* [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves
@ 2026-01-16 14:40 Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 01/21] mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache() Vlastimil Babka
` (20 more replies)
0 siblings, 21 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka, kernel test robot, stable
Percpu sheaves caching was introduced as opt-in but the goal was to
eventually move all caches to them. This is the next step, enabling
sheaves for all caches (except the two bootstrap ones) and then removing
the per cpu (partial) slabs and lots of associated code.
Besides (hopefully) improved performance, this removes the rather
complicated code related to the lockless fastpaths (using
this_cpu_try_cmpxchg128/64) and the complications it causes with
PREEMPT_RT or kmalloc_nolock().
The lockless slab freelist+counters update operation using
try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects
without repeating the "alien" array flushing of SLUB, and to allow
flushing objects from sheaves to slabs mostly without the node
list_lock.
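For illustration, that operation is roughly the following (a simplified
sketch mirroring the freelist_counters based helpers used later in the
series, not the exact kernel code):
	struct freelist_counters old, new;
	do {
		old.freelist = slab->freelist;
		old.counters = slab->counters;	/* inuse/objects/frozen */
		/* link the remotely freed object in front of old freelist */
		set_freepointer(s, object, old.freelist);
		new.freelist = object;
		new.counters = old.counters;
		new.inuse = old.inuse - 1;
		/* a single try_cmpxchg128/64 updates freelist + counters */
	} while (!slab_update_freelist(s, slab, &old, &new, "remote free"));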
This v3 is the first non-RFC (for real). I plan to expose the series to
linux-next at this point. Because of the ongoing troubles with
kmalloc_nolock() that are solved with sheaves, I think it's worth aiming
for 7.0 if it passes linux-next testing.
Git branch for the v3
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=sheaves-for-all-v3
Which is a snapshot of:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/sheaves-for-all
Based on:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-7.0/sheaves
- includes a sheaves optimization that seemed minor, but an lkp test
robot result showed significant improvements:
https://lore.kernel.org/all/202512291555.56ce2e53-lkp@intel.com/
(could be an uncommon corner case workload though)
- includes the kmalloc_nolock() fix commit a4ae75d1b6a2 that is undone
as part of this series
Significant (but not critical) remaining TODOs:
- Integration of rcu sheaves handling with kfree_rcu batching.
- Currently the kfree_rcu batching is almost completely bypassed. I'm
thinking it could be adjusted to handle rcu sheaves in addition to
individual objects, to get the best of both.
- Performance evaluation. Petr Tesarik has been doing that on the RFC
with some promising results (thanks!) and also found a memory leak.
Note that, as with many things, this caching scheme change is a tradeoff,
as summarized by Christoph:
https://lore.kernel.org/all/f7c33974-e520-387e-9e2f-1e523bfe1545@gentwo.org/
- Objects allocated from sheaves should have better temporal locality
(likely recently freed, thus cache hot) but worse spatial locality
(likely from many different slabs, increasing memory usage and
possibly TLB pressure on kernel's direct map).
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
Changes in v3:
- Rebase to current slab/for-7.0/sheaves which itself is rebased to
slab/for-next-fixes to include commit a4ae75d1b6a2 ("slab: fix
kmalloc_nolock() context check for PREEMPT_RT")
- Revert a4ae75d1b6a2 as part of "slab: simplify kmalloc_nolock()" as
it's no longer necessary.
- Add cache_has_sheaves() helper to test for s->sheaf_capacity, use it
in more places instead of s->cpu_sheaves tests that were missed
(Hao Li)
- Fix a bug where kmalloc_nolock() could end up trying to allocate empty
sheaf (not compatible with !allow_spin) in __pcs_replace_full_main()
(Hao Li)
- Fix missing inc_slabs_node() in ___slab_alloc() ->
alloc_from_new_slab() path. (Hao Li)
- Also a bug where refill_objects() -> alloc_from_new_slab ->
free_new_slab_nolock() (previously defer_deactivate_slab()) would
do inc_slabs_node() without matching dec_slabs_node()
- Make __free_slab call free_frozen_pages_nolock() when !allow_spin.
This was correct in the first RFC. (Hao Li)
- Add patch to make SLAB_CONSISTENCY_CHECKS prevent merging.
- Add tags from several people (thanks!)
- Fix checkpatch warnings.
- Link to v2: https://patch.msgid.link/20260112-sheaves-for-all-v2-0-98225cfb50cf@suse.cz
Changes in v2:
- Rebased to v6.19-rc1+slab.git slab/for-7.0/sheaves
- Some of the preliminary patches from the RFC went in there.
- Incorporate feedback/reports from many people (thanks!), including:
- Make caches with sheaves mergeable.
- Fix a major memory leak.
- Cleanup of stat items.
- Link to v1: https://patch.msgid.link/20251023-sheaves-for-all-v1-0-6ffa2c9941c0@suse.cz
---
Vlastimil Babka (21):
mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache()
slab: add SLAB_CONSISTENCY_CHECKS to SLAB_NEVER_MERGE
mm/slab: move and refactor __kmem_cache_alias()
mm/slab: make caches with sheaves mergeable
slab: add sheaves to most caches
slab: introduce percpu sheaves bootstrap
slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock()
slab: handle kmalloc sheaves bootstrap
slab: add optimized sheaf refill from partial list
slab: remove cpu (partial) slabs usage from allocation paths
slab: remove SLUB_CPU_PARTIAL
slab: remove the do_slab_free() fastpath
slab: remove defer_deactivate_slab()
slab: simplify kmalloc_nolock()
slab: remove struct kmem_cache_cpu
slab: remove unused PREEMPT_RT specific macros
slab: refill sheaves from all nodes
slab: update overview comments
slab: remove frozen slab checks from __slab_free()
mm/slub: remove DEACTIVATE_TO_* stat items
mm/slub: cleanup and repurpose some stat items
include/linux/slab.h | 6 -
mm/Kconfig | 11 -
mm/internal.h | 1 +
mm/page_alloc.c | 5 +
mm/slab.h | 53 +-
mm/slab_common.c | 61 +-
mm/slub.c | 2631 +++++++++++++++++---------------------------------
7 files changed, 972 insertions(+), 1796 deletions(-)
---
base-commit: aa2ab7f1e8dc9d27b9130054e48b0c6accddfcba
change-id: 20251002-sheaves-for-all-86ac13dc47a5
Best regards,
--
Vlastimil Babka <vbabka@suse.cz>
* [PATCH v3 01/21] mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache()
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 02/21] slab: add SLAB_CONSISTENCY_CHECKS to SLAB_NEVER_MERGE Vlastimil Babka
` (19 subsequent siblings)
20 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka, kernel test robot, stable
After we submit the rcu_free sheaves to call_rcu() we need to make sure
the rcu callbacks complete. kvfree_rcu_barrier() does that via
flush_all_rcu_sheaves() but kvfree_rcu_barrier_on_cache() doesn't. Fix
that.
This currently causes no issues because the caches with sheaves we have
are never destroyed. The problem flagged by kernel test robot was
reported for a patch that enables sheaves for (almost) all caches, and
occurred only with CONFIG_KASAN. Harry Yoo found the root cause [1]:
It turns out the object freed by sheaf_flush_unused() was in KASAN
percpu quarantine list (confirmed by dumping the list) by the time
__kmem_cache_shutdown() returns an error.
Quarantined objects are supposed to be flushed by kasan_cache_shutdown(),
but things go wrong if the rcu callback (rcu_free_sheaf_nobarn()) is
processed after kasan_cache_shutdown() finishes.
That's why rcu_barrier() in __kmem_cache_shutdown() didn't help,
because it's called after kasan_cache_shutdown().
Calling rcu_barrier() in kvfree_rcu_barrier_on_cache() guarantees
that it'll be added to the quarantine list before kasan_cache_shutdown()
is called. So it's a valid fix!
[1] https://lore.kernel.org/all/aWd6f3jERlrB5yeF@hyeyoo/
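In other words, the destroy-time ordering the fix relies on is roughly
the following (a simplified sketch based on the description above, not
the exact call sites):
	kvfree_rcu_barrier_on_cache(s);	/* rcu_free sheaves flushed; the added
					   rcu_barrier() makes their objects
					   enter the KASAN quarantine now */
	kasan_cache_shutdown(s);	/* quarantine flushed, incl. those objects */
	shutdown_cache(s);		/* __kmem_cache_shutdown()'s own
					   rcu_barrier() runs only here,
					   too late to help */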
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202601121442.c530bed3-lkp@intel.com
Fixes: 0f35040de593 ("mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction")
Cc: stable@vger.kernel.org
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Tested-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slab_common.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index eed7ea556cb1..ee994ec7f251 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -2133,8 +2133,11 @@ EXPORT_SYMBOL_GPL(kvfree_rcu_barrier);
*/
void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
{
- if (s->cpu_sheaves)
+ if (s->cpu_sheaves) {
flush_rcu_sheaves_on_cache(s);
+ rcu_barrier();
+ }
+
/*
* TODO: Introduce a version of __kvfree_rcu_barrier() that works
* on a specific slab cache.
--
2.52.0
* [PATCH v3 02/21] slab: add SLAB_CONSISTENCY_CHECKS to SLAB_NEVER_MERGE
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 01/21] mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache() Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-16 17:22 ` Suren Baghdasaryan
2026-01-19 3:41 ` Harry Yoo
2026-01-16 14:40 ` [PATCH v3 03/21] mm/slab: move and refactor __kmem_cache_alias() Vlastimil Babka
` (18 subsequent siblings)
20 siblings, 2 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
All the debug flags prevent merging, except SLAB_CONSISTENCY_CHECKS. This
is suboptimal because this flag (like any debug flag) prevents the usage
of any fastpaths, and thus affects the performance of any aliased cache.
Also, objects from a cache aliased with the one specified for debugging
could interfere with the debugging efforts.
Fix this by adding the whole SLAB_DEBUG_FLAGS collection to
SLAB_NEVER_MERGE instead of individual debug flags, so it now also
includes SLAB_CONSISTENCY_CHECKS.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slab_common.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index ee994ec7f251..e691ede0e6a8 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -45,9 +45,8 @@ struct kmem_cache *kmem_cache;
/*
* Set of flags that will prevent slab merging
*/
-#define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
- SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
- SLAB_FAILSLAB | SLAB_NO_MERGE)
+#define SLAB_NEVER_MERGE (SLAB_DEBUG_FLAGS | SLAB_TYPESAFE_BY_RCU | \
+ SLAB_NOLEAKTRACE | SLAB_FAILSLAB | SLAB_NO_MERGE)
#define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
SLAB_CACHE_DMA32 | SLAB_ACCOUNT)
--
2.52.0
* [PATCH v3 03/21] mm/slab: move and refactor __kmem_cache_alias()
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 01/21] mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache() Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 02/21] slab: add SLAB_CONSISTENCY_CHECKS to SLAB_NEVER_MERGE Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 04/21] mm/slab: make caches with sheaves mergeable Vlastimil Babka
` (17 subsequent siblings)
20 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
Move __kmem_cache_alias() to slab_common.c since it's called by
__kmem_cache_create_args() and calls find_mergeable(), both of which
are in that file. We can remove the two slab.h declarations and make
both functions static. Instead, declare sysfs_slab_alias() from slub.c
so that __kmem_cache_alias() can keep calling it.
Add an args parameter to __kmem_cache_alias() and find_mergeable()
instead of align and ctor. With that we can also move the checks for
usersize and sheaf_capacity there from __kmem_cache_create_args() and
make the result more symmetric with slab_unmergeable().
No functional changes intended.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slab.h | 8 +++-----
mm/slab_common.c | 44 +++++++++++++++++++++++++++++++++++++-------
mm/slub.c | 30 +-----------------------------
3 files changed, 41 insertions(+), 41 deletions(-)
diff --git a/mm/slab.h b/mm/slab.h
index e767aa7e91b0..cb48ce5014ba 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -281,9 +281,12 @@ struct kmem_cache {
#define SLAB_SUPPORTS_SYSFS 1
void sysfs_slab_unlink(struct kmem_cache *s);
void sysfs_slab_release(struct kmem_cache *s);
+int sysfs_slab_alias(struct kmem_cache *, const char *);
#else
static inline void sysfs_slab_unlink(struct kmem_cache *s) { }
static inline void sysfs_slab_release(struct kmem_cache *s) { }
+static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
+ { return 0; }
#endif
void *fixup_red_left(struct kmem_cache *s, void *p);
@@ -400,11 +403,6 @@ extern void create_boot_cache(struct kmem_cache *, const char *name,
unsigned int useroffset, unsigned int usersize);
int slab_unmergeable(struct kmem_cache *s);
-struct kmem_cache *find_mergeable(unsigned size, unsigned align,
- slab_flags_t flags, const char *name, void (*ctor)(void *));
-struct kmem_cache *
-__kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
- slab_flags_t flags, void (*ctor)(void *));
slab_flags_t kmem_cache_flags(slab_flags_t flags, const char *name);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index e691ede0e6a8..ee245a880603 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -174,15 +174,22 @@ int slab_unmergeable(struct kmem_cache *s)
return 0;
}
-struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
- slab_flags_t flags, const char *name, void (*ctor)(void *))
+static struct kmem_cache *find_mergeable(unsigned int size, slab_flags_t flags,
+ const char *name, struct kmem_cache_args *args)
{
struct kmem_cache *s;
+ unsigned int align;
if (slab_nomerge)
return NULL;
- if (ctor)
+ if (args->ctor)
+ return NULL;
+
+ if (IS_ENABLED(CONFIG_HARDENED_USERCOPY) && args->usersize)
+ return NULL;
+
+ if (args->sheaf_capacity)
return NULL;
flags = kmem_cache_flags(flags, name);
@@ -191,7 +198,7 @@ struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
return NULL;
size = ALIGN(size, sizeof(void *));
- align = calculate_alignment(flags, align, size);
+ align = calculate_alignment(flags, args->align, size);
size = ALIGN(size, align);
list_for_each_entry_reverse(s, &slab_caches, list) {
@@ -252,6 +259,31 @@ static struct kmem_cache *create_cache(const char *name,
return ERR_PTR(err);
}
+static struct kmem_cache *
+__kmem_cache_alias(const char *name, unsigned int size, slab_flags_t flags,
+ struct kmem_cache_args *args)
+{
+ struct kmem_cache *s;
+
+ s = find_mergeable(size, flags, name, args);
+ if (s) {
+ if (sysfs_slab_alias(s, name))
+ pr_err("SLUB: Unable to add cache alias %s to sysfs\n",
+ name);
+
+ s->refcount++;
+
+ /*
+ * Adjust the object sizes so that we clear
+ * the complete object on kzalloc.
+ */
+ s->object_size = max(s->object_size, size);
+ s->inuse = max(s->inuse, ALIGN(size, sizeof(void *)));
+ }
+
+ return s;
+}
+
/**
* __kmem_cache_create_args - Create a kmem cache.
* @name: A string which is used in /proc/slabinfo to identify this cache.
@@ -323,9 +355,7 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
object_size - args->usersize < args->useroffset))
args->usersize = args->useroffset = 0;
- if (!args->usersize && !args->sheaf_capacity)
- s = __kmem_cache_alias(name, object_size, args->align, flags,
- args->ctor);
+ s = __kmem_cache_alias(name, object_size, flags, args);
if (s)
goto out_unlock;
diff --git a/mm/slub.c b/mm/slub.c
index df71c156d13c..2dda2fc57ced 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -350,11 +350,8 @@ enum track_item { TRACK_ALLOC, TRACK_FREE };
#ifdef SLAB_SUPPORTS_SYSFS
static int sysfs_slab_add(struct kmem_cache *);
-static int sysfs_slab_alias(struct kmem_cache *, const char *);
#else
static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
-static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
- { return 0; }
#endif
#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_SLUB_DEBUG)
@@ -8553,31 +8550,6 @@ void __init kmem_cache_init_late(void)
WARN_ON(!flushwq);
}
-struct kmem_cache *
-__kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
- slab_flags_t flags, void (*ctor)(void *))
-{
- struct kmem_cache *s;
-
- s = find_mergeable(size, align, flags, name, ctor);
- if (s) {
- if (sysfs_slab_alias(s, name))
- pr_err("SLUB: Unable to add cache alias %s to sysfs\n",
- name);
-
- s->refcount++;
-
- /*
- * Adjust the object sizes so that we clear
- * the complete object on kzalloc.
- */
- s->object_size = max(s->object_size, size);
- s->inuse = max(s->inuse, ALIGN(size, sizeof(void *)));
- }
-
- return s;
-}
-
int do_kmem_cache_create(struct kmem_cache *s, const char *name,
unsigned int size, struct kmem_cache_args *args,
slab_flags_t flags)
@@ -9810,7 +9782,7 @@ struct saved_alias {
static struct saved_alias *alias_list;
-static int sysfs_slab_alias(struct kmem_cache *s, const char *name)
+int sysfs_slab_alias(struct kmem_cache *s, const char *name)
{
struct saved_alias *al;
--
2.52.0
* [PATCH v3 04/21] mm/slab: make caches with sheaves mergeable
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (2 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 03/21] mm/slab: move and refactor __kmem_cache_alias() Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 05/21] slab: add sheaves to most caches Vlastimil Babka
` (16 subsequent siblings)
20 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
Before enabling sheaves for all caches (with automatically determined
capacity), their enablement should no longer prevent merging of caches.
Limit this merge prevention only to caches that were created with a
specific sheaf capacity, by adding the SLAB_NO_MERGE flag to them.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slab_common.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index ee245a880603..5c15a4ce5743 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -162,9 +162,6 @@ int slab_unmergeable(struct kmem_cache *s)
return 1;
#endif
- if (s->cpu_sheaves)
- return 1;
-
/*
* We may have set a slab to be unmergeable during bootstrap.
*/
@@ -189,9 +186,6 @@ static struct kmem_cache *find_mergeable(unsigned int size, slab_flags_t flags,
if (IS_ENABLED(CONFIG_HARDENED_USERCOPY) && args->usersize)
return NULL;
- if (args->sheaf_capacity)
- return NULL;
-
flags = kmem_cache_flags(flags, name);
if (flags & SLAB_NEVER_MERGE)
@@ -336,6 +330,13 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
flags &= ~SLAB_DEBUG_FLAGS;
#endif
+ /*
+ * Caches with specific capacity are special enough. It's simpler to
+ * make them unmergeable.
+ */
+ if (args->sheaf_capacity)
+ flags |= SLAB_NO_MERGE;
+
mutex_lock(&slab_mutex);
err = kmem_cache_sanity_check(name, object_size);
--
2.52.0
* [PATCH v3 05/21] slab: add sheaves to most caches
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (3 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 04/21] mm/slab: make caches with sheaves mergeable Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-20 18:47 ` Breno Leitao
2026-01-16 14:40 ` [PATCH v3 06/21] slab: introduce percpu sheaves bootstrap Vlastimil Babka
` (15 subsequent siblings)
20 siblings, 1 reply; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
As the first step in replacing cpu (partial) slabs with sheaves, enable
sheaves for almost all caches. Treat args->sheaf_capacity as a minimum,
and calculate the sheaf capacity with a formula that roughly follows the
one for the number of objects in cpu partial slabs in set_cpu_partial().
This should result in roughly similar contention on the barn spin lock
as there currently is on the node list_lock without sheaves, to make
benchmarking results comparable. It can be further tuned later.
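As a worked example (assuming a 64-bit system and, purely for
illustration, a 64-byte struct slab_sheaf header): a cache with
s->size == 192 starts with capacity = 60, struct_size() then gives
64 + 60 * 8 = 544 bytes, kmalloc_size_roundup() rounds that up to the
1024-byte bucket, and the final capacity becomes (1024 - 64) / 8 = 120,
so the sheaf exactly fills its kmalloc bucket.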
Don't enable sheaves for bootstrap caches as that wouldn't work. In
order to recognize them by SLAB_NO_OBJ_EXT, make sure the flag exists
even for !CONFIG_SLAB_OBJ_EXT.
This limitation will be lifted for kmalloc caches after the necessary
bootstrapping changes.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
include/linux/slab.h | 6 ------
mm/slub.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 47 insertions(+), 10 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 2482992248dc..2682ee57ec90 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -57,9 +57,7 @@ enum _slab_flag_bits {
#endif
_SLAB_OBJECT_POISON,
_SLAB_CMPXCHG_DOUBLE,
-#ifdef CONFIG_SLAB_OBJ_EXT
_SLAB_NO_OBJ_EXT,
-#endif
_SLAB_FLAGS_LAST_BIT
};
@@ -238,11 +236,7 @@ enum _slab_flag_bits {
#define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */
/* Slab created using create_boot_cache */
-#ifdef CONFIG_SLAB_OBJ_EXT
#define SLAB_NO_OBJ_EXT __SLAB_FLAG_BIT(_SLAB_NO_OBJ_EXT)
-#else
-#define SLAB_NO_OBJ_EXT __SLAB_FLAG_UNUSED
-#endif
/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
diff --git a/mm/slub.c b/mm/slub.c
index 2dda2fc57ced..edf341c87e20 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -7863,6 +7863,48 @@ static void set_cpu_partial(struct kmem_cache *s)
#endif
}
+static unsigned int calculate_sheaf_capacity(struct kmem_cache *s,
+ struct kmem_cache_args *args)
+
+{
+ unsigned int capacity;
+ size_t size;
+
+
+ if (IS_ENABLED(CONFIG_SLUB_TINY) || s->flags & SLAB_DEBUG_FLAGS)
+ return 0;
+
+ /* bootstrap caches can't have sheaves for now */
+ if (s->flags & SLAB_NO_OBJ_EXT)
+ return 0;
+
+ /*
+ * For now we use roughly similar formula (divided by two as there are
+ * two percpu sheaves) as what was used for percpu partial slabs, which
+ * should result in similar lock contention (barn or list_lock)
+ */
+ if (s->size >= PAGE_SIZE)
+ capacity = 4;
+ else if (s->size >= 1024)
+ capacity = 12;
+ else if (s->size >= 256)
+ capacity = 26;
+ else
+ capacity = 60;
+
+ /* Increment capacity to make sheaf exactly a kmalloc size bucket */
+ size = struct_size_t(struct slab_sheaf, objects, capacity);
+ size = kmalloc_size_roundup(size);
+ capacity = (size - struct_size_t(struct slab_sheaf, objects, 0)) / sizeof(void *);
+
+ /*
+ * Respect an explicit request for capacity that's typically motivated by
+ * expected maximum size of kmem_cache_prefill_sheaf() to not end up
+ * using low-performance oversize sheaves
+ */
+ return max(capacity, args->sheaf_capacity);
+}
+
/*
* calculate_sizes() determines the order and the distribution of data within
* a slab object.
@@ -7997,6 +8039,10 @@ static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s)
if (s->flags & SLAB_RECLAIM_ACCOUNT)
s->allocflags |= __GFP_RECLAIMABLE;
+ /* kmalloc caches need extra care to support sheaves */
+ if (!is_kmalloc_cache(s))
+ s->sheaf_capacity = calculate_sheaf_capacity(s, args);
+
/*
* Determine the number of objects per slab
*/
@@ -8601,15 +8647,12 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
set_cpu_partial(s);
- if (args->sheaf_capacity && !IS_ENABLED(CONFIG_SLUB_TINY)
- && !(s->flags & SLAB_DEBUG_FLAGS)) {
+ if (s->sheaf_capacity) {
s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
if (!s->cpu_sheaves) {
err = -ENOMEM;
goto out;
}
- // TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
- s->sheaf_capacity = args->sheaf_capacity;
}
#ifdef CONFIG_NUMA
--
2.52.0
* [PATCH v3 06/21] slab: introduce percpu sheaves bootstrap
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (4 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 05/21] slab: add sheaves to most caches Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-17 2:11 ` Suren Baghdasaryan
2026-01-19 11:32 ` Hao Li
2026-01-16 14:40 ` [PATCH v3 07/21] slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock() Vlastimil Babka
` (14 subsequent siblings)
20 siblings, 2 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
Until now, kmem_cache->cpu_sheaves was !NULL only for caches with
sheaves enabled. Since we want to enable them for almost all caches,
it's suboptimal to test the pointer in the fast paths, so instead
allocate it for all caches in do_kmem_cache_create(). Instead of testing
the cpu_sheaves pointer to recognize caches (yet) without sheaves, test
kmem_cache->sheaf_capacity for being 0, where needed, using a new
cache_has_sheaves() helper.
However, for the fast paths' sake we also assume that the main sheaf
always exists (pcs->main is !NULL), and during bootstrap we cannot
allocate sheaves yet.
Solve this by introducing a single static bootstrap_sheaf that's
assigned as pcs->main during bootstrap. It has a size of 0, so during
allocations the fast path will find it empty. Since the size of 0
matches a sheaf_capacity of 0, the freeing fast paths will find it
"full". In the slow path handlers, we use cache_has_sheaves() to
recognize that the cache doesn't (yet) have real sheaves, and fall back.
Thus sharing the single bootstrap sheaf like this between multiple
caches and cpus is safe.
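A rough sketch (not the actual code) of how the shared bootstrap_sheaf
short-circuits both fast paths via the existing size checks:
	/* in alloc_from_pcs(): size 0 looks "empty", take the slow path */
	if (unlikely(pcs->main->size == 0))
		goto slowpath;	/* which backs off via cache_has_sheaves() */
	/* in free_to_pcs(): size 0 == sheaf_capacity 0 looks "full" */
	if (unlikely(pcs->main->size == s->sheaf_capacity))
		goto slowpath;	/* which backs off via cache_has_sheaves() */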
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slub.c | 119 ++++++++++++++++++++++++++++++++++++++++++--------------------
1 file changed, 81 insertions(+), 38 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index edf341c87e20..706cb6398f05 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -501,6 +501,18 @@ struct kmem_cache_node {
struct node_barn *barn;
};
+/*
+ * Every cache has !NULL s->cpu_sheaves but they may point to the
+ * bootstrap_sheaf temporarily during init, or permanently for the boot caches
+ * and caches with debugging enabled, or all caches with CONFIG_SLUB_TINY. This
+ * helper distinguishes whether cache has real non-bootstrap sheaves.
+ */
+static inline bool cache_has_sheaves(struct kmem_cache *s)
+{
+ /* Test CONFIG_SLUB_TINY for code elimination purposes */
+ return !IS_ENABLED(CONFIG_SLUB_TINY) && s->sheaf_capacity;
+}
+
static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
{
return s->node[node];
@@ -2855,6 +2867,10 @@ static void pcs_destroy(struct kmem_cache *s)
if (!pcs->main)
continue;
+ /* bootstrap or debug caches, it's the bootstrap_sheaf */
+ if (!pcs->main->cache)
+ continue;
+
/*
* We have already passed __kmem_cache_shutdown() so everything
* was flushed and there should be no objects allocated from
@@ -4030,7 +4046,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
{
struct slub_percpu_sheaves *pcs;
- if (!s->cpu_sheaves)
+ if (!cache_has_sheaves(s))
return false;
pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
@@ -4052,7 +4068,7 @@ static void flush_cpu_slab(struct work_struct *w)
s = sfw->s;
- if (s->cpu_sheaves)
+ if (cache_has_sheaves(s))
pcs_flush_all(s);
flush_this_cpu_slab(s);
@@ -4157,7 +4173,7 @@ void flush_all_rcu_sheaves(void)
mutex_lock(&slab_mutex);
list_for_each_entry(s, &slab_caches, list) {
- if (!s->cpu_sheaves)
+ if (!cache_has_sheaves(s))
continue;
flush_rcu_sheaves_on_cache(s);
}
@@ -4179,7 +4195,7 @@ static int slub_cpu_dead(unsigned int cpu)
mutex_lock(&slab_mutex);
list_for_each_entry(s, &slab_caches, list) {
__flush_cpu_slab(s, cpu);
- if (s->cpu_sheaves)
+ if (cache_has_sheaves(s))
__pcs_flush_all_cpu(s, cpu);
}
mutex_unlock(&slab_mutex);
@@ -4979,6 +4995,12 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+ /* Bootstrap or debug cache, back off */
+ if (unlikely(!cache_has_sheaves(s))) {
+ local_unlock(&s->cpu_sheaves->lock);
+ return NULL;
+ }
+
if (pcs->spare && pcs->spare->size > 0) {
swap(pcs->main, pcs->spare);
return pcs;
@@ -5165,6 +5187,11 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
struct slab_sheaf *full;
struct node_barn *barn;
+ if (unlikely(!cache_has_sheaves(s))) {
+ local_unlock(&s->cpu_sheaves->lock);
+ return allocated;
+ }
+
if (pcs->spare && pcs->spare->size > 0) {
swap(pcs->main, pcs->spare);
goto do_alloc;
@@ -5244,8 +5271,7 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
if (unlikely(object))
goto out;
- if (s->cpu_sheaves)
- object = alloc_from_pcs(s, gfpflags, node);
+ object = alloc_from_pcs(s, gfpflags, node);
if (!object)
object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
@@ -5355,17 +5381,6 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
if (unlikely(size > s->sheaf_capacity)) {
- /*
- * slab_debug disables cpu sheaves intentionally so all
- * prefilled sheaves become "oversize" and we give up on
- * performance for the debugging. Same with SLUB_TINY.
- * Creating a cache without sheaves and then requesting a
- * prefilled sheaf is however not expected, so warn.
- */
- WARN_ON_ONCE(s->sheaf_capacity == 0 &&
- !IS_ENABLED(CONFIG_SLUB_TINY) &&
- !(s->flags & SLAB_DEBUG_FLAGS));
-
sheaf = kzalloc(struct_size(sheaf, objects, size), gfp);
if (!sheaf)
return NULL;
@@ -6082,6 +6097,12 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
restart:
lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+ /* Bootstrap or debug cache, back off */
+ if (unlikely(!cache_has_sheaves(s))) {
+ local_unlock(&s->cpu_sheaves->lock);
+ return NULL;
+ }
+
barn = get_barn(s);
if (!barn) {
local_unlock(&s->cpu_sheaves->lock);
@@ -6280,6 +6301,12 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
struct slab_sheaf *empty;
struct node_barn *barn;
+ /* Bootstrap or debug cache, fall back */
+ if (unlikely(!cache_has_sheaves(s))) {
+ local_unlock(&s->cpu_sheaves->lock);
+ goto fail;
+ }
+
if (pcs->spare && pcs->spare->size == 0) {
pcs->rcu_free = pcs->spare;
pcs->spare = NULL;
@@ -6674,9 +6701,8 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
return;
- if (s->cpu_sheaves && likely(!IS_ENABLED(CONFIG_NUMA) ||
- slab_nid(slab) == numa_mem_id())
- && likely(!slab_test_pfmemalloc(slab))) {
+ if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
+ && likely(!slab_test_pfmemalloc(slab))) {
if (likely(free_to_pcs(s, object)))
return;
}
@@ -7379,7 +7405,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
* freeing to sheaves is so incompatible with the detached freelist so
* once we go that way, we have to do everything differently
*/
- if (s && s->cpu_sheaves) {
+ if (s && cache_has_sheaves(s)) {
free_to_pcs_bulk(s, size, p);
return;
}
@@ -7490,8 +7516,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
size--;
}
- if (s->cpu_sheaves)
- i = alloc_from_pcs_bulk(s, size, p);
+ i = alloc_from_pcs_bulk(s, size, p);
if (i < size) {
/*
@@ -7702,6 +7727,7 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
static int init_percpu_sheaves(struct kmem_cache *s)
{
+ static struct slab_sheaf bootstrap_sheaf = {};
int cpu;
for_each_possible_cpu(cpu) {
@@ -7711,7 +7737,28 @@ static int init_percpu_sheaves(struct kmem_cache *s)
local_trylock_init(&pcs->lock);
- pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
+ /*
+ * Bootstrap sheaf has zero size so fast-path allocation fails.
+ * It has also size == s->sheaf_capacity, so fast-path free
+ * fails. In the slow paths we recognize the situation by
+ * checking s->sheaf_capacity. This allows fast paths to assume
+ * s->cpu_sheaves and pcs->main always exists and is valid.
+ * It's also safe to share the single static bootstrap_sheaf
+ * with zero-sized objects array as it's never modified.
+ *
+ * bootstrap_sheaf also has NULL pointer to kmem_cache so we
+ * recognize it and not attempt to free it when destroying the
+ * cache
+ *
+ * We keep bootstrap_sheaf for kmem_cache and kmem_cache_node,
+ * caches with debug enabled, and all caches with SLUB_TINY.
+ * For kmalloc caches it's used temporarily during the initial
+ * bootstrap.
+ */
+ if (!s->sheaf_capacity)
+ pcs->main = &bootstrap_sheaf;
+ else
+ pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
if (!pcs->main)
return -ENOMEM;
@@ -7809,7 +7856,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
continue;
}
- if (s->cpu_sheaves) {
+ if (cache_has_sheaves(s)) {
barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
if (!barn)
@@ -8127,7 +8174,7 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
flush_all_cpus_locked(s);
/* we might have rcu sheaves in flight */
- if (s->cpu_sheaves)
+ if (cache_has_sheaves(s))
rcu_barrier();
/* Attempt to free all objects */
@@ -8439,7 +8486,7 @@ static int slab_mem_going_online_callback(int nid)
if (get_node(s, nid))
continue;
- if (s->cpu_sheaves) {
+ if (cache_has_sheaves(s)) {
barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
if (!barn) {
@@ -8647,12 +8694,10 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
set_cpu_partial(s);
- if (s->sheaf_capacity) {
- s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
- if (!s->cpu_sheaves) {
- err = -ENOMEM;
- goto out;
- }
+ s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
+ if (!s->cpu_sheaves) {
+ err = -ENOMEM;
+ goto out;
}
#ifdef CONFIG_NUMA
@@ -8671,11 +8716,9 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
if (!alloc_kmem_cache_cpus(s))
goto out;
- if (s->cpu_sheaves) {
- err = init_percpu_sheaves(s);
- if (err)
- goto out;
- }
+ err = init_percpu_sheaves(s);
+ if (err)
+ goto out;
err = 0;
--
2.52.0
* [PATCH v3 07/21] slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock()
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (5 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 06/21] slab: introduce percpu sheaves bootstrap Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-18 20:45 ` Suren Baghdasaryan
2026-01-19 4:31 ` Harry Yoo
2026-01-16 14:40 ` [PATCH v3 08/21] slab: handle kmalloc sheaves bootstrap Vlastimil Babka
` (13 subsequent siblings)
20 siblings, 2 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
Before we enable percpu sheaves for kmalloc caches, we need to make sure
kmalloc_nolock() and kfree_nolock() will continue working properly and
not spin when not allowed to.
Percpu sheaves themselves use local_trylock() so they are already
compatible. We just need to be careful with the barn->lock spin_lock.
Pass a new allow_spin parameter where necessary to use
spin_trylock_irqsave().
In kmalloc_nolock_noprof() we can now safely attempt alloc_from_pcs();
for now it will always fail, until we enable sheaves for kmalloc caches
next. Similarly, in kfree_nolock() we can attempt free_to_pcs().
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slub.c | 79 ++++++++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 56 insertions(+), 23 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 706cb6398f05..b385247c219f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2893,7 +2893,8 @@ static void pcs_destroy(struct kmem_cache *s)
s->cpu_sheaves = NULL;
}
-static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
+static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn,
+ bool allow_spin)
{
struct slab_sheaf *empty = NULL;
unsigned long flags;
@@ -2901,7 +2902,10 @@ static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
if (!data_race(barn->nr_empty))
return NULL;
- spin_lock_irqsave(&barn->lock, flags);
+ if (likely(allow_spin))
+ spin_lock_irqsave(&barn->lock, flags);
+ else if (!spin_trylock_irqsave(&barn->lock, flags))
+ return NULL;
if (likely(barn->nr_empty)) {
empty = list_first_entry(&barn->sheaves_empty,
@@ -2978,7 +2982,8 @@ static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
* change.
*/
static struct slab_sheaf *
-barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
+barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty,
+ bool allow_spin)
{
struct slab_sheaf *full = NULL;
unsigned long flags;
@@ -2986,7 +2991,10 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
if (!data_race(barn->nr_full))
return NULL;
- spin_lock_irqsave(&barn->lock, flags);
+ if (likely(allow_spin))
+ spin_lock_irqsave(&barn->lock, flags);
+ else if (!spin_trylock_irqsave(&barn->lock, flags))
+ return NULL;
if (likely(barn->nr_full)) {
full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
@@ -3007,7 +3015,8 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
* barn. But if there are too many full sheaves, reject this with -E2BIG.
*/
static struct slab_sheaf *
-barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
+barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full,
+ bool allow_spin)
{
struct slab_sheaf *empty;
unsigned long flags;
@@ -3018,7 +3027,10 @@ barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
if (!data_race(barn->nr_empty))
return ERR_PTR(-ENOMEM);
- spin_lock_irqsave(&barn->lock, flags);
+ if (likely(allow_spin))
+ spin_lock_irqsave(&barn->lock, flags);
+ else if (!spin_trylock_irqsave(&barn->lock, flags))
+ return ERR_PTR(-EBUSY);
if (likely(barn->nr_empty)) {
empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
@@ -5012,7 +5024,8 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
return NULL;
}
- full = barn_replace_empty_sheaf(barn, pcs->main);
+ full = barn_replace_empty_sheaf(barn, pcs->main,
+ gfpflags_allow_spinning(gfp));
if (full) {
stat(s, BARN_GET);
@@ -5029,7 +5042,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
empty = pcs->spare;
pcs->spare = NULL;
} else {
- empty = barn_get_empty_sheaf(barn);
+ empty = barn_get_empty_sheaf(barn, true);
}
}
@@ -5169,7 +5182,8 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
}
static __fastpath_inline
-unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
+unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
+ void **p)
{
struct slub_percpu_sheaves *pcs;
struct slab_sheaf *main;
@@ -5203,7 +5217,8 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
return allocated;
}
- full = barn_replace_empty_sheaf(barn, pcs->main);
+ full = barn_replace_empty_sheaf(barn, pcs->main,
+ gfpflags_allow_spinning(gfp));
if (full) {
stat(s, BARN_GET);
@@ -5701,7 +5716,7 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
struct kmem_cache *s;
bool can_retry = true;
- void *ret = ERR_PTR(-EBUSY);
+ void *ret;
VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO |
__GFP_NO_OBJ_EXT));
@@ -5732,6 +5747,12 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
*/
return NULL;
+ ret = alloc_from_pcs(s, alloc_gfp, node);
+ if (ret)
+ goto success;
+
+ ret = ERR_PTR(-EBUSY);
+
/*
* Do not call slab_alloc_node(), since trylock mode isn't
* compatible with slab_pre_alloc_hook/should_failslab and
@@ -5768,6 +5789,7 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
ret = NULL;
}
+success:
maybe_wipe_obj_freeptr(s, ret);
slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
slab_want_init_on_alloc(alloc_gfp, s), size);
@@ -6088,7 +6110,8 @@ static void __pcs_install_empty_sheaf(struct kmem_cache *s,
* unlocked.
*/
static struct slub_percpu_sheaves *
-__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
+__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
+ bool allow_spin)
{
struct slab_sheaf *empty;
struct node_barn *barn;
@@ -6112,7 +6135,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
put_fail = false;
if (!pcs->spare) {
- empty = barn_get_empty_sheaf(barn);
+ empty = barn_get_empty_sheaf(barn, allow_spin);
if (empty) {
pcs->spare = pcs->main;
pcs->main = empty;
@@ -6126,7 +6149,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
return pcs;
}
- empty = barn_replace_full_sheaf(barn, pcs->main);
+ empty = barn_replace_full_sheaf(barn, pcs->main, allow_spin);
if (!IS_ERR(empty)) {
stat(s, BARN_PUT);
@@ -6134,7 +6157,8 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
return pcs;
}
- if (PTR_ERR(empty) == -E2BIG) {
+ /* sheaf_flush_unused() doesn't support !allow_spin */
+ if (PTR_ERR(empty) == -E2BIG && allow_spin) {
/* Since we got here, spare exists and is full */
struct slab_sheaf *to_flush = pcs->spare;
@@ -6159,6 +6183,14 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
alloc_empty:
local_unlock(&s->cpu_sheaves->lock);
+ /*
+ * alloc_empty_sheaf() doesn't support !allow_spin and it's
+ * easier to fall back to freeing directly without sheaves
+ * than add the support (and to sheaf_flush_unused() above)
+ */
+ if (!allow_spin)
+ return NULL;
+
empty = alloc_empty_sheaf(s, GFP_NOWAIT);
if (empty)
goto got_empty;
@@ -6201,7 +6233,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
* The object is expected to have passed slab_free_hook() already.
*/
static __fastpath_inline
-bool free_to_pcs(struct kmem_cache *s, void *object)
+bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
{
struct slub_percpu_sheaves *pcs;
@@ -6212,7 +6244,7 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
if (unlikely(pcs->main->size == s->sheaf_capacity)) {
- pcs = __pcs_replace_full_main(s, pcs);
+ pcs = __pcs_replace_full_main(s, pcs, allow_spin);
if (unlikely(!pcs))
return false;
}
@@ -6319,7 +6351,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
goto fail;
}
- empty = barn_get_empty_sheaf(barn);
+ empty = barn_get_empty_sheaf(barn, true);
if (empty) {
pcs->rcu_free = empty;
@@ -6437,7 +6469,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
goto no_empty;
if (!pcs->spare) {
- empty = barn_get_empty_sheaf(barn);
+ empty = barn_get_empty_sheaf(barn, true);
if (!empty)
goto no_empty;
@@ -6451,7 +6483,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
goto do_free;
}
- empty = barn_replace_full_sheaf(barn, pcs->main);
+ empty = barn_replace_full_sheaf(barn, pcs->main, true);
if (IS_ERR(empty)) {
stat(s, BARN_PUT_FAIL);
goto no_empty;
@@ -6703,7 +6735,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
&& likely(!slab_test_pfmemalloc(slab))) {
- if (likely(free_to_pcs(s, object)))
+ if (likely(free_to_pcs(s, object, true)))
return;
}
@@ -6964,7 +6996,8 @@ void kfree_nolock(const void *object)
* since kasan quarantine takes locks and not supported from NMI.
*/
kasan_slab_free(s, x, false, false, /* skip quarantine */true);
- do_slab_free(s, slab, x, x, 0, _RET_IP_);
+ if (!free_to_pcs(s, x, false))
+ do_slab_free(s, slab, x, x, 0, _RET_IP_);
}
EXPORT_SYMBOL_GPL(kfree_nolock);
@@ -7516,7 +7549,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
size--;
}
- i = alloc_from_pcs_bulk(s, size, p);
+ i = alloc_from_pcs_bulk(s, flags, size, p);
if (i < size) {
/*
--
2.52.0
* [PATCH v3 08/21] slab: handle kmalloc sheaves bootstrap
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (6 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 07/21] slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock() Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-19 5:23 ` Harry Yoo
2026-01-20 1:04 ` Hao Li
2026-01-16 14:40 ` [PATCH v3 09/21] slab: add optimized sheaf refill from partial list Vlastimil Babka
` (12 subsequent siblings)
20 siblings, 2 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
Enable sheaves for kmalloc caches. For types other than KMALLOC_NORMAL,
we can simply allow them in calculate_sizes(), as those caches are
created later than the KMALLOC_NORMAL caches and can therefore allocate
their sheaves and barns from them.
For KMALLOC_NORMAL caches we perform an additional step after first
creating them without sheaves: bootstrap_cache_sheaves() allocates and
initializes the barns and sheaves and finally sets s->sheaf_capacity to
make them actually used.
Afterwards the only caches left without sheaves (unless SLUB_TINY or
debugging is enabled) are kmem_cache and kmem_cache_node. These are only
used when creating or destroying other kmem_caches. Thus they are not
performance critical and we can simply leave them that way.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slub.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 84 insertions(+), 4 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index b385247c219f..9bea8a65e510 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2605,7 +2605,8 @@ static void *setup_object(struct kmem_cache *s, void *object)
return object;
}
-static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
+static struct slab_sheaf *__alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp,
+ unsigned int capacity)
{
struct slab_sheaf *sheaf;
size_t sheaf_size;
@@ -2623,7 +2624,7 @@ static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
if (s->flags & SLAB_KMALLOC)
gfp |= __GFP_NO_OBJ_EXT;
- sheaf_size = struct_size(sheaf, objects, s->sheaf_capacity);
+ sheaf_size = struct_size(sheaf, objects, capacity);
sheaf = kzalloc(sheaf_size, gfp);
if (unlikely(!sheaf))
@@ -2636,6 +2637,12 @@ static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
return sheaf;
}
+static inline struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s,
+ gfp_t gfp)
+{
+ return __alloc_empty_sheaf(s, gfp, s->sheaf_capacity);
+}
+
static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
{
kfree(sheaf);
@@ -8119,8 +8126,11 @@ static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s)
if (s->flags & SLAB_RECLAIM_ACCOUNT)
s->allocflags |= __GFP_RECLAIMABLE;
- /* kmalloc caches need extra care to support sheaves */
- if (!is_kmalloc_cache(s))
+ /*
+ * For KMALLOC_NORMAL caches we enable sheaves later by
+ * bootstrap_kmalloc_sheaves() to avoid recursion
+ */
+ if (!is_kmalloc_normal(s))
s->sheaf_capacity = calculate_sheaf_capacity(s, args);
/*
@@ -8615,6 +8625,74 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
return s;
}
+/*
+ * Finish the sheaves initialization done normally by init_percpu_sheaves() and
+ * init_kmem_cache_nodes(). For normal kmalloc caches we have to bootstrap it
+ * since sheaves and barns are allocated by kmalloc.
+ */
+static void __init bootstrap_cache_sheaves(struct kmem_cache *s)
+{
+ struct kmem_cache_args empty_args = {};
+ unsigned int capacity;
+ bool failed = false;
+ int node, cpu;
+
+ capacity = calculate_sheaf_capacity(s, &empty_args);
+
+ /* capacity can be 0 due to debugging or SLUB_TINY */
+ if (!capacity)
+ return;
+
+ for_each_node_mask(node, slab_nodes) {
+ struct node_barn *barn;
+
+ barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
+
+ if (!barn) {
+ failed = true;
+ goto out;
+ }
+
+ barn_init(barn);
+ get_node(s, node)->barn = barn;
+ }
+
+ for_each_possible_cpu(cpu) {
+ struct slub_percpu_sheaves *pcs;
+
+ pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+ pcs->main = __alloc_empty_sheaf(s, GFP_KERNEL, capacity);
+
+ if (!pcs->main) {
+ failed = true;
+ break;
+ }
+ }
+
+out:
+ /*
+ * It's still early in boot so treat this like same as a failure to
+ * create the kmalloc cache in the first place
+ */
+ if (failed)
+ panic("Out of memory when creating kmem_cache %s\n", s->name);
+
+ s->sheaf_capacity = capacity;
+}
+
+static void __init bootstrap_kmalloc_sheaves(void)
+{
+ enum kmalloc_cache_type type;
+
+ for (type = KMALLOC_NORMAL; type <= KMALLOC_RANDOM_END; type++) {
+ for (int idx = 0; idx < KMALLOC_SHIFT_HIGH + 1; idx++) {
+ if (kmalloc_caches[type][idx])
+ bootstrap_cache_sheaves(kmalloc_caches[type][idx]);
+ }
+ }
+}
+
void __init kmem_cache_init(void)
{
static __initdata struct kmem_cache boot_kmem_cache,
@@ -8658,6 +8736,8 @@ void __init kmem_cache_init(void)
setup_kmalloc_cache_index_table();
create_kmalloc_caches();
+ bootstrap_kmalloc_sheaves();
+
/* Setup random freelists for each cache */
init_freelist_randomization();
--
2.52.0
* [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (7 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 08/21] slab: handle kmalloc sheaves bootstrap Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-19 6:41 ` Harry Yoo
` (3 more replies)
2026-01-16 14:40 ` [PATCH v3 10/21] slab: remove cpu (partial) slabs usage from allocation paths Vlastimil Babka
` (11 subsequent siblings)
20 siblings, 4 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
At this point we have sheaves enabled for all caches, but their refill
is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
slabs - now a redundant caching layer that we are about to remove.
The refill will thus be done from slabs on the node partial list.
Introduce new functions that can do that in an optimized way as it's
easier than modifying the __kmem_cache_alloc_bulk() call chain.
Extend struct partial_context so it can return a list of slabs from the
partial list, with the sum of their free objects within the requested
min and max.
Introduce get_partial_node_bulk() that removes such slabs from the node
partial list and returns them in that list.
Introduce get_freelist_nofreeze() which grabs the freelist without
freezing the slab.
Introduce alloc_from_new_slab() which can allocate multiple objects from
a newly allocated slab where we don't need to synchronize with freeing.
In some aspects it's similar to alloc_single_from_new_slab() but assumes
the cache is a non-debug one so it can avoid some actions.
Introduce __refill_objects() that uses the functions above to fill an
array of objects. It has to handle the possibility that the slabs will
contain more objects than were requested, due to concurrent freeing of
objects to those slabs. When no more slabs on partial lists are
available, it will allocate new slabs. It is intended to be used only
in contexts where spinning is allowed, so add a WARN_ON_ONCE check there.
Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are
only refilled from contexts that allow spinning, or even blocking.
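For orientation, a rough sketch of the flow described above (simplified;
the helpers match the new functions, but the locals, node choice and
exact surplus handling here are illustrative assumptions, not the real
__refill_objects()):
	unsigned int filled = 0;
	struct partial_context pc = { .flags = gfp, .min_objects = min,
				      .max_objects = max };
	struct slab *slab, *slab2;
	if (get_partial_node_bulk(s, get_node(s, numa_mem_id()), &pc)) {
		list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
			void *object = get_freelist_nofreeze(s, slab);
			while (object && filled < max) {
				p[filled++] = object;
				object = get_freepointer(s, object);
			}
			/* surplus objects and still-partial slabs are
			   returned to the node partial list */
		}
	}
	/* partial lists exhausted, allocate new slabs until min is reached */
	while (filled < min) {
		slab = new_slab(s, gfp, NUMA_NO_NODE);
		if (!slab)
			break;
		filled += alloc_from_new_slab(s, slab, &p[filled],
					      max - filled, true);
	}
	return filled;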
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slub.c | 284 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 264 insertions(+), 20 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 9bea8a65e510..dce80463f92c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -246,6 +246,9 @@ struct partial_context {
gfp_t flags;
unsigned int orig_size;
void *object;
+ unsigned int min_objects;
+ unsigned int max_objects;
+ struct list_head slabs;
};
static inline bool kmem_cache_debug(struct kmem_cache *s)
@@ -2650,9 +2653,9 @@ static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
stat(s, SHEAF_FREE);
}
-static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
- size_t size, void **p);
-
+static unsigned int
+__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max);
static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
gfp_t gfp)
@@ -2663,8 +2666,8 @@ static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
if (!to_fill)
return 0;
- filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
- &sheaf->objects[sheaf->size]);
+ filled = __refill_objects(s, &sheaf->objects[sheaf->size], gfp,
+ to_fill, to_fill);
sheaf->size += filled;
@@ -3522,6 +3525,63 @@ static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
#endif
static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
+static bool get_partial_node_bulk(struct kmem_cache *s,
+ struct kmem_cache_node *n,
+ struct partial_context *pc)
+{
+ struct slab *slab, *slab2;
+ unsigned int total_free = 0;
+ unsigned long flags;
+
+ /* Racy check to avoid taking the lock unnecessarily. */
+ if (!n || data_race(!n->nr_partial))
+ return false;
+
+ INIT_LIST_HEAD(&pc->slabs);
+
+ spin_lock_irqsave(&n->list_lock, flags);
+
+ list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
+ struct freelist_counters flc;
+ unsigned int slab_free;
+
+ if (!pfmemalloc_match(slab, pc->flags))
+ continue;
+
+ /*
+ * determine the number of free objects in the slab racily
+ *
+ * due to atomic updates done by a racing free we should not
+ * read an inconsistent value here, but do a sanity check anyway
+ *
+ * slab_free is a lower bound due to subsequent concurrent
+ * freeing, the caller might get more objects than requested and
+ * must deal with it
+ */
+ flc.counters = data_race(READ_ONCE(slab->counters));
+ slab_free = flc.objects - flc.inuse;
+
+ if (unlikely(slab_free > oo_objects(s->oo)))
+ continue;
+
+ /* we have already min and this would get us over the max */
+ if (total_free >= pc->min_objects
+ && total_free + slab_free > pc->max_objects)
+ break;
+
+ remove_partial(n, slab);
+
+ list_add(&slab->slab_list, &pc->slabs);
+
+ total_free += slab_free;
+ if (total_free >= pc->max_objects)
+ break;
+ }
+
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ return total_free > 0;
+}
+
/*
* Try to allocate a partial slab from a specific node.
*/
@@ -4448,6 +4508,33 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
return old.freelist;
}
+/*
+ * Get the slab's freelist and do not freeze it.
+ *
+ * Assumes the slab is isolated from node partial list and not frozen.
+ *
+ * Assumes this is performed only for caches without debugging so we
+ * don't need to worry about adding the slab to the full list
+ */
+static inline void *get_freelist_nofreeze(struct kmem_cache *s, struct slab *slab)
+{
+ struct freelist_counters old, new;
+
+ do {
+ old.freelist = slab->freelist;
+ old.counters = slab->counters;
+
+ new.freelist = NULL;
+ new.counters = old.counters;
+ VM_WARN_ON_ONCE(new.frozen);
+
+ new.inuse = old.objects;
+
+ } while (!slab_update_freelist(s, slab, &old, &new, "get_freelist_nofreeze"));
+
+ return old.freelist;
+}
+
/*
* Freeze the partial slab and return the pointer to the freelist.
*/
@@ -4471,6 +4558,65 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
return old.freelist;
}
+/*
+ * If the object has been wiped upon free, make sure it's fully initialized by
+ * zeroing out freelist pointer.
+ *
+ * Note that we also wipe custom freelist pointers.
+ */
+static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
+ void *obj)
+{
+ if (unlikely(slab_want_init_on_free(s)) && obj &&
+ !freeptr_outside_object(s))
+ memset((void *)((char *)kasan_reset_tag(obj) + s->offset),
+ 0, sizeof(void *));
+}
+
+static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
+ void **p, unsigned int count, bool allow_spin)
+{
+ unsigned int allocated = 0;
+ struct kmem_cache_node *n;
+ unsigned long flags;
+ void *object;
+
+ if (!allow_spin && (slab->objects - slab->inuse) > count) {
+
+ n = get_node(s, slab_nid(slab));
+
+ if (!spin_trylock_irqsave(&n->list_lock, flags)) {
+ /* Unlucky, discard newly allocated slab */
+ defer_deactivate_slab(slab, NULL);
+ return 0;
+ }
+ }
+
+ object = slab->freelist;
+ while (object && allocated < count) {
+ p[allocated] = object;
+ object = get_freepointer(s, object);
+ maybe_wipe_obj_freeptr(s, p[allocated]);
+
+ slab->inuse++;
+ allocated++;
+ }
+ slab->freelist = object;
+
+ if (slab->freelist) {
+
+ if (allow_spin) {
+ n = get_node(s, slab_nid(slab));
+ spin_lock_irqsave(&n->list_lock, flags);
+ }
+ add_partial(n, slab, DEACTIVATE_TO_HEAD);
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ }
+
+ inc_slabs_node(s, slab_nid(slab), slab->objects);
+ return allocated;
+}
+
/*
* Slow path. The lockless freelist is empty or we need to perform
* debugging duties.
@@ -4913,21 +5059,6 @@ static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
return object;
}
-/*
- * If the object has been wiped upon free, make sure it's fully initialized by
- * zeroing out freelist pointer.
- *
- * Note that we also wipe custom freelist pointers.
- */
-static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
- void *obj)
-{
- if (unlikely(slab_want_init_on_free(s)) && obj &&
- !freeptr_outside_object(s))
- memset((void *)((char *)kasan_reset_tag(obj) + s->offset),
- 0, sizeof(void *));
-}
-
static __fastpath_inline
struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
{
@@ -5388,6 +5519,9 @@ static int __prefill_sheaf_pfmemalloc(struct kmem_cache *s,
return ret;
}
+static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
+ size_t size, void **p);
+
/*
* returns a sheaf that has at least the requested size
* when prefilling is needed, do so with given gfp flags
@@ -7463,6 +7597,116 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
}
EXPORT_SYMBOL(kmem_cache_free_bulk);
+static unsigned int
+__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max)
+{
+ struct slab *slab, *slab2;
+ struct partial_context pc;
+ unsigned int refilled = 0;
+ unsigned long flags;
+ void *object;
+ int node;
+
+ pc.flags = gfp;
+ pc.min_objects = min;
+ pc.max_objects = max;
+
+ node = numa_mem_id();
+
+ if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
+ return 0;
+
+ /* TODO: consider also other nodes? */
+ if (!get_partial_node_bulk(s, get_node(s, node), &pc))
+ goto new_slab;
+
+ list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
+
+ list_del(&slab->slab_list);
+
+ object = get_freelist_nofreeze(s, slab);
+
+ while (object && refilled < max) {
+ p[refilled] = object;
+ object = get_freepointer(s, object);
+ maybe_wipe_obj_freeptr(s, p[refilled]);
+
+ refilled++;
+ }
+
+ /*
+ * Freelist had more objects than we can accommodate, we need to
+ * free them back. We can treat it like a detached freelist, just
+ * need to find the tail object.
+ */
+ if (unlikely(object)) {
+ void *head = object;
+ void *tail;
+ int cnt = 0;
+
+ do {
+ tail = object;
+ cnt++;
+ object = get_freepointer(s, object);
+ } while (object);
+ do_slab_free(s, slab, head, tail, cnt, _RET_IP_);
+ }
+
+ if (refilled >= max)
+ break;
+ }
+
+ if (unlikely(!list_empty(&pc.slabs))) {
+ struct kmem_cache_node *n = get_node(s, node);
+
+ spin_lock_irqsave(&n->list_lock, flags);
+
+ list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
+
+ if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial))
+ continue;
+
+ list_del(&slab->slab_list);
+ add_partial(n, slab, DEACTIVATE_TO_HEAD);
+ }
+
+ spin_unlock_irqrestore(&n->list_lock, flags);
+
+ /* any slabs left are completely free and for discard */
+ list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
+
+ list_del(&slab->slab_list);
+ discard_slab(s, slab);
+ }
+ }
+
+
+ if (likely(refilled >= min))
+ goto out;
+
+new_slab:
+
+ slab = new_slab(s, pc.flags, node);
+ if (!slab)
+ goto out;
+
+ stat(s, ALLOC_SLAB);
+
+ /*
+ * TODO: possible optimization - if we know we will consume the whole
+ * slab we might skip creating the freelist?
+ */
+ refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
+ /* allow_spin = */ true);
+
+ if (refilled < min)
+ goto new_slab;
+out:
+
+ return refilled;
+}
+
static inline
int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
--
2.52.0
^ permalink raw reply [flat|nested] 106+ messages in thread
* [PATCH v3 10/21] slab: remove cpu (partial) slabs usage from allocation paths
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (8 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 09/21] slab: add optimized sheaf refill from partial list Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-20 4:20 ` Harry Yoo
` (2 more replies)
2026-01-16 14:40 ` [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL Vlastimil Babka
` (10 subsequent siblings)
20 siblings, 3 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
We now rely on sheaves as the percpu caching layer and can refill them
directly from partial or newly allocated slabs. Start removing the cpu
(partial) slabs code, first from allocation paths.
This means that any allocation not satisfied from percpu sheaves will
end up in ___slab_alloc(), where we remove the usage of cpu (partial)
slabs, so it will only perform get_partial() or new_slab(). In the
latter case we reuse alloc_from_new_slab() (when we don't use
the debug/tiny alloc_single_from_new_slab() variant).
In get_partial_node() we used to return a slab to be frozen as the cpu
slab, and additional slabs to refill the percpu partial list. Now we
only want to return a single object and leave the slab on the list
(unless it becomes full). We can't simply reuse
alloc_single_from_partial() as that assumes freeing uses
free_to_partial_list(). Instead we need to use __slab_update_freelist()
to work properly against a racing __slab_free().
The rest of the changes removes functions that no longer have any
callers.
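After this patch the slow path thus reduces to roughly the following
(a condensed view of the resulting ___slab_alloc(); the NUMA node
fallback, retries on allocation failure, debug tracking and statistics
are omitted):

    void *freelist = NULL;
    struct slab *slab;
    struct partial_context pc = {
            .flags = gfpflags,
            .orig_size = orig_size,
    };

    /* take a single object from a partial slab, if any */
    freelist = get_partial(s, node, &pc);
    if (freelist)
            return freelist;

    /* otherwise allocate a new slab and take an object from it */
    slab = new_slab(s, pc.flags, node);
    if (!slab)
            return NULL;

    if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s))
            freelist = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
    else
            alloc_from_new_slab(s, slab, &freelist, 1, allow_spin);

    return freelist;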
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slub.c | 612 ++++++++------------------------------------------------------
1 file changed, 79 insertions(+), 533 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index dce80463f92c..698c0d940f06 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -245,7 +245,6 @@ static DEFINE_STATIC_KEY_FALSE(strict_numa);
struct partial_context {
gfp_t flags;
unsigned int orig_size;
- void *object;
unsigned int min_objects;
unsigned int max_objects;
struct list_head slabs;
@@ -611,36 +610,6 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
return freelist_ptr_decode(s, p, ptr_addr);
}
-static void prefetch_freepointer(const struct kmem_cache *s, void *object)
-{
- prefetchw(object + s->offset);
-}
-
-/*
- * When running under KMSAN, get_freepointer_safe() may return an uninitialized
- * pointer value in the case the current thread loses the race for the next
- * memory chunk in the freelist. In that case this_cpu_cmpxchg_double() in
- * slab_alloc_node() will fail, so the uninitialized value won't be used, but
- * KMSAN will still check all arguments of cmpxchg because of imperfect
- * handling of inline assembly.
- * To work around this problem, we apply __no_kmsan_checks to ensure that
- * get_freepointer_safe() returns initialized memory.
- */
-__no_kmsan_checks
-static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
-{
- unsigned long freepointer_addr;
- freeptr_t p;
-
- if (!debug_pagealloc_enabled_static())
- return get_freepointer(s, object);
-
- object = kasan_reset_tag(object);
- freepointer_addr = (unsigned long)object + s->offset;
- copy_from_kernel_nofault(&p, (freeptr_t *)freepointer_addr, sizeof(p));
- return freelist_ptr_decode(s, p, freepointer_addr);
-}
-
static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
{
unsigned long freeptr_addr = (unsigned long)object + s->offset;
@@ -720,23 +689,11 @@ static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
nr_slabs = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
s->cpu_partial_slabs = nr_slabs;
}
-
-static inline unsigned int slub_get_cpu_partial(struct kmem_cache *s)
-{
- return s->cpu_partial_slabs;
-}
-#else
-#ifdef SLAB_SUPPORTS_SYSFS
+#elif defined(SLAB_SUPPORTS_SYSFS)
static inline void
slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
{
}
-#endif
-
-static inline unsigned int slub_get_cpu_partial(struct kmem_cache *s)
-{
- return 0;
-}
#endif /* CONFIG_SLUB_CPU_PARTIAL */
/*
@@ -1077,7 +1034,7 @@ static void set_track_update(struct kmem_cache *s, void *object,
p->handle = handle;
#endif
p->addr = addr;
- p->cpu = smp_processor_id();
+ p->cpu = raw_smp_processor_id();
p->pid = current->pid;
p->when = jiffies;
}
@@ -3583,15 +3540,15 @@ static bool get_partial_node_bulk(struct kmem_cache *s,
}
/*
- * Try to allocate a partial slab from a specific node.
+ * Try to allocate object from a partial slab on a specific node.
*/
-static struct slab *get_partial_node(struct kmem_cache *s,
- struct kmem_cache_node *n,
- struct partial_context *pc)
+static void *get_partial_node(struct kmem_cache *s,
+ struct kmem_cache_node *n,
+ struct partial_context *pc)
{
- struct slab *slab, *slab2, *partial = NULL;
+ struct slab *slab, *slab2;
unsigned long flags;
- unsigned int partial_slabs = 0;
+ void *object = NULL;
/*
* Racy check. If we mistakenly see no partial slabs then we
@@ -3607,54 +3564,55 @@ static struct slab *get_partial_node(struct kmem_cache *s,
else if (!spin_trylock_irqsave(&n->list_lock, flags))
return NULL;
list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
+
+ struct freelist_counters old, new;
+
if (!pfmemalloc_match(slab, pc->flags))
continue;
if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
- void *object = alloc_single_from_partial(s, n, slab,
+ object = alloc_single_from_partial(s, n, slab,
pc->orig_size);
- if (object) {
- partial = slab;
- pc->object = object;
+ if (object)
break;
- }
continue;
}
- remove_partial(n, slab);
+ /*
+ * get a single object from the slab. This might race against
+ * __slab_free(), which however has to take the list_lock if
+ * it's about to make the slab fully free.
+ */
+ do {
+ old.freelist = slab->freelist;
+ old.counters = slab->counters;
- if (!partial) {
- partial = slab;
- stat(s, ALLOC_FROM_PARTIAL);
+ new.freelist = get_freepointer(s, old.freelist);
+ new.counters = old.counters;
+ new.inuse++;
- if ((slub_get_cpu_partial(s) == 0)) {
- break;
- }
- } else {
- put_cpu_partial(s, slab, 0);
- stat(s, CPU_PARTIAL_NODE);
+ } while (!__slab_update_freelist(s, slab, &old, &new, "get_partial_node"));
- if (++partial_slabs > slub_get_cpu_partial(s) / 2) {
- break;
- }
- }
+ object = old.freelist;
+ if (!new.freelist)
+ remove_partial(n, slab);
+
+ break;
}
spin_unlock_irqrestore(&n->list_lock, flags);
- return partial;
+ return object;
}
/*
- * Get a slab from somewhere. Search in increasing NUMA distances.
+ * Get an object from somewhere. Search in increasing NUMA distances.
*/
-static struct slab *get_any_partial(struct kmem_cache *s,
- struct partial_context *pc)
+static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
{
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
struct zoneref *z;
struct zone *zone;
enum zone_type highest_zoneidx = gfp_zone(pc->flags);
- struct slab *slab;
unsigned int cpuset_mems_cookie;
/*
@@ -3689,8 +3647,10 @@ static struct slab *get_any_partial(struct kmem_cache *s,
if (n && cpuset_zone_allowed(zone, pc->flags) &&
n->nr_partial > s->min_partial) {
- slab = get_partial_node(s, n, pc);
- if (slab) {
+
+ void *object = get_partial_node(s, n, pc);
+
+ if (object) {
/*
* Don't check read_mems_allowed_retry()
* here - if mems_allowed was updated in
@@ -3698,7 +3658,7 @@ static struct slab *get_any_partial(struct kmem_cache *s,
* between allocation and the cpuset
* update
*/
- return slab;
+ return object;
}
}
}
@@ -3708,20 +3668,20 @@ static struct slab *get_any_partial(struct kmem_cache *s,
}
/*
- * Get a partial slab, lock it and return it.
+ * Get an object from a partial slab
*/
-static struct slab *get_partial(struct kmem_cache *s, int node,
- struct partial_context *pc)
+static void *get_partial(struct kmem_cache *s, int node,
+ struct partial_context *pc)
{
- struct slab *slab;
int searchnode = node;
+ void *object;
if (node == NUMA_NO_NODE)
searchnode = numa_mem_id();
- slab = get_partial_node(s, get_node(s, searchnode), pc);
- if (slab || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
- return slab;
+ object = get_partial_node(s, get_node(s, searchnode), pc);
+ if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
+ return object;
return get_any_partial(s, pc);
}
@@ -4281,19 +4241,6 @@ static int slub_cpu_dead(unsigned int cpu)
return 0;
}
-/*
- * Check if the objects in a per cpu structure fit numa
- * locality expectations.
- */
-static inline int node_match(struct slab *slab, int node)
-{
-#ifdef CONFIG_NUMA
- if (node != NUMA_NO_NODE && slab_nid(slab) != node)
- return 0;
-#endif
- return 1;
-}
-
#ifdef CONFIG_SLUB_DEBUG
static int count_free(struct slab *slab)
{
@@ -4478,36 +4425,6 @@ __update_cpu_freelist_fast(struct kmem_cache *s,
&old.freelist_tid, new.freelist_tid);
}
-/*
- * Check the slab->freelist and either transfer the freelist to the
- * per cpu freelist or deactivate the slab.
- *
- * The slab is still frozen if the return value is not NULL.
- *
- * If this function returns NULL then the slab has been unfrozen.
- */
-static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
-{
- struct freelist_counters old, new;
-
- lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
-
- do {
- old.freelist = slab->freelist;
- old.counters = slab->counters;
-
- new.freelist = NULL;
- new.counters = old.counters;
-
- new.inuse = old.objects;
- new.frozen = old.freelist != NULL;
-
-
- } while (!__slab_update_freelist(s, slab, &old, &new, "get_freelist"));
-
- return old.freelist;
-}
-
/*
* Get the slab's freelist and do not freeze it.
*
@@ -4535,29 +4452,6 @@ static inline void *get_freelist_nofreeze(struct kmem_cache *s, struct slab *sla
return old.freelist;
}
-/*
- * Freeze the partial slab and return the pointer to the freelist.
- */
-static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
-{
- struct freelist_counters old, new;
-
- do {
- old.freelist = slab->freelist;
- old.counters = slab->counters;
-
- new.freelist = NULL;
- new.counters = old.counters;
- VM_BUG_ON(new.frozen);
-
- new.inuse = old.objects;
- new.frozen = 1;
-
- } while (!slab_update_freelist(s, slab, &old, &new, "freeze_slab"));
-
- return old.freelist;
-}
-
/*
* If the object has been wiped upon free, make sure it's fully initialized by
* zeroing out freelist pointer.
@@ -4618,170 +4512,23 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
}
/*
- * Slow path. The lockless freelist is empty or we need to perform
- * debugging duties.
- *
- * Processing is still very fast if new objects have been freed to the
- * regular freelist. In that case we simply take over the regular freelist
- * as the lockless freelist and zap the regular freelist.
- *
- * If that is not working then we fall back to the partial lists. We take the
- * first element of the freelist as the object to allocate now and move the
- * rest of the freelist to the lockless freelist.
- *
- * And if we were unable to get a new slab from the partial slab lists then
- * we need to allocate a new slab. This is the slowest path since it involves
- * a call to the page allocator and the setup of a new slab.
+ * Slow path. We failed to allocate via percpu sheaves or they are not available
+ * due to bootstrap or debugging enabled or SLUB_TINY.
*
- * Version of __slab_alloc to use when we know that preemption is
- * already disabled (which is the case for bulk allocation).
+ * We try to allocate from partial slab lists and fall back to allocating a new
+ * slab.
*/
static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
- unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
+ unsigned long addr, unsigned int orig_size)
{
bool allow_spin = gfpflags_allow_spinning(gfpflags);
void *freelist;
struct slab *slab;
- unsigned long flags;
struct partial_context pc;
bool try_thisnode = true;
stat(s, ALLOC_SLOWPATH);
-reread_slab:
-
- slab = READ_ONCE(c->slab);
- if (!slab) {
- /*
- * if the node is not online or has no normal memory, just
- * ignore the node constraint
- */
- if (unlikely(node != NUMA_NO_NODE &&
- !node_isset(node, slab_nodes)))
- node = NUMA_NO_NODE;
- goto new_slab;
- }
-
- if (unlikely(!node_match(slab, node))) {
- /*
- * same as above but node_match() being false already
- * implies node != NUMA_NO_NODE.
- *
- * We don't strictly honor pfmemalloc and NUMA preferences
- * when !allow_spin because:
- *
- * 1. Most kmalloc() users allocate objects on the local node,
- * so kmalloc_nolock() tries not to interfere with them by
- * deactivating the cpu slab.
- *
- * 2. Deactivating due to NUMA or pfmemalloc mismatch may cause
- * unnecessary slab allocations even when n->partial list
- * is not empty.
- */
- if (!node_isset(node, slab_nodes) ||
- !allow_spin) {
- node = NUMA_NO_NODE;
- } else {
- stat(s, ALLOC_NODE_MISMATCH);
- goto deactivate_slab;
- }
- }
-
- /*
- * By rights, we should be searching for a slab page that was
- * PFMEMALLOC but right now, we are losing the pfmemalloc
- * information when the page leaves the per-cpu allocator
- */
- if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin))
- goto deactivate_slab;
-
- /* must check again c->slab in case we got preempted and it changed */
- local_lock_cpu_slab(s, flags);
-
- if (unlikely(slab != c->slab)) {
- local_unlock_cpu_slab(s, flags);
- goto reread_slab;
- }
- freelist = c->freelist;
- if (freelist)
- goto load_freelist;
-
- freelist = get_freelist(s, slab);
-
- if (!freelist) {
- c->slab = NULL;
- c->tid = next_tid(c->tid);
- local_unlock_cpu_slab(s, flags);
- stat(s, DEACTIVATE_BYPASS);
- goto new_slab;
- }
-
- stat(s, ALLOC_REFILL);
-
-load_freelist:
-
- lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
-
- /*
- * freelist is pointing to the list of objects to be used.
- * slab is pointing to the slab from which the objects are obtained.
- * That slab must be frozen for per cpu allocations to work.
- */
- VM_BUG_ON(!c->slab->frozen);
- c->freelist = get_freepointer(s, freelist);
- c->tid = next_tid(c->tid);
- local_unlock_cpu_slab(s, flags);
- return freelist;
-
-deactivate_slab:
-
- local_lock_cpu_slab(s, flags);
- if (slab != c->slab) {
- local_unlock_cpu_slab(s, flags);
- goto reread_slab;
- }
- freelist = c->freelist;
- c->slab = NULL;
- c->freelist = NULL;
- c->tid = next_tid(c->tid);
- local_unlock_cpu_slab(s, flags);
- deactivate_slab(s, slab, freelist);
-
-new_slab:
-
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- while (slub_percpu_partial(c)) {
- local_lock_cpu_slab(s, flags);
- if (unlikely(c->slab)) {
- local_unlock_cpu_slab(s, flags);
- goto reread_slab;
- }
- if (unlikely(!slub_percpu_partial(c))) {
- local_unlock_cpu_slab(s, flags);
- /* we were preempted and partial list got empty */
- goto new_objects;
- }
-
- slab = slub_percpu_partial(c);
- slub_set_percpu_partial(c, slab);
-
- if (likely(node_match(slab, node) &&
- pfmemalloc_match(slab, gfpflags)) ||
- !allow_spin) {
- c->slab = slab;
- freelist = get_freelist(s, slab);
- VM_BUG_ON(!freelist);
- stat(s, CPU_PARTIAL_ALLOC);
- goto load_freelist;
- }
-
- local_unlock_cpu_slab(s, flags);
-
- slab->next = NULL;
- __put_partials(s, slab);
- }
-#endif
-
new_objects:
pc.flags = gfpflags;
@@ -4806,33 +4553,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
}
pc.orig_size = orig_size;
- slab = get_partial(s, node, &pc);
- if (slab) {
- if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
- freelist = pc.object;
- /*
- * For debug caches here we had to go through
- * alloc_single_from_partial() so just store the
- * tracking info and return the object.
- *
- * Due to disabled preemption we need to disallow
- * blocking. The flags are further adjusted by
- * gfp_nested_mask() in stack_depot itself.
- */
- if (s->flags & SLAB_STORE_USER)
- set_track(s, freelist, TRACK_ALLOC, addr,
- gfpflags & ~(__GFP_DIRECT_RECLAIM));
-
- return freelist;
- }
-
- freelist = freeze_slab(s, slab);
- goto retry_load_slab;
- }
+ freelist = get_partial(s, node, &pc);
+ if (freelist)
+ goto success;
- slub_put_cpu_ptr(s->cpu_slab);
slab = new_slab(s, pc.flags, node);
- c = slub_get_cpu_ptr(s->cpu_slab);
if (unlikely(!slab)) {
if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
@@ -4849,68 +4574,29 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
freelist = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
- if (unlikely(!freelist)) {
- /* This could cause an endless loop. Fail instead. */
- if (!allow_spin)
- return NULL;
- goto new_objects;
- }
-
- if (s->flags & SLAB_STORE_USER)
- set_track(s, freelist, TRACK_ALLOC, addr,
- gfpflags & ~(__GFP_DIRECT_RECLAIM));
-
- return freelist;
- }
-
- /*
- * No other reference to the slab yet so we can
- * muck around with it freely without cmpxchg
- */
- freelist = slab->freelist;
- slab->freelist = NULL;
- slab->inuse = slab->objects;
- slab->frozen = 1;
-
- inc_slabs_node(s, slab_nid(slab), slab->objects);
+ if (likely(freelist))
+ goto success;
+ } else {
+ alloc_from_new_slab(s, slab, &freelist, 1, allow_spin);
- if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin)) {
- /*
- * For !pfmemalloc_match() case we don't load freelist so that
- * we don't make further mismatched allocations easier.
- */
- deactivate_slab(s, slab, get_freepointer(s, freelist));
- return freelist;
+ /* we don't need to check SLAB_STORE_USER here */
+ if (likely(freelist))
+ return freelist;
}
-retry_load_slab:
-
- local_lock_cpu_slab(s, flags);
- if (unlikely(c->slab)) {
- void *flush_freelist = c->freelist;
- struct slab *flush_slab = c->slab;
-
- c->slab = NULL;
- c->freelist = NULL;
- c->tid = next_tid(c->tid);
-
- local_unlock_cpu_slab(s, flags);
-
- if (unlikely(!allow_spin)) {
- /* Reentrant slub cannot take locks, defer */
- defer_deactivate_slab(flush_slab, flush_freelist);
- } else {
- deactivate_slab(s, flush_slab, flush_freelist);
- }
+ if (allow_spin)
+ goto new_objects;
- stat(s, CPUSLAB_FLUSH);
+ /* This could cause an endless loop. Fail instead. */
+ return NULL;
- goto retry_load_slab;
- }
- c->slab = slab;
+success:
+ if (kmem_cache_debug_flags(s, SLAB_STORE_USER))
+ set_track(s, freelist, TRACK_ALLOC, addr, gfpflags);
- goto load_freelist;
+ return freelist;
}
+
/*
* We disallow kprobes in ___slab_alloc() to prevent reentrance
*
@@ -4925,87 +4611,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
NOKPROBE_SYMBOL(___slab_alloc);
-/*
- * A wrapper for ___slab_alloc() for contexts where preemption is not yet
- * disabled. Compensates for possible cpu changes by refetching the per cpu area
- * pointer.
- */
-static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
- unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
-{
- void *p;
-
-#ifdef CONFIG_PREEMPT_COUNT
- /*
- * We may have been preempted and rescheduled on a different
- * cpu before disabling preemption. Need to reload cpu area
- * pointer.
- */
- c = slub_get_cpu_ptr(s->cpu_slab);
-#endif
- if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
- if (local_lock_is_locked(&s->cpu_slab->lock)) {
- /*
- * EBUSY is an internal signal to kmalloc_nolock() to
- * retry a different bucket. It's not propagated
- * to the caller.
- */
- p = ERR_PTR(-EBUSY);
- goto out;
- }
- }
- p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
-out:
-#ifdef CONFIG_PREEMPT_COUNT
- slub_put_cpu_ptr(s->cpu_slab);
-#endif
- return p;
-}
-
static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
{
- struct kmem_cache_cpu *c;
- struct slab *slab;
- unsigned long tid;
void *object;
-redo:
- /*
- * Must read kmem_cache cpu data via this cpu ptr. Preemption is
- * enabled. We may switch back and forth between cpus while
- * reading from one cpu area. That does not matter as long
- * as we end up on the original cpu again when doing the cmpxchg.
- *
- * We must guarantee that tid and kmem_cache_cpu are retrieved on the
- * same cpu. We read first the kmem_cache_cpu pointer and use it to read
- * the tid. If we are preempted and switched to another cpu between the
- * two reads, it's OK as the two are still associated with the same cpu
- * and cmpxchg later will validate the cpu.
- */
- c = raw_cpu_ptr(s->cpu_slab);
- tid = READ_ONCE(c->tid);
-
- /*
- * Irqless object alloc/free algorithm used here depends on sequence
- * of fetching cpu_slab's data. tid should be fetched before anything
- * on c to guarantee that object and slab associated with previous tid
- * won't be used with current tid. If we fetch tid first, object and
- * slab could be one associated with next tid and our alloc/free
- * request will be failed. In this case, we will retry. So, no problem.
- */
- barrier();
-
- /*
- * The transaction ids are globally unique per cpu and per operation on
- * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
- * occurs on the right processor and that there was no operation on the
- * linked list in between.
- */
-
- object = c->freelist;
- slab = c->slab;
-
#ifdef CONFIG_NUMA
if (static_branch_unlikely(&strict_numa) &&
node == NUMA_NO_NODE) {
@@ -5014,47 +4624,20 @@ static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
if (mpol) {
/*
- * Special BIND rule support. If existing slab
+ * Special BIND rule support. If the local node
* is in permitted set then do not redirect
* to a particular node.
* Otherwise we apply the memory policy to get
* the node we need to allocate on.
*/
- if (mpol->mode != MPOL_BIND || !slab ||
- !node_isset(slab_nid(slab), mpol->nodes))
-
+ if (mpol->mode != MPOL_BIND ||
+ !node_isset(numa_mem_id(), mpol->nodes))
node = mempolicy_slab_node();
}
}
#endif
- if (!USE_LOCKLESS_FAST_PATH() ||
- unlikely(!object || !slab || !node_match(slab, node))) {
- object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
- } else {
- void *next_object = get_freepointer_safe(s, object);
-
- /*
- * The cmpxchg will only match if there was no additional
- * operation and if we are on the right processor.
- *
- * The cmpxchg does the following atomically (without lock
- * semantics!)
- * 1. Relocate first pointer to the current per cpu area.
- * 2. Verify that tid and freelist have not been changed
- * 3. If they were not changed replace tid and freelist
- *
- * Since this is without lock semantics the protection is only
- * against code executing on this cpu *not* from access by
- * other cpus.
- */
- if (unlikely(!__update_cpu_freelist_fast(s, object, next_object, tid))) {
- note_cmpxchg_failure("slab_alloc", s, tid);
- goto redo;
- }
- prefetch_freepointer(s, next_object);
- stat(s, ALLOC_FASTPATH);
- }
+ object = ___slab_alloc(s, gfpflags, node, addr, orig_size);
return object;
}
@@ -7711,62 +7294,25 @@ static inline
int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
{
- struct kmem_cache_cpu *c;
- unsigned long irqflags;
int i;
/*
- * Drain objects in the per cpu slab, while disabling local
- * IRQs, which protects against PREEMPT and interrupts
- * handlers invoking normal fastpath.
+ * TODO: this might be more efficient (if necessary) by reusing
+ * __refill_objects()
*/
- c = slub_get_cpu_ptr(s->cpu_slab);
- local_lock_irqsave(&s->cpu_slab->lock, irqflags);
-
for (i = 0; i < size; i++) {
- void *object = c->freelist;
- if (unlikely(!object)) {
- /*
- * We may have removed an object from c->freelist using
- * the fastpath in the previous iteration; in that case,
- * c->tid has not been bumped yet.
- * Since ___slab_alloc() may reenable interrupts while
- * allocating memory, we should bump c->tid now.
- */
- c->tid = next_tid(c->tid);
+ p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
+ s->object_size);
+ if (unlikely(!p[i]))
+ goto error;
- local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
-
- /*
- * Invoking slow path likely have side-effect
- * of re-populating per CPU c->freelist
- */
- p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
- _RET_IP_, c, s->object_size);
- if (unlikely(!p[i]))
- goto error;
-
- c = this_cpu_ptr(s->cpu_slab);
- maybe_wipe_obj_freeptr(s, p[i]);
-
- local_lock_irqsave(&s->cpu_slab->lock, irqflags);
-
- continue; /* goto for-loop */
- }
- c->freelist = get_freepointer(s, object);
- p[i] = object;
maybe_wipe_obj_freeptr(s, p[i]);
- stat(s, ALLOC_FASTPATH);
}
- c->tid = next_tid(c->tid);
- local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
- slub_put_cpu_ptr(s->cpu_slab);
return i;
error:
- slub_put_cpu_ptr(s->cpu_slab);
__kmem_cache_free_bulk(s, i, p);
return 0;
--
2.52.0
^ permalink raw reply [flat|nested] 106+ messages in thread
* [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (9 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 10/21] slab: remove cpu (partial) slabs usage from allocation paths Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-20 5:24 ` Harry Yoo
` (2 more replies)
2026-01-16 14:40 ` [PATCH v3 12/21] slab: remove the do_slab_free() fastpath Vlastimil Babka
` (9 subsequent siblings)
20 siblings, 3 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
We have removed the cpu partial slabs usage from the allocation paths.
Now remove the whole config option and the associated code.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/Kconfig | 11 ---
mm/slab.h | 29 ------
mm/slub.c | 321 ++++---------------------------------------------------------
3 files changed, 19 insertions(+), 342 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index bd0ea5454af8..08593674cd20 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -247,17 +247,6 @@ config SLUB_STATS
out which slabs are relevant to a particular load.
Try running: slabinfo -DA
-config SLUB_CPU_PARTIAL
- default y
- depends on SMP && !SLUB_TINY
- bool "Enable per cpu partial caches"
- help
- Per cpu partial caches accelerate objects allocation and freeing
- that is local to a processor at the price of more indeterminism
- in the latency of the free. On overflow these caches will be cleared
- which requires the taking of locks that may cause latency spikes.
- Typically one would choose no for a realtime system.
-
config RANDOM_KMALLOC_CACHES
default n
depends on !SLUB_TINY
diff --git a/mm/slab.h b/mm/slab.h
index cb48ce5014ba..e77260720994 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -77,12 +77,6 @@ struct slab {
struct llist_node llnode;
void *flush_freelist;
};
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- struct {
- struct slab *next;
- int slabs; /* Nr of slabs left */
- };
-#endif
};
/* Double-word boundary */
struct freelist_counters;
@@ -188,23 +182,6 @@ static inline size_t slab_size(const struct slab *slab)
return PAGE_SIZE << slab_order(slab);
}
-#ifdef CONFIG_SLUB_CPU_PARTIAL
-#define slub_percpu_partial(c) ((c)->partial)
-
-#define slub_set_percpu_partial(c, p) \
-({ \
- slub_percpu_partial(c) = (p)->next; \
-})
-
-#define slub_percpu_partial_read_once(c) READ_ONCE(slub_percpu_partial(c))
-#else
-#define slub_percpu_partial(c) NULL
-
-#define slub_set_percpu_partial(c, p)
-
-#define slub_percpu_partial_read_once(c) NULL
-#endif // CONFIG_SLUB_CPU_PARTIAL
-
/*
* Word size structure that can be atomically updated or read and that
* contains both the order and the number of objects that a slab of the
@@ -228,12 +205,6 @@ struct kmem_cache {
unsigned int object_size; /* Object size without metadata */
struct reciprocal_value reciprocal_size;
unsigned int offset; /* Free pointer offset */
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- /* Number of per cpu partial objects to keep around */
- unsigned int cpu_partial;
- /* Number of per cpu partial slabs to keep around */
- unsigned int cpu_partial_slabs;
-#endif
unsigned int sheaf_capacity;
struct kmem_cache_order_objects oo;
diff --git a/mm/slub.c b/mm/slub.c
index 698c0d940f06..6b1280f7900a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -263,15 +263,6 @@ void *fixup_red_left(struct kmem_cache *s, void *p)
return p;
}
-static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
-{
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- return !kmem_cache_debug(s);
-#else
- return false;
-#endif
-}
-
/*
* Issues still to be resolved:
*
@@ -426,9 +417,6 @@ struct freelist_tid {
struct kmem_cache_cpu {
struct freelist_tid;
struct slab *slab; /* The slab from which we are allocating */
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- struct slab *partial; /* Partially allocated slabs */
-#endif
local_trylock_t lock; /* Protects the fields above */
#ifdef CONFIG_SLUB_STATS
unsigned int stat[NR_SLUB_STAT_ITEMS];
@@ -673,29 +661,6 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
return x.x & OO_MASK;
}
-#ifdef CONFIG_SLUB_CPU_PARTIAL
-static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
-{
- unsigned int nr_slabs;
-
- s->cpu_partial = nr_objects;
-
- /*
- * We take the number of objects but actually limit the number of
- * slabs on the per cpu partial list, in order to limit excessive
- * growth of the list. For simplicity we assume that the slabs will
- * be half-full.
- */
- nr_slabs = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
- s->cpu_partial_slabs = nr_slabs;
-}
-#elif defined(SLAB_SUPPORTS_SYSFS)
-static inline void
-slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
-{
-}
-#endif /* CONFIG_SLUB_CPU_PARTIAL */
-
/*
* If network-based swap is enabled, slub must keep track of whether memory
* were allocated from pfmemalloc reserves.
@@ -3474,12 +3439,6 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
return object;
}
-#ifdef CONFIG_SLUB_CPU_PARTIAL
-static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain);
-#else
-static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
- int drain) { }
-#endif
static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
static bool get_partial_node_bulk(struct kmem_cache *s,
@@ -3898,131 +3857,6 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
#define local_unlock_cpu_slab(s, flags) \
local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
-#ifdef CONFIG_SLUB_CPU_PARTIAL
-static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
-{
- struct kmem_cache_node *n = NULL, *n2 = NULL;
- struct slab *slab, *slab_to_discard = NULL;
- unsigned long flags = 0;
-
- while (partial_slab) {
- slab = partial_slab;
- partial_slab = slab->next;
-
- n2 = get_node(s, slab_nid(slab));
- if (n != n2) {
- if (n)
- spin_unlock_irqrestore(&n->list_lock, flags);
-
- n = n2;
- spin_lock_irqsave(&n->list_lock, flags);
- }
-
- if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial)) {
- slab->next = slab_to_discard;
- slab_to_discard = slab;
- } else {
- add_partial(n, slab, DEACTIVATE_TO_TAIL);
- stat(s, FREE_ADD_PARTIAL);
- }
- }
-
- if (n)
- spin_unlock_irqrestore(&n->list_lock, flags);
-
- while (slab_to_discard) {
- slab = slab_to_discard;
- slab_to_discard = slab_to_discard->next;
-
- stat(s, DEACTIVATE_EMPTY);
- discard_slab(s, slab);
- stat(s, FREE_SLAB);
- }
-}
-
-/*
- * Put all the cpu partial slabs to the node partial list.
- */
-static void put_partials(struct kmem_cache *s)
-{
- struct slab *partial_slab;
- unsigned long flags;
-
- local_lock_irqsave(&s->cpu_slab->lock, flags);
- partial_slab = this_cpu_read(s->cpu_slab->partial);
- this_cpu_write(s->cpu_slab->partial, NULL);
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
-
- if (partial_slab)
- __put_partials(s, partial_slab);
-}
-
-static void put_partials_cpu(struct kmem_cache *s,
- struct kmem_cache_cpu *c)
-{
- struct slab *partial_slab;
-
- partial_slab = slub_percpu_partial(c);
- c->partial = NULL;
-
- if (partial_slab)
- __put_partials(s, partial_slab);
-}
-
-/*
- * Put a slab into a partial slab slot if available.
- *
- * If we did not find a slot then simply move all the partials to the
- * per node partial list.
- */
-static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
-{
- struct slab *oldslab;
- struct slab *slab_to_put = NULL;
- unsigned long flags;
- int slabs = 0;
-
- local_lock_cpu_slab(s, flags);
-
- oldslab = this_cpu_read(s->cpu_slab->partial);
-
- if (oldslab) {
- if (drain && oldslab->slabs >= s->cpu_partial_slabs) {
- /*
- * Partial array is full. Move the existing set to the
- * per node partial list. Postpone the actual unfreezing
- * outside of the critical section.
- */
- slab_to_put = oldslab;
- oldslab = NULL;
- } else {
- slabs = oldslab->slabs;
- }
- }
-
- slabs++;
-
- slab->slabs = slabs;
- slab->next = oldslab;
-
- this_cpu_write(s->cpu_slab->partial, slab);
-
- local_unlock_cpu_slab(s, flags);
-
- if (slab_to_put) {
- __put_partials(s, slab_to_put);
- stat(s, CPU_PARTIAL_DRAIN);
- }
-}
-
-#else /* CONFIG_SLUB_CPU_PARTIAL */
-
-static inline void put_partials(struct kmem_cache *s) { }
-static inline void put_partials_cpu(struct kmem_cache *s,
- struct kmem_cache_cpu *c) { }
-
-#endif /* CONFIG_SLUB_CPU_PARTIAL */
-
static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
unsigned long flags;
@@ -4060,8 +3894,6 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
deactivate_slab(s, slab, freelist);
stat(s, CPUSLAB_FLUSH);
}
-
- put_partials_cpu(s, c);
}
static inline void flush_this_cpu_slab(struct kmem_cache *s)
@@ -4070,15 +3902,13 @@ static inline void flush_this_cpu_slab(struct kmem_cache *s)
if (c->slab)
flush_slab(s, c);
-
- put_partials(s);
}
static bool has_cpu_slab(int cpu, struct kmem_cache *s)
{
struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
- return c->slab || slub_percpu_partial(c);
+ return c->slab;
}
static bool has_pcs_used(int cpu, struct kmem_cache *s)
@@ -5646,13 +5476,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
return;
}
- /*
- * It is enough to test IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) below
- * instead of kmem_cache_has_cpu_partial(s), because kmem_cache_debug(s)
- * is the only other reason it can be false, and it is already handled
- * above.
- */
-
do {
if (unlikely(n)) {
spin_unlock_irqrestore(&n->list_lock, flags);
@@ -5677,26 +5500,19 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
* Unless it's frozen.
*/
if ((!new.inuse || was_full) && !was_frozen) {
+
+ n = get_node(s, slab_nid(slab));
/*
- * If slab becomes non-full and we have cpu partial
- * lists, we put it there unconditionally to avoid
- * taking the list_lock. Otherwise we need it.
+ * Speculatively acquire the list_lock.
+ * If the cmpxchg does not succeed then we may
+ * drop the list_lock without any processing.
+ *
+ * Otherwise the list_lock will synchronize with
+ * other processors updating the list of slabs.
*/
- if (!(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full)) {
-
- n = get_node(s, slab_nid(slab));
- /*
- * Speculatively acquire the list_lock.
- * If the cmpxchg does not succeed then we may
- * drop the list_lock without any processing.
- *
- * Otherwise the list_lock will synchronize with
- * other processors updating the list of slabs.
- */
- spin_lock_irqsave(&n->list_lock, flags);
-
- on_node_partial = slab_test_node_partial(slab);
- }
+ spin_lock_irqsave(&n->list_lock, flags);
+
+ on_node_partial = slab_test_node_partial(slab);
}
} while (!slab_update_freelist(s, slab, &old, &new, "__slab_free"));
@@ -5709,13 +5525,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
* activity can be necessary.
*/
stat(s, FREE_FROZEN);
- } else if (IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) {
- /*
- * If we started with a full slab then put it onto the
- * per cpu partial list.
- */
- put_cpu_partial(s, slab, 1);
- stat(s, CPU_PARTIAL_FREE);
}
/*
@@ -5744,10 +5553,9 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
/*
* Objects left in the slab. If it was not on the partial list before
- * then add it. This can only happen when cache has no per cpu partial
- * list otherwise we would have put it there.
+ * then add it.
*/
- if (!IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && unlikely(was_full)) {
+ if (unlikely(was_full)) {
add_partial(n, slab, DEACTIVATE_TO_TAIL);
stat(s, FREE_ADD_PARTIAL);
}
@@ -6396,8 +6204,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
if (unlikely(!allow_spin)) {
/*
* __slab_free() can locklessly cmpxchg16 into a slab,
- * but then it might need to take spin_lock or local_lock
- * in put_cpu_partial() for further processing.
+ * but then it might need to take spin_lock
+ * for further processing.
* Avoid the complexity and simply add to a deferred list.
*/
defer_free(s, head);
@@ -7707,39 +7515,6 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
return 1;
}
-static void set_cpu_partial(struct kmem_cache *s)
-{
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- unsigned int nr_objects;
-
- /*
- * cpu_partial determined the maximum number of objects kept in the
- * per cpu partial lists of a processor.
- *
- * Per cpu partial lists mainly contain slabs that just have one
- * object freed. If they are used for allocation then they can be
- * filled up again with minimal effort. The slab will never hit the
- * per node partial lists and therefore no locking will be required.
- *
- * For backwards compatibility reasons, this is determined as number
- * of objects, even though we now limit maximum number of pages, see
- * slub_set_cpu_partial()
- */
- if (!kmem_cache_has_cpu_partial(s))
- nr_objects = 0;
- else if (s->size >= PAGE_SIZE)
- nr_objects = 6;
- else if (s->size >= 1024)
- nr_objects = 24;
- else if (s->size >= 256)
- nr_objects = 52;
- else
- nr_objects = 120;
-
- slub_set_cpu_partial(s, nr_objects);
-#endif
-}
-
static unsigned int calculate_sheaf_capacity(struct kmem_cache *s,
struct kmem_cache_args *args)
@@ -8595,8 +8370,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
s->min_partial = min_t(unsigned long, MAX_PARTIAL, ilog2(s->size) / 2);
s->min_partial = max_t(unsigned long, MIN_PARTIAL, s->min_partial);
- set_cpu_partial(s);
-
s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
if (!s->cpu_sheaves) {
err = -ENOMEM;
@@ -8960,20 +8733,6 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
total += x;
nodes[node] += x;
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- slab = slub_percpu_partial_read_once(c);
- if (slab) {
- node = slab_nid(slab);
- if (flags & SO_TOTAL)
- WARN_ON_ONCE(1);
- else if (flags & SO_OBJECTS)
- WARN_ON_ONCE(1);
- else
- x = data_race(slab->slabs);
- total += x;
- nodes[node] += x;
- }
-#endif
}
}
@@ -9108,12 +8867,7 @@ SLAB_ATTR(min_partial);
static ssize_t cpu_partial_show(struct kmem_cache *s, char *buf)
{
- unsigned int nr_partial = 0;
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- nr_partial = s->cpu_partial;
-#endif
-
- return sysfs_emit(buf, "%u\n", nr_partial);
+ return sysfs_emit(buf, "0\n");
}
static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf,
@@ -9125,11 +8879,9 @@ static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf,
err = kstrtouint(buf, 10, &objects);
if (err)
return err;
- if (objects && !kmem_cache_has_cpu_partial(s))
+ if (objects)
return -EINVAL;
- slub_set_cpu_partial(s, objects);
- flush_all(s);
return length;
}
SLAB_ATTR(cpu_partial);
@@ -9168,42 +8920,7 @@ SLAB_ATTR_RO(objects_partial);
static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
{
- int objects = 0;
- int slabs = 0;
- int cpu __maybe_unused;
- int len = 0;
-
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- for_each_online_cpu(cpu) {
- struct slab *slab;
-
- slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
-
- if (slab)
- slabs += data_race(slab->slabs);
- }
-#endif
-
- /* Approximate half-full slabs, see slub_set_cpu_partial() */
- objects = (slabs * oo_objects(s->oo)) / 2;
- len += sysfs_emit_at(buf, len, "%d(%d)", objects, slabs);
-
-#ifdef CONFIG_SLUB_CPU_PARTIAL
- for_each_online_cpu(cpu) {
- struct slab *slab;
-
- slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
- if (slab) {
- slabs = data_race(slab->slabs);
- objects = (slabs * oo_objects(s->oo)) / 2;
- len += sysfs_emit_at(buf, len, " C%d=%d(%d)",
- cpu, objects, slabs);
- }
- }
-#endif
- len += sysfs_emit_at(buf, len, "\n");
-
- return len;
+ return sysfs_emit(buf, "0(0)\n");
}
SLAB_ATTR_RO(slabs_cpu_partial);
--
2.52.0
^ permalink raw reply [flat|nested] 106+ messages in thread
* [PATCH v3 12/21] slab: remove the do_slab_free() fastpath
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (10 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-20 5:35 ` Harry Yoo
2026-01-20 12:29 ` Hao Li
2026-01-16 14:40 ` [PATCH v3 13/21] slab: remove defer_deactivate_slab() Vlastimil Babka
` (8 subsequent siblings)
20 siblings, 2 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
We have removed cpu slab usage from the allocation paths. Now remove
do_slab_free(), which was freeing objects via the percpu freelist when
the object belonged to the current cpu slab. Instead call __slab_free()
directly, which was previously the fallback.
This simplifies kfree_nolock() - when freeing to the percpu sheaf
fails, we can call defer_free() directly.
Also remove functions that became unused.
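__slab_free() keeps the detached-freelist calling convention of the
removed fastpath: head and tail pointers plus a count of objects that
all belong to the same slab. A single-object free, as in the converted
callers below, is simply:

    __slab_free(s, virt_to_slab(object), object, object, 1, _RET_IP_);

while kfree_nolock(), which must not spin on the list_lock that
__slab_free() may take, defers instead:

    if (!free_to_pcs(s, x, false))
            defer_free(s, x);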
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slub.c | 149 ++++++--------------------------------------------------------
1 file changed, 13 insertions(+), 136 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 6b1280f7900a..b08e775dc4cb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3682,29 +3682,6 @@ static inline unsigned int init_tid(int cpu)
return cpu;
}
-static inline void note_cmpxchg_failure(const char *n,
- const struct kmem_cache *s, unsigned long tid)
-{
-#ifdef SLUB_DEBUG_CMPXCHG
- unsigned long actual_tid = __this_cpu_read(s->cpu_slab->tid);
-
- pr_info("%s %s: cmpxchg redo ", n, s->name);
-
- if (IS_ENABLED(CONFIG_PREEMPTION) &&
- tid_to_cpu(tid) != tid_to_cpu(actual_tid)) {
- pr_warn("due to cpu change %d -> %d\n",
- tid_to_cpu(tid), tid_to_cpu(actual_tid));
- } else if (tid_to_event(tid) != tid_to_event(actual_tid)) {
- pr_warn("due to cpu running other code. Event %ld->%ld\n",
- tid_to_event(tid), tid_to_event(actual_tid));
- } else {
- pr_warn("for unknown reason: actual=%lx was=%lx target=%lx\n",
- actual_tid, tid, next_tid(tid));
- }
-#endif
- stat(s, CMPXCHG_DOUBLE_CPU_FAIL);
-}
-
static void init_kmem_cache_cpus(struct kmem_cache *s)
{
#ifdef CONFIG_PREEMPT_RT
@@ -4243,18 +4220,6 @@ static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags)
return true;
}
-static inline bool
-__update_cpu_freelist_fast(struct kmem_cache *s,
- void *freelist_old, void *freelist_new,
- unsigned long tid)
-{
- struct freelist_tid old = { .freelist = freelist_old, .tid = tid };
- struct freelist_tid new = { .freelist = freelist_new, .tid = next_tid(tid) };
-
- return this_cpu_try_cmpxchg_freelist(s->cpu_slab->freelist_tid,
- &old.freelist_tid, new.freelist_tid);
-}
-
/*
* Get the slab's freelist and do not freeze it.
*
@@ -6162,99 +6127,6 @@ void defer_free_barrier(void)
irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
}
-/*
- * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
- * can perform fastpath freeing without additional function calls.
- *
- * The fastpath is only possible if we are freeing to the current cpu slab
- * of this processor. This typically the case if we have just allocated
- * the item before.
- *
- * If fastpath is not possible then fall back to __slab_free where we deal
- * with all sorts of special processing.
- *
- * Bulk free of a freelist with several objects (all pointing to the
- * same slab) possible by specifying head and tail ptr, plus objects
- * count (cnt). Bulk free indicated by tail pointer being set.
- */
-static __always_inline void do_slab_free(struct kmem_cache *s,
- struct slab *slab, void *head, void *tail,
- int cnt, unsigned long addr)
-{
- /* cnt == 0 signals that it's called from kfree_nolock() */
- bool allow_spin = cnt;
- struct kmem_cache_cpu *c;
- unsigned long tid;
- void **freelist;
-
-redo:
- /*
- * Determine the currently cpus per cpu slab.
- * The cpu may change afterward. However that does not matter since
- * data is retrieved via this pointer. If we are on the same cpu
- * during the cmpxchg then the free will succeed.
- */
- c = raw_cpu_ptr(s->cpu_slab);
- tid = READ_ONCE(c->tid);
-
- /* Same with comment on barrier() in __slab_alloc_node() */
- barrier();
-
- if (unlikely(slab != c->slab)) {
- if (unlikely(!allow_spin)) {
- /*
- * __slab_free() can locklessly cmpxchg16 into a slab,
- * but then it might need to take spin_lock
- * for further processing.
- * Avoid the complexity and simply add to a deferred list.
- */
- defer_free(s, head);
- } else {
- __slab_free(s, slab, head, tail, cnt, addr);
- }
- return;
- }
-
- if (unlikely(!allow_spin)) {
- if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
- local_lock_is_locked(&s->cpu_slab->lock)) {
- defer_free(s, head);
- return;
- }
- cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
- }
-
- if (USE_LOCKLESS_FAST_PATH()) {
- freelist = READ_ONCE(c->freelist);
-
- set_freepointer(s, tail, freelist);
-
- if (unlikely(!__update_cpu_freelist_fast(s, freelist, head, tid))) {
- note_cmpxchg_failure("slab_free", s, tid);
- goto redo;
- }
- } else {
- __maybe_unused unsigned long flags = 0;
-
- /* Update the free list under the local lock */
- local_lock_cpu_slab(s, flags);
- c = this_cpu_ptr(s->cpu_slab);
- if (unlikely(slab != c->slab)) {
- local_unlock_cpu_slab(s, flags);
- goto redo;
- }
- tid = c->tid;
- freelist = c->freelist;
-
- set_freepointer(s, tail, freelist);
- c->freelist = head;
- c->tid = next_tid(tid);
-
- local_unlock_cpu_slab(s, flags);
- }
- stat_add(s, FREE_FASTPATH, cnt);
-}
-
static __fastpath_inline
void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
unsigned long addr)
@@ -6271,7 +6143,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
return;
}
- do_slab_free(s, slab, object, object, 1, addr);
+ __slab_free(s, slab, object, object, 1, addr);
}
#ifdef CONFIG_MEMCG
@@ -6280,7 +6152,7 @@ static noinline
void memcg_alloc_abort_single(struct kmem_cache *s, void *object)
{
if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
- do_slab_free(s, virt_to_slab(object), object, object, 1, _RET_IP_);
+ __slab_free(s, virt_to_slab(object), object, object, 1, _RET_IP_);
}
#endif
@@ -6295,7 +6167,7 @@ void slab_free_bulk(struct kmem_cache *s, struct slab *slab, void *head,
* to remove objects, whose reuse must be delayed.
*/
if (likely(slab_free_freelist_hook(s, &head, &tail, &cnt)))
- do_slab_free(s, slab, head, tail, cnt, addr);
+ __slab_free(s, slab, head, tail, cnt, addr);
}
#ifdef CONFIG_SLUB_RCU_DEBUG
@@ -6321,14 +6193,14 @@ static void slab_free_after_rcu_debug(struct rcu_head *rcu_head)
/* resume freeing */
if (slab_free_hook(s, object, slab_want_init_on_free(s), true))
- do_slab_free(s, slab, object, object, 1, _THIS_IP_);
+ __slab_free(s, slab, object, object, 1, _THIS_IP_);
}
#endif /* CONFIG_SLUB_RCU_DEBUG */
#ifdef CONFIG_KASAN_GENERIC
void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr)
{
- do_slab_free(cache, virt_to_slab(x), x, x, 1, addr);
+ __slab_free(cache, virt_to_slab(x), x, x, 1, addr);
}
#endif
@@ -6528,8 +6400,13 @@ void kfree_nolock(const void *object)
* since kasan quarantine takes locks and not supported from NMI.
*/
kasan_slab_free(s, x, false, false, /* skip quarantine */true);
+ /*
+ * __slab_free() can locklessly cmpxchg16 into a slab, but then it might
+ * need to take spin_lock for further processing.
+ * Avoid the complexity and simply add to a deferred list.
+ */
if (!free_to_pcs(s, x, false))
- do_slab_free(s, slab, x, x, 0, _RET_IP_);
+ defer_free(s, x);
}
EXPORT_SYMBOL_GPL(kfree_nolock);
@@ -6955,7 +6832,7 @@ static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
if (kfence_free(df.freelist))
continue;
- do_slab_free(df.s, df.slab, df.freelist, df.tail, df.cnt,
+ __slab_free(df.s, df.slab, df.freelist, df.tail, df.cnt,
_RET_IP_);
} while (likely(size));
}
@@ -7041,7 +6918,7 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
cnt++;
object = get_freepointer(s, object);
} while (object);
- do_slab_free(s, slab, head, tail, cnt, _RET_IP_);
+ __slab_free(s, slab, head, tail, cnt, _RET_IP_);
}
if (refilled >= max)
--
2.52.0
^ permalink raw reply [flat|nested] 106+ messages in thread
* [PATCH v3 13/21] slab: remove defer_deactivate_slab()
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (11 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 12/21] slab: remove the do_slab_free() fastpath Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-20 5:47 ` Harry Yoo
2026-01-20 9:35 ` Hao Li
2026-01-16 14:40 ` [PATCH v3 14/21] slab: simplify kmalloc_nolock() Vlastimil Babka
` (7 subsequent siblings)
20 siblings, 2 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
There are no more cpu slabs, so we don't need their deferred
deactivation. The function is now only used from places where we
allocate a new slab but then can't spin on the node's list_lock to put
it on the partial list. Instead of deferring, we can free the slab
directly via __free_slab(); we just need to tell it to use the _nolock()
variant for freeing the underlying pages and take care of the accounting.
Since a free_frozen_pages_nolock() variant does not yet exist for code
outside of the page allocator, create it as a trivial wrapper around
__free_frozen_pages(..., FPI_TRYLOCK).
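For illustration, a minimal sketch of the resulting pattern in the
allocation paths (condensed from the hunks below; error handling elided):

	/*
	 * A freshly allocated slab that cannot be put on the partial list
	 * because the list_lock trylock failed is now freed directly.
	 * Nobody has seen the slab yet, so the extra work done by
	 * discard_slab()/free_slab() can be skipped.
	 */
	if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
		free_new_slab_nolock(s, slab);	/* __free_slab(s, slab, false) */
		return NULL;
	}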
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/internal.h | 1 +
mm/page_alloc.c | 5 +++++
mm/slab.h | 8 +-------
mm/slub.c | 56 ++++++++++++++++++++------------------------------------
4 files changed, 27 insertions(+), 43 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index e430da900430..1f44ccb4badf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -846,6 +846,7 @@ static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int ord
struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order);
#define alloc_frozen_pages_nolock(...) \
alloc_hooks(alloc_frozen_pages_nolock_noprof(__VA_ARGS__))
+void free_frozen_pages_nolock(struct page *page, unsigned int order);
extern void zone_pcp_reset(struct zone *zone);
extern void zone_pcp_disable(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c380f063e8b7..0127e9d661ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2981,6 +2981,11 @@ void free_frozen_pages(struct page *page, unsigned int order)
__free_frozen_pages(page, order, FPI_NONE);
}
+void free_frozen_pages_nolock(struct page *page, unsigned int order)
+{
+ __free_frozen_pages(page, order, FPI_TRYLOCK);
+}
+
/*
* Free a batch of folios
*/
diff --git a/mm/slab.h b/mm/slab.h
index e77260720994..4efec41b6445 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -71,13 +71,7 @@ struct slab {
struct kmem_cache *slab_cache;
union {
struct {
- union {
- struct list_head slab_list;
- struct { /* For deferred deactivate_slab() */
- struct llist_node llnode;
- void *flush_freelist;
- };
- };
+ struct list_head slab_list;
/* Double-word boundary */
struct freelist_counters;
};
diff --git a/mm/slub.c b/mm/slub.c
index b08e775dc4cb..33f218c0e8d6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3260,7 +3260,7 @@ static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
}
-static void __free_slab(struct kmem_cache *s, struct slab *slab)
+static void __free_slab(struct kmem_cache *s, struct slab *slab, bool allow_spin)
{
struct page *page = slab_page(slab);
int order = compound_order(page);
@@ -3271,14 +3271,26 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
__ClearPageSlab(page);
mm_account_reclaimed_pages(pages);
unaccount_slab(slab, order, s);
- free_frozen_pages(page, order);
+ if (allow_spin)
+ free_frozen_pages(page, order);
+ else
+ free_frozen_pages_nolock(page, order);
+}
+
+static void free_new_slab_nolock(struct kmem_cache *s, struct slab *slab)
+{
+ /*
+ * Since it was just allocated, we can skip the actions in
+ * discard_slab() and free_slab().
+ */
+ __free_slab(s, slab, false);
}
static void rcu_free_slab(struct rcu_head *h)
{
struct slab *slab = container_of(h, struct slab, rcu_head);
- __free_slab(slab->slab_cache, slab);
+ __free_slab(slab->slab_cache, slab, true);
}
static void free_slab(struct kmem_cache *s, struct slab *slab)
@@ -3294,7 +3306,7 @@ static void free_slab(struct kmem_cache *s, struct slab *slab)
if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU))
call_rcu(&slab->rcu_head, rcu_free_slab);
else
- __free_slab(s, slab);
+ __free_slab(s, slab, true);
}
static void discard_slab(struct kmem_cache *s, struct slab *slab)
@@ -3387,8 +3399,6 @@ static void *alloc_single_from_partial(struct kmem_cache *s,
return object;
}
-static void defer_deactivate_slab(struct slab *slab, void *flush_freelist);
-
/*
* Called only for kmem_cache_debug() caches to allocate from a freshly
* allocated slab. Allocate a single object instead of whole freelist
@@ -3404,8 +3414,8 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
void *object;
if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
- /* Unlucky, discard newly allocated slab */
- defer_deactivate_slab(slab, NULL);
+ /* Unlucky, discard newly allocated slab. */
+ free_new_slab_nolock(s, slab);
return NULL;
}
@@ -4276,7 +4286,7 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
if (!spin_trylock_irqsave(&n->list_lock, flags)) {
/* Unlucky, discard newly allocated slab */
- defer_deactivate_slab(slab, NULL);
+ free_new_slab_nolock(s, slab);
return 0;
}
}
@@ -6033,7 +6043,6 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
struct defer_free {
struct llist_head objects;
- struct llist_head slabs;
struct irq_work work;
};
@@ -6041,7 +6050,6 @@ static void free_deferred_objects(struct irq_work *work);
static DEFINE_PER_CPU(struct defer_free, defer_free_objects) = {
.objects = LLIST_HEAD_INIT(objects),
- .slabs = LLIST_HEAD_INIT(slabs),
.work = IRQ_WORK_INIT(free_deferred_objects),
};
@@ -6054,10 +6062,9 @@ static void free_deferred_objects(struct irq_work *work)
{
struct defer_free *df = container_of(work, struct defer_free, work);
struct llist_head *objs = &df->objects;
- struct llist_head *slabs = &df->slabs;
struct llist_node *llnode, *pos, *t;
- if (llist_empty(objs) && llist_empty(slabs))
+ if (llist_empty(objs))
return;
llnode = llist_del_all(objs);
@@ -6081,16 +6088,6 @@ static void free_deferred_objects(struct irq_work *work)
__slab_free(s, slab, x, x, 1, _THIS_IP_);
}
-
- llnode = llist_del_all(slabs);
- llist_for_each_safe(pos, t, llnode) {
- struct slab *slab = container_of(pos, struct slab, llnode);
-
- if (slab->frozen)
- deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
- else
- free_slab(slab->slab_cache, slab);
- }
}
static void defer_free(struct kmem_cache *s, void *head)
@@ -6106,19 +6103,6 @@ static void defer_free(struct kmem_cache *s, void *head)
irq_work_queue(&df->work);
}
-static void defer_deactivate_slab(struct slab *slab, void *flush_freelist)
-{
- struct defer_free *df;
-
- slab->flush_freelist = flush_freelist;
-
- guard(preempt)();
-
- df = this_cpu_ptr(&defer_free_objects);
- if (llist_add(&slab->llnode, &df->slabs))
- irq_work_queue(&df->work);
-}
-
void defer_free_barrier(void)
{
int cpu;
--
2.52.0
^ permalink raw reply [flat|nested] 106+ messages in thread
* [PATCH v3 14/21] slab: simplify kmalloc_nolock()
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (12 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 13/21] slab: remove defer_deactivate_slab() Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-20 12:06 ` Hao Li
2026-01-22 1:53 ` Harry Yoo
2026-01-16 14:40 ` [PATCH v3 15/21] slab: remove struct kmem_cache_cpu Vlastimil Babka
` (6 subsequent siblings)
20 siblings, 2 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
The kmalloc_nolock() implementation has several complications and
restrictions due to SLUB's cpu slab locking, lockless fastpath and
PREEMPT_RT differences. With cpu slab usage removed, we can simplify
things:
- relax the PREEMPT_RT context checks back to what they were before
commit a4ae75d1b6a2 ("slab: fix kmalloc_nolock() context check for
PREEMPT_RT") and reference the explanatory comment in the page
allocator
- the local_lock_cpu_slab() macros became unused, remove them
- we no longer need to set up lockdep classes on PREEMPT_RT
- we no longer need to annotate ___slab_alloc as NOKPROBE_SYMBOL
since there's no lockless cpu freelist manipulation anymore
- __slab_alloc_node() can be called from kmalloc_nolock_noprof()
unconditionally. It can also no longer return -EBUSY. But trylock
failures can still happen, so retry with a larger bucket if the
allocation fails for any reason.
Note that we still need __CMPXCHG_DOUBLE: while we no longer use
cmpxchg16b on the cpu freelist, we still use it on the slab freelist,
and the alternative is slab_lock(), which can be interrupted by an NMI.
Clarify the comment to mention that specifically.
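For reference, the simplified retry logic then reads roughly as follows
(a sketch condensed from the hunks below):

	ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);

	/*
	 * A failure may be due to trylocks: we preempted someone holding
	 * the sheaves locked while the list_lock is held by another cpu.
	 * It should be rare for multiple kmalloc buckets to be in that
	 * state at once, so retry once with the next larger bucket.
	 */
	if (!ret && can_retry) {
		size = s->object_size + 1;	/* pick the next kmalloc bucket */
		can_retry = false;
		goto retry;
	}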
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slab.h | 1 -
mm/slub.c | 144 +++++++++++++-------------------------------------------------
2 files changed, 29 insertions(+), 116 deletions(-)
diff --git a/mm/slab.h b/mm/slab.h
index 4efec41b6445..e9a0738133ed 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -190,7 +190,6 @@ struct kmem_cache_order_objects {
*/
struct kmem_cache {
struct kmem_cache_cpu __percpu *cpu_slab;
- struct lock_class_key lock_key;
struct slub_percpu_sheaves __percpu *cpu_sheaves;
/* Used for retrieving partial slabs, etc. */
slab_flags_t flags;
diff --git a/mm/slub.c b/mm/slub.c
index 33f218c0e8d6..8746d9d3f3a3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3694,29 +3694,12 @@ static inline unsigned int init_tid(int cpu)
static void init_kmem_cache_cpus(struct kmem_cache *s)
{
-#ifdef CONFIG_PREEMPT_RT
- /*
- * Register lockdep key for non-boot kmem caches to avoid
- * WARN_ON_ONCE(static_obj(key))) in lockdep_register_key()
- */
- bool finegrain_lockdep = !init_section_contains(s, 1);
-#else
- /*
- * Don't bother with different lockdep classes for each
- * kmem_cache, since we only use local_trylock_irqsave().
- */
- bool finegrain_lockdep = false;
-#endif
int cpu;
struct kmem_cache_cpu *c;
- if (finegrain_lockdep)
- lockdep_register_key(&s->lock_key);
for_each_possible_cpu(cpu) {
c = per_cpu_ptr(s->cpu_slab, cpu);
local_trylock_init(&c->lock);
- if (finegrain_lockdep)
- lockdep_set_class(&c->lock, &s->lock_key);
c->tid = init_tid(cpu);
}
}
@@ -3803,47 +3786,6 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
}
}
-/*
- * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
- * can be acquired without a deadlock before invoking the function.
- *
- * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is
- * using local_lock_is_locked() properly before calling local_lock_cpu_slab(),
- * and kmalloc() is not used in an unsupported context.
- *
- * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave().
- * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but
- * lockdep_assert() will catch a bug in case:
- * #1
- * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock()
- * or
- * #2
- * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
- *
- * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
- * disabled context. The lock will always be acquired and if needed it
- * block and sleep until the lock is available.
- * #1 is possible in !PREEMPT_RT only.
- * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
- * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
- * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
- *
- * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
- */
-#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
-#define local_lock_cpu_slab(s, flags) \
- local_lock_irqsave(&(s)->cpu_slab->lock, flags)
-#else
-#define local_lock_cpu_slab(s, flags) \
- do { \
- bool __l = local_trylock_irqsave(&(s)->cpu_slab->lock, flags); \
- lockdep_assert(__l); \
- } while (0)
-#endif
-
-#define local_unlock_cpu_slab(s, flags) \
- local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
-
static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
unsigned long flags;
@@ -4402,20 +4344,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
return freelist;
}
-/*
- * We disallow kprobes in ___slab_alloc() to prevent reentrance
- *
- * kmalloc() -> ___slab_alloc() -> local_lock_cpu_slab() protected part of
- * ___slab_alloc() manipulating c->freelist -> kprobe -> bpf ->
- * kmalloc_nolock() or kfree_nolock() -> __update_cpu_freelist_fast()
- * manipulating c->freelist without lock.
- *
- * This does not prevent kprobe in functions called from ___slab_alloc() such as
- * local_lock_irqsave() itself, and that is fine, we only need to protect the
- * c->freelist manipulation in ___slab_alloc() itself.
- */
-NOKPROBE_SYMBOL(___slab_alloc);
-
static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
{
@@ -5253,13 +5181,13 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
if (unlikely(!size))
return ZERO_SIZE_PTR;
- if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
- /*
- * kmalloc_nolock() in PREEMPT_RT is not supported from
- * non-preemptible context because local_lock becomes a
- * sleeping lock on RT.
- */
+ /*
+ * See the comment for the same check in
+ * alloc_frozen_pages_nolock_noprof()
+ */
+ if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
return NULL;
+
retry:
if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
return NULL;
@@ -5268,10 +5196,11 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
/*
* kmalloc_nolock() is not supported on architectures that
- * don't implement cmpxchg16b, but debug caches don't use
- * per-cpu slab and per-cpu partial slabs. They rely on
- * kmem_cache_node->list_lock, so kmalloc_nolock() can
- * attempt to allocate from debug caches by
+ * don't implement cmpxchg16b and thus need slab_lock()
+ * which could be preempted by a nmi.
+ * But debug caches don't use that and only rely on
+ * kmem_cache_node->list_lock, so kmalloc_nolock() can attempt
+ * to allocate from debug caches by
* spin_trylock_irqsave(&n->list_lock, ...)
*/
return NULL;
@@ -5280,42 +5209,31 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
if (ret)
goto success;
- ret = ERR_PTR(-EBUSY);
-
/*
* Do not call slab_alloc_node(), since trylock mode isn't
* compatible with slab_pre_alloc_hook/should_failslab and
* kfence_alloc. Hence call __slab_alloc_node() (at most twice)
* and slab_post_alloc_hook() directly.
- *
- * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
- * in irq saved region. It assumes that the same cpu will not
- * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
- * Therefore use in_nmi() to check whether particular bucket is in
- * irq protected section.
- *
- * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
- * this cpu was interrupted somewhere inside ___slab_alloc() after
- * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
- * In this case fast path with __update_cpu_freelist_fast() is not safe.
*/
- if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
- ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
+ ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
- if (PTR_ERR(ret) == -EBUSY) {
- if (can_retry) {
- /* pick the next kmalloc bucket */
- size = s->object_size + 1;
- /*
- * Another alternative is to
- * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
- * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
- * to retry from bucket of the same size.
- */
- can_retry = false;
- goto retry;
- }
- ret = NULL;
+ /*
+ * It's possible we failed due to trylock as we preempted someone with
+ * the sheaves locked, and the list_lock is also held by another cpu.
+ * But it should be rare that multiple kmalloc buckets would have
+ * sheaves locked, so try a larger one.
+ */
+ if (!ret && can_retry) {
+ /* pick the next kmalloc bucket */
+ size = s->object_size + 1;
+ /*
+ * Another alternative is to
+ * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
+ * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
+ * to retry from bucket of the same size.
+ */
+ can_retry = false;
+ goto retry;
}
success:
@@ -7334,10 +7252,6 @@ void __kmem_cache_release(struct kmem_cache *s)
cache_random_seq_destroy(s);
if (s->cpu_sheaves)
pcs_destroy(s);
-#ifdef CONFIG_PREEMPT_RT
- if (s->cpu_slab)
- lockdep_unregister_key(&s->lock_key);
-#endif
free_percpu(s->cpu_slab);
free_kmem_cache_nodes(s);
}
--
2.52.0
^ permalink raw reply [flat|nested] 106+ messages in thread
* [PATCH v3 15/21] slab: remove struct kmem_cache_cpu
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (13 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 14/21] slab: simplify kmalloc_nolock() Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-20 12:40 ` Hao Li
2026-01-22 3:10 ` Harry Yoo
2026-01-16 14:40 ` [PATCH v3 16/21] slab: remove unused PREEMPT_RT specific macros Vlastimil Babka
` (5 subsequent siblings)
20 siblings, 2 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
The cpu slab is no longer used for allocation or freeing; the remaining
code is for flushing, but it's effectively dead. Remove the whole
struct kmem_cache_cpu, the flushing code and other orphaned functions.
The only field of kmem_cache_cpu still in use is the stat array with
CONFIG_SLUB_STATS. Put it instead in a new struct kmem_cache_stats.
In struct kmem_cache, the field is called cpu_stats and is placed near
the end of the struct.
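The replacement boils down to the following (taken from the hunks below):

#ifdef CONFIG_SLUB_STATS
struct kmem_cache_stats {
	unsigned int stat[NR_SLUB_STAT_ITEMS];
};
#endif

static inline void stat(const struct kmem_cache *s, enum stat_item si)
{
#ifdef CONFIG_SLUB_STATS
	/*
	 * The rmw is racy on a preemptible kernel but this is acceptable,
	 * and it avoids this_cpu_add()'s irq-disable overhead.
	 */
	raw_cpu_inc(s->cpu_stats->stat[si]);
#endif
}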
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slab.h | 7 +-
mm/slub.c | 298 +++++---------------------------------------------------------
2 files changed, 24 insertions(+), 281 deletions(-)
diff --git a/mm/slab.h b/mm/slab.h
index e9a0738133ed..87faeb6143f2 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -21,14 +21,12 @@
# define system_has_freelist_aba() system_has_cmpxchg128()
# define try_cmpxchg_freelist try_cmpxchg128
# endif
-#define this_cpu_try_cmpxchg_freelist this_cpu_try_cmpxchg128
typedef u128 freelist_full_t;
#else /* CONFIG_64BIT */
# ifdef system_has_cmpxchg64
# define system_has_freelist_aba() system_has_cmpxchg64()
# define try_cmpxchg_freelist try_cmpxchg64
# endif
-#define this_cpu_try_cmpxchg_freelist this_cpu_try_cmpxchg64
typedef u64 freelist_full_t;
#endif /* CONFIG_64BIT */
@@ -189,7 +187,6 @@ struct kmem_cache_order_objects {
* Slab cache management.
*/
struct kmem_cache {
- struct kmem_cache_cpu __percpu *cpu_slab;
struct slub_percpu_sheaves __percpu *cpu_sheaves;
/* Used for retrieving partial slabs, etc. */
slab_flags_t flags;
@@ -238,6 +235,10 @@ struct kmem_cache {
unsigned int usersize; /* Usercopy region size */
#endif
+#ifdef CONFIG_SLUB_STATS
+ struct kmem_cache_stats __percpu *cpu_stats;
+#endif
+
struct kmem_cache_node *node[MAX_NUMNODES];
};
diff --git a/mm/slub.c b/mm/slub.c
index 8746d9d3f3a3..bb72cfa2d7ec 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -400,28 +400,11 @@ enum stat_item {
NR_SLUB_STAT_ITEMS
};
-struct freelist_tid {
- union {
- struct {
- void *freelist; /* Pointer to next available object */
- unsigned long tid; /* Globally unique transaction id */
- };
- freelist_full_t freelist_tid;
- };
-};
-
-/*
- * When changing the layout, make sure freelist and tid are still compatible
- * with this_cpu_cmpxchg_double() alignment requirements.
- */
-struct kmem_cache_cpu {
- struct freelist_tid;
- struct slab *slab; /* The slab from which we are allocating */
- local_trylock_t lock; /* Protects the fields above */
#ifdef CONFIG_SLUB_STATS
+struct kmem_cache_stats {
unsigned int stat[NR_SLUB_STAT_ITEMS];
-#endif
};
+#endif
static inline void stat(const struct kmem_cache *s, enum stat_item si)
{
@@ -430,7 +413,7 @@ static inline void stat(const struct kmem_cache *s, enum stat_item si)
* The rmw is racy on a preemptible kernel but this is acceptable, so
* avoid this_cpu_add()'s irq-disable overhead.
*/
- raw_cpu_inc(s->cpu_slab->stat[si]);
+ raw_cpu_inc(s->cpu_stats->stat[si]);
#endif
}
@@ -438,7 +421,7 @@ static inline
void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
{
#ifdef CONFIG_SLUB_STATS
- raw_cpu_add(s->cpu_slab->stat[si], v);
+ raw_cpu_add(s->cpu_stats->stat[si], v);
#endif
}
@@ -1160,20 +1143,6 @@ static void object_err(struct kmem_cache *s, struct slab *slab,
WARN_ON(1);
}
-static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
- void **freelist, void *nextfree)
-{
- if ((s->flags & SLAB_CONSISTENCY_CHECKS) &&
- !check_valid_pointer(s, slab, nextfree) && freelist) {
- object_err(s, slab, *freelist, "Freechain corrupt");
- *freelist = NULL;
- slab_fix(s, "Isolate corrupted freechain");
- return true;
- }
-
- return false;
-}
-
static void __slab_err(struct slab *slab)
{
if (slab_in_kunit_test())
@@ -1955,11 +1924,6 @@ static inline void inc_slabs_node(struct kmem_cache *s, int node,
int objects) {}
static inline void dec_slabs_node(struct kmem_cache *s, int node,
int objects) {}
-static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
- void **freelist, void *nextfree)
-{
- return false;
-}
#endif /* CONFIG_SLUB_DEBUG */
/*
@@ -3655,191 +3619,6 @@ static void *get_partial(struct kmem_cache *s, int node,
return get_any_partial(s, pc);
}
-#ifdef CONFIG_PREEMPTION
-/*
- * Calculate the next globally unique transaction for disambiguation
- * during cmpxchg. The transactions start with the cpu number and are then
- * incremented by CONFIG_NR_CPUS.
- */
-#define TID_STEP roundup_pow_of_two(CONFIG_NR_CPUS)
-#else
-/*
- * No preemption supported therefore also no need to check for
- * different cpus.
- */
-#define TID_STEP 1
-#endif /* CONFIG_PREEMPTION */
-
-static inline unsigned long next_tid(unsigned long tid)
-{
- return tid + TID_STEP;
-}
-
-#ifdef SLUB_DEBUG_CMPXCHG
-static inline unsigned int tid_to_cpu(unsigned long tid)
-{
- return tid % TID_STEP;
-}
-
-static inline unsigned long tid_to_event(unsigned long tid)
-{
- return tid / TID_STEP;
-}
-#endif
-
-static inline unsigned int init_tid(int cpu)
-{
- return cpu;
-}
-
-static void init_kmem_cache_cpus(struct kmem_cache *s)
-{
- int cpu;
- struct kmem_cache_cpu *c;
-
- for_each_possible_cpu(cpu) {
- c = per_cpu_ptr(s->cpu_slab, cpu);
- local_trylock_init(&c->lock);
- c->tid = init_tid(cpu);
- }
-}
-
-/*
- * Finishes removing the cpu slab. Merges cpu's freelist with slab's freelist,
- * unfreezes the slabs and puts it on the proper list.
- * Assumes the slab has been already safely taken away from kmem_cache_cpu
- * by the caller.
- */
-static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
- void *freelist)
-{
- struct kmem_cache_node *n = get_node(s, slab_nid(slab));
- int free_delta = 0;
- void *nextfree, *freelist_iter, *freelist_tail;
- int tail = DEACTIVATE_TO_HEAD;
- unsigned long flags = 0;
- struct freelist_counters old, new;
-
- if (READ_ONCE(slab->freelist)) {
- stat(s, DEACTIVATE_REMOTE_FREES);
- tail = DEACTIVATE_TO_TAIL;
- }
-
- /*
- * Stage one: Count the objects on cpu's freelist as free_delta and
- * remember the last object in freelist_tail for later splicing.
- */
- freelist_tail = NULL;
- freelist_iter = freelist;
- while (freelist_iter) {
- nextfree = get_freepointer(s, freelist_iter);
-
- /*
- * If 'nextfree' is invalid, it is possible that the object at
- * 'freelist_iter' is already corrupted. So isolate all objects
- * starting at 'freelist_iter' by skipping them.
- */
- if (freelist_corrupted(s, slab, &freelist_iter, nextfree))
- break;
-
- freelist_tail = freelist_iter;
- free_delta++;
-
- freelist_iter = nextfree;
- }
-
- /*
- * Stage two: Unfreeze the slab while splicing the per-cpu
- * freelist to the head of slab's freelist.
- */
- do {
- old.freelist = READ_ONCE(slab->freelist);
- old.counters = READ_ONCE(slab->counters);
- VM_BUG_ON(!old.frozen);
-
- /* Determine target state of the slab */
- new.counters = old.counters;
- new.frozen = 0;
- if (freelist_tail) {
- new.inuse -= free_delta;
- set_freepointer(s, freelist_tail, old.freelist);
- new.freelist = freelist;
- } else {
- new.freelist = old.freelist;
- }
- } while (!slab_update_freelist(s, slab, &old, &new, "unfreezing slab"));
-
- /*
- * Stage three: Manipulate the slab list based on the updated state.
- */
- if (!new.inuse && n->nr_partial >= s->min_partial) {
- stat(s, DEACTIVATE_EMPTY);
- discard_slab(s, slab);
- stat(s, FREE_SLAB);
- } else if (new.freelist) {
- spin_lock_irqsave(&n->list_lock, flags);
- add_partial(n, slab, tail);
- spin_unlock_irqrestore(&n->list_lock, flags);
- stat(s, tail);
- } else {
- stat(s, DEACTIVATE_FULL);
- }
-}
-
-static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
-{
- unsigned long flags;
- struct slab *slab;
- void *freelist;
-
- local_lock_irqsave(&s->cpu_slab->lock, flags);
-
- slab = c->slab;
- freelist = c->freelist;
-
- c->slab = NULL;
- c->freelist = NULL;
- c->tid = next_tid(c->tid);
-
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
-
- if (slab) {
- deactivate_slab(s, slab, freelist);
- stat(s, CPUSLAB_FLUSH);
- }
-}
-
-static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
-{
- struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
- void *freelist = c->freelist;
- struct slab *slab = c->slab;
-
- c->slab = NULL;
- c->freelist = NULL;
- c->tid = next_tid(c->tid);
-
- if (slab) {
- deactivate_slab(s, slab, freelist);
- stat(s, CPUSLAB_FLUSH);
- }
-}
-
-static inline void flush_this_cpu_slab(struct kmem_cache *s)
-{
- struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
-
- if (c->slab)
- flush_slab(s, c);
-}
-
-static bool has_cpu_slab(int cpu, struct kmem_cache *s)
-{
- struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
-
- return c->slab;
-}
-
static bool has_pcs_used(int cpu, struct kmem_cache *s)
{
struct slub_percpu_sheaves *pcs;
@@ -3853,7 +3632,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
}
/*
- * Flush cpu slab.
+ * Flush percpu sheaves
*
* Called from CPU work handler with migration disabled.
*/
@@ -3868,8 +3647,6 @@ static void flush_cpu_slab(struct work_struct *w)
if (cache_has_sheaves(s))
pcs_flush_all(s);
-
- flush_this_cpu_slab(s);
}
static void flush_all_cpus_locked(struct kmem_cache *s)
@@ -3882,7 +3659,7 @@ static void flush_all_cpus_locked(struct kmem_cache *s)
for_each_online_cpu(cpu) {
sfw = &per_cpu(slub_flush, cpu);
- if (!has_cpu_slab(cpu, s) && !has_pcs_used(cpu, s)) {
+ if (!has_pcs_used(cpu, s)) {
sfw->skip = true;
continue;
}
@@ -3992,7 +3769,6 @@ static int slub_cpu_dead(unsigned int cpu)
mutex_lock(&slab_mutex);
list_for_each_entry(s, &slab_caches, list) {
- __flush_cpu_slab(s, cpu);
if (cache_has_sheaves(s))
__pcs_flush_all_cpu(s, cpu);
}
@@ -7121,26 +6897,21 @@ init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
barn_init(barn);
}
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
+#ifdef CONFIG_SLUB_STATS
+static inline int alloc_kmem_cache_stats(struct kmem_cache *s)
{
BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
NR_KMALLOC_TYPES * KMALLOC_SHIFT_HIGH *
- sizeof(struct kmem_cache_cpu));
+ sizeof(struct kmem_cache_stats));
- /*
- * Must align to double word boundary for the double cmpxchg
- * instructions to work; see __pcpu_double_call_return_bool().
- */
- s->cpu_slab = __alloc_percpu(sizeof(struct kmem_cache_cpu),
- 2 * sizeof(void *));
+ s->cpu_stats = alloc_percpu(struct kmem_cache_stats);
- if (!s->cpu_slab)
+ if (!s->cpu_stats)
return 0;
- init_kmem_cache_cpus(s);
-
return 1;
}
+#endif
static int init_percpu_sheaves(struct kmem_cache *s)
{
@@ -7252,7 +7023,9 @@ void __kmem_cache_release(struct kmem_cache *s)
cache_random_seq_destroy(s);
if (s->cpu_sheaves)
pcs_destroy(s);
- free_percpu(s->cpu_slab);
+#ifdef CONFIG_SLUB_STATS
+ free_percpu(s->cpu_stats);
+#endif
free_kmem_cache_nodes(s);
}
@@ -7944,12 +7717,6 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
memcpy(s, static_cache, kmem_cache->object_size);
- /*
- * This runs very early, and only the boot processor is supposed to be
- * up. Even if it weren't true, IRQs are not up so we couldn't fire
- * IPIs around.
- */
- __flush_cpu_slab(s, smp_processor_id());
for_each_kmem_cache_node(s, node, n) {
struct slab *p;
@@ -8164,8 +7931,10 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
if (!init_kmem_cache_nodes(s))
goto out;
- if (!alloc_kmem_cache_cpus(s))
+#ifdef CONFIG_SLUB_STATS
+ if (!alloc_kmem_cache_stats(s))
goto out;
+#endif
err = init_percpu_sheaves(s);
if (err)
@@ -8484,33 +8253,6 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
if (!nodes)
return -ENOMEM;
- if (flags & SO_CPU) {
- int cpu;
-
- for_each_possible_cpu(cpu) {
- struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab,
- cpu);
- int node;
- struct slab *slab;
-
- slab = READ_ONCE(c->slab);
- if (!slab)
- continue;
-
- node = slab_nid(slab);
- if (flags & SO_TOTAL)
- x = slab->objects;
- else if (flags & SO_OBJECTS)
- x = slab->inuse;
- else
- x = 1;
-
- total += x;
- nodes[node] += x;
-
- }
- }
-
/*
* It is impossible to take "mem_hotplug_lock" here with "kernfs_mutex"
* already held which will conflict with an existing lock order:
@@ -8881,7 +8623,7 @@ static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
return -ENOMEM;
for_each_online_cpu(cpu) {
- unsigned x = per_cpu_ptr(s->cpu_slab, cpu)->stat[si];
+ unsigned int x = per_cpu_ptr(s->cpu_stats, cpu)->stat[si];
data[cpu] = x;
sum += x;
@@ -8907,7 +8649,7 @@ static void clear_stat(struct kmem_cache *s, enum stat_item si)
int cpu;
for_each_online_cpu(cpu)
- per_cpu_ptr(s->cpu_slab, cpu)->stat[si] = 0;
+ per_cpu_ptr(s->cpu_stats, cpu)->stat[si] = 0;
}
#define STAT_ATTR(si, text) \
--
2.52.0
^ permalink raw reply [flat|nested] 106+ messages in thread
* [PATCH v3 16/21] slab: remove unused PREEMPT_RT specific macros
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (14 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 15/21] slab: remove struct kmem_cache_cpu Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-21 6:42 ` Hao Li
2026-01-22 3:50 ` Harry Yoo
2026-01-16 14:40 ` [PATCH v3 17/21] slab: refill sheaves from all nodes Vlastimil Babka
` (4 subsequent siblings)
20 siblings, 2 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
The macros slub_get_cpu_ptr()/slub_put_cpu_ptr() are now unused, so
remove them. USE_LOCKLESS_FAST_PATH() has lost its true meaning with the
code being removed. Its only remaining use is in fact testing whether we
can assert that irqs are disabled, because spin_lock_irqsave() only
disables them on !RT. Test for CONFIG_PREEMPT_RT instead.
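The remaining check in __slab_update_freelist() then reads (from the
hunk below):

	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		lockdep_assert_irqs_disabled();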
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slub.c | 24 +-----------------------
1 file changed, 1 insertion(+), 23 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index bb72cfa2d7ec..d52de6e3c2d5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -201,28 +201,6 @@ enum slab_flags {
SL_pfmemalloc = PG_active, /* Historical reasons for this bit */
};
-/*
- * We could simply use migrate_disable()/enable() but as long as it's a
- * function call even on !PREEMPT_RT, use inline preempt_disable() there.
- */
-#ifndef CONFIG_PREEMPT_RT
-#define slub_get_cpu_ptr(var) get_cpu_ptr(var)
-#define slub_put_cpu_ptr(var) put_cpu_ptr(var)
-#define USE_LOCKLESS_FAST_PATH() (true)
-#else
-#define slub_get_cpu_ptr(var) \
-({ \
- migrate_disable(); \
- this_cpu_ptr(var); \
-})
-#define slub_put_cpu_ptr(var) \
-do { \
- (void)(var); \
- migrate_enable(); \
-} while (0)
-#define USE_LOCKLESS_FAST_PATH() (false)
-#endif
-
#ifndef CONFIG_SLUB_TINY
#define __fastpath_inline __always_inline
#else
@@ -719,7 +697,7 @@ static inline bool __slab_update_freelist(struct kmem_cache *s, struct slab *sla
{
bool ret;
- if (USE_LOCKLESS_FAST_PATH())
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT))
lockdep_assert_irqs_disabled();
if (s->flags & __CMPXCHG_DOUBLE)
--
2.52.0
^ permalink raw reply [flat|nested] 106+ messages in thread
* [PATCH v3 17/21] slab: refill sheaves from all nodes
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (15 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 16/21] slab: remove unused PREEMPT_RT specific macros Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-21 18:30 ` Suren Baghdasaryan
` (3 more replies)
2026-01-16 14:40 ` [PATCH v3 18/21] slab: update overview comments Vlastimil Babka
` (3 subsequent siblings)
20 siblings, 4 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
__refill_objects() currently only attempts to get partial slabs from the
local node and then allocates new slab(s). Expand it to also try other
nodes while observing the remote node defrag ratio, similarly to
get_any_partial().
This will prevent allocating new slabs on a node while other nodes have
many free slabs. It does mean sheaves will contain non-local objects in
that case. Allocations that care about a specific node will still be
served appropriately, but might get a slowpath allocation.
Like get_any_partial(), we do observe cpuset_zone_allowed(), although we
might be refilling a sheaf that will then be used from a different
allocation context.
We can also use the resulting refill_objects() in
__kmem_cache_alloc_bulk() for non-debug caches. This means
kmem_cache_alloc_bulk() will get better performance when sheaves are
exhausted. kmem_cache_alloc_bulk() cannot indicate a preferred node, so
it's compatible with the sheaf refill preferring the local node.
Its users also pass gfp flags that allow spinning, so document that
as a requirement.
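The resulting refill order can be summarized as follows (a sketch of
refill_objects() condensed from the hunks below; the new-slab loop and
error handling are trimmed):

	/* 1) partial slabs from the local node */
	refilled = __refill_objects_node(s, p, gfp, min, max,
					 get_node(s, local_node));
	if (refilled >= min)
		return refilled;

	/* 2) other nodes, gated by remote_node_defrag_ratio and cpusets */
	refilled += __refill_objects_any(s, p + refilled, gfp, min - refilled,
					 max - refilled, local_node);
	if (refilled >= min)
		return refilled;

	/* 3) otherwise allocate new slab(s) on the local node */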
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slub.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 106 insertions(+), 31 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index d52de6e3c2d5..2c522d2bf547 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2518,8 +2518,8 @@ static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
}
static unsigned int
-__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
- unsigned int max);
+refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max);
static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
gfp_t gfp)
@@ -2530,8 +2530,8 @@ static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
if (!to_fill)
return 0;
- filled = __refill_objects(s, &sheaf->objects[sheaf->size], gfp,
- to_fill, to_fill);
+ filled = refill_objects(s, &sheaf->objects[sheaf->size], gfp, to_fill,
+ to_fill);
sheaf->size += filled;
@@ -6522,29 +6522,22 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
EXPORT_SYMBOL(kmem_cache_free_bulk);
static unsigned int
-__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
- unsigned int max)
+__refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max, struct kmem_cache_node *n)
{
struct slab *slab, *slab2;
struct partial_context pc;
unsigned int refilled = 0;
unsigned long flags;
void *object;
- int node;
pc.flags = gfp;
pc.min_objects = min;
pc.max_objects = max;
- node = numa_mem_id();
-
- if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
+ if (!get_partial_node_bulk(s, n, &pc))
return 0;
- /* TODO: consider also other nodes? */
- if (!get_partial_node_bulk(s, get_node(s, node), &pc))
- goto new_slab;
-
list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
list_del(&slab->slab_list);
@@ -6582,8 +6575,6 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
}
if (unlikely(!list_empty(&pc.slabs))) {
- struct kmem_cache_node *n = get_node(s, node);
-
spin_lock_irqsave(&n->list_lock, flags);
list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
@@ -6605,13 +6596,92 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
}
}
+ return refilled;
+}
- if (likely(refilled >= min))
- goto out;
+#ifdef CONFIG_NUMA
+static unsigned int
+__refill_objects_any(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max, int local_node)
+{
+ struct zonelist *zonelist;
+ struct zoneref *z;
+ struct zone *zone;
+ enum zone_type highest_zoneidx = gfp_zone(gfp);
+ unsigned int cpuset_mems_cookie;
+ unsigned int refilled = 0;
+
+ /* see get_any_partial() for the defrag ratio description */
+ if (!s->remote_node_defrag_ratio ||
+ get_cycles() % 1024 > s->remote_node_defrag_ratio)
+ return 0;
+
+ do {
+ cpuset_mems_cookie = read_mems_allowed_begin();
+ zonelist = node_zonelist(mempolicy_slab_node(), gfp);
+ for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
+ struct kmem_cache_node *n;
+ unsigned int r;
+
+ n = get_node(s, zone_to_nid(zone));
+
+ if (!n || !cpuset_zone_allowed(zone, gfp) ||
+ n->nr_partial <= s->min_partial)
+ continue;
+
+ r = __refill_objects_node(s, p, gfp, min, max, n);
+ refilled += r;
+
+ if (r >= min) {
+ /*
+ * Don't check read_mems_allowed_retry() here -
+ * if mems_allowed was updated in parallel, that
+ * was a harmless race between allocation and
+ * the cpuset update
+ */
+ return refilled;
+ }
+ p += r;
+ min -= r;
+ max -= r;
+ }
+ } while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+ return refilled;
+}
+#else
+static inline unsigned int
+__refill_objects_any(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max, int local_node)
+{
+ return 0;
+}
+#endif
+
+static unsigned int
+refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max)
+{
+ int local_node = numa_mem_id();
+ unsigned int refilled;
+ struct slab *slab;
+
+ if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
+ return 0;
+
+ refilled = __refill_objects_node(s, p, gfp, min, max,
+ get_node(s, local_node));
+ if (refilled >= min)
+ return refilled;
+
+ refilled += __refill_objects_any(s, p + refilled, gfp, min - refilled,
+ max - refilled, local_node);
+ if (refilled >= min)
+ return refilled;
new_slab:
- slab = new_slab(s, pc.flags, node);
+ slab = new_slab(s, gfp, local_node);
if (!slab)
goto out;
@@ -6626,8 +6696,8 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
if (refilled < min)
goto new_slab;
-out:
+out:
return refilled;
}
@@ -6637,18 +6707,20 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
{
int i;
- /*
- * TODO: this might be more efficient (if necessary) by reusing
- * __refill_objects()
- */
- for (i = 0; i < size; i++) {
+ if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
+ for (i = 0; i < size; i++) {
- p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
- s->object_size);
- if (unlikely(!p[i]))
- goto error;
+ p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
+ s->object_size);
+ if (unlikely(!p[i]))
+ goto error;
- maybe_wipe_obj_freeptr(s, p[i]);
+ maybe_wipe_obj_freeptr(s, p[i]);
+ }
+ } else {
+ i = refill_objects(s, p, flags, size, size);
+ if (i < size)
+ goto error;
}
return i;
@@ -6659,7 +6731,10 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
}
-/* Note that interrupts must be enabled when calling this function. */
+/*
+ * Note that interrupts must be enabled when calling this function and gfp
+ * flags must allow spinning.
+ */
int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
{
--
2.52.0
^ permalink raw reply [flat|nested] 106+ messages in thread
* [PATCH v3 18/21] slab: update overview comments
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (16 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 17/21] slab: refill sheaves from all nodes Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-21 20:58 ` Suren Baghdasaryan
` (2 more replies)
2026-01-16 14:40 ` [PATCH v3 19/21] slab: remove frozen slab checks from __slab_free() Vlastimil Babka
` (2 subsequent siblings)
20 siblings, 3 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
The changes related to sheaves made the description of locking and other
details outdated. Update it to reflect the current state.
Also add a new copyright line due to the major changes.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slub.c | 141 +++++++++++++++++++++++++++++---------------------------------
1 file changed, 67 insertions(+), 74 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 2c522d2bf547..476a279f1a94 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1,13 +1,15 @@
// SPDX-License-Identifier: GPL-2.0
/*
- * SLUB: A slab allocator that limits cache line use instead of queuing
- * objects in per cpu and per node lists.
+ * SLUB: A slab allocator with low overhead percpu array caches and mostly
+ * lockless freeing of objects to slabs in the slowpath.
*
- * The allocator synchronizes using per slab locks or atomic operations
- * and only uses a centralized lock to manage a pool of partial slabs.
+ * The allocator synchronizes using spin_trylock for percpu arrays in the
+ * fastpath, and cmpxchg_double (or bit spinlock) for slowpath freeing.
+ * Uses a centralized lock to manage a pool of partial slabs.
*
* (C) 2007 SGI, Christoph Lameter
* (C) 2011 Linux Foundation, Christoph Lameter
+ * (C) 2025 SUSE, Vlastimil Babka
*/
#include <linux/mm.h>
@@ -53,11 +55,13 @@
/*
* Lock order:
- * 1. slab_mutex (Global Mutex)
- * 2. node->list_lock (Spinlock)
- * 3. kmem_cache->cpu_slab->lock (Local lock)
- * 4. slab_lock(slab) (Only on some arches)
- * 5. object_map_lock (Only for debugging)
+ * 0. cpu_hotplug_lock
+ * 1. slab_mutex (Global Mutex)
+ * 2a. kmem_cache->cpu_sheaves->lock (Local trylock)
+ * 2b. node->barn->lock (Spinlock)
+ * 2c. node->list_lock (Spinlock)
+ * 3. slab_lock(slab) (Only on some arches)
+ * 4. object_map_lock (Only for debugging)
*
* slab_mutex
*
@@ -78,31 +82,38 @@
* C. slab->objects -> Number of objects in slab
* D. slab->frozen -> frozen state
*
- * Frozen slabs
+ * SL_partial slabs
+ *
+ * Slabs on node partial list have at least one free object. A limited number
+ * of slabs on the list can be fully free (slab->inuse == 0), until we start
+ * discarding them. These slabs are marked with SL_partial, and the flag is
+ * cleared while removing them, usually to grab their freelist afterwards.
+ * This clearing also exempts them from list management. Please see
+ * __slab_free() for more details.
*
- * If a slab is frozen then it is exempt from list management. It is
- * the cpu slab which is actively allocated from by the processor that
- * froze it and it is not on any list. The processor that froze the
- * slab is the one who can perform list operations on the slab. Other
- * processors may put objects onto the freelist but the processor that
- * froze the slab is the only one that can retrieve the objects from the
- * slab's freelist.
+ * Full slabs
*
- * CPU partial slabs
+ * For caches without debugging enabled, full slabs (slab->inuse ==
+ * slab->objects and slab->freelist == NULL) are not placed on any list.
+ * The __slab_free() freeing the first object from such a slab will place
+ * it on the partial list. Caches with debugging enabled place such slab
+ * on the full list and use different allocation and freeing paths.
+ *
+ * Frozen slabs
*
- * The partially empty slabs cached on the CPU partial list are used
- * for performance reasons, which speeds up the allocation process.
- * These slabs are not frozen, but are also exempt from list management,
- * by clearing the SL_partial flag when moving out of the node
- * partial list. Please see __slab_free() for more details.
+ * If a slab is frozen then it is exempt from list management. It is used to
+ * indicate a slab that has failed consistency checks and thus cannot be
+ * allocated from anymore - it is also marked as full. Any previously
+ * allocated objects will be simply leaked upon freeing instead of attempting
+ * to modify the potentially corrupted freelist and metadata.
*
* To sum up, the current scheme is:
- * - node partial slab: SL_partial && !frozen
- * - cpu partial slab: !SL_partial && !frozen
- * - cpu slab: !SL_partial && frozen
- * - full slab: !SL_partial && !frozen
+ * - node partial slab: SL_partial && !full && !frozen
+ * - taken off partial list: !SL_partial && !full && !frozen
+ * - full slab, not on any list: !SL_partial && full && !frozen
+ * - frozen due to inconsistency: !SL_partial && full && frozen
*
- * list_lock
+ * node->list_lock (spinlock)
*
* The list_lock protects the partial and full list on each node and
* the partial slab counter. If taken then no new slabs may be added or
@@ -112,47 +123,46 @@
*
* The list_lock is a centralized lock and thus we avoid taking it as
* much as possible. As long as SLUB does not have to handle partial
- * slabs, operations can continue without any centralized lock. F.e.
- * allocating a long series of objects that fill up slabs does not require
- * the list lock.
+ * slabs, operations can continue without any centralized lock.
*
* For debug caches, all allocations are forced to go through a list_lock
* protected region to serialize against concurrent validation.
*
- * cpu_slab->lock local lock
+ * cpu_sheaves->lock (local_trylock)
*
- * This locks protect slowpath manipulation of all kmem_cache_cpu fields
- * except the stat counters. This is a percpu structure manipulated only by
- * the local cpu, so the lock protects against being preempted or interrupted
- * by an irq. Fast path operations rely on lockless operations instead.
+ * This lock protects fastpath operations on the percpu sheaves. On !RT it
+ * only disables preemption and does no atomic operations. As long as the main
+ * or spare sheaf can handle the allocation or free, there is no other
+ * overhead.
*
- * On PREEMPT_RT, the local lock neither disables interrupts nor preemption
- * which means the lockless fastpath cannot be used as it might interfere with
- * an in-progress slow path operations. In this case the local lock is always
- * taken but it still utilizes the freelist for the common operations.
+ * node->barn->lock (spinlock)
*
- * lockless fastpaths
+ * This lock protects the operations on per-NUMA-node barn. It can quickly
+ * serve an empty or full sheaf if available, and avoid more expensive refill
+ * or flush operation.
*
- * The fast path allocation (slab_alloc_node()) and freeing (do_slab_free())
- * are fully lockless when satisfied from the percpu slab (and when
- * cmpxchg_double is possible to use, otherwise slab_lock is taken).
- * They also don't disable preemption or migration or irqs. They rely on
- * the transaction id (tid) field to detect being preempted or moved to
- * another cpu.
+ * Lockless freeing
+ *
+ * Objects may have to be freed to their slabs when they are from a remote
+ * node (where we want to avoid filling local sheaves with remote objects)
+ * or when there are too many full sheaves. On architectures supporting
+ * cmpxchg_double this is done by a lockless update of slab's freelist and
+ * counters, otherwise slab_lock is taken. This only needs to take the
+ * list_lock if it's a first free to a full slab, or when there are too many
+ * fully free slabs and some need to be discarded.
*
* irq, preemption, migration considerations
*
- * Interrupts are disabled as part of list_lock or local_lock operations, or
+ * Interrupts are disabled as part of list_lock or barn lock operations, or
* around the slab_lock operation, in order to make the slab allocator safe
* to use in the context of an irq.
+ * Preemption is disabled as part of local_trylock operations.
+ * kmalloc_nolock() and kfree_nolock() are safe in NMI context but see
+ * their limitations.
*
- * In addition, preemption (or migration on PREEMPT_RT) is disabled in the
- * allocation slowpath, bulk allocation, and put_cpu_partial(), so that the
- * local cpu doesn't change in the process and e.g. the kmem_cache_cpu pointer
- * doesn't have to be revalidated in each section protected by the local lock.
- *
- * SLUB assigns one slab for allocation to each processor.
- * Allocations only occur from these slabs called cpu slabs.
+ * SLUB assigns two object arrays called sheaves for caching allocation and
+ * frees on each cpu, with a NUMA node shared barn for balancing between cpus.
+ * Allocations and frees are primarily served from these sheaves.
*
* Slabs with free elements are kept on a partial list and during regular
* operations no list for full slabs is used. If an object in a full slab is
@@ -160,25 +170,8 @@
* We track full slabs for debugging purposes though because otherwise we
* cannot scan all objects.
*
- * Slabs are freed when they become empty. Teardown and setup is
- * minimal so we rely on the page allocators per cpu caches for
- * fast frees and allocs.
- *
- * slab->frozen The slab is frozen and exempt from list processing.
- * This means that the slab is dedicated to a purpose
- * such as satisfying allocations for a specific
- * processor. Objects may be freed in the slab while
- * it is frozen but slab_free will then skip the usual
- * list operations. It is up to the processor holding
- * the slab to integrate the slab into the slab lists
- * when the slab is no longer needed.
- *
- * One use of this flag is to mark slabs that are
- * used for allocations. Then such a slab becomes a cpu
- * slab. The cpu slab may be equipped with an additional
- * freelist that allows lockless access to
- * free objects in addition to the regular freelist
- * that requires the slab lock.
+ * Slabs are freed when they become empty. Teardown and setup is minimal so we
+ * rely on the page allocators per cpu caches for fast frees and allocs.
*
* SLAB_DEBUG_FLAGS Slab requires special handling due to debug
* options set. This moves slab handling out of
--
2.52.0
^ permalink raw reply [flat|nested] 106+ messages in thread
* [PATCH v3 19/21] slab: remove frozen slab checks from __slab_free()
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (17 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 18/21] slab: update overview comments Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-22 0:54 ` Suren Baghdasaryan
2026-01-22 5:01 ` Hao Li
2026-01-16 14:40 ` [PATCH v3 20/21] mm/slub: remove DEACTIVATE_TO_* stat items Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 21/21] mm/slub: cleanup and repurpose some " Vlastimil Babka
20 siblings, 2 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
Currently, slabs are only frozen after consistency checks have failed.
This can happen only in caches with debugging enabled, and those use
free_to_partial_list() for freeing. The non-debug operation of
__slab_free() can thus stop considering the frozen field, and we can
remove the FREE_FROZEN stat.
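The check deciding whether list manipulation is needed then becomes
(from the hunk below):

	/*
	 * List manipulation (and thus the list_lock) is only needed if the
	 * slab may become fully free, or if it was full and now goes back
	 * on the partial list. The frozen (failed consistency check) case
	 * is handled by the debug freeing path and never reaches here.
	 */
	if (!new.inuse || was_full)
		n = get_node(s, slab_nid(slab));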
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slub.c | 22 ++++------------------
1 file changed, 4 insertions(+), 18 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 476a279f1a94..7ec7049c0ca5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -333,7 +333,6 @@ enum stat_item {
FREE_RCU_SHEAF_FAIL, /* Failed to free to a rcu_free sheaf */
FREE_FASTPATH, /* Free to cpu slab */
FREE_SLOWPATH, /* Freeing not to cpu slab */
- FREE_FROZEN, /* Freeing to frozen slab */
FREE_ADD_PARTIAL, /* Freeing moves slab to partial list */
FREE_REMOVE_PARTIAL, /* Freeing removes last object */
ALLOC_FROM_PARTIAL, /* Cpu slab acquired from node partial list */
@@ -5103,7 +5102,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
unsigned long addr)
{
- bool was_frozen, was_full;
+ bool was_full;
struct freelist_counters old, new;
struct kmem_cache_node *n = NULL;
unsigned long flags;
@@ -5126,7 +5125,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
old.counters = slab->counters;
was_full = (old.freelist == NULL);
- was_frozen = old.frozen;
set_freepointer(s, tail, old.freelist);
@@ -5139,7 +5137,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
* to (due to not being full anymore) the partial list.
* Unless it's frozen.
*/
- if ((!new.inuse || was_full) && !was_frozen) {
+ if (!new.inuse || was_full) {
n = get_node(s, slab_nid(slab));
/*
@@ -5158,20 +5156,10 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
} while (!slab_update_freelist(s, slab, &old, &new, "__slab_free"));
if (likely(!n)) {
-
- if (likely(was_frozen)) {
- /*
- * The list lock was not taken therefore no list
- * activity can be necessary.
- */
- stat(s, FREE_FROZEN);
- }
-
/*
- * In other cases we didn't take the list_lock because the slab
- * was already on the partial list and will remain there.
+ * We didn't take the list_lock because the slab was already on
+ * the partial list and will remain there.
*/
-
return;
}
@@ -8721,7 +8709,6 @@ STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
STAT_ATTR(FREE_FASTPATH, free_fastpath);
STAT_ATTR(FREE_SLOWPATH, free_slowpath);
-STAT_ATTR(FREE_FROZEN, free_frozen);
STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
STAT_ATTR(FREE_REMOVE_PARTIAL, free_remove_partial);
STAT_ATTR(ALLOC_FROM_PARTIAL, alloc_from_partial);
@@ -8826,7 +8813,6 @@ static struct attribute *slab_attrs[] = {
&free_rcu_sheaf_fail_attr.attr,
&free_fastpath_attr.attr,
&free_slowpath_attr.attr,
- &free_frozen_attr.attr,
&free_add_partial_attr.attr,
&free_remove_partial_attr.attr,
&alloc_from_partial_attr.attr,
--
2.52.0
^ permalink raw reply [flat|nested] 106+ messages in thread
* [PATCH v3 20/21] mm/slub: remove DEACTIVATE_TO_* stat items
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (18 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 19/21] slab: remove frozen slab checks from __slab_free() Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-22 0:58 ` Suren Baghdasaryan
2026-01-22 5:17 ` Hao Li
2026-01-16 14:40 ` [PATCH v3 21/21] mm/slub: cleanup and repurpose some " Vlastimil Babka
20 siblings, 2 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
The cpu slabs and their deactivations were removed, so remove the unused
stat items. Weirdly enough, the stat item values were also used to
control whether __add_partial() adds to the head or the tail of the
list, so replace that with a new enum add_mode, which is cleaner.
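The new enum and a typical caller look like this (from the hunks below):

enum add_mode {
	ADD_TO_HEAD,
	ADD_TO_TAIL,
};

	/* was add_partial(n, slab, DEACTIVATE_TO_HEAD) */
	add_partial(n, slab, ADD_TO_HEAD);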
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slub.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 7ec7049c0ca5..c12e90cb2fca 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -324,6 +324,11 @@ static void debugfs_slab_add(struct kmem_cache *);
static inline void debugfs_slab_add(struct kmem_cache *s) { }
#endif
+enum add_mode {
+ ADD_TO_HEAD,
+ ADD_TO_TAIL,
+};
+
enum stat_item {
ALLOC_PCS, /* Allocation from percpu sheaf */
ALLOC_FASTPATH, /* Allocation from cpu slab */
@@ -343,8 +348,6 @@ enum stat_item {
CPUSLAB_FLUSH, /* Abandoning of the cpu slab */
DEACTIVATE_FULL, /* Cpu slab was full when deactivated */
DEACTIVATE_EMPTY, /* Cpu slab was empty when deactivated */
- DEACTIVATE_TO_HEAD, /* Cpu slab was moved to the head of partials */
- DEACTIVATE_TO_TAIL, /* Cpu slab was moved to the tail of partials */
DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
DEACTIVATE_BYPASS, /* Implicit deactivation */
ORDER_FALLBACK, /* Number of times fallback was necessary */
@@ -3268,10 +3271,10 @@ static inline void slab_clear_node_partial(struct slab *slab)
* Management of partially allocated slabs.
*/
static inline void
-__add_partial(struct kmem_cache_node *n, struct slab *slab, int tail)
+__add_partial(struct kmem_cache_node *n, struct slab *slab, enum add_mode mode)
{
n->nr_partial++;
- if (tail == DEACTIVATE_TO_TAIL)
+ if (mode == ADD_TO_TAIL)
list_add_tail(&slab->slab_list, &n->partial);
else
list_add(&slab->slab_list, &n->partial);
@@ -3279,10 +3282,10 @@ __add_partial(struct kmem_cache_node *n, struct slab *slab, int tail)
}
static inline void add_partial(struct kmem_cache_node *n,
- struct slab *slab, int tail)
+ struct slab *slab, enum add_mode mode)
{
lockdep_assert_held(&n->list_lock);
- __add_partial(n, slab, tail);
+ __add_partial(n, slab, mode);
}
static inline void remove_partial(struct kmem_cache_node *n,
@@ -3375,7 +3378,7 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
if (slab->inuse == slab->objects)
add_full(s, n, slab);
else
- add_partial(n, slab, DEACTIVATE_TO_HEAD);
+ add_partial(n, slab, ADD_TO_HEAD);
inc_slabs_node(s, nid, slab->objects);
spin_unlock_irqrestore(&n->list_lock, flags);
@@ -3996,7 +3999,7 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
n = get_node(s, slab_nid(slab));
spin_lock_irqsave(&n->list_lock, flags);
}
- add_partial(n, slab, DEACTIVATE_TO_HEAD);
+ add_partial(n, slab, ADD_TO_HEAD);
spin_unlock_irqrestore(&n->list_lock, flags);
}
@@ -5064,7 +5067,7 @@ static noinline void free_to_partial_list(
/* was on full list */
remove_full(s, n, slab);
if (!slab_free) {
- add_partial(n, slab, DEACTIVATE_TO_TAIL);
+ add_partial(n, slab, ADD_TO_TAIL);
stat(s, FREE_ADD_PARTIAL);
}
} else if (slab_free) {
@@ -5184,7 +5187,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
* then add it.
*/
if (unlikely(was_full)) {
- add_partial(n, slab, DEACTIVATE_TO_TAIL);
+ add_partial(n, slab, ADD_TO_TAIL);
stat(s, FREE_ADD_PARTIAL);
}
spin_unlock_irqrestore(&n->list_lock, flags);
@@ -6564,7 +6567,7 @@ __refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int mi
continue;
list_del(&slab->slab_list);
- add_partial(n, slab, DEACTIVATE_TO_HEAD);
+ add_partial(n, slab, ADD_TO_HEAD);
}
spin_unlock_irqrestore(&n->list_lock, flags);
@@ -7031,7 +7034,7 @@ static void early_kmem_cache_node_alloc(int node)
* No locks need to be taken here as it has just been
* initialized and there is no concurrent access.
*/
- __add_partial(n, slab, DEACTIVATE_TO_HEAD);
+ __add_partial(n, slab, ADD_TO_HEAD);
}
static void free_kmem_cache_nodes(struct kmem_cache *s)
@@ -8719,8 +8722,6 @@ STAT_ATTR(FREE_SLAB, free_slab);
STAT_ATTR(CPUSLAB_FLUSH, cpuslab_flush);
STAT_ATTR(DEACTIVATE_FULL, deactivate_full);
STAT_ATTR(DEACTIVATE_EMPTY, deactivate_empty);
-STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate_to_head);
-STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail);
STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
STAT_ATTR(DEACTIVATE_BYPASS, deactivate_bypass);
STAT_ATTR(ORDER_FALLBACK, order_fallback);
@@ -8823,8 +8824,6 @@ static struct attribute *slab_attrs[] = {
&cpuslab_flush_attr.attr,
&deactivate_full_attr.attr,
&deactivate_empty_attr.attr,
- &deactivate_to_head_attr.attr,
- &deactivate_to_tail_attr.attr,
&deactivate_remote_frees_attr.attr,
&deactivate_bypass_attr.attr,
&order_fallback_attr.attr,
--
2.52.0
* [PATCH v3 21/21] mm/slub: cleanup and repurpose some stat items
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
` (19 preceding siblings ...)
2026-01-16 14:40 ` [PATCH v3 20/21] mm/slub: remove DEACTIVATE_TO_* stat items Vlastimil Babka
@ 2026-01-16 14:40 ` Vlastimil Babka
2026-01-22 2:35 ` Suren Baghdasaryan
2026-01-22 5:52 ` Hao Li
20 siblings, 2 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-16 14:40 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev, Vlastimil Babka
A number of stat items related to cpu slabs became unused, so remove
them. Two of those are ALLOC_FASTPATH and FREE_FASTPATH. Instead of
removing them, use them in place of ALLOC_PCS and FREE_PCS, since
sheaves are the new (and only) fastpaths, and remove the recently added
_PCS variants instead.
Change where FREE_SLOWPATH is counted so that it only counts freeing of
objects by slab users that (for whatever reason) do not go to a percpu
sheaf, and not all (including internal) callers of __slab_free(). Thus
flushing sheaves (counted by SHEAF_FLUSH) no longer also increments
FREE_SLOWPATH. This matches how ALLOC_SLOWPATH doesn't count sheaf
refills (counted by SHEAF_REFILL).
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
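To summarize where the repurposed counters end up after this patch (a
sketch following the hunks below; function bodies are elided and this is
not a complete listing):

	/* sheaf hit, i.e. the (only) fastpath now */
	void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
	{
		/* ... */
		stat(s, ALLOC_FASTPATH);	/* was ALLOC_PCS */
		return object;
	}

	/* sheaf bypass/miss: the slowpath is counted in the caller ... */
	void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
		       unsigned long addr)
	{
		/* ... free_to_pcs() is attempted first ... */
		__slab_free(s, slab, object, object, 1, addr);
		stat(s, FREE_SLOWPATH);
	}

	/* ... so internal callers of __slab_free(), such as sheaf flushing
	 * (counted by SHEAF_FLUSH), no longer bump FREE_SLOWPATH.
	 */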
mm/slub.c | 77 +++++++++++++++++----------------------------------------------
1 file changed, 21 insertions(+), 56 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index c12e90cb2fca..d73ad44fa046 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -330,33 +330,19 @@ enum add_mode {
};
enum stat_item {
- ALLOC_PCS, /* Allocation from percpu sheaf */
- ALLOC_FASTPATH, /* Allocation from cpu slab */
- ALLOC_SLOWPATH, /* Allocation by getting a new cpu slab */
- FREE_PCS, /* Free to percpu sheaf */
+ ALLOC_FASTPATH, /* Allocation from percpu sheaves */
+ ALLOC_SLOWPATH, /* Allocation from partial or new slab */
FREE_RCU_SHEAF, /* Free to rcu_free sheaf */
FREE_RCU_SHEAF_FAIL, /* Failed to free to a rcu_free sheaf */
- FREE_FASTPATH, /* Free to cpu slab */
- FREE_SLOWPATH, /* Freeing not to cpu slab */
+ FREE_FASTPATH, /* Free to percpu sheaves */
+ FREE_SLOWPATH, /* Free to a slab */
FREE_ADD_PARTIAL, /* Freeing moves slab to partial list */
FREE_REMOVE_PARTIAL, /* Freeing removes last object */
- ALLOC_FROM_PARTIAL, /* Cpu slab acquired from node partial list */
- ALLOC_SLAB, /* Cpu slab acquired from page allocator */
- ALLOC_REFILL, /* Refill cpu slab from slab freelist */
- ALLOC_NODE_MISMATCH, /* Switching cpu slab */
+ ALLOC_SLAB, /* New slab acquired from page allocator */
+ ALLOC_NODE_MISMATCH, /* Requested node different from cpu sheaf */
FREE_SLAB, /* Slab freed to the page allocator */
- CPUSLAB_FLUSH, /* Abandoning of the cpu slab */
- DEACTIVATE_FULL, /* Cpu slab was full when deactivated */
- DEACTIVATE_EMPTY, /* Cpu slab was empty when deactivated */
- DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
- DEACTIVATE_BYPASS, /* Implicit deactivation */
ORDER_FALLBACK, /* Number of times fallback was necessary */
- CMPXCHG_DOUBLE_CPU_FAIL,/* Failures of this_cpu_cmpxchg_double */
CMPXCHG_DOUBLE_FAIL, /* Failures of slab freelist update */
- CPU_PARTIAL_ALLOC, /* Used cpu partial on alloc */
- CPU_PARTIAL_FREE, /* Refill cpu partial on free */
- CPU_PARTIAL_NODE, /* Refill cpu partial from node partial */
- CPU_PARTIAL_DRAIN, /* Drain cpu partial to node partial */
SHEAF_FLUSH, /* Objects flushed from a sheaf */
SHEAF_REFILL, /* Objects refilled to a sheaf */
SHEAF_ALLOC, /* Allocation of an empty sheaf */
@@ -4347,8 +4333,10 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
* We assume the percpu sheaves contain only local objects although it's
* not completely guaranteed, so we verify later.
*/
- if (unlikely(node_requested && node != numa_mem_id()))
+ if (unlikely(node_requested && node != numa_mem_id())) {
+ stat(s, ALLOC_NODE_MISMATCH);
return NULL;
+ }
if (!local_trylock(&s->cpu_sheaves->lock))
return NULL;
@@ -4371,6 +4359,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
*/
if (page_to_nid(virt_to_page(object)) != node) {
local_unlock(&s->cpu_sheaves->lock);
+ stat(s, ALLOC_NODE_MISMATCH);
return NULL;
}
}
@@ -4379,7 +4368,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
local_unlock(&s->cpu_sheaves->lock);
- stat(s, ALLOC_PCS);
+ stat(s, ALLOC_FASTPATH);
return object;
}
@@ -4451,7 +4440,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
local_unlock(&s->cpu_sheaves->lock);
- stat_add(s, ALLOC_PCS, batch);
+ stat_add(s, ALLOC_FASTPATH, batch);
allocated += batch;
@@ -5111,8 +5100,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
unsigned long flags;
bool on_node_partial;
- stat(s, FREE_SLOWPATH);
-
if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
free_to_partial_list(s, slab, head, tail, cnt, addr);
return;
@@ -5416,7 +5403,7 @@ bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
local_unlock(&s->cpu_sheaves->lock);
- stat(s, FREE_PCS);
+ stat(s, FREE_FASTPATH);
return true;
}
@@ -5664,7 +5651,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
local_unlock(&s->cpu_sheaves->lock);
- stat_add(s, FREE_PCS, batch);
+ stat_add(s, FREE_FASTPATH, batch);
if (batch < size) {
p += batch;
@@ -5686,10 +5673,12 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
*/
fallback:
__kmem_cache_free_bulk(s, size, p);
+ stat_add(s, FREE_SLOWPATH, size);
flush_remote:
if (remote_nr) {
__kmem_cache_free_bulk(s, remote_nr, &remote_objects[0]);
+ stat_add(s, FREE_SLOWPATH, remote_nr);
if (i < size) {
remote_nr = 0;
goto next_remote_batch;
@@ -5784,6 +5773,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
}
__slab_free(s, slab, object, object, 1, addr);
+ stat(s, FREE_SLOWPATH);
}
#ifdef CONFIG_MEMCG
@@ -5806,8 +5796,10 @@ void slab_free_bulk(struct kmem_cache *s, struct slab *slab, void *head,
* With KASAN enabled slab_free_freelist_hook modifies the freelist
* to remove objects, whose reuse must be delayed.
*/
- if (likely(slab_free_freelist_hook(s, &head, &tail, &cnt)))
+ if (likely(slab_free_freelist_hook(s, &head, &tail, &cnt))) {
__slab_free(s, slab, head, tail, cnt, addr);
+ stat_add(s, FREE_SLOWPATH, cnt);
+ }
}
#ifdef CONFIG_SLUB_RCU_DEBUG
@@ -6705,6 +6697,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
i = refill_objects(s, p, flags, size, size);
if (i < size)
goto error;
+ stat_add(s, ALLOC_SLOWPATH, i);
}
return i;
@@ -8704,33 +8697,19 @@ static ssize_t text##_store(struct kmem_cache *s, \
} \
SLAB_ATTR(text); \
-STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
-STAT_ATTR(FREE_PCS, free_cpu_sheaf);
STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
STAT_ATTR(FREE_FASTPATH, free_fastpath);
STAT_ATTR(FREE_SLOWPATH, free_slowpath);
STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
STAT_ATTR(FREE_REMOVE_PARTIAL, free_remove_partial);
-STAT_ATTR(ALLOC_FROM_PARTIAL, alloc_from_partial);
STAT_ATTR(ALLOC_SLAB, alloc_slab);
-STAT_ATTR(ALLOC_REFILL, alloc_refill);
STAT_ATTR(ALLOC_NODE_MISMATCH, alloc_node_mismatch);
STAT_ATTR(FREE_SLAB, free_slab);
-STAT_ATTR(CPUSLAB_FLUSH, cpuslab_flush);
-STAT_ATTR(DEACTIVATE_FULL, deactivate_full);
-STAT_ATTR(DEACTIVATE_EMPTY, deactivate_empty);
-STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
-STAT_ATTR(DEACTIVATE_BYPASS, deactivate_bypass);
STAT_ATTR(ORDER_FALLBACK, order_fallback);
-STAT_ATTR(CMPXCHG_DOUBLE_CPU_FAIL, cmpxchg_double_cpu_fail);
STAT_ATTR(CMPXCHG_DOUBLE_FAIL, cmpxchg_double_fail);
-STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
-STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
-STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
-STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
STAT_ATTR(SHEAF_FLUSH, sheaf_flush);
STAT_ATTR(SHEAF_REFILL, sheaf_refill);
STAT_ATTR(SHEAF_ALLOC, sheaf_alloc);
@@ -8806,33 +8785,19 @@ static struct attribute *slab_attrs[] = {
&remote_node_defrag_ratio_attr.attr,
#endif
#ifdef CONFIG_SLUB_STATS
- &alloc_cpu_sheaf_attr.attr,
&alloc_fastpath_attr.attr,
&alloc_slowpath_attr.attr,
- &free_cpu_sheaf_attr.attr,
&free_rcu_sheaf_attr.attr,
&free_rcu_sheaf_fail_attr.attr,
&free_fastpath_attr.attr,
&free_slowpath_attr.attr,
&free_add_partial_attr.attr,
&free_remove_partial_attr.attr,
- &alloc_from_partial_attr.attr,
&alloc_slab_attr.attr,
- &alloc_refill_attr.attr,
&alloc_node_mismatch_attr.attr,
&free_slab_attr.attr,
- &cpuslab_flush_attr.attr,
- &deactivate_full_attr.attr,
- &deactivate_empty_attr.attr,
- &deactivate_remote_frees_attr.attr,
- &deactivate_bypass_attr.attr,
&order_fallback_attr.attr,
&cmpxchg_double_fail_attr.attr,
- &cmpxchg_double_cpu_fail_attr.attr,
- &cpu_partial_alloc_attr.attr,
- &cpu_partial_free_attr.attr,
- &cpu_partial_node_attr.attr,
- &cpu_partial_drain_attr.attr,
&sheaf_flush_attr.attr,
&sheaf_refill_attr.attr,
&sheaf_alloc_attr.attr,
--
2.52.0
* Re: [PATCH v3 02/21] slab: add SLAB_CONSISTENCY_CHECKS to SLAB_NEVER_MERGE
2026-01-16 14:40 ` [PATCH v3 02/21] slab: add SLAB_CONSISTENCY_CHECKS to SLAB_NEVER_MERGE Vlastimil Babka
@ 2026-01-16 17:22 ` Suren Baghdasaryan
2026-01-19 3:41 ` Harry Yoo
1 sibling, 0 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-16 17:22 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Fri, Jan 16, 2026 at 6:40 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> All the debug flags prevent merging, except SLAB_CONSISTENCY_CHECKS. This
> is suboptimal because this flag (like any debug flag) prevents the
> usage of any fastpaths, and thus affects performance of any aliased
> cache. Also, objects from an aliased cache other than the one specified
> for debugging could interfere with the debugging efforts.
>
> Fix this by adding the whole SLAB_DEBUG_FLAGS collection to
> SLAB_NEVER_MERGE instead of individual debug flags, so it now also
> includes SLAB_CONSISTENCY_CHECKS.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/slab_common.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index ee994ec7f251..e691ede0e6a8 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -45,9 +45,8 @@ struct kmem_cache *kmem_cache;
> /*
> * Set of flags that will prevent slab merging
> */
> -#define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
> - SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
> - SLAB_FAILSLAB | SLAB_NO_MERGE)
> +#define SLAB_NEVER_MERGE (SLAB_DEBUG_FLAGS | SLAB_TYPESAFE_BY_RCU | \
> + SLAB_NOLEAKTRACE | SLAB_FAILSLAB | SLAB_NO_MERGE)
>
> #define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
> SLAB_CACHE_DMA32 | SLAB_ACCOUNT)
>
> --
> 2.52.0
>
* Re: [PATCH v3 06/21] slab: introduce percpu sheaves bootstrap
2026-01-16 14:40 ` [PATCH v3 06/21] slab: introduce percpu sheaves bootstrap Vlastimil Babka
@ 2026-01-17 2:11 ` Suren Baghdasaryan
2026-01-19 3:40 ` Harry Yoo
` (2 more replies)
2026-01-19 11:32 ` Hao Li
1 sibling, 3 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-17 2:11 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Until now, kmem_cache->cpu_sheaves was !NULL only for caches with
> sheaves enabled. Since we want to enable them for almost all caches,
> it's suboptimal to test the pointer in the fast paths, so instead
> allocate it for all caches in do_kmem_cache_create(). Instead of testing
> the cpu_sheaves pointer to recognize caches (yet) without sheaves, test
> kmem_cache->sheaf_capacity for being 0, where needed, using a new
> cache_has_sheaves() helper.
>
> However, for the fast paths sake we also assume that the main sheaf
> always exists (pcs->main is !NULL), and during bootstrap we cannot
> allocate sheaves yet.
>
> Solve this by introducing a single static bootstrap_sheaf that's
> assigned as pcs->main during bootstrap. It has a size of 0, so during
> allocations, the fast path will find it's empty. Since the size of 0
> matches sheaf_capacity of 0, the freeing fast paths will find it's
> "full". In the slow path handlers, we use cache_has_sheaves() to
> recognize that the cache doesn't (yet) have real sheaves, and fall back.
I don't think kmem_cache_prefill_sheaf() handles this case, does it?
Or do you rely on the caller to never try prefilling a bootstrapped
sheaf?
kmem_cache_refill_sheaf() and kmem_cache_return_sheaf() operate on a
sheaf obtained by calling kmem_cache_prefill_sheaf(), so if
kmem_cache_prefill_sheaf() never returns a bootstrapped sheaf we don't
need special handling there.
> Thus sharing the single bootstrap sheaf like this for multiple caches
> and cpus is safe.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 119 ++++++++++++++++++++++++++++++++++++++++++--------------------
> 1 file changed, 81 insertions(+), 38 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index edf341c87e20..706cb6398f05 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -501,6 +501,18 @@ struct kmem_cache_node {
> struct node_barn *barn;
> };
>
> +/*
> + * Every cache has !NULL s->cpu_sheaves but they may point to the
> + * bootstrap_sheaf temporarily during init, or permanently for the boot caches
> + * and caches with debugging enabled, or all caches with CONFIG_SLUB_TINY. This
> + * helper distinguishes whether cache has real non-bootstrap sheaves.
> + */
> +static inline bool cache_has_sheaves(struct kmem_cache *s)
> +{
> + /* Test CONFIG_SLUB_TINY for code elimination purposes */
> + return !IS_ENABLED(CONFIG_SLUB_TINY) && s->sheaf_capacity;
> +}
> +
> static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> {
> return s->node[node];
> @@ -2855,6 +2867,10 @@ static void pcs_destroy(struct kmem_cache *s)
> if (!pcs->main)
> continue;
>
> + /* bootstrap or debug caches, it's the bootstrap_sheaf */
> + if (!pcs->main->cache)
> + continue;
I wonder why we can't simply check cache_has_sheaves(s) at the
beginning and skip the loop altogether.
I realize that __kmem_cache_release()->pcs_destroy() is called in the
failure path of do_kmem_cache_create() and s->cpu_sheaves might be
partially initialized if alloc_empty_sheaf() fails somewhere in the
middle of the loop inside init_percpu_sheaves(). But for that,
s->sheaf_capacity should still be non-zero, so checking
cache_has_sheaves() at the beginning of pcs_destroy() should still
work, no?
BTW, I see one last check for s->cpu_sheaves that you didn't replace
with cache_has_sheaves() inside __kmem_cache_release(). I think that's
because it's also in the failure path of do_kmem_cache_create() and
it's possible that s->sheaf_capacity > 0 while s->cpu_sheaves == NULL
(if alloc_percpu(struct slub_percpu_sheaves) fails). It might be
helpful to add a comment inside __kmem_cache_release() to explain why
cache_has_sheaves() can't be used there.
> +
> /*
> * We have already passed __kmem_cache_shutdown() so everything
> * was flushed and there should be no objects allocated from
> @@ -4030,7 +4046,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
> {
> struct slub_percpu_sheaves *pcs;
>
> - if (!s->cpu_sheaves)
> + if (!cache_has_sheaves(s))
> return false;
>
> pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> @@ -4052,7 +4068,7 @@ static void flush_cpu_slab(struct work_struct *w)
>
> s = sfw->s;
>
> - if (s->cpu_sheaves)
> + if (cache_has_sheaves(s))
> pcs_flush_all(s);
>
> flush_this_cpu_slab(s);
> @@ -4157,7 +4173,7 @@ void flush_all_rcu_sheaves(void)
> mutex_lock(&slab_mutex);
>
> list_for_each_entry(s, &slab_caches, list) {
> - if (!s->cpu_sheaves)
> + if (!cache_has_sheaves(s))
> continue;
> flush_rcu_sheaves_on_cache(s);
> }
> @@ -4179,7 +4195,7 @@ static int slub_cpu_dead(unsigned int cpu)
> mutex_lock(&slab_mutex);
> list_for_each_entry(s, &slab_caches, list) {
> __flush_cpu_slab(s, cpu);
> - if (s->cpu_sheaves)
> + if (cache_has_sheaves(s))
> __pcs_flush_all_cpu(s, cpu);
> }
> mutex_unlock(&slab_mutex);
> @@ -4979,6 +4995,12 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
>
> lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
>
> + /* Bootstrap or debug cache, back off */
> + if (unlikely(!cache_has_sheaves(s))) {
> + local_unlock(&s->cpu_sheaves->lock);
> + return NULL;
> + }
> +
> if (pcs->spare && pcs->spare->size > 0) {
> swap(pcs->main, pcs->spare);
> return pcs;
> @@ -5165,6 +5187,11 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> struct slab_sheaf *full;
> struct node_barn *barn;
>
> + if (unlikely(!cache_has_sheaves(s))) {
> + local_unlock(&s->cpu_sheaves->lock);
> + return allocated;
> + }
> +
> if (pcs->spare && pcs->spare->size > 0) {
> swap(pcs->main, pcs->spare);
> goto do_alloc;
> @@ -5244,8 +5271,7 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
> if (unlikely(object))
> goto out;
>
> - if (s->cpu_sheaves)
> - object = alloc_from_pcs(s, gfpflags, node);
> + object = alloc_from_pcs(s, gfpflags, node);
>
> if (!object)
> object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> @@ -5355,17 +5381,6 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
>
> if (unlikely(size > s->sheaf_capacity)) {
>
> - /*
> - * slab_debug disables cpu sheaves intentionally so all
> - * prefilled sheaves become "oversize" and we give up on
> - * performance for the debugging. Same with SLUB_TINY.
> - * Creating a cache without sheaves and then requesting a
> - * prefilled sheaf is however not expected, so warn.
> - */
> - WARN_ON_ONCE(s->sheaf_capacity == 0 &&
> - !IS_ENABLED(CONFIG_SLUB_TINY) &&
> - !(s->flags & SLAB_DEBUG_FLAGS));
> -
> sheaf = kzalloc(struct_size(sheaf, objects, size), gfp);
> if (!sheaf)
> return NULL;
> @@ -6082,6 +6097,12 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> restart:
> lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
>
> + /* Bootstrap or debug cache, back off */
> + if (unlikely(!cache_has_sheaves(s))) {
> + local_unlock(&s->cpu_sheaves->lock);
> + return NULL;
> + }
> +
> barn = get_barn(s);
> if (!barn) {
> local_unlock(&s->cpu_sheaves->lock);
> @@ -6280,6 +6301,12 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
> struct slab_sheaf *empty;
> struct node_barn *barn;
>
> + /* Bootstrap or debug cache, fall back */
> + if (unlikely(!cache_has_sheaves(s))) {
> + local_unlock(&s->cpu_sheaves->lock);
> + goto fail;
> + }
> +
> if (pcs->spare && pcs->spare->size == 0) {
> pcs->rcu_free = pcs->spare;
> pcs->spare = NULL;
> @@ -6674,9 +6701,8 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
> if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> return;
>
> - if (s->cpu_sheaves && likely(!IS_ENABLED(CONFIG_NUMA) ||
> - slab_nid(slab) == numa_mem_id())
> - && likely(!slab_test_pfmemalloc(slab))) {
> + if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
> + && likely(!slab_test_pfmemalloc(slab))) {
> if (likely(free_to_pcs(s, object)))
> return;
> }
> @@ -7379,7 +7405,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> * freeing to sheaves is so incompatible with the detached freelist so
> * once we go that way, we have to do everything differently
> */
> - if (s && s->cpu_sheaves) {
> + if (s && cache_has_sheaves(s)) {
> free_to_pcs_bulk(s, size, p);
> return;
> }
> @@ -7490,8 +7516,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
> size--;
> }
>
> - if (s->cpu_sheaves)
> - i = alloc_from_pcs_bulk(s, size, p);
> + i = alloc_from_pcs_bulk(s, size, p);
Doesn't the above change make this fastpath a bit longer? IIUC,
instead of bailing out right here we call alloc_from_pcs_bulk() and
bail out from there because pcs->main->size is 0.
>
> if (i < size) {
> /*
> @@ -7702,6 +7727,7 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
>
> static int init_percpu_sheaves(struct kmem_cache *s)
> {
> + static struct slab_sheaf bootstrap_sheaf = {};
> int cpu;
>
> for_each_possible_cpu(cpu) {
> @@ -7711,7 +7737,28 @@ static int init_percpu_sheaves(struct kmem_cache *s)
>
> local_trylock_init(&pcs->lock);
>
> - pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
> + /*
> + * Bootstrap sheaf has zero size so fast-path allocation fails.
> + * It has also size == s->sheaf_capacity, so fast-path free
> + * fails. In the slow paths we recognize the situation by
> + * checking s->sheaf_capacity. This allows fast paths to assume
> + * s->cpu_sheaves and pcs->main always exists and is valid.
s/is/are
> + * It's also safe to share the single static bootstrap_sheaf
> + * with zero-sized objects array as it's never modified.
> + *
> + * bootstrap_sheaf also has NULL pointer to kmem_cache so we
> + * recognize it and not attempt to free it when destroying the
> + * cache
missing a period at the end of the above sentence.
> + *
> + * We keep bootstrap_sheaf for kmem_cache and kmem_cache_node,
> + * caches with debug enabled, and all caches with SLUB_TINY.
> + * For kmalloc caches it's used temporarily during the initial
> + * bootstrap.
> + */
> + if (!s->sheaf_capacity)
> + pcs->main = &bootstrap_sheaf;
> + else
> + pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
>
> if (!pcs->main)
> return -ENOMEM;
> @@ -7809,7 +7856,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
> continue;
> }
>
> - if (s->cpu_sheaves) {
> + if (cache_has_sheaves(s)) {
> barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
>
> if (!barn)
> @@ -8127,7 +8174,7 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
> flush_all_cpus_locked(s);
>
> /* we might have rcu sheaves in flight */
> - if (s->cpu_sheaves)
> + if (cache_has_sheaves(s))
> rcu_barrier();
>
> /* Attempt to free all objects */
> @@ -8439,7 +8486,7 @@ static int slab_mem_going_online_callback(int nid)
> if (get_node(s, nid))
> continue;
>
> - if (s->cpu_sheaves) {
> + if (cache_has_sheaves(s)) {
> barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
>
> if (!barn) {
> @@ -8647,12 +8694,10 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>
> set_cpu_partial(s);
>
> - if (s->sheaf_capacity) {
> - s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
> - if (!s->cpu_sheaves) {
> - err = -ENOMEM;
> - goto out;
> - }
> + s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
> + if (!s->cpu_sheaves) {
> + err = -ENOMEM;
> + goto out;
> }
>
> #ifdef CONFIG_NUMA
> @@ -8671,11 +8716,9 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
> if (!alloc_kmem_cache_cpus(s))
> goto out;
>
> - if (s->cpu_sheaves) {
> - err = init_percpu_sheaves(s);
> - if (err)
> - goto out;
> - }
> + err = init_percpu_sheaves(s);
> + if (err)
> + goto out;
>
> err = 0;
>
>
> --
> 2.52.0
>
* Re: [PATCH v3 07/21] slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock()
2026-01-16 14:40 ` [PATCH v3 07/21] slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock() Vlastimil Babka
@ 2026-01-18 20:45 ` Suren Baghdasaryan
2026-01-19 4:31 ` Harry Yoo
1 sibling, 0 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-18 20:45 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Before we enable percpu sheaves for kmalloc caches, we need to make sure
> kmalloc_nolock() and kfree_nolock() will continue working properly and
> not spin when not allowed to.
>
> Percpu sheaves themselves use local_trylock() so they are already
> compatible. We just need to be careful with the barn->lock spin_lock.
> Pass a new allow_spin parameter where necessary to use
> spin_trylock_irqsave().
>
> In kmalloc_nolock_noprof() we can now attempt alloc_from_pcs() safely;
> for now it will always fail until we enable sheaves for kmalloc caches
> next. Similarly in kfree_nolock() we can attempt free_to_pcs().
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/slub.c | 79 ++++++++++++++++++++++++++++++++++++++++++++-------------------
> 1 file changed, 56 insertions(+), 23 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 706cb6398f05..b385247c219f 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2893,7 +2893,8 @@ static void pcs_destroy(struct kmem_cache *s)
> s->cpu_sheaves = NULL;
> }
>
> -static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
> +static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn,
> + bool allow_spin)
> {
> struct slab_sheaf *empty = NULL;
> unsigned long flags;
> @@ -2901,7 +2902,10 @@ static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
> if (!data_race(barn->nr_empty))
> return NULL;
>
> - spin_lock_irqsave(&barn->lock, flags);
> + if (likely(allow_spin))
> + spin_lock_irqsave(&barn->lock, flags);
> + else if (!spin_trylock_irqsave(&barn->lock, flags))
> + return NULL;
>
> if (likely(barn->nr_empty)) {
> empty = list_first_entry(&barn->sheaves_empty,
> @@ -2978,7 +2982,8 @@ static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
> * change.
> */
> static struct slab_sheaf *
> -barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
> +barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty,
> + bool allow_spin)
> {
> struct slab_sheaf *full = NULL;
> unsigned long flags;
> @@ -2986,7 +2991,10 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
> if (!data_race(barn->nr_full))
> return NULL;
>
> - spin_lock_irqsave(&barn->lock, flags);
> + if (likely(allow_spin))
> + spin_lock_irqsave(&barn->lock, flags);
> + else if (!spin_trylock_irqsave(&barn->lock, flags))
> + return NULL;
>
> if (likely(barn->nr_full)) {
> full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
> @@ -3007,7 +3015,8 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
> * barn. But if there are too many full sheaves, reject this with -E2BIG.
> */
> static struct slab_sheaf *
> -barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> +barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full,
> + bool allow_spin)
> {
> struct slab_sheaf *empty;
> unsigned long flags;
> @@ -3018,7 +3027,10 @@ barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> if (!data_race(barn->nr_empty))
> return ERR_PTR(-ENOMEM);
>
> - spin_lock_irqsave(&barn->lock, flags);
> + if (likely(allow_spin))
> + spin_lock_irqsave(&barn->lock, flags);
> + else if (!spin_trylock_irqsave(&barn->lock, flags))
> + return ERR_PTR(-EBUSY);
>
> if (likely(barn->nr_empty)) {
> empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
> @@ -5012,7 +5024,8 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> return NULL;
> }
>
> - full = barn_replace_empty_sheaf(barn, pcs->main);
> + full = barn_replace_empty_sheaf(barn, pcs->main,
> + gfpflags_allow_spinning(gfp));
>
> if (full) {
> stat(s, BARN_GET);
> @@ -5029,7 +5042,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> empty = pcs->spare;
> pcs->spare = NULL;
> } else {
> - empty = barn_get_empty_sheaf(barn);
> + empty = barn_get_empty_sheaf(barn, true);
> }
> }
>
> @@ -5169,7 +5182,8 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
> }
>
> static __fastpath_inline
> -unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> +unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
> + void **p)
> {
> struct slub_percpu_sheaves *pcs;
> struct slab_sheaf *main;
> @@ -5203,7 +5217,8 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> return allocated;
> }
>
> - full = barn_replace_empty_sheaf(barn, pcs->main);
> + full = barn_replace_empty_sheaf(barn, pcs->main,
> + gfpflags_allow_spinning(gfp));
>
> if (full) {
> stat(s, BARN_GET);
> @@ -5701,7 +5716,7 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> struct kmem_cache *s;
> bool can_retry = true;
> - void *ret = ERR_PTR(-EBUSY);
> + void *ret;
>
> VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO |
> __GFP_NO_OBJ_EXT));
> @@ -5732,6 +5747,12 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> */
> return NULL;
>
> + ret = alloc_from_pcs(s, alloc_gfp, node);
> + if (ret)
> + goto success;
> +
> + ret = ERR_PTR(-EBUSY);
> +
> /*
> * Do not call slab_alloc_node(), since trylock mode isn't
> * compatible with slab_pre_alloc_hook/should_failslab and
> @@ -5768,6 +5789,7 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> ret = NULL;
> }
>
> +success:
> maybe_wipe_obj_freeptr(s, ret);
> slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
> slab_want_init_on_alloc(alloc_gfp, s), size);
> @@ -6088,7 +6110,8 @@ static void __pcs_install_empty_sheaf(struct kmem_cache *s,
> * unlocked.
> */
> static struct slub_percpu_sheaves *
> -__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> +__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> + bool allow_spin)
> {
> struct slab_sheaf *empty;
> struct node_barn *barn;
> @@ -6112,7 +6135,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> put_fail = false;
>
> if (!pcs->spare) {
> - empty = barn_get_empty_sheaf(barn);
> + empty = barn_get_empty_sheaf(barn, allow_spin);
> if (empty) {
> pcs->spare = pcs->main;
> pcs->main = empty;
> @@ -6126,7 +6149,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> return pcs;
> }
>
> - empty = barn_replace_full_sheaf(barn, pcs->main);
> + empty = barn_replace_full_sheaf(barn, pcs->main, allow_spin);
>
> if (!IS_ERR(empty)) {
> stat(s, BARN_PUT);
> @@ -6134,7 +6157,8 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> return pcs;
> }
>
> - if (PTR_ERR(empty) == -E2BIG) {
> + /* sheaf_flush_unused() doesn't support !allow_spin */
> + if (PTR_ERR(empty) == -E2BIG && allow_spin) {
> /* Since we got here, spare exists and is full */
> struct slab_sheaf *to_flush = pcs->spare;
>
> @@ -6159,6 +6183,14 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> alloc_empty:
> local_unlock(&s->cpu_sheaves->lock);
>
> + /*
> + * alloc_empty_sheaf() doesn't support !allow_spin and it's
> + * easier to fall back to freeing directly without sheaves
> + * than add the support (and to sheaf_flush_unused() above)
> + */
> + if (!allow_spin)
> + return NULL;
> +
> empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> if (empty)
> goto got_empty;
> @@ -6201,7 +6233,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> * The object is expected to have passed slab_free_hook() already.
> */
> static __fastpath_inline
> -bool free_to_pcs(struct kmem_cache *s, void *object)
> +bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
> {
> struct slub_percpu_sheaves *pcs;
>
> @@ -6212,7 +6244,7 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
>
> if (unlikely(pcs->main->size == s->sheaf_capacity)) {
>
> - pcs = __pcs_replace_full_main(s, pcs);
> + pcs = __pcs_replace_full_main(s, pcs, allow_spin);
> if (unlikely(!pcs))
> return false;
> }
> @@ -6319,7 +6351,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
> goto fail;
> }
>
> - empty = barn_get_empty_sheaf(barn);
> + empty = barn_get_empty_sheaf(barn, true);
>
> if (empty) {
> pcs->rcu_free = empty;
> @@ -6437,7 +6469,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> goto no_empty;
>
> if (!pcs->spare) {
> - empty = barn_get_empty_sheaf(barn);
> + empty = barn_get_empty_sheaf(barn, true);
> if (!empty)
> goto no_empty;
>
> @@ -6451,7 +6483,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> goto do_free;
> }
>
> - empty = barn_replace_full_sheaf(barn, pcs->main);
> + empty = barn_replace_full_sheaf(barn, pcs->main, true);
> if (IS_ERR(empty)) {
> stat(s, BARN_PUT_FAIL);
> goto no_empty;
> @@ -6703,7 +6735,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>
> if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
> && likely(!slab_test_pfmemalloc(slab))) {
> - if (likely(free_to_pcs(s, object)))
> + if (likely(free_to_pcs(s, object, true)))
> return;
> }
>
> @@ -6964,7 +6996,8 @@ void kfree_nolock(const void *object)
> * since kasan quarantine takes locks and not supported from NMI.
> */
> kasan_slab_free(s, x, false, false, /* skip quarantine */true);
> - do_slab_free(s, slab, x, x, 0, _RET_IP_);
> + if (!free_to_pcs(s, x, false))
> + do_slab_free(s, slab, x, x, 0, _RET_IP_);
> }
> EXPORT_SYMBOL_GPL(kfree_nolock);
>
> @@ -7516,7 +7549,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
> size--;
> }
>
> - i = alloc_from_pcs_bulk(s, size, p);
> + i = alloc_from_pcs_bulk(s, flags, size, p);
>
> if (i < size) {
> /*
>
> --
> 2.52.0
>
* Re: [PATCH v3 06/21] slab: introduce percpu sheaves bootstrap
2026-01-17 2:11 ` Suren Baghdasaryan
@ 2026-01-19 3:40 ` Harry Yoo
2026-01-19 9:13 ` Vlastimil Babka
2026-01-19 9:34 ` Vlastimil Babka
2026-01-21 10:52 ` Vlastimil Babka
2 siblings, 1 reply; 106+ messages in thread
From: Harry Yoo @ 2026-01-19 3:40 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Vlastimil Babka, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Sat, Jan 17, 2026 at 02:11:02AM +0000, Suren Baghdasaryan wrote:
> On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > Until now, kmem_cache->cpu_sheaves was !NULL only for caches with
> > sheaves enabled. Since we want to enable them for almost all caches,
> > it's suboptimal to test the pointer in the fast paths, so instead
> > allocate it for all caches in do_kmem_cache_create(). Instead of testing
> > the cpu_sheaves pointer to recognize caches (yet) without sheaves, test
> > kmem_cache->sheaf_capacity for being 0, where needed, using a new
> > cache_has_sheaves() helper.
> >
> > However, for the fast paths sake we also assume that the main sheaf
> > always exists (pcs->main is !NULL), and during bootstrap we cannot
> > allocate sheaves yet.
> >
> > Solve this by introducing a single static bootstrap_sheaf that's
> > assigned as pcs->main during bootstrap. It has a size of 0, so during
> > allocations, the fast path will find it's empty. Since the size of 0
> > matches sheaf_capacity of 0, the freeing fast paths will find it's
> > "full". In the slow path handlers, we use cache_has_sheaves() to
> > recognize that the cache doesn't (yet) have real sheaves, and fall back.
>
> I don't think kmem_cache_prefill_sheaf() handles this case, does it?
> Or do you rely on the caller to never try prefilling a bootstrapped
> sheaf?
If a cache doesn't have sheaves, s->sheaf_capacity should be 0,
so the sheaf returned by kmem_cache_prefill_sheaf() should be an
"oversized" one... unless the user tries to prefill a sheaf with
size == 0?
> kmem_cache_refill_sheaf() and kmem_cache_return_sheaf() operate on a
> sheaf obtained by calling kmem_cache_prefill_sheaf(), so if
> kmem_cache_prefill_sheaf() never returns a bootstrapped sheaf we don't
> need special handling there.
Right.
> > Thus sharing the single bootstrap sheaf like this for multiple caches
> > and cpus is safe.
> >
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> > mm/slub.c | 119 ++++++++++++++++++++++++++++++++++++++++++--------------------
> > 1 file changed, 81 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index edf341c87e20..706cb6398f05 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -501,6 +501,18 @@ struct kmem_cache_node {
> > struct node_barn *barn;
> > };
> >
> > +/*
> > + * Every cache has !NULL s->cpu_sheaves but they may point to the
> > + * bootstrap_sheaf temporarily during init, or permanently for the boot caches
> > + * and caches with debugging enabled, or all caches with CONFIG_SLUB_TINY. This
> > + * helper distinguishes whether cache has real non-bootstrap sheaves.
> > + */
> > +static inline bool cache_has_sheaves(struct kmem_cache *s)
> > +{
> > + /* Test CONFIG_SLUB_TINY for code elimination purposes */
> > + return !IS_ENABLED(CONFIG_SLUB_TINY) && s->sheaf_capacity;
> > +}
> > +
> > static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> > {
> > return s->node[node];
> > @@ -2855,6 +2867,10 @@ static void pcs_destroy(struct kmem_cache *s)
> > if (!pcs->main)
> > continue;
> >
> > + /* bootstrap or debug caches, it's the bootstrap_sheaf */
> > + if (!pcs->main->cache)
> > + continue;
>
> BTW, I see one last check for s->cpu_sheaves that you didn't replace
> with cache_has_sheaves() inside __kmem_cache_release(). I think that's
> because it's also in the failure path of do_kmem_cache_create() and
> it's possible that s->sheaf_capacity > 0 while s->cpu_sheaves == NULL
> (if alloc_percpu(struct slub_percpu_sheaves) fails). It might be
> helpful to add a comment inside __kmem_cache_release() to explain why
> cache_has_sheaves() can't be used there.
I was thinking it cannot be replaced because s->cpu_sheaves is not NULL
even when s->sheaf_capacity == 0.
Agree that a comment would be worth it!
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v3 02/21] slab: add SLAB_CONSISTENCY_CHECKS to SLAB_NEVER_MERGE
2026-01-16 14:40 ` [PATCH v3 02/21] slab: add SLAB_CONSISTENCY_CHECKS to SLAB_NEVER_MERGE Vlastimil Babka
2026-01-16 17:22 ` Suren Baghdasaryan
@ 2026-01-19 3:41 ` Harry Yoo
1 sibling, 0 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-19 3:41 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:22PM +0100, Vlastimil Babka wrote:
> All the debug flags prevent merging, except SLAB_CONSISTENCY_CHECKS. This
> is suboptimal because this flag (like any debug flags) prevents the
> usage of any fastpaths, and thus affect performance of any aliased
> cache. Also the objects from an aliased cache than the one specified for
> debugging could also interfere with the debugging efforts.
>
> Fix this by adding the whole SLAB_DEBUG_FLAGS collection to
> SLAB_NEVER_MERGE instead of individual debug flags, so it now also
> includes SLAB_CONSISTENCY_CHECKS.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v3 07/21] slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock()
2026-01-16 14:40 ` [PATCH v3 07/21] slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock() Vlastimil Babka
2026-01-18 20:45 ` Suren Baghdasaryan
@ 2026-01-19 4:31 ` Harry Yoo
2026-01-19 10:09 ` Vlastimil Babka
1 sibling, 1 reply; 106+ messages in thread
From: Harry Yoo @ 2026-01-19 4:31 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:27PM +0100, Vlastimil Babka wrote:
> Before we enable percpu sheaves for kmalloc caches, we need to make sure
> kmalloc_nolock() and kfree_nolock() will continue working properly and
> not spin when not allowed to.
>
> Percpu sheaves themselves use local_trylock() so they are already
> compatible. We just need to be careful with the barn->lock spin_lock.
> Pass a new allow_spin parameter where necessary to use
> spin_trylock_irqsave().
>
> In kmalloc_nolock_noprof() we can now attempt alloc_from_pcs() safely;
> for now it will always fail until we enable sheaves for kmalloc caches
> next. Similarly in kfree_nolock() we can attempt free_to_pcs().
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
with a nit below.
> mm/slub.c | 79 ++++++++++++++++++++++++++++++++++++++++++++-------------------
> 1 file changed, 56 insertions(+), 23 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 706cb6398f05..b385247c219f 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -6703,7 +6735,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>
> if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
> && likely(!slab_test_pfmemalloc(slab))) {
> - if (likely(free_to_pcs(s, object)))
> + if (likely(free_to_pcs(s, object, true)))
> return;
> }
>
> @@ -6964,7 +6996,8 @@ void kfree_nolock(const void *object)
> * since kasan quarantine takes locks and not supported from NMI.
> */
> kasan_slab_free(s, x, false, false, /* skip quarantine */true);
> - do_slab_free(s, slab, x, x, 0, _RET_IP_);
> + if (!free_to_pcs(s, x, false))
> + do_slab_free(s, slab, x, x, 0, _RET_IP_);
> }
nit: Maybe it's not that common but should we bypass sheaves if
it's from a remote NUMA node, just like slab_free()?
> EXPORT_SYMBOL_GPL(kfree_nolock);
>
> @@ -7516,7 +7549,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
> size--;
> }
>
> - i = alloc_from_pcs_bulk(s, size, p);
> + i = alloc_from_pcs_bulk(s, flags, size, p);
>
> if (i < size) {
> /*
>
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v3 08/21] slab: handle kmalloc sheaves bootstrap
2026-01-16 14:40 ` [PATCH v3 08/21] slab: handle kmalloc sheaves bootstrap Vlastimil Babka
@ 2026-01-19 5:23 ` Harry Yoo
2026-01-20 1:04 ` Hao Li
1 sibling, 0 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-19 5:23 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:28PM +0100, Vlastimil Babka wrote:
> Enable sheaves for kmalloc caches. For types other than KMALLOC_NORMAL,
> we can simply allow them in calculate_sizes() as they are created later
> than KMALLOC_NORMAL caches and can allocate sheaves and barns from
> those.
>
> For KMALLOC_NORMAL caches we perform an additional step after first
> creating them without sheaves. Then bootstrap_cache_sheaves() simply
> allocates and initializes barns and sheaves and finally sets
> s->sheaf_capacity to make them actually used.
>
> Afterwards the only caches left without sheaves (unless SLUB_TINY or
> debugging is enabled) are kmem_cache and kmem_cache_node. These are only
> used when creating or destroying other kmem_caches. Thus they are not
> performance critical and we can simply leave it that way.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-16 14:40 ` [PATCH v3 09/21] slab: add optimized sheaf refill from partial list Vlastimil Babka
@ 2026-01-19 6:41 ` Harry Yoo
2026-01-19 8:02 ` Harry Yoo
2026-01-19 10:54 ` Vlastimil Babka
2026-01-20 2:32 ` Harry Yoo
` (2 subsequent siblings)
3 siblings, 2 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-19 6:41 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:29PM +0100, Vlastimil Babka wrote:
> At this point we have sheaves enabled for all caches, but their refill
> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
> slabs - now a redundant caching layer that we are about to remove.
>
> The refill will thus be done from slabs on the node partial list.
> Introduce new functions that can do that in an optimized way as it's
> easier than modifying the __kmem_cache_alloc_bulk() call chain.
>
> Extend struct partial_context so it can return a list of slabs from the
> partial list with the sum of free objects in them within the requested
> min and max.
>
> Introduce get_partial_node_bulk() that removes the slabs from the
> partial list and returns them in the list.
>
> Introduce get_freelist_nofreeze() which grabs the freelist without
> freezing the slab.
>
> Introduce alloc_from_new_slab() which can allocate multiple objects from
> a newly allocated slab where we don't need to synchronize with freeing.
> In some aspects it's similar to alloc_single_from_new_slab() but assumes
> the cache is a non-debug one so it can avoid some actions.
>
> Introduce __refill_objects() that uses the functions above to fill an
> array of objects. It has to handle the possibility that the slabs will
> contain more objects than were requested, due to concurrent freeing of
> objects to those slabs. When no more slabs on partial lists are
> available, it will allocate new slabs. It is intended to be only used
> in contexts where spinning is allowed, so add a WARN_ON_ONCE check there.
>
> Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are
> only refilled from contexts that allow spinning, or even blocking.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 284 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 264 insertions(+), 20 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 9bea8a65e510..dce80463f92c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3522,6 +3525,63 @@ static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
> #endif
> static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
>
> +static bool get_partial_node_bulk(struct kmem_cache *s,
> + struct kmem_cache_node *n,
> + struct partial_context *pc)
> +{
> + struct slab *slab, *slab2;
> + unsigned int total_free = 0;
> + unsigned long flags;
> +
> + /* Racy check to avoid taking the lock unnecessarily. */
> + if (!n || data_race(!n->nr_partial))
> + return false;
> +
> + INIT_LIST_HEAD(&pc->slabs);
> +
> + spin_lock_irqsave(&n->list_lock, flags);
> +
> + list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
> + struct freelist_counters flc;
> + unsigned int slab_free;
> +
> + if (!pfmemalloc_match(slab, pc->flags))
> + continue;
> + /*
> + * determine the number of free objects in the slab racily
> + *
> + * due to atomic updates done by a racing free we should not
> + * read an inconsistent value here, but do a sanity check anyway
> + *
> + * slab_free is a lower bound due to subsequent concurrent
> + * freeing, the caller might get more objects than requested and
> + * must deal with it
> + */
> + flc.counters = data_race(READ_ONCE(slab->counters));
> + slab_free = flc.objects - flc.inuse;
> +
> + if (unlikely(slab_free > oo_objects(s->oo)))
> + continue;
When is this condition supposed to be true?
I guess it's when __update_freelist_slow() doesn't update
slab->counters atomically?
> +
> + /* we have already min and this would get us over the max */
> + if (total_free >= pc->min_objects
> + && total_free + slab_free > pc->max_objects)
> + break;
> +
> + remove_partial(n, slab);
> +
> + list_add(&slab->slab_list, &pc->slabs);
> +
> + total_free += slab_free;
> + if (total_free >= pc->max_objects)
> + break;
> + }
> +
> + spin_unlock_irqrestore(&n->list_lock, flags);
> + return total_free > 0;
> +}
> +
> /*
> * Try to allocate a partial slab from a specific node.
> */
> +static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> + void **p, unsigned int count, bool allow_spin)
> +{
> + unsigned int allocated = 0;
> + struct kmem_cache_node *n;
> + unsigned long flags;
> + void *object;
> +
> + if (!allow_spin && (slab->objects - slab->inuse) > count) {
> +
> + n = get_node(s, slab_nid(slab));
> +
> + if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> + /* Unlucky, discard newly allocated slab */
> + defer_deactivate_slab(slab, NULL);
> + return 0;
> + }
> + }
> +
> + object = slab->freelist;
> + while (object && allocated < count) {
> + p[allocated] = object;
> + object = get_freepointer(s, object);
> + maybe_wipe_obj_freeptr(s, p[allocated]);
> +
> + slab->inuse++;
> + allocated++;
> + }
> + slab->freelist = object;
> +
> + if (slab->freelist) {
> +
> + if (allow_spin) {
> + n = get_node(s, slab_nid(slab));
> + spin_lock_irqsave(&n->list_lock, flags);
> + }
> + add_partial(n, slab, DEACTIVATE_TO_HEAD);
> + spin_unlock_irqrestore(&n->list_lock, flags);
> + }
> +
> + inc_slabs_node(s, slab_nid(slab), slab->objects);
Maybe add a comment explaining why inc_slabs_node() doesn't need to be
called under n->list_lock?
> + return allocated;
> +}
> +
> /*
> * Slow path. The lockless freelist is empty or we need to perform
> * debugging duties.
> @@ -5388,6 +5519,9 @@ static int __prefill_sheaf_pfmemalloc(struct kmem_cache *s,
> return ret;
> }
>
> +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> + size_t size, void **p);
> +
> /*
> * returns a sheaf that has at least the requested size
> * when prefilling is needed, do so with given gfp flags
> @@ -7463,6 +7597,116 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> }
> EXPORT_SYMBOL(kmem_cache_free_bulk);
>
> +static unsigned int
> +__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max)
> +{
> + struct slab *slab, *slab2;
> + struct partial_context pc;
> + unsigned int refilled = 0;
> + unsigned long flags;
> + void *object;
> + int node;
> +
> + pc.flags = gfp;
> + pc.min_objects = min;
> + pc.max_objects = max;
> +
> + node = numa_mem_id();
> +
> + if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
> + return 0;
> +
> + /* TODO: consider also other nodes? */
> + if (!get_partial_node_bulk(s, get_node(s, node), &pc))
> + goto new_slab;
> +
> + list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> +
> + list_del(&slab->slab_list);
When a slab is removed from the list,
> + object = get_freelist_nofreeze(s, slab);
> +
> + while (object && refilled < max) {
> + p[refilled] = object;
> + object = get_freepointer(s, object);
> + maybe_wipe_obj_freeptr(s, p[refilled]);
> +
> + refilled++;
> + }
> +
> + /*
> + * Freelist had more objects than we can accommodate, we need to
> + * free them back. We can treat it like a detached freelist, just
> + * need to find the tail object.
> + */
> + if (unlikely(object)) {
And the freelist had more objects than requested,
> + void *head = object;
> + void *tail;
> + int cnt = 0;
> +
> + do {
> + tail = object;
> + cnt++;
> + object = get_freepointer(s, object);
> + } while (object);
> + do_slab_free(s, slab, head, tail, cnt, _RET_IP_);
objects are freed to the slab but the slab may or may not be added back to
n->partial?
> + }
> +
> + if (refilled >= max)
> + break;
> + }
> +
> + if (unlikely(!list_empty(&pc.slabs))) {
> + struct kmem_cache_node *n = get_node(s, node);
> +
> + spin_lock_irqsave(&n->list_lock, flags);
> +
> + list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> +
> + if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial))
> + continue;
> +
> + list_del(&slab->slab_list);
> + add_partial(n, slab, DEACTIVATE_TO_HEAD);
> + }
> +
> + spin_unlock_irqrestore(&n->list_lock, flags);
> +
> + /* any slabs left are completely free and for discard */
> + list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> +
> + list_del(&slab->slab_list);
> + discard_slab(s, slab);
> + }
> + }
> +
> +
> + if (likely(refilled >= min))
> + goto out;
> +
> +new_slab:
> +
> + slab = new_slab(s, pc.flags, node);
> + if (!slab)
> + goto out;
> +
> + stat(s, ALLOC_SLAB);
> +
> + /*
> + * TODO: possible optimization - if we know we will consume the whole
> + * slab we might skip creating the freelist?
> + */
> + refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
> + /* allow_spin = */ true);
> +
> + if (refilled < min)
> + goto new_slab;
It should jump to the out: label when alloc_from_new_slab() returns zero
(trylock failed).
...Oh wait, no. I was confused.
Why does alloc_from_new_slab() handle the !allow_spin case when it is never
called with allow_spin == false?
> +out:
> +
> + return refilled;
> +}
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-19 6:41 ` Harry Yoo
@ 2026-01-19 8:02 ` Harry Yoo
2026-01-19 10:54 ` Vlastimil Babka
1 sibling, 0 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-19 8:02 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Mon, Jan 19, 2026 at 03:41:40PM +0900, Harry Yoo wrote:
> On Fri, Jan 16, 2026 at 03:40:29PM +0100, Vlastimil Babka wrote:
> > At this point we have sheaves enabled for all caches, but their refill
> > is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
> > slabs - now a redundant caching layer that we are about to remove.
> >
> > The refill will thus be done from slabs on the node partial list.
> > Introduce new functions that can do that in an optimized way as it's
> > easier than modifying the __kmem_cache_alloc_bulk() call chain.
> >
> > Extend struct partial_context so it can return a list of slabs from the
> > partial list with the sum of free objects in them within the requested
> > min and max.
> >
> > Introduce get_partial_node_bulk() that removes the slabs from the node's
> > partial list and returns them in a separate list.
> >
> > Introduce get_freelist_nofreeze() which grabs the freelist without
> > freezing the slab.
> >
> > Introduce alloc_from_new_slab() which can allocate multiple objects from
> > a newly allocated slab where we don't need to synchronize with freeing.
> > In some aspects it's similar to alloc_single_from_new_slab() but assumes
> > the cache is a non-debug one so it can avoid some actions.
> >
> > Introduce __refill_objects() that uses the functions above to fill an
> > array of objects. It has to handle the possibility that the slabs will
> > contain more objects than were requested, due to concurrent freeing of
> > objects to those slabs. When no more slabs on partial lists are
> > available, it will allocate new slabs. It is intended to be only used
> > in contexts where spinning is allowed, so add a WARN_ON_ONCE check there.
> >
> > Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are
> > only refilled from contexts that allow spinning, or even blocking.
> >
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> > mm/slub.c | 284 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> > 1 file changed, 264 insertions(+), 20 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 9bea8a65e510..dce80463f92c 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -7463,6 +7597,116 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> > }
> > EXPORT_SYMBOL(kmem_cache_free_bulk);
> >
> > +static unsigned int
> > +__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> > + unsigned int max)
> > +{
> > + struct slab *slab, *slab2;
> > + struct partial_context pc;
> > + unsigned int refilled = 0;
> > + unsigned long flags;
> > + void *object;
> > + int node;
> > +
> > + pc.flags = gfp;
> > + pc.min_objects = min;
> > + pc.max_objects = max;
> > +
> > + node = numa_mem_id();
> > +
> > + if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
> > + return 0;
> > +
> > + /* TODO: consider also other nodes? */
> > + if (!get_partial_node_bulk(s, get_node(s, node), &pc))
> > + goto new_slab;
> > +
> > + list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> > +
> > + list_del(&slab->slab_list);
>
> When a slab is removed from the list,
>
> > + object = get_freelist_nofreeze(s, slab);
> > +
> > + while (object && refilled < max) {
> > + p[refilled] = object;
> > + object = get_freepointer(s, object);
> > + maybe_wipe_obj_freeptr(s, p[refilled]);
> > +
> > + refilled++;
> > + }
> > +
> > + /*
> > + * Freelist had more objects than we can accommodate, we need to
> > + * free them back. We can treat it like a detached freelist, just
> > + * need to find the tail object.
> > + */
> > + if (unlikely(object)) {
>
> And the freelist had more objects than requested,
>
> > + void *head = object;
> > + void *tail;
> > + int cnt = 0;
> > +
> > + do {
> > + tail = object;
> > + cnt++;
> > + object = get_freepointer(s, object);
> > + } while (object);
> > + do_slab_free(s, slab, head, tail, cnt, _RET_IP_);
>
> objects are freed to the slab but the slab may or may not be added back to
> n->partial?
No, since the slab becomes a full slab after get_freelist_nofreeze(),
do_slab_free() should add it back to the n->partial list!
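For reference, this is roughly the part of __slab_free() I mean (paraphrased
from memory, the exact code in this series may differ):

	prior = slab->freelist;	/* NULL means the slab was full before this free */
	...
	if (unlikely(!prior)) {
		/* a full slab is not on n->partial, so put it back now */
		add_partial(n, slab, DEACTIVATE_TO_TAIL);
		stat(s, FREE_ADD_PARTIAL);
	}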
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 06/21] slab: introduce percpu sheaves bootstrap
2026-01-19 3:40 ` Harry Yoo
@ 2026-01-19 9:13 ` Vlastimil Babka
0 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-19 9:13 UTC (permalink / raw)
To: Harry Yoo, Suren Baghdasaryan
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm,
linux-kernel, linux-rt-devel, bpf, kasan-dev
On 1/19/26 04:40, Harry Yoo wrote:
> On Sat, Jan 17, 2026 at 02:11:02AM +0000, Suren Baghdasaryan wrote:
>> On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>> >
>> > Until now, kmem_cache->cpu_sheaves was !NULL only for caches with
>> > sheaves enabled. Since we want to enable them for almost all caches,
>> > it's suboptimal to test the pointer in the fast paths, so instead
>> > allocate it for all caches in do_kmem_cache_create(). Instead of testing
>> > the cpu_sheaves pointer to recognize caches (yet) without sheaves, test
>> > kmem_cache->sheaf_capacity for being 0, where needed, using a new
>> > cache_has_sheaves() helper.
>> >
>> > However, for the fast paths sake we also assume that the main sheaf
>> > always exists (pcs->main is !NULL), and during bootstrap we cannot
>> > allocate sheaves yet.
>> >
>> > Solve this by introducing a single static bootstrap_sheaf that's
>> > assigned as pcs->main during bootstrap. It has a size of 0, so during
>> > allocations, the fast path will find it's empty. Since the size of 0
>> > matches sheaf_capacity of 0, the freeing fast paths will find it's
>> > "full". In the slow path handlers, we use cache_has_sheaves() to
>> > recognize that the cache doesn't (yet) have real sheaves, and fall back.
>>
>> I don't think kmem_cache_prefill_sheaf() handles this case, does it?
>> Or do you rely on the caller to never try prefilling a bootstrapped
>> sheaf?
>
> If a cache doesn't have sheaves, s->sheaf_capacity should be 0,
> so the sheaf returned by kmem_cache_prefill_sheaf() should be an
> "oversized" one... unless the user tries to prefill a sheaf with
> size == 0?
I'll add a
if (unlikely(!size))
return NULL;
to kmem_cache_prefill_sheaf() so we don't have to deal with oversized
sheaves of size 0 just for this theoretical case...
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 06/21] slab: introduce percpu sheaves bootstrap
2026-01-17 2:11 ` Suren Baghdasaryan
2026-01-19 3:40 ` Harry Yoo
@ 2026-01-19 9:34 ` Vlastimil Babka
2026-01-21 10:52 ` Vlastimil Babka
2 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-19 9:34 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On 1/17/26 03:11, Suren Baghdasaryan wrote:
> On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>> Thus sharing the single bootstrap sheaf like this for multiple caches
>> and cpus is safe.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>> mm/slub.c | 119 ++++++++++++++++++++++++++++++++++++++++++--------------------
>> 1 file changed, 81 insertions(+), 38 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index edf341c87e20..706cb6398f05 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -501,6 +501,18 @@ struct kmem_cache_node {
>> struct node_barn *barn;
>> };
>>
>> +/*
>> + * Every cache has !NULL s->cpu_sheaves but they may point to the
>> + * bootstrap_sheaf temporarily during init, or permanently for the boot caches
>> + * and caches with debugging enabled, or all caches with CONFIG_SLUB_TINY. This
>> + * helper distinguishes whether cache has real non-bootstrap sheaves.
>> + */
>> +static inline bool cache_has_sheaves(struct kmem_cache *s)
>> +{
>> + /* Test CONFIG_SLUB_TINY for code elimination purposes */
>> + return !IS_ENABLED(CONFIG_SLUB_TINY) && s->sheaf_capacity;
>> +}
>> +
>> static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
>> {
>> return s->node[node];
>> @@ -2855,6 +2867,10 @@ static void pcs_destroy(struct kmem_cache *s)
>> if (!pcs->main)
>> continue;
>>
>> + /* bootstrap or debug caches, it's the bootstrap_sheaf */
>> + if (!pcs->main->cache)
>> + continue;
>
> I wonder why we can't simply check cache_has_sheaves(s) at the
> beginning and skip the loop altogether.
> I realize that __kmem_cache_release()->pcs_destroy() is called in the
> failure path of do_kmem_cache_create() and s->cpu_sheaves might be
> partially initialized if alloc_empty_sheaf() fails somewhere in the
> middle of the loop inside init_percpu_sheaves(). But for that,
> s->sheaf_capacity should still be non-zero, so checking
> cache_has_sheaves() at the beginning of pcs_destroy() should still
> work, no?
I think it should, will do.
> BTW, I see one last check for s->cpu_sheaves that you didn't replace
> with cache_has_sheaves() inside __kmem_cache_release(). I think that's
> because it's also in the failure path of do_kmem_cache_create() and
> it's possible that s->sheaf_capacity > 0 while s->cpu_sheaves == NULL
> (if alloc_percpu(struct slub_percpu_sheaves) fails). It might be
> helpful to add a comment inside __kmem_cache_release() to explain why
> cache_has_sheaves() can't be used there.
The reason is rather what Harry said. I'll move the check to pcs_destroy()
and add a comment there.
diff --git a/mm/slub.c b/mm/slub.c
index 706cb6398f05..6b19aa518a1a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2858,19 +2858,26 @@ static void pcs_destroy(struct kmem_cache *s)
{
int cpu;
+ /*
+ * We may be unwinding cache creation that failed before or during the
+ * allocation of this.
+ */
+ if (!s->cpu_sheaves)
+ return;
+
+ /* pcs->main can only point to the bootstrap sheaf, nothing to free */
+ if (!cache_has_sheaves(s))
+ goto free_pcs;
+
for_each_possible_cpu(cpu) {
struct slub_percpu_sheaves *pcs;
pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
- /* can happen when unwinding failed create */
+ /* This can happen when unwinding failed cache creation. */
if (!pcs->main)
continue;
- /* bootstrap or debug caches, it's the bootstrap_sheaf */
- if (!pcs->main->cache)
- continue;
-
/*
* We have already passed __kmem_cache_shutdown() so everything
* was flushed and there should be no objects allocated from
@@ -2889,6 +2896,7 @@ static void pcs_destroy(struct kmem_cache *s)
}
}
+free_pcs:
free_percpu(s->cpu_sheaves);
s->cpu_sheaves = NULL;
}
@@ -5379,6 +5387,9 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
struct slab_sheaf *sheaf = NULL;
struct node_barn *barn;
+ if (unlikely(!size))
+ return NULL;
+
if (unlikely(size > s->sheaf_capacity)) {
sheaf = kzalloc(struct_size(sheaf, objects, size), gfp);
@@ -7833,8 +7844,7 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
void __kmem_cache_release(struct kmem_cache *s)
{
cache_random_seq_destroy(s);
- if (s->cpu_sheaves)
- pcs_destroy(s);
+ pcs_destroy(s);
#ifdef CONFIG_PREEMPT_RT
if (s->cpu_slab)
lockdep_unregister_key(&s->lock_key);
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 07/21] slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock()
2026-01-19 4:31 ` Harry Yoo
@ 2026-01-19 10:09 ` Vlastimil Babka
2026-01-19 10:23 ` Vlastimil Babka
0 siblings, 1 reply; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-19 10:09 UTC (permalink / raw)
To: Harry Yoo
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/19/26 05:31, Harry Yoo wrote:
> On Fri, Jan 16, 2026 at 03:40:27PM +0100, Vlastimil Babka wrote:
>> Before we enable percpu sheaves for kmalloc caches, we need to make sure
>> kmalloc_nolock() and kfree_nolock() will continue working properly and
>> not spin when not allowed to.
>>
>> Percpu sheaves themselves use local_trylock() so they are already
>> compatible. We just need to be careful with the barn->lock spin_lock.
>> Pass a new allow_spin parameter where necessary to use
>> spin_trylock_irqsave().
>>
>> In kmalloc_nolock_noprof() we can now attempt alloc_from_pcs() safely,
>> for now it will always fail until we enable sheaves for kmalloc caches
>> next. Similarly in kfree_nolock() we can attempt free_to_pcs().
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>
> Looks good to me,
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Thanks.
>
> with a nit below.
>
>> mm/slub.c | 79 ++++++++++++++++++++++++++++++++++++++++++++-------------------
>> 1 file changed, 56 insertions(+), 23 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 706cb6398f05..b385247c219f 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -6703,7 +6735,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>>
>> if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
>> && likely(!slab_test_pfmemalloc(slab))) {
>> - if (likely(free_to_pcs(s, object)))
>> + if (likely(free_to_pcs(s, object, true)))
>> return;
>> }
>>
>> @@ -6964,7 +6996,8 @@ void kfree_nolock(const void *object)
>> * since kasan quarantine takes locks and not supported from NMI.
>> */
>> kasan_slab_free(s, x, false, false, /* skip quarantine */true);
>> - do_slab_free(s, slab, x, x, 0, _RET_IP_);
>> + if (!free_to_pcs(s, x, false))
>> + do_slab_free(s, slab, x, x, 0, _RET_IP_);
>> }
>
> nit: Maybe it's not that common but should we bypass sheaves if
> it's from a remote NUMA node, just like slab_free()?
Right, will do.
>> EXPORT_SYMBOL_GPL(kfree_nolock);
>>
>> @@ -7516,7 +7549,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>> size--;
>> }
>>
>> - i = alloc_from_pcs_bulk(s, size, p);
>> + i = alloc_from_pcs_bulk(s, flags, size, p);
>>
>> if (i < size) {
>> /*
>>
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 07/21] slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock()
2026-01-19 10:09 ` Vlastimil Babka
@ 2026-01-19 10:23 ` Vlastimil Babka
2026-01-19 12:06 ` Hao Li
0 siblings, 1 reply; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-19 10:23 UTC (permalink / raw)
To: Harry Yoo
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/19/26 11:09, Vlastimil Babka wrote:
> On 1/19/26 05:31, Harry Yoo wrote:
>> On Fri, Jan 16, 2026 at 03:40:27PM +0100, Vlastimil Babka wrote:
>>> Before we enable percpu sheaves for kmalloc caches, we need to make sure
>>> kmalloc_nolock() and kfree_nolock() will continue working properly and
>>> not spin when not allowed to.
>>>
>>> Percpu sheaves themselves use local_trylock() so they are already
>>> compatible. We just need to be careful with the barn->lock spin_lock.
>>> Pass a new allow_spin parameter where necessary to use
>>> spin_trylock_irqsave().
>>>
>>> In kmalloc_nolock_noprof() we can now attempt alloc_from_pcs() safely,
>>> for now it will always fail until we enable sheaves for kmalloc caches
>>> next. Similarly in kfree_nolock() we can attempt free_to_pcs().
>>>
>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>> ---
>>
>> Looks good to me,
>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>
> Thanks.
>
>>
>> with a nit below.
>>
>>> mm/slub.c | 79 ++++++++++++++++++++++++++++++++++++++++++++-------------------
>>> 1 file changed, 56 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/mm/slub.c b/mm/slub.c
>>> index 706cb6398f05..b385247c219f 100644
>>> --- a/mm/slub.c
>>> +++ b/mm/slub.c
>>> @@ -6703,7 +6735,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>>>
>>> if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
>>> && likely(!slab_test_pfmemalloc(slab))) {
>>> - if (likely(free_to_pcs(s, object)))
>>> + if (likely(free_to_pcs(s, object, true)))
>>> return;
>>> }
>>>
>>> @@ -6964,7 +6996,8 @@ void kfree_nolock(const void *object)
>>> * since kasan quarantine takes locks and not supported from NMI.
>>> */
>>> kasan_slab_free(s, x, false, false, /* skip quarantine */true);
>>> - do_slab_free(s, slab, x, x, 0, _RET_IP_);
>>> + if (!free_to_pcs(s, x, false))
>>> + do_slab_free(s, slab, x, x, 0, _RET_IP_);
>>> }
>>
>> nit: Maybe it's not that common but should we bypass sheaves if
>> it's from a remote NUMA node, just like slab_free()?
>
> Right, will do.
However that means sheaves will help less with the defer_free() avoidance
here. It becomes more obvious after "slab: remove the do_slab_free()
fastpath". All remote object frees will be deferred. Guess we can revisit
later if we see there are too many and have no better solution...
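I.e. the tail of kfree_nolock() would then look roughly like this (just a
sketch, assuming defer_free() keeps its current form):

	kasan_slab_free(s, x, false, false, /* skip quarantine */true);

	/* remote-node objects and trylock failures both end up deferred */
	if (unlikely(slab_nid(slab) != numa_mem_id()) || !free_to_pcs(s, x, false))
		defer_free(s, x);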
>>> EXPORT_SYMBOL_GPL(kfree_nolock);
>>>
>>> @@ -7516,7 +7549,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>>> size--;
>>> }
>>>
>>> - i = alloc_from_pcs_bulk(s, size, p);
>>> + i = alloc_from_pcs_bulk(s, flags, size, p);
>>>
>>> if (i < size) {
>>> /*
>>>
>>
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-19 6:41 ` Harry Yoo
2026-01-19 8:02 ` Harry Yoo
@ 2026-01-19 10:54 ` Vlastimil Babka
2026-01-20 1:41 ` Harry Yoo
1 sibling, 1 reply; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-19 10:54 UTC (permalink / raw)
To: Harry Yoo
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/19/26 07:41, Harry Yoo wrote:
> On Fri, Jan 16, 2026 at 03:40:29PM +0100, Vlastimil Babka wrote:
>> At this point we have sheaves enabled for all caches, but their refill
>> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
>> slabs - now a redundant caching layer that we are about to remove.
>>
>> The refill will thus be done from slabs on the node partial list.
>> Introduce new functions that can do that in an optimized way as it's
>> easier than modifying the __kmem_cache_alloc_bulk() call chain.
>>
>> Extend struct partial_context so it can return a list of slabs from the
>> partial list with the sum of free objects in them within the requested
>> min and max.
>>
>> Introduce get_partial_node_bulk() that removes the slabs from the node's
>> partial list and returns them in a separate list.
>>
>> Introduce get_freelist_nofreeze() which grabs the freelist without
>> freezing the slab.
>>
>> Introduce alloc_from_new_slab() which can allocate multiple objects from
>> a newly allocated slab where we don't need to synchronize with freeing.
>> In some aspects it's similar to alloc_single_from_new_slab() but assumes
>> the cache is a non-debug one so it can avoid some actions.
>>
>> Introduce __refill_objects() that uses the functions above to fill an
>> array of objects. It has to handle the possibility that the slabs will
>> contain more objects than were requested, due to concurrent freeing of
>> objects to those slabs. When no more slabs on partial lists are
>> available, it will allocate new slabs. It is intended to be only used
>> in contexts where spinning is allowed, so add a WARN_ON_ONCE check there.
>>
>> Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are
>> only refilled from contexts that allow spinning, or even blocking.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>> mm/slub.c | 284 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>> 1 file changed, 264 insertions(+), 20 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 9bea8a65e510..dce80463f92c 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3522,6 +3525,63 @@ static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
>> #endif
>> static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
>>
>> +static bool get_partial_node_bulk(struct kmem_cache *s,
>> + struct kmem_cache_node *n,
>> + struct partial_context *pc)
>> +{
>> + struct slab *slab, *slab2;
>> + unsigned int total_free = 0;
>> + unsigned long flags;
>> +
>> + /* Racy check to avoid taking the lock unnecessarily. */
>> + if (!n || data_race(!n->nr_partial))
>> + return false;
>> +
>> + INIT_LIST_HEAD(&pc->slabs);
>> +
>> + spin_lock_irqsave(&n->list_lock, flags);
>> +
>> + list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
>> + struct freelist_counters flc;
>> + unsigned int slab_free;
>> +
>> + if (!pfmemalloc_match(slab, pc->flags))
>> + continue;
>> + /*
>> + * determine the number of free objects in the slab racily
>> + *
>> + * due to atomic updates done by a racing free we should not
>> + * read an inconsistent value here, but do a sanity check anyway
>> + *
>> + * slab_free is a lower bound due to subsequent concurrent
>> + * freeing, the caller might get more objects than requested and
>> + * must deal with it
>> + */
>> + flc.counters = data_race(READ_ONCE(slab->counters));
>> + slab_free = flc.objects - flc.inuse;
>> +
>> + if (unlikely(slab_free > oo_objects(s->oo)))
>> + continue;
>
> When is this condition supposed to be true?
>
> I guess it's when __update_freelist_slow() doesn't update
> slab->counters atomically?
Yeah. Probably could be solvable with WRITE_ONCE() there, as this is only
about hypothetical read/write tearing, not seeing stale values. Or not? Just
wanted to be careful.
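Roughly something like this in __update_freelist_slow() (sketch only, leaving
aside the struct freelist_counters details):

	slab_lock(slab);
	if (slab->freelist == old->freelist && slab->counters == old->counters) {
		slab->freelist = new->freelist;
		/* pairs with the racy READ_ONCE() in get_partial_node_bulk() */
		WRITE_ONCE(slab->counters, new->counters);
		ret = true;
	}
	slab_unlock(slab);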
>> +
>> + /* we have already min and this would get us over the max */
>> + if (total_free >= pc->min_objects
>> + && total_free + slab_free > pc->max_objects)
>> + break;
>> +
>> + remove_partial(n, slab);
>> +
>> + list_add(&slab->slab_list, &pc->slabs);
>> +
>> + total_free += slab_free;
>> + if (total_free >= pc->max_objects)
>> + break;
>> + }
>> +
>> + spin_unlock_irqrestore(&n->list_lock, flags);
>> + return total_free > 0;
>> +}
>> +
>> /*
>> * Try to allocate a partial slab from a specific node.
>> */
>> +static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
>> + void **p, unsigned int count, bool allow_spin)
>> +{
>> + unsigned int allocated = 0;
>> + struct kmem_cache_node *n;
>> + unsigned long flags;
>> + void *object;
>> +
>> + if (!allow_spin && (slab->objects - slab->inuse) > count) {
>> +
>> + n = get_node(s, slab_nid(slab));
>> +
>> + if (!spin_trylock_irqsave(&n->list_lock, flags)) {
>> + /* Unlucky, discard newly allocated slab */
>> + defer_deactivate_slab(slab, NULL);
>> + return 0;
>> + }
>> + }
>> +
>> + object = slab->freelist;
>> + while (object && allocated < count) {
>> + p[allocated] = object;
>> + object = get_freepointer(s, object);
>> + maybe_wipe_obj_freeptr(s, p[allocated]);
>> +
>> + slab->inuse++;
>> + allocated++;
>> + }
>> + slab->freelist = object;
>> +
>> + if (slab->freelist) {
>> +
>> + if (allow_spin) {
>> + n = get_node(s, slab_nid(slab));
>> + spin_lock_irqsave(&n->list_lock, flags);
>> + }
>> + add_partial(n, slab, DEACTIVATE_TO_HEAD);
>> + spin_unlock_irqrestore(&n->list_lock, flags);
>> + }
>> +
>> + inc_slabs_node(s, slab_nid(slab), slab->objects);
>
> Maybe add a comment explaining why inc_slabs_node() doesn't need to be
> called under n->list_lock?
Hm, we might not even be holding it. The old code also did the inc with no
comment. If anything could use one, it would be in
alloc_single_from_new_slab()? But that's outside the scope here.
>> + return allocated;
>> +}
>> +
>> /*
>> * Slow path. The lockless freelist is empty or we need to perform
>> * debugging duties.
>> +new_slab:
>> +
>> + slab = new_slab(s, pc.flags, node);
>> + if (!slab)
>> + goto out;
>> +
>> + stat(s, ALLOC_SLAB);
>> +
>> + /*
>> + * TODO: possible optimization - if we know we will consume the whole
>> + * slab we might skip creating the freelist?
>> + */
>> + refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
>> + /* allow_spin = */ true);
>> +
>> + if (refilled < min)
>> + goto new_slab;
>
> It should jump to the out: label when alloc_from_new_slab() returns zero
> (trylock failed).
>
> ...Oh wait, no. I was confused.
>
> Why does alloc_from_new_slab() handle the !allow_spin case when it is never
> called with allow_spin == false?
The next patch will use it, so it seemed easier to add it already. I'll note it
in the commit log.
>> +out:
>> +
>> + return refilled;
>> +}
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 06/21] slab: introduce percpu sheaves bootstrap
2026-01-16 14:40 ` [PATCH v3 06/21] slab: introduce percpu sheaves bootstrap Vlastimil Babka
2026-01-17 2:11 ` Suren Baghdasaryan
@ 2026-01-19 11:32 ` Hao Li
2026-01-21 10:54 ` Vlastimil Babka
1 sibling, 1 reply; 106+ messages in thread
From: Hao Li @ 2026-01-19 11:32 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:26PM +0100, Vlastimil Babka wrote:
> Until now, kmem_cache->cpu_sheaves was !NULL only for caches with
> sheaves enabled. Since we want to enable them for almost all caches,
> it's suboptimal to test the pointer in the fast paths, so instead
> allocate it for all caches in do_kmem_cache_create(). Instead of testing
> the cpu_sheaves pointer to recognize caches (yet) without sheaves, test
> kmem_cache->sheaf_capacity for being 0, where needed, using a new
> cache_has_sheaves() helper.
>
> However, for the fast paths sake we also assume that the main sheaf
> always exists (pcs->main is !NULL), and during bootstrap we cannot
> allocate sheaves yet.
>
> Solve this by introducing a single static bootstrap_sheaf that's
> assigned as pcs->main during bootstrap. It has a size of 0, so during
> allocations, the fast path will find it's empty. Since the size of 0
> matches sheaf_capacity of 0, the freeing fast paths will find it's
> "full". In the slow path handlers, we use cache_has_sheaves() to
> recognize that the cache doesn't (yet) have real sheaves, and fall back.
> Thus sharing the single bootstrap sheaf like this for multiple caches
> and cpus is safe.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 119 ++++++++++++++++++++++++++++++++++++++++++--------------------
> 1 file changed, 81 insertions(+), 38 deletions(-)
>
Nit: would it make sense to also update "if (s->cpu_sheaves)" to
cache_has_sheaves() in kvfree_rcu_barrier_on_cache(), for consistency?
--
Thanks,
Hao
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 07/21] slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock()
2026-01-19 10:23 ` Vlastimil Babka
@ 2026-01-19 12:06 ` Hao Li
0 siblings, 0 replies; 106+ messages in thread
From: Hao Li @ 2026-01-19 12:06 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Mon, Jan 19, 2026 at 11:23:04AM +0100, Vlastimil Babka wrote:
> On 1/19/26 11:09, Vlastimil Babka wrote:
> > On 1/19/26 05:31, Harry Yoo wrote:
> >> On Fri, Jan 16, 2026 at 03:40:27PM +0100, Vlastimil Babka wrote:
> >>> Before we enable percpu sheaves for kmalloc caches, we need to make sure
> >>> kmalloc_nolock() and kfree_nolock() will continue working properly and
> >>> not spin when not allowed to.
> >>>
> >>> Percpu sheaves themselves use local_trylock() so they are already
> >>> compatible. We just need to be careful with the barn->lock spin_lock.
> >>> Pass a new allow_spin parameter where necessary to use
> >>> spin_trylock_irqsave().
> >>>
> >>> In kmalloc_nolock_noprof() we can now attempt alloc_from_pcs() safely,
> >>> for now it will always fail until we enable sheaves for kmalloc caches
> >>> next. Similarly in kfree_nolock() we can attempt free_to_pcs().
> >>>
> >>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >>> ---
> >>
> >> Looks good to me,
> >> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> >
> > Thanks.
> >
> >>
> >> with a nit below.
> >>
> >>> mm/slub.c | 79 ++++++++++++++++++++++++++++++++++++++++++++-------------------
> >>> 1 file changed, 56 insertions(+), 23 deletions(-)
> >>>
> >>> diff --git a/mm/slub.c b/mm/slub.c
> >>> index 706cb6398f05..b385247c219f 100644
> >>> --- a/mm/slub.c
> >>> +++ b/mm/slub.c
> >>> @@ -6703,7 +6735,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
> >>>
> >>> if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
> >>> && likely(!slab_test_pfmemalloc(slab))) {
> >>> - if (likely(free_to_pcs(s, object)))
> >>> + if (likely(free_to_pcs(s, object, true)))
> >>> return;
> >>> }
> >>>
> >>> @@ -6964,7 +6996,8 @@ void kfree_nolock(const void *object)
> >>> * since kasan quarantine takes locks and not supported from NMI.
> >>> */
> >>> kasan_slab_free(s, x, false, false, /* skip quarantine */true);
> >>> - do_slab_free(s, slab, x, x, 0, _RET_IP_);
> >>> + if (!free_to_pcs(s, x, false))
> >>> + do_slab_free(s, slab, x, x, 0, _RET_IP_);
> >>> }
> >>
> >> nit: Maybe it's not that common but should we bypass sheaves if
> >> it's from a remote NUMA node, just like slab_free()?
> >
> > Right, will do.
>
> However that means sheaves will help less with the defer_free() avoidance
> here. It becomes more obvious after "slab: remove the do_slab_free()
> fastpath". All remote object frees will be deferred. Guess we can revisit
> later if we see there are too many and have no better solution...
This makes sense to me, and the commit looks good as well. Thanks!
Reviewed-by: Hao Li <hao.li@linux.dev>
>
> >>> EXPORT_SYMBOL_GPL(kfree_nolock);
> >>>
> >>> @@ -7516,7 +7549,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
> >>> size--;
> >>> }
> >>>
> >>> - i = alloc_from_pcs_bulk(s, size, p);
> >>> + i = alloc_from_pcs_bulk(s, flags, size, p);
> >>>
> >>> if (i < size) {
> >>> /*
> >>>
> >>
> >
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 08/21] slab: handle kmalloc sheaves bootstrap
2026-01-16 14:40 ` [PATCH v3 08/21] slab: handle kmalloc sheaves bootstrap Vlastimil Babka
2026-01-19 5:23 ` Harry Yoo
@ 2026-01-20 1:04 ` Hao Li
1 sibling, 0 replies; 106+ messages in thread
From: Hao Li @ 2026-01-20 1:04 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:28PM +0100, Vlastimil Babka wrote:
> Enable sheaves for kmalloc caches. For other types than KMALLOC_NORMAL,
> we can simply allow them in calculate_sizes() as they are created later
> than KMALLOC_NORMAL caches and can allocate sheaves and barns from
> those.
>
> For KMALLOC_NORMAL caches we perform additional step after first
> creating them without sheaves. Then bootstrap_cache_sheaves() simply
> allocates and initializes barns and sheaves and finally sets
> s->sheaf_capacity to make them actually used.
>
> Afterwards the only caches left without sheaves (unless SLUB_TINY or
> debugging is enabled) are kmem_cache and kmem_cache_node. These are only
> used when creating or destroying other kmem_caches. Thus they are not
> performance critical and we can simply leave it that way.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 84 insertions(+), 4 deletions(-)
>
Looks good to me. Thanks.
Reviewed-by: Hao Li <hao.li@linux.dev>
--
Thanks,
Hao
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-19 10:54 ` Vlastimil Babka
@ 2026-01-20 1:41 ` Harry Yoo
2026-01-20 9:32 ` Hao Li
0 siblings, 1 reply; 106+ messages in thread
From: Harry Yoo @ 2026-01-20 1:41 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Mon, Jan 19, 2026 at 11:54:18AM +0100, Vlastimil Babka wrote:
> On 1/19/26 07:41, Harry Yoo wrote:
> > On Fri, Jan 16, 2026 at 03:40:29PM +0100, Vlastimil Babka wrote:
> >> At this point we have sheaves enabled for all caches, but their refill
> >> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
> >> slabs - now a redundant caching layer that we are about to remove.
> >>
> >> The refill will thus be done from slabs on the node partial list.
> >> Introduce new functions that can do that in an optimized way as it's
> >> easier than modifying the __kmem_cache_alloc_bulk() call chain.
> >>
> >> Extend struct partial_context so it can return a list of slabs from the
> >> partial list with the sum of free objects in them within the requested
> >> min and max.
> >>
> >> Introduce get_partial_node_bulk() that removes the slabs from the node's
> >> partial list and returns them in a separate list.
> >>
> >> Introduce get_freelist_nofreeze() which grabs the freelist without
> >> freezing the slab.
> >>
> >> Introduce alloc_from_new_slab() which can allocate multiple objects from
> >> a newly allocated slab where we don't need to synchronize with freeing.
> >> In some aspects it's similar to alloc_single_from_new_slab() but assumes
> >> the cache is a non-debug one so it can avoid some actions.
> >>
> >> Introduce __refill_objects() that uses the functions above to fill an
> >> array of objects. It has to handle the possibility that the slabs will
> >> contain more objects that were requested, due to concurrent freeing of
> >> objects to those slabs. When no more slabs on partial lists are
> >> available, it will allocate new slabs. It is intended to be only used
> >> in contexts where spinning is allowed, so add a WARN_ON_ONCE check there.
> >>
> >> Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are
> >> only refilled from contexts that allow spinning, or even blocking.
> >>
> >> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >> ---
> >> mm/slub.c | 284 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> >> 1 file changed, 264 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 9bea8a65e510..dce80463f92c 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -3522,6 +3525,63 @@ static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
> >> #endif
> >> static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
> >>
> >> +static bool get_partial_node_bulk(struct kmem_cache *s,
> >> + struct kmem_cache_node *n,
> >> + struct partial_context *pc)
> >> +{
> >> + struct slab *slab, *slab2;
> >> + unsigned int total_free = 0;
> >> + unsigned long flags;
> >> +
> >> + /* Racy check to avoid taking the lock unnecessarily. */
> >> + if (!n || data_race(!n->nr_partial))
> >> + return false;
> >> +
> >> + INIT_LIST_HEAD(&pc->slabs);
> >> +
> >> + spin_lock_irqsave(&n->list_lock, flags);
> >> +
> >> + list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
> >> + struct freelist_counters flc;
> >> + unsigned int slab_free;
> >> +
> >> + if (!pfmemalloc_match(slab, pc->flags))
> >> + continue;
> >> + /*
> >> + * determine the number of free objects in the slab racily
> >> + *
> >> + * due to atomic updates done by a racing free we should not
> >> + * read an inconsistent value here, but do a sanity check anyway
> >> + *
> >> + * slab_free is a lower bound due to subsequent concurrent
> >> + * freeing, the caller might get more objects than requested and
> >> + * must deal with it
> >> + */
> >> + flc.counters = data_race(READ_ONCE(slab->counters));
> >> + slab_free = flc.objects - flc.inuse;
> >> +
> >> + if (unlikely(slab_free > oo_objects(s->oo)))
> >> + continue;
> >
> > When is this condition supposed to be true?
> >
> > I guess it's when __update_freelist_slow() doesn't update
> > slab->counters atomically?
>
> Yeah. Probably could be solvable with WRITE_ONCE() there, as this is only
> about hypothetical read/write tearing, not seeing stale values.
Ok. That's less confusing than "we should not read an inconsistent value
here, but do a sanity check anyway".
> >> +
> >> + /* we have already min and this would get us over the max */
> >> + if (total_free >= pc->min_objects
> >> + && total_free + slab_free > pc->max_objects)
> >> + break;
> >> +
> >> + remove_partial(n, slab);
> >> +
> >> + list_add(&slab->slab_list, &pc->slabs);
> >> +
> >> + total_free += slab_free;
> >> + if (total_free >= pc->max_objects)
> >> + break;
> >> + }
> >> +
> >> + spin_unlock_irqrestore(&n->list_lock, flags);
> >> + return total_free > 0;
> >> +}
> >> +
> >> /*
> >> * Try to allocate a partial slab from a specific node.
> >> */
> >> +static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> >> + void **p, unsigned int count, bool allow_spin)
> >> +{
> >> + unsigned int allocated = 0;
> >> + struct kmem_cache_node *n;
> >> + unsigned long flags;
> >> + void *object;
> >> +
> >> + if (!allow_spin && (slab->objects - slab->inuse) > count) {
> >> +
> >> + n = get_node(s, slab_nid(slab));
> >> +
> >> + if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> >> + /* Unlucky, discard newly allocated slab */
> >> + defer_deactivate_slab(slab, NULL);
> >> + return 0;
> >> + }
> >> + }
> >> +
> >> + object = slab->freelist;
> >> + while (object && allocated < count) {
> >> + p[allocated] = object;
> >> + object = get_freepointer(s, object);
> >> + maybe_wipe_obj_freeptr(s, p[allocated]);
> >> +
> >> + slab->inuse++;
> >> + allocated++;
> >> + }
> >> + slab->freelist = object;
> >> +
> >> + if (slab->freelist) {
> >> +
> >> + if (allow_spin) {
> >> + n = get_node(s, slab_nid(slab));
> >> + spin_lock_irqsave(&n->list_lock, flags);
> >> + }
> >> + add_partial(n, slab, DEACTIVATE_TO_HEAD);
> >> + spin_unlock_irqrestore(&n->list_lock, flags);
> >> + }
> >> +
> >> + inc_slabs_node(s, slab_nid(slab), slab->objects);
> >
> > Maybe add a comment explaining why inc_slabs_node() doesn't need to be
> > called under n->list_lock?
>
> Hm, we might not even be holding it. The old code also did the inc with no
> comment. If anything could use one, it would be in
> alloc_single_from_new_slab()? But that's outside the scope here.
Ok. Perhaps worth adding something like this later, but yeah it's outside
the scope here.
diff --git a/mm/slub.c b/mm/slub.c
index 698c0d940f06..c5a1e47dfe16 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1633,6 +1633,9 @@ static inline void inc_slabs_node(struct kmem_cache *s, int node, int objects)
{
struct kmem_cache_node *n = get_node(s, node);
+ if (kmem_cache_debug(s))
+ /* slab validation may generate false errors without the lock */
+ lockdep_assert_held(&n->list_lock);
atomic_long_inc(&n->nr_slabs);
atomic_long_add(objects, &n->total_objects);
}
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-16 14:40 ` [PATCH v3 09/21] slab: add optimized sheaf refill from partial list Vlastimil Babka
2026-01-19 6:41 ` Harry Yoo
@ 2026-01-20 2:32 ` Harry Yoo
2026-01-20 6:33 ` Vlastimil Babka
2026-01-20 2:55 ` Hao Li
2026-01-20 17:19 ` Suren Baghdasaryan
3 siblings, 1 reply; 106+ messages in thread
From: Harry Yoo @ 2026-01-20 2:32 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:29PM +0100, Vlastimil Babka wrote:
> At this point we have sheaves enabled for all caches, but their refill
> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
> slabs - now a redundant caching layer that we are about to remove.
>
> The refill will thus be done from slabs on the node partial list.
> Introduce new functions that can do that in an optimized way as it's
> easier than modifying the __kmem_cache_alloc_bulk() call chain.
>
> Extend struct partial_context so it can return a list of slabs from the
> partial list with the sum of free objects in them within the requested
> min and max.
>
> Introduce get_partial_node_bulk() that removes the slabs from the node's
> partial list and returns them in a separate list.
>
> Introduce get_freelist_nofreeze() which grabs the freelist without
> freezing the slab.
>
> Introduce alloc_from_new_slab() which can allocate multiple objects from
> a newly allocated slab where we don't need to synchronize with freeing.
> In some aspects it's similar to alloc_single_from_new_slab() but assumes
> the cache is a non-debug one so it can avoid some actions.
>
> Introduce __refill_objects() that uses the functions above to fill an
> array of objects. It has to handle the possibility that the slabs will
> contain more objects than were requested, due to concurrent freeing of
> objects to those slabs. When no more slabs on partial lists are
> available, it will allocate new slabs. It is intended to be only used
> in contexts where spinning is allowed, so add a WARN_ON_ONCE check there.
>
> Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are
> only refilled from contexts that allow spinning, or even blocking.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 284 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 264 insertions(+), 20 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 9bea8a65e510..dce80463f92c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -246,6 +246,9 @@ struct partial_context {
> gfp_t flags;
> unsigned int orig_size;
> void *object;
> + unsigned int min_objects;
> + unsigned int max_objects;
> + struct list_head slabs;
> };
>
> static inline bool kmem_cache_debug(struct kmem_cache *s)
> @@ -2663,8 +2666,8 @@ static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> if (!to_fill)
> return 0;
>
> - filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
> - &sheaf->objects[sheaf->size]);
> + filled = __refill_objects(s, &sheaf->objects[sheaf->size], gfp,
> + to_fill, to_fill);
nit: perhaps handling min and max separately is unnecessary
if it's always min == max? We could simply have one 'count' or 'size'?
Otherwise LGTM!
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-16 14:40 ` [PATCH v3 09/21] slab: add optimized sheaf refill from partial list Vlastimil Babka
2026-01-19 6:41 ` Harry Yoo
2026-01-20 2:32 ` Harry Yoo
@ 2026-01-20 2:55 ` Hao Li
2026-01-20 17:19 ` Suren Baghdasaryan
3 siblings, 0 replies; 106+ messages in thread
From: Hao Li @ 2026-01-20 2:55 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:29PM +0100, Vlastimil Babka wrote:
> At this point we have sheaves enabled for all caches, but their refill
> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
> slabs - now a redundant caching layer that we are about to remove.
>
> The refill will thus be done from slabs on the node partial list.
> Introduce new functions that can do that in an optimized way as it's
> easier than modifying the __kmem_cache_alloc_bulk() call chain.
>
> Extend struct partial_context so it can return a list of slabs from the
> partial list with the sum of free objects in them within the requested
> min and max.
>
> Introduce get_partial_node_bulk() that removes the slabs from the node's
> partial list and returns them in a separate list.
>
> Introduce get_freelist_nofreeze() which grabs the freelist without
> freezing the slab.
>
> Introduce alloc_from_new_slab() which can allocate multiple objects from
> a newly allocated slab where we don't need to synchronize with freeing.
> In some aspects it's similar to alloc_single_from_new_slab() but assumes
> the cache is a non-debug one so it can avoid some actions.
>
> Introduce __refill_objects() that uses the functions above to fill an
> array of objects. It has to handle the possibility that the slabs will
> contain more objects than were requested, due to concurrent freeing of
> objects to those slabs. When no more slabs on partial lists are
> available, it will allocate new slabs. It is intended to be only used
> in contexts where spinning is allowed, so add a WARN_ON_ONCE check there.
>
> Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are
> only refilled from contexts that allow spinning, or even blocking.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 284 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 264 insertions(+), 20 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 9bea8a65e510..dce80463f92c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -246,6 +246,9 @@ struct partial_context {
> gfp_t flags;
> unsigned int orig_size;
> void *object;
> + unsigned int min_objects;
> + unsigned int max_objects;
> + struct list_head slabs;
> };
>
...
> +static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> + void **p, unsigned int count, bool allow_spin)
> +{
> + unsigned int allocated = 0;
> + struct kmem_cache_node *n;
> + unsigned long flags;
> + void *object;
> +
> + if (!allow_spin && (slab->objects - slab->inuse) > count) {
I was wondering: given that slab->inuse is 0 for a newly allocated slab, is
there a reason to use "slab->objects - slab->inuse" instead of simply
slab->objects?
> +
> + n = get_node(s, slab_nid(slab));
> +
> + if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> + /* Unlucky, discard newly allocated slab */
> + defer_deactivate_slab(slab, NULL);
> + return 0;
> + }
> + }
> +
> + object = slab->freelist;
> + while (object && allocated < count) {
> + p[allocated] = object;
> + object = get_freepointer(s, object);
> + maybe_wipe_obj_freeptr(s, p[allocated]);
> +
> + slab->inuse++;
> + allocated++;
> + }
> + slab->freelist = object;
> +
> + if (slab->freelist) {
> +
> + if (allow_spin) {
> + n = get_node(s, slab_nid(slab));
> + spin_lock_irqsave(&n->list_lock, flags);
> + }
> + add_partial(n, slab, DEACTIVATE_TO_HEAD);
> + spin_unlock_irqrestore(&n->list_lock, flags);
> + }
> +
> + inc_slabs_node(s, slab_nid(slab), slab->objects);
> + return allocated;
> +}
> +
...
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 10/21] slab: remove cpu (partial) slabs usage from allocation paths
2026-01-16 14:40 ` [PATCH v3 10/21] slab: remove cpu (partial) slabs usage from allocation paths Vlastimil Babka
@ 2026-01-20 4:20 ` Harry Yoo
2026-01-20 8:36 ` Hao Li
2026-01-20 18:06 ` Suren Baghdasaryan
2 siblings, 0 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-20 4:20 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:30PM +0100, Vlastimil Babka wrote:
> We now rely on sheaves as the percpu caching layer and can refill them
> directly from partial or newly allocated slabs. Start removing the cpu
> (partial) slabs code, first from allocation paths.
>
> This means that any allocation not satisfied from percpu sheaves will
> end up in ___slab_alloc(), where we remove the usage of cpu (partial)
> slabs, so it will only perform get_partial() or new_slab(). In the
> latter case we reuse alloc_from_new_slab() (when we don't use
> the debug/tiny alloc_single_from_new_slab() variant).
>
> In get_partial_node() we used to return a slab for freezing as the cpu
> slab and to refill the partial slab. Now we only want to return a single
> object and leave the slab on the list (unless it became full). We can't
> simply reuse alloc_single_from_partial() as that assumes freeing uses
> free_to_partial_list(). Instead we need to use __slab_update_freelist()
> to work properly against a racing __slab_free().
>
> The rest of the changes is removing functions that no longer have any
> callers.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 612 ++++++++------------------------------------------------------
> 1 file changed, 79 insertions(+), 533 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index dce80463f92c..698c0d940f06 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3607,54 +3564,55 @@ static struct slab *get_partial_node(struct kmem_cache *s,
> else if (!spin_trylock_irqsave(&n->list_lock, flags))
> return NULL;
> list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
> +
> + struct freelist_counters old, new;
> +
> if (!pfmemalloc_match(slab, pc->flags))
> continue;
>
> if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> - void *object = alloc_single_from_partial(s, n, slab,
> + object = alloc_single_from_partial(s, n, slab,
> pc->orig_size);
> - if (object) {
> - partial = slab;
> - pc->object = object;
> + if (object)
> break;
> - }
> continue;
> }
>
> - remove_partial(n, slab);
> + /*
> + * get a single object from the slab. This might race against
> + * __slab_free(), which however has to take the list_lock if
> + * it's about to make the slab fully free.
> + */
> + do {
> + old.freelist = slab->freelist;
> + old.counters = slab->counters;
>
> - if (!partial) {
> - partial = slab;
> - stat(s, ALLOC_FROM_PARTIAL);
> + new.freelist = get_freepointer(s, old.freelist);
> + new.counters = old.counters;
> + new.inuse++;
>
> - if ((slub_get_cpu_partial(s) == 0)) {
> - break;
> - }
> - } else {
> - put_cpu_partial(s, slab, 0);
> - stat(s, CPU_PARTIAL_NODE);
> + } while (!__slab_update_freelist(s, slab, &old, &new, "get_partial_node"));
Hmm I was wondering if it would introduce an ABBA problem,
but it looks fine as allocations are serialized by n->list_lock.
> - if (++partial_slabs > slub_get_cpu_partial(s) / 2) {
> - break;
> - }
> - }
> + object = old.freelist;
> + if (!new.freelist)
> + remove_partial(n, slab);
> +
> + break;
> }
> spin_unlock_irqrestore(&n->list_lock, flags);
> - return partial;
> + return object;
> }
> @@ -4849,68 +4574,29 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
[...]
> + if (allow_spin)
> + goto new_objects;
>
> - stat(s, CPUSLAB_FLUSH);
> + /* This could cause an endless loop. Fail instead. */
> + return NULL;
>
> - goto retry_load_slab;
> - }
> - c->slab = slab;
> +success:
> + if (kmem_cache_debug_flags(s, SLAB_STORE_USER))
> + set_track(s, freelist, TRACK_ALLOC, addr, gfpflags);
Oh, it was gfpflags & ~(__GFP_DIRECT_RECLAIM) but clearing
__GFP_DIRECT_RECLAIM was removed because preemption isn't disabled
anymore.
Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>
> - goto load_freelist;
> + return freelist;
> }
> +
> /*
> * We disallow kprobes in ___slab_alloc() to prevent reentrance
> *
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL
2026-01-16 14:40 ` [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL Vlastimil Babka
@ 2026-01-20 5:24 ` Harry Yoo
2026-01-20 12:10 ` Hao Li
2026-01-20 22:25 ` Suren Baghdasaryan
2 siblings, 0 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-20 5:24 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:31PM +0100, Vlastimil Babka wrote:
> We have removed the partial slab usage from allocation paths. Now remove
> the whole config option and associated code.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 12/21] slab: remove the do_slab_free() fastpath
2026-01-16 14:40 ` [PATCH v3 12/21] slab: remove the do_slab_free() fastpath Vlastimil Babka
@ 2026-01-20 5:35 ` Harry Yoo
2026-01-20 12:29 ` Hao Li
1 sibling, 0 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-20 5:35 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:32PM +0100, Vlastimil Babka wrote:
> We have removed cpu slab usage from allocation paths. Now remove
> do_slab_free() which was freeing objects to the cpu slab when
> the object belonged to it. Instead call __slab_free() directly,
> which was previously the fallback.
>
> This simplifies kfree_nolock() - when freeing to percpu sheaf
> fails, we can call defer_free() directly.
>
> Also remove functions that became unused.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
The alloc/free path is now a lot simpler!
Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 13/21] slab: remove defer_deactivate_slab()
2026-01-16 14:40 ` [PATCH v3 13/21] slab: remove defer_deactivate_slab() Vlastimil Babka
@ 2026-01-20 5:47 ` Harry Yoo
2026-01-20 9:35 ` Hao Li
1 sibling, 0 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-20 5:47 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:33PM +0100, Vlastimil Babka wrote:
> There are no more cpu slabs so we don't need their deferred
> deactivation. The function is now only used from places where we
> allocate a new slab but then can't spin on node list_lock to put it on
> the partial list. Instead of the deferred action we can free it directly
> via __free_slab(), we just need to tell it to use _nolock() freeing of
> the underlying pages and take care of the accounting.
>
> Since free_frozen_pages_nolock() variant does not yet exist for code
> outside of the page allocator, create it as a trivial wrapper for
> __free_frozen_pages(..., FPI_TRYLOCK).
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/internal.h | 1 +
> mm/page_alloc.c | 5 +++++
> mm/slab.h | 8 +-------
> mm/slub.c | 56 ++++++++++++++++++++------------------------------------
> 4 files changed, 27 insertions(+), 43 deletions(-)
>
> index b08e775dc4cb..33f218c0e8d6 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3260,7 +3260,7 @@ static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> }
>
> -static void __free_slab(struct kmem_cache *s, struct slab *slab)
> +static void __free_slab(struct kmem_cache *s, struct slab *slab, bool allow_spin)
> {
> struct page *page = slab_page(slab);
> int order = compound_order(page);
> @@ -3271,14 +3271,26 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
> __ClearPageSlab(page);
> mm_account_reclaimed_pages(pages);
> unaccount_slab(slab, order, s);
As long as the slab is allocated with !allow_spin, it should be safe to
call unaccount_slab()->free_slab_obj_exts().
Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-20 2:32 ` Harry Yoo
@ 2026-01-20 6:33 ` Vlastimil Babka
2026-01-20 10:27 ` Harry Yoo
0 siblings, 1 reply; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-20 6:33 UTC (permalink / raw)
To: Harry Yoo
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/20/26 03:32, Harry Yoo wrote:
> On Fri, Jan 16, 2026 at 03:40:29PM +0100, Vlastimil Babka wrote:
>> At this point we have sheaves enabled for all caches, but their refill
>> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
>> slabs - now a redundant caching layer that we are about to remove.
>>
>> The refill will thus be done from slabs on the node partial list.
>> Introduce new functions that can do that in an optimized way as it's
>> easier than modifying the __kmem_cache_alloc_bulk() call chain.
>>
>> Extend struct partial_context so it can return a list of slabs from the
>> partial list with the sum of free objects in them within the requested
>> min and max.
>>
>> Introduce get_partial_node_bulk() that removes the slabs from freelist
>> and returns them in the list.
>>
>> Introduce get_freelist_nofreeze() which grabs the freelist without
>> freezing the slab.
>>
>> Introduce alloc_from_new_slab() which can allocate multiple objects from
>> a newly allocated slab where we don't need to synchronize with freeing.
>> In some aspects it's similar to alloc_single_from_new_slab() but assumes
>> the cache is a non-debug one so it can avoid some actions.
>>
>> Introduce __refill_objects() that uses the functions above to fill an
>> array of objects. It has to handle the possibility that the slabs will
> >> contain more objects than were requested, due to concurrent freeing of
> >> objects to those slabs. When no more slabs on partial lists are
> >> available, it will allocate new slabs. It is intended to be only used
> >> in contexts where spinning is allowed, so add a WARN_ON_ONCE check there.
>>
>> Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are
>> only refilled from contexts that allow spinning, or even blocking.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>> mm/slub.c | 284 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>> 1 file changed, 264 insertions(+), 20 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 9bea8a65e510..dce80463f92c 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -246,6 +246,9 @@ struct partial_context {
>> gfp_t flags;
>> unsigned int orig_size;
>> void *object;
>> + unsigned int min_objects;
>> + unsigned int max_objects;
>> + struct list_head slabs;
>> };
>>
>> static inline bool kmem_cache_debug(struct kmem_cache *s)
>> @@ -2663,8 +2666,8 @@ static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
>> if (!to_fill)
>> return 0;
>>
>> - filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
>> - &sheaf->objects[sheaf->size]);
>> + filled = __refill_objects(s, &sheaf->objects[sheaf->size], gfp,
>> + to_fill, to_fill);
>
> nit: perhaps handling min and max separately is unnecessary
> if it's always min == max? we could have simply one 'count' or 'size'?
Right, so the plan was to set min to some fraction of max when refilling
sheaves, with the goal of maximizing the chance that once we grab a slab
from the partial list, we almost certainly fully use it and don't have to
return it back. But I didn't get there yet. It seems worthwhile to try
though so we can leave the implementation prepared for it?
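A rough sketch of that idea (the fraction below is made up and none of this
is implemented yet):

	/* hypothetical: tolerate a partially refilled sheaf */
	unsigned int max = s->sheaf_capacity - sheaf->size;
	unsigned int min = max / 2;	/* some fraction, exact value TBD */

	filled = __refill_objects(s, &sheaf->objects[sheaf->size], gfp, min, max);
	sheaf->size += filled;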
> Otherwise LGTM!
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 10/21] slab: remove cpu (partial) slabs usage from allocation paths
2026-01-16 14:40 ` [PATCH v3 10/21] slab: remove cpu (partial) slabs usage from allocation paths Vlastimil Babka
2026-01-20 4:20 ` Harry Yoo
@ 2026-01-20 8:36 ` Hao Li
2026-01-20 18:06 ` Suren Baghdasaryan
2 siblings, 0 replies; 106+ messages in thread
From: Hao Li @ 2026-01-20 8:36 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:30PM +0100, Vlastimil Babka wrote:
> We now rely on sheaves as the percpu caching layer and can refill them
> directly from partial or newly allocated slabs. Start removing the cpu
> (partial) slabs code, first from allocation paths.
>
> This means that any allocation not satisfied from percpu sheaves will
> end up in ___slab_alloc(), where we remove the usage of cpu (partial)
> slabs, so it will only perform get_partial() or new_slab(). In the
> latter case we reuse alloc_from_new_slab() (when we don't use
> the debug/tiny alloc_single_from_new_slab() variant).
>
> In get_partial_node() we used to return a slab for freezing as the cpu
> slab and to refill the cpu partial slabs. Now we only want to return a single
> object and leave the slab on the list (unless it became full). We can't
> simply reuse alloc_single_from_partial() as that assumes freeing uses
> free_to_partial_list(). Instead we need to use __slab_update_freelist()
> to work properly against a racing __slab_free().
>
> The rest of the changes is removing functions that no longer have any
> callers.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 612 ++++++++------------------------------------------------------
> 1 file changed, 79 insertions(+), 533 deletions(-)
>
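For context, the single-object path in get_partial_node() described above
has roughly this shape (a loose reconstruction from the description, not the
actual hunk; stat handling and error paths are omitted):

	struct freelist_counters old, new;
	void *object;

	do {
		old.freelist = slab->freelist;
		old.counters = slab->counters;

		new.counters = old.counters;
		/* take a single object off the slab's freelist */
		new.freelist = get_freepointer(s, old.freelist);
		new.inuse++;
	} while (!__slab_update_freelist(s, slab, &old, &new, "get_partial_node"));

	object = old.freelist;
	if (!new.freelist)
		remove_partial(n, slab);	/* the slab became full */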
Looks good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>
--
Thanks,
Hao
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-20 1:41 ` Harry Yoo
@ 2026-01-20 9:32 ` Hao Li
2026-01-20 10:22 ` Harry Yoo
0 siblings, 1 reply; 106+ messages in thread
From: Hao Li @ 2026-01-20 9:32 UTC (permalink / raw)
To: Harry Yoo
Cc: Vlastimil Babka, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Tue, Jan 20, 2026 at 10:41:37AM +0900, Harry Yoo wrote:
> On Mon, Jan 19, 2026 at 11:54:18AM +0100, Vlastimil Babka wrote:
> > On 1/19/26 07:41, Harry Yoo wrote:
> > > On Fri, Jan 16, 2026 at 03:40:29PM +0100, Vlastimil Babka wrote:
> > >> /*
> > >> * Try to allocate a partial slab from a specific node.
> > >> */
> > >> +static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > >> + void **p, unsigned int count, bool allow_spin)
> > >> +{
> > >> + unsigned int allocated = 0;
> > >> + struct kmem_cache_node *n;
> > >> + unsigned long flags;
> > >> + void *object;
> > >> +
> > >> + if (!allow_spin && (slab->objects - slab->inuse) > count) {
> > >> +
> > >> + n = get_node(s, slab_nid(slab));
> > >> +
> > >> + if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> > >> + /* Unlucky, discard newly allocated slab */
> > >> + defer_deactivate_slab(slab, NULL);
> > >> + return 0;
> > >> + }
> > >> + }
> > >> +
> > >> + object = slab->freelist;
> > >> + while (object && allocated < count) {
> > >> + p[allocated] = object;
> > >> + object = get_freepointer(s, object);
> > >> + maybe_wipe_obj_freeptr(s, p[allocated]);
> > >> +
> > >> + slab->inuse++;
> > >> + allocated++;
> > >> + }
> > >> + slab->freelist = object;
> > >> +
> > >> + if (slab->freelist) {
> > >> +
> > >> + if (allow_spin) {
> > >> + n = get_node(s, slab_nid(slab));
> > >> + spin_lock_irqsave(&n->list_lock, flags);
> > >> + }
> > >> + add_partial(n, slab, DEACTIVATE_TO_HEAD);
> > >> + spin_unlock_irqrestore(&n->list_lock, flags);
> > >> + }
> > >> +
> > >> + inc_slabs_node(s, slab_nid(slab), slab->objects);
> > >
> > > Maybe add a comment explaining why inc_slabs_node() doesn't need to be
> > > called under n->list_lock?
I think this is a great observation.
> >
> > Hm, we might not even be holding it. The old code also did the inc with no
> > comment. If anything could use one, it would be in
> > alloc_single_from_new_slab()? But that's outside the scope here.
>
> Ok. Perhaps worth adding something like this later, but yeah it's outside
> the scope here.
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 698c0d940f06..c5a1e47dfe16 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1633,6 +1633,9 @@ static inline void inc_slabs_node(struct kmem_cache *s, int node, int objects)
> {
> struct kmem_cache_node *n = get_node(s, node);
>
> + if (kmem_cache_debug(s))
> + /* slab validation may generate false errors without the lock */
> + lockdep_assert_held(&n->list_lock);
> atomic_long_inc(&n->nr_slabs);
> atomic_long_add(objects, &n->total_objects);
> }
Yes. This makes sense to me.
Just to double-check - I noticed that inc_slabs_node() is also called by
early_kmem_cache_node_alloc(). Could this potentially lead to false positive
warnings for boot-time caches when debug flags are enabled?
--
Thanks,
Hao
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 13/21] slab: remove defer_deactivate_slab()
2026-01-16 14:40 ` [PATCH v3 13/21] slab: remove defer_deactivate_slab() Vlastimil Babka
2026-01-20 5:47 ` Harry Yoo
@ 2026-01-20 9:35 ` Hao Li
2026-01-21 17:11 ` Suren Baghdasaryan
1 sibling, 1 reply; 106+ messages in thread
From: Hao Li @ 2026-01-20 9:35 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:33PM +0100, Vlastimil Babka wrote:
> There are no more cpu slabs so we don't need their deferred
> deactivation. The function is now only used from places where we
> allocate a new slab but then can't spin on node list_lock to put it on
> the partial list. Instead of the deferred action we can free it directly
> via __free_slab(), we just need to tell it to use _nolock() freeing of
> the underlying pages and take care of the accounting.
>
> Since free_frozen_pages_nolock() variant does not yet exist for code
> outside of the page allocator, create it as a trivial wrapper for
> __free_frozen_pages(..., FPI_TRYLOCK).
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/internal.h | 1 +
> mm/page_alloc.c | 5 +++++
> mm/slab.h | 8 +-------
> mm/slub.c | 56 ++++++++++++++++++++------------------------------------
> 4 files changed, 27 insertions(+), 43 deletions(-)
>
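The new free_frozen_pages_nolock() wrapper described above is indeed tiny; a
sketch of its expected shape (assuming the usual __free_frozen_pages(page,
order, fpi_flags) signature, not copied from the patch):

	void free_frozen_pages_nolock(struct page *page, unsigned int order)
	{
		/* only trylock zone locks, for contexts that must not spin */
		__free_frozen_pages(page, order, FPI_TRYLOCK);
	}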
Looks good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>
--
Thanks,
Hao
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-20 9:32 ` Hao Li
@ 2026-01-20 10:22 ` Harry Yoo
0 siblings, 0 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-20 10:22 UTC (permalink / raw)
To: Hao Li
Cc: Vlastimil Babka, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Tue, Jan 20, 2026 at 05:32:37PM +0800, Hao Li wrote:
> On Tue, Jan 20, 2026 at 10:41:37AM +0900, Harry Yoo wrote:
> > On Mon, Jan 19, 2026 at 11:54:18AM +0100, Vlastimil Babka wrote:
> > > On 1/19/26 07:41, Harry Yoo wrote:
> > > > On Fri, Jan 16, 2026 at 03:40:29PM +0100, Vlastimil Babka wrote:
> > > >> /*
> > > >> * Try to allocate a partial slab from a specific node.
> > > >> */
> > > >> +static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > > >> + void **p, unsigned int count, bool allow_spin)
> > > >> +{
> > > >> + unsigned int allocated = 0;
> > > >> + struct kmem_cache_node *n;
> > > >> + unsigned long flags;
> > > >> + void *object;
> > > >> +
> > > >> + if (!allow_spin && (slab->objects - slab->inuse) > count) {
> > > >> +
> > > >> + n = get_node(s, slab_nid(slab));
> > > >> +
> > > >> + if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> > > >> + /* Unlucky, discard newly allocated slab */
> > > >> + defer_deactivate_slab(slab, NULL);
> > > >> + return 0;
> > > >> + }
> > > >> + }
> > > >> +
> > > >> + object = slab->freelist;
> > > >> + while (object && allocated < count) {
> > > >> + p[allocated] = object;
> > > >> + object = get_freepointer(s, object);
> > > >> + maybe_wipe_obj_freeptr(s, p[allocated]);
> > > >> +
> > > >> + slab->inuse++;
> > > >> + allocated++;
> > > >> + }
> > > >> + slab->freelist = object;
> > > >> +
> > > >> + if (slab->freelist) {
> > > >> +
> > > >> + if (allow_spin) {
> > > >> + n = get_node(s, slab_nid(slab));
> > > >> + spin_lock_irqsave(&n->list_lock, flags);
> > > >> + }
> > > >> + add_partial(n, slab, DEACTIVATE_TO_HEAD);
> > > >> + spin_unlock_irqrestore(&n->list_lock, flags);
> > > >> + }
> > > >> +
> > > >> + inc_slabs_node(s, slab_nid(slab), slab->objects);
> > > >
> > > > Maybe add a comment explaining why inc_slabs_node() doesn't need to be
> > > > called under n->list_lock?
>
> I think this is a great observation.
>
> > >
> > > Hm, we might not even be holding it. The old code also did the inc with no
> > > comment. If anything could use one, it would be in
> > > alloc_single_from_new_slab()? But that's outside the scope here.
> >
> > Ok. Perhaps worth adding something like this later, but yeah it's outside
> > the scope here.
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 698c0d940f06..c5a1e47dfe16 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1633,6 +1633,9 @@ static inline void inc_slabs_node(struct kmem_cache *s, int node, int objects)
> > {
> > struct kmem_cache_node *n = get_node(s, node);
> >
> > + if (kmem_cache_debug(s))
> > + /* slab validation may generate false errors without the lock */
> > + lockdep_assert_held(&n->list_lock);
> > atomic_long_inc(&n->nr_slabs);
> > atomic_long_add(objects, &n->total_objects);
> > }
>
> Yes. This makes sense to me.
>
> Just to double-check - I noticed that inc_slabs_node() is also called by
> early_kmem_cache_node_alloc(). Could this potentially lead to false positive
> warnings for boot-time caches when debug flags are enabled?
Good point. Perhaps the condition should be
if ((slab_state != DOWN) && kmem_cache_debug(s))
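i.e. the earlier sketch would become:

	if (slab_state != DOWN && kmem_cache_debug(s))
		/* slab validation may generate false errors without the lock */
		lockdep_assert_held(&n->list_lock);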
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-20 6:33 ` Vlastimil Babka
@ 2026-01-20 10:27 ` Harry Yoo
2026-01-20 10:32 ` Vlastimil Babka
0 siblings, 1 reply; 106+ messages in thread
From: Harry Yoo @ 2026-01-20 10:27 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Tue, Jan 20, 2026 at 07:33:47AM +0100, Vlastimil Babka wrote:
> On 1/20/26 03:32, Harry Yoo wrote:
> > On Fri, Jan 16, 2026 at 03:40:29PM +0100, Vlastimil Babka wrote:
> >> At this point we have sheaves enabled for all caches, but their refill
> >> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
> >> slabs - now a redundant caching layer that we are about to remove.
> >>
> >> The refill will thus be done from slabs on the node partial list.
> >> Introduce new functions that can do that in an optimized way as it's
> >> easier than modifying the __kmem_cache_alloc_bulk() call chain.
> >>
> >> Extend struct partial_context so it can return a list of slabs from the
> >> partial list with the sum of free objects in them within the requested
> >> min and max.
> >>
> >> Introduce get_partial_node_bulk() that removes the slabs from freelist
> >> and returns them in the list.
> >>
> >> Introduce get_freelist_nofreeze() which grabs the freelist without
> >> freezing the slab.
> >>
> >> Introduce alloc_from_new_slab() which can allocate multiple objects from
> >> a newly allocated slab where we don't need to synchronize with freeing.
> >> In some aspects it's similar to alloc_single_from_new_slab() but assumes
> >> the cache is a non-debug one so it can avoid some actions.
> >>
> >> Introduce __refill_objects() that uses the functions above to fill an
> >> array of objects. It has to handle the possibility that the slabs will
> >> contain more objects than were requested, due to concurrent freeing of
> >> objects to those slabs. When no more slabs on partial lists are
> >> available, it will allocate new slabs. It is intended to be only used
> >> in contexts where spinning is allowed, so add a WARN_ON_ONCE check there.
> >>
> >> Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are
> >> only refilled from contexts that allow spinning, or even blocking.
> >>
> >> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >> ---
> >> mm/slub.c | 284 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> >> 1 file changed, 264 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 9bea8a65e510..dce80463f92c 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -246,6 +246,9 @@ struct partial_context {
> >> gfp_t flags;
> >> unsigned int orig_size;
> >> void *object;
> >> + unsigned int min_objects;
> >> + unsigned int max_objects;
> >> + struct list_head slabs;
> >> };
> >>
> >> static inline bool kmem_cache_debug(struct kmem_cache *s)
> >> @@ -2663,8 +2666,8 @@ static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> >> if (!to_fill)
> >> return 0;
> >>
> >> - filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
> >> - &sheaf->objects[sheaf->size]);
> >> + filled = __refill_objects(s, &sheaf->objects[sheaf->size], gfp,
> >> + to_fill, to_fill);
> >
> > nit: perhaps handling min and max separately is unnecessary
> > if it's always min == max? we could have simply one 'count' or 'size'?
>
> Right, so the plan was to set min to some fraction of max when refilling
> sheaves, with the goal of maximizing the chance that once we grab a slab
> from the partial list, we almost certainly fully use it and don't have to
> return it back.
Oh, you had a plan!
I'm having trouble imagining what it would look like though.
If we fetch more objects than `to_fill`, where do they go?
Have a larger array and fill multiple sheaves with it?
> But I didn't get there yet. It seems worthwhile to try
> though so we can leave the implementation prepared for it?
Yeah that's fine.
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-20 10:27 ` Harry Yoo
@ 2026-01-20 10:32 ` Vlastimil Babka
0 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-20 10:32 UTC (permalink / raw)
To: Harry Yoo
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/20/26 11:27, Harry Yoo wrote:
> On Tue, Jan 20, 2026 at 07:33:47AM +0100, Vlastimil Babka wrote:
>>
>> Right, so the plan was to set min to some fraction of max when refilling
>> sheaves, with the goal of maximizing the chance that once we grab a slab
>> from the partial list, we almost certainly fully use it and don't have to
>> return it back.
>
> Oh, you had a plan!
>
> I'm having trouble imagining what it would look like though.
> If we fetch more objects than `to_fill`, where do they go?
> Have a larger array and fill multiple sheaves with it?
Ah, that wouldn't happen. Rather, we would consider the sheaf to be full even
if it was filled a bit below its capacity, if trying to reach full capacity
would mean taking a slab from the partial list, not using all objects from it,
and having to return it to the list.
Of course this would not apply for a prefilled sheaf request or
kmem_cache_alloc_bulk().
>> But I didn't get there yet. It seems worthwhile to try
>> though so we can leave the implementation prepared for it?
>
> Yeah that's fine.
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 14/21] slab: simplify kmalloc_nolock()
2026-01-16 14:40 ` [PATCH v3 14/21] slab: simplify kmalloc_nolock() Vlastimil Babka
@ 2026-01-20 12:06 ` Hao Li
2026-01-21 17:39 ` Suren Baghdasaryan
2026-01-22 1:53 ` Harry Yoo
1 sibling, 1 reply; 106+ messages in thread
From: Hao Li @ 2026-01-20 12:06 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:34PM +0100, Vlastimil Babka wrote:
> The kmalloc_nolock() implementation has several complications and
> restrictions due to SLUB's cpu slab locking, lockless fastpath and
> PREEMPT_RT differences. With cpu slab usage removed, we can simplify
> things:
>
> - relax the PREEMPT_RT context checks as they were before commit
> a4ae75d1b6a2 ("slab: fix kmalloc_nolock() context check for
> PREEMPT_RT") and also reference the explanation comment in the page
> allocator
>
> - the local_lock_cpu_slab() macros became unused, remove them
>
> - we no longer need to set up lockdep classes on PREEMPT_RT
>
> - we no longer need to annotate ___slab_alloc as NOKPROBE_SYMBOL
> since there's no lockless cpu freelist manipulation anymore
>
> - __slab_alloc_node() can be called from kmalloc_nolock_noprof()
> unconditionally. It can also no longer return EBUSY. But trylock
> failures can still happen so retry with the larger bucket if the
> allocation fails for any reason.
>
> Note that we still need __CMPXCHG_DOUBLE: while we no longer use
> cmpxchg16b on the cpu freelist (that usage was removed), we still use it
> on the slab freelist, and the alternative is slab_lock() which can be
> interrupted by an NMI. Clarify the comment to mention it specifically.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slab.h | 1 -
> mm/slub.c | 144 +++++++++++++-------------------------------------------------
> 2 files changed, 29 insertions(+), 116 deletions(-)
>
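On the retry-with-a-larger-bucket point above, I'd expect something of this
shape (purely a hypothetical sketch; the real kmalloc_nolock_noprof() code
may differ):

	ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
	if (!ret && size <= KMALLOC_MAX_CACHE_SIZE / 2) {
		/* trylock failure or no memory: retry with the next larger bucket */
		size <<= 1;
		goto retry;
	}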
Looks good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>
--
Thanks,
Hao
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL
2026-01-16 14:40 ` [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL Vlastimil Babka
2026-01-20 5:24 ` Harry Yoo
@ 2026-01-20 12:10 ` Hao Li
2026-01-20 22:25 ` Suren Baghdasaryan
2 siblings, 0 replies; 106+ messages in thread
From: Hao Li @ 2026-01-20 12:10 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:31PM +0100, Vlastimil Babka wrote:
> We have removed the partial slab usage from allocation paths. Now remove
> the whole config option and associated code.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/Kconfig | 11 ---
> mm/slab.h | 29 ------
> mm/slub.c | 321 ++++---------------------------------------------------------
> 3 files changed, 19 insertions(+), 342 deletions(-)
>
Looks good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>
--
Thanks,
Hao
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 12/21] slab: remove the do_slab_free() fastpath
2026-01-16 14:40 ` [PATCH v3 12/21] slab: remove the do_slab_free() fastpath Vlastimil Babka
2026-01-20 5:35 ` Harry Yoo
@ 2026-01-20 12:29 ` Hao Li
2026-01-21 16:57 ` Suren Baghdasaryan
1 sibling, 1 reply; 106+ messages in thread
From: Hao Li @ 2026-01-20 12:29 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:32PM +0100, Vlastimil Babka wrote:
> We have removed cpu slab usage from allocation paths. Now remove
> do_slab_free() which was freeing objects to the cpu slab when
> the object belonged to it. Instead call __slab_free() directly,
> which was previously the fallback.
>
> This simplifies kfree_nolock() - when freeing to percpu sheaf
> fails, we can call defer_free() directly.
>
> Also remove functions that became unused.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 149 ++++++--------------------------------------------------------
> 1 file changed, 13 insertions(+), 136 deletions(-)
>
Looks good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>
--
Thanks,
Hao
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 15/21] slab: remove struct kmem_cache_cpu
2026-01-16 14:40 ` [PATCH v3 15/21] slab: remove struct kmem_cache_cpu Vlastimil Babka
@ 2026-01-20 12:40 ` Hao Li
2026-01-21 14:29 ` Vlastimil Babka
2026-01-22 3:10 ` Harry Yoo
1 sibling, 1 reply; 106+ messages in thread
From: Hao Li @ 2026-01-20 12:40 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:35PM +0100, Vlastimil Babka wrote:
> The cpu slab is not used anymore for allocation or freeing, the
> remaining code is for flushing, but it's effectively dead. Remove the
> whole struct kmem_cache_cpu, the flushing code and other orphaned
> functions.
>
> The remaining used field of kmem_cache_cpu is the stat array with
> CONFIG_SLUB_STATS. Put it instead in a new struct kmem_cache_stats.
> In struct kmem_cache, the field is cpu_stats and placed near the
> end of the struct.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slab.h | 7 +-
> mm/slub.c | 298 +++++---------------------------------------------------------
> 2 files changed, 24 insertions(+), 281 deletions(-)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index e9a0738133ed..87faeb6143f2 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -21,14 +21,12 @@
> # define system_has_freelist_aba() system_has_cmpxchg128()
> # define try_cmpxchg_freelist try_cmpxchg128
> # endif
> -#define this_cpu_try_cmpxchg_freelist this_cpu_try_cmpxchg128
> typedef u128 freelist_full_t;
> #else /* CONFIG_64BIT */
> # ifdef system_has_cmpxchg64
> # define system_has_freelist_aba() system_has_cmpxchg64()
> # define try_cmpxchg_freelist try_cmpxchg64
> # endif
> -#define this_cpu_try_cmpxchg_freelist this_cpu_try_cmpxchg64
> typedef u64 freelist_full_t;
> #endif /* CONFIG_64BIT */
>
> @@ -189,7 +187,6 @@ struct kmem_cache_order_objects {
> * Slab cache management.
> */
> struct kmem_cache {
> - struct kmem_cache_cpu __percpu *cpu_slab;
> struct slub_percpu_sheaves __percpu *cpu_sheaves;
> /* Used for retrieving partial slabs, etc. */
> slab_flags_t flags;
> @@ -238,6 +235,10 @@ struct kmem_cache {
> unsigned int usersize; /* Usercopy region size */
> #endif
>
> +#ifdef CONFIG_SLUB_STATS
> + struct kmem_cache_stats __percpu *cpu_stats;
> +#endif
> +
> struct kmem_cache_node *node[MAX_NUMNODES];
> };
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 8746d9d3f3a3..bb72cfa2d7ec 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -400,28 +400,11 @@ enum stat_item {
> NR_SLUB_STAT_ITEMS
> };
>
> -struct freelist_tid {
> - union {
> - struct {
> - void *freelist; /* Pointer to next available object */
> - unsigned long tid; /* Globally unique transaction id */
> - };
> - freelist_full_t freelist_tid;
> - };
> -};
> -
> -/*
> - * When changing the layout, make sure freelist and tid are still compatible
> - * with this_cpu_cmpxchg_double() alignment requirements.
> - */
> -struct kmem_cache_cpu {
> - struct freelist_tid;
> - struct slab *slab; /* The slab from which we are allocating */
> - local_trylock_t lock; /* Protects the fields above */
> #ifdef CONFIG_SLUB_STATS
> +struct kmem_cache_stats {
> unsigned int stat[NR_SLUB_STAT_ITEMS];
> -#endif
> };
> +#endif
>
> static inline void stat(const struct kmem_cache *s, enum stat_item si)
> {
> @@ -430,7 +413,7 @@ static inline void stat(const struct kmem_cache *s, enum stat_item si)
> * The rmw is racy on a preemptible kernel but this is acceptable, so
> * avoid this_cpu_add()'s irq-disable overhead.
> */
> - raw_cpu_inc(s->cpu_slab->stat[si]);
> + raw_cpu_inc(s->cpu_stats->stat[si]);
> #endif
> }
>
> @@ -438,7 +421,7 @@ static inline
> void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
> {
> #ifdef CONFIG_SLUB_STATS
> - raw_cpu_add(s->cpu_slab->stat[si], v);
> + raw_cpu_add(s->cpu_stats->stat[si], v);
> #endif
> }
>
> @@ -1160,20 +1143,6 @@ static void object_err(struct kmem_cache *s, struct slab *slab,
> WARN_ON(1);
> }
>
> -static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
> - void **freelist, void *nextfree)
> -{
> - if ((s->flags & SLAB_CONSISTENCY_CHECKS) &&
> - !check_valid_pointer(s, slab, nextfree) && freelist) {
> - object_err(s, slab, *freelist, "Freechain corrupt");
> - *freelist = NULL;
> - slab_fix(s, "Isolate corrupted freechain");
> - return true;
> - }
> -
> - return false;
> -}
> -
> static void __slab_err(struct slab *slab)
> {
> if (slab_in_kunit_test())
> @@ -1955,11 +1924,6 @@ static inline void inc_slabs_node(struct kmem_cache *s, int node,
> int objects) {}
> static inline void dec_slabs_node(struct kmem_cache *s, int node,
> int objects) {}
> -static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab,
> - void **freelist, void *nextfree)
> -{
> - return false;
> -}
> #endif /* CONFIG_SLUB_DEBUG */
>
> /*
> @@ -3655,191 +3619,6 @@ static void *get_partial(struct kmem_cache *s, int node,
> return get_any_partial(s, pc);
> }
>
> -#ifdef CONFIG_PREEMPTION
> -/*
> - * Calculate the next globally unique transaction for disambiguation
> - * during cmpxchg. The transactions start with the cpu number and are then
> - * incremented by CONFIG_NR_CPUS.
> - */
> -#define TID_STEP roundup_pow_of_two(CONFIG_NR_CPUS)
> -#else
> -/*
> - * No preemption supported therefore also no need to check for
> - * different cpus.
> - */
> -#define TID_STEP 1
> -#endif /* CONFIG_PREEMPTION */
> -
> -static inline unsigned long next_tid(unsigned long tid)
> -{
> - return tid + TID_STEP;
> -}
> -
> -#ifdef SLUB_DEBUG_CMPXCHG
> -static inline unsigned int tid_to_cpu(unsigned long tid)
> -{
> - return tid % TID_STEP;
> -}
> -
> -static inline unsigned long tid_to_event(unsigned long tid)
> -{
> - return tid / TID_STEP;
> -}
> -#endif
> -
> -static inline unsigned int init_tid(int cpu)
> -{
> - return cpu;
> -}
> -
> -static void init_kmem_cache_cpus(struct kmem_cache *s)
> -{
> - int cpu;
> - struct kmem_cache_cpu *c;
> -
> - for_each_possible_cpu(cpu) {
> - c = per_cpu_ptr(s->cpu_slab, cpu);
> - local_trylock_init(&c->lock);
> - c->tid = init_tid(cpu);
> - }
> -}
> -
> -/*
> - * Finishes removing the cpu slab. Merges cpu's freelist with slab's freelist,
> - * unfreezes the slabs and puts it on the proper list.
> - * Assumes the slab has been already safely taken away from kmem_cache_cpu
> - * by the caller.
> - */
> -static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> - void *freelist)
> -{
> - struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> - int free_delta = 0;
> - void *nextfree, *freelist_iter, *freelist_tail;
> - int tail = DEACTIVATE_TO_HEAD;
> - unsigned long flags = 0;
> - struct freelist_counters old, new;
> -
> - if (READ_ONCE(slab->freelist)) {
> - stat(s, DEACTIVATE_REMOTE_FREES);
> - tail = DEACTIVATE_TO_TAIL;
> - }
> -
> - /*
> - * Stage one: Count the objects on cpu's freelist as free_delta and
> - * remember the last object in freelist_tail for later splicing.
> - */
> - freelist_tail = NULL;
> - freelist_iter = freelist;
> - while (freelist_iter) {
> - nextfree = get_freepointer(s, freelist_iter);
> -
> - /*
> - * If 'nextfree' is invalid, it is possible that the object at
> - * 'freelist_iter' is already corrupted. So isolate all objects
> - * starting at 'freelist_iter' by skipping them.
> - */
> - if (freelist_corrupted(s, slab, &freelist_iter, nextfree))
> - break;
> -
> - freelist_tail = freelist_iter;
> - free_delta++;
> -
> - freelist_iter = nextfree;
> - }
> -
> - /*
> - * Stage two: Unfreeze the slab while splicing the per-cpu
> - * freelist to the head of slab's freelist.
> - */
> - do {
> - old.freelist = READ_ONCE(slab->freelist);
> - old.counters = READ_ONCE(slab->counters);
> - VM_BUG_ON(!old.frozen);
> -
> - /* Determine target state of the slab */
> - new.counters = old.counters;
> - new.frozen = 0;
> - if (freelist_tail) {
> - new.inuse -= free_delta;
> - set_freepointer(s, freelist_tail, old.freelist);
> - new.freelist = freelist;
> - } else {
> - new.freelist = old.freelist;
> - }
> - } while (!slab_update_freelist(s, slab, &old, &new, "unfreezing slab"));
> -
> - /*
> - * Stage three: Manipulate the slab list based on the updated state.
> - */
> - if (!new.inuse && n->nr_partial >= s->min_partial) {
> - stat(s, DEACTIVATE_EMPTY);
> - discard_slab(s, slab);
> - stat(s, FREE_SLAB);
> - } else if (new.freelist) {
> - spin_lock_irqsave(&n->list_lock, flags);
> - add_partial(n, slab, tail);
> - spin_unlock_irqrestore(&n->list_lock, flags);
> - stat(s, tail);
> - } else {
> - stat(s, DEACTIVATE_FULL);
> - }
> -}
> -
> -static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
> -{
> - unsigned long flags;
> - struct slab *slab;
> - void *freelist;
> -
> - local_lock_irqsave(&s->cpu_slab->lock, flags);
> -
> - slab = c->slab;
> - freelist = c->freelist;
> -
> - c->slab = NULL;
> - c->freelist = NULL;
> - c->tid = next_tid(c->tid);
> -
> - local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> -
> - if (slab) {
> - deactivate_slab(s, slab, freelist);
> - stat(s, CPUSLAB_FLUSH);
> - }
> -}
> -
> -static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
> -{
> - struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> - void *freelist = c->freelist;
> - struct slab *slab = c->slab;
> -
> - c->slab = NULL;
> - c->freelist = NULL;
> - c->tid = next_tid(c->tid);
> -
> - if (slab) {
> - deactivate_slab(s, slab, freelist);
> - stat(s, CPUSLAB_FLUSH);
> - }
> -}
> -
> -static inline void flush_this_cpu_slab(struct kmem_cache *s)
> -{
> - struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
> -
> - if (c->slab)
> - flush_slab(s, c);
> -}
> -
> -static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> -{
> - struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> -
> - return c->slab;
> -}
> -
> static bool has_pcs_used(int cpu, struct kmem_cache *s)
> {
> struct slub_percpu_sheaves *pcs;
> @@ -3853,7 +3632,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
> }
>
> /*
> - * Flush cpu slab.
> + * Flush percpu sheaves
> *
> * Called from CPU work handler with migration disabled.
> */
> @@ -3868,8 +3647,6 @@ static void flush_cpu_slab(struct work_struct *w)
Nit: Would it make sense to rename flush_cpu_slab to flush_cpu_sheaf for better
clarity?
Other than that, looks good to me. Thanks.
Reviewed-by: Hao Li <hao.li@linux.dev>
--
Thanks,
Hao
>
> if (cache_has_sheaves(s))
> pcs_flush_all(s);
> -
> - flush_this_cpu_slab(s);
> }
>
> static void flush_all_cpus_locked(struct kmem_cache *s)
> @@ -3882,7 +3659,7 @@ static void flush_all_cpus_locked(struct kmem_cache *s)
>
> for_each_online_cpu(cpu) {
> sfw = &per_cpu(slub_flush, cpu);
> - if (!has_cpu_slab(cpu, s) && !has_pcs_used(cpu, s)) {
> + if (!has_pcs_used(cpu, s)) {
> sfw->skip = true;
> continue;
> }
> @@ -3992,7 +3769,6 @@ static int slub_cpu_dead(unsigned int cpu)
>
> mutex_lock(&slab_mutex);
> list_for_each_entry(s, &slab_caches, list) {
> - __flush_cpu_slab(s, cpu);
> if (cache_has_sheaves(s))
> __pcs_flush_all_cpu(s, cpu);
> }
> @@ -7121,26 +6897,21 @@ init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
> barn_init(barn);
> }
>
> -static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
> +#ifdef CONFIG_SLUB_STATS
> +static inline int alloc_kmem_cache_stats(struct kmem_cache *s)
> {
> BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
> NR_KMALLOC_TYPES * KMALLOC_SHIFT_HIGH *
> - sizeof(struct kmem_cache_cpu));
> + sizeof(struct kmem_cache_stats));
>
> - /*
> - * Must align to double word boundary for the double cmpxchg
> - * instructions to work; see __pcpu_double_call_return_bool().
> - */
> - s->cpu_slab = __alloc_percpu(sizeof(struct kmem_cache_cpu),
> - 2 * sizeof(void *));
> + s->cpu_stats = alloc_percpu(struct kmem_cache_stats);
>
> - if (!s->cpu_slab)
> + if (!s->cpu_stats)
> return 0;
>
> - init_kmem_cache_cpus(s);
> -
> return 1;
> }
> +#endif
>
> static int init_percpu_sheaves(struct kmem_cache *s)
> {
> @@ -7252,7 +7023,9 @@ void __kmem_cache_release(struct kmem_cache *s)
> cache_random_seq_destroy(s);
> if (s->cpu_sheaves)
> pcs_destroy(s);
> - free_percpu(s->cpu_slab);
> +#ifdef CONFIG_SLUB_STATS
> + free_percpu(s->cpu_stats);
> +#endif
> free_kmem_cache_nodes(s);
> }
>
> @@ -7944,12 +7717,6 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
>
> memcpy(s, static_cache, kmem_cache->object_size);
>
> - /*
> - * This runs very early, and only the boot processor is supposed to be
> - * up. Even if it weren't true, IRQs are not up so we couldn't fire
> - * IPIs around.
> - */
> - __flush_cpu_slab(s, smp_processor_id());
> for_each_kmem_cache_node(s, node, n) {
> struct slab *p;
>
> @@ -8164,8 +7931,10 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
> if (!init_kmem_cache_nodes(s))
> goto out;
>
> - if (!alloc_kmem_cache_cpus(s))
> +#ifdef CONFIG_SLUB_STATS
> + if (!alloc_kmem_cache_stats(s))
> goto out;
> +#endif
>
> err = init_percpu_sheaves(s);
> if (err)
> @@ -8484,33 +8253,6 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
> if (!nodes)
> return -ENOMEM;
>
> - if (flags & SO_CPU) {
> - int cpu;
> -
> - for_each_possible_cpu(cpu) {
> - struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab,
> - cpu);
> - int node;
> - struct slab *slab;
> -
> - slab = READ_ONCE(c->slab);
> - if (!slab)
> - continue;
> -
> - node = slab_nid(slab);
> - if (flags & SO_TOTAL)
> - x = slab->objects;
> - else if (flags & SO_OBJECTS)
> - x = slab->inuse;
> - else
> - x = 1;
> -
> - total += x;
> - nodes[node] += x;
> -
> - }
> - }
> -
> /*
> * It is impossible to take "mem_hotplug_lock" here with "kernfs_mutex"
> * already held which will conflict with an existing lock order:
> @@ -8881,7 +8623,7 @@ static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
> return -ENOMEM;
>
> for_each_online_cpu(cpu) {
> - unsigned x = per_cpu_ptr(s->cpu_slab, cpu)->stat[si];
> + unsigned int x = per_cpu_ptr(s->cpu_stats, cpu)->stat[si];
>
> data[cpu] = x;
> sum += x;
> @@ -8907,7 +8649,7 @@ static void clear_stat(struct kmem_cache *s, enum stat_item si)
> int cpu;
>
> for_each_online_cpu(cpu)
> - per_cpu_ptr(s->cpu_slab, cpu)->stat[si] = 0;
> + per_cpu_ptr(s->cpu_stats, cpu)->stat[si] = 0;
> }
>
> #define STAT_ATTR(si, text) \
>
> --
> 2.52.0
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-16 14:40 ` [PATCH v3 09/21] slab: add optimized sheaf refill from partial list Vlastimil Babka
` (2 preceding siblings ...)
2026-01-20 2:55 ` Hao Li
@ 2026-01-20 17:19 ` Suren Baghdasaryan
2026-01-21 13:22 ` Vlastimil Babka
3 siblings, 1 reply; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-20 17:19 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> At this point we have sheaves enabled for all caches, but their refill
> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
> slabs - now a redundant caching layer that we are about to remove.
>
> The refill will thus be done from slabs on the node partial list.
> Introduce new functions that can do that in an optimized way as it's
> easier than modifying the __kmem_cache_alloc_bulk() call chain.
>
> Extend struct partial_context so it can return a list of slabs from the
> partial list with the sum of free objects in them within the requested
> min and max.
>
> Introduce get_partial_node_bulk() that removes the slabs from freelist
> and returns them in the list.
>
> Introduce get_freelist_nofreeze() which grabs the freelist without
> freezing the slab.
>
> Introduce alloc_from_new_slab() which can allocate multiple objects from
> a newly allocated slab where we don't need to synchronize with freeing.
> In some aspects it's similar to alloc_single_from_new_slab() but assumes
> the cache is a non-debug one so it can avoid some actions.
>
> Introduce __refill_objects() that uses the functions above to fill an
> array of objects. It has to handle the possibility that the slabs will
> contain more objects than were requested, due to concurrent freeing of
> objects to those slabs. When no more slabs on partial lists are
> available, it will allocate new slabs. It is intended to be only used
> in contexts where spinning is allowed, so add a WARN_ON_ONCE check there.
>
> Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are
> only refilled from contexts that allow spinning, or even blocking.
>
Some nits, but otherwise LGTM.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 284 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 264 insertions(+), 20 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 9bea8a65e510..dce80463f92c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -246,6 +246,9 @@ struct partial_context {
> gfp_t flags;
> unsigned int orig_size;
> void *object;
> + unsigned int min_objects;
> + unsigned int max_objects;
> + struct list_head slabs;
> };
>
> static inline bool kmem_cache_debug(struct kmem_cache *s)
> @@ -2650,9 +2653,9 @@ static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
> stat(s, SHEAF_FREE);
> }
>
> -static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> - size_t size, void **p);
> -
> +static unsigned int
> +__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max);
>
> static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> gfp_t gfp)
> @@ -2663,8 +2666,8 @@ static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> if (!to_fill)
> return 0;
>
> - filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
> - &sheaf->objects[sheaf->size]);
> + filled = __refill_objects(s, &sheaf->objects[sheaf->size], gfp,
> + to_fill, to_fill);
>
> sheaf->size += filled;
>
> @@ -3522,6 +3525,63 @@ static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
> #endif
> static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
>
> +static bool get_partial_node_bulk(struct kmem_cache *s,
> + struct kmem_cache_node *n,
> + struct partial_context *pc)
> +{
> + struct slab *slab, *slab2;
> + unsigned int total_free = 0;
> + unsigned long flags;
> +
> + /* Racy check to avoid taking the lock unnecessarily. */
> + if (!n || data_race(!n->nr_partial))
> + return false;
> +
> + INIT_LIST_HEAD(&pc->slabs);
> +
> + spin_lock_irqsave(&n->list_lock, flags);
> +
> + list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
> + struct freelist_counters flc;
> + unsigned int slab_free;
> +
> + if (!pfmemalloc_match(slab, pc->flags))
> + continue;
> +
> + /*
> + * determine the number of free objects in the slab racily
> + *
> + * due to atomic updates done by a racing free we should not
> + * read an inconsistent value here, but do a sanity check anyway
> + *
> + * slab_free is a lower bound due to subsequent concurrent
> + * freeing, the caller might get more objects than requested and
> + * must deal with it
> + */
> + flc.counters = data_race(READ_ONCE(slab->counters));
> + slab_free = flc.objects - flc.inuse;
> +
> + if (unlikely(slab_free > oo_objects(s->oo)))
> + continue;
> +
> + /* we have already min and this would get us over the max */
> + if (total_free >= pc->min_objects
> + && total_free + slab_free > pc->max_objects)
> + break;
> +
> + remove_partial(n, slab);
> +
> + list_add(&slab->slab_list, &pc->slabs);
> +
> + total_free += slab_free;
> + if (total_free >= pc->max_objects)
> + break;
From the above code it seems like you are trying to get at least
pc->min_objects and as close as possible to the pc->max_objects
without exceeding it (with a possibility that we will exceed both
min_objects and max_objects in one step). Is that indeed the intent?
Because otherwise you could simplify these conditions to stop once
you cross pc->min_objects.
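i.e. something like this hypothetical simplification of the loop above:

	total_free += slab_free;
	if (total_free >= pc->min_objects)
		break;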
> + }
> +
> + spin_unlock_irqrestore(&n->list_lock, flags);
> + return total_free > 0;
> +}
> +
> /*
> * Try to allocate a partial slab from a specific node.
> */
> @@ -4448,6 +4508,33 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
> return old.freelist;
> }
>
> +/*
> + * Get the slab's freelist and do not freeze it.
> + *
> + * Assumes the slab is isolated from node partial list and not frozen.
> + *
> + * Assumes this is performed only for caches without debugging so we
> + * don't need to worry about adding the slab to the full list
nit: Missing a period at the end of the above sentence.
> + */
> +static inline void *get_freelist_nofreeze(struct kmem_cache *s, struct slab *slab)
I was going to comment on similarities between
get_freelist_nofreeze(), get_freelist() and freeze_slab() and
possibility of consolidating them but then I saw you removing the
other functions in the next patch. So, I'm mentioning it here merely
for other reviewers not to trip on this.
> +{
> + struct freelist_counters old, new;
> +
> + do {
> + old.freelist = slab->freelist;
> + old.counters = slab->counters;
> +
> + new.freelist = NULL;
> + new.counters = old.counters;
> + VM_WARN_ON_ONCE(new.frozen);
> +
> + new.inuse = old.objects;
> +
> + } while (!slab_update_freelist(s, slab, &old, &new, "get_freelist_nofreeze"));
> +
> + return old.freelist;
> +}
> +
> /*
> * Freeze the partial slab and return the pointer to the freelist.
> */
> @@ -4471,6 +4558,65 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
> return old.freelist;
> }
>
> +/*
> + * If the object has been wiped upon free, make sure it's fully initialized by
> + * zeroing out freelist pointer.
> + *
> + * Note that we also wipe custom freelist pointers.
> + */
> +static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
> + void *obj)
> +{
> + if (unlikely(slab_want_init_on_free(s)) && obj &&
> + !freeptr_outside_object(s))
> + memset((void *)((char *)kasan_reset_tag(obj) + s->offset),
> + 0, sizeof(void *));
> +}
> +
> +static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> + void **p, unsigned int count, bool allow_spin)
> +{
> + unsigned int allocated = 0;
> + struct kmem_cache_node *n;
> + unsigned long flags;
> + void *object;
> +
> + if (!allow_spin && (slab->objects - slab->inuse) > count) {
> +
> + n = get_node(s, slab_nid(slab));
> +
> + if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> + /* Unlucky, discard newly allocated slab */
> + defer_deactivate_slab(slab, NULL);
> + return 0;
> + }
> + }
> +
> + object = slab->freelist;
> + while (object && allocated < count) {
> + p[allocated] = object;
> + object = get_freepointer(s, object);
> + maybe_wipe_obj_freeptr(s, p[allocated]);
> +
> + slab->inuse++;
> + allocated++;
> + }
> + slab->freelist = object;
> +
> + if (slab->freelist) {
nit: It's a bit subtle that the checks for slab->freelist here and the
earlier one for ((slab->objects - slab->inuse) > count) are
effectively equivalent. That's because this is a new slab and objects
can't be freed into it concurrently. I would feel better if both
checks were explicitly the same, like having "bool extra_objs =
(slab->objects - slab->inuse) > count;" and using it for both checks.
But this is minor, so feel free to ignore.
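Roughly, i.e. (just rearranging the code quoted here, untested):

	bool extra_objs = (slab->objects - slab->inuse) > count;

	if (!allow_spin && extra_objs) {
		n = get_node(s, slab_nid(slab));

		if (!spin_trylock_irqsave(&n->list_lock, flags)) {
			/* Unlucky, discard newly allocated slab */
			defer_deactivate_slab(slab, NULL);
			return 0;
		}
	}

	/* ... take objects off the freelist as above ... */

	if (extra_objs) {
		if (allow_spin) {
			n = get_node(s, slab_nid(slab));
			spin_lock_irqsave(&n->list_lock, flags);
		}
		add_partial(n, slab, DEACTIVATE_TO_HEAD);
		spin_unlock_irqrestore(&n->list_lock, flags);
	}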
> +
> + if (allow_spin) {
> + n = get_node(s, slab_nid(slab));
> + spin_lock_irqsave(&n->list_lock, flags);
> + }
> + add_partial(n, slab, DEACTIVATE_TO_HEAD);
> + spin_unlock_irqrestore(&n->list_lock, flags);
> + }
> +
> + inc_slabs_node(s, slab_nid(slab), slab->objects);
> + return allocated;
> +}
> +
> /*
> * Slow path. The lockless freelist is empty or we need to perform
> * debugging duties.
> @@ -4913,21 +5059,6 @@ static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
> return object;
> }
>
> -/*
> - * If the object has been wiped upon free, make sure it's fully initialized by
> - * zeroing out freelist pointer.
> - *
> - * Note that we also wipe custom freelist pointers.
> - */
> -static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
> - void *obj)
> -{
> - if (unlikely(slab_want_init_on_free(s)) && obj &&
> - !freeptr_outside_object(s))
> - memset((void *)((char *)kasan_reset_tag(obj) + s->offset),
> - 0, sizeof(void *));
> -}
> -
> static __fastpath_inline
> struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
> {
> @@ -5388,6 +5519,9 @@ static int __prefill_sheaf_pfmemalloc(struct kmem_cache *s,
> return ret;
> }
>
> +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> + size_t size, void **p);
> +
> /*
> * returns a sheaf that has at least the requested size
> * when prefilling is needed, do so with given gfp flags
> @@ -7463,6 +7597,116 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> }
> EXPORT_SYMBOL(kmem_cache_free_bulk);
>
> +static unsigned int
> +__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max)
> +{
> + struct slab *slab, *slab2;
> + struct partial_context pc;
> + unsigned int refilled = 0;
> + unsigned long flags;
> + void *object;
> + int node;
> +
> + pc.flags = gfp;
> + pc.min_objects = min;
> + pc.max_objects = max;
> +
> + node = numa_mem_id();
> +
> + if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
> + return 0;
> +
> + /* TODO: consider also other nodes? */
> + if (!get_partial_node_bulk(s, get_node(s, node), &pc))
> + goto new_slab;
> +
> + list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> +
> + list_del(&slab->slab_list);
> +
> + object = get_freelist_nofreeze(s, slab);
> +
> + while (object && refilled < max) {
> + p[refilled] = object;
> + object = get_freepointer(s, object);
> + maybe_wipe_obj_freeptr(s, p[refilled]);
> +
> + refilled++;
> + }
> +
> + /*
> + * Freelist had more objects than we can accommodate, we need to
> + * free them back. We can treat it like a detached freelist, just
> + * need to find the tail object.
> + */
> + if (unlikely(object)) {
> + void *head = object;
> + void *tail;
> + int cnt = 0;
> +
> + do {
> + tail = object;
> + cnt++;
> + object = get_freepointer(s, object);
> + } while (object);
> + do_slab_free(s, slab, head, tail, cnt, _RET_IP_);
> + }
> +
> + if (refilled >= max)
> + break;
> + }
> +
> + if (unlikely(!list_empty(&pc.slabs))) {
> + struct kmem_cache_node *n = get_node(s, node);
> +
> + spin_lock_irqsave(&n->list_lock, flags);
> +
> + list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> +
> + if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial))
> + continue;
> +
> + list_del(&slab->slab_list);
> + add_partial(n, slab, DEACTIVATE_TO_HEAD);
> + }
> +
> + spin_unlock_irqrestore(&n->list_lock, flags);
> +
> + /* any slabs left are completely free and for discard */
> + list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> +
> + list_del(&slab->slab_list);
> + discard_slab(s, slab);
> + }
> + }
> +
> +
> + if (likely(refilled >= min))
> + goto out;
> +
> +new_slab:
> +
> + slab = new_slab(s, pc.flags, node);
> + if (!slab)
> + goto out;
> +
> + stat(s, ALLOC_SLAB);
> +
> + /*
> + * TODO: possible optimization - if we know we will consume the whole
> + * slab we might skip creating the freelist?
> + */
> + refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
> + /* allow_spin = */ true);
> +
> + if (refilled < min)
> + goto new_slab;
Ok, allow_spin=true saves us from a potential infinite loop here. LGTM.
> +out:
> +
> + return refilled;
> +}
> +
> static inline
> int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> void **p)
>
> --
> 2.52.0
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 10/21] slab: remove cpu (partial) slabs usage from allocation paths
2026-01-16 14:40 ` [PATCH v3 10/21] slab: remove cpu (partial) slabs usage from allocation paths Vlastimil Babka
2026-01-20 4:20 ` Harry Yoo
2026-01-20 8:36 ` Hao Li
@ 2026-01-20 18:06 ` Suren Baghdasaryan
2026-01-21 13:56 ` Vlastimil Babka
2 siblings, 1 reply; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-20 18:06 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> We now rely on sheaves as the percpu caching layer and can refill them
> directly from partial or newly allocated slabs. Start removing the cpu
> (partial) slabs code, first from allocation paths.
>
> This means that any allocation not satisfied from percpu sheaves will
> end up in ___slab_alloc(), where we remove the usage of cpu (partial)
> slabs, so it will only perform get_partial() or new_slab(). In the
> latter case we reuse alloc_from_new_slab() (when we don't use
> the debug/tiny alloc_single_from_new_slab() variant).
>
> In get_partial_node() we used to return a slab for freezing as the cpu
> slab and to refill the partial slab. Now we only want to return a single
> object and leave the slab on the list (unless it became full). We can't
> simply reuse alloc_single_from_partial() as that assumes freeing uses
> free_to_partial_list(). Instead we need to use __slab_update_freelist()
> to work properly against a racing __slab_free().
>
> The rest of the changes consists of removing functions that no longer
> have any callers.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
A couple of nits, but otherwise seems fine to me.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/slub.c | 612 ++++++++------------------------------------------------------
> 1 file changed, 79 insertions(+), 533 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index dce80463f92c..698c0d940f06 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -245,7 +245,6 @@ static DEFINE_STATIC_KEY_FALSE(strict_numa);
> struct partial_context {
> gfp_t flags;
> unsigned int orig_size;
> - void *object;
> unsigned int min_objects;
> unsigned int max_objects;
> struct list_head slabs;
> @@ -611,36 +610,6 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
> return freelist_ptr_decode(s, p, ptr_addr);
> }
>
> -static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> -{
> - prefetchw(object + s->offset);
> -}
> -
> -/*
> - * When running under KMSAN, get_freepointer_safe() may return an uninitialized
> - * pointer value in the case the current thread loses the race for the next
> - * memory chunk in the freelist. In that case this_cpu_cmpxchg_double() in
> - * slab_alloc_node() will fail, so the uninitialized value won't be used, but
> - * KMSAN will still check all arguments of cmpxchg because of imperfect
> - * handling of inline assembly.
> - * To work around this problem, we apply __no_kmsan_checks to ensure that
> - * get_freepointer_safe() returns initialized memory.
> - */
> -__no_kmsan_checks
> -static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
> -{
> - unsigned long freepointer_addr;
> - freeptr_t p;
> -
> - if (!debug_pagealloc_enabled_static())
> - return get_freepointer(s, object);
> -
> - object = kasan_reset_tag(object);
> - freepointer_addr = (unsigned long)object + s->offset;
> - copy_from_kernel_nofault(&p, (freeptr_t *)freepointer_addr, sizeof(p));
> - return freelist_ptr_decode(s, p, freepointer_addr);
> -}
> -
> static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
> {
> unsigned long freeptr_addr = (unsigned long)object + s->offset;
> @@ -720,23 +689,11 @@ static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
> nr_slabs = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
> s->cpu_partial_slabs = nr_slabs;
> }
> -
> -static inline unsigned int slub_get_cpu_partial(struct kmem_cache *s)
> -{
> - return s->cpu_partial_slabs;
> -}
> -#else
> -#ifdef SLAB_SUPPORTS_SYSFS
> +#elif defined(SLAB_SUPPORTS_SYSFS)
> static inline void
> slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
> {
> }
> -#endif
> -
> -static inline unsigned int slub_get_cpu_partial(struct kmem_cache *s)
> -{
> - return 0;
> -}
> #endif /* CONFIG_SLUB_CPU_PARTIAL */
>
> /*
> @@ -1077,7 +1034,7 @@ static void set_track_update(struct kmem_cache *s, void *object,
> p->handle = handle;
> #endif
> p->addr = addr;
> - p->cpu = smp_processor_id();
> + p->cpu = raw_smp_processor_id();
> p->pid = current->pid;
> p->when = jiffies;
> }
> @@ -3583,15 +3540,15 @@ static bool get_partial_node_bulk(struct kmem_cache *s,
> }
>
> /*
> - * Try to allocate a partial slab from a specific node.
> + * Try to allocate object from a partial slab on a specific node.
> */
> -static struct slab *get_partial_node(struct kmem_cache *s,
> - struct kmem_cache_node *n,
> - struct partial_context *pc)
> +static void *get_partial_node(struct kmem_cache *s,
> + struct kmem_cache_node *n,
> + struct partial_context *pc)
Naming for get_partial()/get_partial_node()/get_any_partial() made
sense when they returned a slab. Now that they return object(s) the
naming is a bit confusing. I think renaming to
get_from_partial()/get_from_partial_node()/get_from_any_partial()
would be more appropriate.
> {
> - struct slab *slab, *slab2, *partial = NULL;
> + struct slab *slab, *slab2;
> unsigned long flags;
> - unsigned int partial_slabs = 0;
> + void *object = NULL;
>
> /*
> * Racy check. If we mistakenly see no partial slabs then we
> @@ -3607,54 +3564,55 @@ static struct slab *get_partial_node(struct kmem_cache *s,
> else if (!spin_trylock_irqsave(&n->list_lock, flags))
> return NULL;
> list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
> +
> + struct freelist_counters old, new;
> +
> if (!pfmemalloc_match(slab, pc->flags))
> continue;
>
> if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> - void *object = alloc_single_from_partial(s, n, slab,
> + object = alloc_single_from_partial(s, n, slab,
> pc->orig_size);
> - if (object) {
> - partial = slab;
> - pc->object = object;
> + if (object)
> break;
> - }
> continue;
> }
>
> - remove_partial(n, slab);
> + /*
> + * get a single object from the slab. This might race against
> + * __slab_free(), which however has to take the list_lock if
> + * it's about to make the slab fully free.
> + */
> + do {
> + old.freelist = slab->freelist;
> + old.counters = slab->counters;
>
> - if (!partial) {
> - partial = slab;
> - stat(s, ALLOC_FROM_PARTIAL);
> + new.freelist = get_freepointer(s, old.freelist);
> + new.counters = old.counters;
> + new.inuse++;
>
> - if ((slub_get_cpu_partial(s) == 0)) {
> - break;
> - }
> - } else {
> - put_cpu_partial(s, slab, 0);
> - stat(s, CPU_PARTIAL_NODE);
> + } while (!__slab_update_freelist(s, slab, &old, &new, "get_partial_node"));
>
> - if (++partial_slabs > slub_get_cpu_partial(s) / 2) {
> - break;
> - }
> - }
> + object = old.freelist;
> + if (!new.freelist)
> + remove_partial(n, slab);
> +
> + break;
> }
> spin_unlock_irqrestore(&n->list_lock, flags);
> - return partial;
> + return object;
> }
>
> /*
> - * Get a slab from somewhere. Search in increasing NUMA distances.
> + * Get an object from somewhere. Search in increasing NUMA distances.
> */
> -static struct slab *get_any_partial(struct kmem_cache *s,
> - struct partial_context *pc)
> +static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
> {
> #ifdef CONFIG_NUMA
> struct zonelist *zonelist;
> struct zoneref *z;
> struct zone *zone;
> enum zone_type highest_zoneidx = gfp_zone(pc->flags);
> - struct slab *slab;
> unsigned int cpuset_mems_cookie;
>
> /*
> @@ -3689,8 +3647,10 @@ static struct slab *get_any_partial(struct kmem_cache *s,
>
> if (n && cpuset_zone_allowed(zone, pc->flags) &&
> n->nr_partial > s->min_partial) {
> - slab = get_partial_node(s, n, pc);
> - if (slab) {
> +
> + void *object = get_partial_node(s, n, pc);
> +
> + if (object) {
> /*
> * Don't check read_mems_allowed_retry()
> * here - if mems_allowed was updated in
> @@ -3698,7 +3658,7 @@ static struct slab *get_any_partial(struct kmem_cache *s,
> * between allocation and the cpuset
> * update
> */
> - return slab;
> + return object;
> }
> }
> }
> @@ -3708,20 +3668,20 @@ static struct slab *get_any_partial(struct kmem_cache *s,
> }
>
> /*
> - * Get a partial slab, lock it and return it.
> + * Get an object from a partial slab
> */
> -static struct slab *get_partial(struct kmem_cache *s, int node,
> - struct partial_context *pc)
> +static void *get_partial(struct kmem_cache *s, int node,
> + struct partial_context *pc)
> {
> - struct slab *slab;
> int searchnode = node;
> + void *object;
>
> if (node == NUMA_NO_NODE)
> searchnode = numa_mem_id();
>
> - slab = get_partial_node(s, get_node(s, searchnode), pc);
> - if (slab || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
> - return slab;
> + object = get_partial_node(s, get_node(s, searchnode), pc);
> + if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
> + return object;
>
> return get_any_partial(s, pc);
> }
> @@ -4281,19 +4241,6 @@ static int slub_cpu_dead(unsigned int cpu)
> return 0;
> }
>
> -/*
> - * Check if the objects in a per cpu structure fit numa
> - * locality expectations.
> - */
> -static inline int node_match(struct slab *slab, int node)
> -{
> -#ifdef CONFIG_NUMA
> - if (node != NUMA_NO_NODE && slab_nid(slab) != node)
> - return 0;
> -#endif
> - return 1;
> -}
> -
> #ifdef CONFIG_SLUB_DEBUG
> static int count_free(struct slab *slab)
> {
> @@ -4478,36 +4425,6 @@ __update_cpu_freelist_fast(struct kmem_cache *s,
> &old.freelist_tid, new.freelist_tid);
> }
>
> -/*
> - * Check the slab->freelist and either transfer the freelist to the
> - * per cpu freelist or deactivate the slab.
> - *
> - * The slab is still frozen if the return value is not NULL.
> - *
> - * If this function returns NULL then the slab has been unfrozen.
> - */
> -static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
> -{
> - struct freelist_counters old, new;
> -
> - lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
> -
> - do {
> - old.freelist = slab->freelist;
> - old.counters = slab->counters;
> -
> - new.freelist = NULL;
> - new.counters = old.counters;
> -
> - new.inuse = old.objects;
> - new.frozen = old.freelist != NULL;
> -
> -
> - } while (!__slab_update_freelist(s, slab, &old, &new, "get_freelist"));
> -
> - return old.freelist;
> -}
> -
> /*
> * Get the slab's freelist and do not freeze it.
> *
> @@ -4535,29 +4452,6 @@ static inline void *get_freelist_nofreeze(struct kmem_cache *s, struct slab *sla
> return old.freelist;
> }
>
> -/*
> - * Freeze the partial slab and return the pointer to the freelist.
> - */
> -static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
> -{
> - struct freelist_counters old, new;
> -
> - do {
> - old.freelist = slab->freelist;
> - old.counters = slab->counters;
> -
> - new.freelist = NULL;
> - new.counters = old.counters;
> - VM_BUG_ON(new.frozen);
> -
> - new.inuse = old.objects;
> - new.frozen = 1;
> -
> - } while (!slab_update_freelist(s, slab, &old, &new, "freeze_slab"));
> -
> - return old.freelist;
> -}
> -
> /*
> * If the object has been wiped upon free, make sure it's fully initialized by
> * zeroing out freelist pointer.
> @@ -4618,170 +4512,23 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> }
>
> /*
> - * Slow path. The lockless freelist is empty or we need to perform
> - * debugging duties.
> - *
> - * Processing is still very fast if new objects have been freed to the
> - * regular freelist. In that case we simply take over the regular freelist
> - * as the lockless freelist and zap the regular freelist.
> - *
> - * If that is not working then we fall back to the partial lists. We take the
> - * first element of the freelist as the object to allocate now and move the
> - * rest of the freelist to the lockless freelist.
> - *
> - * And if we were unable to get a new slab from the partial slab lists then
> - * we need to allocate a new slab. This is the slowest path since it involves
> - * a call to the page allocator and the setup of a new slab.
> + * Slow path. We failed to allocate via percpu sheaves or they are not available
> + * due to bootstrap or debugging enabled or SLUB_TINY.
> *
> - * Version of __slab_alloc to use when we know that preemption is
> - * already disabled (which is the case for bulk allocation).
> + * We try to allocate from partial slab lists and fall back to allocating a new
> + * slab.
> */
> static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> - unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
> + unsigned long addr, unsigned int orig_size)
> {
> bool allow_spin = gfpflags_allow_spinning(gfpflags);
> void *freelist;
> struct slab *slab;
> - unsigned long flags;
> struct partial_context pc;
> bool try_thisnode = true;
>
> stat(s, ALLOC_SLOWPATH);
>
> -reread_slab:
> -
> - slab = READ_ONCE(c->slab);
> - if (!slab) {
> - /*
> - * if the node is not online or has no normal memory, just
> - * ignore the node constraint
> - */
> - if (unlikely(node != NUMA_NO_NODE &&
> - !node_isset(node, slab_nodes)))
> - node = NUMA_NO_NODE;
> - goto new_slab;
> - }
> -
> - if (unlikely(!node_match(slab, node))) {
> - /*
> - * same as above but node_match() being false already
> - * implies node != NUMA_NO_NODE.
> - *
> - * We don't strictly honor pfmemalloc and NUMA preferences
> - * when !allow_spin because:
> - *
> - * 1. Most kmalloc() users allocate objects on the local node,
> - * so kmalloc_nolock() tries not to interfere with them by
> - * deactivating the cpu slab.
> - *
> - * 2. Deactivating due to NUMA or pfmemalloc mismatch may cause
> - * unnecessary slab allocations even when n->partial list
> - * is not empty.
> - */
> - if (!node_isset(node, slab_nodes) ||
> - !allow_spin) {
> - node = NUMA_NO_NODE;
> - } else {
> - stat(s, ALLOC_NODE_MISMATCH);
> - goto deactivate_slab;
> - }
> - }
> -
> - /*
> - * By rights, we should be searching for a slab page that was
> - * PFMEMALLOC but right now, we are losing the pfmemalloc
> - * information when the page leaves the per-cpu allocator
> - */
> - if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin))
> - goto deactivate_slab;
> -
> - /* must check again c->slab in case we got preempted and it changed */
> - local_lock_cpu_slab(s, flags);
> -
> - if (unlikely(slab != c->slab)) {
> - local_unlock_cpu_slab(s, flags);
> - goto reread_slab;
> - }
> - freelist = c->freelist;
> - if (freelist)
> - goto load_freelist;
> -
> - freelist = get_freelist(s, slab);
> -
> - if (!freelist) {
> - c->slab = NULL;
> - c->tid = next_tid(c->tid);
> - local_unlock_cpu_slab(s, flags);
> - stat(s, DEACTIVATE_BYPASS);
> - goto new_slab;
> - }
> -
> - stat(s, ALLOC_REFILL);
> -
> -load_freelist:
> -
> - lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
> -
> - /*
> - * freelist is pointing to the list of objects to be used.
> - * slab is pointing to the slab from which the objects are obtained.
> - * That slab must be frozen for per cpu allocations to work.
> - */
> - VM_BUG_ON(!c->slab->frozen);
> - c->freelist = get_freepointer(s, freelist);
> - c->tid = next_tid(c->tid);
> - local_unlock_cpu_slab(s, flags);
> - return freelist;
> -
> -deactivate_slab:
> -
> - local_lock_cpu_slab(s, flags);
> - if (slab != c->slab) {
> - local_unlock_cpu_slab(s, flags);
> - goto reread_slab;
> - }
> - freelist = c->freelist;
> - c->slab = NULL;
> - c->freelist = NULL;
> - c->tid = next_tid(c->tid);
> - local_unlock_cpu_slab(s, flags);
> - deactivate_slab(s, slab, freelist);
> -
> -new_slab:
> -
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> - while (slub_percpu_partial(c)) {
> - local_lock_cpu_slab(s, flags);
> - if (unlikely(c->slab)) {
> - local_unlock_cpu_slab(s, flags);
> - goto reread_slab;
> - }
> - if (unlikely(!slub_percpu_partial(c))) {
> - local_unlock_cpu_slab(s, flags);
> - /* we were preempted and partial list got empty */
> - goto new_objects;
> - }
> -
> - slab = slub_percpu_partial(c);
> - slub_set_percpu_partial(c, slab);
> -
> - if (likely(node_match(slab, node) &&
> - pfmemalloc_match(slab, gfpflags)) ||
> - !allow_spin) {
> - c->slab = slab;
> - freelist = get_freelist(s, slab);
> - VM_BUG_ON(!freelist);
> - stat(s, CPU_PARTIAL_ALLOC);
> - goto load_freelist;
> - }
> -
> - local_unlock_cpu_slab(s, flags);
> -
> - slab->next = NULL;
> - __put_partials(s, slab);
> - }
> -#endif
> -
> new_objects:
>
> pc.flags = gfpflags;
> @@ -4806,33 +4553,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> }
>
> pc.orig_size = orig_size;
> - slab = get_partial(s, node, &pc);
> - if (slab) {
> - if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> - freelist = pc.object;
> - /*
> - * For debug caches here we had to go through
> - * alloc_single_from_partial() so just store the
> - * tracking info and return the object.
> - *
> - * Due to disabled preemption we need to disallow
> - * blocking. The flags are further adjusted by
> - * gfp_nested_mask() in stack_depot itself.
> - */
> - if (s->flags & SLAB_STORE_USER)
> - set_track(s, freelist, TRACK_ALLOC, addr,
> - gfpflags & ~(__GFP_DIRECT_RECLAIM));
> -
> - return freelist;
> - }
> -
> - freelist = freeze_slab(s, slab);
> - goto retry_load_slab;
> - }
> + freelist = get_partial(s, node, &pc);
I think all this cleanup results in the `freelist` variable always
storing a single object. Maybe rename it to `object`?
> + if (freelist)
> + goto success;
>
> - slub_put_cpu_ptr(s->cpu_slab);
> slab = new_slab(s, pc.flags, node);
> - c = slub_get_cpu_ptr(s->cpu_slab);
>
> if (unlikely(!slab)) {
> if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
> @@ -4849,68 +4574,29 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> freelist = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
>
> - if (unlikely(!freelist)) {
> - /* This could cause an endless loop. Fail instead. */
> - if (!allow_spin)
> - return NULL;
> - goto new_objects;
> - }
> -
> - if (s->flags & SLAB_STORE_USER)
> - set_track(s, freelist, TRACK_ALLOC, addr,
> - gfpflags & ~(__GFP_DIRECT_RECLAIM));
> -
> - return freelist;
> - }
> -
> - /*
> - * No other reference to the slab yet so we can
> - * muck around with it freely without cmpxchg
> - */
> - freelist = slab->freelist;
> - slab->freelist = NULL;
> - slab->inuse = slab->objects;
> - slab->frozen = 1;
> -
> - inc_slabs_node(s, slab_nid(slab), slab->objects);
> + if (likely(freelist))
> + goto success;
> + } else {
> + alloc_from_new_slab(s, slab, &freelist, 1, allow_spin);
>
> - if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin)) {
> - /*
> - * For !pfmemalloc_match() case we don't load freelist so that
> - * we don't make further mismatched allocations easier.
> - */
> - deactivate_slab(s, slab, get_freepointer(s, freelist));
> - return freelist;
> + /* we don't need to check SLAB_STORE_USER here */
> + if (likely(freelist))
> + return freelist;
> }
>
> -retry_load_slab:
> -
> - local_lock_cpu_slab(s, flags);
> - if (unlikely(c->slab)) {
> - void *flush_freelist = c->freelist;
> - struct slab *flush_slab = c->slab;
> -
> - c->slab = NULL;
> - c->freelist = NULL;
> - c->tid = next_tid(c->tid);
> -
> - local_unlock_cpu_slab(s, flags);
> -
> - if (unlikely(!allow_spin)) {
> - /* Reentrant slub cannot take locks, defer */
> - defer_deactivate_slab(flush_slab, flush_freelist);
> - } else {
> - deactivate_slab(s, flush_slab, flush_freelist);
> - }
> + if (allow_spin)
> + goto new_objects;
>
> - stat(s, CPUSLAB_FLUSH);
> + /* This could cause an endless loop. Fail instead. */
> + return NULL;
>
> - goto retry_load_slab;
> - }
> - c->slab = slab;
> +success:
> + if (kmem_cache_debug_flags(s, SLAB_STORE_USER))
> + set_track(s, freelist, TRACK_ALLOC, addr, gfpflags);
>
> - goto load_freelist;
> + return freelist;
> }
> +
> /*
> * We disallow kprobes in ___slab_alloc() to prevent reentrance
> *
> @@ -4925,87 +4611,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> */
> NOKPROBE_SYMBOL(___slab_alloc);
>
> -/*
> - * A wrapper for ___slab_alloc() for contexts where preemption is not yet
> - * disabled. Compensates for possible cpu changes by refetching the per cpu area
> - * pointer.
> - */
> -static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> - unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
> -{
> - void *p;
> -
> -#ifdef CONFIG_PREEMPT_COUNT
> - /*
> - * We may have been preempted and rescheduled on a different
> - * cpu before disabling preemption. Need to reload cpu area
> - * pointer.
> - */
> - c = slub_get_cpu_ptr(s->cpu_slab);
> -#endif
> - if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
> - if (local_lock_is_locked(&s->cpu_slab->lock)) {
> - /*
> - * EBUSY is an internal signal to kmalloc_nolock() to
> - * retry a different bucket. It's not propagated
> - * to the caller.
> - */
> - p = ERR_PTR(-EBUSY);
> - goto out;
> - }
> - }
> - p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> -out:
> -#ifdef CONFIG_PREEMPT_COUNT
> - slub_put_cpu_ptr(s->cpu_slab);
> -#endif
> - return p;
> -}
> -
> static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
> gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
> {
> - struct kmem_cache_cpu *c;
> - struct slab *slab;
> - unsigned long tid;
> void *object;
>
> -redo:
> - /*
> - * Must read kmem_cache cpu data via this cpu ptr. Preemption is
> - * enabled. We may switch back and forth between cpus while
> - * reading from one cpu area. That does not matter as long
> - * as we end up on the original cpu again when doing the cmpxchg.
> - *
> - * We must guarantee that tid and kmem_cache_cpu are retrieved on the
> - * same cpu. We read first the kmem_cache_cpu pointer and use it to read
> - * the tid. If we are preempted and switched to another cpu between the
> - * two reads, it's OK as the two are still associated with the same cpu
> - * and cmpxchg later will validate the cpu.
> - */
> - c = raw_cpu_ptr(s->cpu_slab);
> - tid = READ_ONCE(c->tid);
> -
> - /*
> - * Irqless object alloc/free algorithm used here depends on sequence
> - * of fetching cpu_slab's data. tid should be fetched before anything
> - * on c to guarantee that object and slab associated with previous tid
> - * won't be used with current tid. If we fetch tid first, object and
> - * slab could be one associated with next tid and our alloc/free
> - * request will be failed. In this case, we will retry. So, no problem.
> - */
> - barrier();
> -
> - /*
> - * The transaction ids are globally unique per cpu and per operation on
> - * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
> - * occurs on the right processor and that there was no operation on the
> - * linked list in between.
> - */
> -
> - object = c->freelist;
> - slab = c->slab;
> -
> #ifdef CONFIG_NUMA
> if (static_branch_unlikely(&strict_numa) &&
> node == NUMA_NO_NODE) {
> @@ -5014,47 +4624,20 @@ static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
>
> if (mpol) {
> /*
> - * Special BIND rule support. If existing slab
> + * Special BIND rule support. If the local node
> * is in permitted set then do not redirect
> * to a particular node.
> * Otherwise we apply the memory policy to get
> * the node we need to allocate on.
> */
> - if (mpol->mode != MPOL_BIND || !slab ||
> - !node_isset(slab_nid(slab), mpol->nodes))
> -
> + if (mpol->mode != MPOL_BIND ||
> + !node_isset(numa_mem_id(), mpol->nodes))
> node = mempolicy_slab_node();
> }
> }
> #endif
>
> - if (!USE_LOCKLESS_FAST_PATH() ||
> - unlikely(!object || !slab || !node_match(slab, node))) {
> - object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
> - } else {
> - void *next_object = get_freepointer_safe(s, object);
> -
> - /*
> - * The cmpxchg will only match if there was no additional
> - * operation and if we are on the right processor.
> - *
> - * The cmpxchg does the following atomically (without lock
> - * semantics!)
> - * 1. Relocate first pointer to the current per cpu area.
> - * 2. Verify that tid and freelist have not been changed
> - * 3. If they were not changed replace tid and freelist
> - *
> - * Since this is without lock semantics the protection is only
> - * against code executing on this cpu *not* from access by
> - * other cpus.
> - */
> - if (unlikely(!__update_cpu_freelist_fast(s, object, next_object, tid))) {
> - note_cmpxchg_failure("slab_alloc", s, tid);
> - goto redo;
> - }
> - prefetch_freepointer(s, next_object);
> - stat(s, ALLOC_FASTPATH);
> - }
> + object = ___slab_alloc(s, gfpflags, node, addr, orig_size);
>
> return object;
> }
> @@ -7711,62 +7294,25 @@ static inline
> int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> void **p)
> {
> - struct kmem_cache_cpu *c;
> - unsigned long irqflags;
> int i;
>
> /*
> - * Drain objects in the per cpu slab, while disabling local
> - * IRQs, which protects against PREEMPT and interrupts
> - * handlers invoking normal fastpath.
> + * TODO: this might be more efficient (if necessary) by reusing
> + * __refill_objects()
> */
> - c = slub_get_cpu_ptr(s->cpu_slab);
> - local_lock_irqsave(&s->cpu_slab->lock, irqflags);
> -
> for (i = 0; i < size; i++) {
> - void *object = c->freelist;
>
> - if (unlikely(!object)) {
> - /*
> - * We may have removed an object from c->freelist using
> - * the fastpath in the previous iteration; in that case,
> - * c->tid has not been bumped yet.
> - * Since ___slab_alloc() may reenable interrupts while
> - * allocating memory, we should bump c->tid now.
> - */
> - c->tid = next_tid(c->tid);
> + p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
> + s->object_size);
> + if (unlikely(!p[i]))
> + goto error;
>
> - local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
> -
> - /*
> - * Invoking slow path likely have side-effect
> - * of re-populating per CPU c->freelist
> - */
> - p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
> - _RET_IP_, c, s->object_size);
> - if (unlikely(!p[i]))
> - goto error;
> -
> - c = this_cpu_ptr(s->cpu_slab);
> - maybe_wipe_obj_freeptr(s, p[i]);
> -
> - local_lock_irqsave(&s->cpu_slab->lock, irqflags);
> -
> - continue; /* goto for-loop */
> - }
> - c->freelist = get_freepointer(s, object);
> - p[i] = object;
> maybe_wipe_obj_freeptr(s, p[i]);
> - stat(s, ALLOC_FASTPATH);
> }
> - c->tid = next_tid(c->tid);
> - local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
> - slub_put_cpu_ptr(s->cpu_slab);
>
> return i;
>
> error:
> - slub_put_cpu_ptr(s->cpu_slab);
> __kmem_cache_free_bulk(s, i, p);
> return 0;
>
>
> --
> 2.52.0
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 05/21] slab: add sheaves to most caches
2026-01-16 14:40 ` [PATCH v3 05/21] slab: add sheaves to most caches Vlastimil Babka
@ 2026-01-20 18:47 ` Breno Leitao
2026-01-21 8:12 ` Vlastimil Babka
0 siblings, 1 reply; 106+ messages in thread
From: Breno Leitao @ 2026-01-20 18:47 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
Hello Vlastimil,
On Fri, Jan 16, 2026 at 03:40:25PM +0100, Vlastimil Babka wrote:
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -7863,6 +7863,48 @@ static void set_cpu_partial(struct kmem_cache *s)
> #endif
> }
>
> +static unsigned int calculate_sheaf_capacity(struct kmem_cache *s,
> + struct kmem_cache_args *args)
> +
> +{
> + unsigned int capacity;
> + size_t size;
> +
> +
> + if (IS_ENABLED(CONFIG_SLUB_TINY) || s->flags & SLAB_DEBUG_FLAGS)
> + return 0;
> +
> + /* bootstrap caches can't have sheaves for now */
> + if (s->flags & SLAB_NO_OBJ_EXT)
> + return 0;
I've been testing this on my arm64 environment with some debug patches,
and the machine became unbootable.
I am wondering if you should avoid SLAB_NOLEAKTRACE as well. I got the
impression it is hitting this infinite loop:
-> slab allocation
-> kmemleak_alloc()
-> kmem_cache_alloc(object_cache)
-> alloc_from_pcs() / __pcs_replace_empty_main()
-> alloc_full_sheaf() -> kzalloc()
-> kmemleak_alloc()
-> ... (infinite recursion)
What about something as:
diff --git a/mm/slub.c b/mm/slub.c
index 26804859821a..0a6481aaa744 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -7445,8 +7445,13 @@ static unsigned int calculate_sheaf_capacity(struct kmem_cache *s,
if (IS_ENABLED(CONFIG_SLUB_TINY) || s->flags & SLAB_DEBUG_FLAGS)
return 0;
- /* bootstrap caches can't have sheaves for now */
- if (s->flags & SLAB_NO_OBJ_EXT)
+ /*
+ * bootstrap caches can't have sheaves for now (SLAB_NO_OBJ_EXT).
+ * SLAB_NOLEAKTRACE caches (e.g., kmemleak's object_cache) must not
+ * have sheaves to avoid recursion when sheaf allocation triggers
+ * kmemleak tracking.
+ */
+ if (s->flags & (SLAB_NO_OBJ_EXT | SLAB_NOLEAKTRACE))
return 0;
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL
2026-01-16 14:40 ` [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL Vlastimil Babka
2026-01-20 5:24 ` Harry Yoo
2026-01-20 12:10 ` Hao Li
@ 2026-01-20 22:25 ` Suren Baghdasaryan
2026-01-21 0:58 ` Harry Yoo
2026-01-21 14:22 ` Vlastimil Babka
2 siblings, 2 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-20 22:25 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> We have removed the partial slab usage from allocation paths. Now remove
> the whole config option and associated code.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
I did? Well, if so, I missed some remaining mentions of cpu partial caches:
- slub.c has several hits on "cpu partial" in the comments.
- there is one hit on "put_cpu_partial" in slub.c in the comments.
Should we also update Documentation/ABI/testing/sysfs-kernel-slab to
say that from now on cpu_partial control always reads 0?
Once addressed, please feel free to keep my Reviewed-by.
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/Kconfig | 11 ---
> mm/slab.h | 29 ------
> mm/slub.c | 321 ++++---------------------------------------------------------
> 3 files changed, 19 insertions(+), 342 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index bd0ea5454af8..08593674cd20 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -247,17 +247,6 @@ config SLUB_STATS
> out which slabs are relevant to a particular load.
> Try running: slabinfo -DA
>
> -config SLUB_CPU_PARTIAL
> - default y
> - depends on SMP && !SLUB_TINY
> - bool "Enable per cpu partial caches"
> - help
> - Per cpu partial caches accelerate objects allocation and freeing
> - that is local to a processor at the price of more indeterminism
> - in the latency of the free. On overflow these caches will be cleared
> - which requires the taking of locks that may cause latency spikes.
> - Typically one would choose no for a realtime system.
> -
> config RANDOM_KMALLOC_CACHES
> default n
> depends on !SLUB_TINY
> diff --git a/mm/slab.h b/mm/slab.h
> index cb48ce5014ba..e77260720994 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -77,12 +77,6 @@ struct slab {
> struct llist_node llnode;
> void *flush_freelist;
> };
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> - struct {
> - struct slab *next;
> - int slabs; /* Nr of slabs left */
> - };
> -#endif
> };
> /* Double-word boundary */
> struct freelist_counters;
> @@ -188,23 +182,6 @@ static inline size_t slab_size(const struct slab *slab)
> return PAGE_SIZE << slab_order(slab);
> }
>
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> -#define slub_percpu_partial(c) ((c)->partial)
> -
> -#define slub_set_percpu_partial(c, p) \
> -({ \
> - slub_percpu_partial(c) = (p)->next; \
> -})
> -
> -#define slub_percpu_partial_read_once(c) READ_ONCE(slub_percpu_partial(c))
> -#else
> -#define slub_percpu_partial(c) NULL
> -
> -#define slub_set_percpu_partial(c, p)
> -
> -#define slub_percpu_partial_read_once(c) NULL
> -#endif // CONFIG_SLUB_CPU_PARTIAL
> -
> /*
> * Word size structure that can be atomically updated or read and that
> * contains both the order and the number of objects that a slab of the
> @@ -228,12 +205,6 @@ struct kmem_cache {
> unsigned int object_size; /* Object size without metadata */
> struct reciprocal_value reciprocal_size;
> unsigned int offset; /* Free pointer offset */
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> - /* Number of per cpu partial objects to keep around */
> - unsigned int cpu_partial;
> - /* Number of per cpu partial slabs to keep around */
> - unsigned int cpu_partial_slabs;
> -#endif
> unsigned int sheaf_capacity;
> struct kmem_cache_order_objects oo;
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 698c0d940f06..6b1280f7900a 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -263,15 +263,6 @@ void *fixup_red_left(struct kmem_cache *s, void *p)
> return p;
> }
>
> -static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
> -{
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> - return !kmem_cache_debug(s);
> -#else
> - return false;
> -#endif
> -}
> -
> /*
> * Issues still to be resolved:
> *
> @@ -426,9 +417,6 @@ struct freelist_tid {
> struct kmem_cache_cpu {
> struct freelist_tid;
> struct slab *slab; /* The slab from which we are allocating */
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> - struct slab *partial; /* Partially allocated slabs */
> -#endif
> local_trylock_t lock; /* Protects the fields above */
> #ifdef CONFIG_SLUB_STATS
> unsigned int stat[NR_SLUB_STAT_ITEMS];
> @@ -673,29 +661,6 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
> return x.x & OO_MASK;
> }
>
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> -static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
> -{
> - unsigned int nr_slabs;
> -
> - s->cpu_partial = nr_objects;
> -
> - /*
> - * We take the number of objects but actually limit the number of
> - * slabs on the per cpu partial list, in order to limit excessive
> - * growth of the list. For simplicity we assume that the slabs will
> - * be half-full.
> - */
> - nr_slabs = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
> - s->cpu_partial_slabs = nr_slabs;
> -}
> -#elif defined(SLAB_SUPPORTS_SYSFS)
> -static inline void
> -slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
> -{
> -}
> -#endif /* CONFIG_SLUB_CPU_PARTIAL */
> -
> /*
> * If network-based swap is enabled, slub must keep track of whether memory
> * were allocated from pfmemalloc reserves.
> @@ -3474,12 +3439,6 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> return object;
> }
>
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> -static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain);
> -#else
> -static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
> - int drain) { }
> -#endif
> static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
>
> static bool get_partial_node_bulk(struct kmem_cache *s,
> @@ -3898,131 +3857,6 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> #define local_unlock_cpu_slab(s, flags) \
> local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
>
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> -static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
> -{
> - struct kmem_cache_node *n = NULL, *n2 = NULL;
> - struct slab *slab, *slab_to_discard = NULL;
> - unsigned long flags = 0;
> -
> - while (partial_slab) {
> - slab = partial_slab;
> - partial_slab = slab->next;
> -
> - n2 = get_node(s, slab_nid(slab));
> - if (n != n2) {
> - if (n)
> - spin_unlock_irqrestore(&n->list_lock, flags);
> -
> - n = n2;
> - spin_lock_irqsave(&n->list_lock, flags);
> - }
> -
> - if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial)) {
> - slab->next = slab_to_discard;
> - slab_to_discard = slab;
> - } else {
> - add_partial(n, slab, DEACTIVATE_TO_TAIL);
> - stat(s, FREE_ADD_PARTIAL);
> - }
> - }
> -
> - if (n)
> - spin_unlock_irqrestore(&n->list_lock, flags);
> -
> - while (slab_to_discard) {
> - slab = slab_to_discard;
> - slab_to_discard = slab_to_discard->next;
> -
> - stat(s, DEACTIVATE_EMPTY);
> - discard_slab(s, slab);
> - stat(s, FREE_SLAB);
> - }
> -}
> -
> -/*
> - * Put all the cpu partial slabs to the node partial list.
> - */
> -static void put_partials(struct kmem_cache *s)
> -{
> - struct slab *partial_slab;
> - unsigned long flags;
> -
> - local_lock_irqsave(&s->cpu_slab->lock, flags);
> - partial_slab = this_cpu_read(s->cpu_slab->partial);
> - this_cpu_write(s->cpu_slab->partial, NULL);
> - local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> -
> - if (partial_slab)
> - __put_partials(s, partial_slab);
> -}
> -
> -static void put_partials_cpu(struct kmem_cache *s,
> - struct kmem_cache_cpu *c)
> -{
> - struct slab *partial_slab;
> -
> - partial_slab = slub_percpu_partial(c);
> - c->partial = NULL;
> -
> - if (partial_slab)
> - __put_partials(s, partial_slab);
> -}
> -
> -/*
> - * Put a slab into a partial slab slot if available.
> - *
> - * If we did not find a slot then simply move all the partials to the
> - * per node partial list.
> - */
> -static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
> -{
> - struct slab *oldslab;
> - struct slab *slab_to_put = NULL;
> - unsigned long flags;
> - int slabs = 0;
> -
> - local_lock_cpu_slab(s, flags);
> -
> - oldslab = this_cpu_read(s->cpu_slab->partial);
> -
> - if (oldslab) {
> - if (drain && oldslab->slabs >= s->cpu_partial_slabs) {
> - /*
> - * Partial array is full. Move the existing set to the
> - * per node partial list. Postpone the actual unfreezing
> - * outside of the critical section.
> - */
> - slab_to_put = oldslab;
> - oldslab = NULL;
> - } else {
> - slabs = oldslab->slabs;
> - }
> - }
> -
> - slabs++;
> -
> - slab->slabs = slabs;
> - slab->next = oldslab;
> -
> - this_cpu_write(s->cpu_slab->partial, slab);
> -
> - local_unlock_cpu_slab(s, flags);
> -
> - if (slab_to_put) {
> - __put_partials(s, slab_to_put);
> - stat(s, CPU_PARTIAL_DRAIN);
> - }
> -}
> -
> -#else /* CONFIG_SLUB_CPU_PARTIAL */
> -
> -static inline void put_partials(struct kmem_cache *s) { }
> -static inline void put_partials_cpu(struct kmem_cache *s,
> - struct kmem_cache_cpu *c) { }
> -
> -#endif /* CONFIG_SLUB_CPU_PARTIAL */
> -
> static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
> {
> unsigned long flags;
> @@ -4060,8 +3894,6 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
> deactivate_slab(s, slab, freelist);
> stat(s, CPUSLAB_FLUSH);
> }
> -
> - put_partials_cpu(s, c);
> }
>
> static inline void flush_this_cpu_slab(struct kmem_cache *s)
> @@ -4070,15 +3902,13 @@ static inline void flush_this_cpu_slab(struct kmem_cache *s)
>
> if (c->slab)
> flush_slab(s, c);
> -
> - put_partials(s);
> }
>
> static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> {
> struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
>
> - return c->slab || slub_percpu_partial(c);
> + return c->slab;
> }
>
> static bool has_pcs_used(int cpu, struct kmem_cache *s)
> @@ -5646,13 +5476,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> return;
> }
>
> - /*
> - * It is enough to test IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) below
> - * instead of kmem_cache_has_cpu_partial(s), because kmem_cache_debug(s)
> - * is the only other reason it can be false, and it is already handled
> - * above.
> - */
> -
> do {
> if (unlikely(n)) {
> spin_unlock_irqrestore(&n->list_lock, flags);
> @@ -5677,26 +5500,19 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> * Unless it's frozen.
> */
> if ((!new.inuse || was_full) && !was_frozen) {
> +
> + n = get_node(s, slab_nid(slab));
> /*
> - * If slab becomes non-full and we have cpu partial
> - * lists, we put it there unconditionally to avoid
> - * taking the list_lock. Otherwise we need it.
> + * Speculatively acquire the list_lock.
> + * If the cmpxchg does not succeed then we may
> + * drop the list_lock without any processing.
> + *
> + * Otherwise the list_lock will synchronize with
> + * other processors updating the list of slabs.
> */
> - if (!(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full)) {
> -
> - n = get_node(s, slab_nid(slab));
> - /*
> - * Speculatively acquire the list_lock.
> - * If the cmpxchg does not succeed then we may
> - * drop the list_lock without any processing.
> - *
> - * Otherwise the list_lock will synchronize with
> - * other processors updating the list of slabs.
> - */
> - spin_lock_irqsave(&n->list_lock, flags);
> -
> - on_node_partial = slab_test_node_partial(slab);
> - }
> + spin_lock_irqsave(&n->list_lock, flags);
> +
> + on_node_partial = slab_test_node_partial(slab);
> }
>
> } while (!slab_update_freelist(s, slab, &old, &new, "__slab_free"));
> @@ -5709,13 +5525,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> * activity can be necessary.
> */
> stat(s, FREE_FROZEN);
> - } else if (IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) {
> - /*
> - * If we started with a full slab then put it onto the
> - * per cpu partial list.
> - */
> - put_cpu_partial(s, slab, 1);
> - stat(s, CPU_PARTIAL_FREE);
> }
>
> /*
> @@ -5744,10 +5553,9 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>
> /*
> * Objects left in the slab. If it was not on the partial list before
> - * then add it. This can only happen when cache has no per cpu partial
> - * list otherwise we would have put it there.
> + * then add it.
> */
> - if (!IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && unlikely(was_full)) {
> + if (unlikely(was_full)) {
This is not really related to your change but I wonder why we check
for was_full to detect that the slab was not on partial list instead
of checking !on_node_partial... They might be equivalent at this point
but it's still a bit confusing.
> add_partial(n, slab, DEACTIVATE_TO_TAIL);
> stat(s, FREE_ADD_PARTIAL);
> }
> @@ -6396,8 +6204,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> if (unlikely(!allow_spin)) {
> /*
> * __slab_free() can locklessly cmpxchg16 into a slab,
> - * but then it might need to take spin_lock or local_lock
> - * in put_cpu_partial() for further processing.
> + * but then it might need to take spin_lock
> + * for further processing.
> * Avoid the complexity and simply add to a deferred list.
> */
> defer_free(s, head);
> @@ -7707,39 +7515,6 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
> return 1;
> }
>
> -static void set_cpu_partial(struct kmem_cache *s)
> -{
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> - unsigned int nr_objects;
> -
> - /*
> - * cpu_partial determined the maximum number of objects kept in the
> - * per cpu partial lists of a processor.
> - *
> - * Per cpu partial lists mainly contain slabs that just have one
> - * object freed. If they are used for allocation then they can be
> - * filled up again with minimal effort. The slab will never hit the
> - * per node partial lists and therefore no locking will be required.
> - *
> - * For backwards compatibility reasons, this is determined as number
> - * of objects, even though we now limit maximum number of pages, see
> - * slub_set_cpu_partial()
> - */
> - if (!kmem_cache_has_cpu_partial(s))
> - nr_objects = 0;
> - else if (s->size >= PAGE_SIZE)
> - nr_objects = 6;
> - else if (s->size >= 1024)
> - nr_objects = 24;
> - else if (s->size >= 256)
> - nr_objects = 52;
> - else
> - nr_objects = 120;
> -
> - slub_set_cpu_partial(s, nr_objects);
> -#endif
> -}
> -
> static unsigned int calculate_sheaf_capacity(struct kmem_cache *s,
> struct kmem_cache_args *args)
>
> @@ -8595,8 +8370,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
> s->min_partial = min_t(unsigned long, MAX_PARTIAL, ilog2(s->size) / 2);
> s->min_partial = max_t(unsigned long, MIN_PARTIAL, s->min_partial);
>
> - set_cpu_partial(s);
> -
> s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
> if (!s->cpu_sheaves) {
> err = -ENOMEM;
> @@ -8960,20 +8733,6 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
> total += x;
> nodes[node] += x;
>
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> - slab = slub_percpu_partial_read_once(c);
> - if (slab) {
> - node = slab_nid(slab);
> - if (flags & SO_TOTAL)
> - WARN_ON_ONCE(1);
> - else if (flags & SO_OBJECTS)
> - WARN_ON_ONCE(1);
> - else
> - x = data_race(slab->slabs);
> - total += x;
> - nodes[node] += x;
> - }
> -#endif
> }
> }
>
> @@ -9108,12 +8867,7 @@ SLAB_ATTR(min_partial);
>
> static ssize_t cpu_partial_show(struct kmem_cache *s, char *buf)
> {
> - unsigned int nr_partial = 0;
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> - nr_partial = s->cpu_partial;
> -#endif
> -
> - return sysfs_emit(buf, "%u\n", nr_partial);
> + return sysfs_emit(buf, "0\n");
> }
>
> static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf,
> @@ -9125,11 +8879,9 @@ static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf,
> err = kstrtouint(buf, 10, &objects);
> if (err)
> return err;
> - if (objects && !kmem_cache_has_cpu_partial(s))
> + if (objects)
> return -EINVAL;
>
> - slub_set_cpu_partial(s, objects);
> - flush_all(s);
> return length;
> }
> SLAB_ATTR(cpu_partial);
> @@ -9168,42 +8920,7 @@ SLAB_ATTR_RO(objects_partial);
>
> static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
> {
> - int objects = 0;
> - int slabs = 0;
> - int cpu __maybe_unused;
> - int len = 0;
> -
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> - for_each_online_cpu(cpu) {
> - struct slab *slab;
> -
> - slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
> -
> - if (slab)
> - slabs += data_race(slab->slabs);
> - }
> -#endif
> -
> - /* Approximate half-full slabs, see slub_set_cpu_partial() */
> - objects = (slabs * oo_objects(s->oo)) / 2;
> - len += sysfs_emit_at(buf, len, "%d(%d)", objects, slabs);
> -
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> - for_each_online_cpu(cpu) {
> - struct slab *slab;
> -
> - slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
> - if (slab) {
> - slabs = data_race(slab->slabs);
> - objects = (slabs * oo_objects(s->oo)) / 2;
> - len += sysfs_emit_at(buf, len, " C%d=%d(%d)",
> - cpu, objects, slabs);
> - }
> - }
> -#endif
> - len += sysfs_emit_at(buf, len, "\n");
> -
> - return len;
> + return sysfs_emit(buf, "0(0)\n");
> }
> SLAB_ATTR_RO(slabs_cpu_partial);
>
>
> --
> 2.52.0
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL
2026-01-20 22:25 ` Suren Baghdasaryan
@ 2026-01-21 0:58 ` Harry Yoo
2026-01-21 1:06 ` Harry Yoo
2026-01-21 16:21 ` Suren Baghdasaryan
2026-01-21 14:22 ` Vlastimil Babka
1 sibling, 2 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-21 0:58 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Vlastimil Babka, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Tue, Jan 20, 2026 at 10:25:27PM +0000, Suren Baghdasaryan wrote:
> On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > @@ -5744,10 +5553,9 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> >
> > /*
> > * Objects left in the slab. If it was not on the partial list before
> > - * then add it. This can only happen when cache has no per cpu partial
> > - * list otherwise we would have put it there.
> > + * then add it.
> > */
> > - if (!IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && unlikely(was_full)) {
> > + if (unlikely(was_full)) {
>
> This is not really related to your change but I wonder why we check
> for was_full to detect that the slab was not on partial list instead
> of checking !on_node_partial... They might be equivalent at this point
> but it's still a bit confusing.
If we only know that a slab is not on the partial list, we cannot
manipulate its list linkage, because it may be on a linked list that
must not be modified outside the function that owns it
(e.g., pc.slabs in __refill_objects()).
If it's not on the partial list, we can safely manipulate the list
only when we know it was full. It's safe because full slabs are not
supposed to be on any list (except for debug caches, where frees are
done via free_to_partial_list()).
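
To put that next to the code, here is the hunk above with the invariant
spelled out in comments (no new logic, just an annotated sketch):

	if (unlikely(was_full)) {
		/*
		 * was_full: the slab had no free objects before this free,
		 * so for a non-debug cache it was on no list at all, and it
		 * is safe to link it into the node partial list now that it
		 * has a free object.
		 */
		add_partial(n, slab, DEACTIVATE_TO_TAIL);
		stat(s, FREE_ADD_PARTIAL);
	}

	/*
	 * Testing !on_node_partial alone would not be enough: a slab that
	 * merely is not on n->partial may still sit on a caller-private
	 * list such as pc.slabs in __refill_objects(), and its slab_list
	 * must not be touched from __slab_free().
	 */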
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL
2026-01-21 0:58 ` Harry Yoo
@ 2026-01-21 1:06 ` Harry Yoo
2026-01-21 16:21 ` Suren Baghdasaryan
1 sibling, 0 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-21 1:06 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Vlastimil Babka, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Wed, Jan 21, 2026 at 09:58:40AM +0900, Harry Yoo wrote:
> On Tue, Jan 20, 2026 at 10:25:27PM +0000, Suren Baghdasaryan wrote:
> > On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > @@ -5744,10 +5553,9 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> > >
> > > /*
> > > * Objects left in the slab. If it was not on the partial list before
> > > - * then add it. This can only happen when cache has no per cpu partial
> > > - * list otherwise we would have put it there.
> > > + * then add it.
> > > */
> > > - if (!IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && unlikely(was_full)) {
> > > + if (unlikely(was_full)) {
> >
> > This is not really related to your change but I wonder why we check
> > for was_full to detect that the slab was not on partial list instead
> > of checking !on_node_partial... They might be equivalent at this point
> > but it's still a bit confusing.
>
> If we only know that a slab is not on the partial list, we cannot
> manipulate its list linkage, because it may be on a linked list that
> must not be modified outside the function that owns it
> (e.g., pc.slabs in __refill_objects()).
>
> If it's not on the partial list, we can safely manipulate the list
> only when we know it was full. It's safe because full slabs are not
> supposed to be on any list (except for debug caches, where frees are
> done via free_to_partial_list()).
Of course, when a slab was frozen, this doesn't apply and __slab_free()
explicitly handles that case.
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 16/21] slab: remove unused PREEMPT_RT specific macros
2026-01-16 14:40 ` [PATCH v3 16/21] slab: remove unused PREEMPT_RT specific macros Vlastimil Babka
@ 2026-01-21 6:42 ` Hao Li
2026-01-21 17:57 ` Suren Baghdasaryan
2026-01-22 3:50 ` Harry Yoo
1 sibling, 1 reply; 106+ messages in thread
From: Hao Li @ 2026-01-21 6:42 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:36PM +0100, Vlastimil Babka wrote:
> The macros slub_get_cpu_ptr()/slub_put_cpu_ptr() are now unused, remove
> them. USE_LOCKLESS_FAST_PATH() has lost its true meaning with the code
> being removed. The only remaining usage is in fact testing whether we
> can assert irqs disabled, because spin_lock_irqsave() only does that on
> !RT. Test for CONFIG_PREEMPT_RT instead.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 24 +-----------------------
> 1 file changed, 1 insertion(+), 23 deletions(-)
>
Looks good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>
--
Thanks,
Hao
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 05/21] slab: add sheaves to most caches
2026-01-20 18:47 ` Breno Leitao
@ 2026-01-21 8:12 ` Vlastimil Babka
0 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-21 8:12 UTC (permalink / raw)
To: Breno Leitao
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/20/26 19:47, Breno Leitao wrote:
> Hello Vlastimil,
>
> On Fri, Jan 16, 2026 at 03:40:25PM +0100, Vlastimil Babka wrote:
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -7863,6 +7863,48 @@ static void set_cpu_partial(struct kmem_cache *s)
>> #endif
>> }
>>
>> +static unsigned int calculate_sheaf_capacity(struct kmem_cache *s,
>> + struct kmem_cache_args *args)
>> +
>> +{
>> + unsigned int capacity;
>> + size_t size;
>> +
>> +
>> + if (IS_ENABLED(CONFIG_SLUB_TINY) || s->flags & SLAB_DEBUG_FLAGS)
>> + return 0;
>> +
>> + /* bootstrap caches can't have sheaves for now */
>> + if (s->flags & SLAB_NO_OBJ_EXT)
>> + return 0;
>
> I've been testing this on my arm64 environment with some debug patches,
> and the machine became unbootable.
>
> I am wondering if you should avoid SLAB_NOLEAKTRACE as well. I got the
> impression it is hitting this infinite loop:
>
> -> slab allocation
> -> kmemleak_alloc()
> -> kmem_cache_alloc(object_cache)
> -> alloc_from_pcs() / __pcs_replace_empty_main()
> -> alloc_full_sheaf() -> kzalloc()
> -> kmemleak_alloc()
> -> ... (infinite recursion)
>
Oops.
> What about something as:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 26804859821a..0a6481aaa744 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -7445,8 +7445,13 @@ static unsigned int calculate_sheaf_capacity(struct kmem_cache *s,
> if (IS_ENABLED(CONFIG_SLUB_TINY) || s->flags & SLAB_DEBUG_FLAGS)
> return 0;
>
> - /* bootstrap caches can't have sheaves for now */
> - if (s->flags & SLAB_NO_OBJ_EXT)
> + /*
> + * bootstrap caches can't have sheaves for now (SLAB_NO_OBJ_EXT).
> + * SLAB_NOLEAKTRACE caches (e.g., kmemleak's object_cache) must not
> + * have sheaves to avoid recursion when sheaf allocation triggers
> + * kmemleak tracking.
> + */
> + if (s->flags & (SLAB_NO_OBJ_EXT | SLAB_NOLEAKTRACE))
> return 0;
Yeah that should work, will do. Thanks a lot!
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 06/21] slab: introduce percpu sheaves bootstrap
2026-01-17 2:11 ` Suren Baghdasaryan
2026-01-19 3:40 ` Harry Yoo
2026-01-19 9:34 ` Vlastimil Babka
@ 2026-01-21 10:52 ` Vlastimil Babka
2 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-21 10:52 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On 1/17/26 03:11, Suren Baghdasaryan wrote:
>> @@ -7379,7 +7405,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>> * freeing to sheaves is so incompatible with the detached freelist so
>> * once we go that way, we have to do everything differently
>> */
>> - if (s && s->cpu_sheaves) {
>> + if (s && cache_has_sheaves(s)) {
>> free_to_pcs_bulk(s, size, p);
>> return;
>> }
>> @@ -7490,8 +7516,7 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>> size--;
>> }
>>
>> - if (s->cpu_sheaves)
>> - i = alloc_from_pcs_bulk(s, size, p);
>> + i = alloc_from_pcs_bulk(s, size, p);
>
> Doesn't the above change make this fastpath a bit longer? IIUC,
> instead of bailing out right here we call alloc_from_pcs_bulk() and
> bail out from there because pcs->main->size is 0.
But only for caches with no sheaves, and those should be the exception. So
the fast path avoids the check and only the slow path gets longer. The
strategy is the same as for single-object alloc and is described in the
changelog. Or what am I missing?
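
To illustrate, roughly (a simplified sketch, not the actual code: locking
is omitted and the sheaf field names are assumed; only the pcs->main->size
test and the zero-sized bootstrap sheaf come from the changelog):

static __always_inline void *alloc_from_pcs_sketch(struct kmem_cache *s)
{
	struct slub_percpu_sheaves *pcs = this_cpu_ptr(s->cpu_sheaves);

	/*
	 * Fast path: only the main sheaf's size is tested, there is no
	 * s->cpu_sheaves or s->sheaf_capacity check.
	 */
	if (likely(pcs->main->size))
		return pcs->main->objects[--pcs->main->size];

	/*
	 * A cache without real sheaves always ends up here, because its
	 * pcs->main is the shared bootstrap sheaf with size == 0. The slow
	 * path then tests cache_has_sheaves() and falls back to
	 * ___slab_alloc() (or the bulk equivalent), so only these
	 * exceptional caches pay for the extra call.
	 */
	return NULL;	/* caller falls back to the slow path */
}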
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 06/21] slab: introduce percpu sheaves bootstrap
2026-01-19 11:32 ` Hao Li
@ 2026-01-21 10:54 ` Vlastimil Babka
0 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-21 10:54 UTC (permalink / raw)
To: Hao Li
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/19/26 12:32, Hao Li wrote:
> On Fri, Jan 16, 2026 at 03:40:26PM +0100, Vlastimil Babka wrote:
>> Until now, kmem_cache->cpu_sheaves was !NULL only for caches with
>> sheaves enabled. Since we want to enable them for almost all caches,
>> it's suboptimal to test the pointer in the fast paths, so instead
>> allocate it for all caches in do_kmem_cache_create(). Instead of testing
>> the cpu_sheaves pointer to recognize caches (yet) without sheaves, test
>> kmem_cache->sheaf_capacity for being 0, where needed, using a new
>> cache_has_sheaves() helper.
>>
>> However, for the fast paths sake we also assume that the main sheaf
>> always exists (pcs->main is !NULL), and during bootstrap we cannot
>> allocate sheaves yet.
>>
>> Solve this by introducing a single static bootstrap_sheaf that's
>> assigned as pcs->main during bootstrap. It has a size of 0, so during
>> allocations, the fast path will find it's empty. Since the size of 0
>> matches sheaf_capacity of 0, the freeing fast paths will find it's
>> "full". In the slow path handlers, we use cache_has_sheaves() to
>> recognize that the cache doesn't (yet) have real sheaves, and fall back.
>> Thus sharing the single bootstrap sheaf like this for multiple caches
>> and cpus is safe.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>> mm/slub.c | 119 ++++++++++++++++++++++++++++++++++++++++++--------------------
>> 1 file changed, 81 insertions(+), 38 deletions(-)
>>
>
> Nit: would it make sense to also update "if (s->cpu_sheaves)" to
> cache_has_sheaves() in kvfree_rcu_barrier_on_cache(), for consistency?
Ack, will do. Thanks.
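For readers of the archive, the bootstrap arrangement from the changelog
condenses to a few lines. Sketch only, not the actual patch; bootstrap_sheaf
and cache_has_sheaves() are the names used in the changelog, the struct
layout is assumed:

/* Shared by all caches and cpus until real sheaves can be allocated. */
static struct slab_sheaf bootstrap_sheaf;       /* .size == 0 */

/* A cache has real sheaves once its sheaf_capacity is non-zero. */
static inline bool cache_has_sheaves(struct kmem_cache *s)
{
        return s->sheaf_capacity != 0;
}

/*
 * With size == 0 the allocation fastpath sees bootstrap_sheaf as empty and,
 * since size == sheaf_capacity == 0, the freeing fastpath sees it as full,
 * so both immediately take the slow path, which uses cache_has_sheaves()
 * to fall back to the non-sheaf code.
 */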
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-20 17:19 ` Suren Baghdasaryan
@ 2026-01-21 13:22 ` Vlastimil Babka
2026-01-21 16:12 ` Suren Baghdasaryan
0 siblings, 1 reply; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-21 13:22 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On 1/20/26 18:19, Suren Baghdasaryan wrote:
> On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> At this point we have sheaves enabled for all caches, but their refill
>> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
>> slabs - now a redundant caching layer that we are about to remove.
>>
>> The refill will thus be done from slabs on the node partial list.
>> Introduce new functions that can do that in an optimized way as it's
>> easier than modifying the __kmem_cache_alloc_bulk() call chain.
>>
>> Extend struct partial_context so it can return a list of slabs from the
>> partial list with the sum of free objects in them within the requested
>> min and max.
>>
>> Introduce get_partial_node_bulk() that removes the slabs from the node
>> partial list and returns them in the list.
>>
>> Introduce get_freelist_nofreeze() which grabs the freelist without
>> freezing the slab.
>>
>> Introduce alloc_from_new_slab() which can allocate multiple objects from
>> a newly allocated slab where we don't need to synchronize with freeing.
>> In some aspects it's similar to alloc_single_from_new_slab() but assumes
>> the cache is a non-debug one so it can avoid some actions.
>>
>> Introduce __refill_objects() that uses the functions above to fill an
>> array of objects. It has to handle the possibility that the slabs will
>> contain more objects than were requested, due to concurrent freeing of
>> objects to those slabs. When no more slabs on partial lists are
>> available, it will allocate new slabs. It is intended to be only used
>> in contexts where spinning is allowed, so add a WARN_ON_ONCE check there.
>>
>> Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are
>> only refilled from contexts that allow spinning, or even blocking.
>>
>
> Some nits, but otherwise LGTM.
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Thanks.
>
> From the above code it seems like you are trying to get at least
> pc->min_objects and as close as possible to the pc->max_objects
> without exceeding it (with a possibility that we will exceed both
> min_objects and max_objects in one step). Is that indeed the intent?
> Because otherwise you could simplify these conditions to stop once
> you crossed pc->min_objects.
Yeah see my reply to Harry, it's for future tuning.
>> + if (slab->freelist) {
>
> nit: It's a bit subtle that the checks for slab->freelist here and the
> earlier one for ((slab->objects - slab->inuse) > count) are
> effectively equivalent. That's because this is a new slab and objects
> can't be freed into it concurrently. I would feel better if both
> checks were explicitly the same, like having "bool extra_objs =
> (slab->objects - slab->inuse) > count;" and use it for both checks.
> But this is minor, so feel free to ignore.
OK, doing this for your and Hao Li's comment:
diff --git a/mm/slub.c b/mm/slub.c
index d6fde1d60ae9..015bdef11eb6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4505,7 +4505,7 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
* Assumes the slab is isolated from node partial list and not frozen.
*
* Assumes this is performed only for caches without debugging so we
- * don't need to worry about adding the slab to the full list
+ * don't need to worry about adding the slab to the full list.
*/
static inline void *get_freelist_nofreeze(struct kmem_cache *s, struct slab *slab)
{
@@ -4569,10 +4569,17 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
{
unsigned int allocated = 0;
struct kmem_cache_node *n;
+ bool needs_add_partial;
unsigned long flags;
void *object;
- if (!allow_spin && (slab->objects - slab->inuse) > count) {
+ /*
+ * Are we going to put the slab on the partial list?
+ * Note slab->inuse is 0 on a new slab.
+ */
+ needs_add_partial = (slab->objects > count);
+
+ if (!allow_spin && needs_add_partial) {
n = get_node(s, slab_nid(slab));
@@ -4594,7 +4601,7 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
}
slab->freelist = object;
- if (slab->freelist) {
+ if (needs_add_partial) {
if (allow_spin) {
n = get_node(s, slab_nid(slab));
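For readers without the full patch at hand, the overall shape of
__refill_objects() as described in the changelog is roughly the following.
Heavily condensed sketch: locking, statistics and the object-copy loops are
elided, and the drain_partial_slabs() helper in step 1 is hypothetical:

static unsigned int __refill_objects(struct kmem_cache *s, void **p,
                                     gfp_t gfp, unsigned int min,
                                     unsigned int max)
{
        struct partial_context pc = {
                .flags = gfp, .min_objects = min, .max_objects = max,
        };
        unsigned int refilled = 0;

        /* Refilling may spin on the node list_lock. */
        if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
                return 0;

        /*
         * 1) Detach a batch of slabs with roughly min..max free objects
         *    from the node partial list and drain their freelists into p[].
         *    Concurrent freeing may add objects in the meantime, so we can
         *    end up with more than requested; slabs that still have free
         *    objects afterwards are put back on the partial list.
         */
        if (get_partial_node_bulk(s, get_node(s, numa_mem_id()), &pc))
                refilled = drain_partial_slabs(s, &pc, p, max);

        /*
         * 2) Still short of min? Allocate new slabs and take objects from
         *    them via alloc_from_new_slab() until min is reached or slab
         *    allocation fails.
         */

        return refilled;
}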
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 10/21] slab: remove cpu (partial) slabs usage from allocation paths
2026-01-20 18:06 ` Suren Baghdasaryan
@ 2026-01-21 13:56 ` Vlastimil Babka
0 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-21 13:56 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On 1/20/26 19:06, Suren Baghdasaryan wrote:
> On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> We now rely on sheaves as the percpu caching layer and can refill them
>> directly from partial or newly allocated slabs. Start removing the cpu
>> (partial) slabs code, first from allocation paths.
>>
>> This means that any allocation not satisfied from percpu sheaves will
>> end up in ___slab_alloc(), where we remove the usage of cpu (partial)
>> slabs, so it will only perform get_partial() or new_slab(). In the
>> latter case we reuse alloc_from_new_slab() (when we don't use
>> the debug/tiny alloc_single_from_new_slab() variant).
>>
>> In get_partial_node() we used to return a slab for freezing as the cpu
>> slab and to refill the partial slab. Now we only want to return a single
>> object and leave the slab on the list (unless it became full). We can't
>> simply reuse alloc_single_from_partial() as that assumes freeing uses
>> free_to_partial_list(). Instead we need to use __slab_update_freelist()
>> to work properly against a racing __slab_free().
>>
>> The rest of the changes is removing functions that no longer have any
>> callers.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>
> A couple of nits, but otherwise seems fine to me.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Thanks!
>> -static struct slab *get_partial_node(struct kmem_cache *s,
>> - struct kmem_cache_node *n,
>> - struct partial_context *pc)
>> +static void *get_partial_node(struct kmem_cache *s,
>> + struct kmem_cache_node *n,
>> + struct partial_context *pc)
>
> Naming for get_partial()/get_partial_node()/get_any_partial() made
> sense when they returned a slab. Now that they return object(s) the
> naming is a bit confusing. I think renaming to
> get_from_partial()/get_from_partial_node()/get_from_any_partial()
> would be more appropriate.
OK, will do.
>> - }
>> + freelist = get_partial(s, node, &pc);
>
> I think all this cleanup results in this `freelist` variable being
> used to always store a single object. Maybe rename it into `object`?
Ack.
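The single-object grab from a partial slab described in the changelog works
roughly as follows. Sketch only: the helper name is made up and the
freelist_counters / __slab_update_freelist() usage is modelled on the
__slab_free() code quoted later in this thread, not copied from the patch:

static void *get_object_from_partial_slab(struct kmem_cache *s,
                                          struct slab *slab)
{
        struct freelist_counters old, new;
        void *object;

        do {
                old.freelist = slab->freelist;
                old.counters = slab->counters;

                object = old.freelist;
                if (!object)
                        return NULL;    /* no free objects left, slab is full */

                new.counters = old.counters;
                new.inuse++;
                new.freelist = get_freepointer(s, object);
        } while (!__slab_update_freelist(s, slab, &old, &new,
                                         "get_partial_node"));

        /* The slab stays on the partial list unless it just became full. */
        return object;
}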
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL
2026-01-20 22:25 ` Suren Baghdasaryan
2026-01-21 0:58 ` Harry Yoo
@ 2026-01-21 14:22 ` Vlastimil Babka
2026-01-21 14:43 ` Vlastimil Babka
2026-01-21 16:22 ` Suren Baghdasaryan
1 sibling, 2 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-21 14:22 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On 1/20/26 23:25, Suren Baghdasaryan wrote:
> On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> We have removed the partial slab usage from allocation paths. Now remove
>> the whole config option and associated code.
>>
>> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> I did?
Hmm looks like you didn't. Wonder if I screwed up, or b4 did. Sorry about that.
> Well, if so, I missed some remaining mentions about cpu partial caches:
> - slub.c has several hits on "cpu partial" in the comments.
> - there is one hit on "put_cpu_partial" in slub.c in the comments.
Should be addressed later by [PATCH v3 18/21] slab: update overview
comments. I'll grep the result to see if anything is missing.
> Should we also update Documentation/ABI/testing/sysfs-kernel-slab to
> say that from now on cpu_partial control always reads 0?
Uh those weird files. Does anyone care? I'd do that separately as well...
> Once addressed, please feel free to keep my Reviewed-by.
Thanks!
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 15/21] slab: remove struct kmem_cache_cpu
2026-01-20 12:40 ` Hao Li
@ 2026-01-21 14:29 ` Vlastimil Babka
2026-01-21 17:54 ` Suren Baghdasaryan
0 siblings, 1 reply; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-21 14:29 UTC (permalink / raw)
To: Hao Li
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/20/26 13:40, Hao Li wrote:
> On Fri, Jan 16, 2026 at 03:40:35PM +0100, Vlastimil Babka wrote:
>> @@ -3853,7 +3632,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
>> }
>>
>> /*
>> - * Flush cpu slab.
>> + * Flush percpu sheaves
>> *
>> * Called from CPU work handler with migration disabled.
>> */
>> @@ -3868,8 +3647,6 @@ static void flush_cpu_slab(struct work_struct *w)
>
> Nit: Would it make sense to rename flush_cpu_slab to flush_cpu_sheaf for better
> clarity?
OK
> Other than that, looks good to me. Thanks.
>
> Reviewed-by: Hao Li <hao.li@linux.dev>
Thanks!
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL
2026-01-21 14:22 ` Vlastimil Babka
@ 2026-01-21 14:43 ` Vlastimil Babka
2026-01-21 16:22 ` Suren Baghdasaryan
1 sibling, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-21 14:43 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On 1/21/26 15:22, Vlastimil Babka wrote:
> On 1/20/26 23:25, Suren Baghdasaryan wrote:
>> On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>>
>>> We have removed the partial slab usage from allocation paths. Now remove
>>> the whole config option and associated code.
>>>
>>> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>>
>> I did?
>
> Hmm looks like you didn't. Wonder if I screwed up, or b4 did. Sorry about that.
Seems like it was b4 and it did that for all patches, damn. Sorry, will fix
it up to match reality.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 09/21] slab: add optimized sheaf refill from partial list
2026-01-21 13:22 ` Vlastimil Babka
@ 2026-01-21 16:12 ` Suren Baghdasaryan
0 siblings, 0 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-21 16:12 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Wed, Jan 21, 2026 at 1:22 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/20/26 18:19, Suren Baghdasaryan wrote:
> > On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> At this point we have sheaves enabled for all caches, but their refill
> >> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
> >> slabs - now a redundant caching layer that we are about to remove.
> >>
> >> The refill will thus be done from slabs on the node partial list.
> >> Introduce new functions that can do that in an optimized way as it's
> >> easier than modifying the __kmem_cache_alloc_bulk() call chain.
> >>
> >> Extend struct partial_context so it can return a list of slabs from the
> >> partial list with the sum of free objects in them within the requested
> >> min and max.
> >>
> >> Introduce get_partial_node_bulk() that removes the slabs from the node
> >> partial list and returns them in the list.
> >>
> >> Introduce get_freelist_nofreeze() which grabs the freelist without
> >> freezing the slab.
> >>
> >> Introduce alloc_from_new_slab() which can allocate multiple objects from
> >> a newly allocated slab where we don't need to synchronize with freeing.
> >> In some aspects it's similar to alloc_single_from_new_slab() but assumes
> >> the cache is a non-debug one so it can avoid some actions.
> >>
> >> Introduce __refill_objects() that uses the functions above to fill an
> >> array of objects. It has to handle the possibility that the slabs will
> >> contain more objects than were requested, due to concurrent freeing of
> >> objects to those slabs. When no more slabs on partial lists are
> >> available, it will allocate new slabs. It is intended to be only used
> >> in contexts where spinning is allowed, so add a WARN_ON_ONCE check there.
> >>
> >> Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are
> >> only refilled from contexts that allow spinning, or even blocking.
> >>
> >
> > Some nits, but otherwise LGTM.
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> Thanks.
>
> >
> > From the above code it seems like you are trying to get at least
> > pc->min_objects and as close as possible to the pc->max_objects
> > without exceeding it (with a possibility that we will exceed both
> > min_objects and max_objects in one step). Is that indeed the intent?
> > Because otherwise you could simplify these conditions to stop once
> > you crossed pc->min_objects.
>
> Yeah see my reply to Harry, it's for future tuning.
Ok.
>
> >> + if (slab->freelist) {
> >
> > nit: It's a bit subtle that the checks for slab->freelist here and the
> > earlier one for ((slab->objects - slab->inuse) > count) are
> > effectively equivalent. That's because this is a new slab and objects
> > can't be freed into it concurrently. I would feel better if both
> > checks were explicitly the same, like having "bool extra_objs =
> > (slab->objects - slab->inuse) > count;" and use it for both checks.
> > But this is minor, so feel free to ignore.
>
> OK, doing this for your and Hao Li's comment:
Sounds good. Thanks!
>
> diff --git a/mm/slub.c b/mm/slub.c
> index d6fde1d60ae9..015bdef11eb6 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4505,7 +4505,7 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
> * Assumes the slab is isolated from node partial list and not frozen.
> *
> * Assumes this is performed only for caches without debugging so we
> - * don't need to worry about adding the slab to the full list
> + * don't need to worry about adding the slab to the full list.
> */
> static inline void *get_freelist_nofreeze(struct kmem_cache *s, struct slab *slab)
> {
> @@ -4569,10 +4569,17 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> {
> unsigned int allocated = 0;
> struct kmem_cache_node *n;
> + bool needs_add_partial;
> unsigned long flags;
> void *object;
>
> - if (!allow_spin && (slab->objects - slab->inuse) > count) {
> + /*
> + * Are we going to put the slab on the partial list?
> + * Note slab->inuse is 0 on a new slab.
> + */
> + needs_add_partial = (slab->objects > count);
> +
> + if (!allow_spin && needs_add_partial) {
>
> n = get_node(s, slab_nid(slab));
>
> @@ -4594,7 +4601,7 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> }
> slab->freelist = object;
>
> - if (slab->freelist) {
> + if (needs_add_partial) {
>
> if (allow_spin) {
> n = get_node(s, slab_nid(slab));
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL
2026-01-21 0:58 ` Harry Yoo
2026-01-21 1:06 ` Harry Yoo
@ 2026-01-21 16:21 ` Suren Baghdasaryan
1 sibling, 0 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-21 16:21 UTC (permalink / raw)
To: Harry Yoo
Cc: Vlastimil Babka, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Wed, Jan 21, 2026 at 12:59 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Tue, Jan 20, 2026 at 10:25:27PM +0000, Suren Baghdasaryan wrote:
> > On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > @@ -5744,10 +5553,9 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> > >
> > > /*
> > > * Objects left in the slab. If it was not on the partial list before
> > > - * then add it. This can only happen when cache has no per cpu partial
> > > - * list otherwise we would have put it there.
> > > + * then add it.
> > > */
> > > - if (!IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && unlikely(was_full)) {
> > > + if (unlikely(was_full)) {
> >
> > This is not really related to your change but I wonder why we check
> > for was_full to detect that the slab was not on partial list instead
> > of checking !on_node_partial... They might be equivalent at this point
> > but it's still a bit confusing.
>
> If we only know that a slab is not on the partial list, we cannot
> manipulate its list because it may be on a linked list that cannot
> handle list manipulation outside the function
> (e.g., pc.slabs in __refill_objects()).
>
> If it's not on the partial list, we can safely manipulate the list
> only when we know it was full. It's safe because full slabs are not
> supposed to be on any list (except for debug caches, where frees are
> done via free_to_partial_list()).
Ack. I guess I'm reading the above comment too literally. Thanks!
>
> --
> Cheers,
> Harry / Hyeonggon
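For reference, the invariant Harry describes can be summarised like this
(illustration only, based on this thread and the overview comments updated
later in the series):

/*
 * Who may touch slab->slab_list in __slab_free():
 *
 *   SL_partial set            the slab is on the node partial list and
 *                             stays there; no list manipulation needed
 *
 *   !SL_partial && was_full   a full slab is on no list at all, so the
 *                             freer that makes it non-full owns it and
 *                             may add_partial() it
 *
 *   !SL_partial && !was_full  someone else took the slab off the partial
 *                             list (e.g. it sits on pc.slabs in
 *                             __refill_objects()) and owns its list
 *                             linkage, so __slab_free() must not touch it
 */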
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL
2026-01-21 14:22 ` Vlastimil Babka
2026-01-21 14:43 ` Vlastimil Babka
@ 2026-01-21 16:22 ` Suren Baghdasaryan
1 sibling, 0 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-21 16:22 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Wed, Jan 21, 2026 at 2:22 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/20/26 23:25, Suren Baghdasaryan wrote:
> > On Fri, Jan 16, 2026 at 2:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> We have removed the partial slab usage from allocation paths. Now remove
> >> the whole config option and associated code.
> >>
> >> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> >
> > I did?
>
> Hmm looks like you didn't. Wonder if I screwed up, or b4 did. Sorry about that.
No worries.
>
> > Well, if so, I missed some remaining mentions about cpu partial caches:
> > - slub.c has several hits on "cpu partial" in the comments.
> > - there is one hit on "put_cpu_partial" in slub.c in the comments.
>
> Should be addressed later by [PATCH v3 18/21] slab: update overview
> comments. I'll grep the result if anything is missing.
>
> > Should we also update Documentation/ABI/testing/sysfs-kernel-slab to
> > say that from now on cpu_partial control always reads 0?
>
> Uh those weird files. Does anyone care? I'd do that separately as well...
I'm fine either way. Thanks!
>
> > Once addressed, please feel free to keep my Reviewed-by.
>
> Thanks!
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 12/21] slab: remove the do_slab_free() fastpath
2026-01-20 12:29 ` Hao Li
@ 2026-01-21 16:57 ` Suren Baghdasaryan
0 siblings, 0 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-21 16:57 UTC (permalink / raw)
To: Hao Li
Cc: Vlastimil Babka, Harry Yoo, Petr Tesarik, Christoph Lameter,
David Rientjes, Roman Gushchin, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Tue, Jan 20, 2026 at 12:30 PM Hao Li <hao.li@linux.dev> wrote:
>
> On Fri, Jan 16, 2026 at 03:40:32PM +0100, Vlastimil Babka wrote:
> > We have removed cpu slab usage from allocation paths. Now remove
> > do_slab_free() which was freeing objects to the cpu slab when
> > the object belonged to it. Instead call __slab_free() directly,
> > which was previously the fallback.
> >
> > This simplifies kfree_nolock() - when freeing to percpu sheaf
> > fails, we can call defer_free() directly.
> >
> > Also remove functions that became unused.
> >
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> > mm/slub.c | 149 ++++++--------------------------------------------------------
> > 1 file changed, 13 insertions(+), 136 deletions(-)
> >
>
> Looks good to me.
> Reviewed-by: Hao Li <hao.li@linux.dev>
There are some hits in the comments on __update_cpu_freelist_fast and
do_slab_free but you remove them later. Nice cleanup!
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> --
> Thanks,
> Hao
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 13/21] slab: remove defer_deactivate_slab()
2026-01-20 9:35 ` Hao Li
@ 2026-01-21 17:11 ` Suren Baghdasaryan
0 siblings, 0 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-21 17:11 UTC (permalink / raw)
To: Hao Li
Cc: Vlastimil Babka, Harry Yoo, Petr Tesarik, Christoph Lameter,
David Rientjes, Roman Gushchin, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Tue, Jan 20, 2026 at 9:35 AM Hao Li <hao.li@linux.dev> wrote:
>
> On Fri, Jan 16, 2026 at 03:40:33PM +0100, Vlastimil Babka wrote:
> > There are no more cpu slabs so we don't need their deferred
> > deactivation. The function is now only used from places where we
> > allocate a new slab but then can't spin on node list_lock to put it on
> > the partial list. Instead of the deferred action we can free it directly
> > via __free_slab(), we just need to tell it to use _nolock() freeing of
> > the underlying pages and take care of the accounting.
> >
> > Since free_frozen_pages_nolock() variant does not yet exist for code
> > outside of the page allocator, create it as a trivial wrapper for
> > __free_frozen_pages(..., FPI_TRYLOCK).
> >
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> > mm/internal.h | 1 +
> > mm/page_alloc.c | 5 +++++
> > mm/slab.h | 8 +-------
> > mm/slub.c | 56 ++++++++++++++++++++------------------------------------
> > 4 files changed, 27 insertions(+), 43 deletions(-)
> >
>
> Looks good to me.
> Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> --
> Thanks,
> Hao
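The wrapper mentioned in the changelog is, as stated, trivial; roughly
(sketch, the exact signature is assumed to match free_frozen_pages()):

void free_frozen_pages_nolock(struct page *page, unsigned int order)
{
        /* Trylock-only freeing for contexts that cannot spin on zone locks. */
        __free_frozen_pages(page, order, FPI_TRYLOCK);
}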
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 14/21] slab: simplify kmalloc_nolock()
2026-01-20 12:06 ` Hao Li
@ 2026-01-21 17:39 ` Suren Baghdasaryan
0 siblings, 0 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-21 17:39 UTC (permalink / raw)
To: Hao Li
Cc: Vlastimil Babka, Harry Yoo, Petr Tesarik, Christoph Lameter,
David Rientjes, Roman Gushchin, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Tue, Jan 20, 2026 at 12:07 PM Hao Li <hao.li@linux.dev> wrote:
>
> On Fri, Jan 16, 2026 at 03:40:34PM +0100, Vlastimil Babka wrote:
> > The kmalloc_nolock() implementation has several complications and
> > restrictions due to SLUB's cpu slab locking, lockless fastpath and
> > PREEMPT_RT differences. With cpu slab usage removed, we can simplify
> > things:
> >
> > - relax the PREEMPT_RT context checks as they were before commit
> > a4ae75d1b6a2 ("slab: fix kmalloc_nolock() context check for
> > PREEMPT_RT") and also reference the explanation comment in the page
> > allocator
> >
> > - the local_lock_cpu_slab() macros became unused, remove them
> >
> > - we no longer need to set up lockdep classes on PREEMPT_RT
> >
> > - we no longer need to annotate ___slab_alloc as NOKPROBE_SYMBOL
> > since there's no lockless cpu freelist manipulation anymore
> >
> > - __slab_alloc_node() can be called from kmalloc_nolock_noprof()
> > unconditionally. It can also no longer return EBUSY. But trylock
> > failures can still happen so retry with the larger bucket if the
> > allocation fails for any reason.
> >
> > Note that we still need __CMPXCHG_DOUBLE: while the cpu freelist was
> > removed and we no longer use cmpxchg16b on it, we still use it on the
> > slab freelist, and the alternative is slab_lock() which can be
> > interrupted by an NMI. Clarify the comment to mention it specifically.
> >
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> > mm/slab.h | 1 -
> > mm/slub.c | 144 +++++++++++++-------------------------------------------------
> > 2 files changed, 29 insertions(+), 116 deletions(-)
> >
>
> Looks good to me.
> Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> --
> Thanks,
> Hao
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 15/21] slab: remove struct kmem_cache_cpu
2026-01-21 14:29 ` Vlastimil Babka
@ 2026-01-21 17:54 ` Suren Baghdasaryan
2026-01-21 19:03 ` Vlastimil Babka
0 siblings, 1 reply; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-21 17:54 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Hao Li, Harry Yoo, Petr Tesarik, Christoph Lameter,
David Rientjes, Roman Gushchin, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Wed, Jan 21, 2026 at 2:29 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/20/26 13:40, Hao Li wrote:
> > On Fri, Jan 16, 2026 at 03:40:35PM +0100, Vlastimil Babka wrote:
> >> @@ -3853,7 +3632,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
> >> }
> >>
> >> /*
> >> - * Flush cpu slab.
> >> + * Flush percpu sheaves
> >> *
> >> * Called from CPU work handler with migration disabled.
> >> */
> >> @@ -3868,8 +3647,6 @@ static void flush_cpu_slab(struct work_struct *w)
> >
> > Nit: Would it make sense to rename flush_cpu_slab to flush_cpu_sheaf for better
> > clarity?
>
> OK
>
> > Other than that, looks good to me. Thanks.
> >
> > Reviewed-by: Hao Li <hao.li@linux.dev>
I noticed one hit on deactivate_slab in the comments after applying
the entire patchset. Other than that LGTM.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> Thanks!
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 16/21] slab: remove unused PREEMPT_RT specific macros
2026-01-21 6:42 ` Hao Li
@ 2026-01-21 17:57 ` Suren Baghdasaryan
0 siblings, 0 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-21 17:57 UTC (permalink / raw)
To: Hao Li
Cc: Vlastimil Babka, Harry Yoo, Petr Tesarik, Christoph Lameter,
David Rientjes, Roman Gushchin, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Wed, Jan 21, 2026 at 6:43 AM Hao Li <hao.li@linux.dev> wrote:
>
> On Fri, Jan 16, 2026 at 03:40:36PM +0100, Vlastimil Babka wrote:
> > The macros slub_get_cpu_ptr()/slub_put_cpu_ptr() are now unused, remove
> > them. USE_LOCKLESS_FAST_PATH() has lost its true meaning with the code
> > being removed. The only remaining usage is in fact testing whether we
> > can assert irqs disabled, because spin_lock_irqsave() only does that on
> > !RT. Test for CONFIG_PREEMPT_RT instead.
> >
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> > mm/slub.c | 24 +-----------------------
> > 1 file changed, 1 insertion(+), 23 deletions(-)
> >
>
> Looks good to me.
> Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> --
> Thanks,
> Hao
>
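The replacement test described in the changelog amounts to something like
this at the former USE_LOCKLESS_FAST_PATH() site (sketch, not the exact
hunk from the patch):

        /*
         * spin_lock_irqsave() disables interrupts only on !PREEMPT_RT,
         * so the irqs-disabled assertion is only valid there.
         */
        if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                lockdep_assert_irqs_disabled();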
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 17/21] slab: refill sheaves from all nodes
2026-01-16 14:40 ` [PATCH v3 17/21] slab: refill sheaves from all nodes Vlastimil Babka
@ 2026-01-21 18:30 ` Suren Baghdasaryan
2026-01-22 4:44 ` Harry Yoo
` (2 subsequent siblings)
3 siblings, 0 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-21 18:30 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Fri, Jan 16, 2026 at 2:41 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> __refill_objects() currently only attempts to get partial slabs from the
> local node and then allocates new slab(s). Expand it to trying also
> other nodes while observing the remote node defrag ratio, similarly to
> get_any_partial().
>
> This will prevent allocating new slabs on a node while other nodes have
> many free slabs. It does mean sheaves will contain non-local objects in
> that case. Allocations that care about a specific node will still be
> served appropriately, but might get a slowpath allocation.
>
> Like get_any_partial() we do observe cpuset_zone_allowed(), although we
> might be refilling a sheaf that will be then used from a different
> allocation context.
>
> We can also use the resulting refill_objects() in
> __kmem_cache_alloc_bulk() for non-debug caches. This means
> kmem_cache_alloc_bulk() will get better performance when sheaves are
> exhausted. kmem_cache_alloc_bulk() cannot indicate a preferred node so
> it's compatible with sheaves refill in preferring the local node.
> Its users also have gfp flags that allow spinning, so document that
> as a requirement.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/slub.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 106 insertions(+), 31 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index d52de6e3c2d5..2c522d2bf547 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2518,8 +2518,8 @@ static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
> }
>
> static unsigned int
> -__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> - unsigned int max);
> +refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max);
>
> static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> gfp_t gfp)
> @@ -2530,8 +2530,8 @@ static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> if (!to_fill)
> return 0;
>
> - filled = __refill_objects(s, &sheaf->objects[sheaf->size], gfp,
> - to_fill, to_fill);
> + filled = refill_objects(s, &sheaf->objects[sheaf->size], gfp, to_fill,
> + to_fill);
>
> sheaf->size += filled;
>
> @@ -6522,29 +6522,22 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> EXPORT_SYMBOL(kmem_cache_free_bulk);
>
> static unsigned int
> -__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> - unsigned int max)
> +__refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max, struct kmem_cache_node *n)
> {
> struct slab *slab, *slab2;
> struct partial_context pc;
> unsigned int refilled = 0;
> unsigned long flags;
> void *object;
> - int node;
>
> pc.flags = gfp;
> pc.min_objects = min;
> pc.max_objects = max;
>
> - node = numa_mem_id();
> -
> - if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
> + if (!get_partial_node_bulk(s, n, &pc))
> return 0;
>
> - /* TODO: consider also other nodes? */
> - if (!get_partial_node_bulk(s, get_node(s, node), &pc))
> - goto new_slab;
> -
> list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
>
> list_del(&slab->slab_list);
> @@ -6582,8 +6575,6 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> }
>
> if (unlikely(!list_empty(&pc.slabs))) {
> - struct kmem_cache_node *n = get_node(s, node);
> -
> spin_lock_irqsave(&n->list_lock, flags);
>
> list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> @@ -6605,13 +6596,92 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> }
> }
>
> + return refilled;
> +}
>
> - if (likely(refilled >= min))
> - goto out;
> +#ifdef CONFIG_NUMA
> +static unsigned int
> +__refill_objects_any(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max, int local_node)
> +{
> + struct zonelist *zonelist;
> + struct zoneref *z;
> + struct zone *zone;
> + enum zone_type highest_zoneidx = gfp_zone(gfp);
> + unsigned int cpuset_mems_cookie;
> + unsigned int refilled = 0;
> +
> + /* see get_any_partial() for the defrag ratio description */
> + if (!s->remote_node_defrag_ratio ||
> + get_cycles() % 1024 > s->remote_node_defrag_ratio)
> + return 0;
> +
> + do {
> + cpuset_mems_cookie = read_mems_allowed_begin();
> + zonelist = node_zonelist(mempolicy_slab_node(), gfp);
> + for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
> + struct kmem_cache_node *n;
> + unsigned int r;
> +
> + n = get_node(s, zone_to_nid(zone));
> +
> + if (!n || !cpuset_zone_allowed(zone, gfp) ||
> + n->nr_partial <= s->min_partial)
> + continue;
> +
> + r = __refill_objects_node(s, p, gfp, min, max, n);
> + refilled += r;
> +
> + if (r >= min) {
> + /*
> + * Don't check read_mems_allowed_retry() here -
> + * if mems_allowed was updated in parallel, that
> + * was a harmless race between allocation and
> + * the cpuset update
> + */
> + return refilled;
> + }
> + p += r;
> + min -= r;
> + max -= r;
> + }
> + } while (read_mems_allowed_retry(cpuset_mems_cookie));
> +
> + return refilled;
> +}
> +#else
> +static inline unsigned int
> +__refill_objects_any(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max, int local_node)
> +{
> + return 0;
> +}
> +#endif
> +
> +static unsigned int
> +refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max)
> +{
> + int local_node = numa_mem_id();
> + unsigned int refilled;
> + struct slab *slab;
> +
> + if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
> + return 0;
> +
> + refilled = __refill_objects_node(s, p, gfp, min, max,
> + get_node(s, local_node));
> + if (refilled >= min)
> + return refilled;
> +
> + refilled += __refill_objects_any(s, p + refilled, gfp, min - refilled,
> + max - refilled, local_node);
> + if (refilled >= min)
> + return refilled;
>
> new_slab:
>
> - slab = new_slab(s, pc.flags, node);
> + slab = new_slab(s, gfp, local_node);
> if (!slab)
> goto out;
>
> @@ -6626,8 +6696,8 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
>
> if (refilled < min)
> goto new_slab;
> -out:
>
> +out:
> return refilled;
> }
>
> @@ -6637,18 +6707,20 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> {
> int i;
>
> - /*
> - * TODO: this might be more efficient (if necessary) by reusing
> - * __refill_objects()
> - */
> - for (i = 0; i < size; i++) {
> + if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> + for (i = 0; i < size; i++) {
>
> - p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
> - s->object_size);
> - if (unlikely(!p[i]))
> - goto error;
> + p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
> + s->object_size);
> + if (unlikely(!p[i]))
> + goto error;
>
> - maybe_wipe_obj_freeptr(s, p[i]);
> + maybe_wipe_obj_freeptr(s, p[i]);
> + }
> + } else {
> + i = refill_objects(s, p, flags, size, size);
> + if (i < size)
> + goto error;
> }
>
> return i;
> @@ -6659,7 +6731,10 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>
> }
>
> -/* Note that interrupts must be enabled when calling this function. */
> +/*
> + * Note that interrupts must be enabled when calling this function and gfp
> + * flags must allow spinning.
> + */
> int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
> void **p)
> {
>
> --
> 2.52.0
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 15/21] slab: remove struct kmem_cache_cpu
2026-01-21 17:54 ` Suren Baghdasaryan
@ 2026-01-21 19:03 ` Vlastimil Babka
0 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-21 19:03 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Hao Li, Harry Yoo, Petr Tesarik, Christoph Lameter,
David Rientjes, Roman Gushchin, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On 1/21/26 18:54, Suren Baghdasaryan wrote:
> On Wed, Jan 21, 2026 at 2:29 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 1/20/26 13:40, Hao Li wrote:
>> > On Fri, Jan 16, 2026 at 03:40:35PM +0100, Vlastimil Babka wrote:
>> >> @@ -3853,7 +3632,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
>> >> }
>> >>
>> >> /*
>> >> - * Flush cpu slab.
>> >> + * Flush percpu sheaves
>> >> *
>> >> * Called from CPU work handler with migration disabled.
>> >> */
>> >> @@ -3868,8 +3647,6 @@ static void flush_cpu_slab(struct work_struct *w)
>> >
>> > Nit: Would it make sense to rename flush_cpu_slab to flush_cpu_sheaf for better
>> > clarity?
>>
>> OK
>>
>> > Other than that, looks good to me. Thanks.
>> >
>> > Reviewed-by: Hao Li <hao.li@linux.dev>
>
> I noticed one hit on deactivate_slab in the comments after applying
> the entire patchset. Other than that LGTM.
Thanks, I'll remove it as part of "slab: remove defer_deactivate_slab()"
where it belongs.
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
>>
>> Thanks!
>>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 18/21] slab: update overview comments
2026-01-16 14:40 ` [PATCH v3 18/21] slab: update overview comments Vlastimil Babka
@ 2026-01-21 20:58 ` Suren Baghdasaryan
2026-01-22 3:54 ` Hao Li
2026-01-22 6:41 ` Harry Yoo
2 siblings, 0 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-21 20:58 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Fri, Jan 16, 2026 at 2:41 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> The changes related to sheaves made the description of locking and other
> details outdated. Update it to reflect current state.
>
> Also add a new copyright line due to major changes.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/slub.c | 141 +++++++++++++++++++++++++++++---------------------------------
> 1 file changed, 67 insertions(+), 74 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 2c522d2bf547..476a279f1a94 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1,13 +1,15 @@
> // SPDX-License-Identifier: GPL-2.0
> /*
> - * SLUB: A slab allocator that limits cache line use instead of queuing
> - * objects in per cpu and per node lists.
> + * SLUB: A slab allocator with low overhead percpu array caches and mostly
> + * lockless freeing of objects to slabs in the slowpath.
> *
> - * The allocator synchronizes using per slab locks or atomic operations
> - * and only uses a centralized lock to manage a pool of partial slabs.
> + * The allocator synchronizes using spin_trylock for percpu arrays in the
> + * fastpath, and cmpxchg_double (or bit spinlock) for slowpath freeing.
> + * Uses a centralized lock to manage a pool of partial slabs.
> *
> * (C) 2007 SGI, Christoph Lameter
> * (C) 2011 Linux Foundation, Christoph Lameter
> + * (C) 2025 SUSE, Vlastimil Babka
> */
>
> #include <linux/mm.h>
> @@ -53,11 +55,13 @@
>
> /*
> * Lock order:
> - * 1. slab_mutex (Global Mutex)
> - * 2. node->list_lock (Spinlock)
> - * 3. kmem_cache->cpu_slab->lock (Local lock)
> - * 4. slab_lock(slab) (Only on some arches)
> - * 5. object_map_lock (Only for debugging)
> + * 0. cpu_hotplug_lock
> + * 1. slab_mutex (Global Mutex)
> + * 2a. kmem_cache->cpu_sheaves->lock (Local trylock)
> + * 2b. node->barn->lock (Spinlock)
> + * 2c. node->list_lock (Spinlock)
> + * 3. slab_lock(slab) (Only on some arches)
> + * 4. object_map_lock (Only for debugging)
> *
> * slab_mutex
> *
> @@ -78,31 +82,38 @@
> * C. slab->objects -> Number of objects in slab
> * D. slab->frozen -> frozen state
> *
> - * Frozen slabs
> + * SL_partial slabs
> + *
> + * Slabs on node partial list have at least one free object. A limited number
> + * of slabs on the list can be fully free (slab->inuse == 0), until we start
> + * discarding them. These slabs are marked with SL_partial, and the flag is
> + * cleared while removing them, usually to grab their freelist afterwards.
> + * This clearing also exempts them from list management. Please see
> + * __slab_free() for more details.
> *
> - * If a slab is frozen then it is exempt from list management. It is
> - * the cpu slab which is actively allocated from by the processor that
> - * froze it and it is not on any list. The processor that froze the
> - * slab is the one who can perform list operations on the slab. Other
> - * processors may put objects onto the freelist but the processor that
> - * froze the slab is the only one that can retrieve the objects from the
> - * slab's freelist.
> + * Full slabs
> *
> - * CPU partial slabs
> + * For caches without debugging enabled, full slabs (slab->inuse ==
> + * slab->objects and slab->freelist == NULL) are not placed on any list.
> + * The __slab_free() freeing the first object from such a slab will place
> + * it on the partial list. Caches with debugging enabled place such slab
> + * on the full list and use different allocation and freeing paths.
> + *
> + * Frozen slabs
> *
> - * The partially empty slabs cached on the CPU partial list are used
> - * for performance reasons, which speeds up the allocation process.
> - * These slabs are not frozen, but are also exempt from list management,
> - * by clearing the SL_partial flag when moving out of the node
> - * partial list. Please see __slab_free() for more details.
> + * If a slab is frozen then it is exempt from list management. It is used to
> + * indicate a slab that has failed consistency checks and thus cannot be
> + * allocated from anymore - it is also marked as full. Any previously
> + * allocated objects will be simply leaked upon freeing instead of attempting
> + * to modify the potentially corrupted freelist and metadata.
> *
> * To sum up, the current scheme is:
> - * - node partial slab: SL_partial && !frozen
> - * - cpu partial slab: !SL_partial && !frozen
> - * - cpu slab: !SL_partial && frozen
> - * - full slab: !SL_partial && !frozen
> + * - node partial slab: SL_partial && !full && !frozen
> + * - taken off partial list: !SL_partial && !full && !frozen
> + * - full slab, not on any list: !SL_partial && full && !frozen
> + * - frozen due to inconsistency: !SL_partial && full && frozen
> *
> - * list_lock
> + * node->list_lock (spinlock)
> *
> * The list_lock protects the partial and full list on each node and
> * the partial slab counter. If taken then no new slabs may be added or
> @@ -112,47 +123,46 @@
> *
> * The list_lock is a centralized lock and thus we avoid taking it as
> * much as possible. As long as SLUB does not have to handle partial
> - * slabs, operations can continue without any centralized lock. F.e.
> - * allocating a long series of objects that fill up slabs does not require
> - * the list lock.
> + * slabs, operations can continue without any centralized lock.
> *
> * For debug caches, all allocations are forced to go through a list_lock
> * protected region to serialize against concurrent validation.
> *
> - * cpu_slab->lock local lock
> + * cpu_sheaves->lock (local_trylock)
> *
> - * This locks protect slowpath manipulation of all kmem_cache_cpu fields
> - * except the stat counters. This is a percpu structure manipulated only by
> - * the local cpu, so the lock protects against being preempted or interrupted
> - * by an irq. Fast path operations rely on lockless operations instead.
> + * This lock protects fastpath operations on the percpu sheaves. On !RT it
> + * only disables preemption and does no atomic operations. As long as the main
> + * or spare sheaf can handle the allocation or free, there is no other
> + * overhead.
> *
> - * On PREEMPT_RT, the local lock neither disables interrupts nor preemption
> - * which means the lockless fastpath cannot be used as it might interfere with
> - * an in-progress slow path operations. In this case the local lock is always
> - * taken but it still utilizes the freelist for the common operations.
> + * node->barn->lock (spinlock)
> *
> - * lockless fastpaths
> + * This lock protects the operations on per-NUMA-node barn. It can quickly
> + * serve an empty or full sheaf if available, and avoid more expensive refill
> + * or flush operation.
> *
> - * The fast path allocation (slab_alloc_node()) and freeing (do_slab_free())
> - * are fully lockless when satisfied from the percpu slab (and when
> - * cmpxchg_double is possible to use, otherwise slab_lock is taken).
> - * They also don't disable preemption or migration or irqs. They rely on
> - * the transaction id (tid) field to detect being preempted or moved to
> - * another cpu.
> + * Lockless freeing
> + *
> + * Objects may have to be freed to their slabs when they are from a remote
> + * node (where we want to avoid filling local sheaves with remote objects)
> + * or when there are too many full sheaves. On architectures supporting
> + * cmpxchg_double this is done by a lockless update of slab's freelist and
> + * counters, otherwise slab_lock is taken. This only needs to take the
> + * list_lock if it's a first free to a full slab, or when there are too many
> + * fully free slabs and some need to be discarded.
> *
> * irq, preemption, migration considerations
> *
> - * Interrupts are disabled as part of list_lock or local_lock operations, or
> + * Interrupts are disabled as part of list_lock or barn lock operations, or
> * around the slab_lock operation, in order to make the slab allocator safe
> * to use in the context of an irq.
> + * Preemption is disabled as part of local_trylock operations.
> + * kmalloc_nolock() and kfree_nolock() are safe in NMI context but see
> + * their limitations.
> *
> - * In addition, preemption (or migration on PREEMPT_RT) is disabled in the
> - * allocation slowpath, bulk allocation, and put_cpu_partial(), so that the
> - * local cpu doesn't change in the process and e.g. the kmem_cache_cpu pointer
> - * doesn't have to be revalidated in each section protected by the local lock.
> - *
> - * SLUB assigns one slab for allocation to each processor.
> - * Allocations only occur from these slabs called cpu slabs.
> + * SLUB assigns two object arrays called sheaves for caching allocation and
s/allocation/allocations
> + * frees on each cpu, with a NUMA node shared barn for balancing between cpus.
> + * Allocations and frees are primarily served from these sheaves.
> *
> * Slabs with free elements are kept on a partial list and during regular
> * operations no list for full slabs is used. If an object in a full slab is
> @@ -160,25 +170,8 @@
> * We track full slabs for debugging purposes though because otherwise we
> * cannot scan all objects.
> *
> - * Slabs are freed when they become empty. Teardown and setup is
> - * minimal so we rely on the page allocators per cpu caches for
> - * fast frees and allocs.
> - *
> - * slab->frozen The slab is frozen and exempt from list processing.
> - * This means that the slab is dedicated to a purpose
> - * such as satisfying allocations for a specific
> - * processor. Objects may be freed in the slab while
> - * it is frozen but slab_free will then skip the usual
> - * list operations. It is up to the processor holding
> - * the slab to integrate the slab into the slab lists
> - * when the slab is no longer needed.
> - *
> - * One use of this flag is to mark slabs that are
> - * used for allocations. Then such a slab becomes a cpu
> - * slab. The cpu slab may be equipped with an additional
> - * freelist that allows lockless access to
> - * free objects in addition to the regular freelist
> - * that requires the slab lock.
> + * Slabs are freed when they become empty. Teardown and setup is minimal so we
> + * rely on the page allocators per cpu caches for fast frees and allocs.
> *
> * SLAB_DEBUG_FLAGS Slab requires special handling due to debug
> * options set. This moves slab handling out of
>
> --
> 2.52.0
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 19/21] slab: remove frozen slab checks from __slab_free()
2026-01-16 14:40 ` [PATCH v3 19/21] slab: remove frozen slab checks from __slab_free() Vlastimil Babka
@ 2026-01-22 0:54 ` Suren Baghdasaryan
2026-01-22 6:31 ` Vlastimil Babka
2026-01-22 5:01 ` Hao Li
1 sibling, 1 reply; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-22 0:54 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Fri, Jan 16, 2026 at 2:41 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Currently slabs are only frozen after consistency checks failed. This
> can happen only in caches with debugging enabled, and those use
> free_to_partial_list() for freeing. The non-debug operation of
> __slab_free() can thus stop considering the frozen field, and we can
> remove the FREE_FROZEN stat.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Functionally looks fine to me. Do we need to do something about the
UAPI breakage that removal of a sysfs node might cause?
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/slub.c | 22 ++++------------------
> 1 file changed, 4 insertions(+), 18 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 476a279f1a94..7ec7049c0ca5 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -333,7 +333,6 @@ enum stat_item {
> FREE_RCU_SHEAF_FAIL, /* Failed to free to a rcu_free sheaf */
> FREE_FASTPATH, /* Free to cpu slab */
> FREE_SLOWPATH, /* Freeing not to cpu slab */
> - FREE_FROZEN, /* Freeing to frozen slab */
> FREE_ADD_PARTIAL, /* Freeing moves slab to partial list */
> FREE_REMOVE_PARTIAL, /* Freeing removes last object */
> ALLOC_FROM_PARTIAL, /* Cpu slab acquired from node partial list */
> @@ -5103,7 +5102,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> unsigned long addr)
>
> {
> - bool was_frozen, was_full;
> + bool was_full;
> struct freelist_counters old, new;
> struct kmem_cache_node *n = NULL;
> unsigned long flags;
> @@ -5126,7 +5125,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> old.counters = slab->counters;
>
> was_full = (old.freelist == NULL);
> - was_frozen = old.frozen;
>
> set_freepointer(s, tail, old.freelist);
>
> @@ -5139,7 +5137,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> * to (due to not being full anymore) the partial list.
> * Unless it's frozen.
> */
> - if ((!new.inuse || was_full) && !was_frozen) {
> + if (!new.inuse || was_full) {
>
> n = get_node(s, slab_nid(slab));
> /*
> @@ -5158,20 +5156,10 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> } while (!slab_update_freelist(s, slab, &old, &new, "__slab_free"));
>
> if (likely(!n)) {
> -
> - if (likely(was_frozen)) {
> - /*
> - * The list lock was not taken therefore no list
> - * activity can be necessary.
> - */
> - stat(s, FREE_FROZEN);
> - }
> -
> /*
> - * In other cases we didn't take the list_lock because the slab
> - * was already on the partial list and will remain there.
> + * We didn't take the list_lock because the slab was already on
> + * the partial list and will remain there.
> */
> -
> return;
> }
>
> @@ -8721,7 +8709,6 @@ STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
> STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
> STAT_ATTR(FREE_FASTPATH, free_fastpath);
> STAT_ATTR(FREE_SLOWPATH, free_slowpath);
> -STAT_ATTR(FREE_FROZEN, free_frozen);
> STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
> STAT_ATTR(FREE_REMOVE_PARTIAL, free_remove_partial);
> STAT_ATTR(ALLOC_FROM_PARTIAL, alloc_from_partial);
> @@ -8826,7 +8813,6 @@ static struct attribute *slab_attrs[] = {
> &free_rcu_sheaf_fail_attr.attr,
> &free_fastpath_attr.attr,
> &free_slowpath_attr.attr,
> - &free_frozen_attr.attr,
> &free_add_partial_attr.attr,
> &free_remove_partial_attr.attr,
> &alloc_from_partial_attr.attr,
>
> --
> 2.52.0
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 20/21] mm/slub: remove DEACTIVATE_TO_* stat items
2026-01-16 14:40 ` [PATCH v3 20/21] mm/slub: remove DEACTIVATE_TO_* stat items Vlastimil Babka
@ 2026-01-22 0:58 ` Suren Baghdasaryan
2026-01-22 5:17 ` Hao Li
1 sibling, 0 replies; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-22 0:58 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Fri, Jan 16, 2026 at 2:41 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> The cpu slabs and their deactivations were removed, so remove the unused
> stat items. Weirdly enough the values were also used to control
> __add_partial() adding to head or tail of the list, so replace that with
> a new enum add_mode, which is cleaner.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Same question about UAPI breakage, but otherwise LGTM.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/slub.c | 31 +++++++++++++++----------------
> 1 file changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 7ec7049c0ca5..c12e90cb2fca 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -324,6 +324,11 @@ static void debugfs_slab_add(struct kmem_cache *);
> static inline void debugfs_slab_add(struct kmem_cache *s) { }
> #endif
>
> +enum add_mode {
> + ADD_TO_HEAD,
> + ADD_TO_TAIL,
> +};
> +
> enum stat_item {
> ALLOC_PCS, /* Allocation from percpu sheaf */
> ALLOC_FASTPATH, /* Allocation from cpu slab */
> @@ -343,8 +348,6 @@ enum stat_item {
> CPUSLAB_FLUSH, /* Abandoning of the cpu slab */
> DEACTIVATE_FULL, /* Cpu slab was full when deactivated */
> DEACTIVATE_EMPTY, /* Cpu slab was empty when deactivated */
> - DEACTIVATE_TO_HEAD, /* Cpu slab was moved to the head of partials */
> - DEACTIVATE_TO_TAIL, /* Cpu slab was moved to the tail of partials */
> DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
> DEACTIVATE_BYPASS, /* Implicit deactivation */
> ORDER_FALLBACK, /* Number of times fallback was necessary */
> @@ -3268,10 +3271,10 @@ static inline void slab_clear_node_partial(struct slab *slab)
> * Management of partially allocated slabs.
> */
> static inline void
> -__add_partial(struct kmem_cache_node *n, struct slab *slab, int tail)
> +__add_partial(struct kmem_cache_node *n, struct slab *slab, enum add_mode mode)
> {
> n->nr_partial++;
> - if (tail == DEACTIVATE_TO_TAIL)
> + if (mode == ADD_TO_TAIL)
> list_add_tail(&slab->slab_list, &n->partial);
> else
> list_add(&slab->slab_list, &n->partial);
> @@ -3279,10 +3282,10 @@ __add_partial(struct kmem_cache_node *n, struct slab *slab, int tail)
> }
>
> static inline void add_partial(struct kmem_cache_node *n,
> - struct slab *slab, int tail)
> + struct slab *slab, enum add_mode mode)
> {
> lockdep_assert_held(&n->list_lock);
> - __add_partial(n, slab, tail);
> + __add_partial(n, slab, mode);
> }
>
> static inline void remove_partial(struct kmem_cache_node *n,
> @@ -3375,7 +3378,7 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> if (slab->inuse == slab->objects)
> add_full(s, n, slab);
> else
> - add_partial(n, slab, DEACTIVATE_TO_HEAD);
> + add_partial(n, slab, ADD_TO_HEAD);
>
> inc_slabs_node(s, nid, slab->objects);
> spin_unlock_irqrestore(&n->list_lock, flags);
> @@ -3996,7 +3999,7 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> n = get_node(s, slab_nid(slab));
> spin_lock_irqsave(&n->list_lock, flags);
> }
> - add_partial(n, slab, DEACTIVATE_TO_HEAD);
> + add_partial(n, slab, ADD_TO_HEAD);
> spin_unlock_irqrestore(&n->list_lock, flags);
> }
>
> @@ -5064,7 +5067,7 @@ static noinline void free_to_partial_list(
> /* was on full list */
> remove_full(s, n, slab);
> if (!slab_free) {
> - add_partial(n, slab, DEACTIVATE_TO_TAIL);
> + add_partial(n, slab, ADD_TO_TAIL);
> stat(s, FREE_ADD_PARTIAL);
> }
> } else if (slab_free) {
> @@ -5184,7 +5187,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> * then add it.
> */
> if (unlikely(was_full)) {
> - add_partial(n, slab, DEACTIVATE_TO_TAIL);
> + add_partial(n, slab, ADD_TO_TAIL);
> stat(s, FREE_ADD_PARTIAL);
> }
> spin_unlock_irqrestore(&n->list_lock, flags);
> @@ -6564,7 +6567,7 @@ __refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int mi
> continue;
>
> list_del(&slab->slab_list);
> - add_partial(n, slab, DEACTIVATE_TO_HEAD);
> + add_partial(n, slab, ADD_TO_HEAD);
> }
>
> spin_unlock_irqrestore(&n->list_lock, flags);
> @@ -7031,7 +7034,7 @@ static void early_kmem_cache_node_alloc(int node)
> * No locks need to be taken here as it has just been
> * initialized and there is no concurrent access.
> */
> - __add_partial(n, slab, DEACTIVATE_TO_HEAD);
> + __add_partial(n, slab, ADD_TO_HEAD);
> }
>
> static void free_kmem_cache_nodes(struct kmem_cache *s)
> @@ -8719,8 +8722,6 @@ STAT_ATTR(FREE_SLAB, free_slab);
> STAT_ATTR(CPUSLAB_FLUSH, cpuslab_flush);
> STAT_ATTR(DEACTIVATE_FULL, deactivate_full);
> STAT_ATTR(DEACTIVATE_EMPTY, deactivate_empty);
> -STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate_to_head);
> -STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail);
> STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
> STAT_ATTR(DEACTIVATE_BYPASS, deactivate_bypass);
> STAT_ATTR(ORDER_FALLBACK, order_fallback);
> @@ -8823,8 +8824,6 @@ static struct attribute *slab_attrs[] = {
> &cpuslab_flush_attr.attr,
> &deactivate_full_attr.attr,
> &deactivate_empty_attr.attr,
> - &deactivate_to_head_attr.attr,
> - &deactivate_to_tail_attr.attr,
> &deactivate_remote_frees_attr.attr,
> &deactivate_bypass_attr.attr,
> &order_fallback_attr.attr,
>
> --
> 2.52.0
>
* Re: [PATCH v3 14/21] slab: simplify kmalloc_nolock()
2026-01-16 14:40 ` [PATCH v3 14/21] slab: simplify kmalloc_nolock() Vlastimil Babka
2026-01-20 12:06 ` Hao Li
@ 2026-01-22 1:53 ` Harry Yoo
2026-01-22 8:16 ` Vlastimil Babka
1 sibling, 1 reply; 106+ messages in thread
From: Harry Yoo @ 2026-01-22 1:53 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:34PM +0100, Vlastimil Babka wrote:
> The kmalloc_nolock() implementation has several complications and
> restrictions due to SLUB's cpu slab locking, lockless fastpath and
> PREEMPT_RT differences. With cpu slab usage removed, we can simplify
> things:
>
> - relax the PREEMPT_RT context checks as they were before commit
> a4ae75d1b6a2 ("slab: fix kmalloc_nolock() context check for
> PREEMPT_RT") and also reference the explanation comment in the page
> allocator
>
> - the local_lock_cpu_slab() macros became unused, remove them
>
> - we no longer need to set up lockdep classes on PREEMPT_RT
>
> - we no longer need to annotate ___slab_alloc as NOKPROBE_SYMBOL
> since there's no lockless cpu freelist manipulation anymore
>
> - __slab_alloc_node() can be called from kmalloc_nolock_noprof()
> unconditionally. It can also no longer return EBUSY. But trylock
> failures can still happen so retry with the larger bucket if the
> allocation fails for any reason.
>
> Note that we still need __CMPXCHG_DOUBLE: while we no longer use
> cmpxchg16b on the cpu freelist (that code was removed), we still use it
> on the slab freelist, and the alternative is slab_lock() which can be
> interrupted by an NMI. Clarify the comment to mention it specifically.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
What a nice cleanup!
Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
with a nit below.
> mm/slab.h | 1 -
> mm/slub.c | 144 +++++++++++++-------------------------------------------------
> 2 files changed, 29 insertions(+), 116 deletions(-)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 4efec41b6445..e9a0738133ed 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -5268,10 +5196,11 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
> /*
> * kmalloc_nolock() is not supported on architectures that
> - * don't implement cmpxchg16b, but debug caches don't use
> - * per-cpu slab and per-cpu partial slabs. They rely on
> - * kmem_cache_node->list_lock, so kmalloc_nolock() can
> - * attempt to allocate from debug caches by
> + * don't implement cmpxchg16b and thus need slab_lock()
> + * which could be preempted by a nmi.
nit: I think now this limitation can be removed because the only slab
lock used in the allocation path is get_partial_node() ->
__slab_update_freelist(), but it is always used under n->list_lock.
Being preempted by a NMI while holding the slab lock is fine because
NMI context should fail to acquire n->list_lock and bail out.
But no hurry on this, it's probably not important enough to delay
this series :)
> + * But debug caches don't use that and only rely on
> + * kmem_cache_node->list_lock, so kmalloc_nolock() can attempt
> + * to allocate from debug caches by
> * spin_trylock_irqsave(&n->list_lock, ...)
> */
> return NULL;
>
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v3 21/21] mm/slub: cleanup and repurpose some stat items
2026-01-16 14:40 ` [PATCH v3 21/21] mm/slub: cleanup and repurpose some " Vlastimil Babka
@ 2026-01-22 2:35 ` Suren Baghdasaryan
2026-01-22 9:30 ` Vlastimil Babka
2026-01-22 5:52 ` Hao Li
1 sibling, 1 reply; 106+ messages in thread
From: Suren Baghdasaryan @ 2026-01-22 2:35 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On Fri, Jan 16, 2026 at 6:41 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> A number of stat items related to cpu slabs became unused, remove them.
>
> Two of those were ALLOC_FASTPATH and FREE_FASTPATH. But instead of
> removing those, use them instead of ALLOC_PCS and FREE_PCS, since
> sheaves are the new (and only) fastpaths. Remove the recently added
> _PCS variants instead.
>
> Change where FREE_SLOWPATH is counted so that it only counts freeing of
> objects by slab users that (for whatever reason) do not go to a percpu
> sheaf, and not all (including internal) callers of __slab_free(). Thus
> flushing sheaves (counted by SHEAF_FLUSH) no longer also increments
> FREE_SLOWPATH.
nit: I think I understand what you mean but "no longer also
increments" sounds wrong. Maybe repharase as "Thus sheaf flushing
(already counted by SHEAF_FLUSH) does not affect FREE_SLOWPATH
anymore."?
> This matches how ALLOC_SLOWPATH doesn't count sheaf
> refills (counted by SHEAF_REFILL).
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 77 +++++++++++++++++----------------------------------------------
> 1 file changed, 21 insertions(+), 56 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index c12e90cb2fca..d73ad44fa046 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -330,33 +330,19 @@ enum add_mode {
> };
>
> enum stat_item {
> - ALLOC_PCS, /* Allocation from percpu sheaf */
> - ALLOC_FASTPATH, /* Allocation from cpu slab */
> - ALLOC_SLOWPATH, /* Allocation by getting a new cpu slab */
> - FREE_PCS, /* Free to percpu sheaf */
> + ALLOC_FASTPATH, /* Allocation from percpu sheaves */
> + ALLOC_SLOWPATH, /* Allocation from partial or new slab */
> FREE_RCU_SHEAF, /* Free to rcu_free sheaf */
> FREE_RCU_SHEAF_FAIL, /* Failed to free to a rcu_free sheaf */
> - FREE_FASTPATH, /* Free to cpu slab */
> - FREE_SLOWPATH, /* Freeing not to cpu slab */
> + FREE_FASTPATH, /* Free to percpu sheaves */
> + FREE_SLOWPATH, /* Free to a slab */
> FREE_ADD_PARTIAL, /* Freeing moves slab to partial list */
> FREE_REMOVE_PARTIAL, /* Freeing removes last object */
> - ALLOC_FROM_PARTIAL, /* Cpu slab acquired from node partial list */
> - ALLOC_SLAB, /* Cpu slab acquired from page allocator */
> - ALLOC_REFILL, /* Refill cpu slab from slab freelist */
> - ALLOC_NODE_MISMATCH, /* Switching cpu slab */
> + ALLOC_SLAB, /* New slab acquired from page allocator */
> + ALLOC_NODE_MISMATCH, /* Requested node different from cpu sheaf */
> FREE_SLAB, /* Slab freed to the page allocator */
> - CPUSLAB_FLUSH, /* Abandoning of the cpu slab */
> - DEACTIVATE_FULL, /* Cpu slab was full when deactivated */
> - DEACTIVATE_EMPTY, /* Cpu slab was empty when deactivated */
> - DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
> - DEACTIVATE_BYPASS, /* Implicit deactivation */
> ORDER_FALLBACK, /* Number of times fallback was necessary */
> - CMPXCHG_DOUBLE_CPU_FAIL,/* Failures of this_cpu_cmpxchg_double */
> CMPXCHG_DOUBLE_FAIL, /* Failures of slab freelist update */
> - CPU_PARTIAL_ALLOC, /* Used cpu partial on alloc */
> - CPU_PARTIAL_FREE, /* Refill cpu partial on free */
> - CPU_PARTIAL_NODE, /* Refill cpu partial from node partial */
> - CPU_PARTIAL_DRAIN, /* Drain cpu partial to node partial */
> SHEAF_FLUSH, /* Objects flushed from a sheaf */
> SHEAF_REFILL, /* Objects refilled to a sheaf */
> SHEAF_ALLOC, /* Allocation of an empty sheaf */
> @@ -4347,8 +4333,10 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
> * We assume the percpu sheaves contain only local objects although it's
> * not completely guaranteed, so we verify later.
> */
> - if (unlikely(node_requested && node != numa_mem_id()))
> + if (unlikely(node_requested && node != numa_mem_id())) {
> + stat(s, ALLOC_NODE_MISMATCH);
> return NULL;
> + }
>
> if (!local_trylock(&s->cpu_sheaves->lock))
> return NULL;
> @@ -4371,6 +4359,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
> */
> if (page_to_nid(virt_to_page(object)) != node) {
> local_unlock(&s->cpu_sheaves->lock);
> + stat(s, ALLOC_NODE_MISMATCH);
> return NULL;
> }
> }
> @@ -4379,7 +4368,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
>
> local_unlock(&s->cpu_sheaves->lock);
>
> - stat(s, ALLOC_PCS);
> + stat(s, ALLOC_FASTPATH);
>
> return object;
> }
> @@ -4451,7 +4440,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
>
> local_unlock(&s->cpu_sheaves->lock);
>
> - stat_add(s, ALLOC_PCS, batch);
> + stat_add(s, ALLOC_FASTPATH, batch);
>
> allocated += batch;
>
> @@ -5111,8 +5100,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> unsigned long flags;
> bool on_node_partial;
>
> - stat(s, FREE_SLOWPATH);
After moving the above accounting to the callers I think there are
several callers which won't account it anymore:
- free_deferred_objects
- memcg_alloc_abort_single
- slab_free_after_rcu_debug
- ___cache_free
Am I missing something or is that intentional?
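To illustrate, in e.g. free_deferred_objects() I'd have expected something
like the following (untested sketch, the local variable names there are
only assumed - the point is just pairing the stat with the __slab_free()
call):

	__slab_free(s, slab, object, object, 1, _RET_IP_);
	stat(s, FREE_SLOWPATH);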
> -
> if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> free_to_partial_list(s, slab, head, tail, cnt, addr);
> return;
> @@ -5416,7 +5403,7 @@ bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
>
> local_unlock(&s->cpu_sheaves->lock);
>
> - stat(s, FREE_PCS);
> + stat(s, FREE_FASTPATH);
>
> return true;
> }
> @@ -5664,7 +5651,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>
> local_unlock(&s->cpu_sheaves->lock);
>
> - stat_add(s, FREE_PCS, batch);
> + stat_add(s, FREE_FASTPATH, batch);
>
> if (batch < size) {
> p += batch;
> @@ -5686,10 +5673,12 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> */
> fallback:
> __kmem_cache_free_bulk(s, size, p);
> + stat_add(s, FREE_SLOWPATH, size);
>
> flush_remote:
> if (remote_nr) {
> __kmem_cache_free_bulk(s, remote_nr, &remote_objects[0]);
> + stat_add(s, FREE_SLOWPATH, remote_nr);
> if (i < size) {
> remote_nr = 0;
> goto next_remote_batch;
> @@ -5784,6 +5773,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
> }
>
> __slab_free(s, slab, object, object, 1, addr);
> + stat(s, FREE_SLOWPATH);
> }
>
> #ifdef CONFIG_MEMCG
> @@ -5806,8 +5796,10 @@ void slab_free_bulk(struct kmem_cache *s, struct slab *slab, void *head,
> * With KASAN enabled slab_free_freelist_hook modifies the freelist
> * to remove objects, whose reuse must be delayed.
> */
> - if (likely(slab_free_freelist_hook(s, &head, &tail, &cnt)))
> + if (likely(slab_free_freelist_hook(s, &head, &tail, &cnt))) {
> __slab_free(s, slab, head, tail, cnt, addr);
> + stat_add(s, FREE_SLOWPATH, cnt);
> + }
> }
>
> #ifdef CONFIG_SLUB_RCU_DEBUG
> @@ -6705,6 +6697,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> i = refill_objects(s, p, flags, size, size);
> if (i < size)
> goto error;
> + stat_add(s, ALLOC_SLOWPATH, i);
> }
>
> return i;
> @@ -8704,33 +8697,19 @@ static ssize_t text##_store(struct kmem_cache *s, \
> } \
> SLAB_ATTR(text); \
>
> -STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
> STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
> STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
> -STAT_ATTR(FREE_PCS, free_cpu_sheaf);
> STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
> STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
> STAT_ATTR(FREE_FASTPATH, free_fastpath);
> STAT_ATTR(FREE_SLOWPATH, free_slowpath);
> STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
> STAT_ATTR(FREE_REMOVE_PARTIAL, free_remove_partial);
> -STAT_ATTR(ALLOC_FROM_PARTIAL, alloc_from_partial);
> STAT_ATTR(ALLOC_SLAB, alloc_slab);
> -STAT_ATTR(ALLOC_REFILL, alloc_refill);
> STAT_ATTR(ALLOC_NODE_MISMATCH, alloc_node_mismatch);
> STAT_ATTR(FREE_SLAB, free_slab);
> -STAT_ATTR(CPUSLAB_FLUSH, cpuslab_flush);
> -STAT_ATTR(DEACTIVATE_FULL, deactivate_full);
> -STAT_ATTR(DEACTIVATE_EMPTY, deactivate_empty);
> -STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
> -STAT_ATTR(DEACTIVATE_BYPASS, deactivate_bypass);
> STAT_ATTR(ORDER_FALLBACK, order_fallback);
> -STAT_ATTR(CMPXCHG_DOUBLE_CPU_FAIL, cmpxchg_double_cpu_fail);
> STAT_ATTR(CMPXCHG_DOUBLE_FAIL, cmpxchg_double_fail);
> -STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
> -STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
> -STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
> -STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
> STAT_ATTR(SHEAF_FLUSH, sheaf_flush);
> STAT_ATTR(SHEAF_REFILL, sheaf_refill);
> STAT_ATTR(SHEAF_ALLOC, sheaf_alloc);
> @@ -8806,33 +8785,19 @@ static struct attribute *slab_attrs[] = {
> &remote_node_defrag_ratio_attr.attr,
> #endif
> #ifdef CONFIG_SLUB_STATS
> - &alloc_cpu_sheaf_attr.attr,
> &alloc_fastpath_attr.attr,
> &alloc_slowpath_attr.attr,
> - &free_cpu_sheaf_attr.attr,
> &free_rcu_sheaf_attr.attr,
> &free_rcu_sheaf_fail_attr.attr,
> &free_fastpath_attr.attr,
> &free_slowpath_attr.attr,
> &free_add_partial_attr.attr,
> &free_remove_partial_attr.attr,
> - &alloc_from_partial_attr.attr,
> &alloc_slab_attr.attr,
> - &alloc_refill_attr.attr,
> &alloc_node_mismatch_attr.attr,
> &free_slab_attr.attr,
> - &cpuslab_flush_attr.attr,
> - &deactivate_full_attr.attr,
> - &deactivate_empty_attr.attr,
> - &deactivate_remote_frees_attr.attr,
> - &deactivate_bypass_attr.attr,
> &order_fallback_attr.attr,
> &cmpxchg_double_fail_attr.attr,
> - &cmpxchg_double_cpu_fail_attr.attr,
> - &cpu_partial_alloc_attr.attr,
> - &cpu_partial_free_attr.attr,
> - &cpu_partial_node_attr.attr,
> - &cpu_partial_drain_attr.attr,
> &sheaf_flush_attr.attr,
> &sheaf_refill_attr.attr,
> &sheaf_alloc_attr.attr,
>
> --
> 2.52.0
>
* Re: [PATCH v3 15/21] slab: remove struct kmem_cache_cpu
2026-01-16 14:40 ` [PATCH v3 15/21] slab: remove struct kmem_cache_cpu Vlastimil Babka
2026-01-20 12:40 ` Hao Li
@ 2026-01-22 3:10 ` Harry Yoo
1 sibling, 0 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-22 3:10 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:35PM +0100, Vlastimil Babka wrote:
> The cpu slab is no longer used for allocation or freeing; the
> remaining code is for flushing, but it's effectively dead. Remove the
> whole struct kmem_cache_cpu, the flushing code and other orphaned
> functions.
>
> The remaining used field of kmem_cache_cpu is the stat array with
> CONFIG_SLUB_STATS. Put it instead in a new struct kmem_cache_stats.
> In struct kmem_cache, the field is cpu_stats and placed near the
> end of the struct.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v3 16/21] slab: remove unused PREEMPT_RT specific macros
2026-01-16 14:40 ` [PATCH v3 16/21] slab: remove unused PREEMPT_RT specific macros Vlastimil Babka
2026-01-21 6:42 ` Hao Li
@ 2026-01-22 3:50 ` Harry Yoo
1 sibling, 0 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-22 3:50 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:36PM +0100, Vlastimil Babka wrote:
> The macros slub_get_cpu_ptr()/slub_put_cpu_ptr() are now unused, remove
> them. USE_LOCKLESS_FAST_PATH() has lost its true meaning with the code
> being removed. The only remaining usage is in fact testing whether we
> can assert irqs disabled, because spin_lock_irqsave() only does that on
> !RT. Test for CONFIG_PREEMPT_RT instead.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v3 18/21] slab: update overview comments
2026-01-16 14:40 ` [PATCH v3 18/21] slab: update overview comments Vlastimil Babka
2026-01-21 20:58 ` Suren Baghdasaryan
@ 2026-01-22 3:54 ` Hao Li
2026-01-22 6:41 ` Harry Yoo
2 siblings, 0 replies; 106+ messages in thread
From: Hao Li @ 2026-01-22 3:54 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:38PM +0100, Vlastimil Babka wrote:
> The changes related to sheaves made the description of locking and other
> details outdated. Update it to reflect current state.
>
> Also add a new copyright line due to major changes.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 141 +++++++++++++++++++++++++++++---------------------------------
> 1 file changed, 67 insertions(+), 74 deletions(-)
>
Looks good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>
--
Thanks,
Hao
* Re: [PATCH v3 17/21] slab: refill sheaves from all nodes
2026-01-16 14:40 ` [PATCH v3 17/21] slab: refill sheaves from all nodes Vlastimil Babka
2026-01-21 18:30 ` Suren Baghdasaryan
@ 2026-01-22 4:44 ` Harry Yoo
2026-01-22 8:37 ` Vlastimil Babka
2026-01-22 4:58 ` Hao Li
2026-01-22 7:02 ` Harry Yoo
3 siblings, 1 reply; 106+ messages in thread
From: Harry Yoo @ 2026-01-22 4:44 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:37PM +0100, Vlastimil Babka wrote:
> __refill_objects() currently only attempts to get partial slabs from the
> local node and then allocates new slab(s). Expand it to trying also
> other nodes while observing the remote node defrag ratio, similarly to
> get_any_partial().
>
> This will prevent allocating new slabs on a node while other nodes have
> many free slabs. It does mean sheaves will contain non-local objects in
> that case. Allocations that care about specific node will still be
> served appropriately, but might get a slowpath allocation.
>
> Like get_any_partial() we do observe cpuset_zone_allowed(), although we
> might be refilling a sheaf that will be then used from a different
> allocation context.
>
> We can also use the resulting refill_objects() in
> __kmem_cache_alloc_bulk() for non-debug caches. This means
> kmem_cache_alloc_bulk() will get better performance when sheaves are
> exhausted. kmem_cache_alloc_bulk() cannot indicate a preferred node so
> it's compatible with sheaves refill in preferring the local node.
> Its users also have gfp flags that allow spinning, so document that
> as a requirement.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
Could this cause strict_numa to not work as intended when
the policy is MPOL_BIND?
alloc_from_pcs() has:
> #ifdef CONFIG_NUMA
> if (static_branch_unlikely(&strict_numa) &&
> node == NUMA_NO_NODE) {
>
> struct mempolicy *mpol = current->mempolicy;
>
> if (mpol) {
> /*
> * Special BIND rule support. If the local node
> * is in permitted set then do not redirect
> * to a particular node.
> * Otherwise we apply the memory policy to get
> * the node we need to allocate on.
> */
> if (mpol->mode != MPOL_BIND ||
> !node_isset(numa_mem_id(), mpol->nodes))
This assumes the sheaves contain (mostly, although it wasn't strictly
guaranteed) objects from the local node, and this change breaks that
assumption.
So... perhaps remove "Special BIND rule support"?
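I mean something like this (untested sketch, just to illustrate the
simplification - always apply the policy when strict_numa is enabled):

	if (static_branch_unlikely(&strict_numa) &&
	    node == NUMA_NO_NODE) {

		struct mempolicy *mpol = current->mempolicy;

		/* no special MPOL_BIND handling anymore */
		if (mpol)
			node = mempolicy_slab_node();
	}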
>
> node = mempolicy_slab_node();
> }
> }
> #endif
Otherwise LGTM.
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v3 17/21] slab: refill sheaves from all nodes
2026-01-16 14:40 ` [PATCH v3 17/21] slab: refill sheaves from all nodes Vlastimil Babka
2026-01-21 18:30 ` Suren Baghdasaryan
2026-01-22 4:44 ` Harry Yoo
@ 2026-01-22 4:58 ` Hao Li
2026-01-22 8:32 ` Vlastimil Babka
2026-01-22 7:02 ` Harry Yoo
3 siblings, 1 reply; 106+ messages in thread
From: Hao Li @ 2026-01-22 4:58 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:37PM +0100, Vlastimil Babka wrote:
> __refill_objects() currently only attempts to get partial slabs from the
> local node and then allocates new slab(s). Expand it to trying also
> other nodes while observing the remote node defrag ratio, similarly to
> get_any_partial().
>
> This will prevent allocating new slabs on a node while other nodes have
> many free slabs. It does mean sheaves will contain non-local objects in
> that case. Allocations that care about specific node will still be
> served appropriately, but might get a slowpath allocation.
>
> Like get_any_partial() we do observe cpuset_zone_allowed(), although we
> might be refilling a sheaf that will be then used from a different
> allocation context.
>
> We can also use the resulting refill_objects() in
> __kmem_cache_alloc_bulk() for non-debug caches. This means
> kmem_cache_alloc_bulk() will get better performance when sheaves are
> exhausted. kmem_cache_alloc_bulk() cannot indicate a preferred node so
> it's compatible with sheaves refill in preferring the local node.
> Its users also have gfp flags that allow spinning, so document that
> as a requirement.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 106 insertions(+), 31 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index d52de6e3c2d5..2c522d2bf547 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2518,8 +2518,8 @@ static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
> }
>
> static unsigned int
> -__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> - unsigned int max);
> +refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max);
>
> static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> gfp_t gfp)
> @@ -2530,8 +2530,8 @@ static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> if (!to_fill)
> return 0;
>
> - filled = __refill_objects(s, &sheaf->objects[sheaf->size], gfp,
> - to_fill, to_fill);
> + filled = refill_objects(s, &sheaf->objects[sheaf->size], gfp, to_fill,
> + to_fill);
>
> sheaf->size += filled;
>
> @@ -6522,29 +6522,22 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> EXPORT_SYMBOL(kmem_cache_free_bulk);
>
> static unsigned int
> -__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> - unsigned int max)
> +__refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max, struct kmem_cache_node *n)
> {
> struct slab *slab, *slab2;
> struct partial_context pc;
> unsigned int refilled = 0;
> unsigned long flags;
> void *object;
> - int node;
>
> pc.flags = gfp;
> pc.min_objects = min;
> pc.max_objects = max;
>
> - node = numa_mem_id();
> -
> - if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
> + if (!get_partial_node_bulk(s, n, &pc))
> return 0;
>
> - /* TODO: consider also other nodes? */
> - if (!get_partial_node_bulk(s, get_node(s, node), &pc))
> - goto new_slab;
> -
> list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
>
> list_del(&slab->slab_list);
> @@ -6582,8 +6575,6 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> }
>
> if (unlikely(!list_empty(&pc.slabs))) {
> - struct kmem_cache_node *n = get_node(s, node);
> -
> spin_lock_irqsave(&n->list_lock, flags);
>
> list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> @@ -6605,13 +6596,92 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> }
> }
>
> + return refilled;
> +}
>
> - if (likely(refilled >= min))
> - goto out;
> +#ifdef CONFIG_NUMA
> +static unsigned int
> +__refill_objects_any(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max, int local_node)
Just a small note: I noticed that the local_node variable is unused. It seems
the intention was to skip local_node in __refill_objects_any(), since it had
already been attempted in __refill_objects_node().
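If the skip was indeed intended, something like this (untested sketch
based on the hunk above) is what I had in mind:

	for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
		struct kmem_cache_node *n;
		unsigned int r;
		int nid = zone_to_nid(zone);

		/* already attempted by __refill_objects_node() */
		if (nid == local_node)
			continue;

		n = get_node(s, nid);
		if (!n || !cpuset_zone_allowed(zone, gfp) ||
		    n->nr_partial <= s->min_partial)
			continue;

		/* ... rest as in the patch ... */
	}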
Everything else looks good.
Reviewed-by: Hao Li <hao.li@linux.dev>
> +{
> + struct zonelist *zonelist;
> + struct zoneref *z;
> + struct zone *zone;
> + enum zone_type highest_zoneidx = gfp_zone(gfp);
> + unsigned int cpuset_mems_cookie;
> + unsigned int refilled = 0;
> +
> + /* see get_any_partial() for the defrag ratio description */
> + if (!s->remote_node_defrag_ratio ||
> + get_cycles() % 1024 > s->remote_node_defrag_ratio)
> + return 0;
> +
> + do {
> + cpuset_mems_cookie = read_mems_allowed_begin();
> + zonelist = node_zonelist(mempolicy_slab_node(), gfp);
> + for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
> + struct kmem_cache_node *n;
> + unsigned int r;
> +
> + n = get_node(s, zone_to_nid(zone));
> +
> + if (!n || !cpuset_zone_allowed(zone, gfp) ||
> + n->nr_partial <= s->min_partial)
> + continue;
> +
> + r = __refill_objects_node(s, p, gfp, min, max, n);
> + refilled += r;
> +
> + if (r >= min) {
> + /*
> + * Don't check read_mems_allowed_retry() here -
> + * if mems_allowed was updated in parallel, that
> + * was a harmless race between allocation and
> + * the cpuset update
> + */
> + return refilled;
> + }
> + p += r;
> + min -= r;
> + max -= r;
> + }
> + } while (read_mems_allowed_retry(cpuset_mems_cookie));
> +
> + return refilled;
> +}
> +#else
> +static inline unsigned int
> +__refill_objects_any(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max, int local_node)
> +{
> + return 0;
> +}
> +#endif
> +
> +static unsigned int
> +refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max)
> +{
> + int local_node = numa_mem_id();
> + unsigned int refilled;
> + struct slab *slab;
> +
> + if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
> + return 0;
> +
> + refilled = __refill_objects_node(s, p, gfp, min, max,
> + get_node(s, local_node));
> + if (refilled >= min)
> + return refilled;
> +
> + refilled += __refill_objects_any(s, p + refilled, gfp, min - refilled,
> + max - refilled, local_node);
> + if (refilled >= min)
> + return refilled;
>
> new_slab:
>
> - slab = new_slab(s, pc.flags, node);
> + slab = new_slab(s, gfp, local_node);
> if (!slab)
> goto out;
>
> @@ -6626,8 +6696,8 @@ __refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
>
> if (refilled < min)
> goto new_slab;
> -out:
>
> +out:
> return refilled;
> }
>
> @@ -6637,18 +6707,20 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> {
> int i;
>
> - /*
> - * TODO: this might be more efficient (if necessary) by reusing
> - * __refill_objects()
> - */
> - for (i = 0; i < size; i++) {
> + if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> + for (i = 0; i < size; i++) {
>
> - p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
> - s->object_size);
> - if (unlikely(!p[i]))
> - goto error;
> + p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_,
> + s->object_size);
> + if (unlikely(!p[i]))
> + goto error;
>
> - maybe_wipe_obj_freeptr(s, p[i]);
> + maybe_wipe_obj_freeptr(s, p[i]);
> + }
> + } else {
> + i = refill_objects(s, p, flags, size, size);
> + if (i < size)
> + goto error;
> }
>
> return i;
> @@ -6659,7 +6731,10 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>
> }
>
> -/* Note that interrupts must be enabled when calling this function. */
> +/*
> + * Note that interrupts must be enabled when calling this function and gfp
> + * flags must allow spinning.
> + */
> int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
> void **p)
> {
>
> --
> 2.52.0
>
* Re: [PATCH v3 19/21] slab: remove frozen slab checks from __slab_free()
2026-01-16 14:40 ` [PATCH v3 19/21] slab: remove frozen slab checks from __slab_free() Vlastimil Babka
2026-01-22 0:54 ` Suren Baghdasaryan
@ 2026-01-22 5:01 ` Hao Li
1 sibling, 0 replies; 106+ messages in thread
From: Hao Li @ 2026-01-22 5:01 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:39PM +0100, Vlastimil Babka wrote:
> Currently slabs are only frozen after consistency checks failed. This
> can happen only in caches with debugging enabled, and those use
> free_to_partial_list() for freeing. The non-debug operation of
> __slab_free() can thus stop considering the frozen field, and we can
> remove the FREE_FROZEN stat.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 22 ++++------------------
> 1 file changed, 4 insertions(+), 18 deletions(-)
>
Looks good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>
* Re: [PATCH v3 20/21] mm/slub: remove DEACTIVATE_TO_* stat items
2026-01-16 14:40 ` [PATCH v3 20/21] mm/slub: remove DEACTIVATE_TO_* stat items Vlastimil Babka
2026-01-22 0:58 ` Suren Baghdasaryan
@ 2026-01-22 5:17 ` Hao Li
1 sibling, 0 replies; 106+ messages in thread
From: Hao Li @ 2026-01-22 5:17 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:40PM +0100, Vlastimil Babka wrote:
> The cpu slabs and their deactivations were removed, so remove the unused
> stat items. Weirdly enough the values were also used to control
> __add_partial() adding to head or tail of the list, so replace that with
> a new enum add_mode, which is cleaner.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 31 +++++++++++++++----------------
> 1 file changed, 15 insertions(+), 16 deletions(-)
>
Looks good to me.
Reviewed-by: Hao Li <hao.li@linux.dev>
--
Thanks,
Hao
* Re: [PATCH v3 21/21] mm/slub: cleanup and repurpose some stat items
2026-01-16 14:40 ` [PATCH v3 21/21] mm/slub: cleanup and repurpose some " Vlastimil Babka
2026-01-22 2:35 ` Suren Baghdasaryan
@ 2026-01-22 5:52 ` Hao Li
2026-01-22 9:30 ` Vlastimil Babka
1 sibling, 1 reply; 106+ messages in thread
From: Hao Li @ 2026-01-22 5:52 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:41PM +0100, Vlastimil Babka wrote:
> A number of stat items related to cpu slabs became unused, remove them.
>
> Two of those were ALLOC_FASTPATH and FREE_FASTPATH. But instead of
> removing those, use them instead of ALLOC_PCS and FREE_PCS, since
> sheaves are the new (and only) fastpaths. Remove the recently added
> _PCS variants instead.
>
> Change where FREE_SLOWPATH is counted so that it only counts freeing of
> objects by slab users that (for whatever reason) do not go to a percpu
> sheaf, and not all (including internal) callers of __slab_free(). Thus
> flushing sheaves (counted by SHEAF_FLUSH) no longer also increments
> FREE_SLOWPATH. This matches how ALLOC_SLOWPATH doesn't count sheaf
> refills (counted by SHEAF_REFILL).
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 77 +++++++++++++++++----------------------------------------------
> 1 file changed, 21 insertions(+), 56 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index c12e90cb2fca..d73ad44fa046 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -330,33 +330,19 @@ enum add_mode {
> };
>
> enum stat_item {
> - ALLOC_PCS, /* Allocation from percpu sheaf */
> - ALLOC_FASTPATH, /* Allocation from cpu slab */
> - ALLOC_SLOWPATH, /* Allocation by getting a new cpu slab */
> - FREE_PCS, /* Free to percpu sheaf */
> + ALLOC_FASTPATH, /* Allocation from percpu sheaves */
> + ALLOC_SLOWPATH, /* Allocation from partial or new slab */
> FREE_RCU_SHEAF, /* Free to rcu_free sheaf */
> FREE_RCU_SHEAF_FAIL, /* Failed to free to a rcu_free sheaf */
> - FREE_FASTPATH, /* Free to cpu slab */
> - FREE_SLOWPATH, /* Freeing not to cpu slab */
> + FREE_FASTPATH, /* Free to percpu sheaves */
> + FREE_SLOWPATH, /* Free to a slab */
Nits: Would it make sense to add stat(s, FREE_SLOWPATH) in
free_deferred_objects() as well, since it also calls __slab_free()?
Everything else looks good.
This patchset replaces cpu slab with cpu sheaves and really simplifies the code
overall - I really like the direction and the end result. It's really been a
pleasure reviewing this series. Thanks!
Reviewed-by: Hao Li <hao.li@linux.dev>
--
Thanks,
Hao
> FREE_ADD_PARTIAL, /* Freeing moves slab to partial list */
> FREE_REMOVE_PARTIAL, /* Freeing removes last object */
> - ALLOC_FROM_PARTIAL, /* Cpu slab acquired from node partial list */
> - ALLOC_SLAB, /* Cpu slab acquired from page allocator */
> - ALLOC_REFILL, /* Refill cpu slab from slab freelist */
> - ALLOC_NODE_MISMATCH, /* Switching cpu slab */
> + ALLOC_SLAB, /* New slab acquired from page allocator */
> + ALLOC_NODE_MISMATCH, /* Requested node different from cpu sheaf */
> FREE_SLAB, /* Slab freed to the page allocator */
> - CPUSLAB_FLUSH, /* Abandoning of the cpu slab */
> - DEACTIVATE_FULL, /* Cpu slab was full when deactivated */
> - DEACTIVATE_EMPTY, /* Cpu slab was empty when deactivated */
> - DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
> - DEACTIVATE_BYPASS, /* Implicit deactivation */
> ORDER_FALLBACK, /* Number of times fallback was necessary */
> - CMPXCHG_DOUBLE_CPU_FAIL,/* Failures of this_cpu_cmpxchg_double */
> CMPXCHG_DOUBLE_FAIL, /* Failures of slab freelist update */
> - CPU_PARTIAL_ALLOC, /* Used cpu partial on alloc */
> - CPU_PARTIAL_FREE, /* Refill cpu partial on free */
> - CPU_PARTIAL_NODE, /* Refill cpu partial from node partial */
> - CPU_PARTIAL_DRAIN, /* Drain cpu partial to node partial */
> SHEAF_FLUSH, /* Objects flushed from a sheaf */
> SHEAF_REFILL, /* Objects refilled to a sheaf */
> SHEAF_ALLOC, /* Allocation of an empty sheaf */
> @@ -4347,8 +4333,10 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
> * We assume the percpu sheaves contain only local objects although it's
> * not completely guaranteed, so we verify later.
> */
> - if (unlikely(node_requested && node != numa_mem_id()))
> + if (unlikely(node_requested && node != numa_mem_id())) {
> + stat(s, ALLOC_NODE_MISMATCH);
> return NULL;
> + }
>
> if (!local_trylock(&s->cpu_sheaves->lock))
> return NULL;
> @@ -4371,6 +4359,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
> */
> if (page_to_nid(virt_to_page(object)) != node) {
> local_unlock(&s->cpu_sheaves->lock);
> + stat(s, ALLOC_NODE_MISMATCH);
> return NULL;
> }
> }
> @@ -4379,7 +4368,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
>
> local_unlock(&s->cpu_sheaves->lock);
>
> - stat(s, ALLOC_PCS);
> + stat(s, ALLOC_FASTPATH);
>
> return object;
> }
> @@ -4451,7 +4440,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
>
> local_unlock(&s->cpu_sheaves->lock);
>
> - stat_add(s, ALLOC_PCS, batch);
> + stat_add(s, ALLOC_FASTPATH, batch);
>
> allocated += batch;
>
> @@ -5111,8 +5100,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> unsigned long flags;
> bool on_node_partial;
>
> - stat(s, FREE_SLOWPATH);
> -
> if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> free_to_partial_list(s, slab, head, tail, cnt, addr);
> return;
> @@ -5416,7 +5403,7 @@ bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
>
> local_unlock(&s->cpu_sheaves->lock);
>
> - stat(s, FREE_PCS);
> + stat(s, FREE_FASTPATH);
>
> return true;
> }
> @@ -5664,7 +5651,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>
> local_unlock(&s->cpu_sheaves->lock);
>
> - stat_add(s, FREE_PCS, batch);
> + stat_add(s, FREE_FASTPATH, batch);
>
> if (batch < size) {
> p += batch;
> @@ -5686,10 +5673,12 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> */
> fallback:
> __kmem_cache_free_bulk(s, size, p);
> + stat_add(s, FREE_SLOWPATH, size);
>
> flush_remote:
> if (remote_nr) {
> __kmem_cache_free_bulk(s, remote_nr, &remote_objects[0]);
> + stat_add(s, FREE_SLOWPATH, remote_nr);
> if (i < size) {
> remote_nr = 0;
> goto next_remote_batch;
> @@ -5784,6 +5773,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
> }
>
> __slab_free(s, slab, object, object, 1, addr);
> + stat(s, FREE_SLOWPATH);
> }
>
> #ifdef CONFIG_MEMCG
> @@ -5806,8 +5796,10 @@ void slab_free_bulk(struct kmem_cache *s, struct slab *slab, void *head,
> * With KASAN enabled slab_free_freelist_hook modifies the freelist
> * to remove objects, whose reuse must be delayed.
> */
> - if (likely(slab_free_freelist_hook(s, &head, &tail, &cnt)))
> + if (likely(slab_free_freelist_hook(s, &head, &tail, &cnt))) {
> __slab_free(s, slab, head, tail, cnt, addr);
> + stat_add(s, FREE_SLOWPATH, cnt);
> + }
> }
>
> #ifdef CONFIG_SLUB_RCU_DEBUG
> @@ -6705,6 +6697,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> i = refill_objects(s, p, flags, size, size);
> if (i < size)
> goto error;
> + stat_add(s, ALLOC_SLOWPATH, i);
> }
>
> return i;
> @@ -8704,33 +8697,19 @@ static ssize_t text##_store(struct kmem_cache *s, \
> } \
> SLAB_ATTR(text); \
>
> -STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
> STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
> STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
> -STAT_ATTR(FREE_PCS, free_cpu_sheaf);
> STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
> STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
> STAT_ATTR(FREE_FASTPATH, free_fastpath);
> STAT_ATTR(FREE_SLOWPATH, free_slowpath);
> STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
> STAT_ATTR(FREE_REMOVE_PARTIAL, free_remove_partial);
> -STAT_ATTR(ALLOC_FROM_PARTIAL, alloc_from_partial);
> STAT_ATTR(ALLOC_SLAB, alloc_slab);
> -STAT_ATTR(ALLOC_REFILL, alloc_refill);
> STAT_ATTR(ALLOC_NODE_MISMATCH, alloc_node_mismatch);
> STAT_ATTR(FREE_SLAB, free_slab);
> -STAT_ATTR(CPUSLAB_FLUSH, cpuslab_flush);
> -STAT_ATTR(DEACTIVATE_FULL, deactivate_full);
> -STAT_ATTR(DEACTIVATE_EMPTY, deactivate_empty);
> -STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
> -STAT_ATTR(DEACTIVATE_BYPASS, deactivate_bypass);
> STAT_ATTR(ORDER_FALLBACK, order_fallback);
> -STAT_ATTR(CMPXCHG_DOUBLE_CPU_FAIL, cmpxchg_double_cpu_fail);
> STAT_ATTR(CMPXCHG_DOUBLE_FAIL, cmpxchg_double_fail);
> -STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
> -STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
> -STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
> -STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
> STAT_ATTR(SHEAF_FLUSH, sheaf_flush);
> STAT_ATTR(SHEAF_REFILL, sheaf_refill);
> STAT_ATTR(SHEAF_ALLOC, sheaf_alloc);
> @@ -8806,33 +8785,19 @@ static struct attribute *slab_attrs[] = {
> &remote_node_defrag_ratio_attr.attr,
> #endif
> #ifdef CONFIG_SLUB_STATS
> - &alloc_cpu_sheaf_attr.attr,
> &alloc_fastpath_attr.attr,
> &alloc_slowpath_attr.attr,
> - &free_cpu_sheaf_attr.attr,
> &free_rcu_sheaf_attr.attr,
> &free_rcu_sheaf_fail_attr.attr,
> &free_fastpath_attr.attr,
> &free_slowpath_attr.attr,
> &free_add_partial_attr.attr,
> &free_remove_partial_attr.attr,
> - &alloc_from_partial_attr.attr,
> &alloc_slab_attr.attr,
> - &alloc_refill_attr.attr,
> &alloc_node_mismatch_attr.attr,
> &free_slab_attr.attr,
> - &cpuslab_flush_attr.attr,
> - &deactivate_full_attr.attr,
> - &deactivate_empty_attr.attr,
> - &deactivate_remote_frees_attr.attr,
> - &deactivate_bypass_attr.attr,
> &order_fallback_attr.attr,
> &cmpxchg_double_fail_attr.attr,
> - &cmpxchg_double_cpu_fail_attr.attr,
> - &cpu_partial_alloc_attr.attr,
> - &cpu_partial_free_attr.attr,
> - &cpu_partial_node_attr.attr,
> - &cpu_partial_drain_attr.attr,
> &sheaf_flush_attr.attr,
> &sheaf_refill_attr.attr,
> &sheaf_alloc_attr.attr,
>
> --
> 2.52.0
>
* Re: [PATCH v3 19/21] slab: remove frozen slab checks from __slab_free()
2026-01-22 0:54 ` Suren Baghdasaryan
@ 2026-01-22 6:31 ` Vlastimil Babka
0 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-22 6:31 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On 1/22/26 01:54, Suren Baghdasaryan wrote:
> On Fri, Jan 16, 2026 at 2:41 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> Currently slabs are only frozen after consistency checks failed. This
>> can happen only in caches with debugging enabled, and those use
>> free_to_partial_list() for freeing. The non-debug operation of
>> __slab_free() can thus stop considering the frozen field, and we can
>> remove the FREE_FROZEN stat.
>>
>> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>
> Functionally looks fine to me. Do we need to do something about the
> UAPI breakage that removal of a sysfs node might cause?
Only if someone complains. Just this week it has been reiterated by Linus:
https://lore.kernel.org/all/CAHk-%3Dwga8Qu0-OSE9VZbviq9GuqwhPhLUXeAt-S7_9%2BfMCLkKg@mail.gmail.com/
Given this is behind a config no distro enables, I think chances are good
no one will complain:
https://oracle.github.io/kconfigs/?config=UTS_RELEASE&config=SLUB_STATS
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
>> ---
>> mm/slub.c | 22 ++++------------------
>> 1 file changed, 4 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 476a279f1a94..7ec7049c0ca5 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -333,7 +333,6 @@ enum stat_item {
>> FREE_RCU_SHEAF_FAIL, /* Failed to free to a rcu_free sheaf */
>> FREE_FASTPATH, /* Free to cpu slab */
>> FREE_SLOWPATH, /* Freeing not to cpu slab */
>> - FREE_FROZEN, /* Freeing to frozen slab */
>> FREE_ADD_PARTIAL, /* Freeing moves slab to partial list */
>> FREE_REMOVE_PARTIAL, /* Freeing removes last object */
>> ALLOC_FROM_PARTIAL, /* Cpu slab acquired from node partial list */
>> @@ -5103,7 +5102,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>> unsigned long addr)
>>
>> {
>> - bool was_frozen, was_full;
>> + bool was_full;
>> struct freelist_counters old, new;
>> struct kmem_cache_node *n = NULL;
>> unsigned long flags;
>> @@ -5126,7 +5125,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>> old.counters = slab->counters;
>>
>> was_full = (old.freelist == NULL);
>> - was_frozen = old.frozen;
>>
>> set_freepointer(s, tail, old.freelist);
>>
>> @@ -5139,7 +5137,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>> * to (due to not being full anymore) the partial list.
>> * Unless it's frozen.
>> */
>> - if ((!new.inuse || was_full) && !was_frozen) {
>> + if (!new.inuse || was_full) {
>>
>> n = get_node(s, slab_nid(slab));
>> /*
>> @@ -5158,20 +5156,10 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>> } while (!slab_update_freelist(s, slab, &old, &new, "__slab_free"));
>>
>> if (likely(!n)) {
>> -
>> - if (likely(was_frozen)) {
>> - /*
>> - * The list lock was not taken therefore no list
>> - * activity can be necessary.
>> - */
>> - stat(s, FREE_FROZEN);
>> - }
>> -
>> /*
>> - * In other cases we didn't take the list_lock because the slab
>> - * was already on the partial list and will remain there.
>> + * We didn't take the list_lock because the slab was already on
>> + * the partial list and will remain there.
>> */
>> -
>> return;
>> }
>>
>> @@ -8721,7 +8709,6 @@ STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
>> STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
>> STAT_ATTR(FREE_FASTPATH, free_fastpath);
>> STAT_ATTR(FREE_SLOWPATH, free_slowpath);
>> -STAT_ATTR(FREE_FROZEN, free_frozen);
>> STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
>> STAT_ATTR(FREE_REMOVE_PARTIAL, free_remove_partial);
>> STAT_ATTR(ALLOC_FROM_PARTIAL, alloc_from_partial);
>> @@ -8826,7 +8813,6 @@ static struct attribute *slab_attrs[] = {
>> &free_rcu_sheaf_fail_attr.attr,
>> &free_fastpath_attr.attr,
>> &free_slowpath_attr.attr,
>> - &free_frozen_attr.attr,
>> &free_add_partial_attr.attr,
>> &free_remove_partial_attr.attr,
>> &alloc_from_partial_attr.attr,
>>
>> --
>> 2.52.0
>>
* Re: [PATCH v3 18/21] slab: update overview comments
2026-01-16 14:40 ` [PATCH v3 18/21] slab: update overview comments Vlastimil Babka
2026-01-21 20:58 ` Suren Baghdasaryan
2026-01-22 3:54 ` Hao Li
@ 2026-01-22 6:41 ` Harry Yoo
2026-01-22 8:49 ` Vlastimil Babka
2 siblings, 1 reply; 106+ messages in thread
From: Harry Yoo @ 2026-01-22 6:41 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:38PM +0100, Vlastimil Babka wrote:
> The changes related to sheaves made the description of locking and other
> details outdated. Update it to reflect current state.
>
> Also add a new copyright line due to major changes.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> @@ -112,47 +123,46 @@
> + * node->barn->lock (spinlock)
> *
> - * lockless fastpaths
> + * Lockless freeing
> + *
> + * Objects may have to be freed to their slabs when they are from a remote
> + * node (where we want to avoid filling local sheaves with remote objects)
> + * or when there are too many full sheaves. On architectures supporting
> + * cmpxchg_double this is done by a lockless update of slab's freelist and
> + * counters, otherwise slab_lock is taken. This only needs to take the
> + * list_lock if it's a first free to a full slab, or when there are too many
> + * fully free slabs and some need to be discarded.
nit: "or when a slab becomes empty after the free"?
because we don't check nr_partial before acquiring list_lock.
With that addressed,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v3 17/21] slab: refill sheaves from all nodes
2026-01-16 14:40 ` [PATCH v3 17/21] slab: refill sheaves from all nodes Vlastimil Babka
` (2 preceding siblings ...)
2026-01-22 4:58 ` Hao Li
@ 2026-01-22 7:02 ` Harry Yoo
2026-01-22 8:42 ` Vlastimil Babka
3 siblings, 1 reply; 106+ messages in thread
From: Harry Yoo @ 2026-01-22 7:02 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Fri, Jan 16, 2026 at 03:40:37PM +0100, Vlastimil Babka wrote:
> __refill_objects() currently only attempts to get partial slabs from the
> local node and then allocates new slab(s). Expand it to trying also
> other nodes while observing the remote node defrag ratio, similarly to
> get_any_partial().
>
> This will prevent allocating new slabs on a node while other nodes have
> many free slabs. It does mean sheaves will contain non-local objects in
> that case. Allocations that care about specific node will still be
> served appropriately, but might get a slowpath allocation.
Hmm one more question.
Given frees to remote nodes bypass sheaves layer anyway, isn't it
more reasonable to let refill_objects() fail sometimes instead of
allocating new local slabs and fall back to slowpath (based on defrag_ratio)?
> Like get_any_partial() we do observe cpuset_zone_allowed(), although we
> might be refilling a sheaf that will be then used from a different
> allocation context.
>
> We can also use the resulting refill_objects() in
> __kmem_cache_alloc_bulk() for non-debug caches. This means
> kmem_cache_alloc_bulk() will get better performance when sheaves are
> exhausted. kmem_cache_alloc_bulk() cannot indicate a preferred node so
> it's compatible with sheaves refill in preferring the local node.
> Its users also have gfp flags that allow spinning, so document that
> as a requirement.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v3 14/21] slab: simplify kmalloc_nolock()
2026-01-22 1:53 ` Harry Yoo
@ 2026-01-22 8:16 ` Vlastimil Babka
2026-01-22 8:34 ` Harry Yoo
0 siblings, 1 reply; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-22 8:16 UTC (permalink / raw)
To: Harry Yoo
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/22/26 02:53, Harry Yoo wrote:
> On Fri, Jan 16, 2026 at 03:40:34PM +0100, Vlastimil Babka wrote:
>> The kmalloc_nolock() implementation has several complications and
>> restrictions due to SLUB's cpu slab locking, lockless fastpath and
>> PREEMPT_RT differences. With cpu slab usage removed, we can simplify
>> things:
>>
>> - relax the PREEMPT_RT context checks as they were before commit
>> a4ae75d1b6a2 ("slab: fix kmalloc_nolock() context check for
>> PREEMPT_RT") and also reference the explanation comment in the page
>> allocator
>>
>> - the local_lock_cpu_slab() macros became unused, remove them
>>
>> - we no longer need to set up lockdep classes on PREEMPT_RT
>>
>> - we no longer need to annotate ___slab_alloc as NOKPROBE_SYMBOL
>> since there's no lockless cpu freelist manipulation anymore
>>
>> - __slab_alloc_node() can be called from kmalloc_nolock_noprof()
>> unconditionally. It can also no longer return EBUSY. But trylock
>> failures can still happen so retry with the larger bucket if the
>> allocation fails for any reason.
>>
>> Note that we still need __CMPXCHG_DOUBLE: while we no longer use
>> cmpxchg16b on the cpu freelist (that code was removed), we still use it
>> on the slab freelist, and the alternative is slab_lock() which can be
>> interrupted by an NMI. Clarify the comment to mention it specifically.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>
> What a nice cleanup!
>
> Looks good to me,
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Thanks!
> with a nit below.
>
>> mm/slab.h | 1 -
>> mm/slub.c | 144 +++++++++++++-------------------------------------------------
>> 2 files changed, 29 insertions(+), 116 deletions(-)
>>
>> diff --git a/mm/slab.h b/mm/slab.h
>> index 4efec41b6445..e9a0738133ed 100644
>> --- a/mm/slab.h
>> +++ b/mm/slab.h
>> @@ -5268,10 +5196,11 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
>> if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
>> /*
>> * kmalloc_nolock() is not supported on architectures that
>> - * don't implement cmpxchg16b, but debug caches don't use
>> - * per-cpu slab and per-cpu partial slabs. They rely on
>> - * kmem_cache_node->list_lock, so kmalloc_nolock() can
>> - * attempt to allocate from debug caches by
>> + * don't implement cmpxchg16b and thus need slab_lock()
>> + * which could be preempted by a nmi.
>
> nit: I think this limitation can now be removed, because the only slab
> lock used in the allocation path is get_partial_node() ->
> __slab_update_freelist(), but it is always used under n->list_lock.
>
> Being preempted by an NMI while holding the slab lock is fine because
> the NMI context should fail to acquire n->list_lock and bail out.
Hmm, but somebody might be freeing with __slab_free() without taking the
n->list_lock (the slab is on the partial list and expected to remain there
after the free), then there's an NMI and the allocation can take
n->list_lock just fine?
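Spelled out (an illustration of the scenario above, not code from the
series), on a !__CMPXCHG_DOUBLE architecture where freelist updates need
slab_lock():

	/*
	 * task context on CPU0                  NMI on the same CPU
	 * --------------------                  -------------------
	 * __slab_free() on a slab that is
	 * already on the partial list and
	 * stays there, so n->list_lock is
	 * not taken:
	 *   slab_lock(slab)   <- bit spinlock
	 *                                        kmalloc_nolock()
	 *                                          spin_trylock(&n->list_lock) succeeds
	 *                                          __slab_update_freelist()
	 *                                            slab_lock(slab)  <- spins forever,
	 *                                               owner is the interrupted context
	 */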
> But no hurry on this, it's probably not important enough to delay
> this series :)
>
>> + * But debug caches don't use that and only rely on
>> + * kmem_cache_node->list_lock, so kmalloc_nolock() can attempt
>> + * to allocate from debug caches by
>> * spin_trylock_irqsave(&n->list_lock, ...)
>> */
>> return NULL;
>>
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 17/21] slab: refill sheaves from all nodes
2026-01-22 4:58 ` Hao Li
@ 2026-01-22 8:32 ` Vlastimil Babka
0 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-22 8:32 UTC (permalink / raw)
To: Hao Li
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/22/26 05:58, Hao Li wrote:
> Just a small note: I noticed that the local_node variable is unused. It seems
> the intention was to skip local_node in __refill_objects_any(), since it had
> already been attempted in __refill_objects_node().
Ah, I'll remove it. Such a skip likely wouldn't do much.
> Everything else looks good.
>
> Reviewed-by: Hao Li <hao.li@linux.dev>
Thanks!
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 14/21] slab: simplify kmalloc_nolock()
2026-01-22 8:16 ` Vlastimil Babka
@ 2026-01-22 8:34 ` Harry Yoo
0 siblings, 0 replies; 106+ messages in thread
From: Harry Yoo @ 2026-01-22 8:34 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On Thu, Jan 22, 2026 at 09:16:04AM +0100, Vlastimil Babka wrote:
> On 1/22/26 02:53, Harry Yoo wrote:
> > On Fri, Jan 16, 2026 at 03:40:34PM +0100, Vlastimil Babka wrote:
> >> if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
> >> /*
> >> * kmalloc_nolock() is not supported on architectures that
> >> - * don't implement cmpxchg16b, but debug caches don't use
> >> - * per-cpu slab and per-cpu partial slabs. They rely on
> >> - * kmem_cache_node->list_lock, so kmalloc_nolock() can
> >> - * attempt to allocate from debug caches by
> >> + * don't implement cmpxchg16b and thus need slab_lock()
> >> + * which could be preempted by a nmi.
> >
> > nit: I think this limitation can now be removed, because the only slab
> > lock used in the allocation path is get_partial_node() ->
> > __slab_update_freelist(), but it is always used under n->list_lock.
> >
> > Being preempted by an NMI while holding the slab lock is fine because
> > the NMI context should fail to acquire n->list_lock and bail out.
>
> Hmm, but somebody might be freeing with __slab_free() without taking the
> n->list_lock (the slab is on the partial list and expected to remain there
> after the free), then there's an NMI and the allocation can take
> n->list_lock just fine?
Oops, you're right. Never mind.
Concurrency is tricky :)
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 17/21] slab: refill sheaves from all nodes
2026-01-22 4:44 ` Harry Yoo
@ 2026-01-22 8:37 ` Vlastimil Babka
0 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-22 8:37 UTC (permalink / raw)
To: Harry Yoo
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/22/26 05:44, Harry Yoo wrote:
> On Fri, Jan 16, 2026 at 03:40:37PM +0100, Vlastimil Babka wrote:
>> __refill_objects() currently only attempts to get partial slabs from the
>> local node and then allocates new slab(s). Expand it to also try
>> other nodes while observing the remote node defrag ratio, similarly to
>> get_any_partial().
>>
>> This will prevent allocating new slabs on a node while other nodes have
>> many free slabs. It does mean sheaves will contain non-local objects in
>> that case. Allocations that care about a specific node will still be
>> served appropriately, but might get a slowpath allocation.
>>
>> Like get_any_partial(), we do observe cpuset_zone_allowed(), although we
>> might be refilling a sheaf that will then be used from a different
>> allocation context.
>>
>> We can also use the resulting refill_objects() in
>> __kmem_cache_alloc_bulk() for non-debug caches. This means
>> kmem_cache_alloc_bulk() will get better performance when sheaves are
>> exhausted. kmem_cache_alloc_bulk() cannot indicate a preferred node,
>> so it is compatible with the sheaf refill preferring the local node.
>> Its users also have gfp flags that allow spinning, so document that
>> as a requirement.
>>
>> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>
> Could this cause strict_numa to not work as intended when
> the policy is MPOL_BIND?
Hm, I guess it could be optimized differently later. I assume people running
strict_numa would also tune remote_node_defrag_ratio accordingly and wouldn't
run into this often.
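For reference, the gate the refill mirrors from get_any_partial() is roughly
the following (a sketch based on the existing get_any_partial() check):

	/*
	 * Only go off-node with a probability controlled by
	 * /sys/kernel/slab/<cache>/remote_node_defrag_ratio;
	 * 0 disables cross-node refill entirely.
	 */
	if (!s->remote_node_defrag_ratio ||
	    get_cycles() % 1024 > s->remote_node_defrag_ratio)
		return NULL;	/* stay on the local node */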
> alloc_from_pcs() has:
>> #ifdef CONFIG_NUMA
>> if (static_branch_unlikely(&strict_numa) &&
>> node == NUMA_NO_NODE) {
>>
>> struct mempolicy *mpol = current->mempolicy;
>>
>> if (mpol) {
>> /*
>> * Special BIND rule support. If the local node
>> * is in permitted set then do not redirect
>> * to a particular node.
>> * Otherwise we apply the memory policy to get
>> * the node we need to allocate on.
>> */
>> if (mpol->mode != MPOL_BIND ||
>> !node_isset(numa_mem_id(), mpol->nodes))
>
> This assumes the sheaves contain (mostly, although it wasn't strictly
> guaranteed) objects from the local node, and this change breaks that
> assumption.
>
> So... perhaps remove "Special BIND rule support"?
Ideally we would check whether the object in the sheaf is from one of the
permitted nodes instead of picking the local one. In a way that doesn't make
systems with strict_numa disabled slower :)
>>
>> node = mempolicy_slab_node();
>> }
>> }
>> #endif
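A very rough sketch of that idea (purely hypothetical, not part of the
series; the pcs->main->objects/->size accesses are used illustratively and
the overhead concern above is not addressed):

	if (mpol->mode == MPOL_BIND && pcs->main->size) {
		void *top = pcs->main->objects[pcs->main->size - 1];

		/* redirect only if the sheaf's top object is on a disallowed node */
		if (!node_isset(slab_nid(virt_to_slab(top)), mpol->nodes))
			node = mempolicy_slab_node();
	}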
>
> Otherwise LGTM.
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 17/21] slab: refill sheaves from all nodes
2026-01-22 7:02 ` Harry Yoo
@ 2026-01-22 8:42 ` Vlastimil Babka
0 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-22 8:42 UTC (permalink / raw)
To: Harry Yoo
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/22/26 08:02, Harry Yoo wrote:
> On Fri, Jan 16, 2026 at 03:40:37PM +0100, Vlastimil Babka wrote:
>> __refill_objects() currently only attempts to get partial slabs from the
>> local node and then allocates new slab(s). Expand it to also try
>> other nodes while observing the remote node defrag ratio, similarly to
>> get_any_partial().
>>
>> This will prevent allocating new slabs on a node while other nodes have
>> many free slabs. It does mean sheaves will contain non-local objects in
>> that case. Allocations that care about a specific node will still be
>> served appropriately, but might get a slowpath allocation.
>
> Hmm one more question.
>
> Given that frees to remote nodes bypass the sheaves layer anyway, isn't it
> more reasonable to let refill_objects() fail sometimes instead of
> allocating new local slabs, and fall back to the slowpath (based on
> defrag_ratio)?
You mean if we can't refill from the local partial list, we give up and
perhaps fail alloc_from_pcs()? Then the __slab_alloc_node() fallback would
allocate a local slab or try remote nodes?
Wouldn't that mean __slab_alloc_node() does all that work for a single
object, and slow everything down? Only in the case of a new slab would it
somehow amortize, because the next attempt would refill from it.
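Roughly the flow in question (a simplified sketch of the sheaf-first
allocation path; the argument lists are illustrative, not the exact
function signatures):

	/* sheaf path: fails when the main sheaf is empty and refill gave up */
	object = alloc_from_pcs(s, gfpflags, node);

	/* per-object slowpath: partial lists (local or remote), or a new slab */
	if (!object)
		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);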
>> Like get_any_partial(), we do observe cpuset_zone_allowed(), although we
>> might be refilling a sheaf that will then be used from a different
>> allocation context.
>>
>> We can also use the resulting refill_objects() in
>> __kmem_cache_alloc_bulk() for non-debug caches. This means
>> kmem_cache_alloc_bulk() will get better performance when sheaves are
>> exhausted. kmem_cache_alloc_bulk() cannot indicate a preferred node,
>> so it is compatible with the sheaf refill preferring the local node.
>> Its users also have gfp flags that allow spinning, so document that
>> as a requirement.
>>
>> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 18/21] slab: update overview comments
2026-01-22 6:41 ` Harry Yoo
@ 2026-01-22 8:49 ` Vlastimil Babka
0 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-22 8:49 UTC (permalink / raw)
To: Harry Yoo
Cc: Petr Tesarik, Christoph Lameter, David Rientjes, Roman Gushchin,
Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/22/26 07:41, Harry Yoo wrote:
> On Fri, Jan 16, 2026 at 03:40:38PM +0100, Vlastimil Babka wrote:
>> The changes related to sheaves made the description of locking and other
>> details outdated. Update it to reflect current state.
>>
>> Also add a new copyright line due to major changes.
>>
>> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>> @@ -112,47 +123,46 @@
>> + * node->barn->lock (spinlock)
>> *
>> - * lockless fastpaths
>> + * Lockless freeing
>> + *
>> + * Objects may have to be freed to their slabs when they are from a remote
>> + * node (where we want to avoid filling local sheaves with remote objects)
>> + * or when there are too many full sheaves. On architectures supporting
>> + * cmpxchg_double this is done by a lockless update of slab's freelist and
>> + * counters, otherwise slab_lock is taken. This only needs to take the
>> + * list_lock if it's a first free to a full slab, or when there are too many
>> + * fully free slabs and some need to be discarded.
>
> nit: "or when a slab becomes empty after the free"?
> because we don't check nr_partial before acquiring list_lock.
>
> With that addressed,
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Good point, thanks!
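A simplified sketch of the lockless freeing the updated comment describes
(loosely modeled on __slab_free(); debug paths and the partial-list
handling are reduced to a comment):

	do {
		prior = slab->freelist;
		counters = slab->counters;
		set_freepointer(s, tail, prior);   /* chain freed object(s) onto the old freelist */
		new.counters = counters;
		new.inuse -= cnt;
	} while (!slab_update_freelist(s, slab, prior, counters,
				       head, new.counters, "__slab_free"));

	/*
	 * n->list_lock is only taken when the slab was full before this free
	 * (it has to be added to the partial list) or becomes empty after it
	 * (it may have to be removed and discarded).
	 */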
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 21/21] mm/slub: cleanup and repurpose some stat items
2026-01-22 2:35 ` Suren Baghdasaryan
@ 2026-01-22 9:30 ` Vlastimil Babka
0 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-22 9:30 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Liam R. Howlett, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev
On 1/22/26 03:35, Suren Baghdasaryan wrote:
> On Fri, Jan 16, 2026 at 6:41 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> A number of stat items related to cpu slabs became unused, remove them.
>>
>> Two of those were ALLOC_FASTPATH and FREE_FASTPATH. But instead of
>> removing those, use them in place of ALLOC_PCS and FREE_PCS, since
>> sheaves are the new (and only) fastpaths. Remove the recently added
>> _PCS variants instead.
>>
>> Change where FREE_SLOWPATH is counted so that it only counts freeing of
>> objects by slab users that (for whatever reason) do not go to a percpu
>> sheaf, and not all (including internal) callers of __slab_free(). Thus
>> flushing sheaves (counted by SHEAF_FLUSH) no longer also increments
>> FREE_SLOWPATH.
>
> nit: I think I understand what you mean but "no longer also
> increments" sounds wrong. Maybe repharase as "Thus sheaf flushing
> (already counted by SHEAF_FLUSH) does not affect FREE_SLOWPATH
> anymore."?
OK will do.
>> @@ -5111,8 +5100,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>> unsigned long flags;
>> bool on_node_partial;
>>
>> - stat(s, FREE_SLOWPATH);
>
> After moving the above accounting to the callers I think there are
> several callers which won't account it anymore:
> - free_deferred_objects
> - memcg_alloc_abort_single
> - slab_free_after_rcu_debug
> - ___cache_free
>
> Am I missing something or is that intentional?
I'm adding them for completeness, but not to memcg_alloc_abort_single(), as
that's not the result of a user-initiated free.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v3 21/21] mm/slub: cleanup and repurpose some stat items
2026-01-22 5:52 ` Hao Li
@ 2026-01-22 9:30 ` Vlastimil Babka
0 siblings, 0 replies; 106+ messages in thread
From: Vlastimil Babka @ 2026-01-22 9:30 UTC (permalink / raw)
To: Hao Li
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior,
Alexei Starovoitov, linux-mm, linux-kernel, linux-rt-devel, bpf,
kasan-dev
On 1/22/26 06:52, Hao Li wrote:
> On Fri, Jan 16, 2026 at 03:40:41PM +0100, Vlastimil Babka wrote:
>> A number of stat items related to cpu slabs became unused, remove them.
>>
>> Two of those were ALLOC_FASTPATH and FREE_FASTPATH. But instead of
>> removing those, use them in place of ALLOC_PCS and FREE_PCS, since
>> sheaves are the new (and only) fastpaths. Remove the recently added
>> _PCS variants instead.
>>
>> Change where FREE_SLOWPATH is counted so that it only counts freeing of
>> objects by slab users that (for whatever reason) do not go to a percpu
>> sheaf, and not all (including internal) callers of __slab_free(). Thus
>> flushing sheaves (counted by SHEAF_FLUSH) no longer also increments
>> FREE_SLOWPATH. This matches how ALLOC_SLOWPATH doesn't count sheaf
>> refills (counted by SHEAF_REFILL).
>>
>> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>> mm/slub.c | 77 +++++++++++++++++----------------------------------------------
>> 1 file changed, 21 insertions(+), 56 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index c12e90cb2fca..d73ad44fa046 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -330,33 +330,19 @@ enum add_mode {
>> };
>>
>> enum stat_item {
>> - ALLOC_PCS, /* Allocation from percpu sheaf */
>> - ALLOC_FASTPATH, /* Allocation from cpu slab */
>> - ALLOC_SLOWPATH, /* Allocation by getting a new cpu slab */
>> - FREE_PCS, /* Free to percpu sheaf */
>> + ALLOC_FASTPATH, /* Allocation from percpu sheaves */
>> + ALLOC_SLOWPATH, /* Allocation from partial or new slab */
>> FREE_RCU_SHEAF, /* Free to rcu_free sheaf */
>> FREE_RCU_SHEAF_FAIL, /* Failed to free to a rcu_free sheaf */
>> - FREE_FASTPATH, /* Free to cpu slab */
>> - FREE_SLOWPATH, /* Freeing not to cpu slab */
>> + FREE_FASTPATH, /* Free to percpu sheaves */
>> + FREE_SLOWPATH, /* Free to a slab */
>
> Nits: Would it make sense to add stat(s, FREE_SLOWPATH) in
> free_deferred_objects() as well, since it also calls __slab_free()?
Yeah.
> Everything else looks good.
>
> This patchset replaces cpu slabs with cpu sheaves and really simplifies the
> code overall - I like the direction and the end result. It's been a
> pleasure reviewing this series. Thanks!
>
> Reviewed-by: Hao Li <hao.li@linux.dev>
Thanks a lot for the thorough review!
^ permalink raw reply [flat|nested] 106+ messages in thread
end of thread, other threads:[~2026-01-22 9:30 UTC | newest]
Thread overview: 106+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-16 14:40 [PATCH v3 00/21] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 01/21] mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache() Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 02/21] slab: add SLAB_CONSISTENCY_CHECKS to SLAB_NEVER_MERGE Vlastimil Babka
2026-01-16 17:22 ` Suren Baghdasaryan
2026-01-19 3:41 ` Harry Yoo
2026-01-16 14:40 ` [PATCH v3 03/21] mm/slab: move and refactor __kmem_cache_alias() Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 04/21] mm/slab: make caches with sheaves mergeable Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 05/21] slab: add sheaves to most caches Vlastimil Babka
2026-01-20 18:47 ` Breno Leitao
2026-01-21 8:12 ` Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 06/21] slab: introduce percpu sheaves bootstrap Vlastimil Babka
2026-01-17 2:11 ` Suren Baghdasaryan
2026-01-19 3:40 ` Harry Yoo
2026-01-19 9:13 ` Vlastimil Babka
2026-01-19 9:34 ` Vlastimil Babka
2026-01-21 10:52 ` Vlastimil Babka
2026-01-19 11:32 ` Hao Li
2026-01-21 10:54 ` Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 07/21] slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock() Vlastimil Babka
2026-01-18 20:45 ` Suren Baghdasaryan
2026-01-19 4:31 ` Harry Yoo
2026-01-19 10:09 ` Vlastimil Babka
2026-01-19 10:23 ` Vlastimil Babka
2026-01-19 12:06 ` Hao Li
2026-01-16 14:40 ` [PATCH v3 08/21] slab: handle kmalloc sheaves bootstrap Vlastimil Babka
2026-01-19 5:23 ` Harry Yoo
2026-01-20 1:04 ` Hao Li
2026-01-16 14:40 ` [PATCH v3 09/21] slab: add optimized sheaf refill from partial list Vlastimil Babka
2026-01-19 6:41 ` Harry Yoo
2026-01-19 8:02 ` Harry Yoo
2026-01-19 10:54 ` Vlastimil Babka
2026-01-20 1:41 ` Harry Yoo
2026-01-20 9:32 ` Hao Li
2026-01-20 10:22 ` Harry Yoo
2026-01-20 2:32 ` Harry Yoo
2026-01-20 6:33 ` Vlastimil Babka
2026-01-20 10:27 ` Harry Yoo
2026-01-20 10:32 ` Vlastimil Babka
2026-01-20 2:55 ` Hao Li
2026-01-20 17:19 ` Suren Baghdasaryan
2026-01-21 13:22 ` Vlastimil Babka
2026-01-21 16:12 ` Suren Baghdasaryan
2026-01-16 14:40 ` [PATCH v3 10/21] slab: remove cpu (partial) slabs usage from allocation paths Vlastimil Babka
2026-01-20 4:20 ` Harry Yoo
2026-01-20 8:36 ` Hao Li
2026-01-20 18:06 ` Suren Baghdasaryan
2026-01-21 13:56 ` Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 11/21] slab: remove SLUB_CPU_PARTIAL Vlastimil Babka
2026-01-20 5:24 ` Harry Yoo
2026-01-20 12:10 ` Hao Li
2026-01-20 22:25 ` Suren Baghdasaryan
2026-01-21 0:58 ` Harry Yoo
2026-01-21 1:06 ` Harry Yoo
2026-01-21 16:21 ` Suren Baghdasaryan
2026-01-21 14:22 ` Vlastimil Babka
2026-01-21 14:43 ` Vlastimil Babka
2026-01-21 16:22 ` Suren Baghdasaryan
2026-01-16 14:40 ` [PATCH v3 12/21] slab: remove the do_slab_free() fastpath Vlastimil Babka
2026-01-20 5:35 ` Harry Yoo
2026-01-20 12:29 ` Hao Li
2026-01-21 16:57 ` Suren Baghdasaryan
2026-01-16 14:40 ` [PATCH v3 13/21] slab: remove defer_deactivate_slab() Vlastimil Babka
2026-01-20 5:47 ` Harry Yoo
2026-01-20 9:35 ` Hao Li
2026-01-21 17:11 ` Suren Baghdasaryan
2026-01-16 14:40 ` [PATCH v3 14/21] slab: simplify kmalloc_nolock() Vlastimil Babka
2026-01-20 12:06 ` Hao Li
2026-01-21 17:39 ` Suren Baghdasaryan
2026-01-22 1:53 ` Harry Yoo
2026-01-22 8:16 ` Vlastimil Babka
2026-01-22 8:34 ` Harry Yoo
2026-01-16 14:40 ` [PATCH v3 15/21] slab: remove struct kmem_cache_cpu Vlastimil Babka
2026-01-20 12:40 ` Hao Li
2026-01-21 14:29 ` Vlastimil Babka
2026-01-21 17:54 ` Suren Baghdasaryan
2026-01-21 19:03 ` Vlastimil Babka
2026-01-22 3:10 ` Harry Yoo
2026-01-16 14:40 ` [PATCH v3 16/21] slab: remove unused PREEMPT_RT specific macros Vlastimil Babka
2026-01-21 6:42 ` Hao Li
2026-01-21 17:57 ` Suren Baghdasaryan
2026-01-22 3:50 ` Harry Yoo
2026-01-16 14:40 ` [PATCH v3 17/21] slab: refill sheaves from all nodes Vlastimil Babka
2026-01-21 18:30 ` Suren Baghdasaryan
2026-01-22 4:44 ` Harry Yoo
2026-01-22 8:37 ` Vlastimil Babka
2026-01-22 4:58 ` Hao Li
2026-01-22 8:32 ` Vlastimil Babka
2026-01-22 7:02 ` Harry Yoo
2026-01-22 8:42 ` Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 18/21] slab: update overview comments Vlastimil Babka
2026-01-21 20:58 ` Suren Baghdasaryan
2026-01-22 3:54 ` Hao Li
2026-01-22 6:41 ` Harry Yoo
2026-01-22 8:49 ` Vlastimil Babka
2026-01-16 14:40 ` [PATCH v3 19/21] slab: remove frozen slab checks from __slab_free() Vlastimil Babka
2026-01-22 0:54 ` Suren Baghdasaryan
2026-01-22 6:31 ` Vlastimil Babka
2026-01-22 5:01 ` Hao Li
2026-01-16 14:40 ` [PATCH v3 20/21] mm/slub: remove DEACTIVATE_TO_* stat items Vlastimil Babka
2026-01-22 0:58 ` Suren Baghdasaryan
2026-01-22 5:17 ` Hao Li
2026-01-16 14:40 ` [PATCH v3 21/21] mm/slub: cleanup and repurpose some " Vlastimil Babka
2026-01-22 2:35 ` Suren Baghdasaryan
2026-01-22 9:30 ` Vlastimil Babka
2026-01-22 5:52 ` Hao Li
2026-01-22 9:30 ` Vlastimil Babka