linux-mm.kvack.org archive mirror
* [RFC PATCH 0/7] k[v]free_rcu() improvements
@ 2026-02-06  9:34 Harry Yoo
  2026-02-06  9:34 ` [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr Harry Yoo
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: Harry Yoo @ 2026-02-06  9:34 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Harry Yoo, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

These are a few improvements to the k[v]free_rcu() API, suggested by
Alexei Starovoitov.

[ To kmemleak folks: I'm going to teach delete_object_full() and
  paint_ptr() to ignore cases when the object does not exist.
  Could you please let me know if the way it's done in patch 3
  looks good? Only part 2 is relevant to you. ]

Although I've put some effort into providing a decent-quality
implementation, please consider this a proof of concept; let's discuss
how best we can tackle these problems:

  1) Allow an 8-byte field to be used as an alternative to
     struct rcu_head (16-byte) for 2-argument kvfree_rcu()
  2) kmalloc_nolock() -> kfree[_rcu]() support
  3) Add kfree_rcu_nolock() for NMI context

# Part 1. Allow an 8-byte field to be used as an alternative to
  struct rcu_head for 2-argument kvfree_rcu()
  
  Technically, objects that are freed with k[v]free_rcu() need
  only one pointer to link objects, because we already know that
  the callback function is always kvfree(). For this purpose,
  struct rcu_head is unnecessarily large (16 bytes on 64-bit).

  Allow a smaller, 8-byte field (of struct rcu_ptr type) to be used
  with k[v]free_rcu(). Let's save one pointer per slab object.
  
  I have to admit that my naming skill isn't great; hopefully
  we'll come up with a better name than `struct rcu_ptr`.

  With this feature, either a struct rcu_ptr or rcu_head field
  can be used as the second argument of the k[v]free_rcu() API.

  Users that only use k[v]free_rcu() are highly encouraged to use
  struct rcu_ptr; otherwise they are wasting memory. However, some users,
  such as maple tree, may use either call_rcu() or k[v]free_rcu() on
  objects of the same type, depending on the situation. For such users,
  struct rcu_head remains the only option.

  Patch 1 implements this feature, and patch 2 adds a few users in mm/.
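
  As a usage sketch (hypothetical 'struct foo'; the call site is the same
  as with struct rcu_head, only the type of the field changes):

	struct foo {
		unsigned long data;
		struct rcu_ptr rcu;	/* 8 bytes instead of 16 on 64-bit */
	};

	static void free_foo(struct foo *f)
	{
		/* kvfree() runs on 'f' after a grace period */
		kfree_rcu(f, rcu);
	}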

# Part 2. kmalloc_nolock() -> kfree() or kfree_rcu() path support
  
  Allow objects allocated with kmalloc_nolock() to be freed with
  kfree[_rcu](). Without this support, users are forced to call
  call_rcu() with a callback that frees the object via kfree_nolock().
  This is inefficient and can create an unnecessarily large number of
  grace periods because it bypasses the kfree_rcu() batching layer.

  This was not supported before because some alloc hooks are not called
  in kmalloc_nolock(), while all free hooks are called in kfree().

  Patch 3 adds support for this by teaching kmemleak to ignore cases
  when free hooks are called without prior alloc hooks. Patch 4 frees
  a bit in enum objexts_flags, since we no longer have to remember
  whether the array was allocated using kmalloc_nolock() or kmalloc().

  Note that the free hooks fall into these categories:

  - Its alloc hook is called in kmalloc_nolock(), no problem!
    (kmsan_slab_alloc(), kasan_slab_alloc(),
     memcg_slab_post_alloc_hook(), alloc_tagging_slab_alloc_hook())

  - Its alloc hook isn't called in kmalloc_nolock(); free hooks
    must handle asymmetric hook calls. (kfence_free(),
    kmemleak_free_recursive())

  - There is no matching alloc hook for the free hook; it's safe to
    call. (debug_check_no_{locks,obj}_freed, __kcsan_check_access())

  Note that the kmalloc() -> kfree_nolock() or kfree_rcu_nolock()
  direction is still not supported! That's much trickier :)
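
  To illustrate the direction that is now supported, a caller in a
  restricted context can do something like this (hypothetical 'struct foo'
  with a struct rcu_ptr 'rcu' member, as in the Part 1 example, assuming
  the existing kmalloc_nolock(size, gfp_flags, node) prototype):

	struct foo *p = kmalloc_nolock(sizeof(*p), 0, NUMA_NO_NODE);

	if (p) {
		/* ... use p ... */
		kfree_rcu(p, rcu);	/* no call_rcu() + kfree_nolock() dance */
	}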

# Part 3. Add kfree_rcu_nolock() for NMI context

  Add a new 2-argument kfree_rcu_nolock() variant that is safe to call
  in NMI context. In NMI context, calling kfree_rcu() or call_rcu() is
  not legal, so users are forced to implement some sort of deferred
  freeing. Let's make users' lives easier with the new variant.

  Note that a 1-argument kfree_rcu_nolock() is not supported, since
  there is not much we can do when trylock and memory allocation fail.
  (You can't call synchronize_rcu() in NMI context!)
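
  Usage sketch (hypothetical struct and caller; only the 2-argument form
  exists):

	struct event {
		u64 payload;
		struct rcu_ptr rcu;
	};

	/* safe even from an NMI handler */
	static void discard_event(struct event *e)
	{
		kfree_rcu_nolock(e, rcu);  /* never spins; may defer via irq_work */
	}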

  When spinning on a lock is not allowed, the spinlock is acquired with
  trylock. When that succeeds, do one of the following:

  1) Free the object via the rcu sheaf. Note that call_rcu() cannot be
     called in NMI context, so if freeing the object makes the rcu sheaf
     full, the object cannot go into the sheaf and we have to fall back.
  
  2) Use struct rcu_ptr field to link objects. Consuming a bnode
     (of struct kvfree_rcu_bulk_data) and queueing work to maintain
     a number of cached bnodes is avoided in NMI context.

  Note that the delayed monitor work that drains objects after
  KFREE_DRAIN_JIFFIES is scheduled via a lazy irq_work to avoid raising
  self-IPIs. That means scheduling of the monitor work can itself be
  delayed by up to the length of a time slice.

  In rare cases where trylock fails, a non-lazy irq_work is used to
  defer calling kvfree_call_rcu().

  When certain debug features (kmemleak, debugobjects) are enabled,
  freeing in NMI context is always deferred because they use spinlocks.

  Patch 6 implements kfree_rcu_nolock() support, patch 7 adds sheaves
  support for the new API.

Harry Yoo (7):
  mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
  mm: use rcu_ptr instead of rcu_head
  mm/slab: allow freeing kmalloc_nolock()'d objects using kfree[_rcu]()
  mm/slab: free a bit in enum objexts_flags
  mm/slab: move kfree_rcu_cpu[_work] definitions
  mm/slab: introduce kfree_rcu_nolock()
  mm/slab: make kfree_rcu_nolock() work with sheaves

 include/linux/list_lru.h   |   2 +-
 include/linux/memcontrol.h |   3 +-
 include/linux/rcupdate.h   |  68 +++++---
 include/linux/shrinker.h   |   2 +-
 include/linux/types.h      |   9 ++
 mm/kmemleak.c              |  11 +-
 mm/slab.h                  |   2 +-
 mm/slab_common.c           | 309 +++++++++++++++++++++++++------------
 mm/slub.c                  |  47 ++++--
 mm/vmalloc.c               |   4 +-
 10 files changed, 310 insertions(+), 147 deletions(-)

-- 
2.43.0




* [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
  2026-02-06  9:34 [RFC PATCH 0/7] k[v]free_rcu() improvements Harry Yoo
@ 2026-02-06  9:34 ` Harry Yoo
  2026-02-11 10:16   ` Uladzislau Rezki
  2026-02-06  9:34 ` [RFC PATCH 2/7] mm: use rcu_ptr instead of rcu_head Harry Yoo
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Harry Yoo @ 2026-02-06  9:34 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Harry Yoo, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

k[v]free_rcu() repurposes two fields of struct rcu_head: 'func' to store
the start address of the object, and 'next' to link objects.

However, using 'func' to store the start address is unnecessary:

  1. slab can get the start address from the address of struct rcu_head
     field via nearest_obj(), and

  2. vmalloc and large kmalloc can get the start address by aligning
     down the address of the struct rcu_head field to the page boundary.
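
That is, the universal callback can recover the start address with logic
along these lines (this is what kvfree_rcu_list() does in this patch):

	slab = virt_to_slab(head);
	if (is_vmalloc_addr(head) || !slab)
		ptr = (void *)PAGE_ALIGN_DOWN((unsigned long)head);
	else
		ptr = nearest_obj(slab->slab_cache, slab, head);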

Therefore, allow an 8-byte (on 64-bit) field (of a new type called
struct rcu_ptr) to be used with k[v]free_rcu() with two arguments.

Some users use both call_rcu() and k[v]free_rcu() to process callbacks
(e.g., maple tree), so it makes sense for them to have a struct rcu_head
field that handles both cases. However, many users that simply free
objects via kvfree_rcu() can save one pointer by using struct rcu_ptr
instead of struct rcu_head.

Note that struct rcu_ptr is a single pointer only when
CONFIG_KVFREE_RCU_BATCHED=y. To keep the kvfree_rcu() implementation
minimal when CONFIG_KVFREE_RCU_BATCHED is disabled, struct rcu_ptr is the
same size as struct rcu_head, and the implementation of kvfree_rcu()
remains unchanged in that configuration.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 include/linux/rcupdate.h | 61 +++++++++++++++++++++++++++-------------
 include/linux/types.h    |  9 ++++++
 mm/slab_common.c         | 40 +++++++++++++++-----------
 3 files changed, 75 insertions(+), 35 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index c5b30054cd01..8924edf7e8c1 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1059,22 +1059,30 @@ static inline void rcu_read_unlock_migrate(void)
 /**
  * kfree_rcu() - kfree an object after a grace period.
  * @ptr: pointer to kfree for double-argument invocations.
- * @rhf: the name of the struct rcu_head within the type of @ptr.
+ * @rf: the name of the struct rcu_head or struct rcu_ptr within the type of @ptr.
  *
  * Many rcu callbacks functions just call kfree() on the base structure.
  * These functions are trivial, but their size adds up, and furthermore
  * when they are used in a kernel module, that module must invoke the
  * high-latency rcu_barrier() function at module-unload time.
+ * The kfree_rcu() function handles this issue by batching.
  *
- * The kfree_rcu() function handles this issue. In order to have a universal
- * callback function handling different offsets of rcu_head, the callback needs
- * to determine the starting address of the freed object, which can be a large
- * kmalloc or vmalloc allocation. To allow simply aligning the pointer down to
- * page boundary for those, only offsets up to 4095 bytes can be accommodated.
- * If the offset is larger than 4095 bytes, a compile-time error will
- * be generated in kvfree_rcu_arg_2(). If this error is triggered, you can
- * either fall back to use of call_rcu() or rearrange the structure to
- * position the rcu_head structure into the first 4096 bytes.
+ * Typically, struct rcu_head is used to process RCU callbacks, but it requires
+ * two pointers. However, since kfree_rcu() uses kfree() as the callback
+ * function, it can process callbacks with struct rcu_ptr, which is only
+ * one pointer in size (unless !CONFIG_KVFREE_RCU_BATCHED).
+ *
+ * The type of @rf can be either struct rcu_head or struct rcu_ptr, and when
+ * possible, it is recommended to use struct rcu_ptr due to its smaller size.
+ *
+ * In order to have a universal callback function handling different offsets
+ * of @rf, the callback needs to determine the starting address of the freed
+ * object, which can be a large kmalloc or vmalloc allocation. To allow simply
+ * aligning the pointer down to page boundary for those, only offsets up to
+ * 4095 bytes can be accommodated. If the offset is larger than 4095 bytes,
+ * a compile-time error will be generated in kvfree_rcu_arg_2().
+ * If this error is triggered, you can either fall back to use of call_rcu()
+ * or rearrange the structure to position @rf into the first 4096 bytes.
  *
  * The object to be freed can be allocated either by kmalloc() or
  * kmem_cache_alloc().
@@ -1084,8 +1092,8 @@ static inline void rcu_read_unlock_migrate(void)
  * The BUILD_BUG_ON check must not involve any function calls, hence the
  * checks are done in macros here.
  */
-#define kfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf)
-#define kvfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf)
+#define kfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
+#define kvfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
 
 /**
  * kfree_rcu_mightsleep() - kfree an object after a grace period.
@@ -1107,22 +1115,37 @@ static inline void rcu_read_unlock_migrate(void)
 #define kfree_rcu_mightsleep(ptr) kvfree_rcu_arg_1(ptr)
 #define kvfree_rcu_mightsleep(ptr) kvfree_rcu_arg_1(ptr)
 
-/*
- * In mm/slab_common.c, no suitable header to include here.
- */
-void kvfree_call_rcu(struct rcu_head *head, void *ptr);
+
+#ifdef CONFIG_KVFREE_RCU_BATCHED
+void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr);
+#define kvfree_call_rcu(head, ptr) \
+	_Generic((head), \
+		struct rcu_head *: kvfree_call_rcu_ptr,		\
+		struct rcu_ptr *: kvfree_call_rcu_ptr,		\
+		void *: kvfree_call_rcu_ptr			\
+	)((struct rcu_ptr *)(head), (ptr))
+#else
+void kvfree_call_rcu_head(struct rcu_head *head, void *ptr);
+static_assert(sizeof(struct rcu_head) == sizeof(struct rcu_ptr));
+#define kvfree_call_rcu(head, ptr) \
+	_Generic((head), \
+		struct rcu_head *: kvfree_call_rcu_head,	\
+		struct rcu_ptr *: kvfree_call_rcu_head,		\
+		void *: kvfree_call_rcu_head			\
+	)((struct rcu_head *)(head), (ptr))
+#endif
 
 /*
  * The BUILD_BUG_ON() makes sure the rcu_head offset can be handled. See the
  * comment of kfree_rcu() for details.
  */
-#define kvfree_rcu_arg_2(ptr, rhf)					\
+#define kvfree_rcu_arg_2(ptr, rf)					\
 do {									\
 	typeof (ptr) ___p = (ptr);					\
 									\
 	if (___p) {							\
-		BUILD_BUG_ON(offsetof(typeof(*(ptr)), rhf) >= 4096);	\
-		kvfree_call_rcu(&((___p)->rhf), (void *) (___p));	\
+		BUILD_BUG_ON(offsetof(typeof(*(ptr)), rf) >= 4096);	\
+		kvfree_call_rcu(&((___p)->rf), (void *) (___p));	\
 	}								\
 } while (0)
 
diff --git a/include/linux/types.h b/include/linux/types.h
index d4437e9c452c..e5596ebab29c 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -245,6 +245,15 @@ struct callback_head {
 } __attribute__((aligned(sizeof(void *))));
 #define rcu_head callback_head
 
+
+struct rcu_ptr {
+#ifdef CONFIG_KVFREE_RCU_BATCHED
+	struct rcu_ptr *next;
+#else
+	struct callback_head;
+#endif
+} __attribute__((aligned(sizeof(void *))));
+
 typedef void (*rcu_callback_t)(struct rcu_head *head);
 typedef void (*call_rcu_func_t)(struct rcu_head *head, rcu_callback_t func);
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index d5a70a831a2a..3ec99a5463d3 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1265,7 +1265,7 @@ EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);
 
 #ifndef CONFIG_KVFREE_RCU_BATCHED
 
-void kvfree_call_rcu(struct rcu_head *head, void *ptr)
+void kvfree_call_rcu_head(struct rcu_head *head, void *ptr)
 {
 	if (head) {
 		kasan_record_aux_stack(ptr);
@@ -1278,7 +1278,7 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 	synchronize_rcu();
 	kvfree(ptr);
 }
-EXPORT_SYMBOL_GPL(kvfree_call_rcu);
+EXPORT_SYMBOL_GPL(kvfree_call_rcu_head);
 
 void __init kvfree_rcu_init(void)
 {
@@ -1346,7 +1346,7 @@ struct kvfree_rcu_bulk_data {
 
 struct kfree_rcu_cpu_work {
 	struct rcu_work rcu_work;
-	struct rcu_head *head_free;
+	struct rcu_ptr *head_free;
 	struct rcu_gp_oldstate head_free_gp_snap;
 	struct list_head bulk_head_free[FREE_N_CHANNELS];
 	struct kfree_rcu_cpu *krcp;
@@ -1381,8 +1381,7 @@ struct kfree_rcu_cpu_work {
  */
 struct kfree_rcu_cpu {
 	// Objects queued on a linked list
-	// through their rcu_head structures.
-	struct rcu_head *head;
+	struct rcu_ptr *head;
 	unsigned long head_gp_snap;
 	atomic_t head_count;
 
@@ -1523,18 +1522,28 @@ kvfree_rcu_bulk(struct kfree_rcu_cpu *krcp,
 }
 
 static void
-kvfree_rcu_list(struct rcu_head *head)
+kvfree_rcu_list(struct rcu_ptr *head)
 {
-	struct rcu_head *next;
+	struct rcu_ptr *next;
 
 	for (; head; head = next) {
-		void *ptr = (void *) head->func;
-		unsigned long offset = (void *) head - ptr;
+		void *ptr;
+		unsigned long offset;
+		struct slab *slab;
+
+		slab = virt_to_slab(head);
+		if (is_vmalloc_addr(head) || !slab)
+			ptr = (void *)PAGE_ALIGN_DOWN((unsigned long)head);
+		else
+			ptr = nearest_obj(slab->slab_cache, slab, head);
+		offset = (void *)head - ptr;
 
 		next = head->next;
 		debug_rcu_head_unqueue((struct rcu_head *)ptr);
 		rcu_lock_acquire(&rcu_callback_map);
-		trace_rcu_invoke_kvfree_callback("slab", head, offset);
+		trace_rcu_invoke_kvfree_callback("slab",
+						(struct rcu_head *)head,
+						offset);
 
 		kvfree(ptr);
 
@@ -1552,7 +1561,7 @@ static void kfree_rcu_work(struct work_struct *work)
 	unsigned long flags;
 	struct kvfree_rcu_bulk_data *bnode, *n;
 	struct list_head bulk_head[FREE_N_CHANNELS];
-	struct rcu_head *head;
+	struct rcu_ptr *head;
 	struct kfree_rcu_cpu *krcp;
 	struct kfree_rcu_cpu_work *krwp;
 	struct rcu_gp_oldstate head_gp_snap;
@@ -1675,7 +1684,7 @@ kvfree_rcu_drain_ready(struct kfree_rcu_cpu *krcp)
 {
 	struct list_head bulk_ready[FREE_N_CHANNELS];
 	struct kvfree_rcu_bulk_data *bnode, *n;
-	struct rcu_head *head_ready = NULL;
+	struct rcu_ptr *head_ready = NULL;
 	unsigned long flags;
 	int i;
 
@@ -1938,7 +1947,7 @@ void __init kfree_rcu_scheduler_running(void)
  * be free'd in workqueue context. This allows us to: batch requests together to
  * reduce the number of grace periods during heavy kfree_rcu()/kvfree_rcu() load.
  */
-void kvfree_call_rcu(struct rcu_head *head, void *ptr)
+void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
 {
 	unsigned long flags;
 	struct kfree_rcu_cpu *krcp;
@@ -1960,7 +1969,7 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 	// Queue the object but don't yet schedule the batch.
 	if (debug_rcu_head_queue(ptr)) {
 		// Probable double kfree_rcu(), just leak.
-		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
+		WARN_ONCE(1, "%s(): Double-freed call. rcu_ptr %p\n",
 			  __func__, head);
 
 		// Mark as success and leave.
@@ -1976,7 +1985,6 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 			// Inline if kvfree_rcu(one_arg) call.
 			goto unlock_return;
 
-		head->func = ptr;
 		head->next = krcp->head;
 		WRITE_ONCE(krcp->head, head);
 		atomic_inc(&krcp->head_count);
@@ -2012,7 +2020,7 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 		kvfree(ptr);
 	}
 }
-EXPORT_SYMBOL_GPL(kvfree_call_rcu);
+EXPORT_SYMBOL_GPL(kvfree_call_rcu_ptr);
 
 static inline void __kvfree_rcu_barrier(void)
 {
-- 
2.43.0




* [RFC PATCH 2/7] mm: use rcu_ptr instead of rcu_head
  2026-02-06  9:34 [RFC PATCH 0/7] k[v]free_rcu() improvements Harry Yoo
  2026-02-06  9:34 ` [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr Harry Yoo
@ 2026-02-06  9:34 ` Harry Yoo
  2026-02-09 10:41   ` Uladzislau Rezki
  2026-02-06  9:34 ` [RFC PATCH 3/7] mm/slab: allow freeing kmalloc_nolock()'d objects using kfree[_rcu]() Harry Yoo
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Harry Yoo @ 2026-02-06  9:34 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Harry Yoo, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

When slab objects are freed with kfree_rcu() and not call_rcu(),
using struct rcu_head (16 bytes on 64-bit) is unnecessary and
struct rcu_ptr (8 bytes on 64-bit) is enough. Save one pointer
per slab object by using struct rcu_ptr.

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 include/linux/list_lru.h | 2 +-
 include/linux/shrinker.h | 2 +-
 mm/vmalloc.c             | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index fe739d35a864..c79bccb7dafa 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -37,7 +37,7 @@ struct list_lru_one {
 };
 
 struct list_lru_memcg {
-	struct rcu_head		rcu;
+	struct rcu_ptr		rcu;
 	/* array of per cgroup per node lists, indexed by node id */
 	struct list_lru_one	node[];
 };
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 1a00be90d93a..bad20de2803a 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -19,7 +19,7 @@ struct shrinker_info_unit {
 };
 
 struct shrinker_info {
-	struct rcu_head rcu;
+	struct rcu_ptr rcu;
 	int map_nr_max;
 	struct shrinker_info_unit *unit[];
 };
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 41dd01e8430c..89c781dcab58 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2596,7 +2596,7 @@ struct vmap_block {
 	DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
 	unsigned long dirty_min, dirty_max; /*< dirty range */
 	struct list_head free_list;
-	struct rcu_head rcu_head;
+	struct rcu_ptr rcu;
 	struct list_head purge;
 	unsigned int cpu;
 };
@@ -2765,7 +2765,7 @@ static void free_vmap_block(struct vmap_block *vb)
 	spin_unlock(&vn->busy.lock);
 
 	free_vmap_area_noflush(vb->va);
-	kfree_rcu(vb, rcu_head);
+	kfree_rcu(vb, rcu);
 }
 
 static bool purge_fragmented_block(struct vmap_block *vb,
-- 
2.43.0




* [RFC PATCH 3/7] mm/slab: allow freeing kmalloc_nolock()'d objects using kfree[_rcu]()
  2026-02-06  9:34 [RFC PATCH 0/7] k[v]free_rcu() improvements Harry Yoo
  2026-02-06  9:34 ` [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr Harry Yoo
  2026-02-06  9:34 ` [RFC PATCH 2/7] mm: use rcu_ptr instead of rcu_head Harry Yoo
@ 2026-02-06  9:34 ` Harry Yoo
  2026-02-06  9:34 ` [RFC PATCH 4/7] mm/slab: free a bit in enum objexts_flags Harry Yoo
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Harry Yoo @ 2026-02-06  9:34 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Harry Yoo, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

Slab objects allocated with kmalloc_nolock() must be freed using
kfree_nolock(), because kmalloc_nolock() can't spin on a lock during
allocation and therefore only a subset of the alloc hooks are called.

This imposes a limitation: such objects cannot be freed with kfree_rcu(),
forcing users to work around it by calling call_rcu() with a callback
that frees the object using kfree_nolock().

Remove this limitation by teaching kmemleak to gracefully ignore cases
when kmemleak_free() is called without a prior kmemleak_alloc().
Unlike kmemleak, kfence already handles this case, because,
due to its design, only a subset of allocations are served from kfence.

With this change, kfree() and kfree_rcu() can be used to free objects
that are allocated using kmalloc_nolock().

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 include/linux/rcupdate.h |  4 ++--
 mm/kmemleak.c            | 11 +++++------
 mm/slub.c                | 21 ++++++++++++++++++++-
 3 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 8924edf7e8c1..db5053a7b0cb 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1084,8 +1084,8 @@ static inline void rcu_read_unlock_migrate(void)
  * If this error is triggered, you can either fall back to use of call_rcu()
  * or rearrange the structure to position @rf into the first 4096 bytes.
  *
- * The object to be freed can be allocated either by kmalloc() or
- * kmem_cache_alloc().
+ * The object to be freed can be allocated either by kmalloc(),
+ * kmalloc_nolock(), or kmem_cache_alloc().
  *
  * Note that the allowable offset might decrease in the future.
  *
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 1ac56ceb29b6..de32db5c4f23 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -837,13 +837,12 @@ static void delete_object_full(unsigned long ptr, unsigned int objflags)
 	struct kmemleak_object *object;
 
 	object = find_and_remove_object(ptr, 0, objflags);
-	if (!object) {
-#ifdef DEBUG
-		kmemleak_warn("Freeing unknown object at 0x%08lx\n",
-			      ptr);
-#endif
+	if (!object)
+		/*
+		 * kmalloc_nolock() -> kfree() calls kmemleak_free()
+		 * without kmemleak_alloc()
+		 */
 		return;
-	}
 	__delete_object(object);
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index 102fb47ae013..a118ac009b61 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2581,6 +2581,24 @@ struct rcu_delayed_free {
  * Returns true if freeing of the object can proceed, false if its reuse
  * was delayed by CONFIG_SLUB_RCU_DEBUG or KASAN quarantine, or it was returned
  * to KFENCE.
+ *
+ * For objects allocated via kmalloc_nolock(), only a subset of alloc hooks
+ * are invoked, so some free hooks must handle asymmetric hook calls.
+ *
+ * Alloc hooks called for kmalloc_nolock():
+ * - kmsan_slab_alloc()
+ * - kasan_slab_alloc()
+ * - memcg_slab_post_alloc_hook()
+ * - alloc_tagging_slab_alloc_hook()
+ *
+ * Free hooks that must handle missing corresponding alloc hooks:
+ * - kmemleak_free_recursive()
+ * - kfence_free()
+ *
+ * Free hooks that have no alloc hook counterpart and are thus safe to call:
+ * - debug_check_no_locks_freed()
+ * - debug_check_no_obj_freed()
+ * - __kcsan_check_access()
  */
 static __always_inline
 bool slab_free_hook(struct kmem_cache *s, void *x, bool init,
@@ -6365,7 +6383,7 @@ void kvfree_rcu_cb(struct rcu_head *head)
 
 /**
  * kfree - free previously allocated memory
- * @object: pointer returned by kmalloc() or kmem_cache_alloc()
+ * @object: pointer returned by kmalloc(), kmalloc_nolock(), or kmem_cache_alloc()
  *
  * If @object is NULL, no operation is performed.
  */
@@ -6384,6 +6402,7 @@ void kfree(const void *object)
 	page = virt_to_page(object);
 	slab = page_slab(page);
 	if (!slab) {
+		/* kmalloc_nolock() doesn't support large kmalloc */
 		free_large_kmalloc(page, (void *)object);
 		return;
 	}
-- 
2.43.0




* [RFC PATCH 4/7] mm/slab: free a bit in enum objexts_flags
  2026-02-06  9:34 [RFC PATCH 0/7] k[v]free_rcu() improvements Harry Yoo
                   ` (2 preceding siblings ...)
  2026-02-06  9:34 ` [RFC PATCH 3/7] mm/slab: allow freeing kmalloc_nolock()'d objects using kfree[_rcu]() Harry Yoo
@ 2026-02-06  9:34 ` Harry Yoo
  2026-02-06 20:09   ` Alexei Starovoitov
  2026-02-06  9:34 ` [RFC PATCH 5/7] mm/slab: move kfree_rcu_cpu[_work] definitions Harry Yoo
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Harry Yoo @ 2026-02-06  9:34 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Harry Yoo, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

Since kfree() now supports freeing objects allocated with
kmalloc_nolock(), free one bit in enum objext_flags.

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 include/linux/memcontrol.h |  3 +--
 mm/slub.c                  | 12 ++----------
 2 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0651865a4564..bb789ec4a2a2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -359,8 +359,7 @@ enum objext_flags {
 	 * MEMCG_DATA_OBJEXTS.
 	 */
 	OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL,
-	/* slabobj_ext vector allocated with kmalloc_nolock() */
-	OBJEXTS_NOSPIN_ALLOC = __FIRST_OBJEXT_FLAG,
+	__OBJEXTS_FLAG_UNUSED = __FIRST_OBJEXT_FLAG,
 	/* the next bit after the last actual flag */
 	__NR_OBJEXTS_FLAGS  = (__FIRST_OBJEXT_FLAG << 1),
 };
diff --git a/mm/slub.c b/mm/slub.c
index a118ac009b61..ac7bc7e1163f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2186,8 +2186,6 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 			virt_to_slab(vec)->slab_cache == s);
 
 	new_exts = (unsigned long)vec;
-	if (unlikely(!allow_spin))
-		new_exts |= OBJEXTS_NOSPIN_ALLOC;
 #ifdef CONFIG_MEMCG
 	new_exts |= MEMCG_DATA_OBJEXTS;
 #endif
@@ -2210,10 +2208,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 		 * objcg vector should be reused.
 		 */
 		mark_objexts_empty(vec);
-		if (unlikely(!allow_spin))
-			kfree_nolock(vec);
-		else
-			kfree(vec);
+		kfree(vec);
 		return 0;
 	} else if (cmpxchg(&slab->obj_exts, old_exts, new_exts) != old_exts) {
 		/* Retry if a racing thread changed slab->obj_exts from under us. */
@@ -2253,10 +2248,7 @@ static inline void free_slab_obj_exts(struct slab *slab)
 	 * the extension for obj_exts is expected to be NULL.
 	 */
 	mark_objexts_empty(obj_exts);
-	if (unlikely(READ_ONCE(slab->obj_exts) & OBJEXTS_NOSPIN_ALLOC))
-		kfree_nolock(obj_exts);
-	else
-		kfree(obj_exts);
+	kfree(obj_exts);
 	slab->obj_exts = 0;
 }
 
-- 
2.43.0




* [RFC PATCH 5/7] mm/slab: move kfree_rcu_cpu[_work] definitions
  2026-02-06  9:34 [RFC PATCH 0/7] k[v]free_rcu() improvements Harry Yoo
                   ` (3 preceding siblings ...)
  2026-02-06  9:34 ` [RFC PATCH 4/7] mm/slab: free a bit in enum objexts_flags Harry Yoo
@ 2026-02-06  9:34 ` Harry Yoo
  2026-02-06  9:34 ` [RFC PATCH 6/7] mm/slab: introduce kfree_rcu_nolock() Harry Yoo
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Harry Yoo @ 2026-02-06  9:34 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Harry Yoo, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

In preparation for defining kfree_rcu_cpu under
CONFIG_KVFREE_RCU_BATCHED=n and adding a new function common to both
configurations, move the existing kfree_rcu_cpu[_work] definitions to
just before the beginning of the kfree_rcu batching infrastructure.

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/slab_common.c | 142 ++++++++++++++++++++++++-----------------------
 1 file changed, 72 insertions(+), 70 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 3ec99a5463d3..d232b99a4b52 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1263,78 +1263,9 @@ EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
 EXPORT_TRACEPOINT_SYMBOL(kfree);
 EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);
 
-#ifndef CONFIG_KVFREE_RCU_BATCHED
-
-void kvfree_call_rcu_head(struct rcu_head *head, void *ptr)
-{
-	if (head) {
-		kasan_record_aux_stack(ptr);
-		call_rcu(head, kvfree_rcu_cb);
-		return;
-	}
-
-	// kvfree_rcu(one_arg) call.
-	might_sleep();
-	synchronize_rcu();
-	kvfree(ptr);
-}
-EXPORT_SYMBOL_GPL(kvfree_call_rcu_head);
-
-void __init kvfree_rcu_init(void)
-{
-}
-
-#else /* CONFIG_KVFREE_RCU_BATCHED */
-
-/*
- * This rcu parameter is runtime-read-only. It reflects
- * a minimum allowed number of objects which can be cached
- * per-CPU. Object size is equal to one page. This value
- * can be changed at boot time.
- */
-static int rcu_min_cached_objs = 5;
-module_param(rcu_min_cached_objs, int, 0444);
-
-// A page shrinker can ask for pages to be freed to make them
-// available for other parts of the system. This usually happens
-// under low memory conditions, and in that case we should also
-// defer page-cache filling for a short time period.
-//
-// The default value is 5 seconds, which is long enough to reduce
-// interference with the shrinker while it asks other systems to
-// drain their caches.
-static int rcu_delay_page_cache_fill_msec = 5000;
-module_param(rcu_delay_page_cache_fill_msec, int, 0444);
-
-static struct workqueue_struct *rcu_reclaim_wq;
-
-/* Maximum number of jiffies to wait before draining a batch. */
-#define KFREE_DRAIN_JIFFIES (5 * HZ)
+#ifdef CONFIG_KVFREE_RCU_BATCHED
 #define KFREE_N_BATCHES 2
 #define FREE_N_CHANNELS 2
-
-/**
- * struct kvfree_rcu_bulk_data - single block to store kvfree_rcu() pointers
- * @list: List node. All blocks are linked between each other
- * @gp_snap: Snapshot of RCU state for objects placed to this bulk
- * @nr_records: Number of active pointers in the array
- * @records: Array of the kvfree_rcu() pointers
- */
-struct kvfree_rcu_bulk_data {
-	struct list_head list;
-	struct rcu_gp_oldstate gp_snap;
-	unsigned long nr_records;
-	void *records[] __counted_by(nr_records);
-};
-
-/*
- * This macro defines how many entries the "records" array
- * will contain. It is based on the fact that the size of
- * kvfree_rcu_bulk_data structure becomes exactly one page.
- */
-#define KVFREE_BULK_MAX_ENTR \
-	((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))
-
 /**
  * struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests
  * @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period
@@ -1402,6 +1333,77 @@ struct kfree_rcu_cpu {
 	struct llist_head bkvcache;
 	int nr_bkv_objs;
 };
+#endif
+
+#ifndef CONFIG_KVFREE_RCU_BATCHED
+
+void kvfree_call_rcu_head(struct rcu_head *head, void *ptr)
+{
+	if (head) {
+		kasan_record_aux_stack(ptr);
+		call_rcu(head, kvfree_rcu_cb);
+		return;
+	}
+
+	// kvfree_rcu(one_arg) call.
+	might_sleep();
+	synchronize_rcu();
+	kvfree(ptr);
+}
+EXPORT_SYMBOL_GPL(kvfree_call_rcu_head);
+
+void __init kvfree_rcu_init(void)
+{
+}
+
+#else /* CONFIG_KVFREE_RCU_BATCHED */
+
+/*
+ * This rcu parameter is runtime-read-only. It reflects
+ * a minimum allowed number of objects which can be cached
+ * per-CPU. Object size is equal to one page. This value
+ * can be changed at boot time.
+ */
+static int rcu_min_cached_objs = 5;
+module_param(rcu_min_cached_objs, int, 0444);
+
+// A page shrinker can ask for pages to be freed to make them
+// available for other parts of the system. This usually happens
+// under low memory conditions, and in that case we should also
+// defer page-cache filling for a short time period.
+//
+// The default value is 5 seconds, which is long enough to reduce
+// interference with the shrinker while it asks other systems to
+// drain their caches.
+static int rcu_delay_page_cache_fill_msec = 5000;
+module_param(rcu_delay_page_cache_fill_msec, int, 0444);
+
+static struct workqueue_struct *rcu_reclaim_wq;
+
+/* Maximum number of jiffies to wait before draining a batch. */
+#define KFREE_DRAIN_JIFFIES (5 * HZ)
+
+/**
+ * struct kvfree_rcu_bulk_data - single block to store kvfree_rcu() pointers
+ * @list: List node. All blocks are linked between each other
+ * @gp_snap: Snapshot of RCU state for objects placed to this bulk
+ * @nr_records: Number of active pointers in the array
+ * @records: Array of the kvfree_rcu() pointers
+ */
+struct kvfree_rcu_bulk_data {
+	struct list_head list;
+	struct rcu_gp_oldstate gp_snap;
+	unsigned long nr_records;
+	void *records[] __counted_by(nr_records);
+};
+
+/*
+ * This macro defines how many entries the "records" array
+ * will contain. It is based on the fact that the size of
+ * kvfree_rcu_bulk_data structure becomes exactly one page.
+ */
+#define KVFREE_BULK_MAX_ENTR \
+	((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))
 
 static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
 	.lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
-- 
2.43.0




* [RFC PATCH 6/7] mm/slab: introduce kfree_rcu_nolock()
  2026-02-06  9:34 [RFC PATCH 0/7] k[v]free_rcu() improvements Harry Yoo
                   ` (4 preceding siblings ...)
  2026-02-06  9:34 ` [RFC PATCH 5/7] mm/slab: move kfree_rcu_cpu[_work] definitions Harry Yoo
@ 2026-02-06  9:34 ` Harry Yoo
  2026-02-12  2:58   ` Harry Yoo
  2026-02-16 21:07   ` Joel Fernandes
  2026-02-06  9:34 ` [RFC PATCH 7/7] mm/slab: make kfree_rcu_nolock() work with sheaves Harry Yoo
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 32+ messages in thread
From: Harry Yoo @ 2026-02-06  9:34 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Harry Yoo, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

Currently, kfree_rcu() cannot be called in an NMI context.
In such a context, even calling call_rcu() is not legal,
forcing users to implement deferred freeing.

Make users' lives easier by introducing a kfree_rcu_nolock() variant.
Unlike kfree_rcu(), kfree_rcu_nolock() only supports the 2-argument form
because, in the worst case where memory allocation fails, the caller
cannot synchronously wait for a grace period to finish.

Similar to the kfree_nolock() implementation, try to acquire the
kfree_rcu_cpu spinlock; if that fails, insert the object into a per-CPU
lockless list and delay freeing via an irq_work that calls
kvfree_call_rcu() later. When kmemleak or debugobjects is enabled,
always defer freeing, as those debug features don't support NMI contexts.

When trylock succeeds, avoid consuming a bnode and calling
run_page_cache_worker() altogether. Instead, insert objects into
struct kfree_rcu_cpu.head without consuming additional memory.

For now, the sheaves layer is bypassed if spinning is not allowed.

Scheduling the delayed monitor work from an NMI context is tricky; use
an irq_work to schedule it, but a lazy one to avoid raising self-IPIs.
That means scheduling of the monitor work can be delayed by up to the
length of a time slice.

Without CONFIG_KVFREE_RCU_BATCHED, all frees in the !allow_spin case are
delayed using irq_work.
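
With CONFIG_KVFREE_RCU_BATCHED=y, the !allow_spin path through
kvfree_call_rcu_ptr() added by this patch is roughly (sketch; fallback
and error handling omitted):

	if (!allow_spin && (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD) ||
			    IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)))
		goto defer_free;			/* llist + irq_work */

	krcp = krc_this_cpu_lock(&flags, allow_spin);	/* trylock if !allow_spin */
	if (!krcp)
		goto defer_free;

	/* link the object via its rcu_ptr into krcp->head; no bnode needed */
	head->next = krcp->head;
	WRITE_ONCE(krcp->head, head);

	/* kick the monitor via a lazy irq_work instead of delayed work */
	irq_work_queue(&krcp->sched_monitor_irq_work);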

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 include/linux/rcupdate.h |  23 ++++---
 mm/slab_common.c         | 140 +++++++++++++++++++++++++++++++++------
 2 files changed, 133 insertions(+), 30 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index db5053a7b0cb..18bb7378b23d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1092,8 +1092,9 @@ static inline void rcu_read_unlock_migrate(void)
  * The BUILD_BUG_ON check must not involve any function calls, hence the
  * checks are done in macros here.
  */
-#define kfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
-#define kvfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
+#define kfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf, true)
+#define kfree_rcu_nolock(ptr, rf) kvfree_rcu_arg_2(ptr, rf, false)
+#define kvfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf, true)
 
 /**
  * kfree_rcu_mightsleep() - kfree an object after a grace period.
@@ -1117,35 +1118,35 @@ static inline void rcu_read_unlock_migrate(void)
 
 
 #ifdef CONFIG_KVFREE_RCU_BATCHED
-void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr);
-#define kvfree_call_rcu(head, ptr) \
+void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr, bool allow_spin);
+#define kvfree_call_rcu(head, ptr, spin) \
 	_Generic((head), \
 		struct rcu_head *: kvfree_call_rcu_ptr,		\
 		struct rcu_ptr *: kvfree_call_rcu_ptr,		\
 		void *: kvfree_call_rcu_ptr			\
-	)((struct rcu_ptr *)(head), (ptr))
+	)((struct rcu_ptr *)(head), (ptr), spin)
 #else
-void kvfree_call_rcu_head(struct rcu_head *head, void *ptr);
+void kvfree_call_rcu_head(struct rcu_head *head, void *ptr, bool allow_spin);
 static_assert(sizeof(struct rcu_head) == sizeof(struct rcu_ptr));
-#define kvfree_call_rcu(head, ptr) \
+#define kvfree_call_rcu(head, ptr, spin) \
 	_Generic((head), \
 		struct rcu_head *: kvfree_call_rcu_head,	\
 		struct rcu_ptr *: kvfree_call_rcu_head,		\
 		void *: kvfree_call_rcu_head			\
-	)((struct rcu_head *)(head), (ptr))
+	)((struct rcu_head *)(head), (ptr), spin)
 #endif
 
 /*
  * The BUILD_BUG_ON() makes sure the rcu_head offset can be handled. See the
  * comment of kfree_rcu() for details.
  */
-#define kvfree_rcu_arg_2(ptr, rf)					\
+#define kvfree_rcu_arg_2(ptr, rf, spin)					\
 do {									\
 	typeof (ptr) ___p = (ptr);					\
 									\
 	if (___p) {							\
 		BUILD_BUG_ON(offsetof(typeof(*(ptr)), rf) >= 4096);	\
-		kvfree_call_rcu(&((___p)->rf), (void *) (___p));	\
+		kvfree_call_rcu(&((___p)->rf), (void *) (___p), spin);	\
 	}								\
 } while (0)
 
@@ -1154,7 +1155,7 @@ do {								\
 	typeof(ptr) ___p = (ptr);				\
 								\
 	if (___p)						\
-		kvfree_call_rcu(NULL, (void *) (___p));		\
+		kvfree_call_rcu(NULL, (void *) (___p), true);	\
 } while (0)
 
 /*
diff --git a/mm/slab_common.c b/mm/slab_common.c
index d232b99a4b52..9d7801e5cb73 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1311,6 +1311,12 @@ struct kfree_rcu_cpu_work {
  * the interactions with the slab allocators.
  */
 struct kfree_rcu_cpu {
+	// Objects queued on a lockless linked list, not protected by the lock.
+	// This allows freeing objects in NMI context, where trylock may fail.
+	struct llist_head llist_head;
+	struct irq_work irq_work;
+	struct irq_work sched_monitor_irq_work;
+
 	// Objects queued on a linked list
 	struct rcu_ptr *head;
 	unsigned long head_gp_snap;
@@ -1333,12 +1339,61 @@ struct kfree_rcu_cpu {
 	struct llist_head bkvcache;
 	int nr_bkv_objs;
 };
+#else
+struct kfree_rcu_cpu {
+	struct llist_head llist_head;
+	struct irq_work irq_work;
+};
 #endif
 
+/* Universal implementation regardless of CONFIG_KVFREE_RCU_BATCHED */
+static void defer_kfree_rcu(struct irq_work *work)
+{
+	struct kfree_rcu_cpu *krcp;
+	struct llist_head *head;
+	struct llist_node *llnode, *pos, *t;
+
+	krcp = container_of(work, struct kfree_rcu_cpu, irq_work);
+	head = &krcp->llist_head;
+
+	if (llist_empty(head))
+		return;
+
+	llnode = llist_del_all(head);
+	llist_for_each_safe(pos, t, llnode) {
+		struct slab *slab;
+		void *objp;
+		struct rcu_ptr *rcup = (struct rcu_ptr *)pos;
+
+		slab = virt_to_slab(pos);
+		if (is_vmalloc_addr(pos) || !slab)
+			objp = (void *)PAGE_ALIGN_DOWN((unsigned long)pos);
+		else
+			objp = nearest_obj(slab->slab_cache, slab, pos);
+
+		kvfree_call_rcu(rcup, objp, true);
+	}
+}
+
 #ifndef CONFIG_KVFREE_RCU_BATCHED
+static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
+	.llist_head = LLIST_HEAD_INIT(llist_head),
+	.irq_work = IRQ_WORK_INIT(defer_kfree_rcu),
+};
 
-void kvfree_call_rcu_head(struct rcu_head *head, void *ptr)
+void kvfree_call_rcu_head(struct rcu_head *head, void *ptr, bool allow_spin)
 {
+	if (!allow_spin) {
+		struct kfree_rcu_cpu *krcp;
+
+		guard(preempt)();
+
+		krcp = this_cpu_ptr(&krc);
+		if (llist_add((struct llist_node *)head, &krcp->llist_head))
+			irq_work_queue(&krcp->irq_work);
+		return;
+	}
+
 	if (head) {
 		kasan_record_aux_stack(ptr);
 		call_rcu(head, kvfree_rcu_cb);
@@ -1405,8 +1460,21 @@ struct kvfree_rcu_bulk_data {
 #define KVFREE_BULK_MAX_ENTR \
 	((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))
 
+static void schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp);
+
+static void sched_monitor_irq_work(struct irq_work *work)
+{
+	struct kfree_rcu_cpu *krcp;
+
+	krcp = container_of(work, struct kfree_rcu_cpu, sched_monitor_irq_work);
+	schedule_delayed_monitor_work(krcp);
+}
+
 static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
 	.lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
+	.irq_work = IRQ_WORK_INIT(defer_kfree_rcu),
+	.sched_monitor_irq_work =
+		IRQ_WORK_INIT_LAZY(sched_monitor_irq_work),
 };
 
 static __always_inline void
@@ -1421,13 +1489,18 @@ debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
 }
 
 static inline struct kfree_rcu_cpu *
-krc_this_cpu_lock(unsigned long *flags)
+krc_this_cpu_lock(unsigned long *flags, bool allow_spin)
 {
 	struct kfree_rcu_cpu *krcp;
 
 	local_irq_save(*flags);	// For safely calling this_cpu_ptr().
 	krcp = this_cpu_ptr(&krc);
-	raw_spin_lock(&krcp->lock);
+	if (allow_spin) {
+		raw_spin_lock(&krcp->lock);
+	} else if (!raw_spin_trylock(&krcp->lock)) {
+		local_irq_restore(*flags);
+		return NULL;
+	}
 
 	return krcp;
 }
@@ -1841,25 +1914,27 @@ static void fill_page_cache_func(struct work_struct *work)
 // Returns true if ptr was successfully recorded, else the caller must
 // use a fallback.
 static inline bool
-add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
-	unsigned long *flags, void *ptr, bool can_alloc)
+add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu *krcp,
+	unsigned long *flags, void *ptr, bool can_alloc, bool allow_spin)
 {
 	struct kvfree_rcu_bulk_data *bnode;
 	int idx;
 
-	*krcp = krc_this_cpu_lock(flags);
-	if (unlikely(!(*krcp)->initialized))
+	if (unlikely(!krcp->initialized))
+		return false;
+
+	if (!allow_spin)
 		return false;
 
 	idx = !!is_vmalloc_addr(ptr);
-	bnode = list_first_entry_or_null(&(*krcp)->bulk_head[idx],
+	bnode = list_first_entry_or_null(&krcp->bulk_head[idx],
 		struct kvfree_rcu_bulk_data, list);
 
 	/* Check if a new block is required. */
 	if (!bnode || bnode->nr_records == KVFREE_BULK_MAX_ENTR) {
-		bnode = get_cached_bnode(*krcp);
+		bnode = get_cached_bnode(krcp);
 		if (!bnode && can_alloc) {
-			krc_this_cpu_unlock(*krcp, *flags);
+			krc_this_cpu_unlock(krcp, *flags);
 
 			// __GFP_NORETRY - allows a light-weight direct reclaim
 			// what is OK from minimizing of fallback hitting point of
@@ -1874,7 +1949,7 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
 			// scenarios.
 			bnode = (struct kvfree_rcu_bulk_data *)
 				__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
-			raw_spin_lock_irqsave(&(*krcp)->lock, *flags);
+			raw_spin_lock_irqsave(&krcp->lock, *flags);
 		}
 
 		if (!bnode)
@@ -1882,14 +1957,14 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
 
 		// Initialize the new block and attach it.
 		bnode->nr_records = 0;
-		list_add(&bnode->list, &(*krcp)->bulk_head[idx]);
+		list_add(&bnode->list, &krcp->bulk_head[idx]);
 	}
 
 	// Finally insert and update the GP for this page.
 	bnode->nr_records++;
 	bnode->records[bnode->nr_records - 1] = ptr;
 	get_state_synchronize_rcu_full(&bnode->gp_snap);
-	atomic_inc(&(*krcp)->bulk_count[idx]);
+	atomic_inc(&krcp->bulk_count[idx]);
 
 	return true;
 }
@@ -1949,7 +2024,7 @@ void __init kfree_rcu_scheduler_running(void)
  * be free'd in workqueue context. This allows us to: batch requests together to
  * reduce the number of grace periods during heavy kfree_rcu()/kvfree_rcu() load.
  */
-void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
+void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr, bool allow_spin)
 {
 	unsigned long flags;
 	struct kfree_rcu_cpu *krcp;
@@ -1965,7 +2040,12 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
 	if (!head)
 		might_sleep();
 
-	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && kfree_rcu_sheaf(ptr))
+	if (!allow_spin && (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD) ||
+				IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)))
+		goto defer_free;
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
+			(allow_spin && kfree_rcu_sheaf(ptr)))
 		return;
 
 	// Queue the object but don't yet schedule the batch.
@@ -1979,9 +2059,15 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
 	}
 
 	kasan_record_aux_stack(ptr);
-	success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
+
+	krcp = krc_this_cpu_lock(&flags, allow_spin);
+	if (!krcp)
+		goto defer_free;
+
+	success = add_ptr_to_bulk_krc_lock(krcp, &flags, ptr, !head, allow_spin);
 	if (!success) {
-		run_page_cache_worker(krcp);
+		if (allow_spin)
+			run_page_cache_worker(krcp);
 
 		if (head == NULL)
 			// Inline if kvfree_rcu(one_arg) call.
@@ -2005,8 +2091,12 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
 	kmemleak_ignore(ptr);
 
 	// Set timer to drain after KFREE_DRAIN_JIFFIES.
-	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING)
-		__schedule_delayed_monitor_work(krcp);
+	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING) {
+		if (allow_spin)
+			__schedule_delayed_monitor_work(krcp);
+		else
+			irq_work_queue(&krcp->sched_monitor_irq_work);
+	}
 
 unlock_return:
 	krc_this_cpu_unlock(krcp, flags);
@@ -2017,10 +2107,22 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
 	 * CPU can pass the QS state.
 	 */
 	if (!success) {
+		VM_WARN_ON_ONCE(!allow_spin);
 		debug_rcu_head_unqueue((struct rcu_head *) ptr);
 		synchronize_rcu();
 		kvfree(ptr);
 	}
+	return;
+
+defer_free:
+	VM_WARN_ON_ONCE(allow_spin);
+	guard(preempt)();
+
+	krcp = this_cpu_ptr(&krc);
+	if (llist_add((struct llist_node *)head, &krcp->llist_head))
+		irq_work_queue(&krcp->irq_work);
+	return;
+
 }
 EXPORT_SYMBOL_GPL(kvfree_call_rcu_ptr);
 
-- 
2.43.0




* [RFC PATCH 7/7] mm/slab: make kfree_rcu_nolock() work with sheaves
  2026-02-06  9:34 [RFC PATCH 0/7] k[v]free_rcu() improvements Harry Yoo
                   ` (5 preceding siblings ...)
  2026-02-06  9:34 ` [RFC PATCH 6/7] mm/slab: introduce kfree_rcu_nolock() Harry Yoo
@ 2026-02-06  9:34 ` Harry Yoo
  2026-02-12 19:15   ` Alexei Starovoitov
  2026-02-07  0:16 ` [RFC PATCH 0/7] k[v]free_rcu() improvements Paul E. McKenney
  2026-02-12 14:28 ` Vlastimil Babka
  8 siblings, 1 reply; 32+ messages in thread
From: Harry Yoo @ 2026-02-06  9:34 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Harry Yoo, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

Teach kfree_rcu_sheaf() how to handle the !allow_spin case. Similar to
__pcs_replace_full_main(), try to get an empty sheaf from pcs->spare or
the barn, but don't add !allow_spin support to alloc_empty_sheaf();
fail early instead.

Since call_rcu() does not support NMI contexts, kfree_rcu_sheaf() fails
when the rcu sheaf becomes full.

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/slab.h        |  2 +-
 mm/slab_common.c |  7 +++----
 mm/slub.c        | 14 ++++++++++++--
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 71c7261bf822..5e05a684258f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -404,7 +404,7 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
 	return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
 }
 
-bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin);
 void flush_all_rcu_sheaves(void);
 void flush_rcu_sheaves_on_cache(struct kmem_cache *s);
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9d7801e5cb73..3ee3cf8da304 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1675,7 +1675,7 @@ static void kfree_rcu_work(struct work_struct *work)
 		kvfree_rcu_list(head);
 }
 
-static bool kfree_rcu_sheaf(void *obj)
+static bool kfree_rcu_sheaf(void *obj, bool allow_spin)
 {
 	struct kmem_cache *s;
 	struct slab *slab;
@@ -1689,7 +1689,7 @@ static bool kfree_rcu_sheaf(void *obj)
 
 	s = slab->slab_cache;
 	if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id()))
-		return __kfree_rcu_sheaf(s, obj);
+		return __kfree_rcu_sheaf(s, obj, allow_spin);
 
 	return false;
 }
@@ -2044,8 +2044,7 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr, bool allow_spin)
 				IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)))
 		goto defer_free;
 
-	if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
-			(allow_spin && kfree_rcu_sheaf(ptr)))
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && kfree_rcu_sheaf(ptr, allow_spin))
 		return;
 
 	// Queue the object but don't yet schedule the batch.
diff --git a/mm/slub.c b/mm/slub.c
index ac7bc7e1163f..48f5d6dd3767 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5783,7 +5783,7 @@ static void rcu_free_sheaf(struct rcu_head *head)
  */
 static DEFINE_WAIT_OVERRIDE_MAP(kfree_rcu_sheaf_map, LD_WAIT_CONFIG);
 
-bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *rcu_sheaf;
@@ -5821,7 +5821,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 			goto fail;
 		}
 
-		empty = barn_get_empty_sheaf(barn, true);
+		empty = barn_get_empty_sheaf(barn, allow_spin);
 
 		if (empty) {
 			pcs->rcu_free = empty;
@@ -5830,6 +5830,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 
 		local_unlock(&s->cpu_sheaves->lock);
 
+		/* It's easier to fall back than trying harder with !allow_spin */
+		if (!allow_spin)
+			goto fail;
+
 		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
 
 		if (!empty)
@@ -5861,6 +5865,12 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
 		rcu_sheaf = NULL;
 	} else {
+		if (unlikely(!allow_spin)) {
+			/* call_rcu() does not support NMI context */
+			rcu_sheaf->size--;
+			local_unlock(&s->cpu_sheaves->lock);
+			goto fail;
+		}
 		pcs->rcu_free = NULL;
 		rcu_sheaf->node = numa_mem_id();
 	}
-- 
2.43.0




* Re: [RFC PATCH 4/7] mm/slab: free a bit in enum objexts_flags
  2026-02-06  9:34 ` [RFC PATCH 4/7] mm/slab: free a bit in enum objexts_flags Harry Yoo
@ 2026-02-06 20:09   ` Alexei Starovoitov
  2026-02-09  9:38     ` Vlastimil Babka
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2026-02-06 20:09 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas, Paul E . McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng,
	Muchun Song, rcu, linux-mm, bpf

On Fri, Feb 6, 2026 at 1:35 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> Since kfree() now supports freeing objects allocated with
> kmalloc_nolock(), free one bit in enum object_flags.
>
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>

For patches 3 and 4:

Acked-by: Alexei Starovoitov <ast@kernel.org>

I think patches 3 and 4 are ready.
Would be great to land them for this merge window
(if Vlastimil agrees).

Patch 3 is tiny, but the impact is huge and
patch 4 is a very nice cleanup.
If we land it now we can start using kfree_rcu() on bpf side
in the next release cycle. That will help us a lot.



* Re: [RFC PATCH 0/7] k[v]free_rcu() improvements
  2026-02-06  9:34 [RFC PATCH 0/7] k[v]free_rcu() improvements Harry Yoo
                   ` (6 preceding siblings ...)
  2026-02-06  9:34 ` [RFC PATCH 7/7] mm/slab: make kfree_rcu_nolock() work with sheaves Harry Yoo
@ 2026-02-07  0:16 ` Paul E. McKenney
  2026-02-07  1:21   ` Harry Yoo
  2026-02-12 14:28 ` Vlastimil Babka
  8 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2026-02-07  0:16 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng,
	Muchun Song, rcu, linux-mm, bpf

On Fri, Feb 06, 2026 at 06:34:03PM +0900, Harry Yoo wrote:
> These are a few improvements for k[v]free_rcu() API, which were suggested
> by Alexei Starovoitov.
> 
> [ To kmemleak folks: I'm going to teach delete_object_full() and
>   paint_ptr() to ignore cases when the object does not exist.
>   Could you please let me know if the way it's done in patch 3
>   looks good? Only part 2 is relevant to you. ]

On what commit should I apply this series?  I get conflicts on top of -rcu
(no surprise there) and build errors on top of next-20260205.

							Thanx, Paul

> Although I've put some effort into providing a decent quality
> implementation, I'd like you to consider this as a proof-of-concept
> and let's discuss how best we could tackle those problems:
> 
>   1) Allow an 8-byte field to be used as an alternative to
>      struct rcu_head (16-byte) for 2-argument kvfree_rcu()
>   2) kmalloc_nolock() -> kfree[_rcu]() support
>   3) Add kfree_rcu_nolock() for NMI context
> 
> # Part 1. Allow an 8-byte field to be used as an alternative to
>   struct rcu_head for 2-argument kvfree_rcu()
>   
>   Technically, objects that are freed with k[v]free_rcu() need
>   only one pointer to link objects, because we already know that
>   the callback function is always kvfree(). For this purpose,
>   struct rcu_head is unnecessarily large (16 bytes on 64-bit).
> 
>   Allow a smaller, 8-byte field (of struct rcu_ptr type) to be used
>   with k[v]free_rcu(). Let's save one pointer per slab object.
>   
>   I have to admit that my naming skill isn't great; hopefully
>   we'll come up with a better name than `struct rcu_ptr`.
> 
>   With this feature, either a struct rcu_ptr or rcu_head field
>   can be used as the second argument of the k[v]free_rcu() API.
> 
>   Users that only use k[v]free_rcu() are highly encouraged to use
>   struct rcu_ptr; otherwise you're wasting memory. However, some users,
>   such as maple tree, may use call_rcu() or k[v]free_rcu() depending on
>   the situation for objects of the same type. For such users,
>   struct rcu_head remains the only option.
> 
>   Patch 1 implements this feature, and patch 2 adds a few users in mm/.
> 
> # Part 2. kmalloc_nolock() -> kfree() or kfree_rcu() path support
>   
>   Allow objects allocated with kmalloc_nolock() to be freed with
>   kfree[_rcu](). Without this support, users are forced to call
>   call_rcu() with kfree_nolock() to free objects after a grace period.
>   This is not efficient and can create unnecessarily many grace periods
>   by bypassing the kfree_rcu batching layer.
> 
>   The reason why it was not supported before was because some alloc
>   hooks are not called in kmalloc_nolock(), while all free hooks are
>   called in kfree().
> 
>   Patch 3 adds support for this by teaching kmemleak to ignore cases
>   when free hooks are called without prior alloc hooks. Patch 4 frees
>   a bit in enum objexts_flags, since we no longer have to remember
>   whether the array was allocated using kmalloc_nolock() or kmalloc().
> 
>   Note that the free hooks fall into these categories:
> 
>   - Its alloc hook is called in kmalloc_nolock(), no problem!
>     (kmsan_slab_alloc(), kasan_slab_alloc(),
>      memcg_slab_post_alloc_hook(), alloc_tagging_slab_alloc_hook())
> 
>   - Its alloc hook isn't called in kmalloc_nolock(); free hooks
>     must handle asymmetric hook calls. (kfence_free(),
>     kmemleak_free_recursive())
> 
>   - There is no matching alloc hook for the free hook; it's safe to
>     call. (debug_check_no_{locks,obj}_freed, __kcsan_check_access())
> 
>   Note that kmalloc() -> kfree_nolock() or kfree_rcu_nolock() is still
>   not supported! That's much trickier :)
> 
> # Part 3. Add kfree_rcu_nolock() for NMI context
> 
>   Add a new 2-argument kfree_rcu_nolock() variant that is safe to be
>   called in NMI context. In NMI context, calling kfree_rcu() or
>   call_rcu() is not legal, and thus users are forced to implement some
>   sort of deferred freeing. Let's make users' lives easier with the new
>   variant.
> 
>   Note that 1-argument kfree_rcu_nolock() is not supported, since there
>   is not much we can do when trylock & memory allocation fails.
>   (You can't call synchronize_rcu() in NMI context!)
> 
>   When spinning on a lock is not allowed, try to acquire the spinlock
>   with a trylock. When that succeeds, do one of the following:
> 
>   1) Use the rcu sheaf to free the object. Note that call_rcu() cannot
>      be called in NMI context! When freeing the object would make the
>      rcu sheaf full, the sheaf cannot be flushed and we have to fall back.
>   
>   2) Use struct rcu_ptr field to link objects. Consuming a bnode
>      (of struct kvfree_rcu_bulk_data) and queueing work to maintain
>      a number of cached bnodes is avoided in NMI context.
> 
>   Note that scheduling delayed monitor work to drain objects after
>   KFREE_DRAIN_JIFFIES is done using a lazy irq_work to avoid raising
>   self-IPIs. That means scheduling delayed monitor work can be delayed
>   up to the length of a time slice.
> 
>   In rare cases where trylock fails, a non-lazy irq_work is used to
>   defer calling kvfree_rcu_call().
> 
>   When certain debug features (kmemleak, debugobjects) are enabled,
>   freeing in NMI context is always deferred because they use spinlocks.
> 
>   Patch 6 implements kfree_rcu_nolock() support, patch 7 adds sheaves
>   support for the new API.
> 
> Harry Yoo (7):
>   mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
>   mm: use rcu_ptr instead of rcu_head
>   mm/slab: allow freeing kmalloc_nolock()'d objects using kfree[_rcu]()
>   mm/slab: free a bit in enum objexts_flags
>   mm/slab: move kfree_rcu_cpu[_work] definitions
>   mm/slab: introduce kfree_rcu_nolock()
>   mm/slab: make kfree_rcu_nolock() work with sheaves
> 
>  include/linux/list_lru.h   |   2 +-
>  include/linux/memcontrol.h |   3 +-
>  include/linux/rcupdate.h   |  68 +++++---
>  include/linux/shrinker.h   |   2 +-
>  include/linux/types.h      |   9 ++
>  mm/kmemleak.c              |  11 +-
>  mm/slab.h                  |   2 +-
>  mm/slab_common.c           | 309 +++++++++++++++++++++++++------------
>  mm/slub.c                  |  47 ++++--
>  mm/vmalloc.c               |   4 +-
>  10 files changed, 310 insertions(+), 147 deletions(-)
> 
> -- 
> 2.43.0
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/7] k[v]free_rcu() improvements
  2026-02-07  0:16 ` [RFC PATCH 0/7] k[v]free_rcu() improvements Paul E. McKenney
@ 2026-02-07  1:21   ` Harry Yoo
  2026-02-07  1:33     ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Harry Yoo @ 2026-02-07  1:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng,
	Muchun Song, rcu, linux-mm, bpf

On Fri, Feb 06, 2026 at 04:16:46PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 06, 2026 at 06:34:03PM +0900, Harry Yoo wrote:
> > These are a few improvements for k[v]free_rcu() API, which were suggested
> > by Alexei Starovoitov.
> > 
> > [ To kmemleak folks: I'm going to teach delete_object_full() and
> >   paint_ptr() to ignore cases when the object does not exist.
> >   Could you please let me know if the way it's done in patch 3
> >   looks good? Only part 2 is relevant to you. ]
> 
> On what commit should I apply this series?

It's based on Vlastimil's slab/for-next:

bc33906024eb Merge branch 'slab/for-7.0/sheaves' into slab/for-next
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next

> I get conflicts on top of -rcu
> (no surprise there) and build errors on top of next-20260205.

Interesting, I don't get build errors when I apply it on top of next-20260205.

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/7] k[v]free_rcu() improvements
  2026-02-07  1:21   ` Harry Yoo
@ 2026-02-07  1:33     ` Paul E. McKenney
  2026-02-09  9:02       ` Harry Yoo
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2026-02-07  1:33 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng,
	Muchun Song, rcu, linux-mm, bpf

On Sat, Feb 07, 2026 at 10:21:26AM +0900, Harry Yoo wrote:
> On Fri, Feb 06, 2026 at 04:16:46PM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 06, 2026 at 06:34:03PM +0900, Harry Yoo wrote:
> > > These are a few improvements for k[v]free_rcu() API, which were suggested
> > > by Alexei Starovoitov.
> > > 
> > > [ To kmemleak folks: I'm going to teach delete_object_full() and
> > >   paint_ptr() to ignore cases when the object does not exist.
> > >   Could you please let me know if the way it's done in patch 3
> > >   looks good? Only part 2 is relevant to you. ]
> > 
> > On what commit should I apply this series?
> 
> It's based on Vlastimil's slab/for-next:
> 
> bc33906024eb Merge branch 'slab/for-7.0/sheaves' into slab/for-next
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
> 
> > I get conflicts on top of -rcu
> > (no surprise there) and build errors on top of next-20260205.
> 
> Interesting, I don't get build errors when applied it on top of next-20260205.

Here you go!

Here is my repeat-by for these build errors, perhaps a .config issue
or difference:

tools/testing/selftests/rcutorture/bin/torture.sh --do-none --do-kvfree --do-kasan

							Thanx, Paul

------------------------------------------------------------------------

mm/slab_common.c:1475:21: error: implicit declaration of function ‘IRQ_WORK_INIT’; did you mean ‘IRQ_WORK_VECTOR’? [-Werror=implicit-function-declaration]
 1475 |         .irq_work = IRQ_WORK_INIT(defer_kfree_rcu),
      |                     ^~~~~~~~~~~~~
      |                     IRQ_WORK_VECTOR
mm/slab_common.c:1475:21: error: initialization of ‘struct llist_node *’ from ‘int’ makes pointer from integer without a cast [-Werror=int-conversion]
mm/slab_common.c:1475:21: note: (near initialization for ‘krc.irq_work.node.llist.next’)
mm/slab_common.c:1475:21: error: initializer element is not constant
mm/slab_common.c:1475:21: note: (near initialization for ‘krc.irq_work.node.llist.next’)
  CC      drivers/tty/pty.o
mm/slab_common.c:1477:17: error: implicit declaration of function ‘IRQ_WORK_INIT_LAZY’ [-Werror=implicit-function-declaration]
 1477 |                 IRQ_WORK_INIT_LAZY(sched_monitor_irq_work),
      |                 ^~~~~~~~~~~~~~~~~~
mm/slab_common.c:1477:17: error: initialization of ‘struct llist_node *’ from ‘int’ makes pointer from integer without a cast [-Werror=int-conversion]
mm/slab_common.c:1477:17: note: (near initialization for ‘krc.sched_monitor_irq_work.node.llist.next’)
mm/slab_common.c:1477:17: error: initializer element is not constant
mm/slab_common.c:1477:17: note: (near initialization for ‘krc.sched_monitor_irq_work.node.llist.next’)
  CC      drivers/tty/tty_audit.o
  CC      net/ethtool/eee.o
mm/slab_common.c: In function ‘kvfree_call_rcu_ptr’:
mm/slab_common.c:2097:25: error: implicit declaration of function ‘irq_work_queue’; did you mean ‘drain_workqueue’? [-Werror=implicit-function-declaration]
 2097 |                         irq_work_queue(&krcp->sched_monitor_irq_work);
      |                         ^~~~~~~~~~~~~~
      |                         drain_workqueue


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/7] k[v]free_rcu() improvements
  2026-02-07  1:33     ` Paul E. McKenney
@ 2026-02-09  9:02       ` Harry Yoo
  2026-02-09 16:40         ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Harry Yoo @ 2026-02-09  9:02 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng,
	Muchun Song, rcu, linux-mm, bpf

On Fri, Feb 06, 2026 at 05:33:52PM -0800, Paul E. McKenney wrote:
> On Sat, Feb 07, 2026 at 10:21:26AM +0900, Harry Yoo wrote:
> > On Fri, Feb 06, 2026 at 04:16:46PM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 06, 2026 at 06:34:03PM +0900, Harry Yoo wrote:
> > > > These are a few improvements for k[v]free_rcu() API, which were suggested
> > > > by Alexei Starovoitov.
> > > > 
> > > > [ To kmemleak folks: I'm going to teach delete_object_full() and
> > > >   paint_ptr() to ignore cases when the object does not exist.
> > > >   Could you please let me know if the way it's done in patch 3
> > > >   looks good? Only part 2 is relevant to you. ]
> > > 
> > > On what commit should I apply this series?
> > 
> > It's based on Vlastimil's slab/for-next:
> > 
> > bc33906024eb Merge branch 'slab/for-7.0/sheaves' into slab/for-next
> > https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
> > 
> > > I get conflicts on top of -rcu
> > > (no surprise there) and build errors on top of next-20260205.
> > 
> > Interesting, I don't get build errors when applied it on top of next-20260205.
> 
> Here you go!
> 
> Here is my repeat-by for these build errors, perhaps a .config issue
> or difference:
> 
> tools/testing/selftests/rcutorture/bin/torture.sh --do-none --do-kvfree --do-kasan

Haha, thanks! The kernel test robot reported the same issue over the
weekend. It seems I forgot to include <linux/irq_work.h>, and it happens
to be pulled in indirectly in my environment.

Adding #include <linux/irq_work.h> to mm/slab_common.c fixes this.
I will adjust it the next time I post the series, thanks!

-- 
Cheers,
Harry / Hyeonggon

> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> mm/slab_common.c:1475:21: error: implicit declaration of function ‘IRQ_WORK_INIT’; did you mean ‘IRQ_WORK_VECTOR’? [-Werror=implicit-function-declaration]
>  1475 |         .irq_work = IRQ_WORK_INIT(defer_kfree_rcu),
>       |                     ^~~~~~~~~~~~~
>       |                     IRQ_WORK_VECTOR
> mm/slab_common.c:1475:21: error: initialization of ‘struct llist_node *’ from ‘int’ makes pointer from integer without a cast [-Werror=int-conversion]
> mm/slab_common.c:1475:21: note: (near initialization for ‘krc.irq_work.node.llist.next’)
> mm/slab_common.c:1475:21: error: initializer element is not constant
> mm/slab_common.c:1475:21: note: (near initialization for ‘krc.irq_work.node.llist.next’)
>   CC      drivers/tty/pty.o
> mm/slab_common.c:1477:17: error: implicit declaration of function ‘IRQ_WORK_INIT_LAZY’ [-Werror=implicit-function-declaration]
>  1477 |                 IRQ_WORK_INIT_LAZY(sched_monitor_irq_work),
>       |                 ^~~~~~~~~~~~~~~~~~
> mm/slab_common.c:1477:17: error: initialization of ‘struct llist_node *’ from ‘int’ makes pointer from integer without a cast [-Werror=int-conversion]
> mm/slab_common.c:1477:17: note: (near initialization for ‘krc.sched_monitor_irq_work.node.llist.next’)
> mm/slab_common.c:1477:17: error: initializer element is not constant
> mm/slab_common.c:1477:17: note: (near initialization for ‘krc.sched_monitor_irq_work.node.llist.next’)
>   CC      drivers/tty/tty_audit.o
>   CC      net/ethtool/eee.o
> mm/slab_common.c: In function ‘kvfree_call_rcu_ptr’:
> mm/slab_common.c:2097:25: error: implicit declaration of function ‘irq_work_queue’; did you mean ‘drain_workqueue’? [-Werror=implicit-function-declaration]
>  2097 |                         irq_work_queue(&krcp->sched_monitor_irq_work);
>       |                         ^~~~~~~~~~~~~~
>       |                         drain_workqueue
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 4/7] mm/slab: free a bit in enum objexts_flags
  2026-02-06 20:09   ` Alexei Starovoitov
@ 2026-02-09  9:38     ` Vlastimil Babka
  2026-02-09 18:44       ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Vlastimil Babka @ 2026-02-09  9:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Harry Yoo
  Cc: Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

On 2/6/26 21:09, Alexei Starovoitov wrote:
> On Fri, Feb 6, 2026 at 1:35 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>>
>> Since kfree() now supports freeing objects allocated with
>> kmalloc_nolock(), free one bit in enum object_flags.
>>
>> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> 
> For patches 3 and 4:
> 
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> 
> I think patches 3 and 4 are ready.
> Would be great to land them for this merge window
> (if Vlastimil agrees).

We should have an ack from Catalin for kmemleak. It would also be better to
take them out of the RFC and send them as two non-RFC patches first, with the
cc list reduced accordingly, etc.

Then I can put them into -next and try sending a second merge-window PR next
week. Can you also point to any bug reports that these would fix (something
you had to work around or that delayed a merge)? That would help the argument
for not waiting a cycle.

> Patch 3 is tiny, but the impact is huge and
> patch 4 is a very nice cleanup.
> If we land it now we can start using kfree_rcu() on bpf side
> in the next release cycle. That will help us a lot.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 2/7] mm: use rcu_ptr instead of rcu_head
  2026-02-06  9:34 ` [RFC PATCH 2/7] mm: use rcu_ptr instead of rcu_head Harry Yoo
@ 2026-02-09 10:41   ` Uladzislau Rezki
  2026-02-09 11:22     ` Harry Yoo
  0 siblings, 1 reply; 32+ messages in thread
From: Uladzislau Rezki @ 2026-02-09 10:41 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas, Paul E . McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng,
	Muchun Song, rcu, linux-mm, bpf

On Fri, Feb 06, 2026 at 06:34:05PM +0900, Harry Yoo wrote:
> When slab objects are freed with kfree_rcu() and not call_rcu(),
> using struct rcu_head (16 bytes on 64-bit) is unnecessary and
> struct rcu_ptr (8 bytes on 64-bit) is enough. Save one pointer
> per slab object by using struct rcu_ptr.
> 
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---
>  include/linux/list_lru.h | 2 +-
>  include/linux/shrinker.h | 2 +-
>  mm/vmalloc.c             | 4 ++--
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index fe739d35a864..c79bccb7dafa 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -37,7 +37,7 @@ struct list_lru_one {
>  };
>  
>  struct list_lru_memcg {
> -	struct rcu_head		rcu;
> +	struct rcu_ptr		rcu;
>  	/* array of per cgroup per node lists, indexed by node id */
>  	struct list_lru_one	node[];
>  };
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 1a00be90d93a..bad20de2803a 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -19,7 +19,7 @@ struct shrinker_info_unit {
>  };
>  
>  struct shrinker_info {
> -	struct rcu_head rcu;
> +	struct rcu_ptr rcu;
>  	int map_nr_max;
>  	struct shrinker_info_unit *unit[];
>  };
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 41dd01e8430c..89c781dcab58 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2596,7 +2596,7 @@ struct vmap_block {
>  	DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
>  	unsigned long dirty_min, dirty_max; /*< dirty range */
>  	struct list_head free_list;
> -	struct rcu_head rcu_head;
> +	struct rcu_ptr rcu;
>  	struct list_head purge;
>  	unsigned int cpu;
>  };
>
Why is this change needed?

If you want to save 8 bytes in the vmap_block structure, then I
do not see a big gain here. We do not have that many vmap_block
objects.

Am I missing something here? :)

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 2/7] mm: use rcu_ptr instead of rcu_head
  2026-02-09 10:41   ` Uladzislau Rezki
@ 2026-02-09 11:22     ` Harry Yoo
  0 siblings, 0 replies; 32+ messages in thread
From: Harry Yoo @ 2026-02-09 11:22 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas, Paul E . McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

On Mon, Feb 09, 2026 at 11:41:17AM +0100, Uladzislau Rezki wrote:
> On Fri, Feb 06, 2026 at 06:34:05PM +0900, Harry Yoo wrote:
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 41dd01e8430c..89c781dcab58 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2596,7 +2596,7 @@ struct vmap_block {
> >  	DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
> >  	unsigned long dirty_min, dirty_max; /*< dirty range */
> >  	struct list_head free_list;
> > -	struct rcu_head rcu_head;
> > +	struct rcu_ptr rcu;
> >  	struct list_head purge;
> >  	unsigned int cpu;
> >  };
> >
> Why this change is needed?
> 
> If you want to save 8 bytes of vmap_block structure,

To be honest, because I didn't want to post a series with a feature
that has no users :)

The feature itself was requested by Alexei, because he doesn't
want to pay an additional 8 bytes per object on the bpf side just
to use kfree_rcu().

But not being familiar with kernel/bpf/,
I just added a few users in mm/ ;)

> then i do not see a big gain here.
> We do not have so many vmap_block objects.

But I agree that replacing existing users just because we can is not
an effective use of our time. I'll drop patch 2 in the next version
as it doesn't (or can't) demonstrate its benefit.

Are there any potential users that might benefit from this (other than bpf)?
I don't know, but it would be interesting to explore.

> --
> Uladzislau Rezki

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/7] k[v]free_rcu() improvements
  2026-02-09  9:02       ` Harry Yoo
@ 2026-02-09 16:40         ` Paul E. McKenney
  0 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2026-02-09 16:40 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng,
	Muchun Song, rcu, linux-mm, bpf

On Mon, Feb 09, 2026 at 06:02:54PM +0900, Harry Yoo wrote:
> On Fri, Feb 06, 2026 at 05:33:52PM -0800, Paul E. McKenney wrote:
> > On Sat, Feb 07, 2026 at 10:21:26AM +0900, Harry Yoo wrote:
> > > On Fri, Feb 06, 2026 at 04:16:46PM -0800, Paul E. McKenney wrote:
> > > > On Fri, Feb 06, 2026 at 06:34:03PM +0900, Harry Yoo wrote:
> > > > > These are a few improvements for k[v]free_rcu() API, which were suggested
> > > > > by Alexei Starovoitov.
> > > > > 
> > > > > [ To kmemleak folks: I'm going to teach delete_object_full() and
> > > > >   paint_ptr() to ignore cases when the object does not exist.
> > > > >   Could you please let me know if the way it's done in patch 3
> > > > >   looks good? Only part 2 is relevant to you. ]
> > > > 
> > > > On what commit should I apply this series?
> > > 
> > > It's based on Vlastimil's slab/for-next:
> > > 
> > > bc33906024eb Merge branch 'slab/for-7.0/sheaves' into slab/for-next
> > > https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
> > > 
> > > > I get conflicts on top of -rcu
> > > > (no surprise there) and build errors on top of next-20260205.
> > > 
> > > Interesting, I don't get build errors when applied it on top of next-20260205.
> > 
> > Here you go!
> > 
> > Here is my repeat-by for these build errors, perhaps a .config issue
> > or difference:
> > 
> > tools/testing/selftests/rcutorture/bin/torture.sh --do-none --do-kvfree --do-kasan
> 
> Haha, thanks! The kernel test robot reported the same issue on the
> weekend. It seems I forgot to include <linux/irq_work.h> and it's
> accidentally included on my environment.
> 
> Adding #include <linux/irq_work.h> in mm/slab_common.c fixes this.
> Will adjust next time I post it, thanks!

Very good, and I will give the update another spin.

							Thanx, Paul

> -- 
> Cheers,
> Harry / Hyeonggon
> 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > mm/slab_common.c:1475:21: error: implicit declaration of function ‘IRQ_WORK_INIT’; did you mean ‘IRQ_WORK_VECTOR’? [-Werror=implicit-function-declaration]
> >  1475 |         .irq_work = IRQ_WORK_INIT(defer_kfree_rcu),
> >       |                     ^~~~~~~~~~~~~
> >       |                     IRQ_WORK_VECTOR
> > mm/slab_common.c:1475:21: error: initialization of ‘struct llist_node *’ from ‘int’ makes pointer from integer without a cast [-Werror=int-conversion]
> > mm/slab_common.c:1475:21: note: (near initialization for ‘krc.irq_work.node.llist.next’)
> > mm/slab_common.c:1475:21: error: initializer element is not constant
> > mm/slab_common.c:1475:21: note: (near initialization for ‘krc.irq_work.node.llist.next’)
> >   CC      drivers/tty/pty.o
> > mm/slab_common.c:1477:17: error: implicit declaration of function ‘IRQ_WORK_INIT_LAZY’ [-Werror=implicit-function-declaration]
> >  1477 |                 IRQ_WORK_INIT_LAZY(sched_monitor_irq_work),
> >       |                 ^~~~~~~~~~~~~~~~~~
> > mm/slab_common.c:1477:17: error: initialization of ‘struct llist_node *’ from ‘int’ makes pointer from integer without a cast [-Werror=int-conversion]
> > mm/slab_common.c:1477:17: note: (near initialization for ‘krc.sched_monitor_irq_work.node.llist.next’)
> > mm/slab_common.c:1477:17: error: initializer element is not constant
> > mm/slab_common.c:1477:17: note: (near initialization for ‘krc.sched_monitor_irq_work.node.llist.next’)
> >   CC      drivers/tty/tty_audit.o
> >   CC      net/ethtool/eee.o
> > mm/slab_common.c: In function ‘kvfree_call_rcu_ptr’:
> > mm/slab_common.c:2097:25: error: implicit declaration of function ‘irq_work_queue’; did you mean ‘drain_workqueue’? [-Werror=implicit-function-declaration]
> >  2097 |                         irq_work_queue(&krcp->sched_monitor_irq_work);
> >       |                         ^~~~~~~~~~~~~~
> >       |                         drain_workqueue
> > 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 4/7] mm/slab: free a bit in enum objexts_flags
  2026-02-09  9:38     ` Vlastimil Babka
@ 2026-02-09 18:44       ` Alexei Starovoitov
  0 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2026-02-09 18:44 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Harry Yoo, Andrew Morton, Christoph Lameter, David Rientjes,
	Roman Gushchin, Johannes Weiner, Shakeel Butt, Michal Hocko,
	Hao Li, Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko,
	Amery Hung, Catalin Marinas, Paul E . McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng,
	Muchun Song, rcu, linux-mm, bpf

On Mon, Feb 9, 2026 at 1:38 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/6/26 21:09, Alexei Starovoitov wrote:
> > On Fri, Feb 6, 2026 at 1:35 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >>
> >> Since kfree() now supports freeing objects allocated with
> >> kmalloc_nolock(), free one bit in enum object_flags.
> >>
> >> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> >
> > For patches 3 and 4:
> >
> > Acked-by: Alexei Starovoitov <ast@kernel.org>
> >
> > I think patches 3 and 4 are ready.
> > Would be great to land them for this merge window
> > (if Vlastimil agrees).
>
> We should have an ack from Catalin for kmemleak. Also better take them out
> of the RFC and send as 2 non-rfc patches first, with cc list reduced
> accordingly etc.
>
> Then I can put them to -next and try sending second merge window PR next
> week. Can you also point to any bug reports that would be fixed? (that you
> had to work around or delay merging or something) that would help the
> argument to not wait a cycle.

Here is one example:
https://lore.kernel.org/all/20251114201329.3275875-1-ameryhung@gmail.com/

"
RFC v1 tried to switch to kmalloc_nolock() unconditionally. However,
as there is substantial performance loss in socket local storage due to
1) defer_free() in kfree_nolock() and 2) no kfree_rcu() batching,
replacing kzalloc() is postponed until necessary improvements in mm
land.
"

This patch addresses both 1 and 2. The freeing is done in a good context.
The only reason we use kfree_nolock() and suffer from defer_free and the
lack of batching is the kmalloc_nolock() -> kfree_nolock() matching
requirement. So this small patch is a big deal.
We will be able to use kmalloc_nolock() -> kfree_rcu().


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
  2026-02-06  9:34 ` [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr Harry Yoo
@ 2026-02-11 10:16   ` Uladzislau Rezki
  2026-02-11 10:44     ` Harry Yoo
  2026-02-12 11:52     ` Vlastimil Babka
  0 siblings, 2 replies; 32+ messages in thread
From: Uladzislau Rezki @ 2026-02-11 10:16 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas, Paul E . McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng,
	Muchun Song, rcu, linux-mm, bpf

On Fri, Feb 06, 2026 at 06:34:04PM +0900, Harry Yoo wrote:
> k[v]free_rcu() repurposes two fields of struct rcu_head: 'func' to store
> the start address of the object, and 'next' to link objects.
> 
> However, using 'func' to store the start address is unnecessary:
> 
>   1. slab can get the start address from the address of struct rcu_head
>      field via nearest_obj(), and
> 
>   2. vmalloc and large kmalloc can get the start address by aligning
>      down the address of the struct rcu_head field to the page boundary.
> 
> Therefore, allow an 8-byte (on 64-bit) field (of a new type called
> struct rcu_ptr) to be used with k[v]free_rcu() with two arguments.
> 
> Some users use both call_rcu() and k[v]free_rcu() to process callbacks
> (e.g., maple tree), so it makes sense to have struct rcu_head field
> to handle both cases. However, many users that simply free objects via
> kvfree_rcu() can save one pointer by using struct rcu_ptr instead of
> struct rcu_head.
> 
> Note that struct rcu_ptr is a single pointer only when
> CONFIG_KVFREE_RCU_BATCHED=y. To keep kvfree_rcu() implementation minimal
> when CONFIG_KVFREE_RCU_BATCHED is disabled, struct rcu_ptr is the size
> as struct rcu_head, and the implementation of kvfree_rcu() remains
> unchanged in that configuration.
> 
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---
>  include/linux/rcupdate.h | 61 +++++++++++++++++++++++++++-------------
>  include/linux/types.h    |  9 ++++++
>  mm/slab_common.c         | 40 +++++++++++++++-----------
>  3 files changed, 75 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index c5b30054cd01..8924edf7e8c1 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1059,22 +1059,30 @@ static inline void rcu_read_unlock_migrate(void)
>  /**
>   * kfree_rcu() - kfree an object after a grace period.
>   * @ptr: pointer to kfree for double-argument invocations.
> - * @rhf: the name of the struct rcu_head within the type of @ptr.
> + * @rf: the name of the struct rcu_head or struct rcu_ptr within the type of @ptr.
>   *
>   * Many rcu callbacks functions just call kfree() on the base structure.
>   * These functions are trivial, but their size adds up, and furthermore
>   * when they are used in a kernel module, that module must invoke the
>   * high-latency rcu_barrier() function at module-unload time.
> + * The kfree_rcu() function handles this issue by batching.
>   *
> - * The kfree_rcu() function handles this issue. In order to have a universal
> - * callback function handling different offsets of rcu_head, the callback needs
> - * to determine the starting address of the freed object, which can be a large
> - * kmalloc or vmalloc allocation. To allow simply aligning the pointer down to
> - * page boundary for those, only offsets up to 4095 bytes can be accommodated.
> - * If the offset is larger than 4095 bytes, a compile-time error will
> - * be generated in kvfree_rcu_arg_2(). If this error is triggered, you can
> - * either fall back to use of call_rcu() or rearrange the structure to
> - * position the rcu_head structure into the first 4096 bytes.
> + * Typically, struct rcu_head is used to process RCU callbacks, but it requires
> + * two pointers. However, since kfree_rcu() uses kfree() as the callback
> + * function, it can process callbacks with struct rcu_ptr, which is only
> + * one pointer in size (unless !CONFIG_KVFREE_RCU_BATCHED).
> + *
> + * The type of @rf can be either struct rcu_head or struct rcu_ptr, and when
> + * possible, it is recommended to use struct rcu_ptr due to its smaller size.
> + *
> + * In order to have a universal callback function handling different offsets
> + * of @rf, the callback needs to determine the starting address of the freed
> + * object, which can be a large kmalloc or vmalloc allocation. To allow simply
> + * aligning the pointer down to page boundary for those, only offsets up to
> + * 4095 bytes can be accommodated. If the offset is larger than 4095 bytes,
> + * a compile-time error will be generated in kvfree_rcu_arg_2().
> + * If this error is triggered, you can either fall back to use of call_rcu()
> + * or rearrange the structure to position @rf into the first 4096 bytes.
>   *
>   * The object to be freed can be allocated either by kmalloc() or
>   * kmem_cache_alloc().
> @@ -1084,8 +1092,8 @@ static inline void rcu_read_unlock_migrate(void)
>   * The BUILD_BUG_ON check must not involve any function calls, hence the
>   * checks are done in macros here.
>   */
> -#define kfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf)
> -#define kvfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf)
> +#define kfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
> +#define kvfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
>  
>  /**
>   * kfree_rcu_mightsleep() - kfree an object after a grace period.
> @@ -1107,22 +1115,37 @@ static inline void rcu_read_unlock_migrate(void)
>  #define kfree_rcu_mightsleep(ptr) kvfree_rcu_arg_1(ptr)
>  #define kvfree_rcu_mightsleep(ptr) kvfree_rcu_arg_1(ptr)
>  
> -/*
> - * In mm/slab_common.c, no suitable header to include here.
> - */
> -void kvfree_call_rcu(struct rcu_head *head, void *ptr);
> +
> +#ifdef CONFIG_KVFREE_RCU_BATCHED
> +void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr);
> +#define kvfree_call_rcu(head, ptr) \
> +	_Generic((head), \
> +		struct rcu_head *: kvfree_call_rcu_ptr,		\
> +		struct rcu_ptr *: kvfree_call_rcu_ptr,		\
> +		void *: kvfree_call_rcu_ptr			\
> +	)((struct rcu_ptr *)(head), (ptr))
> +#else
> +void kvfree_call_rcu_head(struct rcu_head *head, void *ptr);
> +static_assert(sizeof(struct rcu_head) == sizeof(struct rcu_ptr));
> +#define kvfree_call_rcu(head, ptr) \
> +	_Generic((head), \
> +		struct rcu_head *: kvfree_call_rcu_head,	\
> +		struct rcu_ptr *: kvfree_call_rcu_head,		\
> +		void *: kvfree_call_rcu_head			\
> +	)((struct rcu_head *)(head), (ptr))
> +#endif
>  
>  /*
>   * The BUILD_BUG_ON() makes sure the rcu_head offset can be handled. See the
>   * comment of kfree_rcu() for details.
>   */
> -#define kvfree_rcu_arg_2(ptr, rhf)					\
> +#define kvfree_rcu_arg_2(ptr, rf)					\
>  do {									\
>  	typeof (ptr) ___p = (ptr);					\
>  									\
>  	if (___p) {							\
> -		BUILD_BUG_ON(offsetof(typeof(*(ptr)), rhf) >= 4096);	\
> -		kvfree_call_rcu(&((___p)->rhf), (void *) (___p));	\
> +		BUILD_BUG_ON(offsetof(typeof(*(ptr)), rf) >= 4096);	\
> +		kvfree_call_rcu(&((___p)->rf), (void *) (___p));	\
>  	}								\
>  } while (0)
>  
> diff --git a/include/linux/types.h b/include/linux/types.h
> index d4437e9c452c..e5596ebab29c 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -245,6 +245,15 @@ struct callback_head {
>  } __attribute__((aligned(sizeof(void *))));
>  #define rcu_head callback_head
>  
> +
> +struct rcu_ptr {
> +#ifdef CONFIG_KVFREE_RCU_BATCHED
> +	struct rcu_ptr *next;
> +#else
> +	struct callback_head;
> +#endif
> +} __attribute__((aligned(sizeof(void *))));
> +
>  typedef void (*rcu_callback_t)(struct rcu_head *head);
>  typedef void (*call_rcu_func_t)(struct rcu_head *head, rcu_callback_t func);
>  
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index d5a70a831a2a..3ec99a5463d3 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1265,7 +1265,7 @@ EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);
>  
>  #ifndef CONFIG_KVFREE_RCU_BATCHED
>  
> -void kvfree_call_rcu(struct rcu_head *head, void *ptr)
> +void kvfree_call_rcu_head(struct rcu_head *head, void *ptr)
>  {
>  	if (head) {
>  		kasan_record_aux_stack(ptr);
> @@ -1278,7 +1278,7 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>  	synchronize_rcu();
>  	kvfree(ptr);
>  }
> -EXPORT_SYMBOL_GPL(kvfree_call_rcu);
> +EXPORT_SYMBOL_GPL(kvfree_call_rcu_head);
>  
>  void __init kvfree_rcu_init(void)
>  {
> @@ -1346,7 +1346,7 @@ struct kvfree_rcu_bulk_data {
>  
>  struct kfree_rcu_cpu_work {
>  	struct rcu_work rcu_work;
> -	struct rcu_head *head_free;
> +	struct rcu_ptr *head_free;
>  	struct rcu_gp_oldstate head_free_gp_snap;
>  	struct list_head bulk_head_free[FREE_N_CHANNELS];
>  	struct kfree_rcu_cpu *krcp;
> @@ -1381,8 +1381,7 @@ struct kfree_rcu_cpu_work {
>   */
>  struct kfree_rcu_cpu {
>  	// Objects queued on a linked list
> -	// through their rcu_head structures.
> -	struct rcu_head *head;
> +	struct rcu_ptr *head;
>  	unsigned long head_gp_snap;
>  	atomic_t head_count;
>  
> @@ -1523,18 +1522,28 @@ kvfree_rcu_bulk(struct kfree_rcu_cpu *krcp,
>  }
>  
>  static void
> -kvfree_rcu_list(struct rcu_head *head)
> +kvfree_rcu_list(struct rcu_ptr *head)
>  {
> -	struct rcu_head *next;
> +	struct rcu_ptr *next;
>  
>  	for (; head; head = next) {
> -		void *ptr = (void *) head->func;
> -		unsigned long offset = (void *) head - ptr;
> +		void *ptr;
> +		unsigned long offset;
> +		struct slab *slab;
> +
> +		slab = virt_to_slab(head);
> +		if (is_vmalloc_addr(head) || !slab)
> +			ptr = (void *)PAGE_ALIGN_DOWN((unsigned long)head);
> +		else
> +			ptr = nearest_obj(slab->slab_cache, slab, head);
> +		offset = (void *)head - ptr;
>  
>  		next = head->next;
>  		debug_rcu_head_unqueue((struct rcu_head *)ptr);
>  		rcu_lock_acquire(&rcu_callback_map);
> -		trace_rcu_invoke_kvfree_callback("slab", head, offset);
> +		trace_rcu_invoke_kvfree_callback("slab",
> +						(struct rcu_head *)head,
> +						offset);
>  
>  		kvfree(ptr);
>  
> @@ -1552,7 +1561,7 @@ static void kfree_rcu_work(struct work_struct *work)
>  	unsigned long flags;
>  	struct kvfree_rcu_bulk_data *bnode, *n;
>  	struct list_head bulk_head[FREE_N_CHANNELS];
> -	struct rcu_head *head;
> +	struct rcu_ptr *head;
>  	struct kfree_rcu_cpu *krcp;
>  	struct kfree_rcu_cpu_work *krwp;
>  	struct rcu_gp_oldstate head_gp_snap;
> @@ -1675,7 +1684,7 @@ kvfree_rcu_drain_ready(struct kfree_rcu_cpu *krcp)
>  {
>  	struct list_head bulk_ready[FREE_N_CHANNELS];
>  	struct kvfree_rcu_bulk_data *bnode, *n;
> -	struct rcu_head *head_ready = NULL;
> +	struct rcu_ptr *head_ready = NULL;
>  	unsigned long flags;
>  	int i;
>  
> @@ -1938,7 +1947,7 @@ void __init kfree_rcu_scheduler_running(void)
>   * be free'd in workqueue context. This allows us to: batch requests together to
>   * reduce the number of grace periods during heavy kfree_rcu()/kvfree_rcu() load.
>   */
> -void kvfree_call_rcu(struct rcu_head *head, void *ptr)
> +void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
>  {
>  	unsigned long flags;
>  	struct kfree_rcu_cpu *krcp;
> @@ -1960,7 +1969,7 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>  	// Queue the object but don't yet schedule the batch.
>  	if (debug_rcu_head_queue(ptr)) {
>  		// Probable double kfree_rcu(), just leak.
> -		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_ptr %p\n",
>  			  __func__, head);
>  
>  		// Mark as success and leave.
> @@ -1976,7 +1985,6 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>  			// Inline if kvfree_rcu(one_arg) call.
>  			goto unlock_return;
>  
> -		head->func = ptr;
>  		head->next = krcp->head;
>  		WRITE_ONCE(krcp->head, head);
>  		atomic_inc(&krcp->head_count);
> @@ -2012,7 +2020,7 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>  		kvfree(ptr);
>  	}
>  }
> -EXPORT_SYMBOL_GPL(kvfree_call_rcu);
> +EXPORT_SYMBOL_GPL(kvfree_call_rcu_ptr);
>  
>  static inline void __kvfree_rcu_barrier(void)
>  {
> -- 
> 2.43.0
> 
If this is supposed to be invoked from NMI, wouldn't it be better to just
detect such a context in kvfree_call_rcu()? There are a lot of "allow_spin"
checks, which makes it easy to get lost.

As I see it, you maintain an llist and the idea is simply to re-enter
kvfree_rcu() with allow_spin=true, since by then it will be a "normal"
context.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
  2026-02-11 10:16   ` Uladzislau Rezki
@ 2026-02-11 10:44     ` Harry Yoo
  2026-02-11 10:53       ` Uladzislau Rezki
  2026-02-12 11:52     ` Vlastimil Babka
  1 sibling, 1 reply; 32+ messages in thread
From: Harry Yoo @ 2026-02-11 10:44 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas, Paul E . McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

On Wed, Feb 11, 2026 at 11:16:51AM +0100, Uladzislau Rezki wrote:
> If this is supposed to be invoked from NMI, should we better just detect
> such context in the kvfree_call_rcu()? There are lot of "allow_spin" checks
> which make it easy to get lost.

Detecting NMI context might be okay, but IIUC the re-entrancy requirement
comes not only from NMI but also from attaching bpf programs to
kernel functions, something like:

"Run a BPF program whenever queue_delayed_work() is called,
 ... and the BPF program somehow frees memory via kfree_rcu_nolock()".

Then, when the kernel calls queue_delayed_work() while holding
krcp->lock, it runs the BPF program, which calls kfree_rcu_nolock(),
and at that point it is not allowed to spin on krcp->lock.

It is hard to detect whether spinning is safe in this case.
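
To make the scenario concrete, the problematic chain would look roughly
like this (purely illustrative; the intermediate call sites are my
assumption, not something the series spells out):

/*
 * kvfree_call_rcu()
 *   krc_this_cpu_lock()                  <- krcp->lock is now held
 *   ...
 *   __schedule_delayed_monitor_work()
 *     queue_delayed_work()               <- a BPF program is attached here
 *       <BPF program runs>
 *         kfree_rcu_nolock()             <- re-enters; must not spin on krcp->lock
 */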

> As i see you maintain llist and the idea is simply to re-enter to the
> kvfree_rcu() again with allow-spin=true, since then it will be "normal"
> context.

It tries to acquire the lock and add the object to krcp->head, but if
somebody is already holding the lock, it re-runs kvfree_rcu() from an
irq_work.

> --
> Uladzislau Rezki

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
  2026-02-11 10:44     ` Harry Yoo
@ 2026-02-11 10:53       ` Uladzislau Rezki
  2026-02-11 11:26         ` Harry Yoo
  0 siblings, 1 reply; 32+ messages in thread
From: Uladzislau Rezki @ 2026-02-11 10:53 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Uladzislau Rezki, Andrew Morton, Vlastimil Babka,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Dave Chinner, Qi Zheng, Muchun Song, rcu, linux-mm, bpf

On Wed, Feb 11, 2026 at 07:44:37PM +0900, Harry Yoo wrote:
> On Wed, Feb 11, 2026 at 11:16:51AM +0100, Uladzislau Rezki wrote:
> > If this is supposed to be invoked from NMI, should we better just detect
> > such context in the kvfree_call_rcu()? There are lot of "allow_spin" checks
> > which make it easy to get lost.
> 
> Detecting if it's NMI might be okay, but IIUC re-entrancy requirement
> not only comes from NMI but also from attaching bpf programs to
> kernel functions, something like:
> 
> "Run a BPF program whenever queue_delayed_work() is called,
>  ... and the BPF program somehow frees memory via kfree_rcu_nolock()".
> 
> Then, by the time the kernel calls queue_delayed_work() while holding
> krcp->lock, it run the BPF program and calls kfree_rcu_nolock(),
> it is not allowed to spin on krcp->lock.
> 
> 
> > As i see you maintain llist and the idea is simply to re-enter to the
> > kvfree_rcu() again with allow-spin=true, since then it will be "normal"
> > context.
> 
> It tries to acquire the lock and add it to krcp->head, but if somebody
> is already holding the lock, it re-runs kvfree_rcu() with irq work.
> 
Check no_spin on entry; if it is set, llist_add() the object, queue an
irq_work, and re-enter from there. You might need to set up an interval
to prevent frequent bouncing.
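
A minimal sketch of the idea (untested; every name below is made up for
illustration and is not from your series):

struct kfree_rcu_defer {
	struct llist_head head;
	struct irq_work work;	/* assumed IRQ_WORK_INIT(kfree_rcu_defer_drain) */
};

static DEFINE_PER_CPU(struct kfree_rcu_defer, kfree_rcu_defer);

static void kfree_rcu_defer_drain(struct irq_work *work)
{
	struct kfree_rcu_defer *d = container_of(work, struct kfree_rcu_defer, work);
	struct llist_node *pos, *next;

	/* irq_work runs in hardirq context, so spinning is allowed again. */
	llist_for_each_safe(pos, next, llist_del_all(&d->head))
		kvfree_call_rcu((struct rcu_ptr *)pos,
				object_from_rcu_ptr(pos)); /* hypothetical helper */
}

void kfree_rcu_nolock_sketch(struct rcu_ptr *head, void *ptr)
{
	/* allow_spin_in_this_context() stands in for however no_spin is decided. */
	if (!allow_spin_in_this_context()) {
		struct kfree_rcu_defer *d = this_cpu_ptr(&kfree_rcu_defer);

		llist_add((struct llist_node *)head, &d->head);
		irq_work_queue(&d->work);
		return;
	}

	/* "Normal" context: take the usual batching path. */
	kvfree_call_rcu(head, ptr);
}

The point being: there would be exactly one no_spin check right at the
entry, instead of allow_spin being threaded through the whole path.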

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
  2026-02-11 10:53       ` Uladzislau Rezki
@ 2026-02-11 11:26         ` Harry Yoo
  2026-02-11 13:02           ` Uladzislau Rezki
  2026-02-11 17:05           ` Alexei Starovoitov
  0 siblings, 2 replies; 32+ messages in thread
From: Harry Yoo @ 2026-02-11 11:26 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas, Paul E . McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

On Wed, Feb 11, 2026 at 11:53:46AM +0100, Uladzislau Rezki wrote:
> On Wed, Feb 11, 2026 at 07:44:37PM +0900, Harry Yoo wrote:
> > On Wed, Feb 11, 2026 at 11:16:51AM +0100, Uladzislau Rezki wrote:
> > > If this is supposed to be invoked from NMI, should we better just detect
> > > such context in the kvfree_call_rcu()? There are lot of "allow_spin" checks
> > > which make it easy to get lost.
> > 
> > Detecting if it's NMI might be okay, but IIUC re-entrancy requirement
> > not only comes from NMI but also from attaching bpf programs to
> > kernel functions, something like:
> > 
> > "Run a BPF program whenever queue_delayed_work() is called,
> >  ... and the BPF program somehow frees memory via kfree_rcu_nolock()".
> > 
> > Then, by the time the kernel calls queue_delayed_work() while holding
> > krcp->lock, it run the BPF program and calls kfree_rcu_nolock(),
> > it is not allowed to spin on krcp->lock.
> > 
> > 
> > > As i see you maintain llist and the idea is simply to re-enter to the
> > > kvfree_rcu() again with allow-spin=true, since then it will be "normal"
> > > context.
> > 
> > It tries to acquire the lock and add it to krcp->head, but if somebody
> > is already holding the lock, it re-runs kvfree_rcu() with irq work.
> > 
>
> Check no_spin on entry, if true, llist_add, queue-irq-work. Re-enter.

That is much simpler! Actually, I tried it this way during the initial
implementation, and I like its simplicity.

But I wasn't sure about the performance implications of that approach
and switched to the current implementation.

It'd be nice to hear Alexei's thoughts on this; I think he'd have some
insight into the performance aspect, as we have something similar
in slab (defer_free).

> You might need to set-up interval to prevent frequent bouncing.

You mean an interval to wait after queueing the work, before it gets
processed, right?

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
  2026-02-11 11:26         ` Harry Yoo
@ 2026-02-11 13:02           ` Uladzislau Rezki
  2026-02-11 17:05           ` Alexei Starovoitov
  1 sibling, 0 replies; 32+ messages in thread
From: Uladzislau Rezki @ 2026-02-11 13:02 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Uladzislau Rezki, Andrew Morton, Vlastimil Babka,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Dave Chinner, Qi Zheng, Muchun Song, rcu, linux-mm, bpf

On Wed, Feb 11, 2026 at 08:26:54PM +0900, Harry Yoo wrote:
> On Wed, Feb 11, 2026 at 11:53:46AM +0100, Uladzislau Rezki wrote:
> > On Wed, Feb 11, 2026 at 07:44:37PM +0900, Harry Yoo wrote:
> > > On Wed, Feb 11, 2026 at 11:16:51AM +0100, Uladzislau Rezki wrote:
> > > > If this is supposed to be invoked from NMI, should we better just detect
> > > > such context in the kvfree_call_rcu()? There are lot of "allow_spin" checks
> > > > which make it easy to get lost.
> > > 
> > > Detecting if it's NMI might be okay, but IIUC re-entrancy requirement
> > > not only comes from NMI but also from attaching bpf programs to
> > > kernel functions, something like:
> > > 
> > > "Run a BPF program whenever queue_delayed_work() is called,
> > >  ... and the BPF program somehow frees memory via kfree_rcu_nolock()".
> > > 
> > > Then, by the time the kernel calls queue_delayed_work() while holding
> > > krcp->lock, it run the BPF program and calls kfree_rcu_nolock(),
> > > it is not allowed to spin on krcp->lock.
> > > 
> > > 
> > > > As i see you maintain llist and the idea is simply to re-enter to the
> > > > kvfree_rcu() again with allow-spin=true, since then it will be "normal"
> > > > context.
> > > 
> > > It tries to acquire the lock and add it to krcp->head, but if somebody
> > > is already holding the lock, it re-runs kvfree_rcu() with irq work.
> > > 
> >
> > Check no_spin on entry, if true, llist_add, queue-irq-work. Re-enter.
> 
> That is much simpler! Actually, I tried this way during the initial
> implementation. I like its simplicity.
> 
> But I wasn't sure about performance implications of the approach
> and switched to current implementation.
> 
> It'd be nice to hear Alexei's thoughts on this; I think he'd have some
> insights on performance aspect of this, as we have something similar
> in slab (defer_free).
> 
> > You might need to set-up interval to prevent frequent bouncing.
> 
> You mean an interval to wait after queueing the work, before it gets
> processed, right?
> 
Something like that. Don't ping the scheduler as soon as we add an object
to be freed after a GP.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
  2026-02-11 11:26         ` Harry Yoo
  2026-02-11 13:02           ` Uladzislau Rezki
@ 2026-02-11 17:05           ` Alexei Starovoitov
  1 sibling, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2026-02-11 17:05 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Uladzislau Rezki, Andrew Morton, Vlastimil Babka,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Dave Chinner, Qi Zheng, Muchun Song, rcu, linux-mm, bpf

On Wed, Feb 11, 2026 at 3:27 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Wed, Feb 11, 2026 at 11:53:46AM +0100, Uladzislau Rezki wrote:
> > On Wed, Feb 11, 2026 at 07:44:37PM +0900, Harry Yoo wrote:
> > > On Wed, Feb 11, 2026 at 11:16:51AM +0100, Uladzislau Rezki wrote:
> > > > If this is supposed to be invoked from NMI, should we better just detect
> > > > such context in the kvfree_call_rcu()? There are lot of "allow_spin" checks
> > > > which make it easy to get lost.
> > >
> > > Detecting if it's NMI might be okay, but IIUC re-entrancy requirement
> > > not only comes from NMI but also from attaching bpf programs to
> > > kernel functions, something like:
> > >
> > > "Run a BPF program whenever queue_delayed_work() is called,
> > >  ... and the BPF program somehow frees memory via kfree_rcu_nolock()".
> > >
> > > Then, by the time the kernel calls queue_delayed_work() while holding
> > > krcp->lock, it run the BPF program and calls kfree_rcu_nolock(),
> > > it is not allowed to spin on krcp->lock.
> > >
> > >
> > > > As i see you maintain llist and the idea is simply to re-enter to the
> > > > kvfree_rcu() again with allow-spin=true, since then it will be "normal"
> > > > context.
> > >
> > > It tries to acquire the lock and add it to krcp->head, but if somebody
> > > is already holding the lock, it re-runs kvfree_rcu() with irq work.
> > >
> >
> > Check no_spin on entry, if true, llist_add, queue-irq-work. Re-enter.
>
> That is much simpler! Actually, I tried this way during the initial
> implementation. I like its simplicity.
>
> But I wasn't sure about performance implications of the approach
> and switched to current implementation.
>
> It'd be nice to hear Alexei's thoughts on this; I think he'd have some
> insights on performance aspect of this, as we have something similar
> in slab (defer_free).

It's not a good idea. !allow_spin doesn't mean that we're in NMI
or re-entering. It means that the running context is unknown,
but 99% of the time it's fine to go the normal route and try the lock,
and everything will proceed as usual.
An unconditional "if (!allow_spin) irq_work()" will 100% hurt performance.
In kfree_nolock() (before sheaves) we have a fallback to irq_work
that we thought would be rare in practice. It turned out that even
relatively rare spikes of irq_work hurt overall throughput by 5%
for that workload. So, no, irq_work must be absolutely the last resort.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 6/7] mm/slab: introduce kfree_rcu_nolock()
  2026-02-06  9:34 ` [RFC PATCH 6/7] mm/slab: introduce kfree_rcu_nolock() Harry Yoo
@ 2026-02-12  2:58   ` Harry Yoo
  2026-02-16 21:07   ` Joel Fernandes
  1 sibling, 0 replies; 32+ messages in thread
From: Harry Yoo @ 2026-02-12  2:58 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

On Fri, Feb 06, 2026 at 06:34:09PM +0900, Harry Yoo wrote:
> Currently, kfree_rcu() cannot be called in an NMI context.
> In such a context, even calling call_rcu() is not legal,
> forcing users to implement deferred freeing.
> 
> Make users' lives easier by introducing a kfree_rcu_nolock() variant.
> Unlike kfree_rcu(), kfree_rcu_nolock() only supports a 2-argument
> variant, because, in the worst case where memory allocation fails,
> the caller cannot synchronously wait for the grace period to finish.
> 
> Similar to the kfree_nolock() implementation, try to acquire the
> kfree_rcu_cpu spinlock, and if that fails, insert the object into a
> per-cpu lockless list and delay freeing using an irq_work that calls
> kvfree_call_rcu() later.
> In case kmemleak or debugobjects is enabled, always defer freeing as
> those debug features don't support NMI contexts.
> 
> When trylock succeeds, avoid consuming bnode and run_page_cache_worker()
> altogether. Instead, insert objects into struct kfree_rcu_cpu.head
> without consuming additional memory.
> 
> For now, the sheaves layer is bypassed if spinning is not allowed.
> 
> Scheduling delayed monitor work in an NMI context is tricky; use
> irq_work to schedule, but use lazy irq_work to avoid raising self-IPIs.
> That means scheduling delayed monitor work can be delayed up to the
> length of a time slice.

By the way, this part is still not optimal. Unfortunately we can't use
workqueues in NMI context. We need a trick to avoid irq_work (when possible)
without forgetting to drain the batches later.

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
  2026-02-11 10:16   ` Uladzislau Rezki
  2026-02-11 10:44     ` Harry Yoo
@ 2026-02-12 11:52     ` Vlastimil Babka
  2026-02-13  5:17       ` Harry Yoo
  1 sibling, 1 reply; 32+ messages in thread
From: Vlastimil Babka @ 2026-02-12 11:52 UTC (permalink / raw)
  To: Uladzislau Rezki, Harry Yoo
  Cc: Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Dave Chinner, Qi Zheng, Muchun Song, rcu, linux-mm, bpf

On 2/11/26 11:16, Uladzislau Rezki wrote:
> On Fri, Feb 06, 2026 at 06:34:04PM +0900, Harry Yoo wrote:
>> k[v]free_rcu() repurposes two fields of struct rcu_head: 'func' to store
>> the start address of the object, and 'next' to link objects.
>> 
>> However, using 'func' to store the start address is unnecessary:
>> 
>>   1. slab can get the start address from the address of struct rcu_head
>>      field via nearest_obj(), and
>> 
>>   2. vmalloc and large kmalloc can get the start address by aligning
>>      down the address of the struct rcu_head field to the page boundary.
>> 
>> Therefore, allow an 8-byte (on 64-bit) field (of a new type called
>> struct rcu_ptr) to be used with k[v]free_rcu() with two arguments.
>> 
>> Some users use both call_rcu() and k[v]free_rcu() to process callbacks
>> (e.g., maple tree), so it makes sense to have struct rcu_head field
>> to handle both cases. However, many users that simply free objects via
>> kvfree_rcu() can save one pointer by using struct rcu_ptr instead of
>> struct rcu_head.
>> 
>> Note that struct rcu_ptr is a single pointer only when
>> CONFIG_KVFREE_RCU_BATCHED=y. To keep kvfree_rcu() implementation minimal
>> when CONFIG_KVFREE_RCU_BATCHED is disabled, struct rcu_ptr is the same size
>> as struct rcu_head, and the implementation of kvfree_rcu() remains
>> unchanged in that configuration.

Won't that be too limiting, if we can't shrink structures (e.g. BPF)
unconditionally? Or acceptable because CONFIG_KVFREE_RCU_BATCHED=n is uncommon?
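
For reference, the pointer recovery described in the quoted changelog is
roughly the following (a sketch, not the actual patch; it relies on the
existing BUILD_BUG_ON() keeping the field within the first 4096 bytes of
the object):

static void *obj_from_field(void *field)
{
        struct folio *folio;
        struct slab *slab;

        /* vmalloc: allocations start page-aligned, field offset < PAGE_SIZE */
        if (is_vmalloc_addr(field))
                return (void *)PAGE_ALIGN_DOWN((unsigned long)field);

        folio = virt_to_folio(field);
        /* large kmalloc: same page-alignment argument as vmalloc */
        if (!folio_test_slab(folio))
                return (void *)PAGE_ALIGN_DOWN((unsigned long)field);

        /* slab object: derive the object start from the field address */
        slab = folio_slab(folio);
        return nearest_obj(slab->slab_cache, slab, field);
}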



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/7] k[v]free_rcu() improvements
  2026-02-06  9:34 [RFC PATCH 0/7] k[v]free_rcu() improvements Harry Yoo
                   ` (7 preceding siblings ...)
  2026-02-07  0:16 ` [RFC PATCH 0/7] k[v]free_rcu() improvements Paul E. McKenney
@ 2026-02-12 14:28 ` Vlastimil Babka
  8 siblings, 0 replies; 32+ messages in thread
From: Vlastimil Babka @ 2026-02-12 14:28 UTC (permalink / raw)
  To: Harry Yoo, Andrew Morton
  Cc: Christoph Lameter, David Rientjes, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Hao Li,
	Alexei Starovoitov, Puranjay Mohan, Andrii Nakryiko, Amery Hung,
	Catalin Marinas, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

On 2/6/26 10:34, Harry Yoo wrote:
> These are a few improvements for k[v]free_rcu() API, which were suggested
> by Alexei Starovoitov.
> 
> [ To kmemleak folks: I'm going to teach delete_object_full() and
>   paint_ptr() to ignore cases when the object does not exist.
>   Could you please let me know if the way it's done in patch 3
>   looks good? Only part 2 is relevant to you. ]
> 
> Although I've put some effort into providing a decent quality
> implementation, I'd like you to consider this as a proof-of-concept
> and let's discuss how best we could tackle those problems:
> 
>   1) Allow an 8-byte field to be used as an alternative to
>      struct rcu_head (16-byte) for 2-argument kvfree_rcu()
>   2) kmalloc_nolock() -> kfree[_rcu]() support
>   3) Add kfree_rcu_nolock() for NMI context

Since you went bravely into this area, I'd like to suggest some more :)

- teach CONFIG_KVFREE_RCU_BATCHED to handle kfree_rcu sheaves after they
fill, instead of the standard call_rcu() handling

- make kvfree_call_rcu() -> kfree_rcu_sheaf() compatible with PREEMPT_RT

Thanks,
Vlastimil


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 7/7] mm/slab: make kfree_rcu_nolock() work with sheaves
  2026-02-06  9:34 ` [RFC PATCH 7/7] mm/slab: make kfree_rcu_nolock() work with sheaves Harry Yoo
@ 2026-02-12 19:15   ` Alexei Starovoitov
  2026-02-13 11:55     ` Harry Yoo
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2026-02-12 19:15 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas, Paul E . McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng,
	Muchun Song, rcu, linux-mm, bpf

On Fri, Feb 6, 2026 at 1:35 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
>         } else {
> +               if (unlikely(!allow_spin)) {
> +                       /* call_rcu() does not support NMI context */
> +                       rcu_sheaf->size--;
> +                       local_unlock(&s->cpu_sheaves->lock);
> +                       goto fail;

As a first step it's ok, but we need to make call_rcu() work too.
Shouldn't be too hard. It protects itself with local_irq_save,
so if (irqs_disabled()) defer to irq_work and call_rcu there
or guard reentrance into __call_rcu_common() with per-cpu busy counter.
rcu_head can be reused to form list of objects to be processed in irq work.
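
Roughly (untested sketch, names are made up; the per-cpu busy counter
variant would instead let non-reentrant contexts keep calling call_rcu()
directly):

struct call_rcu_defer {
        struct llist_head head;
        struct irq_work work;   /* init with IRQ_WORK_INIT() at boot, omitted */
};
static DEFINE_PER_CPU(struct call_rcu_defer, call_rcu_defer);

static void call_rcu_defer_fn(struct irq_work *work)
{
        struct call_rcu_defer *d = container_of(work, struct call_rcu_defer, work);
        struct llist_node *pos, *n;

        /* ->next was reused as the llist link, ->func holds the callback */
        llist_for_each_safe(pos, n, llist_del_all(&d->head)) {
                struct rcu_head *rhp = (struct rcu_head *)pos;

                call_rcu(rhp, rhp->func);
        }
}

void call_rcu_nolock(struct rcu_head *rhp, rcu_callback_t func)
{
        struct call_rcu_defer *d;

        if (!in_nmi()) {
                call_rcu(rhp, func);
                return;
        }

        /* NMI: stash the callback, link via ->next, punt to irq_work */
        rhp->func = func;
        d = this_cpu_ptr(&call_rcu_defer);      /* NMI => no migration */
        if (llist_add((struct llist_node *)&rhp->next, &d->head))
                irq_work_queue(&d->work);
}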


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
  2026-02-12 11:52     ` Vlastimil Babka
@ 2026-02-13  5:17       ` Harry Yoo
  0 siblings, 0 replies; 32+ messages in thread
From: Harry Yoo @ 2026-02-13  5:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Uladzislau Rezki, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas, Paul E . McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

On Thu, Feb 12, 2026 at 12:52:46PM +0100, Vlastimil Babka wrote:
> On 2/11/26 11:16, Uladzislau Rezki wrote:
> > On Fri, Feb 06, 2026 at 06:34:04PM +0900, Harry Yoo wrote:
> >> k[v]free_rcu() repurposes two fields of struct rcu_head: 'func' to store
> >> the start address of the object, and 'next' to link objects.
> >> 
> >> However, using 'func' to store the start address is unnecessary:
> >> 
> >>   1. slab can get the start address from the address of struct rcu_head
> >>      field via nearest_obj(), and
> >> 
> >>   2. vmalloc and large kmalloc can get the start address by aligning
> >>      down the address of the struct rcu_head field to the page boundary.
> >> 
> >> Therefore, allow an 8-byte (on 64-bit) field (of a new type called
> >> struct rcu_ptr) to be used with k[v]free_rcu() with two arguments.
> >> 
> >> Some users use both call_rcu() and k[v]free_rcu() to process callbacks
> >> (e.g., maple tree), so it makes sense to have struct rcu_head field
> >> to handle both cases. However, many users that simply free objects via
> >> kvfree_rcu() can save one pointer by using struct rcu_ptr instead of
> >> struct rcu_head.
> >> 
> >> Note that struct rcu_ptr is a single pointer only when
> >> CONFIG_KVFREE_RCU_BATCHED=y. To keep kvfree_rcu() implementation minimal
> >> when CONFIG_KVFREE_RCU_BATCHED is disabled, struct rcu_ptr is the same size
> >> as struct rcu_head, and the implementation of kvfree_rcu() remains
> >> unchanged in that configuration.
> 
> Won't that be too limiting, if we can't shrink structures (e.g. BPF)
> unconditionally? Or acceptable because CONFIG_KVFREE_RCU_BATCHED=n is uncommon?

I thought BPF would be the primary user of this feature, and I believe
anyone that cares about BPF performance / memory usage will use
CONFIG_KVFREE_RCU_BATCHED=y.

But yeah, if we have more users of this feature beyond BPF, it makes sense
to reduce memory usage for CONFIG_KVFREE_RCU_BATCHED=n users at the cost of
some extra complexity.

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 7/7] mm/slab: make kfree_rcu_nolock() work with sheaves
  2026-02-12 19:15   ` Alexei Starovoitov
@ 2026-02-13 11:55     ` Harry Yoo
  0 siblings, 0 replies; 32+ messages in thread
From: Harry Yoo @ 2026-02-13 11:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas, Paul E . McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng,
	Muchun Song, rcu, linux-mm, bpf

On Thu, Feb 12, 2026 at 11:15:52AM -0800, Alexei Starovoitov wrote:
> On Fri, Feb 6, 2026 at 1:35 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> >         } else {
> > +               if (unlikely(!allow_spin)) {
> > +                       /* call_rcu() does not support NMI context */
> > +                       rcu_sheaf->size--;
> > +                       local_unlock(&s->cpu_sheaves->lock);
> > +                       goto fail;
> 
> As a first step it's ok, but we need to make call_rcu() work too.

Yeah I was thinking it would be nice to have call_rcu_nolock()...

> Shouldn't be too hard. It protects itself with local_irq_save,
> so if (irqs_disabled()) defer to irq_work and call_rcu there
> or guard reentrance into __call_rcu_common() with per-cpu busy counter.
> rcu_head can be reused to form list of objects to be processed in irq work.

I'll take a look at that, thanks!

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 6/7] mm/slab: introduce kfree_rcu_nolock()
  2026-02-06  9:34 ` [RFC PATCH 6/7] mm/slab: introduce kfree_rcu_nolock() Harry Yoo
  2026-02-12  2:58   ` Harry Yoo
@ 2026-02-16 21:07   ` Joel Fernandes
  2026-02-16 21:32     ` Joel Fernandes
  1 sibling, 1 reply; 32+ messages in thread
From: Joel Fernandes @ 2026-02-16 21:07 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas, Paul E . McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf

Hi Harry,

On Fri, Feb 06, 2026 at 06:34:09PM +0900, Harry Yoo wrote:
> Currently, kfree_rcu() cannot be called in an NMI context.
> In such a context, even calling call_rcu() is not legal,
> forcing users to implement deferred freeing.
> 
> Make users' lives easier by introducing a kfree_rcu_nolock() variant.
> Unlike kfree_rcu(), kfree_rcu_nolock() only supports a 2-argument
> variant, because, in the worst case where memory allocation fails,
> the caller cannot synchronously wait for the grace period to finish.
> 
> Similar to the kfree_nolock() implementation, try to acquire the
> kfree_rcu_cpu spinlock, and if that fails, insert the object into a
> per-cpu lockless list and delay freeing using an irq_work that calls
> kvfree_call_rcu() later.
> In case kmemleak or debugobjects is enabled, always defer freeing as
> those debug features don't support NMI contexts.
> 
> When trylock succeeds, avoid consuming bnode and run_page_cache_worker()
> altogether. Instead, insert objects into struct kfree_rcu_cpu.head
> without consuming additional memory.
> 
> For now, the sheaves layer is bypassed if spinning is not allowed.
> 
> Scheduling delayed monitor work in an NMI context is tricky; use
> irq_work to schedule, but use lazy irq_work to avoid raising self-IPIs.
> That means scheduling delayed monitor work can be delayed up to the
> length of a time slice.
> 
> Without CONFIG_KVFREE_RCU_BATCHED, all frees in the !allow_spin case are
> delayed using irq_work.
> 
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---
>  include/linux/rcupdate.h |  23 ++++---
>  mm/slab_common.c         | 140 +++++++++++++++++++++++++++++++++------
>  2 files changed, 133 insertions(+), 30 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index db5053a7b0cb..18bb7378b23d 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1092,8 +1092,9 @@ static inline void rcu_read_unlock_migrate(void)
>   * The BUILD_BUG_ON check must not involve any function calls, hence the
>   * checks are done in macros here.
>   */
> -#define kfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
> -#define kvfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
> +#define kfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf, true)
> +#define kfree_rcu_nolock(ptr, rf) kvfree_rcu_arg_2(ptr, rf, false)
> +#define kvfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf, true)
>  
>  /**
>   * kfree_rcu_mightsleep() - kfree an object after a grace period.
> @@ -1117,35 +1118,35 @@ static inline void rcu_read_unlock_migrate(void)
>  
>  
>  #ifdef CONFIG_KVFREE_RCU_BATCHED
> -void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr);
> -#define kvfree_call_rcu(head, ptr) \
> +void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr, bool allow_spin);
> +#define kvfree_call_rcu(head, ptr, spin) \
>  	_Generic((head), \
>  		struct rcu_head *: kvfree_call_rcu_ptr,		\
>  		struct rcu_ptr *: kvfree_call_rcu_ptr,		\
>  		void *: kvfree_call_rcu_ptr			\
> -	)((struct rcu_ptr *)(head), (ptr))
> +	)((struct rcu_ptr *)(head), (ptr), spin)
>  #else
> -void kvfree_call_rcu_head(struct rcu_head *head, void *ptr);
> +void kvfree_call_rcu_head(struct rcu_head *head, void *ptr, bool allow_spin);
>  static_assert(sizeof(struct rcu_head) == sizeof(struct rcu_ptr));
> -#define kvfree_call_rcu(head, ptr) \
> +#define kvfree_call_rcu(head, ptr, spin) \
>  	_Generic((head), \
>  		struct rcu_head *: kvfree_call_rcu_head,	\
>  		struct rcu_ptr *: kvfree_call_rcu_head,		\
>  		void *: kvfree_call_rcu_head			\
> -	)((struct rcu_head *)(head), (ptr))
> +	)((struct rcu_head *)(head), (ptr), spin)
>  #endif
>  
>  /*
>   * The BUILD_BUG_ON() makes sure the rcu_head offset can be handled. See the
>   * comment of kfree_rcu() for details.
>   */
> -#define kvfree_rcu_arg_2(ptr, rf)					\
> +#define kvfree_rcu_arg_2(ptr, rf, spin)					\
>  do {									\
>  	typeof (ptr) ___p = (ptr);					\
>  									\
>  	if (___p) {							\
>  		BUILD_BUG_ON(offsetof(typeof(*(ptr)), rf) >= 4096);	\
> -		kvfree_call_rcu(&((___p)->rf), (void *) (___p));	\
> +		kvfree_call_rcu(&((___p)->rf), (void *) (___p), spin);	\
>  	}								\
>  } while (0)
>  
> @@ -1154,7 +1155,7 @@ do {								\
>  	typeof(ptr) ___p = (ptr);				\
>  								\
>  	if (___p)						\
> -		kvfree_call_rcu(NULL, (void *) (___p));		\
> +		kvfree_call_rcu(NULL, (void *) (___p), true);	\
>  } while (0)
>  
>  /*
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index d232b99a4b52..9d7801e5cb73 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1311,6 +1311,12 @@ struct kfree_rcu_cpu_work {
>   * the interactions with the slab allocators.
>   */
>  struct kfree_rcu_cpu {
> +	// Objects queued on a lockless linked list, not protected by the lock.
> +	// This allows freeing objects in NMI context, where trylock may fail.
> +	struct llist_head llist_head;
> +	struct irq_work irq_work;
> +	struct irq_work sched_monitor_irq_work;

It would be great if irq_work_queue() could support a lazy flag, or a new
irq_work_queue_lazy() that just skips irq_work_raise() for the lazy case.
Then we wouldn't need multiple struct irq_work instances doing the same
thing. +PeterZ
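
Something like this in kernel/irq_work.c, maybe (untested sketch; it also
ignores the stopped-tick case that the existing IRQ_WORK_LAZY handling
covers):

/* Queue on this CPU's lazy list without raising a self-IPI;
 * the work then runs from the next tick via irq_work_tick(). */
bool irq_work_queue_lazy(struct irq_work *work)
{
        /* Only queue if not already pending. */
        if (!irq_work_claim(work))
                return false;

        preempt_disable();
        llist_add(&work->node.llist, this_cpu_ptr(&lazy_list));
        preempt_enable();

        return true;
}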

[...]
> @@ -1979,9 +2059,15 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
>  	}
>  
>  	kasan_record_aux_stack(ptr);
> -	success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
> +
> +	krcp = krc_this_cpu_lock(&flags, allow_spin);
> +	if (!krcp)
> +		goto defer_free;
> +
> +	success = add_ptr_to_bulk_krc_lock(krcp, &flags, ptr, !head, allow_spin);
>  	if (!success) {
> -		run_page_cache_worker(krcp);
> +		if (allow_spin)
> +			run_page_cache_worker(krcp);
>  
>  		if (head == NULL)
>  			// Inline if kvfree_rcu(one_arg) call.
> @@ -2005,8 +2091,12 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
>  	kmemleak_ignore(ptr);
>  
>  	// Set timer to drain after KFREE_DRAIN_JIFFIES.
> -	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING)
> -		__schedule_delayed_monitor_work(krcp);
> +	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING) {
> +		if (allow_spin)
> +			__schedule_delayed_monitor_work(krcp);
> +		else
> +			irq_work_queue(&krcp->sched_monitor_irq_work);

Here the irq_work will be queued even if the delayed work is already pending?
That would be additional irq_work overhead (which isn't needed) when the
delayed monitor work is already queued.

If delayed_work_pending() is safe to call from NMI, you could also call
that to avoid unnecessary irq_work queueing. But do double check if it is.

Also per [1], I gather !allow_spin does not always imply NMI. If that is true,
is it better to call in_nmi() instead of relying on allow_spin?

[1] https://lore.kernel.org/all/CAADnVQKk_Bgi0bc-td_3pVpHYXR3CpC3R8rg-NHwdLEDiQSeNg@mail.gmail.com/
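
I.e. something like this (just a sketch; whether delayed_work_pending() is
really NMI-safe still needs to be verified):

	// Set timer to drain after KFREE_DRAIN_JIFFIES.
	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING) {
		if (allow_spin)		// or: !in_nmi()
			__schedule_delayed_monitor_work(krcp);
		else if (!delayed_work_pending(&krcp->monitor_work))
			irq_work_queue(&krcp->sched_monitor_irq_work);
	}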

Thanks,

--
Joel Fernandes



> +	}
>  
>  unlock_return:
>  	krc_this_cpu_unlock(krcp, flags);
> @@ -2017,10 +2107,22 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
>  	 * CPU can pass the QS state.
>  	 */
>  	if (!success) {
> +		VM_WARN_ON_ONCE(!allow_spin);
>  		debug_rcu_head_unqueue((struct rcu_head *) ptr);
>  		synchronize_rcu();
>  		kvfree(ptr);
>  	}
> +	return;
> +
> +defer_free:
> +	VM_WARN_ON_ONCE(allow_spin);
> +	guard(preempt)();
> +
> +	krcp = this_cpu_ptr(&krc);
> +	if (llist_add((struct llist_node *)head, &krcp->llist_head))
> +		irq_work_queue(&krcp->irq_work);
> +	return;
> +
>  }
>  EXPORT_SYMBOL_GPL(kvfree_call_rcu_ptr);
>  
> -- 
> 2.43.0
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 6/7] mm/slab: introduce kfree_rcu_nolock()
  2026-02-16 21:07   ` Joel Fernandes
@ 2026-02-16 21:32     ` Joel Fernandes
  0 siblings, 0 replies; 32+ messages in thread
From: Joel Fernandes @ 2026-02-16 21:32 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Vlastimil Babka, Christoph Lameter,
	David Rientjes, Roman Gushchin, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Hao Li, Alexei Starovoitov, Puranjay Mohan,
	Andrii Nakryiko, Amery Hung, Catalin Marinas, Paul E . McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Dave Chinner, Qi Zheng, Muchun Song, rcu,
	linux-mm, bpf, peterz

CC Peter for real this time. ;-)

On Mon, Feb 16, 2026 at 04:07:55PM -0500, Joel Fernandes wrote:
> Hi Harry,
> 
> On Fri, Feb 06, 2026 at 06:34:09PM +0900, Harry Yoo wrote:
> > Currently, kfree_rcu() cannot be called in an NMI context.
> > In such a context, even calling call_rcu() is not legal,
> > forcing users to implement deferred freeing.
> > 
> > Make users' lives easier by introducing a kfree_rcu_nolock() variant.
> > Unlike kfree_rcu(), kfree_rcu_nolock() only supports a 2-argument
> > variant, because, in the worst case where memory allocation fails,
> > the caller cannot synchronously wait for the grace period to finish.
> > 
> > Similar to the kfree_nolock() implementation, try to acquire the
> > kfree_rcu_cpu spinlock, and if that fails, insert the object into a
> > per-cpu lockless list and delay freeing using an irq_work that calls
> > kvfree_call_rcu() later.
> > In case kmemleak or debugobjects is enabled, always defer freeing as
> > those debug features don't support NMI contexts.
> > 
> > When trylock succeeds, avoid consuming bnode and run_page_cache_worker()
> > altogether. Instead, insert objects into struct kfree_rcu_cpu.head
> > without consuming additional memory.
> > 
> > For now, the sheaves layer is bypassed if spinning is not allowed.
> > 
> > Scheduling delayed monitor work in an NMI context is tricky; use
> > irq_work to schedule, but use lazy irq_work to avoid raising self-IPIs.
> > That means scheduling delayed monitor work can be delayed up to the
> > length of a time slice.
> > 
> > Without CONFIG_KVFREE_RCU_BATCHED, all frees in the !allow_spin case are
> > delayed using irq_work.
> > 
> > Suggested-by: Alexei Starovoitov <ast@kernel.org>
> > Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> > ---
> >  include/linux/rcupdate.h |  23 ++++---
> >  mm/slab_common.c         | 140 +++++++++++++++++++++++++++++++++------
> >  2 files changed, 133 insertions(+), 30 deletions(-)
> > 
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index db5053a7b0cb..18bb7378b23d 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -1092,8 +1092,9 @@ static inline void rcu_read_unlock_migrate(void)
> >   * The BUILD_BUG_ON check must not involve any function calls, hence the
> >   * checks are done in macros here.
> >   */
> > -#define kfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
> > -#define kvfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
> > +#define kfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf, true)
> > +#define kfree_rcu_nolock(ptr, rf) kvfree_rcu_arg_2(ptr, rf, false)
> > +#define kvfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf, true)
> >  
> >  /**
> >   * kfree_rcu_mightsleep() - kfree an object after a grace period.
> > @@ -1117,35 +1118,35 @@ static inline void rcu_read_unlock_migrate(void)
> >  
> >  
> >  #ifdef CONFIG_KVFREE_RCU_BATCHED
> > -void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr);
> > -#define kvfree_call_rcu(head, ptr) \
> > +void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr, bool allow_spin);
> > +#define kvfree_call_rcu(head, ptr, spin) \
> >  	_Generic((head), \
> >  		struct rcu_head *: kvfree_call_rcu_ptr,		\
> >  		struct rcu_ptr *: kvfree_call_rcu_ptr,		\
> >  		void *: kvfree_call_rcu_ptr			\
> > -	)((struct rcu_ptr *)(head), (ptr))
> > +	)((struct rcu_ptr *)(head), (ptr), spin)
> >  #else
> > -void kvfree_call_rcu_head(struct rcu_head *head, void *ptr);
> > +void kvfree_call_rcu_head(struct rcu_head *head, void *ptr, bool allow_spin);
> >  static_assert(sizeof(struct rcu_head) == sizeof(struct rcu_ptr));
> > -#define kvfree_call_rcu(head, ptr) \
> > +#define kvfree_call_rcu(head, ptr, spin) \
> >  	_Generic((head), \
> >  		struct rcu_head *: kvfree_call_rcu_head,	\
> >  		struct rcu_ptr *: kvfree_call_rcu_head,		\
> >  		void *: kvfree_call_rcu_head			\
> > -	)((struct rcu_head *)(head), (ptr))
> > +	)((struct rcu_head *)(head), (ptr), spin)
> >  #endif
> >  
> >  /*
> >   * The BUILD_BUG_ON() makes sure the rcu_head offset can be handled. See the
> >   * comment of kfree_rcu() for details.
> >   */
> > -#define kvfree_rcu_arg_2(ptr, rf)					\
> > +#define kvfree_rcu_arg_2(ptr, rf, spin)					\
> >  do {									\
> >  	typeof (ptr) ___p = (ptr);					\
> >  									\
> >  	if (___p) {							\
> >  		BUILD_BUG_ON(offsetof(typeof(*(ptr)), rf) >= 4096);	\
> > -		kvfree_call_rcu(&((___p)->rf), (void *) (___p));	\
> > +		kvfree_call_rcu(&((___p)->rf), (void *) (___p), spin);	\
> >  	}								\
> >  } while (0)
> >  
> > @@ -1154,7 +1155,7 @@ do {								\
> >  	typeof(ptr) ___p = (ptr);				\
> >  								\
> >  	if (___p)						\
> > -		kvfree_call_rcu(NULL, (void *) (___p));		\
> > +		kvfree_call_rcu(NULL, (void *) (___p), true);	\
> >  } while (0)
> >  
> >  /*
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index d232b99a4b52..9d7801e5cb73 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -1311,6 +1311,12 @@ struct kfree_rcu_cpu_work {
> >   * the interactions with the slab allocators.
> >   */
> >  struct kfree_rcu_cpu {
> > +	// Objects queued on a lockless linked list, not protected by the lock.
> > +	// This allows freeing objects in NMI context, where trylock may fail.
> > +	struct llist_head llist_head;
> > +	struct irq_work irq_work;
> > +	struct irq_work sched_monitor_irq_work;
> 
> It would be great if irq_work_queue() could support a lazy flag, or a new
> irq_work_queue_lazy() that just skips irq_work_raise() for the lazy case.
> Then we wouldn't need multiple struct irq_work instances doing the same
> thing. +PeterZ
> 
> [...]
> > @@ -1979,9 +2059,15 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
> >  	}
> >  
> >  	kasan_record_aux_stack(ptr);
> > -	success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
> > +
> > +	krcp = krc_this_cpu_lock(&flags, allow_spin);
> > +	if (!krcp)
> > +		goto defer_free;
> > +
> > +	success = add_ptr_to_bulk_krc_lock(krcp, &flags, ptr, !head, allow_spin);
> >  	if (!success) {
> > -		run_page_cache_worker(krcp);
> > +		if (allow_spin)
> > +			run_page_cache_worker(krcp);
> >  
> >  		if (head == NULL)
> >  			// Inline if kvfree_rcu(one_arg) call.
> > @@ -2005,8 +2091,12 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
> >  	kmemleak_ignore(ptr);
> >  
> >  	// Set timer to drain after KFREE_DRAIN_JIFFIES.
> > -	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING)
> > -		__schedule_delayed_monitor_work(krcp);
> > +	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING) {
> > +		if (allow_spin)
> > +			__schedule_delayed_monitor_work(krcp);
> > +		else
> > +			irq_work_queue(&krcp->sched_monitor_irq_work);
> 
> Here the irq_work will be queued even if the delayed work is already pending?
> That would be additional irq_work overhead (which isn't needed) when the
> delayed monitor work is already queued.
> 
> If delayed_work_pending() is safe to call from NMI, you could also call
> that to avoid unnecessary irq_work queueing. But do double check if it is.
> 
> Also per [1], I gather !allow_spin does not always imply NMI. If that is true,
> is it better to call in_nmi() instead of relying on allow_spin?
> 
> [1] https://lore.kernel.org/all/CAADnVQKk_Bgi0bc-td_3pVpHYXR3CpC3R8rg-NHwdLEDiQSeNg@mail.gmail.com/
> 
> Thanks,
> 
> --
> Joel Fernandes
> 
> 
> 
> > +	}
> >  
> >  unlock_return:
> >  	krc_this_cpu_unlock(krcp, flags);
> > @@ -2017,10 +2107,22 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
> >  	 * CPU can pass the QS state.
> >  	 */
> >  	if (!success) {
> > +		VM_WARN_ON_ONCE(!allow_spin);
> >  		debug_rcu_head_unqueue((struct rcu_head *) ptr);
> >  		synchronize_rcu();
> >  		kvfree(ptr);
> >  	}
> > +	return;
> > +
> > +defer_free:
> > +	VM_WARN_ON_ONCE(allow_spin);
> > +	guard(preempt)();
> > +
> > +	krcp = this_cpu_ptr(&krc);
> > +	if (llist_add((struct llist_node *)head, &krcp->llist_head))
> > +		irq_work_queue(&krcp->irq_work);
> > +	return;
> > +
> >  }
> >  EXPORT_SYMBOL_GPL(kvfree_call_rcu_ptr);
> >  
> > -- 
> > 2.43.0
> > 


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2026-02-16 21:33 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-06  9:34 [RFC PATCH 0/7] k[v]free_rcu() improvements Harry Yoo
2026-02-06  9:34 ` [RFC PATCH 1/7] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr Harry Yoo
2026-02-11 10:16   ` Uladzislau Rezki
2026-02-11 10:44     ` Harry Yoo
2026-02-11 10:53       ` Uladzislau Rezki
2026-02-11 11:26         ` Harry Yoo
2026-02-11 13:02           ` Uladzislau Rezki
2026-02-11 17:05           ` Alexei Starovoitov
2026-02-12 11:52     ` Vlastimil Babka
2026-02-13  5:17       ` Harry Yoo
2026-02-06  9:34 ` [RFC PATCH 2/7] mm: use rcu_ptr instead of rcu_head Harry Yoo
2026-02-09 10:41   ` Uladzislau Rezki
2026-02-09 11:22     ` Harry Yoo
2026-02-06  9:34 ` [RFC PATCH 3/7] mm/slab: allow freeing kmalloc_nolock()'d objects using kfree[_rcu]() Harry Yoo
2026-02-06  9:34 ` [RFC PATCH 4/7] mm/slab: free a bit in enum objexts_flags Harry Yoo
2026-02-06 20:09   ` Alexei Starovoitov
2026-02-09  9:38     ` Vlastimil Babka
2026-02-09 18:44       ` Alexei Starovoitov
2026-02-06  9:34 ` [RFC PATCH 5/7] mm/slab: move kfree_rcu_cpu[_work] definitions Harry Yoo
2026-02-06  9:34 ` [RFC PATCH 6/7] mm/slab: introduce kfree_rcu_nolock() Harry Yoo
2026-02-12  2:58   ` Harry Yoo
2026-02-16 21:07   ` Joel Fernandes
2026-02-16 21:32     ` Joel Fernandes
2026-02-06  9:34 ` [RFC PATCH 7/7] mm/slab: make kfree_rcu_nolock() work with sheaves Harry Yoo
2026-02-12 19:15   ` Alexei Starovoitov
2026-02-13 11:55     ` Harry Yoo
2026-02-07  0:16 ` [RFC PATCH 0/7] k[v]free_rcu() improvements Paul E. McKenney
2026-02-07  1:21   ` Harry Yoo
2026-02-07  1:33     ` Paul E. McKenney
2026-02-09  9:02       ` Harry Yoo
2026-02-09 16:40         ` Paul E. McKenney
2026-02-12 14:28 ` Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox