From: "Harry Yoo (Oracle)" <harry@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>,
Vlastimil Babka <vbabka@kernel.org>
Cc: Christoph Lameter <cl@gentwo.org>,
David Rientjes <rientjes@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Hao Li <hao.li@linux.dev>, Alexei Starovoitov <ast@kernel.org>,
Uladzislau Rezki <urezki@gmail.com>,
"Paul E . McKenney" <paulmck@kernel.org>,
Frederic Weisbecker <frederic@kernel.org>,
Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
Joel Fernandes <joelagnelf@nvidia.com>,
Josh Triplett <josh@joshtriplett.org>,
Boqun Feng <boqun@kernel.org>, Zqiang <qiang.zhang@linux.dev>,
Steven Rostedt <rostedt@goodmis.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Lai Jiangshan <jiangshanlai@gmail.com>,
rcu@vger.kernel.org, linux-mm@kvack.org,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>
Subject: [RFC PATCH v2 0/8] kvfree_rcu() improvements
Date: Thu, 16 Apr 2026 18:10:14 +0900
Message-ID: <20260416091022.36823-1-harry@kernel.org>
This series makes a few improvements to the k[v]free_rcu() API,
suggested by Alexei Starovoitov. It aims to tackle two problems:
1) Allow an 8-byte field to be used as an alternative to
struct rcu_head (16-byte) for 2-argument kvfree_rcu()
to save memory.
2) Add kfree_rcu_nolock() API for an unknown context.
"Unknown context" means the caller does not know whether spinning
on a lock is safe. For example, a BPF program attached to an
arbitrary kernel function may run while the CPU already holds
krcp->lock. In practice, however, the lock is not held most of the time.
# Discussion
Now that we have sheaves for kmalloc caches, most frees go through
the sheaves layer. However, when the rcu sheaf becomes full with !allow_spin,
call_rcu() cannot be called because the context is unknown (e.g., it
might have preempted call_rcu()). There are two possible approaches:
a) Implement a general call_rcu_nolock() in the RCU subsystem that
defers call_rcu() when it's not safe.
b) Handle this as a special case only for rcu sheaf submission
in mm/slab_common.c, without touching the RCU core.
This series takes approach (b). This is because a general
call_rcu_nolock() would need to flush deferred callbacks before
rcu_barrier() to preserve its guarantee, increasing the cost of
rcu_barrier() for all RCU users, not just kfree_rcu. By keeping the
deferred call_rcu logic in the slab subsystem, only
kvfree_rcu_barrier() pays the extra cost.
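To illustrate the trade-off, here is a minimal userspace sketch, not kernel code. All names (mock_defer_free(), mock_rcu_barrier(), mock_kvfree_rcu_barrier()) are hypothetical stand-ins, and the "RCU core" is reduced to a counter. It only shows that, with approach (b), the generic barrier stays unaware of the deferred queue while the slab-side barrier flushes it first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MOCK_MAX_DEFERRED 16

static int mock_deferred[MOCK_MAX_DEFERRED];	/* objects awaiting a deferred call_rcu() */
static size_t mock_nr_deferred;
static size_t mock_nr_submitted;		/* objects handed to the mock RCU core */

/* Called from an unknown context: only queue, never touch the RCU core. */
static int mock_defer_free(int obj)
{
	if (mock_nr_deferred == MOCK_MAX_DEFERRED)
		return -1;
	mock_deferred[mock_nr_deferred++] = obj;
	return 0;
}

/* Generic barrier: with approach (b) it knows nothing about the deferred queue. */
static void mock_rcu_barrier(void)
{
	/* would wait only for callbacks that were already submitted */
}

/* Slab-side barrier: flush the deferred queue first, then wait as usual. */
static void mock_kvfree_rcu_barrier(void)
{
	size_t i;

	for (i = 0; i < mock_nr_deferred; i++)
		mock_nr_submitted++;	/* stands in for submitting mock_deferred[i] */
	mock_nr_deferred = 0;
	mock_rcu_barrier();
}
```

In this model only callers of mock_kvfree_rcu_barrier() pay for walking the deferred queue; other users of mock_rcu_barrier() see no extra cost.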
One downside of the current approach is that slab uses the condition
`!allow_spin && irqs_disabled()` to determine whether it's safe to
call call_rcu(), which creates a dependency on RCU's implementation
details. I'd like to hear thoughts on this.
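The decision described above could be modeled roughly as follows. This is a userspace sketch under my reading of the text (defer only when spinning is not allowed and IRQs are disabled); the enum and function names are hypothetical, and irqs_off stands in for the result of irqs_disabled().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; this models the condition only, not the kernel code. */
enum mock_submit_action {
	MOCK_SUBMIT_NOW,	/* submit the full rcu sheaf via call_rcu() directly */
	MOCK_SUBMIT_DEFERRED,	/* hand the full rcu sheaf to irq_work instead */
};

static enum mock_submit_action
mock_rcu_sheaf_submit_action(bool allow_spin, bool irqs_off)
{
	/* Assumed reading: only this combination is treated as unsafe. */
	if (!allow_spin && irqs_off)
		return MOCK_SUBMIT_DEFERRED;
	return MOCK_SUBMIT_NOW;
}
```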
# Part 1. Allow an 8-byte field to be used as an alternative to
struct rcu_head for 2-argument kvfree_rcu()
(patches 1-2)
Technically, objects that are freed with k[v]free_rcu() need
only one pointer to link objects, because we already know that
the callback function is always kvfree(). For this purpose,
struct rcu_head is unnecessarily large (16 bytes on 64-bit).
Allow a smaller, 8-byte field (of struct rcu_ptr type) to be used
with k[v]free_rcu(). Let's save one pointer per slab object.
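The size difference can be checked with a small userspace sketch. struct mock_rcu_head mirrors the two-pointer layout of the kernel's struct rcu_head; struct mock_rcu_ptr is an assumed layout for the proposed type (a single link pointer), since the actual definition is in patch 1 and not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for struct rcu_head: a link plus a callback pointer. */
struct mock_rcu_head {
	struct mock_rcu_head *next;
	void (*func)(struct mock_rcu_head *head);
};

/* Assumed layout of the proposed struct rcu_ptr: a single link pointer. */
struct mock_rcu_ptr {
	struct mock_rcu_ptr *next;
};

/* Bytes saved per object when the smaller field is embedded instead. */
static size_t mock_bytes_saved_per_object(void)
{
	return sizeof(struct mock_rcu_head) - sizeof(struct mock_rcu_ptr);
}
```

On a typical 64-bit target this is 16 bytes versus 8 bytes, so the saving is one pointer per object.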
I have to admit that my naming skill isn't great; hopefully
we'll come up with a better name than `struct rcu_ptr`.
With this feature, either a struct rcu_ptr or rcu_head field
can be used as the second argument of the k[v]free_rcu() API.
Users that only use k[v]free_rcu() may use struct rcu_ptr to save
memory (if there can be a lot of objects). However, some users,
such as maple tree, may use call_rcu() or k[v]free_rcu() for objects
of the same type. For such users, struct rcu_head remains the only
option.
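As a rough illustration of an API that accepts either field type as its second argument, the following userspace sketch dispatches on the field's type at compile time with C11 _Generic. This is not how the series implements it; the mock_* types, the macro, and the two example structs are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct mock_rcu_head {
	struct mock_rcu_head *next;
	void (*func)(struct mock_rcu_head *head);
};

struct mock_rcu_ptr {
	struct mock_rcu_ptr *next;
};

enum mock_field_kind { MOCK_RCU_HEAD, MOCK_RCU_PTR };

/* Compile-time dispatch on the named field's type; other types fail to build. */
#define mock_field_kind(ptr, field)				\
	_Generic((ptr)->field,					\
		 struct mock_rcu_head: MOCK_RCU_HEAD,		\
		 struct mock_rcu_ptr: MOCK_RCU_PTR)

/* Only ever freed with k[v]free_rcu(): the small field is enough. */
struct mock_name {
	struct mock_rcu_ptr rcu;
	char name[24];
};

/* May also be passed to call_rcu(): needs the callback slot of rcu_head. */
struct mock_node {
	struct mock_rcu_head rcu;
	long slots[4];
};

static struct mock_name mock_name_obj;
static struct mock_node mock_node_obj;
```

A type like mock_node, which mixes call_rcu() and k[v]free_rcu(), keeps the full-size field, while mock_name can use the smaller one.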
Patch 1 implements the struct rcu_ptr feature (for
CONFIG_KVFREE_RCU_BATCHED), and patch 2 converts fs/dcache external_name
to use struct rcu_ptr as an example user, saving a pointer per
dynamically allocated external file name.
# Part 2. Add kfree_rcu_nolock() for unknown contexts
(patches 3-8)
Currently, kfree_rcu() cannot be called when the context is unknown,
which might not allow spinning on a lock. In such a context, even
calling call_rcu() is not legal, forcing users to implement some
sort of deferred freeing. Let's make users' lives easier with
a new kfree_rcu_nolock() variant.
Note that only the 2-argument variant is supported, since there is
not much we can do when both trylock and memory allocation fail.
When spinning on a lock is not allowed, try to acquire the spinlock
using spin_trylock(). When trylock succeeds, do either:
1) Use the rcu sheaf to free the object. Note that call_rcu() cannot
be called in an unknown context, because it might have preempted
call_rcu(). When the rcu sheaf becomes full by freeing the object,
defer the submission of the full sheaf using irq_work
(defer_call_rcu).
2) Use bnode (of struct kvfree_rcu_bulk_data) to store the pointer.
If trylock succeeded but no cached bnode is available, fall back
and queue the page cache worker, just like the normal 2-argument
kvfree_rcu() path.
In rare cases where trylock fails, a non-lazy irq_work is used to
defer calling kvfree_call_rcu().
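The control flow above can be sketched in userspace C. This is a simplified model, not the kernel implementation: a C11 atomic_flag stands in for krcp->lock, counters stand in for the fallback queue and for irq_work, the rcu sheaf path is omitted, and every mock_* name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; names are illustrative, not kernel APIs. */
#define MOCK_BNODE_SLOTS 2

enum mock_path { MOCK_PATH_BNODE, MOCK_PATH_FALLBACK, MOCK_PATH_DEFERRED };

struct mock_krc {
	atomic_flag lock;		/* stands in for krcp->lock */
	void *bnode[MOCK_BNODE_SLOTS];	/* cached pointer array */
	size_t nr_bnode;
	size_t nr_fallback;		/* objects queued without a bnode slot */
	atomic_size_t nr_deferred;	/* objects handed to (mock) irq_work */
};

static struct mock_krc mock_krc = { .lock = ATOMIC_FLAG_INIT };

static bool mock_trylock(struct mock_krc *krc)
{
	/* test_and_set returns the previous value: false means we took the lock */
	return !atomic_flag_test_and_set_explicit(&krc->lock, memory_order_acquire);
}

static void mock_unlock(struct mock_krc *krc)
{
	atomic_flag_clear_explicit(&krc->lock, memory_order_release);
}

static enum mock_path mock_kfree_rcu_nolock(struct mock_krc *krc, void *obj)
{
	enum mock_path path;

	if (!mock_trylock(krc)) {
		/* Lock is busy and spinning is not allowed: defer the whole call. */
		atomic_fetch_add(&krc->nr_deferred, 1);
		return MOCK_PATH_DEFERRED;
	}

	if (krc->nr_bnode < MOCK_BNODE_SLOTS) {
		/* A cached bnode has room: store the pointer there. */
		krc->bnode[krc->nr_bnode++] = obj;
		path = MOCK_PATH_BNODE;
	} else {
		/* No cached bnode: fall back, as the normal 2-argument path does. */
		krc->nr_fallback++;
		path = MOCK_PATH_FALLBACK;
	}

	mock_unlock(krc);
	return path;
}
```

The key property is that the function never spins: if the trylock fails it returns immediately after recording the object for later processing.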
When certain debug features (kmemleak, debugobjects) are enabled,
freeing is always deferred because they use spinlocks.
Patch 3 moves code for preparation.
Patch 4 introduces kfree_rcu_nolock().
Patch 5 teaches the rcu sheaf to handle the !allow_spin case.
Patch 6 wraps rcu sheaf handling with CONFIG_KVFREE_RCU_BATCHED ifdef.
Patch 7 introduces deferred submission of rcu sheaves for the
!allow_spin case when IRQs are disabled.
Patch 8 adds a kunit test case for kfree_rcu_nolock().
Changes since RFC V1 [1]:
- Dropped the kmalloc_nolock() -> kfree[_rcu]() path support
and the objexts_flags cleanup as they have already landed in mainline.
- Dropped rcu_ptr conversions in mm/ (previous patch 2) and instead
added struct external_name in fs/dcache.c as a user (new patch 2).
- (Fix) Handle kfence addresses correctly using is_kfence_address()
and kfence_object_start().
- Reworked kfree_rcu_nolock() (patch 4):
- When trylock succeeds, now attempts to use cached bnodes
(like normal kvfree_rcu 2-arg path) instead of only inserting
into krcp->head.
- Added allow_spin parameter to __schedule_delayed_monitor_work()
and run_page_cache_worker() to defer work submission via
irq_work when spinning is not allowed (Joel).
- (Fix) Introduced defer_kvfree_rcu_barrier() to flush deferred
objects before flushing rcu sheaves, preserving correctness of
kvfree_rcu_barrier().
- (Fix) Moved kvfree_rcu_barrier()/kvfree_rcu_barrier_on_cache()
to slab_common.c on CONFIG_KVFREE_RCU_BATCHED=n, and made them
wait for deferred irq_works even without kvfree_rcu batching.
- Introduced object_start_addr() helper to deduplicate the
start address calculation logic.
- Instead of falling back when the rcu sheaf becomes full,
implemented deferred submission of rcu sheaves using irq_work
(new patch 7) (Vlastimil, Alexei).
- Wrapped rcu sheaf handling with CONFIG_KVFREE_RCU_BATCHED ifdef
(new patch 6).
- Added a kunit test for kfree_rcu_nolock() (new patch 8).
[1] RFC V1: https://lore.kernel.org/linux-mm/20260206093410.160622-1-harry.yoo@oracle.com
RFC V2 branch is available at:
https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=kvfree-rcu-improvements-rfc-v2r1
RFC V1 branch is available at:
https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=kvfree-rcu-improvements-rfc-v1r1
What hasn't changed since RFC V1:
- PREEMPT_RT support for kfree_rcu_sheaf() (Vlastimil): that is worth
addressing and I think it's doable, but it would be too big a change to
be part of this series.
- Reducing struct rcu_ptr on !KVFREE_RCU_BATCHED (Vlastimil): I tried,
but I'm still not sure it's worth the complexity for
CONFIG_KVFREE_RCU_BATCHED=n users. Also, this inevitably introduces
some delay in freeing objects which is against the purpose of
RCU_STRICT_GRACE_PERIOD.
- While writing this cover letter, I just realized that I should probably
try to reduce the number of irq work structures (pointed out by Joel)
(at least to 2 for lazy and non-lazy instead of 4). Will explore this
in the next version.
Harry Yoo (Oracle) (8):
mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
fs/dcache: use rcu_ptr instead of rcu_head for external names
mm/slab: move kfree_rcu_cpu[_work] definitions
mm/slab: introduce kfree_rcu_nolock()
mm/slab: make kfree_rcu_nolock() work with sheaves
mm/slab: wrap rcu sheaf handling with ifdef
mm/slab: introduce deferred submission of rcu sheaves
lib/tests/slub_kunit: add a test case for kfree_rcu_nolock()
fs/dcache.c | 8 +-
include/linux/rcupdate.h | 64 ++++--
include/linux/slab.h | 16 +-
include/linux/types.h | 9 +
lib/tests/slub_kunit.c | 73 +++++++
mm/slab.h | 8 +-
mm/slab_common.c | 452 +++++++++++++++++++++++++++++----------
mm/slub.c | 47 +++-
8 files changed, 514 insertions(+), 163 deletions(-)
base-commit: 7e0445f673205fd045f3358cacb52b3557627317
--
2.43.0