* [RFC PATCH v2 0/8] kvfree_rcu() improvements
@ 2026-04-16  9:10 Harry Yoo (Oracle)
  2026-04-16  9:10 ` [PATCH 1/8] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr Harry Yoo (Oracle)
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Harry Yoo (Oracle) @ 2026-04-16  9:10 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Christoph Lameter, David Rientjes, Roman Gushchin, Hao Li,
	Alexei Starovoitov, Uladzislau Rezki, Paul E . McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Zqiang, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, rcu, linux-mm, Alexander Viro,
	Christian Brauner

This series contains a few improvements to the k[v]free_rcu() API,
suggested by Alexei Starovoitov. It aims to tackle two problems:

  1) Allow an 8-byte field to be used as an alternative to
     struct rcu_head (16-byte) for 2-argument kvfree_rcu()
     to save memory.

  2) Add kfree_rcu_nolock() API for an unknown context.

  "Unknown context" means the caller does not know whether spinning
  on a lock is safe. For example, a BPF program attached to an
  arbitrary kernel function may run while the CPU already holds
  krcp->lock. In practice, however, the lock is not held most of
  the time.

# Discussion

Now that we have sheaves for kmalloc caches, most frees go through
the sheaves layer. However, when a sheaf becomes full with !allow_spin,
call_rcu() cannot be called because the context is unknown (e.g., the
current context might have preempted call_rcu()). There are two
possible approaches:

  a) Implement a general call_rcu_nolock() in the RCU subsystem that
     defers call_rcu() when it's not safe.

  b) Handle this as a special case only for rcu sheaf submission
     in mm/slab_common.c, without touching the RCU core.

This series takes approach (b). This is because a general
call_rcu_nolock() would need to flush deferred callbacks before
rcu_barrier() to preserve its guarantee, increasing the cost of
rcu_barrier() for all RCU users, not just kfree_rcu. By keeping the
deferred call_rcu logic in the slab subsystem, only
kvfree_rcu_barrier() pays the extra cost.

One downside of the current approach is that slab uses the condition
`!allow_spin && irqs_disabled()` to determine whether it's safe to
call call_rcu(), which creates a dependency on RCU's implementation
details. I'd like to hear thoughts on this.

# Part 1. Allow an 8-byte field to be used as an alternative to
  struct rcu_head for 2-argument kvfree_rcu()
  (patches 1-2)

Technically, objects that are freed with k[v]free_rcu() need
only one pointer to link objects, because we already know that
the callback function is always kvfree(). For this purpose,
struct rcu_head is unnecessarily large (16 bytes on 64-bit).

Allow a smaller, 8-byte field (of struct rcu_ptr type) to be used
with k[v]free_rcu(). Let's save one pointer per slab object.

I have to admit that my naming skill isn't great; hopefully
we'll come up with a better name than `struct rcu_ptr`.

With this feature, either a struct rcu_ptr or rcu_head field
can be used as the second argument of the k[v]free_rcu() API.

Users that only use k[v]free_rcu() may use struct rcu_ptr to save
memory (when there can be many objects). However, some users,
such as maple tree, may use either call_rcu() or k[v]free_rcu() for
objects of the same type. For such users, struct rcu_head remains
the only option.

Patch 1 implements the struct rcu_ptr feature (for
CONFIG_KVFREE_RCU_BATCHED), and patch 2 converts fs/dcache external_name
to use struct rcu_ptr as an example user, saving a pointer per
dynamically allocated external file name.

# Part 2. Add kfree_rcu_nolock() for unknown contexts
  (patches 3-8)

Currently, kfree_rcu() cannot be called from an unknown context,
where spinning on a lock may not be allowed. In such a context, even
calling call_rcu() is not legal, forcing users to implement some
sort of deferred freeing. Let's make users' lives easier with
a new kfree_rcu_nolock() variant.

Note that only the 2-argument variant is supported, since there is
not much we can do when both trylock and memory allocation fail.

When spinning on a lock is not allowed, try to acquire the spinlock
using spin_trylock(). When trylock succeeds, do either:

  1) Use the rcu sheaf to free the object. Note that call_rcu() cannot
     be called in an unknown context, because the current context
     might have preempted call_rcu(). When the rcu sheaf becomes full
     as a result of freeing the object, defer the submission of the
     full sheaf using irq_work (defer_call_rcu).

  2) Use a bnode (of struct kvfree_rcu_bulk_data) to store the
     pointer. If trylock succeeds but no cached bnode is available,
     fall back to queueing the page cache worker just like the normal
     2-argument kvfree_rcu() path.

In rare cases where trylock fails, a non-lazy irq_work is used to
defer calling kvfree_call_rcu().

When certain debug features (kmemleak, debugobjects) are enabled,
freeing is always deferred because they use spinlocks.

Patch 3 moves code for preparation.
Patch 4 introduces kfree_rcu_nolock().
Patch 5 teaches the rcu sheaf to handle the !allow_spin case.
Patch 6 wraps rcu sheaf handling with CONFIG_KVFREE_RCU_BATCHED ifdef.
Patch 7 introduces deferred submission of rcu sheaves for the
!allow_spin case when IRQs are disabled.
Patch 8 adds a kunit test case for kfree_rcu_nolock().

Changes since RFC V1 [1]:
  - Dropped the kmalloc_nolock() -> kfree[_rcu]() path support
    and the objexts_flags cleanup, as they have already landed in
    mainline.
  - Dropped the rcu_ptr conversions in mm/ (previous patch 2) and
    instead converted struct external_name in fs/dcache.c as a user
    (new patch 2).
  - (Fix) Handle kfence addresses correctly using is_kfence_address()
    and kfence_object_start().
  - Reworked kfree_rcu_nolock() (patch 4):
    - When trylock succeeds, now attempts to use cached bnodes
      (like normal kvfree_rcu 2-arg path) instead of only inserting
      into krcp->head.
    - Added allow_spin parameter to __schedule_delayed_monitor_work()
      and run_page_cache_worker() to defer work submission via
      irq_work when spinning is not allowed (Joel).
    - (Fix) Introduced defer_kvfree_rcu_barrier() to flush deferred
      objects before flushing rcu sheaves, preserving correctness of
      kvfree_rcu_barrier().
    - (Fix) Moved kvfree_rcu_barrier()/kvfree_rcu_barrier_on_cache()
      to slab_common.c on CONFIG_KVFREE_RCU_BATCHED=n, and made them
      wait for deferred irq_works even without kvfree_rcu batching.
    - Introduced object_start_addr() helper to deduplicate the
      start address calculation logic.
  - Instead of falling back when the rcu sheaf becomes full,
    implemented deferred submission of rcu sheaves using irq_work
    (new patch 7) (Vlastimil, Alexei).
  - Wrapped rcu sheaf handling with CONFIG_KVFREE_RCU_BATCHED ifdef
    (new patch 6).
  - Added a kunit test for kfree_rcu_nolock() (new patch 8).

[1] RFC V1: https://lore.kernel.org/linux-mm/20260206093410.160622-1-harry.yoo@oracle.com

RFC V2 branch is available at:
https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=kvfree-rcu-improvements-rfc-v2r1

RFC V1 branch is available at:
https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=kvfree-rcu-improvements-rfc-v1r1

What hasn't changed since RFC v1:

- PREEMPT_RT support for kfree_rcu_sheaf() (Vlastimil): that is worth
  addressing and I think it's doable, but it would be too big a change
  to be part of this series.

- Reducing struct rcu_ptr on !KVFREE_RCU_BATCHED (Vlastimil): I tried,
  but I'm still not sure it's worth the complexity for
  CONFIG_KVFREE_RCU_BATCHED=n users. Also, this inevitably introduces
  some delay in freeing objects, which goes against the purpose of
  RCU_STRICT_GRACE_PERIOD.

- While writing this cover letter, I realized that I should probably
  try to reduce the number of irq_work structures (pointed out by
  Joel), at least to 2 (lazy and non-lazy) instead of 4. Will explore
  this in the next version.

Harry Yoo (Oracle) (8):
  mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
  fs/dcache: use rcu_ptr instead of rcu_head for external names
  mm/slab: move kfree_rcu_cpu[_work] definitions
  mm/slab: introduce kfree_rcu_nolock()
  mm/slab: make kfree_rcu_nolock() work with sheaves
  mm/slab: wrap rcu sheaf handling with ifdef
  mm/slab: introduce deferred submission of rcu sheaves
  lib/tests/slub_kunit: add a test case for kfree_rcu_nolock()

 fs/dcache.c              |   8 +-
 include/linux/rcupdate.h |  64 ++++--
 include/linux/slab.h     |  16 +-
 include/linux/types.h    |   9 +
 lib/tests/slub_kunit.c   |  73 +++++++
 mm/slab.h                |   8 +-
 mm/slab_common.c         | 452 +++++++++++++++++++++++++++++----------
 mm/slub.c                |  47 +++-
 8 files changed, 514 insertions(+), 163 deletions(-)


base-commit: 7e0445f673205fd045f3358cacb52b3557627317
-- 
2.43.0


