From: Harry Yoo <harry.yoo@oracle.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, vbabka@suse.cz,
shakeel.butt@linux.dev, mhocko@suse.com, bigeasy@linutronix.de,
andrii@kernel.org, memxor@gmail.com, akpm@linux-foundation.org,
peterz@infradead.org, rostedt@goodmis.org, hannes@cmpxchg.org
Subject: Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
Date: Mon, 15 Sep 2025 21:52:49 +0900
Message-ID: <aMgMIQVYnAq2weuE@hyeyoo>
In-Reply-To: <20250909010007.1660-7-alexei.starovoitov@gmail.com>
On Mon, Sep 08, 2025 at 06:00:07PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> kmalloc_nolock() relies on the ability of local_trylock_t to detect
> the situation when the per-cpu kmem_cache is locked.
>
> In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
> disables IRQs and marks s->cpu_slab->lock as acquired.
> local_lock_is_locked(&s->cpu_slab->lock) returns true when
> slab is in the middle of manipulating per-cpu cache
> of that specific kmem_cache.
>
> kmalloc_nolock() can be called from any context and can re-enter
> into ___slab_alloc():
> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
> kmalloc_nolock() -> ___slab_alloc(cache_B)
> or
> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
> kmalloc_nolock() -> ___slab_alloc(cache_B)
>
> Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
> can be acquired without a deadlock before invoking the function.
> If that specific per-cpu kmem_cache is busy, kmalloc_nolock()
> retries in a different kmalloc bucket. The second attempt will
> likely succeed, since this cpu locked a different kmem_cache.
>
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> per-cpu rt_spin_lock is locked by current _task_. In this case
> re-entrance into the same kmalloc bucket is unsafe, and
> kmalloc_nolock() tries a different bucket that is most likely
> not locked by the current task. Though it may be locked by a
> different task, it's safe to rt_spin_lock() and sleep on it.
>
> Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
> immediately if called from hard irq or NMI in PREEMPT_RT.
>
> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> and (in_nmi() or in PREEMPT_RT).
>
> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> spin_trylock_irqsave(&n->list_lock) to allocate,
> while kfree_nolock() always defers to irq_work.
>
> Note, kfree_nolock() must be called _only_ for objects allocated
> with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
> were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
> will miss kmemleak/kfence bookkeeping and will cause false positives.
> large_kmalloc is not supported by either kmalloc_nolock()
> or kfree_nolock().
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> include/linux/kasan.h | 13 +-
> include/linux/memcontrol.h | 2 +
> include/linux/slab.h | 4 +
> mm/Kconfig | 1 +
> mm/kasan/common.c | 5 +-
> mm/slab.h | 6 +
> mm/slab_common.c | 3 +
> mm/slub.c | 473 +++++++++++++++++++++++++++++++++----
> 8 files changed, 453 insertions(+), 54 deletions(-)
> @@ -3704,6 +3746,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> }
> }
>
> +/*
> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> + * can be acquired without a deadlock before invoking the function.
> + *
> + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is
> + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(),
> + * and kmalloc() is not used in an unsupported context.
> + *
> + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave().
> + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but
> + * lockdep_assert() will catch a bug in case:
> + * #1
> + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock()
> + * or
> + * #2
> + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
> + *
> + * On PREEMPT_RT an invocation is not possible from IRQ-off or
> + * preempt-disabled context. The lock will always be acquired and,
> + * if needed, it will block and sleep until the lock is available.
> + * #1 is possible in !PREEMPT_RT only.
> + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
> + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
> + * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
> + *
> + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
> + */
> +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
> +#define local_lock_cpu_slab(s, flags) \
> + local_lock_irqsave(&(s)->cpu_slab->lock, flags)
> +#else
> +#define local_lock_cpu_slab(s, flags) \
> + lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
> +#endif
> +
> +#define local_unlock_cpu_slab(s, flags) \
> + local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
nit: Do we still need this trick with patch "slab: Make slub local_(try)lock
more precise for LOCKDEP"?
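(If it turns out the trick isn't needed anymore, I'd expect the LOCKDEP
special case to go away and both branches to collapse into the plain form,
i.e. roughly this untested sketch:

/*
 * Untested sketch: drop the trylock trick and rely on
 * local_lock_irqsave()'s own lockdep annotation in both configs.
 */
#define local_lock_cpu_slab(s, flags) \
	local_lock_irqsave(&(s)->cpu_slab->lock, flags)

#define local_unlock_cpu_slab(s, flags) \
	local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)

That of course depends on whether lockdep would still complain about the
NMI re-entrance path even with the more precise lock classes.)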
> #ifdef CONFIG_SLUB_CPU_PARTIAL
> static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
> {
> @@ -4262,6 +4342,7 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
> static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
> {
> + bool allow_spin = gfpflags_allow_spinning(gfpflags);
> void *freelist;
> struct slab *slab;
> unsigned long flags;
> @@ -4287,9 +4368,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> if (unlikely(!node_match(slab, node))) {
> /*
> * same as above but node_match() being false already
> - * implies node != NUMA_NO_NODE
> + * implies node != NUMA_NO_NODE.
> + * Reentrant slub cannot take locks necessary to
> + * deactivate_slab, hence ignore node preference.
nit: the comment is obsolete?
Per the previous discussion there were two points.
Maybe something like this?
/*
* We don't strictly honor pfmemalloc and NUMA preferences when
* !allow_spin because:
*
* 1. Most kmalloc() users allocate objects on the local node,
* so kmalloc_nolock() avoids deactivating the cpu slab
* so as not to interfere with them.
*
* 2. Deactivating due to NUMA or pfmemalloc mismatch may cause
* unnecessary slab allocations even when n->partial list is not empty.
*/
...or if you feel it's not worth documenting,
just removing the misleading comment is fine.
> + * kmalloc_nolock() doesn't allow __GFP_THISNODE.
> */
> - if (!node_isset(node, slab_nodes)) {
> + if (!node_isset(node, slab_nodes) ||
> + !allow_spin) {
> node = NUMA_NO_NODE;
> } else {
> stat(s, ALLOC_NODE_MISMATCH);
> @@ -4374,8 +4460,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> slab = slub_percpu_partial(c);
> slub_set_percpu_partial(c, slab);
>
> + /*
> + * Reentrant slub cannot take locks necessary for
> + * __put_partials(), hence ignore node preference.
nit: same here.
> + * kmalloc_nolock() doesn't allow __GFP_THISNODE.
> + */
> if (likely(node_match(slab, node) &&
> - pfmemalloc_match(slab, gfpflags))) {
> + pfmemalloc_match(slab, gfpflags)) ||
> + !allow_spin) {
> c->slab = slab;
> freelist = get_freelist(s, slab);
> VM_BUG_ON(!freelist);
> @@ -5390,6 +5504,94 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
> }
> EXPORT_SYMBOL(__kmalloc_noprof);
>
> +/**
> + * kmalloc_nolock - Allocate an object of given size from any context.
> + * @size: size to allocate
> + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> + * @node: node number of the target node.
> + *
> + * Return: pointer to the new object or NULL in case of error.
> + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> + * There is no reason to call it again and expect !NULL.
> + */
> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> +{
> + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> + struct kmem_cache *s;
> + bool can_retry = true;
> + void *ret = ERR_PTR(-EBUSY);
> +
> + VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> +
> + if (unlikely(!size))
> + return ZERO_SIZE_PTR;
> +
> + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> + /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> + return NULL;
> +retry:
> + if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> + return NULL;
> + s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> +
> + if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
> + /*
> + * kmalloc_nolock() is not supported on architectures that
> + * don't implement cmpxchg16b, but debug caches don't use
> + * per-cpu slab and per-cpu partial slabs. They rely on
> + * kmem_cache_node->list_lock, so kmalloc_nolock() can
> + * attempt to allocate from debug caches by
> + * spin_trylock_irqsave(&n->list_lock, ...)
> + */
> + return NULL;
> +
> + /*
> + * Do not call slab_alloc_node(), since trylock mode isn't
> + * compatible with slab_pre_alloc_hook/should_failslab and
> + * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
> + * and slab_post_alloc_hook() directly.
> + *
> + * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
> + * in irq saved region. It assumes that the same cpu will not
> + * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
> + * Therefore use in_nmi() to check whether particular bucket is in
> + * irq protected section.
> + *
> + * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
> + * this cpu was interrupted somewhere inside ___slab_alloc() after
> + * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
> + * In this case fast path with __update_cpu_freelist_fast() is not safe.
> + */
> +#ifndef CONFIG_SLUB_TINY
> + if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> +#endif
On !PREEMPT_RT, how does the kernel know that it should not use
the lockless fastpath in kmalloc_nolock() in the following path:
kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
For the same reason as with NMIs, the fastpath is unsafe there: the
slowpath doesn't expect the same cpu to update the (freelist, tid)
pair underneath it while it holds cpu_slab->lock.
Maybe check whether interrupts are disabled instead of in_nmi()?
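I.e. something along these lines (untested sketch, reusing the
local_lock_is_locked() check from this patch):

#ifndef CONFIG_SLUB_TINY
	/*
	 * Sketch only: if this cpu's cpu_slab->lock is held and we are in
	 * any irqs-off context (not just NMI), we may have interrupted
	 * ___slab_alloc() on this cpu, so skip this bucket and let the
	 * -EBUSY retry pick a different one.
	 */
	if (!irqs_disabled() || !local_lock_is_locked(&s->cpu_slab->lock))
#endif
		ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);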
Otherwise looks good to me.
--
Cheers,
Harry / Hyeonggon
> + ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
> +
> + if (PTR_ERR(ret) == -EBUSY) {
> + if (can_retry) {
> + /* pick the next kmalloc bucket */
> + size = s->object_size + 1;
> + /*
> + * Another alternative is to
> + * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
> + * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
> + * to retry from bucket of the same size.
> + */
> + can_retry = false;
> + goto retry;
> + }
> + ret = NULL;
> + }
> +
> + maybe_wipe_obj_freeptr(s, ret);
> + slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
> + slab_want_init_on_alloc(alloc_gfp, s), size);
> +
> + ret = kasan_kmalloc(s, ret, size, alloc_gfp);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(kmalloc_nolock_noprof);