From: Vlastimil Babka <vbabka@suse.cz>
To: Harry Yoo <harry.yoo@oracle.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, shakeel.butt@linux.dev,
mhocko@suse.com, bigeasy@linutronix.de, andrii@kernel.org,
memxor@gmail.com, akpm@linux-foundation.org,
peterz@infradead.org, rostedt@goodmis.org, hannes@cmpxchg.org
Subject: Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
Date: Mon, 15 Sep 2025 16:39:12 +0200
Message-ID: <451c6823-40fa-4ef1-91d7-effb1ca43c90@suse.cz>
In-Reply-To: <aMgMIQVYnAq2weuE@hyeyoo>
On 9/15/25 14:52, Harry Yoo wrote:
> On Mon, Sep 08, 2025 at 06:00:07PM -0700, Alexei Starovoitov wrote:
>> From: Alexei Starovoitov <ast@kernel.org>
>>
>> kmalloc_nolock() relies on the ability of local_trylock_t to detect
>> the situation when the per-cpu kmem_cache is locked.
>>
>> In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
>> disables IRQs and marks s->cpu_slab->lock as acquired.
>> local_lock_is_locked(&s->cpu_slab->lock) returns true when
>> slab is in the middle of manipulating per-cpu cache
>> of that specific kmem_cache.
>>
>> kmalloc_nolock() can be called from any context and can re-enter
>> into ___slab_alloc():
>> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
>> kmalloc_nolock() -> ___slab_alloc(cache_B)
>> or
>> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
>> kmalloc_nolock() -> ___slab_alloc(cache_B)
>>
>> Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
>> can be acquired without a deadlock before invoking the function.
>> If that specific per-cpu kmem_cache is busy, kmalloc_nolock() retries
>> in a different kmalloc bucket. The second attempt will likely succeed,
>> since the kmem_cache this cpu has locked is a different one.
>>
>> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when the
>> per-cpu rt_spin_lock is locked by the current _task_. In this case
>> re-entrance into the same kmalloc bucket is unsafe, and
>> kmalloc_nolock() tries a different bucket that is most likely not
>> locked by the current task. Though it may be locked by a different
>> task, it's safe to rt_spin_lock() and sleep on it.
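[ To illustrate the retry for other readers: with the ret/can_retry/retry:
  names that appear further down in kmalloc_nolock_noprof(), the fallback
  boils down to something like the sketch below. How exactly the next bucket
  is picked is my guess here, not the patch's literal code:

	if (PTR_ERR(ret) == -EBUSY) {
		if (can_retry) {
			/* bump size so kmalloc_slab() picks the next bucket */
			size = s->object_size + 1;
			can_retry = false;
			goto retry;
		}
		ret = NULL;
	}
]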
>>
>> Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
>> immediately if called from hard irq or NMI context in PREEMPT_RT.
>>
>> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
>> and (in_nmi() or in PREEMPT_RT).
>>
>> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
>> spin_trylock_irqsave(&n->list_lock) to allocate,
>> while kfree_nolock() always defers to irq_work.
>>
>> Note, kfree_nolock() must be called _only_ for objects allocated
>> with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
>> were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
>> will miss kmemleak/kfence bookkeeping and will cause false positives.
>> large_kmalloc is not supported by either kmalloc_nolock()
>> or kfree_nolock().
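[ In other words the pairing rule is strict. A minimal illustration of what
  is and isn't allowed (size and flags below are just examples):

	/* OK: an object from kmalloc_nolock() is freed with kfree_nolock() */
	void *p = kmalloc_nolock(64, 0, NUMA_NO_NODE);

	if (p)
		kfree_nolock(p);

	/*
	 * NOT OK: a kmalloc() object went through the kmemleak/kfence hooks
	 * on allocation, so it must be freed with plain kfree(), never with
	 * kfree_nolock().
	 */
	p = kmalloc(64, GFP_KERNEL);
	kfree(p);
]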
>>
>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> ---
>> include/linux/kasan.h | 13 +-
>> include/linux/memcontrol.h | 2 +
>> include/linux/slab.h | 4 +
>> mm/Kconfig | 1 +
>> mm/kasan/common.c | 5 +-
>> mm/slab.h | 6 +
>> mm/slab_common.c | 3 +
>> mm/slub.c | 473 +++++++++++++++++++++++++++++++++----
>> 8 files changed, 453 insertions(+), 54 deletions(-)
>> @@ -3704,6 +3746,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>> }
>> }
>>
>> +/*
>> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
>> + * can be acquired without a deadlock before invoking the function.
>> + *
>> + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is
>> + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(),
>> + * and kmalloc() is not used in an unsupported context.
>> + *
>> + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave().
>> + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but
>> + * lockdep_assert() will catch a bug in case:
>> + * #1
>> + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock()
>> + * or
>> + * #2
>> + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
>> + *
>> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
>> + * disabled context. The lock will always be acquired and, if needed,
>> + * it will block and sleep until the lock is available.
>> + * #1 is possible in !PREEMPT_RT only.
>> + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
>> + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
>> + * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
>> + *
>> + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
>> + */
>> +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
>> +#define local_lock_cpu_slab(s, flags) \
>> + local_lock_irqsave(&(s)->cpu_slab->lock, flags)
>> +#else
>> +#define local_lock_cpu_slab(s, flags) \
>> + lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
>> +#endif
>> +
>> +#define local_unlock_cpu_slab(s, flags) \
>> + local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
>
> nit: Do we still need this trick with patch "slab: Make slub local_(try)lock
> more precise for LOCKDEP"?
I think we only make it more precise on PREEMPT_RT, because on !PREEMPT_RT we
can avoid that by using this trick. It's probably better for lockdep's
overhead to avoid a class per cache when we can.
Perhaps we can even improve things further by having a special class only for
the kmalloc caches? With kmalloc_nolock() we shouldn't ever recurse from one
non-kmalloc cache into another non-kmalloc cache?
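Something like this, perhaps (a rough, untested sketch; the helper name is
made up, and it assumes patch 4/6 keeps giving every cache its own class,
which this would then relax for the non-kmalloc caches):

	/*
	 * kmalloc_nolock() can only re-enter ___slab_alloc() through the
	 * kmalloc caches, so all non-kmalloc caches could share one lockdep
	 * class for their cpu_slab lock.
	 */
	static struct lock_class_key slub_nonkmalloc_cpu_slab_key;

	/* would be called during cache creation, e.g. after init_kmem_cache_cpus() */
	static void relax_cpu_slab_lock_class(struct kmem_cache *s)
	{
		int cpu;

		if (is_kmalloc_cache(s))
			return;	/* keep the distinct per-cache class */

		for_each_possible_cpu(cpu)
			lockdep_set_class(&per_cpu_ptr(s->cpu_slab, cpu)->lock,
					  &slub_nonkmalloc_cpu_slab_key);
	}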
>>
>> +/**
>> + * kmalloc_nolock - Allocate an object of given size from any context.
>> + * @size: size to allocate
>> + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
>> + * @node: node number of the target node.
>> + *
>> + * Return: pointer to the new object or NULL in case of error.
>> + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
>> + * There is no reason to call it again and expect !NULL.
>> + */
>> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
>> +{
>> + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
>> + struct kmem_cache *s;
>> + bool can_retry = true;
>> + void *ret = ERR_PTR(-EBUSY);
>> +
>> + VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
>> +
>> + if (unlikely(!size))
>> + return ZERO_SIZE_PTR;
>> +
>> + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
>> + /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
>> + return NULL;
>> +retry:
>> + if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
>> + return NULL;
>> + s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
>> +
>> + if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
>> + /*
>> + * kmalloc_nolock() is not supported on architectures that
>> + * don't implement cmpxchg16b, but debug caches don't use
>> + * per-cpu slab and per-cpu partial slabs. They rely on
>> + * kmem_cache_node->list_lock, so kmalloc_nolock() can
>> + * attempt to allocate from debug caches by
>> + * spin_trylock_irqsave(&n->list_lock, ...)
>> + */
>> + return NULL;
>> +
>> + /*
>> + * Do not call slab_alloc_node(), since trylock mode isn't
>> + * compatible with slab_pre_alloc_hook/should_failslab and
>> + * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
>> + * and slab_post_alloc_hook() directly.
>> + *
>> + * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
>> + * in irq saved region. It assumes that the same cpu will not
>> + * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
>> + * Therefore use in_nmi() to check whether particular bucket is in
>> + * irq protected section.
>> + *
>> + * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
>> + * this cpu was interrupted somewhere inside ___slab_alloc() after
>> + * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
>> + * In this case fast path with __update_cpu_freelist_fast() is not safe.
>> + */
>> +#ifndef CONFIG_SLUB_TINY
>> + if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
>> +#endif
>
> On !PREEMPT_RT, how does the kernel know that it should not use
> the lockless fastpath in kmalloc_nolock() in the following path:
>
> kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
>
> For the same reason as in NMIs (as slowpath doesn't expect that).
Hmm... seems a good point, unless I'm missing something.
> Maybe check if interrupts are disabled instead of in_nmi()?
Why not then check only for local_lock_is_locked(&s->cpu_slab->lock) and
remove the "!in_nmi() ||" part? There shouldn't be any false positives?
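I.e. the guard would become just

#ifndef CONFIG_SLUB_TINY
	if (!local_lock_is_locked(&s->cpu_slab->lock))
#endif

with the rest unchanged. That should also cover the
kmalloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock() case,
since (as the changelog above says) local_lock_irqsave() on the
local_trylock_t marks the lock as acquired for the whole irqsave section.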
> Otherwise looks good to me.
>