From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Harry Yoo <harry.yoo@oracle.com>, bpf <bpf@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
Shakeel Butt <shakeel.butt@linux.dev>,
Michal Hocko <mhocko@suse.com>,
Sebastian Sewior <bigeasy@linutronix.de>,
Andrii Nakryiko <andrii@kernel.org>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Peter Zijlstra <peterz@infradead.org>,
Steven Rostedt <rostedt@goodmis.org>,
Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
Date: Mon, 15 Sep 2025 17:56:18 -0700
Message-ID: <CAADnVQKC=BA73ifNnyiE2c8ExtzS4LjR00--m5Xe8pqPLSGJfA@mail.gmail.com>
In-Reply-To: <451c6823-40fa-4ef1-91d7-effb1ca43c90@suse.cz>
On Mon, Sep 15, 2025 at 7:39 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 9/15/25 14:52, Harry Yoo wrote:
> > On Mon, Sep 08, 2025 at 06:00:07PM -0700, Alexei Starovoitov wrote:
> >> From: Alexei Starovoitov <ast@kernel.org>
> >>
> >> kmalloc_nolock() relies on ability of local_trylock_t to detect
> >> the situation when per-cpu kmem_cache is locked.
> >>
> >> In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
> >> disables IRQs and marks s->cpu_slab->lock as acquired.
> >> local_lock_is_locked(&s->cpu_slab->lock) returns true when
> >> slab is in the middle of manipulating per-cpu cache
> >> of that specific kmem_cache.
> >>
> >> kmalloc_nolock() can be called from any context and can re-enter
> >> into ___slab_alloc():
> >> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
> >> kmalloc_nolock() -> ___slab_alloc(cache_B)
> >> or
> >> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
> >> kmalloc_nolock() -> ___slab_alloc(cache_B)
> >>
> >> Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
> >> can be acquired without a deadlock before invoking the function.
> >> If that specific per-cpu kmem_cache is busy, kmalloc_nolock()
> >> retries in a different kmalloc bucket. The second attempt will
> >> likely succeed, since this cpu locked a different kmem_cache.
> >>
> >> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> >> per-cpu rt_spin_lock is locked by current _task_. In this case
> >> re-entrance into the same kmalloc bucket is unsafe, and
> >> kmalloc_nolock() tries a different bucket that is most likely
> >> not locked by the current task. Though it may be locked by a
> >> different task, it's safe to rt_spin_lock() and sleep on it.
> >>
> >> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> >> immediately if called from hard irq or NMI in PREEMPT_RT.
> >>
> >> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> >> and (in_nmi() or in PREEMPT_RT).
> >>
> >> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> >> spin_trylock_irqsave(&n->list_lock) to allocate,
> >> while kfree_nolock() always defers to irq_work.
> >>
> >> Note, kfree_nolock() must be called _only_ for objects allocated
> >> with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
> >> were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
> >> will miss kmemleak/kfence bookkeeping and will cause false positives.
> >> large_kmalloc is not supported by either kmalloc_nolock()
> >> or kfree_nolock().
> >>
> >> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >> ---
> >> include/linux/kasan.h | 13 +-
> >> include/linux/memcontrol.h | 2 +
> >> include/linux/slab.h | 4 +
> >> mm/Kconfig | 1 +
> >> mm/kasan/common.c | 5 +-
> >> mm/slab.h | 6 +
> >> mm/slab_common.c | 3 +
> >> mm/slub.c | 473 +++++++++++++++++++++++++++++++++----
> >> 8 files changed, 453 insertions(+), 54 deletions(-)
> >> @@ -3704,6 +3746,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> >> }
> >> }
> >>
> >> +/*
> >> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> >> + * can be acquired without a deadlock before invoking the function.
> >> + *
> >> + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is
> >> + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(),
> >> + * and kmalloc() is not used in an unsupported context.
> >> + *
> >> + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave().
> >> + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but
> >> + * lockdep_assert() will catch a bug in case:
> >> + * #1
> >> + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock()
> >> + * or
> >> + * #2
> >> + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
> >> + *
> >> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> >> + * disabled context. The lock will always be acquired and if needed it
> >> + * will block and sleep until the lock is available.
> >> + * #1 is possible in !PREEMPT_RT only.
> >> + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
> >> + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
> >> + * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
> >> + *
> >> + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
> >> + */
> >> +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
> >> +#define local_lock_cpu_slab(s, flags) \
> >> + local_lock_irqsave(&(s)->cpu_slab->lock, flags)
> >> +#else
> >> +#define local_lock_cpu_slab(s, flags) \
> >> + lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
> >> +#endif
> >> +
> >> +#define local_unlock_cpu_slab(s, flags) \
> >> + local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
> >
> > nit: Do we still need this trick with patch "slab: Make slub local_(try)lock
> > more precise for LOCKDEP"?
>
> I think we only make it more precise on PREEMPT_RT because on !PREEMPT_RT we
> can avoid it using this trick. It's probably better for lockdep's overhead
> to avoid the class-per-cache when we can.
yes.
> Perhaps we can even improve by having a special class only for kmalloc
> caches? With kmalloc_nolock we shouldn't ever recurse from one non-kmalloc
> cache to another non-kmalloc cache?
Probably correct.
The current algorithm of kmalloc_nolock() (pick a different bucket)
works only for kmalloc caches, so other caches won't see a _nolock()
version any time soon...
but caches are mergeable, so another kmem_cache_create()-d cache
might get merged with a kmalloc cache? Still shouldn't be an issue.
I guess we can fine-tune the "bool finegrain_lockdep" in that patch
to make it false for non-kmalloc caches, but I don't know how best
to do it. Some flag in struct kmem_cache? I can do a follow up.
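Roughly something like this, perhaps (untested sketch; it assumes the
"bool finegrain_lockdep" from that patch can be driven off a cache flag,
relies on SLAB_KMALLOC already being set on every kmalloc bucket, and the
helper name is made up):

  /*
   * Only kmalloc caches can recurse into each other via kmalloc_nolock(),
   * so only they need the per-cache lockdep class.
   */
  static inline bool cache_wants_finegrain_lockdep(struct kmem_cache *s)
  {
          return s->flags & SLAB_KMALLOC;
  }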
> >>
> >> +/**
> >> + * kmalloc_nolock - Allocate an object of given size from any context.
> >> + * @size: size to allocate
> >> + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> >> + * @node: node number of the target node.
> >> + *
> >> + * Return: pointer to the new object or NULL in case of error.
> >> + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> >> + * There is no reason to call it again and expect !NULL.
> >> + */
> >> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> >> +{
> >> + gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> >> + struct kmem_cache *s;
> >> + bool can_retry = true;
> >> + void *ret = ERR_PTR(-EBUSY);
> >> +
> >> + VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> >> +
> >> + if (unlikely(!size))
> >> + return ZERO_SIZE_PTR;
> >> +
> >> + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> >> + /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> >> + return NULL;
> >> +retry:
> >> + if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> >> + return NULL;
> >> + s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> >> +
> >> + if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
> >> + /*
> >> + * kmalloc_nolock() is not supported on architectures that
> >> + * don't implement cmpxchg16b, but debug caches don't use
> >> + * per-cpu slab and per-cpu partial slabs. They rely on
> >> + * kmem_cache_node->list_lock, so kmalloc_nolock() can
> >> + * attempt to allocate from debug caches by
> >> + * spin_trylock_irqsave(&n->list_lock, ...)
> >> + */
> >> + return NULL;
> >> +
> >> + /*
> >> + * Do not call slab_alloc_node(), since trylock mode isn't
> >> + * compatible with slab_pre_alloc_hook/should_failslab and
> >> + * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
> >> + * and slab_post_alloc_hook() directly.
> >> + *
> >> + * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
> >> + * in irq saved region. It assumes that the same cpu will not
> >> + * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
> >> + * Therefore use in_nmi() to check whether particular bucket is in
> >> + * irq protected section.
> >> + *
> >> + * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
> >> + * this cpu was interrupted somewhere inside ___slab_alloc() after
> >> + * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
> >> + * In this case fast path with __update_cpu_freelist_fast() is not safe.
> >> + */
> >> +#ifndef CONFIG_SLUB_TINY
> >> + if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> >> +#endif
> >
> > On !PREEMPT_RT, how does the kernel know that it should not use
> > the lockless fastpath in kmalloc_nolock() in the following path:
> >
> > kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
> >
> > For the same reason as in NMIs (as slowpath doesn't expect that).
>
> Hmm... seems a good point, unless I'm missing something.
Good point indeed.
Tracepoints are not an issue, since there are no tracepoints
in the middle of freelist operations,
but a kprobe in the middle of ___slab_alloc() is indeed problematic.
>
> > Maybe check if interrupts are disabled instead of in_nmi()?
but calling irqs_disabled() isn't fast (last time I benchmarked it)
and it's unnecessarily restrictive.
I think it's better to add 'notrace' to ___slab_alloc(),
or I can denylist that function on the bpf side to disallow attaching.
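For the 'notrace' option that would just be the attribute, something like
(sketch, modulo the exact signature in this series):

  -static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
  +static void *notrace ___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

i.e. drop the mcount/fentry hook so nothing can attach at the function entry.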
>
> Why not just check for local_lock_is_locked(&s->cpu_slab->lock) then and
> just remove the "!in_nmi() ||" part? There shouldn't be false positives?
That wouldn't be correct. Remember you asked why the access to
&s->cpu_slab->lock is stable? in_nmi() guarantees that
the task won't migrate.
Adding a slub_put_cpu_ptr() wrap around local_lock_is_locked() _and_
the subsequent call to __slab_alloc_node() would fix it,
but it's ugly.
Potentially we can do
  if (!allow_spin && local_lock_is_locked())
right before calling __update_cpu_freelist_fast(),
but it's even uglier, since it will affect the fast path for everyone.
So I prefer to leave this bit as-is.
I'll add filtering of ___slab_alloc() on the bpf side.
We already have a precedent: the btf_id_deny set.
That would be a one-line patch that I can do in the bpf tree.
Good to disallow poking into ___slab_alloc() anyway.
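The bpf side would be roughly (untested; modulo whatever else is in
btf_id_deny in kernel/bpf/verifier.c by now):

   BTF_SET_START(btf_id_deny)
   BTF_ID_UNUSED
   ...
  +BTF_ID(func, ___slab_alloc)
   BTF_SET_END(btf_id_deny)

check_attach_btf_id() already refuses to attach tracing progs to anything
in that set, so this really is a one-liner.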
Thread overview: 45+ messages
2025-09-09 1:00 [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
2025-09-09 1:00 ` [PATCH slab v5 1/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
2025-09-09 1:00 ` [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
2025-09-12 17:11 ` Shakeel Butt
2025-09-12 17:15 ` Matthew Wilcox
2025-09-12 17:34 ` Alexei Starovoitov
2025-09-12 17:46 ` Shakeel Butt
2025-09-12 17:47 ` Shakeel Butt
2025-09-15 5:25 ` Harry Yoo
2025-09-09 1:00 ` [PATCH slab v5 3/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov
2025-09-12 17:15 ` Shakeel Butt
2025-09-15 5:17 ` Harry Yoo
2025-09-09 1:00 ` [PATCH slab v5 4/6] slab: Make slub local_(try)lock more precise for LOCKDEP Alexei Starovoitov
2025-09-09 1:00 ` [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL Alexei Starovoitov
2025-09-12 19:27 ` Shakeel Butt
2025-09-12 21:03 ` Suren Baghdasaryan
2025-09-12 21:11 ` Shakeel Butt
2025-09-12 21:26 ` Suren Baghdasaryan
2025-09-12 21:24 ` Alexei Starovoitov
2025-09-12 21:29 ` Shakeel Butt
2025-09-12 21:31 ` Alexei Starovoitov
2025-09-12 21:44 ` Shakeel Butt
2025-09-12 21:59 ` Alexei Starovoitov
2025-09-13 0:01 ` Shakeel Butt
2025-09-13 0:07 ` Alexei Starovoitov
2025-09-13 0:33 ` Shakeel Butt
2025-09-13 0:36 ` Suren Baghdasaryan
2025-09-13 1:12 ` Alexei Starovoitov
2025-09-15 7:51 ` Vlastimil Babka
2025-09-15 15:06 ` Suren Baghdasaryan
2025-09-15 15:11 ` Vlastimil Babka
2025-09-15 15:25 ` Suren Baghdasaryan
2025-09-15 20:10 ` Suren Baghdasaryan
2025-09-13 1:16 ` Shakeel Butt
2025-09-15 6:14 ` Harry Yoo
2025-09-09 1:00 ` [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
2025-09-15 12:52 ` Harry Yoo
2025-09-15 14:39 ` Vlastimil Babka
2025-09-16 0:56 ` Alexei Starovoitov [this message]
2025-09-16 9:55 ` Vlastimil Babka
2025-09-16 1:00 ` Alexei Starovoitov
2025-09-24 0:40 ` Harry Yoo
2025-09-24 7:43 ` Alexei Starovoitov
2025-09-24 11:07 ` Harry Yoo
2025-09-12 9:33 ` [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Vlastimil Babka