linux-mm.kvack.org archive mirror
From: Vlastimil Babka <vbabka@suse.cz>
To: Harry Yoo <harry.yoo@oracle.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, shakeel.butt@linux.dev,
	mhocko@suse.com, bigeasy@linutronix.de, andrii@kernel.org,
	memxor@gmail.com, akpm@linux-foundation.org,
	peterz@infradead.org, rostedt@goodmis.org, hannes@cmpxchg.org
Subject: Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
Date: Mon, 15 Sep 2025 16:39:12 +0200
Message-ID: <451c6823-40fa-4ef1-91d7-effb1ca43c90@suse.cz>
In-Reply-To: <aMgMIQVYnAq2weuE@hyeyoo>

On 9/15/25 14:52, Harry Yoo wrote:
> On Mon, Sep 08, 2025 at 06:00:07PM -0700, Alexei Starovoitov wrote:
>> From: Alexei Starovoitov <ast@kernel.org>
>> 
>> kmalloc_nolock() relies on the ability of local_trylock_t to detect
>> the situation when the per-cpu kmem_cache is locked.
>> 
>> In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
>> disables IRQs and marks s->cpu_slab->lock as acquired.
>> local_lock_is_locked(&s->cpu_slab->lock) returns true when
>> slab is in the middle of manipulating the per-cpu cache
>> of that specific kmem_cache.
>> 
>> kmalloc_nolock() can be called from any context and can re-enter
>> into ___slab_alloc():
>>   kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
>>     kmalloc_nolock() -> ___slab_alloc(cache_B)
>> or
>>   kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
>>     kmalloc_nolock() -> ___slab_alloc(cache_B)
>> 
>> Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
>> can be acquired without a deadlock before invoking the function.
>> If that specific per-cpu kmem_cache is busy, kmalloc_nolock()
>> retries in a different kmalloc bucket. The second attempt will
>> likely succeed, since this cpu has locked a different kmem_cache.
>> 
>> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
>> the per-cpu rt_spin_lock is locked by the current _task_. In this case
>> re-entrance into the same kmalloc bucket is unsafe, and
>> kmalloc_nolock() tries a different bucket that is most likely
>> not locked by the current task. Though it may be locked by a
>> different task, it's safe to rt_spin_lock() and sleep on it.
>> 
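[ The resulting flow, as I read it, is roughly the pseudocode below. It is
  illustrative only: next_size_class() is a made-up helper, and the real
  bucket selection and actual code are in the patch quoted further down. ]

	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
	if (local_lock_is_locked(&s->cpu_slab->lock)) {
		/*
		 * This cpu (on RT: this task) is already inside
		 * ___slab_alloc() for this bucket, so taking its lock
		 * again could deadlock. Retry once in a different
		 * kmalloc bucket.
		 */
		size = next_size_class(size);	/* hypothetical helper */
		goto retry;
	}
	/* the lock is not held by us, entering the slow path is safe */
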
>> Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
>> immediately if called from hard IRQ or NMI context in PREEMPT_RT.
>> 
>> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
>> and (in_nmi() or in PREEMPT_RT).
>> 
>> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
>> spin_trylock_irqsave(&n->list_lock) to allocate,
>> while kfree_nolock() always defers to irq_work.
>> 
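[ IOW, the kfree_nolock() decision described above is roughly the sketch
  below; defer_free_to_irq_work() is just a placeholder name for the irq_work
  path, not necessarily what the patch calls it: ]

	/* with SLUB_TINY, kfree_nolock() always takes the deferred branch */
	if (local_lock_is_locked(&s->cpu_slab->lock) &&
	    (in_nmi() || IS_ENABLED(CONFIG_PREEMPT_RT))) {
		/* freeing right here isn't safe: punt to irq_work */
		defer_free_to_irq_work(obj);	/* placeholder name */
	} else {
		/* take the normal free path */
	}
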
>> Note, kfree_nolock() must be called _only_ for objects allocated
>> with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
>> were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
>> will miss kmemleak/kfence bookkeeping and will cause false positives.
>> large_kmalloc is not supported by either kmalloc_nolock()
>> or kfree_nolock().
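
[ To make the pairing requirement concrete: a caller would look roughly like
  the snippet below. Illustrative only (struct foo is made up), not taken
  from the patch. ]

	struct foo *f = kmalloc_nolock(sizeof(*f), __GFP_ZERO, NUMA_NO_NODE);

	if (!f)
		return -ENOMEM;	/* NULL means ENOMEM, retrying won't help */

	/* ... */

	/* allocated with kmalloc_nolock(), so kfree_nolock() is allowed */
	kfree_nolock(f);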
>> 
>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> ---
>>  include/linux/kasan.h      |  13 +-
>>  include/linux/memcontrol.h |   2 +
>>  include/linux/slab.h       |   4 +
>>  mm/Kconfig                 |   1 +
>>  mm/kasan/common.c          |   5 +-
>>  mm/slab.h                  |   6 +
>>  mm/slab_common.c           |   3 +
>>  mm/slub.c                  | 473 +++++++++++++++++++++++++++++++++----
>>  8 files changed, 453 insertions(+), 54 deletions(-)
>> @@ -3704,6 +3746,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>>  	}
>>  }
>>  
>> +/*
>> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
>> + * can be acquired without a deadlock before invoking the function.
>> + *
>> + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is
>> + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(),
>> + * and kmalloc() is not used in an unsupported context.
>> + *
>> + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave().
>> + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but
>> + * lockdep_assert() will catch a bug in case:
>> + * #1
>> + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock()
>> + * or
>> + * #2
>> + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
>> + *
>> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
>> + * disabled context. The lock will always be acquired and, if needed, it
>> + * will block and sleep until the lock is available.
>> + * #1 is possible in !PREEMPT_RT only.
>> + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
>> + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
>> + *    tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
>> + *
>> + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
>> + */
>> +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
>> +#define local_lock_cpu_slab(s, flags)	\
>> +	local_lock_irqsave(&(s)->cpu_slab->lock, flags)
>> +#else
>> +#define local_lock_cpu_slab(s, flags)	\
>> +	lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
>> +#endif
>> +
>> +#define local_unlock_cpu_slab(s, flags)	\
>> +	local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
> 
> nit: Do we still need this trick with patch "slab: Make slub local_(try)lock
> more precise for LOCKDEP"?

I think we only make it more precise on PREEMPT_RT because on !PREEMPT_RT we
can avoid it by using this trick. It's probably better for lockdep's overhead
to avoid the class-per-cache when we can.
Perhaps we can even improve by having a special class only for kmalloc
caches? With kmalloc_nolock we shouldn't ever recurse from one non-kmalloc
cache to another non-kmalloc cache?
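
Something like the below, maybe? Completely untested sketch; it assumes
lockdep_set_class() can be pointed at the local_trylock_t's dep_map and that
is_kmalloc_cache() is usable at this point:

static struct lock_class_key kmalloc_cpu_slab_key;

static void set_kmalloc_cpu_slab_class(struct kmem_cache *s)
{
	int cpu;

	if (!is_kmalloc_cache(s))
		return;

	/* all kmalloc caches share one class, other caches keep their own */
	for_each_possible_cpu(cpu)
		lockdep_set_class(&per_cpu_ptr(s->cpu_slab, cpu)->lock,
				  &kmalloc_cpu_slab_key);
}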

>>  
>> +/**
>> + * kmalloc_nolock - Allocate an object of given size from any context.
>> + * @size: size to allocate
>> + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
>> + * @node: node number of the target node.
>> + *
>> + * Return: pointer to the new object or NULL in case of error.
>> + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
>> + * There is no reason to call it again and expect !NULL.
>> + */
>> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
>> +{
>> +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
>> +	struct kmem_cache *s;
>> +	bool can_retry = true;
>> +	void *ret = ERR_PTR(-EBUSY);
>> +
>> +	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
>> +
>> +	if (unlikely(!size))
>> +		return ZERO_SIZE_PTR;
>> +
>> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
>> +		/* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
>> +		return NULL;
>> +retry:
>> +	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
>> +		return NULL;
>> +	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
>> +
>> +	if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
>> +		/*
>> +		 * kmalloc_nolock() is not supported on architectures that
>> +		 * don't implement cmpxchg16b, but debug caches don't use
>> +		 * per-cpu slab and per-cpu partial slabs. They rely on
>> +		 * kmem_cache_node->list_lock, so kmalloc_nolock() can
>> +		 * attempt to allocate from debug caches by
>> +		 * spin_trylock_irqsave(&n->list_lock, ...)
>> +		 */
>> +		return NULL;
>> +
>> +	/*
>> +	 * Do not call slab_alloc_node(), since trylock mode isn't
>> +	 * compatible with slab_pre_alloc_hook/should_failslab and
>> +	 * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
>> +	 * and slab_post_alloc_hook() directly.
>> +	 *
>> +	 * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
>> +	 * in irq saved region. It assumes that the same cpu will not
>> +	 * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
>> +	 * Therefore use in_nmi() to check whether a particular bucket is in
>> +	 * an irq protected section.
>> +	 *
>> +	 * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
>> +	 * this cpu was interrupted somewhere inside ___slab_alloc() after
>> +	 * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
>> +	 * In this case fast path with __update_cpu_freelist_fast() is not safe.
>> +	 */
>> +#ifndef CONFIG_SLUB_TINY
>> +	if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
>> +#endif
> 
> On !PREEMPT_RT, how does the kernel know that it should not use
> the lockless fastpath in kmalloc_nolock() in the following path:
> 
> kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
> 
> For the same reason as in NMIs (as slowpath doesn't expect that).

Hmm... seems a good point, unless I'm missing something.

> Maybe check if interrupts are disabled instead of in_nmi()?

Why not check for local_lock_is_locked(&s->cpu_slab->lock) then, and just
remove the "!in_nmi() ||" part? There shouldn't be any false positives?
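
IOW, something like this (untested)?

-	if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
+	if (!local_lock_is_locked(&s->cpu_slab->lock))
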
> Otherwise looks good to me.
> 



Thread overview: 45+ messages
2025-09-09  1:00 [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
2025-09-09  1:00 ` [PATCH slab v5 1/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
2025-09-09  1:00 ` [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
2025-09-12 17:11   ` Shakeel Butt
2025-09-12 17:15     ` Matthew Wilcox
2025-09-12 17:34       ` Alexei Starovoitov
2025-09-12 17:46         ` Shakeel Butt
2025-09-12 17:47   ` Shakeel Butt
2025-09-15  5:25   ` Harry Yoo
2025-09-09  1:00 ` [PATCH slab v5 3/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov
2025-09-12 17:15   ` Shakeel Butt
2025-09-15  5:17   ` Harry Yoo
2025-09-09  1:00 ` [PATCH slab v5 4/6] slab: Make slub local_(try)lock more precise for LOCKDEP Alexei Starovoitov
2025-09-09  1:00 ` [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL Alexei Starovoitov
2025-09-12 19:27   ` Shakeel Butt
2025-09-12 21:03     ` Suren Baghdasaryan
2025-09-12 21:11       ` Shakeel Butt
2025-09-12 21:26         ` Suren Baghdasaryan
2025-09-12 21:24       ` Alexei Starovoitov
2025-09-12 21:29         ` Shakeel Butt
2025-09-12 21:31           ` Alexei Starovoitov
2025-09-12 21:44             ` Shakeel Butt
2025-09-12 21:59               ` Alexei Starovoitov
2025-09-13  0:01                 ` Shakeel Butt
2025-09-13  0:07                   ` Alexei Starovoitov
2025-09-13  0:33                     ` Shakeel Butt
2025-09-13  0:36                       ` Suren Baghdasaryan
2025-09-13  1:12                         ` Alexei Starovoitov
2025-09-15  7:51                           ` Vlastimil Babka
2025-09-15 15:06                             ` Suren Baghdasaryan
2025-09-15 15:11                               ` Vlastimil Babka
2025-09-15 15:25                                 ` Suren Baghdasaryan
2025-09-15 20:10                                   ` Suren Baghdasaryan
2025-09-13  1:16   ` Shakeel Butt
2025-09-15  6:14   ` Harry Yoo
2025-09-09  1:00 ` [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
2025-09-15 12:52   ` Harry Yoo
2025-09-15 14:39     ` Vlastimil Babka [this message]
2025-09-16  0:56       ` Alexei Starovoitov
2025-09-16  9:55         ` Vlastimil Babka
2025-09-16  1:00     ` Alexei Starovoitov
2025-09-24  0:40   ` Harry Yoo
2025-09-24  7:43     ` Alexei Starovoitov
2025-09-24 11:07       ` Harry Yoo
2025-09-12  9:33 ` [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Vlastimil Babka
