From: Harry Yoo <harry.yoo@oracle.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, vbabka@suse.cz,
	shakeel.butt@linux.dev, mhocko@suse.com, bigeasy@linutronix.de,
	andrii@kernel.org, memxor@gmail.com, akpm@linux-foundation.org,
	peterz@infradead.org, rostedt@goodmis.org, hannes@cmpxchg.org
Subject: Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
Date: Mon, 15 Sep 2025 21:52:49 +0900
Message-ID: <aMgMIQVYnAq2weuE@hyeyoo>
In-Reply-To: <20250909010007.1660-7-alexei.starovoitov@gmail.com>

On Mon, Sep 08, 2025 at 06:00:07PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> kmalloc_nolock() relies on the ability of local_trylock_t to detect
> the situation when the per-cpu kmem_cache is locked.
> 
> In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
> disables IRQs and marks s->cpu_slab->lock as acquired.
> local_lock_is_locked(&s->cpu_slab->lock) returns true when
> slab is in the middle of manipulating per-cpu cache
> of that specific kmem_cache.
> 
> kmalloc_nolock() can be called from any context and can re-enter
> into ___slab_alloc():
>   kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
>     kmalloc_nolock() -> ___slab_alloc(cache_B)
> or
>   kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
>     kmalloc_nolock() -> ___slab_alloc(cache_B)
> 
> Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
> can be acquired without a deadlock before invoking the function.
> If that specific per-cpu kmem_cache is busy, kmalloc_nolock()
> retries in a different kmalloc bucket. The second attempt will
> likely succeed, since this CPU has locked a different kmem_cache.
> 
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> the per-cpu rt_spin_lock is locked by the current _task_. In this case
> re-entrance into the same kmalloc bucket is unsafe, and
> kmalloc_nolock() tries a different bucket that most likely is
> not locked by the current task. Though it may be locked by a
> different task, it's safe to rt_spin_lock() and sleep on it.
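
IOW, IIUC the scheme is roughly the following (my own restatement, not the
actual patch code):

	/* in the nolock path, before entering ___slab_alloc() for cache s */
	if (local_lock_is_locked(&s->cpu_slab->lock))
		/*
		 * This CPU (this task on RT) is already inside the cpu_slab
		 * critical section of s, so taking the lock would deadlock.
		 */
		return ERR_PTR(-EBUSY);
	/*
	 * Otherwise the lock can be taken safely. On -EBUSY kmalloc_nolock()
	 * bumps the size to the next kmalloc bucket and retries once.
	 */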
> 
> Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
> immediately if called from hard IRQ or NMI in PREEMPT_RT.
> 
> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> and (in_nmi() or in PREEMPT_RT).
> 
> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> spin_trylock_irqsave(&n->list_lock) to allocate,
> while kfree_nolock() always defers to irq_work.
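
For other readers following along, the irq_work deferral amounts to roughly
the following (an illustrative sketch; the struct/function names here are
mine, not necessarily what the patch uses):

#include <linux/container_of.h>
#include <linux/irq_work.h>
#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/slab.h>

struct deferred_free {
	struct llist_head objects;
	struct irq_work work;
};

static void flush_deferred_free(struct irq_work *work)
{
	struct deferred_free *df = container_of(work, struct deferred_free, work);
	struct llist_node *pos, *t;

	/* back in IRQ context, where taking the slab locks is safe again */
	llist_for_each_safe(pos, t, llist_del_all(&df->objects))
		kfree(pos);	/* in this sketch the llist_node sits at the start of the object */
}

static DEFINE_PER_CPU(struct deferred_free, deferred_free_work) = {
	.objects = LLIST_HEAD_INIT(objects),
	.work	 = IRQ_WORK_INIT(flush_deferred_free),
};

static void defer_free(void *object)
{
	struct deferred_free *df = this_cpu_ptr(&deferred_free_work);

	/* store the llist_node in the (now unused) object itself;
	 * llist_add() and irq_work_queue() are NMI safe
	 */
	if (llist_add(object, &df->objects))
		irq_work_queue(&df->work);
}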
> 
> Note, kfree_nolock() must be called _only_ for objects allocated
> with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
> were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
> will miss kmemleak/kfence bookkeeping and will cause false positives.
> large_kmalloc is not supported by either kmalloc_nolock()
> or kfree_nolock().
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/kasan.h      |  13 +-
>  include/linux/memcontrol.h |   2 +
>  include/linux/slab.h       |   4 +
>  mm/Kconfig                 |   1 +
>  mm/kasan/common.c          |   5 +-
>  mm/slab.h                  |   6 +
>  mm/slab_common.c           |   3 +
>  mm/slub.c                  | 473 +++++++++++++++++++++++++++++++++----
>  8 files changed, 453 insertions(+), 54 deletions(-)
> @@ -3704,6 +3746,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>  	}
>  }
>  
> +/*
> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> + * can be acquired without a deadlock before invoking the function.
> + *
> + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is
> + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(),
> + * and kmalloc() is not used in an unsupported context.
> + *
> + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave().
> + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but
> + * lockdep_assert() will catch a bug in case:
> + * #1
> + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock()
> + * or
> + * #2
> + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
> + *
> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt-
> + * disabled context. The lock will always be acquired and, if needed, it
> + * will block and sleep until the lock is available.
> + * #1 is possible in !PREEMPT_RT only.
> + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
> + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
> + *    tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
> + *
> + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
> + */
> +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
> +#define local_lock_cpu_slab(s, flags)	\
> +	local_lock_irqsave(&(s)->cpu_slab->lock, flags)
> +#else
> +#define local_lock_cpu_slab(s, flags)	\
> +	lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
> +#endif
> +
> +#define local_unlock_cpu_slab(s, flags)	\
> +	local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)

nit: Do we still need this trick with patch "slab: Make slub local_(try)lock
more precise for LOCKDEP"?
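
IOW, could both branches simply become the plain variant, i.e. (untested):

#define local_lock_cpu_slab(s, flags)	\
	local_lock_irqsave(&(s)->cpu_slab->lock, flags)

given the more precise per-cache lockdep annotation from that patch?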

>  #ifdef CONFIG_SLUB_CPU_PARTIAL
>  static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
>  {
> @@ -4262,6 +4342,7 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
>  static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
>  {
> +	bool allow_spin = gfpflags_allow_spinning(gfpflags);
>  	void *freelist;
>  	struct slab *slab;
>  	unsigned long flags;
> @@ -4287,9 +4368,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	if (unlikely(!node_match(slab, node))) {
>  		/*
>  		 * same as above but node_match() being false already
> -		 * implies node != NUMA_NO_NODE
> +		 * implies node != NUMA_NO_NODE.
> +		 * Reentrant slub cannot take locks necessary to
> +		 * deactivate_slab, hence ignore node preference.

nit: the comment is obsolete?

Per previous discussion there were two points.
Maybe something like this?

/*
 * We don't strictly honor pfmemalloc and NUMA preferences when
 * !allow_spin because:
 *
 * 1. Most kmalloc() users allocate objects on the local node,
 *    so kmalloc_nolock() tries not to interfere with them by
 *    deactivating the cpu slab.
 *
 * 2. Deactivating due to NUMA or pfmemalloc mismatch may cause
 *    unnecessary slab allocations even when n->partial list is not empty.
 */

...or if you feel it's not worth documenting,
just removing the misleading comment is fine.

> +		 * kmalloc_nolock() doesn't allow __GFP_THISNODE.
>  		 */
> -		if (!node_isset(node, slab_nodes)) {
> +		if (!node_isset(node, slab_nodes) ||
> +		    !allow_spin) {
>  			node = NUMA_NO_NODE;
>  		} else {
>  			stat(s, ALLOC_NODE_MISMATCH);
> @@ -4374,8 +4460,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  		slab = slub_percpu_partial(c);
>  		slub_set_percpu_partial(c, slab);
>  
> +		/*
> +		 * Reentrant slub cannot take locks necessary for
> +		 * __put_partials(), hence ignore node preference.

nit: same here.

> +		 * kmalloc_nolock() doesn't allow __GFP_THISNODE.
> +		 */
>  		if (likely(node_match(slab, node) &&
> -			   pfmemalloc_match(slab, gfpflags))) {
> +			   pfmemalloc_match(slab, gfpflags)) ||
> +		    !allow_spin) {
>  			c->slab = slab;
>  			freelist = get_freelist(s, slab);
>  			VM_BUG_ON(!freelist);
> @@ -5390,6 +5504,94 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
>  }
>  EXPORT_SYMBOL(__kmalloc_noprof);
>  
> +/**
> + * kmalloc_nolock - Allocate an object of given size from any context.
> + * @size: size to allocate
> + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> + * @node: node number of the target node.
> + *
> + * Return: pointer to the new object or NULL in case of error.
> + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> + * There is no reason to call it again and expect !NULL.
> + */
> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> +{
> +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> +	struct kmem_cache *s;
> +	bool can_retry = true;
> +	void *ret = ERR_PTR(-EBUSY);
> +
> +	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> +
> +	if (unlikely(!size))
> +		return ZERO_SIZE_PTR;
> +
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
> +		/* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> +		return NULL;
> +retry:
> +	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> +		return NULL;
> +	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> +
> +	if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
> +		/*
> +		 * kmalloc_nolock() is not supported on architectures that
> +		 * don't implement cmpxchg16b, but debug caches don't use
> +		 * per-cpu slab and per-cpu partial slabs. They rely on
> +		 * kmem_cache_node->list_lock, so kmalloc_nolock() can
> +		 * attempt to allocate from debug caches by
> +		 * spin_trylock_irqsave(&n->list_lock, ...)
> +		 */
> +		return NULL;
> +
> +	/*
> +	 * Do not call slab_alloc_node(), since trylock mode isn't
> +	 * compatible with slab_pre_alloc_hook/should_failslab and
> +	 * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
> +	 * and slab_post_alloc_hook() directly.
> +	 *
> +	 * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
> +	 * in irq saved region. It assumes that the same cpu will not
> +	 * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
> +	 * Therefore use in_nmi() to check whether particular bucket is in
> +	 * irq protected section.
> +	 *
> +	 * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
> +	 * this cpu was interrupted somewhere inside ___slab_alloc() after
> +	 * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
> +	 * In this case fast path with __update_cpu_freelist_fast() is not safe.
> +	 */
> +#ifndef CONFIG_SLUB_TINY
> +	if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> +#endif

On !PREEMPT_RT, how does the kernel know that it should not use
the lockless fastpath in kmalloc_nolock() in the following path:

kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()

For the same reason as with NMIs (the slowpath doesn't expect the fastpath
to update the same (freelist, tid) pair from within its irq-off section).

Maybe check if interrupts are disabled instead of in_nmi()?
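
Something along the lines of (untested, just to illustrate the idea):

#ifndef CONFIG_SLUB_TINY
	if (!irqs_disabled() || !local_lock_is_locked(&s->cpu_slab->lock))
#endif
		ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);

which would cover the tracepoint/kprobe-under-irqsave case as well
(assuming irqs_disabled() also reports true in NMI context).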

Otherwise looks good to me.

-- 
Cheers,
Harry / Hyeonggon

> +		ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
> +
> +	if (PTR_ERR(ret) == -EBUSY) {
> +		if (can_retry) {
> +			/* pick the next kmalloc bucket */
> +			size = s->object_size + 1;
> +			/*
> +			 * Another alternative is to
> +			 * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
> +			 * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
> +			 * to retry from bucket of the same size.
> +			 */
> +			can_retry = false;
> +			goto retry;
> +		}
> +		ret = NULL;
> +	}
> +
> +	maybe_wipe_obj_freeptr(s, ret);
> +	slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
> +			     slab_want_init_on_alloc(alloc_gfp, s), size);
> +
> +	ret = kasan_kmalloc(s, ret, size, alloc_gfp);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(kmalloc_nolock_noprof);


