From: Vlastimil Babka <vbabka@suse.cz>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Harry Yoo <harry.yoo@oracle.com>, bpf <bpf@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Michal Hocko <mhocko@suse.com>,
	Sebastian Sewior <bigeasy@linutronix.de>,
	Andrii Nakryiko <andrii@kernel.org>,
	Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
Date: Tue, 16 Sep 2025 11:55:24 +0200	[thread overview]
Message-ID: <fbadc3e8-7086-4423-9e45-07a46d6510b8@suse.cz> (raw)
In-Reply-To: <CAADnVQKC=BA73ifNnyiE2c8ExtzS4LjR00--m5Xe8pqPLSGJfA@mail.gmail.com>

On 9/16/25 02:56, Alexei Starovoitov wrote:
> On Mon, Sep 15, 2025 at 7:39 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 9/15/25 14:52, Harry Yoo wrote:
>> > On Mon, Sep 08, 2025 at 06:00:07PM -0700, Alexei Starovoitov wrote:
>> >> From: Alexei Starovoitov <ast@kernel.org>
>> >>
>> >> kmalloc_nolock() relies on the ability of local_trylock_t to detect
>> >> the situation when a per-cpu kmem_cache is locked.
>> >>
>> >> In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
>> >> disables IRQs and marks s->cpu_slab->lock as acquired.
>> >> local_lock_is_locked(&s->cpu_slab->lock) returns true when
>> >> the slab allocator is in the middle of manipulating the per-cpu
>> >> cache of that specific kmem_cache.
>> >>
>> >> kmalloc_nolock() can be called from any context and can re-enter
>> >> into ___slab_alloc():
>> >>   kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
>> >>     kmalloc_nolock() -> ___slab_alloc(cache_B)
>> >> or
>> >>   kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
>> >>     kmalloc_nolock() -> ___slab_alloc(cache_B)
>> >>
>> >> Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
>> >> can be acquired without a deadlock before invoking the function.
>> >> If that specific per-cpu kmem_cache is busy, kmalloc_nolock()
>> >> retries in a different kmalloc bucket. The second attempt will
>> >> likely succeed, since this CPU has locked a different kmem_cache.
>> >>
>> >> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
>> >> per-cpu rt_spin_lock is locked by current _task_. In this case
>> >> re-entrance into the same kmalloc bucket is unsafe, and
>> >> kmalloc_nolock() tries a different bucket that is most likely
>> >> not locked by the current task. Though it may be locked by a
>> >> different task, it's safe to rt_spin_lock() and sleep on it.
>> >>
>> >> Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
>> >> immediately if called from hard irq or NMI in PREEMPT_RT.
>> >>
>> >> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
>> >> and (in_nmi() or in PREEMPT_RT).
>> >>
>> >> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
>> >> spin_trylock_irqsave(&n->list_lock) to allocate,
>> >> while kfree_nolock() always defers to irq_work.
>> >>
>> >> Note, kfree_nolock() must be called _only_ for objects allocated
>> >> with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
>> >> were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
>> >> will miss kmemleak/kfence bookkeeping and will cause false positives.
>> >> large_kmalloc is not supported by either kmalloc_nolock()
>> >> or kfree_nolock().
>> >>
>> >> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> >> ---
>> >>  include/linux/kasan.h      |  13 +-
>> >>  include/linux/memcontrol.h |   2 +
>> >>  include/linux/slab.h       |   4 +
>> >>  mm/Kconfig                 |   1 +
>> >>  mm/kasan/common.c          |   5 +-
>> >>  mm/slab.h                  |   6 +
>> >>  mm/slab_common.c           |   3 +
>> >>  mm/slub.c                  | 473 +++++++++++++++++++++++++++++++++----
>> >>  8 files changed, 453 insertions(+), 54 deletions(-)
>> >> @@ -3704,6 +3746,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>> >>      }
>> >>  }
>> >>
>> >> +/*
>> >> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
>> >> + * can be acquired without a deadlock before invoking the function.
>> >> + *
>> >> + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is
>> >> + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(),
>> >> + * and kmalloc() is not used in an unsupported context.
>> >> + *
>> >> + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave().
>> >> + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but
>> >> + * lockdep_assert() will catch a bug in case:
>> >> + * #1
>> >> + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock()
>> >> + * or
>> >> + * #2
>> >> + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
>> >> + *
>> >> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt-
>> >> + * disabled context. The lock will always be acquired and if needed it
>> >> + * will block and sleep until the lock is available.
>> >> + * #1 is possible in !PREEMPT_RT only.
>> >> + * #2 is possible in both, with the twist that irqsave is replaced with rt_spinlock:
>> >> + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
>> >> + *    tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
>> >> + *
>> >> + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
>> >> + */
>> >> +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
>> >> +#define local_lock_cpu_slab(s, flags)       \
>> >> +    local_lock_irqsave(&(s)->cpu_slab->lock, flags)
>> >> +#else
>> >> +#define local_lock_cpu_slab(s, flags)       \
>> >> +    lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
>> >> +#endif
>> >> +
>> >> +#define local_unlock_cpu_slab(s, flags)     \
>> >> +    local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
>> >
>> > nit: Do we still need this trick with patch "slab: Make slub local_(try)lock
>> > more precise for LOCKDEP"?
>>
>> I think we only make it more precise on PREEMPT_RT because on !PREEMPT_RT we
>> can avoid it using this trick. It's probably better for lockdep's overhead
>> to avoid the class-per-cache when we can.
> 
> yes.
> 
>> Perhaps we can even improve by having a special class only for kmalloc
>> caches? With kmalloc_nolock we shouldn't ever recurse from one non-kmalloc
>> cache to another non-kmalloc cache?
> 
> Probably correct.
> The current algorithm of kmalloc_nolock() (pick a different bucket)
> works only for kmalloc caches, so other caches won't see a _nolock()
> version any time soon...
> but caches are mergeable, so another kmem_cache_create()-d cache
> might get merged with a kmalloc cache? Still shouldn't be an issue.

I think we mostly don't merge them, maybe in some configs it's still
possible. We should probably just never merge them, as they are special
enough. But it still shouldn't be an issue for this case, as we wouldn't
ever merge such caches that would need a different lockdep class, AFAICS.

> I guess we can fine tune "bool finegrain_lockdep" in that patch
> to make it false for non-kmalloc caches, but I don't know how to do it.
> Some flag in struct kmem_cache? I can do a follow-up.

Yeah, there's now SLAB_KMALLOC and is_kmalloc_cache(). We can do that as a
followup, it shouldn't be urgent.
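
For the followup I'd imagine something along these lines (completely
untested sketch; want_finegrain_lockdep() is a made-up helper name and the
exact hookup depends on how the LOCKDEP patch lands):

	/*
	 * Only kmalloc caches can be re-entered via kmalloc_nolock(), so
	 * only those need the finer-grained lockdep class; everything else
	 * keeps sharing the static one.
	 */
	static bool want_finegrain_lockdep(struct kmem_cache *s)
	{
		return IS_ENABLED(CONFIG_PREEMPT_RT) && is_kmalloc_cache(s);
	}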

>>
>> Hmm... seems a good point, unless I'm missing something.
> 
> Good point indeed.
> tracepoints are not an issue, since there are no tracepoints
> in the middle of freelist operations,
> but a kprobe in the middle of ___slab_alloc() is indeed problematic.
> 
>>
>> > Maybe check if interrupts are disabled instead of in_nmi()?
> 
> but calling if (irqs_disabled()) isn't fast (last time I benchmarked it)
> and is unnecessarily restrictive.
> 
> I think it's better to add 'notrace' to ___slab_alloc
> or I can denylist that function on bpf side to disallow attaching.
> 
>>
>> Why not just check for local_lock_is_locked(&s->cpu_slab->lock) then and
>> just remove the "!in_nmi() ||" part? There shouldn't be false positives?
> 
> That wouldn't be correct. Remember you asked why
> access to &s->cpu_slab->lock is stable? in_nmi() guarantees that
> the task won't migrate.

Ah right.
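
IOW (restating to make sure I follow): a bare

	if (local_lock_is_locked(&s->cpu_slab->lock))
		/* retry in a different kmalloc bucket */

is only meaningful if we cannot migrate between checking the per-cpu lock
and later taking it, and here it's in_nmi() that guarantees that. (Sketch
only, not the actual patch code.)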

> Adding a slub_put_cpu_ptr() wrap around local_lock_is_locked() _and_ the
> subsequent call to __slab_alloc_node() would fix it,
> but it's ugly.
> Potentially we could do
> if (!allow_spin && local_lock_is_locked())
> right before calling __update_cpu_freelist_fast(),
> but it's even uglier, since it will affect the fast path for everyone.

Agreed.

> So I prefer to leave this bit as-is.
> I'll add filtering of ___slab_alloc() on the bpf side.
> We already have a precedent: the btf_id_deny set.
> That would be a one-line patch that I can do in the bpf tree.
> Good to disallow poking into ___slab_alloc() anyway.

Yeah, it sounds like the easiest fix and won't limit anything.
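
(For reference, I assume the bpf-side denylist addition would be the usual
one-liner in kernel/bpf/verifier.c, roughly:

	BTF_SET_START(btf_id_deny)
	...
	BTF_ID(func, ___slab_alloc)
	BTF_SET_END(btf_id_deny)

assuming ___slab_alloc() doesn't get inlined and keeps its name in BTF.)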


Thread overview: 45+ messages
2025-09-09  1:00 [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
2025-09-09  1:00 ` [PATCH slab v5 1/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
2025-09-09  1:00 ` [PATCH slab v5 2/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
2025-09-12 17:11   ` Shakeel Butt
2025-09-12 17:15     ` Matthew Wilcox
2025-09-12 17:34       ` Alexei Starovoitov
2025-09-12 17:46         ` Shakeel Butt
2025-09-12 17:47   ` Shakeel Butt
2025-09-15  5:25   ` Harry Yoo
2025-09-09  1:00 ` [PATCH slab v5 3/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov
2025-09-12 17:15   ` Shakeel Butt
2025-09-15  5:17   ` Harry Yoo
2025-09-09  1:00 ` [PATCH slab v5 4/6] slab: Make slub local_(try)lock more precise for LOCKDEP Alexei Starovoitov
2025-09-09  1:00 ` [PATCH slab v5 5/6] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL Alexei Starovoitov
2025-09-12 19:27   ` Shakeel Butt
2025-09-12 21:03     ` Suren Baghdasaryan
2025-09-12 21:11       ` Shakeel Butt
2025-09-12 21:26         ` Suren Baghdasaryan
2025-09-12 21:24       ` Alexei Starovoitov
2025-09-12 21:29         ` Shakeel Butt
2025-09-12 21:31           ` Alexei Starovoitov
2025-09-12 21:44             ` Shakeel Butt
2025-09-12 21:59               ` Alexei Starovoitov
2025-09-13  0:01                 ` Shakeel Butt
2025-09-13  0:07                   ` Alexei Starovoitov
2025-09-13  0:33                     ` Shakeel Butt
2025-09-13  0:36                       ` Suren Baghdasaryan
2025-09-13  1:12                         ` Alexei Starovoitov
2025-09-15  7:51                           ` Vlastimil Babka
2025-09-15 15:06                             ` Suren Baghdasaryan
2025-09-15 15:11                               ` Vlastimil Babka
2025-09-15 15:25                                 ` Suren Baghdasaryan
2025-09-15 20:10                                   ` Suren Baghdasaryan
2025-09-13  1:16   ` Shakeel Butt
2025-09-15  6:14   ` Harry Yoo
2025-09-09  1:00 ` [PATCH slab v5 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
2025-09-15 12:52   ` Harry Yoo
2025-09-15 14:39     ` Vlastimil Babka
2025-09-16  0:56       ` Alexei Starovoitov
2025-09-16  9:55         ` Vlastimil Babka [this message]
2025-09-16  1:00     ` Alexei Starovoitov
2025-09-24  0:40   ` Harry Yoo
2025-09-24  7:43     ` Alexei Starovoitov
2025-09-24 11:07       ` Harry Yoo
2025-09-12  9:33 ` [PATCH slab v5 0/6] slab: Re-entrant kmalloc_nolock() Vlastimil Babka
