linux-mm.kvack.org archive mirror
From: Vlastimil Babka <vbabka@suse.cz>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, linux-mm <linux-mm@kvack.org>,
	Harry Yoo <harry.yoo@oracle.com>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Michal Hocko <mhocko@suse.com>,
	Sebastian Sewior <bigeasy@linutronix.de>,
	Andrii Nakryiko <andrii@kernel.org>,
	Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
Date: Wed, 7 May 2025 12:44:24 +0200	[thread overview]
Message-ID: <b6f69981-f851-4c6f-a19d-5f8d13994087@suse.cz> (raw)
In-Reply-To: <CAADnVQLO9YX2_0wEZshHbwXoJY2-wv3OgVGvN-hgf6mK0_ipxw@mail.gmail.com>

On 5/7/25 4:20 AM, Alexei Starovoitov wrote:
> On Tue, May 6, 2025 at 5:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 5/1/25 05:27, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov <ast@kernel.org>
>>>
>>> kmalloc_nolock() relies on the ability of local_lock to detect the
>>> situation when it's locked.
>>> In !PREEMPT_RT local_lock_is_locked() is true only when an NMI happened
>>> in the irq-saved region that protects _that specific_ per-cpu
>>> kmem_cache_cpu. In that case retry the operation in a different
>>> kmalloc bucket. The second attempt will likely succeed, since this cpu
>>> locked a different kmem_cache_cpu.
>>> When local_lock_is_locked() sees a locked memcg_stock.stock_lock,
>>> fall back to atomic operations.
>>>
>>> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when the
>>> per-cpu rt_spin_lock is locked by the current task. In this case
>>> re-entrance into the same kmalloc bucket is unsafe, and kmalloc_nolock()
>>> tries a different bucket that is most likely not locked by the current
>>> task. Though it may be locked by a different task, it's safe to
>>> rt_spin_lock() on it.
>>>
>>> Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
>>> immediately if called from hard irq or NMI in PREEMPT_RT.
>>>
>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>
>> In general I'd prefer if we could avoid local_lock_is_locked() usage outside
>> of debugging code. It just feels hacky given we have local_trylock()
>> operations. But I can see how this makes things simpler so it's probably
>> acceptable.
> 
> local_lock_is_locked() is not for debugging.
> It's gating further calls into slub internals.
> If a particular bucket is locked the logic will use a different one.
> There is no local_trylock() at all here.

It could be done with local_trylock(), but I can see how that would
complicate things. Not worth it, especially if we manage to replace the
current percpu scheme with sheaves later.

> In that sense it's very different from alloc_pages_nolock().
> There we trylock first and if not successful go for plan B.
> For kmalloc_nolock() we first check whether local_lock_is_locked(),
> if not then proceed and do
> local_lock_irqsave_check() instead of local_lock_irqsave().
> Both are unconditional and exactly the same without
> CONFIG_DEBUG_LOCK_ALLOC.

Right.
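
My mental model of the _check() variant (just a sketch of the intent, not
the actual definition from patch 4) is roughly:

/*
 * Same acquire semantics as local_lock_irqsave(); the _check() variant
 * only adds a debug-only sanity check that the lock we earlier found
 * unlocked via local_lock_is_locked() is indeed not held at the point
 * we actually take it.
 */
#ifdef CONFIG_DEBUG_LOCK_ALLOC
#define local_lock_irqsave_check(lock, flags)			\
	do {							\
		lockdep_assert_not_held(this_cpu_ptr(lock));	\
		local_lock_irqsave(lock, flags);		\
	} while (0)
#else
#define local_lock_irqsave_check(lock, flags)			\
	local_lock_irqsave(lock, flags)
#endif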

> Extra checks are there in _check() version for debugging,
> since local_lock_is_locked() is called much earlier in the call chain
> and far from local_lock_irqsave. So not trivial to see by just
> code reading.

Right, we rely on the implication that once we find
local_lock_is_locked() is false, it cannot suddenly become locked later,
even if we re-enable preemption in the meantime.

> If local_lock_is_locked() says that it's locked
> we go for a different bucket which is pretty much guaranteed to
> be unlocked.

OK.
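
So the -EBUSY from __slab_alloc() below becomes a retry with a larger
bucket in kmalloc_nolock(). Just to check I'm reading the intent right,
roughly (purely illustrative; bucket_for_size() and try_bucket_alloc()
are made-up helpers, not the actual patch code):

void *kmalloc_nolock_sketch(size_t size, gfp_t flags, int node)
{
	struct kmem_cache *s = bucket_for_size(size);
	void *p = try_bucket_alloc(s, flags, node);

	if (p != ERR_PTR(-EBUSY))
		return p;	/* success, or plain failure -> NULL */

	/*
	 * The per-cpu lock of this bucket is held by the context we
	 * interrupted. A different size class uses a different
	 * kmem_cache_cpu, so its local_lock is almost certainly free.
	 */
	s = bucket_for_size(size * 2);
	p = try_bucket_alloc(s, flags, node);
	return IS_ERR(p) ? NULL : p;
}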

>>> +             folio = (struct folio *)p;
>>> +     } else if (node == NUMA_NO_NODE)
>>>               folio = (struct folio *)alloc_frozen_pages(flags, order);
>>>       else
>>>               folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);
>>
>> <snip>
>>
>>> @@ -3958,8 +3989,28 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>        */
>>>       c = slub_get_cpu_ptr(s->cpu_slab);
>>>  #endif
>>> +     if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
>>> +             struct slab *slab;
>>> +
>>> +             slab = c->slab;
>>> +             if (slab && !node_match(slab, node))
>>> +                     /* In trylock mode numa node is a hint */
>>> +                     node = NUMA_NO_NODE;
>>> +
>>> +             if (!local_lock_is_locked(&s->cpu_slab->lock)) {
>>> +                     lockdep_assert_not_held(this_cpu_ptr(&s->cpu_slab->lock));
>>> +             } else {
>>> +                     /*
>>> +                      * EBUSY is an internal signal to kmalloc_nolock() to
>>> +                      * retry a different bucket. It's not propagated further.
>>> +                      */
>>> +                     p = ERR_PTR(-EBUSY);
>>> +                     goto out;
>>
>> Am I right in my reasoning as follows?
>>
>> - If we're on RT and "in_nmi() || in_hardirq()" is true then
>> kmalloc_nolock_noprof() would return NULL immediately and we never reach
>> this code
> 
> correct.
> 
>> - local_lock_is_locked() on RT tests if the current process is the lock
>> owner. This means (in absence of double locking bugs) that we locked it as
>> task (or hardirq) and now we're either in_hardirq() (doesn't change current
>> AFAIK?) preempting task, or in_nmi() preempting task or hardirq.
> 
> not quite.
> There could be re-entrance due to kprobe/fentry/tracepoint.
> Like trace_contention_begin().
> The code is still preemptable.

Hm right. Glad that I asked then and thanks for making me realize.

>> - so local_lock_is_locked() will never be true here on RT
> 
> hehe :)
> 
> To have good coverage I fuzz test this patch set with:
> 
> +extern void (*debug_callback)(void);
> +#define local_unlock_irqrestore(lock, flags)			\
> +	do {							\
> +		if (debug_callback) debug_callback();		\
> +		__local_unlock_irqrestore(lock, flags);		\
> +	} while (0)
> 
> and randomly re-enter everywhere from debug_callback().

Oh cool :)

>>
>> <snip>
>>
>>>  /*
>>>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
>>>   * can perform fastpath freeing without additional function calls.
>>> @@ -4605,10 +4762,36 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>>>       barrier();
>>>
>>>       if (unlikely(slab != c->slab)) {
>>
>> Note this unlikely() is actually a lie. It's actually unlikely that the free
>> will happen on the same cpu with the same slab still being c->slab,
>> unless it's a free that shortly follows a temporary object allocation.
> 
> I didn't change it, since you would have called it
> an unrelated change in the patch :)

Right, I'm not suggesting to change it here and now, just that it might
mislead us into treating this as a slowpath where we're fine doing
expensive things - I'm arguing we should be careful.

> I can prepare a separate single line patch to remove unlikely() here,
> but it's a micro optimization unrelated to this set.
> 
>>> -             __slab_free(s, slab, head, tail, cnt, addr);
>>> +             /* cnt == 0 signals that it's called from kfree_nolock() */
>>> +             if (unlikely(!cnt)) {
>>> +                     /*
>>> +                      * Use llist in cache_node ?
>>> +                      * struct kmem_cache_node *n = get_node(s, slab_nid(slab));
>>> +                      */
>>> +                     /*
>>> +                      * __slab_free() can locklessly cmpxchg16 into a slab,
>>> +                      * but then it might need to take spin_lock or local_lock
>>> +                      * in put_cpu_partial() for further processing.
>>> +                      * Avoid the complexity and simply add to a deferred list.
>>> +                      */
>>> +                     llist_add(head, &s->defer_free_objects);
>>> +             } else {
>>> +                     free_deferred_objects(&s->defer_free_objects, addr);
>>
>> So I'm a bit wary that this is actually rather a fast path that might
>> contend on the defer_free_objects from all cpus.
> 
> Well, in my current stress test I could only get this list
> to contain a single digit number of objects.

My worry isn't that the list would get long, but that we'd be checking it
on almost every kfree() (that's not lucky enough to hit slab == c->slab)
from all cpus. And every kfree_nolock() will have to make the cache line
of s->defer_free_objects exclusive to that cpu in the llist_add(), and
then all other cpus doing kfree() will have to refetch it while making
it shared again...

>> I'm wondering if we could make the list part of kmem_cache_cpu to distribute
>> it,
> 
> doable, but kmem_cache_cpu *c = raw_cpu_ptr(s->cpu_slab);
> is preemptable, so there is a risk that
> llist_add(.. , &c->defer_free_objects);
> will be accessing per-cpu memory of another cpu.
> llist_add() will work correctly, but cache line bounce is possible.

The cache line bouncing due to occasional preemption should be much rarer
than bouncing the single s->defer_free_objects cache line as described above.

> In kmem_cache I placed defer_free_objects after cpu_partial and oo,
> so it should be cache hot.

OTOH wouldn't that make the bouncing worse, since writes to the list would
then also bounce the cache line holding those otherwise read-mostly fields?

>> and hook the flushing e.g. to places where we do deactivate_slab() which
>> should be much slower path,
> 
> I don't follow the idea.
> If we don't process kmem_cache_cpu *c right here in do_slab_free()
> this llist will get large.

I'd hope deactivate_slab() happens frequently enough; it mainly means the
local cpu's slab was exhausted, and if there are many frees there are
probably many allocations too.

But ok, maybe having the llist local to the cpu would be sufficient to
make it feasible to check it on every kfree() cheaply enough.
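
For the record, the per-cpu variant I have in mind is just (a rough
sketch with a made-up defer_free() helper; not actual patch code):

struct kmem_cache_cpu {
	/* ... existing fields: freelist, tid, slab, lock, partial ... */
	struct llist_head defer_free_objects;	/* filled by kfree_nolock() */
};

/* kfree_nolock() side: llist_add() is lockless, so any context is fine */
static void defer_free(struct kmem_cache_cpu *c, void *head)
{
	llist_add(head, &c->defer_free_objects);
}

plus a flush in do_slab_free() (or deactivate_slab()), where we take the
local_lock anyway:

	if (unlikely(!llist_empty(&c->defer_free_objects)))
		free_deferred_objects(&c->defer_free_objects, addr);

That way the common case only touches the local cpu's cache line.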




Thread overview: 41+ messages
2025-05-01  3:27 [PATCH 0/6] mm: Reentrant kmalloc Alexei Starovoitov
2025-05-01  3:27 ` [PATCH 1/6] mm: Rename try_alloc_pages() to alloc_pages_nolock() Alexei Starovoitov
2025-05-06  8:26   ` Vlastimil Babka
2025-05-07  1:24     ` Alexei Starovoitov
2025-05-01  3:27 ` [PATCH 2/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
2025-05-06 12:56   ` Vlastimil Babka
2025-05-06 14:55     ` Vlastimil Babka
2025-05-07  1:25       ` Alexei Starovoitov
2025-05-12 13:26   ` Sebastian Andrzej Siewior
2025-05-12 16:46     ` Alexei Starovoitov
2025-05-01  3:27 ` [PATCH 3/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
2025-05-06 12:59   ` Vlastimil Babka
2025-05-07  1:28     ` Alexei Starovoitov
2025-05-12 14:56   ` Sebastian Andrzej Siewior
2025-05-12 15:01     ` Vlastimil Babka
2025-05-12 15:23       ` Sebastian Andrzej Siewior
2025-05-01  3:27 ` [PATCH 4/6] locking/local_lock: Introduce local_lock_irqsave_check() Alexei Starovoitov
2025-05-07 13:02   ` Vlastimil Babka
2025-05-12 14:03   ` Sebastian Andrzej Siewior
2025-05-12 17:16     ` Alexei Starovoitov
2025-05-13  6:58       ` Vlastimil Babka
2025-05-13 21:55         ` Alexei Starovoitov
2025-05-01  3:27 ` [PATCH 5/6] mm: Allow GFP_ACCOUNT and GFP_COMP to be used in alloc_pages_nolock() Alexei Starovoitov
2025-05-06  8:55   ` Vlastimil Babka
2025-05-07  1:33     ` Alexei Starovoitov
2025-05-01  3:27 ` [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
2025-05-05 18:46   ` Shakeel Butt
2025-05-06  0:49     ` Alexei Starovoitov
2025-05-06  1:24       ` Shakeel Butt
2025-05-06  1:51         ` Alexei Starovoitov
2025-05-06 18:05           ` Shakeel Butt
2025-05-06 12:01   ` Vlastimil Babka
2025-05-07  0:31     ` Harry Yoo
2025-05-07  2:23       ` Alexei Starovoitov
2025-05-07  8:38       ` Vlastimil Babka
2025-05-07  2:20     ` Alexei Starovoitov
2025-05-07 10:44       ` Vlastimil Babka [this message]
2025-05-09  1:03   ` Harry Yoo
2025-06-24 17:13     ` SLAB_NO_CMPXCHG was:: " Alexei Starovoitov
2025-06-25 11:38       ` Harry Yoo
2025-06-26 20:03         ` Alexei Starovoitov
