From: Alexei Starovoitov
Date: Tue, 6 May 2025 19:20:32 -0700
Subject: Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
In-Reply-To: <4d3e5d4b-502b-459b-9779-c0bf55ef2a03@suse.cz>
References: <20250501032718.65476-1-alexei.starovoitov@gmail.com>
 <20250501032718.65476-7-alexei.starovoitov@gmail.com>
 <4d3e5d4b-502b-459b-9779-c0bf55ef2a03@suse.cz>
To: Vlastimil Babka
Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko, Sebastian Sewior,
 Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
 Steven Rostedt, Johannes Weiner, Matthew Wilcox

On Tue, May 6, 2025 at 5:01 AM Vlastimil Babka wrote:
>
> On 5/1/25 05:27, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov
> >
> > kmalloc_nolock() relies on the ability of local_lock to detect the
> > situation when it's locked.
> > In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> > the irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> > In that case retry the operation in a different kmalloc bucket.
> > The second attempt will likely succeed, since this cpu locked
> > a different kmem_cache_cpu.
> > When local_lock_is_locked() sees locked memcg_stock.stock_lock
> > fall back to atomic operations.
> >
> > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > the per-cpu rt_spin_lock is locked by the current task. In this case
> > re-entrance into the same kmalloc bucket is unsafe, and kmalloc_nolock()
> > tries a different bucket that is most likely not locked by the current
> > task. Though it may be locked by a different task, it's safe to
> > rt_spin_lock() on it.
> >
> > Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
> > immediately if called from hard irq or NMI in PREEMPT_RT.
> >
> > Signed-off-by: Alexei Starovoitov
>
> In general I'd prefer if we could avoid local_lock_is_locked() usage outside
> of debugging code. It just feels hacky given we have local_trylock()
> operations. But I can see how this makes things simpler so it's probably
> acceptable.

local_lock_is_locked() is not for debugging.
It's gating further calls into slub internals.
If a particular bucket is locked, the logic will use a different one.
There is no local_trylock() at all here.
In that sense it's very different from alloc_pages_nolock().
There we trylock first and if not successful go for plan B.
For kmalloc_nolock() we first check whether local_lock_is_locked();
if not, then proceed and do local_lock_irqsave_check() instead of
local_lock_irqsave(). Both are unconditional and exactly the same
without CONFIG_DEBUG_LOCK_ALLOC.
Extra checks are there in the _check() version for debugging, since
local_lock_is_locked() is called much earlier in the call chain and
far from local_lock_irqsave(), so it's not trivial to see by just
reading the code.
If local_lock_is_locked() says that it's locked, we go for a different
bucket, which is pretty much guaranteed to be unlocked.

> > @@ -2458,13 +2468,21 @@ static void *setup_object(struct kmem_cache *s, void *object)
> >   * Slab allocation and freeing
> >   */
> >  static inline struct slab *alloc_slab_page(gfp_t flags, int node,
> > -                                          struct kmem_cache_order_objects oo)
> > +                                          struct kmem_cache_order_objects oo,
> > +                                          bool allow_spin)
> >  {
> >         struct folio *folio;
> >         struct slab *slab;
> >         unsigned int order = oo_order(oo);
> >
> > -       if (node == NUMA_NO_NODE)
> > +       if (unlikely(!allow_spin)) {
> > +               struct page *p = alloc_pages_nolock(__GFP_COMP, node, order);
> > +
> > +               if (p)
> > +                       /* Make the page frozen. Drop refcnt to zero. */
> > +                       put_page_testzero(p);
>
> This is dangerous. Once we create a refcounted (non-frozen) page, someone
> else (a pfn scanner like compaction) can do a get_page_unless_zero(), so the
> refcount becomes 2, then we decrement the refcount here to 1, the pfn
> scanner realizes it's not a page it can work with, does put_page() and frees
> it under us.

Something like isolate_migratepages_block() does that?
ok. good to know.

> The solution is to split out alloc_frozen_pages_nolock() to use from here,
> and make alloc_pages_nolock() use it too and then set refcounted.

understood.

> > +               folio = (struct folio *)p;
> > +       } else if (node == NUMA_NO_NODE)
> >                 folio = (struct folio *)alloc_frozen_pages(flags, order);
> >         else
> >                 folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);
> >
>
> > @@ -3958,8 +3989,28 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >          */
> >         c = slub_get_cpu_ptr(s->cpu_slab);
> >  #endif
> > +       if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
> > +               struct slab *slab;
> > +
> > +               slab = c->slab;
> > +               if (slab && !node_match(slab, node))
> > +                       /* In trylock mode numa node is a hint */
> > +                       node = NUMA_NO_NODE;
> > +
> > +               if (!local_lock_is_locked(&s->cpu_slab->lock)) {
> > +                       lockdep_assert_not_held(this_cpu_ptr(&s->cpu_slab->lock));
> > +               } else {
> > +                       /*
> > +                        * EBUSY is an internal signal to kmalloc_nolock() to
> > +                        * retry a different bucket. It's not propagated further.
> > +                        */
> > +                       p = ERR_PTR(-EBUSY);
> > +                       goto out;
>
> Am I right in my reasoning as follows?
>
> - If we're on RT and "in_nmi() || in_hardirq()" is true then
> kmalloc_nolock_noprof() would return NULL immediately and we never reach
> this code

correct.

> - local_lock_is_locked() on RT tests if the current process is the lock
> owner. This means (in absence of double locking bugs) that we locked it as
> task (or hardirq) and now we're either in_hardirq() (doesn't change current
> AFAIK?) preempting task, or in_nmi() preempting task or hardirq.

not quite. There could be re-entrance due to kprobe/fentry/tracepoint.
Like trace_contention_begin(). The code is still preemptable.
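To spell the control flow out in one place, here is a minimal userspace
sketch (not the kernel code) of the "check the lock, otherwise retry one
other bucket" logic described above. bucket_lock_held(), bucket_alloc()
and the two-bucket loop are hypothetical stand-ins for
local_lock_is_locked(), __slab_alloc_node() and the kmalloc bucket
selection:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_BUCKETS 2

/* models the per-cpu local_lock state of each kmalloc bucket */
static bool bucket_locked[NR_BUCKETS];

/* hypothetical stand-in for local_lock_is_locked(&s->cpu_slab->lock) */
static bool bucket_lock_held(int b)
{
	return bucket_locked[b];
}

/* hypothetical stand-in for __slab_alloc_node(); always succeeds here */
static void *bucket_alloc(int b, size_t size)
{
	(void)b;
	return malloc(size);
}

/*
 * Model of kmalloc_nolock(): if the chosen bucket's lock is already
 * held, treat it as an internal -EBUSY and retry exactly once with a
 * different bucket. No randomness involved.
 */
static void *model_kmalloc_nolock(size_t size)
{
	for (int attempt = 0; attempt < 2; attempt++) {
		int b = attempt;	/* second attempt picks the other bucket */

		if (bucket_lock_held(b))
			continue;	/* internal -EBUSY: try the other bucket */
		return bucket_alloc(b, size);
	}
	return NULL;			/* both busy: caller just sees NULL */
}

int main(void)
{
	bucket_locked[0] = true;	/* pretend bucket 0 is locked on this cpu */
	void *p = model_kmalloc_nolock(64);

	printf("got %p from the fallback bucket\n", p);
	free(p);
	return 0;
}

This is only a model of the decision, of course; the real code also has
to deal with the per-cpu state, RT, and the ERR_PTR(-EBUSY) plumbing.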
> - so local_lock_is_locked() will never be true here on RT

hehe :)
To have good coverage I fuzz test this patch set with:

+extern void (*debug_callback)(void);
+#define local_unlock_irqrestore(lock, flags)			\
+	do {							\
+		if (debug_callback) debug_callback();		\
+		__local_unlock_irqrestore(lock, flags);		\
+	} while (0)

and randomly re-enter everywhere from debug_callback().

> > +               }
> > +       }
> >
> >         p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> > +out:
> >  #ifdef CONFIG_PREEMPT_COUNT
> >         slub_put_cpu_ptr(s->cpu_slab);
> >  #endif
> > @@ -4162,8 +4213,9 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> >                 if (p[i] && init && (!kasan_init ||
> >                                      !kasan_has_integrated_init()))
> >                         memset(p[i], 0, zero_size);
> > -               kmemleak_alloc_recursive(p[i], s->object_size, 1,
> > -                                        s->flags, init_flags);
> > +               if (gfpflags_allow_spinning(flags))
> > +                       kmemleak_alloc_recursive(p[i], s->object_size, 1,
> > +                                                s->flags, init_flags);
> >                 kmsan_slab_alloc(s, p[i], init_flags);
> >                 alloc_tagging_slab_alloc_hook(s, p[i], flags);
> >         }
> > @@ -4354,6 +4406,88 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
> >  }
> >  EXPORT_SYMBOL(__kmalloc_noprof);
> >
> > +/**
> > + * kmalloc_nolock - Allocate an object of given size from any context.
> > + * @size: size to allocate
> > + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> > + * @node: node number of the target node.
> > + *
> > + * Return: pointer to the new object or NULL in case of error.
> > + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> > + * There is no reason to call it again and expect !NULL.
> > + */
> > +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> > +{
> > +       gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> > +       struct kmem_cache *s;
> > +       bool can_retry = true;
> > +       void *ret = ERR_PTR(-EBUSY);
> > +
> > +       VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> > +
> > +       if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> > +               return NULL;
> > +       if (unlikely(!size))
> > +               return ZERO_SIZE_PTR;
> > +
> > +       if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
> > +               /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> > +               return NULL;
> > +retry:
> > +       s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
>
> The idea of retrying on different bucket is based on wrong assumptions and
> thus won't work as you expect. kmalloc_slab() doesn't select buckets truly
> randomly, but deterministically via hashing from a random per-boot seed and
> the _RET_IP_, as the security hardening goal is to make different kmalloc()
> callsites get different caches with high probability.

There is no relying on randomness. As Harry pointed out
in the other reply, there is one retry from a different bucket.
Everything is deterministic.

> And I wouldn't also recommend changing this for kmalloc_nolock_noprof() case
> as that could make the hardening weaker, and also not help for kernels that
> don't have it enabled, anyway.

This patch doesn't affect hardening.
If RANDOM_KMALLOC_CACHES is enabled it will affect all callers
of kmalloc_slab(), normal kmalloc and this kmalloc_nolock.
Protection is not weakened.

> > +
> > +       if (!(s->flags & __CMPXCHG_DOUBLE))
> > +               /*
> > +                * kmalloc_nolock() is not supported on architectures that
> > +                * don't implement cmpxchg16b.
> > +                */
> > +               return NULL;
> > +
> > +       /*
> > +        * Do not call slab_alloc_node(), since trylock mode isn't
> > +        * compatible with slab_pre_alloc_hook/should_failslab and
> > +        * kfence_alloc.
> > +        *
> > +        * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
> > +        * in irq saved region. It assumes that the same cpu will not
> > +        * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
> > +        * Therefore use in_nmi() to check whether particular bucket is in
> > +        * irq protected section.
> > +        */
> > +       if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> > +               ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
>
> Hm this is somewhat subtle. We're testing the local lock without having the
> cpu explicitly pinned. But the test only happens in_nmi() which implicitly
> is a context that won't migrate, so should work I think, but maybe should be
> more explicit in the comment?

Ok. I'll expand the comment right above this 'if'.

>
> >  /*
> >   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> >   * can perform fastpath freeing without additional function calls.
> > @@ -4605,10 +4762,36 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >                 barrier();
> >
> >                 if (unlikely(slab != c->slab)) {
>
> Note this unlikely() is actually a lie. It's actually unlikely that the free
> will happen on the same cpu and with the same slab still being c->slab,
> unless it's a free following shortly a temporary object allocation.

I didn't change it, since you would have called it
an unrelated change in the patch :)
I can prepare a separate single line patch to remove
unlikely() here, but it's a micro optimization unrelated to this set.

> > -                       __slab_free(s, slab, head, tail, cnt, addr);
> > +                       /* cnt == 0 signals that it's called from kfree_nolock() */
> > +                       if (unlikely(!cnt)) {
> > +                               /*
> > +                                * Use llist in cache_node ?
> > +                                * struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> > +                                */
> > +                               /*
> > +                                * __slab_free() can locklessly cmpxchg16 into a slab,
> > +                                * but then it might need to take spin_lock or local_lock
> > +                                * in put_cpu_partial() for further processing.
> > +                                * Avoid the complexity and simply add to a deferred list.
> > +                                */
> > +                               llist_add(head, &s->defer_free_objects);
> > +                       } else {
> > +                               free_deferred_objects(&s->defer_free_objects, addr);
>
> So I'm a bit wary that this is actually rather a fast path that might
> contend on the defer_free_objects from all cpus.

Well, in my current stress test I could only get this list
to contain a single digit number of objects.

> I'm wondering if we could make the list part of kmem_cache_cpu to distribute
> it,

doable, but
kmem_cache_cpu *c = raw_cpu_ptr(s->cpu_slab);
is preemptable, so there is a risk that
llist_add(.. , &c->defer_free_objects);
will be accessing per-cpu memory of another cpu.
llist_add() will work correctly, but cache line bounce is possible.
In kmem_cache I placed defer_free_objects after cpu_partial and oo,
so it should be cache hot.

> and hook the flushing e.g. to places where we do deactivate_slab() which
> should be much slower path,

I don't follow the idea.
If we don't process kmem_cache_cpu *c right here in do_slab_free()
this llist will get large.
So we have to process it here, but if we do, what's the point of
extra flush in deactivate_slab() ?
Especially with extra for_each_cpu() loop to reach all kmem_cache_cpu ?

> and also free_to_partial_list() to handle
> SLUB_TINY/caches with debugging enabled.
SLUB_TINY... ohh. I didn't try it. Will fix.
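P.S. for readers following the deferred-free discussion above, a tiny
userspace sketch (not the kernel code) of the idea: kfree_nolock()
pushes the object onto a lock-free list instead of taking slab locks,
and the next regular free drains it. defer_free() and drain_deferred()
are hypothetical stand-ins for llist_add() into s->defer_free_objects
and free_deferred_objects(); the cmpxchg push loop roughly mirrors what
llist_add() does.

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct dnode {
	struct dnode *next;
};

/* stands in for s->defer_free_objects (a struct llist_head) */
static _Atomic(struct dnode *) defer_list;

/* stands in for llist_add(): lock-free push usable from any context */
static void defer_free(void *obj)
{
	struct dnode *n = obj;		/* the freed object's memory doubles as the node */
	struct dnode *old = atomic_load(&defer_list);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&defer_list, &old, n));
}

/*
 * stands in for free_deferred_objects(): drains the backlog from a
 * context that is allowed to take the normal slab locks
 */
static void drain_deferred(void)
{
	struct dnode *n = atomic_exchange(&defer_list, NULL);	/* like llist_del_all() */

	while (n) {
		struct dnode *next = n->next;

		free(n);		/* the real free, now that locking is allowed */
		n = next;
	}
}

int main(void)
{
	/* "kfree_nolock()" path: cannot lock, so just queue the objects */
	defer_free(malloc(64));
	defer_free(malloc(64));

	/* next regular free path drains the queue */
	drain_deferred();
	return 0;
}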