From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 7 May 2025 12:44:24 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
To: Alexei Starovoitov
Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko, Sebastian Sewior,
 Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
 Steven Rostedt, Johannes Weiner, Matthew Wilcox
References: <20250501032718.65476-1-alexei.starovoitov@gmail.com>
 <20250501032718.65476-7-alexei.starovoitov@gmail.com>
 <4d3e5d4b-502b-459b-9779-c0bf55ef2a03@suse.cz>
From: Vlastimil Babka
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On 5/7/25 4:20 AM, Alexei Starovoitov wrote:
> On Tue, May 6, 2025 at 5:01 AM Vlastimil Babka wrote:
>>
>> On 5/1/25 05:27, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov
>>>
>>> kmalloc_nolock() relies on the ability of local_lock to detect the
>>> situation when it's locked.
>>> In !PREEMPT_RT, local_lock_is_locked() is true only when an NMI happened
>>> in the irq-saved region that protects _that specific_ per-cpu kmem_cache_cpu.
>>> In that case retry the operation in a different kmalloc bucket.
>>> The second attempt will likely succeed, since this cpu locked a
>>> different kmem_cache_cpu.
>>> When local_lock_is_locked() sees a locked memcg_stock.stock_lock,
>>> fall back to atomic operations.
>>>
>>> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when the
>>> per-cpu rt_spin_lock is locked by the current task. In this case re-entrance
>>> into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
>>> a different bucket that is most likely not locked by the current
>>> task. Though it may be locked by a different task, it's safe to
>>> rt_spin_lock() on it.
>>>
>>> Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
>>> immediately if called from hard irq or NMI in PREEMPT_RT.
>>>
>>> Signed-off-by: Alexei Starovoitov
>>
>> In general I'd prefer if we could avoid local_lock_is_locked() usage outside
>> of debugging code. It just feels hacky given we have local_trylock()
>> operations. But I can see how this makes things simpler so it's probably
>> acceptable.
>
> local_lock_is_locked() is not for debugging.
> It's gating further calls into slub internals.
> If a particular bucket is locked, the logic will use a different one.
> There is no local_trylock() at all here.
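A minimal sketch of the "use a different bucket" idea described above, for
readers following along. bucket_for_size() and try_alloc_from_bucket() are
hypothetical helper names; the actual series implements this inside
kmalloc_nolock()/__slab_alloc() using the -EBUSY signal quoted further down.

	/*
	 * Sketch only: if the chosen bucket's per-cpu lock is already held by
	 * this context (e.g. an NMI interrupted its irq-saved section), retry
	 * once with a larger size class whose lock is almost certainly free.
	 */
	void *kmalloc_nolock_sketch(size_t size, gfp_t flags)
	{
		struct kmem_cache *s = bucket_for_size(size);	/* hypothetical */
		void *p;

		if (!local_lock_is_locked(&s->cpu_slab->lock)) {
			p = try_alloc_from_bucket(s, flags);	/* hypothetical */
			if (!IS_ERR(p))
				return p;	/* object or NULL */
			/* got -EBUSY: the lock was found held deeper in the call chain */
		}
		s = bucket_for_size(size * 2);			/* next size class */
		if (local_lock_is_locked(&s->cpu_slab->lock))
			return NULL;
		p = try_alloc_from_bucket(s, flags);
		return IS_ERR(p) ? NULL : p;
	}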
It could be, but I can see how it would complicate things. Not worth it,
especially in case we manage to replace the current percpu scheme with
sheaves later.

> In that sense it's very different from alloc_pages_nolock().
> There we trylock first and if not successful go for plan B.
> For kmalloc_nolock() we first check whether local_lock_is_locked(),
> and if not, then proceed and do
> local_lock_irqsave_check() instead of local_lock_irqsave().
> Both are unconditional and exactly the same without
> CONFIG_DEBUG_LOCK_ALLOC.

Right.

> Extra checks are there in the _check() version for debugging,
> since local_lock_is_locked() is called much earlier in the call chain
> and far from local_lock_irqsave(). So it's not trivial to see by just
> reading the code.

Right, we rely on the implication that once we find local_lock_is_locked()
is false, it cannot suddenly become locked later even if we re-enable
preemption in the meantime.

> If local_lock_is_locked() says that it's locked
> we go for a different bucket which is pretty much guaranteed to
> be unlocked.

OK.

>>> +		folio = (struct folio *)p;
>>> +	} else if (node == NUMA_NO_NODE)
>>>  		folio = (struct folio *)alloc_frozen_pages(flags, order);
>>>  	else
>>>  		folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);
>>
>>> @@ -3958,8 +3989,28 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>  	 */
>>>  	c = slub_get_cpu_ptr(s->cpu_slab);
>>>  #endif
>>> +	if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
>>> +		struct slab *slab;
>>> +
>>> +		slab = c->slab;
>>> +		if (slab && !node_match(slab, node))
>>> +			/* In trylock mode numa node is a hint */
>>> +			node = NUMA_NO_NODE;
>>> +
>>> +		if (!local_lock_is_locked(&s->cpu_slab->lock)) {
>>> +			lockdep_assert_not_held(this_cpu_ptr(&s->cpu_slab->lock));
>>> +		} else {
>>> +			/*
>>> +			 * EBUSY is an internal signal to kmalloc_nolock() to
>>> +			 * retry a different bucket. It's not propagated further.
>>> +			 */
>>> +			p = ERR_PTR(-EBUSY);
>>> +			goto out;
>>
>> Am I right in my reasoning as follows?
>>
>> - If we're on RT and "in_nmi() || in_hardirq()" is true then
>>   kmalloc_nolock_noprof() would return NULL immediately and we never reach
>>   this code
>
> correct.
>
>> - local_lock_is_locked() on RT tests if the current process is the lock
>>   owner. This means (in the absence of double locking bugs) that we locked
>>   it as a task (or hardirq) and now we're either in_hardirq() (doesn't
>>   change current AFAIK?) preempting the task, or in_nmi() preempting the
>>   task or hardirq.
>
> not quite.
> There could be re-entrance due to kprobe/fentry/tracepoint.
> Like trace_contention_begin().
> The code is still preemptable.

Hm right. Glad that I asked then, and thanks for making me realize.

>> - so local_lock_is_locked() will never be true here on RT
>
> hehe :)
>
> To have good coverage I fuzz test this patch set with:
>
> +extern void (*debug_callback)(void);
> +#define local_unlock_irqrestore(lock, flags)		\
> +	do {						\
> +		if (debug_callback) debug_callback();	\
> +		__local_unlock_irqrestore(lock, flags);	\
> +	} while (0)
>
> and randomly re-enter everywhere from debug_callback().

Oh cool :)
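A rough illustration of what such a callback could do, assuming a hypothetical
fuzz_reentry_callback() and the kmalloc_nolock()/kfree_nolock() calls from this
series (exact signature assumed); it re-enters the nolock API from inside the
unlock path every now and then to exercise the is-locked detection:

	static void fuzz_reentry_callback(void)
	{
		static atomic_t calls;
		void *p;

		/* re-enter only occasionally to keep the overhead sane */
		if (atomic_inc_return(&calls) % 16)
			return;

		p = kmalloc_nolock(32, 0, NUMA_NO_NODE);	/* signature assumed */
		if (p)
			kfree_nolock(p);
	}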
>>> /*
>>>  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
>>>  * can perform fastpath freeing without additional function calls.
>>> @@ -4605,10 +4762,36 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>>>  	barrier();
>>>
>>>  	if (unlikely(slab != c->slab)) {
>>
>> Note this unlikely() is actually a lie. It's actually unlikely that the free
>> will happen on the same cpu and with the same slab still being c->slab,
>> unless it's a free shortly following a temporary object allocation.
>
> I didn't change it, since you would have called it
> an unrelated change in the patch :)

Right, I'm not suggesting to change it here and now, just that it might be
misleading in the sense that this is a slowpath and we're fine doing expensive
things here - I'm arguing we should be careful.

> I can prepare a separate single-line patch to remove unlikely() here,
> but it's a micro-optimization unrelated to this set.
>
>>> -		__slab_free(s, slab, head, tail, cnt, addr);
>>> +		/* cnt == 0 signals that it's called from kfree_nolock() */
>>> +		if (unlikely(!cnt)) {
>>> +			/*
>>> +			 * Use llist in cache_node ?
>>> +			 * struct kmem_cache_node *n = get_node(s, slab_nid(slab));
>>> +			 */
>>> +			/*
>>> +			 * __slab_free() can locklessly cmpxchg16 into a slab,
>>> +			 * but then it might need to take spin_lock or local_lock
>>> +			 * in put_cpu_partial() for further processing.
>>> +			 * Avoid the complexity and simply add to a deferred list.
>>> +			 */
>>> +			llist_add(head, &s->defer_free_objects);
>>> +		} else {
>>> +			free_deferred_objects(&s->defer_free_objects, addr);
>>
>> So I'm a bit wary that this is actually rather a fast path that might
>> contend on the defer_free_objects from all cpus.
>
> Well, in my current stress test I could only get this list
> to contain a single-digit number of objects.

My worry isn't that the list would get long, but that we'd be checking it on
almost any kfree() (that's not lucky enough to have slab == c->slab) from all
cpus. And every kfree_nolock() will have to make the cache line of
s->defer_free_objects exclusive to that cpu in the llist_add(), and then all
other cpus doing kfree() will have to refetch it while making it shared
again...

>> I'm wondering if we could make the list part of kmem_cache_cpu to
>> distribute it,
>
> doable, but kmem_cache_cpu *c = raw_cpu_ptr(s->cpu_slab);
> is preemptable, so there is a risk that
> llist_add(.. , &c->defer_free_objects);
> will be accessing per-cpu memory of another cpu.
> llist_add() will work correctly, but a cache line bounce is possible.

The cache line bounce due to occasional preemption should be much rarer than
on the single s->defer_free_objects cache line as described above.

> In kmem_cache I placed defer_free_objects after cpu_partial and oo,
> so it should be cache hot.

OTOH wouldn't that make the bouncing worse?

>> and hook the flushing e.g. to places where we do deactivate_slab(), which
>> should be a much slower path,
>
> I don't follow the idea.
> If we don't process kmem_cache_cpu *c right here in do_slab_free()
> this llist will get large.

I'd hope deactivate_slab() should be frequent enough; it means mainly that the
local cpu's slab was exhausted. If there are many frees there are probably
many allocations too. But ok, maybe having the llist local to the cpu would be
sufficient to make it feasible to check it on every kfree() cheaply enough.
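For completeness, a minimal sketch of the per-cpu variant discussed above,
assuming a hypothetical defer_free_objects member in struct kmem_cache_cpu and
hypothetical helper names (the posted patch instead keeps a single llist_head
in struct kmem_cache):

	/*
	 * kfree_nolock() path: queue the object instead of taking slab locks.
	 * The object's own memory serves as the llist_node, similar to how
	 * SLUB already chains free objects.
	 */
	static void defer_free(struct kmem_cache *s, void *object)
	{
		llist_add(object, this_cpu_ptr(&s->cpu_slab->defer_free_objects));
	}

	/*
	 * Drained from a slower path such as deactivate_slab(), or
	 * opportunistically from a later kfree() on the same cpu.
	 */
	static void flush_deferred_objects(struct kmem_cache *s,
					   struct kmem_cache_cpu *c,
					   unsigned long addr)
	{
		struct llist_node *pos, *t;

		llist_for_each_safe(pos, t, llist_del_all(&c->defer_free_objects))
			__slab_free(s, virt_to_slab(pos), pos, pos, 1, addr);
	}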