Message-ID: <4d3e5d4b-502b-459b-9779-c0bf55ef2a03@suse.cz>
Date: Tue, 6 May 2025 14:01:48 +0200
Subject: Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
To: Alexei Starovoitov, bpf@vger.kernel.org, linux-mm@kvack.org
Cc: harry.yoo@oracle.com, shakeel.butt@linux.dev, mhocko@suse.com,
 bigeasy@linutronix.de, andrii@kernel.org, memxor@gmail.com,
 akpm@linux-foundation.org, peterz@infradead.org, rostedt@goodmis.org,
 hannes@cmpxchg.org, willy@infradead.org
References: <20250501032718.65476-1-alexei.starovoitov@gmail.com>
 <20250501032718.65476-7-alexei.starovoitov@gmail.com>
From: Vlastimil Babka
In-Reply-To: <20250501032718.65476-7-alexei.starovoitov@gmail.com>

On 5/1/25 05:27, Alexei Starovoitov wrote:
> From: Alexei Starovoitov
>
> kmalloc_nolock() relies on ability of local_lock to detect the situation
> when it's locked.
> In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> In that case retry the operation in a different kmalloc bucket.
> The second attempt will likely succeed, since this cpu locked
> different kmem_cache_cpu.
> When lock_local_is_locked() sees locked memcg_stock.stock_lock
> fallback to atomic operations.
>
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> a different bucket that is most likely is not locked by current
> task. Though it may be locked by a different task it's safe to
> rt_spin_lock() on it.
>
> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> immediately if called from hard irq or NMI in PREEMPT_RT.
>
> Signed-off-by: Alexei Starovoitov

In general I'd prefer if we could avoid local_lock_is_locked() usage
outside of debugging code. It just feels hacky given we have
local_trylock() operations. But I can see how this makes things simpler,
so it's probably acceptable.

> @@ -2458,13 +2468,21 @@ static void *setup_object(struct kmem_cache *s, void *object)
>   * Slab allocation and freeing
>   */
>  static inline struct slab *alloc_slab_page(gfp_t flags, int node,
> -		struct kmem_cache_order_objects oo)
> +		struct kmem_cache_order_objects oo,
> +		bool allow_spin)
>  {
>  	struct folio *folio;
>  	struct slab *slab;
>  	unsigned int order = oo_order(oo);
>
> -	if (node == NUMA_NO_NODE)
> +	if (unlikely(!allow_spin)) {
> +		struct page *p = alloc_pages_nolock(__GFP_COMP, node, order);
> +
> +		if (p)
> +			/* Make the page frozen. Drop refcnt to zero. */
> +			put_page_testzero(p);

This is dangerous. Once we create a refcounted (non-frozen) page,
someone else (a pfn scanner like compaction) can do a
get_page_unless_zero(), so the refcount becomes 2. Then we decrement
the refcount here to 1, the pfn scanner realizes it's not a page it can
work with, does put_page(), and frees it under us.

The solution is to split out alloc_frozen_pages_nolock() to use from
here, and make alloc_pages_nolock() use it too and then set refcounted.
(A rough sketch of what I mean is further below.)

> +		folio = (struct folio *)p;
> +	} else if (node == NUMA_NO_NODE)
>  		folio = (struct folio *)alloc_frozen_pages(flags, order);
>  	else
>  		folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);
> @@ -3958,8 +3989,28 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	 */
>  	c = slub_get_cpu_ptr(s->cpu_slab);
>  #endif
> +	if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
> +		struct slab *slab;
> +
> +		slab = c->slab;
> +		if (slab && !node_match(slab, node))
> +			/* In trylock mode numa node is a hint */
> +			node = NUMA_NO_NODE;
> +
> +		if (!local_lock_is_locked(&s->cpu_slab->lock)) {
> +			lockdep_assert_not_held(this_cpu_ptr(&s->cpu_slab->lock));
> +		} else {
> +			/*
> +			 * EBUSY is an internal signal to kmalloc_nolock() to
> +			 * retry a different bucket. It's not propagated further.
> +			 */
> +			p = ERR_PTR(-EBUSY);
> +			goto out;

Am I right in my reasoning as follows?

- If we're on RT and "in_nmi() || in_hardirq()" is true then
  kmalloc_nolock_noprof() would return NULL immediately and we never
  reach this code
- local_lock_is_locked() on RT tests if the current process is the lock
  owner. This means (in absence of double locking bugs) that we locked
  it as task (or hardirq) and now we're either in_hardirq() (doesn't
  change current AFAIK?) preempting task, or in_nmi() preempting task or
  hardirq.
- so local_lock_is_locked() will never be true here on RT

> +		}
> +	}
>
>  	p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> +out:
>  #ifdef CONFIG_PREEMPT_COUNT
>  	slub_put_cpu_ptr(s->cpu_slab);
>  #endif
> @@ -4162,8 +4213,9 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>  		if (p[i] && init && (!kasan_init ||
>  				     !kasan_has_integrated_init()))
>  			memset(p[i], 0, zero_size);
> -		kmemleak_alloc_recursive(p[i], s->object_size, 1,
> -					 s->flags, init_flags);
> +		if (gfpflags_allow_spinning(flags))
> +			kmemleak_alloc_recursive(p[i], s->object_size, 1,
> +						 s->flags, init_flags);
>  		kmsan_slab_alloc(s, p[i], init_flags);
>  		alloc_tagging_slab_alloc_hook(s, p[i], flags);
>  	}
> @@ -4354,6 +4406,88 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
>  }
>  EXPORT_SYMBOL(__kmalloc_noprof);
>
> +/**
> + * kmalloc_nolock - Allocate an object of given size from any context.
> + * @size: size to allocate
> + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> + * @node: node number of the target node.
> + *
> + * Return: pointer to the new object or NULL in case of error.
> + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> + * There is no reason to call it again and expect !NULL.
> + */
> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> +{
> +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> +	struct kmem_cache *s;
> +	bool can_retry = true;
> +	void *ret = ERR_PTR(-EBUSY);
> +
> +	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> +
> +	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> +		return NULL;
> +	if (unlikely(!size))
> +		return ZERO_SIZE_PTR;
> +
> +	if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
> +		/* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> +		return NULL;
> +retry:
> +	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);

The idea of retrying on a different bucket is based on wrong assumptions
and thus won't work as you expect. kmalloc_slab() doesn't select buckets
truly randomly, but deterministically via hashing from a random per-boot
seed and the _RET_IP_, as the security hardening goal is to make
different kmalloc() callsites get different caches with high
probability. (A rough sketch of that index computation is below, after
the rest of my comments on this function.)

And I also wouldn't recommend changing this for the
kmalloc_nolock_noprof() case, as that could make the hardening weaker,
and it also wouldn't help for kernels that don't have it enabled,
anyway.

> +
> +	if (!(s->flags & __CMPXCHG_DOUBLE))
> +		/*
> +		 * kmalloc_nolock() is not supported on architectures that
> +		 * don't implement cmpxchg16b.
> +		 */
> +		return NULL;
> +
> +	/*
> +	 * Do not call slab_alloc_node(), since trylock mode isn't
> +	 * compatible with slab_pre_alloc_hook/should_failslab and
> +	 * kfence_alloc.
> +	 *
> +	 * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
> +	 * in irq saved region. It assumes that the same cpu will not
> +	 * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
> +	 * Therefore use in_nmi() to check whether particular bucket is in
> +	 * irq protected section.
> +	 */
> +	if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> +		ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);

Hm this is somewhat subtle. We're testing the local lock without having
the cpu explicitly pinned. But the test only happens in_nmi(), which
implicitly is a context that won't migrate, so it should work I think,
but maybe it should be made more explicit in the comment?
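To make the alloc_frozen_pages_nolock() suggestion above more concrete,
roughly something like this (just a sketch; alloc_frozen_pages_nolock()
doesn't exist yet and its body would be whatever alloc_pages_nolock()
does today, minus making the page refcounted):

/*
 * The frozen variant returns the page with refcount 0, so a pfn scanner
 * can never observe a transient refcounted state. The existing public
 * API then becomes a trivial wrapper.
 */
struct page *alloc_pages_nolock(gfp_t gfp_flags, int nid, unsigned int order)
{
	struct page *page = alloc_frozen_pages_nolock(gfp_flags, nid, order);

	if (page)
		set_page_refcounted(page);
	return page;
}

and alloc_slab_page() would call alloc_frozen_pages_nolock() directly,
with no put_page_testzero() dance needed.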
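As for the bucket selection, this is roughly (from memory, not the
exact code) what the index computation boils down to when
CONFIG_RANDOM_KMALLOC_CACHES is enabled:

	/*
	 * kmalloc_type(): the caller address is hashed together with a
	 * per-boot random seed, so a given callsite maps to the same
	 * bucket on every call for the whole uptime.
	 */
	return KMALLOC_RANDOM_START +
	       hash_64(caller ^ random_kmalloc_seed,
		       ilog2(RANDOM_KMALLOC_CACHES_NR + 1));

Retrying with the same _RET_IP_ therefore just recomputes the same
index, and without the config option there's only the single
KMALLOC_NORMAL bucket per size to begin with.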
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
>   * can perform fastpath freeing without additional function calls.
> @@ -4605,10 +4762,36 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>  	barrier();
>
>  	if (unlikely(slab != c->slab)) {

Note this unlikely() is actually a lie. It's in fact unlikely that the
free will happen on the same cpu with the same slab still being c->slab,
unless it's a free that shortly follows a temporary object allocation.

> -		__slab_free(s, slab, head, tail, cnt, addr);
> +		/* cnt == 0 signals that it's called from kfree_nolock() */
> +		if (unlikely(!cnt)) {
> +			/*
> +			 * Use llist in cache_node ?
> +			 * struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> +			 */
> +			/*
> +			 * __slab_free() can locklessly cmpxchg16 into a slab,
> +			 * but then it might need to take spin_lock or local_lock
> +			 * in put_cpu_partial() for further processing.
> +			 * Avoid the complexity and simply add to a deferred list.
> +			 */
> +			llist_add(head, &s->defer_free_objects);
> +		} else {
> +			free_deferred_objects(&s->defer_free_objects, addr);

So I'm a bit wary that this is actually rather a fast path that might
contend on the defer_free_objects list from all cpus. I'm wondering if
we could make the list part of kmem_cache_cpu to distribute it, and
hook the flushing e.g. to places where we do deactivate_slab(), which
should be a much slower path, and also free_to_partial_list() to handle
SLUB_TINY/caches with debugging enabled. (A rough sketch of what I mean
is at the end of this mail.)

> +			__slab_free(s, slab, head, tail, cnt, addr);
> +		}
>  		return;
>  	}
>
> +	if (unlikely(!cnt)) {
> +		if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
> +		    local_lock_is_locked(&s->cpu_slab->lock)) {
> +			llist_add(head, &s->defer_free_objects);
> +			return;
> +		}
> +		cnt = 1;
> +		kasan_slab_free(s, head, false, false, true);
> +	}
> +
>  	if (USE_LOCKLESS_FAST_PATH()) {
>  		freelist = READ_ONCE(c->freelist);
>
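Finally, the deferred-list idea above as a very rough sketch (field and
helper names are made up here, and the SLUB_TINY/kasan/accounting
details are ignored):

struct kmem_cache_cpu {
	/* ... existing fields ... */
	struct llist_head defer_free;	/* objects queued by kfree_nolock() */
};

static void flush_deferred_free(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
	struct llist_node *head, *pos, *t;

	if (llist_empty(&c->defer_free))
		return;

	head = llist_del_all(&c->defer_free);
	llist_for_each_safe(pos, t, head)
		/* we're in a context where taking the locks is fine again */
		__slab_free(s, virt_to_slab(pos), pos, pos, 1, _RET_IP_);
}

Called from e.g. deactivate_slab() and free_to_partial_list(), the
flushing would piggyback on paths that are slow anyway instead of
running on every do_slab_free().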