From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vlastimil Babka <vbabka@suse.cz>
Date: Thu, 23 Oct 2025 15:52:40 +0200
Subject: [PATCH RFC 18/19] slab: update overview comments
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Message-Id: <20251023-sheaves-for-all-v1-18-6ffa2c9941c0@suse.cz>
References: <20251023-sheaves-for-all-v1-0-6ffa2c9941c0@suse.cz>
In-Reply-To: <20251023-sheaves-for-all-v1-0-6ffa2c9941c0@suse.cz>
To: Andrew Morton, Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo
Cc: Uladzislau Rezki, "Liam R. Howlett", Suren Baghdasaryan,
 Sebastian Andrzej Siewior, Alexei Starovoitov, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, linux-rt-devel@lists.linux.dev,
 bpf@vger.kernel.org, kasan-dev@googlegroups.com, Vlastimil Babka
X-Mailer: b4 0.14.3

The changes related to sheaves made the description of locking and other
details outdated. Update it to reflect the current state. Also add a new
copyright line due to major changes.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 141 +++++++++++++++++++++++++++++---------------------------------
 1 file changed, 67 insertions(+), 74 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 4e003493ba60..515a2b59cb52 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1,13 +1,15 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
- * SLUB: A slab allocator that limits cache line use instead of queuing
- * objects in per cpu and per node lists.
+ * SLUB: A slab allocator with low overhead percpu array caches and mostly
+ * lockless freeing of objects to slabs in the slowpath.
  *
- * The allocator synchronizes using per slab locks or atomic operations
- * and only uses a centralized lock to manage a pool of partial slabs.
+ * The allocator synchronizes using spin_trylock for percpu arrays in the
+ * fastpath, and cmpxchg_double (or bit spinlock) for slowpath freeing.
+ * Uses a centralized lock to manage a pool of partial slabs.
  *
  * (C) 2007 SGI, Christoph Lameter
  * (C) 2011 Linux Foundation, Christoph Lameter
+ * (C) 2025 SUSE, Vlastimil Babka
  */
 
 #include 
@@ -53,11 +55,13 @@
 
 /*
  * Lock order:
- *   1. slab_mutex (Global Mutex)
- *   2. node->list_lock (Spinlock)
- *   3. kmem_cache->cpu_slab->lock (Local lock)
- *   4. slab_lock(slab) (Only on some arches)
- *   5. object_map_lock (Only for debugging)
+ *   0. cpu_hotplug_lock
+ *   1. slab_mutex (Global Mutex)
+ *   2a. kmem_cache->cpu_sheaves->lock (Local trylock)
+ *   2b. node->barn->lock (Spinlock)
+ *   2c. node->list_lock (Spinlock)
+ *   3. slab_lock(slab) (Only on some arches)
+ *   4. object_map_lock (Only for debugging)
  *
  *   slab_mutex
  *
@@ -78,31 +82,38 @@
  * C. slab->objects -> Number of objects in slab
  * D. slab->frozen -> frozen state
  *
- * Frozen slabs
+ * SL_partial slabs
+ *
+ * Slabs on node partial list have at least one free object. A limited number
+ * of slabs on the list can be fully free (slab->inuse == 0), until we start
+ * discarding them. These slabs are marked with SL_partial, and the flag is
+ * cleared while removing them, usually to grab their freelist afterwards.
+ * This clearing also exempts them from list management. Please see
+ * __slab_free() for more details.
  *
- * If a slab is frozen then it is exempt from list management. It is
- * the cpu slab which is actively allocated from by the processor that
- * froze it and it is not on any list. The processor that froze the
- * slab is the one who can perform list operations on the slab. Other
- * processors may put objects onto the freelist but the processor that
- * froze the slab is the only one that can retrieve the objects from the
- * slab's freelist.
+ * Full slabs
  *
- * CPU partial slabs
+ * For caches without debugging enabled, full slabs (slab->inuse ==
+ * slab->objects and slab->freelist == NULL) are not placed on any list.
+ * The __slab_free() freeing the first object from such a slab will place
+ * it on the partial list. Caches with debugging enabled place such slab
+ * on the full list and use different allocation and freeing paths.
+ *
+ * Frozen slabs
  *
- * The partially empty slabs cached on the CPU partial list are used
- * for performance reasons, which speeds up the allocation process.
- * These slabs are not frozen, but are also exempt from list management,
- * by clearing the SL_partial flag when moving out of the node
- * partial list. Please see __slab_free() for more details.
+ * If a slab is frozen then it is exempt from list management. It is used to
+ * indicate a slab that has failed consistency checks and thus cannot be
+ * allocated from anymore - it is also marked as full. Any previously
+ * allocated objects will be simply leaked upon freeing instead of attempting
+ * to modify the potentially corrupted freelist and metadata.
  *
  * To sum up, the current scheme is:
- * - node partial slab: SL_partial && !frozen
- * - cpu partial slab: !SL_partial && !frozen
- * - cpu slab: !SL_partial && frozen
- * - full slab: !SL_partial && !frozen
+ * - node partial slab: SL_partial && !full && !frozen
+ * - taken off partial list: !SL_partial && !full && !frozen
+ * - full slab, not on any list: !SL_partial && full && !frozen
+ * - frozen due to inconsistency: !SL_partial && full && frozen
  *
- *   list_lock
+ *   node->list_lock (spinlock)
  *
  *   The list_lock protects the partial and full list on each node and
  *   the partial slab counter. If taken then no new slabs may be added or
@@ -112,47 +123,46 @@
  *
  *   The list_lock is a centralized lock and thus we avoid taking it as
  *   much as possible. As long as SLUB does not have to handle partial
- *   slabs, operations can continue without any centralized lock. F.e.
- *   allocating a long series of objects that fill up slabs does not require
- *   the list lock.
+ *   slabs, operations can continue without any centralized lock.
  *
  *   For debug caches, all allocations are forced to go through a list_lock
  *   protected region to serialize against concurrent validation.
  *
- *   cpu_slab->lock local lock
+ *   cpu_sheaves->lock (local_trylock)
  *
- *   This locks protect slowpath manipulation of all kmem_cache_cpu fields
- *   except the stat counters. This is a percpu structure manipulated only by
- *   the local cpu, so the lock protects against being preempted or interrupted
- *   by an irq. Fast path operations rely on lockless operations instead.
+ *   This lock protects fastpath operations on the percpu sheaves. On !RT it
+ *   only disables preemption and does no atomic operations. As long as the main
+ *   or spare sheaf can handle the allocation or free, there is no other
+ *   overhead.
  *
- *   On PREEMPT_RT, the local lock neither disables interrupts nor preemption
- *   which means the lockless fastpath cannot be used as it might interfere with
- *   an in-progress slow path operations. In this case the local lock is always
- *   taken but it still utilizes the freelist for the common operations.
+ *   node->barn->lock (spinlock)
  *
- *   lockless fastpaths
+ *   This lock protects the operations on per-NUMA-node barn. It can quickly
+ *   serve an empty or full sheaf if available, and avoid more expensive refill
+ *   or flush operation.
  *
- *   The fast path allocation (slab_alloc_node()) and freeing (do_slab_free())
- *   are fully lockless when satisfied from the percpu slab (and when
- *   cmpxchg_double is possible to use, otherwise slab_lock is taken).
- *   They also don't disable preemption or migration or irqs. They rely on
- *   the transaction id (tid) field to detect being preempted or moved to
- *   another cpu.
+ *   Lockless freeing
+ *
+ *   Objects may have to be freed to their slabs when they are from a remote
+ *   node (where we want to avoid filling local sheaves with remote objects)
+ *   or when there are too many full sheaves. On architectures supporting
+ *   cmpxchg_double this is done by a lockless update of slab's freelist and
+ *   counters, otherwise slab_lock is taken. This only needs to take the
+ *   list_lock if it's a first free to a full slab, or when there are too many
+ *   fully free slabs and some need to be discarded.
  *
  *   irq, preemption, migration considerations
  *
- *   Interrupts are disabled as part of list_lock or local_lock operations, or
+ *   Interrupts are disabled as part of list_lock or barn lock operations, or
  *   around the slab_lock operation, in order to make the slab allocator safe
  *   to use in the context of an irq.
+ *   Preemption is disabled as part of local_trylock operations.
+ *   kmalloc_nolock() and kfree_nolock() are safe in NMI context but see
+ *   their limitations.
  *
- *   In addition, preemption (or migration on PREEMPT_RT) is disabled in the
- *   allocation slowpath, bulk allocation, and put_cpu_partial(), so that the
- *   local cpu doesn't change in the process and e.g. the kmem_cache_cpu pointer
- *   doesn't have to be revalidated in each section protected by the local lock.
- *
- * SLUB assigns one slab for allocation to each processor.
- * Allocations only occur from these slabs called cpu slabs.
+ * SLUB assigns two object arrays called sheaves for caching allocation and
+ * frees on each cpu, with a NUMA node shared barn for balancing between cpus.
+ * Allocations and frees are primarily served from these sheaves.
  *
  * Slabs with free elements are kept on a partial list and during regular
  * operations no list for full slabs is used. If an object in a full slab is
@@ -160,25 +170,8 @@
  * We track full slabs for debugging purposes though because otherwise we
  * cannot scan all objects.
  *
- * Slabs are freed when they become empty. Teardown and setup is
- * minimal so we rely on the page allocators per cpu caches for
- * fast frees and allocs.
- *
- * slab->frozen		The slab is frozen and exempt from list processing.
- * 			This means that the slab is dedicated to a purpose
- * 			such as satisfying allocations for a specific
- * 			processor. Objects may be freed in the slab while
- * 			it is frozen but slab_free will then skip the usual
- * 			list operations. It is up to the processor holding
- * 			the slab to integrate the slab into the slab lists
- * 			when the slab is no longer needed.
- *
- * 			One use of this flag is to mark slabs that are
- * 			used for allocations. Then such a slab becomes a cpu
- * 			slab. The cpu slab may be equipped with an additional
- * 			freelist that allows lockless access to
- * 			free objects in addition to the regular freelist
- * 			that requires the slab lock.
+ * Slabs are freed when they become empty. Teardown and setup is minimal so we
+ * rely on the page allocators per cpu caches for fast frees and allocs.
  *
  * SLAB_DEBUG_FLAGS	Slab requires special handling due to debug
  * 			options set. This moves slab handling out of

-- 
2.51.1
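
P.S. For reviewers less familiar with the sheaves design, below are a few
small userspace sketches that model what the updated comments describe.
They are written for this mail only, not taken from the series; struct
layouts and names like SHEAF_CAPACITY or sheaf_model are made up.

The first models the cpu_sheaves->lock fastpath: a sheaf is just a bounded
array of cached object pointers, and in the kernel this code would run
under a successful local_trylock(), so no atomics are needed here.

#include <stdbool.h>
#include <stddef.h>

#define SHEAF_CAPACITY 32	/* made up; the real capacity is per-cache */

struct sheaf_model {
	unsigned int size;		/* number of cached objects */
	void *objects[SHEAF_CAPACITY];
};

/* Fastpath alloc: pop from the main sheaf; on a miss the caller
 * refills from the spare sheaf, the barn, or the slabs. */
static void *sheaf_alloc_fast(struct sheaf_model *main_sheaf)
{
	if (main_sheaf->size == 0)
		return NULL;
	return main_sheaf->objects[--main_sheaf->size];
}

/* Fastpath free: push into the main sheaf if it has room; otherwise
 * the caller swaps in the spare sheaf or flushes to the barn. */
static bool sheaf_free_fast(struct sheaf_model *main_sheaf, void *obj)
{
	if (main_sheaf->size == SHEAF_CAPACITY)
		return false;
	main_sheaf->objects[main_sheaf->size++] = obj;
	return true;
}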
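
The second sketch models the "Lockless freeing" slowpath as a C11
compare-and-swap loop on a slab's freelist head. This is simplified on
purpose: the real __slab_free() updates the freelist pointer and the
counters word together (cmpxchg_double, or slab_lock where unavailable)
to avoid ABA and to detect the full->partial transition; the
single-pointer CAS below does not capture that part.

#include <stdatomic.h>

/* Free objects are linked through their first word. */
struct free_obj {
	struct free_obj *next;
};

/* Stand-in for one slab's freelist head. */
static _Atomic(struct free_obj *) slab_freelist;

/* Link obj in front of the current head, retrying if another
 * cpu updated the head in the meantime. */
static void model_slab_free(struct free_obj *obj)
{
	struct free_obj *old = atomic_load_explicit(&slab_freelist,
						    memory_order_relaxed);
	do {
		obj->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&slab_freelist,
			&old, obj, memory_order_release,
			memory_order_relaxed));
}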
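
Finally, the "To sum up" list maps to three independent booleans; the
helpers below only restate the four listed states as predicates. The
struct is invented for the example; the real flags live in slab metadata.

#include <stdbool.h>

struct slab_state_model {
	bool sl_partial;	/* SL_partial: on the node partial list */
	bool full;		/* inuse == objects && freelist == NULL */
	bool frozen;		/* failed consistency checks */
};

static bool is_node_partial(struct slab_state_model s)
{
	return s.sl_partial && !s.full && !s.frozen;
}

static bool is_taken_off_partial(struct slab_state_model s)
{
	return !s.sl_partial && !s.full && !s.frozen;
}

static bool is_full_unlisted(struct slab_state_model s)
{
	return !s.sl_partial && s.full && !s.frozen;
}

static bool is_frozen_inconsistent(struct slab_state_model s)
{
	return !s.sl_partial && s.full && s.frozen;
}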