From: Vlastimil Babka <vbabka@suse.cz>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Christoph Lameter,
 David Rientjes, Pekka Enberg, Joonsoo Kim
Cc: Mike Galbraith, Sebastian Andrzej Siewior, Thomas Gleixner, Mel Gorman,
 Jesper Dangaard Brouer, Jann Horn, Vlastimil Babka
Subject: [PATCH v3 35/35] mm, slub: convert kmem_cpu_slab protection to local_lock
Date: Thu, 29 Jul 2021 15:21:32 +0200
Message-Id: <20210729132132.19691-36-vbabka@suse.cz>
In-Reply-To: <20210729132132.19691-1-vbabka@suse.cz>
References: <20210729132132.19691-1-vbabka@suse.cz>

Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions
of local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT
that's equivalent, with better lockdep visibility. On PREEMPT_RT that
means better preemption.

However, the cost on PREEMPT_RT is the loss of lockless fast paths, which
only work with the cpu freelist. Those are designed to detect and recover
from being preempted by other conflicting operations (both fast and slow
path), but the slow path operations assume they cannot be preempted by a
fast path operation, which is guaranteed naturally with disabled irqs.
With local locks on PREEMPT_RT, the fast paths now also need to take the
local lock to avoid races.

In the allocation fastpath slab_alloc_node() we can just defer to the
slowpath __slab_alloc(), which also works with the cpu freelist, but
under the local lock. In the free fastpath do_slab_free() we have to add
a new local lock protected version of freeing to the cpu freelist, as the
existing slowpath only works with the page freelist.

Also update the comment about the locking scheme in SLUB to reflect the
changes done by this series.
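As an illustration of the conversion pattern described above, here is a
minimal sketch using hypothetical names (example_cpu_cache, example_refill)
rather than the real SLUB structures: a local_lock_t is embedded in a percpu
structure and the plain local_irq_save()/restore() pairs become the irq-safe
local_lock variants. On !PREEMPT_RT this still disables irqs; on PREEMPT_RT
it takes a per-cpu spinlock and the section stays preemptible.

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  /* Hypothetical percpu cache, standing in for struct kmem_cache_cpu. */
  struct example_cpu_cache {
          local_lock_t lock;      /* protects the fields below */
          void **freelist;
          unsigned long tid;
  };

  static DEFINE_PER_CPU(struct example_cpu_cache, example_cache) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  static void example_refill(void **new_freelist)
  {
          unsigned long flags;

          /* was: local_irq_save(flags); */
          local_lock_irqsave(&example_cache.lock, flags);
          this_cpu_write(example_cache.freelist, new_freelist);
          /* was: local_irq_restore(flags); */
          local_unlock_irqrestore(&example_cache.lock, flags);
  }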
[ Mike Galbraith: use local_lock() without irq in PREEMPT_RT scope;
  debugging of RT crashes resulting in put_cpu_partial() locking changes ]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slub_def.h |   2 +
 mm/slub.c                | 146 ++++++++++++++++++++++++++++++---------
 2 files changed, 115 insertions(+), 33 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index dcde82a4434c..b5bcac29b979 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -10,6 +10,7 @@
 #include
 #include
 #include
+#include <linux/local_lock.h>
 
 enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
@@ -41,6 +42,7 @@ enum stat_item {
 	NR_SLUB_STAT_ITEMS };
 
 struct kmem_cache_cpu {
+	local_lock_t lock;	/* Protects the fields below except stat */
 	void **freelist;	/* Pointer to next available object */
 	unsigned long tid;	/* Globally unique transaction id */
 	struct page *page;	/* The slab from which we are allocating */
diff --git a/mm/slub.c b/mm/slub.c
index 91e04e20cf60..695ffaf28c25 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -46,13 +46,21 @@
 /*
  * Lock order:
  *   1. slab_mutex (Global Mutex)
- *   2. node->list_lock
- *   3. slab_lock(page) (Only on some arches and for debugging)
+ *   2. node->list_lock (Spinlock)
+ *   3. kmem_cache->cpu_slab->lock (Local lock)
+ *   4. slab_lock(page) (Only on some arches or for debugging)
+ *   5. object_map_lock (Only for debugging)
  *
  *   slab_mutex
  *
  *   The role of the slab_mutex is to protect the list of all the slabs
  *   and to synchronize major metadata changes to slab cache structures.
+ *   Also synchronizes memory hotplug callbacks.
+ *
+ *   slab_lock
+ *
+ *   The slab_lock is a wrapper around the page lock, thus it is a bit
+ *   spinlock.
  *
  *   The slab_lock is only used for debugging and on arches that do not
  *   have the ability to do a cmpxchg_double. It only protects:
@@ -61,6 +69,8 @@
  *	C. page->objects	-> Number of objects in page
  *	D. page->frozen		-> frozen state
  *
+ *   Frozen slabs
+ *
  *   If a slab is frozen then it is exempt from list management. It is not
  *   on any list except per cpu partial list. The processor that froze the
  *   slab is the one who can perform list operations on the page. Other
@@ -68,6 +78,8 @@
  *   froze the slab is the only one that can retrieve the objects from the
  *   page's freelist.
  *
+ *   list_lock
+ *
  *   The list_lock protects the partial and full list on each node and
  *   the partial slab counter. If taken then no new slabs may be added or
  *   removed from the lists nor make the number of partial slabs be modified.
@@ -79,10 +91,36 @@
  *   slabs, operations can continue without any centralized lock. F.e.
  *   allocating a long series of objects that fill up slabs does not require
  *   the list lock.
- *   Interrupts are disabled during allocation and deallocation in order to
- *   make the slab allocator safe to use in the context of an irq. In addition
- *   interrupts are disabled to ensure that the processor does not change
- *   while handling per_cpu slabs, due to kernel preemption.
+ *
+ *   cpu_slab->lock local lock
+ *
+ *   This locks protect slowpath manipulation of all kmem_cache_cpu fields
+ *   except the stat counters. This is a percpu structure manipulated only by
+ *   the local cpu, so the lock protects against being preempted or interrupted
+ *   by an irq. Fast path operations rely on lockless operations instead.
+ *   On PREEMPT_RT, the local lock does not actually disable irqs (and thus
+ *   prevent the lockless operations), so fastpath operations also need to take
+ *   the lock and are no longer lockless.
+ *
+ *   lockless fastpaths
+ *
+ *   The fast path allocation (slab_alloc_node()) and freeing (do_slab_free())
+ *   are fully lockless when satisfied from the percpu slab (and when
+ *   cmpxchg_double is possible to use, otherwise slab_lock is taken).
+ *   They also don't disable preemption or migration or irqs. They rely on
+ *   the transaction id (tid) field to detect being preempted or moved to
+ *   another cpu.
+ *
+ *   irq, preemption, migration considerations
+ *
+ *   Interrupts are disabled as part of list_lock or local_lock operations, or
+ *   around the slab_lock operation, in order to make the slab allocator safe
+ *   to use in the context of an irq.
+ *
+ *   In addition, preemption (or migration on PREEMPT_RT) is disabled in the
+ *   allocation slowpath, bulk allocation, and put_cpu_partial(), so that the
+ *   local cpu doesn't change in the process and e.g. the kmem_cache_cpu pointer
+ *   doesn't have to be revalidated in each section protected by the local lock.
  *
  * SLUB assigns one slab for allocation to each processor.
  * Allocations only occur from these slabs called cpu slabs.
@@ -2227,9 +2265,13 @@ static inline void note_cmpxchg_failure(const char *n,
 static void init_kmem_cache_cpus(struct kmem_cache *s)
 {
 	int cpu;
+	struct kmem_cache_cpu *c;
 
-	for_each_possible_cpu(cpu)
-		per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
+	for_each_possible_cpu(cpu) {
+		c = per_cpu_ptr(s->cpu_slab, cpu);
+		local_lock_init(&c->lock);
+		c->tid = init_tid(cpu);
+	}
 }
 
 /*
@@ -2440,10 +2482,10 @@ static void unfreeze_partials(struct kmem_cache *s)
 	struct page *partial_page;
 	unsigned long flags;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 	partial_page = this_cpu_read(s->cpu_slab->partial);
 	this_cpu_write(s->cpu_slab->partial, NULL);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 
 	if (partial_page)
 		__unfreeze_partials(s, partial_page);
@@ -2476,7 +2518,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 	int pages = 0;
 	int pobjects = 0;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 
 	oldpage = this_cpu_read(s->cpu_slab->partial);
 
@@ -2504,7 +2546,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 
 	this_cpu_write(s->cpu_slab->partial, page);
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 
 	if (page_to_unfreeze) {
 		__unfreeze_partials(s, page_to_unfreeze);
@@ -2528,7 +2570,7 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
 	struct page *page;
 
 	if (lock)
-		local_irq_save(flags);
+		local_lock_irqsave(&s->cpu_slab->lock, flags);
 
 	freelist = c->freelist;
 	page = c->page;
@@ -2538,7 +2580,7 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
 	c->tid = next_tid(c->tid);
 
 	if (lock)
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 
 	if (page)
 		deactivate_slab(s, page, freelist);
@@ -2826,9 +2868,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto deactivate_slab;
 
 	/* must check again c->page in case we got preempted and it changed */
-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 	if (unlikely(page != c->page)) {
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		goto reread_page;
 	}
 	freelist = c->freelist;
@@ -2839,7 +2881,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 
 	if (!freelist) {
 		c->page = NULL;
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		stat(s, DEACTIVATE_BYPASS);
 		goto new_slab;
 	}
@@ -2848,7 +2890,11 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 
 load_freelist:
 
-	lockdep_assert_irqs_disabled();
+#ifdef CONFIG_PREEMPT_RT
+	lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock.lock));
+#else
+	lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
+#endif
 
 	/*
 	 * freelist is pointing to the list of objects to be used.
@@ -2858,39 +2904,39 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	VM_BUG_ON(!c->page->frozen);
 	c->freelist = get_freepointer(s, freelist);
 	c->tid = next_tid(c->tid);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 	return freelist;
 
 deactivate_slab:
 
-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 	if (page != c->page) {
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		goto reread_page;
 	}
 	freelist = c->freelist;
 	c->page = NULL;
 	c->freelist = NULL;
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 	deactivate_slab(s, page, freelist);
 
 new_slab:
 
 	if (slub_percpu_partial(c)) {
-		local_irq_save(flags);
+		local_lock_irqsave(&s->cpu_slab->lock, flags);
 		if (unlikely(c->page)) {
-			local_irq_restore(flags);
+			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 			goto reread_page;
 		}
 		if (unlikely(!slub_percpu_partial(c))) {
-			local_irq_restore(flags);
+			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 			/* we were preempted and partial list got empty */
 			goto new_objects;
 		}
 
 		page = c->page = slub_percpu_partial(c);
 		slub_set_percpu_partial(c, page);
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		stat(s, CPU_PARTIAL_ALLOC);
 		goto redo;
 	}
@@ -2943,7 +2989,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 
 retry_load_page:
 
-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 	if (unlikely(c->page)) {
 		void *flush_freelist = c->freelist;
 		struct page *flush_page = c->page;
@@ -2952,7 +2998,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		c->freelist = NULL;
 		c->tid = next_tid(c->tid);
 
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 
 		deactivate_slab(s, flush_page, flush_freelist);
 
@@ -3071,7 +3117,15 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 
 	object = c->freelist;
 	page = c->page;
-	if (unlikely(!object || !page || !node_match(page, node))) {
+	/*
+	 * We cannot use the lockless fastpath on PREEMPT_RT because if a
+	 * slowpath has taken the local_lock_irqsave(), it is not protected
+	 * against a fast path operation in an irq handler. So we need to take
+	 * the slow path which uses local_lock. It is still relatively fast if
+	 * there is a suitable cpu freelist.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
+	    unlikely(!object || !page || !node_match(page, node))) {
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 	} else {
 		void *next_object = get_freepointer_safe(s, object);
@@ -3331,6 +3385,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	barrier();
 
 	if (likely(page == c->page)) {
+#ifndef CONFIG_PREEMPT_RT
 		void **freelist = READ_ONCE(c->freelist);
 
 		set_freepointer(s, tail_obj, freelist);
@@ -3343,6 +3398,31 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 			note_cmpxchg_failure("slab_free", s, tid);
 			goto redo;
 		}
+#else /* CONFIG_PREEMPT_RT */
+		/*
+		 * We cannot use the lockless fastpath on PREEMPT_RT because if
+		 * a slowpath has taken the local_lock_irqsave(), it is not
+		 * protected against a fast path operation in an irq handler. So
+		 * we need to take the local_lock. We shouldn't simply defer to
+		 * __slab_free() as that wouldn't use the cpu freelist at all.
+		 */
+		void **freelist;
+
+		local_lock(&s->cpu_slab->lock);
+		c = this_cpu_ptr(s->cpu_slab);
+		if (unlikely(page != c->page)) {
+			local_unlock(&s->cpu_slab->lock);
+			goto redo;
+		}
+		tid = c->tid;
+		freelist = c->freelist;
+
+		set_freepointer(s, tail_obj, freelist);
+		c->freelist = head;
+		c->tid = next_tid(tid);
+
+		local_unlock(&s->cpu_slab->lock);
+#endif
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, head, tail_obj, cnt, addr);
@@ -3513,7 +3593,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	 * handlers invoking normal fastpath.
 	 */
 	c = slub_get_cpu_ptr(s->cpu_slab);
-	local_irq_disable();
+	local_lock_irq(&s->cpu_slab->lock);
 
 	for (i = 0; i < size; i++) {
 		void *object = kfence_alloc(s, s->object_size, flags);
@@ -3534,7 +3614,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			 */
 			c->tid = next_tid(c->tid);
 
-			local_irq_enable();
+			local_unlock_irq(&s->cpu_slab->lock);
 
 			/*
 			 * Invoking slow path likely have side-effect
@@ -3548,7 +3628,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			c = this_cpu_ptr(s->cpu_slab);
 			maybe_wipe_obj_freeptr(s, p[i]);
 
-			local_irq_disable();
+			local_lock_irq(&s->cpu_slab->lock);
 
 			continue; /* goto for-loop */
 		}
@@ -3557,7 +3637,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 		maybe_wipe_obj_freeptr(s, p[i]);
 	}
 	c->tid = next_tid(c->tid);
-	local_irq_enable();
+	local_unlock_irq(&s->cpu_slab->lock);
 	slub_put_cpu_ptr(s->cpu_slab);
 
 	/*
-- 
2.32.0
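
The new PREEMPT_RT branch added to do_slab_free() above boils down to the
pattern sketched below, again with hypothetical names and a simplified
freelist where the first word of a free object points to the next free
object; the real code also handles tail objects, rechecks the cpu slab page
and keeps the tid-based cmpxchg_double fastpath on !PREEMPT_RT.

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  /* Same hypothetical percpu cache as in the earlier sketch. */
  struct example_cpu_cache {
          local_lock_t lock;
          void **freelist;
          unsigned long tid;
  };

  static DEFINE_PER_CPU(struct example_cpu_cache, example_cache) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  static void example_free(void **object)
  {
          if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
                  /*
                   * On RT the local lock is a per-cpu spinlock, so the free
                   * fastpath must take it to exclude a slowpath running on
                   * the same cpu; disabled irqs no longer guarantee that.
                   */
                  local_lock(&example_cache.lock);
                  /* push the object onto the cpu freelist */
                  *object = this_cpu_read(example_cache.freelist);
                  this_cpu_write(example_cache.freelist, object);
                  this_cpu_write(example_cache.tid,
                                 this_cpu_read(example_cache.tid) + 1);
                  local_unlock(&example_cache.lock);
          } else {
                  /*
                   * On !RT the lockless tid + cmpxchg_double scheme shown in
                   * the patch remains in use (not reproduced here).
                   */
          }
  }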