Message-ID: <96d1b007b3b88c43feac58d2646d675b94daf1fb.camel@surriel.com>
Subject: Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
From: Rik van Riel
To: Jann Horn
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, kernel-team@meta.com,
    dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org,
    tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com,
    akpm@linux-foundation.org, nadav.amit@gmail.com,
    zhengqi.arch@bytedance.com, linux-mm@kvack.org
Date: Fri, 03 Jan 2025 21:55:21 -0500
In-Reply-To:
References: <20241230175550.4046587-1-riel@surriel.com>
    <20241230175550.4046587-10-riel@surriel.com>

On Fri, 2025-01-03 at 18:36 +0100, Jann Horn wrote:
> 
> > +++ b/arch/x86/include/asm/mmu.h
> > @@ -48,6 +48,12 @@ typedef struct {
> >         unsigned long flags;
> >  #endif
> > 
> > +#ifdef CONFIG_CPU_SUP_AMD
> > +       struct list_head broadcast_asid_list;
> > +       u16 broadcast_asid;
> > +       bool asid_transition;
> 
> Please add a comment on the semantics of the "asid_transition" field
> here after addressing the comments below.

Will do.

> 
> > +#endif
> > +
> >  #ifdef CONFIG_ADDRESS_MASKING
> >         /* Active LAM mode:  X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or
> > 0 (disabled) */
> >         unsigned long lam_cr3_mask;
> [...]
> > +#ifdef CONFIG_CPU_SUP_AMD
> > +/*
> > + * Logic for AMD INVLPGB support.
> > + */
> > +static DEFINE_RAW_SPINLOCK(broadcast_asid_lock);
> > +static u16 last_broadcast_asid = TLB_NR_DYN_ASIDS;
> 
> I wonder if this should be set to MAX_ASID_AVAILABLE or such to
> ensure that we do a flush before we start using the broadcast ASID
> space the first time... Or is there something else that already
> guarantees that all ASIDs of the TLB are flushed during kernel boot?

That is a good idea. I don't know whether the TLBs always get flushed
on every kexec, for example, and exercising the wraparound code early
on every boot will double as a self-test and help future-proof the
code.

I'll do that in the next version.
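
Probably something as simple as this, on top of the hunk quoted above
(sketch only, untested; it assumes reset_broadcast_asid_space() is
where the broadcast ASID range gets flushed on wraparound, as
discussed):

        /*
         * Start the search at the very end of the ASID space, so the
         * first call to get_broadcast_asid() takes the wraparound path
         * and calls reset_broadcast_asid_space(). That exercises the
         * reset code on every boot, and means we never rely on the
         * previous kernel (kexec) having flushed the TLBs.
         */
        static u16 last_broadcast_asid = MAX_ASID_AVAILABLE;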
> 
> > +static u16 get_broadcast_asid(void)
> > +{
> > +       lockdep_assert_held(&broadcast_asid_lock);
> > +
> > +       do {
> > +               u16 start = last_broadcast_asid;
> > +               u16 asid = find_next_zero_bit(broadcast_asid_used,
> > MAX_ASID_AVAILABLE, start);
> > +
> > +               if (asid >= MAX_ASID_AVAILABLE) {
> > +                       reset_broadcast_asid_space();
> > +                       continue;
> 
> Can this loop endlessly without making forward progress if we have a
> few thousand processes on the system that are multi-threaded (or used
> to be multi-threaded) and race the wrong way?
> meets_broadcast_asid_threshold() checks if we have free IDs
> remaining, but that check happens before broadcast_asid_lock is held,
> so we could theoretically race such that no free IDs are available,
> right?

You are right, I need to duplicate that check under the spinlock!
I'll get that done in the next version.

> 
> > +       mm->context.broadcast_asid = get_broadcast_asid();
> > +       mm->context.asid_transition = true;
> 
> This looks buggy to me:

Nadav found the same issue. I've already fixed it locally for the
next version.

> Maybe change how mm->context.asid_transition works such that it is
> immediately set on mm creation and cleared when the transition is
> done, so that you don't have to touch it here?
> 

If we want to document the ordering, wouldn't it be better to keep
both assignments close to each other (with WRITE_ONCE), so the code
stays easier to understand for future maintenance?

> Also, please use at least WRITE_ONCE() for writes here, and add
> comments documenting ordering requirements.

I'll add a comment.

> 
> > +static void finish_asid_transition(struct flush_tlb_info *info)
> > +{
> > +       struct mm_struct *mm = info->mm;
> > +       int bc_asid = mm_broadcast_asid(mm);
> > +       int cpu;
> > +
> > +       if (!mm->context.asid_transition)
> 
> AFAIU this can be accessed concurrently - please use at least
> READ_ONCE(). (I think in the current version of the patch, this needs
> to be ordered against the preceding mm_broadcast_asid() read, but
> that's implicit on x86, so I guess writing a barrier here would be
> superfluous.)

I'll add a READ_ONCE() here. Good point.
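
To make the WRITE_ONCE()/READ_ONCE() pairing explicit, what I have in
mind is roughly this (sketch, not the final patch; the helper that
hands out the ASID is simplified here):

        /* Where the broadcast ASID is assigned, under broadcast_asid_lock: */
        mm->context.broadcast_asid = get_broadcast_asid();
        /*
         * Pairs with the READ_ONCE() in finish_asid_transition(); CPUs
         * may keep running this mm with a dynamic ASID until the
         * transition has finished.
         */
        WRITE_ONCE(mm->context.asid_transition, true);

and on the flush side:

        static void finish_asid_transition(struct flush_tlb_info *info)
        {
                struct mm_struct *mm = info->mm;

                /* Pairs with the WRITE_ONCE() above. */
                if (!READ_ONCE(mm->context.asid_transition))
                        return;

                /* ... the for_each_cpu() loop quoted below ... */
        }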
> 
> > +               return;
> > +
> > +       for_each_cpu(cpu, mm_cpumask(mm)) {
> > +               if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu))
> > != mm)
> > +                       continue;
> 
> switch_mm_irqs_off() picks an ASID and writes CR3 before writing
> loaded_mm:
> "/* Make sure we write CR3 before loaded_mm. */"
> 
> Can we race with a concurrent switch_mm_irqs_off() on the other CPU
> such that the other CPU has already switched CR3 to our MM using the
> old ASID, but has not yet written loaded_mm, such that we skip it
> here? And then we'll think we finished the ASID transition, and the
> next time we do a flush, we'll wrongly omit the flush for that other
> CPU even though it's still using the old ASID?

That is a very good question. I suppose we need to check against
LOADED_MM_SWITCHING too, and possibly wait to see what mm shows up
on that CPU before proceeding?

Maybe as simple as this?

        for_each_cpu(cpu, mm_cpumask(mm)) {
                while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) ==
                                LOADED_MM_SWITCHING)
                        cpu_relax();

                if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
                        continue;

                /*
                 * If at least one CPU is not using the broadcast ASID yet,
                 * send a TLB flush IPI. The IPI should cause stragglers
                 * to transition soon.
                 */
                if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) != bc_asid) {
                        flush_tlb_multi(mm_cpumask(info->mm), info);
                        return;
                }
        }

Then the only change needed in switch_mm_irqs_off() would be to move
the LOADED_MM_SWITCHING assignment to before choose_new_asid(), to
fully close the window.

Am I overlooking anything here?

> 
> > +
> > +               /*
> > +                * If at least one CPU is not using the broadcast
> > ASID yet,
> > +                * send a TLB flush IPI. The IPI should cause
> > stragglers
> > +                * to transition soon.
> > +                */
> > +               if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) !=
> > bc_asid) {
> 
> READ_ONCE()? Also, I think this needs a comment explaining that this
> can race with concurrent MM switches such that we wrongly think that
> there's a straggler (because we're not reading the loaded_mm and the
> loaded_mm_asid as one atomic combination).

I'll add the READ_ONCE().

Will the race still exist if we wait on LOADED_MM_SWITCHING as
proposed above?
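
For completeness, the switch_mm_irqs_off() side of that idea would be
roughly the following (sketch against the current ordering only; the
rest of the function stays as it is today):

        /*
         * Let nmi_uaccess_okay() and finish_asid_transition() know that
         * we are changing CR3, *before* an ASID is chosen, so that a
         * remote finish_asid_transition() cannot sample a stale ASID
         * for this CPU without noticing that a switch is in progress.
         */
        this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
        barrier();

        choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);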
> 
> > +                       flush_tlb_multi(mm_cpumask(info->mm),
> > info);
> > +                       return;
> > +               }
> > +       }
> > +
> > +       /* All the CPUs running this process are using the
> > broadcast ASID. */
> > +       mm->context.asid_transition = 0;
> 
> WRITE_ONCE()?
> Also: This is a bool, please use "false".

Will do.

> 
> > +}
> > +
> > +static void broadcast_tlb_flush(struct flush_tlb_info *info)
> > +{
> > +       bool pmd = info->stride_shift == PMD_SHIFT;
> > +       unsigned long maxnr = invlpgb_count_max;
> > +       unsigned long asid = info->mm->context.broadcast_asid;
> > +       unsigned long addr = info->start;
> > +       unsigned long nr;
> > +
> > +       /* Flushing multiple pages at once is not supported with
> > 1GB pages. */
> > +       if (info->stride_shift > PMD_SHIFT)
> > +               maxnr = 1;
> > +
> > +       if (info->end == TLB_FLUSH_ALL) {
> > +               invlpgb_flush_single_pcid(kern_pcid(asid));
> 
> What orders this flush with the preceding page table update? Does the
> instruction implicitly get ordered after preceding memory writes, or
> do we get that ordering from inc_mm_tlb_gen() or something like that?

I believe inc_mm_tlb_gen() should provide the ordering. You are right
that it should be documented.

> 
> > +       /* Broadcast ASIDs are always kept up to date with INVLPGB.
> > */
> > +       if (is_broadcast_asid(loaded_mm_asid))
> > +               return;
> 
> This relies on the mm_broadcast_asid() read in flush_tlb_mm_range()
> being ordered after the page table update, correct? And we get that
> required ordering from the inc_mm_tlb_gen(), which implies a full
> barrier? It might be nice if there were some more comments on this.

I will add some comments, and I hope you can review those in the next
series, because I'll no doubt forget to explain something important.

Thank you for the thorough review!

-- 
All Rights Reversed.