Subject: Re: [PATCH v4 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
From: Rik van Riel
To: Nadav Amit
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Borislav Petkov,
 peterz@infradead.org, Dave Hansen, zhengqi.arch@bytedance.com,
 thomas.lendacky@amd.com, kernel-team@meta.com,
 "open list:MEMORY MANAGEMENT", Andrew Morton, jannh@google.com
Date: Mon, 13 Jan 2025 22:13:08 -0500
References: <20250112155453.1104139-1-riel@surriel.com>
 <20250112155453.1104139-10-riel@surriel.com>
On Mon, 2025-01-13 at 15:09 +0200, Nadav Amit wrote:
> 
> > On 12 Jan 2025, at 17:53, Rik van Riel wrote:
> > 
> > +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> > +	u16 global_asid;
> > +	bool asid_transition;
> 
> As I later note, there are various ordering issues between the two.
> Would it be just easier to combine them into one field? I know
> everybody hates bitfields so I don't suggest it, but there are
> other ways...

It's certainly an option, but I think we figured out the
ordering issues, so at this point documentation and
readability of the code might be more important for
future-proofing?

> 
> > @@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
> >  static inline void destroy_context(struct mm_struct *mm)
> >  {
> >  	destroy_context_ldt(mm);
> > +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> 
> I'd prefer to use IS_ENABLED() and to have a stub for
> destroy_context_free_global_asid().
> 
> > +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> > +		destroy_context_free_global_asid(mm);
> > +#endif
> >  }

I'll think about how to do this more cleanly. I would like to
keep the cpu feature test in the inline function, so we don't
do an unnecessary function call on systems without INVLPGB.

> > 
> > +static inline bool in_asid_transition(const struct flush_tlb_info *info)
> > +{
> > +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> > +		return false;
> > +
> > +	return info->mm && info->mm->context.asid_transition;
> 
> READ_ONCE(context.asid_transition) ?

Changed for the next version.

> 
> > +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> > +{
> 
> Having a VM_WARN_ON() here might be nice.

Added. Thank you.
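
To sketch what I have in mind (illustrative only; the exact check
that lands in the next version may differ), it is just a sanity test
of the invariant that callers only reach this point once the mm has
a global ASID assigned:

static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
{
	/*
	 * Sketch: flush_tlb_mm_range() only takes this path after it
	 * has seen a global ASID for this mm, so warn if that
	 * invariant is ever violated.
	 */
	VM_WARN_ON_ONCE(!info->mm || !mm_global_asid(info->mm));

	/* ... INVLPGB flush loop as in the patch ... */
}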
> 
> > +static u16 get_global_asid(void)
> > +{
> > +	lockdep_assert_held(&global_asid_lock);
> > +
> > +	do {
> > +		u16 start = last_global_asid;
> > +		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> > +
> > +		if (asid >= MAX_ASID_AVAILABLE) {
> > +			reset_global_asid_space();
> > +			continue;
> > +		}
> > +
> > +		/* Claim this global ASID. */
> > +		__set_bit(asid, global_asid_used);
> > +		last_global_asid = asid;
> > +		return asid;
> > +	} while (1);
> 
> This does not make me feel easy at all. I do not understand
> why it might happen. The caller should've already checked the
> global ASID is available under the lock. If it is not obvious
> from the code, perhaps refactoring is needed.
> 

The caller checks whether we have a global ASID available,
anywhere in the namespace. This code will claim a specific one.

I guess the global_asid_available-- line could be moved
into get_global_asid() to improve readability?

> > +static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
> > +{
> > +	int count = 0;
> > +	int cpu;
> > +
> > +	/* This quick check should eliminate most single threaded programs. */
> > +	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
> > +		return false;
> > +
> > +	/* Slower check to make sure. */
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		/* Skip the CPUs that aren't really running this process. */
> > +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> > +			continue;
> 
> Do you really want to make loaded_mm accessed from other cores? Does
> this really provide worthy benefit?
> 
> Why not just use cpumask_weight() and be done with it? Anyhow it's a
> heuristic.

We recently added some code to make mm_cpumask maintenance
a lot lazier, which can result in more CPUs remaining in the
bitmap while not running the mm.

As for accessing loaded_mm from other CPUs, we already have
to do that in finish_asid_transition. I don't see any good
way around that, but I'm open to suggestions :)

> 
> > +	/*
> > +	 * The transition from IPI TLB flushing, with a dynamic ASID,
> > +	 * and broadcast TLB flushing, using a global ASID, uses memory
> > +	 * ordering for synchronization.
> > +	 *
> > +	 * While the process has threads still using a dynamic ASID,
> > +	 * TLB invalidation IPIs continue to get sent.
> > +	 *
> > +	 * This code sets asid_transition first, before assigning the
> > +	 * global ASID.
> > +	 *
> > +	 * The TLB flush code will only verify the ASID transition
> > +	 * after it has seen the new global ASID for the process.
> > +	 */
> > +	WRITE_ONCE(mm->context.asid_transition, true);
> 
> I would prefer smp_wmb() and document where the matching smp_rmb()
> (or smp_mb) is.
> 
> > +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
> > +
> > +	global_asid_available--;
> > +}
> > 

I would be happy with either style of ordering. It's all
a bit of a no-op anyway, because x86 does not do stores
out of order.

> > +static void finish_asid_transition(struct flush_tlb_info *info)
> > +{
> > +	struct mm_struct *mm = info->mm;
> > +	int bc_asid = mm_global_asid(mm);
> > +	int cpu;
> > +
> > +	if (!READ_ONCE(mm->context.asid_transition))
> > +		return;
> > +
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		/*
> > +		 * The remote CPU is context switching. Wait for that to
> > +		 * finish, to catch the unlikely case of it switching to
> > +		 * the target mm with an out of date ASID.
> > +		 */
> > +		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
> > +			cpu_relax();
> 
> Although this code should rarely run, it seems bad for a couple of
> reasons:
> 
> 1. It is a new busy-wait in a very delicate place. Lockdep is blind
>    to this change.

This is true. However, if a CPU gets stuck in the middle of
switch_mm_irqs_off, won't we have a bigger problem?

> 
> 2. cpu_tlbstate is supposed to be private for each core - that's why
>    there is cpu_tlbstate_shared. But I really think loaded_mm should
>    be kept private.
> 
> Can't we just do one TLB shootdown if
> cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids

That conflicts with a future optimization of simply not
maintaining the mm_cpumask at all for processes that use
broadcast TLB invalidation. After all, once we no longer use
the mm_cpumask for anything, why incur the overhead of
maintaining it?

I would like to understand more about the harm of reading
loaded_mm. How is reading loaded_mm worse than reading other
per-CPU variables that are written from the same code paths?

> 
> > +	/* All the CPUs running this process are using the global ASID. */
> 
> I guess it's ordered with the flushes (the flushes must complete
> first).
> 

If there are any flushes. If all the CPUs we scanned are
already using the global ASID, we do not need any additional
ordering here, since any CPUs switching to this mm afterward
will see the new global ASID.

> > +	WRITE_ONCE(mm->context.asid_transition, false);
> > +}
> > 
> > +	} else do {
> > +		/*
> > +		 * Calculate how many pages can be flushed at once; if the
> > +		 * remainder of the range is less than one page, flush one.
> > +		 */
> > +		nr = min(maxnr, (info->end - addr) >> info->stride_shift);
> > +		nr = max(nr, 1);
> > +
> > +		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
> > +		/* Do any CPUs supporting INVLPGB need PTI? */
> > +		if (static_cpu_has(X86_FEATURE_PTI))
> > +			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
> > +		addr += nr << info->stride_shift;
> > +	} while (addr < info->end);
> 
> I would have preferred for instead of while...

Changed that for the next version. Thank you.

> > @@ -1049,9 +1385,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> >  	 * a local TLB flush is needed. Optimize this use-case by calling
> >  	 * flush_tlb_func_local() directly in this case.
> >  	 */
> > -	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> 
> I think an smp_rmb() here would communicate the fact that
> in_asid_transition() and mm_global_asid() must be ordered.
> 
> > +	if (mm_global_asid(mm)) {
> > +		broadcast_tlb_flush(info);

We have the barrier already, in the form of inc_mm_tlb_gen
a few lines up.

Or are you talking about a barrier (or READ_ONCE?) inside of
mm_global_asid(), to make sure the compiler cannot reorder
things around mm_global_asid()?

> > +	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> >  		info->trim_cpumask = should_trim_cpumask(mm);
> >  		flush_tlb_multi(mm_cpumask(mm), info);
> > +		consider_global_asid(mm);
> >  	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> >  		lockdep_assert_irqs_enabled();
> >  		local_irq_disable();
> > -- 
> > 2.47.1
> > 
> 

-- 
All Rights Reversed.