Subject: Re: [PATCH v4 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
From: Nadav Amit
Date: Mon, 13 Jan 2025 15:09:08 +0200
To: Rik van Riel
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Borislav Petkov, peterz@infradead.org, Dave Hansen, zhengqi.arch@bytedance.com, thomas.lendacky@amd.com, kernel-team@meta.com, "open list:MEMORY MANAGEMENT", Andrew Morton, jannh@google.com
In-Reply-To: <20250112155453.1104139-10-riel@surriel.com>
References: <20250112155453.1104139-1-riel@surriel.com> <20250112155453.1104139-10-riel@surriel.com>

Not sure my review is thorough, but that’s all the time I have right now...

> On 12 Jan 2025, at 17:53, Rik van Riel wrote:
>
> Use broadcast TLB invalidation, using the INVLPGB instruction, on AMD EPYC 3
> and newer CPUs.
>
> In order to not exhaust PCID space, and keep TLB flushes local for single
> threaded processes, we only hand out broadcast ASIDs to processes active on
> 3 or more CPUs, and gradually increase the threshold as broadcast ASID space
> is depleted.
>
> Signed-off-by: Rik van Riel
> ---
> arch/x86/include/asm/mmu.h         |   6 +
> arch/x86/include/asm/mmu_context.h |  14 ++
> arch/x86/include/asm/tlbflush.h    |  64 +++++
> arch/x86/mm/tlb.c                  | 363 ++++++++++++++++++++++++++++-
> 4 files changed, 435 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> index 3b496cdcb74b..d71cd599fec4 100644
> --- a/arch/x86/include/asm/mmu.h
> +++ b/arch/x86/include/asm/mmu.h
> @@ -69,6 +69,12 @@ typedef struct {
> 	u16 pkey_allocation_map;
> 	s16 execute_only_pkey;
> #endif
> +
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +	u16 global_asid;
> +	bool asid_transition;

As I later note, there are various ordering issues between the two. Would
it be just easier to combine them into one field? I know everybody hates
bitfields so I don’t suggest it, but there are other ways...
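For instance (completely untested, and the global_asid_state field and the
macro names below are made up just for the illustration): pack both into a
single 32-bit word, so one READ_ONCE() gives a consistent snapshot of the
ASID and the transition bit:

	#define GLOBAL_ASID_MASK	0xffff
	#define GLOBAL_ASID_TRANSITION	BIT(16)

	/* mm->context.global_asid_state would be a u32 replacing both fields. */
	static inline u16 mm_global_asid(struct mm_struct *mm)
	{
		return READ_ONCE(mm->context.global_asid_state) & GLOBAL_ASID_MASK;
	}

	static inline bool mm_in_asid_transition(struct mm_struct *mm)
	{
		return READ_ONCE(mm->context.global_asid_state) & GLOBAL_ASID_TRANSITION;
	}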
> +#endif
> +
> } mm_context_t;
>
> #define INIT_MM_CONTEXT(mm)	\
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 795fdd53bd0a..d670699d32c2 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
> #define enter_lazy_tlb enter_lazy_tlb
> extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
>
> +extern void destroy_context_free_global_asid(struct mm_struct *mm);
> +
> /*
>  * Init a new mm. Used on mm copies, like at fork()
>  * and on mm's that are brand-new, like at execve().
> @@ -161,6 +163,14 @@ static inline int init_new_context(struct task_struct *tsk,
> 		mm->context.execute_only_pkey = -1;
> 	}
> #endif
> +
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> +		mm->context.global_asid = 0;
> +		mm->context.asid_transition = false;
> +	}
> +#endif
> +
> 	mm_reset_untag_mask(mm);
> 	init_new_context_ldt(mm);
> 	return 0;
> @@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
> static inline void destroy_context(struct mm_struct *mm)
> {
> 	destroy_context_ldt(mm);
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH

I’d prefer to use IS_ENABLED() and to have a stub for
destroy_context_free_global_asid().

> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		destroy_context_free_global_asid(mm);
> +#endif
> }
>
> extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index dba5caa4a9f4..cd244cdd49dd 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -239,6 +239,70 @@ void flush_tlb_one_kernel(unsigned long addr);
> void flush_tlb_multi(const struct cpumask *cpumask,
> 		      const struct flush_tlb_info *info);
>
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +static inline bool is_dyn_asid(u16 asid)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return true;
> +
> +	return asid < TLB_NR_DYN_ASIDS;
> +}
> +
> +static inline bool is_global_asid(u16 asid)
> +{
> +	return !is_dyn_asid(asid);
> +}
> +
> +static inline bool in_asid_transition(const struct flush_tlb_info *info)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return false;
> +
> +	return info->mm && info->mm->context.asid_transition;

READ_ONCE(context.asid_transition) ?

> +}
> +
> +static inline u16 mm_global_asid(struct mm_struct *mm)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return 0;
> +
> +	return mm->context.global_asid;
> +}
> +#else
> +static inline bool is_dyn_asid(u16 asid)
> +{
> +	return true;
> +}
> +
> +static inline bool is_global_asid(u16 asid)
> +{
> +	return false;
> +}
> +
> +static inline bool in_asid_transition(const struct flush_tlb_info *info)
> +{
> +	return false;
> +}
> +
> +static inline u16 mm_global_asid(struct mm_struct *mm)
> +{
> +	return 0;
> +}
> +
> +static inline bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> +	return false;
> +}
> +
> +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{

Having a VM_WARN_ON() here might be nice.
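Something along these lines is what I have in mind (untested):

	static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
	{
		/*
		 * Without CONFIG_X86_BROADCAST_TLB_FLUSH no mm should ever
		 * get a global ASID, so this stub should be unreachable.
		 */
		VM_WARN_ON_ONCE(1);
	}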
> +}
> +
> +static inline void consider_global_asid(struct mm_struct *mm)
> +{
> +}
> +#endif
> +
> #ifdef CONFIG_PARAVIRT
> #include
> #endif
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index b47d6c3fe0af..80375ef186d5 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -74,13 +74,15 @@
>  * use different names for each of them:
>  *
>  * ASID  - [0, TLB_NR_DYN_ASIDS-1]
> - *         the canonical identifier for an mm
> + *         the canonical identifier for an mm, dynamically allocated on each CPU
> + *         [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
> + *         the canonical, global identifier for an mm, identical across all CPUs
>  *
> - * kPCID - [1, TLB_NR_DYN_ASIDS]
> + * kPCID - [1, MAX_ASID_AVAILABLE]
>  *         the value we write into the PCID part of CR3; corresponds to the
>  *         ASID+1, because PCID 0 is special.
>  *
> - * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
> + * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
>  *         for KPTI each mm has two address spaces and thus needs two
>  *         PCID values, but we can still do with a single ASID denomination
>  *         for each mm. Corresponds to kPCID + 2048.
> @@ -225,6 +227,19 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> 		return;
> 	}
>
> +	/*
> +	 * TLB consistency for global ASIDs is maintained with broadcast TLB
> +	 * flushing. The TLB is never outdated, and does not need flushing.
> +	 */
> +	if (IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) && static_cpu_has(X86_FEATURE_INVLPGB)) {
> +		u16 global_asid = mm_global_asid(next);
> +		if (global_asid) {
> +			*new_asid = global_asid;
> +			*need_flush = false;
> +			return;
> +		}
> +	}
> +
> 	if (this_cpu_read(cpu_tlbstate.invalidate_other))
> 		clear_asid_other();
>
> @@ -251,6 +266,292 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> 	*need_flush = true;
> }
>
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +/*
> + * Logic for broadcast TLB invalidation.
> + */
> +static DEFINE_RAW_SPINLOCK(global_asid_lock);
> +static u16 last_global_asid = MAX_ASID_AVAILABLE;
> +static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> +static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE) = { 0 };
> +static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +static void reset_global_asid_space(void)
> +{
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	/*
> +	 * A global TLB flush guarantees that any stale entries from
> +	 * previously freed global ASIDs get flushed from the TLB
> +	 * everywhere, making these global ASIDs safe to reuse.
> +	 */
> +	invlpgb_flush_all_nonglobals();
> +
> +	/*
> +	 * Clear all the previously freed global ASIDs from the
> +	 * broadcast_asid_used bitmap, now that the global TLB flush
> +	 * has made them actually available for re-use.
> +	 */
> +	bitmap_andnot(global_asid_used, global_asid_used,
> +			global_asid_freed, MAX_ASID_AVAILABLE);
> +	bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
> +
> +	/*
> +	 * ASIDs 0-TLB_NR_DYN_ASIDS are used for CPU-local ASID
> +	 * assignments, for tasks doing IPI based TLB shootdowns.
> +	 * Restart the search from the start of the global ASID space.
> +	 */
> +	last_global_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 get_global_asid(void)
> +{
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	do {
> +		u16 start = last_global_asid;
> +		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> +
> +		if (asid >= MAX_ASID_AVAILABLE) {
> +			reset_global_asid_space();
> +			continue;
> +		}
> +
> +		/* Claim this global ASID. */
> +		__set_bit(asid, global_asid_used);
> +		last_global_asid = asid;
> +		return asid;
> +	} while (1);

This does not make me feel easy at all. I do not understand why it might
happen. The caller should’ve already checked the global ASID is available
under the lock. If it is not obvious from the code, perhaps refactoring is
needed.

> +}
> +
> +/*
> + * Returns true if the mm is transitioning from a CPU-local ASID to a global
> + * (INVLPGB) ASID, or the other way around.
> + */
> +static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> +	u16 global_asid = mm_global_asid(next);
> +
> +	if (global_asid && prev_asid != global_asid)
> +		return true;
> +
> +	if (!global_asid && is_global_asid(prev_asid))
> +		return true;
> +
> +	return false;
> +}
> +
> +void destroy_context_free_global_asid(struct mm_struct *mm)
> +{
> +	if (!mm->context.global_asid)
> +		return;
> +
> +	guard(raw_spinlock_irqsave)(&global_asid_lock);
> +
> +	/* The global ASID can be re-used only after flush at wrap-around. */
> +	__set_bit(mm->context.global_asid, global_asid_freed);
> +
> +	mm->context.global_asid = 0;
> +	global_asid_available++;
> +}
> +
> +/*
> + * Check whether a process is currently active on more than "threshold" CPUs.
> + * This is a cheap estimation on whether or not it may make sense to assign
> + * a global ASID to this process, and use broadcast TLB invalidation.
> + */
> +static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
> +{
> +	int count = 0;
> +	int cpu;
> +
> +	/* This quick check should eliminate most single threaded programs. */
> +	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
> +		return false;
> +
> +	/* Slower check to make sure. */
> +	for_each_cpu(cpu, mm_cpumask(mm)) {
> +		/* Skip the CPUs that aren't really running this process. */
> +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> +			continue;

Do you really want to make loaded_mm accessed from other cores? Does this
really provide worthy benefit?

Why not just use cpumask_weight() and be done with it? Anyhow it’s a
heuristic.

> +
> +		if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
> +			continue;
> +
> +		if (++count > threshold)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +/*
> + * Assign a global ASID to the current process, protecting against
> + * races between multiple threads in the process.
> + */
> +static void use_global_asid(struct mm_struct *mm)
> +{
> +	guard(raw_spinlock_irqsave)(&global_asid_lock);
> +
> +	/* This process is already using broadcast TLB invalidation. */
> +	if (mm->context.global_asid)
> +		return;
> +
> +	/* The last global ASID was consumed while waiting for the lock. */
> +	if (!global_asid_available)

I think "global_asid_available > 0" would make more sense.

> +		return;
> +
> +	/*
> +	 * The transition from IPI TLB flushing, with a dynamic ASID,
> +	 * and broadcast TLB flushing, using a global ASID, uses memory
> +	 * ordering for synchronization.
> +	 *
> +	 * While the process has threads still using a dynamic ASID,
> +	 * TLB invalidation IPIs continue to get sent.
> +	 *
> +	 * This code sets asid_transition first, before assigning the
> +	 * global ASID.
> +	 *
> +	 * The TLB flush code will only verify the ASID transition
> +	 * after it has seen the new global ASID for the process.
> +	 */
> +	WRITE_ONCE(mm->context.asid_transition, true);

I would prefer smp_wmb() and document where the matching smp_rmb() (or
smp_mb) is.
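In other words, something like this (untested; the comments are only there
to show the pairing I have in mind, the exact placement is up to you):

	/* Writer side, in use_global_asid(): */
	WRITE_ONCE(mm->context.asid_transition, true);
	/* Make the transition flag visible before the global ASID. */
	smp_wmb();
	WRITE_ONCE(mm->context.global_asid, get_global_asid());

	/* Reader side, wherever the flush path inspects the mm: */
	asid = READ_ONCE(mm->context.global_asid);
	/* Pairs with the smp_wmb() in use_global_asid(). */
	smp_rmb();
	transition = READ_ONCE(mm->context.asid_transition);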
> +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
> +
> +	global_asid_available--;
> +}
> +
> +/*
> + * Figure out whether to assign a global ASID to a process.
> + * We vary the threshold by how empty or full global ASID space is.
> + * 1/4 full: >= 4 active threads
> + * 1/2 full: >= 8 active threads
> + * 3/4 full: >= 16 active threads
> + * 7/8 full: >= 32 active threads
> + * etc
> + *
> + * This way we should never exhaust the global ASID space, even on very
> + * large systems, and the processes with the largest number of active
> + * threads should be able to use broadcast TLB invalidation.
> + */
> +#define HALFFULL_THRESHOLD 8
> +static bool meets_global_asid_threshold(struct mm_struct *mm)
> +{
> +	int avail = global_asid_available;
> +	int threshold = HALFFULL_THRESHOLD;
> +
> +	if (!avail)
> +		return false;
> +
> +	if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
> +		threshold = HALFFULL_THRESHOLD / 4;
> +	} else if (avail > MAX_ASID_AVAILABLE / 2) {
> +		threshold = HALFFULL_THRESHOLD / 2;
> +	} else if (avail < MAX_ASID_AVAILABLE / 3) {
> +		do {
> +			avail *= 2;
> +			threshold *= 2;
> +		} while ((avail + threshold) < MAX_ASID_AVAILABLE / 2);
> +	}
> +
> +	return mm_active_cpus_exceeds(mm, threshold);
> +}
> +
> +static void consider_global_asid(struct mm_struct *mm)
> +{
> +	if (!static_cpu_has(X86_FEATURE_INVLPGB))
> +		return;
> +
> +	/* Check every once in a while. */
> +	if ((current->pid & 0x1f) != (jiffies & 0x1f))
> +		return;
> +
> +	if (meets_global_asid_threshold(mm))
> +		use_global_asid(mm);
> +}
> +
> +static void finish_asid_transition(struct flush_tlb_info *info)
> +{
> +	struct mm_struct *mm = info->mm;
> +	int bc_asid = mm_global_asid(mm);
> +	int cpu;
> +
> +	if (!READ_ONCE(mm->context.asid_transition))
> +		return;
> +
> +	for_each_cpu(cpu, mm_cpumask(mm)) {
> +		/*
> +		 * The remote CPU is context switching. Wait for that to
> +		 * finish, to catch the unlikely case of it switching to
> +		 * the target mm with an out of date ASID.
> +		 */
> +		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
> +			cpu_relax();

Although this code should rarely run, it seems bad for a couple of reasons:

1. It is a new busy-wait in a very delicate place. Lockdep is blind to this
change.

2. cpu_tlbstate is supposed to be private for each core - that’s why there
is cpu_tlbstate_shared. But I really think loaded_mm should be kept private.

Can't we just do one TLB shootdown if
cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids

> +
> +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
> +			continue;
> +
> +		/*
> +		 * If at least one CPU is not using the global ASID yet,
> +		 * send a TLB flush IPI. The IPI should cause stragglers
> +		 * to transition soon.
> +		 *
> +		 * This can race with the CPU switching to another task;
> +		 * that results in a (harmless) extra IPI.
> +		 */
> +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
> +			flush_tlb_multi(mm_cpumask(info->mm), info);
> +			return;
> +		}
> +	}
> +
> +	/* All the CPUs running this process are using the global ASID. */

I guess it’s ordered with the flushes (the flushes must complete first).

> +	WRITE_ONCE(mm->context.asid_transition, false);
> +}
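To be a bit more concrete about the "one TLB shootdown" alternative, this is
roughly what I mean (untested sketch; it deliberately over-flushes instead of
peeking at other CPUs' cpu_tlbstate, and whether the flag can be cleared
right away here needs a closer look):

	static void finish_asid_transition(struct flush_tlb_info *info)
	{
		struct mm_struct *mm = info->mm;

		if (!READ_ONCE(mm->context.asid_transition))
			return;

		/*
		 * If any other CPU might still be running this mm with a
		 * dynamic ASID, send a single IPI-based flush to mm_cpumask()
		 * and let flush_tlb_func() on each CPU switch itself over to
		 * the global ASID. No busy-wait, no remote loaded_mm reads.
		 */
		if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
			flush_tlb_multi(mm_cpumask(mm), info);

		WRITE_ONCE(mm->context.asid_transition, false);
	}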
> +
> +static void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{
> +	bool pmd = info->stride_shift == PMD_SHIFT;
> +	unsigned long maxnr = invlpgb_count_max;
> +	unsigned long asid = info->mm->context.global_asid;
> +	unsigned long addr = info->start;
> +	unsigned long nr;
> +
> +	/* Flushing multiple pages at once is not supported with 1GB pages. */
> +	if (info->stride_shift > PMD_SHIFT)
> +		maxnr = 1;
> +
> +	/*
> +	 * TLB flushes with INVLPGB are kicked off asynchronously.
> +	 * The inc_mm_tlb_gen() guarantees page table updates are done
> +	 * before these TLB flushes happen.
> +	 */
> +	if (info->end == TLB_FLUSH_ALL) {
> +		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
> +		/* Do any CPUs supporting INVLPGB need PTI? */
> +		if (static_cpu_has(X86_FEATURE_PTI))
> +			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
> +	} else do {
> +		/*
> +		 * Calculate how many pages can be flushed at once; if the
> +		 * remainder of the range is less than one page, flush one.
> +		 */
> +		nr = min(maxnr, (info->end - addr) >> info->stride_shift);
> +		nr = max(nr, 1);
> +
> +		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
> +		/* Do any CPUs supporting INVLPGB need PTI? */
> +		if (static_cpu_has(X86_FEATURE_PTI))
> +			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
> +		addr += nr << info->stride_shift;
> +	} while (addr < info->end);

I would have preferred for instead of while...

> +
> +	finish_asid_transition(info);
> +
> +	/* Wait for the INVLPGBs kicked off above to finish. */
> +	tlbsync();
> +}
> +#endif /* CONFIG_X86_BROADCAST_TLB_FLUSH */
> +
> /*
>  * Given an ASID, flush the corresponding user ASID. We can delay this
>  * until the next time we switch to it.
> @@ -556,8 +857,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> 	 */
> 	if (prev == next) {
> 		/* Not actually switching mm's */
> -		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> -			   next->context.ctx_id);
> +		VM_WARN_ON(is_dyn_asid(prev_asid) &&
> +			   this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> +			   next->context.ctx_id);
>
> 		/*
> 		 * If this races with another thread that enables lam, 'new_lam'
> @@ -573,6 +875,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> 			   !cpumask_test_cpu(cpu, mm_cpumask(next))))
> 			cpumask_set_cpu(cpu, mm_cpumask(next));
>
> +		/*
> +		 * Check if the current mm is transitioning to a new ASID.
> +		 */
> +		if (needs_global_asid_reload(next, prev_asid)) {
> +			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> +
> +			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> +			goto reload_tlb;
> +		}
> +
> +		/*
> +		 * Broadcast TLB invalidation keeps this PCID up to date
> +		 * all the time.
> +		 */
> +		if (is_global_asid(prev_asid))
> +			return;
> +
> 		/*
> 		 * If the CPU is not in lazy TLB mode, we are just switching
> 		 * from one thread in a process to another thread in the same
> @@ -606,6 +925,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> 		 */
> 		cond_mitigation(tsk);
>
> +		/*
> +		 * Let nmi_uaccess_okay() and finish_asid_transition()
> +		 * know that we're changing CR3.
> +		 */
> +		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
> +		barrier();
> +
> 		/*
> 		 * Leave this CPU in prev's mm_cpumask. Atomic writes to
> 		 * mm_cpumask can be expensive under contention. The CPU
> @@ -620,14 +946,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>
> 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> -
> -		/* Let nmi_uaccess_okay() know that we're changing CR3. */
> -		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
> -		barrier();
> 	}
>
> +reload_tlb:
> 	new_lam = mm_lam_cr3_mask(next);
> 	if (need_flush) {
> +		VM_BUG_ON(is_global_asid(new_asid));
> 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
> 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
> 		load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
> @@ -746,7 +1070,7 @@ static void flush_tlb_func(void *info)
> 	const struct flush_tlb_info *f = info;
> 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> -	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
> +	u64 local_tlb_gen;
> 	bool local = smp_processor_id() == f->initiating_cpu;
> 	unsigned long nr_invalidate = 0;
> 	u64 mm_tlb_gen;
> @@ -769,6 +1093,16 @@ static void flush_tlb_func(void *info)
> 	if (unlikely(loaded_mm == &init_mm))
> 		return;
>
> +	/* Reload the ASID if transitioning into or out of a global ASID */
> +	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
> +		switch_mm_irqs_off(NULL, loaded_mm, NULL);
> +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> +	}
> +
> +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
> +	if (is_global_asid(loaded_mm_asid))
> +		return;
> +
> 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
> 		   loaded_mm->context.ctx_id);
> @@ -786,6 +1120,8 @@ static void flush_tlb_func(void *info)
> 		return;
> 	}
>
> +	local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
> +
> 	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
> 		     f->new_tlb_gen <= local_tlb_gen)) {
> 		/*
> @@ -953,7 +1289,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
> 	 * up on the new contents of what used to be page tables, while
> 	 * doing a speculative memory access.
> 	 */
> -	if (info->freed_tables)
> +	if (info->freed_tables || in_asid_transition(info))
> 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
> 	else
> 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
> @@ -1049,9 +1385,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> 	 * a local TLB flush is needed. Optimize this use-case by calling
> 	 * flush_tlb_func_local() directly in this case.
> 	 */
> -	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {

I think an smp_rmb() here would communicate the fact in_asid_transition()
and mm_global_asid() must be ordered.

> +	if (mm_global_asid(mm)) {
> +		broadcast_tlb_flush(info);
> +	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> 		info->trim_cpumask = should_trim_cpumask(mm);
> 		flush_tlb_multi(mm_cpumask(mm), info);
> +		consider_global_asid(mm);
> 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> 		lockdep_assert_irqs_enabled();
> 		local_irq_disable();
> -- 
> 2.47.1
> 