Subject: Re: [PATCH v4 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
From: Nadav Amit
Date: Mon, 13 Jan 2025 15:09:08 +0200
To: Rik van Riel
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Borislav Petkov, peterz@infradead.org, Dave Hansen, zhengqi.arch@bytedance.com, thomas.lendacky@amd.com, kernel-team@meta.com, "open list:MEMORY MANAGEMENT", Andrew Morton, jannh@google.com
In-Reply-To: <20250112155453.1104139-10-riel@surriel.com>
References: <20250112155453.1104139-1-riel@surriel.com> <20250112155453.1104139-10-riel@surriel.com>

Not sure my review is thorough, but that’s all the time I have right now...

> On 12 Jan 2025, at 17:53, Rik van Riel wrote:
>
> Use broadcast TLB invalidation, using the INVLPGB instruction, on AMD EPYC 3
> and newer CPUs.
>
> In order to not exhaust PCID space, and keep TLB flushes local for single
> threaded processes, we only hand out broadcast ASIDs to processes active on
> 3 or more CPUs, and gradually increase the threshold as broadcast ASID space
> is depleted.
>
> Signed-off-by: Rik van Riel
> ---
> arch/x86/include/asm/mmu.h         |   6 +
> arch/x86/include/asm/mmu_context.h |  14 ++
> arch/x86/include/asm/tlbflush.h    |  64 +++++
> arch/x86/mm/tlb.c                  | 363 ++++++++++++++++++++++++++++-
> 4 files changed, 435 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> index 3b496cdcb74b..d71cd599fec4 100644
> --- a/arch/x86/include/asm/mmu.h
> +++ b/arch/x86/include/asm/mmu.h
> @@ -69,6 +69,12 @@ typedef struct {
> 	u16 pkey_allocation_map;
> 	s16 execute_only_pkey;
> #endif
> +
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +	u16 global_asid;
> +	bool asid_transition;

As I later note, there are various ordering issues between the two. Would
it be just easier to combine them into one field? I know everybody hates
bitfields so I don’t suggest it, but there are other ways...
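For instance (completely untested, and the global_asid_state field and the
macro names below are made up just for the illustration): pack both into a
single 32-bit word, so one READ_ONCE() gives a consistent snapshot of the
ASID and the transition bit:

	#define GLOBAL_ASID_MASK	0xffff
	#define GLOBAL_ASID_TRANSITION	BIT(16)

	/* mm->context.global_asid_state would be a u32 replacing both fields. */
	static inline u16 mm_global_asid(struct mm_struct *mm)
	{
		return READ_ONCE(mm->context.global_asid_state) & GLOBAL_ASID_MASK;
	}

	static inline bool mm_in_asid_transition(struct mm_struct *mm)
	{
		return READ_ONCE(mm->context.global_asid_state) & GLOBAL_ASID_TRANSITION;
	}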
> +#endif
> +
> } mm_context_t;
>
> #define INIT_MM_CONTEXT(mm)	\
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 795fdd53bd0a..d670699d32c2 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
> #define enter_lazy_tlb enter_lazy_tlb
> extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
>
> +extern void destroy_context_free_global_asid(struct mm_struct *mm);
> +
> /*
>  * Init a new mm. Used on mm copies, like at fork()
>  * and on mm's that are brand-new, like at execve().
> @@ -161,6 +163,14 @@ static inline int init_new_context(struct task_struct *tsk,
> 		mm->context.execute_only_pkey = -1;
> 	}
> #endif
> +
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> +		mm->context.global_asid = 0;
> +		mm->context.asid_transition = false;
> +	}
> +#endif
> +
> 	mm_reset_untag_mask(mm);
> 	init_new_context_ldt(mm);
> 	return 0;
> @@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
> static inline void destroy_context(struct mm_struct *mm)
> {
> 	destroy_context_ldt(mm);
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH

I’d prefer to use IS_ENABLED() and to have a stub for
destroy_context_free_global_asid().

> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		destroy_context_free_global_asid(mm);
> +#endif
> }
>
> extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index dba5caa4a9f4..cd244cdd49dd 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -239,6 +239,70 @@ void flush_tlb_one_kernel(unsigned long addr);
> void flush_tlb_multi(const struct cpumask *cpumask,
> 		      const struct flush_tlb_info *info);
>
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +static inline bool is_dyn_asid(u16 asid)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return true;
> +
> +	return asid < TLB_NR_DYN_ASIDS;
> +}
> +
> +static inline bool is_global_asid(u16 asid)
> +{
> +	return !is_dyn_asid(asid);
> +}
> +
> +static inline bool in_asid_transition(const struct flush_tlb_info *info)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return false;
> +
> +	return info->mm && info->mm->context.asid_transition;

READ_ONCE(context.asid_transition) ?

> +}
> +
> +static inline u16 mm_global_asid(struct mm_struct *mm)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return 0;
> +
> +	return mm->context.global_asid;
> +}
> +#else
> +static inline bool is_dyn_asid(u16 asid)
> +{
> +	return true;
> +}
> +
> +static inline bool is_global_asid(u16 asid)
> +{
> +	return false;
> +}
> +
> +static inline bool in_asid_transition(const struct flush_tlb_info *info)
> +{
> +	return false;
> +}
> +
> +static inline u16 mm_global_asid(struct mm_struct *mm)
> +{
> +	return 0;
> +}
> +
> +static inline bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> +	return false;
> +}
> +
> +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{

Having a VM_WARN_ON() here might be nice.
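Something along these lines is what I have in mind (untested):

	static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
	{
		/*
		 * Without CONFIG_X86_BROADCAST_TLB_FLUSH no mm should ever
		 * get a global ASID, so this stub should be unreachable.
		 */
		VM_WARN_ON_ONCE(1);
	}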
> +}
> +
> +static inline void consider_global_asid(struct mm_struct *mm)
> +{
> +}
> +#endif
> +
> #ifdef CONFIG_PARAVIRT
> #include
> #endif
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index b47d6c3fe0af..80375ef186d5 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -74,13 +74,15 @@
>  * use different names for each of them:
>  *
>  * ASID  - [0, TLB_NR_DYN_ASIDS-1]
> - *         the canonical identifier for an mm
> + *         the canonical identifier for an mm, dynamically allocated on each CPU
> + *         [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
> + *         the canonical, global identifier for an mm, identical across all CPUs
>  *
> - * kPCID - [1, TLB_NR_DYN_ASIDS]
> + * kPCID - [1, MAX_ASID_AVAILABLE]
>  *         the value we write into the PCID part of CR3; corresponds to the
>  *         ASID+1, because PCID 0 is special.
>  *
> - * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
> + * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
>  *         for KPTI each mm has two address spaces and thus needs two
>  *         PCID values, but we can still do with a single ASID denomination
>  *         for each mm. Corresponds to kPCID + 2048.
> @@ -225,6 +227,19 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> 		return;
> 	}
>
> +	/*
> +	 * TLB consistency for global ASIDs is maintained with broadcast TLB
> +	 * flushing. The TLB is never outdated, and does not need flushing.
> +	 */
> +	if (IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) && static_cpu_has(X86_FEATURE_INVLPGB)) {
> +		u16 global_asid = mm_global_asid(next);
> +		if (global_asid) {
> +			*new_asid = global_asid;
> +			*need_flush = false;
> +			return;
> +		}
> +	}
> +
> 	if (this_cpu_read(cpu_tlbstate.invalidate_other))
> 		clear_asid_other();
>
> @@ -251,6 +266,292 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> 	*need_flush = true;
> }
>
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +/*
> + * Logic for broadcast TLB invalidation.
> + */
> +static DEFINE_RAW_SPINLOCK(global_asid_lock);
> +static u16 last_global_asid = MAX_ASID_AVAILABLE;
> +static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> +static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE) = { 0 };
> +static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +static void reset_global_asid_space(void)
> +{
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	/*
> +	 * A global TLB flush guarantees that any stale entries from
> +	 * previously freed global ASIDs get flushed from the TLB
> +	 * everywhere, making these global ASIDs safe to reuse.
> +	 */
> +	invlpgb_flush_all_nonglobals();
> +
> +	/*
> +	 * Clear all the previously freed global ASIDs from the
> +	 * broadcast_asid_used bitmap, now that the global TLB flush
> +	 * has made them actually available for re-use.
> +	 */
> +	bitmap_andnot(global_asid_used, global_asid_used,
> +			global_asid_freed, MAX_ASID_AVAILABLE);
> +	bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
> +
> +	/*
> +	 * ASIDs 0-TLB_NR_DYN_ASIDS are used for CPU-local ASID
> +	 * assignments, for tasks doing IPI based TLB shootdowns.
> +	 * Restart the search from the start of the global ASID space.
> +	 */
> +	last_global_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 get_global_asid(void)
> +{
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	do {
> +		u16 start = last_global_asid;
> +		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> +
> +		if (asid >= MAX_ASID_AVAILABLE) {
> +			reset_global_asid_space();
> +			continue;
> +		}
> +
> +		/* Claim this global ASID. */
> +		__set_bit(asid, global_asid_used);
> +		last_global_asid = asid;
> +		return asid;
> +	} while (1);

This does not make me feel easy at all. I do not understand why it might
happen. The caller should’ve already checked the global ASID is available
under the lock. If it is not obvious from the code, perhaps refactoring is
needed.

> +}
> +
> +/*
> + * Returns true if the mm is transitioning from a CPU-local ASID to a global
> + * (INVLPGB) ASID, or the other way around.
> + */
> +static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> +	u16 global_asid = mm_global_asid(next);
> +
> +	if (global_asid && prev_asid != global_asid)
> +		return true;
> +
> +	if (!global_asid && is_global_asid(prev_asid))
> +		return true;
> +
> +	return false;
> +}
> +
> +void destroy_context_free_global_asid(struct mm_struct *mm)
> +{
> +	if (!mm->context.global_asid)
> +		return;
> +
> +	guard(raw_spinlock_irqsave)(&global_asid_lock);
> +
> +	/* The global ASID can be re-used only after flush at wrap-around. */
> +	__set_bit(mm->context.global_asid, global_asid_freed);
> +
> +	mm->context.global_asid = 0;
> +	global_asid_available++;
> +}
> +
> +/*
> + * Check whether a process is currently active on more than "threshold" CPUs.
> + * This is a cheap estimation on whether or not it may make sense to assign
> + * a global ASID to this process, and use broadcast TLB invalidation.
> + */
> +static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
> +{
> +	int count = 0;
> +	int cpu;
> +
> +	/* This quick check should eliminate most single threaded programs. */
> +	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
> +		return false;
> +
> +	/* Slower check to make sure. */
> +	for_each_cpu(cpu, mm_cpumask(mm)) {
> +		/* Skip the CPUs that aren't really running this process. */
> +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> +			continue;

Do you really want to make loaded_mm accessed from other cores? Does this
really provide worthy benefit?

Why not just use cpumask_weight() and be done with it? Anyhow it’s a
heuristic.

> +
> +		if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
> +			continue;
> +
> +		if (++count > threshold)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +/*
> + * Assign a global ASID to the current process, protecting against
> + * races between multiple threads in the process.
> + */
> +static void use_global_asid(struct mm_struct *mm)
> +{
> +	guard(raw_spinlock_irqsave)(&global_asid_lock);
> +
> +	/* This process is already using broadcast TLB invalidation. */
> +	if (mm->context.global_asid)
> +		return;
> +
> +	/* The last global ASID was consumed while waiting for the lock. */
> +	if (!global_asid_available)

I think "global_asid_available > 0" would make more sense.

> +		return;
> +
> +	/*
> +	 * The transition from IPI TLB flushing, with a dynamic ASID,
> +	 * and broadcast TLB flushing, using a global ASID, uses memory
> +	 * ordering for synchronization.
> +	 *
> +	 * While the process has threads still using a dynamic ASID,
> +	 * TLB invalidation IPIs continue to get sent.
> +	 *
> +	 * This code sets asid_transition first, before assigning the
> +	 * global ASID.
> +	 *
> +	 * The TLB flush code will only verify the ASID transition
> +	 * after it has seen the new global ASID for the process.
> +	 */
> +	WRITE_ONCE(mm->context.asid_transition, true);

I would prefer smp_wmb() and document where the matching smp_rmb() (or
smp_mb) is.
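In other words, something like this (untested; the comments are only there
to show the pairing I have in mind, the exact placement is up to you):

	/* Writer side, in use_global_asid(): */
	WRITE_ONCE(mm->context.asid_transition, true);
	/* Make the transition flag visible before the global ASID. */
	smp_wmb();
	WRITE_ONCE(mm->context.global_asid, get_global_asid());

	/* Reader side, wherever the flush path inspects the mm: */
	asid = READ_ONCE(mm->context.global_asid);
	/* Pairs with the smp_wmb() in use_global_asid(). */
	smp_rmb();
	transition = READ_ONCE(mm->context.asid_transition);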
> +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
> +
> +	global_asid_available--;
> +}
> +
> +/*
> + * Figure out whether to assign a global ASID to a process.
> + * We vary the threshold by how empty or full global ASID space is.
> + * 1/4 full: >= 4 active threads
> + * 1/2 full: >= 8 active threads
> + * 3/4 full: >= 16 active threads
> + * 7/8 full: >= 32 active threads
> + * etc
> + *
> + * This way we should never exhaust the global ASID space, even on very
> + * large systems, and the processes with the largest number of active
> + * threads should be able to use broadcast TLB invalidation.
> + */
> +#define HALFFULL_THRESHOLD 8
> +static bool meets_global_asid_threshold(struct mm_struct *mm)
> +{
> +	int avail = global_asid_available;
> +	int threshold = HALFFULL_THRESHOLD;
> +
> +	if (!avail)
> +		return false;
> +
> +	if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
> +		threshold = HALFFULL_THRESHOLD / 4;
> +	} else if (avail > MAX_ASID_AVAILABLE / 2) {
> +		threshold = HALFFULL_THRESHOLD / 2;
> +	} else if (avail < MAX_ASID_AVAILABLE / 3) {
> +		do {
> +			avail *= 2;
> +			threshold *= 2;
> +		} while ((avail + threshold) < MAX_ASID_AVAILABLE / 2);
> +	}
> +
> +	return mm_active_cpus_exceeds(mm, threshold);
> +}
> +
> +static void consider_global_asid(struct mm_struct *mm)
> +{
> +	if (!static_cpu_has(X86_FEATURE_INVLPGB))
> +		return;
> +
> +	/* Check every once in a while. */
> +	if ((current->pid & 0x1f) != (jiffies & 0x1f))
> +		return;
> +
> +	if (meets_global_asid_threshold(mm))
> +		use_global_asid(mm);
> +}
> +
> +static void finish_asid_transition(struct flush_tlb_info *info)
> +{
> +	struct mm_struct *mm = info->mm;
> +	int bc_asid = mm_global_asid(mm);
> +	int cpu;
> +
> +	if (!READ_ONCE(mm->context.asid_transition))
> +		return;
> +
> +	for_each_cpu(cpu, mm_cpumask(mm)) {
> +		/*
> +		 * The remote CPU is context switching. Wait for that to
> +		 * finish, to catch the unlikely case of it switching to
> +		 * the target mm with an out of date ASID.
> +		 */
> +		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
> +			cpu_relax();

Although this code should rarely run, it seems bad for a couple of reasons:

1. It is a new busy-wait in a very delicate place. Lockdep is blind to this
change.

2. cpu_tlbstate is supposed to be private for each core - that’s why there
is cpu_tlbstate_shared. But I really think loaded_mm should be kept private.

Can't we just do one TLB shootdown if
cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids

> +
> +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
> +			continue;
> +
> +		/*
> +		 * If at least one CPU is not using the global ASID yet,
> +		 * send a TLB flush IPI. The IPI should cause stragglers
> +		 * to transition soon.
> +		 *
> +		 * This can race with the CPU switching to another task;
> +		 * that results in a (harmless) extra IPI.
> +		 */
> +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
> +			flush_tlb_multi(mm_cpumask(info->mm), info);
> +			return;
> +		}
> +	}
> +
> +	/* All the CPUs running this process are using the global ASID. */

I guess it’s ordered with the flushes (the flushes must complete first).

> +	WRITE_ONCE(mm->context.asid_transition, false);
> +}
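To be a bit more concrete about the "one TLB shootdown" alternative, this is
roughly what I mean (untested sketch; it deliberately over-flushes instead of
peeking at other CPUs' cpu_tlbstate, and whether the flag can be cleared
right away here needs a closer look):

	static void finish_asid_transition(struct flush_tlb_info *info)
	{
		struct mm_struct *mm = info->mm;

		if (!READ_ONCE(mm->context.asid_transition))
			return;

		/*
		 * If any other CPU might still be running this mm with a
		 * dynamic ASID, send a single IPI-based flush to mm_cpumask()
		 * and let flush_tlb_func() on each CPU switch itself over to
		 * the global ASID. No busy-wait, no remote loaded_mm reads.
		 */
		if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
			flush_tlb_multi(mm_cpumask(mm), info);

		WRITE_ONCE(mm->context.asid_transition, false);
	}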
> +
> +static void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{
> +	bool pmd = info->stride_shift == PMD_SHIFT;
> +	unsigned long maxnr = invlpgb_count_max;
> +	unsigned long asid = info->mm->context.global_asid;
> +	unsigned long addr = info->start;
> +	unsigned long nr;
> +
> +	/* Flushing multiple pages at once is not supported with 1GB pages. */
> +	if (info->stride_shift > PMD_SHIFT)
> +		maxnr = 1;
> +
> +	/*
> +	 * TLB flushes with INVLPGB are kicked off asynchronously.
> +	 * The inc_mm_tlb_gen() guarantees page table updates are done
> +	 * before these TLB flushes happen.
> +	 */
> +	if (info->end == TLB_FLUSH_ALL) {
> +		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
> +		/* Do any CPUs supporting INVLPGB need PTI? */
> +		if (static_cpu_has(X86_FEATURE_PTI))
> +			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
> +	} else do {
> +		/*
> +		 * Calculate how many pages can be flushed at once; if the
> +		 * remainder of the range is less than one page, flush one.
> +		 */
> +		nr = min(maxnr, (info->end - addr) >> info->stride_shift);
> +		nr = max(nr, 1);
> +
> +		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
> +		/* Do any CPUs supporting INVLPGB need PTI? */
> +		if (static_cpu_has(X86_FEATURE_PTI))
> +			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
> +		addr += nr << info->stride_shift;
> +	} while (addr < info->end);

I would have preferred for instead of while...

> +
> +	finish_asid_transition(info);
> +
> +	/* Wait for the INVLPGBs kicked off above to finish. */
> +	tlbsync();
> +}
> +#endif /* CONFIG_X86_BROADCAST_TLB_FLUSH */
> +
> /*
>  * Given an ASID, flush the corresponding user ASID. We can delay this
>  * until the next time we switch to it.
> @@ -556,8 +857,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> 	 */
> 	if (prev == next) {
> 		/* Not actually switching mm's */
> -		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> -			   next->context.ctx_id);
> +		VM_WARN_ON(is_dyn_asid(prev_asid) &&
> +			   this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> +			   next->context.ctx_id);
>
> 		/*
> 		 * If this races with another thread that enables lam, 'new_lam'
> @@ -573,6 +875,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> 			   !cpumask_test_cpu(cpu, mm_cpumask(next))))
> 			cpumask_set_cpu(cpu, mm_cpumask(next));
>
> +		/*
> +		 * Check if the current mm is transitioning to a new ASID.
> +		 */
> +		if (needs_global_asid_reload(next, prev_asid)) {
> +			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> +
> +			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> +			goto reload_tlb;
> +		}
> +
> +		/*
> +		 * Broadcast TLB invalidation keeps this PCID up to date
> +		 * all the time.
> +		 */
> +		if (is_global_asid(prev_asid))
> +			return;
> +
> 		/*
> 		 * If the CPU is not in lazy TLB mode, we are just switching
> 		 * from one thread in a process to another thread in the same
> @@ -606,6 +925,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> 		 */
> 		cond_mitigation(tsk);
>
> +		/*
> +		 * Let nmi_uaccess_okay() and finish_asid_transition()
> +		 * know that we're changing CR3.
> +		 */
> +		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
> +		barrier();
> +
> 		/*
> 		 * Leave this CPU in prev's mm_cpumask. Atomic writes to
> 		 * mm_cpumask can be expensive under contention. The CPU
> @@ -620,14 +946,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>
> 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> -
> -		/* Let nmi_uaccess_okay() know that we're changing CR3. */
> -		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
> -		barrier();
> 	}
>
> +reload_tlb:
> 	new_lam = mm_lam_cr3_mask(next);
> 	if (need_flush) {
> +		VM_BUG_ON(is_global_asid(new_asid));
> 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
> 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
> 		load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
> @@ -746,7 +1070,7 @@ static void flush_tlb_func(void *info)
> 	const struct flush_tlb_info *f = info;
> 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> -	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
> +	u64 local_tlb_gen;
> 	bool local = smp_processor_id() == f->initiating_cpu;
> 	unsigned long nr_invalidate = 0;
> 	u64 mm_tlb_gen;
> @@ -769,6 +1093,16 @@ static void flush_tlb_func(void *info)
> 	if (unlikely(loaded_mm == &init_mm))
> 		return;
>
> +	/* Reload the ASID if transitioning into or out of a global ASID */
> +	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
> +		switch_mm_irqs_off(NULL, loaded_mm, NULL);
> +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> +	}
> +
> +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
> +	if (is_global_asid(loaded_mm_asid))
> +		return;
> +
> 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
> 		   loaded_mm->context.ctx_id);
> @@ -786,6 +1120,8 @@ static void flush_tlb_func(void *info)
> 		return;
> 	}
>
> +	local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
> +
> 	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
> 		     f->new_tlb_gen <= local_tlb_gen)) {
> 		/*
> @@ -953,7 +1289,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
> 	 * up on the new contents of what used to be page tables, while
> 	 * doing a speculative memory access.
> 	 */
> -	if (info->freed_tables)
> +	if (info->freed_tables || in_asid_transition(info))
> 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
> 	else
> 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
> @@ -1049,9 +1385,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> 	 * a local TLB flush is needed. Optimize this use-case by calling
> 	 * flush_tlb_func_local() directly in this case.
> 	 */
> -	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {

I think an smp_rmb() here would communicate the fact in_asid_transition()
and mm_global_asid() must be ordered.

> +	if (mm_global_asid(mm)) {
> +		broadcast_tlb_flush(info);
> +	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> 		info->trim_cpumask = should_trim_cpumask(mm);
> 		flush_tlb_multi(mm_cpumask(mm), info);
> +		consider_global_asid(mm);
> 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> 		lockdep_assert_irqs_enabled();
> 		local_irq_disable();
> -- 
> 2.47.1
> 