Subject: Re: [PATCH v4 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
From: Rik van Riel
To: Nadav Amit
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Borislav Petkov,
 peterz@infradead.org, Dave Hansen, zhengqi.arch@bytedance.com,
 thomas.lendacky@amd.com, kernel-team@meta.com,
 "open list:MEMORY MANAGEMENT", Andrew Morton, jannh@google.com
Date: Mon, 13 Jan 2025 22:13:08 -0500
References: <20250112155453.1104139-1-riel@surriel.com>
 <20250112155453.1104139-10-riel@surriel.com>
On Mon, 2025-01-13 at 15:09 +0200, Nadav Amit wrote:
> 
> > On 12 Jan 2025, at 17:53, Rik van Riel wrote:
> > 
> > +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> > +	u16 global_asid;
> > +	bool asid_transition;
> 
> As I later note, there are various ordering issues between the two.
> Would it be just easier to combine them into one field? I know
> everybody hates bitfields so I don't suggest it, but there are
> other ways...

It's certainly an option, but I think we figured out the
ordering issues, so at this point documentation and
readability of the code might be more important for
future-proofing?

> 
> > @@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
> >  static inline void destroy_context(struct mm_struct *mm)
> >  {
> >  	destroy_context_ldt(mm);
> > +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> 
> I'd prefer to use IS_ENABLED() and to have a stub for
> destroy_context_free_global_asid().
> 
> > +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> > +		destroy_context_free_global_asid(mm);
> > +#endif
> >  }

I'll think about how to do this more cleanly. I would like to
keep the cpu feature test in the inline function, so we don't
do an unnecessary function call on systems without INVLPGB.

> > 
> > +static inline bool in_asid_transition(const struct flush_tlb_info *info)
> > +{
> > +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> > +		return false;
> > +
> > +	return info->mm && info->mm->context.asid_transition;
> 
> READ_ONCE(context.asid_transition) ?

Changed for the next version.

> 
> > +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> > +{
> 
> Having a VM_WARN_ON() here might be nice.

Added. Thank you.
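
To sketch what I have in mind (illustrative only; the exact check
that lands in the next version may differ), it is just a sanity test
of the invariant that callers only reach this point once the mm has
a global ASID assigned:

static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
{
	/*
	 * Sketch: flush_tlb_mm_range() only takes this path after it
	 * has seen a global ASID for this mm, so warn if that
	 * invariant is ever violated.
	 */
	VM_WARN_ON_ONCE(!info->mm || !mm_global_asid(info->mm));

	/* ... INVLPGB flush loop as in the patch ... */
}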
> 
> > +static u16 get_global_asid(void)
> > +{
> > +	lockdep_assert_held(&global_asid_lock);
> > +
> > +	do {
> > +		u16 start = last_global_asid;
> > +		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> > +
> > +		if (asid >= MAX_ASID_AVAILABLE) {
> > +			reset_global_asid_space();
> > +			continue;
> > +		}
> > +
> > +		/* Claim this global ASID. */
> > +		__set_bit(asid, global_asid_used);
> > +		last_global_asid = asid;
> > +		return asid;
> > +	} while (1);
> 
> This does not make me feel easy at all. I do not understand
> why it might happen. The caller should've already checked the
> global ASID is available under the lock. If it is not obvious
> from the code, perhaps refactoring is needed.
> 

The caller checks whether we have a global ASID available,
anywhere in the namespace. This code will claim a specific one.

I guess the global_asid_available-- line could be moved
into get_global_asid() to improve readability?

> > +static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
> > +{
> > +	int count = 0;
> > +	int cpu;
> > +
> > +	/* This quick check should eliminate most single threaded programs. */
> > +	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
> > +		return false;
> > +
> > +	/* Slower check to make sure. */
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		/* Skip the CPUs that aren't really running this process. */
> > +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> > +			continue;
> 
> Do you really want to make loaded_mm accessed from other cores? Does
> this really provide worthy benefit?
> 
> Why not just use cpumask_weight() and be done with it? Anyhow it's a
> heuristic.

We recently added some code to make mm_cpumask maintenance
a lot lazier, which can result in more CPUs remaining in the
bitmap while not running the mm.

As for accessing loaded_mm from other CPUs, we already have
to do that in finish_asid_transition. I don't see any good
way around that, but I'm open to suggestions :)

> 
> > +	/*
> > +	 * The transition from IPI TLB flushing, with a dynamic ASID,
> > +	 * and broadcast TLB flushing, using a global ASID, uses memory
> > +	 * ordering for synchronization.
> > +	 *
> > +	 * While the process has threads still using a dynamic ASID,
> > +	 * TLB invalidation IPIs continue to get sent.
> > +	 *
> > +	 * This code sets asid_transition first, before assigning the
> > +	 * global ASID.
> > +	 *
> > +	 * The TLB flush code will only verify the ASID transition
> > +	 * after it has seen the new global ASID for the process.
> > +	 */
> > +	WRITE_ONCE(mm->context.asid_transition, true);
> 
> I would prefer smp_wmb() and document where the matching smp_rmb()
> (or smp_mb) is.
> 
> > +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
> > +
> > +	global_asid_available--;
> > +}
> > 

I would be happy with either style of ordering. It's all
a bit of a no-op anyway, because x86 does not do stores
out of order.

> > +static void finish_asid_transition(struct flush_tlb_info *info)
> > +{
> > +	struct mm_struct *mm = info->mm;
> > +	int bc_asid = mm_global_asid(mm);
> > +	int cpu;
> > +
> > +	if (!READ_ONCE(mm->context.asid_transition))
> > +		return;
> > +
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		/*
> > +		 * The remote CPU is context switching. Wait for that to
> > +		 * finish, to catch the unlikely case of it switching to
> > +		 * the target mm with an out of date ASID.
> > +		 */
> > +		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
> > +			cpu_relax();
> 
> Although this code should rarely run, it seems bad for a couple of
> reasons:
> 
> 1. It is a new busy-wait in a very delicate place. Lockdep is blind
>    to this change.

This is true. However, if a CPU gets stuck in the middle of
switch_mm_irqs_off, won't we have a bigger problem?

> 
> 2. cpu_tlbstate is supposed to be private for each core - that's why
>    there is cpu_tlbstate_shared. But I really think loaded_mm should
>    be kept private.
> 
> Can't we just do one TLB shootdown if
> cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids

That conflicts with a future optimization of simply not
maintaining the mm_cpumask at all for processes that use
broadcast TLB invalidation. After all, once we no longer use
the mm_cpumask for anything, why incur the overhead of
maintaining it?

I would like to understand more about the harm of reading
loaded_mm. How is reading loaded_mm worse than reading other
per-CPU variables that are written from the same code paths?

> 
> > +	/* All the CPUs running this process are using the global ASID. */
> 
> I guess it's ordered with the flushes (the flushes must complete
> first).
> 

If there are any flushes. If all the CPUs we scanned are
already using the global ASID, we do not need any additional
ordering here, since any CPUs switching to this mm afterward
will see the new global ASID.

> > +	WRITE_ONCE(mm->context.asid_transition, false);
> > +}
> > 
> > +	} else do {
> > +		/*
> > +		 * Calculate how many pages can be flushed at once; if the
> > +		 * remainder of the range is less than one page, flush one.
> > +		 */
> > +		nr = min(maxnr, (info->end - addr) >> info->stride_shift);
> > +		nr = max(nr, 1);
> > +
> > +		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
> > +		/* Do any CPUs supporting INVLPGB need PTI? */
> > +		if (static_cpu_has(X86_FEATURE_PTI))
> > +			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
> > +		addr += nr << info->stride_shift;
> > +	} while (addr < info->end);
> 
> I would have preferred for instead of while...

Changed that for the next version. Thank you.

> > @@ -1049,9 +1385,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> >  	 * a local TLB flush is needed. Optimize this use-case by calling
> >  	 * flush_tlb_func_local() directly in this case.
> >  	 */
> > -	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> 
> I think an smp_rmb() here would communicate the fact that
> in_asid_transition() and mm_global_asid() must be ordered.
> 
> > +	if (mm_global_asid(mm)) {
> > +		broadcast_tlb_flush(info);

We have the barrier already, in the form of inc_mm_tlb_gen
a few lines up.

Or are you talking about a barrier (or READ_ONCE?) inside of
mm_global_asid(), to make sure the compiler cannot reorder
things around mm_global_asid()?

> > +	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> >  		info->trim_cpumask = should_trim_cpumask(mm);
> >  		flush_tlb_multi(mm_cpumask(mm), info);
> > +		consider_global_asid(mm);
> >  	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> >  		lockdep_assert_irqs_enabled();
> >  		local_irq_disable();
> > -- 
> > 2.47.1
> > 
> 

-- 
All Rights Reversed.