From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 558A1E77197
	for <linux-mm@archiver.kernel.org>; Thu,  9 Jan 2025 20:22:32 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id D08EF6B00AA; Thu,  9 Jan 2025 15:22:31 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id CB4526B00AE; Thu,  9 Jan 2025 15:22:31 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id B7C8D6B00B0; Thu,  9 Jan 2025 15:22:31 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11])
	by kanga.kvack.org (Postfix) with ESMTP id 95F526B00AA
	for <linux-mm@kvack.org>; Thu,  9 Jan 2025 15:22:31 -0500 (EST)
Received: from smtpin14.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay10.hostedemail.com (Postfix) with ESMTP id 4CE50C04B8
	for <linux-mm@kvack.org>; Thu,  9 Jan 2025 20:22:31 +0000 (UTC)
X-FDA: 82989036102.14.C73BAB5
Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147])
	by imf23.hostedemail.com (Postfix) with ESMTP id AA78E14001B
	for <linux-mm@kvack.org>; Thu,  9 Jan 2025 20:22:28 +0000 (UTC)
Authentication-Results: imf23.hostedemail.com;
	dkim=none;
	dmarc=none;
	spf=pass (imf23.hostedemail.com: domain of riel@shelob.surriel.com designates 96.67.55.147 as permitted sender) smtp.mailfrom=riel@shelob.surriel.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1736454148; a=rsa-sha256;
	cv=none;
	b=wXSm5dUye6AK0qVn2riJP5521/RDPRJHIl4m5Z3nYF8dEf3Jk/YwD88Dt4mLoWcgiSht1T
	kfNP0Kxn35BdkfOsYxDYGfito4UNkhHJFJ8qIvN0eJFKihYhcIgkfPC+Ehp64m/9Jbze/1
	6/0Wx5yOyiQardKvPnSB9jkQgj3gPX8=
ARC-Authentication-Results: i=1;
	imf23.hostedemail.com;
	dkim=none;
	dmarc=none;
	spf=pass (imf23.hostedemail.com: domain of riel@shelob.surriel.com designates 96.67.55.147 as permitted sender) smtp.mailfrom=riel@shelob.surriel.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1736454148;
	h=from:from:sender:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=dpq4vTpqbKlRDiifE27zVPmZFm6sTZTg00HNe1r9d5s=;
	b=mP4DxPJVQ5iWH8Yr5uKOtcTpjIXOs6d4f1PxQiLxq+pUSSuIkLy0bPF4kY/pKZai0KxQ28
	ROy1x1Lwv1q+GeD4TrfPWnQpNS4/dy7Cs2OM6yWEeR4PDd0qSL45VaEDxhyx3E4hJcftUY
	lxjRUWmKVKCueL14AL6nfaXTM5hi/dw=
Received: from fangorn.home.surriel.com ([10.0.13.7])
	by shelob.surriel.com with esmtpsa  (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.97.1)
	(envelope-from <riel@shelob.surriel.com>)
	id 1tVyxd-000000007eP-49xB;
	Thu, 09 Jan 2025 15:16:58 -0500
Message-ID: <855298e6e981378c3afeab93b8c3cb821a7a5b88.camel@surriel.com>
Subject: Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
From: Rik van Riel <riel@surriel.com>
To: Dave Hansen <dave.hansen@intel.com>, x86@kernel.org
Cc: linux-kernel@vger.kernel.org, kernel-team@meta.com, 
	dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org, 
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com, 
	akpm@linux-foundation.org, nadav.amit@gmail.com,
 zhengqi.arch@bytedance.com, 	linux-mm@kvack.org
Date: Thu, 09 Jan 2025 15:16:57 -0500
In-Reply-To: <bbe96ff4-3633-4008-b524-6b183a64caf6@intel.com>
References: <20241230175550.4046587-1-riel@surriel.com>
	 <20241230175550.4046587-7-riel@surriel.com>
	 <bbe96ff4-3633-4008-b524-6b183a64caf6@intel.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
User-Agent: Evolution 3.54.1 (3.54.1-1.fc41) 
MIME-Version: 1.0
X-Rspam-User: 
X-Rspamd-Server: rspam03
X-Rspamd-Queue-Id: AA78E14001B
X-Stat-Signature: hd8gwkacikm6neyysi5h4fhjahhpsn7s
X-HE-Tag: 1736454148-812867
X-HE-Meta: U2FsdGVkX18OuVRnHtOZxJaGtzQYDjD9clwzsvJ1fIxKmoEKMEKUHWYktCD7Y1+OrYSLGvPg9V7PFQ0ZvkdH71U9THGiDhPBchEzOoLCmCzwufuksKH0xghsWKwWAwyPyfw0I2NvrZZHHIWOWxIdK+noIXXZQ6NDyRobgw0jCh9POPVW184wsLF5e0HdCIxNmgqPpMQDKglvhkfBKu51YytLMyW8dfLcy62i1MOdRiV1Epvjj/1KzI9/sYRInxjm7CIHeW61fpocOVeKBRf0asc2tu+SdyX2kKuWrT22pvaGcK+EqY9QXZctNsWV/IcXTOGDwT4KPCgVuBDYOK9V2nipEYgWtkwp9U8LCrBpu5j/Y7SDbk84eB+bnlu3mf+/wdNl4KsHmjMUF2QR7H27dtN02i8PyOqt6PeHDF2JnITx72Lb8SzLtLk6/DUHYgYhF+ldPxKChplfDh4i6HzkR/vIo0DzisYHVrNLuyEC+4gPRlk2mIRXES6YGbyYepyDitysK6wj/DF9ubwvt0vDPhbkT2AQmo9cwlX3nn2I9fuI1cj8alPdN1162++J9zqqjcLWfyXa7qectJqwidMPly3zOjHXjlcwgWi3m8ZwvuysO6vizb5FvuXqe/qGPSr9qGqJ1qcqF+8biQLw2FpDsxS3y6S7Y7E7PsiXE+2U8qFAG9aVeD2diGA2hzownlWtDCSPoe0N9TWXP71TEIKMNg1DUeas+BfecwR/0saJbGjl+GDuE+rTajPbojO0E/2lEcl8YmTZB/dnK24iefimfPQP9txKK3oCvxitBVUZreuH19DqWHpJ8fMsuNXhVuoOd06h/e7IpOV3rYYBeQ+9sCQZftfIGAH/iI/5lLwQH+pf2jK8ZEhIU+OJOl1SmEl5JRLYXBpTxlpoVv9inSOxNydX28GFdZItu1z8kEDSUYV0xHwBXXkSRcmMQ0fCblBJg4S3u4a3uSH3b3sNOvG
 C194EGo4
 0nhlyWH6YPeqXYBdz8cDKMyn3y7xMYtGiBwdR3iRqxsQoldKG+Yn06yrFF8akm6onQZwCKP7EGn7bVqZ0XI7yukzl9pT5zXIoF1QEASqYMmt7/9sGbbAcGE1Y2AtfAKPEpZy08g/dEPCvNcULahHyEERVJB8OUTbuy7UOyJjDMaDcTbSfUBBaw9MWiYwoACZCnd+PnwQERrgv8LCR3JKRrH8bDBwrnVbkjkwu2RJGMyfStM0SMxa3cp9+uFkkdoChUN+i5dWTWWHmsihawDNAOwoxvQRTGfOW71g3630DNnRupJHABNpOnH11Npdvge2aGhEIBGeOg60IMNFSUPNZ3JawEQ==
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Mon, 2025-01-06 at 09:21 -0800, Dave Hansen wrote:
> On 12/30/24 09:53, Rik van Riel wrote:
>=20
>=20
> > +static void broadcast_kernel_range_flush(unsigned long start,
> > unsigned long end)
> > +{
> > +	unsigned long addr;
> > +	unsigned long maxnr =3D invlpgb_count_max;
> > +	unsigned long threshold =3D tlb_single_page_flush_ceiling *
> > maxnr;
>=20
> The 'tlb_single_page_flush_ceiling' value was determined by looking
> at
> _local_ invalidation cost. Could you talk a bit about why it's also a
> good value to use for remote invalidations? Does it hold up for
> INVLPGB
> the same way it did for good ol' INVLPG? Has there been any explicit
> testing here to find a good value?
>=20
> I'm also confused by the multiplication here. Let's say
> invlpgb_count_max=3D=3D20 and tlb_single_page_flush_ceiling=3D=3D30.
>=20
> You would need to switch away from single-address invalidation when
> the
> number of addresses is >20 for INVLPGB functional reasons. But you'd
> also need to switch away when >30 for performance reasons
> (tlb_single_page_flush_ceiling).
>=20
> But I don't understand how that would make the threshold 20*30=3D600
> invalidations.

I have not done any measurement to see how
flushing with INVLPGB stacks up versus
local TLB flushes.

What makes INVLPGB potentially slower:
- These flushes are done globally

What makes INVLPGB potentially faster:
- Multiple flushes can be pending simultaneously,
  and executed in any convenient order by the CPUs.
- Wait once on completion of all the queued flushes.

Another thing that makes things interesting is the=C2=A0
TLB entry coalescing done by AMD CPUs.

When multiple pages are both virtually and physically
contiguous in memory (which is fairly common), the
CPU can use a single TLB entry to map up to 8 of them.

That means if we issue eg. 20 INVLPGB flushes for
8 4kB pages each, instead of the CPUs needing to
remove 160 TLB entries, there might only be 50.

I just guessed at the numbers used in my code,
while trying to sort out the details elsewhere
in the code.

How should we go about measuring the tradeoffs
between invalidation time, and the time spent
in TLB misses from flushing unnecessary stuff?

>=20
> > +	/*
> > +	 * TLBSYNC only waits for flushes originating on the same
> > CPU.
> > +	 * Disabling migration allows us to wait on all flushes.
> > +	 */
>=20
> Imperative voice here too, please:
>=20
> 	Disable migration to wait on all flushes.
>=20
> > +	guard(preempt)();
> > +
> > +	if (end =3D=3D TLB_FLUSH_ALL ||
> > +	=C2=A0=C2=A0=C2=A0 (end - start) > threshold << PAGE_SHIFT) {
>=20
> This is basically a copy-and-paste of the "range vs. global" flush
> logic, but taking 'invlpgb_count_max' into account.
>=20
> It would be ideal if those limit checks could be consolidated. I
> suspect
> that when the 'threshold' calculation above gets clarified that they
> may
> be easier to consolidate.

Maybe?

I implemented another suggestion in the code,
and the start of flush_tlb_kernel_range() now
looks like this:

void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
        if (broadcast_kernel_range_flush(start, end))
                return;

If we are to consolidate the limit check, it
should probably be in a helper function somewhere,
and not by spreading out the broadcast flush calls.

>=20
> BTW, what is a typical value for 'invlpgb_count_max'? Is it more or
> less
> than the typical value for 'tlb_single_page_flush_ceiling'?
>=20
> Maybe we should just lower 'tlb_single_page_flush_ceiling' if
> 'invlpgb_count_max' falls below it so we only have _one_ runtime
> value
> to consider.

The value for invlpgb_count_max on both Milan
and Bergamo CPUs appears to be 8. That is, the
CPU reports we can flush 7 additional pages
beyond a single page.

This matches the number of PTEs that can be=20
cached in one TLB entry if they are contiguous
and aligned, and matches one cache line worth
of PTE entries.

>=20
>=20
> > +		invlpgb_flush_all();
> > +	} else {
> > +		unsigned long nr;
> > +		for (addr =3D start; addr < end; addr +=3D nr <<
> > PAGE_SHIFT) {
> > +			nr =3D min((end - addr) >> PAGE_SHIFT,
> > maxnr);
> > +			invlpgb_flush_addr(addr, nr);
> > +		}
> > +	}
> > +
> > +	tlbsync();
> > +}
> > +
> > =C2=A0static void do_kernel_range_flush(void *info)
> > =C2=A0{
> > =C2=A0	struct flush_tlb_info *f =3D info;
> > @@ -1089,6 +1115,11 @@ static void do_kernel_range_flush(void
> > *info)
> > =C2=A0
> > =C2=A0void flush_tlb_kernel_range(unsigned long start, unsigned long
> > end)
> > =C2=A0{
> > +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> > +		broadcast_kernel_range_flush(start, end);
> > +		return;
> > +	}
> > +
> > =C2=A0	/* Balance as user space task's flush, a bit conservative
> > */
> > =C2=A0	if (end =3D=3D TLB_FLUSH_ALL ||
> > =C2=A0	=C2=A0=C2=A0=C2=A0 (end - start) > tlb_single_page_flush_ceiling=
 <<
> > PAGE_SHIFT) {
>=20
> I also wonder if this would all get simpler if we give in and
> *always*
> call get_flush_tlb_info(). That would provide a nice single place to
> consolidate the "all vs. ranged" flush logic.

Possibly. That might be a good way to unify that
threshold check?

That should probably be a separate patch, though.


--=20
All Rights Reversed.