From mboxrd@z Thu Jan 1 00:00:00 1970
From: Thomas Gleixner <tglx@linutronix.de>
To: Nadav Amit
Cc: Uladzislau Rezki, "Russell King (Oracle)", Andrew Morton, linux-mm,
 Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
 John Ogness, linux-arm-kernel@lists.infradead.org, Mark Rutland,
 Marc Zyngier, x86@kernel.org
Subject: Re: Excessive TLB flush ranges
In-Reply-To:
References: <87a5y5a6kj.ffs@tglx> <87353x9y3l.ffs@tglx>
 <87zg658fla.ffs@tglx> <87r0rg93z5.ffs@tglx> <87cz308y3s.ffs@tglx>
 <87y1lo7a0z.ffs@tglx> <87o7mk733x.ffs@tglx>
 <7ED917BC-420F-47D4-8956-8984205A75F0@gmail.com> <87bkik6pin.ffs@tglx>
 <87353v7qms.ffs@tglx>
Date: Wed, 17 May 2023 12:31:04 +0200
Message-ID: <87ttwb5jx3.ffs@tglx>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Nadav!

On Tue, May 16 2023 at 18:23, Nadav Amit wrote:
>> On May 16, 2023, at 5:23 PM, Thomas Gleixner wrote:
>>> I'm not ignoring them and I'm well aware of these issues. No need to
>>> repeat them over and over. I'm old but not senile yet.
>
> Thomas, no disrespect was intended. I initially just sent the link and I
> had a sense (based on my past experience) that nobody clicked on it.

All good.

>> It makes a whole lot of a difference whether you do 5 IPIs in a row
>> which all need to get a cache line updated or if you have _one_ which
>> needs a couple of cache lines updated.
>
> Obviously, if the question is 5 IPIs or 1 IPI with more flushing data,
> the 1 IPI wins. The question I was focusing on is whether 1 IPI with a
> potentially global flush or a detailed list of ranges to flush.

Correct, and there is obviously a tradeoff there too, which has yet to
be determined.

>> INVLPG is not serializing so the CPU can pull in the next required cache
>> line(s) on the VA list during that.
>
> Indeed, but ChatGPT says (yes, I see you making fun of me already):
>
> "however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
> can cause a pipeline stall because the TLB entry invalidation must be
> completed before subsequent instructions that might rely on the TLB can
> be executed correctly."
>
> So I am not sure that your claim is exactly correct.

The key is a subsequent instruction which might depend on the
to-be-flushed TLB entry. That's obvious, but I'm having a hard time
constructing that dependent instruction in this case.

>> These cache lines are _not_
>> contended at that point because _all_ of these data structures are no
>> longer globally accessible (mis-speculation aside) and therefore not
>> exclusive (misalignment aside, but you have to prove that this is an
>> issue).
>
> This is not entirely true. Indeed, whether you have 1 remote core or N
> remote cores is not the whole issue (putting aside NUMA). But you will
> get first a snoop to the initiator cache by the responding core, and
> then, after the TLB invalidation is completed, an RFO by the initiator
> once it writes to the cache again. If the invalidation data is on the
> stack (as you did), this is even more likely to happen shortly after.

That's correct, and there might be smarter ways to handle that list
muck.

>> So just dismissing this on 10 years old experience is not really
>> helpful, though I'm happy to confirm your points once I had the time
>> and opportunity to actually run real testing over it, unless you beat
>> me to it.
>
> I really don't know what "dismissing" you are talking about.

Sorry, I was overreacting due to increased grumpiness.

> I do have relatively recent experience with the overhead of caching
> effects on TLB shootdown time. It can become very apparent. You can
> find some numbers in, for instance, the patch of mine I quoted in my
> previous email.
>
> There are additional opportunities to reduce the caching effects for
> x86, such as combining the SMP-code metadata with the TLB-invalidation
> metadata (which is out of scope) that I saw having a performance
> benefit. That's all to say that the caching effect is not something to
> be considered obsolete.

I never claimed that it does not matter. That's surely part of the
decision making to investigate that.

>> The point is that the generic vmalloc code is making assumptions which
>> are x86 centric and not even necessarily true on x86.
>>
>> Whether or not this is beneficial on x86 is a completely separate
>> debate.
>
> I fully understand that if you reduce multiple TLB shootdowns (IPI-wise)
> to 1, it is (pretty much) all benefit and there is no tradeoff. I was
> focusing on the question of whether it is beneficial also to do precise
> TLB flushing, and the tradeoff there is less clear (especially as the
> kernel uses 2MB pages).

For the vmalloc() area mappings? Not really.

> My experience with non-IPI based TLB invalidations is more limited. IIUC
> the usage model is that the TLB shootdowns should be invoked ASAP
> (perhaps each range can be batched, but there is no sense in batching
> multiple ranges), and then later you would issue some barrier to ensure
> prior TLB shootdown invocations have been completed.
>
> If that is the (use) case, I am not sure the abstraction you used in
> your prototype is the best one.
The way arm/arm64 implement that in software is:

    magic_barrier1();
    flush_range_with_magic_opcodes();
    magic_barrier2();

And for that use case, having the list with individual ranges is not
really wrong. Maybe ARM[64] could do this smarter, but that would
require rewriting a lot of code, I assume.

>> There is also a debate required whether a wholesale "flush on _ALL_
>> CPUs" is justified when some of those CPUs are completely isolated and
>> have absolutely no chance to be affected by that. This process-bound
>> seccomp/BPF muck clearly does not justify kicking isolated CPUs out of
>> their computation in user space just because...
>
> I hope you would excuse my ignorance (I am sure you won't), but aren't
> the seccomp/BPF VMAP ranges mapped in all processes (considering
> PTI of course)? Are you suggesting you want a per-process kernel
> address space? (which can make sense, I guess)

Right. The BPF muck is mapped in the global kernel space, but e.g. the
seccomp filters are individual per process. At least that's how I
understand it, but I might be completely wrong.

Thanks,

        tglx