From mboxrd@z Thu Jan 1 00:00:00 1970
From: Thomas Gleixner <tglx@linutronix.de>
To: Nadav Amit
Cc: Uladzislau Rezki, "Russell King (Oracle)", Andrew Morton, linux-mm,
 Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
 John Ogness, linux-arm-kernel@lists.infradead.org, Mark Rutland,
 Marc Zyngier, x86@kernel.org
Subject: Re: Excessive TLB flush ranges
In-Reply-To:
References: <87a5y5a6kj.ffs@tglx> <87353x9y3l.ffs@tglx>
 <87zg658fla.ffs@tglx> <87r0rg93z5.ffs@tglx> <87cz308y3s.ffs@tglx>
 <87y1lo7a0z.ffs@tglx> <87o7mk733x.ffs@tglx>
 <7ED917BC-420F-47D4-8956-8984205A75F0@gmail.com> <87bkik6pin.ffs@tglx>
 <87353v7qms.ffs@tglx>
Date: Wed, 17 May 2023 12:31:04 +0200
Message-ID: <87ttwb5jx3.ffs@tglx>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Nadav!

On Tue, May 16 2023 at 18:23, Nadav Amit wrote:
>> On May 16, 2023, at 5:23 PM, Thomas Gleixner wrote:
>>> I'm not ignoring them and I'm well aware of these issues. No need to
>>> repeat them over and over. I'm old but not senile yet.
>
> Thomas, no disrespect was intended. I initially just sent the link and I
> had a sense (based on my past experience) that nobody clicked on it.

All good.

>> It makes a whole lot of a difference whether you do 5 IPIs in a row
>> which all need to get a cache line updated or if you have _one_ which
>> needs a couple of cache lines updated.
>
> Obviously, if the question is 5 IPIs or 1 IPI with more flushing data,
> the 1 IPI wins. The question I was focusing on is whether 1 IPI with a
> potentially global flush or a detailed list of ranges to flush.

Correct, and there is obviously a tradeoff there too, which has yet to
be determined.

>> INVLPG is not serializing so the CPU can pull in the next required cache
>> line(s) on the VA list during that.
>
> Indeed, but ChatGPT says (yes, I see you making fun of me already):
>
> "however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
> can cause a pipeline stall because the TLB entry invalidation must be
> completed before subsequent instructions that might rely on the TLB can
> be executed correctly."
>
> So I am not sure that your claim is exactly correct.

The key is a subsequent instruction which might depend on the
to-be-flushed TLB entry. That's obvious, but I'm having a hard time
constructing that dependent instruction in this case.

>> These cache lines are _not_
>> contended at that point because _all_ of these data structures are no
>> longer globally accessible (mis-speculation aside) and therefore not
>> exclusive (misalignment aside, but you have to prove that this is an
>> issue).
>
> This is not entirely true. Indeed, whether you have 1 remote core or N
> remote cores is not the whole issue (putting aside NUMA). But you will
> get first a snoop to the initiator cache by the responding core, and
> then, after the TLB invalidation is completed, an RFO by the initiator
> once it writes to the cache again. If the invalidation data is on the
> stack (as you did), this is even more likely to happen shortly after.

That's correct, and there might be smarter ways to handle that list
muck.

>> So just dismissing this on 10 years old experience is not really
>> helpful, though I'm happy to confirm your points once I had the time
>> and opportunity to actually run real testing over it, unless you beat
>> me to it.
>
> I really don't know what "dismissing" you are talking about.

Sorry, I was overreacting due to increased grumpiness.

> I do have relatively recent experience with the overhead of caching
> effects on TLB shootdown time. It can become very apparent. You can
> find some numbers in, for instance, the patch of mine I quoted in my
> previous email.
>
> There are additional opportunities to reduce the caching effects for
> x86, such as combining the SMP-code metadata with the TLB-invalidation
> metadata (which is out of scope) that I saw having a performance
> benefit. That's all to say that the caching effect is not something to
> be considered obsolete.

I never claimed that it does not matter. That's surely part of the
decision making to investigate that.

>> The point is that the generic vmalloc code is making assumptions which
>> are x86 centric and not even necessarily true on x86.
>>
>> Whether or not this is beneficial on x86 is a completely separate
>> debate.
>
> I fully understand that if you reduce multiple TLB shootdowns (IPI-wise)
> to 1, it is (pretty much) all benefit and there is no tradeoff. I was
> focusing on the question of whether it is beneficial also to do precise
> TLB flushing, and the tradeoff there is less clear (especially as the
> kernel uses 2MB pages).

For the vmalloc() area mappings? Not really.

> My experience with non-IPI based TLB invalidations is more limited. IIUC
> the usage model is that the TLB shootdowns should be invoked ASAP
> (perhaps each range can be batched, but there is no sense in batching
> multiple ranges), and then later you would issue some barrier to ensure
> prior TLB shootdown invocations have been completed.
>
> If that is the (use) case, I am not sure the abstraction you used in
> your prototype is the best one.
The way arm/arm64 implement that in software is:

    magic_barrier1();
    flush_range_with_magic_opcodes();
    magic_barrier2();

And for that use case, having the list with individual ranges is not
really wrong. Maybe ARM[64] could do this smarter, but that would
require rewriting a lot of code, I assume.

>> There is also a debate required whether a wholesale "flush on _ALL_
>> CPUs" is justified when some of those CPUs are completely isolated and
>> have absolutely no chance to be affected by that. This process-bound
>> seccomp/BPF muck clearly does not justify kicking isolated CPUs out of
>> their computation in user space just because...
>
> I hope you would excuse my ignorance (I am sure you won't), but aren't
> the seccomp/BPF VMAP ranges mapped in all processes (considering
> PTI of course)? Are you suggesting you want a per-process kernel
> address space? (which can make sense, I guess)

Right. The BPF muck is mapped in the global kernel space, but e.g. the
seccomp filters are individual per process. At least that's how I
understand it, but I might be completely wrong.

Thanks,

        tglx