From: Thomas Gleixner <tglx@linutronix.de>
To: Mark Rutland
Cc: Nadav Amit, Uladzislau Rezki, "Russell King (Oracle)", Andrew Morton,
 linux-mm, Christoph Hellwig, Lorenzo Stoakes, Peter Zijlstra, Baoquan He,
 John Ogness, linux-arm-kernel@lists.infradead.org, Marc Zyngier,
 x86@kernel.org
Subject: Re: Excessive TLB flush ranges
Date: Wed, 17 May 2023 18:41:44 +0200
Message-ID: <87bkii6hbr.ffs@tglx>
References: <87cz308y3s.ffs@tglx> <87y1lo7a0z.ffs@tglx> <87o7mk733x.ffs@tglx>
 <7ED917BC-420F-47D4-8956-8984205A75F0@gmail.com> <87bkik6pin.ffs@tglx>
 <87353v7qms.ffs@tglx> <87ttwb5jx3.ffs@tglx>
On Wed, May 17 2023 at 15:43, Mark Rutland wrote:
> On Wed, May 17, 2023 at 12:31:04PM +0200, Thomas Gleixner wrote:
>> The way how arm/arm64 implement that in software is:
>>
>> magic_barrier1();
>> flush_range_with_magic_opcodes();
>> magic_barrier2();
>
> FWIW, on arm64 that sequence (for leaf entries only) is:
>
> /*
>  * Make sure prior writes to the page table entries are visible to all
>  * CPUs, so that *subsequent* page table walks will see the latest
>  * values.
>  *
>  * This is roughly __smp_wmb().
>  */
> dsb(ishst) // AKA magic_barrier1()
>
> /*
>  * The "TLBI *IS, " instructions send a message to all other
>  * CPUs, essentially saying "please start invalidating entries for
>  * "
>  *
>  * The "TLBI *ALL*IS" instructions send a message to all other CPUs,
>  * essentially saying "please start invalidating all entries".
>  *
>  * In theory, this could be for discontiguous ranges.
>  */
> flush_range_with_magic_opcodes()
>
> /*
>  * Wait for acknowledgement that all prior TLBIs have completed. This
>  * also ensures that all accesses using those translations have also
>  * completed.
>  *
>  * This waits for all relevant CPUs to acknowledge completion of any
>  * prior TLBIs sent by this CPU.
>  */
> dsb(ish) // AKA magic_barrier2()
> isb()
>
> So you can batch a bunch of "TLBI *IS, " with a single barrier for
> completion, or you can use a single "TLBI *ALL*IS" to invalidate
> everything.
>
> It can still be worth using the latter, as arm64 has done since commit:
>
>   05ac65305437e8ef ("arm64: fix soft lockup due to large tlb flush range")
>
> ... as for a large range, issuing a bunch of "TLBI *IS, " can take a
> while, and can require the recipient CPUs to do more work than they
> might have to do for a single "TLBI *ALL*IS".
And looking at the changelog and backtrace:

  PC is at __cpu_flush_kern_tlb_range+0xc/0x40
  LR is at __purge_vmap_area_lazy+0x28c/0x3ac

I'm willing to bet that this is exactly the same scenario of a direct
map + module area flush. That's the only one we found so far which
creates insanely large ranges.

The other effects of coalescing can still result in seriously oversized
flushes for just a couple of pages. The worst I've seen aside of that
BPF muck was a 'flush 2 pages' with a resulting range of ~3.8MB.

> The point at which invalidating everything is better depends on a
> number of factors (e.g. the impact of all CPUs needing to make new page
> table walks), and currently we have an arbitrary boundary where we
> choose to invalidate everything (which has been tweaked a bit over
> time); there isn't really a one-size-fits-all best answer.

I'm well aware of that :)

Thanks,

        tglx