From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <CAJD7tkbk6tLMSKKc1XChJvpOi=J_T0WXXgwfscN0n8CK+CDoYQ@mail.gmail.com>
 <ed486ebe-d3ba-42fb-afdc-485b3f2504f0@citrix.com> <CAJD7tkbnJGdJhyYkMJB2EFUDALoCh93pwsdQVBmm=a10anyTkg@mail.gmail.com>
 <3ecfa4ff-9916-4ac1-8464-c1b6615b832a@citrix.com>
In-Reply-To: <3ecfa4ff-9916-4ac1-8464-c1b6615b832a@citrix.com>
From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 9 Jan 2025 15:26:35 -0800
Message-ID: <CAJD7tkZiujbLOg_HYSd4iUYuOhjK5mHkVrhcgk6wePDd8dfCvA@mail.gmail.com>
Subject: Re: [PATCH v3 00/12] AMD broadcast TLB invalidation
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: akpm@linux-foundation.org, bp@alien8.de, dave.hansen@linux.intel.com, 
	hpa@zytor.com, jackmanb@google.com, kernel-team@meta.com, 
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, luto@kernel.org, 
	mingo@redhat.com, nadav.amit@gmail.com, peterz@infradead.org, 
	reijiw@google.com, riel@surriel.com, tglx@linutronix.de, x86@kernel.org, 
	zhengqi.arch@bytedance.com
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Thu, Jan 9, 2025 at 3:00 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 09/01/2025 9:32 pm, Yosry Ahmed wrote:
> > On Wed, Jan 8, 2025 at 6:47 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >>>> I suspect AMD wouldn't tell us exactly ;)
> >>> Well, ideally they would just tell us the conditions under which CPUs
> >>> respond to the broadcast TLB flush or the expectations around latency.
> >> [Resend, complete this time]
> >>
> >> Disclaimer.  I'm not at AMD; I don't know how they implement it; I'm
> >> just a random person on the internet.  But, here are a few things that
> >> might be relevant to know.
> >>
> >> AMD's SEV-SNP whitepaper [1] states that RMP permissions "are cached in
> >> the CPU TLB and related structures" and also "When required, hardware
> >> automatically performs TLB invalidations to ensure that all processors
> >> in the system see the updated RMP entry information."
> >>
> >> That sentence doesn't use "broadcast" or "remote", but "all processors"
> >> is a pretty clear clue.  Broadcast TLB invalidations are a building
> >> block of all the RMP-manipulation instructions.
> >>
> >> Furthermore, to be useful in this context, they need to be ordered with
> >> memory.  Specifically, a new pagewalk mustn't start after an
> >> invalidation, yet observe the stale RMP entry.
> >>
> >>
> >> x86 CPUs do have reasonable forward-progress guarantees, but in order to
> >> achieve forward progress, they need to e.g. guarantee that one memory
> >> access doesn't displace the TLB entry backing a different memory access
> >> from the same instruction, or you could livelock while trying to
> >> complete a single instruction.
> >>
> >> A consequence is that you can't safely invalidate a TLB entry of an
> >> in-progress instruction (although this means only the oldest instruction
> >> in the pipeline, because everything else is speculative and potentially
> >> transient).
> >>
> >>
> >> INVLPGB invalidations are interrupt-like from the point of view of the
> >> remote core, but are microarchitectural and can be taken irrespective of
> >> the architectural Interrupt and Global Interrupt Flags.  As a
> >> consequence, they'll need to wait until an instruction boundary to be
> >> processed.  While not AMD, the Intel RAR whitepaper [2] discusses the
> >> handling of RARs on the remote processor, and they share a number of
> >> constraints in common with INVLPGB.
> >>
> >>
> >> Overall, I'd expect the INVLPGB instructions to be pretty quick in and
> >> of themselves; interestingly, they're not identified as architecturally
> >> serialising.  The broadcast is probably posted, and will be dealt with
> >> by remote processors on the subsequent instruction boundary.  TLBSYNC is
> >> the barrier to wait until the invalidations have been processed, and
> >> this will block for an unspecified length of time, probably bounded by
> >> the "longest" instruction in progress on a remote CPU.  e.g. I expect it
> >> probably will suck if you have to wait for a WBINVD instruction to
> >> complete on a remote CPU.
> >>
> >> That said, architectural IPIs have the same conditions too, except on
> >> top of that you've got to run a whole interrupt handler.  So, with
> >> reasonable confidence, however slow TLBSYNC might be in the worst case,
> >> it's got absolutely nothing on the overhead of doing invalidations the
> >> old fashioned way.
> > Generally speaking, I am not arguing that TLB flush IPIs are worse
> > than INVLPGB/TLBSYNC, I think we should expect the latter to perform
> > better in most cases.
> >
> > But there is a difference here because the processor executing TLBSYNC
> > cannot serve interrupts or NMIs while waiting for remote CPUs, because
> > they have to be served at an instruction boundary, right?
>
> That's as per the architecture, yes.  NMIs do have to be served on
> instruction boundaries.  An NMI that becomes pending while a TLBSYNC is
> in progress will have to wait until the TLBSYNC completes.
>
> (Probably.  REP string instructions and AVX scatter/gather have explicit
> behaviours that let them be interrupted, and to continue from where
> they left off when the interrupt handler returns.  Depending on how
> TLBSYNC is implemented, it's just possible it has this property too.)

That would be great actually; if that's the case, all my concerns go away.

>
> > Unless
> > TLBSYNC is an exception to that rule, or its execution is considered
> > completed before remote CPUs respond (i.e. the CPU executes it quickly
> > then enters into a wait doing "nothing").
> >
> > There are also intriguing corner cases that are not documented. For
> > example, you mention that it's reasonable to expect that a remote CPU
> > does not serve TLBSYNC except at the instruction boundary.
>
> INVLPGB needs to wait for an instruction boundary in order to be processed.
>
> All TLBSYNC needs to do is wait until it's certain that all the prior
> INVLPGBs issued by this CPU have been serviced.
>
> >  What if
> > that CPU is executing TLBSYNC? Do we have to wait for its execution to
> > complete? Is it possible to end up in a deadlock? This goes back to my
> > previous point about whether TLBSYNC is a special case or when it's
> > considered to have finished executing.
>
> Remember that the SEV-SNP instructions (PSMASH, PVALIDATE,
> RMP{ADJUST,UPDATE,QUERY,READ}) have an INVLPGB/TLBSYNC pair under the
> hood.  You can execute these instructions on different CPUs in parallel.
>
> It's certainly possible AMD missed something and there's a
> deadlock case in there.  But Google do offer SEV-SNP VMs and have the
> data and scale to know whether such a deadlock is happening in practice.

I am not familiar with SEV-SNP so excuse my ignorance. I am also
pretty sure that the percentage of SEV-SNP workloads is very low
compared to the workloads that would start using INVLPGB/TLBSYNC after
this series. So if there's a dormant bug or a rare scenario where the
TLBSYNC latency is massive, it may very well be newly uncovered now.

>
> >
> > I am sure people thought about that and I am probably worried over
> > nothing, but there are few details here so one has to speculate.
> >
> > Again, sorry if I am making a fuss over nothing and it's all in my head.
>
> It's absolutely a valid question to ask.
>
> But x86 is full of longer delays than this.  The GIF for example can
> block NMIs until the hypervisor is complete with the world switch, and
> it's left as an exercise to software not to abuse this.  Taking an SMI
> will be orders of magnitude more expensive than anything discussed here.

Right. What is happening here just seems like something that happens
more frequently and therefore is more likely to run into cases with
absurd delays.

It would be great if someone from AMD could shed some light on what is
to be reasonably expected from TLBSYNC here.

Anyway, thanks a lot for all your (very informative) responses :)