From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0B92CD767F2 for ; Thu, 31 Oct 2024 18:11:26 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 32CDA6B007B; Thu, 31 Oct 2024 14:11:26 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 2DAC66B0082; Thu, 31 Oct 2024 14:11:26 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 1A3386B0083; Thu, 31 Oct 2024 14:11:26 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id F134A6B007B for ; Thu, 31 Oct 2024 14:11:25 -0400 (EDT) Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 8DD8D1C3DFA for ; Thu, 31 Oct 2024 18:11:25 +0000 (UTC) X-FDA: 82734688386.01.0D2A26F Received: from mail-ot1-f43.google.com (mail-ot1-f43.google.com [209.85.210.43]) by imf12.hostedemail.com (Postfix) with ESMTP id 4F8CD4001F for ; Thu, 31 Oct 2024 18:11:11 +0000 (UTC) Authentication-Results: imf12.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=mmqGIba2; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf12.hostedemail.com: domain of yuzhao@google.com designates 209.85.210.43 as permitted sender) smtp.mailfrom=yuzhao@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1730398202; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=zifDQr8Gy9ICyc3p2ENuZG+aDhE1SHEQ39uTi6PmnYo=; b=JZIta0Dmo9W4FECfPWvFFobqNfdIKw6KcFABi1uqxF11DXHW/zkjSUW0bcWNU9zsifb/Go EzpvJFz/LgS9K93aZRkSaA+aPA/ahkC0oAEt81sYkrP0iddHtFWqRchQ8ZaIcMjlrOEezu hdq+6aTy30p+ZWp8r0ugsv+Uyg1pjSg= ARC-Authentication-Results: i=1; imf12.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=mmqGIba2; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf12.hostedemail.com: domain of yuzhao@google.com designates 209.85.210.43 as permitted sender) smtp.mailfrom=yuzhao@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1730398202; a=rsa-sha256; cv=none; b=WMChHI3nHSAU1rdUzxnFS1+3v+uAQLWDZtNsCOW1TLlC2UHIvT669jDjPqcqzsBsUHyjPK ZEJHdMELbR26j6RBU+Ye9l7cX4XnUGkP8eXRAuf5P5zLRDiwt5cNc/1Ni+I19LRBLc8peo +OeNfuffqovv2dHU8NEe3QRVmDc4FCk= Received: by mail-ot1-f43.google.com with SMTP id 46e09a7af769-7180ab89c58so589683a34.1 for ; Thu, 31 Oct 2024 11:11:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1730398283; x=1731003083; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=zifDQr8Gy9ICyc3p2ENuZG+aDhE1SHEQ39uTi6PmnYo=; b=mmqGIba2K9QdeZnzTJJ7BnY7q5YLJNVjJAxlpXE0us/r/Yff2Z3+AEU1Nj4tY6PV9t J6IaAG86GRpZcz8faB1uOON4ptv7fRnZJrWBMz/DWbBEmSukHaa4ep/e6zr9h5t63cfT 4hL3DVTEbG5Dkhk60wUSqelWwZH6BAq12gpopcmJ4OFSv8/b01gad6ppKkdr7nfyWdAi /WN7E0Y/OVZzfZkIQ/F0Zi19JIW5pT3e3oj6pFDD8LrnHBsYlJpisXrUJbhieXz7YvX3 a/hj1wC1FN4jjZFMLLFq0fYQQOZJ092+DQKEeclb9DxCT0jphYwl/SXxXZaiLB3cHfp1 sFGA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1730398283; x=1731003083; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=zifDQr8Gy9ICyc3p2ENuZG+aDhE1SHEQ39uTi6PmnYo=; b=YLCj9Lr3fKOSz63+3jHHF2HAei6HGlSnzMzatXXc1/5GFLbXrBot3lx2DYWeXSGfWW Ct6V72lzvwlwD1qW0W6sWSAf9rOanu4GQLLZPVWkMivSbaJ9TTkDj8SC+l61Mm4zDmQu PjTASloapM46HccZ3rkJp7wcBuxfcss1T+IbNAJe3nva3SgXUAL445bwq2LWGF/EMtDG lomobk8+O/ffGYJ8ulLBuHuSWxsAXgN+sC+Jsh7UrMep3Akj4M5HxufgpNoLEVj7cjxs kjq0nXojD8ilQ35QXV4//yPGmKmY+ZLjbv1S6+yuyLZJ95KmAAyeET5xm/UJCZ2vE2+a +auA== X-Forwarded-Encrypted: i=1; AJvYcCXNHP2UHvUmV1vFvCTOM6Rp8nObfMgDbj1BkH1uY3X8WeIWWE8VfNrA6LSnd097kSmlsqVqJWyu8g==@kvack.org X-Gm-Message-State: AOJu0YzWTjptRYT8gyonQikktod2TELjpW/i8Gmh29L8+od3FbigSHNh ecSG9SSrSQoJRsnb7XcgyjIR6kVXIXIbIB++Ri63CwCjDrEmLmfUrvao9yAQhnWjk4aSe/fCrLG 7+xpnYwdEzq5qXrtQKA8JCFfgGATRSQXffW+t X-Google-Smtp-Source: AGHT+IGCZUfA0nt9tbEFHktX/dCx4DkdYhEES5nQZ7wCYXRaHV6mv2O2BXeCcEfn6XydEqak20pVWKM5POqkaQsbpvU= X-Received: by 2002:a05:6359:4115:b0:1c3:8d57:ea8b with SMTP id e5c5f4694b2df-1c5ee96bd85mr291907555d.16.1730398282240; Thu, 31 Oct 2024 11:11:22 -0700 (PDT) MIME-Version: 1.0 References: <20241021042218.746659-1-yuzhao@google.com> <20241021042218.746659-5-yuzhao@google.com> <868qug3yig.wl-maz@kernel.org> <8634ke3dn4.wl-maz@kernel.org> In-Reply-To: <8634ke3dn4.wl-maz@kernel.org> From: Yu Zhao Date: Thu, 31 Oct 2024 12:10:45 -0600 Message-ID: Subject: Re: [PATCH v1 4/6] arm64: broadcast IPIs to pause remote CPUs To: Marc Zyngier Cc: Andrew Morton , Catalin Marinas , Muchun Song , Thomas Gleixner , Will Deacon , Douglas Anderson , Mark Rutland , Nanyong Sun , linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: 4F8CD4001F X-Stat-Signature: t9wwrwg3jfho6knptptq9sfxh9c6pbks X-HE-Tag: 1730398271-64834 X-HE-Meta: U2FsdGVkX19/xJ2t3tAIiiVambFXlZdDeR8EvH6Sa671SlvOUS6fCxEii0B5YMce2lkAKLhSlIfead059RQMirLZ4e5IRa2yZ2AuYhAOmNsRvh/YqWbY/ZP00/nGxjDF0Y3uIhMnUfLRF3CQh2j0lmD+1L8ODizl+6thMDKcyVaYVexrn1hozIV41+YAlM9N+JNqKBwzqUuOIzrnCVWApyRGP861WvRc6PNsTML8olwmPOHeuoZ8ZpU8XvW/PWw+FUf1/YZ6gb47Jjg5ahpb6uF3TuxlbG/+pLg7IOJ319IfMjG3/xyWSrJkKqoNonYMDWZrCxm82H488fmwF87d+/kxCe9kSgsW4g8zrRI4Bcsh9BI/jB5EdKoL/feeFMi0owD4j+woEnE1xJAxIt6XDGSVquwbBpzie/TExjVGkYSHSewx0QNpyXA+wxqBcIVKtzN5eO+v0QxU7hSI9XXzYlt49KyBYbBfy9BxeaZxKNkoxd0QUx5+ZW+gSu7wwknoV4ZoXbaBTgzN7gMBfG8hdwrM0Nm0+lGxmZk8CjyDcp0EInMh+BLu7ps+8wZn1+Q3/fQYebOSJs3q3Hwc0z4t5exlNdvjaPV4v2SFYzskIp2ri9eW/yB5AJCzgIg4HN2wPraZjFeqjPF1t8OJG0uKL1tyWkPgeFSqA7Aj9w+VEUgOe8zyc4qBLpdSztnrq1dwJedwsVedAYv6wMgDil7RuJnHsmzi9wfFMBmG9PSEhTopqpQZEJgHHRSGJ/D2WaRd8944KdfhgunRxLsHxo+ehheXrzSQQlwc0zwWfwG3aXAnzHtkxBqu36koI3I6Rpg/2LGAMpzz4SuD87U/kP0S1uvO5pXJywYC7j5AkAH4qMSD83jCx2eIz3V+ISZK6k0asB2PmouTF0VZG5TQEdINLrJHb7xBfc1Xc/QSJ2NvWlOUucrSrr/u1+ci4Po/ZkCa7yB7Xd676mrvCgDgRrC +uELbMyI My9TA4mCu520Y/5YCp3f5cyRjzKJNeW5MCnl7Qllt4shhXiWT9XYUEey8N8zOAR5WyTG/e5Josr1K0R/vYrxyA4EBm1hrvih7IoQr8P8kl9ZgbCK6RIXs6TQQoB8iOE83OPwEHwZOrRqmBkkBaKgICOGwX5OJf3Fh5fQPAnchXsPKECJyuBmr1BubfgQh7o7RsazrME0HUqpOi/zqWytt+zZnEyoNTWtGqbeAFvZYsyWxAhKeqUCLF3JX7g== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Oct 29, 2024 at 1:36=E2=80=AFPM Marc Zyngier wrote= : > > On Mon, 28 Oct 2024 22:11:52 +0000, > Yu Zhao wrote: > > > > On Tue, Oct 22, 2024 at 10:15=E2=80=AFAM Marc Zyngier = wrote: > > > > > > On Mon, 21 Oct 2024 05:22:16 +0100, > > > Yu Zhao wrote: > > > > > > > > Broadcast pseudo-NMI IPIs to pause remote CPUs for a short period o= f > > > > time, and then reliably resume them when the local CPU exits critic= al > > > > sections that preclude the execution of remote CPUs. > > > > > > > > A typical example of such critical sections is BBM on kernel PTEs. > > > > HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by > > > > commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable > > > > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the folllowing reason: > > > > > > > > This is deemed UNPREDICTABLE by the Arm architecture without a > > > > break-before-make sequence (make the PTE invalid, TLBI, write the > > > > new valid PTE). However, such sequence is not possible since the > > > > vmemmap may be concurrently accessed by the kernel. > > > > > > > > Supporting BBM on kernel PTEs is one of the approaches that can mak= e > > > > HVO theoretically safe on arm64. > > > > > > Is the safety only theoretical? I would have expected that we'd use a= n > > > approach that is absolutely rock-solid. > > > > We've been trying to construct a repro against the original HVO > > (missing BBM), but so far no success. Hopefully a repro does exist, > > and then we'd be able to demonstrate the effectiveness of this series, > > which is only theoretical at the moment. > > That wasn't my question. > > Just because your HW doesn't show you a failure mode doesn't mean the > issue doesn't exist. Or that someone will eventually make use of the > relaxed aspects of the architecture and turn the behaviour you are > relying on into the perfect memory corruption You absolutely cannot > rely on an implementation-defined behaviour for this stuff. I agree. > Hence my question: is your current approach actually safe? AFAIK, yes. > In a > non-theoretical manner? We are confident enough to try it in our production, but it doesn't mean it's bug free. It would take more eyeballs and soak time before it can prove itself. > > > > Note that it is still possible for the paused CPUs to perform > > > > speculative translations. Such translations would cause spurious > > > > kernel PFs, which should be properly handled by > > > > is_spurious_el1_translation_fault(). > > > > > > Speculative translation faults are never reported, that'd be a CPU > > > bug. > > > > Right, I meant to say "speculative accesses that cause translations". > > > > > *Spurious* translation faults can be reported if the CPU doesn't > > > implement FEAT_ETS2, for example, and that has to do with the orderin= g > > > of memory access wrt page-table walking for the purpose of translatio= ns. > > > > Just want to make sure I fully understand: after the local CPU sends > > TLBI (followed by DSB & ISB), FEAT_ETS2 would act like DSB & ISB on > > remote CPUs when they perform the invalidation, and therefore > > speculative accesses are ordered as well on remote CPUs. > > ETS2 has nothing to do with TLBI. It has to do with non-cacheable > (translation, access and address size) faults, and whether an older > update to a translation is visible to a younger memory access without > extra synchronisation. Is it correct to say that *without* ETS2, we also wouldn't see spurious faults on the local CPU from the following BBM sequence? clear_pte(); dsb(); isb(); tlbi(); dsb(); isb(); set_valid_pte(); dsb(); isb(); ... > It definitely has nothing to do with remote > CPUs (and the idea of a remote ISB, while cute, doesn't exist). > > > Also I'm assuming IPIs on remote CPUs don't act like a full barrier, > > and whatever they interrupt still can be speculatively executed even > > though the IPI hander itself doesn't access the vmemmap area > > undergoing BBM. Is this correct? > > Full barrier *for what*? An interrupt is a CSE, which has the effects > described in the ARM ARM, but that's about it. Sorry for the confusion. I hope the following could explain what's in my mi= nd: local cpu remote cpu send ipi cse (=3D dsb(); isb() ?) enters ipi handler sets cpu_paused sees cpu_paused clear_pte() dsb(); tlbi(); dsb(); isb() set_valid_pte() dsb(); sets cpu_resumed sees cpu_resumed No isb() exits ipi handler My current understanding is: Because exception exit doesn't have the CSE and there are no isb()s within the IPI handler running on the remote CPU, it's possible for the remote CPU to speculative execute the next instruction before it sees the valid PTE and trigger a spurious fault. > > > > Signed-off-by: Yu Zhao > > > > --- > > > > arch/arm64/include/asm/smp.h | 3 ++ > > > > arch/arm64/kernel/smp.c | 92 ++++++++++++++++++++++++++++++++= +--- > > > > 2 files changed, 88 insertions(+), 7 deletions(-) > > > > > > > > diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/= smp.h > > > > index 2510eec026f7..cffb0cfed961 100644 > > > > --- a/arch/arm64/include/asm/smp.h > > > > +++ b/arch/arm64/include/asm/smp.h > > > > @@ -133,6 +133,9 @@ bool cpus_are_stuck_in_kernel(void); > > > > extern void crash_smp_send_stop(void); > > > > extern bool smp_crash_stop_failed(void); > > > > > > > > +void pause_remote_cpus(void); > > > > +void resume_remote_cpus(void); > > > > + > > > > #endif /* ifndef __ASSEMBLY__ */ > > > > > > > > #endif /* ifndef __ASM_SMP_H */ > > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > > > > index 3b3f6b56e733..68829c6de1b1 100644 > > > > --- a/arch/arm64/kernel/smp.c > > > > +++ b/arch/arm64/kernel/smp.c > > > > @@ -85,7 +85,12 @@ static int ipi_irq_base __ro_after_init; > > > > static int nr_ipi __ro_after_init =3D NR_IPI; > > > > static struct irq_desc *ipi_desc[MAX_IPI] __ro_after_init; > > > > > > > > -static bool crash_stop; > > > > +enum { > > > > + SEND_STOP =3D BIT(0), > > > > + CRASH_STOP =3D BIT(1), > > > > +}; > > > > + > > > > +static unsigned long stop_in_progress; > > > > > > > > static void ipi_setup(int cpu); > > > > > > > > @@ -917,6 +922,79 @@ static void __noreturn ipi_cpu_crash_stop(unsi= gned int cpu, struct pt_regs *regs > > > > #endif > > > > } > > > > > > > > +static DEFINE_SPINLOCK(cpu_pause_lock); > > > > > > PREEMPT_RT will turn this into a sleeping lock. Is it safe to sleep a= s > > > you are dealing with kernel mappings? > > > > Right, it should be a raw spinlock -- the caller disabled preemption, > > which as you said is required when dealing with the kernel mappings. > > > > > > +static cpumask_t paused_cpus; > > > > +static cpumask_t resumed_cpus; > > > > + > > > > +static void pause_local_cpu(void) > > > > +{ > > > > + int cpu =3D smp_processor_id(); > > > > + > > > > + cpumask_clear_cpu(cpu, &resumed_cpus); > > > > + /* > > > > + * Paired with pause_remote_cpus() to confirm that this CPU n= ot only > > > > + * will be paused but also can be reliably resumed. > > > > + */ > > > > + smp_wmb(); > > > > + cpumask_set_cpu(cpu, &paused_cpus); > > > > + /* paused_cpus must be set before waiting on resumed_cpus. */ > > > > + barrier(); > > > > > > I'm not sure what this is trying to enforce. Yes, the compiler won't > > > reorder the set and the test. > > > > Sorry I don't follow: does cpumask_set_cpu(), i.e., set_bit(), already > > contain a compiler barrier? > > > > My understanding is that the compiler is free to reorder the set and > > test on those two independent variables, and make it like this: > > > > while (!cpumask_test_cpu(cpu, &resumed_cpus)) > > cpu_relax(); > > cpumask_set_cpu(cpu, &paused_cpus); > > > > So the CPU sent the IPI would keep waiting on paused_cpus being set, > > and this CPU would keep waiting on resumed_cpus being set, which would > > end up with a deadlock. > > > > > But your comment seems to indicate that > > > also need to make sure the CPU preserves that ordering > > > and short of a > > > DMB, the test below could be reordered. > > > > If this CPU reorders the set and test like above, it wouldn't be a > > problem because the set would eventually appear on the other CPU that > > sent the IPI. > > I don't get it. You are requiring that the compiler doesn't reorder > things, but you're happy that the CPU reorders the same things. Surely > this leads to the same outcome... I was thinking that the compiler might be able to reorder even if it can't prove the loop actually terminates. But I now think that shouldn't happen. So yes, I agree with what you said earlier that "the compiler won't reorder the set and the test". > > > > + while (!cpumask_test_cpu(cpu, &resumed_cpus)) > > > > + cpu_relax(); > > > > + /* A typical example for sleep and wake-up functions. */ > > > > > > I'm not sure this is "typical",... > > > > Sorry, this full barrier isn't needed. Apparently I didn't properly > > fix this from the previous attempt to use wfe()/sev() to make this > > function the sleeper for resume_remote_cpus() to wake up. > > > > > > + smp_mb(); > > > > + cpumask_clear_cpu(cpu, &paused_cpus); > > > > +} > > > > + > > > > +void pause_remote_cpus(void) > > > > +{ > > > > + cpumask_t cpus_to_pause; > > > > + > > > > + lockdep_assert_cpus_held(); > > > > + lockdep_assert_preemption_disabled(); > > > > + > > > > + cpumask_copy(&cpus_to_pause, cpu_online_mask); > > > > + cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause); > > > > > > This bitmap is manipulated outside of your cpu_pause_lock. What > > > guarantees you can't have two CPUs stepping on each other here? > > > > Do you mean cpus_to_pause? If so, that's a local bitmap. > > Ah yes, sorry. > > > > > > > + > > > > + spin_lock(&cpu_pause_lock); > > > > + > > > > + WARN_ON_ONCE(!cpumask_empty(&paused_cpus)); > > > > + > > > > + smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI); > > > > + > > > > + while (!cpumask_equal(&cpus_to_pause, &paused_cpus)) > > > > + cpu_relax(); > > > > > > This can be a lot of things to compare, specially that you are > > > explicitly mentioning large systems. Why can't this be implemented as > > > a counter instead? > > > > Agreed - that'd be sufficient and simpler. > > > > > Overall, this looks like stop_machine() in disguise. Why can't this > > > use the existing infrastructure? > > > > This came up during the previous discussion [1]. There are > > similarities. The main concern with reusing stop_machine() (or part of > > its current implementation) is that it may not meet the performance > > requirements. Refactoring might be possible, however, it seems to me > > (after checking the code again) that such a refactoring unlikely ends > > up cleaner or simpler codebase, especially after I got rid of CPU > > masks, as you suggested. > > I would have expected to see an implementation handling faults during > the break window. IPIs are slow, virtualised extremely badly, and > pNMIs are rarely enabled, due to the long history of issues with it. > At this stage, I'm not sure what you have here is any better than > stop_machine(). On the other hand, faults should be the rarest thing, Let me measure stop_machine() and see how that works for our production. Meanwhile, if you don't mind that we leave the IPI approach along with the kernel PF approach on the table? Thanks.