From: Yu Zhao <yuzhao@google.com>
Date: Mon, 28 Oct 2024 16:11:52 -0600
Subject: Re: [PATCH v1 4/6] arm64: broadcast IPIs to pause remote CPUs
To: Marc Zyngier <maz@kernel.org>
Cc: Andrew Morton, Catalin Marinas, Muchun Song, Thomas Gleixner,
	Will Deacon, Douglas Anderson, Mark Rutland, Nanyong Sun,
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
References: <20241021042218.746659-1-yuzhao@google.com>
	<20241021042218.746659-5-yuzhao@google.com>
	<868qug3yig.wl-maz@kernel.org>
In-Reply-To: <868qug3yig.wl-maz@kernel.org>
On Tue, Oct 22, 2024 at 10:15 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 21 Oct 2024 05:22:16 +0100,
> Yu Zhao <yuzhao@google.com> wrote:
> >
> > Broadcast pseudo-NMI IPIs to pause remote CPUs for a short period of
> > time, and then reliably resume them when the local CPU exits critical
> > sections that preclude the execution of remote CPUs.
> >
> > A typical example of such critical sections is BBM on kernel PTEs.
> > HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by
> > commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
> > HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:
> >
> >   This is deemed UNPREDICTABLE by the Arm architecture without a
> >   break-before-make sequence (make the PTE invalid, TLBI, write the
> >   new valid PTE). However, such sequence is not possible since the
> >   vmemmap may be concurrently accessed by the kernel.
> >
> > Supporting BBM on kernel PTEs is one of the approaches that can make
> > HVO theoretically safe on arm64.
>
> Is the safety only theoretical? I would have expected that we'd use an
> approach that is absolutely rock-solid.

We've been trying to construct a repro against the original HVO
(missing BBM), but so far without success. Hopefully a repro does
exist, so that we can demonstrate the effectiveness of this series,
which is only theoretical at the moment.

> > Note that it is still possible for the paused CPUs to perform
> > speculative translations. Such translations would cause spurious
> > kernel PFs, which should be properly handled by
> > is_spurious_el1_translation_fault().
>
> Speculative translation faults are never reported, that'd be a CPU
> bug.

Right, I meant to say "speculative accesses that cause translations".

> *Spurious* translation faults can be reported if the CPU doesn't
> implement FEAT_ETS2, for example, and that has to do with the ordering
> of memory access wrt page-table walking for the purpose of translations.

Just to make sure I fully understand: after the local CPU issues the
TLBI (followed by DSB & ISB), FEAT_ETS2 would act like DSB & ISB on
the remote CPUs when they perform the invalidation, and therefore
speculative accesses are ordered as well on the remote CPUs.

Also, I'm assuming IPIs on remote CPUs don't act like a full barrier,
and whatever they interrupt can still be speculatively executed even
though the IPI handler itself doesn't access the vmemmap area
undergoing BBM. Is this correct?
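For concreteness, the kernel-PTE BBM sequence this series tries to
make safe looks roughly like the sketch below. (A sketch only:
kernel_pte_bbm() is a hypothetical name, not part of the series, and
the exact TLBI variant depends on the mapping.)

static void kernel_pte_bbm(pte_t *ptep, pte_t new_pte, unsigned long addr)
{
	/* 1. Make the old PTE invalid. */
	WRITE_ONCE(*ptep, __pte(0));
	dsb(ishst);

	/* 2. Invalidate the stale translation on all CPUs. */
	__tlbi(vaae1is, __TLBI_VADDR(addr, 0));
	dsb(ish);
	isb();

	/* 3. Only now install the new valid PTE. */
	WRITE_ONCE(*ptep, new_pte);
}

Remote CPUs stay paused across steps 1-3 so that no kernel access can
hit the window where the PTE is invalid.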
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> >  arch/arm64/include/asm/smp.h |  3 ++
> >  arch/arm64/kernel/smp.c      | 92 +++++++++++++++++++++++++++++++++---
> >  2 files changed, 88 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
> > index 2510eec026f7..cffb0cfed961 100644
> > --- a/arch/arm64/include/asm/smp.h
> > +++ b/arch/arm64/include/asm/smp.h
> > @@ -133,6 +133,9 @@ bool cpus_are_stuck_in_kernel(void);
> >  extern void crash_smp_send_stop(void);
> >  extern bool smp_crash_stop_failed(void);
> >
> > +void pause_remote_cpus(void);
> > +void resume_remote_cpus(void);
> > +
> >  #endif /* ifndef __ASSEMBLY__ */
> >
> >  #endif /* ifndef __ASM_SMP_H */
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index 3b3f6b56e733..68829c6de1b1 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -85,7 +85,12 @@ static int ipi_irq_base __ro_after_init;
> >  static int nr_ipi __ro_after_init = NR_IPI;
> >  static struct irq_desc *ipi_desc[MAX_IPI] __ro_after_init;
> >
> > -static bool crash_stop;
> > +enum {
> > +	SEND_STOP	= BIT(0),
> > +	CRASH_STOP	= BIT(1),
> > +};
> > +
> > +static unsigned long stop_in_progress;
> >
> >  static void ipi_setup(int cpu);
> >
> > @@ -917,6 +922,79 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
> >  #endif
> >  }
> >
> > +static DEFINE_SPINLOCK(cpu_pause_lock);
>
> PREEMPT_RT will turn this into a sleeping lock. Is it safe to sleep as
> you are dealing with kernel mappings?

Right, it should be a raw spinlock -- the caller disables preemption,
which, as you said, is required when dealing with kernel mappings.

> > +static cpumask_t paused_cpus;
> > +static cpumask_t resumed_cpus;
> > +
> > +static void pause_local_cpu(void)
> > +{
> > +	int cpu = smp_processor_id();
> > +
> > +	cpumask_clear_cpu(cpu, &resumed_cpus);
> > +	/*
> > +	 * Paired with pause_remote_cpus() to confirm that this CPU not
> > +	 * only will be paused but also can be reliably resumed.
> > +	 */
> > +	smp_wmb();
> > +	cpumask_set_cpu(cpu, &paused_cpus);
> > +	/* paused_cpus must be set before waiting on resumed_cpus. */
> > +	barrier();
>
> I'm not sure what this is trying to enforce. Yes, the compiler won't
> reorder the set and the test.

Sorry, I don't follow: does cpumask_set_cpu(), i.e., set_bit(),
already contain a compiler barrier? My understanding is that the
compiler is free to reorder the set and the test on those two
independent variables, turning the code into this:

	while (!cpumask_test_cpu(cpu, &resumed_cpus))
		cpu_relax();

	cpumask_set_cpu(cpu, &paused_cpus);

So the CPU that sent the IPI would keep waiting on paused_cpus being
set, and this CPU would keep waiting on resumed_cpus being set, which
would end up in a deadlock.

> But your comment seems to indicate that you also need to make sure
> the CPU preserves that ordering, and short of a DMB, the test below
> could be reordered.

If this CPU reorders the set and the test as above, it wouldn't be a
problem, because the set would eventually become visible to the other
CPU, i.e., the one that sent the IPI.

> > +	while (!cpumask_test_cpu(cpu, &resumed_cpus))
> > +		cpu_relax();
> > +	/* A typical example for sleep and wake-up functions. */
>
> I'm not sure this is "typical",...

Sorry, this full barrier isn't needed. Apparently I didn't properly
remove it after the previous attempt to use wfe()/sev(), where this
function was the sleeper that resume_remote_cpus() would wake up.
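To restate the reordering concern above as a minimal, self-contained
sketch (hypothetical code, not the patch itself):

static bool paused, resumed;

/* Runs in the IPI handler on each remote CPU. */
static void remote_side(void)
{
	WRITE_ONCE(paused, true);	/* "I have stopped." */
	while (!READ_ONCE(resumed))	/* Wait to be released. */
		cpu_relax();
}

/* Runs on the CPU doing the BBM. */
static void local_side(void)
{
	while (!READ_ONCE(paused))	/* Wait for the remote CPU. */
		cpu_relax();
	/* ... critical section (BBM) ... */
	WRITE_ONCE(resumed, true);	/* Release the remote CPU. */
}

With plain accesses, the compiler could hoist the wait loop in
remote_side() above the store, recreating the deadlock;
READ_ONCE()/WRITE_ONCE() (or an explicit barrier()) rule that out,
since volatile accesses cannot be reordered against each other at
compile time. CPU-level reordering is harmless here for the reason
given above: the store eventually becomes visible to the other side.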
> > +	smp_mb();
> > +	cpumask_clear_cpu(cpu, &paused_cpus);
> > +}
> > +
> > +void pause_remote_cpus(void)
> > +{
> > +	cpumask_t cpus_to_pause;
> > +
> > +	lockdep_assert_cpus_held();
> > +	lockdep_assert_preemption_disabled();
> > +
> > +	cpumask_copy(&cpus_to_pause, cpu_online_mask);
> > +	cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
>
> This bitmap is manipulated outside of your cpu_pause_lock. What
> guarantees you can't have two CPUs stepping on each other here?

Do you mean cpus_to_pause? If so, that's a local bitmap.

> > +
> > +	spin_lock(&cpu_pause_lock);
> > +
> > +	WARN_ON_ONCE(!cpumask_empty(&paused_cpus));
> > +
> > +	smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI);
> > +
> > +	while (!cpumask_equal(&cpus_to_pause, &paused_cpus))
> > +		cpu_relax();
>
> This can be a lot of things to compare, especially since you are
> explicitly mentioning large systems. Why can't this be implemented as
> a counter instead?

Agreed -- that'd be sufficient and simpler.

> Overall, this looks like stop_machine() in disguise. Why can't this
> use the existing infrastructure?

This came up during the previous discussion [1]. There are
similarities. The main concern with reusing stop_machine() (or parts
of its current implementation) is that it may not meet the
performance requirements. Refactoring might be possible; however, it
seems to me (after checking the code again) that such a refactoring
is unlikely to end up with a cleaner or simpler codebase, especially
after I got rid of the CPU masks, as you suggested:

[1] https://lore.kernel.org/all/ZbKjHHeEdFYY1xR5@arm.com/

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 3b3f6b56e733..b672af2441a3 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -917,6 +922,64 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
 #endif
 }
 
+static DEFINE_RAW_SPINLOCK(cpu_pause_lock);
+static bool __cacheline_aligned_in_smp cpu_paused;
+static atomic_t __cacheline_aligned_in_smp nr_cpus_paused;
+
+static void pause_local_cpu(void)
+{
+	atomic_inc(&nr_cpus_paused);
+
+	while (READ_ONCE(cpu_paused))
+		cpu_relax();
+
+	atomic_dec(&nr_cpus_paused);
+}
+
+void pause_remote_cpus(void)
+{
+	cpumask_t cpus_to_pause;
+	int nr_cpus_to_pause = num_online_cpus() - 1;
+
+	lockdep_assert_cpus_held();
+	lockdep_assert_preemption_disabled();
+
+	if (!nr_cpus_to_pause)
+		return;
+
+	cpumask_copy(&cpus_to_pause, cpu_online_mask);
+	cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
+
+	raw_spin_lock(&cpu_pause_lock);
+
+	WARN_ON_ONCE(cpu_paused);
+	WARN_ON_ONCE(atomic_read(&nr_cpus_paused));
+
+	cpu_paused = true;
+
+	smp_cross_call(&cpus_to_pause, IPI_CPU_STOP_NMI);
+
+	while (atomic_read(&nr_cpus_paused) != nr_cpus_to_pause)
+		cpu_relax();
+
+	raw_spin_unlock(&cpu_pause_lock);
+}
+
+void resume_remote_cpus(void)
+{
+	if (!cpu_paused)
+		return;
+
+	raw_spin_lock(&cpu_pause_lock);
+
+	WRITE_ONCE(cpu_paused, false);
+
+	while (atomic_read(&nr_cpus_paused))
+		cpu_relax();
+
+	raw_spin_unlock(&cpu_pause_lock);
+}
+
 static void arm64_backtrace_ipi(cpumask_t *mask)
 {
 	__ipi_send_mask(ipi_desc[IPI_CPU_BACKTRACE], mask);
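For completeness, a caller would wrap a BBM section roughly like this
(a sketch; hvo_remap_vmemmap() is a hypothetical caller, but the
cpus_read_lock() and preempt_disable() requirements follow from the
lockdep assertions in pause_remote_cpus()):

static void hvo_remap_vmemmap(void)
{
	cpus_read_lock();	/* lockdep_assert_cpus_held() */
	preempt_disable();	/* lockdep_assert_preemption_disabled() */

	pause_remote_cpus();

	/* ... break-before-make on the vmemmap PTEs ... */

	resume_remote_cpus();

	preempt_enable();
	cpus_read_unlock();
}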