Date: Wed, 12 Jan 2022 10:40:39 -0500 (EST)
From: Mathieu Desnoyers
To: Andy Lutomirski
Cc: Andrew Morton, linux-mm, Nicholas Piggin, Anton Blanchard,
    Benjamin Herrenschmidt, Paul Mackerras, Randy Dunlap, linux-arch,
    x86, riel, Dave Hansen, Peter Zijlstra, Nadav Amit
Message-ID: <996622244.24690.1642002039268.JavaMail.zimbra@efficios.com>
Subject: Re: [PATCH 02/23] x86/mm: Handle unlazying membarrier core sync in the arch code

----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@kernel.org wrote:

> The core scheduler isn't a great place for
> membarrier_mm_sync_core_before_usermode() -- the core scheduler
> doesn't actually know whether we are lazy. With the old code, if a
> CPU is running a membarrier-registered task, goes idle, gets unlazied
> via a TLB shootdown IPI, and switches back to the
> membarrier-registered task, it will do an unnecessary core sync.
>
> Conveniently, x86 is the only architecture that does anything in this
> sync_core_before_usermode(), so membarrier_mm_sync_core_before_usermode()
> is a no-op on all other architectures and we can just move the code.
>
> (I am not claiming that the SYNC_CORE code was correct before or after this
> change on any non-x86 architecture. I merely claim that this change
> improves readability, is correct on x86, and makes no change on any other
> architecture.)
>

Looks good to me! Thanks!

Reviewed-by: Mathieu Desnoyers
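
As a quick aside for anyone following this thread who has not used the
SYNC_CORE flavour: here is a minimal, untested userspace sketch of what a
"membarrier-registered" process looks like. The command constants come from
the uapi <linux/membarrier.h> header; the small membarrier() wrapper and the
rest of the program are purely illustrative.

#include <linux/membarrier.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* There is no glibc wrapper for membarrier(2), so call it directly. */
static long membarrier(int cmd, unsigned int flags, int cpu_id)
{
	return syscall(__NR_membarrier, cmd, flags, cpu_id);
}

int main(void)
{
	/*
	 * Registration sets MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE in
	 * mm->membarrier_state, which is the bit tested by the new
	 * sync_core_if_membarrier_enabled() helper in the patch below.
	 */
	if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0))
		perror("membarrier register");

	/* ... e.g. rewrite code that other threads of this process may execute ... */

	/*
	 * Ask that every thread of this process executes a core-serializing
	 * instruction before it next returns to user mode.
	 */
	if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0))
		perror("membarrier sync-core");

	return 0;
}

The patch only changes where the kernel performs that core serialization on
the context-switch path; the userspace API is unchanged.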
> Cc: Mathieu Desnoyers
> Cc: Nicholas Piggin
> Cc: Peter Zijlstra
> Signed-off-by: Andy Lutomirski
> ---
>  arch/x86/mm/tlb.c        | 58 +++++++++++++++++++++++++++++++---------
>  include/linux/sched/mm.h | 13 ---------
>  kernel/sched/core.c      | 14 +++++-----
>  3 files changed, 53 insertions(+), 32 deletions(-)
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 59ba2968af1b..1ae15172885e 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -9,6 +9,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -485,6 +486,15 @@ void cr4_update_pce(void *ignored)
>  static inline void cr4_update_pce_mm(struct mm_struct *mm) { }
>  #endif
>
> +static void sync_core_if_membarrier_enabled(struct mm_struct *next)
> +{
> +#ifdef CONFIG_MEMBARRIER
> +	if (unlikely(atomic_read(&next->membarrier_state) &
> +		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))
> +		sync_core_before_usermode();
> +#endif
> +}
> +
>  void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  			struct task_struct *tsk)
>  {
> @@ -539,16 +549,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  	this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
>
>  	/*
> -	 * The membarrier system call requires a full memory barrier and
> -	 * core serialization before returning to user-space, after
> -	 * storing to rq->curr, when changing mm. This is because
> -	 * membarrier() sends IPIs to all CPUs that are in the target mm
> -	 * to make them issue memory barriers. However, if another CPU
> -	 * switches to/from the target mm concurrently with
> -	 * membarrier(), it can cause that CPU not to receive an IPI
> -	 * when it really should issue a memory barrier. Writing to CR3
> -	 * provides that full memory barrier and core serializing
> -	 * instruction.
> +	 * membarrier() support requires that, when we change rq->curr->mm:
> +	 *
> +	 * - If next->mm has membarrier registered, a full memory barrier
> +	 *   after writing rq->curr (or rq->curr->mm if we switched the mm
> +	 *   without switching tasks) and before returning to user mode.
> +	 *
> +	 * - If next->mm has SYNC_CORE registered, then we sync core before
> +	 *   returning to user mode.
> +	 *
> +	 * In the case where prev->mm == next->mm, membarrier() uses an IPI
> +	 * instead, and no particular barriers are needed while context
> +	 * switching.
> +	 *
> +	 * x86 gets all of this as a side-effect of writing to CR3 except
> +	 * in the case where we unlazy without flushing.
> +	 *
> +	 * All other architectures are civilized and do all of this implicitly
> +	 * when transitioning from kernel to user mode.
>  	 */
>  	if (real_prev == next) {
>  		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> @@ -566,7 +584,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  		/*
>  		 * If the CPU is not in lazy TLB mode, we are just switching
>  		 * from one thread in a process to another thread in the same
> -		 * process. No TLB flush required.
> +		 * process. No TLB flush or membarrier() synchronization
> +		 * is required.
>  		 */
>  		if (!was_lazy)
>  			return;
> @@ -576,16 +595,31 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  		 * If the TLB is up to date, just use it.
>  		 * The barrier synchronizes with the tlb_gen increment in
>  		 * the TLB shootdown code.
> +		 *
> +		 * As a future optimization opportunity, it's plausible
> +		 * that the x86 memory model is strong enough that this
> +		 * smp_mb() isn't needed.
>  		 */
>  		smp_mb();
>  		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>  		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> -				next_tlb_gen)
> +				next_tlb_gen) {
> +			/*
> +			 * We switched logical mm but we're not going to
> +			 * write to CR3. We already did smp_mb() above,
> +			 * but membarrier() might require a sync_core()
> +			 * as well.
> +			 */
> +			sync_core_if_membarrier_enabled(next);
> +
>  			return;
> +		}
>
>  		/*
>  		 * TLB contents went out of date while we were in lazy
>  		 * mode. Fall through to the TLB switching code below.
> +		 * No need for an explicit membarrier invocation -- the CR3
> +		 * write will serialize.
>  		 */
>  		new_asid = prev_asid;
>  		need_flush = true;
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 5561486fddef..c256a7fc0423 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -345,16 +345,6 @@ enum {
>  #include
>  #endif
>
> -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
> -{
> -	if (current->mm != mm)
> -		return;
> -	if (likely(!(atomic_read(&mm->membarrier_state) &
> -		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
> -		return;
> -	sync_core_before_usermode();
> -}
> -
>  extern void membarrier_exec_mmap(struct mm_struct *mm);
>
>  extern void membarrier_update_current_mm(struct mm_struct *next_mm);
> @@ -370,9 +360,6 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
>  static inline void membarrier_exec_mmap(struct mm_struct *mm)
>  {
>  }
> -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
> -{
> -}
>  static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
>  {
>  }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f21714ea3db8..6a1db8264c7b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4822,22 +4822,22 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  	kmap_local_sched_in();
>
>  	fire_sched_in_preempt_notifiers(current);
> +
>  	/*
>  	 * When switching through a kernel thread, the loop in
>  	 * membarrier_{private,global}_expedited() may have observed that
>  	 * kernel thread and not issued an IPI. It is therefore possible to
>  	 * schedule between user->kernel->user threads without passing though
>  	 * switch_mm(). Membarrier requires a barrier after storing to
> -	 * rq->curr, before returning to userspace, so provide them here:
> +	 * rq->curr, before returning to userspace, and mmdrop() provides
> +	 * this barrier.
>  	 *
> -	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
> -	 *   provided by mmdrop(),
> -	 * - a sync_core for SYNC_CORE.
> +	 * If an architecture needs to take a specific action for
> +	 * SYNC_CORE, it can do so in switch_mm_irqs_off().
>  	 */
> -	if (mm) {
> -		membarrier_mm_sync_core_before_usermode(mm);
> +	if (mm)
>  		mmdrop(mm);
> -	}
> +
>  	if (unlikely(prev_state == TASK_DEAD)) {
>  		if (prev->sched_class->task_dead)
>  			prev->sched_class->task_dead(prev);
> --
> 2.33.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com