From: Andy Lutomirski <luto@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>, Linux-MM <linux-mm@kvack.org>
Cc: Nicholas Piggin <npiggin@gmail.com>,
Anton Blanchard <anton@ozlabs.org>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Paul Mackerras <paulus@ozlabs.org>,
Randy Dunlap <rdunlap@infradead.org>,
linux-arch <linux-arch@vger.kernel.org>,
x86@kernel.org, Rik van Riel <riel@surriel.com>,
Dave Hansen <dave.hansen@intel.com>,
Peter Zijlstra <peterz@infradead.org>,
Nadav Amit <nadav.amit@gmail.com>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Andy Lutomirski <luto@kernel.org>
Subject: [PATCH 02/23] x86/mm: Handle unlazying membarrier core sync in the arch code
Date: Sat, 8 Jan 2022 08:43:47 -0800
Message-ID: <b622be287d8148e017742ecf29a966aa4c6de664.1641659630.git.luto@kernel.org>
In-Reply-To: <cover.1641659630.git.luto@kernel.org>

The core scheduler isn't a great place for
membarrier_mm_sync_core_before_usermode() -- the core scheduler
doesn't actually know whether we are lazy.  With the old code, if a
CPU is running a membarrier-registered task, goes idle, gets unlazied
via a TLB shootdown IPI, and switches back to the
membarrier-registered task, it will do an unnecessary core sync.

Conveniently, x86 is the only architecture that does anything in
sync_core_before_usermode(), so membarrier_mm_sync_core_before_usermode()
is a no-op on all other architectures and we can just move the code.

(I am not claiming that the SYNC_CORE code was correct before or after this
change on any non-x86 architecture.  I merely claim that this change
improves readability, is correct on x86, and makes no change on any other
architecture.)

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
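
Not part of the commit: for readers who want the resulting control flow
at a glance, here is a rough userspace model of what
switch_mm_irqs_off() ends up doing for membarrier after this patch.
It is only a sketch. The enum, the struct, and model_switch_mm() are
hypothetical names made up for illustration, the four booleans stand in
for state the kernel reads from per-CPU data and from next, and the
model ignores ASIDs, flushing details, and everything else the real
function does.

```c
#include <stdbool.h>

/* Hypothetical model only; none of these names exist in the kernel. */
enum switch_action {
	ACTION_NONE,		/* same mm, not lazy: thread-to-thread switch */
	ACTION_MB_ONLY,		/* unlazy, TLB current: smp_mb() is all we do */
	ACTION_SYNC_CORE,	/* unlazy, TLB current, SYNC_CORE registered */
	ACTION_WRITE_CR3,	/* CR3 write gives the barrier and serializes */
};

struct switch_state {
	bool same_mm;			/* real_prev == next */
	bool was_lazy;			/* CPU was in lazy TLB mode */
	bool tlb_up_to_date;		/* per-CPU tlb_gen matches next's tlb_gen */
	bool sync_core_registered;	/* ..._PRIVATE_EXPEDITED_SYNC_CORE is set */
};

enum switch_action model_switch_mm(const struct switch_state *s)
{
	/* Switching to a different mm always goes through a CR3 write. */
	if (!s->same_mm)
		return ACTION_WRITE_CR3;

	/* Same mm and not lazy: no flush, no membarrier work. */
	if (!s->was_lazy)
		return ACTION_NONE;

	/*
	 * Unlazying without a CR3 write: smp_mb() has already run, so
	 * the only thing left is the explicit core sync, and only for
	 * an mm that registered for SYNC_CORE.
	 */
	if (s->tlb_up_to_date)
		return s->sync_core_registered ? ACTION_SYNC_CORE
					       : ACTION_MB_ONLY;

	/* Stale TLB: fall through to the CR3 write, which serializes. */
	return ACTION_WRITE_CR3;
}
```

The only case that needs new code is ACTION_SYNC_CORE, which is where
sync_core_if_membarrier_enabled() is called in the diff below.
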
arch/x86/mm/tlb.c | 58 +++++++++++++++++++++++++++++++---------
include/linux/sched/mm.h | 13 ---------
kernel/sched/core.c | 14 +++++-----
3 files changed, 53 insertions(+), 32 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 59ba2968af1b..1ae15172885e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -9,6 +9,7 @@
#include <linux/cpu.h>
#include <linux/debugfs.h>
#include <linux/sched/smt.h>
+#include <linux/sched/mm.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -485,6 +486,15 @@ void cr4_update_pce(void *ignored)
static inline void cr4_update_pce_mm(struct mm_struct *mm) { }
#endif
+static void sync_core_if_membarrier_enabled(struct mm_struct *next)
+{
+#ifdef CONFIG_MEMBARRIER
+ if (unlikely(atomic_read(&next->membarrier_state) &
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))
+ sync_core_before_usermode();
+#endif
+}
+
void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
@@ -539,16 +549,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
/*
- * The membarrier system call requires a full memory barrier and
- * core serialization before returning to user-space, after
- * storing to rq->curr, when changing mm. This is because
- * membarrier() sends IPIs to all CPUs that are in the target mm
- * to make them issue memory barriers. However, if another CPU
- * switches to/from the target mm concurrently with
- * membarrier(), it can cause that CPU not to receive an IPI
- * when it really should issue a memory barrier. Writing to CR3
- * provides that full memory barrier and core serializing
- * instruction.
+ * membarrier() support requires that, when we change rq->curr->mm:
+ *
+ * - If next->mm has membarrier registered, a full memory barrier
+ * after writing rq->curr (or rq->curr->mm if we switched the mm
+ * without switching tasks) and before returning to user mode.
+ *
+ * - If next->mm has SYNC_CORE registered, then we sync core before
+ * returning to user mode.
+ *
+ * In the case where prev->mm == next->mm, membarrier() uses an IPI
+ * instead, and no particular barriers are needed while context
+ * switching.
+ *
+ * x86 gets all of this as a side-effect of writing to CR3 except
+ * in the case where we unlazy without flushing.
+ *
+ * All other architectures are civilized and do all of this implicitly
+ * when transitioning from kernel to user mode.
*/
if (real_prev == next) {
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
@@ -566,7 +584,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
/*
* If the CPU is not in lazy TLB mode, we are just switching
* from one thread in a process to another thread in the same
- * process. No TLB flush required.
+ * process. No TLB flush or membarrier() synchronization
+ * is required.
*/
if (!was_lazy)
return;
@@ -576,16 +595,31 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* If the TLB is up to date, just use it.
* The barrier synchronizes with the tlb_gen increment in
* the TLB shootdown code.
+ *
+ * As a future optimization opportunity, it's plausible
+ * that the x86 memory model is strong enough that this
+ * smp_mb() isn't needed.
*/
smp_mb();
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
- next_tlb_gen)
+ next_tlb_gen) {
+ /*
+ * We switched logical mm but we're not going to
+ * write to CR3. We already did smp_mb() above,
+ * but membarrier() might require a sync_core()
+ * as well.
+ */
+ sync_core_if_membarrier_enabled(next);
+
return;
+ }
/*
* TLB contents went out of date while we were in lazy
* mode. Fall through to the TLB switching code below.
+ * No need for an explicit membarrier invocation -- the CR3
+ * write will serialize.
*/
new_asid = prev_asid;
need_flush = true;
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 5561486fddef..c256a7fc0423 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -345,16 +345,6 @@ enum {
#include <asm/membarrier.h>
#endif
-static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
-{
- if (current->mm != mm)
- return;
- if (likely(!(atomic_read(&mm->membarrier_state) &
- MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
- return;
- sync_core_before_usermode();
-}
-
extern void membarrier_exec_mmap(struct mm_struct *mm);
extern void membarrier_update_current_mm(struct mm_struct *next_mm);
@@ -370,9 +360,6 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
static inline void membarrier_exec_mmap(struct mm_struct *mm)
{
}
-static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
-{
-}
static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
{
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f21714ea3db8..6a1db8264c7b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4822,22 +4822,22 @@ static struct rq *finish_task_switch(struct task_struct *prev)
kmap_local_sched_in();
fire_sched_in_preempt_notifiers(current);
+
/*
* When switching through a kernel thread, the loop in
* membarrier_{private,global}_expedited() may have observed that
* kernel thread and not issued an IPI. It is therefore possible to
* schedule between user->kernel->user threads without passing though
* switch_mm(). Membarrier requires a barrier after storing to
- * rq->curr, before returning to userspace, so provide them here:
+ * rq->curr, before returning to userspace, and mmdrop() provides
+ * this barrier.
*
- * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
- * provided by mmdrop(),
- * - a sync_core for SYNC_CORE.
+ * If an architecture needs to take a specific action for
+ * SYNC_CORE, it can do so in switch_mm_irqs_off().
*/
- if (mm) {
- membarrier_mm_sync_core_before_usermode(mm);
+ if (mm)
mmdrop(mm);
- }
+
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
--
2.33.1