From: Frederic Weisbecker <frederic@kernel.org>
To: Valentin Schneider <vschneid@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
"Peter Zijlstra (Intel)" <peterz@infradead.org>,
Nicolas Saenz Julienne <nsaenzju@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"H. Peter Anvin" <hpa@zytor.com>,
Andy Lutomirski <luto@kernel.org>,
Arnaldo Carvalho de Melo <acme@kernel.org>,
Josh Poimboeuf <jpoimboe@kernel.org>,
Paolo Bonzini <pbonzini@redhat.com>,
Arnd Bergmann <arnd@arndb.de>,
"Paul E. McKenney" <paulmck@kernel.org>,
Jason Baron <jbaron@akamai.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ard Biesheuvel <ardb@kernel.org>,
Sami Tolvanen <samitolvanen@google.com>,
"David S. Miller" <davem@davemloft.net>,
Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
Joel Fernandes <joelagnelf@nvidia.com>,
Josh Triplett <josh@joshtriplett.org>,
Boqun Feng <boqun.feng@gmail.com>,
Uladzislau Rezki <urezki@gmail.com>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Mel Gorman <mgorman@suse.de>,
Andrew Morton <akpm@linux-foundation.org>,
Masahiro Yamada <masahiroy@kernel.org>,
Han Shen <shenhan@google.com>, Rik van Riel <riel@surriel.com>,
Jann Horn <jannh@google.com>,
Dan Carpenter <dan.carpenter@linaro.org>,
Oleg Nesterov <oleg@redhat.com>,
Juri Lelli <juri.lelli@redhat.com>,
Clark Williams <williams@redhat.com>,
Tomas Glozar <tglozar@redhat.com>,
Yair Podemsky <ypodemsk@redhat.com>,
Marcelo Tosatti <mtosatti@redhat.com>,
Daniel Wagner <dwagner@suse.de>, Petr Tesarik <ptesarik@suse.com>,
Shrikanth Hegde <sshegde@linux.ibm.com>
Subject: Re: [RFC PATCH v8 09/10] context_tracking,x86: Defer kernel text patching IPIs when tracking CR3 switches
Date: Wed, 15 Apr 2026 14:11:35 +0200
Message-ID: <ad-Ad44-IyU-yKJf@localhost.localdomain>
In-Reply-To: <20260324094801.3092968-10-vschneid@redhat.com>
On Tue, Mar 24, 2026 at 10:48:00AM +0100, Valentin Schneider wrote:
> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
> them vs the newly patched instruction. CPUs that are executing in userspace
> do not need this synchronization to happen immediately, and this is
> actually harmful interference for NOHZ_FULL CPUs.
>
> As the synchronization IPIs are sent using a blocking call, returning from
> text_poke_bp_batch() implies all CPUs will observe the patched
> instruction(s), and this should be preserved even if the IPI is deferred.
> In other words, to safely defer this synchronization, any kernel
> instruction leading to the execution of the deferred instruction
> sync must *not* be mutable (patchable) at runtime.
>
> This means we must pay attention to mutable instructions in the early entry
> code:
> - alternatives
> - static keys
> - static calls
> - all sorts of probes (kprobes/ftrace/bpf/???)
>
> The early entry code is noinstr, which gets rid of the probes.
>
> Alternatives are safe, because it's boot-time patching (before SMP is
> even brought up) which is before any IPI deferral can happen.
>
> This leaves us with static keys and static calls. Any static key used in
> early entry code should be only forever-enabled at boot time, IOW
> __ro_after_init (pretty much like alternatives). Exceptions to that will
> now be caught by objtool.
>
> The deferred instruction sync is the CR3 RMW done as part of
> kPTI when switching to the kernel page table:
>
> SDM vol2 chapter 4.3 - Move to/from control registers:
> ```
> MOV CR* instructions, except for MOV CR8, are serializing instructions.
> ```
>
> Leverage the new kernel_cr3_loaded signal and the kPTI CR3 RMW to defer
> sync_core() IPIs targeting NOHZ_FULL CPUs.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> arch/x86/include/asm/text-patching.h | 5 ++++
> arch/x86/kernel/alternative.c | 34 +++++++++++++++++++++++-----
> arch/x86/kernel/kprobes/core.c | 4 ++--
> arch/x86/kernel/kprobes/opt.c | 4 ++--
> arch/x86/kernel/module.c | 2 +-
> 5 files changed, 38 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index f2d142a0a862e..628e80f8318cd 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -33,6 +33,11 @@ extern void text_poke_apply_relocation(u8 *buf, const u8 * const instr, size_t i
> */
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern void smp_text_poke_sync_each_cpu(void);
> +#ifdef CONFIG_TRACK_CR3
> +extern void smp_text_poke_sync_each_cpu_deferrable(void);
> +#else
> +#define smp_text_poke_sync_each_cpu_deferrable smp_text_poke_sync_each_cpu
> +#endif
> extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
> extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
> #define text_poke_copy text_poke_copy
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 28518371d8bf3..f3af77d7c533c 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -6,6 +6,7 @@
> #include <linux/vmalloc.h>
> #include <linux/memory.h>
> #include <linux/execmem.h>
> +#include <linux/sched/isolation.h>
>
> #include <asm/text-patching.h>
> #include <asm/insn.h>
> @@ -13,6 +14,7 @@
> #include <asm/ibt.h>
> #include <asm/set_memory.h>
> #include <asm/nmi.h>
> +#include <asm/tlbflush.h>
>
> int __read_mostly alternatives_patched;
>
> @@ -2706,11 +2708,29 @@ static void do_sync_core(void *info)
> sync_core();
> }
>
> +static void __smp_text_poke_sync_each_cpu(smp_cond_func_t cond_func)
> +{
> + on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
> +}
> +
> void smp_text_poke_sync_each_cpu(void)
> {
> - on_each_cpu(do_sync_core, NULL, 1);
> + __smp_text_poke_sync_each_cpu(NULL);
> +}
> +
> +#ifdef CONFIG_TRACK_CR3
> +static bool do_sync_core_defer_cond(int cpu, void *info)
> +{
> + return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
> + per_cpu(kernel_cr3_loaded, cpu);
|| should be && ?
Also, I would again expect full ordering here, with an smp_mb() before the
check, so that:

    CPU 0                          CPU 1
    -----                          -----
    //enter_kernel                 //do_sync_core_defer_cond
    kernel_cr3_loaded = 1          WRITE page table
    smp_mb()                       smp_mb()
    WRITE cr3                      READ kernel_cr3_loaded

But I'm not sure that this ordering is enough to imply that, if CPU 1 observes
kernel_cr3_loaded == 0, then CPU 0 subsequently entering the kernel is
guaranteed to flush the TLB with the latest page table write.
Thoughts?
Thanks.
--
Frederic Weisbecker
SUSE Labs