From: "Andy Lutomirski" <luto@kernel.org>
To: "Valentin Schneider" <vschneid@redhat.com>,
"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
linux-mm@kvack.org, rcu@vger.kernel.org,
"the arch/x86 maintainers" <x86@kernel.org>,
linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev,
linux-riscv@lists.infradead.org, linux-arch@vger.kernel.org,
linux-trace-kernel@vger.kernel.org
Cc: "Thomas Gleixner" <tglx@linutronix.de>,
"Ingo Molnar" <mingo@redhat.com>,
"Borislav Petkov" <bp@alien8.de>,
"Dave Hansen" <dave.hansen@linux.intel.com>,
"H. Peter Anvin" <hpa@zytor.com>,
"Peter Zijlstra (Intel)" <peterz@infradead.org>,
"Arnaldo Carvalho de Melo" <acme@kernel.org>,
"Josh Poimboeuf" <jpoimboe@kernel.org>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Arnd Bergmann" <arnd@arndb.de>,
"Frederic Weisbecker" <frederic@kernel.org>,
"Paul E. McKenney" <paulmck@kernel.org>,
"Jason Baron" <jbaron@akamai.com>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Ard Biesheuvel" <ardb@kernel.org>,
"Sami Tolvanen" <samitolvanen@google.com>,
"David S. Miller" <davem@davemloft.net>,
"Neeraj Upadhyay" <neeraj.upadhyay@kernel.org>,
"Joel Fernandes" <joelagnelf@nvidia.com>,
"Josh Triplett" <josh@joshtriplett.org>,
"Boqun Feng" <boqun.feng@gmail.com>,
"Uladzislau Rezki" <urezki@gmail.com>,
"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
"Mel Gorman" <mgorman@suse.de>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Masahiro Yamada" <masahiroy@kernel.org>,
"Han Shen" <shenhan@google.com>,
"Rik van Riel" <riel@surriel.com>, "Jann Horn" <jannh@google.com>,
"Dan Carpenter" <dan.carpenter@linaro.org>,
"Oleg Nesterov" <oleg@redhat.com>,
"Juri Lelli" <juri.lelli@redhat.com>,
"Clark Williams" <williams@redhat.com>,
"Yair Podemsky" <ypodemsk@redhat.com>,
"Marcelo Tosatti" <mtosatti@redhat.com>,
"Daniel Wagner" <dwagner@suse.de>,
"Petr Tesarik" <ptesarik@suse.com>,
"Shrikanth Hegde" <sshegde@linux.ibm.com>
Subject: Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
Date: Fri, 14 Nov 2025 08:20:42 -0800
Message-ID: <c4768cf3-f131-44e6-9b25-ebeb633f32ee@app.fastmail.com>
In-Reply-To: <20251114150133.1056710-1-vschneid@redhat.com>
On Fri, Nov 14, 2025, at 7:01 AM, Valentin Schneider wrote:
> Context
> =======
>
> We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
> pure-userspace application get regularly interrupted by IPIs sent from
> housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
> leading to various on_each_cpu() calls, e.g.:
>
> The heart of this series is the thought that while we cannot remove NOHZ_FULL
> CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
> the callbacks immediately. Anything that only affects kernelspace can wait
> until the next user->kernel transition, providing it can be executed "early
> enough" in the entry code.
>
I want to point out that there's another option here, although anyone trying to implement it would be fighting against quite a lot of history.
Logically, each CPU is in one of a handful of states: user mode, idle, normal kernel mode (possibly subdivided into IRQ, etc.), and a handful of very narrow windows, hopefully uninstrumented and not accessing any PTEs that might be invalid, in the entry and exit paths where any state in memory could be out of sync with actual CPU state. (That last category includes the moment right after the CPU switches to kernel mode, for example.) On top of that there are NMI, MCE, and whatever weird "security" entry types Intel and AMD love to add.
The way the kernel *currently* deals with this has two big historical oddities:
1. The entry and exit code cares about ti_flags, which is per-*task*, which means that atomically poking it from other CPUs involves the runqueue lock or other shenanigans (see the idle nr_polling code for example), and also that it's not accessible from the user page tables if PTI is on.
2. The actual heavyweight atomic part (context tracking) was built for RCU, and it's sort of bolted on, and, as you've observed in this series, it's really quite awkward to do things that aren't RCU using context tracking.
If this were a greenfield project, I think there's a straightforward approach that's much nicer: stick everything into a single percpu flags structure. Imagine we have cpu_flags, which tracks both the current state of the CPU and what work needs to be done on state changes. On exit to user mode, we would atomically set the mode to USER and make sure we don't touch anything like vmalloc space after that. On entry back to kernel mode, we would avoid vmalloc space, etc, then atomically switch to kernel mode and read out whatever deferred work is needed. As an optimization, if nothing in the current configuration needs atomic state tracking, the state could be left at USER_OR_KERNEL and the overhead of an extra atomic op at entry and exit could be avoided.
And RCU would hook into *that* instead of having its own separate set of hooks.
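To make the cpu_flags idea a bit more concrete, here is a minimal userspace sketch using C11 atomics. It is not kernel code, and every name in it (struct cpu_flags, the MODE_* values, the WORK_* bits, the helper functions) is a made-up placeholder rather than anything in the tree or in this series. It only shows the three operations described above: publish USER on exit, atomically switch to KERNEL and read out deferred work on entry, and let a remote CPU attach work only while the target is in USER mode. In the untracked USER_OR_KERNEL configuration the entry and exit helpers would simply not be called.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: the low two bits hold the mode, the rest are work bits. */
enum { MODE_KERNEL = 0, MODE_USER = 1, MODE_USER_OR_KERNEL = 2, MODE_MASK = 3 };
#define WORK_TLB_FLUSH (1u << 2) /* hypothetical deferred-work bit */
#define WORK_SYNC_CORE (1u << 3) /* hypothetical deferred-work bit */

struct cpu_flags {
	_Atomic uint32_t word; /* would be one word per CPU */
};

static inline void cpu_flags_init(struct cpu_flags *f, uint32_t mode)
{
	assert((mode & ~(uint32_t)MODE_MASK) == 0);
	atomic_init(&f->word, mode);
}

/* Local CPU, last step of exit to user mode: publish USER, no work pending. */
static inline void cpu_exit_to_user(struct cpu_flags *f)
{
	atomic_store(&f->word, MODE_USER);
}

/* Local CPU, first step of entry: switch to KERNEL and collect deferred work. */
static inline uint32_t cpu_enter_kernel(struct cpu_flags *f)
{
	uint32_t old = atomic_exchange(&f->word, MODE_KERNEL);

	return old & ~(uint32_t)MODE_MASK;
}

/*
 * Remote CPU: try to queue work for the target's next kernel entry.
 * Returns false if the target is not known to be in USER mode, in which
 * case the caller has to fall back to an IPI.
 */
static inline bool cpu_try_defer(struct cpu_flags *f, uint32_t work)
{
	uint32_t old = atomic_load(&f->word);

	do {
		if ((old & MODE_MASK) != MODE_USER)
			return false;
	} while (!atomic_compare_exchange_weak(&f->word, &old, old | work));
	return true;
}
```

The compare-and-exchange loop is what makes "is it in USER mode?" and "attach the work" a single atomic step, so a racing kernel entry either sees the work bit or forces the remote side onto the IPI path.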
I think that actually doing this would be a big improvement and would also be a serious project. There's a lot of code that would get touched, and the existing context tracking code is subtle and confusing. And, as mentioned, ti_flags has the wrong scope.
It's *possible* that one could avoid making ti_flags percpu either by extensive use of the runqueue locks or by borrowing a kludge from the idle code. For the latter, right now, the reason that the wake-from-idle code works is that the optimized path only happens if the idle thread/cpu is "polling", and it's impossible for the idle ti_flags to be polling while the CPU isn't actually idle. We could similarly observe that, if a ti_flags says it's in USER mode *and* is on, say, cpu 3, then cpu 3 is most definitely in USER mode. So someone could try shoving the CPU number into ti_flags :-p (USER means actually user or in the late exit / early entry path.)
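As a rough illustration of the "CPU number in ti_flags" kludge, the following userspace sketch packs a USER bit and a CPU number into one per-task word. The bit positions and all names (struct task_flags, TIF_IN_USER, TIF_CPU_SHIFT, the helpers) are assumptions made up for this example and do not match the real thread_info layout. A real version would have the remote side do a compare-and-exchange against the observed value to attach work; the sketch only shows the encoding and the check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TIF_IN_USER   (1u << 0) /* hypothetical "in USER mode" bit */
#define TIF_CPU_SHIFT 8         /* hypothetical: CPU number lives in bits 8..31 */

struct task_flags {
	_Atomic uint32_t word; /* stands in for a per-task ti_flags word */
};

/* Task on @cpu is about to return to user mode: record both facts at once. */
static inline void task_exit_to_user(struct task_flags *t, uint32_t cpu)
{
	assert(cpu < (1u << 24));
	atomic_store(&t->word, TIF_IN_USER | (cpu << TIF_CPU_SHIFT));
}

/* Task re-entered the kernel: the word no longer says anything about a CPU. */
static inline void task_enter_kernel(struct task_flags *t)
{
	atomic_store(&t->word, 0);
}

/* If this returns true, @cpu is in USER mode (or the late exit / early entry path). */
static inline bool task_proves_cpu_in_user(struct task_flags *t, uint32_t cpu)
{
	uint32_t w = atomic_load(&t->word);

	return (w & TIF_IN_USER) && (w >> TIF_CPU_SHIFT) == cpu;
}
```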
Anyway, benefits of this whole approach would include considerably (IMO) increased comprehensibility compared to the current tangled ct code and much more straightforward addition of new things that happen to a target CPU conditionally depending on its mode. And, if the flags word was actually per cpu, it could be mapped such that SWITCH_TO_KERNEL_CR3 would use it -- there could be a single CR3 write (and maybe CR4/invpcid depending on whether a zapped mapping is global) and the flush bit could depend on whether a flush is needed. And there would be basically no chance that a bug that accessed invalidated-but-not-flushed kernel data could be undetected -- in PTI mode, any such access would page fault! Similarly, if kernel text pokes deferred the flush and serialization, the only code that could execute before noticing the deferred flush would be the user-CR3 code.
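For the single-CR3-write point, a small sketch of how the flush decision could fold into the value written on entry. It relies on the architectural behavior that, with CR4.PCIDE set, bit 63 of the value written to CR3 suppresses the flush of that PCID's non-global entries. The function name and parameters are hypothetical, the computation is done in plain C rather than in the entry assembly, and global mappings (the CR4/invpcid case mentioned above) are deliberately not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR3_PCID_MASK 0xfffull      /* PCID occupies CR3 bits 0..11 */
#define CR3_NOFLUSH   (1ull << 63)  /* set: keep this PCID's TLB entries */

/*
 * Hypothetical helper: build the kernel CR3 value for entry from user mode.
 * @flush_pending would come from the deferred-work bits in the per-CPU flags.
 */
static inline uint64_t entry_kernel_cr3(uint64_t pgd_phys, uint16_t pcid,
					bool flush_pending)
{
	uint64_t cr3;

	assert((pgd_phys & CR3_PCID_MASK) == 0); /* page-aligned page table root */
	assert(pcid <= CR3_PCID_MASK);

	cr3 = pgd_phys | pcid;
	if (!flush_pending)
		cr3 |= CR3_NOFLUSH; /* nothing deferred: skip the flush */
	return cr3;
}
```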
Oh, and another primitive would be possible: one CPU could plausibly execute another CPU's interrupts or soft-irqs or whatever by taking a special lock that would effectively pin the remote CPU in user mode -- you'd set a flag in the target cpu_flags saying "pin in USER mode" and the transition on that CPU to kernel mode would then spin on entry to kernel mode and wait for the lock to be released. This could plausibly get a lot of the on_each_cpu callers to switch over in one fell swoop: anything that needs to synchronize to the remote CPU but does not need to poke its actual architectural state could be executed locally while the remote CPU is pinned.
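The pinning primitive can be sketched the same way. Again this is a userspace illustration with C11 atomics and invented names (struct cpu_pin, PIN_MODE_*, PIN_LOCKED, the helpers), assuming a single pinner at a time: the remote side can set the pin only while the target is in USER mode, and the target's kernel entry cannot complete until the pin is dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum { PIN_MODE_KERNEL = 0, PIN_MODE_USER = 1 };
#define PIN_LOCKED (1u << 1) /* hypothetical "pin in USER mode" flag */

struct cpu_pin {
	_Atomic uint32_t word; /* would be part of the per-CPU flags word */
};

/* Remote CPU: pin the target in USER mode; fails if it is in the kernel or already pinned. */
static inline bool cpu_try_pin_in_user(struct cpu_pin *p)
{
	uint32_t expected = PIN_MODE_USER;

	return atomic_compare_exchange_strong(&p->word, &expected,
					      PIN_MODE_USER | PIN_LOCKED);
}

/* Remote CPU: release the pin so the target may enter the kernel again. */
static inline void cpu_unpin(struct cpu_pin *p)
{
	atomic_fetch_and(&p->word, ~PIN_LOCKED);
}

/* Target CPU: one attempt at the USER -> KERNEL transition; fails while pinned. */
static inline bool cpu_try_enter_kernel(struct cpu_pin *p)
{
	uint32_t expected = PIN_MODE_USER;

	return atomic_compare_exchange_strong(&p->word, &expected,
					      PIN_MODE_KERNEL);
}

/* Target CPU entry path: only called from USER mode, spins until unpinned. */
static inline void cpu_enter_kernel_spin(struct cpu_pin *p)
{
	while (!cpu_try_enter_kernel(p))
		; /* a real implementation would pause/relax here */
}
```

While the pin is held, the pinning CPU can run the remote CPU's work locally, knowing the target cannot get past early entry and observe intermediate state.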
--Andy