From: "Andy Lutomirski" <luto@kernel.org>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: "Valentin Schneider" <vschneid@redhat.com>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, rcu@vger.kernel.org,
	"the arch/x86 maintainers" <x86@kernel.org>,
	linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev,
	linux-riscv@lists.infradead.org, linux-arch@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Borislav Petkov" <bp@alien8.de>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	"Arnaldo Carvalho de Melo" <acme@kernel.org>,
	"Josh Poimboeuf" <jpoimboe@kernel.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Arnd Bergmann" <arnd@arndb.de>,
	"Frederic Weisbecker" <frederic@kernel.org>,
	"Jason Baron" <jbaron@akamai.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Ard Biesheuvel" <ardb@kernel.org>,
	"Sami Tolvanen" <samitolvanen@google.com>,
	"David S. Miller" <davem@davemloft.net>,
	"Neeraj Upadhyay" <neeraj.upadhyay@kernel.org>,
	"Joel Fernandes" <joelagnelf@nvidia.com>,
	"Josh Triplett" <josh@joshtriplett.org>,
	"Boqun Feng" <boqun.feng@gmail.com>,
	"Uladzislau Rezki" <urezki@gmail.com>,
	"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"Mel Gorman" <mgorman@suse.de>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Masahiro Yamada" <masahiroy@kernel.org>,
	"Han Shen" <shenhan@google.com>,
	"Rik van Riel" <riel@surriel.com>, "Jann Horn" <jannh@google.com>,
	"Dan Carpenter" <dan.carpenter@linaro.org>,
	"Oleg Nesterov" <oleg@redhat.com>,
	"Juri Lelli" <juri.lelli@redhat.com>,
	"Clark Williams" <williams@redhat.com>,
	"Yair Podemsky" <ypodemsk@redhat.com>,
	"Marcelo Tosatti" <mtosatti@redhat.com>,
	"Daniel Wagner" <dwagner@suse.de>,
	"Petr Tesarik" <ptesarik@suse.com>,
	"Shrikanth Hegde" <sshegde@linux.ibm.com>
Subject: Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition
Date: Fri, 14 Nov 2025 16:29:31 -0800	[thread overview]
Message-ID: <4f1a7b15-cfdc-478e-8e43-98750608866c@app.fastmail.com> (raw)
In-Reply-To: <b3472cb3-86fa-4687-bfce-3b9a1bf4ff36@paulmck-laptop>



On Fri, Nov 14, 2025, at 12:03 PM, Paul E. McKenney wrote:
> On Fri, Nov 14, 2025 at 10:45:08AM -0800, Andy Lutomirski wrote:
>> 
>> 
>> On Fri, Nov 14, 2025, at 10:14 AM, Paul E. McKenney wrote:
>> > On Fri, Nov 14, 2025 at 09:22:35AM -0800, Andy Lutomirski wrote:
>> >> 
>> 
>> >> > Oh, any another primitive would be possible: one CPU could plausibly 
>> >> > execute another CPU's interrupts or soft-irqs or whatever by taking a 
>> >> > special lock that would effectively pin the remote CPU in user mode -- 
>> >> > you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
>> >> > the transition on that CPU to kernel mode would then spin on entry to 
>> >> > kernel mode and wait for the lock to be released.  This could plausibly 
>> >> > get a lot of the on_each_cpu callers to switch over in one fell swoop: 
>> >> > anything that needs to synchronize to the remote CPU but does not need 
>> >> > to poke its actual architectural state could be executed locally while 
>> >> > the remote CPU is pinned.
>> >
>> > It would be necessary to arrange for the remote CPU to remain pinned
>> > while the local CPU executed on its behalf.  Does the above approach
>> > make that happen without re-introducing our current context-tracking
>> > overhead and complexity?
>> 
>> Using the pseudo-implementation farther down, I think this would be like:
>> 
>> if (my_state->status == INDETERMINATE) {
>>    // we were lazy and we never promised to do work atomically.
>>    atomic_set(&my_state->status, KERNEL);
>>    this_entry_work = 0;
>>    /* we are definitely not pinned in this path */
>> } else {
>>    // we were not lazy and we promised we would do work atomically
>>    atomic exchange the entire state to { .work = 0, .status = KERNEL }
>>    this_entry_work = (whatever we just read);
>>    if (this_entry_work & PINNED) {
>>      atomic_t *this_cpu_pin_count = this_cpu_ptr(&pin_count);
>>      while (atomic_read(this_cpu_pin_count)) {
>>        cpu_relax();
>>      }
>>    }
>> }
>> 
>> and we'd have something like:
>> 
>> bool try_pin_remote_cpu(int cpu)
>> {
>>     atomic_t *remote_pin_count = ...;
>>     struct fancy_cpu_state *remote_state = ...;
>>     atomic_inc(remote_pin_count);  // optimistic
>> 
>>     // Hmm, we do not want that read to get reordered with the inc, so we probably
>>     // need a full barrier or seq_cst.  How does Linux spell that?  C++ has atomic::load
>>     // with seq_cst and maybe the optimizer can do the right thing.  Maybe it's:
>>     smp_mb__after_atomic();
>> 
>>     if (atomic_read(&remote_state->status) == USER) {
>>       // egads, if this is some arch with very weak ordering,
>>       // do we need to be concerned that we just took a lock but we
>>       // just did a relaxed read and therefore a subsequent access
>>       // that thinks it's locked might appear to precede the load and
>>       // therefore somehow get surprisingly seen out of order by the
>>       // target cpu?  maybe we wanted atomic_read_acquire above instead?
>> 
>>       // Okay, it's genuinely pinned.
>>       return true;
>>     } else {
>>       // We might not have successfully pinned it
>>       atomic_dec(remote_pin_count);
>>       return false;
>>     }
>> }
>> 
>> void unpin_remote_cpu(int cpu)
>> {
>>     atomic_t *remote_pin_count = ...;
>> 
>>     atomic_dec(remote_pin_count);
>> }
>> 
>> and we'd use it like:
>> 
>> if (try_pin_remote_cpu(cpu)) {
>>   // do something useful
>> } else {
>>   send IPI;
>> }
>> 
>> but we'd really accumulate the set of CPUs that need the IPIs and do them all at once.
>> 
>> I ran the theorem prover that lives inside my head on this code, using the assumption that the machine is a well-behaved x86 system, and it said "yeah, looks like it might be correct".  I trust an actual formalized system, or someone like you who is genuinely very good at this stuff, much more than I trust my initial impression :)
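
For the archive's benefit, here is the same sketch pulled together into
something closer to real kernel C.  Everything here is a sketch under
assumptions: fancy_cpu_state, pin_count, the status values, and the work
functions in the usage example below are made-up names, and the barriers
are the ones argued for above, not audited ones.

    enum cpu_ct_status { KERNEL, USER, INDETERMINATE };

    struct fancy_cpu_state {
            atomic_t status;                /* enum cpu_ct_status */
    };

    static DEFINE_PER_CPU(struct fancy_cpu_state, fancy_cpu_state);
    static DEFINE_PER_CPU(atomic_t, pin_count);

    static bool try_pin_remote_cpu(int cpu)
    {
            atomic_t *remote_pin_count = per_cpu_ptr(&pin_count, cpu);
            struct fancy_cpu_state *remote_state =
                    per_cpu_ptr(&fancy_cpu_state, cpu);

            atomic_inc(remote_pin_count);   /* optimistic */
            smp_mb__after_atomic();         /* order the inc before the read */

            if (atomic_read(&remote_state->status) == USER)
                    return true;            /* genuinely pinned in user mode */

            atomic_dec(remote_pin_count);   /* might not be pinned; undo */
            return false;
    }

    static void unpin_remote_cpu(int cpu)
    {
            atomic_dec(per_cpu_ptr(&pin_count, cpu));
    }

And the caller would batch the misses rather than IPI one at a time,
roughly like this (target_mask, ipi_mask, and the two work functions are
hypothetical):

    cpumask_clear(&ipi_mask);
    for_each_cpu(cpu, target_mask) {
            if (try_pin_remote_cpu(cpu)) {
                    do_remote_work_locally(cpu);
                    unpin_remote_cpu(cpu);
            } else {
                    cpumask_set_cpu(cpu, &ipi_mask);
            }
    }
    smp_call_function_many(&ipi_mask, remote_work_fn, NULL, true);
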
>
> Let's start with requirements, non-traditional though that might be. ;-)

That's ridiculous! :-)

>
> An "RCU idle" CPU is either in deep idle or executing in nohz_full
> userspace.
>
> 1.	If the RCU grace-period kthread sees an RCU-idle CPU, then:
>
> 	a.	Everything that this CPU did before entering RCU-idle
> 		state must be visible to sufficiently later code executed
> 		by the grace-period kthread, and:
>
> 	b.	Everything that the CPU will do after exiting RCU-idle
> 		state must *not* have been visible to the grace-period
> 		kthread sufficiently prior to having sampled this
> 		CPU's state.
>
> 2.	If the RCU grace-period kthread sees an RCU-nonidle CPU, then
> 	it depends on whether this is the same nonidle sojourn as was
> 	initially seen.  (If the kthread initially saw the CPU in an
> 	RCU-idle state, it would not have bothered resampling.)
>
> 	a.	If this is the same nonidle sojourn, then there are no
> 		ordering requirements.	RCU must continue to wait on
> 		this CPU.
>
> 	b.	Otherwise, everything that this CPU did before entering
> 		its last RCU-idle state must be visible to sufficiently
> 		later code executed by the grace-period kthread.
> 		Similar to (1a) above.
>
> 3.	If a given CPU quickly switches into and out of RCU-idle
> 	state, and it is always in RCU-nonidle state whenever the RCU
> 	grace-period kthread looks, RCU must still realize that this
> 	CPU has passed through at least one quiescent state.
>
> 	This is why we have a counter for RCU rather than just a
> 	simple state.

...

> I am not seeing how the third requirement above is met, though.  I have
> not verified the sampling code that the RCU grace-period kthread is
> supposed to use because I am not seeing it right off-hand.

This is ct_rcu_watching_cpu_acquire, right?  Lemme think.  Maybe there's even a way to do everything I'm suggesting without changing that interface.
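
For requirement 3, I'm picturing the grace-period-side check as roughly
the following.  The snapshot comes from ct_rcu_watching_cpu_acquire();
the CT_RCU_WATCHING bit test is my understanding of the current counter
encoding (watching iff the bit is set), so treat that as an assumption
to verify rather than a statement about the real code:

    /* snap was returned earlier by ct_rcu_watching_cpu_acquire(cpu). */
    static bool cpu_passed_quiescent_state(int cpu, int snap)
    {
            /* RCU-idle at snapshot time: that was itself a QS. */
            if (!(snap & CT_RCU_WATCHING))
                    return true;

            /*
             * Otherwise any counter movement means the CPU went through
             * RCU-idle at least once since the snapshot, even if every
             * individual sample caught it in a nonidle sojourn.
             */
            return ct_rcu_watching_cpu(cpu) != snap;
    }
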


>> It wouldn't be outrageous to have real-time imply the full USER transition.
>
> We could have special code for RT, but this of course increases the
> complexity.  Which might be justified by sufficient speedup.

I'm thinking that the code would be arranged so that going to user mode in either the USER or the INDETERMINATE state would be fully valid; the two would just have different performance characteristics.  So the only extra complexity here would be the actual logic to choose which state to go to.  And...
>
> And yes, many distros enable NO_HZ_FULL by default.  I will refrain from
> suggesting additional static branches.  ;-)

I'm not even suggesting a static branch.  I think the potential performance wins are big enough to justify a bona fide ordinary if statement or two :)  I'm also contemplating whether it could make sense to have this whole thing unconditionally configured in, if the performance in the case where no one uses it (i.e. everything is INDETERMINATE instead of USER) is good enough.
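
Concretely, the "if statement or two" on the return-to-user path could
be as dumb as the following sketch; the predicates for wanting the full
USER transition are placeholders I made up, not a proposal:

    static void choose_exit_status(struct fancy_cpu_state *my_state)
    {
            if (context_tracking_enabled_this_cpu() || rt_task(current))
                    /* Promise to do any deferred work atomically on entry. */
                    atomic_set(&my_state->status, USER);
            else
                    /* Stay lazy; entry just does atomic_set(status, KERNEL). */
                    atomic_set(&my_state->status, INDETERMINATE);
    }
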

I will ponder.


