linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Andy Lutomirski" <luto@kernel.org>
To: "Valentin Schneider" <vschneid@redhat.com>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, rcu@vger.kernel.org,
	"the arch/x86 maintainers" <x86@kernel.org>,
	linux-arm-kernel@lists.infradead.org, loongarch@lists.linux.dev,
	linux-riscv@lists.infradead.org, linux-arch@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org
Cc: "Thomas Gleixner" <tglx@linutronix.de>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Borislav Petkov" <bp@alien8.de>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	"Arnaldo Carvalho de Melo" <acme@kernel.org>,
	"Josh Poimboeuf" <jpoimboe@kernel.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Arnd Bergmann" <arnd@arndb.de>,
	"Frederic Weisbecker" <frederic@kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	"Jason Baron" <jbaron@akamai.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Ard Biesheuvel" <ardb@kernel.org>,
	"Sami Tolvanen" <samitolvanen@google.com>,
	"David S. Miller" <davem@davemloft.net>,
	"Neeraj Upadhyay" <neeraj.upadhyay@kernel.org>,
	"Joel Fernandes" <joelagnelf@nvidia.com>,
	"Josh Triplett" <josh@joshtriplett.org>,
	"Boqun Feng" <boqun.feng@gmail.com>,
	"Uladzislau Rezki" <urezki@gmail.com>,
	"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"Mel Gorman" <mgorman@suse.de>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Masahiro Yamada" <masahiroy@kernel.org>,
	"Han Shen" <shenhan@google.com>,
	"Rik van Riel" <riel@surriel.com>, "Jann Horn" <jannh@google.com>,
	"Dan Carpenter" <dan.carpenter@linaro.org>,
	"Oleg Nesterov" <oleg@redhat.com>,
	"Juri Lelli" <juri.lelli@redhat.com>,
	"Clark Williams" <williams@redhat.com>,
	"Yair Podemsky" <ypodemsk@redhat.com>,
	"Marcelo Tosatti" <mtosatti@redhat.com>,
	"Daniel Wagner" <dwagner@suse.de>,
	"Petr Tesarik" <ptesarik@suse.com>,
	"Shrikanth Hegde" <sshegde@linux.ibm.com>
Subject: Re: [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3
Date: Wed, 19 Nov 2025 09:31:37 -0800	[thread overview]
Message-ID: <91702ceb-afba-450e-819b-52d482d7bd11@app.fastmail.com> (raw)
In-Reply-To: <xhsmhecpukowa.mognet@vschneid-thinkpadt14sgen2i.remote.csb>


On Wed, Nov 19, 2025, at 7:44 AM, Valentin Schneider wrote:
> On 19/11/25 06:31, Andy Lutomirski wrote:
>> On Fri, Nov 14, 2025, at 7:14 AM, Valentin Schneider wrote:
>>> Deferring kernel range TLB flushes requires the guarantee that upon
>>> entering the kernel, no stale entry may be accessed. The simplest way to
>>> provide such a guarantee is to issue an unconditional flush upon switching
>>> to the kernel CR3, as this is the pivoting point where such stale entries
>>> may be accessed.
>>>
>>
>> Doing this together with the PTI CR3 switch has no actual benefit: MOV CR3 doesn’t flush global pages. And doing this in asm is pretty gross.  We don’t even get a free sync_core() out of it because INVPCID is not documented as being serializing.
>>
>> Why can’t we do it in C?  What’s the actual risk?  In order to trip over a stale TLB entry, we would need to deference a pointer to newly allocated kernel virtual memory that was not valid prior to our entry into user mode. I can imagine BPF doing this, but plain noinstr C in the entry path?  Especially noinstr C *that has RCU disabled*?  We already can’t follow an RCU pointer, and ISTM the only style of kernel code that might do this would use RCU to protect the pointer, and we are already doomed if we follow an RCU pointer to any sort of memory.
>>
>
> So v4 and earlier had the TLB flush faff done in C in the context_tracking entry
> just like sync_core().
>
> My biggest issue with it was that I couldn't figure out a way to instrument
> memory accesses such that I would get an idea of where vmalloc'd accesses
> happen - even with a hackish thing just to survey the landscape. So while I
> agree with your reasoning wrt entry noinstr code, I don't have any way to
> prove it.
> That's unlike the text_poke sync_core() deferral for which I have all of
> that nice objtool instrumentation.
>
> Dave also pointed out that the whole stale entry flush deferral is a risky
> move, and that the sanest thing would be to execute the deferred flush just
> after switching to the kernel CR3.
>
> See the thread surrounding:
>   https://lore.kernel.org/lkml/20250114175143.81438-30-vschneid@redhat.com/
>
> mainly Dave's reply and subthread:
>   https://lore.kernel.org/lkml/352317e3-c7dc-43b4-b4cb-9644489318d0@intel.com/
>
>> We do need to watch out for NMI/MCE hitting before we flush.

I read a decent fraction of that thread.

Let's consider what we're worried about:

1. Architectural access to a kernel virtual address that has been unmapped, in asm or early C.  If it hasn't been remapped, then we oops anyway.  If it has, then that means we're accessing a pointer where either the pointer has changed or the pointee has been remapped while we're in user mode, and that's a very strange thing to do for anything that the asm points to or that early C points to, unless RCU is involved.  But RCU is already disallowed in the entry paths that might be in extended quiescent states, so I think this is mostly a nonissue.

2. Non-speculative access via GDT access, etc.  We can't control this at all, but we're not avoid to move the GDT, IDT, LDT etc of a running task while that task is in user mode.  We do move the LDT, but that's quite thoroughly synchronized via IPI.  (Should probably be double checked.  I wrote that code, but that doesn't mean I remember it exactly.)

3. Speculative TLB fills.  We can't control this at all.  We have had actual machine checks, on AMD IIRC, due to messing this up.  This is why we can't defer a flush after freeing a page table.

4. Speculative or other nonarchitectural loads.  One would hope that these are not dangerous.  For example, an early version of TDX would machine check if we did a speculative load from TDX memory, but that was fixed.  I don't see why this would be materially different between actual userspace execution (without LASS, anyway), kernel asm, and kernel C.

5. Writes to page table dirty bits.  I don't think we use these.

In any case, the current implementation in your series is really, really, utterly horrifically slow.  It's probably fine for a task that genuinely sits in usermode forever, but I don't think it's likely to be something that we'd be willing to enable for normal kernels and normal tasks.  And it would be really nice for the don't-interrupt-user-code still to move toward being always available rather than further from it.


I admit that I'm kind of with dhansen: Zen 3+ can use INVLPGB and doesn't need any of this.  Some Intel CPUs support RAR and will eventually be able to use RAR, possibly even for sync_core().


  reply	other threads:[~2025-11-19 17:32 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-14 15:01 [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 01/31] objtool: Make validate_call() recognize indirect calls to pv_ops[] Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 02/31] objtool: Flesh out warning related to pv_ops[] calls Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 03/31] rcu: Add a small-width RCU watching counter debug option Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 04/31] rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 05/31] jump_label: Add annotations for validating noinstr usage Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 06/31] static_call: Add read-only-after-init static calls Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 07/31] x86/paravirt: Mark pv_sched_clock static call as __ro_after_init Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 08/31] x86/idle: Mark x86_idle " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 09/31] x86/paravirt: Mark pv_steal_clock " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 10/31] riscv/paravirt: " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 11/31] loongarch/paravirt: " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 12/31] arm64/paravirt: " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 13/31] arm/paravirt: " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 14/31] perf/x86/amd: Mark perf_lopwr_cb " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 15/31] sched/clock: Mark sched_clock_running key " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 16/31] KVM: VMX: Mark __kvm_is_using_evmcs static " Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 17/31] x86/bugs: Mark cpu_buf_vm_clear key as allowed in .noinstr Valentin Schneider
2025-11-14 15:01 ` [PATCH v7 18/31] x86/speculation/mds: Mark cpu_buf_idle_clear " Valentin Schneider
2025-11-14 15:10 ` [PATCH v7 19/31] sched/clock, x86: Mark __sched_clock_stable " Valentin Schneider
2025-11-14 15:10 ` [PATCH v7 20/31] KVM: VMX: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys " Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 21/31] stackleack: Mark stack_erasing_bypass key " Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 22/31] objtool: Add noinstr validation for static branches/calls Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 23/31] module: Add MOD_NOINSTR_TEXT mem_type Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 24/31] context-tracking: Introduce work deferral infrastructure Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 25/31] context_tracking,x86: Defer kernel text patching IPIs Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 26/31] x86/jump_label: Add ASM support for static_branch_likely() Valentin Schneider
2025-11-14 15:14 ` [PATCH v7 27/31] x86/mm: Make INVPCID type macros available to assembly Valentin Schneider
2025-11-14 15:14 ` [RFC PATCH v7 28/31] x86/mm/pti: Introduce a kernel/user CR3 software signal Valentin Schneider
2025-11-14 15:14 ` [RFC PATCH v7 29/31] x86/mm/pti: Implement a TLB flush immediately after a switch to kernel CR3 Valentin Schneider
2025-11-19 14:31   ` Andy Lutomirski
2025-11-19 15:44     ` Valentin Schneider
2025-11-19 17:31       ` Andy Lutomirski [this message]
2025-11-21 10:12         ` Valentin Schneider
2025-11-14 15:14 ` [RFC PATCH v7 30/31] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y Valentin Schneider
2025-11-19 18:31   ` Dave Hansen
2025-11-19 18:33     ` Andy Lutomirski
2025-11-21 17:37     ` Valentin Schneider
2025-11-21 17:50       ` Dave Hansen
2025-11-25 14:13         ` Valentin Schneider
2025-11-14 15:14 ` [RFC PATCH v7 31/31] x86/entry: Add an option to coalesce TLB flushes Valentin Schneider
2025-11-14 16:20 ` [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition Andy Lutomirski
2025-11-14 17:22   ` Andy Lutomirski
2025-11-14 18:14     ` Paul E. McKenney
2025-11-14 18:45       ` Andy Lutomirski
2025-11-14 20:03         ` Paul E. McKenney
2025-11-15  0:29           ` Andy Lutomirski
2025-11-15  2:30             ` Paul E. McKenney
2025-11-14 20:06         ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=91702ceb-afba-450e-819b-52d482d7bd11@app.fastmail.com \
    --to=luto@kernel.org \
    --cc=acme@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=ardb@kernel.org \
    --cc=arnd@arndb.de \
    --cc=boqun.feng@gmail.com \
    --cc=bp@alien8.de \
    --cc=dan.carpenter@linaro.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=davem@davemloft.net \
    --cc=dwagner@suse.de \
    --cc=frederic@kernel.org \
    --cc=hpa@zytor.com \
    --cc=jannh@google.com \
    --cc=jbaron@akamai.com \
    --cc=joelagnelf@nvidia.com \
    --cc=josh@joshtriplett.org \
    --cc=jpoimboe@kernel.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=loongarch@lists.linux.dev \
    --cc=masahiroy@kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=mtosatti@redhat.com \
    --cc=neeraj.upadhyay@kernel.org \
    --cc=oleg@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=ptesarik@suse.com \
    --cc=rcu@vger.kernel.org \
    --cc=riel@surriel.com \
    --cc=rostedt@goodmis.org \
    --cc=samitolvanen@google.com \
    --cc=shenhan@google.com \
    --cc=sshegde@linux.ibm.com \
    --cc=tglx@linutronix.de \
    --cc=urezki@gmail.com \
    --cc=vschneid@redhat.com \
    --cc=williams@redhat.com \
    --cc=x86@kernel.org \
    --cc=ypodemsk@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox