* [RFC PATCH 0/9] Intel RAR TLB invalidation
@ 2025-05-06 0:37 Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 1/9] x86/mm: Introduce MSR_IA32_CORE_CAPABILITIES Rik van Riel
` (8 more replies)
0 siblings, 9 replies; 16+ messages in thread
From: Rik van Riel @ 2025-05-06 0:37 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa
This patch series adds support for IPI-less TLB invalidation
using Intel RAR technology.
Intel RAR differs from AMD INVLPGB in a few ways:
- RAR goes through (emulated?) APIC writes, not instructions
- RAR flushes go through a memory table with 64 entries
- RAR flushes can be targeted to a cpumask
- The RAR functionality must be set up at boot time before it can be used
The cpumask targeting has resulted in Intel RAR and AMD INVLPGB having
slightly different rules:
- Processes with dynamic ASIDs use IPI-based shootdowns
- INVLPGB: processes with a global ASID
  - always have the TLB up to date, on every CPU
  - never need to flush the TLB at context switch time
- RAR: processes with global ASIDs
  - have the TLB up to date on CPUs in the mm_cpumask
  - can skip a TLB flush at context switch time if the CPU is in the mm_cpumask
  - need to flush the TLB when scheduled on a CPU not in the mm_cpumask,
    in case the process ran there before and the TLB still has stale
    entries (see the sketch below)
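To make that last rule concrete, a rough and untested sketch of the
context switch check; the real code later in the series differs in
the details:

/*
 * Sketch only: does a task using a global ASID need a TLB flush
 * when it is scheduled in on this CPU?
 */
static bool global_asid_needs_flush(struct mm_struct *next)
{
        /* AMD INVLPGB keeps every CPU up to date for global ASIDs. */
        if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
                return false;

        /*
         * Intel RAR only keeps CPUs in the mm_cpumask up to date; a CPU
         * outside the mask may still hold stale entries from an earlier
         * run of this process.
         */
        return !cpumask_test_cpu(smp_processor_id(), mm_cpumask(next));
}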
RAR functionality is present on Sapphire Rapids and newer CPUs.
Information about Intel RAR can be found in this whitepaper:
https://www.intel.com/content/dam/develop/external/us/en/documents/341431-remote-action-request-white-paper.pdf
This patch series is based off a 2019 patch series created by
Intel, with patches later in the series modified to fit into
the TLB flush code structure we have after AMD INVLPGB functionality
was integrated.
This first version of the code still has issues with segfaults
and kernel oopses. Clearly something is still wrong with how or
when flushes are done, and the code could use more eyeballs.
I have some ideas for additional code cleanups as well, but would
like to get the last remaining bugs sorted out first...
* [RFC PATCH 1/9] x86/mm: Introduce MSR_IA32_CORE_CAPABILITIES
2025-05-06 0:37 [RFC PATCH 0/9] Intel RAR TLB invalidation Rik van Riel
@ 2025-05-06 0:37 ` Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 2/9] x86/mm: Introduce Remote Action Request MSRs Rik van Riel
` (7 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2025-05-06 0:37 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, Yu-cheng Yu, Rik van Riel
From: Yu-cheng Yu <yu-cheng.yu@intel.com>
MSR_IA32_CORE_CAPABILITIES indicates the existence of other MSRs.
Bit[1] indicates Remote Action Request (RAR) TLB registers.
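For illustration, the bit is expected to be consumed roughly as follows;
a later patch in this series adds essentially this as detect_rar():

static void __init detect_rar(struct cpuinfo_x86 *c)
{
        u64 msr;

        if (cpu_has(c, X86_FEATURE_CORE_CAPABILITIES)) {
                rdmsrl(MSR_IA32_CORE_CAPABILITIES, msr);

                if (msr & CORE_CAP_RAR)         /* bit 1 */
                        setup_force_cpu_cap(X86_FEATURE_RAR);
        }
}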
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/msr-index.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index ac21dc19dde2..0828b891fe2e 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -212,6 +212,12 @@
* File.
*/
+#define MSR_IA32_CORE_CAPABILITIES 0x000000cf
+#define CORE_CAP_RAR BIT(1) /*
+ * Remote Action Request. Used to directly
+ * flush the TLB on remote CPUs.
+ */
+
#define MSR_IA32_FLUSH_CMD 0x0000010b
#define L1D_FLUSH BIT(0) /*
* Writeback and invalidate the
--
2.49.0
* [RFC PATCH 2/9] x86/mm: Introduce Remote Action Request MSRs
2025-05-06 0:37 [RFC PATCH 0/9] Intel RAR TLB invalidation Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 1/9] x86/mm: Introduce MSR_IA32_CORE_CAPABILITIES Rik van Riel
@ 2025-05-06 0:37 ` Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 3/9] x86/mm: enable BROADCAST_TLB_FLUSH on Intel, too Rik van Riel
` (6 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2025-05-06 0:37 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, Yu-cheng Yu, Rik van Riel
From: Yu-cheng Yu <yu-cheng.yu@intel.com>
Remote Action Request (RAR) is a TLB flushing broadcast facility.
This patch introduces the RAR MSR definitions; the code that uses them
is introduced in later patches.
There are five RAR-related MSRs (a rough usage sketch follows the list):
MSR_IA32_CORE_CAPABILITIES
MSR_IA32_RAR_CTRL
MSR_IA32_RAR_ACT_VEC
MSR_IA32_RAR_PAYLOAD_BASE
MSR_IA32_RAR_INFO
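As a rough, untested sketch of how these MSRs fit together (the real
per-CPU setup arrives later in the series; payload_table and
action_vector below stand in for the per-CPU structures allocated there):

        u64 info;

        rdmsrl(MSR_IA32_RAR_INFO, info);        /* supported payload count, etc. */
        wrmsrl(MSR_IA32_RAR_PAYLOAD_BASE, virt_to_phys(payload_table));
        wrmsrl(MSR_IA32_RAR_ACT_VEC, virt_to_phys(action_vector));
        wrmsrl(MSR_IA32_RAR_CTRL, RAR_CTRL_ENABLE | RAR_CTRL_IGNORE_IF);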
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/msr-index.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 0828b891fe2e..923e17462712 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -122,6 +122,17 @@
#define SNB_C3_AUTO_UNDEMOTE (1UL << 27)
#define SNB_C1_AUTO_UNDEMOTE (1UL << 28)
+/*
+ * Remote Action Requests (RAR) MSRs
+ */
+#define MSR_IA32_RAR_CTRL 0x000000ed
+#define MSR_IA32_RAR_ACT_VEC 0x000000ee
+#define MSR_IA32_RAR_PAYLOAD_BASE 0x000000ef
+#define MSR_IA32_RAR_INFO 0x000000f0
+
+#define RAR_CTRL_ENABLE BIT(31)
+#define RAR_CTRL_IGNORE_IF BIT(30)
+
#define MSR_MTRRcap 0x000000fe
#define MSR_IA32_ARCH_CAPABILITIES 0x0000010a
--
2.49.0
* [RFC PATCH 3/9] x86/mm: enable BROADCAST_TLB_FLUSH on Intel, too
2025-05-06 0:37 [RFC PATCH 0/9] Intel RAR TLB invalidation Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 1/9] x86/mm: Introduce MSR_IA32_CORE_CAPABILITIES Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 2/9] x86/mm: Introduce Remote Action Request MSRs Rik van Riel
@ 2025-05-06 0:37 ` Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 4/9] x86/mm: Introduce X86_FEATURE_RAR Rik van Riel
` (5 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2025-05-06 0:37 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@fb.com>
Much of the code for Intel RAR and AMD INVLPGB is shared.
Place both under the same config option.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/Kconfig.cpu | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index f928cf6e3252..f9cdd145abba 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -360,7 +360,7 @@ menuconfig PROCESSOR_SELECT
config BROADCAST_TLB_FLUSH
def_bool y
- depends on CPU_SUP_AMD && 64BIT
+ depends on (CPU_SUP_AMD || CPU_SUP_INTEL) && 64BIT
config CPU_SUP_INTEL
default y
--
2.49.0
* [RFC PATCH 4/9] x86/mm: Introduce X86_FEATURE_RAR
2025-05-06 0:37 [RFC PATCH 0/9] Intel RAR TLB invalidation Rik van Riel
` (2 preceding siblings ...)
2025-05-06 0:37 ` [RFC PATCH 3/9] x86/mm: enable BROADCAST_TLB_FLUSH on Intel, too Rik van Riel
@ 2025-05-06 0:37 ` Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 5/9] x86/mm: Change cpa_flush() to call flush_kernel_range() directly Rik van Riel
` (4 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2025-05-06 0:37 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, Rik van Riel, Yu-cheng Yu, Rik van Riel
From: Rik van Riel <riel@fb.com>
Introduce X86_FEATURE_RAR and enumeration of the feature.
[riel: move disabling to Kconfig.cpufeatures]
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/Kconfig.cpufeatures | 4 ++++
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/kernel/cpu/common.c | 13 +++++++++++++
3 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig.cpufeatures b/arch/x86/Kconfig.cpufeatures
index e12d5b7e39a2..60042f8c2837 100644
--- a/arch/x86/Kconfig.cpufeatures
+++ b/arch/x86/Kconfig.cpufeatures
@@ -199,3 +199,7 @@ config X86_DISABLED_FEATURE_SEV_SNP
config X86_DISABLED_FEATURE_INVLPGB
def_bool y
depends on !BROADCAST_TLB_FLUSH
+
+config X86_DISABLED_FEATURE_RAR
+ def_bool y
+ depends on !BROADCAST_TLB_FLUSH
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 7642310276a8..06732c872998 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -75,7 +75,7 @@
#define X86_FEATURE_CENTAUR_MCR ( 3*32+ 3) /* "centaur_mcr" Centaur MCRs (= MTRRs) */
#define X86_FEATURE_K8 ( 3*32+ 4) /* Opteron, Athlon64 */
#define X86_FEATURE_ZEN5 ( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
-/* Free ( 3*32+ 6) */
+#define X86_FEATURE_RAR ( 3*32+ 6) /* Intel Remote Action Request */
/* Free ( 3*32+ 7) */
#define X86_FEATURE_CONSTANT_TSC ( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
#define X86_FEATURE_UP ( 3*32+ 9) /* "up" SMP kernel running on UP */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b73e09315413..5666620e7153 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1502,6 +1502,18 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
setup_force_cpu_bug(X86_BUG_L1TF);
}
+static void __init detect_rar(struct cpuinfo_x86 *c)
+{
+ u64 msr;
+
+ if (cpu_has(c, X86_FEATURE_CORE_CAPABILITIES)) {
+ rdmsrl(MSR_IA32_CORE_CAPABILITIES, msr);
+
+ if (msr & CORE_CAP_RAR)
+ setup_force_cpu_cap(X86_FEATURE_RAR);
+ }
+}
+
/*
* The NOPL instruction is supposed to exist on all CPUs of family >= 6;
* unfortunately, that's not true in practice because of early VIA
@@ -1728,6 +1740,7 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
setup_clear_cpu_cap(X86_FEATURE_LA57);
detect_nopl();
+ detect_rar(c);
}
void __init init_cpu_devs(void)
--
2.49.0
* [RFC PATCH 5/9] x86/mm: Change cpa_flush() to call flush_kernel_range() directly
2025-05-06 0:37 [RFC PATCH 0/9] Intel RAR TLB invalidation Rik van Riel
` (3 preceding siblings ...)
2025-05-06 0:37 ` [RFC PATCH 4/9] x86/mm: Introduce X86_FEATURE_RAR Rik van Riel
@ 2025-05-06 0:37 ` Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 6/9] x86/apic: Introduce Remote Action Request Operations Rik van Riel
` (3 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2025-05-06 0:37 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, Rik van Riel, Yu-cheng Yu, Rik van Riel
From: Rik van Riel <riel@fb.com>
The function cpa_flush() calls flush_tlb_one_kernel() and
flush_tlb_all().
Replacing that with a call to flush_tlb_kernel_range() allows
cpa_flush() to make use of INVLPGB or RAR without any additional
changes.
Initialize invlpgb_count_max to 1, since flush_tlb_kernel_range()
can now be called before invlpgb_count_max has been initialized
to the value read from CPUID.
[riel: remove now unused __cpa_flush_tlb]
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/kernel/cpu/amd.c | 2 +-
arch/x86/mm/pat/set_memory.c | 20 +++++++-------------
2 files changed, 8 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 13a48ec28f32..c85ecde786f3 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -31,7 +31,7 @@
#include "cpu.h"
-u16 invlpgb_count_max __ro_after_init;
+u16 invlpgb_count_max __ro_after_init = 1;
static inline int rdmsrq_amd_safe(unsigned msr, u64 *p)
{
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 30ab4aced761..2454f5249329 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -399,15 +399,6 @@ static void cpa_flush_all(unsigned long cache)
on_each_cpu(__cpa_flush_all, (void *) cache, 1);
}
-static void __cpa_flush_tlb(void *data)
-{
- struct cpa_data *cpa = data;
- unsigned int i;
-
- for (i = 0; i < cpa->numpages; i++)
- flush_tlb_one_kernel(fix_addr(__cpa_addr(cpa, i)));
-}
-
static int collapse_large_pages(unsigned long addr, struct list_head *pgtables);
static void cpa_collapse_large_pages(struct cpa_data *cpa)
@@ -444,6 +435,7 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
static void cpa_flush(struct cpa_data *cpa, int cache)
{
+ unsigned long start, end;
unsigned int i;
BUG_ON(irqs_disabled() && !early_boot_irqs_disabled);
@@ -453,10 +445,12 @@ static void cpa_flush(struct cpa_data *cpa, int cache)
goto collapse_large_pages;
}
- if (cpa->force_flush_all || cpa->numpages > tlb_single_page_flush_ceiling)
- flush_tlb_all();
- else
- on_each_cpu(__cpa_flush_tlb, cpa, 1);
+ start = fix_addr(__cpa_addr(cpa, 0));
+ end = fix_addr(__cpa_addr(cpa, cpa->numpages));
+ if (cpa->force_flush_all)
+ end = TLB_FLUSH_ALL;
+
+ flush_tlb_kernel_range(start, end);
if (!cache)
goto collapse_large_pages;
--
2.49.0
* [RFC PATCH 6/9] x86/apic: Introduce Remote Action Request Operations
2025-05-06 0:37 [RFC PATCH 0/9] Intel RAR TLB invalidation Rik van Riel
` (4 preceding siblings ...)
2025-05-06 0:37 ` [RFC PATCH 5/9] x86/mm: Change cpa_flush() to call flush_kernel_range() directly Rik van Riel
@ 2025-05-06 0:37 ` Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 7/9] x86/mm: Introduce Remote Action Request Rik van Riel
` (2 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2025-05-06 0:37 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, Rik van Riel, Yu-cheng Yu, Rik van Riel
From: Rik van Riel <riel@fb.com>
RAR TLB flushing is started by sending a command to the APIC.
This patch adds Remote Action Request commands.
[riel: move some things around to account for 6 years of changes]
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/apicdef.h | 1 +
arch/x86/include/asm/irq_vectors.h | 5 +++++
arch/x86/include/asm/smp.h | 15 +++++++++++++++
arch/x86/kernel/apic/ipi.c | 23 +++++++++++++++++++----
arch/x86/kernel/apic/local.h | 3 +++
arch/x86/kernel/smp.c | 3 +++
6 files changed, 46 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/apicdef.h b/arch/x86/include/asm/apicdef.h
index 094106b6a538..b152d45af91a 100644
--- a/arch/x86/include/asm/apicdef.h
+++ b/arch/x86/include/asm/apicdef.h
@@ -92,6 +92,7 @@
#define APIC_DM_LOWEST 0x00100
#define APIC_DM_SMI 0x00200
#define APIC_DM_REMRD 0x00300
+#define APIC_DM_RAR 0x00300
#define APIC_DM_NMI 0x00400
#define APIC_DM_INIT 0x00500
#define APIC_DM_STARTUP 0x00600
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 47051871b436..c417b0015304 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -103,6 +103,11 @@
*/
#define POSTED_MSI_NOTIFICATION_VECTOR 0xeb
+/*
+ * RAR (remote action request) TLB flush
+ */
+#define RAR_VECTOR 0xe0
+
#define NR_VECTORS 256
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 0c1c68039d6f..1ab9f5fcac8a 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -40,6 +40,9 @@ struct smp_ops {
void (*send_call_func_ipi)(const struct cpumask *mask);
void (*send_call_func_single_ipi)(int cpu);
+
+ void (*send_rar_ipi)(const struct cpumask *mask);
+ void (*send_rar_single_ipi)(int cpu);
};
/* Globals due to paravirt */
@@ -100,6 +103,16 @@ static inline void arch_send_call_function_ipi_mask(const struct cpumask *mask)
smp_ops.send_call_func_ipi(mask);
}
+static inline void arch_send_rar_single_ipi(int cpu)
+{
+ smp_ops.send_rar_single_ipi(cpu);
+}
+
+static inline void arch_send_rar_ipi_mask(const struct cpumask *mask)
+{
+ smp_ops.send_rar_ipi(mask);
+}
+
void cpu_disable_common(void);
void native_smp_prepare_boot_cpu(void);
void smp_prepare_cpus_common(void);
@@ -120,6 +133,8 @@ void __noreturn mwait_play_dead(unsigned int eax_hint);
void native_smp_send_reschedule(int cpu);
void native_send_call_func_ipi(const struct cpumask *mask);
void native_send_call_func_single_ipi(int cpu);
+void native_send_rar_ipi(const struct cpumask *mask);
+void native_send_rar_single_ipi(int cpu);
asmlinkage __visible void smp_reboot_interrupt(void);
__visible void smp_reschedule_interrupt(struct pt_regs *regs);
diff --git a/arch/x86/kernel/apic/ipi.c b/arch/x86/kernel/apic/ipi.c
index 98a57cb4aa86..e5e9fc08f86c 100644
--- a/arch/x86/kernel/apic/ipi.c
+++ b/arch/x86/kernel/apic/ipi.c
@@ -79,7 +79,7 @@ void native_send_call_func_single_ipi(int cpu)
__apic_send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR);
}
-void native_send_call_func_ipi(const struct cpumask *mask)
+static void do_native_send_ipi(const struct cpumask *mask, int vector)
{
if (static_branch_likely(&apic_use_ipi_shorthand)) {
unsigned int cpu = smp_processor_id();
@@ -88,14 +88,19 @@ void native_send_call_func_ipi(const struct cpumask *mask)
goto sendmask;
if (cpumask_test_cpu(cpu, mask))
- __apic_send_IPI_all(CALL_FUNCTION_VECTOR);
+ __apic_send_IPI_all(vector);
else if (num_online_cpus() > 1)
- __apic_send_IPI_allbutself(CALL_FUNCTION_VECTOR);
+ __apic_send_IPI_allbutself(vector);
return;
}
sendmask:
- __apic_send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
+ __apic_send_IPI_mask(mask, vector);
+}
+
+void native_send_call_func_ipi(const struct cpumask *mask)
+{
+ do_native_send_ipi(mask, CALL_FUNCTION_VECTOR);
}
void apic_send_nmi_to_offline_cpu(unsigned int cpu)
@@ -106,6 +111,16 @@ void apic_send_nmi_to_offline_cpu(unsigned int cpu)
return;
apic->send_IPI(cpu, NMI_VECTOR);
}
+
+void native_send_rar_single_ipi(int cpu)
+{
+ apic->send_IPI_mask(cpumask_of(cpu), RAR_VECTOR);
+}
+
+void native_send_rar_ipi(const struct cpumask *mask)
+{
+ do_native_send_ipi(mask, RAR_VECTOR);
+}
#endif /* CONFIG_SMP */
static inline int __prepare_ICR2(unsigned int mask)
diff --git a/arch/x86/kernel/apic/local.h b/arch/x86/kernel/apic/local.h
index bdcf609eb283..833669174267 100644
--- a/arch/x86/kernel/apic/local.h
+++ b/arch/x86/kernel/apic/local.h
@@ -38,6 +38,9 @@ static inline unsigned int __prepare_ICR(unsigned int shortcut, int vector,
case NMI_VECTOR:
icr |= APIC_DM_NMI;
break;
+ case RAR_VECTOR:
+ icr |= APIC_DM_RAR;
+ break;
}
return icr;
}
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 18266cc3d98c..2c51ed6aaf03 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -297,5 +297,8 @@ struct smp_ops smp_ops = {
.send_call_func_ipi = native_send_call_func_ipi,
.send_call_func_single_ipi = native_send_call_func_single_ipi,
+
+ .send_rar_ipi = native_send_rar_ipi,
+ .send_rar_single_ipi = native_send_rar_single_ipi,
};
EXPORT_SYMBOL_GPL(smp_ops);
--
2.49.0
* [RFC PATCH 7/9] x86/mm: Introduce Remote Action Request
2025-05-06 0:37 [RFC PATCH 0/9] Intel RAR TLB invalidation Rik van Riel
` (5 preceding siblings ...)
2025-05-06 0:37 ` [RFC PATCH 6/9] x86/apic: Introduce Remote Action Request Operations Rik van Riel
@ 2025-05-06 0:37 ` Rik van Riel
2025-05-06 6:59 ` Nadav Amit
2025-05-06 16:31 ` Ingo Molnar
2025-05-06 0:37 ` [RFC PATCH 8/9] x86/mm: use RAR for kernel TLB flushes Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 9/9] x86/mm: userspace & pageout flushing using Intel RAR Rik van Riel
8 siblings, 2 replies; 16+ messages in thread
From: Rik van Riel @ 2025-05-06 0:37 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, Yu-cheng Yu, Rik van Riel
From: Yu-cheng Yu <yu-cheng.yu@intel.com>
Remote Action Request (RAR) is a TLB flushing broadcast facility.
To start a TLB flush, the initiator CPU creates a RAR payload and
sends a command to the APIC. The receiving CPUs automatically flush
TLBs as specified in the payload, without the kernel's involvement.
[ riel:
- add pcid parameter to smp_call_rar_many so other mms can be flushed
- ensure get_payload only allocates valid indices
- make sure rar_cpu_init does not write reserved bits
- fix overflow in range vs full flush decision ]
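A caller of the new interface is expected to look roughly like the
following; this is illustrative only, the actual callers are added in
patches 8 and 9:

        preempt_disable();
        /* flush start..end for the given PCID on every CPU in mask */
        smp_call_rar_many(mask, pcid, start, end);
        preempt_enable();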
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/rar.h | 69 +++++++++++
arch/x86/kernel/cpu/common.c | 4 +
arch/x86/mm/Makefile | 1 +
arch/x86/mm/rar.c | 226 +++++++++++++++++++++++++++++++++++
4 files changed, 300 insertions(+)
create mode 100644 arch/x86/include/asm/rar.h
create mode 100644 arch/x86/mm/rar.c
diff --git a/arch/x86/include/asm/rar.h b/arch/x86/include/asm/rar.h
new file mode 100644
index 000000000000..b5ba856fcaa8
--- /dev/null
+++ b/arch/x86/include/asm/rar.h
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_RAR_H
+#define _ASM_X86_RAR_H
+
+/*
+ * RAR payload types
+ */
+#define RAR_TYPE_INVPG 0
+#define RAR_TYPE_INVPG_NO_CR3 1
+#define RAR_TYPE_INVPCID 2
+#define RAR_TYPE_INVEPT 3
+#define RAR_TYPE_INVVPID 4
+#define RAR_TYPE_WRMSR 5
+
+/*
+ * Subtypes for RAR_TYPE_INVPG
+ */
+#define RAR_INVPG_ADDR 0 /* address specific */
+#define RAR_INVPG_ALL 2 /* all, include global */
+#define RAR_INVPG_ALL_NO_GLOBAL 3 /* all, exclude global */
+
+/*
+ * Subtypes for RAR_TYPE_INVPCID
+ */
+#define RAR_INVPCID_ADDR 0 /* address specific */
+#define RAR_INVPCID_PCID 1 /* all of PCID */
+#define RAR_INVPCID_ALL 2 /* all, include global */
+#define RAR_INVPCID_ALL_NO_GLOBAL 3 /* all, exclude global */
+
+/*
+ * Page size for RAR_TYPE_INVLPG
+ */
+#define RAR_INVLPG_PAGE_SIZE_4K 0
+#define RAR_INVLPG_PAGE_SIZE_2M 1
+#define RAR_INVLPG_PAGE_SIZE_1G 2
+
+/*
+ * Max number of pages per payload
+ */
+#define RAR_INVLPG_MAX_PAGES 63
+
+typedef struct {
+ uint64_t for_sw : 8;
+ uint64_t type : 8;
+ uint64_t must_be_zero_1 : 16;
+ uint64_t subtype : 3;
+ uint64_t page_size: 2;
+ uint64_t num_pages : 6;
+ uint64_t must_be_zero_2 : 21;
+
+ uint64_t must_be_zero_3;
+
+ /*
+ * Starting address
+ */
+ uint64_t initiator_cr3;
+ uint64_t linear_address;
+
+ /*
+ * Padding
+ */
+ uint64_t padding[4];
+} rar_payload_t;
+
+void rar_cpu_init(void);
+void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
+ unsigned long start, unsigned long end);
+
+#endif /* _ASM_X86_RAR_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 5666620e7153..75b43db0b129 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -71,6 +71,7 @@
#include <asm/tdx.h>
#include <asm/posted_intr.h>
#include <asm/runtime-const.h>
+#include <asm/rar.h>
#include "cpu.h"
@@ -2395,6 +2396,9 @@ void cpu_init(void)
if (is_uv_system())
uv_cpu_init();
+ if (cpu_feature_enabled(X86_FEATURE_RAR))
+ rar_cpu_init();
+
load_fixmap_gdt(cpu);
}
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index cebe5812d78d..d49d16412569 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -54,6 +54,7 @@ obj-$(CONFIG_ACPI_NUMA) += srat.o
obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
obj-$(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION) += pti.o
+obj-$(CONFIG_BROADCAST_TLB_FLUSH) += rar.o
obj-$(CONFIG_X86_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_amd.o
diff --git a/arch/x86/mm/rar.c b/arch/x86/mm/rar.c
new file mode 100644
index 000000000000..77a334f1e212
--- /dev/null
+++ b/arch/x86/mm/rar.c
@@ -0,0 +1,226 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * RAR TLB shootdown
+ */
+
+#include <linux/kgdb.h>
+#include <linux/sched.h>
+#include <linux/bug.h>
+#include <asm/current.h>
+#include <asm/io.h>
+#include <asm/sync_bitops.h>
+#include <asm/rar.h>
+#include <asm/tlbflush.h>
+
+static DEFINE_PER_CPU(int, rar_lock);
+static DEFINE_PER_CPU(struct cpumask, rar_cpu_mask);
+
+#define RAR_ACTION_OK 0x00
+#define RAR_ACTION_START 0x01
+#define RAR_ACTION_ACKED 0x02
+#define RAR_ACTION_FAIL 0x80
+
+#define RAR_MAX_PAYLOADS 32UL
+
+static unsigned long rar_in_use = ~(RAR_MAX_PAYLOADS - 1);
+static rar_payload_t rar_payload[RAR_MAX_PAYLOADS] __page_aligned_bss;
+static DEFINE_PER_CPU_ALIGNED(u64[(RAR_MAX_PAYLOADS + 8) / 8], rar_action);
+
+static __always_inline void lock(int *lock)
+{
+ smp_cond_load_acquire(lock, !(VAL & 1));
+ *lock |= 1;
+
+ /*
+ * prevent the CPU from reordering the above store to the lock
+ * with any subsequent stores to the RAR payload and action
+ * entries that it protects:
+ */
+ smp_wmb();
+}
+
+static __always_inline void unlock(int *lock)
+{
+ WARN_ON(!(*lock & 1));
+
+ /*
+ * ensure we're all done before releasing data:
+ */
+ smp_store_release(lock, 0);
+}
+
+static unsigned long get_payload(void)
+{
+ while (1) {
+ unsigned long bit;
+
+ /*
+ * Find a free bit and confirm it with
+ * test_and_set_bit() below.
+ */
+ bit = ffz(READ_ONCE(rar_in_use));
+
+ if (bit >= RAR_MAX_PAYLOADS)
+ continue;
+
+ if (!test_and_set_bit((long)bit, &rar_in_use))
+ return bit;
+ }
+}
+
+static void free_payload(unsigned long idx)
+{
+ clear_bit(idx, &rar_in_use);
+}
+
+static void set_payload(unsigned long idx, u16 pcid, unsigned long start,
+ uint32_t pages)
+{
+ rar_payload_t *p = &rar_payload[idx];
+
+ p->must_be_zero_1 = 0;
+ p->must_be_zero_2 = 0;
+ p->must_be_zero_3 = 0;
+ p->page_size = RAR_INVLPG_PAGE_SIZE_4K;
+ p->type = RAR_TYPE_INVPCID;
+ p->num_pages = pages;
+ p->initiator_cr3 = pcid;
+ p->linear_address = start;
+
+ if (pcid) {
+ /* RAR invalidation of the mapping of a specific process. */
+ if (pages >= RAR_INVLPG_MAX_PAGES)
+ p->subtype = RAR_INVPCID_PCID;
+ else
+ p->subtype = RAR_INVPCID_ADDR;
+ } else {
+ /*
+ * Unfortunately RAR_INVPCID_ADDR excludes global translations.
+ * Always do a full flush for kernel invalidations.
+ */
+ p->subtype = RAR_INVPCID_ALL;
+ }
+
+ smp_wmb();
+}
+
+static void set_action_entry(unsigned long idx, int target_cpu)
+{
+ u8 *bitmap = (u8 *)per_cpu(rar_action, target_cpu);
+
+ WRITE_ONCE(bitmap[idx], RAR_ACTION_START);
+}
+
+static void wait_for_done(unsigned long idx, int target_cpu)
+{
+ u8 status;
+ u8 *bitmap = (u8 *)per_cpu(rar_action, target_cpu);
+
+ status = READ_ONCE(bitmap[idx]);
+
+ while ((status != RAR_ACTION_OK) && (status != RAR_ACTION_FAIL)) {
+ cpu_relax();
+ status = READ_ONCE(bitmap[idx]);
+ }
+
+ WARN_ON_ONCE(bitmap[idx] == RAR_ACTION_FAIL);
+}
+
+void rar_cpu_init(void)
+{
+ u64 r;
+ u8 *bitmap;
+ int this_cpu = smp_processor_id();
+
+ per_cpu(rar_lock, this_cpu) = 0;
+ cpumask_clear(&per_cpu(rar_cpu_mask, this_cpu));
+
+ rdmsrl(MSR_IA32_RAR_INFO, r);
+ pr_info_once("RAR: support %lld payloads\n", r >> 32);
+
+ bitmap = (u8 *)per_cpu(rar_action, this_cpu);
+ memset(bitmap, 0, RAR_MAX_PAYLOADS);
+ wrmsrl(MSR_IA32_RAR_ACT_VEC, (u64)virt_to_phys(bitmap));
+ wrmsrl(MSR_IA32_RAR_PAYLOAD_BASE, (u64)virt_to_phys(rar_payload));
+
+ r = RAR_CTRL_ENABLE | RAR_CTRL_IGNORE_IF;
+ // reserved bits!!! r |= (RAR_VECTOR & 0xff);
+ wrmsrl(MSR_IA32_RAR_CTRL, r);
+}
+
+/*
+ * This is a modified version of smp_call_function_many() of kernel/smp.c,
+ * without a function pointer, because the RAR handler is the ucode.
+ */
+void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
+ unsigned long start, unsigned long end)
+{
+ unsigned long pages = (end - start + PAGE_SIZE) / PAGE_SIZE;
+ int cpu, next_cpu, this_cpu = smp_processor_id();
+ cpumask_t *dest_mask;
+ unsigned long idx;
+
+ if (pages > RAR_INVLPG_MAX_PAGES || end == TLB_FLUSH_ALL)
+ pages = RAR_INVLPG_MAX_PAGES;
+
+ /*
+ * Can deadlock when called with interrupts disabled.
+ * We allow cpu's that are not yet online though, as no one else can
+ * send smp call function interrupt to this cpu and as such deadlocks
+ * can't happen.
+ */
+ WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
+ && !oops_in_progress && !early_boot_irqs_disabled);
+
+ /* Try to fastpath. So, what's a CPU they want? Ignoring this one. */
+ cpu = cpumask_first_and(mask, cpu_online_mask);
+ if (cpu == this_cpu)
+ cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+
+ /* No online cpus? We're done. */
+ if (cpu >= nr_cpu_ids)
+ return;
+
+ /* Do we have another CPU which isn't us? */
+ next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+ if (next_cpu == this_cpu)
+ next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
+
+ /* Fastpath: do that cpu by itself. */
+ if (next_cpu >= nr_cpu_ids) {
+ lock(this_cpu_ptr(&rar_lock));
+ idx = get_payload();
+ set_payload(idx, pcid, start, pages);
+ set_action_entry(idx, cpu);
+ arch_send_rar_single_ipi(cpu);
+ wait_for_done(idx, cpu);
+ free_payload(idx);
+ unlock(this_cpu_ptr(&rar_lock));
+ return;
+ }
+
+ dest_mask = this_cpu_ptr(&rar_cpu_mask);
+ cpumask_and(dest_mask, mask, cpu_online_mask);
+ cpumask_clear_cpu(this_cpu, dest_mask);
+
+ /* Some callers race with other cpus changing the passed mask */
+ if (unlikely(!cpumask_weight(dest_mask)))
+ return;
+
+ lock(this_cpu_ptr(&rar_lock));
+ idx = get_payload();
+ set_payload(idx, pcid, start, pages);
+
+ for_each_cpu(cpu, dest_mask)
+ set_action_entry(idx, cpu);
+
+ /* Send a message to all CPUs in the map */
+ arch_send_rar_ipi_mask(dest_mask);
+
+ for_each_cpu(cpu, dest_mask)
+ wait_for_done(idx, cpu);
+
+ free_payload(idx);
+ unlock(this_cpu_ptr(&rar_lock));
+}
+EXPORT_SYMBOL(smp_call_rar_many);
--
2.49.0
* [RFC PATCH 8/9] x86/mm: use RAR for kernel TLB flushes
2025-05-06 0:37 [RFC PATCH 0/9] Intel RAR TLB invalidation Rik van Riel
` (6 preceding siblings ...)
2025-05-06 0:37 ` [RFC PATCH 7/9] x86/mm: Introduce Remote Action Request Rik van Riel
@ 2025-05-06 0:37 ` Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 9/9] x86/mm: userspace & pageout flushing using Intel RAR Rik van Riel
8 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2025-05-06 0:37 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@fb.com>
Use Intel RAR for kernel TLB flushes, when enabled.
Pass in PCID 0 to smp_call_rar_many() to flush kernel memory.
Unfortunately RAR_INVPCID_ADDR excludes global PTE mappings, so only
full flushes with RAR_INVPCID_ALL will flush kernel mappings.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/mm/tlb.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7c61bf11d472..a4f3941281b6 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -21,6 +21,7 @@
#include <asm/apic.h>
#include <asm/msr.h>
#include <asm/perf_event.h>
+#include <asm/rar.h>
#include <asm/tlb.h>
#include "mm_internal.h"
@@ -1451,6 +1452,18 @@ static void do_flush_tlb_all(void *info)
__flush_tlb_all();
}
+static void rar_full_flush(const cpumask_t *cpumask)
+{
+ guard(preempt)();
+ smp_call_rar_many(cpumask, 0, 0, TLB_FLUSH_ALL);
+ invpcid_flush_all();
+}
+
+static void rar_flush_all(void)
+{
+ rar_full_flush(cpu_online_mask);
+}
+
void flush_tlb_all(void)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1458,6 +1471,8 @@ void flush_tlb_all(void)
/* First try (faster) hardware-assisted TLB invalidation. */
if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
invlpgb_flush_all();
+ else if (cpu_feature_enabled(X86_FEATURE_RAR))
+ rar_flush_all();
else
/* Fall back to the IPI-based invalidation. */
on_each_cpu(do_flush_tlb_all, NULL, 1);
@@ -1487,15 +1502,36 @@ static void do_kernel_range_flush(void *info)
struct flush_tlb_info *f = info;
unsigned long addr;
+ /*
+ * With PTI, kernel TLB entries in all PCIDs need to be flushed.
+ * With RAR, the PCID space becomes so large that we might as well flush it all.
+ *
+ * Either of the two by itself works with targeted flushes.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_RAR) &&
+ cpu_feature_enabled(X86_FEATURE_PTI)) {
+ invpcid_flush_all();
+ return;
+ }
+
/* flush range by one by one 'invlpg' */
for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
flush_tlb_one_kernel(addr);
}
+static void rar_kernel_range_flush(struct flush_tlb_info *info)
+{
+ guard(preempt)();
+ smp_call_rar_many(cpu_online_mask, 0, info->start, info->end);
+ do_kernel_range_flush(info);
+}
+
static void kernel_tlb_flush_all(struct flush_tlb_info *info)
{
if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
invlpgb_flush_all();
+ else if (cpu_feature_enabled(X86_FEATURE_RAR))
+ rar_flush_all();
else
on_each_cpu(do_flush_tlb_all, NULL, 1);
}
@@ -1504,6 +1540,8 @@ static void kernel_tlb_flush_range(struct flush_tlb_info *info)
{
if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
invlpgb_kernel_range_flush(info);
+ else if (cpu_feature_enabled(X86_FEATURE_RAR))
+ rar_kernel_range_flush(info);
else
on_each_cpu(do_kernel_range_flush, info, 1);
}
--
2.49.0
* [RFC PATCH 9/9] x86/mm: userspace & pageout flushing using Intel RAR
2025-05-06 0:37 [RFC PATCH 0/9] Intel RAR TLB invalidation Rik van Riel
` (7 preceding siblings ...)
2025-05-06 0:37 ` [RFC PATCH 8/9] x86/mm: use RAR for kernel TLB flushes Rik van Riel
@ 2025-05-06 0:37 ` Rik van Riel
8 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2025-05-06 0:37 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@fb.com>
Use Intel RAR to flush userspace mappings.
Because RAR flushes are targeted using a cpu bitmap, the rules are
a little bit different than for true broadcast TLB invalidation.
For true broadcast TLB invalidation, like done with AMD INVLPGB,
a global ASID always has up to date TLB entries on every CPU.
The context switch code never has to flush the TLB when switching
to a global ASID on any CPU with INVLPGB.
For RAR, the TLB mappings for a global ASID are kept up to date
only on CPUs within the mm_cpumask, which lazily follows the
threads around the system. The context switch code does not
need to flush the TLB if the CPU is in the mm_cpumask, and
the PCID used stays the same.
However, a CPU that falls outside of the mm_cpumask can have
out of date TLB mappings for this task. When switching to
that task on a CPU not in the mm_cpumask, the TLB does need
to be flushed.
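In code terms, the context switch path ends up doing roughly the
following; this is only a sketch of the switch_mm_irqs_off() change
further down in this patch:

        if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next))) {
                cpumask_set_cpu(cpu, mm_cpumask(next));
                /* This CPU may have missed RAR flushes for this mm. */
                if (cpu_feature_enabled(X86_FEATURE_RAR))
                        need_flush = true;
        }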
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/tlbflush.h | 9 ++-
arch/x86/mm/tlb.c | 119 +++++++++++++++++++++++++-------
2 files changed, 99 insertions(+), 29 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e9b81876ebe4..1940d51f95a9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -250,7 +250,8 @@ static inline u16 mm_global_asid(struct mm_struct *mm)
{
u16 asid;
- if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+ !cpu_feature_enabled(X86_FEATURE_RAR))
return 0;
asid = smp_load_acquire(&mm->context.global_asid);
@@ -263,7 +264,8 @@ static inline u16 mm_global_asid(struct mm_struct *mm)
static inline void mm_init_global_asid(struct mm_struct *mm)
{
- if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+ !cpu_feature_enabled(X86_FEATURE_RAR)) {
mm->context.global_asid = 0;
mm->context.asid_transition = false;
}
@@ -287,7 +289,8 @@ static inline void mm_clear_asid_transition(struct mm_struct *mm)
static inline bool mm_in_asid_transition(struct mm_struct *mm)
{
- if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+ !cpu_feature_enabled(X86_FEATURE_RAR))
return false;
return mm && READ_ONCE(mm->context.asid_transition);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a4f3941281b6..724359be3f97 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -235,9 +235,11 @@ static struct new_asid choose_new_asid(struct mm_struct *next, u64 next_tlb_gen)
/*
* TLB consistency for global ASIDs is maintained with hardware assisted
- * remote TLB flushing. Global ASIDs are always up to date.
+ * remote TLB flushing. Global ASIDs are always up to date with INVLPGB,
+ * and up to date for CPUs in the mm_cpumask with RAR.
*/
- if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB) ||
+ cpu_feature_enabled(X86_FEATURE_RAR)) {
u16 global_asid = mm_global_asid(next);
if (global_asid) {
@@ -300,7 +302,14 @@ static void reset_global_asid_space(void)
{
lockdep_assert_held(&global_asid_lock);
- invlpgb_flush_all_nonglobals();
+ /*
+ * The global flush ensures that a freshly allocated global ASID
+ * has no entries in any TLB, and can be used immediately.
+ * With Intel RAR, the TLB may still need to be flushed at context
+ * switch time when dealing with a CPU that was not in the mm_cpumask
+ * for the process, and may have missed flushes along the way.
+ */
+ flush_tlb_all();
/*
* The TLB flush above makes it safe to re-use the previously
@@ -377,7 +386,7 @@ static void use_global_asid(struct mm_struct *mm)
{
u16 asid;
- guard(raw_spinlock_irqsave)(&global_asid_lock);
+ guard(raw_spinlock)(&global_asid_lock);
/* This process is already using broadcast TLB invalidation. */
if (mm_global_asid(mm))
@@ -403,13 +412,14 @@ static void use_global_asid(struct mm_struct *mm)
void mm_free_global_asid(struct mm_struct *mm)
{
- if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+ !cpu_feature_enabled(X86_FEATURE_RAR))
return;
if (!mm_global_asid(mm))
return;
- guard(raw_spinlock_irqsave)(&global_asid_lock);
+ guard(raw_spinlock)(&global_asid_lock);
/* The global ASID can be re-used only after flush at wrap-around. */
#ifdef CONFIG_BROADCAST_TLB_FLUSH
@@ -427,7 +437,8 @@ static bool mm_needs_global_asid(struct mm_struct *mm, u16 asid)
{
u16 global_asid = mm_global_asid(mm);
- if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+ !cpu_feature_enabled(X86_FEATURE_RAR))
return false;
/* Process is transitioning to a global ASID */
@@ -445,7 +456,8 @@ static bool mm_needs_global_asid(struct mm_struct *mm, u16 asid)
*/
static void consider_global_asid(struct mm_struct *mm)
{
- if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+ !cpu_feature_enabled(X86_FEATURE_RAR))
return;
/* Check every once in a while. */
@@ -499,7 +511,7 @@ static void finish_asid_transition(struct flush_tlb_info *info)
mm_clear_asid_transition(mm);
}
-static void broadcast_tlb_flush(struct flush_tlb_info *info)
+static void invlpgb_tlb_flush(struct flush_tlb_info *info)
{
bool pmd = info->stride_shift == PMD_SHIFT;
unsigned long asid = mm_global_asid(info->mm);
@@ -865,13 +877,6 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
goto reload_tlb;
}
- /*
- * Broadcast TLB invalidation keeps this ASID up to date
- * all the time.
- */
- if (is_global_asid(prev_asid))
- return;
-
/*
* If the CPU is not in lazy TLB mode, we are just switching
* from one thread in a process to another thread in the same
@@ -880,6 +885,15 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
if (!was_lazy)
return;
+ /*
+ * Broadcast TLB invalidation keeps this ASID up to date
+ * all the time with AMD INVLPGB. Intel RAR may need a TLB
+ * flush if the CPU was in lazy TLB mode.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+ is_global_asid(prev_asid))
+ return;
+
/*
* Read the tlb_gen to check whether a flush is needed.
* If the TLB is up to date, just use it.
@@ -912,20 +926,27 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
barrier();
- /* Start receiving IPIs and then read tlb_gen (and LAM below) */
- if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
- cpumask_set_cpu(cpu, mm_cpumask(next));
+ /* A TLB flush started during a context switch is harmless. */
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
ns = choose_new_asid(next, next_tlb_gen);
+
+ /* Start receiving IPIs and RAR invalidations */
+ if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next))) {
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+ /* CPUs outside mm_cpumask may be out of date. */
+ if (cpu_feature_enabled(X86_FEATURE_RAR))
+ ns.need_flush = true;
+ }
}
reload_tlb:
new_lam = mm_lam_cr3_mask(next);
if (ns.need_flush) {
- VM_WARN_ON_ONCE(is_global_asid(ns.asid));
- this_cpu_write(cpu_tlbstate.ctxs[ns.asid].ctx_id, next->context.ctx_id);
- this_cpu_write(cpu_tlbstate.ctxs[ns.asid].tlb_gen, next_tlb_gen);
+ if (is_dyn_asid(ns.asid)) {
+ this_cpu_write(cpu_tlbstate.ctxs[ns.asid].ctx_id, next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[ns.asid].tlb_gen, next_tlb_gen);
+ }
load_new_mm_cr3(next->pgd, ns.asid, new_lam, true);
trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
@@ -1142,8 +1163,12 @@ static void flush_tlb_func(void *info)
loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
}
- /* Broadcast ASIDs are always kept up to date with INVLPGB. */
- if (is_global_asid(loaded_mm_asid))
+ /*
+ * Broadcast ASIDs are always kept up to date with INVLPGB; with
+ * Intel RAR, IPI-based flushes are used periodically to trim the
+ * mm_cpumask. Make sure those flushes are processed here.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && is_global_asid(loaded_mm_asid))
return;
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
@@ -1363,6 +1388,33 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, flush_tlb_info);
static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
#endif
+static void rar_tlb_flush(struct flush_tlb_info *info)
+{
+ unsigned long asid = mm_global_asid(info->mm);
+ u16 pcid = kern_pcid(asid);
+
+ /* Flush the remote CPUs. */
+ smp_call_rar_many(mm_cpumask(info->mm), pcid, info->start, info->end);
+ if (cpu_feature_enabled(X86_FEATURE_PTI))
+ smp_call_rar_many(mm_cpumask(info->mm), user_pcid(asid), info->start, info->end);
+
+ /* Flush the local TLB, if needed. */
+ if (cpumask_test_cpu(smp_processor_id(), mm_cpumask(info->mm))) {
+ lockdep_assert_irqs_enabled();
+ local_irq_disable();
+ flush_tlb_func(info);
+ local_irq_enable();
+ }
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ invlpgb_tlb_flush(info);
+ else /* Intel RAR */
+ rar_tlb_flush(info);
+}
+
static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
unsigned long start, unsigned long end,
unsigned int stride_shift, bool freed_tables,
@@ -1423,15 +1475,22 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
new_tlb_gen);
+ /*
+ * IPIs and RAR can be targeted to a cpumask. Periodically trim that
+ * mm_cpumask by sending TLB flush IPIs, even when most TLB flushes
+ * are done with RAR.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) || !mm_global_asid(mm))
+ info->trim_cpumask = should_trim_cpumask(mm);
+
/*
* flush_tlb_multi() is not optimized for the common case in which only
* a local TLB flush is needed. Optimize this use-case by calling
* flush_tlb_func_local() directly in this case.
*/
- if (mm_global_asid(mm)) {
+ if (mm_global_asid(mm) && !info->trim_cpumask) {
broadcast_tlb_flush(info);
} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
- info->trim_cpumask = should_trim_cpumask(mm);
flush_tlb_multi(mm_cpumask(mm), info);
consider_global_asid(mm);
} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
@@ -1742,6 +1801,14 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && batch->unmapped_pages) {
invlpgb_flush_all_nonglobals();
batch->unmapped_pages = false;
+ } else if (cpu_feature_enabled(X86_FEATURE_RAR) && cpumask_any(&batch->cpumask) < nr_cpu_ids) {
+ rar_full_flush(&batch->cpumask);
+ if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+ lockdep_assert_irqs_enabled();
+ local_irq_disable();
+ invpcid_flush_all_nonglobals();
+ local_irq_enable();
+ }
} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
flush_tlb_multi(&batch->cpumask, info);
} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
--
2.49.0
* Re: [RFC PATCH 7/9] x86/mm: Introduce Remote Action Request
2025-05-06 0:37 ` [RFC PATCH 7/9] x86/mm: Introduce Remote Action Request Rik van Riel
@ 2025-05-06 6:59 ` Nadav Amit
2025-05-06 15:16 ` Rik van Riel
2025-05-06 16:31 ` Ingo Molnar
1 sibling, 1 reply; 16+ messages in thread
From: Nadav Amit @ 2025-05-06 6:59 UTC (permalink / raw)
To: Rik van Riel
Cc: Linux Kernel Mailing List, open list:MEMORY MANAGEMENT,
the arch/x86 maintainers, kernel-team, Dave Hansen, luto, peterz,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Yu-cheng Yu
> On 6 May 2025, at 3:37, Rik van Riel <riel@surriel.com> wrote:
>
> +void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
> + unsigned long start, unsigned long end)
> +{
> + unsigned long pages = (end - start + PAGE_SIZE) / PAGE_SIZE;
> + int cpu, next_cpu, this_cpu = smp_processor_id();
> + cpumask_t *dest_mask;
> + unsigned long idx;
> +
> + if (pages > RAR_INVLPG_MAX_PAGES || end == TLB_FLUSH_ALL)
> + pages = RAR_INVLPG_MAX_PAGES;
> +
> + /*
> + * Can deadlock when called with interrupts disabled.
> + * We allow cpu's that are not yet online though, as no one else can
> + * send smp call function interrupt to this cpu and as such deadlocks
> + * can't happen.
> + */
> + WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
> + && !oops_in_progress && !early_boot_irqs_disabled);
To ease it for the reader, consider using the updated version from smp.c
(or - even better - refactor it into a common inline function):
if (cpu_online(this_cpu) && !oops_in_progress &&
!early_boot_irqs_disabled)
lockdep_assert_irqs_enabled();
> +
> + /* Try to fastpath. So, what's a CPU they want? Ignoring this one. */
> + cpu = cpumask_first_and(mask, cpu_online_mask);
> + if (cpu == this_cpu)
> + cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
> +
> + /* No online cpus? We're done. */
> + if (cpu >= nr_cpu_ids)
> + return;
> +
> + /* Do we have another CPU which isn't us? */
> + next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
> + if (next_cpu == this_cpu)
> + next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
> +
> + /* Fastpath: do that cpu by itself. */
If you follow my comment (suggestion) about the concurrent flushes, then
this part should be moved to be done in the same way as in the updated
smp_call_function_many_cond().
IOW, the main difference between this path and the “slow path” is
arch_send_rar_ipi_mask() vs arch_send_rar_single_ipi() (and maybe
“and” with cpu_online_mask).
> + if (next_cpu >= nr_cpu_ids) {
> + lock(this_cpu_ptr(&rar_lock));
> + idx = get_payload();
> + set_payload(idx, pcid, start, pages);
> + set_action_entry(idx, cpu);
> + arch_send_rar_single_ipi(cpu);
> + wait_for_done(idx, cpu);
> + free_payload(idx);
> + unlock(this_cpu_ptr(&rar_lock));
> + return;
> + }
> +
> + dest_mask = this_cpu_ptr(&rar_cpu_mask);
> + cpumask_and(dest_mask, mask, cpu_online_mask);
> + cpumask_clear_cpu(this_cpu, dest_mask);
> +
> + /* Some callers race with other cpus changing the passed mask */
> + if (unlikely(!cpumask_weight(dest_mask)))
> + return;
> +
> + lock(this_cpu_ptr(&rar_lock));
> + idx = get_payload();
> + set_payload(idx, pcid, start, pages);
> +
> + for_each_cpu(cpu, dest_mask)
> + set_action_entry(idx, cpu);
> +
> + /* Send a message to all CPUs in the map */
> + arch_send_rar_ipi_mask(dest_mask);
Since 2019 we have moved to a “multi” TLB flush instead of “many”.
This means that we try to take advantage of the time between sending the IPI
and the indication that it was completed to do the local TLB flush. For both
consistency and performance, I recommend you follow this approach and
do the local TLB flush (if needed) here, instead of doing it in the
caller.
> +
> + for_each_cpu(cpu, dest_mask)
> + wait_for_done(idx, cpu);
> +
> + free_payload(idx);
> + unlock(this_cpu_ptr(&rar_lock));
We don’t do lock/unlock in kernel/smp.c, so I would expect at least a
comment as to why it is required.
> +}
> +EXPORT_SYMBOL(smp_call_rar_many);
* Re: [RFC PATCH 7/9] x86/mm: Introduce Remote Action Request
2025-05-06 6:59 ` Nadav Amit
@ 2025-05-06 15:16 ` Rik van Riel
2025-05-06 15:27 ` Dave Hansen
2025-05-06 15:50 ` Nadav Amit
0 siblings, 2 replies; 16+ messages in thread
From: Rik van Riel @ 2025-05-06 15:16 UTC (permalink / raw)
To: Nadav Amit
Cc: Linux Kernel Mailing List, open list:MEMORY MANAGEMENT,
the arch/x86 maintainers, kernel-team, Dave Hansen, luto, peterz,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Yu-cheng Yu
On Tue, 2025-05-06 at 09:59 +0300, Nadav Amit wrote:
>
>
> > On 6 May 2025, at 3:37, Rik van Riel <riel@surriel.com> wrote:
> >
> > +void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
> > + unsigned long start, unsigned long end)
> > +{
> > + unsigned long pages = (end - start + PAGE_SIZE) /
> > PAGE_SIZE;
> > + int cpu, next_cpu, this_cpu = smp_processor_id();
> > + cpumask_t *dest_mask;
> > + unsigned long idx;
> > +
> > + if (pages > RAR_INVLPG_MAX_PAGES || end == TLB_FLUSH_ALL)
> > + pages = RAR_INVLPG_MAX_PAGES;
> > +
> > + /*
> > + * Can deadlock when called with interrupts disabled.
> > + * We allow cpu's that are not yet online though, as no
> > one else can
> > + * send smp call function interrupt to this cpu and as
> > such deadlocks
> > + * can't happen.
> > + */
> > + WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
> > + && !oops_in_progress &&
> > !early_boot_irqs_disabled);
>
> To ease it for the reader, consider using the updated version from
> smp.c
> (or - even better - refactor into common inline function):
>
> if (cpu_online(this_cpu) && !oops_in_progress &&
> !early_boot_irqs_disabled)
> lockdep_assert_irqs_enabled();
Nice cleanup. I will change this. Thank you.
>
>
> > +
> > + /* Try to fastpath. So, what's a CPU they want? Ignoring
> > this one. */
> > + cpu = cpumask_first_and(mask, cpu_online_mask);
> > + if (cpu == this_cpu)
> > + cpu = cpumask_next_and(cpu, mask,
> > cpu_online_mask);
> > +
> > + /* No online cpus? We're done. */
> > + if (cpu >= nr_cpu_ids)
> > + return;
> > +
> > + /* Do we have another CPU which isn't us? */
> > + next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
> > + if (next_cpu == this_cpu)
> > + next_cpu = cpumask_next_and(next_cpu, mask,
> > cpu_online_mask);
> > +
> > + /* Fastpath: do that cpu by itself. */
>
> If you follow my comment (suggestion) about the concurrent flushes,
> then
> this part should be moved to be in the same was as done in the
> updated
> smp_call_function_many_cond().
>
> IOW, the main difference between this path and the “slow path” is
> arch_send_rar_ipi_mask() vs arch_send_rar_single_ipi() (and maybe
> “and” with cpu_online_mask).
It gets better. Page 8 of the RAR whitepaper tells
us that we can simply use RAR to have a CPU send
itself TLB flush instructions, and the microcode
will do the flush at the same time the other CPUs
handle theirs.
"At this point, the ILP may invalidate its own TLB by
signaling RAR to itself in order to invoke the RAR handler
locally as well"
I tried this, but things blew up very early in
boot, presumably due to the CPU trying to send
itself a RAR before it was fully configured to
handle them.
The code may need a better decision point than
cpu_feature_enabled(X86_FEATURE_RAR) to decide
whether or not to use RAR.
Probably something that indicates RAR is actually
ready to use on all CPUs.
>
> Since 2019 we have moved to a “multi” TLB flush instead of “many”.
>
> This means that we try to take advantage of the time between sending the
> IPI and the indication that it was completed to do the local TLB flush.
I think we have 3 cases here:
1) Only the local TLB needs to be flushed.
In this case we can INVPCID locally, and skip any
potential contention on the RAR payload table.
2) Only one remote CPU needs to be flushed (no local).
This can use the arch_rar_send_single_ipi() thing.
3) Multiple CPUs need to be flushed. This could include
the local CPU, or be only multiple remote CPUs.
For this case we could just use arch_send_rar_ipi_mask(),
including sending a RAR request to the local CPU, which
should handle it concurrently with the other CPUs.
Does that seem like a reasonable way to handle things?
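Roughly, and completely untested, I am thinking of something like
this for cases (2) and (3), reusing the helpers from patch 7:

        idx = get_payload();
        set_payload(idx, pcid, start, pages);

        for_each_cpu(cpu, dest_mask)
                set_action_entry(idx, cpu);

        if (cpumask_weight(dest_mask) == 1)
                arch_send_rar_single_ipi(cpumask_first(dest_mask));
        else
                arch_send_rar_ipi_mask(dest_mask); /* may include this CPU */

        for_each_cpu(cpu, dest_mask)
                wait_for_done(idx, cpu);

        free_payload(idx);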
> > +
> > + for_each_cpu(cpu, dest_mask)
> > + wait_for_done(idx, cpu);
> > +
> > + free_payload(idx);
> > + unlock(this_cpu_ptr(&rar_lock));
>
> We don’t do lock/unlock on kernel/smp.c . So I would expect at least
> a
> comment as for why it is required.
>
That is a very good question!
It is locking a per-cpu lock, which no other code
path takes.
It looks like it could protect against preemption,
on a kernel with full preemption enabled, but that
should not be needed since the code in arch/x86/mm/tlb.c
disables preemption around every call to the RAR code.
I suspect that lock is no longer needed, but maybe
somebody at Intel has a reason why we still do?
--
All Rights Reversed.
* Re: [RFC PATCH 7/9] x86/mm: Introduce Remote Action Request
2025-05-06 15:16 ` Rik van Riel
@ 2025-05-06 15:27 ` Dave Hansen
2025-05-06 15:50 ` Nadav Amit
1 sibling, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2025-05-06 15:27 UTC (permalink / raw)
To: Rik van Riel, Nadav Amit
Cc: Linux Kernel Mailing List, open list:MEMORY MANAGEMENT,
the arch/x86 maintainers, kernel-team, Dave Hansen, luto, peterz,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Yu-cheng Yu
On 5/6/25 08:16, Rik van Riel wrote:
> I suspect that lock is no longer needed, but maybe
> somebody at Intel has a reason why we still do?
I just took a quick look at the locking. It doesn't make any sense to me
either.
I suspect it's just plain not needed.
* Re: [RFC PATCH 7/9] x86/mm: Introduce Remote Action Request
2025-05-06 15:16 ` Rik van Riel
2025-05-06 15:27 ` Dave Hansen
@ 2025-05-06 15:50 ` Nadav Amit
2025-05-06 16:00 ` Rik van Riel
1 sibling, 1 reply; 16+ messages in thread
From: Nadav Amit @ 2025-05-06 15:50 UTC (permalink / raw)
To: Rik van Riel
Cc: Linux Kernel Mailing List, open list:MEMORY MANAGEMENT,
the arch/x86 maintainers, kernel-team, Dave Hansen, luto, peterz,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Yu-cheng Yu
> On 6 May 2025, at 18:16, Rik van Riel <riel@surriel.com> wrote:
>
> It gets better. Page 8 of the RAR whitepaper tells
> us that we can simply use RAR to have a CPU send
> itself TLB flush instructions, and the microcode
> will do the flush at the same time the other CPUs
> handle theirs.
>
> "At this point, the ILP may invalidate its own TLB by
> signaling RAR to itself in order to invoke the RAR handler
> locally as well"
>
> I tried this, but things blew up very early in
> boot, presumably due to the CPU trying to send
> itself a RAR before it was fully configured to
> handle them.
>
> The code may need a better decision point than
> cpu_feature_enabled(X86_FEATURE_RAR) to decide
> whether or not to use RAR.
>
> Probably something that indicates RAR is actually
> ready to use on all CPUs.
>
Once you get something working (perhaps with a branch for
now) you can take the static-key/static-call path, presumably.
I would first try to get something working properly.
BTW: I suspect that the RAR approach might not handle TLB flush
storms as well as the IPI approach, in which, once the handler
sees such a storm, it does a full TLB flush and skips flushes of
“older” generations. You may want to benchmark this scenario
(IIRC one of the will-it-scale tests does something similar).
> I think we have 3 cases here:
>
> 1) Only the local TLB needs to be flushed.
> In this case we can INVPCID locally, and skip any
> potential contention on the RAR payload table.
More like INVLPG (and INVPCID for the PTI user ASID). AFAIK, Andy said
INVLPG performs better than INVPCID for a single entry. But yes,
this is a simple and hot scenario that should have a separate
code-path.
>
> 2) Only one remote CPU needs to be flushed (no local).
> This can use the arch_rar_send_single_ipi() thing.
>
> 3) Multiple CPUs need to be flushed. This could include
> the local CPU, or be only multiple remote CPUs.
> For this case we could just use arch_send_rar_ipi_mask(),
> including sending a RAR request to the local CPU, which
> should handle it concurrently with the other CPUs.
>
> Does that seem like a reasonable way to handle things?
It is. It is just that, code-wise, I think the 2nd and 3rd cases
are similar, and it would be better to handle the differences
between them without creating two completely separate code-paths.
This makes maintenance and reasoning simpler, I think.
Consider having a look at smp_call_function_many_cond(). I think
it handles the 2nd and 3rd cases nicely in the manner I just
described. Admittedly, I am a bit biased…
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 7/9] x86/mm: Introduce Remote Action Request
2025-05-06 15:50 ` Nadav Amit
@ 2025-05-06 16:00 ` Rik van Riel
0 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2025-05-06 16:00 UTC (permalink / raw)
To: Nadav Amit
Cc: Linux Kernel Mailing List, open list:MEMORY MANAGEMENT,
the arch/x86 maintainers, kernel-team, Dave Hansen, luto, peterz,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Yu-cheng Yu
On Tue, 2025-05-06 at 18:50 +0300, Nadav Amit wrote:
>
>
> > On 6 May 2025, at 18:16, Rik van Riel <riel@surriel.com> wrote:
> >
> > It gets better. Page 8 of the RAR whitepaper tells
> > us that we can simply use RAR to have a CPU send
> > itself TLB flush instructions, and the microcode
> > will do the flush at the same time the other CPUs
> > handle theirs.
> >
> > "At this point, the ILP may invalidate its own TLB by
> > signaling RAR to itself in order to invoke the RAR handler
> > locally as well"
> >
> > I tried this, but things blew up very early in
> > boot, presumably due to the CPU trying to send
> > itself a RAR before it was fully configured to
> > handle them.
> >
> > The code may need a better decision point than
> > cpu_feature_enabled(X86_FEATURE_RAR) to decide
> > whether or not to use RAR.
> >
> > Probably something that indicates RAR is actually
> > ready to use on all CPUs.
> >
>
> Once you get something working (perhaps with a branch for
> now) you can take the static-key/static-call path, presumably.
> I would first try to get something working properly.
>
The static-key code is implemented with alternatives,
which call flush_tlb_mm_range().
I've not spent the time digging into whether that
creates any chicken-and-egg scenarios yet :)
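As a strawman, the "actually ready" gate could be a static key that only
gets flipped once every CPU has programmed its RAR MSRs. The names
rar_ready, rar_mark_ready() and can_use_rar() below are invented, and
since static_branch_enable() itself patches kernel text, the
chicken-and-egg concern above applies to where this runs:

        /* Sketch only: flipped after the last CPU has set up RAR. */
        static DEFINE_STATIC_KEY_FALSE(rar_ready);

        void rar_mark_ready(void)
        {
                /*
                 * static_branch_enable() patches text, so this must run
                 * late enough that the patching machinery (and any TLB
                 * flushing it does) already works without RAR.
                 */
                static_branch_enable(&rar_ready);
        }

        static inline bool can_use_rar(void)
        {
                return cpu_feature_enabled(X86_FEATURE_RAR) &&
                       static_branch_likely(&rar_ready);
        }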
> > I think we have 3 cases here:
> >
> > 1) Only the local TLB needs to be flushed.
> > In this case we can INVPCID locally, and skip any
> > potential contention on the RAR payload table.
>
> More like INVLPG (and INVPCID to the user PTI). AFAIK, Andy said
> INVLPG performs better than INVPCID for a single entry. But yes,
> this is a simple and hot scenario that should have a separate
> code-path.
I think this can probably be handled in flush_tlb_mm_range(),
so the RAR code is only called for cases (2) and (3) to
begin with.
>
> >
> > 2) Only one remote CPU needs to be flushed (no local).
> > This can use the arch_rar_send_single_ipi() thing.
> >
> > 3) Multiple CPUs need to be flushed. This could include
> > the local CPU, or be only multiple remote CPUs.
> > For this case we could just use arch_send_rar_ipi_mask(),
> > including sending a RAR request to the local CPU, which
> > should handle it concurrently with the other CPUs.
> >
> > Does that seem like a reasonable way to handle things?
>
> It is. It is just that, code-wise, I think the 2nd and 3rd cases
> are similar, and it may be better to handle the differences between
> them inside one path rather than creating two completely separate
> code-paths. That makes maintenance and reasoning simpler, I think.
>
> Consider having a look at smp_call_function_many_cond(). I think
> it handles the 2nd and 3rd cases nicely in the manner I just
> described. Admittedly, I am a bit biased…
I need to use smp_call_function_many_cond() anyway,
to prevent sending RARs to CPUs that are in lazy
TLB mode (and possibly in a power saving idle state).
IPI TLB flushing and RAR can probably both use the
same should_flush_tlb() helper function.
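A minimal sketch of that reuse, assuming should_flush_tlb() keeps its
current (int cpu, void *data) shape; rar_target_cpus() is an invented
name:

        /*
         * Sketch: build the RAR target mask with the same predicate the
         * IPI path uses, so lazy-TLB / idle CPUs are left alone.
         */
        static void rar_target_cpus(struct cpumask *dst,
                                    const struct cpumask *src,
                                    struct flush_tlb_info *info)
        {
                int cpu;

                cpumask_clear(dst);
                for_each_cpu(cpu, src) {
                        if (should_flush_tlb(cpu, info))
                                cpumask_set_cpu(cpu, dst);
                }
        }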
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 7/9] x86/mm: Introduce Remote Action Request
2025-05-06 0:37 ` [RFC PATCH 7/9] x86/mm: Introduce Remote Action Request Rik van Riel
2025-05-06 6:59 ` Nadav Amit
@ 2025-05-06 16:31 ` Ingo Molnar
1 sibling, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2025-05-06 16:31 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, linux-mm, x86, kernel-team, dave.hansen, luto,
peterz, tglx, mingo, bp, hpa, Yu-cheng Yu
* Rik van Riel <riel@surriel.com> wrote:
> +++ b/arch/x86/include/asm/rar.h
> @@ -0,0 +1,69 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_RAR_H
> +#define _ASM_X86_RAR_H
> +
> +/*
> + * RAR payload types
> + */
> +#define RAR_TYPE_INVPG 0
> +#define RAR_TYPE_INVPG_NO_CR3 1
> +#define RAR_TYPE_INVPCID 2
> +#define RAR_TYPE_INVEPT 3
> +#define RAR_TYPE_INVVPID 4
> +#define RAR_TYPE_WRMSR 5
> +
> +/*
> + * Subtypes for RAR_TYPE_INVLPG
> + */
> +#define RAR_INVPG_ADDR 0 /* address specific */
> +#define RAR_INVPG_ALL 2 /* all, include global */
> +#define RAR_INVPG_ALL_NO_GLOBAL 3 /* all, exclude global */
> +
> +/*
> + * Subtypes for RAR_TYPE_INVPCID
> + */
> +#define RAR_INVPCID_ADDR 0 /* address specific */
> +#define RAR_INVPCID_PCID 1 /* all of PCID */
> +#define RAR_INVPCID_ALL 2 /* all, include global */
> +#define RAR_INVPCID_ALL_NO_GLOBAL 3 /* all, exclude global */
> +
> +/*
> + * Page size for RAR_TYPE_INVLPG
> + */
> +#define RAR_INVLPG_PAGE_SIZE_4K 0
> +#define RAR_INVLPG_PAGE_SIZE_2M 1
> +#define RAR_INVLPG_PAGE_SIZE_1G 2
> +
> +/*
> + * Max number of pages per payload
> + */
> +#define RAR_INVLPG_MAX_PAGES 63
> +
> +typedef struct {
> + uint64_t for_sw : 8;
> + uint64_t type : 8;
> + uint64_t must_be_zero_1 : 16;
> + uint64_t subtype : 3;
> + uint64_t page_size: 2;
> + uint64_t num_pages : 6;
> + uint64_t must_be_zero_2 : 21;
> +
> + uint64_t must_be_zero_3;
> +
> + /*
> + * Starting address
> + */
> + uint64_t initiator_cr3;
> + uint64_t linear_address;
> +
> + /*
> + * Padding
> + */
> + uint64_t padding[4];
> +} rar_payload_t;
- Please don't use _t typedefs for complex types. 'struct rar_payload'
should be good enough.
- Please use u64/u32 for HW ABI definitions.
- Please align bitfield definitions vertically, for better readability:
u64 for_sw : 8;
u64 type : 8;
u64 must_be_zero_1 : 16;
u64 subtype : 3;
u64 page_size : 2;
u64 num_pages : 6;
u64 must_be_zero_2 : 21;
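Folding those three comments in, the declaration would presumably end up
looking something like this (a sketch of the requested changes, not a
tested revision):

        struct rar_payload {
                u64     for_sw          :  8;
                u64     type            :  8;
                u64     must_be_zero_1  : 16;
                u64     subtype         :  3;
                u64     page_size       :  2;
                u64     num_pages       :  6;
                u64     must_be_zero_2  : 21;

                u64     must_be_zero_3;

                /* Starting address */
                u64     initiator_cr3;
                u64     linear_address;

                /* Padding */
                u64     padding[4];
        };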
> +++ b/arch/x86/mm/rar.c
> @@ -0,0 +1,226 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * RAR Tlb shootdown
s/Tlb/TLB/
> +#include <linux/kgdb.h>
Is this really needed? There's nothing KGDB specific in here AFAICS.
> +static DEFINE_PER_CPU_ALIGNED(u64[(RAR_MAX_PAYLOADS + 8) / 8], rar_action);
> +static void set_action_entry(unsigned long idx, int target_cpu)
> +{
> + u8 *bitmap = (u8 *)per_cpu(rar_action, target_cpu);
> + u8 *bitmap = (u8 *)per_cpu(rar_action, target_cpu);
> + bitmap = (u8 *)per_cpu(rar_action, this_cpu);
So AFAICS all these ugly, forced type casts to (u8 *) are needed only
because rar_action has the wrong type: if it were a u8[], then these
lines could be:
static DEFINE_PER_CPU_ALIGNED(u8[RAR_MAX_PAYLOADS], rar_action);
...
u8 *bitmap = per_cpu(rar_action, target_cpu);
right?
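With the array typed as u8[], the helper would presumably shrink to
something like this sketch (the value written is purely illustrative;
the real encoding comes from the RAR spec):

        static DEFINE_PER_CPU_ALIGNED(u8[RAR_MAX_PAYLOADS], rar_action);

        static void set_action_entry(unsigned long idx, int target_cpu)
        {
                u8 *bitmap = per_cpu(rar_action, target_cpu);

                /* 1 == "pending" here only for illustration */
                WRITE_ONCE(bitmap[idx], 1);
        }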
Thanks,
Ingo
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2025-05-06 16:31 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-06 0:37 [RFC PATCH 0/9] Intel RAR TLB invalidation Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 1/9] x86/mm: Introduce MSR_IA32_CORE_CAPABILITIES Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 2/9] x86/mm: Introduce Remote Action Request MSRs Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 3/9] x86/mm: enable BROADCAST_TLB_FLUSH on Intel, too Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 4/9] x86/mm: Introduce X86_FEATURE_RAR Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 5/9] x86/mm: Change cpa_flush() to call flush_kernel_range() directly Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 6/9] x86/apic: Introduce Remote Action Request Operations Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 7/9] x86/mm: Introduce Remote Action Request Rik van Riel
2025-05-06 6:59 ` Nadav Amit
2025-05-06 15:16 ` Rik van Riel
2025-05-06 15:27 ` Dave Hansen
2025-05-06 15:50 ` Nadav Amit
2025-05-06 16:00 ` Rik van Riel
2025-05-06 16:31 ` Ingo Molnar
2025-05-06 0:37 ` [RFC PATCH 8/9] x86/mm: use RAR for kernel TLB flushes Rik van Riel
2025-05-06 0:37 ` [RFC PATCH 9/9] x86/mm: userspace & pageout flushing using Intel RAR Rik van Riel