* [PATCH v3 00/12] AMD broadcast TLB invalidation
@ 2024-12-30 17:53 Rik van Riel
2024-12-30 17:53 ` [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
` (13 more replies)
0 siblings, 14 replies; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 17:53 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
Subject: [PATCH v3 00/12] AMD broadcast TLB invalidation
Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
This allows the kernel to invalidate TLB entries on remote CPUs without
needing to send IPIs, without having to wait for remote CPUs to handle
those interrupts, and with less interruption to what was running on
those CPUs.
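For readers unfamiliar with the instruction pair, the basic pattern the
series builds on looks roughly like the sketch below (using the helpers
introduced in patch 5, mirroring the kernel-range flush in patch 6; the
function itself is illustrative only and not part of the series):
/*
 * Illustrative sketch: kick off a broadcast invalidation with INVLPGB,
 * then wait for it with TLBSYNC. TLBSYNC only waits for INVLPGBs issued
 * from the same CPU, so preemption stays disabled in between.
 */
static void example_broadcast_flush_kernel_page(unsigned long addr)
{
	guard(preempt)();		/* stay on this CPU until tlbsync() */

	invlpgb_flush_addr(addr, 1);	/* asynchronous broadcast invalidation */
	tlbsync();			/* wait for this CPU's INVLPGBs to finish */
}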
Because x86 PCID space is limited, and there are some very large
systems out there, broadcast TLB invalidation is only used for
processes that are active on 3 or more CPUs, with the threshold
gradually increasing as the PCID space gets used up.
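As a concrete example of how the threshold scales (patch 9 has the exact
logic): while more than three quarters of the broadcast ASID space is
still free, the threshold is 2 active CPUs, so a process qualifies once
it runs on 3 or more CPUs; the threshold rises to 4 and then 8 as the
space fills up to half, and keeps doubling from there.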
Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:
- vanilla kernel: 527k loops/second
- lru_add_drain removal: 731k loops/second
- only INVLPGB: 527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second
Profiling with only the INVLPGB changes showed that while
TLB invalidation went down from 40% of total CPU time to
only around 4%, the contention simply moved to the LRU lock.
Fixing both at the same time roughly doubles the number of
iterations per second for this case.
v3:
- Remove paravirt tlb_remove_table call (thank you Qi Zheng)
- More suggested cleanups and changelog fixes by Peter and Nadav
v2:
- Apply suggestions by Peter and Borislav (thank you!)
- Fix bug in arch_tlbbatch_flush, where we need to do both
the TLBSYNC, and flush the CPUs that are in the cpumask.
- Some updates to comments and changelogs based on questions.
^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
@ 2024-12-30 17:53 ` Rik van Riel
2024-12-30 18:41 ` Borislav Petkov
2024-12-30 17:53 ` [PATCH 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call Rik van Riel
` (12 subsequent siblings)
13 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 17:53 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Rik van Riel
Currently x86 uses CONFIG_MMU_GATHER_RCU_TABLE_FREE when using
paravirt, and not when running on bare metal.
There is no good reason to do things differently for
each setup. Make them all the same.
After this change, the synchronization between get_user_pages_fast
and page table freeing is handled by RCU, which prevents page tables
from being reused for other data while get_user_pages_fast is walking
them.
This allows us to invalidate page tables while other CPUs have
interrupts disabled.
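As context, a simplified sketch of the synchronization this relies on
(not the actual mm/mmu_gather.c or mm/gup.c code): the freeing side hands
page table pages to tlb_remove_table(), which defers the actual free until
after an RCU grace period, while the lockless walker runs with interrupts
disabled, which holds off that grace period:
/* Freeing side: the table page is queued and freed after a grace period. */
static void example_free_pte_table(struct mmu_gather *tlb, struct page *pte)
{
	tlb_remove_table(tlb, pte);	/* actual free happens via call_rcu() */
}

/* Lockless walker side: IRQs off act as the RCU read-side section. */
static void example_walk_tables_fast(void)
{
	unsigned long flags;

	local_irq_save(flags);		/* blocks the table free from completing */
	/* ... walk the page tables without taking mmap_lock ... */
	local_irq_restore(flags);
}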
Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
---
arch/x86/Kconfig | 2 +-
arch/x86/kernel/paravirt.c | 7 +------
2 files changed, 2 insertions(+), 7 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9d7bd0ae48c4..e8743f8c9fd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -274,7 +274,7 @@ config X86
select HAVE_PCI
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
- select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT
+ select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_MERGE_VMAS
select HAVE_POSIX_CPU_TIMERS_TASK_WORK
select HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec381533555..2b78a6b466ed 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -59,11 +59,6 @@ void __init native_pv_lock_init(void)
static_branch_enable(&virt_spin_lock_key);
}
-static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
- tlb_remove_page(tlb, table);
-}
-
struct static_key paravirt_steal_enabled;
struct static_key paravirt_steal_rq_enabled;
@@ -191,7 +186,7 @@ struct paravirt_patch_template pv_ops = {
.mmu.flush_tlb_kernel = native_flush_tlb_global,
.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
.mmu.flush_tlb_multi = native_flush_tlb_multi,
- .mmu.tlb_remove_table = native_tlb_remove_table,
+ .mmu.tlb_remove_table = tlb_remove_table,
.mmu.exit_mmap = paravirt_nop,
.mmu.notify_page_enc_status_changed = paravirt_nop,
--
2.47.1
^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
2024-12-30 17:53 ` [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
@ 2024-12-30 17:53 ` Rik van Riel
2024-12-31 3:18 ` Qi Zheng
2024-12-30 17:53 ` [PATCH 03/12] x86/mm: add X86_FEATURE_INVLPGB definition Rik van Riel
` (11 subsequent siblings)
13 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 17:53 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Rik van Riel
Every pv_ops.mmu.tlb_remove_table call ends up calling tlb_remove_table.
Get rid of the indirection by simply calling tlb_remove_table directly,
and not going through the paravirt function pointers.
Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
arch/x86/hyperv/mmu.c | 1 -
arch/x86/include/asm/paravirt.h | 5 -----
arch/x86/include/asm/paravirt_types.h | 2 --
arch/x86/kernel/kvm.c | 1 -
arch/x86/kernel/paravirt.c | 1 -
arch/x86/mm/pgtable.c | 16 ++++------------
arch/x86/xen/mmu_pv.c | 1 -
7 files changed, 4 insertions(+), 23 deletions(-)
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 1cc113200ff5..cbe6c71e17c1 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -240,5 +240,4 @@ void hyperv_setup_mmu_ops(void)
pr_info("Using hypercall for remote TLB flush\n");
pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
- pv_ops.mmu.tlb_remove_table = tlb_remove_table;
}
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d4eb9e1d61b8..794ba3647c6c 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -91,11 +91,6 @@ static inline void __flush_tlb_multi(const struct cpumask *cpumask,
PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
}
-static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
- PVOP_VCALL2(mmu.tlb_remove_table, tlb, table);
-}
-
static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
{
PVOP_VCALL1(mmu.exit_mmap, mm);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8d4fbe1be489..13405959e4db 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -136,8 +136,6 @@ struct pv_mmu_ops {
void (*flush_tlb_multi)(const struct cpumask *cpus,
const struct flush_tlb_info *info);
- void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
-
/* Hook for intercepting the destruction of an mm_struct. */
void (*exit_mmap)(struct mm_struct *mm);
void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7a422a6c5983..3be9b3342c67 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -838,7 +838,6 @@ static void __init kvm_guest_init(void)
#ifdef CONFIG_SMP
if (pv_tlb_flush_supported()) {
pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
- pv_ops.mmu.tlb_remove_table = tlb_remove_table;
pr_info("KVM setup pv remote TLB flush\n");
}
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 2b78a6b466ed..c019771e0123 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -186,7 +186,6 @@ struct paravirt_patch_template pv_ops = {
.mmu.flush_tlb_kernel = native_flush_tlb_global,
.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
.mmu.flush_tlb_multi = native_flush_tlb_multi,
- .mmu.tlb_remove_table = tlb_remove_table,
.mmu.exit_mmap = paravirt_nop,
.mmu.notify_page_enc_status_changed = paravirt_nop,
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 5745a354a241..3dc4af1f7868 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -18,14 +18,6 @@ EXPORT_SYMBOL(physical_mask);
#define PGTABLE_HIGHMEM 0
#endif
-#ifndef CONFIG_PARAVIRT
-static inline
-void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
- tlb_remove_page(tlb, table);
-}
-#endif
-
gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
pgtable_t pte_alloc_one(struct mm_struct *mm)
@@ -54,7 +46,7 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
{
pagetable_pte_dtor(page_ptdesc(pte));
paravirt_release_pte(page_to_pfn(pte));
- paravirt_tlb_remove_table(tlb, pte);
+ tlb_remove_table(tlb, pte);
}
#if CONFIG_PGTABLE_LEVELS > 2
@@ -70,7 +62,7 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
tlb->need_flush_all = 1;
#endif
pagetable_pmd_dtor(ptdesc);
- paravirt_tlb_remove_table(tlb, ptdesc_page(ptdesc));
+ tlb_remove_table(tlb, ptdesc_page(ptdesc));
}
#if CONFIG_PGTABLE_LEVELS > 3
@@ -80,14 +72,14 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
pagetable_pud_dtor(ptdesc);
paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
- paravirt_tlb_remove_table(tlb, virt_to_page(pud));
+ tlb_remove_table(tlb, virt_to_page(pud));
}
#if CONFIG_PGTABLE_LEVELS > 4
void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
{
paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
- paravirt_tlb_remove_table(tlb, virt_to_page(p4d));
+ tlb_remove_table(tlb, virt_to_page(p4d));
}
#endif /* CONFIG_PGTABLE_LEVELS > 4 */
#endif /* CONFIG_PGTABLE_LEVELS > 3 */
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 55a4996d0c04..041e17282af0 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2137,7 +2137,6 @@ static const typeof(pv_ops) xen_mmu_ops __initconst = {
.flush_tlb_kernel = xen_flush_tlb,
.flush_tlb_one_user = xen_flush_tlb_one_user,
.flush_tlb_multi = xen_flush_tlb_multi,
- .tlb_remove_table = tlb_remove_table,
.pgd_alloc = xen_pgd_alloc,
.pgd_free = xen_pgd_free,
--
2.47.1
^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH 03/12] x86/mm: add X86_FEATURE_INVLPGB definition.
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
2024-12-30 17:53 ` [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
2024-12-30 17:53 ` [PATCH 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call Rik van Riel
@ 2024-12-30 17:53 ` Rik van Riel
2025-01-02 12:04 ` Borislav Petkov
2024-12-30 17:53 ` [PATCH 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
` (10 subsequent siblings)
13 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 17:53 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Rik van Riel
Add the INVLPGB CPUID definition, allowing the kernel to recognize
whether the CPU supports the INVLPGB instruction.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/cpufeatures.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 17b6590748c0..b7209d6c3a5f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -338,6 +338,7 @@
#define X86_FEATURE_CLZERO (13*32+ 0) /* "clzero" CLZERO instruction */
#define X86_FEATURE_IRPERF (13*32+ 1) /* "irperf" Instructions Retired Count */
#define X86_FEATURE_XSAVEERPTR (13*32+ 2) /* "xsaveerptr" Always save/restore FP error pointers */
+#define X86_FEATURE_INVLPGB (13*32+ 3) /* "invlpgb" INVLPGB instruction */
#define X86_FEATURE_RDPRU (13*32+ 4) /* "rdpru" Read processor register at user level */
#define X86_FEATURE_WBNOINVD (13*32+ 9) /* "wbnoinvd" WBNOINVD instruction */
#define X86_FEATURE_AMD_IBPB (13*32+12) /* Indirect Branch Prediction Barrier */
--
2.47.1
^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH 04/12] x86/mm: get INVLPGB count max from CPUID
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
` (2 preceding siblings ...)
2024-12-30 17:53 ` [PATCH 03/12] x86/mm: add X86_FEATURE_INVLPGB definition Rik van Riel
@ 2024-12-30 17:53 ` Rik van Riel
2025-01-02 12:15 ` Borislav Petkov
2025-01-10 18:44 ` Tom Lendacky
2024-12-30 17:53 ` [PATCH 05/12] x86/mm: add INVLPGB support code Rik van Riel
` (9 subsequent siblings)
13 siblings, 2 replies; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 17:53 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Rik van Riel
The CPU advertises the maximum number of pages that can be shot down
with one INVLPGB instruction in the CPUID data.
Save that information for later use.
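For example, an EDX[15:0] value of 7 results in invlpgb_count_max = 8,
i.e. up to 8 pages can be invalidated with a single INVLPGB.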
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/tlbflush.h | 1 +
arch/x86/kernel/cpu/amd.c | 8 ++++++++
arch/x86/kernel/setup.c | 4 ++++
3 files changed, 13 insertions(+)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 02fc2aa06e9e..7d1468a3967b 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -182,6 +182,7 @@ static inline void cr4_init_shadow(void)
extern unsigned long mmu_cr4_features;
extern u32 *trampoline_cr4_features;
+extern u16 invlpgb_count_max;
extern void initialize_tlbstate_and_flush(void);
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 79d2e17f6582..226b8fc64bfc 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1135,6 +1135,14 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
tlb_lli_2m[ENTRIES] = eax & mask;
tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
+
+ if (c->extended_cpuid_level < 0x80000008)
+ return;
+
+ cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
+
+ /* Max number of pages INVLPGB can invalidate in one shot */
+ invlpgb_count_max = (edx & 0xffff) + 1;
}
static const struct cpu_dev amd_cpu_dev = {
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f1fea506e20f..6c4d08f8f7b1 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -138,6 +138,10 @@ __visible unsigned long mmu_cr4_features __ro_after_init;
__visible unsigned long mmu_cr4_features __ro_after_init = X86_CR4_PAE;
#endif
+#ifdef CONFIG_CPU_SUP_AMD
+u16 invlpgb_count_max __ro_after_init;
+#endif
+
#ifdef CONFIG_IMA
static phys_addr_t ima_kexec_buffer_phys;
static size_t ima_kexec_buffer_size;
--
2.47.1
^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH 05/12] x86/mm: add INVLPGB support code
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
` (3 preceding siblings ...)
2024-12-30 17:53 ` [PATCH 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
@ 2024-12-30 17:53 ` Rik van Riel
2025-01-02 12:42 ` Borislav Petkov
2025-01-03 12:44 ` Borislav Petkov
2024-12-30 17:53 ` [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
` (8 subsequent siblings)
13 siblings, 2 replies; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 17:53 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Rik van Riel
Add invlpgb.h with the helper functions and definitions needed to use
broadcast TLB invalidation on AMD EPYC 3 and newer CPUs.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/invlpgb.h | 93 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/tlbflush.h | 1 +
2 files changed, 94 insertions(+)
create mode 100644 arch/x86/include/asm/invlpgb.h
diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
new file mode 100644
index 000000000000..862775897a54
--- /dev/null
+++ b/arch/x86/include/asm/invlpgb.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_INVLPGB
+#define _ASM_X86_INVLPGB
+
+#include <vdso/bits.h>
+
+/*
+ * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
+ *
+ * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
+ * be done in a parallel fashion.
+ *
+ * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
+ * this CPU have completed.
+ */
+static inline void __invlpgb(unsigned long asid, unsigned long pcid, unsigned long addr,
+ int extra_count, bool pmd_stride, unsigned long flags)
+{
+ u64 rax = addr | flags;
+ u32 ecx = (pmd_stride << 31) | extra_count;
+ u32 edx = (pcid << 16) | asid;
+
+ asm volatile("invlpgb" : : "a" (rax), "c" (ecx), "d" (edx));
+}
+
+/*
+ * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
+ * of the three. For example:
+ * - INVLPGB_VA | INVLPGB_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - INVLPGB_PCID: invalidate all TLB entries matching the PCID
+ *
+ * The first can be used to invalidate (kernel) mappings at a particular
+ * address across all processes.
+ *
+ * The latter invalidates all TLB entries matching a PCID.
+ */
+#define INVLPGB_VA BIT(0)
+#define INVLPGB_PCID BIT(1)
+#define INVLPGB_ASID BIT(2)
+#define INVLPGB_INCLUDE_GLOBAL BIT(3)
+#define INVLPGB_FINAL_ONLY BIT(4)
+#define INVLPGB_INCLUDE_NESTED BIT(5)
+
+/* Flush all mappings for a given pcid and addr, not including globals. */
+static inline void invlpgb_flush_user(unsigned long pcid,
+ unsigned long addr)
+{
+ __invlpgb(0, pcid, addr, 0, 0, INVLPGB_PCID | INVLPGB_VA);
+}
+
+static inline void invlpgb_flush_user_nr(unsigned long pcid, unsigned long addr,
+ int nr, bool pmd_stride)
+{
+ __invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+}
+
+/* Flush all mappings for a given ASID, not including globals. */
+static inline void invlpgb_flush_single_asid(unsigned long asid)
+{
+ __invlpgb(asid, 0, 0, 0, 0, INVLPGB_ASID);
+}
+
+/* Flush all mappings for a given PCID, not including globals. */
+static inline void invlpgb_flush_single_pcid(unsigned long pcid)
+{
+ __invlpgb(0, pcid, 0, 0, 0, INVLPGB_PCID);
+}
+
+/* Flush all mappings, including globals, for all PCIDs. */
+static inline void invlpgb_flush_all(void)
+{
+ __invlpgb(0, 0, 0, 0, 0, INVLPGB_INCLUDE_GLOBAL);
+}
+
+/* Flush addr, including globals, for all PCIDs. */
+static inline void invlpgb_flush_addr(unsigned long addr, int nr)
+{
+ __invlpgb(0, 0, addr, nr - 1, 0, INVLPGB_INCLUDE_GLOBAL);
+}
+
+/* Flush all mappings for all PCIDs except globals. */
+static inline void invlpgb_flush_all_nonglobals(void)
+{
+ __invlpgb(0, 0, 0, 0, 0, 0);
+}
+
+/* Wait for INVLPGB originated by this CPU to complete. */
+static inline void tlbsync(void)
+{
+ asm volatile("tlbsync");
+}
+
+#endif /* _ASM_X86_INVLPGB */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 7d1468a3967b..20074f17fbcd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -10,6 +10,7 @@
#include <asm/cpufeature.h>
#include <asm/special_insns.h>
#include <asm/smp.h>
+#include <asm/invlpgb.h>
#include <asm/invpcid.h>
#include <asm/pti.h>
#include <asm/processor-flags.h>
--
2.47.1
^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
` (4 preceding siblings ...)
2024-12-30 17:53 ` [PATCH 05/12] x86/mm: add INVLPGB support code Rik van Riel
@ 2024-12-30 17:53 ` Rik van Riel
2025-01-03 12:39 ` Borislav Petkov
` (2 more replies)
2024-12-30 17:53 ` [PATCH 07/12] x86/tlb: use INVLPGB in flush_tlb_all Rik van Riel
` (7 subsequent siblings)
13 siblings, 3 replies; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 17:53 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Rik van Riel
Use broadcast TLB invalidation for kernel addresses when available.
This stops us from having to send IPIs for kernel TLB flushes.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/mm/tlb.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6cf881a942bb..29207dc5b807 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1077,6 +1077,32 @@ void flush_tlb_all(void)
on_each_cpu(do_flush_tlb_all, NULL, 1);
}
+static void broadcast_kernel_range_flush(unsigned long start, unsigned long end)
+{
+ unsigned long addr;
+ unsigned long maxnr = invlpgb_count_max;
+ unsigned long threshold = tlb_single_page_flush_ceiling * maxnr;
+
+ /*
+ * TLBSYNC only waits for flushes originating on the same CPU.
+ * Disabling migration allows us to wait on all flushes.
+ */
+ guard(preempt)();
+
+ if (end == TLB_FLUSH_ALL ||
+ (end - start) > threshold << PAGE_SHIFT) {
+ invlpgb_flush_all();
+ } else {
+ unsigned long nr;
+ for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
+ nr = min((end - addr) >> PAGE_SHIFT, maxnr);
+ invlpgb_flush_addr(addr, nr);
+ }
+ }
+
+ tlbsync();
+}
+
static void do_kernel_range_flush(void *info)
{
struct flush_tlb_info *f = info;
@@ -1089,6 +1115,11 @@ static void do_kernel_range_flush(void *info)
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+ broadcast_kernel_range_flush(start, end);
+ return;
+ }
+
/* Balance as user space task's flush, a bit conservative */
if (end == TLB_FLUSH_ALL ||
(end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
--
2.47.1
^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH 07/12] x86/tlb: use INVLPGB in flush_tlb_all
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
` (5 preceding siblings ...)
2024-12-30 17:53 ` [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
@ 2024-12-30 17:53 ` Rik van Riel
2025-01-06 17:29 ` Dave Hansen
2024-12-30 17:53 ` [PATCH 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
` (6 subsequent siblings)
13 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 17:53 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Rik van Riel
The flush_tlb_all() function is not used a whole lot, but we might
as well use broadcast TLB flushing there, too.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/mm/tlb.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 29207dc5b807..266d5174fc7b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1074,6 +1074,12 @@ static void do_flush_tlb_all(void *info)
void flush_tlb_all(void)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+ guard(preempt)();
+ invlpgb_flush_all();
+ tlbsync();
+ return;
+ }
on_each_cpu(do_flush_tlb_all, NULL, 1);
}
--
2.47.1
^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
` (6 preceding siblings ...)
2024-12-30 17:53 ` [PATCH 07/12] x86/tlb: use INVLPGB in flush_tlb_all Rik van Riel
@ 2024-12-30 17:53 ` Rik van Riel
2024-12-30 17:53 ` [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
` (5 subsequent siblings)
13 siblings, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 17:53 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Rik van Riel
In the page reclaim code, we only track the CPU(s) where the TLB needs
to be flushed, rather than all the individual mappings that may be getting
invalidated.
Use broadcast TLB flushing when that is available.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/mm/tlb.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 266d5174fc7b..64f1679c37e1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1310,8 +1310,16 @@ EXPORT_SYMBOL_GPL(__flush_tlb_all);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
struct flush_tlb_info *info;
+ int cpu;
+
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+ guard(preempt)();
+ invlpgb_flush_all_nonglobals();
+ tlbsync();
+ return;
+ }
- int cpu = get_cpu();
+ cpu = get_cpu();
info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
TLB_GENERATION_INVALID);
--
2.47.1
^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
` (7 preceding siblings ...)
2024-12-30 17:53 ` [PATCH 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
@ 2024-12-30 17:53 ` Rik van Riel
2024-12-30 19:24 ` Nadav Amit
` (3 more replies)
2024-12-30 17:53 ` [PATCH 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code Rik van Riel
` (4 subsequent siblings)
13 siblings, 4 replies; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 17:53 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Rik van Riel
Use broadcast TLB invalidation, with the INVLPGB instruction, on AMD EPYC 3
and newer CPUs.
In order to not exhaust PCID space, and keep TLB flushes local for single
threaded processes, we only hand out broadcast ASIDs to processes active on
3 or more CPUs, and gradually increase the threshold as broadcast ASID space
is depleted.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/mmu.h | 6 +
arch/x86/include/asm/mmu_context.h | 12 ++
arch/x86/include/asm/tlbflush.h | 17 ++
arch/x86/mm/tlb.c | 310 ++++++++++++++++++++++++++++-
4 files changed, 336 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 3b496cdcb74b..a8e8dfa5a520 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -48,6 +48,12 @@ typedef struct {
unsigned long flags;
#endif
+#ifdef CONFIG_CPU_SUP_AMD
+ struct list_head broadcast_asid_list;
+ u16 broadcast_asid;
+ bool asid_transition;
+#endif
+
#ifdef CONFIG_ADDRESS_MASKING
/* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
unsigned long lam_cr3_mask;
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 795fdd53bd0a..0dc446c427d2 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
#define enter_lazy_tlb enter_lazy_tlb
extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
+extern void destroy_context_free_broadcast_asid(struct mm_struct *mm);
+
/*
* Init a new mm. Used on mm copies, like at fork()
* and on mm's that are brand-new, like at execve().
@@ -161,6 +163,13 @@ static inline int init_new_context(struct task_struct *tsk,
mm->context.execute_only_pkey = -1;
}
#endif
+
+#ifdef CONFIG_CPU_SUP_AMD
+ INIT_LIST_HEAD(&mm->context.broadcast_asid_list);
+ mm->context.broadcast_asid = 0;
+ mm->context.asid_transition = false;
+#endif
+
mm_reset_untag_mask(mm);
init_new_context_ldt(mm);
return 0;
@@ -170,6 +179,9 @@ static inline int init_new_context(struct task_struct *tsk,
static inline void destroy_context(struct mm_struct *mm)
{
destroy_context_ldt(mm);
+#ifdef CONFIG_CPU_SUP_AMD
+ destroy_context_free_broadcast_asid(mm);
+#endif
}
extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 20074f17fbcd..5e9956af98d1 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -65,6 +65,23 @@ static inline void cr4_clear_bits(unsigned long mask)
*/
#define TLB_NR_DYN_ASIDS 6
+#ifdef CONFIG_CPU_SUP_AMD
+#define is_dyn_asid(asid) (asid) < TLB_NR_DYN_ASIDS
+#define is_broadcast_asid(asid) (asid) >= TLB_NR_DYN_ASIDS
+#define in_asid_transition(info) (info->mm && info->mm->context.asid_transition)
+#define mm_broadcast_asid(mm) (mm->context.broadcast_asid)
+#else
+#define is_dyn_asid(asid) true
+#define is_broadcast_asid(asid) false
+#define in_asid_transition(info) false
+#define mm_broadcast_asid(mm) 0
+
+inline bool needs_broadcast_asid_reload(struct mm_struct *next, u16 prev_asid)
+{
+ return false;
+}
+#endif
+
struct tlb_context {
u64 ctx_id;
u64 tlb_gen;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 64f1679c37e1..eb83391385ce 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -74,13 +74,15 @@
* use different names for each of them:
*
* ASID - [0, TLB_NR_DYN_ASIDS-1]
- * the canonical identifier for an mm
+ * the canonical identifier for an mm, dynamically allocated on each CPU
+ * [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
+ * the canonical, global identifier for an mm, identical across all CPUs
*
- * kPCID - [1, TLB_NR_DYN_ASIDS]
+ * kPCID - [1, MAX_ASID_AVAILABLE]
* the value we write into the PCID part of CR3; corresponds to the
* ASID+1, because PCID 0 is special.
*
- * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
+ * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
* for KPTI each mm has two address spaces and thus needs two
* PCID values, but we can still do with a single ASID denomination
* for each mm. Corresponds to kPCID + 2048.
@@ -225,6 +227,18 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
return;
}
+ /*
+ * TLB consistency for this ASID is maintained with INVLPGB;
+ * TLB flushes happen even while the process isn't running.
+ */
+#ifdef CONFIG_CPU_SUP_AMD
+ if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_broadcast_asid(next)) {
+ *new_asid = mm_broadcast_asid(next);
+ *need_flush = false;
+ return;
+ }
+#endif
+
if (this_cpu_read(cpu_tlbstate.invalidate_other))
clear_asid_other();
@@ -251,6 +265,245 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
*need_flush = true;
}
+#ifdef CONFIG_CPU_SUP_AMD
+/*
+ * Logic for AMD INVLPGB support.
+ */
+static DEFINE_RAW_SPINLOCK(broadcast_asid_lock);
+static u16 last_broadcast_asid = TLB_NR_DYN_ASIDS;
+static DECLARE_BITMAP(broadcast_asid_used, MAX_ASID_AVAILABLE) = { 0 };
+static LIST_HEAD(broadcast_asid_list);
+static int broadcast_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
+
+static void reset_broadcast_asid_space(void)
+{
+ mm_context_t *context;
+
+ lockdep_assert_held(&broadcast_asid_lock);
+
+ /*
+ * Flush once when we wrap around the ASID space, so we won't need
+ * to flush every time we allocate an ASID for broadcast flushing.
+ */
+ invlpgb_flush_all_nonglobals();
+ tlbsync();
+
+ /*
+ * Leave the currently used broadcast ASIDs set in the bitmap, since
+ * those cannot be reused before the next wraparound and flush..
+ */
+ bitmap_clear(broadcast_asid_used, 0, MAX_ASID_AVAILABLE);
+ list_for_each_entry(context, &broadcast_asid_list, broadcast_asid_list)
+ __set_bit(context->broadcast_asid, broadcast_asid_used);
+
+ last_broadcast_asid = TLB_NR_DYN_ASIDS;
+}
+
+static u16 get_broadcast_asid(void)
+{
+ lockdep_assert_held(&broadcast_asid_lock);
+
+ do {
+ u16 start = last_broadcast_asid;
+ u16 asid = find_next_zero_bit(broadcast_asid_used, MAX_ASID_AVAILABLE, start);
+
+ if (asid >= MAX_ASID_AVAILABLE) {
+ reset_broadcast_asid_space();
+ continue;
+ }
+
+ /* Try claiming this broadcast ASID. */
+ if (!test_and_set_bit(asid, broadcast_asid_used)) {
+ last_broadcast_asid = asid;
+ return asid;
+ }
+ } while (1);
+}
+
+/*
+ * Returns true if the mm is transitioning from a CPU-local ASID to a broadcast
+ * (INVLPGB) ASID, or the other way around.
+ */
+static bool needs_broadcast_asid_reload(struct mm_struct *next, u16 prev_asid)
+{
+ u16 broadcast_asid = mm_broadcast_asid(next);
+
+ if (broadcast_asid && prev_asid != broadcast_asid)
+ return true;
+
+ if (!broadcast_asid && is_broadcast_asid(prev_asid))
+ return true;
+
+ return false;
+}
+
+void destroy_context_free_broadcast_asid(struct mm_struct *mm)
+{
+ if (!mm->context.broadcast_asid)
+ return;
+
+ guard(raw_spinlock_irqsave)(&broadcast_asid_lock);
+ mm->context.broadcast_asid = 0;
+ list_del(&mm->context.broadcast_asid_list);
+ broadcast_asid_available++;
+}
+
+static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
+{
+ int count = 0;
+ int cpu;
+
+ if (cpumask_weight(mm_cpumask(mm)) <= threshold)
+ return false;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ /* Skip the CPUs that aren't really running this process. */
+ if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
+ continue;
+
+ if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
+ continue;
+
+ if (++count > threshold)
+ return true;
+ }
+ return false;
+}
+
+/*
+ * Assign a broadcast ASID to the current process, protecting against
+ * races between multiple threads in the process.
+ */
+static void use_broadcast_asid(struct mm_struct *mm)
+{
+ guard(raw_spinlock_irqsave)(&broadcast_asid_lock);
+
+ /* This process is already using broadcast TLB invalidation. */
+ if (mm->context.broadcast_asid)
+ return;
+
+ mm->context.broadcast_asid = get_broadcast_asid();
+ mm->context.asid_transition = true;
+ list_add(&mm->context.broadcast_asid_list, &broadcast_asid_list);
+ broadcast_asid_available--;
+}
+
+/*
+ * Figure out whether to assign a broadcast (global) ASID to a process.
+ * We vary the threshold by how empty or full broadcast ASID space is.
+ * 1/4 full: >= 4 active threads
+ * 1/2 full: >= 8 active threads
+ * 3/4 full: >= 16 active threads
+ * 7/8 full: >= 32 active threads
+ * etc
+ *
+ * This way we should never exhaust the broadcast ASID space, even on very
+ * large systems, and the processes with the largest number of active
+ * threads should be able to use broadcast TLB invalidation.
+ */
+#define HALFFULL_THRESHOLD 8
+static bool meets_broadcast_asid_threshold(struct mm_struct *mm)
+{
+ int avail = broadcast_asid_available;
+ int threshold = HALFFULL_THRESHOLD;
+
+ if (!avail)
+ return false;
+
+ if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
+ threshold = HALFFULL_THRESHOLD / 4;
+ } else if (avail > MAX_ASID_AVAILABLE / 2) {
+ threshold = HALFFULL_THRESHOLD / 2;
+ } else if (avail < MAX_ASID_AVAILABLE / 3) {
+ do {
+ avail *= 2;
+ threshold *= 2;
+ } while ((avail + threshold) < MAX_ASID_AVAILABLE / 2);
+ }
+
+ return mm_active_cpus_exceeds(mm, threshold);
+}
+
+static void count_tlb_flush(struct mm_struct *mm)
+{
+ if (!static_cpu_has(X86_FEATURE_INVLPGB))
+ return;
+
+ /* Check every once in a while. */
+ if ((current->pid & 0x1f) != (jiffies & 0x1f))
+ return;
+
+ if (meets_broadcast_asid_threshold(mm))
+ use_broadcast_asid(mm);
+}
+
+static void finish_asid_transition(struct flush_tlb_info *info)
+{
+ struct mm_struct *mm = info->mm;
+ int bc_asid = mm_broadcast_asid(mm);
+ int cpu;
+
+ if (!mm->context.asid_transition)
+ return;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
+ continue;
+
+ /*
+ * If at least one CPU is not using the broadcast ASID yet,
+ * send a TLB flush IPI. The IPI should cause stragglers
+ * to transition soon.
+ */
+ if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) != bc_asid) {
+ flush_tlb_multi(mm_cpumask(info->mm), info);
+ return;
+ }
+ }
+
+ /* All the CPUs running this process are using the broadcast ASID. */
+ mm->context.asid_transition = 0;
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+ bool pmd = info->stride_shift == PMD_SHIFT;
+ unsigned long maxnr = invlpgb_count_max;
+ unsigned long asid = info->mm->context.broadcast_asid;
+ unsigned long addr = info->start;
+ unsigned long nr;
+
+ /* Flushing multiple pages at once is not supported with 1GB pages. */
+ if (info->stride_shift > PMD_SHIFT)
+ maxnr = 1;
+
+ if (info->end == TLB_FLUSH_ALL) {
+ invlpgb_flush_single_pcid(kern_pcid(asid));
+ /* Do any CPUs supporting INVLPGB need PTI? */
+ if (static_cpu_has(X86_FEATURE_PTI))
+ invlpgb_flush_single_pcid(user_pcid(asid));
+ } else do {
+ /*
+ * Calculate how many pages can be flushed at once; if the
+ * remainder of the range is less than one page, flush one.
+ */
+ nr = min(maxnr, (info->end - addr) >> info->stride_shift);
+ nr = max(nr, 1);
+
+ invlpgb_flush_user_nr(kern_pcid(asid), addr, nr, pmd);
+ /* Do any CPUs supporting INVLPGB need PTI? */
+ if (static_cpu_has(X86_FEATURE_PTI))
+ invlpgb_flush_user_nr(user_pcid(asid), addr, nr, pmd);
+ addr += nr << info->stride_shift;
+ } while (addr < info->end);
+
+ finish_asid_transition(info);
+
+ /* Wait for the INVLPGBs kicked off above to finish. */
+ tlbsync();
+}
+#endif /* CONFIG_CPU_SUP_AMD */
+
/*
* Given an ASID, flush the corresponding user ASID. We can delay this
* until the next time we switch to it.
@@ -556,8 +809,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
*/
if (prev == next) {
/* Not actually switching mm's */
- VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
- next->context.ctx_id);
+ if (is_dyn_asid(prev_asid))
+ VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
+ next->context.ctx_id);
/*
* If this races with another thread that enables lam, 'new_lam'
@@ -573,6 +827,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
!cpumask_test_cpu(cpu, mm_cpumask(next))))
cpumask_set_cpu(cpu, mm_cpumask(next));
+ /*
+ * Check if the current mm is transitioning to a new ASID.
+ */
+ if (needs_broadcast_asid_reload(next, prev_asid)) {
+ next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+
+ choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+ goto reload_tlb;
+ }
+
+ /*
+ * Broadcast TLB invalidation keeps this PCID up to date
+ * all the time.
+ */
+ if (is_broadcast_asid(prev_asid))
+ return;
+
/*
* If the CPU is not in lazy TLB mode, we are just switching
* from one thread in a process to another thread in the same
@@ -626,8 +897,10 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
barrier();
}
+reload_tlb:
new_lam = mm_lam_cr3_mask(next);
if (need_flush) {
+ VM_BUG_ON(is_broadcast_asid(new_asid));
this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
@@ -746,7 +1019,7 @@ static void flush_tlb_func(void *info)
const struct flush_tlb_info *f = info;
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
- u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+ u64 local_tlb_gen;
bool local = smp_processor_id() == f->initiating_cpu;
unsigned long nr_invalidate = 0;
u64 mm_tlb_gen;
@@ -769,6 +1042,16 @@ static void flush_tlb_func(void *info)
if (unlikely(loaded_mm == &init_mm))
return;
+ /* Reload the ASID if transitioning into or out of a broadcast ASID */
+ if (needs_broadcast_asid_reload(loaded_mm, loaded_mm_asid)) {
+ switch_mm_irqs_off(NULL, loaded_mm, NULL);
+ loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ }
+
+ /* Broadcast ASIDs are always kept up to date with INVLPGB. */
+ if (is_broadcast_asid(loaded_mm_asid))
+ return;
+
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
loaded_mm->context.ctx_id);
@@ -786,6 +1069,8 @@ static void flush_tlb_func(void *info)
return;
}
+ local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+
if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
f->new_tlb_gen <= local_tlb_gen)) {
/*
@@ -953,7 +1238,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
* up on the new contents of what used to be page tables, while
* doing a speculative memory access.
*/
- if (info->freed_tables)
+ if (info->freed_tables || in_asid_transition(info))
on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
else
on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
@@ -1026,14 +1311,18 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
bool freed_tables)
{
struct flush_tlb_info *info;
+ unsigned long threshold = tlb_single_page_flush_ceiling;
u64 new_tlb_gen;
int cpu;
+ if (static_cpu_has(X86_FEATURE_INVLPGB))
+ threshold *= invlpgb_count_max;
+
cpu = get_cpu();
/* Should we flush just the requested range? */
if ((end == TLB_FLUSH_ALL) ||
- ((end - start) >> stride_shift) > tlb_single_page_flush_ceiling) {
+ ((end - start) >> stride_shift) > threshold) {
start = 0;
end = TLB_FLUSH_ALL;
}
@@ -1049,9 +1338,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
* a local TLB flush is needed. Optimize this use-case by calling
* flush_tlb_func_local() directly in this case.
*/
- if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+ if (IS_ENABLED(CONFIG_CPU_SUP_AMD) && mm_broadcast_asid(mm)) {
+ broadcast_tlb_flush(info);
+ } else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
info->trim_cpumask = should_trim_cpumask(mm);
flush_tlb_multi(mm_cpumask(mm), info);
+ count_tlb_flush(mm);
} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
--
2.47.1
^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
` (8 preceding siblings ...)
2024-12-30 17:53 ` [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
@ 2024-12-30 17:53 ` Rik van Riel
2024-12-30 17:53 ` [PATCH 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
` (3 subsequent siblings)
13 siblings, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 17:53 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Rik van Riel
Instead of doing a system-wide TLB flush from arch_tlbbatch_flush,
queue up asynchronous, targeted flushes from arch_tlbbatch_add_pending.
This also allows us to avoid adding the CPUs of processes using broadcast
flushing to the batch->cpumask, and will hopefully further reduce TLB
flushing from the reclaim and compaction paths.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/tlbbatch.h | 1 +
arch/x86/include/asm/tlbflush.h | 12 +++------
arch/x86/mm/tlb.c | 48 ++++++++++++++++++++++++++-------
3 files changed, 42 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/tlbbatch.h b/arch/x86/include/asm/tlbbatch.h
index 1ad56eb3e8a8..f9a17edf63ad 100644
--- a/arch/x86/include/asm/tlbbatch.h
+++ b/arch/x86/include/asm/tlbbatch.h
@@ -10,6 +10,7 @@ struct arch_tlbflush_unmap_batch {
* the PFNs being flushed..
*/
struct cpumask cpumask;
+ bool used_invlpgb;
};
#endif /* _ARCH_X86_TLBBATCH_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5e9956af98d1..17ec1b169ebd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -297,21 +297,15 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
return atomic64_inc_return(&mm->context.tlb_gen);
}
-static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
- struct mm_struct *mm,
- unsigned long uaddr)
-{
- inc_mm_tlb_gen(mm);
- cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
- mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
-}
-
static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
{
flush_tlb_mm(mm);
}
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm,
+ unsigned long uaddr);
static inline bool pte_flags_need_flush(unsigned long oldflags,
unsigned long newflags,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index eb83391385ce..454a370494d3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1602,16 +1602,7 @@ EXPORT_SYMBOL_GPL(__flush_tlb_all);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
struct flush_tlb_info *info;
- int cpu;
-
- if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
- guard(preempt)();
- invlpgb_flush_all_nonglobals();
- tlbsync();
- return;
- }
-
- cpu = get_cpu();
+ int cpu = get_cpu();
info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
TLB_GENERATION_INVALID);
@@ -1629,12 +1620,49 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
local_irq_enable();
}
+ /*
+ * If we issued (asynchronous) INVLPGB flushes, wait for them here.
+ * The cpumask above contains only CPUs that were running tasks
+ * not using broadcast TLB flushing.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && batch->used_invlpgb) {
+ tlbsync();
+ migrate_enable();
+ batch->used_invlpgb = false;
+ }
+
cpumask_clear(&batch->cpumask);
put_flush_tlb_info();
put_cpu();
}
+void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm,
+ unsigned long uaddr)
+{
+ if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_broadcast_asid(mm)) {
+ u16 asid = mm_broadcast_asid(mm);
+ /*
+ * Queue up an asynchronous invalidation. The corresponding
+ * TLBSYNC is done in arch_tlbbatch_flush(), and must be done
+ * on the same CPU.
+ */
+ if (!batch->used_invlpgb) {
+ batch->used_invlpgb = true;
+ migrate_disable();
+ }
+ invlpgb_flush_user_nr(kern_pcid(asid), uaddr, 1, 0);
+ /* Do any CPUs supporting INVLPGB need PTI? */
+ if (static_cpu_has(X86_FEATURE_PTI))
+ invlpgb_flush_user_nr(user_pcid(asid), uaddr, 1, 0);
+ } else {
+ inc_mm_tlb_gen(mm);
+ cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
+ }
+ mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
+}
+
/*
* Blindly accessing user memory from NMI context can be dangerous
* if we're in the middle of switching the current user task or
--
2.47.1
^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH 11/12] x86/mm: enable AMD translation cache extensions
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
` (9 preceding siblings ...)
2024-12-30 17:53 ` [PATCH 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code Rik van Riel
@ 2024-12-30 17:53 ` Rik van Riel
2024-12-30 18:25 ` Nadav Amit
` (2 more replies)
2024-12-30 17:53 ` [PATCH 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
` (2 subsequent siblings)
13 siblings, 3 replies; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 17:53 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Rik van Riel
With AMD TCE (translation cache extensions) only the intermediate mappings
that cover the address range zapped by INVLPG / INVLPGB get invalidated,
rather than all intermediate mappings getting zapped at every TLB invalidation.
This can help reduce the TLB miss rate, by keeping more intermediate
mappings in the cache.
From the AMD manual:
Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit
to 1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on
TLB entries. When this bit is 0, these instructions remove the target PTE
from the TLB as well as all upper-level table entries that are cached
in the TLB, whether or not they are associated with the target PTE.
When this bit is set, these instructions will remove the target PTE and
only those upper-level entries that lead to the target PTE in
the page table hierarchy, leaving unrelated upper-level entries intact.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/kernel/cpu/amd.c | 8 ++++++++
arch/x86/mm/tlb.c | 10 +++++++---
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 226b8fc64bfc..4dc42705aaca 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1143,6 +1143,14 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
/* Max number of pages INVLPGB can invalidate in one shot */
invlpgb_count_max = (edx & 0xffff) + 1;
+
+ /* If supported, enable translation cache extensions (TCE) */
+ cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
+ if (ecx & BIT(17)) {
+ u64 msr = native_read_msr(MSR_EFER);;
+ msr |= BIT(15);
+ wrmsrl(MSR_EFER, msr);
+ }
}
static const struct cpu_dev amd_cpu_dev = {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 454a370494d3..585d0731ca9f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -477,7 +477,7 @@ static void broadcast_tlb_flush(struct flush_tlb_info *info)
if (info->stride_shift > PMD_SHIFT)
maxnr = 1;
- if (info->end == TLB_FLUSH_ALL) {
+ if (info->end == TLB_FLUSH_ALL || info->freed_tables) {
invlpgb_flush_single_pcid(kern_pcid(asid));
/* Do any CPUs supporting INVLPGB need PTI? */
if (static_cpu_has(X86_FEATURE_PTI))
@@ -1110,7 +1110,7 @@ static void flush_tlb_func(void *info)
*
* The only question is whether to do a full or partial flush.
*
- * We do a partial flush if requested and two extra conditions
+ * We do a partial flush if requested and three extra conditions
* are met:
*
* 1. f->new_tlb_gen == local_tlb_gen + 1. We have an invariant that
@@ -1137,10 +1137,14 @@ static void flush_tlb_func(void *info)
* date. By doing a full flush instead, we can increase
* local_tlb_gen all the way to mm_tlb_gen and we can probably
* avoid another flush in the very near future.
+ *
+ * 3. No page tables were freed. If page tables were freed, a full
+ * flush ensures intermediate translations in the TLB get flushed.
*/
if (f->end != TLB_FLUSH_ALL &&
f->new_tlb_gen == local_tlb_gen + 1 &&
- f->new_tlb_gen == mm_tlb_gen) {
+ f->new_tlb_gen == mm_tlb_gen &&
+ !f->freed_tables) {
/* Partial flush */
unsigned long addr = f->start;
--
2.47.1
^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH 12/12] x86/mm: only invalidate final translations with INVLPGB
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
` (10 preceding siblings ...)
2024-12-30 17:53 ` [PATCH 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
@ 2024-12-30 17:53 ` Rik van Riel
2025-01-03 18:40 ` Jann Horn
2025-01-06 19:03 ` [PATCH v3 00/12] AMD broadcast TLB invalidation Dave Hansen
2025-01-06 22:49 ` Yosry Ahmed
13 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 17:53 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Rik van Riel
Use the INVLPGB_FINAL_ONLY flag when invalidating mappings with INVLPGB.
This way only leaf mappings get removed from the TLB, leaving intermediate
translations cached.
On the (rare) occasions where we free page tables we do a full flush,
ensuring intermediate translations get flushed from the TLB.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/invlpgb.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
index 862775897a54..2669ebfffe81 100644
--- a/arch/x86/include/asm/invlpgb.h
+++ b/arch/x86/include/asm/invlpgb.h
@@ -51,7 +51,7 @@ static inline void invlpgb_flush_user(unsigned long pcid,
static inline void invlpgb_flush_user_nr(unsigned long pcid, unsigned long addr,
int nr, bool pmd_stride)
{
- __invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+ __invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA | INVLPGB_FINAL_ONLY);
}
/* Flush all mappings for a given ASID, not including globals. */
--
2.47.1
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 11/12] x86/mm: enable AMD translation cache extensions
2024-12-30 17:53 ` [PATCH 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
@ 2024-12-30 18:25 ` Nadav Amit
2024-12-30 18:27 ` Rik van Riel
2025-01-03 17:49 ` Jann Horn
2025-01-10 19:34 ` Tom Lendacky
2 siblings, 1 reply; 89+ messages in thread
From: Nadav Amit @ 2024-12-30 18:25 UTC (permalink / raw)
To: Rik van Riel
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, kernel-team,
Dave Hansen, luto, peterz, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
> On 30 Dec 2024, at 19:53, Rik van Riel <riel@surriel.com> wrote:
>
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -1143,6 +1143,14 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
>
> /* Max number of pages INVLPGB can invalidate in one shot */
> invlpgb_count_max = (edx & 0xffff) + 1;
> +
> + /* If supported, enable translation cache extensions (TCE) */
> + cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
> + if (ecx & BIT(17)) {
> + u64 msr = native_read_msr(MSR_EFER);;
> + msr |= BIT(15);
> + wrmsrl(MSR_EFER, msr);
> + }
> }
Sorry for the gradual/delayed feedback.
Is it possible to avoid the BIT(x) and just add the bits to
arch/x86/include/asm/msr-index.h like EFER_FFXSR ?
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 11/12] x86/mm: enable AMD translation cache extensions
2024-12-30 18:25 ` Nadav Amit
@ 2024-12-30 18:27 ` Rik van Riel
0 siblings, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2024-12-30 18:27 UTC (permalink / raw)
To: Nadav Amit
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, kernel-team,
Dave Hansen, luto, peterz, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
On Mon, 2024-12-30 at 20:25 +0200, Nadav Amit wrote:
>
> > On 30 Dec 2024, at 19:53, Rik van Riel <riel@surriel.com> wrote:
> >
> > --- a/arch/x86/kernel/cpu/amd.c
> > +++ b/arch/x86/kernel/cpu/amd.c
> > @@ -1143,6 +1143,14 @@ static void cpu_detect_tlb_amd(struct
> > cpuinfo_x86 *c)
> >
> > /* Max number of pages INVLPGB can invalidate in one shot
> > */
> > invlpgb_count_max = (edx & 0xffff) + 1;
> > +
> > + /* If supported, enable translation cache extensions (TCE)
> > */
> > + cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
> > + if (ecx & BIT(17)) {
> > + u64 msr = native_read_msr(MSR_EFER);;
> > + msr |= BIT(15);
> > + wrmsrl(MSR_EFER, msr);
> > + }
> > }
>
> Sorry for the gradual/delayed feedback.
>
> Is it possible to avoid the BIT(x) and just add the bits to
> arch/x86/include/asm/msr-index.h like EFER_FFXSR ?
>
Of course!
I'd be happy to send that either as part of the
next version of the patch series, or as a separate
cleanup patch later. Whatever is more convenient
for the x86 maintainers.
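For illustration, such a cleanup could look roughly like the sketch below
(the EFER constant names here are assumptions for the example, not
something this series already defines):
/* arch/x86/include/asm/msr-index.h, names assumed for illustration: */
#define _EFER_TCE			15	/* Translation Cache Extension */
#define EFER_TCE			(1 << _EFER_TCE)

/* cpu_detect_tlb_amd(), using the X86_FEATURE_TCE bit (CPUID 0x80000001 ECX[17]): */
	if (cpu_has(c, X86_FEATURE_TCE))
		msr_set_bit(MSR_EFER, _EFER_TCE);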
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2024-12-30 17:53 ` [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
@ 2024-12-30 18:41 ` Borislav Petkov
2024-12-31 16:11 ` Rik van Riel
0 siblings, 1 reply; 89+ messages in thread
From: Borislav Petkov @ 2024-12-30 18:41 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, Dec 30, 2024 at 12:53:02PM -0500, Rik van Riel wrote:
> Currently x86 uses CONFIG_MMU_GATHER_TABLE_FREE when using
> paravirt, and not when running on bare metal.
>
> There is no real good reason to do things differently for
> each setup. Make them all the same.
>
> After this change, the synchronization between get_user_pages_fast
> and page table freeing is handled by RCU, which prevents page tables
> from being reused for other data while get_user_pages_fast is walking
> them.
I'd rather like to read here why this is not a problem anymore and why
48a8b97cfd80 ("x86/mm: Only use tlb_remove_table() for paravirt")
is not relevant anymore.
> This allows us to invalidate page tables while other CPUs have
^^
Please use passive voice in your commit message: no "we" or "I", etc,
and describe your changes in imperative mood.
Personal pronouns are ambiguous in text, especially with so many
parties/companies/etc developing the kernel so let's avoid them please.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2024-12-30 17:53 ` [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
@ 2024-12-30 19:24 ` Nadav Amit
2025-01-01 4:42 ` Rik van Riel
2025-01-03 17:36 ` Jann Horn
` (2 subsequent siblings)
3 siblings, 1 reply; 89+ messages in thread
From: Nadav Amit @ 2024-12-30 19:24 UTC (permalink / raw)
To: Rik van Riel
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, kernel-team,
Dave Hansen, luto, peterz, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
> On 30 Dec 2024, at 19:53, Rik van Riel <riel@surriel.com> wrote:
>
> Use broadcast TLB invalidation, using the INVLPGB instruction, on AMD EPYC 3
> and newer CPUs.
>
> In order to not exhaust PCID space, and keep TLB flushes local for single
> threaded processes, we only hand out broadcast ASIDs to processes active on
> 3 or more CPUs, and gradually increase the threshold as broadcast ASID space
> is depleted.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>
[snip]
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
> #define enter_lazy_tlb enter_lazy_tlb
> extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
>
> +extern void destroy_context_free_broadcast_asid(struct mm_struct *mm);
> +
> /*
> * Init a new mm. Used on mm copies, like at fork()
> * and on mm's that are brand-new, like at execve().
> @@ -161,6 +163,13 @@ static inline int init_new_context(struct task_struct *tsk,
> mm->context.execute_only_pkey = -1;
> }
> #endif
> +
> +#ifdef CONFIG_CPU_SUP_AMD
> + INIT_LIST_HEAD(&mm->context.broadcast_asid_list);
> + mm->context.broadcast_asid = 0;
> + mm->context.asid_transition = false;
> +#endif
> +
> mm_reset_untag_mask(mm);
> init_new_context_ldt(mm);
> return 0;
> @@ -170,6 +179,9 @@ static inline int init_new_context(struct task_struct *tsk,
> static inline void destroy_context(struct mm_struct *mm)
> {
> destroy_context_ldt(mm);
> +#ifdef CONFIG_CPU_SUP_AMD
> + destroy_context_free_broadcast_asid(mm);
> +#endif
This ifdef’ry is not great. I think it’s better to have entire functions
in ifdef than put ifdef’s within the code.
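For example (only a sketch of the idea), the header could provide an empty
stub so the call site needs no ifdef at all:

	#ifdef CONFIG_CPU_SUP_AMD
	extern void destroy_context_free_broadcast_asid(struct mm_struct *mm);
	#else
	static inline void destroy_context_free_broadcast_asid(struct mm_struct *mm) { }
	#endif

and destroy_context() can then call it unconditionally.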
> }
>
> extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 20074f17fbcd..5e9956af98d1 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -65,6 +65,23 @@ static inline void cr4_clear_bits(unsigned long mask)
> */
> #define TLB_NR_DYN_ASIDS 6
>
> +#ifdef CONFIG_CPU_SUP_AMD
> +#define is_dyn_asid(asid) (asid) < TLB_NR_DYN_ASIDS
> +#define is_broadcast_asid(asid) (asid) >= TLB_NR_DYN_ASIDS
> +#define in_asid_transition(info) (info->mm && info->mm->context.asid_transition)
> +#define mm_broadcast_asid(mm) (mm->context.broadcast_asid)
> +#else
> +#define is_dyn_asid(asid) true
> +#define is_broadcast_asid(asid) false
> +#define in_asid_transition(info) false
> +#define mm_broadcast_asid(mm) 0
I don’t see a reason why those should be #define instead of inline functions.
Arguably, those are better due to type-checking, etc. For instance is_dyn_asid()
is missing brackets to be safe.
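e.g. something like (illustrative only):

	static inline bool is_dyn_asid(u16 asid)
	{
		return asid < TLB_NR_DYN_ASIDS;
	}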
> +
> +inline bool needs_broadcast_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> + return false;
> +}
> +#endif
> +
> struct tlb_context {
> u64 ctx_id;
> u64 tlb_gen;
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 64f1679c37e1..eb83391385ce 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -74,13 +74,15 @@
> * use different names for each of them:
> *
> * ASID - [0, TLB_NR_DYN_ASIDS-1]
> - * the canonical identifier for an mm
> + * the canonical identifier for an mm, dynamically allocated on each CPU
> + * [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
> + * the canonical, global identifier for an mm, identical across all CPUs
> *
> - * kPCID - [1, TLB_NR_DYN_ASIDS]
> + * kPCID - [1, MAX_ASID_AVAILABLE]
> * the value we write into the PCID part of CR3; corresponds to the
> * ASID+1, because PCID 0 is special.
> *
> - * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
> + * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
> * for KPTI each mm has two address spaces and thus needs two
> * PCID values, but we can still do with a single ASID denomination
> * for each mm. Corresponds to kPCID + 2048.
> @@ -225,6 +227,18 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> return;
> }
>
> + /*
> + * TLB consistency for this ASID is maintained with INVLPGB;
> + * TLB flushes happen even while the process isn't running.
> + */
> +#ifdef CONFIG_CPU_SUP_AMD
I’m pretty sure IS_ENABLED() can be used here.
> + if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_broadcast_asid(next)) {
> + *new_asid = mm_broadcast_asid(next);
Isn’t there a risk of a race changing broadcast_asid between the two reads?
Maybe use READ_ONCE() also since the value is modified asynchronously?
> + *need_flush = false;
> + return;
> + }
> +#endif
> +
> if (this_cpu_read(cpu_tlbstate.invalidate_other))
> clear_asid_other();
>
> @@ -251,6 +265,245 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> *need_flush = true;
> }
>
> +#ifdef CONFIG_CPU_SUP_AMD
> +/*
> + * Logic for AMD INVLPGB support.
> + */
> +static DEFINE_RAW_SPINLOCK(broadcast_asid_lock);
> +static u16 last_broadcast_asid = TLB_NR_DYN_ASIDS;
> +static DECLARE_BITMAP(broadcast_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> +static LIST_HEAD(broadcast_asid_list);
> +static int broadcast_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
Presumably some of these data structures are shared, and some are accessed
frequently together. Wouldn’t it make more sense to put them inside a struct(s)
and make it cacheline aligned?
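Something along these lines, perhaps (purely illustrative; the name and layout
are made up, and runtime init of the lock and list is omitted):

	static struct {
		raw_spinlock_t		lock;
		u16			last_asid;
		int			available;
		struct list_head	list;
		DECLARE_BITMAP(used, MAX_ASID_AVAILABLE);
	} broadcast_asid_state ____cacheline_aligned_in_smp;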
> +
> +static void reset_broadcast_asid_space(void)
> +{
> + mm_context_t *context;
> +
> + lockdep_assert_held(&broadcast_asid_lock);
> +
> + /*
> + * Flush once when we wrap around the ASID space, so we won't need
> + * to flush every time we allocate an ASID for boradcast flushing.
> + */
> + invlpgb_flush_all_nonglobals();
> + tlbsync();
> +
> + /*
> + * Leave the currently used broadcast ASIDs set in the bitmap, since
> + * those cannot be reused before the next wraparound and flush.
> + */
> + bitmap_clear(broadcast_asid_used, 0, MAX_ASID_AVAILABLE);
> + list_for_each_entry(context, &broadcast_asid_list, broadcast_asid_list)
> + __set_bit(context->broadcast_asid, broadcast_asid_used);
> +
> + last_broadcast_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 get_broadcast_asid(void)
> +{
> + lockdep_assert_held(&broadcast_asid_lock);
> +
> + do {
> + u16 start = last_broadcast_asid;
> + u16 asid = find_next_zero_bit(broadcast_asid_used, MAX_ASID_AVAILABLE, start);
> +
> + if (asid >= MAX_ASID_AVAILABLE) {
> + reset_broadcast_asid_space();
> + continue;
> + }
> +
> + /* Try claiming this broadcast ASID. */
> + if (!test_and_set_bit(asid, broadcast_asid_used)) {
IIUC, broadcast_asid_used is always protected with broadcast_asid_lock.
So why test_and_set_bit ?
> + last_broadcast_asid = asid;
> + return asid;
> + }
> + } while (1);
> +}
> +
> +/*
> + * Returns true if the mm is transitioning from a CPU-local ASID to a broadcast
> + * (INVLPGB) ASID, or the other way around.
> + */
> +static bool needs_broadcast_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> + u16 broadcast_asid = mm_broadcast_asid(next);
> +
> + if (broadcast_asid && prev_asid != broadcast_asid)
> + return true;
> +
> + if (!broadcast_asid && is_broadcast_asid(prev_asid))
> + return true;
> +
> + return false;
> +}
> +
> +void destroy_context_free_broadcast_asid(struct mm_struct *mm)
> +{
> + if (!mm->context.broadcast_asid)
mm_broadcast_asid()?
> + return;
> +
> + guard(raw_spinlock_irqsave)(&broadcast_asid_lock);
> + mm->context.broadcast_asid = 0;
> + list_del(&mm->context.broadcast_asid_list);
> + broadcast_asid_available++;
> +}
> +
> +static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
> +{
> + int count = 0;
> + int cpu;
> +
> + if (cpumask_weight(mm_cpumask(mm)) <= threshold)
> + return false;
> +
> + for_each_cpu(cpu, mm_cpumask(mm)) {
> + /* Skip the CPUs that aren't really running this process. */
> + if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> + continue;
> +
> + if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
> + continue;
> +
> + if (++count > threshold)
> + return true;
> + }
> + return false;
> +}
> +
> +/*
> + * Assign a broadcast ASID to the current process, protecting against
> + * races between multiple threads in the process.
> + */
> +static void use_broadcast_asid(struct mm_struct *mm)
> +{
> + guard(raw_spinlock_irqsave)(&broadcast_asid_lock);
> +
> + /* This process is already using broadcast TLB invalidation. */
> + if (mm->context.broadcast_asid)
> + return;
> +
> + mm->context.broadcast_asid = get_broadcast_asid();
This is read without the lock, so do you want WRITE_ONCE() here?
> + mm->context.asid_transition = true;
And what about asid_transition? Presumably also need WRITE_ONCE(). But more
importantly than this theoretical compiler optimization, is there some assumed
ordering with setting broadcast_asid?
> + list_add(&mm->context.broadcast_asid_list, &broadcast_asid_list);
> + broadcast_asid_available--;
> +}
> +
> +/*
> + * Figure out whether to assign a broadcast (global) ASID to a process.
> + * We vary the threshold by how empty or full broadcast ASID space is.
> + * 1/4 full: >= 4 active threads
> + * 1/2 full: >= 8 active threads
> + * 3/4 full: >= 16 active threads
> + * 7/8 full: >= 32 active threads
> + * etc
> + *
> + * This way we should never exhaust the broadcast ASID space, even on very
> + * large systems, and the processes with the largest number of active
> + * threads should be able to use broadcast TLB invalidation.
> + */
> +#define HALFFULL_THRESHOLD 8
> +static bool meets_broadcast_asid_threshold(struct mm_struct *mm)
> +{
> + int avail = broadcast_asid_available;
> + int threshold = HALFFULL_THRESHOLD;
> +
> + if (!avail)
> + return false;
> +
> + if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
> + threshold = HALFFULL_THRESHOLD / 4;
> + } else if (avail > MAX_ASID_AVAILABLE / 2) {
> + threshold = HALFFULL_THRESHOLD / 2;
> + } else if (avail < MAX_ASID_AVAILABLE / 3) {
> + do {
> + avail *= 2;
> + threshold *= 2;
> + } while ((avail + threshold) < MAX_ASID_AVAILABLE / 2);
> + }
> +
> + return mm_active_cpus_exceeds(mm, threshold);
> +}
> +
> +static void count_tlb_flush(struct mm_struct *mm)
> +{
> + if (!static_cpu_has(X86_FEATURE_INVLPGB))
> + return;
> +
> + /* Check every once in a while. */
> + if ((current->pid & 0x1f) != (jiffies & 0x1f))
> + return;
> +
> + if (meets_broadcast_asid_threshold(mm))
> + use_broadcast_asid(mm);
> +}
I don’t think count_tlb_flush() is a name that reflects what this function
does.
> +
> +static void finish_asid_transition(struct flush_tlb_info *info)
> +{
> + struct mm_struct *mm = info->mm;
> + int bc_asid = mm_broadcast_asid(mm);
> + int cpu;
> +
> + if (!mm->context.asid_transition)
is_asid_transition()?
> + return;
> +
> + for_each_cpu(cpu, mm_cpumask(mm)) {
> + if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
> + continue;
> +
> + /*
> + * If at least one CPU is not using the broadcast ASID yet,
> + * send a TLB flush IPI. The IPI should cause stragglers
> + * to transition soon.
> + */
> + if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) != bc_asid) {
> + flush_tlb_multi(mm_cpumask(info->mm), info);
> + return;
> + }
> + }
> +
> + /* All the CPUs running this process are using the broadcast ASID. */
> + mm->context.asid_transition = 0;
> +}
> +
> +static void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{
> + bool pmd = info->stride_shift == PMD_SHIFT;
> + unsigned long maxnr = invlpgb_count_max;
> + unsigned long asid = info->mm->context.broadcast_asid;
> + unsigned long addr = info->start;
> + unsigned long nr;
> +
> + /* Flushing multiple pages at once is not supported with 1GB pages. */
> + if (info->stride_shift > PMD_SHIFT)
> + maxnr = 1;
> +
> + if (info->end == TLB_FLUSH_ALL) {
> + invlpgb_flush_single_pcid(kern_pcid(asid));
> + /* Do any CPUs supporting INVLPGB need PTI? */
> + if (static_cpu_has(X86_FEATURE_PTI))
> + invlpgb_flush_single_pcid(user_pcid(asid));
> + } else do {
I couldn’t find any use of “else do” in the kernel. Might it be confusing?
> + /*
> + * Calculate how many pages can be flushed at once; if the
> + * remainder of the range is less than one page, flush one.
> + */
> + nr = min(maxnr, (info->end - addr) >> info->stride_shift);
> + nr = max(nr, 1);
> +
> + invlpgb_flush_user_nr(kern_pcid(asid), addr, nr, pmd);
> + /* Do any CPUs supporting INVLPGB need PTI? */
> + if (static_cpu_has(X86_FEATURE_PTI))
> + invlpgb_flush_user_nr(user_pcid(asid), addr, nr, pmd);
> + addr += nr << info->stride_shift;
> + } while (addr < info->end);
> +
> + finish_asid_transition(info);
> +
> + /* Wait for the INVLPGBs kicked off above to finish. */
> + tlbsync();
> +}
> +#endif /* CONFIG_CPU_SUP_AMD */
> +
> /*
> * Given an ASID, flush the corresponding user ASID. We can delay this
> * until the next time we switch to it.
> @@ -556,8 +809,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> */
> if (prev == next) {
> /* Not actually switching mm's */
> - VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> - next->context.ctx_id);
> + if (is_dyn_asid(prev_asid))
> + VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> + next->context.ctx_id);
Why not to add the condition into the VM_WARN_ON and avoid the nesting?
>
> /*
> * If this races with another thread that enables lam, 'new_lam'
> @@ -573,6 +827,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> !cpumask_test_cpu(cpu, mm_cpumask(next))))
> cpumask_set_cpu(cpu, mm_cpumask(next));
>
> + /*
> + * Check if the current mm is transitioning to a new ASID.
> + */
> + if (needs_broadcast_asid_reload(next, prev_asid)) {
> + next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> +
> + choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> + goto reload_tlb;
> + }
> +
> + /*
> + * Broadcast TLB invalidation keeps this PCID up to date
> + * all the time.
> + */
> + if (is_broadcast_asid(prev_asid))
> + return;
> +
> /*
> * If the CPU is not in lazy TLB mode, we are just switching
> * from one thread in a process to another thread in the same
> @@ -626,8 +897,10 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> barrier();
> }
>
> +reload_tlb:
> new_lam = mm_lam_cr3_mask(next);
> if (need_flush) {
> + VM_BUG_ON(is_broadcast_asid(new_asid));
> this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
> this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
> load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
> @@ -746,7 +1019,7 @@ static void flush_tlb_func(void *info)
> const struct flush_tlb_info *f = info;
> struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> - u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
> + u64 local_tlb_gen;
> bool local = smp_processor_id() == f->initiating_cpu;
> unsigned long nr_invalidate = 0;
> u64 mm_tlb_gen;
> @@ -769,6 +1042,16 @@ static void flush_tlb_func(void *info)
> if (unlikely(loaded_mm == &init_mm))
> return;
>
> + /* Reload the ASID if transitioning into or out of a broadcast ASID */
> + if (needs_broadcast_asid_reload(loaded_mm, loaded_mm_asid)) {
> + switch_mm_irqs_off(NULL, loaded_mm, NULL);
> + loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> + }
> +
> + /* Broadcast ASIDs are always kept up to date with INVLPGB. */
> + if (is_broadcast_asid(loaded_mm_asid))
> + return;
> +
> VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
> loaded_mm->context.ctx_id);
>
> @@ -786,6 +1069,8 @@ static void flush_tlb_func(void *info)
> return;
> }
>
> + local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
> +
> if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
> f->new_tlb_gen <= local_tlb_gen)) {
> /*
> @@ -953,7 +1238,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
> * up on the new contents of what used to be page tables, while
> * doing a speculative memory access.
> */
> - if (info->freed_tables)
> + if (info->freed_tables || in_asid_transition(info))
> on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
> else
> on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
> @@ -1026,14 +1311,18 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> bool freed_tables)
> {
> struct flush_tlb_info *info;
> + unsigned long threshold = tlb_single_page_flush_ceiling;
> u64 new_tlb_gen;
> int cpu;
>
> + if (static_cpu_has(X86_FEATURE_INVLPGB))
> + threshold *= invlpgb_count_max;
I know it’s not really impacting performance, but it is hard for me to see
such calculations happening unnecessarily every time...
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call
2024-12-30 17:53 ` [PATCH 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call Rik van Riel
@ 2024-12-31 3:18 ` Qi Zheng
0 siblings, 0 replies; 89+ messages in thread
From: Qi Zheng @ 2024-12-31 3:18 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On 2024/12/31 01:53, Rik van Riel wrote:
> Every pv_ops.mmu.tlb_remove_table call ends up calling tlb_remove_table.
>
> Get rid of the indirection by simply calling tlb_remove_table directly,
> and not going through the paravirt function pointers.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Suggested-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> arch/x86/hyperv/mmu.c | 1 -
> arch/x86/include/asm/paravirt.h | 5 -----
> arch/x86/include/asm/paravirt_types.h | 2 --
> arch/x86/kernel/kvm.c | 1 -
> arch/x86/kernel/paravirt.c | 1 -
> arch/x86/mm/pgtable.c | 16 ++++------------
> arch/x86/xen/mmu_pv.c | 1 -
> 7 files changed, 4 insertions(+), 23 deletions(-)
This change looks good to me. Thanks!
In addition, since I also made relevant changes [1], there will be
conflict when our code is merged into the linux-next branch. Of course,
this is very easy to fix. ;)
[1].
https://lore.kernel.org/all/0287d442a973150b0e1019cc406e6322d148277a.1733305182.git.zhengqi.arch@bytedance.com/
(This patch has been merged into the mm-unstable branch, and we can
also see it in the linux-next:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=62e76fb4ff704945b5cf9411dda448fefb6618f9)
Thanks,
Qi
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2024-12-30 18:41 ` Borislav Petkov
@ 2024-12-31 16:11 ` Rik van Riel
2024-12-31 16:19 ` Borislav Petkov
2025-01-02 19:56 ` Peter Zijlstra
0 siblings, 2 replies; 89+ messages in thread
From: Rik van Riel @ 2024-12-31 16:11 UTC (permalink / raw)
To: Borislav Petkov
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, 2024-12-30 at 19:41 +0100, Borislav Petkov wrote:
> On Mon, Dec 30, 2024 at 12:53:02PM -0500, Rik van Riel wrote:
> > Currently x86 uses CONFIG_MMU_GATHER_TABLE_FREE when using
> > paravirt, and not when running on bare metal.
> >
> > There is no real good reason to do things differently for
> > each setup. Make them all the same.
> >
> > After this change, the synchronization between get_user_pages_fast
> > and page table freeing is handled by RCU, which prevents page
> > tables
> > from being reused for other data while get_user_pages_fast is
> > walking
> > them.
>
> I'd rather like to read here why this is not a problem anymore and
> why
>
> 48a8b97cfd80 ("x86/mm: Only use tlb_remove_table() for paravirt")
>
> is not relevant anymore.
That would be a question for Peter :)
>
> > This allows us to invalidate page tables while other CPUs have
> ^^
>
> Please use passive voice in your commit message: no "we" or "I", etc,
> and describe your changes in imperative mood.
Will do. Between your feedback, the suggestions
from Qi and Nadav, and the kernel test robot
flagging some build issues without CONFIG_CPU_SUP_AMD,
there's enough to warrant a v4 :)
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2024-12-31 16:11 ` Rik van Riel
@ 2024-12-31 16:19 ` Borislav Petkov
2024-12-31 16:30 ` Rik van Riel
2025-01-02 19:56 ` Peter Zijlstra
1 sibling, 1 reply; 89+ messages in thread
From: Borislav Petkov @ 2024-12-31 16:19 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Tue, Dec 31, 2024 at 11:11:51AM -0500, Rik van Riel wrote:
> Will do. Between your feedback, the suggestions
> from Qi and Nadav, and the kernel test robot
> flagging some build issues without CONFIG_CPU_SUP_AMD,
> there's enough to warrant a v4 :)
Yes, just take your time pls. I need to get through the rest, first, in the
coming days.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2024-12-31 16:19 ` Borislav Petkov
@ 2024-12-31 16:30 ` Rik van Riel
2025-01-02 11:52 ` Borislav Petkov
0 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2024-12-31 16:30 UTC (permalink / raw)
To: Borislav Petkov
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Tue, 2024-12-31 at 17:19 +0100, Borislav Petkov wrote:
> On Tue, Dec 31, 2024 at 11:11:51AM -0500, Rik van Riel wrote:
> > Will do. Between your feedback, the suggestions
> > from Qi and Nadav, and the kernel test robot
> > flagging some build issues without CONFIG_CPU_SUP_AMD,
> > there's enough to warrant a v4 :)
>
> Yes, just take your time pls. I need to get through the rest, first,
> in the
> coming days.
>
I'm incorporating the feedback I have so far,
and will test with those improvements.
I'll wait for you to finish making your way
through before coming up with a v4 :)
I do have a question about the second to last patch
"x86/mm: enable AMD translation cache extensions"
It looks like with this patch the translation cache
extensions get enabled (when I read back the MSR),
but the AMD manual suggests I may need to enable
EFER.TCE separately on every CPU?
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2024-12-30 19:24 ` Nadav Amit
@ 2025-01-01 4:42 ` Rik van Riel
2025-01-01 15:20 ` Nadav Amit
0 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2025-01-01 4:42 UTC (permalink / raw)
To: Nadav Amit
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, kernel-team,
Dave Hansen, luto, peterz, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
On Mon, 2024-12-30 at 21:24 +0200, Nadav Amit wrote:
>
> > --- a/arch/x86/include/asm/tlbflush.h
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -65,6 +65,23 @@ static inline void cr4_clear_bits(unsigned long
> > mask)
> > */
> > #define TLB_NR_DYN_ASIDS 6
> >
> > +#ifdef CONFIG_CPU_SUP_AMD
> > +#define is_dyn_asid(asid) (asid) < TLB_NR_DYN_ASIDS
> >
>
> I don’t see a reason why those should be #define instead of inline
> functions.
> Arguably, those are better due to type-checking, etc. For instance
> is_dyn_asid()
> is missing brackets to be safe.
>
I've turned these all into inline functions for
the next version.
> >
> > @@ -225,6 +227,18 @@ static void choose_new_asid(struct mm_struct
> > *next, u64 next_tlb_gen,
> > return;
> > }
> >
> > + /*
> > + * TLB consistency for this ASID is maintained with
> > INVLPGB;
> > + * TLB flushes happen even while the process isn't
> > running.
> > + */
> > +#ifdef CONFIG_CPU_SUP_AMD
> I’m pretty sure IS_ENABLED() can be used here.
>
> > + if (static_cpu_has(X86_FEATURE_INVLPGB) &&
> > mm_broadcast_asid(next)) {
> > + *new_asid = mm_broadcast_asid(next);
>
> Isn’t there a risk of a race changing broadcast_asid between the two
> reads?
>
> Maybe use READ_ONCE() also since the value is modified
> asynchronously?
>
In the current code, the broadcast_asid only ever
changes from zero to a non-zero value, which
continues to be used for the lifetime of the
process.
It will not change from one non-zero to another
non-zero value.
I have cleaned up the code a bit by having just
one single read, though.
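That cleanup could look roughly like this in choose_new_asid() (just a sketch
of the hunk, not the final code):

	u16 broadcast_asid = READ_ONCE(next->context.broadcast_asid);

	if (static_cpu_has(X86_FEATURE_INVLPGB) && broadcast_asid) {
		*new_asid = broadcast_asid;
		*need_flush = false;
		return;
	}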
> >
> > @@ -251,6 +265,245 @@ static void choose_new_asid(struct mm_struct
> > *next, u64 next_tlb_gen,
> > *need_flush = true;
> > }
> >
> > +#ifdef CONFIG_CPU_SUP_AMD
> > +/*
> > + * Logic for AMD INVLPGB support.
> > + */
> > +static DEFINE_RAW_SPINLOCK(broadcast_asid_lock);
> > +static u16 last_broadcast_asid = TLB_NR_DYN_ASIDS;
> > +static DECLARE_BITMAP(broadcast_asid_used, MAX_ASID_AVAILABLE) = {
> > 0 };
> > +static LIST_HEAD(broadcast_asid_list);
> > +static int broadcast_asid_available = MAX_ASID_AVAILABLE -
> > TLB_NR_DYN_ASIDS - 1;
>
> Presumably some of these data structures are shared, and some are
> accessed
> frequently together. Wouldn’t it make more sense to put them inside a
> struct(s)
> and make it cacheline aligned?
>
Maybe?
I'm not sure any of these are particularly
frequently used, at least not compared to things
like TLB invalidations.
> >
> > + /* Try claiming this broadcast ASID. */
> > + if (!test_and_set_bit(asid, broadcast_asid_used))
> > {
>
> IIUC, broadcast_asid_used is always protected with
> broadcast_asid_lock.
> So why test_and_set_bit ?
Thanks for spotting that one. I cleaned that up for
the next version.
>
> > +void destroy_context_free_broadcast_asid(struct mm_struct *mm)
> > +{
> > + if (!mm->context.broadcast_asid)
>
> mm_broadcast_asid()?
I've intentionally kept this one in the same "shape"
as the assignment a few lines lower. That might be
cleaner than reading it one way, and then writing it
another way.
>
> > +static void use_broadcast_asid(struct mm_struct *mm)
> > +{
> > + guard(raw_spinlock_irqsave)(&broadcast_asid_lock);
> > +
> > + /* This process is already using broadcast TLB
> > invalidation. */
> > + if (mm->context.broadcast_asid)
> > + return;
> > +
> > + mm->context.broadcast_asid = get_broadcast_asid();
>
> This is read without the lock, so do you want WRITE_ONCE() here?
>
> > + mm->context.asid_transition = true;
>
> And what about asid_transition? Presumably also need WRITE_ONCE().
> But more
> importantly than this theoretical compiler optimization, is there
> some assumed
> ordering with setting broadcast_asid?
I changed both to WRITE_ONCE.
Thinking about ordering, maybe we need to set
asid_transition before setting broadcast_asid?
That way when we see a non-zero broadcast ASID,
we are guaranteed to try finish_asid_transition()
from broadcast_tlb_flush().
Fixed for the next version.
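A sketch of that ordering in use_broadcast_asid(), assuming smp_store_release()
is an acceptable way to publish the ASID (the exact primitives may end up
looking different in v4):

	/* Readers must see asid_transition == true before a nonzero ASID. */
	WRITE_ONCE(mm->context.asid_transition, true);
	smp_store_release(&mm->context.broadcast_asid, get_broadcast_asid());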
>
> > + if (meets_broadcast_asid_threshold(mm))
> > + use_broadcast_asid(mm);
> > +}
>
> I don’t think count_tlb_flush() is a name that reflects what this
> function
> does.
>
Agreed. I've renamed it to the equally lame
consider_broadcast_asid :)
> > +
> > +static void finish_asid_transition(struct flush_tlb_info *info)
> > +{
> > + struct mm_struct *mm = info->mm;
> > + int bc_asid = mm_broadcast_asid(mm);
> > + int cpu;
> > +
> > + if (!mm->context.asid_transition)
>
> is_asid_transition()?
I'm not convinced that for a thing we access in so few
places, having a different "shape" for the read than we
have for the write would make the code easier to follow.
I'm open to arguments, though ;)
>
> > + if (info->end == TLB_FLUSH_ALL) {
> > + invlpgb_flush_single_pcid(kern_pcid(asid));
> > + /* Do any CPUs supporting INVLPGB need PTI? */
> > + if (static_cpu_has(X86_FEATURE_PTI))
> > + invlpgb_flush_single_pcid(user_pcid(asid))
> > ;
> > + } else do {
>
> I couldn’t find any use of “else do” in the kernel. Might it be
> confusing?
I could replace it with a goto finish_asid_transition.
That's what I had there before. I'm not sure it was
better, though :/
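The simplest alternative is probably to keep the do/while but give it a
regular else block, something like (sketch):

	if (info->end == TLB_FLUSH_ALL) {
		invlpgb_flush_single_pcid(kern_pcid(asid));
		/* Do any CPUs supporting INVLPGB need PTI? */
		if (static_cpu_has(X86_FEATURE_PTI))
			invlpgb_flush_single_pcid(user_pcid(asid));
	} else {
		do {
			nr = min(maxnr, (info->end - addr) >> info->stride_shift);
			nr = max(nr, 1);
			invlpgb_flush_user_nr(kern_pcid(asid), addr, nr, pmd);
			if (static_cpu_has(X86_FEATURE_PTI))
				invlpgb_flush_user_nr(user_pcid(asid), addr, nr, pmd);
			addr += nr << info->stride_shift;
		} while (addr < info->end);
	}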
>
> > + /*
> > + * Calculate how many pages can be flushed at
> > once; if the
> > + * remainder of the range is less than one page,
> > flush one.
> > + */
> > + nr = min(maxnr, (info->end - addr) >> info-
> > >stride_shift);
> > + nr = max(nr, 1);
> > +
> > + invlpgb_flush_user_nr(kern_pcid(asid), addr, nr,
> > pmd);
> > + /* Do any CPUs supporting INVLPGB need PTI? */
> > + if (static_cpu_has(X86_FEATURE_PTI))
> > + invlpgb_flush_user_nr(user_pcid(asid),
> > addr, nr, pmd);
> > + addr += nr << info->stride_shift;
> > + } while (addr < info->end);
> > +
> > + finish_asid_transition(info);
> > +
> > + /* Wait for the INVLPGBs kicked off above to finish. */
> > + tlbsync();
> > +}
> >
> > @@ -556,8 +809,9 @@ void switch_mm_irqs_off(struct mm_struct
> > *unused, struct mm_struct *next,
> > */
> > if (prev == next) {
> > /* Not actually switching mm's */
> > -
> > VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> > - next->context.ctx_id);
> > + if (is_dyn_asid(prev_asid))
> > + VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs
> > [prev_asid].ctx_id) !=
> > + next->context.ctx_id);
>
> Why not to add the condition into the VM_WARN_ON and avoid the
> nesting?
Done. Thank you.
>
> > @@ -1026,14 +1311,18 @@ void flush_tlb_mm_range(struct mm_struct
> > *mm, unsigned long start,
> > bool freed_tables)
> > {
> > struct flush_tlb_info *info;
> > + unsigned long threshold = tlb_single_page_flush_ceiling;
> > u64 new_tlb_gen;
> > int cpu;
> >
> > + if (static_cpu_has(X86_FEATURE_INVLPGB))
> > + threshold *= invlpgb_count_max;
>
> I know it’s not really impacting performance, but it is hard for me
> to see
> such calculations happening unnecessarily every time...
>
Thanks for pointing out this code.
We should only do the multiplication if this
mm is using broadcast TLB invalidation.
Fixed for the next version.
I left the multiplication for now, since that
is smaller than adding code to the debugfs
tlb_single_page_flush_ceiling handler to store
a multiplied value in another variable.
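So the flush_tlb_mm_range() hunk would presumably become something like
(sketch):

	unsigned long threshold = tlb_single_page_flush_ceiling;

	if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_broadcast_asid(mm))
		threshold *= invlpgb_count_max;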
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2025-01-01 4:42 ` Rik van Riel
@ 2025-01-01 15:20 ` Nadav Amit
2025-01-01 16:15 ` Karim Manaouil
0 siblings, 1 reply; 89+ messages in thread
From: Nadav Amit @ 2025-01-01 15:20 UTC (permalink / raw)
To: Rik van Riel
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, kernel-team,
Dave Hansen, luto, peterz, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
On 01/01/2025 6:42, Rik van Riel wrote:
> Fixed for the next version.
Thanks Rik,
Admittedly, I don't feel great about my overall last review - it mostly
focused on style and common BKMs.
I still don't quite get the entire logic. To name one thing that I don't
understand: why do we need broadcast_asid_list and the complicated games
of syncing it with broadcast_asid_used. Why wouldn't broadcast_asid_used
suffice?
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2025-01-01 15:20 ` Nadav Amit
@ 2025-01-01 16:15 ` Karim Manaouil
2025-01-01 16:23 ` Rik van Riel
0 siblings, 1 reply; 89+ messages in thread
From: Karim Manaouil @ 2025-01-01 16:15 UTC (permalink / raw)
To: Nadav Amit
Cc: Rik van Riel, the arch/x86 maintainers,
Linux Kernel Mailing List, kernel-team, Dave Hansen, luto,
peterz, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
On Wed, Jan 01, 2025 at 05:20:01PM +0200, Nadav Amit wrote:
>
>
> On 01/01/2025 6:42, Rik van Riel wrote:
> > Fixed for the next version.
>
> Thanks Rik,
>
> Admittedly, I don't feel great about my overall last review - it mostly
> focused on style and common BKMs.
>
> I still don't quite get the entire logic. To name one thing that I don't
> understand: why do we need broadcast_asid_list and the complicated games of
> syncing it with broadcast_asid_used. Why wouldn't broadcast_asid_used
> suffice?
If I understand correctly from Rik's patch, I think the list is needed to
save the flush for only when we run out of the ASID space (wrap around).
Without the list, whenever the ASID bit is cleared, you also have to flush
the TLBs.
>
>
>
--
Best,
Karim
Edinburgh University
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2025-01-01 16:15 ` Karim Manaouil
@ 2025-01-01 16:23 ` Rik van Riel
2025-01-02 0:06 ` Nadav Amit
0 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2025-01-01 16:23 UTC (permalink / raw)
To: Karim Manaouil, Nadav Amit
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, kernel-team,
Dave Hansen, luto, peterz, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
On Wed, 2025-01-01 at 16:15 +0000, Karim Manaouil wrote:
> On Wed, Jan 01, 2025 at 05:20:01PM +0200, Nadav Amit wrote:
> >
> >
> > On 01/01/2025 6:42, Rik van Riel wrote:
> > > Fixed for the next version.
> >
> > Thanks Rik,
> >
> > Admittedly, I don't feel great about my overall last review - it
> > mostly
> > focused on style and common BKMs.
> >
> > I still don't quite get the entire logic. To name one thing that I
> > don't
> > understand: why do we need broadcast_asid_list and the complicated
> > games of
> > syncing it with broadcast_asid_used. Why wouldn't
> > broadcast_asid_used
> > suffice?
>
> If I understand correctly from Rik's patch, I think the list is needed
> to
> save the flush for only when we run out of the ASID space (wrap
> around).
> Without the list, whenever the ASID bit is cleared, you also have to
> flush
> the TLBs.
That's exactly it.
The list will only contain processes that are active on
multiple CPUs, and hit a TLB flush "at the right moment"
to be assigned a broadcast ASID, which will be true for
essentially every process that does a lot of TLB flushes
and is long lived.
However, something like a kernel build has lots of
short lived, single threaded processes, for which we
should not be using broadcast TLB flushing, and which
will not need to remove themselves from the list at
exit time.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2025-01-01 16:23 ` Rik van Riel
@ 2025-01-02 0:06 ` Nadav Amit
0 siblings, 0 replies; 89+ messages in thread
From: Nadav Amit @ 2025-01-02 0:06 UTC (permalink / raw)
To: Rik van Riel, Karim Manaouil
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, kernel-team,
Dave Hansen, luto, peterz, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
On 01/01/2025 18:23, Rik van Riel wrote:
> On Wed, 2025-01-01 at 16:15 +0000, Karim Manaouil wrote:
>> On Wed, Jan 01, 2025 at 05:20:01PM +0200, Nadav Amit wrote:
>>>
>>>
>>> On 01/01/2025 6:42, Rik van Riel wrote:
>>>> Fixed for the next version.
>>>
>>> Thanks Rik,
>>>
>>> Admittedly, I don't feel great about my overall last review - it
>>> mostly
>>> focused on style and common BKMs.
>>>
>>> I still don't quite get the entire logic. To name one thing that I
>>> don't
>>> understand: why do we need broadcast_asid_list and the complicated
>>> games of
>>> syncing it with broadcast_asid_used. Why wouldn't
>>> broadcast_asid_used
>>> suffice?
>>
>> If I uderstand correctly from Rik's patch, I think the list is needed
>> to
>> save the flush for only when we run out of the ASID space (wrap
>> around).
>> Without the list, whenever the ASID bit is cleared, you also have to
>> flush
>> the TLBs.
>
> That's exactly it.
>
> The list will only contain processes that are active on
> multiple CPUs, and hit a TLB flush "at the right moment"
> to be assigned a broadcast ASID, which will be true for
> essentially every process that does a lot of TLB flushes
> and is long lived.
>
> However, something like a kernel build has lots of
> short lived, single threaded processes, for which we
> should not be using broadcast TLB flushing, and which
> will not need to remove themselves from the list at
> exit time.
Thank you Karim and Rik for the patient explanations.
But IIUC, it does seem a bit overly complicated (and when I say
complicated, I mostly refer to traversing the broadcast_asid_list and
its overhead).
It seems to me that basically you want to have two pieces of data that
can easily be kept in bitmaps.
1. (existing) broadcast_asid_used - those that are currently "busy" and
cannot be allocated.
2. (new) broadcast_asid_pending_flush - those that are logically free
but were still not flushed. This would allow removing broadcast_asid_list.
Then in reset_broadcast_asid_space(), you just do something like:
bitmap_andnot(broadcast_asid_used, broadcast_asid_used,
broadcast_asid_pending_flush, MAX_ASID_AVAILABLE);
bitmap_clear(broadcast_asid_pending_flush, 0, MAX_ASID_AVAILABLE);
Seems to me as simpler to understand, faster, and may even allow in the
future to avoid locking in fast-paths (would require different ordering
and some thought). Of course, I might be missing something...
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2024-12-31 16:30 ` Rik van Riel
@ 2025-01-02 11:52 ` Borislav Petkov
0 siblings, 0 replies; 89+ messages in thread
From: Borislav Petkov @ 2025-01-02 11:52 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Tue, Dec 31, 2024 at 11:30:48AM -0500, Rik van Riel wrote:
> I'll wait for you to finish making your way
> through before coming up with a v4 :)
Thanks!
> It looks like with this patch the translation cache
> extensions get enabled (when I read back the MSR),
> but the AMD manual suggests I may need to enable
> EFER.TCE separately on every CPU?
Yes, you do. EFER is a per-thread MSR.
cpu/amd.c is where we usually do enable features.
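So the enable belongs in init_amd(), which runs on every CPU during
identification; roughly (sketch, assuming an _EFER_TCE define gets added as
discussed elsewhere in the thread):

	if (cpu_has(c, X86_FEATURE_TCE))
		msr_set_bit(MSR_EFER, _EFER_TCE);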
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 03/12] x86/mm: add X86_FEATURE_INVLPGB definition.
2024-12-30 17:53 ` [PATCH 03/12] x86/mm: add X86_FEATURE_INVLPGB definition Rik van Riel
@ 2025-01-02 12:04 ` Borislav Petkov
2025-01-03 18:27 ` Rik van Riel
0 siblings, 1 reply; 89+ messages in thread
From: Borislav Petkov @ 2025-01-02 12:04 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, Dec 30, 2024 at 12:53:04PM -0500, Rik van Riel wrote:
> Add the INVLPGB CPUID definition, allowing the kernel to recognize
> whether the CPU supports the INVLPGB instruction.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/include/asm/cpufeatures.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 17b6590748c0..b7209d6c3a5f 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -338,6 +338,7 @@
> #define X86_FEATURE_CLZERO (13*32+ 0) /* "clzero" CLZERO instruction */
> #define X86_FEATURE_IRPERF (13*32+ 1) /* "irperf" Instructions Retired Count */
> #define X86_FEATURE_XSAVEERPTR (13*32+ 2) /* "xsaveerptr" Always save/restore FP error pointers */
> +#define X86_FEATURE_INVLPGB (13*32+ 3) /* "invlpgb" INVLPGB instruction */
^^^^^^^^^
We don't show random CPUID bits in /proc/cpuinfo anymore so you can remove
that.
> #define X86_FEATURE_RDPRU (13*32+ 4) /* "rdpru" Read processor register at user level */
> #define X86_FEATURE_WBNOINVD (13*32+ 9) /* "wbnoinvd" WBNOINVD instruction */
> #define X86_FEATURE_AMD_IBPB (13*32+12) /* Indirect Branch Prediction Barrier */
> --
Also, merge this patch with the patch which uses the flag pls.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 04/12] x86/mm: get INVLPGB count max from CPUID
2024-12-30 17:53 ` [PATCH 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
@ 2025-01-02 12:15 ` Borislav Petkov
2025-01-10 18:44 ` Tom Lendacky
1 sibling, 0 replies; 89+ messages in thread
From: Borislav Petkov @ 2025-01-02 12:15 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, Dec 30, 2024 at 12:53:05PM -0500, Rik van Riel wrote:
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index f1fea506e20f..6c4d08f8f7b1 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -138,6 +138,10 @@ __visible unsigned long mmu_cr4_features __ro_after_init;
> __visible unsigned long mmu_cr4_features __ro_after_init = X86_CR4_PAE;
> #endif
>
> +#ifdef CONFIG_CPU_SUP_AMD
> +u16 invlpgb_count_max __ro_after_init;
> +#endif
You can define this in amd.c and put the ifdeffery in the header. Something
like:
#ifdef CONFIG_CPU_SUP_AMD
extern u16 invlpgb_count_max __ro_after_init;
#else
#define invlpgb_count_max 0
#endif
or so and use it freely in the remaining places.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 05/12] x86/mm: add INVLPGB support code
2024-12-30 17:53 ` [PATCH 05/12] x86/mm: add INVLPGB support code Rik van Riel
@ 2025-01-02 12:42 ` Borislav Petkov
2025-01-06 16:50 ` Dave Hansen
2025-01-14 19:50 ` Rik van Riel
2025-01-03 12:44 ` Borislav Petkov
1 sibling, 2 replies; 89+ messages in thread
From: Borislav Petkov @ 2025-01-02 12:42 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, Dec 30, 2024 at 12:53:06PM -0500, Rik van Riel wrote:
> Add invlpgb.h with the helper functions and definitions needed to use
> broadcast TLB invalidation on AMD EPYC 3 and newer CPUs.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/include/asm/invlpgb.h | 93 +++++++++++++++++++++++++++++++++
> arch/x86/include/asm/tlbflush.h | 1 +
> 2 files changed, 94 insertions(+)
> create mode 100644 arch/x86/include/asm/invlpgb.h
>
> diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
> new file mode 100644
> index 000000000000..862775897a54
> --- /dev/null
> +++ b/arch/x86/include/asm/invlpgb.h
I don't see the point for a separate header just for that. We have
arch/x86/include/asm/tlb.h.
> @@ -0,0 +1,93 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_INVLPGB
> +#define _ASM_X86_INVLPGB
> +
> +#include <vdso/bits.h>
> +
> +/*
> + * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
> + *
> + * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
> + * be done in a parallel fashion.
> + *
> + * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
> + * this CPU have completed.
> + */
> +static inline void __invlpgb(unsigned long asid, unsigned long pcid, unsigned long addr,
> + int extra_count, bool pmd_stride, unsigned long flags)
See below. Once you prune the functions you're not using in your patchset,
this argument list will drop too. We can always extend it later, if really
needed, so let's keep it simple here.
I had slimmed it down to this internally:
static inline void invlpgb(unsigned long va, unsigned long count, unsigned long id)
{
	/* INVLPGB; supported in binutils >= 2.36. */
	asm volatile(".byte 0x0f, 0x01, 0xfe"
		     : : "a" (va), "c" (count), "d" (id)
		     : "memory");
}
I had the memory clobber too but now that I think of it, it probably isn't
needed because even if the compiler reorders INVLPGB, it is weakly-ordered
anyway.
TLBSYNC should probably have a memory clobber tho, to prevent the compiler
from doing funky stuff...
> +{
> + u64 rax = addr | flags;
> + u32 ecx = (pmd_stride << 31) | extra_count;
> + u32 edx = (pcid << 16) | asid;
> +
> + asm volatile("invlpgb" : : "a" (rax), "c" (ecx), "d" (edx));
No, you do:
/* INVLPGB; supported in binutils >= 2.36. */
asm volatile(".byte 0x0f, 0x01, 0xfe"
...
> +/*
> + * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
> + * of the three. For example:
> + * - INVLPGB_VA | INVLPGB_INCLUDE_GLOBAL: invalidate all TLB entries at the address
> + * - INVLPGB_PCID: invalidate all TLB entries matching the PCID
^^^^^^^^^^^^^^^^^^^^^^
Whitespace damage here. Needs tabs.
> + *
> + * The first can be used to invalidate (kernel) mappings at a particular
> + * address across all processes.
> + *
> + * The latter invalidates all TLB entries matching a PCID.
> + */
> +#define INVLPGB_VA BIT(0)
> +#define INVLPGB_PCID BIT(1)
> +#define INVLPGB_ASID BIT(2)
> +#define INVLPGB_INCLUDE_GLOBAL BIT(3)
> +#define INVLPGB_FINAL_ONLY BIT(4)
> +#define INVLPGB_INCLUDE_NESTED BIT(5)
Please add only the defines which are actually being used. Ditto for the
functions.
> +/* Wait for INVLPGB originated by this CPU to complete. */
> +static inline void tlbsync(void)
> +{
> + asm volatile("tlbsync");
> +}
/* TLBSYNC; supported in binutils >= 2.36. */
asm volatile(".byte 0x0f, 0x01, 0xff" ::: "memory");
> +
> +#endif /* _ASM_X86_INVLPGB */
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 7d1468a3967b..20074f17fbcd 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -10,6 +10,7 @@
> #include <asm/cpufeature.h>
> #include <asm/special_insns.h>
> #include <asm/smp.h>
> +#include <asm/invlpgb.h>
> #include <asm/invpcid.h>
> #include <asm/pti.h>
> #include <asm/processor-flags.h>
> --
> 2.47.1
>
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2024-12-31 16:11 ` Rik van Riel
2024-12-31 16:19 ` Borislav Petkov
@ 2025-01-02 19:56 ` Peter Zijlstra
2025-01-03 12:18 ` Borislav Petkov
1 sibling, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2025-01-02 19:56 UTC (permalink / raw)
To: Rik van Riel
Cc: Borislav Petkov, x86, linux-kernel, kernel-team, dave.hansen,
luto, tglx, mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Tue, Dec 31, 2024 at 11:11:51AM -0500, Rik van Riel wrote:
> On Mon, 2024-12-30 at 19:41 +0100, Borislav Petkov wrote:
> > On Mon, Dec 30, 2024 at 12:53:02PM -0500, Rik van Riel wrote:
> > > Currently x86 uses CONFIG_MMU_GATHER_TABLE_FREE when using
> > > paravirt, and not when running on bare metal.
> > >
> > > There is no real good reason to do things differently for
> > > each setup. Make them all the same.
> > >
> > > After this change, the synchronization between get_user_pages_fast
> > > and page table freeing is handled by RCU, which prevents page
> > > tables
> > > from being reused for other data while get_user_pages_fast is
> > > walking
> > > them.
> >
> > I'd rather like to read here why this is not a problem anymore and
> > why
> >
> > 48a8b97cfd80 ("x86/mm: Only use tlb_remove_table() for paravirt")
> >
> > is not relevant anymore.
>
> That would be a question for Peter :)
Well, I've already answered why we need this in the previous thread but
it wasn't preserved :-(
Currently GUP-fast serializes against table-free by disabling
interrupts, which in turn holds off the TLBI-IPIs.
Since you're going to be doing broadcast TLBI -- without IPIs, this no
longer works and we need other means of serializing GUP-fast vs
table-free.
MMU_GATHER_RCU_TABLE_FREE is that means.
So where previously paravirt implementations of tlb_flush_multi might
require this (because of virt optimizations that avoided the TLBI-IPI),
this broadcast invalidate now very much requires this for native.
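For reference, a rough, illustrative-only sketch of the old serialization (not
the actual mm/gup.c code):

	/*
	 * GUP-fast walks the page tables with IRQs disabled.  An IPI-based
	 * TLB flush -- and the table free queued behind it -- cannot
	 * complete until the walker re-enables IRQs, so the tables cannot
	 * disappear under the walk.
	 */
	local_irq_save(flags);
	/* walk pgd/p4d/pud/pmd/pte, take references on the target pages */
	local_irq_restore(flags);

With INVLPGB there is no IPI to hold off, so the RCU grace period from
MMU_GATHER_RCU_TABLE_FREE has to provide that guarantee instead.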
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2025-01-02 19:56 ` Peter Zijlstra
@ 2025-01-03 12:18 ` Borislav Petkov
2025-01-04 16:27 ` Peter Zijlstra
2025-01-06 15:47 ` Rik van Riel
0 siblings, 2 replies; 89+ messages in thread
From: Borislav Petkov @ 2025-01-03 12:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rik van Riel, x86, linux-kernel, kernel-team, dave.hansen, luto,
tglx, mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Thu, Jan 02, 2025 at 08:56:09PM +0100, Peter Zijlstra wrote:
> Well, I've already answered why we need this in the previous thread but
> it wasn't preserved :-(
... and this needs to be part of the commit message. And there's a similar
comment over tlb_remove_table_smp_sync() in mm/mmu_gather.c which pretty much
explains the same thing.
> Currently GUP-fast serializes against table-free by disabling
> interrupts, which in turn holds of the TLBI-IPIs.
>
> Since you're going to be doing broadcast TLBI -- without IPIs, this no
> longer works and we need other means of serializing GUP-fast vs
> table-free.
>
> MMU_GATHER_RCU_TABLE_FREE is that means.
>
> So where previously paravirt implementations of tlb_flush_multi might
> require this (because of virt optimizations that avoided the TLBI-IPI),
> this broadcast invalidate now very much requires this for native.
Right, so this begs the question: we probably should do this dynamically only
on TLBI systems - not on everything native - due to the overhead of this
batching - I'm looking at tlb_remove_table().
Or should we make this unconditional on all native because we don't care about
the overhead and would like to have simpler code. I mean, disabling IRQs vs
batching and allocating memory...?
Meh.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2024-12-30 17:53 ` [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
@ 2025-01-03 12:39 ` Borislav Petkov
2025-01-06 17:21 ` Dave Hansen
2025-01-10 18:53 ` Tom Lendacky
2 siblings, 0 replies; 89+ messages in thread
From: Borislav Petkov @ 2025-01-03 12:39 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, Dec 30, 2024 at 12:53:07PM -0500, Rik van Riel wrote:
> Use broadcast TLB invalidation for kernel addresses when available.
>
> This stops us from having to send IPIs for kernel TLB flushes.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/mm/tlb.c | 31 +++++++++++++++++++++++++++++++
> 1 file changed, 31 insertions(+)
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 6cf881a942bb..29207dc5b807 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1077,6 +1077,32 @@ void flush_tlb_all(void)
> on_each_cpu(do_flush_tlb_all, NULL, 1);
> }
>
> +static void broadcast_kernel_range_flush(unsigned long start, unsigned long end)
> +{
> + unsigned long addr;
> + unsigned long maxnr = invlpgb_count_max;
> + unsigned long threshold = tlb_single_page_flush_ceiling * maxnr;
The tip-tree preferred ordering of variable declarations at the
beginning of a function is reverse fir tree order::
struct long_struct_name *descriptive_name;
unsigned long foo, bar;
unsigned int tmp;
int ret;
The above is faster to parse than the reverse ordering::
int ret;
unsigned int tmp;
unsigned long foo, bar;
struct long_struct_name *descriptive_name;
And even more so than random ordering::
unsigned long foo, bar;
int ret;
struct long_struct_name *descriptive_name;
unsigned int tmp;
And you can get rid of maxnr and get the reversed xmas tree order:
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 29207dc5b807..8a85acd9483d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1079,9 +1079,8 @@ void flush_tlb_all(void)
static void broadcast_kernel_range_flush(unsigned long start, unsigned long end)
{
+ unsigned long threshold = tlb_single_page_flush_ceiling * invlpgb_count_max;
unsigned long addr;
- unsigned long maxnr = invlpgb_count_max;
- unsigned long threshold = tlb_single_page_flush_ceiling * maxnr;
/*
* TLBSYNC only waits for flushes originating on the same CPU.
@@ -1095,7 +1094,7 @@ static void broadcast_kernel_range_flush(unsigned long start, unsigned long end)
} else {
unsigned long nr;
for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
- nr = min((end - addr) >> PAGE_SHIFT, maxnr);
+ nr = min((end - addr) >> PAGE_SHIFT, invlpgb_count_max);
invlpgb_flush_addr(addr, nr);
}
}
> + /*
> + * TLBSYNC only waits for flushes originating on the same CPU.
> + * Disabling migration allows us to wait on all flushes.
> + */
> + guard(preempt)();
Migration?
Why not migrate_disable() then?
Although there's a big, thorny comment in include/linux/preempt.h about its
influence on sched.
> +
> + if (end == TLB_FLUSH_ALL ||
> + (end - start) > threshold << PAGE_SHIFT) {
> + invlpgb_flush_all();
> + } else {
> + unsigned long nr;
> + for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
> + nr = min((end - addr) >> PAGE_SHIFT, maxnr);
> + invlpgb_flush_addr(addr, nr);
> + }
> + }
> +
> + tlbsync();
> +}
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 05/12] x86/mm: add INVLPGB support code
2024-12-30 17:53 ` [PATCH 05/12] x86/mm: add INVLPGB support code Rik van Riel
2025-01-02 12:42 ` Borislav Petkov
@ 2025-01-03 12:44 ` Borislav Petkov
1 sibling, 0 replies; 89+ messages in thread
From: Borislav Petkov @ 2025-01-03 12:44 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, Dec 30, 2024 at 12:53:06PM -0500, Rik van Riel wrote:
> +static inline void __invlpgb(unsigned long asid, unsigned long pcid, unsigned long addr,
> + int extra_count, bool pmd_stride, unsigned long flags)
That pmd_stride thing - the callers should supply a bool: true/false instead
of 0/1 for more clarity at the call sites.
> +{
> + u64 rax = addr | flags;
> + u32 ecx = (pmd_stride << 31) | extra_count;
You need to handle the case where extra_count becomes negative because callers
supply nr=0 and you do "nr - 1" below and then you'll end up flushing
0b1111_1111_1111_1111 TLB entries.
"ECX[15:0] contains a count of the number of sequential pages to invalidate in
addition to the original virtual address, starting from the virtual address
specified in rAX."
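i.e. something like this (illustrative only) before ECX is assembled:

	if (WARN_ON_ONCE(extra_count < 0))
		extra_count = 0;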
> + u32 edx = (pcid << 16) | asid;
> +
> + asm volatile("invlpgb" : : "a" (rax), "c" (ecx), "d" (edx));
> +}
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2024-12-30 17:53 ` [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
2024-12-30 19:24 ` Nadav Amit
@ 2025-01-03 17:36 ` Jann Horn
2025-01-04 2:55 ` Rik van Riel
2025-01-06 14:52 ` Nadav Amit
2025-01-06 18:40 ` Dave Hansen
3 siblings, 1 reply; 89+ messages in thread
From: Jann Horn @ 2025-01-03 17:36 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, Dec 30, 2024 at 6:53 PM Rik van Riel <riel@surriel.com> wrote:
> Use broadcast TLB invalidation, using the INVLPGB instruction, on AMD EPYC 3
> and newer CPUs.
>
> In order to not exhaust PCID space, and keep TLB flushes local for single
> threaded processes, we only hand out broadcast ASIDs to processes active on
> 3 or more CPUs, and gradually increase the threshold as broadcast ASID space
> is depleted.
[...]
> ---
> arch/x86/include/asm/mmu.h | 6 +
> arch/x86/include/asm/mmu_context.h | 12 ++
> arch/x86/include/asm/tlbflush.h | 17 ++
> arch/x86/mm/tlb.c | 310 ++++++++++++++++++++++++++++-
> 4 files changed, 336 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> index 3b496cdcb74b..a8e8dfa5a520 100644
> --- a/arch/x86/include/asm/mmu.h
> +++ b/arch/x86/include/asm/mmu.h
> @@ -48,6 +48,12 @@ typedef struct {
> unsigned long flags;
> #endif
>
> +#ifdef CONFIG_CPU_SUP_AMD
> + struct list_head broadcast_asid_list;
> + u16 broadcast_asid;
> + bool asid_transition;
Please add a comment on the semantics of the "asid_transition" field
here after addressing the comments below.
> +#endif
> +
> #ifdef CONFIG_ADDRESS_MASKING
> /* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
> unsigned long lam_cr3_mask;
[...]
> +#ifdef CONFIG_CPU_SUP_AMD
> +/*
> + * Logic for AMD INVLPGB support.
> + */
> +static DEFINE_RAW_SPINLOCK(broadcast_asid_lock);
> +static u16 last_broadcast_asid = TLB_NR_DYN_ASIDS;
I wonder if this should be set to MAX_ASID_AVAILABLE or such to ensure
that we do a flush before we start using the broadcast ASID space the
first time... Or is there something else that already guarantees that
all ASIDs of the TLB are flushed during kernel boot?
> +static DECLARE_BITMAP(broadcast_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> +static LIST_HEAD(broadcast_asid_list);
> +static int broadcast_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +static void reset_broadcast_asid_space(void)
> +{
> + mm_context_t *context;
> +
> + lockdep_assert_held(&broadcast_asid_lock);
> +
> + /*
> + * Flush once when we wrap around the ASID space, so we won't need
> + * to flush every time we allocate an ASID for boradcast flushing.
nit: typoed "broadcast"
> + */
> + invlpgb_flush_all_nonglobals();
> + tlbsync();
> +
> + /*
> + * Leave the currently used broadcast ASIDs set in the bitmap, since
> + * those cannot be reused before the next wraparound and flush..
> + */
> + bitmap_clear(broadcast_asid_used, 0, MAX_ASID_AVAILABLE);
> + list_for_each_entry(context, &broadcast_asid_list, broadcast_asid_list)
> + __set_bit(context->broadcast_asid, broadcast_asid_used);
> +
> + last_broadcast_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 get_broadcast_asid(void)
> +{
> + lockdep_assert_held(&broadcast_asid_lock);
> +
> + do {
> + u16 start = last_broadcast_asid;
> + u16 asid = find_next_zero_bit(broadcast_asid_used, MAX_ASID_AVAILABLE, start);
> +
> + if (asid >= MAX_ASID_AVAILABLE) {
> + reset_broadcast_asid_space();
> + continue;
Can this loop endlessly without making forward progress if we have a
few thousand processes on the system that are multi-threaded (or used
to be multi-threaded) and race the wrong way?
meets_broadcast_asid_threshold() checks if we have free IDs remaining,
but that check happens before broadcast_asid_lock is held, so we could
theoretically race such that no free IDs are available, right?
> + }
> +
> + /* Try claiming this broadcast ASID. */
> + if (!test_and_set_bit(asid, broadcast_asid_used)) {
> + last_broadcast_asid = asid;
> + return asid;
> + }
> + } while (1);
> +}
[...]
> +/*
> + * Assign a broadcast ASID to the current process, protecting against
> + * races between multiple threads in the process.
> + */
> +static void use_broadcast_asid(struct mm_struct *mm)
> +{
> + guard(raw_spinlock_irqsave)(&broadcast_asid_lock);
> +
> + /* This process is already using broadcast TLB invalidation. */
> + if (mm->context.broadcast_asid)
> + return;
> +
> + mm->context.broadcast_asid = get_broadcast_asid();
> + mm->context.asid_transition = true;
This looks buggy to me: If we first set mm->context.broadcast_asid and
then later set mm->context.asid_transition, then a
flush_tlb_mm_range() that happens in between will observe
mm_broadcast_asid() being true (meaning broadcast invalidation should
be used) while mm->context.asid_transition is false (meaning broadcast
invalidation alone is sufficient); but actually we haven't even
started transitioning CPUs over to the new ASID yet, so I think the
flush does nothing?
Maybe change how mm->context.asid_transition works such that it is
immediately set on mm creation and cleared when the transition is
done, so that you don't have to touch it here?
Also, please use at least WRITE_ONCE() for writes here, and add
comments documenting ordering requirements.
> + list_add(&mm->context.broadcast_asid_list, &broadcast_asid_list);
> + broadcast_asid_available--;
> +}
[...]
> +static void finish_asid_transition(struct flush_tlb_info *info)
> +{
> + struct mm_struct *mm = info->mm;
> + int bc_asid = mm_broadcast_asid(mm);
> + int cpu;
> +
> + if (!mm->context.asid_transition)
AFAIU this can be accessed concurrently - please use at least
READ_ONCE(). (I think in the current version of the patch, this needs
to be ordered against the preceding mm_broadcast_asid() read, but
that's implicit on x86, so I guess writing a barrier here would be
superfluous.)
> + return;
> +
> + for_each_cpu(cpu, mm_cpumask(mm)) {
> + if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
> + continue;
switch_mm_irqs_off() picks an ASID and writes CR3 before writing loaded_mm:
"/* Make sure we write CR3 before loaded_mm. */"
Can we race with a concurrent switch_mm_irqs_off() on the other CPU
such that the other CPU has already switched CR3 to our MM using the
old ASID, but has not yet written loaded_mm, such that we skip it
here? And then we'll think we finished the ASID transition, and the
next time we do a flush, we'll wrongly omit the flush for that other
CPU even though it's still using the old ASID?
> +
> + /*
> + * If at least one CPU is not using the broadcast ASID yet,
> + * send a TLB flush IPI. The IPI should cause stragglers
> + * to transition soon.
> + */
> + if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) != bc_asid) {
READ_ONCE()? Also, I think this needs a comment explaining that this
can race with concurrent MM switches such that we wrongly think that
there's a straggler (because we're not reading the loaded_mm and the
loaded_mm_asid as one atomic combination).
> + flush_tlb_multi(mm_cpumask(info->mm), info);
> + return;
> + }
> + }
> +
> + /* All the CPUs running this process are using the broadcast ASID. */
> + mm->context.asid_transition = 0;
WRITE_ONCE()?
Also: This is a bool, please use "false".
> +}
> +
> +static void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{
> + bool pmd = info->stride_shift == PMD_SHIFT;
> + unsigned long maxnr = invlpgb_count_max;
> + unsigned long asid = info->mm->context.broadcast_asid;
> + unsigned long addr = info->start;
> + unsigned long nr;
> +
> + /* Flushing multiple pages at once is not supported with 1GB pages. */
> + if (info->stride_shift > PMD_SHIFT)
> + maxnr = 1;
> +
> + if (info->end == TLB_FLUSH_ALL) {
> + invlpgb_flush_single_pcid(kern_pcid(asid));
What orders this flush with the preceding page table update? Does the
instruction implicitly get ordered after preceding memory writes, or
do we get that ordering from inc_mm_tlb_gen() or something like that?
> + /* Do any CPUs supporting INVLPGB need PTI? */
> + if (static_cpu_has(X86_FEATURE_PTI))
> + invlpgb_flush_single_pcid(user_pcid(asid));
> + } else do {
> + /*
> + * Calculate how many pages can be flushed at once; if the
> + * remainder of the range is less than one page, flush one.
> + */
> + nr = min(maxnr, (info->end - addr) >> info->stride_shift);
> + nr = max(nr, 1);
> +
> + invlpgb_flush_user_nr(kern_pcid(asid), addr, nr, pmd);
> + /* Do any CPUs supporting INVLPGB need PTI? */
> + if (static_cpu_has(X86_FEATURE_PTI))
> + invlpgb_flush_user_nr(user_pcid(asid), addr, nr, pmd);
> + addr += nr << info->stride_shift;
> + } while (addr < info->end);
> +
> + finish_asid_transition(info);
> +
> + /* Wait for the INVLPGBs kicked off above to finish. */
> + tlbsync();
> +}
> +#endif /* CONFIG_CPU_SUP_AMD */
[...]
> @@ -769,6 +1042,16 @@ static void flush_tlb_func(void *info)
> if (unlikely(loaded_mm == &init_mm))
> return;
>
> + /* Reload the ASID if transitioning into or out of a broadcast ASID */
> + if (needs_broadcast_asid_reload(loaded_mm, loaded_mm_asid)) {
> + switch_mm_irqs_off(NULL, loaded_mm, NULL);
> + loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> + }
> +
> + /* Broadcast ASIDs are always kept up to date with INVLPGB. */
> + if (is_broadcast_asid(loaded_mm_asid))
> + return;
This relies on the mm_broadcast_asid() read in flush_tlb_mm_range()
being ordered after the page table update, correct? And we get that
required ordering from the inc_mm_tlb_gen(), which implies a full
barrier? It might be nice if there were some more comments on this.
> VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
> loaded_mm->context.ctx_id);
>
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 11/12] x86/mm: enable AMD translation cache extensions
2024-12-30 17:53 ` [PATCH 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
2024-12-30 18:25 ` Nadav Amit
@ 2025-01-03 17:49 ` Jann Horn
2025-01-04 3:08 ` Rik van Riel
2025-01-10 19:34 ` Tom Lendacky
2 siblings, 1 reply; 89+ messages in thread
From: Jann Horn @ 2025-01-03 17:49 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, Dec 30, 2024 at 6:53 PM Rik van Riel <riel@surriel.com> wrote:
> With AMD TCE (translation cache extensions) only the intermediate mappings
> that cover the address range zapped by INVLPG / INVLPGB get invalidated,
> rather than all intermediate mappings getting zapped at every TLB invalidation.
>
> This can help reduce the TLB miss rate, by keeping more intermediate
> mappings in the cache.
>
> From the AMD manual:
>
> Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit
> to 1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on
> TLB entries. When this bit is 0, these instructions remove the target PTE
> from the TLB as well as all upper-level table entries that are cached
> in the TLB, whether or not they are associated with the target PTE.
> When this bit is set, these instructions will remove the target PTE and
> only those upper-level entries that lead to the target PTE in
> the page table hierarchy, leaving unrelated upper-level entries intact.
How does this patch interact with KVM SVM guests?
In particular, will this patch cause TLB flushes performed by guest
kernels to behave differently?
[...]
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 454a370494d3..585d0731ca9f 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -477,7 +477,7 @@ static void broadcast_tlb_flush(struct flush_tlb_info *info)
> if (info->stride_shift > PMD_SHIFT)
> maxnr = 1;
>
> - if (info->end == TLB_FLUSH_ALL) {
> + if (info->end == TLB_FLUSH_ALL || info->freed_tables) {
> invlpgb_flush_single_pcid(kern_pcid(asid));
> /* Do any CPUs supporting INVLPGB need PTI? */
> if (static_cpu_has(X86_FEATURE_PTI))
> @@ -1110,7 +1110,7 @@ static void flush_tlb_func(void *info)
> *
> * The only question is whether to do a full or partial flush.
> *
> - * We do a partial flush if requested and two extra conditions
> + * We do a partial flush if requested and three extra conditions
> * are met:
> *
> * 1. f->new_tlb_gen == local_tlb_gen + 1. We have an invariant that
> @@ -1137,10 +1137,14 @@ static void flush_tlb_func(void *info)
> * date. By doing a full flush instead, we can increase
> * local_tlb_gen all the way to mm_tlb_gen and we can probably
> * avoid another flush in the very near future.
> + *
> + * 3. No page tables were freed. If page tables were freed, a full
> + * flush ensures intermediate translations in the TLB get flushed.
> */
Why is this necessary - do we ever issue TLB flushes that are intended
to zap upper-level entries which are not covered by the specified
address range?
When, for example, free_pmd_range() gets rid of a page table, it calls
pmd_free_tlb(), which sets tlb->freed_tables and does
tlb_flush_pud_range(tlb, address, PAGE_SIZE).
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 03/12] x86/mm: add X86_FEATURE_INVLPGB definition.
2025-01-02 12:04 ` Borislav Petkov
@ 2025-01-03 18:27 ` Rik van Riel
2025-01-03 21:07 ` Borislav Petkov
0 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2025-01-03 18:27 UTC (permalink / raw)
To: Borislav Petkov
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Thu, 2025-01-02 at 13:04 +0100, Borislav Petkov wrote:
> On Mon, Dec 30, 2024 at 12:53:04PM -0500, Rik van Riel wrote:
> >
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -338,6 +338,7 @@
> > #define X86_FEATURE_CLZERO (13*32+ 0) /* "clzero"
> > CLZERO instruction */
> > #define X86_FEATURE_IRPERF (13*32+ 1) /* "irperf"
> > Instructions Retired Count */
> > #define X86_FEATURE_XSAVEERPTR (13*32+ 2) /* "xsaveerptr"
> > Always save/restore FP error pointers */
> > +#define X86_FEATURE_INVLPGB (13*32+ 3) /* "invlpgb"
> > INVLPGB instruction */
> ^^^^^^^^^
>
> We don't show random CPUID bits in /proc/cpuinfo anymore so you can
> remove
> that.
I still see dozens of flags in /proc/cpuinfo here on
6.11. When did that change?
>
> > #define X86_FEATURE_RDPRU (13*32+ 4) /* "rdpru" Read
> > processor register at user level */
> > #define X86_FEATURE_WBNOINVD (13*32+ 9) /* "wbnoinvd"
> > WBNOINVD instruction */
> > #define X86_FEATURE_AMD_IBPB (13*32+12) /* Indirect
> > Branch Prediction Barrier */
> > --
>
> Also, merge this patch with the patch which uses the flag pls.
The first real use is 3 patches further down into the
series.
I'm not convinced things will be more readable if 4
patches get squashed down into one.
Are you sure you want that?
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 12/12] x86/mm: only invalidate final translations with INVLPGB
2024-12-30 17:53 ` [PATCH 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
@ 2025-01-03 18:40 ` Jann Horn
2025-01-12 2:39 ` Rik van Riel
0 siblings, 1 reply; 89+ messages in thread
From: Jann Horn @ 2025-01-03 18:40 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, Dec 30, 2024 at 6:53 PM Rik van Riel <riel@surriel.com> wrote:
> Use the INVLPGB_FINAL_ONLY flag when invalidating mappings with INVLPGB.
> This way only leaf mappings get removed from the TLB, leaving intermediate
> translations cached.
>
> On the (rare) occasions where we free page tables we do a full flush,
> ensuring intermediate translations get flushed from the TLB.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/include/asm/invlpgb.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
> index 862775897a54..2669ebfffe81 100644
> --- a/arch/x86/include/asm/invlpgb.h
> +++ b/arch/x86/include/asm/invlpgb.h
> @@ -51,7 +51,7 @@ static inline void invlpgb_flush_user(unsigned long pcid,
> static inline void invlpgb_flush_user_nr(unsigned long pcid, unsigned long addr,
> int nr, bool pmd_stride)
> {
> - __invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
> + __invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA | INVLPGB_FINAL_ONLY);
> }
Please note this final-only behavior in a comment above the function
and/or rename the function to make this clear.
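For example, a comment along these lines above the helper (the wording is
only a suggestion):

	/*
	 * Flush user pages, but only the final translations: with
	 * INVLPGB_FINAL_ONLY the intermediate (paging-structure) entries
	 * stay cached. Callers that free page tables must do a full
	 * flush instead of relying on this helper.
	 */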
I think this currently interacts badly with pmdp_collapse_flush(),
which is used by retract_page_tables(). pmdp_collapse_flush() removes
a PMD entry pointing to a page table with pmdp_huge_get_and_clear(),
then calls flush_tlb_range(), which on x86 calls flush_tlb_mm_range()
with the "freed_tables" parameter set to false. But that's really a
preexisting bug, not something introduced by your series. I've sent a
patch for that, see
<https://lore.kernel.org/r/20250103-x86-collapse-flush-fix-v1-1-3c521856cfa6@google.com>.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 03/12] x86/mm: add X86_FEATURE_INVLPGB definition.
2025-01-03 18:27 ` Rik van Riel
@ 2025-01-03 21:07 ` Borislav Petkov
0 siblings, 0 replies; 89+ messages in thread
From: Borislav Petkov @ 2025-01-03 21:07 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Fri, Jan 03, 2025 at 01:27:41PM -0500, Rik van Riel wrote:
> I still see dozens of flags in /proc/cpuinfo here on
> 6.11.
We can't remove those flags because they're an ABI.
> When did that change?
This should explain: Documentation/arch/x86/cpuinfo.rst
> I'm not convinced things will be more readable if 4
> patches get squashed down into one.
Not 4 patches - you merge this one with patch 6 - the first patch that uses
the flag.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2025-01-03 17:36 ` Jann Horn
@ 2025-01-04 2:55 ` Rik van Riel
2025-01-06 13:04 ` Jann Horn
0 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2025-01-04 2:55 UTC (permalink / raw)
To: Jann Horn
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Fri, 2025-01-03 at 18:36 +0100, Jann Horn wrote:
>
> > +++ b/arch/x86/include/asm/mmu.h
> > @@ -48,6 +48,12 @@ typedef struct {
> > unsigned long flags;
> > #endif
> >
> > +#ifdef CONFIG_CPU_SUP_AMD
> > + struct list_head broadcast_asid_list;
> > + u16 broadcast_asid;
> > + bool asid_transition;
>
> Please add a comment on the semantics of the "asid_transition" field
> here after addressing the comments below.
Will do.
>
> > +#endif
> > +
> > #ifdef CONFIG_ADDRESS_MASKING
> > /* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or
> > 0 (disabled) */
> > unsigned long lam_cr3_mask;
> [...]
> > +#ifdef CONFIG_CPU_SUP_AMD
> > +/*
> > + * Logic for AMD INVLPGB support.
> > + */
> > +static DEFINE_RAW_SPINLOCK(broadcast_asid_lock);
> > +static u16 last_broadcast_asid = TLB_NR_DYN_ASIDS;
>
> I wonder if this should be set to MAX_ASID_AVAILABLE or such to
> ensure
> that we do a flush before we start using the broadcast ASID space the
> first time... Or is there something else that already guarantees that
> all ASIDs of the TLB are flushed during kernel boot?
That is a good idea. I don't know whether the TLBs always get
flushed on every kexec, for example, and having the wraparound
code exercised early on every boot will be a good self-test for
future-proofing the code.
I'll do that in the next version.
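Roughly, as a sketch:

	/*
	 * Start past the end of the ASID space, so the very first allocation
	 * goes through reset_broadcast_asid_space() and flushes all
	 * non-global TLB entries before any broadcast ASID is handed out.
	 */
	static u16 last_broadcast_asid = MAX_ASID_AVAILABLE;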
>
> > +static u16 get_broadcast_asid(void)
> > +{
> > + lockdep_assert_held(&broadcast_asid_lock);
> > +
> > + do {
> > + u16 start = last_broadcast_asid;
> > + u16 asid = find_next_zero_bit(broadcast_asid_used,
> > MAX_ASID_AVAILABLE, start);
> > +
> > + if (asid >= MAX_ASID_AVAILABLE) {
> > + reset_broadcast_asid_space();
> > + continue;
>
> Can this loop endlessly without making forward progress if we have a
> few thousand processes on the system that are multi-threaded (or used
> to be multi-threaded) and race the wrong way?
> meets_broadcast_asid_threshold() checks if we have free IDs
> remaining,
> but that check happens before broadcast_asid_lock is held, so we
> could
> theoretically race such that no free IDs are available, right?
You are right, I need to duplicate that check under
the spinlock! I'll get that done in the next version.
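As a sketch, assuming use_broadcast_asid() keeps its current shape, the
recheck could look like this:

	static void use_broadcast_asid(struct mm_struct *mm)
	{
		guard(raw_spinlock_irqsave)(&broadcast_asid_lock);

		/* This process is already using broadcast TLB invalidation. */
		if (mm->context.broadcast_asid)
			return;

		/*
		 * Re-check under the lock: the earlier check of
		 * broadcast_asid_available was done without it, so we may
		 * have raced with other processes claiming the last ASIDs.
		 */
		if (!broadcast_asid_available)
			return;

		/* ... assign the ASID and update the list as before ... */
	}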
>
> > + mm->context.broadcast_asid = get_broadcast_asid();
> > + mm->context.asid_transition = true;
>
> This looks buggy to me:
Nadav found the same issue. I've fixed it locally
already for the next version.
> Maybe change how mm->context.asid_transition works such that it is
> immediately set on mm creation and cleared when the transition is
> done, so that you don't have to touch it here?
>
If we want to document the ordering, won't it be better
to keep both assignments close to each other (with WRITE_ONCE),
so the code stays easier to understand for future maintenance?
> Also, please use at least WRITE_ONCE() for writes here, and add
> comments documenting ordering requirements.
I'll add a comment.
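Something like this, as a sketch (exact comment wording still to be decided):

	/*
	 * Mark the mm as transitioning before publishing the broadcast
	 * ASID, so a flush_tlb_mm_range() that sees the ASID through
	 * mm_broadcast_asid() also sees the transition in progress.
	 * WRITE_ONCE() keeps the compiler from reordering the stores;
	 * x86 does not reorder stores against stores.
	 */
	WRITE_ONCE(mm->context.asid_transition, true);
	WRITE_ONCE(mm->context.broadcast_asid, get_broadcast_asid());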
>
> > +static void finish_asid_transition(struct flush_tlb_info *info)
> > +{
> > + struct mm_struct *mm = info->mm;
> > + int bc_asid = mm_broadcast_asid(mm);
> > + int cpu;
> > +
> > + if (!mm->context.asid_transition)
>
> AFAIU this can be accessed concurrently - please use at least
> READ_ONCE(). (I think in the current version of the patch, this needs
> to be ordered against the preceding mm_broadcast_asid() read, but
> that's implicit on x86, so I guess writing a barrier here would be
> superfluous.)
I'll add a READ_ONCE here. Good point.
>
> > + return;
> > +
> > + for_each_cpu(cpu, mm_cpumask(mm)) {
> > + if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu))
> > != mm)
> > + continue;
>
> switch_mm_irqs_off() picks an ASID and writes CR3 before writing
> loaded_mm:
> "/* Make sure we write CR3 before loaded_mm. */"
>
> Can we race with a concurrent switch_mm_irqs_off() on the other CPU
> such that the other CPU has already switched CR3 to our MM using the
> old ASID, but has not yet written loaded_mm, such that we skip it
> here? And then we'll think we finished the ASID transition, and the
> next time we do a flush, we'll wrongly omit the flush for that other
> CPU even though it's still using the old ASID?
That is a very good question.
I suppose we need to check against LOADED_MM_SWITCHING
too, and possibly wait to see what mm shows up on that
CPU before proceeding?
Maybe as simple as this?
	for_each_cpu(cpu, mm_cpumask(mm)) {
		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
			cpu_relax();

		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
			continue;

		/*
		 * If at least one CPU is not using the broadcast ASID yet,
		 * send a TLB flush IPI. The IPI should cause stragglers
		 * to transition soon.
		 */
		if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) != bc_asid) {
			flush_tlb_multi(mm_cpumask(info->mm), info);
			return;
		}
	}
Then the only change needed to switch_mm_irqs_off
would be to move the LOADED_MM_SWITCHING line to
before choose_new_asid, to fully close the window.
Am I overlooking anything here?
>
> > +
> > + /*
> > + * If at least one CPU is not using the broadcast
> > ASID yet,
> > + * send a TLB flush IPI. The IPI should cause
> > stragglers
> > + * to transition soon.
> > + */
> > + if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) !=
> > bc_asid) {
>
> READ_ONCE()? Also, I think this needs a comment explaining that this
> can race with concurrent MM switches such that we wrongly think that
> there's a straggler (because we're not reading the loaded_mm and the
> loaded_mm_asid as one atomic combination).
I'll add the READ_ONCE.
Will the race still exist if we wait on
LOADED_MM_SWITCHING as proposed above?
>
> > + flush_tlb_multi(mm_cpumask(info->mm),
> > info);
> > + return;
> > + }
> > + }
> > +
> > + /* All the CPUs running this process are using the
> > broadcast ASID. */
> > + mm->context.asid_transition = 0;
>
> WRITE_ONCE()?
> Also: This is a bool, please use "false".
Will do.
>
> > +}
> > +
> > +static void broadcast_tlb_flush(struct flush_tlb_info *info)
> > +{
> > + bool pmd = info->stride_shift == PMD_SHIFT;
> > + unsigned long maxnr = invlpgb_count_max;
> > + unsigned long asid = info->mm->context.broadcast_asid;
> > + unsigned long addr = info->start;
> > + unsigned long nr;
> > +
> > + /* Flushing multiple pages at once is not supported with
> > 1GB pages. */
> > + if (info->stride_shift > PMD_SHIFT)
> > + maxnr = 1;
> > +
> > + if (info->end == TLB_FLUSH_ALL) {
> > + invlpgb_flush_single_pcid(kern_pcid(asid));
>
> What orders this flush with the preceding page table update? Does the
> instruction implicitly get ordered after preceding memory writes, or
> do we get that ordering from inc_mm_tlb_gen() or something like that?
I believe inc_mm_tlb_gen() should provide the ordering.
You are right that it should be documented.
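Perhaps something like this above the flush in broadcast_tlb_flush()
(a sketch; the exact wording and placement are assumptions):

	/*
	 * The caller already called inc_mm_tlb_gen(), an atomic RMW that
	 * implies a full memory barrier on x86, so the page table updates
	 * are visible before the INVLPGB below is issued.
	 */
	invlpgb_flush_single_pcid(kern_pcid(asid));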
>
> > + /* Broadcast ASIDs are always kept up to date with INVLPGB.
> > */
> > + if (is_broadcast_asid(loaded_mm_asid))
> > + return;
>
> This relies on the mm_broadcast_asid() read in flush_tlb_mm_range()
> being ordered after the page table update, correct? And we get that
> required ordering from the inc_mm_tlb_gen(), which implies a full
> barrier? It might be nice if there were some more comments on this.
I will add some comments, and I hope you can review
those in the next series, because I'll no doubt
forget to explain something important.
Thank you for the thorough review!
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 11/12] x86/mm: enable AMD translation cache extensions
2025-01-03 17:49 ` Jann Horn
@ 2025-01-04 3:08 ` Rik van Riel
2025-01-06 13:10 ` Jann Horn
0 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2025-01-04 3:08 UTC (permalink / raw)
To: Jann Horn
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Fri, 2025-01-03 at 18:49 +0100, Jann Horn wrote:
> On Mon, Dec 30, 2024 at 6:53 PM Rik van Riel <riel@surriel.com>
> > only those upper-level entries that lead to the target PTE in
> > the page table hierarchy, leaving unrelated upper-level entries
> > intact.
>
> How does this patch interact with KVM SVM guests?
> In particular, will this patch cause TLB flushes performed by guest
> kernels to behave differently?
>
That is a good question.
A Linux guest should be fine, since Linux already
flushes the parts of the TLB where page tables are
being freed.
I don't know whether this could potentially break
some non-Linux guests, though.
> [...]
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 454a370494d3..585d0731ca9f 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -477,7 +477,7 @@ static void broadcast_tlb_flush(struct
> > flush_tlb_info *info)
> > if (info->stride_shift > PMD_SHIFT)
> > maxnr = 1;
> >
> > - if (info->end == TLB_FLUSH_ALL) {
> > + if (info->end == TLB_FLUSH_ALL || info->freed_tables) {
> > invlpgb_flush_single_pcid(kern_pcid(asid));
> > /* Do any CPUs supporting INVLPGB need PTI? */
> > if (static_cpu_has(X86_FEATURE_PTI))
> > @@ -1110,7 +1110,7 @@ static void flush_tlb_func(void *info)
> > *
> > * The only question is whether to do a full or partial
> > flush.
> > *
> > - * We do a partial flush if requested and two extra
> > conditions
> > + * We do a partial flush if requested and three extra
> > conditions
> > * are met:
> > *
> > * 1. f->new_tlb_gen == local_tlb_gen + 1. We have an
> > invariant that
> > @@ -1137,10 +1137,14 @@ static void flush_tlb_func(void *info)
> > * date. By doing a full flush instead, we can increase
> > * local_tlb_gen all the way to mm_tlb_gen and we can
> > probably
> > * avoid another flush in the very near future.
> > + *
> > + * 3. No page tables were freed. If page tables were freed,
> > a full
> > + * flush ensures intermediate translations in the TLB
> > get flushed.
> > */
>
> Why is this necessary - do we ever issue TLB flushes that are
> intended
> to zap upper-level entries which are not covered by the specified
> address range?
>
> When, for example, free_pmd_range() gets rid of a page table, it
> calls
> pmd_free_tlb(), which sets tlb->freed_tables and does
> tlb_flush_pud_range(tlb, address, PAGE_SIZE).
>
I missed those calls.
It looks like this change is not needed.
Of course, the way pmd_free_tlb() operates, the partial
flushes issued from that code will typically exceed the
(default 33 page) tlb_single_page_flush_ceiling, so the
code in flush_tlb_mm_range() will already do a full flush
by default today.
I'll leave out these unnecessary changes in the next
version.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2025-01-03 12:18 ` Borislav Petkov
@ 2025-01-04 16:27 ` Peter Zijlstra
2025-01-06 15:54 ` Dave Hansen
2025-01-06 15:47 ` Rik van Riel
1 sibling, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2025-01-04 16:27 UTC (permalink / raw)
To: Borislav Petkov
Cc: Rik van Riel, x86, linux-kernel, kernel-team, dave.hansen, luto,
tglx, mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Fri, Jan 03, 2025 at 01:18:43PM +0100, Borislav Petkov wrote:
> On Thu, Jan 02, 2025 at 08:56:09PM +0100, Peter Zijlstra wrote:
> > Well, I've already answered why we need this in the previous thread but
> > it wasn't preserved :-(
>
> ... and this needs to be part of the commit message. And there's a similar
> comment over tlb_remove_table_smp_sync() in mm/mmu_gather.c which pretty much
> explains the same thing.
>
> > Currently GUP-fast serializes against table-free by disabling
> > interrupts, which in turn holds of the TLBI-IPIs.
> >
> > Since you're going to be doing broadcast TLBI -- without IPIs, this no
> > longer works and we need other means of serializing GUP-fast vs
> > table-free.
> >
> > MMU_GATHER_RCU_TABLE_FREE is that means.
> >
> > So where previously paravirt implementations of tlb_flush_multi might
> > require this (because of virt optimizations that avoided the TLBI-IPI),
> > this broadcast invalidate now very much requires this for native.
>
> Right, so this begs the question: we probably should do this dynamically only
> on TLBI systems - not on everything native - due to the overhead of this
> batching - I'm looking at tlb_remove_table().
>
> Or should we make this unconditional on all native because we don't care about
> the overhead and would like to have simpler code. I mean, disabling IRQs vs
> batching and allocating memory...?
The disabling IRQs on the GUP-fast side stays, it acts as a
RCU-read-side section -- also mmu_gather reverts to sending IPIs if it
runs out of memory (extremely rare).
I don't think there is measurable overhead from doing the separate table
batching, but I'm sure the robots will tell us.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2025-01-04 2:55 ` Rik van Riel
@ 2025-01-06 13:04 ` Jann Horn
2025-01-06 14:26 ` Rik van Riel
0 siblings, 1 reply; 89+ messages in thread
From: Jann Horn @ 2025-01-06 13:04 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Sat, Jan 4, 2025 at 3:55 AM Rik van Riel <riel@surriel.com> wrote:
> On Fri, 2025-01-03 at 18:36 +0100, Jann Horn wrote:
> > Maybe change how mm->context.asid_transition works such that it is
> > immediately set on mm creation and cleared when the transition is
> > done, so that you don't have to touch it here?
> >
> If we want to document the ordering, won't it be better
> to keep both assignments close to each other (with WRITE_ONCE),
> so the code stays easier to understand for future maintenance?
You have a point there. I was thinking that if asid_transition is set
on mm creation, we don't have to think about the ordering properties
as hard; but I guess you're right that it would be more
clean/future-proof to do the writes together here.
> > > + return;
> > > +
> > > + for_each_cpu(cpu, mm_cpumask(mm)) {
> > > + if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu))
> > > != mm)
> > > + continue;
> >
> > switch_mm_irqs_off() picks an ASID and writes CR3 before writing
> > loaded_mm:
> > "/* Make sure we write CR3 before loaded_mm. */"
> >
> > Can we race with a concurrent switch_mm_irqs_off() on the other CPU
> > such that the other CPU has already switched CR3 to our MM using the
> > old ASID, but has not yet written loaded_mm, such that we skip it
> > here? And then we'll think we finished the ASID transition, and the
> > next time we do a flush, we'll wrongly omit the flush for that other
> > CPU even though it's still using the old ASID?
>
> That is a very good question.
>
> I suppose we need to check against LOADED_MM_SWITCHING
> too, and possibly wait to see what mm shows up on that
> CPU before proceeding?
>
> Maybe as simple as this?
>
> 	for_each_cpu(cpu, mm_cpumask(mm)) {
> 		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
> 			cpu_relax();
>
> 		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
> 			continue;
>
> 		/*
> 		 * If at least one CPU is not using the broadcast ASID yet,
> 		 * send a TLB flush IPI. The IPI should cause stragglers
> 		 * to transition soon.
> 		 */
> 		if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) != bc_asid) {
> 			flush_tlb_multi(mm_cpumask(info->mm), info);
> 			return;
> 		}
> 	}
>
> Then the only change needed to switch_mm_irqs_off
> would be to move the LOADED_MM_SWITCHING line to
> before choose_new_asid, to fully close the window.
>
> Am I overlooking anything here?
I think that might require having a full memory barrier in
switch_mm_irqs_off to ensure that the write of LOADED_MM_SWITCHING
can't be reordered after reads in choose_new_asid(). Which wouldn't be
very nice; we probably should avoid adding heavy barriers to the task
switch path...
Hmm, but I think luckily the cpumask_set_cpu() already implies a
relaxed RMW atomic, which I think on X86 is actually the same as a
sequentially consistent atomic, so as long as you put the
LOADED_MM_SWITCHING line before that, it might do the job? Maybe with
an smp_mb__after_atomic() and/or an explainer comment.
(smp_mb__after_atomic() is a no-op on x86, so maybe just a comment is
the right way. Documentation/memory-barriers.txt says
smp_mb__after_atomic() can be used together with atomic RMW bitop
functions.)
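Roughly, as a sketch of that ordering in switch_mm_irqs_off()
(surrounding code elided; the exact placement is an assumption):

	this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);

	/*
	 * cpumask_set_cpu() is an atomic RMW, which on x86 acts as a full
	 * barrier, so the LOADED_MM_SWITCHING store above cannot be
	 * reordered past the loads in choose_new_asid() below.
	 */
	cpumask_set_cpu(cpu, mm_cpumask(next));
	smp_mb__after_atomic();	/* no-op on x86, documents the ordering */

	/* ... choose_new_asid(), write CR3, then publish loaded_mm ... */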
> > > +
> > > + /*
> > > + * If at least one CPU is not using the broadcast
> > > ASID yet,
> > > + * send a TLB flush IPI. The IPI should cause
> > > stragglers
> > > + * to transition soon.
> > > + */
> > > + if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) !=
> > > bc_asid) {
> >
> > READ_ONCE()? Also, I think this needs a comment explaining that this
> > can race with concurrent MM switches such that we wrongly think that
> > there's a straggler (because we're not reading the loaded_mm and the
> > loaded_mm_asid as one atomic combination).
>
> I'll add the READ_ONCE.
>
> Will the race still exist if we wait on
> LOADED_MM_SWITCHING as proposed above?
I think so, since between reading the loaded_mm and reading the
loaded_mm_asid, the remote CPU might go through an entire task switch.
Like:
1. We read the loaded_mm, and see that the remote CPU is currently
running in our mm_struct.
2. The remote CPU does a task switch to another process with a
different mm_struct.
3. We read the loaded_mm_asid, and see an ASID that does not match our
broadcast ASID (because the loaded ASID is not for our mm_struct).
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 11/12] x86/mm: enable AMD translation cache extensions
2025-01-04 3:08 ` Rik van Riel
@ 2025-01-06 13:10 ` Jann Horn
2025-01-06 18:29 ` Sean Christopherson
0 siblings, 1 reply; 89+ messages in thread
From: Jann Horn @ 2025-01-06 13:10 UTC (permalink / raw)
To: Rik van Riel, Sean Christopherson, Paolo Bonzini, KVM list, Tom Lendacky
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
+KVM/SVM folks in case they know more about how enabling CPU features
interacts with virtualization; original patch is at
https://lore.kernel.org/all/20241230175550.4046587-12-riel@surriel.com/
On Sat, Jan 4, 2025 at 4:08 AM Rik van Riel <riel@surriel.com> wrote:
> On Fri, 2025-01-03 at 18:49 +0100, Jann Horn wrote:
> > On Mon, Dec 30, 2024 at 6:53 PM Rik van Riel <riel@surriel.com>
> > > only those upper-level entries that lead to the target PTE in
> > > the page table hierarchy, leaving unrelated upper-level entries
> > > intact.
> >
> > How does this patch interact with KVM SVM guests?
> > In particular, will this patch cause TLB flushes performed by guest
> > kernels to behave differently?
> >
> That is a good question.
>
> A Linux guest should be fine, since Linux already
> flushes the parts of the TLB where page tables are
> being freed.
>
> I don't know whether this could potentially break
> some non-Linux guests, though.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2025-01-06 13:04 ` Jann Horn
@ 2025-01-06 14:26 ` Rik van Riel
0 siblings, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2025-01-06 14:26 UTC (permalink / raw)
To: Jann Horn
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, 2025-01-06 at 14:04 +0100, Jann Horn wrote:
> On Sat, Jan 4, 2025 at 3:55 AM Rik van Riel <riel@surriel.com> wrote:
>
> >
> > Then the only change needed to switch_mm_irqs_off
> > would be to move the LOADED_MM_SWITCHING line to
> > before choose_new_asid, to fully close the window.
> >
> > Am I overlooking anything here?
>
> I think that might require having a full memory barrier in
> switch_mm_irqs_off to ensure that the write of LOADED_MM_SWITCHING
> can't be reordered after reads in choose_new_asid(). Which wouldn't
> be
> very nice; we probably should avoid adding heavy barriers to the task
> switch path...
>
> Hmm, but I think luckily the cpumask_set_cpu() already implies a
> relaxed RMW atomic, which I think on X86 is actually the same as a
> sequentially consistent atomic, so as long as you put the
> LOADED_MM_SWITCHING line before that, it might do the job? Maybe with
> an smp_mb__after_atomic() and/or an explainer comment.
> (smp_mb__after_atomic() is a no-op on x86, so maybe just a comment is
> the right way. Documentation/memory-barriers.txt says
> smp_mb__after_atomic() can be used together with atomic RMW bitop
> functions.)
>
That noop smp_mb__after_atomic() might be the way to go,
since we do not actually use the mm_cpumask with INVLPGB,
and we could conceivably skip updates to the bitmask for
tasks using broadcast TLB flushing.
> > >
> >
> > I'll add the READ_ONCE.
> >
> > Will the race still exist if we wait on
> > LOADED_MM_SWITCHING as proposed above?
>
> I think so, since between reading the loaded_mm and reading the
> loaded_mm_asid, the remote CPU might go through an entire task
> switch.
> Like:
>
> 1. We read the loaded_mm, and see that the remote CPU is currently
> running in our mm_struct.
> 2. The remote CPU does a task switch to another process with a
> different mm_struct.
> 3. We read the loaded_mm_asid, and see an ASID that does not match
> our
> broadcast ASID (because the loaded ASID is not for our mm_struct).
>
A false positive, where we do not clear the
asid_transition field and simply check again
in the future, should be harmless, though.
The worry is false negatives, where we fail
to detect an out-of-sync CPU, yet still clear
the asid_transition field.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2024-12-30 17:53 ` [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
2024-12-30 19:24 ` Nadav Amit
2025-01-03 17:36 ` Jann Horn
@ 2025-01-06 14:52 ` Nadav Amit
2025-01-06 16:03 ` Rik van Riel
2025-01-06 18:40 ` Dave Hansen
3 siblings, 1 reply; 89+ messages in thread
From: Nadav Amit @ 2025-01-06 14:52 UTC (permalink / raw)
To: Rik van Riel
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, kernel-team,
Dave Hansen, luto, peterz, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
> On 30 Dec 2024, at 19:53, Rik van Riel <riel@surriel.com> wrote:
>
> +/*
> + * Figure out whether to assign a broadcast (global) ASID to a process.
> + * We vary the threshold by how empty or full broadcast ASID space is.
> + * 1/4 full: >= 4 active threads
> + * 1/2 full: >= 8 active threads
> + * 3/4 full: >= 16 active threads
> + * 7/8 full: >= 32 active threads
> + * etc
> + *
> + * This way we should never exhaust the broadcast ASID space, even on very
> + * large systems, and the processes with the largest number of active
> + * threads should be able to use broadcast TLB invalidation.
> + */
> +#define HALFFULL_THRESHOLD 8
> +static bool meets_broadcast_asid_threshold(struct mm_struct *mm)
> +{
> + int avail = broadcast_asid_available;
> + int threshold = HALFFULL_THRESHOLD;
> +
> + if (!avail)
> + return false;
> +
> + if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
> + threshold = HALFFULL_THRESHOLD / 4;
> + } else if (avail > MAX_ASID_AVAILABLE / 2) {
> + threshold = HALFFULL_THRESHOLD / 2;
> + } else if (avail < MAX_ASID_AVAILABLE / 3) {
> + do {
> + avail *= 2;
> + threshold *= 2;
> + } while ((avail + threshold) < MAX_ASID_AVAILABLE / 2);
> + }
> +
> + return mm_active_cpus_exceeds(mm, threshold);
> +}
Rik,
I thought about it further and I am not sure this approach is so great.
It reminds me of the technique for making chocolate last forever: each day,
eat half of what was left the previous day. It works in theory, but less so
in practice.
IOW, it seems likely that early processes would grab and hog all the
broadcast ASIDs. It seems necessary to be able to revoke broadcast ASIDs,
although I understand that can be complicated.
Do you have any other resource in mind that Linux manages in a similar
way (avoids revoking)?
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2025-01-03 12:18 ` Borislav Petkov
2025-01-04 16:27 ` Peter Zijlstra
@ 2025-01-06 15:47 ` Rik van Riel
1 sibling, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2025-01-06 15:47 UTC (permalink / raw)
To: Borislav Petkov, Peter Zijlstra
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Fri, 2025-01-03 at 13:18 +0100, Borislav Petkov wrote:
> On Thu, Jan 02, 2025 at 08:56:09PM +0100, Peter Zijlstra wrote:
> > Well, I've already answered why we need this in the previous thread
> > but
> > it wasn't preserved :-(
>
> ... and this needs to be part of the commit message. And there's a
> similar
> comment over tlb_remove_table_smp_sync() in mm/mmu_gather.c which
> pretty much
> explains the same thing.
>
> > Currently GUP-fast serializes against table-free by disabling
> > interrupts, which in turn holds of the TLBI-IPIs.
> >
> > Since you're going to be doing broadcast TLBI -- without IPIs, this
> > no
> > longer works and we need other means of serializing GUP-fast vs
> > table-free.
> >
> > MMU_GATHER_RCU_TABLE_FREE is that means.
> >
> > So where previously paravirt implementations of tlb_flush_multi
> > might
> > require this (because of virt optimizations that avoided the TLBI-
> > IPI),
> > this broadcast invalidate now very much requires this for native.
>
> Right, so this begs the question: we probably should do this
> dynamically only
> on TLBI systems - not on everything native - due to the overhead of
> this
> batching - I'm looking at tlb_remove_table().
>
> Or should we make this unconditional on all native because we don't
> care about
> the overhead and would like to have simpler code. I mean, disabling
> IRQs vs
> batching and allocating memory...?
Given the cost of IPIs on systems where we don't
technically need MMU_GATHER_RCU_TABLE_FREE, the
batching might still be cheaper.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2025-01-04 16:27 ` Peter Zijlstra
@ 2025-01-06 15:54 ` Dave Hansen
0 siblings, 0 replies; 89+ messages in thread
From: Dave Hansen @ 2025-01-06 15:54 UTC (permalink / raw)
To: Peter Zijlstra, Borislav Petkov
Cc: Rik van Riel, x86, linux-kernel, kernel-team, dave.hansen, luto,
tglx, mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On 1/4/25 08:27, Peter Zijlstra wrote:
>> Or should we make this unconditional on all native because we don't care about
>> the overhead and would like to have simpler code. I mean, disabling IRQs vs
>> batching and allocating memory...?
> The disabling IRQs on the GUP-fast side stays, it acts as a
> RCU-read-side section -- also mmu_gather reverts to sending IPIs if it
> runs out of memory (extremely rare).
>
> I don't think there is measurable overhead from doing the separate table
> batching, but I'm sure the robots will tell us.
We should _try_ to make it unconditional for simplicity if nothing else.
BTW, a few years back, some folks at Intel turned on
MMU_GATHER_RCU_TABLE_FREE and ran the usual 0day/LKP tests. I _think_ it
was when we were exploring the benefits of Intel's IPI-free TLB flushing
mechanism. We didn't find anything remarkable either way (IIRC).
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2025-01-06 14:52 ` Nadav Amit
@ 2025-01-06 16:03 ` Rik van Riel
0 siblings, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2025-01-06 16:03 UTC (permalink / raw)
To: Nadav Amit
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, kernel-team,
Dave Hansen, luto, peterz, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
On Mon, 2025-01-06 at 16:52 +0200, Nadav Amit wrote:
>
>
> I thought about it further and I am not sure this approach is so
> great.
> It reminds me the technique of eating chocolate forever: each day eat
> half of the previous day. It works in theory, but less in practice.
>
The PCID space might be just large enough that we can
get away with it, even on extremely large systems.
Say that we have a system with 8192 CPUs, where we are
not using the PTI mitigation, giving us 4086 or so available
PCIDs.
If the system runs 4k 2-thread processes, most of them
get to use INVLPGB.
If the system runs 4 processes with 2k threads each, all
of those large processes get to use INVLPGB.
If a few smaller processes do not get to use INVLPGB,
it may not matter much, since the large (and presumably
more important) processes in the system do get to use it.
> IOW, I mean it seems likely that early processes would get and hog
> all
> broadcast ASIDs. It seems necessary to be able to revoke broadcast
> ASIDs,
> although I understand it can be complicated.
>
Revoking broadcast ASIDs works. An earlier prototype of
these patches assigned broadcast ASIDs only to the top
8 TLB flushing processes on the system, and would kick
tasks out of the top 8 when a more active flusher showed
up.
However, given that INVLPGB seems to give only a few
percent performance boost in the Phoronix tests, having
some processes use INVLPGB, and others use IPI-based TLB
flushing, might be a perfectly reasonable fallback.
https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 05/12] x86/mm: add INVLPGB support code
2025-01-02 12:42 ` Borislav Petkov
@ 2025-01-06 16:50 ` Dave Hansen
2025-01-06 17:32 ` Rik van Riel
2025-01-06 18:14 ` Borislav Petkov
2025-01-14 19:50 ` Rik van Riel
1 sibling, 2 replies; 89+ messages in thread
From: Dave Hansen @ 2025-01-06 16:50 UTC (permalink / raw)
To: Borislav Petkov, Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On 1/2/25 04:42, Borislav Petkov wrote:
>> +#define INVLPGB_VA BIT(0)
>> +#define INVLPGB_PCID BIT(1)
>> +#define INVLPGB_ASID BIT(2)
>> +#define INVLPGB_INCLUDE_GLOBAL BIT(3)
>> +#define INVLPGB_FINAL_ONLY BIT(4)
>> +#define INVLPGB_INCLUDE_NESTED BIT(5)
> Please add only the defines which are actually being used. Ditto for the
> functions.
There's some precedent for defining them all up front, like we did for
invpcid_flush_*().
For INVPCID, there are four variants and two of them got used up front.
But I get that it's a balancing act between having untested code that
might bitrot and introducing helpers at a time when someone (Rik) is
very likely to get all the variants coded up correctly.
Rik, how many of these end up being used by the end of the series?
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2024-12-30 17:53 ` [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
2025-01-03 12:39 ` Borislav Petkov
@ 2025-01-06 17:21 ` Dave Hansen
2025-01-09 20:16 ` Rik van Riel
2025-01-10 18:53 ` Tom Lendacky
2 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2025-01-06 17:21 UTC (permalink / raw)
To: Rik van Riel, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On 12/30/24 09:53, Rik van Riel wrote:
> Use broadcast TLB invalidation for kernel addresses when available.
>
> This stops us from having to send IPIs for kernel TLB flushes.
Could this be changed to imperative voice, please?
Remove the need to send IPIs for kernel TLB flushes.
> +static void broadcast_kernel_range_flush(unsigned long start, unsigned long end)
> +{
> + unsigned long addr;
> + unsigned long maxnr = invlpgb_count_max;
> + unsigned long threshold = tlb_single_page_flush_ceiling * maxnr;
The 'tlb_single_page_flush_ceiling' value was determined by looking at
_local_ invalidation cost. Could you talk a bit about why it's also a
good value to use for remote invalidations? Does it hold up for INVLPGB
the same way it did for good ol' INVLPG? Has there been any explicit
testing here to find a good value?
I'm also confused by the multiplication here. Let's say
invlpgb_count_max==20 and tlb_single_page_flush_ceiling==30.
You would need to switch away from single-address invalidation when the
number of addresses is >20 for INVLPGB functional reasons. But you'd
also need to switch away when >30 for performance reasons
(tlb_single_page_flush_ceiling).
But I don't understand how that would make the threshold 20*30=600
invalidations.
> + /*
> + * TLBSYNC only waits for flushes originating on the same CPU.
> + * Disabling migration allows us to wait on all flushes.
> + */
Imperative voice here too, please:
Disable migration to wait on all flushes.
> + guard(preempt)();
> +
> + if (end == TLB_FLUSH_ALL ||
> + (end - start) > threshold << PAGE_SHIFT) {
This is basically a copy-and-paste of the "range vs. global" flush
logic, but taking 'invlpgb_count_max' into account.
It would be ideal if those limit checks could be consolidated. I suspect
that when the 'threshold' calculation above gets clarified that they may
be easier to consolidate.
BTW, what is a typical value for 'invlpgb_count_max'? Is it more or less
than the typical value for 'tlb_single_page_flush_ceiling'?
Maybe we should just lower 'tlb_single_page_flush_ceiling' if
'invlpgb_count_max' falls below it so we only have _one_ runtime value
to consider.
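A minimal sketch of that, assuming it runs once during boot after
invlpgb_count_max has been read from CPUID:

	if (cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
	    invlpgb_count_max < tlb_single_page_flush_ceiling)
		tlb_single_page_flush_ceiling = invlpgb_count_max;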
> + invlpgb_flush_all();
> + } else {
> + unsigned long nr;
> + for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
> + nr = min((end - addr) >> PAGE_SHIFT, maxnr);
> + invlpgb_flush_addr(addr, nr);
> + }
> + }
> +
> + tlbsync();
> +}
> +
> static void do_kernel_range_flush(void *info)
> {
> struct flush_tlb_info *f = info;
> @@ -1089,6 +1115,11 @@ static void do_kernel_range_flush(void *info)
>
> void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> {
> + if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> + broadcast_kernel_range_flush(start, end);
> + return;
> + }
> +
> /* Balance as user space task's flush, a bit conservative */
> if (end == TLB_FLUSH_ALL ||
> (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
I also wonder if this would all get simpler if we give in and *always*
call get_flush_tlb_info(). That would provide a nice single place to
consolidate the "all vs. ranged" flush logic.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 07/12] x86/tlb: use INVLPGB in flush_tlb_all
2024-12-30 17:53 ` [PATCH 07/12] x86/tlb: use INVLPGB in flush_tlb_all Rik van Riel
@ 2025-01-06 17:29 ` Dave Hansen
2025-01-06 17:35 ` Rik van Riel
0 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2025-01-06 17:29 UTC (permalink / raw)
To: Rik van Riel, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On 12/30/24 09:53, Rik van Riel wrote:
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1074,6 +1074,12 @@ static void do_flush_tlb_all(void *info)
> void flush_tlb_all(void)
> {
> count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
> + if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> + guard(preempt)();
> + invlpgb_flush_all();
> + tlbsync();
> + return;
> + }
After seeing a few of these, I'd really prefer that the preempt and
tlbsync() logic be hidden in the invlpgb_*() helper, or *a* helper at least.
This would be a lot easier on the eyes if it were something like:
flushed = invlpgb_flush_all();
if (flushed)
return;
or even:
if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
invlpgb_flush_all();
return;
}
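As a sketch of that second form, with the synchronization hidden inside
the helper (the name and the bool return are assumptions):

	/* Returns false when INVLPGB is not supported, so callers can fall back. */
	static inline bool invlpgb_flush_all_sync(void)
	{
		if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
			return false;

		/* TLBSYNC only waits for INVLPGBs issued on this CPU. */
		guard(preempt)();
		invlpgb_flush_all();
		tlbsync();
		return true;
	}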
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 05/12] x86/mm: add INVLPGB support code
2025-01-06 16:50 ` Dave Hansen
@ 2025-01-06 17:32 ` Rik van Riel
2025-01-06 18:14 ` Borislav Petkov
1 sibling, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2025-01-06 17:32 UTC (permalink / raw)
To: Dave Hansen, Borislav Petkov
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, 2025-01-06 at 08:50 -0800, Dave Hansen wrote:
> On 1/2/25 04:42, Borislav Petkov wrote:
> > > +#define INVLPGB_VA BIT(0)
> > > +#define INVLPGB_PCID BIT(1)
> > > +#define INVLPGB_ASID BIT(2)
> > > +#define INVLPGB_INCLUDE_GLOBAL BIT(3)
> > > +#define INVLPGB_FINAL_ONLY BIT(4)
> > > +#define INVLPGB_INCLUDE_NESTED BIT(5)
> > Please add only the defines which are actually being used. Ditto
> > for the
> > functions.
>
> There's some precedent for defining them all up front, like we did
> for
> invpcid_flush_*().
>
> For INVPCID, there are four variants and two of them got used up
> front.
> But I get that it's a balancing act between having untested code that
> might bitrot and introducing helpers at a time when someone (Rik) is
> very likely to get all the variants coded up correctly.
>
> Rik, how many of these end up being used by the end of the series?
>
Only invlpgb_flush_single_asid is unused at the
end of the series.
I'll remove that one.
As for the bit flags, those are a hardware
interface. I can remove the unused ones, but
would like to know why :)
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 07/12] x86/tlb: use INVLPGB in flush_tlb_all
2025-01-06 17:29 ` Dave Hansen
@ 2025-01-06 17:35 ` Rik van Riel
2025-01-06 17:54 ` Dave Hansen
0 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2025-01-06 17:35 UTC (permalink / raw)
To: Dave Hansen, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, 2025-01-06 at 09:29 -0800, Dave Hansen wrote:
> On 12/30/24 09:53, Rik van Riel wrote:
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -1074,6 +1074,12 @@ static void do_flush_tlb_all(void *info)
> > void flush_tlb_all(void)
> > {
> > count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
> > + if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> > + guard(preempt)();
> > + invlpgb_flush_all();
> > + tlbsync();
> > + return;
> > + }
>
> After seeing a few of these, I'd really prefer that the preempt and
> tlbsync() logic be hidden in the invlpgb_*() helper, or *a* helper at
> least.
>
> This would be a lot easier on the eyes if it were something like:
>
> flushed = invlpgb_flush_all();
> if (flushed)
> return;
One issue here is that some of the invlpgb helpers
are supposed to be asynchronous, because we can
have multiple of those flushes pending simultaneously,
and then wait for them to complete with a tlbsync.
How would we avoid the confusion between the two
types (async vs sync) invlpgb helpers?
I'm all for cleaning this up, but I have not
thought of a good idea yet...
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 07/12] x86/tlb: use INVLPGB in flush_tlb_all
2025-01-06 17:35 ` Rik van Riel
@ 2025-01-06 17:54 ` Dave Hansen
0 siblings, 0 replies; 89+ messages in thread
From: Dave Hansen @ 2025-01-06 17:54 UTC (permalink / raw)
To: Rik van Riel, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On 1/6/25 09:35, Rik van Riel wrote:
> On Mon, 2025-01-06 at 09:29 -0800, Dave Hansen wrote:
>> On 12/30/24 09:53, Rik van Riel wrote:
>>> --- a/arch/x86/mm/tlb.c
>>> +++ b/arch/x86/mm/tlb.c
>>> @@ -1074,6 +1074,12 @@ static void do_flush_tlb_all(void *info)
>>> void flush_tlb_all(void)
>>> {
>>> count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
>>> + if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
>>> + guard(preempt)();
>>> + invlpgb_flush_all();
>>> + tlbsync();
>>> + return;
>>> + }
>>
>> After seeing a few of these, I'd really prefer that the preempt and
>> tlbsync() logic be hidden in the invlpgb_*() helper, or *a* helper at
>> least.
>>
>> This would be a lot easier on the eyes if it were something like:
>>
>> flushed = invlpgb_flush_all();
>> if (flushed)
>> return;
>
> One issue here is that some of the invlpgb helpers
> are supposed to be asynchronous, because we can
> have multiple of those flushes pending simultaneously,
> and then wait for them to complete with a tlbsync.
>
> How would we avoid the confusion between the two
> types (async vs sync) invlpgb helpers?
It could be done with naming. Either preface things with __ or give them
"sync" suffixes.
We could also do it with a calling convention:
struct invlpgb_seq;
start_invlpgb(&invlpgb_seq);
invlpgb_flush_addr(&invlpgb_seq, start, end);
end_invlpgb(&invlpgb_seq);
The things that can logically get done in sequence need to have the
start/end, and need to have the struct passed in. The ones that have the
internal sync don't have the argument.
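To make that concrete, a minimal sketch of the convention (the struct
contents and helper bodies are guesses, not anything from the series):

struct invlpgb_seq {
	bool flushed;
};

static inline void start_invlpgb(struct invlpgb_seq *seq)
{
	seq->flushed = false;
	/* TLBSYNC only waits for flushes issued from this CPU. */
	preempt_disable();
}

static inline void end_invlpgb(struct invlpgb_seq *seq)
{
	if (seq->flushed)
		tlbsync();
	preempt_enable();
}

invlpgb_flush_addr(&seq, start, end) would then issue the asynchronous
broadcast and set seq->flushed, while the internally-synchronous
helpers keep the plain two-argument form.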
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 05/12] x86/mm: add INVLPGB support code
2025-01-06 16:50 ` Dave Hansen
2025-01-06 17:32 ` Rik van Riel
@ 2025-01-06 18:14 ` Borislav Petkov
1 sibling, 0 replies; 89+ messages in thread
From: Borislav Petkov @ 2025-01-06 18:14 UTC (permalink / raw)
To: Dave Hansen
Cc: Rik van Riel, x86, linux-kernel, kernel-team, dave.hansen, luto,
peterz, tglx, mingo, hpa, akpm, nadav.amit, zhengqi.arch,
linux-mm
On Mon, Jan 06, 2025 at 08:50:27AM -0800, Dave Hansen wrote:
> There's some precedent for defining them all up front, like we did for
> invpcid_flush_*().
>
> For INVPCID, there are four variants and two of them got used up front.
> But I get that it's a balancing act between having untested code that
> might bitrot and introducing helpers at a time when someone (Rik) is
> very likely to get all the variants coded up correctly.
That's just silly. We don't add all possible hw interface bits, defines etc,
when they're not going to be used. If we did, the kernel would be an unwieldy
mess of MSRs and their bits, unused insn opcodes and the like.
If someone wants to use them, someone can add them *when* they're needed - not
preemptively, in anticipation that *someone* *might* use them in the future.
Guys, I can't believe I'm actually arguing for something so obvious.
:-\
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 11/12] x86/mm: enable AMD translation cache extensions
2025-01-06 13:10 ` Jann Horn
@ 2025-01-06 18:29 ` Sean Christopherson
0 siblings, 0 replies; 89+ messages in thread
From: Sean Christopherson @ 2025-01-06 18:29 UTC (permalink / raw)
To: Jann Horn
Cc: Rik van Riel, Paolo Bonzini, KVM list, Tom Lendacky, x86,
linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, Jan 06, 2025, Jann Horn wrote:
> +KVM/SVM folks in case they know more about how enabling CPU features
> interacts with virtualization; original patch is at
> https://lore.kernel.org/all/20241230175550.4046587-12-riel@surriel.com/
>
> On Sat, Jan 4, 2025 at 4:08 AM Rik van Riel <riel@surriel.com> wrote:
> > On Fri, 2025-01-03 at 18:49 +0100, Jann Horn wrote:
> > > On Mon, Dec 30, 2024 at 6:53 PM Rik van Riel <riel@surriel.com>
> > > > only those upper-level entries that lead to the target PTE in
> > > > the page table hierarchy, leaving unrelated upper-level entries
> > > > intact.
> > >
> > > How does this patch interact with KVM SVM guests?
> > > In particular, will this patch cause TLB flushes performed by guest
> > > kernels to behave differently?
No. EFER is context switched by hardware on VMRUN and #VMEXIT, i.e. the guest
runs with its own EFER, and thus will get the targeted flushes if and only if
the hypervisor virtualizes EFER.TCE *and* the guest explicitly enables EFER.TCE.
> > That is a good question.
> >
> > A Linux guest should be fine, since Linux already flushes the parts of the
> > TLB where page tables are being freed.
> >
> > I don't know whether this could potentially break some non-Linux guests,
> > though.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2024-12-30 17:53 ` [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
` (2 preceding siblings ...)
2025-01-06 14:52 ` Nadav Amit
@ 2025-01-06 18:40 ` Dave Hansen
2025-01-12 2:36 ` Rik van Riel
3 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2025-01-06 18:40 UTC (permalink / raw)
To: Rik van Riel, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On 12/30/24 09:53, Rik van Riel wrote:
...
> +#ifdef CONFIG_CPU_SUP_AMD
> + struct list_head broadcast_asid_list;
> + u16 broadcast_asid;
> + bool asid_transition;
> +#endif
Could we either do:
config X86_TLB_FLUSH_BROADCAST_HW
	bool
	depends on CPU_SUP_AMD
or even
#define X86_TLB_FLUSH_BROADCAST_HW CONFIG_CPU_SUP_AMD
for the whole series please?
There are a non-trivial number of #ifdefs here and it would be nice to
know what they're for, logically.
This is a completely selfish request because Intel has a similar feature
and we're surely going to give this approach a try on Intel CPUs too.
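The struct fields above would then read as, say:

#ifdef CONFIG_X86_TLB_FLUSH_BROADCAST_HW
	struct list_head broadcast_asid_list;
	u16 broadcast_asid;
	bool asid_transition;
#endif

which at least spells out *why* the block is conditional, even before
any Intel support shows up.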
Second, is there something that prevents you from defining a new
MM_CONTEXT_* flag instead of a new bool? It might save bloating the
context by a few words.
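For instance, something along these lines (the flag name and bit number
are invented; the existing MM_CONTEXT_* flags live in asm/mmu.h):

/* mm is transitioning to/from a broadcast ASID (hypothetical bit) */
#define MM_CONTEXT_ASID_TRANSITION	4

static inline bool mm_in_asid_transition(struct mm_struct *mm)
{
	return test_bit(MM_CONTEXT_ASID_TRANSITION, &mm->context.flags);
}

static inline void mm_set_asid_transition(struct mm_struct *mm)
{
	set_bit(MM_CONTEXT_ASID_TRANSITION, &mm->context.flags);
}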
> #ifdef CONFIG_ADDRESS_MASKING
> /* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
> unsigned long lam_cr3_mask;
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 795fdd53bd0a..0dc446c427d2 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
> #define enter_lazy_tlb enter_lazy_tlb
> extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
>
> +extern void destroy_context_free_broadcast_asid(struct mm_struct *mm);
> +
> /*
> * Init a new mm. Used on mm copies, like at fork()
> * and on mm's that are brand-new, like at execve().
> @@ -161,6 +163,13 @@ static inline int init_new_context(struct task_struct *tsk,
> mm->context.execute_only_pkey = -1;
> }
> #endif
> +
> +#ifdef CONFIG_CPU_SUP_AMD
> + INIT_LIST_HEAD(&mm->context.broadcast_asid_list);
> + mm->context.broadcast_asid = 0;
> + mm->context.asid_transition = false;
> +#endif
We've been inconsistent about it, but I think I'd prefer that this had a:
if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
...
}
wrapper as opposed to CONFIG_CPU_SUP_AMD. It might save dirtying a
cacheline on all the CPUs that don't care. cpu_feature_enabled() would
also function the same as the #ifdef.
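In other words, something like (same fields as in the patch, just the
runtime check instead of the #ifdef):

	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
		INIT_LIST_HEAD(&mm->context.broadcast_asid_list);
		mm->context.broadcast_asid = 0;
		mm->context.asid_transition = false;
	}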
> mm_reset_untag_mask(mm);
> init_new_context_ldt(mm);
> return 0;
> @@ -170,6 +179,9 @@ static inline int init_new_context(struct task_struct *tsk,
> static inline void destroy_context(struct mm_struct *mm)
> {
> destroy_context_ldt(mm);
> +#ifdef CONFIG_CPU_SUP_AMD
> + destroy_context_free_broadcast_asid(mm);
> +#endif
> }
>
> extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 20074f17fbcd..5e9956af98d1 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -65,6 +65,23 @@ static inline void cr4_clear_bits(unsigned long mask)
> */
> #define TLB_NR_DYN_ASIDS 6
>
> +#ifdef CONFIG_CPU_SUP_AMD
> +#define is_dyn_asid(asid) (asid) < TLB_NR_DYN_ASIDS
> +#define is_broadcast_asid(asid) (asid) >= TLB_NR_DYN_ASIDS
> +#define in_asid_transition(info) (info->mm && info->mm->context.asid_transition)
> +#define mm_broadcast_asid(mm) (mm->context.broadcast_asid)
> +#else
> +#define is_dyn_asid(asid) true
> +#define is_broadcast_asid(asid) false
> +#define in_asid_transition(info) false
> +#define mm_broadcast_asid(mm) 0
I think it was said elsewhere, but I also prefer static inlines for
these instead of macros. The type checking that you get from the
compiler in _both_ compile configurations is much more valuable than
brevity.
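For example, a sketch of the static-inline form, mirroring the macros
quoted above:

#ifdef CONFIG_CPU_SUP_AMD
static inline bool is_dyn_asid(u16 asid)
{
	return asid < TLB_NR_DYN_ASIDS;
}

static inline bool is_broadcast_asid(u16 asid)
{
	return asid >= TLB_NR_DYN_ASIDS;
}

static inline u16 mm_broadcast_asid(struct mm_struct *mm)
{
	return mm->context.broadcast_asid;
}
#else
static inline bool is_dyn_asid(u16 asid)		{ return true; }
static inline bool is_broadcast_asid(u16 asid)		{ return false; }
static inline u16 mm_broadcast_asid(struct mm_struct *mm) { return 0; }
#endif

in_asid_transition() would follow the same pattern, and both
configurations then get type checking on 'asid' and 'mm'.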
...
> + /*
> + * TLB consistency for this ASID is maintained with INVLPGB;
> + * TLB flushes happen even while the process isn't running.
> + */
I'm not sure this comment helps much. The thing that matters here is
that a broadcast ASID is assigned from a global namespace and not from a
per-cpu namespace.
> +#ifdef CONFIG_CPU_SUP_AMD
> + if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_broadcast_asid(next)) {
> + *new_asid = mm_broadcast_asid(next);
> + *need_flush = false;
> + return;
> + }
> +#endif
> +
> if (this_cpu_read(cpu_tlbstate.invalidate_other))
> clear_asid_other();
>
> @@ -251,6 +265,245 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> *need_flush = true;
> }
>
> +#ifdef CONFIG_CPU_SUP_AMD
> +/*
> + * Logic for AMD INVLPGB support.
> + */
This comment is another indication that this shouldn't all be crammed
under CONFIG_CPU_SUP_AMD.
> +static DEFINE_RAW_SPINLOCK(broadcast_asid_lock);
> +static u16 last_broadcast_asid = TLB_NR_DYN_ASIDS;
> +static DECLARE_BITMAP(broadcast_asid_used, MAX_ASID_AVAILABLE) = { 0 };
I'm debating whether this should be a bitmap for "broadcast" ASIDs alone
or for all ASIDs.
> +static LIST_HEAD(broadcast_asid_list);
> +static int broadcast_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +static void reset_broadcast_asid_space(void)
> +{
> + mm_context_t *context;
> +
> + lockdep_assert_held(&broadcast_asid_lock);
> +
> + /*
> + * Flush once when we wrap around the ASID space, so we won't need
> + * to flush every time we allocate an ASID for boradcast flushing.
^ broadcast
> + */
> + invlpgb_flush_all_nonglobals();
> + tlbsync();
> +
> + /*
> + * Leave the currently used broadcast ASIDs set in the bitmap, since
> + * those cannot be reused before the next wraparound and flush.
> + */
> + bitmap_clear(broadcast_asid_used, 0, MAX_ASID_AVAILABLE);
> + list_for_each_entry(context, &broadcast_asid_list, broadcast_asid_list)
> + __set_bit(context->broadcast_asid, broadcast_asid_used);
> +
> + last_broadcast_asid = TLB_NR_DYN_ASIDS;
> +}
'TLB_NR_DYN_ASIDS' is special here. Could it please be made more clear
what it means *logically*?
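Even something as small as (name invented here):

/*
 * ASIDs below TLB_NR_DYN_ASIDS are the per-cpu, dynamically recycled
 * ones; ASIDs from TLB_NR_DYN_ASIDS up to MAX_ASID_AVAILABLE are
 * handed out from the global broadcast space.
 */
#define MIN_BROADCAST_ASID	TLB_NR_DYN_ASIDS

would make "last_broadcast_asid = MIN_BROADCAST_ASID;" read a lot more
obviously than a bare TLB_NR_DYN_ASIDS.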
> +static u16 get_broadcast_asid(void)
> +{
> + lockdep_assert_held(&broadcast_asid_lock);
> +
> + do {
> + u16 start = last_broadcast_asid;
> + u16 asid = find_next_zero_bit(broadcast_asid_used, MAX_ASID_AVAILABLE, start);
> +
> + if (asid >= MAX_ASID_AVAILABLE) {
> + reset_broadcast_asid_space();
> + continue;
> + }
> +
> + /* Try claiming this broadcast ASID. */
> + if (!test_and_set_bit(asid, broadcast_asid_used)) {
> + last_broadcast_asid = asid;
> + return asid;
> + }
> + } while (1);
> +}
I think it was said elsewhere, but the "try" logic doesn't make a lot of
sense to me when it's all protected by a global lock.
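With the lock held for the whole thing, something like this looks
sufficient (a sketch only, and it assumes the broadcast_asid_available
accounting guarantees a free slot by the time we get here):

static u16 get_broadcast_asid(void)
{
	u16 asid;

	lockdep_assert_held(&broadcast_asid_lock);

	asid = find_next_zero_bit(broadcast_asid_used, MAX_ASID_AVAILABLE,
				  last_broadcast_asid);
	if (asid >= MAX_ASID_AVAILABLE) {
		/* Wrapped around: flush, rebuild the bitmap, start over. */
		reset_broadcast_asid_space();
		asid = find_next_zero_bit(broadcast_asid_used,
					  MAX_ASID_AVAILABLE,
					  TLB_NR_DYN_ASIDS);
	}

	__set_bit(asid, broadcast_asid_used);
	last_broadcast_asid = asid;
	return asid;
}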
> +/*
> + * Returns true if the mm is transitioning from a CPU-local ASID to a broadcast
> + * (INVLPGB) ASID, or the other way around.
> + */
> +static bool needs_broadcast_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> + u16 broadcast_asid = mm_broadcast_asid(next);
> +
> + if (broadcast_asid && prev_asid != broadcast_asid)
> + return true;
> +
> + if (!broadcast_asid && is_broadcast_asid(prev_asid))
> + return true;
> +
> + return false;
> +}
> +
> +void destroy_context_free_broadcast_asid(struct mm_struct *mm)
> +{
> + if (!mm->context.broadcast_asid)
> + return;
> +
> + guard(raw_spinlock_irqsave)(&broadcast_asid_lock);
> + mm->context.broadcast_asid = 0;
> + list_del(&mm->context.broadcast_asid_list);
> + broadcast_asid_available++;
> +}
> +
> +static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
> +{
This function is pretty important. It's kinda missing a comment about
its theory of operation.
> + int count = 0;
> + int cpu;
> +
> + if (cpumask_weight(mm_cpumask(mm)) <= threshold)
> + return false;
There's a lot of potential redundancy between this check and the one
below. I assume this sequence was desinged for performance: first, do a
cheap, one-stop-shopping check on mm_cpumask(). If it looks, ok, then go
marauding around in a bunch of per_cpu() cachelines in a much more
expensive but precise search.
Could we spell some of that out explicitly, please?
> + for_each_cpu(cpu, mm_cpumask(mm)) {
> + /* Skip the CPUs that aren't really running this process. */
> + if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> + continue;
This is the only place I know of where 'cpu_tlbstate' is read from a
non-local CPU. This is fundamentally racy as hell and needs some heavy
commenting about why this raciness is OK.
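To be concrete, the kind of comment I'm after is roughly this (wording
mine, and the "heuristic only" part is my reading of how the result
gets used):

/*
 * Check whether this mm is actively running on more than 'threshold'
 * CPUs.
 *
 * mm_cpumask() is a cheap, conservative over-estimate: bits are only
 * cleared lazily, so it may include CPUs that stopped running this mm
 * a while ago.  Only when that coarse check passes do we pay for the
 * remote cpu_tlbstate.loaded_mm reads to get a precise count.
 *
 * Those remote reads are racy, but the result only feeds the
 * heuristic that decides whether to hand this mm a broadcast ASID;
 * a stale answer does not affect the correctness of the TLB flushes
 * themselves.
 */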
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3 00/12] AMD broadcast TLB invalidation
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
` (11 preceding siblings ...)
2024-12-30 17:53 ` [PATCH 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
@ 2025-01-06 19:03 ` Dave Hansen
2025-01-12 2:46 ` Rik van Riel
2025-01-06 22:49 ` Yosry Ahmed
13 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2025-01-06 19:03 UTC (permalink / raw)
To: Rik van Riel, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
A couple of high level things we need to address:
First, I'm OK calling this approach "broadcast TLB invalidation". But I
don't think the ASIDs should be called "broadcast ASIDs". I'd much
rather that they are called something which makes it clear that they are
from a different namespace than the existing ASIDs.
After this series there will be three classes:
  0:                    Special ASID used for the kernel, basically
  1->TLB_NR_DYN_ASIDS:  Allocated from private, per-cpu space.
                        Meaningless when compared between CPUs.
  >TLB_NR_DYN_ASIDS:    Allocated from shared, kernel-wide space.
                        All CPUs share this space and must all agree
                        on what the values mean.
The fact that the "shared" ones are system-wide obviously allows INVLPGB
to be used. The hardware feature also obviously "broadcasts" things more
than plain old INVLPG did. But I don't think that makes the ASIDs
"broadcast" ASIDs.
It's much more important to know that they are shared across the system
instead of per-cpu than the fact that the deep implementation manages
them with an instruction that is "broadcast" by hardware.
So can we call them "global", "shared" or "system" ASIDs, please?
Second, the TLB_NR_DYN_ASIDS was picked because it's roughly the number
of distinct PCIDs that the CPU can keep in the TLB at once (at least on
Intel). Let's say a CPU has 6 mm's in the per-cpu ASID space and another
6 in the shared/broadcast space. At that point, PCIDs might not be doing
much good because the TLB can't store entries for 12 PCIDs.
Is there any consideration of this in the series? Should we be indexing
cpu_tlbstate.ctxs[] by a *context* number rather than by the ASID that
it's running as?
Last, I'm not 100% convinced we want to do this whole thing. The
will-it-scale numbers are nice. But given the complexity of this, I
think we need some actual, real end users to stand up and say exactly
how this is important in *PRODUCTION* to them.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3 00/12] AMD broadcast TLB invalidation
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
` (12 preceding siblings ...)
2025-01-06 19:03 ` [PATCH v3 00/12] AMD broadcast TLB invalidation Dave Hansen
@ 2025-01-06 22:49 ` Yosry Ahmed
2025-01-07 3:25 ` Rik van Riel
13 siblings, 1 reply; 89+ messages in thread
From: Yosry Ahmed @ 2025-01-06 22:49 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Reiji Watanabe, Brendan Jackman
On Mon, Dec 30, 2024 at 9:57 AM Rik van Riel <riel@surriel.com> wrote:
>
> Subject: [RFC PATCH 00/10] AMD broadcast TLB invalidation
>
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
>
> This allows the kernel to invalidate TLB entries on remote CPUs without
> needing to send IPIs, without having to wait for remote CPUs to handle
> those interrupts, and with less interruption to what was running on
> those CPUs.
>
> Because x86 PCID space is limited, and there are some very large
> systems out there, broadcast TLB invalidation is only used for
> processes that are active on 3 or more CPUs, with the threshold
> being gradually increased the more the PCID space gets exhausted.
>
> Combined with the removal of unnecessary lru_add_drain calls
> (see https://lkml.org/lkml/2024/12/19/1388) this results in a
> nice performance boost for the will-it-scale tlb_flush2_threads
> test on an AMD Milan system with 36 cores:
>
> - vanilla kernel: 527k loops/second
> - lru_add_drain removal: 731k loops/second
> - only INVLPGB: 527k loops/second
> - lru_add_drain + INVLPGB: 1157k loops/second
>
> Profiling with only the INVLPGB changes showed while
> TLB invalidation went down from 40% of the total CPU
> time to only around 4% of CPU time, the contention
> simply moved to the LRU lock.
We briefly looked at using INVLPGB/TLBSYNC as part of the ASI work to
optimize away the async freeing logic which sends TLB flush IPIs.
I have a high-level question about INVLPGB/TLBSYNC that I could not
immediately find the answer to in the AMD manual. Sorry if I missed
the answer or if I missed something obvious.
Do we know what the underlying mechanism for delivering the TLB
flushes is? If a CPU has interrupts disabled, does it still receive
the broadcast TLB flush request and handle it?
My main concern is that TLBSYNC is a single instruction that seems
like it will wait for an arbitrary amount of time, and IIUC interrupts
(and NMIs) will not be delivered to the running CPU until after the
instruction completes execution (only at an instruction boundary).
Are there any guarantees about other CPUs handling the broadcast TLB
flush in a timely manner, or an explanation of how CPUs handle the
incoming requests in general?
>
> Fixing both at the same time about doubles the
> number of iterations per second from this case.
>
> v3:
> - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
> - More suggested cleanups and changelog fixes by Peter and Nadav
> v2:
> - Apply suggestions by Peter and Borislav (thank you!)
> - Fix bug in arch_tlbbatch_flush, where we need to do both
> the TLBSYNC, and flush the CPUs that are in the cpumask.
> - Some updates to comments and changelogs based on questions.
>
>
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3 00/12] AMD broadcast TLB invalidation
2025-01-06 22:49 ` Yosry Ahmed
@ 2025-01-07 3:25 ` Rik van Riel
2025-01-08 1:36 ` Yosry Ahmed
0 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2025-01-07 3:25 UTC (permalink / raw)
To: Yosry Ahmed
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Reiji Watanabe, Brendan Jackman
On Mon, 2025-01-06 at 14:49 -0800, Yosry Ahmed wrote:
>
> We briefly looked at using INVLPGB/TLBSYNC as part of the ASI work to
> optimize away the async freeing logic which sends TLB flush IPIs.
>
> I have a high-level question about INVLPGB/TLBSYNC that I could not
> immediately find the answer to in the AMD manual. Sorry if I missed
> the answer or if I missed something obvious.
>
> Do we know what the underlying mechanism for delivering the TLB
> flushes is? If a CPU has interrupts disabled, does it still receive
> the broadcast TLB flush request and handle it?
I assume TLB invalidation is probably handled similarly
to how cache coherency is handled between CPUs.
However, it probably does not need to be quite as fast,
since cache coherency traffic is probably 2-6 orders of
magnitude more common than TLB invalidation traffic.
>
> My main concern is that TLBSYNC is a single instruction that seems
> like it will wait for an arbitrary amount of time, and IIUC
> interrupts
> (and NMIs) will not be delivered to the running CPU until after the
> instruction completes execution (only at an instruction boundary).
>
> Are there any guarantees about other CPUs handling the broadcast TLB
> flush in a timely manner, or an explanation of how CPUs handle the
> incoming requests in general?
The performance numbers I got with the tlb_flush2_threads
microbenchmark strongly suggest that INVLPGB flushes are
handled by the receiving CPUs even while interrupts are
disabled.
CPU time spent in flush_tlb_mm_range goes down with
INVLPGB, compared with IPI based TLB flushing, even when
the IPIs only go to a subset of CPUs.
I have no idea whether the invalidation is handled by
something like microcode in the CPU, by the (more
external?) logic that handles cache coherency, or
something else entirely.
I suspect AMD wouldn't tell us exactly ;)
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3 00/12] AMD broadcast TLB invalidation
2025-01-07 3:25 ` Rik van Riel
@ 2025-01-08 1:36 ` Yosry Ahmed
2025-01-09 2:25 ` Andrew Cooper
2025-01-09 2:47 ` Andrew Cooper
0 siblings, 2 replies; 89+ messages in thread
From: Yosry Ahmed @ 2025-01-08 1:36 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm,
Reiji Watanabe, Brendan Jackman
On Mon, Jan 6, 2025 at 7:25 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Mon, 2025-01-06 at 14:49 -0800, Yosry Ahmed wrote:
> >
> > We briefly looked at using INVLPGB/TLBSYNC as part of the ASI work to
> > optimize away the async freeing logic which sends TLB flush IPIs.
> >
> > I have a high-level question about INVLPGB/TLBSYNC that I could not
> > immediately find the answer to in the AMD manual. Sorry if I missed
> > the answer or if I missed something obvious.
> >
> > Do we know what the underlying mechanism for delivering the TLB
> > flushes is? If a CPU has interrupts disabled, does it still receive
> > the broadcast TLB flush request and handle it?
>
> I assume TLB invalidation is probably handled similarly
> to how cache coherency is handled between CPUs.
>
> However, it probably does not need to be quite as fast,
> since cache coherency traffic is probably 2-6 orders of
> magnitude more common than TLB invalidation traffic.
>
> >
> > My main concern is that TLBSYNC is a single instruction that seems
> > like it will wait for an arbitrary amount of time, and IIUC
> > interrupts
> > (and NMIs) will not be delivered to the running CPU until after the
> > instruction completes execution (only at an instruction boundary).
> >
> > Are there any guarantees about other CPUs handling the broadcast TLB
> > flush in a timely manner, or an explanation of how CPUs handle the
> > incoming requests in general?
>
> The performance numbers I got with the tlb_flush2_threads
> microbenchmark strongly suggest that INVLPGB flushes are
> handled by the receiving CPUs even while interrupts are
> disabled.
>
> CPU time spent in flush_tlb_mm_range goes down with
> INVLPGB, compared with IPI based TLB flushing, even when
> the IPIs only go to a subset of CPUs.
>
> I have no idea whether the invalidation is handled by
> something like microcode in the CPU, by the (more
> external?) logic that handles cache coherency, or
> something else entirely.
>
> I suspect AMD wouldn't tell us exactly ;)
Well, ideally they would just tell us the conditions under which CPUs
respond to the broadcast TLB flush or the expectations around latency.
I am also wondering if a CPU can respond to an INVLPGB while running
TLBSYNC, specifically if it's possible for two CPUs to send broadcasts
to one another and then execute TLBSYNC to wait for each other. Could
this lead to a deadlock? I think the answer is no but we have little
understanding about what's going on under the hood to know for sure
(or at least I do).
>
> --
> All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3 00/12] AMD broadcast TLB invalidation
2025-01-08 1:36 ` Yosry Ahmed
@ 2025-01-09 2:25 ` Andrew Cooper
2025-01-09 2:47 ` Andrew Cooper
1 sibling, 0 replies; 89+ messages in thread
From: Andrew Cooper @ 2025-01-09 2:25 UTC (permalink / raw)
To: yosryahmed
Cc: akpm, bp, dave.hansen, hpa, jackmanb, kernel-team, linux-kernel,
linux-mm, luto, mingo, nadav.amit, peterz, reijiw, riel, tglx,
x86, zhengqi.arch
>> I suspect AMD wouldn't tell us exactly ;)
>
> Well, ideally they would just tell us the conditions under which CPUs
> respond to the broadcast TLB flush or the expectations around latency.
Disclaimer. I'm not at AMD; I don't know how they implement it; I'm
just a random person on the internet. But, here are a few things that
might be relevant to know.
AMD's SEV-SNP whitepaper [1] states that RMP permissions "are cached in
the CPU TLB and related structures" and also "When required, hardware
automatically performs TLB invalidations to ensure that all processors
in the system see the updated RMP entry information."
That sentence doesn't use "broadcast" or "remote", but "all processors"
is a pretty clear clue. Broadcast TLB invalidations are a building
block of all the RMP-manipulation instructions.
Furthermore, to be useful in this context, they need to be ordered with
memory. Specifically, a new pagewalk mustn't start after an
invalidation, yet observe the stale RMP entry.
x86 CPUs do have reasonable forward-progress guarantees, but in order to
achieve forward progress, they need to e.g. guarantee that one memory
access doesn't displace the TLB entry backing a different memory access
from the same instruction, or you could livelock while trying to
complete a single instruction.
A consequence is that you can't safely invalidate a TLB entry of an
in-progress instruction (although this means only the oldest instruction
in the pipeline, because everything else is speculative and potentially
transient).
INVLPGB invalidations are interrupt-like from the point of view of the
remote core, but can be processed
~Andrew
[1]
https://www.amd.com/content/dam/amd/en/documents/epyc-business-docs/white-papers/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3 00/12] AMD broadcast TLB invalidation
2025-01-08 1:36 ` Yosry Ahmed
2025-01-09 2:25 ` Andrew Cooper
@ 2025-01-09 2:47 ` Andrew Cooper
2025-01-09 21:32 ` Yosry Ahmed
1 sibling, 1 reply; 89+ messages in thread
From: Andrew Cooper @ 2025-01-09 2:47 UTC (permalink / raw)
To: yosryahmed
Cc: akpm, bp, dave.hansen, hpa, jackmanb, kernel-team, linux-kernel,
linux-mm, luto, mingo, nadav.amit, peterz, reijiw, riel, tglx,
x86, zhengqi.arch
>> I suspect AMD wouldn't tell us exactly ;)
>
> Well, ideally they would just tell us the conditions under which CPUs
> respond to the broadcast TLB flush or the expectations around latency.
[Resend, complete this time]
Disclaimer. I'm not at AMD; I don't know how they implement it; I'm
just a random person on the internet. But, here are a few things that
might be relevant to know.
AMD's SEV-SNP whitepaper [1] states that RMP permissions "are cached in
the CPU TLB and related structures" and also "When required, hardware
automatically performs TLB invalidations to ensure that all processors
in the system see the updated RMP entry information."
That sentence doesn't use "broadcast" or "remote", but "all processors"
is a pretty clear clue. Broadcast TLB invalidations are a building
block of all the RMP-manipulation instructions.
Furthermore, to be useful in this context, they need to be ordered with
memory. Specifically, a new pagewalk mustn't start after an
invalidation, yet observe the stale RMP entry.
x86 CPUs do have reasonable forward-progress guarantees, but in order to
achieve forward progress, they need to e.g. guarantee that one memory
access doesn't displace the TLB entry backing a different memory access
from the same instruction, or you could livelock while trying to
complete a single instruction.
A consequence is that you can't safely invalidate a TLB entry of an
in-progress instruction (although this means only the oldest instruction
in the pipeline, because everything else is speculative and potentially
transient).
INVLPGB invalidations are interrupt-like from the point of view of the
remote core, but are microarchitectural and can be taken irrespective of
the architectural Interrupt and Global Interrupt Flags. As a
consequence, they'll need to wait until an instruction boundary to be
processed. While not AMD, the Intel RAR whitepaper [2] discusses the
handling of RARs on the remote processor, and they share a number of
constraints in common with INVLPGB.
Overall, I'd expect the INVLPGB instructions to be pretty quick in and
of themselves; interestingly, they're not identified as architecturally
serialising. The broadcast is probably posted, and will be dealt with
by remote processors on the subsequent instruction boundary. TLBSYNC is
the barrier to wait until the invalidations have been processed, and
this will block for an unspecified length of time, probably bounded by
the "longest" instruction in progress on a remote CPU. e.g. I expect it
probably will suck if you have to wait for a WBINVD instruction to
complete on a remote CPU.
That said, architectural IPIs have the same conditions too, except on
top of that you've got to run a whole interrupt handler. So, with
reasonable confidence, however slow TLBSYNC might be in the worst case,
it's got absolutely nothing on the overhead of doing invalidations the
old fashioned way.
~Andrew
[1]
https://www.amd.com/content/dam/amd/en/documents/epyc-business-docs/white-papers/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf
[2]
https://www.intel.com/content/dam/develop/external/us/en/documents/341431-remote-action-request-white-paper.pdf
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2025-01-06 17:21 ` Dave Hansen
@ 2025-01-09 20:16 ` Rik van Riel
2025-01-09 21:18 ` Dave Hansen
0 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2025-01-09 20:16 UTC (permalink / raw)
To: Dave Hansen, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, 2025-01-06 at 09:21 -0800, Dave Hansen wrote:
> On 12/30/24 09:53, Rik van Riel wrote:
>
>
> > +static void broadcast_kernel_range_flush(unsigned long start,
> > unsigned long end)
> > +{
> > + unsigned long addr;
> > + unsigned long maxnr = invlpgb_count_max;
> > + unsigned long threshold = tlb_single_page_flush_ceiling *
> > maxnr;
>
> The 'tlb_single_page_flush_ceiling' value was determined by looking
> at
> _local_ invalidation cost. Could you talk a bit about why it's also a
> good value to use for remote invalidations? Does it hold up for
> INVLPGB
> the same way it did for good ol' INVLPG? Has there been any explicit
> testing here to find a good value?
>
> I'm also confused by the multiplication here. Let's say
> invlpgb_count_max==20 and tlb_single_page_flush_ceiling==30.
>
> You would need to switch away from single-address invalidation when
> the
> number of addresses is >20 for INVLPGB functional reasons. But you'd
> also need to switch away when >30 for performance reasons
> (tlb_single_page_flush_ceiling).
>
> But I don't understand how that would make the threshold 20*30=600
> invalidations.
I have not done any measurement to see how
flushing with INVLPGB stacks up versus
local TLB flushes.
What makes INVLPGB potentially slower:
- These flushes are done globally
What makes INVLPGB potentially faster:
- Multiple flushes can be pending simultaneously,
and executed in any convenient order by the CPUs.
- Wait once on completion of all the queued flushes.
Another thing that makes things interesting is the
TLB entry coalescing done by AMD CPUs.
When multiple pages are both virtually and physically
contiguous in memory (which is fairly common), the
CPU can use a single TLB entry to map up to 8 of them.
That means if we issue eg. 20 INVLPGB flushes for
8 4kB pages each, instead of the CPUs needing to
remove 160 TLB entries, there might only be 50.
I just guessed at the numbers used in my code,
while trying to sort out the details elsewhere
in the code.
How should we go about measuring the tradeoffs
between invalidation time, and the time spent
in TLB misses from flushing unnecessary stuff?
>
> > + /*
> > + * TLBSYNC only waits for flushes originating on the same
> > CPU.
> > + * Disabling migration allows us to wait on all flushes.
> > + */
>
> Imperative voice here too, please:
>
> Disable migration to wait on all flushes.
>
> > + guard(preempt)();
> > +
> > + if (end == TLB_FLUSH_ALL ||
> > + (end - start) > threshold << PAGE_SHIFT) {
>
> This is basically a copy-and-paste of the "range vs. global" flush
> logic, but taking 'invlpgb_count_max' into account.
>
> It would be ideal if those limit checks could be consolidated. I
> suspect
> that when the 'threshold' calculation above gets clarified that they
> may
> be easier to consolidate.
Maybe?
I implemented another suggestion in the code,
and the start of flush_tlb_kernel_range() now
looks like this:
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
if (broadcast_kernel_range_flush(start, end))
return;
If we are to consolidate the limit check, it
should probably be in a helper function somewhere,
and not by spreading out the broadcast flush calls.
>
> BTW, what is a typical value for 'invlpgb_count_max'? Is it more or
> less
> than the typical value for 'tlb_single_page_flush_ceiling'?
>
> Maybe we should just lower 'tlb_single_page_flush_ceiling' if
> 'invlpgb_count_max' falls below it so we only have _one_ runtime
> value
> to consider.
The value for invlpgb_count_max on both Milan
and Bergamo CPUs appears to be 8. That is, the
CPU reports we can flush 7 additional pages
beyond a single page.
This matches the number of PTEs that can be
cached in one TLB entry if they are contiguous
and aligned, and matches one cache line worth
of PTE entries.
>
>
> > + invlpgb_flush_all();
> > + } else {
> > + unsigned long nr;
> > + for (addr = start; addr < end; addr += nr <<
> > PAGE_SHIFT) {
> > + nr = min((end - addr) >> PAGE_SHIFT,
> > maxnr);
> > + invlpgb_flush_addr(addr, nr);
> > + }
> > + }
> > +
> > + tlbsync();
> > +}
> > +
> > static void do_kernel_range_flush(void *info)
> > {
> > struct flush_tlb_info *f = info;
> > @@ -1089,6 +1115,11 @@ static void do_kernel_range_flush(void
> > *info)
> >
> > void flush_tlb_kernel_range(unsigned long start, unsigned long
> > end)
> > {
> > + if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> > + broadcast_kernel_range_flush(start, end);
> > + return;
> > + }
> > +
> > /* Balance as user space task's flush, a bit conservative
> > */
> > if (end == TLB_FLUSH_ALL ||
> > (end - start) > tlb_single_page_flush_ceiling <<
> > PAGE_SHIFT) {
>
> I also wonder if this would all get simpler if we give in and
> *always*
> call get_flush_tlb_info(). That would provide a nice single place to
> consolidate the "all vs. ranged" flush logic.
Possibly. That might be a good way to unify that
threshold check?
That should probably be a separate patch, though.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2025-01-09 20:16 ` Rik van Riel
@ 2025-01-09 21:18 ` Dave Hansen
2025-01-10 5:31 ` Rik van Riel
2025-01-10 6:07 ` Nadav Amit
0 siblings, 2 replies; 89+ messages in thread
From: Dave Hansen @ 2025-01-09 21:18 UTC (permalink / raw)
To: Rik van Riel, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On 1/9/25 12:16, Rik van Riel wrote:
> On Mon, 2025-01-06 at 09:21 -0800, Dave Hansen wrote:
>> On 12/30/24 09:53, Rik van Riel wrote:
>>
>>
>>> +static void broadcast_kernel_range_flush(unsigned long start,
>>> unsigned long end)
>>> +{
>>> + unsigned long addr;
>>> + unsigned long maxnr = invlpgb_count_max;
>>> + unsigned long threshold = tlb_single_page_flush_ceiling *
>>> maxnr;
>>
>> The 'tlb_single_page_flush_ceiling' value was determined by
>> looking at _local_ invalidation cost. Could you talk a bit about
>> why it's also a good value to use for remote invalidations? Does it
>> hold up for INVLPGB the same way it did for good ol' INVLPG? Has
>> there been any explicit testing here to find a good value?
>>
>> I'm also confused by the multiplication here. Let's say
>> invlpgb_count_max==20 and tlb_single_page_flush_ceiling==30.
>>
>> You would need to switch away from single-address invalidation
>> when the number of addresses is >20 for INVLPGB functional reasons.
>> But you'd also need to switch away when >30 for performance
>> reasons (tlb_single_page_flush_ceiling).
>>
>> But I don't understand how that would make the threshold 20*30=600
>> invalidations.
>
> I have not done any measurement to see how
> flushing with INVLPGB stacks up versus
> local TLB flushes.
>
> What makes INVLPGB potentially slower:
> - These flushes are done globally
>
> What makes INVLPGB potentially faster:
> - Multiple flushes can be pending simultaneously,
> and executed in any convenient order by the CPUs.
> - Wait once on completion of all the queued flushes.
>
> Another thing that makes things interesting is the
> TLB entry coalescing done by AMD CPUs.
>
> When multiple pages are both virtually and physically
> contiguous in memory (which is fairly common), the
> CPU can use a single TLB entry to map up to 8 of them.
>
> That means if we issue eg. 20 INVLPGB flushes for
> 8 4kB pages each, instead of the CPUs needing to
> remove 160 TLB entries, there might only be 50.
I honestly don't expect there to be any real difference in INVLPGB
execution on the sender side based on what the receivers have in their TLB.
> I just guessed at the numbers used in my code,
> while trying to sort out the details elsewhere
> in the code.
>
> How should we go about measuring the tradeoffs
> between invalidation time, and the time spent
> in TLB misses from flushing unnecessary stuff?
Well, we did a bunch of benchmarks for INVLPG. We could dig that back up
and repeat some of it.
But actually I think INVLPGB is *WAY* better than INVLPG here. INVLPG
doesn't have ranged invalidation. It will only architecturally
invalidate multiple 4K entries when the hardware fractured them in the
first place. I think we should probably take advantage of what INVLPGB
can do instead of following the INVLPG approach.
INVLPGB will invalidate a range no matter where the underlying entries
came from. Its "increment the virtual address at the 2M boundary" mode
will invalidate entries of any size. That's my reading of the docs at
least. Is that everyone else's reading too?
So, let's pick a number "Z" which is >= invlpgb_count_max. Z could
arguably be set to tlb_single_page_flush_ceiling. Then do this:
4k -> Z*4k => use 4k step
>Z*4k -> Z*2M => use 2M step
>Z*2M => invalidate everything
Invalidations <=Z*4k are exact. They never zap extra TLB entries.
Invalidations that use the 2M step *might* unnecessarily zap some extra
4k mappings in the last 2M, but this is *WAY* better than invalidating
everything.
"Invalidate everything" obviously stinks, but it should only be for
pretty darn big invalidations. This approach can also do a true ranged
INVLPGB for many more cases than the existing proposal. The only issue
would be if the 2M step is substantially more expensive than the 4k step.
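Concretely, something along these lines (invlpgb_flush_addr_2m() is an
invented name for a 2M-stride variant; the rest mirrors the code in the
patch):

static void broadcast_kernel_range_flush(unsigned long start,
					 unsigned long end)
{
	unsigned long addr, nr;
	unsigned long maxnr = invlpgb_count_max;
	unsigned long nr_4k = (end - start) >> PAGE_SHIFT;
	unsigned long z = max(tlb_single_page_flush_ceiling, maxnr);

	/* TLBSYNC only waits for flushes issued from this CPU. */
	guard(preempt)();

	if (end == TLB_FLUSH_ALL || nr_4k > z * (PMD_SIZE >> PAGE_SHIFT)) {
		invlpgb_flush_all();
	} else if (nr_4k > z) {
		/* 2M stride: may zap extra 4k entries in the last 2M. */
		for (addr = start; addr < end; addr += nr * PMD_SIZE) {
			nr = min(DIV_ROUND_UP(end - addr, PMD_SIZE), maxnr);
			invlpgb_flush_addr_2m(addr, nr);
		}
	} else {
		/* Exact: never invalidates unrelated entries. */
		for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
			nr = min((end - addr) >> PAGE_SHIFT, maxnr);
			invlpgb_flush_addr(addr, nr);
		}
	}

	tlbsync();
}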
...
>> I also wonder if this would all get simpler if we give in and
>> *always* call get_flush_tlb_info(). That would provide a nice
>> single place to consolidate the "all vs. ranged" flush logic.
>
> Possibly. That might be a good way to unify that threshold check?
>
> That should probably be a separate patch, though.
Yes, it should be part of refactoring that comes before the INVLPGB
enabling.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3 00/12] AMD broadcast TLB invalidation
2025-01-09 2:47 ` Andrew Cooper
@ 2025-01-09 21:32 ` Yosry Ahmed
2025-01-09 23:00 ` Andrew Cooper
0 siblings, 1 reply; 89+ messages in thread
From: Yosry Ahmed @ 2025-01-09 21:32 UTC (permalink / raw)
To: Andrew Cooper
Cc: akpm, bp, dave.hansen, hpa, jackmanb, kernel-team, linux-kernel,
linux-mm, luto, mingo, nadav.amit, peterz, reijiw, riel, tglx,
x86, zhengqi.arch
On Wed, Jan 8, 2025 at 6:47 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> >> I suspect AMD wouldn't tell us exactly ;)
> >
> > Well, ideally they would just tell us the conditions under which CPUs
> > respond to the broadcast TLB flush or the expectations around latency.
>
> [Resend, complete this time]
>
> Disclaimer. I'm not at AMD; I don't know how they implement it; I'm
> just a random person on the internet. But, here are a few things that
> might be relevant to know.
>
> AMD's SEV-SNP whitepaper [1] states that RMP permissions "are cached in
> the CPU TLB and related structures" and also "When required, hardware
> automatically performs TLB invalidations to ensure that all processors
> in the system see the updated RMP entry information."
>
> That sentence doesn't use "broadcast" or "remote", but "all processors"
> is a pretty clear clue. Broadcast TLB invalidations are a building
> block of all the RMP-manipulation instructions.
>
> Furthermore, to be useful in this context, they need to be ordered with
> memory. Specifically, a new pagewalk mustn't start after an
> invalidation, yet observe the stale RMP entry.
>
>
> x86 CPUs do have reasonable forward-progress guarantees, but in order to
> achieve forward progress, they need to e.g. guarantee that one memory
> access doesn't displace the TLB entry backing a different memory access
> from the same instruction, or you could livelock while trying to
> complete a single instruction.
>
> A consequence is that you can't safely invalidate a TLB entry of an
> in-progress instruction (although this means only the oldest instruction
> in the pipeline, because everything else is speculative and potentially
> transient).
>
>
> INVLPGB invalidations are interrupt-like from the point of view of the
> remote core, but are microarchitectural and can be taken irrespective of
> the architectural Interrupt and Global Interrupt Flags. As a
> consequence, they'll need to wait until an instruction boundary to be
> processed. While not AMD, the Intel RAR whitepaper [2] discusses the
> handling of RARs on the remote processor, and they share a number of
> constraints in common with INVLPGB.
>
>
> Overall, I'd expect the INVLPGB instructions to be pretty quick in and
> of themselves; interestingly, they're not identified as architecturally
> serialising. The broadcast is probably posted, and will be dealt with
> by remote processors on the subsequent instruction boundary. TLBSYNC is
> the barrier to wait until the invalidations have been processed, and
> this will block for an unspecified length of time, probably bounded by
> the "longest" instruction in progress on a remote CPU. e.g. I expect it
> probably will suck if you have to wait for a WBINVD instruction to
> complete on a remote CPU.
>
> That said, architectural IPIs have the same conditions too, except on
> top of that you've got to run a whole interrupt handler. So, with
> reasonable confidence, however slow TLBSYNC might be in the worst case,
> it's got absolutely nothing on the overhead of doing invalidations the
> old fashioned way.
Generally speaking, I am not arguing that TLB flush IPIs are worse
than INVLPGB/TLBSYNC; I think we should expect the latter to perform
better in most cases.
But there is a difference here because the processor executing TLBSYNC
cannot serve interrupts or NMIs while waiting for remote CPUs, because
they have to be served at an instruction boundary, right? Unless
TLBSYNC is an exception to that rule, or its execution is considered
completed before remote CPUs respond (i.e. the CPU executes it quickly
then enters into a wait doing "nothing").
There are also intriguing corner cases that are not documented. For
example, you mention that it's reasonable to expect that a remote CPU
does not serve TLBSYNC except at the instruction boundary. What if
that CPU is executing TLBSYNC? Do we have to wait for its execution to
complete? Is it possible to end up in a deadlock? This goes back to my
previous point about whether TLBSYNC is a special case or when it's
considered to have finished executing.
I am sure people thought about that and I am probably worried over
nothing, but there are few details here, so one has to speculate.
Again, sorry if I am making a fuss over nothing and it's all in my head.
>
>
> ~Andrew
>
> [1]
> https://www.amd.com/content/dam/amd/en/documents/epyc-business-docs/white-papers/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf
> [2]
> https://www.intel.com/content/dam/develop/external/us/en/documents/341431-remote-action-request-white-paper.pdf
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3 00/12] AMD broadcast TLB invalidation
2025-01-09 21:32 ` Yosry Ahmed
@ 2025-01-09 23:00 ` Andrew Cooper
2025-01-09 23:26 ` Yosry Ahmed
0 siblings, 1 reply; 89+ messages in thread
From: Andrew Cooper @ 2025-01-09 23:00 UTC (permalink / raw)
To: Yosry Ahmed
Cc: akpm, bp, dave.hansen, hpa, jackmanb, kernel-team, linux-kernel,
linux-mm, luto, mingo, nadav.amit, peterz, reijiw, riel, tglx,
x86, zhengqi.arch
On 09/01/2025 9:32 pm, Yosry Ahmed wrote:
> On Wed, Jan 8, 2025 at 6:47 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>> I suspect AMD wouldn't tell us exactly ;)
>>> Well, ideally they would just tell us the conditions under which CPUs
>>> respond to the broadcast TLB flush or the expectations around latency.
>> [Resend, complete this time]
>>
>> Disclaimer. I'm not at AMD; I don't know how they implement it; I'm
>> just a random person on the internet. But, here are a few things that
>> might be relevant to know.
>>
>> AMD's SEV-SNP whitepaper [1] states that RMP permissions "are cached in
>> the CPU TLB and related structures" and also "When required, hardware
>> automatically performs TLB invalidations to ensure that all processors
>> in the system see the updated RMP entry information."
>>
>> That sentence doesn't use "broadcast" or "remote", but "all processors"
>> is a pretty clear clue. Broadcast TLB invalidations are a building
>> block of all the RMP-manipulation instructions.
>>
>> Furthermore, to be useful in this context, they need to be ordered with
>> memory. Specifically, a new pagewalk mustn't start after an
>> invalidation, yet observe the stale RMP entry.
>>
>>
>> x86 CPUs do have reasonable forward-progress guarantees, but in order to
>> achieve forward progress, they need to e.g. guarantee that one memory
>> access doesn't displace the TLB entry backing a different memory access
>> from the same instruction, or you could livelock while trying to
>> complete a single instruction.
>>
>> A consequence is that you can't safely invalidate a TLB entry of an
>> in-progress instruction (although this means only the oldest instruction
>> in the pipeline, because everything else is speculative and potentially
>> transient).
>>
>>
>> INVLPGB invalidations are interrupt-like from the point of view of the
>> remote core, but are microarchitectural and can be taken irrespective of
>> the architectural Interrupt and Global Interrupt Flags. As a
>> consequence, they'll need wait until an instruction boundary to be
>> processed. While not AMD, the Intel RAR whitepaper [2] discusses the
>> handling of RARs on the remote processor, and they share a number of
>> constraints in common with INVLPGB.
>>
>>
>> Overall, I'd expect the INVLPGB instructions to be pretty quick in and
>> of themselves; interestingly, they're not identified as architecturally
>> serialising. The broadcast is probably posted, and will be dealt with
>> by remote processors on the subsequent instruction boundary. TLBSYNC is
>> the barrier to wait until the invalidations have been processed, and
>> this will block for an unspecified length of time, probably bounded by
>> the "longest" instruction in progress on a remote CPU. e.g. I expect it
>> probably will suck if you have to wait for a WBINVD instruction to
>> complete on a remote CPU.
>>
>> That said, architectural IPIs have the same conditions too, except on
>> top of that you've got to run a whole interrupt handler. So, with
>> reasonable confidence, however slow TLBSYNC might be in the worst case,
>> it's got absolutely nothing on the overhead of doing invalidations the
>> old fashioned way.
> Generally speaking, I am not arguing that TLB flush IPIs are worse
> than INVLPGB/TLBSYNC; I think we should expect the latter to perform
> better in most cases.
>
> But there is a difference here because the processor executing TLBSYNC
> cannot serve interrupts or NMIs while waiting for remote CPUs, because
> they have to be served at an instruction boundary, right?
That's as per the architecture, yes. NMIs do have to be served on
instruction boundaries. An NMI that becomes pending while a TLBSYNC is
in progress will have to wait until the TLBSYNC completes.
(Probably. REP string instructions and AVX scatter/gather have explicit
behaviours that let them be interrupted, and to continue from where
they left off when the interrupt handler returns. Depending on how
TLBSYNC is implemented, it's just possible it has this property too.)
> Unless
> TLBSYNC is an exception to that rule, or its execution is considered
> completed before remote CPUs respond (i.e. the CPU executes it quickly
> then enters into a wait doing "nothing").
>
> There are also intriguing corner cases that are not documented. For
> example, you mention that it's reasonable to expect that a remote CPU
> does not serve TLBSYNC except at the instruction boundary.
INVLPGB needs to wait for an instruction boundary in order to be processed.
All TLBSYNC needs to do is wait until it's certain that all the prior
INVLPGBs issued by this CPU have been serviced.
> What if
> that CPU is executing TLBSYNC? Do we have to wait for its execution to
> complete? Is it possible to end up in a deadlock? This goes back to my
> previous point about whether TLBSYNC is a special case or when it's
> considered to have finished executing.
Remember that the SEV-SNP instructions (PSMASH, PVALIDATE,
RMP{ADJUST,UPDATE,QUERY,READ}) have an INVLPGB/TLBSYNC pair under the
hood. You can execute these instructions on different CPUs in parallel.
It's certainly possible AMD missed something and there's a
deadlock case in there. But Google do offer SEV-SNP VMs and have the
data and scale to know whether such a deadlock is happening in practice.
>
> I am sure people thought about that and I am probably worried over
> nothing, but there are few details here, so one has to speculate.
>
> Again, sorry if I am making a fuss over nothing and it's all in my head.
It's absolutely a valid question to ask.
But x86 is full of longer delays than this. The GIF for example can
block NMIs until the hypervisor is complete with the world switch, and
it's left as an exercise to software not to abuse this. Taking an SMI
will be orders of magnitude more expensive than anything discussed here.
~Andrew
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3 00/12] AMD broadcast TLB invalidation
2025-01-09 23:00 ` Andrew Cooper
@ 2025-01-09 23:26 ` Yosry Ahmed
0 siblings, 0 replies; 89+ messages in thread
From: Yosry Ahmed @ 2025-01-09 23:26 UTC (permalink / raw)
To: Andrew Cooper
Cc: akpm, bp, dave.hansen, hpa, jackmanb, kernel-team, linux-kernel,
linux-mm, luto, mingo, nadav.amit, peterz, reijiw, riel, tglx,
x86, zhengqi.arch
On Thu, Jan 9, 2025 at 3:00 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 09/01/2025 9:32 pm, Yosry Ahmed wrote:
> > On Wed, Jan 8, 2025 at 6:47 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >>>> I suspect AMD wouldn't tell us exactly ;)
> >>> Well, ideally they would just tell us the conditions under which CPUs
> >>> respond to the broadcast TLB flush or the expectations around latency.
> >> [Resend, complete this time]
> >>
> >> Disclaimer. I'm not at AMD; I don't know how they implement it; I'm
> >> just a random person on the internet. But, here are a few things that
> >> might be relevant to know.
> >>
> >> AMD's SEV-SNP whitepaper [1] states that RMP permissions "are cached in
> >> the CPU TLB and related structures" and also "When required, hardware
> >> automatically performs TLB invalidations to ensure that all processors
> >> in the system see the updated RMP entry information."
> >>
> >> That sentence doesn't use "broadcast" or "remote", but "all processors"
> >> is a pretty clear clue. Broadcast TLB invalidations are a building
> >> block of all the RMP-manipulation instructions.
> >>
> >> Furthermore, to be useful in this context, they need to be ordered with
> >> memory. Specifically, a new pagewalk mustn't start after an
> >> invalidation, yet observe the stale RMP entry.
> >>
> >>
> >> x86 CPUs do have reasonable forward-progress guarantees, but in order to
> >> achieve forward progress, they need to e.g. guarantee that one memory
> >> access doesn't displace the TLB entry backing a different memory access
> >> from the same instruction, or you could livelock while trying to
> >> complete a single instruction.
> >>
> >> A consequence is that you can't safely invalidate a TLB entry of an
> >> in-progress instruction (although this means only the oldest instruction
> >> in the pipeline, because everything else is speculative and potentially
> >> transient).
> >>
> >>
> >> INVLPGB invalidations are interrupt-like from the point of view of the
> >> remote core, but are microarchitectural and can be taken irrespective of
> >> the architectural Interrupt and Global Interrupt Flags. As a
> >> consequence, they'll need to wait until an instruction boundary to be
> >> processed. While not AMD, the Intel RAR whitepaper [2] discusses the
> >> handling of RARs on the remote processor, and they share a number of
> >> constraints in common with INVLPGB.
> >>
> >>
> >> Overall, I'd expect the INVLPGB instructions to be pretty quick in and
> >> of themselves; interestingly, they're not identified as architecturally
> >> serialising. The broadcast is probably posted, and will be dealt with
> >> by remote processors on the subsequent instruction boundary. TLBSYNC is
> >> the barrier to wait until the invalidations have been processed, and
> >> this will block for an unspecified length of time, probably bounded by
> >> the "longest" instruction in progress on a remote CPU. e.g. I expect it
> >> probably will suck if you have to wait for a WBINVD instruction to
> >> complete on a remote CPU.
> >>
> >> That said, architectural IPIs have the same conditions too, except on
> >> top of that you've got to run a whole interrupt handler. So, with
> >> reasonable confidence, however slow TLBSYNC might be in the worst case,
> >> it's got absolutely nothing on the overhead of doing invalidations the
> >> old fashioned way.
> > Generally speaking, I am not arguing that TLB flush IPIs are worse
> > than INVLPGB/TLBSYNC; I think we should expect the latter to perform
> > better in most cases.
> >
> > But there is a difference here because the processor executing TLBSYNC
> > cannot serve interrupts or NMIs while waiting for remote CPUs, because
> > they have to be served at an instruction boundary, right?
>
> That's as per the architecture, yes. NMIs do have to be served on
> instruction boundaries. An NMI that becomes pending while a TLBSYNC is
> in progress will have to wait until the TLBSYNC completes.
>
> (Probably. REP string instructions and AVX scatter/gather have explicit
> > behaviours that let them be interrupted, and to continue from where
> they left off when the interrupt handler returns. Depending on how
> TLBSYNC is implemented, it's just possible it has this property too.)
That would be great actually, if that's the case all my concerns go away.
>
> > Unless
> > TLBSYNC is an exception to that rule, or its execution is considered
> > completed before remote CPUs respond (i.e. the CPU executes it quickly
> > then enters into a wait doing "nothing").
> >
> > There are also intriguing corner cases that are not documented. For
> > example, you mention that it's reasonable to expect that a remote CPU
> > does not serve TLBSYNC except at the instruction boundary.
>
> INVLPGB needs to wait for an instruction boundary in order to be processed.
>
> All TLBSYNC needs to do is wait until it's certain that all the prior
> INVLPGBs issued by this CPU have been serviced.
>
> > What if
> > that CPU is executing TLBSYNC? Do we have to wait for its execution to
> > complete? Is it possible to end up in a deadlock? This goes back to my
> > previous point about whether TLBSYNC is a special case or when it's
> > considered to have finished executing.
>
> Remember that the SEV-SNP instruction (PSMASH, PVALIDATE,
> RMP{ADJUST,UPDATE,QUERY,READ}) have an INVLPGB/TLBSYNC pair under the
> hood. You can execute these instructions on different CPUs in parallel.
>
> It's certainly possible AMD missed something and there's a
> deadlock case in there. But Google do offer SEV-SNP VMs and have the
> data and scale to know whether such a deadlock is happening in practice.
I am not familiar with SEV-SNP so excuse my ignorance. I am also
pretty sure that the percentage of SEV-SNP workloads is very low
compared to the workloads that would start using INVLPGB/TLBSYNC after
this series. So if there's a dormant bug or a rare scenario where the
TLBSYNC latency is massive, it may very well be newly uncovered now.
>
> >
> > I am sure people thought about that and I am probably worried over
> > nothing, but there's little details here so one has to speculate.
> >
> > Again, sorry if I am making a fuss over nothing and it's all in my head.
>
> It's absolutely a valid question to ask.
>
> But x86 is full of longer delays than this. The GIF for example can
> block NMIs until the hypervisor is complete with the world switch, and
> it's left as an exercise to software not to abuse this. Taking an SMI
> will be orders of magnitude more expensive than anything discussed here.
Right. What is happening here just seems like something that happens
more frequently and therefore is more likely to run into cases with
absurd delays.
It would be great if someone from AMD could shed some light on what is
to be reasonably expected from TLBSYNC here.
Anyway, thanks a lot for all your (very informative) responses :)
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2025-01-09 21:18 ` Dave Hansen
@ 2025-01-10 5:31 ` Rik van Riel
2025-01-10 6:07 ` Nadav Amit
1 sibling, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2025-01-10 5:31 UTC (permalink / raw)
To: Dave Hansen, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Thu, 2025-01-09 at 13:18 -0800, Dave Hansen wrote:
>
> But actually I think INVLPGB is *WAY* better than INVLPG here.
> INVLPG
> doesn't have ranged invalidation. It will only architecturally
> invalidate multiple 4K entries when the hardware fractured them in
> the
> first place. I think we should probably take advantage of what
> INVLPGB
> can do instead of following the INVLPG approach.
>
> INVLPGB will invalidate a range no matter where the underlying
> entries
> came from. Its "increment the virtual address at the 2M boundary"
> mode
> will invalidate entries of any size. That's my reading of the docs at
> least. Is that everyone else's reading too?
Ohhhh, good point! I glossed over that the first
half dozen times I was reading the document, because
I was trying to use the ASID, and was working to figure
out why things kept crashing (turns out I can only
use the PCID on bare metal).
>
> So, let's pick a number "Z" which is >= invlpgb_count_max. Z could
> arguably be set to tlb_single_page_flush_ceiling. Then do this:
>
> 4k -> Z*4k => use 4k step
> >Z*4k -> Z*2M => use 2M step
> >Z*2M => invalidate everything
>
> Invalidations <=Z*4k are exact. They never zap extra TLB entries.
>
> Invalidations that use the 2M step *might* unnecessarily zap some
> extra
> 4k mappings in the last 2M, but this is *WAY* better than
> invalidating
> everything.
>
This is a great idea.
Then the code in get_flush_tlb_info can adjust
start, end, and stride_shift as needed.
INVLPGB also supports invalidation of an entire
1GB region, so we can take your idea one step
further :)
With up to 8 pages zapped by a single INVLPGB
instruction, and multiple in flight simultaneously,
maybe we could set the threshold to 64, for 8
INVLPGBs in flight at once?
That way we can invalidate up to 1/8th of a
512 entry range with individual zaps, before
just zapping the higher level entry.
> "Invalidate everything" obviously stinks, but it should only be for
> pretty darn big invalidations.
That would only come into play when we get
past several GB worth of invalidation.
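In get_flush_tlb_info, the adjustment could look something like this
rough sketch (illustrative only; the helper name is made up and "z"
stands in for the threshold, e.g. tlb_single_page_flush_ceiling):

	/*
	 * Sketch: pick a stride for a broadcast flush of [start, end).
	 * Names and placement are assumptions, not the actual patch.
	 */
	static unsigned int pick_invlpgb_stride_shift(unsigned long start,
						      unsigned long end,
						      unsigned long z,
						      bool *flush_all)
	{
		unsigned long nr_4k = (end - start) >> PAGE_SHIFT;

		*flush_all = false;

		if (nr_4k <= z)
			return PAGE_SHIFT;	/* exact: only the requested 4k entries */

		if (nr_4k <= z << (PMD_SHIFT - PAGE_SHIFT))
			return PMD_SHIFT;	/* 2M step: may over-zap within the last 2M */

		*flush_all = true;		/* several GB or more: just flush everything */
		return PAGE_SHIFT;
	}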
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2025-01-09 21:18 ` Dave Hansen
2025-01-10 5:31 ` Rik van Riel
@ 2025-01-10 6:07 ` Nadav Amit
2025-01-10 15:14 ` Dave Hansen
1 sibling, 1 reply; 89+ messages in thread
From: Nadav Amit @ 2025-01-10 6:07 UTC (permalink / raw)
To: Dave Hansen
Cc: Rik van Riel, the arch/x86 maintainers,
Linux Kernel Mailing List, kernel-team, Dave Hansen, luto,
peterz, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
> On 9 Jan 2025, at 23:18, Dave Hansen <dave.hansen@intel.com> wrote:
>
> But actually I think INVLPGB is *WAY* better than INVLPG here. INVLPG
> doesn't have ranged invalidation. It will only architecturally
> invalidate multiple 4K entries when the hardware fractured them in the
> first place. I think we should probably take advantage of what INVLPGB
> can do instead of following the INVLPG approach.
>
> INVLPGB will invalidate a range no matter where the underlying entries
> came from. Its "increment the virtual address at the 2M boundary" mode
> will invalidate entries of any size. That's my reading of the docs at
> least. Is that everyone else's reading too?
This is not my reading. I think that this reading assumes that besides
the broadcast, some new “range flush” was added to the TLB. My guess
is that this is not the case, since presumably it would require a different
TLB structure (and who does 2 changes at once ;-) ).
My understanding is therefore that it’s all in microcode. There is a
“stride” and a “number” which are used by the microcode for iterating,
and on every iteration the microcode issues a TLB invalidation. This
invalidation is similar to INVLPG, just as it was always done (putting
aside the variants that do not invalidate the PWC). IOW, the page size
is not given as part of INVLPG, nor as part of INVLPGB (regardless
of the stride), for whatever entries are used to invalidate a given address.
I think my understanding is backed by the wording of "regardless of the
page size” appearing for INVLPG as well in AMD’s manual.
My guess is that invalidating more entries will take a longer time, maybe
not on the sender, but at least on the receiver. I also guess that in
certain setups - big NUMA machines - INVLPGB might perform worse. I
vaguely remember the ARM guys writing something about such behavior.
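In pseudo-code, the model I have in mind is roughly the following (purely
illustrative, not documented behaviour; invlpg_like() is a made-up
stand-in for the internal per-address invalidation):

	static void invlpgb_model(u16 pcid, unsigned long addr,
				  unsigned int extra_count, bool pmd_stride)
	{
		unsigned int i;

		/* One INVLPG-style zap per iteration, also broadcast to remote cores. */
		for (i = 0; i <= extra_count; i++) {
			invlpg_like(pcid, addr);	/* hits whatever entry maps addr, any page size */
			addr += pmd_stride ? PMD_SIZE : PAGE_SIZE;
		}
	}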
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2025-01-10 6:07 ` Nadav Amit
@ 2025-01-10 15:14 ` Dave Hansen
2025-01-10 16:08 ` Rik van Riel
0 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2025-01-10 15:14 UTC (permalink / raw)
To: Nadav Amit
Cc: Rik van Riel, the arch/x86 maintainers,
Linux Kernel Mailing List, kernel-team, Dave Hansen, luto,
peterz, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
On 1/9/25 22:07, Nadav Amit wrote:
> This is not my reading. I think that this reading assumes that besides
> the broadcast, some new “range flush” was added to the TLB. My guess
> is that this is not the case, since presumably it would require a different
> TLB structure (and who does 2 changes at once 😉 ).
Reading it again, I think you're right.
The INVLPG and INVLPGB language is too close. If there were a real range
flush, the docs would also _talk_ about invalidating a range rather than
just incrementing an address to invalidate.
I think the key thing we need to decide is whether to treat a single
INVLPGB(stride=8) more like a single INVLPGB or eight INVLPGBs.
Measuring a bunch of invalidation loops should tell us that.
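Something along these lines, perhaps (a sketch only; the harness itself is
made up, the helper names are taken from the series):

	/* Time one INVLPGB covering nr pages vs. nr single-page INVLPGBs,
	 * including the TLBSYNC wait. */
	static u64 time_broadcast_flush(unsigned long addr, unsigned long nr,
					bool one_by_one)
	{
		u64 start = rdtsc_ordered();

		if (one_by_one) {
			unsigned long i;

			for (i = 0; i < nr; i++)
				invlpgb_flush_addr(addr + (i << PAGE_SHIFT), 1);
		} else {
			invlpgb_flush_addr(addr, nr);
		}
		tlbsync();

		return rdtsc_ordered() - start;
	}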
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2025-01-10 15:14 ` Dave Hansen
@ 2025-01-10 16:08 ` Rik van Riel
2025-01-10 16:29 ` Dave Hansen
0 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2025-01-10 16:08 UTC (permalink / raw)
To: Dave Hansen, Nadav Amit
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, kernel-team,
Dave Hansen, luto, peterz, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
On Fri, 2025-01-10 at 07:14 -0800, Dave Hansen wrote:
> On 1/9/25 22:07, Nadav Amit wrote:
> > This is not my reading. I think that this reading assumes that
> > besides
> > the broadcast, some new “range flush” was added to the TLB. My
> > guess
> > is that this not the case, since presumably it would require a
> > different
> > TLB structure (and who does 2 changes at once 😉 ).
>
> Reading it again, I think you're right.
>
> The INVLPG and INVLPGB language is too close. It would also _talk_
> about
> invalidating a range rather than just incrementing an address to
> invalidate.
>
> I think the key thing we need to decide is whether to treat a single
> INVLPGB(stride=8) more like a single INVLPGB or eight INVLPGBs.
> Measuring a bunch of invalidation loops should tell us that.
Would I be wrong to assume that the CPUs have
some optimizations built in to efficiently
execute an invalidation for "everything in a
PCID"?
The "global invalidate" we send does not
zap everything in the TLB, but only the
translations for a single PCID.
I suppose we should measure these things
at some point (after I do the other
cleanups?), because the CPUs may well have
made a bunch of optimizations that we
don't know about.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2025-01-10 16:08 ` Rik van Riel
@ 2025-01-10 16:29 ` Dave Hansen
2025-01-10 16:36 ` Rik van Riel
0 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2025-01-10 16:29 UTC (permalink / raw)
To: Rik van Riel, Nadav Amit
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, kernel-team,
Dave Hansen, luto, peterz, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
On 1/10/25 08:08, Rik van Riel wrote:
> On Fri, 2025-01-10 at 07:14 -0800, Dave Hansen wrote:
...
>> I think the key thing we need to decide is whether to treat a single
>> INVLPGB(stride=8) more like a single INVLPGB or eight INVLPGBs.
>> Measuring a bunch of invalidation loops should tell us that.
>
> Would I be wrong to assume that the CPUs have
> some optimizations built in to efficiently
> execute an invalidation for "everything in a
> PCID"?
There's only a few bits in the actual TLBs to store the PCID (or VPID),
roughly 3 on Intel. Then there's another structure to map between the
architectural PCID and the 3 bits of actual hardware alias.
That's what I know. The rest is pure speculation:
All you have to do in theory is zap the one entry in the PCID=>HW
mapping structure to invalidate a whole PCID. You don't need to run
through the TLB itself to invalidate it.
You need to do something else to make sure that the now-unused 3-bit
hardware identifier gets reused at _some_ point, but there may be other
tricks for that.
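As a purely illustrative model (again, speculation rather than documented
hardware), the mapping structure could be as simple as:

	/* Speculative sketch: a tiny table mapping architectural PCIDs to
	 * the ~3-bit tag that TLB entries actually carry. Dropping a whole
	 * PCID is then just clearing its entry, not walking the TLB. */
	struct pcid_alias {
		u16  arch_pcid;	/* architectural PCID (up to 12 bits) */
		u8   hw_tag;	/* small hardware tag stored in TLB entries */
		bool valid;	/* clear to invalidate everything under this PCID */
	};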
> The "global invalidate" we send does not
> zap everything in the TLB, but only the
> translations for a single PCID.
>
> I suppose we should measure these things
> at some point (after I do the other
> cleanups?), because the CPUs may well have
> made a bunch of optimizations that we
> don't know about.
IIRC, the "big" invalidation modes are pretty cheap to execute. Most of
the cost comes from the TLB refill, not the flush itself.
But there's no substitute for actually measuring it. There's some wonky
stuff out there. The last time Andy L. went and looked at it, there were
oddities like INVPCID's "Individual-address invalidation" and INVLPG
having surprisingly different performance.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2025-01-10 16:29 ` Dave Hansen
@ 2025-01-10 16:36 ` Rik van Riel
0 siblings, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2025-01-10 16:36 UTC (permalink / raw)
To: Dave Hansen, Nadav Amit
Cc: the arch/x86 maintainers, Linux Kernel Mailing List, kernel-team,
Dave Hansen, luto, peterz, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Andrew Morton, zhengqi.arch,
open list:MEMORY MANAGEMENT
On Fri, 2025-01-10 at 08:29 -0800, Dave Hansen wrote:
> > about.
>
> IIRC, the "big" invalidation modes are pretty cheap to execute. Most
> of
> the cost comes from the TLB refill, not the flush itself.
>
> But there's no substitute for actually measuring it. There's some
> wonky
> stuff out there. The last time Andy L. went and looked at it, there
> were
> oddities like INVPCID's "Individual-address invalidation" and INVLPG
> having surprisingly different performance.
>
Agreed on all points.
If you guys don't mind, I'm going to keep this on the
back burner, while I work my way through all the
suggested code cleanups and improvements first.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 04/12] x86/mm: get INVLPGB count max from CPUID
2024-12-30 17:53 ` [PATCH 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
2025-01-02 12:15 ` Borislav Petkov
@ 2025-01-10 18:44 ` Tom Lendacky
2025-01-10 20:27 ` Rik van Riel
1 sibling, 1 reply; 89+ messages in thread
From: Tom Lendacky @ 2025-01-10 18:44 UTC (permalink / raw)
To: Rik van Riel, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On 12/30/24 11:53, Rik van Riel wrote:
> The CPU advertises the maximum number of pages that can be shot down
> with one INVLPGB instruction in the CPUID data.
>
> Save that information for later use.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/include/asm/tlbflush.h | 1 +
> arch/x86/kernel/cpu/amd.c | 8 ++++++++
> arch/x86/kernel/setup.c | 4 ++++
> 3 files changed, 13 insertions(+)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 02fc2aa06e9e..7d1468a3967b 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -182,6 +182,7 @@ static inline void cr4_init_shadow(void)
>
> extern unsigned long mmu_cr4_features;
> extern u32 *trampoline_cr4_features;
> +extern u16 invlpgb_count_max;
>
> extern void initialize_tlbstate_and_flush(void);
>
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index 79d2e17f6582..226b8fc64bfc 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -1135,6 +1135,14 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
> tlb_lli_2m[ENTRIES] = eax & mask;
>
> tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
> +
> + if (c->extended_cpuid_level < 0x80000008)
> + return;
Can this just be based on cpu_feature_enabled(X86_FEATURE_TLBI), e.g:
	if (cpu_feature_enabled(X86_FEATURE_TLBI))
		invlpgb_count_max = (cpuid_edx(0x80000008) & 0xffff) + 1;
Then you can squash this and the previous patch.
Thanks,
Tom
> +
> + cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
> +
> + /* Max number of pages INVLPGB can invalidate in one shot */
> + invlpgb_count_max = (edx & 0xffff) + 1;
> }
>
> static const struct cpu_dev amd_cpu_dev = {
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index f1fea506e20f..6c4d08f8f7b1 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -138,6 +138,10 @@ __visible unsigned long mmu_cr4_features __ro_after_init;
> __visible unsigned long mmu_cr4_features __ro_after_init = X86_CR4_PAE;
> #endif
>
> +#ifdef CONFIG_CPU_SUP_AMD
> +u16 invlpgb_count_max __ro_after_init;
> +#endif
> +
> #ifdef CONFIG_IMA
> static phys_addr_t ima_kexec_buffer_phys;
> static size_t ima_kexec_buffer_size;
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2024-12-30 17:53 ` [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
2025-01-03 12:39 ` Borislav Petkov
2025-01-06 17:21 ` Dave Hansen
@ 2025-01-10 18:53 ` Tom Lendacky
2025-01-10 20:29 ` Rik van Riel
2 siblings, 1 reply; 89+ messages in thread
From: Tom Lendacky @ 2025-01-10 18:53 UTC (permalink / raw)
To: Rik van Riel, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On 12/30/24 11:53, Rik van Riel wrote:
> Use broadcast TLB invalidation for kernel addresses when available.
>
> This stops us from having to send IPIs for kernel TLB flushes.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/mm/tlb.c | 31 +++++++++++++++++++++++++++++++
> 1 file changed, 31 insertions(+)
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 6cf881a942bb..29207dc5b807 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1077,6 +1077,32 @@ void flush_tlb_all(void)
> on_each_cpu(do_flush_tlb_all, NULL, 1);
> }
>
> +static void broadcast_kernel_range_flush(unsigned long start, unsigned long end)
> +{
> + unsigned long addr;
> + unsigned long maxnr = invlpgb_count_max;
> + unsigned long threshold = tlb_single_page_flush_ceiling * maxnr;
> +
> + /*
> + * TLBSYNC only waits for flushes originating on the same CPU.
> + * Disabling migration allows us to wait on all flushes.
> + */
> + guard(preempt)();
> +
> + if (end == TLB_FLUSH_ALL ||
> + (end - start) > threshold << PAGE_SHIFT) {
> + invlpgb_flush_all();
> + } else {
> + unsigned long nr;
> + for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
> + nr = min((end - addr) >> PAGE_SHIFT, maxnr);
> + invlpgb_flush_addr(addr, nr);
> + }
Would it be better to put this loop in the actual invlpgb_flush*
function(s)? Then callers don't have to worry about it, similar to what is
done in clflush_cache_range() / clflush_cache_range_opt().
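Something like this, perhaps (sketch only, with an assumed name;
invlpgb_flush_addr() and invlpgb_count_max are the names from the series):

	/* Let the helper chunk the range itself so callers don't need the loop. */
	static void invlpgb_flush_addr_range(unsigned long addr, unsigned long nr_pages)
	{
		while (nr_pages) {
			unsigned long nr = min(nr_pages, (unsigned long)invlpgb_count_max);

			invlpgb_flush_addr(addr, nr);
			addr     += nr << PAGE_SHIFT;
			nr_pages -= nr;
		}
	}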
Thanks,
Tom
> + }
> +
> + tlbsync();
> +}
> +
> static void do_kernel_range_flush(void *info)
> {
> struct flush_tlb_info *f = info;
> @@ -1089,6 +1115,11 @@ static void do_kernel_range_flush(void *info)
>
> void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> {
> + if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> + broadcast_kernel_range_flush(start, end);
> + return;
> + }
> +
> /* Balance as user space task's flush, a bit conservative */
> if (end == TLB_FLUSH_ALL ||
> (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 11/12] x86/mm: enable AMD translation cache extensions
2024-12-30 17:53 ` [PATCH 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
2024-12-30 18:25 ` Nadav Amit
2025-01-03 17:49 ` Jann Horn
@ 2025-01-10 19:34 ` Tom Lendacky
2025-01-10 19:45 ` Rik van Riel
2 siblings, 1 reply; 89+ messages in thread
From: Tom Lendacky @ 2025-01-10 19:34 UTC (permalink / raw)
To: Rik van Riel, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On 12/30/24 11:53, Rik van Riel wrote:
> With AMD TCE (translation cache extensions) only the intermediate mappings
> that cover the address range zapped by INVLPG / INVLPGB get invalidated,
> rather than all intermediate mappings getting zapped at every TLB invalidation.
>
> This can help reduce the TLB miss rate, by keeping more intermediate
> mappings in the cache.
>
> From the AMD manual:
>
> Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit
> to 1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on
> TLB entries. When this bit is 0, these instructions remove the target PTE
> from the TLB as well as all upper-level table entries that are cached
> in the TLB, whether or not they are associated with the target PTE.
> When this bit is set, these instructions will remove the target PTE and
> only those upper-level entries that lead to the target PTE in
> the page table hierarchy, leaving unrelated upper-level entries intact.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/kernel/cpu/amd.c | 8 ++++++++
> arch/x86/mm/tlb.c | 10 +++++++---
> 2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index 226b8fc64bfc..4dc42705aaca 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -1143,6 +1143,14 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
>
> /* Max number of pages INVLPGB can invalidate in one shot */
> invlpgb_count_max = (edx & 0xffff) + 1;
> +
> + /* If supported, enable translation cache extensions (TCE) */
> + cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
> + if (ecx & BIT(17)) {
Back to my comment from patch #4, you can put this under the
cpu_feature_enabled() check and just set it.
> + u64 msr = native_read_msr(MSR_EFER);;
> + msr |= BIT(15);
> + wrmsrl(MSR_EFER, msr);
msr_set_bit() ?
Thanks,
Tom
> + }
> }
>
> static const struct cpu_dev amd_cpu_dev = {
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 454a370494d3..585d0731ca9f 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -477,7 +477,7 @@ static void broadcast_tlb_flush(struct flush_tlb_info *info)
> if (info->stride_shift > PMD_SHIFT)
> maxnr = 1;
>
> - if (info->end == TLB_FLUSH_ALL) {
> + if (info->end == TLB_FLUSH_ALL || info->freed_tables) {
> invlpgb_flush_single_pcid(kern_pcid(asid));
> /* Do any CPUs supporting INVLPGB need PTI? */
> if (static_cpu_has(X86_FEATURE_PTI))
> @@ -1110,7 +1110,7 @@ static void flush_tlb_func(void *info)
> *
> * The only question is whether to do a full or partial flush.
> *
> - * We do a partial flush if requested and two extra conditions
> + * We do a partial flush if requested and three extra conditions
> * are met:
> *
> * 1. f->new_tlb_gen == local_tlb_gen + 1. We have an invariant that
> @@ -1137,10 +1137,14 @@ static void flush_tlb_func(void *info)
> * date. By doing a full flush instead, we can increase
> * local_tlb_gen all the way to mm_tlb_gen and we can probably
> * avoid another flush in the very near future.
> + *
> + * 3. No page tables were freed. If page tables were freed, a full
> + * flush ensures intermediate translations in the TLB get flushed.
> */
> if (f->end != TLB_FLUSH_ALL &&
> f->new_tlb_gen == local_tlb_gen + 1 &&
> - f->new_tlb_gen == mm_tlb_gen) {
> + f->new_tlb_gen == mm_tlb_gen &&
> + !f->freed_tables) {
> /* Partial flush */
> unsigned long addr = f->start;
>
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 11/12] x86/mm: enable AMD translation cache extensions
2025-01-10 19:34 ` Tom Lendacky
@ 2025-01-10 19:45 ` Rik van Riel
2025-01-10 19:58 ` Borislav Petkov
0 siblings, 1 reply; 89+ messages in thread
From: Rik van Riel @ 2025-01-10 19:45 UTC (permalink / raw)
To: Tom Lendacky, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Fri, 2025-01-10 at 13:34 -0600, Tom Lendacky wrote:
>
> > +++ b/arch/x86/kernel/cpu/amd.c
> > @@ -1143,6 +1143,14 @@ static void cpu_detect_tlb_amd(struct
> > cpuinfo_x86 *c)
> >
> > /* Max number of pages INVLPGB can invalidate in one shot
> > */
> > invlpgb_count_max = (edx & 0xffff) + 1;
> > +
> > + /* If supported, enable translation cache extensions (TCE)
> > */
> > + cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
> > + if (ecx & BIT(17)) {
>
> Back to my comment from patch #4, you can put this under the
> cpu_feature_enabled() check and just set it.
>
Ohhh nice, so I can just add a CPUID feature
bit for TCE, and then have this?
	if (cpu_feature_enabled(X86_FEATURE_TCE))
		msr_set_bit(MSR_EFER, EFER_TCE);
That is much nicer.
Is this the right location for that code, or
do I need to move it somewhere else to
guarantee TCE gets enabled on every CPU?
> > + u64 msr = native_read_msr(MSR_EFER);;
> > + msr |= BIT(15);
> > + wrmsrl(MSR_EFER, msr);
>
> msr_set_bit() ?
>
> Thanks,
> Tom
>
> > + }
> > }
> >
> > static const struct cpu_dev amd_cpu_dev = {
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 454a370494d3..585d0731ca9f 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -477,7 +477,7 @@ static void broadcast_tlb_flush(struct
> > flush_tlb_info *info)
> > if (info->stride_shift > PMD_SHIFT)
> > maxnr = 1;
> >
> > - if (info->end == TLB_FLUSH_ALL) {
> > + if (info->end == TLB_FLUSH_ALL || info->freed_tables) {
> > invlpgb_flush_single_pcid(kern_pcid(asid));
> > /* Do any CPUs supporting INVLPGB need PTI? */
> > if (static_cpu_has(X86_FEATURE_PTI))
> > @@ -1110,7 +1110,7 @@ static void flush_tlb_func(void *info)
> > *
> > * The only question is whether to do a full or partial
> > flush.
> > *
> > - * We do a partial flush if requested and two extra
> > conditions
> > + * We do a partial flush if requested and three extra
> > conditions
> > * are met:
> > *
> > * 1. f->new_tlb_gen == local_tlb_gen + 1. We have an
> > invariant that
> > @@ -1137,10 +1137,14 @@ static void flush_tlb_func(void *info)
> > * date. By doing a full flush instead, we can
> > increase
> > * local_tlb_gen all the way to mm_tlb_gen and we can
> > probably
> > * avoid another flush in the very near future.
> > + *
> > + * 3. No page tables were freed. If page tables were
> > freed, a full
> > + * flush ensures intermediate translations in the TLB
> > get flushed.
> > */
> > if (f->end != TLB_FLUSH_ALL &&
> > f->new_tlb_gen == local_tlb_gen + 1 &&
> > - f->new_tlb_gen == mm_tlb_gen) {
> > + f->new_tlb_gen == mm_tlb_gen &&
> > + !f->freed_tables) {
> > /* Partial flush */
> > unsigned long addr = f->start;
> >
>
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 11/12] x86/mm: enable AMD translation cache extensions
2025-01-10 19:45 ` Rik van Riel
@ 2025-01-10 19:58 ` Borislav Petkov
2025-01-10 20:43 ` Rik van Riel
0 siblings, 1 reply; 89+ messages in thread
From: Borislav Petkov @ 2025-01-10 19:58 UTC (permalink / raw)
To: Rik van Riel
Cc: Tom Lendacky, x86, linux-kernel, kernel-team, dave.hansen, luto,
peterz, tglx, mingo, hpa, akpm, nadav.amit, zhengqi.arch,
linux-mm
On Fri, Jan 10, 2025 at 02:45:27PM -0500, Rik van Riel wrote:
> Is this the right location for that code, or
> do I need to move it somewhere else to
> guarantee TCE gets enabled on every CPU?
Didn't I answer this already?
https://lore.kernel.org/r/20250102115237.GGZ3Z-BUqBauHEfB10@fat_crate.local
And you should put it in init_amd() as it runs on every CPU.
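I.e. something like this (sketch only; it assumes the X86_FEATURE_TCE
feature bit and an _EFER_TCE bit-number define, as discussed elsewhere in
the thread):

	/* In init_amd(), which runs on every CPU. */
	if (cpu_feature_enabled(X86_FEATURE_TCE))
		msr_set_bit(MSR_EFER, _EFER_TCE);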
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 04/12] x86/mm: get INVLPGB count max from CPUID
2025-01-10 18:44 ` Tom Lendacky
@ 2025-01-10 20:27 ` Rik van Riel
2025-01-10 20:31 ` Tom Lendacky
2025-01-10 20:34 ` Borislav Petkov
0 siblings, 2 replies; 89+ messages in thread
From: Rik van Riel @ 2025-01-10 20:27 UTC (permalink / raw)
To: Tom Lendacky, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Fri, 2025-01-10 at 12:44 -0600, Tom Lendacky wrote:
> On 12/30/24 11:53, Rik van Riel wrote:
> >
> > +++ b/arch/x86/kernel/cpu/amd.c
> > @@ -1135,6 +1135,14 @@ static void cpu_detect_tlb_amd(struct
> > cpuinfo_x86 *c)
> > tlb_lli_2m[ENTRIES] = eax & mask;
> >
> > tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
> > +
> > + if (c->extended_cpuid_level < 0x80000008)
> > + return;
>
> Can this just be based on cpu_feature_enabled(X86_FEATURE_TLBI), e.g:
>
> if (cpu_feature_enabled(X86_FEATURE_TLBI))
> invlpgb_count_max = (cpuid_edx(0x80000008) & 0xffff)
> + 1
>
I don't see X86_FEATURE_TLBI defined in the tip
tree. Which CPUID bit does that need to be?
> Then you can squash this and the previous patch.
Heh, I already squashed the previous commit into
what is patch 6 in this series, as requested by
Borislav :)
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2025-01-10 18:53 ` Tom Lendacky
@ 2025-01-10 20:29 ` Rik van Riel
0 siblings, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2025-01-10 20:29 UTC (permalink / raw)
To: Tom Lendacky, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Fri, 2025-01-10 at 12:53 -0600, Tom Lendacky wrote:
> On 12/30/24 11:53, Rik van Riel wrote:
> > Use broadcast TLB invalidation for kernel addresses when available.
> >
> > +static void broadcast_kernel_range_flush(unsigned long start,
> > unsigned long end)
> > +{
> > + unsigned long addr;
> > + unsigned long maxnr = invlpgb_count_max;
> > + unsigned long threshold = tlb_single_page_flush_ceiling *
> > maxnr;
> > +
> > + /*
> > + * TLBSYNC only waits for flushes originating on the same
> > CPU.
> > + * Disabling migration allows us to wait on all flushes.
> > + */
> > + guard(preempt)();
> > +
> > + if (end == TLB_FLUSH_ALL ||
> > + (end - start) > threshold << PAGE_SHIFT) {
> > + invlpgb_flush_all();
> > + } else {
> > + unsigned long nr;
> > + for (addr = start; addr < end; addr += nr <<
> > PAGE_SHIFT) {
> > + nr = min((end - addr) >> PAGE_SHIFT,
> > maxnr);
> > + invlpgb_flush_addr(addr, nr);
> > + }
>
> Would it be better to put this loop in the actual invlpgb_flush*
> function(s)? Then callers don't have to worry about it, similar to
> what is
> done in clflush_cache_range() / clflush_cache_range_opt().
Maybe?
We only have one caller, though.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 04/12] x86/mm: get INVLPGB count max from CPUID
2025-01-10 20:27 ` Rik van Riel
@ 2025-01-10 20:31 ` Tom Lendacky
2025-01-10 20:34 ` Borislav Petkov
1 sibling, 0 replies; 89+ messages in thread
From: Tom Lendacky @ 2025-01-10 20:31 UTC (permalink / raw)
To: Rik van Riel, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On 1/10/25 14:27, Rik van Riel wrote:
> On Fri, 2025-01-10 at 12:44 -0600, Tom Lendacky wrote:
>> On 12/30/24 11:53, Rik van Riel wrote:
>>>
>>> +++ b/arch/x86/kernel/cpu/amd.c
>>> @@ -1135,6 +1135,14 @@ static void cpu_detect_tlb_amd(struct
>>> cpuinfo_x86 *c)
>>> tlb_lli_2m[ENTRIES] = eax & mask;
>>>
>>> tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
>>> +
>>> + if (c->extended_cpuid_level < 0x80000008)
>>> + return;
>>
>> Can this just be based on cpu_feature_enabled(X86_FEATURE_TLBI), e.g:
>>
>> if (cpu_feature_enabled(X86_FEATURE_TLBI))
>> invlpgb_count_max = (cpuid_edx(0x80000008) & 0xffff)
>> + 1
>>
> I don't see X86_FEATURE_TLBI defined in the tip
> tree. Which CPUID bit does that need to be?
Sorry, I meant X86_FEATURE_INVLPGB.
Thanks,
Tom
>
>> Then you can squash this and the previous patch.
>
> Heh, I already squashed the previous commit into
> what is patch 6 in this series, as requested by
> Borislav :)
>
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 04/12] x86/mm: get INVLPGB count max from CPUID
2025-01-10 20:27 ` Rik van Riel
2025-01-10 20:31 ` Tom Lendacky
@ 2025-01-10 20:34 ` Borislav Petkov
1 sibling, 0 replies; 89+ messages in thread
From: Borislav Petkov @ 2025-01-10 20:34 UTC (permalink / raw)
To: Rik van Riel
Cc: Tom Lendacky, x86, linux-kernel, kernel-team, dave.hansen, luto,
peterz, tglx, mingo, hpa, akpm, nadav.amit, zhengqi.arch,
linux-mm
On Fri, Jan 10, 2025 at 03:27:37PM -0500, Rik van Riel wrote:
> I don't see X86_FEATURE_TLBI defined in the tip
> tree. Which CPUID bit does that need to be?
It should be added as:
#define X86_FEATURE_INVLPGB (13*32+ 3) /* INVLPGB and TLBSYNC instruction supported. */
in arch/x86/include/asm/cpufeatures.h
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 11/12] x86/mm: enable AMD translation cache extensions
2025-01-10 19:58 ` Borislav Petkov
@ 2025-01-10 20:43 ` Rik van Riel
0 siblings, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2025-01-10 20:43 UTC (permalink / raw)
To: Borislav Petkov
Cc: Tom Lendacky, x86, linux-kernel, kernel-team, dave.hansen, luto,
peterz, tglx, mingo, hpa, akpm, nadav.amit, zhengqi.arch,
linux-mm
On Fri, 2025-01-10 at 20:58 +0100, Borislav Petkov wrote:
> On Fri, Jan 10, 2025 at 02:45:27PM -0500, Rik van Riel wrote:
> > Is this the right location for that code, or
> > do I need to move it somewhere else to
> > guarantee TCE gets enabled on every CPU?
>
> Didn't I answer this already?
>
> https://lore.kernel.org/r/20250102115237.GGZ3Z-BUqBauHEfB10@fat_crate.local
>
> And you should put it in init_amd() as it runs on every CPU.
>
Thank you for telling me about init_amd().
I wasn't sure where in amd.c to put it.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2025-01-06 18:40 ` Dave Hansen
@ 2025-01-12 2:36 ` Rik van Riel
0 siblings, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2025-01-12 2:36 UTC (permalink / raw)
To: Dave Hansen, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, 2025-01-06 at 10:40 -0800, Dave Hansen wrote:
> On 12/30/24 09:53, Rik van Riel wrote:
> ...
> > +#ifdef CONFIG_CPU_SUP_AMD
> > + struct list_head broadcast_asid_list;
> > + u16 broadcast_asid;
> > + bool asid_transition;
> > +#endif
>
> Could we either do:
>
> config X86_TLB_FLUSH_BROADCAST_HW
> bool
> depends on CONFIG_CPU_SUP_AMD
I went with this option.
I also got rid of the list_head (thanks Nadav), and
moved the remaining two items into the same word as
the protection key bits.
I think I have everybody's comments addressed in my
code base now. I'll post a new version soon-ish :)
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 12/12] x86/mm: only invalidate final translations with INVLPGB
2025-01-03 18:40 ` Jann Horn
@ 2025-01-12 2:39 ` Rik van Riel
0 siblings, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2025-01-12 2:39 UTC (permalink / raw)
To: Jann Horn
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Fri, 2025-01-03 at 19:40 +0100, Jann Horn wrote:
> On Mon, Dec 30, 2024 at 6:53 PM Rik van Riel <riel@surriel.com>
> wrote:
> >
> > +++ b/arch/x86/include/asm/invlpgb.h
> > @@ -51,7 +51,7 @@ static inline void invlpgb_flush_user(unsigned
> > long pcid,
> > static inline void invlpgb_flush_user_nr(unsigned long pcid,
> > unsigned long addr,
> > int nr, bool pmd_stride)
> > {
> > - __invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID |
> > INVLPGB_VA);
> > + __invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID |
> > INVLPGB_VA | INVLPGB_FINAL_ONLY);
> > }
>
> Please note this final-only behavior in a comment above the function
> and/or rename the function to make this clear.
>
> I think this currently interacts badly with pmdp_collapse_flush(),
> which is used by retract_page_tables(). pmdp_collapse_flush() removes
I've added a freed_tables argument to invlpgb_flush_user_nr_nosync
> a PMD entry pointing to a page table with pmdp_huge_get_and_clear(),
> then calls flush_tlb_range(), which on x86 calls flush_tlb_mm_range()
> with the "freed_tables" parameter set to false. But that's really a
> preexisting bug, not something introduced by your series. I've sent a
> patch for that, see
> <
> https://lore.kernel.org/r/20250103-x86-collapse-flush-fix-v1-1-3c521856cfa6@google.com
> >.
>
With your change, I believe the next version of my patch
series should handle this case correctly, too.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3 00/12] AMD broadcast TLB invalidation
2025-01-06 19:03 ` [PATCH v3 00/12] AMD broadcast TLB invalidation Dave Hansen
@ 2025-01-12 2:46 ` Rik van Riel
0 siblings, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2025-01-12 2:46 UTC (permalink / raw)
To: Dave Hansen, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Mon, 2025-01-06 at 11:03 -0800, Dave Hansen wrote:
>
> So can we call them "global", "shared" or "system" ASIDs, please?
>
I have renamed them to global ASIDs.
> Second, the TLB_NR_DYN_ASIDS was picked because it's roughly the
> number
> of distinct PCIDs that the CPU can keep in the TLB at once (at least
> on
> Intel). Let's say a CPU has 6 mm's in the per-cpu ASID space and
> another
> 6 in the shared/broadcast space. At that point, PCIDs might not be
> doing
> much good because the TLB can't store entries for 12 PCIDs.
>
If the CPU has 12 runnable processes, we may have
various other performance issues, too, like the
system simply not having enough CPU power to run
all the runnable tasks.
Most of the systems I have looked at seem to average
between .2 and 2 runnable tasks per CPU, depending on
whether the workload is CPU bound, or memory/IO bound.
> Is there any comprehension in this series? Should we be indexing
> cpu_tlbstate.ctxs[] by a *context* number rather than by the ASID
> that
> it's running as?
>
We only need the cpu_tlbstate.ctxs[] for the per-CPU
ASID space, in order to look up what process is
assigned which slot.
We do not need it for global ASID numbers, which are
always the same everywhere.
> Last, I'm not 100% convinced we want to do this whole thing. The
> will-it-scale numbers are nice. But given the complexity of this, I
> think we need some actual, real end users to stand up and say exactly
> how this is important in *PRODUCTION* to them.
>
Do any of these count? :)
https://www.phoronix.com/review/amd-invlpgb-linux
I am hoping to gather some real-world numbers as well,
and will work with some workload owners to get them.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH 05/12] x86/mm: add INVLPGB support code
2025-01-02 12:42 ` Borislav Petkov
2025-01-06 16:50 ` Dave Hansen
@ 2025-01-14 19:50 ` Rik van Riel
1 sibling, 0 replies; 89+ messages in thread
From: Rik van Riel @ 2025-01-14 19:50 UTC (permalink / raw)
To: Borislav Petkov, Michael Kelley
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx,
mingo, hpa, akpm, nadav.amit, zhengqi.arch, linux-mm
On Thu, 2025-01-02 at 13:42 +0100, Borislav Petkov wrote:
> On Mon, Dec 30, 2024 at 12:53:06PM -0500, Rik van Riel wrote:
>
>
> > +{
> > + u64 rax = addr | flags;
> > + u32 ecx = (pmd_stride << 31) | extra_count;
> > + u32 edx = (pcid << 16) | asid;
> > +
> > + asm volatile("invlpgb" : : "a" (rax), "c" (ecx), "d"
> > (edx));
>
> No, you do:
>
> /* INVLPGB; supported in binutils >= 2.36. */
> asm volatile(".byte 0x0f, 0x01, 0xfe"
> ...
Sorry, this one fell through the cracks.
I've replaced both of the asm mnemonics with
byte code for the next version of the series.
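For reference, the replacements look roughly like this (untested sketch,
not the exact hunk; the operand setup is unchanged from __invlpgb() and
tlbsync(), and the TLBSYNC byte encoding here is taken from the AMD
manual, so worth double-checking):

	/* INVLPGB: 0F 01 FE, operands in rAX, ECX, EDX */
	asm volatile(".byte 0x0f, 0x01, 0xfe"
		     : : "a" (rax), "c" (ecx), "d" (edx));

	/* TLBSYNC: 0F 01 FF */
	asm volatile(".byte 0x0f, 0x01, 0xff" : : : "memory");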
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 89+ messages in thread
end of thread, other threads:[~2025-01-14 19:51 UTC | newest]
Thread overview: 89+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-30 17:53 [PATCH v3 00/12] AMD broadcast TLB invalidation Rik van Riel
2024-12-30 17:53 ` [PATCH 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
2024-12-30 18:41 ` Borislav Petkov
2024-12-31 16:11 ` Rik van Riel
2024-12-31 16:19 ` Borislav Petkov
2024-12-31 16:30 ` Rik van Riel
2025-01-02 11:52 ` Borislav Petkov
2025-01-02 19:56 ` Peter Zijlstra
2025-01-03 12:18 ` Borislav Petkov
2025-01-04 16:27 ` Peter Zijlstra
2025-01-06 15:54 ` Dave Hansen
2025-01-06 15:47 ` Rik van Riel
2024-12-30 17:53 ` [PATCH 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call Rik van Riel
2024-12-31 3:18 ` Qi Zheng
2024-12-30 17:53 ` [PATCH 03/12] x86/mm: add X86_FEATURE_INVLPGB definition Rik van Riel
2025-01-02 12:04 ` Borislav Petkov
2025-01-03 18:27 ` Rik van Riel
2025-01-03 21:07 ` Borislav Petkov
2024-12-30 17:53 ` [PATCH 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
2025-01-02 12:15 ` Borislav Petkov
2025-01-10 18:44 ` Tom Lendacky
2025-01-10 20:27 ` Rik van Riel
2025-01-10 20:31 ` Tom Lendacky
2025-01-10 20:34 ` Borislav Petkov
2024-12-30 17:53 ` [PATCH 05/12] x86/mm: add INVLPGB support code Rik van Riel
2025-01-02 12:42 ` Borislav Petkov
2025-01-06 16:50 ` Dave Hansen
2025-01-06 17:32 ` Rik van Riel
2025-01-06 18:14 ` Borislav Petkov
2025-01-14 19:50 ` Rik van Riel
2025-01-03 12:44 ` Borislav Petkov
2024-12-30 17:53 ` [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
2025-01-03 12:39 ` Borislav Petkov
2025-01-06 17:21 ` Dave Hansen
2025-01-09 20:16 ` Rik van Riel
2025-01-09 21:18 ` Dave Hansen
2025-01-10 5:31 ` Rik van Riel
2025-01-10 6:07 ` Nadav Amit
2025-01-10 15:14 ` Dave Hansen
2025-01-10 16:08 ` Rik van Riel
2025-01-10 16:29 ` Dave Hansen
2025-01-10 16:36 ` Rik van Riel
2025-01-10 18:53 ` Tom Lendacky
2025-01-10 20:29 ` Rik van Riel
2024-12-30 17:53 ` [PATCH 07/12] x86/tlb: use INVLPGB in flush_tlb_all Rik van Riel
2025-01-06 17:29 ` Dave Hansen
2025-01-06 17:35 ` Rik van Riel
2025-01-06 17:54 ` Dave Hansen
2024-12-30 17:53 ` [PATCH 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
2024-12-30 17:53 ` [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
2024-12-30 19:24 ` Nadav Amit
2025-01-01 4:42 ` Rik van Riel
2025-01-01 15:20 ` Nadav Amit
2025-01-01 16:15 ` Karim Manaouil
2025-01-01 16:23 ` Rik van Riel
2025-01-02 0:06 ` Nadav Amit
2025-01-03 17:36 ` Jann Horn
2025-01-04 2:55 ` Rik van Riel
2025-01-06 13:04 ` Jann Horn
2025-01-06 14:26 ` Rik van Riel
2025-01-06 14:52 ` Nadav Amit
2025-01-06 16:03 ` Rik van Riel
2025-01-06 18:40 ` Dave Hansen
2025-01-12 2:36 ` Rik van Riel
2024-12-30 17:53 ` [PATCH 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code Rik van Riel
2024-12-30 17:53 ` [PATCH 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
2024-12-30 18:25 ` Nadav Amit
2024-12-30 18:27 ` Rik van Riel
2025-01-03 17:49 ` Jann Horn
2025-01-04 3:08 ` Rik van Riel
2025-01-06 13:10 ` Jann Horn
2025-01-06 18:29 ` Sean Christopherson
2025-01-10 19:34 ` Tom Lendacky
2025-01-10 19:45 ` Rik van Riel
2025-01-10 19:58 ` Borislav Petkov
2025-01-10 20:43 ` Rik van Riel
2024-12-30 17:53 ` [PATCH 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
2025-01-03 18:40 ` Jann Horn
2025-01-12 2:39 ` Rik van Riel
2025-01-06 19:03 ` [PATCH v3 00/12] AMD broadcast TLB invalidation Dave Hansen
2025-01-12 2:46 ` Rik van Riel
2025-01-06 22:49 ` Yosry Ahmed
2025-01-07 3:25 ` Rik van Riel
2025-01-08 1:36 ` Yosry Ahmed
2025-01-09 2:25 ` Andrew Cooper
2025-01-09 2:47 ` Andrew Cooper
2025-01-09 21:32 ` Yosry Ahmed
2025-01-09 23:00 ` Andrew Cooper
2025-01-09 23:26 ` Yosry Ahmed