[PATCH v6 00/12] AMD broadcast TLB invalidation


* [PATCH v6 00/12] AMD broadcast TLB invalidation
@ 2025-01-20  2:40 Rik van Riel
  2025-01-20  2:40 ` [PATCH v6 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
                   ` (13 more replies)
  0 siblings, 14 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-20  2:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

This allows the kernel to invalidate TLB entries on remote CPUs without
needing to send IPIs, without having to wait for remote CPUs to handle
those interrupts, and with less interruption to what was running on
those CPUs.

Because x86 PCID space is limited, and there are some very large
systems out there, broadcast TLB invalidation is only used for
processes that are active on 3 or more CPUs, with the threshold
being gradually increased the more the PCID space gets exhausted.

Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while
TLB invalidation went down from 40% of the total CPU
time to only around 4% of CPU time, the contention
simply moved to the LRU lock.

Fixing both at the same time about doubles the
number of iterations per second for this case.

Some numbers closer to real world performance
can be found at Phoronix, thanks to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

My current plan is to implement support for Intel's RAR
(Remote Action Request) TLB flushing in a follow-up series,
after this thing has been merged into -tip. Making things
any larger would just be unwieldy for reviewers.

v6:
 - fix info->end check in flush_tlb_kernel_range (Michael)
 - disable broadcast TLB flushing on 32 bit x86
v5:
 - use byte assembly for compatibility with older toolchains (Borislav, Michael)
 - ensure a panic on an invalid number of extra pages (Dave, Tom)
 - add cant_migrate() assertion to tlbsync (Jann)
 - a bunch more cleanups (Nadav)
 - key TCE enabling off X86_FEATURE_TCE (Andrew)
 - fix a race between reclaim and ASID transition (Jann)
v4:
 - Use only bitmaps to track free global ASIDs (Nadav)
 - Improved AMD initialization (Borislav & Tom)
 - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
 - Fixes for subtle race conditions (Jann)
v3:
 - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
 - More suggested cleanups and changelog fixes by Peter and Nadav
v2:
 - Apply suggestions by Peter and Borislav (thank you!)
 - Fix bug in arch_tlbbatch_flush, where we need to do both
   the TLBSYNC, and flush the CPUs that are in the cpumask.
 - Some updates to comments and changelogs based on questions.


* [PATCH v6 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
@ 2025-01-20  2:40 ` Rik van Riel
  2025-01-20 19:32   ` David Hildenbrand
  2025-01-20  2:40 ` [PATCH v6 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call Rik van Riel
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-20  2:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Rik van Riel

Currently x86 uses CONFIG_MMU_GATHER_RCU_TABLE_FREE when using
paravirt, and not when running on bare metal.

There is no real good reason to do things differently for
each setup. Make them all the same.

Currently get_user_pages_fast synchronizes against page table
freeing in two different ways:
- on bare metal, by blocking IRQs, which blocks TLB flush IPIs
- on paravirt, with MMU_GATHER_RCU_TABLE_FREE

This is done because some paravirt TLB flush implementations
handle the TLB flush in the hypervisor, and will do the flush
even when the target CPU has interrupts disabled.

Always handle page table freeing with MMU_GATHER_RCU_TABLE_FREE.
Using RCU synchronization between page table freeing and get_user_pages_fast()
allows bare metal to also do TLB flushing while interrupts are disabled.

That makes it safe to use INVLPGB on AMD CPUs.

Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/Kconfig           | 2 +-
 arch/x86/kernel/paravirt.c | 7 +------
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9d7bd0ae48c4..e8743f8c9fd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -274,7 +274,7 @@ config X86
 	select HAVE_PCI
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
-	select MMU_GATHER_RCU_TABLE_FREE	if PARAVIRT
+	select MMU_GATHER_RCU_TABLE_FREE
 	select MMU_GATHER_MERGE_VMAS
 	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
 	select HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec381533555..2b78a6b466ed 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -59,11 +59,6 @@ void __init native_pv_lock_init(void)
 		static_branch_enable(&virt_spin_lock_key);
 }
 
-static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
-	tlb_remove_page(tlb, table);
-}
-
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
 
@@ -191,7 +186,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
 	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
-	.mmu.tlb_remove_table	= native_tlb_remove_table,
+	.mmu.tlb_remove_table	= tlb_remove_table,
 
 	.mmu.exit_mmap		= paravirt_nop,
 	.mmu.notify_page_enc_status_changed	= paravirt_nop,
-- 
2.47.1




* [PATCH v6 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
  2025-01-20  2:40 ` [PATCH v6 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
@ 2025-01-20  2:40 ` Rik van Riel
  2025-01-20 19:47   ` David Hildenbrand
  2025-01-20  2:40 ` [PATCH v6 03/12] x86/mm: consolidate full flush threshold decision Rik van Riel
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-20  2:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Rik van Riel

Every pv_ops.mmu.tlb_remove_table call ends up calling tlb_remove_table.

Get rid of the indirection by simply calling tlb_remove_table directly,
and not going through the paravirt function pointers.

Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/x86/hyperv/mmu.c                 |  1 -
 arch/x86/include/asm/paravirt.h       |  5 -----
 arch/x86/include/asm/paravirt_types.h |  2 --
 arch/x86/kernel/kvm.c                 |  1 -
 arch/x86/kernel/paravirt.c            |  1 -
 arch/x86/mm/pgtable.c                 | 16 ++++------------
 arch/x86/xen/mmu_pv.c                 |  1 -
 7 files changed, 4 insertions(+), 23 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 1cc113200ff5..cbe6c71e17c1 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -240,5 +240,4 @@ void hyperv_setup_mmu_ops(void)
 
 	pr_info("Using hypercall for remote TLB flush\n");
 	pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
-	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d4eb9e1d61b8..794ba3647c6c 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -91,11 +91,6 @@ static inline void __flush_tlb_multi(const struct cpumask *cpumask,
 	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
-static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
-	PVOP_VCALL2(mmu.tlb_remove_table, tlb, table);
-}
-
 static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 {
 	PVOP_VCALL1(mmu.exit_mmap, mm);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8d4fbe1be489..13405959e4db 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -136,8 +136,6 @@ struct pv_mmu_ops {
 	void (*flush_tlb_multi)(const struct cpumask *cpus,
 				const struct flush_tlb_info *info);
 
-	void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
-
 	/* Hook for intercepting the destruction of an mm_struct. */
 	void (*exit_mmap)(struct mm_struct *mm);
 	void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7a422a6c5983..3be9b3342c67 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -838,7 +838,6 @@ static void __init kvm_guest_init(void)
 #ifdef CONFIG_SMP
 	if (pv_tlb_flush_supported()) {
 		pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
-		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 		pr_info("KVM setup pv remote TLB flush\n");
 	}
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 2b78a6b466ed..c019771e0123 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -186,7 +186,6 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
 	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
-	.mmu.tlb_remove_table	= tlb_remove_table,
 
 	.mmu.exit_mmap		= paravirt_nop,
 	.mmu.notify_page_enc_status_changed	= paravirt_nop,
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 5745a354a241..3dc4af1f7868 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -18,14 +18,6 @@ EXPORT_SYMBOL(physical_mask);
 #define PGTABLE_HIGHMEM 0
 #endif
 
-#ifndef CONFIG_PARAVIRT
-static inline
-void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
-	tlb_remove_page(tlb, table);
-}
-#endif
-
 gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
@@ -54,7 +46,7 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
 	pagetable_pte_dtor(page_ptdesc(pte));
 	paravirt_release_pte(page_to_pfn(pte));
-	paravirt_tlb_remove_table(tlb, pte);
+	tlb_remove_table(tlb, pte);
 }
 
 #if CONFIG_PGTABLE_LEVELS > 2
@@ -70,7 +62,7 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 	tlb->need_flush_all = 1;
 #endif
 	pagetable_pmd_dtor(ptdesc);
-	paravirt_tlb_remove_table(tlb, ptdesc_page(ptdesc));
+	tlb_remove_table(tlb, ptdesc_page(ptdesc));
 }
 
 #if CONFIG_PGTABLE_LEVELS > 3
@@ -80,14 +72,14 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 
 	pagetable_pud_dtor(ptdesc);
 	paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
-	paravirt_tlb_remove_table(tlb, virt_to_page(pud));
+	tlb_remove_table(tlb, virt_to_page(pud));
 }
 
 #if CONFIG_PGTABLE_LEVELS > 4
 void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
 {
 	paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
-	paravirt_tlb_remove_table(tlb, virt_to_page(p4d));
+	tlb_remove_table(tlb, virt_to_page(p4d));
 }
 #endif	/* CONFIG_PGTABLE_LEVELS > 4 */
 #endif	/* CONFIG_PGTABLE_LEVELS > 3 */
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 55a4996d0c04..041e17282af0 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2137,7 +2137,6 @@ static const typeof(pv_ops) xen_mmu_ops __initconst = {
 		.flush_tlb_kernel = xen_flush_tlb,
 		.flush_tlb_one_user = xen_flush_tlb_one_user,
 		.flush_tlb_multi = xen_flush_tlb_multi,
-		.tlb_remove_table = tlb_remove_table,
 
 		.pgd_alloc = xen_pgd_alloc,
 		.pgd_free = xen_pgd_free,
-- 
2.47.1




* [PATCH v6 03/12] x86/mm: consolidate full flush threshold decision
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
  2025-01-20  2:40 ` [PATCH v6 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
  2025-01-20  2:40 ` [PATCH v6 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call Rik van Riel
@ 2025-01-20  2:40 ` Rik van Riel
  2025-01-20  2:40 ` [PATCH v6 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-20  2:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Rik van Riel, Dave Hansen

Reduce code duplication by consolidating the decision point
for whether to do individual invalidations or a full flush
inside get_flush_tlb_info.
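
As a rough illustration of that decision point, here is a minimal
userspace sketch. It is not the kernel code: struct flush_range and
pick_flush_range() are made-up names, TLB_FLUSH_ALL is assumed to be
the all-ones sentinel, and the ceiling is passed in by the caller
instead of being read from tlb_single_page_flush_ceiling.

```c
#include <assert.h>

/* Stand-in for the kernel's TLB_FLUSH_ALL sentinel (assumed all ones). */
#define TLB_FLUSH_ALL (~0UL)

/* Hypothetical stand-in for struct flush_tlb_info; only the range is modelled. */
struct flush_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Promote a ranged flush to a full flush when it covers more than
 * `ceiling` strides. This is the same comparison the patch places
 * inside get_flush_tlb_info().
 */
struct flush_range pick_flush_range(unsigned long start, unsigned long end,
				    unsigned int stride_shift,
				    unsigned long ceiling)
{
	struct flush_range r = { start, end };

	assert(end >= start);
	if (((end - start) >> stride_shift) > ceiling) {
		r.start = 0;
		r.end = TLB_FLUSH_ALL;
	}
	return r;
}
```

With an assumed ceiling of 33 and 4 KiB strides, a 33-page range stays
a ranged flush, while a 34-page range is turned into a full flush. A
caller that already passes TLB_FLUSH_ALL as the end ends up with a full
flush as well, so callers only need to test the resulting end value.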

Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
---
 arch/x86/mm/tlb.c | 43 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6cf881a942bb..4c2feb7259b1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1009,6 +1009,15 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 	info->initiating_cpu	= smp_processor_id();
 	info->trim_cpumask	= 0;
 
+	/*
+	 * If the number of flushes is so large that a full flush
+	 * would be faster, do a full flush.
+	 */
+	if ((end - start) >> stride_shift > tlb_single_page_flush_ceiling) {
+		info->start = 0;
+		info->end = TLB_FLUSH_ALL;
+	}
+
 	return info;
 }
 
@@ -1026,17 +1035,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				bool freed_tables)
 {
 	struct flush_tlb_info *info;
+	int cpu = get_cpu();
 	u64 new_tlb_gen;
-	int cpu;
-
-	cpu = get_cpu();
-
-	/* Should we flush just the requested range? */
-	if ((end == TLB_FLUSH_ALL) ||
-	    ((end - start) >> stride_shift) > tlb_single_page_flush_ceiling) {
-		start = 0;
-		end = TLB_FLUSH_ALL;
-	}
 
 	/* This is also a barrier that synchronizes with switch_mm(). */
 	new_tlb_gen = inc_mm_tlb_gen(mm);
@@ -1089,22 +1089,19 @@ static void do_kernel_range_flush(void *info)
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-	/* Balance as user space task's flush, a bit conservative */
-	if (end == TLB_FLUSH_ALL ||
-	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
-	} else {
-		struct flush_tlb_info *info;
+	struct flush_tlb_info *info;
 
-		preempt_disable();
-		info = get_flush_tlb_info(NULL, start, end, 0, false,
-					  TLB_GENERATION_INVALID);
+	guard(preempt)();
 
+	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
+				  TLB_GENERATION_INVALID);
+
+	if (info->end == TLB_FLUSH_ALL)
+		on_each_cpu(do_flush_tlb_all, NULL, 1);
+	else
 		on_each_cpu(do_kernel_range_flush, info, 1);
 
-		put_flush_tlb_info();
-		preempt_enable();
-	}
+	put_flush_tlb_info();
 }
 
 /*
@@ -1276,7 +1273,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 
 	int cpu = get_cpu();
 
-	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
+	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, PAGE_SHIFT, false,
 				  TLB_GENERATION_INVALID);
 	/*
 	 * flush_tlb_multi() is not optimized for the common case in which only
-- 
2.47.1




* [PATCH v6 04/12] x86/mm: get INVLPGB count max from CPUID
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
                   ` (2 preceding siblings ...)
  2025-01-20  2:40 ` [PATCH v6 03/12] x86/mm: consolidate full flush threshold decision Rik van Riel
@ 2025-01-20  2:40 ` Rik van Riel
  2025-01-20  2:40 ` [PATCH v6 05/12] x86/mm: add INVLPGB support code Rik van Riel
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-20  2:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Rik van Riel

The CPU advertises the maximum number of pages that can be shot down
with one INVLPGB instruction in the CPUID data.

Save that information for later use.
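
For readers who want to see the decoding in isolation, a small sketch
follows. It works on a raw EDX value from CPUID leaf 0x80000008 rather
than executing CPUID, and the function name is made up for this
example. It mirrors the "(edx & 0xffff) + 1" expression used in the
patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode the INVLPGB page count from a raw CPUID 0x80000008 EDX value.
 * Only the low 16 bits are used, and one is added to the field, the
 * same way the patch computes invlpgb_count_max.
 *
 * The name invlpgb_count_max_from_edx() is hypothetical.
 */
uint32_t invlpgb_count_max_from_edx(uint32_t edx)
{
	return (edx & 0xffff) + 1;
}
```

The sketch returns a 32-bit value so that a field of 0xffff yields
65536. The patch stores the result in a u16, where that one value would
wrap to 0.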

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/Kconfig.cpu               | 5 +++++
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/include/asm/tlbflush.h    | 7 +++++++
 arch/x86/kernel/cpu/amd.c          | 8 ++++++++
 4 files changed, 21 insertions(+)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 2a7279d80460..abe013a1b076 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -395,6 +395,10 @@ config X86_VMX_FEATURE_NAMES
 	def_bool y
 	depends on IA32_FEAT_CTL
 
+config X86_BROADCAST_TLB_FLUSH
+	def_bool y
+	depends on CPU_SUP_AMD && 64BIT
+
 menuconfig PROCESSOR_SELECT
 	bool "Supported processor vendors" if EXPERT
 	help
@@ -431,6 +435,7 @@ config CPU_SUP_CYRIX_32
 config CPU_SUP_AMD
 	default y
 	bool "Support AMD processors" if PROCESSOR_SELECT
+	select X86_BROADCAST_TLB_FLUSH
 	help
 	  This enables detection, tunings and quirks for AMD processors
 
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 17b6590748c0..f9b832e971c5 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -338,6 +338,7 @@
 #define X86_FEATURE_CLZERO		(13*32+ 0) /* "clzero" CLZERO instruction */
 #define X86_FEATURE_IRPERF		(13*32+ 1) /* "irperf" Instructions Retired Count */
 #define X86_FEATURE_XSAVEERPTR		(13*32+ 2) /* "xsaveerptr" Always save/restore FP error pointers */
+#define X86_FEATURE_INVLPGB		(13*32+ 3) /* INVLPGB and TLBSYNC instruction supported. */
 #define X86_FEATURE_RDPRU		(13*32+ 4) /* "rdpru" Read processor register at user level */
 #define X86_FEATURE_WBNOINVD		(13*32+ 9) /* "wbnoinvd" WBNOINVD instruction */
 #define X86_FEATURE_AMD_IBPB		(13*32+12) /* Indirect Branch Prediction Barrier */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 02fc2aa06e9e..8fe3b2dda507 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -183,6 +183,13 @@ static inline void cr4_init_shadow(void)
 extern unsigned long mmu_cr4_features;
 extern u32 *trampoline_cr4_features;
 
+/* How many pages can we invalidate with one INVLPGB. */
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+extern u16 invlpgb_count_max;
+#else
+#define invlpgb_count_max 1
+#endif
+
 extern void initialize_tlbstate_and_flush(void);
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 79d2e17f6582..bcf73775b4f8 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -29,6 +29,8 @@
 
 #include "cpu.h"
 
+u16 invlpgb_count_max __ro_after_init;
+
 static inline int rdmsrl_amd_safe(unsigned msr, unsigned long long *p)
 {
 	u32 gprs[8] = { 0 };
@@ -1135,6 +1137,12 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
 		tlb_lli_2m[ENTRIES] = eax & mask;
 
 	tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
+
+	/* Max number of pages INVLPGB can invalidate in one shot */
+	if (boot_cpu_has(X86_FEATURE_INVLPGB)) {
+		cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
+		invlpgb_count_max = (edx & 0xffff) + 1;
+	}
 }
 
 static const struct cpu_dev amd_cpu_dev = {
-- 
2.47.1




* [PATCH v6 05/12] x86/mm: add INVLPGB support code
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
                   ` (3 preceding siblings ...)
  2025-01-20  2:40 ` [PATCH v6 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
@ 2025-01-20  2:40 ` Rik van Riel
  2025-01-21  9:45   ` Peter Zijlstra
  2025-01-20  2:40 ` [PATCH v6 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-20  2:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Rik van Riel

Add invlpgb.h with the helper functions and definitions needed to use
broadcast TLB invalidation on AMD EPYC 3 and newer CPUs.
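
To make the operand layout used by __invlpgb() easier to follow, here
is a userspace sketch that only packs the three register values and
issues no instruction. struct invlpgb_regs and invlpgb_pack() are
invented for this example; the bit positions are copied from the
helper in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bits with the same values as in the patch's invlpgb.h. */
#define INVLPGB_VA	(1ULL << 0)
#define INVLPGB_PCID	(1ULL << 1)

/* Hypothetical container for the three register operands. */
struct invlpgb_regs {
	uint64_t rax;
	uint32_t ecx;
	uint32_t edx;
};

struct invlpgb_regs invlpgb_pack(uint32_t asid, uint32_t pcid, uint64_t addr,
				 uint16_t extra_count, bool pmd_stride,
				 uint64_t flags)
{
	struct invlpgb_regs r;

	/* The flags share rax with the address, so addr must be page aligned. */
	assert((addr & 0xfff) == 0);

	r.edx = (pcid << 16) | asid;			/* PCID high, ASID low */
	r.ecx = ((uint32_t)pmd_stride << 31) | extra_count;
	r.rax = addr | flags;
	return r;
}
```

For example, PCID 5 with ASID 0 gives edx = 0x50000, and three extra
pages with the PMD stride bit set give ecx = 0x80000003. Keeping the
packing separate from the asm statement is just a convenience for this
sketch; the patch does both in one inline function.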

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/invlpgb.h  | 97 +++++++++++++++++++++++++++++++++
 arch/x86/include/asm/tlbflush.h |  1 +
 2 files changed, 98 insertions(+)
 create mode 100644 arch/x86/include/asm/invlpgb.h

diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
new file mode 100644
index 000000000000..4dfd09e65fa6
--- /dev/null
+++ b/arch/x86/include/asm/invlpgb.h
@@ -0,0 +1,97 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_INVLPGB
+#define _ASM_X86_INVLPGB
+
+#include <linux/kernel.h>
+#include <vdso/bits.h>
+
+/*
+ * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
+ *
+ * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
+ * be done in a parallel fashion.
+ *
+ * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
+ * this CPU have completed.
+ */
+static inline void __invlpgb(unsigned long asid, unsigned long pcid,
+			     unsigned long addr, u16 extra_count,
+			     bool pmd_stride, unsigned long flags)
+{
+	u32 edx = (pcid << 16) | asid;
+	u32 ecx = (pmd_stride << 31) | extra_count;
+	u64 rax = addr | flags;
+
+	/* INVLPGB; supported in binutils >= 2.36. */
+	asm volatile(".byte 0x0f, 0x01, 0xfe" : : "a" (rax), "c" (ecx), "d" (edx));
+}
+
+/* Wait for INVLPGB originated by this CPU to complete. */
+static inline void tlbsync(void)
+{
+	cant_migrate();
+	/* TLBSYNC: supported in binutils >= 2.36. */
+	asm volatile(".byte 0x0f, 0x01, 0xff" ::: "memory");
+}
+
+/*
+ * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
+ * of the three. For example:
+ * - INVLPGB_VA | INVLPGB_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - INVLPGB_PCID:			  invalidate all TLB entries matching the PCID
+ *
+ * The first can be used to invalidate (kernel) mappings at a particular
+ * address across all processes.
+ *
+ * The latter invalidates all TLB entries matching a PCID.
+ */
+#define INVLPGB_VA			BIT(0)
+#define INVLPGB_PCID			BIT(1)
+#define INVLPGB_ASID			BIT(2)
+#define INVLPGB_INCLUDE_GLOBAL		BIT(3)
+#define INVLPGB_FINAL_ONLY		BIT(4)
+#define INVLPGB_INCLUDE_NESTED		BIT(5)
+
+/* Flush all mappings for a given pcid and addr, not including globals. */
+static inline void invlpgb_flush_user(unsigned long pcid,
+				      unsigned long addr)
+{
+	__invlpgb(0, pcid, addr, 0, 0, INVLPGB_PCID | INVLPGB_VA);
+	tlbsync();
+}
+
+static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
+						unsigned long addr,
+						u16 nr,
+						bool pmd_stride)
+{
+	__invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+}
+
+/* Flush all mappings for a given PCID, not including globals. */
+static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+{
+	__invlpgb(0, pcid, 0, 0, 0, INVLPGB_PCID);
+}
+
+/* Flush all mappings, including globals, for all PCIDs. */
+static inline void invlpgb_flush_all(void)
+{
+	__invlpgb(0, 0, 0, 0, 0, INVLPGB_INCLUDE_GLOBAL);
+	tlbsync();
+}
+
+/* Flush addr, including globals, for all PCIDs. */
+static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+{
+	__invlpgb(0, 0, addr, nr - 1, 0, INVLPGB_INCLUDE_GLOBAL);
+}
+
+/* Flush all mappings for all PCIDs except globals. */
+static inline void invlpgb_flush_all_nonglobals(void)
+{
+	__invlpgb(0, 0, 0, 0, 0, 0);
+	tlbsync();
+}
+
+#endif /* _ASM_X86_INVLPGB */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8fe3b2dda507..dba5caa4a9f4 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -10,6 +10,7 @@
 #include <asm/cpufeature.h>
 #include <asm/special_insns.h>
 #include <asm/smp.h>
+#include <asm/invlpgb.h>
 #include <asm/invpcid.h>
 #include <asm/pti.h>
 #include <asm/processor-flags.h>
-- 
2.47.1




* [PATCH v6 06/12] x86/mm: use INVLPGB for kernel TLB flushes
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
                   ` (4 preceding siblings ...)
  2025-01-20  2:40 ` [PATCH v6 05/12] x86/mm: add INVLPGB support code Rik van Riel
@ 2025-01-20  2:40 ` Rik van Riel
  2025-01-20  2:40 ` [PATCH v6 07/12] x86/tlb: use INVLPGB in flush_tlb_all Rik van Riel
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-20  2:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Rik van Riel

Use broadcast TLB invalidation for kernel addresses when available.

Remove the need to send IPIs for kernel TLB flushes.
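
The ranged case walks the address range in chunks of at most
invlpgb_count_max pages. The following userspace sketch models only
that loop and counts how many INVLPGB instructions would be issued. The
function name is made up, PAGE_SHIFT is assumed to be 12, and the
alignment assertion is an addition for the sketch so that the loop
always makes progress.

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumed: 4 KiB pages */

/*
 * Count the ranged invalidations needed for [start, end) when one
 * instruction can cover at most count_max pages. The loop has the
 * same shape as the one in broadcast_kernel_range_flush().
 */
unsigned long count_invlpgb_batches(unsigned long start, unsigned long end,
				    unsigned long count_max)
{
	unsigned long addr, nr = 0, batches = 0;

	assert(count_max > 0);
	/* Page-aligned bounds guarantee nr >= 1 on every iteration. */
	assert(((start | end) & ((1UL << PAGE_SHIFT) - 1)) == 0);

	for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
		unsigned long left = (end - addr) >> PAGE_SHIFT;

		nr = left < count_max ? left : count_max;
		batches++;	/* one INVLPGB would be issued here */
	}
	return batches;
}
```

A 17-page range with a count_max of 8 takes three instructions
(8 + 8 + 1 pages), followed in the real code by a single TLBSYNC.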

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/mm/tlb.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4c2feb7259b1..2c9e9b7482dd 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1077,6 +1077,30 @@ void flush_tlb_all(void)
 	on_each_cpu(do_flush_tlb_all, NULL, 1);
 }
 
+static bool broadcast_kernel_range_flush(struct flush_tlb_info *info)
+{
+	unsigned long addr;
+	unsigned long nr;
+
+	if (!IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH))
+		return false;
+
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	if (info->end == TLB_FLUSH_ALL) {
+		invlpgb_flush_all();
+		return true;
+	}
+
+	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
+		nr = min((info->end - addr) >> PAGE_SHIFT, invlpgb_count_max);
+		invlpgb_flush_addr_nosync(addr, nr);
+	}
+	tlbsync();
+	return true;
+}
+
 static void do_kernel_range_flush(void *info)
 {
 	struct flush_tlb_info *f = info;
@@ -1096,7 +1120,9 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
 				  TLB_GENERATION_INVALID);
 
-	if (info->end == TLB_FLUSH_ALL)
+	if (broadcast_kernel_range_flush(info))
+		; /* Fall through. */
+	else if (info->end == TLB_FLUSH_ALL)
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
 	else
 		on_each_cpu(do_kernel_range_flush, info, 1);
-- 
2.47.1




* [PATCH v6 07/12] x86/tlb: use INVLPGB in flush_tlb_all
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
                   ` (5 preceding siblings ...)
  2025-01-20  2:40 ` [PATCH v6 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
@ 2025-01-20  2:40 ` Rik van Riel
  2025-01-20  2:40 ` [PATCH v6 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-20  2:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Rik van Riel

The flush_tlb_all() function is not used a whole lot, but we might
as well use broadcast TLB flushing there, too.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/mm/tlb.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2c9e9b7482dd..e2a0b7fc5fed 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1065,6 +1065,19 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 }
 
 
+static bool broadcast_flush_tlb_all(void)
+{
+	if (!IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH))
+		return false;
+
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	guard(preempt)();
+	invlpgb_flush_all();
+	return true;
+}
+
 static void do_flush_tlb_all(void *info)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -1073,6 +1086,8 @@ static void do_flush_tlb_all(void *info)
 
 void flush_tlb_all(void)
 {
+	if (broadcast_flush_tlb_all())
+		return;
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	on_each_cpu(do_flush_tlb_all, NULL, 1);
 }
-- 
2.47.1




* [PATCH v6 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
                   ` (6 preceding siblings ...)
  2025-01-20  2:40 ` [PATCH v6 07/12] x86/tlb: use INVLPGB in flush_tlb_all Rik van Riel
@ 2025-01-20  2:40 ` Rik van Riel
  2025-01-20  2:40 ` [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-20  2:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Rik van Riel

In the page reclaim code, we only track the CPU(s) where the TLB needs
to be flushed, rather than all the individual mappings that may be getting
invalidated.

Use broadcast TLB flushing when that is available.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/mm/tlb.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e2a0b7fc5fed..9d4864db5720 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1321,7 +1321,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		invlpgb_flush_all_nonglobals();
+	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
 		flush_tlb_multi(&batch->cpumask, info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
-- 
2.47.1




* [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
                   ` (7 preceding siblings ...)
  2025-01-20  2:40 ` [PATCH v6 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
@ 2025-01-20  2:40 ` Rik van Riel
  2025-01-20 14:02   ` Nadav Amit
                     ` (2 more replies)
  2025-01-20  2:40 ` [PATCH v6 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code Rik van Riel
                   ` (4 subsequent siblings)
  13 siblings, 3 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-20  2:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Rik van Riel

Use broadcast TLB invalidation, via the INVLPGB instruction, on AMD EPYC 3
and newer CPUs.

In order to not exhaust PCID space, and keep TLB flushes local for single
threaded processes, we only hand out broadcast ASIDs to processes active on
3 or more CPUs, and gradually increase the threshold as broadcast ASID space
is depleted.
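
As a rough illustration of how the threshold scales, here is a minimal
userspace sketch of the logic in meets_global_asid_threshold().
MODEL_MAX_ASIDS and model_threshold() are made-up names for this example,
and 2048 is only an assumed stand-in for MAX_ASID_AVAILABLE, not the
kernel's actual value. The returned threshold is what
mm_active_cpus_exceeds() compares against, so a process has to be active
on more than that many CPUs to qualify.

```c
#include <assert.h>

/* Hypothetical stand-in for MAX_ASID_AVAILABLE; the real value may differ. */
#define MODEL_MAX_ASIDS		2048
#define HALFFULL_THRESHOLD	8

/*
 * Userspace model: given the number of global ASIDs still available,
 * return the "active on more than N CPUs" threshold a process must
 * exceed before it would be handed a global ASID.
 */
static int model_threshold(int avail)
{
	int threshold = HALFFULL_THRESHOLD;

	/* The kernel code bails out earlier when nothing is available. */
	assert(avail > 0 && avail <= MODEL_MAX_ASIDS);

	if (avail > MODEL_MAX_ASIDS * 3 / 4) {
		/* Mostly empty: be generous. */
		threshold = HALFFULL_THRESHOLD / 4;
	} else if (avail > MODEL_MAX_ASIDS / 2) {
		threshold = HALFFULL_THRESHOLD / 2;
	} else if (avail < MODEL_MAX_ASIDS / 3) {
		/* Getting full: double the threshold as space shrinks. */
		do {
			avail *= 2;
			threshold *= 2;
		} while ((avail + threshold) < MODEL_MAX_ASIDS / 2);
	}

	return threshold;
}
```

Under these assumptions model_threshold(2000) returns 2, which matches the
"3 or more CPUs" rule above, while model_threshold(100) returns 128.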

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/mmu.h         |   6 +
 arch/x86/include/asm/mmu_context.h |  14 ++
 arch/x86/include/asm/tlbflush.h    |  72 ++++++
 arch/x86/mm/tlb.c                  | 362 ++++++++++++++++++++++++++++-
 4 files changed, 442 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 3b496cdcb74b..d71cd599fec4 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -69,6 +69,12 @@ typedef struct {
 	u16 pkey_allocation_map;
 	s16 execute_only_pkey;
 #endif
+
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	u16 global_asid;
+	bool asid_transition;
+#endif
+
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(mm)						\
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 795fdd53bd0a..d670699d32c2 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
 #define enter_lazy_tlb enter_lazy_tlb
 extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
+extern void destroy_context_free_global_asid(struct mm_struct *mm);
+
 /*
  * Init a new mm.  Used on mm copies, like at fork()
  * and on mm's that are brand-new, like at execve().
@@ -161,6 +163,14 @@ static inline int init_new_context(struct task_struct *tsk,
 		mm->context.execute_only_pkey = -1;
 	}
 #endif
+
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		mm->context.global_asid = 0;
+		mm->context.asid_transition = false;
+	}
+#endif
+
 	mm_reset_untag_mask(mm);
 	init_new_context_ldt(mm);
 	return 0;
@@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
 static inline void destroy_context(struct mm_struct *mm)
 {
 	destroy_context_ldt(mm);
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		destroy_context_free_global_asid(mm);
+#endif
 }
 
 extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dba5caa4a9f4..5eae5c1aafa5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -239,6 +239,78 @@ void flush_tlb_one_kernel(unsigned long addr);
 void flush_tlb_multi(const struct cpumask *cpumask,
 		      const struct flush_tlb_info *info);
 
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+static inline bool is_dyn_asid(u16 asid)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return true;
+
+	return asid < TLB_NR_DYN_ASIDS;
+}
+
+static inline bool is_global_asid(u16 asid)
+{
+	return !is_dyn_asid(asid);
+}
+
+static inline bool in_asid_transition(const struct flush_tlb_info *info)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	return info->mm && READ_ONCE(info->mm->context.asid_transition);
+}
+
+static inline u16 mm_global_asid(struct mm_struct *mm)
+{
+	u16 asid;
+
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return 0;
+
+	asid = READ_ONCE(mm->context.global_asid);
+
+	/* mm->context.global_asid is either 0, or a global ASID */
+	VM_WARN_ON_ONCE(is_dyn_asid(asid));
+
+	return asid;
+}
+#else
+static inline bool is_dyn_asid(u16 asid)
+{
+	return true;
+}
+
+static inline bool is_global_asid(u16 asid)
+{
+	return false;
+}
+
+static inline bool in_asid_transition(const struct flush_tlb_info *info)
+{
+	return false;
+}
+
+static inline u16 mm_global_asid(struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
+{
+	return false;
+}
+
+static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+	VM_WARN_ON_ONCE(1);
+}
+
+static inline void consider_global_asid(struct mm_struct *mm)
+{
+}
+#endif
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9d4864db5720..08eee1f8573a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -74,13 +74,15 @@
  * use different names for each of them:
  *
  * ASID  - [0, TLB_NR_DYN_ASIDS-1]
- *         the canonical identifier for an mm
+ *         the canonical identifier for an mm, dynamically allocated on each CPU
+ *         [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
+ *         the canonical, global identifier for an mm, identical across all CPUs
  *
- * kPCID - [1, TLB_NR_DYN_ASIDS]
+ * kPCID - [1, MAX_ASID_AVAILABLE]
  *         the value we write into the PCID part of CR3; corresponds to the
  *         ASID+1, because PCID 0 is special.
  *
- * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
+ * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
  *         for KPTI each mm has two address spaces and thus needs two
  *         PCID values, but we can still do with a single ASID denomination
  *         for each mm. Corresponds to kPCID + 2048.
@@ -225,6 +227,20 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 		return;
 	}
 
+	/*
+	 * TLB consistency for global ASIDs is maintained with broadcast TLB
+	 * flushing. The TLB is never outdated, and does not need flushing.
+	 */
+	if (IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) && static_cpu_has(X86_FEATURE_INVLPGB)) {
+		u16 global_asid = mm_global_asid(next);
+
+		if (global_asid) {
+			*new_asid = global_asid;
+			*need_flush = false;
+			return;
+		}
+	}
+
 	if (this_cpu_read(cpu_tlbstate.invalidate_other))
 		clear_asid_other();
 
@@ -251,6 +267,290 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 	*need_flush = true;
 }
 
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+/*
+ * Logic for broadcast TLB invalidation.
+ */
+static DEFINE_RAW_SPINLOCK(global_asid_lock);
+static u16 last_global_asid = MAX_ASID_AVAILABLE;
+static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE) = { 0 };
+static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE) = { 0 };
+static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
+
+static void reset_global_asid_space(void)
+{
+	lockdep_assert_held(&global_asid_lock);
+
+	/*
+	 * A global TLB flush guarantees that any stale entries from
+	 * previously freed global ASIDs get flushed from the TLB
+	 * everywhere, making these global ASIDs safe to reuse.
+	 */
+	invlpgb_flush_all_nonglobals();
+
+	/*
+	 * Clear all the previously freed global ASIDs from the
+	 * global_asid_used bitmap, now that the global TLB flush
+	 * has made them actually available for re-use.
+	 */
+	bitmap_andnot(global_asid_used, global_asid_used,
+			global_asid_freed, MAX_ASID_AVAILABLE);
+	bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
+
+	/*
+	 * ASIDs 0-TLB_NR_DYN_ASIDS are used for CPU-local ASID
+	 * assignments, for tasks doing IPI based TLB shootdowns.
+	 * Restart the search from the start of the global ASID space.
+	 */
+	last_global_asid = TLB_NR_DYN_ASIDS;
+}
+
+static u16 get_global_asid(void)
+{
+	lockdep_assert_held(&global_asid_lock);
+
+	do {
+		u16 start = last_global_asid;
+		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
+
+		if (asid >= MAX_ASID_AVAILABLE) {
+			reset_global_asid_space();
+			continue;
+		}
+
+		/* Claim this global ASID. */
+		__set_bit(asid, global_asid_used);
+		last_global_asid = asid;
+		global_asid_available--;
+		return asid;
+	} while (1);
+}
+
+/*
+ * Returns true if the mm is transitioning from a CPU-local ASID to a global
+ * (INVLPGB) ASID, or the other way around.
+ */
+static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
+{
+	u16 global_asid = mm_global_asid(next);
+
+	if (global_asid && prev_asid != global_asid)
+		return true;
+
+	if (!global_asid && is_global_asid(prev_asid))
+		return true;
+
+	return false;
+}
+
+void destroy_context_free_global_asid(struct mm_struct *mm)
+{
+	if (!mm->context.global_asid)
+		return;
+
+	guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+	/* The global ASID can be re-used only after flush at wrap-around. */
+	__set_bit(mm->context.global_asid, global_asid_freed);
+
+	mm->context.global_asid = 0;
+	global_asid_available++;
+}
+
+/*
+ * Check whether a process is currently active on more than "threshold" CPUs.
+ * This is a cheap estimation on whether or not it may make sense to assign
+ * a global ASID to this process, and use broadcast TLB invalidation.
+ */
+static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
+{
+	int count = 0;
+	int cpu;
+
+	/* This quick check should eliminate most single threaded programs. */
+	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
+		return false;
+
+	/* Slower check to make sure. */
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/* Skip the CPUs that aren't really running this process. */
+		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
+			continue;
+
+		if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
+			continue;
+
+		if (++count > threshold)
+			return true;
+	}
+	return false;
+}
+
+/*
+ * Assign a global ASID to the current process, protecting against
+ * races between multiple threads in the process.
+ */
+static void use_global_asid(struct mm_struct *mm)
+{
+	guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+	/* This process is already using broadcast TLB invalidation. */
+	if (mm->context.global_asid)
+		return;
+
+	/* The last global ASID was consumed while waiting for the lock. */
+	if (!global_asid_available)
+		return;
+
+	/*
+	 * The transition from IPI TLB flushing, with a dynamic ASID,
+	 * to broadcast TLB flushing, using a global ASID, uses memory
+	 * ordering for synchronization.
+	 *
+	 * While the process has threads still using a dynamic ASID,
+	 * TLB invalidation IPIs continue to get sent.
+	 *
+	 * This code sets asid_transition first, before assigning the
+	 * global ASID.
+	 *
+	 * The TLB flush code will only verify the ASID transition
+	 * after it has seen the new global ASID for the process.
+	 */
+	WRITE_ONCE(mm->context.asid_transition, true);
+	WRITE_ONCE(mm->context.global_asid, get_global_asid());
+}
+
+/*
+ * Figure out whether to assign a global ASID to a process.
+ * We vary the threshold by how empty or full global ASID space is.
+ * 1/4 full: >= 4 active threads
+ * 1/2 full: >= 8 active threads
+ * 3/4 full: >= 16 active threads
+ * 7/8 full: >= 32 active threads
+ * etc
+ *
+ * This way we should never exhaust the global ASID space, even on very
+ * large systems, and the processes with the largest number of active
+ * threads should be able to use broadcast TLB invalidation.
+ */
+#define HALFFULL_THRESHOLD 8
+static bool meets_global_asid_threshold(struct mm_struct *mm)
+{
+	int avail = global_asid_available;
+	int threshold = HALFFULL_THRESHOLD;
+
+	if (!avail)
+		return false;
+
+	if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
+		threshold = HALFFULL_THRESHOLD / 4;
+	} else if (avail > MAX_ASID_AVAILABLE / 2) {
+		threshold = HALFFULL_THRESHOLD / 2;
+	} else if (avail < MAX_ASID_AVAILABLE / 3) {
+		do {
+			avail *= 2;
+			threshold *= 2;
+		} while ((avail + threshold) < MAX_ASID_AVAILABLE / 2);
+	}
+
+	return mm_active_cpus_exceeds(mm, threshold);
+}
+
+static void consider_global_asid(struct mm_struct *mm)
+{
+	if (!static_cpu_has(X86_FEATURE_INVLPGB))
+		return;
+
+	/* Check every once in a while. */
+	if ((current->pid & 0x1f) != (jiffies & 0x1f))
+		return;
+
+	if (meets_global_asid_threshold(mm))
+		use_global_asid(mm);
+}
+
+static void finish_asid_transition(struct flush_tlb_info *info)
+{
+	struct mm_struct *mm = info->mm;
+	int bc_asid = mm_global_asid(mm);
+	int cpu;
+
+	if (!READ_ONCE(mm->context.asid_transition))
+		return;
+
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/*
+		 * The remote CPU is context switching. Wait for that to
+		 * finish, to catch the unlikely case of it switching to
+		 * the target mm with an out of date ASID.
+		 */
+		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
+			cpu_relax();
+
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
+			continue;
+
+		/*
+		 * If at least one CPU is not using the global ASID yet,
+		 * send a TLB flush IPI. The IPI should cause stragglers
+		 * to transition soon.
+		 *
+		 * This can race with the CPU switching to another task;
+		 * that results in a (harmless) extra IPI.
+		 */
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
+			flush_tlb_multi(mm_cpumask(info->mm), info);
+			return;
+		}
+	}
+
+	/* All the CPUs running this process are using the global ASID. */
+	WRITE_ONCE(mm->context.asid_transition, false);
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+	bool pmd = info->stride_shift == PMD_SHIFT;
+	unsigned long maxnr = invlpgb_count_max;
+	unsigned long asid = info->mm->context.global_asid;
+	unsigned long addr = info->start;
+	unsigned long nr;
+
+	/* Flushing multiple pages at once is not supported with 1GB pages. */
+	if (info->stride_shift > PMD_SHIFT)
+		maxnr = 1;
+
+	/*
+	 * TLB flushes with INVLPGB are kicked off asynchronously.
+	 * The inc_mm_tlb_gen() guarantees page table updates are done
+	 * before these TLB flushes happen.
+	 */
+	if (info->end == TLB_FLUSH_ALL) {
+		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
+	} else for (; addr < info->end; addr += nr << info->stride_shift) {
+		/*
+		 * Calculate how many pages can be flushed at once; if the
+		 * remainder of the range is less than one page, flush one.
+		 */
+		nr = min(maxnr, (info->end - addr) >> info->stride_shift);
+		nr = max(nr, 1);
+
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+	}
+
+	finish_asid_transition(info);
+
+	/* Wait for the INVLPGBs kicked off above to finish. */
+	tlbsync();
+}
+#endif /* CONFIG_X86_BROADCAST_TLB_FLUSH */
+
 /*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
@@ -556,8 +856,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 	 */
 	if (prev == next) {
 		/* Not actually switching mm's */
-		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
-			   next->context.ctx_id);
+		VM_WARN_ON(is_dyn_asid(prev_asid) &&
+				this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
+				next->context.ctx_id);
 
 		/*
 		 * If this races with another thread that enables lam, 'new_lam'
@@ -573,6 +874,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 
+		/*
+		 * Check if the current mm is transitioning to a new ASID.
+		 */
+		if (needs_global_asid_reload(next, prev_asid)) {
+			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+
+			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+			goto reload_tlb;
+		}
+
+		/*
+		 * Broadcast TLB invalidation keeps this PCID up to date
+		 * all the time.
+		 */
+		if (is_global_asid(prev_asid))
+			return;
+
 		/*
 		 * If the CPU is not in lazy TLB mode, we are just switching
 		 * from one thread in a process to another thread in the same
@@ -606,6 +924,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		 */
 		cond_mitigation(tsk);
 
+		/*
+		 * Let nmi_uaccess_okay() and finish_asid_transition()
+		 * know that we're changing CR3.
+		 */
+		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
+		barrier();
+
 		/*
 		 * Leave this CPU in prev's mm_cpumask. Atomic writes to
 		 * mm_cpumask can be expensive under contention. The CPU
@@ -620,14 +945,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
-
-		/* Let nmi_uaccess_okay() know that we're changing CR3. */
-		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
-		barrier();
 	}
 
+reload_tlb:
 	new_lam = mm_lam_cr3_mask(next);
 	if (need_flush) {
+		VM_WARN_ON_ONCE(is_global_asid(new_asid));
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
 		load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
@@ -746,7 +1069,7 @@ static void flush_tlb_func(void *info)
 	const struct flush_tlb_info *f = info;
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+	u64 local_tlb_gen;
 	bool local = smp_processor_id() == f->initiating_cpu;
 	unsigned long nr_invalidate = 0;
 	u64 mm_tlb_gen;
@@ -769,6 +1092,16 @@ static void flush_tlb_func(void *info)
 	if (unlikely(loaded_mm == &init_mm))
 		return;
 
+	/* Reload the ASID if transitioning into or out of a global ASID */
+	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
+		switch_mm_irqs_off(NULL, loaded_mm, NULL);
+		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	}
+
+	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
+	if (is_global_asid(loaded_mm_asid))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
@@ -786,6 +1119,8 @@ static void flush_tlb_func(void *info)
 		return;
 	}
 
+	local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+
 	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
 		     f->new_tlb_gen <= local_tlb_gen)) {
 		/*
@@ -953,7 +1288,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
 	 */
-	if (info->freed_tables)
+	if (info->freed_tables || in_asid_transition(info))
 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
 	else
 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
@@ -1049,9 +1384,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+	if (mm_global_asid(mm)) {
+		broadcast_tlb_flush(info);
+	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
 		info->trim_cpumask = should_trim_cpumask(mm);
 		flush_tlb_multi(mm_cpumask(mm), info);
+		consider_global_asid(mm);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-- 
2.47.1




* [PATCH v6 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
                   ` (8 preceding siblings ...)
  2025-01-20  2:40 ` [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
@ 2025-01-20  2:40 ` Rik van Riel
  2025-01-20  2:40 ` [PATCH v6 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-20  2:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Rik van Riel

Instead of doing a system-wide TLB flush from arch_tlbbatch_flush,
queue up asynchronous, targeted flushes from arch_tlbbatch_add_pending.

This also allows us to avoid adding the CPUs of processes using broadcast
flushing to the batch->cpumask, and will hopefully further reduce TLB
flushing from the reclaim and compaction paths.
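
A minimal userspace sketch of the bookkeeping this implies, assuming a
64-bit word as a stand-in for a cpumask and counters in place of the real
INVLPGB and TLBSYNC instructions. All model_* names are hypothetical and
exist only for this illustration; migration pinning and the mmu notifier
call are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_mm {
	uint16_t global_asid;	/* 0 means "no global ASID" */
	bool asid_transition;	/* some CPUs still on a local ASID */
	uint64_t cpumask;	/* CPUs this mm has run on */
};

struct model_batch {
	uint64_t cpumask;	/* CPUs that still need an IPI */
	bool used_invlpgb;	/* async broadcast flushes are in flight */
	int queued;		/* stand-in for issued INVLPGBs */
	int syncs;		/* stand-in for issued TLBSYNCs */
};

/* Queue one page: broadcast if possible, otherwise remember the CPUs. */
static void model_add_pending(struct model_batch *b, const struct model_mm *mm)
{
	assert(b && mm);

	if (mm->global_asid) {
		b->used_invlpgb = true;
		b->queued++;
		/* Fully transitioned: no IPIs needed for this mm. */
		if (!mm->asid_transition)
			return;
	}
	b->cpumask |= mm->cpumask;
}

/* Flush the batch: return the IPI targets, sync once if we broadcast. */
static uint64_t model_flush(struct model_batch *b)
{
	uint64_t ipi_targets = b->cpumask;

	if (b->used_invlpgb) {
		b->syncs++;
		b->used_invlpgb = false;
	}
	b->cpumask = 0;
	return ipi_targets;
}
```

In this model, a process with a global ASID contributes one queued
broadcast flush and no CPUs to the mask, so only the CPUs of the other
processes are returned as IPI targets, and a single sync covers all the
queued broadcasts.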

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/tlbbatch.h |  1 +
 arch/x86/include/asm/tlbflush.h | 12 ++------
 arch/x86/mm/tlb.c               | 54 +++++++++++++++++++++++++++++++--
 3 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/tlbbatch.h b/arch/x86/include/asm/tlbbatch.h
index 1ad56eb3e8a8..f9a17edf63ad 100644
--- a/arch/x86/include/asm/tlbbatch.h
+++ b/arch/x86/include/asm/tlbbatch.h
@@ -10,6 +10,7 @@ struct arch_tlbflush_unmap_batch {
 	 * the PFNs being flushed..
 	 */
 	struct cpumask cpumask;
+	bool used_invlpgb;
 };
 
 #endif /* _ARCH_X86_TLBBATCH_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5eae5c1aafa5..e5516afdef7d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -358,21 +358,15 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 	return atomic64_inc_return(&mm->context.tlb_gen);
 }
 
-static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
-					     struct mm_struct *mm,
-					     unsigned long uaddr)
-{
-	inc_mm_tlb_gen(mm);
-	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
-	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
-}
-
 static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
 {
 	flush_tlb_mm(mm);
 }
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr);
 
 static inline bool pte_flags_need_flush(unsigned long oldflags,
 					unsigned long newflags,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 08eee1f8573a..f731e6cfaa29 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1659,9 +1659,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
-		invlpgb_flush_all_nonglobals();
-	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
 		flush_tlb_multi(&batch->cpumask, info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
@@ -1670,12 +1668,62 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 		local_irq_enable();
 	}
 
+	/*
+	 * If we issued (asynchronous) INVLPGB flushes, wait for them here.
+	 * The cpumask above contains only CPUs that were running tasks
+	 * not using broadcast TLB flushing.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && batch->used_invlpgb) {
+		tlbsync();
+		migrate_enable();
+		batch->used_invlpgb = false;
+	}
+
 	cpumask_clear(&batch->cpumask);
 
 	put_flush_tlb_info();
 	put_cpu();
 }
 
+void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr)
+{
+	if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_global_asid(mm)) {
+		u16 asid = mm_global_asid(mm);
+		/*
+		 * Queue up an asynchronous invalidation. The corresponding
+		 * TLBSYNC is done in arch_tlbbatch_flush(), and must be done
+		 * on the same CPU.
+		 */
+		if (!batch->used_invlpgb) {
+			batch->used_invlpgb = true;
+			migrate_disable();
+		}
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
+
+		/*
+		 * Some CPUs might still be using a local ASID for this
+		 * process, and require IPIs, while others are using the
+		 * global ASID.
+		 *
+		 * In this corner case we need to do both the broadcast
+		 * TLB invalidation, and send IPIs. The IPIs will help
+		 * stragglers transition to the broadcast ASID.
+		 */
+		if (READ_ONCE(mm->context.asid_transition))
+			goto also_send_ipi;
+	} else {
+also_send_ipi:
+		inc_mm_tlb_gen(mm);
+		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
+	}
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
+}
+
 /*
  * Blindly accessing user memory from NMI context can be dangerous
  * if we're in the middle of switching the current user task or
-- 
2.47.1




* [PATCH v6 11/12] x86/mm: enable AMD translation cache extensions
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
                   ` (9 preceding siblings ...)
  2025-01-20  2:40 ` [PATCH v6 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code Rik van Riel
@ 2025-01-20  2:40 ` Rik van Riel
  2025-01-20  2:40 ` [PATCH v6 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-20  2:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Rik van Riel

With AMD TCE (translation cache extensions) only the intermediate mappings
that cover the address range zapped by INVLPG / INVLPGB get invalidated,
rather than all intermediate mappings getting zapped at every TLB invalidation.

This can help reduce the TLB miss rate, by keeping more intermediate
mappings in the cache.

From the AMD manual:

Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit
to 1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on
TLB entries. When this bit is 0, these instructions remove the target PTE
from the TLB as well as all upper-level table entries that are cached
in the TLB, whether or not they are associated with the target PTE.
When this bit is set, these instructions will remove the target PTE and
only those upper-level entries that lead to the target PTE in
the page table hierarchy, leaving unrelated upper-level entries intact.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/msr-index.h       | 2 ++
 arch/x86/kernel/cpu/amd.c              | 4 ++++
 tools/arch/x86/include/asm/msr-index.h | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3ae84c3b8e6d..dc1c1057f26e 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
 #define _EFER_SVME		12 /* Enable virtualization */
 #define _EFER_LMSLE		13 /* Long Mode Segment Limit Enable */
 #define _EFER_FFXSR		14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE		15 /* Enable Translation Cache Extensions */
 #define _EFER_AUTOIBRS		21 /* Enable Automatic IBRS */
 
 #define EFER_SCE		(1<<_EFER_SCE)
@@ -34,6 +35,7 @@
 #define EFER_SVME		(1<<_EFER_SVME)
 #define EFER_LMSLE		(1<<_EFER_LMSLE)
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
+#define EFER_TCE		(1<<_EFER_TCE)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index bcf73775b4f8..21076252a491 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1071,6 +1071,10 @@ static void init_amd(struct cpuinfo_x86 *c)
 
 	/* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
 	clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
+
+	/* Enable Translation Cache Extension */
+	if (cpu_feature_enabled(X86_FEATURE_TCE))
+		msr_set_bit(MSR_EFER, _EFER_TCE);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h
index 3ae84c3b8e6d..dc1c1057f26e 100644
--- a/tools/arch/x86/include/asm/msr-index.h
+++ b/tools/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
 #define _EFER_SVME		12 /* Enable virtualization */
 #define _EFER_LMSLE		13 /* Long Mode Segment Limit Enable */
 #define _EFER_FFXSR		14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE		15 /* Enable Translation Cache Extensions */
 #define _EFER_AUTOIBRS		21 /* Enable Automatic IBRS */
 
 #define EFER_SCE		(1<<_EFER_SCE)
@@ -34,6 +35,7 @@
 #define EFER_SVME		(1<<_EFER_SVME)
 #define EFER_LMSLE		(1<<_EFER_LMSLE)
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
+#define EFER_TCE		(1<<_EFER_TCE)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
 /*
-- 
2.47.1




* [PATCH v6 12/12] x86/mm: only invalidate final translations with INVLPGB
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
                   ` (10 preceding siblings ...)
  2025-01-20  2:40 ` [PATCH v6 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
@ 2025-01-20  2:40 ` Rik van Riel
  2025-01-20  5:58 ` [PATCH v6 00/12] AMD broadcast TLB invalidation Michael Kelley
  2025-01-24 11:41 ` Manali Shukla
  13 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-20  2:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Rik van Riel

Use the INVLPGB_FINAL_ONLY flag when invalidating mappings with INVLPGB.
This way only leaf mappings get removed from the TLB, leaving intermediate
translations cached.

On the (rare) occasions where we free page tables we do a full flush,
ensuring intermediate translations get flushed from the TLB.
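
The flag selection can be sketched in userspace as follows. The MODEL_*
bit positions are assumptions made for this illustration (they follow my
reading of the INVLPGB rAX layout and are not taken from this series), and
model_invlpgb_flags() is a made-up helper name.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit positions; check them against the AMD manual before use. */
#define MODEL_INVLPGB_VA		(1UL << 0)
#define MODEL_INVLPGB_PCID		(1UL << 1)
#define MODEL_INVLPGB_FINAL_ONLY	(1UL << 4)

/*
 * Pick the flags for a ranged user flush: only ask for leaf-only
 * invalidation when no page tables were freed, because freed tables
 * mean cached intermediate translations may be stale as well.
 */
static unsigned long model_invlpgb_flags(bool freed_tables)
{
	unsigned long flags = MODEL_INVLPGB_PCID | MODEL_INVLPGB_VA;

	if (!freed_tables)
		flags |= MODEL_INVLPGB_FINAL_ONLY;

	return flags;
}
```

With these assumed values, the common case (no freed page tables) yields
0x13 and the freed-tables case yields 0x03.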

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/invlpgb.h | 10 ++++++++--
 arch/x86/mm/tlb.c              |  8 ++++----
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
index 4dfd09e65fa6..418402535319 100644
--- a/arch/x86/include/asm/invlpgb.h
+++ b/arch/x86/include/asm/invlpgb.h
@@ -63,9 +63,15 @@ static inline void invlpgb_flush_user(unsigned long pcid,
 static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
 						unsigned long addr,
 						u16 nr,
-						bool pmd_stride)
+						bool pmd_stride,
+						bool freed_tables)
 {
-	__invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+	unsigned long flags = INVLPGB_PCID | INVLPGB_VA;
+
+	if (!freed_tables)
+		flags |= INVLPGB_FINAL_ONLY;
+
+	__invlpgb(0, pcid, addr, nr - 1, pmd_stride, flags);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f731e6cfaa29..4057afb6edc0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -538,10 +538,10 @@ static void broadcast_tlb_flush(struct flush_tlb_info *info)
 		nr = min(maxnr, (info->end - addr) >> info->stride_shift);
 		nr = max(nr, 1);
 
-		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd, info->freed_tables);
 		/* Do any CPUs supporting INVLPGB need PTI? */
 		if (static_cpu_has(X86_FEATURE_PTI))
-			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd, info->freed_tables);
 	}
 
 	finish_asid_transition(info);
@@ -1700,10 +1700,10 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
 			batch->used_invlpgb = true;
 			migrate_disable();
 		}
-		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false, false);
 		/* Do any CPUs supporting INVLPGB need PTI? */
 		if (static_cpu_has(X86_FEATURE_PTI))
-			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false, false);
 
 		/*
 		 * Some CPUs might still be using a local ASID for this
-- 
2.47.1



^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH v6 00/12] AMD broadcast TLB invalidation
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
                   ` (11 preceding siblings ...)
  2025-01-20  2:40 ` [PATCH v6 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
@ 2025-01-20  5:58 ` Michael Kelley
  2025-01-24 11:41 ` Manali Shukla
  13 siblings, 0 replies; 36+ messages in thread
From: Michael Kelley @ 2025-01-20  5:58 UTC (permalink / raw)
  To: riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
	andrew.cooper3

From: riel@surriel.com <riel@surriel.com>
> 
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
> 
> This allows the kernel to invalidate TLB entries on remote CPUs without
> needing to send IPIs, without having to wait for remote CPUs to handle
> those interrupts, and with less interruption to what was running on
> those CPUs.
> 
> Because x86 PCID space is limited, and there are some very large
> systems out there, broadcast TLB invalidation is only used for
> processes that are active on 3 or more CPUs, with the threshold
> being gradually increased the more the PCID space gets exhausted.
> 
> Combined with the removal of unnecessary lru_add_drain calls
> (see https://lkml.org/lkml/2024/12/19/1388) this results in a
> nice performance boost for the will-it-scale tlb_flush2_threads
> test on an AMD Milan system with 36 cores:
> 
> - vanilla kernel:           527k loops/second
> - lru_add_drain removal:    731k loops/second
> - only INVLPGB:             527k loops/second
> - lru_add_drain + INVLPGB: 1157k loops/second
> 
> Profiling with only the INVLPGB changes showed while
> TLB invalidation went down from 40% of the total CPU
> time to only around 4% of CPU time, the contention
> simply moved to the LRU lock.
> 
> Fixing both at the same time about doubles the
> number of iterations per second from this case.
> 
> Some numbers closer to real world performance
> can be found at Phoronix, thanks to Michael:
> https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits
> 
> My current plan is to implement support for Intel's RAR
> (Remote Action Request) TLB flushing in a follow-up series,
> after this thing has been merged into -tip. Making things
> any larger would just be unwieldy for reviewers.
> 
> v6:
>  - fix info->end check in flush_tlb_kernel_range (Michael)
>  - disable broadcast TLB flushing on 32 bit x86
> v5:
>  - use byte assembly for compatibility with older toolchains (Borislav, Michael)
>  - ensure a panic on an invalid number of extra pages (Dave, Tom)
>  - add cant_migrate() assertion to tlbsync (Jann)
>  - a bunch more cleanups (Nadav)
>  - key TCE enabling off X86_FEATURE_TCE (Andrew)
>  - fix a race between reclaim and ASID transition (Jann)
> v4:
>  - Use only bitmaps to track free global ASIDs (Nadav)
>  - Improved AMD initialization (Borislav & Tom)
>  - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
>  - Fixes for subtle race conditions (Jann)
> v3:
>  - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
>  - More suggested cleanups and changelog fixes by Peter and Nadav
> v2:
>  - Apply suggestions by Peter and Borislav (thank you!)
>  - Fix bug in arch_tlbbatch_flush, where we need to do both
>    the TLBSYNC, and flush the CPUs that are in the cpumask.
>  - Some updates to comments and changelogs based on questions.
> 

I've done functional testing of this v6 in a local Hyper-V VM on an
Intel processor, and in a Hyper-V-based Azure Confidential VM on
an AMD Milan, where INVLPGB is enabled in the VM. Testing is
basic booting, and then examining some custom telemetry
added to ensure that INVLPGB is being used in the VM on the AMD
Milan for some processes, and falling back to the existing paravirt
TLB flushing hypercalls for other processes. Testing is based on
6.13-rc6.

All looks good. For this limited testing of the entire series,

Tested-by: Michael Kelley <mhklinux@outlook.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-20  2:40 ` [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
@ 2025-01-20 14:02   ` Nadav Amit
  2025-01-20 16:09     ` Rik van Riel
  2025-01-21  9:55   ` Peter Zijlstra
  2025-01-22  8:38   ` Peter Zijlstra
  2 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2025-01-20 14:02 UTC (permalink / raw)
  To: Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3



On 20/01/2025 4:40, Rik van Riel wrote:
> Use broadcast TLB invalidation, using the INVLPGB instruction, on AMD EPYC 3
> and newer CPUs.
> 
> In order to not exhaust PCID space, and keep TLB flushes local for single
> threaded processes, we only hand out broadcast ASIDs to processes active on
> 3 or more CPUs, and gradually increase the threshold as broadcast ASID space
> is depleted.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>   arch/x86/include/asm/mmu.h         |   6 +
>   arch/x86/include/asm/mmu_context.h |  14 ++
>   arch/x86/include/asm/tlbflush.h    |  72 ++++++
>   arch/x86/mm/tlb.c                  | 362 ++++++++++++++++++++++++++++-
>   4 files changed, 442 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> index 3b496cdcb74b..d71cd599fec4 100644
> --- a/arch/x86/include/asm/mmu.h
> +++ b/arch/x86/include/asm/mmu.h
> @@ -69,6 +69,12 @@ typedef struct {
>   	u16 pkey_allocation_map;
>   	s16 execute_only_pkey;
>   #endif
> +
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +	u16 global_asid;
> +	bool asid_transition;
> +#endif
> +
>   } mm_context_t;
>   
>   #define INIT_MM_CONTEXT(mm)						\
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 795fdd53bd0a..d670699d32c2 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
>   #define enter_lazy_tlb enter_lazy_tlb
>   extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
>   
> +extern void destroy_context_free_global_asid(struct mm_struct *mm);
> +
>   /*
>    * Init a new mm.  Used on mm copies, like at fork()
>    * and on mm's that are brand-new, like at execve().
> @@ -161,6 +163,14 @@ static inline int init_new_context(struct task_struct *tsk,
>   		mm->context.execute_only_pkey = -1;
>   	}
>   #endif
> +
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> +		mm->context.global_asid = 0;
> +		mm->context.asid_transition = false;
> +	}
> +#endif
> +
>   	mm_reset_untag_mask(mm);
>   	init_new_context_ldt(mm);
>   	return 0;
> @@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
>   static inline void destroy_context(struct mm_struct *mm)
>   {
>   	destroy_context_ldt(mm);
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		destroy_context_free_global_asid(mm);
> +#endif
>   }
>   
>   extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index dba5caa4a9f4..5eae5c1aafa5 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -239,6 +239,78 @@ void flush_tlb_one_kernel(unsigned long addr);
>   void flush_tlb_multi(const struct cpumask *cpumask,
>   		      const struct flush_tlb_info *info);
>   
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +static inline bool is_dyn_asid(u16 asid)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return true;
> +
> +	return asid < TLB_NR_DYN_ASIDS;
> +}
> +
> +static inline bool is_global_asid(u16 asid)
> +{
> +	return !is_dyn_asid(asid);
> +}
> +
> +static inline bool in_asid_transition(const struct flush_tlb_info *info)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return false;
> +
> +	return info->mm && READ_ONCE(info->mm->context.asid_transition);
> +}
> +
> +static inline u16 mm_global_asid(struct mm_struct *mm)
> +{
> +	u16 asid;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return 0;
> +
> +	asid = READ_ONCE(mm->context.global_asid);
> +
> +	/* mm->context.global_asid is either 0, or a global ASID */
> +	VM_WARN_ON_ONCE(is_dyn_asid(asid));
> +
> +	return asid;
> +}
> +#else
> +static inline bool is_dyn_asid(u16 asid)
> +{
> +	return true;
> +}
> +
> +static inline bool is_global_asid(u16 asid)
> +{
> +	return false;
> +}
> +
> +static inline bool in_asid_transition(const struct flush_tlb_info *info)
> +{
> +	return false;
> +}
> +
> +static inline u16 mm_global_asid(struct mm_struct *mm)
> +{
> +	return 0;
> +}
> +
> +static inline bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> +	return false;
> +}
> +
> +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{
> +	VM_WARN_ON_ONCE(1);

Not sure why not use VM_WARN_ONCE() here instead, with a more
informative message (a string is allocated for it anyhow).

> +}
> +
> +static inline void consider_global_asid(struct mm_struct *mm)
> +{
> +}
> +#endif
> +
>   #ifdef CONFIG_PARAVIRT
>   #include <asm/paravirt.h>
>   #endif
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 9d4864db5720..08eee1f8573a 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -74,13 +74,15 @@
>    * use different names for each of them:
>    *
>    * ASID  - [0, TLB_NR_DYN_ASIDS-1]
> - *         the canonical identifier for an mm
> + *         the canonical identifier for an mm, dynamically allocated on each CPU
> + *         [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
> + *         the canonical, global identifier for an mm, identical across all CPUs
>    *
> - * kPCID - [1, TLB_NR_DYN_ASIDS]
> + * kPCID - [1, MAX_ASID_AVAILABLE]
>    *         the value we write into the PCID part of CR3; corresponds to the
>    *         ASID+1, because PCID 0 is special.
>    *
> - * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
> + * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
>    *         for KPTI each mm has two address spaces and thus needs two
>    *         PCID values, but we can still do with a single ASID denomination
>    *         for each mm. Corresponds to kPCID + 2048.
> @@ -225,6 +227,20 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
>   		return;
>   	}
>   
> +	/*
> +	 * TLB consistency for global ASIDs is maintained with broadcast TLB
> +	 * flushing. The TLB is never outdated, and does not need flushing.
> +	 */
> +	if (IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) && static_cpu_has(X86_FEATURE_INVLPGB)) {
> +		u16 global_asid = mm_global_asid(next);
> +
> +		if (global_asid) {
> +			*new_asid = global_asid;
> +			*need_flush = false;
> +			return;
> +		}
> +	}
> +
>   	if (this_cpu_read(cpu_tlbstate.invalidate_other))
>   		clear_asid_other();
>   
> @@ -251,6 +267,290 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
>   	*need_flush = true;
>   }
>   
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +/*
> + * Logic for broadcast TLB invalidation.
> + */
> +static DEFINE_RAW_SPINLOCK(global_asid_lock);
> +static u16 last_global_asid = MAX_ASID_AVAILABLE;
> +static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> +static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE) = { 0 };
> +static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +static void reset_global_asid_space(void)
> +{
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	/*
> +	 * A global TLB flush guarantees that any stale entries from
> +	 * previously freed global ASIDs get flushed from the TLB
> +	 * everywhere, making these global ASIDs safe to reuse.
> +	 */
> +	invlpgb_flush_all_nonglobals();
> +
> +	/*
> +	 * Clear all the previously freed global ASIDs from the
> +	 * global_asid_used bitmap, now that the global TLB flush
> +	 * has made them actually available for re-use.
> +	 */
> +	bitmap_andnot(global_asid_used, global_asid_used,
> +			global_asid_freed, MAX_ASID_AVAILABLE);
> +	bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
> +
> +	/*
> +	 * ASIDs 0-TLB_NR_DYN_ASIDS are used for CPU-local ASID
> +	 * assignments, for tasks doing IPI based TLB shootdowns.
> +	 * Restart the search from the start of the global ASID space.
> +	 */
> +	last_global_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 get_global_asid(void)
> +{
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	do {
> +		u16 start = last_global_asid;
> +		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> +
> +		if (asid >= MAX_ASID_AVAILABLE) {
> +			reset_global_asid_space();
> +			continue;
> +		}

I think that unless something is awfully wrong, you are supposed to call
reset_global_asid_space() at most once. So if that's the case, why write
it as a loop?

Instead, you can get rid of the loop and just do:

	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);

If you want, you can warn if asid >= MAX_ASID_AVAILABLE and have some
fallback. But the loop is just confusing for no reason, in my opinion.

> +
> +		/* Claim this global ASID. */
> +		__set_bit(asid, global_asid_used);
> +		last_global_asid = asid;
> +		global_asid_available--;
> +		return asid;
> +	} while (1);
> +}
> +
> +/*
> + * Returns true if the mm is transitioning from a CPU-local ASID to a global
> + * (INVLPGB) ASID, or the other way around.
> + */
> +static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> +	u16 global_asid = mm_global_asid(next);
> +
> +	if (global_asid && prev_asid != global_asid)
> +		return true;
> +
> +	if (!global_asid && is_global_asid(prev_asid))
> +		return true;
> +
> +	return false;
> +}
> +
> +void destroy_context_free_global_asid(struct mm_struct *mm)
> +{
> +	if (!mm->context.global_asid)
> +		return;
> +
> +	guard(raw_spinlock_irqsave)(&global_asid_lock);
> +
> +	/* The global ASID can be re-used only after flush at wrap-around. */
> +	__set_bit(mm->context.global_asid, global_asid_freed);
> +
> +	mm->context.global_asid = 0;
> +	global_asid_available++;
> +}
> +
> +/*
> + * Check whether a process is currently active on more than "threshold" CPUs.
> + * This is a cheap estimation on whether or not it may make sense to assign
> + * a global ASID to this process, and use broadcast TLB invalidation.
> + */
> +static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
> +{
> +	int count = 0;
> +	int cpu;
> +
> +	/* This quick check should eliminate most single threaded programs. */
> +	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
> +		return false;
> +
> +	/* Slower check to make sure. */
> +	for_each_cpu(cpu, mm_cpumask(mm)) {
> +		/* Skip the CPUs that aren't really running this process. */
> +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> +			continue;

Then perhaps at least add a comment next to loaded_mm, that it's not 
private per-se, but rarely accessed by other cores?

> +
> +		if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
> +			continue;
> +
> +		if (++count > threshold)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +/*
> + * Assign a global ASID to the current process, protecting against
> + * races between multiple threads in the process.
> + */
> +static void use_global_asid(struct mm_struct *mm)
> +{
> +	guard(raw_spinlock_irqsave)(&global_asid_lock);
> +
> +	/* This process is already using broadcast TLB invalidation. */
> +	if (mm->context.global_asid)
> +		return;
> +
> +	/* The last global ASID was consumed while waiting for the lock. */
> +	if (!global_asid_available)
> +		return;
> +
> +	/*
> +	 * The transition from IPI TLB flushing, with a dynamic ASID,
> +	 * and broadcast TLB flushing, using a global ASID, uses memory
> +	 * ordering for synchronization.
> +	 *
> +	 * While the process has threads still using a dynamic ASID,
> +	 * TLB invalidation IPIs continue to get sent.
> +	 *
> +	 * This code sets asid_transition first, before assigning the
> +	 * global ASID.
> +	 *
> +	 * The TLB flush code will only verify the ASID transition
> +	 * after it has seen the new global ASID for the process.
> +	 */
> +	WRITE_ONCE(mm->context.asid_transition, true);
> +	WRITE_ONCE(mm->context.global_asid, get_global_asid());

I know it is likely correct in practice (due to the TSO memory model), but
it is not clear, at least to me, how the order of those writes affects the
rest of the code. I managed to figure out how it relates to the reads in
flush_tlb_mm_range() and native_flush_tlb_multi(), but I wouldn't say it
is trivial, and I think it is worth a comment (or smp_wmb/smp_rmb).

> +}
> +
> +/*
> + * Figure out whether to assign a global ASID to a process.
> + * We vary the threshold by how empty or full global ASID space is.
> + * 1/4 full: >= 4 active threads
> + * 1/2 full: >= 8 active threads
> + * 3/4 full: >= 16 active threads
> + * 7/8 full: >= 32 active threads
> + * etc
> + *
> + * This way we should never exhaust the global ASID space, even on very
> + * large systems, and the processes with the largest number of active
> + * threads should be able to use broadcast TLB invalidation.
> + */
> +#define HALFFULL_THRESHOLD 8
> +static bool meets_global_asid_threshold(struct mm_struct *mm)
> +{
> +	int avail = global_asid_available;
> +	int threshold = HALFFULL_THRESHOLD;
> +
> +	if (!avail)
> +		return false;
> +
> +	if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
> +		threshold = HALFFULL_THRESHOLD / 4;
> +	} else if (avail > MAX_ASID_AVAILABLE / 2) {
> +		threshold = HALFFULL_THRESHOLD / 2;
> +	} else if (avail < MAX_ASID_AVAILABLE / 3) {
> +		do {
> +			avail *= 2;
> +			threshold *= 2;
> +		} while ((avail + threshold) < MAX_ASID_AVAILABLE / 2);
> +	}
> +
> +	return mm_active_cpus_exceeds(mm, threshold);
> +}
> +
> +static void consider_global_asid(struct mm_struct *mm)
> +{
> +	if (!static_cpu_has(X86_FEATURE_INVLPGB))
> +		return;
> +
> +	/* Check every once in a while. */
> +	if ((current->pid & 0x1f) != (jiffies & 0x1f))
> +		return;
> +
> +	if (meets_global_asid_threshold(mm))
> +		use_global_asid(mm);
> +}
> +
> +static void finish_asid_transition(struct flush_tlb_info *info)
> +{
> +	struct mm_struct *mm = info->mm;
> +	int bc_asid = mm_global_asid(mm);
> +	int cpu;
> +
> +	if (!READ_ONCE(mm->context.asid_transition))
> +		return;
> +
> +	for_each_cpu(cpu, mm_cpumask(mm)) {
> +		/*
> +		 * The remote CPU is context switching. Wait for that to
> +		 * finish, to catch the unlikely case of it switching to
> +		 * the target mm with an out of date ASID.
> +		 */
> +		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
> +			cpu_relax();
> +
> +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
> +			continue;
> +
> +		/*
> +		 * If at least one CPU is not using the global ASID yet,
> +		 * send a TLB flush IPI. The IPI should cause stragglers
> +		 * to transition soon.
> +		 *
> +		 * This can race with the CPU switching to another task;
> +		 * that results in a (harmless) extra IPI.
> +		 */
> +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
> +			flush_tlb_multi(mm_cpumask(info->mm), info);
> +			return;

I am trying to figure out why we return here. The transition might not 
be over? Why is it "soon"? Wouldn't flush_tlb_func() reload it 
unconditionally?

> +		}
> +	}
> +
> +	/* All the CPUs running this process are using the global ASID. */
> +	WRITE_ONCE(mm->context.asid_transition, false);
> +}
> +
> +static void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{
> +	bool pmd = info->stride_shift == PMD_SHIFT;
> +	unsigned long maxnr = invlpgb_count_max;
> +	unsigned long asid = info->mm->context.global_asid;
> +	unsigned long addr = info->start;
> +	unsigned long nr;
> +
> +	/* Flushing multiple pages at once is not supported with 1GB pages. */
> +	if (info->stride_shift > PMD_SHIFT)
> +		maxnr = 1;
> +
> +	/*
> +	 * TLB flushes with INVLPGB are kicked off asynchronously.
> +	 * The inc_mm_tlb_gen() guarantees page table updates are done
> +	 * before these TLB flushes happen.
> +	 */
> +	if (info->end == TLB_FLUSH_ALL) {
> +		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
> +		/* Do any CPUs supporting INVLPGB need PTI? */
> +		if (static_cpu_has(X86_FEATURE_PTI))
> +			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
> +	} else for (; addr < info->end; addr += nr << info->stride_shift) {

I guess I was wrong, and do-while was cleaner here.

And I guess this is now a bug, if info->stride_shift > PMD_SHIFT...

[ I guess the cleanest way was to change get_flush_tlb_info to mask the 
low bits of start and end based on ((1ull << stride_shift) - 1). But 
whatever... ]

> +		/*
> +		 * Calculate how many pages can be flushed at once; if the
> +		 * remainder of the range is less than one page, flush one.
> +		 */
> +		nr = min(maxnr, (info->end - addr) >> info->stride_shift);
> +		nr = max(nr, 1);
> +
> +		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
> +		/* Do any CPUs supporting INVLPGB need PTI? */
> +		if (static_cpu_has(X86_FEATURE_PTI))
> +			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
> +	}
> +
> +	finish_asid_transition(info);
> +
> +	/* Wait for the INVLPGBs kicked off above to finish. */
> +	tlbsync();
> +}
> +#endif /* CONFIG_X86_BROADCAST_TLB_FLUSH */
> +
>   /*
>    * Given an ASID, flush the corresponding user ASID.  We can delay this
>    * until the next time we switch to it.
> @@ -556,8 +856,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
>   	 */
>   	if (prev == next) {
>   		/* Not actually switching mm's */
> -		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> -			   next->context.ctx_id);
> +		VM_WARN_ON(is_dyn_asid(prev_asid) &&
> +				this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> +				next->context.ctx_id);
>   
>   		/*
>   		 * If this races with another thread that enables lam, 'new_lam'
> @@ -573,6 +874,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
>   				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
>   			cpumask_set_cpu(cpu, mm_cpumask(next));
>   
> +		/*
> +		 * Check if the current mm is transitioning to a new ASID.
> +		 */
> +		if (needs_global_asid_reload(next, prev_asid)) {
> +			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> +
> +			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> +			goto reload_tlb;

Not a fan of goto's when they are not really needed, and I don't
think one is really needed here. Especially since the name of the label,
"reload_tlb", does not really convey that the page tables are reloaded at
that point.

> +		}
> +
> +		/*
> +		 * Broadcast TLB invalidation keeps this PCID up to date
> +		 * all the time.
> +		 */
> +		if (is_global_asid(prev_asid))
> +			return;

Hard for me to convince myself

> +
>   		/*
>   		 * If the CPU is not in lazy TLB mode, we are just switching
>   		 * from one thread in a process to another thread in the same
> @@ -606,6 +924,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
>   		 */
>   		cond_mitigation(tsk);
>   
> +		/*
> +		 * Let nmi_uaccess_okay() and finish_asid_transition()
> +		 * know that we're changing CR3.
> +		 */
> +		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
> +		barrier();
> +
>   		/*
>   		 * Leave this CPU in prev's mm_cpumask. Atomic writes to
>   		 * mm_cpumask can be expensive under contention. The CPU
> @@ -620,14 +945,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
>   		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>   
>   		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> -
> -		/* Let nmi_uaccess_okay() know that we're changing CR3. */
> -		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
> -		barrier();
>   	}
>   
> +reload_tlb:
>   	new_lam = mm_lam_cr3_mask(next);
>   	if (need_flush) {
> +		VM_WARN_ON_ONCE(is_global_asid(new_asid));
>   		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
>   		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
>   		load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
> @@ -746,7 +1069,7 @@ static void flush_tlb_func(void *info)
>   	const struct flush_tlb_info *f = info;
>   	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
>   	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> -	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
> +	u64 local_tlb_gen;
>   	bool local = smp_processor_id() == f->initiating_cpu;
>   	unsigned long nr_invalidate = 0;
>   	u64 mm_tlb_gen;
> @@ -769,6 +1092,16 @@ static void flush_tlb_func(void *info)
>   	if (unlikely(loaded_mm == &init_mm))
>   		return;
>   
> +	/* Reload the ASID if transitioning into or out of a global ASID */
> +	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
> +		switch_mm_irqs_off(NULL, loaded_mm, NULL);

I understand you want to reuse that logic, but it doesn't seem 
reasonable to me. It both doesn't convey what you want to do, and can 
lead to undesired operations - cpu_tlbstate_update_lam() for instance. 
Probably the impact on performance is minor, but it is an opening for 
future mistakes.

> +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> +	}
> +
> +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
> +	if (is_global_asid(loaded_mm_asid))
> +		return;

The comment does not make it clear to me, and I can't clearly explain
to myself, why it is guaranteed that all the IPI TLB flushes, which were
potentially issued before the transition, are not needed.

> +
>   	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
>   		   loaded_mm->context.ctx_id);
>   
> @@ -786,6 +1119,8 @@ static void flush_tlb_func(void *info)
>   		return;
>   	}
>   
> +	local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
> +
>   	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
>   		     f->new_tlb_gen <= local_tlb_gen)) {
>   		/*
> @@ -953,7 +1288,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
>   	 * up on the new contents of what used to be page tables, while
>   	 * doing a speculative memory access.
>   	 */
> -	if (info->freed_tables)
> +	if (info->freed_tables || in_asid_transition(info))
>   		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
>   	else
>   		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
> @@ -1049,9 +1384,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>   	 * a local TLB flush is needed. Optimize this use-case by calling
>   	 * flush_tlb_func_local() directly in this case.
>   	 */
> -	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> +	if (mm_global_asid(mm)) {
> +		broadcast_tlb_flush(info);
> +	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
>   		info->trim_cpumask = should_trim_cpumask(mm);
>   		flush_tlb_multi(mm_cpumask(mm), info);
> +		consider_global_asid(mm);
>   	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
>   		lockdep_assert_irqs_enabled();
>   		local_irq_disable();



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-20 14:02   ` Nadav Amit
@ 2025-01-20 16:09     ` Rik van Riel
  2025-01-20 20:04       ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-20 16:09 UTC (permalink / raw)
  To: Nadav Amit, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On Mon, 2025-01-20 at 16:02 +0200, Nadav Amit wrote:
> 
> 
> On 20/01/2025 4:40, Rik van Riel wrote:
> > 
> > +static inline void broadcast_tlb_flush(struct flush_tlb_info
> > *info)
> > +{
> > +	VM_WARN_ON_ONCE(1);
> 
> Not sure why not the use VM_WARN_ONCE() instead with some more 
> informative message (anyhow, a string is allocated for it).
> 
VM_WARN_ON_ONCE only has a condition, not a message.

> > 
> > +static u16 get_global_asid(void)
> > +{
> > +	lockdep_assert_held(&global_asid_lock);
> > +
> > +	do {
> > +		u16 start = last_global_asid;
> > +		u16 asid = find_next_zero_bit(global_asid_used,
> > MAX_ASID_AVAILABLE, start);
> > +
> > +		if (asid >= MAX_ASID_AVAILABLE) {
> > +			reset_global_asid_space();
> > +			continue;
> > +		}
> 
> I think that unless something is awfully wrong, you are supposed to
> at 
> most call reset_global_asid_space() once. So if that's the case, why
> not 
> do it this way?
> 
> Instead, you can get rid of the loop and just do:
> 
> 	asid = find_next_zero_bit(global_asid_used,
> MAX_ASID_AVAILABLE, start);
> 
> If you want, you can warn if asid >= MAX_ASID_AVAILABLE and have some
> fallback. But the loop, is just confusing in my opinion for no
> reason.

I can get rid of the loop. You're right that the code
can just call find_next_zero_bit after calling
reset_global_asid_space.
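
For illustration only, here is a minimal user-space sketch of that
loop-free shape. It is not the kernel code: the sketch_* functions, the
SKETCH_* constants and the bool arrays standing in for the bitmaps are
all hypothetical, the broadcast flush done by reset_global_asid_space()
is only noted in a comment, and returning 0 when the space is exhausted
is an assumed fallback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, tiny stand-ins for TLB_NR_DYN_ASIDS and MAX_ASID_AVAILABLE. */
#define SKETCH_NR_DYN_ASIDS	6
#define SKETCH_MAX_ASID		16

static bool asid_used[SKETCH_MAX_ASID];
static bool asid_freed[SKETCH_MAX_ASID];
/* Start past the end, so the first allocation triggers a reset. */
static uint16_t last_asid = SKETCH_MAX_ASID;

/* Linear scan standing in for the kernel's find_next_zero_bit(). */
static uint16_t sketch_find_next_zero(const bool *map, uint16_t size,
				      uint16_t start)
{
	for (uint16_t i = start; i < size; i++)
		if (!map[i])
			return i;
	return size;
}

/* Models reset_global_asid_space(); the real one first flushes the TLBs. */
static void sketch_reset_asid_space(void)
{
	for (uint16_t i = 0; i < SKETCH_MAX_ASID; i++) {
		if (asid_freed[i])
			asid_used[i] = false;
		asid_freed[i] = false;
	}
	last_asid = SKETCH_NR_DYN_ASIDS;
}

/* Loop-free allocation: search, reset at most once, search again. */
static uint16_t sketch_get_global_asid(void)
{
	uint16_t asid = sketch_find_next_zero(asid_used, SKETCH_MAX_ASID,
					      last_asid);

	if (asid >= SKETCH_MAX_ASID) {
		sketch_reset_asid_space();
		asid = sketch_find_next_zero(asid_used, SKETCH_MAX_ASID,
					     last_asid);
		/* Assumed fallback: 0 means "keep using a dynamic ASID". */
		if (asid >= SKETCH_MAX_ASID)
			return 0;
	}

	asid_used[asid] = true;
	last_asid = asid;
	return asid;
}

/* A freed ASID only becomes reusable after the next reset. */
static void sketch_free_global_asid(uint16_t asid)
{
	assert(asid >= SKETCH_NR_DYN_ASIDS && asid < SKETCH_MAX_ASID);
	asid_freed[asid] = true;
}
```

The point of the sketch is only that a second search after a single
reset covers every case the loop handled; if that second search also
fails, the space is genuinely full.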

> 
> > +	/* Slower check to make sure. */
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		/* Skip the CPUs that aren't really running this
> > process. */
> > +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> > +			continue;
> 
> Then perhaps at least add a comment next to loaded_mm, that it's not 
> private per-se, but rarely accessed by other cores?
> 
I don't see any comment in struct tlb_state that
suggests it was ever private to begin with.

Which comment are you referring to that should
be edited?

> > 
> > +
> > +	/*
> > +	 * The transition from IPI TLB flushing, with a dynamic
> > ASID,
> > +	 * and broadcast TLB flushing, using a global ASID, uses
> > memory
> > +	 * ordering for synchronization.
> > +	 *
> > +	 * While the process has threads still using a dynamic
> > ASID,
> > +	 * TLB invalidation IPIs continue to get sent.
> > +	 *
> > +	 * This code sets asid_transition first, before assigning
> > the
> > +	 * global ASID.
> > +	 *
> > +	 * The TLB flush code will only verify the ASID transition
> > +	 * after it has seen the new global ASID for the process.
> > +	 */
> > +	WRITE_ONCE(mm->context.asid_transition, true);
> > +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
> 
> I know it is likely correct in practice (due to TSO memory model),
> but 
> it is not clear, at least for me, how those write order affects the
> rest 
> of the code. I managed to figure out how it relates to the reads in 
> flush_tlb_mm_range() and native_flush_tlb_multi(), but I wouldn't say
> it 
> is trivial and doesn't worth a comment (or smp_wmb/smp_rmb).
> 

What kind of wording should we add here to make it
easier to understand?

"The TLB invalidation code reads these variables in
 the opposite order in which they are written" ?
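
To make the write/read pairing concrete, here is a small user-space
sketch using C11 atomics. It is an assumption-laden model, not what the
patch does: the patch uses WRITE_ONCE/READ_ONCE and relies on x86
ordering, while this sketch spells the same intent out with portable
release/acquire. struct sketch_ctx and the sketch_* functions are
hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two new mm_context_t fields. */
struct sketch_ctx {
	atomic_bool asid_transition;
	_Atomic uint16_t global_asid;
};

static void sketch_ctx_init(struct sketch_ctx *ctx)
{
	atomic_init(&ctx->asid_transition, false);
	atomic_init(&ctx->global_asid, 0);
}

/* Writer: set the transition flag first, publish the global ASID second. */
static void sketch_use_global_asid(struct sketch_ctx *ctx, uint16_t asid)
{
	assert(asid != 0);
	atomic_store_explicit(&ctx->asid_transition, true,
			      memory_order_relaxed);
	/* Release: the flag store above becomes visible before the ASID. */
	atomic_store_explicit(&ctx->global_asid, asid, memory_order_release);
}

/*
 * Reader: load in the opposite order. A reader that observes the
 * non-zero ASID written above also observes that writer's flag store.
 * Returns true if IPI-based flushing may still be needed.
 */
static bool sketch_flush_needs_ipi(struct sketch_ctx *ctx)
{
	uint16_t asid = atomic_load_explicit(&ctx->global_asid,
					     memory_order_acquire);

	if (!asid)
		return true;	/* No global ASID yet: plain IPI path. */

	return atomic_load_explicit(&ctx->asid_transition,
				    memory_order_relaxed);
}
```

In this model a flusher can see "no global ASID" or "global ASID with a
transition in progress", but cannot see the global ASID published by
sketch_use_global_asid() while missing that same call's flag store.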


> > +		/*
> > +		 * If at least one CPU is not using the global
> > ASID yet,
> > +		 * send a TLB flush IPI. The IPI should cause
> > stragglers
> > +		 * to transition soon.
> > +		 *
> > +		 * This can race with the CPU switching to another
> > task;
> > +		 * that results in a (harmless) extra IPI.
> > +		 */
> > +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid,
> > cpu)) != bc_asid) {
> > +			flush_tlb_multi(mm_cpumask(info->mm),
> > info);
> > +			return;
> 
> I am trying to figure out why we return here. The transition might not
> be over? Why is it "soon"? Wouldn't flush_tlb_func() reload it
> unconditionally?

The transition _should_ be over, but what if another
CPU got an NMI while in the middle of switch_mm_irqs_off,
and set its own bit in the mm_cpumask after we send this
IPI?

On the other hand, if it sets its mm_cpumask bit after
this point, it will also load the mm->context.global_asid
after this point, and should definitely get the new ASID.

I think we are probably fine to set asid_transition to
false here, but I've had to tweak this code so much over
the past months that I don't feel super confident any more :)

> 
> > +	/*
> > +	 * TLB flushes with INVLPGB are kicked off asynchronously.
> > +	 * The inc_mm_tlb_gen() guarantees page table updates are done
> > +	 * before these TLB flushes happen.
> > +	 */
> > +	if (info->end == TLB_FLUSH_ALL) {
> > +		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
> > +		/* Do any CPUs supporting INVLPGB need PTI? */
> > +		if (static_cpu_has(X86_FEATURE_PTI))
> > +			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
> > +	} else for (; addr < info->end; addr += nr << info->stride_shift) {
> 
> I guess I was wrong, and do-while was cleaner here.
> 
> And I guess this is now a bug, if info->stride_shift > PMD_SHIFT...
> 
We set maxnr to 1 for larger stride shifts at the top of the function:

        /* Flushing multiple pages at once is not supported with 1GB pages. */
        if (info->stride_shift > PMD_SHIFT)
                maxnr = 1;
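
For readers following along, here is a rough user-space sketch (not the
kernel code) of how the maxnr override interacts with the range loop.
The function name count_flush_ops() is hypothetical, and the clamping of
nr to between 1 and maxnr is an assumption; the quoted hunks only show
the loop header and the maxnr override.

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SHIFT 21	/* 2 MB stride on x86-64 */

/*
 * Count how many flush operations a range needs when each operation
 * covers at most maxnr pages of size (1 << stride_shift).
 */
static unsigned int count_flush_ops(uint64_t addr, uint64_t end,
				    unsigned int stride_shift,
				    uint64_t maxnr)
{
	unsigned int ops = 0;
	uint64_t nr = 1;

	assert(maxnr >= 1);

	/* Mirrors the quoted check: no multi-page flushes for 1GB pages. */
	if (stride_shift > PMD_SHIFT)
		maxnr = 1;

	for (; addr < end; addr += nr << stride_shift) {
		/* Assumed clamping: between 1 and maxnr pages per operation. */
		nr = (end - addr) >> stride_shift;
		if (nr > maxnr)
			nr = maxnr;
		if (nr < 1)
			nr = 1;
		ops++;
	}

	return ops;
}
```

With a 4kB stride and maxnr of 8, ten pages take two operations; with a
1GB stride, every page gets its own operation regardless of maxnr.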

> [ I guess the cleanest way was to change get_flush_tlb_info to mask the
> low bits of start and end based on ((1ull << stride_shift) - 1). But
> whatever... ]

I'll change it back :)

I'm just happy this code is getting lots of attention,
and we're improving it with time.


> > @@ -573,6 +874,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> >   				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
> >   			cpumask_set_cpu(cpu, mm_cpumask(next));
> >   
> > +		/*
> > +		 * Check if the current mm is transitioning to a new ASID.
> > +		 */
> > +		if (needs_global_asid_reload(next, prev_asid)) {
> > +			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> > +
> > +			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> > +			goto reload_tlb;
> 
> Not a fan of the goto's when they are not really needed, and I don't
> think it is really needed here. Especially that the name of the tag
> "reload_tlb" does not really convey that the page-tables are reloaded at
> that point.

In this particular case, the CPU continues running with
the same page tables, but with a different PCID.

> 
> > +		}
> > +
> > +		/*
> > +		 * Broadcast TLB invalidation keeps this PCID up to date
> > +		 * all the time.
> > +		 */
> > +		if (is_global_asid(prev_asid))
> > +			return;
> 
> Hard for me to convince myself

When a process uses a global ASID, we always send
out TLB invalidations using INVLPGB.

The global ASID should always be up to date.

> 
> > @@ -769,6 +1092,16 @@ static void flush_tlb_func(void *info)
> >   	if (unlikely(loaded_mm == &init_mm))
> >   		return;
> >   
> > +	/* Reload the ASID if transitioning into or out of a global ASID */
> > +	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
> > +		switch_mm_irqs_off(NULL, loaded_mm, NULL);
> 
> I understand you want to reuse that logic, but it doesn't seem
> reasonable to me. It both doesn't convey what you want to do, and can
> lead to undesired operations - cpu_tlbstate_update_lam() for instance.
> Probably the impact on performance is minor, but it is an opening for
> future mistakes.

My worry with having a separate code path here is
that the separate code path could bit rot, and we
could introduce bugs that way.

I would rather have a tiny performance impact in
what is a rare code path, than a rare (and hard
to track down) memory corruption due to bit rot.


> 
> > +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> > +	}
> > +
> > +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
> > +	if (is_global_asid(loaded_mm_asid))
> > +		return;
> 
> The comment does not clarify to me, and I don't manage to clearly 
> explain to myself, why it is guaranteed that all the IPI TLB flushes,
> which were potentially issued before the transition, are not needed.
> 
IPI TLB flushes that were issued before the transition went
to the CPUs when they were using dynamic ASIDs (numbers 1-5).

Reloading the TLB with a different PCID, even pointed at the
same page tables, means that the TLB should load the
translations fresh from the page tables, and not re-use any
that it had previously loaded under a different PCID.
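
A toy model may help picture the argument. The sketch below is only an
illustration under simplifying assumptions: all toy_tlb_* names are
hypothetical, and a real TLB also has global pages, multiple levels and
hardware-specific behavior. It shows that entries are matched on
(PCID, virtual page), so a lookup under a new PCID cannot hit entries
cached under an old one. It also shows that a flush aimed at one PCID
leaves entries tagged with another PCID untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TLB_SLOTS 8

struct toy_tlb_entry {
	bool valid;
	uint16_t pcid;	/* tag: which address space ID cached this */
	uint64_t vpn;	/* virtual page number */
	uint64_t pfn;	/* physical frame number */
};

struct toy_tlb {
	struct toy_tlb_entry slot[TOY_TLB_SLOTS];
};

/* A hit requires both the PCID tag and the virtual page to match. */
static const struct toy_tlb_entry *
toy_tlb_lookup(const struct toy_tlb *tlb, uint16_t pcid, uint64_t vpn)
{
	for (size_t i = 0; i < TOY_TLB_SLOTS; i++) {
		const struct toy_tlb_entry *e = &tlb->slot[i];

		if (e->valid && e->pcid == pcid && e->vpn == vpn)
			return e;
	}
	return NULL;
}

/* Cache a translation in the first free slot, or overwrite slot 0. */
static void toy_tlb_fill(struct toy_tlb *tlb, uint16_t pcid,
			 uint64_t vpn, uint64_t pfn)
{
	size_t victim = 0;

	for (size_t i = 0; i < TOY_TLB_SLOTS; i++) {
		if (!tlb->slot[i].valid) {
			victim = i;
			break;
		}
	}
	tlb->slot[victim] = (struct toy_tlb_entry){
		.valid = true, .pcid = pcid, .vpn = vpn, .pfn = pfn,
	};
}

/* Drop every entry tagged with the given PCID; other tags survive. */
static void toy_tlb_flush_pcid(struct toy_tlb *tlb, uint16_t pcid)
{
	for (size_t i = 0; i < TOY_TLB_SLOTS; i++) {
		if (tlb->slot[i].pcid == pcid)
			tlb->slot[i].valid = false;
	}
}
```

In this model, a translation filled under PCID 3 is invisible to a
lookup under PCID 100, and flushing PCID 3 does nothing to an entry
that was filled under PCID 100.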


-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
  2025-01-20  2:40 ` [PATCH v6 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
@ 2025-01-20 19:32   ` David Hildenbrand
  0 siblings, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-01-20 19:32 UTC (permalink / raw)
  To: Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On 20.01.25 03:40, Rik van Riel wrote:
> Currently x86 uses CONFIG_MMU_GATHER_TABLE_FREE when using
> paravirt, and not when running on bare metal.
> 
> There is no real good reason to do things differently for
> each setup. Make them all the same.
> 
> Currently get_user_pages_fast synchronizes against page table
> freeing in two different ways:
> - on bare metal, by blocking IRQs, which block TLB flush IPIs
> - on paravirt, with MMU_GATHER_RCU_TABLE_FREE

Right, worth noting the latter also right now relies on the blocking of 
IRQs that is in place, because they implicitly block the RCU sync.

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call
  2025-01-20  2:40 ` [PATCH v6 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call Rik van Riel
@ 2025-01-20 19:47   ` David Hildenbrand
  2025-01-21  1:03     ` Rik van Riel
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-01-20 19:47 UTC (permalink / raw)
  To: Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On 20.01.25 03:40, Rik van Riel wrote:
> Every pv_ops.mmu.tlb_remove_table call ends up calling tlb_remove_table.
> 

Indeed, but the !CONFIG_PARAVIRT variant paravirt_tlb_remove_table() 
however calls tlb_remove_page().

tlb_remove_page() ends up in 
__tlb_remove_page_size()->__tlb_remove_folio_pages_size(), not in 
tlb_remove_table()

... but maybe I am looking at the wrong tree, so I wonder if this is 
okay and simply not spelled out here explicitly?


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-20 16:09     ` Rik van Riel
@ 2025-01-20 20:04       ` Nadav Amit
  2025-01-20 22:44         ` Rik van Riel
  0 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2025-01-20 20:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List,
	Borislav Petkov, peterz, Dave Hansen, zhengqi.arch,
	thomas.lendacky, kernel-team, open list:MEMORY MANAGEMENT,
	Andrew Morton, jannh, mhklinux, andrew.cooper3



> On 20 Jan 2025, at 18:09, Rik van Riel <riel@surriel.com> wrote:
> 
> On Mon, 2025-01-20 at 16:02 +0200, Nadav Amit wrote:
>> 
>> 
>> On 20/01/2025 4:40, Rik van Riel wrote:
>>> 
>>> +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
>>> +{
>>> + VM_WARN_ON_ONCE(1);
>> 
>> Not sure why not the use VM_WARN_ONCE() instead with some more 
>> informative message (anyhow, a string is allocated for it).
>> 
> VM_WARN_ON_ONCE only has a condition, not a message.

Right, my bad.

> 
>>> + /* Slower check to make sure. */
>>> + for_each_cpu(cpu, mm_cpumask(mm)) {
>>> + /* Skip the CPUs that aren't really running this process. */
>>> + if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
>>> + continue;
>> 
>> Then perhaps at least add a comment next to loaded_mm, that it's not 
>> private per-se, but rarely accessed by other cores?
>> 
> I don't see any comment in struct tlb_state that
> suggests it was ever private to begin with.
> 
> Which comment are you referring to that should
> be edited?

You can see there is a tlb_state_shared, so one assumes tlb_state is
private... (at least that was my intention separating them).

> 
>>> 
>>> + WRITE_ONCE(mm->context.asid_transition, true);
>>> + WRITE_ONCE(mm->context.global_asid, get_global_asid());
>> 
>> I know it is likely correct in practice (due to TSO memory model), but
>> it is not clear, at least for me, how those write order affects the rest
>> of the code. I managed to figure out how it relates to the reads in
>> flush_tlb_mm_range() and native_flush_tlb_multi(), but I wouldn't say it
>> is trivial and doesn't worth a comment (or smp_wmb/smp_rmb).
>> 
> 
> What kind of wording should we add here to make it
> easier to understand?
> 
> "The TLB invalidation code reads these variables in
> the opposite order in which they are written" ?

Usually in such cases, you make a reference to wherever there are readers
that rely on the ordering. This is how documenting smp_wmb()/smp_rmb()
ordering is usually done.

> 
> 
>>> 
>>> +		/*
>>> +		 * If at least one CPU is not using the global ASID yet,
>>> +		 * send a TLB flush IPI. The IPI should cause stragglers
>>> +		 * to transition soon.
>>> +		 *
>>> +		 * This can race with the CPU switching to another task;
>>> +		 * that results in a (harmless) extra IPI.
>>> +		 */
>>> +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
>>> +			flush_tlb_multi(mm_cpumask(info->mm), info);
>>> +			return;
>>> +		}
>> 
>> I am trying to figure out why we return here. The transition might not
>> be over? Why is it "soon"? Wouldn't flush_tlb_func() reload it
>> unconditionally?
> 
> The transition _should_ be over, but what if another
> CPU got an NMI while in the middle of switch_mm_irqs_off,
> and set its own bit in the mm_cpumask after we send this
> IPI?
> 
> On the other hand, if it sets its mm_cpumask bit after
> this point, it will also load the mm->context.global_asid
> after this point, and should definitely get the new ASID.
> 
> I think we are probably fine to set asid_transition to
> false here, but I've had to tweak this code so much over
> the past months that I don't feel super confident any more :)

I fully relate, but I am not sure it is that great. The problem
is that nobody would have the guts to change that code later...

>> 
>> And I guess this is now a bug, if info->stride_shift > PMD_SHIFT...
>> 
> We set maxnr to 1 for larger stride shifts at the top of the function:
> 

You’re right, all safe.

> 
>>> + goto reload_tlb;
>> 
>> Not a fan of the goto's when they are not really needed, and I don't
>> think it is really needed here. Especially that the name of the tag
>> "reload_tlb" does not really convey that the page-tables are reloaded at
>> that point.
> 
> In this particular case, the CPU continues running with
> the same page tables, but with a different PCID.

I understand it is “reload_tlb” from your point of view, or from the
point of view of the code that does the “goto”, but if I showed you
the code that follows the “reload_tlb”, I’m not sure you’d know it
is so.

[ snip, taking your valid points ]

>> 
>>> + loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
>>> + }
>>> +
>>> + /* Broadcast ASIDs are always kept up to date with INVLPGB. */
>>> + if (is_global_asid(loaded_mm_asid))
>>> + return;
>> 
>> The comment does not clarify to me, and I don't manage to clearly 
>> explain to myself, why it is guaranteed that all the IPI TLB flushes,
>> which were potentially issued before the transition, are not needed.
>> 
> IPI TLB flushes that were issued before the transition went
> to the CPUs when they were using dynamic ASIDs (numbers 1-5).
> 
> Reloading the TLB with a different PCID, even pointed at the
> same page tables, means that the TLB should load the
> translations fresh from the page tables, and not re-use any
> that it had previously loaded under a different PCID.
> 

What about this scenario for instance?

CPU0                  CPU1                      CPU2
----                  ----                      ----
(1) use_global_asid(mm):        
    mm->context.asid_trans = T;
    mm->context.global_asid = G;

                      (2) switch_mm(..., next=mm):
                          *Observes global_asid = G
                          => loads CR3 with PCID=G
                          => fills TLB under G.
                          TLB caches PTE[G, V] = P
			  (for some reason)

                                             (3) flush_tlb_mm_range(mm):
                                                 *Sees global_asid == 0
                                                   (stale/old value)
                                                 => flush_tlb_multi()
                                                 => IPI flush for dyn.

                      (4) IPI arrives on CPU1:
                          flush_tlb_func(...): 
                          is_global_asid(G)? yes,
                          skip invalidate; broadcast
                          flush assumed to cover it.

                                             (5) IPI completes on CPU2:
                                                 Dyn. ASIDs are flushed, 
                                                 but CPU1’s global ASID
                                                 was never invalidated!

                      (6) CPU1 uses stale TLB entries under ASID G.
                          TLB continues to use PTE[G, V] = P, as it
                          was not invalidated.






^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-20 20:04       ` Nadav Amit
@ 2025-01-20 22:44         ` Rik van Riel
  2025-01-21  7:31           ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-20 22:44 UTC (permalink / raw)
  To: Nadav Amit
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List,
	Borislav Petkov, peterz, Dave Hansen, zhengqi.arch,
	thomas.lendacky, kernel-team, open list:MEMORY MANAGEMENT,
	Andrew Morton, jannh, mhklinux, andrew.cooper3

On Mon, 2025-01-20 at 22:04 +0200, Nadav Amit wrote:
> 
> What about this scenario for instance?
> 
> CPU0                  CPU1                      CPU2
> ----                  ----                      ----
> (1) use_global_asid(mm):        
>     mm->context.asid_trans = T;
>     mm->context.global_asid = G;
> 
>                       (2) switch_mm(..., next=mm):
>                           *Observes global_asid = G
>                           => loads CR3 with PCID=G
>                           => fills TLB under G.
>                           TLB caches PTE[G, V] = P
> 			  (for some reason)
> 
>                                              (3) flush_tlb_mm_range(mm):
>                                                  *Sees global_asid == 0
>                                                    (stale/old value)
>                                                  => flush_tlb_multi()
>                                                  => IPI flush for dyn.
> 

If the TLB flush is about a page table change that
happened before CPUs 0 and 1 switched to the global
ASID, then CPUs 0 and 1 will not see the old page
table contents after the switch.

If the TLB flush is about a page table change that
happened after the transition to a global ASID,
flush_tlb_mm_range() should see that global ASID,
and flush accordingly.

What am I missing?

>                       (4) IPI arrives on CPU1:
>                           flush_tlb_func(...): 
>                           is_global_asid(G)? yes,
>                           skip invalidate; broadcast
>                           flush assumed to cover it.
> 
>                                              (5) IPI completes on CPU2:
>                                                  Dyn. ASIDs are flushed, 
>                                                  but CPU1’s global ASID
>                                                  was never invalidated!
> 
>                       (6) CPU1 uses stale TLB entries under ASID G.
>                           TLB continues to use PTE[G, V] = P, as it
>                           was not invalidated.
> 
> 
> 
> 
> 

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call
  2025-01-20 19:47   ` David Hildenbrand
@ 2025-01-21  1:03     ` Rik van Riel
  2025-01-21  7:46       ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-21  1:03 UTC (permalink / raw)
  To: David Hildenbrand, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On Mon, 2025-01-20 at 20:47 +0100, David Hildenbrand wrote:
> On 20.01.25 03:40, Rik van Riel wrote:
> > Every pv_ops.mmu.tlb_remove_table call ends up calling
> > tlb_remove_table.
> > 
> 
> Indeed, but the !CONFIG_PARAVIRT variant paravirt_tlb_remove_table() 
> however calls tlb_remove_page().

Patch 1/12 from this series removes that.

After patch 1/12, we always call tlb_remove_table everywhere.

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-20 22:44         ` Rik van Riel
@ 2025-01-21  7:31           ` Nadav Amit
  0 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2025-01-21  7:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List,
	Borislav Petkov, peterz, Dave Hansen, zhengqi.arch,
	thomas.lendacky, kernel-team, open list:MEMORY MANAGEMENT,
	Andrew Morton, jannh, mhklinux, andrew.cooper3

> On 21 Jan 2025, at 0:44, Rik van Riel <riel@surriel.com> wrote:
> 
> 
> If the TLB flush is about a page table change that
> happened after the transition to a global ASID,
> flush_tlb_mm_range() should see that global ASID,
> and flush accordingly.
> 
> What am I missing?

I think reasoning needs to be done using memory ordering
arguments based on the kernel memory model (which builds on top
of the x86 memory model in our case) and, when necessary,
“happens-before” relations. The fact that one CPU sees a write
does not by itself imply that another CPU will see that write.

So if there are some memory barriers that would prevent this
scenario, it would be good to mark how they synchronize.
Otherwise, I think at the very least “late” TLB-shootdowns should
be respected even if the ASID is already “global”.

I do recommend that you also check the opposite case,
where a CPU that transitioned to the global ASID does a broadcast
and there is a straggler CPU that has not yet switched to
the global one. While in that case the TLB flush would
eventually take place, there might be a window of time in which
it has not (and the page is already freed).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call
  2025-01-21  1:03     ` Rik van Riel
@ 2025-01-21  7:46       ` David Hildenbrand
  2025-01-21  8:54         ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-01-21  7:46 UTC (permalink / raw)
  To: Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On 21.01.25 02:03, Rik van Riel wrote:
> On Mon, 2025-01-20 at 20:47 +0100, David Hildenbrand wrote:
>> On 20.01.25 03:40, Rik van Riel wrote:
>>> Every pv_ops.mmu.tlb_remove_table call ends up calling
>>> tlb_remove_table.
>>>
>>
>> Indeed, but the !CONFIG_PARAVIRT variant paravirt_tlb_remove_table()
>> however calls tlb_remove_page().
> 
> Patch 1/12 from this series removes that.
> 
> After patch 1/12, we always call tlb_remove_table everywhere.

This patch contains the hunk:

-#ifndef CONFIG_PARAVIRT
-static inline
-void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
-	tlb_remove_page(tlb, table);
-}
-#endif
-

That is the source of my confusion.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call
  2025-01-21  7:46       ` David Hildenbrand
@ 2025-01-21  8:54         ` Peter Zijlstra
  2025-01-22 15:48           ` Rik van Riel
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-01-21  8:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Rik van Riel, x86, linux-kernel, bp, dave.hansen, zhengqi.arch,
	nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
	mhklinux, andrew.cooper3

On Tue, Jan 21, 2025 at 08:46:04AM +0100, David Hildenbrand wrote:
> On 21.01.25 02:03, Rik van Riel wrote:
> > On Mon, 2025-01-20 at 20:47 +0100, David Hildenbrand wrote:
> > > On 20.01.25 03:40, Rik van Riel wrote:
> > > > Every pv_ops.mmu.tlb_remove_table call ends up calling
> > > > tlb_remove_table.
> > > > 
> > > 
> > > Indeed, but the !CONFIG_PARAVIRT variant paravirt_tlb_remove_table()
> > > however calls tlb_remove_page().
> > 
> > Patch 1/12 from this series removes that.
> > 
> > After patch 1/12, we always call tlb_remove_table everywhere.
> 
> This patch contains the hunk:
> 
> -#ifndef CONFIG_PARAVIRT
> -static inline
> -void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
> -{
> -	tlb_remove_page(tlb, table);
> -}
> -#endif
> -
> 
> That is the source of my confusion.

Ah, that hunk should probably go to patch 1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 05/12] x86/mm: add INVLPGB support code
  2025-01-20  2:40 ` [PATCH v6 05/12] x86/mm: add INVLPGB support code Rik van Riel
@ 2025-01-21  9:45   ` Peter Zijlstra
  2025-01-22 16:58     ` Rik van Riel
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-01-21  9:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, bp, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On Sun, Jan 19, 2025 at 09:40:13PM -0500, Rik van Riel wrote:

> +/*
> + * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
> + *
> + * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
> + * be done in a parallel fashion.
> + *
> + * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
> + * this CPU have completed.
> + */
> +static inline void __invlpgb(unsigned long asid, unsigned long pcid,
> +			     unsigned long addr, u16 extra_count,
> +			     bool pmd_stride, unsigned long flags)
> +{
> +	u32 edx = (pcid << 16) | asid;
> +	u32 ecx = (pmd_stride << 31) | extra_count;
> +	u64 rax = addr | flags;
> +
> +	/* INVLPGB; supported in binutils >= 2.36. */
> +	asm volatile(".byte 0x0f, 0x01, 0xfe" : : "a" (rax), "c" (ecx), "d" (edx));
> +}

So asid is always 0 (for now), but I'd feel better if that was a u16
argument, less chance of funnies when someone starts using it.

We should probably mask or WARN on addr having low bits set, and flags
should then be a u8 or something.
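
To make the suggestion concrete, here is a user-space sketch of the
operand packing with narrower argument types. It is only an
illustration: struct invlpgb_operands and invlpgb_pack() are
hypothetical names, the register layout is copied from the quoted
__invlpgb(), and masking the low 12 address bits (rather than warning)
is an assumption about how the check might look.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three register operands. */
struct invlpgb_operands {
	uint64_t rax;
	uint32_t ecx;
	uint32_t edx;
};

#define ADDR_LOW_BITS 0xfffULL	/* low 12 bits are reserved for flags */

static struct invlpgb_operands invlpgb_pack(uint16_t asid, uint16_t pcid,
					     uint64_t addr, uint16_t extra_count,
					     bool pmd_stride, uint8_t flags)
{
	struct invlpgb_operands ops;

	/* Narrow types keep stray high bits out of the neighboring field. */
	ops.edx = ((uint32_t)pcid << 16) | asid;
	/* Cast before shifting so bit 31 is set on an unsigned value. */
	ops.ecx = ((uint32_t)pmd_stride << 31) | extra_count;
	/* Assumed policy: silently mask address bits that overlap flags. */
	ops.rax = (addr & ~ADDR_LOW_BITS) | flags;

	return ops;
}
```

With a u16 asid and a u8 flags argument, a caller cannot accidentally
spill bits into the PCID field or into the page-aligned address.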




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-20  2:40 ` [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
  2025-01-20 14:02   ` Nadav Amit
@ 2025-01-21  9:55   ` Peter Zijlstra
  2025-01-21 10:33     ` Peter Zijlstra
  2025-01-21 18:48     ` Dave Hansen
  2025-01-22  8:38   ` Peter Zijlstra
  2 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-01-21  9:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, bp, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On Sun, Jan 19, 2025 at 09:40:17PM -0500, Rik van Riel wrote:
> +/*
> + * Figure out whether to assign a global ASID to a process.
> + * We vary the threshold by how empty or full global ASID space is.
> + * 1/4 full: >= 4 active threads
> + * 1/2 full: >= 8 active threads
> + * 3/4 full: >= 16 active threads
> + * 7/8 full: >= 32 active threads
> + * etc
> + *
> + * This way we should never exhaust the global ASID space, even on very
> + * large systems, and the processes with the largest number of active
> + * threads should be able to use broadcast TLB invalidation.
> + */
> +#define HALFFULL_THRESHOLD 8
> +static bool meets_global_asid_threshold(struct mm_struct *mm)
> +{
> +	int avail = global_asid_available;
> +	int threshold = HALFFULL_THRESHOLD;
> +
> +	if (!avail)
> +		return false;
> +
> +	if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
> +		threshold = HALFFULL_THRESHOLD / 4;
> +	} else if (avail > MAX_ASID_AVAILABLE / 2) {
> +		threshold = HALFFULL_THRESHOLD / 2;
> +	} else if (avail < MAX_ASID_AVAILABLE / 3) {
> +		do {
> +			avail *= 2;
> +			threshold *= 2;
> +		} while ((avail + threshold) < MAX_ASID_AVAILABLE / 2);
> +	}
> +
> +	return mm_active_cpus_exceeds(mm, threshold);
> +}

I'm still very much disliking this. Why do we need this? Yes, running
out of ASID space is a pain, but this increasing threshold also makes
things behave weird.

Suppose our most used process starts slow, and ends up not getting an
ASID because too much irrelevant crap gets started before it spawns
enough threads, and it then no longer qualifies.

Can't we just start with a very simple constant test and poke at things
if/when it's found to not work?
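
For illustration, the threshold arithmetic quoted above can be pulled
out into a user-space function to see how quickly it climbs. This is a
sketch, not the kernel code: global_asid_threshold() is a hypothetical
name, the size of the ASID space is passed in as max_avail, and
returning 0 stands in for the "no ASIDs left" early exit.

```c
#include <assert.h>

#define HALFFULL_THRESHOLD 8

/*
 * Return the number of active CPUs a process must exceed to get a
 * global ASID, given how many ASIDs are still available. Returns 0
 * when none are left, meaning no process qualifies.
 */
static int global_asid_threshold(int avail, int max_avail)
{
	int threshold = HALFFULL_THRESHOLD;

	if (!avail)
		return 0;

	if (avail > max_avail * 3 / 4) {
		threshold = HALFFULL_THRESHOLD / 4;
	} else if (avail > max_avail / 2) {
		threshold = HALFFULL_THRESHOLD / 2;
	} else if (avail < max_avail / 3) {
		/* Double the threshold until the scaled value passes half. */
		do {
			avail *= 2;
			threshold *= 2;
		} while ((avail + threshold) < max_avail / 2);
	}

	return threshold;
}
```

Assuming a space of 4096 ASIDs, the threshold is 2 with 4000 free, 8
with 1500 free, 32 with 1000 free and 64 with 500 free, so a process
that grows its thread count slowly can indeed find the bar raised
before it gets there.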


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-21  9:55   ` Peter Zijlstra
@ 2025-01-21 10:33     ` Peter Zijlstra
  2025-01-23  1:40       ` Rik van Riel
  2025-01-21 18:48     ` Dave Hansen
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-01-21 10:33 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, bp, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On Tue, Jan 21, 2025 at 10:55:07AM +0100, Peter Zijlstra wrote:
> On Sun, Jan 19, 2025 at 09:40:17PM -0500, Rik van Riel wrote:
> > +/*
> > + * Figure out whether to assign a global ASID to a process.
> > + * We vary the threshold by how empty or full global ASID space is.
> > + * 1/4 full: >= 4 active threads
> > + * 1/2 full: >= 8 active threads
> > + * 3/4 full: >= 16 active threads
> > + * 7/8 full: >= 32 active threads
> > + * etc
> > + *
> > + * This way we should never exhaust the global ASID space, even on very
> > + * large systems, and the processes with the largest number of active
> > + * threads should be able to use broadcast TLB invalidation.
> > + */
> > +#define HALFFULL_THRESHOLD 8
> > +static bool meets_global_asid_threshold(struct mm_struct *mm)
> > +{
> > +	int avail = global_asid_available;
> > +	int threshold = HALFFULL_THRESHOLD;
> > +
> > +	if (!avail)
> > +		return false;
> > +
> > +	if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
> > +		threshold = HALFFULL_THRESHOLD / 4;
> > +	} else if (avail > MAX_ASID_AVAILABLE / 2) {
> > +		threshold = HALFFULL_THRESHOLD / 2;
> > +	} else if (avail < MAX_ASID_AVAILABLE / 3) {
> > +		do {
> > +			avail *= 2;
> > +			threshold *= 2;
> > +		} while ((avail + threshold) < MAX_ASID_AVAILABLE / 2);
> > +	}
> > +
> > +	return mm_active_cpus_exceeds(mm, threshold);
> > +}
> 
> I'm still very much disliking this. Why do we need this? Yes, running
> out of ASID space is a pain, but this increasing threshold also makes
> things behave weird.
> 
> Suppose our most used processes starts slow, and ends up not getting an
> ASID because too much irrelevant crap gets started before it spawns
> enough threads and then no longer qualifies.
> 
> Can't we just start with a very simple constant test and poke at things
> if/when its found to not work?

Something like so perhaps?

--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -268,7 +268,7 @@ static inline u16 mm_global_asid(struct
 	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
 		return 0;
 
-	asid = READ_ONCE(mm->context.global_asid);
+	asid = smp_load_acquire(&mm->context.global_asid);
 
 	/* mm->context.global_asid is either 0, or a global ASID */
 	VM_WARN_ON_ONCE(is_dyn_asid(asid));
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -308,13 +308,18 @@ static void reset_global_asid_space(void
 static u16 get_global_asid(void)
 {
 	lockdep_assert_held(&global_asid_lock);
+	bool done_reset = false;
 
 	do {
 		u16 start = last_global_asid;
 		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
 
-		if (asid >= MAX_ASID_AVAILABLE) {
+		if (asid > MAX_ASID_AVAILABLE) {
+			if (done_reset)
+				return asid;
+
 			reset_global_asid_space();
+			done_reset = true;
 			continue;
 		}
 
@@ -392,6 +398,12 @@ static bool mm_active_cpus_exceeds(struc
  */
 static void use_global_asid(struct mm_struct *mm)
 {
+	u16 asid;
+
+	/* This process is already using broadcast TLB invalidation. */
+	if (mm->context.global_asid)
+		return;
+
 	guard(raw_spinlock_irqsave)(&global_asid_lock);
 
 	/* This process is already using broadcast TLB invalidation. */
@@ -402,58 +414,25 @@ static void use_global_asid(struct mm_st
 	if (!global_asid_available)
 		return;
 
+	asid = get_global_asid();
+	if (asid > MAX_ASID_AVAILABLE)
+		return;
+
 	/*
-	 * The transition from IPI TLB flushing, with a dynamic ASID,
-	 * and broadcast TLB flushing, using a global ASID, uses memory
-	 * ordering for synchronization.
-	 *
-	 * While the process has threads still using a dynamic ASID,
-	 * TLB invalidation IPIs continue to get sent.
-	 *
-	 * This code sets asid_transition first, before assigning the
-	 * global ASID.
-	 *
-	 * The TLB flush code will only verify the ASID transition
-	 * after it has seen the new global ASID for the process.
+	 * Notably flush_tlb_mm_range() -> broadcast_tlb_flush() ->
+	 * finish_asid_transition() needs to observe asid_transition == true
+	 * once it observes global_asid.
 	 */
-	WRITE_ONCE(mm->context.asid_transition, true);
-	WRITE_ONCE(mm->context.global_asid, get_global_asid());
+	mm->context.asid_transition = true;
+	smp_store_release(&mm->context.global_asid, asid);
 }
 
-/*
- * Figure out whether to assign a global ASID to a process.
- * We vary the threshold by how empty or full global ASID space is.
- * 1/4 full: >= 4 active threads
- * 1/2 full: >= 8 active threads
- * 3/4 full: >= 16 active threads
- * 7/8 full: >= 32 active threads
- * etc
- *
- * This way we should never exhaust the global ASID space, even on very
- * large systems, and the processes with the largest number of active
- * threads should be able to use broadcast TLB invalidation.
- */
-#define HALFFULL_THRESHOLD 8
 static bool meets_global_asid_threshold(struct mm_struct *mm)
 {
-	int avail = global_asid_available;
-	int threshold = HALFFULL_THRESHOLD;
-
-	if (!avail)
+	if (!global_asid_available)
 		return false;
 
-	if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
-		threshold = HALFFULL_THRESHOLD / 4;
-	} else if (avail > MAX_ASID_AVAILABLE / 2) {
-		threshold = HALFFULL_THRESHOLD / 2;
-	} else if (avail < MAX_ASID_AVAILABLE / 3) {
-		do {
-			avail *= 2;
-			threshold *= 2;
-		} while ((avail + threshold) < MAX_ASID_AVAILABLE / 2);
-	}
-
-	return mm_active_cpus_exceeds(mm, threshold);
+	return mm_active_cpus_exceeds(mm, 4);
 }
 
 static void consider_global_asid(struct mm_struct *mm)
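
The ordering requirement in the comment above can be illustrated in userspace with C11 atomics. This is only a sketch under assumptions: struct ctx_sketch and both function names are hypothetical, the C11 release store stands in for smp_store_release(), and the reader is assumed to use an acquire load, which the quoted hunk does not show.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two mm->context fields involved. */
struct ctx_sketch {
	bool asid_transition;          /* plain store, published by the release */
	_Atomic uint16_t global_asid;  /* 0 means "no global ASID assigned yet" */
};

/* Writer: set the flag first, then publish the ASID with release semantics. */
static void publish_global_asid(struct ctx_sketch *ctx, uint16_t asid)
{
	ctx->asid_transition = true;
	atomic_store_explicit(&ctx->global_asid, asid, memory_order_release);
}

/*
 * Reader (assumed shape): once the acquire load observes a non-zero ASID,
 * the earlier store to asid_transition is guaranteed to be visible.
 */
static bool reader_sees_transition(struct ctx_sketch *ctx)
{
	uint16_t asid = atomic_load_explicit(&ctx->global_asid,
					     memory_order_acquire);

	if (!asid)
		return false;
	return ctx->asid_transition;
}
```

With two plain WRITE_ONCE() stores, as in the removed lines, nothing orders the flag before the ASID on weakly ordered machines; the release store makes that dependency explicit.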



* Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-21  9:55   ` Peter Zijlstra
  2025-01-21 10:33     ` Peter Zijlstra
@ 2025-01-21 18:48     ` Dave Hansen
  1 sibling, 0 replies; 36+ messages in thread
From: Dave Hansen @ 2025-01-21 18:48 UTC (permalink / raw)
  To: Peter Zijlstra, Rik van Riel
  Cc: x86, linux-kernel, bp, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On 1/21/25 01:55, Peter Zijlstra wrote:
> Can't we just start with a very simple constant test and poke at things
> if/when it's found to not work?

I'd prefer something simpler for now, too.

Let's just pick a sane number, maybe 16 or 32 for now, make it pokeable
in debugfs and make sure we have a way to tell when the PCID space is
exhausted.

Then we try and design a solution for the _actual_ cases where folks are
exhausting it.



* Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-20  2:40 ` [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
  2025-01-20 14:02   ` Nadav Amit
  2025-01-21  9:55   ` Peter Zijlstra
@ 2025-01-22  8:38   ` Peter Zijlstra
  2025-01-23  1:13     ` Rik van Riel
  2 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-01-22  8:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, bp, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On Sun, Jan 19, 2025 at 09:40:17PM -0500, Rik van Riel wrote:
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +/*
> + * Logic for broadcast TLB invalidation.
> + */
> +static DEFINE_RAW_SPINLOCK(global_asid_lock);
> +static u16 last_global_asid = MAX_ASID_AVAILABLE;
> +static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> +static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE) = { 0 };
> +static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +static void reset_global_asid_space(void)
> +{
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	/*
> +	 * A global TLB flush guarantees that any stale entries from
> +	 * previously freed global ASIDs get flushed from the TLB
> +	 * everywhere, making these global ASIDs safe to reuse.
> +	 */
> +	invlpgb_flush_all_nonglobals();
> +
> +	/*
> +	 * Clear all the previously freed global ASIDs from the
> +	 * global_asid_used bitmap, now that the global TLB flush
> +	 * has made them actually available for re-use.
> +	 */
> +	bitmap_andnot(global_asid_used, global_asid_used,
> +			global_asid_freed, MAX_ASID_AVAILABLE);
> +	bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
> +
> +	/*
> +	 * ASIDs 0-TLB_NR_DYN_ASIDS are used for CPU-local ASID
> +	 * assignments, for tasks doing IPI based TLB shootdowns.
> +	 * Restart the search from the start of the global ASID space.
> +	 */
> +	last_global_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 get_global_asid(void)
> +{
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	do {
> +		u16 start = last_global_asid;
> +		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> +
> +		if (asid >= MAX_ASID_AVAILABLE) {
> +			reset_global_asid_space();
> +			continue;
> +		}
> +
> +		/* Claim this global ASID. */
> +		__set_bit(asid, global_asid_used);
> +		last_global_asid = asid;
> +		global_asid_available--;
> +		return asid;
> +	} while (1);
> +}

Looking at this more... I'm left wondering, did 'we' look at any other
architecture code at all? 

For example, look at arch/arm64/mm/context.c and see how their reset
works. Notably, they are not at all limited to reclaiming free'd ASIDs,
but will very aggressively take back all ASIDs except for the current
running ones.

And IIRC more architectures are like that (at some point in the distant
past I read through the tlb and mmu context crap from every architecture
we had at that point -- but those memories are vague).

If we want to move towards relying on broadcast TLBI, we'll need to
go in that direction. Also, as argued in the old thread yesterday, we
likely want more PCID bits -- in the interest of competition we can't be
having less than ARM64, surely :-)

Anyway, please drop the crazy threshold thing, and if you run into
falling back to IPIs because you don't have enough ASIDs to go around,
we should 'borrow' some of the ARM64 code -- RISC-V seems to have
borrowed very heavily from that as well.
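
For reference, the allocation scheme in the quoted patch (scan a "used" bitmap, and on wrap-around flush and recycle only the ASIDs recorded as freed) behaves roughly like the userspace model below. This is a sketch under assumptions: every sk_ name and both sizes are hypothetical, plain bool arrays replace the kernel bitmaps, the broadcast flush is reduced to a counter, and unlike the quoted code the model returns -1 when nothing is available.

```c
#include <stdbool.h>

#define SK_MAX_ASID 16	/* hypothetical, tiny for illustration */
#define SK_NR_DYN    4	/* hypothetical count of per-CPU dynamic ASIDs */

static bool sk_used[SK_MAX_ASID];
static bool sk_freed[SK_MAX_ASID];
static int sk_last = SK_MAX_ASID;	/* forces a reset on first use */
static int sk_available = SK_MAX_ASID - SK_NR_DYN;
static int sk_flushes;			/* stands in for the broadcast flush */

/* Recycle only the ASIDs that were explicitly freed, then rescan. */
static void sk_reset(void)
{
	sk_flushes++;
	for (int i = 0; i < SK_MAX_ASID; i++) {
		if (sk_freed[i]) {
			sk_used[i] = false;
			sk_freed[i] = false;
		}
	}
	sk_last = SK_NR_DYN;
}

static int sk_get(void)
{
	if (!sk_available)
		return -1;

	for (;;) {
		int asid = sk_last;

		/* Equivalent of find_next_zero_bit() starting at sk_last. */
		while (asid < SK_MAX_ASID && sk_used[asid])
			asid++;
		if (asid >= SK_MAX_ASID) {
			sk_reset();
			continue;
		}
		sk_used[asid] = true;
		sk_last = asid;
		sk_available--;
		return asid;
	}
}

/* A freed ASID stays marked used until the next reset flushes it. */
static void sk_free(int asid)
{
	sk_freed[asid] = true;
	sk_available++;
}
```

The model makes the limitation discussed above visible: an ASID only returns to the pool after its owner frees it, so a reset can never take back ASIDs from processes that are merely idle.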



* Re: [PATCH v6 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call
  2025-01-21  8:54         ` Peter Zijlstra
@ 2025-01-22 15:48           ` Rik van Riel
  0 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-22 15:48 UTC (permalink / raw)
  To: Peter Zijlstra, David Hildenbrand
  Cc: x86, linux-kernel, bp, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On Tue, 2025-01-21 at 09:54 +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2025 at 08:46:04AM +0100, David Hildenbrand wrote:
> > On 21.01.25 02:03, Rik van Riel wrote:
> > > On Mon, 2025-01-20 at 20:47 +0100, David Hildenbrand wrote:
> > > > On 20.01.25 03:40, Rik van Riel wrote:
> > > > > Every pv_ops.mmu.tlb_remove_table call ends up calling
> > > > > tlb_remove_table.
> > > > > 
> > > > 
> > > > Indeed, but the !CONFIG_PARAVIRT variant
> > > > paravirt_tlb_remove_table()
> > > > however calls tlb_remove_page().
> > > 
> > > Patch 1/12 from this series removes that.
> > > 
> > > After patch 1/12, we always call tlb_remove_table everywhere.
> > 
> > This patch contains the hunk:
> > 
> > -#ifndef CONFIG_PARAVIRT
> > -static inline
> > -void paravirt_tlb_remove_table(struct mmu_gather *tlb, void
> > *table)
> > -{
> > -	tlb_remove_page(tlb, table);
> > -}
> > -#endif
> > -
> > 
> > That is the source of my confusion.
> 
> Ah, that hunk should probably go to patch 1
> 
Moved that over for the next version.


-- 
All Rights Reversed.



* Re: [PATCH v6 05/12] x86/mm: add INVLPGB support code
  2025-01-21  9:45   ` Peter Zijlstra
@ 2025-01-22 16:58     ` Rik van Riel
  0 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-22 16:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, bp, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On Tue, 2025-01-21 at 10:45 +0100, Peter Zijlstra wrote:
> On Sun, Jan 19, 2025 at 09:40:13PM -0500, Rik van Riel wrote:
> 
> > 
> > +static inline void __invlpgb(unsigned long asid, unsigned long
> > pcid,
> > +			     unsigned long addr, u16 extra_count,
> > +			     bool pmd_stride, unsigned long flags)
> > +{
> > +	u32 edx = (pcid << 16) | asid;
> > +	u32 ecx = (pmd_stride << 31) | extra_count;
> > +	u64 rax = addr | flags;
> > +
> > +	/* INVLPGB; supported in binutils >= 2.36. */
> > +	asm volatile(".byte 0x0f, 0x01, 0xfe" : : "a" (rax), "c"
> > (ecx), "d" (edx));
> > +}
> 
> So asid is always 0 (for now), but I'd feel better if that was a u16
> argument, less chance funnies when someone starts using it.
> 
> We should probably mask or WARN on addr having low bits set, and
> flags
> should then be a u8 or something.

Done and done. Thank you for the suggestions.
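
The register packing with those suggestions applied might look like the userspace sketch below. It is not the code from the next version of the series: struct invlpgb_regs and invlpgb_pack() are hypothetical names, the function only computes the register values without executing INVLPGB, and masking the low 12 address bits is one of the two options (mask or WARN) mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three register operands. */
struct invlpgb_regs {
	uint64_t rax;
	uint32_t ecx;
	uint32_t edx;
};

static struct invlpgb_regs invlpgb_pack(uint16_t asid, uint16_t pcid,
					 uint64_t addr, uint16_t extra_count,
					 bool pmd_stride, uint8_t flags)
{
	struct invlpgb_regs r;

	/* EDX: PCID in the upper half, ASID in the lower half. */
	r.edx = ((uint32_t)pcid << 16) | asid;
	/* ECX: stride selector in bit 31, extra page count in the low bits. */
	r.ecx = ((uint32_t)pmd_stride << 31) | extra_count;
	/* RAX: page-aligned address, with the flags in the low bits. */
	r.rax = (addr & ~UINT64_C(0xfff)) | flags;
	return r;
}
```

Casting pmd_stride to uint32_t before the shift keeps the operation out of the sign bit of a promoted int, and the narrow argument types make it harder for a caller to pass a value that spills into a neighbouring field.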

-- 
All Rights Reversed.



* Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-22  8:38   ` Peter Zijlstra
@ 2025-01-23  1:13     ` Rik van Riel
  2025-01-23  9:07       ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-23  1:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, bp, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On Wed, 2025-01-22 at 09:38 +0100, Peter Zijlstra wrote:
> 
> Looking at this more... I'm left wondering, did 'we' look at any
> other
> architecture code at all? 
> 
> For example, look at arch/arm64/mm/context.c and see how their reset
> works. Notably, they are not at all limited to reclaiming free'd
> ASIDs,
> but will very aggressively take back all ASIDs except for the current
> running ones.
> 
I did look at the ARM64 code, and while their reset
is much nicer, it looks like that comes at a cost on
each process at context switch time.

In new_context(), there is a call to check_update_reserved_asid(),
which will iterate over all CPUs to check whether this
process's ASID is part of the reserved list that got
carried over during the rollover.

I don't know if that would scale well enough to work
on systems with thousands of CPUs.

> If we want to move towards relying on broadcast TLBI, we'll need to
> go in that direction.

For single threaded processes, which are still very
common, a local flush would likely be faster than
broadcast flushes, even if multiple broadcast flushes
can be pending simultaneously.

For very large systems with a large number of processes,
I agree we want to move in that direction, but we may
need to figure out whether or not everybody taking the 
cpu_asid_lock at rollover time, and then scanning all
other CPUs from check_update_reserved_asid(), with the
lock held, would scale to systems with thousands of CPUs.

Everybody taking the cpu_asid_lock would probably be
fine, if they didn't all have to scan over all the
CPUs.

If we can figure out a more scalable way to do the
new_context() stuff, this would definitely be the
way to go.
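
To make the scaling concern concrete, here is a simplified userspace model of that scan. It is a sketch, not the arm64 code: sk_check_update_reserved() is a hypothetical name and a plain array stands in for the per-CPU reserved ASID slots.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk every CPU's reserved slot; if it holds this process's old ASID,
 * rewrite it to the new-generation value. Cost is O(nr_cpus) per call.
 */
static bool sk_check_update_reserved(uint64_t *reserved, size_t nr_cpus,
				     uint64_t asid, uint64_t newasid)
{
	bool hit = false;

	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		if (reserved[cpu] == asid) {
			hit = true;
			reserved[cpu] = newasid;
		}
	}
	return hit;
}
```

If every process that switches in after a rollover pays this loop under a single lock, the total work grows with the number of CPUs times the number of such processes, which is the cost in question.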

-- 
All Rights Reversed.


* Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-21 10:33     ` Peter Zijlstra
@ 2025-01-23  1:40       ` Rik van Riel
  0 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-23  1:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, bp, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3

On Tue, 2025-01-21 at 11:33 +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2025 at 10:55:07AM +0100, Peter Zijlstra wrote:
> > 
> > Can't we just start with a very simple constant test and poke at
> > things
> > if/when it's found to not work?
> 
> Something like so perhaps?

I've applied your suggestions, with the exception of some
code that was already simplified further based on other
people's suggestions (get_global_asid is no longer a loop).

-- 
All Rights Reversed.



* Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-23  1:13     ` Rik van Riel
@ 2025-01-23  9:07       ` Peter Zijlstra
  2025-01-23 12:42         ` Rik van Riel
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-01-23  9:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, bp, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Mark Rutland, Will Deacon

On Wed, Jan 22, 2025 at 08:13:03PM -0500, Rik van Riel wrote:
> On Wed, 2025-01-22 at 09:38 +0100, Peter Zijlstra wrote:
> > 
> > Looking at this more... I'm left wondering, did 'we' look at any
> > other
> > architecture code at all? 
> > 
> > For example, look at arch/arm64/mm/context.c and see how their reset
> > works. Notably, they are not at all limited to reclaiming free'd
> > ASIDs,
> > but will very aggressively take back all ASIDs except for the current
> > running ones.
> > 
> I did look at the ARM64 code, and while their reset
> is much nicer, it looks like that comes at a cost on
> each process at context switch time.
> 
> In new_context(), there is a call to check_update_reserved_asid(),
> which will iterate over all CPUs to check whether this
> process's ASID is part of the reserved list that got
> carried over during the rollover.
> 
> I don't know if that would scale well enough to work
> on systems with thousands of CPUs.

So assuming something like 1k CPUs and !PTI, we only have like 4 PCIDs
per CPU on average, and rollover could be frequent.

While an ARM64 with 1k CPUs and !PTI would have an average of 64 ASIDs
per CPU, and rollover would be far less frequent.

That is to say, their larger ASID space (16 bits, vs our 12) definitely
helps. But at some point yeah, this will become a problem.

Notably, I think a 2 socket Epyc Turin with 192C is one of the
larger off-the-shelf systems atm, that gets you 768 CPUs and that is
already uncomfortably tight with our PCID space.
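
The averages above follow from dividing the ID space by the CPU count. A small sketch of that arithmetic, where sk_ids_per_cpu() is a hypothetical helper and the result is rounded down:

```c
#include <stdint.h>

/* Average number of address space IDs per CPU for a given ID width. */
static uint32_t sk_ids_per_cpu(unsigned int id_bits, uint32_t nr_cpus)
{
	return (UINT32_C(1) << id_bits) / nr_cpus;
}
```

With 12 bits and 1024 CPUs this gives 4096 / 1024 = 4, with 16 bits it gives 65536 / 1024 = 64, and for the 768 CPU Turin example 12 bits leave 4096 / 768, or about 5 per CPU.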





* Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-23  9:07       ` Peter Zijlstra
@ 2025-01-23 12:42         ` Rik van Riel
  0 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-23 12:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, bp, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Mark Rutland, Will Deacon

On Thu, 2025-01-23 at 10:07 +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2025 at 08:13:03PM -0500, Rik van Riel wrote:
> > On Wed, 2025-01-22 at 09:38 +0100, Peter Zijlstra wrote:
> > > 
> > > Looking at this more... I'm left wondering, did 'we' look at any
> > > other
> > > architecture code at all? 
> > > 
> > > For example, look at arch/arm64/mm/context.c and see how their
> > > reset
> > > works. Notably, they are not at all limited to reclaiming free'd
> > > ASIDs,
> > > but will very aggressively take back all ASIDs except for the
> > > current
> > > running ones.
> > > 
> > I did look at the ARM64 code, and while their reset
> > is much nicer, it looks like that comes at a cost on
> > each process at context switch time.
> > 
> > In new_context(), there is a call to check_update_reserved_asid(),
> > which will iterate over all CPUs to check whether this
> > process's ASID is part of the reserved list that got
> > carried over during the rollover.
> > 
> > I don't know if that would scale well enough to work
> > on systems with thousands of CPUs.
> 
> So assuming something like 1k CPUs and !PTI, we only have like 4
> PCIDs
> per CPU on average, and rollover could be frequent.
> 
> While an ARM64 with 1k CPUs and !PTI would have an average of 64
> ASIDs
> per CPU, and rollover would be far less frequent.

Not necessarily. On ARM64, every short lived task will
get a global ASID, while on x86_64 only longer lived
processes that are simultaneously active on multiple
CPUs get a global ASID.

The situation could be fairly bad for both, which is
why I would like to solve the O(n^2) issues with the
rollover code before adding that in to our x86_64
side :)

I fully agree we should probably move in that direction,
but I would like to make the worst case in the rollover-reuse
cheaper.

> 
> That is to say, their larger ASID space (16 bits, vs our 12)
> definitely
> helps. But at some point yeah, this will become a problem.
> 
> Notably, I think a 2 socket Epyc Turin with 192C is one of the
> larger off-the-shelf systems atm, that gets you 768 CPUs and that is
> already uncomfortably tight with our PCID space.
> 
> 
> 

-- 
All Rights Reversed.



* Re: [PATCH v6 00/12] AMD broadcast TLB invalidation
  2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
                   ` (12 preceding siblings ...)
  2025-01-20  5:58 ` [PATCH v6 00/12] AMD broadcast TLB invalidation Michael Kelley
@ 2025-01-24 11:41 ` Manali Shukla
  13 siblings, 0 replies; 36+ messages in thread
From: Manali Shukla @ 2025-01-24 11:41 UTC (permalink / raw)
  To: Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
	andrew.cooper3, Manali Shukla

On 1/20/2025 8:10 AM, Rik van Riel wrote:
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
> 
> This allows the kernel to invalidate TLB entries on remote CPUs without
> needing to send IPIs, without having to wait for remote CPUs to handle
> those interrupts, and with less interruption to what was running on
> those CPUs.
> 
> Because x86 PCID space is limited, and there are some very large
> systems out there, broadcast TLB invalidation is only used for
> processes that are active on 3 or more CPUs, with the threshold
> being gradually increased the more the PCID space gets exhausted.
> 
> Combined with the removal of unnecessary lru_add_drain calls
> (see https://lkml.org/lkml/2024/12/19/1388) this results in a
> nice performance boost for the will-it-scale tlb_flush2_threads
> test on an AMD Milan system with 36 cores:
> 
> - vanilla kernel:           527k loops/second
> - lru_add_drain removal:    731k loops/second
> - only INVLPGB:             527k loops/second
> - lru_add_drain + INVLPGB: 1157k loops/second
> 
> Profiling with only the INVLPGB changes showed while
> TLB invalidation went down from 40% of the total CPU
> time to only around 4% of CPU time, the contention
> simply moved to the LRU lock.
> 
> Fixing both at the same time about doubles the
> number of iterations per second from this case.
> 
> Some numbers closer to real world performance
> can be found at Phoronix, thanks to Michael:
> 
> https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits
> 
> My current plan is to implement support for Intel's RAR
> (Remote Action Request) TLB flushing in a follow-up series,
> after this thing has been merged into -tip. Making things
> any larger would just be unwieldy for reviewers.
> 
> v6:
>  - fix info->end check in flush_tlb_kernel_range (Michael)
>  - disable broadcast TLB flushing on 32 bit x86
> v5:
>  - use byte assembly for compatibility with older toolchains (Borislav, Michael)
>  - ensure a panic on an invalid number of extra pages (Dave, Tom)
>  - add cant_migrate() assertion to tlbsync (Jann)
>  - a bunch more cleanups (Nadav)
>  - key TCE enabling off X86_FEATURE_TCE (Andrew)
>  - fix a race between reclaim and ASID transition (Jann)
> v4:
>  - Use only bitmaps to track free global ASIDs (Nadav)
>  - Improved AMD initialization (Borislav & Tom)
>  - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
>  - Fixes for subtle race conditions (Jann)
> v3:
>  - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
>  - More suggested cleanups and changelog fixes by Peter and Nadav
> v2:
>  - Apply suggestions by Peter and Borislav (thank you!)
>  - Fix bug in arch_tlbbatch_flush, where we need to do both
>    the TLBSYNC, and flush the CPUs that are in the cpumask.
>  - Some updates to comments and changelogs based on questions.
> 
> 
> 

I have collected performance data using the will-it-scale tlb_flush2_threads test
with the AMD broadcast TLB invalidation v6 + LRU drain removal patches on my AMD Milan,
Genoa and Turin machines. I have not observed any discrepancy in the data. As seen
in the table below, the LRU drain removal and INVLPGB patches combined give a nice
performance boost, as expected.

Since v7 has already been posted, I plan to collect similar data for v7 too.

------------------------------------------------------------------------------------------------------------------------------------------------
| ./tlb_flush2_threads -s 5 -t 128 | Milan 1P (NPS1) | Milan 1P (NPS2) | Genoa 1P (NPS1) | Genoa 1P (NPS2) | Turin 2P (NPS1) | Turin 2P (NPS2) |
------------------------------------------------------------------------------------------------------------------------------------------------
| Vanilla                          |      369880     |      399732     |     311936      |      326639     |      371146     |      377428     |
------------------------------------------------------------------------------------------------------------------------------------------------
| LRU drain removal                |      781684     |      794648     |     541434      |      531191     |      549773     |      487340     |
------------------------------------------------------------------------------------------------------------------------------------------------
| INVLPGB                          |      554438     |      937696     |     501677      |      565055     |      531544     |      487342     |
------------------------------------------------------------------------------------------------------------------------------------------------
| LRU drain removal + INVLPGB      |     1091792     |     1096667     |     971744      |      956330     |      1387741    |      1388581    |
------------------------------------------------------------------------------------------------------------------------------------------------
| LRU drain vs Vanilla             |      52.68%     |     49.70%      |     42.39%      |      38.51%     |      32.49%     |      22.55%     |
------------------------------------------------------------------------------------------------------------------------------------------------
| INVLPGB vs Vanilla               |      33.29%     |     57.37%      |     37.82%      |      42.19%     |      30.18%     |      22.55%     |
------------------------------------------------------------------------------------------------------------------------------------------------
| (LRU drain + INVLPGB) vs Vanilla |      66.12%     |     63.55%      |     67.90%      |      65.84%     |      73.26%     |      72.82%     |
------------------------------------------------------------------------------------------------------------------------------------------------

-Manali
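
A note on reading the percentage rows: checking them against the raw numbers, they appear to be expressed relative to the patched result, i.e. (patched - vanilla) / patched, rather than relative to vanilla. The sketch below reproduces that calculation next to the plain speedup ratio; both helper names are hypothetical.

```c
/* Gain as a percentage of the patched result, as the table appears to use. */
static double sk_gain_vs_patched(double vanilla, double patched)
{
	return (patched - vanilla) * 100.0 / patched;
}

/* Plain throughput ratio of patched over vanilla. */
static double sk_speedup(double vanilla, double patched)
{
	return patched / vanilla;
}
```

For Milan 1P (NPS1), LRU drain removal alone gives (781684 - 369880) / 781684, about 52.68%, which is a speedup of roughly 2.11x; the combined patches give about 66.12%, or roughly 2.95x.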




Thread overview: 36+ messages
2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
2025-01-20  2:40 ` [PATCH v6 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
2025-01-20 19:32   ` David Hildenbrand
2025-01-20  2:40 ` [PATCH v6 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call Rik van Riel
2025-01-20 19:47   ` David Hildenbrand
2025-01-21  1:03     ` Rik van Riel
2025-01-21  7:46       ` David Hildenbrand
2025-01-21  8:54         ` Peter Zijlstra
2025-01-22 15:48           ` Rik van Riel
2025-01-20  2:40 ` [PATCH v6 03/12] x86/mm: consolidate full flush threshold decision Rik van Riel
2025-01-20  2:40 ` [PATCH v6 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
2025-01-20  2:40 ` [PATCH v6 05/12] x86/mm: add INVLPGB support code Rik van Riel
2025-01-21  9:45   ` Peter Zijlstra
2025-01-22 16:58     ` Rik van Riel
2025-01-20  2:40 ` [PATCH v6 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
2025-01-20  2:40 ` [PATCH v6 07/12] x86/tlb: use INVLPGB in flush_tlb_all Rik van Riel
2025-01-20  2:40 ` [PATCH v6 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
2025-01-20  2:40 ` [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
2025-01-20 14:02   ` Nadav Amit
2025-01-20 16:09     ` Rik van Riel
2025-01-20 20:04       ` Nadav Amit
2025-01-20 22:44         ` Rik van Riel
2025-01-21  7:31           ` Nadav Amit
2025-01-21  9:55   ` Peter Zijlstra
2025-01-21 10:33     ` Peter Zijlstra
2025-01-23  1:40       ` Rik van Riel
2025-01-21 18:48     ` Dave Hansen
2025-01-22  8:38   ` Peter Zijlstra
2025-01-23  1:13     ` Rik van Riel
2025-01-23  9:07       ` Peter Zijlstra
2025-01-23 12:42         ` Rik van Riel
2025-01-20  2:40 ` [PATCH v6 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code Rik van Riel
2025-01-20  2:40 ` [PATCH v6 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
2025-01-20  2:40 ` [PATCH v6 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
2025-01-20  5:58 ` [PATCH v6 00/12] AMD broadcast TLB invalidation Michael Kelley
2025-01-24 11:41 ` Manali Shukla
