* [PATCH v9 00/12] AMD broadcast TLB invalidation
From: Rik van Riel @ 2025-02-06 4:43 UTC (permalink / raw)
To: x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3
Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
This allows the kernel to invalidate TLB entries on remote CPUs without
needing to send IPIs, without having to wait for remote CPUs to handle
those interrupts, and with less interruption to what was running on
those CPUs.
Because x86 PCID space is limited, and there are some very large
systems out there, broadcast TLB invalidation is only used for
processes that are active on 3 or more CPUs, with the threshold
being gradually increased the more the PCID space gets exhausted.
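A rough sketch of the resulting flush path for process addresses
(illustrative only; the helpers shown are the ones introduced in
patches 9 and 10 of this series):

	if (mm_global_asid(mm)) {
		/* Broadcast: INVLPGB + TLBSYNC, no IPIs to remote CPUs. */
		broadcast_tlb_flush(info);
	} else {
		/* Traditional path: IPI the CPUs in mm_cpumask(mm). */
		flush_tlb_multi(mm_cpumask(mm), info);
		/* Maybe promote this mm to a global ASID for next time. */
		consider_global_asid(mm);
	}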
Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:
- vanilla kernel: 527k loops/second
- lru_add_drain removal: 731k loops/second
- only INVLPGB: 527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second
Profiling with only the INVLPGB changes showed that while
TLB invalidation went down from 40% of the total CPU
time to only around 4% of CPU time, the contention
simply moved to the LRU lock.
Fixing both at the same time roughly doubles the
number of iterations per second for this test case.
Some numbers closer to real world performance
can be found at Phoronix, thanks to Michael:
https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits
My current plan is to implement support for Intel's RAR
(Remote Action Request) TLB flushing in a follow-up series,
after this series has been merged into -tip. Making things
any larger would just be unwieldy for reviewers.
v9:
- print warning when start or end address was rounded (Peter)
- in the reclaim code, tlbsync at context switch time (Peter)
- fix !CONFIG_CPU_SUP_AMD compile error in arch_tlbbatch_add_pending (Jan)
v8:
- round start & end to handle non-page-aligned callers (Steven & Jan)
- fix up changelog & add tested-by tags (Manali)
v7:
- a few small code cleanups (Nadav)
- fix spurious VM_WARN_ON_ONCE in mm_global_asid
- code simplifications & better barriers (Peter & Dave)
v6:
- fix info->end check in flush_tlb_kernel_range (Michael)
- disable broadcast TLB flushing on 32 bit x86
v5:
- use byte assembly for compatibility with older toolchains (Borislav, Michael)
- ensure a panic on an invalid number of extra pages (Dave, Tom)
- add cant_migrate() assertion to tlbsync (Jann)
- a bunch more cleanups (Nadav)
- key TCE enabling off X86_FEATURE_TCE (Andrew)
- fix a race between reclaim and ASID transition (Jann)
v4:
- Use only bitmaps to track free global ASIDs (Nadav)
- Improved AMD initialization (Borislav & Tom)
- Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
- Fixes for subtle race conditions (Jann)
v3:
- Remove paravirt tlb_remove_table call (thank you Qi Zheng)
- More suggested cleanups and changelog fixes by Peter and Nadav
v2:
- Apply suggestions by Peter and Borislav (thank you!)
- Fix bug in arch_tlbbatch_flush, where we need to do both
the TLBSYNC, and flush the CPUs that are in the cpumask.
- Some updates to comments and changelogs based on questions.
* [PATCH v9 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
From: Rik van Riel @ 2025-02-06 4:43 UTC (permalink / raw)
To: x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Rik van Riel, Manali Shukla
Currently x86 uses CONFIG_MMU_GATHER_RCU_TABLE_FREE when using
paravirt, and not when running on bare metal.
There is no real good reason to do things differently for
each setup. Make them all the same.
Currently get_user_pages_fast synchronizes against page table
freeing in two different ways:
- on bare metal, by blocking IRQs, which block TLB flush IPIs
- on paravirt, with MMU_GATHER_RCU_TABLE_FREE
This is done because some paravirt TLB flush implementations
handle the TLB flush in the hypervisor, and will do the flush
even when the target CPU has interrupts disabled.
Always handle page table freeing with MMU_GATHER_RCU_TABLE_FREE.
Using RCU synchronization between page table freeing and get_user_pages_fast()
allows bare metal to also do TLB flushing while interrupts are disabled.
Various places in the mm do still block IRQs or disable preemption
as an implicit way to block RCU frees.
That makes it safe to use INVLPGB on AMD CPUs.
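To make that reasoning concrete, here is a minimal sketch of why the
RCU scheme keeps the IRQs-off lockless walk safe (table, free_table_rcu()
and walk_page_tables() are illustrative placeholders, not the actual mm code):

	/*
	 * Freeing side: with MMU_GATHER_RCU_TABLE_FREE a page table page is
	 * only handed back to the allocator after an RCU grace period.
	 */
	call_rcu(&table->rcu_head, free_table_rcu);

	/*
	 * Lockless walk side (get_user_pages_fast): disabling IRQs delays
	 * the RCU grace period on this CPU, so the table cannot be freed
	 * and reused while it is being walked - no TLB flush IPI needed.
	 */
	local_irq_disable();
	walk_page_tables(mm, addr);
	local_irq_enable();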
Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
---
arch/x86/Kconfig | 2 +-
arch/x86/kernel/paravirt.c | 7 +------
arch/x86/mm/pgtable.c | 16 ++++------------
3 files changed, 6 insertions(+), 19 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9d7bd0ae48c4..e8743f8c9fd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -274,7 +274,7 @@ config X86
select HAVE_PCI
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
- select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT
+ select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_MERGE_VMAS
select HAVE_POSIX_CPU_TIMERS_TASK_WORK
select HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec381533555..2b78a6b466ed 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -59,11 +59,6 @@ void __init native_pv_lock_init(void)
static_branch_enable(&virt_spin_lock_key);
}
-static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
- tlb_remove_page(tlb, table);
-}
-
struct static_key paravirt_steal_enabled;
struct static_key paravirt_steal_rq_enabled;
@@ -191,7 +186,7 @@ struct paravirt_patch_template pv_ops = {
.mmu.flush_tlb_kernel = native_flush_tlb_global,
.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
.mmu.flush_tlb_multi = native_flush_tlb_multi,
- .mmu.tlb_remove_table = native_tlb_remove_table,
+ .mmu.tlb_remove_table = tlb_remove_table,
.mmu.exit_mmap = paravirt_nop,
.mmu.notify_page_enc_status_changed = paravirt_nop,
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 5745a354a241..3dc4af1f7868 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -18,14 +18,6 @@ EXPORT_SYMBOL(physical_mask);
#define PGTABLE_HIGHMEM 0
#endif
-#ifndef CONFIG_PARAVIRT
-static inline
-void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
- tlb_remove_page(tlb, table);
-}
-#endif
-
gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
pgtable_t pte_alloc_one(struct mm_struct *mm)
@@ -54,7 +46,7 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
{
pagetable_pte_dtor(page_ptdesc(pte));
paravirt_release_pte(page_to_pfn(pte));
- paravirt_tlb_remove_table(tlb, pte);
+ tlb_remove_table(tlb, pte);
}
#if CONFIG_PGTABLE_LEVELS > 2
@@ -70,7 +62,7 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
tlb->need_flush_all = 1;
#endif
pagetable_pmd_dtor(ptdesc);
- paravirt_tlb_remove_table(tlb, ptdesc_page(ptdesc));
+ tlb_remove_table(tlb, ptdesc_page(ptdesc));
}
#if CONFIG_PGTABLE_LEVELS > 3
@@ -80,14 +72,14 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
pagetable_pud_dtor(ptdesc);
paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
- paravirt_tlb_remove_table(tlb, virt_to_page(pud));
+ tlb_remove_table(tlb, virt_to_page(pud));
}
#if CONFIG_PGTABLE_LEVELS > 4
void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
{
paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
- paravirt_tlb_remove_table(tlb, virt_to_page(p4d));
+ tlb_remove_table(tlb, virt_to_page(p4d));
}
#endif /* CONFIG_PGTABLE_LEVELS > 4 */
#endif /* CONFIG_PGTABLE_LEVELS > 3 */
--
2.47.1
* [PATCH v9 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call
From: Rik van Riel @ 2025-02-06 4:43 UTC (permalink / raw)
To: x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Rik van Riel, Manali Shukla
Every pv_ops.mmu.tlb_remove_table call ends up calling tlb_remove_table.
Get rid of the indirection by simply calling tlb_remove_table directly,
and not going through the paravirt function pointers.
Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Qi Zheng <zhengqi.arch@bytedance.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
---
arch/x86/hyperv/mmu.c | 1 -
arch/x86/include/asm/paravirt.h | 5 -----
arch/x86/include/asm/paravirt_types.h | 2 --
arch/x86/kernel/kvm.c | 1 -
arch/x86/kernel/paravirt.c | 1 -
arch/x86/xen/mmu_pv.c | 1 -
6 files changed, 11 deletions(-)
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 1cc113200ff5..cbe6c71e17c1 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -240,5 +240,4 @@ void hyperv_setup_mmu_ops(void)
pr_info("Using hypercall for remote TLB flush\n");
pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
- pv_ops.mmu.tlb_remove_table = tlb_remove_table;
}
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d4eb9e1d61b8..794ba3647c6c 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -91,11 +91,6 @@ static inline void __flush_tlb_multi(const struct cpumask *cpumask,
PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
}
-static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
- PVOP_VCALL2(mmu.tlb_remove_table, tlb, table);
-}
-
static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
{
PVOP_VCALL1(mmu.exit_mmap, mm);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8d4fbe1be489..13405959e4db 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -136,8 +136,6 @@ struct pv_mmu_ops {
void (*flush_tlb_multi)(const struct cpumask *cpus,
const struct flush_tlb_info *info);
- void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
-
/* Hook for intercepting the destruction of an mm_struct. */
void (*exit_mmap)(struct mm_struct *mm);
void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7a422a6c5983..3be9b3342c67 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -838,7 +838,6 @@ static void __init kvm_guest_init(void)
#ifdef CONFIG_SMP
if (pv_tlb_flush_supported()) {
pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
- pv_ops.mmu.tlb_remove_table = tlb_remove_table;
pr_info("KVM setup pv remote TLB flush\n");
}
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 2b78a6b466ed..c019771e0123 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -186,7 +186,6 @@ struct paravirt_patch_template pv_ops = {
.mmu.flush_tlb_kernel = native_flush_tlb_global,
.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
.mmu.flush_tlb_multi = native_flush_tlb_multi,
- .mmu.tlb_remove_table = tlb_remove_table,
.mmu.exit_mmap = paravirt_nop,
.mmu.notify_page_enc_status_changed = paravirt_nop,
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 55a4996d0c04..041e17282af0 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2137,7 +2137,6 @@ static const typeof(pv_ops) xen_mmu_ops __initconst = {
.flush_tlb_kernel = xen_flush_tlb,
.flush_tlb_one_user = xen_flush_tlb_one_user,
.flush_tlb_multi = xen_flush_tlb_multi,
- .tlb_remove_table = tlb_remove_table,
.pgd_alloc = xen_pgd_alloc,
.pgd_free = xen_pgd_free,
--
2.47.1
* [PATCH v9 03/12] x86/mm: consolidate full flush threshold decision
From: Rik van Riel @ 2025-02-06 4:43 UTC (permalink / raw)
To: x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Rik van Riel, Dave Hansen
Reduce code duplication by consolidating the decision point
for whether to do individual invalidations or a full flush
inside get_flush_tlb_info.
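As a worked example of the consolidated check (a simplified restatement
of the code added below; tlb_single_page_flush_ceiling defaults to 33 at
the time of writing):

	/* More invalidations than the ceiling? Just flush everything. */
	if ((end - start) >> stride_shift > tlb_single_page_flush_ceiling) {
		info->start = 0;
		info->end = TLB_FLUSH_ALL;
	}

	/*
	 * e.g. a 64-page flush with a 4k stride exceeds 33 and is turned
	 * into a full flush, while a 16-page flush stays a ranged flush.
	 */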
Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
---
arch/x86/mm/tlb.c | 56 ++++++++++++++++++++++++++---------------------
1 file changed, 31 insertions(+), 25 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6cf881a942bb..36939b104561 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1000,8 +1000,13 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
BUG_ON(this_cpu_inc_return(flush_tlb_info_idx) != 1);
#endif
- info->start = start;
- info->end = end;
+ /*
+ * Round the start and end addresses to the page size specified
+ * by the stride shift. This ensures partial pages at the end of
+ * a range get fully invalidated.
+ */
+ info->start = round_down(start, 1 << stride_shift);
+ info->end = round_up(end, 1 << stride_shift);
info->mm = mm;
info->stride_shift = stride_shift;
info->freed_tables = freed_tables;
@@ -1009,6 +1014,19 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
info->initiating_cpu = smp_processor_id();
info->trim_cpumask = 0;
+ WARN_ONCE(start != info->start || end != info->end,
+ "TLB flush not stride %x aligned. Start %lx, end %lx\n",
+ 1 << stride_shift, start, end);
+
+ /*
+ * If the number of flushes is so large that a full flush
+ * would be faster, do a full flush.
+ */
+ if ((end - start) >> stride_shift > tlb_single_page_flush_ceiling) {
+ info->start = 0;
+ info->end = TLB_FLUSH_ALL;
+ }
+
return info;
}
@@ -1026,17 +1044,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
bool freed_tables)
{
struct flush_tlb_info *info;
+ int cpu = get_cpu();
u64 new_tlb_gen;
- int cpu;
-
- cpu = get_cpu();
-
- /* Should we flush just the requested range? */
- if ((end == TLB_FLUSH_ALL) ||
- ((end - start) >> stride_shift) > tlb_single_page_flush_ceiling) {
- start = 0;
- end = TLB_FLUSH_ALL;
- }
/* This is also a barrier that synchronizes with switch_mm(). */
new_tlb_gen = inc_mm_tlb_gen(mm);
@@ -1089,22 +1098,19 @@ static void do_kernel_range_flush(void *info)
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
- /* Balance as user space task's flush, a bit conservative */
- if (end == TLB_FLUSH_ALL ||
- (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
- on_each_cpu(do_flush_tlb_all, NULL, 1);
- } else {
- struct flush_tlb_info *info;
+ struct flush_tlb_info *info;
- preempt_disable();
- info = get_flush_tlb_info(NULL, start, end, 0, false,
- TLB_GENERATION_INVALID);
+ guard(preempt)();
+
+ info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
+ TLB_GENERATION_INVALID);
+ if (info->end == TLB_FLUSH_ALL)
+ on_each_cpu(do_flush_tlb_all, NULL, 1);
+ else
on_each_cpu(do_kernel_range_flush, info, 1);
- put_flush_tlb_info();
- preempt_enable();
- }
+ put_flush_tlb_info();
}
/*
@@ -1276,7 +1282,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
int cpu = get_cpu();
- info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
+ info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, PAGE_SHIFT, false,
TLB_GENERATION_INVALID);
/*
* flush_tlb_multi() is not optimized for the common case in which only
--
2.47.1
* [PATCH v9 04/12] x86/mm: get INVLPGB count max from CPUID
From: Rik van Riel @ 2025-02-06 4:43 UTC (permalink / raw)
To: x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Rik van Riel, Manali Shukla
In its CPUID data, the CPU advertises the maximum number of pages that
can be shot down with one INVLPGB instruction.
Save that information for later use.
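For reference, the same information is visible from user space. A minimal
stand-alone probe (not part of this patch; it assumes a CPU that implements
CPUID leaf 0x80000008) could look like:

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* CPUID leaf 0x80000008: AMD extended feature/size information. */
		if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
			return 1;

		/* EBX bit 3: INVLPGB and TLBSYNC are supported. */
		printf("INVLPGB supported: %s\n", (ebx & (1u << 3)) ? "yes" : "no");

		/* EDX bits 15:0 hold the maximum page count minus one. */
		printf("invlpgb_count_max: %u\n", (edx & 0xffff) + 1);
		return 0;
	}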
Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
---
arch/x86/Kconfig.cpu | 5 +++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/tlbflush.h | 7 +++++++
arch/x86/kernel/cpu/amd.c | 8 ++++++++
4 files changed, 21 insertions(+)
diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 2a7279d80460..abe013a1b076 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -395,6 +395,10 @@ config X86_VMX_FEATURE_NAMES
def_bool y
depends on IA32_FEAT_CTL
+config X86_BROADCAST_TLB_FLUSH
+ def_bool y
+ depends on CPU_SUP_AMD && 64BIT
+
menuconfig PROCESSOR_SELECT
bool "Supported processor vendors" if EXPERT
help
@@ -431,6 +435,7 @@ config CPU_SUP_CYRIX_32
config CPU_SUP_AMD
default y
bool "Support AMD processors" if PROCESSOR_SELECT
+ select X86_BROADCAST_TLB_FLUSH
help
This enables detection, tunings and quirks for AMD processors
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 17b6590748c0..f9b832e971c5 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -338,6 +338,7 @@
#define X86_FEATURE_CLZERO (13*32+ 0) /* "clzero" CLZERO instruction */
#define X86_FEATURE_IRPERF (13*32+ 1) /* "irperf" Instructions Retired Count */
#define X86_FEATURE_XSAVEERPTR (13*32+ 2) /* "xsaveerptr" Always save/restore FP error pointers */
+#define X86_FEATURE_INVLPGB (13*32+ 3) /* INVLPGB and TLBSYNC instruction supported. */
#define X86_FEATURE_RDPRU (13*32+ 4) /* "rdpru" Read processor register at user level */
#define X86_FEATURE_WBNOINVD (13*32+ 9) /* "wbnoinvd" WBNOINVD instruction */
#define X86_FEATURE_AMD_IBPB (13*32+12) /* Indirect Branch Prediction Barrier */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 02fc2aa06e9e..8fe3b2dda507 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -183,6 +183,13 @@ static inline void cr4_init_shadow(void)
extern unsigned long mmu_cr4_features;
extern u32 *trampoline_cr4_features;
+/* How many pages can we invalidate with one INVLPGB. */
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+extern u16 invlpgb_count_max;
+#else
+#define invlpgb_count_max 1
+#endif
+
extern void initialize_tlbstate_and_flush(void);
/*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 79d2e17f6582..bcf73775b4f8 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -29,6 +29,8 @@
#include "cpu.h"
+u16 invlpgb_count_max __ro_after_init;
+
static inline int rdmsrl_amd_safe(unsigned msr, unsigned long long *p)
{
u32 gprs[8] = { 0 };
@@ -1135,6 +1137,12 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
tlb_lli_2m[ENTRIES] = eax & mask;
tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
+
+ /* Max number of pages INVLPGB can invalidate in one shot */
+ if (boot_cpu_has(X86_FEATURE_INVLPGB)) {
+ cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
+ invlpgb_count_max = (edx & 0xffff) + 1;
+ }
}
static const struct cpu_dev amd_cpu_dev = {
--
2.47.1
* [PATCH v9 05/12] x86/mm: add INVLPGB support code
From: Rik van Riel @ 2025-02-06 4:43 UTC (permalink / raw)
To: x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Rik van Riel, Manali Shukla
Add invlpgb.h with the helper functions and definitions needed to use
broadcast TLB invalidation on AMD EPYC 3 and newer CPUs.
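The intended usage pattern, based on how the later patches in this series
use these helpers, is to queue any number of asynchronous invalidations and
then wait for all of them with a single TLBSYNC. A sketch (asid, addr and
nr stand in for whatever the caller is flushing):

	/* Queue broadcast invalidations for the kernel and, with PTI, user PCID. */
	invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, false);
	if (static_cpu_has(X86_FEATURE_PTI))
		invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, false);

	/* Wait for every INVLPGB issued by this CPU to complete. */
	tlbsync();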
Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
---
arch/x86/include/asm/invlpgb.h | 101 ++++++++++++++++++++++++++++++++
arch/x86/include/asm/tlbflush.h | 1 +
2 files changed, 102 insertions(+)
create mode 100644 arch/x86/include/asm/invlpgb.h
diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
new file mode 100644
index 000000000000..a1d5dedd5217
--- /dev/null
+++ b/arch/x86/include/asm/invlpgb.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_INVLPGB
+#define _ASM_X86_INVLPGB
+
+#include <linux/kernel.h>
+#include <vdso/bits.h>
+#include <vdso/page.h>
+
+/*
+ * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
+ *
+ * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
+ * be done in a parallel fashion.
+ *
+ * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
+ * this CPU have completed.
+ */
+static inline void __invlpgb(unsigned long asid, unsigned long pcid,
+ unsigned long addr, u16 extra_count,
+ bool pmd_stride, u8 flags)
+{
+ u32 edx = (pcid << 16) | asid;
+ u32 ecx = (pmd_stride << 31) | extra_count;
+ u64 rax = addr | flags;
+
+ /* The low bits in rax are for flags. Verify addr is clean. */
+ VM_WARN_ON_ONCE(addr & ~PAGE_MASK);
+
+ /* INVLPGB; supported in binutils >= 2.36. */
+ asm volatile(".byte 0x0f, 0x01, 0xfe" : : "a" (rax), "c" (ecx), "d" (edx));
+}
+
+/* Wait for INVLPGB originated by this CPU to complete. */
+static inline void tlbsync(void)
+{
+ cant_migrate();
+ /* TLBSYNC: supported in binutils >= 2.36. */
+ asm volatile(".byte 0x0f, 0x01, 0xff" ::: "memory");
+}
+
+/*
+ * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
+ * of the three. For example:
+ * - INVLPGB_VA | INVLPGB_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - INVLPGB_PCID: invalidate all TLB entries matching the PCID
+ *
+ * The former can be used to invalidate (kernel) mappings at a particular
+ * address across all processes.
+ *
+ * The latter invalidates all TLB entries matching a PCID.
+ */
+#define INVLPGB_VA BIT(0)
+#define INVLPGB_PCID BIT(1)
+#define INVLPGB_ASID BIT(2)
+#define INVLPGB_INCLUDE_GLOBAL BIT(3)
+#define INVLPGB_FINAL_ONLY BIT(4)
+#define INVLPGB_INCLUDE_NESTED BIT(5)
+
+/* Flush all mappings for a given pcid and addr, not including globals. */
+static inline void invlpgb_flush_user(unsigned long pcid,
+ unsigned long addr)
+{
+ __invlpgb(0, pcid, addr, 0, 0, INVLPGB_PCID | INVLPGB_VA);
+ tlbsync();
+}
+
+static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
+ unsigned long addr,
+ u16 nr,
+ bool pmd_stride)
+{
+ __invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+}
+
+/* Flush all mappings for a given PCID, not including globals. */
+static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+{
+ __invlpgb(0, pcid, 0, 0, 0, INVLPGB_PCID);
+}
+
+/* Flush all mappings, including globals, for all PCIDs. */
+static inline void invlpgb_flush_all(void)
+{
+ __invlpgb(0, 0, 0, 0, 0, INVLPGB_INCLUDE_GLOBAL);
+ tlbsync();
+}
+
+/* Flush addr, including globals, for all PCIDs. */
+static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+{
+ __invlpgb(0, 0, addr, nr - 1, 0, INVLPGB_INCLUDE_GLOBAL);
+}
+
+/* Flush all mappings for all PCIDs except globals. */
+static inline void invlpgb_flush_all_nonglobals(void)
+{
+ __invlpgb(0, 0, 0, 0, 0, 0);
+ tlbsync();
+}
+
+#endif /* _ASM_X86_INVLPGB */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8fe3b2dda507..dba5caa4a9f4 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -10,6 +10,7 @@
#include <asm/cpufeature.h>
#include <asm/special_insns.h>
#include <asm/smp.h>
+#include <asm/invlpgb.h>
#include <asm/invpcid.h>
#include <asm/pti.h>
#include <asm/processor-flags.h>
--
2.47.1
* [PATCH v9 06/12] x86/mm: use INVLPGB for kernel TLB flushes
From: Rik van Riel @ 2025-02-06 4:43 UTC (permalink / raw)
To: x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Rik van Riel, Manali Shukla
Use broadcast TLB invalidation for kernel addresses when available.
Remove the need to send IPIs for kernel TLB flushes.
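The ranged case below simply chunks the flush by invlpgb_count_max. For
example, if the CPU reports an invlpgb_count_max of 8, flushing a 40-page
kernel range issues five asynchronous INVLPGBs followed by one TLBSYNC:

	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
		nr = min((info->end - addr) >> PAGE_SHIFT, invlpgb_count_max);
		invlpgb_flush_addr_nosync(addr, nr);	/* asynchronous */
	}
	tlbsync();	/* single wait for all of the above */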
Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
---
arch/x86/mm/tlb.c | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 36939b104561..227e972b6fbc 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1086,6 +1086,30 @@ void flush_tlb_all(void)
on_each_cpu(do_flush_tlb_all, NULL, 1);
}
+static bool broadcast_kernel_range_flush(struct flush_tlb_info *info)
+{
+ unsigned long addr;
+ unsigned long nr;
+
+ if (!IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH))
+ return false;
+
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ return false;
+
+ if (info->end == TLB_FLUSH_ALL) {
+ invlpgb_flush_all();
+ return true;
+ }
+
+ for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
+ nr = min((info->end - addr) >> PAGE_SHIFT, invlpgb_count_max);
+ invlpgb_flush_addr_nosync(addr, nr);
+ }
+ tlbsync();
+ return true;
+}
+
static void do_kernel_range_flush(void *info)
{
struct flush_tlb_info *f = info;
@@ -1105,7 +1129,9 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
TLB_GENERATION_INVALID);
- if (info->end == TLB_FLUSH_ALL)
+ if (broadcast_kernel_range_flush(info))
+ ; /* Fall through. */
+ else if (info->end == TLB_FLUSH_ALL)
on_each_cpu(do_flush_tlb_all, NULL, 1);
else
on_each_cpu(do_kernel_range_flush, info, 1);
--
2.47.1
* [PATCH v9 07/12] x86/mm: use INVLPGB in flush_tlb_all
From: Rik van Riel @ 2025-02-06 4:43 UTC (permalink / raw)
To: x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Rik van Riel, Manali Shukla
The flush_tlb_all() function is not used a whole lot, but we might
as well use broadcast TLB flushing there, too.
Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
---
arch/x86/mm/tlb.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 227e972b6fbc..d225a3df9aa7 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1074,6 +1074,19 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
}
+static bool broadcast_flush_tlb_all(void)
+{
+ if (!IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH))
+ return false;
+
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ return false;
+
+ guard(preempt)();
+ invlpgb_flush_all();
+ return true;
+}
+
static void do_flush_tlb_all(void *info)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -1082,6 +1095,8 @@ static void do_flush_tlb_all(void *info)
void flush_tlb_all(void)
{
+ if (broadcast_flush_tlb_all())
+ return;
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
on_each_cpu(do_flush_tlb_all, NULL, 1);
}
--
2.47.1
* [PATCH v9 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
From: Rik van Riel @ 2025-02-06 4:43 UTC (permalink / raw)
To: x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Rik van Riel, Manali Shukla
In the page reclaim code, we only track the CPU(s) where the TLB needs
to be flushed, rather than all the individual mappings that may be getting
invalidated.
Use broadcast TLB flushing when that is available.
Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
---
arch/x86/mm/tlb.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d225a3df9aa7..f2d18f16d76f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1330,7 +1330,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
* a local TLB flush is needed. Optimize this use-case by calling
* flush_tlb_func_local() directly in this case.
*/
- if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+ invlpgb_flush_all_nonglobals();
+ } else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
flush_tlb_multi(&batch->cpumask, info);
} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
lockdep_assert_irqs_enabled();
--
2.47.1
* [PATCH v9 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
From: Rik van Riel @ 2025-02-06 4:43 UTC (permalink / raw)
To: x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Rik van Riel, Manali Shukla
Use broadcast TLB invalidation, using the INVLPGB instruction, on AMD EPYC 3
and newer CPUs.
In order to not exhaust PCID space, and keep TLB flushes local for
single-threaded processes, we only hand out broadcast ASIDs to processes
active on 4 or more CPUs.
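A condensed view of the global ASID lifecycle implemented by this patch
(a summary of the code below, not additional behaviour):

	/*
	 * 1. flush_tlb_mm_range() calls consider_global_asid() once the
	 *    process is seen running on enough CPUs.
	 * 2. use_global_asid() allocates an ASID from the global bitmap,
	 *    sets mm->context.asid_transition, then publishes
	 *    mm->context.global_asid with smp_store_release().
	 * 3. CPUs running the mm pick up the global ASID at context switch
	 *    time or from a TLB flush IPI; finish_asid_transition() clears
	 *    the transition flag once every CPU has switched over.
	 * 4. destroy_context_free_global_asid() marks the ASID as freed; it
	 *    is only reused after reset_global_asid_space() flushes at
	 *    wrap-around of the ASID space.
	 */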
Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
---
arch/x86/include/asm/mmu.h | 6 +
arch/x86/include/asm/mmu_context.h | 14 ++
arch/x86/include/asm/tlbflush.h | 73 ++++++
arch/x86/mm/tlb.c | 344 ++++++++++++++++++++++++++++-
4 files changed, 425 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 3b496cdcb74b..d71cd599fec4 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -69,6 +69,12 @@ typedef struct {
u16 pkey_allocation_map;
s16 execute_only_pkey;
#endif
+
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+ u16 global_asid;
+ bool asid_transition;
+#endif
+
} mm_context_t;
#define INIT_MM_CONTEXT(mm) \
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 795fdd53bd0a..d670699d32c2 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
#define enter_lazy_tlb enter_lazy_tlb
extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
+extern void destroy_context_free_global_asid(struct mm_struct *mm);
+
/*
* Init a new mm. Used on mm copies, like at fork()
* and on mm's that are brand-new, like at execve().
@@ -161,6 +163,14 @@ static inline int init_new_context(struct task_struct *tsk,
mm->context.execute_only_pkey = -1;
}
#endif
+
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+ mm->context.global_asid = 0;
+ mm->context.asid_transition = false;
+ }
+#endif
+
mm_reset_untag_mask(mm);
init_new_context_ldt(mm);
return 0;
@@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
static inline void destroy_context(struct mm_struct *mm)
{
destroy_context_ldt(mm);
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ destroy_context_free_global_asid(mm);
+#endif
}
extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dba5caa4a9f4..234277a5ef89 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -6,6 +6,7 @@
#include <linux/mmu_notifier.h>
#include <linux/sched.h>
+#include <asm/barrier.h>
#include <asm/processor.h>
#include <asm/cpufeature.h>
#include <asm/special_insns.h>
@@ -239,6 +240,78 @@ void flush_tlb_one_kernel(unsigned long addr);
void flush_tlb_multi(const struct cpumask *cpumask,
const struct flush_tlb_info *info);
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+static inline bool is_dyn_asid(u16 asid)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ return true;
+
+ return asid < TLB_NR_DYN_ASIDS;
+}
+
+static inline bool is_global_asid(u16 asid)
+{
+ return !is_dyn_asid(asid);
+}
+
+static inline bool in_asid_transition(struct mm_struct *mm)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ return false;
+
+ return mm && READ_ONCE(mm->context.asid_transition);
+}
+
+static inline u16 mm_global_asid(struct mm_struct *mm)
+{
+ u16 asid;
+
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ return 0;
+
+ asid = smp_load_acquire(&mm->context.global_asid);
+
+ /* mm->context.global_asid is either 0, or a global ASID */
+ VM_WARN_ON_ONCE(asid && is_dyn_asid(asid));
+
+ return asid;
+}
+#else
+static inline bool is_dyn_asid(u16 asid)
+{
+ return true;
+}
+
+static inline bool is_global_asid(u16 asid)
+{
+ return false;
+}
+
+static inline bool in_asid_transition(struct mm_struct *mm)
+{
+ return false;
+}
+
+static inline u16 mm_global_asid(struct mm_struct *mm)
+{
+ return 0;
+}
+
+static inline bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
+{
+ return false;
+}
+
+static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+ VM_WARN_ON_ONCE(1);
+}
+
+static inline void consider_global_asid(struct mm_struct *mm)
+{
+}
+#endif
+
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f2d18f16d76f..05390f0e6cb0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -74,13 +74,15 @@
* use different names for each of them:
*
* ASID - [0, TLB_NR_DYN_ASIDS-1]
- * the canonical identifier for an mm
+ * the canonical identifier for an mm, dynamically allocated on each CPU
+ * [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
+ * the canonical, global identifier for an mm, identical across all CPUs
*
- * kPCID - [1, TLB_NR_DYN_ASIDS]
+ * kPCID - [1, MAX_ASID_AVAILABLE]
* the value we write into the PCID part of CR3; corresponds to the
* ASID+1, because PCID 0 is special.
*
- * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
+ * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
* for KPTI each mm has two address spaces and thus needs two
* PCID values, but we can still do with a single ASID denomination
* for each mm. Corresponds to kPCID + 2048.
@@ -225,6 +227,20 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
return;
}
+ /*
+ * TLB consistency for global ASIDs is maintained with broadcast TLB
+ * flushing. The TLB is never outdated, and does not need flushing.
+ */
+ if (IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) && static_cpu_has(X86_FEATURE_INVLPGB)) {
+ u16 global_asid = mm_global_asid(next);
+
+ if (global_asid) {
+ *new_asid = global_asid;
+ *need_flush = false;
+ return;
+ }
+ }
+
if (this_cpu_read(cpu_tlbstate.invalidate_other))
clear_asid_other();
@@ -251,6 +267,272 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
*need_flush = true;
}
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+/*
+ * Logic for broadcast TLB invalidation.
+ */
+static DEFINE_RAW_SPINLOCK(global_asid_lock);
+static u16 last_global_asid = MAX_ASID_AVAILABLE;
+static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE) = { 0 };
+static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE) = { 0 };
+static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
+
+static void reset_global_asid_space(void)
+{
+ lockdep_assert_held(&global_asid_lock);
+
+ /*
+ * A global TLB flush guarantees that any stale entries from
+ * previously freed global ASIDs get flushed from the TLB
+ * everywhere, making these global ASIDs safe to reuse.
+ */
+ invlpgb_flush_all_nonglobals();
+
+ /*
+ * Clear all the previously freed global ASIDs from the
+ * broadcast_asid_used bitmap, now that the global TLB flush
+ * has made them actually available for re-use.
+ */
+ bitmap_andnot(global_asid_used, global_asid_used,
+ global_asid_freed, MAX_ASID_AVAILABLE);
+ bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
+
+ /*
+ * ASIDs 0-TLB_NR_DYN_ASIDS are used for CPU-local ASID
+ * assignments, for tasks doing IPI based TLB shootdowns.
+ * Restart the search from the start of the global ASID space.
+ */
+ last_global_asid = TLB_NR_DYN_ASIDS;
+}
+
+static u16 get_global_asid(void)
+{
+
+ u16 asid;
+
+ lockdep_assert_held(&global_asid_lock);
+
+ /* The previous allocated ASID is at the top of the address space. */
+ if (last_global_asid >= MAX_ASID_AVAILABLE - 1)
+ reset_global_asid_space();
+
+ asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, last_global_asid);
+
+ if (asid >= MAX_ASID_AVAILABLE) {
+ /* This should never happen. */
+ VM_WARN_ONCE(1, "Unable to allocate global ASID despite %d available\n", global_asid_available);
+ return 0;
+ }
+
+ /* Claim this global ASID. */
+ __set_bit(asid, global_asid_used);
+ last_global_asid = asid;
+ global_asid_available--;
+ return asid;
+}
+
+/*
+ * Returns true if the mm is transitioning from a CPU-local ASID to a global
+ * (INVLPGB) ASID, or the other way around.
+ */
+static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
+{
+ u16 global_asid = mm_global_asid(next);
+
+ if (global_asid && prev_asid != global_asid)
+ return true;
+
+ if (!global_asid && is_global_asid(prev_asid))
+ return true;
+
+ return false;
+}
+
+void destroy_context_free_global_asid(struct mm_struct *mm)
+{
+ if (!mm->context.global_asid)
+ return;
+
+ guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+ /* The global ASID can be re-used only after flush at wrap-around. */
+ __set_bit(mm->context.global_asid, global_asid_freed);
+
+ mm->context.global_asid = 0;
+ global_asid_available++;
+}
+
+/*
+ * Check whether a process is currently active on more than "threshold" CPUs.
+ * This is a cheap estimation on whether or not it may make sense to assign
+ * a global ASID to this process, and use broadcast TLB invalidation.
+ */
+static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
+{
+ int count = 0;
+ int cpu;
+
+ /* This quick check should eliminate most single threaded programs. */
+ if (cpumask_weight(mm_cpumask(mm)) <= threshold)
+ return false;
+
+ /* Slower check to make sure. */
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ /* Skip the CPUs that aren't really running this process. */
+ if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
+ continue;
+
+ if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
+ continue;
+
+ if (++count > threshold)
+ return true;
+ }
+ return false;
+}
+
+/*
+ * Assign a global ASID to the current process, protecting against
+ * races between multiple threads in the process.
+ */
+static void use_global_asid(struct mm_struct *mm)
+{
+ u16 asid;
+
+ guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+ /* This process is already using broadcast TLB invalidation. */
+ if (mm->context.global_asid)
+ return;
+
+ /* The last global ASID was consumed while waiting for the lock. */
+ if (!global_asid_available) {
+ VM_WARN_ONCE(1, "Ran out of global ASIDs\n");
+ return;
+ }
+
+ asid = get_global_asid();
+ if (!asid)
+ return;
+
+ /*
+ * Notably flush_tlb_mm_range() -> broadcast_tlb_flush() ->
+ * finish_asid_transition() needs to observe asid_transition = true
+ * once it observes global_asid.
+ */
+ mm->context.asid_transition = true;
+ smp_store_release(&mm->context.global_asid, asid);
+}
+
+static bool meets_global_asid_threshold(struct mm_struct *mm)
+{
+ if (!global_asid_available)
+ return false;
+
+ /*
+ * Assign a global ASID if the process is active on
+ * 4 or more CPUs simultaneously.
+ */
+ return mm_active_cpus_exceeds(mm, 3);
+}
+
+static void consider_global_asid(struct mm_struct *mm)
+{
+ if (!static_cpu_has(X86_FEATURE_INVLPGB))
+ return;
+
+ /* Check every once in a while. */
+ if ((current->pid & 0x1f) != (jiffies & 0x1f))
+ return;
+
+ if (meets_global_asid_threshold(mm))
+ use_global_asid(mm);
+}
+
+static void finish_asid_transition(struct flush_tlb_info *info)
+{
+ struct mm_struct *mm = info->mm;
+ int bc_asid = mm_global_asid(mm);
+ int cpu;
+
+ if (!READ_ONCE(mm->context.asid_transition))
+ return;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ /*
+ * The remote CPU is context switching. Wait for that to
+ * finish, to catch the unlikely case of it switching to
+ * the target mm with an out of date ASID.
+ */
+ while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
+ cpu_relax();
+
+ if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
+ continue;
+
+ /*
+ * If at least one CPU is not using the global ASID yet,
+ * send a TLB flush IPI. The IPI should cause stragglers
+ * to transition soon.
+ *
+ * This can race with the CPU switching to another task;
+ * that results in a (harmless) extra IPI.
+ */
+ if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
+ flush_tlb_multi(mm_cpumask(info->mm), info);
+ return;
+ }
+ }
+
+ /* All the CPUs running this process are using the global ASID. */
+ WRITE_ONCE(mm->context.asid_transition, false);
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+ bool pmd = info->stride_shift == PMD_SHIFT;
+ unsigned long maxnr = invlpgb_count_max;
+ unsigned long asid = info->mm->context.global_asid;
+ unsigned long addr = info->start;
+ unsigned long nr;
+
+ /* Flushing multiple pages at once is not supported with 1GB pages. */
+ if (info->stride_shift > PMD_SHIFT)
+ maxnr = 1;
+
+ /*
+ * TLB flushes with INVLPGB are kicked off asynchronously.
+ * The inc_mm_tlb_gen() guarantees page table updates are done
+ * before these TLB flushes happen.
+ */
+ if (info->end == TLB_FLUSH_ALL) {
+ invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
+ /* Do any CPUs supporting INVLPGB need PTI? */
+ if (static_cpu_has(X86_FEATURE_PTI))
+ invlpgb_flush_single_pcid_nosync(user_pcid(asid));
+ } else do {
+ /*
+ * Calculate how many pages can be flushed at once; if the
+ * remainder of the range is less than one page, flush one.
+ */
+ nr = min(maxnr, (info->end - addr) >> info->stride_shift);
+ nr = max(nr, 1);
+
+ invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+ /* Do any CPUs supporting INVLPGB need PTI? */
+ if (static_cpu_has(X86_FEATURE_PTI))
+ invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+
+ addr += nr << info->stride_shift;
+ } while (addr < info->end);
+
+ finish_asid_transition(info);
+
+ /* Wait for the INVLPGBs kicked off above to finish. */
+ tlbsync();
+}
+#endif /* CONFIG_X86_BROADCAST_TLB_FLUSH */
+
/*
* Given an ASID, flush the corresponding user ASID. We can delay this
* until the next time we switch to it.
@@ -556,8 +838,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
*/
if (prev == next) {
/* Not actually switching mm's */
- VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
- next->context.ctx_id);
+ VM_WARN_ON(is_dyn_asid(prev_asid) &&
+ this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
+ next->context.ctx_id);
/*
* If this races with another thread that enables lam, 'new_lam'
@@ -573,6 +856,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
!cpumask_test_cpu(cpu, mm_cpumask(next))))
cpumask_set_cpu(cpu, mm_cpumask(next));
+ /*
+ * Check if the current mm is transitioning to a new ASID.
+ */
+ if (needs_global_asid_reload(next, prev_asid)) {
+ next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+
+ choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+ goto reload_tlb;
+ }
+
+ /*
+ * Broadcast TLB invalidation keeps this PCID up to date
+ * all the time.
+ */
+ if (is_global_asid(prev_asid))
+ return;
+
/*
* If the CPU is not in lazy TLB mode, we are just switching
* from one thread in a process to another thread in the same
@@ -606,6 +906,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
*/
cond_mitigation(tsk);
+ /*
+ * Let nmi_uaccess_okay() and finish_asid_transition()
+ * know that we're changing CR3.
+ */
+ this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
+ barrier();
+
/*
* Leave this CPU in prev's mm_cpumask. Atomic writes to
* mm_cpumask can be expensive under contention. The CPU
@@ -620,14 +927,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
-
- /* Let nmi_uaccess_okay() know that we're changing CR3. */
- this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
- barrier();
}
+reload_tlb:
new_lam = mm_lam_cr3_mask(next);
if (need_flush) {
+ VM_WARN_ON_ONCE(is_global_asid(new_asid));
this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
@@ -746,7 +1051,7 @@ static void flush_tlb_func(void *info)
const struct flush_tlb_info *f = info;
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
- u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+ u64 local_tlb_gen;
bool local = smp_processor_id() == f->initiating_cpu;
unsigned long nr_invalidate = 0;
u64 mm_tlb_gen;
@@ -769,6 +1074,16 @@ static void flush_tlb_func(void *info)
if (unlikely(loaded_mm == &init_mm))
return;
+ /* Reload the ASID if transitioning into or out of a global ASID */
+ if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
+ switch_mm_irqs_off(NULL, loaded_mm, NULL);
+ loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ }
+
+ /* Broadcast ASIDs are always kept up to date with INVLPGB. */
+ if (is_global_asid(loaded_mm_asid))
+ return;
+
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
loaded_mm->context.ctx_id);
@@ -786,6 +1101,8 @@ static void flush_tlb_func(void *info)
return;
}
+ local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+
if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
f->new_tlb_gen <= local_tlb_gen)) {
/*
@@ -953,7 +1270,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
* up on the new contents of what used to be page tables, while
* doing a speculative memory access.
*/
- if (info->freed_tables)
+ if (info->freed_tables || in_asid_transition(info->mm))
on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
else
on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
@@ -1058,9 +1375,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
* a local TLB flush is needed. Optimize this use-case by calling
* flush_tlb_func_local() directly in this case.
*/
- if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+ if (mm_global_asid(mm)) {
+ broadcast_tlb_flush(info);
+ } else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
info->trim_cpumask = should_trim_cpumask(mm);
flush_tlb_multi(mm_cpumask(mm), info);
+ consider_global_asid(mm);
} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
--
2.47.1
* [PATCH v9 10/12] x86/mm: do targeted broadcast flushing from tlbbatch code
From: Rik van Riel @ 2025-02-06 4:43 UTC (permalink / raw)
To: x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Rik van Riel, Manali Shukla
Instead of doing a system-wide TLB flush from arch_tlbbatch_flush,
queue up asynchronous, targeted flushes from arch_tlbbatch_add_pending.
This also allows us to avoid adding the CPUs of processes using broadcast
flushing to the batch->cpumask, and will hopefully further reduce TLB
flushing from the reclaim and compaction paths.
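The resulting flow for an mm that already has a global ASID, simplified
from the code below:

	/*
	 * reclaim, per page being unmapped:
	 *   arch_tlbbatch_add_pending()
	 *     -> invlpgb_flush_user_nr_nosync()   (asynchronous, no wait)
	 *
	 * when the batch is flushed, or at the next context switch:
	 *   arch_tlbbatch_flush() / switch_mm_irqs_off()
	 *     -> tlbsync()   (single wait for everything queued on this CPU)
	 */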
Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
---
arch/x86/include/asm/invlpgb.h | 21 +++++----
arch/x86/include/asm/tlbflush.h | 17 ++++---
arch/x86/mm/tlb.c | 80 +++++++++++++++++++++++++++++++--
3 files changed, 95 insertions(+), 23 deletions(-)
diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
index a1d5dedd5217..357e3cc417e4 100644
--- a/arch/x86/include/asm/invlpgb.h
+++ b/arch/x86/include/asm/invlpgb.h
@@ -31,9 +31,8 @@ static inline void __invlpgb(unsigned long asid, unsigned long pcid,
}
/* Wait for INVLPGB originated by this CPU to complete. */
-static inline void tlbsync(void)
+static inline void __tlbsync(void)
{
- cant_migrate();
/* TLBSYNC: supported in binutils >= 2.36. */
asm volatile(".byte 0x0f, 0x01, 0xff" ::: "memory");
}
@@ -61,19 +60,19 @@ static inline void invlpgb_flush_user(unsigned long pcid,
unsigned long addr)
{
__invlpgb(0, pcid, addr, 0, 0, INVLPGB_PCID | INVLPGB_VA);
- tlbsync();
+ __tlbsync();
}
-static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
- unsigned long addr,
- u16 nr,
- bool pmd_stride)
+static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
+ unsigned long addr,
+ u16 nr,
+ bool pmd_stride)
{
__invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
}
/* Flush all mappings for a given PCID, not including globals. */
-static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+static inline void __invlpgb_flush_single_pcid_nosync(unsigned long pcid)
{
__invlpgb(0, pcid, 0, 0, 0, INVLPGB_PCID);
}
@@ -82,11 +81,11 @@ static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
static inline void invlpgb_flush_all(void)
{
__invlpgb(0, 0, 0, 0, 0, INVLPGB_INCLUDE_GLOBAL);
- tlbsync();
+ __tlbsync();
}
/* Flush addr, including globals, for all PCIDs. */
-static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+static inline void __invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
{
__invlpgb(0, 0, addr, nr - 1, 0, INVLPGB_INCLUDE_GLOBAL);
}
@@ -95,7 +94,7 @@ static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
static inline void invlpgb_flush_all_nonglobals(void)
{
__invlpgb(0, 0, 0, 0, 0, 0);
- tlbsync();
+ __tlbsync();
}
#endif /* _ASM_X86_INVLPGB */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 234277a5ef89..bf167e215e8e 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -106,6 +106,7 @@ struct tlb_state {
* need to be invalidated.
*/
bool invalidate_other;
+ bool need_tlbsync;
#ifdef CONFIG_ADDRESS_MASKING
/*
@@ -310,6 +311,10 @@ static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
static inline void consider_global_asid(struct mm_struct *mm)
{
}
+
+static inline void tlbsync(void)
+{
+}
#endif
#ifdef CONFIG_PARAVIRT
@@ -359,21 +364,15 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
return atomic64_inc_return(&mm->context.tlb_gen);
}
-static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
- struct mm_struct *mm,
- unsigned long uaddr)
-{
- inc_mm_tlb_gen(mm);
- cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
- mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
-}
-
static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
{
flush_tlb_mm(mm);
}
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm,
+ unsigned long uaddr);
static inline bool pte_flags_need_flush(unsigned long oldflags,
unsigned long newflags,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 05390f0e6cb0..4253c3efd7e4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -488,6 +488,37 @@ static void finish_asid_transition(struct flush_tlb_info *info)
WRITE_ONCE(mm->context.asid_transition, false);
}
+static inline void tlbsync(void)
+{
+ if (!this_cpu_read(cpu_tlbstate.need_tlbsync))
+ return;
+ __tlbsync();
+ this_cpu_write(cpu_tlbstate.need_tlbsync, false);
+}
+
+static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
+ unsigned long addr,
+ u16 nr, bool pmd_stride)
+{
+ __invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride);
+ if (!this_cpu_read(cpu_tlbstate.need_tlbsync))
+ this_cpu_write(cpu_tlbstate.need_tlbsync, true);
+}
+
+static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+{
+ __invlpgb_flush_single_pcid_nosync(pcid);
+ if (!this_cpu_read(cpu_tlbstate.need_tlbsync))
+ this_cpu_write(cpu_tlbstate.need_tlbsync, true);
+}
+
+static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+{
+ __invlpgb_flush_addr_nosync(addr, nr);
+ if (!this_cpu_read(cpu_tlbstate.need_tlbsync))
+ this_cpu_write(cpu_tlbstate.need_tlbsync, true);
+}
+
static void broadcast_tlb_flush(struct flush_tlb_info *info)
{
bool pmd = info->stride_shift == PMD_SHIFT;
@@ -794,6 +825,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
if (IS_ENABLED(CONFIG_PROVE_LOCKING))
WARN_ON_ONCE(!irqs_disabled());
+ tlbsync();
+
/*
* Verify that CR3 is what we think it is. This will catch
* hypothetical buggy code that directly switches to swapper_pg_dir
@@ -973,6 +1006,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
*/
void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
+ tlbsync();
+
if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
return;
@@ -1650,9 +1685,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
* a local TLB flush is needed. Optimize this use-case by calling
* flush_tlb_func_local() directly in this case.
*/
- if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
- invlpgb_flush_all_nonglobals();
- } else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+ if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
flush_tlb_multi(&batch->cpumask, info);
} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
lockdep_assert_irqs_enabled();
@@ -1661,12 +1694,53 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
local_irq_enable();
}
+ /*
+ * If we issued (asynchronous) INVLPGB flushes, wait for them here.
+ * The cpumask above contains only CPUs that were running tasks
+ * not using broadcast TLB flushing.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ tlbsync();
+
cpumask_clear(&batch->cpumask);
put_flush_tlb_info();
put_cpu();
}
+void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm,
+ unsigned long uaddr)
+{
+ u16 asid = mm_global_asid(mm);
+
+ if (asid) {
+ invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
+ /* Do any CPUs supporting INVLPGB need PTI? */
+ if (static_cpu_has(X86_FEATURE_PTI))
+ invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
+
+ /*
+ * Some CPUs might still be using a local ASID for this
+ * process, and require IPIs, while others are using the
+ * global ASID.
+ *
+ * In this corner case we need to do both the broadcast
+ * TLB invalidation, and send IPIs. The IPIs will help
+ * stragglers transition to the broadcast ASID.
+ */
+ if (in_asid_transition(mm))
+ asid = 0;
+ }
+
+ if (!asid) {
+ inc_mm_tlb_gen(mm);
+ cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
+ }
+
+ mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
+}
+
/*
* Blindly accessing user memory from NMI context can be dangerous
* if we're in the middle of switching the current user task or
--
2.47.1
* [PATCH v9 11/12] x86/mm: enable AMD translation cache extensions
2025-02-06 4:43 [PATCH v9 00/12] AMD broadcast TLB invalidation Rik van Riel
` (9 preceding siblings ...)
2025-02-06 4:43 ` [PATCH v9 10/12] x86/mm: do targeted broadcast flushing from tlbbatch code Rik van Riel
@ 2025-02-06 4:43 ` Rik van Riel
2025-02-06 4:43 ` [PATCH v9 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
` (2 subsequent siblings)
13 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2025-02-06 4:43 UTC (permalink / raw)
To: x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Rik van Riel, Manali Shukla
With AMD TCE (translation cache extensions) only the intermediate mappings
that cover the address range zapped by INVLPG / INVLPGB get invalidated,
rather than all intermediate mappings getting zapped at every TLB invalidation.
This can help reduce the TLB miss rate, by keeping more intermediate
mappings in the cache.
From the AMD manual:
Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit
to 1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on
TLB entries. When this bit is 0, these instructions remove the target PTE
from the TLB as well as all upper-level table entries that are cached
in the TLB, whether or not they are associated with the target PTE.
When this bit is set, these instructions will remove the target PTE and
only those upper-level entries that lead to the target PTE in
the page table hierarchy, leaving unrelated upper-level entries intact.
Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
---
arch/x86/include/asm/msr-index.h | 2 ++
arch/x86/kernel/cpu/amd.c | 4 ++++
tools/arch/x86/include/asm/msr-index.h | 2 ++
3 files changed, 8 insertions(+)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3ae84c3b8e6d..dc1c1057f26e 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
#define _EFER_SVME 12 /* Enable virtualization */
#define _EFER_LMSLE 13 /* Long Mode Segment Limit Enable */
#define _EFER_FFXSR 14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE 15 /* Enable Translation Cache Extensions */
#define _EFER_AUTOIBRS 21 /* Enable Automatic IBRS */
#define EFER_SCE (1<<_EFER_SCE)
@@ -34,6 +35,7 @@
#define EFER_SVME (1<<_EFER_SVME)
#define EFER_LMSLE (1<<_EFER_LMSLE)
#define EFER_FFXSR (1<<_EFER_FFXSR)
+#define EFER_TCE (1<<_EFER_TCE)
#define EFER_AUTOIBRS (1<<_EFER_AUTOIBRS)
/*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index bcf73775b4f8..21076252a491 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1071,6 +1071,10 @@ static void init_amd(struct cpuinfo_x86 *c)
/* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
+
+ /* Enable Translation Cache Extension */
+ if (cpu_feature_enabled(X86_FEATURE_TCE))
+ msr_set_bit(MSR_EFER, _EFER_TCE);
}
#ifdef CONFIG_X86_32
diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h
index 3ae84c3b8e6d..dc1c1057f26e 100644
--- a/tools/arch/x86/include/asm/msr-index.h
+++ b/tools/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
#define _EFER_SVME 12 /* Enable virtualization */
#define _EFER_LMSLE 13 /* Long Mode Segment Limit Enable */
#define _EFER_FFXSR 14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE 15 /* Enable Translation Cache Extensions */
#define _EFER_AUTOIBRS 21 /* Enable Automatic IBRS */
#define EFER_SCE (1<<_EFER_SCE)
@@ -34,6 +35,7 @@
#define EFER_SVME (1<<_EFER_SVME)
#define EFER_LMSLE (1<<_EFER_LMSLE)
#define EFER_FFXSR (1<<_EFER_FFXSR)
+#define EFER_TCE (1<<_EFER_TCE)
#define EFER_AUTOIBRS (1<<_EFER_AUTOIBRS)
/*
--
2.47.1
* [PATCH v9 12/12] x86/mm: only invalidate final translations with INVLPGB
2025-02-06 4:43 [PATCH v9 00/12] AMD broadcast TLB invalidation Rik van Riel
` (10 preceding siblings ...)
2025-02-06 4:43 ` [PATCH v9 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
@ 2025-02-06 4:43 ` Rik van Riel
2025-02-06 10:16 ` [PATCH v9 00/12] AMD broadcast TLB invalidation Oleksandr Natalenko
2025-02-07 18:23 ` Brendan Jackman
13 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2025-02-06 4:43 UTC (permalink / raw)
To: x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Rik van Riel, Manali Shukla
Use the INVLPGB_FINAL_ONLY flag when invalidating mappings with INVLPGB.
This way only leaf mappings get removed from the TLB, leaving intermediate
translations cached.
On the (rare) occasions when we free page tables, we do a full flush,
ensuring intermediate translations get flushed from the TLB.
Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
---
arch/x86/include/asm/invlpgb.h | 10 ++++++++--
arch/x86/mm/tlb.c | 13 +++++++------
2 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
index 357e3cc417e4..9df559974f78 100644
--- a/arch/x86/include/asm/invlpgb.h
+++ b/arch/x86/include/asm/invlpgb.h
@@ -66,9 +66,15 @@ static inline void invlpgb_flush_user(unsigned long pcid,
static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
unsigned long addr,
u16 nr,
- bool pmd_stride)
+ bool pmd_stride,
+ bool freed_tables)
{
- __invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+ u8 flags = INVLPGB_PCID | INVLPGB_VA;
+
+ if (!freed_tables)
+ flags |= INVLPGB_FINAL_ONLY;
+
+ __invlpgb(0, pcid, addr, nr - 1, pmd_stride, flags);
}
/* Flush all mappings for a given PCID, not including globals. */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4253c3efd7e4..b9aa5ab1b1af 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -498,9 +498,10 @@ static inline void tlbsync(void)
static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
unsigned long addr,
- u16 nr, bool pmd_stride)
+ u16 nr, bool pmd_stride,
+ bool freed_tables)
{
- __invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride);
+ __invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride, freed_tables);
if (!this_cpu_read(cpu_tlbstate.need_tlbsync))
this_cpu_write(cpu_tlbstate.need_tlbsync, true);
}
@@ -549,10 +550,10 @@ static void broadcast_tlb_flush(struct flush_tlb_info *info)
nr = min(maxnr, (info->end - addr) >> info->stride_shift);
nr = max(nr, 1);
- invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+ invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd, info->freed_tables);
/* Do any CPUs supporting INVLPGB need PTI? */
if (static_cpu_has(X86_FEATURE_PTI))
- invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+ invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd, info->freed_tables);
addr += nr << info->stride_shift;
} while (addr < info->end);
@@ -1715,10 +1716,10 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
u16 asid = mm_global_asid(mm);
if (asid) {
- invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
+ invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false, false);
/* Do any CPUs supporting INVLPGB need PTI? */
if (static_cpu_has(X86_FEATURE_PTI))
- invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
+ invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false, false);
/*
* Some CPUs might still be using a local ASID for this
--
2.47.1
* Re: [PATCH v9 00/12] AMD broadcast TLB invalidation
2025-02-06 4:43 [PATCH v9 00/12] AMD broadcast TLB invalidation Rik van Riel
` (11 preceding siblings ...)
2025-02-06 4:43 ` [PATCH v9 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
@ 2025-02-06 10:16 ` Oleksandr Natalenko
2025-02-06 14:16 ` Rik van Riel
2025-02-07 18:23 ` Brendan Jackman
13 siblings, 1 reply; 47+ messages in thread
From: Oleksandr Natalenko @ 2025-02-06 10:16 UTC (permalink / raw)
To: x86, Rik van Riel
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3
Hello.
On Thursday 6 February 2025 at 5:43:19 Central European Standard Time, Rik van Riel wrote:
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
>
> This allows the kernel to invalidate TLB entries on remote CPUs without
> needing to send IPIs, without having to wait for remote CPUs to handle
> those interrupts, and with less interruption to what was running on
> those CPUs.
>
> Because x86 PCID space is limited, and there are some very large
> systems out there, broadcast TLB invalidation is only used for
> processes that are active on 3 or more CPUs, with the threshold
> being gradually increased the more the PCID space gets exhausted.
>
> Combined with the removal of unnecessary lru_add_drain calls
> (see https://lkml.org/lkml/2024/12/19/1388) this results in a
> nice performance boost for the will-it-scale tlb_flush2_threads
> test on an AMD Milan system with 36 cores:
>
> - vanilla kernel: 527k loops/second
> - lru_add_drain removal: 731k loops/second
> - only INVLPGB: 527k loops/second
> - lru_add_drain + INVLPGB: 1157k loops/second
>
> Profiling with only the INVLPGB changes showed while
> TLB invalidation went down from 40% of the total CPU
> time to only around 4% of CPU time, the contention
> simply moved to the LRU lock.
>
> Fixing both at the same time about doubles the
> number of iterations per second from this case.
>
> Some numbers closer to real world performance
> can be found at Phoronix, thanks to Michael:
>
> https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits
>
> My current plan is to implement support for Intel's RAR
> (Remote Action Request) TLB flushing in a follow-up series,
> after this thing has been merged into -tip. Making things
> any larger would just be unwieldy for reviewers.
>
> v9:
> - print warning when start or end address was rounded (Peter)
OK, I've just hit one:
TLB flush not stride 200000 aligned. Start 7fffc0000000, end 7fffffe01000
WARNING: CPU: 31 PID: 411 at arch/x86/mm/tlb.c:1342 flush_tlb_mm_range+0x57b/0x600
Modules linked in:
CPU: 31 UID: 0 PID: 411 Comm: modprobe Not tainted 6.13.0-pf3 #1 1366679ca06f46d05d1e9d9c537b0c6b4c922b82
Hardware name: ASUS System Product Name/Pro WS X570-ACE, BIOS 4902 08/29/2024
RIP: 0010:flush_tlb_mm_range+0x57b/0x600
Code: 5f e9 39 b3 3f 00 e8 24 57 f5 ff e9 e9 fc ff ff 48 8b 0c 24 4c 89 e2 48 c7 c7 78 59 27 b0 c6 05 3d 1a 31 02 01 e8 85 e4 01 00 <0f> 0b e9 35 fb ff ff fa 0f 1f 44 00 00 48 89 df e8 a0 f4 ff ff fb
RSP: 0018:ffffc137c11e7a38 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff9e6eaf1b5d80 RCX: 00000000ffffdfff
RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000001
RBP: ffff9e500244d800 R08: 00000000ffffdfff R09: ffff9e6eae1fffa8
R10: 00000000ffffdfff R11: 0000000000000003 R12: 00007fffc0000000
R13: 000000000000001f R14: 0000000000000015 R15: ffff9e6eaf180000
FS: 0000000000000000(0000) GS:ffff9e6eaf180000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000109966000 CR4: 0000000000f50ef0
PKRU: 55555554
Call Trace:
<TASK>
tlb_flush_mmu+0x125/0x1a0
tlb_finish_mmu+0x41/0x80
relocate_vma_down+0x183/0x200
setup_arg_pages+0x201/0x390
load_elf_binary+0x3a7/0x17d0
bprm_execve+0x244/0x630
kernel_execve+0x180/0x1f0
call_usermodehelper_exec_async+0xd0/0x190
ret_from_fork+0x34/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>
What do I do with it?
Thank you.
> - in the reclaim code, tlbsync at context switch time (Peter)
> - fix !CONFIG_CPU_SUP_AMD compile error in arch_tlbbatch_add_pending (Jan)
> v8:
> - round start & end to handle non-page-aligned callers (Steven & Jan)
> - fix up changelog & add tested-by tags (Manali)
> v7:
> - a few small code cleanups (Nadav)
> - fix spurious VM_WARN_ON_ONCE in mm_global_asid
> - code simplifications & better barriers (Peter & Dave)
> v6:
> - fix info->end check in flush_tlb_kernel_range (Michael)
> - disable broadcast TLB flushing on 32 bit x86
> v5:
> - use byte assembly for compatibility with older toolchains (Borislav, Michael)
> - ensure a panic on an invalid number of extra pages (Dave, Tom)
> - add cant_migrate() assertion to tlbsync (Jann)
> - a bunch more cleanups (Nadav)
> - key TCE enabling off X86_FEATURE_TCE (Andrew)
> - fix a race between reclaim and ASID transition (Jann)
> v4:
> - Use only bitmaps to track free global ASIDs (Nadav)
> - Improved AMD initialization (Borislav & Tom)
> - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
> - Fixes for subtle race conditions (Jann)
> v3:
> - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
> - More suggested cleanups and changelog fixes by Peter and Nadav
> v2:
> - Apply suggestions by Peter and Borislav (thank you!)
> - Fix bug in arch_tlbbatch_flush, where we need to do both
> the TLBSYNC, and flush the CPUs that are in the cpumask.
> - Some updates to comments and changelogs based on questions.
--
Oleksandr Natalenko, MSE
* Re: [PATCH v9 00/12] AMD broadcast TLB invalidation
2025-02-06 10:16 ` [PATCH v9 00/12] AMD broadcast TLB invalidation Oleksandr Natalenko
@ 2025-02-06 14:16 ` Rik van Riel
2025-02-06 14:23 ` Peter Zijlstra
0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2025-02-06 14:16 UTC (permalink / raw)
To: Oleksandr Natalenko, x86
Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3
On Thu, 2025-02-06 at 11:16 +0100, Oleksandr Natalenko wrote:
> Hello.
>
> On Thursday 6 February 2025 at 5:43:19 Central European Standard Time, Rik van Riel wrote:
> >
> > v9:
> > - print warning when start or end address was rounded (Peter)
>
> OK, I've just hit one:
>
> TLB flush not stride 200000 aligned. Start 7fffc0000000, end
> 7fffffe01000
Beautiful, the caller wants to zap 2MB pages, but
the end address is 4kB aligned.
> WARNING: CPU: 31 PID: 411 at arch/x86/mm/tlb.c:1342
> flush_tlb_mm_range+0x57b/0x600
> Modules linked in:
> CPU: 31 UID: 0 PID: 411 Comm: modprobe Not tainted 6.13.0-pf3 #1
> 1366679ca06f46d05d1e9d9c537b0c6b4c922b82
> Hardware name: ASUS System Product Name/Pro WS X570-ACE, BIOS 4902
> 08/29/2024
> RIP: 0010:flush_tlb_mm_range+0x57b/0x600
> Code: 5f e9 39 b3 3f 00 e8 24 57 f5 ff e9 e9 fc ff ff 48 8b 0c 24 4c
> 89 e2 48 c7 c7 78 59 27 b0 c6 05 3d 1a 31 02 01 e8 85 e4 01 00 <0f>
> 0b e9 35 fb ff ff fa 0f 1f 44 00 00 48 89 df e8 a0 f4 ff ff fb
> RSP: 0018:ffffc137c11e7a38 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: ffff9e6eaf1b5d80 RCX: 00000000ffffdfff
> RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000001
> RBP: ffff9e500244d800 R08: 00000000ffffdfff R09: ffff9e6eae1fffa8
> R10: 00000000ffffdfff R11: 0000000000000003 R12: 00007fffc0000000
> R13: 000000000000001f R14: 0000000000000015 R15: ffff9e6eaf180000
> FS: 0000000000000000(0000) GS:ffff9e6eaf180000(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 0000000109966000 CR4: 0000000000f50ef0
> PKRU: 55555554
> Call Trace:
> <TASK>
> tlb_flush_mmu+0x125/0x1a0
> tlb_finish_mmu+0x41/0x80
> relocate_vma_down+0x183/0x200
> setup_arg_pages+0x201/0x390
> load_elf_binary+0x3a7/0x17d0
> bprm_execve+0x244/0x630
> kernel_execve+0x180/0x1f0
> call_usermodehelper_exec_async+0xd0/0x190
> ret_from_fork+0x34/0x50
> ret_from_fork_asm+0x1a/0x30
> </TASK>
>
> What do I do with it?
Reporting it is the right thing. Let me dig into what
setup_arg_pages and relocate_vma_down are doing to
come up with a 2MB page size area where the end is
not 2MB aligned.
Reading through the relocate_vma_down code, and the
free_pgd/p4d/pud/pmd_range code, it looks like that
code always adds PAGE_SIZE to the address being zapped,
even when zapping things at a larger granularity.
On the flip side, the code in relocate_vma_down and
free_pgd_range correctly sets the TLB page size to
the 4kB PAGE_SIZE.
It looks like setting the stride_shift to something
larger is done transparently by the x86 tlb_flush()
implementation, specifically by tlb_get_unmap_shift(),
which looks at which page table level got freed to
determine what stride shift to use.
This can result in flush_tlb_mm_range being called
with a stride_shift for 2MB pages, but a range ending
on a 4kB aligned (not 2MB aligned) boundary.
Peter, how should we solve this one?
Should tlb_flush() round the start & end addresses
to match the found stride_shift?
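Something along these lines, maybe? Untested sketch, and assuming the
x86 tlb_flush() still looks roughly like I remember it:

static inline void tlb_flush(struct mmu_gather *tlb)
{
	unsigned long start = 0UL, end = TLB_FLUSH_ALL;
	unsigned int stride_shift = tlb_get_unmap_shift(tlb);

	if (!tlb->fullmm && !tlb->need_flush_all) {
		/* Round to the stride picked by tlb_get_unmap_shift(). */
		start = round_down(tlb->start, 1UL << stride_shift);
		end = round_up(tlb->end, 1UL << stride_shift);
	}

	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
}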
--
All Rights Reversed.
* Re: [PATCH v9 00/12] AMD broadcast TLB invalidation
2025-02-06 14:16 ` Rik van Riel
@ 2025-02-06 14:23 ` Peter Zijlstra
2025-02-06 14:48 ` Rik van Riel
0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-06 14:23 UTC (permalink / raw)
To: Rik van Riel
Cc: Oleksandr Natalenko, x86, linux-kernel, bp, dave.hansen,
zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
akpm, jannh, mhklinux, andrew.cooper3
On Thu, Feb 06, 2025 at 09:16:35AM -0500, Rik van Riel wrote:
> This can result in flush_tlb_mm_range being called
> with a stride_shift for 2MB pages, but a range ending
> on a 4kB aligned (not 2MB aligned) boundary.
>
> Peter, how should we solve this one?
I don't think that's wrong per se, since all we really need is for end
to be past the end; one byte, one page, or one stride doesn't matter much.
Anyway, I'm in desperate need of a break, so I'm not quite sure what the
best way forward is.
* Re: [PATCH v9 00/12] AMD broadcast TLB invalidation
2025-02-06 14:23 ` Peter Zijlstra
@ 2025-02-06 14:48 ` Rik van Riel
2025-02-07 8:16 ` Peter Zijlstra
0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2025-02-06 14:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Oleksandr Natalenko, x86, linux-kernel, bp, dave.hansen,
zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
akpm, jannh, mhklinux, andrew.cooper3
On Thu, 2025-02-06 at 15:23 +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2025 at 09:16:35AM -0500, Rik van Riel wrote:
>
> > This can result in flush_tlb_mm_range being called
> > with a stride_shift for 2MB pages, but a range ending
> > on a 4kB aligned (not 2MB aligned) boundary.
> >
> > Peter, how should we solve this one?
>
> I don't think that's wrong per se, since all we really need is for end
> to be past the end; one byte, one page, or one stride doesn't matter
> much.
>
> Anyway, I'm in desperate need of a break, so I'm not quite sure what
> the
> best way forward is.
>
Given that the tlb_flush() code is used only for
page table freeing, we can just round up the
end address to the nearest stride boundary
there, with a comment explaining why?
Alternatively, we could just change the page
sizes used in pmd_free_tlb, pud_free_tlb,
and p4d_free_tlb, given that the functions
called now have parameters they didn't seem
to have back in 2014, when the linked email
in the comment was written?
Either way, no big hurry. The rounding up
is already being done in get_flush_tlb_info,
so all we have is a noisy WARN_ONCE :)
--
All Rights Reversed.
* Re: [PATCH v9 00/12] AMD broadcast TLB invalidation
2025-02-06 14:48 ` Rik van Riel
@ 2025-02-07 8:16 ` Peter Zijlstra
2025-02-07 17:46 ` Rik van Riel
0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-07 8:16 UTC (permalink / raw)
To: Rik van Riel
Cc: Oleksandr Natalenko, x86, linux-kernel, bp, dave.hansen,
zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
akpm, jannh, mhklinux, andrew.cooper3
On Thu, Feb 06, 2025 at 09:48:25AM -0500, Rik van Riel wrote:
> On Thu, 2025-02-06 at 15:23 +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2025 at 09:16:35AM -0500, Rik van Riel wrote:
> >
> > > This can result in flush_tlb_mm_range being called
> > > with a stride_shift for 2MB pages, but a range ending
> > > on a 4kB aligned (not 2MB aligned) boundary.
> > >
> > > Peter, how should we solve this one?
> >
> > I don't think that's wrong per se, since all we really need is for end
> > to be past the end; one byte, one page, or one stride doesn't matter
> > much.
> >
> > Anyway, I'm in desperate need of a break, so I'm not quite sure what
> > the
> > best way forward is.
> >
> Given that the tlb_flush() code is used only for
> page table freeing,
mmu_gather is used for both page and page-table freeing.
> we can just round up the
> end address to the nearest stride boundary
> there, with a comment explaining why?
Well, why are we rounding at all? I don't think I've seen an explanation
for that anywhere yet.
What made you do this?
> Alternatively, we could just change the page
> sizes used in pmd_free_tlb, pud_free_tlb,
> and p4d_free_tlb, given that the functions
> called now have parameters they didn't seem
> to have back in 2014, when the linked email
> in the comment was written?
Well, it really shouldn't matter.
Notably, stride is simply the smallest size encountered during the
gather, not something that can be determined a priori.
Like I said, end is exclusive, so it doesn't really matter how much
further it is, and PAGE_SIZE is a natural enough step, seeing it is the
smallest granularity.
So start and end not being page aligned is a definite fail.
And start should be stride aligned.
But other than that, I don't see it matters much.
* Re: [PATCH v9 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2025-02-06 4:43 ` [PATCH v9 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
@ 2025-02-07 14:28 ` Brendan Jackman
2025-02-11 11:07 ` Peter Zijlstra
0 siblings, 1 reply; 47+ messages in thread
From: Brendan Jackman @ 2025-02-07 14:28 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
Hi Rik,
On Thu, 6 Feb 2025 at 05:45, Rik van Riel <riel@surriel.com> wrote:
> This is done because some paravirt TLB flush implementations
> handle the TLB flush in the hypervisor, and will do the flush
> even when the target CPU has interrupts disabled.
>
> Always handle page table freeing with MMU_GATHER_RCU_TABLE_FREE.
> Using RCU synchronization between page table freeing and get_user_pages_fast()
> allows bare metal to also do TLB flushing while interrupts are disabled.
>
> Various places in the mm do still block IRQs or disable preemption
> as an implicit way to block RCU frees.
>
> That makes it safe to use INVLPGB on AMD CPUs.
It would be nice to update the CONFIG_MMU_GATHER_RCU_TABLE_FREE
comment in mm/mmu_gather.c to mention INVLPGB alongside "Architectures
that do not have this (PPC)" - and while that's being updated it would
also be useful to note down the paravirt thing you explained above.
IMO it's pretty helpful to have more examples of the concrete use cases
for this logic.
* Re: [PATCH v9 03/12] x86/mm: consolidate full flush threshold decision
2025-02-06 4:43 ` [PATCH v9 03/12] x86/mm: consolidate full flush threshold decision Rik van Riel
@ 2025-02-07 14:50 ` Brendan Jackman
2025-02-07 20:22 ` Rik van Riel
2025-02-10 19:12 ` Rik van Riel
0 siblings, 2 replies; 47+ messages in thread
From: Brendan Jackman @ 2025-02-07 14:50 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Dave Hansen
On Thu, 6 Feb 2025 at 05:45, Rik van Riel <riel@surriel.com> wrote:
> @@ -1276,7 +1282,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>
> int cpu = get_cpu();
>
> - info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
> + info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, PAGE_SHIFT, false,
> TLB_GENERATION_INVALID);
[Why] do we need this change? If it's necessary here, why isn't it
needed everywhere else that does TLB_FLUSH_ALL too, like
flush_tlb_mm()?
* Re: [PATCH v9 04/12] x86/mm: get INVLPGB count max from CPUID
2025-02-06 4:43 ` [PATCH v9 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
@ 2025-02-07 15:10 ` Brendan Jackman
2025-02-07 17:34 ` Brendan Jackman
2025-02-10 7:30 ` Vern Hao
1 sibling, 1 reply; 47+ messages in thread
From: Brendan Jackman @ 2025-02-07 15:10 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Thu, 6 Feb 2025 at 05:45, Rik van Riel <riel@surriel.com> wrote:
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 17b6590748c0..f9b832e971c5 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -338,6 +338,7 @@
> #define X86_FEATURE_CLZERO (13*32+ 0) /* "clzero" CLZERO instruction */
> #define X86_FEATURE_IRPERF (13*32+ 1) /* "irperf" Instructions Retired Count */
> #define X86_FEATURE_XSAVEERPTR (13*32+ 2) /* "xsaveerptr" Always save/restore FP error pointers */
> +#define X86_FEATURE_INVLPGB (13*32+ 3) /* INVLPGB and TLBSYNC instruction supported. */
Why no "invlpgb" here? Seems like having this flag visible in cpuinfo
would be worthwhile.
If there's a reason to hide it maybe add a comment to explain the
reason? Sorry if this is a stupid question - I also can't see an
obvious rationale for why existing flags do or don't get a name at
runtime.
* Re: [PATCH v9 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2025-02-06 4:43 ` [PATCH v9 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
@ 2025-02-07 16:03 ` Brendan Jackman
2025-02-07 20:50 ` Rik van Riel
2025-02-11 2:01 ` Rik van Riel
0 siblings, 2 replies; 47+ messages in thread
From: Brendan Jackman @ 2025-02-07 16:03 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Thu, 6 Feb 2025 at 05:45, Rik van Riel <riel@surriel.com> wrote:
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 36939b104561..227e972b6fbc 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1086,6 +1086,30 @@ void flush_tlb_all(void)
> on_each_cpu(do_flush_tlb_all, NULL, 1);
> }
>
> +static bool broadcast_kernel_range_flush(struct flush_tlb_info *info)
> +{
> + unsigned long addr;
> + unsigned long nr;
> +
> + if (!IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH))
> + return false;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> + return false;
I think that first conditional can be shunted off into the header
diff --git a/arch/x86/include/asm/disabled-features.h
b/arch/x86/include/asm/disabled-features.h
index c492bdc97b05..61376b4e4fa7 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -129,6 +129,12 @@
#define DISABLE_SEV_SNP (1 << (X86_FEATURE_SEV_SNP & 31))
#endif
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+#define DISABLE_INVLPGB 0
+#else
+#define DISABLE_INVLPGB (1 << (X86_FEATURE_INVLPGB & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -146,7 +152,7 @@
#define DISABLED_MASK11
(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
#define DISABLED_MASK12 (DISABLE_FRED|DISABLE_LAM)
-#define DISABLED_MASK13 0
+#define DISABLED_MASK13 (DISABLE_INVLPGB)
#define DISABLED_MASK14 0
#define DISABLED_MASK15 0
#define DISABLED_MASK16
(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 227e972b6fbc..b8a8665359a2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1091,9 +1091,6 @@ static bool broadcast_kernel_range_flush(struct
flush_tlb_info *info)
unsigned long addr;
unsigned long nr;
- if (!IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH))
- return false;
-
if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
return false;
With !CPU_SUP_AMD and the above, broadcast_kernel_range_flush
disappears from tlb.o. (Caveat - I didn't actually read the disasm; I
just made it noinline and checked that the call disappeared).
It's actually more lines of code but now they're off in a boilerplate
header and it's consistent with the other flags that do this.
* Re: [PATCH v9 04/12] x86/mm: get INVLPGB count max from CPUID
2025-02-07 15:10 ` Brendan Jackman
@ 2025-02-07 17:34 ` Brendan Jackman
0 siblings, 0 replies; 47+ messages in thread
From: Brendan Jackman @ 2025-02-07 17:34 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
Oh, sorry
On Fri, 7 Feb 2025 at 16:10, Brendan Jackman <jackmanb@google.com> wrote:
>
> On Thu, 6 Feb 2025 at 05:45, Rik van Riel <riel@surriel.com> wrote:
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index 17b6590748c0..f9b832e971c5 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -338,6 +338,7 @@
> > #define X86_FEATURE_CLZERO (13*32+ 0) /* "clzero" CLZERO instruction */
> > #define X86_FEATURE_IRPERF (13*32+ 1) /* "irperf" Instructions Retired Count */
> > #define X86_FEATURE_XSAVEERPTR (13*32+ 2) /* "xsaveerptr" Always save/restore FP error pointers */
> > +#define X86_FEATURE_INVLPGB (13*32+ 3) /* INVLPGB and TLBSYNC instruction supported. */
>
> Why no "invlpgb" here? Seems like having this flag visible in cpuinfo
> would be worthwhile.
>
> If there's a reason to hide it maybe add a comment to explain the
> reason? Sorry if this is a stupid question - I also can't see an
> obvious rationale for why existing flags do or don't get a name at
> runtime.
Oh, found it:
https://lore.kernel.org/lkml/20250102120450.GNZ3aA4oVPnoJYRVUL@fat_crate.local/
Sorry for the noise, please ignore.
* Re: [PATCH v9 00/12] AMD broadcast TLB invalidation
2025-02-07 8:16 ` Peter Zijlstra
@ 2025-02-07 17:46 ` Rik van Riel
0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2025-02-07 17:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Oleksandr Natalenko, x86, linux-kernel, bp, dave.hansen,
zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
akpm, jannh, mhklinux, andrew.cooper3
On Fri, 2025-02-07 at 09:16 +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2025 at 09:48:25AM -0500, Rik van Riel wrote:
>
>
> > we can just round up the
> > end address to the nearest stride boundary
> > there, with a comment explaining why?
>
> Well, why are we rounding at all? I don't think I've seen an
> explanation
> for that anywhere yet.
>
> What made you do this?
The only real reason is to not end up with a 0
value for nr in these loops:
	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
		nr = min((info->end - addr) >> PAGE_SHIFT, invlpgb_count_max);
		invlpgb_flush_addr_nosync(addr, nr);
	}
Which makes me think we could just insert a
"nr = max(nr, 1)" in there, and round up
partial pages that way?
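I.e. something like this (untested):

	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
		nr = min((info->end - addr) >> PAGE_SHIFT, invlpgb_count_max);
		/* Round a partial trailing page up to one invalidation. */
		nr = max(nr, 1UL);
		invlpgb_flush_addr_nosync(addr, nr);
	}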
Looking further, I already did that on the
userspace flushing side. Let me do the same
thing on the kernel side, and remove the
other code related to partial pages being
passed in.
--
All Rights Reversed.
* Re: [PATCH v9 00/12] AMD broadcast TLB invalidation
2025-02-06 4:43 [PATCH v9 00/12] AMD broadcast TLB invalidation Rik van Riel
` (12 preceding siblings ...)
2025-02-06 10:16 ` [PATCH v9 00/12] AMD broadcast TLB invalidation Oleksandr Natalenko
@ 2025-02-07 18:23 ` Brendan Jackman
13 siblings, 0 replies; 47+ messages in thread
From: Brendan Jackman @ 2025-02-07 18:23 UTC (permalink / raw)
To: riel
Cc: akpm, andrew.cooper3, bp, dave.hansen, jannh, kernel-team,
linux-kernel, linux-mm, mhklinux, nadav.amit, peterz,
thomas.lendacky, x86, zhengqi.arch, Brendan Jackman
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
I applied this to 60675d4ca1ef0 ("Merge branch 'linus' into x86/mm, to pick up
fixes"), booted it and ran some selftests on a Milan and a Genoa machine, no
issues, although I didn't have CONFIG_DEBUG_VM or anything like that. I did see
the warning reported by Oleksandr [0] on QEMU.
I tried to test Turin but there's some unrelated brokenness. Due to other
unrelated issues I also couldn't test with lockdep.
Anyway I guess that counts as:
Tested-by: Brendan Jackman <jackmanb@google.com>
I couldn't get any performance data either, more unrelated brokenness. Maybe I
can try again next week.
I posted some bikeshed review comments but I haven't grokked the interesting
bits yet, will try to do that next week too.
[0] https://lore.kernel.org/linux-kernel/12602226.O9o76ZdvQC@natalenko.name/
Cheers,
Brendan
* Re: [PATCH v9 03/12] x86/mm: consolidate full flush threshold decision
2025-02-07 14:50 ` Brendan Jackman
@ 2025-02-07 20:22 ` Rik van Riel
2025-02-10 11:15 ` Brendan Jackman
2025-02-10 19:12 ` Rik van Riel
1 sibling, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2025-02-07 20:22 UTC (permalink / raw)
To: Brendan Jackman
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Dave Hansen
On Fri, 2025-02-07 at 15:50 +0100, Brendan Jackman wrote:
> On Thu, 6 Feb 2025 at 05:45, Rik van Riel <riel@surriel.com> wrote:
> > @@ -1276,7 +1282,7 @@ void arch_tlbbatch_flush(struct
> > arch_tlbflush_unmap_batch *batch)
> >
> > int cpu = get_cpu();
> >
> > - info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
> > + info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL,
> > PAGE_SHIFT, false,
> > TLB_GENERATION_INVALID);
>
> [Why] do we need this change? If it's necessary here, why isn't it
> needed everywhere else that does TLB_FLUSH_ALL too, like
> flush_tlb_mm()?
>
flush_tlb_mm() calls flush_tlb_mm_range(), which
does also use get_flush_tlb_info().
We pass in PAGE_SHIFT here to ensure that the
stride shift is specified correctly to the
INVLPGB instruction later on.
--
All Rights Reversed.
* Re: [PATCH v9 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2025-02-07 16:03 ` Brendan Jackman
@ 2025-02-07 20:50 ` Rik van Riel
2025-02-10 11:22 ` Brendan Jackman
2025-02-11 2:01 ` Rik van Riel
1 sibling, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2025-02-07 20:50 UTC (permalink / raw)
To: Brendan Jackman
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Fri, 2025-02-07 at 17:03 +0100, Brendan Jackman wrote:
> With !CPU_SUP_AMD and the above, broadcast_kernel_range_flush
> disappears from tlb.o. (Caveat - I didn't actually read the disasm; I
> just made it noinline and checked that the call disappeared).
>
> It's actually more lines of code but now they're off in a boilerplate
> header and it's consistent with the other flags that do this.
>
What compiler did you use?
While I like the cleanup in principle, I
don't want to saddle people with older
compilers with extra code they don't need.
--
All Rights Reversed.
* Re: [PATCH v9 04/12] x86/mm: get INVLPGB count max from CPUID
2025-02-06 4:43 ` [PATCH v9 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
2025-02-07 15:10 ` Brendan Jackman
@ 2025-02-10 7:30 ` Vern Hao
2025-02-10 16:48 ` Rik van Riel
1 sibling, 1 reply; 47+ messages in thread
From: Vern Hao @ 2025-02-10 7:30 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
I did some tests on my machine with an AMD EPYC 7K83. These patches work on
my host, but fail in my guest under qemu.
On the host, lscpu shows invlpgb in the flags, but in the guest it does not.
So do you plan to support it in the guest?
Best Regards!
Thanks
On Thu, 6 Feb 2025 at 12:45, Rik van Riel <riel@surriel.com> wrote:
> The CPU advertises the maximum number of pages that can be shot down
> with one INVLPGB instruction in the CPUID data.
>
> Save that information for later use.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Manali Shukla <Manali.Shukla@amd.com>
> ---
> arch/x86/Kconfig.cpu | 5 +++++
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/include/asm/tlbflush.h | 7 +++++++
> arch/x86/kernel/cpu/amd.c | 8 ++++++++
> 4 files changed, 21 insertions(+)
>
> diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
> index 2a7279d80460..abe013a1b076 100644
> --- a/arch/x86/Kconfig.cpu
> +++ b/arch/x86/Kconfig.cpu
> @@ -395,6 +395,10 @@ config X86_VMX_FEATURE_NAMES
> def_bool y
> depends on IA32_FEAT_CTL
>
> +config X86_BROADCAST_TLB_FLUSH
> + def_bool y
> + depends on CPU_SUP_AMD && 64BIT
> +
> menuconfig PROCESSOR_SELECT
> bool "Supported processor vendors" if EXPERT
> help
> @@ -431,6 +435,7 @@ config CPU_SUP_CYRIX_32
> config CPU_SUP_AMD
> default y
> bool "Support AMD processors" if PROCESSOR_SELECT
> + select X86_BROADCAST_TLB_FLUSH
> help
> This enables detection, tunings and quirks for AMD processors
>
> diff --git a/arch/x86/include/asm/cpufeatures.h
> b/arch/x86/include/asm/cpufeatures.h
> index 17b6590748c0..f9b832e971c5 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -338,6 +338,7 @@
> #define X86_FEATURE_CLZERO (13*32+ 0) /* "clzero" CLZERO
> instruction */
> #define X86_FEATURE_IRPERF (13*32+ 1) /* "irperf"
> Instructions Retired Count */
> #define X86_FEATURE_XSAVEERPTR (13*32+ 2) /* "xsaveerptr" Always
> save/restore FP error pointers */
> +#define X86_FEATURE_INVLPGB (13*32+ 3) /* INVLPGB and TLBSYNC
> instruction supported. */
> #define X86_FEATURE_RDPRU (13*32+ 4) /* "rdpru" Read
> processor register at user level */
> #define X86_FEATURE_WBNOINVD (13*32+ 9) /* "wbnoinvd" WBNOINVD
> instruction */
> #define X86_FEATURE_AMD_IBPB (13*32+12) /* Indirect Branch
> Prediction Barrier */
> diff --git a/arch/x86/include/asm/tlbflush.h
> b/arch/x86/include/asm/tlbflush.h
> index 02fc2aa06e9e..8fe3b2dda507 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -183,6 +183,13 @@ static inline void cr4_init_shadow(void)
> extern unsigned long mmu_cr4_features;
> extern u32 *trampoline_cr4_features;
>
> +/* How many pages can we invalidate with one INVLPGB. */
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +extern u16 invlpgb_count_max;
> +#else
> +#define invlpgb_count_max 1
> +#endif
> +
> extern void initialize_tlbstate_and_flush(void);
>
> /*
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index 79d2e17f6582..bcf73775b4f8 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -29,6 +29,8 @@
>
> #include "cpu.h"
>
> +u16 invlpgb_count_max __ro_after_init;
> +
> static inline int rdmsrl_amd_safe(unsigned msr, unsigned long long *p)
> {
> u32 gprs[8] = { 0 };
> @@ -1135,6 +1137,12 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86
> *c)
> tlb_lli_2m[ENTRIES] = eax & mask;
>
> tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
> +
> + /* Max number of pages INVLPGB can invalidate in one shot */
> + if (boot_cpu_has(X86_FEATURE_INVLPGB)) {
> + cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
> + invlpgb_count_max = (edx & 0xffff) + 1;
> + }
> }
>
> static const struct cpu_dev amd_cpu_dev = {
> --
> 2.47.1
>
>
>
* Re: [PATCH v9 03/12] x86/mm: consolidate full flush threshold decision
2025-02-07 20:22 ` Rik van Riel
@ 2025-02-10 11:15 ` Brendan Jackman
0 siblings, 0 replies; 47+ messages in thread
From: Brendan Jackman @ 2025-02-10 11:15 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Dave Hansen
On Fri, 7 Feb 2025 at 21:25, Rik van Riel <riel@surriel.com> wrote:
>
> On Fri, 2025-02-07 at 15:50 +0100, Brendan Jackman wrote:
> > On Thu, 6 Feb 2025 at 05:45, Rik van Riel <riel@surriel.com> wrote:
> > > @@ -1276,7 +1282,7 @@ void arch_tlbbatch_flush(struct
> > > arch_tlbflush_unmap_batch *batch)
> > >
> > > int cpu = get_cpu();
> > >
> > > - info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
> > > + info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL,
> > > PAGE_SHIFT, false,
> > > TLB_GENERATION_INVALID);
> >
> > [Why] do we need this change? If it's necessary here, why isn't it
> > needed everywhere else that does TLB_FLUSH_ALL too, like
> > flush_tlb_mm()?
> >
> flush_tlb_mm() calls flush_tlb_mm_range(), which
> does also use get_flush_tlb_info().
>
> We pass in PAGE_SHIFT here to ensure that the
> stride shift is specified correctly to the
> INVLPGB instruction later on.
Hm I don't follow your answer here so lemme just state my current
understanding more verbosely...
flush_tlb_mm() indirectly calls get_flush_tlb_info() with
end=TLB_FLUSH_ALL and stride_shift=0. That's an invalid stride shift,
but this doesn't break anything because with TLB_FLUSH_ALL the stride
shift is never used (except to check for values above PMD_SHIFT).
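(For reference, unless I'm misreading tlbflush.h, flush_tlb_mm() is just

#define flush_tlb_mm(mm)						\
		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true)

so the 0UL stride_shift comes straight from there.)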
So here I think passing stride_shift=0 would be fine too. Concretely,
here we specifically call invlpgb_flush_all_nonglobals() which doesn't
care about stride_shift, and flush_tlb_func() is the same as before
this patchset in this respect.
I would be quite happy with removing the "it's invalid but it doesn't
matter" thing everywhere but this seems to be localised and I think
the issue is orthogonal to the presence of invlpgb.
* Re: [PATCH v9 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2025-02-07 20:50 ` Rik van Riel
@ 2025-02-10 11:22 ` Brendan Jackman
0 siblings, 0 replies; 47+ messages in thread
From: Brendan Jackman @ 2025-02-10 11:22 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Fri, 7 Feb 2025 at 21:51, Rik van Riel <riel@surriel.com> wrote:
>
> On Fri, 2025-02-07 at 17:03 +0100, Brendan Jackman wrote:
> > With !CPU_SUP_AMD and the above, broadcast_kernel_range_flush
> > disappears from tlb.o. (Caveat - I didn't actually read the disasm; I
> > just made it noinline and checked that the call disappeared).
> >
> > It's actually more lines of code but now they're off in a boilerplate
> > header and it's consistent with the other flags that do this.
> >
> What compiler did you use?
>
> While I like the cleanup in principle, I
> don't want to saddle people with older
> compilers with extra code they don't need.
I used a pretty fresh Clang but I'd be very surprised if it needs a
fancy compiler. Compared to
if (IS_ENABLED(CONFIG_FOO))
I think all we have with the disabled-features.h magic is
- An extra __builtin_constant_p - I did a quick search and I can find
GCC release notes referring to this at least back to 4.7 (2012) [0].
Note also this doesn't create any code.
- An extra bit of constant folding to turn the (x & y) into
true/false. This seems like something compilers have been good at for
a long time. And if someone's happy with a compiler so old that it
can't do this, I dunno but they probably don't mind a few extra
instructions.
[0] https://gcc.gnu.org/gcc-4.7/changes.html
* Re: [PATCH v9 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2025-02-06 4:43 ` [PATCH v9 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
@ 2025-02-10 14:15 ` Brendan Jackman
2025-02-11 3:07 ` Rik van Riel
0 siblings, 1 reply; 47+ messages in thread
From: Brendan Jackman @ 2025-02-10 14:15 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Thu, 6 Feb 2025 at 05:47, Rik van Riel <riel@surriel.com> wrote:
> +static u16 get_global_asid(void)
> +{
> +
> + u16 asid;
> +
> + lockdep_assert_held(&global_asid_lock);
> +
> + /* The previous allocated ASID is at the top of the address space. */
> + if (last_global_asid >= MAX_ASID_AVAILABLE - 1)
> + reset_global_asid_space();
> +
> + asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, last_global_asid);
> +
> + if (asid >= MAX_ASID_AVAILABLE) {
> + /* This should never happen. */
> + VM_WARN_ONCE(1, "Unable to allocate global ASID despite %d available\n", global_asid_available);
If you'll forgive the nitpicking, please put the last arg on a new
line or otherwise break this up; the rest of this file stays below 100
chars (this is 113).
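E.g. something like:

		VM_WARN_ONCE(1, "Unable to allocate global ASID despite %d available\n",
			     global_asid_available);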
> + return 0;
> + }
> +
> + /* Claim this global ASID. */
> + __set_bit(asid, global_asid_used);
> + last_global_asid = asid;
> + global_asid_available--;
> + return asid;
> +}
> +
> +/*
> + * Returns true if the mm is transitioning from a CPU-local ASID to a global
> + * (INVLPGB) ASID, or the other way around.
> + */
> +static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> + u16 global_asid = mm_global_asid(next);
> +
> + if (global_asid && prev_asid != global_asid)
> + return true;
> +
> + if (!global_asid && is_global_asid(prev_asid))
> + return true;
I think this needs clarification around when switches from
global->nonglobal happen. Maybe commentary or maybe there's a way to
just express the code that makes it obvious. Here's what I currently
understand, please correct me if I'm wrong:
- Once a process gets a global ASID it keeps it forever. So within a
process we never switch global->nonglobal.
- In flush_tlb_func() we are just calling this to check if the process
has just been given a global ASID - there's no way loaded_mm_asid can
be global yet !mm_global_asid(loaded_mm).
- When we call this from switch_mm_irqs_off() we are in the prev==next
case. Is there something about lazy TLB that can cause the case above
to happen here?
> +static bool meets_global_asid_threshold(struct mm_struct *mm)
> +{
> + if (!global_asid_available)
I think we need READ_ONCE here.
Also - this doesn't really make sense in this function as it's currently named.
I think we could just inline this whole function into
consider_global_asid(), it would still be nice and readable IMO.
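Something like this is what I had in mind, as a sketch:

	/* Lockless read of global_asid_available, hence READ_ONCE(). */
	if (!READ_ONCE(global_asid_available))
		return false;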
> @@ -786,6 +1101,8 @@ static void flush_tlb_func(void *info)
> return;
> }
>
> + local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
> +
> if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
> f->new_tlb_gen <= local_tlb_gen)) {
> /*
> @@ -953,7 +1270,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
> * up on the new contents of what used to be page tables, while
> * doing a speculative memory access.
> */
> - if (info->freed_tables)
> + if (info->freed_tables || in_asid_transition(info->mm))
> on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
> else
> on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
> @@ -1058,9 +1375,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> * a local TLB flush is needed. Optimize this use-case by calling
> * flush_tlb_func_local() directly in this case.
> */
> - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> + if (mm_global_asid(mm)) {
> + broadcast_tlb_flush(info);
> + } else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> info->trim_cpumask = should_trim_cpumask(mm);
> flush_tlb_multi(mm_cpumask(mm), info);
> + consider_global_asid(mm);
Why do we do this here instead of when the CPU enters the mm? Is the
idea that in combination with the jiffies thing in
consider_global_asid() we get a probability of getting a global ASID
(within some time period) that scales with the amount of TLB flushing
the process does? So then we avoid using up ASID space on processes
that are multithreaded but just sit around with stable VMAs etc?
I guess another reason would be in the bizarre case that we ran out of
global ASIDs and then entered one big workload that effectively owns
all the CPUs, that big workload can still get a global ASID later once
the old processes free them up, even if we enter it before
reset_global_asid_space().
Just to be clear - this isn't an objection, I just wanna see if I
actually understood the design.
I guess it would be worth having a comment about it - especially if
I'm missing something or totally wrong.
* Re: [PATCH v9 10/12] x86/mm: do targeted broadcast flushing from tlbbatch code
2025-02-06 4:43 ` [PATCH v9 10/12] x86/mm: do targeted broadcast flushing from tlbbatch code Rik van Riel
@ 2025-02-10 15:27 ` Brendan Jackman
2025-02-11 3:45 ` Rik van Riel
0 siblings, 1 reply; 47+ messages in thread
From: Brendan Jackman @ 2025-02-10 15:27 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Thu, 6 Feb 2025 at 05:46, Rik van Riel <riel@surriel.com> wrote:
> /* Wait for INVLPGB originated by this CPU to complete. */
> -static inline void tlbsync(void)
> +static inline void __tlbsync(void)
> {
> - cant_migrate();
Why does this have to go away?
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 234277a5ef89..bf167e215e8e 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -106,6 +106,7 @@ struct tlb_state {
> * need to be invalidated.
> */
> bool invalidate_other;
> + bool need_tlbsync;
The ifdeffery is missing here.
> +static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
> + unsigned long addr,
> + u16 nr, bool pmd_stride)
> +{
> + __invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride);
> + if (!this_cpu_read(cpu_tlbstate.need_tlbsync))
> + this_cpu_write(cpu_tlbstate.need_tlbsync, true);
Why do we need the conditional here?
I always thought code like this was about avoiding cacheline
contention, but given this is a percpu and it's really only of
interest to the present CPU, is this worthwhile?
I guess it might be sharing a cacheline with some other percpu that is
more shared?
Anyway, not a big deal, I'm mostly asking for my own education.
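The plain version I had in mind, purely for illustration:

	__invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride);
	/* Unconditionally note that this CPU owes a TLBSYNC. */
	this_cpu_write(cpu_tlbstate.need_tlbsync, true);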
> @@ -794,6 +825,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> if (IS_ENABLED(CONFIG_PROVE_LOCKING))
> WARN_ON_ONCE(!irqs_disabled());
>
> + tlbsync();
> +
> /*
> * Verify that CR3 is what we think it is. This will catch
> * hypothetical buggy code that directly switches to swapper_pg_dir
> @@ -973,6 +1006,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> */
> void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> {
> + tlbsync();
> +
I have a feeling I'll look stupid for asking this, but why do we need
this and the one in switch_mm_irqs_off()?
> @@ -1661,12 +1694,53 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> local_irq_enable();
> }
>
> + /*
> + * If we issued (asynchronous) INVLPGB flushes, wait for them here.
> + * The cpumask above contains only CPUs that were running tasks
> + * not using broadcast TLB flushing.
> + */
> + if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
It feels wrong that we check the cpufeature here but not in e.g.
enter_lazy_tlb().
> + tlbsync();
> +
> cpumask_clear(&batch->cpumask);
>
> put_flush_tlb_info();
> put_cpu();
> }
>
> +void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> + struct mm_struct *mm,
> + unsigned long uaddr)
> +{
> + u16 asid = mm_global_asid(mm);
> +
> + if (asid) {
> + invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
> + /* Do any CPUs supporting INVLPGB need PTI? */
Not today, but
1. I don't think avoiding static_cpu_has() is worth the cost of making
the reader take that logical step.
2. AFAIK a new CPU bug never led to enabling KPTI on a CPU that
didn't have it before, and I think that would be a pretty dystopian
future (and hopefully by then we'll have ASI instead...). But we can't
really rule it out.
* Re: [PATCH v9 04/12] x86/mm: get INVLPGB count max from CPUID
2025-02-10 7:30 ` Vern Hao
@ 2025-02-10 16:48 ` Rik van Riel
2025-02-12 1:18 ` Vern Hao
2025-02-12 1:57 ` Vern Hao
0 siblings, 2 replies; 47+ messages in thread
From: Rik van Riel @ 2025-02-10 16:48 UTC (permalink / raw)
To: Vern Hao
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Mon, 2025-02-10 at 15:30 +0800, Vern Hao wrote:
> I did some tests on my machine with an AMD EPYC 7K83. These patches
> work on my host, but fail in my guest under qemu.
>
> On the host, lscpu shows invlpgb in the flags, but in the guest it
> does not.
>
> So do you plan to support it in the guest?
How exactly did it fail in the guest?
Did the guest simply not use INVLPGB because that
CPUID flag was not presented in the CPUID that
qemu shows to the guest, or did things crash somehow?
My understanding is that while INVLPGB can work
in guests, actually implementing that is a whole
other can of worms, and definitely not something
we should try to tackle at the same time as bare
metal support.
A TLB flush hypercall, with IRQ-less flushing on
the hypervisor side will probably get us 90% of
the way there, potentially with less overall
complexity than actually supporting INVLPGB in
the guest.
--
All Rights Reversed.
* Re: [PATCH v9 03/12] x86/mm: consolidate full flush threshold decision
2025-02-07 14:50 ` Brendan Jackman
2025-02-07 20:22 ` Rik van Riel
@ 2025-02-10 19:12 ` Rik van Riel
1 sibling, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2025-02-10 19:12 UTC (permalink / raw)
To: Brendan Jackman
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Dave Hansen
On Fri, 2025-02-07 at 15:50 +0100, Brendan Jackman wrote:
> On Thu, 6 Feb 2025 at 05:45, Rik van Riel <riel@surriel.com> wrote:
> > @@ -1276,7 +1282,7 @@ void arch_tlbbatch_flush(struct
> > arch_tlbflush_unmap_batch *batch)
> >
> > int cpu = get_cpu();
> >
> > - info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
> > + info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL,
> > PAGE_SHIFT, false,
> > TLB_GENERATION_INVALID);
>
> [Why] do we need this change? If it's necessary here, why isn't it
> needed everywhere else that does TLB_FLUSH_ALL too, like
> flush_tlb_mm()?
>
You are right, we do not actually need this.
I'll drop this for v10.
--
All Rights Reversed.
* Re: [PATCH v9 06/12] x86/mm: use INVLPGB for kernel TLB flushes
2025-02-07 16:03 ` Brendan Jackman
2025-02-07 20:50 ` Rik van Riel
@ 2025-02-11 2:01 ` Rik van Riel
1 sibling, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2025-02-11 2:01 UTC (permalink / raw)
To: Brendan Jackman
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Fri, 2025-02-07 at 17:03 +0100, Brendan Jackman wrote:
> On Thu, 6 Feb 2025 at 05:45, Rik van Riel <riel@surriel.com> wrote:
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 36939b104561..227e972b6fbc 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -1086,6 +1086,30 @@ void flush_tlb_all(void)
> > on_each_cpu(do_flush_tlb_all, NULL, 1);
> > }
> >
> > +static bool broadcast_kernel_range_flush(struct flush_tlb_info
> > *info)
> > +{
> > + unsigned long addr;
> > + unsigned long nr;
> > +
> > + if (!IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH))
> > + return false;
> > +
> > + if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> > + return false;
>
> I think that first conditional can be shunted off into the header
>
> diff --git a/arch/x86/include/asm/disabled-features.h
> b/arch/x86/include/asm/disabled-features.h
> index c492bdc97b05..61376b4e4fa7 100644
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -129,6 +129,12 @@
> #define DISABLE_SEV_SNP (1 << (X86_FEATURE_SEV_SNP &
> 31))
> #endif
>
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +#define DISABLE_INVLPGB 0
> +#else
> +#define DISABLE_INVLPGB (1 << (X86_FEATURE_INVLPGB & 31))
> +#endif
> +
I'm adding this for v10, with the disabled-features.h
stuff in patch 5, and removing the no longer needed
tests from each subsequent patch.
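To illustrate how the pieces fit together, here is a sketch of the disabled-features wiring and the simplified check. The DISABLED_MASK13 line assumes X86_FEATURE_INVLPGB sits in CPUID word 13 (the 0x80000008 EBX word); this is not the exact v10 code:

/* arch/x86/include/asm/disabled-features.h */
#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
#define DISABLE_INVLPGB		0
#else
#define DISABLE_INVLPGB		(1 << (X86_FEATURE_INVLPGB & 31))
#endif

#define DISABLED_MASK13		(DISABLE_INVLPGB)

/*
 * arch/x86/mm/tlb.c: cpu_feature_enabled() now folds in the Kconfig
 * check at compile time, so the IS_ENABLED() test can be dropped.
 */
static bool broadcast_kernel_range_flush(struct flush_tlb_info *info)
{
	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
		return false;

	/* ... issue INVLPGB for info->start..info->end, then tlbsync() ... */
	return true;
}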
--
All Rights Reversed.
* Re: [PATCH v9 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
2025-02-10 14:15 ` Brendan Jackman
@ 2025-02-11 3:07 ` Rik van Riel
0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2025-02-11 3:07 UTC (permalink / raw)
To: Brendan Jackman
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Mon, 2025-02-10 at 15:15 +0100, Brendan Jackman wrote:
> On Thu, 6 Feb 2025 at 05:47, Rik van Riel <riel@surriel.com> wrote:
> >
> > + if (asid >= MAX_ASID_AVAILABLE) {
> > + /* This should never happen. */
> > + VM_WARN_ONCE(1, "Unable to allocate global ASID
> > despite %d available\n", global_asid_available);
>
> If you'll forgive the nitpicking, please put the last arg on a new
> line or otherwise break this up, the rest of this file keeps below
> 100
> chars (this is 113).
>
Nitpicks are great! Chances are I'll have to look at
this code again several times over the coming years,
so getting it in the best possible shape is in my
interest as much as anybody else's ;)
> >
> > +static bool needs_global_asid_reload(struct mm_struct *next, u16
> > prev_asid)
> > +{
> > + u16 global_asid = mm_global_asid(next);
> > +
> > + if (global_asid && prev_asid != global_asid)
> > + return true;
> > +
> > + if (!global_asid && is_global_asid(prev_asid))
> > + return true;
>
> I think this needs clarification around when switches from
> global->nonglobal happen. Maybe commentary or maybe there's a way to
> just express the code that makes it obvious. Here's what I currently
> understand, please correct me if I'm wrong:
>
> - Once a process gets a global ASID it keeps it forever. So within a
> process we never switch global->nonglobal.
>
> - In flush_tlb_func() we are just calling this to check if the
> process
> has just been given a global ASID - there's no way loaded_mm_asid can
> be global yet !mm_global_asid(loaded_mm).
>
> - When we call this from switch_mm_irqs_off() we are in the
> prev==next
> case. Is there something about lazy TLB that can cause the case above
> to happen here?
>
In the current implementation, we never transition
from global->local ASID.
In a previous implementation, the code did do those
transitions, and they appeared to survive the testing
thrown at it.
If we implement more aggressive ASID reuse (which we
may need to), we may need to support that transition
again.
In short, while we do not need to support that
transition right now, I don't really want to remove
the two lines of code that make it work :)
I'll add comments.
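A sketch of what the commented helper could look like, with the comments simply restating the explanation above (editorial wording, not the exact v10 comments):

static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
{
	u16 global_asid = mm_global_asid(next);

	/* The mm was given a global ASID since this CPU last loaded it. */
	if (global_asid && prev_asid != global_asid)
		return true;

	/*
	 * The mm went from a global ASID back to a dynamic one.  The current
	 * allocator never does this (a global ASID is kept for the lifetime
	 * of the mm), but keep the check so more aggressive ASID reuse can
	 * be added later without breaking this path.
	 */
	if (!global_asid && is_global_asid(prev_asid))
		return true;

	return false;
}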
> > +static bool meets_global_asid_threshold(struct mm_struct *mm)
> > +{
> > + if (!global_asid_available)
>
> I think we need READ_ONCE here.
>
> Also - this doesn't really make sense in this function as it's
> currently named.
>
> I think we could just inline this whole function into
> consider_global_asid(), it would still be nice and readable IMO.
>
Done and done.
> >
> > @@ -1058,9 +1375,12 @@ void flush_tlb_mm_range(struct mm_struct
> > *mm, unsigned long start,
> > * a local TLB flush is needed. Optimize this use-case by
> > calling
> > * flush_tlb_func_local() directly in this case.
> > */
> > - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> > + if (mm_global_asid(mm)) {
> > + broadcast_tlb_flush(info);
> > + } else if (cpumask_any_but(mm_cpumask(mm), cpu) <
> > nr_cpu_ids) {
> > info->trim_cpumask = should_trim_cpumask(mm);
> > flush_tlb_multi(mm_cpumask(mm), info);
> > + consider_global_asid(mm);
>
> Why do we do this here instead of when the CPU enters the mm? Is the
> idea that in combination with the jiffies thing in
> consider_global_asid() we get a probability of getting a global ASID
> (within some time period) that scales with the amount of TLB flushing
> the process does? So then we avoid using up ASID space on processes
> that are multithreaded but just sit around with stable VMAs etc?
>
You guessed right.
On current x86 hardware, a global ASID is a scarce
resource, with about 4k available ASIDs (2k in a kernel
compiled with support for the KPTI mitigation), while
the largest available x86 systems have at least 8k CPUs.
We can either implement the much more aggressive ASID
reuse that ARM64 and RISC-V implement, though it is not
clear how to scale that to thousands of CPUs, or reserve
global ASIDs for the processes that are most likely to
benefit from them, continuing to use IPI-based flushing
for the processes that need it less.
I've added a comment to document that.
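For example, the call site could carry a comment along these lines (excerpt of the quoted hunk with editorial comment wording):

	if (mm_global_asid(mm)) {
		broadcast_tlb_flush(info);
	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
		info->trim_cpumask = should_trim_cpumask(mm);
		flush_tlb_multi(mm_cpumask(mm), info);
		/*
		 * Global ASIDs are scarce: roughly 4k on current hardware,
		 * about 2k with KPTI enabled, while the largest x86 systems
		 * have 8k+ CPUs.  Upgrading an mm from the remote-flush path
		 * spends that limited space on the processes that actually
		 * flush from multiple CPUs, and leaves everything else on
		 * IPI-based flushing.
		 */
		consider_global_asid(mm);
	}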
--
All Rights Reversed.
* Re: [PATCH v9 10/12] x86/mm: do targeted broadcast flushing from tlbbatch code
2025-02-10 15:27 ` Brendan Jackman
@ 2025-02-11 3:45 ` Rik van Riel
2025-02-11 10:02 ` Brendan Jackman
0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2025-02-11 3:45 UTC (permalink / raw)
To: Brendan Jackman
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Mon, 2025-02-10 at 16:27 +0100, Brendan Jackman wrote:
> On Thu, 6 Feb 2025 at 05:46, Rik van Riel <riel@surriel.com> wrote:
> > /* Wait for INVLPGB originated by this CPU to complete. */
> > -static inline void tlbsync(void)
> > +static inline void __tlbsync(void)
> > {
> > - cant_migrate();
>
> Why does this have to go away?
I'm not sure the current task in sched_init() has
all the correct bits set to prevent the warning
from firing, but on the flip side it won't have
called INVLPGB yet at that point, so the call to
enter_lazy_tlb() won't actually end up here.
I'll put it back.
>
> > diff --git a/arch/x86/include/asm/tlbflush.h
> > b/arch/x86/include/asm/tlbflush.h
> > index 234277a5ef89..bf167e215e8e 100644
> > --- a/arch/x86/include/asm/tlbflush.h
> > +++ b/arch/x86/include/asm/tlbflush.h
> > @@ -106,6 +106,7 @@ struct tlb_state {
> > * need to be invalidated.
> > */
> > bool invalidate_other;
> > + bool need_tlbsync;
>
> The ifdeffery is missing here.
Added.
>
> > @@ -794,6 +825,8 @@ void switch_mm_irqs_off(struct mm_struct
> > *unused, struct mm_struct *next,
> > if (IS_ENABLED(CONFIG_PROVE_LOCKING))
> > WARN_ON_ONCE(!irqs_disabled());
> >
> > + tlbsync();
> > +
> > /*
> > * Verify that CR3 is what we think it is. This will catch
> > * hypothetical buggy code that directly switches to
> > swapper_pg_dir
> > @@ -973,6 +1006,8 @@ void switch_mm_irqs_off(struct mm_struct
> > *unused, struct mm_struct *next,
> > */
> > void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> > {
> > + tlbsync();
> > +
>
> I have a feeling I'll look stupid for asking this, but why do we need
> this and the one in switch_mm_irqs_off()?
This is an architectural thing: TLBSYNC waits for
the INVLPGB flushes that were issued from the same
CPU to finish.
That means if we have pending flushes (from the
pageout code), we need to wait for them at context
switch time, before the task could potentially be
migrated to another CPU.
>
> > @@ -1661,12 +1694,53 @@ void arch_tlbbatch_flush(struct
> > arch_tlbflush_unmap_batch *batch)
> > local_irq_enable();
> > }
> >
> > + /*
> > + * If we issued (asynchronous) INVLPGB flushes, wait for
> > them here.
> > + * The cpumask above contains only CPUs that were running
> > tasks
> > + * not using broadcast TLB flushing.
> > + */
> > + if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
>
> It feels wrong that we check the cpufeature here but not in e.g.
> enter_lazy_tlb().
>
> > + tlbsync();
> > +
We no longer need to check it here, with the change
to tlbsync. Good catch.
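A sketch of the reworked wrapper that makes the feature check unnecessary at the call sites (need_tlbsync is only ever set by the INVLPGB helpers, so it doubles as the gate; illustrative, not necessarily the exact v10 code):

static inline void tlbsync(void)
{
	cant_migrate();

	/*
	 * Only CPUs with INVLPGB ever set need_tlbsync, so on everything
	 * else (and before any broadcast flush was issued) this is a
	 * cheap no-op and no cpu_feature_enabled() check is needed here.
	 */
	if (!this_cpu_read(cpu_tlbstate.need_tlbsync))
		return;

	__tlbsync();
	this_cpu_write(cpu_tlbstate.need_tlbsync, false);
}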
--
All Rights Reversed.
* Re: [PATCH v9 10/12] x86/mm: do targeted broadcast flushing from tlbbatch code
2025-02-11 3:45 ` Rik van Riel
@ 2025-02-11 10:02 ` Brendan Jackman
2025-02-11 20:21 ` Rik van Riel
0 siblings, 1 reply; 47+ messages in thread
From: Brendan Jackman @ 2025-02-11 10:02 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Tue, 11 Feb 2025 at 04:50, Rik van Riel <riel@surriel.com> wrote:
>
> On Mon, 2025-02-10 at 16:27 +0100, Brendan Jackman wrote:
> > On Thu, 6 Feb 2025 at 05:46, Rik van Riel <riel@surriel.com> wrote:
> > > /* Wait for INVLPGB originated by this CPU to complete. */
> > > -static inline void tlbsync(void)
> > > +static inline void __tlbsync(void)
> > > {
> > > - cant_migrate();
> >
> > Why does this have to go away?
>
> I'm not sure the current task in sched_init() has
> all the correct bits set to prevent the warning
> from firing, but on the flip side it won't have
> called INVLPGB yet at that point, so the call to
> enter_lazy_tlb() won't actually end up here.
>
> I'll put it back.
Sounds good.
FWIW I think if we do run into early-boot code hitting false
DEBUG_ATOMIC_SLEEP warnings, the best response might be to update the
DEBUG_ATOMIC_SLEEP code. Maybe there's a more targeted solution, but
something roughly equivalent to checking if (system_state >=
SYSTEM_SCHEDULING) before the warning.
> > > @@ -794,6 +825,8 @@ void switch_mm_irqs_off(struct mm_struct
> > > *unused, struct mm_struct *next,
> > > if (IS_ENABLED(CONFIG_PROVE_LOCKING))
> > > WARN_ON_ONCE(!irqs_disabled());
> > >
> > > + tlbsync();
> > > +
> > > /*
> > > * Verify that CR3 is what we think it is. This will catch
> > > * hypothetical buggy code that directly switches to
> > > swapper_pg_dir
> > > @@ -973,6 +1006,8 @@ void switch_mm_irqs_off(struct mm_struct
> > > *unused, struct mm_struct *next,
> > > */
> > > void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> > > {
> > > + tlbsync();
> > > +
> >
> > I have a feeling I'll look stupid for asking this, but why do we need
> > this and the one in switch_mm_irqs_off()?
>
> This is an architectural thing: TLBSYNC waits for
> the INVLPGB flushes to finish that were issued
> from the same CPU.
>
> That means if we have pending flushes (from the
> pageout code), we need to wait for them at context
> switch time, before the task could potentially be
> migrated to another CPU.
Oh right thanks, that makes sense.
So I think here we're encoding the assumption that context_switch()
always calls either enter_lazy_tlb() or switch_mm_irqs_off(), which is
a little awkward, plus the job of these functions is already kinda
hazy and this makes it even hazier. What about doing it in
arch_start_context_switch() instead?
That would mean a bit of plumbing since we'd still wanna have the
tlbsync() in tlb.c, but that seems worth it to me. Plus, having it in
one place would give us a spot to add a comment. Now that you point it
out it does indeed seem obvious but it didn't seem so yesterday.
Now I think about it... if we always tlbsync() before a context
switch, is the cant_migrate() above actually required? I think with
that, even if we migrated in the middle of e.g.
broadcast_kernel_range_flush(), we'd be fine? (At least, from the
specific perspective of the invlpgb code, presumably having preemption
on there would break things horribly in other ways).
* Re: [PATCH v9 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2025-02-07 14:28 ` Brendan Jackman
@ 2025-02-11 11:07 ` Peter Zijlstra
2025-02-11 12:10 ` Brendan Jackman
0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-02-11 11:07 UTC (permalink / raw)
To: Brendan Jackman
Cc: Rik van Riel, x86, linux-kernel, bp, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Fri, Feb 07, 2025 at 03:28:31PM +0100, Brendan Jackman wrote:
> Hi Rik,
>
> On Thu, 6 Feb 2025 at 05:45, Rik van Riel <riel@surriel.com> wrote:
> > This is done because some paravirt TLB flush implementations
> > handle the TLB flush in the hypervisor, and will do the flush
> > even when the target CPU has interrupts disabled.
> >
> > Always handle page table freeing with MMU_GATHER_RCU_TABLE_FREE.
> > Using RCU synchronization between page table freeing and get_user_pages_fast()
> > allows bare metal to also do TLB flushing while interrupts are disabled.
> >
> > Various places in the mm do still block IRQs or disable preemption
> > as an implicit way to block RCU frees.
> >
> > That makes it safe to use INVLPGB on AMD CPUs.
>
> It would be nice to update the CONFIG_MMU_GATHER_RCU_TABLE_FREE
> comment in mm/mmu_gather.c to mention INVLPGB alongside "Architectures
> that do not have this (PPC)"
Why? This is just one more architecture that does broadcast.
> - and while that's being updated it would
> also be useful to note down the paravirt thing you explained above,
> IMO it's pretty helpful to have more examples of the concrete usecases
> for this logic.
Look at kvm_flush_tlb_multi() if you're interested. The notable detail
is that it avoids flushing the TLB for vCPUs that are preempted, and instead
lets the vCPU resume code do the invalidate.
* Re: [PATCH v9 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2025-02-11 11:07 ` Peter Zijlstra
@ 2025-02-11 12:10 ` Brendan Jackman
2025-02-11 20:23 ` Rik van Riel
0 siblings, 1 reply; 47+ messages in thread
From: Brendan Jackman @ 2025-02-11 12:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rik van Riel, x86, linux-kernel, bp, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Tue, 11 Feb 2025 at 12:07, Peter Zijlstra <peterz@infradead.org> wrote:
> > It would be nice to update the CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > comment in mm/mmu_gather.c to mention INVLPGB alongside "Architectures
> > that do not have this (PPC)"
>
> Why? This is just one more architecture that does broadcast.
Hmm yeah, I didn't really make the leap from "doesn't do an IPI" to
"that just means it uses broadcast TLB flush". In that light I can see
how it's "just another architecture".
I do think it would make sense to be more explicit about that, even
though it seems obvious now you point it out. But it's not really
relevant to this patchset.
> - and while that's being updated it would
> > also be useful to note down the paravirt thing you explained above,
> > IMO it's pretty helpful to have more examples of the concrete usecases
> > for this logic.
>
> Look at kvm_flush_tlb_multi() if you're interested. The notable detail
> is that it avoids flushing the TLB for vCPUs that are preempted, and instead
> lets the vCPU resume code do the invalidate.
Oh that actually looks like a slightly different case from what Rik mentioned?
> some paravirt TLB flush implementations
> handle the TLB flush in the hypervisor, and will do the flush
> even when the target CPU has interrupts disabled.
Do we have
a) Systems where the flush gets completely pushed into the hypervisor
- xen_flush_tlb_multi()?
b) Systems where the guest coordinates with the hypervisor to avoid
IPI-ing preempted vCPUs?
Maybe I can send a separate patch to note this in the commentary, it's
interesting and useful to know.
* Re: [PATCH v9 10/12] x86/mm: do targeted broadcast flushing from tlbbatch code
2025-02-11 10:02 ` Brendan Jackman
@ 2025-02-11 20:21 ` Rik van Riel
2025-02-12 10:38 ` Brendan Jackman
0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2025-02-11 20:21 UTC (permalink / raw)
To: Brendan Jackman
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Tue, 2025-02-11 at 11:02 +0100, Brendan Jackman wrote:
>
> So I think here we're encoding the assumption that context_switch()
> always calls either enter_lazy_tlb() or switch_mm_irqs_off(), which
> is
> a little awkward, plus the job of these functions is already kinda
> hazy and this makes it even hazier. What about doing it in
> arch_start_context_switch() instead?
>
> That would mean a bit of plumbing since we'd still wanna have the
> tlbsync() in tlb.c, but that seems worth it to me. Plus, having it in
> one place would give us a spot to add a comment. Now that you point
> it
> out it does indeed seem obvious but it didn't seem so yesterday.
>
While that would be a little bit cleaner to maintain,
in theory, I'm not convinced that adding an extra
function call to the context switch path is worthwhile
for that small maintenance benefit.
I'd rather add more comments than an extra function call :)
> Now I think about it... if we always tlbsync() before a context
> switch, is the cant_migrate() above actually required?
Probably not, but I guess it won't hurt?
I'm running tests right now on a kernel with a bunch
of debug options enabled, and I'm getting all sorts
of warnings, but not this one :)
--
All Rights Reversed.
* Re: [PATCH v9 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
2025-02-11 12:10 ` Brendan Jackman
@ 2025-02-11 20:23 ` Rik van Riel
0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2025-02-11 20:23 UTC (permalink / raw)
To: Brendan Jackman, Peter Zijlstra
Cc: x86, linux-kernel, bp, dave.hansen, zhengqi.arch, nadav.amit,
thomas.lendacky, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Manali Shukla
On Tue, 2025-02-11 at 13:10 +0100, Brendan Jackman wrote:
>
> Maybe I can send a separate patch to note this in the commentary,
> it's
> interesting and useful to know.
>
I added a similar comment to this patch series
for v10 yesterday. I'll send it out after a few
more tests have finished running.
I'm not particularly invested in which version
of the comment gets merged :)
--
All Rights Reversed.
* Re: [PATCH v9 04/12] x86/mm: get INVLPGB count max from CPUID
2025-02-10 16:48 ` Rik van Riel
@ 2025-02-12 1:18 ` Vern Hao
2025-02-12 1:57 ` Vern Hao
1 sibling, 0 replies; 47+ messages in thread
From: Vern Hao @ 2025-02-12 1:18 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
I applied these patches in both the host and the guest, and added this
patch to expose the CPUID flag in KVM:
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index db3838667466..fd21d9438137 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -488,7 +488,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function,
 	/* cpuid 0x80000008.ebx */
 	const u32 kvm_cpuid_8000_0008_ebx_x86_features =
-		F(CLZERO) | F(XSAVEERPTR) |
+		F(CLZERO) | F(XSAVEERPTR) | F(INVLPGB) |
 		F(WBNOINVD) | F(AMD_IBPB) | F(AMD_IBRS) | F(AMD_SSBD) | F(VIRT_SSBD) |
 		F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON);
But in the guest, using the lscpu command, I still cannot see invlpgb,
so I wonder what is going wrong?
Best Regards!
Thanks
Rik van Riel <riel@surriel.com> 于2025年2月11日周二 00:50写道:
> On Mon, 2025-02-10 at 15:30 +0800, Vern Hao wrote:
> > I do some test on my Machine with AMD EPYC 7K83, these patches work
> > on my host, but failed on my guest with qemu.
> >
> > in host, use lscpu cmd, you can see invlpgb in flags, but in guest
> > no.
> >
> > So are you plan to support it in guest?
>
> How exactly did it fail in the guest?
>
> Did the guest simply not use INVLPGB because that
> CPUID flag was not presented in the CPUID that
> qemu shows to the guest, or did things crash somehow?
>
> My understanding is that while INVLPGB can work
> in guests, actually implementing that is a whole
> other can of worms, and definitely not something
> we should try to tackle at the same time as bare
> metal support.
>
> A TLB flush hypercall, with IRQ-less flushing on
> the hypervisor side will probably get us 90% of
> the way there, potentially with less overall
> complexity than actually supporting INVLPGB in
> the guest.
>
> --
> All Rights Reversed.
>
* Re: [PATCH v9 04/12] x86/mm: get INVLPGB count max from CPUID
2025-02-10 16:48 ` Rik van Riel
2025-02-12 1:18 ` Vern Hao
@ 2025-02-12 1:57 ` Vern Hao
2025-02-12 15:56 ` Tom Lendacky
1 sibling, 1 reply; 47+ messages in thread
From: Vern Hao @ 2025-02-12 1:57 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On 2025/2/11 00:48, Rik van Riel wrote:
> On Mon, 2025-02-10 at 15:30 +0800, Vern Hao wrote:
>> I do some test on my Machine with AMD EPYC 7K83, these patches work
>> on my host, but failed on my guest with qemu.
>>
>> in host, use lscpu cmd, you can see invlpgb in flags, but in guest
>> no.
>>
>> So are you plan to support it in guest?
> How exactly did it fail in the guest?
>
> Did the guest simply not use INVLPGB because that
> CPUID flag was not presented in the CPUID that
> qemu shows to the guest, or did things crash somehow?
I applied these patches in both the host and the guest, and added this
patch to expose the CPUID flag in KVM:
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index db3838667466..fd21d9438137 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -488,7 +488,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function,
 	/* cpuid 0x80000008.ebx */
 	const u32 kvm_cpuid_8000_0008_ebx_x86_features =
-		F(CLZERO) | F(XSAVEERPTR) |
+		F(CLZERO) | F(XSAVEERPTR) | F(INVLPGB) |
 		F(WBNOINVD) | F(AMD_IBPB) | F(AMD_IBRS) | F(AMD_SSBD) | F(VIRT_SSBD) |
 		F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON);
But in the guest, using the lscpu command, I still cannot see invlpgb,
so I wonder what is going wrong?
>
> My understanding is that while INVLPGB can work
> in guests, actually implementing that is a whole
> other can of worms, and definitely not something
> we should try to tackle at the same time as bare
> metal support.
>
> A TLB flush hypercall, with IRQ-less flushing on
> the hypervisor side will probably get us 90% of
> the way there, potentially with less overall
> complexity than actually supporting INVLPGB in
> the guest.
>
* Re: [PATCH v9 10/12] x86/mm: do targeted broadcast flushing from tlbbatch code
2025-02-11 20:21 ` Rik van Riel
@ 2025-02-12 10:38 ` Brendan Jackman
0 siblings, 0 replies; 47+ messages in thread
From: Brendan Jackman @ 2025-02-12 10:38 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm, jannh,
mhklinux, andrew.cooper3, Manali Shukla
On Tue, 11 Feb 2025 at 21:21, Rik van Riel <riel@surriel.com> wrote:
>
> On Tue, 2025-02-11 at 11:02 +0100, Brendan Jackman wrote:
> >
> > So I think here we're encoding the assumption that context_switch()
> > always calls either enter_lazy_tlb() or switch_mm_irqs_off(), which
> > is
> > a little awkward, plus the job of these functions is already kinda
> > hazy and this makes it even hazier. What about doing it in
> > arch_start_context_switch() instead?
> >
> > That would mean a bit of plumbing since we'd still wanna have the
> > tlbsync() in tlb.c, but that seems worth it to me. Plus, having it in
> > one place would give us a spot to add a comment. Now that you point
> > it
> > out it does indeed seem obvious but it didn't seem so yesterday.
> >
> While that would be a little bit cleaner to maintain,
> in theory, I'm not convinced that adding an extra
> function call to the context switch path is worthwhile
> for that small maintenance benefit.
I don't see why it has to introduce a function call, can't we just
have something like:
static inline void arch_start_context_switch(struct task_struct *prev)
{
	arch_paravirt_start_context_switch(prev);
	tlb_start_context_switch(prev);
}
The paravirt one would disappear when !CONFIG_PARAVIRT (as the current
arch_start_context_switch() does) and the tlb one would disappear when
!CONFIG_X86_BROADCAST_TLB_FLUSH. The whole thing should be inlinable.
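And the tlb.c side of that plumbing could be as small as this (tlb_start_context_switch() is the hypothetical helper named above, not something in the posted series):

static inline void tlb_start_context_switch(struct task_struct *prev)
{
	/*
	 * Wait for any INVLPGB flushes this CPU issued on behalf of the
	 * outgoing task (e.g. from the reclaim batching code) before the
	 * task can be migrated to another CPU.
	 */
	tlbsync();
}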
* Re: [PATCH v9 04/12] x86/mm: get INVLPGB count max from CPUID
2025-02-12 1:57 ` Vern Hao
@ 2025-02-12 15:56 ` Tom Lendacky
2025-02-13 8:16 ` Vern Hao
0 siblings, 1 reply; 47+ messages in thread
From: Tom Lendacky @ 2025-02-12 15:56 UTC (permalink / raw)
To: Vern Hao, Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Manali Shukla
On 2/11/25 19:57, Vern Hao wrote:
>
> On 2025/2/11 00:48, Rik van Riel wrote:
>> On Mon, 2025-02-10 at 15:30 +0800, Vern Hao wrote:
>>> I do some test on my Machine with AMD EPYC 7K83, these patches work
>>> on my host, but failed on my guest with qemu.
>>>
>>> in host, use lscpu cmd, you can see invlpgb in flags, but in guest
>>> no.
>>>
>>> So are you plan to support it in guest?
>> How exactly did it fail in the guest?
>>
>> Did the guest simply not use INVLPGB because that
>> CPUID flag was not presented in the CPUID that
>> qemu shows to the guest, or did things crash somehow?
> i support these patches in host and guest, and add this patch to support
> cpuid flags in kvm.
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index db3838667466..fd21d9438137 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -488,7 +488,7 @@ static inline int __do_cpuid_func(struct
> kvm_cpuid_entry2 *entry, u32 function,
>
> /* cpuid 0x80000008.ebx */
> const u32 kvm_cpuid_8000_0008_ebx_x86_features =
> - F(CLZERO) | F(XSAVEERPTR) |
> + F(CLZERO) | F(XSAVEERPTR) | F(INVLPGB) |
> F(WBNOINVD) | F(AMD_IBPB) | F(AMD_IBRS) | F(AMD_SSBD) |
> F(VIRT_SSBD) |
> F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON);
>
> But in guest, use lscpu cmd, i still can not see invlpgb, so i just
> wonder where is wrong ?
Well, for one, the INVLPGB instruction has to be enabled in the VMCB in
order for it to be used (unless it is an SEV-ES or SEV-SNP guest). Also,
lscpu won't show "invlpgb" since the patches don't define the feature in
the way that it would be visible via lscpu. You need to issue CPUID to
see whether the bit is set. Also, you might need VMM support in order for
that CPUID bit to be set in the guest.
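For reference, a minimal userspace sketch of that CPUID check (not from the original message; it assumes INVLPGB is reported in CPUID leaf 0x80000008, EBX bit 3, with the maximum page count in EDX bits 15:0):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
		puts("CPUID leaf 0x80000008 not available");
		return 1;
	}

	printf("INVLPGB: %s\n", (ebx & (1u << 3)) ? "supported" : "not supported");
	printf("InvlpgbCountMax: %u\n", edx & 0xffff);
	return 0;
}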
But, it will take hypervisor support to use INVLPGB in a non-SEV guest,
since non-SEV guests do not use global ASIDs. In this case, the
instruction will need to be intercepted and the hypervisor will need to
determine how to process it.
If you have an SEV guest, which uses global ASIDs and uses the same ASID
for all vCPUs within the guest, you can use INVLPGB in the guest without
issue and without needing to intercept the instruction.
See "Guest Usage of INVLPGB" in AMD APM Vol 3 under the INVLPGB
instruction documentation.
Thanks,
Tom
>
>> My understanding is that while INVLPGB can work
>> in guests, actually implementing that is a whole
>> other can of worms, and definitely not something
>> we should try to tackle at the same time as bare
>> metal support.
>>
>> A TLB flush hypercall, with IRQ-less flushing on
>> the hypervisor side will probably get us 90% of
>> the way there, potentially with less overall
>> complexity than actually supporting INVLPGB in
>> the guest.
>>
* Re: [PATCH v9 04/12] x86/mm: get INVLPGB count max from CPUID
2025-02-12 15:56 ` Tom Lendacky
@ 2025-02-13 8:16 ` Vern Hao
0 siblings, 0 replies; 47+ messages in thread
From: Vern Hao @ 2025-02-13 8:16 UTC (permalink / raw)
To: Tom Lendacky, Rik van Riel
Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
nadav.amit, kernel-team, linux-mm, akpm, jannh, mhklinux,
andrew.cooper3, Manali Shukla
On 2025/2/12 23:56, Tom Lendacky wrote:
> On 2/11/25 19:57, Vern Hao wrote:
>> On 2025/2/11 00:48, Rik van Riel wrote:
>>> On Mon, 2025-02-10 at 15:30 +0800, Vern Hao wrote:
>>>> I do some test on my Machine with AMD EPYC 7K83, these patches work
>>>> on my host, but failed on my guest with qemu.
>>>>
>>>> in host, use lscpu cmd, you can see invlpgb in flags, but in guest
>>>> no.
>>>>
>>>> So are you plan to support it in guest?
>>> How exactly did it fail in the guest?
>>>
>>> Did the guest simply not use INVLPGB because that
>>> CPUID flag was not presented in the CPUID that
>>> qemu shows to the guest, or did things crash somehow?
>> i support these patches in host and guest, and add this patch to support
>> cpuid flags in kvm.
>>
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index db3838667466..fd21d9438137 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -488,7 +488,7 @@ static inline int __do_cpuid_func(struct
>> kvm_cpuid_entry2 *entry, u32 function,
>>
>> /* cpuid 0x80000008.ebx */
>> const u32 kvm_cpuid_8000_0008_ebx_x86_features =
>> - F(CLZERO) | F(XSAVEERPTR) |
>> + F(CLZERO) | F(XSAVEERPTR) | F(INVLPGB) |
>> F(WBNOINVD) | F(AMD_IBPB) | F(AMD_IBRS) | F(AMD_SSBD) |
>> F(VIRT_SSBD) |
>> F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON);
>>
>> But in guest, use lscpu cmd, i still can not see invlpgb, so i just
>> wonder where is wrong ?
> Well, for one, the INVLPGB instruction has to be enabled in the VMCB in
> order for it to be used (unless it is an SEV-ES or SEV-SNP guest). Also,
> lscpu won't show "invlpgb" since the patches don't define the feature in
> the way that it would be visible via lscpu. You need to issue CPUID to
> see the bit is set or not. Also, you might need VMM support in order for
> that CPUID bit to be set in the guest.
>
> But, it will take hypervisor support to use INVLPGB in a non-SEV guest,
> since non-SEV guests do not use global ASIDs. In this case, the
> instruction will need to be intercepted and the hypervisor will need to
> determine how to process it.
>
> If you have an SEV guest, which use global ASIDs and use the same ASID
> for all vCPUs within a guest, you can use INVLPGB in the guest without
> issue and without needing to intercept the instruction.
>
> See "Guest Usage of INVLPGB" in AMD APM Vol 3 under the INVLPGB
> instruction documentation.
OK, thanks a lot.
Obviously, the KVM in my 5.10 kernel does not support this
CPUID bit well; I tested on the upstream kernel and it works well.
>
> Thanks,
> Tom
>
>>> My understanding is that while INVLPGB can work
>>> in guests, actually implementing that is a whole
>>> other can of worms, and definitely not something
>>> we should try to tackle at the same time as bare
>>> metal support.
>>>
>>> A TLB flush hypercall, with IRQ-less flushing on
>>> the hypervisor side will probably get us 90% of
>>> the way there, potentially with less overall
>>> complexity than actually supporting INVLPGB in
>>> the guest.
>>>
end of thread (newest message: 2025-02-13 8:16 UTC)
Thread overview: 47+ messages
2025-02-06 4:43 [PATCH v9 00/12] AMD broadcast TLB invalidation Rik van Riel
2025-02-06 4:43 ` [PATCH v9 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
2025-02-07 14:28 ` Brendan Jackman
2025-02-11 11:07 ` Peter Zijlstra
2025-02-11 12:10 ` Brendan Jackman
2025-02-11 20:23 ` Rik van Riel
2025-02-06 4:43 ` [PATCH v9 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call Rik van Riel
2025-02-06 4:43 ` [PATCH v9 03/12] x86/mm: consolidate full flush threshold decision Rik van Riel
2025-02-07 14:50 ` Brendan Jackman
2025-02-07 20:22 ` Rik van Riel
2025-02-10 11:15 ` Brendan Jackman
2025-02-10 19:12 ` Rik van Riel
2025-02-06 4:43 ` [PATCH v9 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
2025-02-07 15:10 ` Brendan Jackman
2025-02-07 17:34 ` Brendan Jackman
2025-02-10 7:30 ` Vern Hao
2025-02-10 16:48 ` Rik van Riel
2025-02-12 1:18 ` Vern Hao
2025-02-12 1:57 ` Vern Hao
2025-02-12 15:56 ` Tom Lendacky
2025-02-13 8:16 ` Vern Hao
2025-02-06 4:43 ` [PATCH v9 05/12] x86/mm: add INVLPGB support code Rik van Riel
2025-02-06 4:43 ` [PATCH v9 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
2025-02-07 16:03 ` Brendan Jackman
2025-02-07 20:50 ` Rik van Riel
2025-02-10 11:22 ` Brendan Jackman
2025-02-11 2:01 ` Rik van Riel
2025-02-06 4:43 ` [PATCH v9 07/12] x86/mm: use INVLPGB in flush_tlb_all Rik van Riel
2025-02-06 4:43 ` [PATCH v9 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
2025-02-06 4:43 ` [PATCH v9 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
2025-02-10 14:15 ` Brendan Jackman
2025-02-11 3:07 ` Rik van Riel
2025-02-06 4:43 ` [PATCH v9 10/12] x86/mm: do targeted broadcast flushing from tlbbatch code Rik van Riel
2025-02-10 15:27 ` Brendan Jackman
2025-02-11 3:45 ` Rik van Riel
2025-02-11 10:02 ` Brendan Jackman
2025-02-11 20:21 ` Rik van Riel
2025-02-12 10:38 ` Brendan Jackman
2025-02-06 4:43 ` [PATCH v9 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
2025-02-06 4:43 ` [PATCH v9 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
2025-02-06 10:16 ` [PATCH v9 00/12] AMD broadcast TLB invalidation Oleksandr Natalenko
2025-02-06 14:16 ` Rik van Riel
2025-02-06 14:23 ` Peter Zijlstra
2025-02-06 14:48 ` Rik van Riel
2025-02-07 8:16 ` Peter Zijlstra
2025-02-07 17:46 ` Rik van Riel
2025-02-07 18:23 ` Brendan Jackman