* [PATCH RESEND v3 0/2] skip redundant TLB sync IPIs @ 2026-01-06 12:03 Lance Yang 2026-01-06 12:03 ` [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized Lance Yang 2026-01-06 12:03 ` [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI Lance Yang 0 siblings, 2 replies; 16+ messages in thread From: Lance Yang @ 2026-01-06 12:03 UTC (permalink / raw) To: akpm Cc: david, dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0 Hi all, When unsharing hugetlb PMD page tables or collapsing pages in khugepaged, we send two IPIs: one for TLB invalidation, and another to synchronize with concurrent GUP-fast walkers. However, if the TLB flush already reaches all CPUs, the second IPI is redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI completes, any concurrent GUP-fast must have finished. We now track whether IPIs were actually sent during TLB flush. We pass the mmu_gather context through the flush path, and native_flush_tlb_multi() sets a flag when sending IPIs. Works with PV and INVLPGB since only native_flush_tlb_multi() sets the flag - no matter what replaces pv_ops.mmu.flush_tlb_multi or whether INVLPGB is available. David Hildenbrand did the initial implementation. I built on his work and relied on off-list discussions to push it further - thanks a lot David! v2 -> v3: - Complete rewrite: use dynamic IPI tracking instead of static checks (per Dave Hansen, thanks!) - Track IPIs via mmu_gather: native_flush_tlb_multi() sets flag when actually sending IPIs - Motivation for skipping redundant IPIs explained by David: https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/ - https://lore.kernel.org/linux-mm/20251229145245.85452-1-lance.yang@linux.dev/ v1 -> v2: - Fix cover letter encoding to resolve send-email issues. Apologies for any email flood caused by the failed send attempts :( RFC -> v1: - Use a callback function in pv_mmu_ops instead of comparing function pointers (per David) - Embed the check directly in tlb_remove_table_sync_one() instead of requiring every caller to check explicitly (per David) - Move tlb_table_flush_implies_ipi_broadcast() outside of CONFIG_MMU_GATHER_RCU_TABLE_FREE to fix build error on architectures that don't enable this config. https://lore.kernel.org/oe-kbuild-all/202512142156.cShiu6PU-lkp@intel.com/ - https://lore.kernel.org/linux-mm/20251213080038.10917-1-lance.yang@linux.dev/ Lance Yang (2): mm/tlb: skip redundant IPI when TLB flush already synchronized mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI arch/x86/include/asm/tlb.h | 3 ++- arch/x86/include/asm/tlbflush.h | 9 +++++---- arch/x86/kernel/alternative.c | 2 +- arch/x86/kernel/ldt.c | 2 +- arch/x86/mm/tlb.c | 22 +++++++++++++++------ include/asm-generic/tlb.h | 14 +++++++++----- include/linux/pgtable.h | 13 +++++++++---- mm/khugepaged.c | 9 +++------ mm/mmu_gather.c | 26 ++++++++++++++++++------- mm/pgtable-generic.c | 34 +++++++++++++++++++++++++++++++++ 10 files changed, 99 insertions(+), 35 deletions(-) -- 2.49.0 ^ permalink raw reply [flat|nested] 16+ messages in thread
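To make the core claim above concrete: GUP-fast performs its lockless page-table walk entirely with local interrupts disabled, so any IPI that has completed on every CPU has, by construction, waited out every walker that could still see the old entries. Below is a minimal sketch of that shape; it is illustrative only and not code from this series -- the real walk is lockless_pages_from_mm() in mm/gup.c, and the function name here is invented.

#include <linux/irqflags.h>
#include <linux/mm_types.h>

/*
 * Illustrative sketch only -- not code from this series. The real walk is
 * lockless_pages_from_mm() in mm/gup.c; the function name is invented.
 */
static unsigned long gup_fast_walk_sketch(struct mm_struct *mm,
                                          unsigned long start,
                                          unsigned long end)
{
        unsigned long flags, nr_pinned = 0;

        local_irq_save(flags);          /* a sync IPI cannot land here ... */
        /* ... lockless page-table walk, pinning pages ... */
        local_irq_restore(flags);       /* ... it can only arrive after this */

        return nr_pinned;
}

Once the on_each_cpu_cond_mask()/smp_call_function() round trip used for the TLB flush has returned, every GUP-fast walker that could have observed the old page-table entries has left its IRQs-off section, which is why a second, synchronization-only IPI adds nothing.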
* [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized 2026-01-06 12:03 [PATCH RESEND v3 0/2] skip redundant TLB sync IPIs Lance Yang @ 2026-01-06 12:03 ` Lance Yang 2026-01-06 15:19 ` David Hildenbrand (Red Hat) 2026-01-06 16:24 ` Dave Hansen 2026-01-06 12:03 ` [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI Lance Yang 1 sibling, 2 replies; 16+ messages in thread From: Lance Yang @ 2026-01-06 12:03 UTC (permalink / raw) To: akpm Cc: david, dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0, Lance Yang From: Lance Yang <lance.yang@linux.dev> When unsharing hugetlb PMD page tables, we currently send two IPIs: one for TLB invalidation, and another to synchronize with concurrent GUP-fast walkers via tlb_remove_table_sync_one(). However, if the TLB flush already sent IPIs to all CPUs (when freed_tables or unshared_tables is true), the second IPI is redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI completes, any concurrent GUP-fast must have finished. To avoid the redundant IPI, we add a flag to mmu_gather to track whether the TLB flush sent IPIs. We pass the mmu_gather pointer through the TLB flush path via flush_tlb_info, so native_flush_tlb_multi() can set the flag when it sends IPIs for freed_tables. We also set the flag for local-only flushes, since disabling IRQs provides the same guarantee. Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Lance Yang <lance.yang@linux.dev> --- arch/x86/include/asm/tlb.h | 3 ++- arch/x86/include/asm/tlbflush.h | 9 +++++---- arch/x86/kernel/alternative.c | 2 +- arch/x86/kernel/ldt.c | 2 +- arch/x86/mm/tlb.c | 22 ++++++++++++++++------ include/asm-generic/tlb.h | 14 +++++++++----- mm/mmu_gather.c | 26 +++++++++++++++++++------- 7 files changed, 53 insertions(+), 25 deletions(-) diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h index 866ea78ba156..c5950a92058c 100644 --- a/arch/x86/include/asm/tlb.h +++ b/arch/x86/include/asm/tlb.h @@ -20,7 +20,8 @@ static inline void tlb_flush(struct mmu_gather *tlb) end = tlb->end; } - flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables); + flush_tlb_mm_range(tlb->mm, start, end, stride_shift, + tlb->freed_tables || tlb->unshared_tables, tlb); } static inline void invlpg(unsigned long addr) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 00daedfefc1b..83c260c88b80 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -220,6 +220,7 @@ struct flush_tlb_info { * will be zero. */ struct mm_struct *mm; + struct mmu_gather *tlb; unsigned long start; unsigned long end; u64 new_tlb_gen; @@ -305,23 +306,23 @@ static inline bool mm_in_asid_transition(struct mm_struct *mm) { return false; } #endif #define flush_tlb_mm(mm) \ - flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true) + flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true, NULL) #define flush_tlb_range(vma, start, end) \ flush_tlb_mm_range((vma)->vm_mm, start, end, \ ((vma)->vm_flags & VM_HUGETLB) \ ? 
huge_page_shift(hstate_vma(vma)) \ - : PAGE_SHIFT, true) + : PAGE_SHIFT, true, NULL) extern void flush_tlb_all(void); extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned int stride_shift, - bool freed_tables); + bool freed_tables, struct mmu_gather *tlb); extern void flush_tlb_kernel_range(unsigned long start, unsigned long end); static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a) { - flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false); + flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false, NULL); } static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 28518371d8bf..006f3705b616 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -2572,7 +2572,7 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l */ flush_tlb_mm_range(text_poke_mm, text_poke_mm_addr, text_poke_mm_addr + (cross_page_boundary ? 2 : 1) * PAGE_SIZE, - PAGE_SHIFT, false); + PAGE_SHIFT, false, NULL); if (func == text_poke_memcpy) { /* diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index 0f19ef355f5f..d8494706fec5 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -374,7 +374,7 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt) } va = (unsigned long)ldt_slot_va(ldt->slot); - flush_tlb_mm_range(mm, va, va + nr_pages * PAGE_SIZE, PAGE_SHIFT, false); + flush_tlb_mm_range(mm, va, va + nr_pages * PAGE_SIZE, PAGE_SHIFT, false, NULL); } #else /* !CONFIG_MITIGATION_PAGE_TABLE_ISOLATION */ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index f5b93e01e347..be45976c0d16 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -1374,6 +1374,9 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask, else on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func, (void *)info, 1, cpumask); + + if (info->freed_tables && info->tlb) + info->tlb->tlb_flush_sent_ipi = true; } void flush_tlb_multi(const struct cpumask *cpumask, @@ -1403,7 +1406,7 @@ static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx); static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned int stride_shift, bool freed_tables, - u64 new_tlb_gen) + u64 new_tlb_gen, struct mmu_gather *tlb) { struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info); @@ -1433,6 +1436,7 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm, info->new_tlb_gen = new_tlb_gen; info->initiating_cpu = smp_processor_id(); info->trim_cpumask = 0; + info->tlb = tlb; return info; } @@ -1447,8 +1451,8 @@ static void put_flush_tlb_info(void) } void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, - unsigned long end, unsigned int stride_shift, - bool freed_tables) + unsigned long end, unsigned int stride_shift, + bool freed_tables, struct mmu_gather *tlb) { struct flush_tlb_info *info; int cpu = get_cpu(); @@ -1458,7 +1462,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, new_tlb_gen = inc_mm_tlb_gen(mm); info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables, - new_tlb_gen); + new_tlb_gen, tlb); /* * flush_tlb_multi() is not optimized for the common case in which only @@ -1476,6 +1480,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, local_irq_disable(); flush_tlb_func(info); local_irq_enable(); + /* + * Only current CPU uses 
this mm, so we can treat this as + * having synchronized with GUP-fast. No sync IPI needed. + */ + if (tlb && freed_tables) + tlb->tlb_flush_sent_ipi = true; } put_flush_tlb_info(); @@ -1553,7 +1563,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) guard(preempt)(); info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false, - TLB_GENERATION_INVALID); + TLB_GENERATION_INVALID, NULL); if (info->end == TLB_FLUSH_ALL) kernel_tlb_flush_all(info); @@ -1733,7 +1743,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) int cpu = get_cpu(); info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false, - TLB_GENERATION_INVALID); + TLB_GENERATION_INVALID, NULL); /* * flush_tlb_multi() is not optimized for the common case in which only * a local TLB flush is needed. Optimize this use-case by calling diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 3975f7d11553..cbbe008590ee 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -249,6 +249,7 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table) #define tlb_needs_table_invalidate() (true) #endif +void tlb_gather_remove_table_sync_one(struct mmu_gather *tlb); void tlb_remove_table_sync_one(void); #else @@ -257,6 +258,7 @@ void tlb_remove_table_sync_one(void); #error tlb_needs_table_invalidate() requires MMU_GATHER_RCU_TABLE_FREE #endif +static inline void tlb_gather_remove_table_sync_one(struct mmu_gather *tlb) { } static inline void tlb_remove_table_sync_one(void) { } #endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */ @@ -378,6 +380,12 @@ struct mmu_gather { */ unsigned int fully_unshared_tables : 1; + /* + * Did the TLB flush for freed/unshared tables send IPIs to all CPUs? + * If true, we can skip the redundant IPI in tlb_remove_table_sync_one(). + */ + unsigned int tlb_flush_sent_ipi : 1; + unsigned int batch_count; #ifndef CONFIG_MMU_GATHER_NO_GATHER @@ -833,13 +841,9 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb) * * We only perform this when we are the last sharer of a page table, * as the IPI will reach all CPUs: any GUP-fast. - * - * Note that on configs where tlb_remove_table_sync_one() is a NOP, - * the expectation is that the tlb_flush_mmu_tlbonly() would have issued - * required IPIs already for us. */ if (tlb->fully_unshared_tables) { - tlb_remove_table_sync_one(); + tlb_gather_remove_table_sync_one(tlb); tlb->fully_unshared_tables = false; } } diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index 2faa23d7f8d4..da36de52b281 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -273,8 +273,14 @@ static void tlb_remove_table_smp_sync(void *arg) /* Simply deliver the interrupt */ } -void tlb_remove_table_sync_one(void) +void tlb_gather_remove_table_sync_one(struct mmu_gather *tlb) { + /* Skip the IPI if the TLB flush already synchronized with other CPUs */ + if (tlb && tlb->tlb_flush_sent_ipi) { + tlb->tlb_flush_sent_ipi = false; + return; + } + /* * This isn't an RCU grace period and hence the page-tables cannot be * assumed to be actually RCU-freed. 
@@ -285,6 +291,11 @@ void tlb_remove_table_sync_one(void) smp_call_function(tlb_remove_table_smp_sync, NULL, 1); } +void tlb_remove_table_sync_one(void) +{ + tlb_gather_remove_table_sync_one(NULL); +} + static void tlb_remove_table_rcu(struct rcu_head *head) { __tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu)); @@ -328,7 +339,7 @@ static inline void __tlb_remove_table_one_rcu(struct rcu_head *head) __tlb_remove_table(ptdesc); } -static inline void __tlb_remove_table_one(void *table) +static inline void __tlb_remove_table_one(void *table, struct mmu_gather *tlb) { struct ptdesc *ptdesc; @@ -336,16 +347,16 @@ static inline void __tlb_remove_table_one(void *table) call_rcu(&ptdesc->pt_rcu_head, __tlb_remove_table_one_rcu); } #else -static inline void __tlb_remove_table_one(void *table) +static inline void __tlb_remove_table_one(void *table, struct mmu_gather *tlb) { - tlb_remove_table_sync_one(); + tlb_gather_remove_table_sync_one(tlb); __tlb_remove_table(table); } #endif /* CONFIG_PT_RECLAIM */ -static void tlb_remove_table_one(void *table) +static void tlb_remove_table_one(void *table, struct mmu_gather *tlb) { - __tlb_remove_table_one(table); + __tlb_remove_table_one(table, tlb); } static void tlb_table_flush(struct mmu_gather *tlb) @@ -367,7 +378,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table) *batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT); if (*batch == NULL) { tlb_table_invalidate(tlb); - tlb_remove_table_one(table); + tlb_remove_table_one(table, tlb); return; } (*batch)->nr = 0; @@ -427,6 +438,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, tlb->vma_pfn = 0; tlb->fully_unshared_tables = 0; + tlb->tlb_flush_sent_ipi = 0; __tlb_reset_range(tlb); inc_tlb_flush_pending(tlb->mm); } -- 2.49.0 ^ permalink raw reply [flat|nested] 16+ messages in thread
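As a reading aid for the hunks above, the way the new flag travels through the touched files can be summarized as a call-chain sketch; this only restates what the diff does and is not additional code.

/*
 * Call-chain summary of this patch (reading aid, not new code):
 *
 * tlb_flush_unshared_tables(tlb)                 include/asm-generic/tlb.h
 *   -> tlb_flush_mmu_tlbonly(tlb)
 *        -> tlb_flush(tlb)                       arch/x86/include/asm/tlb.h
 *             -> flush_tlb_mm_range(mm, start, end, stride_shift,
 *                                   freed_tables, tlb)    arch/x86/mm/tlb.c
 *                  -> flush_tlb_multi(): only native_flush_tlb_multi()
 *                     sets tlb->tlb_flush_sent_ipi, after sending IPIs and
 *                     only when freed_tables is set and a gather is
 *                     attached; the local-only path sets the flag directly
 *                     under the same condition
 *   -> tlb_gather_remove_table_sync_one(tlb)     mm/mmu_gather.c
 *        consumes and clears tlb_flush_sent_ipi, skipping the extra
 *        smp_call_function(tlb_remove_table_smp_sync, ...) when it was set
 */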
* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized 2026-01-06 12:03 ` [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized Lance Yang @ 2026-01-06 15:19 ` David Hildenbrand (Red Hat) 2026-01-06 16:10 ` Lance Yang 2026-01-06 16:24 ` Dave Hansen 1 sibling, 1 reply; 16+ messages in thread From: David Hildenbrand (Red Hat) @ 2026-01-06 15:19 UTC (permalink / raw) To: Lance Yang, akpm Cc: dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0 > > static void tlb_table_flush(struct mmu_gather *tlb) > @@ -367,7 +378,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table) > *batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT); > if (*batch == NULL) { > tlb_table_invalidate(tlb); > - tlb_remove_table_one(table); > + tlb_remove_table_one(table, tlb); > return; > } > (*batch)->nr = 0; > @@ -427,6 +438,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, > tlb->vma_pfn = 0; > > tlb->fully_unshared_tables = 0; > + tlb->tlb_flush_sent_ipi = 0; > __tlb_reset_range(tlb); > inc_tlb_flush_pending(tlb->mm); > } But when would we have to reset tlb->tlb_flush_sent_ipi = 0 later? That's where it gets tricky. Just imagine the MMU gather gets reused later. Also, + if (info->freed_tables && info->tlb) + info->tlb->tlb_flush_sent_ipi = true; in native_flush_tlb_multi() misses the fact that we have different flushing types for removed/unshared tables vs. other flush. So this approach more here certainly gets more complicated and error prone. tlb_table_flush_implies_ipi_broadcast() was clearer in that regard: if you flushed the TLB after removing /unsharing tables, the IPI for handling page tables can be skipped. It's on the code flow to assure that. What could work is tracking "tlb_table_flush_sent_ipi" really when we are flushing the TLB for removed/unshared tables, and maybe resetting it ... I don't know when from the top of my head. v2 was simpler IMHO. -- Cheers David ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized 2026-01-06 15:19 ` David Hildenbrand (Red Hat) @ 2026-01-06 16:10 ` Lance Yang 2026-01-07 6:37 ` Lance Yang 2026-01-09 14:13 ` David Hildenbrand (Red Hat) 0 siblings, 2 replies; 16+ messages in thread From: Lance Yang @ 2026-01-06 16:10 UTC (permalink / raw) To: David Hildenbrand (Red Hat) Cc: dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0 On 2026/1/6 23:19, David Hildenbrand (Red Hat) wrote: >> static void tlb_table_flush(struct mmu_gather *tlb) >> @@ -367,7 +378,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void >> *table) >> *batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT); >> if (*batch == NULL) { >> tlb_table_invalidate(tlb); >> - tlb_remove_table_one(table); >> + tlb_remove_table_one(table, tlb); >> return; >> } >> (*batch)->nr = 0; >> @@ -427,6 +438,7 @@ static void __tlb_gather_mmu(struct mmu_gather >> *tlb, struct mm_struct *mm, >> tlb->vma_pfn = 0; >> tlb->fully_unshared_tables = 0; >> + tlb->tlb_flush_sent_ipi = 0; >> __tlb_reset_range(tlb); >> inc_tlb_flush_pending(tlb->mm); >> } > > But when would we have to reset tlb->tlb_flush_sent_ipi = 0 later? > That's where it gets tricky. Just imagine the MMU gather gets reused later. > > Also, > > + if (info->freed_tables && info->tlb) > + info->tlb->tlb_flush_sent_ipi = true; > > in native_flush_tlb_multi() misses the fact that we have different > flushing types for removed/unshared tables vs. other flush. > > So this approach more here certainly gets more complicated and error prone. Agreed. Tracking the flag through mmu_gather lifecycle does get more complicated and error-prone ... > > tlb_table_flush_implies_ipi_broadcast() was clearer in that regard: if > you flushed the TLB after removing /unsharing tables, the IPI for > handling page tables can be skipped. It's on the code flow to assure that. v2 was definitely simpler. > > What could work is tracking "tlb_table_flush_sent_ipi" really when we > are flushing the TLB for removed/unshared tables, and maybe resetting > it ... I don't know when from the top of my head. Not sure what's the best way forward here :( > > v2 was simpler IMHO. The main concern Dave raised was that with PV hypercalls or when INVLPGB is available, we can't tell from a static check whether IPIs were actually sent. Maybe that's acceptable, or we could find a simpler way to track that ... Open to suggestions! ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized 2026-01-06 16:10 ` Lance Yang @ 2026-01-07 6:37 ` Lance Yang 2026-01-09 14:11 ` David Hildenbrand (Red Hat) 2026-01-09 14:13 ` David Hildenbrand (Red Hat) 1 sibling, 1 reply; 16+ messages in thread From: Lance Yang @ 2026-01-07 6:37 UTC (permalink / raw) To: David Hildenbrand (Red Hat) Cc: dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0 Hi David, On 2026/1/7 00:10, Lance Yang wrote: [..] >> What could work is tracking "tlb_table_flush_sent_ipi" really when we >> are flushing the TLB for removed/unshared tables, and maybe resetting >> it ... I don't know when from the top of my head. > Seems like we could fix the issue that the flag lifetime was broken if the MMU gather gets reused by splitting the flush and reset. This ensures the flag stays valid between flush and sync. Now tlb_flush_unshared_tables() does: 1) __tlb_flush_mmu_tlbonly() - flush only, keeps flags alive 2) tlb_gather_remove_table_sync_one() - can check the flag 3) __tlb_reset_range() - reset everything after sync Something like this: ---8<--- diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 3975f7d11553..a95b054dfcca 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -415,6 +415,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb) tlb->cleared_puds = 0; tlb->cleared_p4ds = 0; tlb->unshared_tables = 0; + tlb->tlb_flush_sent_ipi = 0; /* * Do not reset mmu_gather::vma_* fields here, we do not * call into tlb_start_vma() again to set them if there is an @@ -492,7 +493,7 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma) tlb->vma_pfn |= !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)); } -static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) +static inline void __tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { /* * Anything calling __tlb_adjust_range() also sets at least one of @@ -503,6 +504,11 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) return; tlb_flush(tlb); +} + +static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) +{ + __tlb_flush_mmu_tlbonly(tlb); __tlb_reset_range(tlb); } @@ -824,7 +830,7 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb) * flush the TLB for the unsharer now. */ if (tlb->unshared_tables) - tlb_flush_mmu_tlbonly(tlb); + __tlb_flush_mmu_tlbonly(tlb); /* * Similarly, we must make sure that concurrent GUP-fast will not @@ -834,14 +840,16 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb) * We only perform this when we are the last sharer of a page table, * as the IPI will reach all CPUs: any GUP-fast. * - * Note that on configs where tlb_remove_table_sync_one() is a NOP, - * the expectation is that the tlb_flush_mmu_tlbonly() would have issued - * required IPIs already for us. + * Use tlb_gather_remove_table_sync_one() instead of + * tlb_remove_table_sync_one() to skip the redundant IPI if the + * TLB flush above already sent one. */ if (tlb->fully_unshared_tables) { - tlb_remove_table_sync_one(); + tlb_gather_remove_table_sync_one(tlb); tlb->fully_unshared_tables = false; } + + __tlb_reset_range(tlb); } #endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */ --- For khugepaged, it should be fine - it uses a local mmu_gather that doesn't get reused. 
The lifetime is simply: tlb_gather_mmu() → flush → sync → tlb_finish_mmu() Let me know if this addresses your concern :) [...] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized 2026-01-07 6:37 ` Lance Yang @ 2026-01-09 14:11 ` David Hildenbrand (Red Hat) 0 siblings, 0 replies; 16+ messages in thread From: David Hildenbrand (Red Hat) @ 2026-01-09 14:11 UTC (permalink / raw) To: Lance Yang Cc: dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0 On 1/7/26 07:37, Lance Yang wrote: > Hi David, > > On 2026/1/7 00:10, Lance Yang wrote: > [..] >>> What could work is tracking "tlb_table_flush_sent_ipi" really when we >>> are flushing the TLB for removed/unshared tables, and maybe resetting >>> it ... I don't know when from the top of my head. >> > > Seems like we could fix the issue that the flag lifetime was broken > if the MMU gather gets reused by splitting the flush and reset. This > ensures the flag stays valid between flush and sync. > > Now tlb_flush_unshared_tables() does: > 1) __tlb_flush_mmu_tlbonly() - flush only, keeps flags alive > 2) tlb_gather_remove_table_sync_one() - can check the flag > 3) __tlb_reset_range() - reset everything after sync > > Something like this: > > ---8<--- > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h > index 3975f7d11553..a95b054dfcca 100644 > --- a/include/asm-generic/tlb.h > +++ b/include/asm-generic/tlb.h > @@ -415,6 +415,7 @@ static inline void __tlb_reset_range(struct > mmu_gather *tlb) > tlb->cleared_puds = 0; > tlb->cleared_p4ds = 0; > tlb->unshared_tables = 0; > + tlb->tlb_flush_sent_ipi = 0; As raised, the "tlb_flush_sent_ipi" is confusing when we sent to different CPUs based on whether we are removing page tables or not. I think you would really want to track that explicitly "tlb_table_flush_sent_ipi" ? > /* > * Do not reset mmu_gather::vma_* fields here, we do not > * call into tlb_start_vma() again to set them if there is an > @@ -492,7 +493,7 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct > vm_area_struct *vma) > tlb->vma_pfn |= !!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)); > } > > -static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) > +static inline void __tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) > { > /* > * Anything calling __tlb_adjust_range() also sets at least one of > @@ -503,6 +504,11 @@ static inline void tlb_flush_mmu_tlbonly(struct > mmu_gather *tlb) > return; > > tlb_flush(tlb); > +} > + > +static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) > +{ > + __tlb_flush_mmu_tlbonly(tlb); > __tlb_reset_range(tlb); > } > > @@ -824,7 +830,7 @@ static inline void tlb_flush_unshared_tables(struct > mmu_gather *tlb) > * flush the TLB for the unsharer now. > */ > if (tlb->unshared_tables) > - tlb_flush_mmu_tlbonly(tlb); > + __tlb_flush_mmu_tlbonly(tlb); > > /* > * Similarly, we must make sure that concurrent GUP-fast will not > @@ -834,14 +840,16 @@ static inline void > tlb_flush_unshared_tables(struct mmu_gather *tlb) > * We only perform this when we are the last sharer of a page table, > * as the IPI will reach all CPUs: any GUP-fast. > * > - * Note that on configs where tlb_remove_table_sync_one() is a NOP, > - * the expectation is that the tlb_flush_mmu_tlbonly() would have issued > - * required IPIs already for us. 
> + * Use tlb_gather_remove_table_sync_one() instead of > + * tlb_remove_table_sync_one() to skip the redundant IPI if the > + * TLB flush above already sent one. > */ > if (tlb->fully_unshared_tables) { > - tlb_remove_table_sync_one(); > + tlb_gather_remove_table_sync_one(tlb); > tlb->fully_unshared_tables = false; > } > + > + __tlb_reset_range(tlb); > } > #endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */ > --- > > For khugepaged, it should be fine - it uses a local mmu_gather that > doesn't get reused. The lifetime is simply: > > tlb_gather_mmu() → flush → sync → tlb_finish_mmu() > > Let me know if this addresses your concern :) I'll probably have to see the full picture. But this lifetime stuff in core-mm ends up getting more complicated than v2 without a clear benefit to me (except maybe handling some x86 oddities better ;) ) -- Cheers David ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized 2026-01-06 16:10 ` Lance Yang 2026-01-07 6:37 ` Lance Yang @ 2026-01-09 14:13 ` David Hildenbrand (Red Hat) 2026-01-09 15:30 ` Lance Yang 1 sibling, 1 reply; 16+ messages in thread From: David Hildenbrand (Red Hat) @ 2026-01-09 14:13 UTC (permalink / raw) To: Lance Yang Cc: dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0 >> What could work is tracking "tlb_table_flush_sent_ipi" really when we >> are flushing the TLB for removed/unshared tables, and maybe resetting >> it ... I don't know when from the top of my head. > > Not sure what's the best way forward here :( > >> >> v2 was simpler IMHO. > > The main concern Dave raised was that with PV hypercalls or when > INVLPGB is available, we can't tell from a static check whether IPIs > were actually sent. Why can't we set the boolean at runtime when initializing the pv_ops structure, when we are sure that it is allowed? -- Cheers David ^ permalink raw reply [flat|nested] 16+ messages in thread
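For discussion, one hypothetical shape of that idea -- the flag and the init hook below are invented names, and this is not code from v2 or v3: decide once, when the final flush method is known, whether remote flushes go out as IPIs, and let the v2-style helper tlb_table_flush_implies_ipi_broadcast() read that decision at runtime.

/*
 * Hypothetical sketch only: tlb_flush_uses_ipi_broadcast and
 * tlb_flush_disable_ipi_broadcast() are invented for illustration.
 */
static bool tlb_flush_uses_ipi_broadcast __ro_after_init = true;

/*
 * Called from x86 setup once it is known that remote flushes will not be
 * performed by IPI, e.g. when a paravirt hypercall replaces
 * pv_ops.mmu.flush_tlb_multi or when INVLPGB will be used.
 */
void __init tlb_flush_disable_ipi_broadcast(void)
{
        tlb_flush_uses_ipi_broadcast = false;
}

/* The v2-style static check could then become a runtime one. */
bool tlb_table_flush_implies_ipi_broadcast(void)
{
        return tlb_flush_uses_ipi_broadcast;
}

Whether a single boot-time decision covers all the cases Dave has in mind (e.g. the INVLPGB+no_global_asid case mentioned later in the thread) is the open question.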
* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized 2026-01-09 14:13 ` David Hildenbrand (Red Hat) @ 2026-01-09 15:30 ` Lance Yang 2026-01-09 15:40 ` David Hildenbrand (Red Hat) 0 siblings, 1 reply; 16+ messages in thread From: Lance Yang @ 2026-01-09 15:30 UTC (permalink / raw) To: David Hildenbrand (Red Hat), dave.hansen Cc: dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0 On 2026/1/9 22:13, David Hildenbrand (Red Hat) wrote: > >>> What could work is tracking "tlb_table_flush_sent_ipi" really when we >>> are flushing the TLB for removed/unshared tables, and maybe resetting >>> it ... I don't know when from the top of my head. >> >> Not sure what's the best way forward here :( >> >>> >>> v2 was simpler IMHO. >> >> The main concern Dave raised was that with PV hypercalls or when >> INVLPGB is available, we can't tell from a static check whether IPIs >> were actually sent. > > Why can't we set the boolean at runtime when initializing the pv_ops > structure, when we are sure that it is allowed? Yes, thanks, that sounds like a reasonable trade-off :) As you mentioned: "this lifetime stuff in core-mm ends up getting more complicated than v2 without a clear benefit". I totally agree that v3 is too complicated :( But Dave's concern about v2 was that we can't accurately tell whether IPIs were actually sent in PV environments or with INVLPGB, which misses optimization opportunities. The INVLPGB+no_global_asid case also sends IPIs during TLB flush. Anyway, yeah, I'd rather start with a simple approach, even if it's not perfect. We can always improve it later ;) Any ideas on how to move forward? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized 2026-01-09 15:30 ` Lance Yang @ 2026-01-09 15:40 ` David Hildenbrand (Red Hat) 0 siblings, 0 replies; 16+ messages in thread From: David Hildenbrand (Red Hat) @ 2026-01-09 15:40 UTC (permalink / raw) To: Lance Yang, dave.hansen Cc: dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0 On 1/9/26 16:30, Lance Yang wrote: > > > On 2026/1/9 22:13, David Hildenbrand (Red Hat) wrote: >> >>>> What could work is tracking "tlb_table_flush_sent_ipi" really when we >>>> are flushing the TLB for removed/unshared tables, and maybe resetting >>>> it ... I don't know when from the top of my head. >>> >>> Not sure what's the best way forward here :( >>> >>>> >>>> v2 was simpler IMHO. >>> >>> The main concern Dave raised was that with PV hypercalls or when >>> INVLPGB is available, we can't tell from a static check whether IPIs >>> were actually sent. >> >> Why can't we set the boolean at runtime when initializing the pv_ops >> structure, when we are sure that it is allowed? > > Yes, thanks, that sounds like a reasonable trade-off :) > > As you mentioned: > > "this lifetime stuff in core-mm ends up getting more complicated than > v2 without a clear benefit". > > I totally agree that v3 is too complicated :( > > But Dave's concern about v2 was that we can't accurately tell whether > IPIs were actually sent in PV environments or with INVLPGB, which > misses optimization opportunities. The INVLPGB+no_global_asid case > also sends IPIs during TLB flush. > > Anyway, yeah, I'd rather start with a simple approach, even if it's > not perfect. We can always improve it later ;) > > Any ideas on how to move forward? I'd hope Dave can comment :) In general, I saw the whole thing as a two-step process: 1) Avoid IPIs completely when the TLB flush sent them. We can achieve that through v2 or v3, one way or the other, I don't particularly care as long as it is clean and simple. 2) For other configs/arch, send IPIs only to CPUs that are actually in GUP-fast etc. That would resolve some RT headache with broadcast IPIs. Regarding 2), it obviously only applies to setups where 1) does not apply: like x86 with INVLPGB or arm64. I once had the idea of letting CPUs that enter/exit GUP-fast (and similar) indicate in a global cpumask (or per-CPU variables) that they are in that context. Then, we can just collect these CPUs and limit the IPIs to them (usually, not a lot ...). The trick here is to not slow down GUP-fast too much. And one person (Yair, in an RT context) who played with that was not able to reduce the overhead sufficiently. I guess the options are a) Per-MM CPU mask we have to update atomically when entering/leaving GUP-fast b) Global mask we have to update atomically when entering/leaving GUP-fast c) Per-CPU variable we have to update when entering/leaving GUP-fast. Interrupts are disabled, so we don't have to worry about reschedule etc. Maybe someone reading along has other thoughts. -- Cheers David ^ permalink raw reply [flat|nested] 16+ messages in thread
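As a rough illustration of option c) above -- every name below is invented, and the overhead and ordering questions just raised are deliberately left open -- the shape could be a per-CPU marker updated inside the already IRQs-off GUP-fast section, plus a sync helper that only interrupts the CPUs currently in GUP-fast:

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/gfp.h>

/*
 * Rough sketch of option c) only; all names are invented and the
 * overhead/ordering concerns discussed above are not addressed here.
 */
static DEFINE_PER_CPU(bool, in_gup_fast);

static inline void gup_fast_enter(void)
{
        /* Caller already has IRQs disabled, so no migration can occur. */
        this_cpu_write(in_gup_fast, true);
}

static inline void gup_fast_exit(void)
{
        this_cpu_write(in_gup_fast, false);
}

/* Empty IPI handler: delivering the interrupt is the synchronization. */
static void gup_fast_sync_ipi(void *arg)
{
}

static void tlb_sync_gup_fast_cpus(void)
{
        cpumask_var_t mask;
        int cpu;

        if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
                /* Fall back to the existing broadcast behaviour. */
                smp_call_function(gup_fast_sync_ipi, NULL, 1);
                return;
        }

        for_each_online_cpu(cpu)
                if (per_cpu(in_gup_fast, cpu))
                        cpumask_set_cpu(cpu, mask);

        preempt_disable();
        smp_call_function_many(mask, gup_fast_sync_ipi, NULL, 1);
        preempt_enable();
        free_cpumask_var(mask);
}

The racy snapshot is arguably fine for walkers that start after the entry was cleared, since (as the khugepaged comment in patch 2 notes) GUP-fast backs off when it sees the PMD changed, but that is exactly the kind of argument that would need to be spelled out.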
* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized 2026-01-06 12:03 ` [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized Lance Yang 2026-01-06 15:19 ` David Hildenbrand (Red Hat) @ 2026-01-06 16:24 ` Dave Hansen 2026-01-07 2:47 ` Lance Yang 1 sibling, 1 reply; 16+ messages in thread From: Dave Hansen @ 2026-01-06 16:24 UTC (permalink / raw) To: Lance Yang, akpm Cc: david, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0 On 1/6/26 04:03, Lance Yang wrote: > From: Lance Yang <lance.yang@linux.dev> > > When unsharing hugetlb PMD page tables, we currently send two IPIs: one > for TLB invalidation, and another to synchronize with concurrent GUP-fast > walkers via tlb_remove_table_sync_one(). > > However, if the TLB flush already sent IPIs to all CPUs (when freed_tables > or unshared_tables is true), the second IPI is redundant. GUP-fast runs > with IRQs disabled, so when the TLB flush IPI completes, any concurrent > GUP-fast must have finished. > > To avoid the redundant IPI, we add a flag to mmu_gather to track whether > the TLB flush sent IPIs. We pass the mmu_gather pointer through the TLB > flush path via flush_tlb_info, so native_flush_tlb_multi() can set the > flag when it sends IPIs for freed_tables. We also set the flag for > local-only flushes, since disabling IRQs provides the same guarantee. The lack of imperative voice is killing me. :) > diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h > index 866ea78ba156..c5950a92058c 100644 > --- a/arch/x86/include/asm/tlb.h > +++ b/arch/x86/include/asm/tlb.h > @@ -20,7 +20,8 @@ static inline void tlb_flush(struct mmu_gather *tlb) > end = tlb->end; > } > > - flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables); > + flush_tlb_mm_range(tlb->mm, start, end, stride_shift, > + tlb->freed_tables || tlb->unshared_tables, tlb); > } I think this hunk sums up v3 pretty well. Where there was a single boolean, now there are two. To add to that, the structure that contains the booleans is itself being passed in. The boolean is still named 'freed_tables', and is going from: tlb->freed_tables which is pretty obviously correct to: tlb->freed_tables || tlb->unshared_tables which is _far_ from obviously correct. I'm at a loss for why the patch wouldn't just do this: - flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables); + flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb); I suspect these were sent out in a bit of haste, which isn't the first time I've gotten that feeling with this series. Could we slow down, please? > static inline void invlpg(unsigned long addr) > diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h > index 00daedfefc1b..83c260c88b80 100644 > --- a/arch/x86/include/asm/tlbflush.h > +++ b/arch/x86/include/asm/tlbflush.h > @@ -220,6 +220,7 @@ struct flush_tlb_info { > * will be zero. > */ > struct mm_struct *mm; > + struct mmu_gather *tlb; > unsigned long start; > unsigned long end; > u64 new_tlb_gen; This also gives me pause. There is a *lot* of redundant information between 'struct mmu_gather' and 'struct tlb_flush_info'. There needs to at least be a description of what the relationship is and how these relate to each other. 
I would have naively thought that the right move here would be to pull the mmu_gather data out at one discrete time rather than store a pointer to it. What I see here is, I suspect, the most expedient way to do it. I'd _certainly_ have done this myself if I was just hacking something together to play with as quickly as possible. So, in the end, I don't hate the approach here (yet). But it is almost impossible to evaluate it because the series is taking some rather egregious shortcuts and is lacking any real semblance of a refactoring effort. ^ permalink raw reply [flat|nested] 16+ messages in thread
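One possible reading of Dave's "pull the mmu_gather data out at one discrete time" remark, sketched purely for discussion -- the struct and function names below are invented and this is not a proposal from the series: copy the single bit the flush cares about out of the mmu_gather up front, and have the flush report back whether IPIs were sent, so flush_tlb_info never carries a pointer into the mmu_gather.

/*
 * Sketch of the "pull the data out at one discrete time" reading; all
 * names here are invented for illustration.
 */
struct tlb_flush_request {
        struct mm_struct *mm;
        unsigned long start;
        unsigned long end;
        unsigned int stride_shift;
        bool flush_freed_tables;        /* copied from the mmu_gather */
        bool sent_ipi;                  /* filled in by the arch flush */
};

/* Hypothetical arch hook; x86 would set req->sent_ipi in its IPI path. */
void arch_flush_tlb_request(struct tlb_flush_request *req);

static void tlb_flush_sketch(struct mmu_gather *tlb)
{
        struct tlb_flush_request req = {
                .mm                     = tlb->mm,
                .start                  = tlb->start,
                .end                    = tlb->end,
                .stride_shift           = PAGE_SHIFT,
                .flush_freed_tables     = tlb->freed_tables ||
                                          tlb->unshared_tables,
        };

        arch_flush_tlb_request(&req);

        /* The mmu_gather updates its own state; no pointer escapes. */
        if (req.sent_ipi)
                tlb->tlb_flush_sent_ipi = 1;
}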
* Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized 2026-01-06 16:24 ` Dave Hansen @ 2026-01-07 2:47 ` Lance Yang 0 siblings, 0 replies; 16+ messages in thread From: Lance Yang @ 2026-01-07 2:47 UTC (permalink / raw) To: Dave Hansen, david Cc: dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0, akpm On 2026/1/7 00:24, Dave Hansen wrote: > On 1/6/26 04:03, Lance Yang wrote: >> From: Lance Yang <lance.yang@linux.dev> >> >> When unsharing hugetlb PMD page tables, we currently send two IPIs: one >> for TLB invalidation, and another to synchronize with concurrent GUP-fast >> walkers via tlb_remove_table_sync_one(). >> >> However, if the TLB flush already sent IPIs to all CPUs (when freed_tables >> or unshared_tables is true), the second IPI is redundant. GUP-fast runs >> with IRQs disabled, so when the TLB flush IPI completes, any concurrent >> GUP-fast must have finished. >> >> To avoid the redundant IPI, we add a flag to mmu_gather to track whether >> the TLB flush sent IPIs. We pass the mmu_gather pointer through the TLB >> flush path via flush_tlb_info, so native_flush_tlb_multi() can set the >> flag when it sends IPIs for freed_tables. We also set the flag for >> local-only flushes, since disabling IRQs provides the same guarantee. > > The lack of imperative voice is killing me. :) Oops. > >> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h >> index 866ea78ba156..c5950a92058c 100644 >> --- a/arch/x86/include/asm/tlb.h >> +++ b/arch/x86/include/asm/tlb.h >> @@ -20,7 +20,8 @@ static inline void tlb_flush(struct mmu_gather *tlb) >> end = tlb->end; >> } >> >> - flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables); >> + flush_tlb_mm_range(tlb->mm, start, end, stride_shift, >> + tlb->freed_tables || tlb->unshared_tables, tlb); >> } > > I think this hunk sums up v3 pretty well. Where there was a single boolean, now there are two. To add to that, the structure that contains the booleans is itself being passed in. The boolean is still named 'freed_tables', and is going from: > > tlb->freed_tables > > which is pretty obviously correct to: > > tlb->freed_tables || tlb->unshared_tables > > which is _far_ from obviously correct. > > I'm at a loss for why the patch wouldn't just do this: > > - flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables); > + flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb); > > I suspect these were sent out in a bit of haste, which isn't the first time I've gotten that feeling with this series. > > Could we slow down, please? Sorry, I went too fast ... > >> static inline void invlpg(unsigned long addr) >> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h >> index 00daedfefc1b..83c260c88b80 100644 >> --- a/arch/x86/include/asm/tlbflush.h >> +++ b/arch/x86/include/asm/tlbflush.h >> @@ -220,6 +220,7 @@ struct flush_tlb_info { >> * will be zero. >> */ >> struct mm_struct *mm; >> + struct mmu_gather *tlb; >> unsigned long start; >> unsigned long end; >> u64 new_tlb_gen; > > This also gives me pause. > > There is a *lot* of redundant information between 'struct mmu_gather' and 'struct tlb_flush_info'. There needs to at least be a description of what the relationship is and how these relate to each other. 
> I would have naively thought that the right move here would be to pull the mmu_gather data out at one discrete time rather than store a pointer to it. > > What I see here is, I suspect, the most expedient way to do it. I'd _certainly_ have done this myself if I was just hacking something together to play with as quickly as possible. > > So, in the end, I don't hate the approach here (yet). But it is almost impossible to evaluate it because the series is taking some rather egregious shortcuts and is lacking any real semblance of a refactoring effort. The flag lifetime issue David pointed out is real, and you're right about the messy parameters :) And, yeah, I need to think more about those. Maybe v3 can be fixed, or maybe v2 is actually sufficient - it's conservative but safe (no false positives). Will take more time, thanks! ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI 2026-01-06 12:03 [PATCH RESEND v3 0/2] skip redundant TLB sync IPIs Lance Yang 2026-01-06 12:03 ` [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized Lance Yang @ 2026-01-06 12:03 ` Lance Yang 2026-01-06 15:07 ` David Hildenbrand (Red Hat) ` (2 more replies) 1 sibling, 3 replies; 16+ messages in thread From: Lance Yang @ 2026-01-06 12:03 UTC (permalink / raw) To: akpm Cc: david, dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0, Lance Yang From: Lance Yang <lance.yang@linux.dev> pmdp_collapse_flush() may already send IPIs to flush TLBs, and then callers send another IPI via tlb_remove_table_sync_one() or pmdp_get_lockless_sync() to synchronize with concurrent GUP-fast walkers. However, since GUP-fast runs with IRQs disabled, the TLB flush IPI already provides the necessary synchronization. We can avoid the redundant second IPI. Introduce pmdp_collapse_flush_sync() which combines flush and sync: - For architectures using the generic pmdp_collapse_flush() implementation (e.g., x86): Use mmu_gather to track IPI sends. If the TLB flush sent an IPI, tlb_gather_remove_table_sync_one() will skip the redundant one. - For architectures with custom pmdp_collapse_flush() (s390, riscv, powerpc): Fall back to calling pmdp_collapse_flush() followed by tlb_remove_table_sync_one(). No behavior change. Update khugepaged to use pmdp_collapse_flush_sync() instead of separate flush and sync calls. Remove the now-unused pmdp_get_lockless_sync() macro. 
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> Signed-off-by: Lance Yang <lance.yang@linux.dev> --- include/linux/pgtable.h | 13 +++++++++---- mm/khugepaged.c | 9 +++------ mm/pgtable-generic.c | 34 ++++++++++++++++++++++++++++++++++ 3 files changed, 46 insertions(+), 10 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index eb8aacba3698..69e290dab450 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -755,7 +755,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp) return pmd; } #define pmdp_get_lockless pmdp_get_lockless -#define pmdp_get_lockless_sync() tlb_remove_table_sync_one() #endif /* CONFIG_PGTABLE_LEVELS > 2 */ #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */ @@ -774,9 +773,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp) { return pmdp_get(pmdp); } -static inline void pmdp_get_lockless_sync(void) -{ -} #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE @@ -1174,6 +1170,8 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm, #ifdef CONFIG_TRANSPARENT_HUGEPAGE extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); +extern pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp); #else static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address, @@ -1182,6 +1180,13 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, BUILD_BUG(); return *pmdp; } +static inline pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, + unsigned long address, + pmd_t *pmdp) +{ + BUILD_BUG(); + return *pmdp; +} #define pmdp_collapse_flush pmdp_collapse_flush #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 9f790ec34400..0a98afc85c50 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1177,10 +1177,9 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a * Parallel GUP-fast is fine since GUP-fast will back off when * it detects PMD is changed. */ - _pmd = pmdp_collapse_flush(vma, address, pmd); + _pmd = pmdp_collapse_flush_sync(vma, address, pmd); spin_unlock(pmd_ptl); mmu_notifier_invalidate_range_end(&range); - tlb_remove_table_sync_one(); pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); if (pte) { @@ -1663,8 +1662,7 @@ static enum scan_result try_collapse_pte_mapped_thp(struct mm_struct *mm, unsign } } } - pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd); - pmdp_get_lockless_sync(); + pgt_pmd = pmdp_collapse_flush_sync(vma, haddr, pmd); pte_unmap_unlock(start_pte, ptl); if (ptl != pml) spin_unlock(pml); @@ -1817,8 +1815,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) * races against the prior checks. 
*/ if (likely(file_backed_vma_is_retractable(vma))) { - pgt_pmd = pmdp_collapse_flush(vma, addr, pmd); - pmdp_get_lockless_sync(); + pgt_pmd = pmdp_collapse_flush_sync(vma, addr, pmd); success = true; } diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index d3aec7a9926a..be2ee82e6fc4 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -233,6 +233,40 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address, flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); return pmd; } + +pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, unsigned long address, + pmd_t *pmdp) +{ + struct mmu_gather tlb; + pmd_t pmd; + + VM_BUG_ON(address & ~HPAGE_PMD_MASK); + VM_BUG_ON(pmd_trans_huge(*pmdp)); + + tlb_gather_mmu(&tlb, vma->vm_mm); + pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp); + + flush_tlb_mm_range(vma->vm_mm, address, address + HPAGE_PMD_SIZE, + PAGE_SHIFT, true, &tlb); + + /* + * Synchronize with GUP-fast. If the flush sent IPIs, skip the + * redundant sync IPI. + */ + tlb_gather_remove_table_sync_one(&tlb); + tlb_finish_mmu(&tlb); + return pmd; +} +#else +pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, unsigned long address, + pmd_t *pmdp) +{ + pmd_t pmd; + + pmd = pmdp_collapse_flush(vma, address, pmdp); + tlb_remove_table_sync_one(); + return pmd; +} #endif /* arch define pte_free_defer in asm/pgalloc.h for its own implementation */ -- 2.49.0 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI 2026-01-06 12:03 ` [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI Lance Yang @ 2026-01-06 15:07 ` David Hildenbrand (Red Hat) 2026-01-06 15:41 ` Lance Yang 2026-01-07 9:46 ` kernel test robot 2026-01-07 10:52 ` kernel test robot 2 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand (Red Hat) @ 2026-01-06 15:07 UTC (permalink / raw) To: Lance Yang, akpm Cc: dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0 On 1/6/26 13:03, Lance Yang wrote: > From: Lance Yang <lance.yang@linux.dev> > > pmdp_collapse_flush() may already send IPIs to flush TLBs, and then > callers send another IPI via tlb_remove_table_sync_one() or > pmdp_get_lockless_sync() to synchronize with concurrent GUP-fast walkers. > > However, since GUP-fast runs with IRQs disabled, the TLB flush IPI already > provides the necessary synchronization. We can avoid the redundant second > IPI. > > Introduce pmdp_collapse_flush_sync() which combines flush and sync: > > - For architectures using the generic pmdp_collapse_flush() implementation > (e.g., x86): Use mmu_gather to track IPI sends. If the TLB flush sent > an IPI, tlb_gather_remove_table_sync_one() will skip the redundant one. > > - For architectures with custom pmdp_collapse_flush() (s390, riscv, > powerpc): Fall back to calling pmdp_collapse_flush() followed by > tlb_remove_table_sync_one(). No behavior change. > > Update khugepaged to use pmdp_collapse_flush_sync() instead of separate > flush and sync calls. Remove the now-unused pmdp_get_lockless_sync() macro. 
> > Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> > Signed-off-by: Lance Yang <lance.yang@linux.dev> > --- > include/linux/pgtable.h | 13 +++++++++---- > mm/khugepaged.c | 9 +++------ > mm/pgtable-generic.c | 34 ++++++++++++++++++++++++++++++++++ > 3 files changed, 46 insertions(+), 10 deletions(-) > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index eb8aacba3698..69e290dab450 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -755,7 +755,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp) > return pmd; > } > #define pmdp_get_lockless pmdp_get_lockless > -#define pmdp_get_lockless_sync() tlb_remove_table_sync_one() > #endif /* CONFIG_PGTABLE_LEVELS > 2 */ > #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */ > > @@ -774,9 +773,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp) > { > return pmdp_get(pmdp); > } > -static inline void pmdp_get_lockless_sync(void) > -{ > -} > #endif > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > @@ -1174,6 +1170,8 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm, > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, > unsigned long address, pmd_t *pmdp); > +extern pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, > + unsigned long address, pmd_t *pmdp); > #else > static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, > unsigned long address, > @@ -1182,6 +1180,13 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, > BUILD_BUG(); > return *pmdp; > } > +static inline pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, > + unsigned long address, > + pmd_t *pmdp) > +{ > + BUILD_BUG(); > + return *pmdp; > +} > #define pmdp_collapse_flush pmdp_collapse_flush > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > #endif > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 9f790ec34400..0a98afc85c50 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1177,10 +1177,9 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a > * Parallel GUP-fast is fine since GUP-fast will back off when > * it detects PMD is changed. > */ > - _pmd = pmdp_collapse_flush(vma, address, pmd); > + _pmd = pmdp_collapse_flush_sync(vma, address, pmd); > spin_unlock(pmd_ptl); > mmu_notifier_invalidate_range_end(&range); > - tlb_remove_table_sync_one(); Now you issue the IPI under PTL. [...] > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c > index d3aec7a9926a..be2ee82e6fc4 100644 > --- a/mm/pgtable-generic.c > +++ b/mm/pgtable-generic.c > @@ -233,6 +233,40 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address, > flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); > return pmd; > } > + > +pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, unsigned long address, > + pmd_t *pmdp) > +{ > + struct mmu_gather tlb; > + pmd_t pmd; > + > + VM_BUG_ON(address & ~HPAGE_PMD_MASK); > + VM_BUG_ON(pmd_trans_huge(*pmdp)); > + > + tlb_gather_mmu(&tlb, vma->vm_mm); Should we be using the new tlb_gather_mmu_vma(), and do we have to set the TLB pagesize to PMD? -- Cheers David ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI 2026-01-06 15:07 ` David Hildenbrand (Red Hat) @ 2026-01-06 15:41 ` Lance Yang 0 siblings, 0 replies; 16+ messages in thread From: Lance Yang @ 2026-01-06 15:41 UTC (permalink / raw) To: David Hildenbrand (Red Hat) Cc: dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0 On 2026/1/6 23:07, David Hildenbrand (Red Hat) wrote: > On 1/6/26 13:03, Lance Yang wrote: >> From: Lance Yang <lance.yang@linux.dev> >> >> pmdp_collapse_flush() may already send IPIs to flush TLBs, and then >> callers send another IPI via tlb_remove_table_sync_one() or >> pmdp_get_lockless_sync() to synchronize with concurrent GUP-fast walkers. >> >> However, since GUP-fast runs with IRQs disabled, the TLB flush IPI >> already >> provides the necessary synchronization. We can avoid the redundant second >> IPI. >> >> Introduce pmdp_collapse_flush_sync() which combines flush and sync: >> >> - For architectures using the generic pmdp_collapse_flush() >> implementation >> (e.g., x86): Use mmu_gather to track IPI sends. If the TLB flush sent >> an IPI, tlb_gather_remove_table_sync_one() will skip the redundant >> one. >> >> - For architectures with custom pmdp_collapse_flush() (s390, riscv, >> powerpc): Fall back to calling pmdp_collapse_flush() followed by >> tlb_remove_table_sync_one(). No behavior change. >> >> Update khugepaged to use pmdp_collapse_flush_sync() instead of separate >> flush and sync calls. Remove the now-unused pmdp_get_lockless_sync() >> macro. >> >> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> >> Signed-off-by: Lance Yang <lance.yang@linux.dev> >> --- >> include/linux/pgtable.h | 13 +++++++++---- >> mm/khugepaged.c | 9 +++------ >> mm/pgtable-generic.c | 34 ++++++++++++++++++++++++++++++++++ >> 3 files changed, 46 insertions(+), 10 deletions(-) >> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >> index eb8aacba3698..69e290dab450 100644 >> --- a/include/linux/pgtable.h >> +++ b/include/linux/pgtable.h >> @@ -755,7 +755,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp) >> return pmd; >> } >> #define pmdp_get_lockless pmdp_get_lockless >> -#define pmdp_get_lockless_sync() tlb_remove_table_sync_one() >> #endif /* CONFIG_PGTABLE_LEVELS > 2 */ >> #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */ >> @@ -774,9 +773,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp) >> { >> return pmdp_get(pmdp); >> } >> -static inline void pmdp_get_lockless_sync(void) >> -{ >> -} >> #endif >> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >> @@ -1174,6 +1170,8 @@ static inline void pudp_set_wrprotect(struct >> mm_struct *mm, >> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >> extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, >> unsigned long address, pmd_t *pmdp); >> +extern pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, >> + unsigned long address, pmd_t *pmdp); >> #else >> static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, >> unsigned long address, >> @@ -1182,6 +1180,13 @@ static inline pmd_t pmdp_collapse_flush(struct >> vm_area_struct *vma, >> BUILD_BUG(); >> return *pmdp; >> } >> +static inline pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, >> + unsigned long address, >> + pmd_t *pmdp) >> +{ >> + BUILD_BUG(); >> + return *pmdp; >> +} >> #define pmdp_collapse_flush 
pmdp_collapse_flush >> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ >> #endif >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >> index 9f790ec34400..0a98afc85c50 100644 >> --- a/mm/khugepaged.c >> +++ b/mm/khugepaged.c >> @@ -1177,10 +1177,9 @@ static enum scan_result >> collapse_huge_page(struct mm_struct *mm, unsigned long a >> * Parallel GUP-fast is fine since GUP-fast will back off when >> * it detects PMD is changed. >> */ >> - _pmd = pmdp_collapse_flush(vma, address, pmd); >> + _pmd = pmdp_collapse_flush_sync(vma, address, pmd); >> spin_unlock(pmd_ptl); >> mmu_notifier_invalidate_range_end(&range); >> - tlb_remove_table_sync_one(); > > Now you issue the IPI under PTL. We already send the TLB flush IPI under the PTL today, e.g. in try_collapse_pte_mapped_thp(): pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd); pmdp_get_lockless_sync(); pte_unmap_unlock(start_pte, ptl); But anyway, we can do better by passing the ptl in and unlocking it before issuing the sync IPI ;) > > [...] > >> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c >> index d3aec7a9926a..be2ee82e6fc4 100644 >> --- a/mm/pgtable-generic.c >> +++ b/mm/pgtable-generic.c >> @@ -233,6 +233,40 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct >> *vma, unsigned long address, >> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); >> return pmd; >> } >> + >> +pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, unsigned >> long address, >> + pmd_t *pmdp) >> +{ >> + struct mmu_gather tlb; >> + pmd_t pmd; >> + >> + VM_BUG_ON(address & ~HPAGE_PMD_MASK); >> + VM_BUG_ON(pmd_trans_huge(*pmdp)); >> + >> + tlb_gather_mmu(&tlb, vma->vm_mm); > > Should we be using the new tlb_gather_mmu_vma(), and do we have to set > the TLB pagesize to PMD? Yes, good point on tlb_gather_mmu_vma()! So the sequence would be: tlb_gather_mmu_vma(&tlb, vma); pmd = pmdp_huge_get_and_clear(...); flush_tlb_mm_range(..., &tlb); if (ptl) spin_unlock(ptl); tlb_gather_remove_table_sync_one(&tlb); tlb_finish_mmu(&tlb); Thanks, Lance ^ permalink raw reply [flat|nested] 16+ messages in thread
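[For readers following along, the reworked helper described by that sequence might look roughly like the sketch below. It is only an illustration of the discussion, not the actual patch: tlb_gather_mmu_vma() and tlb_gather_remove_table_sync_one() are the new helpers referred to in this thread, the extra ptl parameter follows the "unlock before the sync IPI" idea above, and the flush arguments (range, stride, freed_tables flag) are filled in as guesses. flush_tlb_mm_range() is the x86 entry point named in the reply, so a generic implementation would need an equivalent arch path.]

    /*
     * Rough sketch of the sequence in the reply above; not the final patch.
     * tlb_gather_mmu_vma() and tlb_gather_remove_table_sync_one() are the
     * new helpers discussed in this thread; the ptl parameter and the
     * flush arguments are assumptions made for illustration.
     */
    pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
                                   unsigned long address, pmd_t *pmdp,
                                   spinlock_t *ptl)
    {
            struct mmu_gather tlb;
            pmd_t pmd;

            VM_BUG_ON(address & ~HPAGE_PMD_MASK);
            VM_BUG_ON(pmd_trans_huge(*pmdp));

            tlb_gather_mmu_vma(&tlb, vma);  /* new helper (assumed) */
            pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
            /* Arguments below are guesses; patch 1/2 adds the tlb parameter. */
            flush_tlb_mm_range(vma->vm_mm, address, address + HPAGE_PMD_SIZE,
                               PMD_SHIFT, true, &tlb);
            if (ptl)
                    spin_unlock(ptl);       /* don't hold PTL across the sync IPI */
            /* Skipped internally if the flush above already sent IPIs. */
            tlb_gather_remove_table_sync_one(&tlb);
            tlb_finish_mmu(&tlb);

            return pmd;
    }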
* Re: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI 2026-01-06 12:03 ` [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI Lance Yang 2026-01-06 15:07 ` David Hildenbrand (Red Hat) @ 2026-01-07 9:46 ` kernel test robot 2026-01-07 10:52 ` kernel test robot 2 siblings, 0 replies; 16+ messages in thread From: kernel test robot @ 2026-01-07 9:46 UTC (permalink / raw) To: Lance Yang, akpm Cc: oe-kbuild-all, david, dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0, Lance Yang Hi Lance, kernel test robot noticed the following build errors: [auto build test ERROR on akpm-mm/mm-everything] [also build test ERROR on next-20260107] [cannot apply to tip/x86/core tip/x86/mm arnd-asm-generic/master linus/master v6.19-rc4] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Lance-Yang/mm-tlb-skip-redundant-IPI-when-TLB-flush-already-synchronized/20260106-200505 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20260106120303.38124-3-lance.yang%40linux.dev patch subject: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI config: s390-allnoconfig-bpf (https://download.01.org/0day-ci/archive/20260107/202601071005.oEsmtf0J-lkp@intel.com/config) compiler: s390-linux-gcc (GCC) 15.1.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260107/202601071005.oEsmtf0J-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202601071005.oEsmtf0J-lkp@intel.com/ All errors (new ones prefixed by >>): mm/khugepaged.c: In function 'collapse_huge_page': >> mm/khugepaged.c:1180:16: error: implicit declaration of function 'pmdp_collapse_flush_sync'; did you mean 'pmdp_collapse_flush'? [-Wimplicit-function-declaration] 1180 | _pmd = pmdp_collapse_flush_sync(vma, address, pmd); | ^~~~~~~~~~~~~~~~~~~~~~~~ | pmdp_collapse_flush vim +1180 mm/khugepaged.c 1092 1093 static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address, 1094 int referenced, int unmapped, 1095 struct collapse_control *cc) 1096 { 1097 LIST_HEAD(compound_pagelist); 1098 pmd_t *pmd, _pmd; 1099 pte_t *pte; 1100 pgtable_t pgtable; 1101 struct folio *folio; 1102 spinlock_t *pmd_ptl, *pte_ptl; 1103 enum scan_result result = SCAN_FAIL; 1104 struct vm_area_struct *vma; 1105 struct mmu_notifier_range range; 1106 1107 VM_BUG_ON(address & ~HPAGE_PMD_MASK); 1108 1109 /* 1110 * Before allocating the hugepage, release the mmap_lock read lock. 1111 * The allocation can take potentially a long time if it involves 1112 * sync compaction, and we do not need to hold the mmap_lock during 1113 * that. We will recheck the vma after taking it again in write mode. 
1114 */ 1115 mmap_read_unlock(mm); 1116 1117 result = alloc_charge_folio(&folio, mm, cc); 1118 if (result != SCAN_SUCCEED) 1119 goto out_nolock; 1120 1121 mmap_read_lock(mm); 1122 result = hugepage_vma_revalidate(mm, address, true, &vma, cc); 1123 if (result != SCAN_SUCCEED) { 1124 mmap_read_unlock(mm); 1125 goto out_nolock; 1126 } 1127 1128 result = find_pmd_or_thp_or_none(mm, address, &pmd); 1129 if (result != SCAN_SUCCEED) { 1130 mmap_read_unlock(mm); 1131 goto out_nolock; 1132 } 1133 1134 if (unmapped) { 1135 /* 1136 * __collapse_huge_page_swapin will return with mmap_lock 1137 * released when it fails. So we jump out_nolock directly in 1138 * that case. Continuing to collapse causes inconsistency. 1139 */ 1140 result = __collapse_huge_page_swapin(mm, vma, address, pmd, 1141 referenced); 1142 if (result != SCAN_SUCCEED) 1143 goto out_nolock; 1144 } 1145 1146 mmap_read_unlock(mm); 1147 /* 1148 * Prevent all access to pagetables with the exception of 1149 * gup_fast later handled by the ptep_clear_flush and the VM 1150 * handled by the anon_vma lock + PG_lock. 1151 * 1152 * UFFDIO_MOVE is prevented to race as well thanks to the 1153 * mmap_lock. 1154 */ 1155 mmap_write_lock(mm); 1156 result = hugepage_vma_revalidate(mm, address, true, &vma, cc); 1157 if (result != SCAN_SUCCEED) 1158 goto out_up_write; 1159 /* check if the pmd is still valid */ 1160 vma_start_write(vma); 1161 result = check_pmd_still_valid(mm, address, pmd); 1162 if (result != SCAN_SUCCEED) 1163 goto out_up_write; 1164 1165 anon_vma_lock_write(vma->anon_vma); 1166 1167 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, 1168 address + HPAGE_PMD_SIZE); 1169 mmu_notifier_invalidate_range_start(&range); 1170 1171 pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ 1172 /* 1173 * This removes any huge TLB entry from the CPU so we won't allow 1174 * huge and small TLB entries for the same virtual address to 1175 * avoid the risk of CPU bugs in that area. 1176 * 1177 * Parallel GUP-fast is fine since GUP-fast will back off when 1178 * it detects PMD is changed. 1179 */ > 1180 _pmd = pmdp_collapse_flush_sync(vma, address, pmd); 1181 spin_unlock(pmd_ptl); 1182 mmu_notifier_invalidate_range_end(&range); 1183 1184 pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); 1185 if (pte) { 1186 result = __collapse_huge_page_isolate(vma, address, pte, cc, 1187 &compound_pagelist); 1188 spin_unlock(pte_ptl); 1189 } else { 1190 result = SCAN_NO_PTE_TABLE; 1191 } 1192 1193 if (unlikely(result != SCAN_SUCCEED)) { 1194 if (pte) 1195 pte_unmap(pte); 1196 spin_lock(pmd_ptl); 1197 BUG_ON(!pmd_none(*pmd)); 1198 /* 1199 * We can only use set_pmd_at when establishing 1200 * hugepmds and never for establishing regular pmds that 1201 * points to regular pagetables. Use pmd_populate for that 1202 */ 1203 pmd_populate(mm, pmd, pmd_pgtable(_pmd)); 1204 spin_unlock(pmd_ptl); 1205 anon_vma_unlock_write(vma->anon_vma); 1206 goto out_up_write; 1207 } 1208 1209 /* 1210 * All pages are isolated and locked so anon_vma rmap 1211 * can't run anymore. 1212 */ 1213 anon_vma_unlock_write(vma->anon_vma); 1214 1215 result = __collapse_huge_page_copy(pte, folio, pmd, _pmd, 1216 vma, address, pte_ptl, 1217 &compound_pagelist); 1218 pte_unmap(pte); 1219 if (unlikely(result != SCAN_SUCCEED)) 1220 goto out_up_write; 1221 1222 /* 1223 * The smp_wmb() inside __folio_mark_uptodate() ensures the 1224 * copy_huge_page writes become visible before the set_pmd_at() 1225 * write. 
1226 */ 1227 __folio_mark_uptodate(folio); 1228 pgtable = pmd_pgtable(_pmd); 1229 1230 spin_lock(pmd_ptl); 1231 BUG_ON(!pmd_none(*pmd)); 1232 pgtable_trans_huge_deposit(mm, pmd, pgtable); 1233 map_anon_folio_pmd_nopf(folio, pmd, vma, address); 1234 spin_unlock(pmd_ptl); 1235 1236 folio = NULL; 1237 1238 result = SCAN_SUCCEED; 1239 out_up_write: 1240 mmap_write_unlock(mm); 1241 out_nolock: 1242 if (folio) 1243 folio_put(folio); 1244 trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result); 1245 return result; 1246 } 1247 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI 2026-01-06 12:03 ` [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI Lance Yang 2026-01-06 15:07 ` David Hildenbrand (Red Hat) 2026-01-07 9:46 ` kernel test robot @ 2026-01-07 10:52 ` kernel test robot 2 siblings, 0 replies; 16+ messages in thread From: kernel test robot @ 2026-01-07 10:52 UTC (permalink / raw) To: Lance Yang, akpm Cc: oe-kbuild-all, david, dave.hansen, dave.hansen, will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, ioworker0, Lance Yang Hi Lance, kernel test robot noticed the following build errors: [auto build test ERROR on akpm-mm/mm-everything] [also build test ERROR on next-20260107] [cannot apply to tip/x86/core tip/x86/mm linus/master v6.16-rc1] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Lance-Yang/mm-tlb-skip-redundant-IPI-when-TLB-flush-already-synchronized/20260106-200505 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20260106120303.38124-3-lance.yang%40linux.dev patch subject: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI config: riscv-allnoconfig-bpf (https://download.01.org/0day-ci/archive/20260107/202601071153.9k8Fm05X-lkp@intel.com/config) compiler: riscv64-linux-gcc (GCC) 15.1.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260107/202601071153.9k8Fm05X-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202601071153.9k8Fm05X-lkp@intel.com/ All errors (new ones prefixed by >>): mm/khugepaged.c: In function 'collapse_huge_page': mm/khugepaged.c:1180:16: error: implicit declaration of function 'pmdp_collapse_flush_sync'; did you mean 'pmdp_collapse_flush'? 
[-Wimplicit-function-declaration] 1180 | _pmd = pmdp_collapse_flush_sync(vma, address, pmd); | ^~~~~~~~~~~~~~~~~~~~~~~~ | pmdp_collapse_flush >> mm/khugepaged.c:1180:16: error: incompatible types when assigning to type 'pmd_t' from type 'int' mm/khugepaged.c: In function 'try_collapse_pte_mapped_thp': mm/khugepaged.c:1665:19: error: incompatible types when assigning to type 'pmd_t' from type 'int' 1665 | pgt_pmd = pmdp_collapse_flush_sync(vma, haddr, pmd); | ^~~~~~~~~~~~~~~~~~~~~~~~ mm/khugepaged.c: In function 'retract_page_tables': mm/khugepaged.c:1818:35: error: incompatible types when assigning to type 'pmd_t' from type 'int' 1818 | pgt_pmd = pmdp_collapse_flush_sync(vma, addr, pmd); | ^~~~~~~~~~~~~~~~~~~~~~~~ vim +1180 mm/khugepaged.c 1092 1093 static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address, 1094 int referenced, int unmapped, 1095 struct collapse_control *cc) 1096 { 1097 LIST_HEAD(compound_pagelist); 1098 pmd_t *pmd, _pmd; 1099 pte_t *pte; 1100 pgtable_t pgtable; 1101 struct folio *folio; 1102 spinlock_t *pmd_ptl, *pte_ptl; 1103 enum scan_result result = SCAN_FAIL; 1104 struct vm_area_struct *vma; 1105 struct mmu_notifier_range range; 1106 1107 VM_BUG_ON(address & ~HPAGE_PMD_MASK); 1108 1109 /* 1110 * Before allocating the hugepage, release the mmap_lock read lock. 1111 * The allocation can take potentially a long time if it involves 1112 * sync compaction, and we do not need to hold the mmap_lock during 1113 * that. We will recheck the vma after taking it again in write mode. 1114 */ 1115 mmap_read_unlock(mm); 1116 1117 result = alloc_charge_folio(&folio, mm, cc); 1118 if (result != SCAN_SUCCEED) 1119 goto out_nolock; 1120 1121 mmap_read_lock(mm); 1122 result = hugepage_vma_revalidate(mm, address, true, &vma, cc); 1123 if (result != SCAN_SUCCEED) { 1124 mmap_read_unlock(mm); 1125 goto out_nolock; 1126 } 1127 1128 result = find_pmd_or_thp_or_none(mm, address, &pmd); 1129 if (result != SCAN_SUCCEED) { 1130 mmap_read_unlock(mm); 1131 goto out_nolock; 1132 } 1133 1134 if (unmapped) { 1135 /* 1136 * __collapse_huge_page_swapin will return with mmap_lock 1137 * released when it fails. So we jump out_nolock directly in 1138 * that case. Continuing to collapse causes inconsistency. 1139 */ 1140 result = __collapse_huge_page_swapin(mm, vma, address, pmd, 1141 referenced); 1142 if (result != SCAN_SUCCEED) 1143 goto out_nolock; 1144 } 1145 1146 mmap_read_unlock(mm); 1147 /* 1148 * Prevent all access to pagetables with the exception of 1149 * gup_fast later handled by the ptep_clear_flush and the VM 1150 * handled by the anon_vma lock + PG_lock. 1151 * 1152 * UFFDIO_MOVE is prevented to race as well thanks to the 1153 * mmap_lock. 1154 */ 1155 mmap_write_lock(mm); 1156 result = hugepage_vma_revalidate(mm, address, true, &vma, cc); 1157 if (result != SCAN_SUCCEED) 1158 goto out_up_write; 1159 /* check if the pmd is still valid */ 1160 vma_start_write(vma); 1161 result = check_pmd_still_valid(mm, address, pmd); 1162 if (result != SCAN_SUCCEED) 1163 goto out_up_write; 1164 1165 anon_vma_lock_write(vma->anon_vma); 1166 1167 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, 1168 address + HPAGE_PMD_SIZE); 1169 mmu_notifier_invalidate_range_start(&range); 1170 1171 pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ 1172 /* 1173 * This removes any huge TLB entry from the CPU so we won't allow 1174 * huge and small TLB entries for the same virtual address to 1175 * avoid the risk of CPU bugs in that area. 
1176 * 1177 * Parallel GUP-fast is fine since GUP-fast will back off when 1178 * it detects PMD is changed. 1179 */ > 1180 _pmd = pmdp_collapse_flush_sync(vma, address, pmd); 1181 spin_unlock(pmd_ptl); 1182 mmu_notifier_invalidate_range_end(&range); 1183 1184 pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); 1185 if (pte) { 1186 result = __collapse_huge_page_isolate(vma, address, pte, cc, 1187 &compound_pagelist); 1188 spin_unlock(pte_ptl); 1189 } else { 1190 result = SCAN_NO_PTE_TABLE; 1191 } 1192 1193 if (unlikely(result != SCAN_SUCCEED)) { 1194 if (pte) 1195 pte_unmap(pte); 1196 spin_lock(pmd_ptl); 1197 BUG_ON(!pmd_none(*pmd)); 1198 /* 1199 * We can only use set_pmd_at when establishing 1200 * hugepmds and never for establishing regular pmds that 1201 * points to regular pagetables. Use pmd_populate for that 1202 */ 1203 pmd_populate(mm, pmd, pmd_pgtable(_pmd)); 1204 spin_unlock(pmd_ptl); 1205 anon_vma_unlock_write(vma->anon_vma); 1206 goto out_up_write; 1207 } 1208 1209 /* 1210 * All pages are isolated and locked so anon_vma rmap 1211 * can't run anymore. 1212 */ 1213 anon_vma_unlock_write(vma->anon_vma); 1214 1215 result = __collapse_huge_page_copy(pte, folio, pmd, _pmd, 1216 vma, address, pte_ptl, 1217 &compound_pagelist); 1218 pte_unmap(pte); 1219 if (unlikely(result != SCAN_SUCCEED)) 1220 goto out_up_write; 1221 1222 /* 1223 * The smp_wmb() inside __folio_mark_uptodate() ensures the 1224 * copy_huge_page writes become visible before the set_pmd_at() 1225 * write. 1226 */ 1227 __folio_mark_uptodate(folio); 1228 pgtable = pmd_pgtable(_pmd); 1229 1230 spin_lock(pmd_ptl); 1231 BUG_ON(!pmd_none(*pmd)); 1232 pgtable_trans_huge_deposit(mm, pmd, pgtable); 1233 map_anon_folio_pmd_nopf(folio, pmd, vma, address); 1234 spin_unlock(pmd_ptl); 1235 1236 folio = NULL; 1237 1238 result = SCAN_SUCCEED; 1239 out_up_write: 1240 mmap_write_unlock(mm); 1241 out_nolock: 1242 if (folio) 1243 folio_put(folio); 1244 trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result); 1245 return result; 1246 } 1247 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 16+ messages in thread
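[Both robot reports appear to be the same failure: on s390 and riscv, which carry their own pmdp_collapse_flush(), the new pmdp_collapse_flush_sync() declaration is seemingly not visible, so the call in mm/khugepaged.c becomes an implicit declaration, and on riscv the implied int return then fails to assign to pmd_t. The patch description quoted earlier in the thread says such architectures should simply flush and then sync; a fallback along the lines below would match that description. This is a sketch only: the #ifndef guard and the placement in include/linux/pgtable.h are assumptions, not the author's actual fix.]

    /*
     * Illustrative fallback only, not the fix from this series:
     * architectures with a custom pmdp_collapse_flush() (s390, riscv,
     * powerpc) could get a generic _sync variant that simply flushes and
     * then synchronizes, as the patch description states. The #ifndef
     * guard and placement in include/linux/pgtable.h are assumptions.
     */
    #ifndef pmdp_collapse_flush_sync
    static inline pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
                                                 unsigned long address,
                                                 pmd_t *pmdp)
    {
            pmd_t pmd = pmdp_collapse_flush(vma, address, pmdp);

            /* No IPI tracking here, so keep the explicit GUP-fast sync. */
            tlb_remove_table_sync_one();
            return pmd;
    }
    #define pmdp_collapse_flush_sync pmdp_collapse_flush_sync
    #endif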
end of thread, other threads:[~2026-01-09 15:40 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2026-01-06 12:03 [PATCH RESEND v3 0/2] skip redundant TLB sync IPIs Lance Yang 2026-01-06 12:03 ` [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized Lance Yang 2026-01-06 15:19 ` David Hildenbrand (Red Hat) 2026-01-06 16:10 ` Lance Yang 2026-01-07 6:37 ` Lance Yang 2026-01-09 14:11 ` David Hildenbrand (Red Hat) 2026-01-09 14:13 ` David Hildenbrand (Red Hat) 2026-01-09 15:30 ` Lance Yang 2026-01-09 15:40 ` David Hildenbrand (Red Hat) 2026-01-06 16:24 ` Dave Hansen 2026-01-07 2:47 ` Lance Yang 2026-01-06 12:03 ` [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI Lance Yang 2026-01-06 15:07 ` David Hildenbrand (Red Hat) 2026-01-06 15:41 ` Lance Yang 2026-01-07 9:46 ` kernel test robot 2026-01-07 10:52 ` kernel test robot