* [PATCH -v2 0/2] arm, tlbflush: avoid TLBI broadcast if page reused in write fault @ 2025-10-13 9:20 Huang Ying 2025-10-13 9:20 ` [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd Huang Ying 2025-10-13 9:20 ` [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault Huang Ying 0 siblings, 2 replies; 33+ messages in thread From: Huang Ying @ 2025-10-13 9:20 UTC (permalink / raw) To: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand Cc: Huang Ying, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm This series optimizes system performance by avoiding TLBI broadcast when a page is reused in the write protection fault handler. More details on the background and the test results can be found in [2/2]. Changelog: v2: - Various code cleanups in [1/2], thanks to David's comments! - Remove unnecessary __local_flush_tlb_page_nosync() in [2/2], thanks to Ryan's comments! - Add missing contpte processing, thanks to Ryan's and Catalin's comments! Huang Ying (2): mm: add spurious fault fixing support for huge pmd arm64, tlbflush: don't TLBI broadcast if page reused in write fault arch/arm64/include/asm/pgtable.h | 14 +++++--- arch/arm64/include/asm/tlbflush.h | 56 +++++++++++++++++++++++++++++++ arch/arm64/mm/contpte.c | 3 +- arch/arm64/mm/fault.c | 2 +- include/linux/pgtable.h | 4 +++ mm/huge_memory.c | 22 +++++++++--- mm/internal.h | 4 +-- 7 files changed, 90 insertions(+), 15 deletions(-) Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-13 9:20 [PATCH -v2 0/2] arm, tlbflush: avoid TLBI broadcast if page reused in write fault Huang Ying @ 2025-10-13 9:20 ` Huang Ying 2025-10-14 14:21 ` Lorenzo Stoakes 2025-10-13 9:20 ` [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault Huang Ying 1 sibling, 1 reply; 33+ messages in thread From: Huang Ying @ 2025-10-13 9:20 UTC (permalink / raw) To: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand Cc: Huang Ying, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm In the current kernel, there is spurious fault fixing support for pte, but not for huge pmd because no architectures need it. But in the next patch in the series, we will change the write protection fault handling logic on arm64, so that some stale huge pmd entries may remain in the TLB. These entries need to be flushed via the huge pmd spurious fault fixing mechanism. Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/pgtable.h | 4 ++++ mm/huge_memory.c | 22 +++++++++++++++++----- mm/internal.h | 4 ++-- 3 files changed, 23 insertions(+), 7 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 32e8457ad535..341622ec80e4 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) #endif +#ifndef flush_tlb_fix_spurious_fault_pmd +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) +#endif + /* * When walking page tables, get the address of the next boundary, * or the end address of the range if that comes earlier. 
Although no diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1b81680b4225..8533457c52b7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1641,17 +1641,22 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, EXPORT_SYMBOL_GPL(vmf_insert_folio_pud); #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, bool write) +/* Returns whether the PMD entry is changed */ +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, + pmd_t *pmd, bool write) { + int changed; pmd_t _pmd; _pmd = pmd_mkyoung(*pmd); if (write) _pmd = pmd_mkdirty(_pmd); - if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, - pmd, _pmd, write)) + changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, + pmd, _pmd, write); + if (changed) update_mmu_cache_pmd(vma, addr, pmd); + + return changed; } int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, @@ -1849,7 +1854,14 @@ void huge_pmd_set_accessed(struct vm_fault *vmf) if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) goto unlock; - touch_pmd(vmf->vma, vmf->address, vmf->pmd, write); + if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write)) { + /* See corresponding comments in handle_pte_fault(). */ + if (vmf->flags & FAULT_FLAG_TRIED) + goto unlock; + if (vmf->flags & FAULT_FLAG_WRITE) + flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address, + vmf->pmd); + } unlock: spin_unlock(vmf->ptl); diff --git a/mm/internal.h b/mm/internal.h index 1561fc2ff5b8..8b58ab00a7cd 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1402,8 +1402,8 @@ int __must_check try_grab_folio(struct folio *folio, int refs, */ void touch_pud(struct vm_area_struct *vma, unsigned long addr, pud_t *pud, bool write); -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, bool write); +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, + pmd_t *pmd, bool write); /* * Parses a string with mem suffixes into its order. Useful to parse kernel -- 2.39.5 ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-13 9:20 ` [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd Huang Ying @ 2025-10-14 14:21 ` Lorenzo Stoakes 2025-10-14 14:38 ` David Hildenbrand 2025-10-15 8:43 ` Huang, Ying 0 siblings, 2 replies; 33+ messages in thread From: Lorenzo Stoakes @ 2025-10-14 14:21 UTC (permalink / raw) To: Huang Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On Mon, Oct 13, 2025 at 05:20:37PM +0800, Huang Ying wrote: > In the current kernel, there is spurious fault fixing support for pte, > but not for huge pmd because no architectures need it. But in the > next patch in the series, we will change the write protection fault > handling logic on arm64, so that some stale huge pmd entries may > remain in the TLB. These entries need to be flushed via the huge pmd > spurious fault fixing mechanism. > > Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> Right now the PTE level spurious fault handling is dealt with in handle_pte_fault() when ptep_set_access_flags() returns false. Now you're updating touch_pmd() which is invoked by follow_huge_pmd() and huge_pmd_set_accessed(). 1 - Why are you not adding handling to GUP? 2 - Is this the correct level of abstraction? It's really not obvious but huge_pmd_set_accessed() is invoked by __handle_mm_fault() on a non-WP, non-NUMA hint huge page fault where a page table entry already exists but we are faulting anyway (e.g. non-present or read-only writable). You don't mention any of this in the commit message, which you need to do and really need to explain how spurious faults can arise, why you can only do this at the point of abstraction you do (if you are unable to put it in actual fault-handling code), and you need to add a bunch more comments to explain this. Otherwise this just ends up being a lot of open-coded + confusing 'you have to go look it up/just know' type stuff that we have too much of in mm :) So please update commit message/comments, confirm whether this is the correct level of abstraction, and address other comments below, thanks!
> Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Will Deacon <will@kernel.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: David Hildenbrand <david@redhat.com> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Zi Yan <ziy@nvidia.com> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > Cc: Ryan Roberts <ryan.roberts@arm.com> > Cc: Yang Shi <yang@os.amperecomputing.com> > Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> > Cc: Dev Jain <dev.jain@arm.com> > Cc: Barry Song <baohua@kernel.org> > Cc: Anshuman Khandual <anshuman.khandual@arm.com> > Cc: Yicong Yang <yangyicong@hisilicon.com> > Cc: Kefeng Wang <wangkefeng.wang@huawei.com> > Cc: Kevin Brodsky <kevin.brodsky@arm.com> > Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> > Cc: linux-arm-kernel@lists.infradead.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org > --- > include/linux/pgtable.h | 4 ++++ > mm/huge_memory.c | 22 +++++++++++++++++----- > mm/internal.h | 4 ++-- > 3 files changed, 23 insertions(+), 7 deletions(-) > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index 32e8457ad535..341622ec80e4 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) > #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) > #endif > > +#ifndef flush_tlb_fix_spurious_fault_pmd > +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) > +#endif flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to flush_tlb_page() - why do we just do nothing in this case here? > + > /* > * When walking page tables, get the address of the next boundary, > * or the end address of the range if that comes earlier. Although no > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 1b81680b4225..8533457c52b7 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1641,17 +1641,22 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, > EXPORT_SYMBOL_GPL(vmf_insert_folio_pud); > #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ > > -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, > - pmd_t *pmd, bool write) > +/* Returns whether the PMD entry is changed */ Could we have a kernel doc description here? > +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, It's 2025 can we use bool please :) > + pmd_t *pmd, bool write) > { > + int changed; > pmd_t _pmd; While we're here can we rename this horrible parameter name to e.g. entry? 
significantly altering this function anyway, so it isn't much more churn. > > _pmd = pmd_mkyoung(*pmd); > if (write) > _pmd = pmd_mkdirty(_pmd); > - if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, > - pmd, _pmd, write)) > + changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, > + pmd, _pmd, write); > + if (changed) > update_mmu_cache_pmd(vma, addr, pmd); We can make this simpler, e.g.: if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, pmd, entry, write)) { update_mmu_cache_pmd(vma, addr, pmd); return true; } return false; > + > + return changed; > } > > int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > @@ -1849,7 +1854,14 @@ void huge_pmd_set_accessed(struct vm_fault *vmf) > if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) > goto unlock; > > - touch_pmd(vmf->vma, vmf->address, vmf->pmd, write); > + if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write)) { > + /* See corresponding comments in handle_pte_fault(). */ What are the 'corresponding' comments? How can a reader of this code know what they are? This isn't a very helpful comment. Also those comments might be moved in future... Presumably it's: /* Skip spurious TLB flush for retried page fault */ if (vmf->flags & FAULT_FLAG_TRIED) goto unlock; /* * This is needed only for protection faults but the arch code * is not yet telling us if this is a protection fault or not. * This still avoids useless tlb flushes for .text page faults * with threads. */ if (vmf->flags & FAULT_FLAG_WRITE) flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, vmf->pte); So I don't see why it's so egregious to have the equivalent here, or actually ideally to abstract the code entirely. In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum pgtable_level"") David introduced: enum pgtable_level { PGTABLE_LEVEL_PTE = 0, PGTABLE_LEVEL_PMD, PGTABLE_LEVEL_PUD, PGTABLE_LEVEL_P4D, PGTABLE_LEVEL_PGD, }; Which allows for sensible abstraction. > + if (vmf->flags & FAULT_FLAG_TRIED) > + goto unlock; > + if (vmf->flags & FAULT_FLAG_WRITE) > + flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address, > + vmf->pmd); > + } > > unlock: > spin_unlock(vmf->ptl); > diff --git a/mm/internal.h b/mm/internal.h > index 1561fc2ff5b8..8b58ab00a7cd 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -1402,8 +1402,8 @@ int __must_check try_grab_folio(struct folio *folio, int refs, > */ > void touch_pud(struct vm_area_struct *vma, unsigned long addr, > pud_t *pud, bool write); > -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, > - pmd_t *pmd, bool write); > +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, > + pmd_t *pmd, bool write); > > /* > * Parses a string with mem suffixes into its order. Useful to parse kernel > -- > 2.39.5 > ^ permalink raw reply [flat|nested] 33+ messages in thread
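To make the suggested abstraction concrete, a minimal sketch of a shared helper might look like the following. The function name flush_tlb_fix_spurious_fault_level() and its placement are assumptions for illustration only, not code from the series; it simply hoists the handle_pte_fault() logic quoted above and keys the flush on the pgtable_level enum:

static void flush_tlb_fix_spurious_fault_level(struct vm_fault *vmf,
					       enum pgtable_level level)
{
	/* Skip spurious TLB flush for retried page fault. */
	if (vmf->flags & FAULT_FLAG_TRIED)
		return;

	/*
	 * Needed only for protection faults, but the arch code does not
	 * yet tell us whether this is a protection fault; restricting
	 * the flush to write faults still avoids useless TLB flushes
	 * for .text page faults with threads.
	 */
	if (!(vmf->flags & FAULT_FLAG_WRITE))
		return;

	if (level == PGTABLE_LEVEL_PMD)
		flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address,
						 vmf->pmd);
	else
		flush_tlb_fix_spurious_fault(vmf->vma, vmf->address,
					     vmf->pte);
}

The callers would keep holding the page table lock and drop it themselves, as handle_pte_fault() and huge_pmd_set_accessed() already do, so the helper itself stays lock-agnostic.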
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-14 14:21 ` Lorenzo Stoakes @ 2025-10-14 14:38 ` David Hildenbrand 2025-10-14 14:49 ` Lorenzo Stoakes 2025-10-15 8:43 ` Huang, Ying 1 sibling, 1 reply; 33+ messages in thread From: David Hildenbrand @ 2025-10-14 14:38 UTC (permalink / raw) To: Lorenzo Stoakes, Huang Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm > > /* Skip spurious TLB flush for retried page fault */ > if (vmf->flags & FAULT_FLAG_TRIED) > goto unlock; > /* > * This is needed only for protection faults but the arch code > * is not yet telling us if this is a protection fault or not. > * This still avoids useless tlb flushes for .text page faults > * with threads. > */ > if (vmf->flags & FAULT_FLAG_WRITE) > flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, > vmf->pte); > > > So I don't see why it's so egregious to have the equivalent here, or actually > ideally to abstract the code entirely. Let's definitely not duplicate such comments whereby one instance will end up bitrotting. When talking about spurious faults I assume the educated reader will usually find the right comments -- like you easily did :P However I agree that ... > > In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum > pgtable_level"") David introduced: > > enum pgtable_level { > PGTABLE_LEVEL_PTE = 0, > PGTABLE_LEVEL_PMD, > PGTABLE_LEVEL_PUD, > PGTABLE_LEVEL_P4D, > PGTABLE_LEVEL_PGD, > }; > > Which allows for sensible abstraction. ... if there is an easier way to just unify the code and have the comments at a central place, even better. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-14 14:38 ` David Hildenbrand @ 2025-10-14 14:49 ` Lorenzo Stoakes 2025-10-14 14:58 ` David Hildenbrand 0 siblings, 1 reply; 33+ messages in thread From: Lorenzo Stoakes @ 2025-10-14 14:49 UTC (permalink / raw) To: David Hildenbrand Cc: Huang Ying, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On Tue, Oct 14, 2025 at 04:38:05PM +0200, David Hildenbrand wrote: > > > > > /* Skip spurious TLB flush for retried page fault */ > > if (vmf->flags & FAULT_FLAG_TRIED) > > goto unlock; > > /* > > * This is needed only for protection faults but the arch code > > * is not yet telling us if this is a protection fault or not. > > * This still avoids useless tlb flushes for .text page faults > > * with threads. > > */ > > if (vmf->flags & FAULT_FLAG_WRITE) > > flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, > > vmf->pte); > > > > > > So I don't see why it's so egregious to have the equivalent here, or actually > > ideally to abstract the code entirely. > > Let's definitely not duplicate such comments whereby one instance will end > up bitrotting. We're duplicating the code in two places, how would that bitrot happen exactly? I mean as I also say, probably better to just de-duplicate in general. It's one thing referring to a comment _above_ another function, it's quite another to refer to comments buried in a branch buried inside another function and to hand wave about it. And _those_ comments might very well be moved when the function is sensibly refactored (as it needs to be tbh). So yeah, in general I'd agree _if_ you were doing something similar-but-not-the-same AND referencing a clearly identifiable comment (e.g. above the function). But this isn't that at all. Anyway TL;DR is that we should just de-dupe the code. > > When talking about spurious faults I assume the educated reader will usually > find the right comments -- like you easily did :P Yeah but I'm familiar with the (kinda hideous) code there, it wasn't so much 'easily' found and for somebody else they'd maybe get confused as to what on earth that's referring to. > > However I agree that ... > > > > > In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum > > pgtable_level"") David introduced: > > > > enum pgtable_level { > > PGTABLE_LEVEL_PTE = 0, > > PGTABLE_LEVEL_PMD, > > PGTABLE_LEVEL_PUD, > > PGTABLE_LEVEL_P4D, > > PGTABLE_LEVEL_PGD, > > }; > > > > Which allows for sensible abstraction. > > ... if there is an easier way to just unify the code and have the comments > at a central place, even better. Yup we're agreed on that :) > > -- > Cheers > > David / dhildenb > Cheers, Lorenzo ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-14 14:49 ` Lorenzo Stoakes @ 2025-10-14 14:58 ` David Hildenbrand 2025-10-14 15:13 ` Lorenzo Stoakes 0 siblings, 1 reply; 33+ messages in thread From: David Hildenbrand @ 2025-10-14 14:58 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Huang Ying, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On 14.10.25 16:49, Lorenzo Stoakes wrote: > On Tue, Oct 14, 2025 at 04:38:05PM +0200, David Hildenbrand wrote: >> >>> >>> /* Skip spurious TLB flush for retried page fault */ >>> if (vmf->flags & FAULT_FLAG_TRIED) >>> goto unlock; >>> /* >>> * This is needed only for protection faults but the arch code >>> * is not yet telling us if this is a protection fault or not. >>> * This still avoids useless tlb flushes for .text page faults >>> * with threads. >>> */ >>> if (vmf->flags & FAULT_FLAG_WRITE) >>> flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, >>> vmf->pte); >>> >>> >>> So I don't see why it's so egregious to have the equivalent here, or actually >>> ideally to abstract the code entirely. >> >> Let's definitely not duplicate such comments whereby one instance will end >> up bitrotting. > > We're duplicating the code in two places, how would that bitrot happen exactly? Often we adjust/fix comments to make scenarios/conditions clearer or extend them to cover some new conditions. So even without any code changes people will simply forget to update the other comments. Code you can at least test, in the hope of finding inconsistencies. So copying rather large comments is usually never the answer :) Well, just like copying larger chunks of code, agreed. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-14 14:58 ` David Hildenbrand @ 2025-10-14 15:13 ` Lorenzo Stoakes 0 siblings, 0 replies; 33+ messages in thread From: Lorenzo Stoakes @ 2025-10-14 15:13 UTC (permalink / raw) To: David Hildenbrand Cc: Huang Ying, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On Tue, Oct 14, 2025 at 04:58:44PM +0200, David Hildenbrand wrote: > On 14.10.25 16:49, Lorenzo Stoakes wrote: > > On Tue, Oct 14, 2025 at 04:38:05PM +0200, David Hildenbrand wrote: > > > > > > > > > > > > /* Skip spurious TLB flush for retried page fault */ > > > > if (vmf->flags & FAULT_FLAG_TRIED) > > > > goto unlock; > > > > /* > > > > * This is needed only for protection faults but the arch code > > > > * is not yet telling us if this is a protection fault or not. > > > > * This still avoids useless tlb flushes for .text page faults > > > > * with threads. > > > > */ > > > > if (vmf->flags & FAULT_FLAG_WRITE) > > > > flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, > > > > vmf->pte); > > > > > > > > > > > > So I don't see why it's so egregious to have the equivalent here, or actually > > > > ideally to abstract the code entirely. > > > > > > Let's definitely not duplicate such comments whereby one instance will end > > > up bitrotting. > > > > We're duplicating the code in two places, how would that bitrot happen exactly? > > Often we adjust/fix comments to make scenarios/conditions clearer or extend > them to cover some new conditions. > > So even without any code changes people will just ignore to update other > comments. > > Code you can at least test with the hope to find inconsistencies. > > So copying rather large comments is usually never the answer :) > > Well, just like copying larger chunks of code, agreed. This is a bit moot; I don't think it's worth having a big debate about it. I'm one of the biggest proponents of de-duplicating things (comments included) and have done so extensively as you know. My _only_ point here is that it's hard to find the comment referenced and it's _very_ likely it'll get moved later (in fact I feel like doing that _right now_ as that function needs refactoring). In that case the lesser evil is to copy a 4-line comment, right? But anyway, we both agree de-duplicating the code as a whole is the right way forward, and that solves the issue, so let's go with that! > > -- > Cheers > > David / dhildenb > Cheers, Lorenzo ^ permalink raw reply [flat|nested] 33+ messages in thread
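Under the de-duplication both sides agree on here, the two call sites would presumably reduce to fragments like these. This is a sketch assuming the flush_tlb_fix_spurious_fault_level() helper sketched earlier in the thread; the surrounding code is paraphrased from handle_pte_fault() and huge_pmd_set_accessed(), with the success paths, update_mmu_cache*() calls, and unlocking elided:

	/* In handle_pte_fault(), replacing the open-coded else branch: */
	if (!ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
				   vmf->flags & FAULT_FLAG_WRITE))
		flush_tlb_fix_spurious_fault_level(vmf, PGTABLE_LEVEL_PTE);

	/* In huge_pmd_set_accessed(), once touch_pmd() returns a bool: */
	if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write))
		flush_tlb_fix_spurious_fault_level(vmf, PGTABLE_LEVEL_PMD);

With that, the FAULT_FLAG_TRIED/FAULT_FLAG_WRITE checks and their comments live in exactly one place.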
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-14 14:21 ` Lorenzo Stoakes 2025-10-14 14:38 ` David Hildenbrand @ 2025-10-15 8:43 ` Huang, Ying 2025-10-15 11:20 ` Lorenzo Stoakes 1 sibling, 1 reply; 33+ messages in thread From: Huang, Ying @ 2025-10-15 8:43 UTC (permalink / raw) To: Lorenzo Stoakes, David Hildenbrand Cc: Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm Hi, Lorenzo, Thanks for the comments! Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: > On Mon, Oct 13, 2025 at 05:20:37PM +0800, Huang Ying wrote: >> In the current kernel, there is spurious fault fixing support for pte, >> but not for huge pmd because no architectures need it. But in the >> next patch in the series, we will change the write protection fault >> handling logic on arm64, so that some stale huge pmd entries may >> remain in the TLB. These entries need to be flushed via the huge pmd >> spurious fault fixing mechanism. >> >> Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> > > Right now the PTE level spurious fault handling is dealt with in > handle_pte_fault() when ptep_set_access_flags() returns false. > > Now you're updating touch_pmd() which is invoked by follow_huge_pmd() and > huge_pmd_set_accessed(). > > 1 - Why are you not adding handling to GUP? > > 2 - Is this the correct level of abstraction? It's really not obvious but > huge_pmd_set_accessed() is invoked by __handle_mm_fault() on a non-WP, > non-NUMA hint huge page fault where a page table entry already exists > but we are faulting anyway (e.g. non-present or read-only writable). > > You don't mention any of this in the commit message, which you need to do > and really need to explain how spurious faults can arise, why you can only > do this at the point of abstraction you do (if you are unable to put it in > actual fault handing-code), and you need to add a bunch more comments to > explain this. This patch adds spurious PMD page fault fixing modeled on the existing spurious PTE page fault fixing. So, I assumed that spurious page fault fixing had been documented already. But you are right, nothing prevents us from improving it further. Let's try to do that. Page faults may be spurious because of racy access to the page table. For example, when a non-populated virtual page is accessed on 2 CPUs simultaneously, page faults are triggered on both CPUs. However, it's possible that one CPU (say CPU A) cannot find the reason for the page fault if the other CPU (say CPU B) has changed the page table before the PTE is checked on CPU A. Most of the time, spurious page faults can be ignored safely. However, if the page fault is for a write access, it's possible that a stale read-only TLB entry exists on the local CPU and needs to be flushed on some architectures. Fixing that up is called spurious page fault fixing. Spurious page fault fixing only makes sense during page fault handling, so we don't need to do it for GUP. In fact, I plan to avoid it in all GUP paths in another follow-up patch. As for where to put the spurious PMD page fault fixing code, because it's THP-related code, I thought that we should put it in huge_memory.c, so I implemented it in huge_pmd_set_accessed().
If we follow the design of the spurious PTE page fault fixing, we can call the unified implementation in __handle_mm_fault() after huge_pmd_set_accessed() reports that nothing has changed. > Otherwise this just ends up being a lot of open-coded + confusing 'you have > to go look it up/just know' type stuff that we have too much of in mm :) > > So please update commit message/comments, confirm whether this is the > correct level of abstraction, and address other comments below, thanks! > >> Cc: Catalin Marinas <catalin.marinas@arm.com> >> Cc: Will Deacon <will@kernel.org> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Cc: David Hildenbrand <david@redhat.com> >> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >> Cc: Vlastimil Babka <vbabka@suse.cz> >> Cc: Zi Yan <ziy@nvidia.com> >> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >> Cc: Ryan Roberts <ryan.roberts@arm.com> >> Cc: Yang Shi <yang@os.amperecomputing.com> >> Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> >> Cc: Dev Jain <dev.jain@arm.com> >> Cc: Barry Song <baohua@kernel.org> >> Cc: Anshuman Khandual <anshuman.khandual@arm.com> >> Cc: Yicong Yang <yangyicong@hisilicon.com> >> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> >> Cc: Kevin Brodsky <kevin.brodsky@arm.com> >> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> >> Cc: linux-arm-kernel@lists.infradead.org >> Cc: linux-kernel@vger.kernel.org >> Cc: linux-mm@kvack.org >> --- >> include/linux/pgtable.h | 4 ++++ >> mm/huge_memory.c | 22 +++++++++++++++++----- >> mm/internal.h | 4 ++-- >> 3 files changed, 23 insertions(+), 7 deletions(-) >> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >> index 32e8457ad535..341622ec80e4 100644 >> --- a/include/linux/pgtable.h >> +++ b/include/linux/pgtable.h >> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) >> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) >> #endif >> >> +#ifndef flush_tlb_fix_spurious_fault_pmd >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) >> +#endif > > flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to > flush_tlb_page() - why do we just do nothing in this case here? Because all architectures do nothing for spurious PMD page fault fixing until [2/2] of this series, where we make it necessary to flush the local TLB for spurious PMD page fault fixing on the arm64 architecture. If we follow the design of flush_tlb_fix_spurious_fault(), we would need to change all architecture implementations to do nothing in this patch to keep the current behavior. I don't think that it's a good idea. Do you agree? >> + >> /* >> * When walking page tables, get the address of the next boundary, >> * or the end address of the range if that comes earlier. Although no >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 1b81680b4225..8533457c52b7 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -1641,17 +1641,22 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, >> EXPORT_SYMBOL_GPL(vmf_insert_folio_pud); >> #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ >> >> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, >> - pmd_t *pmd, bool write) >> +/* Returns whether the PMD entry is changed */ > > Could we have a kernel doc description here? Sure. >> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, > > It's 2025 can we use bool please :) Sure.
>> + pmd_t *pmd, bool write) >> { >> + int changed; >> pmd_t _pmd; > > While we're here can we rename this horrible parameter name to e.g. entry? We're > significantly altering this function anyway so it isn't much more Sure. >> >> _pmd = pmd_mkyoung(*pmd); >> if (write) >> _pmd = pmd_mkdirty(_pmd); >> - if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, >> - pmd, _pmd, write)) >> + changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, >> + pmd, _pmd, write); >> + if (changed) >> update_mmu_cache_pmd(vma, addr, pmd); > > We can make this simpler, e.g.: > > if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, > pmd, entry, write)) { > update_mmu_cache_pmd(vma, addr, pmd); > return true; > } > > return false; No problem. As long as David is OK with this. >> + >> + return changed; >> } >> >> int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, >> @@ -1849,7 +1854,14 @@ void huge_pmd_set_accessed(struct vm_fault *vmf) >> if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) >> goto unlock; >> >> - touch_pmd(vmf->vma, vmf->address, vmf->pmd, write); >> + if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write)) { >> + /* See corresponding comments in handle_pte_fault(). */ > > What are the 'corresponding' comments? How can a reader of this code know what > they are? This isn't a very helpful comment. Also those comments might be > moved in future... > > Presumably it's: > > /* Skip spurious TLB flush for retried page fault */ > if (vmf->flags & FAULT_FLAG_TRIED) > goto unlock; > /* > * This is needed only for protection faults but the arch code > * is not yet telling us if this is a protection fault or not. > * This still avoids useless tlb flushes for .text page faults > * with threads. > */ > if (vmf->flags & FAULT_FLAG_WRITE) > flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, > vmf->pte); > > > So I don't see why it's so egregious to have the equivalent here, or actually > ideally to abstract the code entirely. > > In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum > pgtable_level"") David introduced: > > enum pgtable_level { > PGTABLE_LEVEL_PTE = 0, > PGTABLE_LEVEL_PMD, > PGTABLE_LEVEL_PUD, > PGTABLE_LEVEL_P4D, > PGTABLE_LEVEL_PGD, > }; > > Which allows for sensible abstraction. Sure. Based on your discussion with David on this, I will add a function to do the spurious page fault fixing for the PTE and PMD fault. >> + if (vmf->flags & FAULT_FLAG_TRIED) >> + goto unlock; >> + if (vmf->flags & FAULT_FLAG_WRITE) >> + flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address, >> + vmf->pmd); >> + } >> >> unlock: >> spin_unlock(vmf->ptl); >> diff --git a/mm/internal.h b/mm/internal.h >> index 1561fc2ff5b8..8b58ab00a7cd 100644 >> --- a/mm/internal.h >> +++ b/mm/internal.h >> @@ -1402,8 +1402,8 @@ int __must_check try_grab_folio(struct folio *folio, int refs, >> */ >> void touch_pud(struct vm_area_struct *vma, unsigned long addr, >> pud_t *pud, bool write); >> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, >> - pmd_t *pmd, bool write); >> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, >> + pmd_t *pmd, bool write); >> >> /* >> * Parses a string with mem suffixes into its order. Useful to parse kernel >> -- >> 2.39.5 >> --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
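Putting the points agreed above together (the bool return, the entry rename, and the simplified control flow), touch_pmd() would plausibly end up as below. This is a guess at what the next revision might contain, assembled from the review feedback, not the posted code:

bool touch_pmd(struct vm_area_struct *vma, unsigned long addr,
	       pmd_t *pmd, bool write)
{
	pmd_t entry = pmd_mkyoung(*pmd);

	if (write)
		entry = pmd_mkdirty(entry);

	/* Report to the caller whether the PMD entry was changed. */
	if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
				  pmd, entry, write)) {
		update_mmu_cache_pmd(vma, addr, pmd);
		return true;
	}

	return false;
}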
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-15 8:43 ` Huang, Ying @ 2025-10-15 11:20 ` Lorenzo Stoakes 2025-10-15 12:23 ` David Hildenbrand 2025-10-16 2:22 ` Huang, Ying 0 siblings, 2 replies; 33+ messages in thread From: Lorenzo Stoakes @ 2025-10-15 11:20 UTC (permalink / raw) To: Huang, Ying Cc: David Hildenbrand, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On Wed, Oct 15, 2025 at 04:43:14PM +0800, Huang, Ying wrote: > Hi, Lorenzo, > > Thanks for comments! > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: > > > On Mon, Oct 13, 2025 at 05:20:37PM +0800, Huang Ying wrote: > >> In the current kernel, there is spurious fault fixing support for pte, > >> but not for huge pmd because no architectures need it. But in the > >> next patch in the series, we will change the write protection fault > >> handling logic on arm64, so that some stale huge pmd entries may > >> remain in the TLB. These entries need to be flushed via the huge pmd > >> spurious fault fixing mechanism. > >> > >> Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> > > > > Right now the PTE level spurious fault handling is dealt with in > > handle_pte_fault() when ptep_set_access_flags() returns false. > > > > Now you're updating touch_pmd() which is invoked by follow_huge_pmd() and > > huge_pmd_set_accessed(). > > > > 1 - Why are you not adding handling to GUP? > > > > 2 - Is this the correct level of abstraction? It's really not obvious but > > huge_pmd_set_accessed() is invoked by __handle_mm_fault() on a non-WP, > > non-NUMA hint huge page fault where a page table entry already exists > > but we are faulting anyway (e.g. non-present or read-only writable). > > > > You don't mention any of this in the commit message, which you need to do > > and really need to explain how spurious faults can arise, why you can only > > do this at the point of abstraction you do (if you are unable to put it in > > actual fault handing-code), and you need to add a bunch more comments to > > explain this. > > This patch adds the spurious PMD page fault fixing based on the spurious > PTE page fault fixing. So, I assumed that the spurious page fault > fixing has been documented already. But you are right, nothing prevents > us from improving it further. Let's try to do that. I wouldn't make any kind of assumption like this in the kernel :) sadly our documentation is often incomplete. > > The page faults may be spurious because of the racy access to the page > table. For example, a non-populated virtual page is accessed on 2 CPUs > simultaneously, thus the page faults are triggered on both CPUs. > However, it's possible that one CPU (say CPU A) cannot find the reason > for the page fault if the other CPU (say CPU B) has changed the page > table before the PTE is checked on CPU A. Most of the time, the > spurious page faults can be ignored safely. However, if the page fault > is for the write access, it's possible that a stale read-only TLB entry > exists in the local CPU and needs to be flushed on some architectures. > This is called the spurious page fault fixing. > > The spurious page fault fixing only makes sense during page fault > handling, so we don't need to do it for GUP. In fact, I plan to avoid > it in all GUP paths in another followup patch. 
OK this is great, let's put it all in the kdoc for the new shared spurious faulting function! :) and additionally add it to the commit message. > > As for where to put the spurious PMD page fault fixing code, because > it's THP related code, I thought that we should put it in huge_memory.c, > so I implemented it in huge_pmd_set_accessed(). If we follow the design > of the spurious PTE page fault fixing, we can call the unified > implementation in __handle_mm_fault() after huge_pmd_set_accessed() > reports nothing has been changed. Ah ok great this is a better place for it for sure, thanks! > > > Otherwise this just ends up being a lot of open-coded + confusing 'you have > > to go look it up/just know' type stuff that we have too much of in mm :) > > > > So please update commit message/comments, confirm whether this is the > > correct level of abstraction, and address other comments below, thanks! > > > >> Cc: Catalin Marinas <catalin.marinas@arm.com> > >> Cc: Will Deacon <will@kernel.org> > >> Cc: Andrew Morton <akpm@linux-foundation.org> > >> Cc: David Hildenbrand <david@redhat.com> > >> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > >> Cc: Vlastimil Babka <vbabka@suse.cz> > >> Cc: Zi Yan <ziy@nvidia.com> > >> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > >> Cc: Ryan Roberts <ryan.roberts@arm.com> > >> Cc: Yang Shi <yang@os.amperecomputing.com> > >> Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> > >> Cc: Dev Jain <dev.jain@arm.com> > >> Cc: Barry Song <baohua@kernel.org> > >> Cc: Anshuman Khandual <anshuman.khandual@arm.com> > >> Cc: Yicong Yang <yangyicong@hisilicon.com> > >> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> > >> Cc: Kevin Brodsky <kevin.brodsky@arm.com> > >> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> > >> Cc: linux-arm-kernel@lists.infradead.org > >> Cc: linux-kernel@vger.kernel.org > >> Cc: linux-mm@kvack.org > >> --- > >> include/linux/pgtable.h | 4 ++++ > >> mm/huge_memory.c | 22 +++++++++++++++++----- > >> mm/internal.h | 4 ++-- > >> 3 files changed, 23 insertions(+), 7 deletions(-) > >> > >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > >> index 32e8457ad535..341622ec80e4 100644 > >> --- a/include/linux/pgtable.h > >> +++ b/include/linux/pgtable.h > >> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) > >> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) > >> #endif > >> > >> +#ifndef flush_tlb_fix_spurious_fault_pmd > >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) > >> +#endif > > > > flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to > > flush_tlb_page() - why do we just do nothing in this case here? > > Because all architectures do nothing for the spurious PMD page fault > fixing until the [2/2] of this series. Where, we make it necessary to > flush the local TLB for spurious PMD page fault fixing on arm64 > architecture. > > If we follow the design of flush_tlb_fix_spurious_fault(), we need to > change all architecture implementation to do nothing in this patch to > keep the current behavior. I don't think that it's a good idea. Do > you agree? Yeah probably we should keep the same behaviour as before, which is obviously, prior to this series, we did nothing. I guess in the PTE case we _always_ want to flush the TLB, whereas in the PMD case we otherwise don't have any need to at the point at which the spurious flush is performed. 
But from your explanation above re: the stale TLB entry this _only_ needs to be done for architectures which might encounter this problem rather than needing a TLB flush in general. Given we're generalising the code and one case always flushes the TLB and the other doesn't maybe it's worth putting a comment in the generalised function mentioning this? > > >> + > >> /* > >> * When walking page tables, get the address of the next boundary, > >> * or the end address of the range if that comes earlier. Although no > >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c > >> index 1b81680b4225..8533457c52b7 100644 > >> --- a/mm/huge_memory.c > >> +++ b/mm/huge_memory.c > >> @@ -1641,17 +1641,22 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, > >> EXPORT_SYMBOL_GPL(vmf_insert_folio_pud); > >> #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ > >> > >> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, > >> - pmd_t *pmd, bool write) > >> +/* Returns whether the PMD entry is changed */ > > > > Could we have a kernel doc description here? > > Sure. Thanks! > > >> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, > > > > It's 2025 can we use bool please :) > > Sure. Thanks! > > >> + pmd_t *pmd, bool write) > >> { > >> + int changed; > >> pmd_t _pmd; > > > > While we're here can we rename this horrible parameter name to e.g. entry? We're > > significantly altering this function anyway so it isn't much more > > Sure. Thanks! > > >> > >> _pmd = pmd_mkyoung(*pmd); > >> if (write) > >> _pmd = pmd_mkdirty(_pmd); > >> - if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, > >> - pmd, _pmd, write)) > >> + changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, > >> + pmd, _pmd, write); > >> + if (changed) > >> update_mmu_cache_pmd(vma, addr, pmd); > > > > We can make this simpler, e.g.: > > > > if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, > > pmd, entry, write)) { > > update_mmu_cache_pmd(vma, addr, pmd); > > return true; > > } > > > > return false; > > No problem. As long as David is OK with this. Sure I don't think he'd have an issue with it but he can raise it if so :) Thanks! > > >> + > >> + return changed; > >> } > >> > >> int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > >> @@ -1849,7 +1854,14 @@ void huge_pmd_set_accessed(struct vm_fault *vmf) > >> if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) > >> goto unlock; > >> > >> - touch_pmd(vmf->vma, vmf->address, vmf->pmd, write); > >> + if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write)) { > >> + /* See corresponding comments in handle_pte_fault(). */ > > > > What are the 'corresponding' comments? How can a reader of this code know what > > they are? This isn't a very helpful comment. Also those comments might be > > moved in future... > > > > Presumably it's: > > > > /* Skip spurious TLB flush for retried page fault */ > > if (vmf->flags & FAULT_FLAG_TRIED) > > goto unlock; > > /* > > * This is needed only for protection faults but the arch code > > * is not yet telling us if this is a protection fault or not. > > * This still avoids useless tlb flushes for .text page faults > > * with threads. > > */ > > if (vmf->flags & FAULT_FLAG_WRITE) > > flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, > > vmf->pte); > > > > > > So I don't see why it's so egregious to have the equivalent here, or actually > > ideally to abstract the code entirely. 
> > > > In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum > > pgtable_level"") David introduced: > > > > enum pgtable_level { > > PGTABLE_LEVEL_PTE = 0, > > PGTABLE_LEVEL_PMD, > > PGTABLE_LEVEL_PUD, > > PGTABLE_LEVEL_P4D, > > PGTABLE_LEVEL_PGD, > > }; > > > > Which allows for sensible abstraction. > > Sure. Based on your discussion with David on this, I will add a > function to do the spurious page fault fixing for the PTE and PMD fault. Thanks! > > >> + if (vmf->flags & FAULT_FLAG_TRIED) > >> + goto unlock; > >> + if (vmf->flags & FAULT_FLAG_WRITE) > >> + flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address, > >> + vmf->pmd); > >> + } > >> > >> unlock: > >> spin_unlock(vmf->ptl); > >> diff --git a/mm/internal.h b/mm/internal.h > >> index 1561fc2ff5b8..8b58ab00a7cd 100644 > >> --- a/mm/internal.h > >> +++ b/mm/internal.h > >> @@ -1402,8 +1402,8 @@ int __must_check try_grab_folio(struct folio *folio, int refs, > >> */ > >> void touch_pud(struct vm_area_struct *vma, unsigned long addr, > >> pud_t *pud, bool write); > >> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, > >> - pmd_t *pmd, bool write); > >> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, > >> + pmd_t *pmd, bool write); > >> > >> /* > >> * Parses a string with mem suffixes into its order. Useful to parse kernel > >> -- > >> 2.39.5 > >> > > --- > Best Regards, > Huang, Ying Cheers, Lorenzo ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-15 11:20 ` Lorenzo Stoakes @ 2025-10-15 12:23 ` David Hildenbrand 2025-10-16 2:22 ` Huang, Ying 1 sibling, 0 replies; 33+ messages in thread From: David Hildenbrand @ 2025-10-15 12:23 UTC (permalink / raw) To: Lorenzo Stoakes, Huang, Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm >> >>>> >>>> _pmd = pmd_mkyoung(*pmd); >>>> if (write) >>>> _pmd = pmd_mkdirty(_pmd); >>>> - if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, >>>> - pmd, _pmd, write)) >>>> + changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, >>>> + pmd, _pmd, write); >>>> + if (changed) >>>> update_mmu_cache_pmd(vma, addr, pmd); >>> >>> We can make this simpler, e.g.: >>> >>> if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, >>> pmd, entry, write)) { >>> update_mmu_cache_pmd(vma, addr, pmd); >>> return true; >>> } >>> >>> return false; >> >> No problem. As long as David is OK with this. > > Sure I don't think he'd have an issue with it but he can raise it if so :) Absolutely fine :) -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-15 11:20 ` Lorenzo Stoakes 2025-10-15 12:23 ` David Hildenbrand @ 2025-10-16 2:22 ` Huang, Ying 2025-10-16 8:25 ` Lorenzo Stoakes 1 sibling, 1 reply; 33+ messages in thread From: Huang, Ying @ 2025-10-16 2:22 UTC (permalink / raw) To: Lorenzo Stoakes Cc: David Hildenbrand, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: > On Wed, Oct 15, 2025 at 04:43:14PM +0800, Huang, Ying wrote: >> Hi, Lorenzo, >> >> Thanks for comments! >> >> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: >> >> > On Mon, Oct 13, 2025 at 05:20:37PM +0800, Huang Ying wrote: >> >> In the current kernel, there is spurious fault fixing support for pte, >> >> but not for huge pmd because no architectures need it. But in the >> >> next patch in the series, we will change the write protection fault >> >> handling logic on arm64, so that some stale huge pmd entries may >> >> remain in the TLB. These entries need to be flushed via the huge pmd >> >> spurious fault fixing mechanism. >> >> >> >> Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> >> > >> > Right now the PTE level spurious fault handling is dealt with in >> > handle_pte_fault() when ptep_set_access_flags() returns false. >> > >> > Now you're updating touch_pmd() which is invoked by follow_huge_pmd() and >> > huge_pmd_set_accessed(). >> > >> > 1 - Why are you not adding handling to GUP? >> > >> > 2 - Is this the correct level of abstraction? It's really not obvious but >> > huge_pmd_set_accessed() is invoked by __handle_mm_fault() on a non-WP, >> > non-NUMA hint huge page fault where a page table entry already exists >> > but we are faulting anyway (e.g. non-present or read-only writable). >> > >> > You don't mention any of this in the commit message, which you need to do >> > and really need to explain how spurious faults can arise, why you can only >> > do this at the point of abstraction you do (if you are unable to put it in >> > actual fault handing-code), and you need to add a bunch more comments to >> > explain this. >> >> This patch adds the spurious PMD page fault fixing based on the spurious >> PTE page fault fixing. So, I assumed that the spurious page fault >> fixing has been documented already. But you are right, nothing prevents >> us from improving it further. Let's try to do that. > > I wouldn't make any kind of assumption like this in the kernel :) sadly our > documentation is often incomplete. > >> >> The page faults may be spurious because of the racy access to the page >> table. For example, a non-populated virtual page is accessed on 2 CPUs >> simultaneously, thus the page faults are triggered on both CPUs. >> However, it's possible that one CPU (say CPU A) cannot find the reason >> for the page fault if the other CPU (say CPU B) has changed the page >> table before the PTE is checked on CPU A. Most of the time, the >> spurious page faults can be ignored safely. However, if the page fault >> is for the write access, it's possible that a stale read-only TLB entry >> exists in the local CPU and needs to be flushed on some architectures. >> This is called the spurious page fault fixing. 
>> >> The spurious page fault fixing only makes sense during page fault >> handling, so we don't need to do it for GUP. In fact, I plan to avoid >> it in all GUP paths in another followup patch. > > OK this is great, let's put it all in the kdoc for the new shared spurious > faulting function! :) and additionally add it to the commit message. Sure. Will do it in the next version. [snip] > >> >> > >> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >> >> index 32e8457ad535..341622ec80e4 100644 >> >> --- a/include/linux/pgtable.h >> >> +++ b/include/linux/pgtable.h >> >> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) >> >> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) >> >> #endif >> >> >> >> +#ifndef flush_tlb_fix_spurious_fault_pmd >> >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) >> >> +#endif >> > >> > flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to >> > flush_tlb_page() - why do we just do nothing in this case here? >> >> Because all architectures do nothing for the spurious PMD page fault >> fixing until the [2/2] of this series. Where, we make it necessary to >> flush the local TLB for spurious PMD page fault fixing on arm64 >> architecture. >> >> If we follow the design of flush_tlb_fix_spurious_fault(), we need to >> change all architecture implementation to do nothing in this patch to >> keep the current behavior. I don't think that it's a good idea. Do >> you agree? > > Yeah probably we should keep the same behaviour as before, which is > obviously, prior to this series, we did nothing. > > I guess in the PTE case we _always_ want to flush the TLB, whereas in the > PMD case we otherwise don't have any need to at the point at which the > spurious flush is performed. > > But from your explanation above re: the stale TLB entry this _only_ needs > to be done for architectures which might encounter this problem rather than > needing a TLB flush in general. > > Given we're generalising the code and one case always flushes the TLB and > the other doesn't maybe it's worth putting a comment in the generalised > function mentioning this? I'm not sure whether it's a good idea to document architecture behaviors in the generic code. The behavior may change from architecture to architecture in the future. [snip] --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-16 2:22 ` Huang, Ying @ 2025-10-16 8:25 ` Lorenzo Stoakes 2025-10-16 8:59 ` David Hildenbrand 2025-10-16 9:12 ` Huang, Ying 0 siblings, 2 replies; 33+ messages in thread From: Lorenzo Stoakes @ 2025-10-16 8:25 UTC (permalink / raw) To: Huang, Ying Cc: David Hildenbrand, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On Thu, Oct 16, 2025 at 10:22:57AM +0800, Huang, Ying wrote: > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: > > > On Wed, Oct 15, 2025 at 04:43:14PM +0800, Huang, Ying wrote: > > OK this is great, let's put it all in the kdoc for the new shared spurious > > faulting function! :) and additionally add it to the commit message. > > Sure. Will do it in the next version. Thanks! > > [snip] > > >> >> > >> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > >> >> index 32e8457ad535..341622ec80e4 100644 > >> >> --- a/include/linux/pgtable.h > >> >> +++ b/include/linux/pgtable.h > >> >> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) > >> >> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) > >> >> #endif > >> >> > >> >> +#ifndef flush_tlb_fix_spurious_fault_pmd > >> >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) > >> >> +#endif > >> > > >> > flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to > >> > flush_tlb_page() - why do we just do nothing in this case here? > >> > >> Because all architectures do nothing for the spurious PMD page fault > >> fixing until the [2/2] of this series. Where, we make it necessary to > >> flush the local TLB for spurious PMD page fault fixing on arm64 > >> architecture. > >> > >> If we follow the design of flush_tlb_fix_spurious_fault(), we need to > >> change all architecture implementation to do nothing in this patch to > >> keep the current behavior. I don't think that it's a good idea. Do > >> you agree? > > > > Yeah probably we should keep the same behaviour as before, which is > > obviously, prior to this series, we did nothing. > > > > I guess in the PTE case we _always_ want to flush the TLB, whereas in the > > PMD case we otherwise don't have any need to at the point at which the > > spurious flush is performed. > > > > But from your explanation above re: the stale TLB entry this _only_ needs > > to be done for architectures which might encounter this problem rather than > > needing a TLB flush in general. > > > > Given we're generalising the code and one case always flushes the TLB and > > the other doesn't maybe it's worth putting a comment in the generalised > > function mentioning this? > > I'm not sure whether it's a good idea to document architecture behaviors > in the general code. The behavior may be changed architecture by > architecture in the future. Right, but we are unconditionally doing a TLB flush in the PTE case but not the PMD case, so let's document that to be clear :) > > [snip] > > --- > Best Regards, > Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
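A sketch of how that asymmetry could be documented at the generic fallbacks themselves; the comment wording here is an assumption, while the two definitions match the existing code and this patch:

#ifndef flush_tlb_fix_spurious_fault
/* PTE level: the generic fallback flushes unconditionally. */
#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
	flush_tlb_page(vma, address)
#endif

#ifndef flush_tlb_fix_spurious_fault_pmd
/*
 * PMD level: unlike the PTE fallback above, this is a no-op, since no
 * architecture needed a huge PMD flush here before this series.
 * Architectures where a stale read-only huge PMD entry can linger in
 * the local TLB (arm64 after patch [2/2]) must override this.
 */
#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
	do { } while (0)
#endif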
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-16 8:25 ` Lorenzo Stoakes @ 2025-10-16 8:59 ` David Hildenbrand 2025-10-16 9:12 ` Huang, Ying 1 sibling, 0 replies; 33+ messages in thread From: David Hildenbrand @ 2025-10-16 8:59 UTC (permalink / raw) To: Lorenzo Stoakes, Huang, Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On 16.10.25 10:25, Lorenzo Stoakes wrote: > On Thu, Oct 16, 2025 at 10:22:57AM +0800, Huang, Ying wrote: >> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: >> >>> On Wed, Oct 15, 2025 at 04:43:14PM +0800, Huang, Ying wrote: >>> OK this is great, let's put it all in the kdoc for the new shared spurious >>> faulting function! :) and additionally add it to the commit message. >> >> Sure. Will do it in the next version. > > Thanks! > >> >> [snip] >> >>>>>> >>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>>>>> index 32e8457ad535..341622ec80e4 100644 >>>>>> --- a/include/linux/pgtable.h >>>>>> +++ b/include/linux/pgtable.h >>>>>> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) >>>>>> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) >>>>>> #endif >>>>>> >>>>>> +#ifndef flush_tlb_fix_spurious_fault_pmd >>>>>> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) >>>>>> +#endif >>>>> >>>>> flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to >>>>> flush_tlb_page() - why do we just do nothing in this case here? >>>> >>>> Because all architectures do nothing for the spurious PMD page fault >>>> fixing until the [2/2] of this series. Where, we make it necessary to >>>> flush the local TLB for spurious PMD page fault fixing on arm64 >>>> architecture. >>>> >>>> If we follow the design of flush_tlb_fix_spurious_fault(), we need to >>>> change all architecture implementation to do nothing in this patch to >>>> keep the current behavior. I don't think that it's a good idea. Do >>>> you agree? >>> >>> Yeah probably we should keep the same behaviour as before, which is >>> obviously, prior to this series, we did nothing. >>> >>> I guess in the PTE case we _always_ want to flush the TLB, whereas in the >>> PMD case we otherwise don't have any need to at the point at which the >>> spurious flush is performed. >>> >>> But from your explanation above re: the stale TLB entry this _only_ needs >>> to be done for architectures which might encounter this problem rather than >>> needing a TLB flush in general. >>> >>> Given we're generalising the code and one case always flushes the TLB and >>> the other doesn't maybe it's worth putting a comment in the generalised >>> function mentioning this? >> >> I'm not sure whether it's a good idea to document architecture behaviors >> in the general code. The behavior may be changed architecture by >> architecture in the future. > > Right, but we are unconditionaly doing a TLB flush in the PTE case but not PMD > so let's document that to be clear :) Agreed! That's a big benefit of merging the code, it sticks out what is not common already. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-16 8:25 ` Lorenzo Stoakes 2025-10-16 8:59 ` David Hildenbrand @ 2025-10-16 9:12 ` Huang, Ying 1 sibling, 0 replies; 33+ messages in thread From: Huang, Ying @ 2025-10-16 9:12 UTC (permalink / raw) To: Lorenzo Stoakes Cc: David Hildenbrand, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: > On Thu, Oct 16, 2025 at 10:22:57AM +0800, Huang, Ying wrote: >> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: >> >> > On Wed, Oct 15, 2025 at 04:43:14PM +0800, Huang, Ying wrote: >> > OK this is great, let's put it all in the kdoc for the new shared spurious >> > faulting function! :) and additionally add it to the commit message. >> >> Sure. Will do it in the next version. > > Thanks! > >> >> [snip] >> >> >> >> >> >> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >> >> >> index 32e8457ad535..341622ec80e4 100644 >> >> >> --- a/include/linux/pgtable.h >> >> >> +++ b/include/linux/pgtable.h >> >> >> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) >> >> >> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) >> >> >> #endif >> >> >> >> >> >> +#ifndef flush_tlb_fix_spurious_fault_pmd >> >> >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) >> >> >> +#endif >> >> > >> >> > flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to >> >> > flush_tlb_page() - why do we just do nothing in this case here? >> >> >> >> Because all architectures do nothing for the spurious PMD page fault >> >> fixing until the [2/2] of this series. Where, we make it necessary to >> >> flush the local TLB for spurious PMD page fault fixing on arm64 >> >> architecture. >> >> >> >> If we follow the design of flush_tlb_fix_spurious_fault(), we need to >> >> change all architecture implementation to do nothing in this patch to >> >> keep the current behavior. I don't think that it's a good idea. Do >> >> you agree? >> > >> > Yeah probably we should keep the same behaviour as before, which is >> > obviously, prior to this series, we did nothing. >> > >> > I guess in the PTE case we _always_ want to flush the TLB, whereas in the >> > PMD case we otherwise don't have any need to at the point at which the >> > spurious flush is performed. >> > >> > But from your explanation above re: the stale TLB entry this _only_ needs >> > to be done for architectures which might encounter this problem rather than >> > needing a TLB flush in general. >> > >> > Given we're generalising the code and one case always flushes the TLB and >> > the other doesn't maybe it's worth putting a comment in the generalised >> > function mentioning this? >> >> I'm not sure whether it's a good idea to document architecture behaviors >> in the general code. The behavior may be changed architecture by >> architecture in the future. > > Right, but we are unconditionaly doing a TLB flush in the PTE case but not PMD > so let's document that to be clear :) Sure. Will do this. >> >> [snip] --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
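For reference, the asymmetry this sub-thread agrees to document is easiest to see with the two generic fallbacks side by side. The macro definitions below are taken from the series; the comments are only a sketch of the kind of documentation being requested, not merged wording:

#ifndef flush_tlb_fix_spurious_fault
/* PTE case: always flush; a stale entry could otherwise keep faulting. */
#define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address)
#endif

#ifndef flush_tlb_fix_spurious_fault_pmd
/*
 * PMD case: no architecture needed a flush here before arm64 started
 * relying on spurious-fault fixing, so the default is a no-op.
 */
#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0)
#endif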
* [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-13  9:20 [PATCH -v2 0/2] arm, tlbflush: avoid TLBI broadcast if page reused in write fault Huang Ying
  2025-10-13  9:20 ` [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd Huang Ying
@ 2025-10-13  9:20 ` Huang Ying
  2025-10-15 15:28   ` Ryan Roberts
  2025-10-22  4:08   ` Barry Song
  1 sibling, 2 replies; 33+ messages in thread
From: Huang Ying @ 2025-10-13  9:20 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand
Cc: Huang Ying, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
	Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
	Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang,
	Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel,
	linux-mm

A multi-thread customer workload with large memory footprint uses
fork()/exec() to run some external programs every few tens of seconds.
When running the workload on an arm64 server machine, it's observed
that quite some CPU cycles are spent in the TLB flushing functions.
When running the workload on an x86_64 server machine, they are not.
This causes the performance on arm64 to be much worse than that on
x86_64.

During the workload running, after fork()/exec() write-protects all
pages in the parent process, memory writing in the parent process
will cause a write protection fault. Then the page fault handler
will make the PTE/PDE writable if the page can be reused, which is
almost always true in the workload. On arm64, to avoid the write
protection fault on other CPUs, the page fault handler flushes the TLB
globally with TLBI broadcast after changing the PTE/PDE. However, this
isn't always necessary. Firstly, it's safe to leave some stall
read-only TLB entries as long as they will be flushed finally.
Secondly, it's quite possible that the original read-only PTE/PDEs
aren't cached in remote TLB at all if the memory footprint is large.
In fact, on x86_64, the page fault handler doesn't flush the remote
TLB in this situation, which benefits the performance a lot.

To improve the performance on arm64, make the write protection fault
handler flush the TLB locally instead of globally via TLBI broadcast
after making the PTE/PDE writable. If there are stall read-only TLB
entries in the remote CPUs, the page fault handler on these CPUs will
regard the page fault as spurious and flush the stall TLB entries.

To test the patchset, the usemem.c from vm-scalability
(https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git)
was modified to support calling fork()/exec() periodically (merged).
To mimic the behavior of the customer workload, run usemem with 4
threads, access 100GB memory, and call fork()/exec() every 40 seconds.
Test results show that with the patchset the score of usemem improves
~40.6%. The cycles% of the TLB flush functions drops from ~50.5% to
~0.3% in the perf profile.
Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- arch/arm64/include/asm/pgtable.h | 14 +++++--- arch/arm64/include/asm/tlbflush.h | 56 +++++++++++++++++++++++++++++++ arch/arm64/mm/contpte.c | 3 +- arch/arm64/mm/fault.c | 2 +- 4 files changed, 67 insertions(+), 8 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index aa89c2e67ebc..35bae2e4bcfe 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void) #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ /* - * Outside of a few very special situations (e.g. hibernation), we always - * use broadcast TLB invalidation instructions, therefore a spurious page - * fault on one CPU which has been handled concurrently by another CPU - * does not need to perform additional invalidation. + * We use local TLB invalidation instruction when reusing page in + * write protection fault handler to avoid TLBI broadcast in the hot + * path. This will cause spurious page faults if stall read-only TLB + * entries exist. */ -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0) +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \ + local_flush_tlb_page_nonotify(vma, address) + +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ + local_flush_tlb_page_nonotify(vma, address) /* * ZERO_PAGE is a global shared page that is always zero: used diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h index 18a5dc0c9a54..651b31fd18bb 100644 --- a/arch/arm64/include/asm/tlbflush.h +++ b/arch/arm64/include/asm/tlbflush.h @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void) * cannot be easily determined, the value TLBI_TTL_UNKNOWN will * perform a non-hinted invalidation. * + * local_flush_tlb_page(vma, addr) + * Local variant of flush_tlb_page(). Stale TLB entries may + * remain in remote CPUs. + * + * local_flush_tlb_page_nonotify(vma, addr) + * Same as local_flush_tlb_page() except MMU notifier will not be + * called. + * + * local_flush_tlb_contpte_range(vma, start, end) + * Invalidate the virtual-address range '[start, end)' mapped with + * contpte on local CPU for the user address space corresponding + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs. 
* * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented * on top of these routines, since that is our interface to the mmu_gather @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm) mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); } +static inline void __local_flush_tlb_page_nonotify_nosync( + struct mm_struct *mm, unsigned long uaddr) +{ + unsigned long addr; + + dsb(nshst); + addr = __TLBI_VADDR(uaddr, ASID(mm)); + __tlbi(vale1, addr); + __tlbi_user(vale1, addr); +} + +static inline void local_flush_tlb_page_nonotify( + struct vm_area_struct *vma, unsigned long uaddr) +{ + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); + dsb(nsh); +} + +static inline void local_flush_tlb_page(struct vm_area_struct *vma, + unsigned long uaddr) +{ + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK, + (uaddr & PAGE_MASK) + PAGE_SIZE); + dsb(nsh); +} + static inline void __flush_tlb_page_nosync(struct mm_struct *mm, unsigned long uaddr) { @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, dsb(ish); } +static inline void local_flush_tlb_contpte_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end) +{ + unsigned long asid, pages; + + start = round_down(start, PAGE_SIZE); + end = round_up(end, PAGE_SIZE); + pages = (end - start) >> PAGE_SHIFT; + + dsb(nshst); + asid = ASID(vma->vm_mm); + __flush_tlb_range_op(vale1, start, pages, PAGE_SIZE, asid, + 3, true, lpa2_is_enabled()); + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end); + dsb(nsh); +} + static inline void flush_tlb_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) { diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c index c0557945939c..0f9bbb7224dc 100644 --- a/arch/arm64/mm/contpte.c +++ b/arch/arm64/mm/contpte.c @@ -622,8 +622,7 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma, __ptep_set_access_flags(vma, addr, ptep, entry, 0); if (dirty) - __flush_tlb_range(vma, start_addr, addr, - PAGE_SIZE, true, 3); + local_flush_tlb_contpte_range(vma, start_addr, addr); } else { __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte); __ptep_set_access_flags(vma, addr, ptep, entry, dirty); diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index d816ff44faff..22f54f5afe3f 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -235,7 +235,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma, /* Invalidate a stale read-only entry */ if (dirty) - flush_tlb_page(vma, address); + local_flush_tlb_page(vma, address); return 1; } -- 2.39.5 ^ permalink raw reply [flat|nested] 33+ messages in thread
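To summarize the mechanism the patch relies on, here is a hedged two-CPU sketch in illustrative pseudo-code; it is not the literal mm/ or arch/arm64 code, and names such as orig_pte and ptl are placeholders:

/* CPU A: takes the write protection fault; the page can be reused. */
spin_lock(ptl);
entry = pte_mkdirty(pte_mkwrite(orig_pte, vma));
set_pte_at(mm, addr, ptep, entry);
local_flush_tlb_page(vma, addr);	/* local tlbi only, no broadcast */
spin_unlock(ptl);

/* CPU B: may still cache the stale read-only entry and fault again. */
spin_lock(ptl);
if (pte_write(ptep_get(ptep))) {
	/* CPU A already fixed the PTE: the fault is spurious. */
	flush_tlb_fix_spurious_fault(vma, addr, ptep);	/* local tlbi */
}
spin_unlock(ptl);

The broadcast invalidation is traded for at most one extra (spurious) page fault per CPU that actually holds a stale entry, which the commit message above argues is rare for large memory footprints.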
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault 2025-10-13 9:20 ` [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault Huang Ying @ 2025-10-15 15:28 ` Ryan Roberts 2025-10-16 1:35 ` Huang, Ying 2025-10-22 4:08 ` Barry Song 1 sibling, 1 reply; 33+ messages in thread From: Ryan Roberts @ 2025-10-15 15:28 UTC (permalink / raw) To: Huang Ying, Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand Cc: Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On 13/10/2025 10:20, Huang Ying wrote: > A multi-thread customer workload with large memory footprint uses > fork()/exec() to run some external programs every tens seconds. When > running the workload on an arm64 server machine, it's observed that > quite some CPU cycles are spent in the TLB flushing functions. While > running the workload on the x86_64 server machine, it's not. This > causes the performance on arm64 to be much worse than that on x86_64. > > During the workload running, after fork()/exec() write-protects all > pages in the parent process, memory writing in the parent process > will cause a write protection fault. Then the page fault handler > will make the PTE/PDE writable if the page can be reused, which is > almost always true in the workload. On arm64, to avoid the write > protection fault on other CPUs, the page fault handler flushes the TLB > globally with TLBI broadcast after changing the PTE/PDE. However, this > isn't always necessary. Firstly, it's safe to leave some stall nit: You keep using the word "stall" here and in the code. I think you mean "stale"? > read-only TLB entries as long as they will be flushed finally. > Secondly, it's quite possible that the original read-only PTE/PDEs > aren't cached in remote TLB at all if the memory footprint is large. > In fact, on x86_64, the page fault handler doesn't flush the remote > TLB in this situation, which benefits the performance a lot. > > To improve the performance on arm64, make the write protection fault > handler flush the TLB locally instead of globally via TLBI broadcast > after making the PTE/PDE writable. If there are stall read-only TLB > entries in the remote CPUs, the page fault handler on these CPUs will > regard the page fault as spurious and flush the stall TLB entries. > > To test the patchset, make the usemem.c from vm-scalability > (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git). > support calling fork()/exec() periodically (merged). To mimic the > behavior of the customer workload, run usemem with 4 threads, access > 100GB memory, and call fork()/exec() every 40 seconds. Test results > show that with the patchset the score of usemem improves ~40.6%. The > cycles% of TLB flush functions reduces from ~50.5% to ~0.3% in perf > profile. 
> > Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Will Deacon <will@kernel.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: David Hildenbrand <david@redhat.com> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Zi Yan <ziy@nvidia.com> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > Cc: Ryan Roberts <ryan.roberts@arm.com> > Cc: Yang Shi <yang@os.amperecomputing.com> > Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> > Cc: Dev Jain <dev.jain@arm.com> > Cc: Barry Song <baohua@kernel.org> > Cc: Anshuman Khandual <anshuman.khandual@arm.com> > Cc: Yicong Yang <yangyicong@hisilicon.com> > Cc: Kefeng Wang <wangkefeng.wang@huawei.com> > Cc: Kevin Brodsky <kevin.brodsky@arm.com> > Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> > Cc: linux-arm-kernel@lists.infradead.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org > --- > arch/arm64/include/asm/pgtable.h | 14 +++++--- > arch/arm64/include/asm/tlbflush.h | 56 +++++++++++++++++++++++++++++++ > arch/arm64/mm/contpte.c | 3 +- > arch/arm64/mm/fault.c | 2 +- > 4 files changed, 67 insertions(+), 8 deletions(-) > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > index aa89c2e67ebc..35bae2e4bcfe 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void) > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > /* > - * Outside of a few very special situations (e.g. hibernation), we always > - * use broadcast TLB invalidation instructions, therefore a spurious page > - * fault on one CPU which has been handled concurrently by another CPU > - * does not need to perform additional invalidation. > + * We use local TLB invalidation instruction when reusing page in > + * write protection fault handler to avoid TLBI broadcast in the hot > + * path. This will cause spurious page faults if stall read-only TLB > + * entries exist. > */ > -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0) > +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \ > + local_flush_tlb_page_nonotify(vma, address) > + > +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ > + local_flush_tlb_page_nonotify(vma, address) > > /* > * ZERO_PAGE is a global shared page that is always zero: used > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h > index 18a5dc0c9a54..651b31fd18bb 100644 > --- a/arch/arm64/include/asm/tlbflush.h > +++ b/arch/arm64/include/asm/tlbflush.h > @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void) > * cannot be easily determined, the value TLBI_TTL_UNKNOWN will > * perform a non-hinted invalidation. > * > + * local_flush_tlb_page(vma, addr) > + * Local variant of flush_tlb_page(). Stale TLB entries may > + * remain in remote CPUs. > + * > + * local_flush_tlb_page_nonotify(vma, addr) > + * Same as local_flush_tlb_page() except MMU notifier will not be > + * called. > + * > + * local_flush_tlb_contpte_range(vma, start, end) > + * Invalidate the virtual-address range '[start, end)' mapped with > + * contpte on local CPU for the user address space corresponding > + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs. 
> * > * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented > * on top of these routines, since that is our interface to the mmu_gather > @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm) > mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); > } > > +static inline void __local_flush_tlb_page_nonotify_nosync( > + struct mm_struct *mm, unsigned long uaddr) > +{ > + unsigned long addr; > + > + dsb(nshst); > + addr = __TLBI_VADDR(uaddr, ASID(mm)); > + __tlbi(vale1, addr); > + __tlbi_user(vale1, addr); > +} > + > +static inline void local_flush_tlb_page_nonotify( > + struct vm_area_struct *vma, unsigned long uaddr) > +{ > + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); > + dsb(nsh); > +} > + > +static inline void local_flush_tlb_page(struct vm_area_struct *vma, > + unsigned long uaddr) > +{ > + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); > + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK, > + (uaddr & PAGE_MASK) + PAGE_SIZE); > + dsb(nsh); > +} > + > static inline void __flush_tlb_page_nosync(struct mm_struct *mm, > unsigned long uaddr) > { > @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, > dsb(ish); > } > > +static inline void local_flush_tlb_contpte_range(struct vm_area_struct *vma, > + unsigned long start, unsigned long end) This would be clearer as an API if it was like this: static inline void local_flush_tlb_contpte(struct vm_area_struct *vma, unsigned long uaddr) i.e. the user doesn't set the range - it's implicitly CONT_PTE_SIZE starting at round_down(uaddr, PAGE_SIZE). Thanks, Ryan > +{ > + unsigned long asid, pages; > + > + start = round_down(start, PAGE_SIZE); > + end = round_up(end, PAGE_SIZE); > + pages = (end - start) >> PAGE_SHIFT; > + > + dsb(nshst); > + asid = ASID(vma->vm_mm); > + __flush_tlb_range_op(vale1, start, pages, PAGE_SIZE, asid, > + 3, true, lpa2_is_enabled()); > + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end); > + dsb(nsh); > +} > + > static inline void flush_tlb_range(struct vm_area_struct *vma, > unsigned long start, unsigned long end) > { > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c > index c0557945939c..0f9bbb7224dc 100644 > --- a/arch/arm64/mm/contpte.c > +++ b/arch/arm64/mm/contpte.c > @@ -622,8 +622,7 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma, > __ptep_set_access_flags(vma, addr, ptep, entry, 0); > > if (dirty) > - __flush_tlb_range(vma, start_addr, addr, > - PAGE_SIZE, true, 3); > + local_flush_tlb_contpte_range(vma, start_addr, addr); > } else { > __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte); > __ptep_set_access_flags(vma, addr, ptep, entry, dirty); > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c > index d816ff44faff..22f54f5afe3f 100644 > --- a/arch/arm64/mm/fault.c > +++ b/arch/arm64/mm/fault.c > @@ -235,7 +235,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma, > > /* Invalidate a stale read-only entry */ > if (dirty) > - flush_tlb_page(vma, address); > + local_flush_tlb_page(vma, address); > return 1; > } > ^ permalink raw reply [flat|nested] 33+ messages in thread
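A minimal sketch of what the suggested wrapper could look like, assuming it is layered on the series' range-based helper; the rounding follows Ryan's wording, and this is an illustration rather than code from any posted version:

static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
					   unsigned long uaddr)
{
	unsigned long start = round_down(uaddr, PAGE_SIZE);

	/* The range is implicit: CONT_PTE_SIZE worth of address space. */
	local_flush_tlb_contpte_range(vma, start, start + CONT_PTE_SIZE);
}

The benefit of this shape is that callers cannot pass a range that disagrees with the contpte geometry; the helper owns the extent.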
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault 2025-10-15 15:28 ` Ryan Roberts @ 2025-10-16 1:35 ` Huang, Ying 0 siblings, 0 replies; 33+ messages in thread From: Huang, Ying @ 2025-10-16 1:35 UTC (permalink / raw) To: Ryan Roberts Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm Hi, Ryan, Thanks for comments! Ryan Roberts <ryan.roberts@arm.com> writes: > On 13/10/2025 10:20, Huang Ying wrote: >> A multi-thread customer workload with large memory footprint uses >> fork()/exec() to run some external programs every tens seconds. When >> running the workload on an arm64 server machine, it's observed that >> quite some CPU cycles are spent in the TLB flushing functions. While >> running the workload on the x86_64 server machine, it's not. This >> causes the performance on arm64 to be much worse than that on x86_64. >> >> During the workload running, after fork()/exec() write-protects all >> pages in the parent process, memory writing in the parent process >> will cause a write protection fault. Then the page fault handler >> will make the PTE/PDE writable if the page can be reused, which is >> almost always true in the workload. On arm64, to avoid the write >> protection fault on other CPUs, the page fault handler flushes the TLB >> globally with TLBI broadcast after changing the PTE/PDE. However, this >> isn't always necessary. Firstly, it's safe to leave some stall > > nit: You keep using the word "stall" here and in the code. I think you mean "stale"? OOPS, my poor English :-( Yes, it should be "stale". Thanks for pointing this out, will fix it in the future versions. >> read-only TLB entries as long as they will be flushed finally. >> Secondly, it's quite possible that the original read-only PTE/PDEs >> aren't cached in remote TLB at all if the memory footprint is large. >> In fact, on x86_64, the page fault handler doesn't flush the remote >> TLB in this situation, which benefits the performance a lot. >> >> To improve the performance on arm64, make the write protection fault >> handler flush the TLB locally instead of globally via TLBI broadcast >> after making the PTE/PDE writable. If there are stall read-only TLB >> entries in the remote CPUs, the page fault handler on these CPUs will >> regard the page fault as spurious and flush the stall TLB entries. >> >> To test the patchset, make the usemem.c from vm-scalability >> (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git). >> support calling fork()/exec() periodically (merged). To mimic the >> behavior of the customer workload, run usemem with 4 threads, access >> 100GB memory, and call fork()/exec() every 40 seconds. Test results >> show that with the patchset the score of usemem improves ~40.6%. The >> cycles% of TLB flush functions reduces from ~50.5% to ~0.3% in perf >> profile. 
>> >> Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> >> Cc: Catalin Marinas <catalin.marinas@arm.com> >> Cc: Will Deacon <will@kernel.org> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Cc: David Hildenbrand <david@redhat.com> >> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >> Cc: Vlastimil Babka <vbabka@suse.cz> >> Cc: Zi Yan <ziy@nvidia.com> >> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >> Cc: Ryan Roberts <ryan.roberts@arm.com> >> Cc: Yang Shi <yang@os.amperecomputing.com> >> Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> >> Cc: Dev Jain <dev.jain@arm.com> >> Cc: Barry Song <baohua@kernel.org> >> Cc: Anshuman Khandual <anshuman.khandual@arm.com> >> Cc: Yicong Yang <yangyicong@hisilicon.com> >> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> >> Cc: Kevin Brodsky <kevin.brodsky@arm.com> >> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> >> Cc: linux-arm-kernel@lists.infradead.org >> Cc: linux-kernel@vger.kernel.org >> Cc: linux-mm@kvack.org >> --- >> arch/arm64/include/asm/pgtable.h | 14 +++++--- >> arch/arm64/include/asm/tlbflush.h | 56 +++++++++++++++++++++++++++++++ >> arch/arm64/mm/contpte.c | 3 +- >> arch/arm64/mm/fault.c | 2 +- >> 4 files changed, 67 insertions(+), 8 deletions(-) >> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >> index aa89c2e67ebc..35bae2e4bcfe 100644 >> --- a/arch/arm64/include/asm/pgtable.h >> +++ b/arch/arm64/include/asm/pgtable.h >> @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void) >> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ >> >> /* >> - * Outside of a few very special situations (e.g. hibernation), we always >> - * use broadcast TLB invalidation instructions, therefore a spurious page >> - * fault on one CPU which has been handled concurrently by another CPU >> - * does not need to perform additional invalidation. >> + * We use local TLB invalidation instruction when reusing page in >> + * write protection fault handler to avoid TLBI broadcast in the hot >> + * path. This will cause spurious page faults if stall read-only TLB >> + * entries exist. >> */ >> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0) >> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \ >> + local_flush_tlb_page_nonotify(vma, address) >> + >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ >> + local_flush_tlb_page_nonotify(vma, address) >> >> /* >> * ZERO_PAGE is a global shared page that is always zero: used >> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h >> index 18a5dc0c9a54..651b31fd18bb 100644 >> --- a/arch/arm64/include/asm/tlbflush.h >> +++ b/arch/arm64/include/asm/tlbflush.h >> @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void) >> * cannot be easily determined, the value TLBI_TTL_UNKNOWN will >> * perform a non-hinted invalidation. >> * >> + * local_flush_tlb_page(vma, addr) >> + * Local variant of flush_tlb_page(). Stale TLB entries may >> + * remain in remote CPUs. >> + * >> + * local_flush_tlb_page_nonotify(vma, addr) >> + * Same as local_flush_tlb_page() except MMU notifier will not be >> + * called. >> + * >> + * local_flush_tlb_contpte_range(vma, start, end) >> + * Invalidate the virtual-address range '[start, end)' mapped with >> + * contpte on local CPU for the user address space corresponding >> + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs. 
>> * >> * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented >> * on top of these routines, since that is our interface to the mmu_gather >> @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm) >> mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); >> } >> >> +static inline void __local_flush_tlb_page_nonotify_nosync( >> + struct mm_struct *mm, unsigned long uaddr) >> +{ >> + unsigned long addr; >> + >> + dsb(nshst); >> + addr = __TLBI_VADDR(uaddr, ASID(mm)); >> + __tlbi(vale1, addr); >> + __tlbi_user(vale1, addr); >> +} >> + >> +static inline void local_flush_tlb_page_nonotify( >> + struct vm_area_struct *vma, unsigned long uaddr) >> +{ >> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); >> + dsb(nsh); >> +} >> + >> +static inline void local_flush_tlb_page(struct vm_area_struct *vma, >> + unsigned long uaddr) >> +{ >> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); >> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK, >> + (uaddr & PAGE_MASK) + PAGE_SIZE); >> + dsb(nsh); >> +} >> + >> static inline void __flush_tlb_page_nosync(struct mm_struct *mm, >> unsigned long uaddr) >> { >> @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, >> dsb(ish); >> } >> >> +static inline void local_flush_tlb_contpte_range(struct vm_area_struct *vma, >> + unsigned long start, unsigned long end) > > This would be clearer as an API if it was like this: > > static inline void local_flush_tlb_contpte(struct vm_area_struct *vma, > unsigned long uaddr) > > i.e. the user doesn't set the range - it's implicitly CONT_PTE_SIZE starting at > round_down(uaddr, PAGE_SIZE). Sure. Will do this. > Thanks, > Ryan > >> +{ >> + unsigned long asid, pages; >> + >> + start = round_down(start, PAGE_SIZE); >> + end = round_up(end, PAGE_SIZE); >> + pages = (end - start) >> PAGE_SHIFT; >> + >> + dsb(nshst); >> + asid = ASID(vma->vm_mm); >> + __flush_tlb_range_op(vale1, start, pages, PAGE_SIZE, asid, >> + 3, true, lpa2_is_enabled()); >> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end); >> + dsb(nsh); >> +} >> + >> static inline void flush_tlb_range(struct vm_area_struct *vma, >> unsigned long start, unsigned long end) >> { >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c >> index c0557945939c..0f9bbb7224dc 100644 >> --- a/arch/arm64/mm/contpte.c >> +++ b/arch/arm64/mm/contpte.c >> @@ -622,8 +622,7 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma, >> __ptep_set_access_flags(vma, addr, ptep, entry, 0); >> >> if (dirty) >> - __flush_tlb_range(vma, start_addr, addr, >> - PAGE_SIZE, true, 3); >> + local_flush_tlb_contpte_range(vma, start_addr, addr); >> } else { >> __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte); >> __ptep_set_access_flags(vma, addr, ptep, entry, dirty); >> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c >> index d816ff44faff..22f54f5afe3f 100644 >> --- a/arch/arm64/mm/fault.c >> +++ b/arch/arm64/mm/fault.c >> @@ -235,7 +235,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma, >> >> /* Invalidate a stale read-only entry */ >> if (dirty) >> - flush_tlb_page(vma, address); >> + local_flush_tlb_page(vma, address); >> return 1; >> } >> --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault 2025-10-13 9:20 ` [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault Huang Ying 2025-10-15 15:28 ` Ryan Roberts @ 2025-10-22 4:08 ` Barry Song 2025-10-22 7:31 ` Huang, Ying 2025-10-23 10:18 ` Ryan Roberts 1 sibling, 2 replies; 33+ messages in thread From: Barry Song @ 2025-10-22 4:08 UTC (permalink / raw) To: Huang Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > index aa89c2e67ebc..35bae2e4bcfe 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void) > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > /* > - * Outside of a few very special situations (e.g. hibernation), we always > - * use broadcast TLB invalidation instructions, therefore a spurious page > - * fault on one CPU which has been handled concurrently by another CPU > - * does not need to perform additional invalidation. > + * We use local TLB invalidation instruction when reusing page in > + * write protection fault handler to avoid TLBI broadcast in the hot > + * path. This will cause spurious page faults if stall read-only TLB > + * entries exist. > */ > -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0) > +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \ > + local_flush_tlb_page_nonotify(vma, address) > + > +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ > + local_flush_tlb_page_nonotify(vma, address) > > /* > * ZERO_PAGE is a global shared page that is always zero: used > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h > index 18a5dc0c9a54..651b31fd18bb 100644 > --- a/arch/arm64/include/asm/tlbflush.h > +++ b/arch/arm64/include/asm/tlbflush.h > @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void) > * cannot be easily determined, the value TLBI_TTL_UNKNOWN will > * perform a non-hinted invalidation. > * > + * local_flush_tlb_page(vma, addr) > + * Local variant of flush_tlb_page(). Stale TLB entries may > + * remain in remote CPUs. > + * > + * local_flush_tlb_page_nonotify(vma, addr) > + * Same as local_flush_tlb_page() except MMU notifier will not be > + * called. > + * > + * local_flush_tlb_contpte_range(vma, start, end) > + * Invalidate the virtual-address range '[start, end)' mapped with > + * contpte on local CPU for the user address space corresponding > + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs. > * > * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented > * on top of these routines, since that is our interface to the mmu_gather > @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm) > mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); > } > > +static inline void __local_flush_tlb_page_nonotify_nosync( > + struct mm_struct *mm, unsigned long uaddr) > +{ > + unsigned long addr; > + > + dsb(nshst); We were issuing dsb(ishst) even for the nosync case, likely to ensure PTE visibility across cores. 
However, since set_ptes already includes a dsb(ishst) in __set_pte_complete(), does this mean we’re being overly cautious in __flush_tlb_page_nosync() in many cases? static inline void __flush_tlb_page_nosync(struct mm_struct *mm, unsigned long uaddr) { unsigned long addr; dsb(ishst); addr = __TLBI_VADDR(uaddr, ASID(mm)); __tlbi(vale1is, addr); __tlbi_user(vale1is, addr); mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK, (uaddr & PAGE_MASK) + PAGE_SIZE); } On the other hand, __ptep_set_access_flags() doesn’t seem to use set_ptes(), so there’s no guarantee the updated PTEs are visible to all cores. If a remote CPU later encounters a page fault and performs a TLB invalidation, will it still see a stable PTE? > + addr = __TLBI_VADDR(uaddr, ASID(mm)); > + __tlbi(vale1, addr); > + __tlbi_user(vale1, addr); > +} > + > +static inline void local_flush_tlb_page_nonotify( > + struct vm_area_struct *vma, unsigned long uaddr) > +{ > + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); > + dsb(nsh); > +} > + > +static inline void local_flush_tlb_page(struct vm_area_struct *vma, > + unsigned long uaddr) > +{ > + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); > + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK, > + (uaddr & PAGE_MASK) + PAGE_SIZE); > + dsb(nsh); > +} > + > static inline void __flush_tlb_page_nosync(struct mm_struct *mm, > unsigned long uaddr) > { > @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, > dsb(ish); > } > We already have functions like __flush_tlb_page_nosync() and __flush_tlb_range_nosync(). Is there a way to factor out or extract their common parts? Is it because of the differences in barriers that this extraction of common code isn’t feasible? Thanks Barry ^ permalink raw reply [flat|nested] 33+ messages in thread
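As background for the barrier question above: dsb(ishst) makes prior stores visible to the whole Inner Shareable domain (all CPUs), while dsb(nshst) only does so for the local CPU. Side by side, the two flush patterns discussed in this thread look roughly like this (condensed from code already quoted above):

/* Broadcast flush: every CPU's walker must see the PTE store first. */
dsb(ishst);			/* order the store inner-shareable-wide */
addr = __TLBI_VADDR(uaddr, ASID(mm));
__tlbi(vale1is, addr);		/* invalidate on all CPUs */
dsb(ish);			/* wait for completion everywhere */

/* Local flush (this series): only the local walker matters. */
dsb(nshst);			/* order the store for this CPU only */
addr = __TLBI_VADDR(uaddr, ASID(mm));
__tlbi(vale1, addr);		/* invalidate on this CPU only */
dsb(nsh);			/* wait for local completion */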
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault 2025-10-22 4:08 ` Barry Song @ 2025-10-22 7:31 ` Huang, Ying 2025-10-22 8:14 ` Barry Song 2025-10-23 10:18 ` Ryan Roberts 1 sibling, 1 reply; 33+ messages in thread From: Huang, Ying @ 2025-10-22 7:31 UTC (permalink / raw) To: Barry Song Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm Hi, Barry, Barry Song <21cnbao@gmail.com> writes: >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >> index aa89c2e67ebc..35bae2e4bcfe 100644 >> --- a/arch/arm64/include/asm/pgtable.h >> +++ b/arch/arm64/include/asm/pgtable.h >> @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void) >> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ >> >> /* >> - * Outside of a few very special situations (e.g. hibernation), we always >> - * use broadcast TLB invalidation instructions, therefore a spurious page >> - * fault on one CPU which has been handled concurrently by another CPU >> - * does not need to perform additional invalidation. >> + * We use local TLB invalidation instruction when reusing page in >> + * write protection fault handler to avoid TLBI broadcast in the hot >> + * path. This will cause spurious page faults if stall read-only TLB >> + * entries exist. >> */ >> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0) >> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \ >> + local_flush_tlb_page_nonotify(vma, address) >> + >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ >> + local_flush_tlb_page_nonotify(vma, address) >> >> /* >> * ZERO_PAGE is a global shared page that is always zero: used >> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h >> index 18a5dc0c9a54..651b31fd18bb 100644 >> --- a/arch/arm64/include/asm/tlbflush.h >> +++ b/arch/arm64/include/asm/tlbflush.h >> @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void) >> * cannot be easily determined, the value TLBI_TTL_UNKNOWN will >> * perform a non-hinted invalidation. >> * >> + * local_flush_tlb_page(vma, addr) >> + * Local variant of flush_tlb_page(). Stale TLB entries may >> + * remain in remote CPUs. >> + * >> + * local_flush_tlb_page_nonotify(vma, addr) >> + * Same as local_flush_tlb_page() except MMU notifier will not be >> + * called. >> + * >> + * local_flush_tlb_contpte_range(vma, start, end) >> + * Invalidate the virtual-address range '[start, end)' mapped with >> + * contpte on local CPU for the user address space corresponding >> + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs. >> * >> * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented >> * on top of these routines, since that is our interface to the mmu_gather >> @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm) >> mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); >> } >> >> +static inline void __local_flush_tlb_page_nonotify_nosync( >> + struct mm_struct *mm, unsigned long uaddr) >> +{ >> + unsigned long addr; >> + >> + dsb(nshst); > > We were issuing dsb(ishst) even for the nosync case, likely to ensure > PTE visibility across cores. 
> However, since set_ptes already includes a
> dsb(ishst) in __set_pte_complete(), does this mean we're being overly
> cautious in __flush_tlb_page_nosync() in many cases?
>
> static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>                                            unsigned long uaddr)
> {
>         unsigned long addr;
>
>         dsb(ishst);
>         addr = __TLBI_VADDR(uaddr, ASID(mm));
>         __tlbi(vale1is, addr);
>         __tlbi_user(vale1is, addr);
>         mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>                                                     (uaddr & PAGE_MASK) +
>                                                     PAGE_SIZE);
> }

IIUC, _nosync() here means it doesn't synchronize with the following
code. It still synchronizes with the previous code, mainly the
preceding page table changes. And yes, there may be room to improve
this.

> On the other hand, __ptep_set_access_flags() doesn't seem to use
> set_ptes(), so there's no guarantee the updated PTEs are visible to all
> cores. If a remote CPU later encounters a page fault and performs a TLB
> invalidation, will it still see a stable PTE?

I don't think so. We just flush the local TLB in the
local_flush_tlb_page() family of functions. So, we only need to
guarantee that the page table changes are visible for local page table
walking. If a page fault occurs on a remote CPU, we will call
local_flush_tlb_page() on the remote CPU.

>> +       addr = __TLBI_VADDR(uaddr, ASID(mm));
>> +       __tlbi(vale1, addr);
>> +       __tlbi_user(vale1, addr);
>> +}
>> +
>> +static inline void local_flush_tlb_page_nonotify(
>> +       struct vm_area_struct *vma, unsigned long uaddr)
>> +{
>> +       __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> +       dsb(nsh);
>> +}
>> +
>> +static inline void local_flush_tlb_page(struct vm_area_struct *vma,
>> +       unsigned long uaddr)
>> +{
>> +       __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> +       mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
>> +                                                   (uaddr & PAGE_MASK) + PAGE_SIZE);
>> +       dsb(nsh);
>> +}
>> +
>>  static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>>                                             unsigned long uaddr)
>>  {
>> @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>>         dsb(ish);
>>  }
>>
>
> We already have functions like
> __flush_tlb_page_nosync() and __flush_tlb_range_nosync().
> Is there a way to factor out or extract their common parts?
>
> Is it because of the differences in barriers that this extraction of
> common code isn't feasible?

Yes, it's a good idea to do some code cleanup to reduce code
duplication. Ryan plans to work on this.

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
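In other words, the _nosync() variants still order against the preceding page table update (via dsb(nshst)) but leave the completion barrier to the caller, which permits batching. A minimal sketch, assuming the helpers added by this patch; uaddr1 and uaddr2 are placeholder addresses:

/* Invalidate two pages but pay the completion barrier only once. */
__local_flush_tlb_page_nonotify_nosync(mm, uaddr1);	/* dsb(nshst) + tlbi */
__local_flush_tlb_page_nonotify_nosync(mm, uaddr2);
dsb(nsh);	/* single wait for both local invalidations to complete */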
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault 2025-10-22 7:31 ` Huang, Ying @ 2025-10-22 8:14 ` Barry Song 2025-10-22 9:02 ` Huang, Ying 0 siblings, 1 reply; 33+ messages in thread From: Barry Song @ 2025-10-22 8:14 UTC (permalink / raw) To: Huang, Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm > > > > static inline void __flush_tlb_page_nosync(struct mm_struct *mm, > > unsigned long uaddr) > > { > > unsigned long addr; > > > > dsb(ishst); > > addr = __TLBI_VADDR(uaddr, ASID(mm)); > > __tlbi(vale1is, addr); > > __tlbi_user(vale1is, addr); > > mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK, > > (uaddr & PAGE_MASK) + > > PAGE_SIZE); > > } > > IIUC, _nosync() here means doesn't synchronize with the following code. > It still synchronizes with the previous code, mainly the page table > changing. And, Yes. There may be room to improve this. > > > On the other hand, __ptep_set_access_flags() doesn’t seem to use > > set_ptes(), so there’s no guarantee the updated PTEs are visible to all > > cores. If a remote CPU later encounters a page fault and performs a TLB > > invalidation, will it still see a stable PTE? > > I don't think so. We just flush local TLB in local_flush_tlb_page() > family functions. So, we only needs to guarantee the page table changes > are available for the local page table walking. If a page fault occurs > on a remote CPU, we will call local_flush_tlb_page() on the remote CPU. > My concern is that: We don’t have a dsb(ish) to ensure the PTE page table is visible to remote CPUs, since you’re using dsb(nsh). So even if a remote CPU performs local_flush_tlb_page(), the memory may not be synchronized yet, and it could still see the old PTE. Thanks Barry ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22  8:14 ` Barry Song
@ 2025-10-22  9:02 ` Huang, Ying
  2025-10-22  9:17   ` Barry Song
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2025-10-22  9:02 UTC (permalink / raw)
To: Barry Song
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts,
	Yang Shi, Christoph Lameter (Ampere), Dev Jain, Anshuman Khandual,
	Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei,
	linux-arm-kernel, linux-kernel, linux-mm

Barry Song <21cnbao@gmail.com> writes:

>> >
>> > static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>> >                                            unsigned long uaddr)
>> > {
>> >         unsigned long addr;
>> >
>> >         dsb(ishst);
>> >         addr = __TLBI_VADDR(uaddr, ASID(mm));
>> >         __tlbi(vale1is, addr);
>> >         __tlbi_user(vale1is, addr);
>> >         mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>> >                                                     (uaddr & PAGE_MASK) +
>> >                                                     PAGE_SIZE);
>> > }
>>
>> IIUC, _nosync() here means doesn't synchronize with the following code.
>> It still synchronizes with the previous code, mainly the page table
>> changing. And, Yes. There may be room to improve this.
>>
>> > On the other hand, __ptep_set_access_flags() doesn't seem to use
>> > set_ptes(), so there's no guarantee the updated PTEs are visible to all
>> > cores. If a remote CPU later encounters a page fault and performs a TLB
>> > invalidation, will it still see a stable PTE?
>>
>> I don't think so. We just flush local TLB in local_flush_tlb_page()
>> family functions. So, we only needs to guarantee the page table changes
>> are available for the local page table walking. If a page fault occurs
>> on a remote CPU, we will call local_flush_tlb_page() on the remote CPU.
>>
>
> My concern is that:
>
> We don't have a dsb(ish) to ensure the PTE page table is visible to remote
> CPUs, since you're using dsb(nsh). So even if a remote CPU performs
> local_flush_tlb_page(), the memory may not be synchronized yet, and it could
> still see the old PTE.

So, do you think that after the load/store unit of the remote CPU has
seen the new PTE, the page table walker could still see the old PTE? I
doubt it. Even if so, the worst case is just one extra spurious page
fault. If the possibility of the worst case is low enough, that should
be OK.

Additionally, the page table lock is held when writing the PTE on this
CPU and re-reading the PTE on the remote CPU. That provides some memory
ordering guarantee too.

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault 2025-10-22 9:02 ` Huang, Ying @ 2025-10-22 9:17 ` Barry Song 2025-10-22 9:30 ` Huang, Ying 0 siblings, 1 reply; 33+ messages in thread From: Barry Song @ 2025-10-22 9:17 UTC (permalink / raw) To: Huang, Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On Wed, Oct 22, 2025 at 10:02 PM Huang, Ying <ying.huang@linux.alibaba.com> wrote: > > Barry Song <21cnbao@gmail.com> writes: > > >> > > >> > static inline void __flush_tlb_page_nosync(struct mm_struct *mm, > >> > unsigned long uaddr) > >> > { > >> > unsigned long addr; > >> > > >> > dsb(ishst); > >> > addr = __TLBI_VADDR(uaddr, ASID(mm)); > >> > __tlbi(vale1is, addr); > >> > __tlbi_user(vale1is, addr); > >> > mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK, > >> > (uaddr & PAGE_MASK) + > >> > PAGE_SIZE); > >> > } > >> > >> IIUC, _nosync() here means doesn't synchronize with the following code. > >> It still synchronizes with the previous code, mainly the page table > >> changing. And, Yes. There may be room to improve this. > >> > >> > On the other hand, __ptep_set_access_flags() doesn’t seem to use > >> > set_ptes(), so there’s no guarantee the updated PTEs are visible to all > >> > cores. If a remote CPU later encounters a page fault and performs a TLB > >> > invalidation, will it still see a stable PTE? > >> > >> I don't think so. We just flush local TLB in local_flush_tlb_page() > >> family functions. So, we only needs to guarantee the page table changes > >> are available for the local page table walking. If a page fault occurs > >> on a remote CPU, we will call local_flush_tlb_page() on the remote CPU. > >> > > > > My concern is that: > > > > We don’t have a dsb(ish) to ensure the PTE page table is visible to remote > > CPUs, since you’re using dsb(nsh). So even if a remote CPU performs > > local_flush_tlb_page(), the memory may not be synchronized yet, and it could > > still see the old PTE. > > So, do you think that after the load/store unit of the remote CPU have > seen the new PTE, the page table walker could still see the old PTE? I Without a barrier in the ish domain, remote CPUs’ load/store units may not see the new PTE written by the first CPU performing the reuse. That’s why we need a barrier in the ish domain to ensure the PTE is actually visible across the SMP domain. A store instruction doesn’t guarantee that the data is immediately visible to other CPUs — at least not for load instructions. Though, I’m not entirely sure about the page table walker case. > doubt it. Even if so, the worse case is one extra spurious page fault? > If the possibility of the worst case is low enough, that should be OK. CPU0: CPU1: write pte; do local tlbi; page fault; do local tlbi; -> still old PTE pte visible to CPU1 > > Additionally, the page table lock is held when writing PTE on this CPU > and re-reading PTE on the remote CPU. That provides some memory order > guarantee too. Right, the PTL might take care of it automatically. Thanks Barry ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 9:17 ` Barry Song
@ 2025-10-22 9:30 ` Huang, Ying
  2025-10-22 9:37 ` Barry Song
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2025-10-22 9:30 UTC (permalink / raw)
To: Barry Song
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

Barry Song <21cnbao@gmail.com> writes:

> On Wed, Oct 22, 2025 at 10:02 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> >> >
>> >> > static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>> >> >                                            unsigned long uaddr)
>> >> > {
>> >> >         unsigned long addr;
>> >> >
>> >> >         dsb(ishst);
>> >> >         addr = __TLBI_VADDR(uaddr, ASID(mm));
>> >> >         __tlbi(vale1is, addr);
>> >> >         __tlbi_user(vale1is, addr);
>> >> >         mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>> >> >                                                     (uaddr & PAGE_MASK) +
>> >> >                                                     PAGE_SIZE);
>> >> > }
>> >>
>> >> IIUC, _nosync() here means that it doesn't synchronize with the following
>> >> code.  It still synchronizes with the previous code, mainly the page table
>> >> change.  And, yes, there may be room to improve this.
>> >>
>> >> > On the other hand, __ptep_set_access_flags() doesn't seem to use
>> >> > set_ptes(), so there's no guarantee the updated PTEs are visible to all
>> >> > cores. If a remote CPU later encounters a page fault and performs a TLB
>> >> > invalidation, will it still see a stable PTE?
>> >>
>> >> I don't think so.  We just flush the local TLB in the local_flush_tlb_page()
>> >> family of functions.  So, we only need to guarantee that the page table
>> >> changes are visible to the local page table walker.  If a page fault occurs
>> >> on a remote CPU, we will call local_flush_tlb_page() on the remote CPU.
>> >>
>> >
>> > My concern is that:
>> >
>> > We don't have a dsb(ish) to ensure the PTE page table is visible to remote
>> > CPUs, since you're using dsb(nsh). So even if a remote CPU performs
>> > local_flush_tlb_page(), the memory may not be synchronized yet, and it could
>> > still see the old PTE.
>>
>> So, do you think that after the load/store unit of the remote CPU has
>> seen the new PTE, the page table walker could still see the old PTE?  I
>
> Without a barrier in the ish domain, remote CPUs' load/store units may not
> see the new PTE written by the first CPU performing the reuse.
>
> That's why we need a barrier in the ish domain to ensure the PTE is
> actually visible across the SMP domain. A store instruction doesn't
> guarantee that the data is immediately visible to other CPUs, at least
> not to the load instructions of other CPUs.
>
> Though, I'm not entirely sure about the page table walker case.
>
>> doubt it.  Even if so, the worst case is one extra spurious page fault?
>> If the possibility of the worst case is low enough, that should be OK.
>
> CPU0:                          CPU1:
>
> write pte;
>
> do local tlbi;
>                                page fault;
>                                do local tlbi;  -> still old PTE
>
> pte visible to CPU1

With the PTL, this becomes

CPU0:                          CPU1:

page fault                     page fault
lock PTL
write PTE
do local tlbi
unlock PTL
                               lock PTL       <- pte visible to CPU1
                               read PTE       <- new PTE
                               do local tlbi  <- new PTE
                               unlock PTL

>> Additionally, the page table lock is held when writing the PTE on this CPU
>> and re-reading the PTE on the remote CPU.  That provides some memory
>> ordering guarantee too.
>
> Right, the PTL might take care of it automatically.

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
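A minimal sketch of the ordering this diagram relies on (illustrative, not
the exact mm/ code; flush_tlb_fix_spurious_fault() is the arm64 definition
added by this series):

	/* CPU0: write fault reusing the page */
	spin_lock(ptl);					/* the PTL */
	ptep_set_access_flags(vma, addr, ptep, entry, 1); /* entry = new, writable pte */
	/* ... local tlbi only, as in the patch ... */
	spin_unlock(ptl);				/* release: publishes the pte write */

	/* CPU1: spurious fault through a stale read-only TLB entry */
	spin_lock(ptl);					/* acquire: pairs with CPU0's unlock */
	if (pte_write(ptep_get(ptep)))			/* guaranteed to see the new pte */
		flush_tlb_fix_spurious_fault(vma, addr, ptep); /* local tlbi, no broadcast */
	spin_unlock(ptl);

The acquire/release pair on the PTL is what makes the new pte visible to
CPU1; the barriers inside the flush itself only order the pte store against
the tlbi.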
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 9:30 ` Huang, Ying
@ 2025-10-22 9:37 ` Barry Song
  2025-10-22 9:46 ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: Barry Song @ 2025-10-22 9:37 UTC (permalink / raw)
To: Huang, Ying
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

>
> With the PTL, this becomes
>
> CPU0:                          CPU1:
>
> page fault                     page fault
> lock PTL
> write PTE
> do local tlbi
> unlock PTL
>                                lock PTL       <- pte visible to CPU1
>                                read PTE       <- new PTE
>                                do local tlbi  <- new PTE
>                                unlock PTL

I agree. Yet the ish barrier can still avoid page faults while CPU0
holds the PTL.

CPU0:                          CPU1:

lock PTL

write pte;
Issue ish barrier
do local tlbi;
                               No page fault occurs if the TLB misses

unlock PTL

Otherwise, it could be:

CPU0:                          CPU1:

lock PTL

write pte;
Issue nsh barrier
do local tlbi;
                               page fault occurs if the TLB misses

unlock PTL

I'm not quite sure whether adding an ish right after the PTE modification
has any noticeable performance impact on the test. I assume the most
expensive part is still the dsb that waits for the tlbi broadcast, not the
memory sync barrier for the PTE?

Thanks
Barry

^ permalink raw reply	[flat|nested] 33+ messages in thread
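In code form, the two variants being compared differ only in the
shareability domain of the leading barrier (a sketch; the invalidation
itself stays non-broadcast in both):

	/* the patch as posted: publish the pte for the local walker only */
	dsb(nshst);
	__tlbi(vale1, addr);
	dsb(nsh);

	/* the variant suggested here: publish the pte inner-shareable
	 * first, so a racing remote fault is less likely to still see
	 * the old pte */
	dsb(ishst);
	__tlbi(vale1, addr);
	dsb(nsh);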
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 9:37 ` Barry Song
@ 2025-10-22 9:46 ` Huang, Ying
  2025-10-22 9:55 ` Barry Song
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2025-10-22 9:46 UTC (permalink / raw)
To: Barry Song
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

Barry Song <21cnbao@gmail.com> writes:

>>
>> With the PTL, this becomes
>>
>> CPU0:                          CPU1:
>>
>> page fault                     page fault
>> lock PTL
>> write PTE
>> do local tlbi
>> unlock PTL
>>                                lock PTL       <- pte visible to CPU1
>>                                read PTE       <- new PTE
>>                                do local tlbi  <- new PTE
>>                                unlock PTL
>
> I agree. Yet the ish barrier can still avoid page faults while CPU0
> holds the PTL.

IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
memory writing (becoming visible to other CPUs).  TBH, I doubt that this
is the case.

> CPU0:                          CPU1:
>
> lock PTL
>
> write pte;
> Issue ish barrier
> do local tlbi;
>                                No page fault occurs if the TLB misses
>
> unlock PTL
>
> Otherwise, it could be:
>
> CPU0:                          CPU1:
>
> lock PTL
>
> write pte;
> Issue nsh barrier
> do local tlbi;
>                                page fault occurs if the TLB misses
>
> unlock PTL
>
> I'm not quite sure whether adding an ish right after the PTE modification
> has any noticeable performance impact on the test. I assume the most
> expensive part is still the dsb that waits for the tlbi broadcast, not the
> memory sync barrier for the PTE?

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 9:46 ` Huang, Ying
@ 2025-10-22 9:55 ` Barry Song
  2025-10-22 10:22 ` Barry Song
  2025-10-22 10:34 ` Huang, Ying
  0 siblings, 2 replies; 33+ messages in thread
From: Barry Song @ 2025-10-22 9:55 UTC (permalink / raw)
To: Huang, Ying
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
<ying.huang@linux.alibaba.com> wrote:
>
> >
> > I agree. Yet the ish barrier can still avoid page faults while CPU0
> > holds the PTL.
>
> IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
> memory writing (becoming visible to other CPUs).  TBH, I doubt that this
> is the case.

Why? In any case, nsh does not cover the SMP domain.

I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
are visible to other CPUs. I'm not quite sure why the current flush code
uses dsb(ish); it seems like overkill.

Thanks
Barry

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 9:55 ` Barry Song
@ 2025-10-22 10:22 ` Barry Song
  0 siblings, 0 replies; 33+ messages in thread
From: Barry Song @ 2025-10-22 10:22 UTC (permalink / raw)
To: Huang, Ying
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

On Wed, Oct 22, 2025 at 10:55 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
> >
> > >
> > > I agree. Yet the ish barrier can still avoid page faults while CPU0
> > > holds the PTL.
> >
> > IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
> > memory writing (becoming visible to other CPUs).  TBH, I doubt that this
> > is the case.
>
> Why? In any case, nsh does not cover the SMP domain.
>
> I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
> are visible to other CPUs. I'm not quite sure why the current flush code
> uses dsb(ish); it seems like overkill.

On second thought, the PTE/page table walker might not be a typical SMP
sync case, so a dmb may not be sufficient; we are not dealing with
standard load/store instruction sequences across multiple threads.

In any case, my point is that dsb(ish) might be slightly slower than your
dsb(nsh), but it makes the PTE visible to other CPUs earlier and helps
avoid some page faults after we've written the PTE.

However, if your current nsh version actually provides better performance,
even when multiple threads may access the data simultaneously, it should
be completely fine.

Now you do:

write pte
don't broadcast pte
tlbi
don't broadcast tlbi

we might instead do:

write pte
broadcast pte
tlbi
don't broadcast tlbi

Thanks
Barry

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 9:55 ` Barry Song
  2025-10-22 10:22 ` Barry Song
@ 2025-10-22 10:34 ` Huang, Ying
  2025-10-22 10:52 ` Barry Song
  1 sibling, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2025-10-22 10:34 UTC (permalink / raw)
To: Barry Song
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

Barry Song <21cnbao@gmail.com> writes:

> On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
>
>> >
>> > I agree. Yet the ish barrier can still avoid page faults while CPU0
>> > holds the PTL.
>>
>> IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
>> memory writing (becoming visible to other CPUs).  TBH, I doubt that this
>> is the case.
>
> Why? In any case, nsh does not cover the SMP domain.

I think dsb(ish) will be slower than dsb(nsh) in theory.  I guess that
dsb just waits for the memory write to become visible in the specified
shareability domain instead of making the write faster.

> I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
> are visible

dmb(ishst) (smp_wmb()) should pair with dmb(ishld) (smp_rmb()).

> to other CPUs. I'm not quite sure why the current flush code uses dsb(ish);
> it seems like overkill.

dsb(ish) here is used for the tlbi(XXis) broadcast.  It waits until the
page table change is visible to the page table walkers of the remote CPUs.

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
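The pairing referred to above is the classic message-passing pattern (a
generic sketch for comparison; the hardware page table walker is not a
consumer of this kind, which is why this pattern alone does not cover the
TLB case):

	static int data, flag;

	/* CPU0 (producer) */
	WRITE_ONCE(data, 42);
	smp_wmb();			/* dmb(ishst) on arm64 */
	WRITE_ONCE(flag, 1);

	/* CPU1 (consumer) */
	int r = 0;
	if (READ_ONCE(flag)) {
		smp_rmb();		/* dmb(ishld) on arm64 */
		r = READ_ONCE(data);	/* guaranteed to observe 42 */
	}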
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 10:34 ` Huang, Ying
@ 2025-10-22 10:52 ` Barry Song
  2025-10-23 1:22 ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: Barry Song @ 2025-10-22 10:52 UTC (permalink / raw)
To: Huang, Ying
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

On Wed, Oct 22, 2025 at 11:34 PM Huang, Ying
<ying.huang@linux.alibaba.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
> > <ying.huang@linux.alibaba.com> wrote:
> >
> >> >
> >> > I agree. Yet the ish barrier can still avoid page faults while CPU0
> >> > holds the PTL.
> >>
> >> IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
> >> memory writing (becoming visible to other CPUs).  TBH, I doubt that this
> >> is the case.
> >
> > Why? In any case, nsh does not cover the SMP domain.
>
> I think dsb(ish) will be slower than dsb(nsh) in theory.  I guess that
> dsb just waits for the memory write to become visible in the specified
> shareability domain instead of making the write faster.
>
> > I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
> > are visible
>
> dmb(ishst) (smp_wmb()) should pair with dmb(ishld) (smp_rmb()).
>
> > to other CPUs. I'm not quite sure why the current flush code uses dsb(ish);
> > it seems like overkill.
>
> dsb(ish) here is used for the tlbi(XXis) broadcast.  It waits until the
> page table change is visible to the page table walkers of the remote CPUs.

It seems we're aligned on all points[1], although I'm not sure whether
you have data comparing A and B.

A:
write pte
don't broadcast pte
tlbi
don't broadcast tlbi

with

B:
write pte
broadcast pte
tlbi
don't broadcast tlbi

I guess the gain comes from "don't broadcast tlbi"?
With B, we should be able to share much of the existing code.

[1] https://lore.kernel.org/linux-mm/20251013092038.6963-1-ying.huang@linux.alibaba.com/T/#m54312d4914c69aa550bee7df36711c03a4280c52

Thanks
Barry

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 10:52 ` Barry Song
@ 2025-10-23 1:22 ` Huang, Ying
  2025-10-23 5:39 ` Barry Song
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2025-10-23 1:22 UTC (permalink / raw)
To: Barry Song
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

Barry Song <21cnbao@gmail.com> writes:

> On Wed, Oct 22, 2025 at 11:34 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
>> > <ying.huang@linux.alibaba.com> wrote:
>> >
>> >> >
>> >> > I agree. Yet the ish barrier can still avoid page faults while CPU0
>> >> > holds the PTL.
>> >>
>> >> IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
>> >> memory writing (becoming visible to other CPUs).  TBH, I doubt that this
>> >> is the case.
>> >
>> > Why? In any case, nsh does not cover the SMP domain.
>>
>> I think dsb(ish) will be slower than dsb(nsh) in theory.  I guess that
>> dsb just waits for the memory write to become visible in the specified
>> shareability domain instead of making the write faster.
>>
>> > I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
>> > are visible
>>
>> dmb(ishst) (smp_wmb()) should pair with dmb(ishld) (smp_rmb()).
>>
>> > to other CPUs. I'm not quite sure why the current flush code uses dsb(ish);
>> > it seems like overkill.
>>
>> dsb(ish) here is used for the tlbi(XXis) broadcast.  It waits until the
>> page table change is visible to the page table walkers of the remote CPUs.
>
> It seems we're aligned on all points[1], although I'm not sure whether
> you have data comparing A and B.
>
> A:
> write pte
> don't broadcast pte
> tlbi
> don't broadcast tlbi
>
> with
>
> B:
> write pte
> broadcast pte

I doubt that the pte will be broadcast; DVM broadcast isn't used for
memory coherency, IIUC.

> tlbi
> don't broadcast tlbi
>
> I guess the gain comes from "don't broadcast tlbi"?
> With B, we should be able to share much of the existing code.

Ryan has a plan to reduce the code duplication with the current
solution.

> [1]
> https://lore.kernel.org/linux-mm/20251013092038.6963-1-ying.huang@linux.alibaba.com/T/#m54312d4914c69aa550bee7df36711c03a4280c52

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-23 1:22 ` Huang, Ying
@ 2025-10-23 5:39 ` Barry Song
  2025-10-23 6:15 ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: Barry Song @ 2025-10-23 5:39 UTC (permalink / raw)
To: Huang, Ying
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

> >
> > A:
> > write pte
> > don't broadcast pte
> > tlbi
> > don't broadcast tlbi
> >
> > with
> >
> > B:
> > write pte
> > broadcast pte
>
> I doubt that the pte will be broadcast; DVM broadcast isn't used for
> memory coherency, IIUC.

I guess you're right. By "broadcast," I actually meant the PTE becoming
visible to other CPUs. With a dsb(ish) before the tlbi, other cores' TLBs
can load the new PTE after their TLB entries are shot down. But as you
said, if the hardware doesn't propagate the updated PTE faster, it doesn't
seem to help reduce page faults.

As a side note, I'm curious about the data between dsb(nsh) and dsb(ish)
on your platform. Perhaps because the number of CPU cores is small, I
didn't see any noticeable difference between them on phones.

>
> > tlbi
> > don't broadcast tlbi
> >
> > I guess the gain comes from "don't broadcast tlbi"?
> > With B, we should be able to share much of the existing code.
>
> Ryan has a plan to reduce the code duplication with the current
> solution.

Ok.

Thanks
Barry

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-23 5:39 ` Barry Song
@ 2025-10-23 6:15 ` Huang, Ying
  0 siblings, 0 replies; 33+ messages in thread
From: Huang, Ying @ 2025-10-23 6:15 UTC (permalink / raw)
To: Barry Song
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

Barry Song <21cnbao@gmail.com> writes:

>> >
>> > A:
>> > write pte
>> > don't broadcast pte
>> > tlbi
>> > don't broadcast tlbi
>> >
>> > with
>> >
>> > B:
>> > write pte
>> > broadcast pte
>>
>> I doubt that the pte will be broadcast; DVM broadcast isn't used for
>> memory coherency, IIUC.
>
> I guess you're right. By "broadcast," I actually meant the PTE becoming
> visible to other CPUs. With a dsb(ish) before the tlbi, other cores' TLBs
> can load the new PTE after their TLB entries are shot down. But as you
> said, if the hardware doesn't propagate the updated PTE faster, it doesn't
> seem to help reduce page faults.
>
> As a side note, I'm curious about the data between dsb(nsh) and dsb(ish)
> on your platform. Perhaps because the number of CPU cores is small, I
> didn't see any noticeable difference between them on phones.

Sure.  I can give it a try.  Can you share the test case?

>>
>> > tlbi
>> > don't broadcast tlbi
>> >
>> > I guess the gain comes from "don't broadcast tlbi"?
>> > With B, we should be able to share much of the existing code.
>>
>> Ryan has a plan to reduce the code duplication with the current
>> solution.
>
> Ok.

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
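One possible test case along these lines (an assumption on my part, not
necessarily the benchmark used for this series): after fork()
write-protects the parent's anonymous pages and the child exits, every
subsequent parent write takes the write-protect-fault reuse path, which is
exactly where the nsh-versus-ish leading barrier sits.

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <time.h>
	#include <unistd.h>

	#define SZ	(256UL << 20)

	int main(void)
	{
		char *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		struct timespec t0, t1;

		if (buf == MAP_FAILED)
			return 1;
		memset(buf, 1, SZ);	/* populate the pages */
		if (fork() == 0)
			_exit(0);	/* child exists only to write-protect the parent */
		wait(NULL);
		clock_gettime(CLOCK_MONOTONIC, &t0);
		memset(buf, 2, SZ);	/* one wp fault + page reuse per page */
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("%.3f s\n", (t1.tv_sec - t0.tv_sec) +
		       (t1.tv_nsec - t0.tv_nsec) / 1e9);
		return 0;
	}

Running it pinned to different CPU counts with each barrier variant should
show whether the dsb(nsh) versus dsb(ish) difference is measurable at all.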
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 4:08 ` Barry Song
  2025-10-22 7:31 ` Huang, Ying
@ 2025-10-23 10:18 ` Ryan Roberts
  1 sibling, 0 replies; 33+ messages in thread
From: Ryan Roberts @ 2025-10-23 10:18 UTC (permalink / raw)
To: Barry Song, Huang Ying
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Yang Shi,
    Christoph Lameter (Ampere), Dev Jain, Anshuman Khandual,
    Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei,
    linux-arm-kernel, linux-kernel, linux-mm

Hi Barry, Huang,

On 22/10/2025 05:08, Barry Song wrote:
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index aa89c2e67ebc..35bae2e4bcfe 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void)
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>>  /*
>> - * Outside of a few very special situations (e.g. hibernation), we always
>> - * use broadcast TLB invalidation instructions, therefore a spurious page
>> - * fault on one CPU which has been handled concurrently by another CPU
>> - * does not need to perform additional invalidation.
>> + * We use local TLB invalidation instructions when reusing a page in
>> + * the write protection fault handler to avoid TLBI broadcast in the
>> + * hot path. This will cause spurious page faults if stale read-only
>> + * TLB entries exist.
>>   */
>> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
>> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
>> +	local_flush_tlb_page_nonotify(vma, address)
>> +
>> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
>> +	local_flush_tlb_page_nonotify(vma, address)
>>
>>  /*
>>   * ZERO_PAGE is a global shared page that is always zero: used
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index 18a5dc0c9a54..651b31fd18bb 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void)
>>   *		cannot be easily determined, the value TLBI_TTL_UNKNOWN will
>>   *		perform a non-hinted invalidation.
>>   *
>> + *	local_flush_tlb_page(vma, addr)
>> + *		Local variant of flush_tlb_page(). Stale TLB entries may
>> + *		remain in remote CPUs.
>> + *
>> + *	local_flush_tlb_page_nonotify(vma, addr)
>> + *		Same as local_flush_tlb_page() except the MMU notifier will
>> + *		not be called.
>> + *
>> + *	local_flush_tlb_contpte_range(vma, start, end)
>> + *		Invalidate the virtual-address range '[start, end)' mapped with
>> + *		contpte on the local CPU for the user address space corresponding
>> + *		to 'vma->mm'. Stale TLB entries may remain in remote CPUs.
>>   *
>>   * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented
>>   * on top of these routines, since that is our interface to the mmu_gather
>> @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>>  	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>>  }
>>
>> +static inline void __local_flush_tlb_page_nonotify_nosync(
>> +	struct mm_struct *mm, unsigned long uaddr)
>> +{
>> +	unsigned long addr;
>> +
>> +	dsb(nshst);

I've skimmed through this thread so apologies if I've missed some of the
detail, but thought it might be useful to give my opinion as a summary...
> We were issuing dsb(ishst) even for the nosync case, likely to ensure
> PTE visibility across cores.

The leading dsb prior to issuing the tlbi is to ensure that the HW table
walker(s) will always see the new pte immediately after the tlbi
completes. Without it, you could end up with the old value immediately
re-cached after the tlbi completes. So if you are broadcasting the tlbi,
the dsb needs to be to ish. If you're doing local invalidation, then nsh
is sufficient.

"nosync" is just saying that we will not wait for the tlbi to complete.
You still need to issue the leading dsb to ensure that the table walkers
see the latest pte once the tlbi (eventually) completes.

> However, since set_ptes already includes a
> dsb(ishst) in __set_pte_complete(), does this mean we're being overly
> cautious in __flush_tlb_page_nosync() in many cases?

We only issue a dsb in __set_pte_complete() for kernel ptes. We elide it
for user ptes because we can safely take a fault (for the case where we
transition invalid->valid) for user mappings and that race will resolve
itself with the help of the PTL. For valid->valid or valid->invalid,
there will be an associated tlb flush, which has the barrier.

>
> static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>                                            unsigned long uaddr)
> {
>         unsigned long addr;
>
>         dsb(ishst);
>         addr = __TLBI_VADDR(uaddr, ASID(mm));
>         __tlbi(vale1is, addr);
>         __tlbi_user(vale1is, addr);
>         mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>                                                     (uaddr & PAGE_MASK) +
>                                                     PAGE_SIZE);
> }
>
> On the other hand, __ptep_set_access_flags() doesn't seem to use
> set_ptes(), so there's no guarantee the updated PTEs are visible to all
> cores. If a remote CPU later encounters a page fault and performs a TLB
> invalidation, will it still see a stable PTE?

Yes, because the reads and writes are done under the PTL; that
synchronizes the memory for us.

You were discussing the potential value of upgrading the leading dsb from
nshst to ishst. IMHO that's neither required nor desirable; the memory
synchronization is handled by the PTL.

Overall, this optimization relies on the premise that synchronizing with
remote CPUs is expensive and races are rare, so we should keep everything
local for as long as possible and not worry about micro-optimizing the
efficiency of the race case.

>
>> +	addr = __TLBI_VADDR(uaddr, ASID(mm));
>> +	__tlbi(vale1, addr);
>> +	__tlbi_user(vale1, addr);
>> +}
>> +
>> +static inline void local_flush_tlb_page_nonotify(
>> +	struct vm_area_struct *vma, unsigned long uaddr)
>> +{
>> +	__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> +	dsb(nsh);
>> +}
>> +
>> +static inline void local_flush_tlb_page(struct vm_area_struct *vma,
>> +					unsigned long uaddr)
>> +{
>> +	__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> +	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
>> +						    (uaddr & PAGE_MASK) + PAGE_SIZE);
>> +	dsb(nsh);
>> +}
>> +
>>  static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>>  					   unsigned long uaddr)
>>  {
>> @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>>  	dsb(ish);
>>  }
>>

> We already have functions like
> __flush_tlb_page_nosync() and __flush_tlb_range_nosync().
> Is there a way to factor out or extract their common parts?
>
> Is it because of the differences in barriers that this extraction of
> common code isn't feasible?
I've proposed re-working these functions to add independent flags for
sync/nosync, local/broadcast and notify/nonotify. I think that will clean
it up quite a bit. But I was going to wait for this to land first. And
also, Will has an RFC for some other tlbflush API cleanup (converting it
to C functions), so we might want to wait for or incorporate that too.

Thanks,
Ryan

> Thanks
> Barry

^ permalink raw reply	[flat|nested] 33+ messages in thread
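Ryan's rule of thumb above, reduced to the two sequences it contrasts
(shapes taken from __flush_tlb_page_nosync() and from the patch; the
comments are a summary, not kernel text):

	/* broadcast invalidate: walkers on every CPU may re-walk as soon
	 * as the tlbi completes, so the pte store must be visible across
	 * the inner-shareable domain first */
	dsb(ishst);
	__tlbi(vale1is, addr);
	dsb(ish);

	/* local invalidate: only this CPU's walker re-walks, so a
	 * non-shareable barrier is enough; cross-CPU pte visibility is
	 * the PTL's job, not the barrier's */
	dsb(nshst);
	__tlbi(vale1, addr);
	dsb(nsh);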
end of thread, other threads:[~2025-10-23 10:19 UTC | newest]

Thread overview: 33+ messages
2025-10-13 9:20 [PATCH -v2 0/2] arm, tlbflush: avoid TLBI broadcast if page reused in write fault Huang Ying
2025-10-13 9:20 ` [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd Huang Ying
2025-10-14 14:21   ` Lorenzo Stoakes
2025-10-14 14:38     ` David Hildenbrand
2025-10-14 14:49       ` Lorenzo Stoakes
2025-10-14 14:58         ` David Hildenbrand
2025-10-14 15:13           ` Lorenzo Stoakes
2025-10-15 8:43             ` Huang, Ying
2025-10-15 11:20               ` Lorenzo Stoakes
2025-10-15 12:23                 ` David Hildenbrand
2025-10-16 2:22                   ` Huang, Ying
2025-10-16 8:25                     ` Lorenzo Stoakes
2025-10-16 8:59                       ` David Hildenbrand
2025-10-16 9:12                         ` Huang, Ying
2025-10-13 9:20 ` [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault Huang Ying
2025-10-15 15:28   ` Ryan Roberts
2025-10-16 1:35     ` Huang, Ying
2025-10-22 4:08   ` Barry Song
2025-10-22 7:31     ` Huang, Ying
2025-10-22 8:14       ` Barry Song
2025-10-22 9:02         ` Huang, Ying
2025-10-22 9:17           ` Barry Song
2025-10-22 9:30             ` Huang, Ying
2025-10-22 9:37               ` Barry Song
2025-10-22 9:46                 ` Huang, Ying
2025-10-22 9:55                   ` Barry Song
2025-10-22 10:22                     ` Barry Song
2025-10-22 10:34                     ` Huang, Ying
2025-10-22 10:52                       ` Barry Song
2025-10-23 1:22                         ` Huang, Ying
2025-10-23 5:39                           ` Barry Song
2025-10-23 6:15                             ` Huang, Ying
2025-10-23 10:18     ` Ryan Roberts