* [PATCH -v2 0/2] arm, tlbflush: avoid TLBI broadcast if page reused in write fault @ 2025-10-13 9:20 Huang Ying 2025-10-13 9:20 ` [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd Huang Ying 2025-10-13 9:20 ` [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault Huang Ying 0 siblings, 2 replies; 33+ messages in thread From: Huang Ying @ 2025-10-13 9:20 UTC (permalink / raw) To: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand Cc: Huang Ying, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm This series optimizes system performance by avoiding TLBI broadcast when a page is reused in the write protection fault handler. More details on the background and the test results can be found in [2/2]. Changelog: v2: - Various code cleanups in [1/2], thanks to David's comments! - Remove unnecessary __local_flush_tlb_page_nosync() in [2/2], thanks to Ryan's comments! - Add missing contpte processing, thanks to Ryan's and Catalin's comments! Huang Ying (2): mm: add spurious fault fixing support for huge pmd arm64, tlbflush: don't TLBI broadcast if page reused in write fault arch/arm64/include/asm/pgtable.h | 14 +++++--- arch/arm64/include/asm/tlbflush.h | 56 +++++++++++++++++++++++++++++++ arch/arm64/mm/contpte.c | 3 +- arch/arm64/mm/fault.c | 2 +- include/linux/pgtable.h | 4 +++ mm/huge_memory.c | 22 +++++++++--- mm/internal.h | 4 +-- 7 files changed, 90 insertions(+), 15 deletions(-) Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-13 9:20 [PATCH -v2 0/2] arm, tlbflush: avoid TLBI broadcast if page reused in write fault Huang Ying @ 2025-10-13 9:20 ` Huang Ying 2025-10-14 14:21 ` Lorenzo Stoakes 2025-10-13 9:20 ` [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault Huang Ying 1 sibling, 1 reply; 33+ messages in thread From: Huang Ying @ 2025-10-13 9:20 UTC (permalink / raw) To: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand Cc: Huang Ying, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm In the current kernel, there is spurious fault fixing support for pte, but not for huge pmd because no architectures need it. But in the next patch in the series, we will change the write protection fault handling logic on arm64, so that some stale huge pmd entries may remain in the TLB. These entries need to be flushed via the huge pmd spurious fault fixing mechanism. Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/pgtable.h | 4 ++++ mm/huge_memory.c | 22 +++++++++++++++++----- mm/internal.h | 4 ++-- 3 files changed, 23 insertions(+), 7 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 32e8457ad535..341622ec80e4 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) #endif +#ifndef flush_tlb_fix_spurious_fault_pmd +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) +#endif + /* * When walking page tables, get the address of the next boundary, * or the end address of the range if that comes earlier. 
Although no diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1b81680b4225..8533457c52b7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1641,17 +1641,22 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, EXPORT_SYMBOL_GPL(vmf_insert_folio_pud); #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, bool write) +/* Returns whether the PMD entry is changed */ +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, + pmd_t *pmd, bool write) { + int changed; pmd_t _pmd; _pmd = pmd_mkyoung(*pmd); if (write) _pmd = pmd_mkdirty(_pmd); - if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, - pmd, _pmd, write)) + changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, + pmd, _pmd, write); + if (changed) update_mmu_cache_pmd(vma, addr, pmd); + + return changed; } int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, @@ -1849,7 +1854,14 @@ void huge_pmd_set_accessed(struct vm_fault *vmf) if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) goto unlock; - touch_pmd(vmf->vma, vmf->address, vmf->pmd, write); + if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write)) { + /* See corresponding comments in handle_pte_fault(). */ + if (vmf->flags & FAULT_FLAG_TRIED) + goto unlock; + if (vmf->flags & FAULT_FLAG_WRITE) + flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address, + vmf->pmd); + } unlock: spin_unlock(vmf->ptl); diff --git a/mm/internal.h b/mm/internal.h index 1561fc2ff5b8..8b58ab00a7cd 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1402,8 +1402,8 @@ int __must_check try_grab_folio(struct folio *folio, int refs, */ void touch_pud(struct vm_area_struct *vma, unsigned long addr, pud_t *pud, bool write); -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, bool write); +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, + pmd_t *pmd, bool write); /* * Parses a string with mem suffixes into its order. Useful to parse kernel -- 2.39.5 ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-13 9:20 ` [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd Huang Ying @ 2025-10-14 14:21 ` Lorenzo Stoakes 2025-10-14 14:38 ` David Hildenbrand 2025-10-15 8:43 ` Huang, Ying 0 siblings, 2 replies; 33+ messages in thread From: Lorenzo Stoakes @ 2025-10-14 14:21 UTC (permalink / raw) To: Huang Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On Mon, Oct 13, 2025 at 05:20:37PM +0800, Huang Ying wrote: > In the current kernel, there is spurious fault fixing support for pte, > but not for huge pmd because no architectures need it. But in the > next patch in the series, we will change the write protection fault > handling logic on arm64, so that some stale huge pmd entries may > remain in the TLB. These entries need to be flushed via the huge pmd > spurious fault fixing mechanism. > > Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> Right now the PTE level spurious fault handling is dealt with in handle_pte_fault() when ptep_set_access_flags() returns false. Now you're updating touch_pmd() which is invoked by follow_huge_pmd() and huge_pmd_set_accessed(). 1 - Why are you not adding handling to GUP? 2 - Is this the correct level of abstraction? It's really not obvious but huge_pmd_set_accessed() is invoked by __handle_mm_fault() on a non-WP, non-NUMA hint huge page fault where a page table entry already exists but we are faulting anyway (e.g. non-present or read-only writable). You don't mention any of this in the commit message, which you need to do and really need to explain how spurious faults can arise, why you can only do this at the point of abstraction you do (if you are unable to put it in actual fault-handling code), and you need to add a bunch more comments to explain this. Otherwise this just ends up being a lot of open-coded + confusing 'you have to go look it up/just know' type stuff that we have too much of in mm :) So please update commit message/comments, confirm whether this is the correct level of abstraction, and address other comments below, thanks!
> Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Will Deacon <will@kernel.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: David Hildenbrand <david@redhat.com> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Zi Yan <ziy@nvidia.com> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > Cc: Ryan Roberts <ryan.roberts@arm.com> > Cc: Yang Shi <yang@os.amperecomputing.com> > Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> > Cc: Dev Jain <dev.jain@arm.com> > Cc: Barry Song <baohua@kernel.org> > Cc: Anshuman Khandual <anshuman.khandual@arm.com> > Cc: Yicong Yang <yangyicong@hisilicon.com> > Cc: Kefeng Wang <wangkefeng.wang@huawei.com> > Cc: Kevin Brodsky <kevin.brodsky@arm.com> > Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> > Cc: linux-arm-kernel@lists.infradead.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org > --- > include/linux/pgtable.h | 4 ++++ > mm/huge_memory.c | 22 +++++++++++++++++----- > mm/internal.h | 4 ++-- > 3 files changed, 23 insertions(+), 7 deletions(-) > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index 32e8457ad535..341622ec80e4 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) > #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) > #endif > > +#ifndef flush_tlb_fix_spurious_fault_pmd > +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) > +#endif flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to flush_tlb_page() - why do we just do nothing in this case here? > + > /* > * When walking page tables, get the address of the next boundary, > * or the end address of the range if that comes earlier. Although no > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 1b81680b4225..8533457c52b7 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1641,17 +1641,22 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, > EXPORT_SYMBOL_GPL(vmf_insert_folio_pud); > #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ > > -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, > - pmd_t *pmd, bool write) > +/* Returns whether the PMD entry is changed */ Could we have a kernel doc description here? > +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, It's 2025 can we use bool please :) > + pmd_t *pmd, bool write) > { > + int changed; > pmd_t _pmd; While we're here can we rename this horrible parameter name to e.g. entry? 
significantly altering this function anyway, so it isn't much more churn. > > _pmd = pmd_mkyoung(*pmd); > if (write) > _pmd = pmd_mkdirty(_pmd); > - if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, > - pmd, _pmd, write)) > + changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, > + pmd, _pmd, write); > + if (changed) > update_mmu_cache_pmd(vma, addr, pmd); We can make this simpler, e.g.: if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, pmd, entry, write)) { update_mmu_cache_pmd(vma, addr, pmd); return true; } return false; > + > + return changed; > } > > int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > @@ -1849,7 +1854,14 @@ void huge_pmd_set_accessed(struct vm_fault *vmf) > if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) > goto unlock; > > - touch_pmd(vmf->vma, vmf->address, vmf->pmd, write); > + if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write)) { > + /* See corresponding comments in handle_pte_fault(). */ What are the 'corresponding' comments? How can a reader of this code know what they are? This isn't a very helpful comment. Also those comments might be moved in future... Presumably it's: /* Skip spurious TLB flush for retried page fault */ if (vmf->flags & FAULT_FLAG_TRIED) goto unlock; /* * This is needed only for protection faults but the arch code * is not yet telling us if this is a protection fault or not. * This still avoids useless tlb flushes for .text page faults * with threads. */ if (vmf->flags & FAULT_FLAG_WRITE) flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, vmf->pte); So I don't see why it's so egregious to have the equivalent here, or actually ideally to abstract the code entirely. In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum pgtable_level"") David introduced: enum pgtable_level { PGTABLE_LEVEL_PTE = 0, PGTABLE_LEVEL_PMD, PGTABLE_LEVEL_PUD, PGTABLE_LEVEL_P4D, PGTABLE_LEVEL_PGD, }; Which allows for sensible abstraction. > + if (vmf->flags & FAULT_FLAG_TRIED) > + goto unlock; > + if (vmf->flags & FAULT_FLAG_WRITE) > + flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address, > + vmf->pmd); > + } > > unlock: > spin_unlock(vmf->ptl); > diff --git a/mm/internal.h b/mm/internal.h > index 1561fc2ff5b8..8b58ab00a7cd 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -1402,8 +1402,8 @@ int __must_check try_grab_folio(struct folio *folio, int refs, > */ > void touch_pud(struct vm_area_struct *vma, unsigned long addr, > pud_t *pud, bool write); > -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, > - pmd_t *pmd, bool write); > +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, > + pmd_t *pmd, bool write); > > /* > * Parses a string with mem suffixes into its order. Useful to parse kernel > -- > 2.39.5 > ^ permalink raw reply [flat|nested] 33+ messages in thread
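To make the suggested abstraction concrete, a minimal sketch of a shared helper might look like the following. The function name flush_tlb_fix_spurious_fault_level() and its placement are assumptions for illustration only, not code from the series; it simply hoists the handle_pte_fault() logic quoted above and keys the flush on the pgtable_level enum:

static void flush_tlb_fix_spurious_fault_level(struct vm_fault *vmf,
					       enum pgtable_level level)
{
	/* Skip spurious TLB flush for retried page fault. */
	if (vmf->flags & FAULT_FLAG_TRIED)
		return;

	/*
	 * Needed only for protection faults, but the arch code does not
	 * yet tell us whether this is a protection fault; restricting
	 * the flush to write faults still avoids useless TLB flushes
	 * for .text page faults with threads.
	 */
	if (!(vmf->flags & FAULT_FLAG_WRITE))
		return;

	if (level == PGTABLE_LEVEL_PMD)
		flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address,
						 vmf->pmd);
	else
		flush_tlb_fix_spurious_fault(vmf->vma, vmf->address,
					     vmf->pte);
}

The callers would keep holding the page table lock and drop it themselves, as handle_pte_fault() and huge_pmd_set_accessed() already do, so the helper itself stays lock-agnostic.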
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-14 14:21 ` Lorenzo Stoakes @ 2025-10-14 14:38 ` David Hildenbrand 2025-10-14 14:49 ` Lorenzo Stoakes 2025-10-15 8:43 ` Huang, Ying 1 sibling, 1 reply; 33+ messages in thread From: David Hildenbrand @ 2025-10-14 14:38 UTC (permalink / raw) To: Lorenzo Stoakes, Huang Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm > > /* Skip spurious TLB flush for retried page fault */ > if (vmf->flags & FAULT_FLAG_TRIED) > goto unlock; > /* > * This is needed only for protection faults but the arch code > * is not yet telling us if this is a protection fault or not. > * This still avoids useless tlb flushes for .text page faults > * with threads. > */ > if (vmf->flags & FAULT_FLAG_WRITE) > flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, > vmf->pte); > > > So I don't see why it's so egregious to have the equivalent here, or actually > ideally to abstract the code entirely. Let's definitely not duplicate such comments whereby one instance will end up bitrotting. When talking about spurious faults I assume the educated reader will usually find the right comments -- like you easily did :P However I agree that ... > > In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum > pgtable_level"") David introduced: > > enum pgtable_level { > PGTABLE_LEVEL_PTE = 0, > PGTABLE_LEVEL_PMD, > PGTABLE_LEVEL_PUD, > PGTABLE_LEVEL_P4D, > PGTABLE_LEVEL_PGD, > }; > > Which allows for sensible abstraction. ... if there is an easier way to just unify the code and have the comments at a central place, even better. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-14 14:38 ` David Hildenbrand @ 2025-10-14 14:49 ` Lorenzo Stoakes 2025-10-14 14:58 ` David Hildenbrand 0 siblings, 1 reply; 33+ messages in thread From: Lorenzo Stoakes @ 2025-10-14 14:49 UTC (permalink / raw) To: David Hildenbrand Cc: Huang Ying, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On Tue, Oct 14, 2025 at 04:38:05PM +0200, David Hildenbrand wrote: > > > > > /* Skip spurious TLB flush for retried page fault */ > > if (vmf->flags & FAULT_FLAG_TRIED) > > goto unlock; > > /* > > * This is needed only for protection faults but the arch code > > * is not yet telling us if this is a protection fault or not. > > * This still avoids useless tlb flushes for .text page faults > > * with threads. > > */ > > if (vmf->flags & FAULT_FLAG_WRITE) > > flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, > > vmf->pte); > > > > > > So I don't see why it's so egregious to have the equivalent here, or actually > > ideally to abstract the code entirely. > > Let's definitely not duplicate such comments whereby one instance will end > up bitrotting. We're duplicating the code in two places, how would that bitrot happen exactly? I mean as I also say, probably better to just de-duplicate in general. It's one thing referring to a comment _above_ another function, it's quite another to refer to comments buried in a branch buried inside another function and to hand wave about it. And _those_ comments might very well be moved when the function is sensibly refactored (as it needs to be tbh). So yeah, in general I'd agree _if_ you were doing something similar-but-not-the-same AND referencing a clearly identifiable comment (e.g. above the function). But this isn't that at all. Anyway TL;DR is that we should just de-dupe the code. > > When talking about spurious faults I assume the educated reader will usually > find the right comments -- like you easily did :P Yeah but I'm familiar with the (kinda hideous) code there, it wasn't so much 'easily' found and for somebody else they'd maybe get confused as to what on earth that's referring to. > > However I agree that ... > > > > > In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum > > pgtable_level"") David introduced: > > > > enum pgtable_level { > > PGTABLE_LEVEL_PTE = 0, > > PGTABLE_LEVEL_PMD, > > PGTABLE_LEVEL_PUD, > > PGTABLE_LEVEL_P4D, > > PGTABLE_LEVEL_PGD, > > }; > > > > Which allows for sensible abstraction. > > ... if there is an easier way to just unify the code and have the comments > at a central place, even better. Yup we're agreed on that :) > > -- > Cheers > > David / dhildenb > Cheers, Lorenzo ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-14 14:49 ` Lorenzo Stoakes @ 2025-10-14 14:58 ` David Hildenbrand 2025-10-14 15:13 ` Lorenzo Stoakes 0 siblings, 1 reply; 33+ messages in thread From: David Hildenbrand @ 2025-10-14 14:58 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Huang Ying, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On 14.10.25 16:49, Lorenzo Stoakes wrote: > On Tue, Oct 14, 2025 at 04:38:05PM +0200, David Hildenbrand wrote: >> >>> >>> /* Skip spurious TLB flush for retried page fault */ >>> if (vmf->flags & FAULT_FLAG_TRIED) >>> goto unlock; >>> /* >>> * This is needed only for protection faults but the arch code >>> * is not yet telling us if this is a protection fault or not. >>> * This still avoids useless tlb flushes for .text page faults >>> * with threads. >>> */ >>> if (vmf->flags & FAULT_FLAG_WRITE) >>> flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, >>> vmf->pte); >>> >>> >>> So I don't see why it's so egregious to have the equivalent here, or actually >>> ideally to abstract the code entirely. >> >> Let's definitely not duplicate such comments whereby one instance will end >> up bitrotting. > > We're duplicating the code in two places, how would that bitrot happen exactly? Often we adjust/fix comments to make scenarios/conditions clearer or extend them to cover some new conditions. So even without any code changes people will simply forget to update the other comments. Code you can at least test, in the hope of finding inconsistencies. So copying rather large comments is usually never the answer :) Well, just like copying larger chunks of code, agreed. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-14 14:58 ` David Hildenbrand @ 2025-10-14 15:13 ` Lorenzo Stoakes 0 siblings, 0 replies; 33+ messages in thread From: Lorenzo Stoakes @ 2025-10-14 15:13 UTC (permalink / raw) To: David Hildenbrand Cc: Huang Ying, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On Tue, Oct 14, 2025 at 04:58:44PM +0200, David Hildenbrand wrote: > On 14.10.25 16:49, Lorenzo Stoakes wrote: > > On Tue, Oct 14, 2025 at 04:38:05PM +0200, David Hildenbrand wrote: > > > > > > > > > > > > /* Skip spurious TLB flush for retried page fault */ > > > > if (vmf->flags & FAULT_FLAG_TRIED) > > > > goto unlock; > > > > /* > > > > * This is needed only for protection faults but the arch code > > > > * is not yet telling us if this is a protection fault or not. > > > > * This still avoids useless tlb flushes for .text page faults > > > > * with threads. > > > > */ > > > > if (vmf->flags & FAULT_FLAG_WRITE) > > > > flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, > > > > vmf->pte); > > > > > > > > > > > > So I don't see why it's so egregious to have the equivalent here, or actually > > > > ideally to abstract the code entirely. > > > > > > Let's definitely not duplicate such comments whereby one instance will end > > > up bitrotting. > > > > We're duplicating the code in two places, how would that bitrot happen exactly? > > Often we adjust/fix comments to make scenarios/conditions clearer or extend > them to cover some new conditions. > > So even without any code changes people will just ignore to update other > comments. > > Code you can at least test with the hope to find inconsistencies. > > So copying rather large comments is usually never the answer :) > > Well, just like copying larger chunks of code, agreed. This is a bit moot; I don't think it's worth having a big debate about it. I'm one of the biggest proponents of de-duplicating things (comments included) and have done so extensively as you know. My _only_ point here is that it's hard to find the comment referenced and it's _very_ likely it'll get moved later (in fact I feel like doing that _right now_ as that function needs refactoring). In that case the lesser evil is to copy a 4-line comment, right? But anyway, we both agree de-duplicating the code as a whole is the right way forward, and that solves the issue, so let's go with that! > > -- > Cheers > > David / dhildenb > Cheers, Lorenzo ^ permalink raw reply [flat|nested] 33+ messages in thread
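Under the de-duplication both sides agree on here, the two call sites would presumably reduce to fragments like these. This is a sketch assuming the flush_tlb_fix_spurious_fault_level() helper sketched earlier in the thread; the surrounding code is paraphrased from handle_pte_fault() and huge_pmd_set_accessed(), with the success paths, update_mmu_cache*() calls, and unlocking elided:

	/* In handle_pte_fault(), replacing the open-coded else branch: */
	if (!ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
				   vmf->flags & FAULT_FLAG_WRITE))
		flush_tlb_fix_spurious_fault_level(vmf, PGTABLE_LEVEL_PTE);

	/* In huge_pmd_set_accessed(), once touch_pmd() returns a bool: */
	if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write))
		flush_tlb_fix_spurious_fault_level(vmf, PGTABLE_LEVEL_PMD);

With that, the FAULT_FLAG_TRIED/FAULT_FLAG_WRITE checks and their comments live in exactly one place.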
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-14 14:21 ` Lorenzo Stoakes 2025-10-14 14:38 ` David Hildenbrand @ 2025-10-15 8:43 ` Huang, Ying 2025-10-15 11:20 ` Lorenzo Stoakes 1 sibling, 1 reply; 33+ messages in thread From: Huang, Ying @ 2025-10-15 8:43 UTC (permalink / raw) To: Lorenzo Stoakes, David Hildenbrand Cc: Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm Hi, Lorenzo, Thanks for the comments! Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: > On Mon, Oct 13, 2025 at 05:20:37PM +0800, Huang Ying wrote: >> In the current kernel, there is spurious fault fixing support for pte, >> but not for huge pmd because no architectures need it. But in the >> next patch in the series, we will change the write protection fault >> handling logic on arm64, so that some stale huge pmd entries may >> remain in the TLB. These entries need to be flushed via the huge pmd >> spurious fault fixing mechanism. >> >> Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> > > Right now the PTE level spurious fault handling is dealt with in > handle_pte_fault() when ptep_set_access_flags() returns false. > > Now you're updating touch_pmd() which is invoked by follow_huge_pmd() and > huge_pmd_set_accessed(). > > 1 - Why are you not adding handling to GUP? > > 2 - Is this the correct level of abstraction? It's really not obvious but > huge_pmd_set_accessed() is invoked by __handle_mm_fault() on a non-WP, > non-NUMA hint huge page fault where a page table entry already exists > but we are faulting anyway (e.g. non-present or read-only writable). > > You don't mention any of this in the commit message, which you need to do > and really need to explain how spurious faults can arise, why you can only > do this at the point of abstraction you do (if you are unable to put it in > actual fault handing-code), and you need to add a bunch more comments to > explain this. This patch adds spurious PMD page fault fixing modeled on the existing spurious PTE page fault fixing. So, I assumed that spurious page fault fixing had been documented already. But you are right, nothing prevents us from improving it further. Let's try to do that. Page faults may be spurious because of racy access to the page table. For example, when a non-populated virtual page is accessed on 2 CPUs simultaneously, page faults are triggered on both CPUs. However, it's possible that one CPU (say CPU A) cannot find the reason for the page fault if the other CPU (say CPU B) has changed the page table before the PTE is checked on CPU A. Most of the time, spurious page faults can be ignored safely. However, if the page fault is for a write access, it's possible that a stale read-only TLB entry exists on the local CPU and needs to be flushed on some architectures. Fixing that up is called spurious page fault fixing. Spurious page fault fixing only makes sense during page fault handling, so we don't need to do it for GUP. In fact, I plan to avoid it in all GUP paths in another follow-up patch. As for where to put the spurious PMD page fault fixing code, because it's THP-related code, I thought that we should put it in huge_memory.c, so I implemented it in huge_pmd_set_accessed().
If we follow the design of the spurious PTE page fault fixing, we can call the unified implementation in __handle_mm_fault() after huge_pmd_set_accessed() reports that nothing has changed. > Otherwise this just ends up being a lot of open-coded + confusing 'you have > to go look it up/just know' type stuff that we have too much of in mm :) > > So please update commit message/comments, confirm whether this is the > correct level of abstraction, and address other comments below, thanks! > >> Cc: Catalin Marinas <catalin.marinas@arm.com> >> Cc: Will Deacon <will@kernel.org> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Cc: David Hildenbrand <david@redhat.com> >> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >> Cc: Vlastimil Babka <vbabka@suse.cz> >> Cc: Zi Yan <ziy@nvidia.com> >> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >> Cc: Ryan Roberts <ryan.roberts@arm.com> >> Cc: Yang Shi <yang@os.amperecomputing.com> >> Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> >> Cc: Dev Jain <dev.jain@arm.com> >> Cc: Barry Song <baohua@kernel.org> >> Cc: Anshuman Khandual <anshuman.khandual@arm.com> >> Cc: Yicong Yang <yangyicong@hisilicon.com> >> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> >> Cc: Kevin Brodsky <kevin.brodsky@arm.com> >> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> >> Cc: linux-arm-kernel@lists.infradead.org >> Cc: linux-kernel@vger.kernel.org >> Cc: linux-mm@kvack.org >> --- >> include/linux/pgtable.h | 4 ++++ >> mm/huge_memory.c | 22 +++++++++++++++++----- >> mm/internal.h | 4 ++-- >> 3 files changed, 23 insertions(+), 7 deletions(-) >> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >> index 32e8457ad535..341622ec80e4 100644 >> --- a/include/linux/pgtable.h >> +++ b/include/linux/pgtable.h >> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) >> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) >> #endif >> >> +#ifndef flush_tlb_fix_spurious_fault_pmd >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) >> +#endif > > flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to > flush_tlb_page() - why do we just do nothing in this case here? Because all architectures do nothing for spurious PMD page fault fixing until [2/2] of this series, where we make it necessary to flush the local TLB for spurious PMD page fault fixing on the arm64 architecture. If we follow the design of flush_tlb_fix_spurious_fault(), we would need to change all architecture implementations to do nothing in this patch to keep the current behavior. I don't think that it's a good idea. Do you agree? >> + >> /* >> * When walking page tables, get the address of the next boundary, >> * or the end address of the range if that comes earlier. Although no >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 1b81680b4225..8533457c52b7 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -1641,17 +1641,22 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, >> EXPORT_SYMBOL_GPL(vmf_insert_folio_pud); >> #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ >> >> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, >> - pmd_t *pmd, bool write) >> +/* Returns whether the PMD entry is changed */ > > Could we have a kernel doc description here? Sure. >> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, > > It's 2025 can we use bool please :) Sure.
>> + pmd_t *pmd, bool write) >> { >> + int changed; >> pmd_t _pmd; > > While we're here can we rename this horrible parameter name to e.g. entry? We're > significantly altering this function anyway so it isn't much more Sure. >> >> _pmd = pmd_mkyoung(*pmd); >> if (write) >> _pmd = pmd_mkdirty(_pmd); >> - if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, >> - pmd, _pmd, write)) >> + changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, >> + pmd, _pmd, write); >> + if (changed) >> update_mmu_cache_pmd(vma, addr, pmd); > > We can make this simpler, e.g.: > > if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, > pmd, entry, write)) { > update_mmu_cache_pmd(vma, addr, pmd); > return true; > } > > return false; No problem. As long as David is OK with this. >> + >> + return changed; >> } >> >> int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, >> @@ -1849,7 +1854,14 @@ void huge_pmd_set_accessed(struct vm_fault *vmf) >> if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) >> goto unlock; >> >> - touch_pmd(vmf->vma, vmf->address, vmf->pmd, write); >> + if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write)) { >> + /* See corresponding comments in handle_pte_fault(). */ > > What are the 'corresponding' comments? How can a reader of this code know what > they are? This isn't a very helpful comment. Also those comments might be > moved in future... > > Presumably it's: > > /* Skip spurious TLB flush for retried page fault */ > if (vmf->flags & FAULT_FLAG_TRIED) > goto unlock; > /* > * This is needed only for protection faults but the arch code > * is not yet telling us if this is a protection fault or not. > * This still avoids useless tlb flushes for .text page faults > * with threads. > */ > if (vmf->flags & FAULT_FLAG_WRITE) > flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, > vmf->pte); > > > So I don't see why it's so egregious to have the equivalent here, or actually > ideally to abstract the code entirely. > > In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum > pgtable_level"") David introduced: > > enum pgtable_level { > PGTABLE_LEVEL_PTE = 0, > PGTABLE_LEVEL_PMD, > PGTABLE_LEVEL_PUD, > PGTABLE_LEVEL_P4D, > PGTABLE_LEVEL_PGD, > }; > > Which allows for sensible abstraction. Sure. Based on your discussion with David on this, I will add a function to do the spurious page fault fixing for the PTE and PMD fault. >> + if (vmf->flags & FAULT_FLAG_TRIED) >> + goto unlock; >> + if (vmf->flags & FAULT_FLAG_WRITE) >> + flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address, >> + vmf->pmd); >> + } >> >> unlock: >> spin_unlock(vmf->ptl); >> diff --git a/mm/internal.h b/mm/internal.h >> index 1561fc2ff5b8..8b58ab00a7cd 100644 >> --- a/mm/internal.h >> +++ b/mm/internal.h >> @@ -1402,8 +1402,8 @@ int __must_check try_grab_folio(struct folio *folio, int refs, >> */ >> void touch_pud(struct vm_area_struct *vma, unsigned long addr, >> pud_t *pud, bool write); >> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, >> - pmd_t *pmd, bool write); >> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, >> + pmd_t *pmd, bool write); >> >> /* >> * Parses a string with mem suffixes into its order. Useful to parse kernel >> -- >> 2.39.5 >> --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
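Putting the points agreed above together (the bool return, the entry rename, and the simplified control flow), touch_pmd() would plausibly end up as below. This is a guess at what the next revision might contain, assembled from the review feedback, not the posted code:

bool touch_pmd(struct vm_area_struct *vma, unsigned long addr,
	       pmd_t *pmd, bool write)
{
	pmd_t entry = pmd_mkyoung(*pmd);

	if (write)
		entry = pmd_mkdirty(entry);

	/* Report to the caller whether the PMD entry was changed. */
	if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
				  pmd, entry, write)) {
		update_mmu_cache_pmd(vma, addr, pmd);
		return true;
	}

	return false;
}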
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-15 8:43 ` Huang, Ying @ 2025-10-15 11:20 ` Lorenzo Stoakes 2025-10-15 12:23 ` David Hildenbrand 2025-10-16 2:22 ` Huang, Ying 0 siblings, 2 replies; 33+ messages in thread From: Lorenzo Stoakes @ 2025-10-15 11:20 UTC (permalink / raw) To: Huang, Ying Cc: David Hildenbrand, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On Wed, Oct 15, 2025 at 04:43:14PM +0800, Huang, Ying wrote: > Hi, Lorenzo, > > Thanks for comments! > > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: > > > On Mon, Oct 13, 2025 at 05:20:37PM +0800, Huang Ying wrote: > >> In the current kernel, there is spurious fault fixing support for pte, > >> but not for huge pmd because no architectures need it. But in the > >> next patch in the series, we will change the write protection fault > >> handling logic on arm64, so that some stale huge pmd entries may > >> remain in the TLB. These entries need to be flushed via the huge pmd > >> spurious fault fixing mechanism. > >> > >> Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> > > > > Right now the PTE level spurious fault handling is dealt with in > > handle_pte_fault() when ptep_set_access_flags() returns false. > > > > Now you're updating touch_pmd() which is invoked by follow_huge_pmd() and > > huge_pmd_set_accessed(). > > > > 1 - Why are you not adding handling to GUP? > > > > 2 - Is this the correct level of abstraction? It's really not obvious but > > huge_pmd_set_accessed() is invoked by __handle_mm_fault() on a non-WP, > > non-NUMA hint huge page fault where a page table entry already exists > > but we are faulting anyway (e.g. non-present or read-only writable). > > > > You don't mention any of this in the commit message, which you need to do > > and really need to explain how spurious faults can arise, why you can only > > do this at the point of abstraction you do (if you are unable to put it in > > actual fault handing-code), and you need to add a bunch more comments to > > explain this. > > This patch adds the spurious PMD page fault fixing based on the spurious > PTE page fault fixing. So, I assumed that the spurious page fault > fixing has been documented already. But you are right, nothing prevents > us from improving it further. Let's try to do that. I wouldn't make any kind of assumption like this in the kernel :) sadly our documentation is often incomplete. > > The page faults may be spurious because of the racy access to the page > table. For example, a non-populated virtual page is accessed on 2 CPUs > simultaneously, thus the page faults are triggered on both CPUs. > However, it's possible that one CPU (say CPU A) cannot find the reason > for the page fault if the other CPU (say CPU B) has changed the page > table before the PTE is checked on CPU A. Most of the time, the > spurious page faults can be ignored safely. However, if the page fault > is for the write access, it's possible that a stale read-only TLB entry > exists in the local CPU and needs to be flushed on some architectures. > This is called the spurious page fault fixing. > > The spurious page fault fixing only makes sense during page fault > handling, so we don't need to do it for GUP. In fact, I plan to avoid > it in all GUP paths in another followup patch. 
OK this is great, let's put it all in the kdoc for the new shared spurious faulting function! :) and additionally add it to the commit message. > > As for where to put the spurious PMD page fault fixing code, because > it's THP related code, I thought that we should put it in huge_memory.c, > so I implemented it in huge_pmd_set_accessed(). If we follow the design > of the spurious PTE page fault fixing, we can call the unified > implementation in __handle_mm_fault() after huge_pmd_set_accessed() > reports nothing has been changed. Ah ok great this is a better place for it for sure, thanks! > > > Otherwise this just ends up being a lot of open-coded + confusing 'you have > > to go look it up/just know' type stuff that we have too much of in mm :) > > > > So please update commit message/comments, confirm whether this is the > > correct level of abstraction, and address other comments below, thanks! > > > >> Cc: Catalin Marinas <catalin.marinas@arm.com> > >> Cc: Will Deacon <will@kernel.org> > >> Cc: Andrew Morton <akpm@linux-foundation.org> > >> Cc: David Hildenbrand <david@redhat.com> > >> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > >> Cc: Vlastimil Babka <vbabka@suse.cz> > >> Cc: Zi Yan <ziy@nvidia.com> > >> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > >> Cc: Ryan Roberts <ryan.roberts@arm.com> > >> Cc: Yang Shi <yang@os.amperecomputing.com> > >> Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> > >> Cc: Dev Jain <dev.jain@arm.com> > >> Cc: Barry Song <baohua@kernel.org> > >> Cc: Anshuman Khandual <anshuman.khandual@arm.com> > >> Cc: Yicong Yang <yangyicong@hisilicon.com> > >> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> > >> Cc: Kevin Brodsky <kevin.brodsky@arm.com> > >> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> > >> Cc: linux-arm-kernel@lists.infradead.org > >> Cc: linux-kernel@vger.kernel.org > >> Cc: linux-mm@kvack.org > >> --- > >> include/linux/pgtable.h | 4 ++++ > >> mm/huge_memory.c | 22 +++++++++++++++++----- > >> mm/internal.h | 4 ++-- > >> 3 files changed, 23 insertions(+), 7 deletions(-) > >> > >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > >> index 32e8457ad535..341622ec80e4 100644 > >> --- a/include/linux/pgtable.h > >> +++ b/include/linux/pgtable.h > >> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) > >> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) > >> #endif > >> > >> +#ifndef flush_tlb_fix_spurious_fault_pmd > >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) > >> +#endif > > > > flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to > > flush_tlb_page() - why do we just do nothing in this case here? > > Because all architectures do nothing for the spurious PMD page fault > fixing until the [2/2] of this series. Where, we make it necessary to > flush the local TLB for spurious PMD page fault fixing on arm64 > architecture. > > If we follow the design of flush_tlb_fix_spurious_fault(), we need to > change all architecture implementation to do nothing in this patch to > keep the current behavior. I don't think that it's a good idea. Do > you agree? Yeah probably we should keep the same behaviour as before, which is obviously, prior to this series, we did nothing. I guess in the PTE case we _always_ want to flush the TLB, whereas in the PMD case we otherwise don't have any need to at the point at which the spurious flush is performed. 
But from your explanation above re: the stale TLB entry this _only_ needs to be done for architectures which might encounter this problem rather than needing a TLB flush in general. Given we're generalising the code and one case always flushes the TLB and the other doesn't maybe it's worth putting a comment in the generalised function mentioning this? > > >> + > >> /* > >> * When walking page tables, get the address of the next boundary, > >> * or the end address of the range if that comes earlier. Although no > >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c > >> index 1b81680b4225..8533457c52b7 100644 > >> --- a/mm/huge_memory.c > >> +++ b/mm/huge_memory.c > >> @@ -1641,17 +1641,22 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, > >> EXPORT_SYMBOL_GPL(vmf_insert_folio_pud); > >> #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ > >> > >> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, > >> - pmd_t *pmd, bool write) > >> +/* Returns whether the PMD entry is changed */ > > > > Could we have a kernel doc description here? > > Sure. Thanks! > > >> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, > > > > It's 2025 can we use bool please :) > > Sure. Thanks! > > >> + pmd_t *pmd, bool write) > >> { > >> + int changed; > >> pmd_t _pmd; > > > > While we're here can we rename this horrible parameter name to e.g. entry? We're > > significantly altering this function anyway so it isn't much more > > Sure. Thanks! > > >> > >> _pmd = pmd_mkyoung(*pmd); > >> if (write) > >> _pmd = pmd_mkdirty(_pmd); > >> - if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, > >> - pmd, _pmd, write)) > >> + changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, > >> + pmd, _pmd, write); > >> + if (changed) > >> update_mmu_cache_pmd(vma, addr, pmd); > > > > We can make this simpler, e.g.: > > > > if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, > > pmd, entry, write)) { > > update_mmu_cache_pmd(vma, addr, pmd); > > return true; > > } > > > > return false; > > No problem. As long as David is OK with this. Sure I don't think he'd have an issue with it but he can raise it if so :) Thanks! > > >> + > >> + return changed; > >> } > >> > >> int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > >> @@ -1849,7 +1854,14 @@ void huge_pmd_set_accessed(struct vm_fault *vmf) > >> if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd))) > >> goto unlock; > >> > >> - touch_pmd(vmf->vma, vmf->address, vmf->pmd, write); > >> + if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write)) { > >> + /* See corresponding comments in handle_pte_fault(). */ > > > > What are the 'corresponding' comments? How can a reader of this code know what > > they are? This isn't a very helpful comment. Also those comments might be > > moved in future... > > > > Presumably it's: > > > > /* Skip spurious TLB flush for retried page fault */ > > if (vmf->flags & FAULT_FLAG_TRIED) > > goto unlock; > > /* > > * This is needed only for protection faults but the arch code > > * is not yet telling us if this is a protection fault or not. > > * This still avoids useless tlb flushes for .text page faults > > * with threads. > > */ > > if (vmf->flags & FAULT_FLAG_WRITE) > > flush_tlb_fix_spurious_fault(vmf->vma, vmf->address, > > vmf->pte); > > > > > > So I don't see why it's so egregious to have the equivalent here, or actually > > ideally to abstract the code entirely. 
> > > > In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum > > pgtable_level"") David introduced: > > > > enum pgtable_level { > > PGTABLE_LEVEL_PTE = 0, > > PGTABLE_LEVEL_PMD, > > PGTABLE_LEVEL_PUD, > > PGTABLE_LEVEL_P4D, > > PGTABLE_LEVEL_PGD, > > }; > > > > Which allows for sensible abstraction. > > Sure. Based on your discussion with David on this, I will add a > function to do the spurious page fault fixing for the PTE and PMD fault. Thanks! > > >> + if (vmf->flags & FAULT_FLAG_TRIED) > >> + goto unlock; > >> + if (vmf->flags & FAULT_FLAG_WRITE) > >> + flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address, > >> + vmf->pmd); > >> + } > >> > >> unlock: > >> spin_unlock(vmf->ptl); > >> diff --git a/mm/internal.h b/mm/internal.h > >> index 1561fc2ff5b8..8b58ab00a7cd 100644 > >> --- a/mm/internal.h > >> +++ b/mm/internal.h > >> @@ -1402,8 +1402,8 @@ int __must_check try_grab_folio(struct folio *folio, int refs, > >> */ > >> void touch_pud(struct vm_area_struct *vma, unsigned long addr, > >> pud_t *pud, bool write); > >> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr, > >> - pmd_t *pmd, bool write); > >> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr, > >> + pmd_t *pmd, bool write); > >> > >> /* > >> * Parses a string with mem suffixes into its order. Useful to parse kernel > >> -- > >> 2.39.5 > >> > > --- > Best Regards, > Huang, Ying Cheers, Lorenzo ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-15 11:20 ` Lorenzo Stoakes @ 2025-10-15 12:23 ` David Hildenbrand 2025-10-16 2:22 ` Huang, Ying 1 sibling, 0 replies; 33+ messages in thread From: David Hildenbrand @ 2025-10-15 12:23 UTC (permalink / raw) To: Lorenzo Stoakes, Huang, Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm >> >>>> >>>> _pmd = pmd_mkyoung(*pmd); >>>> if (write) >>>> _pmd = pmd_mkdirty(_pmd); >>>> - if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, >>>> - pmd, _pmd, write)) >>>> + changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, >>>> + pmd, _pmd, write); >>>> + if (changed) >>>> update_mmu_cache_pmd(vma, addr, pmd); >>> >>> We can make this simpler, e.g.: >>> >>> if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK, >>> pmd, entry, write)) { >>> update_mmu_cache_pmd(vma, addr, pmd); >>> return true; >>> } >>> >>> return false; >> >> No problem. As long as David is OK with this. > > Sure I don't think he'd have an issue with it but he can raise it if so :) Absolutely fine :) -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-15 11:20 ` Lorenzo Stoakes 2025-10-15 12:23 ` David Hildenbrand @ 2025-10-16 2:22 ` Huang, Ying 2025-10-16 8:25 ` Lorenzo Stoakes 1 sibling, 1 reply; 33+ messages in thread From: Huang, Ying @ 2025-10-16 2:22 UTC (permalink / raw) To: Lorenzo Stoakes Cc: David Hildenbrand, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: > On Wed, Oct 15, 2025 at 04:43:14PM +0800, Huang, Ying wrote: >> Hi, Lorenzo, >> >> Thanks for comments! >> >> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: >> >> > On Mon, Oct 13, 2025 at 05:20:37PM +0800, Huang Ying wrote: >> >> In the current kernel, there is spurious fault fixing support for pte, >> >> but not for huge pmd because no architectures need it. But in the >> >> next patch in the series, we will change the write protection fault >> >> handling logic on arm64, so that some stale huge pmd entries may >> >> remain in the TLB. These entries need to be flushed via the huge pmd >> >> spurious fault fixing mechanism. >> >> >> >> Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> >> > >> > Right now the PTE level spurious fault handling is dealt with in >> > handle_pte_fault() when ptep_set_access_flags() returns false. >> > >> > Now you're updating touch_pmd() which is invoked by follow_huge_pmd() and >> > huge_pmd_set_accessed(). >> > >> > 1 - Why are you not adding handling to GUP? >> > >> > 2 - Is this the correct level of abstraction? It's really not obvious but >> > huge_pmd_set_accessed() is invoked by __handle_mm_fault() on a non-WP, >> > non-NUMA hint huge page fault where a page table entry already exists >> > but we are faulting anyway (e.g. non-present or read-only writable). >> > >> > You don't mention any of this in the commit message, which you need to do >> > and really need to explain how spurious faults can arise, why you can only >> > do this at the point of abstraction you do (if you are unable to put it in >> > actual fault handing-code), and you need to add a bunch more comments to >> > explain this. >> >> This patch adds the spurious PMD page fault fixing based on the spurious >> PTE page fault fixing. So, I assumed that the spurious page fault >> fixing has been documented already. But you are right, nothing prevents >> us from improving it further. Let's try to do that. > > I wouldn't make any kind of assumption like this in the kernel :) sadly our > documentation is often incomplete. > >> >> The page faults may be spurious because of the racy access to the page >> table. For example, a non-populated virtual page is accessed on 2 CPUs >> simultaneously, thus the page faults are triggered on both CPUs. >> However, it's possible that one CPU (say CPU A) cannot find the reason >> for the page fault if the other CPU (say CPU B) has changed the page >> table before the PTE is checked on CPU A. Most of the time, the >> spurious page faults can be ignored safely. However, if the page fault >> is for the write access, it's possible that a stale read-only TLB entry >> exists in the local CPU and needs to be flushed on some architectures. >> This is called the spurious page fault fixing. 
>> >> The spurious page fault fixing only makes sense during page fault >> handling, so we don't need to do it for GUP. In fact, I plan to avoid >> it in all GUP paths in another followup patch. > > OK this is great, let's put it all in the kdoc for the new shared spurious > faulting function! :) and additionally add it to the commit message. Sure. Will do it in the next version. [snip] > >> >> > >> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >> >> index 32e8457ad535..341622ec80e4 100644 >> >> --- a/include/linux/pgtable.h >> >> +++ b/include/linux/pgtable.h >> >> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) >> >> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) >> >> #endif >> >> >> >> +#ifndef flush_tlb_fix_spurious_fault_pmd >> >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) >> >> +#endif >> > >> > flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to >> > flush_tlb_page() - why do we just do nothing in this case here? >> >> Because all architectures do nothing for the spurious PMD page fault >> fixing until the [2/2] of this series. Where, we make it necessary to >> flush the local TLB for spurious PMD page fault fixing on arm64 >> architecture. >> >> If we follow the design of flush_tlb_fix_spurious_fault(), we need to >> change all architecture implementation to do nothing in this patch to >> keep the current behavior. I don't think that it's a good idea. Do >> you agree? > > Yeah probably we should keep the same behaviour as before, which is > obviously, prior to this series, we did nothing. > > I guess in the PTE case we _always_ want to flush the TLB, whereas in the > PMD case we otherwise don't have any need to at the point at which the > spurious flush is performed. > > But from your explanation above re: the stale TLB entry this _only_ needs > to be done for architectures which might encounter this problem rather than > needing a TLB flush in general. > > Given we're generalising the code and one case always flushes the TLB and > the other doesn't maybe it's worth putting a comment in the generalised > function mentioning this? I'm not sure whether it's a good idea to document architecture behaviors in the generic code. The behavior may change from architecture to architecture in the future. [snip] --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-16 2:22 ` Huang, Ying @ 2025-10-16 8:25 ` Lorenzo Stoakes 2025-10-16 8:59 ` David Hildenbrand 2025-10-16 9:12 ` Huang, Ying 0 siblings, 2 replies; 33+ messages in thread From: Lorenzo Stoakes @ 2025-10-16 8:25 UTC (permalink / raw) To: Huang, Ying Cc: David Hildenbrand, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On Thu, Oct 16, 2025 at 10:22:57AM +0800, Huang, Ying wrote: > Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: > > > On Wed, Oct 15, 2025 at 04:43:14PM +0800, Huang, Ying wrote: > > OK this is great, let's put it all in the kdoc for the new shared spurious > > faulting function! :) and additionally add it to the commit message. > > Sure. Will do it in the next version. Thanks! > > [snip] > > >> >> > >> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > >> >> index 32e8457ad535..341622ec80e4 100644 > >> >> --- a/include/linux/pgtable.h > >> >> +++ b/include/linux/pgtable.h > >> >> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) > >> >> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) > >> >> #endif > >> >> > >> >> +#ifndef flush_tlb_fix_spurious_fault_pmd > >> >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) > >> >> +#endif > >> > > >> > flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to > >> > flush_tlb_page() - why do we just do nothing in this case here? > >> > >> Because all architectures do nothing for the spurious PMD page fault > >> fixing until the [2/2] of this series. Where, we make it necessary to > >> flush the local TLB for spurious PMD page fault fixing on arm64 > >> architecture. > >> > >> If we follow the design of flush_tlb_fix_spurious_fault(), we need to > >> change all architecture implementation to do nothing in this patch to > >> keep the current behavior. I don't think that it's a good idea. Do > >> you agree? > > > > Yeah probably we should keep the same behaviour as before, which is > > obviously, prior to this series, we did nothing. > > > > I guess in the PTE case we _always_ want to flush the TLB, whereas in the > > PMD case we otherwise don't have any need to at the point at which the > > spurious flush is performed. > > > > But from your explanation above re: the stale TLB entry this _only_ needs > > to be done for architectures which might encounter this problem rather than > > needing a TLB flush in general. > > > > Given we're generalising the code and one case always flushes the TLB and > > the other doesn't maybe it's worth putting a comment in the generalised > > function mentioning this? > > I'm not sure whether it's a good idea to document architecture behaviors > in the general code. The behavior may be changed architecture by > architecture in the future. Right, but we are unconditionally doing a TLB flush in the PTE case but not the PMD case, so let's document that to be clear :) > > [snip] > > --- > Best Regards, > Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
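A sketch of how that asymmetry could be documented at the generic fallbacks themselves; the comment wording here is an assumption, while the two definitions match the existing code and this patch:

#ifndef flush_tlb_fix_spurious_fault
/* PTE level: the generic fallback flushes unconditionally. */
#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
	flush_tlb_page(vma, address)
#endif

#ifndef flush_tlb_fix_spurious_fault_pmd
/*
 * PMD level: unlike the PTE fallback above, this is a no-op, since no
 * architecture needed a huge PMD flush here before this series.
 * Architectures where a stale read-only huge PMD entry can linger in
 * the local TLB (arm64 after patch [2/2]) must override this.
 */
#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
	do { } while (0)
#endif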
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-16 8:25 ` Lorenzo Stoakes @ 2025-10-16 8:59 ` David Hildenbrand 2025-10-16 9:12 ` Huang, Ying 1 sibling, 0 replies; 33+ messages in thread From: David Hildenbrand @ 2025-10-16 8:59 UTC (permalink / raw) To: Lorenzo Stoakes, Huang, Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On 16.10.25 10:25, Lorenzo Stoakes wrote: > On Thu, Oct 16, 2025 at 10:22:57AM +0800, Huang, Ying wrote: >> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: >> >>> On Wed, Oct 15, 2025 at 04:43:14PM +0800, Huang, Ying wrote: >>> OK this is great, let's put it all in the kdoc for the new shared spurious >>> faulting function! :) and additionally add it to the commit message. >> >> Sure. Will do it in the next version. > > Thanks! > >> >> [snip] >> >>>>>> >>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>>>>> index 32e8457ad535..341622ec80e4 100644 >>>>>> --- a/include/linux/pgtable.h >>>>>> +++ b/include/linux/pgtable.h >>>>>> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) >>>>>> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) >>>>>> #endif >>>>>> >>>>>> +#ifndef flush_tlb_fix_spurious_fault_pmd >>>>>> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) >>>>>> +#endif >>>>> >>>>> flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to >>>>> flush_tlb_page() - why do we just do nothing in this case here? >>>> >>>> Because all architectures do nothing for the spurious PMD page fault >>>> fixing until the [2/2] of this series. Where, we make it necessary to >>>> flush the local TLB for spurious PMD page fault fixing on arm64 >>>> architecture. >>>> >>>> If we follow the design of flush_tlb_fix_spurious_fault(), we need to >>>> change all architecture implementation to do nothing in this patch to >>>> keep the current behavior. I don't think that it's a good idea. Do >>>> you agree? >>> >>> Yeah probably we should keep the same behaviour as before, which is >>> obviously, prior to this series, we did nothing. >>> >>> I guess in the PTE case we _always_ want to flush the TLB, whereas in the >>> PMD case we otherwise don't have any need to at the point at which the >>> spurious flush is performed. >>> >>> But from your explanation above re: the stale TLB entry this _only_ needs >>> to be done for architectures which might encounter this problem rather than >>> needing a TLB flush in general. >>> >>> Given we're generalising the code and one case always flushes the TLB and >>> the other doesn't maybe it's worth putting a comment in the generalised >>> function mentioning this? >> >> I'm not sure whether it's a good idea to document architecture behaviors >> in the general code. The behavior may be changed architecture by >> architecture in the future. > > Right, but we are unconditionaly doing a TLB flush in the PTE case but not PMD > so let's document that to be clear :) Agreed! That's a big benefit of merging the code, it sticks out what is not common already. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd 2025-10-16 8:25 ` Lorenzo Stoakes 2025-10-16 8:59 ` David Hildenbrand @ 2025-10-16 9:12 ` Huang, Ying 1 sibling, 0 replies; 33+ messages in thread From: Huang, Ying @ 2025-10-16 9:12 UTC (permalink / raw) To: Lorenzo Stoakes Cc: David Hildenbrand, Catalin Marinas, Will Deacon, Andrew Morton, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: > On Thu, Oct 16, 2025 at 10:22:57AM +0800, Huang, Ying wrote: >> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes: >> >> > On Wed, Oct 15, 2025 at 04:43:14PM +0800, Huang, Ying wrote: >> > OK this is great, let's put it all in the kdoc for the new shared spurious >> > faulting function! :) and additionally add it to the commit message. >> >> Sure. Will do it in the next version. > > Thanks! > >> >> [snip] >> >> >> >> >> >> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >> >> >> index 32e8457ad535..341622ec80e4 100644 >> >> >> --- a/include/linux/pgtable.h >> >> >> +++ b/include/linux/pgtable.h >> >> >> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) >> >> >> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address) >> >> >> #endif >> >> >> >> >> >> +#ifndef flush_tlb_fix_spurious_fault_pmd >> >> >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0) >> >> >> +#endif >> >> > >> >> > flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to >> >> > flush_tlb_page() - why do we just do nothing in this case here? >> >> >> >> Because all architectures do nothing for the spurious PMD page fault >> >> fixing until the [2/2] of this series. Where, we make it necessary to >> >> flush the local TLB for spurious PMD page fault fixing on arm64 >> >> architecture. >> >> >> >> If we follow the design of flush_tlb_fix_spurious_fault(), we need to >> >> change all architecture implementation to do nothing in this patch to >> >> keep the current behavior. I don't think that it's a good idea. Do >> >> you agree? >> > >> > Yeah probably we should keep the same behaviour as before, which is >> > obviously, prior to this series, we did nothing. >> > >> > I guess in the PTE case we _always_ want to flush the TLB, whereas in the >> > PMD case we otherwise don't have any need to at the point at which the >> > spurious flush is performed. >> > >> > But from your explanation above re: the stale TLB entry this _only_ needs >> > to be done for architectures which might encounter this problem rather than >> > needing a TLB flush in general. >> > >> > Given we're generalising the code and one case always flushes the TLB and >> > the other doesn't maybe it's worth putting a comment in the generalised >> > function mentioning this? >> >> I'm not sure whether it's a good idea to document architecture behaviors >> in the general code. The behavior may be changed architecture by >> architecture in the future. > > Right, but we are unconditionaly doing a TLB flush in the PTE case but not PMD > so let's document that to be clear :) Sure. Will do this. >> >> [snip] --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
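For reference, the asymmetry this sub-thread agrees to document is easiest to see with the two generic fallbacks side by side. The macro definitions below are taken from the series; the comments are only a sketch of the kind of documentation being requested, not merged wording:

#ifndef flush_tlb_fix_spurious_fault
/* PTE case: always flush; a stale entry could otherwise keep faulting. */
#define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address)
#endif

#ifndef flush_tlb_fix_spurious_fault_pmd
/*
 * PMD case: no architecture needed a flush here before arm64 started
 * relying on spurious-fault fixing, so the default is a no-op.
 */
#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0)
#endif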
* [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-13  9:20 [PATCH -v2 0/2] arm, tlbflush: avoid TLBI broadcast if page reused in write fault Huang Ying
  2025-10-13  9:20 ` [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd Huang Ying
@ 2025-10-13  9:20 ` Huang Ying
  2025-10-15 15:28   ` Ryan Roberts
  2025-10-22  4:08   ` Barry Song
  1 sibling, 2 replies; 33+ messages in thread
From: Huang Ying @ 2025-10-13  9:20 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand
Cc: Huang Ying, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
	Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
	Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang,
	Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel,
	linux-mm

A multi-thread customer workload with large memory footprint uses
fork()/exec() to run some external programs every few tens of seconds.
When running the workload on an arm64 server machine, it's observed
that quite some CPU cycles are spent in the TLB flushing functions.
When running the workload on an x86_64 server machine, they are not.
This causes the performance on arm64 to be much worse than that on
x86_64.

During the workload running, after fork()/exec() write-protects all
pages in the parent process, memory writing in the parent process
will cause a write protection fault. Then the page fault handler
will make the PTE/PDE writable if the page can be reused, which is
almost always true in the workload. On arm64, to avoid the write
protection fault on other CPUs, the page fault handler flushes the TLB
globally with TLBI broadcast after changing the PTE/PDE. However, this
isn't always necessary. Firstly, it's safe to leave some stall
read-only TLB entries as long as they will be flushed finally.
Secondly, it's quite possible that the original read-only PTE/PDEs
aren't cached in remote TLB at all if the memory footprint is large.
In fact, on x86_64, the page fault handler doesn't flush the remote
TLB in this situation, which benefits the performance a lot.

To improve the performance on arm64, make the write protection fault
handler flush the TLB locally instead of globally via TLBI broadcast
after making the PTE/PDE writable. If there are stall read-only TLB
entries in the remote CPUs, the page fault handler on these CPUs will
regard the page fault as spurious and flush the stall TLB entries.

To test the patchset, the usemem.c from vm-scalability
(https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git)
was modified to support calling fork()/exec() periodically (merged).
To mimic the behavior of the customer workload, run usemem with 4
threads, access 100GB memory, and call fork()/exec() every 40 seconds.
Test results show that with the patchset the score of usemem improves
~40.6%. The cycles% of the TLB flush functions drops from ~50.5% to
~0.3% in the perf profile.
Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- arch/arm64/include/asm/pgtable.h | 14 +++++--- arch/arm64/include/asm/tlbflush.h | 56 +++++++++++++++++++++++++++++++ arch/arm64/mm/contpte.c | 3 +- arch/arm64/mm/fault.c | 2 +- 4 files changed, 67 insertions(+), 8 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index aa89c2e67ebc..35bae2e4bcfe 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void) #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ /* - * Outside of a few very special situations (e.g. hibernation), we always - * use broadcast TLB invalidation instructions, therefore a spurious page - * fault on one CPU which has been handled concurrently by another CPU - * does not need to perform additional invalidation. + * We use local TLB invalidation instruction when reusing page in + * write protection fault handler to avoid TLBI broadcast in the hot + * path. This will cause spurious page faults if stall read-only TLB + * entries exist. */ -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0) +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \ + local_flush_tlb_page_nonotify(vma, address) + +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ + local_flush_tlb_page_nonotify(vma, address) /* * ZERO_PAGE is a global shared page that is always zero: used diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h index 18a5dc0c9a54..651b31fd18bb 100644 --- a/arch/arm64/include/asm/tlbflush.h +++ b/arch/arm64/include/asm/tlbflush.h @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void) * cannot be easily determined, the value TLBI_TTL_UNKNOWN will * perform a non-hinted invalidation. * + * local_flush_tlb_page(vma, addr) + * Local variant of flush_tlb_page(). Stale TLB entries may + * remain in remote CPUs. + * + * local_flush_tlb_page_nonotify(vma, addr) + * Same as local_flush_tlb_page() except MMU notifier will not be + * called. + * + * local_flush_tlb_contpte_range(vma, start, end) + * Invalidate the virtual-address range '[start, end)' mapped with + * contpte on local CPU for the user address space corresponding + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs. 
* * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented * on top of these routines, since that is our interface to the mmu_gather @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm) mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); } +static inline void __local_flush_tlb_page_nonotify_nosync( + struct mm_struct *mm, unsigned long uaddr) +{ + unsigned long addr; + + dsb(nshst); + addr = __TLBI_VADDR(uaddr, ASID(mm)); + __tlbi(vale1, addr); + __tlbi_user(vale1, addr); +} + +static inline void local_flush_tlb_page_nonotify( + struct vm_area_struct *vma, unsigned long uaddr) +{ + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); + dsb(nsh); +} + +static inline void local_flush_tlb_page(struct vm_area_struct *vma, + unsigned long uaddr) +{ + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK, + (uaddr & PAGE_MASK) + PAGE_SIZE); + dsb(nsh); +} + static inline void __flush_tlb_page_nosync(struct mm_struct *mm, unsigned long uaddr) { @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, dsb(ish); } +static inline void local_flush_tlb_contpte_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end) +{ + unsigned long asid, pages; + + start = round_down(start, PAGE_SIZE); + end = round_up(end, PAGE_SIZE); + pages = (end - start) >> PAGE_SHIFT; + + dsb(nshst); + asid = ASID(vma->vm_mm); + __flush_tlb_range_op(vale1, start, pages, PAGE_SIZE, asid, + 3, true, lpa2_is_enabled()); + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end); + dsb(nsh); +} + static inline void flush_tlb_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) { diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c index c0557945939c..0f9bbb7224dc 100644 --- a/arch/arm64/mm/contpte.c +++ b/arch/arm64/mm/contpte.c @@ -622,8 +622,7 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma, __ptep_set_access_flags(vma, addr, ptep, entry, 0); if (dirty) - __flush_tlb_range(vma, start_addr, addr, - PAGE_SIZE, true, 3); + local_flush_tlb_contpte_range(vma, start_addr, addr); } else { __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte); __ptep_set_access_flags(vma, addr, ptep, entry, dirty); diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index d816ff44faff..22f54f5afe3f 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -235,7 +235,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma, /* Invalidate a stale read-only entry */ if (dirty) - flush_tlb_page(vma, address); + local_flush_tlb_page(vma, address); return 1; } -- 2.39.5 ^ permalink raw reply [flat|nested] 33+ messages in thread
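To summarize the mechanism the patch relies on, here is a hedged two-CPU sketch in illustrative pseudo-code; it is not the literal mm/ or arch/arm64 code, and names such as orig_pte and ptl are placeholders:

/* CPU A: takes the write protection fault; the page can be reused. */
spin_lock(ptl);
entry = pte_mkdirty(pte_mkwrite(orig_pte, vma));
set_pte_at(mm, addr, ptep, entry);
local_flush_tlb_page(vma, addr);	/* local tlbi only, no broadcast */
spin_unlock(ptl);

/* CPU B: may still cache the stale read-only entry and fault again. */
spin_lock(ptl);
if (pte_write(ptep_get(ptep))) {
	/* CPU A already fixed the PTE: the fault is spurious. */
	flush_tlb_fix_spurious_fault(vma, addr, ptep);	/* local tlbi */
}
spin_unlock(ptl);

The broadcast invalidation is traded for at most one extra (spurious) page fault per CPU that actually holds a stale entry, which the commit message above argues is rare for large memory footprints.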
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault 2025-10-13 9:20 ` [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault Huang Ying @ 2025-10-15 15:28 ` Ryan Roberts 2025-10-16 1:35 ` Huang, Ying 2025-10-22 4:08 ` Barry Song 1 sibling, 1 reply; 33+ messages in thread From: Ryan Roberts @ 2025-10-15 15:28 UTC (permalink / raw) To: Huang Ying, Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand Cc: Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On 13/10/2025 10:20, Huang Ying wrote: > A multi-thread customer workload with large memory footprint uses > fork()/exec() to run some external programs every tens seconds. When > running the workload on an arm64 server machine, it's observed that > quite some CPU cycles are spent in the TLB flushing functions. While > running the workload on the x86_64 server machine, it's not. This > causes the performance on arm64 to be much worse than that on x86_64. > > During the workload running, after fork()/exec() write-protects all > pages in the parent process, memory writing in the parent process > will cause a write protection fault. Then the page fault handler > will make the PTE/PDE writable if the page can be reused, which is > almost always true in the workload. On arm64, to avoid the write > protection fault on other CPUs, the page fault handler flushes the TLB > globally with TLBI broadcast after changing the PTE/PDE. However, this > isn't always necessary. Firstly, it's safe to leave some stall nit: You keep using the word "stall" here and in the code. I think you mean "stale"? > read-only TLB entries as long as they will be flushed finally. > Secondly, it's quite possible that the original read-only PTE/PDEs > aren't cached in remote TLB at all if the memory footprint is large. > In fact, on x86_64, the page fault handler doesn't flush the remote > TLB in this situation, which benefits the performance a lot. > > To improve the performance on arm64, make the write protection fault > handler flush the TLB locally instead of globally via TLBI broadcast > after making the PTE/PDE writable. If there are stall read-only TLB > entries in the remote CPUs, the page fault handler on these CPUs will > regard the page fault as spurious and flush the stall TLB entries. > > To test the patchset, make the usemem.c from vm-scalability > (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git). > support calling fork()/exec() periodically (merged). To mimic the > behavior of the customer workload, run usemem with 4 threads, access > 100GB memory, and call fork()/exec() every 40 seconds. Test results > show that with the patchset the score of usemem improves ~40.6%. The > cycles% of TLB flush functions reduces from ~50.5% to ~0.3% in perf > profile. 
> > Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Will Deacon <will@kernel.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: David Hildenbrand <david@redhat.com> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Zi Yan <ziy@nvidia.com> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > Cc: Ryan Roberts <ryan.roberts@arm.com> > Cc: Yang Shi <yang@os.amperecomputing.com> > Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> > Cc: Dev Jain <dev.jain@arm.com> > Cc: Barry Song <baohua@kernel.org> > Cc: Anshuman Khandual <anshuman.khandual@arm.com> > Cc: Yicong Yang <yangyicong@hisilicon.com> > Cc: Kefeng Wang <wangkefeng.wang@huawei.com> > Cc: Kevin Brodsky <kevin.brodsky@arm.com> > Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> > Cc: linux-arm-kernel@lists.infradead.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org > --- > arch/arm64/include/asm/pgtable.h | 14 +++++--- > arch/arm64/include/asm/tlbflush.h | 56 +++++++++++++++++++++++++++++++ > arch/arm64/mm/contpte.c | 3 +- > arch/arm64/mm/fault.c | 2 +- > 4 files changed, 67 insertions(+), 8 deletions(-) > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > index aa89c2e67ebc..35bae2e4bcfe 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void) > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > /* > - * Outside of a few very special situations (e.g. hibernation), we always > - * use broadcast TLB invalidation instructions, therefore a spurious page > - * fault on one CPU which has been handled concurrently by another CPU > - * does not need to perform additional invalidation. > + * We use local TLB invalidation instruction when reusing page in > + * write protection fault handler to avoid TLBI broadcast in the hot > + * path. This will cause spurious page faults if stall read-only TLB > + * entries exist. > */ > -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0) > +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \ > + local_flush_tlb_page_nonotify(vma, address) > + > +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ > + local_flush_tlb_page_nonotify(vma, address) > > /* > * ZERO_PAGE is a global shared page that is always zero: used > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h > index 18a5dc0c9a54..651b31fd18bb 100644 > --- a/arch/arm64/include/asm/tlbflush.h > +++ b/arch/arm64/include/asm/tlbflush.h > @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void) > * cannot be easily determined, the value TLBI_TTL_UNKNOWN will > * perform a non-hinted invalidation. > * > + * local_flush_tlb_page(vma, addr) > + * Local variant of flush_tlb_page(). Stale TLB entries may > + * remain in remote CPUs. > + * > + * local_flush_tlb_page_nonotify(vma, addr) > + * Same as local_flush_tlb_page() except MMU notifier will not be > + * called. > + * > + * local_flush_tlb_contpte_range(vma, start, end) > + * Invalidate the virtual-address range '[start, end)' mapped with > + * contpte on local CPU for the user address space corresponding > + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs. 
> * > * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented > * on top of these routines, since that is our interface to the mmu_gather > @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm) > mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); > } > > +static inline void __local_flush_tlb_page_nonotify_nosync( > + struct mm_struct *mm, unsigned long uaddr) > +{ > + unsigned long addr; > + > + dsb(nshst); > + addr = __TLBI_VADDR(uaddr, ASID(mm)); > + __tlbi(vale1, addr); > + __tlbi_user(vale1, addr); > +} > + > +static inline void local_flush_tlb_page_nonotify( > + struct vm_area_struct *vma, unsigned long uaddr) > +{ > + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); > + dsb(nsh); > +} > + > +static inline void local_flush_tlb_page(struct vm_area_struct *vma, > + unsigned long uaddr) > +{ > + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); > + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK, > + (uaddr & PAGE_MASK) + PAGE_SIZE); > + dsb(nsh); > +} > + > static inline void __flush_tlb_page_nosync(struct mm_struct *mm, > unsigned long uaddr) > { > @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, > dsb(ish); > } > > +static inline void local_flush_tlb_contpte_range(struct vm_area_struct *vma, > + unsigned long start, unsigned long end) This would be clearer as an API if it was like this: static inline void local_flush_tlb_contpte(struct vm_area_struct *vma, unsigned long uaddr) i.e. the user doesn't set the range - it's implicitly CONT_PTE_SIZE starting at round_down(uaddr, PAGE_SIZE). Thanks, Ryan > +{ > + unsigned long asid, pages; > + > + start = round_down(start, PAGE_SIZE); > + end = round_up(end, PAGE_SIZE); > + pages = (end - start) >> PAGE_SHIFT; > + > + dsb(nshst); > + asid = ASID(vma->vm_mm); > + __flush_tlb_range_op(vale1, start, pages, PAGE_SIZE, asid, > + 3, true, lpa2_is_enabled()); > + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end); > + dsb(nsh); > +} > + > static inline void flush_tlb_range(struct vm_area_struct *vma, > unsigned long start, unsigned long end) > { > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c > index c0557945939c..0f9bbb7224dc 100644 > --- a/arch/arm64/mm/contpte.c > +++ b/arch/arm64/mm/contpte.c > @@ -622,8 +622,7 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma, > __ptep_set_access_flags(vma, addr, ptep, entry, 0); > > if (dirty) > - __flush_tlb_range(vma, start_addr, addr, > - PAGE_SIZE, true, 3); > + local_flush_tlb_contpte_range(vma, start_addr, addr); > } else { > __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte); > __ptep_set_access_flags(vma, addr, ptep, entry, dirty); > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c > index d816ff44faff..22f54f5afe3f 100644 > --- a/arch/arm64/mm/fault.c > +++ b/arch/arm64/mm/fault.c > @@ -235,7 +235,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma, > > /* Invalidate a stale read-only entry */ > if (dirty) > - flush_tlb_page(vma, address); > + local_flush_tlb_page(vma, address); > return 1; > } > ^ permalink raw reply [flat|nested] 33+ messages in thread
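A minimal sketch of what the suggested wrapper could look like, assuming it is layered on the series' range-based helper; the rounding follows Ryan's wording, and this is an illustration rather than code from any posted version:

static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
					   unsigned long uaddr)
{
	unsigned long start = round_down(uaddr, PAGE_SIZE);

	/* The range is implicit: CONT_PTE_SIZE worth of address space. */
	local_flush_tlb_contpte_range(vma, start, start + CONT_PTE_SIZE);
}

The benefit of this shape is that callers cannot pass a range that disagrees with the contpte geometry; the helper owns the extent.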
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault 2025-10-15 15:28 ` Ryan Roberts @ 2025-10-16 1:35 ` Huang, Ying 0 siblings, 0 replies; 33+ messages in thread From: Huang, Ying @ 2025-10-16 1:35 UTC (permalink / raw) To: Ryan Roberts Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Barry Song, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm Hi, Ryan, Thanks for comments! Ryan Roberts <ryan.roberts@arm.com> writes: > On 13/10/2025 10:20, Huang Ying wrote: >> A multi-thread customer workload with large memory footprint uses >> fork()/exec() to run some external programs every tens seconds. When >> running the workload on an arm64 server machine, it's observed that >> quite some CPU cycles are spent in the TLB flushing functions. While >> running the workload on the x86_64 server machine, it's not. This >> causes the performance on arm64 to be much worse than that on x86_64. >> >> During the workload running, after fork()/exec() write-protects all >> pages in the parent process, memory writing in the parent process >> will cause a write protection fault. Then the page fault handler >> will make the PTE/PDE writable if the page can be reused, which is >> almost always true in the workload. On arm64, to avoid the write >> protection fault on other CPUs, the page fault handler flushes the TLB >> globally with TLBI broadcast after changing the PTE/PDE. However, this >> isn't always necessary. Firstly, it's safe to leave some stall > > nit: You keep using the word "stall" here and in the code. I think you mean "stale"? OOPS, my poor English :-( Yes, it should be "stale". Thanks for pointing this out, will fix it in the future versions. >> read-only TLB entries as long as they will be flushed finally. >> Secondly, it's quite possible that the original read-only PTE/PDEs >> aren't cached in remote TLB at all if the memory footprint is large. >> In fact, on x86_64, the page fault handler doesn't flush the remote >> TLB in this situation, which benefits the performance a lot. >> >> To improve the performance on arm64, make the write protection fault >> handler flush the TLB locally instead of globally via TLBI broadcast >> after making the PTE/PDE writable. If there are stall read-only TLB >> entries in the remote CPUs, the page fault handler on these CPUs will >> regard the page fault as spurious and flush the stall TLB entries. >> >> To test the patchset, make the usemem.c from vm-scalability >> (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git). >> support calling fork()/exec() periodically (merged). To mimic the >> behavior of the customer workload, run usemem with 4 threads, access >> 100GB memory, and call fork()/exec() every 40 seconds. Test results >> show that with the patchset the score of usemem improves ~40.6%. The >> cycles% of TLB flush functions reduces from ~50.5% to ~0.3% in perf >> profile. 
>> >> Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> >> Cc: Catalin Marinas <catalin.marinas@arm.com> >> Cc: Will Deacon <will@kernel.org> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Cc: David Hildenbrand <david@redhat.com> >> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >> Cc: Vlastimil Babka <vbabka@suse.cz> >> Cc: Zi Yan <ziy@nvidia.com> >> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> >> Cc: Ryan Roberts <ryan.roberts@arm.com> >> Cc: Yang Shi <yang@os.amperecomputing.com> >> Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org> >> Cc: Dev Jain <dev.jain@arm.com> >> Cc: Barry Song <baohua@kernel.org> >> Cc: Anshuman Khandual <anshuman.khandual@arm.com> >> Cc: Yicong Yang <yangyicong@hisilicon.com> >> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> >> Cc: Kevin Brodsky <kevin.brodsky@arm.com> >> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> >> Cc: linux-arm-kernel@lists.infradead.org >> Cc: linux-kernel@vger.kernel.org >> Cc: linux-mm@kvack.org >> --- >> arch/arm64/include/asm/pgtable.h | 14 +++++--- >> arch/arm64/include/asm/tlbflush.h | 56 +++++++++++++++++++++++++++++++ >> arch/arm64/mm/contpte.c | 3 +- >> arch/arm64/mm/fault.c | 2 +- >> 4 files changed, 67 insertions(+), 8 deletions(-) >> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >> index aa89c2e67ebc..35bae2e4bcfe 100644 >> --- a/arch/arm64/include/asm/pgtable.h >> +++ b/arch/arm64/include/asm/pgtable.h >> @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void) >> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ >> >> /* >> - * Outside of a few very special situations (e.g. hibernation), we always >> - * use broadcast TLB invalidation instructions, therefore a spurious page >> - * fault on one CPU which has been handled concurrently by another CPU >> - * does not need to perform additional invalidation. >> + * We use local TLB invalidation instruction when reusing page in >> + * write protection fault handler to avoid TLBI broadcast in the hot >> + * path. This will cause spurious page faults if stall read-only TLB >> + * entries exist. >> */ >> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0) >> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \ >> + local_flush_tlb_page_nonotify(vma, address) >> + >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ >> + local_flush_tlb_page_nonotify(vma, address) >> >> /* >> * ZERO_PAGE is a global shared page that is always zero: used >> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h >> index 18a5dc0c9a54..651b31fd18bb 100644 >> --- a/arch/arm64/include/asm/tlbflush.h >> +++ b/arch/arm64/include/asm/tlbflush.h >> @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void) >> * cannot be easily determined, the value TLBI_TTL_UNKNOWN will >> * perform a non-hinted invalidation. >> * >> + * local_flush_tlb_page(vma, addr) >> + * Local variant of flush_tlb_page(). Stale TLB entries may >> + * remain in remote CPUs. >> + * >> + * local_flush_tlb_page_nonotify(vma, addr) >> + * Same as local_flush_tlb_page() except MMU notifier will not be >> + * called. >> + * >> + * local_flush_tlb_contpte_range(vma, start, end) >> + * Invalidate the virtual-address range '[start, end)' mapped with >> + * contpte on local CPU for the user address space corresponding >> + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs. 
>> * >> * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented >> * on top of these routines, since that is our interface to the mmu_gather >> @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm) >> mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); >> } >> >> +static inline void __local_flush_tlb_page_nonotify_nosync( >> + struct mm_struct *mm, unsigned long uaddr) >> +{ >> + unsigned long addr; >> + >> + dsb(nshst); >> + addr = __TLBI_VADDR(uaddr, ASID(mm)); >> + __tlbi(vale1, addr); >> + __tlbi_user(vale1, addr); >> +} >> + >> +static inline void local_flush_tlb_page_nonotify( >> + struct vm_area_struct *vma, unsigned long uaddr) >> +{ >> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); >> + dsb(nsh); >> +} >> + >> +static inline void local_flush_tlb_page(struct vm_area_struct *vma, >> + unsigned long uaddr) >> +{ >> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); >> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK, >> + (uaddr & PAGE_MASK) + PAGE_SIZE); >> + dsb(nsh); >> +} >> + >> static inline void __flush_tlb_page_nosync(struct mm_struct *mm, >> unsigned long uaddr) >> { >> @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, >> dsb(ish); >> } >> >> +static inline void local_flush_tlb_contpte_range(struct vm_area_struct *vma, >> + unsigned long start, unsigned long end) > > This would be clearer as an API if it was like this: > > static inline void local_flush_tlb_contpte(struct vm_area_struct *vma, > unsigned long uaddr) > > i.e. the user doesn't set the range - it's implicitly CONT_PTE_SIZE starting at > round_down(uaddr, PAGE_SIZE). Sure. Will do this. > Thanks, > Ryan > >> +{ >> + unsigned long asid, pages; >> + >> + start = round_down(start, PAGE_SIZE); >> + end = round_up(end, PAGE_SIZE); >> + pages = (end - start) >> PAGE_SHIFT; >> + >> + dsb(nshst); >> + asid = ASID(vma->vm_mm); >> + __flush_tlb_range_op(vale1, start, pages, PAGE_SIZE, asid, >> + 3, true, lpa2_is_enabled()); >> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end); >> + dsb(nsh); >> +} >> + >> static inline void flush_tlb_range(struct vm_area_struct *vma, >> unsigned long start, unsigned long end) >> { >> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c >> index c0557945939c..0f9bbb7224dc 100644 >> --- a/arch/arm64/mm/contpte.c >> +++ b/arch/arm64/mm/contpte.c >> @@ -622,8 +622,7 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma, >> __ptep_set_access_flags(vma, addr, ptep, entry, 0); >> >> if (dirty) >> - __flush_tlb_range(vma, start_addr, addr, >> - PAGE_SIZE, true, 3); >> + local_flush_tlb_contpte_range(vma, start_addr, addr); >> } else { >> __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte); >> __ptep_set_access_flags(vma, addr, ptep, entry, dirty); >> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c >> index d816ff44faff..22f54f5afe3f 100644 >> --- a/arch/arm64/mm/fault.c >> +++ b/arch/arm64/mm/fault.c >> @@ -235,7 +235,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma, >> >> /* Invalidate a stale read-only entry */ >> if (dirty) >> - flush_tlb_page(vma, address); >> + local_flush_tlb_page(vma, address); >> return 1; >> } >> --- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault 2025-10-13 9:20 ` [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault Huang Ying 2025-10-15 15:28 ` Ryan Roberts @ 2025-10-22 4:08 ` Barry Song 2025-10-22 7:31 ` Huang, Ying 2025-10-23 10:18 ` Ryan Roberts 1 sibling, 2 replies; 33+ messages in thread From: Barry Song @ 2025-10-22 4:08 UTC (permalink / raw) To: Huang Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > index aa89c2e67ebc..35bae2e4bcfe 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void) > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > /* > - * Outside of a few very special situations (e.g. hibernation), we always > - * use broadcast TLB invalidation instructions, therefore a spurious page > - * fault on one CPU which has been handled concurrently by another CPU > - * does not need to perform additional invalidation. > + * We use local TLB invalidation instruction when reusing page in > + * write protection fault handler to avoid TLBI broadcast in the hot > + * path. This will cause spurious page faults if stall read-only TLB > + * entries exist. > */ > -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0) > +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \ > + local_flush_tlb_page_nonotify(vma, address) > + > +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ > + local_flush_tlb_page_nonotify(vma, address) > > /* > * ZERO_PAGE is a global shared page that is always zero: used > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h > index 18a5dc0c9a54..651b31fd18bb 100644 > --- a/arch/arm64/include/asm/tlbflush.h > +++ b/arch/arm64/include/asm/tlbflush.h > @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void) > * cannot be easily determined, the value TLBI_TTL_UNKNOWN will > * perform a non-hinted invalidation. > * > + * local_flush_tlb_page(vma, addr) > + * Local variant of flush_tlb_page(). Stale TLB entries may > + * remain in remote CPUs. > + * > + * local_flush_tlb_page_nonotify(vma, addr) > + * Same as local_flush_tlb_page() except MMU notifier will not be > + * called. > + * > + * local_flush_tlb_contpte_range(vma, start, end) > + * Invalidate the virtual-address range '[start, end)' mapped with > + * contpte on local CPU for the user address space corresponding > + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs. > * > * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented > * on top of these routines, since that is our interface to the mmu_gather > @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm) > mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); > } > > +static inline void __local_flush_tlb_page_nonotify_nosync( > + struct mm_struct *mm, unsigned long uaddr) > +{ > + unsigned long addr; > + > + dsb(nshst); We were issuing dsb(ishst) even for the nosync case, likely to ensure PTE visibility across cores. 
However, since set_ptes already includes a dsb(ishst) in __set_pte_complete(), does this mean we’re being overly cautious in __flush_tlb_page_nosync() in many cases? static inline void __flush_tlb_page_nosync(struct mm_struct *mm, unsigned long uaddr) { unsigned long addr; dsb(ishst); addr = __TLBI_VADDR(uaddr, ASID(mm)); __tlbi(vale1is, addr); __tlbi_user(vale1is, addr); mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK, (uaddr & PAGE_MASK) + PAGE_SIZE); } On the other hand, __ptep_set_access_flags() doesn’t seem to use set_ptes(), so there’s no guarantee the updated PTEs are visible to all cores. If a remote CPU later encounters a page fault and performs a TLB invalidation, will it still see a stable PTE? > + addr = __TLBI_VADDR(uaddr, ASID(mm)); > + __tlbi(vale1, addr); > + __tlbi_user(vale1, addr); > +} > + > +static inline void local_flush_tlb_page_nonotify( > + struct vm_area_struct *vma, unsigned long uaddr) > +{ > + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); > + dsb(nsh); > +} > + > +static inline void local_flush_tlb_page(struct vm_area_struct *vma, > + unsigned long uaddr) > +{ > + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr); > + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK, > + (uaddr & PAGE_MASK) + PAGE_SIZE); > + dsb(nsh); > +} > + > static inline void __flush_tlb_page_nosync(struct mm_struct *mm, > unsigned long uaddr) > { > @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma, > dsb(ish); > } > We already have functions like __flush_tlb_page_nosync() and __flush_tlb_range_nosync(). Is there a way to factor out or extract their common parts? Is it because of the differences in barriers that this extraction of common code isn’t feasible? Thanks Barry ^ permalink raw reply [flat|nested] 33+ messages in thread
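As background for the barrier question above: dsb(ishst) makes prior stores visible to the whole Inner Shareable domain (all CPUs), while dsb(nshst) only does so for the local CPU. Side by side, the two flush patterns discussed in this thread look roughly like this (condensed from code already quoted above):

/* Broadcast flush: every CPU's walker must see the PTE store first. */
dsb(ishst);			/* order the store inner-shareable-wide */
addr = __TLBI_VADDR(uaddr, ASID(mm));
__tlbi(vale1is, addr);		/* invalidate on all CPUs */
dsb(ish);			/* wait for completion everywhere */

/* Local flush (this series): only the local walker matters. */
dsb(nshst);			/* order the store for this CPU only */
addr = __TLBI_VADDR(uaddr, ASID(mm));
__tlbi(vale1, addr);		/* invalidate on this CPU only */
dsb(nsh);			/* wait for local completion */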
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault 2025-10-22 4:08 ` Barry Song @ 2025-10-22 7:31 ` Huang, Ying 2025-10-22 8:14 ` Barry Song 2025-10-23 10:18 ` Ryan Roberts 1 sibling, 1 reply; 33+ messages in thread From: Huang, Ying @ 2025-10-22 7:31 UTC (permalink / raw) To: Barry Song Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm Hi, Barry, Barry Song <21cnbao@gmail.com> writes: >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h >> index aa89c2e67ebc..35bae2e4bcfe 100644 >> --- a/arch/arm64/include/asm/pgtable.h >> +++ b/arch/arm64/include/asm/pgtable.h >> @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void) >> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ >> >> /* >> - * Outside of a few very special situations (e.g. hibernation), we always >> - * use broadcast TLB invalidation instructions, therefore a spurious page >> - * fault on one CPU which has been handled concurrently by another CPU >> - * does not need to perform additional invalidation. >> + * We use local TLB invalidation instruction when reusing page in >> + * write protection fault handler to avoid TLBI broadcast in the hot >> + * path. This will cause spurious page faults if stall read-only TLB >> + * entries exist. >> */ >> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0) >> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \ >> + local_flush_tlb_page_nonotify(vma, address) >> + >> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \ >> + local_flush_tlb_page_nonotify(vma, address) >> >> /* >> * ZERO_PAGE is a global shared page that is always zero: used >> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h >> index 18a5dc0c9a54..651b31fd18bb 100644 >> --- a/arch/arm64/include/asm/tlbflush.h >> +++ b/arch/arm64/include/asm/tlbflush.h >> @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void) >> * cannot be easily determined, the value TLBI_TTL_UNKNOWN will >> * perform a non-hinted invalidation. >> * >> + * local_flush_tlb_page(vma, addr) >> + * Local variant of flush_tlb_page(). Stale TLB entries may >> + * remain in remote CPUs. >> + * >> + * local_flush_tlb_page_nonotify(vma, addr) >> + * Same as local_flush_tlb_page() except MMU notifier will not be >> + * called. >> + * >> + * local_flush_tlb_contpte_range(vma, start, end) >> + * Invalidate the virtual-address range '[start, end)' mapped with >> + * contpte on local CPU for the user address space corresponding >> + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs. >> * >> * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented >> * on top of these routines, since that is our interface to the mmu_gather >> @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm) >> mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); >> } >> >> +static inline void __local_flush_tlb_page_nonotify_nosync( >> + struct mm_struct *mm, unsigned long uaddr) >> +{ >> + unsigned long addr; >> + >> + dsb(nshst); > > We were issuing dsb(ishst) even for the nosync case, likely to ensure > PTE visibility across cores. 
> However, since set_ptes already includes a
> dsb(ishst) in __set_pte_complete(), does this mean we're being overly
> cautious in __flush_tlb_page_nosync() in many cases?
>
> static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>                                            unsigned long uaddr)
> {
>         unsigned long addr;
>
>         dsb(ishst);
>         addr = __TLBI_VADDR(uaddr, ASID(mm));
>         __tlbi(vale1is, addr);
>         __tlbi_user(vale1is, addr);
>         mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>                                                     (uaddr & PAGE_MASK) +
>                                                     PAGE_SIZE);
> }

IIUC, _nosync() here means it doesn't synchronize with the following
code. It still synchronizes with the previous code, mainly the
preceding page table changes. And yes, there may be room to improve
this.

> On the other hand, __ptep_set_access_flags() doesn't seem to use
> set_ptes(), so there's no guarantee the updated PTEs are visible to all
> cores. If a remote CPU later encounters a page fault and performs a TLB
> invalidation, will it still see a stable PTE?

I don't think so. We just flush the local TLB in the
local_flush_tlb_page() family of functions. So, we only need to
guarantee that the page table changes are visible for local page table
walking. If a page fault occurs on a remote CPU, we will call
local_flush_tlb_page() on the remote CPU.

>> +       addr = __TLBI_VADDR(uaddr, ASID(mm));
>> +       __tlbi(vale1, addr);
>> +       __tlbi_user(vale1, addr);
>> +}
>> +
>> +static inline void local_flush_tlb_page_nonotify(
>> +       struct vm_area_struct *vma, unsigned long uaddr)
>> +{
>> +       __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> +       dsb(nsh);
>> +}
>> +
>> +static inline void local_flush_tlb_page(struct vm_area_struct *vma,
>> +       unsigned long uaddr)
>> +{
>> +       __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> +       mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
>> +                                                   (uaddr & PAGE_MASK) + PAGE_SIZE);
>> +       dsb(nsh);
>> +}
>> +
>>  static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>>                                             unsigned long uaddr)
>>  {
>> @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>>         dsb(ish);
>>  }
>>
>
> We already have functions like
> __flush_tlb_page_nosync() and __flush_tlb_range_nosync().
> Is there a way to factor out or extract their common parts?
>
> Is it because of the differences in barriers that this extraction of
> common code isn't feasible?

Yes, it's a good idea to do some code cleanup to reduce code
duplication. Ryan plans to work on this.

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
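In other words, the _nosync() variants still order against the preceding page table update (via dsb(nshst)) but leave the completion barrier to the caller, which permits batching. A minimal sketch, assuming the helpers added by this patch; uaddr1 and uaddr2 are placeholder addresses:

/* Invalidate two pages but pay the completion barrier only once. */
__local_flush_tlb_page_nonotify_nosync(mm, uaddr1);	/* dsb(nshst) + tlbi */
__local_flush_tlb_page_nonotify_nosync(mm, uaddr2);
dsb(nsh);	/* single wait for both local invalidations to complete */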
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault 2025-10-22 7:31 ` Huang, Ying @ 2025-10-22 8:14 ` Barry Song 2025-10-22 9:02 ` Huang, Ying 0 siblings, 1 reply; 33+ messages in thread From: Barry Song @ 2025-10-22 8:14 UTC (permalink / raw) To: Huang, Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm > > > > static inline void __flush_tlb_page_nosync(struct mm_struct *mm, > > unsigned long uaddr) > > { > > unsigned long addr; > > > > dsb(ishst); > > addr = __TLBI_VADDR(uaddr, ASID(mm)); > > __tlbi(vale1is, addr); > > __tlbi_user(vale1is, addr); > > mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK, > > (uaddr & PAGE_MASK) + > > PAGE_SIZE); > > } > > IIUC, _nosync() here means doesn't synchronize with the following code. > It still synchronizes with the previous code, mainly the page table > changing. And, Yes. There may be room to improve this. > > > On the other hand, __ptep_set_access_flags() doesn’t seem to use > > set_ptes(), so there’s no guarantee the updated PTEs are visible to all > > cores. If a remote CPU later encounters a page fault and performs a TLB > > invalidation, will it still see a stable PTE? > > I don't think so. We just flush local TLB in local_flush_tlb_page() > family functions. So, we only needs to guarantee the page table changes > are available for the local page table walking. If a page fault occurs > on a remote CPU, we will call local_flush_tlb_page() on the remote CPU. > My concern is that: We don’t have a dsb(ish) to ensure the PTE page table is visible to remote CPUs, since you’re using dsb(nsh). So even if a remote CPU performs local_flush_tlb_page(), the memory may not be synchronized yet, and it could still see the old PTE. Thanks Barry ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22  8:14 ` Barry Song
@ 2025-10-22  9:02 ` Huang, Ying
  2025-10-22  9:17   ` Barry Song
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2025-10-22  9:02 UTC (permalink / raw)
To: Barry Song
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts,
	Yang Shi, Christoph Lameter (Ampere), Dev Jain, Anshuman Khandual,
	Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei,
	linux-arm-kernel, linux-kernel, linux-mm

Barry Song <21cnbao@gmail.com> writes:

>> >
>> > static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>> >                                            unsigned long uaddr)
>> > {
>> >         unsigned long addr;
>> >
>> >         dsb(ishst);
>> >         addr = __TLBI_VADDR(uaddr, ASID(mm));
>> >         __tlbi(vale1is, addr);
>> >         __tlbi_user(vale1is, addr);
>> >         mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>> >                                                     (uaddr & PAGE_MASK) +
>> >                                                     PAGE_SIZE);
>> > }
>>
>> IIUC, _nosync() here means doesn't synchronize with the following code.
>> It still synchronizes with the previous code, mainly the page table
>> changing. And, Yes. There may be room to improve this.
>>
>> > On the other hand, __ptep_set_access_flags() doesn't seem to use
>> > set_ptes(), so there's no guarantee the updated PTEs are visible to all
>> > cores. If a remote CPU later encounters a page fault and performs a TLB
>> > invalidation, will it still see a stable PTE?
>>
>> I don't think so. We just flush local TLB in local_flush_tlb_page()
>> family functions. So, we only needs to guarantee the page table changes
>> are available for the local page table walking. If a page fault occurs
>> on a remote CPU, we will call local_flush_tlb_page() on the remote CPU.
>>
>
> My concern is that:
>
> We don't have a dsb(ish) to ensure the PTE page table is visible to remote
> CPUs, since you're using dsb(nsh). So even if a remote CPU performs
> local_flush_tlb_page(), the memory may not be synchronized yet, and it could
> still see the old PTE.

So, do you think that after the load/store unit of the remote CPU has
seen the new PTE, the page table walker could still see the old PTE? I
doubt it. Even if so, the worst case is just one extra spurious page
fault. If the possibility of the worst case is low enough, that should
be OK.

Additionally, the page table lock is held when writing the PTE on this
CPU and re-reading the PTE on the remote CPU. That provides some memory
ordering guarantee too.

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault 2025-10-22 9:02 ` Huang, Ying @ 2025-10-22 9:17 ` Barry Song 2025-10-22 9:30 ` Huang, Ying 0 siblings, 1 reply; 33+ messages in thread From: Barry Song @ 2025-10-22 9:17 UTC (permalink / raw) To: Huang, Ying Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain, Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm On Wed, Oct 22, 2025 at 10:02 PM Huang, Ying <ying.huang@linux.alibaba.com> wrote: > > Barry Song <21cnbao@gmail.com> writes: > > >> > > >> > static inline void __flush_tlb_page_nosync(struct mm_struct *mm, > >> > unsigned long uaddr) > >> > { > >> > unsigned long addr; > >> > > >> > dsb(ishst); > >> > addr = __TLBI_VADDR(uaddr, ASID(mm)); > >> > __tlbi(vale1is, addr); > >> > __tlbi_user(vale1is, addr); > >> > mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK, > >> > (uaddr & PAGE_MASK) + > >> > PAGE_SIZE); > >> > } > >> > >> IIUC, _nosync() here means doesn't synchronize with the following code. > >> It still synchronizes with the previous code, mainly the page table > >> changing. And, Yes. There may be room to improve this. > >> > >> > On the other hand, __ptep_set_access_flags() doesn’t seem to use > >> > set_ptes(), so there’s no guarantee the updated PTEs are visible to all > >> > cores. If a remote CPU later encounters a page fault and performs a TLB > >> > invalidation, will it still see a stable PTE? > >> > >> I don't think so. We just flush local TLB in local_flush_tlb_page() > >> family functions. So, we only needs to guarantee the page table changes > >> are available for the local page table walking. If a page fault occurs > >> on a remote CPU, we will call local_flush_tlb_page() on the remote CPU. > >> > > > > My concern is that: > > > > We don’t have a dsb(ish) to ensure the PTE page table is visible to remote > > CPUs, since you’re using dsb(nsh). So even if a remote CPU performs > > local_flush_tlb_page(), the memory may not be synchronized yet, and it could > > still see the old PTE. > > So, do you think that after the load/store unit of the remote CPU have > seen the new PTE, the page table walker could still see the old PTE? I Without a barrier in the ish domain, remote CPUs’ load/store units may not see the new PTE written by the first CPU performing the reuse. That’s why we need a barrier in the ish domain to ensure the PTE is actually visible across the SMP domain. A store instruction doesn’t guarantee that the data is immediately visible to other CPUs — at least not for load instructions. Though, I’m not entirely sure about the page table walker case. > doubt it. Even if so, the worse case is one extra spurious page fault? > If the possibility of the worst case is low enough, that should be OK. CPU0: CPU1: write pte; do local tlbi; page fault; do local tlbi; -> still old PTE pte visible to CPU1 > > Additionally, the page table lock is held when writing PTE on this CPU > and re-reading PTE on the remote CPU. That provides some memory order > guarantee too. Right, the PTL might take care of it automatically. Thanks Barry ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 9:17 ` Barry Song
@ 2025-10-22 9:30 ` Huang, Ying
  2025-10-22 9:37 ` Barry Song
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2025-10-22 9:30 UTC (permalink / raw)
To: Barry Song
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

Barry Song <21cnbao@gmail.com> writes:

> On Wed, Oct 22, 2025 at 10:02 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> >> >
>> >> > static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>> >> >                                            unsigned long uaddr)
>> >> > {
>> >> >         unsigned long addr;
>> >> >
>> >> >         dsb(ishst);
>> >> >         addr = __TLBI_VADDR(uaddr, ASID(mm));
>> >> >         __tlbi(vale1is, addr);
>> >> >         __tlbi_user(vale1is, addr);
>> >> >         mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>> >> >                                                     (uaddr & PAGE_MASK) +
>> >> >                                                     PAGE_SIZE);
>> >> > }
>> >>
>> >> IIUC, _nosync() here means that it doesn't synchronize with the following
>> >> code.  It still synchronizes with the previous code, mainly the page table
>> >> change.  And, yes, there may be room to improve this.
>> >>
>> >> > On the other hand, __ptep_set_access_flags() doesn't seem to use
>> >> > set_ptes(), so there's no guarantee the updated PTEs are visible to all
>> >> > cores. If a remote CPU later encounters a page fault and performs a TLB
>> >> > invalidation, will it still see a stable PTE?
>> >>
>> >> I don't think so.  We just flush the local TLB in the local_flush_tlb_page()
>> >> family of functions.  So, we only need to guarantee that the page table
>> >> changes are visible to the local page table walker.  If a page fault occurs
>> >> on a remote CPU, we will call local_flush_tlb_page() on the remote CPU.
>> >>
>> >
>> > My concern is that:
>> >
>> > We don't have a dsb(ish) to ensure the PTE page table is visible to remote
>> > CPUs, since you're using dsb(nsh). So even if a remote CPU performs
>> > local_flush_tlb_page(), the memory may not be synchronized yet, and it could
>> > still see the old PTE.
>>
>> So, do you think that after the load/store unit of the remote CPU has
>> seen the new PTE, the page table walker could still see the old PTE?  I
>
> Without a barrier in the ish domain, remote CPUs' load/store units may not
> see the new PTE written by the first CPU performing the reuse.
>
> That's why we need a barrier in the ish domain to ensure the PTE is
> actually visible across the SMP domain. A store instruction doesn't
> guarantee that the data is immediately visible to other CPUs, at least
> not to the load instructions of other CPUs.
>
> Though, I'm not entirely sure about the page table walker case.
>
>> doubt it.  Even if so, the worst case is one extra spurious page fault?
>> If the possibility of the worst case is low enough, that should be OK.
>
> CPU0:                          CPU1:
>
> write pte;
>
> do local tlbi;
>                                page fault;
>                                do local tlbi;  -> still old PTE
>
> pte visible to CPU1

With the PTL, this becomes

CPU0:                          CPU1:

page fault                     page fault
lock PTL
write PTE
do local tlbi
unlock PTL
                               lock PTL       <- pte visible to CPU1
                               read PTE       <- new PTE
                               do local tlbi  <- new PTE
                               unlock PTL

>> Additionally, the page table lock is held when writing the PTE on this CPU
>> and re-reading the PTE on the remote CPU.  That provides some memory
>> ordering guarantee too.
>
> Right, the PTL might take care of it automatically.

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
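A minimal sketch of the ordering this diagram relies on (illustrative, not
the exact mm/ code; flush_tlb_fix_spurious_fault() is the arm64 definition
added by this series):

	/* CPU0: write fault reusing the page */
	spin_lock(ptl);					/* the PTL */
	ptep_set_access_flags(vma, addr, ptep, entry, 1); /* entry = new, writable pte */
	/* ... local tlbi only, as in the patch ... */
	spin_unlock(ptl);				/* release: publishes the pte write */

	/* CPU1: spurious fault through a stale read-only TLB entry */
	spin_lock(ptl);					/* acquire: pairs with CPU0's unlock */
	if (pte_write(ptep_get(ptep)))			/* guaranteed to see the new pte */
		flush_tlb_fix_spurious_fault(vma, addr, ptep); /* local tlbi, no broadcast */
	spin_unlock(ptl);

The acquire/release pair on the PTL is what makes the new pte visible to
CPU1; the barriers inside the flush itself only order the pte store against
the tlbi.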
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 9:30 ` Huang, Ying
@ 2025-10-22 9:37 ` Barry Song
  2025-10-22 9:46 ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: Barry Song @ 2025-10-22 9:37 UTC (permalink / raw)
To: Huang, Ying
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

>
> With the PTL, this becomes
>
> CPU0:                          CPU1:
>
> page fault                     page fault
> lock PTL
> write PTE
> do local tlbi
> unlock PTL
>                                lock PTL       <- pte visible to CPU1
>                                read PTE       <- new PTE
>                                do local tlbi  <- new PTE
>                                unlock PTL

I agree. Yet the ish barrier can still avoid page faults while CPU0
holds the PTL.

CPU0:                          CPU1:

lock PTL

write pte;
Issue ish barrier
do local tlbi;
                               No page fault occurs if the TLB misses

unlock PTL

Otherwise, it could be:

CPU0:                          CPU1:

lock PTL

write pte;
Issue nsh barrier
do local tlbi;
                               page fault occurs if the TLB misses

unlock PTL

I'm not quite sure whether adding an ish right after the PTE modification
has any noticeable performance impact on the test. I assume the most
expensive part is still the dsb that waits for the tlbi broadcast, not the
memory sync barrier for the PTE?

Thanks
Barry

^ permalink raw reply	[flat|nested] 33+ messages in thread
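In code form, the two variants being compared differ only in the
shareability domain of the leading barrier (a sketch; the invalidation
itself stays non-broadcast in both):

	/* the patch as posted: publish the pte for the local walker only */
	dsb(nshst);
	__tlbi(vale1, addr);
	dsb(nsh);

	/* the variant suggested here: publish the pte inner-shareable
	 * first, so a racing remote fault is less likely to still see
	 * the old pte */
	dsb(ishst);
	__tlbi(vale1, addr);
	dsb(nsh);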
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 9:37 ` Barry Song
@ 2025-10-22 9:46 ` Huang, Ying
  2025-10-22 9:55 ` Barry Song
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2025-10-22 9:46 UTC (permalink / raw)
To: Barry Song
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

Barry Song <21cnbao@gmail.com> writes:

>>
>> With the PTL, this becomes
>>
>> CPU0:                          CPU1:
>>
>> page fault                     page fault
>> lock PTL
>> write PTE
>> do local tlbi
>> unlock PTL
>>                                lock PTL       <- pte visible to CPU1
>>                                read PTE       <- new PTE
>>                                do local tlbi  <- new PTE
>>                                unlock PTL
>
> I agree. Yet the ish barrier can still avoid page faults while CPU0
> holds the PTL.

IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
memory writing (becoming visible to other CPUs).  TBH, I doubt that this
is the case.

> CPU0:                          CPU1:
>
> lock PTL
>
> write pte;
> Issue ish barrier
> do local tlbi;
>                                No page fault occurs if the TLB misses
>
> unlock PTL
>
> Otherwise, it could be:
>
> CPU0:                          CPU1:
>
> lock PTL
>
> write pte;
> Issue nsh barrier
> do local tlbi;
>                                page fault occurs if the TLB misses
>
> unlock PTL
>
> I'm not quite sure whether adding an ish right after the PTE modification
> has any noticeable performance impact on the test. I assume the most
> expensive part is still the dsb that waits for the tlbi broadcast, not the
> memory sync barrier for the PTE?

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 9:46 ` Huang, Ying
@ 2025-10-22 9:55 ` Barry Song
  2025-10-22 10:22 ` Barry Song
  2025-10-22 10:34 ` Huang, Ying
  0 siblings, 2 replies; 33+ messages in thread
From: Barry Song @ 2025-10-22 9:55 UTC (permalink / raw)
To: Huang, Ying
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
<ying.huang@linux.alibaba.com> wrote:
>
> >
> > I agree. Yet the ish barrier can still avoid page faults while CPU0
> > holds the PTL.
>
> IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
> memory writing (becoming visible to other CPUs).  TBH, I doubt that this
> is the case.

Why? In any case, nsh does not cover the SMP domain.

I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
are visible to other CPUs. I'm not quite sure why the current flush code
uses dsb(ish); it seems like overkill.

Thanks
Barry

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 9:55 ` Barry Song
@ 2025-10-22 10:22 ` Barry Song
  0 siblings, 0 replies; 33+ messages in thread
From: Barry Song @ 2025-10-22 10:22 UTC (permalink / raw)
To: Huang, Ying
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

On Wed, Oct 22, 2025 at 10:55 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
> >
> > >
> > > I agree. Yet the ish barrier can still avoid page faults while CPU0
> > > holds the PTL.
> >
> > IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
> > memory writing (becoming visible to other CPUs).  TBH, I doubt that this
> > is the case.
>
> Why? In any case, nsh does not cover the SMP domain.
>
> I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
> are visible to other CPUs. I'm not quite sure why the current flush code
> uses dsb(ish); it seems like overkill.

On second thought, the PTE/page table walker might not be a typical SMP
sync case, so a dmb may not be sufficient; we are not dealing with
standard load/store instruction sequences across multiple threads.

In any case, my point is that dsb(ish) might be slightly slower than your
dsb(nsh), but it makes the PTE visible to other CPUs earlier and helps
avoid some page faults after we've written the PTE.

However, if your current nsh version actually provides better performance,
even when multiple threads may access the data simultaneously, it should
be completely fine.

Now you do:

write pte
don't broadcast pte
tlbi
don't broadcast tlbi

we might instead do:

write pte
broadcast pte
tlbi
don't broadcast tlbi

Thanks
Barry

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 9:55 ` Barry Song
  2025-10-22 10:22 ` Barry Song
@ 2025-10-22 10:34 ` Huang, Ying
  2025-10-22 10:52 ` Barry Song
  1 sibling, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2025-10-22 10:34 UTC (permalink / raw)
To: Barry Song
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

Barry Song <21cnbao@gmail.com> writes:

> On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
>
>> >
>> > I agree. Yet the ish barrier can still avoid page faults while CPU0
>> > holds the PTL.
>>
>> IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
>> memory writing (becoming visible to other CPUs).  TBH, I doubt that this
>> is the case.
>
> Why? In any case, nsh does not cover the SMP domain.

I think dsb(ish) will be slower than dsb(nsh) in theory.  I guess that
dsb just waits for the memory write to become visible in the specified
shareability domain instead of making the write faster.

> I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
> are visible

dmb(ishst) (smp_wmb()) should pair with dmb(ishld) (smp_rmb()).

> to other CPUs. I'm not quite sure why the current flush code uses dsb(ish);
> it seems like overkill.

dsb(ish) here is used for the tlbi(XXis) broadcast.  It waits until the
page table change is visible to the page table walkers of the remote CPUs.

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
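The pairing referred to above is the classic message-passing pattern (a
generic sketch for comparison; the hardware page table walker is not a
consumer of this kind, which is why this pattern alone does not cover the
TLB case):

	static int data, flag;

	/* CPU0 (producer) */
	WRITE_ONCE(data, 42);
	smp_wmb();			/* dmb(ishst) on arm64 */
	WRITE_ONCE(flag, 1);

	/* CPU1 (consumer) */
	int r = 0;
	if (READ_ONCE(flag)) {
		smp_rmb();		/* dmb(ishld) on arm64 */
		r = READ_ONCE(data);	/* guaranteed to observe 42 */
	}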
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 10:34 ` Huang, Ying
@ 2025-10-22 10:52 ` Barry Song
  2025-10-23 1:22 ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: Barry Song @ 2025-10-22 10:52 UTC (permalink / raw)
To: Huang, Ying
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

On Wed, Oct 22, 2025 at 11:34 PM Huang, Ying
<ying.huang@linux.alibaba.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
> > <ying.huang@linux.alibaba.com> wrote:
> >
> >> >
> >> > I agree. Yet the ish barrier can still avoid page faults while CPU0
> >> > holds the PTL.
> >>
> >> IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
> >> memory writing (becoming visible to other CPUs).  TBH, I doubt that this
> >> is the case.
> >
> > Why? In any case, nsh does not cover the SMP domain.
>
> I think dsb(ish) will be slower than dsb(nsh) in theory.  I guess that
> dsb just waits for the memory write to become visible in the specified
> shareability domain instead of making the write faster.
>
> > I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
> > are visible
>
> dmb(ishst) (smp_wmb()) should pair with dmb(ishld) (smp_rmb()).
>
> > to other CPUs. I'm not quite sure why the current flush code uses dsb(ish);
> > it seems like overkill.
>
> dsb(ish) here is used for the tlbi(XXis) broadcast.  It waits until the
> page table change is visible to the page table walkers of the remote CPUs.

It seems we're aligned on all points[1], although I'm not sure whether
you have data comparing A and B.

A:
write pte
don't broadcast pte
tlbi
don't broadcast tlbi

with

B:
write pte
broadcast pte
tlbi
don't broadcast tlbi

I guess the gain comes from "don't broadcast tlbi"?
With B, we should be able to share much of the existing code.

[1] https://lore.kernel.org/linux-mm/20251013092038.6963-1-ying.huang@linux.alibaba.com/T/#m54312d4914c69aa550bee7df36711c03a4280c52

Thanks
Barry

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 10:52 ` Barry Song
@ 2025-10-23 1:22 ` Huang, Ying
  2025-10-23 5:39 ` Barry Song
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2025-10-23 1:22 UTC (permalink / raw)
To: Barry Song
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

Barry Song <21cnbao@gmail.com> writes:

> On Wed, Oct 22, 2025 at 11:34 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Wed, Oct 22, 2025 at 10:46 PM Huang, Ying
>> > <ying.huang@linux.alibaba.com> wrote:
>> >
>> >> >
>> >> > I agree. Yet the ish barrier can still avoid page faults while CPU0
>> >> > holds the PTL.
>> >>
>> >> IIUC, you think that dsb(ish) compared with dsb(nsh) can accelerate
>> >> memory writing (becoming visible to other CPUs).  TBH, I doubt that this
>> >> is the case.
>> >
>> > Why? In any case, nsh does not cover the SMP domain.
>>
>> I think dsb(ish) will be slower than dsb(nsh) in theory.  I guess that
>> dsb just waits for the memory write to become visible in the specified
>> shareability domain instead of making the write faster.
>>
>> > I believe a dmb(ishst) is sufficient to ensure that the new PTE writes
>> > are visible
>>
>> dmb(ishst) (smp_wmb()) should pair with dmb(ishld) (smp_rmb()).
>>
>> > to other CPUs. I'm not quite sure why the current flush code uses dsb(ish);
>> > it seems like overkill.
>>
>> dsb(ish) here is used for the tlbi(XXis) broadcast.  It waits until the
>> page table change is visible to the page table walkers of the remote CPUs.
>
> It seems we're aligned on all points[1], although I'm not sure whether
> you have data comparing A and B.
>
> A:
> write pte
> don't broadcast pte
> tlbi
> don't broadcast tlbi
>
> with
>
> B:
> write pte
> broadcast pte

I doubt that the pte will be broadcast; DVM broadcast isn't used for
memory coherency, IIUC.

> tlbi
> don't broadcast tlbi
>
> I guess the gain comes from "don't broadcast tlbi"?
> With B, we should be able to share much of the existing code.

Ryan has a plan to reduce the code duplication with the current
solution.

> [1]
> https://lore.kernel.org/linux-mm/20251013092038.6963-1-ying.huang@linux.alibaba.com/T/#m54312d4914c69aa550bee7df36711c03a4280c52

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-23 1:22 ` Huang, Ying
@ 2025-10-23 5:39 ` Barry Song
  2025-10-23 6:15 ` Huang, Ying
  0 siblings, 1 reply; 33+ messages in thread
From: Barry Song @ 2025-10-23 5:39 UTC (permalink / raw)
To: Huang, Ying
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

> >
> > A:
> > write pte
> > don't broadcast pte
> > tlbi
> > don't broadcast tlbi
> >
> > with
> >
> > B:
> > write pte
> > broadcast pte
>
> I doubt that the pte will be broadcast; DVM broadcast isn't used for
> memory coherency, IIUC.

I guess you're right. By "broadcast," I actually meant the PTE becoming
visible to other CPUs. With a dsb(ish) before the tlbi, other cores' TLBs
can load the new PTE after their TLB entries are shot down. But as you
said, if the hardware doesn't propagate the updated PTE faster, it doesn't
seem to help reduce page faults.

As a side note, I'm curious about the data between dsb(nsh) and dsb(ish)
on your platform. Perhaps because the number of CPU cores is small, I
didn't see any noticeable difference between them on phones.

>
> > tlbi
> > don't broadcast tlbi
> >
> > I guess the gain comes from "don't broadcast tlbi"?
> > With B, we should be able to share much of the existing code.
>
> Ryan has a plan to reduce the code duplication with the current
> solution.

Ok.

Thanks
Barry

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-23 5:39 ` Barry Song
@ 2025-10-23 6:15 ` Huang, Ying
  0 siblings, 0 replies; 33+ messages in thread
From: Huang, Ying @ 2025-10-23 6:15 UTC (permalink / raw)
To: Barry Song
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang,
    Ryan Roberts, Yang Shi, Christoph Lameter (Ampere), Dev Jain,
    Anshuman Khandual, Yicong Yang, Kefeng Wang, Kevin Brodsky,
    Yin Fengwei, linux-arm-kernel, linux-kernel, linux-mm

Barry Song <21cnbao@gmail.com> writes:

>> >
>> > A:
>> > write pte
>> > don't broadcast pte
>> > tlbi
>> > don't broadcast tlbi
>> >
>> > with
>> >
>> > B:
>> > write pte
>> > broadcast pte
>>
>> I doubt that the pte will be broadcast; DVM broadcast isn't used for
>> memory coherency, IIUC.
>
> I guess you're right. By "broadcast," I actually meant the PTE becoming
> visible to other CPUs. With a dsb(ish) before the tlbi, other cores' TLBs
> can load the new PTE after their TLB entries are shot down. But as you
> said, if the hardware doesn't propagate the updated PTE faster, it doesn't
> seem to help reduce page faults.
>
> As a side note, I'm curious about the data between dsb(nsh) and dsb(ish)
> on your platform. Perhaps because the number of CPU cores is small, I
> didn't see any noticeable difference between them on phones.

Sure.  I can give it a try.  Can you share the test case?

>>
>> > tlbi
>> > don't broadcast tlbi
>> >
>> > I guess the gain comes from "don't broadcast tlbi"?
>> > With B, we should be able to share much of the existing code.
>>
>> Ryan has a plan to reduce the code duplication with the current
>> solution.
>
> Ok.

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 33+ messages in thread
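One possible test case along these lines (an assumption on my part, not
necessarily the benchmark used for this series): after fork()
write-protects the parent's anonymous pages and the child exits, every
subsequent parent write takes the write-protect-fault reuse path, which is
exactly where the nsh-versus-ish leading barrier sits.

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <time.h>
	#include <unistd.h>

	#define SZ	(256UL << 20)

	int main(void)
	{
		char *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		struct timespec t0, t1;

		if (buf == MAP_FAILED)
			return 1;
		memset(buf, 1, SZ);	/* populate the pages */
		if (fork() == 0)
			_exit(0);	/* child exists only to write-protect the parent */
		wait(NULL);
		clock_gettime(CLOCK_MONOTONIC, &t0);
		memset(buf, 2, SZ);	/* one wp fault + page reuse per page */
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("%.3f s\n", (t1.tv_sec - t0.tv_sec) +
		       (t1.tv_nsec - t0.tv_nsec) / 1e9);
		return 0;
	}

Running it pinned to different CPU counts with each barrier variant should
show whether the dsb(nsh) versus dsb(ish) difference is measurable at all.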
* Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  2025-10-22 4:08 ` Barry Song
  2025-10-22 7:31 ` Huang, Ying
@ 2025-10-23 10:18 ` Ryan Roberts
  1 sibling, 0 replies; 33+ messages in thread
From: Ryan Roberts @ 2025-10-23 10:18 UTC (permalink / raw)
To: Barry Song, Huang Ying
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Hildenbrand,
    Lorenzo Stoakes, Vlastimil Babka, Zi Yan, Baolin Wang, Yang Shi,
    Christoph Lameter (Ampere), Dev Jain, Anshuman Khandual,
    Yicong Yang, Kefeng Wang, Kevin Brodsky, Yin Fengwei,
    linux-arm-kernel, linux-kernel, linux-mm

Hi Barry, Huang,

On 22/10/2025 05:08, Barry Song wrote:
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index aa89c2e67ebc..35bae2e4bcfe 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void)
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>>  /*
>> - * Outside of a few very special situations (e.g. hibernation), we always
>> - * use broadcast TLB invalidation instructions, therefore a spurious page
>> - * fault on one CPU which has been handled concurrently by another CPU
>> - * does not need to perform additional invalidation.
>> + * We use local TLB invalidation instructions when reusing a page in
>> + * the write protection fault handler to avoid TLBI broadcast in the
>> + * hot path. This will cause spurious page faults if stale read-only
>> + * TLB entries exist.
>>   */
>> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
>> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
>> +	local_flush_tlb_page_nonotify(vma, address)
>> +
>> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
>> +	local_flush_tlb_page_nonotify(vma, address)
>>
>>  /*
>>   * ZERO_PAGE is a global shared page that is always zero: used
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index 18a5dc0c9a54..651b31fd18bb 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void)
>>   *		cannot be easily determined, the value TLBI_TTL_UNKNOWN will
>>   *		perform a non-hinted invalidation.
>>   *
>> + *	local_flush_tlb_page(vma, addr)
>> + *		Local variant of flush_tlb_page(). Stale TLB entries may
>> + *		remain in remote CPUs.
>> + *
>> + *	local_flush_tlb_page_nonotify(vma, addr)
>> + *		Same as local_flush_tlb_page() except the MMU notifier will
>> + *		not be called.
>> + *
>> + *	local_flush_tlb_contpte_range(vma, start, end)
>> + *		Invalidate the virtual-address range '[start, end)' mapped with
>> + *		contpte on the local CPU for the user address space corresponding
>> + *		to 'vma->mm'. Stale TLB entries may remain in remote CPUs.
>>   *
>>   * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented
>>   * on top of these routines, since that is our interface to the mmu_gather
>> @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>>  	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>>  }
>>
>> +static inline void __local_flush_tlb_page_nonotify_nosync(
>> +	struct mm_struct *mm, unsigned long uaddr)
>> +{
>> +	unsigned long addr;
>> +
>> +	dsb(nshst);

I've skimmed through this thread so apologies if I've missed some of the
detail, but thought it might be useful to give my opinion as a summary...
> We were issuing dsb(ishst) even for the nosync case, likely to ensure
> PTE visibility across cores.

The leading dsb prior to issuing the tlbi is to ensure that the HW table
walker(s) will always see the new pte immediately after the tlbi
completes. Without it, you could end up with the old value immediately
re-cached after the tlbi completes. So if you are broadcasting the tlbi,
the dsb needs to be to ish. If you're doing local invalidation, then nsh
is sufficient.

"nosync" is just saying that we will not wait for the tlbi to complete.
You still need to issue the leading dsb to ensure that the table walkers
see the latest pte once the tlbi (eventually) completes.

> However, since set_ptes already includes a
> dsb(ishst) in __set_pte_complete(), does this mean we're being overly
> cautious in __flush_tlb_page_nosync() in many cases?

We only issue a dsb in __set_pte_complete() for kernel ptes. We elide it
for user ptes because we can safely take a fault (for the case where we
transition invalid->valid) for user mappings and that race will resolve
itself with the help of the PTL. For valid->valid or valid->invalid,
there will be an associated tlb flush, which has the barrier.

>
> static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>                                            unsigned long uaddr)
> {
>         unsigned long addr;
>
>         dsb(ishst);
>         addr = __TLBI_VADDR(uaddr, ASID(mm));
>         __tlbi(vale1is, addr);
>         __tlbi_user(vale1is, addr);
>         mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
>                                                     (uaddr & PAGE_MASK) +
>                                                     PAGE_SIZE);
> }
>
> On the other hand, __ptep_set_access_flags() doesn't seem to use
> set_ptes(), so there's no guarantee the updated PTEs are visible to all
> cores. If a remote CPU later encounters a page fault and performs a TLB
> invalidation, will it still see a stable PTE?

Yes, because the reads and writes are done under the PTL; that
synchronizes the memory for us.

You were discussing the potential value of upgrading the leading dsb from
nshst to ishst. IMHO that's neither required nor desirable; the memory
synchronization is handled by the PTL.

Overall, this optimization relies on the premise that synchronizing with
remote CPUs is expensive and races are rare, so we should keep everything
local for as long as possible and not worry about micro-optimizing the
efficiency of the race case.

>
>> +	addr = __TLBI_VADDR(uaddr, ASID(mm));
>> +	__tlbi(vale1, addr);
>> +	__tlbi_user(vale1, addr);
>> +}
>> +
>> +static inline void local_flush_tlb_page_nonotify(
>> +	struct vm_area_struct *vma, unsigned long uaddr)
>> +{
>> +	__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> +	dsb(nsh);
>> +}
>> +
>> +static inline void local_flush_tlb_page(struct vm_area_struct *vma,
>> +					unsigned long uaddr)
>> +{
>> +	__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> +	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
>> +						    (uaddr & PAGE_MASK) + PAGE_SIZE);
>> +	dsb(nsh);
>> +}
>> +
>>  static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>>  					   unsigned long uaddr)
>>  {
>> @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>>  	dsb(ish);
>>  }
>>

> We already have functions like
> __flush_tlb_page_nosync() and __flush_tlb_range_nosync().
> Is there a way to factor out or extract their common parts?
>
> Is it because of the differences in barriers that this extraction of
> common code isn't feasible?
I've proposed re-working these functions to add independent flags for
sync/nosync, local/broadcast and notify/nonotify. I think that will clean
it up quite a bit. But I was going to wait for this to land first. And
also, Will has an RFC for some other tlbflush API cleanup (converting it
to C functions), so we might want to wait for or incorporate that too.

Thanks,
Ryan

> Thanks
> Barry

^ permalink raw reply	[flat|nested] 33+ messages in thread
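Ryan's rule of thumb above, reduced to the two sequences it contrasts
(shapes taken from __flush_tlb_page_nosync() and from the patch; the
comments are a summary, not kernel text):

	/* broadcast invalidate: walkers on every CPU may re-walk as soon
	 * as the tlbi completes, so the pte store must be visible across
	 * the inner-shareable domain first */
	dsb(ishst);
	__tlbi(vale1is, addr);
	dsb(ish);

	/* local invalidate: only this CPU's walker re-walks, so a
	 * non-shareable barrier is enough; cross-CPU pte visibility is
	 * the PTL's job, not the barrier's */
	dsb(nshst);
	__tlbi(vale1, addr);
	dsb(nsh);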
end of thread, other threads:[~2025-10-23 10:19 UTC | newest]

Thread overview: 33+ messages
2025-10-13 9:20 [PATCH -v2 0/2] arm, tlbflush: avoid TLBI broadcast if page reused in write fault Huang Ying
2025-10-13 9:20 ` [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd Huang Ying
2025-10-14 14:21   ` Lorenzo Stoakes
2025-10-14 14:38     ` David Hildenbrand
2025-10-14 14:49       ` Lorenzo Stoakes
2025-10-14 14:58         ` David Hildenbrand
2025-10-14 15:13           ` Lorenzo Stoakes
2025-10-15 8:43             ` Huang, Ying
2025-10-15 11:20               ` Lorenzo Stoakes
2025-10-15 12:23                 ` David Hildenbrand
2025-10-16 2:22                   ` Huang, Ying
2025-10-16 8:25                     ` Lorenzo Stoakes
2025-10-16 8:59                       ` David Hildenbrand
2025-10-16 9:12                         ` Huang, Ying
2025-10-13 9:20 ` [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault Huang Ying
2025-10-15 15:28   ` Ryan Roberts
2025-10-16 1:35     ` Huang, Ying
2025-10-22 4:08   ` Barry Song
2025-10-22 7:31     ` Huang, Ying
2025-10-22 8:14       ` Barry Song
2025-10-22 9:02         ` Huang, Ying
2025-10-22 9:17           ` Barry Song
2025-10-22 9:30             ` Huang, Ying
2025-10-22 9:37               ` Barry Song
2025-10-22 9:46                 ` Huang, Ying
2025-10-22 9:55                   ` Barry Song
2025-10-22 10:22                     ` Barry Song
2025-10-22 10:34                     ` Huang, Ying
2025-10-22 10:52                       ` Barry Song
2025-10-23 1:22                         ` Huang, Ying
2025-10-23 5:39                           ` Barry Song
2025-10-23 6:15                             ` Huang, Ying
2025-10-23 10:18     ` Ryan Roberts