* [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements
@ 2025-02-05 15:09 Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 01/16] mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear() Ryan Roberts
` (16 more replies)
0 siblings, 17 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Hi All,
This series started out as a few simple bug fixes but evolved into some code
cleanups and useful performance improvements too. It mainly touches arm64 arch
code but there are a couple of supporting mm changes; I'm guessing that going in
through the arm64 tree is the right approach here?
Beyond the bug fixes and cleanups, the 2 key performance improvements are: 1)
enabling the use of contpte-mapped blocks in the vmalloc space when appropriate
(which reduces TLB pressure); there were already hooks for this (used by
powerpc) but they required some tidying and extending for arm64. And 2) batching
up barriers when modifying the vmalloc address space, for up to a 30% reduction
in the time taken in vmalloc().
vmalloc() performance was measured using the test_vmalloc.ko module. Tested on
Apple M2 and Ampere Altra. Each test had loop count set to 500000 and the whole
test was repeated 10 times.
legend:
- p: nr_pages (pages to allocate)
- h: use_huge (vmalloc() vs vmalloc_huge())
- (I): statistically significant improvement (95% CI does not overlap)
- (R): statistically significant regression (95% CI does not overlap)
- measurements are times; smaller is better
+--------------------------------------------------+-------------+-------------+
| Benchmark | | |
| Result Class | Apple M2 | Ampere Altra |
+==================================================+=============+=============+
| micromm/vmalloc | | |
| fix_align_alloc_test: p:1, h:0 (usec) | (I) -12.93% | (I) -7.89% |
| fix_size_alloc_test: p:1, h:0 (usec) | (R) 4.00% | 1.40% |
| fix_size_alloc_test: p:1, h:1 (usec) | (R) 5.28% | 1.46% |
| fix_size_alloc_test: p:2, h:0 (usec) | (I) -3.04% | -1.11% |
| fix_size_alloc_test: p:2, h:1 (usec) | -3.24% | -2.86% |
| fix_size_alloc_test: p:4, h:0 (usec) | (I) -11.77% | (I) -4.48% |
| fix_size_alloc_test: p:4, h:1 (usec) | (I) -9.19% | (I) -4.45% |
| fix_size_alloc_test: p:8, h:0 (usec) | (I) -19.79% | (I) -11.63% |
| fix_size_alloc_test: p:8, h:1 (usec) | (I) -19.40% | (I) -11.11% |
| fix_size_alloc_test: p:16, h:0 (usec) | (I) -24.89% | (I) -15.26% |
| fix_size_alloc_test: p:16, h:1 (usec) | (I) -11.61% | (R) 6.00% |
| fix_size_alloc_test: p:32, h:0 (usec) | (I) -26.54% | (I) -18.80% |
| fix_size_alloc_test: p:32, h:1 (usec) | (I) -15.42% | (R) 5.82% |
| fix_size_alloc_test: p:64, h:0 (usec) | (I) -30.25% | (I) -20.80% |
| fix_size_alloc_test: p:64, h:1 (usec) | (I) -16.98% | (R) 6.54% |
| fix_size_alloc_test: p:128, h:0 (usec) | (I) -32.56% | (I) -21.79% |
| fix_size_alloc_test: p:128, h:1 (usec) | (I) -18.39% | (R) 5.91% |
| fix_size_alloc_test: p:256, h:0 (usec) | (I) -33.33% | (I) -22.22% |
| fix_size_alloc_test: p:256, h:1 (usec) | (I) -18.82% | (R) 5.79% |
| fix_size_alloc_test: p:512, h:0 (usec) | (I) -33.27% | (I) -22.23% |
| fix_size_alloc_test: p:512, h:1 (usec) | 0.86% | -0.71% |
| full_fit_alloc_test: p:1, h:0 (usec) | 2.49% | -0.62% |
| kvfree_rcu_1_arg_vmalloc_test: p:1, h:0 (usec) | 1.79% | -1.25% |
| kvfree_rcu_2_arg_vmalloc_test: p:1, h:0 (usec) | -0.32% | 0.61% |
| long_busy_list_alloc_test: p:1, h:0 (usec) | (I) -31.06% | (I) -19.62% |
| pcpu_alloc_test: p:1, h:0 (usec) | 0.06% | 0.47% |
| random_size_align_alloc_test: p:1, h:0 (usec) | (I) -14.94% | (I) -8.68% |
| random_size_alloc_test: p:1, h:0 (usec) | (I) -30.22% | (I) -19.59% |
| vm_map_ram_test: p:1, h:0 (usec) | 2.65% | (R) 7.22% |
+--------------------------------------------------+-------------+-------------+
So there are some nice improvements but also some regressions to explain:
First, fix_size_alloc_test with h:1 and p:16,32,64,128,256 regresses by ~6% on
Altra. The regression is actually introduced by enabling contpte-mapped 64K
blocks in these tests, and it is reduced (from about 8% if memory serves) by
the barrier batching. I don't have a definite conclusion on the root cause,
but I've ruled out differences in the mapping paths in vmalloc. I believe it
is most likely down to the difference in the allocation path; 64K blocks are
not cached per-cpu so we have to go all the way to the buddy allocator. I'm
not sure why this doesn't show up on M2 though. Regardless, I'll assert that a
16x reduction in TLB pressure is worth a ~6% increase in the vmalloc()
allocation call duration.
Next, we have a ~4% regression on M2 when vmalloc'ing a single page (h is
irrelevant because a single page is too small for contpte). I assume this is
because there is some minor overhead in the barrier deferral mechanism and we
don't get to amortize it over multiple pages here. But I would assume
vmalloc'ing a single page is uncommon because it doesn't buy you anything over
kmalloc?
Applies on top of v6.14-rc1. All mm selftests run and pass.
Thanks,
Ryan
Ryan Roberts (16):
mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear()
arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes
arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level
arm64: hugetlb: Refine tlb maintenance scope
mm/page_table_check: Batch-check pmds/puds just like ptes
arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
arm64: hugetlb: Use ___set_ptes() and ___ptep_get_and_clear()
arm64/mm: Hoist barriers out of ___set_ptes() loop
arm64/mm: Avoid barriers for invalid or userspace mappings
mm/vmalloc: Warn on improper use of vunmap_range()
mm/vmalloc: Gracefully unmap huge ptes
arm64/mm: Support huge pte-mapped pages in vmap
mm: Don't skip arch_sync_kernel_mappings() in error paths
mm/vmalloc: Batch arch_sync_kernel_mappings() more efficiently
mm: Generalize arch_sync_kernel_mappings()
arm64/mm: Defer barriers when updating kernel mappings
arch/arm64/include/asm/hugetlb.h | 33 +++-
arch/arm64/include/asm/pgtable.h | 225 ++++++++++++++++++++-------
arch/arm64/include/asm/thread_info.h | 2 +
arch/arm64/include/asm/vmalloc.h | 40 +++++
arch/arm64/kernel/process.c | 20 ++-
arch/arm64/mm/hugetlbpage.c | 114 ++++++--------
arch/loongarch/include/asm/hugetlb.h | 6 +-
arch/mips/include/asm/hugetlb.h | 6 +-
arch/parisc/include/asm/hugetlb.h | 2 +-
arch/parisc/mm/hugetlbpage.c | 2 +-
arch/powerpc/include/asm/hugetlb.h | 6 +-
arch/riscv/include/asm/hugetlb.h | 3 +-
arch/riscv/mm/hugetlbpage.c | 2 +-
arch/s390/include/asm/hugetlb.h | 12 +-
arch/s390/mm/hugetlbpage.c | 10 +-
arch/sparc/include/asm/hugetlb.h | 2 +-
arch/sparc/mm/hugetlbpage.c | 2 +-
include/asm-generic/hugetlb.h | 2 +-
include/linux/hugetlb.h | 4 +-
include/linux/page_table_check.h | 30 ++--
include/linux/pgtable.h | 24 +--
include/linux/pgtable_modmask.h | 32 ++++
include/linux/vmalloc.h | 55 +++++++
mm/hugetlb.c | 4 +-
mm/memory.c | 11 +-
mm/page_table_check.c | 34 ++--
mm/vmalloc.c | 97 +++++++-----
27 files changed, 530 insertions(+), 250 deletions(-)
create mode 100644 include/linux/pgtable_modmask.h
--
2.43.0
* [PATCH v1 01/16] mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear()
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-06 5:03 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 02/16] arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes Ryan Roberts
` (15 subsequent siblings)
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel,
Huacai Chen, Thomas Bogendoerfer, James E.J. Bottomley,
Helge Deller, Madhavan Srinivasan, Michael Ellerman,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Gerald Schaefer,
David S. Miller, Andreas Larsson, stable
In order to fix a bug, arm64 needs to be told the size of the huge page
for which the huge_pte is being set in huge_ptep_get_and_clear().
Provide for this by adding an `unsigned long sz` parameter to the
function. This follows the same pattern as huge_pte_clear() and
set_huge_pte_at().
This commit makes the required interface modifications to the core mm as
well as all arches that implement this function (arm64, loongarch, mips,
parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed
in a separate commit.
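For illustration only (a sketch mirroring the changes below, not additional
functionality): callers are expected to pass the size explicitly, typically
derived from the relevant hstate, e.g.:

	/* 'h' is the hstate for the mapping (illustrative) */
	unsigned long sz = huge_page_size(h);
	pte_t pte = huge_ptep_get_and_clear(mm, addr, ptep, sz);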
Cc: <stable@vger.kernel.org>
Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/hugetlb.h | 4 ++--
arch/arm64/mm/hugetlbpage.c | 8 +++++---
arch/loongarch/include/asm/hugetlb.h | 6 ++++--
arch/mips/include/asm/hugetlb.h | 6 ++++--
arch/parisc/include/asm/hugetlb.h | 2 +-
arch/parisc/mm/hugetlbpage.c | 2 +-
arch/powerpc/include/asm/hugetlb.h | 6 ++++--
arch/riscv/include/asm/hugetlb.h | 3 ++-
arch/riscv/mm/hugetlbpage.c | 2 +-
arch/s390/include/asm/hugetlb.h | 12 ++++++++----
arch/s390/mm/hugetlbpage.c | 10 ++++++++--
arch/sparc/include/asm/hugetlb.h | 2 +-
arch/sparc/mm/hugetlbpage.c | 2 +-
include/asm-generic/hugetlb.h | 2 +-
include/linux/hugetlb.h | 4 +++-
mm/hugetlb.c | 4 ++--
16 files changed, 48 insertions(+), 27 deletions(-)
diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index c6dff3e69539..03db9cb21ace 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -42,8 +42,8 @@ extern int huge_ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t pte, int dirty);
#define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
-extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
- unsigned long addr, pte_t *ptep);
+extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned long sz);
#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
extern void huge_ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep);
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 98a2a0e64e25..06db4649af91 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -396,8 +396,8 @@ void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
__pte_clear(mm, addr, ptep);
}
-pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
- unsigned long addr, pte_t *ptep)
+pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned long sz)
{
int ncontig;
size_t pgsize;
@@ -549,6 +549,8 @@ bool __init arch_hugetlb_valid_size(unsigned long size)
pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
{
+ unsigned long psize = huge_page_size(hstate_vma(vma));
+
if (alternative_has_cap_unlikely(ARM64_WORKAROUND_2645198)) {
/*
* Break-before-make (BBM) is required for all user space mappings
@@ -558,7 +560,7 @@ pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr
if (pte_user_exec(__ptep_get(ptep)))
return huge_ptep_clear_flush(vma, addr, ptep);
}
- return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
+ return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep, psize);
}
void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
diff --git a/arch/loongarch/include/asm/hugetlb.h b/arch/loongarch/include/asm/hugetlb.h
index c8e4057734d0..4dc4b3e04225 100644
--- a/arch/loongarch/include/asm/hugetlb.h
+++ b/arch/loongarch/include/asm/hugetlb.h
@@ -36,7 +36,8 @@ static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
#define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
- unsigned long addr, pte_t *ptep)
+ unsigned long addr, pte_t *ptep,
+ unsigned long sz)
{
pte_t clear;
pte_t pte = ptep_get(ptep);
@@ -51,8 +52,9 @@ static inline pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
pte_t pte;
+ unsigned long sz = huge_page_size(hstate_vma(vma));
- pte = huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
+ pte = huge_ptep_get_and_clear(vma->vm_mm, addr, ptep, sz);
flush_tlb_page(vma, addr);
return pte;
}
diff --git a/arch/mips/include/asm/hugetlb.h b/arch/mips/include/asm/hugetlb.h
index d0a86ce83de9..fbc71ddcf0f6 100644
--- a/arch/mips/include/asm/hugetlb.h
+++ b/arch/mips/include/asm/hugetlb.h
@@ -27,7 +27,8 @@ static inline int prepare_hugepage_range(struct file *file,
#define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
- unsigned long addr, pte_t *ptep)
+ unsigned long addr, pte_t *ptep,
+ unsigned long sz)
{
pte_t clear;
pte_t pte = *ptep;
@@ -42,13 +43,14 @@ static inline pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
pte_t pte;
+ unsigned long sz = huge_page_size(hstate_vma(vma));
/*
* clear the huge pte entry firstly, so that the other smp threads will
* not get old pte entry after finishing flush_tlb_page and before
* setting new huge pte entry
*/
- pte = huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
+ pte = huge_ptep_get_and_clear(vma->vm_mm, addr, ptep, sz);
flush_tlb_page(vma, addr);
return pte;
}
diff --git a/arch/parisc/include/asm/hugetlb.h b/arch/parisc/include/asm/hugetlb.h
index 5b3a5429f71b..21e9ace17739 100644
--- a/arch/parisc/include/asm/hugetlb.h
+++ b/arch/parisc/include/asm/hugetlb.h
@@ -10,7 +10,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
#define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep);
+ pte_t *ptep, unsigned long sz);
#define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
static inline pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index e9d18cf25b79..a94fe546d434 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -126,7 +126,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep)
+ pte_t *ptep, unsigned long sz)
{
pte_t entry;
diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index dad2e7980f24..86326587e58d 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -45,7 +45,8 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
#define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
- unsigned long addr, pte_t *ptep)
+ unsigned long addr, pte_t *ptep,
+ unsigned long sz)
{
return __pte(pte_update(mm, addr, ptep, ~0UL, 0, 1));
}
@@ -55,8 +56,9 @@ static inline pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
pte_t pte;
+ unsigned long sz = huge_page_size(hstate_vma(vma));
- pte = huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
+ pte = huge_ptep_get_and_clear(vma->vm_mm, addr, ptep, sz);
flush_hugetlb_page(vma, addr);
return pte;
}
diff --git a/arch/riscv/include/asm/hugetlb.h b/arch/riscv/include/asm/hugetlb.h
index faf3624d8057..446126497768 100644
--- a/arch/riscv/include/asm/hugetlb.h
+++ b/arch/riscv/include/asm/hugetlb.h
@@ -28,7 +28,8 @@ void set_huge_pte_at(struct mm_struct *mm,
#define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
- unsigned long addr, pte_t *ptep);
+ unsigned long addr, pte_t *ptep,
+ unsigned long sz);
#define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index 42314f093922..b4a78a4b35cf 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -293,7 +293,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
unsigned long addr,
- pte_t *ptep)
+ pte_t *ptep, unsigned long sz)
{
pte_t orig_pte = ptep_get(ptep);
int pte_num;
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index 7c52acaf9f82..420c74306779 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -26,7 +26,11 @@ void __set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
#define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
-pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
+pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep,
+ unsigned long sz);
+pte_t __huge_ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep);
static inline void arch_clear_hugetlb_flags(struct folio *folio)
{
@@ -48,7 +52,7 @@ static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
static inline pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
- return huge_ptep_get_and_clear(vma->vm_mm, address, ptep);
+ return __huge_ptep_get_and_clear(vma->vm_mm, address, ptep);
}
#define __HAVE_ARCH_HUGE_PTEP_SET_ACCESS_FLAGS
@@ -59,7 +63,7 @@ static inline int huge_ptep_set_access_flags(struct vm_area_struct *vma,
int changed = !pte_same(huge_ptep_get(vma->vm_mm, addr, ptep), pte);
if (changed) {
- huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
+ __huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
__set_huge_pte_at(vma->vm_mm, addr, ptep, pte);
}
return changed;
@@ -69,7 +73,7 @@ static inline int huge_ptep_set_access_flags(struct vm_area_struct *vma,
static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
- pte_t pte = huge_ptep_get_and_clear(mm, addr, ptep);
+ pte_t pte = __huge_ptep_get_and_clear(mm, addr, ptep);
__set_huge_pte_at(mm, addr, ptep, pte_wrprotect(pte));
}
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index d9ce199953de..52ee8e854195 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -188,8 +188,8 @@ pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
return __rste_to_pte(pte_val(*ptep));
}
-pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
- unsigned long addr, pte_t *ptep)
+pte_t __huge_ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
{
pte_t pte = huge_ptep_get(mm, addr, ptep);
pmd_t *pmdp = (pmd_t *) ptep;
@@ -202,6 +202,12 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
return pte;
}
+pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep, unsigned long sz)
+{
+ return __huge_ptep_get_and_clear(mm, addr, ptep);
+}
+
pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz)
{
diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h
index c714ca6a05aa..e7a9cdd498dc 100644
--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -20,7 +20,7 @@ void __set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
#define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep);
+ pte_t *ptep, unsigned long sz);
#define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
static inline pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index eee601a0d2cf..80504148d8a5 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -260,7 +260,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
}
pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep)
+ pte_t *ptep, unsigned long sz)
{
unsigned int i, nptes, orig_shift, shift;
unsigned long size;
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index f42133dae68e..2afc95bf1655 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -90,7 +90,7 @@ static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
#ifndef __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
- unsigned long addr, pte_t *ptep)
+ unsigned long addr, pte_t *ptep, unsigned long sz)
{
return ptep_get_and_clear(mm, addr, ptep);
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ec8c0ccc8f95..bf5f7256bd28 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1004,7 +1004,9 @@ static inline void hugetlb_count_sub(long l, struct mm_struct *mm)
static inline pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
- return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
+ unsigned long psize = huge_page_size(hstate_vma(vma));
+
+ return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep, psize);
}
#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 65068671e460..de9d49e521c1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5447,7 +5447,7 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
if (src_ptl != dst_ptl)
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
- pte = huge_ptep_get_and_clear(mm, old_addr, src_pte);
+ pte = huge_ptep_get_and_clear(mm, old_addr, src_pte, sz);
if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
huge_pte_clear(mm, new_addr, dst_pte, sz);
@@ -5622,7 +5622,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED);
}
- pte = huge_ptep_get_and_clear(mm, address, ptep);
+ pte = huge_ptep_get_and_clear(mm, address, ptep, sz);
tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
if (huge_pte_dirty(pte))
set_page_dirty(page);
--
2.43.0
* [PATCH v1 02/16] arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 01/16] mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear() Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-06 6:15 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 03/16] arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level Ryan Roberts
` (14 subsequent siblings)
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel, stable
arm64 supports multiple huge_pte sizes. Some of the sizes are covered by
a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some
are covered by multiple ptes at a particular level (CONT_PTE_SIZE,
CONT_PMD_SIZE). So the function has to figure out the size from the
huge_pte pointer. This was previously done by walking the pgtable to
determine the level, then using the PTE_CONT bit to determine the number
of ptes.
But the PTE_CONT bit is only valid when the pte is present. For
non-present pte values (e.g. markers, migration entries), the previous
implementation was therefore erroneously determining the size. There is
at least one known caller in core-mm, move_huge_pte(), which may call
huge_ptep_get_and_clear() for a non-present pte. So we must be robust to
this case. Additionally, the "regular" ptep_get_and_clear() is robust to
being called for non-present ptes, so it makes sense to follow that
behaviour.
Fix this by using the new sz parameter which is now provided to the
function. Additionally when clearing each pte in a contig range, don't
gather the access and dirty bits if the pte is not present.
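(For reference, and not part of this patch: the fixed path derives the
geometry purely from sz via arm64's existing num_contig_ptes() helper,
rather than reading the possibly non-present pte. A simplified sketch of
that helper's behaviour:)

	static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
	{
		int contig_ptes = 0;

		*pgsize = size;

		switch (size) {
		case PUD_SIZE:
		case PMD_SIZE:
			/* single block entry at pud/pmd level */
			contig_ptes = 1;
			break;
		case CONT_PMD_SIZE:
			*pgsize = PMD_SIZE;
			contig_ptes = CONT_PMDS;
			break;
		case CONT_PTE_SIZE:
			*pgsize = PAGE_SIZE;
			contig_ptes = CONT_PTES;
			break;
		}

		return contig_ptes;
	}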
An alternative approach that would not require API changes would be to
store the PTE_CONT bit in a spare bit in the swap entry pte. But it felt
cleaner to follow other APIs' lead and just pass in the size.
While we are at it, add some debug warnings in functions that require
the pte to be present.
As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap
entry offset field (the layout of a non-present pte). Since hugetlb is
never swapped to disk, this field will only be populated for markers,
which always set this bit to 0, and hwpoison swap entries, which set the
offset field to a PFN; so it would only ever be 1 on a 52-bit PA system
where memory in that high half was poisoned (I think!). So in practice,
this bit would almost always be zero for non-present ptes and we would
only clear the first entry if it was actually a contiguous block. That's
probably a less severe symptom than if it was always interpreted as 1
and cleared out potentially-present neighboring PTEs.
Cc: <stable@vger.kernel.org>
Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/mm/hugetlbpage.c | 54 ++++++++++++++++++++-----------------
1 file changed, 29 insertions(+), 25 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 06db4649af91..328eec4bfe55 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
unsigned long pgsize,
unsigned long ncontig)
{
- pte_t orig_pte = __ptep_get(ptep);
- unsigned long i;
-
- for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
- pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
-
- /*
- * If HW_AFDBM is enabled, then the HW could turn on
- * the dirty or accessed bit for any page in the set,
- * so check them all.
- */
- if (pte_dirty(pte))
- orig_pte = pte_mkdirty(orig_pte);
-
- if (pte_young(pte))
- orig_pte = pte_mkyoung(orig_pte);
+ pte_t pte, tmp_pte;
+ bool present;
+
+ pte = __ptep_get_and_clear(mm, addr, ptep);
+ present = pte_present(pte);
+ while (--ncontig) {
+ ptep++;
+ addr += pgsize;
+ tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+ if (present) {
+ if (pte_dirty(tmp_pte))
+ pte = pte_mkdirty(pte);
+ if (pte_young(tmp_pte))
+ pte = pte_mkyoung(pte);
+ }
}
- return orig_pte;
+ return pte;
}
static pte_t get_clear_contig_flush(struct mm_struct *mm,
@@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
{
int ncontig;
size_t pgsize;
- pte_t orig_pte = __ptep_get(ptep);
-
- if (!pte_cont(orig_pte))
- return __ptep_get_and_clear(mm, addr, ptep);
-
- ncontig = find_num_contig(mm, addr, ptep, &pgsize);
+ ncontig = num_contig_ptes(sz, &pgsize);
return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
}
@@ -451,6 +445,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
pgprot_t hugeprot;
pte_t orig_pte;
+ VM_WARN_ON(!pte_present(pte));
+
if (!pte_cont(pte))
return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
@@ -461,6 +457,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
return 0;
orig_pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
+ VM_WARN_ON(!pte_present(orig_pte));
/* Make sure we don't lose the dirty or young state */
if (pte_dirty(orig_pte))
@@ -485,7 +482,10 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
size_t pgsize;
pte_t pte;
- if (!pte_cont(__ptep_get(ptep))) {
+ pte = __ptep_get(ptep);
+ VM_WARN_ON(!pte_present(pte));
+
+ if (!pte_cont(pte)) {
__ptep_set_wrprotect(mm, addr, ptep);
return;
}
@@ -509,8 +509,12 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
size_t pgsize;
int ncontig;
+ pte_t pte;
- if (!pte_cont(__ptep_get(ptep)))
+ pte = __ptep_get(ptep);
+ VM_WARN_ON(!pte_present(pte));
+
+ if (!pte_cont(pte))
return ptep_clear_flush(vma, addr, ptep);
ncontig = find_num_contig(mm, addr, ptep, &pgsize);
--
2.43.0
* [PATCH v1 03/16] arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 01/16] mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear() Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 02/16] arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-06 6:46 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 04/16] arm64: hugetlb: Refine tlb maintenance scope Ryan Roberts
` (13 subsequent siblings)
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel, stable
commit c910f2b65518 ("arm64/mm: Update tlb invalidation routines for
FEAT_LPA2") changed the "invalidation level unknown" hint from 0 to
TLBI_TTL_UNKNOWN (INT_MAX). But the fallback "unknown level" path in
flush_hugetlb_tlb_range() was not updated. So as it stands, when trying
to invalidate CONT_PMD_SIZE or CONT_PTE_SIZE hugetlb mappings, we will
spuriously try to invalidate at level 0 on LPA2-enabled systems.
Fix this so that the fallback passes TLBI_TTL_UNKNOWN, and while we are
at it, explicitly use the correct stride and level for CONT_PMD_SIZE and
CONT_PTE_SIZE, which should provide a minor optimization.
Cc: <stable@vger.kernel.org>
Fixes: c910f2b65518 ("arm64/mm: Update tlb invalidation routines for FEAT_LPA2")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/hugetlb.h | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index 03db9cb21ace..8ab9542d2d22 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -76,12 +76,20 @@ static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
{
unsigned long stride = huge_page_size(hstate_vma(vma));
- if (stride == PMD_SIZE)
- __flush_tlb_range(vma, start, end, stride, false, 2);
- else if (stride == PUD_SIZE)
- __flush_tlb_range(vma, start, end, stride, false, 1);
- else
- __flush_tlb_range(vma, start, end, PAGE_SIZE, false, 0);
+ switch (stride) {
+ case PUD_SIZE:
+ __flush_tlb_range(vma, start, end, PUD_SIZE, false, 1);
+ break;
+ case CONT_PMD_SIZE:
+ case PMD_SIZE:
+ __flush_tlb_range(vma, start, end, PMD_SIZE, false, 2);
+ break;
+ case CONT_PTE_SIZE:
+ __flush_tlb_range(vma, start, end, PAGE_SIZE, false, 3);
+ break;
+ default:
+ __flush_tlb_range(vma, start, end, PAGE_SIZE, false, TLBI_TTL_UNKNOWN);
+ }
}
#endif /* __ASM_HUGETLB_H */
--
2.43.0
* [PATCH v1 04/16] arm64: hugetlb: Refine tlb maintenance scope
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (2 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 03/16] arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 05/16] mm/page_table_check: Batch-check pmds/puds just like ptes Ryan Roberts
` (12 subsequent siblings)
16 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
When operating on contiguous blocks of ptes (or pmds) for some hugetlb
sizes, we must honour break-before-make requirements and clear down the
block to invalid state in the pgtable then invalidate the relevant tlb
entries before making the pgtable entries valid again.
However, the tlb maintenance is currently always done assuming the worst
case stride (PAGE_SIZE), last_level (false) and tlb_level
(TLBI_TTL_UNKNOWN). We can do much better with the hinting; in reality,
we know the stride from the huge_pte pgsize, we are always operating
only on the last level, and we always know the tlb_level, again based on
pgsize. So let's start providing these hints.
Additionally, avoid tlb maintenance in set_huge_pte_at().
Break-before-make is only required if we are transitioning the
contiguous pte block from valid -> valid. So let's elide the
clear-and-flush ("break") if the pte range was previously invalid.
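To illustrate the effect for a CONT_PMD-sized block (a sketch only; the real
change is in the diff below), the maintenance moves from worst-case hints to
exact ones:

	/* Effectively what was issued before: worst-case hints */
	__flush_tlb_range(vma, start, end, PAGE_SIZE, false, TLBI_TTL_UNKNOWN);

	/* Now: exact stride, last-level only, known level */
	__flush_tlb_range(vma, start, end, PMD_SIZE, true, 2);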
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/hugetlb.h | 29 +++++++++++++++++++----------
arch/arm64/mm/hugetlbpage.c | 9 ++++++---
2 files changed, 25 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index 8ab9542d2d22..c38f2944c20d 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -69,27 +69,36 @@ extern void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
#include <asm-generic/hugetlb.h>
-#define __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
-static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
- unsigned long start,
- unsigned long end)
+static inline void __flush_hugetlb_tlb_range(struct vm_area_struct *vma,
+ unsigned long start,
+ unsigned long end,
+ unsigned long stride,
+ bool last_level)
{
- unsigned long stride = huge_page_size(hstate_vma(vma));
-
switch (stride) {
case PUD_SIZE:
- __flush_tlb_range(vma, start, end, PUD_SIZE, false, 1);
+ __flush_tlb_range(vma, start, end, PUD_SIZE, last_level, 1);
break;
case CONT_PMD_SIZE:
case PMD_SIZE:
- __flush_tlb_range(vma, start, end, PMD_SIZE, false, 2);
+ __flush_tlb_range(vma, start, end, PMD_SIZE, last_level, 2);
break;
case CONT_PTE_SIZE:
- __flush_tlb_range(vma, start, end, PAGE_SIZE, false, 3);
+ __flush_tlb_range(vma, start, end, PAGE_SIZE, last_level, 3);
break;
default:
- __flush_tlb_range(vma, start, end, PAGE_SIZE, false, TLBI_TTL_UNKNOWN);
+ __flush_tlb_range(vma, start, end, PAGE_SIZE, last_level, TLBI_TTL_UNKNOWN);
}
}
+#define __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
+static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
+ unsigned long start,
+ unsigned long end)
+{
+ unsigned long stride = huge_page_size(hstate_vma(vma));
+
+ __flush_hugetlb_tlb_range(vma, start, end, stride, false);
+}
+
#endif /* __ASM_HUGETLB_H */
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 328eec4bfe55..e870d01d12ea 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -190,8 +190,9 @@ static pte_t get_clear_contig_flush(struct mm_struct *mm,
{
pte_t orig_pte = get_clear_contig(mm, addr, ptep, pgsize, ncontig);
struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+ unsigned long end = addr + (pgsize * ncontig);
- flush_tlb_range(&vma, addr, addr + (pgsize * ncontig));
+ __flush_hugetlb_tlb_range(&vma, addr, end, pgsize, true);
return orig_pte;
}
@@ -216,7 +217,7 @@ static void clear_flush(struct mm_struct *mm,
for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
__ptep_get_and_clear(mm, addr, ptep);
- flush_tlb_range(&vma, saddr, addr);
+ __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
}
void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
@@ -245,7 +246,9 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
dpfn = pgsize >> PAGE_SHIFT;
hugeprot = pte_pgprot(pte);
- clear_flush(mm, addr, ptep, pgsize, ncontig);
+ /* Only need to "break" if transitioning valid -> valid. */
+ if (pte_valid(__ptep_get(ptep)))
+ clear_flush(mm, addr, ptep, pgsize, ncontig);
for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
__set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
--
2.43.0
* [PATCH v1 05/16] mm/page_table_check: Batch-check pmds/puds just like ptes
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (3 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 04/16] arm64: hugetlb: Refine tlb maintenance scope Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-06 10:55 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 06/16] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear() Ryan Roberts
` (11 subsequent siblings)
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Convert page_table_check_p[mu]d_set(...) to
page_table_check_p[mu]ds_set(..., nr) to allow checking a contiguous set
of pmds/puds in a single batch. We retain page_table_check_p[mu]d_set(...)
as macros that call the new batch functions with nr=1 for compatibility.
arm64 is about to reorganise its pte/pmd/pud helpers to reuse more code
and to allow the implementation for huge_pte to more efficiently set
ptes/pmds/puds in batches. We need these batch-helpers to make the
refactoring possible.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/page_table_check.h | 30 +++++++++++++++++-----------
mm/page_table_check.c | 34 +++++++++++++++++++-------------
2 files changed, 38 insertions(+), 26 deletions(-)
diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h
index 6722941c7cb8..289620d4aad3 100644
--- a/include/linux/page_table_check.h
+++ b/include/linux/page_table_check.h
@@ -19,8 +19,10 @@ void __page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd);
void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud);
void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
unsigned int nr);
-void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd);
-void __page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, pud_t pud);
+void __page_table_check_pmds_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd,
+ unsigned int nr);
+void __page_table_check_puds_set(struct mm_struct *mm, pud_t *pudp, pud_t pud,
+ unsigned int nr);
void __page_table_check_pte_clear_range(struct mm_struct *mm,
unsigned long addr,
pmd_t pmd);
@@ -74,22 +76,22 @@ static inline void page_table_check_ptes_set(struct mm_struct *mm,
__page_table_check_ptes_set(mm, ptep, pte, nr);
}
-static inline void page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp,
- pmd_t pmd)
+static inline void page_table_check_pmds_set(struct mm_struct *mm,
+ pmd_t *pmdp, pmd_t pmd, unsigned int nr)
{
if (static_branch_likely(&page_table_check_disabled))
return;
- __page_table_check_pmd_set(mm, pmdp, pmd);
+ __page_table_check_pmds_set(mm, pmdp, pmd, nr);
}
-static inline void page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp,
- pud_t pud)
+static inline void page_table_check_puds_set(struct mm_struct *mm,
+ pud_t *pudp, pud_t pud, unsigned int nr)
{
if (static_branch_likely(&page_table_check_disabled))
return;
- __page_table_check_pud_set(mm, pudp, pud);
+ __page_table_check_puds_set(mm, pudp, pud, nr);
}
static inline void page_table_check_pte_clear_range(struct mm_struct *mm,
@@ -129,13 +131,13 @@ static inline void page_table_check_ptes_set(struct mm_struct *mm,
{
}
-static inline void page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp,
- pmd_t pmd)
+static inline void page_table_check_pmds_set(struct mm_struct *mm,
+ pmd_t *pmdp, pmd_t pmd, unsigned int nr)
{
}
-static inline void page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp,
- pud_t pud)
+static inline void page_table_check_puds_set(struct mm_struct *mm,
+ pud_t *pudp, pud_t pud, unsigned int nr)
{
}
@@ -146,4 +148,8 @@ static inline void page_table_check_pte_clear_range(struct mm_struct *mm,
}
#endif /* CONFIG_PAGE_TABLE_CHECK */
+
+#define page_table_check_pmd_set(mm, pmdp, pmd) page_table_check_pmds_set(mm, pmdp, pmd, 1)
+#define page_table_check_pud_set(mm, pudp, pud) page_table_check_puds_set(mm, pudp, pud, 1)
+
#endif /* __LINUX_PAGE_TABLE_CHECK_H */
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 509c6ef8de40..dae4a7d776b3 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -234,33 +234,39 @@ static inline void page_table_check_pmd_flags(pmd_t pmd)
WARN_ON_ONCE(swap_cached_writable(pmd_to_swp_entry(pmd)));
}
-void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd)
+void __page_table_check_pmds_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd,
+ unsigned int nr)
{
+ unsigned int i;
+ unsigned long stride = PMD_SIZE >> PAGE_SHIFT;
+
if (&init_mm == mm)
return;
page_table_check_pmd_flags(pmd);
- __page_table_check_pmd_clear(mm, *pmdp);
- if (pmd_user_accessible_page(pmd)) {
- page_table_check_set(pmd_pfn(pmd), PMD_SIZE >> PAGE_SHIFT,
- pmd_write(pmd));
- }
+ for (i = 0; i < nr; i++)
+ __page_table_check_pmd_clear(mm, *(pmdp + i));
+ if (pmd_user_accessible_page(pmd))
+ page_table_check_set(pmd_pfn(pmd), stride * nr, pmd_write(pmd));
}
-EXPORT_SYMBOL(__page_table_check_pmd_set);
+EXPORT_SYMBOL(__page_table_check_pmds_set);
-void __page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, pud_t pud)
+void __page_table_check_puds_set(struct mm_struct *mm, pud_t *pudp, pud_t pud,
+ unsigned int nr)
{
+ unsigned int i;
+ unsigned long stride = PUD_SIZE >> PAGE_SHIFT;
+
if (&init_mm == mm)
return;
- __page_table_check_pud_clear(mm, *pudp);
- if (pud_user_accessible_page(pud)) {
- page_table_check_set(pud_pfn(pud), PUD_SIZE >> PAGE_SHIFT,
- pud_write(pud));
- }
+ for (i = 0; i < nr; i++)
+ __page_table_check_pud_clear(mm, *(pudp + i));
+ if (pud_user_accessible_page(pud))
+ page_table_check_set(pud_pfn(pud), stride * nr, pud_write(pud));
}
-EXPORT_SYMBOL(__page_table_check_pud_set);
+EXPORT_SYMBOL(__page_table_check_puds_set);
void __page_table_check_pte_clear_range(struct mm_struct *mm,
unsigned long addr,
--
2.43.0
* [PATCH v1 06/16] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (4 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 05/16] mm/page_table_check: Batch-check pmds/puds just like ptes Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-06 11:48 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 07/16] arm64: hugetlb: Use ___set_ptes() and ___ptep_get_and_clear() Ryan Roberts
` (10 subsequent siblings)
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Refactor __set_ptes(), set_pmd_at() and set_pud_at() so that they are
all thin wrappers around a generic ___set_ptes(), which takes a pgsize
parameter. This cleans up the code to remove the confusing
__set_pte_at() (which was only ever used for pmd/pud) and will allow us
to perform future barrier optimizations in a single place. Additionally,
it will permit the huge_pte API to efficiently batch-set pgtable entries
and take advantage of the future barrier optimizations.
___set_ptes() calls the correct page_table_check_*_set() function based
on the pgsize. This means that huge_ptes will be able to get proper coverage
regardless of their size, once it's plumbed into huge_pte. Currently the
huge_pte API always uses the pte API which assumes an entry only covers
a single page.
While we are at it, refactor __ptep_get_and_clear() and
pmdp_huge_get_and_clear() to use a common ___ptep_get_and_clear() which
also takes a pgsize parameter. This will provide the huge_pte API the
means to clear ptes corresponding with the way they were set.
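For illustration (not part of the diff below), callers end up selecting the
granule roughly as follows; CONT_PTES and pmd_pte() are the existing arm64
definitions:

	___set_ptes(mm, ptep, pte, 1, PAGE_SIZE);		/* single pte */
	___set_ptes(mm, ptep, pte, CONT_PTES, PAGE_SIZE);	/* contpte block (hugetlb, later patch) */
	___set_ptes(mm, (pte_t *)pmdp, pmd_pte(pmd), 1, PMD_SIZE); /* pmd block mapping */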
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/pgtable.h | 108 +++++++++++++++++++------------
1 file changed, 67 insertions(+), 41 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b2a2ad1b9e8..3b55d9a15f05 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -420,23 +420,6 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
}
-static inline void __set_ptes(struct mm_struct *mm,
- unsigned long __always_unused addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
-{
- page_table_check_ptes_set(mm, ptep, pte, nr);
- __sync_cache_and_tags(pte, nr);
-
- for (;;) {
- __check_safe_pte_update(mm, ptep, pte);
- __set_pte(ptep, pte);
- if (--nr == 0)
- break;
- ptep++;
- pte = pte_advance_pfn(pte, 1);
- }
-}
-
/*
* Hugetlb definitions.
*/
@@ -641,30 +624,59 @@ static inline pgprot_t pud_pgprot(pud_t pud)
return __pgprot(pud_val(pfn_pud(pfn, __pgprot(0))) ^ pud_val(pud));
}
-static inline void __set_pte_at(struct mm_struct *mm,
- unsigned long __always_unused addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
+static inline void ___set_ptes(struct mm_struct *mm, pte_t *ptep, pte_t pte,
+ unsigned int nr, unsigned long pgsize)
{
- __sync_cache_and_tags(pte, nr);
- __check_safe_pte_update(mm, ptep, pte);
- __set_pte(ptep, pte);
+ unsigned long stride = pgsize >> PAGE_SHIFT;
+
+ switch (pgsize) {
+ case PAGE_SIZE:
+ page_table_check_ptes_set(mm, ptep, pte, nr);
+ break;
+ case PMD_SIZE:
+ page_table_check_pmds_set(mm, (pmd_t *)ptep, pte_pmd(pte), nr);
+ break;
+ case PUD_SIZE:
+ page_table_check_puds_set(mm, (pud_t *)ptep, pte_pud(pte), nr);
+ break;
+ default:
+ VM_WARN_ON(1);
+ }
+
+ __sync_cache_and_tags(pte, nr * stride);
+
+ for (;;) {
+ __check_safe_pte_update(mm, ptep, pte);
+ __set_pte(ptep, pte);
+ if (--nr == 0)
+ break;
+ ptep++;
+ pte = pte_advance_pfn(pte, stride);
+ }
}
-static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
- pmd_t *pmdp, pmd_t pmd)
+static inline void __set_ptes(struct mm_struct *mm,
+ unsigned long __always_unused addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
{
- page_table_check_pmd_set(mm, pmdp, pmd);
- return __set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd),
- PMD_SIZE >> PAGE_SHIFT);
+ ___set_ptes(mm, ptep, pte, nr, PAGE_SIZE);
}
-static inline void set_pud_at(struct mm_struct *mm, unsigned long addr,
- pud_t *pudp, pud_t pud)
+static inline void __set_pmds(struct mm_struct *mm,
+ unsigned long __always_unused addr,
+ pmd_t *pmdp, pmd_t pmd, unsigned int nr)
+{
+ ___set_ptes(mm, (pte_t *)pmdp, pmd_pte(pmd), nr, PMD_SIZE);
+}
+#define set_pmd_at(mm, addr, pmdp, pmd) __set_pmds(mm, addr, pmdp, pmd, 1)
+
+static inline void __set_puds(struct mm_struct *mm,
+ unsigned long __always_unused addr,
+ pud_t *pudp, pud_t pud, unsigned int nr)
{
- page_table_check_pud_set(mm, pudp, pud);
- return __set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud),
- PUD_SIZE >> PAGE_SHIFT);
+ ___set_ptes(mm, (pte_t *)pudp, pud_pte(pud), nr, PUD_SIZE);
}
+#define set_pud_at(mm, addr, pudp, pud) __set_puds(mm, addr, pudp, pud, 1)
#define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
#define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)
@@ -1276,16 +1288,34 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
-static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
- unsigned long address, pte_t *ptep)
+static inline pte_t ___ptep_get_and_clear(struct mm_struct *mm, pte_t *ptep,
+ unsigned long pgsize)
{
pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
- page_table_check_pte_clear(mm, pte);
+ switch (pgsize) {
+ case PAGE_SIZE:
+ page_table_check_pte_clear(mm, pte);
+ break;
+ case PMD_SIZE:
+ page_table_check_pmd_clear(mm, pte_pmd(pte));
+ break;
+ case PUD_SIZE:
+ page_table_check_pud_clear(mm, pte_pud(pte));
+ break;
+ default:
+ VM_WARN_ON(1);
+ }
return pte;
}
+static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep)
+{
+ return ___ptep_get_and_clear(mm, ptep, PAGE_SIZE);
+}
+
static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned int nr, int full)
{
@@ -1322,11 +1352,7 @@ static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
unsigned long address, pmd_t *pmdp)
{
- pmd_t pmd = __pmd(xchg_relaxed(&pmd_val(*pmdp), 0));
-
- page_table_check_pmd_clear(mm, pmd);
-
- return pmd;
+ return pte_pmd(___ptep_get_and_clear(mm, (pte_t *)pmdp, PMD_SIZE));
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--
2.43.0
* [PATCH v1 07/16] arm64: hugetlb: Use ___set_ptes() and ___ptep_get_and_clear()
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (5 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 06/16] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear() Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-07 4:09 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 08/16] arm64/mm: Hoist barriers out of ___set_ptes() loop Ryan Roberts
` (9 subsequent siblings)
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Refactor the huge_pte helpers to use the new generic ___set_ptes() and
___ptep_get_and_clear() APIs.
This provides 2 benefits: first, when page_table_check=on, hugetlb is
now properly/fully checked. Previously only the first page of a hugetlb
folio was checked. Second, instead of having to call __set_ptes(nr=1)
for each pte in a loop, the whole contiguous batch can now be set in one
go, which enables some efficiencies and cleans up the code.
One detail to note is that huge_ptep_clear_flush() was previously
calling ptep_clear_flush() for a non-contiguous pte (i.e. a pud or pmd
block mapping). This has a couple of disadvantages; first,
ptep_clear_flush() calls ptep_get_and_clear(), which transparently
handles contpte. Given we only call it for non-contiguous ptes, it would
be safe, but a waste of effort. It's preferable to go straight to the
layer below. However, more problematic is that ptep_get_and_clear() is for
PAGE_SIZE entries so it calls page_table_check_pte_clear() and would not
clear the whole hugetlb folio. So let's stop special-casing the non-cont
case and just rely on get_clear_contig_flush() to do the right thing for
non-cont entries.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/mm/hugetlbpage.c | 50 ++++++++-----------------------------
1 file changed, 11 insertions(+), 39 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index e870d01d12ea..02afee31444e 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -166,12 +166,12 @@ static pte_t get_clear_contig(struct mm_struct *mm,
pte_t pte, tmp_pte;
bool present;
- pte = __ptep_get_and_clear(mm, addr, ptep);
+ pte = ___ptep_get_and_clear(mm, ptep, pgsize);
present = pte_present(pte);
while (--ncontig) {
ptep++;
addr += pgsize;
- tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+ tmp_pte = ___ptep_get_and_clear(mm, ptep, pgsize);
if (present) {
if (pte_dirty(tmp_pte))
pte = pte_mkdirty(pte);
@@ -215,7 +215,7 @@ static void clear_flush(struct mm_struct *mm,
unsigned long i, saddr = addr;
for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
- __ptep_get_and_clear(mm, addr, ptep);
+ ___ptep_get_and_clear(mm, ptep, pgsize);
__flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
}
@@ -226,32 +226,20 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
size_t pgsize;
int i;
int ncontig;
- unsigned long pfn, dpfn;
- pgprot_t hugeprot;
ncontig = num_contig_ptes(sz, &pgsize);
if (!pte_present(pte)) {
for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
- __set_ptes(mm, addr, ptep, pte, 1);
+ ___set_ptes(mm, ptep, pte, 1, pgsize);
return;
}
- if (!pte_cont(pte)) {
- __set_ptes(mm, addr, ptep, pte, 1);
- return;
- }
-
- pfn = pte_pfn(pte);
- dpfn = pgsize >> PAGE_SHIFT;
- hugeprot = pte_pgprot(pte);
-
/* Only need to "break" if transitioning valid -> valid. */
- if (pte_valid(__ptep_get(ptep)))
+ if (pte_cont(pte) && pte_valid(__ptep_get(ptep)))
clear_flush(mm, addr, ptep, pgsize, ncontig);
- for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
+ ___set_ptes(mm, ptep, pte, ncontig, pgsize);
}
pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -441,11 +429,9 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t pte, int dirty)
{
- int ncontig, i;
+ int ncontig;
size_t pgsize = 0;
- unsigned long pfn = pte_pfn(pte), dpfn;
struct mm_struct *mm = vma->vm_mm;
- pgprot_t hugeprot;
pte_t orig_pte;
VM_WARN_ON(!pte_present(pte));
@@ -454,7 +440,6 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
ncontig = find_num_contig(mm, addr, ptep, &pgsize);
- dpfn = pgsize >> PAGE_SHIFT;
if (!__cont_access_flags_changed(ptep, pte, ncontig))
return 0;
@@ -469,19 +454,14 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
if (pte_young(orig_pte))
pte = pte_mkyoung(pte);
- hugeprot = pte_pgprot(pte);
- for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
-
+ ___set_ptes(mm, ptep, pte, ncontig, pgsize);
return 1;
}
void huge_ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
- unsigned long pfn, dpfn;
- pgprot_t hugeprot;
- int ncontig, i;
+ int ncontig;
size_t pgsize;
pte_t pte;
@@ -494,16 +474,11 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
}
ncontig = find_num_contig(mm, addr, ptep, &pgsize);
- dpfn = pgsize >> PAGE_SHIFT;
pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
pte = pte_wrprotect(pte);
- hugeprot = pte_pgprot(pte);
- pfn = pte_pfn(pte);
-
- for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
+ ___set_ptes(mm, ptep, pte, ncontig, pgsize);
}
pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
@@ -517,10 +492,7 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
pte = __ptep_get(ptep);
VM_WARN_ON(!pte_present(pte));
- if (!pte_cont(pte))
- return ptep_clear_flush(vma, addr, ptep);
-
- ncontig = find_num_contig(mm, addr, ptep, &pgsize);
+ ncontig = num_contig_ptes(page_size(pte_page(pte)), &pgsize);
return get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
}
--
2.43.0
* [PATCH v1 08/16] arm64/mm: Hoist barriers out of ___set_ptes() loop
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (6 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 07/16] arm64: hugetlb: Use ___set_ptes() and ___ptep_get_and_clear() Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-07 5:35 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 09/16] arm64/mm: Avoid barriers for invalid or userspace mappings Ryan Roberts
` (8 subsequent siblings)
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
___set_ptes() previously called __set_pte() for each PTE in the range,
which would conditionally issue a DSB and ISB to make the new PTE value
immediately visible to the table walker, but only if the new PTE was a
valid kernel mapping.
We can do better than this; let's hoist those barriers out of the loop
so that they are only issued once at the end of the loop. This divides
the barrier cost by the number of PTEs in the range.
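For illustration, the resulting write pattern boils down to something
like the sketch below (set_ptes_batched() is a made-up name; the real
change is in the diff below, which keeps the original loop structure):
static inline void set_ptes_batched(pte_t *ptep, pte_t pte, unsigned int nr,
				    unsigned long stride)
{
	unsigned int i;
	/* Write all the entries with no barriers at all ... */
	for (i = 0; i < nr; i++)
		__set_pte_nosync(ptep + i, pte_advance_pfn(pte, i * stride));
	/* ... then emit dsb(ishst) + isb() once, if the pte needs them. */
	__set_pte_complete(pte);
}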
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/pgtable.h | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3b55d9a15f05..1d428e9c0e5a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -317,10 +317,8 @@ static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
WRITE_ONCE(*ptep, pte);
}
-static inline void __set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte_complete(pte_t pte)
{
- __set_pte_nosync(ptep, pte);
-
/*
* Only if the new pte is valid and kernel, otherwise TLB maintenance
* or update_mmu_cache() have the necessary barriers.
@@ -331,6 +329,12 @@ static inline void __set_pte(pte_t *ptep, pte_t pte)
}
}
+static inline void __set_pte(pte_t *ptep, pte_t pte)
+{
+ __set_pte_nosync(ptep, pte);
+ __set_pte_complete(pte);
+}
+
static inline pte_t __ptep_get(pte_t *ptep)
{
return READ_ONCE(*ptep);
@@ -647,12 +651,14 @@ static inline void ___set_ptes(struct mm_struct *mm, pte_t *ptep, pte_t pte,
for (;;) {
__check_safe_pte_update(mm, ptep, pte);
- __set_pte(ptep, pte);
+ __set_pte_nosync(ptep, pte);
if (--nr == 0)
break;
ptep++;
pte = pte_advance_pfn(pte, stride);
}
+
+ __set_pte_complete(pte);
}
static inline void __set_ptes(struct mm_struct *mm,
--
2.43.0
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH v1 09/16] arm64/mm: Avoid barriers for invalid or userspace mappings
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (7 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 08/16] arm64/mm: Hoist barriers out of ___set_ptes() loop Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-07 8:11 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 10/16] mm/vmalloc: Warn on improper use of vunmap_range() Ryan Roberts
` (7 subsequent siblings)
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
__set_pte_complete(), set_pmd(), set_pud(), set_p4d() and set_pgd() are
used to write entries into pgtables. And they issue barriers (currently
dsb and isb) to ensure that the written values are observed by the table
walker prior to any program-order-future memory access to the mapped
location.
Over the years some of these functions have received optimizations: In
particular, commit 7f0b1bf04511 ("arm64: Fix barriers used for page
table modifications") made it so that the barriers were only emitted for
valid-kernel mappings for set_pte() (now __set_pte_complete()). And
commit 0795edaf3f1f ("arm64: pgtable: Implement p[mu]d_valid() and check
in set_p[mu]d()") made it so that set_pmd()/set_pud() only emitted the
barriers for valid mappings. set_p4d()/set_pgd() continue to emit the
barriers unconditionally.
This is all very confusing to the casual observer; surely the rules
should be invariant to the level? Let's change this so that every level
consistently emits the barriers only when setting valid, non-user
entries (both table and leaf).
It seems obvious that if it is ok to elide barriers for all but valid
kernel mappings at pte level, it must also be ok to do this for leaf
entries at other levels: If setting an entry to invalid, a TLB
maintenance operation must surely follow to synchronise the TLB and this
contains the required barriers. If setting a valid user mapping, the
previous mapping must have been invalid and there must have been a TLB
maintenance operation (complete with barriers) to honour
break-before-make. So the worst that can happen is we take an extra
fault (which will imply the DSB + ISB) and conclude that there is
nothing to do. These are the arguments for doing this optimization at
pte level and they also apply to leaf mappings at other levels.
For table entries, the same arguments hold: If unsetting a table entry,
a TLB maintenance operation is required and this will emit the required
barriers. If setting a table entry, the previous value must have been
invalid and the table walker must already be able to observe that.
Additionally the contents of the pgtable being pointed to by the newly
set entry must be visible before the entry is written and this is
enforced via smp_wmb() (dmb) in the pgtable allocation functions and in
__split_huge_pmd_locked(). But this last part could never have been
enforced by the barriers in set_pXd() because they occur after updating
the entry. So ultimately, the worst that can happen by eliding these
barriers for user table entries is an extra fault.
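For reference, the publish order that the table-entry argument relies on
looks roughly like the sketch below (publish_new_pte_table() is a
hypothetical helper; the real smp_wmb() calls live in the pgtable
allocation paths and __split_huge_pmd_locked()):
static void publish_new_pte_table(struct mm_struct *mm, pmd_t *pmdp,
				  pte_t *new_table)
{
	/* 1. Fill the new, not-yet-reachable pte table. */
	/* ... write entries into new_table ... */
	/* 2. Make those stores visible before the table is linked in. */
	smp_wmb();
	/*
	 * 3. Only now write the pmd that points at the new table; the
	 *    dsb/isb in set_pmd() runs after this store, so it never was
	 *    what ordered the table contents against the link.
	 */
	pmd_populate_kernel(mm, pmdp, new_table);
}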
I observe roughly the same number of page faults (107M) with and without
this change when compiling the kernel on Apple M2.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/pgtable.h | 60 ++++++++++++++++++++++++++++----
1 file changed, 54 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1d428e9c0e5a..ff358d983583 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -767,6 +767,19 @@ static inline bool in_swapper_pgdir(void *addr)
((unsigned long)swapper_pg_dir & PAGE_MASK);
}
+static inline bool pmd_valid_not_user(pmd_t pmd)
+{
+ /*
+ * User-space table pmd entries always have (PXN && !UXN). All other
+ * combinations indicate it's a table entry for kernel space.
+ * Valid-not-user leaf entries follow the same rules as
+ * pte_valid_not_user().
+ */
+ if (pmd_table(pmd))
+ return !((pmd_val(pmd) & (PMD_TABLE_PXN | PMD_TABLE_UXN)) == PMD_TABLE_PXN);
+ return pte_valid_not_user(pmd_pte(pmd));
+}
+
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
#ifdef __PAGETABLE_PMD_FOLDED
@@ -778,7 +791,7 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
WRITE_ONCE(*pmdp, pmd);
- if (pmd_valid(pmd)) {
+ if (pmd_valid_not_user(pmd)) {
dsb(ishst);
isb();
}
@@ -836,6 +849,17 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
static inline bool pgtable_l4_enabled(void);
+
+static inline bool pud_valid_not_user(pud_t pud)
+{
+ /*
+ * Follows the same rules as pmd_valid_not_user().
+ */
+ if (pud_table(pud))
+ return !((pud_val(pud) & (PUD_TABLE_PXN | PUD_TABLE_UXN)) == PUD_TABLE_PXN);
+ return pte_valid_not_user(pud_pte(pud));
+}
+
static inline void set_pud(pud_t *pudp, pud_t pud)
{
if (!pgtable_l4_enabled() && in_swapper_pgdir(pudp)) {
@@ -845,7 +869,7 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
WRITE_ONCE(*pudp, pud);
- if (pud_valid(pud)) {
+ if (pud_valid_not_user(pud)) {
dsb(ishst);
isb();
}
@@ -917,6 +941,16 @@ static inline bool mm_pud_folded(const struct mm_struct *mm)
#define p4d_bad(p4d) (pgtable_l4_enabled() && !(p4d_val(p4d) & P4D_TABLE_BIT))
#define p4d_present(p4d) (!p4d_none(p4d))
+static inline bool p4d_valid_not_user(p4d_t p4d)
+{
+ /*
+ * User-space table p4d entries always have (PXN && !UXN). All other
+ * combinations indicate it's a table entry for kernel space. p4d block
+ * entries are not supported.
+ */
+ return !((p4d_val(p4d) & (P4D_TABLE_PXN | P4D_TABLE_UXN)) == P4D_TABLE_PXN);
+}
+
static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
{
if (in_swapper_pgdir(p4dp)) {
@@ -925,8 +959,11 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
}
WRITE_ONCE(*p4dp, p4d);
- dsb(ishst);
- isb();
+
+ if (p4d_valid_not_user(p4d)) {
+ dsb(ishst);
+ isb();
+ }
}
static inline void p4d_clear(p4d_t *p4dp)
@@ -1044,6 +1081,14 @@ static inline bool mm_p4d_folded(const struct mm_struct *mm)
#define pgd_bad(pgd) (pgtable_l5_enabled() && !(pgd_val(pgd) & PGD_TABLE_BIT))
#define pgd_present(pgd) (!pgd_none(pgd))
+static inline bool pgd_valid_not_user(pgd_t pgd)
+{
+ /*
+ * Follows the same rules as p4d_valid_not_user().
+ */
+ return !((pgd_val(pgd) & (PGD_TABLE_PXN | PGD_TABLE_UXN)) == PGD_TABLE_PXN);
+}
+
static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
{
if (in_swapper_pgdir(pgdp)) {
@@ -1052,8 +1097,11 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
}
WRITE_ONCE(*pgdp, pgd);
- dsb(ishst);
- isb();
+
+ if (pgd_valid_not_user(pgd)) {
+ dsb(ishst);
+ isb();
+ }
}
static inline void pgd_clear(pgd_t *pgdp)
--
2.43.0
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH v1 10/16] mm/vmalloc: Warn on improper use of vunmap_range()
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (8 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 09/16] arm64/mm: Avoid barriers for invalid or userspace mappings Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-07 8:41 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 11/16] mm/vmalloc: Gracefully unmap huge ptes Ryan Roberts
` (6 subsequent siblings)
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
A call to vmalloc_huge() may cause memory blocks to be mapped at pmd or
pud level. But it is possible to subsequently call vunmap_range() on a
sub-range of the mapped memory, which partially overlaps a pmd or pud.
In this case, vmalloc unmaps the entire pmd or pud so that the
non-overlapping portion is also unmapped. Clearly that would have a bad
outcome, but it's not something that any callers do today as far as I
can tell. So I guess it's just expected that callers will not do this.
However, it would be useful to know if this happened in the future;
let's add a warning to cover the eventuality.
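For example, a hypothetical caller that would trigger the new warning
(assuming the allocation ends up pmd block mapped) might look like:
static void example_partial_vunmap(void)
{
	unsigned long addr;
	void *p;
	/* May be mapped with 2M pmd blocks on a suitable arch/config. */
	p = vmalloc_huge(4 * SZ_2M, GFP_KERNEL);
	if (!p)
		return;
	addr = (unsigned long)p;
	/*
	 * Partially overlaps the first block: the whole 2M block gets torn
	 * down and the new WARN_ON(next - addr < PMD_SIZE) fires.
	 */
	vunmap_range(addr + SZ_1M, addr + SZ_2M);
}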
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/vmalloc.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a6e7acebe9ad..fcdf67d5177a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -374,8 +374,10 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
if (cleared || pmd_bad(*pmd))
*mask |= PGTBL_PMD_MODIFIED;
- if (cleared)
+ if (cleared) {
+ WARN_ON(next - addr < PMD_SIZE);
continue;
+ }
if (pmd_none_or_clear_bad(pmd))
continue;
vunmap_pte_range(pmd, addr, next, mask);
@@ -399,8 +401,10 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
if (cleared || pud_bad(*pud))
*mask |= PGTBL_PUD_MODIFIED;
- if (cleared)
+ if (cleared) {
+ WARN_ON(next - addr < PUD_SIZE);
continue;
+ }
if (pud_none_or_clear_bad(pud))
continue;
vunmap_pmd_range(pud, addr, next, mask);
--
2.43.0
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH v1 11/16] mm/vmalloc: Gracefully unmap huge ptes
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (9 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 10/16] mm/vmalloc: Warn on improper use of vunmap_range() Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-07 9:19 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 12/16] arm64/mm: Support huge pte-mapped pages in vmap Ryan Roberts
` (5 subsequent siblings)
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Commit f7ee1f13d606 ("mm/vmalloc: enable mapping of huge pages at pte
level in vmap") added this support by reusing the set_huge_pte_at() API,
which is otherwise only used for user mappings. But when unmapping those
huge ptes, it continued to call ptep_get_and_clear(), which is a
layering violation. To date, the only arch to implement this support is
powerpc and it all happens to work ok for it.
But arm64's implementation of ptep_get_and_clear() cannot be safely
used to clear a previous set_huge_pte_at(). So let's introduce a new
arch opt-in function, arch_vmap_pte_range_unmap_size(), which can
provide the size of a (present) pte. Then we can call
huge_ptep_get_and_clear() to tear it down properly.
Note that if vunmap_range() is called with a range that starts in the
middle of a huge pte-mapped page, we must unmap the entire huge page so
the behaviour is consistent with pmd and pud block mappings. In this
case emit a warning just like we do for pmd/pud mappings.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/vmalloc.h | 8 ++++++++
mm/vmalloc.c | 18 ++++++++++++++++--
2 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 31e9ffd936e3..16dd4cba64f2 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -113,6 +113,14 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr, uns
}
#endif
+#ifndef arch_vmap_pte_range_unmap_size
+static inline unsigned long arch_vmap_pte_range_unmap_size(unsigned long addr,
+ pte_t *ptep)
+{
+ return PAGE_SIZE;
+}
+#endif
+
#ifndef arch_vmap_pte_supported_shift
static inline int arch_vmap_pte_supported_shift(unsigned long size)
{
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index fcdf67d5177a..6111ce900ec4 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -350,12 +350,26 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pgtbl_mod_mask *mask)
{
pte_t *pte;
+ pte_t ptent;
+ unsigned long size = PAGE_SIZE;
pte = pte_offset_kernel(pmd, addr);
do {
- pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte);
+#ifdef CONFIG_HUGETLB_PAGE
+ size = arch_vmap_pte_range_unmap_size(addr, pte);
+ if (size != PAGE_SIZE) {
+ if (WARN_ON(!IS_ALIGNED(addr, size))) {
+ addr = ALIGN_DOWN(addr, size);
+ pte = PTR_ALIGN_DOWN(pte, sizeof(*pte) * (size >> PAGE_SHIFT));
+ }
+ ptent = huge_ptep_get_and_clear(&init_mm, addr, pte, size);
+ if (WARN_ON(end - addr < size))
+ size = end - addr;
+ } else
+#endif
+ ptent = ptep_get_and_clear(&init_mm, addr, pte);
WARN_ON(!pte_none(ptent) && !pte_present(ptent));
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ } while (pte += (size >> PAGE_SHIFT), addr += size, addr != end);
*mask |= PGTBL_PTE_MODIFIED;
}
--
2.43.0
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH v1 12/16] arm64/mm: Support huge pte-mapped pages in vmap
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (10 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 11/16] mm/vmalloc: Gracefully unmap huge ptes Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-07 10:04 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 13/16] mm: Don't skip arch_sync_kernel_mappings() in error paths Ryan Roberts
` (4 subsequent siblings)
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Implement the required arch functions to enable use of contpte in the
vmap when VM_ALLOW_HUGE_VMAP is specified. This speeds up vmap
operations because only one DSB and ISB is issued per contpte block
instead of one per pte. It also reduces TLB pressure, since a single
TLB entry covers the whole contpte block.
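As a rough usage sketch (assuming 4K base pages, where CONT_PTE_SIZE is
64K), an allocation like the one below can now be mapped with 16
contiguous ptes sharing a single TLB entry:
static void *alloc_contpte_candidate(void)
{
	/*
	 * VM_ALLOW_HUGE_VMAP is implied by vmalloc_huge(); a 128K request
	 * gives the new arch_vmap_pte_range_map_size() hook the chance to
	 * map in 64K contpte blocks rather than individual 4K ptes.
	 */
	return vmalloc_huge(SZ_128K, GFP_KERNEL);
}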
Since vmap uses set_huge_pte_at() to set the contpte, that API is now
used for kernel mappings for the first time. In the vmap case we never
expect it to be called to modify a valid mapping, so clear_flush()
should never run; but it is still wise to make it robust for the kernel
case, so amend the TLB flush in clear_flush() to use
flush_tlb_kernel_range() when the mm is init_mm.
Tested with vmalloc performance selftests:
# kself/mm/test_vmalloc.sh \
run_test_mask=1
test_repeat_count=5
nr_pages=256
test_loop_count=100000
use_huge=1
Duration reduced from 1274243 usec to 1083553 usec on Apple M2, a 15%
reduction in time taken.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/vmalloc.h | 40 ++++++++++++++++++++++++++++++++
arch/arm64/mm/hugetlbpage.c | 5 +++-
2 files changed, 44 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 38fafffe699f..fbdeb40f3857 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -23,6 +23,46 @@ static inline bool arch_vmap_pmd_supported(pgprot_t prot)
return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
}
+#define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
+static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
+ unsigned long end, u64 pfn,
+ unsigned int max_page_shift)
+{
+ if (max_page_shift < CONT_PTE_SHIFT)
+ return PAGE_SIZE;
+
+ if (end - addr < CONT_PTE_SIZE)
+ return PAGE_SIZE;
+
+ if (!IS_ALIGNED(addr, CONT_PTE_SIZE))
+ return PAGE_SIZE;
+
+ if (!IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
+ return PAGE_SIZE;
+
+ return CONT_PTE_SIZE;
+}
+
+#define arch_vmap_pte_range_unmap_size arch_vmap_pte_range_unmap_size
+static inline unsigned long arch_vmap_pte_range_unmap_size(unsigned long addr,
+ pte_t *ptep)
+{
+ /*
+ * The caller handles alignment so it's sufficient just to check
+ * PTE_CONT.
+ */
+ return pte_valid_cont(__ptep_get(ptep)) ? CONT_PTE_SIZE : PAGE_SIZE;
+}
+
+#define arch_vmap_pte_supported_shift arch_vmap_pte_supported_shift
+static inline int arch_vmap_pte_supported_shift(unsigned long size)
+{
+ if (size >= CONT_PTE_SIZE)
+ return CONT_PTE_SHIFT;
+
+ return PAGE_SHIFT;
+}
+
#endif
#define arch_vmap_pgprot_tagged arch_vmap_pgprot_tagged
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 02afee31444e..a74e43101dad 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -217,7 +217,10 @@ static void clear_flush(struct mm_struct *mm,
for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
___ptep_get_and_clear(mm, ptep, pgsize);
- __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
+ if (mm == &init_mm)
+ flush_tlb_kernel_range(saddr, addr);
+ else
+ __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
}
void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
--
2.43.0
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH v1 13/16] mm: Don't skip arch_sync_kernel_mappings() in error paths
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (11 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 12/16] arm64/mm: Support huge pte-mapped pages in vmap Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-07 10:21 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 14/16] mm/vmalloc: Batch arch_sync_kernel_mappings() more efficiently Ryan Roberts
` (3 subsequent siblings)
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel, stable
Fix callers that previously skipped calling arch_sync_kernel_mappings()
if an error occurred during a pgtable update. The call is still required
to sync any pgtable updates that may have occurred prior to hitting the
error condition.
These are theoretical bugs discovered during code review.
Cc: <stable@vger.kernel.org>
Fixes: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified")
Fixes: 0c95cba49255 ("mm: apply_to_pte_range warn and fail if a large pte is encountered")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/memory.c | 6 ++++--
mm/vmalloc.c | 4 ++--
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 539c0f7c6d54..a15f7dd500ea 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3040,8 +3040,10 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
next = pgd_addr_end(addr, end);
if (pgd_none(*pgd) && !create)
continue;
- if (WARN_ON_ONCE(pgd_leaf(*pgd)))
- return -EINVAL;
+ if (WARN_ON_ONCE(pgd_leaf(*pgd))) {
+ err = -EINVAL;
+ break;
+ }
if (!pgd_none(*pgd) && WARN_ON_ONCE(pgd_bad(*pgd))) {
if (!create)
continue;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6111ce900ec4..68950b1824d0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -604,13 +604,13 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
mask |= PGTBL_PGD_MODIFIED;
err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
if (err)
- return err;
+ break;
} while (pgd++, addr = next, addr != end);
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);
- return 0;
+ return err;
}
/*
--
2.43.0
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH v1 14/16] mm/vmalloc: Batch arch_sync_kernel_mappings() more efficiently
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (12 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 13/16] mm: Don't skip arch_sync_kernel_mappings() in error paths Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-10 7:11 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 15/16] mm: Generalize arch_sync_kernel_mappings() Ryan Roberts
` (2 subsequent siblings)
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
When page_shift is greater than PAGE_SHIFT, __vmap_pages_range_noflush()
will call vmap_range_noflush() for each individual huge page. But
vmap_range_noflush() previously called arch_sync_kernel_mappings()
directly, so it ended up being called once for every huge page.
We can do better than this; refactor the call into the outer
__vmap_pages_range_noflush() so that it is only called once for the
entire batch operation.
This will benefit performance on arm64, which is about to opt in to
using the hook.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/vmalloc.c | 60 ++++++++++++++++++++++++++--------------------------
1 file changed, 30 insertions(+), 30 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 68950b1824d0..50fd44439875 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -285,40 +285,38 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
static int vmap_range_noflush(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot,
- unsigned int max_page_shift)
+ unsigned int max_page_shift, pgtbl_mod_mask *mask)
{
pgd_t *pgd;
- unsigned long start;
unsigned long next;
int err;
- pgtbl_mod_mask mask = 0;
might_sleep();
BUG_ON(addr >= end);
- start = addr;
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
err = vmap_p4d_range(pgd, addr, next, phys_addr, prot,
- max_page_shift, &mask);
+ max_page_shift, mask);
if (err)
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
- if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
- arch_sync_kernel_mappings(start, end);
-
return err;
}
int vmap_page_range(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot)
{
+ pgtbl_mod_mask mask = 0;
int err;
err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
- ioremap_max_page_shift);
+ ioremap_max_page_shift, &mask);
+ if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+ arch_sync_kernel_mappings(addr, end);
+
flush_cache_vmap(addr, end);
if (!err)
err = kmsan_ioremap_page_range(addr, end, phys_addr, prot,
@@ -587,29 +585,24 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
}
static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
- pgprot_t prot, struct page **pages)
+ pgprot_t prot, struct page **pages, pgtbl_mod_mask *mask)
{
- unsigned long start = addr;
pgd_t *pgd;
unsigned long next;
int err = 0;
int nr = 0;
- pgtbl_mod_mask mask = 0;
BUG_ON(addr >= end);
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
if (pgd_bad(*pgd))
- mask |= PGTBL_PGD_MODIFIED;
- err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
+ *mask |= PGTBL_PGD_MODIFIED;
+ err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, mask);
if (err)
break;
} while (pgd++, addr = next, addr != end);
- if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
- arch_sync_kernel_mappings(start, end);
-
return err;
}
@@ -626,26 +619,33 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
pgprot_t prot, struct page **pages, unsigned int page_shift)
{
unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
+ unsigned long start = addr;
+ pgtbl_mod_mask mask = 0;
+ int err = 0;
WARN_ON(page_shift < PAGE_SHIFT);
if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
- page_shift == PAGE_SHIFT)
- return vmap_small_pages_range_noflush(addr, end, prot, pages);
-
- for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
- int err;
-
- err = vmap_range_noflush(addr, addr + (1UL << page_shift),
- page_to_phys(pages[i]), prot,
- page_shift);
- if (err)
- return err;
+ page_shift == PAGE_SHIFT) {
+ err = vmap_small_pages_range_noflush(addr, end, prot, pages,
+ &mask);
+ } else {
+ for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
+ err = vmap_range_noflush(addr,
+ addr + (1UL << page_shift),
+ page_to_phys(pages[i]), prot,
+ page_shift, &mask);
+ if (err)
+ break;
- addr += 1UL << page_shift;
+ addr += 1UL << page_shift;
+ }
}
- return 0;
+ if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+ arch_sync_kernel_mappings(start, end);
+
+ return err;
}
int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
--
2.43.0
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH v1 15/16] mm: Generalize arch_sync_kernel_mappings()
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (13 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 14/16] mm/vmalloc: Batch arch_sync_kernel_mappings() more efficiently Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-10 7:45 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 16/16] arm64/mm: Defer barriers when updating kernel mappings Ryan Roberts
2025-02-06 7:52 ` [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Andrew Morton
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
arch_sync_kernel_mappings() is an optional hook for arches to allow them
to synchronize certain levels of the kernel pgtables after modification.
But arm64 could benefit from a hook similar to this, paired with a call
prior to starting the batch of modifications.
So let's introduce arch_update_kernel_mappings_begin() and
arch_update_kernel_mappings_end(). Both have a default implementation
which can be overridden by the arch code. The default for the former is
a nop, and the default for the latter is to call
arch_sync_kernel_mappings(), so the latter replaces previous
arch_sync_kernel_mappings() callsites. So by default, the resulting
behaviour is unchanged.
To avoid include hell, the pgtbl_mod_mask type and its associated
macros are moved to their own header.
In a future patch, arm64 will opt in to overriding both functions.
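The callers converted below all end up following roughly this shape (a
condensed sketch, not new code; update_kernel_range_sketch() is just an
illustrative name):
static void update_kernel_range_sketch(unsigned long start, unsigned long end)
{
	pgtbl_mod_mask mask = 0;
	arch_update_kernel_mappings_begin(start, end);
	/*
	 * ... walk and modify the kernel pgtables for [start, end), ORing
	 * PGTBL_*_MODIFIED bits into mask as levels are changed ...
	 */
	arch_update_kernel_mappings_end(start, end, mask);
}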
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/pgtable.h | 24 +----------------
include/linux/pgtable_modmask.h | 32 ++++++++++++++++++++++
include/linux/vmalloc.h | 47 +++++++++++++++++++++++++++++++++
mm/memory.c | 5 ++--
mm/vmalloc.c | 15 ++++++-----
5 files changed, 92 insertions(+), 31 deletions(-)
create mode 100644 include/linux/pgtable_modmask.h
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94d267d02372..7f70786a73b3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -4,6 +4,7 @@
#include <linux/pfn.h>
#include <asm/pgtable.h>
+#include <linux/pgtable_modmask.h>
#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
#define PUD_ORDER (PUD_SHIFT - PAGE_SHIFT)
@@ -1786,29 +1787,6 @@ static inline bool arch_has_pfn_modify_check(void)
# define PAGE_KERNEL_EXEC PAGE_KERNEL
#endif
-/*
- * Page Table Modification bits for pgtbl_mod_mask.
- *
- * These are used by the p?d_alloc_track*() set of functions an in the generic
- * vmalloc/ioremap code to track at which page-table levels entries have been
- * modified. Based on that the code can better decide when vmalloc and ioremap
- * mapping changes need to be synchronized to other page-tables in the system.
- */
-#define __PGTBL_PGD_MODIFIED 0
-#define __PGTBL_P4D_MODIFIED 1
-#define __PGTBL_PUD_MODIFIED 2
-#define __PGTBL_PMD_MODIFIED 3
-#define __PGTBL_PTE_MODIFIED 4
-
-#define PGTBL_PGD_MODIFIED BIT(__PGTBL_PGD_MODIFIED)
-#define PGTBL_P4D_MODIFIED BIT(__PGTBL_P4D_MODIFIED)
-#define PGTBL_PUD_MODIFIED BIT(__PGTBL_PUD_MODIFIED)
-#define PGTBL_PMD_MODIFIED BIT(__PGTBL_PMD_MODIFIED)
-#define PGTBL_PTE_MODIFIED BIT(__PGTBL_PTE_MODIFIED)
-
-/* Page-Table Modification Mask */
-typedef unsigned int pgtbl_mod_mask;
-
#endif /* !__ASSEMBLY__ */
#if !defined(MAX_POSSIBLE_PHYSMEM_BITS) && !defined(CONFIG_64BIT)
diff --git a/include/linux/pgtable_modmask.h b/include/linux/pgtable_modmask.h
new file mode 100644
index 000000000000..5a21b1bb8df3
--- /dev/null
+++ b/include/linux/pgtable_modmask.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PGTABLE_MODMASK_H
+#define _LINUX_PGTABLE_MODMASK_H
+
+#ifndef __ASSEMBLY__
+
+/*
+ * Page Table Modification bits for pgtbl_mod_mask.
+ *
+ * These are used by the p?d_alloc_track*() set of functions an in the generic
+ * vmalloc/ioremap code to track at which page-table levels entries have been
+ * modified. Based on that the code can better decide when vmalloc and ioremap
+ * mapping changes need to be synchronized to other page-tables in the system.
+ */
+#define __PGTBL_PGD_MODIFIED 0
+#define __PGTBL_P4D_MODIFIED 1
+#define __PGTBL_PUD_MODIFIED 2
+#define __PGTBL_PMD_MODIFIED 3
+#define __PGTBL_PTE_MODIFIED 4
+
+#define PGTBL_PGD_MODIFIED BIT(__PGTBL_PGD_MODIFIED)
+#define PGTBL_P4D_MODIFIED BIT(__PGTBL_P4D_MODIFIED)
+#define PGTBL_PUD_MODIFIED BIT(__PGTBL_PUD_MODIFIED)
+#define PGTBL_PMD_MODIFIED BIT(__PGTBL_PMD_MODIFIED)
+#define PGTBL_PTE_MODIFIED BIT(__PGTBL_PTE_MODIFIED)
+
+/* Page-Table Modification Mask */
+typedef unsigned int pgtbl_mod_mask;
+
+#endif /* !__ASSEMBLY__ */
+
+#endif /* _LINUX_PGTABLE_MODMASK_H */
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 16dd4cba64f2..cb5d8f1965a1 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -11,6 +11,7 @@
#include <asm/page.h> /* pgprot_t */
#include <linux/rbtree.h>
#include <linux/overflow.h>
+#include <linux/pgtable_modmask.h>
#include <asm/vmalloc.h>
@@ -213,6 +214,26 @@ extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
struct page **pages, unsigned int page_shift);
+#ifndef arch_update_kernel_mappings_begin
+/**
+ * arch_update_kernel_mappings_begin - A batch of kernel pgtable mappings are
+ * about to be updated.
+ * @start: Virtual address of start of range to be updated.
+ * @end: Virtual address of end of range to be updated.
+ *
+ * An optional hook to allow architecture code to prepare for a batch of kernel
+ * pgtable mapping updates. An architecture may use this to enter a lazy mode
+ * where some operations can be deferred until the end of the batch.
+ *
+ * Context: Called in task context and may be preemptible.
+ */
+static inline void arch_update_kernel_mappings_begin(unsigned long start,
+ unsigned long end)
+{
+}
+#endif
+
+#ifndef arch_update_kernel_mappings_end
/*
* Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
* and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
@@ -229,6 +250,32 @@ int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
*/
void arch_sync_kernel_mappings(unsigned long start, unsigned long end);
+/**
+ * arch_update_kernel_mappings_end - A batch of kernel pgtable mappings have
+ * been updated.
+ * @start: Virtual address of start of range that was updated.
+ * @end: Virtual address of end of range that was updated.
+ *
+ * An optional hook to inform architecture code that a batch update is complete.
+ * This balances a previous call to arch_update_kernel_mappings_begin().
+ *
+ * An architecture may override this for any purpose, such as exiting a lazy
+ * mode previously entered with arch_update_kernel_mappings_begin() or syncing
+ * kernel mappings to a secondary pgtable. The default implementation calls an
+ * arch-provided arch_sync_kernel_mappings() if any arch-defined pgtable level
+ * was updated.
+ *
+ * Context: Called in task context and may be preemptible.
+ */
+static inline void arch_update_kernel_mappings_end(unsigned long start,
+ unsigned long end,
+ pgtbl_mod_mask mask)
+{
+ if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+ arch_sync_kernel_mappings(start, end);
+}
+#endif
+
/*
* Lowlevel-APIs (not for driver use!)
*/
diff --git a/mm/memory.c b/mm/memory.c
index a15f7dd500ea..f80930bc19f6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3035,6 +3035,8 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
if (WARN_ON(addr >= end))
return -EINVAL;
+ arch_update_kernel_mappings_begin(start, end);
+
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
@@ -3055,8 +3057,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
break;
} while (pgd++, addr = next, addr != end);
- if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
- arch_sync_kernel_mappings(start, start + size);
+ arch_update_kernel_mappings_end(start, end, mask);
return err;
}
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 50fd44439875..c5c51d86ef78 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -312,10 +312,10 @@ int vmap_page_range(unsigned long addr, unsigned long end,
pgtbl_mod_mask mask = 0;
int err;
+ arch_update_kernel_mappings_begin(addr, end);
err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
ioremap_max_page_shift, &mask);
- if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
- arch_sync_kernel_mappings(addr, end);
+ arch_update_kernel_mappings_end(addr, end, mask);
flush_cache_vmap(addr, end);
if (!err)
@@ -463,6 +463,9 @@ void __vunmap_range_noflush(unsigned long start, unsigned long end)
pgtbl_mod_mask mask = 0;
BUG_ON(addr >= end);
+
+ arch_update_kernel_mappings_begin(start, end);
+
pgd = pgd_offset_k(addr);
do {
next = pgd_addr_end(addr, end);
@@ -473,8 +476,7 @@ void __vunmap_range_noflush(unsigned long start, unsigned long end)
vunmap_p4d_range(pgd, addr, next, &mask);
} while (pgd++, addr = next, addr != end);
- if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
- arch_sync_kernel_mappings(start, end);
+ arch_update_kernel_mappings_end(start, end, mask);
}
void vunmap_range_noflush(unsigned long start, unsigned long end)
@@ -625,6 +627,8 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
WARN_ON(page_shift < PAGE_SHIFT);
+ arch_update_kernel_mappings_begin(start, end);
+
if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
page_shift == PAGE_SHIFT) {
err = vmap_small_pages_range_noflush(addr, end, prot, pages,
@@ -642,8 +646,7 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
}
}
- if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
- arch_sync_kernel_mappings(start, end);
+ arch_update_kernel_mappings_end(start, end, mask);
return err;
}
--
2.43.0
^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH v1 16/16] arm64/mm: Defer barriers when updating kernel mappings
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (14 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 15/16] mm: Generalize arch_sync_kernel_mappings() Ryan Roberts
@ 2025-02-05 15:09 ` Ryan Roberts
2025-02-10 8:03 ` Anshuman Khandual
2025-02-06 7:52 ` [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Andrew Morton
16 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-05 15:09 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Because the kernel can't tolerate page faults for kernel mappings, when
setting a valid, kernel space pte (or pmd/pud/p4d/pgd), it emits a
dsb(ishst) to ensure that the store to the pgtable is observed by the
table walker immediately. Additionally it emits an isb() to ensure that
any already speculatively determined invalid mapping fault gets
canceled.
We can improve the performance of vmalloc operations by batching these
barriers until the end of a set of entry updates. The newly added
arch_update_kernel_mappings_begin() / arch_update_kernel_mappings_end()
provide the required hooks.
vmalloc improves by up to 30% as a result.
Two new TIF_ flags are created; TIF_KMAP_UPDATE_ACTIVE tells us if we
are in batch mode and can therefore defer any barriers until the end
of the batch. TIF_KMAP_UPDATE_PENDING tells us if barriers are queued to
be emitted at the end of the batch.
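Condensing the hunks below, the flow for one batched update is roughly
as follows (batched_kmap_update_sketch() is an illustrative name only;
the mask argument is ignored by the arm64 override):
static void batched_kmap_update_sketch(unsigned long start, unsigned long end)
{
	/* Sets TIF_KMAP_UPDATE_ACTIVE. */
	arch_update_kernel_mappings_begin(start, end);
	/*
	 * Each __set_pte()/set_pmd()/... on a valid kernel entry now calls
	 * queue_pte_barriers(), which just sets TIF_KMAP_UPDATE_PENDING
	 * instead of issuing dsb(ishst) + isb() per entry.
	 */
	/* Emits the barriers once if pending, then clears both flags. */
	arch_update_kernel_mappings_end(start, end, 0);
}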
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/pgtable.h | 65 +++++++++++++++++++---------
arch/arm64/include/asm/thread_info.h | 2 +
arch/arm64/kernel/process.c | 20 +++++++--
3 files changed, 63 insertions(+), 24 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ff358d983583..1ee9b9588502 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -39,6 +39,41 @@
#include <linux/mm_types.h>
#include <linux/sched.h>
#include <linux/page_table_check.h>
+#include <linux/pgtable_modmask.h>
+
+static inline void emit_pte_barriers(void)
+{
+ dsb(ishst);
+ isb();
+}
+
+static inline void queue_pte_barriers(void)
+{
+ if (test_thread_flag(TIF_KMAP_UPDATE_ACTIVE)) {
+ if (!test_thread_flag(TIF_KMAP_UPDATE_PENDING))
+ set_thread_flag(TIF_KMAP_UPDATE_PENDING);
+ } else
+ emit_pte_barriers();
+}
+
+#define arch_update_kernel_mappings_begin arch_update_kernel_mappings_begin
+static inline void arch_update_kernel_mappings_begin(unsigned long start,
+ unsigned long end)
+{
+ set_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
+}
+
+#define arch_update_kernel_mappings_end arch_update_kernel_mappings_end
+static inline void arch_update_kernel_mappings_end(unsigned long start,
+ unsigned long end,
+ pgtbl_mod_mask mask)
+{
+ if (test_thread_flag(TIF_KMAP_UPDATE_PENDING))
+ emit_pte_barriers();
+
+ clear_thread_flag(TIF_KMAP_UPDATE_PENDING);
+ clear_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
+}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
@@ -323,10 +358,8 @@ static inline void __set_pte_complete(pte_t pte)
* Only if the new pte is valid and kernel, otherwise TLB maintenance
* or update_mmu_cache() have the necessary barriers.
*/
- if (pte_valid_not_user(pte)) {
- dsb(ishst);
- isb();
- }
+ if (pte_valid_not_user(pte))
+ queue_pte_barriers();
}
static inline void __set_pte(pte_t *ptep, pte_t pte)
@@ -791,10 +824,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
WRITE_ONCE(*pmdp, pmd);
- if (pmd_valid_not_user(pmd)) {
- dsb(ishst);
- isb();
- }
+ if (pmd_valid_not_user(pmd))
+ queue_pte_barriers();
}
static inline void pmd_clear(pmd_t *pmdp)
@@ -869,10 +900,8 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
WRITE_ONCE(*pudp, pud);
- if (pud_valid_not_user(pud)) {
- dsb(ishst);
- isb();
- }
+ if (pud_valid_not_user(pud))
+ queue_pte_barriers();
}
static inline void pud_clear(pud_t *pudp)
@@ -960,10 +989,8 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
WRITE_ONCE(*p4dp, p4d);
- if (p4d_valid_not_user(p4d)) {
- dsb(ishst);
- isb();
- }
+ if (p4d_valid_not_user(p4d))
+ queue_pte_barriers();
}
static inline void p4d_clear(p4d_t *p4dp)
@@ -1098,10 +1125,8 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
WRITE_ONCE(*pgdp, pgd);
- if (pgd_valid_not_user(pgd)) {
- dsb(ishst);
- isb();
- }
+ if (pgd_valid_not_user(pgd))
+ queue_pte_barriers();
}
static inline void pgd_clear(pgd_t *pgdp)
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 1114c1c3300a..382d2121261e 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -82,6 +82,8 @@ void arch_setup_new_exec(void);
#define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */
#define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */
#define TIF_TSC_SIGSEGV 30 /* SIGSEGV on counter-timer access */
+#define TIF_KMAP_UPDATE_ACTIVE 31 /* kernel map update in progress */
+#define TIF_KMAP_UPDATE_PENDING 32 /* kernel map updated with deferred barriers */
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 42faebb7b712..1367ec6407d1 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -680,10 +680,10 @@ struct task_struct *__switch_to(struct task_struct *prev,
gcs_thread_switch(next);
/*
- * Complete any pending TLB or cache maintenance on this CPU in case
- * the thread migrates to a different CPU.
- * This full barrier is also required by the membarrier system
- * call.
+ * Complete any pending TLB or cache maintenance on this CPU in case the
+ * thread migrates to a different CPU. This full barrier is also
+ * required by the membarrier system call. Additionally it is required
+ * for TIF_KMAP_UPDATE_PENDING, see below.
*/
dsb(ish);
@@ -696,6 +696,18 @@ struct task_struct *__switch_to(struct task_struct *prev,
/* avoid expensive SCTLR_EL1 accesses if no change */
if (prev->thread.sctlr_user != next->thread.sctlr_user)
update_sctlr_el1(next->thread.sctlr_user);
+ else if (unlikely(test_thread_flag(TIF_KMAP_UPDATE_PENDING))) {
+ /*
+ * In unlikely event that a kernel map update is on-going when
+ * preemption occurs, we must emit_pte_barriers() if pending.
+ * emit_pte_barriers() consists of "dsb(ishst); isb();". The dsb
+ * is already handled above. The isb() is handled if
+ * update_sctlr_el1() was called. So only need to emit isb()
+ * here if it wasn't called.
+ */
+ isb();
+ clear_thread_flag(TIF_KMAP_UPDATE_PENDING);
+ }
/* the actual thread switch */
last = cpu_switch_to(prev, next);
--
2.43.0
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 01/16] mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear()
2025-02-05 15:09 ` [PATCH v1 01/16] mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear() Ryan Roberts
@ 2025-02-06 5:03 ` Anshuman Khandual
2025-02-06 12:15 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-06 5:03 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel, Huacai Chen,
Thomas Bogendoerfer, James E.J. Bottomley, Helge Deller,
Madhavan Srinivasan, Michael Ellerman, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Gerald Schaefer, David S. Miller,
Andreas Larsson, stable
On 2/5/25 20:39, Ryan Roberts wrote:
> In order to fix a bug, arm64 needs to be told the size of the huge page
> for which the huge_pte is being set in huge_ptep_get_and_clear().
> Provide for this by adding an `unsigned long sz` parameter to the
> function. This follows the same pattern as huge_pte_clear() and
> set_huge_pte_at().
>
> This commit makes the required interface modifications to the core mm as
> well as all arches that implement this function (arm64, loongarch, mips,
> parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed
> in a separate commit.
>
> Cc: <stable@vger.kernel.org>
> Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/hugetlb.h | 4 ++--
> arch/arm64/mm/hugetlbpage.c | 8 +++++---
> arch/loongarch/include/asm/hugetlb.h | 6 ++++--
> arch/mips/include/asm/hugetlb.h | 6 ++++--
> arch/parisc/include/asm/hugetlb.h | 2 +-
> arch/parisc/mm/hugetlbpage.c | 2 +-
> arch/powerpc/include/asm/hugetlb.h | 6 ++++--
> arch/riscv/include/asm/hugetlb.h | 3 ++-
> arch/riscv/mm/hugetlbpage.c | 2 +-
> arch/s390/include/asm/hugetlb.h | 12 ++++++++----
> arch/s390/mm/hugetlbpage.c | 10 ++++++++--
> arch/sparc/include/asm/hugetlb.h | 2 +-
> arch/sparc/mm/hugetlbpage.c | 2 +-
> include/asm-generic/hugetlb.h | 2 +-
> include/linux/hugetlb.h | 4 +++-
> mm/hugetlb.c | 4 ++--
> 16 files changed, 48 insertions(+), 27 deletions(-)
>
> diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
> index c6dff3e69539..03db9cb21ace 100644
> --- a/arch/arm64/include/asm/hugetlb.h
> +++ b/arch/arm64/include/asm/hugetlb.h
> @@ -42,8 +42,8 @@ extern int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep,
> pte_t pte, int dirty);
> #define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
> -extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
> - unsigned long addr, pte_t *ptep);
> +extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned long sz);
If VMA could be passed instead of MM, the size of the huge page can
be derived via huge_page_size(hstate_vma(vma)) and another argument
here need not be added. Also MM can be derived from VMA if required.
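i.e. something along the lines of the sketch below (this is the
suggestion, not what the patch implements; the body mirrors the arm64
implementation from the patch, just deriving mm and sz from the VMA):
pte_t huge_ptep_get_and_clear(struct vm_area_struct *vma,
			      unsigned long addr, pte_t *ptep)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long sz = huge_page_size(hstate_vma(vma));
	size_t pgsize;
	int ncontig;
	ncontig = num_contig_ptes(sz, &pgsize);
	return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
}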
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 02/16] arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes
2025-02-05 15:09 ` [PATCH v1 02/16] arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes Ryan Roberts
@ 2025-02-06 6:15 ` Anshuman Khandual
2025-02-06 12:55 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-06 6:15 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel, stable
On 2/5/25 20:39, Ryan Roberts wrote:
> arm64 supports multiple huge_pte sizes. Some of the sizes are covered by
> a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some
> are covered by multiple ptes at a particular level (CONT_PTE_SIZE,
> CONT_PMD_SIZE). So the function has to figure out the size from the
> huge_pte pointer. This was previously done by walking the pgtable to
> determine the level, then using the PTE_CONT bit to determine the number
> of ptes.
Actually PTE_CONT gets used to determine whether the entry is a normal
(i.e. PMD/PUD based) huge page or a cont PTE/PMD based huge page, just
to decide between calling the standard __ptep_get_and_clear() or the
specific get_clear_contig() after find_num_contig() has walked the page
table. So the presence of PTE_CONT only selects between those two
paths; it is not used to determine the number of ptes for the mapping,
as stated above.
There are two similar functions which determine this count:
static int find_num_contig(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, size_t *pgsize)
{
pgd_t *pgdp = pgd_offset(mm, addr);
p4d_t *p4dp;
pud_t *pudp;
pmd_t *pmdp;
*pgsize = PAGE_SIZE;
p4dp = p4d_offset(pgdp, addr);
pudp = pud_offset(p4dp, addr);
pmdp = pmd_offset(pudp, addr);
if ((pte_t *)pmdp == ptep) {
*pgsize = PMD_SIZE;
return CONT_PMDS;
}
return CONT_PTES;
}
find_num_contig() already assumes that the entry is a contig huge page
and just finds whether it is a PMD or PTE based one. This always
requires first determining that the PTE_CONT bit is set via pte_cont()
before calling find_num_contig() in each instance.
But num_contig_ptes() can get the same information without walking the
page table and thus without predetermining whether PTE_CONT is set or
not. The size can be derived from the VMA argument when present.
static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
{
int contig_ptes = 0;
*pgsize = size;
switch (size) {
#ifndef __PAGETABLE_PMD_FOLDED
case PUD_SIZE:
if (pud_sect_supported())
contig_ptes = 1;
break;
#endif
case PMD_SIZE:
contig_ptes = 1;
break;
case CONT_PMD_SIZE:
*pgsize = PMD_SIZE;
contig_ptes = CONT_PMDS;
break;
case CONT_PTE_SIZE:
*pgsize = PAGE_SIZE;
contig_ptes = CONT_PTES;
break;
}
return contig_ptes;
}
On a side note, why can't num_contig_ptes() be used all the time and
find_num_contig() be dropped? Or am I missing something here?
>
> But the PTE_CONT bit is only valid when the pte is present. For
> non-present pte values (e.g. markers, migration entries), the previous
> implementation was therefore erroniously determining the size. There is
> at least one known caller in core-mm, move_huge_pte(), which may call
> huge_ptep_get_and_clear() for a non-present pte. So we must be robust to
> this case. Additionally the "regular" ptep_get_and_clear() is robust to
> being called for non-present ptes so it makes sense to follow the
> behaviour.
With a VMA argument and num_contig_ptes(), the dependency on PTE_CONT
being set and on the entry being present might not be required.
>
> Fix this by using the new sz parameter which is now provided to the
> function. Additionally when clearing each pte in a contig range, don't
> gather the access and dirty bits if the pte is not present.
Makes sense.
>
> An alternative approach that would not require API changes would be to
> store the PTE_CONT bit in a spare bit in the swap entry pte. But it felt
> cleaner to follow other APIs' lead and just pass in the size.
Right, changing the arguments in the API will help solve this problem.
>
> While we are at it, add some debug warnings in functions that require
> the pte is present.
>
> As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap
> entry offset field (layout of non-present pte). Since hugetlb is never
> swapped to disk, this field will only be populated for markers, which
> always set this bit to 0 and hwpoison swap entries, which set the offset
> field to a PFN; So it would only ever be 1 for a 52-bit PVA system where
> memory in that high half was poisoned (I think!). So in practice, this
> bit would almost always be zero for non-present ptes and we would only
> clear the first entry if it was actually a contiguous block. That's
> probably a less severe symptom than if it was always interpretted as 1
> and cleared out potentially-present neighboring PTEs.
>
> Cc: <stable@vger.kernel.org>
> Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/mm/hugetlbpage.c | 54 ++++++++++++++++++++-----------------
> 1 file changed, 29 insertions(+), 25 deletions(-)
>
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 06db4649af91..328eec4bfe55 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
> unsigned long pgsize,
> unsigned long ncontig)
> {
> - pte_t orig_pte = __ptep_get(ptep);
> - unsigned long i;
> -
> - for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
> - pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
> -
> - /*
> - * If HW_AFDBM is enabled, then the HW could turn on
> - * the dirty or accessed bit for any page in the set,
> - * so check them all.
> - */
> - if (pte_dirty(pte))
> - orig_pte = pte_mkdirty(orig_pte);
> -
> - if (pte_young(pte))
> - orig_pte = pte_mkyoung(orig_pte);
> + pte_t pte, tmp_pte;
> + bool present;
> +
> + pte = __ptep_get_and_clear(mm, addr, ptep);
> + present = pte_present(pte);
> + while (--ncontig) {
Although this does the right thing by calling __ptep_get_and_clear() once
for non-contig huge pages, I am wondering whether the cont/non-cont
separation should be maintained in the caller huge_ptep_get_and_clear(),
keeping the current logical bifurcation intact.
> + ptep++;
> + addr += pgsize;
> + tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
> + if (present) {
Checking for present entries makes sense here.
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
> + }
> }
> - return orig_pte;
> + return pte;
> }
>
> static pte_t get_clear_contig_flush(struct mm_struct *mm,
> @@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
> {
> int ncontig;
> size_t pgsize;
> - pte_t orig_pte = __ptep_get(ptep);
> -
> - if (!pte_cont(orig_pte))
> - return __ptep_get_and_clear(mm, addr, ptep);
> -
> - ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>
> + ncontig = num_contig_ptes(sz, &pgsize);
__ptep_get_and_clear() can still be called here if 'ncontig' is
returned as 0, indicating a normal non-contig huge page, thus
keeping get_clear_contig() unchanged, just to handle contig huge
pages.
> return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
> }
>
> @@ -451,6 +445,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> pgprot_t hugeprot;
> pte_t orig_pte;
>
> + VM_WARN_ON(!pte_present(pte));
> +
> if (!pte_cont(pte))
> return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
>
> @@ -461,6 +457,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> return 0;
>
> orig_pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
> + VM_WARN_ON(!pte_present(orig_pte));
>
> /* Make sure we don't lose the dirty or young state */
> if (pte_dirty(orig_pte))
> @@ -485,7 +482,10 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
> size_t pgsize;
> pte_t pte;
>
> - if (!pte_cont(__ptep_get(ptep))) {
> + pte = __ptep_get(ptep);
> + VM_WARN_ON(!pte_present(pte));
> +
> + if (!pte_cont(pte)) {
> __ptep_set_wrprotect(mm, addr, ptep);
> return;
> }
> @@ -509,8 +509,12 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
> struct mm_struct *mm = vma->vm_mm;
> size_t pgsize;
> int ncontig;
> + pte_t pte;
>
> - if (!pte_cont(__ptep_get(ptep)))
> + pte = __ptep_get(ptep);
> + VM_WARN_ON(!pte_present(pte));
> +
> + if (!pte_cont(pte))
> return ptep_clear_flush(vma, addr, ptep);
>
> ncontig = find_num_contig(mm, addr, ptep, &pgsize);
In all the above instances, should not num_contig_ptes() be called to determine
whether a given entry is a non-contig or contig huge page, thus dropping the
need for the pte_cont() and pte_present() tests proposed here?
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 03/16] arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level
2025-02-05 15:09 ` [PATCH v1 03/16] arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level Ryan Roberts
@ 2025-02-06 6:46 ` Anshuman Khandual
2025-02-06 13:04 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-06 6:46 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel, stable
On 2/5/25 20:39, Ryan Roberts wrote:
> commit c910f2b65518 ("arm64/mm: Update tlb invalidation routines for
> FEAT_LPA2") changed the "invalidation level unknown" hint from 0 to
> TLBI_TTL_UNKNOWN (INT_MAX). But the fallback "unknown level" path in
> flush_hugetlb_tlb_range() was not updated. So as it stands, when trying
> to invalidate CONT_PMD_SIZE or CONT_PTE_SIZE hugetlb mappings, we will
> spuriously try to invalidate at level 0 on LPA2-enabled systems.
>
> Fix this so that the fallback passes TLBI_TTL_UNKNOWN, and while we are
> at it, explicitly use the correct stride and level for CONT_PMD_SIZE and
> CONT_PTE_SIZE, which should provide a minor optimization.
>
> Cc: <stable@vger.kernel.org>
> Fixes: c910f2b65518 ("arm64/mm: Update tlb invalidation routines for FEAT_LPA2")
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/hugetlb.h | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
> index 03db9cb21ace..8ab9542d2d22 100644
> --- a/arch/arm64/include/asm/hugetlb.h
> +++ b/arch/arm64/include/asm/hugetlb.h
> @@ -76,12 +76,20 @@ static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
> {
> unsigned long stride = huge_page_size(hstate_vma(vma));
>
> - if (stride == PMD_SIZE)
> - __flush_tlb_range(vma, start, end, stride, false, 2);
> - else if (stride == PUD_SIZE)
> - __flush_tlb_range(vma, start, end, stride, false, 1);
> - else
> - __flush_tlb_range(vma, start, end, PAGE_SIZE, false, 0);
> + switch (stride) {
> + case PUD_SIZE:
> + __flush_tlb_range(vma, start, end, PUD_SIZE, false, 1);
> + break;
Just wondering - should not !__PAGETABLE_PMD_FOLDED and pud_sect_supported()
checks also be added here for this PUD_SIZE case ?
> + case CONT_PMD_SIZE:
> + case PMD_SIZE:
> + __flush_tlb_range(vma, start, end, PMD_SIZE, false, 2);
> + break;
> + case CONT_PTE_SIZE:
> + __flush_tlb_range(vma, start, end, PAGE_SIZE, false, 3);
> + break;
> + default:
> + __flush_tlb_range(vma, start, end, PAGE_SIZE, false, TLBI_TTL_UNKNOWN);
> + }
> }
>
> #endif /* __ASM_HUGETLB_H */
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
` (15 preceding siblings ...)
2025-02-05 15:09 ` [PATCH v1 16/16] arm64/mm: Defer barriers when updating kernel mappings Ryan Roberts
@ 2025-02-06 7:52 ` Andrew Morton
2025-02-06 11:59 ` Ryan Roberts
16 siblings, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2025-02-06 7:52 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky, linux-arm-kernel, linux-mm,
linux-kernel
On Wed, 5 Feb 2025 15:09:40 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:
> I'm guessing that going in
> through the arm64 tree is the right approach here?
Seems that way, just from the line counts.
I suggest two series - one for the four cc:stable patches and one for
the 6.14 material. This depends on whether the ARM maintainers want to
get patches 1-4 into the -stable stream before the 6.14 release.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 05/16] mm/page_table_check: Batch-check pmds/puds just like ptes
2025-02-05 15:09 ` [PATCH v1 05/16] mm/page_table_check: Batch-check pmds/puds just like ptes Ryan Roberts
@ 2025-02-06 10:55 ` Anshuman Khandual
2025-02-06 13:07 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-06 10:55 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/5/25 20:39, Ryan Roberts wrote:
> Convert page_table_check_p[mu]d_set(...) to
> page_table_check_p[mu]ds_set(..., nr) to allow checking a contiguous set
> of pmds/puds in single batch. We retain page_table_check_p[mu]d_set(...)
> as macros that call new batch functions with nr=1 for compatibility.
>
> arm64 is about to reorganise its pte/pmd/pud helpers to reuse more code
> and to allow the implementation for huge_pte to more efficiently set
> ptes/pmds/puds in batches. We need these batch-helpers to make the
> refactoring possible.
A very small nit.
Although the justification here is reasonable enough, I'm not sure if
platform-specific requirements need to be spelled out in such detail
for a generic MM change.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
Regardless, LGTM.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
> include/linux/page_table_check.h | 30 +++++++++++++++++-----------
> mm/page_table_check.c | 34 +++++++++++++++++++-------------
> 2 files changed, 38 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h
> index 6722941c7cb8..289620d4aad3 100644
> --- a/include/linux/page_table_check.h
> +++ b/include/linux/page_table_check.h
> @@ -19,8 +19,10 @@ void __page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd);
> void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud);
> void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
> unsigned int nr);
> -void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd);
> -void __page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, pud_t pud);
> +void __page_table_check_pmds_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd,
> + unsigned int nr);
> +void __page_table_check_puds_set(struct mm_struct *mm, pud_t *pudp, pud_t pud,
> + unsigned int nr);
> void __page_table_check_pte_clear_range(struct mm_struct *mm,
> unsigned long addr,
> pmd_t pmd);
> @@ -74,22 +76,22 @@ static inline void page_table_check_ptes_set(struct mm_struct *mm,
> __page_table_check_ptes_set(mm, ptep, pte, nr);
> }
>
> -static inline void page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp,
> - pmd_t pmd)
> +static inline void page_table_check_pmds_set(struct mm_struct *mm,
> + pmd_t *pmdp, pmd_t pmd, unsigned int nr)
> {
> if (static_branch_likely(&page_table_check_disabled))
> return;
>
> - __page_table_check_pmd_set(mm, pmdp, pmd);
> + __page_table_check_pmds_set(mm, pmdp, pmd, nr);
> }
>
> -static inline void page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp,
> - pud_t pud)
> +static inline void page_table_check_puds_set(struct mm_struct *mm,
> + pud_t *pudp, pud_t pud, unsigned int nr)
> {
> if (static_branch_likely(&page_table_check_disabled))
> return;
>
> - __page_table_check_pud_set(mm, pudp, pud);
> + __page_table_check_puds_set(mm, pudp, pud, nr);
> }
>
> static inline void page_table_check_pte_clear_range(struct mm_struct *mm,
> @@ -129,13 +131,13 @@ static inline void page_table_check_ptes_set(struct mm_struct *mm,
> {
> }
>
> -static inline void page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp,
> - pmd_t pmd)
> +static inline void page_table_check_pmds_set(struct mm_struct *mm,
> + pmd_t *pmdp, pmd_t pmd, unsigned int nr)
> {
> }
>
> -static inline void page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp,
> - pud_t pud)
> +static inline void page_table_check_puds_set(struct mm_struct *mm,
> + pud_t *pudp, pud_t pud, unsigned int nr)
> {
> }
>
> @@ -146,4 +148,8 @@ static inline void page_table_check_pte_clear_range(struct mm_struct *mm,
> }
>
> #endif /* CONFIG_PAGE_TABLE_CHECK */
> +
> +#define page_table_check_pmd_set(mm, pmdp, pmd) page_table_check_pmds_set(mm, pmdp, pmd, 1)
> +#define page_table_check_pud_set(mm, pudp, pud) page_table_check_puds_set(mm, pudp, pud, 1)
> +
> #endif /* __LINUX_PAGE_TABLE_CHECK_H */
> diff --git a/mm/page_table_check.c b/mm/page_table_check.c
> index 509c6ef8de40..dae4a7d776b3 100644
> --- a/mm/page_table_check.c
> +++ b/mm/page_table_check.c
> @@ -234,33 +234,39 @@ static inline void page_table_check_pmd_flags(pmd_t pmd)
> WARN_ON_ONCE(swap_cached_writable(pmd_to_swp_entry(pmd)));
> }
>
> -void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd)
> +void __page_table_check_pmds_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd,
> + unsigned int nr)
> {
> + unsigned int i;
> + unsigned long stride = PMD_SIZE >> PAGE_SHIFT;
> +
> if (&init_mm == mm)
> return;
>
> page_table_check_pmd_flags(pmd);
>
> - __page_table_check_pmd_clear(mm, *pmdp);
> - if (pmd_user_accessible_page(pmd)) {
> - page_table_check_set(pmd_pfn(pmd), PMD_SIZE >> PAGE_SHIFT,
> - pmd_write(pmd));
> - }
> + for (i = 0; i < nr; i++)
> + __page_table_check_pmd_clear(mm, *(pmdp + i));
> + if (pmd_user_accessible_page(pmd))
> + page_table_check_set(pmd_pfn(pmd), stride * nr, pmd_write(pmd));
> }
> -EXPORT_SYMBOL(__page_table_check_pmd_set);
> +EXPORT_SYMBOL(__page_table_check_pmds_set);
>
> -void __page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, pud_t pud)
> +void __page_table_check_puds_set(struct mm_struct *mm, pud_t *pudp, pud_t pud,
> + unsigned int nr)
> {
> + unsigned int i;
> + unsigned long stride = PUD_SIZE >> PAGE_SHIFT;
> +
> if (&init_mm == mm)
> return;
>
> - __page_table_check_pud_clear(mm, *pudp);
> - if (pud_user_accessible_page(pud)) {
> - page_table_check_set(pud_pfn(pud), PUD_SIZE >> PAGE_SHIFT,
> - pud_write(pud));
> - }
> + for (i = 0; i < nr; i++)
> + __page_table_check_pud_clear(mm, *(pudp + i));
> + if (pud_user_accessible_page(pud))
> + page_table_check_set(pud_pfn(pud), stride * nr, pud_write(pud));
> }
> -EXPORT_SYMBOL(__page_table_check_pud_set);
> +EXPORT_SYMBOL(__page_table_check_puds_set);
>
> void __page_table_check_pte_clear_range(struct mm_struct *mm,
> unsigned long addr,
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 06/16] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
2025-02-05 15:09 ` [PATCH v1 06/16] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear() Ryan Roberts
@ 2025-02-06 11:48 ` Anshuman Khandual
2025-02-06 13:26 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-06 11:48 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/5/25 20:39, Ryan Roberts wrote:
> Refactor __set_ptes(), set_pmd_at() and set_pud_at() so that they are
> all a thin wrapper around a generic ___set_ptes(), which takes pgsize
s/generic/common - as generic might be misleading for being generic MM.
> parameter. This cleans up the code to remove the confusing
> __set_pte_at() (which was only ever used for pmd/pud) and will allow us
> to perform future barrier optimizations in a single place. Additionally,
> it will permit the huge_pte API to efficiently batch-set pgtable entries
> and take advantage of the future barrier optimizations.
Makes sense.
>
> ___set_ptes() calls the correct page_table_check_*_set() function based
> on the pgsize. This means that huge_ptes will be able to get proper coverage
> regardless of their size, once it's plumbed into huge_pte. Currently the
> huge_pte API always uses the pte API which assumes an entry only covers
> a single page.
Right
>
> While we are at it, refactor __ptep_get_and_clear() and
> pmdp_huge_get_and_clear() to use a common ___ptep_get_and_clear() which
> also takes a pgsize parameter. This will provide the huge_pte API the
> means to clear ptes corresponding with the way they were set.
__ptep_get_and_clear() refactoring does not seem to be related to the
previous __set_ptes(). Should they be separated out into two different
patches instead - for better clarity and review? Both these cleanups
have enough change and can stand on their own.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 108 +++++++++++++++++++------------
> 1 file changed, 67 insertions(+), 41 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0b2a2ad1b9e8..3b55d9a15f05 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -420,23 +420,6 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
> return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
> }
>
> -static inline void __set_ptes(struct mm_struct *mm,
> - unsigned long __always_unused addr,
> - pte_t *ptep, pte_t pte, unsigned int nr)
> -{
> - page_table_check_ptes_set(mm, ptep, pte, nr);
> - __sync_cache_and_tags(pte, nr);
> -
> - for (;;) {
> - __check_safe_pte_update(mm, ptep, pte);
> - __set_pte(ptep, pte);
> - if (--nr == 0)
> - break;
> - ptep++;
> - pte = pte_advance_pfn(pte, 1);
> - }
> -}
> -
> /*
> * Hugetlb definitions.
> */
> @@ -641,30 +624,59 @@ static inline pgprot_t pud_pgprot(pud_t pud)
> return __pgprot(pud_val(pfn_pud(pfn, __pgprot(0))) ^ pud_val(pud));
> }
>
> -static inline void __set_pte_at(struct mm_struct *mm,
> - unsigned long __always_unused addr,
> - pte_t *ptep, pte_t pte, unsigned int nr)
> +static inline void ___set_ptes(struct mm_struct *mm, pte_t *ptep, pte_t pte,
> + unsigned int nr, unsigned long pgsize)
So address got dropped and page size got added as an argument.
s/___set_ptes/___set_pxds ? to be more generic for all levels.
> {
> - __sync_cache_and_tags(pte, nr);
> - __check_safe_pte_update(mm, ptep, pte);
> - __set_pte(ptep, pte);
> + unsigned long stride = pgsize >> PAGE_SHIFT;
> +
> + switch (pgsize) {
> + case PAGE_SIZE:
> + page_table_check_ptes_set(mm, ptep, pte, nr);
> + break;
> + case PMD_SIZE:
> + page_table_check_pmds_set(mm, (pmd_t *)ptep, pte_pmd(pte), nr);
> + break;
> + case PUD_SIZE:
> + page_table_check_puds_set(mm, (pud_t *)ptep, pte_pud(pte), nr);
> + break;
This is where the new page table check APIs get used for batch testing.
> + default:
> + VM_WARN_ON(1);
> + }
> +
> + __sync_cache_and_tags(pte, nr * stride);
> +
> + for (;;) {
> + __check_safe_pte_update(mm, ptep, pte);
> + __set_pte(ptep, pte);
> + if (--nr == 0)
> + break;
> + ptep++;
> + pte = pte_advance_pfn(pte, stride);
> + }
> }
LGTM
>
> -static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> - pmd_t *pmdp, pmd_t pmd)
> +static inline void __set_ptes(struct mm_struct *mm,
> + unsigned long __always_unused addr,
> + pte_t *ptep, pte_t pte, unsigned int nr)
> {
> - page_table_check_pmd_set(mm, pmdp, pmd);
> - return __set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd),
> - PMD_SIZE >> PAGE_SHIFT);
> + ___set_ptes(mm, ptep, pte, nr, PAGE_SIZE);
> }
>
> -static inline void set_pud_at(struct mm_struct *mm, unsigned long addr,
> - pud_t *pudp, pud_t pud)
> +static inline void __set_pmds(struct mm_struct *mm,
> + unsigned long __always_unused addr,
> + pmd_t *pmdp, pmd_t pmd, unsigned int nr)
> +{
> + ___set_ptes(mm, (pte_t *)pmdp, pmd_pte(pmd), nr, PMD_SIZE);
> +}
> +#define set_pmd_at(mm, addr, pmdp, pmd) __set_pmds(mm, addr, pmdp, pmd, 1)
> +
> +static inline void __set_puds(struct mm_struct *mm,
> + unsigned long __always_unused addr,
> + pud_t *pudp, pud_t pud, unsigned int nr)
> {
> - page_table_check_pud_set(mm, pudp, pud);
> - return __set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud),
> - PUD_SIZE >> PAGE_SHIFT);
> + ___set_ptes(mm, (pte_t *)pudp, pud_pte(pud), nr, PUD_SIZE);
> }
> +#define set_pud_at(mm, addr, pudp, pud) __set_puds(mm, addr, pudp, pud, 1)
LGTM
>
> #define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
> #define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)
> @@ -1276,16 +1288,34 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> }
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
>
> -static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
> - unsigned long address, pte_t *ptep)
> +static inline pte_t ___ptep_get_and_clear(struct mm_struct *mm, pte_t *ptep,
> + unsigned long pgsize)
So address got dropped and page size got added as an argument.
> {
> pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
>
> - page_table_check_pte_clear(mm, pte);
> + switch (pgsize) {
> + case PAGE_SIZE:
> + page_table_check_pte_clear(mm, pte);
> + break;
> + case PMD_SIZE:
> + page_table_check_pmd_clear(mm, pte_pmd(pte));
> + break;
> + case PUD_SIZE:
> + page_table_check_pud_clear(mm, pte_pud(pte));
> + break;
> + default:
> + VM_WARN_ON(1);
> + }
>
> return pte;
> }
>
> +static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep)
> +{
> + return ___ptep_get_and_clear(mm, ptep, PAGE_SIZE);
> +}
> +
> static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, unsigned int nr, int full)
> {
> @@ -1322,11 +1352,7 @@ static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
> static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
> unsigned long address, pmd_t *pmdp)
> {
> - pmd_t pmd = __pmd(xchg_relaxed(&pmd_val(*pmdp), 0));
> -
> - page_table_check_pmd_clear(mm, pmd);
> -
> - return pmd;
> + return pte_pmd(___ptep_get_and_clear(mm, (pte_t *)pmdp, PMD_SIZE));
> }
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
Although currently there is no pudp_huge_get_and_clear() helper on arm64, the
reworked ___ptep_get_and_clear() will be able to support that as well if
and when required, as it now handles the PUD_SIZE page size.
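If that is ever needed, a minimal (hypothetical) sketch mirroring the pmdp
variant above could be:

        static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
                                                    unsigned long address, pud_t *pudp)
        {
                /* reuse the common helper with a PUD_SIZE stride */
                return pte_pud(___ptep_get_and_clear(mm, (pte_t *)pudp, PUD_SIZE));
        }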
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements
2025-02-06 7:52 ` [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Andrew Morton
@ 2025-02-06 11:59 ` Ryan Roberts
0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-02-06 11:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Catalin Marinas, Will Deacon, Muchun Song, Pasha Tatashin,
Uladzislau Rezki, Christoph Hellwig, Mark Rutland,
Ard Biesheuvel, Anshuman Khandual, Dev Jain, Alexandre Ghiti,
Steve Capper, Kevin Brodsky, linux-arm-kernel, linux-mm,
linux-kernel
On 06/02/2025 07:52, Andrew Morton wrote:
> On Wed, 5 Feb 2025 15:09:40 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:
>
>> I'm guessing that going in
>> through the arm64 tree is the right approach here?
>
> Seems that way, just from the line counts.
>
> I suggest two series - one for the four cc:stable patches and one for
> the 6.14 material. This depends on whether the ARM maintainers want to
> get patches 1-4 into the -stable stream before the 6.14 release.
Thanks Andrew, I'm happy to take this approach assuming Catalin/Will agree.
But to be pedantic for a moment, I nominated patches 1-3 and 13 as candidates
for stable. 1-3 should definitely go via arm64. 13 is a pure mm fix. But later
arm64 patches in the series depend on it being fixed. So I wouldn't want to put
13 in through mm tree if it means 14-16 will be in the arm64 tree without the
fix for a while.
Anyway, 13 doesn't depend on anything before it in the series so I can gather
the fixes into a series of 4 as you suggest. Then the improvements become a
series of 12. And both can go via arm64?
I'll gather review comments then re-post as 2 series for v2; assuming
Will/Catalin are happy.
Thanks,
Ryan
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 01/16] mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear()
2025-02-06 5:03 ` Anshuman Khandual
@ 2025-02-06 12:15 ` Ryan Roberts
0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-02-06 12:15 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel, Huacai Chen,
Thomas Bogendoerfer, James E.J. Bottomley, Helge Deller,
Madhavan Srinivasan, Michael Ellerman, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Gerald Schaefer, David S. Miller,
Andreas Larsson, stable
Thanks for the review!
On 06/02/2025 05:03, Anshuman Khandual wrote:
>
>
> On 2/5/25 20:39, Ryan Roberts wrote:
>> In order to fix a bug, arm64 needs to be told the size of the huge page
>> for which the huge_pte is being set in huge_ptep_get_and_clear().
>> Provide for this by adding an `unsigned long sz` parameter to the
>> function. This follows the same pattern as huge_pte_clear() and
>> set_huge_pte_at().
>>
>> This commit makes the required interface modifications to the core mm as
>> well as all arches that implement this function (arm64, loongarch, mips,
>> parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed
>> in a separate commit.
>>
>> Cc: <stable@vger.kernel.org>
>> Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> arch/arm64/include/asm/hugetlb.h | 4 ++--
>> arch/arm64/mm/hugetlbpage.c | 8 +++++---
>> arch/loongarch/include/asm/hugetlb.h | 6 ++++--
>> arch/mips/include/asm/hugetlb.h | 6 ++++--
>> arch/parisc/include/asm/hugetlb.h | 2 +-
>> arch/parisc/mm/hugetlbpage.c | 2 +-
>> arch/powerpc/include/asm/hugetlb.h | 6 ++++--
>> arch/riscv/include/asm/hugetlb.h | 3 ++-
>> arch/riscv/mm/hugetlbpage.c | 2 +-
>> arch/s390/include/asm/hugetlb.h | 12 ++++++++----
>> arch/s390/mm/hugetlbpage.c | 10 ++++++++--
>> arch/sparc/include/asm/hugetlb.h | 2 +-
>> arch/sparc/mm/hugetlbpage.c | 2 +-
>> include/asm-generic/hugetlb.h | 2 +-
>> include/linux/hugetlb.h | 4 +++-
>> mm/hugetlb.c | 4 ++--
>> 16 files changed, 48 insertions(+), 27 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
>> index c6dff3e69539..03db9cb21ace 100644
>> --- a/arch/arm64/include/asm/hugetlb.h
>> +++ b/arch/arm64/include/asm/hugetlb.h
>> @@ -42,8 +42,8 @@ extern int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>> unsigned long addr, pte_t *ptep,
>> pte_t pte, int dirty);
>> #define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
>> -extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
>> - unsigned long addr, pte_t *ptep);
>> +extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>> + pte_t *ptep, unsigned long sz);
>
> If VMA could be passed instead of MM, the size of the huge page can
> be derived via huge_page_size(hstate_vma(vma)) and another argument
> here need not be added. Also MM can be derived from VMA if required.
I considered this approach; in fact that's what I first implemented when fixing
an equivalent bug on set_huge_pte_at() in the past. But that was problematic
because there are some cases where the helper is used for kernel mappings (see
vmalloc) and there is no VMA to pass in that case. See [1].
To fix this bug, usage of this helper for kernel mappings is not an issue (yet)
so I guess technically it could be fixed by passing VMA. But later in this
series I start using huge_ptep_get_and_clear() for the vmap (see patch 11) so it
would break at that point.
Another approach I considered was to allocate a spare swap-pte bit (we have a
few) to indicate PTE_CONT for non-present PTEs. Then no API change is required
at all. But given set_huge_pte_at() and huge_pte_clear() already pass "sz", it
seemed best just to keep things simple and follow that pattern.
[1] https://lore.kernel.org/all/20230922115804.2043771-1-ryan.roberts@arm.com/
Thanks,
Ryan
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 02/16] arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes
2025-02-06 6:15 ` Anshuman Khandual
@ 2025-02-06 12:55 ` Ryan Roberts
2025-02-12 14:44 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-06 12:55 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel, stable
On 06/02/2025 06:15, Anshuman Khandual wrote:
> On 2/5/25 20:39, Ryan Roberts wrote:
>> arm64 supports multiple huge_pte sizes. Some of the sizes are covered by
>> a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some
>> are covered by multiple ptes at a particular level (CONT_PTE_SIZE,
>> CONT_PMD_SIZE). So the function has to figure out the size from the
>> huge_pte pointer. This was previously done by walking the pgtable to
>> determine the level, then using the PTE_CONT bit to determine the number
>> of ptes.
>
> Actually PTE_CONT gets used to determine whether the entry is a normal i.e.
> PMD/PUD based huge page or a cont PTE/PMD based huge page, just to call
> into the standard __ptep_get_and_clear() or the specific get_clear_contig(),
> after working out the entry count via find_num_contig() by walking the page table.
>
> PTE_CONT presence is only used to determine the switch above but not
> to determine the number of ptes for the mapping as mentioned earlier.
Sorry I don't really follow your distinction; PTE_CONT is used to decide whether
we are operating on a single entry (pte_cont()==false) or on multiple entries
(pte_cont()==true). For the multiple entry case, the level tells you the exact
number.
I can certainly tidy up this description a bit, but I think we both agree that
the value of PTE_CONT is one of the inputs into deciding how many entries need
to be operated on?
>
> There are two similar functions which determines the
>
> static int find_num_contig(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, size_t *pgsize)
> {
> pgd_t *pgdp = pgd_offset(mm, addr);
> p4d_t *p4dp;
> pud_t *pudp;
> pmd_t *pmdp;
>
> *pgsize = PAGE_SIZE;
> p4dp = p4d_offset(pgdp, addr);
> pudp = pud_offset(p4dp, addr);
> pmdp = pmd_offset(pudp, addr);
> if ((pte_t *)pmdp == ptep) {
> *pgsize = PMD_SIZE;
> return CONT_PMDS;
> }
> return CONT_PTES;
> }
>
> find_num_contig() already assumes that the entry is a contig huge page and
> just finds whether it is a PMD or PTE based one. This always requires a
> prior pte_cont() check to confirm the PTE_CONT bit is set before calling
> find_num_contig() in each instance.
Agreed.
>
> But num_contig_ptes() can get the same information without walking the
> page table and thus without predetermining if PTE_CONT is set or not.
> The size can be derived from the VMA argument when present.
Also agreed. But VMA is not provided to this function. And because we want to
use it for kernel space mappings, I think it's a bad idea to pass VMA.
>
> static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
> {
> int contig_ptes = 0;
>
> *pgsize = size;
>
> switch (size) {
> #ifndef __PAGETABLE_PMD_FOLDED
> case PUD_SIZE:
> if (pud_sect_supported())
> contig_ptes = 1;
> break;
> #endif
> case PMD_SIZE:
> contig_ptes = 1;
> break;
> case CONT_PMD_SIZE:
> *pgsize = PMD_SIZE;
> contig_ptes = CONT_PMDS;
> break;
> case CONT_PTE_SIZE:
> *pgsize = PAGE_SIZE;
> contig_ptes = CONT_PTES;
> break;
> }
>
> return contig_ptes;
> }
>
> On a side note, why cannot num_contig_ptes() be used all the time and
> find_num_contig() be dropped ? OR am I missing something here.
There are 2 remaining users of find_num_contig() after my series:
huge_ptep_set_access_flags() and huge_ptep_set_wrprotect(). Both of them can
only be legitimately called for present ptes (so it's safe to check pte_cont()).
huge_ptep_set_access_flags() already has the VMA so it would be easy to convert
to num_contig_ptes(). huge_ptep_set_wrprotect() doesn't have the VMA but I guess
you could do the trick where you take the size of the folio that the pte points to?
So yes, I think we could drop find_num_contig() and I agree it would be an
improvement.
But to be honest, grabbing the folio size also feels like a hack to me (we do
this in other places too). While today, the folio size is guaranteed to be
the same size as the huge pte in practice, I'm not sure there is any spec that
mandates that?
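For reference, the folio-size trick would look roughly like this in
huge_ptep_set_wrprotect() (sketch only; it leans on the assumption above that
the folio size matches the huge_pte size):

        pte_t pte = __ptep_get(ptep);
        struct folio *folio = page_folio(pte_page(pte));
        size_t pgsize;
        int ncontig;

        /* infer the huge_pte geometry from the size of the mapped folio */
        ncontig = num_contig_ptes(folio_size(folio), &pgsize);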
Perhaps the most robust thing is to just have a PTE_CONT bit for the swap-pte so
we can tell the size of both present and non-present ptes, then do the table
walk trick to find the level. Shrug.
>
>>
>> But the PTE_CONT bit is only valid when the pte is present. For
>> non-present pte values (e.g. markers, migration entries), the previous
>> implementation was therefore erroneously determining the size. There is
>> at least one known caller in core-mm, move_huge_pte(), which may call
>> huge_ptep_get_and_clear() for a non-present pte. So we must be robust to
>> this case. Additionally the "regular" ptep_get_and_clear() is robust to
>> being called for non-present ptes so it makes sense to follow the
>> behaviour.
>
> With a VMA argument and num_contig_ptes(), the dependency on PTE_CONT being set
> and the entry being mapped might not be required.
> >>
>> Fix this by using the new sz parameter which is now provided to the
>> function. Additionally when clearing each pte in a contig range, don't
>> gather the access and dirty bits if the pte is not present.
>
> Makes sense.
>
>>
>> An alternative approach that would not require API changes would be to
>> store the PTE_CONT bit in a spare bit in the swap entry pte. But it felt
>> cleaner to follow other APIs' lead and just pass in the size.
>
> Right, changing the arguments in the API will help solve this problem.
>
>>
>> While we are at it, add some debug warnings in functions that require
>> the pte is present.
>>
>> As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap
>> entry offset field (layout of non-present pte). Since hugetlb is never
>> swapped to disk, this field will only be populated for markers, which
>> always set this bit to 0 and hwpoison swap entries, which set the offset
>> field to a PFN; So it would only ever be 1 for a 52-bit PVA system where
>> memory in that high half was poisoned (I think!). So in practice, this
>> bit would almost always be zero for non-present ptes and we would only
>> clear the first entry if it was actually a contiguous block. That's
>> probably a less severe symptom than if it was always interpreted as 1
>> and cleared out potentially-present neighboring PTEs.
>>
>> Cc: <stable@vger.kernel.org>
>> Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> arch/arm64/mm/hugetlbpage.c | 54 ++++++++++++++++++++-----------------
>> 1 file changed, 29 insertions(+), 25 deletions(-)
>>
>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>> index 06db4649af91..328eec4bfe55 100644
>> --- a/arch/arm64/mm/hugetlbpage.c
>> +++ b/arch/arm64/mm/hugetlbpage.c
>> @@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
>> unsigned long pgsize,
>> unsigned long ncontig)
>> {
>> - pte_t orig_pte = __ptep_get(ptep);
>> - unsigned long i;
>> -
>> - for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
>> - pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
>> -
>> - /*
>> - * If HW_AFDBM is enabled, then the HW could turn on
>> - * the dirty or accessed bit for any page in the set,
>> - * so check them all.
>> - */
>> - if (pte_dirty(pte))
>> - orig_pte = pte_mkdirty(orig_pte);
>> -
>> - if (pte_young(pte))
>> - orig_pte = pte_mkyoung(orig_pte);
>> + pte_t pte, tmp_pte;
>> + bool present;
>> +
>> + pte = __ptep_get_and_clear(mm, addr, ptep);
>> + present = pte_present(pte);
>> + while (--ncontig) {
>
> Although this does the right thing by calling __ptep_get_and_clear() once
> for non-contig huge pages, I'm wondering if the cont/non-cont separation should
> be maintained in the caller huge_ptep_get_and_clear(), keeping the current
> logical bifurcation intact.
To what benefit?
>
>> + ptep++;
>> + addr += pgsize;
>> + tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
>> + if (present) {
>
> Checking for present entries makes sense here.
>
>> + if (pte_dirty(tmp_pte))
>> + pte = pte_mkdirty(pte);
>> + if (pte_young(tmp_pte))
>> + pte = pte_mkyoung(pte);
>> + }
>> }
>> - return orig_pte;
>> + return pte;
>> }
>>
>> static pte_t get_clear_contig_flush(struct mm_struct *mm,
>> @@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>> {
>> int ncontig;
>> size_t pgsize;
>> - pte_t orig_pte = __ptep_get(ptep);
>> -
>> - if (!pte_cont(orig_pte))
>> - return __ptep_get_and_clear(mm, addr, ptep);
>> -
>> - ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>>
>> + ncontig = num_contig_ptes(sz, &pgsize);
>
> __ptep_get_and_clear() can still be called here if 'ncontig' is
> returned as 0, indicating a normal non-contig huge page, thus
> keeping get_clear_contig() unchanged just to handle contig huge
> pages.
I think you're describing the case where num_contig_ptes() returns 0? The
intention, from my reading of the function, is that num_contig_ptes() returns
the number of ptes that need to be operated on (e.g. 1 for a single entry or N
for a contig block). It will only return 0 if called with an invalid huge size.
I don't believe it will ever "return 0 indicating a normal non-contig huge page".
Perhaps the right solution is to add a warning if returning 0?
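Something like this in huge_ptep_get_and_clear() is what I have in mind (sketch):

        ncontig = num_contig_ptes(sz, &pgsize);
        VM_WARN_ON(!ncontig);   /* 0 here would mean an invalid huge page size */
        return get_clear_contig(mm, addr, ptep, pgsize, ncontig);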
>
>> return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
>> }
>>
>> @@ -451,6 +445,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>> pgprot_t hugeprot;
>> pte_t orig_pte;
>>
>> + VM_WARN_ON(!pte_present(pte));
>> +
>> if (!pte_cont(pte))
>> return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
>>
>> @@ -461,6 +457,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>> return 0;
>>
>> orig_pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
>> + VM_WARN_ON(!pte_present(orig_pte));
>>
>> /* Make sure we don't lose the dirty or young state */
>> if (pte_dirty(orig_pte))
>> @@ -485,7 +482,10 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
>> size_t pgsize;
>> pte_t pte;
>>
>> - if (!pte_cont(__ptep_get(ptep))) {
>> + pte = __ptep_get(ptep);
>> + VM_WARN_ON(!pte_present(pte));
>> +
>> + if (!pte_cont(pte)) {
>> __ptep_set_wrprotect(mm, addr, ptep);
>> return;
>> }
>> @@ -509,8 +509,12 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
>> struct mm_struct *mm = vma->vm_mm;
>> size_t pgsize;
>> int ncontig;
>> + pte_t pte;
>>
>> - if (!pte_cont(__ptep_get(ptep)))
>> + pte = __ptep_get(ptep);
>> + VM_WARN_ON(!pte_present(pte));
>> +
>> + if (!pte_cont(pte))
>> return ptep_clear_flush(vma, addr, ptep);
>>
>> ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>
> In all the above instances, should not num_contig_ptes() be called to determine
> whether a given entry is a non-contig or contig huge page, thus dropping the need for
> the pte_cont() and pte_present() tests as proposed here?
Yeah maybe. But as per above, we have options for how to do that. I'm not sure
which is preferable at the moment. What do you think? Regardless, I think that
cleanup would be a separate patch (which I'm happy to add for v2). For this bug
fix, I was trying to do the minimum.
Thanks,
Ryan
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 03/16] arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level
2025-02-06 6:46 ` Anshuman Khandual
@ 2025-02-06 13:04 ` Ryan Roberts
2025-02-13 4:57 ` Anshuman Khandual
0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-06 13:04 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel, stable
On 06/02/2025 06:46, Anshuman Khandual wrote:
>
>
> On 2/5/25 20:39, Ryan Roberts wrote:
>> commit c910f2b65518 ("arm64/mm: Update tlb invalidation routines for
>> FEAT_LPA2") changed the "invalidation level unknown" hint from 0 to
>> TLBI_TTL_UNKNOWN (INT_MAX). But the fallback "unknown level" path in
>> flush_hugetlb_tlb_range() was not updated. So as it stands, when trying
>> to invalidate CONT_PMD_SIZE or CONT_PTE_SIZE hugetlb mappings, we will
>> spuriously try to invalidate at level 0 on LPA2-enabled systems.
>>
>> Fix this so that the fallback passes TLBI_TTL_UNKNOWN, and while we are
>> at it, explicitly use the correct stride and level for CONT_PMD_SIZE and
>> CONT_PTE_SIZE, which should provide a minor optimization.
>>
>> Cc: <stable@vger.kernel.org>
>> Fixes: c910f2b65518 ("arm64/mm: Update tlb invalidation routines for FEAT_LPA2")
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> arch/arm64/include/asm/hugetlb.h | 20 ++++++++++++++------
>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
>> index 03db9cb21ace..8ab9542d2d22 100644
>> --- a/arch/arm64/include/asm/hugetlb.h
>> +++ b/arch/arm64/include/asm/hugetlb.h
>> @@ -76,12 +76,20 @@ static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
>> {
>> unsigned long stride = huge_page_size(hstate_vma(vma));
>>
>> - if (stride == PMD_SIZE)
>> - __flush_tlb_range(vma, start, end, stride, false, 2);
>> - else if (stride == PUD_SIZE)
>> - __flush_tlb_range(vma, start, end, stride, false, 1);
>> - else
>> - __flush_tlb_range(vma, start, end, PAGE_SIZE, false, 0);
>> + switch (stride) {
>> + case PUD_SIZE:
>> + __flush_tlb_range(vma, start, end, PUD_SIZE, false, 1);
>> + break;
>
> Just wondering - should not !__PAGETABLE_PMD_FOLDED and pud_sect_supported()
> checks also be added here for this PUD_SIZE case ?
Yeah I guess so. TBH, it's never been entirely clear to me what the benefit is?
Is it just to remove (a tiny amount of) dead code when we know we don't support
blocks at the level? Or is there something more fundamental going on that I've
missed?
We seem to be quite inconsistent with the use of pud_sect_supported() in
hugetlbpage.c.
Anyway, I'll add this in, I guess it's preferable to follow the established pattern.
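i.e. roughly (sketch, following the pattern used elsewhere in hugetlbpage.c;
note stride can only be PUD_SIZE when PUD sections are actually supported, so
the extra check is belt-and-braces):

        switch (stride) {
#ifndef __PAGETABLE_PMD_FOLDED
        case PUD_SIZE:
                if (pud_sect_supported())
                        __flush_tlb_range(vma, start, end, PUD_SIZE, false, 1);
                break;
#endif
        /* remaining cases unchanged */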
Thanks,
Ryan
>
>> + case CONT_PMD_SIZE:
>> + case PMD_SIZE:
>> + __flush_tlb_range(vma, start, end, PMD_SIZE, false, 2);
>> + break;
>> + case CONT_PTE_SIZE:
>> + __flush_tlb_range(vma, start, end, PAGE_SIZE, false, 3);
>> + break;
>> + default:
>> + __flush_tlb_range(vma, start, end, PAGE_SIZE, false, TLBI_TTL_UNKNOWN);
>> + }
>> }
>>
>> #endif /* __ASM_HUGETLB_H */
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 05/16] mm/page_table_check: Batch-check pmds/puds just like ptes
2025-02-06 10:55 ` Anshuman Khandual
@ 2025-02-06 13:07 ` Ryan Roberts
0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-02-06 13:07 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 06/02/2025 10:55, Anshuman Khandual wrote:
> On 2/5/25 20:39, Ryan Roberts wrote:
>> Convert page_table_check_p[mu]d_set(...) to
>> page_table_check_p[mu]ds_set(..., nr) to allow checking a contiguous set
>> of pmds/puds in single batch. We retain page_table_check_p[mu]d_set(...)
>> as macros that call new batch functions with nr=1 for compatibility.
>>
>> arm64 is about to reorganise its pte/pmd/pud helpers to reuse more code
>> and to allow the implementation for huge_pte to more efficiently set
>> ptes/pmds/puds in batches. We need these batch-helpers to make the
>> refactoring possible.
>
> A very small nit.
>
> Although the justification here is reasonable enough, I'm not sure if
> platform-specific requirements need to be spelled out in such detail
> for a generic MM change.
Personally I find the "why" to be the most useful part of a commit log. I'd
prefer to leave it in unless any mm people complain.
>
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>
> Regardless, LGTM.
>
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Thanks!
>
>> include/linux/page_table_check.h | 30 +++++++++++++++++-----------
>> mm/page_table_check.c | 34 +++++++++++++++++++-------------
>> 2 files changed, 38 insertions(+), 26 deletions(-)
>>
>> diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h
>> index 6722941c7cb8..289620d4aad3 100644
>> --- a/include/linux/page_table_check.h
>> +++ b/include/linux/page_table_check.h
>> @@ -19,8 +19,10 @@ void __page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd);
>> void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud);
>> void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
>> unsigned int nr);
>> -void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd);
>> -void __page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, pud_t pud);
>> +void __page_table_check_pmds_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd,
>> + unsigned int nr);
>> +void __page_table_check_puds_set(struct mm_struct *mm, pud_t *pudp, pud_t pud,
>> + unsigned int nr);
>> void __page_table_check_pte_clear_range(struct mm_struct *mm,
>> unsigned long addr,
>> pmd_t pmd);
>> @@ -74,22 +76,22 @@ static inline void page_table_check_ptes_set(struct mm_struct *mm,
>> __page_table_check_ptes_set(mm, ptep, pte, nr);
>> }
>>
>> -static inline void page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp,
>> - pmd_t pmd)
>> +static inline void page_table_check_pmds_set(struct mm_struct *mm,
>> + pmd_t *pmdp, pmd_t pmd, unsigned int nr)
>> {
>> if (static_branch_likely(&page_table_check_disabled))
>> return;
>>
>> - __page_table_check_pmd_set(mm, pmdp, pmd);
>> + __page_table_check_pmds_set(mm, pmdp, pmd, nr);
>> }
>>
>> -static inline void page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp,
>> - pud_t pud)
>> +static inline void page_table_check_puds_set(struct mm_struct *mm,
>> + pud_t *pudp, pud_t pud, unsigned int nr)
>> {
>> if (static_branch_likely(&page_table_check_disabled))
>> return;
>>
>> - __page_table_check_pud_set(mm, pudp, pud);
>> + __page_table_check_puds_set(mm, pudp, pud, nr);
>> }
>>
>> static inline void page_table_check_pte_clear_range(struct mm_struct *mm,
>> @@ -129,13 +131,13 @@ static inline void page_table_check_ptes_set(struct mm_struct *mm,
>> {
>> }
>>
>> -static inline void page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp,
>> - pmd_t pmd)
>> +static inline void page_table_check_pmds_set(struct mm_struct *mm,
>> + pmd_t *pmdp, pmd_t pmd, unsigned int nr)
>> {
>> }
>>
>> -static inline void page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp,
>> - pud_t pud)
>> +static inline void page_table_check_puds_set(struct mm_struct *mm,
>> + pud_t *pudp, pud_t pud, unsigned int nr)
>> {
>> }
>>
>> @@ -146,4 +148,8 @@ static inline void page_table_check_pte_clear_range(struct mm_struct *mm,
>> }
>>
>> #endif /* CONFIG_PAGE_TABLE_CHECK */
>> +
>> +#define page_table_check_pmd_set(mm, pmdp, pmd) page_table_check_pmds_set(mm, pmdp, pmd, 1)
>> +#define page_table_check_pud_set(mm, pudp, pud) page_table_check_puds_set(mm, pudp, pud, 1)
>> +
>> #endif /* __LINUX_PAGE_TABLE_CHECK_H */
>> diff --git a/mm/page_table_check.c b/mm/page_table_check.c
>> index 509c6ef8de40..dae4a7d776b3 100644
>> --- a/mm/page_table_check.c
>> +++ b/mm/page_table_check.c
>> @@ -234,33 +234,39 @@ static inline void page_table_check_pmd_flags(pmd_t pmd)
>> WARN_ON_ONCE(swap_cached_writable(pmd_to_swp_entry(pmd)));
>> }
>>
>> -void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd)
>> +void __page_table_check_pmds_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd,
>> + unsigned int nr)
>> {
>> + unsigned int i;
>> + unsigned long stride = PMD_SIZE >> PAGE_SHIFT;
>> +
>> if (&init_mm == mm)
>> return;
>>
>> page_table_check_pmd_flags(pmd);
>>
>> - __page_table_check_pmd_clear(mm, *pmdp);
>> - if (pmd_user_accessible_page(pmd)) {
>> - page_table_check_set(pmd_pfn(pmd), PMD_SIZE >> PAGE_SHIFT,
>> - pmd_write(pmd));
>> - }
>> + for (i = 0; i < nr; i++)
>> + __page_table_check_pmd_clear(mm, *(pmdp + i));
>> + if (pmd_user_accessible_page(pmd))
>> + page_table_check_set(pmd_pfn(pmd), stride * nr, pmd_write(pmd));
>> }
>> -EXPORT_SYMBOL(__page_table_check_pmd_set);
>> +EXPORT_SYMBOL(__page_table_check_pmds_set);
>>
>> -void __page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, pud_t pud)
>> +void __page_table_check_puds_set(struct mm_struct *mm, pud_t *pudp, pud_t pud,
>> + unsigned int nr)
>> {
>> + unsigned int i;
>> + unsigned long stride = PUD_SIZE >> PAGE_SHIFT;
>> +
>> if (&init_mm == mm)
>> return;
>>
>> - __page_table_check_pud_clear(mm, *pudp);
>> - if (pud_user_accessible_page(pud)) {
>> - page_table_check_set(pud_pfn(pud), PUD_SIZE >> PAGE_SHIFT,
>> - pud_write(pud));
>> - }
>> + for (i = 0; i < nr; i++)
>> + __page_table_check_pud_clear(mm, *(pudp + i));
>> + if (pud_user_accessible_page(pud))
>> + page_table_check_set(pud_pfn(pud), stride * nr, pud_write(pud));
>> }
>> -EXPORT_SYMBOL(__page_table_check_pud_set);
>> +EXPORT_SYMBOL(__page_table_check_puds_set);
>>
>> void __page_table_check_pte_clear_range(struct mm_struct *mm,
>> unsigned long addr,
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 06/16] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
2025-02-06 11:48 ` Anshuman Khandual
@ 2025-02-06 13:26 ` Ryan Roberts
2025-02-07 9:38 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-06 13:26 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 06/02/2025 11:48, Anshuman Khandual wrote:
> On 2/5/25 20:39, Ryan Roberts wrote:
>> Refactor __set_ptes(), set_pmd_at() and set_pud_at() so that they are
>> all a thin wrapper around a generic ___set_ptes(), which takes pgsize
>
> s/generic/common - as generic might be misleading for being generic MM.
Good spot. I'll make this change.
>
>> parameter. This cleans up the code to remove the confusing
>> __set_pte_at() (which was only ever used for pmd/pud) and will allow us
>> to perform future barrier optimizations in a single place. Additionally,
>> it will permit the huge_pte API to efficiently batch-set pgtable entries
>> and take advantage of the future barrier optimizations.
>
> Makes sense.
>
>>
>> ___set_ptes() calls the correct page_table_check_*_set() function based
>> on the pgsize. This means that huge_ptes will be able to get proper coverage
>> regardless of their size, once it's plumbed into huge_pte. Currently the
>> huge_pte API always uses the pte API which assumes an entry only covers
>> a single page.
>
> Right
>
>>
>> While we are at it, refactor __ptep_get_and_clear() and
>> pmdp_huge_get_and_clear() to use a common ___ptep_get_and_clear() which
>> also takes a pgsize parameter. This will provide the huge_pte API the
>> means to clear ptes corresponding with the way they were set.
>
> __ptep_get_and_clear() refactoring does not seem to be related to the
> previous __set_ptes(). Should they be separated out into two different
> patches instead - for better clarity and review? Both these cleanups
> have enough change and can stand on their own.
Yeah I think you're probably right... I was being lazy... I'll separate them.
>
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 108 +++++++++++++++++++------------
>> 1 file changed, 67 insertions(+), 41 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 0b2a2ad1b9e8..3b55d9a15f05 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -420,23 +420,6 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>> return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
>> }
>>
>> -static inline void __set_ptes(struct mm_struct *mm,
>> - unsigned long __always_unused addr,
>> - pte_t *ptep, pte_t pte, unsigned int nr)
>> -{
>> - page_table_check_ptes_set(mm, ptep, pte, nr);
>> - __sync_cache_and_tags(pte, nr);
>> -
>> - for (;;) {
>> - __check_safe_pte_update(mm, ptep, pte);
>> - __set_pte(ptep, pte);
>> - if (--nr == 0)
>> - break;
>> - ptep++;
>> - pte = pte_advance_pfn(pte, 1);
>> - }
>> -}
>> -
>> /*
>> * Hugetlb definitions.
>> */
>> @@ -641,30 +624,59 @@ static inline pgprot_t pud_pgprot(pud_t pud)
>> return __pgprot(pud_val(pfn_pud(pfn, __pgprot(0))) ^ pud_val(pud));
>> }
>>
>> -static inline void __set_pte_at(struct mm_struct *mm,
>> - unsigned long __always_unused addr,
>> - pte_t *ptep, pte_t pte, unsigned int nr)
>> +static inline void ___set_ptes(struct mm_struct *mm, pte_t *ptep, pte_t pte,
>> + unsigned int nr, unsigned long pgsize)
>
> So address got dropped and page size got added as an argument.
Yeah; addr is never used in our implementations so I figured why haul it around
everywhere?
> s/___set_ptes/___set_pxds ? to be more generic for all levels.
So now we are into naming... I agree that in some senses pte feels specific to
the last level. But its long form "page table entry" seems more generic than
"pxd" which implies only pmd, pud, p4d and pgd. At least to me...
I think we got stuck trying to figure out a clear and short term for "page table
entry at any level" in the past. I think ttd was the best we got; Translation
Table Descriptor, which is the term the Arm ARM uses. But that opens a can of
worms as now we need ttd_t and all the converters pte_ttd(), ttd_pte(),
pmd_ttd(), ... and probably a bunch more stuff on top.
So personally I prefer to take the coward's way out and just reuse pte.
>
>> {
>> - __sync_cache_and_tags(pte, nr);
>> - __check_safe_pte_update(mm, ptep, pte);
>> - __set_pte(ptep, pte);
>> + unsigned long stride = pgsize >> PAGE_SHIFT;
>> +
>> + switch (pgsize) {
>> + case PAGE_SIZE:
>> + page_table_check_ptes_set(mm, ptep, pte, nr);
>> + break;
>> + case PMD_SIZE:
>> + page_table_check_pmds_set(mm, (pmd_t *)ptep, pte_pmd(pte), nr);
>> + break;
>> + case PUD_SIZE:
>> + page_table_check_puds_set(mm, (pud_t *)ptep, pte_pud(pte), nr);
>> + break;
>
> This is where the new page table check APIs get used for batch testing.
Yes and I anticipate that the whole switch block should be optimized out when
page_table_check is disabled.
>
>> + default:
>> + VM_WARN_ON(1);
>> + }
>> +
>> + __sync_cache_and_tags(pte, nr * stride);
>> +
>> + for (;;) {
>> + __check_safe_pte_update(mm, ptep, pte);
>> + __set_pte(ptep, pte);
>> + if (--nr == 0)
>> + break;
>> + ptep++;
>> + pte = pte_advance_pfn(pte, stride);
>> + }
>> }
>
> LGTM
>
>>
>> -static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>> - pmd_t *pmdp, pmd_t pmd)
>> +static inline void __set_ptes(struct mm_struct *mm,
>> + unsigned long __always_unused addr,
>> + pte_t *ptep, pte_t pte, unsigned int nr)
>> {
>> - page_table_check_pmd_set(mm, pmdp, pmd);
>> - return __set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd),
>> - PMD_SIZE >> PAGE_SHIFT);
>> + ___set_ptes(mm, ptep, pte, nr, PAGE_SIZE);
>> }
>>
>> -static inline void set_pud_at(struct mm_struct *mm, unsigned long addr,
>> - pud_t *pudp, pud_t pud)
>> +static inline void __set_pmds(struct mm_struct *mm,
>> + unsigned long __always_unused addr,
>> + pmd_t *pmdp, pmd_t pmd, unsigned int nr)
>> +{
>> + ___set_ptes(mm, (pte_t *)pmdp, pmd_pte(pmd), nr, PMD_SIZE);
>> +}
>> +#define set_pmd_at(mm, addr, pmdp, pmd) __set_pmds(mm, addr, pmdp, pmd, 1)
>> +
>> +static inline void __set_puds(struct mm_struct *mm,
>> + unsigned long __always_unused addr,
>> + pud_t *pudp, pud_t pud, unsigned int nr)
>> {
>> - page_table_check_pud_set(mm, pudp, pud);
>> - return __set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud),
>> - PUD_SIZE >> PAGE_SHIFT);
>> + ___set_ptes(mm, (pte_t *)pudp, pud_pte(pud), nr, PUD_SIZE);
>> }
>> +#define set_pud_at(mm, addr, pudp, pud) __set_puds(mm, addr, pudp, pud, 1)
>
> LGTM
>
>>
>> #define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
>> #define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)
>> @@ -1276,16 +1288,34 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>> }
>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
>>
>> -static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
>> - unsigned long address, pte_t *ptep)
>> +static inline pte_t ___ptep_get_and_clear(struct mm_struct *mm, pte_t *ptep,
>> + unsigned long pgsize)
>
> So address got dropped and page size got added as an argument.
>
>> {
>> pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
>>
>> - page_table_check_pte_clear(mm, pte);
>> + switch (pgsize) {
>> + case PAGE_SIZE:
>> + page_table_check_pte_clear(mm, pte);
>> + break;
>> + case PMD_SIZE:
>> + page_table_check_pmd_clear(mm, pte_pmd(pte));
>> + break;
>> + case PUD_SIZE:
>> + page_table_check_pud_clear(mm, pte_pud(pte));
>> + break;
>> + default:
>> + VM_WARN_ON(1);
>> + }
>>
>> return pte;
>> }
>>
>> +static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
>> + unsigned long address, pte_t *ptep)
>> +{
>> + return ___ptep_get_and_clear(mm, ptep, PAGE_SIZE);
>> +}
>> +
>> static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep, unsigned int nr, int full)
>> {
>> @@ -1322,11 +1352,7 @@ static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
>> static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
>> unsigned long address, pmd_t *pmdp)
>> {
>> - pmd_t pmd = __pmd(xchg_relaxed(&pmd_val(*pmdp), 0));
>> -
>> - page_table_check_pmd_clear(mm, pmd);
>> -
>> - return pmd;
>> + return pte_pmd(___ptep_get_and_clear(mm, (pte_t *)pmdp, PMD_SIZE));
>> }
>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>
> Although currently there is no pudp_huge_get_and_clear() helper on arm64, the
> reworked ___ptep_get_and_clear() will be able to support that as well if
> and when required, as it now handles the PUD_SIZE page size.
yep.
Thanks for all your review so far!
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 07/16] arm64: hugetlb: Use ___set_ptes() and ___ptep_get_and_clear()
2025-02-05 15:09 ` [PATCH v1 07/16] arm64: hugetlb: Use ___set_ptes() and ___ptep_get_and_clear() Ryan Roberts
@ 2025-02-07 4:09 ` Anshuman Khandual
2025-02-07 10:00 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-07 4:09 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/5/25 20:39, Ryan Roberts wrote:
> Refactor the huge_pte helpers to use the new generic ___set_ptes() and
> ___ptep_get_and_clear() APIs.
>
> This provides 2 benefits; First, when page_table_check=on, hugetlb is
> now properly/fully checked. Previously only the first page of a hugetlb
PAGE_TABLE_CHECK will be fully supported now in hugetlb irrespective of
the page table level. This is definitely an improvement.
> folio was checked. Second, instead of having to call __set_ptes(nr=1)
> for each pte in a loop, the whole contiguous batch can now be set in one
> go, which enables some efficiencies and cleans up the code.
Improvements done to the common __set_ptes() will automatically be available
for hugetlb pages as well. This converges all batch updates into a single
helper, i.e. __set_ptes(), which can be optimized further in a single place.
Makes sense.
>
> One detail to note is that huge_ptep_clear_flush() was previously
> calling ptep_clear_flush() for a non-contiguous pte (i.e. a pud or pmd
> block mapping). This has a couple of disadvantages; first
> ptep_clear_flush() calls ptep_get_and_clear() which transparently
> handles contpte. Given we only call for non-contiguous ptes, it would be
> safe, but a waste of effort. It's preferable to go stright to the layer
A small nit - typo s/stright/straight
> below. However, more problematic is that ptep_get_and_clear() is for
> PAGE_SIZE entries so it calls page_table_check_pte_clear() and would not
> clear the whole hugetlb folio. So let's stop special-casing the non-cont
> case and just rely on get_clear_contig_flush() to do the right thing for
> non-cont entries.
Like before, this change is unrelated to all the conversions done earlier for
the set and clear paths above using the new helpers. Hence ideally it should
be separated out into a different patch.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/mm/hugetlbpage.c | 50 ++++++++-----------------------------
> 1 file changed, 11 insertions(+), 39 deletions(-)
>
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index e870d01d12ea..02afee31444e 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -166,12 +166,12 @@ static pte_t get_clear_contig(struct mm_struct *mm,
> pte_t pte, tmp_pte;
> bool present;
>
> - pte = __ptep_get_and_clear(mm, addr, ptep);
> + pte = ___ptep_get_and_clear(mm, ptep, pgsize);
> present = pte_present(pte);
> while (--ncontig) {
> ptep++;
> addr += pgsize;
> - tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
> + tmp_pte = ___ptep_get_and_clear(mm, ptep, pgsize);
> if (present) {
> if (pte_dirty(tmp_pte))
> pte = pte_mkdirty(pte);
> @@ -215,7 +215,7 @@ static void clear_flush(struct mm_struct *mm,
> unsigned long i, saddr = addr;
>
> for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
> - __ptep_get_and_clear(mm, addr, ptep);
> + ___ptep_get_and_clear(mm, ptep, pgsize);
>
> __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
> }
___ptep_get_and_clear() will have the opportunity to call page_table_check_pxx_clear()
depending on the page size passed unlike the current scenario.
> @@ -226,32 +226,20 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
> size_t pgsize;
> int i;
> int ncontig;
> - unsigned long pfn, dpfn;
> - pgprot_t hugeprot;
>
> ncontig = num_contig_ptes(sz, &pgsize);
>
> if (!pte_present(pte)) {
> for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
> - __set_ptes(mm, addr, ptep, pte, 1);
> + ___set_ptes(mm, ptep, pte, 1, pgsize);
IIUC __set_ptes() wrapper is still around in the header. So what's the benefit of
converting this into ___set_ptes() ? __set_ptes() gets dropped eventually ?
> return;
> }
>
> - if (!pte_cont(pte)) {
> - __set_ptes(mm, addr, ptep, pte, 1);
> - return;
> - }
> -
> - pfn = pte_pfn(pte);
> - dpfn = pgsize >> PAGE_SHIFT;
> - hugeprot = pte_pgprot(pte);
> -
> /* Only need to "break" if transitioning valid -> valid. */
> - if (pte_valid(__ptep_get(ptep)))
> + if (pte_cont(pte) && pte_valid(__ptep_get(ptep)))
> clear_flush(mm, addr, ptep, pgsize, ncontig);
>
> - for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
> - __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
> + ___set_ptes(mm, ptep, pte, ncontig, pgsize);
> }
Similarly __set_ptes() will have the opportunity to call page_table_check_pxx_set()
depending on the page size passed unlike the current scenario.
>
> pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> @@ -441,11 +429,9 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep,
> pte_t pte, int dirty)
> {
> - int ncontig, i;
> + int ncontig;
> size_t pgsize = 0;
> - unsigned long pfn = pte_pfn(pte), dpfn;
> struct mm_struct *mm = vma->vm_mm;
> - pgprot_t hugeprot;
> pte_t orig_pte;
>
> VM_WARN_ON(!pte_present(pte));
> @@ -454,7 +440,6 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
>
> ncontig = find_num_contig(mm, addr, ptep, &pgsize);
> - dpfn = pgsize >> PAGE_SHIFT;
>
> if (!__cont_access_flags_changed(ptep, pte, ncontig))
> return 0;
> @@ -469,19 +454,14 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> if (pte_young(orig_pte))
> pte = pte_mkyoung(pte);
>
> - hugeprot = pte_pgprot(pte);
> - for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
> - __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
> -
> + ___set_ptes(mm, ptep, pte, ncontig, pgsize);
> return 1;
> }
This makes huge_ptep_set_access_flags() cleaner and simpler as well.
>
> void huge_ptep_set_wrprotect(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> {
> - unsigned long pfn, dpfn;
> - pgprot_t hugeprot;
> - int ncontig, i;
> + int ncontig;
> size_t pgsize;
> pte_t pte;
>
> @@ -494,16 +474,11 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
> }
>
> ncontig = find_num_contig(mm, addr, ptep, &pgsize);
> - dpfn = pgsize >> PAGE_SHIFT;
>
> pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
> pte = pte_wrprotect(pte);
>
> - hugeprot = pte_pgprot(pte);
> - pfn = pte_pfn(pte);
> -
> - for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
> - __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
> + ___set_ptes(mm, ptep, pte, ncontig, pgsize);
> }
This makes huge_ptep_set_wrprotect() cleaner and simpler as well.
>
> pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
> @@ -517,10 +492,7 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
> pte = __ptep_get(ptep);
> VM_WARN_ON(!pte_present(pte));
>
> - if (!pte_cont(pte))
> - return ptep_clear_flush(vma, addr, ptep);
> -
> - ncontig = find_num_contig(mm, addr, ptep, &pgsize);
> + ncontig = num_contig_ptes(page_size(pte_page(pte)), &pgsize);
A VMA argument is present in this function huge_ptep_clear_flush(). Why not just
use that to get the huge page size here, instead of retrieving it via the PFN
contained in the page table entry? That might be safer.
s/page_size(pte_page(pte))/huge_page_size(hstate_vma(vma))
> return get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
> }
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 08/16] arm64/mm: Hoist barriers out of ___set_ptes() loop
2025-02-05 15:09 ` [PATCH v1 08/16] arm64/mm: Hoist barriers out of ___set_ptes() loop Ryan Roberts
@ 2025-02-07 5:35 ` Anshuman Khandual
2025-02-07 10:38 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-07 5:35 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/5/25 20:39, Ryan Roberts wrote:
> ___set_ptes() previously called __set_pte() for each PTE in the range,
> which would conditionally issue a DSB and ISB to make the new PTE value
> immediately visible to the table walker if the new PTE was valid and for
> kernel space.
>
> We can do better than this; let's hoist those barriers out of the loop
> so that they are only issued once at the end of the loop. We then reduce
> the cost by the number of PTEs in the range.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 14 ++++++++++----
> 1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 3b55d9a15f05..1d428e9c0e5a 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -317,10 +317,8 @@ static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
> WRITE_ONCE(*ptep, pte);
> }
>
> -static inline void __set_pte(pte_t *ptep, pte_t pte)
> +static inline void __set_pte_complete(pte_t pte)
> {
> - __set_pte_nosync(ptep, pte);
> -
> /*
> * Only if the new pte is valid and kernel, otherwise TLB maintenance
> * or update_mmu_cache() have the necessary barriers.
> @@ -331,6 +329,12 @@ static inline void __set_pte(pte_t *ptep, pte_t pte)
> }
> }
>
> +static inline void __set_pte(pte_t *ptep, pte_t pte)
> +{
> + __set_pte_nosync(ptep, pte);
> + __set_pte_complete(pte);
> +}
> +
> static inline pte_t __ptep_get(pte_t *ptep)
> {
> return READ_ONCE(*ptep);
> @@ -647,12 +651,14 @@ static inline void ___set_ptes(struct mm_struct *mm, pte_t *ptep, pte_t pte,
>
> for (;;) {
> __check_safe_pte_update(mm, ptep, pte);
> - __set_pte(ptep, pte);
> + __set_pte_nosync(ptep, pte);
> if (--nr == 0)
> break;
> ptep++;
> pte = pte_advance_pfn(pte, stride);
> }
> +
> + __set_pte_complete(pte);
Given that the loop now iterates over number of page table entries without corresponding
consecutive dsb/isb sync, could there be a situation where something else gets scheduled
on the cpu before __set_pte_complete() is called ? Hence leaving the entire page table
entries block without desired mapping effect. IOW how __set_pte_complete() is ensured to
execute once the loop above completes. Otherwise this change LGTM.
> }
>
> static inline void __set_ptes(struct mm_struct *mm,
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 09/16] arm64/mm: Avoid barriers for invalid or userspace mappings
2025-02-05 15:09 ` [PATCH v1 09/16] arm64/mm: Avoid barriers for invalid or userspace mappings Ryan Roberts
@ 2025-02-07 8:11 ` Anshuman Khandual
2025-02-07 10:53 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-07 8:11 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/5/25 20:39, Ryan Roberts wrote:
> __set_pte_complete(), set_pmd(), set_pud(), set_p4d() and set_pgd() are
> used to write entries into pgtables. And they issue barriers (currently
> dsb and isb) to ensure that the written values are observed by the table
> walker prior to any program-order-future memory access to the mapped
> location.
Right.
>
> Over the years some of these functions have received optimizations: In
> particular, commit 7f0b1bf04511 ("arm64: Fix barriers used for page
> table modifications") made it so that the barriers were only emitted for
> valid-kernel mappings for set_pte() (now __set_pte_complete()). And
> commit 0795edaf3f1f ("arm64: pgtable: Implement p[mu]d_valid() and check
> in set_p[mu]d()") made it so that set_pmd()/set_pud() only emitted the
> barriers for valid mappings. set_p4d()/set_pgd() continue to emit the
> barriers unconditionally.
Right.
>
> This is all very confusing to the casual observer; surely the rules
> should be invariant to the level? Let's change this so that every level
> consistently emits the barriers only when setting valid, non-user
> entries (both table and leaf).
Agreed.
>
> It seems obvious that if it is ok to elide barriers all but valid kernel
> mappings at pte level, it must also be ok to do this for leaf entries at
> other levels: If setting an entry to invalid, a tlb maintenance
> operaiton must surely follow to synchronise the TLB and this contains
s/operaiton/operation
> the required barriers. If setting a valid user mapping, the previous
> mapping must have been invalid and there must have been a TLB
> maintenance operation (complete with barriers) to honour
> break-before-make. So the worst that can happen is we take an extra
> fault (which will imply the DSB + ISB) and conclude that there is
> nothing to do. These are the aguments for doing this optimization at pte
s/aguments/arguments
> level and they also apply to leaf mappings at other levels.
So user the page table updates both for the table and leaf entries remains
unchanged for now regarding dsb/isb sync i.e don't do anything ?
>
> For table entries, the same arguments hold: If unsetting a table entry,
> a TLB is required and this will emit the required barriers. If setting a
> table entry, the previous value must have been invalid and the table
> walker must already be able to observe that. Additionally the contents
> of the pgtable being pointed to in the newly set entry must be visible
> before the entry is written and this is enforced via smp_wmb() (dmb) in
> the pgtable allocation functions and in __split_huge_pmd_locked(). But
> this last part could never have been enforced by the barriers in
> set_pXd() because they occur after updating the entry. So ultimately,
> the worst that can happen by eliding these barriers for user table
> entries is an extra fault.
Basically nothing needs to be done while setting user page table entries.
>
> I observe roughly the same number of page faults (107M) with and without
> this change when compiling the kernel on Apple M2.
These are total page faults or only additional faults caused because there
were no dsb/isb sync after the user page table update ?
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 60 ++++++++++++++++++++++++++++----
> 1 file changed, 54 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 1d428e9c0e5a..ff358d983583 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -767,6 +767,19 @@ static inline bool in_swapper_pgdir(void *addr)
> ((unsigned long)swapper_pg_dir & PAGE_MASK);
> }
>
> +static inline bool pmd_valid_not_user(pmd_t pmd)
> +{
> + /*
> + * User-space table pmd entries always have (PXN && !UXN). All other
> + * combinations indicate it's a table entry for kernel space.
> + * Valid-not-user leaf entries follow the same rules as
> + * pte_valid_not_user().
> + */
> + if (pmd_table(pmd))
> + return !((pmd_val(pmd) & (PMD_TABLE_PXN | PMD_TABLE_UXN)) == PMD_TABLE_PXN);
Should not this be abstracted out as pmd_table_not_user_table() which can
then be re-used in other levels as well.
> + return pte_valid_not_user(pmd_pte(pmd));
> +}
> +
Something like.
static inline bool pmd_valid_not_user_table(pmd_t pmd)
{
return pmd_valid(pmd) &&
!((pmd_val(pmd) & (PMD_TABLE_PXN | PMD_TABLE_UXN)) == PMD_TABLE_PXN);
}
static inline bool pmd_valid_not_user(pmd_t pmd)
{
if (pmd_table(pmd))
return pmd_valid_not_user_table(pmd);
else
return pte_valid_not_user(pmd_pte(pmd));
}
> static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
> {
> #ifdef __PAGETABLE_PMD_FOLDED
> @@ -778,7 +791,7 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>
> WRITE_ONCE(*pmdp, pmd);
>
> - if (pmd_valid(pmd)) {
> + if (pmd_valid_not_user(pmd)) {
> dsb(ishst);
> isb();
> }
> @@ -836,6 +849,17 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
>
> static inline bool pgtable_l4_enabled(void);
>
> +
> +static inline bool pud_valid_not_user(pud_t pud)
> +{
> + /*
> + * Follows the same rules as pmd_valid_not_user().
> + */
> + if (pud_table(pud))
> + return !((pud_val(pud) & (PUD_TABLE_PXN | PUD_TABLE_UXN)) == PUD_TABLE_PXN);
> + return pte_valid_not_user(pud_pte(pud));
> +}
This can be expressed in terms of pmd_valid_not_user() itself.
#define pud_valid_not_user(pud) pmd_valid_not_user(pud_pmd(pud))
> +
> static inline void set_pud(pud_t *pudp, pud_t pud)
> {
> if (!pgtable_l4_enabled() && in_swapper_pgdir(pudp)) {
> @@ -845,7 +869,7 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
>
> WRITE_ONCE(*pudp, pud);
>
> - if (pud_valid(pud)) {
> + if (pud_valid_not_user(pud)) {
> dsb(ishst);
> isb();
> }
> @@ -917,6 +941,16 @@ static inline bool mm_pud_folded(const struct mm_struct *mm)
> #define p4d_bad(p4d) (pgtable_l4_enabled() && !(p4d_val(p4d) & P4D_TABLE_BIT))
> #define p4d_present(p4d) (!p4d_none(p4d))
>
> +static inline bool p4d_valid_not_user(p4d_t p4d)
> +{
> + /*
> + * User-space table p4d entries always have (PXN && !UXN). All other
> + * combinations indicate it's a table entry for kernel space. p4d block
> + * entries are not supported.
> + */
> + return !((p4d_val(p4d) & (P4D_TABLE_PXN | P4D_TABLE_UXN)) == P4D_TABLE_PXN);
> +}
Instead
#define p4d_valid_not_user_table(p4d) pmd_valid_not_user_table(p4d_pmd(p4d))
> +
> static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
> {
> if (in_swapper_pgdir(p4dp)) {
> @@ -925,8 +959,11 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
> }
>
> WRITE_ONCE(*p4dp, p4d);
> - dsb(ishst);
> - isb();
> +
> + if (p4d_valid_not_user(p4d)) {
Check p4d_valid_not_user_table() instead.
> + dsb(ishst);
> + isb();
> + }
> }
>
> static inline void p4d_clear(p4d_t *p4dp)
> @@ -1044,6 +1081,14 @@ static inline bool mm_p4d_folded(const struct mm_struct *mm)
> #define pgd_bad(pgd) (pgtable_l5_enabled() && !(pgd_val(pgd) & PGD_TABLE_BIT))
> #define pgd_present(pgd) (!pgd_none(pgd))
>
> +static inline bool pgd_valid_not_user(pgd_t pgd)
> +{
> + /*
> + * Follows the same rules as p4d_valid_not_user().
> + */
> + return !((pgd_val(pgd) & (PGD_TABLE_PXN | PGD_TABLE_UXN)) == PGD_TABLE_PXN);
> +}
Similarly
#define pgd_valid_not_user_table(pgd) pmd_valid_not_user_table(pgd_pmd(pgd))
> +
> static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
> {
> if (in_swapper_pgdir(pgdp)) {
> @@ -1052,8 +1097,11 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
> }
>
> WRITE_ONCE(*pgdp, pgd);
> - dsb(ishst);
> - isb();
> +
> + if (pgd_valid_not_user(pgd)) {
Check pgd_valid_not_user_table() instead.
> + dsb(ishst);
> + isb();
> + }
> }
>
> static inline void pgd_clear(pgd_t *pgdp)
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 10/16] mm/vmalloc: Warn on improper use of vunmap_range()
2025-02-05 15:09 ` [PATCH v1 10/16] mm/vmalloc: Warn on improper use of vunmap_range() Ryan Roberts
@ 2025-02-07 8:41 ` Anshuman Khandual
2025-02-07 10:59 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-07 8:41 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/5/25 20:39, Ryan Roberts wrote:
> A call to vmalloc_huge() may cause memory blocks to be mapped at pmd or
> pud level. But it is possible to subsquently call vunmap_range() on a
s/subsquently/subsequently
> sub-range of the mapped memory, which partially overlaps a pmd or pud.
> In this case, vmalloc unmaps the entire pmd or pud so that the
> no-overlapping portion is also unmapped. Clearly that would have a bad
> outcome, but it's not something that any callers do today as far as I
> can tell. So I guess it's jsut expected that callers will not do this.
s/jsut/just
>
> However, it would be useful to know if this happened in future; let's
> add a warning to cover the eventuality.
This is a reasonable check to prevent bad outcomes later.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> mm/vmalloc.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a6e7acebe9ad..fcdf67d5177a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -374,8 +374,10 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> if (cleared || pmd_bad(*pmd))
> *mask |= PGTBL_PMD_MODIFIED;
>
> - if (cleared)
> + if (cleared) {
> + WARN_ON(next - addr < PMD_SIZE);
> continue;
> + }
> if (pmd_none_or_clear_bad(pmd))
> continue;
> vunmap_pte_range(pmd, addr, next, mask);
> @@ -399,8 +401,10 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> if (cleared || pud_bad(*pud))
> *mask |= PGTBL_PUD_MODIFIED;
>
> - if (cleared)
> + if (cleared) {
> + WARN_ON(next - addr < PUD_SIZE);
> continue;
> + }
> if (pud_none_or_clear_bad(pud))
> continue;
> vunmap_pmd_range(pud, addr, next, mask);
Why not also include such checks in vunmap_p4d_range() and __vunmap_range_noflush()
for corresponding P4D and PGD levels as well ?
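i.e. something along these lines at the p4d level (sketch only; the exact shape
of the "cleared" path in vunmap_p4d_range() may differ, with PGDIR_SIZE being
the equivalent for the pgd level):

	if (cleared) {
		WARN_ON(next - addr < P4D_SIZE);
		continue;
	}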
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 11/16] mm/vmalloc: Gracefully unmap huge ptes
2025-02-05 15:09 ` [PATCH v1 11/16] mm/vmalloc: Gracefully unmap huge ptes Ryan Roberts
@ 2025-02-07 9:19 ` Anshuman Khandual
0 siblings, 0 replies; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-07 9:19 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/5/25 20:39, Ryan Roberts wrote:
> Commit f7ee1f13d606 ("mm/vmalloc: enable mapping of huge pages at pte
> level in vmap") added its support by reusing the set_huge_pte_at() API,
> which is otherwise only used for user mappings. But when unmapping those
> huge ptes, it continued to call ptep_get_and_clear(), which is a
> layering violation. To date, the only arch to implement this support is
> powerpc and it all happens to work ok for it.
>
> But arm64's implementation of ptep_get_and_clear() can not be safely
> used to clear a previous set_huge_pte_at(). So let's introduce a new
> arch opt-in function, arch_vmap_pte_range_unmap_size(), which can
> provide the size of a (present) pte. Then we can call
> huge_ptep_get_and_clear() to tear it down properly.
>
> Note that if vunmap_range() is called with a range that starts in the
> middle of a huge pte-mapped page, we must unmap the entire huge page so
> the behaviour is consistent with pmd and pud block mappings. In this
> case emit a warning just like we do for pmd/pud mappings.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> include/linux/vmalloc.h | 8 ++++++++
> mm/vmalloc.c | 18 ++++++++++++++++--
> 2 files changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 31e9ffd936e3..16dd4cba64f2 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -113,6 +113,14 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr, uns
> }
> #endif
>
> +#ifndef arch_vmap_pte_range_unmap_size
> +static inline unsigned long arch_vmap_pte_range_unmap_size(unsigned long addr,
> + pte_t *ptep)
> +{
> + return PAGE_SIZE;
> +}
> +#endif
> +
> #ifndef arch_vmap_pte_supported_shift
> static inline int arch_vmap_pte_supported_shift(unsigned long size)
> {
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index fcdf67d5177a..6111ce900ec4 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -350,12 +350,26 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> pgtbl_mod_mask *mask)
> {
> pte_t *pte;
> + pte_t ptent;
> + unsigned long size = PAGE_SIZE;
Default fallback size being PAGE_SIZE like before.
>
> pte = pte_offset_kernel(pmd, addr);
> do {
> - pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte);
> +#ifdef CONFIG_HUGETLB_PAGE
> + size = arch_vmap_pte_range_unmap_size(addr, pte);
> + if (size != PAGE_SIZE) {
> + if (WARN_ON(!IS_ALIGNED(addr, size))) {
> + addr = ALIGN_DOWN(addr, size);
> + pte = PTR_ALIGN_DOWN(pte, sizeof(*pte) * (size >> PAGE_SHIFT));
> + }
> + ptent = huge_ptep_get_and_clear(&init_mm, addr, pte, size);
> + if (WARN_ON(end - addr < size))
> + size = end - addr;
> + } else
> +#endif
> + ptent = ptep_get_and_clear(&init_mm, addr, pte);
ptep_get_and_clear() gets used both when !HUGETLB_PAGE and when HUGETLB_PAGE with
arch_vmap_pte_range_unmap_size() returning PAGE_SIZE, which makes sense.
> WARN_ON(!pte_none(ptent) && !pte_present(ptent));
> - } while (pte++, addr += PAGE_SIZE, addr != end);
> + } while (pte += (size >> PAGE_SHIFT), addr += size, addr != end);
> *mask |= PGTBL_PTE_MODIFIED;
> }
>
LGTM
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 06/16] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
2025-02-06 13:26 ` Ryan Roberts
@ 2025-02-07 9:38 ` Ryan Roberts
2025-02-12 15:29 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-07 9:38 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 06/02/2025 13:26, Ryan Roberts wrote:
> On 06/02/2025 11:48, Anshuman Khandual wrote:
>> On 2/5/25 20:39, Ryan Roberts wrote:
>>> Refactor __set_ptes(), set_pmd_at() and set_pud_at() so that they are
>>> all a thin wrapper around a generic ___set_ptes(), which takes pgsize
>>
>> s/generic/common - as generic might be misleading for being generic MM.
>
> Good spot. I'll make this change.
>
>>
>>> parameter. This cleans up the code to remove the confusing
>>> __set_pte_at() (which was only ever used for pmd/pud) and will allow us
>>> to perform future barrier optimizations in a single place. Additionally,
>>> it will permit the huge_pte API to efficiently batch-set pgtable entries
>>> and take advantage of the future barrier optimizations.
>>
>> Makes sense.
>>
>>>
>>> ___set_ptes() calls the correct page_table_check_*_set() function based
>>> on the pgsize. This means that huge_ptes be able to get proper coverage
>>> regardless of their size, once it's plumbed into huge_pte. Currently the
>>> huge_pte API always uses the pte API which assumes an entry only covers
>>> a single page.
>>
>> Right
>>
>>>
>>> While we are at it, refactor __ptep_get_and_clear() and
>>> pmdp_huge_get_and_clear() to use a common ___ptep_get_and_clear() which
>>> also takes a pgsize parameter. This will provide the huge_pte API the
>>> means to clear ptes corresponding with the way they were set.
>>
>> __ptep_get_and_clear() refactoring does not seem to be related to the
>> previous __set_ptes(). Should they be separated out into two different
>> patches instead - for better clarity and review ? Both these clean ups
>> have enough change and can stand own their own.
>
> Yeah I think you're probably right... I was being lazy... I'll separate them.
>
>>
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>> arch/arm64/include/asm/pgtable.h | 108 +++++++++++++++++++------------
>>> 1 file changed, 67 insertions(+), 41 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 0b2a2ad1b9e8..3b55d9a15f05 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -420,23 +420,6 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>>> return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
>>> }
>>>
>>> -static inline void __set_ptes(struct mm_struct *mm,
>>> - unsigned long __always_unused addr,
>>> - pte_t *ptep, pte_t pte, unsigned int nr)
>>> -{
>>> - page_table_check_ptes_set(mm, ptep, pte, nr);
>>> - __sync_cache_and_tags(pte, nr);
>>> -
>>> - for (;;) {
>>> - __check_safe_pte_update(mm, ptep, pte);
>>> - __set_pte(ptep, pte);
>>> - if (--nr == 0)
>>> - break;
>>> - ptep++;
>>> - pte = pte_advance_pfn(pte, 1);
>>> - }
>>> -}
>>> -
>>> /*
>>> * Hugetlb definitions.
>>> */
>>> @@ -641,30 +624,59 @@ static inline pgprot_t pud_pgprot(pud_t pud)
>>> return __pgprot(pud_val(pfn_pud(pfn, __pgprot(0))) ^ pud_val(pud));
>>> }
>>>
>>> -static inline void __set_pte_at(struct mm_struct *mm,
>>> - unsigned long __always_unused addr,
>>> - pte_t *ptep, pte_t pte, unsigned int nr)
>>> +static inline void ___set_ptes(struct mm_struct *mm, pte_t *ptep, pte_t pte,
>>> + unsigned int nr, unsigned long pgsize)
>>
>> So address got dropped and page size got added as an argument.
>
> Yeah; addr is never used in our implementations so I figured why haul it around
> everywhere?
>
>> s/___set_ptes/___set_pxds ? to be more generic for all levels.
>
> So now we are into naming... I agree that in some senses pte feels specific to
> the last level. But its long form "page table entry" seems more generic than
> "pxd" which implies only pmd, pud, p4d and pgd. At least to me...
>
> I think we got stuck trying to figure out a clear and short term for "page table
> entry at any level" in the past. I think ttd was the best we got; Translation
> Table Descriptor, which is the term the Arm ARM uses. But that opens a can of
> worms as now we need ttd_t and all the converters pte_ttd(), ttd_pte(),
> pmd_ttd(), ... and probably a bunch more stuff on top.
>
> So personally I prefer to take the coward's way out and just reuse pte.
How about set_ptes_anylvl() and ptep_get_and_clear_anylvl()? I think this makes
it explicit and has the benefit of removing the leading underscores. It also
means we can reuse pte_t and friends, and we can extend this nomenclature to
other places in future at the expense of a 7 char suffix ("_anylvl").
What do you think?
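For concreteness, with that naming the two helpers from this series would
presumably end up as (sketch of the proposal only, signatures as already used
above):

static inline void set_ptes_anylvl(struct mm_struct *mm, pte_t *ptep, pte_t pte,
				unsigned int nr, unsigned long pgsize);
static inline pte_t ptep_get_and_clear_anylvl(struct mm_struct *mm, pte_t *ptep,
				unsigned long pgsize);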
>
>>
>>> {
>>> - __sync_cache_and_tags(pte, nr);
>>> - __check_safe_pte_update(mm, ptep, pte);
>>> - __set_pte(ptep, pte);
>>> + unsigned long stride = pgsize >> PAGE_SHIFT;
>>> +
>>> + switch (pgsize) {
>>> + case PAGE_SIZE:
>>> + page_table_check_ptes_set(mm, ptep, pte, nr);
>>> + break;
>>> + case PMD_SIZE:
>>> + page_table_check_pmds_set(mm, (pmd_t *)ptep, pte_pmd(pte), nr);
>>> + break;
>>> + case PUD_SIZE:
>>> + page_table_check_puds_set(mm, (pud_t *)ptep, pte_pud(pte), nr);
>>> + break;
>>
>> This is where the new page table check APIs get used for batch testing.
>
> Yes and I anticipate that the whole switch block should be optimized out when
> page_table_check is disabled.
>
>>
>>> + default:
>>> + VM_WARN_ON(1);
>>> + }
>>> +
>>> + __sync_cache_and_tags(pte, nr * stride);
>>> +
>>> + for (;;) {
>>> + __check_safe_pte_update(mm, ptep, pte);
>>> + __set_pte(ptep, pte);
>>> + if (--nr == 0)
>>> + break;
>>> + ptep++;
>>> + pte = pte_advance_pfn(pte, stride);
>>> + }
>>> }
>>
>> LGTM
>>
>>>
>>> -static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>>> - pmd_t *pmdp, pmd_t pmd)
>>> +static inline void __set_ptes(struct mm_struct *mm,
>>> + unsigned long __always_unused addr,
>>> + pte_t *ptep, pte_t pte, unsigned int nr)
>>> {
>>> - page_table_check_pmd_set(mm, pmdp, pmd);
>>> - return __set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd),
>>> - PMD_SIZE >> PAGE_SHIFT);
>>> + ___set_ptes(mm, ptep, pte, nr, PAGE_SIZE);
>>> }
>>>
>>> -static inline void set_pud_at(struct mm_struct *mm, unsigned long addr,
>>> - pud_t *pudp, pud_t pud)
>>> +static inline void __set_pmds(struct mm_struct *mm,
>>> + unsigned long __always_unused addr,
>>> + pmd_t *pmdp, pmd_t pmd, unsigned int nr)
>>> +{
>>> + ___set_ptes(mm, (pte_t *)pmdp, pmd_pte(pmd), nr, PMD_SIZE);
>>> +}
>>> +#define set_pmd_at(mm, addr, pmdp, pmd) __set_pmds(mm, addr, pmdp, pmd, 1)
>>> +
>>> +static inline void __set_puds(struct mm_struct *mm,
>>> + unsigned long __always_unused addr,
>>> + pud_t *pudp, pud_t pud, unsigned int nr)
>>> {
>>> - page_table_check_pud_set(mm, pudp, pud);
>>> - return __set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud),
>>> - PUD_SIZE >> PAGE_SHIFT);
>>> + ___set_ptes(mm, (pte_t *)pudp, pud_pte(pud), nr, PUD_SIZE);
>>> }
>>> +#define set_pud_at(mm, addr, pudp, pud) __set_puds(mm, addr, pudp, pud, 1)
>>
>> LGTM
>>
>>>
>>> #define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
>>> #define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)
>>> @@ -1276,16 +1288,34 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>>> }
>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
>>>
>>> -static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
>>> - unsigned long address, pte_t *ptep)
>>> +static inline pte_t ___ptep_get_and_clear(struct mm_struct *mm, pte_t *ptep,
>>> + unsigned long pgsize)
>>
>> So address got dropped and page size got added as an argument.
>>
>>> {
>>> pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
>>>
>>> - page_table_check_pte_clear(mm, pte);
>>> + switch (pgsize) {
>>> + case PAGE_SIZE:
>>> + page_table_check_pte_clear(mm, pte);
>>> + break;
>>> + case PMD_SIZE:
>>> + page_table_check_pmd_clear(mm, pte_pmd(pte));
>>> + break;
>>> + case PUD_SIZE:
>>> + page_table_check_pud_clear(mm, pte_pud(pte));
>>> + break;
>>> + default:
>>> + VM_WARN_ON(1);
>>> + }
>>>
>>> return pte;
>>> }
>>>
>>> +static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
>>> + unsigned long address, pte_t *ptep)
>>> +{
>>> + return ___ptep_get_and_clear(mm, ptep, PAGE_SIZE);
>>> +}
>>> +
>>> static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>>> pte_t *ptep, unsigned int nr, int full)
>>> {
>>> @@ -1322,11 +1352,7 @@ static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
>>> static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
>>> unsigned long address, pmd_t *pmdp)
>>> {
>>> - pmd_t pmd = __pmd(xchg_relaxed(&pmd_val(*pmdp), 0));
>>> -
>>> - page_table_check_pmd_clear(mm, pmd);
>>> -
>>> - return pmd;
>>> + return pte_pmd(___ptep_get_and_clear(mm, (pte_t *)pmdp, PMD_SIZE));
>>> }
>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>
>>
>> Although currently there is no pudp_huge_get_and_clear() helper on arm64
>> reworked ___ptep_get_and_clear() will be able to support that as well if
>> and when required as it now supports PUD_SIZE page size.
>
> yep.
>
> Thanks for all your review so far!
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 07/16] arm64: hugetlb: Use ___set_ptes() and ___ptep_get_and_clear()
2025-02-07 4:09 ` Anshuman Khandual
@ 2025-02-07 10:00 ` Ryan Roberts
0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-02-07 10:00 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 07/02/2025 04:09, Anshuman Khandual wrote:
> On 2/5/25 20:39, Ryan Roberts wrote:
>> Refactor the huge_pte helpers to use the new generic ___set_ptes() and
>> ___ptep_get_and_clear() APIs.
>>
>> This provides 2 benefits; First, when page_table_check=on, hugetlb is
>> now properly/fully checked. Previously only the first page of a hugetlb
>
> PAGE_TABLE_CHECK will be fully supported now in hugetlb irrespective of
> the page table level. This is definitely an improvement.
>
>> folio was checked. Second, instead of having to call __set_ptes(nr=1)
>> for each pte in a loop, the whole contiguous batch can now be set in one
>> go, which enables some efficiencies and cleans up the code.
>
> Improvements done to common __set_ptes() will automatically be available
> for hugetlb pages as well. This converges all batch updates in a single
> i.e __set_ptes() which can be optimized further in a single place. Makes
> sense.
>
>>
>> One detail to note is that huge_ptep_clear_flush() was previously
>> calling ptep_clear_flush() for a non-contiguous pte (i.e. a pud or pmd
>> block mapping). This has a couple of disadvantages; first
>> ptep_clear_flush() calls ptep_get_and_clear() which transparently
>> handles contpte. Given we only call for non-contiguous ptes, it would be
>> safe, but a waste of effort. It's preferable to go stright to the layer
>
> A small nit - typo s/stright/straight
>
>> below. However, more problematic is that ptep_get_and_clear() is for
>> PAGE_SIZE entries so it calls page_table_check_pte_clear() and would not
>> clear the whole hugetlb folio. So let's stop special-casing the non-cont
>> case and just rely on get_clear_contig_flush() to do the right thing for
>> non-cont entries.
>
> Like before, this change is unrelated to all the conversions done earlier for
> the set and clear paths above using the new helpers. Hence ideally it should
> be separated out into a different patch.
No this is very much related and must be done in this patch. Previously
ptep_get_and_clear() would be called for a PMD or PUD entry. But
ptep_get_and_clear() only considers itself to be operating on PAGE_SIZE entries.
So when page_table_check=on, it will always forward to
page_table_check_pte_clear(). That used to be fine when only the first page of
the hugetlb folio was checked. But now that this patch changes the "set" side to
use the appropriate page_table_check_pXXs_set() call, the "clear" side must be
balanced. So we need to stop calling ptep_get_and_clear().
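To spell out the pairing with the helpers from patch 06 (illustration only):

	___set_ptes(mm, ptep, pte, nr, PMD_SIZE);	/* -> page_table_check_pmds_set()  */
	___ptep_get_and_clear(mm, ptep, PMD_SIZE);	/* -> page_table_check_pmd_clear() */

whereas ptep_get_and_clear() would still end up in page_table_check_pte_clear()
and leave the accounting unbalanced.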
>
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> arch/arm64/mm/hugetlbpage.c | 50 ++++++++-----------------------------
>> 1 file changed, 11 insertions(+), 39 deletions(-)
>>
>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>> index e870d01d12ea..02afee31444e 100644
>> --- a/arch/arm64/mm/hugetlbpage.c
>> +++ b/arch/arm64/mm/hugetlbpage.c
>> @@ -166,12 +166,12 @@ static pte_t get_clear_contig(struct mm_struct *mm,
>> pte_t pte, tmp_pte;
>> bool present;
>>
>> - pte = __ptep_get_and_clear(mm, addr, ptep);
>> + pte = ___ptep_get_and_clear(mm, ptep, pgsize);
>> present = pte_present(pte);
>> while (--ncontig) {
>> ptep++;
>> addr += pgsize;
>> - tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
>> + tmp_pte = ___ptep_get_and_clear(mm, ptep, pgsize);
>> if (present) {
>> if (pte_dirty(tmp_pte))
>> pte = pte_mkdirty(pte);
>> @@ -215,7 +215,7 @@ static void clear_flush(struct mm_struct *mm,
>> unsigned long i, saddr = addr;
>>
>> for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
>> - __ptep_get_and_clear(mm, addr, ptep);
>> + ___ptep_get_and_clear(mm, ptep, pgsize);
>>
>> __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
>> }
>
> ___ptep_get_and_clear() will have the opportunity to call page_table_check_pxx_clear()
> depending on the page size passed unlike the current scenario.
>
>> @@ -226,32 +226,20 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
>> size_t pgsize;
>> int i;
>> int ncontig;
>> - unsigned long pfn, dpfn;
>> - pgprot_t hugeprot;
>>
>> ncontig = num_contig_ptes(sz, &pgsize);
>>
>> if (!pte_present(pte)) {
>> for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
>> - __set_ptes(mm, addr, ptep, pte, 1);
>> + ___set_ptes(mm, ptep, pte, 1, pgsize);
>
> IIUC __set_ptes() wrapper is still around in the header. So what's the benefit of
> converting this into ___set_ptes() ? __set_ptes() gets dropped eventually ?
__set_ptes() is explicitly operating on PAGE_SIZE entries. The double
underscores indicate that it's the layer below the contpte management layer.
The new ___set_ptes() takes a pgsize and can therefore operate on PTEs at any
level in the pgtable.
As per other thread, I'm proposing to rename ___set_ptes() to set_ptes_anylvl()
and ___ptep_get_and_clear() to ptep_get_and_clear_anylvl(). I think that makes
things a bit clearer?
>
>> return;
>> }
>>
>> - if (!pte_cont(pte)) {
>> - __set_ptes(mm, addr, ptep, pte, 1);
>> - return;
>> - }
>> -
>> - pfn = pte_pfn(pte);
>> - dpfn = pgsize >> PAGE_SHIFT;
>> - hugeprot = pte_pgprot(pte);
>> -
>> /* Only need to "break" if transitioning valid -> valid. */
>> - if (pte_valid(__ptep_get(ptep)))
>> + if (pte_cont(pte) && pte_valid(__ptep_get(ptep)))
>> clear_flush(mm, addr, ptep, pgsize, ncontig);
>>
>> - for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
>> - __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
>> + ___set_ptes(mm, ptep, pte, ncontig, pgsize);
>> }
>
> Similarly __set_ptes() will have the opportunity to call page_table_check_pxx_set()
> depending on the page size passed unlike the current scenario.
Sorry I don't understand this comment. __set_ptes() (2 leading underscores) is
always implicitly operating on PAGE_SIZE entries. ___set_ptes() (3 leading
underscores) allows the size of the entries to be passed in.
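i.e. the distinction is just:

	__set_ptes(mm, addr, ptep, pte, nr);		/* always PAGE_SIZE entries */
	___set_ptes(mm, ptep, pte, nr, pgsize);		/* pgsize selects the level */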
>
>>
>> pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>> @@ -441,11 +429,9 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>> unsigned long addr, pte_t *ptep,
>> pte_t pte, int dirty)
>> {
>> - int ncontig, i;
>> + int ncontig;
>> size_t pgsize = 0;
>> - unsigned long pfn = pte_pfn(pte), dpfn;
>> struct mm_struct *mm = vma->vm_mm;
>> - pgprot_t hugeprot;
>> pte_t orig_pte;
>>
>> VM_WARN_ON(!pte_present(pte));
>> @@ -454,7 +440,6 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>> return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
>>
>> ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>> - dpfn = pgsize >> PAGE_SHIFT;
>>
>> if (!__cont_access_flags_changed(ptep, pte, ncontig))
>> return 0;
>> @@ -469,19 +454,14 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>> if (pte_young(orig_pte))
>> pte = pte_mkyoung(pte);
>>
>> - hugeprot = pte_pgprot(pte);
>> - for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
>> - __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
>> -
>> + ___set_ptes(mm, ptep, pte, ncontig, pgsize);
>> return 1;
>> }
>
> This makes huge_ptep_set_access_flags() cleaner and simpler as well.
>
>>
>> void huge_ptep_set_wrprotect(struct mm_struct *mm,
>> unsigned long addr, pte_t *ptep)
>> {
>> - unsigned long pfn, dpfn;
>> - pgprot_t hugeprot;
>> - int ncontig, i;
>> + int ncontig;
>> size_t pgsize;
>> pte_t pte;
>>
>> @@ -494,16 +474,11 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
>> }
>>
>> ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>> - dpfn = pgsize >> PAGE_SHIFT;
>>
>> pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
>> pte = pte_wrprotect(pte);
>>
>> - hugeprot = pte_pgprot(pte);
>> - pfn = pte_pfn(pte);
>> -
>> - for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
>> - __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
>> + ___set_ptes(mm, ptep, pte, ncontig, pgsize);
>> }
>
> This makes huge_ptep_set_wrprotect() cleaner and simpler as well.
>
>>
>> pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
>> @@ -517,10 +492,7 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
>> pte = __ptep_get(ptep);
>> VM_WARN_ON(!pte_present(pte));
>>
>> - if (!pte_cont(pte))
>> - return ptep_clear_flush(vma, addr, ptep);
>> -
>> - ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>> + ncontig = num_contig_ptes(page_size(pte_page(pte)), &pgsize);
>
> A VMA argument is present in this function huge_ptep_clear_flush(). Why not just
> use that to get the huge page size here, instead of retrieving it via the PFN
> contained in the page table entry? That might be safer.
>
> s/page_size(pte_page(pte))/huge_page_size(hstate_vma(vma))
Yes, that's a good idea. I'll make this change in the next version.
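i.e. something like (untested sketch for the next version):

	ncontig = num_contig_ptes(huge_page_size(hstate_vma(vma)), &pgsize);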
>
>> return get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
>> }
>>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 12/16] arm64/mm: Support huge pte-mapped pages in vmap
2025-02-05 15:09 ` [PATCH v1 12/16] arm64/mm: Support huge pte-mapped pages in vmap Ryan Roberts
@ 2025-02-07 10:04 ` Anshuman Khandual
2025-02-07 11:20 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-07 10:04 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/5/25 20:39, Ryan Roberts wrote:
> Implement the required arch functions to enable use of contpte in the
> vmap when VM_ALLOW_HUGE_VMAP is specified. This speeds up vmap
> operations due to only having to issue a DSB and ISB per contpte block
> instead of per pte. But it also means that the TLB pressure reduces due
> to only needing a single TLB entry for the whole contpte block.
Right.
>
> Since vmap uses set_huge_pte_at() to set the contpte, that API is now
> used for kernel mappings for the first time. Although in the vmap case
> we never expect it to be called to modify a valid mapping so
> clear_flush() should never be called, it's still wise to make it robust
> for the kernel case, so amend the tlb flush function if the mm is for
> kernel space.
Makes sense.
>
> Tested with vmalloc performance selftests:
>
> # kself/mm/test_vmalloc.sh \
> run_test_mask=1
> test_repeat_count=5
> nr_pages=256
> test_loop_count=100000
> use_huge=1
>
> Duration reduced from 1274243 usec to 1083553 usec on Apple M2 for 15%
> reduction in time taken.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/vmalloc.h | 40 ++++++++++++++++++++++++++++++++
> arch/arm64/mm/hugetlbpage.c | 5 +++-
> 2 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
> index 38fafffe699f..fbdeb40f3857 100644
> --- a/arch/arm64/include/asm/vmalloc.h
> +++ b/arch/arm64/include/asm/vmalloc.h
> @@ -23,6 +23,46 @@ static inline bool arch_vmap_pmd_supported(pgprot_t prot)
> return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> }
>
> +#define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
> +static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
> + unsigned long end, u64 pfn,
> + unsigned int max_page_shift)
> +{
> + if (max_page_shift < CONT_PTE_SHIFT)
> + return PAGE_SIZE;
> +
> + if (end - addr < CONT_PTE_SIZE)
> + return PAGE_SIZE;
> +
> + if (!IS_ALIGNED(addr, CONT_PTE_SIZE))
> + return PAGE_SIZE;
> +
> + if (!IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
> + return PAGE_SIZE;
> +
> + return CONT_PTE_SIZE;
A small nit:
Should the rationale behind picking CONT_PTE_SIZE be added here as an in-code
comment or something in the function - just to make things a bit clearer.
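Something along these lines perhaps (wording is only a suggestion, not from the
patch):

	/*
	 * Falling through the checks above means the caller allows a contpte
	 * mapping, the remaining range is at least CONT_PTE_SIZE, and both the
	 * VA and the PA are CONT_PTE_SIZE aligned, so the block can be mapped
	 * with a single contiguous-bit span.
	 */
	return CONT_PTE_SIZE;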
> +}
> +
> +#define arch_vmap_pte_range_unmap_size arch_vmap_pte_range_unmap_size
> +static inline unsigned long arch_vmap_pte_range_unmap_size(unsigned long addr,
> + pte_t *ptep)
> +{
> + /*
> + * The caller handles alignment so it's sufficient just to check
> + * PTE_CONT.
> + */
> + return pte_valid_cont(__ptep_get(ptep)) ? CONT_PTE_SIZE : PAGE_SIZE;
I guess it is safe to query the CONT_PTE from the mapped entry itself.
> +}
> +
> +#define arch_vmap_pte_supported_shift arch_vmap_pte_supported_shift
> +static inline int arch_vmap_pte_supported_shift(unsigned long size)
> +{
> + if (size >= CONT_PTE_SIZE)
> + return CONT_PTE_SHIFT;
> +
> + return PAGE_SHIFT;
> +}
> +
> #endif
>
> #define arch_vmap_pgprot_tagged arch_vmap_pgprot_tagged
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 02afee31444e..a74e43101dad 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -217,7 +217,10 @@ static void clear_flush(struct mm_struct *mm,
> for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
> ___ptep_get_and_clear(mm, ptep, pgsize);
>
> - __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
> + if (mm == &init_mm)
> + flush_tlb_kernel_range(saddr, addr);
> + else
> + __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
> }
>
> void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
Otherwise LGTM.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 13/16] mm: Don't skip arch_sync_kernel_mappings() in error paths
2025-02-05 15:09 ` [PATCH v1 13/16] mm: Don't skip arch_sync_kernel_mappings() in error paths Ryan Roberts
@ 2025-02-07 10:21 ` Anshuman Khandual
0 siblings, 0 replies; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-07 10:21 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel, stable
On 2/5/25 20:39, Ryan Roberts wrote:
> Fix callers that previously skipped calling arch_sync_kernel_mappings()
> if an error occurred during a pgtable update. The call is still required
> to sync any pgtable updates that may have occurred prior to hitting the
> error condition.
>
> These are theoretical bugs discovered during code review.
>
> Cc: <stable@vger.kernel.org>
> Fixes: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified")
> Fixes: 0c95cba49255 ("mm: apply_to_pte_range warn and fail if a large pte is encountered")
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
This change could stand on its own and LGTM.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
> mm/memory.c | 6 ++++--
> mm/vmalloc.c | 4 ++--
> 2 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 539c0f7c6d54..a15f7dd500ea 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3040,8 +3040,10 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
> next = pgd_addr_end(addr, end);
> if (pgd_none(*pgd) && !create)
> continue;
> - if (WARN_ON_ONCE(pgd_leaf(*pgd)))
> - return -EINVAL;
> + if (WARN_ON_ONCE(pgd_leaf(*pgd))) {
> + err = -EINVAL;
> + break;
> + }
> if (!pgd_none(*pgd) && WARN_ON_ONCE(pgd_bad(*pgd))) {
> if (!create)
> continue;
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 6111ce900ec4..68950b1824d0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -604,13 +604,13 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> mask |= PGTBL_PGD_MODIFIED;
> err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
> if (err)
> - return err;
> + break;
> } while (pgd++, addr = next, addr != end);
>
> if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> arch_sync_kernel_mappings(start, end);
>
> - return 0;
> + return err;
> }
>
> /*
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 08/16] arm64/mm: Hoist barriers out of ___set_ptes() loop
2025-02-07 5:35 ` Anshuman Khandual
@ 2025-02-07 10:38 ` Ryan Roberts
2025-02-12 16:00 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-07 10:38 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 07/02/2025 05:35, Anshuman Khandual wrote:
>
>
> On 2/5/25 20:39, Ryan Roberts wrote:
>> ___set_ptes() previously called __set_pte() for each PTE in the range,
>> which would conditionally issue a DSB and ISB to make the new PTE value
>> immediately visible to the table walker if the new PTE was valid and for
>> kernel space.
>>
>> We can do better than this; let's hoist those barriers out of the loop
>> so that they are only issued once at the end of the loop. We then reduce
>> the cost by the number of PTEs in the range.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 14 ++++++++++----
>> 1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 3b55d9a15f05..1d428e9c0e5a 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -317,10 +317,8 @@ static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
>> WRITE_ONCE(*ptep, pte);
>> }
>>
>> -static inline void __set_pte(pte_t *ptep, pte_t pte)
>> +static inline void __set_pte_complete(pte_t pte)
>> {
>> - __set_pte_nosync(ptep, pte);
>> -
>> /*
>> * Only if the new pte is valid and kernel, otherwise TLB maintenance
>> * or update_mmu_cache() have the necessary barriers.
>> @@ -331,6 +329,12 @@ static inline void __set_pte(pte_t *ptep, pte_t pte)
>> }
>> }
>>
>> +static inline void __set_pte(pte_t *ptep, pte_t pte)
>> +{
>> + __set_pte_nosync(ptep, pte);
>> + __set_pte_complete(pte);
>> +}
>> +
>> static inline pte_t __ptep_get(pte_t *ptep)
>> {
>> return READ_ONCE(*ptep);
>> @@ -647,12 +651,14 @@ static inline void ___set_ptes(struct mm_struct *mm, pte_t *ptep, pte_t pte,
>>
>> for (;;) {
>> __check_safe_pte_update(mm, ptep, pte);
>> - __set_pte(ptep, pte);
>> + __set_pte_nosync(ptep, pte);
>> if (--nr == 0)
>> break;
>> ptep++;
>> pte = pte_advance_pfn(pte, stride);
>> }
>> +
>> + __set_pte_complete(pte);
>
> Given that the loop now iterates over number of page table entries without corresponding
> consecutive dsb/isb sync, could there be a situation where something else gets scheduled
> on the cpu before __set_pte_complete() is called ? Hence leaving the entire page table
> entries block without desired mapping effect. IOW how __set_pte_complete() is ensured to
> execute once the loop above completes. Otherwise this change LGTM.
I don't think this changes the model. Previously, __set_pte() was called, which
writes the pte to the pgtable, then issues the barriers. So there is still a
window where the thread could be unscheduled after the write but before the
barriers. Yes, my change makes that window bigger, but if it is a bug now, it
was a bug before.
Additionally, the spec for set_ptes() is:
/**
* set_ptes - Map consecutive pages to a contiguous range of addresses.
* @mm: Address space to map the pages into.
* @addr: Address to map the first page at.
* @ptep: Page table pointer for the first entry.
* @pte: Page table entry for the first page.
* @nr: Number of pages to map.
*
* When nr==1, initial state of pte may be present or not present, and new state
* may be present or not present. When nr>1, initial state of all ptes must be
* not present, and new state must be present.
*
* May be overridden by the architecture, or the architecture can define
* set_pte() and PFN_PTE_SHIFT.
*
* Context: The caller holds the page table lock. The pages all belong
* to the same folio. The PTEs are all in the same PMD.
*/
Note that the caller is required to hold the page table lock. That's a spin lock
so should be non-preemptible at this point (perhaps not for RT?)
Although actually, vmalloc doesn't hold a lock when calling these helpers; it
has a lock when allocating the VA space, then drops it.
So yes, I think there is a chance of preemption after writing the pgtable entry
but before issuing the barriers.
But in that case, we get saved by the DSB in the context switch path. There is
no guarantee of an ISB in that path (AFAIU). But the need for an ISB is a bit
woolly anyway. My rough understanding is that the ISB is there to prevent
previous speculation from determining that a given translation was invalid and
"caching" that determination in the pipeline. That could still (theoretically)
happen on remote CPUs I think, and we have the spurious fault handler to detect
that. Anyway, once you context switch, the local CPU becomes remote and we don't
have the ISB there, so what's the difference... There's a high chance I've
misunderstood a bunch of this.
In conclusion, I don't think I've made things any worse.
Thanks,
Ryan
>
>> }
>>
>> static inline void __set_ptes(struct mm_struct *mm,
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 09/16] arm64/mm: Avoid barriers for invalid or userspace mappings
2025-02-07 8:11 ` Anshuman Khandual
@ 2025-02-07 10:53 ` Ryan Roberts
2025-02-12 16:48 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-07 10:53 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 07/02/2025 08:11, Anshuman Khandual wrote:
> On 2/5/25 20:39, Ryan Roberts wrote:
>> __set_pte_complete(), set_pmd(), set_pud(), set_p4d() and set_pgd() are
>> used to write entries into pgtables. And they issue barriers (currently
>> dsb and isb) to ensure that the written values are observed by the table
>> walker prior to any program-order-future memory access to the mapped
>> location.
>
> Right.
>
>>
>> Over the years some of these functions have received optimizations: In
>> particular, commit 7f0b1bf04511 ("arm64: Fix barriers used for page
>> table modifications") made it so that the barriers were only emitted for
>> valid-kernel mappings for set_pte() (now __set_pte_complete()). And
>> commit 0795edaf3f1f ("arm64: pgtable: Implement p[mu]d_valid() and check
>> in set_p[mu]d()") made it so that set_pmd()/set_pud() only emitted the
>> barriers for valid mappings. set_p4d()/set_pgd() continue to emit the
>> barriers unconditionally.
>
> Right.
>
>>
>> This is all very confusing to the casual observer; surely the rules
>> should be invariant to the level? Let's change this so that every level
>> consistently emits the barriers only when setting valid, non-user
>> entries (both table and leaf).
>
> Agreed.
>
>>
>> It seems obvious that if it is ok to elide barriers all but valid kernel
>> mappings at pte level, it must also be ok to do this for leaf entries at
>> other levels: If setting an entry to invalid, a tlb maintenance
>> operaiton must surely follow to synchronise the TLB and this contains
>
> s/operaiton/operation
Ugh, I really need a spell checker for my editor!
>
>> the required barriers. If setting a valid user mapping, the previous
>> mapping must have been invalid and there must have been a TLB
>> maintenance operation (complete with barriers) to honour
>> break-before-make. So the worst that can happen is we take an extra
>> fault (which will imply the DSB + ISB) and conclude that there is
>> nothing to do. These are the aguments for doing this optimization at pte
>
> s/aguments/arguments
>
>> level and they also apply to leaf mappings at other levels.
>
> So user the page table updates both for the table and leaf entries remains
> unchanged for now regarding dsb/isb sync i.e don't do anything ?
Sorry, this doesn't parse.
>
>>
>> For table entries, the same arguments hold: If unsetting a table entry,
>> a TLB is required and this will emit the required barriers. If setting a
>> table entry, the previous value must have been invalid and the table
>> walker must already be able to observe that. Additionally the contents
>> of the pgtable being pointed to in the newly set entry must be visible
>> before the entry is written and this is enforced via smp_wmb() (dmb) in
>> the pgtable allocation functions and in __split_huge_pmd_locked(). But
>> this last part could never have been enforced by the barriers in
>> set_pXd() because they occur after updating the entry. So ultimately,
>> the worst that can happen by eliding these barriers for user table
>> entries is an extra fault.
>
> Basically nothing needs to be done while setting user page table entries.
>
>>
>> I observe roughly the same number of page faults (107M) with and without
>> this change when compiling the kernel on Apple M2.
>
> These are total page faults or only additional faults caused because there
> were no dsb/isb sync after the user page table update ?
Total page faults. The experiment was to check whether eliding more barriers for
valid user space mappings leads to an increase in page faults. This very simple
experiment suggests it does not.
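For anyone wanting to reproduce the comparison, a sketch along these lines would
do (assuming the cumulative "pgfault" counter in /proc/vmstat is a good enough
proxy; that is an assumption, not necessarily how the 107M figure was collected):

#include <stdio.h>
#include <string.h>

/* Read the cumulative page-fault count from /proc/vmstat. */
static unsigned long long read_pgfault(void)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char key[64];
	unsigned long long val = 0, pgfault = 0;

	if (!f)
		return 0;

	/* /proc/vmstat is a list of "<name> <value>" pairs. */
	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (strcmp(key, "pgfault") == 0) {
			pgfault = val;
			break;
		}
	}

	fclose(f);
	return pgfault;
}

int main(void)
{
	/* Snapshot before and after the kernel build on each kernel. */
	printf("pgfault: %llu\n", read_pgfault());
	return 0;
}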
>
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 60 ++++++++++++++++++++++++++++----
>> 1 file changed, 54 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 1d428e9c0e5a..ff358d983583 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -767,6 +767,19 @@ static inline bool in_swapper_pgdir(void *addr)
>> ((unsigned long)swapper_pg_dir & PAGE_MASK);
>> }
>>
>> +static inline bool pmd_valid_not_user(pmd_t pmd)
>> +{
>> + /*
>> + * User-space table pmd entries always have (PXN && !UXN). All other
>> + * combinations indicate it's a table entry for kernel space.
>> + * Valid-not-user leaf entries follow the same rules as
>> + * pte_valid_not_user().
>> + */
>> + if (pmd_table(pmd))
>> + return !((pmd_val(pmd) & (PMD_TABLE_PXN | PMD_TABLE_UXN)) == PMD_TABLE_PXN);
>
> Should not this be abstracted out as pmd_table_not_user_table() which can
> then be re-used in other levels as well.
Yeah maybe. Let me mull it over.
>
>> + return pte_valid_not_user(pmd_pte(pmd));
>> +}
>> +
>
> Something like.
>
> static inline bool pmd_valid_not_user_table(pmd_t pmd)
> {
> return pmd_valid(pmd) &&
> !((pmd_val(pmd) & (PMD_PMD_TABLE_PXN | PMD_TABLE_UXN)) == PMD_TABLE_PXN);
> }
>
> static inline bool pmd_valid_not_user(pmd_t pmd)
> {
> if (pmd_table(pmd))
> return pmd_valid_not_user_table(pmd);
> else
> return pte_valid_not_user(pmd_pte(pmd));
> }
>
>> static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>> {
>> #ifdef __PAGETABLE_PMD_FOLDED
>> @@ -778,7 +791,7 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>>
>> WRITE_ONCE(*pmdp, pmd);
>>
>> - if (pmd_valid(pmd)) {
>> + if (pmd_valid_not_user(pmd)) {
>> dsb(ishst);
>> isb();
>> }
>> @@ -836,6 +849,17 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
>>
>> static inline bool pgtable_l4_enabled(void);
>>
>> +
>> +static inline bool pud_valid_not_user(pud_t pud)
>> +{
>> + /*
>> + * Follows the same rules as pmd_valid_not_user().
>> + */
>> + if (pud_table(pud))
>> + return !((pud_val(pud) & (PUD_TABLE_PXN | PUD_TABLE_UXN)) == PUD_TABLE_PXN);
>> + return pte_valid_not_user(pud_pte(pud));
>> +}
>
> This can be expressed in terms of pmd_valid_not_user() itself.
>
> #define pud_valid_not_user() pmd_valid_not_user(pud_pmd(pud))
The trouble with this is that you end up using pmd_table() rather than
pud_table(). For some configs pud_table() is hardcoded to true, so we lose that
benefit. I'd rather keep it as its own function.
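To illustrate the point, here is a hedged sketch (the config guard is recalled
from arm64's pgtable.h and may not be verbatim); the dedicated helper lets the
pud_table() check constant-fold away on those configs, whereas routing through
pud_pmd()/pmd_table() always keeps the descriptor bit test:

/* Hedged sketch, not upstream code verbatim. */
#if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
static inline bool pud_table(pud_t pud) { return true; }	/* compile-time constant */
#endif

/* (a) dedicated helper: the branch below disappears entirely when
 * pud_table() is the constant-true variant. */
static inline bool pud_valid_not_user(pud_t pud)
{
	if (pud_table(pud))
		return !((pud_val(pud) & (PUD_TABLE_PXN | PUD_TABLE_UXN)) ==
			 PUD_TABLE_PXN);
	return pte_valid_not_user(pud_pte(pud));
}

/* (b) reuse via pud_pmd(): ends up in pmd_table(), which always inspects
 * the descriptor bits, so the constant folding above is lost. */
#define pud_valid_not_user_alt(pud)	pmd_valid_not_user(pud_pmd(pud))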
>
>> +
>> static inline void set_pud(pud_t *pudp, pud_t pud)
>> {
>> if (!pgtable_l4_enabled() && in_swapper_pgdir(pudp)) {
>> @@ -845,7 +869,7 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
>>
>> WRITE_ONCE(*pudp, pud);
>>
>> - if (pud_valid(pud)) {
>> + if (pud_valid_not_user(pud)) {
>> dsb(ishst);
>> isb();
>> }
>> @@ -917,6 +941,16 @@ static inline bool mm_pud_folded(const struct mm_struct *mm)
>> #define p4d_bad(p4d) (pgtable_l4_enabled() && !(p4d_val(p4d) & P4D_TABLE_BIT))
>> #define p4d_present(p4d) (!p4d_none(p4d))
>>
>> +static inline bool p4d_valid_not_user(p4d_t p4d)
>> +{
>> + /*
>> + * User-space table p4d entries always have (PXN && !UXN). All other
>> + * combinations indicate it's a table entry for kernel space. p4d block
>> + * entries are not supported.
>> + */
>> + return !((p4d_val(p4d) & (P4D_TABLE_PXN | P4D_TABLE_UXN)) == P4D_TABLE_PXN);
>> +}
>
> Instead
>
> #define p4d_valid_not_user_able() pmd_valid_not_user_able(p4d_pmd(p4d))
>
>> +
>> static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>> {
>> if (in_swapper_pgdir(p4dp)) {
>> @@ -925,8 +959,11 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>> }
>>
>> WRITE_ONCE(*p4dp, p4d);
>> - dsb(ishst);
>> - isb();
>> +
>> + if (p4d_valid_not_user(p4d)) {
>
>
> Check p4d_valid_not_user_able() instead.
I don't really know why we would want to add 'table' into the name at this
level. Why not be consistent and continue to use p4d_valid_not_user()? The fact
that p4d doesn't support leaf entries is just an implementation detail.
>
>> + dsb(ishst);
>> + isb();
>> + }
>> }
>>
>> static inline void p4d_clear(p4d_t *p4dp)
>> @@ -1044,6 +1081,14 @@ static inline bool mm_p4d_folded(const struct mm_struct *mm)
>> #define pgd_bad(pgd) (pgtable_l5_enabled() && !(pgd_val(pgd) & PGD_TABLE_BIT))
>> #define pgd_present(pgd) (!pgd_none(pgd))
>>
>> +static inline bool pgd_valid_not_user(pgd_t pgd)
>> +{
>> + /*
>> + * Follows the same rules as p4d_valid_not_user().
>> + */
>> + return !((pgd_val(pgd) & (PGD_TABLE_PXN | PGD_TABLE_UXN)) == PGD_TABLE_PXN);
>> +}
>
> Similarly
>
> #define pgd_valid_not_user_able() pmd_valid_not_user_able(pgd_pmd(pgd))
>
>
>> +
>> static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
>> {
>> if (in_swapper_pgdir(pgdp)) {
>> @@ -1052,8 +1097,11 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
>> }
>>
>> WRITE_ONCE(*pgdp, pgd);
>> - dsb(ishst);
>> - isb();
>> +
>> + if (pgd_valid_not_user(pgd)) {
>
> Check pgd_valid_not_user_able() instead.
>
>> + dsb(ishst);
>> + isb();
>> + }
>> }
>>
>> static inline void pgd_clear(pgd_t *pgdp)
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 10/16] mm/vmalloc: Warn on improper use of vunmap_range()
2025-02-07 8:41 ` Anshuman Khandual
@ 2025-02-07 10:59 ` Ryan Roberts
2025-02-13 6:36 ` Anshuman Khandual
0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-07 10:59 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 07/02/2025 08:41, Anshuman Khandual wrote:
> On 2/5/25 20:39, Ryan Roberts wrote:
>> A call to vmalloc_huge() may cause memory blocks to be mapped at pmd or
>> pud level. But it is possible to subsquently call vunmap_range() on a
>
> s/subsquently/subsequently
>
>> sub-range of the mapped memory, which partially overlaps a pmd or pud.
>> In this case, vmalloc unmaps the entire pmd or pud so that the
>> non-overlapping portion is also unmapped. Clearly that would have a bad
>> outcome, but it's not something that any callers do today as far as I
>> can tell. So I guess it's jsut expected that callers will not do this.
>
> s/jsut/just
>
>>
>> However, it would be useful to know if this happened in future; let's
>> add a warning to cover the eventuality.
>
> This is a reasonable check to prevent bad outcomes later.
>
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> mm/vmalloc.c | 8 ++++++--
>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index a6e7acebe9ad..fcdf67d5177a 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -374,8 +374,10 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>> if (cleared || pmd_bad(*pmd))
>> *mask |= PGTBL_PMD_MODIFIED;
>>
>> - if (cleared)
>> + if (cleared) {
>> + WARN_ON(next - addr < PMD_SIZE);
>> continue;
>> + }
>> if (pmd_none_or_clear_bad(pmd))
>> continue;
>> vunmap_pte_range(pmd, addr, next, mask);
>> @@ -399,8 +401,10 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>> if (cleared || pud_bad(*pud))
>> *mask |= PGTBL_PUD_MODIFIED;
>>
>> - if (cleared)
>> + if (cleared) {
>> + WARN_ON(next - addr < PUD_SIZE);
>> continue;
>> + }
>> if (pud_none_or_clear_bad(pud))
>> continue;
>> vunmap_pmd_range(pud, addr, next, mask);
> Why not also include such checks in vunmap_p4d_range() and __vunmap_range_noflush()
> for corresponding P4D and PGD levels as well ?
The kernel does not support p4d or pgd leaf entries, so there is nothing to
check. Although vunmap_p4d_range() does call p4d_clear_huge(), that function is
a stub and returns void (unlike p[mu]d_clear_huge()). I suspect we could just
remove p4d_clear_huge() entirely, but that would be a separate patch for the mm
tree I think. For pgd, there isn't even an equivalent-looking function.
Basically, at those 2 levels it's always a table.
Thanks,
Ryan
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 12/16] arm64/mm: Support huge pte-mapped pages in vmap
2025-02-07 10:04 ` Anshuman Khandual
@ 2025-02-07 11:20 ` Ryan Roberts
2025-02-13 6:32 ` Anshuman Khandual
0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-07 11:20 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 07/02/2025 10:04, Anshuman Khandual wrote:
>
>
> On 2/5/25 20:39, Ryan Roberts wrote:
>> Implement the required arch functions to enable use of contpte in the
>> vmap when VM_ALLOW_HUGE_VMAP is specified. This speeds up vmap
>> operations due to only having to issue a DSB and ISB per contpte block
>> instead of per pte. But it also means that the TLB pressure reduces due
>> to only needing a single TLB entry for the whole contpte block.
>
> Right.
>
>>
>> Since vmap uses set_huge_pte_at() to set the contpte, that API is now
>> used for kernel mappings for the first time. Although in the vmap case
>> we never expect it to be called to modify a valid mapping so
>> clear_flush() should never be called, it's still wise to make it robust
>> for the kernel case, so amend the tlb flush function if the mm is for
>> kernel space.
>
> Makes sense.
>
>>
>> Tested with vmalloc performance selftests:
>>
>> # kself/mm/test_vmalloc.sh \
>> run_test_mask=1 \
>> test_repeat_count=5 \
>> nr_pages=256 \
>> test_loop_count=100000 \
>> use_huge=1
>>
>> Duration reduced from 1274243 usec to 1083553 usec on Apple M2 for 15%
>> reduction in time taken.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> arch/arm64/include/asm/vmalloc.h | 40 ++++++++++++++++++++++++++++++++
>> arch/arm64/mm/hugetlbpage.c | 5 +++-
>> 2 files changed, 44 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>> index 38fafffe699f..fbdeb40f3857 100644
>> --- a/arch/arm64/include/asm/vmalloc.h
>> +++ b/arch/arm64/include/asm/vmalloc.h
>> @@ -23,6 +23,46 @@ static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>> return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> }
>>
>> +#define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
>> +static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
>> + unsigned long end, u64 pfn,
>> + unsigned int max_page_shift)
>> +{
>> + if (max_page_shift < CONT_PTE_SHIFT)
>> + return PAGE_SIZE;
>> +
>> + if (end - addr < CONT_PTE_SIZE)
>> + return PAGE_SIZE;
>> +
>> + if (!IS_ALIGNED(addr, CONT_PTE_SIZE))
>> + return PAGE_SIZE;
>> +
>> + if (!IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
>> + return PAGE_SIZE;
>> +
>> + return CONT_PTE_SIZE;
>
> A small nit:
>
> Should the rationale behind picking CONT_PTE_SIZE be added here as an in-code
> comment or something in the function - just to make things a bit clearer.
I'm not sure what other size we would pick?
>
>> +}
>> +
>> +#define arch_vmap_pte_range_unmap_size arch_vmap_pte_range_unmap_size
>> +static inline unsigned long arch_vmap_pte_range_unmap_size(unsigned long addr,
>> + pte_t *ptep)
>> +{
>> + /*
>> + * The caller handles alignment so it's sufficient just to check
>> + * PTE_CONT.
>> + */
>> + return pte_valid_cont(__ptep_get(ptep)) ? CONT_PTE_SIZE : PAGE_SIZE;
>
> I guess it is safe to query the CONT_PTE from the mapped entry itself.
Yes I don't see why not. Is there some specific aspect you're concerned about?
>
>> +}
>> +
>> +#define arch_vmap_pte_supported_shift arch_vmap_pte_supported_shift
>> +static inline int arch_vmap_pte_supported_shift(unsigned long size)
>> +{
>> + if (size >= CONT_PTE_SIZE)
>> + return CONT_PTE_SHIFT;
>> +
>> + return PAGE_SHIFT;
>> +}
>> +
>> #endif
>>
>> #define arch_vmap_pgprot_tagged arch_vmap_pgprot_tagged
>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>> index 02afee31444e..a74e43101dad 100644
>> --- a/arch/arm64/mm/hugetlbpage.c
>> +++ b/arch/arm64/mm/hugetlbpage.c
>> @@ -217,7 +217,10 @@ static void clear_flush(struct mm_struct *mm,
>> for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
>> ___ptep_get_and_clear(mm, ptep, pgsize);
>>
>> - __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
>> + if (mm == &init_mm)
>> + flush_tlb_kernel_range(saddr, addr);
>> + else
>> + __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
>> }
>>
>> void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
>
> Otherwise LGTM.
>
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Thanks!
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 14/16] mm/vmalloc: Batch arch_sync_kernel_mappings() more efficiently
2025-02-05 15:09 ` [PATCH v1 14/16] mm/vmalloc: Batch arch_sync_kernel_mappings() more efficiently Ryan Roberts
@ 2025-02-10 7:11 ` Anshuman Khandual
0 siblings, 0 replies; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-10 7:11 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/5/25 20:39, Ryan Roberts wrote:
> When page_shift is greater than PAGE_SHIFT, __vmap_pages_range_noflush()
> will call vmap_range_noflush() for each individual huge page. But
> vmap_range_noflush() would previously call arch_sync_kernel_mappings()
> directly so this would end up being called for every huge page.
>
> We can do better than this; refactor the call into the outer
> __vmap_pages_range_noflush() so that it is only called once for the
> entire batch operation.
This makes sense.
>
> This will benefit performance for arm64 which is about to opt-in to
> using the hook.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> mm/vmalloc.c | 60 ++++++++++++++++++++++++++--------------------------
> 1 file changed, 30 insertions(+), 30 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 68950b1824d0..50fd44439875 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -285,40 +285,38 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
>
> static int vmap_range_noflush(unsigned long addr, unsigned long end,
> phys_addr_t phys_addr, pgprot_t prot,
> - unsigned int max_page_shift)
> + unsigned int max_page_shift, pgtbl_mod_mask *mask)
> {
> pgd_t *pgd;
> - unsigned long start;
> unsigned long next;
> int err;
> - pgtbl_mod_mask mask = 0;
>
> might_sleep();
> BUG_ON(addr >= end);
>
> - start = addr;
> pgd = pgd_offset_k(addr);
> do {
> next = pgd_addr_end(addr, end);
> err = vmap_p4d_range(pgd, addr, next, phys_addr, prot,
> - max_page_shift, &mask);
> + max_page_shift, mask);
> if (err)
> break;
> } while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
>
> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> - arch_sync_kernel_mappings(start, end);
> -
> return err;
> }
arch_sync_kernel_mappings() gets dropped here and moved to existing
vmap_range_noflush() callers instead.
>
> int vmap_page_range(unsigned long addr, unsigned long end,
> phys_addr_t phys_addr, pgprot_t prot)
> {
> + pgtbl_mod_mask mask = 0;
> int err;
>
> err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
> - ioremap_max_page_shift);
> + ioremap_max_page_shift, &mask);
> + if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> + arch_sync_kernel_mappings(addr, end);
> +
arch_sync_kernel_mappings() gets moved here.
> flush_cache_vmap(addr, end);
> if (!err)
> err = kmsan_ioremap_page_range(addr, end, phys_addr, prot,
> @@ -587,29 +585,24 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
> }
>
> static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
> - pgprot_t prot, struct page **pages)
> + pgprot_t prot, struct page **pages, pgtbl_mod_mask *mask)
> {
> - unsigned long start = addr;
> pgd_t *pgd;
> unsigned long next;
> int err = 0;
> int nr = 0;
> - pgtbl_mod_mask mask = 0;
>
> BUG_ON(addr >= end);
> pgd = pgd_offset_k(addr);
> do {
> next = pgd_addr_end(addr, end);
> if (pgd_bad(*pgd))
> - mask |= PGTBL_PGD_MODIFIED;
> - err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
> + *mask |= PGTBL_PGD_MODIFIED;
> + err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, mask);
> if (err)
> break;
> } while (pgd++, addr = next, addr != end);
>
> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> - arch_sync_kernel_mappings(start, end);
> -
> return err;
> }
>
> @@ -626,26 +619,33 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> pgprot_t prot, struct page **pages, unsigned int page_shift)
> {
> unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
> + unsigned long start = addr;
> + pgtbl_mod_mask mask = 0;
> + int err = 0;
>
> WARN_ON(page_shift < PAGE_SHIFT);
>
> if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> - page_shift == PAGE_SHIFT)
> - return vmap_small_pages_range_noflush(addr, end, prot, pages);
> -
> - for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> - int err;
> -
> - err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> - page_to_phys(pages[i]), prot,
> - page_shift);
> - if (err)
> - return err;
> + page_shift == PAGE_SHIFT) {
> + err = vmap_small_pages_range_noflush(addr, end, prot, pages,
> + &mask);
Unlike earlier, we don't return here, so that arch_sync_kernel_mappings()
gets covered later.
> + } else {
> + for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
> + err = vmap_range_noflush(addr,
> + addr + (1UL << page_shift),
> + page_to_phys(pages[i]), prot,
> + page_shift, &mask);
> + if (err)
> + break;
>
> - addr += 1UL << page_shift;
> + addr += 1UL << page_shift;
> + }
> }
>
> - return 0;
> + if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> + arch_sync_kernel_mappings(start, end);
arch_sync_kernel_mappings() gets moved here after getting dropped
from both vmap_range_noflush() and vmap_small_pages_range_noflush().
> +
> + return err;
> }
>
> int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
LGTM, and this can stand on its own as well.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 15/16] mm: Generalize arch_sync_kernel_mappings()
2025-02-05 15:09 ` [PATCH v1 15/16] mm: Generalize arch_sync_kernel_mappings() Ryan Roberts
@ 2025-02-10 7:45 ` Anshuman Khandual
2025-02-10 11:04 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-10 7:45 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/5/25 20:39, Ryan Roberts wrote:
> arch_sync_kernel_mappings() is an optional hook for arches to allow them
> to synchronize certain levels of the kernel pgtables after modification.
> But arm64 could benefit from a hook similar to this, paired with a call
> prior to starting the batch of modifications.
>
> So let's introduce arch_update_kernel_mappings_begin() and
> arch_update_kernel_mappings_end(). Both have a default implementation
> which can be overridden by the arch code. The default for the former is
> a nop, and the default for the latter is to call
> arch_sync_kernel_mappings(), so the latter replaces previous
> arch_sync_kernel_mappings() callsites. So by default, the resulting
> behaviour is unchanged.
>
> To avoid include hell, the pgtbl_mod_mask type and its associated
> macros are moved to their own header.
>
> In a future patch, arm64 will opt-in to overriding both functions.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> include/linux/pgtable.h | 24 +----------------
> include/linux/pgtable_modmask.h | 32 ++++++++++++++++++++++
> include/linux/vmalloc.h | 47 +++++++++++++++++++++++++++++++++
> mm/memory.c | 5 ++--
> mm/vmalloc.c | 15 ++++++-----
> 5 files changed, 92 insertions(+), 31 deletions(-)
> create mode 100644 include/linux/pgtable_modmask.h
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 94d267d02372..7f70786a73b3 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -4,6 +4,7 @@
>
> #include <linux/pfn.h>
> #include <asm/pgtable.h>
> +#include <linux/pgtable_modmask.h>
>
> #define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
> #define PUD_ORDER (PUD_SHIFT - PAGE_SHIFT)
> @@ -1786,29 +1787,6 @@ static inline bool arch_has_pfn_modify_check(void)
> # define PAGE_KERNEL_EXEC PAGE_KERNEL
> #endif
>
> -/*
> - * Page Table Modification bits for pgtbl_mod_mask.
> - *
> - * These are used by the p?d_alloc_track*() set of functions an in the generic
> - * vmalloc/ioremap code to track at which page-table levels entries have been
> - * modified. Based on that the code can better decide when vmalloc and ioremap
> - * mapping changes need to be synchronized to other page-tables in the system.
> - */
> -#define __PGTBL_PGD_MODIFIED 0
> -#define __PGTBL_P4D_MODIFIED 1
> -#define __PGTBL_PUD_MODIFIED 2
> -#define __PGTBL_PMD_MODIFIED 3
> -#define __PGTBL_PTE_MODIFIED 4
> -
> -#define PGTBL_PGD_MODIFIED BIT(__PGTBL_PGD_MODIFIED)
> -#define PGTBL_P4D_MODIFIED BIT(__PGTBL_P4D_MODIFIED)
> -#define PGTBL_PUD_MODIFIED BIT(__PGTBL_PUD_MODIFIED)
> -#define PGTBL_PMD_MODIFIED BIT(__PGTBL_PMD_MODIFIED)
> -#define PGTBL_PTE_MODIFIED BIT(__PGTBL_PTE_MODIFIED)
> -
> -/* Page-Table Modification Mask */
> -typedef unsigned int pgtbl_mod_mask;
> -
> #endif /* !__ASSEMBLY__ */
>
> #if !defined(MAX_POSSIBLE_PHYSMEM_BITS) && !defined(CONFIG_64BIT)
> diff --git a/include/linux/pgtable_modmask.h b/include/linux/pgtable_modmask.h
> new file mode 100644
> index 000000000000..5a21b1bb8df3
> --- /dev/null
> +++ b/include/linux/pgtable_modmask.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_PGTABLE_MODMASK_H
> +#define _LINUX_PGTABLE_MODMASK_H
> +
> +#ifndef __ASSEMBLY__
> +
> +/*
> + * Page Table Modification bits for pgtbl_mod_mask.
> + *
> + * These are used by the p?d_alloc_track*() set of functions an in the generic
> + * vmalloc/ioremap code to track at which page-table levels entries have been
> + * modified. Based on that the code can better decide when vmalloc and ioremap
> + * mapping changes need to be synchronized to other page-tables in the system.
> + */
> +#define __PGTBL_PGD_MODIFIED 0
> +#define __PGTBL_P4D_MODIFIED 1
> +#define __PGTBL_PUD_MODIFIED 2
> +#define __PGTBL_PMD_MODIFIED 3
> +#define __PGTBL_PTE_MODIFIED 4
> +
> +#define PGTBL_PGD_MODIFIED BIT(__PGTBL_PGD_MODIFIED)
> +#define PGTBL_P4D_MODIFIED BIT(__PGTBL_P4D_MODIFIED)
> +#define PGTBL_PUD_MODIFIED BIT(__PGTBL_PUD_MODIFIED)
> +#define PGTBL_PMD_MODIFIED BIT(__PGTBL_PMD_MODIFIED)
> +#define PGTBL_PTE_MODIFIED BIT(__PGTBL_PTE_MODIFIED)
> +
> +/* Page-Table Modification Mask */
> +typedef unsigned int pgtbl_mod_mask;
> +
> +#endif /* !__ASSEMBLY__ */
> +
> +#endif /* _LINUX_PGTABLE_MODMASK_H */
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 16dd4cba64f2..cb5d8f1965a1 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -11,6 +11,7 @@
> #include <asm/page.h> /* pgprot_t */
> #include <linux/rbtree.h>
> #include <linux/overflow.h>
> +#include <linux/pgtable_modmask.h>
>
> #include <asm/vmalloc.h>
>
> @@ -213,6 +214,26 @@ extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
> int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
> struct page **pages, unsigned int page_shift);
>
> +#ifndef arch_update_kernel_mappings_begin
> +/**
> + * arch_update_kernel_mappings_begin - A batch of kernel pgtable mappings are
> + * about to be updated.
> + * @start: Virtual address of start of range to be updated.
> + * @end: Virtual address of end of range to be updated.
> + *
> + * An optional hook to allow architecture code to prepare for a batch of kernel
> + * pgtable mapping updates. An architecture may use this to enter a lazy mode
> + * where some operations can be deferred until the end of the batch.
> + *
> + * Context: Called in task context and may be preemptible.
> + */
> +static inline void arch_update_kernel_mappings_begin(unsigned long start,
> + unsigned long end)
> +{
> +}
> +#endif
> +
> +#ifndef arch_update_kernel_mappings_end
> /*
> * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
> * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
> @@ -229,6 +250,32 @@ int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
> */
> void arch_sync_kernel_mappings(unsigned long start, unsigned long end);
>
> +/**
> + * arch_update_kernel_mappings_end - A batch of kernel pgtable mappings have
> + * been updated.
> + * @start: Virtual address of start of range that was updated.
> + * @end: Virtual address of end of range that was updated.
> + *
> + * An optional hook to inform architecture code that a batch update is complete.
> + * This balances a previous call to arch_update_kernel_mappings_begin().
> + *
> + * An architecture may override this for any purpose, such as exiting a lazy
> + * mode previously entered with arch_update_kernel_mappings_begin() or syncing
> + * kernel mappings to a secondary pgtable. The default implementation calls an
> + * arch-provided arch_sync_kernel_mappings() if any arch-defined pgtable level
> + * was updated.
> + *
> + * Context: Called in task context and may be preemptible.
> + */
> +static inline void arch_update_kernel_mappings_end(unsigned long start,
> + unsigned long end,
> + pgtbl_mod_mask mask)
> +{
> + if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> + arch_sync_kernel_mappings(start, end);
> +}
One arch callback calling yet another arch callback sounds a bit odd. Also,
shouldn't ARCH_PAGE_TABLE_SYNC_MASK be checked for both the __begin and __end
callbacks in case a platform subscribes to this framework? Instead the
following changes sound more reasonable, but will also require some more
updates for the current platforms using arch_sync_kernel_mappings().
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_update_kernel_mappings_begin()
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_update_kernel_mappings_end()
Basically, when any platform defines ARCH_PAGE_TABLE_SYNC_MASK and subscribes
to this framework, it will also provide arch_update_kernel_mappings_begin/end()
callbacks as required.
> +#endif
> +
> /*
> * Lowlevel-APIs (not for driver use!)
> */
> diff --git a/mm/memory.c b/mm/memory.c
> index a15f7dd500ea..f80930bc19f6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3035,6 +3035,8 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
> if (WARN_ON(addr >= end))
> return -EINVAL;
>
> + arch_update_kernel_mappings_begin(start, end);
> +
> pgd = pgd_offset(mm, addr);
> do {
> next = pgd_addr_end(addr, end);
> @@ -3055,8 +3057,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
> break;
> } while (pgd++, addr = next, addr != end);
>
> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> - arch_sync_kernel_mappings(start, start + size);
> + arch_update_kernel_mappings_end(start, end, mask);
>
> return err;
> }
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 50fd44439875..c5c51d86ef78 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -312,10 +312,10 @@ int vmap_page_range(unsigned long addr, unsigned long end,
> pgtbl_mod_mask mask = 0;
> int err;
>
> + arch_update_kernel_mappings_begin(addr, end);
> err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
> ioremap_max_page_shift, &mask);
> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> - arch_sync_kernel_mappings(addr, end);
> + arch_update_kernel_mappings_end(addr, end, mask);
>
> flush_cache_vmap(addr, end);
> if (!err)
> @@ -463,6 +463,9 @@ void __vunmap_range_noflush(unsigned long start, unsigned long end)
> pgtbl_mod_mask mask = 0;
>
> BUG_ON(addr >= end);
> +
> + arch_update_kernel_mappings_begin(start, end);
> +
> pgd = pgd_offset_k(addr);
> do {
> next = pgd_addr_end(addr, end);
> @@ -473,8 +476,7 @@ void __vunmap_range_noflush(unsigned long start, unsigned long end)
> vunmap_p4d_range(pgd, addr, next, &mask);
> } while (pgd++, addr = next, addr != end);
>
> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> - arch_sync_kernel_mappings(start, end);
> + arch_update_kernel_mappings_end(start, end, mask);
> }
>
> void vunmap_range_noflush(unsigned long start, unsigned long end)
> @@ -625,6 +627,8 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>
> WARN_ON(page_shift < PAGE_SHIFT);
>
> + arch_update_kernel_mappings_begin(start, end);
> +
> if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> page_shift == PAGE_SHIFT) {
> err = vmap_small_pages_range_noflush(addr, end, prot, pages,
> @@ -642,8 +646,7 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> }
> }
>
> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> - arch_sync_kernel_mappings(start, end);
> + arch_update_kernel_mappings_end(start, end, mask);
>
> return err;
> }
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 16/16] arm64/mm: Defer barriers when updating kernel mappings
2025-02-05 15:09 ` [PATCH v1 16/16] arm64/mm: Defer barriers when updating kernel mappings Ryan Roberts
@ 2025-02-10 8:03 ` Anshuman Khandual
2025-02-10 11:12 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-10 8:03 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/5/25 20:39, Ryan Roberts wrote:
> Because the kernel can't tolerate page faults for kernel mappings, when
> setting a valid, kernel space pte (or pmd/pud/p4d/pgd), it emits a
> dsb(ishst) to ensure that the store to the pgtable is observed by the
> table walker immediately. Additionally it emits an isb() to ensure that
> any already speculatively determined invalid mapping fault gets
> canceled.
>
> We can improve the performance of vmalloc operations by batching these
> barriers until the end of a set of entry updates. The newly added
> arch_update_kernel_mappings_begin() / arch_update_kernel_mappings_end()
> provide the required hooks.
>
> vmalloc improves by up to 30% as a result.
>
> Two new TIF_ flags are created; TIF_KMAP_UPDATE_ACTIVE tells us if we
> are in the batch mode and can therefore defer any barriers until the end
> of the batch. TIF_KMAP_UPDATE_PENDING tells us if barriers are queued to
> be emitted at the end of the batch.
Why cannot this be achieved with a single TIF_KMAP_UPDATE_ACTIVE which is
set in __begin(), cleared in __end() and saved across a __switch_to().
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 65 +++++++++++++++++++---------
> arch/arm64/include/asm/thread_info.h | 2 +
> arch/arm64/kernel/process.c | 20 +++++++--
> 3 files changed, 63 insertions(+), 24 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index ff358d983583..1ee9b9588502 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -39,6 +39,41 @@
> #include <linux/mm_types.h>
> #include <linux/sched.h>
> #include <linux/page_table_check.h>
> +#include <linux/pgtable_modmask.h>
> +
> +static inline void emit_pte_barriers(void)
> +{
> + dsb(ishst);
> + isb();
> +}
There are many sequences of these two barriers in this particular header,
hence it is probably a good idea to factor this out into a common helper.
> +
> +static inline void queue_pte_barriers(void)
> +{
> + if (test_thread_flag(TIF_KMAP_UPDATE_ACTIVE)) {
> + if (!test_thread_flag(TIF_KMAP_UPDATE_PENDING))
> + set_thread_flag(TIF_KMAP_UPDATE_PENDING);
> + } else
> + emit_pte_barriers();
> +}
> +
> +#define arch_update_kernel_mappings_begin arch_update_kernel_mappings_begin
> +static inline void arch_update_kernel_mappings_begin(unsigned long start,
> + unsigned long end)
> +{
> + set_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
> +}
> +
> +#define arch_update_kernel_mappings_end arch_update_kernel_mappings_end
> +static inline void arch_update_kernel_mappings_end(unsigned long start,
> + unsigned long end,
> + pgtbl_mod_mask mask)
> +{
> + if (test_thread_flag(TIF_KMAP_UPDATE_PENDING))
> + emit_pte_barriers();
> +
> + clear_thread_flag(TIF_KMAP_UPDATE_PENDING);
> + clear_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
> +}
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
> @@ -323,10 +358,8 @@ static inline void __set_pte_complete(pte_t pte)
> * Only if the new pte is valid and kernel, otherwise TLB maintenance
> * or update_mmu_cache() have the necessary barriers.
> */
> - if (pte_valid_not_user(pte)) {
> - dsb(ishst);
> - isb();
> - }
> + if (pte_valid_not_user(pte))
> + queue_pte_barriers();
> }
>
> static inline void __set_pte(pte_t *ptep, pte_t pte)
> @@ -791,10 +824,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>
> WRITE_ONCE(*pmdp, pmd);
>
> - if (pmd_valid_not_user(pmd)) {
> - dsb(ishst);
> - isb();
> - }
> + if (pmd_valid_not_user(pmd))
> + queue_pte_barriers();
> }
>
> static inline void pmd_clear(pmd_t *pmdp)
> @@ -869,10 +900,8 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
>
> WRITE_ONCE(*pudp, pud);
>
> - if (pud_valid_not_user(pud)) {
> - dsb(ishst);
> - isb();
> - }
> + if (pud_valid_not_user(pud))
> + queue_pte_barriers();
> }
>
> static inline void pud_clear(pud_t *pudp)
> @@ -960,10 +989,8 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>
> WRITE_ONCE(*p4dp, p4d);
>
> - if (p4d_valid_not_user(p4d)) {
> - dsb(ishst);
> - isb();
> - }
> + if (p4d_valid_not_user(p4d))
> + queue_pte_barriers();
> }
>
> static inline void p4d_clear(p4d_t *p4dp)
> @@ -1098,10 +1125,8 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
>
> WRITE_ONCE(*pgdp, pgd);
>
> - if (pgd_valid_not_user(pgd)) {
> - dsb(ishst);
> - isb();
> - }
> + if (pgd_valid_not_user(pgd))
> + queue_pte_barriers();
> }
>
> static inline void pgd_clear(pgd_t *pgdp)
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index 1114c1c3300a..382d2121261e 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -82,6 +82,8 @@ void arch_setup_new_exec(void);
> #define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */
> #define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */
> #define TIF_TSC_SIGSEGV 30 /* SIGSEGV on counter-timer access */
> +#define TIF_KMAP_UPDATE_ACTIVE 31 /* kernel map update in progress */
> +#define TIF_KMAP_UPDATE_PENDING 32 /* kernel map updated with deferred barriers */
>
> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 42faebb7b712..1367ec6407d1 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -680,10 +680,10 @@ struct task_struct *__switch_to(struct task_struct *prev,
> gcs_thread_switch(next);
>
> /*
> - * Complete any pending TLB or cache maintenance on this CPU in case
> - * the thread migrates to a different CPU.
> - * This full barrier is also required by the membarrier system
> - * call.
> + * Complete any pending TLB or cache maintenance on this CPU in case the
> + * thread migrates to a different CPU. This full barrier is also
> + * required by the membarrier system call. Additionally it is required
> + * for TIF_KMAP_UPDATE_PENDING, see below.
> */
> dsb(ish);
>
> @@ -696,6 +696,18 @@ struct task_struct *__switch_to(struct task_struct *prev,
> /* avoid expensive SCTLR_EL1 accesses if no change */
> if (prev->thread.sctlr_user != next->thread.sctlr_user)
> update_sctlr_el1(next->thread.sctlr_user);
> + else if (unlikely(test_thread_flag(TIF_KMAP_UPDATE_PENDING))) {
> + /*
> + * In unlikely event that a kernel map update is on-going when
> + * preemption occurs, we must emit_pte_barriers() if pending.
> + * emit_pte_barriers() consists of "dsb(ishst); isb();". The dsb
> + * is already handled above. The isb() is handled if
> + * update_sctlr_el1() was called. So only need to emit isb()
> + * here if it wasn't called.
> + */
> + isb();
> + clear_thread_flag(TIF_KMAP_UPDATE_PENDING);
> + }
>
> /* the actual thread switch */
> last = cpu_switch_to(prev, next);
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 15/16] mm: Generalize arch_sync_kernel_mappings()
2025-02-10 7:45 ` Anshuman Khandual
@ 2025-02-10 11:04 ` Ryan Roberts
2025-02-13 5:57 ` Anshuman Khandual
0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-10 11:04 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 10/02/2025 07:45, Anshuman Khandual wrote:
>
>
> On 2/5/25 20:39, Ryan Roberts wrote:
>> arch_sync_kernel_mappings() is an optional hook for arches to allow them
>> to synchronize certain levels of the kernel pgtables after modification.
>> But arm64 could benefit from a hook similar to this, paired with a call
>> prior to starting the batch of modifications.
>>
>> So let's introduce arch_update_kernel_mappings_begin() and
>> arch_update_kernel_mappings_end(). Both have a default implementation
>> which can be overridden by the arch code. The default for the former is
>> a nop, and the default for the latter is to call
>> arch_sync_kernel_mappings(), so the latter replaces previous
>> arch_sync_kernel_mappings() callsites. So by default, the resulting
>> behaviour is unchanged.
>>
>> To avoid include hell, the pgtbl_mod_mask type and its associated
>> macros are moved to their own header.
>>
>> In a future patch, arm64 will opt-in to overriding both functions.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> include/linux/pgtable.h | 24 +----------------
>> include/linux/pgtable_modmask.h | 32 ++++++++++++++++++++++
>> include/linux/vmalloc.h | 47 +++++++++++++++++++++++++++++++++
>> mm/memory.c | 5 ++--
>> mm/vmalloc.c | 15 ++++++-----
>> 5 files changed, 92 insertions(+), 31 deletions(-)
>> create mode 100644 include/linux/pgtable_modmask.h
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 94d267d02372..7f70786a73b3 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -4,6 +4,7 @@
>>
>> #include <linux/pfn.h>
>> #include <asm/pgtable.h>
>> +#include <linux/pgtable_modmask.h>
>>
>> #define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
>> #define PUD_ORDER (PUD_SHIFT - PAGE_SHIFT)
>> @@ -1786,29 +1787,6 @@ static inline bool arch_has_pfn_modify_check(void)
>> # define PAGE_KERNEL_EXEC PAGE_KERNEL
>> #endif
>>
>> -/*
>> - * Page Table Modification bits for pgtbl_mod_mask.
>> - *
>> - * These are used by the p?d_alloc_track*() set of functions an in the generic
>> - * vmalloc/ioremap code to track at which page-table levels entries have been
>> - * modified. Based on that the code can better decide when vmalloc and ioremap
>> - * mapping changes need to be synchronized to other page-tables in the system.
>> - */
>> -#define __PGTBL_PGD_MODIFIED 0
>> -#define __PGTBL_P4D_MODIFIED 1
>> -#define __PGTBL_PUD_MODIFIED 2
>> -#define __PGTBL_PMD_MODIFIED 3
>> -#define __PGTBL_PTE_MODIFIED 4
>> -
>> -#define PGTBL_PGD_MODIFIED BIT(__PGTBL_PGD_MODIFIED)
>> -#define PGTBL_P4D_MODIFIED BIT(__PGTBL_P4D_MODIFIED)
>> -#define PGTBL_PUD_MODIFIED BIT(__PGTBL_PUD_MODIFIED)
>> -#define PGTBL_PMD_MODIFIED BIT(__PGTBL_PMD_MODIFIED)
>> -#define PGTBL_PTE_MODIFIED BIT(__PGTBL_PTE_MODIFIED)
>> -
>> -/* Page-Table Modification Mask */
>> -typedef unsigned int pgtbl_mod_mask;
>> -
>> #endif /* !__ASSEMBLY__ */
>>
>> #if !defined(MAX_POSSIBLE_PHYSMEM_BITS) && !defined(CONFIG_64BIT)
>> diff --git a/include/linux/pgtable_modmask.h b/include/linux/pgtable_modmask.h
>> new file mode 100644
>> index 000000000000..5a21b1bb8df3
>> --- /dev/null
>> +++ b/include/linux/pgtable_modmask.h
>> @@ -0,0 +1,32 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_PGTABLE_MODMASK_H
>> +#define _LINUX_PGTABLE_MODMASK_H
>> +
>> +#ifndef __ASSEMBLY__
>> +
>> +/*
>> + * Page Table Modification bits for pgtbl_mod_mask.
>> + *
>> + * These are used by the p?d_alloc_track*() set of functions an in the generic
>> + * vmalloc/ioremap code to track at which page-table levels entries have been
>> + * modified. Based on that the code can better decide when vmalloc and ioremap
>> + * mapping changes need to be synchronized to other page-tables in the system.
>> + */
>> +#define __PGTBL_PGD_MODIFIED 0
>> +#define __PGTBL_P4D_MODIFIED 1
>> +#define __PGTBL_PUD_MODIFIED 2
>> +#define __PGTBL_PMD_MODIFIED 3
>> +#define __PGTBL_PTE_MODIFIED 4
>> +
>> +#define PGTBL_PGD_MODIFIED BIT(__PGTBL_PGD_MODIFIED)
>> +#define PGTBL_P4D_MODIFIED BIT(__PGTBL_P4D_MODIFIED)
>> +#define PGTBL_PUD_MODIFIED BIT(__PGTBL_PUD_MODIFIED)
>> +#define PGTBL_PMD_MODIFIED BIT(__PGTBL_PMD_MODIFIED)
>> +#define PGTBL_PTE_MODIFIED BIT(__PGTBL_PTE_MODIFIED)
>> +
>> +/* Page-Table Modification Mask */
>> +typedef unsigned int pgtbl_mod_mask;
>> +
>> +#endif /* !__ASSEMBLY__ */
>> +
>> +#endif /* _LINUX_PGTABLE_MODMASK_H */
>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
>> index 16dd4cba64f2..cb5d8f1965a1 100644
>> --- a/include/linux/vmalloc.h
>> +++ b/include/linux/vmalloc.h
>> @@ -11,6 +11,7 @@
>> #include <asm/page.h> /* pgprot_t */
>> #include <linux/rbtree.h>
>> #include <linux/overflow.h>
>> +#include <linux/pgtable_modmask.h>
>>
>> #include <asm/vmalloc.h>
>>
>> @@ -213,6 +214,26 @@ extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
>> int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
>> struct page **pages, unsigned int page_shift);
>>
>> +#ifndef arch_update_kernel_mappings_begin
>> +/**
>> + * arch_update_kernel_mappings_begin - A batch of kernel pgtable mappings are
>> + * about to be updated.
>> + * @start: Virtual address of start of range to be updated.
>> + * @end: Virtual address of end of range to be updated.
>> + *
>> + * An optional hook to allow architecture code to prepare for a batch of kernel
>> + * pgtable mapping updates. An architecture may use this to enter a lazy mode
>> + * where some operations can be deferred until the end of the batch.
>> + *
>> + * Context: Called in task context and may be preemptible.
>> + */
>> +static inline void arch_update_kernel_mappings_begin(unsigned long start,
>> + unsigned long end)
>> +{
>> +}
>> +#endif
>> +
>> +#ifndef arch_update_kernel_mappings_end
>> /*
>> * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
>> * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
>> @@ -229,6 +250,32 @@ int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
>> */
>> void arch_sync_kernel_mappings(unsigned long start, unsigned long end);
>>
>> +/**
>> + * arch_update_kernel_mappings_end - A batch of kernel pgtable mappings have
>> + * been updated.
>> + * @start: Virtual address of start of range that was updated.
>> + * @end: Virtual address of end of range that was updated.
>> + *
>> + * An optional hook to inform architecture code that a batch update is complete.
>> + * This balances a previous call to arch_update_kernel_mappings_begin().
>> + *
>> + * An architecture may override this for any purpose, such as exiting a lazy
>> + * mode previously entered with arch_update_kernel_mappings_begin() or syncing
>> + * kernel mappings to a secondary pgtable. The default implementation calls an
>> + * arch-provided arch_sync_kernel_mappings() if any arch-defined pgtable level
>> + * was updated.
>> + *
>> + * Context: Called in task context and may be preemptible.
>> + */
>> +static inline void arch_update_kernel_mappings_end(unsigned long start,
>> + unsigned long end,
>> + pgtbl_mod_mask mask)
>> +{
>> + if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>> + arch_sync_kernel_mappings(start, end);
>> +}
>
> One arch callback calling yet another arch callback sounds a bit odd.
It's no different from the default implementation of arch_make_huge_pte()
calling pte_mkhuge() is it?
> Also,
> shouldn't ARCH_PAGE_TABLE_SYNC_MASK be checked for both the __begin and __end
> callbacks in case a platform subscribes to this framework?
I'm not sure how that would work? The mask is accumulated during the pgtable
walk. So we don't have a mask until we get to the end.
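To spell out the flow (simplified from the vmap_page_range() hunk in this
patch; just a restatement of what the diff already does, not new code):

	pgtbl_mod_mask mask = 0;

	arch_update_kernel_mappings_begin(addr, end);	/* no mask available yet */
	err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
				 ioremap_max_page_shift, &mask);
	arch_update_kernel_mappings_end(addr, end, mask); /* mask populated by the walk */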
> Instead the
> following changes sound more reasonable, but will also require some more
> updates for the current platforms using arch_sync_kernel_mappings().
>
> if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> arch_update_kernel_mappings_begin()
This makes no sense. mask is always 0 before doing the walk.
>
> if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
> arch_update_kernel_mappings_end()
>
> Basically, when any platform defines ARCH_PAGE_TABLE_SYNC_MASK and subscribes
> to this framework, it will also provide arch_update_kernel_mappings_begin/end()
> callbacks as required.
Personally I think it's cleaner to just pass the mask to
arch_update_kernel_mappings_end() and let the function decide what it wants to
do. But it's a good question as to whether we should refactor x86 and arm to
directly implement arch_update_kernel_mappings_end() instead of
arch_sync_kernel_mappings(). Personally I thought it was better to avoid the
churn, but I'm interested in others' opinions.
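For illustration, the refactor being weighed up would look roughly like this
for an existing user (hypothetical sketch; sync_kernel_mappings_arch() is a
stand-in name for whatever arch-internal helper the real code would use):

#define arch_update_kernel_mappings_end arch_update_kernel_mappings_end
static inline void arch_update_kernel_mappings_end(unsigned long start,
						   unsigned long end,
						   pgtbl_mod_mask mask)
{
	/* Same decision the generic default makes, just moved into arch code. */
	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
		sync_kernel_mappings_arch(start, end);
}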
Thanks,
Ryan
>
>> +#endif
>> +
>> /*
>> * Lowlevel-APIs (not for driver use!)
>> */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index a15f7dd500ea..f80930bc19f6 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3035,6 +3035,8 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>> if (WARN_ON(addr >= end))
>> return -EINVAL;
>>
>> + arch_update_kernel_mappings_begin(start, end);
>> +
>> pgd = pgd_offset(mm, addr);
>> do {
>> next = pgd_addr_end(addr, end);
>> @@ -3055,8 +3057,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>> break;
>> } while (pgd++, addr = next, addr != end);
>>
>> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>> - arch_sync_kernel_mappings(start, start + size);
>> + arch_update_kernel_mappings_end(start, end, mask);
>>
>> return err;
>> }
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 50fd44439875..c5c51d86ef78 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -312,10 +312,10 @@ int vmap_page_range(unsigned long addr, unsigned long end,
>> pgtbl_mod_mask mask = 0;
>> int err;
>>
>> + arch_update_kernel_mappings_begin(addr, end);
>> err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
>> ioremap_max_page_shift, &mask);
>> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>> - arch_sync_kernel_mappings(addr, end);
>> + arch_update_kernel_mappings_end(addr, end, mask);
>>
>> flush_cache_vmap(addr, end);
>> if (!err)
>> @@ -463,6 +463,9 @@ void __vunmap_range_noflush(unsigned long start, unsigned long end)
>> pgtbl_mod_mask mask = 0;
>>
>> BUG_ON(addr >= end);
>> +
>> + arch_update_kernel_mappings_begin(start, end);
>> +
>> pgd = pgd_offset_k(addr);
>> do {
>> next = pgd_addr_end(addr, end);
>> @@ -473,8 +476,7 @@ void __vunmap_range_noflush(unsigned long start, unsigned long end)
>> vunmap_p4d_range(pgd, addr, next, &mask);
>> } while (pgd++, addr = next, addr != end);
>>
>> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>> - arch_sync_kernel_mappings(start, end);
>> + arch_update_kernel_mappings_end(start, end, mask);
>> }
>>
>> void vunmap_range_noflush(unsigned long start, unsigned long end)
>> @@ -625,6 +627,8 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>
>> WARN_ON(page_shift < PAGE_SHIFT);
>>
>> + arch_update_kernel_mappings_begin(start, end);
>> +
>> if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>> page_shift == PAGE_SHIFT) {
>> err = vmap_small_pages_range_noflush(addr, end, prot, pages,
>> @@ -642,8 +646,7 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>> }
>> }
>>
>> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>> - arch_sync_kernel_mappings(start, end);
>> + arch_update_kernel_mappings_end(start, end, mask);
>>
>> return err;
>> }
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 16/16] arm64/mm: Defer barriers when updating kernel mappings
2025-02-10 8:03 ` Anshuman Khandual
@ 2025-02-10 11:12 ` Ryan Roberts
2025-02-13 5:30 ` Anshuman Khandual
0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-10 11:12 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 10/02/2025 08:03, Anshuman Khandual wrote:
>
>
> On 2/5/25 20:39, Ryan Roberts wrote:
>> Because the kernel can't tolerate page faults for kernel mappings, when
>> setting a valid, kernel space pte (or pmd/pud/p4d/pgd), it emits a
>> dsb(ishst) to ensure that the store to the pgtable is observed by the
>> table walker immediately. Additionally it emits an isb() to ensure that
>> any already speculatively determined invalid mapping fault gets
>> canceled.
>>
>> We can improve the performance of vmalloc operations by batching these
>> barriers until the end of a set of entry updates. The newly added
>> arch_update_kernel_mappings_begin() / arch_update_kernel_mappings_end()
>> provide the required hooks.
>>
>> vmalloc improves by up to 30% as a result.
>>
>> Two new TIF_ flags are created; TIF_KMAP_UPDATE_ACTIVE tells us if we
>> are in the batch mode and can therefore defer any barriers until the end
>> of the batch. TIF_KMAP_UPDATE_PENDING tells us if barriers are queued to
>> be emitted at the end of the batch.
>
> Why cannot this be achieved with a single TIF_KMAP_UPDATE_ACTIVE which is
> set in __begin(), cleared in __end() and saved across a __switch_to().
So unconditionally emit the barriers in _end(), and emit them in __switch_to()
if TIF_KMAP_UPDATE_ACTIVE is set?
I guess if you are calling _begin() then you are definitely going to be setting
at least 1 PTE, so you can safely emit the barriers unconditionally. I was trying to
protect against the case where you get pre-empted (potentially multiple times)
while in the loop. The TIF_KMAP_UPDATE_PENDING flag ensures you only emit the
barriers when you definitely need to. Without it, you would have to emit on
every pre-emption even if no more PTEs got set.
But I suspect this is a premature optimization. Probably it will never occur. So
I'll simplify as you suggest.
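i.e. an untested sketch along these lines; a single flag, set in _begin(),
cleared in _end(), with the barriers emitted unconditionally at the end of the
batch (and handled in __switch_to() if the flag is still set on preemption):

static inline void queue_pte_barriers(void)
{
	/* Defer the barriers while a batch update is active. */
	if (!test_thread_flag(TIF_KMAP_UPDATE_ACTIVE))
		emit_pte_barriers();
}

#define arch_update_kernel_mappings_begin arch_update_kernel_mappings_begin
static inline void arch_update_kernel_mappings_begin(unsigned long start,
						      unsigned long end)
{
	set_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
}

#define arch_update_kernel_mappings_end arch_update_kernel_mappings_end
static inline void arch_update_kernel_mappings_end(unsigned long start,
						    unsigned long end,
						    pgtbl_mod_mask mask)
{
	clear_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
	/* Pay for the barriers once per batch, unconditionally. */
	emit_pte_barriers();
}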
Thanks,
Ryan
>
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 65 +++++++++++++++++++---------
>> arch/arm64/include/asm/thread_info.h | 2 +
>> arch/arm64/kernel/process.c | 20 +++++++--
>> 3 files changed, 63 insertions(+), 24 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index ff358d983583..1ee9b9588502 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -39,6 +39,41 @@
>> #include <linux/mm_types.h>
>> #include <linux/sched.h>
>> #include <linux/page_table_check.h>
>> +#include <linux/pgtable_modmask.h>
>> +
>> +static inline void emit_pte_barriers(void)
>> +{
>> + dsb(ishst);
>> + isb();
>> +}
>
> There are many sequences of these two barriers in this particular header,
> hence it is probably a good idea to factor this out into a common helper.
> >> +
>> +static inline void queue_pte_barriers(void)
>> +{
>> + if (test_thread_flag(TIF_KMAP_UPDATE_ACTIVE)) {
>> + if (!test_thread_flag(TIF_KMAP_UPDATE_PENDING))
>> + set_thread_flag(TIF_KMAP_UPDATE_PENDING);
>> + } else
>> + emit_pte_barriers();
>> +}
>> +
>> +#define arch_update_kernel_mappings_begin arch_update_kernel_mappings_begin
>> +static inline void arch_update_kernel_mappings_begin(unsigned long start,
>> + unsigned long end)
>> +{
>> + set_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
>> +}
>> +
>> +#define arch_update_kernel_mappings_end arch_update_kernel_mappings_end
>> +static inline void arch_update_kernel_mappings_end(unsigned long start,
>> + unsigned long end,
>> + pgtbl_mod_mask mask)
>> +{
>> + if (test_thread_flag(TIF_KMAP_UPDATE_PENDING))
>> + emit_pte_barriers();
>> +
>> + clear_thread_flag(TIF_KMAP_UPDATE_PENDING);
>> + clear_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
>> +}
>>
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
>> @@ -323,10 +358,8 @@ static inline void __set_pte_complete(pte_t pte)
>> * Only if the new pte is valid and kernel, otherwise TLB maintenance
>> * or update_mmu_cache() have the necessary barriers.
>> */
>> - if (pte_valid_not_user(pte)) {
>> - dsb(ishst);
>> - isb();
>> - }
>> + if (pte_valid_not_user(pte))
>> + queue_pte_barriers();
>> }
>>
>> static inline void __set_pte(pte_t *ptep, pte_t pte)
>> @@ -791,10 +824,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>>
>> WRITE_ONCE(*pmdp, pmd);
>>
>> - if (pmd_valid_not_user(pmd)) {
>> - dsb(ishst);
>> - isb();
>> - }
>> + if (pmd_valid_not_user(pmd))
>> + queue_pte_barriers();
>> }
>>
>> static inline void pmd_clear(pmd_t *pmdp)
>> @@ -869,10 +900,8 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
>>
>> WRITE_ONCE(*pudp, pud);
>>
>> - if (pud_valid_not_user(pud)) {
>> - dsb(ishst);
>> - isb();
>> - }
>> + if (pud_valid_not_user(pud))
>> + queue_pte_barriers();
>> }
>>
>> static inline void pud_clear(pud_t *pudp)
>> @@ -960,10 +989,8 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>>
>> WRITE_ONCE(*p4dp, p4d);
>>
>> - if (p4d_valid_not_user(p4d)) {
>> - dsb(ishst);
>> - isb();
>> - }
>> + if (p4d_valid_not_user(p4d))
>> + queue_pte_barriers();
>> }
>>
>> static inline void p4d_clear(p4d_t *p4dp)
>> @@ -1098,10 +1125,8 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
>>
>> WRITE_ONCE(*pgdp, pgd);
>>
>> - if (pgd_valid_not_user(pgd)) {
>> - dsb(ishst);
>> - isb();
>> - }
>> + if (pgd_valid_not_user(pgd))
>> + queue_pte_barriers();
>> }
>>
>> static inline void pgd_clear(pgd_t *pgdp)
>> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
>> index 1114c1c3300a..382d2121261e 100644
>> --- a/arch/arm64/include/asm/thread_info.h
>> +++ b/arch/arm64/include/asm/thread_info.h
>> @@ -82,6 +82,8 @@ void arch_setup_new_exec(void);
>> #define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */
>> #define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */
>> #define TIF_TSC_SIGSEGV 30 /* SIGSEGV on counter-timer access */
>> +#define TIF_KMAP_UPDATE_ACTIVE 31 /* kernel map update in progress */
>> +#define TIF_KMAP_UPDATE_PENDING 32 /* kernel map updated with deferred barriers */
>>
>> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
>> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
>> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
>> index 42faebb7b712..1367ec6407d1 100644
>> --- a/arch/arm64/kernel/process.c
>> +++ b/arch/arm64/kernel/process.c
>> @@ -680,10 +680,10 @@ struct task_struct *__switch_to(struct task_struct *prev,
>> gcs_thread_switch(next);
>>
>> /*
>> - * Complete any pending TLB or cache maintenance on this CPU in case
>> - * the thread migrates to a different CPU.
>> - * This full barrier is also required by the membarrier system
>> - * call.
>> + * Complete any pending TLB or cache maintenance on this CPU in case the
>> + * thread migrates to a different CPU. This full barrier is also
>> + * required by the membarrier system call. Additionally it is required
>> + * for TIF_KMAP_UPDATE_PENDING, see below.
>> */
>> dsb(ish);
>>
>> @@ -696,6 +696,18 @@ struct task_struct *__switch_to(struct task_struct *prev,
>> /* avoid expensive SCTLR_EL1 accesses if no change */
>> if (prev->thread.sctlr_user != next->thread.sctlr_user)
>> update_sctlr_el1(next->thread.sctlr_user);
>> + else if (unlikely(test_thread_flag(TIF_KMAP_UPDATE_PENDING))) {
>> + /*
>> + * In unlikely event that a kernel map update is on-going when
>> + * preemption occurs, we must emit_pte_barriers() if pending.
>> + * emit_pte_barriers() consists of "dsb(ishst); isb();". The dsb
>> + * is already handled above. The isb() is handled if
>> + * update_sctlr_el1() was called. So only need to emit isb()
>> + * here if it wasn't called.
>> + */
>> + isb();
>> + clear_thread_flag(TIF_KMAP_UPDATE_PENDING);
>> + }
>>
>> /* the actual thread switch */
>> last = cpu_switch_to(prev, next);
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 02/16] arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes
2025-02-06 12:55 ` Ryan Roberts
@ 2025-02-12 14:44 ` Ryan Roberts
0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-02-12 14:44 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel, stable
On 06/02/2025 12:55, Ryan Roberts wrote:
> On 06/02/2025 06:15, Anshuman Khandual wrote:
>> On 2/5/25 20:39, Ryan Roberts wrote:
>>> arm64 supports multiple huge_pte sizes. Some of the sizes are covered by
>>> a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some
>>> are covered by multiple ptes at a particular level (CONT_PTE_SIZE,
>>> CONT_PMD_SIZE). So the function has to figure out the size from the
>>> huge_pte pointer. This was previously done by walking the pgtable to
>>> determine the level, then using the PTE_CONT bit to determine the number
>>> of ptes.
>>
>> Actually PTE_CONT gets used to determine whether the entry is a normal (i.e.
>> PMD/PUD based) huge page or a cont PTE/PMD based huge page, just to call into
>> the standard __ptep_get_and_clear() or the specific get_clear_contig(), after
>> working out the count via find_num_contig(), which walks the page table.
>>
>> PTE_CONT presence is only used to determine the switch above but not
>> to determine the number of ptes for the mapping as mentioned earlier.
>
> Sorry I don't really follow your distinction; PTE_CONT is used to decide whether
> we are operating on a single entry (pte_cont()==false) or on multiple entries
> (pte_cont()==true). For the multiple entry case, the level tells you the exact
> number.
>
> I can certainly tidy up this description a bit, but I think we both agree that
> the value of PTE_CONT is one of the inputs into deciding how many entries need
> to be operated on?
>
>>
>> There are two similar functions which determine the number of contig entries:
>>
>> static int find_num_contig(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep, size_t *pgsize)
>> {
>> pgd_t *pgdp = pgd_offset(mm, addr);
>> p4d_t *p4dp;
>> pud_t *pudp;
>> pmd_t *pmdp;
>>
>> *pgsize = PAGE_SIZE;
>> p4dp = p4d_offset(pgdp, addr);
>> pudp = pud_offset(p4dp, addr);
>> pmdp = pmd_offset(pudp, addr);
>> if ((pte_t *)pmdp == ptep) {
>> *pgsize = PMD_SIZE;
>> return CONT_PMDS;
>> }
>> return CONT_PTES;
>> }
>>
>> find_num_contig() already assumes that the entry is a contig huge page and
>> just finds whether it is a PMD or PTE based one. This always requires first
>> determining via pte_cont() that the PTE_CONT bit is set before calling
>> find_num_contig() in each instance.
>
> Agreed.
>
>>
>> But num_contig_ptes() can get the same information without walking the
>> page table and thus without predetermining if PTE_CONT is set or not.
>> The size can be derived from the VMA argument when present.
>
> Also agreed. But VMA is not provided to this function. And because we want to
> use it for kernel space mappings, I think it's a bad idea to pass VMA.
>
>>
>> static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
>> {
>> int contig_ptes = 0;
>>
>> *pgsize = size;
>>
>> switch (size) {
>> #ifndef __PAGETABLE_PMD_FOLDED
>> case PUD_SIZE:
>> if (pud_sect_supported())
>> contig_ptes = 1;
>> break;
>> #endif
>> case PMD_SIZE:
>> contig_ptes = 1;
>> break;
>> case CONT_PMD_SIZE:
>> *pgsize = PMD_SIZE;
>> contig_ptes = CONT_PMDS;
>> break;
>> case CONT_PTE_SIZE:
>> *pgsize = PAGE_SIZE;
>> contig_ptes = CONT_PTES;
>> break;
>> }
>>
>> return contig_ptes;
>> }
>>
>> On a side note, why can't num_contig_ptes() be used all the time and
>> find_num_contig() be dropped? Or am I missing something here?
>
> There are 2 remaining users of find_num_contig() after my series:
> huge_ptep_set_access_flags() and huge_ptep_set_wrprotect(). Both of them can
> only be legitimately called for present ptes (so it's safe to check pte_cont()).
> huge_ptep_set_access_flags() already has the VMA so it would be easy to convert
> to num_contig_ptes(). huge_ptep_set_wrprotect() doesn't have the VMA but I guess
> you could do the trick where you take the size of the folio that the pte points to?
>
> So yes, I think we could drop find_num_contig() and I agree it would be an
> improvement.
>
> But to be honest, grabbing the folio size also feels like a hack to me (we do
> this in other places too). While today, the folio size is guaranteed to be
> the same size as the huge pte in practice, I'm not sure there is any spec that
> mandates that?
>
> Perhaps the most robust thing is to just have a PTE_CONT bit for the swap-pte so
> we can tell the size of both present and non-present ptes, then do the table
> walk trick to find the level. Shrug.
>
>>
>>>
>>> But the PTE_CONT bit is only valid when the pte is present. For
>>> non-present pte values (e.g. markers, migration entries), the previous
>>> implementation was therefore erroneously determining the size. There is
>>> at least one known caller in core-mm, move_huge_pte(), which may call
>>> huge_ptep_get_and_clear() for a non-present pte. So we must be robust to
>>> this case. Additionally the "regular" ptep_get_and_clear() is robust to
>>> being called for non-present ptes so it makes sense to follow the
>>> behaviour.
>>
>> With a VMA argument and num_contig_ptes(), the dependency on PTE_CONT being
>> set and the entry being mapped might not be required.
>>>
>>> Fix this by using the new sz parameter which is now provided to the
>>> function. Additionally when clearing each pte in a contig range, don't
>>> gather the access and dirty bits if the pte is not present.
>>
>> Makes sense.
>>
>>>
>>> An alternative approach that would not require API changes would be to
>>> store the PTE_CONT bit in a spare bit in the swap entry pte. But it felt
>>> cleaner to follow other APIs' lead and just pass in the size.
>>
>> Right, changing the arguments in the API will help solve this problem.
>>
>>>
>>> While we are at it, add some debug warnings in functions that require
>>> the pte is present.
>>>
>>> As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap
>>> entry offset field (layout of non-present pte). Since hugetlb is never
>>> swapped to disk, this field will only be populated for markers, which
>>> always set this bit to 0, and hwpoison swap entries, which set the offset
>>> field to a PFN; so it would only ever be 1 for a 52-bit PVA system where
>>> memory in that high half was poisoned (I think!). So in practice, this
>>> bit would almost always be zero for non-present ptes and we would only
>>> clear the first entry if it was actually a contiguous block. That's
>>> probably a less severe symptom than if it was always interpreted as 1
>>> and cleared out potentially-present neighboring PTEs.
>>>
>>> Cc: <stable@vger.kernel.org>
>>> Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>> arch/arm64/mm/hugetlbpage.c | 54 ++++++++++++++++++++-----------------
>>> 1 file changed, 29 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>>> index 06db4649af91..328eec4bfe55 100644
>>> --- a/arch/arm64/mm/hugetlbpage.c
>>> +++ b/arch/arm64/mm/hugetlbpage.c
>>> @@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
>>> unsigned long pgsize,
>>> unsigned long ncontig)
>>> {
>>> - pte_t orig_pte = __ptep_get(ptep);
>>> - unsigned long i;
>>> -
>>> - for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
>>> - pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
>>> -
>>> - /*
>>> - * If HW_AFDBM is enabled, then the HW could turn on
>>> - * the dirty or accessed bit for any page in the set,
>>> - * so check them all.
>>> - */
>>> - if (pte_dirty(pte))
>>> - orig_pte = pte_mkdirty(orig_pte);
>>> -
>>> - if (pte_young(pte))
>>> - orig_pte = pte_mkyoung(orig_pte);
>>> + pte_t pte, tmp_pte;
>>> + bool present;
>>> +
>>> + pte = __ptep_get_and_clear(mm, addr, ptep);
>>> + present = pte_present(pte);
>>> + while (--ncontig) {
>>
>> Although this does the right thing by calling __ptep_get_and_clear() once
>> for non-contig huge pages, I'm wondering if the cont/non-cont separation
>> should be maintained in the caller huge_ptep_get_and_clear(), keeping the
>> current logical bifurcation intact.
>
> To what benefit?
>
>>
>>> + ptep++;
>>> + addr += pgsize;
>>> + tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
>>> + if (present) {
>>
>> Checking for present entries makes sense here.
>>
>>> + if (pte_dirty(tmp_pte))
>>> + pte = pte_mkdirty(pte);
>>> + if (pte_young(tmp_pte))
>>> + pte = pte_mkyoung(pte);
>>> + }
>>> }
>>> - return orig_pte;
>>> + return pte;
>>> }
>>>
>>> static pte_t get_clear_contig_flush(struct mm_struct *mm,
>>> @@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>>> {
>>> int ncontig;
>>> size_t pgsize;
>>> - pte_t orig_pte = __ptep_get(ptep);
>>> -
>>> - if (!pte_cont(orig_pte))
>>> - return __ptep_get_and_clear(mm, addr, ptep);
>>> -
>>> - ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>>>
>>> + ncontig = num_contig_ptes(sz, &pgsize);
>>
>> __ptep_get_and_clear() can still be called here if 'ncontig' is
>> returned as 0 indicating a normal non-contig huge page thus
>> keeping get_clear_contig() unchanged just to handle contig huge
>> pages.
>
> I think you're describing the case where num_contig_ptes() returns 0? The
> intention, from my reading of the function, is that num_contig_ptes() returns
> the number of ptes that need to be operated on (e.g. 1 for a single entry or N
> for a contig block). It will only return 0 if called with an invalid huge size.
> I don't believe it will ever "return 0 indicating a normal non-contig huge page".
>
> Perhaps the right solution is to add a warning if returning 0?
>
>>
>>> return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
>>> }
>>>
>>> @@ -451,6 +445,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>>> pgprot_t hugeprot;
>>> pte_t orig_pte;
>>>
>>> + VM_WARN_ON(!pte_present(pte));
>>> +
>>> if (!pte_cont(pte))
>>> return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
>>>
>>> @@ -461,6 +457,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>>> return 0;
>>>
>>> orig_pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
>>> + VM_WARN_ON(!pte_present(orig_pte));
>>>
>>> /* Make sure we don't lose the dirty or young state */
>>> if (pte_dirty(orig_pte))
>>> @@ -485,7 +482,10 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> size_t pgsize;
>>> pte_t pte;
>>>
>>> - if (!pte_cont(__ptep_get(ptep))) {
>>> + pte = __ptep_get(ptep);
>>> + VM_WARN_ON(!pte_present(pte));
>>> +
>>> + if (!pte_cont(pte)) {
>>> __ptep_set_wrprotect(mm, addr, ptep);
>>> return;
>>> }
>>> @@ -509,8 +509,12 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
>>> struct mm_struct *mm = vma->vm_mm;
>>> size_t pgsize;
>>> int ncontig;
>>> + pte_t pte;
>>>
>>> - if (!pte_cont(__ptep_get(ptep)))
>>> + pte = __ptep_get(ptep);
>>> + VM_WARN_ON(!pte_present(pte));
>>> +
>>> + if (!pte_cont(pte))
>>> return ptep_clear_flush(vma, addr, ptep);
>>>
>>> ncontig = find_num_contig(mm, addr, ptep, &pgsize);
>>
>> In all the above instances should not num_contig_ptes() be called to determine
>> if a given entry is non-contig or contig huge page, thus dropping the need for
>> pte_cont() and pte_present() tests as proposed here.
>
> Yeah maybe. But as per above, we have options for how to do that. I'm not sure
> which is preferable at the moment. What do you think? Regardless, I think that
> cleanup would be a separate patch (which I'm happy to add for v2). For this bug
> fix, I was trying to do the minimum.
I took another look at this; I concluded that we should switch from
find_num_contig() to num_contig_ptes() for the 2 cases where we have the vma and
can therefore directly get the huge_pte size.
That leaves one callsite in huge_ptep_set_wrprotect(), which doesn't have the
vma. One approach would be to grab the folio out of the pte and use the size of
the folio. That's already done in huge_ptep_get(). But actually I think that's a
very brittle approach because there is nothing stopping the size of the folio
from changing in future (you could have a folio twice the size mapped by 2
huge_ptes for example). So I'm actually proposing to keep find_num_contig() and
use it additionally in huge_ptep_get(). That will mean we end up with 2
callsites for find_num_contig(), and 0 places that infer the huge_pte size from
the folio size. I think that's much cleaner personally. I'll do this for v2.
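For reference, a rough sketch of the shape I have in mind for huge_ptep_get()
(illustrative only; it assumes the current prototype and loop structure, and the
exact v2 code may differ):

pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
	int ncontig, i;
	size_t pgsize;
	pte_t orig_pte = __ptep_get(ptep);

	if (!pte_present(orig_pte) || !pte_cont(orig_pte))
		return orig_pte;

	/* Derive the number of entries from the pgtable level, not the folio. */
	ncontig = find_num_contig(mm, addr, ptep, &pgsize);
	for (i = 0; i < ncontig; i++, ptep++) {
		pte_t pte = __ptep_get(ptep);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);

		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}
	return orig_pte;
}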
Thanks,
Ryan
>
> Thanks,
> Ryan
>
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 06/16] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
2025-02-07 9:38 ` Ryan Roberts
@ 2025-02-12 15:29 ` Ryan Roberts
0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-02-12 15:29 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 07/02/2025 09:38, Ryan Roberts wrote:
> On 06/02/2025 13:26, Ryan Roberts wrote:
>> On 06/02/2025 11:48, Anshuman Khandual wrote:
>>> On 2/5/25 20:39, Ryan Roberts wrote:
[...]
>>>>
>>>> While we are at it, refactor __ptep_get_and_clear() and
>>>> pmdp_huge_get_and_clear() to use a common ___ptep_get_and_clear() which
>>>> also takes a pgsize parameter. This will provide the huge_pte API the
>>>> means to clear ptes corresponding with the way they were set.
>>>
>>> __ptep_get_and_clear() refactoring does not seem to be related to the
>>> previous __set_ptes(). Should they be separated out into two different
>>> patches instead - for better clarity and review? Both these clean-ups
>>> have enough change and can stand on their own.
>>
>> Yeah I think you're probably right... I was being lazy... I'll separate them.
When looking at this again, I've decided not to split the patch. The same
approach is being applied to both helper APIs to produce "anysz" versions that
will be used by the huge_pte helpers. So I think it is reasonable to do all the
conversion in a single patch. It just looks too small and bitty if I split it out.
I'll rework the commit log to give both APIs equal prominence since it currently
sounds like __ptep_get_and_clear() is an afterthought.
[...]
>>> s/___set_ptes/___set_pxds ? to be more generic for all levels.
>>
>> So now we are into naming... I agree that in some senses pte feels specific to
>> the last level. But it's long form "page table entry" seems more generic than
>> "pxd" which implies only pmd, pud, p4d and pgd. At least to me...
>>
>> I think we got stuck trying to figure out a clear and short term for "page table
>> entry at any level" in the past. I think ttd was the best we got; Translation
>> Table Descriptor, which is the term the Arm ARM uses. But that opens a can of
>> worms as now we need tdd_t and all the converters pte_tdd(), tdd_pte(),
>> pmd_tdd(), ... and probably a bunch more stuff on top.
>>
>> So personally I prefer to take the coward's way out and just reuse pte.
>
> How about set_ptes_anylvl() and ptep_get_and_clear_anylvl()? I think this makes
> it explicit and has the benefit of removing the leading underscores. It also
>> means we can reuse pte_t and friends, and we can extend this nomenclature to
> other places in future at the expense of a 7 char suffix ("_anylvl").
>
> What do you think?
I'm going to call them set_ptes_anysz() and ptep_get_and_clear_anysz(). That's
one char shorter and better reflects that we are passing a size parameter in,
not a level parameter.
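i.e. roughly this shape (sketch only; the final v2 signatures may differ in
detail):

static inline void set_ptes_anysz(struct mm_struct *mm, pte_t *ptep, pte_t pte,
				  unsigned int nr, unsigned long pgsize);

static inline pte_t ptep_get_and_clear_anysz(struct mm_struct *mm, pte_t *ptep,
					     unsigned long pgsize);

with the level-specific helpers becoming thin wrappers, e.g.:

static inline void __set_ptes(struct mm_struct *mm,
			      unsigned long __always_unused addr,
			      pte_t *ptep, pte_t pte, unsigned int nr)
{
	set_ptes_anysz(mm, ptep, pte, nr, PAGE_SIZE);
}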
Thanks,
Ryan
>
>>
>>>
>>>> {
>>>> - __sync_cache_and_tags(pte, nr);
>>>> - __check_safe_pte_update(mm, ptep, pte);
>>>> - __set_pte(ptep, pte);
>>>> + unsigned long stride = pgsize >> PAGE_SHIFT;
>>>> +
>>>> + switch (pgsize) {
>>>> + case PAGE_SIZE:
>>>> + page_table_check_ptes_set(mm, ptep, pte, nr);
>>>> + break;
>>>> + case PMD_SIZE:
>>>> + page_table_check_pmds_set(mm, (pmd_t *)ptep, pte_pmd(pte), nr);
>>>> + break;
>>>> + case PUD_SIZE:
>>>> + page_table_check_puds_set(mm, (pud_t *)ptep, pte_pud(pte), nr);
>>>> + break;
>>>
>>> This is where the new page table check APIs get used for batch testing.
>>
>> Yes and I anticipate that the whole switch block should be optimized out when
>> page_table_check is disabled.
>>
>>>
>>>> + default:
>>>> + VM_WARN_ON(1);
>>>> + }
>>>> +
>>>> + __sync_cache_and_tags(pte, nr * stride);
>>>> +
>>>> + for (;;) {
>>>> + __check_safe_pte_update(mm, ptep, pte);
>>>> + __set_pte(ptep, pte);
>>>> + if (--nr == 0)
>>>> + break;
>>>> + ptep++;
>>>> + pte = pte_advance_pfn(pte, stride);
>>>> + }
>>>> }
>>>
>>> LGTM
>>>
>>>>
>>>> -static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>>>> - pmd_t *pmdp, pmd_t pmd)
>>>> +static inline void __set_ptes(struct mm_struct *mm,
>>>> + unsigned long __always_unused addr,
>>>> + pte_t *ptep, pte_t pte, unsigned int nr)
>>>> {
>>>> - page_table_check_pmd_set(mm, pmdp, pmd);
>>>> - return __set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd),
>>>> - PMD_SIZE >> PAGE_SHIFT);
>>>> + ___set_ptes(mm, ptep, pte, nr, PAGE_SIZE);
>>>> }
>>>>
>>>> -static inline void set_pud_at(struct mm_struct *mm, unsigned long addr,
>>>> - pud_t *pudp, pud_t pud)
>>>> +static inline void __set_pmds(struct mm_struct *mm,
>>>> + unsigned long __always_unused addr,
>>>> + pmd_t *pmdp, pmd_t pmd, unsigned int nr)
>>>> +{
>>>> + ___set_ptes(mm, (pte_t *)pmdp, pmd_pte(pmd), nr, PMD_SIZE);
>>>> +}
>>>> +#define set_pmd_at(mm, addr, pmdp, pmd) __set_pmds(mm, addr, pmdp, pmd, 1)
>>>> +
>>>> +static inline void __set_puds(struct mm_struct *mm,
>>>> + unsigned long __always_unused addr,
>>>> + pud_t *pudp, pud_t pud, unsigned int nr)
>>>> {
>>>> - page_table_check_pud_set(mm, pudp, pud);
>>>> - return __set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud),
>>>> - PUD_SIZE >> PAGE_SHIFT);
>>>> + ___set_ptes(mm, (pte_t *)pudp, pud_pte(pud), nr, PUD_SIZE);
>>>> }
>>>> +#define set_pud_at(mm, addr, pudp, pud) __set_puds(mm, addr, pudp, pud, 1)
>>>
>>> LGTM
>>>
>>>>
>>>> #define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
>>>> #define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)
>>>> @@ -1276,16 +1288,34 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>>>> }
>>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
>>>>
>>>> -static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
>>>> - unsigned long address, pte_t *ptep)
>>>> +static inline pte_t ___ptep_get_and_clear(struct mm_struct *mm, pte_t *ptep,
>>>> + unsigned long pgsize)
>>>
>>> So address got dropped and page size got added as an argument.
>>>
>>>> {
>>>> pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
>>>>
>>>> - page_table_check_pte_clear(mm, pte);
>>>> + switch (pgsize) {
>>>> + case PAGE_SIZE:
>>>> + page_table_check_pte_clear(mm, pte);
>>>> + break;
>>>> + case PMD_SIZE:
>>>> + page_table_check_pmd_clear(mm, pte_pmd(pte));
>>>> + break;
>>>> + case PUD_SIZE:
>>>> + page_table_check_pud_clear(mm, pte_pud(pte));
>>>> + break;
>>>> + default:
>>>> + VM_WARN_ON(1);
>>>> + }
>>>>
>>>> return pte;
>>>> }
>>>>
>>>> +static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
>>>> + unsigned long address, pte_t *ptep)
>>>> +{
>>>> + return ___ptep_get_and_clear(mm, ptep, PAGE_SIZE);
>>>> +}
>>>> +
>>>> static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>>>> pte_t *ptep, unsigned int nr, int full)
>>>> {
>>>> @@ -1322,11 +1352,7 @@ static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
>>>> static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
>>>> unsigned long address, pmd_t *pmdp)
>>>> {
>>>> - pmd_t pmd = __pmd(xchg_relaxed(&pmd_val(*pmdp), 0));
>>>> -
>>>> - page_table_check_pmd_clear(mm, pmd);
>>>> -
>>>> - return pmd;
>>>> + return pte_pmd(___ptep_get_and_clear(mm, (pte_t *)pmdp, PMD_SIZE));
>>>> }
>>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>>
>>>
>>> Although currently there is no pudp_huge_get_and_clear() helper on arm64,
>>> the reworked ___ptep_get_and_clear() will be able to support that as well
>>> if and when required, as it now supports the PUD_SIZE page size.
>>
>> yep.
>>
>> Thanks for all your review so far!
>>
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 08/16] arm64/mm: Hoist barriers out of ___set_ptes() loop
2025-02-07 10:38 ` Ryan Roberts
@ 2025-02-12 16:00 ` Ryan Roberts
0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-02-12 16:00 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 07/02/2025 10:38, Ryan Roberts wrote:
> On 07/02/2025 05:35, Anshuman Khandual wrote:
>>
>>
>> On 2/5/25 20:39, Ryan Roberts wrote:
>>> ___set_ptes() previously called __set_pte() for each PTE in the range,
>>> which would conditionally issue a DSB and ISB to make the new PTE value
>>> immediately visible to the table walker if the new PTE was valid and for
>>> kernel space.
>>>
>>> We can do better than this; let's hoist those barriers out of the loop
>>> so that they are only issued once at the end of the loop. We then reduce
>>> the cost by the number of PTEs in the range.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>> arch/arm64/include/asm/pgtable.h | 14 ++++++++++----
>>> 1 file changed, 10 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 3b55d9a15f05..1d428e9c0e5a 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -317,10 +317,8 @@ static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
>>> WRITE_ONCE(*ptep, pte);
>>> }
>>>
>>> -static inline void __set_pte(pte_t *ptep, pte_t pte)
>>> +static inline void __set_pte_complete(pte_t pte)
>>> {
>>> - __set_pte_nosync(ptep, pte);
>>> -
>>> /*
>>> * Only if the new pte is valid and kernel, otherwise TLB maintenance
>>> * or update_mmu_cache() have the necessary barriers.
>>> @@ -331,6 +329,12 @@ static inline void __set_pte(pte_t *ptep, pte_t pte)
>>> }
>>> }
>>>
>>> +static inline void __set_pte(pte_t *ptep, pte_t pte)
>>> +{
>>> + __set_pte_nosync(ptep, pte);
>>> + __set_pte_complete(pte);
>>> +}
>>> +
>>> static inline pte_t __ptep_get(pte_t *ptep)
>>> {
>>> return READ_ONCE(*ptep);
>>> @@ -647,12 +651,14 @@ static inline void ___set_ptes(struct mm_struct *mm, pte_t *ptep, pte_t pte,
>>>
>>> for (;;) {
>>> __check_safe_pte_update(mm, ptep, pte);
>>> - __set_pte(ptep, pte);
>>> + __set_pte_nosync(ptep, pte);
>>> if (--nr == 0)
>>> break;
>>> ptep++;
>>> pte = pte_advance_pfn(pte, stride);
>>> }
>>> +
>>> + __set_pte_complete(pte);
>>
>> Given that the loop now iterates over a number of page table entries without a corresponding
>> dsb/isb sync per entry, could there be a situation where something else gets scheduled
>> on the cpu before __set_pte_complete() is called, leaving the entire block of page table
>> entries without the desired mapping effect? IOW, how is __set_pte_complete() guaranteed to
>> execute once the loop above completes? Otherwise this change LGTM.
>
> I don't think this changes the model. Previously, __set_pte() was called, which
> writes the pte to the pgtable, then issues the barriers. So there is still a
> window where the thread could be unscheduled after the write but before the
> barriers. Yes, my change makes that window bigger, but if it is a bug now, it
> was a bug before.
>
> Additionally, the spec for set_ptes() is:
>
> /**
> * set_ptes - Map consecutive pages to a contiguous range of addresses.
> * @mm: Address space to map the pages into.
> * @addr: Address to map the first page at.
> * @ptep: Page table pointer for the first entry.
> * @pte: Page table entry for the first page.
> * @nr: Number of pages to map.
> *
> * When nr==1, initial state of pte may be present or not present, and new state
> * may be present or not present. When nr>1, initial state of all ptes must be
> * not present, and new state must be present.
> *
> * May be overridden by the architecture, or the architecture can define
> * set_pte() and PFN_PTE_SHIFT.
> *
> * Context: The caller holds the page table lock. The pages all belong
> * to the same folio. The PTEs are all in the same PMD.
> */
>
> Note that the caller is required to hold the page table lock. That's a spin lock
> so should be non-preemptible at this point (perhaps not for RT?)
>
> Although actually, vmalloc doesn't hold a lock when calling these helpers; it
> has a lock when allocating the VA space, then drops it.
>
> So yes, I think there is a chance of preemption after writing the pgtable entry
> but before issuing the barriers.
>
> But in that case, we get saved by the DSB in the context switch path. There is
> no guarantee of an ISB in that path (AFAIU). But the need for an ISB is a bit
> woolly anyway. My rough understanding is that the ISB is there to prevent
> previous speculation from determining that a given translation was invalid and
> "caching" that determination in the pipeline. That could still (theoretically)
> happen on remote CPUs I think, and we have the spurious fault handler to detect
> that. Anyway, once you context switch, the local CPU becomes remote and we don't
> have the ISB there, so what's the difference... There's a high chance I've
> misunderstood a bunch of this.
I thought about this some more; the ISB is there to ensure that the "speculative
invalid translation marker" cached in the pipeline gets removed prior to any
code that runs after set_ptes() returns and accesses an address now mapped by
the pte that was set. Even if preemption occurs, the ISB will still execute when
the thread runs again, before the return from set_ptes(). So all is well.
>
>
> In conclusion, I don't think I've made things any worse.
>
> Thanks,
> Ryan
>
>>
>>> }
>>>
>>> static inline void __set_ptes(struct mm_struct *mm,
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 09/16] arm64/mm: Avoid barriers for invalid or userspace mappings
2025-02-07 10:53 ` Ryan Roberts
@ 2025-02-12 16:48 ` Ryan Roberts
0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-02-12 16:48 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 07/02/2025 10:53, Ryan Roberts wrote:
> On 07/02/2025 08:11, Anshuman Khandual wrote:
>> On 2/5/25 20:39, Ryan Roberts wrote:
>>> __set_pte_complete(), set_pmd(), set_pud(), set_p4d() and set_pgd() are
>>> used to write entries into pgtables. And they issue barriers (currently
>>> dsb and isb) to ensure that the written values are observed by the table
>>> walker prior to any program-order-future memory access to the mapped
>>> location.
>>
>> Right.
>>
>>>
>>> Over the years some of these functions have received optimizations: In
>>> particular, commit 7f0b1bf04511 ("arm64: Fix barriers used for page
>>> table modifications") made it so that the barriers were only emitted for
>>> valid-kernel mappings for set_pte() (now __set_pte_complete()). And
>>> commit 0795edaf3f1f ("arm64: pgtable: Implement p[mu]d_valid() and check
>>> in set_p[mu]d()") made it so that set_pmd()/set_pud() only emitted the
>>> barriers for valid mappings. set_p4d()/set_pgd() continue to emit the
>>> barriers unconditionally.
>>
>> Right.
>>
>>>
>>> This is all very confusing to the casual observer; surely the rules
>>> should be invariant to the level? Let's change this so that every level
>>> consistently emits the barriers only when setting valid, non-user
>>> entries (both table and leaf).
>>
>> Agreed.
>>
>>>
>>> It seems obvious that if it is ok to elide barriers for all but valid kernel
>>> mappings at pte level, it must also be ok to do this for leaf entries at
>>> other levels: If setting an entry to invalid, a tlb maintenance
>>> operaiton must surely follow to synchronise the TLB and this contains
>>
>> s/operaiton/operation
>
> Ugh, I really need a spell checker for my editor!
>
>>
>>> the required barriers. If setting a valid user mapping, the previous
>>> mapping must have been invalid and there must have been a TLB
>>> maintenance operation (complete with barriers) to honour
>>> break-before-make. So the worst that can happen is we take an extra
>>> fault (which will imply the DSB + ISB) and conclude that there is
>>> nothing to do. These are the aguments for doing this optimization at pte
>>
>> s/aguments/arguments
>>
>>> level and they also apply to leaf mappings at other levels.
>>
>> So user the page table updates both for the table and leaf entries remains
>> unchanged for now regarding dsb/isb sync i.e don't do anything ?
>
> Sorry, this doesn't parse.
>
>>
>>>
>>> For table entries, the same arguments hold: If unsetting a table entry, TLB
>>> maintenance is required and this will emit the required barriers. If setting a
>>> table entry, the previous value must have been invalid and the table
>>> walker must already be able to observe that. Additionally the contents
>>> of the pgtable being pointed to in the newly set entry must be visible
>>> before the entry is written and this is enforced via smp_wmb() (dmb) in
>>> the pgtable allocation functions and in __split_huge_pmd_locked(). But
>>> this last part could never have been enforced by the barriers in
>>> set_pXd() because they occur after updating the entry. So ultimately,
>>> the worst that can happen by eliding these barriers for user table
>>> entries is an extra fault.
>>
>> Basically nothing needs to be done while setting user page table entries.
>>
>>>
>>> I observe roughly the same number of page faults (107M) with and without
>>> this change when compiling the kernel on Apple M2.
>>
>> Are these total page faults or only additional faults caused because there
>> was no dsb/isb sync after the user page table update?
>
> total page faults. The experiment was to check whether eliding more barriers for
> valid user space mappings leads to an increase in page faults. This
> very simple experiment suggests no.
>
>>
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>> arch/arm64/include/asm/pgtable.h | 60 ++++++++++++++++++++++++++++----
>>> 1 file changed, 54 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 1d428e9c0e5a..ff358d983583 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -767,6 +767,19 @@ static inline bool in_swapper_pgdir(void *addr)
>>> ((unsigned long)swapper_pg_dir & PAGE_MASK);
>>> }
>>>
>>> +static inline bool pmd_valid_not_user(pmd_t pmd)
>>> +{
>>> + /*
>>> + * User-space table pmd entries always have (PXN && !UXN). All other
>>> + * combinations indicate it's a table entry for kernel space.
>>> + * Valid-not-user leaf entries follow the same rules as
>>> + * pte_valid_not_user().
>>> + */
>>> + if (pmd_table(pmd))
>>> + return !((pmd_val(pmd) & (PMD_TABLE_PXN | PMD_TABLE_UXN)) == PMD_TABLE_PXN);
>>
>> Should not this be abstracted out as pmd_table_not_user_table() which can
>> then be re-used in other levels as well.
>
> Yeah maybe. Let me mull it over.
I discovered a bug (see below), so I decided to keep it simple. I'm defining
pmd_valid_not_user() as an inline function, as is done in this version. But all
other levels are just macros defined to wrap pmd_valid_not_user(). That way all
levels are treated the same way. This means that we might be slightly
over-checking for the higher levels that don't support block mappings, but it's
safe and correct and reuses the maximum amount of code.
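Something like this sketch (the exact conversion plumbing for the other levels
is still to be finalised, so treat the macros below as illustrative):

static inline bool pmd_valid_not_user(pmd_t pmd)
{
	/*
	 * User-space table entries always have (PXN && !UXN). Valid-not-user
	 * leaf entries follow the same rules as pte_valid_not_user().
	 */
	if (pmd_table(pmd))
		return !((pmd_val(pmd) & (PMD_TABLE_PXN | PMD_TABLE_UXN)) == PMD_TABLE_PXN);
	return pte_valid_not_user(pmd_pte(pmd));
}

/* All other levels wrap the pmd variant so every level is treated the same. */
#define pud_valid_not_user(pud)	pmd_valid_not_user(__pmd(pud_val(pud)))
#define p4d_valid_not_user(p4d)	pmd_valid_not_user(__pmd(p4d_val(p4d)))
#define pgd_valid_not_user(pgd)	pmd_valid_not_user(__pmd(pgd_val(pgd)))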
>
>>
>>> + return pte_valid_not_user(pmd_pte(pmd));
>>> +}
>>> +
>>
>> Something like.
>>
>> static inline bool pmd_valid_not_user_table(pmd_t pmd)
>> {
>> return pmd_valid(pmd) &&
>> !((pmd_val(pmd) & (PMD_PMD_TABLE_PXN | PMD_TABLE_UXN)) == PMD_TABLE_PXN);
>> }
>>
>> static inline bool pmd_valid_not_user(pmd_t pmd)
>> {
>> if (pmd_table(pmd))
>> return pmd_valid_not_user_table(pmd);
>> else
>> return pte_valid_not_user(pmd_pte(pmd));
>> }
>>
>>> static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>>> {
>>> #ifdef __PAGETABLE_PMD_FOLDED
>>> @@ -778,7 +791,7 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>>>
>>> WRITE_ONCE(*pmdp, pmd);
>>>
>>> - if (pmd_valid(pmd)) {
>>> + if (pmd_valid_not_user(pmd)) {
>>> dsb(ishst);
>>> isb();
>>> }
>>> @@ -836,6 +849,17 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
>>>
>>> static inline bool pgtable_l4_enabled(void);
>>>
>>> +
>>> +static inline bool pud_valid_not_user(pud_t pud)
>>> +{
>>> + /*
>>> + * Follows the same rules as pmd_valid_not_user().
>>> + */
>>> + if (pud_table(pud))
>>> + return !((pud_val(pud) & (PUD_TABLE_PXN | PUD_TABLE_UXN)) == PUD_TABLE_PXN);
This is buggy for configs where pud_table() is hardcoded to return true. In this
case we will assume the pud is valid when it may not be. (actually you could
call that a bug in pud_table(), which should really at least still be checking
that it's valid).
>>> + return pte_valid_not_user(pud_pte(pud));
>>> +}
>>
>> This can be expressed in terms of pmd_valid_not_user() itself.
>>
>> #define pud_valid_not_user() pmd_valid_not_user(pud_pmd(pud))
>
> The trouble with this is that you end up using pmd_table() not pud_table(). For
> some configs pud_table() is hardcoded to true. So we lose the benefit. So I'd
> rather keep it as it's own function.
>
>>
>>> +
>>> static inline void set_pud(pud_t *pudp, pud_t pud)
>>> {
>>> if (!pgtable_l4_enabled() && in_swapper_pgdir(pudp)) {
>>> @@ -845,7 +869,7 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
>>>
>>> WRITE_ONCE(*pudp, pud);
>>>
>>> - if (pud_valid(pud)) {
>>> + if (pud_valid_not_user(pud)) {
>>> dsb(ishst);
>>> isb();
>>> }
>>> @@ -917,6 +941,16 @@ static inline bool mm_pud_folded(const struct mm_struct *mm)
>>> #define p4d_bad(p4d) (pgtable_l4_enabled() && !(p4d_val(p4d) & P4D_TABLE_BIT))
>>> #define p4d_present(p4d) (!p4d_none(p4d))
>>>
>>> +static inline bool p4d_valid_not_user(p4d_t p4d)
>>> +{
>>> + /*
>>> + * User-space table p4d entries always have (PXN && !UXN). All other
>>> + * combinations indicate it's a table entry for kernel space. p4d block
>>> + * entries are not supported.
>>> + */
>>> + return !((p4d_val(p4d) & (P4D_TABLE_PXN | P4D_TABLE_UXN)) == P4D_TABLE_PXN);
This was buggy because we never check valid!
>>> +}
>>
>> Instead
>>
>> #define p4d_valid_not_user_able() pmd_valid_not_user_able(p4d_pmd(p4d))
>>
>>> +
>>> static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>>> {
>>> if (in_swapper_pgdir(p4dp)) {
>>> @@ -925,8 +959,11 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>>> }
>>>
>>> WRITE_ONCE(*p4dp, p4d);
>>> - dsb(ishst);
>>> - isb();
>>> +
>>> + if (p4d_valid_not_user(p4d)) {
>>
>>
>> Check p4d_valid_not_user_able() instead.
>
> I don't really know why we would want to add table into the name at this level.
> Why not be consistent and continue to use p4d_valid_not_user()? The fact that
> p4d doesn't support leaf entries is just a matter for the implementation.
>
>>
>>> + dsb(ishst);
>>> + isb();
>>> + }
>>> }
>>>
>>> static inline void p4d_clear(p4d_t *p4dp)
>>> @@ -1044,6 +1081,14 @@ static inline bool mm_p4d_folded(const struct mm_struct *mm)
>>> #define pgd_bad(pgd) (pgtable_l5_enabled() && !(pgd_val(pgd) & PGD_TABLE_BIT))
>>> #define pgd_present(pgd) (!pgd_none(pgd))
>>>
>>> +static inline bool pgd_valid_not_user(pgd_t pgd)
>>> +{
>>> + /*
>>> + * Follows the same rules as p4d_valid_not_user().
>>> + */
>>> + return !((pgd_val(pgd) & (PGD_TABLE_PXN | PGD_TABLE_UXN)) == PGD_TABLE_PXN);
Same here!
>>> +}
>>
>> Similarly
>>
>> #define pgd_valid_not_user_able() pmd_valid_not_user_able(pgd_pmd(pgd))
>>
>>
>>> +
>>> static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
>>> {
>>> if (in_swapper_pgdir(pgdp)) {
>>> @@ -1052,8 +1097,11 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
>>> }
>>>
>>> WRITE_ONCE(*pgdp, pgd);
>>> - dsb(ishst);
>>> - isb();
>>> +
>>> + if (pgd_valid_not_user(pgd)) {
>>
>> Check pgd_valid_not_user_able() instead.
>>
>>> + dsb(ishst);
>>> + isb();
>>> + }
>>> }
>>>
>>> static inline void pgd_clear(pgd_t *pgdp)
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 03/16] arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level
2025-02-06 13:04 ` Ryan Roberts
@ 2025-02-13 4:57 ` Anshuman Khandual
0 siblings, 0 replies; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-13 4:57 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel, stable
On 2/6/25 18:34, Ryan Roberts wrote:
> On 06/02/2025 06:46, Anshuman Khandual wrote:
>>
>>
>> On 2/5/25 20:39, Ryan Roberts wrote:
>>> commit c910f2b65518 ("arm64/mm: Update tlb invalidation routines for
>>> FEAT_LPA2") changed the "invalidation level unknown" hint from 0 to
>>> TLBI_TTL_UNKNOWN (INT_MAX). But the fallback "unknown level" path in
>>> flush_hugetlb_tlb_range() was not updated. So as it stands, when trying
>>> to invalidate CONT_PMD_SIZE or CONT_PTE_SIZE hugetlb mappings, we will
>>> spuriously try to invalidate at level 0 on LPA2-enabled systems.
>>>
>>> Fix this so that the fallback passes TLBI_TTL_UNKNOWN, and while we are
>>> at it, explicitly use the correct stride and level for CONT_PMD_SIZE and
>>> CONT_PTE_SIZE, which should provide a minor optimization.
>>>
>>> Cc: <stable@vger.kernel.org>
>>> Fixes: c910f2b65518 ("arm64/mm: Update tlb invalidation routines for FEAT_LPA2")
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>> arch/arm64/include/asm/hugetlb.h | 20 ++++++++++++++------
>>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
>>> index 03db9cb21ace..8ab9542d2d22 100644
>>> --- a/arch/arm64/include/asm/hugetlb.h
>>> +++ b/arch/arm64/include/asm/hugetlb.h
>>> @@ -76,12 +76,20 @@ static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
>>> {
>>> unsigned long stride = huge_page_size(hstate_vma(vma));
>>>
>>> - if (stride == PMD_SIZE)
>>> - __flush_tlb_range(vma, start, end, stride, false, 2);
>>> - else if (stride == PUD_SIZE)
>>> - __flush_tlb_range(vma, start, end, stride, false, 1);
>>> - else
>>> - __flush_tlb_range(vma, start, end, PAGE_SIZE, false, 0);
>>> + switch (stride) {
>>> + case PUD_SIZE:
>>> + __flush_tlb_range(vma, start, end, PUD_SIZE, false, 1);
>>> + break;
>>
>> Just wondering - should not !__PAGETABLE_PMD_FOLDED and pud_sect_supported()
>> checks also be added here for this PUD_SIZE case ?
>
> Yeah I guess so. TBH, it's never been entirely clear to me what the benefit is?
> Is it just to remove (a tiny amount of) dead code when we know we don't support
> blocks at the level? Or is there something more fundamental going on that I've
> missed?
There is a generic fallback for PUD_SIZE in include/asm-generic/pgtable-nopud.h when
it is not defined on the arm64 platform, and pud_sect_supported() might also get
optimized away by the compiler.
static inline bool pud_sect_supported(void)
{
return PAGE_SIZE == SZ_4K;
}
IIUC this just saves dead code from being compiled as you mentioned.
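i.e. the PUD case in flush_hugetlb_tlb_range() would presumably end up looking
something like this (illustrative only):

	switch (stride) {
#ifndef __PAGETABLE_PMD_FOLDED
	case PUD_SIZE:
		if (pud_sect_supported())
			__flush_tlb_range(vma, start, end, PUD_SIZE, false, 1);
		break;
#endif
	/* remaining cases unchanged */
	}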
>
> We seem to be quite inconsistent with the use of pud_sect_supported() in
> hugetlbpage.c.
PUD_SIZE switch cases in hugetlb_mask_last_page() and arch_make_huge_pte()? Those
should be fixed.
>
> Anyway, I'll add this in, I guess it's preferable to follow the established pattern.
Agreed.
>
> Thanks,
> Ryan
>
>>
>>> + case CONT_PMD_SIZE:
>>> + case PMD_SIZE:
>>> + __flush_tlb_range(vma, start, end, PMD_SIZE, false, 2);
>>> + break;
>>> + case CONT_PTE_SIZE:
>>> + __flush_tlb_range(vma, start, end, PAGE_SIZE, false, 3);
>>> + break;
>>> + default:
>>> + __flush_tlb_range(vma, start, end, PAGE_SIZE, false, TLBI_TTL_UNKNOWN);
>>> + }
>>> }
>>>
>>> #endif /* __ASM_HUGETLB_H */
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 16/16] arm64/mm: Defer barriers when updating kernel mappings
2025-02-10 11:12 ` Ryan Roberts
@ 2025-02-13 5:30 ` Anshuman Khandual
2025-02-13 9:38 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-13 5:30 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/10/25 16:42, Ryan Roberts wrote:
> On 10/02/2025 08:03, Anshuman Khandual wrote:
>>
>>
>> On 2/5/25 20:39, Ryan Roberts wrote:
>>> Because the kernel can't tolerate page faults for kernel mappings, when
>>> setting a valid, kernel space pte (or pmd/pud/p4d/pgd), it emits a
>>> dsb(ishst) to ensure that the store to the pgtable is observed by the
>>> table walker immediately. Additionally it emits an isb() to ensure that
>>> any already speculatively determined invalid mapping fault gets
>>> canceled.
>>>
>>> We can improve the performance of vmalloc operations by batching these
>>> barriers until the end of a set of entry updates. The newly added
>>> arch_update_kernel_mappings_begin() / arch_update_kernel_mappings_end()
>>> provide the required hooks.
>>>
>>> vmalloc improves by up to 30% as a result.
>>>
>>> Two new TIF_ flags are created; TIF_KMAP_UPDATE_ACTIVE tells us if we
>>> are in the batch mode and can therefore defer any barriers until the end
>>> of the batch. TIF_KMAP_UPDATE_PENDING tells us if barriers are queued to
>>> be emitted at the end of the batch.
>>
>> Why can't this be achieved with a single TIF_KMAP_UPDATE_ACTIVE which is
>> set in __begin(), cleared in __end() and saved across a __switch_to()?
>
> So unconditionally emit the barriers in _end(), and emit them in __switch_to()
> if TIF_KMAP_UPDATE_ACTIVE is set?
Right.
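i.e. something like the following rough sketch (naming kept as in this version;
exact details are of course up to you):

static inline void queue_pte_barriers(void)
{
	/* In batch mode just defer; _end() emits the barriers unconditionally. */
	if (!test_thread_flag(TIF_KMAP_UPDATE_ACTIVE))
		emit_pte_barriers();
}

static inline void arch_update_kernel_mappings_end(unsigned long start,
						   unsigned long end,
						   pgtbl_mod_mask mask)
{
	emit_pte_barriers();
	clear_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
}

with __switch_to() emitting the barriers for the outgoing task whenever
TIF_KMAP_UPDATE_ACTIVE is set.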
>
> I guess if calling _begin() then you are definitely going to be setting at least
> 1 PTE. So you can definitely emit the barriers unconditionally. I was trying to
> protect against the case where you get pre-empted (potentially multiple times)
> while in the loop. The TIF_KMAP_UPDATE_PENDING flag ensures you only emit the
> barriers when you definitely need to. Without it, you would have to emit on
> every pre-emption even if no more PTEs got set.
>
> But I suspect this is a premature optimization. Probably it will never occur. So
Agreed.
> I'll simplify as you suggest.
>
> Thanks,
> Ryan
>
>>
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>> arch/arm64/include/asm/pgtable.h | 65 +++++++++++++++++++---------
>>> arch/arm64/include/asm/thread_info.h | 2 +
>>> arch/arm64/kernel/process.c | 20 +++++++--
>>> 3 files changed, 63 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index ff358d983583..1ee9b9588502 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -39,6 +39,41 @@
>>> #include <linux/mm_types.h>
>>> #include <linux/sched.h>
>>> #include <linux/page_table_check.h>
>>> +#include <linux/pgtable_modmask.h>
>>> +
>>> +static inline void emit_pte_barriers(void)
>>> +{
>>> + dsb(ishst);
>>> + isb();
>>> +}
>>
>> There are many sequences of these two barriers in this particular header,
>> hence probably a good idea to factor this out into a common helper.
>>>> +
>>> +static inline void queue_pte_barriers(void)
>>> +{
>>> + if (test_thread_flag(TIF_KMAP_UPDATE_ACTIVE)) {
>>> + if (!test_thread_flag(TIF_KMAP_UPDATE_PENDING))
>>> + set_thread_flag(TIF_KMAP_UPDATE_PENDING);
>>> + } else
>>> + emit_pte_barriers();
>>> +}
>>> +
>>> +#define arch_update_kernel_mappings_begin arch_update_kernel_mappings_begin
>>> +static inline void arch_update_kernel_mappings_begin(unsigned long start,
>>> + unsigned long end)
>>> +{
>>> + set_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
>>> +}
>>> +
>>> +#define arch_update_kernel_mappings_end arch_update_kernel_mappings_end
>>> +static inline void arch_update_kernel_mappings_end(unsigned long start,
>>> + unsigned long end,
>>> + pgtbl_mod_mask mask)
>>> +{
>>> + if (test_thread_flag(TIF_KMAP_UPDATE_PENDING))
>>> + emit_pte_barriers();
>>> +
>>> + clear_thread_flag(TIF_KMAP_UPDATE_PENDING);
>>> + clear_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
>>> +}
>>>
>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
>>> @@ -323,10 +358,8 @@ static inline void __set_pte_complete(pte_t pte)
>>> * Only if the new pte is valid and kernel, otherwise TLB maintenance
>>> * or update_mmu_cache() have the necessary barriers.
>>> */
>>> - if (pte_valid_not_user(pte)) {
>>> - dsb(ishst);
>>> - isb();
>>> - }
>>> + if (pte_valid_not_user(pte))
>>> + queue_pte_barriers();
>>> }
>>>
>>> static inline void __set_pte(pte_t *ptep, pte_t pte)
>>> @@ -791,10 +824,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>>>
>>> WRITE_ONCE(*pmdp, pmd);
>>>
>>> - if (pmd_valid_not_user(pmd)) {
>>> - dsb(ishst);
>>> - isb();
>>> - }
>>> + if (pmd_valid_not_user(pmd))
>>> + queue_pte_barriers();
>>> }
>>>
>>> static inline void pmd_clear(pmd_t *pmdp)
>>> @@ -869,10 +900,8 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
>>>
>>> WRITE_ONCE(*pudp, pud);
>>>
>>> - if (pud_valid_not_user(pud)) {
>>> - dsb(ishst);
>>> - isb();
>>> - }
>>> + if (pud_valid_not_user(pud))
>>> + queue_pte_barriers();
>>> }
>>>
>>> static inline void pud_clear(pud_t *pudp)
>>> @@ -960,10 +989,8 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>>>
>>> WRITE_ONCE(*p4dp, p4d);
>>>
>>> - if (p4d_valid_not_user(p4d)) {
>>> - dsb(ishst);
>>> - isb();
>>> - }
>>> + if (p4d_valid_not_user(p4d))
>>> + queue_pte_barriers();
>>> }
>>>
>>> static inline void p4d_clear(p4d_t *p4dp)
>>> @@ -1098,10 +1125,8 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
>>>
>>> WRITE_ONCE(*pgdp, pgd);
>>>
>>> - if (pgd_valid_not_user(pgd)) {
>>> - dsb(ishst);
>>> - isb();
>>> - }
>>> + if (pgd_valid_not_user(pgd))
>>> + queue_pte_barriers();
>>> }
>>>
>>> static inline void pgd_clear(pgd_t *pgdp)
>>> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
>>> index 1114c1c3300a..382d2121261e 100644
>>> --- a/arch/arm64/include/asm/thread_info.h
>>> +++ b/arch/arm64/include/asm/thread_info.h
>>> @@ -82,6 +82,8 @@ void arch_setup_new_exec(void);
>>> #define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */
>>> #define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */
>>> #define TIF_TSC_SIGSEGV 30 /* SIGSEGV on counter-timer access */
>>> +#define TIF_KMAP_UPDATE_ACTIVE 31 /* kernel map update in progress */
>>> +#define TIF_KMAP_UPDATE_PENDING 32 /* kernel map updated with deferred barriers */
>>>
>>> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
>>> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
>>> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
>>> index 42faebb7b712..1367ec6407d1 100644
>>> --- a/arch/arm64/kernel/process.c
>>> +++ b/arch/arm64/kernel/process.c
>>> @@ -680,10 +680,10 @@ struct task_struct *__switch_to(struct task_struct *prev,
>>> gcs_thread_switch(next);
>>>
>>> /*
>>> - * Complete any pending TLB or cache maintenance on this CPU in case
>>> - * the thread migrates to a different CPU.
>>> - * This full barrier is also required by the membarrier system
>>> - * call.
>>> + * Complete any pending TLB or cache maintenance on this CPU in case the
>>> + * thread migrates to a different CPU. This full barrier is also
>>> + * required by the membarrier system call. Additionally it is required
>>> + * for TIF_KMAP_UPDATE_PENDING, see below.
>>> */
>>> dsb(ish);
>>>
>>> @@ -696,6 +696,18 @@ struct task_struct *__switch_to(struct task_struct *prev,
>>> /* avoid expensive SCTLR_EL1 accesses if no change */
>>> if (prev->thread.sctlr_user != next->thread.sctlr_user)
>>> update_sctlr_el1(next->thread.sctlr_user);
>>> + else if (unlikely(test_thread_flag(TIF_KMAP_UPDATE_PENDING))) {
>>> + /*
>>> + * In unlikely event that a kernel map update is on-going when
>>> + * preemption occurs, we must emit_pte_barriers() if pending.
>>> + * emit_pte_barriers() consists of "dsb(ishst); isb();". The dsb
>>> + * is already handled above. The isb() is handled if
>>> + * update_sctlr_el1() was called. So only need to emit isb()
>>> + * here if it wasn't called.
>>> + */
>>> + isb();
>>> + clear_thread_flag(TIF_KMAP_UPDATE_PENDING);
>>> + }
>>>
>>> /* the actual thread switch */
>>> last = cpu_switch_to(prev, next);
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 15/16] mm: Generalize arch_sync_kernel_mappings()
2025-02-10 11:04 ` Ryan Roberts
@ 2025-02-13 5:57 ` Anshuman Khandual
2025-02-13 9:17 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-13 5:57 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/10/25 16:34, Ryan Roberts wrote:
> On 10/02/2025 07:45, Anshuman Khandual wrote:
>>
>>
>> On 2/5/25 20:39, Ryan Roberts wrote:
>>> arch_sync_kernel_mappings() is an optional hook for arches to allow them
>>> to synchronize certain levels of the kernel pgtables after modification.
>>> But arm64 could benefit from a hook similar to this, paired with a call
>>> prior to starting the batch of modifications.
>>>
>>> So let's introduce arch_update_kernel_mappings_begin() and
>>> arch_update_kernel_mappings_end(). Both have a default implementation
>>> which can be overridden by the arch code. The default for the former is
>>> a nop, and the default for the latter is to call
>>> arch_sync_kernel_mappings(), so the latter replaces previous
>>> arch_sync_kernel_mappings() callsites. So by default, the resulting
>>> behaviour is unchanged.
>>>
>>> To avoid include hell, the pgtbl_mod_mask type and its associated
>>> macros are moved to their own header.
>>>
>>> In a future patch, arm64 will opt-in to overriding both functions.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>> include/linux/pgtable.h | 24 +----------------
>>> include/linux/pgtable_modmask.h | 32 ++++++++++++++++++++++
>>> include/linux/vmalloc.h | 47 +++++++++++++++++++++++++++++++++
>>> mm/memory.c | 5 ++--
>>> mm/vmalloc.c | 15 ++++++-----
>>> 5 files changed, 92 insertions(+), 31 deletions(-)
>>> create mode 100644 include/linux/pgtable_modmask.h
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index 94d267d02372..7f70786a73b3 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -4,6 +4,7 @@
>>>
>>> #include <linux/pfn.h>
>>> #include <asm/pgtable.h>
>>> +#include <linux/pgtable_modmask.h>
>>>
>>> #define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
>>> #define PUD_ORDER (PUD_SHIFT - PAGE_SHIFT)
>>> @@ -1786,29 +1787,6 @@ static inline bool arch_has_pfn_modify_check(void)
>>> # define PAGE_KERNEL_EXEC PAGE_KERNEL
>>> #endif
>>>
>>> -/*
>>> - * Page Table Modification bits for pgtbl_mod_mask.
>>> - *
>>> - * These are used by the p?d_alloc_track*() set of functions an in the generic
>>> - * vmalloc/ioremap code to track at which page-table levels entries have been
>>> - * modified. Based on that the code can better decide when vmalloc and ioremap
>>> - * mapping changes need to be synchronized to other page-tables in the system.
>>> - */
>>> -#define __PGTBL_PGD_MODIFIED 0
>>> -#define __PGTBL_P4D_MODIFIED 1
>>> -#define __PGTBL_PUD_MODIFIED 2
>>> -#define __PGTBL_PMD_MODIFIED 3
>>> -#define __PGTBL_PTE_MODIFIED 4
>>> -
>>> -#define PGTBL_PGD_MODIFIED BIT(__PGTBL_PGD_MODIFIED)
>>> -#define PGTBL_P4D_MODIFIED BIT(__PGTBL_P4D_MODIFIED)
>>> -#define PGTBL_PUD_MODIFIED BIT(__PGTBL_PUD_MODIFIED)
>>> -#define PGTBL_PMD_MODIFIED BIT(__PGTBL_PMD_MODIFIED)
>>> -#define PGTBL_PTE_MODIFIED BIT(__PGTBL_PTE_MODIFIED)
>>> -
>>> -/* Page-Table Modification Mask */
>>> -typedef unsigned int pgtbl_mod_mask;
>>> -
>>> #endif /* !__ASSEMBLY__ */
>>>
>>> #if !defined(MAX_POSSIBLE_PHYSMEM_BITS) && !defined(CONFIG_64BIT)
>>> diff --git a/include/linux/pgtable_modmask.h b/include/linux/pgtable_modmask.h
>>> new file mode 100644
>>> index 000000000000..5a21b1bb8df3
>>> --- /dev/null
>>> +++ b/include/linux/pgtable_modmask.h
>>> @@ -0,0 +1,32 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +#ifndef _LINUX_PGTABLE_MODMASK_H
>>> +#define _LINUX_PGTABLE_MODMASK_H
>>> +
>>> +#ifndef __ASSEMBLY__
>>> +
>>> +/*
>>> + * Page Table Modification bits for pgtbl_mod_mask.
>>> + *
>>> + * These are used by the p?d_alloc_track*() set of functions an in the generic
>>> + * vmalloc/ioremap code to track at which page-table levels entries have been
>>> + * modified. Based on that the code can better decide when vmalloc and ioremap
>>> + * mapping changes need to be synchronized to other page-tables in the system.
>>> + */
>>> +#define __PGTBL_PGD_MODIFIED 0
>>> +#define __PGTBL_P4D_MODIFIED 1
>>> +#define __PGTBL_PUD_MODIFIED 2
>>> +#define __PGTBL_PMD_MODIFIED 3
>>> +#define __PGTBL_PTE_MODIFIED 4
>>> +
>>> +#define PGTBL_PGD_MODIFIED BIT(__PGTBL_PGD_MODIFIED)
>>> +#define PGTBL_P4D_MODIFIED BIT(__PGTBL_P4D_MODIFIED)
>>> +#define PGTBL_PUD_MODIFIED BIT(__PGTBL_PUD_MODIFIED)
>>> +#define PGTBL_PMD_MODIFIED BIT(__PGTBL_PMD_MODIFIED)
>>> +#define PGTBL_PTE_MODIFIED BIT(__PGTBL_PTE_MODIFIED)
>>> +
>>> +/* Page-Table Modification Mask */
>>> +typedef unsigned int pgtbl_mod_mask;
>>> +
>>> +#endif /* !__ASSEMBLY__ */
>>> +
>>> +#endif /* _LINUX_PGTABLE_MODMASK_H */
>>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
>>> index 16dd4cba64f2..cb5d8f1965a1 100644
>>> --- a/include/linux/vmalloc.h
>>> +++ b/include/linux/vmalloc.h
>>> @@ -11,6 +11,7 @@
>>> #include <asm/page.h> /* pgprot_t */
>>> #include <linux/rbtree.h>
>>> #include <linux/overflow.h>
>>> +#include <linux/pgtable_modmask.h>
>>>
>>> #include <asm/vmalloc.h>
>>>
>>> @@ -213,6 +214,26 @@ extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
>>> int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
>>> struct page **pages, unsigned int page_shift);
>>>
>>> +#ifndef arch_update_kernel_mappings_begin
>>> +/**
>>> + * arch_update_kernel_mappings_begin - A batch of kernel pgtable mappings are
>>> + * about to be updated.
>>> + * @start: Virtual address of start of range to be updated.
>>> + * @end: Virtual address of end of range to be updated.
>>> + *
>>> + * An optional hook to allow architecture code to prepare for a batch of kernel
>>> + * pgtable mapping updates. An architecture may use this to enter a lazy mode
>>> + * where some operations can be deferred until the end of the batch.
>>> + *
>>> + * Context: Called in task context and may be preemptible.
>>> + */
>>> +static inline void arch_update_kernel_mappings_begin(unsigned long start,
>>> + unsigned long end)
>>> +{
>>> +}
>>> +#endif
>>> +
>>> +#ifndef arch_update_kernel_mappings_end
>>> /*
>>> * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
>>> * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
>>> @@ -229,6 +250,32 @@ int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
>>> */
>>> void arch_sync_kernel_mappings(unsigned long start, unsigned long end);
>>>
>>> +/**
>>> + * arch_update_kernel_mappings_end - A batch of kernel pgtable mappings have
>>> + * been updated.
>>> + * @start: Virtual address of start of range that was updated.
>>> + * @end: Virtual address of end of range that was updated.
>>> + *
>>> + * An optional hook to inform architecture code that a batch update is complete.
>>> + * This balances a previous call to arch_update_kernel_mappings_begin().
>>> + *
>>> + * An architecture may override this for any purpose, such as exiting a lazy
>>> + * mode previously entered with arch_update_kernel_mappings_begin() or syncing
>>> + * kernel mappings to a secondary pgtable. The default implementation calls an
>>> + * arch-provided arch_sync_kernel_mappings() if any arch-defined pgtable level
>>> + * was updated.
>>> + *
>>> + * Context: Called in task context and may be preemptible.
>>> + */
>>> +static inline void arch_update_kernel_mappings_end(unsigned long start,
>>> + unsigned long end,
>>> + pgtbl_mod_mask mask)
>>> +{
>>> + if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>>> + arch_sync_kernel_mappings(start, end);
>>> +}
>>
>> One arch call back calling yet another arch call back sounds a bit odd.
>
> It's no different from the default implementation of arch_make_huge_pte()
> calling pte_mkhuge() is it?
Agreed. arch_make_huge_pte() ---> pte_mkhuge(), where either helper can be
customized by the platform, is another such example, but unless necessary we
should probably avoid following that pattern. Anyway, it's not a big deal I guess.
>
>> Also
>> should not ARCH_PAGE_TABLE_SYNC_MASK be checked both for __begin and __end
>> callbacks in case a platform subscribes into this framework.
>
> I'm not sure how that would work? The mask is accumulated during the pgtable
> walk. So we don't have a mask until we get to the end.
A non-zero ARCH_PAGE_TABLE_SYNC_MASK indicates that a platform is subscribing
to this mechanism. So could ARCH_PAGE_TABLE_SYNC_MASK != 0 be used instead ?
>
>> Instead the
>> following changes sound more reasonable, but will also require some more
>> updates for the current platforms using arch_sync_kernel_mappings().
>>
>> if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>> arch_update_kernel_mappings_begin()
>
> This makes no sense. mask is always 0 before doing the walk.
Got it.
>
>>
>> if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>> arch_update_kernel_mappings_end()
>>
>> Basically when any platform defines ARCH_PAGE_TABLE_SYNC_MASK and subscribes
>> this framework, it will also provide arch_update_kernel_mappings_begin/end()
>> callbacks as required.
>
> Personally I think it's cleaner to just pass mask to
> arch_update_kernel_mappings_end() and let the function decide what it wants to do.
>
> But it's a good question as to whether we should refactor x86 and arm to
> directly implement arch_update_kernel_mappings_end() instead of
> arch_sync_kernel_mappings(). Personally I thought it was better to avoid the
> churn. But interested in others' opinions.
>
> Thanks,
> Ryan
>
>>
>>> +#endif
>>> +
>>> /*
>>> * Lowlevel-APIs (not for driver use!)
>>> */
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index a15f7dd500ea..f80930bc19f6 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3035,6 +3035,8 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>>> if (WARN_ON(addr >= end))
>>> return -EINVAL;
>>>
>>> + arch_update_kernel_mappings_begin(start, end);
>>> +
>>> pgd = pgd_offset(mm, addr);
>>> do {
>>> next = pgd_addr_end(addr, end);
>>> @@ -3055,8 +3057,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>>> break;
>>> } while (pgd++, addr = next, addr != end);
>>>
>>> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>>> - arch_sync_kernel_mappings(start, start + size);
>>> + arch_update_kernel_mappings_end(start, end, mask);
>>>
>>> return err;
>>> }
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index 50fd44439875..c5c51d86ef78 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -312,10 +312,10 @@ int vmap_page_range(unsigned long addr, unsigned long end,
>>> pgtbl_mod_mask mask = 0;
>>> int err;
>>>
>>> + arch_update_kernel_mappings_begin(addr, end);
>>> err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
>>> ioremap_max_page_shift, &mask);
>>> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>>> - arch_sync_kernel_mappings(addr, end);
>>> + arch_update_kernel_mappings_end(addr, end, mask);
>>>
>>> flush_cache_vmap(addr, end);
>>> if (!err)
>>> @@ -463,6 +463,9 @@ void __vunmap_range_noflush(unsigned long start, unsigned long end)
>>> pgtbl_mod_mask mask = 0;
>>>
>>> BUG_ON(addr >= end);
>>> +
>>> + arch_update_kernel_mappings_begin(start, end);
>>> +
>>> pgd = pgd_offset_k(addr);
>>> do {
>>> next = pgd_addr_end(addr, end);
>>> @@ -473,8 +476,7 @@ void __vunmap_range_noflush(unsigned long start, unsigned long end)
>>> vunmap_p4d_range(pgd, addr, next, &mask);
>>> } while (pgd++, addr = next, addr != end);
>>>
>>> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>>> - arch_sync_kernel_mappings(start, end);
>>> + arch_update_kernel_mappings_end(start, end, mask);
>>> }
>>>
>>> void vunmap_range_noflush(unsigned long start, unsigned long end)
>>> @@ -625,6 +627,8 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>>
>>> WARN_ON(page_shift < PAGE_SHIFT);
>>>
>>> + arch_update_kernel_mappings_begin(start, end);
>>> +
>>> if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>>> page_shift == PAGE_SHIFT) {
>>> err = vmap_small_pages_range_noflush(addr, end, prot, pages,
>>> @@ -642,8 +646,7 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>>> }
>>> }
>>>
>>> - if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>>> - arch_sync_kernel_mappings(start, end);
>>> + arch_update_kernel_mappings_end(start, end, mask);
>>>
>>> return err;
>>> }
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 12/16] arm64/mm: Support huge pte-mapped pages in vmap
2025-02-07 11:20 ` Ryan Roberts
@ 2025-02-13 6:32 ` Anshuman Khandual
2025-02-13 9:09 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-13 6:32 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/7/25 16:50, Ryan Roberts wrote:
> On 07/02/2025 10:04, Anshuman Khandual wrote:
>>
>>
>> On 2/5/25 20:39, Ryan Roberts wrote:
>>> Implement the required arch functions to enable use of contpte in the
>>> vmap when VM_ALLOW_HUGE_VMAP is specified. This speeds up vmap
>>> operations due to only having to issue a DSB and ISB per contpte block
>>> instead of per pte. But it also means that the TLB pressure reduces due
>>> to only needing a single TLB entry for the whole contpte block.
>>
>> Right.
>>
>>>
>>> Since vmap uses set_huge_pte_at() to set the contpte, that API is now
>>> used for kernel mappings for the first time. Although in the vmap case
>>> we never expect it to be called to modify a valid mapping so
>>> clear_flush() should never be called, it's still wise to make it robust
>>> for the kernel case, so amend the tlb flush function if the mm is for
>>> kernel space.
>>
>> Makes sense.
>>
>>>
>>> Tested with vmalloc performance selftests:
>>>
>>> # kself/mm/test_vmalloc.sh \
>>> run_test_mask=1
>>> test_repeat_count=5
>>> nr_pages=256
>>> test_loop_count=100000
>>> use_huge=1
>>>
>>> Duration reduced from 1274243 usec to 1083553 usec on Apple M2 for 15%
>>> reduction in time taken.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>> arch/arm64/include/asm/vmalloc.h | 40 ++++++++++++++++++++++++++++++++
>>> arch/arm64/mm/hugetlbpage.c | 5 +++-
>>> 2 files changed, 44 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>>> index 38fafffe699f..fbdeb40f3857 100644
>>> --- a/arch/arm64/include/asm/vmalloc.h
>>> +++ b/arch/arm64/include/asm/vmalloc.h
>>> @@ -23,6 +23,46 @@ static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>> return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>> }
>>>
>>> +#define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
>>> +static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
>>> + unsigned long end, u64 pfn,
>>> + unsigned int max_page_shift)
>>> +{
>>> + if (max_page_shift < CONT_PTE_SHIFT)
>>> + return PAGE_SIZE;
>>> +
>>> + if (end - addr < CONT_PTE_SIZE)
>>> + return PAGE_SIZE;
>>> +
>>> + if (!IS_ALIGNED(addr, CONT_PTE_SIZE))
>>> + return PAGE_SIZE;
>>> +
>>> + if (!IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
>>> + return PAGE_SIZE;
>>> +
>>> + return CONT_PTE_SIZE;
>>
>> A small nit:
>>
>> Should the rationale behind picking CONT_PTE_SIZE be added here as an in code
>> comment or something in the function - just to make things bit clear.
>
> I'm not sure what other size we would pick?
The suggestion was to add a small comment in the above helper function explaining
the rationale for various conditions in there while returning either PAGE_SIZE or
CONT_PTE_SIZE to improve readability etc.
>
>>
>>> +}
>>> +
>>> +#define arch_vmap_pte_range_unmap_size arch_vmap_pte_range_unmap_size
>>> +static inline unsigned long arch_vmap_pte_range_unmap_size(unsigned long addr,
>>> + pte_t *ptep)
>>> +{
>>> + /*
>>> + * The caller handles alignment so it's sufficient just to check
>>> + * PTE_CONT.
>>> + */
>>> + return pte_valid_cont(__ptep_get(ptep)) ? CONT_PTE_SIZE : PAGE_SIZE;
>>
>> I guess it is safe to query the CONT_PTE from the mapped entry itself.
>
> Yes I don't see why not. Is there some specific aspect you're concerned about?
Nothing came up while following this code, it was more of a general observation.
>
>>
>>> +}
>>> +
>>> +#define arch_vmap_pte_supported_shift arch_vmap_pte_supported_shift
>>> +static inline int arch_vmap_pte_supported_shift(unsigned long size)
>>> +{
>>> + if (size >= CONT_PTE_SIZE)
>>> + return CONT_PTE_SHIFT;
>>> +
>>> + return PAGE_SHIFT;
>>> +}
>>> +
>>> #endif
>>>
>>> #define arch_vmap_pgprot_tagged arch_vmap_pgprot_tagged
>>> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
>>> index 02afee31444e..a74e43101dad 100644
>>> --- a/arch/arm64/mm/hugetlbpage.c
>>> +++ b/arch/arm64/mm/hugetlbpage.c
>>> @@ -217,7 +217,10 @@ static void clear_flush(struct mm_struct *mm,
>>> for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
>>> ___ptep_get_and_clear(mm, ptep, pgsize);
>>>
>>> - __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
>>> + if (mm == &init_mm)
>>> + flush_tlb_kernel_range(saddr, addr);
>>> + else
>>> + __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
>>> }
>>>
>>> void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
>>
>> Otherwise LGTM.
>>
>> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
>
> Thanks!
>
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 10/16] mm/vmalloc: Warn on improper use of vunmap_range()
2025-02-07 10:59 ` Ryan Roberts
@ 2025-02-13 6:36 ` Anshuman Khandual
0 siblings, 0 replies; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-13 6:36 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/7/25 16:29, Ryan Roberts wrote:
> On 07/02/2025 08:41, Anshuman Khandual wrote:
>> On 2/5/25 20:39, Ryan Roberts wrote:
>>> A call to vmalloc_huge() may cause memory blocks to be mapped at pmd or
>>> pud level. But it is possible to subsquently call vunmap_range() on a
>>
>> s/subsquently/subsequently
>>
>>> sub-range of the mapped memory, which partially overlaps a pmd or pud.
>>> In this case, vmalloc unmaps the entire pmd or pud so that the
>>> no-overlapping portion is also unmapped. Clearly that would have a bad
>>> outcome, but it's not something that any callers do today as far as I
>>> can tell. So I guess it's jsut expected that callers will not do this.
>>
>> s/jsut/just
>>
>>>
>>> However, it would be useful to know if this happened in future; let's
>>> add a warning to cover the eventuality.
>>
>> This is a reasonable check to prevent bad outcomes later.
>>
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>> mm/vmalloc.c | 8 ++++++--
>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index a6e7acebe9ad..fcdf67d5177a 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -374,8 +374,10 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>>> if (cleared || pmd_bad(*pmd))
>>> *mask |= PGTBL_PMD_MODIFIED;
>>>
>>> - if (cleared)
>>> + if (cleared) {
>>> + WARN_ON(next - addr < PMD_SIZE);
>>> continue;
>>> + }
>>> if (pmd_none_or_clear_bad(pmd))
>>> continue;
>>> vunmap_pte_range(pmd, addr, next, mask);
>>> @@ -399,8 +401,10 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>>> if (cleared || pud_bad(*pud))
>>> *mask |= PGTBL_PUD_MODIFIED;
>>>
>>> - if (cleared)
>>> + if (cleared) {
>>> + WARN_ON(next - addr < PUD_SIZE);
>>> continue;
>>> + }
>>> if (pud_none_or_clear_bad(pud))
>>> continue;
>>> vunmap_pmd_range(pud, addr, next, mask);
>> Why not also include such checks in vunmap_p4d_range() and __vunmap_range_noflush()
>> for corresponding P4D and PGD levels as well ?
>
> The kernel does not support p4d or pgd leaf entries so there is nothing to check.
> Although vunmap_p4d_range() does call p4d_clear_huge(). The function is a stub
> and returns void (unlike p[mu]d_clear_huge()). I suspect we could just remove
> p4d_clear_huge() entirely. But that would be a separate patch to the mm tree, I think.
>
> For pgd, there isn't even an equivalent looking function.
>
> Basically at those 2 levels, it's always a table.
Understood, thanks !
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 12/16] arm64/mm: Support huge pte-mapped pages in vmap
2025-02-13 6:32 ` Anshuman Khandual
@ 2025-02-13 9:09 ` Ryan Roberts
2025-02-17 4:33 ` Anshuman Khandual
0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-13 9:09 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
>>>> +#define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
>>>> +static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
>>>> + unsigned long end, u64 pfn,
>>>> + unsigned int max_page_shift)
>>>> +{
>>>> + if (max_page_shift < CONT_PTE_SHIFT)
>>>> + return PAGE_SIZE;
>>>> +
>>>> + if (end - addr < CONT_PTE_SIZE)
>>>> + return PAGE_SIZE;
>>>> +
>>>> + if (!IS_ALIGNED(addr, CONT_PTE_SIZE))
>>>> + return PAGE_SIZE;
>>>> +
>>>> + if (!IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
>>>> + return PAGE_SIZE;
>>>> +
>>>> + return CONT_PTE_SIZE;
>>>
>>> A small nit:
>>>
>>> Should the rationale behind picking CONT_PTE_SIZE be added here as an in code
>>> comment or something in the function - just to make things bit clear.
>>
>> I'm not sure what other size we would pick?
>
> The suggestion was to add a small comment in the above helper function explaining
> the rationale for various conditions in there while returning either PAGE_SIZE or
> CONT_PTE_SIZE to improve readability etc.
OK I've added the following:
/*
* If the block is at least CONT_PTE_SIZE in size, and is naturally
* aligned in both virtual and physical space, then we can pte-map the
* block using the PTE_CONT bit for more efficient use of the TLB.
*/
Thanks,
Ryan
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 15/16] mm: Generalize arch_sync_kernel_mappings()
2025-02-13 5:57 ` Anshuman Khandual
@ 2025-02-13 9:17 ` Ryan Roberts
0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-02-13 9:17 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
>>>> +/**
>>>> + * arch_update_kernel_mappings_end - A batch of kernel pgtable mappings have
>>>> + * been updated.
>>>> + * @start: Virtual address of start of range that was updated.
>>>> + * @end: Virtual address of end of range that was updated.
>>>> + *
>>>> + * An optional hook to inform architecture code that a batch update is complete.
>>>> + * This balances a previous call to arch_update_kernel_mappings_begin().
>>>> + *
>>>> + * An architecture may override this for any purpose, such as exiting a lazy
>>>> + * mode previously entered with arch_update_kernel_mappings_begin() or syncing
>>>> + * kernel mappings to a secondary pgtable. The default implementation calls an
>>>> + * arch-provided arch_sync_kernel_mappings() if any arch-defined pgtable level
>>>> + * was updated.
>>>> + *
>>>> + * Context: Called in task context and may be preemptible.
>>>> + */
>>>> +static inline void arch_update_kernel_mappings_end(unsigned long start,
>>>> + unsigned long end,
>>>> + pgtbl_mod_mask mask)
>>>> +{
>>>> + if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
>>>> + arch_sync_kernel_mappings(start, end);
>>>> +}
>>>
>>> One arch call back calling yet another arch call back sounds a bit odd.
>>
>> It's no different from the default implementation of arch_make_huge_pte()
>> calling pte_mkhuge() is it?
>
> Agreed. arch_make_huge_pte() ---> pte_mkhuge() where either helpers can be
> customized in the platform is another such example but unless necessary we
> should probably avoid following that. Anyways it's not a big deal I guess.
>
>>
>>> Also
>>> should not ARCH_PAGE_TABLE_SYNC_MASK be checked both for __begin and __end
>>> callbacks in case a platform subscribes into this framework.
>>
>> I'm not sure how that would work? The mask is accumulated during the pgtable
>> walk. So we don't have a mask until we get to the end.
>
> A non-zero ARCH_PAGE_TABLE_SYNC_MASK indicates that a platform is subscribing
> to this mechanism. So could ARCH_PAGE_TABLE_SYNC_MASK != 0 be used instead ?
There are now 2 levels of mechanism:
Either: arch defines ARCH_PAGE_TABLE_SYNC_MASK to be non-zero and provides
arch_sync_kernel_mappings(). This is unchanged from how it was before.
Or: arch defines its own version of one or both of
arch_update_kernel_mappings_begin() and arch_update_kernel_mappings_end().
So a non-zero ARCH_PAGE_TABLE_SYNC_MASK indicates that a platform is subscribing
to the *first* mechanism. It has nothing to do with the second mechanism. If the
platform defines arch_update_kernel_mappings_begin() it wants it to be called.
If it doesn't define it, then it doesn't get called.
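
To make that concrete, here is a rough sketch of the two styles (the mask value
and the hook bodies below are just placeholders, not taken from any real arch):

/*
 * Mechanism 1: unchanged from today. The arch names the levels it cares
 * about and provides arch_sync_kernel_mappings(); the default
 * arch_update_kernel_mappings_end() calls it when the accumulated mask
 * intersects ARCH_PAGE_TABLE_SYNC_MASK.
 */
#define ARCH_PAGE_TABLE_SYNC_MASK	PGTBL_PGD_MODIFIED
void arch_sync_kernel_mappings(unsigned long start, unsigned long end);

/*
 * Mechanism 2: the arch overrides the batch hooks directly; they are
 * called around every walk regardless of ARCH_PAGE_TABLE_SYNC_MASK.
 */
#define arch_update_kernel_mappings_begin arch_update_kernel_mappings_begin
static inline void arch_update_kernel_mappings_begin(unsigned long start,
						     unsigned long end)
{
	/* e.g. enter a lazy barrier mode */
}

#define arch_update_kernel_mappings_end arch_update_kernel_mappings_end
static inline void arch_update_kernel_mappings_end(unsigned long start,
						   unsigned long end,
						   pgtbl_mod_mask mask)
{
	/* e.g. exit lazy mode and/or sync to secondary pgtables */
}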
Thanks,
Ryan
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 16/16] arm64/mm: Defer barriers when updating kernel mappings
2025-02-13 5:30 ` Anshuman Khandual
@ 2025-02-13 9:38 ` Ryan Roberts
2025-02-17 4:48 ` Anshuman Khandual
0 siblings, 1 reply; 62+ messages in thread
From: Ryan Roberts @ 2025-02-13 9:38 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 13/02/2025 05:30, Anshuman Khandual wrote:
>
>
> On 2/10/25 16:42, Ryan Roberts wrote:
>> On 10/02/2025 08:03, Anshuman Khandual wrote:
>>>
>>>
>>> On 2/5/25 20:39, Ryan Roberts wrote:
>>>> Because the kernel can't tolerate page faults for kernel mappings, when
>>>> setting a valid, kernel space pte (or pmd/pud/p4d/pgd), it emits a
>>>> dsb(ishst) to ensure that the store to the pgtable is observed by the
>>>> table walker immediately. Additionally it emits an isb() to ensure that
>>>> any already speculatively determined invalid mapping fault gets
>>>> canceled.
>>>> We can improve the performance of vmalloc operations by batching these
>>>> barriers until the end of a set of entry updates. The newly added
>>>> arch_update_kernel_mappings_begin() / arch_update_kernel_mappings_end()
>>>> provide the required hooks.
>>>>
>>>> vmalloc improves by up to 30% as a result.
>>>>
>>>> Two new TIF_ flags are created; TIF_KMAP_UPDATE_ACTIVE tells us if we
>>>> are in the batch mode and can therefore defer any barriers until the end
>>>> of the batch. TIF_KMAP_UPDATE_PENDING tells us if barriers are queued to
>>>> be emitted at the end of the batch.
>>>
>>> Why cannot this be achieved with a single TIF_KMAP_UPDATE_ACTIVE which is
>>> set in __begin(), cleared in __end() and saved across a __switch_to().
>>
>> So unconditionally emit the barriers in _end(), and emit them in __switch_to()
>> if TIF_KMAP_UPDATE_ACTIVE is set?
>
> Right.
>
>>
>> I guess if calling _begin() then you are definitely going to be setting at least
>> 1 PTE. So you can definitely emit the barriers unconditionally. I was trying to
>> protect against the case where you get pre-empted (potentially multiple times)
>> while in the loop. The TIF_KMAP_UPDATE_PENDING flag ensures you only emit the
>> barriers when you definitely need to. Without it, you would have to emit on
>> every pre-emption even if no more PTEs got set.
>>
>> But I suspect this is a premature optimization. Probably it will never occur. So
>
> Agreed.
Having done this simplification, I've just noticed that one of the
arch_update_kernel_mappings_begin/end callsites is __apply_to_page_range() which
gets called for user space mappings as well as kernel mappings. So actually with
the simplification I'll be emitting barriers even when only user space mappings
were touched.
I think there are a couple of options to fix this:
- Revert to the 2 flag approach. For the user space case, I'll get to _end() and
notice that no barriers are queued so will emit nothing.
- Only set TIF_KMAP_UPDATE_ACTIVE if the address range passed to _begin() is a
kernel address range. I guess that's just a case of checking if the MSB is set
in "end"?
- pass mm to _begin() and only set TIF_KMAP_UPDATE_ACTIVE if mm == &init_mm. I
guess this should be the same as option 2.
I'm leaning towards option 2. But I have a niggling feeling that my proposed
check isn't quite correct. What do you think?
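
For illustration, option 2 might look something like this on the arm64 side
(just a sketch; whether testing the top address bit of "end" is the right
predicate is exactly the doubt above):

#define arch_update_kernel_mappings_begin arch_update_kernel_mappings_begin
static inline void arch_update_kernel_mappings_begin(unsigned long start,
						     unsigned long end)
{
	/* Kernel VAs have the top bit set; user-only walks skip batching. */
	if (end & BIT(63))
		set_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
}

#define arch_update_kernel_mappings_end arch_update_kernel_mappings_end
static inline void arch_update_kernel_mappings_end(unsigned long start,
						   unsigned long end,
						   pgtbl_mod_mask mask)
{
	/* Only emit the deferred barriers if we actually entered batch mode. */
	if (test_thread_flag(TIF_KMAP_UPDATE_ACTIVE)) {
		emit_pte_barriers();
		clear_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
	}
}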
Thanks,
Ryan
>
>> I'll simplify as you suggest.
>>
>> Thanks,
>> Ryan
>>
>>>
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>> arch/arm64/include/asm/pgtable.h | 65 +++++++++++++++++++---------
>>>> arch/arm64/include/asm/thread_info.h | 2 +
>>>> arch/arm64/kernel/process.c | 20 +++++++--
>>>> 3 files changed, 63 insertions(+), 24 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index ff358d983583..1ee9b9588502 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -39,6 +39,41 @@
>>>> #include <linux/mm_types.h>
>>>> #include <linux/sched.h>
>>>> #include <linux/page_table_check.h>
>>>> +#include <linux/pgtable_modmask.h>
>>>> +
>>>> +static inline void emit_pte_barriers(void)
>>>> +{
>>>> + dsb(ishst);
>>>> + isb();
>>>> +}
>>>
>>> There are many sequences of these two barriers in this particular header,
>>> hence probably a good idea to factor this out into a common helper.
>>>> +
>>>> +static inline void queue_pte_barriers(void)
>>>> +{
>>>> + if (test_thread_flag(TIF_KMAP_UPDATE_ACTIVE)) {
>>>> + if (!test_thread_flag(TIF_KMAP_UPDATE_PENDING))
>>>> + set_thread_flag(TIF_KMAP_UPDATE_PENDING);
>>>> + } else
>>>> + emit_pte_barriers();
>>>> +}
>>>> +
>>>> +#define arch_update_kernel_mappings_begin arch_update_kernel_mappings_begin
>>>> +static inline void arch_update_kernel_mappings_begin(unsigned long start,
>>>> + unsigned long end)
>>>> +{
>>>> + set_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
>>>> +}
>>>> +
>>>> +#define arch_update_kernel_mappings_end arch_update_kernel_mappings_end
>>>> +static inline void arch_update_kernel_mappings_end(unsigned long start,
>>>> + unsigned long end,
>>>> + pgtbl_mod_mask mask)
>>>> +{
>>>> + if (test_thread_flag(TIF_KMAP_UPDATE_PENDING))
>>>> + emit_pte_barriers();
>>>> +
>>>> + clear_thread_flag(TIF_KMAP_UPDATE_PENDING);
>>>> + clear_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
>>>> +}
>>>>
>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
>>>> @@ -323,10 +358,8 @@ static inline void __set_pte_complete(pte_t pte)
>>>> * Only if the new pte is valid and kernel, otherwise TLB maintenance
>>>> * or update_mmu_cache() have the necessary barriers.
>>>> */
>>>> - if (pte_valid_not_user(pte)) {
>>>> - dsb(ishst);
>>>> - isb();
>>>> - }
>>>> + if (pte_valid_not_user(pte))
>>>> + queue_pte_barriers();
>>>> }
>>>>
>>>> static inline void __set_pte(pte_t *ptep, pte_t pte)
>>>> @@ -791,10 +824,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>>>>
>>>> WRITE_ONCE(*pmdp, pmd);
>>>>
>>>> - if (pmd_valid_not_user(pmd)) {
>>>> - dsb(ishst);
>>>> - isb();
>>>> - }
>>>> + if (pmd_valid_not_user(pmd))
>>>> + queue_pte_barriers();
>>>> }
>>>>
>>>> static inline void pmd_clear(pmd_t *pmdp)
>>>> @@ -869,10 +900,8 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
>>>>
>>>> WRITE_ONCE(*pudp, pud);
>>>>
>>>> - if (pud_valid_not_user(pud)) {
>>>> - dsb(ishst);
>>>> - isb();
>>>> - }
>>>> + if (pud_valid_not_user(pud))
>>>> + queue_pte_barriers();
>>>> }
>>>>
>>>> static inline void pud_clear(pud_t *pudp)
>>>> @@ -960,10 +989,8 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>>>>
>>>> WRITE_ONCE(*p4dp, p4d);
>>>>
>>>> - if (p4d_valid_not_user(p4d)) {
>>>> - dsb(ishst);
>>>> - isb();
>>>> - }
>>>> + if (p4d_valid_not_user(p4d))
>>>> + queue_pte_barriers();
>>>> }
>>>>
>>>> static inline void p4d_clear(p4d_t *p4dp)
>>>> @@ -1098,10 +1125,8 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
>>>>
>>>> WRITE_ONCE(*pgdp, pgd);
>>>>
>>>> - if (pgd_valid_not_user(pgd)) {
>>>> - dsb(ishst);
>>>> - isb();
>>>> - }
>>>> + if (pgd_valid_not_user(pgd))
>>>> + queue_pte_barriers();
>>>> }
>>>>
>>>> static inline void pgd_clear(pgd_t *pgdp)
>>>> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
>>>> index 1114c1c3300a..382d2121261e 100644
>>>> --- a/arch/arm64/include/asm/thread_info.h
>>>> +++ b/arch/arm64/include/asm/thread_info.h
>>>> @@ -82,6 +82,8 @@ void arch_setup_new_exec(void);
>>>> #define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */
>>>> #define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */
>>>> #define TIF_TSC_SIGSEGV 30 /* SIGSEGV on counter-timer access */
>>>> +#define TIF_KMAP_UPDATE_ACTIVE 31 /* kernel map update in progress */
>>>> +#define TIF_KMAP_UPDATE_PENDING 32 /* kernel map updated with deferred barriers */
>>>>
>>>> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
>>>> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
>>>> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
>>>> index 42faebb7b712..1367ec6407d1 100644
>>>> --- a/arch/arm64/kernel/process.c
>>>> +++ b/arch/arm64/kernel/process.c
>>>> @@ -680,10 +680,10 @@ struct task_struct *__switch_to(struct task_struct *prev,
>>>> gcs_thread_switch(next);
>>>>
>>>> /*
>>>> - * Complete any pending TLB or cache maintenance on this CPU in case
>>>> - * the thread migrates to a different CPU.
>>>> - * This full barrier is also required by the membarrier system
>>>> - * call.
>>>> + * Complete any pending TLB or cache maintenance on this CPU in case the
>>>> + * thread migrates to a different CPU. This full barrier is also
>>>> + * required by the membarrier system call. Additionally it is required
>>>> + * for TIF_KMAP_UPDATE_PENDING, see below.
>>>> */
>>>> dsb(ish);
>>>>
>>>> @@ -696,6 +696,18 @@ struct task_struct *__switch_to(struct task_struct *prev,
>>>> /* avoid expensive SCTLR_EL1 accesses if no change */
>>>> if (prev->thread.sctlr_user != next->thread.sctlr_user)
>>>> update_sctlr_el1(next->thread.sctlr_user);
>>>> + else if (unlikely(test_thread_flag(TIF_KMAP_UPDATE_PENDING))) {
>>>> + /*
>>>> + * In unlikely event that a kernel map update is on-going when
>>>> + * preemption occurs, we must emit_pte_barriers() if pending.
>>>> + * emit_pte_barriers() consists of "dsb(ishst); isb();". The dsb
>>>> + * is already handled above. The isb() is handled if
>>>> + * update_sctlr_el1() was called. So only need to emit isb()
>>>> + * here if it wasn't called.
>>>> + */
>>>> + isb();
>>>> + clear_thread_flag(TIF_KMAP_UPDATE_PENDING);
>>>> + }
>>>>
>>>> /* the actual thread switch */
>>>> last = cpu_switch_to(prev, next);
>>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 12/16] arm64/mm: Support huge pte-mapped pages in vmap
2025-02-13 9:09 ` Ryan Roberts
@ 2025-02-17 4:33 ` Anshuman Khandual
0 siblings, 0 replies; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-17 4:33 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/13/25 14:39, Ryan Roberts wrote:
>
>>>>> +#define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
>>>>> +static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
>>>>> + unsigned long end, u64 pfn,
>>>>> + unsigned int max_page_shift)
>>>>> +{
>>>>> + if (max_page_shift < CONT_PTE_SHIFT)
>>>>> + return PAGE_SIZE;
>>>>> +
>>>>> + if (end - addr < CONT_PTE_SIZE)
>>>>> + return PAGE_SIZE;
>>>>> +
>>>>> + if (!IS_ALIGNED(addr, CONT_PTE_SIZE))
>>>>> + return PAGE_SIZE;
>>>>> +
>>>>> + if (!IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
>>>>> + return PAGE_SIZE;
>>>>> +
>>>>> + return CONT_PTE_SIZE;
>>>>
>>>> A small nit:
>>>>
>>>> Should the rationale behind picking CONT_PTE_SIZE be added here as an in code
>>>> comment or something in the function - just to make things bit clear.
>>>
>>> I'm not sure what other size we would pick?
>>
>> The suggestion was to add a small comment in the above helper function explaining
>> the rationale for various conditions in there while returning either PAGE_SIZE or
>> CONT_PTE_SIZE to improve readability etc.
>
> OK I've added the following:
>
> /*
> * If the block is at least CONT_PTE_SIZE in size, and is naturally
> * aligned in both virtual and physical space, then we can pte-map the
> * block using the PTE_CONT bit for more efficient use of the TLB.
> */
Sounds good, thanks !
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 16/16] arm64/mm: Defer barriers when updating kernel mappings
2025-02-13 9:38 ` Ryan Roberts
@ 2025-02-17 4:48 ` Anshuman Khandual
2025-02-17 9:40 ` Ryan Roberts
0 siblings, 1 reply; 62+ messages in thread
From: Anshuman Khandual @ 2025-02-17 4:48 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 2/13/25 15:08, Ryan Roberts wrote:
> On 13/02/2025 05:30, Anshuman Khandual wrote:
>>
>>
>> On 2/10/25 16:42, Ryan Roberts wrote:
>>> On 10/02/2025 08:03, Anshuman Khandual wrote:
>>>>
>>>>
>>>> On 2/5/25 20:39, Ryan Roberts wrote:
>>>>> Because the kernel can't tolerate page faults for kernel mappings, when
>>>>> setting a valid, kernel space pte (or pmd/pud/p4d/pgd), it emits a
>>>>> dsb(ishst) to ensure that the store to the pgtable is observed by the
>>>>> table walker immediately. Additionally it emits an isb() to ensure that
>>>>> any already speculatively determined invalid mapping fault gets
>>>>> canceled.
>>>>> We can improve the performance of vmalloc operations by batching these
>>>>> barriers until the end of a set of entry updates. The newly added
>>>>> arch_update_kernel_mappings_begin() / arch_update_kernel_mappings_end()
>>>>> provide the required hooks.
>>>>>
>>>>> vmalloc improves by up to 30% as a result.
>>>>>
>>>>> Two new TIF_ flags are created; TIF_KMAP_UPDATE_ACTIVE tells us if we
>>>>> are in the batch mode and can therefore defer any barriers until the end
>>>>> of the batch. TIF_KMAP_UPDATE_PENDING tells us if barriers are queued to
>>>>> be emitted at the end of the batch.
>>>>
>>>> Why cannot this be achieved with a single TIF_KMAP_UPDATE_ACTIVE which is
>>>> set in __begin(), cleared in __end() and saved across a __switch_to().
>>>
>>> So unconditionally emit the barriers in _end(), and emit them in __switch_to()
>>> if TIF_KMAP_UPDATE_ACTIVE is set?
>>
>> Right.
>>
>>>
>>> I guess if calling _begin() then you are definitely going to be setting at least
>>> 1 PTE. So you can definitely emit the barriers unconditionally. I was trying to
>>> protect against the case where you get pre-empted (potentially multiple times)
>>> while in the loop. The TIF_KMAP_UPDATE_PENDING flag ensures you only emit the
>>> barriers when you definitely need to. Without it, you would have to emit on
>>> every pre-emption even if no more PTEs got set.
>>>
>>> But I suspect this is a premature optimization. Probably it will never occur. So
>>
>> Agreed.
>
> Having done this simplification, I've just noticed that one of the
> arch_update_kernel_mappings_begin/end callsites is __apply_to_page_range() which
> gets called for user space mappings as well as kernel mappings. So actually with
> the simplification I'll be emitting barriers even when only user space mappings
> were touched.
Right, that will not be desirable.
>
> I think there are a couple of options to fix this:
>
> - Revert to the 2 flag approach. For the user space case, I'll get to _end() and
> notice that no barriers are queued so will emit nothing.
>
> - Only set TIF_KMAP_UPDATE_ACTIVE if the address range passed to _begin() is a
> kernel address range. I guess that's just a case of checking if the MSB is set
> in "end"?
>
> - pass mm to _begin() and only set TIF_KMAP_UPDATE_ACTIVE if mm == &init_mm. I
> guess this should be the same as option 2.
>
> I'm leaning towards option 2. But I have a niggling feeling that my proposed
> check isn't quite correct. What do you think?
Options 2 and 3 look better than the two-flag approach proposed earlier. But
isn't option 3 a bit simpler than option 2? Does getting a struct mm argument
into these functions create more code churn?
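
For comparison, a rough idea of what option 3 implies (hypothetical signature;
every caller in mm/memory.c and mm/vmalloc.c would need the extra argument,
which is where the churn would come from):

/* Hypothetical option 3: the hook takes the mm and filters on init_mm. */
#define arch_update_kernel_mappings_begin arch_update_kernel_mappings_begin
static inline void arch_update_kernel_mappings_begin(struct mm_struct *mm,
						     unsigned long start,
						     unsigned long end)
{
	if (mm == &init_mm)
		set_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
}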
>
> Thanks,
> Ryan
>
>
>>
>>> I'll simplify as you suggest.
>>>
>>> Thanks,
>>> Ryan
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>> arch/arm64/include/asm/pgtable.h | 65 +++++++++++++++++++---------
>>>>> arch/arm64/include/asm/thread_info.h | 2 +
>>>>> arch/arm64/kernel/process.c | 20 +++++++--
>>>>> 3 files changed, 63 insertions(+), 24 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>>> index ff358d983583..1ee9b9588502 100644
>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>> @@ -39,6 +39,41 @@
>>>>> #include <linux/mm_types.h>
>>>>> #include <linux/sched.h>
>>>>> #include <linux/page_table_check.h>
>>>>> +#include <linux/pgtable_modmask.h>
>>>>> +
>>>>> +static inline void emit_pte_barriers(void)
>>>>> +{
>>>>> + dsb(ishst);
>>>>> + isb();
>>>>> +}
>>>>
>>>> There are many sequences of these two barriers in this particular header,
>>>> hence probably a good idea to factor this out into a common helper.
>>>>> +
>>>>> +static inline void queue_pte_barriers(void)
>>>>> +{
>>>>> + if (test_thread_flag(TIF_KMAP_UPDATE_ACTIVE)) {
>>>>> + if (!test_thread_flag(TIF_KMAP_UPDATE_PENDING))
>>>>> + set_thread_flag(TIF_KMAP_UPDATE_PENDING);
>>>>> + } else
>>>>> + emit_pte_barriers();
>>>>> +}
>>>>> +
>>>>> +#define arch_update_kernel_mappings_begin arch_update_kernel_mappings_begin
>>>>> +static inline void arch_update_kernel_mappings_begin(unsigned long start,
>>>>> + unsigned long end)
>>>>> +{
>>>>> + set_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
>>>>> +}
>>>>> +
>>>>> +#define arch_update_kernel_mappings_end arch_update_kernel_mappings_end
>>>>> +static inline void arch_update_kernel_mappings_end(unsigned long start,
>>>>> + unsigned long end,
>>>>> + pgtbl_mod_mask mask)
>>>>> +{
>>>>> + if (test_thread_flag(TIF_KMAP_UPDATE_PENDING))
>>>>> + emit_pte_barriers();
>>>>> +
>>>>> + clear_thread_flag(TIF_KMAP_UPDATE_PENDING);
>>>>> + clear_thread_flag(TIF_KMAP_UPDATE_ACTIVE);
>>>>> +}
>>>>>
>>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
>>>>> @@ -323,10 +358,8 @@ static inline void __set_pte_complete(pte_t pte)
>>>>> * Only if the new pte is valid and kernel, otherwise TLB maintenance
>>>>> * or update_mmu_cache() have the necessary barriers.
>>>>> */
>>>>> - if (pte_valid_not_user(pte)) {
>>>>> - dsb(ishst);
>>>>> - isb();
>>>>> - }
>>>>> + if (pte_valid_not_user(pte))
>>>>> + queue_pte_barriers();
>>>>> }
>>>>>
>>>>> static inline void __set_pte(pte_t *ptep, pte_t pte)
>>>>> @@ -791,10 +824,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>>>>>
>>>>> WRITE_ONCE(*pmdp, pmd);
>>>>>
>>>>> - if (pmd_valid_not_user(pmd)) {
>>>>> - dsb(ishst);
>>>>> - isb();
>>>>> - }
>>>>> + if (pmd_valid_not_user(pmd))
>>>>> + queue_pte_barriers();
>>>>> }
>>>>>
>>>>> static inline void pmd_clear(pmd_t *pmdp)
>>>>> @@ -869,10 +900,8 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
>>>>>
>>>>> WRITE_ONCE(*pudp, pud);
>>>>>
>>>>> - if (pud_valid_not_user(pud)) {
>>>>> - dsb(ishst);
>>>>> - isb();
>>>>> - }
>>>>> + if (pud_valid_not_user(pud))
>>>>> + queue_pte_barriers();
>>>>> }
>>>>>
>>>>> static inline void pud_clear(pud_t *pudp)
>>>>> @@ -960,10 +989,8 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>>>>>
>>>>> WRITE_ONCE(*p4dp, p4d);
>>>>>
>>>>> - if (p4d_valid_not_user(p4d)) {
>>>>> - dsb(ishst);
>>>>> - isb();
>>>>> - }
>>>>> + if (p4d_valid_not_user(p4d))
>>>>> + queue_pte_barriers();
>>>>> }
>>>>>
>>>>> static inline void p4d_clear(p4d_t *p4dp)
>>>>> @@ -1098,10 +1125,8 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
>>>>>
>>>>> WRITE_ONCE(*pgdp, pgd);
>>>>>
>>>>> - if (pgd_valid_not_user(pgd)) {
>>>>> - dsb(ishst);
>>>>> - isb();
>>>>> - }
>>>>> + if (pgd_valid_not_user(pgd))
>>>>> + queue_pte_barriers();
>>>>> }
>>>>>
>>>>> static inline void pgd_clear(pgd_t *pgdp)
>>>>> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
>>>>> index 1114c1c3300a..382d2121261e 100644
>>>>> --- a/arch/arm64/include/asm/thread_info.h
>>>>> +++ b/arch/arm64/include/asm/thread_info.h
>>>>> @@ -82,6 +82,8 @@ void arch_setup_new_exec(void);
>>>>> #define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */
>>>>> #define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */
>>>>> #define TIF_TSC_SIGSEGV 30 /* SIGSEGV on counter-timer access */
>>>>> +#define TIF_KMAP_UPDATE_ACTIVE 31 /* kernel map update in progress */
>>>>> +#define TIF_KMAP_UPDATE_PENDING 32 /* kernel map updated with deferred barriers */
>>>>>
>>>>> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
>>>>> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
>>>>> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
>>>>> index 42faebb7b712..1367ec6407d1 100644
>>>>> --- a/arch/arm64/kernel/process.c
>>>>> +++ b/arch/arm64/kernel/process.c
>>>>> @@ -680,10 +680,10 @@ struct task_struct *__switch_to(struct task_struct *prev,
>>>>> gcs_thread_switch(next);
>>>>>
>>>>> /*
>>>>> - * Complete any pending TLB or cache maintenance on this CPU in case
>>>>> - * the thread migrates to a different CPU.
>>>>> - * This full barrier is also required by the membarrier system
>>>>> - * call.
>>>>> + * Complete any pending TLB or cache maintenance on this CPU in case the
>>>>> + * thread migrates to a different CPU. This full barrier is also
>>>>> + * required by the membarrier system call. Additionally it is required
>>>>> + * for TIF_KMAP_UPDATE_PENDING, see below.
>>>>> */
>>>>> dsb(ish);
>>>>>
>>>>> @@ -696,6 +696,18 @@ struct task_struct *__switch_to(struct task_struct *prev,
>>>>> /* avoid expensive SCTLR_EL1 accesses if no change */
>>>>> if (prev->thread.sctlr_user != next->thread.sctlr_user)
>>>>> update_sctlr_el1(next->thread.sctlr_user);
>>>>> + else if (unlikely(test_thread_flag(TIF_KMAP_UPDATE_PENDING))) {
>>>>> + /*
>>>>> + * In unlikely event that a kernel map update is on-going when
>>>>> + * preemption occurs, we must emit_pte_barriers() if pending.
>>>>> + * emit_pte_barriers() consists of "dsb(ishst); isb();". The dsb
>>>>> + * is already handled above. The isb() is handled if
>>>>> + * update_sctlr_el1() was called. So only need to emit isb()
>>>>> + * here if it wasn't called.
>>>>> + */
>>>>> + isb();
>>>>> + clear_thread_flag(TIF_KMAP_UPDATE_PENDING);
>>>>> + }
>>>>>
>>>>> /* the actual thread switch */
>>>>> last = cpu_switch_to(prev, next);
>>>
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v1 16/16] arm64/mm: Defer barriers when updating kernel mappings
2025-02-17 4:48 ` Anshuman Khandual
@ 2025-02-17 9:40 ` Ryan Roberts
0 siblings, 0 replies; 62+ messages in thread
From: Ryan Roberts @ 2025-02-17 9:40 UTC (permalink / raw)
To: Anshuman Khandual, Catalin Marinas, Will Deacon, Muchun Song,
Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, Mark Rutland, Ard Biesheuvel, Dev Jain,
Alexandre Ghiti, Steve Capper, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 17/02/2025 04:48, Anshuman Khandual wrote:
>
>
> On 2/13/25 15:08, Ryan Roberts wrote:
>> On 13/02/2025 05:30, Anshuman Khandual wrote:
>>>
>>>
>>> On 2/10/25 16:42, Ryan Roberts wrote:
>>>> On 10/02/2025 08:03, Anshuman Khandual wrote:
>>>>>
>>>>>
>>>>> On 2/5/25 20:39, Ryan Roberts wrote:
>>>>>> Because the kernel can't tolerate page faults for kernel mappings, when
>>>>>> setting a valid, kernel space pte (or pmd/pud/p4d/pgd), it emits a
>>>>>> dsb(ishst) to ensure that the store to the pgtable is observed by the
>>>>>> table walker immediately. Additionally it emits an isb() to ensure that
>>>>>> any already speculatively determined invalid mapping fault gets
>>>>>> canceled.
>>>>>> We can improve the performance of vmalloc operations by batching these
>>>>>> barriers until the end of a set of entry updates. The newly added
>>>>>> arch_update_kernel_mappings_begin() / arch_update_kernel_mappings_end()
>>>>>> provide the required hooks.
>>>>>>
>>>>>> vmalloc improves by up to 30% as a result.
>>>>>>
>>>>>> Two new TIF_ flags are created; TIF_KMAP_UPDATE_ACTIVE tells us if we
>>>>>> are in the batch mode and can therefore defer any barriers until the end
>>>>>> of the batch. TIF_KMAP_UPDATE_PENDING tells us if barriers are queued to
>>>>>> be emitted at the end of the batch.
>>>>>
>>>>> Why cannot this be achieved with a single TIF_KMAP_UPDATE_ACTIVE which is
>>>>> set in __begin(), cleared in __end() and saved across a __switch_to().
>>>>
>>>> So unconditionally emit the barriers in _end(), and emit them in __switch_to()
>>>> if TIF_KMAP_UPDATE_ACTIVE is set?
>>>
>>> Right.
>>>
>>>>
>>>> I guess if calling _begin() then you are definitely going to be setting at least
>>>> 1 PTE. So you can definitely emit the barriers unconditionally. I was trying to
>>>> protect against the case where you get pre-empted (potentially multiple times)
>>>> while in the loop. The TIF_KMAP_UPDATE_PENDING flag ensures you only emit the
>>>> barriers when you definitely need to. Without it, you would have to emit on
>>>> every pre-emption even if no more PTEs got set.
>>>>
>>>> But I suspect this is a premature optimization. Probably it will never occur. So
>>>
>>> Agreed.
>>
>> Having done this simplification, I've just noticed that one of the
>> arch_update_kernel_mappings_begin/end callsites is __apply_to_page_range() which
>> gets called for user space mappings as well as kernel mappings. So actually with
>> the simplification I'll be emitting barriers even when only user space mappings
>> were touched.
>
> Right, that will not be desirable.
>
>>
>> I think there are a couple of options to fix this:
>>
>> - Revert to the 2 flag approach. For the user space case, I'll get to _end() and
>> notice that no barriers are queued so will emit nothing.
>>
>> - Only set TIF_KMAP_UPDATE_ACTIVE if the address range passed to _begin() is a
>> kernel address range. I guess that's just a case of checking if the MSB is set
>> in "end"?
>>
>> - pass mm to _begin() and only set TIF_KMAP_UPDATE_ACTIVE if mm == &init_mm. I
>> guess this should be the same as option 2.
>>
>> I'm leaning towards option 2. But I have a niggling feeling that my proposed
>> check isn't quite correct. What do you think?
>
> Option 2 and 3 looks better than the two flags approach proposed earlier. But is
> not option 3 bit more simplistic than option 2 ? Does getting struct mm argument
> into these function create more code churn ?
Actually looking at this again, I think the best thing is that when called in
the context of __apply_to_page_range(), we will only call
arch_update_kernel_mappings_[begin|end]() if mm == &init_mm. The function is
explicitly for "kernel mappings" so it doesn't make sense to call it for user
mappings.
Looking at the current implementations of arch_sync_kernel_mappings() they are
filtering on kernel addresses anyway, so this should be safe.
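
In other words, something along these lines in __apply_to_page_range() (a
sketch of the intent only):

	if (mm == &init_mm)
		arch_update_kernel_mappings_begin(start, end);

	/* ... existing pgd walk, accumulating "mask" ... */

	if (mm == &init_mm)
		arch_update_kernel_mappings_end(start, end, mask);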
Thanks,
Ryan
^ permalink raw reply [flat|nested] 62+ messages in thread
Thread overview: 62+ messages
2025-02-05 15:09 [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 01/16] mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear() Ryan Roberts
2025-02-06 5:03 ` Anshuman Khandual
2025-02-06 12:15 ` Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 02/16] arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes Ryan Roberts
2025-02-06 6:15 ` Anshuman Khandual
2025-02-06 12:55 ` Ryan Roberts
2025-02-12 14:44 ` Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 03/16] arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level Ryan Roberts
2025-02-06 6:46 ` Anshuman Khandual
2025-02-06 13:04 ` Ryan Roberts
2025-02-13 4:57 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 04/16] arm64: hugetlb: Refine tlb maintenance scope Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 05/16] mm/page_table_check: Batch-check pmds/puds just like ptes Ryan Roberts
2025-02-06 10:55 ` Anshuman Khandual
2025-02-06 13:07 ` Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 06/16] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear() Ryan Roberts
2025-02-06 11:48 ` Anshuman Khandual
2025-02-06 13:26 ` Ryan Roberts
2025-02-07 9:38 ` Ryan Roberts
2025-02-12 15:29 ` Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 07/16] arm64: hugetlb: Use ___set_ptes() and ___ptep_get_and_clear() Ryan Roberts
2025-02-07 4:09 ` Anshuman Khandual
2025-02-07 10:00 ` Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 08/16] arm64/mm: Hoist barriers out of ___set_ptes() loop Ryan Roberts
2025-02-07 5:35 ` Anshuman Khandual
2025-02-07 10:38 ` Ryan Roberts
2025-02-12 16:00 ` Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 09/16] arm64/mm: Avoid barriers for invalid or userspace mappings Ryan Roberts
2025-02-07 8:11 ` Anshuman Khandual
2025-02-07 10:53 ` Ryan Roberts
2025-02-12 16:48 ` Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 10/16] mm/vmalloc: Warn on improper use of vunmap_range() Ryan Roberts
2025-02-07 8:41 ` Anshuman Khandual
2025-02-07 10:59 ` Ryan Roberts
2025-02-13 6:36 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 11/16] mm/vmalloc: Gracefully unmap huge ptes Ryan Roberts
2025-02-07 9:19 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 12/16] arm64/mm: Support huge pte-mapped pages in vmap Ryan Roberts
2025-02-07 10:04 ` Anshuman Khandual
2025-02-07 11:20 ` Ryan Roberts
2025-02-13 6:32 ` Anshuman Khandual
2025-02-13 9:09 ` Ryan Roberts
2025-02-17 4:33 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 13/16] mm: Don't skip arch_sync_kernel_mappings() in error paths Ryan Roberts
2025-02-07 10:21 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 14/16] mm/vmalloc: Batch arch_sync_kernel_mappings() more efficiently Ryan Roberts
2025-02-10 7:11 ` Anshuman Khandual
2025-02-05 15:09 ` [PATCH v1 15/16] mm: Generalize arch_sync_kernel_mappings() Ryan Roberts
2025-02-10 7:45 ` Anshuman Khandual
2025-02-10 11:04 ` Ryan Roberts
2025-02-13 5:57 ` Anshuman Khandual
2025-02-13 9:17 ` Ryan Roberts
2025-02-05 15:09 ` [PATCH v1 16/16] arm64/mm: Defer barriers when updating kernel mappings Ryan Roberts
2025-02-10 8:03 ` Anshuman Khandual
2025-02-10 11:12 ` Ryan Roberts
2025-02-13 5:30 ` Anshuman Khandual
2025-02-13 9:38 ` Ryan Roberts
2025-02-17 4:48 ` Anshuman Khandual
2025-02-17 9:40 ` Ryan Roberts
2025-02-06 7:52 ` [PATCH v1 00/16] hugetlb and vmalloc fixes and perf improvements Andrew Morton
2025-02-06 11:59 ` Ryan Roberts