linux-mm.kvack.org archive mirror
* [PATCH v6 0/5] support batch checking of references and unmapping for large folios
@ 2026-02-09 14:07 Baolin Wang
  2026-02-09 14:07 ` [PATCH v6 1/5] mm: rmap: support batched checks of the references " Baolin Wang
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Baolin Wang @ 2026-02-09 14:07 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.

Moreover, the Arm64 architecture, which supports contiguous PTEs, already has an
optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. We can extend it to perform batched operations
on the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).

Similar to folio_referenced_one(), we can also apply batched unmapping to
large file folios to optimize the performance of file folio reclamation. With
batched checking of the young flags, batched TLB flushing, and batched
unmapping, I observed significant performance improvements in my file folio
reclamation tests. Please check the performance data in the commit message of
each patch.

I ran stress-ng and the mm selftests; no issues were found.
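For readers less familiar with the rmap side, the core idea of the series — test-and-clear the young (Access Flag) bit across all PTEs of a folio in one call instead of one PTE at a time — can be modeled in plain userspace C. The pte representation and bit position below are illustrative only, not the real arm64/Linux definitions:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the generic clear_flush_young_ptes() fallback added
 * by this series: visit nr consecutive "PTEs", test-and-clear a young
 * (Access Flag) bit in each, and report whether any entry was young.
 */
#define MODEL_PTE_AF (1ULL << 10)	/* hypothetical young/AF bit */

static int model_clear_young_ptes(uint64_t *ptep, unsigned int nr)
{
	int young = 0;

	for (; nr; nr--, ptep++) {
		young |= !!(*ptep & MODEL_PTE_AF);
		*ptep &= ~MODEL_PTE_AF;		/* mark the entry old */
	}

	return young;
}
```

In the kernel the per-PTE step also flushes the TLB; the arch override in patch 4 is what allows the flush to be batched as well.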

Patch 1: Add a new generic batched PTE helper that supports batched checks of
the references for large folios.
Patches 2 - 3: Preparation patches.
Patch 4: Implement the Arm64 arch-specific clear_flush_young_ptes().
Patch 5: Support batched unmapping for large file folios.

Changes from v5:
 - Collect reviewed tags from Ryan, Harry and David. Thanks.
 - Fix some coding style issues (per David).
 - Skip batched unmapping for the uffd case (reported by Dev). Thanks.

Changes from v4:
 - Fix passing the incorrect 'CONT_PTES' for non-batched APIs.
 - Rename ptep_clear_flush_young_notify() to clear_flush_young_ptes_notify() (per Ryan).
 - Fix some coding style issues (per Ryan).
 - Add reviewed tag from Ryan. Thanks.

Changes from v3:
 - Fix using an incorrect parameter in ptep_clear_flush_young_notify()
   (per Liam).

Changes from v2:
 - Rearrange the patch set (per Ryan).
 - Add pte_cont() check in clear_flush_young_ptes() (per Ryan).
 - Add a helper to do contpte block alignment (per Ryan).
 - Fix some coding style issues (per Lorenzo and Ryan).
 - Add more comments and update the commit message (per Lorenzo and Ryan).
 - Add acked tag from Barry. Thanks. 

Changes from v1:
 - Add a new patch to support batched unmapping for file large folios.
 - Update the cover letter


Baolin Wang (5):
  mm: rmap: support batched checks of the references for large folios
  arm64: mm: factor out the address and ptep alignment into a new helper
  arm64: mm: support batch clearing of the young flag for large folios
  arm64: mm: implement the architecture-specific
    clear_flush_young_ptes()
  mm: rmap: support batched unmapping for file large folios

 arch/arm64/include/asm/pgtable.h | 23 ++++++++----
 arch/arm64/mm/contpte.c          | 62 ++++++++++++++++++++------------
 include/linux/mmu_notifier.h     |  9 ++---
 include/linux/pgtable.h          | 35 ++++++++++++++++++
 mm/rmap.c                        | 38 ++++++++++++++++----
 5 files changed, 129 insertions(+), 38 deletions(-)

-- 
2.47.3



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
  2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping for large folios Baolin Wang
@ 2026-02-09 14:07 ` Baolin Wang
  2026-02-09 15:25   ` David Hildenbrand (Arm)
  2026-03-06 21:07   ` Barry Song
  2026-02-09 14:07 ` [PATCH v6 2/5] arm64: mm: factor out the address and ptep alignment into a new helper Baolin Wang
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Baolin Wang @ 2026-02-09 14:07 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.

Moreover, the Arm64 architecture, which supports contiguous PTEs, already has an
optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. We can extend it to perform batched operations
on the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).

Introduce a new API, clear_flush_young_ptes(), to facilitate batched checking
of the young flags and batched flushing of TLB entries, thereby improving
performance during large folio reclamation. It can be overridden by
architectures that implement a more efficient batched operation, as the
following patches do for Arm64.

While we are at it, rename ptep_clear_flush_young_notify() to
clear_flush_young_ptes_notify() to indicate that this is a batch operation.
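One constraint worth spelling out: in folio_referenced_one() the batch size is capped so it never crosses a PMD boundary or the end of the VMA. That arithmetic can be sketched in userspace C; the 2M PMD / 4K page geometry and names below are assumptions of this model (the kernel's pmd_addr_end() does the equivalent clamping):

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT	12
#define MODEL_PMD_SIZE		(1UL << 21)	/* 2M PMD, 4K pages */

/* Clamp to the next PMD boundary or the given end, whichever is first. */
static unsigned long model_pmd_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary;

	boundary = (addr + MODEL_PMD_SIZE) & ~(MODEL_PMD_SIZE - 1);
	return boundary < end ? boundary : end;
}

/* Maximum number of PTEs that may be batched starting at addr. */
static unsigned int model_max_batch_nr(unsigned long addr,
				       unsigned long vma_end)
{
	return (model_pmd_addr_end(addr, vma_end) - addr) >> MODEL_PAGE_SHIFT;
}
```

folio_pte_batch() then further limits the batch to PTEs that actually map consecutive pages of the folio.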

Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/mmu_notifier.h |  9 +++++----
 include/linux/pgtable.h      | 35 +++++++++++++++++++++++++++++++++++
 mm/rmap.c                    | 28 +++++++++++++++++++++++++---
 3 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d1094c2d5fb6..07a2bbaf86e9 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
 	range->owner = owner;
 }
 
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
+#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr)	\
 ({									\
 	int __young;							\
 	struct vm_area_struct *___vma = __vma;				\
 	unsigned long ___address = __address;				\
-	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
+	unsigned int ___nr = __nr;					\
+	__young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr);	\
 	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
 						  ___address,		\
 						  ___address +		\
-							PAGE_SIZE);	\
+						  ___nr * PAGE_SIZE);	\
 	__young;							\
 })
 
@@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
 
 #define mmu_notifier_range_update_to_read_only(r) false
 
-#define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define clear_flush_young_ptes_notify clear_flush_young_ptes
 #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
 #define ptep_clear_young_notify ptep_test_and_clear_young
 #define pmdp_clear_young_notify pmdp_test_and_clear_young
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 21b67d937555..a50df42a893f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1068,6 +1068,41 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
 }
 #endif
 
+#ifndef clear_flush_young_ptes
+/**
+ * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
+ *			    folio as old and flush the TLB.
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear access bit.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_clear_flush_young().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, unsigned int nr)
+{
+	int young = 0;
+
+	for (;;) {
+		young |= ptep_clear_flush_young(vma, addr, ptep);
+		if (--nr == 0)
+			break;
+		ptep++;
+		addr += PAGE_SIZE;
+	}
+
+	return young;
+}
+#endif
+
 /*
  * On some architectures hardware does not set page access bit when accessing
  * memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/rmap.c b/mm/rmap.c
index a5a284f2a83d..8807f8a7df28 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -913,9 +913,11 @@ static bool folio_referenced_one(struct folio *folio,
 	struct folio_referenced_arg *pra = arg;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	int ptes = 0, referenced = 0;
+	unsigned int nr;
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		address = pvmw.address;
+		nr = 1;
 
 		if (vma->vm_flags & VM_LOCKED) {
 			ptes++;
@@ -960,9 +962,21 @@ static bool folio_referenced_one(struct folio *folio,
 			if (lru_gen_look_around(&pvmw))
 				referenced++;
 		} else if (pvmw.pte) {
-			if (ptep_clear_flush_young_notify(vma, address,
-						pvmw.pte))
+			if (folio_test_large(folio)) {
+				unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
+				unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
+				pte_t pteval = ptep_get(pvmw.pte);
+
+				nr = folio_pte_batch(folio, pvmw.pte,
+						     pteval, max_nr);
+			}
+
+			ptes += nr;
+			if (clear_flush_young_ptes_notify(vma, address, pvmw.pte, nr))
 				referenced++;
+			/* Skip the batched PTEs */
+			pvmw.pte += nr - 1;
+			pvmw.address += (nr - 1) * PAGE_SIZE;
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
 			if (pmdp_clear_flush_young_notify(vma, address,
 						pvmw.pmd))
@@ -972,7 +986,15 @@ static bool folio_referenced_one(struct folio *folio,
 			WARN_ON_ONCE(1);
 		}
 
-		pra->mapcount--;
+		pra->mapcount -= nr;
+		/*
+		 * If we are sure that we batched the entire folio,
+		 * we can just optimize and stop right here.
+		 */
+		if (ptes == pvmw.nr_pages) {
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
 	}
 
 	if (referenced)
-- 
2.47.3




* [PATCH v6 2/5] arm64: mm: factor out the address and ptep alignment into a new helper
  2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping for large folios Baolin Wang
  2026-02-09 14:07 ` [PATCH v6 1/5] mm: rmap: support batched checks of the references " Baolin Wang
@ 2026-02-09 14:07 ` Baolin Wang
  2026-02-09 14:07 ` [PATCH v6 3/5] arm64: mm: support batch clearing of the young flag for large folios Baolin Wang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Baolin Wang @ 2026-02-09 14:07 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Factor out the contpte block's address and ptep alignment into a new helper,
which will be reused in a following patch.

No functional changes.
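The widening performed by the helper — if the first or last PTE of the range lies inside a contiguous (contpte) block, expand the range to cover the whole block — is just alignment arithmetic, sketched here in userspace C. CONT_PTES == 16 with 4K pages (a 64K contpte block) is an assumption of this model:

```c
#include <assert.h>

#define MODEL_PAGE_SIZE		4096UL
#define MODEL_CONT_PTE_SIZE	(16UL * MODEL_PAGE_SIZE)	/* 64K block */

#define MODEL_ALIGN_DOWN(x, a)	((x) & ~((a) - 1))
#define MODEL_ALIGN(x, a)	MODEL_ALIGN_DOWN((x) + (a) - 1, (a))

/*
 * Widen [start, end) to whole contpte blocks when its first/last entry
 * carries the contiguous hint, mirroring contpte_align_addr_ptep().
 */
static void model_align_range(unsigned long *start, unsigned long *end,
			      int first_is_cont, int last_is_cont)
{
	if (last_is_cont)
		*end = MODEL_ALIGN(*end, MODEL_CONT_PTE_SIZE);
	if (first_is_cont)
		*start = MODEL_ALIGN_DOWN(*start, MODEL_CONT_PTE_SIZE);
}
```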

Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 arch/arm64/mm/contpte.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 589bcf878938..e4ddeb46f25d 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -26,6 +26,26 @@ static inline pte_t *contpte_align_down(pte_t *ptep)
 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
 }
 
+static inline pte_t *contpte_align_addr_ptep(unsigned long *start,
+					     unsigned long *end, pte_t *ptep,
+					     unsigned int nr)
+{
+	/*
+	 * Note: caller must ensure these nr PTEs are consecutive (present)
+	 * PTEs that map consecutive pages of the same large folio within a
+	 * single VMA and a single page table.
+	 */
+	if (pte_cont(__ptep_get(ptep + nr - 1)))
+		*end = ALIGN(*end, CONT_PTE_SIZE);
+
+	if (pte_cont(__ptep_get(ptep))) {
+		*start = ALIGN_DOWN(*start, CONT_PTE_SIZE);
+		ptep = contpte_align_down(ptep);
+	}
+
+	return ptep;
+}
+
 static void contpte_try_unfold_partial(struct mm_struct *mm, unsigned long addr,
 					pte_t *ptep, unsigned int nr)
 {
@@ -569,14 +589,7 @@ void contpte_clear_young_dirty_ptes(struct vm_area_struct *vma,
 	unsigned long start = addr;
 	unsigned long end = start + nr * PAGE_SIZE;
 
-	if (pte_cont(__ptep_get(ptep + nr - 1)))
-		end = ALIGN(end, CONT_PTE_SIZE);
-
-	if (pte_cont(__ptep_get(ptep))) {
-		start = ALIGN_DOWN(start, CONT_PTE_SIZE);
-		ptep = contpte_align_down(ptep);
-	}
-
+	ptep = contpte_align_addr_ptep(&start, &end, ptep, nr);
 	__clear_young_dirty_ptes(vma, start, ptep, (end - start) / PAGE_SIZE, flags);
 }
 EXPORT_SYMBOL_GPL(contpte_clear_young_dirty_ptes);
-- 
2.47.3




* [PATCH v6 3/5] arm64: mm: support batch clearing of the young flag for large folios
  2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping for large folios Baolin Wang
  2026-02-09 14:07 ` [PATCH v6 1/5] mm: rmap: support batched checks of the references " Baolin Wang
  2026-02-09 14:07 ` [PATCH v6 2/5] arm64: mm: factor out the address and ptep alignment into a new helper Baolin Wang
@ 2026-02-09 14:07 ` Baolin Wang
  2026-02-09 14:07 ` [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Baolin Wang @ 2026-02-09 14:07 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
only clear the young flag and flush TLBs for PTEs within the contiguous range.
To support batched PTE operations on other sizes of large folios in the
following patches, add a new parameter specifying the number of PTEs that map
consecutive pages of the same large folio in a single VMA and a single page
table.

While we are at it, rename the functions to maintain consistency with other
contpte_*() functions.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 arch/arm64/include/asm/pgtable.h | 12 ++++++------
 arch/arm64/mm/contpte.c          | 33 ++++++++++++++++++--------------
 2 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index d94445b4f3df..3dabf5ea17fa 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1648,10 +1648,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
 extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep,
 				unsigned int nr, int full);
-extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
-				unsigned long addr, pte_t *ptep);
-extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
-				unsigned long addr, pte_t *ptep);
+int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep, unsigned int nr);
+int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep, unsigned int nr);
 extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, unsigned int nr);
 extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
@@ -1823,7 +1823,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	if (likely(!pte_valid_cont(orig_pte)))
 		return __ptep_test_and_clear_young(vma, addr, ptep);
 
-	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
+	return contpte_test_and_clear_young_ptes(vma, addr, ptep, 1);
 }
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
@@ -1835,7 +1835,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	if (likely(!pte_valid_cont(orig_pte)))
 		return __ptep_clear_flush_young(vma, addr, ptep);
 
-	return contpte_ptep_clear_flush_young(vma, addr, ptep);
+	return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
 }
 
 #define wrprotect_ptes wrprotect_ptes
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index e4ddeb46f25d..b929a455103f 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -508,8 +508,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
 
-int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
-					unsigned long addr, pte_t *ptep)
+int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep,
+					unsigned int nr)
 {
 	/*
 	 * ptep_clear_flush_young() technically requires us to clear the access
@@ -518,41 +519,45 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 	 * contig range when the range is covered by a single folio, we can get
 	 * away with clearing young for the whole contig range here, so we avoid
 	 * having to unfold.
+	 *
+	 * The 'nr' means consecutive (present) PTEs that map consecutive pages
+	 * of the same large folio in a single VMA and a single page table.
 	 */
 
+	unsigned long end = addr + nr * PAGE_SIZE;
 	int young = 0;
-	int i;
 
-	ptep = contpte_align_down(ptep);
-	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
-
-	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+	ptep = contpte_align_addr_ptep(&addr, &end, ptep, nr);
+	for (; addr != end; ptep++, addr += PAGE_SIZE)
 		young |= __ptep_test_and_clear_young(vma, addr, ptep);
 
 	return young;
 }
-EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
+EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
 
-int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
-					unsigned long addr, pte_t *ptep)
+int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep,
+				unsigned int nr)
 {
 	int young;
 
-	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
+	young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
 
 	if (young) {
+		unsigned long end = addr + nr * PAGE_SIZE;
+
+		contpte_align_addr_ptep(&addr, &end, ptep, nr);
 		/*
 		 * See comment in __ptep_clear_flush_young(); same rationale for
 		 * eliding the trailing DSB applies here.
 		 */
-		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
-		__flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
+		__flush_tlb_range_nosync(vma->vm_mm, addr, end,
 					 PAGE_SIZE, true, 3);
 	}
 
 	return young;
 }
-EXPORT_SYMBOL_GPL(contpte_ptep_clear_flush_young);
+EXPORT_SYMBOL_GPL(contpte_clear_flush_young_ptes);
 
 void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
 					pte_t *ptep, unsigned int nr)
-- 
2.47.3




* [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping for large folios Baolin Wang
                   ` (2 preceding siblings ...)
  2026-02-09 14:07 ` [PATCH v6 3/5] arm64: mm: support batch clearing of the young flag for large folios Baolin Wang
@ 2026-02-09 14:07 ` Baolin Wang
  2026-02-09 15:30   ` David Hildenbrand (Arm)
  2026-03-06 21:20   ` Barry Song
  2026-02-09 14:07 ` [PATCH v6 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
  2026-02-10  1:53 ` [PATCH v6 0/5] support batch checking of references and unmapping for " Andrew Morton
  5 siblings, 2 replies; 18+ messages in thread
From: Baolin Wang @ 2026-02-09 14:07 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
batched checking of young flags and TLB flushing, improving performance during
large folio reclamation.

Performance testing:
Allocate 10G of clean file-backed folios via mmap() in a memory cgroup, then
reclaim 8G of them via the memory.reclaim interface. I observed a 33%
performance improvement on my Arm64 32-core server (and a 10%+ improvement
on my X86 machine). Meanwhile, the folio_check_references() hotspot dropped
from approximately 35% to around 5%.

W/o patchset:
real	0m1.518s
user	0m0.000s
sys	0m1.518s

W/ patchset:
real	0m1.018s
user	0m0.000s
sys	0m1.018s

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 arch/arm64/include/asm/pgtable.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3dabf5ea17fa..a17eb8a76788 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
 }
 
+#define clear_flush_young_ptes clear_flush_young_ptes
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t *ptep,
+					 unsigned int nr)
+{
+	if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
+		return __ptep_clear_flush_young(vma, addr, ptep);
+
+	return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
+}
+
 #define wrprotect_ptes wrprotect_ptes
 static __always_inline void wrprotect_ptes(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep, unsigned int nr)
-- 
2.47.3




* [PATCH v6 5/5] mm: rmap: support batched unmapping for file large folios
  2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping for large folios Baolin Wang
                   ` (3 preceding siblings ...)
  2026-02-09 14:07 ` [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
@ 2026-02-09 14:07 ` Baolin Wang
  2026-02-09 15:31   ` David Hildenbrand (Arm)
  2026-02-10  1:53 ` [PATCH v6 0/5] support batch checking of references and unmapping for " Andrew Morton
  5 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2026-02-09 14:07 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Similar to folio_referenced_one(), we can apply batched unmapping to large
file folios to optimize the performance of file folio reclamation.

Barry previously implemented batched unmapping for lazyfree anonymous large
folios[1] and did not further optimize anonymous or file-backed large folios
at that stage. For file-backed large folios, batched unmapping support is
relatively straightforward, as we only need to clear the consecutive (present)
PTE entries.

Note that batched unmapping is not yet ready for the uffd case, so we still
fall back to per-page unmapping there.
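The batching decision this patch makes in folio_unmap_pte_batch() can be summarized as a small decision function. This is a userspace sketch with boolean flags standing in for the real folio/VMA tests; the real function returns folio_pte_batch(), for which max_nr stands in here:

```c
#include <assert.h>

/*
 * Model of folio_unmap_pte_batch() after this patch: regular
 * (swap-backed) anonymous folios, unused PTEs, and the uffd case fall
 * back to per-page unmapping; lazyfree anonymous and file-backed
 * folios may batch.
 */
static unsigned int model_unmap_batch_nr(int anon, int swapbacked,
					 int pte_unused, int uffd_wp,
					 unsigned int max_nr)
{
	if (anon && swapbacked)
		return 1;	/* regular anon memory: not batched */
	if (pte_unused)
		return 1;
	if (uffd_wp)
		return 1;	/* uffd case: per-page unmapping */

	return max_nr;		/* lazyfree anon or file folio: batch */
}
```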

Performance testing:
Allocate 10G of clean file-backed folios via mmap() in a memory cgroup, then
reclaim 8G of them via the memory.reclaim interface. With this patch, I
observed a 75% performance improvement on my Arm64 32-core server (and a
50%+ improvement on my X86 machine).

W/o patch:
real    0m1.018s
user    0m0.000s
sys     0m1.018s

W/ patch:
real	0m0.249s
user	0m0.000s
sys	0m0.249s

[1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Barry Song <baohua@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/rmap.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 8807f8a7df28..43cb9ac6f523 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1945,12 +1945,16 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 	end_addr = pmd_addr_end(addr, vma->vm_end);
 	max_nr = (end_addr - addr) >> PAGE_SHIFT;
 
-	/* We only support lazyfree batching for now ... */
-	if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
+	/* We only support lazyfree or file folios batching for now ... */
+	if (folio_test_anon(folio) && folio_test_swapbacked(folio))
 		return 1;
+
 	if (pte_unused(pte))
 		return 1;
 
+	if (userfaultfd_wp(vma))
+		return 1;
+
 	return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
 }
 
@@ -2313,7 +2317,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 *
 			 * See Documentation/mm/mmu_notifier.rst
 			 */
-			dec_mm_counter(mm, mm_counter_file(folio));
+			add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
 		}
 discard:
 		if (unlikely(folio_test_hugetlb(folio))) {
-- 
2.47.3




* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
  2026-02-09 14:07 ` [PATCH v6 1/5] mm: rmap: support batched checks of the references " Baolin Wang
@ 2026-02-09 15:25   ` David Hildenbrand (Arm)
  2026-03-06 21:07   ` Barry Song
  1 sibling, 0 replies; 18+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09 15:25 UTC (permalink / raw)
  To: Baolin Wang, akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel

On 2/9/26 15:07, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
> 
> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
> 
> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> of the young flags and flushing TLB entries, thereby improving performance
> during large folio reclamation. And it will be overridden by the architecture
> that implements a more efficient batch operation in the following patches.
> 
> While we are at it, rename ptep_clear_flush_young_notify() to
> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
> 
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---

Thanks!

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David



* Re: [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2026-02-09 14:07 ` [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
@ 2026-02-09 15:30   ` David Hildenbrand (Arm)
  2026-02-10  0:39     ` Baolin Wang
  2026-03-06 21:20   ` Barry Song
  1 sibling, 1 reply; 18+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09 15:30 UTC (permalink / raw)
  To: Baolin Wang, akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel

On 2/9/26 15:07, Baolin Wang wrote:
> Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
> batched checking of young flags and TLB flushing, improving performance during
> large folio reclamation.
> 
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
> from approximately 35% to around 5%.
> 
> W/o patchset:
> real	0m1.518s
> user	0m0.000s
> sys	0m1.518s
> 
> W/ patchset:
> real	0m1.018s
> user	0m0.000s
> sys	0m1.018s
> 
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   arch/arm64/include/asm/pgtable.h | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 3dabf5ea17fa..a17eb8a76788 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>   	return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
>   }
>   
> +#define clear_flush_young_ptes clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> +					 unsigned long addr, pte_t *ptep,
> +					 unsigned int nr)
> +{
> +	if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))

I guess similar cases where we should never end up with non-present ptes
should be updated accordingly.

ptep_test_and_clear_young(), for example, should never be called on 
non-present ptes.

Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David



* Re: [PATCH v6 5/5] mm: rmap: support batched unmapping for file large folios
  2026-02-09 14:07 ` [PATCH v6 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
@ 2026-02-09 15:31   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09 15:31 UTC (permalink / raw)
  To: Baolin Wang, akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel

On 2/9/26 15:07, Baolin Wang wrote:
> Similar to folio_referenced_one(), we can apply batched unmapping for file
> large folios to optimize the performance of file folios reclamation.
> 
> Barry previously implemented batched unmapping for lazyfree anonymous large
> folios[1] and did not further optimize anonymous large folios or file-backed
> large folios at that stage. As for file-backed large folios, the batched
> unmapping support is relatively straightforward, as we only need to clear
> the consecutive (present) PTE entries for file-backed large folios.
> 
> Note that it's not ready to support batched unmapping for uffd case, so
> let's still fallback to per-page unmapping for the uffd case.
> 
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
> on my X86 machine) with this patch.
> 
> W/o patch:
> real    0m1.018s
> user    0m0.000s
> sys     0m1.018s
> 
> W/ patch:
> real	0m0.249s
> user	0m0.000s
> sys	0m0.249s
> 
> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Acked-by: Barry Song <baohua@kernel.org>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David



* Re: [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2026-02-09 15:30   ` David Hildenbrand (Arm)
@ 2026-02-10  0:39     ` Baolin Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Baolin Wang @ 2026-02-10  0:39 UTC (permalink / raw)
  To: David Hildenbrand (Arm), akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel



On 2/9/26 11:30 PM, David Hildenbrand (Arm) wrote:
> On 2/9/26 15:07, Baolin Wang wrote:
>> Implement the Arm64 architecture-specific clear_flush_young_ptes() to 
>> enable
>> batched checking of young flags and TLB flushing, improving 
>> performance during
>> large folio reclamation.
>>
>> Performance testing:
>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, 
>> and try to
>> reclaim 8G file-backed folios via the memory.reclaim interface. I can 
>> observe
>> 33% performance improvement on my Arm64 32-core server (and 10%+ 
>> improvement
>> on my X86 machine). Meanwhile, the hotspot folio_check_references() 
>> dropped
>> from approximately 35% to around 5%.
>>
>> W/o patchset:
>> real    0m1.518s
>> user    0m0.000s
>> sys    0m1.518s
>>
>> W/ patchset:
>> real    0m1.018s
>> user    0m0.000s
>> sys    0m1.018s
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   arch/arm64/include/asm/pgtable.h | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/ 
>> asm/pgtable.h
>> index 3dabf5ea17fa..a17eb8a76788 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct 
>> vm_area_struct *vma,
>>       return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
>>   }
>> +#define clear_flush_young_ptes clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> +                     unsigned long addr, pte_t *ptep,
>> +                     unsigned int nr)
>> +{
>> +    if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
> I guess similar cases where we should never end up with non-present ptes 
> should be updated accordingly.
> 
> ptep_test_and_clear_young(), for example, should never be called on non- 
> present ptes.

Yes. I already addressed this in my follow-up patchset.

> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>

Thanks.



* Re: [PATCH v6 0/5] support batch checking of references and unmapping for large folios
  2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping for large folios Baolin Wang
                   ` (4 preceding siblings ...)
  2026-02-09 14:07 ` [PATCH v6 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
@ 2026-02-10  1:53 ` Andrew Morton
  2026-02-10  2:01   ` Baolin Wang
  5 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2026-02-10  1:53 UTC (permalink / raw)
  To: Baolin Wang
  Cc: david, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
	jannh, willy, baohua, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel

On Mon,  9 Feb 2026 22:07:23 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
> 
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
> 
> Similar to folio_referenced_one(), we can also apply batched unmapping for large
> file folios to optimize the performance of file folio reclamation. By supporting
> batched checking of the young flags, flushing TLB entries, and unmapping, I
> observed significant performance improvements in my tests of file folio
> reclamation. Please check the performance data in the commit message of
> each patch.
> 

Thanks, I updated mm.git to this version.  Below is how v6 altered
mm.git.

I notice that this fix:

https://lore.kernel.org/all/de141225-a0c1-41fd-b3e1-bcab09827ddd@linux.alibaba.com/T/#u

was not carried forward.  Was this deliberate?

Also, regarding the 80-column tricks in folio_referenced_one(): we're
allowed to do this ;)


				unsigned long end_addr;
				unsigned int max_nr;

				end_addr = pmd_addr_end(address, vma->vm_end);
				max_nr = (end_addr - address) >> PAGE_SHIFT;




 arch/arm64/include/asm/pgtable.h |    2 +-
 include/linux/pgtable.h          |   16 ++++++++++------
 mm/rmap.c                        |    9 +++------
 3 files changed, 14 insertions(+), 13 deletions(-)

--- a/arch/arm64/include/asm/pgtable.h~b
+++ a/arch/arm64/include/asm/pgtable.h
@@ -1843,7 +1843,7 @@ static inline int clear_flush_young_ptes
 					 unsigned long addr, pte_t *ptep,
 					 unsigned int nr)
 {
-	if (likely(nr == 1 && !pte_valid_cont(__ptep_get(ptep))))
+	if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
 		return __ptep_clear_flush_young(vma, addr, ptep);
 
 	return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
--- a/include/linux/pgtable.h~b
+++ a/include/linux/pgtable.h
@@ -1070,8 +1070,8 @@ static inline void wrprotect_ptes(struct
 
 #ifndef clear_flush_young_ptes
 /**
- * clear_flush_young_ptes - Clear the access bit and perform a TLB flush for PTEs
- *			    that map consecutive pages of the same folio.
+ * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
+ *			    folio as old and flush the TLB.
  * @vma: The virtual memory area the pages are mapped into.
  * @addr: Address the first page is mapped at.
  * @ptep: Page table pointer for the first entry.
@@ -1087,13 +1087,17 @@ static inline void wrprotect_ptes(struct
  * pages that belong to the same folio.  The PTEs are all in the same PMD.
  */
 static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
-					 unsigned long addr, pte_t *ptep,
-					 unsigned int nr)
+		unsigned long addr, pte_t *ptep, unsigned int nr)
 {
-	int i, young = 0;
+	int young = 0;
 
-	for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE)
+	for (;;) {
 		young |= ptep_clear_flush_young(vma, addr, ptep);
+		if (--nr == 0)
+			break;
+		ptep++;
+		addr += PAGE_SIZE;
+	}
 
 	return young;
 }
--- a/mm/rmap.c~b
+++ a/mm/rmap.c
@@ -963,10 +963,8 @@ static bool folio_referenced_one(struct
 				referenced++;
 		} else if (pvmw.pte) {
 			if (folio_test_large(folio)) {
-				unsigned long end_addr =
-					pmd_addr_end(address, vma->vm_end);
-				unsigned int max_nr =
-					(end_addr - address) >> PAGE_SHIFT;
+				unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
+				unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
 				pte_t pteval = ptep_get(pvmw.pte);
 
 				nr = folio_pte_batch(folio, pvmw.pte,
@@ -974,8 +972,7 @@ static bool folio_referenced_one(struct
 			}
 
 			ptes += nr;
-			if (clear_flush_young_ptes_notify(vma, address,
-						pvmw.pte, nr))
+			if (clear_flush_young_ptes_notify(vma, address, pvmw.pte, nr))
 				referenced++;
 			/* Skip the batched PTEs */
 			pvmw.pte += nr - 1;
_




* Re: [PATCH v6 0/5] support batch checking of references and unmapping for large folios
  2026-02-10  1:53 ` [PATCH v6 0/5] support batch checking of references and unmapping for " Andrew Morton
@ 2026-02-10  2:01   ` Baolin Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Baolin Wang @ 2026-02-10  2:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: david, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
	jannh, willy, baohua, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel



On 2/10/26 9:53 AM, Andrew Morton wrote:
> On Mon,  9 Feb 2026 22:07:23 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> 
>> Currently, folio_referenced_one() always checks the young flag for each PTE
>> sequentially, which is inefficient for large folios. This inefficiency is
>> especially noticeable when reclaiming clean file-backed large folios, where
>> folio_referenced() is observed as a significant performance hotspot.
>>
>> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
>> an optimization to clear the young flags for PTEs within a contiguous range.
>> However, this is not sufficient. We can extend this to perform batched operations
>> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>>
>> Similar to folio_referenced_one(), we can also apply batched unmapping for large
>> file folios to optimize the performance of file folio reclamation. By supporting
>> batched checking of the young flags, flushing TLB entries, and unmapping, I
>> observed significant performance improvements in my tests of file folio
>> reclamation. Please check the performance data in the commit message of
>> each patch.
>>
> 
> Thanks, I updated mm.git to this version.  Below is how v6 altered
> mm.git.
> 
> I notice that this fix:
> 
> https://lore.kernel.org/all/de141225-a0c1-41fd-b3e1-bcab09827ddd@linux.alibaba.com/T/#u
> 
> was not carried forward.  Was this deliberate?

Yes. After discussing with David[1], we believe the original patch is 
correct, so the 'fix' is unnecessary.

[1] 
https://lore.kernel.org/all/280ae63e-d66e-438f-8045-6c870420fe76@linux.alibaba.com/

The following diff looks good to me. Thanks.

> Also, regarding the 80-column tricks in folio_referenced_one(): we're
> allowed to do this ;)
> 
> 
> 				unsigned long end_addr;
> 				unsigned int max_nr;
> 
> 				end_addr = pmd_addr_end(address, vma->vm_end);
> 				max_nr = (end_addr - address) >> PAGE_SHIFT;
> 
> 
> 
> 
>   arch/arm64/include/asm/pgtable.h |    2 +-
>   include/linux/pgtable.h          |   16 ++++++++++------
>   mm/rmap.c                        |    9 +++------
>   3 files changed, 14 insertions(+), 13 deletions(-)
> 
> --- a/arch/arm64/include/asm/pgtable.h~b
> +++ a/arch/arm64/include/asm/pgtable.h
> @@ -1843,7 +1843,7 @@ static inline int clear_flush_young_ptes
>   					 unsigned long addr, pte_t *ptep,
>   					 unsigned int nr)
>   {
> -	if (likely(nr == 1 && !pte_valid_cont(__ptep_get(ptep))))
> +	if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
>   		return __ptep_clear_flush_young(vma, addr, ptep);
>   
>   	return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
> --- a/include/linux/pgtable.h~b
> +++ a/include/linux/pgtable.h
> @@ -1070,8 +1070,8 @@ static inline void wrprotect_ptes(struct
>   
>   #ifndef clear_flush_young_ptes
>   /**
> - * clear_flush_young_ptes - Clear the access bit and perform a TLB flush for PTEs
> - *			    that map consecutive pages of the same folio.
> + * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
> + *			    folio as old and flush the TLB.
>    * @vma: The virtual memory area the pages are mapped into.
>    * @addr: Address the first page is mapped at.
>    * @ptep: Page table pointer for the first entry.
> @@ -1087,13 +1087,17 @@ static inline void wrprotect_ptes(struct
>    * pages that belong to the same folio.  The PTEs are all in the same PMD.
>    */
>   static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> -					 unsigned long addr, pte_t *ptep,
> -					 unsigned int nr)
> +		unsigned long addr, pte_t *ptep, unsigned int nr)
>   {
> -	int i, young = 0;
> +	int young = 0;
>   
> -	for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE)
> +	for (;;) {
>   		young |= ptep_clear_flush_young(vma, addr, ptep);
> +		if (--nr == 0)
> +			break;
> +		ptep++;
> +		addr += PAGE_SIZE;
> +	}
>   
>   	return young;
>   }
> --- a/mm/rmap.c~b
> +++ a/mm/rmap.c
> @@ -963,10 +963,8 @@ static bool folio_referenced_one(struct
>   				referenced++;
>   		} else if (pvmw.pte) {
>   			if (folio_test_large(folio)) {
> -				unsigned long end_addr =
> -					pmd_addr_end(address, vma->vm_end);
> -				unsigned int max_nr =
> -					(end_addr - address) >> PAGE_SHIFT;
> +				unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
> +				unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
>   				pte_t pteval = ptep_get(pvmw.pte);
>   
>   				nr = folio_pte_batch(folio, pvmw.pte,
> @@ -974,8 +972,7 @@ static bool folio_referenced_one(struct
>   			}
>   
>   			ptes += nr;
> -			if (clear_flush_young_ptes_notify(vma, address,
> -						pvmw.pte, nr))
> +			if (clear_flush_young_ptes_notify(vma, address, pvmw.pte, nr))
>   				referenced++;
>   			/* Skip the batched PTEs */
>   			pvmw.pte += nr - 1;
> _




* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
  2026-02-09 14:07 ` [PATCH v6 1/5] mm: rmap: support batched checks of the references " Baolin Wang
  2026-02-09 15:25   ` David Hildenbrand (Arm)
@ 2026-03-06 21:07   ` Barry Song
  2026-03-07  2:22     ` Baolin Wang
  1 sibling, 1 reply; 18+ messages in thread
From: Barry Song @ 2026-03-06 21:07 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel

On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> of the young flags and flushing TLB entries, thereby improving performance
> during large folio reclamation. And it will be overridden by the architecture
> that implements a more efficient batch operation in the following patches.
>
> While we are at it, rename ptep_clear_flush_young_notify() to
> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

LGTM,

Reviewed-by: Barry Song <baohua@kernel.org>

> ---
>  include/linux/mmu_notifier.h |  9 +++++----
>  include/linux/pgtable.h      | 35 +++++++++++++++++++++++++++++++++++
>  mm/rmap.c                    | 28 +++++++++++++++++++++++++---
>  3 files changed, 65 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d1094c2d5fb6..07a2bbaf86e9 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>         range->owner = owner;
>  }
>
> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep)                \
> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr)  \
>  ({                                                                     \
>         int __young;                                                    \
>         struct vm_area_struct *___vma = __vma;                          \
>         unsigned long ___address = __address;                           \
> -       __young = ptep_clear_flush_young(___vma, ___address, __ptep);   \
> +       unsigned int ___nr = __nr;                                      \
> +       __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr);    \
>         __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,        \
>                                                   ___address,           \
>                                                   ___address +          \
> -                                                       PAGE_SIZE);     \
> +                                                 ___nr * PAGE_SIZE);   \
>         __young;                                                        \
>  })
>
> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>
>  #define mmu_notifier_range_update_to_read_only(r) false
>
> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
> +#define clear_flush_young_ptes_notify clear_flush_young_ptes
>  #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
>  #define ptep_clear_young_notify ptep_test_and_clear_young
>  #define pmdp_clear_young_notify pmdp_test_and_clear_young
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 21b67d937555..a50df42a893f 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1068,6 +1068,41 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>  }
>  #endif
>
> +#ifndef clear_flush_young_ptes
> +/**
> + * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
> + *                         folio as old and flush the TLB.
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear access bit.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_clear_flush_young().
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> +               unsigned long addr, pte_t *ptep, unsigned int nr)
> +{
> +       int young = 0;
> +
> +       for (;;) {
> +               young |= ptep_clear_flush_young(vma, addr, ptep);
> +               if (--nr == 0)
> +                       break;
> +               ptep++;
> +               addr += PAGE_SIZE;
> +       }
> +
> +       return young;
> +}
> +#endif

We might have an opportunity to batch the TLB synchronization,
using flush_tlb_range() instead of calling flush_tlb_page()
one by one. Not sure the benefit would be significant though,
especially if only one entry among nr has the young bit set.
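
The trade-off Barry describes can be sketched in plain userspace C. This is a
simulation with counters, not kernel code: `flush_ops` stands in for the cost
of flush_tlb_page() versus a single flush_tlb_range() over the batch, and the
PTE bit layout is made up for illustration.

```c
#include <stdint.h>

#define PTE_YOUNG (1ULL << 1)

static int flush_ops;	/* number of simulated TLB flush operations */

/* Per-page flushing: one flush for every PTE that had the young bit set. */
static int clear_young_flush_per_page(uint64_t *ptes, unsigned int nr)
{
	int young = 0;

	for (unsigned int i = 0; i < nr; i++) {
		if (ptes[i] & PTE_YOUNG) {
			ptes[i] &= ~PTE_YOUNG;
			flush_ops++;	/* models flush_tlb_page() */
			young = 1;
		}
	}
	return young;
}

/* Range flushing: clear all young bits first, then flush the batch once. */
static int clear_young_flush_range(uint64_t *ptes, unsigned int nr)
{
	int young = 0;

	for (unsigned int i = 0; i < nr; i++) {
		if (ptes[i] & PTE_YOUNG) {
			ptes[i] &= ~PTE_YOUNG;
			young = 1;
		}
	}
	if (young)
		flush_ops++;	/* models one flush_tlb_range() */
	return young;
}
```

With three young entries out of four, the per-page variant issues three
flushes while the range variant issues one; with a single young entry both
issue one, matching the "not sure the benefit would be significant" caveat.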

Best Regards
Barry



* Re: [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2026-02-09 14:07 ` [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
  2026-02-09 15:30   ` David Hildenbrand (Arm)
@ 2026-03-06 21:20   ` Barry Song
  2026-03-07  2:14     ` Baolin Wang
  1 sibling, 1 reply; 18+ messages in thread
From: Barry Song @ 2026-03-06 21:20 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel

On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
> batched checking of young flags and TLB flushing, improving performance during
> large folio reclamation.
>
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
> from approximately 35% to around 5%.
>
> W/o patchset:
> real    0m1.518s
> user    0m0.000s
> sys     0m1.518s
>
> W/ patchset:
> real    0m1.018s
> user    0m0.000s
> sys     0m1.018s
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Reviewed-by: Barry Song <baohua@kernel.org>

> ---
>  arch/arm64/include/asm/pgtable.h | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 3dabf5ea17fa..a17eb8a76788 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>         return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
>  }
>
> +#define clear_flush_young_ptes clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> +                                        unsigned long addr, pte_t *ptep,
> +                                        unsigned int nr)
> +{
> +       if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
> +               return __ptep_clear_flush_young(vma, addr, ptep);
> +
> +       return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
> +}

A similar question arises here:

If nr = 4 for 16KB large folios and one of those entries is young,
we end up flushing the TLB for all 4 PTEs.

If all four entries are young, we win; if only one is young, it seems
we flush 3 redundant pages. But arm64 has TLB coalescing, so
maybe they occupy just one TLB entry?

> +
>  #define wrprotect_ptes wrprotect_ptes
>  static __always_inline void wrprotect_ptes(struct mm_struct *mm,
>                                 unsigned long addr, pte_t *ptep, unsigned int nr)
> --
> 2.47.3

Thanks
Barry



* Re: [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2026-03-06 21:20   ` Barry Song
@ 2026-03-07  2:14     ` Baolin Wang
  2026-03-07  7:41       ` Barry Song
  0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2026-03-07  2:14 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel



On 3/7/26 5:20 AM, Barry Song wrote:
> On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
>> batched checking of young flags and TLB flushing, improving performance during
>> large folio reclamation.
>>
>> Performance testing:
>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
>> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
>> from approximately 35% to around 5%.
>>
>> W/o patchset:
>> real    0m1.518s
>> user    0m0.000s
>> sys     0m1.518s
>>
>> W/ patchset:
>> real    0m1.018s
>> user    0m0.000s
>> sys     0m1.018s
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> Reviewed-by: Barry Song <baohua@kernel.org>

Thanks Barry. But this series has already been upstreamed, so I cannot 
add your Reviewed-by tag.

> 
>> ---
>>   arch/arm64/include/asm/pgtable.h | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 3dabf5ea17fa..a17eb8a76788 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>          return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
>>   }
>>
>> +#define clear_flush_young_ptes clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> +                                        unsigned long addr, pte_t *ptep,
>> +                                        unsigned int nr)
>> +{
>> +       if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
>> +               return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> +       return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
>> +}
> 
> A similar question arises here:
> 
> If nr = 4 for 16KB large folios and one of those entries is young,
> we end up flushing the TLB for all 4 PTEs.
> 
> If all four entries are young, we win; if only one is young, it seems
> we flush 3 redundant pages. But arm64 has TLB coalescing, so
> maybe they occupy just one TLB entry?

We discussed a similar issue in the previous thread [1], and I quote 
some comments from Ryan:

"
My concern was the opportunity cost of evicting the entries for all the
non-accessed parts of the folio from the TLB. But of course, I'm talking
nonsense because the architecture does not allow caching non-accessed 
entries in the TLB.
"

[1] 
https://lore.kernel.org/all/02239ca7-9701-4bfa-af0f-dcf0d05a3e89@linux.alibaba.com/




* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
  2026-03-06 21:07   ` Barry Song
@ 2026-03-07  2:22     ` Baolin Wang
  2026-03-07  8:02       ` Barry Song
  0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2026-03-07  2:22 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel



On 3/7/26 5:07 AM, Barry Song wrote:
> On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> Currently, folio_referenced_one() always checks the young flag for each PTE
>> sequentially, which is inefficient for large folios. This inefficiency is
>> especially noticeable when reclaiming clean file-backed large folios, where
>> folio_referenced() is observed as a significant performance hotspot.
>>
>> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
>> an optimization to clear the young flags for PTEs within a contiguous range.
>> However, this is not sufficient. We can extend this to perform batched operations
>> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>>
>> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
>> of the young flags and flushing TLB entries, thereby improving performance
>> during large folio reclamation. And it will be overridden by the architecture
>> that implements a more efficient batch operation in the following patches.
>>
>> While we are at it, rename ptep_clear_flush_young_notify() to
>> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
>>
>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> LGTM,
> 
> Reviewed-by: Barry Song <baohua@kernel.org>

Thanks.

>> ---
>>   include/linux/mmu_notifier.h |  9 +++++----
>>   include/linux/pgtable.h      | 35 +++++++++++++++++++++++++++++++++++
>>   mm/rmap.c                    | 28 +++++++++++++++++++++++++---
>>   3 files changed, 65 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d1094c2d5fb6..07a2bbaf86e9 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>>          range->owner = owner;
>>   }
>>
>> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep)                \
>> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr)  \
>>   ({                                                                     \
>>          int __young;                                                    \
>>          struct vm_area_struct *___vma = __vma;                          \
>>          unsigned long ___address = __address;                           \
>> -       __young = ptep_clear_flush_young(___vma, ___address, __ptep);   \
>> +       unsigned int ___nr = __nr;                                      \
>> +       __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr);    \
>>          __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,        \
>>                                                    ___address,           \
>>                                                    ___address +          \
>> -                                                       PAGE_SIZE);     \
>> +                                                 ___nr * PAGE_SIZE);   \
>>          __young;                                                        \
>>   })
>>
>> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
>>
>>   #define mmu_notifier_range_update_to_read_only(r) false
>>
>> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
>> +#define clear_flush_young_ptes_notify clear_flush_young_ptes
>>   #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
>>   #define ptep_clear_young_notify ptep_test_and_clear_young
>>   #define pmdp_clear_young_notify pmdp_test_and_clear_young
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 21b67d937555..a50df42a893f 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -1068,6 +1068,41 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>   }
>>   #endif
>>
>> +#ifndef clear_flush_young_ptes
>> +/**
>> + * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
>> + *                         folio as old and flush the TLB.
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to clear access bit.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_clear_flush_young().
>> + *
>> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
>> + * some PTEs might be write-protected.
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>> + */
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> +               unsigned long addr, pte_t *ptep, unsigned int nr)
>> +{
>> +       int young = 0;
>> +
>> +       for (;;) {
>> +               young |= ptep_clear_flush_young(vma, addr, ptep);
>> +               if (--nr == 0)
>> +                       break;
>> +               ptep++;
>> +               addr += PAGE_SIZE;
>> +       }
>> +
>> +       return young;
>> +}
>> +#endif
> 
> We might have an opportunity to batch the TLB synchronization,
> using flush_tlb_range() instead of calling flush_tlb_page()
> one by one. Not sure the benefit would be significant though,
> especially if only one entry among nr has the young bit set.

Yes. In addition, this would involve many architectures' implementations 
and their differing TLB flush mechanisms, so it's difficult to make a 
reasonable per-architecture measurement. If any architecture has a more 
efficient flush method, I'd prefer to implement an architecture-specific 
clear_flush_young_ptes().
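
The generic fallback quoted above is easy to model in plain userspace C,
which makes its OR-accumulation semantics concrete: the helper returns
whether *any* of the nr entries was young, clearing each accessed bit as
it goes. All names below are invented for this sketch; it is a model of
the logic, not kernel code.

```c
#include <stdbool.h>

/* Userspace stand-in for a PTE: only the accessed ("young") bit matters. */
struct fake_pte {
	bool young;
};

/* Test-and-clear the young bit of one entry; a real implementation
 * would also flush the corresponding TLB entry here. */
static int fake_clear_flush_young(struct fake_pte *pte)
{
	int was_young = pte->young;

	pte->young = false;
	return was_young;
}

/* Mirrors the loop structure of the generic clear_flush_young_ptes()
 * fallback: walk nr (>= 1) consecutive entries and OR the results. */
static int fake_clear_flush_young_ptes(struct fake_pte *ptep, unsigned int nr)
{
	int young = 0;

	for (;;) {
		young |= fake_clear_flush_young(ptep);
		if (--nr == 0)
			break;
		ptep++;
	}
	return young;
}
```

A second pass over the same entries then returns 0, since every accessed
bit was cleared on the first pass.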


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2026-03-07  2:14     ` Baolin Wang
@ 2026-03-07  7:41       ` Barry Song
  0 siblings, 0 replies; 18+ messages in thread
From: Barry Song @ 2026-03-07  7:41 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel

On Sat, Mar 7, 2026 at 10:14 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 3/7/26 5:20 AM, Barry Song wrote:
> > On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >>
> >> Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
> >> batched checking of young flags and TLB flushing, improving performance during
> >> large folio reclamation.
> >>
> >> Performance testing:
> >> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> >> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> >> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
> >> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
> >> from approximately 35% to around 5%.
> >>
> >> W/o patchset:
> >> real    0m1.518s
> >> user    0m0.000s
> >> sys     0m1.518s
> >>
> >> W/ patchset:
> >> real    0m1.018s
> >> user    0m0.000s
> >> sys     0m1.018s
> >>
> >> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >
> > Reviewed-by: Barry Song <baohua@kernel.org>
>
> Thanks Barry. But this series has already been upstreamed, so I cannot
> add your Reviewed-by tag.
>
> >
> >> ---
> >>   arch/arm64/include/asm/pgtable.h | 11 +++++++++++
> >>   1 file changed, 11 insertions(+)
> >>
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index 3dabf5ea17fa..a17eb8a76788 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> >>          return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
> >>   }
> >>
> >> +#define clear_flush_young_ptes clear_flush_young_ptes
> >> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> >> +                                        unsigned long addr, pte_t *ptep,
> >> +                                        unsigned int nr)
> >> +{
> >> +       if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
> >> +               return __ptep_clear_flush_young(vma, addr, ptep);
> >> +
> >> +       return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
> >> +}
> >
> > A similar question arises here:
> >
> > If nr = 4 for 16KB large folios and one of those entries is young,
> > we end up flushing the TLB for all 4 PTEs.
> >
> > If all four entries are young, we win; if only one is young, it seems
> > we flush 3 redundant pages. But arm64 has TLB coalescing, so
> > maybe they are covered by just one TLB entry?
>
> We discussed a similar issue in the previous thread [1], and I quote
> some comments from Ryan:
>
> "
> My concern was the opportunity cost of evicting the entries for all the
> non-accessed parts of the folio from the TLB. But of course, I'm talking
> nonsense because the architecture does not allow caching non-accessed
> entries in the TLB.
> "

You and Ryan are clearly smarter than me :-) Thinking about it
again, worrying about shooting down the TLBs of non-accessed
PTEs seems to be nonsense.

>
> [1]
> https://lore.kernel.org/all/02239ca7-9701-4bfa-af0f-dcf0d05a3e89@linux.alibaba.com/
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
  2026-03-07  2:22     ` Baolin Wang
@ 2026-03-07  8:02       ` Barry Song
  0 siblings, 0 replies; 18+ messages in thread
From: Barry Song @ 2026-03-07  8:02 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel

On Sat, Mar 7, 2026 at 10:22 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 3/7/26 5:07 AM, Barry Song wrote:
> > On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >>
> >> Currently, folio_referenced_one() always checks the young flag for each PTE
> >> sequentially, which is inefficient for large folios. This inefficiency is
> >> especially noticeable when reclaiming clean file-backed large folios, where
> >> folio_referenced() is observed as a significant performance hotspot.
> >>
> >> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
> >> an optimization to clear the young flags for PTEs within a contiguous range.
> >> However, this is not sufficient. We can extend this to perform batched operations
> >> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
> >>
> >> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> >> of the young flags and flushing TLB entries, thereby improving performance
> >> during large folio reclamation. And it will be overridden by the architecture
> >> that implements a more efficient batch operation in the following patches.
> >>
> >> While we are at it, rename ptep_clear_flush_young_notify() to
> >> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
> >>
> >> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> >> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >
> > LGTM,
> >
> > Reviewed-by: Barry Song <baohua@kernel.org>
>
> Thanks.
>
> >> ---
> >>   include/linux/mmu_notifier.h |  9 +++++----
> >>   include/linux/pgtable.h      | 35 +++++++++++++++++++++++++++++++++++
> >>   mm/rmap.c                    | 28 +++++++++++++++++++++++++---
> >>   3 files changed, 65 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> >> index d1094c2d5fb6..07a2bbaf86e9 100644
> >> --- a/include/linux/mmu_notifier.h
> >> +++ b/include/linux/mmu_notifier.h
> >> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
> >>          range->owner = owner;
> >>   }
> >>
> >> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep)                \
> >> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr)  \
> >>   ({                                                                     \
> >>          int __young;                                                    \
> >>          struct vm_area_struct *___vma = __vma;                          \
> >>          unsigned long ___address = __address;                           \
> >> -       __young = ptep_clear_flush_young(___vma, ___address, __ptep);   \
> >> +       unsigned int ___nr = __nr;                                      \
> >> +       __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr);    \
> >>          __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,        \
> >>                                                    ___address,           \
> >>                                                    ___address +          \
> >> -                                                       PAGE_SIZE);     \
> >> +                                                 ___nr * PAGE_SIZE);   \
> >>          __young;                                                        \
> >>   })
> >>
> >> @@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
> >>
> >>   #define mmu_notifier_range_update_to_read_only(r) false
> >>
> >> -#define ptep_clear_flush_young_notify ptep_clear_flush_young
> >> +#define clear_flush_young_ptes_notify clear_flush_young_ptes
> >>   #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
> >>   #define ptep_clear_young_notify ptep_test_and_clear_young
> >>   #define pmdp_clear_young_notify pmdp_test_and_clear_young
> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >> index 21b67d937555..a50df42a893f 100644
> >> --- a/include/linux/pgtable.h
> >> +++ b/include/linux/pgtable.h
> >> @@ -1068,6 +1068,41 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> >>   }
> >>   #endif
> >>
> >> +#ifndef clear_flush_young_ptes
> >> +/**
> >> + * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
> >> + *                         folio as old and flush the TLB.
> >> + * @vma: The virtual memory area the pages are mapped into.
> >> + * @addr: Address the first page is mapped at.
> >> + * @ptep: Page table pointer for the first entry.
> >> + * @nr: Number of entries to clear access bit.
> >> + *
> >> + * May be overridden by the architecture; otherwise, implemented as a simple
> >> + * loop over ptep_clear_flush_young().
> >> + *
> >> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> >> + * some PTEs might be write-protected.
> >> + *
> >> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> >> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> >> + */
> >> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> >> +               unsigned long addr, pte_t *ptep, unsigned int nr)
> >> +{
> >> +       int young = 0;
> >> +
> >> +       for (;;) {
> >> +               young |= ptep_clear_flush_young(vma, addr, ptep);
> >> +               if (--nr == 0)
> >> +                       break;
> >> +               ptep++;
> >> +               addr += PAGE_SIZE;
> >> +       }
> >> +
> >> +       return young;
> >> +}
> >> +#endif
> >
> > We might have an opportunity to batch the TLB synchronization,
> > using flush_tlb_range() instead of calling flush_tlb_page()
> > one by one. Not sure the benefit would be significant though,
> > especially if only one entry among nr has the young bit set.
>
> Yes. In addition, this would involve many architectures' implementations
> and their differing TLB flush mechanisms, so it's difficult to make a
> reasonable per-architecture measurement. If any architecture has a more
> efficient flush method, I'd prefer to implement an architecture-specific
> clear_flush_young_ptes().

Right! Since TLBI is usually quite expensive, I wonder if a generic
implementation for architectures lacking clear_flush_young_ptes()
might benefit from something like the below (just a very rough idea):

int clear_flush_young_ptes(struct vm_area_struct *vma,
                unsigned long addr, pte_t *ptep, unsigned int nr)
{
        unsigned long curr_addr = addr;
        int young = 0;

        while (nr--) {
                young |= ptep_test_and_clear_young(vma, curr_addr, ptep);
                ptep++;
                curr_addr += PAGE_SIZE;
        }

        if (young)
                flush_tlb_range(vma, addr, curr_addr);
        return young;
}

Thanks
Barry
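
Barry's rough idea can be modeled the same way in userspace to show what
it buys: the young bits of all nr entries are cleared first, and at most
one range flush is issued afterwards, instead of one flush per young
entry. The names below are invented for illustration only; this is a
sketch of the intent, not kernel code.

```c
#include <stdbool.h>

struct fake_pte {
	bool young;
};

/* Counts how many (simulated) TLB range flushes were issued. */
static int flush_count;

static void fake_flush_tlb_range(void)
{
	flush_count++;
}

/* Deferred-flush variant: clear all young bits first (the
 * ptep_test_and_clear_young() analogue), then flush once if anything
 * was young, mirroring the single flush_tlb_range() call in the sketch. */
static int deferred_clear_flush_young_ptes(struct fake_pte *ptep,
					   unsigned int nr)
{
	int young = 0;

	while (nr--) {
		young |= ptep->young;
		ptep->young = false;
		ptep++;
	}
	if (young)
		fake_flush_tlb_range();
	return young;
}
```

With four entries of which two are young, this issues a single flush,
whereas a per-entry ptep_clear_flush_young() loop would have flushed
once for each young entry.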


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2026-03-07  8:03 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-09 14:07 [PATCH v6 0/5] support batch checking of references and unmapping for large folios Baolin Wang
2026-02-09 14:07 ` [PATCH v6 1/5] mm: rmap: support batched checks of the references " Baolin Wang
2026-02-09 15:25   ` David Hildenbrand (Arm)
2026-03-06 21:07   ` Barry Song
2026-03-07  2:22     ` Baolin Wang
2026-03-07  8:02       ` Barry Song
2026-02-09 14:07 ` [PATCH v6 2/5] arm64: mm: factor out the address and ptep alignment into a new helper Baolin Wang
2026-02-09 14:07 ` [PATCH v6 3/5] arm64: mm: support batch clearing of the young flag for large folios Baolin Wang
2026-02-09 14:07 ` [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
2026-02-09 15:30   ` David Hildenbrand (Arm)
2026-02-10  0:39     ` Baolin Wang
2026-03-06 21:20   ` Barry Song
2026-03-07  2:14     ` Baolin Wang
2026-03-07  7:41       ` Barry Song
2026-02-09 14:07 ` [PATCH v6 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
2026-02-09 15:31   ` David Hildenbrand (Arm)
2026-02-10  1:53 ` [PATCH v6 0/5] support batch checking of references and unmapping for " Andrew Morton
2026-02-10  2:01   ` Baolin Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox