* [PATCH v5 0/5] support batch checking of references and unmapping for large folios
@ 2025-12-26  6:07 Baolin Wang
  2025-12-26  6:07 ` [PATCH v5 1/5] mm: rmap: support batched checks of the references " Baolin Wang
                   ` (6 more replies)
  0 siblings, 7 replies; 52+ messages in thread
From: Baolin Wang @ 2025-12-26  6:07 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.

Moreover, the Arm64 architecture, which supports contiguous PTEs, already has an
optimization to clear the young flags for PTEs within a contiguous range. However,
this is not sufficient: we can extend it to perform batched operations for the
entire large folio, which might exceed the contiguous range (CONT_PTE_SIZE; with
4K pages a contpte block covers only 16 PTEs, whereas a 2M folio spans 512 PTEs).

Similar to folio_referenced_one(), we can also apply batched unmapping for large
file folios to optimize the performance of file folio reclamation. By supporting
batched checking of the young flags, batched TLB flushing, and batched unmapping,
I observed significant performance improvements in my tests of file folio
reclamation. Please check the performance data in the commit message of each
patch.
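
In short, the change to folio_referenced_one() looks roughly like the sketch
below (a simplified view of the hunks in patch 1, with declarations and the
max_nr computation omitted):

	/* before: one PTE per page_vma_mapped_walk() iteration */
	if (ptep_clear_flush_young_notify(vma, address, pvmw.pte))
		referenced++;

	/* after: batch all PTEs of the large folio mapped in this page table */
	nr = folio_pte_batch(folio, pvmw.pte, ptep_get(pvmw.pte), max_nr);
	if (clear_flush_young_ptes_notify(vma, address, pvmw.pte, nr))
		referenced++;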

I ran stress-ng and the mm selftests; no issues were found.

Patch 1: Add a new generic batched PTE helper that supports batched checks of
the references for large folios.
Patches 2 - 3: Preparation patches.
Patch 4: Implement the Arm64 arch-specific clear_flush_young_ptes().
Patch 5: Support batched unmapping for file large folios.

Changes from v4:
 - Fix passing the incorrect 'CONT_PTES' for non-batched APIs.
 - Rename ptep_clear_flush_young_notify() to clear_flush_young_ptes_notify() (per Ryan).
 - Fix some coding style issues (per Ryan).
 - Add reviewed tag from Ryan. Thanks.

Changes from v3:
 - Fix using an incorrect parameter in ptep_clear_flush_young_notify()
   (per Liam).

Changes from v2:
 - Rearrange the patch set (per Ryan).
 - Add pte_cont() check in clear_flush_young_ptes() (per Ryan).
 - Add a helper to do contpte block alignment (per Ryan).
 - Fix some coding style issues (per Lorenzo and Ryan).
 - Add more comments and update the commit message (per Lorenzo and Ryan).
 - Add acked tag from Barry. Thanks. 

Changes from v1:
 - Add a new patch to support batched unmapping for file large folios.
 - Update the cover letter

Baolin Wang (5):
  mm: rmap: support batched checks of the references for large folios
  arm64: mm: factor out the address and ptep alignment into a new helper
  arm64: mm: support batch clearing of the young flag for large folios
  arm64: mm: implement the architecture-specific
    clear_flush_young_ptes()
  mm: rmap: support batched unmapping for file large folios

 arch/arm64/include/asm/pgtable.h | 23 ++++++++----
 arch/arm64/mm/contpte.c          | 62 ++++++++++++++++++++------------
 include/linux/mmu_notifier.h     |  9 ++---
 include/linux/pgtable.h          | 31 ++++++++++++++++
 mm/rmap.c                        | 38 ++++++++++++++++----
 5 files changed, 125 insertions(+), 38 deletions(-)

-- 
2.47.3




* [PATCH v5 1/5] mm: rmap: support batched checks of the references for large folios
  2025-12-26  6:07 [PATCH v5 0/5] support batch checking of references and unmapping for large folios Baolin Wang
@ 2025-12-26  6:07 ` Baolin Wang
  2026-01-07  6:01   ` Harry Yoo
  2026-02-09  8:49   ` David Hildenbrand (Arm)
  2025-12-26  6:07 ` [PATCH v5 2/5] arm64: mm: factor out the address and ptep alignment into a new helper Baolin Wang
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 52+ messages in thread
From: Baolin Wang @ 2025-12-26  6:07 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.

Moreover, on the Arm64 architecture, which supports contiguous PTEs, there is
already an optimization to clear the young flags for PTEs within a contiguous
range. However, this is not sufficient: we can extend it to perform batched
operations for the entire large folio (which might exceed the contiguous range,
CONT_PTE_SIZE).

Introduce a new API, clear_flush_young_ptes(), to facilitate batched checking
of the young flags and batched TLB flushing, thereby improving performance
during large folio reclamation. It can be overridden by architectures that
implement a more efficient batch operation, as done for arm64 in the following
patches.

While we are at it, rename ptep_clear_flush_young_notify() to
clear_flush_young_ptes_notify() to indicate that this is a batch operation.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/mmu_notifier.h |  9 +++++----
 include/linux/pgtable.h      | 31 +++++++++++++++++++++++++++++++
 mm/rmap.c                    | 31 ++++++++++++++++++++++++++++---
 3 files changed, 64 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d1094c2d5fb6..07a2bbaf86e9 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
 	range->owner = owner;
 }
 
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
+#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr)	\
 ({									\
 	int __young;							\
 	struct vm_area_struct *___vma = __vma;				\
 	unsigned long ___address = __address;				\
-	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
+	unsigned int ___nr = __nr;					\
+	__young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr);	\
 	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
 						  ___address,		\
 						  ___address +		\
-							PAGE_SIZE);	\
+						  ___nr * PAGE_SIZE);	\
 	__young;							\
 })
 
@@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
 
 #define mmu_notifier_range_update_to_read_only(r) false
 
-#define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define clear_flush_young_ptes_notify clear_flush_young_ptes
 #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
 #define ptep_clear_young_notify ptep_test_and_clear_young
 #define pmdp_clear_young_notify pmdp_test_and_clear_young
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2f0dd3a4ace1..eb8aacba3698 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1087,6 +1087,37 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
 }
 #endif
 
+#ifndef clear_flush_young_ptes
+/**
+ * clear_flush_young_ptes - Clear the access bit and perform a TLB flush for PTEs
+ *			    that map consecutive pages of the same folio.
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear access bit.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_clear_flush_young().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t *ptep,
+					 unsigned int nr)
+{
+	int i, young = 0;
+
+	for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE)
+		young |= ptep_clear_flush_young(vma, addr, ptep);
+
+	return young;
+}
+#endif
+
 /*
  * On some architectures hardware does not set page access bit when accessing
  * memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/rmap.c b/mm/rmap.c
index e805ddc5a27b..985ab0b085ba 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -828,9 +828,11 @@ static bool folio_referenced_one(struct folio *folio,
 	struct folio_referenced_arg *pra = arg;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	int ptes = 0, referenced = 0;
+	unsigned int nr;
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		address = pvmw.address;
+		nr = 1;
 
 		if (vma->vm_flags & VM_LOCKED) {
 			ptes++;
@@ -875,9 +877,24 @@ static bool folio_referenced_one(struct folio *folio,
 			if (lru_gen_look_around(&pvmw))
 				referenced++;
 		} else if (pvmw.pte) {
-			if (ptep_clear_flush_young_notify(vma, address,
-						pvmw.pte))
+			if (folio_test_large(folio)) {
+				unsigned long end_addr =
+					pmd_addr_end(address, vma->vm_end);
+				unsigned int max_nr =
+					(end_addr - address) >> PAGE_SHIFT;
+				pte_t pteval = ptep_get(pvmw.pte);
+
+				nr = folio_pte_batch(folio, pvmw.pte,
+						     pteval, max_nr);
+			}
+
+			ptes += nr;
+			if (clear_flush_young_ptes_notify(vma, address,
+						pvmw.pte, nr))
 				referenced++;
+			/* Skip the batched PTEs */
+			pvmw.pte += nr - 1;
+			pvmw.address += (nr - 1) * PAGE_SIZE;
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
 			if (pmdp_clear_flush_young_notify(vma, address,
 						pvmw.pmd))
@@ -887,7 +904,15 @@ static bool folio_referenced_one(struct folio *folio,
 			WARN_ON_ONCE(1);
 		}
 
-		pra->mapcount--;
+		pra->mapcount -= nr;
+		/*
+		 * If we are sure that we batched the entire folio,
+		 * we can just optimize and stop right here.
+		 */
+		if (ptes == pvmw.nr_pages) {
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
 	}
 
 	if (referenced)
-- 
2.47.3




* [PATCH v5 2/5] arm64: mm: factor out the address and ptep alignment into a new helper
  2025-12-26  6:07 [PATCH v5 0/5] support batch checking of references and unmapping for large folios Baolin Wang
  2025-12-26  6:07 ` [PATCH v5 1/5] mm: rmap: support batched checks of the references " Baolin Wang
@ 2025-12-26  6:07 ` Baolin Wang
  2026-02-09  8:50   ` David Hildenbrand (Arm)
  2025-12-26  6:07 ` [PATCH v5 3/5] arm64: mm: support batch clearing of the young flag for large folios Baolin Wang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2025-12-26  6:07 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Factor out the contpte block's address and ptep alignment into a new helper,
which will be reused in the following patch.

No functional changes.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 arch/arm64/mm/contpte.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 589bcf878938..e4ddeb46f25d 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -26,6 +26,26 @@ static inline pte_t *contpte_align_down(pte_t *ptep)
 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
 }
 
+static inline pte_t *contpte_align_addr_ptep(unsigned long *start,
+					     unsigned long *end, pte_t *ptep,
+					     unsigned int nr)
+{
+	/*
+	 * Note: caller must ensure these nr PTEs are consecutive (present)
+	 * PTEs that map consecutive pages of the same large folio within a
+	 * single VMA and a single page table.
+	 */
+	if (pte_cont(__ptep_get(ptep + nr - 1)))
+		*end = ALIGN(*end, CONT_PTE_SIZE);
+
+	if (pte_cont(__ptep_get(ptep))) {
+		*start = ALIGN_DOWN(*start, CONT_PTE_SIZE);
+		ptep = contpte_align_down(ptep);
+	}
+
+	return ptep;
+}
+
 static void contpte_try_unfold_partial(struct mm_struct *mm, unsigned long addr,
 					pte_t *ptep, unsigned int nr)
 {
@@ -569,14 +589,7 @@ void contpte_clear_young_dirty_ptes(struct vm_area_struct *vma,
 	unsigned long start = addr;
 	unsigned long end = start + nr * PAGE_SIZE;
 
-	if (pte_cont(__ptep_get(ptep + nr - 1)))
-		end = ALIGN(end, CONT_PTE_SIZE);
-
-	if (pte_cont(__ptep_get(ptep))) {
-		start = ALIGN_DOWN(start, CONT_PTE_SIZE);
-		ptep = contpte_align_down(ptep);
-	}
-
+	ptep = contpte_align_addr_ptep(&start, &end, ptep, nr);
 	__clear_young_dirty_ptes(vma, start, ptep, (end - start) / PAGE_SIZE, flags);
 }
 EXPORT_SYMBOL_GPL(contpte_clear_young_dirty_ptes);
-- 
2.47.3




* [PATCH v5 3/5] arm64: mm: support batch clearing of the young flag for large folios
  2025-12-26  6:07 [PATCH v5 0/5] support batch checking of references and unmapping for large folios Baolin Wang
  2025-12-26  6:07 ` [PATCH v5 1/5] mm: rmap: support batched checks of the references " Baolin Wang
  2025-12-26  6:07 ` [PATCH v5 2/5] arm64: mm: factor out the address and ptep alignment into a new helper Baolin Wang
@ 2025-12-26  6:07 ` Baolin Wang
  2026-01-02 12:21   ` Ryan Roberts
  2026-02-09  9:02   ` David Hildenbrand (Arm)
  2025-12-26  6:07 ` [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 52+ messages in thread
From: Baolin Wang @ 2025-12-26  6:07 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
only clear the young flag and flush TLBs for PTEs within the contiguous range.
To support batched PTE operations for large folios of other sizes in the following
patches, add a new parameter to specify the number of PTEs that map consecutive
pages of the same large folio in a single VMA and a single page table.

While we are at it, rename the functions to maintain consistency with other
contpte_*() functions.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 arch/arm64/include/asm/pgtable.h | 12 ++++++------
 arch/arm64/mm/contpte.c          | 33 ++++++++++++++++++--------------
 2 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 445e18e92221..5e9ff16146c3 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1648,10 +1648,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
 extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep,
 				unsigned int nr, int full);
-extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
-				unsigned long addr, pte_t *ptep);
-extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
-				unsigned long addr, pte_t *ptep);
+int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep, unsigned int nr);
+int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep, unsigned int nr);
 extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, unsigned int nr);
 extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
@@ -1823,7 +1823,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	if (likely(!pte_valid_cont(orig_pte)))
 		return __ptep_test_and_clear_young(vma, addr, ptep);
 
-	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
+	return contpte_test_and_clear_young_ptes(vma, addr, ptep, 1);
 }
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
@@ -1835,7 +1835,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	if (likely(!pte_valid_cont(orig_pte)))
 		return __ptep_clear_flush_young(vma, addr, ptep);
 
-	return contpte_ptep_clear_flush_young(vma, addr, ptep);
+	return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
 }
 
 #define wrprotect_ptes wrprotect_ptes
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index e4ddeb46f25d..b929a455103f 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -508,8 +508,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
 
-int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
-					unsigned long addr, pte_t *ptep)
+int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep,
+					unsigned int nr)
 {
 	/*
 	 * ptep_clear_flush_young() technically requires us to clear the access
@@ -518,41 +519,45 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 	 * contig range when the range is covered by a single folio, we can get
 	 * away with clearing young for the whole contig range here, so we avoid
 	 * having to unfold.
+	 *
+	 * The 'nr' means consecutive (present) PTEs that map consecutive pages
+	 * of the same large folio in a single VMA and a single page table.
 	 */
 
+	unsigned long end = addr + nr * PAGE_SIZE;
 	int young = 0;
-	int i;
 
-	ptep = contpte_align_down(ptep);
-	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
-
-	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
+	ptep = contpte_align_addr_ptep(&addr, &end, ptep, nr);
+	for (; addr != end; ptep++, addr += PAGE_SIZE)
 		young |= __ptep_test_and_clear_young(vma, addr, ptep);
 
 	return young;
 }
-EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
+EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
 
-int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
-					unsigned long addr, pte_t *ptep)
+int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep,
+				unsigned int nr)
 {
 	int young;
 
-	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
+	young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
 
 	if (young) {
+		unsigned long end = addr + nr * PAGE_SIZE;
+
+		contpte_align_addr_ptep(&addr, &end, ptep, nr);
 		/*
 		 * See comment in __ptep_clear_flush_young(); same rationale for
 		 * eliding the trailing DSB applies here.
 		 */
-		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
-		__flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
+		__flush_tlb_range_nosync(vma->vm_mm, addr, end,
 					 PAGE_SIZE, true, 3);
 	}
 
 	return young;
 }
-EXPORT_SYMBOL_GPL(contpte_ptep_clear_flush_young);
+EXPORT_SYMBOL_GPL(contpte_clear_flush_young_ptes);
 
 void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
 					pte_t *ptep, unsigned int nr)
-- 
2.47.3




* [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2025-12-26  6:07 [PATCH v5 0/5] support batch checking of references and unmapping for large folios Baolin Wang
                   ` (2 preceding siblings ...)
  2025-12-26  6:07 ` [PATCH v5 3/5] arm64: mm: support batch clearing of the young flag for large folios Baolin Wang
@ 2025-12-26  6:07 ` Baolin Wang
  2026-01-28 11:47   ` Chris Mason
  2025-12-26  6:07 ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2025-12-26  6:07 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
batched checking of young flags and TLB flushing, improving performance during
large folio reclamation.

Performance testing:
Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
reclaim 8G of file-backed folios via the memory.reclaim interface. I observed a
33% performance improvement on my Arm64 32-core server (and a 10%+ improvement
on my x86 machine). Meanwhile, the folio_check_references() hotspot dropped
from approximately 35% to around 5%.

W/o patchset:
real	0m1.518s
user	0m0.000s
sys	0m1.518s

W/ patchset:
real	0m1.018s
user	0m0.000s
sys	0m1.018s
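
For reference, the test workload is roughly the following (a minimal userspace
sketch; the file path, cgroup name, fixed 4K stride and missing error handling
are assumptions rather than part of this patch, and the task is assumed to
already run inside the 'test' memory cgroup):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t size = 10UL << 30;	/* 10G file, assumed to pre-exist */
		int fd = open("/mnt/test/file-10g", O_RDONLY);
		char *p = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
		volatile char sum = 0;
		size_t off;

		/* fault everything in -> ~10G of clean file-backed page cache */
		for (off = 0; off < size; off += 4096)
			sum += p[off];

		/* trigger proactive reclaim of 8G via cgroup v2 memory.reclaim */
		fd = open("/sys/fs/cgroup/test/memory.reclaim", O_WRONLY);
		write(fd, "8G", 2);
		return 0;
	}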

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 arch/arm64/include/asm/pgtable.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 5e9ff16146c3..aa8f642f1260 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
 }
 
+#define clear_flush_young_ptes clear_flush_young_ptes
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t *ptep,
+					 unsigned int nr)
+{
+	if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
+		return __ptep_clear_flush_young(vma, addr, ptep);
+
+	return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
+}
+
 #define wrprotect_ptes wrprotect_ptes
 static __always_inline void wrprotect_ptes(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep, unsigned int nr)
-- 
2.47.3




* [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2025-12-26  6:07 [PATCH v5 0/5] support batch checking of references and unmapping for large folios Baolin Wang
                   ` (3 preceding siblings ...)
  2025-12-26  6:07 ` [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
@ 2025-12-26  6:07 ` Baolin Wang
  2026-01-06 13:22   ` Wei Yang
                     ` (4 more replies)
  2026-01-16  8:41 ` [PATCH v5 0/5] support batch checking of references and unmapping for " Lorenzo Stoakes
  2026-01-16 10:52 ` David Hildenbrand (Red Hat)
  6 siblings, 5 replies; 52+ messages in thread
From: Baolin Wang @ 2025-12-26  6:07 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Similar to folio_referenced_one(), we can apply batched unmapping for file
large folios to optimize the performance of file folios reclamation.

Barry previously implemented batched unmapping for lazyfree anonymous large
folios[1] and did not further optimize anonymous large folios or file-backed
large folios at that stage. As for file-backed large folios, the batched
unmapping support is relatively straightforward, as we only need to clear
the consecutive (present) PTE entries for file-backed large folios.

Performance testing:
Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
75% performance improvement on my Arm64 32-core server (and 50%+ improvement
on my X86 machine) with this patch.

W/o patch:
real    0m1.018s
user    0m0.000s
sys     0m1.018s

W/ patch:
real	0m0.249s
user	0m0.000s
sys	0m0.249s

[1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Barry Song <baohua@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/rmap.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 985ab0b085ba..e1d16003c514 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 	end_addr = pmd_addr_end(addr, vma->vm_end);
 	max_nr = (end_addr - addr) >> PAGE_SHIFT;
 
-	/* We only support lazyfree batching for now ... */
-	if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
+	/* We only support lazyfree or file folios batching for now ... */
+	if (folio_test_anon(folio) && folio_test_swapbacked(folio))
 		return 1;
+
 	if (pte_unused(pte))
 		return 1;
 
@@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 *
 			 * See Documentation/mm/mmu_notifier.rst
 			 */
-			dec_mm_counter(mm, mm_counter_file(folio));
+			add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
 		}
 discard:
 		if (unlikely(folio_test_hugetlb(folio))) {
-- 
2.47.3




* Re: [PATCH v5 3/5] arm64: mm: support batch clearing of the young flag for large folios
  2025-12-26  6:07 ` [PATCH v5 3/5] arm64: mm: support batch clearing of the young flag for large folios Baolin Wang
@ 2026-01-02 12:21   ` Ryan Roberts
  2026-02-09  9:02   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 52+ messages in thread
From: Ryan Roberts @ 2026-01-02 12:21 UTC (permalink / raw)
  To: Baolin Wang, akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	riel, harry.yoo, jannh, willy, baohua, dev.jain, linux-mm,
	linux-arm-kernel, linux-kernel

On 26/12/2025 06:07, Baolin Wang wrote:
> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
> only clear the young flag and flush TLBs for PTEs within the contiguous range.
> To support batched PTE operations for large folios of other sizes in the following
> patches, add a new parameter to specify the number of PTEs that map consecutive
> pages of the same large folio in a single VMA and a single page table.
> 
> While we are at it, rename the functions to maintain consistency with other
> contpte_*() functions.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  arch/arm64/include/asm/pgtable.h | 12 ++++++------
>  arch/arm64/mm/contpte.c          | 33 ++++++++++++++++++--------------
>  2 files changed, 25 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 445e18e92221..5e9ff16146c3 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1648,10 +1648,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>  extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep,
>  				unsigned int nr, int full);
> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> -				unsigned long addr, pte_t *ptep);
> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> -				unsigned long addr, pte_t *ptep);
> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep, unsigned int nr);
> +int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep, unsigned int nr);
>  extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>  				pte_t *ptep, unsigned int nr);
>  extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> @@ -1823,7 +1823,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  	if (likely(!pte_valid_cont(orig_pte)))
>  		return __ptep_test_and_clear_young(vma, addr, ptep);
>  
> -	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +	return contpte_test_and_clear_young_ptes(vma, addr, ptep, 1);
>  }
>  
>  #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> @@ -1835,7 +1835,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>  	if (likely(!pte_valid_cont(orig_pte)))
>  		return __ptep_clear_flush_young(vma, addr, ptep);
>  
> -	return contpte_ptep_clear_flush_young(vma, addr, ptep);
> +	return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
>  }
>  
>  #define wrprotect_ptes wrprotect_ptes
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index e4ddeb46f25d..b929a455103f 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -508,8 +508,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>  }
>  EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>  
> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> -					unsigned long addr, pte_t *ptep)
> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep,
> +					unsigned int nr)
>  {
>  	/*
>  	 * ptep_clear_flush_young() technically requires us to clear the access
> @@ -518,41 +519,45 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>  	 * contig range when the range is covered by a single folio, we can get
>  	 * away with clearing young for the whole contig range here, so we avoid
>  	 * having to unfold.
> +	 *
> +	 * The 'nr' means consecutive (present) PTEs that map consecutive pages
> +	 * of the same large folio in a single VMA and a single page table.
>  	 */
>  
> +	unsigned long end = addr + nr * PAGE_SIZE;
>  	int young = 0;
> -	int i;
>  
> -	ptep = contpte_align_down(ptep);
> -	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> -
> -	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> +	ptep = contpte_align_addr_ptep(&addr, &end, ptep, nr);
> +	for (; addr != end; ptep++, addr += PAGE_SIZE)
>  		young |= __ptep_test_and_clear_young(vma, addr, ptep);
>  
>  	return young;
>  }
> -EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
> +EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
>  
> -int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> -					unsigned long addr, pte_t *ptep)
> +int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep,
> +				unsigned int nr)
>  {
>  	int young;
>  
> -	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +	young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
>  
>  	if (young) {
> +		unsigned long end = addr + nr * PAGE_SIZE;
> +
> +		contpte_align_addr_ptep(&addr, &end, ptep, nr);
>  		/*
>  		 * See comment in __ptep_clear_flush_young(); same rationale for
>  		 * eliding the trailing DSB applies here.
>  		 */
> -		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> -		__flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
> +		__flush_tlb_range_nosync(vma->vm_mm, addr, end,
>  					 PAGE_SIZE, true, 3);
>  	}
>  
>  	return young;
>  }
> -EXPORT_SYMBOL_GPL(contpte_ptep_clear_flush_young);
> +EXPORT_SYMBOL_GPL(contpte_clear_flush_young_ptes);
>  
>  void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>  					pte_t *ptep, unsigned int nr)




* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2025-12-26  6:07 ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
@ 2026-01-06 13:22   ` Wei Yang
  2026-01-06 21:29     ` Barry Song
  2026-01-07  6:54   ` Harry Yoo
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 52+ messages in thread
From: Wei Yang @ 2026-01-06 13:22 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, baohua, dev.jain, linux-mm,
	linux-arm-kernel, linux-kernel

On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>Similar to folio_referenced_one(), we can apply batched unmapping for file
>large folios to optimize the performance of file folios reclamation.
>
>Barry previously implemented batched unmapping for lazyfree anonymous large
>folios[1] and did not further optimize anonymous large folios or file-backed
>large folios at that stage. As for file-backed large folios, the batched
>unmapping support is relatively straightforward, as we only need to clear
>the consecutive (present) PTE entries for file-backed large folios.
>
>Performance testing:
>Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>on my X86 machine) with this patch.
>
>W/o patch:
>real    0m1.018s
>user    0m0.000s
>sys     0m1.018s
>
>W/ patch:
>real	0m0.249s
>user	0m0.000s
>sys	0m0.249s
>
>[1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>Acked-by: Barry Song <baohua@kernel.org>
>Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>---
> mm/rmap.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
>diff --git a/mm/rmap.c b/mm/rmap.c
>index 985ab0b085ba..e1d16003c514 100644
>--- a/mm/rmap.c
>+++ b/mm/rmap.c
>@@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> 	end_addr = pmd_addr_end(addr, vma->vm_end);
> 	max_nr = (end_addr - addr) >> PAGE_SHIFT;
> 
>-	/* We only support lazyfree batching for now ... */
>-	if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>+	/* We only support lazyfree or file folios batching for now ... */
>+	if (folio_test_anon(folio) && folio_test_swapbacked(folio))
> 		return 1;
>+
> 	if (pte_unused(pte))
> 		return 1;
> 
>@@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> 			 *
> 			 * See Documentation/mm/mmu_notifier.rst
> 			 */
>-			dec_mm_counter(mm, mm_counter_file(folio));
>+			add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
> 		}
> discard:
> 		if (unlikely(folio_test_hugetlb(folio))) {
>-- 
>2.47.3
>

Hi, Baolin

When reading your patch, I come up one small question.

Current try_to_unmap_one() has following structure:

    try_to_unmap_one()
        while (page_vma_mapped_walk(&pvmw)) {
            nr_pages = folio_unmap_pte_batch()

            if (nr_pages = folio_nr_pages(folio))
                goto walk_done;
        }

I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().

If my understanding is correct, page_vma_mapped_walk() would start from
(pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
(pvmw->address + nr_pages * PAGE_SIZE), right?

Not sure my understanding is correct, if so do we have some reason not to
skip the cleared range?

-- 
Wei Yang
Help you, Help me



* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-06 13:22   ` Wei Yang
@ 2026-01-06 21:29     ` Barry Song
  2026-01-07  1:46       ` Wei Yang
  0 siblings, 1 reply; 52+ messages in thread
From: Barry Song @ 2026-01-06 21:29 UTC (permalink / raw)
  To: Wei Yang
  Cc: Baolin Wang, akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel

On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
> >Similar to folio_referenced_one(), we can apply batched unmapping for file
> >large folios to optimize the performance of file folios reclamation.
> >
> >Barry previously implemented batched unmapping for lazyfree anonymous large
> >folios[1] and did not further optimize anonymous large folios or file-backed
> >large folios at that stage. As for file-backed large folios, the batched
> >unmapping support is relatively straightforward, as we only need to clear
> >the consecutive (present) PTE entries for file-backed large folios.
> >
> >Performance testing:
> >Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> >reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> >75% performance improvement on my Arm64 32-core server (and 50%+ improvement
> >on my X86 machine) with this patch.
> >
> >W/o patch:
> >real    0m1.018s
> >user    0m0.000s
> >sys     0m1.018s
> >
> >W/ patch:
> >real   0m0.249s
> >user   0m0.000s
> >sys    0m0.249s
> >
> >[1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
> >Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >Acked-by: Barry Song <baohua@kernel.org>
> >Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >---
> > mm/rmap.c | 7 ++++---
> > 1 file changed, 4 insertions(+), 3 deletions(-)
> >
> >diff --git a/mm/rmap.c b/mm/rmap.c
> >index 985ab0b085ba..e1d16003c514 100644
> >--- a/mm/rmap.c
> >+++ b/mm/rmap.c
> >@@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> >       end_addr = pmd_addr_end(addr, vma->vm_end);
> >       max_nr = (end_addr - addr) >> PAGE_SHIFT;
> >
> >-      /* We only support lazyfree batching for now ... */
> >-      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> >+      /* We only support lazyfree or file folios batching for now ... */
> >+      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
> >               return 1;
> >+
> >       if (pte_unused(pte))
> >               return 1;
> >
> >@@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >                        *
> >                        * See Documentation/mm/mmu_notifier.rst
> >                        */
> >-                      dec_mm_counter(mm, mm_counter_file(folio));
> >+                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
> >               }
> > discard:
> >               if (unlikely(folio_test_hugetlb(folio))) {
> >--
> >2.47.3
> >
>
> Hi, Baolin
>
> When reading your patch, I come up one small question.
>
> Current try_to_unmap_one() has following structure:
>
>     try_to_unmap_one()
>         while (page_vma_mapped_walk(&pvmw)) {
>             nr_pages = folio_unmap_pte_batch()
>
>             if (nr_pages = folio_nr_pages(folio))
>                 goto walk_done;
>         }
>
> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>
> If my understanding is correct, page_vma_mapped_walk() would start from
> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
> (pvmw->address + nr_pages * PAGE_SIZE), right?
>
> Not sure my understanding is correct, if so do we have some reason not to
> skip the cleared range?

I don’t quite understand your question. For nr_pages > 1 but not equal
to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs inside.

take a look:

next_pte:
                do {
                        pvmw->address += PAGE_SIZE;
                        if (pvmw->address >= end)
                                return not_found(pvmw);
                        /* Did we cross page table boundary? */
                        if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
                                if (pvmw->ptl) {
                                        spin_unlock(pvmw->ptl);
                                        pvmw->ptl = NULL;
                                }
                                pte_unmap(pvmw->pte);
                                pvmw->pte = NULL;
                                pvmw->flags |= PVMW_PGTABLE_CROSSED;
                                goto restart;
                        }
                        pvmw->pte++;
                } while (pte_none(ptep_get(pvmw->pte)));


>
> --
> Wei Yang
> Help you, Help me

Thanks
Barry



* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-06 21:29     ` Barry Song
@ 2026-01-07  1:46       ` Wei Yang
  2026-01-07  2:21         ` Barry Song
  2026-01-16  9:53         ` Dev Jain
  0 siblings, 2 replies; 52+ messages in thread
From: Wei Yang @ 2026-01-07  1:46 UTC (permalink / raw)
  To: Barry Song
  Cc: Wei Yang, Baolin Wang, akpm, david, catalin.marinas, will,
	lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel

On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>> >Similar to folio_referenced_one(), we can apply batched unmapping for file
>> >large folios to optimize the performance of file folios reclamation.
>> >
>> >Barry previously implemented batched unmapping for lazyfree anonymous large
>> >folios[1] and did not further optimize anonymous large folios or file-backed
>> >large folios at that stage. As for file-backed large folios, the batched
>> >unmapping support is relatively straightforward, as we only need to clear
>> >the consecutive (present) PTE entries for file-backed large folios.
>> >
>> >Performance testing:
>> >Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>> >reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>> >75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>> >on my X86 machine) with this patch.
>> >
>> >W/o patch:
>> >real    0m1.018s
>> >user    0m0.000s
>> >sys     0m1.018s
>> >
>> >W/ patch:
>> >real   0m0.249s
>> >user   0m0.000s
>> >sys    0m0.249s
>> >
>> >[1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>> >Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>> >Acked-by: Barry Song <baohua@kernel.org>
>> >Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> >---
>> > mm/rmap.c | 7 ++++---
>> > 1 file changed, 4 insertions(+), 3 deletions(-)
>> >
>> >diff --git a/mm/rmap.c b/mm/rmap.c
>> >index 985ab0b085ba..e1d16003c514 100644
>> >--- a/mm/rmap.c
>> >+++ b/mm/rmap.c
>> >@@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>> >       end_addr = pmd_addr_end(addr, vma->vm_end);
>> >       max_nr = (end_addr - addr) >> PAGE_SHIFT;
>> >
>> >-      /* We only support lazyfree batching for now ... */
>> >-      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>> >+      /* We only support lazyfree or file folios batching for now ... */
>> >+      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>> >               return 1;
>> >+
>> >       if (pte_unused(pte))
>> >               return 1;
>> >
>> >@@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> >                        *
>> >                        * See Documentation/mm/mmu_notifier.rst
>> >                        */
>> >-                      dec_mm_counter(mm, mm_counter_file(folio));
>> >+                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>> >               }
>> > discard:
>> >               if (unlikely(folio_test_hugetlb(folio))) {
>> >--
>> >2.47.3
>> >
>>
>> Hi, Baolin
>>
>> When reading your patch, I come up one small question.
>>
>> Current try_to_unmap_one() has following structure:
>>
>>     try_to_unmap_one()
>>         while (page_vma_mapped_walk(&pvmw)) {
>>             nr_pages = folio_unmap_pte_batch()
>>
>>             if (nr_pages = folio_nr_pages(folio))
>>                 goto walk_done;
>>         }
>>
>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>
>> If my understanding is correct, page_vma_mapped_walk() would start from
>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>
>> Not sure my understanding is correct, if so do we have some reason not to
>> skip the cleared range?
>
>I don’t quite understand your question. For nr_pages > 1 but not equal
>to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs inside.
>
>take a look:
>
>next_pte:
>                do {
>                        pvmw->address += PAGE_SIZE;
>                        if (pvmw->address >= end)
>                                return not_found(pvmw);
>                        /* Did we cross page table boundary? */
>                        if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>                                if (pvmw->ptl) {
>                                        spin_unlock(pvmw->ptl);
>                                        pvmw->ptl = NULL;
>                                }
>                                pte_unmap(pvmw->pte);
>                                pvmw->pte = NULL;
>                                pvmw->flags |= PVMW_PGTABLE_CROSSED;
>                                goto restart;
>                        }
>                        pvmw->pte++;
>                } while (pte_none(ptep_get(pvmw->pte)));
>

Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
will be skipped.

I mean maybe we can skip it in try_to_unmap_one(), for example:

diff --git a/mm/rmap.c b/mm/rmap.c
index 9e5bd4834481..ea1afec7c802 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		 */
 		if (nr_pages == folio_nr_pages(folio))
 			goto walk_done;
+		else {
+			pvmw.address += PAGE_SIZE * (nr_pages - 1);
+			pvmw.pte += nr_pages - 1;
+		}
 		continue;
 walk_abort:
 		ret = false;

Not sure this is reasonable.


-- 
Wei Yang
Help you, Help me



* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-07  1:46       ` Wei Yang
@ 2026-01-07  2:21         ` Barry Song
  2026-01-07  2:29           ` Baolin Wang
  2026-01-16  9:53         ` Dev Jain
  1 sibling, 1 reply; 52+ messages in thread
From: Barry Song @ 2026-01-07  2:21 UTC (permalink / raw)
  To: Wei Yang
  Cc: Baolin Wang, akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel

On Wed, Jan 7, 2026 at 2:46 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
> >On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
> >> >Similar to folio_referenced_one(), we can apply batched unmapping for file
> >> >large folios to optimize the performance of file folios reclamation.
> >> >
> >> >Barry previously implemented batched unmapping for lazyfree anonymous large
> >> >folios[1] and did not further optimize anonymous large folios or file-backed
> >> >large folios at that stage. As for file-backed large folios, the batched
> >> >unmapping support is relatively straightforward, as we only need to clear
> >> >the consecutive (present) PTE entries for file-backed large folios.
> >> >
> >> >Performance testing:
> >> >Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> >> >reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> >> >75% performance improvement on my Arm64 32-core server (and 50%+ improvement
> >> >on my X86 machine) with this patch.
> >> >
> >> >W/o patch:
> >> >real    0m1.018s
> >> >user    0m0.000s
> >> >sys     0m1.018s
> >> >
> >> >W/ patch:
> >> >real   0m0.249s
> >> >user   0m0.000s
> >> >sys    0m0.249s
> >> >
> >> >[1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
> >> >Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >> >Acked-by: Barry Song <baohua@kernel.org>
> >> >Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >> >---
> >> > mm/rmap.c | 7 ++++---
> >> > 1 file changed, 4 insertions(+), 3 deletions(-)
> >> >
> >> >diff --git a/mm/rmap.c b/mm/rmap.c
> >> >index 985ab0b085ba..e1d16003c514 100644
> >> >--- a/mm/rmap.c
> >> >+++ b/mm/rmap.c
> >> >@@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> >> >       end_addr = pmd_addr_end(addr, vma->vm_end);
> >> >       max_nr = (end_addr - addr) >> PAGE_SHIFT;
> >> >
> >> >-      /* We only support lazyfree batching for now ... */
> >> >-      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> >> >+      /* We only support lazyfree or file folios batching for now ... */
> >> >+      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
> >> >               return 1;
> >> >+
> >> >       if (pte_unused(pte))
> >> >               return 1;
> >> >
> >> >@@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >> >                        *
> >> >                        * See Documentation/mm/mmu_notifier.rst
> >> >                        */
> >> >-                      dec_mm_counter(mm, mm_counter_file(folio));
> >> >+                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
> >> >               }
> >> > discard:
> >> >               if (unlikely(folio_test_hugetlb(folio))) {
> >> >--
> >> >2.47.3
> >> >
> >>
> >> Hi, Baolin
> >>
> >> When reading your patch, I come up one small question.
> >>
> >> Current try_to_unmap_one() has following structure:
> >>
> >>     try_to_unmap_one()
> >>         while (page_vma_mapped_walk(&pvmw)) {
> >>             nr_pages = folio_unmap_pte_batch()
> >>
> >>             if (nr_pages = folio_nr_pages(folio))
> >>                 goto walk_done;
> >>         }
> >>
> >> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
> >>
> >> If my understanding is correct, page_vma_mapped_walk() would start from
> >> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
> >> (pvmw->address + nr_pages * PAGE_SIZE), right?
> >>
> >> Not sure my understanding is correct, if so do we have some reason not to
> >> skip the cleared range?
> >
> >I don’t quite understand your question. For nr_pages > 1 but not equal
> >to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs inside.
> >
> >take a look:
> >
> >next_pte:
> >                do {
> >                        pvmw->address += PAGE_SIZE;
> >                        if (pvmw->address >= end)
> >                                return not_found(pvmw);
> >                        /* Did we cross page table boundary? */
> >                        if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
> >                                if (pvmw->ptl) {
> >                                        spin_unlock(pvmw->ptl);
> >                                        pvmw->ptl = NULL;
> >                                }
> >                                pte_unmap(pvmw->pte);
> >                                pvmw->pte = NULL;
> >                                pvmw->flags |= PVMW_PGTABLE_CROSSED;
> >                                goto restart;
> >                        }
> >                        pvmw->pte++;
> >                } while (pte_none(ptep_get(pvmw->pte)));
> >
>
> Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
> will be skipped.
>
> I mean maybe we can skip it in try_to_unmap_one(), for example:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9e5bd4834481..ea1afec7c802 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>                  */
>                 if (nr_pages == folio_nr_pages(folio))
>                         goto walk_done;
> +               else {
> +                       pvmw.address += PAGE_SIZE * (nr_pages - 1);
> +                       pvmw.pte += nr_pages - 1;
> +               }
>                 continue;
>  walk_abort:
>                 ret = false;


I feel this couples the PTE walk iteration with the unmap
operation, which does not seem fine to me. It also appears
to affect only corner cases.

>
> Not sure this is reasonable.
>

Thanks
Barry



* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-07  2:21         ` Barry Song
@ 2026-01-07  2:29           ` Baolin Wang
  2026-01-07  3:31             ` Wei Yang
  0 siblings, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2026-01-07  2:29 UTC (permalink / raw)
  To: Barry Song, Wei Yang
  Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel



On 1/7/26 10:21 AM, Barry Song wrote:
> On Wed, Jan 7, 2026 at 2:46 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>>>
>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>> Similar to folio_referenced_one(), we can apply batched unmapping for file
>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>
>>>>> Barry previously implemented batched unmapping for lazyfree anonymous large
>>>>> folios[1] and did not further optimize anonymous large folios or file-backed
>>>>> large folios at that stage. As for file-backed large folios, the batched
>>>>> unmapping support is relatively straightforward, as we only need to clear
>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>
>>>>> Performance testing:
>>>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>>>>> on my X86 machine) with this patch.
>>>>>
>>>>> W/o patch:
>>>>> real    0m1.018s
>>>>> user    0m0.000s
>>>>> sys     0m1.018s
>>>>>
>>>>> W/ patch:
>>>>> real   0m0.249s
>>>>> user   0m0.000s
>>>>> sys    0m0.249s
>>>>>
>>>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>> ---
>>>>> mm/rmap.c | 7 ++++---
>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>>>        end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>        max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>
>>>>> -      /* We only support lazyfree batching for now ... */
>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>> +      /* We only support lazyfree or file folios batching for now ... */
>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>                return 1;
>>>>> +
>>>>>        if (pte_unused(pte))
>>>>>                return 1;
>>>>>
>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>                         *
>>>>>                         * See Documentation/mm/mmu_notifier.rst
>>>>>                         */
>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>> +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>>>>>                }
>>>>> discard:
>>>>>                if (unlikely(folio_test_hugetlb(folio))) {
>>>>> --
>>>>> 2.47.3
>>>>>
>>>>
>>>> Hi, Baolin
>>>>
>>>> When reading your patch, I come up one small question.
>>>>
>>>> Current try_to_unmap_one() has following structure:
>>>>
>>>>      try_to_unmap_one()
>>>>          while (page_vma_mapped_walk(&pvmw)) {
>>>>              nr_pages = folio_unmap_pte_batch()
>>>>
>>>>              if (nr_pages = folio_nr_pages(folio))
>>>>                  goto walk_done;
>>>>          }
>>>>
>>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>
>>>> If my understanding is correct, page_vma_mapped_walk() would start from
>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>
>>>> Not sure my understanding is correct, if so do we have some reason not to
>>>> skip the cleared range?
>>>
>>> I don’t quite understand your question. For nr_pages > 1 but not equal
>>> to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs inside.
>>>
>>> take a look:
>>>
>>> next_pte:
>>>                 do {
>>>                         pvmw->address += PAGE_SIZE;
>>>                         if (pvmw->address >= end)
>>>                                 return not_found(pvmw);
>>>                         /* Did we cross page table boundary? */
>>>                         if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>>>                                 if (pvmw->ptl) {
>>>                                         spin_unlock(pvmw->ptl);
>>>                                         pvmw->ptl = NULL;
>>>                                 }
>>>                                 pte_unmap(pvmw->pte);
>>>                                 pvmw->pte = NULL;
>>>                                 pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>                                 goto restart;
>>>                         }
>>>                         pvmw->pte++;
>>>                 } while (pte_none(ptep_get(pvmw->pte)));
>>>
>>
>> Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
>> will be skipped.
>>
>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 9e5bd4834481..ea1afec7c802 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>                   */
>>                  if (nr_pages == folio_nr_pages(folio))
>>                          goto walk_done;
>> +               else {
>> +                       pvmw.address += PAGE_SIZE * (nr_pages - 1);
>> +                       pvmw.pte += nr_pages - 1;
>> +               }
>>                  continue;
>>   walk_abort:
>>                  ret = false;
> 
> 
> I feel this couples the PTE walk iteration with the unmap
> operation, which does not seem fine to me. It also appears
> to affect only corner cases.

Agree. There may be no performance gains, so I also prefer to leave it 
as is.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-07  2:29           ` Baolin Wang
@ 2026-01-07  3:31             ` Wei Yang
  0 siblings, 0 replies; 52+ messages in thread
From: Wei Yang @ 2026-01-07  3:31 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Barry Song, Wei Yang, akpm, david, catalin.marinas, will,
	lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel

On Wed, Jan 07, 2026 at 10:29:18AM +0800, Baolin Wang wrote:
>
>
>On 1/7/26 10:21 AM, Barry Song wrote:
>> On Wed, Jan 7, 2026 at 2:46 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>> > 
>> > On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>> > > On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>> > > > 
>> > > > On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>> > > > > Similar to folio_referenced_one(), we can apply batched unmapping for file
>> > > > > large folios to optimize the performance of file folios reclamation.
>> > > > > 
>> > > > > Barry previously implemented batched unmapping for lazyfree anonymous large
>> > > > > folios[1] and did not further optimize anonymous large folios or file-backed
>> > > > > large folios at that stage. As for file-backed large folios, the batched
>> > > > > unmapping support is relatively straightforward, as we only need to clear
>> > > > > the consecutive (present) PTE entries for file-backed large folios.
>> > > > > 
>> > > > > Performance testing:
>> > > > > Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>> > > > > reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>> > > > > 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>> > > > > on my X86 machine) with this patch.
>> > > > > 
>> > > > > W/o patch:
>> > > > > real    0m1.018s
>> > > > > user    0m0.000s
>> > > > > sys     0m1.018s
>> > > > > 
>> > > > > W/ patch:
>> > > > > real   0m0.249s
>> > > > > user   0m0.000s
>> > > > > sys    0m0.249s
>> > > > > 
>> > > > > [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>> > > > > Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>> > > > > Acked-by: Barry Song <baohua@kernel.org>
>> > > > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> > > > > ---
>> > > > > mm/rmap.c | 7 ++++---
>> > > > > 1 file changed, 4 insertions(+), 3 deletions(-)
>> > > > > 
>> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
>> > > > > index 985ab0b085ba..e1d16003c514 100644
>> > > > > --- a/mm/rmap.c
>> > > > > +++ b/mm/rmap.c
>> > > > > @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>> > > > >        end_addr = pmd_addr_end(addr, vma->vm_end);
>> > > > >        max_nr = (end_addr - addr) >> PAGE_SHIFT;
>> > > > > 
>> > > > > -      /* We only support lazyfree batching for now ... */
>> > > > > -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>> > > > > +      /* We only support lazyfree or file folios batching for now ... */
>> > > > > +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>> > > > >                return 1;
>> > > > > +
>> > > > >        if (pte_unused(pte))
>> > > > >                return 1;
>> > > > > 
>> > > > > @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> > > > >                         *
>> > > > >                         * See Documentation/mm/mmu_notifier.rst
>> > > > >                         */
>> > > > > -                      dec_mm_counter(mm, mm_counter_file(folio));
>> > > > > +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>> > > > >                }
>> > > > > discard:
>> > > > >                if (unlikely(folio_test_hugetlb(folio))) {
>> > > > > --
>> > > > > 2.47.3
>> > > > > 
>> > > > 
>> > > > Hi, Baolin
>> > > > 
>> > > > When reading your patch, I come up one small question.
>> > > > 
>> > > > Current try_to_unmap_one() has following structure:
>> > > > 
>> > > >      try_to_unmap_one()
>> > > >          while (page_vma_mapped_walk(&pvmw)) {
>> > > >              nr_pages = folio_unmap_pte_batch()
>> > > > 
>> > > >              if (nr_pages = folio_nr_pages(folio))
>> > > >                  goto walk_done;
>> > > >          }
>> > > > 
>> > > > I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>> > > > 
>> > > > If my understanding is correct, page_vma_mapped_walk() would start from
>> > > > (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
>> > > > (pvmw->address + nr_pages * PAGE_SIZE), right?
>> > > > 
>> > > > Not sure my understanding is correct, if so do we have some reason not to
>> > > > skip the cleared range?
>> > > 
>> > > I don’t quite understand your question. For nr_pages > 1 but not equal
>> > > to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs inside.
>> > > 
>> > > take a look:
>> > > 
>> > > next_pte:
>> > >                 do {
>> > >                         pvmw->address += PAGE_SIZE;
>> > >                         if (pvmw->address >= end)
>> > >                                 return not_found(pvmw);
>> > >                         /* Did we cross page table boundary? */
>> > >                         if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>> > >                                 if (pvmw->ptl) {
>> > >                                         spin_unlock(pvmw->ptl);
>> > >                                         pvmw->ptl = NULL;
>> > >                                 }
>> > >                                 pte_unmap(pvmw->pte);
>> > >                                 pvmw->pte = NULL;
>> > >                                 pvmw->flags |= PVMW_PGTABLE_CROSSED;
>> > >                                 goto restart;
>> > >                         }
>> > >                         pvmw->pte++;
>> > >                 } while (pte_none(ptep_get(pvmw->pte)));
>> > > 
>> > 
>> > Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
>> > will be skipped.
>> > 
>> > I mean maybe we can skip it in try_to_unmap_one(), for example:
>> > 
>> > diff --git a/mm/rmap.c b/mm/rmap.c
>> > index 9e5bd4834481..ea1afec7c802 100644
>> > --- a/mm/rmap.c
>> > +++ b/mm/rmap.c
>> > @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> >                   */
>> >                  if (nr_pages == folio_nr_pages(folio))
>> >                          goto walk_done;
>> > +               else {
>> > +                       pvmw.address += PAGE_SIZE * (nr_pages - 1);
>> > +                       pvmw.pte += nr_pages - 1;
>> > +               }
>> >                  continue;
>> >   walk_abort:
>> >                  ret = false;
>> 
>> 
>> I feel this couples the PTE walk iteration with the unmap
>> operation, which does not seem fine to me. It also appears
>> to affect only corner cases.
>
>Agree. There may be no performance gains, so I also prefer to leave it as is.

Got it, thanks.

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 1/5] mm: rmap: support batched checks of the references for large folios
  2025-12-26  6:07 ` [PATCH v5 1/5] mm: rmap: support batched checks of the references " Baolin Wang
@ 2026-01-07  6:01   ` Harry Yoo
  2026-02-09  8:49   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 52+ messages in thread
From: Harry Yoo @ 2026-01-07  6:01 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	jannh, willy, baohua, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel

On Fri, Dec 26, 2025 at 02:07:55PM +0800, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
> 
> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
> 
> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> of the young flags and flushing TLB entries, thereby improving performance
> during large folio reclamation. And it will be overridden by the architecture
> that implements a more efficient batch operation in the following patches.
> 
> While we are at it, rename ptep_clear_flush_young_notify() to
> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
> 
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---

Looks good to me, so:
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2025-12-26  6:07 ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
  2026-01-06 13:22   ` Wei Yang
@ 2026-01-07  6:54   ` Harry Yoo
  2026-01-16  8:42   ` Lorenzo Stoakes
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Harry Yoo @ 2026-01-07  6:54 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	jannh, willy, baohua, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel

On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
> Similar to folio_referenced_one(), we can apply batched unmapping for file
> large folios to optimize the performance of file folios reclamation.
> 
> Barry previously implemented batched unmapping for lazyfree anonymous large
> folios[1] and did not further optimize anonymous large folios or file-backed
> large folios at that stage. As for file-backed large folios, the batched
> unmapping support is relatively straightforward, as we only need to clear
> the consecutive (present) PTE entries for file-backed large folios.
>
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
> on my X86 machine) with this patch.
> 
> W/o patch:
> real    0m1.018s
> user    0m0.000s
> sys     0m1.018s
> 
> W/ patch:
> real	0m0.249s
> user	0m0.000s
> sys	0m0.249s
> 
> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Acked-by: Barry Song <baohua@kernel.org>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---

Looks good to me, so:
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 0/5] support batch checking of references and unmapping for large folios
  2025-12-26  6:07 [PATCH v5 0/5] support batch checking of references and unmapping for large folios Baolin Wang
                   ` (4 preceding siblings ...)
  2025-12-26  6:07 ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
@ 2026-01-16  8:41 ` Lorenzo Stoakes
  2026-01-16 10:53   ` David Hildenbrand (Red Hat)
  2026-01-16 10:52 ` David Hildenbrand (Red Hat)
  6 siblings, 1 reply; 52+ messages in thread
From: Lorenzo Stoakes @ 2026-01-16  8:41 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, david, catalin.marinas, will, ryan.roberts, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, riel, harry.yoo, jannh, willy,
	baohua, dev.jain, linux-mm, linux-arm-kernel, linux-kernel

Andrew -

I know this has had a lot of attention, but can we hold off on sending this
upstream until either David or I have had a chance to review it?

Also note that Dev has discovered an issue with how this interacts with the
accursed uffd-wp logic (see [0]), so the series needs a respin anyway.

Thanks, Lorenzo

[0]: https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/


On Fri, Dec 26, 2025 at 02:07:54PM +0800, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> Similar to folio_referenced_one(), we can also apply batched unmapping for large
> file folios to optimize the performance of file folio reclamation. By supporting
> batched checking of the young flags, flushing TLB entries, and unmapping, I can
> observed a significant performance improvements in my performance tests for file
> folios reclamation. Please check the performance data in the commit message of
> each patch.
>
> Run stress-ng and mm selftests, no issues were found.
>
> Patch 1: Add a new generic batched PTE helper that supports batched checks of
> the references for large folios.
> Patch 2 - 3: Preparation patches.
> patch 4: Implement the Arm64 arch-specific clear_flush_young_ptes().
> Patch 5: Support batched unmapping for file large folios.
>
> Changes from v4:
>  - Fix passing the incorrect 'CONT_PTES' for non-batched APIs.
>  - Rename ptep_clear_flush_young_notify() to clear_flush_young_ptes_notify() (per Ryan).
>  - Fix some coding style issues (per Ryan).
>  - Add reviewed tag from Ryan. Thanks.
>
> Changes from v3:
>  - Fix using an incorrect parameter in ptep_clear_flush_young_notify()
>    (per Liam).
>
> Changes from v2:
>  - Rearrange the patch set (per Ryan).
>  - Add pte_cont() check in clear_flush_young_ptes() (per Ryan).
>  - Add a helper to do contpte block alignment (per Ryan).
>  - Fix some coding style issues (per Lorenzo and Ryan).
>  - Add more comments and update the commit message (per Lorenzo and Ryan).
>  - Add acked tag from Barry. Thanks.
>
> Changes from v1:
>  - Add a new patch to support batched unmapping for file large folios.
>  - Update the cover letter
>
> Baolin Wang (5):
>   mm: rmap: support batched checks of the references for large folios
>   arm64: mm: factor out the address and ptep alignment into a new helper
>   arm64: mm: support batch clearing of the young flag for large folios
>   arm64: mm: implement the architecture-specific
>     clear_flush_young_ptes()
>   mm: rmap: support batched unmapping for file large folios
>
>  arch/arm64/include/asm/pgtable.h | 23 ++++++++----
>  arch/arm64/mm/contpte.c          | 62 ++++++++++++++++++++------------
>  include/linux/mmu_notifier.h     |  9 ++---
>  include/linux/pgtable.h          | 31 ++++++++++++++++
>  mm/rmap.c                        | 38 ++++++++++++++++----
>  5 files changed, 125 insertions(+), 38 deletions(-)
>
> --
> 2.47.3
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2025-12-26  6:07 ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
  2026-01-06 13:22   ` Wei Yang
  2026-01-07  6:54   ` Harry Yoo
@ 2026-01-16  8:42   ` Lorenzo Stoakes
  2026-01-16 16:26   ` [PATCH] mm: rmap: skip batched unmapping for UFFD vmas Baolin Wang
  2026-02-09  9:38   ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios David Hildenbrand (Arm)
  4 siblings, 0 replies; 52+ messages in thread
From: Lorenzo Stoakes @ 2026-01-16  8:42 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, david, catalin.marinas, will, ryan.roberts, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, riel, harry.yoo, jannh, willy,
	baohua, dev.jain, linux-mm, linux-arm-kernel, linux-kernel

FYI Dev found an issue here, see [0].

[0]: https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/

Cheers, Lorenzo

On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
> Similar to folio_referenced_one(), we can apply batched unmapping for file
> large folios to optimize the performance of file folios reclamation.
>
> Barry previously implemented batched unmapping for lazyfree anonymous large
> folios[1] and did not further optimize anonymous large folios or file-backed
> large folios at that stage. As for file-backed large folios, the batched
> unmapping support is relatively straightforward, as we only need to clear
> the consecutive (present) PTE entries for file-backed large folios.
>
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
> on my X86 machine) with this patch.
>
> W/o patch:
> real    0m1.018s
> user    0m0.000s
> sys     0m1.018s
>
> W/ patch:
> real	0m0.249s
> user	0m0.000s
> sys	0m0.249s
>
> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Acked-by: Barry Song <baohua@kernel.org>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  mm/rmap.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 985ab0b085ba..e1d16003c514 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>  	end_addr = pmd_addr_end(addr, vma->vm_end);
>  	max_nr = (end_addr - addr) >> PAGE_SHIFT;
>
> -	/* We only support lazyfree batching for now ... */
> -	if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> +	/* We only support lazyfree or file folios batching for now ... */
> +	if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>  		return 1;
> +
>  	if (pte_unused(pte))
>  		return 1;
>
> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			 *
>  			 * See Documentation/mm/mmu_notifier.rst
>  			 */
> -			dec_mm_counter(mm, mm_counter_file(folio));
> +			add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>  		}
>  discard:
>  		if (unlikely(folio_test_hugetlb(folio))) {
> --
> 2.47.3
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-07  1:46       ` Wei Yang
  2026-01-07  2:21         ` Barry Song
@ 2026-01-16  9:53         ` Dev Jain
  2026-01-16 11:14           ` Lorenzo Stoakes
                             ` (2 more replies)
  1 sibling, 3 replies; 52+ messages in thread
From: Dev Jain @ 2026-01-16  9:53 UTC (permalink / raw)
  To: Wei Yang, Barry Song
  Cc: Baolin Wang, akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, linux-mm, linux-arm-kernel,
	linux-kernel


On 07/01/26 7:16 am, Wei Yang wrote:
> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>> Similar to folio_referenced_one(), we can apply batched unmapping for file
>>>> large folios to optimize the performance of file folios reclamation.
>>>>
>>>> Barry previously implemented batched unmapping for lazyfree anonymous large
>>>> folios[1] and did not further optimize anonymous large folios or file-backed
>>>> large folios at that stage. As for file-backed large folios, the batched
>>>> unmapping support is relatively straightforward, as we only need to clear
>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>
>>>> Performance testing:
>>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>>>> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>>>> on my X86 machine) with this patch.
>>>>
>>>> W/o patch:
>>>> real    0m1.018s
>>>> user    0m0.000s
>>>> sys     0m1.018s
>>>>
>>>> W/ patch:
>>>> real   0m0.249s
>>>> user   0m0.000s
>>>> sys    0m0.249s
>>>>
>>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>> mm/rmap.c | 7 ++++---
>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 985ab0b085ba..e1d16003c514 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>>       end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>       max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>
>>>> -      /* We only support lazyfree batching for now ... */
>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>> +      /* We only support lazyfree or file folios batching for now ... */
>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>               return 1;
>>>> +
>>>>       if (pte_unused(pte))
>>>>               return 1;
>>>>
>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>                        *
>>>>                        * See Documentation/mm/mmu_notifier.rst
>>>>                        */
>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>> +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>>>>               }
>>>> discard:
>>>>               if (unlikely(folio_test_hugetlb(folio))) {
>>>> --
>>>> 2.47.3
>>>>
>>> Hi, Baolin
>>>
>>> When reading your patch, I come up one small question.
>>>
>>> Current try_to_unmap_one() has following structure:
>>>
>>>     try_to_unmap_one()
>>>         while (page_vma_mapped_walk(&pvmw)) {
>>>             nr_pages = folio_unmap_pte_batch()
>>>
>>>             if (nr_pages = folio_nr_pages(folio))
>>>                 goto walk_done;
>>>         }
>>>
>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>
>>> If my understanding is correct, page_vma_mapped_walk() would start from
>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>
>>> Not sure my understanding is correct, if so do we have some reason not to
>>> skip the cleared range?
>> I don’t quite understand your question. For nr_pages > 1 but not equal
>> to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs inside.
>>
>> take a look:
>>
>> next_pte:
>>                do {
>>                        pvmw->address += PAGE_SIZE;
>>                        if (pvmw->address >= end)
>>                                return not_found(pvmw);
>>                        /* Did we cross page table boundary? */
>>                        if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>>                                if (pvmw->ptl) {
>>                                        spin_unlock(pvmw->ptl);
>>                                        pvmw->ptl = NULL;
>>                                }
>>                                pte_unmap(pvmw->pte);
>>                                pvmw->pte = NULL;
>>                                pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>                                goto restart;
>>                        }
>>                        pvmw->pte++;
>>                } while (pte_none(ptep_get(pvmw->pte)));
>>
> Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
> will be skipped.
>
> I mean maybe we can skip it in try_to_unmap_one(), for example:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9e5bd4834481..ea1afec7c802 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  		 */
>  		if (nr_pages == folio_nr_pages(folio))
>  			goto walk_done;
> +		else {
> +			pvmw.address += PAGE_SIZE * (nr_pages - 1);
> +			pvmw.pte += nr_pages - 1;
> +		}
>  		continue;
>  walk_abort:
>  		ret = false;

I am of the opinion that we should do something like this. In the internal pvmw code,
we keep skipping PTEs only while they are pte_none(). With my proposed uffd fix [1], if
the old PTEs were uffd-wp armed, pte_install_uffd_wp_if_needed() will convert those PTEs
from none to non-none, and we will lose the batching effect. I also plan to extend support
to anonymous folios (thereby generalizing to all types of memory), which will set a
batch of PTEs to swap entries, and the internal pvmw code won't be able to skip through
the batch either.


[1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/
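
To make the concern above concrete, here is a minimal sketch (argument lists
elided, not the exact try_to_unmap_one() code) that combines the loop structure
quoted earlier with the explicit skip from Wei Yang's diff:

        while (page_vma_mapped_walk(&pvmw)) {
                nr_pages = folio_unmap_pte_batch(...);

                /* ... batched clear of nr_pages PTEs starting at pvmw.address ... */

                if (nr_pages == folio_nr_pages(folio))
                        goto walk_done;

                /*
                 * Explicit skip: the walker's own pte_none() loop cannot jump
                 * over these slots once they hold uffd-wp markers or, for anon
                 * folios, swap entries.
                 */
                pvmw.address += PAGE_SIZE * (nr_pages - 1);
                pvmw.pte += nr_pages - 1;
        }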

>
> Not sure this is reasonable.
>
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 0/5] support batch checking of references and unmapping for large folios
  2025-12-26  6:07 [PATCH v5 0/5] support batch checking of references and unmapping for large folios Baolin Wang
                   ` (5 preceding siblings ...)
  2026-01-16  8:41 ` [PATCH v5 0/5] support batch checking of references and unmapping for " Lorenzo Stoakes
@ 2026-01-16 10:52 ` David Hildenbrand (Red Hat)
  6 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-16 10:52 UTC (permalink / raw)
  To: Baolin Wang, akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel

On 12/26/25 07:07, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
> 
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
> 
> Similar to folio_referenced_one(), we can also apply batched unmapping for large
> file folios to optimize the performance of file folio reclamation. By supporting
> batched checking of the young flags, flushing TLB entries, and unmapping, I can
> observed a significant performance improvements in my performance tests for file
> folios reclamation. Please check the performance data in the commit message of
> each patch.
> 
> Run stress-ng and mm selftests, no issues were found.


Baolin, I'm intending to review this, but it might still take me a bit 
until I get to it. (PTO/vacation and other fun)

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 0/5] support batch checking of references and unmapping for large folios
  2026-01-16  8:41 ` [PATCH v5 0/5] support batch checking of references and unmapping for " Lorenzo Stoakes
@ 2026-01-16 10:53   ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-16 10:53 UTC (permalink / raw)
  To: Lorenzo Stoakes, Baolin Wang
  Cc: akpm, catalin.marinas, will, ryan.roberts, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, riel, harry.yoo, jannh, willy, baohua,
	dev.jain, linux-mm, linux-arm-kernel, linux-kernel

On 1/16/26 09:41, Lorenzo Stoakes wrote:
> Andrew -
> 
> I know this has had a lot of attention, but can we hold off on sending this
> upstream until either David or I have had a chance to review it?

Ah, I didn't read your mail before I sent mine. +1 ;)

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-16  9:53         ` Dev Jain
@ 2026-01-16 11:14           ` Lorenzo Stoakes
  2026-01-16 14:28           ` Barry Song
  2026-01-16 15:14           ` Barry Song
  2 siblings, 0 replies; 52+ messages in thread
From: Lorenzo Stoakes @ 2026-01-16 11:14 UTC (permalink / raw)
  To: Dev Jain
  Cc: Wei Yang, Barry Song, Baolin Wang, akpm, david, catalin.marinas,
	will, ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	riel, harry.yoo, jannh, willy, linux-mm, linux-arm-kernel,
	linux-kernel

On Fri, Jan 16, 2026 at 03:23:02PM +0530, Dev Jain wrote:
> I am of the opinion that we should do something like this. In the internal pvmw code,
> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
> to not none, and we will lose the batching effect. I also plan to extend support to
> anonymous folios (therefore generalizing for all types of memory) which will set a
> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
> batch.
>
>
> [1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/

No, as I told you, the correct course is to make your suggestion here, perhaps
with a suggested fix-patch. Please let's not split the discussion
between _the actual series where the issue exists_ and an invalid patch
report; it makes it _super hard_ to track what on earth is going on here.

Now anybody responding will be inclined to reply there and it's a total
mess...

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-16  9:53         ` Dev Jain
  2026-01-16 11:14           ` Lorenzo Stoakes
@ 2026-01-16 14:28           ` Barry Song
  2026-01-16 15:23             ` Barry Song
                               ` (2 more replies)
  2026-01-16 15:14           ` Barry Song
  2 siblings, 3 replies; 52+ messages in thread
From: Barry Song @ 2026-01-16 14:28 UTC (permalink / raw)
  To: Dev Jain
  Cc: Wei Yang, Baolin Wang, akpm, david, catalin.marinas, will,
	lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, linux-mm,
	linux-arm-kernel, linux-kernel

On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 07/01/26 7:16 am, Wei Yang wrote:
> > On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
> >> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
> >>>> Similar to folio_referenced_one(), we can apply batched unmapping for file
> >>>> large folios to optimize the performance of file folios reclamation.
> >>>>
> >>>> Barry previously implemented batched unmapping for lazyfree anonymous large
> >>>> folios[1] and did not further optimize anonymous large folios or file-backed
> >>>> large folios at that stage. As for file-backed large folios, the batched
> >>>> unmapping support is relatively straightforward, as we only need to clear
> >>>> the consecutive (present) PTE entries for file-backed large folios.
> >>>>
> >>>> Performance testing:
> >>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> >>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> >>>> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
> >>>> on my X86 machine) with this patch.
> >>>>
> >>>> W/o patch:
> >>>> real    0m1.018s
> >>>> user    0m0.000s
> >>>> sys     0m1.018s
> >>>>
> >>>> W/ patch:
> >>>> real   0m0.249s
> >>>> user   0m0.000s
> >>>> sys    0m0.249s
> >>>>
> >>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
> >>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>> Acked-by: Barry Song <baohua@kernel.org>
> >>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>>> ---
> >>>> mm/rmap.c | 7 ++++---
> >>>> 1 file changed, 4 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>> index 985ab0b085ba..e1d16003c514 100644
> >>>> --- a/mm/rmap.c
> >>>> +++ b/mm/rmap.c
> >>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> >>>>       end_addr = pmd_addr_end(addr, vma->vm_end);
> >>>>       max_nr = (end_addr - addr) >> PAGE_SHIFT;
> >>>>
> >>>> -      /* We only support lazyfree batching for now ... */
> >>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> >>>> +      /* We only support lazyfree or file folios batching for now ... */
> >>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
> >>>>               return 1;
> >>>> +
> >>>>       if (pte_unused(pte))
> >>>>               return 1;
> >>>>
> >>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>                        *
> >>>>                        * See Documentation/mm/mmu_notifier.rst
> >>>>                        */
> >>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
> >>>> +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
> >>>>               }
> >>>> discard:
> >>>>               if (unlikely(folio_test_hugetlb(folio))) {
> >>>> --
> >>>> 2.47.3
> >>>>
> >>> Hi, Baolin
> >>>
> >>> When reading your patch, I come up one small question.
> >>>
> >>> Current try_to_unmap_one() has following structure:
> >>>
> >>>     try_to_unmap_one()
> >>>         while (page_vma_mapped_walk(&pvmw)) {
> >>>             nr_pages = folio_unmap_pte_batch()
> >>>
> >>>             if (nr_pages = folio_nr_pages(folio))
> >>>                 goto walk_done;
> >>>         }
> >>>
> >>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
> >>>
> >>> If my understanding is correct, page_vma_mapped_walk() would start from
> >>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
> >>> (pvmw->address + nr_pages * PAGE_SIZE), right?
> >>>
> >>> Not sure my understanding is correct, if so do we have some reason not to
> >>> skip the cleared range?
> >> I don’t quite understand your question. For nr_pages > 1 but not equal
> >> to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs inside.
> >>
> >> take a look:
> >>
> >> next_pte:
> >>                do {
> >>                        pvmw->address += PAGE_SIZE;
> >>                        if (pvmw->address >= end)
> >>                                return not_found(pvmw);
> >>                        /* Did we cross page table boundary? */
> >>                        if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
> >>                                if (pvmw->ptl) {
> >>                                        spin_unlock(pvmw->ptl);
> >>                                        pvmw->ptl = NULL;
> >>                                }
> >>                                pte_unmap(pvmw->pte);
> >>                                pvmw->pte = NULL;
> >>                                pvmw->flags |= PVMW_PGTABLE_CROSSED;
> >>                                goto restart;
> >>                        }
> >>                        pvmw->pte++;
> >>                } while (pte_none(ptep_get(pvmw->pte)));
> >>
> > Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
> > will be skipped.
> >
> > I mean maybe we can skip it in try_to_unmap_one(), for example:
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 9e5bd4834481..ea1afec7c802 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >                */
> >               if (nr_pages == folio_nr_pages(folio))
> >                       goto walk_done;
> > +             else {
> > +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
> > +                     pvmw.pte += nr_pages - 1;
> > +             }
> >               continue;
> >  walk_abort:
> >               ret = false;
>
> I am of the opinion that we should do something like this. In the internal pvmw code,

I am still not convinced that try_to_unmap_one() is the right place
to skip PTEs. If we really want to skip certain PTEs early,
should we instead hint page_vma_mapped_walk()? That said, I don't
see much value in doing so, since in most cases nr is either 1 or
folio_nr_pages(folio).
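
For illustration only, a sketch of what such a hint could look like; the
skip_pages field is hypothetical and does not exist in struct
page_vma_mapped_walk today:

        nr_pages = folio_unmap_pte_batch(...);
        /* ... batched clear of nr_pages PTEs ... */

        /* Hypothetical hint so the walker jumps over the handled range. */
        pvmw.skip_pages = nr_pages - 1;

page_vma_mapped_walk() would then consume skip_pages before starting its
pte_none() scan.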

> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
> to not none, and we will lose the batching effect. I also plan to extend support to
> anonymous folios (therefore generalizing for all types of memory) which will set a
> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
> batch.

Thanks for catching this, Dev. I already filter out some of the more
complex cases, for example:
if (pte_unused(pte))
        return 1;

Since the userfaultfd write-protection case is also a corner case,
could we filter it out as well?

diff --git a/mm/rmap.c b/mm/rmap.c
index c86f1135222b..6bb8ba6f046e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1870,6 +1870,9 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
        if (pte_unused(pte))
                return 1;

+       if (userfaultfd_wp(vma))
+               return 1;
+
        return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
}

Just offering a second option — yours is probably better.

Thanks
Barry


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-16  9:53         ` Dev Jain
  2026-01-16 11:14           ` Lorenzo Stoakes
  2026-01-16 14:28           ` Barry Song
@ 2026-01-16 15:14           ` Barry Song
  2026-01-18  5:48             ` Dev Jain
  2 siblings, 1 reply; 52+ messages in thread
From: Barry Song @ 2026-01-16 15:14 UTC (permalink / raw)
  To: Dev Jain
  Cc: Wei Yang, Baolin Wang, akpm, david, catalin.marinas, will,
	lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, linux-mm,
	linux-arm-kernel, linux-kernel

> >
> > I mean maybe we can skip it in try_to_unmap_one(), for example:
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 9e5bd4834481..ea1afec7c802 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >                */
> >               if (nr_pages == folio_nr_pages(folio))
> >                       goto walk_done;
> > +             else {
> > +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
> > +                     pvmw.pte += nr_pages - 1;
> > +             }
> >               continue;
> >  walk_abort:
> >               ret = false;
>
> I am of the opinion that we should do something like this. In the internal pvmw code,
> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
> to not none, and we will lose the batching effect. I also plan to extend support to
> anonymous folios (therefore generalizing for all types of memory) which will set a

I posted an RFC on anon folios quite some time ago [1].
It’s great to hear that you’re interested in taking this over.

[1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/

> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
> batch.

Interesting — I didn’t catch this issue in the RFC earlier. Back then,
we only supported nr == 1 and nr == folio_nr_pages(folio). When
nr == nr_pages, page_vma_mapped_walk() would break entirely. With
Lance’s commit ddd05742b45b08, arbitrary nr in [1, nr_pages] is now
supported, which means we have to handle all the complexity. :-)

Thanks
Barry


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-16 14:28           ` Barry Song
@ 2026-01-16 15:23             ` Barry Song
  2026-01-16 15:49             ` Baolin Wang
  2026-01-18  5:46             ` Dev Jain
  2 siblings, 0 replies; 52+ messages in thread
From: Barry Song @ 2026-01-16 15:23 UTC (permalink / raw)
  To: Dev Jain
  Cc: Wei Yang, Baolin Wang, akpm, david, catalin.marinas, will,
	lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, linux-mm,
	linux-arm-kernel, linux-kernel

>
> Thanks for catching this, Dev. I already filter out some of the more
> complex cases, for example:
> if (pte_unused(pte))
>         return 1;
>
> Since the userfaultfd write-protection case is also a corner case,
> could we filter it out as well?
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c86f1135222b..6bb8ba6f046e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1870,6 +1870,9 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>         if (pte_unused(pte))
>                 return 1;
>
> +       if (userfaultfd_wp(vma))
> +               return 1;
> +
>         return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
> }
>
> Just offering a second option — yours is probably better.

Sorry for replying in the wrong place. The above reply was actually meant
for your fix-patch below[1]:

"mm: Fix uffd-wp bit loss when batching file folio unmapping"

[1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/

Thanks
Barry


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-16 14:28           ` Barry Song
  2026-01-16 15:23             ` Barry Song
@ 2026-01-16 15:49             ` Baolin Wang
  2026-01-18  5:46             ` Dev Jain
  2 siblings, 0 replies; 52+ messages in thread
From: Baolin Wang @ 2026-01-16 15:49 UTC (permalink / raw)
  To: Barry Song, Dev Jain
  Cc: Wei Yang, akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, linux-mm, linux-arm-kernel,
	linux-kernel



On 1/16/26 10:28 PM, Barry Song wrote:
> On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>> On 07/01/26 7:16 am, Wei Yang wrote:
>>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>>> Similar to folio_referenced_one(), we can apply batched unmapping for file
>>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>>
>>>>>> Barry previously implemented batched unmapping for lazyfree anonymous large
>>>>>> folios[1] and did not further optimize anonymous large folios or file-backed
>>>>>> large folios at that stage. As for file-backed large folios, the batched
>>>>>> unmapping support is relatively straightforward, as we only need to clear
>>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>>
>>>>>> Performance testing:
>>>>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>>>>>> on my X86 machine) with this patch.
>>>>>>
>>>>>> W/o patch:
>>>>>> real    0m1.018s
>>>>>> user    0m0.000s
>>>>>> sys     0m1.018s
>>>>>>
>>>>>> W/ patch:
>>>>>> real   0m0.249s
>>>>>> user   0m0.000s
>>>>>> sys    0m0.249s
>>>>>>
>>>>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> ---
>>>>>> mm/rmap.c | 7 ++++---
>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>>>>        end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>>        max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>
>>>>>> -      /* We only support lazyfree batching for now ... */
>>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>> +      /* We only support lazyfree or file folios batching for now ... */
>>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>>                return 1;
>>>>>> +
>>>>>>        if (pte_unused(pte))
>>>>>>                return 1;
>>>>>>
>>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>                         *
>>>>>>                         * See Documentation/mm/mmu_notifier.rst
>>>>>>                         */
>>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>>> +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>>>>>>                }
>>>>>> discard:
>>>>>>                if (unlikely(folio_test_hugetlb(folio))) {
>>>>>> --
>>>>>> 2.47.3
>>>>>>
>>>>> Hi, Baolin
>>>>>
>>>>> When reading your patch, I come up one small question.
>>>>>
>>>>> Current try_to_unmap_one() has following structure:
>>>>>
>>>>>      try_to_unmap_one()
>>>>>          while (page_vma_mapped_walk(&pvmw)) {
>>>>>              nr_pages = folio_unmap_pte_batch()
>>>>>
>>>>>              if (nr_pages = folio_nr_pages(folio))
>>>>>                  goto walk_done;
>>>>>          }
>>>>>
>>>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>>
>>>>> If my understanding is correct, page_vma_mapped_walk() would start from
>>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
>>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>>
>>>>> Not sure my understanding is correct, if so do we have some reason not to
>>>>> skip the cleared range?
>>>> I don’t quite understand your question. For nr_pages > 1 but not equal
>>>> to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs inside.
>>>>
>>>> take a look:
>>>>
>>>> next_pte:
>>>>                 do {
>>>>                         pvmw->address += PAGE_SIZE;
>>>>                         if (pvmw->address >= end)
>>>>                                 return not_found(pvmw);
>>>>                         /* Did we cross page table boundary? */
>>>>                         if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>>>>                                 if (pvmw->ptl) {
>>>>                                         spin_unlock(pvmw->ptl);
>>>>                                         pvmw->ptl = NULL;
>>>>                                 }
>>>>                                 pte_unmap(pvmw->pte);
>>>>                                 pvmw->pte = NULL;
>>>>                                 pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>>                                 goto restart;
>>>>                         }
>>>>                         pvmw->pte++;
>>>>                 } while (pte_none(ptep_get(pvmw->pte)));
>>>>
>>> Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
>>> will be skipped.
>>>
>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 9e5bd4834481..ea1afec7c802 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>                 */
>>>                if (nr_pages == folio_nr_pages(folio))
>>>                        goto walk_done;
>>> +             else {
>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>> +                     pvmw.pte += nr_pages - 1;
>>> +             }
>>>                continue;
>>>   walk_abort:
>>>                ret = false;
>>
>> I am of the opinion that we should do something like this. In the internal pvmw code,
> 
> I am still not convinced that skipping PTEs in try_to_unmap_one()
> is the right place. If we really want to skip certain PTEs early,
> should we instead hint page_vma_mapped_walk()? That said, I don't
> see much value in doing so, since in most cases nr is either 1 or
> folio_nr_pages(folio).
> 
>> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
>> to not none, and we will lose the batching effect. I also plan to extend support to
>> anonymous folios (therefore generalizing for all types of memory) which will set a
>> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
>> batch.
> 
> Thanks for catching this, Dev. I already filter out some of the more
> complex cases, for example:
> if (pte_unused(pte))
>          return 1;

Hi Dev, thanks for the report[1], and also for explaining why the mm selftests 
can pass.

[1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/


> Since the userfaultfd write-protection case is also a corner case,
> could we filter it out as well?
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c86f1135222b..6bb8ba6f046e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1870,6 +1870,9 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>          if (pte_unused(pte))
>                  return 1;
> 
> +       if (userfaultfd_wp(vma))
> +               return 1;
> +
>          return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
> }

That small fix makes sense to me. I think Dev can continue working on the 
UFFD batch optimization, and we will need more review and testing for the 
UFFD batched operations, as David suggested[2].

[2] https://lore.kernel.org/all/9edeeef1-5553-406b-8e56-30b11809eec5@kernel.org/


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] mm: rmap: skip batched unmapping for UFFD vmas
  2025-12-26  6:07 ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
                     ` (2 preceding siblings ...)
  2026-01-16  8:42   ` Lorenzo Stoakes
@ 2026-01-16 16:26   ` Baolin Wang
  2026-02-09  9:54     ` David Hildenbrand (Arm)
  2026-02-09  9:38   ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios David Hildenbrand (Arm)
  4 siblings, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2026-01-16 16:26 UTC (permalink / raw)
  To: baolin.wang
  Cc: Liam.Howlett, akpm, baohua, catalin.marinas, david, dev.jain,
	harry.yoo, jannh, linux-arm-kernel, linux-kernel, linux-mm,
	lorenzo.stoakes, mhocko, riel, rppt, ryan.roberts, surenb,
	vbabka, will, willy

As Dev reported[1], we are not yet ready to support batched unmapping for the uffd case.
Let's still fall back to per-page unmapping for the uffd case.

[1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/
Reported-by: Dev Jain <dev.jain@arm.com>
Suggested-by: Barry Song <baohua@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/rmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/rmap.c b/mm/rmap.c
index f13480cb9f2e..172643092dcf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1953,6 +1953,9 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 	if (pte_unused(pte))
 		return 1;
 
+	if (userfaultfd_wp(vma))
+		return 1;
+
 	return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
 }
 
-- 
2.47.3



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-16 14:28           ` Barry Song
  2026-01-16 15:23             ` Barry Song
  2026-01-16 15:49             ` Baolin Wang
@ 2026-01-18  5:46             ` Dev Jain
  2026-01-19  5:50               ` Baolin Wang
  2 siblings, 1 reply; 52+ messages in thread
From: Dev Jain @ 2026-01-18  5:46 UTC (permalink / raw)
  To: Barry Song
  Cc: Wei Yang, Baolin Wang, akpm, david, catalin.marinas, will,
	lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, linux-mm,
	linux-arm-kernel, linux-kernel


On 16/01/26 7:58 pm, Barry Song wrote:
> On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>> On 07/01/26 7:16 am, Wei Yang wrote:
>>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>>> Similar to folio_referenced_one(), we can apply batched unmapping for file
>>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>>
>>>>>> Barry previously implemented batched unmapping for lazyfree anonymous large
>>>>>> folios[1] and did not further optimize anonymous large folios or file-backed
>>>>>> large folios at that stage. As for file-backed large folios, the batched
>>>>>> unmapping support is relatively straightforward, as we only need to clear
>>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>>
>>>>>> Performance testing:
>>>>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>>>>>> on my X86 machine) with this patch.
>>>>>>
>>>>>> W/o patch:
>>>>>> real    0m1.018s
>>>>>> user    0m0.000s
>>>>>> sys     0m1.018s
>>>>>>
>>>>>> W/ patch:
>>>>>> real   0m0.249s
>>>>>> user   0m0.000s
>>>>>> sys    0m0.249s
>>>>>>
>>>>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> ---
>>>>>> mm/rmap.c | 7 ++++---
>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>>>>       end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>>       max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>
>>>>>> -      /* We only support lazyfree batching for now ... */
>>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>> +      /* We only support lazyfree or file folios batching for now ... */
>>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>>               return 1;
>>>>>> +
>>>>>>       if (pte_unused(pte))
>>>>>>               return 1;
>>>>>>
>>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>                        *
>>>>>>                        * See Documentation/mm/mmu_notifier.rst
>>>>>>                        */
>>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>>> +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>>>>>>               }
>>>>>> discard:
>>>>>>               if (unlikely(folio_test_hugetlb(folio))) {
>>>>>> --
>>>>>> 2.47.3
>>>>>>
>>>>> Hi, Baolin
>>>>>
>>>>> When reading your patch, I come up one small question.
>>>>>
>>>>> Current try_to_unmap_one() has following structure:
>>>>>
>>>>>     try_to_unmap_one()
>>>>>         while (page_vma_mapped_walk(&pvmw)) {
>>>>>             nr_pages = folio_unmap_pte_batch()
>>>>>
>>>>>             if (nr_pages = folio_nr_pages(folio))
>>>>>                 goto walk_done;
>>>>>         }
>>>>>
>>>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>>
>>>>> If my understanding is correct, page_vma_mapped_walk() would start from
>>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
>>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>>
>>>>> Not sure my understanding is correct, if so do we have some reason not to
>>>>> skip the cleared range?
>>>> I don’t quite understand your question. For nr_pages > 1 but not equal
>>>> to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs inside.
>>>>
>>>> take a look:
>>>>
>>>> next_pte:
>>>>                do {
>>>>                        pvmw->address += PAGE_SIZE;
>>>>                        if (pvmw->address >= end)
>>>>                                return not_found(pvmw);
>>>>                        /* Did we cross page table boundary? */
>>>>                        if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>>>>                                if (pvmw->ptl) {
>>>>                                        spin_unlock(pvmw->ptl);
>>>>                                        pvmw->ptl = NULL;
>>>>                                }
>>>>                                pte_unmap(pvmw->pte);
>>>>                                pvmw->pte = NULL;
>>>>                                pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>>                                goto restart;
>>>>                        }
>>>>                        pvmw->pte++;
>>>>                } while (pte_none(ptep_get(pvmw->pte)));
>>>>
>>> Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
>>> will be skipped.
>>>
>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 9e5bd4834481..ea1afec7c802 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>                */
>>>               if (nr_pages == folio_nr_pages(folio))
>>>                       goto walk_done;
>>> +             else {
>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>> +                     pvmw.pte += nr_pages - 1;
>>> +             }
>>>               continue;
>>>  walk_abort:
>>>               ret = false;
>> I am of the opinion that we should do something like this. In the internal pvmw code,
> I am still not convinced that skipping PTEs in try_to_unmap_one()
> is the right place. If we really want to skip certain PTEs early,
> should we instead hint page_vma_mapped_walk()? That said, I don't
> see much value in doing so, since in most cases nr is either 1 or
> folio_nr_pages(folio).
>
>> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
>> to not none, and we will lose the batching effect. I also plan to extend support to
>> anonymous folios (therefore generalizing for all types of memory) which will set a
>> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
>> batch.
> Thanks for catching this, Dev. I already filter out some of the more
> complex cases, for example:
> if (pte_unused(pte))
>         return 1;
>
> Since the userfaultfd write-protection case is also a corner case,
> could we filter it out as well?
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c86f1135222b..6bb8ba6f046e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1870,6 +1870,9 @@ static inline unsigned int
> folio_unmap_pte_batch(struct folio *folio,
>         if (pte_unused(pte))
>                 return 1;
>
> +       if (userfaultfd_wp(vma))
> +               return 1;
> +
>         return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
> }
>
> Just offering a second option — yours is probably better.

No. This is not an edge case. This is a case which gets exposed by your work, and
I believe that if you intend to get the file folio batching thingy in, then you
need to fix the uffd stuff too.

>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-16 15:14           ` Barry Song
@ 2026-01-18  5:48             ` Dev Jain
  0 siblings, 0 replies; 52+ messages in thread
From: Dev Jain @ 2026-01-18  5:48 UTC (permalink / raw)
  To: Barry Song
  Cc: Wei Yang, Baolin Wang, akpm, david, catalin.marinas, will,
	lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, linux-mm,
	linux-arm-kernel, linux-kernel


On 16/01/26 8:44 pm, Barry Song wrote:
>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 9e5bd4834481..ea1afec7c802 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>                */
>>>               if (nr_pages == folio_nr_pages(folio))
>>>                       goto walk_done;
>>> +             else {
>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>> +                     pvmw.pte += nr_pages - 1;
>>> +             }
>>>               continue;
>>>  walk_abort:
>>>               ret = false;
>> I am of the opinion that we should do something like this. In the internal pvmw code,
>> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
>> to not none, and we will lose the batching effect. I also plan to extend support to
>> anonymous folios (therefore generalizing for all types of memory) which will set a
> I posted an RFC on anon folios quite some time ago [1].
> It’s great to hear that you’re interested in taking this over.
>
> [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/

Great! Now I have a reference to look at :)

>
>> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
>> batch.
> Interesting — I didn’t catch this issue in the RFC earlier. Back then,
> we only supported nr == 1 and nr == folio_nr_pages(folio). When
> nr == nr_pages, page_vma_mapped_walk() would break entirely. With
> Lance’s commit ddd05742b45b08, arbitrary nr in [1, nr_pages] is now
> supported, which means we have to handle all the complexity. :-)
>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-18  5:46             ` Dev Jain
@ 2026-01-19  5:50               ` Baolin Wang
  2026-01-19  6:36                 ` Dev Jain
  0 siblings, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2026-01-19  5:50 UTC (permalink / raw)
  To: Dev Jain, Barry Song
  Cc: Wei Yang, akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, linux-mm, linux-arm-kernel,
	linux-kernel



On 1/18/26 1:46 PM, Dev Jain wrote:
> 
> On 16/01/26 7:58 pm, Barry Song wrote:
>> On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@arm.com> wrote:
>>>
>>> On 07/01/26 7:16 am, Wei Yang wrote:
>>>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>>>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>>>> Similar to folio_referenced_one(), we can apply batched unmapping for file
>>>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>>>
>>>>>>> Barry previously implemented batched unmapping for lazyfree anonymous large
>>>>>>> folios[1] and did not further optimize anonymous large folios or file-backed
>>>>>>> large folios at that stage. As for file-backed large folios, the batched
>>>>>>> unmapping support is relatively straightforward, as we only need to clear
>>>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>>>
>>>>>>> Performance testing:
>>>>>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>>>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>>>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
>>>>>>> on my X86 machine) with this patch.
>>>>>>>
>>>>>>> W/o patch:
>>>>>>> real    0m1.018s
>>>>>>> user    0m0.000s
>>>>>>> sys     0m1.018s
>>>>>>>
>>>>>>> W/ patch:
>>>>>>> real   0m0.249s
>>>>>>> user   0m0.000s
>>>>>>> sys    0m0.249s
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>> ---
>>>>>>> mm/rmap.c | 7 ++++---
>>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>>>> --- a/mm/rmap.c
>>>>>>> +++ b/mm/rmap.c
>>>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>>>>>        end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>>>        max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>>
>>>>>>> -      /* We only support lazyfree batching for now ... */
>>>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>>> +      /* We only support lazyfree or file folios batching for now ... */
>>>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>>>                return 1;
>>>>>>> +
>>>>>>>        if (pte_unused(pte))
>>>>>>>                return 1;
>>>>>>>
>>>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>>                         *
>>>>>>>                         * See Documentation/mm/mmu_notifier.rst
>>>>>>>                         */
>>>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>>>> +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>>>>>>>                }
>>>>>>> discard:
>>>>>>>                if (unlikely(folio_test_hugetlb(folio))) {
>>>>>>> --
>>>>>>> 2.47.3
>>>>>>>
>>>>>> Hi, Baolin
>>>>>>
>>>>>> When reading your patch, I come up one small question.
>>>>>>
>>>>>> Current try_to_unmap_one() has following structure:
>>>>>>
>>>>>>      try_to_unmap_one()
>>>>>>          while (page_vma_mapped_walk(&pvmw)) {
>>>>>>              nr_pages = folio_unmap_pte_batch()
>>>>>>
>>>>>>              if (nr_pages = folio_nr_pages(folio))
>>>>>>                  goto walk_done;
>>>>>>          }
>>>>>>
>>>>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>>>
>>>>>> If my understanding is correct, page_vma_mapped_walk() would start from
>>>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
>>>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>>>
>>>>>> Not sure my understanding is correct, if so do we have some reason not to
>>>>>> skip the cleared range?
>>>>> I don’t quite understand your question. For nr_pages > 1 but not equal
>>>>> to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs inside.
>>>>>
>>>>> take a look:
>>>>>
>>>>> next_pte:
>>>>>                 do {
>>>>>                         pvmw->address += PAGE_SIZE;
>>>>>                         if (pvmw->address >= end)
>>>>>                                 return not_found(pvmw);
>>>>>                         /* Did we cross page table boundary? */
>>>>>                         if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>>>>>                                 if (pvmw->ptl) {
>>>>>                                         spin_unlock(pvmw->ptl);
>>>>>                                         pvmw->ptl = NULL;
>>>>>                                 }
>>>>>                                 pte_unmap(pvmw->pte);
>>>>>                                 pvmw->pte = NULL;
>>>>>                                 pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>>>                                 goto restart;
>>>>>                         }
>>>>>                         pvmw->pte++;
>>>>>                 } while (pte_none(ptep_get(pvmw->pte)));
>>>>>
>>>> Yes, we do it in page_vma_mapped_walk() now. Since they are pte_none(), they
>>>> will be skipped.
>>>>
>>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 9e5bd4834481..ea1afec7c802 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>                 */
>>>>                if (nr_pages == folio_nr_pages(folio))
>>>>                        goto walk_done;
>>>> +             else {
>>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>>> +                     pvmw.pte += nr_pages - 1;
>>>> +             }
>>>>                continue;
>>>>   walk_abort:
>>>>                ret = false;
>>> I am of the opinion that we should do something like this. In the internal pvmw code,
>> I am still not convinced that skipping PTEs in try_to_unmap_one()
>> is the right place. If we really want to skip certain PTEs early,
>> should we instead hint page_vma_mapped_walk()? That said, I don't
>> see much value in doing so, since in most cases nr is either 1 or
>> folio_nr_pages(folio).
>>
>>> we keep skipping ptes till the ptes are none. With my proposed uffd-fix [1], if the old
>>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
>>> to not none, and we will lose the batching effect. I also plan to extend support to
>>> anonymous folios (therefore generalizing for all types of memory) which will set a
>>> batch of ptes as swap, and the internal pvmw code won't be able to skip through the
>>> batch.
>> Thanks for catching this, Dev. I already filter out some of the more
>> complex cases, for example:
>> if (pte_unused(pte))
>>          return 1;
>>
>> Since the userfaultfd write-protection case is also a corner case,
>> could we filter it out as well?
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index c86f1135222b..6bb8ba6f046e 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1870,6 +1870,9 @@ static inline unsigned int
>> folio_unmap_pte_batch(struct folio *folio,
>>          if (pte_unused(pte))
>>                  return 1;
>>
>> +       if (userfaultfd_wp(vma))
>> +               return 1;
>> +
>>          return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>> }
>>
>> Just offering a second option — yours is probably better.
> 
> No. This is not an edge case. This is a case which gets exposed by your work, and
> I believe that if you intend to get the file folio batching thingy in, then you
> need to fix the uffd stuff too.

Barry’s point isn’t that this is an edge case. I think he means that 
uffd is not a common performance-sensitive scenario in production. Also, 
we typically fall back to per-page handling for uffd cases (see 
finish_fault() and alloc_anon_folio()). So I prefer to follow Barry’s 
suggestion and filter out the uffd cases until we have a test case that 
shows a performance improvement.
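
For reference, the fallback I mean is the usual "uffd needs per-page fault
fidelity" pattern. As far as I recall, alloc_anon_folio() does roughly the
following before trying a large folio (approximate, not a verbatim quote of
mm/memory.c):

	/*
	 * If uffd is active for the vma we need per-page fault fidelity
	 * in order to map them correctly with uffd-wp.
	 */
	if (unlikely(userfaultfd_armed(vma)))
		goto fallback;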

I also think you can continue iterating your patch[1] to support batched 
unmapping for uffd VMAs, and provide data to evaluate its value.

[1] 
https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-19  5:50               ` Baolin Wang
@ 2026-01-19  6:36                 ` Dev Jain
  2026-01-19  7:22                   ` Baolin Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Dev Jain @ 2026-01-19  6:36 UTC (permalink / raw)
  To: Baolin Wang, Barry Song
  Cc: Wei Yang, akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, linux-mm, linux-arm-kernel,
	linux-kernel


On 19/01/26 11:20 am, Baolin Wang wrote:
>
>
> On 1/18/26 1:46 PM, Dev Jain wrote:
>>
>> On 16/01/26 7:58 pm, Barry Song wrote:
>>> On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@arm.com> wrote:
>>>>
>>>> On 07/01/26 7:16 am, Wei Yang wrote:
>>>>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>>>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com>
>>>>>> wrote:
>>>>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>>>>> Similar to folio_referenced_one(), we can apply batched unmapping
>>>>>>>> for file
>>>>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>>>>
>>>>>>>> Barry previously implemented batched unmapping for lazyfree
>>>>>>>> anonymous large
>>>>>>>> folios[1] and did not further optimize anonymous large folios or
>>>>>>>> file-backed
>>>>>>>> large folios at that stage. As for file-backed large folios, the
>>>>>>>> batched
>>>>>>>> unmapping support is relatively straightforward, as we only need
>>>>>>>> to clear
>>>>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>>>>
>>>>>>>> Performance testing:
>>>>>>>> Allocate 10G clean file-backed folios by mmap() in a memory
>>>>>>>> cgroup, and try to
>>>>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I
>>>>>>>> can observe
>>>>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+
>>>>>>>> improvement
>>>>>>>> on my X86 machine) with this patch.
>>>>>>>>
>>>>>>>> W/o patch:
>>>>>>>> real    0m1.018s
>>>>>>>> user    0m0.000s
>>>>>>>> sys     0m1.018s
>>>>>>>>
>>>>>>>> W/ patch:
>>>>>>>> real   0m0.249s
>>>>>>>> user   0m0.000s
>>>>>>>> sys    0m0.249s
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>> ---
>>>>>>>> mm/rmap.c | 7 ++++---
>>>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>>>>> --- a/mm/rmap.c
>>>>>>>> +++ b/mm/rmap.c
>>>>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int
>>>>>>>> folio_unmap_pte_batch(struct folio *folio,
>>>>>>>>        end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>>>>        max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>>>
>>>>>>>> -      /* We only support lazyfree batching for now ... */
>>>>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>>>> +      /* We only support lazyfree or file folios batching for now
>>>>>>>> ... */
>>>>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>>>>                return 1;
>>>>>>>> +
>>>>>>>>        if (pte_unused(pte))
>>>>>>>>                return 1;
>>>>>>>>
>>>>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio
>>>>>>>> *folio, struct vm_area_struct *vma,
>>>>>>>>                         *
>>>>>>>>                         * See Documentation/mm/mmu_notifier.rst
>>>>>>>>                         */
>>>>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>>>>> +                      add_mm_counter(mm, mm_counter_file(folio),
>>>>>>>> -nr_pages);
>>>>>>>>                }
>>>>>>>> discard:
>>>>>>>>                if (unlikely(folio_test_hugetlb(folio))) {
>>>>>>>> -- 
>>>>>>>> 2.47.3
>>>>>>>>
>>>>>>> Hi, Baolin
>>>>>>>
>>>>>>> When reading your patch, I come up one small question.
>>>>>>>
>>>>>>> Current try_to_unmap_one() has following structure:
>>>>>>>
>>>>>>>      try_to_unmap_one()
>>>>>>>          while (page_vma_mapped_walk(&pvmw)) {
>>>>>>>              nr_pages = folio_unmap_pte_batch()
>>>>>>>
>>>>>>>              if (nr_pages = folio_nr_pages(folio))
>>>>>>>                  goto walk_done;
>>>>>>>          }
>>>>>>>
>>>>>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>>>>
>>>>>>> If my understanding is correct, page_vma_mapped_walk() would start
>>>>>>> from
>>>>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already
>>>>>>> cleared to
>>>>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>>>>
>>>>>>> Not sure my understanding is correct, if so do we have some reason
>>>>>>> not to
>>>>>>> skip the cleared range?
>>>>>> I don’t quite understand your question. For nr_pages > 1 but not equal
>>>>>> to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs
>>>>>> inside.
>>>>>>
>>>>>> take a look:
>>>>>>
>>>>>> next_pte:
>>>>>>                 do {
>>>>>>                         pvmw->address += PAGE_SIZE;
>>>>>>                         if (pvmw->address >= end)
>>>>>>                                 return not_found(pvmw);
>>>>>>                         /* Did we cross page table boundary? */
>>>>>>                         if ((pvmw->address & (PMD_SIZE - PAGE_SIZE))
>>>>>> == 0) {
>>>>>>                                 if (pvmw->ptl) {
>>>>>>                                         spin_unlock(pvmw->ptl);
>>>>>>                                         pvmw->ptl = NULL;
>>>>>>                                 }
>>>>>>                                 pte_unmap(pvmw->pte);
>>>>>>                                 pvmw->pte = NULL;
>>>>>>                                 pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>>>>                                 goto restart;
>>>>>>                         }
>>>>>>                         pvmw->pte++;
>>>>>>                 } while (pte_none(ptep_get(pvmw->pte)));
>>>>>>
>>>>> Yes, we do it in page_vma_mapped_walk() now. Since they are
>>>>> pte_none(), they
>>>>> will be skipped.
>>>>>
>>>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 9e5bd4834481..ea1afec7c802 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio
>>>>> *folio, struct vm_area_struct *vma,
>>>>>                 */
>>>>>                if (nr_pages == folio_nr_pages(folio))
>>>>>                        goto walk_done;
>>>>> +             else {
>>>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>>>> +                     pvmw.pte += nr_pages - 1;
>>>>> +             }
>>>>>                continue;
>>>>>   walk_abort:
>>>>>                ret = false;
>>>> I am of the opinion that we should do something like this. In the
>>>> internal pvmw code,
>>> I am still not convinced that skipping PTEs in try_to_unmap_one()
>>> is the right place. If we really want to skip certain PTEs early,
>>> should we instead hint page_vma_mapped_walk()? That said, I don't
>>> see much value in doing so, since in most cases nr is either 1 or
>>> folio_nr_pages(folio).
>>>
>>>> we keep skipping ptes till the ptes are none. With my proposed
>>>> uffd-fix [1], if the old
>>>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert
>>>> all ptes from none
>>>> to not none, and we will lose the batching effect. I also plan to
>>>> extend support to
>>>> anonymous folios (therefore generalizing for all types of memory)
>>>> which will set a
>>>> batch of ptes as swap, and the internal pvmw code won't be able to
>>>> skip through the
>>>> batch.
>>> Thanks for catching this, Dev. I already filter out some of the more
>>> complex cases, for example:
>>> if (pte_unused(pte))
>>>          return 1;
>>>
>>> Since the userfaultfd write-protection case is also a corner case,
>>> could we filter it out as well?
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index c86f1135222b..6bb8ba6f046e 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1870,6 +1870,9 @@ static inline unsigned int
>>> folio_unmap_pte_batch(struct folio *folio,
>>>          if (pte_unused(pte))
>>>                  return 1;
>>>
>>> +       if (userfaultfd_wp(vma))
>>> +               return 1;
>>> +
>>>          return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>>> }
>>>
>>> Just offering a second option — yours is probably better.
>>
>> No. This is not an edge case. This is a case which gets exposed by your
>> work, and
>> I believe that if you intend to get the file folio batching thingy in,
>> then you
>> need to fix the uffd stuff too.
>
> Barry’s point isn’t that this is an edge case. I think he means that uffd
> is not a common performance-sensitive scenario in production. Also, we
> typically fall back to per-page handling for uffd cases (see
> finish_fault() and alloc_anon_folio()). So I prefer to follow Barry’s
> suggestion and filter out the uffd cases until we have a test case that shows
> performance improvement. 

I am of the opinion that you are making the wrong analogy here. The
per-page fault fidelity is *required* for uffd.

When you say you want to support file folio batched unmapping, I think it's
inappropriate to say "let us refuse to batch if the pte mapping the file
folio is smeared with a particular bit, and consider it a totally different
case". Instead of getting folio (all memory types) batched unmapping in, we
have already broken this down into "lazyfree folio", then "file folio", with
"anon folio" remaining. Now you intend to break "file folio" further into
"file folio non-uffd" and "file folio uffd".

Just my 2C, I don't opine strongly here.


>
> I also think you can continue iterating your patch[1] to support batched
> unmapping for uffd VMAs, and provide data to evaluate its value.
>
> [1]
> https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-01-19  6:36                 ` Dev Jain
@ 2026-01-19  7:22                   ` Baolin Wang
  0 siblings, 0 replies; 52+ messages in thread
From: Baolin Wang @ 2026-01-19  7:22 UTC (permalink / raw)
  To: Dev Jain, Barry Song
  Cc: Wei Yang, akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, linux-mm, linux-arm-kernel,
	linux-kernel



On 1/19/26 2:36 PM, Dev Jain wrote:
> 
> On 19/01/26 11:20 am, Baolin Wang wrote:
>>
>>
>> On 1/18/26 1:46 PM, Dev Jain wrote:
>>>
>>> On 16/01/26 7:58 pm, Barry Song wrote:
>>>> On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@arm.com> wrote:
>>>>>
>>>>> On 07/01/26 7:16 am, Wei Yang wrote:
>>>>>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>>>>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@gmail.com>
>>>>>>> wrote:
>>>>>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>>>>>> Similar to folio_referenced_one(), we can apply batched unmapping
>>>>>>>>> for file
>>>>>>>>> large folios to optimize the performance of file folios reclamation.
>>>>>>>>>
>>>>>>>>> Barry previously implemented batched unmapping for lazyfree
>>>>>>>>> anonymous large
>>>>>>>>> folios[1] and did not further optimize anonymous large folios or
>>>>>>>>> file-backed
>>>>>>>>> large folios at that stage. As for file-backed large folios, the
>>>>>>>>> batched
>>>>>>>>> unmapping support is relatively straightforward, as we only need
>>>>>>>>> to clear
>>>>>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>>>>>
>>>>>>>>> Performance testing:
>>>>>>>>> Allocate 10G clean file-backed folios by mmap() in a memory
>>>>>>>>> cgroup, and try to
>>>>>>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I
>>>>>>>>> can observe
>>>>>>>>> 75% performance improvement on my Arm64 32-core server (and 50%+
>>>>>>>>> improvement
>>>>>>>>> on my X86 machine) with this patch.
>>>>>>>>>
>>>>>>>>> W/o patch:
>>>>>>>>> real    0m1.018s
>>>>>>>>> user    0m0.000s
>>>>>>>>> sys     0m1.018s
>>>>>>>>>
>>>>>>>>> W/ patch:
>>>>>>>>> real   0m0.249s
>>>>>>>>> user   0m0.000s
>>>>>>>>> sys    0m0.249s
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>> Acked-by: Barry Song <baohua@kernel.org>
>>>>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>> ---
>>>>>>>>> mm/rmap.c | 7 ++++---
>>>>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>>>>>> --- a/mm/rmap.c
>>>>>>>>> +++ b/mm/rmap.c
>>>>>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int
>>>>>>>>> folio_unmap_pte_batch(struct folio *folio,
>>>>>>>>>         end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>>>>>         max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>>>>
>>>>>>>>> -      /* We only support lazyfree batching for now ... */
>>>>>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>>>>> +      /* We only support lazyfree or file folios batching for now
>>>>>>>>> ... */
>>>>>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>>>>>                 return 1;
>>>>>>>>> +
>>>>>>>>>         if (pte_unused(pte))
>>>>>>>>>                 return 1;
>>>>>>>>>
>>>>>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio
>>>>>>>>> *folio, struct vm_area_struct *vma,
>>>>>>>>>                          *
>>>>>>>>>                          * See Documentation/mm/mmu_notifier.rst
>>>>>>>>>                          */
>>>>>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>>>>>> +                      add_mm_counter(mm, mm_counter_file(folio),
>>>>>>>>> -nr_pages);
>>>>>>>>>                 }
>>>>>>>>> discard:
>>>>>>>>>                 if (unlikely(folio_test_hugetlb(folio))) {
>>>>>>>>> -- 
>>>>>>>>> 2.47.3
>>>>>>>>>
>>>>>>>> Hi, Baolin
>>>>>>>>
>>>>>>>> When reading your patch, I come up one small question.
>>>>>>>>
>>>>>>>> Current try_to_unmap_one() has following structure:
>>>>>>>>
>>>>>>>>       try_to_unmap_one()
>>>>>>>>           while (page_vma_mapped_walk(&pvmw)) {
>>>>>>>>               nr_pages = folio_unmap_pte_batch()
>>>>>>>>
>>>>>>>>               if (nr_pages = folio_nr_pages(folio))
>>>>>>>>                   goto walk_done;
>>>>>>>>           }
>>>>>>>>
>>>>>>>> I am thinking what if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>>>>>
>>>>>>>> If my understanding is correct, page_vma_mapped_walk() would start
>>>>>>>> from
>>>>>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already
>>>>>>>> cleared to
>>>>>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>>>>>
>>>>>>>> Not sure my understanding is correct, if so do we have some reason
>>>>>>>> not to
>>>>>>>> skip the cleared range?
>>>>>>> I don’t quite understand your question. For nr_pages > 1 but not equal
>>>>>>> to nr_pages, page_vma_mapped_walk will skip the nr_pages - 1 PTEs
>>>>>>> inside.
>>>>>>>
>>>>>>> take a look:
>>>>>>>
>>>>>>> next_pte:
>>>>>>>                  do {
>>>>>>>                          pvmw->address += PAGE_SIZE;
>>>>>>>                          if (pvmw->address >= end)
>>>>>>>                                  return not_found(pvmw);
>>>>>>>                          /* Did we cross page table boundary? */
>>>>>>>                          if ((pvmw->address & (PMD_SIZE - PAGE_SIZE))
>>>>>>> == 0) {
>>>>>>>                                  if (pvmw->ptl) {
>>>>>>>                                          spin_unlock(pvmw->ptl);
>>>>>>>                                          pvmw->ptl = NULL;
>>>>>>>                                  }
>>>>>>>                                  pte_unmap(pvmw->pte);
>>>>>>>                                  pvmw->pte = NULL;
>>>>>>>                                  pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>>>>>                                  goto restart;
>>>>>>>                          }
>>>>>>>                          pvmw->pte++;
>>>>>>>                  } while (pte_none(ptep_get(pvmw->pte)));
>>>>>>>
>>>>>> Yes, we do it in page_vma_mapped_walk() now. Since they are
>>>>>> pte_none(), they
>>>>>> will be skipped.
>>>>>>
>>>>>> I mean maybe we can skip it in try_to_unmap_one(), for example:
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 9e5bd4834481..ea1afec7c802 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio
>>>>>> *folio, struct vm_area_struct *vma,
>>>>>>                  */
>>>>>>                 if (nr_pages == folio_nr_pages(folio))
>>>>>>                         goto walk_done;
>>>>>> +             else {
>>>>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>>>>> +                     pvmw.pte += nr_pages - 1;
>>>>>> +             }
>>>>>>                 continue;
>>>>>>    walk_abort:
>>>>>>                 ret = false;
>>>>> I am of the opinion that we should do something like this. In the
>>>>> internal pvmw code,
>>>> I am still not convinced that skipping PTEs in try_to_unmap_one()
>>>> is the right place. If we really want to skip certain PTEs early,
>>>> should we instead hint page_vma_mapped_walk()? That said, I don't
>>>> see much value in doing so, since in most cases nr is either 1 or
>>>> folio_nr_pages(folio).
>>>>
>>>>> we keep skipping ptes till the ptes are none. With my proposed
>>>>> uffd-fix [1], if the old
>>>>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert
>>>>> all ptes from none
>>>>> to not none, and we will lose the batching effect. I also plan to
>>>>> extend support to
>>>>> anonymous folios (therefore generalizing for all types of memory)
>>>>> which will set a
>>>>> batch of ptes as swap, and the internal pvmw code won't be able to
>>>>> skip through the
>>>>> batch.
>>>> Thanks for catching this, Dev. I already filter out some of the more
>>>> complex cases, for example:
>>>> if (pte_unused(pte))
>>>>           return 1;
>>>>
>>>> Since the userfaultfd write-protection case is also a corner case,
>>>> could we filter it out as well?
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index c86f1135222b..6bb8ba6f046e 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1870,6 +1870,9 @@ static inline unsigned int
>>>> folio_unmap_pte_batch(struct folio *folio,
>>>>           if (pte_unused(pte))
>>>>                   return 1;
>>>>
>>>> +       if (userfaultfd_wp(vma))
>>>> +               return 1;
>>>> +
>>>>           return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>>>> }
>>>>
>>>> Just offering a second option — yours is probably better.
>>>
>>> No. This is not an edge case. This is a case which gets exposed by your
>>> work, and
>>> I believe that if you intend to get the file folio batching thingy in,
>>> then you
>>> need to fix the uffd stuff too.
>>
>> Barry’s point isn’t that this is an edge case. I think he means that uffd
>> is not a common performance-sensitive scenario in production. Also, we
>> typically fall back to per-page handling for uffd cases (see
>> finish_fault() and alloc_anon_folio()). So I prefer to follow Barry’s
>> suggestion and filter out the uffd cases until we have a test case that shows
>> performance improvement.
> 
> I am of the opinion that you are making the wrong analogy here. The
> per-page fault fidelity is *required* for uffd.
> 
> When you say you want to support file folio batched unmapping, I think it's
> inappropriate to say "let us refuse to batch if the pte mapping the file
> folio is smeared with a particular bit, and consider it a totally different
> case". Instead of getting folio (all memory types) batched unmapping in, we
> have already broken this down into "lazyfree folio", then "file folio", with
> "anon folio" remaining. Now you intend to break "file folio" further into
> "file folio non-uffd" and "file folio uffd".

At least for me, I think this is a reasonable approach: break a complex 
problem into smaller features and address them step by step (possibly by 
different contributors in the community). This makes it easier for 
reviewers to focus and discuss. You can see that batched unmapping for 
anonymous folios still has ongoing discussion.

As I mentioned, since uffd is not a common performance-sensitive 
scenario in production, we need to continue discussing whether we 
actually need to support batched unmapping for uffd, and support the 
decision with technical feedback and performance data. So I’d prefer to 
discuss it in a separate patch.

David and Lorenzo, what do you think?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2025-12-26  6:07 ` [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
@ 2026-01-28 11:47   ` Chris Mason
  2026-01-29  1:42     ` Baolin Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Chris Mason @ 2026-01-28 11:47 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, baohua, dev.jain, linux-mm,
	linux-arm-kernel, linux-kernel

Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
> batched checking of young flags and TLB flushing, improving performance during
> large folio reclamation.
> 
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
> from approximately 35% to around 5%.

Hi everyone, I ran mm-new through my AI review prompts and this one was
flagged.  AI review below:

> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>  	return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
>  }
>
> +#define clear_flush_young_ptes clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> +					 unsigned long addr, pte_t *ptep,
> +					 unsigned int nr)
> +{
> +	if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
> +		return __ptep_clear_flush_young(vma, addr, ptep);

Should this be checking !pte_valid_cont() instead of !pte_cont()?

The existing ptep_clear_flush_young() above uses !pte_valid_cont() to
determine when to take the fast path. The new function only checks
!pte_cont(), which differs when handling non-present PTEs.

Non-present PTEs (device-private, device-exclusive) can reach
clear_flush_young_ptes() through folio_referenced_one()->
clear_flush_young_ptes_notify(). These entries may have bit 52 set as
part of their encoding, but they aren't valid contiguous mappings.

With the current check, wouldn't such entries incorrectly trigger the
contpte path and potentially cause contpte_clear_flush_young_ptes() to
process additional unrelated PTEs beyond the intended single entry?
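
For reference, my understanding of the arm64 helpers involved (approximate;
see arch/arm64/include/asm/pgtable.h for the authoritative definitions):

	#define pte_cont(pte)		(!!(pte_val(pte) & PTE_CONT))
	#define pte_valid(pte)		(!!(pte_val(pte) & PTE_VALID))
	#define pte_valid_cont(pte)	(pte_valid(pte) && pte_cont(pte))

i.e. pte_valid_cont() additionally requires the entry to be valid, which
rules out the non-present encodings mentioned above.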

> +
> +	return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
> +}
> +
>  #define wrprotect_ptes wrprotect_ptes
>  static __always_inline void wrprotect_ptes(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep, unsigned int nr)



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2026-01-28 11:47   ` Chris Mason
@ 2026-01-29  1:42     ` Baolin Wang
  2026-02-09  9:09       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2026-01-29  1:42 UTC (permalink / raw)
  To: Chris Mason
  Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, baohua, dev.jain, linux-mm,
	linux-arm-kernel, linux-kernel



On 1/28/26 7:47 PM, Chris Mason wrote:
> Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>> Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
>> batched checking of young flags and TLB flushing, improving performance during
>> large folio reclamation.
>>
>> Performance testing:
>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
>> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
>> from approximately 35% to around 5%.
> 
> Hi everyone, I ran mm-new through my AI review prompts and this one was
> flagged.  AI review below:
> 
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>   	return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
>>   }
>>
>> +#define clear_flush_young_ptes clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> +					 unsigned long addr, pte_t *ptep,
>> +					 unsigned int nr)
>> +{
>> +	if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
>> +		return __ptep_clear_flush_young(vma, addr, ptep);
> 
> Should this be checking !pte_valid_cont() instead of !pte_cont()?
> 
> The existing ptep_clear_flush_young() above uses !pte_valid_cont() to
> determine when to take the fast path. The new function only checks
> !pte_cont(), which differs when handling non-present PTEs.
> 
> Non-present PTEs (device-private, device-exclusive) can reach
> clear_flush_young_ptes() through folio_referenced_one()->
> clear_flush_young_ptes_notify(). These entries may have bit 52 set as
> part of their encoding, but they aren't valid contiguous mappings.
> 
> With the current check, wouldn't such entries incorrectly trigger the
> contpte path and potentially cause contpte_clear_flush_young_ptes() to
> process additional unrelated PTEs beyond the intended single entry?

Indeed. I previously discussed with Ryan whether using pte_cont() was 
enough, and we believed that invalid PTEs wouldn’t have the PTE_CONT bit 
set. But we clearly missed the device-folio cases. Thanks for reporting.

Andrew, could you please squash the following fix into this patch? If 
you prefer a new version, please let me know. Thanks.

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index a17eb8a76788..dc16591c4241 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1843,7 +1843,7 @@ static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
 					 unsigned long addr, pte_t *ptep,
 					 unsigned int nr)
 {
-	if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
+	if (likely(nr == 1 && !pte_valid_cont(__ptep_get(ptep))))
 		return __ptep_clear_flush_young(vma, addr, ptep);
 
 	return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 1/5] mm: rmap: support batched checks of the references for large folios
  2025-12-26  6:07 ` [PATCH v5 1/5] mm: rmap: support batched checks of the references " Baolin Wang
  2026-01-07  6:01   ` Harry Yoo
@ 2026-02-09  8:49   ` David Hildenbrand (Arm)
  2026-02-09  9:14     ` Baolin Wang
  1 sibling, 1 reply; 52+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09  8:49 UTC (permalink / raw)
  To: Baolin Wang, akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel

On 12/26/25 07:07, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
> 
> Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
> 
> Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
> of the young flags and flushing TLB entries, thereby improving performance
> during large folio reclamation. And it will be overridden by the architecture
> that implements a more efficient batch operation in the following patches.
> 
> While we are at it, rename ptep_clear_flush_young_notify() to
> clear_flush_young_ptes_notify() to indicate that this is a batch operation.
> 
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   include/linux/mmu_notifier.h |  9 +++++----
>   include/linux/pgtable.h      | 31 +++++++++++++++++++++++++++++++
>   mm/rmap.c                    | 31 ++++++++++++++++++++++++++++---
>   3 files changed, 64 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d1094c2d5fb6..07a2bbaf86e9 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>   	range->owner = owner;
>   }
>   
> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr)	\
>   ({									\
>   	int __young;							\
>   	struct vm_area_struct *___vma = __vma;				\
>   	unsigned long ___address = __address;				\
> -	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
> +	unsigned int ___nr = __nr;					\
> +	__young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr);	\
>   	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
>   						  ___address,		\
>   						  ___address +		\
> -							PAGE_SIZE);	\
> +						  ___nr * PAGE_SIZE);	\
>   	__young;							\
>   })
>   

Man, that's ugly. Not your fault, but can this possibly be turned into an 
inline function in a follow-up patch?
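
Something along these lines, I suppose (a completely untested sketch, just
to illustrate the shape):

static inline int clear_flush_young_ptes_notify(struct vm_area_struct *vma,
						unsigned long address,
						pte_t *ptep, unsigned int nr)
{
	int young = clear_flush_young_ptes(vma, address, ptep, nr);

	young |= mmu_notifier_clear_flush_young(vma->vm_mm, address,
						address + nr * PAGE_SIZE);
	return young;
}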

[...]

>   
> +#ifndef clear_flush_young_ptes
> +/**
> + * clear_flush_young_ptes - Clear the access bit and perform a TLB flush for PTEs
> + *			    that map consecutive pages of the same folio.

With clear_young_dirty_ptes() description in mind, this should probably 
be "Mark PTEs that map consecutive pages of the same folio as clean and 
flush the TLB" ?

> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear access bit.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_clear_flush_young().
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> +					 unsigned long addr, pte_t *ptep,
> +					 unsigned int nr)

Two-tab alignment on second+ line like all similar functions here.

> +{
> +	int i, young = 0;
> +
> +	for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE)
> +		young |= ptep_clear_flush_young(vma, addr, ptep);
> +

Why don't we use a loop similar to the one in clear_young_dirty_ptes() or 
clear_full_ptes() etc.? It's not only consistent but also optimizes out 
the first check for nr.


for (;;) {
	young |= ptep_clear_flush_young(vma, addr, ptep);
	if (--nr == 0)
		break;
	ptep++;
	addr += PAGE_SIZE;
}
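
i.e. the generic fallback in this hunk would then look something like
(sketch only):

static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
					 unsigned long addr, pte_t *ptep,
					 unsigned int nr)
{
	int young = 0;

	for (;;) {
		young |= ptep_clear_flush_young(vma, addr, ptep);
		if (--nr == 0)
			break;
		ptep++;
		addr += PAGE_SIZE;
	}

	return young;
}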

> +	return young;
> +}
> +#endif
> +
>   /*
>    * On some architectures hardware does not set page access bit when accessing
>    * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e805ddc5a27b..985ab0b085ba 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -828,9 +828,11 @@ static bool folio_referenced_one(struct folio *folio,
>   	struct folio_referenced_arg *pra = arg;
>   	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>   	int ptes = 0, referenced = 0;
> +	unsigned int nr;
>   
>   	while (page_vma_mapped_walk(&pvmw)) {
>   		address = pvmw.address;
> +		nr = 1;
>   
>   		if (vma->vm_flags & VM_LOCKED) {
>   			ptes++;
> @@ -875,9 +877,24 @@ static bool folio_referenced_one(struct folio *folio,
>   			if (lru_gen_look_around(&pvmw))
>   				referenced++;
>   		} else if (pvmw.pte) {
> -			if (ptep_clear_flush_young_notify(vma, address,
> -						pvmw.pte))
> +			if (folio_test_large(folio)) {
> +				unsigned long end_addr =
> +					pmd_addr_end(address, vma->vm_end);
> +				unsigned int max_nr =
> +					(end_addr - address) >> PAGE_SHIFT;

Good news: you can fit both into a single line, as we are allowed to 
exceed 80 columns if it aids readability.

> +				pte_t pteval = ptep_get(pvmw.pte);
> +
> +				nr = folio_pte_batch(folio, pvmw.pte,
> +						     pteval, max_nr);
> +			}
> +
> +			ptes += nr;

I'm not sure whether we should mess with the "ptes" variable, which is 
so far only used for VM_LOCKED vmas. See below; maybe we can just 
avoid that.

> +			if (clear_flush_young_ptes_notify(vma, address,
> +						pvmw.pte, nr))

Could maybe fit that into a single line as well, whatever you prefer.

>   				referenced++;
> +			/* Skip the batched PTEs */
> +			pvmw.pte += nr - 1;
> +			pvmw.address += (nr - 1) * PAGE_SIZE;
>   		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>   			if (pmdp_clear_flush_young_notify(vma, address,
>   						pvmw.pmd))
> @@ -887,7 +904,15 @@ static bool folio_referenced_one(struct folio *folio,
>   			WARN_ON_ONCE(1);
>   		}
>   
> -		pra->mapcount--;
> +		pra->mapcount -= nr;
> +		/*
> +		 * If we are sure that we batched the entire folio,
> +		 * we can just optimize and stop right here.
> +		 */
> +		if (ptes == pvmw.nr_pages) {
> +			page_vma_mapped_walk_done(&pvmw);
> +			break;
> +		}
Why not check for !pra->mapcount? Then you can also drop the comment, 
because it's exactly the same thing we check after the loop to indicate 
what to return to the caller.

And then you won't have to mess with the "ptes" variable either.
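
i.e. something like (sketch):

		pra->mapcount -= nr;
		if (!pra->mapcount) {
			page_vma_mapped_walk_done(&pvmw);
			break;
		}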



Only minor stuff.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 2/5] arm64: mm: factor out the address and ptep alignment into a new helper
  2025-12-26  6:07 ` [PATCH v5 2/5] arm64: mm: factor out the address and ptep alignment into a new helper Baolin Wang
@ 2026-02-09  8:50   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09  8:50 UTC (permalink / raw)
  To: Baolin Wang, akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel

On 12/26/25 07:07, Baolin Wang wrote:
> Factor out the contpte block's address and ptep alignment into a new helper,
> and will be reused in the following patch.
> 
> No functional changes.
> 
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---

Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 3/5] arm64: mm: support batch clearing of the young flag for large folios
  2025-12-26  6:07 ` [PATCH v5 3/5] arm64: mm: support batch clearing of the young flag for large folios Baolin Wang
  2026-01-02 12:21   ` Ryan Roberts
@ 2026-02-09  9:02   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 52+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09  9:02 UTC (permalink / raw)
  To: Baolin Wang, akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel

On 12/26/25 07:07, Baolin Wang wrote:
> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
> only clear the young flag and flush TLBs for PTEs within the contiguous range.
> To support batch PTE operations for other sized large folios in the following
> patches, adding a new parameter to specify the number of PTEs that map consecutive
> pages of the same large folio in a single VMA and a single page table.
> 
> While we are at it, rename the functions to maintain consistency with other
> contpte_*() functions.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---

Nothing jumped at me :) Reusing contpte_align_addr_ptep() makes it a lot 
clearer.

Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2026-01-29  1:42     ` Baolin Wang
@ 2026-02-09  9:09       ` David Hildenbrand (Arm)
  2026-02-09  9:36         ` Baolin Wang
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09  9:09 UTC (permalink / raw)
  To: Baolin Wang, Chris Mason
  Cc: akpm, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
	jannh, willy, baohua, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel

On 1/29/26 02:42, Baolin Wang wrote:
> 
> 
> On 1/28/26 7:47 PM, Chris Mason wrote:
>> Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>> Implement the Arm64 architecture-specific clear_flush_young_ptes() to 
>>> enable
>>> batched checking of young flags and TLB flushing, improving 
>>> performance during
>>> large folio reclamation.
>>>
>>> Performance testing:
>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, 
>>> and try to
>>> reclaim 8G file-backed folios via the memory.reclaim interface. I can 
>>> observe
>>> 33% performance improvement on my Arm64 32-core server (and 10%+ 
>>> improvement
>>> on my X86 machine). Meanwhile, the hotspot folio_check_references() 
>>> dropped
>>> from approximately 35% to around 5%.
>>
>> Hi everyone, I ran mm-new through my AI review prompts and this one was
>> flagged.  AI review below:
>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/ 
>>> asm/pgtable.h
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -1838,6 +1838,17 @@ static inline int 
>>> ptep_clear_flush_young(struct vm_area_struct *vma,
>>>       return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
>>>   }
>>>
>>> +#define clear_flush_young_ptes clear_flush_young_ptes
>>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>>> +                     unsigned long addr, pte_t *ptep,
>>> +                     unsigned int nr)
>>> +{
>>> +    if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
>>> +        return __ptep_clear_flush_young(vma, addr, ptep);
>>
>> Should this be checking !pte_valid_cont() instead of !pte_cont()?
>>
>> The existing ptep_clear_flush_young() above uses !pte_valid_cont() to
>> determine when to take the fast path. The new function only checks
>> !pte_cont(), which differs when handling non-present PTEs.
>>
>> Non-present PTEs (device-private, device-exclusive) can reach
>> clear_flush_young_ptes() through folio_referenced_one()->
>> clear_flush_young_ptes_notify(). These entries may have bit 52 set as
>> part of their encoding, but they aren't valid contiguous mappings.
>>
>> With the current check, wouldn't such entries incorrectly trigger the
>> contpte path and potentially cause contpte_clear_flush_young_ptes() to
>> process additional unrelated PTEs beyond the intended single entry?
> 
> Indeed. I previously discussed with Ryan whether using pte_cont() was 
> enough, and we believed that invalid PTEs wouldn’t have the PTE_CONT bit 
> set. But we clearly missed the device-folio cases. Thanks for reporting.
> 
> Andrew, could you please squash the following fix into this patch? If 
> you prefer a new version, please let me know. Thanks.

Isn't the real problem that we should never ever ever ever, try clearing 
the young bit on a non-present pte?

See damon_ptep_mkold() how that is handled with the flushing/notify.

There needs to be a pte_present() check in the caller.


BUT

I recall that folio_referenced() should never apply to ZONE_DEVICE 
folios. folio_referenced() is only called from memory reclaim code, and 
ZONE_DEVICE pages never get reclaimed through vmscan.c

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 1/5] mm: rmap: support batched checks of the references for large folios
  2026-02-09  8:49   ` David Hildenbrand (Arm)
@ 2026-02-09  9:14     ` Baolin Wang
  2026-02-09  9:20       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2026-02-09  9:14 UTC (permalink / raw)
  To: David Hildenbrand (Arm), akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel



On 2/9/26 4:49 PM, David Hildenbrand (Arm) wrote:
> On 12/26/25 07:07, Baolin Wang wrote:
>> Currently, folio_referenced_one() always checks the young flag for 
>> each PTE
>> sequentially, which is inefficient for large folios. This inefficiency is
>> especially noticeable when reclaiming clean file-backed large folios, 
>> where
>> folio_referenced() is observed as a significant performance hotspot.
>>
>> Moreover, on Arm64 architecture, which supports contiguous PTEs, there 
>> is already
>> an optimization to clear the young flags for PTEs within a contiguous 
>> range.
>> However, this is not sufficient. We can extend this to perform batched 
>> operations
>> for the entire large folio (which might exceed the contiguous range: 
>> CONT_PTE_SIZE).
>>
>> Introduce a new API: clear_flush_young_ptes() to facilitate batched 
>> checking
>> of the young flags and flushing TLB entries, thereby improving 
>> performance
>> during large folio reclamation. And it will be overridden by the 
>> architecture
>> that implements a more efficient batch operation in the following 
>> patches.
>>
>> While we are at it, rename ptep_clear_flush_young_notify() to
>> clear_flush_young_ptes_notify() to indicate that this is a batch 
>> operation.
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   include/linux/mmu_notifier.h |  9 +++++----
>>   include/linux/pgtable.h      | 31 +++++++++++++++++++++++++++++++
>>   mm/rmap.c                    | 31 ++++++++++++++++++++++++++++---
>>   3 files changed, 64 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d1094c2d5fb6..07a2bbaf86e9 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>>       range->owner = owner;
>>   }
>> -#define ptep_clear_flush_young_notify(__vma, __address, __ptep)        \
>> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, 
>> __nr)    \
>>   ({                                    \
>>       int __young;                            \
>>       struct vm_area_struct *___vma = __vma;                \
>>       unsigned long ___address = __address;                \
>> -    __young = ptep_clear_flush_young(___vma, ___address, __ptep);    \
>> +    unsigned int ___nr = __nr;                    \
>> +    __young = clear_flush_young_ptes(___vma, ___address, __ptep, 
>> ___nr);    \
>>       __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,    \
>>                             ___address,        \
>>                             ___address +        \
>> -                            PAGE_SIZE);    \
>> +                          ___nr * PAGE_SIZE);    \
>>       __young;                            \
>>   })
> 
> Man that's ugly, Not your fault, but can this possibly be turned into an 
> inline function in a follow-up patch.

Yes, the cleanup of these macros is already in my follow-up patch set.

>> +#ifndef clear_flush_young_ptes
>> +/**
>> + * clear_flush_young_ptes - Clear the access bit and perform a TLB 
>> flush for PTEs
>> + *                that map consecutive pages of the same folio.
> 
> With clear_young_dirty_ptes() description in mind, this should probably 
> be "Mark PTEs that map consecutive pages of the same folio as clean and 
> flush the TLB" ?

IMO, “clean” is confusing here, as it sounds like clearing the dirty bit to
make the folio clean.

>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to clear access bit.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a 
>> simple
>> + * loop over ptep_clear_flush_young().
>> + *
>> + * Note that PTE bits in the PTE range besides the PFN can differ. 
>> For example,
>> + * some PTEs might be write-protected.
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map 
>> consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same 
>> PMD.
>> + */
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> +                     unsigned long addr, pte_t *ptep,
>> +                     unsigned int nr)
> 
> Two-tab alignment on second+ line like all similar functions here.

Sure.

>> +{
>> +    int i, young = 0;
>> +
>> +    for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE)
>> +        young |= ptep_clear_flush_young(vma, addr, ptep);
>> +
> 
> Why don't we use a similar loop we use in clear_young_dirty_ptes() or 
> clear_full_ptes() etc? It's not only consistent but also optimizes out 
> the first check for nr.
> for (;;) {
>      young |= ptep_clear_flush_young(vma, addr, ptep);
>      if (--nr == 0)
>          break;
>      ptep++;
>      addr += PAGE_SIZE;
> }

We’ve discussed this loop pattern before [1], and it seems that people 
prefer the ‘for (;;)’ loop. Do you have a strong preference for changing 
it back?

[1]https://lore.kernel.org/all/ec49f0fe-9df8-4762-b315-240cbb1ed3ce@arm.com/

>> +    return young;
>> +}
>> +#endif
>> +
>>   /*
>>    * On some architectures hardware does not set page access bit when 
>> accessing
>>    * memory page, it is responsibility of software setting this bit. 
>> It brings
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index e805ddc5a27b..985ab0b085ba 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -828,9 +828,11 @@ static bool folio_referenced_one(struct folio 
>> *folio,
>>       struct folio_referenced_arg *pra = arg;
>>       DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>>       int ptes = 0, referenced = 0;
>> +    unsigned int nr;
>>       while (page_vma_mapped_walk(&pvmw)) {
>>           address = pvmw.address;
>> +        nr = 1;
>>           if (vma->vm_flags & VM_LOCKED) {
>>               ptes++;
>> @@ -875,9 +877,24 @@ static bool folio_referenced_one(struct folio 
>> *folio,
>>               if (lru_gen_look_around(&pvmw))
>>                   referenced++;
>>           } else if (pvmw.pte) {
>> -            if (ptep_clear_flush_young_notify(vma, address,
>> -                        pvmw.pte))
>> +            if (folio_test_large(folio)) {
>> +                unsigned long end_addr =
>> +                    pmd_addr_end(address, vma->vm_end);
>> +                unsigned int max_nr =
>> +                    (end_addr - address) >> PAGE_SHIFT;
> 
> Good news: you can both fit into a single line as we are allowed to 
> exceed 80c if it aids readability.

Sure.

>> +                pte_t pteval = ptep_get(pvmw.pte);
>> +
>> +                nr = folio_pte_batch(folio, pvmw.pte,
>> +                             pteval, max_nr);
>> +            }
>> +
>> +            ptes += nr;
> 
> I'm not sure about whether we should mess with the "ptes" variable that 
> is so far only used for VM_LOCKED vmas. See below, maybe we can just 
> avoid that.

See below.

> 
>> +            if (clear_flush_young_ptes_notify(vma, address,
>> +                        pvmw.pte, nr))
> 
> Could maybe fit that into a single line as well, whatever you prefer.

Sure.

>>                   referenced++;
>> +            /* Skip the batched PTEs */
>> +            pvmw.pte += nr - 1;
>> +            pvmw.address += (nr - 1) * PAGE_SIZE;
>>           } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>>               if (pmdp_clear_flush_young_notify(vma, address,
>>                           pvmw.pmd))
>> @@ -887,7 +904,15 @@ static bool folio_referenced_one(struct folio 
>> *folio,
>>               WARN_ON_ONCE(1);
>>           }
>> -        pra->mapcount--;
>> +        pra->mapcount -= nr;
>> +        /*
>> +         * If we are sure that we batched the entire folio,
>> +         * we can just optimize and stop right here.
>> +         */
>> +        if (ptes == pvmw.nr_pages) {
>> +            page_vma_mapped_walk_done(&pvmw);
>> +            break;
>> +        }
> Why not check for !pra->mapcount? Then you can also drop the comment, 
> because it's exactly the same thing we check after the loop to indicate 
> what to return to the caller.
> 
> And you will not have to mess with the "ptes" variable?

We can't rely on pra->mapcount here, because a folio can be mapped in 
multiple VMAs. Even if the pra->mapcount is not zero, we can still call 
page_vma_mapped_walk_done() for the current VMA mapping when the entire 
folio is batched.
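
To make the distinction explicit, here are the same lines as in the hunk
above, with comments added purely for illustration:

	ptes += nr;		/* PTEs handled for *this* VMA mapping only */
	pra->mapcount -= nr;	/* mappings remaining across *all* VMAs */
	/* This VMA mapped the entire folio, so its walk is done. */
	if (ptes == pvmw.nr_pages) {
		page_vma_mapped_walk_done(&pvmw);
		break;
	}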

> Only minor stuff.

Thanks for taking a look.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 1/5] mm: rmap: support batched checks of the references for large folios
  2026-02-09  9:14     ` Baolin Wang
@ 2026-02-09  9:20       ` David Hildenbrand (Arm)
  2026-02-09  9:25         ` Baolin Wang
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09  9:20 UTC (permalink / raw)
  To: Baolin Wang, akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel

On 2/9/26 10:14, Baolin Wang wrote:
> 
> 
> On 2/9/26 4:49 PM, David Hildenbrand (Arm) wrote:
>> On 12/26/25 07:07, Baolin Wang wrote:
>>> Currently, folio_referenced_one() always checks the young flag for 
>>> each PTE
>>> sequentially, which is inefficient for large folios. This 
>>> inefficiency is
>>> especially noticeable when reclaiming clean file-backed large folios, 
>>> where
>>> folio_referenced() is observed as a significant performance hotspot.
>>>
>>> Moreover, on Arm64 architecture, which supports contiguous PTEs, 
>>> there is already
>>> an optimization to clear the young flags for PTEs within a contiguous 
>>> range.
>>> However, this is not sufficient. We can extend this to perform 
>>> batched operations
>>> for the entire large folio (which might exceed the contiguous range: 
>>> CONT_PTE_SIZE).
>>>
>>> Introduce a new API: clear_flush_young_ptes() to facilitate batched 
>>> checking
>>> of the young flags and flushing TLB entries, thereby improving 
>>> performance
>>> during large folio reclamation. And it will be overridden by the 
>>> architecture
>>> that implements a more efficient batch operation in the following 
>>> patches.
>>>
>>> While we are at it, rename ptep_clear_flush_young_notify() to
>>> clear_flush_young_ptes_notify() to indicate that this is a batch 
>>> operation.
>>>
>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>   include/linux/mmu_notifier.h |  9 +++++----
>>>   include/linux/pgtable.h      | 31 +++++++++++++++++++++++++++++++
>>>   mm/rmap.c                    | 31 ++++++++++++++++++++++++++++---
>>>   3 files changed, 64 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>>> index d1094c2d5fb6..07a2bbaf86e9 100644
>>> --- a/include/linux/mmu_notifier.h
>>> +++ b/include/linux/mmu_notifier.h
>>> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>>>       range->owner = owner;
>>>   }
>>> -#define ptep_clear_flush_young_notify(__vma, __address, 
>>> __ptep)        \
>>> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, 
>>> __nr)    \
>>>   ({                                    \
>>>       int __young;                            \
>>>       struct vm_area_struct *___vma = __vma;                \
>>>       unsigned long ___address = __address;                \
>>> -    __young = ptep_clear_flush_young(___vma, ___address, __ptep);    \
>>> +    unsigned int ___nr = __nr;                    \
>>> +    __young = clear_flush_young_ptes(___vma, ___address, __ptep, 
>>> ___nr);    \
>>>       __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,    \
>>>                             ___address,        \
>>>                             ___address +        \
>>> -                            PAGE_SIZE);    \
>>> +                          ___nr * PAGE_SIZE);    \
>>>       __young;                            \
>>>   })
>>
>> Man that's ugly, Not your fault, but can this possibly be turned into 
>> an inline function in a follow-up patch.
> 
> Yes, the cleanup of these macros is already in my follow-up patch set.
> 
>>> +#ifndef clear_flush_young_ptes
>>> +/**
>>> + * clear_flush_young_ptes - Clear the access bit and perform a TLB 
>>> flush for PTEs
>>> + *                that map consecutive pages of the same folio.
>>
>> With clear_young_dirty_ptes() description in mind, this should 
>> probably be "Mark PTEs that map consecutive pages of the same folio as 
>> clean and flush the TLB" ?
> 
> IMO, “clean” is confusing here, as it sounds like clear the dirty bit to 
> make the folio clean.

"as old", sorry, I used the wrong part of the description.

> 
>>> + * @vma: The virtual memory area the pages are mapped into.
>>> + * @addr: Address the first page is mapped at.
>>> + * @ptep: Page table pointer for the first entry.
>>> + * @nr: Number of entries to clear access bit.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented as 
>>> a simple
>>> + * loop over ptep_clear_flush_young().
>>> + *
>>> + * Note that PTE bits in the PTE range besides the PFN can differ. 
>>> For example,
>>> + * some PTEs might be write-protected.
>>> + *
>>> + * Context: The caller holds the page table lock.  The PTEs map 
>>> consecutive
>>> + * pages that belong to the same folio.  The PTEs are all in the 
>>> same PMD.
>>> + */
>>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>>> +                     unsigned long addr, pte_t *ptep,
>>> +                     unsigned int nr)
>>
>> Two-tab alignment on second+ line like all similar functions here.
> 
> Sure.
> 
>>> +{
>>> +    int i, young = 0;
>>> +
>>> +    for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE)
>>> +        young |= ptep_clear_flush_young(vma, addr, ptep);
>>> +
>>
>> Why don't we use a similar loop we use in clear_young_dirty_ptes() or 
>> clear_full_ptes() etc? It's not only consistent but also optimizes out 
>> the first check for nr.
>> for (;;) {
>>      young |= ptep_clear_flush_young(vma, addr, ptep);
>>      if (--nr == 0)
>>          break;
>>      ptep++;
>>      addr += PAGE_SIZE;
>> }
> 
> We’ve discussed this loop pattern before [1], and it seems that people 
> prefer the ‘for (;;)’ loop. Do you have a strong preference for changing 
> it back?

Yes, to make all such helpers look consistent. Note that your version 
was also not consistent with the other variants.

Ryan's point was about avoiding two ptep_clear_flush_young() calls, which
the for(;;) avoids as well.
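
I.e., with the signature from your patch, the generic fallback would then
look something like this (untested sketch):

static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
		unsigned long addr, pte_t *ptep, unsigned int nr)
{
	int young = 0;

	for (;;) {
		young |= ptep_clear_flush_young(vma, addr, ptep);
		if (--nr == 0)
			break;
		ptep++;
		addr += PAGE_SIZE;
	}

	return young;
}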

[...]

>>
>> And you will not have to mess with the "ptes" variable?
> 
> We can't rely on pra->mapcount here, because a folio can be mapped in 
> multiple VMAs. Even if the pra->mapcount is not zero, we can still call 
> page_vma_mapped_walk_done() for the current VMA mapping when the entire 
> folio is batched.

You are absolutely right for folios that are mapped into multiple processes.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 1/5] mm: rmap: support batched checks of the references for large folios
  2026-02-09  9:20       ` David Hildenbrand (Arm)
@ 2026-02-09  9:25         ` Baolin Wang
  0 siblings, 0 replies; 52+ messages in thread
From: Baolin Wang @ 2026-02-09  9:25 UTC (permalink / raw)
  To: David Hildenbrand (Arm), akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel



On 2/9/26 5:20 PM, David Hildenbrand (Arm) wrote:
> On 2/9/26 10:14, Baolin Wang wrote:
>>
>>
>> On 2/9/26 4:49 PM, David Hildenbrand (Arm) wrote:
>>> On 12/26/25 07:07, Baolin Wang wrote:
>>>> Currently, folio_referenced_one() always checks the young flag for 
>>>> each PTE
>>>> sequentially, which is inefficient for large folios. This 
>>>> inefficiency is
>>>> especially noticeable when reclaiming clean file-backed large 
>>>> folios, where
>>>> folio_referenced() is observed as a significant performance hotspot.
>>>>
>>>> Moreover, on Arm64 architecture, which supports contiguous PTEs, 
>>>> there is already
>>>> an optimization to clear the young flags for PTEs within a 
>>>> contiguous range.
>>>> However, this is not sufficient. We can extend this to perform 
>>>> batched operations
>>>> for the entire large folio (which might exceed the contiguous range: 
>>>> CONT_PTE_SIZE).
>>>>
>>>> Introduce a new API: clear_flush_young_ptes() to facilitate batched 
>>>> checking
>>>> of the young flags and flushing TLB entries, thereby improving 
>>>> performance
>>>> during large folio reclamation. And it will be overridden by the 
>>>> architecture
>>>> that implements a more efficient batch operation in the following 
>>>> patches.
>>>>
>>>> While we are at it, rename ptep_clear_flush_young_notify() to
>>>> clear_flush_young_ptes_notify() to indicate that this is a batch 
>>>> operation.
>>>>
>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>   include/linux/mmu_notifier.h |  9 +++++----
>>>>   include/linux/pgtable.h      | 31 +++++++++++++++++++++++++++++++
>>>>   mm/rmap.c                    | 31 ++++++++++++++++++++++++++++---
>>>>   3 files changed, 64 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/linux/mmu_notifier.h b/include/linux/ 
>>>> mmu_notifier.h
>>>> index d1094c2d5fb6..07a2bbaf86e9 100644
>>>> --- a/include/linux/mmu_notifier.h
>>>> +++ b/include/linux/mmu_notifier.h
>>>> @@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
>>>>       range->owner = owner;
>>>>   }
>>>> -#define ptep_clear_flush_young_notify(__vma, __address, 
>>>> __ptep)        \
>>>> +#define clear_flush_young_ptes_notify(__vma, __address, __ptep, 
>>>> __nr)    \
>>>>   ({                                    \
>>>>       int __young;                            \
>>>>       struct vm_area_struct *___vma = __vma;                \
>>>>       unsigned long ___address = __address;                \
>>>> -    __young = ptep_clear_flush_young(___vma, ___address, __ptep);    \
>>>> +    unsigned int ___nr = __nr;                    \
>>>> +    __young = clear_flush_young_ptes(___vma, ___address, __ptep, 
>>>> ___nr);    \
>>>>       __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,    \
>>>>                             ___address,        \
>>>>                             ___address +        \
>>>> -                            PAGE_SIZE);    \
>>>> +                          ___nr * PAGE_SIZE);    \
>>>>       __young;                            \
>>>>   })
>>>
>>> Man that's ugly, Not your fault, but can this possibly be turned into 
>>> an inline function in a follow-up patch.
>>
>> Yes, the cleanup of these macros is already in my follow-up patch set.
>>
>>>> +#ifndef clear_flush_young_ptes
>>>> +/**
>>>> + * clear_flush_young_ptes - Clear the access bit and perform a TLB 
>>>> flush for PTEs
>>>> + *                that map consecutive pages of the same folio.
>>>
>>> With clear_young_dirty_ptes() description in mind, this should 
>>> probably be "Mark PTEs that map consecutive pages of the same folio 
>>> as clean and flush the TLB" ?
>>
>> IMO, “clean” is confusing here, as it sounds like clearing the dirty bit
>> to make the folio clean.
> 
> "as old", sorry, I used the wrong part of the description.

OK.

>>>> + * @vma: The virtual memory area the pages are mapped into.
>>>> + * @addr: Address the first page is mapped at.
>>>> + * @ptep: Page table pointer for the first entry.
>>>> + * @nr: Number of entries to clear access bit.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented as 
>>>> a simple
>>>> + * loop over ptep_clear_flush_young().
>>>> + *
>>>> + * Note that PTE bits in the PTE range besides the PFN can differ. 
>>>> For example,
>>>> + * some PTEs might be write-protected.
>>>> + *
>>>> + * Context: The caller holds the page table lock.  The PTEs map 
>>>> consecutive
>>>> + * pages that belong to the same folio.  The PTEs are all in the 
>>>> same PMD.
>>>> + */
>>>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>>>> +                     unsigned long addr, pte_t *ptep,
>>>> +                     unsigned int nr)
>>>
>>> Two-tab alignment on second+ line like all similar functions here.
>>
>> Sure.
>>
>>>> +{
>>>> +    int i, young = 0;
>>>> +
>>>> +    for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE)
>>>> +        young |= ptep_clear_flush_young(vma, addr, ptep);
>>>> +
>>>
>>> Why don't we use a similar loop we use in clear_young_dirty_ptes() or 
>>> clear_full_ptes() etc? It's not only consistent but also optimizes 
>>> out the first check for nr.
>>> for (;;) {
>>>      young |= ptep_clear_flush_young(vma, addr, ptep);
>>>      if (--nr == 0)
>>>          break;
>>>      ptep++;
>>>      addr += PAGE_SIZE;
>>> }
>>
>> We’ve discussed this loop pattern before [1], and it seems that people 
>> prefer the ‘for (;;)’ loop. Do you have a strong preference for 
>> changing it back?
> 
> Yes, to make all such helpers look consistent. Note that your version 
> was also not consistent with the other variants.
> 
> Ryans point was about avoiding two ptep_clear_flush_young() calls, which 
> the for(;;) avoids as well.

Actually, my v2[1] followed the previous pattern; anyway, let me
change it back.

[1] 
https://lore.kernel.org/all/545dba5e899634bc6c8ca782417d16fef3bd049f.1765439381.git.baolin.wang@linux.alibaba.com/


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2026-02-09  9:09       ` David Hildenbrand (Arm)
@ 2026-02-09  9:36         ` Baolin Wang
  2026-02-09  9:55           ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2026-02-09  9:36 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Chris Mason
  Cc: akpm, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
	jannh, willy, baohua, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel



On 2/9/26 5:09 PM, David Hildenbrand (Arm) wrote:
> On 1/29/26 02:42, Baolin Wang wrote:
>>
>>
>> On 1/28/26 7:47 PM, Chris Mason wrote:
>>> Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>>> Implement the Arm64 architecture-specific clear_flush_young_ptes() 
>>>> to enable
>>>> batched checking of young flags and TLB flushing, improving 
>>>> performance during
>>>> large folio reclamation.
>>>>
>>>> Performance testing:
>>>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, 
>>>> and try to
>>>> reclaim 8G file-backed folios via the memory.reclaim interface. I 
>>>> can observe
>>>> 33% performance improvement on my Arm64 32-core server (and 10%+ 
>>>> improvement
>>>> on my X86 machine). Meanwhile, the hotspot folio_check_references() 
>>>> dropped
>>>> from approximately 35% to around 5%.
>>>
>>> Hi everyone, I ran mm-new through my AI review prompts and this one was
>>> flagged.  AI review below:
>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/ 
>>>> asm/pgtable.h
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -1838,6 +1838,17 @@ static inline int 
>>>> ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>       return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
>>>>   }
>>>>
>>>> +#define clear_flush_young_ptes clear_flush_young_ptes
>>>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>>>> +                     unsigned long addr, pte_t *ptep,
>>>> +                     unsigned int nr)
>>>> +{
>>>> +    if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
>>>> +        return __ptep_clear_flush_young(vma, addr, ptep);
>>>
>>> Should this be checking !pte_valid_cont() instead of !pte_cont()?
>>>
>>> The existing ptep_clear_flush_young() above uses !pte_valid_cont() to
>>> determine when to take the fast path. The new function only checks
>>> !pte_cont(), which differs when handling non-present PTEs.
>>>
>>> Non-present PTEs (device-private, device-exclusive) can reach
>>> clear_flush_young_ptes() through folio_referenced_one()->
>>> clear_flush_young_ptes_notify(). These entries may have bit 52 set as
>>> part of their encoding, but they aren't valid contiguous mappings.
>>>
>>> With the current check, wouldn't such entries incorrectly trigger the
>>> contpte path and potentially cause contpte_clear_flush_young_ptes() to
>>> process additional unrelated PTEs beyond the intended single entry?
>>
>> Indeed. I previously discussed with Ryan whether using pte_cont() was 
>> enough, and we believed that invalid PTEs wouldn’t have the PTE_CONT 
>> bit set. But we clearly missed the device-folio cases. Thanks for 
>> reporting.
>>
>> Andrew, could you please squash the following fix into this patch? If 
>> you prefer a new version, please let me know. Thanks.
> 
> Isn't the real problem that we should never ever ever ever, try clearing 
> the young bit on a non-present pte?
> 
> See damon_ptep_mkold() how that is handled with the flushing/notify.
> 
> There needs to be a pte_present() check in the caller.

The handling of ZONE_DEVICE memory in check_pte() makes me doubt my
earlier understanding. And I think you are right.

	} else if (pte_present(ptent)) {
		pfn = pte_pfn(ptent);
	} else {
		const softleaf_t entry = softleaf_from_pte(ptent);

		/* Handle un-addressable ZONE_DEVICE memory */
		if (!softleaf_is_device_private(entry) &&
		    !softleaf_is_device_exclusive(entry))
			return false;

		pfn = softleaf_to_pfn(entry);
	}


> BUT
> 
> I recall that folio_referenced() should never apply to ZONE_DEVICE 
> folios. folio_referenced() is only called from memory reclaim code, and 
> ZONE_DEVICE pages never get reclaimed through vmscan.c

Thanks for clarifying. So I can drop the pte valid check.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2025-12-26  6:07 ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
                     ` (3 preceding siblings ...)
  2026-01-16 16:26   ` [PATCH] mm: rmap: skip batched unmapping for UFFD vmas Baolin Wang
@ 2026-02-09  9:38   ` David Hildenbrand (Arm)
  2026-02-09  9:43     ` Baolin Wang
  4 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09  9:38 UTC (permalink / raw)
  To: Baolin Wang, akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel

On 12/26/25 07:07, Baolin Wang wrote:
> Similar to folio_referenced_one(), we can apply batched unmapping for file
> large folios to optimize the performance of file folios reclamation.
> 
> Barry previously implemented batched unmapping for lazyfree anonymous large
> folios[1] and did not further optimize anonymous large folios or file-backed
> large folios at that stage. As for file-backed large folios, the batched
> unmapping support is relatively straightforward, as we only need to clear
> the consecutive (present) PTE entries for file-backed large folios.
> 
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 75% performance improvement on my Arm64 32-core server (and 50%+ improvement
> on my X86 machine) with this patch.
> 
> W/o patch:
> real    0m1.018s
> user    0m0.000s
> sys     0m1.018s
> 
> W/ patch:
> real	0m0.249s
> user	0m0.000s
> sys	0m0.249s
> 
> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Acked-by: Barry Song <baohua@kernel.org>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   mm/rmap.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 985ab0b085ba..e1d16003c514 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>   	end_addr = pmd_addr_end(addr, vma->vm_end);
>   	max_nr = (end_addr - addr) >> PAGE_SHIFT;
>   
> -	/* We only support lazyfree batching for now ... */
> -	if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> +	/* We only support lazyfree or file folios batching for now ... */
> +	if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>   		return 1;

Right, the anon folio handling would require a bit more work in the


	} else if (folio_test_anon(folio)) {

branch.

Do you intend to tackle that one as well?


I'll reply to the fixup.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-02-09  9:38   ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios David Hildenbrand (Arm)
@ 2026-02-09  9:43     ` Baolin Wang
  2026-02-13  5:19       ` Barry Song
  0 siblings, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2026-02-09  9:43 UTC (permalink / raw)
  To: David Hildenbrand (Arm), akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, dev.jain,
	linux-mm, linux-arm-kernel, linux-kernel



On 2/9/26 5:38 PM, David Hildenbrand (Arm) wrote:
> On 12/26/25 07:07, Baolin Wang wrote:
>> Similar to folio_referenced_one(), we can apply batched unmapping for 
>> file
>> large folios to optimize the performance of file folios reclamation.
>>
>> Barry previously implemented batched unmapping for lazyfree anonymous 
>> large
>> folios[1] and did not further optimize anonymous large folios or file- 
>> backed
>> large folios at that stage. As for file-backed large folios, the batched
>> unmapping support is relatively straightforward, as we only need to clear
>> the consecutive (present) PTE entries for file-backed large folios.
>>
>> Performance testing:
>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, 
>> and try to
>> reclaim 8G file-backed folios via the memory.reclaim interface. I can 
>> observe
>> 75% performance improvement on my Arm64 32-core server (and 50%+ 
>> improvement
>> on my X86 machine) with this patch.
>>
>> W/o patch:
>> real    0m1.018s
>> user    0m0.000s
>> sys     0m1.018s
>>
>> W/ patch:
>> real    0m0.249s
>> user    0m0.000s
>> sys    0m0.249s
>>
>> [1] https://lore.kernel.org/ 
>> all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>> Acked-by: Barry Song <baohua@kernel.org>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   mm/rmap.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 985ab0b085ba..e1d16003c514 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1863,9 +1863,10 @@ static inline unsigned int 
>> folio_unmap_pte_batch(struct folio *folio,
>>       end_addr = pmd_addr_end(addr, vma->vm_end);
>>       max_nr = (end_addr - addr) >> PAGE_SHIFT;
>> -    /* We only support lazyfree batching for now ... */
>> -    if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>> +    /* We only support lazyfree or file folios batching for now ... */
>> +    if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>           return 1;
> 
> Right, the anon folio handling would require a bit more work in the
> 
> 
>      } else if (folio_test_anon(folio)) {
> 
> branch.
> 
> Do you intend to tackle that one as well?
>
> I'll reply to the fixup.

I'm not sure whether Barry has time to continue this work. If he does 
not, I can take over. Barry?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] mm: rmap: skip batched unmapping for UFFD vmas
  2026-01-16 16:26   ` [PATCH] mm: rmap: skip batched unmapping for UFFD vmas Baolin Wang
@ 2026-02-09  9:54     ` David Hildenbrand (Arm)
  2026-02-09 10:49       ` Barry Song
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09  9:54 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Liam.Howlett, akpm, baohua, catalin.marinas, dev.jain, harry.yoo,
	jannh, linux-arm-kernel, linux-kernel, linux-mm, lorenzo.stoakes,
	mhocko, riel, rppt, ryan.roberts, surenb, vbabka, will, willy

On 1/16/26 17:26, Baolin Wang wrote:
> As Dev reported[1], it's not ready to support batched unmapping for uffd case.
> Let's still fallback to per-page unmapping for the uffd case.
> 
> [1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/
> Reported-by: Dev Jain <dev.jain@arm.com>
> Suggested-by: Barry Song <baohua@kernel.org>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   mm/rmap.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f13480cb9f2e..172643092dcf 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1953,6 +1953,9 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>   	if (pte_unused(pte))
>   		return 1;
>   
> +	if (userfaultfd_wp(vma))
> +		return 1;
> +

Interesting. I was just wondering why we didn't run into that with lazyfree folios.

Staring at pte_install_uffd_wp_if_needed(), we never set the marker for
anonymous VMAs.

So, yeah, if one sets lazyfree on a uffd-wp PTE, the uffd-wp bit will just get
zapped alongside. Just like MADV_DONTNEED.
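
(Roughly, i.e. the helper bails out early for anonymous VMAs; sketch, not
the exact code:)

	/* Markers are never installed for anon VMAs in this helper. */
	if (vma_is_anonymous(vma))
		return false;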


I'm fine with that temporary fix. But I guess the non-hacky way to handle this would be:


 From 53d016d6e6f624425dbdbc2fb1dec7c91fbef15c Mon Sep 17 00:00:00 2001
From: "David Hildenbrand (Arm)" <david@kernel.org>
Date: Mon, 9 Feb 2026 10:52:59 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
---
  include/linux/mm_inline.h | 15 ++++++---------
  mm/memory.c               | 21 +--------------------
  mm/rmap.c                 |  2 +-
  3 files changed, 8 insertions(+), 30 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index fa2d6ba811b5..8a9a2c5f5ee3 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -566,9 +566,8 @@ static inline pte_marker copy_pte_marker(
   *
   * Returns true if an uffd-wp pte was installed, false otherwise.
   */
-static inline bool
-pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
-			      pte_t *pte, pte_t pteval)
+static inline bool install_uffd_wp_ptes_if_needed(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *pte, unsigned int nr, pte_t pteval)
  {
  	bool arm_uffd_pte = false;
  
@@ -598,13 +597,11 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
  	if (unlikely(pte_swp_uffd_wp_any(pteval)))
  		arm_uffd_pte = true;
  
-	if (unlikely(arm_uffd_pte)) {
-		set_pte_at(vma->vm_mm, addr, pte,
-			   make_pte_marker(PTE_MARKER_UFFD_WP));
-		return true;
-	}
+	if (likely(!arm_uffd_pte))
+		return false;
  
-	return false;
+	set_ptes(vma->vm_mm, addr, pte, make_pte_marker(PTE_MARKER_UFFD_WP), nr);
+	return true;
  }
  
  static inline bool vma_has_recency(const struct vm_area_struct *vma)
diff --git a/mm/memory.c b/mm/memory.c
index da360a6eb8a4..0a87d02a9a69 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1592,29 +1592,10 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
  			      unsigned long addr, pte_t *pte, int nr,
  			      struct zap_details *details, pte_t pteval)
  {
-	bool was_installed = false;
-
-	if (!uffd_supports_wp_marker())
-		return false;
-
-	/* Zap on anonymous always means dropping everything */
-	if (vma_is_anonymous(vma))
-		return false;
-
  	if (zap_drop_markers(details))
  		return false;
  
-	for (;;) {
-		/* the PFN in the PTE is irrelevant. */
-		if (pte_install_uffd_wp_if_needed(vma, addr, pte, pteval))
-			was_installed = true;
-		if (--nr == 0)
-			break;
-		pte++;
-		addr += PAGE_SIZE;
-	}
-
-	return was_installed;
+	return install_uffd_wp_ptes_if_needed(vma, addr, pte, nr, pteval);
  }
  
  static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
diff --git a/mm/rmap.c b/mm/rmap.c
index 7b9879ef442d..f71aacf35925 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2061,7 +2061,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
  		 * we may want to replace a none pte with a marker pte if
  		 * it's file-backed, so we don't lose the tracking info.
  		 */
-		pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+		install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, nr_pages, pteval);
  
  		/* Update high watermark before we lower rss */
  		update_hiwater_rss(mm);
-- 
2.43.0



Does somebody have time to look into that? We should also adjust the doc of pte_install_uffd_wp_if_needed()
and turn it into some proper kerneldoc.
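
Something along these lines (just a sketch; the parameter descriptions
would need double-checking):

/**
 * install_uffd_wp_ptes_if_needed - install uffd-wp markers after zapping PTEs
 * @vma: The VMA the (now zapped) PTEs were mapped into.
 * @addr: Address the first entry is mapped at.
 * @pte: Page table pointer for the first entry.
 * @nr: Number of consecutive entries.
 * @pteval: Value of the first zapped PTE (the PFN part is irrelevant).
 *
 * Returns true if uffd-wp pte markers were installed, false otherwise.
 */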

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2026-02-09  9:36         ` Baolin Wang
@ 2026-02-09  9:55           ` David Hildenbrand (Arm)
  2026-02-09 10:13             ` Baolin Wang
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09  9:55 UTC (permalink / raw)
  To: Baolin Wang, Chris Mason
  Cc: akpm, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
	jannh, willy, baohua, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel

On 2/9/26 10:36, Baolin Wang wrote:
> 
> 
> On 2/9/26 5:09 PM, David Hildenbrand (Arm) wrote:
>> On 1/29/26 02:42, Baolin Wang wrote:
>>>
>>>
>>>
>>> Indeed. I previously discussed with Ryan whether using pte_cont() was 
>>> enough, and we believed that invalid PTEs wouldn’t have the PTE_CONT 
>>> bit set. But we clearly missed the device-folio cases. Thanks for 
>>> reporting.
>>>
>>> Andrew, could you please squash the following fix into this patch? If 
>>> you prefer a new version, please let me know. Thanks.
>>
>> Isn't the real problem that we should never ever ever ever, try 
>> clearing the young bit on a non-present pte?
>>
>> See damon_ptep_mkold() how that is handled with the flushing/notify.
>>
>> There needs to be a pte_present() check in the caller.
> 
> The handling of ZONE_DEVICE memory in check_pte() makes me me doubt my 
> earlier understanding. And I think you are right.
> 
>      } else if (pte_present(ptent)) {
>          pfn = pte_pfn(ptent);
>      } else {
>          const softleaf_t entry = softleaf_from_pte(ptent);
> 
>          /* Handle un-addressable ZONE_DEVICE memory */
>          if (!softleaf_is_device_private(entry) &&
>              !softleaf_is_device_exclusive(entry))
>              return false;
> 
>          pfn = softleaf_to_pfn(entry);
>      }
> 
> 
>> BUT
>>
>> I recall that folio_referenced() should never apply to ZONE_DEVICE 
>> folios. folio_referenced() is only called from memory reclaim code, 
>> and ZONE_DEVICE pages never get reclaimed through vmscan.c
> 
> Thanks for clarifying. So I can drop the pte valid check.

We should probably add a safety check in folio_referenced(), warning
if we would ever get a ZONE_DEVICE folio passed.
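
Something like (untested):

	/* e.g., early in folio_referenced() */
	VM_WARN_ON_ONCE_FOLIO(folio_is_zone_device(folio), folio);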

Can someone look into that? :)

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2026-02-09  9:55           ` David Hildenbrand (Arm)
@ 2026-02-09 10:13             ` Baolin Wang
  2026-02-16  0:24               ` Alistair Popple
  0 siblings, 1 reply; 52+ messages in thread
From: Baolin Wang @ 2026-02-09 10:13 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Chris Mason
  Cc: akpm, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
	jannh, willy, baohua, dev.jain, linux-mm, linux-arm-kernel,
	linux-kernel



On 2/9/26 5:55 PM, David Hildenbrand (Arm) wrote:
> On 2/9/26 10:36, Baolin Wang wrote:
>>
>>
>> On 2/9/26 5:09 PM, David Hildenbrand (Arm) wrote:
>>> On 1/29/26 02:42, Baolin Wang wrote:
>>>>
>>>>
>>>>
>>>> Indeed. I previously discussed with Ryan whether using pte_cont() 
>>>> was enough, and we believed that invalid PTEs wouldn’t have the 
>>>> PTE_CONT bit set. But we clearly missed the device-folio cases. 
>>>> Thanks for reporting.
>>>>
>>>> Andrew, could you please squash the following fix into this patch? 
>>>> If you prefer a new version, please let me know. Thanks.
>>>
>>> Isn't the real problem that we should never ever ever ever, try 
>>> clearing the young bit on a non-present pte?
>>>
>>> See damon_ptep_mkold() how that is handled with the flushing/notify.
>>>
>>> There needs to be a pte_present() check in the caller.
>>
>> The handling of ZONE_DEVICE memory in check_pte() makes me me doubt my 
>> earlier understanding. And I think you are right.
>>
>>      } else if (pte_present(ptent)) {
>>          pfn = pte_pfn(ptent);
>>      } else {
>>          const softleaf_t entry = softleaf_from_pte(ptent);
>>
>>          /* Handle un-addressable ZONE_DEVICE memory */
>>          if (!softleaf_is_device_private(entry) &&
>>              !softleaf_is_device_exclusive(entry))
>>              return false;
>>
>>          pfn = softleaf_to_pfn(entry);
>>      }
>>
>>
>>> BUT
>>>
>>> I recall that folio_referenced() should never apply to ZONE_DEVICE 
>>> folios. folio_referenced() is only called from memory reclaim code, 
>>> and ZONE_DEVICE pages never get reclaimed through vmscan.c
>>
>> Thanks for clarifying. So I can drop the pte valid check.
> 
> We should probably add a safety check in folio_referenced(), warning
> if we would ever get a ZONE_DEVICE folio passed.
> 
> Can someone look into that ? :)

Sure, I can take a close look and address that in my follow-up patchset.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] mm: rmap: skip batched unmapping for UFFD vmas
  2026-02-09  9:54     ` David Hildenbrand (Arm)
@ 2026-02-09 10:49       ` Barry Song
  2026-02-09 10:58         ` David Hildenbrand (Arm)
  2026-02-10 12:01         ` Dev Jain
  0 siblings, 2 replies; 52+ messages in thread
From: Barry Song @ 2026-02-09 10:49 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Baolin Wang, Liam.Howlett, akpm, catalin.marinas, dev.jain,
	harry.yoo, jannh, linux-arm-kernel, linux-kernel, linux-mm,
	lorenzo.stoakes, mhocko, riel, rppt, ryan.roberts, surenb,
	vbabka, will, willy

On Mon, Feb 9, 2026 at 5:54 PM David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> On 1/16/26 17:26, Baolin Wang wrote:
> > As Dev reported[1], it's not ready to support batched unmapping for uffd case.
> > Let's still fallback to per-page unmapping for the uffd case.
> >
> > [1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/
> > Reported-by: Dev Jain <dev.jain@arm.com>
> > Suggested-by: Barry Song <baohua@kernel.org>
> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > ---
> >   mm/rmap.c | 3 +++
> >   1 file changed, 3 insertions(+)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index f13480cb9f2e..172643092dcf 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1953,6 +1953,9 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> >       if (pte_unused(pte))
> >               return 1;
> >
> > +     if (userfaultfd_wp(vma))
> > +             return 1;
> > +
>
> Interesting. I was just wondering why we didn't run into that with lazyfree folios.
>
> Staring at pte_install_uffd_wp_if_needed(), we never set the marker for
> anonymous VMAs.
>
> So, yeah, if one sets lazyfree on a uffd-wp PTE, the uffd-wp bit will just get
> zapped alongside. Just like MADV_DONTNEED.
>
>
> I'm fine with that temporary fix. But I guess the non-hacky way to handle this would be:
>
>
>  From 53d016d6e6f624425dbdbc2fb1dec7c91fbef15c Mon Sep 17 00:00:00 2001
> From: "David Hildenbrand (Arm)" <david@kernel.org>
> Date: Mon, 9 Feb 2026 10:52:59 +0100
> Subject: [PATCH] tmp
>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> ---
>   include/linux/mm_inline.h | 15 ++++++---------
>   mm/memory.c               | 21 +--------------------
>   mm/rmap.c                 |  2 +-
>   3 files changed, 8 insertions(+), 30 deletions(-)
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index fa2d6ba811b5..8a9a2c5f5ee3 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -566,9 +566,8 @@ static inline pte_marker copy_pte_marker(
>    *
>    * Returns true if an uffd-wp pte was installed, false otherwise.
>    */
> -static inline bool
> -pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
> -                             pte_t *pte, pte_t pteval)
> +static inline bool install_uffd_wp_ptes_if_needed(struct vm_area_struct *vma,
> +               unsigned long addr, pte_t *pte, unsigned int nr, pte_t pteval)
>   {
>         bool arm_uffd_pte = false;
>
> @@ -598,13 +597,11 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
>         if (unlikely(pte_swp_uffd_wp_any(pteval)))
>                 arm_uffd_pte = true;
>
> -       if (unlikely(arm_uffd_pte)) {
> -               set_pte_at(vma->vm_mm, addr, pte,
> -                          make_pte_marker(PTE_MARKER_UFFD_WP));
> -               return true;
> -       }
> +       if (likely(!arm_uffd_pte))
> +               return false;
>
> -       return false;
> +       set_ptes(vma->vm_mm, addr, pte, make_pte_marker(PTE_MARKER_UFFD_WP), nr);
> +       return true;
>   }
>
>   static inline bool vma_has_recency(const struct vm_area_struct *vma)
> diff --git a/mm/memory.c b/mm/memory.c
> index da360a6eb8a4..0a87d02a9a69 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1592,29 +1592,10 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>                               unsigned long addr, pte_t *pte, int nr,
>                               struct zap_details *details, pte_t pteval)
>   {
> -       bool was_installed = false;
> -
> -       if (!uffd_supports_wp_marker())
> -               return false;
> -
> -       /* Zap on anonymous always means dropping everything */
> -       if (vma_is_anonymous(vma))
> -               return false;
> -
>         if (zap_drop_markers(details))
>                 return false;
>
> -       for (;;) {
> -               /* the PFN in the PTE is irrelevant. */
> -               if (pte_install_uffd_wp_if_needed(vma, addr, pte, pteval))
> -                       was_installed = true;
> -               if (--nr == 0)
> -                       break;
> -               pte++;
> -               addr += PAGE_SIZE;
> -       }
> -
> -       return was_installed;
> +       return install_uffd_wp_ptes_if_needed(vma, addr, pte, nr, pteval);
>   }
>
>   static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 7b9879ef442d..f71aacf35925 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2061,7 +2061,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>                  * we may want to replace a none pte with a marker pte if
>                  * it's file-backed, so we don't lose the tracking info.
>                  */
> -               pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> +               install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, nr_pages, pteval);
>
>                 /* Update high watermark before we lower rss */
>                 update_hiwater_rss(mm);
> --
> 2.43.0
>
>
>
> Does somebody have time to look into that? We should also adjust the doc of pte_install_uffd_wp_if_needed()
> and turn it into some proper kerneldoc.

I'd nominate Dev, if he has the time, as he has been working on
related changes and is already familiar with this area :-)

https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/

I assume this could be treated as a separate optimization, as
the current temporary fix seems acceptable?

Thanks
Barry


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] mm: rmap: skip batched unmapping for UFFD vmas
  2026-02-09 10:49       ` Barry Song
@ 2026-02-09 10:58         ` David Hildenbrand (Arm)
  2026-02-10 12:01         ` Dev Jain
  1 sibling, 0 replies; 52+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09 10:58 UTC (permalink / raw)
  To: Barry Song
  Cc: Baolin Wang, Liam.Howlett, akpm, catalin.marinas, dev.jain,
	harry.yoo, jannh, linux-arm-kernel, linux-kernel, linux-mm,
	lorenzo.stoakes, mhocko, riel, rppt, ryan.roberts, surenb,
	vbabka, will, willy

On 2/9/26 11:49, Barry Song wrote:
> On Mon, Feb 9, 2026 at 5:54 PM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>
>> On 1/16/26 17:26, Baolin Wang wrote:
>>> As Dev reported[1], it's not ready to support batched unmapping for uffd case.
>>> Let's still fallback to per-page unmapping for the uffd case.
>>>
>>> [1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/
>>> Reported-by: Dev Jain <dev.jain@arm.com>
>>> Suggested-by: Barry Song <baohua@kernel.org>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>    mm/rmap.c | 3 +++
>>>    1 file changed, 3 insertions(+)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index f13480cb9f2e..172643092dcf 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1953,6 +1953,9 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>        if (pte_unused(pte))
>>>                return 1;
>>>
>>> +     if (userfaultfd_wp(vma))
>>> +             return 1;
>>> +
>>
>> Interesting. I was just wondering why we didn't run into that with lazyfree folios.
>>
>> Staring at pte_install_uffd_wp_if_needed(), we never set the marker for
>> anonymous VMAs.
>>
>> So, yeah, if one sets lazyfree on a uffd-wp PTE, the uffd-wp bit will just get
>> zapped alongside. Just like MADV_DONTNEED.
>>
>>
>> I'm fine with that temporary fix. But I guess the non-hacky way to handle this would be:
>>
>>
>>   From 53d016d6e6f624425dbdbc2fb1dec7c91fbef15c Mon Sep 17 00:00:00 2001
>> From: "David Hildenbrand (Arm)" <david@kernel.org>
>> Date: Mon, 9 Feb 2026 10:52:59 +0100
>> Subject: [PATCH] tmp
>>
>> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>> ---
>>    include/linux/mm_inline.h | 15 ++++++---------
>>    mm/memory.c               | 21 +--------------------
>>    mm/rmap.c                 |  2 +-
>>    3 files changed, 8 insertions(+), 30 deletions(-)
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index fa2d6ba811b5..8a9a2c5f5ee3 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -566,9 +566,8 @@ static inline pte_marker copy_pte_marker(
>>     *
>>     * Returns true if an uffd-wp pte was installed, false otherwise.
>>     */
>> -static inline bool
>> -pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
>> -                             pte_t *pte, pte_t pteval)
>> +static inline bool install_uffd_wp_ptes_if_needed(struct vm_area_struct *vma,
>> +               unsigned long addr, pte_t *pte, unsigned int nr, pte_t pteval)
>>    {
>>          bool arm_uffd_pte = false;
>>
>> @@ -598,13 +597,11 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
>>          if (unlikely(pte_swp_uffd_wp_any(pteval)))
>>                  arm_uffd_pte = true;
>>
>> -       if (unlikely(arm_uffd_pte)) {
>> -               set_pte_at(vma->vm_mm, addr, pte,
>> -                          make_pte_marker(PTE_MARKER_UFFD_WP));
>> -               return true;
>> -       }
>> +       if (likely(!arm_uffd_pte))
>> +               return false;
>>
>> -       return false;
>> +       set_ptes(vma->vm_mm, addr, pte, make_pte_marker(PTE_MARKER_UFFD_WP), nr);
>> +       return true;
>>    }
>>
>>    static inline bool vma_has_recency(const struct vm_area_struct *vma)
>> diff --git a/mm/memory.c b/mm/memory.c
>> index da360a6eb8a4..0a87d02a9a69 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1592,29 +1592,10 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>>                                unsigned long addr, pte_t *pte, int nr,
>>                                struct zap_details *details, pte_t pteval)
>>    {
>> -       bool was_installed = false;
>> -
>> -       if (!uffd_supports_wp_marker())
>> -               return false;
>> -
>> -       /* Zap on anonymous always means dropping everything */
>> -       if (vma_is_anonymous(vma))
>> -               return false;
>> -
>>          if (zap_drop_markers(details))
>>                  return false;
>>
>> -       for (;;) {
>> -               /* the PFN in the PTE is irrelevant. */
>> -               if (pte_install_uffd_wp_if_needed(vma, addr, pte, pteval))
>> -                       was_installed = true;
>> -               if (--nr == 0)
>> -                       break;
>> -               pte++;
>> -               addr += PAGE_SIZE;
>> -       }
>> -
>> -       return was_installed;
>> +       return install_uffd_wp_ptes_if_needed(vma, addr, pte, nr, pteval);
>>    }
>>
>>    static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 7b9879ef442d..f71aacf35925 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2061,7 +2061,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>                   * we may want to replace a none pte with a marker pte if
>>                   * it's file-backed, so we don't lose the tracking info.
>>                   */
>> -               pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
>> +               install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, nr_pages, pteval);
>>
>>                  /* Update high watermark before we lower rss */
>>                  update_hiwater_rss(mm);
>> --
>> 2.43.0
>>
>>
>>
>> Does somebody have time to look into that? We should also adjust the doc of pte_install_uffd_wp_if_needed()
>> and turn it into some proper kerneldoc.
> 
> I'd nominate Dev, if he has the time, as he has been working on
> related changes and is already familiar with this area :-)
> 
> https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/
> 
> I assume this could be treated as a separate optimization, as
> the current temporary fix seems acceptable?

Let's call it a cleanup, because the way try_to_unmap_one() sometimes 
respects "nr_pages" and sometimes doesn't (the fact that nr_pages will 
be 1 is hidden elsewhere) is just nasty, and it provoked the bug you 
ran into in the first place. :)
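
For readers following along without the full file in front of them, the
shape of that inconsistency is roughly the following. This is a heavily
simplified sketch of the relevant try_to_unmap_one() steps, not the
literal code; the helpers named do exist, but locking, TLB flushing and
error handling are elided:

	pte_t pteval = ptep_get(pvmw.pte);
	unsigned int nr_pages = folio_unmap_pte_batch(folio, &pvmw, vma, pteval);

	/* The zap itself honours the batch size ... */
	pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr_pages, 0);

	/*
	 * ... but, before the cleanup above, the uffd-wp marker install
	 * silently assumed that nr_pages == 1:
	 */
	pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);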

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] mm: rmap: skip batched unmapping for UFFD vmas
  2026-02-09 10:49       ` Barry Song
  2026-02-09 10:58         ` David Hildenbrand (Arm)
@ 2026-02-10 12:01         ` Dev Jain
  1 sibling, 0 replies; 52+ messages in thread
From: Dev Jain @ 2026-02-10 12:01 UTC (permalink / raw)
  To: Barry Song, David Hildenbrand (Arm)
  Cc: Baolin Wang, Liam.Howlett, akpm, catalin.marinas, harry.yoo,
	jannh, linux-arm-kernel, linux-kernel, linux-mm, lorenzo.stoakes,
	mhocko, riel, rppt, ryan.roberts, surenb, vbabka, will, willy


On 09/02/26 4:19 pm, Barry Song wrote:
> On Mon, Feb 9, 2026 at 5:54 PM David Hildenbrand (Arm) <david@kernel.org> wrote:
>> On 1/16/26 17:26, Baolin Wang wrote:
>>> As Dev reported[1], batched unmapping is not yet ready for the uffd case.
>>> Let's still fall back to per-page unmapping for the uffd case.
>>>
>>> [1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/
>>> Reported-by: Dev Jain <dev.jain@arm.com>
>>> Suggested-by: Barry Song <baohua@kernel.org>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>>   mm/rmap.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index f13480cb9f2e..172643092dcf 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1953,6 +1953,9 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>       if (pte_unused(pte))
>>>               return 1;
>>>
>>> +     if (userfaultfd_wp(vma))
>>> +             return 1;
>>> +
>> Interesting. I was just wondering why we didn't run into that with lazyfree folios.
>>
>> Staring at pte_install_uffd_wp_if_needed(), we never set the marker for
>> anonymous VMAs.
>>
>> So, yeah, if one sets lazyfree on a uffd-wp PTE, the uffd-wp bit will just get
>> zapped alongside. Just like MADV_DONTNEED.
>>
>>
>> I'm fine with that temporary fix. But I guess the non-hacky way to handle this would be:
>>
>>
>>  From 53d016d6e6f624425dbdbc2fb1dec7c91fbef15c Mon Sep 17 00:00:00 2001
>> From: "David Hildenbrand (Arm)" <david@kernel.org>
>> Date: Mon, 9 Feb 2026 10:52:59 +0100
>> Subject: [PATCH] tmp
>>
>> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>> ---
>>   include/linux/mm_inline.h | 15 ++++++---------
>>   mm/memory.c               | 21 +--------------------
>>   mm/rmap.c                 |  2 +-
>>   3 files changed, 8 insertions(+), 30 deletions(-)
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index fa2d6ba811b5..8a9a2c5f5ee3 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -566,9 +566,8 @@ static inline pte_marker copy_pte_marker(
>>    *
>>    * Returns true if an uffd-wp pte was installed, false otherwise.
>>    */
>> -static inline bool
>> -pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
>> -                             pte_t *pte, pte_t pteval)
>> +static inline bool install_uffd_wp_ptes_if_needed(struct vm_area_struct *vma,
>> +               unsigned long addr, pte_t *pte, unsigned int nr, pte_t pteval)
>>   {
>>         bool arm_uffd_pte = false;
>>
>> @@ -598,13 +597,11 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
>>         if (unlikely(pte_swp_uffd_wp_any(pteval)))
>>                 arm_uffd_pte = true;
>>
>> -       if (unlikely(arm_uffd_pte)) {
>> -               set_pte_at(vma->vm_mm, addr, pte,
>> -                          make_pte_marker(PTE_MARKER_UFFD_WP));
>> -               return true;
>> -       }
>> +       if (likely(!arm_uffd_pte))
>> +               return false;
>>
>> -       return false;
>> +       set_ptes(vma->vm_mm, addr, pte, make_pte_marker(PTE_MARKER_UFFD_WP), nr);
>> +       return true;
>>   }
>>
>>   static inline bool vma_has_recency(const struct vm_area_struct *vma)
>> diff --git a/mm/memory.c b/mm/memory.c
>> index da360a6eb8a4..0a87d02a9a69 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1592,29 +1592,10 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>>                               unsigned long addr, pte_t *pte, int nr,
>>                               struct zap_details *details, pte_t pteval)
>>   {
>> -       bool was_installed = false;
>> -
>> -       if (!uffd_supports_wp_marker())
>> -               return false;
>> -
>> -       /* Zap on anonymous always means dropping everything */
>> -       if (vma_is_anonymous(vma))
>> -               return false;
>> -
>>         if (zap_drop_markers(details))
>>                 return false;
>>
>> -       for (;;) {
>> -               /* the PFN in the PTE is irrelevant. */
>> -               if (pte_install_uffd_wp_if_needed(vma, addr, pte, pteval))
>> -                       was_installed = true;
>> -               if (--nr == 0)
>> -                       break;
>> -               pte++;
>> -               addr += PAGE_SIZE;
>> -       }
>> -
>> -       return was_installed;
>> +       return install_uffd_wp_ptes_if_needed(vma, addr, pte, nr, pteval);
>>   }
>>
>>   static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 7b9879ef442d..f71aacf35925 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2061,7 +2061,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>                  * we may want to replace a none pte with a marker pte if
>>                  * it's file-backed, so we don't lose the tracking info.
>>                  */
>> -               pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
>> +               install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, nr_pages, pteval);
>>
>>                 /* Update high watermark before we lower rss */
>>                 update_hiwater_rss(mm);
>> --
>> 2.43.0
>>
>>
>>
>> Does somebody have time to look into that? We should also adjust the doc of pte_install_uffd_wp_if_needed()
>> and turn it into some proper kerneldoc.
> I'd nominate Dev, if he has the time, as he has been working on
> related changes and is already familiar with this area :-)
>
> https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/

Indeed, I'll be taking a stab at the uffd batching, and the generic folio batching
(meaning anon folios now).

>
> I assume this could be treated as a separate optimization, as
> the current temporary fix seems acceptable?
>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-02-09  9:43     ` Baolin Wang
@ 2026-02-13  5:19       ` Barry Song
  2026-02-18 12:26         ` Dev Jain
  0 siblings, 1 reply; 52+ messages in thread
From: Barry Song @ 2026-02-13  5:19 UTC (permalink / raw)
  To: Baolin Wang
  Cc: David Hildenbrand (Arm),
	akpm, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
	jannh, willy, dev.jain, linux-mm, linux-arm-kernel, linux-kernel

On Mon, Feb 9, 2026 at 5:43 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
[...]
> >> ---
> >>   mm/rmap.c | 7 ++++---
> >>   1 file changed, 4 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 985ab0b085ba..e1d16003c514 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1863,9 +1863,10 @@ static inline unsigned int
> >> folio_unmap_pte_batch(struct folio *folio,
> >>       end_addr = pmd_addr_end(addr, vma->vm_end);
> >>       max_nr = (end_addr - addr) >> PAGE_SHIFT;
> >> -    /* We only support lazyfree batching for now ... */
> >> -    if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> >> +    /* We only support lazyfree or file folios batching for now ... */
> >> +    if (folio_test_anon(folio) && folio_test_swapbacked(folio))
> >>           return 1;
> >
> > Right, the anon folio handling would require a bit more work in the
> >
> >
> >      } else if (folio_test_anon(folio)) {
> >
> > branch.
> >
> > Do you intend to tackle that one as well?
>  >> I'll reply to the fixup.
>
> I'm not sure whether Barry has time to continue this work. If he does
> not, I can take over. Barry?

I expect to have some availability after April 1st.
In the meantime, please feel free to send along any patches
if you and Dev would like to move forward before then :-)

Best regards
Barry


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
  2026-02-09 10:13             ` Baolin Wang
@ 2026-02-16  0:24               ` Alistair Popple
  0 siblings, 0 replies; 52+ messages in thread
From: Alistair Popple @ 2026-02-16  0:24 UTC (permalink / raw)
  To: Baolin Wang
  Cc: David Hildenbrand (Arm),
	Chris Mason, akpm, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, baohua, dev.jain, linux-mm,
	linux-arm-kernel, linux-kernel

On 2026-02-09 at 21:13 +1100, Baolin Wang <baolin.wang@linux.alibaba.com> wrote...
> 
> 
> On 2/9/26 5:55 PM, David Hildenbrand (Arm) wrote:
> > On 2/9/26 10:36, Baolin Wang wrote:
> > > 
> > > 
> > > On 2/9/26 5:09 PM, David Hildenbrand (Arm) wrote:
> > > > On 1/29/26 02:42, Baolin Wang wrote:
> > > > > 
> > > > > 
> > > > > 
> > > > > Indeed. I previously discussed with Ryan whether using
> > > > > pte_cont() was enough, and we believed that invalid PTEs
> > > > > wouldn’t have the PTE_CONT bit set. But we clearly missed
> > > > > the device-folio cases. Thanks for reporting.
> > > > > 
> > > > > Andrew, could you please squash the following fix into this
> > > > > patch? If you prefer a new version, please let me know.
> > > > > Thanks.
> > > > 
> > > > Isn't the real problem that we should never ever ever ever try
> > > > clearing the young bit on a non-present pte?
> > > > 
> > > > See damon_ptep_mkold() for how that is handled with the flushing/notify.
> > > > 
> > > > There needs to be a pte_present() check in the caller.
> > > 
> > > The handling of ZONE_DEVICE memory in check_pte() makes me doubt
> > > my earlier understanding. And I think you are right.
> > > 
> > >      } else if (pte_present(ptent)) {
> > >          pfn = pte_pfn(ptent);
> > >      } else {
> > >          const softleaf_t entry = softleaf_from_pte(ptent);
> > > 
> > >          /* Handle un-addressable ZONE_DEVICE memory */
> > >          if (!softleaf_is_device_private(entry) &&
> > >              !softleaf_is_device_exclusive(entry))
> > >              return false;
> > > 
> > >          pfn = softleaf_to_pfn(entry);
> > >      }
> > > 
> > > 
> > > > BUT
> > > > 
> > > > I recall that folio_referenced() should never apply to
> > > > ZONE_DEVICE folios. folio_referenced() is only called from
> > > > memory reclaim code, and ZONE_DEVICE pages never get reclaimed
> > > > through vmscan.c

Agree this is true, although I've always found the reason somewhat difficult to
see in code because there are no explicit checks for ZONE_DEVICE pages. Instead
it relies on the fact ZONE_DEVICE pages can't be put on any LRU in the first
place, hence reclaim can't find them.
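
For illustration, the damon_ptep_mkold()-style handling David refers to
in the quote above boils down to guarding the young-bit clearing with a
presence check. A simplified sketch of that pattern follows; the helper
name mkold_if_present() is invented for the sketch and this is not the
literal damon code:

	static void mkold_if_present(struct vm_area_struct *vma,
				     unsigned long addr, pte_t *ptep)
	{
		pte_t pte = ptep_get(ptep);

		/*
		 * Swap, migration, device-private and marker entries are not
		 * present and have no hardware young bit to clear.
		 */
		if (!pte_present(pte))
			return;

		/* Clear the young bit and notify secondary MMUs. */
		ptep_clear_young_notify(vma, addr, ptep);
	}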

> > > 
> > > Thanks for clarifying. So I can drop the pte valid check.
> > 
> > We should probably add a safety check in folio_referenced(), warning
> > if we would ever get a ZONE_DEVICE folio passed.
> > 
> > Can someone look into that? :)
> 
> Sure, I can take a close look and address that in my follow-up patchset.
> 
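
For reference, the safety check discussed above would presumably amount
to a single warning at the top of folio_referenced(). A hypothetical
sketch, not posted in this thread; the exact warning macro used is an
assumption:

	int folio_referenced(struct folio *folio, int is_locked,
			     struct mem_cgroup *memcg, unsigned long *vm_flags)
	{
		/*
		 * ZONE_DEVICE folios are never put on an LRU, so reclaim
		 * should never hand one to folio_referenced(); warn if that
		 * assumption is ever broken.
		 */
		VM_WARN_ON_ONCE_FOLIO(folio_is_zone_device(folio), folio);

		/* ... existing body unchanged ... */
	}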


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios
  2026-02-13  5:19       ` Barry Song
@ 2026-02-18 12:26         ` Dev Jain
  0 siblings, 0 replies; 52+ messages in thread
From: Dev Jain @ 2026-02-18 12:26 UTC (permalink / raw)
  To: Barry Song, Baolin Wang
  Cc: David Hildenbrand (Arm),
	akpm, catalin.marinas, will, lorenzo.stoakes, ryan.roberts,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, riel, harry.yoo,
	jannh, willy, linux-mm, linux-arm-kernel, linux-kernel


On 13/02/26 10:49 am, Barry Song wrote:
> On Mon, Feb 9, 2026 at 5:43 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> [...]
>>>> ---
>>>>   mm/rmap.c | 7 ++++---
>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 985ab0b085ba..e1d16003c514 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int
>>>> folio_unmap_pte_batch(struct folio *folio,
>>>>       end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>       max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>> -    /* We only support lazyfree batching for now ... */
>>>> -    if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>> +    /* We only support lazyfree or file folios batching for now ... */
>>>> +    if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>           return 1;
>>> Right, the anon folio handling would require a bit more work in the
>>>
>>>
>>>      } else if (folio_test_anon(folio)) {
>>>
>>> branch.
>>>
>>> Do you intend to tackle that one as well?
>>  >> I'll reply to the fixup.
>>
>> I'm not sure whether Barry has time to continue this work. If he does
>> not, I can take over. Barry?
> I expect to have some availability after April 1st.
> In the meantime, please feel free to send along any patches
> if you and Dev would like to move forward before then :-)

On it :)

>
> Best regards
> Barry


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2026-02-18 12:26 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-26  6:07 [PATCH v5 0/5] support batch checking of references and unmapping for large folios Baolin Wang
2025-12-26  6:07 ` [PATCH v5 1/5] mm: rmap: support batched checks of the references " Baolin Wang
2026-01-07  6:01   ` Harry Yoo
2026-02-09  8:49   ` David Hildenbrand (Arm)
2026-02-09  9:14     ` Baolin Wang
2026-02-09  9:20       ` David Hildenbrand (Arm)
2026-02-09  9:25         ` Baolin Wang
2025-12-26  6:07 ` [PATCH v5 2/5] arm64: mm: factor out the address and ptep alignment into a new helper Baolin Wang
2026-02-09  8:50   ` David Hildenbrand (Arm)
2025-12-26  6:07 ` [PATCH v5 3/5] arm64: mm: support batch clearing of the young flag for large folios Baolin Wang
2026-01-02 12:21   ` Ryan Roberts
2026-02-09  9:02   ` David Hildenbrand (Arm)
2025-12-26  6:07 ` [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes() Baolin Wang
2026-01-28 11:47   ` Chris Mason
2026-01-29  1:42     ` Baolin Wang
2026-02-09  9:09       ` David Hildenbrand (Arm)
2026-02-09  9:36         ` Baolin Wang
2026-02-09  9:55           ` David Hildenbrand (Arm)
2026-02-09 10:13             ` Baolin Wang
2026-02-16  0:24               ` Alistair Popple
2025-12-26  6:07 ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios Baolin Wang
2026-01-06 13:22   ` Wei Yang
2026-01-06 21:29     ` Barry Song
2026-01-07  1:46       ` Wei Yang
2026-01-07  2:21         ` Barry Song
2026-01-07  2:29           ` Baolin Wang
2026-01-07  3:31             ` Wei Yang
2026-01-16  9:53         ` Dev Jain
2026-01-16 11:14           ` Lorenzo Stoakes
2026-01-16 14:28           ` Barry Song
2026-01-16 15:23             ` Barry Song
2026-01-16 15:49             ` Baolin Wang
2026-01-18  5:46             ` Dev Jain
2026-01-19  5:50               ` Baolin Wang
2026-01-19  6:36                 ` Dev Jain
2026-01-19  7:22                   ` Baolin Wang
2026-01-16 15:14           ` Barry Song
2026-01-18  5:48             ` Dev Jain
2026-01-07  6:54   ` Harry Yoo
2026-01-16  8:42   ` Lorenzo Stoakes
2026-01-16 16:26   ` [PATCH] mm: rmap: skip batched unmapping for UFFD vmas Baolin Wang
2026-02-09  9:54     ` David Hildenbrand (Arm)
2026-02-09 10:49       ` Barry Song
2026-02-09 10:58         ` David Hildenbrand (Arm)
2026-02-10 12:01         ` Dev Jain
2026-02-09  9:38   ` [PATCH v5 5/5] mm: rmap: support batched unmapping for file large folios David Hildenbrand (Arm)
2026-02-09  9:43     ` Baolin Wang
2026-02-13  5:19       ` Barry Song
2026-02-18 12:26         ` Dev Jain
2026-01-16  8:41 ` [PATCH v5 0/5] support batch checking of references and unmapping for " Lorenzo Stoakes
2026-01-16 10:53   ` David Hildenbrand (Red Hat)
2026-01-16 10:52 ` David Hildenbrand (Red Hat)
