[PATCH 0/2] support batched checks of the references for large folios

linux-mm.kvack.org archive mirror

* [PATCH 0/2] support batched checks of the references for large folios
@ 2025-11-25  0:56 Baolin Wang
  2025-11-25  0:56 ` [PATCH 1/2] arm64: mm: support batch clearing of the young flag " Baolin Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Baolin Wang @ 2025-11-25  0:56 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.

Moreover, on the Arm architecture, which supports contiguous PTEs, there is already
an optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. We can extend this to perform batched operations
for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).

By supporting batched checking of the young flags and flushing TLB entries,
I observed a 33% performance improvement in my file-backed folio reclaim tests.

BTW, I still noticed a hotspot in try_to_unmap() in my test. I hope Barry can
resend the optimization patch for try_to_unmap() [1].

[1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/

Baolin Wang (2):
  arm64: mm: support batch clearing of the young flag for large folios
  mm: rmap: support batched checks of the references for large folios

 arch/arm64/include/asm/pgtable.h | 23 ++++++++++++-----
 arch/arm64/mm/contpte.c          | 44 ++++++++++++++++++++++----------
 include/linux/mmu_notifier.h     |  9 ++++---
 include/linux/pgtable.h          | 19 ++++++++++++++
 mm/rmap.c                        | 22 ++++++++++++++--
 5 files changed, 92 insertions(+), 25 deletions(-)

-- 
2.47.3

* [PATCH 1/2] arm64: mm: support batch clearing of the young flag for large folios
  2025-11-25  0:56 [PATCH 0/2] support batched checks of the references for large folios Baolin Wang
@ 2025-11-25  0:56 ` Baolin Wang
  2025-11-25  0:56 ` [PATCH 2/2] mm: rmap: support batched checks of the references " Baolin Wang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Baolin Wang @ 2025-11-25  0:56 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
only clear the young flag and flush TLBs for PTEs within the contiguous range.
To support batched PTE operations for large folios of other sizes in the
following patch, add a new parameter to specify the number of PTEs.

While we are at it, rename the functions to maintain consistency with other
contpte_*() functions.
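
As an illustration of the range handling described above, here is a minimal
userspace sketch of how a batch of nr PTEs could be widened to whole
contiguous blocks when its first or last PTE carries the contiguous bit. The
sk_* names, the 4 KiB page size and the 16-PTE block size are assumptions made
for this example only; they are not kernel APIs, and the real page and block
sizes depend on the kernel configuration.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only: 4 KiB pages, 16 PTEs per contiguous block. */
#define SK_PAGE_SIZE     4096UL
#define SK_CONT_PTES     16UL
#define SK_CONT_PTE_SIZE (SK_CONT_PTES * SK_PAGE_SIZE)

/* Hypothetical result type: the widened range and its page count. */
struct sk_range {
	unsigned long start;
	unsigned long end;	/* exclusive */
	unsigned long nr;	/* pages in [start, end) */
};

/* Round x down to a multiple of a (a must be a power of two). */
static unsigned long sk_align_down(unsigned long x, unsigned long a)
{
	return x & ~(a - 1);
}

/* Round x up to a multiple of a (a must be a power of two). */
static unsigned long sk_align_up(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Widen [addr, addr + nr pages): round the end up when the last PTE is
 * contiguous, and round the start down when the first PTE is contiguous.
 */
static struct sk_range sk_widen(unsigned long addr, unsigned long nr,
				bool first_is_cont, bool last_is_cont)
{
	struct sk_range r;

	assert(nr > 0);
	r.start = addr;
	r.end = addr + nr * SK_PAGE_SIZE;
	if (last_is_cont)
		r.end = sk_align_up(r.end, SK_CONT_PTE_SIZE);
	if (first_is_cont)
		r.start = sk_align_down(r.start, SK_CONT_PTE_SIZE);
	r.nr = (r.end - r.start) / SK_PAGE_SIZE;
	return r;
}
```

Under these assumptions, sk_widen(0x13000, 1, true, true) covers the whole
16-page block starting at 0x10000, which matches what the old single-block
helpers did, while a 32-page batch that is already block-aligned is left
unchanged.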

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 arch/arm64/include/asm/pgtable.h | 12 ++++-----
 arch/arm64/mm/contpte.c          | 44 ++++++++++++++++++++++----------
 2 files changed, 37 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0944e296dd4a..e03034683156 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
 extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep,
 				unsigned int nr, int full);
-extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
-				unsigned long addr, pte_t *ptep);
-extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
-				unsigned long addr, pte_t *ptep);
+extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep, unsigned int nr);
+extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep, unsigned int nr);
 extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, unsigned int nr);
 extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
@@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	if (likely(!pte_valid_cont(orig_pte)))
 		return __ptep_test_and_clear_young(vma, addr, ptep);
 
-	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
+	return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
 }
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
@@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	if (likely(!pte_valid_cont(orig_pte)))
 		return __ptep_clear_flush_young(vma, addr, ptep);
 
-	return contpte_ptep_clear_flush_young(vma, addr, ptep);
+	return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
 }
 
 #define wrprotect_ptes wrprotect_ptes
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index c0557945939c..19b122441be3 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
 
-int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
-					unsigned long addr, pte_t *ptep)
+int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep,
+					unsigned int nr)
 {
 	/*
 	 * ptep_clear_flush_young() technically requires us to clear the access
@@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 	 * having to unfold.
 	 */
 
+	unsigned long start = addr;
+	unsigned long end = start + nr * PAGE_SIZE;
 	int young = 0;
 	int i;
 
-	ptep = contpte_align_down(ptep);
-	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+	if (pte_cont(__ptep_get(ptep + nr - 1)))
+		end = ALIGN(end, CONT_PTE_SIZE);
 
-	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
-		young |= __ptep_test_and_clear_young(vma, addr, ptep);
+	if (pte_cont(__ptep_get(ptep))) {
+		start = ALIGN_DOWN(start, CONT_PTE_SIZE);
+		ptep = contpte_align_down(ptep);
+	}
+
+	nr = (end - start) / PAGE_SIZE;
+	for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
+		young |= __ptep_test_and_clear_young(vma, start, ptep);
 
 	return young;
 }
-EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
+EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
 
-int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
-					unsigned long addr, pte_t *ptep)
+int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
+				unsigned long addr, pte_t *ptep,
+				unsigned int nr)
 {
 	int young;
 
-	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
+	young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
 
 	if (young) {
+		unsigned long start = addr;
+		unsigned long end = start + nr * PAGE_SIZE;
+
+		if (pte_cont(__ptep_get(ptep + nr - 1)))
+			end = ALIGN(end, CONT_PTE_SIZE);
+
+		if (pte_cont(__ptep_get(ptep)))
+			start = ALIGN_DOWN(start, CONT_PTE_SIZE);
+
 		/*
 		 * See comment in __ptep_clear_flush_young(); same rationale for
 		 * eliding the trailing DSB applies here.
 		 */
-		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
-		__flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
+		__flush_tlb_range_nosync(vma->vm_mm, start, end,
 					 PAGE_SIZE, true, 3);
 	}
 
 	return young;
 }
-EXPORT_SYMBOL_GPL(contpte_ptep_clear_flush_young);
+EXPORT_SYMBOL_GPL(contpte_clear_flush_young_ptes);
 
 void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
 					pte_t *ptep, unsigned int nr)
-- 
2.47.3



* [PATCH 2/2] mm: rmap: support batched checks of the references for large folios
  2025-11-25  0:56 [PATCH 0/2] support batched checks of the references for large folios Baolin Wang
  2025-11-25  0:56 ` [PATCH 1/2] arm64: mm: support batch clearing of the young flag " Baolin Wang
@ 2025-11-25  0:56 ` Baolin Wang
  2025-11-25  9:29 ` [PATCH 0/2] " Barry Song
  2025-12-01 16:23 ` David Hildenbrand (Red Hat)
  3 siblings, 0 replies; 7+ messages in thread
From: Baolin Wang @ 2025-11-25  0:56 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua,
	baolin.wang, linux-mm, linux-arm-kernel, linux-kernel

Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.

Moreover, on the Arm architecture, which supports contiguous PTEs, there is already
an optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. We can extend this to perform batched operations
for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).

Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
of the young flags and flushing TLB entries, thereby improving performance
during large folio reclamation.
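
To make the batching idea concrete, here is a small userspace model of a
generic fallback: one call that clears the young state of nr consecutive
entries and reports whether any of them was young. The sk_pte type and the
sk_* helpers are hypothetical stand-ins for this sketch; only the accessed bit
is modelled, and no TLB flushing or MMU notifier handling is shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a PTE: only the young (accessed) bit is modelled. */
struct sk_pte {
	bool young;
};

/* Per-entry test-and-clear: returns the previous young state. */
static int sk_test_and_clear_young(struct sk_pte *pte)
{
	int young = pte->young;

	pte->young = false;
	return young;
}

/*
 * Batched form: clear the young state of nr consecutive entries and
 * return nonzero if at least one of them was young.
 */
static int sk_clear_young_batch(struct sk_pte *pte, unsigned int nr)
{
	int young = 0;

	assert(nr > 0);
	for (;;) {
		young |= sk_test_and_clear_young(pte);
		if (--nr == 0)
			break;
		pte++;
	}

	return young;
}
```

The caller gets a single answer for the whole batch, so a walker can count a
folio as referenced once and then skip past the nr entries it has already
handled.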

Performance testing:
Allocate 10G of clean file-backed folios by mmap() in a memory cgroup, and try to
reclaim 8G of file-backed folios via the memory.reclaim interface. I observed a
33% performance improvement on my Arm64 32-core server (and a 10%+ improvement
on my X86 machine). Meanwhile, the folio_check_references() hotspot dropped
from approximately 35% to around 5%.

W/o patchset:
real	0m1.518s
user	0m0.000s
sys	0m1.518s

W/ patchset:
real	0m1.018s
user	0m0.000s
sys	0m1.018s
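
For reference, the 33% figure can be recomputed from the timings above. The
helpers below are only an arithmetic sketch with hypothetical names; they take
the "W/o patchset" and "W/ patchset" times in seconds.

```c
#include <assert.h>

/* Percentage by which the elapsed time shrinks from before_s to after_s. */
static double sk_time_reduction_pct(double before_s, double after_s)
{
	assert(before_s > 0.0);
	return (before_s - after_s) / before_s * 100.0;
}

/* Speedup factor: how many times faster the run with the patchset is. */
static double sk_speedup(double before_s, double after_s)
{
	assert(after_s > 0.0);
	return before_s / after_s;
}
```

With 1.518s before and 1.018s after, the reduction works out to about 32.9%
of the reclaim time, or a speedup of roughly 1.49x.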

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 arch/arm64/include/asm/pgtable.h | 11 +++++++++++
 include/linux/mmu_notifier.h     |  9 +++++----
 include/linux/pgtable.h          | 19 +++++++++++++++++++
 mm/rmap.c                        | 22 ++++++++++++++++++++--
 4 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e03034683156..a865bd8c46a3 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1869,6 +1869,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
 }
 
+#define clear_flush_young_ptes clear_flush_young_ptes
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+					unsigned long addr, pte_t *ptep,
+					unsigned int nr)
+{
+	if (likely(nr == 1))
+		return __ptep_clear_flush_young(vma, addr, ptep);
+
+	return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
+}
+
 #define wrprotect_ptes wrprotect_ptes
 static __always_inline void wrprotect_ptes(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep, unsigned int nr)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d1094c2d5fb6..be594b274729 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
 	range->owner = owner;
 }
 
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
+#define ptep_clear_flush_young_notify(__vma, __address, __ptep, __nr)	\
 ({									\
 	int __young;							\
 	struct vm_area_struct *___vma = __vma;				\
 	unsigned long ___address = __address;				\
-	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
+	unsigned int ___nr = __nr;					\
+	__young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr);	\
 	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
 						  ___address,		\
 						  ___address +		\
-							PAGE_SIZE);	\
+						___nr * PAGE_SIZE);	\
 	__young;							\
 })
 
@@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
 
 #define mmu_notifier_range_update_to_read_only(r) false
 
-#define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define ptep_clear_flush_young_notify clear_flush_young_ptes
 #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
 #define ptep_clear_young_notify ptep_test_and_clear_young
 #define pmdp_clear_young_notify pmdp_test_and_clear_young
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b13b6f42be3c..c7d0fd228cb7 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -947,6 +947,25 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
 }
 #endif
 
+#ifndef clear_flush_young_ptes
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t *ptep,
+					 unsigned int nr)
+{
+	int young = 0;
+
+	for (;;) {
+		young |= ptep_clear_flush_young(vma, addr, ptep);
+		if (--nr == 0)
+			break;
+		ptep++;
+		addr += PAGE_SIZE;
+	}
+
+	return young;
+}
+#endif
+
 /*
  * On some architectures hardware does not set page access bit when accessing
  * memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/rmap.c b/mm/rmap.c
index f955f02d570e..3833b8557a6f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -827,9 +827,11 @@ static bool folio_referenced_one(struct folio *folio,
 	struct folio_referenced_arg *pra = arg;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	int ptes = 0, referenced = 0;
+	unsigned int nr;
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		address = pvmw.address;
+		nr = 1;
 
 		if (vma->vm_flags & VM_LOCKED) {
 			ptes++;
@@ -874,9 +876,21 @@ static bool folio_referenced_one(struct folio *folio,
 			if (lru_gen_look_around(&pvmw))
 				referenced++;
 		} else if (pvmw.pte) {
+			if (folio_test_large(folio)) {
+				unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
+				unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
+				pte_t pteval = ptep_get(pvmw.pte);
+
+				nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
+			}
+
+			ptes += nr;
 			if (ptep_clear_flush_young_notify(vma, address,
-						pvmw.pte))
+						pvmw.pte, nr))
 				referenced++;
+			/* Skip the batched PTEs */
+			pvmw.pte += nr - 1;
+			pvmw.address += (nr - 1) * PAGE_SIZE;
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
 			if (pmdp_clear_flush_young_notify(vma, address,
 						pvmw.pmd))
@@ -886,7 +900,11 @@ static bool folio_referenced_one(struct folio *folio,
 			WARN_ON_ONCE(1);
 		}
 
-		pra->mapcount--;
+		pra->mapcount -= nr;
+		if (ptes == pvmw.nr_pages) {
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
 	}
 
 	if (referenced)
-- 
2.47.3



* Re: [PATCH 0/2] support batched checks of the references for large folios
  2025-11-25  0:56 [PATCH 0/2] support batched checks of the references for large folios Baolin Wang
  2025-11-25  0:56 ` [PATCH 1/2] arm64: mm: support batch clearing of the young flag " Baolin Wang
  2025-11-25  0:56 ` [PATCH 2/2] mm: rmap: support batched checks of the references " Baolin Wang
@ 2025-11-25  9:29 ` Barry Song
  2025-11-25 17:38   ` Kairui Song
  2025-12-01 16:23 ` David Hildenbrand (Red Hat)
  3 siblings, 1 reply; 7+ messages in thread
From: Barry Song @ 2025-11-25  9:29 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, linux-mm, linux-arm-kernel,
	linux-kernel

Hi Baolin,

On Tue, Nov 25, 2025 at 8:57 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> By supporting batched checking of the young flags and flushing TLB entries,
> I observed a 33% performance improvement in my file-backed folios reclaim tests.

nice!

>
> BTW, I still noticed a hotspot in try_to_unmap() in my test. Hope Barry can
> resend the optimization patch for try_to_unmap() [1].

Thanks for waking me up. Yes, it's still on my list—I've just had a lot of
non-technical issues come up that seriously slowed my progress. Sorry for
the delay.

And I suppose we also need that for try_to_migrate().

>
> [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/
>
> Baolin Wang (2):
>   arm64: mm: support batch clearing of the young flag for large folios
>   mm: rmap: support batched checks of the references for large folios
>
>  arch/arm64/include/asm/pgtable.h | 23 ++++++++++++-----
>  arch/arm64/mm/contpte.c          | 44 ++++++++++++++++++++++----------
>  include/linux/mmu_notifier.h     |  9 ++++---
>  include/linux/pgtable.h          | 19 ++++++++++++++
>  mm/rmap.c                        | 22 ++++++++++++++--
>  5 files changed, 92 insertions(+), 25 deletions(-)
>

Thanks
Barry


* Re: [PATCH 0/2] support batched checks of the references for large folios
  2025-11-25  9:29 ` [PATCH 0/2] " Barry Song
@ 2025-11-25 17:38   ` Kairui Song
  0 siblings, 0 replies; 7+ messages in thread
From: Kairui Song @ 2025-11-25 17:38 UTC (permalink / raw)
  To: Barry Song
  Cc: Baolin Wang, akpm, david, catalin.marinas, will, lorenzo.stoakes,
	ryan.roberts, Liam.Howlett, vbabka, rppt, surenb, mhocko, riel,
	harry.yoo, jannh, willy, Chris Li, linux-mm, linux-arm-kernel,
	linux-kernel

On Tue, Nov 25, 2025 at 6:15 PM Barry Song <21cnbao@gmail.com> wrote:
>
> Hi Baolin,
>
> On Tue, Nov 25, 2025 at 8:57 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> > Currently, folio_referenced_one() always checks the young flag for each PTE
> > sequentially, which is inefficient for large folios. This inefficiency is
> > especially noticeable when reclaiming clean file-backed large folios, where
> > folio_referenced() is observed as a significant performance hotspot.
> >
> > Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> > an optimization to clear the young flags for PTEs within a contiguous range.
> > However, this is not sufficient. We can extend this to perform batched operations
> > for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
> >
> > By supporting batched checking of the young flags and flushing TLB entries,
> > I observed a 33% performance improvement in my file-backed folios reclaim tests.
>
> nice!
>
> >
> > BTW, I still noticed a hotspot in try_to_unmap() in my test. Hope Barry can
> > resend the optimization patch for try_to_unmap() [1].
>
> Thanks for waking me up. Yes, it's still on my list—I've just had a lot of
> non-technical issues come up that seriously slowed my progress. Sorry for
> the delay.
>
> And I suppose we also need that for try_to_migrate().
>
> >
> > [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/

Hi Barry, Baolin.

About the try_to_unmap() part, I also noticed that patch and its comment
"We only support batched swap_duplicate() for unmapping". I guess one
reason is add_swap_count_continuation(), right? That limitation will be
removed by swap table phase 3, which can be previewed here:
https://lore.kernel.org/linux-mm/20250514201729.48420-28-ryncsn@gmail.com/

And I think we will be able to handle that much more easily by then. Sorry
that it is taking a while to land upstream, though.


* Re: [PATCH 0/2] support batched checks of the references for large folios
  2025-11-25  0:56 [PATCH 0/2] support batched checks of the references for large folios Baolin Wang
                   ` (2 preceding siblings ...)
  2025-11-25  9:29 ` [PATCH 0/2] " Barry Song
@ 2025-12-01 16:23 ` David Hildenbrand (Red Hat)
  2025-12-02  5:37   ` Baolin Wang
  3 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-01 16:23 UTC (permalink / raw)
  To: Baolin Wang, akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, linux-mm,
	linux-arm-kernel, linux-kernel

On 11/25/25 01:56, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
> 
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
> 
> By supporting batched checking of the young flags and flushing TLB entries,
> I observed a 33% performance improvement in my file-backed folios reclaim tests.

Can you point at the benchmark or briefly explain what it does? What 
exactly are we measuring that improves by 33%?

-- 
Cheers

David


* Re: [PATCH 0/2] support batched checks of the references for large folios
  2025-12-01 16:23 ` David Hildenbrand (Red Hat)
@ 2025-12-02  5:37   ` Baolin Wang
  0 siblings, 0 replies; 7+ messages in thread
From: Baolin Wang @ 2025-12-02  5:37 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), akpm, catalin.marinas, will
  Cc: lorenzo.stoakes, ryan.roberts, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, riel, harry.yoo, jannh, willy, baohua, linux-mm,
	linux-arm-kernel, linux-kernel



On 2025/12/2 00:23, David Hildenbrand (Red Hat) wrote:
> On 11/25/25 01:56, Baolin Wang wrote:
>> Currently, folio_referenced_one() always checks the young flag for 
>> each PTE
>> sequentially, which is inefficient for large folios. This inefficiency is
>> especially noticeable when reclaiming clean file-backed large folios, 
>> where
>> folio_referenced() is observed as a significant performance hotspot.
>>
>> Moreover, on Arm architecture, which supports contiguous PTEs, there 
>> is already
>> an optimization to clear the young flags for PTEs within a contiguous 
>> range.
>> However, this is not sufficient. We can extend this to perform batched 
>> operations
>> for the entire large folio (which might exceed the contiguous range: 
>> CONT_PTE_SIZE).
>>
>> By supporting batched checking of the young flags and flushing TLB 
>> entries,
>> I observed a 33% performance improvement in my file-backed folios 
>> reclaim tests.
> 
> Can you point at the benchmark or briefly explain what it does? What 
> exactly are we measuring that improves by 33%?

Sorry for not being clear. I've described the performance test in patch 
2, and I should have copied it to the cover letter:

"
Performance testing:
Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and 
try to reclaim 8G file-backed folios via the memory.reclaim interface. I 
can observe 33% performance improvement on my Arm64 32-core server (and 
10%+ improvement on my X86 machine). Meanwhile, the hotspot 
folio_check_references() dropped from approximately 35% to around 5%.

W/o patchset:
real	0m1.518s
user	0m0.000s
sys	0m1.518s

W/ patchset:
real	0m1.018s
user	0m0.000s
sys	0m1.018s
"

