linux-mm.kvack.org archive mirror
* [PATCH v3 0/4] mm: batched unmap lazyfree large folios during reclamation
@ 2025-01-15  3:38 Barry Song
  2025-01-15  3:38 ` [PATCH v3 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one Barry Song
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Barry Song @ 2025-01-15  3:38 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 21cnbao, baolin.wang, chrisl, david, ioworker0, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan

From: Barry Song <v-songbaohua@oppo.com>

Commit 735ecdfaf4e8 ("mm/vmscan: avoid split lazyfree THP during
shrink_folio_list()") prevents the splitting of MADV_FREE'd THP in
madvise.c. 
However, those folios are still added to the deferred_split list in 
try_to_unmap_one() because we are unmapping PTEs and removing rmap
entries one by one.

Firstly, this has rendered the following counter somewhat confusing:
/sys/kernel/mm/transparent_hugepage/hugepages-size/stats/split_deferred
The split_deferred counter was originally designed to track operations
such as partial unmap or madvise of large folios. However, in practice,
most split_deferred cases arise from memory reclamation of aligned
lazyfree mTHPs, as observed by Tangquan. This discrepancy has made
the split_deferred counter highly misleading.

Secondly, this approach is slow because it requires iterating through 
each PTE and removing the rmap one by one for a large folio. In fact, 
all PTEs of a pte-mapped large folio should be unmapped at once, and
the entire folio should be removed from the rmap as a whole.
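
As an illustration, patch 3/4 roughly replaces that per-PTE loop with a
batched sequence along these lines (simplified from the hunks in that
patch; corner cases and the non-deferred TLB flush path are omitted):

   nr_pages = folio_nr_pages(folio);
   /* clear all PTEs of the folio in one go */
   pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr_pages, 0);
   /* queue a single deferred TLB flush covering the whole folio */
   set_tlb_ubc_flush_pending(mm, pteval, address, address + folio_size(folio));
   ...
   /* drop all rmap entries of the folio at once */
   folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
   folio_ref_sub(folio, nr_pages - 1);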

Thirdly, it also increases the risk of a race condition where lazyfree
folios are incorrectly set back to swapbacked, as a speculative folio_get
may occur in the shrinker's callback.
deferred_split_scan() might call folio_try_get(folio) because the folio
was added to the deferred_split list while the rmap for the 1st subpage
was being removed. Then, while we are scanning the 2nd to nr_pages PTEs
of this folio in try_to_unmap_one(), the elevated reference count can
make the "ref_count == 1 + map_count" check in try_to_unmap_one() false,
and the entire mTHP is transitioned back to swap-backed.

   /*
    * The only page refs must be one from isolation
    * plus the rmap(s) (dropped by discard:).
    */
   if (ref_count == 1 + map_count &&
       (!folio_test_dirty(folio) ||
        ...
        (vma->vm_flags & VM_DROPPABLE))) {
           dec_mm_counter(mm, MM_ANONPAGES);
           goto discard;
   }

This patchset resolves the issue by marking only genuinely dirty folios
as swap-backed, as suggested by David, and by transitioning to batched
unmapping of entire folios in try_to_unmap_one(). Consequently, the
deferred_split count drops to zero, and memory reclamation performance
improves significantly: reclaiming 64KiB lazyfree large folios is now
2.5x faster (the specific data is embedded in the changelog of patch
3/4).
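
Conceptually, patch 1/4 reworks the lazyfree check in try_to_unmap_one()
along the following lines (simplified from the hunk in that patch):

   if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
           /* genuinely redirtied: remap and mark swap-backed again */
           set_pte_at(mm, address, pvmw.pte, pteval);
           folio_set_swapbacked(folio);
           goto walk_abort;
   } else if (ref_count != 1 + map_count) {
           /* additional (possibly speculative) reference: keep the
            * folio lazyfree and detect redirtying on a later pass */
           set_pte_at(mm, address, pvmw.pte, pteval);
           goto walk_abort;
   }
   /* safe to discard the folio */
   dec_mm_counter(mm, MM_ANONPAGES);
   goto discard;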

By the way, while the patchset is primarily aimed at PTE-mapped large
folios, Baolin and Lance also found that try_to_unmap_one() handles
lazyfree redirtied PMD-mapped large folios inefficiently: it splits
the PMD into PTEs and iterates over them. This patchset removes the
unnecessary splitting, enabling us to skip redirtied PMD-mapped large
folios 3.5x faster during memory reclamation. (The specific data can
be found in the changelog of patch 4/4).

-v3:
 * collect Reviewed-by and Acked-by tags from Baolin, David, Lance and
   Will, thanks!
 * refine the pmd-mapped THP lazyfree code per Baolin and Lance.
 * refine the tlbbatch deferred flushing range support code per David.

-v2:
 https://lore.kernel.org/linux-mm/20250113033901.68951-1-21cnbao@gmail.com/

 * describe the background and problems more clearly in the cover letter
   per Lorenzo Stoakes;
 * also handle redirtied pmd-mapped large folios per Baolin and Lance;
 * handle some corner cases such as HWPOISON and pte_unused;
 * fix riscv and x86 build issues.

-v1:
 https://lore.kernel.org/linux-mm/20250106031711.82855-1-21cnbao@gmail.com/

Barry Song (4):
  mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one
  mm: Support tlbbatch flush for a range of PTEs
  mm: Support batched unmap for lazyfree large folios during reclamation
  mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap

 arch/arm64/include/asm/tlbflush.h |  25 +++----
 arch/arm64/mm/contpte.c           |   2 +-
 arch/riscv/include/asm/tlbflush.h |   5 +-
 arch/riscv/mm/tlbflush.c          |   5 +-
 arch/x86/include/asm/tlbflush.h   |   5 +-
 mm/huge_memory.c                  |  24 +++++--
 mm/rmap.c                         | 115 ++++++++++++++++++++----------
 7 files changed, 117 insertions(+), 64 deletions(-)

-- 
2.39.3 (Apple Git-146)




* [PATCH v3 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one
  2025-01-15  3:38 [PATCH v3 0/4] mm: batched unmap lazyfree large folios during reclamation Barry Song
@ 2025-01-15  3:38 ` Barry Song
  2025-01-15  3:38 ` [PATCH v3 2/4] mm: Support tlbbatch flush for a range of PTEs Barry Song
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Barry Song @ 2025-01-15  3:38 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 21cnbao, baolin.wang, chrisl, david, ioworker0, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan,
	Mauricio Faria de Oliveira

From: Barry Song <v-songbaohua@oppo.com>

The refcount may be increased temporarily or for the long term, but this
does not change the fundamental nature of the folio: it has already been
lazy-freed. Therefore, we only set 'swapbacked' again when we are certain
the folio is dirty and not droppable.

Fixes: 6c8e2a256915 ("mm: fix race between MADV_FREE reclaim and blkdev direct IO read")
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Cc: Mauricio Faria de Oliveira <mfo@canonical.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lance Yang <ioworker0@gmail.com>
---
 mm/rmap.c | 49 ++++++++++++++++++++++---------------------------
 1 file changed, 22 insertions(+), 27 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c6c4d4ea29a7..de6b8c34e98c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1868,34 +1868,29 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				smp_rmb();
 
-				/*
-				 * The only page refs must be one from isolation
-				 * plus the rmap(s) (dropped by discard:).
-				 */
-				if (ref_count == 1 + map_count &&
-				    (!folio_test_dirty(folio) ||
-				     /*
-				      * Unlike MADV_FREE mappings, VM_DROPPABLE
-				      * ones can be dropped even if they've
-				      * been dirtied.
-				      */
-				     (vma->vm_flags & VM_DROPPABLE))) {
-					dec_mm_counter(mm, MM_ANONPAGES);
-					goto discard;
-				}
-
-				/*
-				 * If the folio was redirtied, it cannot be
-				 * discarded. Remap the page to page table.
-				 */
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				/*
-				 * Unlike MADV_FREE mappings, VM_DROPPABLE ones
-				 * never get swap backed on failure to drop.
-				 */
-				if (!(vma->vm_flags & VM_DROPPABLE))
+				if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
+					/*
+					 * redirtied either using the page table or a previously
+					 * obtained GUP reference.
+					 */
+					set_pte_at(mm, address, pvmw.pte, pteval);
 					folio_set_swapbacked(folio);
-				goto walk_abort;
+					goto walk_abort;
+				} else if (ref_count != 1 + map_count) {
+					/*
+					 * Additional reference. Could be a GUP reference or any
+					 * speculative reference. GUP users must mark the folio
+					 * dirty if there was a modification. This folio cannot be
+					 * reclaimed right now either way, so act just like nothing
+					 * happened.
+					 * We'll come back here later and detect if the folio was
+					 * dirtied when the additional reference is gone.
+					 */
+					set_pte_at(mm, address, pvmw.pte, pteval);
+					goto walk_abort;
+				}
+				dec_mm_counter(mm, MM_ANONPAGES);
+				goto discard;
 			}
 
 			if (swap_duplicate(entry) < 0) {
-- 
2.39.3 (Apple Git-146)




* [PATCH v3 2/4] mm: Support tlbbatch flush for a range of PTEs
  2025-01-15  3:38 [PATCH v3 0/4] mm: batched unmap lazyfree large folios during reclamation Barry Song
  2025-01-15  3:38 ` [PATCH v3 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one Barry Song
@ 2025-01-15  3:38 ` Barry Song
  2025-01-16  8:30   ` Kefeng Wang
  2025-01-15  3:38 ` [PATCH v3 3/4] mm: Support batched unmap for lazyfree large folios during reclamation Barry Song
  2025-01-15  3:38 ` [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap Barry Song
  3 siblings, 1 reply; 17+ messages in thread
From: Barry Song @ 2025-01-15  3:38 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 21cnbao, baolin.wang, chrisl, david, ioworker0, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan,
	Catalin Marinas, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Anshuman Khandual, Shaoqin Huang,
	Gavin Shan, Kefeng Wang, Mark Rutland, Kirill A. Shutemov,
	Yosry Ahmed, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Yicong Yang, Will Deacon

From: Barry Song <v-songbaohua@oppo.com>

This patch lays the groundwork for supporting batch PTE unmapping in
try_to_unmap_one(). It introduces range handling for TLB batch flushing,
with the range currently set to the size of a single page (PAGE_SIZE).

The function __flush_tlb_range_nosync() is architecture-specific and is
only used within arch/arm64. It needs only the mm structure, not the
whole vma. To allow its reuse by arch_tlbbatch_add_pending(), which
operates on an mm but has no vma, this patch changes
__flush_tlb_range_nosync() to take mm as its parameter.
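
For example, callers can now express the flush range explicitly; both
lines below are taken from the diffs in this series:

   /* this patch: a single page, behaviour unchanged */
   set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE);
   /* patch 3/4: one deferred flush covering a whole folio */
   set_tlb_ubc_flush_pending(mm, pteval, address, address + folio_size(folio));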

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shaoqin Huang <shahuang@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/tlbflush.h | 25 +++++++++++++------------
 arch/arm64/mm/contpte.c           |  2 +-
 arch/riscv/include/asm/tlbflush.h |  5 +++--
 arch/riscv/mm/tlbflush.c          |  5 +++--
 arch/x86/include/asm/tlbflush.h   |  5 +++--
 mm/rmap.c                         | 12 +++++++-----
 6 files changed, 30 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bc94e036a26b..98fbc8df7cf3 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -322,13 +322,6 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
 	return true;
 }
 
-static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
-					     struct mm_struct *mm,
-					     unsigned long uaddr)
-{
-	__flush_tlb_page_nosync(mm, uaddr);
-}
-
 /*
  * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
  * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
@@ -448,7 +441,7 @@ static inline bool __flush_tlb_range_limit_excess(unsigned long start,
 	return false;
 }
 
-static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
+static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
 				     unsigned long start, unsigned long end,
 				     unsigned long stride, bool last_level,
 				     int tlb_level)
@@ -460,12 +453,12 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
 	pages = (end - start) >> PAGE_SHIFT;
 
 	if (__flush_tlb_range_limit_excess(start, end, pages, stride)) {
-		flush_tlb_mm(vma->vm_mm);
+		flush_tlb_mm(mm);
 		return;
 	}
 
 	dsb(ishst);
-	asid = ASID(vma->vm_mm);
+	asid = ASID(mm);
 
 	if (last_level)
 		__flush_tlb_range_op(vale1is, start, pages, stride, asid,
@@ -474,7 +467,7 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
 		__flush_tlb_range_op(vae1is, start, pages, stride, asid,
 				     tlb_level, true, lpa2_is_enabled());
 
-	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
 static inline void __flush_tlb_range(struct vm_area_struct *vma,
@@ -482,7 +475,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 				     unsigned long stride, bool last_level,
 				     int tlb_level)
 {
-	__flush_tlb_range_nosync(vma, start, end, stride,
+	__flush_tlb_range_nosync(vma->vm_mm, start, end, stride,
 				 last_level, tlb_level);
 	dsb(ish);
 }
@@ -533,6 +526,14 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
 	dsb(ish);
 	isb();
 }
+
+static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+		struct mm_struct *mm,
+		unsigned long start,
+		unsigned long end)
+{
+	__flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3);
+}
 #endif
 
 #endif
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 55107d27d3f8..bcac4f55f9c1 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -335,7 +335,7 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
 		 * eliding the trailing DSB applies here.
 		 */
 		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
-		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
+		__flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
 					 PAGE_SIZE, true, 3);
 	}
 
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index 72e559934952..e4c533691a7d 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -60,8 +60,9 @@ void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
 
 bool arch_tlbbatch_should_defer(struct mm_struct *mm);
 void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
-			       struct mm_struct *mm,
-			       unsigned long uaddr);
+		struct mm_struct *mm,
+		unsigned long start,
+		unsigned long end);
 void arch_flush_tlb_batched_pending(struct mm_struct *mm);
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 9b6e86ce3867..6d6e8e7cc576 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -186,8 +186,9 @@ bool arch_tlbbatch_should_defer(struct mm_struct *mm)
 }
 
 void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
-			       struct mm_struct *mm,
-			       unsigned long uaddr)
+		struct mm_struct *mm,
+		unsigned long start,
+		unsigned long end)
 {
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
 }
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 69e79fff41b8..2b511972d008 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -278,8 +278,9 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 }
 
 static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
-					     struct mm_struct *mm,
-					     unsigned long uaddr)
+		struct mm_struct *mm,
+		unsigned long start,
+		unsigned long end)
 {
 	inc_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
diff --git a/mm/rmap.c b/mm/rmap.c
index de6b8c34e98c..abeb9fcec384 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -672,7 +672,8 @@ void try_to_unmap_flush_dirty(void)
 	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
 
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
-				      unsigned long uaddr)
+		unsigned long start,
+		unsigned long end)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 	int batch;
@@ -681,7 +682,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
 	if (!pte_accessible(mm, pteval))
 		return;
 
-	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
+	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, start, end);
 	tlb_ubc->flush_required = true;
 
 	/*
@@ -757,7 +758,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
 }
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
-				      unsigned long uaddr)
+		unsigned long start,
+		unsigned long end)
 {
 }
 
@@ -1792,7 +1794,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pteval, address);
+				set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
@@ -2164,7 +2166,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pteval, address);
+				set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
-- 
2.39.3 (Apple Git-146)




* [PATCH v3 3/4] mm: Support batched unmap for lazyfree large folios during reclamation
  2025-01-15  3:38 [PATCH v3 0/4] mm: batched unmap lazyfree large folios during reclamation Barry Song
  2025-01-15  3:38 ` [PATCH v3 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one Barry Song
  2025-01-15  3:38 ` [PATCH v3 2/4] mm: Support tlbbatch flush for a range of PTEs Barry Song
@ 2025-01-15  3:38 ` Barry Song
  2025-02-04 11:38   ` David Hildenbrand
  2025-01-15  3:38 ` [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap Barry Song
  3 siblings, 1 reply; 17+ messages in thread
From: Barry Song @ 2025-01-15  3:38 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 21cnbao, baolin.wang, chrisl, david, ioworker0, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan

From: Barry Song <v-songbaohua@oppo.com>

Currently, the PTEs and rmap of a large folio are removed one at a time.
This is not only slow but also causes the large folio to be unnecessarily
added to deferred_split, which can lead to races between the
deferred_split shrinker callback and memory reclamation. This patch
releases all PTEs and rmap entries in a batch; for now, it only handles
lazyfree large folios.

The microbenchmark below tries to reclaim 128MB of lazyfree large folios
whose size is 64KiB:

 #include <stdio.h>
 #include <sys/mman.h>
 #include <string.h>
 #include <time.h>

 #define SIZE 128*1024*1024  // 128 MB

 unsigned long read_split_deferred()
 {
 	FILE *file = fopen("/sys/kernel/mm/transparent_hugepage"
			"/hugepages-64kB/stats/split_deferred", "r");
 	if (!file) {
 		perror("Error opening file");
 		return 0;
 	}

 	unsigned long value;
 	if (fscanf(file, "%lu", &value) != 1) {
 		perror("Error reading value");
 		fclose(file);
 		return 0;
 	}

 	fclose(file);
 	return value;
 }

 int main(int argc, char *argv[])
 {
 	while(1) {
 		volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
 				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

 		memset((void *)p, 1, SIZE);

 		madvise((void *)p, SIZE, MADV_FREE);

 		clock_t start_time = clock();
 		unsigned long start_split = read_split_deferred();
 		madvise((void *)p, SIZE, MADV_PAGEOUT);
 		clock_t end_time = clock();
 		unsigned long end_split = read_split_deferred();

 		double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
 		printf("Time taken by reclamation: %f seconds, split_deferred: %ld\n",
 			elapsed_time, end_split - start_split);

 		munmap((void *)p, SIZE);
 	}
 	return 0;
 }

w/o patch:
~ # ./a.out
Time taken by reclamation: 0.177418 seconds, split_deferred: 2048
Time taken by reclamation: 0.178348 seconds, split_deferred: 2048
Time taken by reclamation: 0.174525 seconds, split_deferred: 2048
Time taken by reclamation: 0.171620 seconds, split_deferred: 2048
Time taken by reclamation: 0.172241 seconds, split_deferred: 2048
Time taken by reclamation: 0.174003 seconds, split_deferred: 2048
Time taken by reclamation: 0.171058 seconds, split_deferred: 2048
Time taken by reclamation: 0.171993 seconds, split_deferred: 2048
Time taken by reclamation: 0.169829 seconds, split_deferred: 2048
Time taken by reclamation: 0.172895 seconds, split_deferred: 2048
Time taken by reclamation: 0.176063 seconds, split_deferred: 2048
Time taken by reclamation: 0.172568 seconds, split_deferred: 2048
Time taken by reclamation: 0.171185 seconds, split_deferred: 2048
Time taken by reclamation: 0.170632 seconds, split_deferred: 2048
Time taken by reclamation: 0.170208 seconds, split_deferred: 2048
Time taken by reclamation: 0.174192 seconds, split_deferred: 2048
...

w/ patch:
~ # ./a.out
Time taken by reclamation: 0.074231 seconds, split_deferred: 0
Time taken by reclamation: 0.071026 seconds, split_deferred: 0
Time taken by reclamation: 0.072029 seconds, split_deferred: 0
Time taken by reclamation: 0.071873 seconds, split_deferred: 0
Time taken by reclamation: 0.073573 seconds, split_deferred: 0
Time taken by reclamation: 0.071906 seconds, split_deferred: 0
Time taken by reclamation: 0.073604 seconds, split_deferred: 0
Time taken by reclamation: 0.075903 seconds, split_deferred: 0
Time taken by reclamation: 0.073191 seconds, split_deferred: 0
Time taken by reclamation: 0.071228 seconds, split_deferred: 0
Time taken by reclamation: 0.071391 seconds, split_deferred: 0
Time taken by reclamation: 0.071468 seconds, split_deferred: 0
Time taken by reclamation: 0.071896 seconds, split_deferred: 0
Time taken by reclamation: 0.072508 seconds, split_deferred: 0
Time taken by reclamation: 0.071884 seconds, split_deferred: 0
Time taken by reclamation: 0.072433 seconds, split_deferred: 0
Time taken by reclamation: 0.071939 seconds, split_deferred: 0
...

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/rmap.c | 47 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index abeb9fcec384..be1978d2712d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1642,6 +1642,25 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page,
 #endif
 }
 
+/* We support batch unmapping of PTEs for lazyfree large folios */
+static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
+			struct folio *folio, pte_t *ptep)
+{
+	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+	int max_nr = folio_nr_pages(folio);
+	pte_t pte = ptep_get(ptep);
+
+	if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
+		return false;
+	if (pte_none(pte) || pte_unused(pte) || !pte_present(pte))
+		return false;
+	if (pte_pfn(pte) != folio_pfn(folio))
+		return false;
+
+	return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
+			       NULL, NULL) == max_nr;
+}
+
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
@@ -1655,6 +1674,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+	int nr_pages = 1;
 	unsigned long pfn;
 	unsigned long hsz = 0;
 
@@ -1780,6 +1800,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				hugetlb_vma_unlock_write(vma);
 			}
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
+		} else if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
+				can_batch_unmap_folio_ptes(address, folio, pvmw.pte)) {
+			nr_pages = folio_nr_pages(folio);
+			flush_cache_range(vma, range.start, range.end);
+			pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr_pages, 0);
+			if (should_defer_flush(mm, flags))
+				set_tlb_ubc_flush_pending(mm, pteval, address,
+					address + folio_size(folio));
+			else
+				flush_tlb_range(vma, range.start, range.end);
 		} else {
 			flush_cache_page(vma, address, pfn);
 			/* Nuke the page table entry. */
@@ -1875,7 +1905,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 					 * redirtied either using the page table or a previously
 					 * obtained GUP reference.
 					 */
-					set_pte_at(mm, address, pvmw.pte, pteval);
+					set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
 					folio_set_swapbacked(folio);
 					goto walk_abort;
 				} else if (ref_count != 1 + map_count) {
@@ -1888,10 +1918,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 					 * We'll come back here later and detect if the folio was
 					 * dirtied when the additional reference is gone.
 					 */
-					set_pte_at(mm, address, pvmw.pte, pteval);
+					set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
 					goto walk_abort;
 				}
-				dec_mm_counter(mm, MM_ANONPAGES);
+				add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
 				goto discard;
 			}
 
@@ -1943,13 +1973,18 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			dec_mm_counter(mm, mm_counter_file(folio));
 		}
 discard:
-		if (unlikely(folio_test_hugetlb(folio)))
+		if (unlikely(folio_test_hugetlb(folio))) {
 			hugetlb_remove_rmap(folio);
-		else
-			folio_remove_rmap_pte(folio, subpage, vma);
+		} else {
+			folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
+			folio_ref_sub(folio, nr_pages - 1);
+		}
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_drain_local();
 		folio_put(folio);
+		/* We have already batched the entire folio */
+		if (nr_pages > 1)
+			goto walk_done;
 		continue;
 walk_abort:
 		ret = false;
-- 
2.39.3 (Apple Git-146)




* [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-15  3:38 [PATCH v3 0/4] mm: batched unmap lazyfree large folios during reclamation Barry Song
                   ` (2 preceding siblings ...)
  2025-01-15  3:38 ` [PATCH v3 3/4] mm: Support batched unmap for lazyfree large folios during reclamation Barry Song
@ 2025-01-15  3:38 ` Barry Song
  2025-01-15  5:01   ` Lance Yang
                     ` (2 more replies)
  3 siblings, 3 replies; 17+ messages in thread
From: Barry Song @ 2025-01-15  3:38 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 21cnbao, baolin.wang, chrisl, david, ioworker0, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan

From: Barry Song <v-songbaohua@oppo.com>

The try_to_unmap_one() function currently handles PMD-mapped THPs
inefficiently. It first splits the PMD into PTEs, copies the dirty
state from the PMD to the PTEs, iterates over the PTEs to locate
the dirty state, and then marks the THP as swap-backed. This process
involves unnecessary PMD splitting and redundant iteration. Instead,
this functionality can be efficiently managed in
__discard_anon_folio_pmd_locked(), avoiding the extra steps and
improving performance.

The following microbenchmark redirties folios after invoking MADV_FREE,
then measures the time taken to perform memory reclamation (actually
set those folios swapbacked again) on the redirtied folios.

 #include <stdio.h>
 #include <sys/mman.h>
 #include <string.h>
 #include <time.h>

 #define SIZE 128*1024*1024  // 128 MB

 int main(int argc, char *argv[])
 {
 	while(1) {
 		volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
 				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

 		memset((void *)p, 1, SIZE);
 		madvise((void *)p, SIZE, MADV_FREE);
 		/* redirty after MADV_FREE */
 		memset((void *)p, 1, SIZE);

		clock_t start_time = clock();
 		madvise((void *)p, SIZE, MADV_PAGEOUT);
 		clock_t end_time = clock();

 		double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
 		printf("Time taken by reclamation: %f seconds\n", elapsed_time);

 		munmap((void *)p, SIZE);
 	}
 	return 0;
 }

Testing results are as below,
w/o patch:
~ # ./a.out
Time taken by reclamation: 0.007300 seconds
Time taken by reclamation: 0.007226 seconds
Time taken by reclamation: 0.007295 seconds
Time taken by reclamation: 0.007731 seconds
Time taken by reclamation: 0.007134 seconds
Time taken by reclamation: 0.007285 seconds
Time taken by reclamation: 0.007720 seconds
Time taken by reclamation: 0.007128 seconds
Time taken by reclamation: 0.007710 seconds
Time taken by reclamation: 0.007712 seconds
Time taken by reclamation: 0.007236 seconds
Time taken by reclamation: 0.007690 seconds
Time taken by reclamation: 0.007174 seconds
Time taken by reclamation: 0.007670 seconds
Time taken by reclamation: 0.007169 seconds
Time taken by reclamation: 0.007305 seconds
Time taken by reclamation: 0.007432 seconds
Time taken by reclamation: 0.007158 seconds
Time taken by reclamation: 0.007133 seconds
…

w/ patch

~ # ./a.out
Time taken by reclamation: 0.002124 seconds
Time taken by reclamation: 0.002116 seconds
Time taken by reclamation: 0.002150 seconds
Time taken by reclamation: 0.002261 seconds
Time taken by reclamation: 0.002137 seconds
Time taken by reclamation: 0.002173 seconds
Time taken by reclamation: 0.002063 seconds
Time taken by reclamation: 0.002088 seconds
Time taken by reclamation: 0.002169 seconds
Time taken by reclamation: 0.002124 seconds
Time taken by reclamation: 0.002111 seconds
Time taken by reclamation: 0.002224 seconds
Time taken by reclamation: 0.002297 seconds
Time taken by reclamation: 0.002260 seconds
Time taken by reclamation: 0.002246 seconds
Time taken by reclamation: 0.002272 seconds
Time taken by reclamation: 0.002277 seconds
Time taken by reclamation: 0.002462 seconds
…

This patch significantly speeds up try_to_unmap_one() by allowing it
to skip redirtied THPs without splitting the PMD.

Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Suggested-by: Lance Yang <ioworker0@gmail.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/huge_memory.c | 24 +++++++++++++++++-------
 mm/rmap.c        | 13 ++++++++++---
 2 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3d3ebdc002d5..47cc8c3f8f80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3070,8 +3070,12 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
 	int ref_count, map_count;
 	pmd_t orig_pmd = *pmdp;
 
-	if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
+	if (pmd_dirty(orig_pmd))
+		folio_set_dirty(folio);
+	if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
+		folio_set_swapbacked(folio);
 		return false;
+	}
 
 	orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
 
@@ -3098,8 +3102,15 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
 	 *
 	 * The only folio refs must be one from isolation plus the rmap(s).
 	 */
-	if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
-	    ref_count != map_count + 1) {
+	if (pmd_dirty(orig_pmd))
+		folio_set_dirty(folio);
+	if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
+		folio_set_swapbacked(folio);
+		set_pmd_at(mm, addr, pmdp, orig_pmd);
+		return false;
+	}
+
+	if (ref_count != map_count + 1) {
 		set_pmd_at(mm, addr, pmdp, orig_pmd);
 		return false;
 	}
@@ -3119,12 +3130,11 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
 {
 	VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
 	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
+	VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
 	VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
 
-	if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
-		return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
-
-	return false;
+	return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
 }
 
 static void remap_page(struct folio *folio, unsigned long nr, int flags)
diff --git a/mm/rmap.c b/mm/rmap.c
index be1978d2712d..a859c399ec7c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1724,9 +1724,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		}
 
 		if (!pvmw.pte) {
-			if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
-						  folio))
-				goto walk_done;
+			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
+				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
+					goto walk_done;
+				/*
+				 * unmap_huge_pmd_locked has either already marked
+				 * the folio as swap-backed or decided to retain it
+				 * due to GUP or speculative references.
+				 */
+				goto walk_abort;
+			}
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
 				/*
-- 
2.39.3 (Apple Git-146)




* Re: [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-15  3:38 ` [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap Barry Song
@ 2025-01-15  5:01   ` Lance Yang
  2025-01-15  5:09     ` Barry Song
  2025-01-17  1:30   ` Baolin Wang
  2025-01-18 14:00   ` Lance Yang
  2 siblings, 1 reply; 17+ messages in thread
From: Lance Yang @ 2025-01-15  5:01 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan

On Wed, Jan 15, 2025 at 11:38 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> The try_to_unmap_one() function currently handles PMD-mapped THPs
> inefficiently. It first splits the PMD into PTEs, copies the dirty
> state from the PMD to the PTEs, iterates over the PTEs to locate
> the dirty state, and then marks the THP as swap-backed. This process
> involves unnecessary PMD splitting and redundant iteration. Instead,
> this functionality can be efficiently managed in
> __discard_anon_folio_pmd_locked(), avoiding the extra steps and
> improving performance.
>
> The following microbenchmark redirties folios after invoking MADV_FREE,
> then measures the time taken to perform memory reclamation (actually
> set those folios swapbacked again) on the redirtied folios.
>
>  #include <stdio.h>
>  #include <sys/mman.h>
>  #include <string.h>
>  #include <time.h>
>
>  #define SIZE 128*1024*1024  // 128 MB
>
>  int main(int argc, char *argv[])
>  {
>         while(1) {
>                 volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
>                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
>                 memset((void *)p, 1, SIZE);
>                 madvise((void *)p, SIZE, MADV_FREE);
>                 /* redirty after MADV_FREE */
>                 memset((void *)p, 1, SIZE);
>
>                 clock_t start_time = clock();
>                 madvise((void *)p, SIZE, MADV_PAGEOUT);
>                 clock_t end_time = clock();
>
>                 double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
>                 printf("Time taken by reclamation: %f seconds\n", elapsed_time);
>
>                 munmap((void *)p, SIZE);
>         }
>         return 0;
>  }
>
> Testing results are as below,
> w/o patch:
> ~ # ./a.out
> Time taken by reclamation: 0.007300 seconds
> Time taken by reclamation: 0.007226 seconds
> Time taken by reclamation: 0.007295 seconds
> Time taken by reclamation: 0.007731 seconds
> Time taken by reclamation: 0.007134 seconds
> Time taken by reclamation: 0.007285 seconds
> Time taken by reclamation: 0.007720 seconds
> Time taken by reclamation: 0.007128 seconds
> Time taken by reclamation: 0.007710 seconds
> Time taken by reclamation: 0.007712 seconds
> Time taken by reclamation: 0.007236 seconds
> Time taken by reclamation: 0.007690 seconds
> Time taken by reclamation: 0.007174 seconds
> Time taken by reclamation: 0.007670 seconds
> Time taken by reclamation: 0.007169 seconds
> Time taken by reclamation: 0.007305 seconds
> Time taken by reclamation: 0.007432 seconds
> Time taken by reclamation: 0.007158 seconds
> Time taken by reclamation: 0.007133 seconds
> …
>
> w/ patch
>
> ~ # ./a.out
> Time taken by reclamation: 0.002124 seconds
> Time taken by reclamation: 0.002116 seconds
> Time taken by reclamation: 0.002150 seconds
> Time taken by reclamation: 0.002261 seconds
> Time taken by reclamation: 0.002137 seconds
> Time taken by reclamation: 0.002173 seconds
> Time taken by reclamation: 0.002063 seconds
> Time taken by reclamation: 0.002088 seconds
> Time taken by reclamation: 0.002169 seconds
> Time taken by reclamation: 0.002124 seconds
> Time taken by reclamation: 0.002111 seconds
> Time taken by reclamation: 0.002224 seconds
> Time taken by reclamation: 0.002297 seconds
> Time taken by reclamation: 0.002260 seconds
> Time taken by reclamation: 0.002246 seconds
> Time taken by reclamation: 0.002272 seconds
> Time taken by reclamation: 0.002277 seconds
> Time taken by reclamation: 0.002462 seconds
> …
>
> This patch significantly speeds up try_to_unmap_one() by allowing it
> to skip redirtied THPs without splitting the PMD.
>
> Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Suggested-by: Lance Yang <ioworker0@gmail.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/huge_memory.c | 24 +++++++++++++++++-------
>  mm/rmap.c        | 13 ++++++++++---
>  2 files changed, 27 insertions(+), 10 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 3d3ebdc002d5..47cc8c3f8f80 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3070,8 +3070,12 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
>         int ref_count, map_count;
>         pmd_t orig_pmd = *pmdp;
>
> -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> +       if (pmd_dirty(orig_pmd))
> +               folio_set_dirty(folio);
> +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> +               folio_set_swapbacked(folio);
>                 return false;
> +       }

If either the PMD or the folio is dirty, should we just return false right away,
regardless of VM_DROPPABLE? There’s no need to proceed further in that
case, IMHO ;)

Thanks,
Lance

>
>         orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
>
> @@ -3098,8 +3102,15 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
>          *
>          * The only folio refs must be one from isolation plus the rmap(s).
>          */
> -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
> -           ref_count != map_count + 1) {
> +       if (pmd_dirty(orig_pmd))
> +               folio_set_dirty(folio);
> +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> +               folio_set_swapbacked(folio);
> +               set_pmd_at(mm, addr, pmdp, orig_pmd);
> +               return false;
> +       }
> +
> +       if (ref_count != map_count + 1) {
>                 set_pmd_at(mm, addr, pmdp, orig_pmd);
>                 return false;
>         }
> @@ -3119,12 +3130,11 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
>  {
>         VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
>         VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> +       VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> +       VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
>         VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
>
> -       if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
> -               return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> -
> -       return false;
> +       return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
>  }
>
>  static void remap_page(struct folio *folio, unsigned long nr, int flags)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index be1978d2712d..a859c399ec7c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1724,9 +1724,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>                 }
>
>                 if (!pvmw.pte) {
> -                       if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> -                                                 folio))
> -                               goto walk_done;
> +                       if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> +                               if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
> +                                       goto walk_done;
> +                               /*
> +                                * unmap_huge_pmd_locked has either already marked
> +                                * the folio as swap-backed or decided to retain it
> +                                * due to GUP or speculative references.
> +                                */
> +                               goto walk_abort;
> +                       }
>
>                         if (flags & TTU_SPLIT_HUGE_PMD) {
>                                 /*
> --
> 2.39.3 (Apple Git-146)
>



* Re: [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-15  5:01   ` Lance Yang
@ 2025-01-15  5:09     ` Barry Song
  2025-01-15  5:41       ` Lance Yang
  2025-01-15  6:26       ` Lance Yang
  0 siblings, 2 replies; 17+ messages in thread
From: Barry Song @ 2025-01-15  5:09 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan

On Wed, Jan 15, 2025 at 6:01 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> On Wed, Jan 15, 2025 at 11:38 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > The try_to_unmap_one() function currently handles PMD-mapped THPs
> > inefficiently. It first splits the PMD into PTEs, copies the dirty
> > state from the PMD to the PTEs, iterates over the PTEs to locate
> > the dirty state, and then marks the THP as swap-backed. This process
> > involves unnecessary PMD splitting and redundant iteration. Instead,
> > this functionality can be efficiently managed in
> > __discard_anon_folio_pmd_locked(), avoiding the extra steps and
> > improving performance.
> >
> > The following microbenchmark redirties folios after invoking MADV_FREE,
> > then measures the time taken to perform memory reclamation (actually
> > set those folios swapbacked again) on the redirtied folios.
> >
> >  #include <stdio.h>
> >  #include <sys/mman.h>
> >  #include <string.h>
> >  #include <time.h>
> >
> >  #define SIZE 128*1024*1024  // 128 MB
> >
> >  int main(int argc, char *argv[])
> >  {
> >         while(1) {
> >                 volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
> >                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >
> >                 memset((void *)p, 1, SIZE);
> >                 madvise((void *)p, SIZE, MADV_FREE);
> >                 /* redirty after MADV_FREE */
> >                 memset((void *)p, 1, SIZE);
> >
> >                 clock_t start_time = clock();
> >                 madvise((void *)p, SIZE, MADV_PAGEOUT);
> >                 clock_t end_time = clock();
> >
> >                 double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
> >                 printf("Time taken by reclamation: %f seconds\n", elapsed_time);
> >
> >                 munmap((void *)p, SIZE);
> >         }
> >         return 0;
> >  }
> >
> > Testing results are as below,
> > w/o patch:
> > ~ # ./a.out
> > Time taken by reclamation: 0.007300 seconds
> > Time taken by reclamation: 0.007226 seconds
> > Time taken by reclamation: 0.007295 seconds
> > Time taken by reclamation: 0.007731 seconds
> > Time taken by reclamation: 0.007134 seconds
> > Time taken by reclamation: 0.007285 seconds
> > Time taken by reclamation: 0.007720 seconds
> > Time taken by reclamation: 0.007128 seconds
> > Time taken by reclamation: 0.007710 seconds
> > Time taken by reclamation: 0.007712 seconds
> > Time taken by reclamation: 0.007236 seconds
> > Time taken by reclamation: 0.007690 seconds
> > Time taken by reclamation: 0.007174 seconds
> > Time taken by reclamation: 0.007670 seconds
> > Time taken by reclamation: 0.007169 seconds
> > Time taken by reclamation: 0.007305 seconds
> > Time taken by reclamation: 0.007432 seconds
> > Time taken by reclamation: 0.007158 seconds
> > Time taken by reclamation: 0.007133 seconds
> > …
> >
> > w/ patch
> >
> > ~ # ./a.out
> > Time taken by reclamation: 0.002124 seconds
> > Time taken by reclamation: 0.002116 seconds
> > Time taken by reclamation: 0.002150 seconds
> > Time taken by reclamation: 0.002261 seconds
> > Time taken by reclamation: 0.002137 seconds
> > Time taken by reclamation: 0.002173 seconds
> > Time taken by reclamation: 0.002063 seconds
> > Time taken by reclamation: 0.002088 seconds
> > Time taken by reclamation: 0.002169 seconds
> > Time taken by reclamation: 0.002124 seconds
> > Time taken by reclamation: 0.002111 seconds
> > Time taken by reclamation: 0.002224 seconds
> > Time taken by reclamation: 0.002297 seconds
> > Time taken by reclamation: 0.002260 seconds
> > Time taken by reclamation: 0.002246 seconds
> > Time taken by reclamation: 0.002272 seconds
> > Time taken by reclamation: 0.002277 seconds
> > Time taken by reclamation: 0.002462 seconds
> > …
> >
> > This patch significantly speeds up try_to_unmap_one() by allowing it
> > to skip redirtied THPs without splitting the PMD.
> >
> > Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Suggested-by: Lance Yang <ioworker0@gmail.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/huge_memory.c | 24 +++++++++++++++++-------
> >  mm/rmap.c        | 13 ++++++++++---
> >  2 files changed, 27 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 3d3ebdc002d5..47cc8c3f8f80 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3070,8 +3070,12 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> >         int ref_count, map_count;
> >         pmd_t orig_pmd = *pmdp;
> >
> > -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> > +       if (pmd_dirty(orig_pmd))
> > +               folio_set_dirty(folio);
> > +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > +               folio_set_swapbacked(folio);
> >                 return false;
> > +       }
>
> If either the PMD or the folio is dirty, should we just return false right away,
> regardless of VM_DROPPABLE? There’s no need to proceed further in that
> case, IMHO ;)

I don't quite understand you, but we need to proceed to clear the pmd entry.
If VM_DROPPABLE is set, we still drop the folio even if it is dirty.

>
> Thanks,
> Lance
>
> >
> >         orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
> >
> > @@ -3098,8 +3102,15 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> >          *
> >          * The only folio refs must be one from isolation plus the rmap(s).
> >          */
> > -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
> > -           ref_count != map_count + 1) {
> > +       if (pmd_dirty(orig_pmd))
> > +               folio_set_dirty(folio);
> > +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > +               folio_set_swapbacked(folio);
> > +               set_pmd_at(mm, addr, pmdp, orig_pmd);
> > +               return false;
> > +       }
> > +
> > +       if (ref_count != map_count + 1) {
> >                 set_pmd_at(mm, addr, pmdp, orig_pmd);
> >                 return false;
> >         }
> > @@ -3119,12 +3130,11 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> >  {
> >         VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
> >         VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > +       VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> > +       VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
> >         VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
> >
> > -       if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
> > -               return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> > -
> > -       return false;
> > +       return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> >  }
> >
> >  static void remap_page(struct folio *folio, unsigned long nr, int flags)
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index be1978d2712d..a859c399ec7c 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1724,9 +1724,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >                 }
> >
> >                 if (!pvmw.pte) {
> > -                       if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> > -                                                 folio))
> > -                               goto walk_done;
> > +                       if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> > +                               if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
> > +                                       goto walk_done;
> > +                               /*
> > +                                * unmap_huge_pmd_locked has either already marked
> > +                                * the folio as swap-backed or decided to retain it
> > +                                * due to GUP or speculative references.
> > +                                */
> > +                               goto walk_abort;
> > +                       }
> >
> >                         if (flags & TTU_SPLIT_HUGE_PMD) {
> >                                 /*
> > --
> > 2.39.3 (Apple Git-146)
> >



* Re: [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-15  5:09     ` Barry Song
@ 2025-01-15  5:41       ` Lance Yang
  2025-01-15  6:26       ` Lance Yang
  1 sibling, 0 replies; 17+ messages in thread
From: Lance Yang @ 2025-01-15  5:41 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan

On Wed, Jan 15, 2025 at 1:09 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Jan 15, 2025 at 6:01 PM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > On Wed, Jan 15, 2025 at 11:38 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > The try_to_unmap_one() function currently handles PMD-mapped THPs
> > > inefficiently. It first splits the PMD into PTEs, copies the dirty
> > > state from the PMD to the PTEs, iterates over the PTEs to locate
> > > the dirty state, and then marks the THP as swap-backed. This process
> > > involves unnecessary PMD splitting and redundant iteration. Instead,
> > > this functionality can be efficiently managed in
> > > __discard_anon_folio_pmd_locked(), avoiding the extra steps and
> > > improving performance.
> > >
> > > The following microbenchmark redirties folios after invoking MADV_FREE,
> > > then measures the time taken to perform memory reclamation (actually
> > > set those folios swapbacked again) on the redirtied folios.
> > >
> > >  #include <stdio.h>
> > >  #include <sys/mman.h>
> > >  #include <string.h>
> > >  #include <time.h>
> > >
> > >  #define SIZE 128*1024*1024  // 128 MB
> > >
> > >  int main(int argc, char *argv[])
> > >  {
> > >         while(1) {
> > >                 volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
> > >                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > >
> > >                 memset((void *)p, 1, SIZE);
> > >                 madvise((void *)p, SIZE, MADV_FREE);
> > >                 /* redirty after MADV_FREE */
> > >                 memset((void *)p, 1, SIZE);
> > >
> > >                 clock_t start_time = clock();
> > >                 madvise((void *)p, SIZE, MADV_PAGEOUT);
> > >                 clock_t end_time = clock();
> > >
> > >                 double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
> > >                 printf("Time taken by reclamation: %f seconds\n", elapsed_time);
> > >
> > >                 munmap((void *)p, SIZE);
> > >         }
> > >         return 0;
> > >  }
> > >
> > > Testing results are as below,
> > > w/o patch:
> > > ~ # ./a.out
> > > Time taken by reclamation: 0.007300 seconds
> > > Time taken by reclamation: 0.007226 seconds
> > > Time taken by reclamation: 0.007295 seconds
> > > Time taken by reclamation: 0.007731 seconds
> > > Time taken by reclamation: 0.007134 seconds
> > > Time taken by reclamation: 0.007285 seconds
> > > Time taken by reclamation: 0.007720 seconds
> > > Time taken by reclamation: 0.007128 seconds
> > > Time taken by reclamation: 0.007710 seconds
> > > Time taken by reclamation: 0.007712 seconds
> > > Time taken by reclamation: 0.007236 seconds
> > > Time taken by reclamation: 0.007690 seconds
> > > Time taken by reclamation: 0.007174 seconds
> > > Time taken by reclamation: 0.007670 seconds
> > > Time taken by reclamation: 0.007169 seconds
> > > Time taken by reclamation: 0.007305 seconds
> > > Time taken by reclamation: 0.007432 seconds
> > > Time taken by reclamation: 0.007158 seconds
> > > Time taken by reclamation: 0.007133 seconds
> > > …
> > >
> > > w/ patch
> > >
> > > ~ # ./a.out
> > > Time taken by reclamation: 0.002124 seconds
> > > Time taken by reclamation: 0.002116 seconds
> > > Time taken by reclamation: 0.002150 seconds
> > > Time taken by reclamation: 0.002261 seconds
> > > Time taken by reclamation: 0.002137 seconds
> > > Time taken by reclamation: 0.002173 seconds
> > > Time taken by reclamation: 0.002063 seconds
> > > Time taken by reclamation: 0.002088 seconds
> > > Time taken by reclamation: 0.002169 seconds
> > > Time taken by reclamation: 0.002124 seconds
> > > Time taken by reclamation: 0.002111 seconds
> > > Time taken by reclamation: 0.002224 seconds
> > > Time taken by reclamation: 0.002297 seconds
> > > Time taken by reclamation: 0.002260 seconds
> > > Time taken by reclamation: 0.002246 seconds
> > > Time taken by reclamation: 0.002272 seconds
> > > Time taken by reclamation: 0.002277 seconds
> > > Time taken by reclamation: 0.002462 seconds
> > > …
> > >
> > > This patch significantly speeds up try_to_unmap_one() by allowing it
> > > to skip redirtied THPs without splitting the PMD.
> > >
> > > Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Suggested-by: Lance Yang <ioworker0@gmail.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  mm/huge_memory.c | 24 +++++++++++++++++-------
> > >  mm/rmap.c        | 13 ++++++++++---
> > >  2 files changed, 27 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 3d3ebdc002d5..47cc8c3f8f80 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -3070,8 +3070,12 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> > >         int ref_count, map_count;
> > >         pmd_t orig_pmd = *pmdp;
> > >
> > > -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> > > +       if (pmd_dirty(orig_pmd))
> > > +               folio_set_dirty(folio);
> > > +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > > +               folio_set_swapbacked(folio);
> > >                 return false;
> > > +       }
> >
> > If either the PMD or the folio is dirty, should we just return false right away,
> > regardless of VM_DROPPABLE? There’s no need to proceed further in that
> > case, IMHO ;)
>
> I don't quite understand you, but we need to proceed to clear pmd entry.
> if vm_droppable is true, even if the folio is dirty, we still drop the folio.

Ah, you're right, and I completely got it wrong ;(

Thanks,
Lance

>
> >
> > Thanks,
> > Lance
> >
> > >
> > >         orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
> > >
> > > @@ -3098,8 +3102,15 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> > >          *
> > >          * The only folio refs must be one from isolation plus the rmap(s).
> > >          */
> > > -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
> > > -           ref_count != map_count + 1) {
> > > +       if (pmd_dirty(orig_pmd))
> > > +               folio_set_dirty(folio);
> > > +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > > +               folio_set_swapbacked(folio);
> > > +               set_pmd_at(mm, addr, pmdp, orig_pmd);
> > > +               return false;
> > > +       }
> > > +
> > > +       if (ref_count != map_count + 1) {
> > >                 set_pmd_at(mm, addr, pmdp, orig_pmd);
> > >                 return false;
> > >         }
> > > @@ -3119,12 +3130,11 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> > >  {
> > >         VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
> > >         VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > > +       VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> > > +       VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
> > >         VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
> > >
> > > -       if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
> > > -               return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> > > -
> > > -       return false;
> > > +       return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> > >  }
> > >
> > >  static void remap_page(struct folio *folio, unsigned long nr, int flags)
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index be1978d2712d..a859c399ec7c 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -1724,9 +1724,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > >                 }
> > >
> > >                 if (!pvmw.pte) {
> > > -                       if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> > > -                                                 folio))
> > > -                               goto walk_done;
> > > +                       if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> > > +                               if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
> > > +                                       goto walk_done;
> > > +                               /*
> > > +                                * unmap_huge_pmd_locked has either already marked
> > > +                                * the folio as swap-backed or decided to retain it
> > > +                                * due to GUP or speculative references.
> > > +                                */
> > > +                               goto walk_abort;
> > > +                       }
> > >
> > >                         if (flags & TTU_SPLIT_HUGE_PMD) {
> > >                                 /*
> > > --
> > > 2.39.3 (Apple Git-146)
> > >


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-15  5:09     ` Barry Song
  2025-01-15  5:41       ` Lance Yang
@ 2025-01-15  6:26       ` Lance Yang
  2025-01-15  6:42         ` Barry Song
  1 sibling, 1 reply; 17+ messages in thread
From: Lance Yang @ 2025-01-15  6:26 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan

On Wed, Jan 15, 2025 at 1:09 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Jan 15, 2025 at 6:01 PM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > On Wed, Jan 15, 2025 at 11:38 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > The try_to_unmap_one() function currently handles PMD-mapped THPs
> > > inefficiently. It first splits the PMD into PTEs, copies the dirty
> > > state from the PMD to the PTEs, iterates over the PTEs to locate
> > > the dirty state, and then marks the THP as swap-backed. This process
> > > involves unnecessary PMD splitting and redundant iteration. Instead,
> > > this functionality can be efficiently managed in
> > > __discard_anon_folio_pmd_locked(), avoiding the extra steps and
> > > improving performance.
> > >
> > > The following microbenchmark redirties folios after invoking MADV_FREE,
> > > then measures the time taken to perform memory reclamation (actually
> > > set those folios swapbacked again) on the redirtied folios.
> > >
> > >  #include <stdio.h>
> > >  #include <sys/mman.h>
> > >  #include <string.h>
> > >  #include <time.h>
> > >
> > >  #define SIZE 128*1024*1024  // 128 MB
> > >
> > >  int main(int argc, char *argv[])
> > >  {
> > >         while(1) {
> > >                 volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
> > >                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > >
> > >                 memset((void *)p, 1, SIZE);
> > >                 madvise((void *)p, SIZE, MADV_FREE);
> > >                 /* redirty after MADV_FREE */
> > >                 memset((void *)p, 1, SIZE);
> > >
> > >                 clock_t start_time = clock();
> > >                 madvise((void *)p, SIZE, MADV_PAGEOUT);
> > >                 clock_t end_time = clock();
> > >
> > >                 double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
> > >                 printf("Time taken by reclamation: %f seconds\n", elapsed_time);
> > >
> > >                 munmap((void *)p, SIZE);
> > >         }
> > >         return 0;
> > >  }
> > >
> > > Testing results are as below,
> > > w/o patch:
> > > ~ # ./a.out
> > > Time taken by reclamation: 0.007300 seconds
> > > Time taken by reclamation: 0.007226 seconds
> > > Time taken by reclamation: 0.007295 seconds
> > > Time taken by reclamation: 0.007731 seconds
> > > Time taken by reclamation: 0.007134 seconds
> > > Time taken by reclamation: 0.007285 seconds
> > > Time taken by reclamation: 0.007720 seconds
> > > Time taken by reclamation: 0.007128 seconds
> > > Time taken by reclamation: 0.007710 seconds
> > > Time taken by reclamation: 0.007712 seconds
> > > Time taken by reclamation: 0.007236 seconds
> > > Time taken by reclamation: 0.007690 seconds
> > > Time taken by reclamation: 0.007174 seconds
> > > Time taken by reclamation: 0.007670 seconds
> > > Time taken by reclamation: 0.007169 seconds
> > > Time taken by reclamation: 0.007305 seconds
> > > Time taken by reclamation: 0.007432 seconds
> > > Time taken by reclamation: 0.007158 seconds
> > > Time taken by reclamation: 0.007133 seconds
> > > …
> > >
> > > w/ patch
> > >
> > > ~ # ./a.out
> > > Time taken by reclamation: 0.002124 seconds
> > > Time taken by reclamation: 0.002116 seconds
> > > Time taken by reclamation: 0.002150 seconds
> > > Time taken by reclamation: 0.002261 seconds
> > > Time taken by reclamation: 0.002137 seconds
> > > Time taken by reclamation: 0.002173 seconds
> > > Time taken by reclamation: 0.002063 seconds
> > > Time taken by reclamation: 0.002088 seconds
> > > Time taken by reclamation: 0.002169 seconds
> > > Time taken by reclamation: 0.002124 seconds
> > > Time taken by reclamation: 0.002111 seconds
> > > Time taken by reclamation: 0.002224 seconds
> > > Time taken by reclamation: 0.002297 seconds
> > > Time taken by reclamation: 0.002260 seconds
> > > Time taken by reclamation: 0.002246 seconds
> > > Time taken by reclamation: 0.002272 seconds
> > > Time taken by reclamation: 0.002277 seconds
> > > Time taken by reclamation: 0.002462 seconds
> > > …
> > >
> > > This patch significantly speeds up try_to_unmap_one() by allowing it
> > > to skip redirtied THPs without splitting the PMD.
> > >
> > > Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Suggested-by: Lance Yang <ioworker0@gmail.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  mm/huge_memory.c | 24 +++++++++++++++++-------
> > >  mm/rmap.c        | 13 ++++++++++---
> > >  2 files changed, 27 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 3d3ebdc002d5..47cc8c3f8f80 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -3070,8 +3070,12 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> > >         int ref_count, map_count;
> > >         pmd_t orig_pmd = *pmdp;
> > >
> > > -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> > > +       if (pmd_dirty(orig_pmd))
> > > +               folio_set_dirty(folio);
> > > +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > > +               folio_set_swapbacked(folio);
> > >                 return false;
> > > +       }
> >
> > If either the PMD or the folio is dirty, should we just return false right away,
> > regardless of VM_DROPPABLE? There’s no need to proceed further in that
> > case, IMHO ;)
>
> I don't quite understand you, but we need to proceed to clear pmd entry.
> if vm_droppable is true, even if the folio is dirty, we still drop the folio.

Hey barry,

One thing I still don’t quite understand is as follows:

One of the semantics of VM_DROPPABLE is that, under memory pressure,
the kernel can drop the pages. Similarly, for MADV_FREE, one of its
semantics is that the kernel can free the pages when memory pressure
occurs, but only if there is no subsequent write (i.e., the PMD is clean).

So, if VM_DROPPABLE is true, we still drop the folio even if it's dirty. This
seems to conflict with the semantics of MADV_FREE, which requires the
folio or PMD to be clean before being dropped. wdyt?

Thanks,
Lance




>
> >
> > Thanks,
> > Lance
> >
> > >
> > >         orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
> > >
> > > @@ -3098,8 +3102,15 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> > >          *
> > >          * The only folio refs must be one from isolation plus the rmap(s).
> > >          */
> > > -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
> > > -           ref_count != map_count + 1) {
> > > +       if (pmd_dirty(orig_pmd))
> > > +               folio_set_dirty(folio);
> > > +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > > +               folio_set_swapbacked(folio);
> > > +               set_pmd_at(mm, addr, pmdp, orig_pmd);
> > > +               return false;
> > > +       }
> > > +
> > > +       if (ref_count != map_count + 1) {
> > >                 set_pmd_at(mm, addr, pmdp, orig_pmd);
> > >                 return false;
> > >         }
> > > @@ -3119,12 +3130,11 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> > >  {
> > >         VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
> > >         VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > > +       VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> > > +       VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
> > >         VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
> > >
> > > -       if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
> > > -               return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> > > -
> > > -       return false;
> > > +       return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> > >  }
> > >
> > >  static void remap_page(struct folio *folio, unsigned long nr, int flags)
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index be1978d2712d..a859c399ec7c 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -1724,9 +1724,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > >                 }
> > >
> > >                 if (!pvmw.pte) {
> > > -                       if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> > > -                                                 folio))
> > > -                               goto walk_done;
> > > +                       if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> > > +                               if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
> > > +                                       goto walk_done;
> > > +                               /*
> > > +                                * unmap_huge_pmd_locked has either already marked
> > > +                                * the folio as swap-backed or decided to retain it
> > > +                                * due to GUP or speculative references.
> > > +                                */
> > > +                               goto walk_abort;
> > > +                       }
> > >
> > >                         if (flags & TTU_SPLIT_HUGE_PMD) {
> > >                                 /*
> > > --
> > > 2.39.3 (Apple Git-146)
> > >


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-15  6:26       ` Lance Yang
@ 2025-01-15  6:42         ` Barry Song
  2025-01-15  7:01           ` Lance Yang
  0 siblings, 1 reply; 17+ messages in thread
From: Barry Song @ 2025-01-15  6:42 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan

On Wed, Jan 15, 2025 at 7:26 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> On Wed, Jan 15, 2025 at 1:09 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Jan 15, 2025 at 6:01 PM Lance Yang <ioworker0@gmail.com> wrote:
> > >
> > > On Wed, Jan 15, 2025 at 11:38 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > The try_to_unmap_one() function currently handles PMD-mapped THPs
> > > > inefficiently. It first splits the PMD into PTEs, copies the dirty
> > > > state from the PMD to the PTEs, iterates over the PTEs to locate
> > > > the dirty state, and then marks the THP as swap-backed. This process
> > > > involves unnecessary PMD splitting and redundant iteration. Instead,
> > > > this functionality can be efficiently managed in
> > > > __discard_anon_folio_pmd_locked(), avoiding the extra steps and
> > > > improving performance.
> > > >
> > > > The following microbenchmark redirties folios after invoking MADV_FREE,
> > > > then measures the time taken to perform memory reclamation (actually
> > > > set those folios swapbacked again) on the redirtied folios.
> > > >
> > > >  #include <stdio.h>
> > > >  #include <sys/mman.h>
> > > >  #include <string.h>
> > > >  #include <time.h>
> > > >
> > > >  #define SIZE 128*1024*1024  // 128 MB
> > > >
> > > >  int main(int argc, char *argv[])
> > > >  {
> > > >         while(1) {
> > > >                 volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
> > > >                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > >
> > > >                 memset((void *)p, 1, SIZE);
> > > >                 madvise((void *)p, SIZE, MADV_FREE);
> > > >                 /* redirty after MADV_FREE */
> > > >                 memset((void *)p, 1, SIZE);
> > > >
> > > >                 clock_t start_time = clock();
> > > >                 madvise((void *)p, SIZE, MADV_PAGEOUT);
> > > >                 clock_t end_time = clock();
> > > >
> > > >                 double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
> > > >                 printf("Time taken by reclamation: %f seconds\n", elapsed_time);
> > > >
> > > >                 munmap((void *)p, SIZE);
> > > >         }
> > > >         return 0;
> > > >  }
> > > >
> > > > Testing results are as below,
> > > > w/o patch:
> > > > ~ # ./a.out
> > > > Time taken by reclamation: 0.007300 seconds
> > > > Time taken by reclamation: 0.007226 seconds
> > > > Time taken by reclamation: 0.007295 seconds
> > > > Time taken by reclamation: 0.007731 seconds
> > > > Time taken by reclamation: 0.007134 seconds
> > > > Time taken by reclamation: 0.007285 seconds
> > > > Time taken by reclamation: 0.007720 seconds
> > > > Time taken by reclamation: 0.007128 seconds
> > > > Time taken by reclamation: 0.007710 seconds
> > > > Time taken by reclamation: 0.007712 seconds
> > > > Time taken by reclamation: 0.007236 seconds
> > > > Time taken by reclamation: 0.007690 seconds
> > > > Time taken by reclamation: 0.007174 seconds
> > > > Time taken by reclamation: 0.007670 seconds
> > > > Time taken by reclamation: 0.007169 seconds
> > > > Time taken by reclamation: 0.007305 seconds
> > > > Time taken by reclamation: 0.007432 seconds
> > > > Time taken by reclamation: 0.007158 seconds
> > > > Time taken by reclamation: 0.007133 seconds
> > > > …
> > > >
> > > > w/ patch
> > > >
> > > > ~ # ./a.out
> > > > Time taken by reclamation: 0.002124 seconds
> > > > Time taken by reclamation: 0.002116 seconds
> > > > Time taken by reclamation: 0.002150 seconds
> > > > Time taken by reclamation: 0.002261 seconds
> > > > Time taken by reclamation: 0.002137 seconds
> > > > Time taken by reclamation: 0.002173 seconds
> > > > Time taken by reclamation: 0.002063 seconds
> > > > Time taken by reclamation: 0.002088 seconds
> > > > Time taken by reclamation: 0.002169 seconds
> > > > Time taken by reclamation: 0.002124 seconds
> > > > Time taken by reclamation: 0.002111 seconds
> > > > Time taken by reclamation: 0.002224 seconds
> > > > Time taken by reclamation: 0.002297 seconds
> > > > Time taken by reclamation: 0.002260 seconds
> > > > Time taken by reclamation: 0.002246 seconds
> > > > Time taken by reclamation: 0.002272 seconds
> > > > Time taken by reclamation: 0.002277 seconds
> > > > Time taken by reclamation: 0.002462 seconds
> > > > …
> > > >
> > > > This patch significantly speeds up try_to_unmap_one() by allowing it
> > > > to skip redirtied THPs without splitting the PMD.
> > > >
> > > > Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > > Suggested-by: Lance Yang <ioworker0@gmail.com>
> > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > ---
> > > >  mm/huge_memory.c | 24 +++++++++++++++++-------
> > > >  mm/rmap.c        | 13 ++++++++++---
> > > >  2 files changed, 27 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index 3d3ebdc002d5..47cc8c3f8f80 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -3070,8 +3070,12 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> > > >         int ref_count, map_count;
> > > >         pmd_t orig_pmd = *pmdp;
> > > >
> > > > -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> > > > +       if (pmd_dirty(orig_pmd))
> > > > +               folio_set_dirty(folio);
> > > > +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > > > +               folio_set_swapbacked(folio);
> > > >                 return false;
> > > > +       }
> > >
> > > If either the PMD or the folio is dirty, should we just return false right away,
> > > regardless of VM_DROPPABLE? There’s no need to proceed further in that
> > > case, IMHO ;)
> >
> > I don't quite understand you, but we need to proceed to clear pmd entry.
> > if vm_droppable is true, even if the folio is dirty, we still drop the folio.
>
> Hey barry,
>
> One thing I still don’t quite understand is as follows:
>
> One of the semantics of VM_DROPPABLE is that, under memory pressure,
> the kernel can drop the pages. Similarly, for MADV_FREE, one of its
> semantics is that the kernel can free the pages when memory pressure
> occurs, but only if there is no subsequent write (i.e., the PMD is clean).
>
> So, if VM_DROPPABLE is true, we still drop the folio even if it's dirty. This
> seems to conflict with the semantics of MADV_FREE, which requires the
> folio or PMD to be clean before being dropped. wdyt?

I think we can simply revisit the patch where VM_DROPPABLE was introduced:
commit 9651fcedf7b92d3f7
"mm: add MAP_DROPPABLE for designating always lazily freeable mappings"

>
> Thanks,
> Lance
>
>
>
>
> >
> > >
> > > Thanks,
> > > Lance
> > >
> > > >
> > > >         orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
> > > >
> > > > @@ -3098,8 +3102,15 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> > > >          *
> > > >          * The only folio refs must be one from isolation plus the rmap(s).
> > > >          */
> > > > -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
> > > > -           ref_count != map_count + 1) {
> > > > +       if (pmd_dirty(orig_pmd))
> > > > +               folio_set_dirty(folio);
> > > > +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > > > +               folio_set_swapbacked(folio);
> > > > +               set_pmd_at(mm, addr, pmdp, orig_pmd);
> > > > +               return false;
> > > > +       }
> > > > +
> > > > +       if (ref_count != map_count + 1) {
> > > >                 set_pmd_at(mm, addr, pmdp, orig_pmd);
> > > >                 return false;
> > > >         }
> > > > @@ -3119,12 +3130,11 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> > > >  {
> > > >         VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
> > > >         VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > > > +       VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> > > > +       VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
> > > >         VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
> > > >
> > > > -       if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
> > > > -               return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> > > > -
> > > > -       return false;
> > > > +       return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> > > >  }
> > > >
> > > >  static void remap_page(struct folio *folio, unsigned long nr, int flags)
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > index be1978d2712d..a859c399ec7c 100644
> > > > --- a/mm/rmap.c
> > > > +++ b/mm/rmap.c
> > > > @@ -1724,9 +1724,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > > >                 }
> > > >
> > > >                 if (!pvmw.pte) {
> > > > -                       if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> > > > -                                                 folio))
> > > > -                               goto walk_done;
> > > > +                       if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> > > > +                               if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
> > > > +                                       goto walk_done;
> > > > +                               /*
> > > > +                                * unmap_huge_pmd_locked has either already marked
> > > > +                                * the folio as swap-backed or decided to retain it
> > > > +                                * due to GUP or speculative references.
> > > > +                                */
> > > > +                               goto walk_abort;
> > > > +                       }
> > > >
> > > >                         if (flags & TTU_SPLIT_HUGE_PMD) {
> > > >                                 /*
> > > > --
> > > > 2.39.3 (Apple Git-146)
> > > >

Thanks
Barry


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-15  6:42         ` Barry Song
@ 2025-01-15  7:01           ` Lance Yang
  0 siblings, 0 replies; 17+ messages in thread
From: Lance Yang @ 2025-01-15  7:01 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan

On Wed, Jan 15, 2025 at 2:42 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Jan 15, 2025 at 7:26 PM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > On Wed, Jan 15, 2025 at 1:09 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Wed, Jan 15, 2025 at 6:01 PM Lance Yang <ioworker0@gmail.com> wrote:
> > > >
> > > > On Wed, Jan 15, 2025 at 11:38 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > >
> > > > > The try_to_unmap_one() function currently handles PMD-mapped THPs
> > > > > inefficiently. It first splits the PMD into PTEs, copies the dirty
> > > > > state from the PMD to the PTEs, iterates over the PTEs to locate
> > > > > the dirty state, and then marks the THP as swap-backed. This process
> > > > > involves unnecessary PMD splitting and redundant iteration. Instead,
> > > > > this functionality can be efficiently managed in
> > > > > __discard_anon_folio_pmd_locked(), avoiding the extra steps and
> > > > > improving performance.
> > > > >
> > > > > The following microbenchmark redirties folios after invoking MADV_FREE,
> > > > > then measures the time taken to perform memory reclamation (actually
> > > > > set those folios swapbacked again) on the redirtied folios.
> > > > >
> > > > >  #include <stdio.h>
> > > > >  #include <sys/mman.h>
> > > > >  #include <string.h>
> > > > >  #include <time.h>
> > > > >
> > > > >  #define SIZE 128*1024*1024  // 128 MB
> > > > >
> > > > >  int main(int argc, char *argv[])
> > > > >  {
> > > > >         while(1) {
> > > > >                 volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
> > > > >                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > > >
> > > > >                 memset((void *)p, 1, SIZE);
> > > > >                 madvise((void *)p, SIZE, MADV_FREE);
> > > > >                 /* redirty after MADV_FREE */
> > > > >                 memset((void *)p, 1, SIZE);
> > > > >
> > > > >                 clock_t start_time = clock();
> > > > >                 madvise((void *)p, SIZE, MADV_PAGEOUT);
> > > > >                 clock_t end_time = clock();
> > > > >
> > > > >                 double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
> > > > >                 printf("Time taken by reclamation: %f seconds\n", elapsed_time);
> > > > >
> > > > >                 munmap((void *)p, SIZE);
> > > > >         }
> > > > >         return 0;
> > > > >  }
> > > > >
> > > > > Testing results are as below,
> > > > > w/o patch:
> > > > > ~ # ./a.out
> > > > > Time taken by reclamation: 0.007300 seconds
> > > > > Time taken by reclamation: 0.007226 seconds
> > > > > Time taken by reclamation: 0.007295 seconds
> > > > > Time taken by reclamation: 0.007731 seconds
> > > > > Time taken by reclamation: 0.007134 seconds
> > > > > Time taken by reclamation: 0.007285 seconds
> > > > > Time taken by reclamation: 0.007720 seconds
> > > > > Time taken by reclamation: 0.007128 seconds
> > > > > Time taken by reclamation: 0.007710 seconds
> > > > > Time taken by reclamation: 0.007712 seconds
> > > > > Time taken by reclamation: 0.007236 seconds
> > > > > Time taken by reclamation: 0.007690 seconds
> > > > > Time taken by reclamation: 0.007174 seconds
> > > > > Time taken by reclamation: 0.007670 seconds
> > > > > Time taken by reclamation: 0.007169 seconds
> > > > > Time taken by reclamation: 0.007305 seconds
> > > > > Time taken by reclamation: 0.007432 seconds
> > > > > Time taken by reclamation: 0.007158 seconds
> > > > > Time taken by reclamation: 0.007133 seconds
> > > > > …
> > > > >
> > > > > w/ patch
> > > > >
> > > > > ~ # ./a.out
> > > > > Time taken by reclamation: 0.002124 seconds
> > > > > Time taken by reclamation: 0.002116 seconds
> > > > > Time taken by reclamation: 0.002150 seconds
> > > > > Time taken by reclamation: 0.002261 seconds
> > > > > Time taken by reclamation: 0.002137 seconds
> > > > > Time taken by reclamation: 0.002173 seconds
> > > > > Time taken by reclamation: 0.002063 seconds
> > > > > Time taken by reclamation: 0.002088 seconds
> > > > > Time taken by reclamation: 0.002169 seconds
> > > > > Time taken by reclamation: 0.002124 seconds
> > > > > Time taken by reclamation: 0.002111 seconds
> > > > > Time taken by reclamation: 0.002224 seconds
> > > > > Time taken by reclamation: 0.002297 seconds
> > > > > Time taken by reclamation: 0.002260 seconds
> > > > > Time taken by reclamation: 0.002246 seconds
> > > > > Time taken by reclamation: 0.002272 seconds
> > > > > Time taken by reclamation: 0.002277 seconds
> > > > > Time taken by reclamation: 0.002462 seconds
> > > > > …
> > > > >
> > > > > This patch significantly speeds up try_to_unmap_one() by allowing it
> > > > > to skip redirtied THPs without splitting the PMD.
> > > > >
> > > > > Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > > > Suggested-by: Lance Yang <ioworker0@gmail.com>
> > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > > ---
> > > > >  mm/huge_memory.c | 24 +++++++++++++++++-------
> > > > >  mm/rmap.c        | 13 ++++++++++---
> > > > >  2 files changed, 27 insertions(+), 10 deletions(-)
> > > > >
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index 3d3ebdc002d5..47cc8c3f8f80 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -3070,8 +3070,12 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> > > > >         int ref_count, map_count;
> > > > >         pmd_t orig_pmd = *pmdp;
> > > > >
> > > > > -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> > > > > +       if (pmd_dirty(orig_pmd))
> > > > > +               folio_set_dirty(folio);
> > > > > +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > > > > +               folio_set_swapbacked(folio);
> > > > >                 return false;
> > > > > +       }
> > > >
> > > > If either the PMD or the folio is dirty, should we just return false right away,
> > > > regardless of VM_DROPPABLE? There’s no need to proceed further in that
> > > > case, IMHO ;)
> > >
> > > I don't quite understand you, but we need to proceed to clear pmd entry.
> > > if vm_droppable is true, even if the folio is dirty, we still drop the folio.
> >
> > Hey barry,
> >
> > One thing I still don’t quite understand is as follows:
> >
> > One of the semantics of VM_DROPPABLE is that, under memory pressure,
> > the kernel can drop the pages. Similarly, for MADV_FREE, one of its
> > semantics is that the kernel can free the pages when memory pressure
> > occurs, but only if there is no subsequent write (i.e., the PMD is clean).
> >
> > So, if VM_DROPPABLE is true, we still drop the folio even if it's dirty. This
> > seems to conflict with the semantics of MADV_FREE, which requires the
> > folio or PMD to be clean before being dropped. wdyt?
>
> I think we can simply revisit the patch where VM_DROPPABLE was introduced:
> commit 9651fcedf7b92d3f7
> "mm: add MAP_DROPPABLE for designating always lazily freeable mappings"

Ah, I see ~

As the patch says: "Unlike MADV_FREE mappings, VM_DROPPABLE ones
can be dropped even if they've been dirtied".
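
Just to make the contrast concrete for myself, here's a minimal (untested)
userspace sketch. It assumes a 6.11+ kernel with MAP_DROPPABLE support; the
0x08 fallback below is only my assumption for older headers, so please treat
the whole thing as illustrative rather than authoritative:

 #include <stdio.h>
 #include <string.h>
 #include <sys/mman.h>

 #ifndef MAP_DROPPABLE
 #define MAP_DROPPABLE 0x08   /* assumed uapi value; see linux/mman.h */
 #endif

 #define LEN (2UL * 1024 * 1024)

 int main(void)
 {
         /* MADV_FREE path: may be discarded only while the pages stay clean */
         char *lazy = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
         /* MAP_DROPPABLE path: may be discarded under pressure even if dirty */
         char *drop = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_DROPPABLE, -1, 0);

         if (lazy == MAP_FAILED || drop == MAP_FAILED)
                 return 1;

         memset(lazy, 1, LEN);
         madvise(lazy, LEN, MADV_FREE);
         lazy[0] = 2;    /* redirty: reclaim must keep (swapback) these pages */

         memset(drop, 1, LEN);
         drop[0] = 2;    /* still droppable: later reads may see zero-filled pages */

         printf("lazy=%d drop=%d\n", lazy[0], drop[0]);
         return 0;
 }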

Thanks a lot for the lesson!
Lance

>
> >
> > Thanks,
> > Lance
> >
> >
> >
> >
> > >
> > > >
> > > > Thanks,
> > > > Lance
> > > >
> > > > >
> > > > >         orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
> > > > >
> > > > > @@ -3098,8 +3102,15 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> > > > >          *
> > > > >          * The only folio refs must be one from isolation plus the rmap(s).
> > > > >          */
> > > > > -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
> > > > > -           ref_count != map_count + 1) {
> > > > > +       if (pmd_dirty(orig_pmd))
> > > > > +               folio_set_dirty(folio);
> > > > > +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > > > > +               folio_set_swapbacked(folio);
> > > > > +               set_pmd_at(mm, addr, pmdp, orig_pmd);
> > > > > +               return false;
> > > > > +       }
> > > > > +
> > > > > +       if (ref_count != map_count + 1) {
> > > > >                 set_pmd_at(mm, addr, pmdp, orig_pmd);
> > > > >                 return false;
> > > > >         }
> > > > > @@ -3119,12 +3130,11 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> > > > >  {
> > > > >         VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
> > > > >         VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> > > > > +       VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> > > > > +       VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
> > > > >         VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
> > > > >
> > > > > -       if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
> > > > > -               return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> > > > > -
> > > > > -       return false;
> > > > > +       return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> > > > >  }
> > > > >
> > > > >  static void remap_page(struct folio *folio, unsigned long nr, int flags)
> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > index be1978d2712d..a859c399ec7c 100644
> > > > > --- a/mm/rmap.c
> > > > > +++ b/mm/rmap.c
> > > > > @@ -1724,9 +1724,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > > > >                 }
> > > > >
> > > > >                 if (!pvmw.pte) {
> > > > > -                       if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> > > > > -                                                 folio))
> > > > > -                               goto walk_done;
> > > > > +                       if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> > > > > +                               if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
> > > > > +                                       goto walk_done;
> > > > > +                               /*
> > > > > +                                * unmap_huge_pmd_locked has either already marked
> > > > > +                                * the folio as swap-backed or decided to retain it
> > > > > +                                * due to GUP or speculative references.
> > > > > +                                */
> > > > > +                               goto walk_abort;
> > > > > +                       }
> > > > >
> > > > >                         if (flags & TTU_SPLIT_HUGE_PMD) {
> > > > >                                 /*
> > > > > --
> > > > > 2.39.3 (Apple Git-146)
> > > > >
>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 2/4] mm: Support tlbbatch flush for a range of PTEs
  2025-01-15  3:38 ` [PATCH v3 2/4] mm: Support tlbbatch flush for a range of PTEs Barry Song
@ 2025-01-16  8:30   ` Kefeng Wang
  0 siblings, 0 replies; 17+ messages in thread
From: Kefeng Wang @ 2025-01-16  8:30 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, david, ioworker0, kasong, linux-arm-kernel,
	linux-kernel, linux-riscv, lorenzo.stoakes, ryan.roberts,
	v-songbaohua, x86, ying.huang, zhengtangquan, Catalin Marinas,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Anshuman Khandual, Shaoqin Huang, Gavin Shan,
	Mark Rutland, Kirill A. Shutemov, Yosry Ahmed, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Yicong Yang, Will Deacon



On 2025/1/15 11:38, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> This patch lays the groundwork for supporting batch PTE unmapping in
> try_to_unmap_one(). It introduces range handling for TLB batch flushing,
> with the range currently set to the size of PAGE_SIZE.
> 
> The function __flush_tlb_range_nosync() is architecture-specific and is
> only used within arch/arm64. This function requires the mm structure
> instead of the vma structure. To allow its reuse by
> arch_tlbbatch_add_pending(), which operates with mm but not vma, this
> patch modifies the argument of __flush_tlb_range_nosync() to take mm
> as its parameter.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Shaoqin Huang <shahuang@redhat.com>
> Cc: Gavin Shan <gshan@redhat.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Lance Yang <ioworker0@gmail.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Cc: Paul Walmsley <paul.walmsley@sifive.com>
> Cc: Palmer Dabbelt <palmer@dabbelt.com>
> Cc: Albert Ou <aou@eecs.berkeley.edu>
> Cc: Yicong Yang <yangyicong@hisilicon.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Acked-by: Will Deacon <will@kernel.org>
> ---
>   arch/arm64/include/asm/tlbflush.h | 25 +++++++++++++------------
>   arch/arm64/mm/contpte.c           |  2 +-
>   arch/riscv/include/asm/tlbflush.h |  5 +++--
>   arch/riscv/mm/tlbflush.c          |  5 +++--
>   arch/x86/include/asm/tlbflush.h   |  5 +++--
>   mm/rmap.c                         | 12 +++++++-----
>   6 files changed, 30 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index bc94e036a26b..98fbc8df7cf3 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -322,13 +322,6 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>   	return true;
>   }
>   
> -static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> -					     struct mm_struct *mm,
> -					     unsigned long uaddr)
> -{
> -	__flush_tlb_page_nosync(mm, uaddr);
> -}
> -
>   /*
>    * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
>    * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
> @@ -448,7 +441,7 @@ static inline bool __flush_tlb_range_limit_excess(unsigned long start,
>   	return false;
>   }
>   
> -static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
> +static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
>   				     unsigned long start, unsigned long end,
>   				     unsigned long stride, bool last_level,
>   				     int tlb_level)
> @@ -460,12 +453,12 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
>   	pages = (end - start) >> PAGE_SHIFT;
>   
>   	if (__flush_tlb_range_limit_excess(start, end, pages, stride)) {
> -		flush_tlb_mm(vma->vm_mm);
> +		flush_tlb_mm(mm);
>   		return;
>   	}
>   
>   	dsb(ishst);
> -	asid = ASID(vma->vm_mm);
> +	asid = ASID(mm);
>   
>   	if (last_level)
>   		__flush_tlb_range_op(vale1is, start, pages, stride, asid,
> @@ -474,7 +467,7 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
>   		__flush_tlb_range_op(vae1is, start, pages, stride, asid,
>   				     tlb_level, true, lpa2_is_enabled());
>   
> -	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
> +	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
>   }
>   
>   static inline void __flush_tlb_range(struct vm_area_struct *vma,
> @@ -482,7 +475,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>   				     unsigned long stride, bool last_level,
>   				     int tlb_level)
>   {
> -	__flush_tlb_range_nosync(vma, start, end, stride,
> +	__flush_tlb_range_nosync(vma->vm_mm, start, end, stride,
>   				 last_level, tlb_level);
>   	dsb(ish);
>   }
> @@ -533,6 +526,14 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
>   	dsb(ish);
>   	isb();
>   }
> +
> +static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> +		struct mm_struct *mm,
> +		unsigned long start,
> +		unsigned long end)

Nit: the new start/end arguments could share one line rather than being split
one per line (same for the other functions changed here).
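
Something along these lines, just to illustrate the layout I mean (the body
is unchanged from your patch):

static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
					     struct mm_struct *mm,
					     unsigned long start, unsigned long end)
{
	__flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3);
}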

Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>

> +{
> +	__flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3);
> +}
>   #endif
>   
>   #endif
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 55107d27d3f8..bcac4f55f9c1 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -335,7 +335,7 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>   		 * eliding the trailing DSB applies here.
>   		 */
>   		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> -		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
> +		__flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
>   					 PAGE_SIZE, true, 3);
>   	}
>   
> diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
> index 72e559934952..e4c533691a7d 100644
> --- a/arch/riscv/include/asm/tlbflush.h
> +++ b/arch/riscv/include/asm/tlbflush.h
> @@ -60,8 +60,9 @@ void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
>   
>   bool arch_tlbbatch_should_defer(struct mm_struct *mm);
>   void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> -			       struct mm_struct *mm,
> -			       unsigned long uaddr);
> +		struct mm_struct *mm,
> +		unsigned long start,
> +		unsigned long end);
>   void arch_flush_tlb_batched_pending(struct mm_struct *mm);
>   void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
>   
> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> index 9b6e86ce3867..6d6e8e7cc576 100644
> --- a/arch/riscv/mm/tlbflush.c
> +++ b/arch/riscv/mm/tlbflush.c
> @@ -186,8 +186,9 @@ bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>   }
>   
>   void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> -			       struct mm_struct *mm,
> -			       unsigned long uaddr)
> +		struct mm_struct *mm,
> +		unsigned long start,
> +		unsigned long end)
>   {
>   	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
>   }
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 69e79fff41b8..2b511972d008 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -278,8 +278,9 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>   }
>   
>   static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> -					     struct mm_struct *mm,
> -					     unsigned long uaddr)
> +		struct mm_struct *mm,
> +		unsigned long start,
> +		unsigned long end)
>   {
>   	inc_mm_tlb_gen(mm);
>   	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> diff --git a/mm/rmap.c b/mm/rmap.c
> index de6b8c34e98c..abeb9fcec384 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -672,7 +672,8 @@ void try_to_unmap_flush_dirty(void)
>   	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
>   
>   static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
> -				      unsigned long uaddr)
> +		unsigned long start,
> +		unsigned long end)
>   {
>   	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>   	int batch;
> @@ -681,7 +682,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
>   	if (!pte_accessible(mm, pteval))
>   		return;
>   
> -	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
> +	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, start, end);
>   	tlb_ubc->flush_required = true;
>   
>   	/*
> @@ -757,7 +758,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>   }
>   #else
>   static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
> -				      unsigned long uaddr)
> +		unsigned long start,
> +		unsigned long end)
>   {
>   }
>   
> @@ -1792,7 +1794,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   				 */
>   				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>   
> -				set_tlb_ubc_flush_pending(mm, pteval, address);
> +				set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE);
>   			} else {
>   				pteval = ptep_clear_flush(vma, address, pvmw.pte);
>   			}
> @@ -2164,7 +2166,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>   				 */
>   				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>   
> -				set_tlb_ubc_flush_pending(mm, pteval, address);
> +				set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE);
>   			} else {
>   				pteval = ptep_clear_flush(vma, address, pvmw.pte);
>   			}



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-15  3:38 ` [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap Barry Song
  2025-01-15  5:01   ` Lance Yang
@ 2025-01-17  1:30   ` Baolin Wang
  2025-01-18 14:00   ` Lance Yang
  2 siblings, 0 replies; 17+ messages in thread
From: Baolin Wang @ 2025-01-17  1:30 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: chrisl, david, ioworker0, kasong, linux-arm-kernel, linux-kernel,
	linux-riscv, lorenzo.stoakes, ryan.roberts, v-songbaohua, x86,
	ying.huang, zhengtangquan



On 2025/1/15 11:38, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> The try_to_unmap_one() function currently handles PMD-mapped THPs
> inefficiently. It first splits the PMD into PTEs, copies the dirty
> state from the PMD to the PTEs, iterates over the PTEs to locate
> the dirty state, and then marks the THP as swap-backed. This process
> involves unnecessary PMD splitting and redundant iteration. Instead,
> this functionality can be efficiently managed in
> __discard_anon_folio_pmd_locked(), avoiding the extra steps and
> improving performance.
> 
> The following microbenchmark redirties folios after invoking MADV_FREE,
> then measures the time taken to perform memory reclamation (actually
> set those folios swapbacked again) on the redirtied folios.
> 
>   #include <stdio.h>
>   #include <sys/mman.h>
>   #include <string.h>
>   #include <time.h>
> 
>   #define SIZE 128*1024*1024  // 128 MB
> 
>   int main(int argc, char *argv[])
>   {
>   	while(1) {
>   		volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
>   				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
>   		memset((void *)p, 1, SIZE);
>   		madvise((void *)p, SIZE, MADV_FREE);
>   		/* redirty after MADV_FREE */
>   		memset((void *)p, 1, SIZE);
> 
> 		clock_t start_time = clock();
>   		madvise((void *)p, SIZE, MADV_PAGEOUT);
>   		clock_t end_time = clock();
> 
>   		double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
>   		printf("Time taken by reclamation: %f seconds\n", elapsed_time);
> 
>   		munmap((void *)p, SIZE);
>   	}
>   	return 0;
>   }
> 
> Testing results are as below,
> w/o patch:
> ~ # ./a.out
> Time taken by reclamation: 0.007300 seconds
> Time taken by reclamation: 0.007226 seconds
> Time taken by reclamation: 0.007295 seconds
> Time taken by reclamation: 0.007731 seconds
> Time taken by reclamation: 0.007134 seconds
> Time taken by reclamation: 0.007285 seconds
> Time taken by reclamation: 0.007720 seconds
> Time taken by reclamation: 0.007128 seconds
> Time taken by reclamation: 0.007710 seconds
> Time taken by reclamation: 0.007712 seconds
> Time taken by reclamation: 0.007236 seconds
> Time taken by reclamation: 0.007690 seconds
> Time taken by reclamation: 0.007174 seconds
> Time taken by reclamation: 0.007670 seconds
> Time taken by reclamation: 0.007169 seconds
> Time taken by reclamation: 0.007305 seconds
> Time taken by reclamation: 0.007432 seconds
> Time taken by reclamation: 0.007158 seconds
> Time taken by reclamation: 0.007133 seconds
> …
> 
> w/ patch
> 
> ~ # ./a.out
> Time taken by reclamation: 0.002124 seconds
> Time taken by reclamation: 0.002116 seconds
> Time taken by reclamation: 0.002150 seconds
> Time taken by reclamation: 0.002261 seconds
> Time taken by reclamation: 0.002137 seconds
> Time taken by reclamation: 0.002173 seconds
> Time taken by reclamation: 0.002063 seconds
> Time taken by reclamation: 0.002088 seconds
> Time taken by reclamation: 0.002169 seconds
> Time taken by reclamation: 0.002124 seconds
> Time taken by reclamation: 0.002111 seconds
> Time taken by reclamation: 0.002224 seconds
> Time taken by reclamation: 0.002297 seconds
> Time taken by reclamation: 0.002260 seconds
> Time taken by reclamation: 0.002246 seconds
> Time taken by reclamation: 0.002272 seconds
> Time taken by reclamation: 0.002277 seconds
> Time taken by reclamation: 0.002462 seconds

Good result.

> This patch significantly speeds up try_to_unmap_one() by allowing it
> to skip redirtied THPs without splitting the PMD.
> 
> Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Suggested-by: Lance Yang <ioworker0@gmail.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-15  3:38 ` [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap Barry Song
  2025-01-15  5:01   ` Lance Yang
  2025-01-17  1:30   ` Baolin Wang
@ 2025-01-18 14:00   ` Lance Yang
  2 siblings, 0 replies; 17+ messages in thread
From: Lance Yang @ 2025-01-18 14:00 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan

On Wed, Jan 15, 2025 at 11:38 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> The try_to_unmap_one() function currently handles PMD-mapped THPs
> inefficiently. It first splits the PMD into PTEs, copies the dirty
> state from the PMD to the PTEs, iterates over the PTEs to locate
> the dirty state, and then marks the THP as swap-backed. This process
> involves unnecessary PMD splitting and redundant iteration. Instead,
> this functionality can be efficiently managed in
> __discard_anon_folio_pmd_locked(), avoiding the extra steps and
> improving performance.
>
> The following microbenchmark redirties folios after invoking MADV_FREE,
> then measures the time taken to perform memory reclamation (actually
> set those folios swapbacked again) on the redirtied folios.
>
>  #include <stdio.h>
>  #include <sys/mman.h>
>  #include <string.h>
>  #include <time.h>
>
>  #define SIZE 128*1024*1024  // 128 MB
>
>  int main(int argc, char *argv[])
>  {
>         while(1) {
>                 volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
>                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
>                 memset((void *)p, 1, SIZE);
>                 madvise((void *)p, SIZE, MADV_FREE);
>                 /* redirty after MADV_FREE */
>                 memset((void *)p, 1, SIZE);
>
>                 clock_t start_time = clock();
>                 madvise((void *)p, SIZE, MADV_PAGEOUT);
>                 clock_t end_time = clock();
>
>                 double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
>                 printf("Time taken by reclamation: %f seconds\n", elapsed_time);
>
>                 munmap((void *)p, SIZE);
>         }
>         return 0;
>  }
>
> Testing results are as below,
> w/o patch:
> ~ # ./a.out
> Time taken by reclamation: 0.007300 seconds
> Time taken by reclamation: 0.007226 seconds
> Time taken by reclamation: 0.007295 seconds
> Time taken by reclamation: 0.007731 seconds
> Time taken by reclamation: 0.007134 seconds
> Time taken by reclamation: 0.007285 seconds
> Time taken by reclamation: 0.007720 seconds
> Time taken by reclamation: 0.007128 seconds
> Time taken by reclamation: 0.007710 seconds
> Time taken by reclamation: 0.007712 seconds
> Time taken by reclamation: 0.007236 seconds
> Time taken by reclamation: 0.007690 seconds
> Time taken by reclamation: 0.007174 seconds
> Time taken by reclamation: 0.007670 seconds
> Time taken by reclamation: 0.007169 seconds
> Time taken by reclamation: 0.007305 seconds
> Time taken by reclamation: 0.007432 seconds
> Time taken by reclamation: 0.007158 seconds
> Time taken by reclamation: 0.007133 seconds
> …
>
> w/ patch
>
> ~ # ./a.out
> Time taken by reclamation: 0.002124 seconds
> Time taken by reclamation: 0.002116 seconds
> Time taken by reclamation: 0.002150 seconds
> Time taken by reclamation: 0.002261 seconds
> Time taken by reclamation: 0.002137 seconds
> Time taken by reclamation: 0.002173 seconds
> Time taken by reclamation: 0.002063 seconds
> Time taken by reclamation: 0.002088 seconds
> Time taken by reclamation: 0.002169 seconds
> Time taken by reclamation: 0.002124 seconds
> Time taken by reclamation: 0.002111 seconds
> Time taken by reclamation: 0.002224 seconds
> Time taken by reclamation: 0.002297 seconds
> Time taken by reclamation: 0.002260 seconds
> Time taken by reclamation: 0.002246 seconds
> Time taken by reclamation: 0.002272 seconds
> Time taken by reclamation: 0.002277 seconds
> Time taken by reclamation: 0.002462 seconds
> …
>
> This patch significantly speeds up try_to_unmap_one() by allowing it
> to skip redirtied THPs without splitting the PMD.
>
> Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Suggested-by: Lance Yang <ioworker0@gmail.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

Nothing jumps out at me ;)

LGTM. Feel free to add:
Reviewed-by: Lance Yang <ioworker0@gmail.com>

Thanks,
Lance

> ---
>  mm/huge_memory.c | 24 +++++++++++++++++-------
>  mm/rmap.c        | 13 ++++++++++---
>  2 files changed, 27 insertions(+), 10 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 3d3ebdc002d5..47cc8c3f8f80 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3070,8 +3070,12 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
>         int ref_count, map_count;
>         pmd_t orig_pmd = *pmdp;
>
> -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> +       if (pmd_dirty(orig_pmd))
> +               folio_set_dirty(folio);
> +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> +               folio_set_swapbacked(folio);
>                 return false;
> +       }
>
>         orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
>
> @@ -3098,8 +3102,15 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
>          *
>          * The only folio refs must be one from isolation plus the rmap(s).
>          */
> -       if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
> -           ref_count != map_count + 1) {
> +       if (pmd_dirty(orig_pmd))
> +               folio_set_dirty(folio);
> +       if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> +               folio_set_swapbacked(folio);
> +               set_pmd_at(mm, addr, pmdp, orig_pmd);
> +               return false;
> +       }
> +
> +       if (ref_count != map_count + 1) {
>                 set_pmd_at(mm, addr, pmdp, orig_pmd);
>                 return false;
>         }
> @@ -3119,12 +3130,11 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
>  {
>         VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
>         VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> +       VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> +       VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
>         VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
>
> -       if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
> -               return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> -
> -       return false;
> +       return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
>  }
>
>  static void remap_page(struct folio *folio, unsigned long nr, int flags)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index be1978d2712d..a859c399ec7c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1724,9 +1724,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>                 }
>
>                 if (!pvmw.pte) {
> -                       if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> -                                                 folio))
> -                               goto walk_done;
> +                       if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> +                               if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
> +                                       goto walk_done;
> +                               /*
> +                                * unmap_huge_pmd_locked has either already marked
> +                                * the folio as swap-backed or decided to retain it
> +                                * due to GUP or speculative references.
> +                                */
> +                               goto walk_abort;
> +                       }
>
>                         if (flags & TTU_SPLIT_HUGE_PMD) {
>                                 /*
> --
> 2.39.3 (Apple Git-146)
>
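
To make the redirtied-lazyfree case from the changelog above concrete,
here is a minimal userspace sketch. It is not the benchmark behind the
quoted timing numbers; the cgroup path and the use of cgroup v2's
memory.reclaim interface (v5.19+) are assumptions to adapt, and the
process must already be a member of that cgroup.

/*
 * Sketch only: allocate a PMD-sized anonymous region, mark it lazyfree
 * with MADV_FREE, then redirty it. A redirtied lazyfree THP is exactly
 * the case the patch above lets try_to_unmap_one() skip without
 * splitting the PMD.
 */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define SZ_2M (2UL * 1024 * 1024)

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	void *buf = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int fd;

	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, SZ_2M, MADV_HUGEPAGE);	/* request a THP mapping */
	memset(buf, 1, SZ_2M);			/* fault in and dirty it */

	madvise(buf, SZ_2M, MADV_FREE);		/* now lazyfree */
	memset(buf, 2, SZ_2M);			/* redirty: contents must be kept */

	/* Hypothetical cgroup path; adjust for the test machine. */
	fd = open("/sys/fs/cgroup/test/memory.reclaim", O_WRONLY);
	if (fd >= 0) {
		double t = now();

		if (write(fd, "2M", 2) < 0)
			perror("memory.reclaim");
		printf("Time taken by reclamation: %f seconds\n", now() - t);
		close(fd);
	}
	munmap(buf, SZ_2M);
	return 0;
}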



* Re: [PATCH v3 3/4] mm: Support batched unmap for lazyfree large folios during reclamation
  2025-01-15  3:38 ` [PATCH v3 3/4] mm: Support batched unmap for lazyfree large folios during reclamation Barry Song
@ 2025-02-04 11:38   ` David Hildenbrand
  2025-02-05  2:55     ` Andrew Morton
  2025-02-05  3:35     ` Barry Song
  0 siblings, 2 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-02-04 11:38 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, ioworker0, kasong, linux-arm-kernel,
	linux-kernel, linux-riscv, lorenzo.stoakes, ryan.roberts,
	v-songbaohua, x86, ying.huang, zhengtangquan

Hi,

>   	unsigned long hsz = 0;
>   
> @@ -1780,6 +1800,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   				hugetlb_vma_unlock_write(vma);
>   			}
>   			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> +		} else if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
> +				can_batch_unmap_folio_ptes(address, folio, pvmw.pte)) {
> +			nr_pages = folio_nr_pages(folio);
> +			flush_cache_range(vma, range.start, range.end);
> +			pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr_pages, 0);
> +			if (should_defer_flush(mm, flags))
> +				set_tlb_ubc_flush_pending(mm, pteval, address,
> +					address + folio_size(folio));
> +			else
> +				flush_tlb_range(vma, range.start, range.end);
>   		} else {

I have some fixes [1] that will collide with this series. I'm currently 
preparing a v2, and am not 100% sure when the fixes will get queued+merged.

I'll base them against mm-stable for now, and send them out based on 
that, to avoid the conflicts here (should all be fairly easy to resolve 
from a quick glimpse).

So we might have to refresh this series here if the fixes go in first.

[1] https://lkml.kernel.org/r/20250129115411.2077152-1-david@redhat.com

-- 
Cheers,

David / dhildenb
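
The hunk quoted above clears every PTE of the folio with one
get_and_clear_full_ptes() call and issues a single ranged TLB flush.
The userspace analogue below (an illustration only, not the kernel
path itself) shows the same idea from the other side of the syscall
boundary: tearing down a mapping with one ranged madvise() call versus
one call per 4KiB page, i.e. one page-table walk and one flush instead
of many.

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define REGION	(64UL * 1024 * 1024)
#define PAGE	4096UL

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long off;
	double t;

	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 1, REGION);			/* populate */
	t = now();
	for (off = 0; off < REGION; off += PAGE)
		madvise(buf + off, PAGE, MADV_DONTNEED);
	printf("per-page teardown: %f s\n", now() - t);

	memset(buf, 1, REGION);			/* populate again */
	t = now();
	madvise(buf, REGION, MADV_DONTNEED);	/* one ranged call */
	printf("one ranged call:   %f s\n", now() - t);

	munmap(buf, REGION);
	return 0;
}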




* Re: [PATCH v3 3/4] mm: Support batched unmap for lazyfree large folios during reclamation
  2025-02-04 11:38   ` David Hildenbrand
@ 2025-02-05  2:55     ` Andrew Morton
  2025-02-05  3:35     ` Barry Song
  1 sibling, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2025-02-05  2:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Barry Song, linux-mm, baolin.wang, chrisl, ioworker0, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan

On Tue, 4 Feb 2025 12:38:31 +0100 David Hildenbrand <david@redhat.com> wrote:

> Hi,
> 
> >   	unsigned long hsz = 0;
> >   
> > @@ -1780,6 +1800,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >   				hugetlb_vma_unlock_write(vma);
> >   			}
> >   			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> > +		} else if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
> > +				can_batch_unmap_folio_ptes(address, folio, pvmw.pte)) {
> > +			nr_pages = folio_nr_pages(folio);
> > +			flush_cache_range(vma, range.start, range.end);
> > +			pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr_pages, 0);
> > +			if (should_defer_flush(mm, flags))
> > +				set_tlb_ubc_flush_pending(mm, pteval, address,
> > +					address + folio_size(folio));
> > +			else
> > +				flush_tlb_range(vma, range.start, range.end);
> >   		} else {
> 
> I have some fixes [1] that will collide with this series. I'm currently 
> preparing a v2, and am not 100% sure when the fixes will get queued+merged.
> 
> I'll base them against mm-stable for now, and send them out based on 
> that, to avoid the conflicts here (should all be fairly easy to resolve 
> from a quick glimpse).
> 
> So we might have to refresh this series here if the fixes go in first.
> 
> [1] https://lkml.kernel.org/r/20250129115411.2077152-1-david@redhat.com

It doesn't look like "mm: fixes for device-exclusive entries (hmm)"
will be backportable(?), but yes, we should aim to stage your fixes
against mainline, ahead of other changes, to at least make life
easier for anyone who chooses to backport them into an earlier
kernel.




* Re: [PATCH v3 3/4] mm: Support batched unmap for lazyfree large folios during reclamation
  2025-02-04 11:38   ` David Hildenbrand
  2025-02-05  2:55     ` Andrew Morton
@ 2025-02-05  3:35     ` Barry Song
  1 sibling, 0 replies; 17+ messages in thread
From: Barry Song @ 2025-02-05  3:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, baolin.wang, chrisl, ioworker0, kasong,
	linux-arm-kernel, linux-kernel, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan

On Wed, Feb 5, 2025 at 12:38 AM David Hildenbrand <david@redhat.com> wrote:
>
> Hi,
>
> >       unsigned long hsz = 0;
> >
> > @@ -1780,6 +1800,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >                               hugetlb_vma_unlock_write(vma);
> >                       }
> >                       pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> > +             } else if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
> > +                             can_batch_unmap_folio_ptes(address, folio, pvmw.pte)) {
> > +                     nr_pages = folio_nr_pages(folio);
> > +                     flush_cache_range(vma, range.start, range.end);
> > +                     pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr_pages, 0);
> > +                     if (should_defer_flush(mm, flags))
> > +                             set_tlb_ubc_flush_pending(mm, pteval, address,
> > +                                     address + folio_size(folio));
> > +                     else
> > +                             flush_tlb_range(vma, range.start, range.end);
> >               } else {
>
> I have some fixes [1] that will collide with this series. I'm currently
> preparing a v2, and am not 100% sure when the fixes will get queued+merged.
>
> I'll base them against mm-stable for now, and send them out based on
> that, to avoid the conflicts here (should all be fairly easy to resolve
> from a quick glimpse).
>
> So we might have to refresh this series here if the fixes go in first.

I assume you're referring to "[PATCH v1 08/12] mm/rmap: handle
device-exclusive entries correctly in try_to_unmap_one()". It looks
straightforward to resolve the conflict. If your patch is applied first,
I'll send a rebase.

>
> [1] https://lkml.kernel.org/r/20250129115411.2077152-1-david@redhat.com
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry




Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
2025-01-15  3:38 [PATCH v3 0/4] mm: batched unmap lazyfree large folios during reclamation Barry Song
2025-01-15  3:38 ` [PATCH v3 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one Barry Song
2025-01-15  3:38 ` [PATCH v3 2/4] mm: Support tlbbatch flush for a range of PTEs Barry Song
2025-01-16  8:30   ` Kefeng Wang
2025-01-15  3:38 ` [PATCH v3 3/4] mm: Support batched unmap for lazyfree large folios during reclamation Barry Song
2025-02-04 11:38   ` David Hildenbrand
2025-02-05  2:55     ` Andrew Morton
2025-02-05  3:35     ` Barry Song
2025-01-15  3:38 ` [PATCH v3 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap Barry Song
2025-01-15  5:01   ` Lance Yang
2025-01-15  5:09     ` Barry Song
2025-01-15  5:41       ` Lance Yang
2025-01-15  6:26       ` Lance Yang
2025-01-15  6:42         ` Barry Song
2025-01-15  7:01           ` Lance Yang
2025-01-17  1:30   ` Baolin Wang
2025-01-18 14:00   ` Lance Yang
