linux-mm.kvack.org archive mirror
* [PATCH v2 0/4] mm: batched unmap lazyfree large folios during reclamation
@ 2025-01-13  3:38 Barry Song
  2025-01-13  3:38 ` [PATCH v2 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one Barry Song
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Barry Song @ 2025-01-13  3:38 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, ioworker0, kasong, linux-arm-kernel,
	linux-kernel, ryan.roberts, v-songbaohua, x86, linux-riscv,
	ying.huang, zhengtangquan, lorenzo.stoakes

From: Barry Song <v-songbaohua@oppo.com>

Commit 735ecdfaf4e8 ("mm/vmscan: avoid split lazyfree THP during
shrink_folio_list()") prevents the splitting of MADV_FREE'd THP in
madvise.c. 
However, those folios are still added to the deferred_split list in 
try_to_unmap_one() because we are unmapping PTEs and removing rmap
entries one by one.

Firstly, this has made the following counter confusing:
/sys/kernel/mm/transparent_hugepage/hugepages-size/stats/split_deferred
The split_deferred counter was originally designed to track operations
such as partial unmap or madvise of large folios. In practice, however,
as observed by Tangquan, most split_deferred events arise from memory
reclamation of aligned lazyfree mTHPs, which makes the counter highly
misleading.

Secondly, this approach is slow because it iterates over each PTE and
removes the rmap entries one by one for a large folio. In fact, all
PTEs of a PTE-mapped large folio should be unmapped at once, and the
entire folio should be removed from the rmap as a whole.

Thirdly, it also increases the risk of a race in which lazyfree folios
are incorrectly set back to swapbacked, since a speculative folio_get
may occur in the shrinker's callback. Because the folio is added to the
deferred_split list while the rmap of the 1st subpage is being removed,
deferred_split_scan() might call folio_try_get(folio); while we are
still scanning the 2nd to nr_pages PTEs of this folio in
try_to_unmap_one(), the raised reference count can make
"ref_count == 1 + map_count" false and transition the entire mTHP back
to swap-backed.

   /*
    * The only page refs must be one from isolation
    * plus the rmap(s) (dropped by discard:).
    */
   if (ref_count == 1 + map_count &&
       (!folio_test_dirty(folio) ||
        ...
        (vma->vm_flags & VM_DROPPABLE))) {
           dec_mm_counter(mm, MM_ANONPAGES);
           goto discard;
   }

This patchset resolves the issue by marking only genuinely dirty folios
as swap-backed, as suggested by David, and by transitioning to batched
unmapping of entire folios in try_to_unmap_one(). Consequently, the
deferred_split count drops to zero, and memory reclamation performance
improves significantly: reclaiming 64KiB lazyfree large folios is now
2.5x faster (the specific data is in the changelog of patch 3/4).
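
For reference, the resulting flow for a lazyfree large folio in
try_to_unmap_one() looks roughly like the sketch below (condensed from
patches 1/4 and 3/4; locking, TLB flushing and error paths are elided):

   nr_pages = folio_nr_pages(folio);
   pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr_pages, 0);

   if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
           /* genuinely redirtied: remap and fall back to swap-backed */
           set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
           folio_set_swapbacked(folio);
   } else if (ref_count != 1 + map_count) {
           /* extra (GUP or speculative) reference: keep it lazyfree, retry later */
           set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
   } else {
           /* clean and unpinned: discard the whole folio at once */
           add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
           folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
   }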

By the way, while the patchset is primarily aimed at PTE-mapped large
folios, Baolin and Lance also found that try_to_unmap_one() handles
lazyfree redirtied PMD-mapped large folios inefficiently: it splits the
PMD into PTEs and iterates over them. This patchset removes the
unnecessary splitting, enabling us to skip redirtied PMD-mapped large
folios 3.5x faster during memory reclamation (the specific data can be
found in the changelog of patch 4/4).

-v2:
 * describe the background and problems more clearly in the cover letter,
   per Lorenzo Stoakes;
 * also handle redirtied pmd-mapped large folios, per Baolin and Lance;
 * handle some corner cases such as HWPOISON and pte_unused;
 * fix riscv and x86 build issues.

-v1:
 https://lore.kernel.org/linux-mm/20250106031711.82855-1-21cnbao@gmail.com/

Barry Song (4):
  mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one
  mm: Support tlbbatch flush for a range of PTEs
  mm: Support batched unmap for lazyfree large folios during reclamation
  mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap

 arch/arm64/include/asm/tlbflush.h |  26 +++----
 arch/arm64/mm/contpte.c           |   2 +-
 arch/riscv/include/asm/tlbflush.h |   3 +-
 arch/riscv/mm/tlbflush.c          |   3 +-
 arch/x86/include/asm/tlbflush.h   |   3 +-
 mm/huge_memory.c                  |  17 ++++-
 mm/rmap.c                         | 112 ++++++++++++++++++++----------
 7 files changed, 111 insertions(+), 55 deletions(-)

-- 
2.39.3 (Apple Git-146)



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one
  2025-01-13  3:38 [PATCH v2 0/4] mm: batched unmap lazyfree large folios during reclamation Barry Song
@ 2025-01-13  3:38 ` Barry Song
  2025-01-13 13:19   ` David Hildenbrand
  2025-01-14  2:55   ` Baolin Wang
  2025-01-13  3:38 ` [PATCH v2 2/4] mm: Support tlbbatch flush for a range of PTEs Barry Song
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 18+ messages in thread
From: Barry Song @ 2025-01-13  3:38 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, ioworker0, kasong, linux-arm-kernel,
	linux-kernel, ryan.roberts, v-songbaohua, x86, linux-riscv,
	ying.huang, zhengtangquan, lorenzo.stoakes

From: Barry Song <v-songbaohua@oppo.com>

The refcount may be temporarily or long-term increased, but this does
not change the fundamental nature of the folio already being lazy-
freed. Therefore, we only reset 'swapbacked' when we are certain the
folio is dirty and not droppable.

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/rmap.c | 49 ++++++++++++++++++++++---------------------------
 1 file changed, 22 insertions(+), 27 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c6c4d4ea29a7..de6b8c34e98c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1868,34 +1868,29 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				smp_rmb();
 
-				/*
-				 * The only page refs must be one from isolation
-				 * plus the rmap(s) (dropped by discard:).
-				 */
-				if (ref_count == 1 + map_count &&
-				    (!folio_test_dirty(folio) ||
-				     /*
-				      * Unlike MADV_FREE mappings, VM_DROPPABLE
-				      * ones can be dropped even if they've
-				      * been dirtied.
-				      */
-				     (vma->vm_flags & VM_DROPPABLE))) {
-					dec_mm_counter(mm, MM_ANONPAGES);
-					goto discard;
-				}
-
-				/*
-				 * If the folio was redirtied, it cannot be
-				 * discarded. Remap the page to page table.
-				 */
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				/*
-				 * Unlike MADV_FREE mappings, VM_DROPPABLE ones
-				 * never get swap backed on failure to drop.
-				 */
-				if (!(vma->vm_flags & VM_DROPPABLE))
+				if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
+					/*
+					 * redirtied either using the page table or a previously
+					 * obtained GUP reference.
+					 */
+					set_pte_at(mm, address, pvmw.pte, pteval);
 					folio_set_swapbacked(folio);
-				goto walk_abort;
+					goto walk_abort;
+				} else if (ref_count != 1 + map_count) {
+					/*
+					 * Additional reference. Could be a GUP reference or any
+					 * speculative reference. GUP users must mark the folio
+					 * dirty if there was a modification. This folio cannot be
+					 * reclaimed right now either way, so act just like nothing
+					 * happened.
+					 * We'll come back here later and detect if the folio was
+					 * dirtied when the additional reference is gone.
+					 */
+					set_pte_at(mm, address, pvmw.pte, pteval);
+					goto walk_abort;
+				}
+				dec_mm_counter(mm, MM_ANONPAGES);
+				goto discard;
 			}
 
 			if (swap_duplicate(entry) < 0) {
-- 
2.39.3 (Apple Git-146)



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 2/4] mm: Support tlbbatch flush for a range of PTEs
  2025-01-13  3:38 [PATCH v2 0/4] mm: batched unmap lazyfree large folios during reclamation Barry Song
  2025-01-13  3:38 ` [PATCH v2 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one Barry Song
@ 2025-01-13  3:38 ` Barry Song
  2025-01-13 16:48   ` Will Deacon
  2025-01-14  9:52   ` David Hildenbrand
  2025-01-13  3:39 ` [PATCH v2 3/4] mm: Support batched unmap for lazyfree large folios during reclamation Barry Song
  2025-01-13  3:39 ` [PATCH v2 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap Barry Song
  3 siblings, 2 replies; 18+ messages in thread
From: Barry Song @ 2025-01-13  3:38 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, ioworker0, kasong, linux-arm-kernel,
	linux-kernel, ryan.roberts, v-songbaohua, x86, linux-riscv,
	ying.huang, zhengtangquan, lorenzo.stoakes, Catalin Marinas,
	Will Deacon, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Anshuman Khandual, Shaoqin Huang,
	Gavin Shan, Kefeng Wang, Mark Rutland, Kirill A. Shutemov,
	Yosry Ahmed, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Yicong Yang

From: Barry Song <v-songbaohua@oppo.com>

This is a preparatory patch to support batch PTE unmapping in
`try_to_unmap_one`. It first introduces range handling for
`tlbbatch` flush. Currently, the range is always set to the size of
PAGE_SIZE.
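
For example (a sketch based on this patch together with its use in
patch 3/4), a caller unmapping a single PTE keeps passing PAGE_SIZE,
while the batched path can cover a whole large folio in one call:

   /* single-PTE unmap: behaviour unchanged */
   set_tlb_ubc_flush_pending(mm, pteval, address, PAGE_SIZE);

   /* batched unmap of a PTE-mapped large folio (added in patch 3/4) */
   set_tlb_ubc_flush_pending(mm, pteval, address, folio_size(folio));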

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shaoqin Huang <shahuang@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 arch/arm64/include/asm/tlbflush.h | 26 ++++++++++++++------------
 arch/arm64/mm/contpte.c           |  2 +-
 arch/riscv/include/asm/tlbflush.h |  3 ++-
 arch/riscv/mm/tlbflush.c          |  3 ++-
 arch/x86/include/asm/tlbflush.h   |  3 ++-
 mm/rmap.c                         | 12 +++++++-----
 6 files changed, 28 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bc94e036a26b..f34e4fab5aa2 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -322,13 +322,6 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
 	return true;
 }
 
-static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
-					     struct mm_struct *mm,
-					     unsigned long uaddr)
-{
-	__flush_tlb_page_nosync(mm, uaddr);
-}
-
 /*
  * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
  * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
@@ -448,7 +441,7 @@ static inline bool __flush_tlb_range_limit_excess(unsigned long start,
 	return false;
 }
 
-static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
+static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
 				     unsigned long start, unsigned long end,
 				     unsigned long stride, bool last_level,
 				     int tlb_level)
@@ -460,12 +453,12 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
 	pages = (end - start) >> PAGE_SHIFT;
 
 	if (__flush_tlb_range_limit_excess(start, end, pages, stride)) {
-		flush_tlb_mm(vma->vm_mm);
+		flush_tlb_mm(mm);
 		return;
 	}
 
 	dsb(ishst);
-	asid = ASID(vma->vm_mm);
+	asid = ASID(mm);
 
 	if (last_level)
 		__flush_tlb_range_op(vale1is, start, pages, stride, asid,
@@ -474,7 +467,7 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
 		__flush_tlb_range_op(vae1is, start, pages, stride, asid,
 				     tlb_level, true, lpa2_is_enabled());
 
-	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
 static inline void __flush_tlb_range(struct vm_area_struct *vma,
@@ -482,7 +475,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 				     unsigned long stride, bool last_level,
 				     int tlb_level)
 {
-	__flush_tlb_range_nosync(vma, start, end, stride,
+	__flush_tlb_range_nosync(vma->vm_mm, start, end, stride,
 				 last_level, tlb_level);
 	dsb(ish);
 }
@@ -533,6 +526,15 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
 	dsb(ish);
 	isb();
 }
+
+static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr,
+					     unsigned long size)
+{
+	__flush_tlb_range_nosync(mm, uaddr, uaddr + size,
+				 PAGE_SIZE, true, 3);
+}
 #endif
 
 #endif
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 55107d27d3f8..bcac4f55f9c1 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -335,7 +335,7 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
 		 * eliding the trailing DSB applies here.
 		 */
 		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
-		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
+		__flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
 					 PAGE_SIZE, true, 3);
 	}
 
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index 72e559934952..7f3ea687ce33 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -61,7 +61,8 @@ void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
 bool arch_tlbbatch_should_defer(struct mm_struct *mm);
 void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
 			       struct mm_struct *mm,
-			       unsigned long uaddr);
+			       unsigned long uaddr,
+			       unsigned long size);
 void arch_flush_tlb_batched_pending(struct mm_struct *mm);
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 9b6e86ce3867..aeda64a36d50 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -187,7 +187,8 @@ bool arch_tlbbatch_should_defer(struct mm_struct *mm)
 
 void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
 			       struct mm_struct *mm,
-			       unsigned long uaddr)
+			       unsigned long uaddr,
+			       unsigned long size)
 {
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
 }
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 69e79fff41b8..4b62a6329b8f 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -279,7 +279,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 
 static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
 					     struct mm_struct *mm,
-					     unsigned long uaddr)
+					     unsigned long uaddr,
+					     unsigned long size)
 {
 	inc_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
diff --git a/mm/rmap.c b/mm/rmap.c
index de6b8c34e98c..365112af5291 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -672,7 +672,8 @@ void try_to_unmap_flush_dirty(void)
 	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
 
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
-				      unsigned long uaddr)
+				      unsigned long uaddr,
+				      unsigned long size)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 	int batch;
@@ -681,7 +682,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
 	if (!pte_accessible(mm, pteval))
 		return;
 
-	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
+	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr, size);
 	tlb_ubc->flush_required = true;
 
 	/*
@@ -757,7 +758,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
 }
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
-				      unsigned long uaddr)
+				      unsigned long uaddr,
+				      unsigned long size)
 {
 }
 
@@ -1792,7 +1794,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pteval, address);
+				set_tlb_ubc_flush_pending(mm, pteval, address, PAGE_SIZE);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
@@ -2164,7 +2166,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pteval, address);
+				set_tlb_ubc_flush_pending(mm, pteval, address, PAGE_SIZE);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
-- 
2.39.3 (Apple Git-146)



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 3/4] mm: Support batched unmap for lazyfree large folios during reclamation
  2025-01-13  3:38 [PATCH v2 0/4] mm: batched unmap lazyfree large folios during reclamation Barry Song
  2025-01-13  3:38 ` [PATCH v2 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one Barry Song
  2025-01-13  3:38 ` [PATCH v2 2/4] mm: Support tlbbatch flush for a range of PTEs Barry Song
@ 2025-01-13  3:39 ` Barry Song
  2025-01-13  3:39 ` [PATCH v2 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap Barry Song
  3 siblings, 0 replies; 18+ messages in thread
From: Barry Song @ 2025-01-13  3:39 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, ioworker0, kasong, linux-arm-kernel,
	linux-kernel, ryan.roberts, v-songbaohua, x86, linux-riscv,
	ying.huang, zhengtangquan, lorenzo.stoakes

From: Barry Song <v-songbaohua@oppo.com>

Currently, the PTEs and rmap of a large folio are removed one at a time.
This is not only slow but also causes the large folio to be unnecessarily
added to deferred_split, which can lead to races between the
deferred_split shrinker callback and memory reclamation. This patch
releases all PTEs and rmap entries in a batch; for now, it only handles
lazyfree large folios.

The microbenchmark below tries to reclaim 128MB of lazyfree large folios
whose size is 64KiB:

 #include <stdio.h>
 #include <sys/mman.h>
 #include <string.h>
 #include <time.h>

 #define SIZE 128*1024*1024  // 128 MB

 unsigned long read_split_deferred()
 {
 	FILE *file = fopen("/sys/kernel/mm/transparent_hugepage"
			"/hugepages-64kB/stats/split_deferred", "r");
 	if (!file) {
 		perror("Error opening file");
 		return 0;
 	}

 	unsigned long value;
 	if (fscanf(file, "%lu", &value) != 1) {
 		perror("Error reading value");
 		fclose(file);
 		return 0;
 	}

 	fclose(file);
 	return value;
 }

 int main(int argc, char *argv[])
 {
 	while(1) {
 		volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
 				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

 		memset((void *)p, 1, SIZE);

 		madvise((void *)p, SIZE, MADV_FREE);

 		clock_t start_time = clock();
 		unsigned long start_split = read_split_deferred();
 		madvise((void *)p, SIZE, MADV_PAGEOUT);
 		clock_t end_time = clock();
 		unsigned long end_split = read_split_deferred();

 		double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
 		printf("Time taken by reclamation: %f seconds, split_deferred: %ld\n",
 			elapsed_time, end_split - start_split);

 		munmap((void *)p, SIZE);
 	}
 	return 0;
 }
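
One possible way to build and run it (assumptions: gcc is available on
the target, 64KiB mTHP is enabled through the standard mTHP sysfs
interface, and the source file name is arbitrary):

 ~ # gcc -O2 -o a.out lazyfree_bench.c
 ~ # echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
 ~ # ./a.out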

w/o patch:
~ # ./a.out
Time taken by reclamation: 0.177418 seconds, split_deferred: 2048
Time taken by reclamation: 0.178348 seconds, split_deferred: 2048
Time taken by reclamation: 0.174525 seconds, split_deferred: 2048
Time taken by reclamation: 0.171620 seconds, split_deferred: 2048
Time taken by reclamation: 0.172241 seconds, split_deferred: 2048
Time taken by reclamation: 0.174003 seconds, split_deferred: 2048
Time taken by reclamation: 0.171058 seconds, split_deferred: 2048
Time taken by reclamation: 0.171993 seconds, split_deferred: 2048
Time taken by reclamation: 0.169829 seconds, split_deferred: 2048
Time taken by reclamation: 0.172895 seconds, split_deferred: 2048
Time taken by reclamation: 0.176063 seconds, split_deferred: 2048
Time taken by reclamation: 0.172568 seconds, split_deferred: 2048
Time taken by reclamation: 0.171185 seconds, split_deferred: 2048
Time taken by reclamation: 0.170632 seconds, split_deferred: 2048
Time taken by reclamation: 0.170208 seconds, split_deferred: 2048
Time taken by reclamation: 0.174192 seconds, split_deferred: 2048
...

w/ patch:
~ # ./a.out
Time taken by reclamation: 0.074231 seconds, split_deferred: 0
Time taken by reclamation: 0.071026 seconds, split_deferred: 0
Time taken by reclamation: 0.072029 seconds, split_deferred: 0
Time taken by reclamation: 0.071873 seconds, split_deferred: 0
Time taken by reclamation: 0.073573 seconds, split_deferred: 0
Time taken by reclamation: 0.071906 seconds, split_deferred: 0
Time taken by reclamation: 0.073604 seconds, split_deferred: 0
Time taken by reclamation: 0.075903 seconds, split_deferred: 0
Time taken by reclamation: 0.073191 seconds, split_deferred: 0
Time taken by reclamation: 0.071228 seconds, split_deferred: 0
Time taken by reclamation: 0.071391 seconds, split_deferred: 0
Time taken by reclamation: 0.071468 seconds, split_deferred: 0
Time taken by reclamation: 0.071896 seconds, split_deferred: 0
Time taken by reclamation: 0.072508 seconds, split_deferred: 0
Time taken by reclamation: 0.071884 seconds, split_deferred: 0
Time taken by reclamation: 0.072433 seconds, split_deferred: 0
Time taken by reclamation: 0.071939 seconds, split_deferred: 0
...

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/rmap.c | 46 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 6 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 365112af5291..3ef659310797 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1642,6 +1642,25 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page,
 #endif
 }
 
+/* We support batch unmapping of PTEs for lazyfree large folios */
+static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
+			struct folio *folio, pte_t *ptep)
+{
+	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+	int max_nr = folio_nr_pages(folio);
+	pte_t pte = ptep_get(ptep);
+
+	if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
+		return false;
+	if (pte_none(pte) || pte_unused(pte) || !pte_present(pte))
+		return false;
+	if (pte_pfn(pte) != folio_pfn(folio))
+		return false;
+
+	return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
+			       NULL, NULL) == max_nr;
+}
+
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
@@ -1655,6 +1674,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+	int nr_pages = 1;
 	unsigned long pfn;
 	unsigned long hsz = 0;
 
@@ -1780,6 +1800,15 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				hugetlb_vma_unlock_write(vma);
 			}
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
+		} else if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
+				can_batch_unmap_folio_ptes(address, folio, pvmw.pte)) {
+			nr_pages = folio_nr_pages(folio);
+			flush_cache_range(vma, range.start, range.end);
+			pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr_pages, 0);
+			if (should_defer_flush(mm, flags))
+				set_tlb_ubc_flush_pending(mm, pteval, address, folio_size(folio));
+			else
+				flush_tlb_range(vma, range.start, range.end);
 		} else {
 			flush_cache_page(vma, address, pfn);
 			/* Nuke the page table entry. */
@@ -1875,7 +1904,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 					 * redirtied either using the page table or a previously
 					 * obtained GUP reference.
 					 */
-					set_pte_at(mm, address, pvmw.pte, pteval);
+					set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
 					folio_set_swapbacked(folio);
 					goto walk_abort;
 				} else if (ref_count != 1 + map_count) {
@@ -1888,10 +1917,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 					 * We'll come back here later and detect if the folio was
 					 * dirtied when the additional reference is gone.
 					 */
-					set_pte_at(mm, address, pvmw.pte, pteval);
+					set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
 					goto walk_abort;
 				}
-				dec_mm_counter(mm, MM_ANONPAGES);
+				add_mm_counter(mm, MM_ANONPAGES, -nr_pages);
 				goto discard;
 			}
 
@@ -1943,13 +1972,18 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			dec_mm_counter(mm, mm_counter_file(folio));
 		}
 discard:
-		if (unlikely(folio_test_hugetlb(folio)))
+		if (unlikely(folio_test_hugetlb(folio))) {
 			hugetlb_remove_rmap(folio);
-		else
-			folio_remove_rmap_pte(folio, subpage, vma);
+		} else {
+			folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
+			folio_ref_sub(folio, nr_pages - 1);
+		}
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_drain_local();
 		folio_put(folio);
+		/* We have already batched the entire folio */
+		if (nr_pages > 1)
+			goto walk_done;
 		continue;
 walk_abort:
 		ret = false;
-- 
2.39.3 (Apple Git-146)



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-13  3:38 [PATCH v2 0/4] mm: batched unmap lazyfree large folios during reclamation Barry Song
                   ` (2 preceding siblings ...)
  2025-01-13  3:39 ` [PATCH v2 3/4] mm: Support batched unmap for lazyfree large folios during reclamation Barry Song
@ 2025-01-13  3:39 ` Barry Song
  2025-01-14  3:40   ` Baolin Wang
  3 siblings, 1 reply; 18+ messages in thread
From: Barry Song @ 2025-01-13  3:39 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baolin.wang, chrisl, david, ioworker0, kasong, linux-arm-kernel,
	linux-kernel, ryan.roberts, v-songbaohua, x86, linux-riscv,
	ying.huang, zhengtangquan, lorenzo.stoakes

From: Barry Song <v-songbaohua@oppo.com>

The try_to_unmap_one() function currently handles PMD-mapped THPs
inefficiently. It first splits the PMD into PTEs, copies the dirty
state from the PMD to the PTEs, iterates over the PTEs to locate
the dirty state, and then marks the THP as swap-backed. This process
involves unnecessary PMD splitting and redundant iteration. Instead,
this functionality can be efficiently managed in
__discard_anon_folio_pmd_locked(), avoiding the extra steps and
improving performance.

The following microbenchmark redirties folios after invoking MADV_FREE,
then measures the time taken to perform memory reclamation (actually
set those folios swapbacked again) on the redirtied folios.

 #include <stdio.h>
 #include <sys/mman.h>
 #include <string.h>
 #include <time.h>

 #define SIZE 128*1024*1024  // 128 MB

 int main(int argc, char *argv[])
 {
 	while(1) {
 		volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
 				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

 		memset((void *)p, 1, SIZE);
 		madvise((void *)p, SIZE, MADV_FREE);
 		/* redirty after MADV_FREE */
 		memset((void *)p, 1, SIZE);

		clock_t start_time = clock();
 		madvise((void *)p, SIZE, MADV_PAGEOUT);
 		clock_t end_time = clock();

 		double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
 		printf("Time taken by reclamation: %f seconds\n", elapsed_time);

 		munmap((void *)p, SIZE);
 	}
 	return 0;
 }

Testing results are as below,
w/o patch:
~ # ./a.out
Time taken by reclamation: 0.007300 seconds
Time taken by reclamation: 0.007226 seconds
Time taken by reclamation: 0.007295 seconds
Time taken by reclamation: 0.007731 seconds
Time taken by reclamation: 0.007134 seconds
Time taken by reclamation: 0.007285 seconds
Time taken by reclamation: 0.007720 seconds
Time taken by reclamation: 0.007128 seconds
Time taken by reclamation: 0.007710 seconds
Time taken by reclamation: 0.007712 seconds
Time taken by reclamation: 0.007236 seconds
Time taken by reclamation: 0.007690 seconds
Time taken by reclamation: 0.007174 seconds
Time taken by reclamation: 0.007670 seconds
Time taken by reclamation: 0.007169 seconds
Time taken by reclamation: 0.007305 seconds
Time taken by reclamation: 0.007432 seconds
Time taken by reclamation: 0.007158 seconds
Time taken by reclamation: 0.007133 seconds
…

w/ patch

~ # ./a.out
Time taken by reclamation: 0.002124 seconds
Time taken by reclamation: 0.002116 seconds
Time taken by reclamation: 0.002150 seconds
Time taken by reclamation: 0.002261 seconds
Time taken by reclamation: 0.002137 seconds
Time taken by reclamation: 0.002173 seconds
Time taken by reclamation: 0.002063 seconds
Time taken by reclamation: 0.002088 seconds
Time taken by reclamation: 0.002169 seconds
Time taken by reclamation: 0.002124 seconds
Time taken by reclamation: 0.002111 seconds
Time taken by reclamation: 0.002224 seconds
Time taken by reclamation: 0.002297 seconds
Time taken by reclamation: 0.002260 seconds
Time taken by reclamation: 0.002246 seconds
Time taken by reclamation: 0.002272 seconds
Time taken by reclamation: 0.002277 seconds
Time taken by reclamation: 0.002462 seconds
…

This patch significantly speeds up try_to_unmap_one() by allowing it
to skip redirtied THPs without splitting the PMD.

Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Suggested-by: Lance Yang <ioworker0@gmail.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/huge_memory.c | 17 ++++++++++++++---
 mm/rmap.c        | 11 ++++++++++-
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3d3ebdc002d5..aea49f7125f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3070,8 +3070,12 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
 	int ref_count, map_count;
 	pmd_t orig_pmd = *pmdp;
 
-	if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
+	if (pmd_dirty(orig_pmd))
+		folio_set_dirty(folio);
+	if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
+		folio_set_swapbacked(folio);
 		return false;
+	}
 
 	orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
 
@@ -3098,8 +3102,15 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
 	 *
 	 * The only folio refs must be one from isolation plus the rmap(s).
 	 */
-	if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
-	    ref_count != map_count + 1) {
+	if (pmd_dirty(orig_pmd))
+		folio_set_dirty(folio);
+	if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
+		folio_set_swapbacked(folio);
+		set_pmd_at(mm, addr, pmdp, orig_pmd);
+		return false;
+	}
+
+	if (ref_count != map_count + 1) {
 		set_pmd_at(mm, addr, pmdp, orig_pmd);
 		return false;
 	}
diff --git a/mm/rmap.c b/mm/rmap.c
index 3ef659310797..02c4e4b2cd7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1671,7 +1671,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	pte_t pteval;
 	struct page *subpage;
-	bool anon_exclusive, ret = true;
+	bool anon_exclusive, lazyfree, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	int nr_pages = 1;
@@ -1724,9 +1724,18 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		}
 
 		if (!pvmw.pte) {
+			lazyfree = folio_test_anon(folio) && !folio_test_swapbacked(folio);
+
 			if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
 						  folio))
 				goto walk_done;
+			/*
+			 * unmap_huge_pmd_locked has either already marked
+			 * the folio as swap-backed or decided to retain it
+			 * due to GUP or speculative references.
+			 */
+			if (lazyfree)
+				goto walk_abort;
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
 				/*
-- 
2.39.3 (Apple Git-146)



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one
  2025-01-13  3:38 ` [PATCH v2 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one Barry Song
@ 2025-01-13 13:19   ` David Hildenbrand
  2025-01-13 13:20     ` David Hildenbrand
  2025-01-14  2:55   ` Baolin Wang
  1 sibling, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-01-13 13:19 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, ioworker0, kasong, linux-arm-kernel,
	linux-kernel, ryan.roberts, v-songbaohua, x86, linux-riscv,
	ying.huang, zhengtangquan, lorenzo.stoakes

On 13.01.25 04:38, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> The refcount may be temporarily or long-term increased, but this does
> not change the fundamental nature of the folio already being lazy-
> freed. Therefore, we only reset 'swapbacked' when we are certain the
> folio is dirty and not droppable.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one
  2025-01-13 13:19   ` David Hildenbrand
@ 2025-01-13 13:20     ` David Hildenbrand
  2025-01-13 21:56       ` Barry Song
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-01-13 13:20 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, ioworker0, kasong, linux-arm-kernel,
	linux-kernel, ryan.roberts, v-songbaohua, x86, linux-riscv,
	ying.huang, zhengtangquan, lorenzo.stoakes

On 13.01.25 14:19, David Hildenbrand wrote:
> On 13.01.25 04:38, Barry Song wrote:
>> From: Barry Song <v-songbaohua@oppo.com>
>>
>> The refcount may be temporarily or long-term increased, but this does
>> not change the fundamental nature of the folio already being lazy-
>> freed. Therefore, we only reset 'swapbacked' when we are certain the
>> folio is dirty and not droppable.
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> ---
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 

Ah, should this have a Fixes: ?

Because of a speculative reference, we might not reclaim MADV_FREE 
folios as we silently mark them as swapbacked again, which sounds fairly 
wrong.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] mm: Support tlbbatch flush for a range of PTEs
  2025-01-13  3:38 ` [PATCH v2 2/4] mm: Support tlbbatch flush for a range of PTEs Barry Song
@ 2025-01-13 16:48   ` Will Deacon
  2025-01-14  9:52   ` David Hildenbrand
  1 sibling, 0 replies; 18+ messages in thread
From: Will Deacon @ 2025-01-13 16:48 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baolin.wang, chrisl, david, ioworker0, kasong,
	linux-arm-kernel, linux-kernel, ryan.roberts, v-songbaohua, x86,
	linux-riscv, ying.huang, zhengtangquan, lorenzo.stoakes,
	Catalin Marinas, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Anshuman Khandual, Shaoqin Huang,
	Gavin Shan, Kefeng Wang, Mark Rutland, Kirill A. Shutemov,
	Yosry Ahmed, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Yicong Yang

On Mon, Jan 13, 2025 at 04:38:59PM +1300, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> This is a preparatory patch to support batch PTE unmapping in
> `try_to_unmap_one`. It first introduces range handling for
> `tlbbatch` flush. Currently, the range is always set to the size of
> PAGE_SIZE.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Shaoqin Huang <shahuang@redhat.com>
> Cc: Gavin Shan <gshan@redhat.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Lance Yang <ioworker0@gmail.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Cc: Paul Walmsley <paul.walmsley@sifive.com>
> Cc: Palmer Dabbelt <palmer@dabbelt.com>
> Cc: Albert Ou <aou@eecs.berkeley.edu>
> Cc: Yicong Yang <yangyicong@hisilicon.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  arch/arm64/include/asm/tlbflush.h | 26 ++++++++++++++------------
>  arch/arm64/mm/contpte.c           |  2 +-
>  arch/riscv/include/asm/tlbflush.h |  3 ++-
>  arch/riscv/mm/tlbflush.c          |  3 ++-
>  arch/x86/include/asm/tlbflush.h   |  3 ++-
>  mm/rmap.c                         | 12 +++++++-----
>  6 files changed, 28 insertions(+), 21 deletions(-)

The arm64 bits look correct to me:

Acked-by: Will Deacon <will@kernel.org>

Will


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one
  2025-01-13 13:20     ` David Hildenbrand
@ 2025-01-13 21:56       ` Barry Song
  0 siblings, 0 replies; 18+ messages in thread
From: Barry Song @ 2025-01-13 21:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, baolin.wang, chrisl, ioworker0, kasong,
	linux-arm-kernel, linux-kernel, ryan.roberts, v-songbaohua, x86,
	linux-riscv, ying.huang, zhengtangquan, lorenzo.stoakes, mfo

On Tue, Jan 14, 2025 at 2:21 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.01.25 14:19, David Hildenbrand wrote:
> > On 13.01.25 04:38, Barry Song wrote:
> >> From: Barry Song <v-songbaohua@oppo.com>
> >>
> >> The refcount may be temporarily or long-term increased, but this does
> >> not change the fundamental nature of the folio already being lazy-
> >> freed. Therefore, we only reset 'swapbacked' when we are certain the
> >> folio is dirty and not droppable.
> >>
> >> Suggested-by: David Hildenbrand <david@redhat.com>
> >> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> ---
> >
> > Acked-by: David Hildenbrand <david@redhat.com>
> >

Thanks!

>
> Ah, should this have a Fixes: ?

That makes sense to me.

>
> Because of a speculative reference, we might not reclaim MADV_FREE
> folios as we silently mark them as swapbacked again, which sounds fairly
> wrong.
>

I assume the tag should point to Mauricio's commit 6c8e2a256915a2 ("mm:
fix race between MADV_FREE reclaim and blkdev direct IO read"); Mauricio
is also CC'ed.
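
i.e. something like:

  Fixes: 6c8e2a256915a2 ("mm: fix race between MADV_FREE reclaim and blkdev direct IO read")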

> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one
  2025-01-13  3:38 ` [PATCH v2 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one Barry Song
  2025-01-13 13:19   ` David Hildenbrand
@ 2025-01-14  2:55   ` Baolin Wang
  2025-01-14  6:05     ` Lance Yang
  1 sibling, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-01-14  2:55 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: chrisl, david, ioworker0, kasong, linux-arm-kernel, linux-kernel,
	ryan.roberts, v-songbaohua, x86, linux-riscv, ying.huang,
	zhengtangquan, lorenzo.stoakes



On 2025/1/13 11:38, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> The refcount may be temporarily or long-term increased, but this does
> not change the fundamental nature of the folio already being lazy-
> freed. Therefore, we only reset 'swapbacked' when we are certain the
> folio is dirty and not droppable.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

LGTM. Feel free to add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-13  3:39 ` [PATCH v2 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap Barry Song
@ 2025-01-14  3:40   ` Baolin Wang
  2025-01-14  4:09     ` Barry Song
  0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-01-14  3:40 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: chrisl, david, ioworker0, kasong, linux-arm-kernel, linux-kernel,
	ryan.roberts, v-songbaohua, x86, linux-riscv, ying.huang,
	zhengtangquan, lorenzo.stoakes



On 2025/1/13 11:39, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> The try_to_unmap_one() function currently handles PMD-mapped THPs
> inefficiently. It first splits the PMD into PTEs, copies the dirty
> state from the PMD to the PTEs, iterates over the PTEs to locate
> the dirty state, and then marks the THP as swap-backed. This process
> involves unnecessary PMD splitting and redundant iteration. Instead,
> this functionality can be efficiently managed in
> __discard_anon_folio_pmd_locked(), avoiding the extra steps and
> improving performance.
> 
> The following microbenchmark redirties folios after invoking MADV_FREE,
> then measures the time taken to perform memory reclamation (actually
> set those folios swapbacked again) on the redirtied folios.
> 
>   #include <stdio.h>
>   #include <sys/mman.h>
>   #include <string.h>
>   #include <time.h>
> 
>   #define SIZE 128*1024*1024  // 128 MB
> 
>   int main(int argc, char *argv[])
>   {
>   	while(1) {
>   		volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
>   				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
>   		memset((void *)p, 1, SIZE);
>   		madvise((void *)p, SIZE, MADV_FREE);
>   		/* redirty after MADV_FREE */
>   		memset((void *)p, 1, SIZE);
> 
> 		clock_t start_time = clock();
>   		madvise((void *)p, SIZE, MADV_PAGEOUT);
>   		clock_t end_time = clock();
> 
>   		double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
>   		printf("Time taken by reclamation: %f seconds\n", elapsed_time);
> 
>   		munmap((void *)p, SIZE);
>   	}
>   	return 0;
>   }
> 
> Testing results are as below,
> w/o patch:
> ~ # ./a.out
> Time taken by reclamation: 0.007300 seconds
> Time taken by reclamation: 0.007226 seconds
> Time taken by reclamation: 0.007295 seconds
> Time taken by reclamation: 0.007731 seconds
> Time taken by reclamation: 0.007134 seconds
> Time taken by reclamation: 0.007285 seconds
> Time taken by reclamation: 0.007720 seconds
> Time taken by reclamation: 0.007128 seconds
> Time taken by reclamation: 0.007710 seconds
> Time taken by reclamation: 0.007712 seconds
> Time taken by reclamation: 0.007236 seconds
> Time taken by reclamation: 0.007690 seconds
> Time taken by reclamation: 0.007174 seconds
> Time taken by reclamation: 0.007670 seconds
> Time taken by reclamation: 0.007169 seconds
> Time taken by reclamation: 0.007305 seconds
> Time taken by reclamation: 0.007432 seconds
> Time taken by reclamation: 0.007158 seconds
> Time taken by reclamation: 0.007133 seconds
> …
> 
> w/ patch
> 
> ~ # ./a.out
> Time taken by reclamation: 0.002124 seconds
> Time taken by reclamation: 0.002116 seconds
> Time taken by reclamation: 0.002150 seconds
> Time taken by reclamation: 0.002261 seconds
> Time taken by reclamation: 0.002137 seconds
> Time taken by reclamation: 0.002173 seconds
> Time taken by reclamation: 0.002063 seconds
> Time taken by reclamation: 0.002088 seconds
> Time taken by reclamation: 0.002169 seconds
> Time taken by reclamation: 0.002124 seconds
> Time taken by reclamation: 0.002111 seconds
> Time taken by reclamation: 0.002224 seconds
> Time taken by reclamation: 0.002297 seconds
> Time taken by reclamation: 0.002260 seconds
> Time taken by reclamation: 0.002246 seconds
> Time taken by reclamation: 0.002272 seconds
> Time taken by reclamation: 0.002277 seconds
> Time taken by reclamation: 0.002462 seconds
> …
> 
> This patch significantly speeds up try_to_unmap_one() by allowing it
> to skip redirtied THPs without splitting the PMD.
> 
> Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Suggested-by: Lance Yang <ioworker0@gmail.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>   mm/huge_memory.c | 17 ++++++++++++++---
>   mm/rmap.c        | 11 ++++++++++-
>   2 files changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 3d3ebdc002d5..aea49f7125f1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3070,8 +3070,12 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
>   	int ref_count, map_count;
>   	pmd_t orig_pmd = *pmdp;
>   
> -	if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> +	if (pmd_dirty(orig_pmd))
> +		folio_set_dirty(folio);
> +	if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> +		folio_set_swapbacked(folio);
>   		return false;
> +	}
>   
>   	orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
>   
> @@ -3098,8 +3102,15 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
>   	 *
>   	 * The only folio refs must be one from isolation plus the rmap(s).
>   	 */
> -	if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
> -	    ref_count != map_count + 1) {
> +	if (pmd_dirty(orig_pmd))
> +		folio_set_dirty(folio);
> +	if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> +		folio_set_swapbacked(folio);
> +		set_pmd_at(mm, addr, pmdp, orig_pmd);
> +		return false;
> +	}
> +
> +	if (ref_count != map_count + 1) {
>   		set_pmd_at(mm, addr, pmdp, orig_pmd);
>   		return false;
>   	}
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 3ef659310797..02c4e4b2cd7b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1671,7 +1671,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>   	pte_t pteval;
>   	struct page *subpage;
> -	bool anon_exclusive, ret = true;
> +	bool anon_exclusive, lazyfree, ret = true;
>   	struct mmu_notifier_range range;
>   	enum ttu_flags flags = (enum ttu_flags)(long)arg;
>   	int nr_pages = 1;
> @@ -1724,9 +1724,18 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   		}
>   
>   		if (!pvmw.pte) {
> +			lazyfree = folio_test_anon(folio) && !folio_test_swapbacked(folio);

You've checked lazyfree here, so can we remove the duplicate check in 
unmap_huge_pmd_locked()? Then the code should be:

		if (lazyfree && unmap_huge_pmd_locked(...))
			goto walk_done;

>   			if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
>   						  folio))
>   				goto walk_done;
> +			/*
> +			 * unmap_huge_pmd_locked has either already marked
> +			 * the folio as swap-backed or decided to retain it
> +			 * due to GUP or speculative references.
> +			 */
> +			if (lazyfree)
> +				goto walk_abort;
>   
>   			if (flags & TTU_SPLIT_HUGE_PMD) {
>   				/*


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-14  3:40   ` Baolin Wang
@ 2025-01-14  4:09     ` Barry Song
  2025-01-14  6:00       ` Barry Song
  2025-01-14  6:30       ` Lance Yang
  0 siblings, 2 replies; 18+ messages in thread
From: Barry Song @ 2025-01-14  4:09 UTC (permalink / raw)
  To: baolin.wang
  Cc: 21cnbao, akpm, chrisl, david, ioworker0, kasong,
	linux-arm-kernel, linux-kernel, linux-mm, linux-riscv,
	lorenzo.stoakes, ryan.roberts, v-songbaohua, x86, ying.huang,
	zhengtangquan

On Tue, Jan 14, 2025 at 4:40 PM Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/1/13 11:39, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > The try_to_unmap_one() function currently handles PMD-mapped THPs
> > inefficiently. It first splits the PMD into PTEs, copies the dirty
> > state from the PMD to the PTEs, iterates over the PTEs to locate
> > the dirty state, and then marks the THP as swap-backed. This process
> > involves unnecessary PMD splitting and redundant iteration. Instead,
> > this functionality can be efficiently managed in
> > __discard_anon_folio_pmd_locked(), avoiding the extra steps and
> > improving performance.
> >
> > The following microbenchmark redirties folios after invoking MADV_FREE,
> > then measures the time taken to perform memory reclamation (actually
> > set those folios swapbacked again) on the redirtied folios.
> >
> >   #include <stdio.h>
> >   #include <sys/mman.h>
> >   #include <string.h>
> >   #include <time.h>
> >
> >   #define SIZE 128*1024*1024  // 128 MB
> >
> >   int main(int argc, char *argv[])
> >   {
> >       while(1) {
> >               volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
> >                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >
> >               memset((void *)p, 1, SIZE);
> >               madvise((void *)p, SIZE, MADV_FREE);
> >               /* redirty after MADV_FREE */
> >               memset((void *)p, 1, SIZE);
> >
> >               clock_t start_time = clock();
> >               madvise((void *)p, SIZE, MADV_PAGEOUT);
> >               clock_t end_time = clock();
> >
> >               double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
> >               printf("Time taken by reclamation: %f seconds\n", elapsed_time);
> >
> >               munmap((void *)p, SIZE);
> >       }
> >       return 0;
> >   }
> >
> > Testing results are as below,
> > w/o patch:
> > ~ # ./a.out
> > Time taken by reclamation: 0.007300 seconds
> > Time taken by reclamation: 0.007226 seconds
> > Time taken by reclamation: 0.007295 seconds
> > Time taken by reclamation: 0.007731 seconds
> > Time taken by reclamation: 0.007134 seconds
> > Time taken by reclamation: 0.007285 seconds
> > Time taken by reclamation: 0.007720 seconds
> > Time taken by reclamation: 0.007128 seconds
> > Time taken by reclamation: 0.007710 seconds
> > Time taken by reclamation: 0.007712 seconds
> > Time taken by reclamation: 0.007236 seconds
> > Time taken by reclamation: 0.007690 seconds
> > Time taken by reclamation: 0.007174 seconds
> > Time taken by reclamation: 0.007670 seconds
> > Time taken by reclamation: 0.007169 seconds
> > Time taken by reclamation: 0.007305 seconds
> > Time taken by reclamation: 0.007432 seconds
> > Time taken by reclamation: 0.007158 seconds
> > Time taken by reclamation: 0.007133 seconds
> > …
> >
> > w/ patch
> >
> > ~ # ./a.out
> > Time taken by reclamation: 0.002124 seconds
> > Time taken by reclamation: 0.002116 seconds
> > Time taken by reclamation: 0.002150 seconds
> > Time taken by reclamation: 0.002261 seconds
> > Time taken by reclamation: 0.002137 seconds
> > Time taken by reclamation: 0.002173 seconds
> > Time taken by reclamation: 0.002063 seconds
> > Time taken by reclamation: 0.002088 seconds
> > Time taken by reclamation: 0.002169 seconds
> > Time taken by reclamation: 0.002124 seconds
> > Time taken by reclamation: 0.002111 seconds
> > Time taken by reclamation: 0.002224 seconds
> > Time taken by reclamation: 0.002297 seconds
> > Time taken by reclamation: 0.002260 seconds
> > Time taken by reclamation: 0.002246 seconds
> > Time taken by reclamation: 0.002272 seconds
> > Time taken by reclamation: 0.002277 seconds
> > Time taken by reclamation: 0.002462 seconds
> > …
> >
> > This patch significantly speeds up try_to_unmap_one() by allowing it
> > to skip redirtied THPs without splitting the PMD.
> >
> > Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Suggested-by: Lance Yang <ioworker0@gmail.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >   mm/huge_memory.c | 17 ++++++++++++++---
> >   mm/rmap.c        | 11 ++++++++++-
> >   2 files changed, 24 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 3d3ebdc002d5..aea49f7125f1 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3070,8 +3070,12 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> >       int ref_count, map_count;
> >       pmd_t orig_pmd = *pmdp;
> >  
> > -     if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> > +     if (pmd_dirty(orig_pmd))
> > +             folio_set_dirty(folio);
> > +     if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > +             folio_set_swapbacked(folio);
> >               return false;
> > +     }
> >  
> >       orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
> >  
> > @@ -3098,8 +3102,15 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> >        *
> >        * The only folio refs must be one from isolation plus the rmap(s).
> >        */
> > -     if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
> > -         ref_count != map_count + 1) {
> > +     if (pmd_dirty(orig_pmd))
> > +             folio_set_dirty(folio);
> > +     if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > +             folio_set_swapbacked(folio);
> > +             set_pmd_at(mm, addr, pmdp, orig_pmd);
> > +             return false;
> > +     }
> > +
> > +     if (ref_count != map_count + 1) {
> >               set_pmd_at(mm, addr, pmdp, orig_pmd);
> >               return false;
> >       }
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 3ef659310797..02c4e4b2cd7b 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1671,7 +1671,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >       DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> >       pte_t pteval;
> >       struct page *subpage;
> > -     bool anon_exclusive, ret = true;
> > +     bool anon_exclusive, lazyfree, ret = true;
> >       struct mmu_notifier_range range;
> >       enum ttu_flags flags = (enum ttu_flags)(long)arg;
> >       int nr_pages = 1;
> > @@ -1724,9 +1724,18 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >               }
> >  
> >               if (!pvmw.pte) {
> > +                     lazyfree = folio_test_anon(folio) && !folio_test_swapbacked(folio);
>
> You've checked lazyfree here, so can we remove the duplicate check in
> unmap_huge_pmd_locked()? Then the code should be:
>
>                 if (lazyfree && unmap_huge_pmd_locked(...))
>                         goto walk_done;


Right. It seems unmap_huge_pmd_locked() only handles lazyfree pmd-mapped
THP, so I guess the code could be:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aea49f7125f1..c4c3a7896de4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3131,11 +3131,10 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
 	VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
 	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
+	VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
+	VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
 
-	if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
-		return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
-
-	return false;
+	return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
 }
 
 static void remap_page(struct folio *folio, unsigned long nr, int flags)
diff --git a/mm/rmap.c b/mm/rmap.c
index 02c4e4b2cd7b..72907eb1b8fe 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1671,7 +1671,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	pte_t pteval;
 	struct page *subpage;
-	bool anon_exclusive, lazyfree, ret = true;
+	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	int nr_pages = 1;
@@ -1724,18 +1724,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		}
 
 		if (!pvmw.pte) {
-			lazyfree = folio_test_anon(folio) && !folio_test_swapbacked(folio);
-
-			if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
-						  folio))
-				goto walk_done;
-			/*
-			 * unmap_huge_pmd_locked has either already marked
-			 * the folio as swap-backed or decided to retain it
-			 * due to GUP or speculative references.
-			 */
-			if (lazyfree)
+			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
+				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
+					goto walk_done;
+				/*
+				 * unmap_huge_pmd_locked has either already marked
+				 * the folio as swap-backed or decided to retain it
+				 * due to GUP or speculative references.
+				 */
 				goto walk_abort;
+			}
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
 				/*

>
> >                       if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> >                                                 folio))
> >                               goto walk_done;
> > +                     /*
> > +                      * unmap_huge_pmd_locked has either already marked
> > +                      * the folio as swap-backed or decided to retain it
> > +                      * due to GUP or speculative references.
> > +                      */
> > +                     if (lazyfree)
> > +                             goto walk_abort;
> >  
> >                       if (flags & TTU_SPLIT_HUGE_PMD) {
> >                               /*



* Re: [PATCH v2 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-14  4:09     ` Barry Song
@ 2025-01-14  6:00       ` Barry Song
  2025-01-14  7:51         ` Baolin Wang
  2025-01-14  6:30       ` Lance Yang
  1 sibling, 1 reply; 18+ messages in thread
From: Barry Song @ 2025-01-14  6:00 UTC (permalink / raw)
  To: baolin.wang
  Cc: 21cnbao, akpm, chrisl, david, ioworker0, kasong,
	linux-arm-kernel, linux-kernel, linux-mm, linux-riscv,
	lorenzo.stoakes, ryan.roberts, v-songbaohua, x86, ying.huang,
	zhengtangquan

> > >               if (!pvmw.pte) {
> > > +                     lazyfree = folio_test_anon(folio) && !folio_test_swapbacked(folio);
> >
> > You've checked lazyfree here, so can we remove the duplicate check in
> > unmap_huge_pmd_locked()? Then the code should be:
> >
> >                 if (lazyfree && unmap_huge_pmd_locked(...))
> >                         goto walk_done;
>
>
> right. it seems unmap_huge_pmd_locked() only handles lazyfree pmd-mapped
> thp. so i guess the code could be:
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index aea49f7125f1..c4c3a7896de4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3131,11 +3131,10 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
>         VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
>         VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>         VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
> +       VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> +       VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
>
> -       if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
> -               return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> -
> -       return false;
> +       return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
>  }
>
>  static void remap_page(struct folio *folio, unsigned long nr, int flags)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 02c4e4b2cd7b..72907eb1b8fe 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1671,7 +1671,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>         DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>         pte_t pteval;
>         struct page *subpage;
> -       bool anon_exclusive, lazyfree, ret = true;
> +       bool anon_exclusive, ret = true;
>         struct mmu_notifier_range range;
>         enum ttu_flags flags = (enum ttu_flags)(long)arg;
>         int nr_pages = 1;
> @@ -1724,18 +1724,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>                 }
>
>                 if (!pvmw.pte) {
> -                       lazyfree = folio_test_anon(folio) && !folio_test_swapbacked(folio);
> -
> -                       if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> -                                                 folio))
> -                               goto walk_done;
> -                       /*
> -                        * unmap_huge_pmd_locked has either already marked
> -                        * the folio as swap-backed or decided to retain it
> -                        * due to GUP or speculative references.
> -                        */
> -                       if (lazyfree)
> +                       if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> +                               if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
> +                                       goto walk_done;
> +                               /*
> +                                * unmap_huge_pmd_locked has either already marked
> +                                * the folio as swap-backed or decided to retain it
> +                                * due to GUP or speculative references.
> +                                */
>                                 goto walk_abort;
> +                       }
>
>                         if (flags & TTU_SPLIT_HUGE_PMD) {
>                                 /*
>
> >
> > >                       if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> > >                                                 folio))
> > >                               goto walk_done;
> > > +                     /*
> > > +                      * unmap_huge_pmd_locked has either already marked
> > > +                      * the folio as swap-backed or decided to retain it
> > > +                      * due to GUP or speculative references.
> > > +                      */
> > > +                     if (lazyfree)
> > > +                             goto walk_abort;
> > > 
> > >                       if (flags & TTU_SPLIT_HUGE_PMD) {
> > >                               /*



The final diff is as follows.
Baolin, do you have any additional comments before I send out v3? 

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3d3ebdc002d5..47cc8c3f8f80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3070,8 +3070,12 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
 	int ref_count, map_count;
 	pmd_t orig_pmd = *pmdp;
 
-	if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
+	if (pmd_dirty(orig_pmd))
+		folio_set_dirty(folio);
+	if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
+		folio_set_swapbacked(folio);
 		return false;
+	}
 
 	orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
 
@@ -3098,8 +3102,15 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
 	 *
 	 * The only folio refs must be one from isolation plus the rmap(s).
 	 */
-	if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
-	    ref_count != map_count + 1) {
+	if (pmd_dirty(orig_pmd))
+		folio_set_dirty(folio);
+	if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
+		folio_set_swapbacked(folio);
+		set_pmd_at(mm, addr, pmdp, orig_pmd);
+		return false;
+	}
+
+	if (ref_count != map_count + 1) {
 		set_pmd_at(mm, addr, pmdp, orig_pmd);
 		return false;
 	}
@@ -3119,12 +3130,11 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
 {
 	VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
 	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
+	VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
 	VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
 
-	if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
-		return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
-
-	return false;
+	return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
 }
 
 static void remap_page(struct folio *folio, unsigned long nr, int flags)
diff --git a/mm/rmap.c b/mm/rmap.c
index 3ef659310797..72907eb1b8fe 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1724,9 +1724,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		}
 
 		if (!pvmw.pte) {
-			if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
-						  folio))
-				goto walk_done;
+			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
+				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
+					goto walk_done;
+				/*
+				 * unmap_huge_pmd_locked has either already marked
+				 * the folio as swap-backed or decided to retain it
+				 * due to GUP or speculative references.
+				 */
+				goto walk_abort;
+			}
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
 				/*
-- 
2.39.3 (Apple Git-146)




* Re: [PATCH v2 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one
  2025-01-14  2:55   ` Baolin Wang
@ 2025-01-14  6:05     ` Lance Yang
  0 siblings, 0 replies; 18+ messages in thread
From: Lance Yang @ 2025-01-14  6:05 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Barry Song, akpm, linux-mm, chrisl, david, kasong,
	linux-arm-kernel, linux-kernel, ryan.roberts, v-songbaohua, x86,
	linux-riscv, ying.huang, zhengtangquan, lorenzo.stoakes

On Tue, Jan 14, 2025 at 10:56 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/1/13 11:38, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > The refcount may be temporarily or long-term increased, but this does
> > not change the fundamental nature of the folio already being lazy-
> > freed. Therefore, we only reset 'swapbacked' when we are certain the
> > folio is dirty and not droppable.
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>
> LGTM. Feel free to add:
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

LGTM!
Reviewed-by: Lance Yang <ioworker0@gmail.com>

Thanks,
Lance
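
For reference, the rule quoted from the patch 1/4 commit message above
boils down to roughly the following check (a minimal sketch, not the
exact try_to_unmap_one() hunk; the helpers are the same folio/vma ones
used in the diffs elsewhere in this thread):

	/*
	 * Only a genuinely dirty folio on a non-VM_DROPPABLE mapping is
	 * turned back into a swap-backed one; a transiently elevated
	 * refcount alone no longer clears the lazyfree state.
	 */
	if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE))
		folio_set_swapbacked(folio);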



* Re: [PATCH v2 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-14  4:09     ` Barry Song
  2025-01-14  6:00       ` Barry Song
@ 2025-01-14  6:30       ` Lance Yang
  1 sibling, 0 replies; 18+ messages in thread
From: Lance Yang @ 2025-01-14  6:30 UTC (permalink / raw)
  To: Barry Song
  Cc: baolin.wang, akpm, chrisl, david, kasong, linux-arm-kernel,
	linux-kernel, linux-mm, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan

On Tue, Jan 14, 2025 at 12:09 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Jan 14, 2025 at 4:40 PM Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> >
> >
> >
> > On 2025/1/13 11:39, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > The try_to_unmap_one() function currently handles PMD-mapped THPs
> > > inefficiently. It first splits the PMD into PTEs, copies the dirty
> > > state from the PMD to the PTEs, iterates over the PTEs to locate
> > > the dirty state, and then marks the THP as swap-backed. This process
> > > involves unnecessary PMD splitting and redundant iteration. Instead,
> > > this functionality can be efficiently managed in
> > > __discard_anon_folio_pmd_locked(), avoiding the extra steps and
> > > improving performance.
> > >
> > > The following microbenchmark redirties folios after invoking MADV_FREE,
> > > then measures the time taken to perform memory reclamation (actually
> > > set those folios swapbacked again) on the redirtied folios.
> > >
> > >   #include <stdio.h>
> > >   #include <sys/mman.h>
> > >   #include <string.h>
> > >   #include <time.h>
> > >
> > >   #define SIZE 128*1024*1024  // 128 MB
> > >
> > >   int main(int argc, char *argv[])
> > >   {
> > >       while(1) {
> > >               volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
> > >                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > >
> > >               memset((void *)p, 1, SIZE);
> > >               madvise((void *)p, SIZE, MADV_FREE);
> > >               /* redirty after MADV_FREE */
> > >               memset((void *)p, 1, SIZE);
> > >
> > >               clock_t start_time = clock();
> > >               madvise((void *)p, SIZE, MADV_PAGEOUT);
> > >               clock_t end_time = clock();
> > >
> > >               double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
> > >               printf("Time taken by reclamation: %f seconds\n", elapsed_time);
> > >
> > >               munmap((void *)p, SIZE);
> > >       }
> > >       return 0;
> > >   }
> > >
> > > Testing results are as below,
> > > w/o patch:
> > > ~ # ./a.out
> > > Time taken by reclamation: 0.007300 seconds
> > > Time taken by reclamation: 0.007226 seconds
> > > Time taken by reclamation: 0.007295 seconds
> > > Time taken by reclamation: 0.007731 seconds
> > > Time taken by reclamation: 0.007134 seconds
> > > Time taken by reclamation: 0.007285 seconds
> > > Time taken by reclamation: 0.007720 seconds
> > > Time taken by reclamation: 0.007128 seconds
> > > Time taken by reclamation: 0.007710 seconds
> > > Time taken by reclamation: 0.007712 seconds
> > > Time taken by reclamation: 0.007236 seconds
> > > Time taken by reclamation: 0.007690 seconds
> > > Time taken by reclamation: 0.007174 seconds
> > > Time taken by reclamation: 0.007670 seconds
> > > Time taken by reclamation: 0.007169 seconds
> > > Time taken by reclamation: 0.007305 seconds
> > > Time taken by reclamation: 0.007432 seconds
> > > Time taken by reclamation: 0.007158 seconds
> > > Time taken by reclamation: 0.007133 seconds
> > > …
> > >
> > > w/ patch
> > >
> > > ~ # ./a.out
> > > Time taken by reclamation: 0.002124 seconds
> > > Time taken by reclamation: 0.002116 seconds
> > > Time taken by reclamation: 0.002150 seconds
> > > Time taken by reclamation: 0.002261 seconds
> > > Time taken by reclamation: 0.002137 seconds
> > > Time taken by reclamation: 0.002173 seconds
> > > Time taken by reclamation: 0.002063 seconds
> > > Time taken by reclamation: 0.002088 seconds
> > > Time taken by reclamation: 0.002169 seconds
> > > Time taken by reclamation: 0.002124 seconds
> > > Time taken by reclamation: 0.002111 seconds
> > > Time taken by reclamation: 0.002224 seconds
> > > Time taken by reclamation: 0.002297 seconds
> > > Time taken by reclamation: 0.002260 seconds
> > > Time taken by reclamation: 0.002246 seconds
> > > Time taken by reclamation: 0.002272 seconds
> > > Time taken by reclamation: 0.002277 seconds
> > > Time taken by reclamation: 0.002462 seconds
> > > …
> > >
> > > This patch significantly speeds up try_to_unmap_one() by allowing it
> > > to skip redirtied THPs without splitting the PMD.
> > >
> > > Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Suggested-by: Lance Yang <ioworker0@gmail.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >   mm/huge_memory.c | 17 ++++++++++++++---
> > >   mm/rmap.c        | 11 ++++++++++-
> > >   2 files changed, 24 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 3d3ebdc002d5..aea49f7125f1 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -3070,8 +3070,12 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> > >       int ref_count, map_count;
> > >       pmd_t orig_pmd = *pmdp;
> > >
> > > -     if (folio_test_dirty(folio) || pmd_dirty(orig_pmd))
> > > +     if (pmd_dirty(orig_pmd))
> > > +             folio_set_dirty(folio);
> > > +     if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > > +             folio_set_swapbacked(folio);
> > >               return false;
> > > +     }
> > >
> > >       orig_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
> > >
> > > @@ -3098,8 +3102,15 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
> > >        *
> > >        * The only folio refs must be one from isolation plus the rmap(s).
> > >        */
> > > -     if (folio_test_dirty(folio) || pmd_dirty(orig_pmd) ||
> > > -         ref_count != map_count + 1) {
> > > +     if (pmd_dirty(orig_pmd))
> > > +             folio_set_dirty(folio);
> > > +     if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> > > +             folio_set_swapbacked(folio);
> > > +             set_pmd_at(mm, addr, pmdp, orig_pmd);
> > > +             return false;
> > > +     }
> > > +
> > > +     if (ref_count != map_count + 1) {
> > >               set_pmd_at(mm, addr, pmdp, orig_pmd);
> > >               return false;
> > >       }
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 3ef659310797..02c4e4b2cd7b 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -1671,7 +1671,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > >       DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> > >       pte_t pteval;
> > >       struct page *subpage;
> > > -     bool anon_exclusive, ret = true;
> > > +     bool anon_exclusive, lazyfree, ret = true;
> > >       struct mmu_notifier_range range;
> > >       enum ttu_flags flags = (enum ttu_flags)(long)arg;
> > >       int nr_pages = 1;
> > > @@ -1724,9 +1724,18 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > >               }
> > >
> > >               if (!pvmw.pte) {
> > > +                     lazyfree = folio_test_anon(folio) && !folio_test_swapbacked(folio);
> >
> > You've checked lazyfree here, so can we remove the duplicate check in
> > unmap_huge_pmd_locked()? Then the code should be:
> >
> >                 if (lazyfree && unmap_huge_pmd_locked(...))
> >                         goto walk_done;
>
>
> right. it seems unmap_huge_pmd_locked() only handles lazyfree pmd-mapped
> thp. so i guess the code could be:

Yep, it would be better to remove the duplicate check in
unmap_huge_pmd_locked(), and we can extend unmap_huge_pmd_locked()
in the future if needed.

Thanks,
Lance


>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index aea49f7125f1..c4c3a7896de4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3131,11 +3131,10 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
>         VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
>         VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>         VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
> +       VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> +       VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
>
> -       if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
> -               return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
> -
> -       return false;
> +       return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
>  }
>
>  static void remap_page(struct folio *folio, unsigned long nr, int flags)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 02c4e4b2cd7b..72907eb1b8fe 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1671,7 +1671,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>         DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>         pte_t pteval;
>         struct page *subpage;
> -       bool anon_exclusive, lazyfree, ret = true;
> +       bool anon_exclusive, ret = true;
>         struct mmu_notifier_range range;
>         enum ttu_flags flags = (enum ttu_flags)(long)arg;
>         int nr_pages = 1;
> @@ -1724,18 +1724,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>                 }
>
>                 if (!pvmw.pte) {
> -                       lazyfree = folio_test_anon(folio) && !folio_test_swapbacked(folio);
> -
> -                       if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> -                                                 folio))
> -                               goto walk_done;
> -                       /*
> -                        * unmap_huge_pmd_locked has either already marked
> -                        * the folio as swap-backed or decided to retain it
> -                        * due to GUP or speculative references.
> -                        */
> -                       if (lazyfree)
> +                       if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> +                               if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
> +                                       goto walk_done;
> +                               /*
> +                                * unmap_huge_pmd_locked has either already marked
> +                                * the folio as swap-backed or decided to retain it
> +                                * due to GUP or speculative references.
> +                                */
>                                 goto walk_abort;
> +                       }
>
>                         if (flags & TTU_SPLIT_HUGE_PMD) {
>                                 /*
>
> >
> > >                       if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
> > >                                                 folio))
> > >                               goto walk_done;
> > > +                     /*
> > > +                      * unmap_huge_pmd_locked has either already marked
> > > +                      * the folio as swap-backed or decided to retain it
> > > +                      * due to GUP or speculative references.
> > > +                      */
> > > +                     if (lazyfree)
> > > +                             goto walk_abort;
> > >
> > >                       if (flags & TTU_SPLIT_HUGE_PMD) {
> > >                               /*



* Re: [PATCH v2 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap
  2025-01-14  6:00       ` Barry Song
@ 2025-01-14  7:51         ` Baolin Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Baolin Wang @ 2025-01-14  7:51 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, chrisl, david, ioworker0, kasong, linux-arm-kernel,
	linux-kernel, linux-mm, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, ying.huang, zhengtangquan



On 2025/1/14 14:00, Barry Song wrote:
>>>>                if (!pvmw.pte) {
>>>> +                     lazyfree = folio_test_anon(folio) && !folio_test_swapbacked(folio);
>>>
>>> You've checked lazyfree here, so can we remove the duplicate check in
>>> unmap_huge_pmd_locked()? Then the code should be:
>>>
>>>                  if (lazyfree && unmap_huge_pmd_locked(...))
>>>                          goto walk_done;
>>
>>
>> right. it seems unmap_huge_pmd_locked() only handles lazyfree pmd-mapped
>> thp. so i guess the code could be:
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index aea49f7125f1..c4c3a7896de4 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3131,11 +3131,10 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
>>          VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
>>          VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>>          VM_WARN_ON_ONCE(!IS_ALIGNED(addr, HPAGE_PMD_SIZE));
>> +       VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
>> +       VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
>>
>> -       if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
>> -               return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
>> -
>> -       return false;
>> +       return __discard_anon_folio_pmd_locked(vma, addr, pmdp, folio);
>>   }
>>
>>   static void remap_page(struct folio *folio, unsigned long nr, int flags)
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 02c4e4b2cd7b..72907eb1b8fe 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1671,7 +1671,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>          DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>>          pte_t pteval;
>>          struct page *subpage;
>> -       bool anon_exclusive, lazyfree, ret = true;
>> +       bool anon_exclusive, ret = true;
>>          struct mmu_notifier_range range;
>>          enum ttu_flags flags = (enum ttu_flags)(long)arg;
>>          int nr_pages = 1;
>> @@ -1724,18 +1724,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>                  }
>>
>>                  if (!pvmw.pte) {
>> -                       lazyfree = folio_test_anon(folio) && !folio_test_swapbacked(folio);
>> -
>> -                       if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
>> -                                                 folio))
>> -                               goto walk_done;
>> -                       /*
>> -                        * unmap_huge_pmd_locked has either already marked
>> -                        * the folio as swap-backed or decided to retain it
>> -                        * due to GUP or speculative references.
>> -                        */
>> -                       if (lazyfree)
>> +                       if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
>> +                               if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
>> +                                       goto walk_done;
>> +                               /*
>> +                                * unmap_huge_pmd_locked has either already marked
>> +                                * the folio as swap-backed or decided to retain it
>> +                                * due to GUP or speculative references.
>> +                                */
>>                                  goto walk_abort;
>> +                       }
>>
>>                          if (flags & TTU_SPLIT_HUGE_PMD) {
>>                                  /*
>>
>>>
>>>>                        if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd,
>>>>                                                  folio))
>>>>                                goto walk_done;
>>>> +                     /*
>>>> +                      * unmap_huge_pmd_locked has either already marked
>>>> +                      * the folio as swap-backed or decided to retain it
>>>> +                      * due to GUP or speculative references.
>>>> +                      */
>>>> +                     if (lazyfree)
>>>> +                             goto walk_abort;
>>>>
>>>>                        if (flags & TTU_SPLIT_HUGE_PMD) {
>>>>                                /*
> 
> 
> 
> The final diff is as follows.
> Baolin, do you have any additional comments before I send out v3?

No other comments. Looks good to me.



* Re: [PATCH v2 2/4] mm: Support tlbbatch flush for a range of PTEs
  2025-01-13  3:38 ` [PATCH v2 2/4] mm: Support tlbbatch flush for a range of PTEs Barry Song
  2025-01-13 16:48   ` Will Deacon
@ 2025-01-14  9:52   ` David Hildenbrand
  2025-01-14 10:37     ` Barry Song
  1 sibling, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-01-14  9:52 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: baolin.wang, chrisl, ioworker0, kasong, linux-arm-kernel,
	linux-kernel, ryan.roberts, v-songbaohua, x86, linux-riscv,
	ying.huang, zhengtangquan, lorenzo.stoakes, Catalin Marinas,
	Will Deacon, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Anshuman Khandual, Shaoqin Huang,
	Gavin Shan, Kefeng Wang, Mark Rutland, Kirill A. Shutemov,
	Yosry Ahmed, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Yicong Yang

On 13.01.25 04:38, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> This is a preparatory patch to support batch PTE unmapping in
> `try_to_unmap_one`. It first introduces range handling for
> `tlbbatch` flush. Currently, the range is always set to the size of
> PAGE_SIZE.

You could have spelled out why you perform the VMA -> MM change.

> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Shaoqin Huang <shahuang@redhat.com>
> Cc: Gavin Shan <gshan@redhat.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Lance Yang <ioworker0@gmail.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Cc: Paul Walmsley <paul.walmsley@sifive.com>
> Cc: Palmer Dabbelt <palmer@dabbelt.com>
> Cc: Albert Ou <aou@eecs.berkeley.edu>
> Cc: Yicong Yang <yangyicong@hisilicon.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>   arch/arm64/include/asm/tlbflush.h | 26 ++++++++++++++------------
>   arch/arm64/mm/contpte.c           |  2 +-
>   arch/riscv/include/asm/tlbflush.h |  3 ++-
>   arch/riscv/mm/tlbflush.c          |  3 ++-
>   arch/x86/include/asm/tlbflush.h   |  3 ++-
>   mm/rmap.c                         | 12 +++++++-----
>   6 files changed, 28 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index bc94e036a26b..f34e4fab5aa2 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -322,13 +322,6 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>   	return true;
>   }
>   
> -static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> -					     struct mm_struct *mm,
> -					     unsigned long uaddr)
> -{
> -	__flush_tlb_page_nosync(mm, uaddr);
> -}
> -
>   /*
>    * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
>    * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
> @@ -448,7 +441,7 @@ static inline bool __flush_tlb_range_limit_excess(unsigned long start,
>   	return false;
>   }
>   
> -static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
> +static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
>   				     unsigned long start, unsigned long end,
>   				     unsigned long stride, bool last_level,
>   				     int tlb_level)
> @@ -460,12 +453,12 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
>   	pages = (end - start) >> PAGE_SHIFT;
>   
>   	if (__flush_tlb_range_limit_excess(start, end, pages, stride)) {
> -		flush_tlb_mm(vma->vm_mm);
> +		flush_tlb_mm(mm);
>   		return;
>   	}
>   
>   	dsb(ishst);
> -	asid = ASID(vma->vm_mm);
> +	asid = ASID(mm);
>   
>   	if (last_level)
>   		__flush_tlb_range_op(vale1is, start, pages, stride, asid,
> @@ -474,7 +467,7 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
>   		__flush_tlb_range_op(vae1is, start, pages, stride, asid,
>   				     tlb_level, true, lpa2_is_enabled());
>   
> -	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
> +	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
>   }
>   
>   static inline void __flush_tlb_range(struct vm_area_struct *vma,
> @@ -482,7 +475,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>   				     unsigned long stride, bool last_level,
>   				     int tlb_level)
>   {
> -	__flush_tlb_range_nosync(vma, start, end, stride,
> +	__flush_tlb_range_nosync(vma->vm_mm, start, end, stride,
>   				 last_level, tlb_level);
>   	dsb(ish);
>   }
> @@ -533,6 +526,15 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
>   	dsb(ish);
>   	isb();
>   }
> +
> +static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> +					     struct mm_struct *mm,
> +					     unsigned long uaddr,
> +					     unsigned long size)
> +{
> +	__flush_tlb_range_nosync(mm, uaddr, uaddr + size,
> +				 PAGE_SIZE, true, 3);
> +}
>   #endif
>   
>   #endif
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 55107d27d3f8..bcac4f55f9c1 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -335,7 +335,7 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>   		 * eliding the trailing DSB applies here.
>   		 */
>   		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> -		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
> +		__flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
>   					 PAGE_SIZE, true, 3);
>   	}
>   
> diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
> index 72e559934952..7f3ea687ce33 100644
> --- a/arch/riscv/include/asm/tlbflush.h
> +++ b/arch/riscv/include/asm/tlbflush.h
> @@ -61,7 +61,8 @@ void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
>   bool arch_tlbbatch_should_defer(struct mm_struct *mm);
>   void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
>   			       struct mm_struct *mm,
> -			       unsigned long uaddr);
> +			       unsigned long uaddr,
> +			       unsigned long size);

While we're at it, can we just convert this to the "two tabs of
indentation starting on the second parameter line" way of doing things?
Same for all other cases (at least in core code, if some arch wants its
own weird rules on how to handle these things).
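
Purely as an illustration of that style (reusing the riscv declaration
from the diff above; not part of the patch itself), continuation lines
would simply be indented with two tabs:

	void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
			struct mm_struct *mm, unsigned long uaddr,
			unsigned long size);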

[...]

>   	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
>   	int batch;
> @@ -681,7 +682,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
>   	if (!pte_accessible(mm, pteval))
>   		return;
>   
> -	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
> +	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr, size);
>   	tlb_ubc->flush_required = true;
>   
>   	/*
> @@ -757,7 +758,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>   }
>   #else
>   static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
> -				      unsigned long uaddr)
> +				      unsigned long uaddr,
> +				      unsigned long size)

I'll note that most TLB functions seem to consume start+end instead of
start+size, like

flush_tlb_mm_range()
flush_tlb_kernel_range()

So I'm wondering if this should be start+end instead of uaddr+size.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 2/4] mm: Support tlbbatch flush for a range of PTEs
  2025-01-14  9:52   ` David Hildenbrand
@ 2025-01-14 10:37     ` Barry Song
  0 siblings, 0 replies; 18+ messages in thread
From: Barry Song @ 2025-01-14 10:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, baolin.wang, chrisl, ioworker0, kasong,
	linux-arm-kernel, linux-kernel, ryan.roberts, v-songbaohua, x86,
	linux-riscv, ying.huang, zhengtangquan, lorenzo.stoakes,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Anshuman Khandual,
	Shaoqin Huang, Gavin Shan, Kefeng Wang, Mark Rutland,
	Kirill A. Shutemov, Yosry Ahmed, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Yicong Yang

On Tue, Jan 14, 2025 at 10:52 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.01.25 04:38, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > This is a preparatory patch to support batch PTE unmapping in
> > `try_to_unmap_one`. It first introduces range handling for
> > `tlbbatch` flush. Currently, the range is always set to the size of
> > PAGE_SIZE.
>
> You could have spelled out why you perform the VMA -> MM change.

Sure, I’ll include some additional details in v3.

>
> >
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Shaoqin Huang <shahuang@redhat.com>
> > Cc: Gavin Shan <gshan@redhat.com>
> > Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Lance Yang <ioworker0@gmail.com>
> > Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Cc: Yosry Ahmed <yosryahmed@google.com>
> > Cc: Paul Walmsley <paul.walmsley@sifive.com>
> > Cc: Palmer Dabbelt <palmer@dabbelt.com>
> > Cc: Albert Ou <aou@eecs.berkeley.edu>
> > Cc: Yicong Yang <yangyicong@hisilicon.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >   arch/arm64/include/asm/tlbflush.h | 26 ++++++++++++++------------
> >   arch/arm64/mm/contpte.c           |  2 +-
> >   arch/riscv/include/asm/tlbflush.h |  3 ++-
> >   arch/riscv/mm/tlbflush.c          |  3 ++-
> >   arch/x86/include/asm/tlbflush.h   |  3 ++-
> >   mm/rmap.c                         | 12 +++++++-----
> >   6 files changed, 28 insertions(+), 21 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index bc94e036a26b..f34e4fab5aa2 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -322,13 +322,6 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >       return true;
> >   }
> >
> > -static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> > -                                          struct mm_struct *mm,
> > -                                          unsigned long uaddr)
> > -{
> > -     __flush_tlb_page_nosync(mm, uaddr);
> > -}
> > -
> >   /*
> >    * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
> >    * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
> > @@ -448,7 +441,7 @@ static inline bool __flush_tlb_range_limit_excess(unsigned long start,
> >       return false;
> >   }
> >
> > -static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
> > +static inline void __flush_tlb_range_nosync(struct mm_struct *mm,
> >                                    unsigned long start, unsigned long end,
> >                                    unsigned long stride, bool last_level,
> >                                    int tlb_level)
> > @@ -460,12 +453,12 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
> >       pages = (end - start) >> PAGE_SHIFT;
> >
> >       if (__flush_tlb_range_limit_excess(start, end, pages, stride)) {
> > -             flush_tlb_mm(vma->vm_mm);
> > +             flush_tlb_mm(mm);
> >               return;
> >       }
> >
> >       dsb(ishst);
> > -     asid = ASID(vma->vm_mm);
> > +     asid = ASID(mm);
> >
> >       if (last_level)
> >               __flush_tlb_range_op(vale1is, start, pages, stride, asid,
> > @@ -474,7 +467,7 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
> >               __flush_tlb_range_op(vae1is, start, pages, stride, asid,
> >                                    tlb_level, true, lpa2_is_enabled());
> >
> > -     mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
> > +     mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
> >   }
> >
> >   static inline void __flush_tlb_range(struct vm_area_struct *vma,
> > @@ -482,7 +475,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
> >                                    unsigned long stride, bool last_level,
> >                                    int tlb_level)
> >   {
> > -     __flush_tlb_range_nosync(vma, start, end, stride,
> > +     __flush_tlb_range_nosync(vma->vm_mm, start, end, stride,
> >                                last_level, tlb_level);
> >       dsb(ish);
> >   }
> > @@ -533,6 +526,15 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
> >       dsb(ish);
> >       isb();
> >   }
> > +
> > +static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> > +                                          struct mm_struct *mm,
> > +                                          unsigned long uaddr,
> > +                                          unsigned long size)
> > +{
> > +     __flush_tlb_range_nosync(mm, uaddr, uaddr + size,
> > +                              PAGE_SIZE, true, 3);
> > +}
> >   #endif
> >
> >   #endif
> > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > index 55107d27d3f8..bcac4f55f9c1 100644
> > --- a/arch/arm64/mm/contpte.c
> > +++ b/arch/arm64/mm/contpte.c
> > @@ -335,7 +335,7 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> >                * eliding the trailing DSB applies here.
> >                */
> >               addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> > -             __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
> > +             __flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
> >                                        PAGE_SIZE, true, 3);
> >       }
> >
> > diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
> > index 72e559934952..7f3ea687ce33 100644
> > --- a/arch/riscv/include/asm/tlbflush.h
> > +++ b/arch/riscv/include/asm/tlbflush.h
> > @@ -61,7 +61,8 @@ void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
> >   bool arch_tlbbatch_should_defer(struct mm_struct *mm);
> >   void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> >                              struct mm_struct *mm,
> > -                            unsigned long uaddr);
> > +                            unsigned long uaddr,
> > +                            unsigned long size);
>
> While we're at it, can we just convert this to the "two tabs of
> indentation starting on the second parameter line" way of doing things?
> Same for all other cases (at least in core code, if some arch wants its
> own weird rules on how to handle these things).

Sure.

>
> [...]
>
> >       struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
> >       int batch;
> > @@ -681,7 +682,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
> >       if (!pte_accessible(mm, pteval))
> >               return;
> >
> > -     arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
> > +     arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr, size);
> >       tlb_ubc->flush_required = true;
> >
> >       /*
> > @@ -757,7 +758,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
> >   }
> >   #else
> >   static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
> > -                                   unsigned long uaddr)
> > +                                   unsigned long uaddr,
> > +                                   unsigned long size)
>
> I'll note that most TLB functions seem to consume start+end instead of
> start+size, like
>
> flush_tlb_mm_range()
> flush_tlb_kernel_range()
>
> So I'm wondering if this should be start+end instead of uaddr+size.

For some reason, I can’t recall why I originally named it "uaddr" instead
of "address" at:
https://lore.kernel.org/lkml/20220707125242.425242-4-21cnbao@gmail.com/

The name seems a bit odd now.

It's probably because the arm64 functions listed below are using "uaddr":

static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 unsigned long uaddr);
static inline void flush_tlb_page(struct vm_area_struct *vma,
                                  unsigned long uaddr);

At that time, we only supported a single page for TLB deferred shootdown.

Regarding set_tlb_ubc_flush_pending(), I’m fine with either approach:
1. start, size
2. start, end

For option 1, it is eventually converted to (start, start + size) when calling
__flush_tlb_range_nosync().

But if we convert to (start, end), it seems to align with

static inline void flush_tlb_range(struct vm_area_struct *vma,
                                   unsigned long start, unsigned long end);

So if there is no objection from Will, I will move to (start, end)
in v3.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
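
If the interface does move to (start, end) in v3, a minimal sketch of
the arm64 helper from patch 2/4 would look like the following
(hypothetical shape only, mirroring the uaddr + size version in the
diff above, with continuation lines indented the way David suggested):

	static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
			struct mm_struct *mm, unsigned long start,
			unsigned long end)
	{
		__flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3);
	}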



Thread overview: 18+ messages
2025-01-13  3:38 [PATCH v2 0/4] mm: batched unmap lazyfree large folios during reclamation Barry Song
2025-01-13  3:38 ` [PATCH v2 1/4] mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one Barry Song
2025-01-13 13:19   ` David Hildenbrand
2025-01-13 13:20     ` David Hildenbrand
2025-01-13 21:56       ` Barry Song
2025-01-14  2:55   ` Baolin Wang
2025-01-14  6:05     ` Lance Yang
2025-01-13  3:38 ` [PATCH v2 2/4] mm: Support tlbbatch flush for a range of PTEs Barry Song
2025-01-13 16:48   ` Will Deacon
2025-01-14  9:52   ` David Hildenbrand
2025-01-14 10:37     ` Barry Song
2025-01-13  3:39 ` [PATCH v2 3/4] mm: Support batched unmap for lazyfree large folios during reclamation Barry Song
2025-01-13  3:39 ` [PATCH v2 4/4] mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap Barry Song
2025-01-14  3:40   ` Baolin Wang
2025-01-14  4:09     ` Barry Song
2025-01-14  6:00       ` Barry Song
2025-01-14  7:51         ` Baolin Wang
2025-01-14  6:30       ` Lance Yang
