* [PATCH v3 0/5] batched remove rmap in try_to_unmap_one()
@ 2023-03-06  9:22 Yin Fengwei
  2023-03-06  9:22 ` [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function Yin Fengwei
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Yin Fengwei @ 2023-03-06  9:22 UTC (permalink / raw)
  To: linux-mm, akpm, willy, mike.kravetz, sidhartha.kumar,
	naoya.horiguchi, jane.chu, david
  Cc: fengwei.yin

This series brings batched rmap removal to try_to_unmap_one().
Batched rmap removal is expected to give a performance gain
compared to removing the rmap one page at a time.

This series restructures try_to_unmap_one() from:
  loop:
     clear and update PTE
     unmap one page
     goto loop
to:
  loop:
     clear and update PTE
     goto loop
  unmap the range of folio in one call
This is one step toward always mapping/unmapping the entire folio
in one call, which can simplify folio mapcount handling by avoiding
per-page map/unmap operations.
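
In C terms, the restructured inner loop ends up looking roughly like
the sketch below (simplified from patch 5; the hugetlb path, error
handling and mmu notifier details are omitted):

  while (page_vma_mapped_walk(&pvmw)) {
          if (!start)
                  start = folio_page(folio,
                                  pte_pfn(*pvmw.pte) - folio_pfn(folio));

          /* clear and update the PTE of the current page */
          ret = try_to_unmap_one_page(folio, vma, range, pvmw,
                                          address, flags);
          if (!ret) {
                  folio_remove_rmap_and_update_count(folio, start,
                                                  vma, count);
                  page_vma_mapped_walk_done(&pvmw);
                  break;
          }
          count++;

          /* flush the batch if the walk will skip or end on the next pte */
          if (pvmw_walk_skip_or_end_on_next(&pvmw)) {
                  folio_remove_rmap_and_update_count(folio, start,
                                                  vma, count);
                  count = 0;
                  start = NULL;
          }
  }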


The changes are organized as follows:
Patches 1 and 2 move the hugetlb and normal page unmap paths to
dedicated functions, to make the try_to_unmap_one() logic clearer
and easier to extend with batched rmap removal. To make code review
easier, there is no functional change.

Patch 3 cleans up try_to_unmap_one_page() and removes some
duplicated function calls.

Patch 4 adds folio_remove_rmap_range(), which removes the rmap for
a range of pages in one batch.

Patch 5 makes try_to_unmap_one() do the batched rmap removal.

Testing was done with the v3 patchset in a qemu guest
with 4G of memory:
  - kernel mm selftests to trigger vmscan() and finally hit
    try_to_unmap_one().
  - Injecting hwpoison into a hugetlb page to trigger the
    try_to_unmap_one() call against hugetlb.
  - 8 hours of stress testing: Firefox + kernel mm selftests +
    kernel build.

This series is based on next-20230303.

Changes from v2:
  - General
    - Rebase the patch to next-20230303
    - Update cover letter about the preparation to unmap
      the entire folio in one call
    - No code change compared to v2, but fixed the patch-apply
      conflict caused by the wrong patch order in v2.

Changes from v1:
  - General
    - Rebase the patch to next-20230228

  - Patch1
    - Removed the if (PageHWPoison(page) && !(flags & TTU_HWPOISON))
      check as suggested by Mike Kravetz and HORIGUCHI NAOYA
    - Removed the mlock_drain_local() as suggested by Mike Kravetz
    - Removed the comments about the mm counter change as suggested
      by Mike Kravetz

Yin Fengwei (5):
  rmap: move hugetlb try_to_unmap to dedicated function
  rmap: move page unmap operation to dedicated function
  rmap: cleanup exit path of try_to_unmap_one_page()
  rmap: add folio_remove_rmap_range()
  try_to_unmap_one: batched remove rmap, update folio refcount

 include/linux/rmap.h |   5 +
 mm/page_vma_mapped.c |  30 +++
 mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
 3 files changed, 398 insertions(+), 260 deletions(-)

-- 
2.30.2




* [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function
  2023-03-06  9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
@ 2023-03-06  9:22 ` Yin Fengwei
  2023-03-08 21:38   ` Mike Kravetz
  2023-03-06  9:22 ` [PATCH v3 2/5] rmap: move page unmap operation " Yin Fengwei
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Yin Fengwei @ 2023-03-06  9:22 UTC (permalink / raw)
  To: linux-mm, akpm, willy, mike.kravetz, sidhartha.kumar,
	naoya.horiguchi, jane.chu, david
  Cc: fengwei.yin

This prepares for the batched rmap update for large folios.
There is no need to handle hugetlb inside the pte walk loop;
just handle hugetlb once and bail out early.
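
With this change, the hugetlb case in try_to_unmap_one() reduces to
the following (simplified from the diff below):

  while (page_vma_mapped_walk(&pvmw)) {
          ...
          address = pvmw.address;
          if (folio_test_hugetlb(folio)) {
                  ret = try_to_unmap_one_hugetlb(folio, vma, range,
                                                  pvmw, address, flags);

                  /* hugetlb is handled in one shot, no need to loop */
                  page_vma_mapped_walk_done(&pvmw);
                  break;
          }
          ...
  }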

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/rmap.c | 200 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 121 insertions(+), 79 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ba901c416785..508d141dacc5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1441,6 +1441,103 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 	munlock_vma_folio(folio, vma, compound);
 }
 
+static bool try_to_unmap_one_hugetlb(struct folio *folio,
+		struct vm_area_struct *vma, struct mmu_notifier_range range,
+		struct page_vma_mapped_walk pvmw, unsigned long address,
+		enum ttu_flags flags)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t pteval;
+	bool ret = true, anon = folio_test_anon(folio);
+
+	/*
+	 * The try_to_unmap() is only passed a hugetlb page
+	 * in the case where the hugetlb page is poisoned.
+	 */
+	VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
+	/*
+	 * huge_pmd_unshare may unmap an entire PMD page.
+	 * There is no way of knowing exactly which PMDs may
+	 * be cached for this mm, so we must flush them all.
+	 * start/end were already adjusted above to cover this
+	 * range.
+	 */
+	flush_cache_range(vma, range.start, range.end);
+
+	/*
+	 * To call huge_pmd_unshare, i_mmap_rwsem must be
+	 * held in write mode.  Caller needs to explicitly
+	 * do this outside rmap routines.
+	 *
+	 * We also must hold hugetlb vma_lock in write mode.
+	 * Lock order dictates acquiring vma_lock BEFORE
+	 * i_mmap_rwsem.  We can only try lock here and fail
+	 * if unsuccessful.
+	 */
+	if (!anon) {
+		VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+		if (!hugetlb_vma_trylock_write(vma)) {
+			ret = false;
+			goto out;
+		}
+		if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+			hugetlb_vma_unlock_write(vma);
+			flush_tlb_range(vma,
+					range.start, range.end);
+			mmu_notifier_invalidate_range(mm,
+					range.start, range.end);
+			/*
+			 * The ref count of the PMD page was
+			 * dropped which is part of the way map
+			 * counting is done for shared PMDs.
+			 * Return 'true' here.  When there is
+			 * no other sharing, huge_pmd_unshare
+			 * returns false and we will unmap the
+			 * actual page and drop map count
+			 * to zero.
+			 */
+			goto out;
+		}
+		hugetlb_vma_unlock_write(vma);
+	}
+	pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
+
+	/*
+	 * Now the pte is cleared. If this pte was uffd-wp armed,
+	 * we may want to replace a none pte with a marker pte if
+	 * it's file-backed, so we don't lose the tracking info.
+	 */
+	pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
+	/* Set the dirty flag on the folio now the pte is gone. */
+	if (pte_dirty(pteval))
+		folio_mark_dirty(folio);
+
+	/* Update high watermark before we lower rss */
+	update_hiwater_rss(mm);
+
+	/* Poisoned hugetlb folio with TTU_HWPOISON always cleared in flags */
+	pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
+	set_huge_pte_at(mm, address, pvmw.pte, pteval);
+	hugetlb_count_sub(folio_nr_pages(folio), mm);
+
+	/*
+	 * No need to call mmu_notifier_invalidate_range() it has be
+	 * done above for all cases requiring it to happen under page
+	 * table lock before mmu_notifier_invalidate_range_end()
+	 *
+	 * See Documentation/mm/mmu_notifier.rst
+	 */
+	page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
+	/* No VM_LOCKED set in vma->vm_flags for hugetlb. So not
+	 * necessary to call mlock_drain_local().
+	 */
+	folio_put(folio);
+
+out:
+	return ret;
+}
+
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
@@ -1504,86 +1601,37 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			break;
 		}
 
+		address = pvmw.address;
+		if (folio_test_hugetlb(folio)) {
+			ret = try_to_unmap_one_hugetlb(folio, vma, range,
+							pvmw, address, flags);
+
+			/* no need to loop for hugetlb */
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
 		subpage = folio_page(folio,
 					pte_pfn(*pvmw.pte) - folio_pfn(folio));
-		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
 				 PageAnonExclusive(subpage);
 
-		if (folio_test_hugetlb(folio)) {
-			bool anon = folio_test_anon(folio);
-
-			/*
-			 * The try_to_unmap() is only passed a hugetlb page
-			 * in the case where the hugetlb page is poisoned.
-			 */
-			VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
+		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		/* Nuke the page table entry. */
+		if (should_defer_flush(mm, flags)) {
 			/*
-			 * huge_pmd_unshare may unmap an entire PMD page.
-			 * There is no way of knowing exactly which PMDs may
-			 * be cached for this mm, so we must flush them all.
-			 * start/end were already adjusted above to cover this
-			 * range.
+			 * We clear the PTE but do not flush so potentially
+			 * a remote CPU could still be writing to the folio.
+			 * If the entry was previously clean then the
+			 * architecture must guarantee that a clear->dirty
+			 * transition on a cached TLB entry is written through
+			 * and traps if the PTE is unmapped.
 			 */
-			flush_cache_range(vma, range.start, range.end);
+			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-			/*
-			 * To call huge_pmd_unshare, i_mmap_rwsem must be
-			 * held in write mode.  Caller needs to explicitly
-			 * do this outside rmap routines.
-			 *
-			 * We also must hold hugetlb vma_lock in write mode.
-			 * Lock order dictates acquiring vma_lock BEFORE
-			 * i_mmap_rwsem.  We can only try lock here and fail
-			 * if unsuccessful.
-			 */
-			if (!anon) {
-				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
-				if (!hugetlb_vma_trylock_write(vma)) {
-					page_vma_mapped_walk_done(&pvmw);
-					ret = false;
-					break;
-				}
-				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
-					hugetlb_vma_unlock_write(vma);
-					flush_tlb_range(vma,
-						range.start, range.end);
-					mmu_notifier_invalidate_range(mm,
-						range.start, range.end);
-					/*
-					 * The ref count of the PMD page was
-					 * dropped which is part of the way map
-					 * counting is done for shared PMDs.
-					 * Return 'true' here.  When there is
-					 * no other sharing, huge_pmd_unshare
-					 * returns false and we will unmap the
-					 * actual page and drop map count
-					 * to zero.
-					 */
-					page_vma_mapped_walk_done(&pvmw);
-					break;
-				}
-				hugetlb_vma_unlock_write(vma);
-			}
-			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
+			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
 		} else {
-			flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
-			/* Nuke the page table entry. */
-			if (should_defer_flush(mm, flags)) {
-				/*
-				 * We clear the PTE but do not flush so potentially
-				 * a remote CPU could still be writing to the folio.
-				 * If the entry was previously clean then the
-				 * architecture must guarantee that a clear->dirty
-				 * transition on a cached TLB entry is written through
-				 * and traps if the PTE is unmapped.
-				 */
-				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
-
-				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
-			} else {
-				pteval = ptep_clear_flush(vma, address, pvmw.pte);
-			}
+			pteval = ptep_clear_flush(vma, address, pvmw.pte);
 		}
 
 		/*
@@ -1602,14 +1650,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
 		if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
-			if (folio_test_hugetlb(folio)) {
-				hugetlb_count_sub(folio_nr_pages(folio), mm);
-				set_huge_pte_at(mm, address, pvmw.pte, pteval);
-			} else {
-				dec_mm_counter(mm, mm_counter(&folio->page));
-				set_pte_at(mm, address, pvmw.pte, pteval);
-			}
-
+			dec_mm_counter(mm, mm_counter(&folio->page));
+			set_pte_at(mm, address, pvmw.pte, pteval);
 		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
 			/*
 			 * The guest indicated that the page content is of no
-- 
2.30.2




* [PATCH v3 2/5] rmap: move page unmap operation to dedicated function
  2023-03-06  9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
  2023-03-06  9:22 ` [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function Yin Fengwei
@ 2023-03-06  9:22 ` Yin Fengwei
  2023-03-06  9:22 ` [PATCH v3 3/5] rmap: cleanup exit path of try_to_unmap_one_page() Yin Fengwei
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Yin Fengwei @ 2023-03-06  9:22 UTC (permalink / raw)
  To: linux-mm, akpm, willy, mike.kravetz, sidhartha.kumar,
	naoya.horiguchi, jane.chu, david
  Cc: fengwei.yin

No functional change, just code reorganization.
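
After this patch, the per-pte body of the try_to_unmap_one() loop is
reduced to roughly the following (simplified from the diff below);
rmap removal and the folio reference drop still happen per page:

  subpage = folio_page(folio, pte_pfn(*pvmw.pte) - folio_pfn(folio));
  ret = try_to_unmap_one_page(folio, vma, range, pvmw, address, flags);
  if (!ret)
          break;

  page_remove_rmap(subpage, vma, folio_test_hugetlb(folio));
  if (vma->vm_flags & VM_LOCKED)
          mlock_drain_local();
  folio_put(folio);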

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/rmap.c | 369 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 194 insertions(+), 175 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 508d141dacc5..013643122d0c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1538,17 +1538,204 @@ static bool try_to_unmap_one_hugetlb(struct folio *folio,
 	return ret;
 }
 
+static bool try_to_unmap_one_page(struct folio *folio,
+		struct vm_area_struct *vma, struct mmu_notifier_range range,
+		struct page_vma_mapped_walk pvmw, unsigned long address,
+		enum ttu_flags flags)
+{
+	bool anon_exclusive, ret = true;
+	struct page *subpage;
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t pteval;
+
+	subpage = folio_page(folio,
+			pte_pfn(*pvmw.pte) - folio_pfn(folio));
+	anon_exclusive = folio_test_anon(folio) &&
+		PageAnonExclusive(subpage);
+
+	flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+	/* Nuke the page table entry. */
+	if (should_defer_flush(mm, flags)) {
+		/*
+		 * We clear the PTE but do not flush so potentially
+		 * a remote CPU could still be writing to the folio.
+		 * If the entry was previously clean then the
+		 * architecture must guarantee that a clear->dirty
+		 * transition on a cached TLB entry is written through
+		 * and traps if the PTE is unmapped.
+		 */
+		pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+
+		set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+	} else {
+		pteval = ptep_clear_flush(vma, address, pvmw.pte);
+	}
+
+	/*
+	 * Now the pte is cleared. If this pte was uffd-wp armed,
+	 * we may want to replace a none pte with a marker pte if
+	 * it's file-backed, so we don't lose the tracking info.
+	 */
+	pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
+	/* Set the dirty flag on the folio now the pte is gone. */
+	if (pte_dirty(pteval))
+		folio_mark_dirty(folio);
+
+	/* Update high watermark before we lower rss */
+	update_hiwater_rss(mm);
+
+	if (PageHWPoison(subpage) && !(flags & TTU_HWPOISON)) {
+		pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
+		dec_mm_counter(mm, mm_counter(&folio->page));
+		set_pte_at(mm, address, pvmw.pte, pteval);
+	} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
+		/*
+		 * The guest indicated that the page content is of no
+		 * interest anymore. Simply discard the pte, vmscan
+		 * will take care of the rest.
+		 * A future reference will then fault in a new zero
+		 * page. When userfaultfd is active, we must not drop
+		 * this page though, as its main user (postcopy
+		 * migration) will not expect userfaults on already
+		 * copied pages.
+		 */
+		dec_mm_counter(mm, mm_counter(&folio->page));
+		/* We have to invalidate as we cleared the pte */
+		mmu_notifier_invalidate_range(mm, address,
+				address + PAGE_SIZE);
+	} else if (folio_test_anon(folio)) {
+		swp_entry_t entry = { .val = page_private(subpage) };
+		pte_t swp_pte;
+		/*
+		 * Store the swap location in the pte.
+		 * See handle_pte_fault() ...
+		 */
+		if (unlikely(folio_test_swapbacked(folio) !=
+					folio_test_swapcache(folio))) {
+			WARN_ON_ONCE(1);
+			ret = false;
+			/* We have to invalidate as we cleared the pte */
+			mmu_notifier_invalidate_range(mm, address,
+					address + PAGE_SIZE);
+			page_vma_mapped_walk_done(&pvmw);
+			goto discard;
+		}
+
+		/* MADV_FREE page check */
+		if (!folio_test_swapbacked(folio)) {
+			int ref_count, map_count;
+
+			/*
+			 * Synchronize with gup_pte_range():
+			 * - clear PTE; barrier; read refcount
+			 * - inc refcount; barrier; read PTE
+			 */
+			smp_mb();
+
+			ref_count = folio_ref_count(folio);
+			map_count = folio_mapcount(folio);
+
+			/*
+			 * Order reads for page refcount and dirty flag
+			 * (see comments in __remove_mapping()).
+			 */
+			smp_rmb();
+
+			/*
+			 * The only page refs must be one from isolation
+			 * plus the rmap(s) (dropped by discard:).
+			 */
+			if (ref_count == 1 + map_count &&
+					!folio_test_dirty(folio)) {
+				/* Invalidate as we cleared the pte */
+				mmu_notifier_invalidate_range(mm,
+						address, address + PAGE_SIZE);
+				dec_mm_counter(mm, MM_ANONPAGES);
+				goto discard;
+			}
+
+			/*
+			 * If the folio was redirtied, it cannot be
+			 * discarded. Remap the page to page table.
+			 */
+			set_pte_at(mm, address, pvmw.pte, pteval);
+			folio_set_swapbacked(folio);
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			goto discard;
+		}
+
+		if (swap_duplicate(entry) < 0) {
+			set_pte_at(mm, address, pvmw.pte, pteval);
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			goto discard;
+		}
+		if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+			swap_free(entry);
+			set_pte_at(mm, address, pvmw.pte, pteval);
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			goto discard;
+		}
+
+		/* See page_try_share_anon_rmap(): clear PTE first. */
+		if (anon_exclusive &&
+				page_try_share_anon_rmap(subpage)) {
+			swap_free(entry);
+			set_pte_at(mm, address, pvmw.pte, pteval);
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			goto discard;
+		}
+		if (list_empty(&mm->mmlist)) {
+			spin_lock(&mmlist_lock);
+			if (list_empty(&mm->mmlist))
+				list_add(&mm->mmlist, &init_mm.mmlist);
+			spin_unlock(&mmlist_lock);
+		}
+		dec_mm_counter(mm, MM_ANONPAGES);
+		inc_mm_counter(mm, MM_SWAPENTS);
+		swp_pte = swp_entry_to_pte(entry);
+		if (anon_exclusive)
+			swp_pte = pte_swp_mkexclusive(swp_pte);
+		if (pte_soft_dirty(pteval))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		if (pte_uffd_wp(pteval))
+			swp_pte = pte_swp_mkuffd_wp(swp_pte);
+		set_pte_at(mm, address, pvmw.pte, swp_pte);
+		/* Invalidate as we cleared the pte */
+		mmu_notifier_invalidate_range(mm, address,
+				address + PAGE_SIZE);
+	} else {
+		/*
+		 * This is a locked file-backed folio,
+		 * so it cannot be removed from the page
+		 * cache and replaced by a new folio before
+		 * mmu_notifier_invalidate_range_end, so no
+		 * concurrent thread might update its page table
+		 * to point at a new folio while a device is
+		 * still using this folio.
+		 *
+		 * See Documentation/mm/mmu_notifier.rst
+		 */
+		dec_mm_counter(mm, mm_counter_file(&folio->page));
+	}
+
+discard:
+	return ret;
+}
+
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
 static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		     unsigned long address, void *arg)
 {
-	struct mm_struct *mm = vma->vm_mm;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
-	pte_t pteval;
 	struct page *subpage;
-	bool anon_exclusive, ret = true;
+	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 
@@ -1613,179 +1800,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
 		subpage = folio_page(folio,
 					pte_pfn(*pvmw.pte) - folio_pfn(folio));
-		anon_exclusive = folio_test_anon(folio) &&
-				 PageAnonExclusive(subpage);
-
-		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
-		/* Nuke the page table entry. */
-		if (should_defer_flush(mm, flags)) {
-			/*
-			 * We clear the PTE but do not flush so potentially
-			 * a remote CPU could still be writing to the folio.
-			 * If the entry was previously clean then the
-			 * architecture must guarantee that a clear->dirty
-			 * transition on a cached TLB entry is written through
-			 * and traps if the PTE is unmapped.
-			 */
-			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
-
-			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
-		} else {
-			pteval = ptep_clear_flush(vma, address, pvmw.pte);
-		}
-
-		/*
-		 * Now the pte is cleared. If this pte was uffd-wp armed,
-		 * we may want to replace a none pte with a marker pte if
-		 * it's file-backed, so we don't lose the tracking info.
-		 */
-		pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
-
-		/* Set the dirty flag on the folio now the pte is gone. */
-		if (pte_dirty(pteval))
-			folio_mark_dirty(folio);
-
-		/* Update high watermark before we lower rss */
-		update_hiwater_rss(mm);
-
-		if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
-			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
-			dec_mm_counter(mm, mm_counter(&folio->page));
-			set_pte_at(mm, address, pvmw.pte, pteval);
-		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
-			/*
-			 * The guest indicated that the page content is of no
-			 * interest anymore. Simply discard the pte, vmscan
-			 * will take care of the rest.
-			 * A future reference will then fault in a new zero
-			 * page. When userfaultfd is active, we must not drop
-			 * this page though, as its main user (postcopy
-			 * migration) will not expect userfaults on already
-			 * copied pages.
-			 */
-			dec_mm_counter(mm, mm_counter(&folio->page));
-			/* We have to invalidate as we cleared the pte */
-			mmu_notifier_invalidate_range(mm, address,
-						      address + PAGE_SIZE);
-		} else if (folio_test_anon(folio)) {
-			swp_entry_t entry = { .val = page_private(subpage) };
-			pte_t swp_pte;
-			/*
-			 * Store the swap location in the pte.
-			 * See handle_pte_fault() ...
-			 */
-			if (unlikely(folio_test_swapbacked(folio) !=
-					folio_test_swapcache(folio))) {
-				WARN_ON_ONCE(1);
-				ret = false;
-				/* We have to invalidate as we cleared the pte */
-				mmu_notifier_invalidate_range(mm, address,
-							address + PAGE_SIZE);
-				page_vma_mapped_walk_done(&pvmw);
-				break;
-			}
-
-			/* MADV_FREE page check */
-			if (!folio_test_swapbacked(folio)) {
-				int ref_count, map_count;
-
-				/*
-				 * Synchronize with gup_pte_range():
-				 * - clear PTE; barrier; read refcount
-				 * - inc refcount; barrier; read PTE
-				 */
-				smp_mb();
-
-				ref_count = folio_ref_count(folio);
-				map_count = folio_mapcount(folio);
-
-				/*
-				 * Order reads for page refcount and dirty flag
-				 * (see comments in __remove_mapping()).
-				 */
-				smp_rmb();
-
-				/*
-				 * The only page refs must be one from isolation
-				 * plus the rmap(s) (dropped by discard:).
-				 */
-				if (ref_count == 1 + map_count &&
-				    !folio_test_dirty(folio)) {
-					/* Invalidate as we cleared the pte */
-					mmu_notifier_invalidate_range(mm,
-						address, address + PAGE_SIZE);
-					dec_mm_counter(mm, MM_ANONPAGES);
-					goto discard;
-				}
-
-				/*
-				 * If the folio was redirtied, it cannot be
-				 * discarded. Remap the page to page table.
-				 */
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				folio_set_swapbacked(folio);
-				ret = false;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
-			}
-
-			if (swap_duplicate(entry) < 0) {
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				ret = false;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
-			}
-			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
-				swap_free(entry);
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				ret = false;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
-			}
+		ret = try_to_unmap_one_page(folio, vma,
+						range, pvmw, address, flags);
+		if (!ret)
+			break;
 
-			/* See page_try_share_anon_rmap(): clear PTE first. */
-			if (anon_exclusive &&
-			    page_try_share_anon_rmap(subpage)) {
-				swap_free(entry);
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				ret = false;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
-			}
-			if (list_empty(&mm->mmlist)) {
-				spin_lock(&mmlist_lock);
-				if (list_empty(&mm->mmlist))
-					list_add(&mm->mmlist, &init_mm.mmlist);
-				spin_unlock(&mmlist_lock);
-			}
-			dec_mm_counter(mm, MM_ANONPAGES);
-			inc_mm_counter(mm, MM_SWAPENTS);
-			swp_pte = swp_entry_to_pte(entry);
-			if (anon_exclusive)
-				swp_pte = pte_swp_mkexclusive(swp_pte);
-			if (pte_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			set_pte_at(mm, address, pvmw.pte, swp_pte);
-			/* Invalidate as we cleared the pte */
-			mmu_notifier_invalidate_range(mm, address,
-						      address + PAGE_SIZE);
-		} else {
-			/*
-			 * This is a locked file-backed folio,
-			 * so it cannot be removed from the page
-			 * cache and replaced by a new folio before
-			 * mmu_notifier_invalidate_range_end, so no
-			 * concurrent thread might update its page table
-			 * to point at a new folio while a device is
-			 * still using this folio.
-			 *
-			 * See Documentation/mm/mmu_notifier.rst
-			 */
-			dec_mm_counter(mm, mm_counter_file(&folio->page));
-		}
-discard:
 		/*
 		 * No need to call mmu_notifier_invalidate_range() it has be
 		 * done above for all cases requiring it to happen under page
-- 
2.30.2




* [PATCH v3 3/5] rmap: cleanup exit path of try_to_unmap_one_page()
  2023-03-06  9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
  2023-03-06  9:22 ` [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function Yin Fengwei
  2023-03-06  9:22 ` [PATCH v3 2/5] rmap: move page unmap operation " Yin Fengwei
@ 2023-03-06  9:22 ` Yin Fengwei
  2023-03-06  9:22 ` [PATCH v3 4/5] rmap: add folio_remove_rmap_range() Yin Fengwei
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Yin Fengwei @ 2023-03-06  9:22 UTC (permalink / raw)
  To: linux-mm, akpm, willy, mike.kravetz, sidhartha.kumar,
	naoya.horiguchi, jane.chu, david
  Cc: fengwei.yin

Clean up the exit path of try_to_unmap_one_page() by removing
some duplicated code.
Move page_vma_mapped_walk_done() back to try_to_unmap_one().
Rename subpage to page, as a folio has no concept of a subpage.
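
The exit paths of try_to_unmap_one_page() then collapse to the
following (condensed from the diff below):

  discard:
          dec_mm_counter(vma->vm_mm, mm_counter(&folio->page));
          return true;

  exit_restore_pte:
          set_pte_at(mm, address, pvmw.pte, pteval);
  exit:
          return false;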

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/rmap.c | 74 ++++++++++++++++++++++---------------------------------
 1 file changed, 30 insertions(+), 44 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 013643122d0c..836cfc13cf9d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1528,7 +1528,7 @@ static bool try_to_unmap_one_hugetlb(struct folio *folio,
 	 *
 	 * See Documentation/mm/mmu_notifier.rst
 	 */
-	page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
+	page_remove_rmap(&folio->page, vma, true);
 	/* No VM_LOCKED set in vma->vm_flags for hugetlb. So not
 	 * necessary to call mlock_drain_local().
 	 */
@@ -1543,15 +1543,13 @@ static bool try_to_unmap_one_page(struct folio *folio,
 		struct page_vma_mapped_walk pvmw, unsigned long address,
 		enum ttu_flags flags)
 {
-	bool anon_exclusive, ret = true;
-	struct page *subpage;
+	bool anon_exclusive;
+	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t pteval;
 
-	subpage = folio_page(folio,
-			pte_pfn(*pvmw.pte) - folio_pfn(folio));
-	anon_exclusive = folio_test_anon(folio) &&
-		PageAnonExclusive(subpage);
+	page = folio_page(folio, pte_pfn(*pvmw.pte) - folio_pfn(folio));
+	anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
 
 	flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
 	/* Nuke the page table entry. */
@@ -1579,15 +1577,14 @@ static bool try_to_unmap_one_page(struct folio *folio,
 	pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
 
 	/* Set the dirty flag on the folio now the pte is gone. */
-	if (pte_dirty(pteval))
+	if (pte_dirty(pteval) && !folio_test_dirty(folio))
 		folio_mark_dirty(folio);
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
 
-	if (PageHWPoison(subpage) && !(flags & TTU_HWPOISON)) {
-		pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
-		dec_mm_counter(mm, mm_counter(&folio->page));
+	if (PageHWPoison(page) && !(flags & TTU_HWPOISON)) {
+		pteval = swp_entry_to_pte(make_hwpoison_entry(page));
 		set_pte_at(mm, address, pvmw.pte, pteval);
 	} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
 		/*
@@ -1600,12 +1597,11 @@ static bool try_to_unmap_one_page(struct folio *folio,
 		 * migration) will not expect userfaults on already
 		 * copied pages.
 		 */
-		dec_mm_counter(mm, mm_counter(&folio->page));
 		/* We have to invalidate as we cleared the pte */
 		mmu_notifier_invalidate_range(mm, address,
 				address + PAGE_SIZE);
 	} else if (folio_test_anon(folio)) {
-		swp_entry_t entry = { .val = page_private(subpage) };
+		swp_entry_t entry = { .val = page_private(page) };
 		pte_t swp_pte;
 		/*
 		 * Store the swap location in the pte.
@@ -1614,12 +1610,10 @@ static bool try_to_unmap_one_page(struct folio *folio,
 		if (unlikely(folio_test_swapbacked(folio) !=
 					folio_test_swapcache(folio))) {
 			WARN_ON_ONCE(1);
-			ret = false;
 			/* We have to invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
 					address + PAGE_SIZE);
-			page_vma_mapped_walk_done(&pvmw);
-			goto discard;
+			goto exit;
 		}
 
 		/* MADV_FREE page check */
@@ -1651,7 +1645,6 @@ static bool try_to_unmap_one_page(struct folio *folio,
 				/* Invalidate as we cleared the pte */
 				mmu_notifier_invalidate_range(mm,
 						address, address + PAGE_SIZE);
-				dec_mm_counter(mm, MM_ANONPAGES);
 				goto discard;
 			}
 
@@ -1659,43 +1652,30 @@ static bool try_to_unmap_one_page(struct folio *folio,
 			 * If the folio was redirtied, it cannot be
 			 * discarded. Remap the page to page table.
 			 */
-			set_pte_at(mm, address, pvmw.pte, pteval);
 			folio_set_swapbacked(folio);
-			ret = false;
-			page_vma_mapped_walk_done(&pvmw);
-			goto discard;
+			goto exit_restore_pte;
 		}
 
-		if (swap_duplicate(entry) < 0) {
-			set_pte_at(mm, address, pvmw.pte, pteval);
-			ret = false;
-			page_vma_mapped_walk_done(&pvmw);
-			goto discard;
-		}
+		if (swap_duplicate(entry) < 0)
+			goto exit_restore_pte;
+
 		if (arch_unmap_one(mm, vma, address, pteval) < 0) {
 			swap_free(entry);
-			set_pte_at(mm, address, pvmw.pte, pteval);
-			ret = false;
-			page_vma_mapped_walk_done(&pvmw);
-			goto discard;
+			goto exit_restore_pte;
 		}
 
 		/* See page_try_share_anon_rmap(): clear PTE first. */
-		if (anon_exclusive &&
-				page_try_share_anon_rmap(subpage)) {
+		if (anon_exclusive && page_try_share_anon_rmap(page)) {
 			swap_free(entry);
-			set_pte_at(mm, address, pvmw.pte, pteval);
-			ret = false;
-			page_vma_mapped_walk_done(&pvmw);
-			goto discard;
+			goto exit_restore_pte;
 		}
+
 		if (list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
 			if (list_empty(&mm->mmlist))
 				list_add(&mm->mmlist, &init_mm.mmlist);
 			spin_unlock(&mmlist_lock);
 		}
-		dec_mm_counter(mm, MM_ANONPAGES);
 		inc_mm_counter(mm, MM_SWAPENTS);
 		swp_pte = swp_entry_to_pte(entry);
 		if (anon_exclusive)
@@ -1706,8 +1686,7 @@ static bool try_to_unmap_one_page(struct folio *folio,
 			swp_pte = pte_swp_mkuffd_wp(swp_pte);
 		set_pte_at(mm, address, pvmw.pte, swp_pte);
 		/* Invalidate as we cleared the pte */
-		mmu_notifier_invalidate_range(mm, address,
-				address + PAGE_SIZE);
+		mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE);
 	} else {
 		/*
 		 * This is a locked file-backed folio,
@@ -1720,11 +1699,16 @@ static bool try_to_unmap_one_page(struct folio *folio,
 		 *
 		 * See Documentation/mm/mmu_notifier.rst
 		 */
-		dec_mm_counter(mm, mm_counter_file(&folio->page));
 	}
 
 discard:
-	return ret;
+	dec_mm_counter(vma->vm_mm, mm_counter(&folio->page));
+	return true;
+
+exit_restore_pte:
+	set_pte_at(mm, address, pvmw.pte, pteval);
+exit:
+	return false;
 }
 
 /*
@@ -1802,8 +1786,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 					pte_pfn(*pvmw.pte) - folio_pfn(folio));
 		ret = try_to_unmap_one_page(folio, vma,
 						range, pvmw, address, flags);
-		if (!ret)
+		if (!ret) {
+			page_vma_mapped_walk_done(&pvmw);
 			break;
+		}
 
 		/*
 		 * No need to call mmu_notifier_invalidate_range() it has be
@@ -1812,7 +1798,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/mm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, vma, folio_test_hugetlb(folio));
+		page_remove_rmap(subpage, vma, false);
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_drain_local();
 		folio_put(folio);
-- 
2.30.2




* [PATCH v3 4/5] rmap: add folio_remove_rmap_range()
  2023-03-06  9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
                   ` (2 preceding siblings ...)
  2023-03-06  9:22 ` [PATCH v3 3/5] rmap: cleanup exit path of try_to_unmap_one_page() Yin Fengwei
@ 2023-03-06  9:22 ` Yin Fengwei
  2023-03-06  9:22 ` [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount Yin Fengwei
  2023-03-06 21:12 ` [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Andrew Morton
  5 siblings, 0 replies; 13+ messages in thread
From: Yin Fengwei @ 2023-03-06  9:22 UTC (permalink / raw)
  To: linux-mm, akpm, willy, mike.kravetz, sidhartha.kumar,
	naoya.horiguchi, jane.chu, david
  Cc: fengwei.yin

folio_remove_rmap_range() allows taking down the pte mappings of
a specific range of pages in a folio. Compared to
page_remove_rmap(), it batches the __lruvec_stat updates for
large folios.
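
With the new interface, page_remove_rmap() becomes a thin wrapper
(condensed from the diff below):

  void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
                  bool compound)
  {
          struct folio *folio = page_folio(page);
          unsigned int nr_pages = compound ? folio_nr_pages(folio) : 1;

          folio_remove_rmap_range(folio, page, nr_pages, vma, compound);
  }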

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/rmap.h |  4 +++
 mm/rmap.c            | 58 +++++++++++++++++++++++++++++++++-----------
 2 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b87d01660412..d2569b42e21a 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -200,6 +200,10 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
+void folio_remove_rmap_range(struct folio *, struct page *,
+			unsigned int nr_pages, struct vm_area_struct *,
+			bool compound);
+
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address, rmap_t flags);
diff --git a/mm/rmap.c b/mm/rmap.c
index 836cfc13cf9d..bb3fcb8df579 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1355,23 +1355,25 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
 }
 
 /**
- * page_remove_rmap - take down pte mapping from a page
- * @page:	page to remove mapping from
+ * folio_remove_rmap_range - take down pte mapping from a range of pages
+ * @folio:	folio to remove mapping from
+ * @page:	The first page to take down pte mapping
+ * @nr_pages:	The number of pages which will be take down pte mapping
  * @vma:	the vm area from which the mapping is removed
  * @compound:	uncharge the page as compound or small page
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
-		bool compound)
+void folio_remove_rmap_range(struct folio *folio, struct page *page,
+			unsigned int nr_pages, struct vm_area_struct *vma,
+			bool compound)
 {
-	struct folio *folio = page_folio(page);
 	atomic_t *mapped = &folio->_nr_pages_mapped;
-	int nr = 0, nr_pmdmapped = 0;
-	bool last;
+	int nr = 0, nr_pmdmapped = 0, last;
 	enum node_stat_item idx;
 
-	VM_BUG_ON_PAGE(compound && !PageHead(page), page);
+	VM_BUG_ON_FOLIO(compound && (nr_pages != folio_nr_pages(folio)), folio);
+	VM_BUG_ON_FOLIO(compound && (page != &folio->page), folio);
 
 	/* Hugetlb pages are not counted in NR_*MAPPED */
 	if (unlikely(folio_test_hugetlb(folio))) {
@@ -1382,12 +1384,16 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 
 	/* Is page being unmapped by PTE? Is this its last map to be removed? */
 	if (likely(!compound)) {
-		last = atomic_add_negative(-1, &page->_mapcount);
-		nr = last;
-		if (last && folio_test_large(folio)) {
-			nr = atomic_dec_return_relaxed(mapped);
-			nr = (nr < COMPOUND_MAPPED);
-		}
+		do {
+			last = atomic_add_negative(-1, &page->_mapcount);
+			if (last && folio_test_large(folio)) {
+				last = atomic_dec_return_relaxed(mapped);
+				last = (last < COMPOUND_MAPPED);
+			}
+
+			if (last)
+				nr++;
+		} while (page++, --nr_pages > 0);
 	} else if (folio_test_pmd_mappable(folio)) {
 		/* That test is redundant: it's for safety or to optimize out */
 
@@ -1441,6 +1447,30 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 	munlock_vma_folio(folio, vma, compound);
 }
 
+/**
+ * page_remove_rmap - take down pte mapping from a page
+ * @page:	page to remove mapping from
+ * @vma:	the vm area from which the mapping is removed
+ * @compound:	uncharge the page as compound or small page
+ *
+ * The caller needs to hold the pte lock.
+ */
+void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
+		bool compound)
+{
+	struct folio *folio = page_folio(page);
+	unsigned int nr_pages;
+
+	VM_BUG_ON_FOLIO(compound && (page != &folio->page), folio);
+
+	if (likely(!compound))
+		nr_pages = 1;
+	else
+		nr_pages = folio_nr_pages(folio);
+
+	folio_remove_rmap_range(folio, page, nr_pages, vma, compound);
+}
+
 static bool try_to_unmap_one_hugetlb(struct folio *folio,
 		struct vm_area_struct *vma, struct mmu_notifier_range range,
 		struct page_vma_mapped_walk pvmw, unsigned long address,
-- 
2.30.2




* [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount
  2023-03-06  9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
                   ` (3 preceding siblings ...)
  2023-03-06  9:22 ` [PATCH v3 4/5] rmap: add folio_remove_rmap_range() Yin Fengwei
@ 2023-03-06  9:22 ` Yin Fengwei
  2023-03-06 12:39   ` haoxin
  2023-03-06 21:12 ` [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Andrew Morton
  5 siblings, 1 reply; 13+ messages in thread
From: Yin Fengwei @ 2023-03-06  9:22 UTC (permalink / raw)
  To: linux-mm, akpm, willy, mike.kravetz, sidhartha.kumar,
	naoya.horiguchi, jane.chu, david
  Cc: fengwei.yin

If unmapping one page fails, or the vma walk will skip the next
pte, or the vma walk will end at the next pte, do the batched
rmap removal and folio refcount update.
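
The batched tear-down itself is done by a small helper (condensed
from the diff below):

  static void folio_remove_rmap_and_update_count(struct folio *folio,
                  struct page *start, struct vm_area_struct *vma, int count)
  {
          if (count == 0)
                  return;

          folio_remove_rmap_range(folio, start, count, vma,
                                  folio_test_hugetlb(folio));
          if (vma->vm_flags & VM_LOCKED)
                  mlock_drain_local();
          folio_ref_sub(folio, count);
  }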

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/rmap.h |  1 +
 mm/page_vma_mapped.c | 30 +++++++++++++++++++++++++++
 mm/rmap.c            | 48 ++++++++++++++++++++++++++++++++++----------
 3 files changed, 68 insertions(+), 11 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d2569b42e21a..18193d1d5a8e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -424,6 +424,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
 }
 
 bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
+bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk *pvmw);
 
 /*
  * Used by swapoff to help locate where page is expected in vma.
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 4e448cfbc6ef..19e997dfb5c6 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -291,6 +291,36 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	return false;
 }
 
+/**
+ * pvmw_walk_skip_or_end_on_next - check if next pte will be skipped or
+ *                                 end the walk
+ * @pvmw: pointer to struct page_vma_mapped_walk.
+ *
+ * This function can only be called with correct pte lock hold
+ */
+bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk *pvmw)
+{
+       unsigned long address = pvmw->address + PAGE_SIZE;
+
+       if (address >= vma_address_end(pvmw))
+               return true;
+
+       if ((address & (PMD_SIZE - PAGE_SIZE)) == 0)
+               return true;
+
+       if (pte_none(*pvmw->pte))
+               return true;
+
+       pvmw->pte++;
+       if (!check_pte(pvmw)) {
+               pvmw->pte--;
+               return true;
+       }
+       pvmw->pte--;
+
+       return false;
+}
+
 /**
  * page_mapped_in_vma - check whether a page is really mapped in a VMA
  * @page: the page to test
diff --git a/mm/rmap.c b/mm/rmap.c
index bb3fcb8df579..a64e9cbb52dd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1741,6 +1741,26 @@ static bool try_to_unmap_one_page(struct folio *folio,
 	return false;
 }
 
+static void folio_remove_rmap_and_update_count(struct folio *folio,
+		struct page *start, struct vm_area_struct *vma, int count)
+{
+	if (count == 0)
+		return;
+
+	/*
+	 * No need to call mmu_notifier_invalidate_range() it has be
+	 * done above for all cases requiring it to happen under page
+	 * table lock before mmu_notifier_invalidate_range_end()
+	 *
+	 * See Documentation/mm/mmu_notifier.rst
+	 */
+	folio_remove_rmap_range(folio, start, count, vma,
+					folio_test_hugetlb(folio));
+	if (vma->vm_flags & VM_LOCKED)
+		mlock_drain_local();
+	folio_ref_sub(folio, count);
+}
+
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
@@ -1748,10 +1768,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		     unsigned long address, void *arg)
 {
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
-	struct page *subpage;
+	struct page *start = NULL;
 	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+	int count = 0;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -1812,26 +1833,31 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			break;
 		}
 
-		subpage = folio_page(folio,
+		if (!start)
+			start = folio_page(folio,
 					pte_pfn(*pvmw.pte) - folio_pfn(folio));
 		ret = try_to_unmap_one_page(folio, vma,
 						range, pvmw, address, flags);
 		if (!ret) {
+			folio_remove_rmap_and_update_count(folio,
+							start, vma, count);
 			page_vma_mapped_walk_done(&pvmw);
 			break;
 		}
+		count++;
 
 		/*
-		 * No need to call mmu_notifier_invalidate_range() it has be
-		 * done above for all cases requiring it to happen under page
-		 * table lock before mmu_notifier_invalidate_range_end()
-		 *
-		 * See Documentation/mm/mmu_notifier.rst
+		 * If next pte will be skipped in page_vma_mapped_walk() or
+		 * the walk will end at it, batched remove rmap and update
+		 * page refcount. We can't do it after page_vma_mapped_walk()
+		 * return false because the pte lock will not be hold.
 		 */
-		page_remove_rmap(subpage, vma, false);
-		if (vma->vm_flags & VM_LOCKED)
-			mlock_drain_local();
-		folio_put(folio);
+		if (pvmw_walk_skip_or_end_on_next(&pvmw)) {
+			folio_remove_rmap_and_update_count(folio,
+							start, vma, count);
+			count = 0;
+			start = NULL;
+		}
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
-- 
2.30.2




* Re: [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount
  2023-03-06  9:22 ` [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount Yin Fengwei
@ 2023-03-06 12:39   ` haoxin
  2023-03-07  2:45     ` Yin, Fengwei
  0 siblings, 1 reply; 13+ messages in thread
From: haoxin @ 2023-03-06 12:39 UTC (permalink / raw)
  To: Yin Fengwei, linux-mm, akpm, willy, mike.kravetz,
	sidhartha.kumar, naoya.horiguchi, jane.chu, david


On 2023/3/6 17:22, Yin Fengwei wrote:
> If unmap one page fails, or the vma walk will skip next pte,
> or the vma walk will end on next pte, batched remove map,
> update folio refcount.
>
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
>   include/linux/rmap.h |  1 +
>   mm/page_vma_mapped.c | 30 +++++++++++++++++++++++++++
>   mm/rmap.c            | 48 ++++++++++++++++++++++++++++++++++----------
>   3 files changed, 68 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index d2569b42e21a..18193d1d5a8e 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -424,6 +424,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>   }
>   
>   bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
> +bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk *pvmw);
>   
>   /*
>    * Used by swapoff to help locate where page is expected in vma.
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 4e448cfbc6ef..19e997dfb5c6 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -291,6 +291,36 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>   	return false;
>   }
>   
> +/**
> + * pvmw_walk_skip_or_end_on_next - check if next pte will be skipped or
> + *                                 end the walk
> + * @pvmw: pointer to struct page_vma_mapped_walk.
> + *
> + * This function can only be called with correct pte lock hold
> + */
> +bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk *pvmw)
> +{
> +       unsigned long address = pvmw->address + PAGE_SIZE;
> +
> +       if (address >= vma_address_end(pvmw))
> +               return true;

If vma_address_end() is exactly equal to the next address
(pvmw->address + PAGE_SIZE), does this mean we skip unmapping the
last page here?

If so, could this just use '>': 'if (address > vma_address_end(pvmw))'?

I may have misunderstood, please correct me.

> +
> +       if ((address & (PMD_SIZE - PAGE_SIZE)) == 0)
> +               return true;
> +
> +       if (pte_none(*pvmw->pte))
> +               return true;
> +
> +       pvmw->pte++;
> +       if (!check_pte(pvmw)) {
> +               pvmw->pte--;
> +               return true;
> +       }
> +       pvmw->pte--;
> +
> +       return false;
> +}
> +
>   /**
>    * page_mapped_in_vma - check whether a page is really mapped in a VMA
>    * @page: the page to test
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bb3fcb8df579..a64e9cbb52dd 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1741,6 +1741,26 @@ static bool try_to_unmap_one_page(struct folio *folio,
>   	return false;
>   }
>   
> +static void folio_remove_rmap_and_update_count(struct folio *folio,
> +		struct page *start, struct vm_area_struct *vma, int count)
> +{
> +	if (count == 0)
> +		return;
> +
> +	/*
> +	 * No need to call mmu_notifier_invalidate_range() it has be
> +	 * done above for all cases requiring it to happen under page
> +	 * table lock before mmu_notifier_invalidate_range_end()
> +	 *
> +	 * See Documentation/mm/mmu_notifier.rst
> +	 */
> +	folio_remove_rmap_range(folio, start, count, vma,
> +					folio_test_hugetlb(folio));
> +	if (vma->vm_flags & VM_LOCKED)
> +		mlock_drain_local();
> +	folio_ref_sub(folio, count);
> +}
> +
>   /*
>    * @arg: enum ttu_flags will be passed to this argument
>    */
> @@ -1748,10 +1768,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   		     unsigned long address, void *arg)
>   {
>   	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> -	struct page *subpage;
> +	struct page *start = NULL;
>   	bool ret = true;
>   	struct mmu_notifier_range range;
>   	enum ttu_flags flags = (enum ttu_flags)(long)arg;
> +	int count = 0;
>   
>   	/*
>   	 * When racing against e.g. zap_pte_range() on another cpu,
> @@ -1812,26 +1833,31 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   			break;
>   		}
>   
> -		subpage = folio_page(folio,
> +		if (!start)
> +			start = folio_page(folio,
>   					pte_pfn(*pvmw.pte) - folio_pfn(folio));
>   		ret = try_to_unmap_one_page(folio, vma,
>   						range, pvmw, address, flags);
>   		if (!ret) {
> +			folio_remove_rmap_and_update_count(folio,
> +							start, vma, count);
>   			page_vma_mapped_walk_done(&pvmw);
>   			break;
>   		}
> +		count++;
>   
>   		/*
> -		 * No need to call mmu_notifier_invalidate_range() it has be
> -		 * done above for all cases requiring it to happen under page
> -		 * table lock before mmu_notifier_invalidate_range_end()
> -		 *
> -		 * See Documentation/mm/mmu_notifier.rst
> +		 * If next pte will be skipped in page_vma_mapped_walk() or
> +		 * the walk will end at it, batched remove rmap and update
> +		 * page refcount. We can't do it after page_vma_mapped_walk()
> +		 * return false because the pte lock will not be hold.
>   		 */
> -		page_remove_rmap(subpage, vma, false);
> -		if (vma->vm_flags & VM_LOCKED)
> -			mlock_drain_local();
> -		folio_put(folio);
> +		if (pvmw_walk_skip_or_end_on_next(&pvmw)) {
> +			folio_remove_rmap_and_update_count(folio,
> +							start, vma, count);
> +			count = 0;
> +			start = NULL;
> +		}
>   	}
>   
>   	mmu_notifier_invalidate_range_end(&range);



* Re: [PATCH v3 0/5] batched remove rmap in try_to_unmap_one()
  2023-03-06  9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
                   ` (4 preceding siblings ...)
  2023-03-06  9:22 ` [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount Yin Fengwei
@ 2023-03-06 21:12 ` Andrew Morton
  2023-03-07  2:44   ` Yin, Fengwei
  2023-03-09 13:56   ` Yin, Fengwei
  5 siblings, 2 replies; 13+ messages in thread
From: Andrew Morton @ 2023-03-06 21:12 UTC (permalink / raw)
  To: Yin Fengwei
  Cc: linux-mm, willy, mike.kravetz, sidhartha.kumar, naoya.horiguchi,
	jane.chu, david

On Mon,  6 Mar 2023 17:22:54 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:

> This series is trying to bring the batched rmap removing to
> try_to_unmap_one(). It's expected that the batched rmap
> removing bring performance gain than remove rmap per page.
> 
> ...
>
>  include/linux/rmap.h |   5 +
>  mm/page_vma_mapped.c |  30 +++
>  mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
>  3 files changed, 398 insertions(+), 260 deletions(-)

As was discussed in v2's review, if no performance benefit has been
demonstrated, why make this change?




* Re: [PATCH v3 0/5] batched remove rmap in try_to_unmap_one()
  2023-03-06 21:12 ` [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Andrew Morton
@ 2023-03-07  2:44   ` Yin, Fengwei
  2023-03-09 13:56   ` Yin, Fengwei
  1 sibling, 0 replies; 13+ messages in thread
From: Yin, Fengwei @ 2023-03-07  2:44 UTC (permalink / raw)
  To: akpm
  Cc: david, linux-mm, willy, mike.kravetz, sidhartha.kumar,
	naoya.horiguchi, chu, jane

On Mon, 2023-03-06 at 13:12 -0800, Andrew Morton wrote:
> On Mon,  6 Mar 2023 17:22:54 +0800 Yin Fengwei
> <fengwei.yin@intel.com> wrote:
> 
> > This series is trying to bring the batched rmap removing to
> > try_to_unmap_one(). It's expected that the batched rmap
> > removing bring performance gain than remove rmap per page.
> > 
> > ...
> > 
> >  include/linux/rmap.h |   5 +
> >  mm/page_vma_mapped.c |  30 +++
> >  mm/rmap.c            | 623 +++++++++++++++++++++++++--------------
> > ----
> >  3 files changed, 398 insertions(+), 260 deletions(-)
> 
> As was discussed in v2's review, if no performance benefit has been
> demonstrated, why make this change?
OK. Let me check whether there is a way to demonstrate the
performance gain. Thanks.


Regards
Yin, Fengwei

> 



* Re: [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount
  2023-03-06 12:39   ` haoxin
@ 2023-03-07  2:45     ` Yin, Fengwei
  0 siblings, 0 replies; 13+ messages in thread
From: Yin, Fengwei @ 2023-03-07  2:45 UTC (permalink / raw)
  To: mike.kravetz, chu, jane, akpm, linux-mm, sidhartha.kumar, xhao,
	naoya.horiguchi, david, willy

On Mon, 2023-03-06 at 20:39 +0800, haoxin wrote:
> 
> 在 2023/3/6 下午5:22, Yin Fengwei 写道:
> > If unmap one page fails, or the vma walk will skip next pte,
> > or the vma walk will end on next pte, batched remove map,
> > update folio refcount.
> > 
> > Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> > ---
> >   include/linux/rmap.h |  1 +
> >   mm/page_vma_mapped.c | 30 +++++++++++++++++++++++++++
> >   mm/rmap.c            | 48 ++++++++++++++++++++++++++++++++++-----
> > -----
> >   3 files changed, 68 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index d2569b42e21a..18193d1d5a8e 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -424,6 +424,7 @@ static inline void
> > page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> >   }
> >   
> >   bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
> > +bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk
> > *pvmw);
> >   
> >   /*
> >    * Used by swapoff to help locate where page is expected in vma.
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index 4e448cfbc6ef..19e997dfb5c6 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -291,6 +291,36 @@ bool page_vma_mapped_walk(struct
> > page_vma_mapped_walk *pvmw)
> >         return false;
> >   }
> >   
> > +/**
> > + * pvmw_walk_skip_or_end_on_next - check if next pte will be
> > skipped or
> > + *                                 end the walk
> > + * @pvmw: pointer to struct page_vma_mapped_walk.
> > + *
> > + * This function can only be called with correct pte lock hold
> > + */
> > +bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk
> > *pvmw)
> > +{
> > +       unsigned long address = pvmw->address + PAGE_SIZE;
> > +
> > +       if (address >= vma_address_end(pvmw))
> > +               return true;
> 
> If vma_address_end is exactly equal to next address(pvmw->address + 
> PAGE_SIZE) , does this mean that we are ignored to unmap the last
> page 
> here ? so
> 
> there can just use ' > '      ' if (address > vma_address_end(pvmw))'
> .
This check is done after the current PTE has already been handled,
so when the next address equals vma_address_end() there is no
further PTE left to unmap and '>=' is correct. Thanks.


Regards
Yin, Fengwei

> 
> I may have misunderstood, please correct me.
> 
> > +
> > +       if ((address & (PMD_SIZE - PAGE_SIZE)) == 0)
> > +               return true;
> > +
> > +       if (pte_none(*pvmw->pte))
> > +               return true;
> > +
> > +       pvmw->pte++;
> > +       if (!check_pte(pvmw)) {
> > +               pvmw->pte--;
> > +               return true;
> > +       }
> > +       pvmw->pte--;
> > +
> > +       return false;
> > +}
> > +
> >   /**
> >    * page_mapped_in_vma - check whether a page is really mapped in
> > a VMA
> >    * @page: the page to test
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index bb3fcb8df579..a64e9cbb52dd 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1741,6 +1741,26 @@ static bool try_to_unmap_one_page(struct folio *folio,
> >         return false;
> >   }
> >   
> > +static void folio_remove_rmap_and_update_count(struct folio *folio,
> > +               struct page *start, struct vm_area_struct *vma, int count)
> > +{
> > +       if (count == 0)
> > +               return;
> > +
> > +       /*
> > +        * No need to call mmu_notifier_invalidate_range() it has be
> > +        * done above for all cases requiring it to happen under page
> > +        * table lock before mmu_notifier_invalidate_range_end()
> > +        *
> > +        * See Documentation/mm/mmu_notifier.rst
> > +        */
> > +       folio_remove_rmap_range(folio, start, count, vma,
> > +                                       folio_test_hugetlb(folio));
> > +       if (vma->vm_flags & VM_LOCKED)
> > +               mlock_drain_local();
> > +       folio_ref_sub(folio, count);
> > +}
> > +
> >   /*
> >    * @arg: enum ttu_flags will be passed to this argument
> >    */
> > @@ -1748,10 +1768,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >                      unsigned long address, void *arg)
> >   {
> >         DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> > -       struct page *subpage;
> > +       struct page *start = NULL;
> >         bool ret = true;
> >         struct mmu_notifier_range range;
> >         enum ttu_flags flags = (enum ttu_flags)(long)arg;
> > +       int count = 0;
> >   
> >         /*
> >          * When racing against e.g. zap_pte_range() on another cpu,
> > @@ -1812,26 +1833,31 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >                         break;
> >                 }
> >   
> > -               subpage = folio_page(folio,
> > +               if (!start)
> > +                       start = folio_page(folio,
> >                                         pte_pfn(*pvmw.pte) - folio_pfn(folio));
> >                 ret = try_to_unmap_one_page(folio, vma,
> >                                                 range, pvmw, address, flags);
> >                 if (!ret) {
> > +                       folio_remove_rmap_and_update_count(folio,
> > +                                                       start, vma, count);
> >                         page_vma_mapped_walk_done(&pvmw);
> >                         break;
> >                 }
> > +               count++;
> >   
> >                 /*
> > -                * No need to call mmu_notifier_invalidate_range() it has be
> > -                * done above for all cases requiring it to happen under page
> > -                * table lock before mmu_notifier_invalidate_range_end()
> > -                *
> > -                * See Documentation/mm/mmu_notifier.rst
> > +                * If the next pte will be skipped in page_vma_mapped_walk() or
> > +                * the walk will end at it, batch-remove the rmap and update
> > +                * the page refcount. We can't do it after page_vma_mapped_walk()
> > +                * returns false because the pte lock will no longer be held.
> >                  */
> > -               page_remove_rmap(subpage, vma, false);
> > -               if (vma->vm_flags & VM_LOCKED)
> > -                       mlock_drain_local();
> > -               folio_put(folio);
> > +               if (pvmw_walk_skip_or_end_on_next(&pvmw)) {
> > +                       folio_remove_rmap_and_update_count(folio,
> > +                                                       start, vma, count);
> > +                       count = 0;
> > +                       start = NULL;
> > +               }
> >         }
> >   
> >         mmu_notifier_invalidate_range_end(&range);


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function
  2023-03-06  9:22 ` [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function Yin Fengwei
@ 2023-03-08 21:38   ` Mike Kravetz
  2023-03-09  5:13     ` Yin, Fengwei
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Kravetz @ 2023-03-08 21:38 UTC (permalink / raw)
  To: Yin Fengwei
  Cc: linux-mm, akpm, willy, sidhartha.kumar, naoya.horiguchi, jane.chu, david

On 03/06/23 17:22, Yin Fengwei wrote:
> This is to prepare for the batched rmap update for large folios.
> There is no need to handle hugetlb inside the loop; just handle
> hugetlb separately and bail out early.
> 
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
>  mm/rmap.c | 200 +++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 121 insertions(+), 79 deletions(-)

Looks good,

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

A few nits below.

> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ba901c416785..508d141dacc5 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1441,6 +1441,103 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>  	munlock_vma_folio(folio, vma, compound);
>  }
>  
> +static bool try_to_unmap_one_hugetlb(struct folio *folio,
> +		struct vm_area_struct *vma, struct mmu_notifier_range range,
> +		struct page_vma_mapped_walk pvmw, unsigned long address,
> +		enum ttu_flags flags)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pte_t pteval;
> +	bool ret = true, anon = folio_test_anon(folio);
> +
> +	/*
> +	 * The try_to_unmap() is only passed a hugetlb page
> +	 * in the case where the hugetlb page is poisoned.
> +	 */
> +	VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
> +	/*
> +	 * huge_pmd_unshare may unmap an entire PMD page.
> +	 * There is no way of knowing exactly which PMDs may
> +	 * be cached for this mm, so we must flush them all.
> +	 * start/end were already adjusted above to cover this

nit, start/end are adjusted in the caller (try_to_unmap_one), not above.

> +	 * range.
> +	 */
> +	flush_cache_range(vma, range.start, range.end);
> +
> +	/*
> +	 * To call huge_pmd_unshare, i_mmap_rwsem must be
> +	 * held in write mode.  Caller needs to explicitly
> +	 * do this outside rmap routines.
> +	 *
> +	 * We also must hold hugetlb vma_lock in write mode.
> +	 * Lock order dictates acquiring vma_lock BEFORE
> +	 * i_mmap_rwsem.  We can only try lock here and fail
> +	 * if unsuccessful.
> +	 */
> +	if (!anon) {
> +		VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> +		if (!hugetlb_vma_trylock_write(vma)) {
> +			ret = false;
> +			goto out;
> +		}
> +		if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> +			hugetlb_vma_unlock_write(vma);
> +			flush_tlb_range(vma,
> +					range.start, range.end);
> +			mmu_notifier_invalidate_range(mm,
> +					range.start, range.end);
> +			/*
> +			 * The ref count of the PMD page was
> +			 * dropped which is part of the way map
> +			 * counting is done for shared PMDs.
> +			 * Return 'true' here.  When there is
> +			 * no other sharing, huge_pmd_unshare
> +			 * returns false and we will unmap the
> +			 * actual page and drop map count
> +			 * to zero.
> +			 */
> +			goto out;
> +		}
> +		hugetlb_vma_unlock_write(vma);
> +	}
> +	pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> +
> +	/*
> +	 * Now the pte is cleared. If this pte was uffd-wp armed,
> +	 * we may want to replace a none pte with a marker pte if
> +	 * it's file-backed, so we don't lose the tracking info.
> +	 */
> +	pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> +
> +	/* Set the dirty flag on the folio now the pte is gone. */
> +	if (pte_dirty(pteval))

nit, technically, I suppose this should be huge_pte_dirty but it really is
the same as pte_dirty which is why it works in current code.
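
i.e. something like this (just sketching what the nit suggests, not a tested
change):

	/* Set the dirty flag on the folio now the pte is gone. */
	if (huge_pte_dirty(pteval))
		folio_mark_dirty(folio);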

> +		folio_mark_dirty(folio);
> +
> +	/* Update high watermark before we lower rss */
> +	update_hiwater_rss(mm);
> +
> +	/* Poisoned hugetlb folio with TTU_HWPOISON always cleared in flags */
> +	pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
> +	set_huge_pte_at(mm, address, pvmw.pte, pteval);
> +	hugetlb_count_sub(folio_nr_pages(folio), mm);
> +
> +	/*
> +	 * No need to call mmu_notifier_invalidate_range() it has be
> +	 * done above for all cases requiring it to happen under page
> +	 * table lock before mmu_notifier_invalidate_range_end()
> +	 *
> +	 * See Documentation/mm/mmu_notifier.rst
> +	 */
> +	page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));

nit, we KNOW folio_test_hugetlb(folio) is true here so can just pass
'true'.  In addition, the same call in try_to_unmap_one is now known to
always be false.  No need to check folio_test_hugetlb(folio) there as
well.
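
i.e. (sketch of the suggested change):

	/* in try_to_unmap_one_hugetlb(), hugetlb is known to be true: */
	page_remove_rmap(&folio->page, vma, true);

	/* in try_to_unmap_one(), hugetlb folios never reach this path: */
	page_remove_rmap(subpage, vma, false);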

-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function
  2023-03-08 21:38   ` Mike Kravetz
@ 2023-03-09  5:13     ` Yin, Fengwei
  0 siblings, 0 replies; 13+ messages in thread
From: Yin, Fengwei @ 2023-03-09  5:13 UTC (permalink / raw)
  To: mike.kravetz
  Cc: david, linux-mm, willy, naoya.horiguchi, sidhartha.kumar, akpm,
	chu, jane

On Wed, 2023-03-08 at 13:38 -0800, Mike Kravetz wrote:
> On 03/06/23 17:22, Yin Fengwei wrote:
> > This is to prepare for the batched rmap update for large folios.
> > There is no need to handle hugetlb inside the loop; just handle
> > hugetlb separately and bail out early.
> > 
> > Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> > ---
> >  mm/rmap.c | 200 +++++++++++++++++++++++++++++++++---------------------
> >  1 file changed, 121 insertions(+), 79 deletions(-)
> 
> Looks good,
> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Thanks a lot for reviewing.

> 
> A few nits below.
> 
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index ba901c416785..508d141dacc5 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1441,6 +1441,103 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
> >         munlock_vma_folio(folio, vma, compound);
> >  }
> >  
> > +static bool try_to_unmap_one_hugetlb(struct folio *folio,
> > +               struct vm_area_struct *vma, struct mmu_notifier_range range,
> > +               struct page_vma_mapped_walk pvmw, unsigned long address,
> > +               enum ttu_flags flags)
> > +{
> > +       struct mm_struct *mm = vma->vm_mm;
> > +       pte_t pteval;
> > +       bool ret = true, anon = folio_test_anon(folio);
> > +
> > +       /*
> > +        * The try_to_unmap() is only passed a hugetlb page
> > +        * in the case where the hugetlb page is poisoned.
> > +        */
> > +       VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
> > +       /*
> > +        * huge_pmd_unshare may unmap an entire PMD page.
> > +        * There is no way of knowing exactly which PMDs may
> > +        * be cached for this mm, so we must flush them all.
> > +        * start/end were already adjusted above to cover this
> 
> nit, start/end are adjusted in the caller (try_to_unmap_one), not above.
Yes. Will update the comment.

> 
> > +        * range.
> > +        */
> > +       flush_cache_range(vma, range.start, range.end);
> > +
> > +       /*
> > +        * To call huge_pmd_unshare, i_mmap_rwsem must be
> > +        * held in write mode.  Caller needs to explicitly
> > +        * do this outside rmap routines.
> > +        *
> > +        * We also must hold hugetlb vma_lock in write mode.
> > +        * Lock order dictates acquiring vma_lock BEFORE
> > +        * i_mmap_rwsem.  We can only try lock here and fail
> > +        * if unsuccessful.
> > +        */
> > +       if (!anon) {
> > +               VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> > +               if (!hugetlb_vma_trylock_write(vma)) {
> > +                       ret = false;
> > +                       goto out;
> > +               }
> > +               if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> > +                       hugetlb_vma_unlock_write(vma);
> > +                       flush_tlb_range(vma,
> > +                                       range.start, range.end);
> > +                       mmu_notifier_invalidate_range(mm,
> > +                                       range.start, range.end);
> > +                       /*
> > +                        * The ref count of the PMD page was
> > +                        * dropped which is part of the way map
> > +                        * counting is done for shared PMDs.
> > +                        * Return 'true' here.  When there is
> > +                        * no other sharing, huge_pmd_unshare
> > +                        * returns false and we will unmap the
> > +                        * actual page and drop map count
> > +                        * to zero.
> > +                        */
> > +                       goto out;
> > +               }
> > +               hugetlb_vma_unlock_write(vma);
> > +       }
> > +       pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> > +
> > +       /*
> > +        * Now the pte is cleared. If this pte was uffd-wp armed,
> > +        * we may want to replace a none pte with a marker pte if
> > +        * it's file-backed, so we don't lose the tracking info.
> > +        */
> > +       pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> > +
> > +       /* Set the dirty flag on the folio now the pte is gone. */
> > +       if (pte_dirty(pteval))
> 
> nit, technically, I suppose this should be huge_pte_dirty but it really is
> the same as pte_dirty which is why it works in current code.
Yes. Will update to use huge_pte_dirty().

> 
> > +               folio_mark_dirty(folio);
> > +
> > +       /* Update high watermark before we lower rss */
> > +       update_hiwater_rss(mm);
> > +
> > +       /* Poisoned hugetlb folio with TTU_HWPOISON always cleared in flags */
> > +       pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
> > +       set_huge_pte_at(mm, address, pvmw.pte, pteval);
> > +       hugetlb_count_sub(folio_nr_pages(folio), mm);
> > +
> > +       /*
> > +        * No need to call mmu_notifier_invalidate_range() it has
> > be
> > +        * done above for all cases requiring it to happen under
> > page
> > +        * table lock before mmu_notifier_invalidate_range_end()
> > +        *
> > +        * See Documentation/mm/mmu_notifier.rst
> > +        */
> > +       page_remove_rmap(&folio->page, vma,
> > folio_test_hugetlb(folio));
> 
> nit, we KNOW folio_test_hugetlb(folio) is true here so can just pass
> 'true'.  In addition, the same call in try_to_unmap_one is now known to
> always be false.  No need to check folio_test_hugetlb(folio) there as
> well.
The "folio_test_hugetlb(folio) -> true" changes was in patch 3. I tried
to apply "one patch for code moving and one patch for code change)" to
make review easy. But this patch already changed the code, I will move
"folio_test_hugetlb(folio)->true" from patch 3 to this one. Thanks.


Regards
Yin, Fengwei

> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 0/5] batched remove rmap in try_to_unmap_one()
  2023-03-06 21:12 ` [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Andrew Morton
  2023-03-07  2:44   ` Yin, Fengwei
@ 2023-03-09 13:56   ` Yin, Fengwei
  1 sibling, 0 replies; 13+ messages in thread
From: Yin, Fengwei @ 2023-03-09 13:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, willy, mike.kravetz, sidhartha.kumar, naoya.horiguchi,
	jane.chu, david


On 3/7/2023 5:12 AM, Andrew Morton wrote:
> On Mon,  6 Mar 2023 17:22:54 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
> 
>> This series is trying to bring the batched rmap removing to
>> try_to_unmap_one(). It's expected that the batched rmap
>> removing bring performance gain than remove rmap per page.
>>
>> ...
>>
>>  include/linux/rmap.h |   5 +
>>  mm/page_vma_mapped.c |  30 +++
>>  mm/rmap.c            | 623 +++++++++++++++++++++++++------------------
>>  3 files changed, 398 insertions(+), 260 deletions(-)
> 
> As was discussed in v2's review, if no performance benefit has been
> demonstrated, why make this change?
> 
I changed MADV_PAGEOUT not to split large folios for the page cache and
created a micro-benchmark whose core loop is as follows (a more
self-contained sketch follows the test setup below):

        char *c = mmap(NULL, FILESIZE, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE, fd, 0);
        count = 0;
        while (1) {
                unsigned long i;

                for (i = 0; i < FILESIZE; i += pgsize) {
                        cc = *(volatile char *)(c + i);
                }
                madvise(c, FILESIZE, MADV_PAGEOUT);
                count++;
        }
        munmap(c, FILESIZE);

I ran it with 96 instances against 96 files for 1 second. The test platform
was an Ice Lake server with 48C/96T and 192G of memory.
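
For reference, a more self-contained sketch of the loop above. FILESIZE, the
file name "testfile" and the single instance are assumptions here; the real
test pre-created 96 files and ran 96 such instances, stopping them after 1
second:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* assumed size; the backing file must be at least this large */
#define FILESIZE (256UL << 20)

int main(void)
{
        unsigned long pgsize = sysconf(_SC_PAGESIZE);
        unsigned long count = 0;
        volatile char cc;
        char *c;
        int fd;

        fd = open("testfile", O_RDWR);          /* assumed file name */
        if (fd < 0) {
                perror("open");
                return 1;
        }

        c = mmap(NULL, FILESIZE, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE, fd, 0);
        if (c == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        while (1) {
                unsigned long i;

                /* fault in every page, then ask the kernel to reclaim the range */
                for (i = 0; i < FILESIZE; i += pgsize)
                        cc = *(volatile char *)(c + i);
                madvise(c, FILESIZE, MADV_PAGEOUT);
                count++;        /* completed iterations are the reported score */
        }

        /* not reached: the harness kills the process after 1 second */
        munmap(c, FILESIZE);
        close(fd);
        return 0;
}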

The test result (the iteration count) improved by 10% with this patch
series, and perf shows the following:

Before the patch:
--19.97%--try_to_unmap_one
          |          
          |--12.35%--page_remove_rmap
          |          |          
          |           --11.39%--__mod_lruvec_page_state
          |                     |          
          |                     |--1.51%--__mod_memcg_lruvec_state
          |                     |          |          
          |                     |           --0.91%--cgroup_rstat_updated
          |                     |          
          |                      --0.70%--__mod_lruvec_state
          |                                |          
          |                                 --0.63%--__mod_node_page_state
          |          
          |--5.41%--ptep_clear_flush
          |          |          
          |           --4.65%--flush_tlb_mm_range
          |                     |          
          |                      --3.83%--flush_tlb_func
          |                                |          
          |                                 --3.51%--native_flush_tlb_one_user
          |          
          |--0.75%--percpu_counter_add_batch
          |          
           --0.55%--PageHeadHuge

After the patch:
--9.50%--try_to_unmap_one
          |          
          |--6.94%--try_to_unmap_one_page.constprop.0.isra.0
          |          |          
          |          |--5.07%--ptep_clear_flush
          |          |          |          
          |          |           --4.25%--flush_tlb_mm_range
          |          |                     |          
          |          |                      --3.44%--flush_tlb_func
          |          |                                |          
          |          |                                 --3.05%--native_flush_tlb_one_user
          |          |          
          |           --0.80%--percpu_counter_add_batch
          |          
          |--1.22%--folio_remove_rmap_and_update_count.part.0
          |          |          
          |           --1.16%--folio_remove_rmap_range
          |                     |          
          |                      --0.62%--__mod_lruvec_page_state
          |          
           --0.56%--PageHeadHuge

As expected, the cost of __mod_lruvec_page_state is reduced a lot (from 11.39%
to 0.62% of cycles) by the batched folio_remove_rmap_range(), and
try_to_unmap_one() as a whole drops from 19.97% to 9.50%.

I believe the same benefit applies to the page reclaim path as well. Thanks.


Regards
Yin, Fengwei


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-03-09 13:58 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-06  9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
2023-03-06  9:22 ` [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function Yin Fengwei
2023-03-08 21:38   ` Mike Kravetz
2023-03-09  5:13     ` Yin, Fengwei
2023-03-06  9:22 ` [PATCH v3 2/5] rmap: move page unmap operation " Yin Fengwei
2023-03-06  9:22 ` [PATCH v3 3/5] rmap: cleanup exit path of try_to_unmap_one_page() Yin Fengwei
2023-03-06  9:22 ` [PATCH v3 4/5] rmap:addd folio_remove_rmap_range() Yin Fengwei
2023-03-06  9:22 ` [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount Yin Fengwei
2023-03-06 12:39   ` haoxin
2023-03-07  2:45     ` Yin, Fengwei
2023-03-06 21:12 ` [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Andrew Morton
2023-03-07  2:44   ` Yin, Fengwei
2023-03-09 13:56   ` Yin, Fengwei
