* [PATCH v3 0/5] batched remove rmap in try_to_unmap_one()
@ 2023-03-06 9:22 Yin Fengwei
2023-03-06 9:22 ` [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function Yin Fengwei
` (5 more replies)
0 siblings, 6 replies; 13+ messages in thread
From: Yin Fengwei @ 2023-03-06 9:22 UTC (permalink / raw)
To: linux-mm, akpm, willy, mike.kravetz, sidhartha.kumar,
naoya.horiguchi, jane.chu, david
Cc: fengwei.yin
This series brings batched rmap removal to
try_to_unmap_one(). Batched rmap removal is expected to
give a performance gain over removing the rmap page by page.
This series restructures try_to_unmap_one() from:
loop:
clear and update PTE
unmap one page
goto loop
to:
loop:
clear and update PTE
goto loop
unmap the range of folio in one call
It is one step towards always mapping/unmapping the entire folio
in one call, which can simplify the folio mapcount handling by
avoiding per-page map/unmap handling.
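With the series applied, the core of the loop looks roughly like
the sketch below (simplified from patch 5, error handling omitted;
folio_remove_rmap_and_update_count() wraps folio_remove_rmap_range()
plus the mlock drain and the folio refcount update):

	while (page_vma_mapped_walk(&pvmw)) {
		if (!start)
			start = folio_page(folio,
				pte_pfn(*pvmw.pte) - folio_pfn(folio));

		/* clear and update the PTE; rmap is not removed here */
		ret = try_to_unmap_one_page(folio, vma, range, pvmw,
					    address, flags);
		count++;

		/* batch the rmap removal when the walk will skip or end */
		if (pvmw_walk_skip_or_end_on_next(&pvmw)) {
			folio_remove_rmap_and_update_count(folio, start,
							   vma, count);
			count = 0;
			start = NULL;
		}
	}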
The changes are organized as follows:
Patches 1 and 2 move the hugetlb and normal page unmap to dedicated
functions to make the try_to_unmap_one() logic clearer and to make
it easier to add batched rmap removal. To make code review easier,
there is no functional change.
Patch 3 cleans up try_to_unmap_one_page() and removes some
duplicated function calls.
Patch 4 adds folio_remove_rmap_range(), which removes the rmap in
batches (prototype sketched below).
Patch 5 makes try_to_unmap_one() remove the rmap in batches.
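For reference, the prototype of the new helper added in patch 4;
the existing page_remove_rmap() becomes a wrapper that passes a
one-page range for the non-compound case:

	void folio_remove_rmap_range(struct folio *folio, struct page *page,
			unsigned int nr_pages, struct vm_area_struct *vma,
			bool compound);

	/* non-compound single-page case, as wired up in page_remove_rmap() */
	folio_remove_rmap_range(folio, page, 1, vma, false);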
Testing was done with the V3 patchset in a qemu guest
with 4G memory:
- kernel mm selftests to trigger vmscan() and finally hit
  try_to_unmap_one().
- Inject hwpoison into a hugetlb page to trigger the
  try_to_unmap_one() call against hugetlb.
- 8 hours of stress testing: Firefox + kernel mm selftests + kernel
  build.
This series is based on next-20230303.
Changes from v2:
- General
- Rebase the patch to next-20230303
- Update the cover letter about the preparation for unmapping
  the entire folio in one call
- No code change compared to V2, but fix the patch apply
  conflict caused by the wrong patch order in V2.
Changes from v1:
- General
- Rebase the patch to next-20230228
- Patch1
- Removed the if (PageHWPoison(page) && !(flags & TTU_HWPOISON)) check
  as suggested by Mike Kravetz and HORIGUCHI NAOYA
- Removed the mlock_drain_local() as suggested by Mike Kravetz
- Removed the comments about the mm counter change as suggested
  by Mike Kravetz
Yin Fengwei (5):
rmap: move hugetlb try_to_unmap to dedicated function
rmap: move page unmap operation to dedicated function
rmap: cleanup exit path of try_to_unmap_one_page()
rmap: add folio_remove_rmap_range()
try_to_unmap_one: batched remove rmap, update folio refcount
include/linux/rmap.h | 5 +
mm/page_vma_mapped.c | 30 +++
mm/rmap.c | 623 +++++++++++++++++++++++++------------------
3 files changed, 398 insertions(+), 260 deletions(-)
--
2.30.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function
2023-03-06 9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
@ 2023-03-06 9:22 ` Yin Fengwei
2023-03-08 21:38 ` Mike Kravetz
2023-03-06 9:22 ` [PATCH v3 2/5] rmap: move page unmap operation " Yin Fengwei
` (4 subsequent siblings)
5 siblings, 1 reply; 13+ messages in thread
From: Yin Fengwei @ 2023-03-06 9:22 UTC (permalink / raw)
To: linux-mm, akpm, willy, mike.kravetz, sidhartha.kumar,
naoya.horiguchi, jane.chu, david
Cc: fengwei.yin
This prepares for the batched rmap update for large folios.
There is no need to handle hugetlb inside the loop; just handle
hugetlb once and bail out early.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
mm/rmap.c | 200 +++++++++++++++++++++++++++++++++---------------------
1 file changed, 121 insertions(+), 79 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index ba901c416785..508d141dacc5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1441,6 +1441,103 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
munlock_vma_folio(folio, vma, compound);
}
+static bool try_to_unmap_one_hugetlb(struct folio *folio,
+ struct vm_area_struct *vma, struct mmu_notifier_range range,
+ struct page_vma_mapped_walk pvmw, unsigned long address,
+ enum ttu_flags flags)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t pteval;
+ bool ret = true, anon = folio_test_anon(folio);
+
+ /*
+ * The try_to_unmap() is only passed a hugetlb page
+ * in the case where the hugetlb page is poisoned.
+ */
+ VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
+ /*
+ * huge_pmd_unshare may unmap an entire PMD page.
+ * There is no way of knowing exactly which PMDs may
+ * be cached for this mm, so we must flush them all.
+ * start/end were already adjusted above to cover this
+ * range.
+ */
+ flush_cache_range(vma, range.start, range.end);
+
+ /*
+ * To call huge_pmd_unshare, i_mmap_rwsem must be
+ * held in write mode. Caller needs to explicitly
+ * do this outside rmap routines.
+ *
+ * We also must hold hugetlb vma_lock in write mode.
+ * Lock order dictates acquiring vma_lock BEFORE
+ * i_mmap_rwsem. We can only try lock here and fail
+ * if unsuccessful.
+ */
+ if (!anon) {
+ VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+ if (!hugetlb_vma_trylock_write(vma)) {
+ ret = false;
+ goto out;
+ }
+ if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+ hugetlb_vma_unlock_write(vma);
+ flush_tlb_range(vma,
+ range.start, range.end);
+ mmu_notifier_invalidate_range(mm,
+ range.start, range.end);
+ /*
+ * The ref count of the PMD page was
+ * dropped which is part of the way map
+ * counting is done for shared PMDs.
+ * Return 'true' here. When there is
+ * no other sharing, huge_pmd_unshare
+ * returns false and we will unmap the
+ * actual page and drop map count
+ * to zero.
+ */
+ goto out;
+ }
+ hugetlb_vma_unlock_write(vma);
+ }
+ pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
+
+ /*
+ * Now the pte is cleared. If this pte was uffd-wp armed,
+ * we may want to replace a none pte with a marker pte if
+ * it's file-backed, so we don't lose the tracking info.
+ */
+ pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
+ /* Set the dirty flag on the folio now the pte is gone. */
+ if (pte_dirty(pteval))
+ folio_mark_dirty(folio);
+
+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
+ /* Poisoned hugetlb folio with TTU_HWPOISON always cleared in flags */
+ pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
+ set_huge_pte_at(mm, address, pvmw.pte, pteval);
+ hugetlb_count_sub(folio_nr_pages(folio), mm);
+
+ /*
+ * No need to call mmu_notifier_invalidate_range() it has be
+ * done above for all cases requiring it to happen under page
+ * table lock before mmu_notifier_invalidate_range_end()
+ *
+ * See Documentation/mm/mmu_notifier.rst
+ */
+ page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
+ /* No VM_LOCKED set in vma->vm_flags for hugetlb. So not
+ * necessary to call mlock_drain_local().
+ */
+ folio_put(folio);
+
+out:
+ return ret;
+}
+
/*
* @arg: enum ttu_flags will be passed to this argument
*/
@@ -1504,86 +1601,37 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
break;
}
+ address = pvmw.address;
+ if (folio_test_hugetlb(folio)) {
+ ret = try_to_unmap_one_hugetlb(folio, vma, range,
+ pvmw, address, flags);
+
+ /* no need to loop for hugetlb */
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+
subpage = folio_page(folio,
pte_pfn(*pvmw.pte) - folio_pfn(folio));
- address = pvmw.address;
anon_exclusive = folio_test_anon(folio) &&
PageAnonExclusive(subpage);
- if (folio_test_hugetlb(folio)) {
- bool anon = folio_test_anon(folio);
-
- /*
- * The try_to_unmap() is only passed a hugetlb page
- * in the case where the hugetlb page is poisoned.
- */
- VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
+ flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+ /* Nuke the page table entry. */
+ if (should_defer_flush(mm, flags)) {
/*
- * huge_pmd_unshare may unmap an entire PMD page.
- * There is no way of knowing exactly which PMDs may
- * be cached for this mm, so we must flush them all.
- * start/end were already adjusted above to cover this
- * range.
+ * We clear the PTE but do not flush so potentially
+ * a remote CPU could still be writing to the folio.
+ * If the entry was previously clean then the
+ * architecture must guarantee that a clear->dirty
+ * transition on a cached TLB entry is written through
+ * and traps if the PTE is unmapped.
*/
- flush_cache_range(vma, range.start, range.end);
+ pteval = ptep_get_and_clear(mm, address, pvmw.pte);
- /*
- * To call huge_pmd_unshare, i_mmap_rwsem must be
- * held in write mode. Caller needs to explicitly
- * do this outside rmap routines.
- *
- * We also must hold hugetlb vma_lock in write mode.
- * Lock order dictates acquiring vma_lock BEFORE
- * i_mmap_rwsem. We can only try lock here and fail
- * if unsuccessful.
- */
- if (!anon) {
- VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
- if (!hugetlb_vma_trylock_write(vma)) {
- page_vma_mapped_walk_done(&pvmw);
- ret = false;
- break;
- }
- if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
- hugetlb_vma_unlock_write(vma);
- flush_tlb_range(vma,
- range.start, range.end);
- mmu_notifier_invalidate_range(mm,
- range.start, range.end);
- /*
- * The ref count of the PMD page was
- * dropped which is part of the way map
- * counting is done for shared PMDs.
- * Return 'true' here. When there is
- * no other sharing, huge_pmd_unshare
- * returns false and we will unmap the
- * actual page and drop map count
- * to zero.
- */
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
- hugetlb_vma_unlock_write(vma);
- }
- pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
+ set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
} else {
- flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
- /* Nuke the page table entry. */
- if (should_defer_flush(mm, flags)) {
- /*
- * We clear the PTE but do not flush so potentially
- * a remote CPU could still be writing to the folio.
- * If the entry was previously clean then the
- * architecture must guarantee that a clear->dirty
- * transition on a cached TLB entry is written through
- * and traps if the PTE is unmapped.
- */
- pteval = ptep_get_and_clear(mm, address, pvmw.pte);
-
- set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
- } else {
- pteval = ptep_clear_flush(vma, address, pvmw.pte);
- }
+ pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
/*
@@ -1602,14 +1650,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
- if (folio_test_hugetlb(folio)) {
- hugetlb_count_sub(folio_nr_pages(folio), mm);
- set_huge_pte_at(mm, address, pvmw.pte, pteval);
- } else {
- dec_mm_counter(mm, mm_counter(&folio->page));
- set_pte_at(mm, address, pvmw.pte, pteval);
- }
-
+ dec_mm_counter(mm, mm_counter(&folio->page));
+ set_pte_at(mm, address, pvmw.pte, pteval);
} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
/*
* The guest indicated that the page content is of no
--
2.30.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v3 2/5] rmap: move page unmap operation to dedicated function
2023-03-06 9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
2023-03-06 9:22 ` [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function Yin Fengwei
@ 2023-03-06 9:22 ` Yin Fengwei
2023-03-06 9:22 ` [PATCH v3 3/5] rmap: cleanup exit path of try_to_unmap_one_page() Yin Fengwei
` (3 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Yin Fengwei @ 2023-03-06 9:22 UTC (permalink / raw)
To: linux-mm, akpm, willy, mike.kravetz, sidhartha.kumar,
naoya.horiguchi, jane.chu, david
Cc: fengwei.yin
No functional change, just code reorganization.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
mm/rmap.c | 369 ++++++++++++++++++++++++++++--------------------------
1 file changed, 194 insertions(+), 175 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 508d141dacc5..013643122d0c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1538,17 +1538,204 @@ static bool try_to_unmap_one_hugetlb(struct folio *folio,
return ret;
}
+static bool try_to_unmap_one_page(struct folio *folio,
+ struct vm_area_struct *vma, struct mmu_notifier_range range,
+ struct page_vma_mapped_walk pvmw, unsigned long address,
+ enum ttu_flags flags)
+{
+ bool anon_exclusive, ret = true;
+ struct page *subpage;
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t pteval;
+
+ subpage = folio_page(folio,
+ pte_pfn(*pvmw.pte) - folio_pfn(folio));
+ anon_exclusive = folio_test_anon(folio) &&
+ PageAnonExclusive(subpage);
+
+ flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+ /* Nuke the page table entry. */
+ if (should_defer_flush(mm, flags)) {
+ /*
+ * We clear the PTE but do not flush so potentially
+ * a remote CPU could still be writing to the folio.
+ * If the entry was previously clean then the
+ * architecture must guarantee that a clear->dirty
+ * transition on a cached TLB entry is written through
+ * and traps if the PTE is unmapped.
+ */
+ pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+
+ set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+ } else {
+ pteval = ptep_clear_flush(vma, address, pvmw.pte);
+ }
+
+ /*
+ * Now the pte is cleared. If this pte was uffd-wp armed,
+ * we may want to replace a none pte with a marker pte if
+ * it's file-backed, so we don't lose the tracking info.
+ */
+ pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
+ /* Set the dirty flag on the folio now the pte is gone. */
+ if (pte_dirty(pteval))
+ folio_mark_dirty(folio);
+
+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
+ if (PageHWPoison(subpage) && !(flags & TTU_HWPOISON)) {
+ pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
+ dec_mm_counter(mm, mm_counter(&folio->page));
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
+ /*
+ * The guest indicated that the page content is of no
+ * interest anymore. Simply discard the pte, vmscan
+ * will take care of the rest.
+ * A future reference will then fault in a new zero
+ * page. When userfaultfd is active, we must not drop
+ * this page though, as its main user (postcopy
+ * migration) will not expect userfaults on already
+ * copied pages.
+ */
+ dec_mm_counter(mm, mm_counter(&folio->page));
+ /* We have to invalidate as we cleared the pte */
+ mmu_notifier_invalidate_range(mm, address,
+ address + PAGE_SIZE);
+ } else if (folio_test_anon(folio)) {
+ swp_entry_t entry = { .val = page_private(subpage) };
+ pte_t swp_pte;
+ /*
+ * Store the swap location in the pte.
+ * See handle_pte_fault() ...
+ */
+ if (unlikely(folio_test_swapbacked(folio) !=
+ folio_test_swapcache(folio))) {
+ WARN_ON_ONCE(1);
+ ret = false;
+ /* We have to invalidate as we cleared the pte */
+ mmu_notifier_invalidate_range(mm, address,
+ address + PAGE_SIZE);
+ page_vma_mapped_walk_done(&pvmw);
+ goto discard;
+ }
+
+ /* MADV_FREE page check */
+ if (!folio_test_swapbacked(folio)) {
+ int ref_count, map_count;
+
+ /*
+ * Synchronize with gup_pte_range():
+ * - clear PTE; barrier; read refcount
+ * - inc refcount; barrier; read PTE
+ */
+ smp_mb();
+
+ ref_count = folio_ref_count(folio);
+ map_count = folio_mapcount(folio);
+
+ /*
+ * Order reads for page refcount and dirty flag
+ * (see comments in __remove_mapping()).
+ */
+ smp_rmb();
+
+ /*
+ * The only page refs must be one from isolation
+ * plus the rmap(s) (dropped by discard:).
+ */
+ if (ref_count == 1 + map_count &&
+ !folio_test_dirty(folio)) {
+ /* Invalidate as we cleared the pte */
+ mmu_notifier_invalidate_range(mm,
+ address, address + PAGE_SIZE);
+ dec_mm_counter(mm, MM_ANONPAGES);
+ goto discard;
+ }
+
+ /*
+ * If the folio was redirtied, it cannot be
+ * discarded. Remap the page to page table.
+ */
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ folio_set_swapbacked(folio);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ goto discard;
+ }
+
+ if (swap_duplicate(entry) < 0) {
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ goto discard;
+ }
+ if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+ swap_free(entry);
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ goto discard;
+ }
+
+ /* See page_try_share_anon_rmap(): clear PTE first. */
+ if (anon_exclusive &&
+ page_try_share_anon_rmap(subpage)) {
+ swap_free(entry);
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ goto discard;
+ }
+ if (list_empty(&mm->mmlist)) {
+ spin_lock(&mmlist_lock);
+ if (list_empty(&mm->mmlist))
+ list_add(&mm->mmlist, &init_mm.mmlist);
+ spin_unlock(&mmlist_lock);
+ }
+ dec_mm_counter(mm, MM_ANONPAGES);
+ inc_mm_counter(mm, MM_SWAPENTS);
+ swp_pte = swp_entry_to_pte(entry);
+ if (anon_exclusive)
+ swp_pte = pte_swp_mkexclusive(swp_pte);
+ if (pte_soft_dirty(pteval))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_uffd_wp(pteval))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ set_pte_at(mm, address, pvmw.pte, swp_pte);
+ /* Invalidate as we cleared the pte */
+ mmu_notifier_invalidate_range(mm, address,
+ address + PAGE_SIZE);
+ } else {
+ /*
+ * This is a locked file-backed folio,
+ * so it cannot be removed from the page
+ * cache and replaced by a new folio before
+ * mmu_notifier_invalidate_range_end, so no
+ * concurrent thread might update its page table
+ * to point at a new folio while a device is
+ * still using this folio.
+ *
+ * See Documentation/mm/mmu_notifier.rst
+ */
+ dec_mm_counter(mm, mm_counter_file(&folio->page));
+ }
+
+discard:
+ return ret;
+}
+
/*
* @arg: enum ttu_flags will be passed to this argument
*/
static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
unsigned long address, void *arg)
{
- struct mm_struct *mm = vma->vm_mm;
DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
- pte_t pteval;
struct page *subpage;
- bool anon_exclusive, ret = true;
+ bool ret = true;
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;
@@ -1613,179 +1800,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
subpage = folio_page(folio,
pte_pfn(*pvmw.pte) - folio_pfn(folio));
- anon_exclusive = folio_test_anon(folio) &&
- PageAnonExclusive(subpage);
-
- flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
- /* Nuke the page table entry. */
- if (should_defer_flush(mm, flags)) {
- /*
- * We clear the PTE but do not flush so potentially
- * a remote CPU could still be writing to the folio.
- * If the entry was previously clean then the
- * architecture must guarantee that a clear->dirty
- * transition on a cached TLB entry is written through
- * and traps if the PTE is unmapped.
- */
- pteval = ptep_get_and_clear(mm, address, pvmw.pte);
-
- set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
- } else {
- pteval = ptep_clear_flush(vma, address, pvmw.pte);
- }
-
- /*
- * Now the pte is cleared. If this pte was uffd-wp armed,
- * we may want to replace a none pte with a marker pte if
- * it's file-backed, so we don't lose the tracking info.
- */
- pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
-
- /* Set the dirty flag on the folio now the pte is gone. */
- if (pte_dirty(pteval))
- folio_mark_dirty(folio);
-
- /* Update high watermark before we lower rss */
- update_hiwater_rss(mm);
-
- if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
- pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
- dec_mm_counter(mm, mm_counter(&folio->page));
- set_pte_at(mm, address, pvmw.pte, pteval);
- } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
- /*
- * The guest indicated that the page content is of no
- * interest anymore. Simply discard the pte, vmscan
- * will take care of the rest.
- * A future reference will then fault in a new zero
- * page. When userfaultfd is active, we must not drop
- * this page though, as its main user (postcopy
- * migration) will not expect userfaults on already
- * copied pages.
- */
- dec_mm_counter(mm, mm_counter(&folio->page));
- /* We have to invalidate as we cleared the pte */
- mmu_notifier_invalidate_range(mm, address,
- address + PAGE_SIZE);
- } else if (folio_test_anon(folio)) {
- swp_entry_t entry = { .val = page_private(subpage) };
- pte_t swp_pte;
- /*
- * Store the swap location in the pte.
- * See handle_pte_fault() ...
- */
- if (unlikely(folio_test_swapbacked(folio) !=
- folio_test_swapcache(folio))) {
- WARN_ON_ONCE(1);
- ret = false;
- /* We have to invalidate as we cleared the pte */
- mmu_notifier_invalidate_range(mm, address,
- address + PAGE_SIZE);
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
-
- /* MADV_FREE page check */
- if (!folio_test_swapbacked(folio)) {
- int ref_count, map_count;
-
- /*
- * Synchronize with gup_pte_range():
- * - clear PTE; barrier; read refcount
- * - inc refcount; barrier; read PTE
- */
- smp_mb();
-
- ref_count = folio_ref_count(folio);
- map_count = folio_mapcount(folio);
-
- /*
- * Order reads for page refcount and dirty flag
- * (see comments in __remove_mapping()).
- */
- smp_rmb();
-
- /*
- * The only page refs must be one from isolation
- * plus the rmap(s) (dropped by discard:).
- */
- if (ref_count == 1 + map_count &&
- !folio_test_dirty(folio)) {
- /* Invalidate as we cleared the pte */
- mmu_notifier_invalidate_range(mm,
- address, address + PAGE_SIZE);
- dec_mm_counter(mm, MM_ANONPAGES);
- goto discard;
- }
-
- /*
- * If the folio was redirtied, it cannot be
- * discarded. Remap the page to page table.
- */
- set_pte_at(mm, address, pvmw.pte, pteval);
- folio_set_swapbacked(folio);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
-
- if (swap_duplicate(entry) < 0) {
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
- if (arch_unmap_one(mm, vma, address, pteval) < 0) {
- swap_free(entry);
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
+ ret = try_to_unmap_one_page(folio, vma,
+ range, pvmw, address, flags);
+ if (!ret)
+ break;
- /* See page_try_share_anon_rmap(): clear PTE first. */
- if (anon_exclusive &&
- page_try_share_anon_rmap(subpage)) {
- swap_free(entry);
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
- if (list_empty(&mm->mmlist)) {
- spin_lock(&mmlist_lock);
- if (list_empty(&mm->mmlist))
- list_add(&mm->mmlist, &init_mm.mmlist);
- spin_unlock(&mmlist_lock);
- }
- dec_mm_counter(mm, MM_ANONPAGES);
- inc_mm_counter(mm, MM_SWAPENTS);
- swp_pte = swp_entry_to_pte(entry);
- if (anon_exclusive)
- swp_pte = pte_swp_mkexclusive(swp_pte);
- if (pte_soft_dirty(pteval))
- swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
- set_pte_at(mm, address, pvmw.pte, swp_pte);
- /* Invalidate as we cleared the pte */
- mmu_notifier_invalidate_range(mm, address,
- address + PAGE_SIZE);
- } else {
- /*
- * This is a locked file-backed folio,
- * so it cannot be removed from the page
- * cache and replaced by a new folio before
- * mmu_notifier_invalidate_range_end, so no
- * concurrent thread might update its page table
- * to point at a new folio while a device is
- * still using this folio.
- *
- * See Documentation/mm/mmu_notifier.rst
- */
- dec_mm_counter(mm, mm_counter_file(&folio->page));
- }
-discard:
/*
* No need to call mmu_notifier_invalidate_range() it has be
* done above for all cases requiring it to happen under page
--
2.30.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v3 3/5] rmap: cleanup exit path of try_to_unmap_one_page()
2023-03-06 9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
2023-03-06 9:22 ` [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function Yin Fengwei
2023-03-06 9:22 ` [PATCH v3 2/5] rmap: move page unmap operation " Yin Fengwei
@ 2023-03-06 9:22 ` Yin Fengwei
2023-03-06 9:22 ` [PATCH v3 4/5] rmap: add folio_remove_rmap_range() Yin Fengwei
` (2 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Yin Fengwei @ 2023-03-06 9:22 UTC (permalink / raw)
To: linux-mm, akpm, willy, mike.kravetz, sidhartha.kumar,
naoya.horiguchi, jane.chu, david
Cc: fengwei.yin
Clean up the exit path of try_to_unmap_one_page() by removing
some duplicated code.
Move page_vma_mapped_walk_done() back to try_to_unmap_one().
Rename subpage to page, as a folio has no concept of a subpage.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
mm/rmap.c | 74 ++++++++++++++++++++++---------------------------------
1 file changed, 30 insertions(+), 44 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 013643122d0c..836cfc13cf9d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1528,7 +1528,7 @@ static bool try_to_unmap_one_hugetlb(struct folio *folio,
*
* See Documentation/mm/mmu_notifier.rst
*/
- page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
+ page_remove_rmap(&folio->page, vma, true);
/* No VM_LOCKED set in vma->vm_flags for hugetlb. So not
* necessary to call mlock_drain_local().
*/
@@ -1543,15 +1543,13 @@ static bool try_to_unmap_one_page(struct folio *folio,
struct page_vma_mapped_walk pvmw, unsigned long address,
enum ttu_flags flags)
{
- bool anon_exclusive, ret = true;
- struct page *subpage;
+ bool anon_exclusive;
+ struct page *page;
struct mm_struct *mm = vma->vm_mm;
pte_t pteval;
- subpage = folio_page(folio,
- pte_pfn(*pvmw.pte) - folio_pfn(folio));
- anon_exclusive = folio_test_anon(folio) &&
- PageAnonExclusive(subpage);
+ page = folio_page(folio, pte_pfn(*pvmw.pte) - folio_pfn(folio));
+ anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
/* Nuke the page table entry. */
@@ -1579,15 +1577,14 @@ static bool try_to_unmap_one_page(struct folio *folio,
pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
/* Set the dirty flag on the folio now the pte is gone. */
- if (pte_dirty(pteval))
+ if (pte_dirty(pteval) && !folio_test_dirty(folio))
folio_mark_dirty(folio);
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
- if (PageHWPoison(subpage) && !(flags & TTU_HWPOISON)) {
- pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
- dec_mm_counter(mm, mm_counter(&folio->page));
+ if (PageHWPoison(page) && !(flags & TTU_HWPOISON)) {
+ pteval = swp_entry_to_pte(make_hwpoison_entry(page));
set_pte_at(mm, address, pvmw.pte, pteval);
} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
/*
@@ -1600,12 +1597,11 @@ static bool try_to_unmap_one_page(struct folio *folio,
* migration) will not expect userfaults on already
* copied pages.
*/
- dec_mm_counter(mm, mm_counter(&folio->page));
/* We have to invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm, address,
address + PAGE_SIZE);
} else if (folio_test_anon(folio)) {
- swp_entry_t entry = { .val = page_private(subpage) };
+ swp_entry_t entry = { .val = page_private(page) };
pte_t swp_pte;
/*
* Store the swap location in the pte.
@@ -1614,12 +1610,10 @@ static bool try_to_unmap_one_page(struct folio *folio,
if (unlikely(folio_test_swapbacked(folio) !=
folio_test_swapcache(folio))) {
WARN_ON_ONCE(1);
- ret = false;
/* We have to invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm, address,
address + PAGE_SIZE);
- page_vma_mapped_walk_done(&pvmw);
- goto discard;
+ goto exit;
}
/* MADV_FREE page check */
@@ -1651,7 +1645,6 @@ static bool try_to_unmap_one_page(struct folio *folio,
/* Invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm,
address, address + PAGE_SIZE);
- dec_mm_counter(mm, MM_ANONPAGES);
goto discard;
}
@@ -1659,43 +1652,30 @@ static bool try_to_unmap_one_page(struct folio *folio,
* If the folio was redirtied, it cannot be
* discarded. Remap the page to page table.
*/
- set_pte_at(mm, address, pvmw.pte, pteval);
folio_set_swapbacked(folio);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- goto discard;
+ goto exit_restore_pte;
}
- if (swap_duplicate(entry) < 0) {
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- goto discard;
- }
+ if (swap_duplicate(entry) < 0)
+ goto exit_restore_pte;
+
if (arch_unmap_one(mm, vma, address, pteval) < 0) {
swap_free(entry);
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- goto discard;
+ goto exit_restore_pte;
}
/* See page_try_share_anon_rmap(): clear PTE first. */
- if (anon_exclusive &&
- page_try_share_anon_rmap(subpage)) {
+ if (anon_exclusive && page_try_share_anon_rmap(page)) {
swap_free(entry);
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- goto discard;
+ goto exit_restore_pte;
}
+
if (list_empty(&mm->mmlist)) {
spin_lock(&mmlist_lock);
if (list_empty(&mm->mmlist))
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- dec_mm_counter(mm, MM_ANONPAGES);
inc_mm_counter(mm, MM_SWAPENTS);
swp_pte = swp_entry_to_pte(entry);
if (anon_exclusive)
@@ -1706,8 +1686,7 @@ static bool try_to_unmap_one_page(struct folio *folio,
swp_pte = pte_swp_mkuffd_wp(swp_pte);
set_pte_at(mm, address, pvmw.pte, swp_pte);
/* Invalidate as we cleared the pte */
- mmu_notifier_invalidate_range(mm, address,
- address + PAGE_SIZE);
+ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE);
} else {
/*
* This is a locked file-backed folio,
@@ -1720,11 +1699,16 @@ static bool try_to_unmap_one_page(struct folio *folio,
*
* See Documentation/mm/mmu_notifier.rst
*/
- dec_mm_counter(mm, mm_counter_file(&folio->page));
}
discard:
- return ret;
+ dec_mm_counter(vma->vm_mm, mm_counter(&folio->page));
+ return true;
+
+exit_restore_pte:
+ set_pte_at(mm, address, pvmw.pte, pteval);
+exit:
+ return false;
}
/*
@@ -1802,8 +1786,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
pte_pfn(*pvmw.pte) - folio_pfn(folio));
ret = try_to_unmap_one_page(folio, vma,
range, pvmw, address, flags);
- if (!ret)
+ if (!ret) {
+ page_vma_mapped_walk_done(&pvmw);
break;
+ }
/*
* No need to call mmu_notifier_invalidate_range() it has be
@@ -1812,7 +1798,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*
* See Documentation/mm/mmu_notifier.rst
*/
- page_remove_rmap(subpage, vma, folio_test_hugetlb(folio));
+ page_remove_rmap(subpage, vma, false);
if (vma->vm_flags & VM_LOCKED)
mlock_drain_local();
folio_put(folio);
--
2.30.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v3 4/5] rmap: add folio_remove_rmap_range()
2023-03-06 9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
` (2 preceding siblings ...)
2023-03-06 9:22 ` [PATCH v3 3/5] rmap: cleanup exit path of try_to_unmap_one_page() Yin Fengwei
@ 2023-03-06 9:22 ` Yin Fengwei
2023-03-06 9:22 ` [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount Yin Fengwei
2023-03-06 21:12 ` [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Andrew Morton
5 siblings, 0 replies; 13+ messages in thread
From: Yin Fengwei @ 2023-03-06 9:22 UTC (permalink / raw)
To: linux-mm, akpm, willy, mike.kravetz, sidhartha.kumar,
naoya.horiguchi, jane.chu, david
Cc: fengwei.yin
folio_remove_rmap_range() allows taking down the pte mappings of
a specific range of a folio. Compared to page_remove_rmap(), it
batches the __lruvec_stat updates for large folios.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
include/linux/rmap.h | 4 +++
mm/rmap.c | 58 +++++++++++++++++++++++++++++++++-----------
2 files changed, 48 insertions(+), 14 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b87d01660412..d2569b42e21a 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -200,6 +200,10 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
bool compound);
void page_remove_rmap(struct page *, struct vm_area_struct *,
bool compound);
+void folio_remove_rmap_range(struct folio *, struct page *,
+ unsigned int nr_pages, struct vm_area_struct *,
+ bool compound);
+
void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long address, rmap_t flags);
diff --git a/mm/rmap.c b/mm/rmap.c
index 836cfc13cf9d..bb3fcb8df579 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1355,23 +1355,25 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
}
/**
- * page_remove_rmap - take down pte mapping from a page
- * @page: page to remove mapping from
+ * folio_remove_rmap_range - take down pte mapping from a range of pages
+ * @folio: folio to remove mapping from
+ * @page: The first page to take down pte mapping
+ * @nr_pages: The number of pages which will be take down pte mapping
* @vma: the vm area from which the mapping is removed
* @compound: uncharge the page as compound or small page
*
* The caller needs to hold the pte lock.
*/
-void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
- bool compound)
+void folio_remove_rmap_range(struct folio *folio, struct page *page,
+ unsigned int nr_pages, struct vm_area_struct *vma,
+ bool compound)
{
- struct folio *folio = page_folio(page);
atomic_t *mapped = &folio->_nr_pages_mapped;
- int nr = 0, nr_pmdmapped = 0;
- bool last;
+ int nr = 0, nr_pmdmapped = 0, last;
enum node_stat_item idx;
- VM_BUG_ON_PAGE(compound && !PageHead(page), page);
+ VM_BUG_ON_FOLIO(compound && (nr_pages != folio_nr_pages(folio)), folio);
+ VM_BUG_ON_FOLIO(compound && (page != &folio->page), folio);
/* Hugetlb pages are not counted in NR_*MAPPED */
if (unlikely(folio_test_hugetlb(folio))) {
@@ -1382,12 +1384,16 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
/* Is page being unmapped by PTE? Is this its last map to be removed? */
if (likely(!compound)) {
- last = atomic_add_negative(-1, &page->_mapcount);
- nr = last;
- if (last && folio_test_large(folio)) {
- nr = atomic_dec_return_relaxed(mapped);
- nr = (nr < COMPOUND_MAPPED);
- }
+ do {
+ last = atomic_add_negative(-1, &page->_mapcount);
+ if (last && folio_test_large(folio)) {
+ last = atomic_dec_return_relaxed(mapped);
+ last = (last < COMPOUND_MAPPED);
+ }
+
+ if (last)
+ nr++;
+ } while (page++, --nr_pages > 0);
} else if (folio_test_pmd_mappable(folio)) {
/* That test is redundant: it's for safety or to optimize out */
@@ -1441,6 +1447,30 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
munlock_vma_folio(folio, vma, compound);
}
+/**
+ * page_remove_rmap - take down pte mapping from a page
+ * @page: page to remove mapping from
+ * @vma: the vm area from which the mapping is removed
+ * @compound: uncharge the page as compound or small page
+ *
+ * The caller needs to hold the pte lock.
+ */
+void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
+ bool compound)
+{
+ struct folio *folio = page_folio(page);
+ unsigned int nr_pages;
+
+ VM_BUG_ON_FOLIO(compound && (page != &folio->page), folio);
+
+ if (likely(!compound))
+ nr_pages = 1;
+ else
+ nr_pages = folio_nr_pages(folio);
+
+ folio_remove_rmap_range(folio, page, nr_pages, vma, compound);
+}
+
static bool try_to_unmap_one_hugetlb(struct folio *folio,
struct vm_area_struct *vma, struct mmu_notifier_range range,
struct page_vma_mapped_walk pvmw, unsigned long address,
--
2.30.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount
2023-03-06 9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
` (3 preceding siblings ...)
2023-03-06 9:22 ` [PATCH v3 4/5] rmap: add folio_remove_rmap_range() Yin Fengwei
@ 2023-03-06 9:22 ` Yin Fengwei
2023-03-06 12:39 ` haoxin
2023-03-06 21:12 ` [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Andrew Morton
5 siblings, 1 reply; 13+ messages in thread
From: Yin Fengwei @ 2023-03-06 9:22 UTC (permalink / raw)
To: linux-mm, akpm, willy, mike.kravetz, sidhartha.kumar,
naoya.horiguchi, jane.chu, david
Cc: fengwei.yin
If unmapping one page fails, or the vma walk will skip the next
pte, or the vma walk will end on the next pte, remove the rmap in
a batch and update the folio refcount.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
include/linux/rmap.h | 1 +
mm/page_vma_mapped.c | 30 +++++++++++++++++++++++++++
mm/rmap.c | 48 ++++++++++++++++++++++++++++++++++----------
3 files changed, 68 insertions(+), 11 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d2569b42e21a..18193d1d5a8e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -424,6 +424,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
}
bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
+bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk *pvmw);
/*
* Used by swapoff to help locate where page is expected in vma.
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 4e448cfbc6ef..19e997dfb5c6 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -291,6 +291,36 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
return false;
}
+/**
+ * pvmw_walk_skip_or_end_on_next - check if next pte will be skipped or
+ * end the walk
+ * @pvmw: pointer to struct page_vma_mapped_walk.
+ *
+ * This function can only be called with correct pte lock hold
+ */
+bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk *pvmw)
+{
+ unsigned long address = pvmw->address + PAGE_SIZE;
+
+ if (address >= vma_address_end(pvmw))
+ return true;
+
+ if ((address & (PMD_SIZE - PAGE_SIZE)) == 0)
+ return true;
+
+ if (pte_none(*pvmw->pte))
+ return true;
+
+ pvmw->pte++;
+ if (!check_pte(pvmw)) {
+ pvmw->pte--;
+ return true;
+ }
+ pvmw->pte--;
+
+ return false;
+}
+
/**
* page_mapped_in_vma - check whether a page is really mapped in a VMA
* @page: the page to test
diff --git a/mm/rmap.c b/mm/rmap.c
index bb3fcb8df579..a64e9cbb52dd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1741,6 +1741,26 @@ static bool try_to_unmap_one_page(struct folio *folio,
return false;
}
+static void folio_remove_rmap_and_update_count(struct folio *folio,
+ struct page *start, struct vm_area_struct *vma, int count)
+{
+ if (count == 0)
+ return;
+
+ /*
+ * No need to call mmu_notifier_invalidate_range() it has be
+ * done above for all cases requiring it to happen under page
+ * table lock before mmu_notifier_invalidate_range_end()
+ *
+ * See Documentation/mm/mmu_notifier.rst
+ */
+ folio_remove_rmap_range(folio, start, count, vma,
+ folio_test_hugetlb(folio));
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_drain_local();
+ folio_ref_sub(folio, count);
+}
+
/*
* @arg: enum ttu_flags will be passed to this argument
*/
@@ -1748,10 +1768,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
unsigned long address, void *arg)
{
DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
- struct page *subpage;
+ struct page *start = NULL;
bool ret = true;
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;
+ int count = 0;
/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -1812,26 +1833,31 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
break;
}
- subpage = folio_page(folio,
+ if (!start)
+ start = folio_page(folio,
pte_pfn(*pvmw.pte) - folio_pfn(folio));
ret = try_to_unmap_one_page(folio, vma,
range, pvmw, address, flags);
if (!ret) {
+ folio_remove_rmap_and_update_count(folio,
+ start, vma, count);
page_vma_mapped_walk_done(&pvmw);
break;
}
+ count++;
/*
- * No need to call mmu_notifier_invalidate_range() it has be
- * done above for all cases requiring it to happen under page
- * table lock before mmu_notifier_invalidate_range_end()
- *
- * See Documentation/mm/mmu_notifier.rst
+ * If next pte will be skipped in page_vma_mapped_walk() or
+ * the walk will end at it, batched remove rmap and update
+ * page refcount. We can't do it after page_vma_mapped_walk()
+ * return false because the pte lock will not be hold.
*/
- page_remove_rmap(subpage, vma, false);
- if (vma->vm_flags & VM_LOCKED)
- mlock_drain_local();
- folio_put(folio);
+ if (pvmw_walk_skip_or_end_on_next(&pvmw)) {
+ folio_remove_rmap_and_update_count(folio,
+ start, vma, count);
+ count = 0;
+ start = NULL;
+ }
}
mmu_notifier_invalidate_range_end(&range);
--
2.30.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount
2023-03-06 9:22 ` [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount Yin Fengwei
@ 2023-03-06 12:39 ` haoxin
2023-03-07 2:45 ` Yin, Fengwei
0 siblings, 1 reply; 13+ messages in thread
From: haoxin @ 2023-03-06 12:39 UTC (permalink / raw)
To: Yin Fengwei, linux-mm, akpm, willy, mike.kravetz,
sidhartha.kumar, naoya.horiguchi, jane.chu, david
On 2023/3/6 17:22, Yin Fengwei wrote:
> If unmap one page fails, or the vma walk will skip next pte,
> or the vma walk will end on next pte, batched remove map,
> update folio refcount.
>
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
> include/linux/rmap.h | 1 +
> mm/page_vma_mapped.c | 30 +++++++++++++++++++++++++++
> mm/rmap.c | 48 ++++++++++++++++++++++++++++++++++----------
> 3 files changed, 68 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index d2569b42e21a..18193d1d5a8e 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -424,6 +424,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> }
>
> bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
> +bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk *pvmw);
>
> /*
> * Used by swapoff to help locate where page is expected in vma.
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 4e448cfbc6ef..19e997dfb5c6 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -291,6 +291,36 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> return false;
> }
>
> +/**
> + * pvmw_walk_skip_or_end_on_next - check if next pte will be skipped or
> + * end the walk
> + * @pvmw: pointer to struct page_vma_mapped_walk.
> + *
> + * This function can only be called with correct pte lock hold
> + */
> +bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk *pvmw)
> +{
> + unsigned long address = pvmw->address + PAGE_SIZE;
> +
> + if (address >= vma_address_end(pvmw))
> + return true;
If vma_address_end() is exactly equal to the next address (pvmw->address +
PAGE_SIZE), does this mean that we skip unmapping the last page
here? If so, ' > ' could be used instead:
'if (address > vma_address_end(pvmw))'.
I may have misunderstood, please correct me.
> +
> + if ((address & (PMD_SIZE - PAGE_SIZE)) == 0)
> + return true;
> +
> + if (pte_none(*pvmw->pte))
> + return true;
> +
> + pvmw->pte++;
> + if (!check_pte(pvmw)) {
> + pvmw->pte--;
> + return true;
> + }
> + pvmw->pte--;
> +
> + return false;
> +}
> +
> /**
> * page_mapped_in_vma - check whether a page is really mapped in a VMA
> * @page: the page to test
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bb3fcb8df579..a64e9cbb52dd 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1741,6 +1741,26 @@ static bool try_to_unmap_one_page(struct folio *folio,
> return false;
> }
>
> +static void folio_remove_rmap_and_update_count(struct folio *folio,
> + struct page *start, struct vm_area_struct *vma, int count)
> +{
> + if (count == 0)
> + return;
> +
> + /*
> + * No need to call mmu_notifier_invalidate_range() it has be
> + * done above for all cases requiring it to happen under page
> + * table lock before mmu_notifier_invalidate_range_end()
> + *
> + * See Documentation/mm/mmu_notifier.rst
> + */
> + folio_remove_rmap_range(folio, start, count, vma,
> + folio_test_hugetlb(folio));
> + if (vma->vm_flags & VM_LOCKED)
> + mlock_drain_local();
> + folio_ref_sub(folio, count);
> +}
> +
> /*
> * @arg: enum ttu_flags will be passed to this argument
> */
> @@ -1748,10 +1768,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> unsigned long address, void *arg)
> {
> DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> - struct page *subpage;
> + struct page *start = NULL;
> bool ret = true;
> struct mmu_notifier_range range;
> enum ttu_flags flags = (enum ttu_flags)(long)arg;
> + int count = 0;
>
> /*
> * When racing against e.g. zap_pte_range() on another cpu,
> @@ -1812,26 +1833,31 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> break;
> }
>
> - subpage = folio_page(folio,
> + if (!start)
> + start = folio_page(folio,
> pte_pfn(*pvmw.pte) - folio_pfn(folio));
> ret = try_to_unmap_one_page(folio, vma,
> range, pvmw, address, flags);
> if (!ret) {
> + folio_remove_rmap_and_update_count(folio,
> + start, vma, count);
> page_vma_mapped_walk_done(&pvmw);
> break;
> }
> + count++;
>
> /*
> - * No need to call mmu_notifier_invalidate_range() it has be
> - * done above for all cases requiring it to happen under page
> - * table lock before mmu_notifier_invalidate_range_end()
> - *
> - * See Documentation/mm/mmu_notifier.rst
> + * If next pte will be skipped in page_vma_mapped_walk() or
> + * the walk will end at it, batched remove rmap and update
> + * page refcount. We can't do it after page_vma_mapped_walk()
> + * return false because the pte lock will not be hold.
> */
> - page_remove_rmap(subpage, vma, false);
> - if (vma->vm_flags & VM_LOCKED)
> - mlock_drain_local();
> - folio_put(folio);
> + if (pvmw_walk_skip_or_end_on_next(&pvmw)) {
> + folio_remove_rmap_and_update_count(folio,
> + start, vma, count);
> + count = 0;
> + start = NULL;
> + }
> }
>
> mmu_notifier_invalidate_range_end(&range);
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 0/5] batched remove rmap in try_to_unmap_one()
2023-03-06 9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
` (4 preceding siblings ...)
2023-03-06 9:22 ` [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount Yin Fengwei
@ 2023-03-06 21:12 ` Andrew Morton
2023-03-07 2:44 ` Yin, Fengwei
2023-03-09 13:56 ` Yin, Fengwei
5 siblings, 2 replies; 13+ messages in thread
From: Andrew Morton @ 2023-03-06 21:12 UTC (permalink / raw)
To: Yin Fengwei
Cc: linux-mm, willy, mike.kravetz, sidhartha.kumar, naoya.horiguchi,
jane.chu, david
On Mon, 6 Mar 2023 17:22:54 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
> This series is trying to bring the batched rmap removing to
> try_to_unmap_one(). It's expected that the batched rmap
> removing bring performance gain than remove rmap per page.
>
> ...
>
> include/linux/rmap.h | 5 +
> mm/page_vma_mapped.c | 30 +++
> mm/rmap.c | 623 +++++++++++++++++++++++++------------------
> 3 files changed, 398 insertions(+), 260 deletions(-)
As was discussed in v2's review, if no performance benefit has been
demonstrated, why make this change?
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 0/5] batched remove rmap in try_to_unmap_one()
2023-03-06 21:12 ` [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Andrew Morton
@ 2023-03-07 2:44 ` Yin, Fengwei
2023-03-09 13:56 ` Yin, Fengwei
1 sibling, 0 replies; 13+ messages in thread
From: Yin, Fengwei @ 2023-03-07 2:44 UTC (permalink / raw)
To: akpm
Cc: david, linux-mm, willy, mike.kravetz, sidhartha.kumar,
naoya.horiguchi, chu, jane
On Mon, 2023-03-06 at 13:12 -0800, Andrew Morton wrote:
> On Mon, 6 Mar 2023 17:22:54 +0800 Yin Fengwei
> <fengwei.yin@intel.com> wrote:
>
> > This series is trying to bring the batched rmap removing to
> > try_to_unmap_one(). It's expected that the batched rmap
> > removing bring performance gain than remove rmap per page.
> >
> > ...
> >
> > include/linux/rmap.h | 5 +
> > mm/page_vma_mapped.c | 30 +++
> > mm/rmap.c | 623 +++++++++++++++++++++++++--------------
> > ----
> > 3 files changed, 398 insertions(+), 260 deletions(-)
>
> As was discussed in v2's review, if no performance benefit has been
> demonstrated, why make this change?
OK. Let me check whether there is a way to demonstrate the performance
gain. Thanks.
Regards
Yin, Fengwei
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount
2023-03-06 12:39 ` haoxin
@ 2023-03-07 2:45 ` Yin, Fengwei
0 siblings, 0 replies; 13+ messages in thread
From: Yin, Fengwei @ 2023-03-07 2:45 UTC (permalink / raw)
To: mike.kravetz, chu, jane, akpm, linux-mm, sidhartha.kumar, xhao,
naoya.horiguchi, david, willy
On Mon, 2023-03-06 at 20:39 +0800, haoxin wrote:
>
> On 2023/3/6 17:22, Yin Fengwei wrote:
> > If unmap one page fails, or the vma walk will skip next pte,
> > or the vma walk will end on next pte, batched remove map,
> > update folio refcount.
> >
> > Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> > ---
> > include/linux/rmap.h | 1 +
> > mm/page_vma_mapped.c | 30 +++++++++++++++++++++++++++
> > mm/rmap.c | 48 ++++++++++++++++++++++++++++++++++-----
> > -----
> > 3 files changed, 68 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index d2569b42e21a..18193d1d5a8e 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -424,6 +424,7 @@ static inline void
> > page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> > }
> >
> > bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
> > +bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk
> > *pvmw);
> >
> > /*
> > * Used by swapoff to help locate where page is expected in vma.
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index 4e448cfbc6ef..19e997dfb5c6 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -291,6 +291,36 @@ bool page_vma_mapped_walk(struct
> > page_vma_mapped_walk *pvmw)
> > return false;
> > }
> >
> > +/**
> > + * pvmw_walk_skip_or_end_on_next - check if next pte will be
> > skipped or
> > + * end the walk
> > + * @pvmw: pointer to struct page_vma_mapped_walk.
> > + *
> > + * This function can only be called with correct pte lock hold
> > + */
> > +bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk
> > *pvmw)
> > +{
> > + unsigned long address = pvmw->address + PAGE_SIZE;
> > +
> > + if (address >= vma_address_end(pvmw))
> > + return true;
>
> If vma_address_end is exactly equal to next address(pvmw->address +
> PAGE_SIZE) , does this mean that we are ignored to unmap the last
> page
> here ? so
>
> there can just use ' > ' ' if (address > vma_address_end(pvmw))'
> .
This check is done after the current PTE (possibly the last one) has
already been unmapped, so the last page is not skipped.
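To illustrate with the loop of try_to_unmap_one() as changed by
patch 5 (simplified, details omitted):

	while (page_vma_mapped_walk(&pvmw)) {
		/* the pte at pvmw.address is cleared and unmapped here */
		ret = try_to_unmap_one_page(folio, vma, range, pvmw,
					    address, flags);
		count++;

		/* only now do we look at the *next* address */
		if (pvmw_walk_skip_or_end_on_next(&pvmw)) {
			folio_remove_rmap_and_update_count(folio, start,
							   vma, count);
			count = 0;
			start = NULL;
		}
	}

So when pvmw->address + PAGE_SIZE equals vma_address_end(pvmw), the pte
at pvmw->address has already been handled and returning true here only
flushes the current batch. Thanks.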
Regards
Yin, Fengwei
>
> I may have misunderstood, please correct me.
>
> > +
> > + if ((address & (PMD_SIZE - PAGE_SIZE)) == 0)
> > + return true;
> > +
> > + if (pte_none(*pvmw->pte))
> > + return true;
> > +
> > + pvmw->pte++;
> > + if (!check_pte(pvmw)) {
> > + pvmw->pte--;
> > + return true;
> > + }
> > + pvmw->pte--;
> > +
> > + return false;
> > +}
> > +
> > /**
> > * page_mapped_in_vma - check whether a page is really mapped in
> > a VMA
> > * @page: the page to test
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index bb3fcb8df579..a64e9cbb52dd 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1741,6 +1741,26 @@ static bool try_to_unmap_one_page(struct
> > folio *folio,
> > return false;
> > }
> >
> > +static void folio_remove_rmap_and_update_count(struct folio
> > *folio,
> > + struct page *start, struct vm_area_struct *vma, int
> > count)
> > +{
> > + if (count == 0)
> > + return;
> > +
> > + /*
> > + * No need to call mmu_notifier_invalidate_range() it has
> > be
> > + * done above for all cases requiring it to happen under
> > page
> > + * table lock before mmu_notifier_invalidate_range_end()
> > + *
> > + * See Documentation/mm/mmu_notifier.rst
> > + */
> > + folio_remove_rmap_range(folio, start, count, vma,
> > + folio_test_hugetlb(folio));
> > + if (vma->vm_flags & VM_LOCKED)
> > + mlock_drain_local();
> > + folio_ref_sub(folio, count);
> > +}
> > +
> > /*
> > * @arg: enum ttu_flags will be passed to this argument
> > */
> > @@ -1748,10 +1768,11 @@ static bool try_to_unmap_one(struct folio
> > *folio, struct vm_area_struct *vma,
> > unsigned long address, void *arg)
> > {
> > DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> > - struct page *subpage;
> > + struct page *start = NULL;
> > bool ret = true;
> > struct mmu_notifier_range range;
> > enum ttu_flags flags = (enum ttu_flags)(long)arg;
> > + int count = 0;
> >
> > /*
> > * When racing against e.g. zap_pte_range() on another cpu,
> > @@ -1812,26 +1833,31 @@ static bool try_to_unmap_one(struct folio
> > *folio, struct vm_area_struct *vma,
> > break;
> > }
> >
> > - subpage = folio_page(folio,
> > + if (!start)
> > + start = folio_page(folio,
> > pte_pfn(*pvmw.pte) -
> > folio_pfn(folio));
> > ret = try_to_unmap_one_page(folio, vma,
> > range, pvmw,
> > address, flags);
> > if (!ret) {
> > + folio_remove_rmap_and_update_count(folio,
> > + start, vma,
> > count);
> > page_vma_mapped_walk_done(&pvmw);
> > break;
> > }
> > + count++;
> >
> > /*
> > - * No need to call mmu_notifier_invalidate_range()
> > it has be
> > - * done above for all cases requiring it to happen
> > under page
> > - * table lock before
> > mmu_notifier_invalidate_range_end()
> > - *
> > - * See Documentation/mm/mmu_notifier.rst
> > + * If next pte will be skipped in
> > page_vma_mapped_walk() or
> > + * the walk will end at it, batched remove rmap and
> > update
> > + * page refcount. We can't do it after
> > page_vma_mapped_walk()
> > + * return false because the pte lock will not be
> > hold.
> > */
> > - page_remove_rmap(subpage, vma, false);
> > - if (vma->vm_flags & VM_LOCKED)
> > - mlock_drain_local();
> > - folio_put(folio);
> > + if (pvmw_walk_skip_or_end_on_next(&pvmw)) {
> > + folio_remove_rmap_and_update_count(folio,
> > + start, vma,
> > count);
> > + count = 0;
> > + start = NULL;
> > + }
> > }
> >
> > mmu_notifier_invalidate_range_end(&range);
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function
2023-03-06 9:22 ` [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function Yin Fengwei
@ 2023-03-08 21:38 ` Mike Kravetz
2023-03-09 5:13 ` Yin, Fengwei
0 siblings, 1 reply; 13+ messages in thread
From: Mike Kravetz @ 2023-03-08 21:38 UTC (permalink / raw)
To: Yin Fengwei
Cc: linux-mm, akpm, willy, sidhartha.kumar, naoya.horiguchi, jane.chu, david
On 03/06/23 17:22, Yin Fengwei wrote:
> It's to prepare the batched rmap update for large folio.
> No need to looped handle hugetlb. Just handle hugetlb and
> bail out early.
>
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
> mm/rmap.c | 200 +++++++++++++++++++++++++++++++++---------------------
> 1 file changed, 121 insertions(+), 79 deletions(-)
Looks good,
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
A few nits below.
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ba901c416785..508d141dacc5 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1441,6 +1441,103 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
> munlock_vma_folio(folio, vma, compound);
> }
>
> +static bool try_to_unmap_one_hugetlb(struct folio *folio,
> + struct vm_area_struct *vma, struct mmu_notifier_range range,
> + struct page_vma_mapped_walk pvmw, unsigned long address,
> + enum ttu_flags flags)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pte_t pteval;
> + bool ret = true, anon = folio_test_anon(folio);
> +
> + /*
> + * The try_to_unmap() is only passed a hugetlb page
> + * in the case where the hugetlb page is poisoned.
> + */
> + VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
> + /*
> + * huge_pmd_unshare may unmap an entire PMD page.
> + * There is no way of knowing exactly which PMDs may
> + * be cached for this mm, so we must flush them all.
> + * start/end were already adjusted above to cover this
nit, start/end are adjusted in the caller (try_to_unmap_one), not above.
> + * range.
> + */
> + flush_cache_range(vma, range.start, range.end);
> +
> + /*
> + * To call huge_pmd_unshare, i_mmap_rwsem must be
> + * held in write mode. Caller needs to explicitly
> + * do this outside rmap routines.
> + *
> + * We also must hold hugetlb vma_lock in write mode.
> + * Lock order dictates acquiring vma_lock BEFORE
> + * i_mmap_rwsem. We can only try lock here and fail
> + * if unsuccessful.
> + */
> + if (!anon) {
> + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> + if (!hugetlb_vma_trylock_write(vma)) {
> + ret = false;
> + goto out;
> + }
> + if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> + hugetlb_vma_unlock_write(vma);
> + flush_tlb_range(vma,
> + range.start, range.end);
> + mmu_notifier_invalidate_range(mm,
> + range.start, range.end);
> + /*
> + * The ref count of the PMD page was
> + * dropped which is part of the way map
> + * counting is done for shared PMDs.
> + * Return 'true' here. When there is
> + * no other sharing, huge_pmd_unshare
> + * returns false and we will unmap the
> + * actual page and drop map count
> + * to zero.
> + */
> + goto out;
> + }
> + hugetlb_vma_unlock_write(vma);
> + }
> + pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> +
> + /*
> + * Now the pte is cleared. If this pte was uffd-wp armed,
> + * we may want to replace a none pte with a marker pte if
> + * it's file-backed, so we don't lose the tracking info.
> + */
> + pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> +
> + /* Set the dirty flag on the folio now the pte is gone. */
> + if (pte_dirty(pteval))
nit, technically, I suppose this should be huge_pte_dirty but it really is
the same as pte_dirty which is why it works in current code.
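i.e. a minimal sketch of the substitution (untested):

        if (huge_pte_dirty(pteval))
                folio_mark_dirty(folio);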
> + folio_mark_dirty(folio);
> +
> + /* Update high watermark before we lower rss */
> + update_hiwater_rss(mm);
> +
> + /* Poisoned hugetlb folio with TTU_HWPOISON always cleared in flags */
> + pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
> + set_huge_pte_at(mm, address, pvmw.pte, pteval);
> + hugetlb_count_sub(folio_nr_pages(folio), mm);
> +
> + /*
> + * No need to call mmu_notifier_invalidate_range() it has be
> + * done above for all cases requiring it to happen under page
> + * table lock before mmu_notifier_invalidate_range_end()
> + *
> + * See Documentation/mm/mmu_notifier.rst
> + */
> + page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
nit, we KNOW folio_test_hugetlb(folio) is true here so can just pass
'true'. In addition, the same call in try_to_unmap_one is now known to
always be false. No need to check folio_test_hugetlb(folio) there as
well.
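Something like the below for the two call sites (an untested sketch of the
suggestion, not code from the posted series):

        /* in try_to_unmap_one_hugetlb(): the folio is known to be hugetlb */
        page_remove_rmap(&folio->page, vma, true);

        /* in try_to_unmap_one(): hugetlb folios bail out earlier */
        page_remove_rmap(subpage, vma, false);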
--
Mike Kravetz
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function
2023-03-08 21:38 ` Mike Kravetz
@ 2023-03-09 5:13 ` Yin, Fengwei
0 siblings, 0 replies; 13+ messages in thread
From: Yin, Fengwei @ 2023-03-09 5:13 UTC (permalink / raw)
To: mike.kravetz
Cc: david, linux-mm, willy, naoya.horiguchi, sidhartha.kumar, akpm,
chu, jane
On Wed, 2023-03-08 at 13:38 -0800, Mike Kravetz wrote:
> On 03/06/23 17:22, Yin Fengwei wrote:
> > It's to prepare the batched rmap update for large folio.
> > No need to looped handle hugetlb. Just handle hugetlb and
> > bail out early.
> >
> > Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> > ---
> > mm/rmap.c | 200 +++++++++++++++++++++++++++++++++---------------------
> > 1 file changed, 121 insertions(+), 79 deletions(-)
>
> Looks good,
>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Thanks a lot for reviewing.
>
> A few nits below.
>
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index ba901c416785..508d141dacc5 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1441,6 +1441,103 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
> > munlock_vma_folio(folio, vma, compound);
> > }
> >
> > +static bool try_to_unmap_one_hugetlb(struct folio *folio,
> > + struct vm_area_struct *vma, struct mmu_notifier_range range,
> > + struct page_vma_mapped_walk pvmw, unsigned long address,
> > + enum ttu_flags flags)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + pte_t pteval;
> > + bool ret = true, anon = folio_test_anon(folio);
> > +
> > + /*
> > + * The try_to_unmap() is only passed a hugetlb page
> > + * in the case where the hugetlb page is poisoned.
> > + */
> > + VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
> > + /*
> > + * huge_pmd_unshare may unmap an entire PMD page.
> > + * There is no way of knowing exactly which PMDs may
> > + * be cached for this mm, so we must flush them all.
> > + * start/end were already adjusted above to cover this
>
> nit, start/end are adjusted in caller (try_to_unmap_one) not above.
Yes. Will update the comment.
>
> > + * range.
> > + */
> > + flush_cache_range(vma, range.start, range.end);
> > +
> > + /*
> > + * To call huge_pmd_unshare, i_mmap_rwsem must be
> > + * held in write mode. Caller needs to explicitly
> > + * do this outside rmap routines.
> > + *
> > + * We also must hold hugetlb vma_lock in write mode.
> > + * Lock order dictates acquiring vma_lock BEFORE
> > + * i_mmap_rwsem. We can only try lock here and fail
> > + * if unsuccessful.
> > + */
> > + if (!anon) {
> > + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> > + if (!hugetlb_vma_trylock_write(vma)) {
> > + ret = false;
> > + goto out;
> > + }
> > + if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> > + hugetlb_vma_unlock_write(vma);
> > + flush_tlb_range(vma,
> > + range.start, range.end);
> > + mmu_notifier_invalidate_range(mm,
> > + range.start, range.end);
> > + /*
> > + * The ref count of the PMD page was
> > + * dropped which is part of the way map
> > + * counting is done for shared PMDs.
> > + * Return 'true' here. When there is
> > + * no other sharing, huge_pmd_unshare
> > + * returns false and we will unmap the
> > + * actual page and drop map count
> > + * to zero.
> > + */
> > + goto out;
> > + }
> > + hugetlb_vma_unlock_write(vma);
> > + }
> > + pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> > +
> > + /*
> > + * Now the pte is cleared. If this pte was uffd-wp armed,
> > + * we may want to replace a none pte with a marker pte if
> > + * it's file-backed, so we don't lose the tracking info.
> > + */
> > + pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> > +
> > + /* Set the dirty flag on the folio now the pte is gone. */
> > + if (pte_dirty(pteval))
>
> nit, technically, I suppose this should be huge_pte_dirty but it really is
> the same as pte_dirty which is why it works in current code.
Yes. Will update to use huge_pte_dirty().
>
> > + folio_mark_dirty(folio);
> > +
> > + /* Update high watermark before we lower rss */
> > + update_hiwater_rss(mm);
> > +
> > + /* Poisoned hugetlb folio with TTU_HWPOISON always cleared in flags */
> > + pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
> > + set_huge_pte_at(mm, address, pvmw.pte, pteval);
> > + hugetlb_count_sub(folio_nr_pages(folio), mm);
> > +
> > + /*
> > + * No need to call mmu_notifier_invalidate_range() it has be
> > + * done above for all cases requiring it to happen under page
> > + * table lock before mmu_notifier_invalidate_range_end()
> > + *
> > + * See Documentation/mm/mmu_notifier.rst
> > + */
> > + page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
>
> nit, we KNOW folio_test_hugetlb(folio) is true here so can just pass
> 'true'. In addition, the same call in try_to_unmap_one is now known to
> always be false. No need to check folio_test_hugetlb(folio) there as well.
The "folio_test_hugetlb(folio) -> true" change was in patch 3. I tried
to apply "one patch for code moving and one patch for code change" to
make review easier. But since this patch already changes the code, I will
move the "folio_test_hugetlb(folio) -> true" change from patch 3 to this
one. Thanks.
Regards
Yin, Fengwei
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 0/5] batched remove rmap in try_to_unmap_one()
2023-03-06 21:12 ` [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Andrew Morton
2023-03-07 2:44 ` Yin, Fengwei
@ 2023-03-09 13:56 ` Yin, Fengwei
1 sibling, 0 replies; 13+ messages in thread
From: Yin, Fengwei @ 2023-03-09 13:56 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, willy, mike.kravetz, sidhartha.kumar, naoya.horiguchi,
jane.chu, david
On 3/7/2023 5:12 AM, Andrew Morton wrote:
> On Mon, 6 Mar 2023 17:22:54 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote:
>
>> This series is trying to bring the batched rmap removing to
>> try_to_unmap_one(). It's expected that the batched rmap
>> removing bring performance gain than remove rmap per page.
>>
>> ...
>>
>> include/linux/rmap.h | 5 +
>> mm/page_vma_mapped.c | 30 +++
>> mm/rmap.c | 623 +++++++++++++++++++++++++------------------
>> 3 files changed, 398 insertions(+), 260 deletions(-)
>
> As was discussed in v2's review, if no performance benefit has been
> demonstrated, why make this change?
>
I changed MADV_PAGEOUT so that it does not split large folios for the
page cache, and created a micro benchmark that is mainly as follows:
        char *c = mmap(NULL, FILESIZE, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE, fd, 0);
        count = 0;
        while (1) {
                unsigned long i;
                for (i = 0; i < FILESIZE; i += pgsize) {
                        cc = *(volatile char *)(c + i);
                }
                madvise(c, FILESIZE, MADV_PAGEOUT);
                count++;
        }
        munmap(c, FILESIZE);
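For reference, a self-contained version of this micro benchmark could look
like the sketch below. Only the mmap + touch + MADV_PAGEOUT loop comes from
the fragment above; the FILESIZE value, taking the file name from argv[1],
the error handling and the alarm() based 1 second cut-off are assumptions
added for illustration (MADV_PAGEOUT needs Linux 5.4+ and matching headers):

/*
 * bench_pageout.c - sketch only, not the exact code used for the numbers
 * below. Build with something like: gcc -O2 -o bench_pageout bench_pageout.c
 */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILESIZE        (256UL * 1024 * 1024)   /* assumed size, not from the original */

static volatile sig_atomic_t stop;

static void on_alarm(int sig)
{
        (void)sig;
        stop = 1;
}

int main(int argc, char **argv)
{
        unsigned long pgsize = sysconf(_SC_PAGESIZE);
        unsigned long count = 0, i;
        volatile char cc;
        char *c;
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file of at least FILESIZE bytes>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        c = mmap(NULL, FILESIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
        if (c == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        signal(SIGALRM, on_alarm);
        alarm(1);               /* 1 second run, as in the test description */

        while (!stop) {
                /* touch every page so the whole range is mapped ... */
                for (i = 0; i < FILESIZE; i += pgsize)
                        cc = *(volatile char *)(c + i);
                /* ... then push it out again, driving try_to_unmap_one() */
                madvise(c, FILESIZE, MADV_PAGEOUT);
                count++;
        }

        printf("%lu iterations\n", count);
        munmap(c, FILESIZE);
        close(fd);
        return 0;
}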
Run it with 96 instances + 96 files for 1 second. The test platform was
an Ice Lake machine with 48C/96T + 192G memory.
The test result (the iteration count) shows a 10% improvement with this
patch series. And perf shows the following:
Before the patch:
   --19.97%--try_to_unmap_one
             |
             |--12.35%--page_remove_rmap
             |          |
             |           --11.39%--__mod_lruvec_page_state
             |                     |
             |                     |--1.51%--__mod_memcg_lruvec_state
             |                     |          |
             |                     |           --0.91%--cgroup_rstat_updated
             |                     |
             |                      --0.70%--__mod_lruvec_state
             |                                |
             |                                 --0.63%--__mod_node_page_state
             |
             |--5.41%--ptep_clear_flush
             |          |
             |           --4.65%--flush_tlb_mm_range
             |                     |
             |                      --3.83%--flush_tlb_func
             |                                |
             |                                 --3.51%--native_flush_tlb_one_user
             |
             |--0.75%--percpu_counter_add_batch
             |
              --0.55%--PageHeadHuge
After the patch:
   --9.50%--try_to_unmap_one
             |
             |--6.94%--try_to_unmap_one_page.constprop.0.isra.0
             |          |
             |          |--5.07%--ptep_clear_flush
             |          |          |
             |          |           --4.25%--flush_tlb_mm_range
             |          |                     |
             |          |                      --3.44%--flush_tlb_func
             |          |                                |
             |          |                                 --3.05%--native_flush_tlb_one_user
             |          |
             |           --0.80%--percpu_counter_add_batch
             |
             |--1.22%--folio_remove_rmap_and_update_count.part.0
             |          |
             |           --1.16%--folio_remove_rmap_range
             |                     |
             |                      --0.62%--__mod_lruvec_page_state
             |
              --0.56%--PageHeadHuge
As expected, the cost of __mod_lruvec_page_state() is reduced a lot by the
batched folio_remove_rmap_range().
I believe the same benefit applies to the page reclaim path as well. Thanks.
Regards
Yin, Fengwei
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2023-03-09 13:58 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-06 9:22 [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
2023-03-06 9:22 ` [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function Yin Fengwei
2023-03-08 21:38 ` Mike Kravetz
2023-03-09 5:13 ` Yin, Fengwei
2023-03-06 9:22 ` [PATCH v3 2/5] rmap: move page unmap operation " Yin Fengwei
2023-03-06 9:22 ` [PATCH v3 3/5] rmap: cleanup exit path of try_to_unmap_one_page() Yin Fengwei
2023-03-06 9:22 ` [PATCH v3 4/5] rmap:addd folio_remove_rmap_range() Yin Fengwei
2023-03-06 9:22 ` [PATCH v3 5/5] try_to_unmap_one: batched remove rmap, update folio refcount Yin Fengwei
2023-03-06 12:39 ` haoxin
2023-03-07 2:45 ` Yin, Fengwei
2023-03-06 21:12 ` [PATCH v3 0/5] batched remove rmap in try_to_unmap_one() Andrew Morton
2023-03-07 2:44 ` Yin, Fengwei
2023-03-09 13:56 ` Yin, Fengwei
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox