* [PATCH v3 0/9] synchronously scan and reclaim empty user PTE pages
From: Qi Zheng @ 2024-11-14 6:59 UTC (permalink / raw)
To: david, jannh, hughd, willy, muchun.song, vbabka, akpm, peterx
Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes,
Qi Zheng
Changes in v3:
- recheck pmd state instead of pmd_same() in retract_page_tables()
(suggested by Jann Horn)
- recheck dst_pmd entry in move_pages_pte() (pointed out by Jann Horn)
- introduce new skip_none_ptes() (suggested by David Hildenbrand)
- minor changes in [PATCH v2 5/7]
- remove tlb_remove_table_sync_one() if CONFIG_PT_RECLAIM is enabled.
- use put_page() instead of free_page_and_swap_cache() in
__tlb_remove_table_one_rcu() (pointed out by Jann Horn)
- collect the Reviewed-bys and Acked-bys
- rebase onto the next-20241112
Changes in v2:
- fix [PATCH v1 1/7] (Jann Horn)
- reset force_flush and force_break to false in [PATCH v1 2/7] (Jann Horn)
- introduce zap_nonpresent_ptes() and do_zap_pte_range()
- check pte_none() instead of can_reclaim_pt after the processing of PTEs
(remove [PATCH v1 3/7] and [PATCH v1 4/7])
- reorder patches
- rebase onto the next-20241031
Changes in v1:
- replace [RFC PATCH 1/7] with a separate series (already merged into mm-unstable):
https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
(suggested by David Hildenbrand)
- squash [RFC PATCH 2/7] into [RFC PATCH 4/7]
(suggested by David Hildenbrand)
- change to scan and reclaim empty user PTE pages in zap_pte_range()
(suggested by David Hildenbrand)
- sent a separate RFC patch to track the tlb flushing issue, and remove
that part from this series ([RFC PATCH 3/7] and [RFC PATCH 6/7]).
link: https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
- add [PATCH v1 1/7] into this series
- drop RFC tag
- rebase onto the next-20241011
Changes in RFC v2:
- fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reported by
kernel test robot
- use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid()
in retract_page_tables() (in [RFC PATCH 4/7])
- rebase onto the next-20240805
Hi all,
Previously, we tried to use a completely asynchronous method to reclaim empty
user PTE pages [1]. After discussing with David Hildenbrand, we decided to
implement synchronous reclamation in the case of madvise(MADV_DONTNEED) as the
first step.
So this series aims to synchronously free the empty PTE pages in the
madvise(MADV_DONTNEED) case. We will detect and free empty PTE pages in
zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other than
madvise(MADV_DONTNEED).
In zap_pte_range(), mmu_gather is used to perform batch tlb flushing and page
freeing operations. Therefore, if we want to free the empty PTE page in this
path, the most natural way is to add it to mmu_gather as well. Now, if
CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free page table
pages by semi RCU:
- batch table freeing: asynchronous free by RCU
- single table freeing: IPI + synchronous free
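That is, roughly (a sketch of the existing mm/mmu_gather.c logic, not the
exact code):

        /* batch table freeing: asynchronous free by RCU */
        static void tlb_remove_table_free(struct mmu_table_batch *batch)
        {
                call_rcu(&batch->rcu, tlb_remove_table_rcu);
        }

        /* single table freeing: IPI + synchronous free */
        static void tlb_remove_table_one(void *table)
        {
                tlb_remove_table_sync_one();    /* IPIs all CPUs */
                __tlb_remove_table(table);      /* then frees immediately */
        }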
But this is not enough to free the empty PTE page table pages in paths
other than the munmap and exit_mmap paths, because the IPI cannot be
synchronized with rcu_read_lock() in pte_offset_map{_lock}(). So we should
let single tables also be freed by RCU, like batch tables.
As a first step, we support this feature on x86_64 and select the newly
introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
For other cases such as madvise(MADV_FREE), we may consider scanning and
freeing empty PTE pages asynchronously in the future.
This series is based on next-20241112 (which contains the series [2]).
Note: issues related to TLB flushing are not new to this series and are
tracked in the separate RFC patch [3]. For more context, please refer to
this thread [4].
Comments and suggestions are welcome!
Thanks,
Qi
[1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
[2]. https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
[3]. https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
[4]. https://lore.kernel.org/lkml/6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com/
Qi Zheng (9):
mm: khugepaged: recheck pmd state in retract_page_tables()
mm: userfaultfd: recheck dst_pmd entry in move_pages_pte()
mm: introduce zap_nonpresent_ptes()
mm: introduce skip_none_ptes()
mm: introduce do_zap_pte_range()
mm: make zap_pte_range() handle full within-PMD range
mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
x86: mm: free page table pages by RCU instead of semi RCU
x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
arch/x86/Kconfig | 1 +
arch/x86/include/asm/tlb.h | 19 +++
arch/x86/kernel/paravirt.c | 7 ++
arch/x86/mm/pgtable.c | 10 +-
include/linux/mm.h | 1 +
include/linux/mm_types.h | 4 +-
mm/Kconfig | 15 +++
mm/Makefile | 1 +
mm/internal.h | 19 +++
mm/khugepaged.c | 45 ++++---
mm/madvise.c | 7 +-
mm/memory.c | 250 +++++++++++++++++++++++++------------
mm/mmu_gather.c | 9 +-
mm/pt_reclaim.c | 71 +++++++++++
mm/userfaultfd.c | 51 +++++---
15 files changed, 397 insertions(+), 113 deletions(-)
create mode 100644 mm/pt_reclaim.c
--
2.20.1
* [PATCH v3 1/9] mm: khugepaged: recheck pmd state in retract_page_tables()
From: Qi Zheng @ 2024-11-14 6:59 UTC (permalink / raw)
To: david, jannh, hughd, willy, muchun.song, vbabka, akpm, peterx
Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes,
Qi Zheng
In retract_page_tables(), the lock on new_folio is still held, so the page
fault path will block on it, which prevents the pte entries from being set
again. So even though the old empty PTE page may be concurrently freed and
a new PTE page filled into the pmd entry, the new page is still empty and
can be removed.

So just refactor retract_page_tables() a little bit and recheck the pmd
state after taking the pmd lock.
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/khugepaged.c | 45 +++++++++++++++++++++++++++++++--------------
1 file changed, 31 insertions(+), 14 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6f8d46d107b4b..99dc995aac110 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -947,17 +947,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
return SCAN_SUCCEED;
}
-static int find_pmd_or_thp_or_none(struct mm_struct *mm,
- unsigned long address,
- pmd_t **pmd)
+static inline int check_pmd_state(pmd_t *pmd)
{
- pmd_t pmde;
+ pmd_t pmde = pmdp_get_lockless(pmd);
- *pmd = mm_find_pmd(mm, address);
- if (!*pmd)
- return SCAN_PMD_NULL;
-
- pmde = pmdp_get_lockless(*pmd);
if (pmd_none(pmde))
return SCAN_PMD_NONE;
if (!pmd_present(pmde))
@@ -971,6 +964,17 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
return SCAN_SUCCEED;
}
+static int find_pmd_or_thp_or_none(struct mm_struct *mm,
+ unsigned long address,
+ pmd_t **pmd)
+{
+ *pmd = mm_find_pmd(mm, address);
+ if (!*pmd)
+ return SCAN_PMD_NULL;
+
+ return check_pmd_state(*pmd);
+}
+
static int check_pmd_still_valid(struct mm_struct *mm,
unsigned long address,
pmd_t *pmd)
@@ -1720,7 +1724,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
pmd_t *pmd, pgt_pmd;
spinlock_t *pml;
spinlock_t *ptl;
- bool skipped_uffd = false;
+ bool success = false;
/*
* Check vma->anon_vma to exclude MAP_PRIVATE mappings that
@@ -1757,6 +1761,19 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
mmu_notifier_invalidate_range_start(&range);
pml = pmd_lock(mm, pmd);
+ /*
+ * The lock of new_folio is still held, we will be blocked in
+ * the page fault path, which prevents the pte entries from
+ * being set again. So even though the old empty PTE page may be
+ * concurrently freed and a new PTE page is filled into the pmd
+ * entry, it is still empty and can be removed.
+ *
+ * So here we only need to recheck if the state of pmd entry
+ * still meets our requirements, rather than checking pmd_same()
+ * like elsewhere.
+ */
+ if (check_pmd_state(pmd) != SCAN_SUCCEED)
+ goto drop_pml;
ptl = pte_lockptr(mm, pmd);
if (ptl != pml)
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
@@ -1770,20 +1787,20 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
* repeating the anon_vma check protects from one category,
* and repeating the userfaultfd_wp() check from another.
*/
- if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
- skipped_uffd = true;
- } else {
+ if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
pmdp_get_lockless_sync();
+ success = true;
}
if (ptl != pml)
spin_unlock(ptl);
+drop_pml:
spin_unlock(pml);
mmu_notifier_invalidate_range_end(&range);
- if (!skipped_uffd) {
+ if (success) {
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pgt_pmd);
pte_free_defer(mm, pmd_pgtable(pgt_pmd));
--
2.20.1
* [PATCH v3 2/9] mm: userfaultfd: recheck dst_pmd entry in move_pages_pte()
From: Qi Zheng @ 2024-11-14 6:59 UTC (permalink / raw)
To: david, jannh, hughd, willy, muchun.song, vbabka, akpm, peterx
Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes,
Qi Zheng
In move_pages_pte(), since dst_pte needs to be none, the subsequent
pte_same() check cannot prevent the dst_pte page from being freed
concurrently, so we also need to obtain dst_pmdval and recheck pmd_same().
Otherwise, once we support empty PTE page reclamation for anonymous pages,
it may result in moving the src_pte page into the dst_pte page that is
about to be freed by RCU.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/userfaultfd.c | 51 +++++++++++++++++++++++++++++++-----------------
1 file changed, 33 insertions(+), 18 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 60a0be33766ff..8e16dc290ddf1 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1020,6 +1020,14 @@ void double_pt_unlock(spinlock_t *ptl1,
__release(ptl2);
}
+static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
+ pte_t orig_dst_pte, pte_t orig_src_pte,
+ pmd_t *dst_pmd, pmd_t dst_pmdval)
+{
+ return pte_same(ptep_get(src_pte), orig_src_pte) &&
+ pte_same(ptep_get(dst_pte), orig_dst_pte) &&
+ pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
+}
static int move_present_pte(struct mm_struct *mm,
struct vm_area_struct *dst_vma,
@@ -1027,6 +1035,7 @@ static int move_present_pte(struct mm_struct *mm,
unsigned long dst_addr, unsigned long src_addr,
pte_t *dst_pte, pte_t *src_pte,
pte_t orig_dst_pte, pte_t orig_src_pte,
+ pmd_t *dst_pmd, pmd_t dst_pmdval,
spinlock_t *dst_ptl, spinlock_t *src_ptl,
struct folio *src_folio)
{
@@ -1034,8 +1043,8 @@ static int move_present_pte(struct mm_struct *mm,
double_pt_lock(dst_ptl, src_ptl);
- if (!pte_same(ptep_get(src_pte), orig_src_pte) ||
- !pte_same(ptep_get(dst_pte), orig_dst_pte)) {
+ if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
+ dst_pmd, dst_pmdval)) {
err = -EAGAIN;
goto out;
}
@@ -1071,6 +1080,7 @@ static int move_swap_pte(struct mm_struct *mm,
unsigned long dst_addr, unsigned long src_addr,
pte_t *dst_pte, pte_t *src_pte,
pte_t orig_dst_pte, pte_t orig_src_pte,
+ pmd_t *dst_pmd, pmd_t dst_pmdval,
spinlock_t *dst_ptl, spinlock_t *src_ptl)
{
if (!pte_swp_exclusive(orig_src_pte))
@@ -1078,8 +1088,8 @@ static int move_swap_pte(struct mm_struct *mm,
double_pt_lock(dst_ptl, src_ptl);
- if (!pte_same(ptep_get(src_pte), orig_src_pte) ||
- !pte_same(ptep_get(dst_pte), orig_dst_pte)) {
+ if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
+ dst_pmd, dst_pmdval)) {
double_pt_unlock(dst_ptl, src_ptl);
return -EAGAIN;
}
@@ -1097,13 +1107,14 @@ static int move_zeropage_pte(struct mm_struct *mm,
unsigned long dst_addr, unsigned long src_addr,
pte_t *dst_pte, pte_t *src_pte,
pte_t orig_dst_pte, pte_t orig_src_pte,
+ pmd_t *dst_pmd, pmd_t dst_pmdval,
spinlock_t *dst_ptl, spinlock_t *src_ptl)
{
pte_t zero_pte;
double_pt_lock(dst_ptl, src_ptl);
- if (!pte_same(ptep_get(src_pte), orig_src_pte) ||
- !pte_same(ptep_get(dst_pte), orig_dst_pte)) {
+ if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
+ dst_pmd, dst_pmdval)) {
double_pt_unlock(dst_ptl, src_ptl);
return -EAGAIN;
}
@@ -1136,6 +1147,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
pte_t *src_pte = NULL;
pte_t *dst_pte = NULL;
pmd_t dummy_pmdval;
+ pmd_t dst_pmdval;
struct folio *src_folio = NULL;
struct anon_vma *src_anon_vma = NULL;
struct mmu_notifier_range range;
@@ -1148,11 +1160,11 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
retry:
/*
* Use the maywrite version to indicate that dst_pte will be modified,
- * but since we will use pte_same() to detect the change of the pte
- * entry, there is no need to get pmdval, so just pass a dummy variable
- * to it.
+ * since dst_pte needs to be none, the subsequent pte_same() check
+ * cannot prevent the dst_pte page from being freed concurrently, so we
+ * also need to obtain dst_pmdval and recheck pmd_same() later.
*/
- dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr, &dummy_pmdval,
+ dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr, &dst_pmdval,
&dst_ptl);
/* Retry if a huge pmd materialized from under us */
@@ -1161,7 +1173,11 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
goto out;
}
- /* same as dst_pte */
+ /*
+ * Unlike dst_pte, the subsequent pte_same() check can ensure the
+ * stability of the src_pte page, so there is no need to get pmdval,
+ * just pass a dummy variable to it.
+ */
src_pte = pte_offset_map_rw_nolock(mm, src_pmd, src_addr, &dummy_pmdval,
&src_ptl);
@@ -1213,7 +1229,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
err = move_zeropage_pte(mm, dst_vma, src_vma,
dst_addr, src_addr, dst_pte, src_pte,
orig_dst_pte, orig_src_pte,
- dst_ptl, src_ptl);
+ dst_pmd, dst_pmdval, dst_ptl, src_ptl);
goto out;
}
@@ -1303,8 +1319,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
err = move_present_pte(mm, dst_vma, src_vma,
dst_addr, src_addr, dst_pte, src_pte,
- orig_dst_pte, orig_src_pte,
- dst_ptl, src_ptl, src_folio);
+ orig_dst_pte, orig_src_pte, dst_pmd,
+ dst_pmdval, dst_ptl, src_ptl, src_folio);
} else {
entry = pte_to_swp_entry(orig_src_pte);
if (non_swap_entry(entry)) {
@@ -1319,10 +1335,9 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
goto out;
}
- err = move_swap_pte(mm, dst_addr, src_addr,
- dst_pte, src_pte,
- orig_dst_pte, orig_src_pte,
- dst_ptl, src_ptl);
+ err = move_swap_pte(mm, dst_addr, src_addr, dst_pte, src_pte,
+ orig_dst_pte, orig_src_pte, dst_pmd,
+ dst_pmdval, dst_ptl, src_ptl);
}
out:
--
2.20.1
* [PATCH v3 3/9] mm: introduce zap_nonpresent_ptes()
From: Qi Zheng @ 2024-11-14 6:59 UTC (permalink / raw)
To: david, jannh, hughd, willy, muchun.song, vbabka, akpm, peterx
Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes,
Qi Zheng
Similar to zap_present_ptes(), let's introduce zap_nonpresent_ptes() to
handle non-present ptes, which can improve code readability.
No functional change.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
mm/memory.c | 136 ++++++++++++++++++++++++++++------------------------
1 file changed, 73 insertions(+), 63 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 209885a4134f7..bd9ebe0f4471f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1587,6 +1587,76 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
return 1;
}
+static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+ unsigned int max_nr, unsigned long addr,
+ struct zap_details *details, int *rss)
+{
+ swp_entry_t entry;
+ int nr = 1;
+
+ entry = pte_to_swp_entry(ptent);
+ if (is_device_private_entry(entry) ||
+ is_device_exclusive_entry(entry)) {
+ struct page *page = pfn_swap_entry_to_page(entry);
+ struct folio *folio = page_folio(page);
+
+ if (unlikely(!should_zap_folio(details, folio)))
+ return 1;
+ /*
+ * Both device private/exclusive mappings should only
+ * work with anonymous page so far, so we don't need to
+ * consider uffd-wp bit when zap. For more information,
+ * see zap_install_uffd_wp_if_needed().
+ */
+ WARN_ON_ONCE(!vma_is_anonymous(vma));
+ rss[mm_counter(folio)]--;
+ if (is_device_private_entry(entry))
+ folio_remove_rmap_pte(folio, page, vma);
+ folio_put(folio);
+ } else if (!non_swap_entry(entry)) {
+ /* Genuine swap entries, hence a private anon pages */
+ if (!should_zap_cows(details))
+ return 1;
+
+ nr = swap_pte_batch(pte, max_nr, ptent);
+ rss[MM_SWAPENTS] -= nr;
+ free_swap_and_cache_nr(entry, nr);
+ } else if (is_migration_entry(entry)) {
+ struct folio *folio = pfn_swap_entry_folio(entry);
+
+ if (!should_zap_folio(details, folio))
+ return 1;
+ rss[mm_counter(folio)]--;
+ } else if (pte_marker_entry_uffd_wp(entry)) {
+ /*
+ * For anon: always drop the marker; for file: only
+ * drop the marker if explicitly requested.
+ */
+ if (!vma_is_anonymous(vma) && !zap_drop_markers(details))
+ return 1;
+ } else if (is_guard_swp_entry(entry)) {
+ /*
+ * Ordinary zapping should not remove guard PTE
+ * markers. Only do so if we should remove PTE markers
+ * in general.
+ */
+ if (!zap_drop_markers(details))
+ return 1;
+ } else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) {
+ if (!should_zap_cows(details))
+ return 1;
+ } else {
+ /* We should have covered all the swap entry types */
+ pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
+ WARN_ON_ONCE(1);
+ }
+ clear_not_present_full_ptes(vma->vm_mm, addr, pte, nr, tlb->fullmm);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
+
+ return nr;
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1598,7 +1668,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
spinlock_t *ptl;
pte_t *start_pte;
pte_t *pte;
- swp_entry_t entry;
int nr;
tlb_change_page_size(tlb, PAGE_SIZE);
@@ -1611,8 +1680,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
arch_enter_lazy_mmu_mode();
do {
pte_t ptent = ptep_get(pte);
- struct folio *folio;
- struct page *page;
int max_nr;
nr = 1;
@@ -1622,8 +1689,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (need_resched())
break;
+ max_nr = (end - addr) / PAGE_SIZE;
if (pte_present(ptent)) {
- max_nr = (end - addr) / PAGE_SIZE;
nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
addr, details, rss, &force_flush,
&force_break);
@@ -1631,67 +1698,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
addr += nr * PAGE_SIZE;
break;
}
- continue;
- }
-
- entry = pte_to_swp_entry(ptent);
- if (is_device_private_entry(entry) ||
- is_device_exclusive_entry(entry)) {
- page = pfn_swap_entry_to_page(entry);
- folio = page_folio(page);
- if (unlikely(!should_zap_folio(details, folio)))
- continue;
- /*
- * Both device private/exclusive mappings should only
- * work with anonymous page so far, so we don't need to
- * consider uffd-wp bit when zap. For more information,
- * see zap_install_uffd_wp_if_needed().
- */
- WARN_ON_ONCE(!vma_is_anonymous(vma));
- rss[mm_counter(folio)]--;
- if (is_device_private_entry(entry))
- folio_remove_rmap_pte(folio, page, vma);
- folio_put(folio);
- } else if (!non_swap_entry(entry)) {
- max_nr = (end - addr) / PAGE_SIZE;
- nr = swap_pte_batch(pte, max_nr, ptent);
- /* Genuine swap entries, hence a private anon pages */
- if (!should_zap_cows(details))
- continue;
- rss[MM_SWAPENTS] -= nr;
- free_swap_and_cache_nr(entry, nr);
- } else if (is_migration_entry(entry)) {
- folio = pfn_swap_entry_folio(entry);
- if (!should_zap_folio(details, folio))
- continue;
- rss[mm_counter(folio)]--;
- } else if (pte_marker_entry_uffd_wp(entry)) {
- /*
- * For anon: always drop the marker; for file: only
- * drop the marker if explicitly requested.
- */
- if (!vma_is_anonymous(vma) &&
- !zap_drop_markers(details))
- continue;
- } else if (is_guard_swp_entry(entry)) {
- /*
- * Ordinary zapping should not remove guard PTE
- * markers. Only do so if we should remove PTE markers
- * in general.
- */
- if (!zap_drop_markers(details))
- continue;
- } else if (is_hwpoison_entry(entry) ||
- is_poisoned_swp_entry(entry)) {
- if (!should_zap_cows(details))
- continue;
} else {
- /* We should have covered all the swap entry types */
- pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
- WARN_ON_ONCE(1);
+ nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
+ addr, details, rss);
}
- clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
- zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
add_mm_rss_vec(mm, rss);
--
2.20.1
* [PATCH v3 4/9] mm: introduce skip_none_ptes()
From: Qi Zheng @ 2024-11-14 6:59 UTC (permalink / raw)
To: david, jannh, hughd, willy, muchun.song, vbabka, akpm, peterx
Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes,
Qi Zheng
This commit introduces skip_none_ptes() to skip over all consecutive none
ptes in zap_pte_range(), which helps optimize away need_resched() +
force_break + incremental pte/addr increments etc.
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/memory.c | 34 ++++++++++++++++++++++++++++++----
1 file changed, 30 insertions(+), 4 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index bd9ebe0f4471f..24633d0e1445a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1657,6 +1657,28 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
return nr;
}
+static inline int skip_none_ptes(pte_t *pte, unsigned long addr,
+ unsigned long end)
+{
+ pte_t ptent = ptep_get(pte);
+ int max_nr;
+ int nr;
+
+ if (!pte_none(ptent))
+ return 0;
+
+ max_nr = (end - addr) / PAGE_SIZE;
+ nr = 1;
+
+ for (; nr < max_nr; nr++) {
+ ptent = ptep_get(pte + nr);
+ if (!pte_none(ptent))
+ break;
+ }
+
+ return nr;
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1682,13 +1704,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte_t ptent = ptep_get(pte);
int max_nr;
- nr = 1;
- if (pte_none(ptent))
- continue;
-
if (need_resched())
break;
+ nr = skip_none_ptes(pte, addr, end);
+ if (nr) {
+ addr += PAGE_SIZE * nr;
+ if (addr == end)
+ break;
+ pte += nr;
+ }
+
max_nr = (end - addr) / PAGE_SIZE;
if (pte_present(ptent)) {
nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
--
2.20.1
* [PATCH v3 5/9] mm: introduce do_zap_pte_range()
From: Qi Zheng @ 2024-11-14 6:59 UTC (permalink / raw)
To: david, jannh, hughd, willy, muchun.song, vbabka, akpm, peterx
Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes,
Qi Zheng
This commit introduces do_zap_pte_range() to actually zap the PTEs, which
will help improve code readability and facilitate secondary checking of
the processed PTEs in the future.
No functional change.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
mm/memory.c | 39 ++++++++++++++++++++++++---------------
1 file changed, 24 insertions(+), 15 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 24633d0e1445a..bf5ac8e0b4656 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1679,6 +1679,25 @@ static inline int skip_none_ptes(pte_t *pte, unsigned long addr,
return nr;
}
+/* If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be re-installed. */
+static inline int do_zap_pte_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pte_t *pte,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details, int *rss,
+ bool *force_flush, bool *force_break)
+{
+ pte_t ptent = ptep_get(pte);
+ int max_nr = (end - addr) / PAGE_SIZE;
+
+ if (pte_present(ptent))
+ return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
+ addr, details, rss, force_flush,
+ force_break);
+
+ return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
+ details, rss);
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1701,9 +1720,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
do {
- pte_t ptent = ptep_get(pte);
- int max_nr;
-
if (need_resched())
break;
@@ -1715,18 +1731,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte += nr;
}
- max_nr = (end - addr) / PAGE_SIZE;
- if (pte_present(ptent)) {
- nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
- addr, details, rss, &force_flush,
- &force_break);
- if (unlikely(force_break)) {
- addr += nr * PAGE_SIZE;
- break;
- }
- } else {
- nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
- addr, details, rss);
+ nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
+ rss, &force_flush, &force_break);
+ if (unlikely(force_break)) {
+ addr += nr * PAGE_SIZE;
+ break;
}
} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
--
2.20.1
* [PATCH v3 6/9] mm: make zap_pte_range() handle full within-PMD range
From: Qi Zheng @ 2024-11-14 6:59 UTC (permalink / raw)
To: david, jannh, hughd, willy, muchun.song, vbabka, akpm, peterx
Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes,
Qi Zheng
In preparation for reclaiming empty PTE pages, this commit first makes
zap_pte_range() handle the full within-PMD range, so that we can more
easily detect and free PTE pages in this function in subsequent commits.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Jann Horn <jannh@google.com>
---
mm/memory.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index bf5ac8e0b4656..8b3348ff374ff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1711,6 +1711,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte_t *pte;
int nr;
+retry:
tlb_change_page_size(tlb, PAGE_SIZE);
init_rss_vec(rss);
start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
@@ -1758,6 +1759,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (force_flush)
tlb_flush_mmu(tlb);
+ if (addr != end) {
+ cond_resched();
+ force_flush = false;
+ force_break = false;
+ goto retry;
+ }
+
return addr;
}
--
2.20.1
* [PATCH v3 7/9] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
From: Qi Zheng @ 2024-11-14 6:59 UTC (permalink / raw)
To: david, jannh, hughd, willy, muchun.song, vbabka, akpm, peterx
Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes,
Qi Zheng
Nowadays, in order to pursue high performance, applications mostly use
high-performance user-mode memory allocators such as jemalloc or tcmalloc.
These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
physical memory, but neither MADV_DONTNEED nor MADV_FREE will release page
table memory, which may lead to huge page table memory usage.
The following is a memory usage snapshot of one process, as actually
observed on our server:
VIRT: 55t
RES: 590g
VmPTE: 110g
In this case, most of the page table entries are empty. For such a PTE
page where all entries are empty, we can actually free it back to the
system for others to use.
As a first step, this commit aims to synchronously free the empty PTE
pages in the madvise(MADV_DONTNEED) case. We will detect and free empty
PTE pages in zap_pte_range(), and will add zap_details.reclaim_pt to
exclude cases other than madvise(MADV_DONTNEED).
Once an empty PTE page is detected, we first try to hold the pmd lock
within the pte lock. If successful, we clear the pmd entry directly (fast
path). Otherwise, we wait until the pte lock is released, then re-hold the
pmd and pte locks and loop PTRS_PER_PTE times, checking pte_none() to
re-verify that the PTE page is empty before freeing it (slow path).
For other cases such as madvise(MADV_FREE), consider scanning and freeing
empty PTE pages asynchronously in the future.
The following code snippet shows the effect of the optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }
As we can see, the memory usage of VmPTE is reduced:

                        before          after
        VIRT            50.0 GB         50.0 GB
        RES              3.1 MB          3.1 MB
        VmPTE         102640 KB          240 KB
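For reference, a minimal C version of the above snippet might look like
this (a sketch only: the sizes match the pseudocode above, and error
handling is reduced to the bare minimum):

        #include <string.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t sz = 50UL << 30;         /* 50G mapping */
                size_t step = 2UL << 20;        /* 2M chunks */
                char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                               -1, 0);

                if (p == MAP_FAILED)
                        return 1;

                while (1) {
                        for (size_t i = 0; i < 1024 * 25; i++) {
                                memset(p + i * step, 1, step); /* touch 2M */
                                madvise(p + i * step, step, MADV_DONTNEED);
                        }
                }
        }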
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/mm.h | 1 +
mm/Kconfig | 15 ++++++++++
mm/Makefile | 1 +
mm/internal.h | 19 +++++++++++++
mm/madvise.c | 7 ++++-
mm/memory.c | 45 ++++++++++++++++++++++++++++-
mm/pt_reclaim.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 157 insertions(+), 2 deletions(-)
create mode 100644 mm/pt_reclaim.c
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca59d165f1f2e..1fcd4172d2c03 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2319,6 +2319,7 @@ extern void pagefault_out_of_memory(void);
struct zap_details {
struct folio *single_folio; /* Locked folio to be unmapped */
bool even_cows; /* Zap COWed private pages too? */
+ bool reclaim_pt; /* Need reclaim page tables? */
zap_flags_t zap_flags; /* Extra flags for zapping */
};
diff --git a/mm/Kconfig b/mm/Kconfig
index 84000b0168086..7949ab121070f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
The architecture has hardware support for userspace shadow call
stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
+config ARCH_SUPPORTS_PT_RECLAIM
+ def_bool n
+
+config PT_RECLAIM
+ bool "reclaim empty user page table pages"
+ default y
+ depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
+ select MMU_GATHER_RCU_TABLE_FREE
+ help
+ Try to reclaim empty user page table pages in paths other than munmap
+ and exit_mmap path.
+
+ Note: now only empty user PTE page table pages will be reclaimed.
+
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index dba52bb0da8ab..850386a67b3e0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -146,3 +146,4 @@ obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
obj-$(CONFIG_EXECMEM) += execmem.o
obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
+obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
diff --git a/mm/internal.h b/mm/internal.h
index 5a7302baeed7c..5b2aef61073f1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1530,4 +1530,23 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
unsigned long end, const struct mm_walk_ops *ops,
void *private);
+/* pt_reclaim.c */
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval);
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+ pmd_t pmdval);
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+ struct mmu_gather *tlb);
+
+#ifdef CONFIG_PT_RECLAIM
+bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
+ struct zap_details *details);
+#else
+static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
+ struct zap_details *details)
+{
+ return false;
+}
+#endif /* CONFIG_PT_RECLAIM */
+
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 0ceae57da7dad..49f3a75046f63 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -851,7 +851,12 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- zap_page_range_single(vma, start, end - start, NULL);
+ struct zap_details details = {
+ .reclaim_pt = true,
+ .even_cows = true,
+ };
+
+ zap_page_range_single(vma, start, end - start, &details);
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index 8b3348ff374ff..fe93b0648c430 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
static inline bool should_zap_cows(struct zap_details *details)
{
/* By default, zap all pages */
- if (!details)
+ if (!details || details->reclaim_pt)
return true;
/* Or, we zap COWed pages only if the caller wants to */
@@ -1698,6 +1698,30 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
details, rss);
}
+static inline int count_pte_none(pte_t *pte, int nr)
+{
+ int none_nr = 0;
+
+ /*
+ * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
+ * re-installed, so we need to check pte_none() one by one.
+ * Otherwise, checking a single PTE in a batch is sufficient.
+ */
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+ for (;;) {
+ if (pte_none(ptep_get(pte)))
+ none_nr++;
+ if (--nr == 0)
+ break;
+ pte++;
+ }
+#else
+ if (pte_none(ptep_get(pte)))
+ none_nr = nr;
+#endif
+ return none_nr;
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1709,6 +1733,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
spinlock_t *ptl;
pte_t *start_pte;
pte_t *pte;
+ pmd_t pmdval;
+ unsigned long start = addr;
+ bool can_reclaim_pt = reclaim_pt_is_enabled(start, end, details);
+ bool direct_reclaim = false;
+ int none_nr = 0;
int nr;
retry:
@@ -1726,6 +1755,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
nr = skip_none_ptes(pte, addr, end);
if (nr) {
+ if (can_reclaim_pt)
+ none_nr += nr;
addr += PAGE_SIZE * nr;
if (addr == end)
break;
@@ -1734,12 +1765,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
rss, &force_flush, &force_break);
+ if (can_reclaim_pt)
+ none_nr += count_pte_none(pte, nr);
if (unlikely(force_break)) {
addr += nr * PAGE_SIZE;
break;
}
} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
+ if (can_reclaim_pt && addr == end && (none_nr == PTRS_PER_PTE))
+ direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
+
add_mm_rss_vec(mm, rss);
arch_leave_lazy_mmu_mode();
@@ -1766,6 +1802,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
goto retry;
}
+ if (can_reclaim_pt) {
+ if (direct_reclaim)
+ free_pte(mm, start, tlb, pmdval);
+ else
+ try_to_free_pte(mm, pmd, start, tlb);
+ }
+
return addr;
}
diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
new file mode 100644
index 0000000000000..6540a3115dde8
--- /dev/null
+++ b/mm/pt_reclaim.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/hugetlb.h>
+#include <asm-generic/tlb.h>
+#include <asm/pgalloc.h>
+
+#include "internal.h"
+
+bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
+ struct zap_details *details)
+{
+ return details && details->reclaim_pt && (end - start >= PMD_SIZE);
+}
+
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
+{
+ spinlock_t *pml = pmd_lockptr(mm, pmd);
+
+ if (!spin_trylock(pml))
+ return false;
+
+ *pmdval = pmdp_get_lockless(pmd);
+ pmd_clear(pmd);
+ spin_unlock(pml);
+
+ return true;
+}
+
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+ pmd_t pmdval)
+{
+ pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
+ mm_dec_nr_ptes(mm);
+}
+
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+ struct mmu_gather *tlb)
+{
+ pmd_t pmdval;
+ spinlock_t *pml, *ptl;
+ pte_t *start_pte, *pte;
+ int i;
+
+ pml = pmd_lock(mm, pmd);
+ start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
+ if (!start_pte)
+ goto out_ptl;
+ if (ptl != pml)
+ spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+ /* Check if it is empty PTE page */
+ for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
+ if (!pte_none(ptep_get(pte)))
+ goto out_ptl;
+ }
+ pte_unmap(start_pte);
+
+ pmd_clear(pmd);
+
+ if (ptl != pml)
+ spin_unlock(ptl);
+ spin_unlock(pml);
+
+ free_pte(mm, addr, tlb, pmdval);
+
+ return;
+out_ptl:
+ if (start_pte)
+ pte_unmap_unlock(start_pte, ptl);
+ if (ptl != pml)
+ spin_unlock(pml);
+}
--
2.20.1
* [PATCH v3 8/9] x86: mm: free page table pages by RCU instead of semi RCU
From: Qi Zheng @ 2024-11-14 6:59 UTC (permalink / raw)
To: david, jannh, hughd, willy, muchun.song, vbabka, akpm, peterx
Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes,
Qi Zheng
Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages
will be freed by semi RCU, that is:
- batch table freeing: asynchronous free by RCU
- single table freeing: IPI + synchronous free
In this way, the page table can be traversed locklessly by disabling IRQs
in paths such as fast GUP. But this is not enough to free the empty PTE
page table pages in paths other than the munmap and exit_mmap paths,
because the IPI cannot be synchronized with rcu_read_lock() in
pte_offset_map{_lock}().
In preparation for supporting empty PTE page table page reclamation, let
single tables also be freed by RCU like batch tables. Then we can also use
pte_offset_map() etc. to prevent the PTE page from being freed.
Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free
the page table pages (see the abridged sketch after this list):
- The pt_rcu_head is unioned with pt_list and pmd_huge_pte.
- For pt_list, it is used to manage the PGD page in x86. Fortunately,
  tlb_remove_table() will not be used for freeing PGD pages, so it is safe
  to use pt_rcu_head.
- For pmd_huge_pte, it is used for THPs, so it is safe.
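The relevant union in struct ptdesc looks roughly like this (an abridged
sketch of include/linux/mm_types.h, with the remaining fields omitted):

        struct ptdesc {
                unsigned long __page_flags;

                union {
                        struct rcu_head pt_rcu_head;  /* used here for RCU freeing */
                        struct list_head pt_list;     /* s390 gmap / x86 pgd lists */
                        struct {
                                unsigned long _pt_pad_1;
                                pgtable_t pmd_huge_pte; /* THP pgtable deposit */
                        };
                };
                /* ... */
        };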
After applying this patch, if CONFIG_PT_RECLAIM is enabled, the function
call of free_pte() is as follows:
free_pte
  pte_free_tlb
    __pte_free_tlb
      ___pte_free_tlb
        paravirt_tlb_remove_table
          tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
            [no-free-memory slowpath:]
              tlb_table_invalidate
              tlb_remove_table_one
                __tlb_remove_table_one [frees via RCU]
            [fastpath:]
              tlb_table_flush
                tlb_remove_table_free [frees via RCU]
          native_tlb_remove_table [CONFIG_PARAVIRT on native]
            tlb_remove_table [see above]
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
arch/x86/include/asm/tlb.h | 19 +++++++++++++++++++
arch/x86/kernel/paravirt.c | 7 +++++++
arch/x86/mm/pgtable.c | 10 +++++++++-
include/linux/mm_types.h | 4 +++-
mm/mmu_gather.c | 9 ++++++++-
5 files changed, 46 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 580636cdc257b..d134ecf1ada06 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -34,4 +34,23 @@ static inline void __tlb_remove_table(void *table)
free_page_and_swap_cache(table);
}
+#ifdef CONFIG_PT_RECLAIM
+static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
+{
+ struct page *page;
+
+ page = container_of(head, struct page, rcu_head);
+ put_page(page);
+}
+
+static inline void __tlb_remove_table_one(void *table)
+{
+ struct page *page;
+
+ page = table;
+ call_rcu(&page->rcu_head, __tlb_remove_table_one_rcu);
+}
+#define __tlb_remove_table_one __tlb_remove_table_one
+#endif /* CONFIG_PT_RECLAIM */
+
#endif /* _ASM_X86_TLB_H */
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec3815335558..89688921ea62e 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -59,10 +59,17 @@ void __init native_pv_lock_init(void)
static_branch_enable(&virt_spin_lock_key);
}
+#ifndef CONFIG_PT_RECLAIM
static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
{
tlb_remove_page(tlb, table);
}
+#else
+static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+ tlb_remove_table(tlb, table);
+}
+#endif
struct static_key paravirt_steal_enabled;
struct static_key paravirt_steal_rq_enabled;
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 5745a354a241c..69a357b15974a 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -19,12 +19,20 @@ EXPORT_SYMBOL(physical_mask);
#endif
#ifndef CONFIG_PARAVIRT
+#ifndef CONFIG_PT_RECLAIM
static inline
void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
{
tlb_remove_page(tlb, table);
}
-#endif
+#else
+static inline
+void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+ tlb_remove_table(tlb, table);
+}
+#endif /* !CONFIG_PT_RECLAIM */
+#endif /* !CONFIG_PARAVIRT */
gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 97e2f4fe1d6c4..266f53b2bb497 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -438,7 +438,9 @@ FOLIO_MATCH(compound_head, _head_2a);
* struct ptdesc - Memory descriptor for page tables.
* @__page_flags: Same as page flags. Powerpc only.
* @pt_rcu_head: For freeing page table pages.
- * @pt_list: List of used page tables. Used for s390 and x86.
+ * @pt_list: List of used page tables. Used for s390 gmap shadow pages
+ * (which are not linked into the user page tables) and x86
+ * pgds.
* @_pt_pad_1: Padding that aliases with page's compound head.
* @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs.
* @__page_mapping: Aliases with page->mapping. Unused for page tables.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99b3e9408aa0f..1e21022bcf339 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -311,11 +311,18 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
}
}
-static void tlb_remove_table_one(void *table)
+#ifndef __tlb_remove_table_one
+static inline void __tlb_remove_table_one(void *table)
{
tlb_remove_table_sync_one();
__tlb_remove_table(table);
}
+#endif
+
+static void tlb_remove_table_one(void *table)
+{
+ __tlb_remove_table_one(table);
+}
static void tlb_table_flush(struct mmu_gather *tlb)
{
--
2.20.1
* [PATCH v3 9/9] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
From: Qi Zheng @ 2024-11-14 7:00 UTC (permalink / raw)
To: david, jannh, hughd, willy, muchun.song, vbabka, akpm, peterx
Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes,
Qi Zheng
Now that x86 fully supports the CONFIG_PT_RECLAIM feature, and reclaiming
PTE pages is profitable only on 64-bit systems, select
ARCH_SUPPORTS_PT_RECLAIM if X86_64.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e74a611bff4a6..54526ce2b1d90 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -322,6 +322,7 @@ config X86
select FUNCTION_ALIGNMENT_4B
imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI
select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
+ select ARCH_SUPPORTS_PT_RECLAIM if X86_64
config INSTRUCTION_DECODER
def_bool y
--
2.20.1
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
From: David Hildenbrand @ 2024-11-14 8:04 UTC (permalink / raw)
To: Qi Zheng, jannh, hughd, willy, muchun.song, vbabka, akpm, peterx
Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, unsigned long end,
> @@ -1682,13 +1704,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> pte_t ptent = ptep_get(pte);
> int max_nr;
>
> - nr = 1;
> - if (pte_none(ptent))
> - continue;
> -
> if (need_resched())
> break;
>
> + nr = skip_none_ptes(pte, addr, end);
> + if (nr) {
> + addr += PAGE_SIZE * nr;
> + if (addr == end)
> + break;
> + pte += nr;
> + }
> +
> max_nr = (end - addr) / PAGE_SIZE;
I dislike calculating max_nr twice, once here and once in skip_none_ptes().
Further, you're not updating ptent here. If you inline it you can
avoid another ptep_get().
> if (pte_present(ptent)) {
> nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
--
Cheers,
David / dhildenb
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
From: Qi Zheng @ 2024-11-14 9:20 UTC (permalink / raw)
To: David Hildenbrand
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 2024/11/14 16:04, David Hildenbrand wrote:
>
>> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> struct vm_area_struct *vma, pmd_t *pmd,
>> unsigned long addr, unsigned long end,
>> @@ -1682,13 +1704,17 @@ static unsigned long zap_pte_range(struct
>> mmu_gather *tlb,
>> pte_t ptent = ptep_get(pte);
>> int max_nr;
>> - nr = 1;
>> - if (pte_none(ptent))
>> - continue;
>> -
>> if (need_resched())
>> break;
>> + nr = skip_none_ptes(pte, addr, end);
>> + if (nr) {
>> + addr += PAGE_SIZE * nr;
>> + if (addr == end)
>> + break;
>> + pte += nr;
>> + }
>> +
>> max_nr = (end - addr) / PAGE_SIZE;
>
> I dislike calculating max_nr twice, once here and once in skip_none_ptes().
>
> Further, you're not updating ptent here.
Oh, my bad. However, with [PATCH v3 5/9], there will be no problem, but
there are still two ptep_get() calls and two max_nr calculations.
> If you inline it you can
> avoid another ptep_get().
Do you mean to inline the skip_none_ptes() into do_zap_pte_range()? This
will cause these none ptes to be checked again in count_pte_none() when
PTE_MARKER_UFFD_WP is enabled. Of course, maybe we can count none ptes
in do_zap_pte_range() and pass it through function parameters like
force_break.
>
>> if (pte_present(ptent)) {
>> nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
>
>
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
From: David Hildenbrand @ 2024-11-14 12:32 UTC (permalink / raw)
To: Qi Zheng
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 14.11.24 10:20, Qi Zheng wrote:
>
>
> On 2024/11/14 16:04, David Hildenbrand wrote:
>>
>>> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>> struct vm_area_struct *vma, pmd_t *pmd,
>>> unsigned long addr, unsigned long end,
>>> @@ -1682,13 +1704,17 @@ static unsigned long zap_pte_range(struct
>>> mmu_gather *tlb,
>>> pte_t ptent = ptep_get(pte);
>>> int max_nr;
>>> - nr = 1;
>>> - if (pte_none(ptent))
>>> - continue;
>>> -
>>> if (need_resched())
>>> break;
>>> + nr = skip_none_ptes(pte, addr, end);
>>> + if (nr) {
>>> + addr += PAGE_SIZE * nr;
>>> + if (addr == end)
>>> + break;
>>> + pte += nr;
>>> + }
>>> +
>>> max_nr = (end - addr) / PAGE_SIZE;
>>
>> I dislike calculating max_nr twice, once here and once in skip_none_ptes().
>>
>> Further, you're not updating ptent here.
>
> Oh, my bad. However, with [PATCH v3 5/9], there will be no problem, but
> there are still two ptep_get() calls and two max_nr calculations.
>
>> If you inline it you can
>> avoid another ptep_get().
>
> Do you mean to inline the skip_none_ptes() into do_zap_pte_range()?
Effectively moving this patch after #5, and have it be something like:
diff --git a/mm/memory.c b/mm/memory.c
index 1949f5e0fece5..4f5d1e4c6688e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1667,8 +1667,21 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
pte_t ptent = ptep_get(pte);
int max_nr = (end - addr) / PAGE_SIZE;
- if (pte_none(ptent))
- return 1;
+ /* Skip all consecutive pte_none(). */
+ if (pte_none(ptent)) {
+ int nr;
+
+ for (nr = 1; nr < max_nr; nr++) {
+ ptent = ptep_get(pte + nr);
+ if (!pte_none(ptent))
+ break;
+ }
+ max_nr -= nr;
+ if (!max_nr)
+ return nr;
+ pte += nr;
+ addr += nr * PAGE_SIZE;
+ }
if (pte_present(ptent))
return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
In the context of this patch this makes most sense.
Regarding "count_pte_none" comment, I assume you talk about patch #7.
Can't you simply return the number of pte_none that you skipped here using another
input variable, if really required?
--
Cheers,
David / dhildenb
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
From: Qi Zheng @ 2024-11-14 12:51 UTC (permalink / raw)
To: David Hildenbrand
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 2024/11/14 20:32, David Hildenbrand wrote:
> On 14.11.24 10:20, Qi Zheng wrote:
>>
>>
>> On 2024/11/14 16:04, David Hildenbrand wrote:
>>>
>>>> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>> struct vm_area_struct *vma, pmd_t *pmd,
>>>> unsigned long addr, unsigned long end,
>>>> @@ -1682,13 +1704,17 @@ static unsigned long zap_pte_range(struct
>>>> mmu_gather *tlb,
>>>> pte_t ptent = ptep_get(pte);
>>>> int max_nr;
>>>> - nr = 1;
>>>> - if (pte_none(ptent))
>>>> - continue;
>>>> -
>>>> if (need_resched())
>>>> break;
>>>> + nr = skip_none_ptes(pte, addr, end);
>>>> + if (nr) {
>>>> + addr += PAGE_SIZE * nr;
>>>> + if (addr == end)
>>>> + break;
>>>> + pte += nr;
>>>> + }
>>>> +
>>>> max_nr = (end - addr) / PAGE_SIZE;
>>>
>>> I dislike calculating max_nr twice, once here and once in skip_none_ptes().
>>>
>>> Further, you're not updating ptent here.
>>
>> Oh, my bad. However, with [PATCH v3 5/9], there will be no problem, but
>> there are still two ptep_get() calls and two max_nr calculations.
>>
>>> If you inline it you can
>>> avoid another ptep_get().
>>
>> Do you mean to inline the skip_none_ptes() into do_zap_pte_range()?
>
> Effectively moving this patch after #5, and have it be something like:
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 1949f5e0fece5..4f5d1e4c6688e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1667,8 +1667,21 @@ static inline int do_zap_pte_range(struct
> mmu_gather *tlb,
> pte_t ptent = ptep_get(pte);
> int max_nr = (end - addr) / PAGE_SIZE;
>
> - if (pte_none(ptent))
> - return 1;
> + /* Skip all consecutive pte_none(). */
> + if (pte_none(ptent)) {
> + int nr;
> +
> + for (nr = 1; nr < max_nr; nr++) {
> + ptent = ptep_get(pte + nr);
> + if (!pte_none(ptent))
> + break;
> + }
> + max_nr -= nr;
> + if (!max_nr)
> + return nr;
> + pte += nr;
> + addr += nr * PAGE_SIZE;
> + }
>
> if (pte_present(ptent))
> return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
>
>
> In the context of this patch this makes most sense.
>
> Regarding "count_pte_none" comment, I assume you talk about patch #7.
Yes.
>
> Can't you simply return the number of pte_none that you skipped here
> using another
> input variable, if really required?
Suppose we add an input variable nr_skip to do_zap_pte_range(), you mean
to return the above nr to zap_pte_range() through:

        *nr_skip = nr;

and then:

        zap_pte_range
        --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, &nr_skip,
                                  rss, &force_flush, &force_break);
            if (can_reclaim_pt) {
                    none_nr += count_pte_none(pte, nr);
                    none_nr += nr_skip;
            }

Right?
>
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
From: David Hildenbrand @ 2024-11-14 21:19 UTC (permalink / raw)
To: Qi Zheng
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 14.11.24 13:51, Qi Zheng wrote:
>
>
> On 2024/11/14 20:32, David Hildenbrand wrote:
>> On 14.11.24 10:20, Qi Zheng wrote:
>>>
>>>
>>> On 2024/11/14 16:04, David Hildenbrand wrote:
>>>>
>>>>> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>> struct vm_area_struct *vma, pmd_t *pmd,
>>>>> unsigned long addr, unsigned long end,
>>>>> @@ -1682,13 +1704,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>> pte_t ptent = ptep_get(pte);
>>>>> int max_nr;
>>>>> - nr = 1;
>>>>> - if (pte_none(ptent))
>>>>> - continue;
>>>>> -
>>>>> if (need_resched())
>>>>> break;
>>>>> + nr = skip_none_ptes(pte, addr, end);
>>>>> + if (nr) {
>>>>> + addr += PAGE_SIZE * nr;
>>>>> + if (addr == end)
>>>>> + break;
>>>>> + pte += nr;
>>>>> + }
>>>>> +
>>>>> max_nr = (end - addr) / PAGE_SIZE;
>>>>
>>>> I dislike calculating max_nr twice, once here and once in skip_none_ptes().
>>>>
>>>> Further, you're not updating ptent here.
>>>
>>> Oh, my bad. However, with [PATCH v3 5/9], there will be no problem, but
>>> there are still two ptep_get() calls and two max_nr calculations.
>>>
>>>> If you inline it you can avoid another ptep_get().
>>>
>>> Do you mean to inline the skip_none_ptes() into do_zap_pte_range()?
>>
>> Effectively moving this patch after #5, and have it be something like:
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 1949f5e0fece5..4f5d1e4c6688e 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1667,8 +1667,21 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
>> pte_t ptent = ptep_get(pte);
>> int max_nr = (end - addr) / PAGE_SIZE;
>>
>> - if (pte_none(ptent))
>> - return 1;
>> + /* Skip all consecutive pte_none(). */
>> + if (pte_none(ptent)) {
>> + int nr;
>> +
>> + for (nr = 1; nr < max_nr; nr++) {
>> + ptent = ptep_get(pte + nr);
>> + if (!pte_none(ptent))
>> + break;
>> + }
>> + max_nr -= nr;
>> + if (!max_nr)
>> + return nr;
>> + pte += nr;
>> + addr += nr * PAGE_SIZE;
>> + }
>>
>> if (pte_present(ptent))
>> return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
>>
>>
>> In the context of this patch this makes most sense.
>>
>> Regarding the "count_pte_none" comment, I assume you're talking about patch #7.
>
> Yes.
>
>>
>> Can't you simply return the number of pte_none that you skipped here
>> using another
>> input variable, if really required?
>
> Suppose we add an input variable nr_skip to do_zap_pte_range(), you mean
> to return the above nr to zap_pte_range() through:
Maybe "cur_none_nr" or something similar.
>
> *nr_skip = nr;
>
> and then:
>
> zap_pte_range
> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, &skip_nr,
> rss, &force_flush, &force_break);
> if (can_reclaim_pt) {
> none_nr += count_pte_none(pte, nr);
> none_nr += nr_skip;
> }
>
> Right?
Yes. I did not look closely at the patch that adds the counting of
pte_none though (to digest why it is required :) ).
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
2024-11-14 21:19 ` David Hildenbrand
@ 2024-11-15 3:03 ` Qi Zheng
2024-11-15 10:22 ` David Hildenbrand
0 siblings, 1 reply; 28+ messages in thread
From: Qi Zheng @ 2024-11-15 3:03 UTC (permalink / raw)
To: David Hildenbrand
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 2024/11/15 05:19, David Hildenbrand wrote:
> On 14.11.24 13:51, Qi Zheng wrote:
>>
>>
>> On 2024/11/14 20:32, David Hildenbrand wrote:
>>> On 14.11.24 10:20, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2024/11/14 16:04, David Hildenbrand wrote:
>>>>>
>>>>>> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>>> struct vm_area_struct *vma, pmd_t *pmd,
>>>>>> unsigned long addr, unsigned long end,
>>>>>> @@ -1682,13 +1704,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>>> pte_t ptent = ptep_get(pte);
>>>>>> int max_nr;
>>>>>> - nr = 1;
>>>>>> - if (pte_none(ptent))
>>>>>> - continue;
>>>>>> -
>>>>>> if (need_resched())
>>>>>> break;
>>>>>> + nr = skip_none_ptes(pte, addr, end);
>>>>>> + if (nr) {
>>>>>> + addr += PAGE_SIZE * nr;
>>>>>> + if (addr == end)
>>>>>> + break;
>>>>>> + pte += nr;
>>>>>> + }
>>>>>> +
>>>>>> max_nr = (end - addr) / PAGE_SIZE;
>>>>>
>>>>> I dislike calculating max_nr twice, once here and once in
>>>>> skip_none_ptes().
>>>>>
>>>>> Further, you're not updating ptent here.
>>>>
>>>> Oh, my bad. However, with [PATCH v3 5/9], there will be no problem, but
>>>> there are still two ptep_get() calls and two max_nr calculations.
>>>>
>>>>> If you inline it you can avoid another ptep_get().
>>>>
>>>> Do you mean to inline the skip_none_ptes() into do_zap_pte_range()?
>>>
>>> Effectively moving this patch after #5, and have it be something like:
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 1949f5e0fece5..4f5d1e4c6688e 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -1667,8 +1667,21 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
>>> pte_t ptent = ptep_get(pte);
>>> int max_nr = (end - addr) / PAGE_SIZE;
>>>
>>> - if (pte_none(ptent))
>>> - return 1;
>>> + /* Skip all consecutive pte_none(). */
>>> + if (pte_none(ptent)) {
>>> + int nr;
>>> +
>>> + for (nr = 1; nr < max_nr; nr++) {
>>> + ptent = ptep_get(pte + nr);
>>> + if (!pte_none(ptent))
>>> + break;
>>> + }
>>> + max_nr -= nr;
>>> + if (!max_nr)
>>> + return nr;
>>> + pte += nr;
>>> + addr += nr * PAGE_SIZE;
>>> + }
>>>
>>> if (pte_present(ptent))
>>> return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
>>>
>>>
>>> In the context of this patch this makes most sense.
>>>
>>> Regarding the "count_pte_none" comment, I assume you're talking about patch #7.
>>
>> Yes.
>>
>>>
>>> Can't you simply return the number of pte_none that you skipped here
>>> using another
>>> input variable, if really required?
>>
>> Suppose we add an input variable nr_skip to do_zap_pte_range(), you mean
>> to return the above nr to zap_pte_range() through:
>
> Maybe "cur_none_nr" or something similar.
OK.
>
>>
>> *nr_skip = nr;
>>
>> and then:
>>
>> zap_pte_range
>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, &skip_nr,
>> rss, &force_flush, &force_break);
>> if (can_reclaim_pt) {
>> none_nr += count_pte_none(pte, nr);
>> none_nr += nr_skip;
>> }
>>
>> Right?
>
> Yes. I did not look closely at the patch that adds the counting of
Got it.
> pte_none though (to digest why it is required :) ).
Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
empty PTE page.
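i.e., roughly this check once the scan of the range finishes (a sketch of the patch #7 logic as described here; try_to_free_pte() is an assumed helper name, not necessarily the final one):

```
/* Reclaim the PTE page only if every slot was observed to be none. */
if (can_reclaim_pt && none_nr == PTRS_PER_PTE)
	try_to_free_pte(mm, pmd, start, details);
```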
Looking forward to more of your review feedback on this series.
Thanks!
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
2024-11-15 3:03 ` Qi Zheng
@ 2024-11-15 10:22 ` David Hildenbrand
2024-11-15 14:41 ` Qi Zheng
0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-11-15 10:22 UTC (permalink / raw)
To: Qi Zheng
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
>>> *nr_skip = nr;
>>>
>>> and then:
>>>
>>> zap_pte_range
>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, &skip_nr,
>>> rss, &force_flush, &force_break);
>>> if (can_reclaim_pt) {
>>> none_nr += count_pte_none(pte, nr);
>>> none_nr += nr_skip;
>>> }
>>>
>>> Right?
>>
>> Yes. I did not look closely at the patch that adds the counting of
>
> Got it.
>
>> pte_none though (to digest why it is required :) ).
>
> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
> empty PTE page.
Okay, so the problem is that "nr" would be "all processed entries" but
there are cases where we "process an entry but not zap it".
What you really only want to know is "was any entry not zapped", which
could be a simple input boolean variable passed into do_zap_pte_range?
Because as soon as any entry was processed but not zapped, you can
immediately give up on reclaiming that table.
>
> Looking forward to more of your review feedback.
Thanks for all your hard work on this, I'm only able to make slow
progress because I keep getting distracted by all different kinds of
things :(
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
2024-11-15 10:22 ` David Hildenbrand
@ 2024-11-15 14:41 ` Qi Zheng
2024-11-15 14:59 ` David Hildenbrand
0 siblings, 1 reply; 28+ messages in thread
From: Qi Zheng @ 2024-11-15 14:41 UTC (permalink / raw)
To: David Hildenbrand
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 2024/11/15 18:22, David Hildenbrand wrote:
>>>> *nr_skip = nr;
>>>>
>>>> and then:
>>>>
>>>> zap_pte_range
>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, &skip_nr,
>>>> rss, &force_flush, &force_break);
>>>> if (can_reclaim_pt) {
>>>> none_nr += count_pte_none(pte, nr);
>>>> none_nr += nr_skip;
>>>> }
>>>>
>>>> Right?
>>>
>>> Yes. I did not look closely at the patch that adds the counting of
>>
>> Got it.
>>
>>> pte_none though (to digest why it is required :) ).
>>
>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>> empty PTE page.
>
> Okay, so the problem is that "nr" would be "all processed entries" but
> there are cases where we "process an entry but not zap it".
>
> What you really only want to know is "was any entry not zapped", which
> could be a simple input boolean variable passed into do_zap_pte_range?
>
> Because as soon as any entry was processed but no zapped, you can
> immediately give up on reclaiming that table.
Yes, we can set can_reclaim_pt to false when a !pte_none() entry is
found in count_pte_none().
>
>>
>> Looking forward to more of your review feedback.
>
> Thanks for all your hard work on this, I'm only able to make slow
> progress because I keep getting distracted by all different kinds of
> things :(
It's OK, just take your time. :)
Thanks!
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
2024-11-15 14:41 ` Qi Zheng
@ 2024-11-15 14:59 ` David Hildenbrand
2024-11-18 3:35 ` Qi Zheng
0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-11-15 14:59 UTC (permalink / raw)
To: Qi Zheng
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 15.11.24 15:41, Qi Zheng wrote:
>
>
> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>> *nr_skip = nr;
>>>>>
>>>>> and then:
>>>>>
>>>>> zap_pte_range
>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, &skip_nr,
>>>>> rss, &force_flush, &force_break);
>>>>> if (can_reclaim_pt) {
>>>>> none_nr += count_pte_none(pte, nr);
>>>>> none_nr += nr_skip;
>>>>> }
>>>>>
>>>>> Right?
>>>>
>>>> Yes. I did not look closely at the patch that adds the counting of
>>>
>>> Got it.
>>>
>>>> pte_none though (to digest why it is required :) ).
>>>
>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>> empty PTE page.
>>
>> Okay, so the problem is that "nr" would be "all processed entries" but
>> there are cases where we "process an entry but not zap it".
>>
>> What you really only want to know is "was any entry not zapped", which
>> could be a simple input boolean variable passed into do_zap_pte_range?
>>
>> Because as soon as any entry was processed but no zapped, you can
>> immediately give up on reclaiming that table.
>
> Yes, we can set can_reclaim_pt to false when a !pte_none() entry is
> found in count_pte_none().
I'm not sure if we'll need count_pte_none(), but I'll have to take a look
at your new patch to see how this fits together with doing the pte_none
detection+skipping in do_zap_pte_range().
I was wondering if you cannot simply avoid the additional scanning and
simply set "can_reclaim_pt" if you skip a zap.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
2024-11-15 14:59 ` David Hildenbrand
@ 2024-11-18 3:35 ` Qi Zheng
2024-11-18 9:29 ` David Hildenbrand
0 siblings, 1 reply; 28+ messages in thread
From: Qi Zheng @ 2024-11-18 3:35 UTC (permalink / raw)
To: David Hildenbrand
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 2024/11/15 22:59, David Hildenbrand wrote:
> On 15.11.24 15:41, Qi Zheng wrote:
>>
>>
>> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>>> *nr_skip = nr;
>>>>>>
>>>>>> and then:
>>>>>>
>>>>>> zap_pte_range
>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>>>>>> &skip_nr,
>>>>>> rss, &force_flush, &force_break);
>>>>>> if (can_reclaim_pt) {
>>>>>> none_nr += count_pte_none(pte, nr);
>>>>>> none_nr += nr_skip;
>>>>>> }
>>>>>>
>>>>>> Right?
>>>>>
>>>>> Yes. I did not look closely at the patch that adds the counting of
>>>>
>>>> Got it.
>>>>
>>>>> pte_none though (to digest why it is required :) ).
>>>>
>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>>> empty PTE page.
>>>
>>> Okay, so the problem is that "nr" would be "all processed entries" but
>>> there are cases where we "process an entry but not zap it".
>>>
>>> What you really only want to know is "was any entry not zapped", which
>>> could be a simple input boolean variable passed into do_zap_pte_range?
>>>
>>> Because as soon as any entry was processed but no zapped, you can
>>> immediately give up on reclaiming that table.
>>
>> Yes, we can set can_reclaim_pt to false when a !pte_none() entry is
>> found in count_pte_none().
>
> I'm not sure if we'll need count_pte_none(), but I'll have to take a look
> at your new patch to see how this fits together with doing the pte_none
> detection+skipping in do_zap_pte_range().
>
> I was wondering if you cannot simply avoid the additional scanning and
> simply set "can_reclaim_pt" if you skip a zap.
Maybe we can return the information whether the zap was skipped from
zap_present_ptes() and zap_nonpresent_ptes() through parameters like I
did in [PATCH v1 3/7] and [PATCH v1 4/7].
In theory, we can detect empty PTE pages in the following two ways:
1) If no zap is skipped, it means that all pte entries have been
zapped, and the PTE page must be empty.
2) If all pte entries are detected to be none, then the PTE page is
empty.
In the error case, 1) may cause non-empty PTE pages to be reclaimed
(which is unacceptable), while the 2) will at most cause empty PTE pages
to not be reclaimed.
So the most reliable and efficient method may be:
a. If there is a zap that is skipped, stop scanning and do not reclaim
the PTE page;
b. Otherwise, as now, detect the empty PTE page through count_pte_none()
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
2024-11-18 3:35 ` Qi Zheng
@ 2024-11-18 9:29 ` David Hildenbrand
2024-11-18 10:34 ` Qi Zheng
0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-11-18 9:29 UTC (permalink / raw)
To: Qi Zheng
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 18.11.24 04:35, Qi Zheng wrote:
>
>
> On 2024/11/15 22:59, David Hildenbrand wrote:
>> On 15.11.24 15:41, Qi Zheng wrote:
>>>
>>>
>>> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>>>> *nr_skip = nr;
>>>>>>>
>>>>>>> and then:
>>>>>>>
>>>>>>> zap_pte_range
>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>>>>>>> &skip_nr,
>>>>>>> rss, &force_flush, &force_break);
>>>>>>> if (can_reclaim_pt) {
>>>>>>> none_nr += count_pte_none(pte, nr);
>>>>>>> none_nr += nr_skip;
>>>>>>> }
>>>>>>>
>>>>>>> Right?
>>>>>>
>>>>>> Yes. I did not look closely at the patch that adds the counting of
>>>>>
>>>>> Got it.
>>>>>
>>>>>> pte_none though (to digest why it is required :) ).
>>>>>
>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>>>> empty PTE page.
>>>>
>>>> Okay, so the problem is that "nr" would be "all processed entries" but
>>>> there are cases where we "process an entry but not zap it".
>>>>
>>>> What you really only want to know is "was any entry not zapped", which
>>>> could be a simple input boolean variable passed into do_zap_pte_range?
>>>>
>>>> Because as soon as any entry was processed but not zapped, you can
>>>> immediately give up on reclaiming that table.
>>>
>>> Yes, we can set can_reclaim_pt to false when a !pte_none() entry is
>>> found in count_pte_none().
>>
>> I'm not sure if we'll need count_pte_none(), but I'll have to take a look
>> at your new patch to see how this fits together with doing the pte_none
>> detection+skipping in do_zap_pte_range().
>>
>> I was wondering if you cannot simply avoid the additional scanning and
>> simply set "can_reclaim_pt" if you skip a zap.
>
> Maybe we can return the information whether the zap was skipped from
> zap_present_ptes() and zap_nonpresent_ptes() through parameters like I
> did in [PATCH v1 3/7] and [PATCH v1 4/7].
>
> In theory, we can detect empty PTE pages in the following two ways:
>
> 1) If no zap is skipped, it means that all pte entries have been
> zapped, and the PTE page must be empty.
> 2) If all pte entries are detected to be none, then the PTE page is
> empty.
>
> In the error case, 1) may cause non-empty PTE pages to be reclaimed
> (which is unacceptable), while the 2) will at most cause empty PTE pages
> to not be reclaimed.
>
> So the most reliable and efficient method may be:
>
> a. If there is a zap that is skipped, stop scanning and do not reclaim
> the PTE page;
> b. Otherwise, as now, detect the empty PTE page through count_pte_none()
Is there a need for count_pte_none() that I am missing?
Assume we have
nr = do_zap_pte_range(&any_skipped)
If "nr" is the number of processed entries (including pte_none()), and
"any_skipped" is set whenever we skipped to zap a !pte_none entry, we
can detect what we need, no?
If any_skipped == false after the call, we now have "nr" pte_none()
entries. -> We can continue trying to reclaim
If any_skipped == true, we have at least one !pte_none() entry among the
"nr" entries. -> We cannot and must not reclaim.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
2024-11-18 9:29 ` David Hildenbrand
@ 2024-11-18 10:34 ` Qi Zheng
2024-11-18 10:41 ` David Hildenbrand
0 siblings, 1 reply; 28+ messages in thread
From: Qi Zheng @ 2024-11-18 10:34 UTC (permalink / raw)
To: David Hildenbrand
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 2024/11/18 17:29, David Hildenbrand wrote:
> On 18.11.24 04:35, Qi Zheng wrote:
>>
>>
>> On 2024/11/15 22:59, David Hildenbrand wrote:
>>> On 15.11.24 15:41, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>>>>> *nr_skip = nr;
>>>>>>>>
>>>>>>>> and then:
>>>>>>>>
>>>>>>>> zap_pte_range
>>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>>>>>>>> &skip_nr,
>>>>>>>> rss, &force_flush, &force_break);
>>>>>>>> if (can_reclaim_pt) {
>>>>>>>> none_nr += count_pte_none(pte, nr);
>>>>>>>> none_nr += nr_skip;
>>>>>>>> }
>>>>>>>>
>>>>>>>> Right?
>>>>>>>
>>>>>>> Yes. I did not look closely at the patch that adds the counting of
>>>>>>
>>>>>> Got it.
>>>>>>
>>>>>>> pte_none though (to digest why it is required :) ).
>>>>>>
>>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>>>>> empty PTE page.
>>>>>
>>>>> Okay, so the problem is that "nr" would be "all processed entries" but
>>>>> there are cases where we "process an entry but not zap it".
>>>>>
>>>>> What you really only want to know is "was any entry not zapped", which
>>>>> could be a simple input boolean variable passed into do_zap_pte_range?
>>>>>
>>>>> Because as soon as any entry was processed but not zapped, you can
>>>>> immediately give up on reclaiming that table.
>>>>
>>>> Yes, we can set can_reclaim_pt to false when a !pte_none() entry is
>>>> found in count_pte_none().
>>>
>>> I'm not sure if we'll need count_pte_none(), but I'll have to take a look
>>> at your new patch to see how this fits together with doing the pte_none
>>> detection+skipping in do_zap_pte_range().
>>>
>>> I was wondering if you cannot simply avoid the additional scanning and
>>> simply set "can_reclaim_pt" if you skip a zap.
>>
>> Maybe we can return the information whether the zap was skipped from
>> zap_present_ptes() and zap_nonpresent_ptes() through parameters like I
>> did in [PATCH v1 3/7] and [PATCH v1 4/7].
>>
>> In theory, we can detect empty PTE pages in the following two ways:
>>
>> 1) If no zap is skipped, it means that all pte entries have been
>> zapped, and the PTE page must be empty.
>> 2) If all pte entries are detected to be none, then the PTE page is
>> empty.
>>
>> In the error case, 1) may cause non-empty PTE pages to be reclaimed
>> (which is unacceptable), while the 2) will at most cause empty PTE pages
>> to not be reclaimed.
>>
>> So the most reliable and efficient method may be:
>>
>> a. If there is a zap that is skipped, stop scanning and do not reclaim
>> the PTE page;
>> b. Otherwise, as now, detect the empty PTE page through count_pte_none()
>
> Is there a need for count_pte_none() that I am missing?
When any_skipped == false, at least add VM_BUG_ON() to recheck none ptes.
>
> Assume we have
>
> nr = do_zap_pte_range(&any_skipped)
>
>
> If "nr" is the number of processed entries (including pte_none()), and
> "any_skipped" is set whenever we skipped to zap a !pte_none entry, we
> can detect what we need, no?
>
> If any_skipped == false after the call, we now have "nr" pte_none()
> entries. -> We can continue trying to reclaim
I prefer that "nr" should not include pte_none().
Something like this:
nr = do_zap_pte_range(&any_skipped);
if (can_reclaim_pt) {
VM_BUG_ON(!any_skipped && count_pte_none(pte, nr) == nr);
if (any_skipped)
can_reclaim_pt = false;
}
nr: the number of processed entries (excluding pte_none())
any_skipped: set whenever we skipped to zap a !pte_none entry
```
do_zap_pte_range
--> pte_t ptent = ptep_get(pte);
int max_nr = (end - addr) / PAGE_SIZE;
/* Skip all consecutive pte_none(). */
if (pte_none(ptent)) {
int nr;
for (nr = 1; nr < max_nr; nr++) {
ptent = ptep_get(pte + nr);
if (!pte_none(ptent))
break;
}
max_nr -= nr;
if (!max_nr)
return 0;
pte += nr;
addr += nr * PAGE_SIZE;
}
if (pte_present(ptent))
return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
addr, details, rss, force_flush,
force_break, any_skipped);
return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
details, rss, any_skipped);
```
>
> If any_skipped == true, we have at least one !pte_none() entry among the
> "nr" entries. -> We cannot and must not reclaim.
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
2024-11-18 10:34 ` Qi Zheng
@ 2024-11-18 10:41 ` David Hildenbrand
2024-11-18 10:56 ` Qi Zheng
0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-11-18 10:41 UTC (permalink / raw)
To: Qi Zheng
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 18.11.24 11:34, Qi Zheng wrote:
>
>
> On 2024/11/18 17:29, David Hildenbrand wrote:
>> On 18.11.24 04:35, Qi Zheng wrote:
>>>
>>>
>>> On 2024/11/15 22:59, David Hildenbrand wrote:
>>>> On 15.11.24 15:41, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>>>>>> *nr_skip = nr;
>>>>>>>>>
>>>>>>>>> and then:
>>>>>>>>>
>>>>>>>>> zap_pte_range
>>>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>>>>>>>>> &skip_nr,
>>>>>>>>> rss, &force_flush, &force_break);
>>>>>>>>> if (can_reclaim_pt) {
>>>>>>>>> none_nr += count_pte_none(pte, nr);
>>>>>>>>> none_nr += nr_skip;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Right?
>>>>>>>>
>>>>>>>> Yes. I did not look closely at the patch that adds the counting of
>>>>>>>
>>>>>>> Got it.
>>>>>>>
>>>>>>>> pte_none though (to digest why it is required :) ).
>>>>>>>
>>>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>>>>>> empty PTE page.
>>>>>>
>>>>>> Okay, so the problem is that "nr" would be "all processed entries" but
>>>>>> there are cases where we "process an entry but not zap it".
>>>>>>
>>>>>> What you really only want to know is "was any entry not zapped", which
>>>>>> could be a simple input boolean variable passed into do_zap_pte_range?
>>>>>>
>>>>>> Because as soon as any entry was processed but not zapped, you can
>>>>>> immediately give up on reclaiming that table.
>>>>>
>>>>> Yes, we can set can_reclaim_pt to false when a !pte_none() entry is
>>>>> found in count_pte_none().
>>>>
>>>> I'm not sure if we'll need count_pte_none(), but I'll have to take a look
>>>> at your new patch to see how this fits together with doing the pte_none
>>>> detection+skipping in do_zap_pte_range().
>>>>
>>>> I was wondering if you cannot simply avoid the additional scanning and
>>>> simply set "can_reclaim_pt" if you skip a zap.
>>>
>>> Maybe we can return the information whether the zap was skipped from
>>> zap_present_ptes() and zap_nonpresent_ptes() through parameters like I
>>> did in [PATCH v1 3/7] and [PATCH v1 4/7].
>>>
>>> In theory, we can detect empty PTE pages in the following two ways:
>>>
>>> 1) If no zap is skipped, it means that all pte entries have been
>>> zapped, and the PTE page must be empty.
>>> 2) If all pte entries are detected to be none, then the PTE page is
>>> empty.
>>>
>>> In the error case, 1) may cause non-empty PTE pages to be reclaimed
>>> (which is unacceptable), while the 2) will at most cause empty PTE pages
>>> to not be reclaimed.
>>>
>>> So the most reliable and efficient method may be:
>>>
>>> a. If there is a zap that is skipped, stop scanning and do not reclaim
>>> the PTE page;
>>> b. Otherwise, as now, detect the empty PTE page through count_pte_none()
>>
>> Is there a need for count_pte_none() that I am missing?
>
> When any_skipped == false, at least add VM_BUG_ON() to recheck none ptes.
>
>>
>> Assume we have
>>
>> nr = do_zap_pte_range(&any_skipped)
>>
>>
>> If "nr" is the number of processed entries (including pte_none()), and
>> "any_skipped" is set whenever we skipped to zap a !pte_none entry, we
>> can detect what we need, no?
>>
>> If any_skipped == false after the call, we now have "nr" pte_none()
>> entries. -> We can continue trying to reclaim
>
> I prefer that "nr" should not include pte_none().
>
Why? do_zap_pte_range() should tell you how far to advance, nothing
less, nothing more.
Let's just keep it simple and avoid count_pte_none().
I'm probably missing something important?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
2024-11-18 10:41 ` David Hildenbrand
@ 2024-11-18 10:56 ` Qi Zheng
2024-11-18 10:59 ` David Hildenbrand
0 siblings, 1 reply; 28+ messages in thread
From: Qi Zheng @ 2024-11-18 10:56 UTC (permalink / raw)
To: David Hildenbrand
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 2024/11/18 18:41, David Hildenbrand wrote:
> On 18.11.24 11:34, Qi Zheng wrote:
>>
>>
>> On 2024/11/18 17:29, David Hildenbrand wrote:
>>> On 18.11.24 04:35, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2024/11/15 22:59, David Hildenbrand wrote:
>>>>> On 15.11.24 15:41, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>>>>>>> *nr_skip = nr;
>>>>>>>>>>
>>>>>>>>>> and then:
>>>>>>>>>>
>>>>>>>>>> zap_pte_range
>>>>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>>>>>>>>>> &skip_nr,
>>>>>>>>>> rss, &force_flush, &force_break);
>>>>>>>>>> if (can_reclaim_pt) {
>>>>>>>>>> none_nr += count_pte_none(pte, nr);
>>>>>>>>>> none_nr += nr_skip;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Right?
>>>>>>>>>
>>>>>>>>> Yes. I did not look closely at the patch that adds the counting of
>>>>>>>>
>>>>>>>> Got it.
>>>>>>>>
>>>>>>>>> pte_none though (to digest why it is required :) ).
>>>>>>>>
>>>>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>>>>>>> empty PTE page.
>>>>>>>
>>>>>>> Okay, so the problem is that "nr" would be "all processed
>>>>>>> entries" but
>>>>>>> there are cases where we "process an entry but not zap it".
>>>>>>>
>>>>>>> What you really only want to know is "was any entry not zapped",
>>>>>>> which
>>>>>>> could be a simple input boolean variable passed into
>>>>>>> do_zap_pte_range?
>>>>>>>
>>>>>>> Because as soon as any entry was processed but not zapped, you can
>>>>>>> immediately give up on reclaiming that table.
>>>>>>
>>>>>> Yes, we can set can_reclaim_pt to false when a !pte_none() entry is
>>>>>> found in count_pte_none().
>>>>>
>>>>> I'm not sure if we'll need count_pte_none(), but I'll have to take a
>>>>> look
>>>>> at your new patch to see how this fits together with doing the
>>>>> pte_none
>>>>> detection+skipping in do_zap_pte_range().
>>>>>
>>>>> I was wondering if you cannot simply avoid the additional scanning and
>>>>> simply set "can_reclaim_pt" if you skip a zap.
>>>>
>>>> Maybe we can return the information whether the zap was skipped from
>>>> zap_present_ptes() and zap_nonpresent_ptes() through parameters like I
>>>> did in [PATCH v1 3/7] and [PATCH v1 4/7].
>>>>
>>>> In theory, we can detect empty PTE pages in the following two ways:
>>>>
>>>> 1) If no zap is skipped, it means that all pte entries have been
>>>> zapped, and the PTE page must be empty.
>>>> 2) If all pte entries are detected to be none, then the PTE page is
>>>> empty.
>>>>
>>>> In the error case, 1) may cause non-empty PTE pages to be reclaimed
>>>> (which is unacceptable), while the 2) will at most cause empty PTE
>>>> pages
>>>> to not be reclaimed.
>>>>
>>>> So the most reliable and efficient method may be:
>>>>
>>>> a. If there is a zap that is skipped, stop scanning and do not reclaim
>>>> the PTE page;
>>>> b. Otherwise, as now, detect the empty PTE page through
>>>> count_pte_none()
>>>
>>> Is there a need for count_pte_none() that I am missing?
>>
>> When any_skipped == false, at least add VM_BUG_ON() to recheck none ptes.
>>
>>>
>>> Assume we have
>>>
>>> nr = do_zap_pte_range(&any_skipped)
>>>
>>>
>>> If "nr" is the number of processed entries (including pte_none()), and
>>> "any_skipped" is set whenever we skipped to zap a !pte_none entry, we
>>> can detect what we need, no?
>>>
>>> If any_skipped == false after the call, we now have "nr" pte_none()
>>> entries. -> We can continue trying to reclaim
>>
>> I prefer that "nr" should not include pte_none().
>>
>
> Why? do_zap_pte_range() should tell you how far to advance, nothing
> less, nothing more.
>
> Let's just keep it simple and avoid count_pte_none().
>
> I'm probably missing something important?
As we discussed before, we should skip all consecutive none ptes;
pte and addr are already incremented before returning.
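The convention matters because of how zap_pte_range() consumes the return value (a sketch of the existing caller loop from this series):

```
do {
	/* ... */
	nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
			      rss, &force_flush, &force_break,
			      &any_skipped);
	/*
	 * If nr excluded a leading pte_none() run, a batch that is
	 * entirely none would return 0 and this loop could never
	 * advance past it; counting the skipped entries keeps nr
	 * equal to the distance to advance.
	 */
} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
```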
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
2024-11-18 10:56 ` Qi Zheng
@ 2024-11-18 10:59 ` David Hildenbrand
2024-11-18 11:13 ` Qi Zheng
0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-11-18 10:59 UTC (permalink / raw)
To: Qi Zheng
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 18.11.24 11:56, Qi Zheng wrote:
>
>
> On 2024/11/18 18:41, David Hildenbrand wrote:
>> On 18.11.24 11:34, Qi Zheng wrote:
>>>
>>>
>>> On 2024/11/18 17:29, David Hildenbrand wrote:
>>>> On 18.11.24 04:35, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 2024/11/15 22:59, David Hildenbrand wrote:
>>>>>> On 15.11.24 15:41, Qi Zheng wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>>>>>>>> *nr_skip = nr;
>>>>>>>>>>>
>>>>>>>>>>> and then:
>>>>>>>>>>>
>>>>>>>>>>> zap_pte_range
>>>>>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>>>>>>>>>>> &skip_nr,
>>>>>>>>>>> rss, &force_flush, &force_break);
>>>>>>>>>>> if (can_reclaim_pt) {
>>>>>>>>>>> none_nr += count_pte_none(pte, nr);
>>>>>>>>>>> none_nr += nr_skip;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Right?
>>>>>>>>>>
>>>>>>>>>> Yes. I did not look closely at the patch that adds the counting of
>>>>>>>>>
>>>>>>>>> Got it.
>>>>>>>>>
>>>>>>>>>> pte_none though (to digest why it is required :) ).
>>>>>>>>>
>>>>>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>>>>>>>> empty PTE page.
>>>>>>>>
>>>>>>>> Okay, so the problem is that "nr" would be "all processed
>>>>>>>> entries" but
>>>>>>>> there are cases where we "process an entry but not zap it".
>>>>>>>>
>>>>>>>> What you really only want to know is "was any entry not zapped",
>>>>>>>> which
>>>>>>>> could be a simple input boolean variable passed into
>>>>>>>> do_zap_pte_range?
>>>>>>>>
>>>>>>>> Because as soon as any entry was processed but not zapped, you can
>>>>>>>> immediately give up on reclaiming that table.
>>>>>>>
>>>>>>> Yes, we can set can_reclaim_pt to false when a !pte_none() entry is
>>>>>>> found in count_pte_none().
>>>>>>
>>>>>> I'm not sure if we'll need count_pte_none(), but I'll have to take a
>>>>>> look
>>>>>> at your new patch to see how this fits together with doing the
>>>>>> pte_none
>>>>>> detection+skipping in do_zap_pte_range().
>>>>>>
>>>>>> I was wondering if you cannot simply avoid the additional scanning and
>>>>>> simply set "can_reclaim_pt" if you skip a zap.
>>>>>
>>>>> Maybe we can return the information whether the zap was skipped from
>>>>> zap_present_ptes() and zap_nonpresent_ptes() through parameters like I
>>>>> did in [PATCH v1 3/7] and [PATCH v1 4/7].
>>>>>
>>>>> In theory, we can detect empty PTE pages in the following two ways:
>>>>>
>>>>> 1) If no zap is skipped, it means that all pte entries have been
>>>>> zapped, and the PTE page must be empty.
>>>>> 2) If all pte entries are detected to be none, then the PTE page is
>>>>> empty.
>>>>>
>>>>> In the error case, 1) may cause non-empty PTE pages to be reclaimed
>>>>> (which is unacceptable), while the 2) will at most cause empty PTE
>>>>> pages
>>>>> to not be reclaimed.
>>>>>
>>>>> So the most reliable and efficient method may be:
>>>>>
>>>>> a. If there is a zap that is skipped, stop scanning and do not reclaim
>>>>> the PTE page;
>>>>> b. Otherwise, as now, detect the empty PTE page through
>>>>> count_pte_none()
>>>>
>>>> Is there a need for count_pte_none() that I am missing?
>>>
>>> When any_skipped == false, at least add VM_BUG_ON() to recheck none ptes.
>>>
>>>>
>>>> Assume we have
>>>>
>>>> nr = do_zap_pte_range(&any_skipped)
>>>>
>>>>
>>>> If "nr" is the number of processed entries (including pte_none()), and
>>>> "any_skipped" is set whenever we skipped to zap a !pte_none entry, we
>>>> can detect what we need, no?
>>>>
>>>> If any_skipped == false after the call, we now have "nr" pte_none()
>>>> entries. -> We can continue trying to reclaim
>>>
>>> I prefer that "nr" should not include pte_none().
>>>
>>
>> Why? do_zap_pte_range() should tell you how far to advance, nothing
>> less, nothing more.
>>
>> Let's just keep it simple and avoid count_pte_none().
>>
>> I'm probably missing something important?
>
> As we discussed before, we should skip all consecutive none ptes;
> pte and addr are already incremented before returning.
It's probably best to send the resulting patch so I can either
understand why count_pte_none() is required or comment on how to get rid
of it.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
2024-11-18 10:59 ` David Hildenbrand
@ 2024-11-18 11:13 ` Qi Zheng
2024-11-19 9:55 ` David Hildenbrand
0 siblings, 1 reply; 28+ messages in thread
From: Qi Zheng @ 2024-11-18 11:13 UTC (permalink / raw)
To: David Hildenbrand
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 2024/11/18 18:59, David Hildenbrand wrote:
> On 18.11.24 11:56, Qi Zheng wrote:
>>
>>
>> On 2024/11/18 18:41, David Hildenbrand wrote:
>>> On 18.11.24 11:34, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2024/11/18 17:29, David Hildenbrand wrote:
>>>>> On 18.11.24 04:35, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/11/15 22:59, David Hildenbrand wrote:
>>>>>>> On 15.11.24 15:41, Qi Zheng wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>>>>>>>>> *nr_skip = nr;
>>>>>>>>>>>>
>>>>>>>>>>>> and then:
>>>>>>>>>>>>
>>>>>>>>>>>> zap_pte_range
>>>>>>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>>>>>>>>>>>> &skip_nr,
>>>>>>>>>>>> rss, &force_flush, &force_break);
>>>>>>>>>>>> if (can_reclaim_pt) {
>>>>>>>>>>>> none_nr += count_pte_none(pte, nr);
>>>>>>>>>>>> none_nr += nr_skip;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> Right?
>>>>>>>>>>>
>>>>>>>>>>> Yes. I did not look closely at the patch that adds the
>>>>>>>>>>> counting of
>>>>>>>>>>
>>>>>>>>>> Got it.
>>>>>>>>>>
>>>>>>>>>>> pte_none though (to digest why it is required :) ).
>>>>>>>>>>
>>>>>>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>>>>>>>>> empty PTE page.
>>>>>>>>>
>>>>>>>>> Okay, so the problem is that "nr" would be "all processed
>>>>>>>>> entries" but
>>>>>>>>> there are cases where we "process an entry but not zap it".
>>>>>>>>>
>>>>>>>>> What you really only want to know is "was any entry not zapped",
>>>>>>>>> which
>>>>>>>>> could be a simple input boolean variable passed into
>>>>>>>>> do_zap_pte_range?
>>>>>>>>>
>>>>>>>>> Because as soon as any entry was processed but not zapped, you can
>>>>>>>>> immediately give up on reclaiming that table.
>>>>>>>>
>>>>>>>> Yes, we can set can_reclaim_pt to false when a !pte_none() entry is
>>>>>>>> found in count_pte_none().
>>>>>>>
>>>>>>> I'm not sure if we'll need count_pte_none(), but I'll have to take a
>>>>>>> look
>>>>>>> at your new patch to see how this fits together with doing the
>>>>>>> pte_none
>>>>>>> detection+skipping in do_zap_pte_range().
>>>>>>>
>>>>>>> I was wondering if you cannot simply avoid the additional
>>>>>>> scanning and
>>>>>>> simply set "can_reclaim_pt" if you skip a zap.
>>>>>>
>>>>>> Maybe we can return the information whether the zap was skipped from
>>>>>> zap_present_ptes() and zap_nonpresent_ptes() through parameters
>>>>>> like I
>>>>>> did in [PATCH v1 3/7] and [PATCH v1 4/7].
>>>>>>
>>>>>> In theory, we can detect empty PTE pages in the following two ways:
>>>>>>
>>>>>> 1) If no zap is skipped, it means that all pte entries have been
>>>>>> zapped, and the PTE page must be empty.
>>>>>> 2) If all pte entries are detected to be none, then the PTE page is
>>>>>> empty.
>>>>>>
>>>>>> In the error case, 1) may cause non-empty PTE pages to be reclaimed
>>>>>> (which is unacceptable), while the 2) will at most cause empty PTE
>>>>>> pages
>>>>>> to not be reclaimed.
>>>>>>
>>>>>> So the most reliable and efficient method may be:
>>>>>>
>>>>>> a. If there is a zap that is skipped, stop scanning and do not
>>>>>> reclaim
>>>>>> the PTE page;
>>>>>> b. Otherwise, as now, detect the empty PTE page through
>>>>>> count_pte_none()
>>>>>
>>>>> Is there a need for count_pte_none() that I am missing?
>>>>
>>>> When any_skipped == false, at least add VM_BUG_ON() to recheck none
>>>> ptes.
>>>>
>>>>>
>>>>> Assume we have
>>>>>
>>>>> nr = do_zap_pte_range(&any_skipped)
>>>>>
>>>>>
>>>>> If "nr" is the number of processed entries (including pte_none()), and
>>>>> "any_skipped" is set whenever we skipped to zap a !pte_none entry, we
>>>>> can detect what we need, no?
>>>>>
>>>>> If any_skipped == false after the call, we now have "nr" pte_none()
>>>>> entries. -> We can continue trying to reclaim
>>>>
>>>> I prefer that "nr" should not include pte_none().
>>>>
>>>
>>> Why? do_zap_pte_range() should tell you how far to advance, nothing
>>> less, nothing more.
>>>
>>> Let's just keep it simple and avoid count_pte_none().
>>>
>>> I'm probably missing something important?
>>
>> As we discussed before, we should skip all consecutive none ptes;
>> pte and addr are already incremented before returning.
>
> It's probably best to send the resulting patch so I can either
> understand why count_pte_none() is required or comment on how to get rid
> of it.
Something like this:
diff --git a/mm/memory.c b/mm/memory.c
index bd9ebe0f4471f..e9bec3cd49d44 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1657,6 +1657,66 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
return nr;
}
+static inline int do_zap_pte_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pte_t *pte,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details, int *rss,
+ bool *force_flush, bool *force_break,
+ bool *any_skipped)
+{
+ pte_t ptent = ptep_get(pte);
+ int max_nr = (end - addr) / PAGE_SIZE;
+
+ /* Skip all consecutive pte_none(). */
+ if (pte_none(ptent)) {
+ int nr;
+
+ for (nr = 1; nr < max_nr; nr++) {
+ ptent = ptep_get(pte + nr);
+ if (!pte_none(ptent))
+ break;
+ }
+ max_nr -= nr;
+ if (!max_nr)
+ return 0;
+ pte += nr;
+ addr += nr * PAGE_SIZE;
+ }
+
+ if (pte_present(ptent))
+ return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
+ addr, details, rss, force_flush,
+ force_break, any_skipped);
+
+ return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
+ details, rss, any_skipped);
+}
+
+static inline int count_pte_none(pte_t *pte, int nr)
+{
+ int none_nr = 0;
+
+ /*
+ * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
+ * re-installed, so we need to check pte_none() one by one.
+ * Otherwise, checking a single PTE in a batch is sufficient.
+ */
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+ for (;;) {
+ if (pte_none(ptep_get(pte)))
+ none_nr++;
+ if (--nr == 0)
+ break;
+ pte++;
+ }
+#else
+ if (pte_none(ptep_get(pte)))
+ none_nr = nr;
+#endif
+ return none_nr;
+}
+
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1667,6 +1727,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
int rss[NR_MM_COUNTERS];
spinlock_t *ptl;
pte_t *start_pte;
+ bool can_reclaim_pt = true;
pte_t *pte;
int nr;
@@ -1679,28 +1740,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
do {
- pte_t ptent = ptep_get(pte);
- int max_nr;
-
- nr = 1;
- if (pte_none(ptent))
- continue;
+ bool any_skipped;
if (need_resched())
break;
- max_nr = (end - addr) / PAGE_SIZE;
- if (pte_present(ptent)) {
- nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
- addr, details, rss, &force_flush,
- &force_break);
- if (unlikely(force_break)) {
- addr += nr * PAGE_SIZE;
- break;
- }
- } else {
- nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
- addr, details, rss);
+ nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
+ rss, &force_flush, &force_break,
+ &any_skipped);
+ if (can_reclaim_pt) {
+ VM_BUG_ON(!any_skipped && count_pte_none(pte, nr) == nr);
+ if (any_skipped)
+ can_reclaim_pt = false;
+ }
+ if (unlikely(force_break)) {
+ addr += nr * PAGE_SIZE;
+ break;
+ }
} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
2024-11-18 11:13 ` Qi Zheng
@ 2024-11-19 9:55 ` David Hildenbrand
2024-11-19 10:03 ` Qi Zheng
0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2024-11-19 9:55 UTC (permalink / raw)
To: Qi Zheng
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 18.11.24 12:13, Qi Zheng wrote:
>
>
> On 2024/11/18 18:59, David Hildenbrand wrote:
>> On 18.11.24 11:56, Qi Zheng wrote:
>>>
>>>
>>> On 2024/11/18 18:41, David Hildenbrand wrote:
>>>> On 18.11.24 11:34, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 2024/11/18 17:29, David Hildenbrand wrote:
>>>>>> On 18.11.24 04:35, Qi Zheng wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/11/15 22:59, David Hildenbrand wrote:
>>>>>>>> On 15.11.24 15:41, Qi Zheng wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>>>>>>>>>> *nr_skip = nr;
>>>>>>>>>>>>>
>>>>>>>>>>>>> and then:
>>>>>>>>>>>>>
>>>>>>>>>>>>> zap_pte_range
>>>>>>>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>>>>>>>>>>>>> &skip_nr,
>>>>>>>>>>>>> rss, &force_flush, &force_break);
>>>>>>>>>>>>> if (can_reclaim_pt) {
>>>>>>>>>>>>> none_nr += count_pte_none(pte, nr);
>>>>>>>>>>>>> none_nr += nr_skip;
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> Right?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes. I did not look closely at the patch that adds the
>>>>>>>>>>>> counting of
>>>>>>>>>>>
>>>>>>>>>>> Got it.
>>>>>>>>>>>
>>>>>>>>>>>> pte_none though (to digest why it is required :) ).
>>>>>>>>>>>
>>>>>>>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>>>>>>>>>> empty PTE page.
>>>>>>>>>>
>>>>>>>>>> Okay, so the problem is that "nr" would be "all processed
>>>>>>>>>> entries" but
>>>>>>>>>> there are cases where we "process an entry but not zap it".
>>>>>>>>>>
>>>>>>>>>> What you really only want to know is "was any entry not zapped",
>>>>>>>>>> which
>>>>>>>>>> could be a simple input boolean variable passed into
>>>>>>>>>> do_zap_pte_range?
>>>>>>>>>>
>>>>>>>>>> Because as soon as any entry was processed but not zapped, you can
>>>>>>>>>> immediately give up on reclaiming that table.
>>>>>>>>>
>>>>>>>>> Yes, we can set can_reclaim_pt to false when a !pte_none() entry is
>>>>>>>>> found in count_pte_none().
>>>>>>>>
>>>>>>>> I'm not sure if we'll need count_pte_none(), but I'll have to take a
>>>>>>>> look
>>>>>>>> at your new patch to see how this fits together with doing the
>>>>>>>> pte_none
>>>>>>>> detection+skipping in do_zap_pte_range().
>>>>>>>>
>>>>>>>> I was wondering if you cannot simply avoid the additional
>>>>>>>> scanning and
>>>>>>>> simply set "can_reclaim_pt" if you skip a zap.
>>>>>>>
>>>>>>> Maybe we can return the information whether the zap was skipped from
>>>>>>> zap_present_ptes() and zap_nonpresent_ptes() through parameters
>>>>>>> like I
>>>>>>> did in [PATCH v1 3/7] and [PATCH v1 4/7].
>>>>>>>
>>>>>>> In theory, we can detect empty PTE pages in the following two ways:
>>>>>>>
>>>>>>> 1) If no zap is skipped, it means that all pte entries have been
>>>>>>> zapped, and the PTE page must be empty.
>>>>>>> 2) If all pte entries are detected to be none, then the PTE page is
>>>>>>> empty.
>>>>>>>
>>>>>>> In the error case, 1) may cause non-empty PTE pages to be reclaimed
>>>>>>> (which is unacceptable), while the 2) will at most cause empty PTE
>>>>>>> pages
>>>>>>> to not be reclaimed.
>>>>>>>
>>>>>>> So the most reliable and efficient method may be:
>>>>>>>
>>>>>>> a. If there is a zap that is skipped, stop scanning and do not
>>>>>>> reclaim
>>>>>>> the PTE page;
>>>>>>> b. Otherwise, as now, detect the empty PTE page through
>>>>>>> count_pte_none()
>>>>>>
>>>>>> Is there a need for count_pte_none() that I am missing?
>>>>>
>>>>> When any_skipped == false, at least add VM_BUG_ON() to recheck none
>>>>> ptes.
>>>>>
>>>>>>
>>>>>> Assume we have
>>>>>>
>>>>>> nr = do_zap_pte_range(&any_skipped)
>>>>>>
>>>>>>
>>>>>> If "nr" is the number of processed entries (including pte_none()), and
>>>>>> "any_skipped" is set whenever we skipped to zap a !pte_none entry, we
>>>>>> can detect what we need, no?
>>>>>>
>>>>>> If any_skipped == false after the call, we now have "nr" pte_none()
>>>>>> entries. -> We can continue trying to reclaim
>>>>>
>>>>> I prefer that "nr" should not include pte_none().
>>>>>
>>>>
>>>> Why? do_zap_pte_range() should tell you how far to advance, nothing
>>>> less, nothing more.
>>>>
>>>> Let's just keep it simple and avoid count_pte_none().
>>>>
>>>> I'm probably missing something important?
>>>
>>> As we discussed before, we should skip all consecutive none ptes;
>>> pte and addr are already incremented before returning.
>>
>> It's probably best to send the resulting patch so I can either
>> understand why count_pte_none() is required or comment on how to get rid
>> of it.
>
> Something like this:
>
> diff --git a/mm/memory.c b/mm/memory.c
> index bd9ebe0f4471f..e9bec3cd49d44 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1657,6 +1657,66 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
> return nr;
> }
>
> +static inline int do_zap_pte_range(struct mmu_gather *tlb,
> + struct vm_area_struct *vma, pte_t *pte,
> + struct zap_details *details, int *rss,
> + bool *force_flush, bool *force_break,
> + bool *any_skipped)
> +{
> + pte_t ptent = ptep_get(pte);
> + int max_nr = (end - addr) / PAGE_SIZE;
I'd do here:
int nr = 0;
> +
> + /* Skip all consecutive pte_none(). */
> + if (pte_none(ptent)) {
> +
instead of the "int nr" here
> + for (nr = 1; nr < max_nr; nr++) {
> + ptent = ptep_get(pte + nr);
> + if (!pte_none(ptent))
> + break;
> + }
> + max_nr -= nr;
> + if (!max_nr)
> + return 0;
> + pte += nr;
> + addr += nr * PAGE_SIZE;
> + }
> +
> + if (pte_present(ptent))
> + return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
> + addr, details, rss, force_flush,
> + force_break, any_skipped);
and here:
if (pte_present(ptent))
nr += zap_present_ptes(...)
else
nr += zap_nonpresent_ptes();
return nr;
So you really return the number of processed items -- how much the
caller should advance.
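Folding both of these comments in, the helper would end up roughly as follows (still a sketch of the direction discussed here, not the posted patch):

```
static inline int do_zap_pte_range(struct mmu_gather *tlb,
				   struct vm_area_struct *vma, pte_t *pte,
				   unsigned long addr, unsigned long end,
				   struct zap_details *details, int *rss,
				   bool *force_flush, bool *force_break,
				   bool *any_skipped)
{
	pte_t ptent = ptep_get(pte);
	int max_nr = (end - addr) / PAGE_SIZE;
	int nr = 0;

	/* Skip all consecutive pte_none(). */
	if (pte_none(ptent)) {
		for (nr = 1; nr < max_nr; nr++) {
			ptent = ptep_get(pte + nr);
			if (!pte_none(ptent))
				break;
		}
		max_nr -= nr;
		/* Even an all-none batch reports how far to advance. */
		if (!max_nr)
			return nr;
		pte += nr;
		addr += nr * PAGE_SIZE;
	}

	if (pte_present(ptent))
		nr += zap_present_ptes(tlb, vma, pte, ptent, max_nr, addr,
				       details, rss, force_flush,
				       force_break, any_skipped);
	else
		nr += zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
					  addr, details, rss, any_skipped);
	return nr;
}
```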
> +
> + return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
> + details, rss, any_skipped);
> +}
> +
> +static inline int count_pte_none(pte_t *pte, int nr)
> +{
> + int none_nr = 0;
> +
> + /*
> + * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
> + * re-installed, so we need to check pte_none() one by one.
> + * Otherwise, checking a single PTE in a batch is sufficient.
> + */
> +#ifdef CONFIG_PTE_MARKER_UFFD_WP
> + for (;;) {
> + if (pte_none(ptep_get(pte)))
> + none_nr++;
> + if (--nr == 0)
> + break;
> + pte++;
> + }
> +#else
> + if (pte_none(ptep_get(pte)))
> + none_nr = nr;
> +#endif
> + return none_nr;
> +}
> +
> +
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, unsigned long end,
> @@ -1667,6 +1727,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> int rss[NR_MM_COUNTERS];
> spinlock_t *ptl;
> pte_t *start_pte;
> + bool can_reclaim_pt = true;
> pte_t *pte;
> int nr;
>
> @@ -1679,28 +1740,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> flush_tlb_batched_pending(mm);
> arch_enter_lazy_mmu_mode();
> do {
> - pte_t ptent = ptep_get(pte);
> - int max_nr;
> -
> - nr = 1;
> - if (pte_none(ptent))
> - continue;
> + bool any_skipped;
>
> if (need_resched())
> break;
>
> - max_nr = (end - addr) / PAGE_SIZE;
> - if (pte_present(ptent)) {
> - nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
> - addr, details, rss, &force_flush,
> - &force_break);
> - if (unlikely(force_break)) {
> - addr += nr * PAGE_SIZE;
> - break;
> - }
> - } else {
> - nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
> - addr, details, rss);
> + nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
> + rss, &force_flush, &force_break,
> + &any_skipped);
> + if (can_reclaim_pt) {
> + VM_BUG_ON(!any_skipped && count_pte_none(pte, nr) == nr);
Okay, so this is really only for debugging then? So we can safely drop
that for now.
I assume it would make sense to check, when reclaiming a page table with
CONFIG_DEBUG_VM, that the table is actually all-pte_none instead?
(as a side note: no VM_BUG_ON, please :) )
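That debug-only check could then sit where the table is actually reclaimed, e.g. (a sketch; start_pte stands for the mapped start of the PTE page, and VM_WARN_ON_ONCE() is used instead of a VM_BUG_ON()):

```
#ifdef CONFIG_DEBUG_VM
	{
		int i;

		/* A PTE page about to be reclaimed must be fully none. */
		for (i = 0; i < PTRS_PER_PTE; i++)
			VM_WARN_ON_ONCE(!pte_none(ptep_get(start_pte + i)));
	}
#endif
```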
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
2024-11-19 9:55 ` David Hildenbrand
@ 2024-11-19 10:03 ` Qi Zheng
0 siblings, 0 replies; 28+ messages in thread
From: Qi Zheng @ 2024-11-19 10:03 UTC (permalink / raw)
To: David Hildenbrand
Cc: jannh, hughd, willy, muchun.song, vbabka, akpm, peterx, mgorman,
catalin.marinas, will, dave.hansen, luto, peterz, x86,
lorenzo.stoakes, linux-mm, linux-kernel, zokeefe, rientjes
On 2024/11/19 17:55, David Hildenbrand wrote:
> On 18.11.24 12:13, Qi Zheng wrote:
>>
>>
>> On 2024/11/18 18:59, David Hildenbrand wrote:
>>> On 18.11.24 11:56, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2024/11/18 18:41, David Hildenbrand wrote:
>>>>> On 18.11.24 11:34, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/11/18 17:29, David Hildenbrand wrote:
>>>>>>> On 18.11.24 04:35, Qi Zheng wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/11/15 22:59, David Hildenbrand wrote:
>>>>>>>>> On 15.11.24 15:41, Qi Zheng wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>>>>>>>>>>> *nr_skip = nr;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and then:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> zap_pte_range
>>>>>>>>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>>>>>>>>>>>>>> &skip_nr,
>>>>>>>>>>>>>> rss, &force_flush,
>>>>>>>>>>>>>> &force_break);
>>>>>>>>>>>>>> if (can_reclaim_pt) {
>>>>>>>>>>>>>> none_nr += count_pte_none(pte, nr);
>>>>>>>>>>>>>> none_nr += nr_skip;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes. I did not look closely at the patch that adds the
>>>>>>>>>>>>> counting of
>>>>>>>>>>>>
>>>>>>>>>>>> Got it.
>>>>>>>>>>>>
>>>>>>>>>>>>> pte_none though (to digest why it is required :) ).
>>>>>>>>>>>>
>>>>>>>>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>>>>>>>>>>> empty PTE page.
>>>>>>>>>>>
>>>>>>>>>>> Okay, so the problem is that "nr" would be "all processed
>>>>>>>>>>> entries" but
>>>>>>>>>>> there are cases where we "process an entry but not zap it".
>>>>>>>>>>>
>>>>>>>>>>> What you really only want to know is "was any entry not zapped",
>>>>>>>>>>> which
>>>>>>>>>>> could be a simple input boolean variable passed into
>>>>>>>>>>> do_zap_pte_range?
>>>>>>>>>>>
>>>>>>>>>>> Because as soon as any entry was processed but not zapped,
>>>>>>>>>>> you can
>>>>>>>>>>> immediately give up on reclaiming that table.
>>>>>>>>>>
>>>>>>>>>> Yes, we can set can_reclaim_pt to false when a !pte_none()
>>>>>>>>>> entry is
>>>>>>>>>> found in count_pte_none().
>>>>>>>>>
>>>>>>>>> I'm not sure if we'll need count_pte_none(), but I'll have to take a
>>>>>>>>> look
>>>>>>>>> at your new patch to see how this fits together with doing the
>>>>>>>>> pte_none
>>>>>>>>> detection+skipping in do_zap_pte_range().
>>>>>>>>>
>>>>>>>>> I was wondering if you cannot simply avoid the additional
>>>>>>>>> scanning and
>>>>>>>>> simply set "can_reclaim_pt" if you skip a zap.
>>>>>>>>
>>>>>>>> Maybe we can return the information whether the zap was skipped
>>>>>>>> from
>>>>>>>> zap_present_ptes() and zap_nonpresent_ptes() through parameters
>>>>>>>> like I
>>>>>>>> did in [PATCH v1 3/7] and [PATCH v1 4/7].
>>>>>>>>
>>>>>>>> In theory, we can detect empty PTE pages in the following two ways:
>>>>>>>>
>>>>>>>> 1) If no zap is skipped, it means that all pte entries have been
>>>>>>>> zapped, and the PTE page must be empty.
>>>>>>>> 2) If all pte entries are detected to be none, then the PTE page is
>>>>>>>> empty.
>>>>>>>>
>>>>>>>> In the error case, 1) may cause non-empty PTE pages to be reclaimed
>>>>>>>> (which is unacceptable), while the 2) will at most cause empty PTE
>>>>>>>> pages
>>>>>>>> to not be reclaimed.
>>>>>>>>
>>>>>>>> So the most reliable and efficient method may be:
>>>>>>>>
>>>>>>>> a. If there is a zap that is skipped, stop scanning and do not
>>>>>>>> reclaim
>>>>>>>> the PTE page;
>>>>>>>> b. Otherwise, as now, detect the empty PTE page through
>>>>>>>> count_pte_none()
>>>>>>>
>>>>>>> Is there a need for count_pte_none() that I am missing?
>>>>>>
>>>>>> When any_skipped == false, at least add VM_BUG_ON() to recheck none
>>>>>> ptes.
>>>>>>
>>>>>>>
>>>>>>> Assume we have
>>>>>>>
>>>>>>> nr = do_zap_pte_range(&any_skipped)
>>>>>>>
>>>>>>>
>>>>>>> If "nr" is the number of processed entries (including
>>>>>>> pte_none()), and
>>>>>>> "any_skipped" is set whenever we skipped to zap a !pte_none
>>>>>>> entry, we
>>>>>>> can detect what we need, no?
>>>>>>>
>>>>>>> If any_skipped == false after the call, we now have "nr" pte_none()
>>>>>>> entries. -> We can continue trying to reclaim
>>>>>>
>>>>>> I prefer that "nr" should not include pte_none().
>>>>>>
>>>>>
>>>>> Why? do_zap_pte_range() should tell you how far to advance, nothing
>>>>> less, nothing more.
>>>>>
>>>>> Let's just keep it simple and avoid count_pte_none().
>>>>>
>>>>> I'm probably missing something important?
>>>>
>>>> As we discussed before, we should skip all consecutive none ptes, so
>>>> pte and addr are already incremented before returning.
>>>
>>> It's probably best to send the resulting patch so I can either
>>> understand why count_pte_none() is required or comment on how to get rid
>>> of it.
>>
>> Something like this:
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index bd9ebe0f4471f..e9bec3cd49d44 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1657,6 +1657,66 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
>> return nr;
>> }
>>
>> +static inline int do_zap_pte_range(struct mmu_gather *tlb,
>> + struct vm_area_struct *vma, pte_t *pte,
>> + unsigned long addr, unsigned long end,
>> + struct zap_details *details, int *rss,
>> + bool *force_flush, bool *force_break,
>> + bool *any_skipped)
>> +{
>> + pte_t ptent = ptep_get(pte);
>> + int max_nr = (end - addr) / PAGE_SIZE;
>
> I'd do here:
>
> int nr = 0;
>
>> +
>> + /* Skip all consecutive pte_none(). */
>> + if (pte_none(ptent)) {
>> + int nr;
>> +
>
> instead of the "int nr" here
>
>> + for (nr = 1; nr < max_nr; nr++) {
>> + ptent = ptep_get(pte + nr);
>> + if (!pte_none(ptent))
>> + break;
>> + }
>> + max_nr -= nr;
>> + if (!max_nr)
>> + return 0;
>> + pte += nr;
>> + addr += nr * PAGE_SIZE;
>> + }
>> +
>> + if (pte_present(ptent))
>> + return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
>> + addr, details, rss, force_flush,
>> + force_break, any_skipped);
>
> and here:
>
> if (pte_present(ptent))
> nr += zap_present_ptes(...)
> else
> nr += zap_nonpresent_ptes();
> return nr;
>
> So you really return the number of processed items -- how much the
> caller should advance.
Got it, please ignore my previous stupid considerations. ;)
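
So, with "nr" accumulated that way, the helper would end up roughly like
this (a sketch only, to confirm I understood; the any_skipped output
parameter of zap_present_ptes()/zap_nonpresent_ptes() is the one we
discussed above):

static inline int do_zap_pte_range(struct mmu_gather *tlb,
				   struct vm_area_struct *vma, pte_t *pte,
				   unsigned long addr, unsigned long end,
				   struct zap_details *details, int *rss,
				   bool *force_flush, bool *force_break,
				   bool *any_skipped)
{
	pte_t ptent = ptep_get(pte);
	int max_nr = (end - addr) / PAGE_SIZE;
	int nr = 0;

	/* Skip all consecutive pte_none() entries. */
	if (pte_none(ptent)) {
		for (nr = 1; nr < max_nr; nr++) {
			ptent = ptep_get(pte + nr);
			if (!pte_none(ptent))
				break;
		}
		max_nr -= nr;
		if (!max_nr)
			return nr;
		pte += nr;
		addr += nr * PAGE_SIZE;
	}

	/* Return the total number of processed entries, so the caller
	 * always knows how far to advance pte and addr. */
	if (pte_present(ptent))
		nr += zap_present_ptes(tlb, vma, pte, ptent, max_nr,
				       addr, details, rss, force_flush,
				       force_break, any_skipped);
	else
		nr += zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
					  addr, details, rss, any_skipped);

	return nr;
}
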
>
>> +
>> + return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
>> + details, rss, any_skipped);
>> +}
>> +
>> +static inline int count_pte_none(pte_t *pte, int nr)
>> +{
>> + int none_nr = 0;
>> +
>> + /*
>> + * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
>> + * re-installed, so we need to check pte_none() one by one.
>> + * Otherwise, checking a single PTE in a batch is sufficient.
>> + */
>> +#ifdef CONFIG_PTE_MARKER_UFFD_WP
>> + for (;;) {
>> + if (pte_none(ptep_get(pte)))
>> + none_nr++;
>> + if (--nr == 0)
>> + break;
>> + pte++;
>> + }
>> +#else
>> + if (pte_none(ptep_get(pte)))
>> + none_nr = nr;
>> +#endif
>> + return none_nr;
>> +}
>> +
>> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> struct vm_area_struct *vma, pmd_t *pmd,
>> unsigned long addr, unsigned long end,
>> @@ -1667,6 +1727,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> int rss[NR_MM_COUNTERS];
>> spinlock_t *ptl;
>> pte_t *start_pte;
>> + bool can_reclaim_pt;
>> pte_t *pte;
>> int nr;
>>
>> @@ -1679,28 +1740,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> flush_tlb_batched_pending(mm);
>> arch_enter_lazy_mmu_mode();
>> do {
>> - pte_t ptent = ptep_get(pte);
>> - int max_nr;
>> -
>> - nr = 1;
>> - if (pte_none(ptent))
>> - continue;
>> + bool any_skipped;
>>
>> if (need_resched())
>> break;
>>
>> - max_nr = (end - addr) / PAGE_SIZE;
>> - if (pte_present(ptent)) {
>> - nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
>> - addr, details, rss, &force_flush,
>> - &force_break);
>> - if (unlikely(force_break)) {
>> - addr += nr * PAGE_SIZE;
>> - break;
>> - }
>> - } else {
>> - nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
>> - addr, details, rss);
>> + nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>> + rss, &force_flush, &force_break,
>> + &any_skipped);
>> + if (can_reclaim_pt) {
>> + VM_BUG_ON(!any_skipped && count_pte_none(pte, nr) != nr);
>
> Okay, so this is really only for debugging then? So we can safely drop
> that for now.
>
> I assume it would make sense to check, when reclaiming a page table with
> CONFIG_DEBUG_VM, that the table is actually all pte_none instead?
Ah, makes sense. Will change to that in the next version.
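
The loop in zap_pte_range() would then become roughly this (sketch;
assuming can_reclaim_pt is initialized earlier in the function from
zap_details.reclaim_pt):

	do {
		bool any_skipped = false;

		if (need_resched())
			break;

		nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
				      rss, &force_flush, &force_break,
				      &any_skipped);
		/* Any entry we failed to zap means the PTE page is in use. */
		if (can_reclaim_pt && any_skipped)
			can_reclaim_pt = false;
		if (unlikely(force_break)) {
			addr += nr * PAGE_SIZE;
			break;
		}
	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
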
>
> (as a side note: no VM_BUG_ON, please :) )
Got it.
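
For the recheck at reclaim time, I imagine something like the following
just before freeing the table (sketch only; the surrounding reclaim
helper is a placeholder, start_pte maps the PTE page being reclaimed):

#ifdef CONFIG_DEBUG_VM
	{
		int i;

		/* The page table must be empty before we free it. */
		for (i = 0; i < PTRS_PER_PTE; i++)
			VM_WARN_ON_ONCE(!pte_none(ptep_get(start_pte + i)));
	}
#endif
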
Thanks!
>
Thread overview: 28+ messages
2024-11-14 6:59 [PATCH v3 0/9] synchronously scan and reclaim empty user PTE pages Qi Zheng
2024-11-14 6:59 ` [PATCH v3 1/9] mm: khugepaged: recheck pmd state in retract_page_tables() Qi Zheng
2024-11-14 6:59 ` [PATCH v3 2/9] mm: userfaultfd: recheck dst_pmd entry in move_pages_pte() Qi Zheng
2024-11-14 6:59 ` [PATCH v3 3/9] mm: introduce zap_nonpresent_ptes() Qi Zheng
2024-11-14 6:59 ` [PATCH v3 4/9] mm: introduce skip_none_ptes() Qi Zheng
2024-11-14 8:04 ` David Hildenbrand
2024-11-14 9:20 ` Qi Zheng
2024-11-14 12:32 ` David Hildenbrand
2024-11-14 12:51 ` Qi Zheng
2024-11-14 21:19 ` David Hildenbrand
2024-11-15 3:03 ` Qi Zheng
2024-11-15 10:22 ` David Hildenbrand
2024-11-15 14:41 ` Qi Zheng
2024-11-15 14:59 ` David Hildenbrand
2024-11-18 3:35 ` Qi Zheng
2024-11-18 9:29 ` David Hildenbrand
2024-11-18 10:34 ` Qi Zheng
2024-11-18 10:41 ` David Hildenbrand
2024-11-18 10:56 ` Qi Zheng
2024-11-18 10:59 ` David Hildenbrand
2024-11-18 11:13 ` Qi Zheng
2024-11-19 9:55 ` David Hildenbrand
2024-11-19 10:03 ` Qi Zheng
2024-11-14 6:59 ` [PATCH v3 5/9] mm: introduce do_zap_pte_range() Qi Zheng
2024-11-14 6:59 ` [PATCH v3 6/9] mm: make zap_pte_range() handle full within-PMD range Qi Zheng
2024-11-14 6:59 ` [PATCH v3 7/9] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
2024-11-14 6:59 ` [PATCH v3 8/9] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
2024-11-14 7:00 ` [PATCH v3 9/9] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng