* [PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages
@ 2024-10-31 8:13 Qi Zheng
2024-10-31 8:13 ` [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock() Qi Zheng
` (6 more replies)
0 siblings, 7 replies; 31+ messages in thread
From: Qi Zheng @ 2024-10-31 8:13 UTC (permalink / raw)
To: david, jannh, hughd, willy, mgorman, muchun.song, vbabka, akpm,
zokeefe, rientjes, peterx, catalin.marinas
Cc: linux-mm, linux-kernel, x86, Qi Zheng
Changes in v2:
- fix [PATCH v1 1/7] (Jann Horn)
- reset force_flush and force_break to false in [PATCH v1 2/7] (Jann Horn)
- introduce zap_nonpresent_ptes() and do_zap_pte_range()
- check pte_none() instead of can_reclaim_pt after the processing of PTEs
(remove [PATCH v1 3/7] and [PATCH v1 4/7])
- reorder patches
- rebase onto the next-20241031
Changes in v1:
- replace [RFC PATCH 1/7] with a separate series (already merged into mm-unstable):
https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
(suggested by David Hildenbrand)
- squash [RFC PATCH 2/7] into [RFC PATCH 4/7]
(suggested by David Hildenbrand)
- change to scan and reclaim empty user PTE pages in zap_pte_range()
(suggested by David Hildenbrand)
- sent a separate RFC patch to track the tlb flushing issue, and removed
that part from this series ([RFC PATCH 3/7] and [RFC PATCH 6/7]).
link: https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
- add [PATCH v1 1/7] into this series
- drop RFC tag
- rebase onto the next-20241011
Changes in RFC v2:
- fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reported by
kernel test robot
- use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid()
in retract_page_tables() (in [RFC PATCH 4/7])
- rebase onto the next-20240805
Hi all,
Previously, we tried to use a completely asynchronous method to reclaim empty
user PTE pages [1]. After discussing with David Hildenbrand, we decided to
implement synchronous reclamation in the case of madvise(MADV_DONTNEED) as the
first step.
So this series aims to synchronously free the empty PTE pages in the
madvise(MADV_DONTNEED) case. We will detect and free empty PTE pages in
zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other
than madvise(MADV_DONTNEED).
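Concretely, [PATCH v2 5/7] makes madvise_dontneed_single_vma() do:

        struct zap_details details = {.reclaim_pt = true,};

        zap_page_range_single(vma, start, end - start, &details);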
In zap_pte_range(), mmu_gather is used to perform batch tlb flushing and page
freeing operations. Therefore, if we want to free the empty PTE page in this
path, the most natural way is to add it to mmu_gather as well. Now, if
CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free page table
pages by semi RCU:
- batch table freeing: asynchronous free by RCU
- single table freeing: IPI + synchronous free
But this is not enough to free the empty PTE page table pages in paths other
than the munmap and exit_mmap paths, because the IPI cannot be synchronized
with rcu_read_lock() in pte_offset_map{_lock}(). So we should let single
tables also be freed by RCU like batch table freeing.
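For illustration, a lockless walker only relies on an RCU read-side critical
section (a simplified sketch; the real helper is __pte_offset_map() in
mm/pgtable-generic.c):

        pte = pte_offset_map(pmd, addr); /* takes rcu_read_lock() internally */
        if (pte) {
                /* the PTE page must not be freed while it is mapped here */
                pte_unmap(pte);          /* drops rcu_read_lock() */
        }

An IPI broadcast only waits for code that runs with IRQs disabled (such as
GUP-fast), not for such RCU readers, which is why single table freeing must
also go through an RCU grace period before the PTE page can be reused.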
As a first step, we support this feature on x86_64 and select the newly
introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
For other cases such as madvise(MADV_FREE), we may consider scanning and
freeing empty PTE pages asynchronously in the future.
This series is based on next-20241031 (which contains the series [2]).
Note: issues related to TLB flushing are not new to this series and are tracked
in the separate RFC patch [3]. For more context, please refer to this
thread [4].
Comments and suggestions are welcome!
Thanks,
Qi
[1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
[2]. https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
[3]. https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
[4]. https://lore.kernel.org/lkml/6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com/
Qi Zheng (7):
mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock()
mm: introduce zap_nonpresent_ptes()
mm: introduce do_zap_pte_range()
mm: make zap_pte_range() handle full within-PMD range
mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
x86: mm: free page table pages by RCU instead of semi RCU
x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
arch/x86/Kconfig | 1 +
arch/x86/include/asm/tlb.h | 19 ++++
arch/x86/kernel/paravirt.c | 7 ++
arch/x86/mm/pgtable.c | 10 +-
include/linux/mm.h | 1 +
mm/Kconfig | 15 +++
mm/Makefile | 1 +
mm/internal.h | 23 ++++
mm/khugepaged.c | 17 ++-
mm/madvise.c | 4 +-
mm/memory.c | 228 ++++++++++++++++++++++++-------------
mm/mmu_gather.c | 9 +-
mm/pt_reclaim.c | 68 +++++++++++
13 files changed, 319 insertions(+), 84 deletions(-)
create mode 100644 mm/pt_reclaim.c
--
2.20.1
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock()
2024-10-31 8:13 [PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
@ 2024-10-31 8:13 ` Qi Zheng
2024-11-06 21:48 ` Jann Horn
2024-10-31 8:13 ` [PATCH v2 2/7] mm: introduce zap_nonpresent_ptes() Qi Zheng
` (5 subsequent siblings)
6 siblings, 1 reply; 31+ messages in thread
From: Qi Zheng @ 2024-10-31 8:13 UTC (permalink / raw)
To: david, jannh, hughd, willy, mgorman, muchun.song, vbabka, akpm,
zokeefe, rientjes, peterx, catalin.marinas
Cc: linux-mm, linux-kernel, x86, Qi Zheng
In retract_page_tables(), we may modify the pmd entry after acquiring the
pml and ptl, so we should also check whether the pmd entry is stable.
Use pte_offset_map_rw_nolock() + pmd_same() to do this, and then we can
also remove the call to pte_lockptr().
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/khugepaged.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6f8d46d107b4b..6d76dde64f5fb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
spinlock_t *pml;
spinlock_t *ptl;
bool skipped_uffd = false;
+ pte_t *pte;
/*
* Check vma->anon_vma to exclude MAP_PRIVATE mappings that
@@ -1756,11 +1757,25 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
addr, addr + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
+ pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
+ if (!pte) {
+ mmu_notifier_invalidate_range_end(&range);
+ continue;
+ }
+
pml = pmd_lock(mm, pmd);
- ptl = pte_lockptr(mm, pmd);
if (ptl != pml)
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+ if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+ pte_unmap_unlock(pte, ptl);
+ if (ptl != pml)
+ spin_unlock(pml);
+ mmu_notifier_invalidate_range_end(&range);
+ continue;
+ }
+ pte_unmap(pte);
+
/*
* Huge page lock is still held, so normally the page table
* must remain empty; and we have already skipped anon_vma
--
2.20.1
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 2/7] mm: introduce zap_nonpresent_ptes()
2024-10-31 8:13 [PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
2024-10-31 8:13 ` [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock() Qi Zheng
@ 2024-10-31 8:13 ` Qi Zheng
2024-11-06 21:48 ` Jann Horn
2024-11-12 16:58 ` David Hildenbrand
2024-10-31 8:13 ` [PATCH v2 3/7] mm: introduce do_zap_pte_range() Qi Zheng
` (4 subsequent siblings)
6 siblings, 2 replies; 31+ messages in thread
From: Qi Zheng @ 2024-10-31 8:13 UTC (permalink / raw)
To: david, jannh, hughd, willy, mgorman, muchun.song, vbabka, akpm,
zokeefe, rientjes, peterx, catalin.marinas
Cc: linux-mm, linux-kernel, x86, Qi Zheng
Similar to zap_present_ptes(), let's introduce zap_nonpresent_ptes() to
handle non-present ptes, which can improve code readability.
No functional change.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/memory.c | 136 ++++++++++++++++++++++++++++------------------------
1 file changed, 73 insertions(+), 63 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 209885a4134f7..bd9ebe0f4471f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1587,6 +1587,76 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
return 1;
}
+static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+ unsigned int max_nr, unsigned long addr,
+ struct zap_details *details, int *rss)
+{
+ swp_entry_t entry;
+ int nr = 1;
+
+ entry = pte_to_swp_entry(ptent);
+ if (is_device_private_entry(entry) ||
+ is_device_exclusive_entry(entry)) {
+ struct page *page = pfn_swap_entry_to_page(entry);
+ struct folio *folio = page_folio(page);
+
+ if (unlikely(!should_zap_folio(details, folio)))
+ return 1;
+ /*
+ * Both device private/exclusive mappings should only
+ * work with anonymous page so far, so we don't need to
+ * consider uffd-wp bit when zap. For more information,
+ * see zap_install_uffd_wp_if_needed().
+ */
+ WARN_ON_ONCE(!vma_is_anonymous(vma));
+ rss[mm_counter(folio)]--;
+ if (is_device_private_entry(entry))
+ folio_remove_rmap_pte(folio, page, vma);
+ folio_put(folio);
+ } else if (!non_swap_entry(entry)) {
+ /* Genuine swap entries, hence a private anon pages */
+ if (!should_zap_cows(details))
+ return 1;
+
+ nr = swap_pte_batch(pte, max_nr, ptent);
+ rss[MM_SWAPENTS] -= nr;
+ free_swap_and_cache_nr(entry, nr);
+ } else if (is_migration_entry(entry)) {
+ struct folio *folio = pfn_swap_entry_folio(entry);
+
+ if (!should_zap_folio(details, folio))
+ return 1;
+ rss[mm_counter(folio)]--;
+ } else if (pte_marker_entry_uffd_wp(entry)) {
+ /*
+ * For anon: always drop the marker; for file: only
+ * drop the marker if explicitly requested.
+ */
+ if (!vma_is_anonymous(vma) && !zap_drop_markers(details))
+ return 1;
+ } else if (is_guard_swp_entry(entry)) {
+ /*
+ * Ordinary zapping should not remove guard PTE
+ * markers. Only do so if we should remove PTE markers
+ * in general.
+ */
+ if (!zap_drop_markers(details))
+ return 1;
+ } else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) {
+ if (!should_zap_cows(details))
+ return 1;
+ } else {
+ /* We should have covered all the swap entry types */
+ pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
+ WARN_ON_ONCE(1);
+ }
+ clear_not_present_full_ptes(vma->vm_mm, addr, pte, nr, tlb->fullmm);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
+
+ return nr;
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1598,7 +1668,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
spinlock_t *ptl;
pte_t *start_pte;
pte_t *pte;
- swp_entry_t entry;
int nr;
tlb_change_page_size(tlb, PAGE_SIZE);
@@ -1611,8 +1680,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
arch_enter_lazy_mmu_mode();
do {
pte_t ptent = ptep_get(pte);
- struct folio *folio;
- struct page *page;
int max_nr;
nr = 1;
@@ -1622,8 +1689,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (need_resched())
break;
+ max_nr = (end - addr) / PAGE_SIZE;
if (pte_present(ptent)) {
- max_nr = (end - addr) / PAGE_SIZE;
nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
addr, details, rss, &force_flush,
&force_break);
@@ -1631,67 +1698,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
addr += nr * PAGE_SIZE;
break;
}
- continue;
- }
-
- entry = pte_to_swp_entry(ptent);
- if (is_device_private_entry(entry) ||
- is_device_exclusive_entry(entry)) {
- page = pfn_swap_entry_to_page(entry);
- folio = page_folio(page);
- if (unlikely(!should_zap_folio(details, folio)))
- continue;
- /*
- * Both device private/exclusive mappings should only
- * work with anonymous page so far, so we don't need to
- * consider uffd-wp bit when zap. For more information,
- * see zap_install_uffd_wp_if_needed().
- */
- WARN_ON_ONCE(!vma_is_anonymous(vma));
- rss[mm_counter(folio)]--;
- if (is_device_private_entry(entry))
- folio_remove_rmap_pte(folio, page, vma);
- folio_put(folio);
- } else if (!non_swap_entry(entry)) {
- max_nr = (end - addr) / PAGE_SIZE;
- nr = swap_pte_batch(pte, max_nr, ptent);
- /* Genuine swap entries, hence a private anon pages */
- if (!should_zap_cows(details))
- continue;
- rss[MM_SWAPENTS] -= nr;
- free_swap_and_cache_nr(entry, nr);
- } else if (is_migration_entry(entry)) {
- folio = pfn_swap_entry_folio(entry);
- if (!should_zap_folio(details, folio))
- continue;
- rss[mm_counter(folio)]--;
- } else if (pte_marker_entry_uffd_wp(entry)) {
- /*
- * For anon: always drop the marker; for file: only
- * drop the marker if explicitly requested.
- */
- if (!vma_is_anonymous(vma) &&
- !zap_drop_markers(details))
- continue;
- } else if (is_guard_swp_entry(entry)) {
- /*
- * Ordinary zapping should not remove guard PTE
- * markers. Only do so if we should remove PTE markers
- * in general.
- */
- if (!zap_drop_markers(details))
- continue;
- } else if (is_hwpoison_entry(entry) ||
- is_poisoned_swp_entry(entry)) {
- if (!should_zap_cows(details))
- continue;
} else {
- /* We should have covered all the swap entry types */
- pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
- WARN_ON_ONCE(1);
+ nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
+ addr, details, rss);
}
- clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
- zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
add_mm_rss_vec(mm, rss);
--
2.20.1
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 3/7] mm: introduce do_zap_pte_range()
2024-10-31 8:13 [PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
2024-10-31 8:13 ` [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock() Qi Zheng
2024-10-31 8:13 ` [PATCH v2 2/7] mm: introduce zap_nonpresent_ptes() Qi Zheng
@ 2024-10-31 8:13 ` Qi Zheng
2024-11-07 21:50 ` Jann Horn
2024-11-12 17:00 ` David Hildenbrand
2024-10-31 8:13 ` [PATCH v2 4/7] mm: make zap_pte_range() handle full within-PMD range Qi Zheng
` (3 subsequent siblings)
6 siblings, 2 replies; 31+ messages in thread
From: Qi Zheng @ 2024-10-31 8:13 UTC (permalink / raw)
To: david, jannh, hughd, willy, mgorman, muchun.song, vbabka, akpm,
zokeefe, rientjes, peterx, catalin.marinas
Cc: linux-mm, linux-kernel, x86, Qi Zheng
This commit introduces do_zap_pte_range() to actually zap the PTEs, which
will help improve code readability and facilitate secondary checking of
the processed PTEs in the future.
No functional change.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
1 file changed, 26 insertions(+), 19 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index bd9ebe0f4471f..c1150e62dd073 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1657,6 +1657,27 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
return nr;
}
+static inline int do_zap_pte_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pte_t *pte,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details, int *rss,
+ bool *force_flush, bool *force_break)
+{
+ pte_t ptent = ptep_get(pte);
+ int max_nr = (end - addr) / PAGE_SIZE;
+
+ if (pte_none(ptent))
+ return 1;
+
+ if (pte_present(ptent))
+ return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
+ addr, details, rss, force_flush,
+ force_break);
+
+ return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
+ details, rss);
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1679,28 +1700,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
do {
- pte_t ptent = ptep_get(pte);
- int max_nr;
-
- nr = 1;
- if (pte_none(ptent))
- continue;
-
if (need_resched())
break;
- max_nr = (end - addr) / PAGE_SIZE;
- if (pte_present(ptent)) {
- nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
- addr, details, rss, &force_flush,
- &force_break);
- if (unlikely(force_break)) {
- addr += nr * PAGE_SIZE;
- break;
- }
- } else {
- nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
- addr, details, rss);
+ nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
+ &force_flush, &force_break);
+ if (unlikely(force_break)) {
+ addr += nr * PAGE_SIZE;
+ break;
}
} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
--
2.20.1
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 4/7] mm: make zap_pte_range() handle full within-PMD range
2024-10-31 8:13 [PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
` (2 preceding siblings ...)
2024-10-31 8:13 ` [PATCH v2 3/7] mm: introduce do_zap_pte_range() Qi Zheng
@ 2024-10-31 8:13 ` Qi Zheng
2024-11-07 21:46 ` Jann Horn
2024-10-31 8:13 ` [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
` (2 subsequent siblings)
6 siblings, 1 reply; 31+ messages in thread
From: Qi Zheng @ 2024-10-31 8:13 UTC (permalink / raw)
To: david, jannh, hughd, willy, mgorman, muchun.song, vbabka, akpm,
zokeefe, rientjes, peterx, catalin.marinas
Cc: linux-mm, linux-kernel, x86, Qi Zheng
In preparation for reclaiming empty PTE pages, this commit first makes
zap_pte_range() handle the full within-PMD range, so that we can more
easily detect and free PTE pages in this function in subsequent commits.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/memory.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index c1150e62dd073..002aa4f454fa0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1691,6 +1691,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pte_t *pte;
int nr;
+retry:
tlb_change_page_size(tlb, PAGE_SIZE);
init_rss_vec(rss);
start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
@@ -1730,6 +1731,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (force_flush)
tlb_flush_mmu(tlb);
+ if (addr != end) {
+ cond_resched();
+ force_flush = false;
+ force_break = false;
+ goto retry;
+ }
+
return addr;
}
--
2.20.1
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
2024-10-31 8:13 [PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
` (3 preceding siblings ...)
2024-10-31 8:13 ` [PATCH v2 4/7] mm: make zap_pte_range() handle full within-PMD range Qi Zheng
@ 2024-10-31 8:13 ` Qi Zheng
2024-11-07 23:35 ` Jann Horn
2024-10-31 8:13 ` [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
2024-10-31 8:13 ` [PATCH v2 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng
6 siblings, 1 reply; 31+ messages in thread
From: Qi Zheng @ 2024-10-31 8:13 UTC (permalink / raw)
To: david, jannh, hughd, willy, mgorman, muchun.song, vbabka, akpm,
zokeefe, rientjes, peterx, catalin.marinas
Cc: linux-mm, linux-kernel, x86, Qi Zheng
Now, in order to pursue high performance, applications mostly use some
high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
physical memory, but neither MADV_DONTNEED nor MADV_FREE releases page table
memory, which may result in huge page table memory usage.
The following is a memory usage snapshot of one process that was actually
observed on our server:
VIRT: 55t
RES: 590g
VmPTE: 110g
In this case, most of the page table entries are empty. For such a PTE
page where all entries are empty, we can actually free it back to the
system for others to use.
As a first step, this commit aims to synchronously free the empty PTE
pages in the madvise(MADV_DONTNEED) case. We will detect and free empty PTE
pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
cases other than madvise(MADV_DONTNEED).
Once an empty PTE page is detected, we first try to hold the pmd lock within
the pte lock. If successful, we clear the pmd entry directly (fast path).
Otherwise, we wait until the pte lock is released, then re-take the pmd and
pte locks and loop PTRS_PER_PTE times checking pte_none() to re-detect
whether the PTE page is empty, and free it if so (slow path).
For other cases such as madvise(MADV_FREE), consider scanning and freeing
empty PTE pages asynchronously in the future.
The following code snippet shows the effect of the optimization:

mmap 50G
while (1) {
        for (; i < 1024 * 25; i++) {
                touch 2M memory
                madvise MADV_DONTNEED 2M
        }
}
As we can see, the memory usage of VmPTE is reduced:
                 before          after
VIRT            50.0 GB        50.0 GB
RES              3.1 MB         3.1 MB
VmPTE         102640 KB         240 KB
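For reference, the test loop above roughly corresponds to the following
userspace program (a minimal sketch; the exact mmap flags, e.g.
MAP_NORESERVE, are illustrative):

#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t size = 50UL << 30;               /* mmap 50G */
        size_t step = 2UL << 20;                /* 2M chunks */
        char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        while (1) {
                for (size_t i = 0; i < 1024 * 25; i++) {
                        memset(p + i * step, 1, step);  /* touch 2M memory */
                        madvise(p + i * step, step, MADV_DONTNEED);
                }
        }
}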
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/mm.h | 1 +
mm/Kconfig | 15 ++++++++++
mm/Makefile | 1 +
mm/internal.h | 23 ++++++++++++++++
mm/madvise.c | 4 ++-
mm/memory.c | 45 +++++++++++++++++++++++++++++-
mm/pt_reclaim.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 155 insertions(+), 2 deletions(-)
create mode 100644 mm/pt_reclaim.c
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3e4bb43035953..ce3936590fe72 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2319,6 +2319,7 @@ extern void pagefault_out_of_memory(void);
struct zap_details {
struct folio *single_folio; /* Locked folio to be unmapped */
bool even_cows; /* Zap COWed private pages too? */
+ bool reclaim_pt; /* Need reclaim page tables? */
zap_flags_t zap_flags; /* Extra flags for zapping */
};
diff --git a/mm/Kconfig b/mm/Kconfig
index 84000b0168086..681909e0a9fa3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
The architecture has hardware support for userspace shadow call
stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
+config ARCH_SUPPORTS_PT_RECLAIM
+ def_bool n
+
+config PT_RECLAIM
+ bool "reclaim empty user page table pages"
+ default y
+ depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
+ select MMU_GATHER_RCU_TABLE_FREE
+ help
+ Try to reclaim empty user page table pages in paths other that munmap
+ and exit_mmap path.
+
+ Note: now only empty user PTE page table pages will be reclaimed.
+
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d5639b0361663..9d816323d247a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -145,3 +145,4 @@ obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
obj-$(CONFIG_EXECMEM) += execmem.o
obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
+obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
diff --git a/mm/internal.h b/mm/internal.h
index d5b93c5b63648..7aba395a9940f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1508,4 +1508,27 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
unsigned long end, const struct mm_walk_ops *ops,
void *private);
+#ifdef CONFIG_PT_RECLAIM
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval);
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+ pmd_t pmdval);
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+ struct mmu_gather *tlb);
+#else
+static inline bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd,
+ pmd_t *pmdval)
+{
+ return false;
+}
+static inline void free_pte(struct mm_struct *mm, unsigned long addr,
+ struct mmu_gather *tlb, pmd_t pmdval)
+{
+}
+static inline void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, struct mmu_gather *tlb)
+{
+}
+#endif /* CONFIG_PT_RECLAIM */
+
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 0ceae57da7dad..ee88652761d45 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -851,7 +851,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- zap_page_range_single(vma, start, end - start, NULL);
+ struct zap_details details = {.reclaim_pt = true,};
+
+ zap_page_range_single(vma, start, end - start, &details);
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index 002aa4f454fa0..c4a8c18fbcfd7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
static inline bool should_zap_cows(struct zap_details *details)
{
/* By default, zap all pages */
- if (!details)
+ if (!details || details->reclaim_pt)
return true;
/* Or, we zap COWed pages only if the caller wants to */
@@ -1678,6 +1678,30 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
details, rss);
}
+static inline int count_pte_none(pte_t *pte, int nr)
+{
+ int none_nr = 0;
+
+ /*
+ * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
+ * re-installed, so we need to check pte_none() one by one.
+ * Otherwise, checking a single PTE in a batch is sufficient.
+ */
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+ for (;;) {
+ if (pte_none(ptep_get(pte)))
+ none_nr++;
+ if (--nr == 0)
+ break;
+ pte++;
+ }
+#else
+ if (pte_none(ptep_get(pte)))
+ none_nr = nr;
+#endif
+ return none_nr;
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1689,8 +1713,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
spinlock_t *ptl;
pte_t *start_pte;
pte_t *pte;
+ pmd_t pmdval;
+ bool can_reclaim_pt = false;
+ bool direct_reclaim = false;
+ unsigned long start = addr;
+ int none_nr = 0;
int nr;
+ if (details && details->reclaim_pt && (end - start >= PMD_SIZE))
+ can_reclaim_pt = true;
+
retry:
tlb_change_page_size(tlb, PAGE_SIZE);
init_rss_vec(rss);
@@ -1706,12 +1738,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
&force_flush, &force_break);
+ none_nr += count_pte_none(pte, nr);
if (unlikely(force_break)) {
addr += nr * PAGE_SIZE;
break;
}
} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
+ if (addr == end && can_reclaim_pt && (none_nr == PTRS_PER_PTE))
+ direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
+
add_mm_rss_vec(mm, rss);
arch_leave_lazy_mmu_mode();
@@ -1738,6 +1774,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
goto retry;
}
+ if (can_reclaim_pt) {
+ if (direct_reclaim)
+ free_pte(mm, start, tlb, pmdval);
+ else
+ try_to_free_pte(mm, pmd, start, tlb);
+ }
+
return addr;
}
diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
new file mode 100644
index 0000000000000..fc055da40b615
--- /dev/null
+++ b/mm/pt_reclaim.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/hugetlb.h>
+#include <asm-generic/tlb.h>
+#include <asm/pgalloc.h>
+
+#include "internal.h"
+
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
+{
+ spinlock_t *pml = pmd_lockptr(mm, pmd);
+
+ if (!spin_trylock(pml))
+ return false;
+
+ *pmdval = pmdp_get_lockless(pmd);
+ pmd_clear(pmd);
+ spin_unlock(pml);
+
+ return true;
+}
+
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+ pmd_t pmdval)
+{
+ pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
+ mm_dec_nr_ptes(mm);
+}
+
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+ struct mmu_gather *tlb)
+{
+ pmd_t pmdval;
+ spinlock_t *pml, *ptl;
+ pte_t *start_pte, *pte;
+ int i;
+
+ start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
+ if (!start_pte)
+ return;
+
+ pml = pmd_lock(mm, pmd);
+ if (ptl != pml)
+ spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+ if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
+ goto out_ptl;
+
+ /* Check if it is empty PTE page */
+ for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
+ if (!pte_none(ptep_get(pte)))
+ goto out_ptl;
+ }
+ pte_unmap(start_pte);
+
+ pmd_clear(pmd);
+
+ if (ptl != pml)
+ spin_unlock(ptl);
+ spin_unlock(pml);
+
+ free_pte(mm, addr, tlb, pmdval);
+
+ return;
+out_ptl:
+ pte_unmap_unlock(start_pte, ptl);
+ if (pml != ptl)
+ spin_unlock(pml);
+}
--
2.20.1
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU
2024-10-31 8:13 [PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
` (4 preceding siblings ...)
2024-10-31 8:13 ` [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
@ 2024-10-31 8:13 ` Qi Zheng
2024-11-07 22:39 ` Jann Horn
2024-10-31 8:13 ` [PATCH v2 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng
6 siblings, 1 reply; 31+ messages in thread
From: Qi Zheng @ 2024-10-31 8:13 UTC (permalink / raw)
To: david, jannh, hughd, willy, mgorman, muchun.song, vbabka, akpm,
zokeefe, rientjes, peterx, catalin.marinas
Cc: linux-mm, linux-kernel, x86, Qi Zheng
Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages
will be freed by semi RCU, that is:
- batch table freeing: asynchronous free by RCU
- single table freeing: IPI + synchronous free
In this way, the page table can be traversed locklessly by disabling IRQs in
paths such as fast GUP. But this is not enough to free the empty PTE page
table pages in paths other than the munmap and exit_mmap paths, because the
IPI cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().
In preparation for supporting reclamation of empty PTE page table pages, let
single tables also be freed by RCU like batch table freeing. Then we can also
use pte_offset_map() etc. to prevent the PTE page from being freed.
Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free
the page table pages:
- The pt_rcu_head is unioned with pt_list and pmd_huge_pte.
- For pt_list, it is used to manage the PGD page in x86. Fortunately
tlb_remove_table() will not be used for freeing PGD pages, so it is safe
to use pt_rcu_head.
- For pmd_huge_pte, we will do zap_deposited_table() before freeing the
PMD page, so it is also safe.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
arch/x86/include/asm/tlb.h | 19 +++++++++++++++++++
arch/x86/kernel/paravirt.c | 7 +++++++
arch/x86/mm/pgtable.c | 10 +++++++++-
mm/mmu_gather.c | 9 ++++++++-
4 files changed, 43 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 580636cdc257b..e223b53a8b190 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -34,4 +34,23 @@ static inline void __tlb_remove_table(void *table)
free_page_and_swap_cache(table);
}
+#ifdef CONFIG_PT_RECLAIM
+static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
+{
+ struct page *page;
+
+ page = container_of(head, struct page, rcu_head);
+ free_page_and_swap_cache(page);
+}
+
+static inline void __tlb_remove_table_one(void *table)
+{
+ struct page *page;
+
+ page = table;
+ call_rcu(&page->rcu_head, __tlb_remove_table_one_rcu);
+}
+#define __tlb_remove_table_one __tlb_remove_table_one
+#endif /* CONFIG_PT_RECLAIM */
+
#endif /* _ASM_X86_TLB_H */
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec3815335558..89688921ea62e 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -59,10 +59,17 @@ void __init native_pv_lock_init(void)
static_branch_enable(&virt_spin_lock_key);
}
+#ifndef CONFIG_PT_RECLAIM
static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
{
tlb_remove_page(tlb, table);
}
+#else
+static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+ tlb_remove_table(tlb, table);
+}
+#endif
struct static_key paravirt_steal_enabled;
struct static_key paravirt_steal_rq_enabled;
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 5745a354a241c..69a357b15974a 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -19,12 +19,20 @@ EXPORT_SYMBOL(physical_mask);
#endif
#ifndef CONFIG_PARAVIRT
+#ifndef CONFIG_PT_RECLAIM
static inline
void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
{
tlb_remove_page(tlb, table);
}
-#endif
+#else
+static inline
+void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+ tlb_remove_table(tlb, table);
+}
+#endif /* !CONFIG_PT_RECLAIM */
+#endif /* !CONFIG_PARAVIRT */
gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99b3e9408aa0f..d948479ca09e6 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -311,10 +311,17 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
}
}
+#ifndef __tlb_remove_table_one
+static inline void __tlb_remove_table_one(void *table)
+{
+ __tlb_remove_table(table);
+}
+#endif
+
static void tlb_remove_table_one(void *table)
{
tlb_remove_table_sync_one();
- __tlb_remove_table(table);
+ __tlb_remove_table_one(table);
}
static void tlb_table_flush(struct mmu_gather *tlb)
--
2.20.1
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
2024-10-31 8:13 [PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
` (5 preceding siblings ...)
2024-10-31 8:13 ` [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
@ 2024-10-31 8:13 ` Qi Zheng
6 siblings, 0 replies; 31+ messages in thread
From: Qi Zheng @ 2024-10-31 8:13 UTC (permalink / raw)
To: david, jannh, hughd, willy, mgorman, muchun.song, vbabka, akpm,
zokeefe, rientjes, peterx, catalin.marinas
Cc: linux-mm, linux-kernel, x86, Qi Zheng
Now x86 fully supports the CONFIG_PT_RECLAIM feature, and reclaiming PTE
pages is only profitable on 64-bit systems, so select
ARCH_SUPPORTS_PT_RECLAIM if X86_64.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d194fb2f979e..194baed21ae5c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -320,6 +320,7 @@ config X86
select FUNCTION_ALIGNMENT_4B
imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI
select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
+ select ARCH_SUPPORTS_PT_RECLAIM if X86_64
config INSTRUCTION_DECODER
def_bool y
--
2.20.1
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock()
2024-10-31 8:13 ` [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock() Qi Zheng
@ 2024-11-06 21:48 ` Jann Horn
2024-11-07 7:54 ` Qi Zheng
0 siblings, 1 reply; 31+ messages in thread
From: Jann Horn @ 2024-11-06 21:48 UTC (permalink / raw)
To: Qi Zheng
Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> In retract_page_tables(), we may modify the pmd entry after acquiring the
> pml and ptl, so we should also check whether the pmd entry is stable.
Why does taking the PMD lock not guarantee that the PMD entry is stable?
> Using pte_offset_map_rw_nolock() + pmd_same() to do it, and then we can
> also remove the calling of the pte_lockptr().
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> mm/khugepaged.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6f8d46d107b4b..6d76dde64f5fb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> spinlock_t *pml;
> spinlock_t *ptl;
> bool skipped_uffd = false;
> + pte_t *pte;
>
> /*
> * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> @@ -1756,11 +1757,25 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> addr, addr + HPAGE_PMD_SIZE);
> mmu_notifier_invalidate_range_start(&range);
>
> + pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
> + if (!pte) {
> + mmu_notifier_invalidate_range_end(&range);
> + continue;
> + }
> +
> pml = pmd_lock(mm, pmd);
I don't understand why you're mapping the page table before locking
the PMD. Doesn't that just mean we need more error checking
afterwards?
> - ptl = pte_lockptr(mm, pmd);
> if (ptl != pml)
> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>
> + if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
> + pte_unmap_unlock(pte, ptl);
> + if (ptl != pml)
> + spin_unlock(pml);
> + mmu_notifier_invalidate_range_end(&range);
> + continue;
> + }
> + pte_unmap(pte);
> +
> /*
> * Huge page lock is still held, so normally the page table
> * must remain empty; and we have already skipped anon_vma
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/7] mm: introduce zap_nonpresent_ptes()
2024-10-31 8:13 ` [PATCH v2 2/7] mm: introduce zap_nonpresent_ptes() Qi Zheng
@ 2024-11-06 21:48 ` Jann Horn
2024-11-12 16:58 ` David Hildenbrand
1 sibling, 0 replies; 31+ messages in thread
From: Jann Horn @ 2024-11-06 21:48 UTC (permalink / raw)
To: Qi Zheng
Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> Similar to zap_present_ptes(), let's introduce zap_nonpresent_ptes() to
> handle non-present ptes, which can improve code readability.
>
> No functional change.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
This is a nice cleanup.
Reviewed-by: Jann Horn <jannh@google.com>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock()
2024-11-06 21:48 ` Jann Horn
@ 2024-11-07 7:54 ` Qi Zheng
2024-11-07 17:57 ` Jann Horn
0 siblings, 1 reply; 31+ messages in thread
From: Qi Zheng @ 2024-11-07 7:54 UTC (permalink / raw)
To: Jann Horn
Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
Hi Jann,
On 2024/11/7 05:48, Jann Horn wrote:
> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> In retract_page_tables(), we may modify the pmd entry after acquiring the
>> pml and ptl, so we should also check whether the pmd entry is stable.
>
> Why does taking the PMD lock not guarantee that the PMD entry is stable?
The pmd entry may have changed before taking the pmd lock, so we
need to recheck it after taking the pmd or pte lock.
>
>> Using pte_offset_map_rw_nolock() + pmd_same() to do it, and then we can
>> also remove the calling of the pte_lockptr().
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> mm/khugepaged.c | 17 ++++++++++++++++-
>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 6f8d46d107b4b..6d76dde64f5fb 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>> spinlock_t *pml;
>> spinlock_t *ptl;
>> bool skipped_uffd = false;
>> + pte_t *pte;
>>
>> /*
>> * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
>> @@ -1756,11 +1757,25 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>> addr, addr + HPAGE_PMD_SIZE);
>> mmu_notifier_invalidate_range_start(&range);
>>
>> + pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
>> + if (!pte) {
>> + mmu_notifier_invalidate_range_end(&range);
>> + continue;
>> + }
>> +
>> pml = pmd_lock(mm, pmd);
>
> I don't understand why you're mapping the page table before locking
> the PMD. Doesn't that just mean we need more error checking
> afterwards?
The main purpose is to obtain the pmdval. If we don't use
pte_offset_map_rw_nolock(), we need to be careful to recheck the pmd entry
before calling pte_lockptr(), like this:
pmdval = pmdp_get_lockless(pmd);
pmd_lock
recheck pmdval
pte_lockptr(mm, pmd)
Otherwise, it may cause the system to crash. Consider the following
situation:
CPU 0                               CPU 1

zap_pte_range
--> clear pmd entry
    free pte page (by RCU)

                                    retract_page_tables
                                    --> pmd_lock
                                        pte_lockptr(mm, pmd) <-- BOOM!!
So maybe calling pte_offset_map_rw_nolock() is more convenient.
Thanks,
Qi
>
>
>> - ptl = pte_lockptr(mm, pmd);
>> if (ptl != pml)
>> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>
>> + if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
>> + pte_unmap_unlock(pte, ptl);
>> + if (ptl != pml)
>> + spin_unlock(pml);
>> + mmu_notifier_invalidate_range_end(&range);
>> + continue;
>> + }
>> + pte_unmap(pte);
>> +
>> /*
>> * Huge page lock is still held, so normally the page table
>> * must remain empty; and we have already skipped anon_vma
>> --
>> 2.20.1
>>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock()
2024-11-07 7:54 ` Qi Zheng
@ 2024-11-07 17:57 ` Jann Horn
2024-11-08 6:31 ` Qi Zheng
0 siblings, 1 reply; 31+ messages in thread
From: Jann Horn @ 2024-11-07 17:57 UTC (permalink / raw)
To: Qi Zheng
Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On Thu, Nov 7, 2024 at 8:54 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> On 2024/11/7 05:48, Jann Horn wrote:
> > On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> >> In retract_page_tables(), we may modify the pmd entry after acquiring the
> >> pml and ptl, so we should also check whether the pmd entry is stable.
> >
> > Why does taking the PMD lock not guarantee that the PMD entry is stable?
>
> Because the pmd entry may have changed before taking the pmd lock, so we
> need to recheck it after taking the pmd or pte lock.
You mean it could have changed from the value we obtained from
find_pmd_or_thp_or_none(mm, addr, &pmd)? I don't think that matters
though.
> >> Using pte_offset_map_rw_nolock() + pmd_same() to do it, and then we can
> >> also remove the calling of the pte_lockptr().
> >>
> >> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> >> ---
> >> mm/khugepaged.c | 17 ++++++++++++++++-
> >> 1 file changed, 16 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 6f8d46d107b4b..6d76dde64f5fb 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >> spinlock_t *pml;
> >> spinlock_t *ptl;
> >> bool skipped_uffd = false;
> >> + pte_t *pte;
> >>
> >> /*
> >> * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> >> @@ -1756,11 +1757,25 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >> addr, addr + HPAGE_PMD_SIZE);
> >> mmu_notifier_invalidate_range_start(&range);
> >>
> >> + pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
> >> + if (!pte) {
> >> + mmu_notifier_invalidate_range_end(&range);
> >> + continue;
> >> + }
> >> +
> >> pml = pmd_lock(mm, pmd);
> >
> > I don't understand why you're mapping the page table before locking
> > the PMD. Doesn't that just mean we need more error checking
> > afterwards?
>
> The main purpose is to obtain the pmdval. If we don't use
> pte_offset_map_rw_nolock, we should pay attention to recheck pmd entry
> before pte_lockptr(), like this:
>
> pmdval = pmdp_get_lockless(pmd);
> pmd_lock
> recheck pmdval
> pte_lockptr(mm, pmd)
>
> Otherwise, it may cause the system to crash. Consider the following
> situation:
>
> CPU 0                               CPU 1
>
> zap_pte_range
> --> clear pmd entry
>     free pte page (by RCU)
>
>                                     retract_page_tables
>                                     --> pmd_lock
>                                         pte_lockptr(mm, pmd) <-- BOOM!!
>
> So maybe calling pte_offset_map_rw_nolock() is more convenient.
How about refactoring find_pmd_or_thp_or_none() like this, by moving
the checks of the PMD entry value into a separate helper:
-static int find_pmd_or_thp_or_none(struct mm_struct *mm,
- unsigned long address,
- pmd_t **pmd)
+static int check_pmd_state(pmd_t *pmd)
{
- pmd_t pmde;
+ pmd_t pmde = pmdp_get_lockless(*pmd);
- *pmd = mm_find_pmd(mm, address);
- if (!*pmd)
- return SCAN_PMD_NULL;
-
- pmde = pmdp_get_lockless(*pmd);
if (pmd_none(pmde))
return SCAN_PMD_NONE;
if (!pmd_present(pmde))
return SCAN_PMD_NULL;
if (pmd_trans_huge(pmde))
return SCAN_PMD_MAPPED;
if (pmd_devmap(pmde))
return SCAN_PMD_NULL;
if (pmd_bad(pmde))
return SCAN_PMD_NULL;
return SCAN_SUCCEED;
}
+static int find_pmd_or_thp_or_none(struct mm_struct *mm,
+ unsigned long address,
+ pmd_t **pmd)
+{
+
+ *pmd = mm_find_pmd(mm, address);
+ if (!*pmd)
+ return SCAN_PMD_NULL;
+ return check_pmd_state(*pmd);
+}
+
And simplifying retract_page_tables() a little bit like this:
i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
struct mmu_notifier_range range;
struct mm_struct *mm;
unsigned long addr;
pmd_t *pmd, pgt_pmd;
spinlock_t *pml;
spinlock_t *ptl;
- bool skipped_uffd = false;
+ bool success = false;
/*
* Check vma->anon_vma to exclude MAP_PRIVATE mappings that
* got written to. These VMAs are likely not worth removing
* page tables from, as PMD-mapping is likely to be split later.
*/
if (READ_ONCE(vma->anon_vma))
continue;
addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
@@ -1763,34 +1767,34 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
/*
* Huge page lock is still held, so normally the page table
* must remain empty; and we have already skipped anon_vma
* and userfaultfd_wp() vmas. But since the mmap_lock is not
* held, it is still possible for a racing userfaultfd_ioctl()
* to have inserted ptes or markers. Now that we hold ptlock,
* repeating the anon_vma check protects from one category,
* and repeating the userfaultfd_wp() check from another.
*/
- if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
- skipped_uffd = true;
- } else {
+ if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
pmdp_get_lockless_sync();
+ success = true;
}
if (ptl != pml)
spin_unlock(ptl);
+drop_pml:
spin_unlock(pml);
mmu_notifier_invalidate_range_end(&range);
- if (!skipped_uffd) {
+ if (success) {
mm_dec_nr_ptes(mm);
page_table_check_pte_clear_range(mm, addr, pgt_pmd);
pte_free_defer(mm, pmd_pgtable(pgt_pmd));
}
}
i_mmap_unlock_read(mapping);
And then instead of your patch, I think you can just do this?
@@ -1754,20 +1754,22 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
*/
if (userfaultfd_wp(vma))
continue;
/* PTEs were notified when unmapped; but now for the PMD? */
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
addr, addr + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
pml = pmd_lock(mm, pmd);
+ if (check_pmd_state(mm, addr, pmd) != SCAN_SUCCEED)
+ goto drop_pml;
ptl = pte_lockptr(mm, pmd);
if (ptl != pml)
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
/*
* Huge page lock is still held, so normally the page table
* must remain empty; and we have already skipped anon_vma
* and userfaultfd_wp() vmas. But since the mmap_lock is not
* held, it is still possible for a racing userfaultfd_ioctl()
* to have inserted ptes or markers. Now that we hold ptlock,
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 4/7] mm: make zap_pte_range() handle full within-PMD range
2024-10-31 8:13 ` [PATCH v2 4/7] mm: make zap_pte_range() handle full within-PMD range Qi Zheng
@ 2024-11-07 21:46 ` Jann Horn
0 siblings, 0 replies; 31+ messages in thread
From: Jann Horn @ 2024-11-07 21:46 UTC (permalink / raw)
To: Qi Zheng
Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> In preparation for reclaiming empty PTE pages, this commit first makes
> zap_pte_range() to handle the full within-PMD range, so that we can more
> easily detect and free PTE pages in this function in subsequent commits.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Jann Horn <jannh@google.com>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 3/7] mm: introduce do_zap_pte_range()
2024-10-31 8:13 ` [PATCH v2 3/7] mm: introduce do_zap_pte_range() Qi Zheng
@ 2024-11-07 21:50 ` Jann Horn
2024-11-12 17:00 ` David Hildenbrand
1 sibling, 0 replies; 31+ messages in thread
From: Jann Horn @ 2024-11-07 21:50 UTC (permalink / raw)
To: Qi Zheng
Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> This commit introduces do_zap_pte_range() to actually zap the PTEs, which
> will help improve code readability and facilitate secondary checking of
> the processed PTEs in the future.
>
> No functional change.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Jann Horn <jannh@google.com>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU
2024-10-31 8:13 ` [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
@ 2024-11-07 22:39 ` Jann Horn
2024-11-08 7:38 ` Qi Zheng
0 siblings, 1 reply; 31+ messages in thread
From: Jann Horn @ 2024-11-07 22:39 UTC (permalink / raw)
To: Qi Zheng, Dave Hansen, Andy Lutomirski, Peter Zijlstra
Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
+x86 MM maintainers - x86@kernel.org was already cc'ed, but I don't
know if that is enough for them to see it, and I haven't seen them
comment on this series yet; I think you need an ack from them for this
change.
On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages
> will be freed by semi RCU, that is:
>
> - batch table freeing: asynchronous free by RCU
> - single table freeing: IPI + synchronous free
>
> In this way, the page table can be lockless traversed by disabling IRQ in
> paths such as fast GUP. But this is not enough to free the empty PTE page
> table pages in paths other that munmap and exit_mmap path, because IPI
> cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().
>
> In preparation for supporting empty PTE page table pages reclaimation,
> let single table also be freed by RCU like batch table freeing. Then we
> can also use pte_offset_map() etc to prevent PTE page from being freed.
I applied your series locally and followed the page table freeing path
that the reclaim feature would use on x86-64. Looks like it goes like
this with the series applied:
free_pte
  pte_free_tlb
    __pte_free_tlb
      ___pte_free_tlb
        paravirt_tlb_remove_table
          tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
            [no-free-memory slowpath:]
              tlb_table_invalidate
              tlb_remove_table_one
                tlb_remove_table_sync_one [does IPI for GUP-fast]
                __tlb_remove_table_one [frees via RCU]
            [fastpath:]
              tlb_table_flush
                tlb_remove_table_free [frees via RCU]
          native_tlb_remove_table [CONFIG_PARAVIRT on native]
            tlb_remove_table [see above]
Basically, the only remaining case in which
paravirt_tlb_remove_table() does not use tlb_remove_table() with RCU
delay is !CONFIG_PARAVIRT && !CONFIG_PT_RECLAIM. Given that
CONFIG_PT_RECLAIM is defined as "default y" when supported, I guess
that means X86's direct page table freeing path will almost never be
used? If it stays that way and the X86 folks don't see a performance
impact from using RCU to free tables on munmap() / process exit, I
guess we might want to get rid of the direct page table freeing path
on x86 at some point to simplify things...
(That simplification might also help prepare for Intel Remote Action
Request, if that is a thing people want...)
> Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free
> the page table pages:
>
> - The pt_rcu_head is unioned with pt_list and pmd_huge_pte.
>
> - For pt_list, it is used to manage the PGD page in x86. Fortunately
> tlb_remove_table() will not be used for free PGD pages, so it is safe
> to use pt_rcu_head.
>
> - For pmd_huge_pte, we will do zap_deposited_table() before freeing the
> PMD page, so it is also safe.
Please also update the comments on "struct ptdesc" accordingly.
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> arch/x86/include/asm/tlb.h | 19 +++++++++++++++++++
> arch/x86/kernel/paravirt.c | 7 +++++++
> arch/x86/mm/pgtable.c | 10 +++++++++-
> mm/mmu_gather.c | 9 ++++++++-
> 4 files changed, 43 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> index 580636cdc257b..e223b53a8b190 100644
> --- a/arch/x86/include/asm/tlb.h
> +++ b/arch/x86/include/asm/tlb.h
> @@ -34,4 +34,23 @@ static inline void __tlb_remove_table(void *table)
> free_page_and_swap_cache(table);
> }
>
> +#ifdef CONFIG_PT_RECLAIM
> +static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
> +{
> + struct page *page;
> +
> + page = container_of(head, struct page, rcu_head);
> + free_page_and_swap_cache(page);
> +}
Why free_page_and_swap_cache()? Page tables shouldn't have swap cache,
so I think something like put_page() would do the job.
> +static inline void __tlb_remove_table_one(void *table)
> +{
> + struct page *page;
> +
> + page = table;
> + call_rcu(&page->rcu_head, __tlb_remove_table_one_rcu);
> +}
> +#define __tlb_remove_table_one __tlb_remove_table_one
> +#endif /* CONFIG_PT_RECLAIM */
> +
> #endif /* _ASM_X86_TLB_H */
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index fec3815335558..89688921ea62e 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -59,10 +59,17 @@ void __init native_pv_lock_init(void)
> static_branch_enable(&virt_spin_lock_key);
> }
>
> +#ifndef CONFIG_PT_RECLAIM
> static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
> {
> tlb_remove_page(tlb, table);
> }
> +#else
> +static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
> +{
> + tlb_remove_table(tlb, table);
> +}
> +#endif
>
> struct static_key paravirt_steal_enabled;
> struct static_key paravirt_steal_rq_enabled;
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 5745a354a241c..69a357b15974a 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -19,12 +19,20 @@ EXPORT_SYMBOL(physical_mask);
> #endif
>
> #ifndef CONFIG_PARAVIRT
> +#ifndef CONFIG_PT_RECLAIM
> static inline
> void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
> {
> tlb_remove_page(tlb, table);
> }
> -#endif
> +#else
> +static inline
> +void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
> +{
> + tlb_remove_table(tlb, table);
> +}
> +#endif /* !CONFIG_PT_RECLAIM */
> +#endif /* !CONFIG_PARAVIRT */
>
> gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
>
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99b3e9408aa0f..d948479ca09e6 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -311,10 +311,17 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
> }
> }
>
> +#ifndef __tlb_remove_table_one
> +static inline void __tlb_remove_table_one(void *table)
> +{
> + __tlb_remove_table(table);
> +}
> +#endif
> +
> static void tlb_remove_table_one(void *table)
> {
> tlb_remove_table_sync_one();
> - __tlb_remove_table(table);
> + __tlb_remove_table_one(table);
> }
>
> static void tlb_table_flush(struct mmu_gather *tlb)
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
2024-10-31 8:13 ` [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
@ 2024-11-07 23:35 ` Jann Horn
2024-11-08 7:13 ` Qi Zheng
0 siblings, 1 reply; 31+ messages in thread
From: Jann Horn @ 2024-11-07 23:35 UTC (permalink / raw)
To: Qi Zheng
Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> As a first step, this commit aims to synchronously free the empty PTE
> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
> cases other than madvise(MADV_DONTNEED).
>
> Once an empty PTE is detected, we first try to hold the pmd lock within
> the pte lock. If successful, we clear the pmd entry directly (fast path).
> Otherwise, we wait until the pte lock is released, then re-hold the pmd
> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
> whether the PTE page is empty and free it (slow path).
How does this interact with move_pages_pte()? I am looking at your
series applied on top of next-20241106, and it looks to me like
move_pages_pte() uses pte_offset_map_rw_nolock() and assumes that the
PMD entry can't change. You can clearly see this assumption at the
WARN_ON_ONCE(pmd_none(*dst_pmd)). And if we race the wrong way, I
think for example move_present_pte() can end up moving a present PTE
into a page table that has already been scheduled for RCU freeing.
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> include/linux/mm.h | 1 +
> mm/Kconfig | 15 ++++++++++
> mm/Makefile | 1 +
> mm/internal.h | 23 ++++++++++++++++
> mm/madvise.c | 4 ++-
> mm/memory.c | 45 +++++++++++++++++++++++++++++-
> mm/pt_reclaim.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 155 insertions(+), 2 deletions(-)
> create mode 100644 mm/pt_reclaim.c
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3e4bb43035953..ce3936590fe72 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2319,6 +2319,7 @@ extern void pagefault_out_of_memory(void);
> struct zap_details {
> struct folio *single_folio; /* Locked folio to be unmapped */
> bool even_cows; /* Zap COWed private pages too? */
> + bool reclaim_pt; /* Need reclaim page tables? */
> zap_flags_t zap_flags; /* Extra flags for zapping */
> };
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 84000b0168086..681909e0a9fa3 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
> The architecture has hardware support for userspace shadow call
> stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>
> +config ARCH_SUPPORTS_PT_RECLAIM
> + def_bool n
> +
> +config PT_RECLAIM
> + bool "reclaim empty user page table pages"
> + default y
> + depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
> + select MMU_GATHER_RCU_TABLE_FREE
> + help
> + Try to reclaim empty user page table pages in paths other that munmap
nit: s/other that/other than/
> + and exit_mmap path.
> +
> + Note: now only empty user PTE page table pages will be reclaimed.
[...]
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0ceae57da7dad..ee88652761d45 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -851,7 +851,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> unsigned long start, unsigned long end)
> {
> - zap_page_range_single(vma, start, end - start, NULL);
> + struct zap_details details = {.reclaim_pt = true,};
> +
> + zap_page_range_single(vma, start, end - start, &details);
> return 0;
> }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 002aa4f454fa0..c4a8c18fbcfd7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> static inline bool should_zap_cows(struct zap_details *details)
> {
> /* By default, zap all pages */
> - if (!details)
> + if (!details || details->reclaim_pt)
> return true;
>
> /* Or, we zap COWed pages only if the caller wants to */
This looks hacky - when we have a "details" object, its ->even_cows
member is supposed to indicate whether COW pages should be zapped. So
please instead set .even_cows=true in madvise_dontneed_single_vma().
> @@ -1678,6 +1678,30 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
> details, rss);
> }
>
> +static inline int count_pte_none(pte_t *pte, int nr)
> +{
> + int none_nr = 0;
> +
> + /*
> + * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
> + * re-installed, so we need to check pte_none() one by one.
> + * Otherwise, checking a single PTE in a batch is sufficient.
> + */
Please also add a comment noting that count_pte_none() relies on this
invariant on top of do_zap_pte_range() or somewhere like that.
> +#ifdef CONFIG_PTE_MARKER_UFFD_WP
> + for (;;) {
> + if (pte_none(ptep_get(pte)))
> + none_nr++;
> + if (--nr == 0)
> + break;
> + pte++;
> + }
> +#else
> + if (pte_none(ptep_get(pte)))
> + none_nr = nr;
> +#endif
> + return none_nr;
> +}
> +
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, unsigned long end,
> @@ -1689,8 +1713,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> spinlock_t *ptl;
> pte_t *start_pte;
> pte_t *pte;
> + pmd_t pmdval;
> + bool can_reclaim_pt = false;
> + bool direct_reclaim = false;
> + unsigned long start = addr;
> + int none_nr = 0;
> int nr;
>
> + if (details && details->reclaim_pt && (end - start >= PMD_SIZE))
> + can_reclaim_pt = true;
> +
> retry:
> tlb_change_page_size(tlb, PAGE_SIZE);
> init_rss_vec(rss);
> @@ -1706,12 +1738,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>
> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
> &force_flush, &force_break);
> + none_nr += count_pte_none(pte, nr);
> if (unlikely(force_break)) {
> addr += nr * PAGE_SIZE;
> break;
> }
> } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>
> + if (addr == end && can_reclaim_pt && (none_nr == PTRS_PER_PTE))
> + direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
> +
> add_mm_rss_vec(mm, rss);
> arch_leave_lazy_mmu_mode();
>
> @@ -1738,6 +1774,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> goto retry;
> }
>
> + if (can_reclaim_pt) {
> + if (direct_reclaim)
> + free_pte(mm, start, tlb, pmdval);
> + else
> + try_to_free_pte(mm, pmd, start, tlb);
> + }
> +
> return addr;
> }
>
> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
> new file mode 100644
> index 0000000000000..fc055da40b615
> --- /dev/null
> +++ b/mm/pt_reclaim.c
> @@ -0,0 +1,68 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/hugetlb.h>
> +#include <asm-generic/tlb.h>
> +#include <asm/pgalloc.h>
> +
> +#include "internal.h"
> +
> +bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
> +{
> + spinlock_t *pml = pmd_lockptr(mm, pmd);
> +
> + if (!spin_trylock(pml))
> + return false;
> +
> + *pmdval = pmdp_get_lockless(pmd);
> + pmd_clear(pmd);
> + spin_unlock(pml);
> +
> + return true;
> +}
> +
> +void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
> + pmd_t pmdval)
> +{
> + pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
> + mm_dec_nr_ptes(mm);
> +}
> +
> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> + struct mmu_gather *tlb)
> +{
> + pmd_t pmdval;
> + spinlock_t *pml, *ptl;
> + pte_t *start_pte, *pte;
> + int i;
> +
If you swap the following two operations around (first pmd_lock(),
then pte_offset_map_rw_nolock(), then take the PTE lock):
> + start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
> + if (!start_pte)
> + return;
> +
> + pml = pmd_lock(mm, pmd);
> + if (ptl != pml)
> + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +
Then I think this check can go away:
> + if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
> + goto out_ptl;
> +
> + /* Check if it is empty PTE page */
> + for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
> + if (!pte_none(ptep_get(pte)))
> + goto out_ptl;
> + }
> + pte_unmap(start_pte);
> +
> + pmd_clear(pmd);
> +
> + if (ptl != pml)
> + spin_unlock(ptl);
> + spin_unlock(pml);
> +
> + free_pte(mm, addr, tlb, pmdval);
> +
> + return;
> +out_ptl:
> + pte_unmap_unlock(start_pte, ptl);
> + if (pml != ptl)
> + spin_unlock(pml);
> +}
> --
> 2.20.1
>
* Re: [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock()
2024-11-07 17:57 ` Jann Horn
@ 2024-11-08 6:31 ` Qi Zheng
0 siblings, 0 replies; 31+ messages in thread
From: Qi Zheng @ 2024-11-08 6:31 UTC (permalink / raw)
To: Jann Horn
Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On 2024/11/8 01:57, Jann Horn wrote:
> On Thu, Nov 7, 2024 at 8:54 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> On 2024/11/7 05:48, Jann Horn wrote:
>>> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>>> In retract_page_tables(), we may modify the pmd entry after acquiring the
>>>> pml and ptl, so we should also check whether the pmd entry is stable.
>>>
>>> Why does taking the PMD lock not guarantee that the PMD entry is stable?
>>
>> Because the pmd entry may have changed before taking the pmd lock, so we
>> need to recheck it after taking the pmd or pte lock.
>
> You mean it could have changed from the value we obtained from
> find_pmd_or_thp_or_none(mm, addr, &pmd)? I don't think that matters
> though.
>
>>>> Using pte_offset_map_rw_nolock() + pmd_same() to do it, and then we can
>>>> also remove the calling of the pte_lockptr().
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>> mm/khugepaged.c | 17 ++++++++++++++++-
>>>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 6f8d46d107b4b..6d76dde64f5fb 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>>>> spinlock_t *pml;
>>>> spinlock_t *ptl;
>>>> bool skipped_uffd = false;
>>>> + pte_t *pte;
>>>>
>>>> /*
>>>> * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
>>>> @@ -1756,11 +1757,25 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>>>> addr, addr + HPAGE_PMD_SIZE);
>>>> mmu_notifier_invalidate_range_start(&range);
>>>>
>>>> + pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
>>>> + if (!pte) {
>>>> + mmu_notifier_invalidate_range_end(&range);
>>>> + continue;
>>>> + }
>>>> +
>>>> pml = pmd_lock(mm, pmd);
>>>
>>> I don't understand why you're mapping the page table before locking
>>> the PMD. Doesn't that just mean we need more error checking
>>> afterwards?
>>
>> The main purpose is to obtain the pmdval. If we don't use
>> pte_offset_map_rw_nolock, we should pay attention to recheck pmd entry
>> before pte_lockptr(), like this:
>>
>> pmdval = pmdp_get_lockless(pmd);
>> pmd_lock
>> recheck pmdval
>> pte_lockptr(mm, pmd)
>>
>> Otherwise, it may cause the system to crash. Consider the following
>> situation:
>>
>> CPU 0 CPU 1
>>
>> zap_pte_range
>> --> clear pmd entry
>> free pte page (by RCU)
>>
>> retract_page_tables
>> --> pmd_lock
>> pte_lockptr(mm, pmd) <-- BOOM!!
>>
>> So maybe calling pte_offset_map_rw_nolock() is more convenient.
>
> How about refactoring find_pmd_or_thp_or_none() like this, by moving
> the checks of the PMD entry value into a separate helper:
>
>
>
> -static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> - unsigned long address,
> - pmd_t **pmd)
> +static int check_pmd_state(pmd_t *pmd)
> {
> - pmd_t pmde;
> + pmd_t pmde = pmdp_get_lockless(*pmd);
>
> - *pmd = mm_find_pmd(mm, address);
> - if (!*pmd)
> - return SCAN_PMD_NULL;
> -
> - pmde = pmdp_get_lockless(*pmd);
> if (pmd_none(pmde))
> return SCAN_PMD_NONE;
> if (!pmd_present(pmde))
> return SCAN_PMD_NULL;
> if (pmd_trans_huge(pmde))
> return SCAN_PMD_MAPPED;
> if (pmd_devmap(pmde))
> return SCAN_PMD_NULL;
> if (pmd_bad(pmde))
> return SCAN_PMD_NULL;
> return SCAN_SUCCEED;
> }
>
> +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> + unsigned long address,
> + pmd_t **pmd)
> +{
> +
> + *pmd = mm_find_pmd(mm, address);
> + if (!*pmd)
> + return SCAN_PMD_NULL;
> + return check_pmd_state(*pmd);
> +}
> +
>
>
> And simplifying retract_page_tables() a little bit like this:
>
>
> i_mmap_lock_read(mapping);
> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> struct mmu_notifier_range range;
> struct mm_struct *mm;
> unsigned long addr;
> pmd_t *pmd, pgt_pmd;
> spinlock_t *pml;
> spinlock_t *ptl;
> - bool skipped_uffd = false;
> + bool success = false;
>
> /*
> * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> * got written to. These VMAs are likely not worth removing
> * page tables from, as PMD-mapping is likely to be split later.
> */
> if (READ_ONCE(vma->anon_vma))
> continue;
>
> addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> @@ -1763,34 +1767,34 @@ static void retract_page_tables(struct
> address_space *mapping, pgoff_t pgoff)
>
> /*
> * Huge page lock is still held, so normally the page table
> * must remain empty; and we have already skipped anon_vma
> * and userfaultfd_wp() vmas. But since the mmap_lock is not
> * held, it is still possible for a racing userfaultfd_ioctl()
> * to have inserted ptes or markers. Now that we hold ptlock,
> * repeating the anon_vma check protects from one category,
> * and repeating the userfaultfd_wp() check from another.
> */
> - if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
> - skipped_uffd = true;
> - } else {
> + if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
> pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> pmdp_get_lockless_sync();
> + success = true;
> }
>
> if (ptl != pml)
> spin_unlock(ptl);
> +drop_pml:
> spin_unlock(pml);
>
> mmu_notifier_invalidate_range_end(&range);
>
> - if (!skipped_uffd) {
> + if (success) {
> mm_dec_nr_ptes(mm);
> page_table_check_pte_clear_range(mm, addr, pgt_pmd);
> pte_free_defer(mm, pmd_pgtable(pgt_pmd));
> }
> }
> i_mmap_unlock_read(mapping);
>
>
> And then instead of your patch, I think you can just do this?
Ah, this does look much better! Will change to this in the next version.
Thanks!
>
>
> @@ -1754,20 +1754,22 @@ static void retract_page_tables(struct
> address_space *mapping, pgoff_t pgoff)
> */
> if (userfaultfd_wp(vma))
> continue;
>
> /* PTEs were notified when unmapped; but now for the PMD? */
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
> addr, addr + HPAGE_PMD_SIZE);
> mmu_notifier_invalidate_range_start(&range);
>
> pml = pmd_lock(mm, pmd);
> + if (check_pmd_state(mm, addr, pmd) != SCAN_SUCCEED)
> + goto drop_pml;
> ptl = pte_lockptr(mm, pmd);
> if (ptl != pml)
> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>
> /*
> * Huge page lock is still held, so normally the page table
> * must remain empty; and we have already skipped anon_vma
> * and userfaultfd_wp() vmas. But since the mmap_lock is not
> * held, it is still possible for a racing userfaultfd_ioctl()
> * to have inserted ptes or markers. Now that we hold ptlock,
* Re: [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
2024-11-07 23:35 ` Jann Horn
@ 2024-11-08 7:13 ` Qi Zheng
2024-11-08 18:04 ` Jann Horn
0 siblings, 1 reply; 31+ messages in thread
From: Qi Zheng @ 2024-11-08 7:13 UTC (permalink / raw)
To: Jann Horn
Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
Hi Jann,
On 2024/11/8 07:35, Jann Horn wrote:
> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> As a first step, this commit aims to synchronously free the empty PTE
>> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
>> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
>> cases other than madvise(MADV_DONTNEED).
>>
>> Once an empty PTE is detected, we first try to hold the pmd lock within
>> the pte lock. If successful, we clear the pmd entry directly (fast path).
>> Otherwise, we wait until the pte lock is released, then re-hold the pmd
>> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
>> whether the PTE page is empty and free it (slow path).
>
> How does this interact with move_pages_pte()? I am looking at your
> series applied on top of next-20241106, and it looks to me like
> move_pages_pte() uses pte_offset_map_rw_nolock() and assumes that the
> PMD entry can't change. You can clearly see this assumption at the
> WARN_ON_ONCE(pmd_none(*dst_pmd)). And if we race the wrong way, I
In move_pages_pte(), the following conditions may indeed be triggered:
/* Sanity checks before the operation */
if (WARN_ON_ONCE(pmd_none(*dst_pmd)) || WARN_ON_ONCE(pmd_none(*src_pmd)) ||
WARN_ON_ONCE(pmd_trans_huge(*dst_pmd)) ||
WARN_ON_ONCE(pmd_trans_huge(*src_pmd))) {
err = -EINVAL;
goto out;
}
But maybe we can just remove the WARN_ON_ONCE(), because...
> think for example move_present_pte() can end up moving a present PTE
> into a page table that has already been scheduled for RCU freeing.
...this situation cannot happen. Before performing the move
operation, the pte_same() check will be performed after holding the
pte lock, which can ensure that the PTE page is stable:
CPU 0 CPU 1
zap_pte_range
orig_src_pte = ptep_get(src_pte);
pmd_lock
pte_lock
check if all PTEs are pte_none()
--> clear pmd entry
unlock pte
unlock pmd
src_pte_lock
pte_same(orig_src_pte, ptep_get(src_pte))
--> return false and will skip the move op
>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> include/linux/mm.h | 1 +
>> mm/Kconfig | 15 ++++++++++
>> mm/Makefile | 1 +
>> mm/internal.h | 23 ++++++++++++++++
>> mm/madvise.c | 4 ++-
>> mm/memory.c | 45 +++++++++++++++++++++++++++++-
>> mm/pt_reclaim.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++
>> 7 files changed, 155 insertions(+), 2 deletions(-)
>> create mode 100644 mm/pt_reclaim.c
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 3e4bb43035953..ce3936590fe72 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2319,6 +2319,7 @@ extern void pagefault_out_of_memory(void);
>> struct zap_details {
>> struct folio *single_folio; /* Locked folio to be unmapped */
>> bool even_cows; /* Zap COWed private pages too? */
>> + bool reclaim_pt; /* Need reclaim page tables? */
>> zap_flags_t zap_flags; /* Extra flags for zapping */
>> };
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 84000b0168086..681909e0a9fa3 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
>> The architecture has hardware support for userspace shadow call
>> stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>>
>> +config ARCH_SUPPORTS_PT_RECLAIM
>> + def_bool n
>> +
>> +config PT_RECLAIM
>> + bool "reclaim empty user page table pages"
>> + default y
>> + depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
>> + select MMU_GATHER_RCU_TABLE_FREE
>> + help
>> + Try to reclaim empty user page table pages in paths other that munmap
>
> nit: s/other that/other than/
will fix.
>
>> + and exit_mmap path.
>> +
>> + Note: now only empty user PTE page table pages will be reclaimed.
> [...]
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 0ceae57da7dad..ee88652761d45 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -851,7 +851,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>> static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>> unsigned long start, unsigned long end)
>> {
>> - zap_page_range_single(vma, start, end - start, NULL);
>> + struct zap_details details = {.reclaim_pt = true,};
>> +
>> + zap_page_range_single(vma, start, end - start, &details);
>> return 0;
>> }
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 002aa4f454fa0..c4a8c18fbcfd7 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>> static inline bool should_zap_cows(struct zap_details *details)
>> {
>> /* By default, zap all pages */
>> - if (!details)
>> + if (!details || details->reclaim_pt)
>> return true;
>>
>> /* Or, we zap COWed pages only if the caller wants to */
>
> This looks hacky - when we have a "details" object, its ->even_cows
> member is supposed to indicate whether COW pages should be zapped. So
> please instead set .even_cows=true in madvise_dontneed_single_vma().
But the details->reclaim_pt should continue to be set, right? Because
we need to use .reclaim_pt to indicate whether the empty PTE page
should be reclaimed.
>
>> @@ -1678,6 +1678,30 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
>> details, rss);
>> }
>>
>> +static inline int count_pte_none(pte_t *pte, int nr)
>> +{
>> + int none_nr = 0;
>> +
>> + /*
>> + * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
>> + * re-installed, so we need to check pte_none() one by one.
>> + * Otherwise, checking a single PTE in a batch is sufficient.
>> + */
>
> Please also add a comment noting that count_pte_none() relies on this
> invariant on top of do_zap_pte_range() or somewhere like that.
OK, will add a comment above do_zap_pte_range() to explain this special
case.
>
>> +#ifdef CONFIG_PTE_MARKER_UFFD_WP
>> + for (;;) {
>> + if (pte_none(ptep_get(pte)))
>> + none_nr++;
>> + if (--nr == 0)
>> + break;
>> + pte++;
>> + }
>> +#else
>> + if (pte_none(ptep_get(pte)))
>> + none_nr = nr;
>> +#endif
>> + return none_nr;
>> +}
>> +
>> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> struct vm_area_struct *vma, pmd_t *pmd,
>> unsigned long addr, unsigned long end,
>> @@ -1689,8 +1713,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> spinlock_t *ptl;
>> pte_t *start_pte;
>> pte_t *pte;
>> + pmd_t pmdval;
>> + bool can_reclaim_pt = false;
>> + bool direct_reclaim = false;
>> + unsigned long start = addr;
>> + int none_nr = 0;
>> int nr;
>>
>> + if (details && details->reclaim_pt && (end - start >= PMD_SIZE))
>> + can_reclaim_pt = true;
>> +
>> retry:
>> tlb_change_page_size(tlb, PAGE_SIZE);
>> init_rss_vec(rss);
>> @@ -1706,12 +1738,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>
>> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
>> &force_flush, &force_break);
>> + none_nr += count_pte_none(pte, nr);
>> if (unlikely(force_break)) {
>> addr += nr * PAGE_SIZE;
>> break;
>> }
>> } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>
>> + if (addr == end && can_reclaim_pt && (none_nr == PTRS_PER_PTE))
>> + direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
>> +
>> add_mm_rss_vec(mm, rss);
>> arch_leave_lazy_mmu_mode();
>>
>> @@ -1738,6 +1774,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> goto retry;
>> }
>>
>> + if (can_reclaim_pt) {
>> + if (direct_reclaim)
>> + free_pte(mm, start, tlb, pmdval);
>> + else
>> + try_to_free_pte(mm, pmd, start, tlb);
>> + }
>> +
>> return addr;
>> }
>>
>> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
>> new file mode 100644
>> index 0000000000000..fc055da40b615
>> --- /dev/null
>> +++ b/mm/pt_reclaim.c
>> @@ -0,0 +1,68 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include <linux/hugetlb.h>
>> +#include <asm-generic/tlb.h>
>> +#include <asm/pgalloc.h>
>> +
>> +#include "internal.h"
>> +
>> +bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
>> +{
>> + spinlock_t *pml = pmd_lockptr(mm, pmd);
>> +
>> + if (!spin_trylock(pml))
>> + return false;
>> +
>> + *pmdval = pmdp_get_lockless(pmd);
>> + pmd_clear(pmd);
>> + spin_unlock(pml);
>> +
>> + return true;
>> +}
>> +
>> +void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
>> + pmd_t pmdval)
>> +{
>> + pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
>> + mm_dec_nr_ptes(mm);
>> +}
>> +
>> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
>> + struct mmu_gather *tlb)
>> +{
>> + pmd_t pmdval;
>> + spinlock_t *pml, *ptl;
>> + pte_t *start_pte, *pte;
>> + int i;
>> +
>
> If you swap the following two operations around (first pmd_lock(),
> then pte_offset_map_rw_nolock(), then take the PTE lock):
Indeed, will change to it.
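For reference, a minimal sketch of try_to_free_pte() with the locking
reordered as suggested (pmd lock first, then pte_offset_map_rw_nolock(),
then the pte lock); it reuses the helper names from this patch and is
only a sketch, not the final code. With the pmd lock held first, the
pmd_same() recheck is no longer needed:

void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
		     struct mmu_gather *tlb)
{
	pmd_t pmdval;
	spinlock_t *pml, *ptl;
	pte_t *start_pte, *pte;
	int i;

	pml = pmd_lock(mm, pmd);
	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
	if (!start_pte) {
		spin_unlock(pml);
		return;
	}
	if (ptl != pml)
		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

	/* Check if it is an empty PTE page */
	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
		if (!pte_none(ptep_get(pte)))
			goto out_ptl;
	}
	pte_unmap(start_pte);

	pmd_clear(pmd);

	if (ptl != pml)
		spin_unlock(ptl);
	spin_unlock(pml);

	free_pte(mm, addr, tlb, pmdval);
	return;

out_ptl:
	pte_unmap_unlock(start_pte, ptl);
	if (pml != ptl)
		spin_unlock(pml);
}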
Thanks,
Qi
>
>> + start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
>> + if (!start_pte)
>> + return;
>> +
>> + pml = pmd_lock(mm, pmd);
>> + if (ptl != pml)
>> + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>> +
>
> Then I think this check can go away:
>
>> + if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
>> + goto out_ptl;
>> +
>> + /* Check if it is empty PTE page */
>> + for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
>> + if (!pte_none(ptep_get(pte)))
>> + goto out_ptl;
>> + }
>> + pte_unmap(start_pte);
>> +
>> + pmd_clear(pmd);
>> +
>> + if (ptl != pml)
>> + spin_unlock(ptl);
>> + spin_unlock(pml);
>> +
>> + free_pte(mm, addr, tlb, pmdval);
>> +
>> + return;
>> +out_ptl:
>> + pte_unmap_unlock(start_pte, ptl);
>> + if (pml != ptl)
>> + spin_unlock(pml);
>> +}
>> --
>> 2.20.1
>>
* Re: [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU
2024-11-07 22:39 ` Jann Horn
@ 2024-11-08 7:38 ` Qi Zheng
2024-11-08 20:09 ` Jann Horn
2024-11-13 11:26 ` Qi Zheng
0 siblings, 2 replies; 31+ messages in thread
From: Qi Zheng @ 2024-11-08 7:38 UTC (permalink / raw)
To: Jann Horn
Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, david, hughd,
willy, mgorman, muchun.song, vbabka, akpm, zokeefe, rientjes,
peterx, catalin.marinas, linux-mm, linux-kernel, x86
Hi Jann,
On 2024/11/8 06:39, Jann Horn wrote:
> +x86 MM maintainers - x86@kernel.org was already cc'ed, but I don't
> know if that is enough for them to see it, and I haven't seen them
> comment on this series yet; I think you need an ack from them for this
> change.
Yes, thanks to Jann for cc-ing the x86 MM maintainers, and I look forward to
their feedback!
>
> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages
>> will be freed by semi RCU, that is:
>>
>> - batch table freeing: asynchronous free by RCU
>> - single table freeing: IPI + synchronous free
>>
>> In this way, the page table can be lockless traversed by disabling IRQ in
>> paths such as fast GUP. But this is not enough to free the empty PTE page
>> table pages in paths other than munmap and exit_mmap path, because IPI
>> cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().
>>
>> In preparation for supporting empty PTE page table pages reclamation,
>> let single table also be freed by RCU like batch table freeing. Then we
>> can also use pte_offset_map() etc to prevent PTE page from being freed.
>
> I applied your series locally and followed the page table freeing path
> that the reclaim feature would use on x86-64. Looks like it goes like
> this with the series applied:
Yes.
>
> free_pte
> pte_free_tlb
> __pte_free_tlb
> ___pte_free_tlb
> paravirt_tlb_remove_table
> tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
> [no-free-memory slowpath:]
> tlb_table_invalidate
> tlb_remove_table_one
> tlb_remove_table_sync_one [does IPI for GUP-fast]
^
It seems that this step can be omitted when
CONFIG_PT_RECLAIM is enabled, because GUP-fast will
disable IRQ, which can also serve as the RCU critical
section.
> __tlb_remove_table_one [frees via RCU]
> [fastpath:]
> tlb_table_flush
> tlb_remove_table_free [frees via RCU]
> native_tlb_remove_table [CONFIG_PARAVIRT on native]
> tlb_remove_table [see above]
>
> Basically, the only remaining case in which
> paravirt_tlb_remove_table() does not use tlb_remove_table() with RCU
> delay is !CONFIG_PARAVIRT && !CONFIG_PT_RECLAIM. Given that
> CONFIG_PT_RECLAIM is defined as "default y" when supported, I guess
> that means X86's direct page table freeing path will almost never be
> used? If it stays that way and the X86 folks don't see a performance
> impact from using RCU to free tables on munmap() / process exit, I
> guess we might want to get rid of the direct page table freeing path
> on x86 at some point to simplify things...
In theory, using RCU to asynchronously free PTE pages should make
munmap() / process exit path faster. I can try to grab some data.
>
> (That simplification might also help prepare for Intel Remote Action
> Request, if that is a thing people want...)
If so, even better!
>
>> Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free
>> the page table pages:
>>
>> - The pt_rcu_head is unioned with pt_list and pmd_huge_pte.
>>
>> - For pt_list, it is used to manage the PGD page in x86. Fortunately
>> tlb_remove_table() will not be used for free PGD pages, so it is safe
>> to use pt_rcu_head.
>>
>> - For pmd_huge_pte, we will do zap_deposited_table() before freeing the
>> PMD page, so it is also safe.
>
> Please also update the comments on "struct ptdesc" accordingly.
OK, will do.
>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> arch/x86/include/asm/tlb.h | 19 +++++++++++++++++++
>> arch/x86/kernel/paravirt.c | 7 +++++++
>> arch/x86/mm/pgtable.c | 10 +++++++++-
>> mm/mmu_gather.c | 9 ++++++++-
>> 4 files changed, 43 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
>> index 580636cdc257b..e223b53a8b190 100644
>> --- a/arch/x86/include/asm/tlb.h
>> +++ b/arch/x86/include/asm/tlb.h
>> @@ -34,4 +34,23 @@ static inline void __tlb_remove_table(void *table)
>> free_page_and_swap_cache(table);
>> }
>>
>> +#ifdef CONFIG_PT_RECLAIM
>> +static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
>> +{
>> + struct page *page;
>> +
>> + page = container_of(head, struct page, rcu_head);
>> + free_page_and_swap_cache(page);
>> +}
>
> Why free_page_and_swap_cache()? Page tables shouldn't have swap cache,
> so I think something like put_page() would do the job.
Ah, I just did the same thing as __tlb_remove_table(). But I also
have the same doubt as you, why does __tlb_remove_table() need to
call free_page_and_swap_cache() instead of put_page().
Thanks,
Qi
>
>> +static inline void __tlb_remove_table_one(void *table)
>> +{
>> + struct page *page;
>> +
>> + page = table;
>> + call_rcu(&page->rcu_head, __tlb_remove_table_one_rcu);
>> +}
>> +#define __tlb_remove_table_one __tlb_remove_table_one
>> +#endif /* CONFIG_PT_RECLAIM */
>> +
>> #endif /* _ASM_X86_TLB_H */
>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>> index fec3815335558..89688921ea62e 100644
>> --- a/arch/x86/kernel/paravirt.c
>> +++ b/arch/x86/kernel/paravirt.c
>> @@ -59,10 +59,17 @@ void __init native_pv_lock_init(void)
>> static_branch_enable(&virt_spin_lock_key);
>> }
>>
>> +#ifndef CONFIG_PT_RECLAIM
>> static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
>> {
>> tlb_remove_page(tlb, table);
>> }
>> +#else
>> +static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
>> +{
>> + tlb_remove_table(tlb, table);
>> +}
>> +#endif
>>
>> struct static_key paravirt_steal_enabled;
>> struct static_key paravirt_steal_rq_enabled;
>> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
>> index 5745a354a241c..69a357b15974a 100644
>> --- a/arch/x86/mm/pgtable.c
>> +++ b/arch/x86/mm/pgtable.c
>> @@ -19,12 +19,20 @@ EXPORT_SYMBOL(physical_mask);
>> #endif
>>
>> #ifndef CONFIG_PARAVIRT
>> +#ifndef CONFIG_PT_RECLAIM
>> static inline
>> void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
>> {
>> tlb_remove_page(tlb, table);
>> }
>> -#endif
>> +#else
>> +static inline
>> +void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
>> +{
>> + tlb_remove_table(tlb, table);
>> +}
>> +#endif /* !CONFIG_PT_RECLAIM */
>> +#endif /* !CONFIG_PARAVIRT */
>>
>> gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
>>
>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>> index 99b3e9408aa0f..d948479ca09e6 100644
>> --- a/mm/mmu_gather.c
>> +++ b/mm/mmu_gather.c
>> @@ -311,10 +311,17 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
>> }
>> }
>>
>> +#ifndef __tlb_remove_table_one
>> +static inline void __tlb_remove_table_one(void *table)
>> +{
>> + __tlb_remove_table(table);
>> +}
>> +#endif
>> +
>> static void tlb_remove_table_one(void *table)
>> {
>> tlb_remove_table_sync_one();
>> - __tlb_remove_table(table);
>> + __tlb_remove_table_one(table);
>> }
>>
>> static void tlb_table_flush(struct mmu_gather *tlb)
>> --
>> 2.20.1
>>
* Re: [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
2024-11-08 7:13 ` Qi Zheng
@ 2024-11-08 18:04 ` Jann Horn
2024-11-09 3:07 ` Qi Zheng
0 siblings, 1 reply; 31+ messages in thread
From: Jann Horn @ 2024-11-08 18:04 UTC (permalink / raw)
To: Qi Zheng
Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On Fri, Nov 8, 2024 at 8:13 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> On 2024/11/8 07:35, Jann Horn wrote:
> > On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> >> As a first step, this commit aims to synchronously free the empty PTE
> >> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
> >> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
> >> cases other than madvise(MADV_DONTNEED).
> >>
> >> Once an empty PTE is detected, we first try to hold the pmd lock within
> >> the pte lock. If successful, we clear the pmd entry directly (fast path).
> >> Otherwise, we wait until the pte lock is released, then re-hold the pmd
> >> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
> >> whether the PTE page is empty and free it (slow path).
> >
> > How does this interact with move_pages_pte()? I am looking at your
> > series applied on top of next-20241106, and it looks to me like
> > move_pages_pte() uses pte_offset_map_rw_nolock() and assumes that the
> > PMD entry can't change. You can clearly see this assumption at the
> > WARN_ON_ONCE(pmd_none(*dst_pmd)). And if we race the wrong way, I
>
> In move_pages_pte(), the following conditions may indeed be triggered:
>
> /* Sanity checks before the operation */
> if (WARN_ON_ONCE(pmd_none(*dst_pmd)) || WARN_ON_ONCE(pmd_none(*src_pmd)) ||
> WARN_ON_ONCE(pmd_trans_huge(*dst_pmd)) ||
> WARN_ON_ONCE(pmd_trans_huge(*src_pmd))) {
> err = -EINVAL;
> goto out;
> }
>
> But maybe we can just remove the WARN_ON_ONCE(), because...
>
> > think for example move_present_pte() can end up moving a present PTE
> > into a page table that has already been scheduled for RCU freeing.
>
> ...this situation cannot happen. Before performing the move
> operation, the pte_same() check will be performed after holding the
> pte lock, which can ensure that the PTE page is stable:
>
> CPU 0 CPU 1
>
> zap_pte_range
>
> orig_src_pte = ptep_get(src_pte);
>
> pmd_lock
> pte_lock
> check if all PTEs are pte_none()
> --> clear pmd entry
> unlock pte
> unlock pmd
>
> src_pte_lock
> pte_same(orig_src_pte, ptep_get(src_pte))
> --> return false and will skip the move op
Yes, that works for the source PTE. But what about the destination?
Operations on the destination PTE in move_pages_pte() are, when moving
a normal present source PTE pointing to an order-0 page, and assuming
that the optimistic folio_trylock(src_folio) and
anon_vma_trylock_write(src_anon_vma) succeed:
dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr,
&dummy_pmdval, &dst_ptl)
[check that dst_pte is non-NULL]
some racy WARN_ON_ONCE() checks
spin_lock(dst_ptl);
orig_dst_pte = ptep_get(dst_pte);
spin_unlock(dst_ptl);
[bail if orig_dst_pte isn't none]
double_pt_lock(dst_ptl, src_ptl)
[check pte_same(ptep_get(dst_pte), orig_dst_pte)]
and then we're past the point of no return. Note that there is a
pte_same() check against orig_dst_pte, but pte_none(orig_dst_pte) is
intentionally pte_none(), so the pte_same() check does not guarantee
that the destination page table is still linked in.
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 002aa4f454fa0..c4a8c18fbcfd7 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >> static inline bool should_zap_cows(struct zap_details *details)
> >> {
> >> /* By default, zap all pages */
> >> - if (!details)
> >> + if (!details || details->reclaim_pt)
> >> return true;
> >>
> >> /* Or, we zap COWed pages only if the caller wants to */
> >
> > This looks hacky - when we have a "details" object, its ->even_cows
> > member is supposed to indicate whether COW pages should be zapped. So
> > please instead set .even_cows=true in madvise_dontneed_single_vma().
>
> But the details->reclaim_pt should continue to be set, right? Because
> we need to use .reclaim_pt to indicate whether the empty PTE page
> should be reclaimed.
Yeah, you should set both.
* Re: [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU
2024-11-08 7:38 ` Qi Zheng
@ 2024-11-08 20:09 ` Jann Horn
2024-11-09 3:14 ` Qi Zheng
2024-11-13 11:26 ` Qi Zheng
1 sibling, 1 reply; 31+ messages in thread
From: Jann Horn @ 2024-11-08 20:09 UTC (permalink / raw)
To: Qi Zheng
Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, david, hughd,
willy, mgorman, muchun.song, vbabka, akpm, zokeefe, rientjes,
peterx, catalin.marinas, linux-mm, linux-kernel, x86
On Fri, Nov 8, 2024 at 8:38 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> On 2024/11/8 06:39, Jann Horn wrote:
> > On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> > free_pte
> > pte_free_tlb
> > __pte_free_tlb
> > ___pte_free_tlb
> > paravirt_tlb_remove_table
> > tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
> > [no-free-memory slowpath:]
> > tlb_table_invalidate
> > tlb_remove_table_one
> > tlb_remove_table_sync_one [does IPI for GUP-fast]
>
> ^
> It seems that this step can be ommitted when
> CONFIG_PT_RECLAIM is enabled, because GUP-fast will
> disable IRQ, which can also serve as the RCU critical
> section.
Yeah, I think so too.
> >> +#ifdef CONFIG_PT_RECLAIM
> >> +static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
> >> +{
> >> + struct page *page;
> >> +
> >> + page = container_of(head, struct page, rcu_head);
> >> + free_page_and_swap_cache(page);
> >> +}
> >
> > Why free_page_and_swap_cache()? Page tables shouldn't have swap cache,
> > so I think something like put_page() would do the job.
>
> Ah, I just did the same thing as __tlb_remove_table(). But I also
> have the same doubt as you, why does __tlb_remove_table() need to
> call free_page_and_swap_cache() instead of put_page().
I think commit 9e52fc2b50de3a1c08b44f94c610fbe998c0031a probably just
copy-pasted it from a more generic page freeing path...
* Re: [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
2024-11-08 18:04 ` Jann Horn
@ 2024-11-09 3:07 ` Qi Zheng
0 siblings, 0 replies; 31+ messages in thread
From: Qi Zheng @ 2024-11-09 3:07 UTC (permalink / raw)
To: Jann Horn
Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On 2024/11/9 02:04, Jann Horn wrote:
> On Fri, Nov 8, 2024 at 8:13 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> On 2024/11/8 07:35, Jann Horn wrote:
>>> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>>> As a first step, this commit aims to synchronously free the empty PTE
>>>> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
>>>> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
>>>> cases other than madvise(MADV_DONTNEED).
>>>>
>>>> Once an empty PTE is detected, we first try to hold the pmd lock within
>>>> the pte lock. If successful, we clear the pmd entry directly (fast path).
>>>> Otherwise, we wait until the pte lock is released, then re-hold the pmd
>>>> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
>>>> whether the PTE page is empty and free it (slow path).
>>>
>>> How does this interact with move_pages_pte()? I am looking at your
>>> series applied on top of next-20241106, and it looks to me like
>>> move_pages_pte() uses pte_offset_map_rw_nolock() and assumes that the
>>> PMD entry can't change. You can clearly see this assumption at the
>>> WARN_ON_ONCE(pmd_none(*dst_pmd)). And if we race the wrong way, I
>>
>> In move_pages_pte(), the following conditions may indeed be triggered:
>>
>> /* Sanity checks before the operation */
>> if (WARN_ON_ONCE(pmd_none(*dst_pmd)) || WARN_ON_ONCE(pmd_none(*src_pmd)) ||
>> WARN_ON_ONCE(pmd_trans_huge(*dst_pmd)) ||
>> WARN_ON_ONCE(pmd_trans_huge(*src_pmd))) {
>> err = -EINVAL;
>> goto out;
>> }
>>
>> But maybe we can just remove the WARN_ON_ONCE(), because...
>>
>>> think for example move_present_pte() can end up moving a present PTE
>>> into a page table that has already been scheduled for RCU freeing.
>>
>> ...this situation cannot happen. Before performing the move
>> operation, the pte_same() check will be performed after holding the
>> pte lock, which can ensure that the PTE page is stable:
>>
>> CPU 0 CPU 1
>>
>> zap_pte_range
>>
>> orig_src_pte = ptep_get(src_pte);
>>
>> pmd_lock
>> pte_lock
>> check if all PTEs are pte_none()
>> --> clear pmd entry
>> unlock pte
>> unlock pmd
>>
>> src_pte_lock
>> pte_same(orig_src_pte, ptep_get(src_pte))
>> --> return false and will skip the move op
>
> Yes, that works for the source PTE. But what about the destination?
>
> Operations on the destination PTE in move_pages_pte() are, when moving
> a normal present source PTE pointing to an order-0 page, and assuming
> that the optimistic folio_trylock(src_folio) and
> anon_vma_trylock_write(src_anon_vma) succeed:
>
> dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr,
> &dummy_pmdval, &dst_ptl)
> [check that dst_pte is non-NULL]
> some racy WARN_ON_ONCE() checks
> spin_lock(dst_ptl);
> orig_dst_pte = ptep_get(dst_pte);
> spin_unlock(dst_ptl);
> [bail if orig_dst_pte isn't none]
> double_pt_lock(dst_ptl, src_ptl)
> [check pte_same(ptep_get(dst_pte), orig_dst_pte)]
>
> and then we're past the point of no return. Note that there is a
> pte_same() check against orig_dst_pte, but pte_none(orig_dst_pte) is
> intentionally pte_none(), so the pte_same() check does not guarantee
> that the destination page table is still linked in.
OK, now I got what you mean. This is indeed a problem. In this case,
it is still necessary to recheck pmd_same() to ensure the stability
of the dst_pte page. Will fix it.
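A rough sketch of that recheck (dst_pmdval is a placeholder name, and this
is not the final patch): keep the pmd value returned by
pte_offset_map_rw_nolock() for dst_pte instead of passing a dummy, and
recompare it once both PT locks are held:

	dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr,
					   &dst_pmdval, &dst_ptl);
	...
	double_pt_lock(dst_ptl, src_ptl);
	/*
	 * The empty dst PTE page may have been reclaimed and the pmd
	 * entry cleared after dst_pte was mapped, so recheck the pmd
	 * entry as well as the PTEs.
	 */
	if (!pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd)) ||
	    !pte_same(ptep_get(dst_pte), orig_dst_pte) ||
	    !pte_same(ptep_get(src_pte), orig_src_pte)) {
		double_pt_unlock(dst_ptl, src_ptl);
		/* bail out or retry */
	}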
>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 002aa4f454fa0..c4a8c18fbcfd7 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>>> static inline bool should_zap_cows(struct zap_details *details)
>>>> {
>>>> /* By default, zap all pages */
>>>> - if (!details)
>>>> + if (!details || details->reclaim_pt)
>>>> return true;
>>>>
>>>> /* Or, we zap COWed pages only if the caller wants to */
>>>
>>> This looks hacky - when we have a "details" object, its ->even_cows
>>> member is supposed to indicate whether COW pages should be zapped. So
>>> please instead set .even_cows=true in madvise_dontneed_single_vma().
>>
>> But the details->reclaim_pt should continue to be set, right? Because
>> we need to use .reclaim_pt to indicate whether the empty PTE page
>> should be reclaimed.
>
> Yeah, you should set both.
OK.
Thanks!
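With both flags set, madvise_dontneed_single_vma() would presumably end up
looking something like this sketch (not the final patch), leaving
should_zap_cows() untouched:

static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
					unsigned long start, unsigned long end)
{
	struct zap_details details = {
		.reclaim_pt = true,
		.even_cows = true,
	};

	zap_page_range_single(vma, start, end - start, &details);
	return 0;
}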
* Re: [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU
2024-11-08 20:09 ` Jann Horn
@ 2024-11-09 3:14 ` Qi Zheng
0 siblings, 0 replies; 31+ messages in thread
From: Qi Zheng @ 2024-11-09 3:14 UTC (permalink / raw)
To: Jann Horn
Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, david, hughd,
willy, mgorman, muchun.song, vbabka, akpm, zokeefe, rientjes,
peterx, catalin.marinas, linux-mm, linux-kernel, x86
On 2024/11/9 04:09, Jann Horn wrote:
> On Fri, Nov 8, 2024 at 8:38 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> On 2024/11/8 06:39, Jann Horn wrote:
>>> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>> free_pte
>>> pte_free_tlb
>>> __pte_free_tlb
>>> ___pte_free_tlb
>>> paravirt_tlb_remove_table
>>> tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
>>> [no-free-memory slowpath:]
>>> tlb_table_invalidate
>>> tlb_remove_table_one
>>> tlb_remove_table_sync_one [does IPI for GUP-fast]
>>
>> ^
>> It seems that this step can be omitted when
>> CONFIG_PT_RECLAIM is enabled, because GUP-fast will
>> disable IRQ, which can also serve as the RCU critical
>> section.
>
> Yeah, I think so too.
Will remove this step in the next version.
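A minimal sketch of dropping the IPI, assuming the __tlb_remove_table_one()
hook from this series (not the final code):

static void tlb_remove_table_one(void *table)
{
#ifndef CONFIG_PT_RECLAIM
	/*
	 * Without RCU freeing of the single table, the IPI is still
	 * needed to synchronize against GUP-fast.
	 */
	tlb_remove_table_sync_one();
#endif
	__tlb_remove_table_one(table);
}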
>
>>>> +#ifdef CONFIG_PT_RECLAIM
>>>> +static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
>>>> +{
>>>> + struct page *page;
>>>> +
>>>> + page = container_of(head, struct page, rcu_head);
>>>> + free_page_and_swap_cache(page);
>>>> +}
>>>
>>> Why free_page_and_swap_cache()? Page tables shouldn't have swap cache,
>>> so I think something like put_page() would do the job.
>>
>> Ah, I just did the same thing as __tlb_remove_table(). But I also
>> have the same doubt as you, why does __tlb_remove_table() need to
>> call free_page_and_swap_cache() instead of put_page().
>
> I think commit 9e52fc2b50de3a1c08b44f94c610fbe998c0031a probably just
> copy-pasted it from a more generic page freeing path...
I guess so. Will use put_page() instead of free_page_and_swap_cache()
in the next version. But for __tlb_remove_table(), I prefer to send
a separate patch to modify it.
Thanks!
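That would presumably reduce the x86 RCU callback to something like this
sketch (not the final patch):

static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
{
	struct page *page = container_of(head, struct page, rcu_head);

	/* Page table pages never have swap cache, so put_page() is enough. */
	put_page(page);
}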
* Re: [PATCH v2 2/7] mm: introduce zap_nonpresent_ptes()
2024-10-31 8:13 ` [PATCH v2 2/7] mm: introduce zap_nonpresent_ptes() Qi Zheng
2024-11-06 21:48 ` Jann Horn
@ 2024-11-12 16:58 ` David Hildenbrand
1 sibling, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2024-11-12 16:58 UTC (permalink / raw)
To: Qi Zheng, jannh, hughd, willy, mgorman, muchun.song, vbabka,
akpm, zokeefe, rientjes, peterx, catalin.marinas
Cc: linux-mm, linux-kernel, x86
On 31.10.24 09:13, Qi Zheng wrote:
> Similar to zap_present_ptes(), let's introduce zap_nonpresent_ptes() to
> handle non-present ptes, which can improve code readability.
>
> No functional change.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* Re: [PATCH v2 3/7] mm: introduce do_zap_pte_range()
2024-10-31 8:13 ` [PATCH v2 3/7] mm: introduce do_zap_pte_range() Qi Zheng
2024-11-07 21:50 ` Jann Horn
@ 2024-11-12 17:00 ` David Hildenbrand
2024-11-13 2:40 ` Qi Zheng
1 sibling, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2024-11-12 17:00 UTC (permalink / raw)
To: Qi Zheng, jannh, hughd, willy, mgorman, muchun.song, vbabka,
akpm, zokeefe, rientjes, peterx, catalin.marinas
Cc: linux-mm, linux-kernel, x86
On 31.10.24 09:13, Qi Zheng wrote:
> This commit introduces do_zap_pte_range() to actually zap the PTEs, which
> will help improve code readability and facilitate secondary checking of
> the processed PTEs in the future.
>
> No functional change.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
> 1 file changed, 26 insertions(+), 19 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index bd9ebe0f4471f..c1150e62dd073 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1657,6 +1657,27 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
> return nr;
> }
>
> +static inline int do_zap_pte_range(struct mmu_gather *tlb,
> + struct vm_area_struct *vma, pte_t *pte,
> + unsigned long addr, unsigned long end,
> + struct zap_details *details, int *rss,
> + bool *force_flush, bool *force_break)
> +{
> + pte_t ptent = ptep_get(pte);
> + int max_nr = (end - addr) / PAGE_SIZE;
> +
> + if (pte_none(ptent))
> + return 1;
Maybe we should just skip all applicable pte_none() here directly.
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* Re: [PATCH v2 3/7] mm: introduce do_zap_pte_range()
2024-11-12 17:00 ` David Hildenbrand
@ 2024-11-13 2:40 ` Qi Zheng
2024-11-13 11:43 ` David Hildenbrand
0 siblings, 1 reply; 31+ messages in thread
From: Qi Zheng @ 2024-11-13 2:40 UTC (permalink / raw)
To: David Hildenbrand
Cc: jannh, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On 2024/11/13 01:00, David Hildenbrand wrote:
> On 31.10.24 09:13, Qi Zheng wrote:
>> This commit introduces do_zap_pte_range() to actually zap the PTEs, which
>> will help improve code readability and facilitate secondary checking of
>> the processed PTEs in the future.
>>
>> No functional change.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
>> 1 file changed, 26 insertions(+), 19 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index bd9ebe0f4471f..c1150e62dd073 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1657,6 +1657,27 @@ static inline int zap_nonpresent_ptes(struct
>> mmu_gather *tlb,
>> return nr;
>> }
>> +static inline int do_zap_pte_range(struct mmu_gather *tlb,
>> + struct vm_area_struct *vma, pte_t *pte,
>> + unsigned long addr, unsigned long end,
>> + struct zap_details *details, int *rss,
>> + bool *force_flush, bool *force_break)
>> +{
>> + pte_t ptent = ptep_get(pte);
>> + int max_nr = (end - addr) / PAGE_SIZE;
>> +
>> + if (pte_none(ptent))
>> + return 1;
>
> Maybe we should just skip all applicable pte_none() here directly.
Do you mean we should keep pte_none() case in zap_pte_range()? Like
below:
diff --git a/mm/memory.c b/mm/memory.c
index 002aa4f454fa0..2ccdcf37b7a46 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1666,9 +1666,6 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
 	pte_t ptent = ptep_get(pte);
 	int max_nr = (end - addr) / PAGE_SIZE;
 
-	if (pte_none(ptent))
-		return 1;
-
 	if (pte_present(ptent))
 		return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
 					addr, details, rss, force_flush,
@@ -1704,11 +1701,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		if (need_resched())
 			break;
 
-		nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
-				      &force_flush, &force_break);
-		if (unlikely(force_break)) {
-			addr += nr * PAGE_SIZE;
-			break;
+		if (pte_none(ptep_get(pte))) {
+			nr = 1;
+		} else {
+			nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
+					      rss, &force_flush, &force_break);
+			if (unlikely(force_break)) {
+				addr += nr * PAGE_SIZE;
+				break;
+			}
 		}
 	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
This avoids repeated checks for pte_none() later. Both are fine for
me, will change to this in the next version.
>
> Acked-by: David Hildenbrand <david@redhat.com>
Thanks!
>
* Re: [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU
2024-11-08 7:38 ` Qi Zheng
2024-11-08 20:09 ` Jann Horn
@ 2024-11-13 11:26 ` Qi Zheng
1 sibling, 0 replies; 31+ messages in thread
From: Qi Zheng @ 2024-11-13 11:26 UTC (permalink / raw)
To: Jann Horn
Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, david, hughd,
willy, mgorman, muchun.song, vbabka, akpm, zokeefe, rientjes,
peterx, catalin.marinas, linux-mm, linux-kernel, x86
On 2024/11/8 15:38, Qi Zheng wrote:
> Hi Jann,
>
> On 2024/11/8 06:39, Jann Horn wrote:
>> +x86 MM maintainers - x86@kernel.org was already cc'ed, but I don't
>> know if that is enough for them to see it, and I haven't seen them
>> comment on this series yet; I think you need an ack from them for this
>> change.
>
> Yes, thanks to Jann for cc-ing the x86 MM maintainers, and I look forward to
> their feedback!
>
>>
>> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@bytedance.com>
>> wrote:
>>> Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table
>>> pages
>>> will be freed by semi RCU, that is:
>>>
>>> - batch table freeing: asynchronous free by RCU
>>> - single table freeing: IPI + synchronous free
>>>
>>> In this way, the page table can be lockless traversed by disabling
>>> IRQ in
>>> paths such as fast GUP. But this is not enough to free the empty PTE
>>> page
>>> table pages in paths other than munmap and exit_mmap path, because IPI
>>> cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().
>>>
>>> In preparation for supporting empty PTE page table pages reclamation,
>>> let single table also be freed by RCU like batch table freeing. Then we
>>> can also use pte_offset_map() etc to prevent PTE page from being freed.
>>
>> I applied your series locally and followed the page table freeing path
>> that the reclaim feature would use on x86-64. Looks like it goes like
>> this with the series applied:
>
> Yes.
>
>>
>> free_pte
>> pte_free_tlb
>> __pte_free_tlb
>> ___pte_free_tlb
>> paravirt_tlb_remove_table
>> tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
>> [no-free-memory slowpath:]
>> tlb_table_invalidate
>> tlb_remove_table_one
>> tlb_remove_table_sync_one [does IPI for GUP-fast]
>
> ^
> It seems that this step can be omitted when
> CONFIG_PT_RECLAIM is enabled, because GUP-fast will
> disable IRQ, which can also serve as the RCU critical
> section.
>
>> __tlb_remove_table_one [frees via RCU]
>> [fastpath:]
>> tlb_table_flush
>> tlb_remove_table_free [frees via RCU]
>> native_tlb_remove_table [CONFIG_PARAVIRT on native]
>> tlb_remove_table [see above]
>>
>> Basically, the only remaining case in which
>> paravirt_tlb_remove_table() does not use tlb_remove_table() with RCU
>> delay is !CONFIG_PARAVIRT && !CONFIG_PT_RECLAIM. Given that
>> CONFIG_PT_RECLAIM is defined as "default y" when supported, I guess
>> that means X86's direct page table freeing path will almost never be
>> used? If it stays that way and the X86 folks don't see a performance
>> impact from using RCU to free tables on munmap() / process exit, I
>> guess we might want to get rid of the direct page table freeing path
>> on x86 at some point to simplify things...
>
> In theory, using RCU to asynchronously free PTE pages should make
> munmap() / process exit path faster. I can try to grab some data.
>
I ran 'stress-ng --mmap 1 --mmap-bytes 1G', and grabbed the data with
bpftrace like this:
bpftrace -e '
	tracepoint:syscalls:sys_enter_munmap /comm == "stress-ng"/ {
		@start[tid] = nsecs;
	}
	tracepoint:syscalls:sys_exit_munmap /@start[tid]/ {
		@ns[comm] = hist(nsecs - @start[tid]);
		delete(@start[tid]);
	}
	interval:s:1 { exit(); }'
The results are as follows:
without patch:
@ns[stress-ng]:
[1K, 2K)         99566 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)         77756 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            |
[4K, 8K)         32545 |@@@@@@@@@@@@@@@@                                    |
[8K, 16K)          442 |                                                    |
[16K, 32K)          69 |                                                    |
[32K, 64K)           1 |                                                    |
[64K, 128K)          1 |                                                    |
[128K, 256K)        14 |                                                    |
[256K, 512K)        14 |                                                    |
[512K, 1M)          68 |                                                    |

with patch:
@ns[stress-ng]:
[512, 1K)           69 |                                                    |
[1K, 2K)         53921 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)         47088 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[4K, 8K)         20583 |@@@@@@@@@@@@@@@@@@@                                 |
[8K, 16K)          659 |                                                    |
[16K, 32K)          93 |                                                    |
[32K, 64K)          24 |                                                    |
[64K, 128K)         14 |                                                    |
[128K, 256K)         6 |                                                    |
[256K, 512K)        10 |                                                    |
[512K, 1M)          29 |                                                    |
It doesn't seem to have much effect on munmap.
Thanks,
Qi
* Re: [PATCH v2 3/7] mm: introduce do_zap_pte_range()
2024-11-13 2:40 ` Qi Zheng
@ 2024-11-13 11:43 ` David Hildenbrand
2024-11-13 12:19 ` Qi Zheng
2024-11-14 3:09 ` Qi Zheng
0 siblings, 2 replies; 31+ messages in thread
From: David Hildenbrand @ 2024-11-13 11:43 UTC (permalink / raw)
To: Qi Zheng
Cc: jannh, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On 13.11.24 03:40, Qi Zheng wrote:
>
>
> On 2024/11/13 01:00, David Hildenbrand wrote:
>> On 31.10.24 09:13, Qi Zheng wrote:
>>> This commit introduces do_zap_pte_range() to actually zap the PTEs, which
>>> will help improve code readability and facilitate secondary checking of
>>> the processed PTEs in the future.
>>>
>>> No functional change.
>>>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>> mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
>>> 1 file changed, 26 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index bd9ebe0f4471f..c1150e62dd073 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -1657,6 +1657,27 @@ static inline int zap_nonpresent_ptes(struct
>>> mmu_gather *tlb,
>>> return nr;
>>> }
>>> +static inline int do_zap_pte_range(struct mmu_gather *tlb,
>>> + struct vm_area_struct *vma, pte_t *pte,
>>> + unsigned long addr, unsigned long end,
>>> + struct zap_details *details, int *rss,
>>> + bool *force_flush, bool *force_break)
>>> +{
>>> + pte_t ptent = ptep_get(pte);
>>> + int max_nr = (end - addr) / PAGE_SIZE;
>>> +
>>> + if (pte_none(ptent))
>>> + return 1;
>>
>> Maybe we should just skip all applicable pte_none() here directly.
>
> Do you mean we should keep pte_none() case in zap_pte_range()? Like
> below:
>
No, rather an add-on patch that will simply skip over all
consecutive pte_none, like:

if (pte_none(ptent)) {
        int nr;

        for (nr = 1; nr < max_nr; nr++) {
                ptent = ptep_get(pte + nr);
                if (pte_none(ptent))
                        continue;
        }

        max_nr -= nr;
        if (!max_nr)
                return nr;
        addr += nr * PAGE_SIZE;
        pte += nr;
}

Assuming that it's likely more common to have larger pte_none() holes
than single ones, optimizing out the
need_resched()+force_break+incremental pte/addr increments etc.
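(A minimal userspace model of this skip, for illustration only: a PTE is
modelled as an int, with 0 standing in for pte_none(). Note that the inner
loop here breaks at the first non-none entry, which seems to be the intent;
the sketch above uses 'continue', so as written it would always run to
max_nr.)

#include <stdio.h>

/* Return the number of leading "none" entries in pte[0..max_nr). */
static int skip_none_entries(const int *pte, int max_nr)
{
        int nr;

        if (pte[0] != 0)
                return 0;       /* first entry is not none: nothing to skip */

        for (nr = 1; nr < max_nr; nr++) {
                if (pte[nr] != 0)
                        break;  /* stop at the first non-none entry */
        }
        return nr;
}

int main(void)
{
        int ptes[] = { 0, 0, 0, 7, 0, 5 };

        /* Skips the leading run of three none entries, prints 3. */
        printf("skipped %d entries\n", skip_none_entries(ptes, 6));
        return 0;
}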
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 3/7] mm: introduce do_zap_pte_range()
2024-11-13 11:43 ` David Hildenbrand
@ 2024-11-13 12:19 ` Qi Zheng
2024-11-14 3:09 ` Qi Zheng
1 sibling, 0 replies; 31+ messages in thread
From: Qi Zheng @ 2024-11-13 12:19 UTC (permalink / raw)
To: David Hildenbrand
Cc: jannh, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On 2024/11/13 19:43, David Hildenbrand wrote:
> On 13.11.24 03:40, Qi Zheng wrote:
>>
>>
>> On 2024/11/13 01:00, David Hildenbrand wrote:
>>> On 31.10.24 09:13, Qi Zheng wrote:
>>>> This commit introduces do_zap_pte_range() to actually zap the PTEs,
>>>> which
>>>> will help improve code readability and facilitate secondary checking of
>>>> the processed PTEs in the future.
>>>>
>>>> No functional change.
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>> mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
>>>> 1 file changed, 26 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index bd9ebe0f4471f..c1150e62dd073 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1657,6 +1657,27 @@ static inline int zap_nonpresent_ptes(struct
>>>> mmu_gather *tlb,
>>>> return nr;
>>>> }
>>>> +static inline int do_zap_pte_range(struct mmu_gather *tlb,
>>>> + struct vm_area_struct *vma, pte_t *pte,
>>>> + unsigned long addr, unsigned long end,
>>>> + struct zap_details *details, int *rss,
>>>> + bool *force_flush, bool *force_break)
>>>> +{
>>>> + pte_t ptent = ptep_get(pte);
>>>> + int max_nr = (end - addr) / PAGE_SIZE;
>>>> +
>>>> + if (pte_none(ptent))
>>>> + return 1;
>>>
>>> Maybe we should just skip all applicable pte_none() here directly.
>>
>> Do you mean we should keep pte_none() case in zap_pte_range()? Like
>> below:
>>
>
> No, rather an add-on patch that will simply skip over all
> consecutive pte_none, like:
>
> if (pte_none(ptent)) {
>         int nr;
>
>         for (nr = 1; nr < max_nr; nr++) {
>                 ptent = ptep_get(pte + nr);
>                 if (pte_none(ptent))
>                         continue;
>         }
>
>         max_nr -= nr;
>         if (!max_nr)
>                 return nr;
>         addr += nr * PAGE_SIZE;
>         pte += nr;
> }
>
> Assuming that it's likely more common to have larger pte_none() holes
> than single ones, optimizing out the
> need_resched()+force_break+incremental pte/addr increments etc.
Ah, got it. And I agree with you; I will switch to this in the next version.
Thanks!
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 3/7] mm: introduce do_zap_pte_range()
2024-11-13 11:43 ` David Hildenbrand
2024-11-13 12:19 ` Qi Zheng
@ 2024-11-14 3:09 ` Qi Zheng
2024-11-14 4:12 ` Qi Zheng
1 sibling, 1 reply; 31+ messages in thread
From: Qi Zheng @ 2024-11-14 3:09 UTC (permalink / raw)
To: David Hildenbrand
Cc: jannh, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On 2024/11/13 19:43, David Hildenbrand wrote:
> On 13.11.24 03:40, Qi Zheng wrote:
>>
>>
>> On 2024/11/13 01:00, David Hildenbrand wrote:
>>> On 31.10.24 09:13, Qi Zheng wrote:
>>>> This commit introduces do_zap_pte_range() to actually zap the PTEs,
>>>> which
>>>> will help improve code readability and facilitate secondary checking of
>>>> the processed PTEs in the future.
>>>>
>>>> No functional change.
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>> mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
>>>> 1 file changed, 26 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index bd9ebe0f4471f..c1150e62dd073 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1657,6 +1657,27 @@ static inline int zap_nonpresent_ptes(struct
>>>> mmu_gather *tlb,
>>>> return nr;
>>>> }
>>>> +static inline int do_zap_pte_range(struct mmu_gather *tlb,
>>>> + struct vm_area_struct *vma, pte_t *pte,
>>>> + unsigned long addr, unsigned long end,
>>>> + struct zap_details *details, int *rss,
>>>> + bool *force_flush, bool *force_break)
>>>> +{
>>>> + pte_t ptent = ptep_get(pte);
>>>> + int max_nr = (end - addr) / PAGE_SIZE;
>>>> +
>>>> + if (pte_none(ptent))
>>>> + return 1;
>>>
>>> Maybe we should just skip all applicable pte_none() here directly.
>>
>> Do you mean we should keep pte_none() case in zap_pte_range()? Like
>> below:
>>
>
> No, rather an add-on patch that will simply skip over all
> consecutive pte_none, like:
>
> if (pte_none(ptent)) {
>         int nr;
>
>         for (nr = 1; nr < max_nr; nr++) {
>                 ptent = ptep_get(pte + nr);
>                 if (pte_none(ptent))
>                         continue;
>         }
>
>         max_nr -= nr;
>         if (!max_nr)
>                 return nr;
>         addr += nr * PAGE_SIZE;
>         pte += nr;
> }
I'm inclined to hand the pte/addr increments here over to the loop
outside do_zap_pte_range(), like this:
diff --git a/mm/memory.c b/mm/memory.c
index bd9ebe0f4471f..2367a1c19edd6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1657,6 +1657,36 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
         return nr;
 }
 
+static inline int do_zap_pte_range(struct mmu_gather *tlb,
+                                   struct vm_area_struct *vma, pte_t *pte,
+                                   unsigned long addr, unsigned long end,
+                                   struct zap_details *details, int *rss,
+                                   bool *force_flush, bool *force_break)
+{
+        pte_t ptent = ptep_get(pte);
+        int max_nr = (end - addr) / PAGE_SIZE;
+
+        if (pte_none(ptent)) {
+                int nr = 1;
+
+                for (; nr < max_nr; nr++) {
+                        ptent = ptep_get(pte + nr);
+                        if (!pte_none(ptent))
+                                break;
+                }
+
+                return nr;
+        }
+
+        if (pte_present(ptent))
+                return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
+                                        addr, details, rss, force_flush,
+                                        force_break);
+
+        return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
+                                   details, rss);
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
                                    struct vm_area_struct *vma, pmd_t *pmd,
                                    unsigned long addr, unsigned long end,
@@ -1679,28 +1709,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
         flush_tlb_batched_pending(mm);
         arch_enter_lazy_mmu_mode();
         do {
-                pte_t ptent = ptep_get(pte);
-                int max_nr;
-
-                nr = 1;
-                if (pte_none(ptent))
-                        continue;
-
                 if (need_resched())
                         break;
 
-                max_nr = (end - addr) / PAGE_SIZE;
-                if (pte_present(ptent)) {
-                        nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
-                                              addr, details, rss, &force_flush,
-                                              &force_break);
-                        if (unlikely(force_break)) {
-                                addr += nr * PAGE_SIZE;
-                                break;
-                        }
-                } else {
-                        nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
-                                                 addr, details, rss);
+                nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
+                                      rss, &force_flush, &force_break);
+                if (unlikely(force_break)) {
+                        addr += nr * PAGE_SIZE;
+                        break;
                 }
         } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>
> Assuming that it's likely more common to have larger pte_none() holes
> than single ones, optimizing out the
> need_resched()+force_break+incremental pte/addr increments etc.
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 3/7] mm: introduce do_zap_pte_range()
2024-11-14 3:09 ` Qi Zheng
@ 2024-11-14 4:12 ` Qi Zheng
0 siblings, 0 replies; 31+ messages in thread
From: Qi Zheng @ 2024-11-14 4:12 UTC (permalink / raw)
To: David Hildenbrand
Cc: jannh, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
rientjes, peterx, catalin.marinas, linux-mm, linux-kernel, x86
On 2024/11/14 11:09, Qi Zheng wrote:
>
>
> On 2024/11/13 19:43, David Hildenbrand wrote:
>> On 13.11.24 03:40, Qi Zheng wrote:
>>>
>>>
>>> On 2024/11/13 01:00, David Hildenbrand wrote:
>>>> On 31.10.24 09:13, Qi Zheng wrote:
>>>>> This commit introduces do_zap_pte_range() to actually zap the PTEs,
>>>>> which
>>>>> will help improve code readability and facilitate secondary
>>>>> checking of
>>>>> the processed PTEs in the future.
>>>>>
>>>>> No functional change.
>>>>>
>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>> ---
>>>>> mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
>>>>> 1 file changed, 26 insertions(+), 19 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index bd9ebe0f4471f..c1150e62dd073 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -1657,6 +1657,27 @@ static inline int zap_nonpresent_ptes(struct
>>>>> mmu_gather *tlb,
>>>>> return nr;
>>>>> }
>>>>> +static inline int do_zap_pte_range(struct mmu_gather *tlb,
>>>>> + struct vm_area_struct *vma, pte_t *pte,
>>>>> + unsigned long addr, unsigned long end,
>>>>> + struct zap_details *details, int *rss,
>>>>> + bool *force_flush, bool *force_break)
>>>>> +{
>>>>> + pte_t ptent = ptep_get(pte);
>>>>> + int max_nr = (end - addr) / PAGE_SIZE;
>>>>> +
>>>>> + if (pte_none(ptent))
>>>>> + return 1;
>>>>
>>>> Maybe we should just skip all applicable pte_none() here directly.
>>>
>>> Do you mean we should keep pte_none() case in zap_pte_range()? Like
>>> below:
>>>
>>
>> No, rather an add-on patch that will simply skip over all
>> consecutive pte_none, like:
>>
>> if (pte_none(ptent)) {
>>         int nr;
>>
>>         for (nr = 1; nr < max_nr; nr++) {
>>                 ptent = ptep_get(pte + nr);
>>                 if (pte_none(ptent))
>>                         continue;
>>         }
>>
>>         max_nr -= nr;
>>         if (!max_nr)
>>                 return nr;
>>         addr += nr * PAGE_SIZE;
>>         pte += nr;
>> }
>
> I'm inclined to hand the pte/addr increments here over to the loop
> outside do_zap_pte_range(), like this:
>
Finally, I chose to introduce skip_none_ptes() to do this:
diff --git a/mm/memory.c b/mm/memory.c
index bd9ebe0f4471f..24633d0e1445a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1657,6 +1657,28 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
         return nr;
 }
 
+static inline int skip_none_ptes(pte_t *pte, unsigned long addr,
+                                 unsigned long end)
+{
+        pte_t ptent = ptep_get(pte);
+        int max_nr;
+        int nr;
+
+        if (!pte_none(ptent))
+                return 0;
+
+        max_nr = (end - addr) / PAGE_SIZE;
+        nr = 1;
+
+        for (; nr < max_nr; nr++) {
+                ptent = ptep_get(pte + nr);
+                if (!pte_none(ptent))
+                        break;
+        }
+
+        return nr;
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
                                    struct vm_area_struct *vma, pmd_t *pmd,
                                    unsigned long addr, unsigned long end,
@@ -1682,13 +1704,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
                 pte_t ptent = ptep_get(pte);
                 int max_nr;
 
-                nr = 1;
-                if (pte_none(ptent))
-                        continue;
-
                 if (need_resched())
                         break;
 
+                nr = skip_none_ptes(pte, addr, end);
+                if (nr) {
+                        addr += PAGE_SIZE * nr;
+                        if (addr == end)
+                                break;
+                        pte += nr;
+                }
+
                 max_nr = (end - addr) / PAGE_SIZE;
                 if (pte_present(ptent)) {
                         nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
Thanks!
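For illustration, here is a userspace model of the reworked loop structure
above (not kernel code; a PTE is modelled as an int with 0 standing in for
pte_none(), and the present/nonpresent batch helpers are collapsed into a
stand-in that zaps one entry at a time):

#include <stdio.h>

/* Return the number of leading "none" entries in pte[0..end-addr). */
static int skip_none(const int *pte, int addr, int end)
{
        int max_nr = end - addr, nr;

        if (pte[0] != 0)
                return 0;
        for (nr = 1; nr < max_nr; nr++)
                if (pte[nr] != 0)
                        break;
        return nr;
}

/* Stand-in for the zap helpers: handles a batch of one entry here. */
static int zap_one(int *pte, int addr)
{
        printf("zap entry %d (value %d)\n", addr, *pte);
        *pte = 0;
        return 1;
}

int main(void)
{
        int ptes[] = { 0, 0, 3, 4, 0, 0, 0, 9 };
        int end = 8, addr = 0, nr;
        int *pte = ptes;

        do {
                nr = skip_none(pte, addr, end);
                if (nr) {
                        addr += nr;
                        if (addr == end)
                                break;
                        pte += nr;
                }
                nr = zap_one(pte, addr);
        } while (pte += nr, addr += nr, addr != end);

        return 0;
}

Running it zaps entries 2, 3 and 7 while skipping the two runs of none
entries in between.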
^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread
Thread overview: 31+ messages
2024-10-31 8:13 [PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
2024-10-31 8:13 ` [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock() Qi Zheng
2024-11-06 21:48 ` Jann Horn
2024-11-07 7:54 ` Qi Zheng
2024-11-07 17:57 ` Jann Horn
2024-11-08 6:31 ` Qi Zheng
2024-10-31 8:13 ` [PATCH v2 2/7] mm: introduce zap_nonpresent_ptes() Qi Zheng
2024-11-06 21:48 ` Jann Horn
2024-11-12 16:58 ` David Hildenbrand
2024-10-31 8:13 ` [PATCH v2 3/7] mm: introduce do_zap_pte_range() Qi Zheng
2024-11-07 21:50 ` Jann Horn
2024-11-12 17:00 ` David Hildenbrand
2024-11-13 2:40 ` Qi Zheng
2024-11-13 11:43 ` David Hildenbrand
2024-11-13 12:19 ` Qi Zheng
2024-11-14 3:09 ` Qi Zheng
2024-11-14 4:12 ` Qi Zheng
2024-10-31 8:13 ` [PATCH v2 4/7] mm: make zap_pte_range() handle full within-PMD range Qi Zheng
2024-11-07 21:46 ` Jann Horn
2024-10-31 8:13 ` [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
2024-11-07 23:35 ` Jann Horn
2024-11-08 7:13 ` Qi Zheng
2024-11-08 18:04 ` Jann Horn
2024-11-09 3:07 ` Qi Zheng
2024-10-31 8:13 ` [PATCH v2 6/7] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
2024-11-07 22:39 ` Jann Horn
2024-11-08 7:38 ` Qi Zheng
2024-11-08 20:09 ` Jann Horn
2024-11-09 3:14 ` Qi Zheng
2024-11-13 11:26 ` Qi Zheng
2024-10-31 8:13 ` [PATCH v2 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng