* [PATCH v1 0/7] synchronously scan and reclaim empty user PTE pages
@ 2024-10-17  9:47 Qi Zheng
  2024-10-17  9:47 ` [PATCH v1 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_lock() Qi Zheng
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Qi Zheng @ 2024-10-17  9:47 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, jannh, peterx
  Cc: linux-mm, linux-kernel, x86, Qi Zheng

Changes in v1:
 - replace [RFC PATCH 1/7] with a separate series (already merged into mm-unstable):
   https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
   (suggested by David Hildenbrand)
 - squash [RFC PATCH 2/7] into [RFC PATCH 4/7]
   (suggested by David Hildenbrand)
 - change to scan and reclaim empty user PTE pages in zap_pte_range()
   (suggested by David Hildenbrand)
 - sent a separate RFC patch to track the tlb flushing issue, and removed
   that part from this series ([RFC PATCH 3/7] and [RFC PATCH 6/7]).
   link: https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
 - add [PATCH v1 1/7] into this series
 - drop RFC tag
 - rebase onto the next-20241011

Changes in RFC v2:
 - fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reported by
   kernel test robot
 - use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid()
   in retract_page_tables() (in [RFC PATCH 4/7])
 - rebase onto the next-20240805

Hi all,

Previously, we tried to use a completely asynchronous method to reclaim empty
user PTE pages [1]. After discussing with David Hildenbrand, we decided to
implement synchronous reclamation in the case of madvise(MADV_DONTNEED) as the
first step.

So this series aims to synchronously free the empty PTE pages in
madvise(MADV_DONTNEED) case. We will detect and free empty PTE pages in
zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other than
madvise(MADV_DONTNEED).

In zap_pte_range(), mmu_gather is used to perform batch tlb flushing and page
freeing operations. Therefore, if we want to free the empty PTE page in this
path, the most natural way is to add it to mmu_gather as well. Now, if
CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free page table
pages by semi RCU:

 - batch table freeing: asynchronous free by RCU
 - single table freeing: IPI + synchronous free

But this is not enough to free the empty PTE page table pages in paths other
than the munmap and exit_mmap paths, because the IPI cannot be synchronized with
rcu_read_lock() in pte_offset_map{_lock}(). So we should let single tables also
be freed by RCU, like batch table freeing.
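
To illustrate (a simplified conceptual sketch, not the actual kernel code):
lockless walkers such as fast GUP run with IRQs disabled, so an IPI-based
synchronous free cannot complete while the walk is in progress;
pte_offset_map{_lock}(), on the other hand, only relies on rcu_read_lock(),
which delays a call_rcu()-based free but does not block an IPI:

        /* walker A: fast GUP style, safe against the IPI-based free */
        local_irq_save(flags);
        /* ... walk the page tables; the IPI sent by the freeing side
         * cannot be handled here until IRQs are re-enabled ... */
        local_irq_restore(flags);

        /* walker B: pte_offset_map{_lock}() style */
        pte = pte_offset_map(pmd, addr); /* enters an RCU read-side section */
        /* ... the IPI can still be handled here, so a synchronous free
         * could release the PTE page under us; only an RCU-based free
         * is guaranteed to wait for pte_unmap() ... */
        pte_unmap(pte);                  /* leaves the RCU read-side section */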

As a first step, we support this feature on x86_64 and select the newly
introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.

For other cases such as madvise(MADV_FREE), we will consider scanning and
freeing empty PTE pages asynchronously in the future.

This series is based on next-20241011 (which contains the series [2]).

Comments and suggestions are welcome!

Thanks,
Qi

[1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
[2]. https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/

Qi Zheng (7):
  mm: khugepaged: retract_page_tables() use pte_offset_map_lock()
  mm: make zap_pte_range() handle full within-PMD range
  mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been
    re-installed
  mm: zap_present_ptes: return whether the PTE page is unreclaimable
  mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
  x86: mm: free page table pages by RCU instead of semi RCU
  x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64

 arch/x86/Kconfig           |  1 +
 arch/x86/include/asm/tlb.h | 19 ++++++++
 arch/x86/kernel/paravirt.c |  7 +++
 arch/x86/mm/pgtable.c      | 10 +++-
 include/linux/mm.h         |  1 +
 include/linux/mm_inline.h  | 11 +++--
 mm/Kconfig                 | 14 ++++++
 mm/Makefile                |  1 +
 mm/internal.h              | 29 ++++++++++++
 mm/khugepaged.c            |  9 +++-
 mm/madvise.c               |  4 +-
 mm/memory.c                | 95 +++++++++++++++++++++++++++++---------
 mm/mmu_gather.c            |  9 +++-
 mm/pt_reclaim.c            | 68 +++++++++++++++++++++++++++
 14 files changed, 248 insertions(+), 30 deletions(-)
 create mode 100644 mm/pt_reclaim.c

-- 
2.20.1




* [PATCH v1 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_lock()
  2024-10-17  9:47 [PATCH v1 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
@ 2024-10-17  9:47 ` Qi Zheng
  2024-10-17 18:00   ` Jann Horn
  2024-10-17  9:47 ` [PATCH v1 2/7] mm: make zap_pte_range() handle full within-PMD range Qi Zheng
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Qi Zheng @ 2024-10-17  9:47 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, jannh, peterx
  Cc: linux-mm, linux-kernel, x86, Qi Zheng

In retract_page_tables(), we may modify the pmd entry after acquiring the
pml and ptl, so we should also check whether the pmd entry is stable.
Use pte_offset_map_lock() to do this, and then we can also remove the
call to pte_lockptr().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/khugepaged.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 94feb85ce996c..b4f49d323c8d9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		spinlock_t *pml;
 		spinlock_t *ptl;
 		bool skipped_uffd = false;
+		pte_t *pte;
 
 		/*
 		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
@@ -1757,9 +1758,15 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		mmu_notifier_invalidate_range_start(&range);
 
 		pml = pmd_lock(mm, pmd);
-		ptl = pte_lockptr(mm, pmd);
+		pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+		if (!pte) {
+			spin_unlock(pml);
+			mmu_notifier_invalidate_range_end(&range);
+			continue;
+		}
 		if (ptl != pml)
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+		pte_unmap(pte);
 
 		/*
 		 * Huge page lock is still held, so normally the page table
-- 
2.20.1




* [PATCH v1 2/7] mm: make zap_pte_range() handle full within-PMD range
  2024-10-17  9:47 [PATCH v1 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
  2024-10-17  9:47 ` [PATCH v1 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_lock() Qi Zheng
@ 2024-10-17  9:47 ` Qi Zheng
  2024-10-17 18:06   ` Jann Horn
  2024-10-17  9:47 ` [PATCH v1 3/7] mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been re-installed Qi Zheng
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Qi Zheng @ 2024-10-17  9:47 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, jannh, peterx
  Cc: linux-mm, linux-kernel, x86, Qi Zheng

In preparation for reclaiming empty PTE pages, this commit first makes
zap_pte_range() handle the full within-PMD range, so that we can more
easily detect and free PTE pages in this function in subsequent commits.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/memory.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index caa6ed0a7fe5b..fd57c0f49fce2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1602,6 +1602,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	swp_entry_t entry;
 	int nr;
 
+retry:
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	init_rss_vec(rss);
 	start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
@@ -1706,6 +1707,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	if (force_flush)
 		tlb_flush_mmu(tlb);
 
+	if (addr != end) {
+		cond_resched();
+		goto retry;
+	}
+
 	return addr;
 }
 
@@ -1744,8 +1750,6 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 			continue;
 		}
 		addr = zap_pte_range(tlb, vma, pmd, addr, next, details);
-		if (addr != next)
-			pmd--;
 	} while (pmd++, cond_resched(), addr != end);
 
 	return addr;
-- 
2.20.1




* [PATCH v1 3/7] mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been re-installed
  2024-10-17  9:47 [PATCH v1 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
  2024-10-17  9:47 ` [PATCH v1 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_lock() Qi Zheng
  2024-10-17  9:47 ` [PATCH v1 2/7] mm: make zap_pte_range() handle full within-PMD range Qi Zheng
@ 2024-10-17  9:47 ` Qi Zheng
  2024-10-17  9:47 ` [PATCH v1 4/7] mm: zap_present_ptes: return whether the PTE page is unreclaimable Qi Zheng
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Qi Zheng @ 2024-10-17  9:47 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, jannh, peterx
  Cc: linux-mm, linux-kernel, x86, Qi Zheng

In some cases, we'll replace the none pte with an uffd-wp swap special pte
marker when necessary. Let's expose this information to the caller through
the return value, so that subsequent commits can use this information to
detect whether the PTE page is empty.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm_inline.h | 11 ++++++++---
 mm/memory.c               | 15 +++++++++++----
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 6f801c7b36e2f..c3ba1da418caf 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -552,8 +552,10 @@ static inline pte_marker copy_pte_marker(
  * pte, and if they see it, they'll fault and serialize at the pgtable lock.
  *
  * This function is a no-op if PTE_MARKER_UFFD_WP is not enabled.
+ *
+ * Returns true if an uffd-wp pte was installed, false otherwise.
  */
-static inline void
+static inline bool
 pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
 			      pte_t *pte, pte_t pteval)
 {
@@ -570,7 +572,7 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
 	 * with a swap pte.  There's no way of leaking the bit.
 	 */
 	if (vma_is_anonymous(vma) || !userfaultfd_wp(vma))
-		return;
+		return false;
 
 	/* A uffd-wp wr-protected normal pte */
 	if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
@@ -583,10 +585,13 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
 	if (unlikely(pte_swp_uffd_wp_any(pteval)))
 		arm_uffd_pte = true;
 
-	if (unlikely(arm_uffd_pte))
+	if (unlikely(arm_uffd_pte)) {
 		set_pte_at(vma->vm_mm, addr, pte,
 			   make_pte_marker(PTE_MARKER_UFFD_WP));
+		return true;
+	}
 #endif
+	return false;
 }
 
 static inline bool vma_has_recency(struct vm_area_struct *vma)
diff --git a/mm/memory.c b/mm/memory.c
index fd57c0f49fce2..534d9d52b5ebe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1467,27 +1467,34 @@ static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
 /*
  * This function makes sure that we'll replace the none pte with an uffd-wp
  * swap special pte marker when necessary. Must be with the pgtable lock held.
+ *
+ * Returns true if uffd-wp ptes was installed, false otherwise.
  */
-static inline void
+static inline bool
 zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
 			      unsigned long addr, pte_t *pte, int nr,
 			      struct zap_details *details, pte_t pteval)
 {
+	bool was_installed = false;
+
 	/* Zap on anonymous always means dropping everything */
 	if (vma_is_anonymous(vma))
-		return;
+		return false;
 
 	if (zap_drop_file_uffd_wp(details))
-		return;
+		return false;
 
 	for (;;) {
 		/* the PFN in the PTE is irrelevant. */
-		pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+		if (pte_install_uffd_wp_if_needed(vma, addr, pte, pteval))
+			was_installed = true;
 		if (--nr == 0)
 			break;
 		pte++;
 		addr += PAGE_SIZE;
 	}
+
+	return was_installed;
 }
 
 static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
-- 
2.20.1




* [PATCH v1 4/7] mm: zap_present_ptes: return whether the PTE page is unreclaimable
  2024-10-17  9:47 [PATCH v1 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (2 preceding siblings ...)
  2024-10-17  9:47 ` [PATCH v1 3/7] mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been re-installed Qi Zheng
@ 2024-10-17  9:47 ` Qi Zheng
  2024-10-17  9:47 ` [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Qi Zheng @ 2024-10-17  9:47 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, jannh, peterx
  Cc: linux-mm, linux-kernel, x86, Qi Zheng

In the following two cases, the PTE page cannot be empty and cannot be
reclaimed:

1. an uffd-wp pte was re-installed
2. should_zap_folio() returns false

Let's expose this information to the caller through is_pt_unreclaimable,
so that subsequent commits can use this information in zap_pte_range() to
detect whether the PTE page can be reclaimed.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/memory.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 534d9d52b5ebe..cc89ede8ce2ab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1501,7 +1501,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, struct folio *folio,
 		struct page *page, pte_t *pte, pte_t ptent, unsigned int nr,
 		unsigned long addr, struct zap_details *details, int *rss,
-		bool *force_flush, bool *force_break)
+		bool *force_flush, bool *force_break, bool *is_pt_unreclaimable)
 {
 	struct mm_struct *mm = tlb->mm;
 	bool delay_rmap = false;
@@ -1527,8 +1527,8 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
 	arch_check_zapped_pte(vma, ptent);
 	tlb_remove_tlb_entries(tlb, pte, nr, addr);
 	if (unlikely(userfaultfd_pte_wp(vma, ptent)))
-		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details,
-					      ptent);
+		*is_pt_unreclaimable =
+			zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
 
 	if (!delay_rmap) {
 		folio_remove_rmap_ptes(folio, page, nr, vma);
@@ -1552,7 +1552,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
 		unsigned int max_nr, unsigned long addr,
 		struct zap_details *details, int *rss, bool *force_flush,
-		bool *force_break)
+		bool *force_break, bool *is_pt_unreclaimable)
 {
 	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	struct mm_struct *mm = tlb->mm;
@@ -1567,15 +1567,18 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 		arch_check_zapped_pte(vma, ptent);
 		tlb_remove_tlb_entry(tlb, pte, addr);
 		if (userfaultfd_pte_wp(vma, ptent))
-			zap_install_uffd_wp_if_needed(vma, addr, pte, 1,
-						      details, ptent);
+			*is_pt_unreclaimable =
+				zap_install_uffd_wp_if_needed(vma, addr, pte, 1,
+							      details, ptent);
 		ksm_might_unmap_zero_page(mm, ptent);
 		return 1;
 	}
 
 	folio = page_folio(page);
-	if (unlikely(!should_zap_folio(details, folio)))
+	if (unlikely(!should_zap_folio(details, folio))) {
+		*is_pt_unreclaimable = true;
 		return 1;
+	}
 
 	/*
 	 * Make sure that the common "small folio" case is as fast as possible
@@ -1587,11 +1590,12 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 
 		zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
 				       addr, details, rss, force_flush,
-				       force_break);
+				       force_break, is_pt_unreclaimable);
 		return nr;
 	}
 	zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, 1, addr,
-			       details, rss, force_flush, force_break);
+			       details, rss, force_flush, force_break,
+			       is_pt_unreclaimable);
 	return 1;
 }
 
@@ -1622,6 +1626,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		pte_t ptent = ptep_get(pte);
 		struct folio *folio;
 		struct page *page;
+		bool is_pt_unreclaimable = false;
 		int max_nr;
 
 		nr = 1;
@@ -1635,7 +1640,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			max_nr = (end - addr) / PAGE_SIZE;
 			nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
 					      addr, details, rss, &force_flush,
-					      &force_break);
+					      &force_break, &is_pt_unreclaimable);
 			if (unlikely(force_break)) {
 				addr += nr * PAGE_SIZE;
 				break;
-- 
2.20.1




* [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
  2024-10-17  9:47 [PATCH v1 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (3 preceding siblings ...)
  2024-10-17  9:47 ` [PATCH v1 4/7] mm: zap_present_ptes: return whether the PTE page is unreclaimable Qi Zheng
@ 2024-10-17  9:47 ` Qi Zheng
  2024-10-17 18:43   ` Jann Horn
  2024-10-17  9:47 ` [PATCH v1 6/7] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
  2024-10-17  9:47 ` [PATCH v1 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng
  6 siblings, 1 reply; 18+ messages in thread
From: Qi Zheng @ 2024-10-17  9:47 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, jannh, peterx
  Cc: linux-mm, linux-kernel, x86, Qi Zheng

Now in order to pursue high performance, applications mostly use some
high-performance user-mode memory allocators, such as jemalloc or
tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
release page table memory, which may cause huge page table memory usage.

The following is a memory usage snapshot of one process, which actually
happened on our server:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

In this case, most of the page table entries are empty. For such a PTE
page where all entries are empty, we can actually free it back to the
system for others to use.

As a first step, this commit aims to synchronously free the empty PTE
pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
cases other than madvise(MADV_DONTNEED).

Once an empty PTE page is detected, we first try to hold the pmd lock within
the pte lock. If successful, we clear the pmd entry directly (fast path).
Otherwise, we wait until the pte lock is released, then re-hold the pmd
and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
whether the PTE page is empty and free it (slow path).

For other cases such as madvise(MADV_FREE), we will consider scanning and
freeing empty PTE pages asynchronously in the future.

The following code snippet shows the effect of this optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }
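
For reference, a minimal standalone version of the above snippet could look
like this (just a sketch; error handling and the exact touch pattern are
simplified assumptions, not the actual test program we used):

        #include <stdlib.h>
        #include <string.h>
        #include <sys/mman.h>

        #define CHUNK   (2UL << 20)        /* 2M */
        #define NCHUNKS (1024UL * 25)      /* 25600 * 2M = 50G */

        int main(void)
        {
                char *p = mmap(NULL, NCHUNKS * CHUNK, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED)
                        return 1;

                for (;;) {
                        for (unsigned long i = 0; i < NCHUNKS; i++) {
                                /* touch 2M memory, then drop it again */
                                memset(p + i * CHUNK, 1, CHUNK);
                                madvise(p + i * CHUNK, CHUNK, MADV_DONTNEED);
                        }
                }
                return 0;
        }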

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.1 MB
VmPTE                102640 KB                         240 KB

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h |  1 +
 mm/Kconfig         | 14 ++++++++++
 mm/Makefile        |  1 +
 mm/internal.h      | 29 ++++++++++++++++++++
 mm/madvise.c       |  4 ++-
 mm/memory.c        | 47 +++++++++++++++++++++++++++-----
 mm/pt_reclaim.c    | 68 ++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 156 insertions(+), 8 deletions(-)
 create mode 100644 mm/pt_reclaim.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index df0a5eac66b78..667a466bb4649 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2321,6 +2321,7 @@ extern void pagefault_out_of_memory(void);
 struct zap_details {
 	struct folio *single_folio;	/* Locked folio to be unmapped */
 	bool even_cows;			/* Zap COWed private pages too? */
+	bool reclaim_pt;
 	zap_flags_t zap_flags;		/* Extra flags for zapping */
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 4b2a1ef9a161c..f5993b9cc2a9f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1302,6 +1302,20 @@ config ARCH_HAS_USER_SHADOW_STACK
 	  The architecture has hardware support for userspace shadow call
           stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
 
+config ARCH_SUPPORTS_PT_RECLAIM
+	def_bool n
+
+config PT_RECLAIM
+	bool "reclaim empty user page table pages"
+	default y
+	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
+	select MMU_GATHER_RCU_TABLE_FREE
+	help
+	  Try to reclaim empty user page table pages in paths other than the
+	  munmap and exit_mmap paths.
+
+	  Note: now only empty user PTE page table pages will be reclaimed.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d5639b0361663..9d816323d247a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -145,3 +145,4 @@ obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
+obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
diff --git a/mm/internal.h b/mm/internal.h
index 906da6280c2df..4adaaea0917c0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1445,4 +1445,33 @@ static inline void accept_page(struct page *page)
 }
 #endif /* CONFIG_UNACCEPTED_MEMORY */
 
+#ifdef CONFIG_PT_RECLAIM
+static inline void set_pt_unreclaimable(bool *can_reclaim_pt)
+{
+	*can_reclaim_pt = false;
+}
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval);
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+	      pmd_t pmdval);
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+		     struct mmu_gather *tlb);
+#else
+static inline void set_pt_unreclaimable(bool *can_reclaim_pt)
+{
+}
+static inline bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd,
+					 pmd_t *pmdval)
+{
+	return false;
+}
+static inline void free_pte(struct mm_struct *mm, unsigned long addr,
+			    struct mmu_gather *tlb, pmd_t pmdval)
+{
+}
+static inline void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd,
+				   unsigned long addr, struct mmu_gather *tlb)
+{
+}
+#endif /* CONFIG_PT_RECLAIM */
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index e871a72a6c329..82a6d15429da7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -843,7 +843,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range_single(vma, start, end - start, NULL);
+	struct zap_details details = {.reclaim_pt = true,};
+
+	zap_page_range_single(vma, start, end - start, &details);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index cc89ede8ce2ab..77774b34f2cde 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1437,7 +1437,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 static inline bool should_zap_cows(struct zap_details *details)
 {
 	/* By default, zap all pages */
-	if (!details)
+	if (!details || details->reclaim_pt)
 		return true;
 
 	/* Or, we zap COWed pages only if the caller wants to */
@@ -1611,8 +1611,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	pte_t *start_pte;
 	pte_t *pte;
 	swp_entry_t entry;
+	pmd_t pmdval;
+	bool can_reclaim_pt = false;
+	bool direct_reclaim;
+	unsigned long start = addr;
 	int nr;
 
+	if (details && details->reclaim_pt)
+		can_reclaim_pt = true;
+
+	if ((ALIGN_DOWN(end, PMD_SIZE)) - (ALIGN(start, PMD_SIZE)) < PMD_SIZE)
+		can_reclaim_pt = false;
+
 retry:
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	init_rss_vec(rss);
@@ -1641,6 +1651,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
 					      addr, details, rss, &force_flush,
 					      &force_break, &is_pt_unreclaimable);
+			if (is_pt_unreclaimable)
+				set_pt_unreclaimable(&can_reclaim_pt);
 			if (unlikely(force_break)) {
 				addr += nr * PAGE_SIZE;
 				break;
@@ -1653,8 +1665,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		    is_device_exclusive_entry(entry)) {
 			page = pfn_swap_entry_to_page(entry);
 			folio = page_folio(page);
-			if (unlikely(!should_zap_folio(details, folio)))
+			if (unlikely(!should_zap_folio(details, folio))) {
+				set_pt_unreclaimable(&can_reclaim_pt);
 				continue;
+			}
 			/*
 			 * Both device private/exclusive mappings should only
 			 * work with anonymous page so far, so we don't need to
@@ -1670,14 +1684,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			max_nr = (end - addr) / PAGE_SIZE;
 			nr = swap_pte_batch(pte, max_nr, ptent);
 			/* Genuine swap entries, hence a private anon pages */
-			if (!should_zap_cows(details))
+			if (!should_zap_cows(details)) {
+				set_pt_unreclaimable(&can_reclaim_pt);
 				continue;
+			}
 			rss[MM_SWAPENTS] -= nr;
 			free_swap_and_cache_nr(entry, nr);
 		} else if (is_migration_entry(entry)) {
 			folio = pfn_swap_entry_folio(entry);
-			if (!should_zap_folio(details, folio))
+			if (!should_zap_folio(details, folio)) {
+				set_pt_unreclaimable(&can_reclaim_pt);
 				continue;
+			}
 			rss[mm_counter(folio)]--;
 		} else if (pte_marker_entry_uffd_wp(entry)) {
 			/*
@@ -1685,21 +1703,29 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			 * drop the marker if explicitly requested.
 			 */
 			if (!vma_is_anonymous(vma) &&
-			    !zap_drop_file_uffd_wp(details))
+			    !zap_drop_file_uffd_wp(details)) {
+				set_pt_unreclaimable(&can_reclaim_pt);
 				continue;
+			}
 		} else if (is_hwpoison_entry(entry) ||
 			   is_poisoned_swp_entry(entry)) {
-			if (!should_zap_cows(details))
+			if (!should_zap_cows(details)) {
+				set_pt_unreclaimable(&can_reclaim_pt);
 				continue;
+			}
 		} else {
 			/* We should have covered all the swap entry types */
 			pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
 			WARN_ON_ONCE(1);
 		}
 		clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
-		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
+		if (zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent))
+			set_pt_unreclaimable(&can_reclaim_pt);
 	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
 
+	if (addr == end && can_reclaim_pt)
+		direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
+
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
 
@@ -1724,6 +1750,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		goto retry;
 	}
 
+	if (can_reclaim_pt) {
+		if (direct_reclaim)
+			free_pte(mm, start, tlb, pmdval);
+		else
+			try_to_free_pte(mm, pmd, start, tlb);
+	}
+
 	return addr;
 }
 
diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
new file mode 100644
index 0000000000000..fc055da40b615
--- /dev/null
+++ b/mm/pt_reclaim.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/hugetlb.h>
+#include <asm-generic/tlb.h>
+#include <asm/pgalloc.h>
+
+#include "internal.h"
+
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
+{
+	spinlock_t *pml = pmd_lockptr(mm, pmd);
+
+	if (!spin_trylock(pml))
+		return false;
+
+	*pmdval = pmdp_get_lockless(pmd);
+	pmd_clear(pmd);
+	spin_unlock(pml);
+
+	return true;
+}
+
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+	      pmd_t pmdval)
+{
+	pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
+	mm_dec_nr_ptes(mm);
+}
+
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+		     struct mmu_gather *tlb)
+{
+	pmd_t pmdval;
+	spinlock_t *pml, *ptl;
+	pte_t *start_pte, *pte;
+	int i;
+
+	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
+	if (!start_pte)
+		return;
+
+	pml = pmd_lock(mm, pmd);
+	if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
+		goto out_ptl;
+
+	/* Check if it is empty PTE page */
+	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
+		if (!pte_none(ptep_get(pte)))
+			goto out_ptl;
+	}
+	pte_unmap(start_pte);
+
+	pmd_clear(pmd);
+
+	if (ptl != pml)
+		spin_unlock(ptl);
+	spin_unlock(pml);
+
+	free_pte(mm, addr, tlb, pmdval);
+
+	return;
+out_ptl:
+	pte_unmap_unlock(start_pte, ptl);
+	if (pml != ptl)
+		spin_unlock(pml);
+}
-- 
2.20.1




* [PATCH v1 6/7] x86: mm: free page table pages by RCU instead of semi RCU
  2024-10-17  9:47 [PATCH v1 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (4 preceding siblings ...)
  2024-10-17  9:47 ` [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
@ 2024-10-17  9:47 ` Qi Zheng
  2024-10-17  9:47 ` [PATCH v1 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng
  6 siblings, 0 replies; 18+ messages in thread
From: Qi Zheng @ 2024-10-17  9:47 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, jannh, peterx
  Cc: linux-mm, linux-kernel, x86, Qi Zheng

Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages
will be freed by semi RCU, that is:

 - batch table freeing: asynchronous free by RCU
 - single table freeing: IPI + synchronous free

In this way, the page table can be traversed locklessly by disabling IRQs in
paths such as fast GUP. But this is not enough to free the empty PTE page
table pages in paths other than the munmap and exit_mmap paths, because the
IPI cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().

In preparation for supporting empty PTE page table page reclamation,
let single tables also be freed by RCU, like batch table freeing. Then we
can also use pte_offset_map() etc. to prevent the PTE page from being freed.

Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free
the page table pages:

 - The pt_rcu_head is unioned with pt_list and pmd_huge_pte.

 - For pt_list, it is used to manage PGD pages on x86. Fortunately,
   tlb_remove_table() will not be used to free PGD pages, so it is safe
   to use pt_rcu_head.

 - For pmd_huge_pte, we will do zap_deposited_table() before freeing the
   PMD page, so it is also safe.
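
For reference, the union in question in struct ptdesc looks roughly like this
(a simplified excerpt; unrelated fields are omitted):

        union {
                struct rcu_head pt_rcu_head;    /* used here to free the table via RCU */
                struct list_head pt_list;       /* on x86, only used to manage PGD pages */
                struct {
                        unsigned long _pt_pad_1;
                        pgtable_t pmd_huge_pte; /* deposited table, see zap_deposited_table() */
                };
        };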

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/x86/include/asm/tlb.h | 19 +++++++++++++++++++
 arch/x86/kernel/paravirt.c |  7 +++++++
 arch/x86/mm/pgtable.c      | 10 +++++++++-
 mm/mmu_gather.c            |  9 ++++++++-
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 580636cdc257b..e223b53a8b190 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -34,4 +34,23 @@ static inline void __tlb_remove_table(void *table)
 	free_page_and_swap_cache(table);
 }
 
+#ifdef CONFIG_PT_RECLAIM
+static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	free_page_and_swap_cache(page);
+}
+
+static inline void __tlb_remove_table_one(void *table)
+{
+	struct page *page;
+
+	page = table;
+	call_rcu(&page->rcu_head, __tlb_remove_table_one_rcu);
+}
+#define __tlb_remove_table_one __tlb_remove_table_one
+#endif /* CONFIG_PT_RECLAIM */
+
 #endif /* _ASM_X86_TLB_H */
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec3815335558..89688921ea62e 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -59,10 +59,17 @@ void __init native_pv_lock_init(void)
 		static_branch_enable(&virt_spin_lock_key);
 }
 
+#ifndef CONFIG_PT_RECLAIM
 static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
 	tlb_remove_page(tlb, table);
 }
+#else
+static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	tlb_remove_table(tlb, table);
+}
+#endif
 
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 5745a354a241c..69a357b15974a 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -19,12 +19,20 @@ EXPORT_SYMBOL(physical_mask);
 #endif
 
 #ifndef CONFIG_PARAVIRT
+#ifndef CONFIG_PT_RECLAIM
 static inline
 void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
 	tlb_remove_page(tlb, table);
 }
-#endif
+#else
+static inline
+void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	tlb_remove_table(tlb, table);
+}
+#endif /* !CONFIG_PT_RECLAIM */
+#endif /* !CONFIG_PARAVIRT */
 
 gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
 
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99b3e9408aa0f..d948479ca09e6 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -311,10 +311,17 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 	}
 }
 
+#ifndef __tlb_remove_table_one
+static inline void __tlb_remove_table_one(void *table)
+{
+	__tlb_remove_table(table);
+}
+#endif
+
 static void tlb_remove_table_one(void *table)
 {
 	tlb_remove_table_sync_one();
-	__tlb_remove_table(table);
+	__tlb_remove_table_one(table);
 }
 
 static void tlb_table_flush(struct mmu_gather *tlb)
-- 
2.20.1




* [PATCH v1 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
  2024-10-17  9:47 [PATCH v1 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (5 preceding siblings ...)
  2024-10-17  9:47 ` [PATCH v1 6/7] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
@ 2024-10-17  9:47 ` Qi Zheng
  2024-10-23  6:54   ` kernel test robot
  6 siblings, 1 reply; 18+ messages in thread
From: Qi Zheng @ 2024-10-17  9:47 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, jannh, peterx
  Cc: linux-mm, linux-kernel, x86, Qi Zheng

Now that x86 fully supports the CONFIG_PT_RECLAIM feature, and reclaiming
PTE pages is only profitable on 64-bit systems, select
ARCH_SUPPORTS_PT_RECLAIM if X86_64.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1ea18662942c9..69a20cb9ddd81 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -319,6 +319,7 @@ config X86
 	select FUNCTION_ALIGNMENT_4B
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
 	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
+	select ARCH_SUPPORTS_PT_RECLAIM		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
-- 
2.20.1




* Re: [PATCH v1 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_lock()
  2024-10-17  9:47 ` [PATCH v1 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_lock() Qi Zheng
@ 2024-10-17 18:00   ` Jann Horn
  2024-10-18  2:15     ` Qi Zheng
  0 siblings, 1 reply; 18+ messages in thread
From: Jann Horn @ 2024-10-17 18:00 UTC (permalink / raw)
  To: Qi Zheng
  Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, peterx, linux-mm, linux-kernel, x86

On Thu, Oct 17, 2024 at 11:47 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> In retract_page_tables(), we may modify the pmd entry after acquiring the
> pml and ptl, so we should also check whether the pmd entry is stable.
> Use pte_offset_map_lock() to do this, and then we can also remove the
> call to pte_lockptr().
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  mm/khugepaged.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 94feb85ce996c..b4f49d323c8d9 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>                 spinlock_t *pml;
>                 spinlock_t *ptl;
>                 bool skipped_uffd = false;
> +               pte_t *pte;
>
>                 /*
>                  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> @@ -1757,9 +1758,15 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>                 mmu_notifier_invalidate_range_start(&range);
>
>                 pml = pmd_lock(mm, pmd);
> -               ptl = pte_lockptr(mm, pmd);
> +               pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

This takes the lock "ptl" on the success path...

> +               if (!pte) {
> +                       spin_unlock(pml);
> +                       mmu_notifier_invalidate_range_end(&range);
> +                       continue;
> +               }
>                 if (ptl != pml)
>                         spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

... and this takes the same lock again, right? I think this will
deadlock on kernels with CONFIG_SPLIT_PTE_PTLOCKS=y. Did you test this
on a machine with less than 4 CPU cores, or something like that? Or am
I missing something?



* Re: [PATCH v1 2/7] mm: make zap_pte_range() handle full within-PMD range
  2024-10-17  9:47 ` [PATCH v1 2/7] mm: make zap_pte_range() handle full within-PMD range Qi Zheng
@ 2024-10-17 18:06   ` Jann Horn
  2024-10-18  2:23     ` Qi Zheng
  0 siblings, 1 reply; 18+ messages in thread
From: Jann Horn @ 2024-10-17 18:06 UTC (permalink / raw)
  To: Qi Zheng
  Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, peterx, linux-mm, linux-kernel, x86

On Thu, Oct 17, 2024 at 11:48 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> In preparation for reclaiming empty PTE pages, this commit first makes
> zap_pte_range() handle the full within-PMD range, so that we can more
> easily detect and free PTE pages in this function in subsequent commits.

I think your patch causes some unintended difference in behavior:

> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  mm/memory.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index caa6ed0a7fe5b..fd57c0f49fce2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1602,6 +1602,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>         swp_entry_t entry;
>         int nr;
>
> +retry:

This "retry" label is below the line "bool force_flush = false,
force_break = false;", so I think after force_break is set once and
you go through the retry path, every subsequent present PTE will again
bail out and retry. I think that doesn't lead to anything bad, but it
seems unintended.



* Re: [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
  2024-10-17  9:47 ` [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
@ 2024-10-17 18:43   ` Jann Horn
  2024-10-18  2:53     ` Qi Zheng
  2024-10-24 13:21     ` Will Deacon
  0 siblings, 2 replies; 18+ messages in thread
From: Jann Horn @ 2024-10-17 18:43 UTC (permalink / raw)
  To: Qi Zheng, Catalin Marinas, Will Deacon
  Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, peterx, linux-mm, linux-kernel, x86

+arm64 maintainers in case they have opinions on the break-before-make aspects

On Thu, Oct 17, 2024 at 11:48 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> Now in order to pursue high performance, applications mostly use some
> high-performance user-mode memory allocators, such as jemalloc or
> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
> to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
> release page table memory, which may cause huge page table memory usage.
>
> The following is a memory usage snapshot of one process, which actually
> happened on our server:
>
>         VIRT:  55t
>         RES:   590g
>         VmPTE: 110g
>
> In this case, most of the page table entries are empty. For such a PTE
> page where all entries are empty, we can actually free it back to the
> system for others to use.
>
> As a first step, this commit aims to synchronously free the empty PTE
> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
> cases other than madvise(MADV_DONTNEED).
>
> Once an empty PTE page is detected, we first try to hold the pmd lock within
> the pte lock. If successful, we clear the pmd entry directly (fast path).
> Otherwise, we wait until the pte lock is released, then re-hold the pmd
> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
> whether the PTE page is empty and free it (slow path).
>
> For other cases such as madvise(MADV_FREE), we will consider scanning and
> freeing empty PTE pages asynchronously in the future.

One thing I find somewhat scary about this is that it makes it
possible to free page tables in anonymous mappings, and to free page
tables of VMAs with an ->anon_vma, which was not possible before. Have
you checked all the current users of pte_offset_map_ro_nolock(),
pte_offset_map_rw_nolock(), and pte_offset_map() to make sure none of
them assume that this can't happen?

For example, pte_offset_map_rw_nolock() is called from move_ptes(),
with a comment basically talking about how this is safe *because only
khugepaged can remove page tables*.

> diff --git a/mm/memory.c b/mm/memory.c
> index cc89ede8ce2ab..77774b34f2cde 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1437,7 +1437,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  static inline bool should_zap_cows(struct zap_details *details)
>  {
>         /* By default, zap all pages */
> -       if (!details)
> +       if (!details || details->reclaim_pt)
>                 return true;
>
>         /* Or, we zap COWed pages only if the caller wants to */
> @@ -1611,8 +1611,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>         pte_t *start_pte;
>         pte_t *pte;
>         swp_entry_t entry;
> +       pmd_t pmdval;
> +       bool can_reclaim_pt = false;
> +       bool direct_reclaim;
> +       unsigned long start = addr;
>         int nr;
>
> +       if (details && details->reclaim_pt)
> +               can_reclaim_pt = true;
> +
> +       if ((ALIGN_DOWN(end, PMD_SIZE)) - (ALIGN(start, PMD_SIZE)) < PMD_SIZE)
> +               can_reclaim_pt = false;

Does this check actually work? Assuming we're on x86, if you pass in
start=0x1000 and end=0x2000, if I understand correctly,
ALIGN_DOWN(end, PMD_SIZE) will be 0, while ALIGN(start, PMD_SIZE) will
be 0x200000, and so we will check:

if (0 - 0x200000 < PMD_SIZE)

which is

if (0xffffffffffe00000 < 0x200000)

which is false?
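
(For what it's worth, a quick userspace check of that arithmetic - the helper
macros below just mirror the kernel's power-of-two ALIGN/ALIGN_DOWN behaviour,
with PMD_SIZE hardcoded to 2M:)

        #include <stdio.h>

        #define PMD_SIZE         0x200000UL
        #define ALIGN(x, a)      (((x) + (a) - 1) & ~((a) - 1))
        #define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))

        int main(void)
        {
                unsigned long start = 0x1000, end = 0x2000;
                unsigned long diff = ALIGN_DOWN(end, PMD_SIZE) - ALIGN(start, PMD_SIZE);

                /* prints: diff=0xffffffffffe00000 check-hit=0 */
                printf("diff=%#lx check-hit=%d\n", diff, diff < PMD_SIZE);
                return 0;
        }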

>  retry:
>         tlb_change_page_size(tlb, PAGE_SIZE);
>         init_rss_vec(rss);
> @@ -1641,6 +1651,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                         nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
>                                               addr, details, rss, &force_flush,
>                                               &force_break, &is_pt_unreclaimable);
> +                       if (is_pt_unreclaimable)
> +                               set_pt_unreclaimable(&can_reclaim_pt);
>                         if (unlikely(force_break)) {
>                                 addr += nr * PAGE_SIZE;
>                                 break;
> @@ -1653,8 +1665,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                     is_device_exclusive_entry(entry)) {
>                         page = pfn_swap_entry_to_page(entry);
>                         folio = page_folio(page);
> -                       if (unlikely(!should_zap_folio(details, folio)))
> +                       if (unlikely(!should_zap_folio(details, folio))) {
> +                               set_pt_unreclaimable(&can_reclaim_pt);
>                                 continue;
> +                       }
>                         /*
>                          * Both device private/exclusive mappings should only
>                          * work with anonymous page so far, so we don't need to
> @@ -1670,14 +1684,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                         max_nr = (end - addr) / PAGE_SIZE;
>                         nr = swap_pte_batch(pte, max_nr, ptent);
>                         /* Genuine swap entries, hence a private anon pages */
> -                       if (!should_zap_cows(details))
> +                       if (!should_zap_cows(details)) {
> +                               set_pt_unreclaimable(&can_reclaim_pt);
>                                 continue;
> +                       }
>                         rss[MM_SWAPENTS] -= nr;
>                         free_swap_and_cache_nr(entry, nr);
>                 } else if (is_migration_entry(entry)) {
>                         folio = pfn_swap_entry_folio(entry);
> -                       if (!should_zap_folio(details, folio))
> +                       if (!should_zap_folio(details, folio)) {
> +                               set_pt_unreclaimable(&can_reclaim_pt);
>                                 continue;
> +                       }
>                         rss[mm_counter(folio)]--;
>                 } else if (pte_marker_entry_uffd_wp(entry)) {
>                         /*
> @@ -1685,21 +1703,29 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                          * drop the marker if explicitly requested.
>                          */
>                         if (!vma_is_anonymous(vma) &&
> -                           !zap_drop_file_uffd_wp(details))
> +                           !zap_drop_file_uffd_wp(details)) {
> +                               set_pt_unreclaimable(&can_reclaim_pt);
>                                 continue;
> +                       }
>                 } else if (is_hwpoison_entry(entry) ||
>                            is_poisoned_swp_entry(entry)) {
> -                       if (!should_zap_cows(details))
> +                       if (!should_zap_cows(details)) {
> +                               set_pt_unreclaimable(&can_reclaim_pt);
>                                 continue;
> +                       }
>                 } else {
>                         /* We should have covered all the swap entry types */
>                         pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>                         WARN_ON_ONCE(1);
>                 }
>                 clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> -               zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
> +               if (zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent))
> +                       set_pt_unreclaimable(&can_reclaim_pt);
>         } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>
> +       if (addr == end && can_reclaim_pt)
> +               direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
> +
>         add_mm_rss_vec(mm, rss);
>         arch_leave_lazy_mmu_mode();
>
> @@ -1724,6 +1750,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                 goto retry;
>         }
>
> +       if (can_reclaim_pt) {
> +               if (direct_reclaim)
> +                       free_pte(mm, start, tlb, pmdval);
> +               else
> +                       try_to_free_pte(mm, pmd, start, tlb);
> +       }
> +
>         return addr;
>  }
>
> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
> new file mode 100644
> index 0000000000000..fc055da40b615
> --- /dev/null
> +++ b/mm/pt_reclaim.c
> @@ -0,0 +1,68 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/hugetlb.h>
> +#include <asm-generic/tlb.h>
> +#include <asm/pgalloc.h>
> +
> +#include "internal.h"
> +
> +bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
> +{
> +       spinlock_t *pml = pmd_lockptr(mm, pmd);
> +
> +       if (!spin_trylock(pml))
> +               return false;
> +
> +       *pmdval = pmdp_get_lockless(pmd);
> +       pmd_clear(pmd);
> +       spin_unlock(pml);
> +
> +       return true;
> +}
> +
> +void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
> +             pmd_t pmdval)
> +{
> +       pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
> +       mm_dec_nr_ptes(mm);
> +}
> +
> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> +                    struct mmu_gather *tlb)
> +{
> +       pmd_t pmdval;
> +       spinlock_t *pml, *ptl;
> +       pte_t *start_pte, *pte;
> +       int i;
> +
> +       start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
> +       if (!start_pte)
> +               return;
> +
> +       pml = pmd_lock(mm, pmd);
> +       if (ptl != pml)
> +               spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +
> +       if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
> +               goto out_ptl;
> +
> +       /* Check if it is empty PTE page */
> +       for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
> +               if (!pte_none(ptep_get(pte)))
> +                       goto out_ptl;
> +       }
> +       pte_unmap(start_pte);
> +
> +       pmd_clear(pmd);
> +
> +       if (ptl != pml)
> +               spin_unlock(ptl);
> +       spin_unlock(pml);

At this point, you have cleared the PMD and dropped the locks
protecting against concurrency, but have not yet done a TLB flush. If
another thread concurrently repopulates the PMD at this point, can we
get incoherent TLB state in a way that violates the arm64
break-before-make rule?

Though I guess we can probably already violate break-before-make if
MADV_DONTNEED races with a pagefault, since zap_present_folio_ptes()
does not seem to set "force_flush" when zapping anon PTEs...

(I realize you're only enabling this for x86 for now, but we should
probably make sure the code is not arch-dependent in subtle
undocumented ways...)

> +       free_pte(mm, addr, tlb, pmdval);
> +
> +       return;
> +out_ptl:
> +       pte_unmap_unlock(start_pte, ptl);
> +       if (pml != ptl)
> +               spin_unlock(pml);
> +}
> --
> 2.20.1
>



* Re: [PATCH v1 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_lock()
  2024-10-17 18:00   ` Jann Horn
@ 2024-10-18  2:15     ` Qi Zheng
  0 siblings, 0 replies; 18+ messages in thread
From: Qi Zheng @ 2024-10-18  2:15 UTC (permalink / raw)
  To: Jann Horn
  Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, peterx, linux-mm, linux-kernel, x86



On 2024/10/18 02:00, Jann Horn wrote:
> On Thu, Oct 17, 2024 at 11:47 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> In retract_page_tables(), we may modify the pmd entry after acquiring the
>> pml and ptl, so we should also check whether the pmd entry is stable.
>> Use pte_offset_map_lock() to do this, and then we can also remove the
>> call to pte_lockptr().
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   mm/khugepaged.c | 9 ++++++++-
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 94feb85ce996c..b4f49d323c8d9 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>>                  spinlock_t *pml;
>>                  spinlock_t *ptl;
>>                  bool skipped_uffd = false;
>> +               pte_t *pte;
>>
>>                  /*
>>                   * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
>> @@ -1757,9 +1758,15 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>>                  mmu_notifier_invalidate_range_start(&range);
>>
>>                  pml = pmd_lock(mm, pmd);
>> -               ptl = pte_lockptr(mm, pmd);
>> +               pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> 
> This takes the lock "ptl" on the success path...
> 
>> +               if (!pte) {
>> +                       spin_unlock(pml);
>> +                       mmu_notifier_invalidate_range_end(&range);
>> +                       continue;
>> +               }
>>                  if (ptl != pml)
>>                          spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> 
> ... and this takes the same lock again, right? I think this will

Oh my god, my mistake. I used pte_offset_map_rw_nolock() at first, then
changed it to pte_offset_map_lock() but forgot to delete this line. And
because my test did not trigger retract_page_tables(), I did not catch
this error.

Will change in v2.

Thanks!

> deadlock on kernels with CONFIG_SPLIT_PTE_PTLOCKS=y. Did you test this
> on a machine with less than 4 CPU cores, or something like that? Or am
> I missing something?



* Re: [PATCH v1 2/7] mm: make zap_pte_range() handle full within-PMD range
  2024-10-17 18:06   ` Jann Horn
@ 2024-10-18  2:23     ` Qi Zheng
  0 siblings, 0 replies; 18+ messages in thread
From: Qi Zheng @ 2024-10-18  2:23 UTC (permalink / raw)
  To: Jann Horn
  Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, peterx, linux-mm, linux-kernel, x86



On 2024/10/18 02:06, Jann Horn wrote:
> On Thu, Oct 17, 2024 at 11:48 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> In preparation for reclaiming empty PTE pages, this commit first makes
>> zap_pte_range() handle the full within-PMD range, so that we can more
>> easily detect and free PTE pages in this function in subsequent commits.
> 
> I think your patch causes some unintended difference in behavior:
> 
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   mm/memory.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index caa6ed0a7fe5b..fd57c0f49fce2 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1602,6 +1602,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>          swp_entry_t entry;
>>          int nr;
>>
>> +retry:
> 
> This "retry" label is below the line "bool force_flush = false,
> force_break = false;", so I think after force_break is set once and
> you go through the retry path, every subsequent present PTE will again
> bail out and retry. I think that doesn't lead to anything bad, but it
> seems unintended.

Right, thanks for catching this! Will set force_flush and force_break to
false under the "retry" label in v2.

Thanks!




* Re: [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
  2024-10-17 18:43   ` Jann Horn
@ 2024-10-18  2:53     ` Qi Zheng
  2024-10-18  2:58       ` Qi Zheng
  2024-10-24 13:21     ` Will Deacon
  1 sibling, 1 reply; 18+ messages in thread
From: Qi Zheng @ 2024-10-18  2:53 UTC (permalink / raw)
  To: Jann Horn
  Cc: Catalin Marinas, Will Deacon, david, hughd, willy, mgorman,
	muchun.song, vbabka, akpm, zokeefe, rientjes, peterx, linux-mm,
	linux-kernel, x86



On 2024/10/18 02:43, Jann Horn wrote:
> +arm64 maintainers in case they have opinions on the break-before-make aspects
> 
> On Thu, Oct 17, 2024 at 11:48 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> Now in order to pursue high performance, applications mostly use some
>> high-performance user-mode memory allocators, such as jemalloc or
>> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
>> to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
>> release page table memory, which may cause huge page table memory usage.
>>
>> The following is a memory usage snapshot of one process, which actually
>> happened on our server:
>>
>>          VIRT:  55t
>>          RES:   590g
>>          VmPTE: 110g
>>
>> In this case, most of the page table entries are empty. A PTE page whose
>> entries are all empty can actually be freed back to the system for
>> others to use.
>>
>> As a first step, this commit aims to synchronously free the empty PTE
>> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
>> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
>> cases other than madvise(MADV_DONTNEED).
>>
>> Once an empty PTE page is detected, we first try to take the pmd lock
>> while still holding the pte lock. If successful, we clear the pmd entry
>> directly (fast path). Otherwise, we wait until the pte lock is released,
>> then re-take the pmd and pte locks and loop PTRS_PER_PTE times checking
>> pte_none() to re-detect whether the PTE page is empty before freeing it
>> (slow path).
>>
>> For other cases such as madvise(MADV_FREE), consider scanning and freeing
>> empty PTE pages asynchronously in the future.
> 
> One thing I find somewhat scary about this is that it makes it
> possible to free page tables in anonymous mappings, and to free page
> tables of VMAs with an ->anon_vma, which was not possible before. Have
> you checked all the current users of pte_offset_map_ro_nolock(),
> pte_offset_map_rw_nolock(), and pte_offset_map() to make sure none of
> them assume that this can't happen?

The users of pte_offset_map_ro_nolock() and pte_offset_map() only perform
read-only operations on the PTE page, and the rcu_read_lock() taken inside
pte_offset_map_ro_nolock() and pte_offset_map() ensures that the PTE page
stays valid, so they are safe.

The users of pte_offset_map_rw_nolock() that follow it with a
pmd_same()/pte_same() check behave just like pte_offset_map_lock(), so
they are safe as well.

The users that call pte_offset_map_rw_nolock() without a subsequent
pmd_same()/pte_same() check are the following:

1. copy_pte_range()
2. move_ptes()

Both hold the mmap_lock exclusively (write mode), while we hold the
mmap_lock in read mode when freeing page tables in anonymous mappings,
so the two cannot run concurrently; this is also safe.

> 
> For example, pte_offset_map_rw_nolock() is called from move_ptes(),
> with a comment basically talking about how this is safe *because only
> khugepaged can remove page tables*.

As mentioned above, it is also safe here.

> 
>> diff --git a/mm/memory.c b/mm/memory.c
>> index cc89ede8ce2ab..77774b34f2cde 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1437,7 +1437,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>   static inline bool should_zap_cows(struct zap_details *details)
>>   {
>>          /* By default, zap all pages */
>> -       if (!details)
>> +       if (!details || details->reclaim_pt)
>>                  return true;
>>
>>          /* Or, we zap COWed pages only if the caller wants to */
>> @@ -1611,8 +1611,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>          pte_t *start_pte;
>>          pte_t *pte;
>>          swp_entry_t entry;
>> +       pmd_t pmdval;
>> +       bool can_reclaim_pt = false;
>> +       bool direct_reclaim;
>> +       unsigned long start = addr;
>>          int nr;
>>
>> +       if (details && details->reclaim_pt)
>> +               can_reclaim_pt = true;
>> +
>> +       if ((ALIGN_DOWN(end, PMD_SIZE)) - (ALIGN(start, PMD_SIZE)) < PMD_SIZE)
>> +               can_reclaim_pt = false;
> 
> Does this check actually work? Assuming we're on x86, if you pass in
> start=0x1000 and end=0x2000, if I understand correctly,
> ALIGN_DOWN(end, PMD_SIZE) will be 0, while ALIGN(start, PMD_SIZE) will
> be 0x200000, and so we will check:
> 
> if (0 - 0x200000 < PMD_SIZE)
> 
> which is
> 
> if (0xffffffffffe00000 < 0x200000)
> 
> which is false?

Oh, I missed this. It seems we can just do:

if (end - start < PMD_SIZE)
	can_reclaim_pt = false;
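
Just to double-check the wraparound you describe, a quick userspace
sketch (PMD_SIZE and the ALIGN helpers are hard-coded here to their
x86-64 power-of-two forms, purely for illustration) shows that the
original expression never triggers for such a sub-PMD range, while the
simpler check does:

#include <stdio.h>

#define PMD_SIZE		0x200000UL
#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

int main(void)
{
	unsigned long start = 0x1000, end = 0x2000;

	/* Original check: the subtraction wraps around to a huge unsigned
	 * value, so this prints 0 and can_reclaim_pt stays (wrongly) set. */
	printf("%d\n", ALIGN_DOWN(end, PMD_SIZE) - ALIGN(start, PMD_SIZE) < PMD_SIZE);

	/* Simpler check: prints 1, so can_reclaim_pt is cleared for a range
	 * that cannot cover a full PTE page, as intended. */
	printf("%d\n", end - start < PMD_SIZE);

	return 0;
}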

> 
>>   retry:
>>          tlb_change_page_size(tlb, PAGE_SIZE);
>>          init_rss_vec(rss);
>> @@ -1641,6 +1651,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                          nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
>>                                                addr, details, rss, &force_flush,
>>                                                &force_break, &is_pt_unreclaimable);
>> +                       if (is_pt_unreclaimable)
>> +                               set_pt_unreclaimable(&can_reclaim_pt);
>>                          if (unlikely(force_break)) {
>>                                  addr += nr * PAGE_SIZE;
>>                                  break;
>> @@ -1653,8 +1665,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                      is_device_exclusive_entry(entry)) {
>>                          page = pfn_swap_entry_to_page(entry);
>>                          folio = page_folio(page);
>> -                       if (unlikely(!should_zap_folio(details, folio)))
>> +                       if (unlikely(!should_zap_folio(details, folio))) {
>> +                               set_pt_unreclaimable(&can_reclaim_pt);
>>                                  continue;
>> +                       }
>>                          /*
>>                           * Both device private/exclusive mappings should only
>>                           * work with anonymous page so far, so we don't need to
>> @@ -1670,14 +1684,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                          max_nr = (end - addr) / PAGE_SIZE;
>>                          nr = swap_pte_batch(pte, max_nr, ptent);
>>                          /* Genuine swap entries, hence a private anon pages */
>> -                       if (!should_zap_cows(details))
>> +                       if (!should_zap_cows(details)) {
>> +                               set_pt_unreclaimable(&can_reclaim_pt);
>>                                  continue;
>> +                       }
>>                          rss[MM_SWAPENTS] -= nr;
>>                          free_swap_and_cache_nr(entry, nr);
>>                  } else if (is_migration_entry(entry)) {
>>                          folio = pfn_swap_entry_folio(entry);
>> -                       if (!should_zap_folio(details, folio))
>> +                       if (!should_zap_folio(details, folio)) {
>> +                               set_pt_unreclaimable(&can_reclaim_pt);
>>                                  continue;
>> +                       }
>>                          rss[mm_counter(folio)]--;
>>                  } else if (pte_marker_entry_uffd_wp(entry)) {
>>                          /*
>> @@ -1685,21 +1703,29 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                           * drop the marker if explicitly requested.
>>                           */
>>                          if (!vma_is_anonymous(vma) &&
>> -                           !zap_drop_file_uffd_wp(details))
>> +                           !zap_drop_file_uffd_wp(details)) {
>> +                               set_pt_unreclaimable(&can_reclaim_pt);
>>                                  continue;
>> +                       }
>>                  } else if (is_hwpoison_entry(entry) ||
>>                             is_poisoned_swp_entry(entry)) {
>> -                       if (!should_zap_cows(details))
>> +                       if (!should_zap_cows(details)) {
>> +                               set_pt_unreclaimable(&can_reclaim_pt);
>>                                  continue;
>> +                       }
>>                  } else {
>>                          /* We should have covered all the swap entry types */
>>                          pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>                          WARN_ON_ONCE(1);
>>                  }
>>                  clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>> -               zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>> +               if (zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent))
>> +                       set_pt_unreclaimable(&can_reclaim_pt);
>>          } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>
>> +       if (addr == end && can_reclaim_pt)
>> +               direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
>> +
>>          add_mm_rss_vec(mm, rss);
>>          arch_leave_lazy_mmu_mode();
>>
>> @@ -1724,6 +1750,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                  goto retry;
>>          }
>>
>> +       if (can_reclaim_pt) {
>> +               if (direct_reclaim)
>> +                       free_pte(mm, start, tlb, pmdval);
>> +               else
>> +                       try_to_free_pte(mm, pmd, start, tlb);
>> +       }
>> +
>>          return addr;
>>   }
>>
>> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
>> new file mode 100644
>> index 0000000000000..fc055da40b615
>> --- /dev/null
>> +++ b/mm/pt_reclaim.c
>> @@ -0,0 +1,68 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include <linux/hugetlb.h>
>> +#include <asm-generic/tlb.h>
>> +#include <asm/pgalloc.h>
>> +
>> +#include "internal.h"
>> +
>> +bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
>> +{
>> +       spinlock_t *pml = pmd_lockptr(mm, pmd);
>> +
>> +       if (!spin_trylock(pml))
>> +               return false;
>> +
>> +       *pmdval = pmdp_get_lockless(pmd);
>> +       pmd_clear(pmd);
>> +       spin_unlock(pml);
>> +
>> +       return true;
>> +}
>> +
>> +void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
>> +             pmd_t pmdval)
>> +{
>> +       pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
>> +       mm_dec_nr_ptes(mm);
>> +}
>> +
>> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
>> +                    struct mmu_gather *tlb)
>> +{
>> +       pmd_t pmdval;
>> +       spinlock_t *pml, *ptl;
>> +       pte_t *start_pte, *pte;
>> +       int i;
>> +
>> +       start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
>> +       if (!start_pte)
>> +               return;
>> +
>> +       pml = pmd_lock(mm, pmd);
>> +       if (ptl != pml)
>> +               spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>> +
>> +       if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
>> +               goto out_ptl;
>> +
>> +       /* Check if it is empty PTE page */
>> +       for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
>> +               if (!pte_none(ptep_get(pte)))
>> +                       goto out_ptl;
>> +       }
>> +       pte_unmap(start_pte);
>> +
>> +       pmd_clear(pmd);
>> +
>> +       if (ptl != pml)
>> +               spin_unlock(ptl);
>> +       spin_unlock(pml);
> 
> At this point, you have cleared the PMD and dropped the locks
> protecting against concurrency, but have not yet done a TLB flush. If
> another thread concurrently repopulates the PMD at this point, can we
> get incoherent TLB state in a way that violates the arm64
> break-before-make rule?
> 
> Though I guess we can probably already violate break-before-make if
> MADV_DONTNEED races with a pagefault, since zap_present_folio_ptes()
> does not seem to set "force_flush" when zapping anon PTEs...

Thanks for pointing this out! That's why I sent a separate patch
discussing this a while ago, but unfortunately it hasn't gotten any
feedback yet. Please take a look:

https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/

Thanks!

> 
> (I realize you're only enabling this for x86 for now, but we should
> probably make sure the code is not arch-dependent in subtle
> undocumented ways...)
> 
>> +       free_pte(mm, addr, tlb, pmdval);
>> +
>> +       return;
>> +out_ptl:
>> +       pte_unmap_unlock(start_pte, ptl);
>> +       if (pml != ptl)
>> +               spin_unlock(pml);
>> +}
>> --
>> 2.20.1
>>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
  2024-10-18  2:53     ` Qi Zheng
@ 2024-10-18  2:58       ` Qi Zheng
  0 siblings, 0 replies; 18+ messages in thread
From: Qi Zheng @ 2024-10-18  2:58 UTC (permalink / raw)
  To: Jann Horn
  Cc: Catalin Marinas, Will Deacon, david, hughd, willy, mgorman,
	muchun.song, vbabka, akpm, zokeefe, rientjes, peterx, linux-mm,
	linux-kernel, x86



On 2024/10/18 10:53, Qi Zheng wrote:
> 
> 
> On 2024/10/18 02:43, Jann Horn wrote:
>> +arm64 maintainers in case they have opinions on the break-before-make 
>> aspects
>>

[snip]

>>> +
>>> +       pmd_clear(pmd);
>>> +
>>> +       if (ptl != pml)
>>> +               spin_unlock(ptl);
>>> +       spin_unlock(pml);
>>
>> At this point, you have cleared the PMD and dropped the locks
>> protecting against concurrency, but have not yet done a TLB flush. If
>> another thread concurrently repopulates the PMD at this point, can we
>> get incoherent TLB state in a way that violates the arm64
>> break-before-make rule?
>>
>> Though I guess we can probably already violate break-before-make if
>> MADV_DONTNEED races with a pagefault, since zap_present_folio_ptes()
>> does not seem to set "force_flush" when zapping anon PTEs...
> 
> Thanks for pointing this out! That's why I sent a separate patch
> discussing this a while ago, but unfortunately haven't gotten any
> feedback yet, please take a look:
> 
> https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/

More context here: 
https://lore.kernel.org/lkml/6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com/

> 
> Thanks!
> 
>>
>> (I realize you're only enabling this for x86 for now, but we should
>> probably make sure the code is not arch-dependent in subtle
>> undocumented ways...)
>>
>>> +       free_pte(mm, addr, tlb, pmdval);
>>> +
>>> +       return;
>>> +out_ptl:
>>> +       pte_unmap_unlock(start_pte, ptl);
>>> +       if (pml != ptl)
>>> +               spin_unlock(pml);
>>> +}
>>> -- 
>>> 2.20.1
>>>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
  2024-10-17  9:47 ` [PATCH v1 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng
@ 2024-10-23  6:54   ` kernel test robot
  0 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2024-10-23  6:54 UTC (permalink / raw)
  To: Qi Zheng
  Cc: oe-lkp, lkp, linux-kernel, david, hughd, willy, mgorman,
	muchun.song, vbabka, akpm, zokeefe, rientjes, jannh, peterx,
	linux-mm, x86, Qi Zheng, oliver.sang


Hello,

With this commit, the two configs below are enabled:

--- /pkg/linux/x86_64-rhel-8.3/gcc-12/c9f9931196ccc64ec25268538edc327c3add08de/.config  2024-10-20 21:40:11.559320920 +0800
+++ /pkg/linux/x86_64-rhel-8.3/gcc-12/2e22ca3c1f2a6d64740f7b875d869d1f80f78ce8/.config  2024-10-20 06:02:46.008212911 +0800
@@ -1207,6 +1207,8 @@ CONFIG_IOMMU_MM_DATA=y
 CONFIG_EXECMEM=y
 CONFIG_NUMA_MEMBLKS=y
 CONFIG_NUMA_EMU=y
+CONFIG_ARCH_SUPPORTS_PT_RECLAIM=y
+CONFIG_PT_RECLAIM=y


We then noticed various issues that we do not observe on the parent commit.


kernel test robot noticed "BUG:Bad_rss-counter_state_mm:#type:MM_FILEPAGES_val" on:

commit: 2e22ca3c1f2a6d64740f7b875d869d1f80f78ce8 ("[PATCH v1 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64")
url: https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/mm-khugepaged-retract_page_tables-use-pte_offset_map_lock/20241017-174953
patch link: https://lore.kernel.org/all/0f6e7fb7fb21431710f28df60738f8be98fe9dd9.1729157502.git.zhengqi.arch@bytedance.com/
patch subject: [PATCH v1 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64

in testcase: boot

config: x86_64-rhel-8.3
compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+-----------------------------------------------------------+------------+------------+
|                                                           | c9f9931196 | 2e22ca3c1f |
+-----------------------------------------------------------+------------+------------+
| boot_failures                                             | 0          | 6          |
| BUG:Bad_rss-counter_state_mm:#type:MM_FILEPAGES_val       | 0          | 5          |
| BUG:Bad_rss-counter_state_mm:#type:MM_ANONPAGES_val       | 0          | 6          |
| BUG:Bad_page_cache_in_process                             | 0          | 3          |
| segfault_at_ip_sp_error                                   | 0          | 5          |
| Kernel_panic-not_syncing:Attempted_to_kill_init!exitcode= | 0          | 4          |
+-----------------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202410231429.b91daa36-oliver.sang@intel.com


[    9.153217][    T1] BUG: Bad rss-counter state mm:000000006dcf9cdd type:MM_FILEPAGES val:40
[    9.153929][    T1] BUG: Bad rss-counter state mm:000000006dcf9cdd type:MM_ANONPAGES val:1

...

[    9.444419][  T214] systemd[214]: segfault at 0 ip 0000000000000000 sp 00000000f6b1c2ec error 14 likely on CPU 1 (core 1, socket 0)
[ 9.445388][ T214] Code: Unable to access opcode bytes at 0xffffffffffffffd6.

Code starting with the faulting instruction
===========================================
[  OK  ] Started LKP bootstrap.
[    9.453331][    T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    9.454023][    T1] CPU: 1 UID: 0 PID: 1 Comm: systemd Not tainted 6.12.0-rc3-next-20241016-00007-g2e22ca3c1f2a #1
[    9.454818][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[    9.455601][    T1] Call Trace:
[    9.455906][    T1]  <TASK>
[ 9.456184][ T1] panic (kernel/panic.c:354) 
[ 9.456522][ T1] do_exit (include/linux/audit.h:327 kernel/exit.c:920) 
[ 9.456884][ T1] do_group_exit (kernel/exit.c:1069) 
[ 9.457252][ T1] get_signal (kernel/signal.c:2917) 
[ 9.457615][ T1] arch_do_signal_or_restart (arch/x86/kernel/signal.c:337) 
[ 9.458053][ T1] syscall_exit_to_user_mode (kernel/entry/common.c:113 include/linux/entry-common.h:328 kernel/entry/common.c:207 kernel/entry/common.c:218) 
[ 9.458495][ T1] __do_fast_syscall_32 (arch/x86/entry/common.c:391) 
[ 9.458904][ T1] do_fast_syscall_32 (arch/x86/entry/common.c:411) 
[ 9.459299][ T1] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:127) 
[    9.459788][    T1] RIP: 0023:0xf7fbf589
[ 9.460130][ T1] Code: 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
All code
========
   0:	03 74 d8 01          	add    0x1(%rax,%rbx,8),%esi
	...
  20:	00 51 52             	add    %dl,0x52(%rcx)
  23:	55                   	push   %rbp
  24:*	89 e5                	mov    %esp,%ebp		<-- trapping instruction
  26:	0f 34                	sysenter 
  28:	cd 80                	int    $0x80
  2a:	5d                   	pop    %rbp
  2b:	5a                   	pop    %rdx
  2c:	59                   	pop    %rcx
  2d:	c3                   	retq   
  2e:	90                   	nop
  2f:	90                   	nop
  30:	90                   	nop
  31:	90                   	nop
  32:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi
  39:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi

Code starting with the faulting instruction
===========================================
   0:	5d                   	pop    %rbp
   1:	5a                   	pop    %rdx
   2:	59                   	pop    %rcx
   3:	c3                   	retq   
   4:	90                   	nop
   5:	90                   	nop
   6:	90                   	nop
   7:	90                   	nop
   8:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi
   f:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi
[    9.461528][    T1] RSP: 002b:00000000ff837680 EFLAGS: 00200206 ORIG_RAX: 0000000000000006
[    9.462212][    T1] RAX: 0000000000000000 RBX: 0000000000000038 RCX: 0000000000000002
[    9.462834][    T1] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000f731e6cc
[    9.463462][    T1] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[    9.464083][    T1] R10: 0000000000000000 R11: 0000000000200206 R12: 0000000000000000
[    9.464714][    T1] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    9.465343][    T1]  </TASK>
[    9.465677][    T1] Kernel Offset: 0x4600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20241023/202410231429.b91daa36-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
  2024-10-17 18:43   ` Jann Horn
  2024-10-18  2:53     ` Qi Zheng
@ 2024-10-24 13:21     ` Will Deacon
  2024-10-25  2:43       ` Qi Zheng
  1 sibling, 1 reply; 18+ messages in thread
From: Will Deacon @ 2024-10-24 13:21 UTC (permalink / raw)
  To: Jann Horn
  Cc: Qi Zheng, Catalin Marinas, david, hughd, willy, mgorman,
	muchun.song, vbabka, akpm, zokeefe, rientjes, peterx, linux-mm,
	linux-kernel, x86

On Thu, Oct 17, 2024 at 08:43:43PM +0200, Jann Horn wrote:
> +arm64 maintainers in case they have opinions on the break-before-make aspects

Thanks, Jann.

> On Thu, Oct 17, 2024 at 11:48 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> > +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> > +                    struct mmu_gather *tlb)
> > +{
> > +       pmd_t pmdval;
> > +       spinlock_t *pml, *ptl;
> > +       pte_t *start_pte, *pte;
> > +       int i;
> > +
> > +       start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
> > +       if (!start_pte)
> > +               return;
> > +
> > +       pml = pmd_lock(mm, pmd);
> > +       if (ptl != pml)
> > +               spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> > +
> > +       if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
> > +               goto out_ptl;
> > +
> > +       /* Check if it is empty PTE page */
> > +       for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
> > +               if (!pte_none(ptep_get(pte)))
> > +                       goto out_ptl;
> > +       }
> > +       pte_unmap(start_pte);
> > +
> > +       pmd_clear(pmd);
> > +
> > +       if (ptl != pml)
> > +               spin_unlock(ptl);
> > +       spin_unlock(pml);
> 
> At this point, you have cleared the PMD and dropped the locks
> protecting against concurrency, but have not yet done a TLB flush. If
> another thread concurrently repopulates the PMD at this point, can we
> get incoherent TLB state in a way that violates the arm64
> break-before-make rule?

Sounds like it, yes, unless there's something that constrains the new
PMD value to be some function of what it was in the first place?

Will


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED)
  2024-10-24 13:21     ` Will Deacon
@ 2024-10-25  2:43       ` Qi Zheng
  0 siblings, 0 replies; 18+ messages in thread
From: Qi Zheng @ 2024-10-25  2:43 UTC (permalink / raw)
  To: Will Deacon
  Cc: Jann Horn, Catalin Marinas, david, hughd, willy, mgorman,
	muchun.song, vbabka, akpm, zokeefe, rientjes, peterx, linux-mm,
	linux-kernel, x86

Hi Will,

On 2024/10/24 21:21, Will Deacon wrote:
> On Thu, Oct 17, 2024 at 08:43:43PM +0200, Jann Horn wrote:
>> +arm64 maintainers in case they have opinions on the break-before-make aspects
> 
> Thanks, Jann.
> 
>> On Thu, Oct 17, 2024 at 11:48 AM Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
>>> +                    struct mmu_gather *tlb)
>>> +{
>>> +       pmd_t pmdval;
>>> +       spinlock_t *pml, *ptl;
>>> +       pte_t *start_pte, *pte;
>>> +       int i;
>>> +
>>> +       start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
>>> +       if (!start_pte)
>>> +               return;
>>> +
>>> +       pml = pmd_lock(mm, pmd);
>>> +       if (ptl != pml)
>>> +               spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>> +
>>> +       if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
>>> +               goto out_ptl;
>>> +
>>> +       /* Check if it is empty PTE page */
>>> +       for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
>>> +               if (!pte_none(ptep_get(pte)))
>>> +                       goto out_ptl;
>>> +       }
>>> +       pte_unmap(start_pte);
>>> +
>>> +       pmd_clear(pmd);
>>> +
>>> +       if (ptl != pml)
>>> +               spin_unlock(ptl);
>>> +       spin_unlock(pml);
>>
>> At this point, you have cleared the PMD and dropped the locks
>> protecting against concurrency, but have not yet done a TLB flush. If
>> another thread concurrently repopulates the PMD at this point, can we
>> get incoherent TLB state in a way that violates the arm64
>> break-before-make rule?
> 
> Sounds like it, yes, unless there's something that constrains the new
> PMD value to be some function of what it was in the first place?

Thank you for taking a look at this! I have tried to detect this case
and flush the TLB in the page fault path. For details, please refer to
this RFC patch:

https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/

And more context here: 
https://lore.kernel.org/lkml/6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com/

If necessary, I can rebase the RFC patch and resend it.

Thanks!

> 
> Will


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-10-25  2:44 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-17  9:47 [PATCH v1 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
2024-10-17  9:47 ` [PATCH v1 1/7] mm: khugepaged: retract_page_tables() use pte_offset_map_lock() Qi Zheng
2024-10-17 18:00   ` Jann Horn
2024-10-18  2:15     ` Qi Zheng
2024-10-17  9:47 ` [PATCH v1 2/7] mm: make zap_pte_range() handle full within-PMD range Qi Zheng
2024-10-17 18:06   ` Jann Horn
2024-10-18  2:23     ` Qi Zheng
2024-10-17  9:47 ` [PATCH v1 3/7] mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been re-installed Qi Zheng
2024-10-17  9:47 ` [PATCH v1 4/7] mm: zap_present_ptes: return whether the PTE page is unreclaimable Qi Zheng
2024-10-17  9:47 ` [PATCH v1 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
2024-10-17 18:43   ` Jann Horn
2024-10-18  2:53     ` Qi Zheng
2024-10-18  2:58       ` Qi Zheng
2024-10-24 13:21     ` Will Deacon
2024-10-25  2:43       ` Qi Zheng
2024-10-17  9:47 ` [PATCH v1 6/7] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
2024-10-17  9:47 ` [PATCH v1 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng
2024-10-23  6:54   ` kernel test robot
