[RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages
@ 2024-08-05 12:55 Qi Zheng
  2024-08-05 12:55 ` [RFC PATCH v2 1/7] mm: pgtable: make pte_offset_map_nolock() return pmdval Qi Zheng
                   ` (8 more replies)
  0 siblings, 9 replies; 24+ messages in thread
From: Qi Zheng @ 2024-08-05 12:55 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes
  Cc: linux-mm, linux-kernel, Qi Zheng

Changes in RFC v2:
 - fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reproted by
   kernel test robot
 - use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid()
   in retract_page_tables() (in [RFC PATCH 4/7])
 - rebase onto the next-20240805

Hi all,

Previously, we tried to use a completely asynchronous method to reclaim empty
user PTE pages [1]. After discussing with David Hildenbrand, we decided to
implement synchronous reclaimation in the case of madvise(MADV_DONTNEED) as the
first step.

So this series aims to synchronously scan and reclaim empty user PTE pages in
zap_page_range_single() (madvise(MADV_DONTNEED) etc will invoke this). In
zap_page_range_single(), mmu_gather is used to perform batch tlb flushing and
page freeing operations. Therefore, if we want to free the empty PTE page in
this path, the most natural way is to add it to mmu_gather as well. There are
two problems that need to be solved here:

1. Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free
   page table pages by semi RCU:

   - batch table freeing: asynchronous free by RCU
   - single table freeing: IPI + synchronous free

   But this is not enough to free the empty PTE page table pages in paths other
   that munmap and exit_mmap path, because IPI cannot be synchronized with
   rcu_read_lock() in pte_offset_map{_lock}(). So we should let single table
   also be freed by RCU like batch table freeing.

2. When we use mmu_gather to batch flush tlb and free PTE pages, the TLB is not
   flushed before pmd lock is unlocked. This may result in the following two
   situations:

   1) Userland can trigger page fault and fill a huge page, which will cause
      the existence of small size TLB and huge TLB for the same address.

   2) Userland can also trigger page fault and fill a PTE page, which will
      cause the existence of two small size TLBs, but the PTE page they map
      are different.

   For case 1), according to Intel's TLB Application note (317080), some CPUs of
   x86 do not allow it:

   ```
   If software modifies the paging structures so that the page size used for a
   4-KByte range of linear addresses changes, the TLBs may subsequently contain
   both ordinary and large-page translations for the address range.12 A reference
   to a linear address in the address range may use either translation. Which of
   the two translations is used may vary from one execution to another and the
   choice may be implementation-specific.

   Software wishing to prevent this uncertainty should not write to a paging-
   structure entry in a way that would change, for any linear address, both the
   page size and either the page frame or attributes. It can instead use the
   following algorithm: first mark the relevant paging-structure entry (e.g.,
   PDE) not present; then invalidate any translations for the affected linear
   addresses (see Section 5.2); and then modify the relevant paging-structure
   entry to mark it present and establish translation(s) for the new page size.
   ```

   We can also learn more information from the comments above pmdp_invalidate()
   in __split_huge_pmd_locked().

   For case 2), we can see from the comments above ptep_clear_flush() in
   wp_page_copy() that this situation is also not allowed. Even without
   this patch series, madvise(MADV_DONTNEED) can also cause this situation:

           CPU 0                         CPU 1

   madvise (MADV_DONTNEED)
   -->  clear pte entry
        pte_unmap_unlock
                                      touch and tlb miss
				      --> set pte entry 
        mmu_gather flush tlb

   But strangely, I didn't see any relevant fix code, maybe I missed something,
   or is this guaranteed by userland?

   Anyway, this series defines the following two functions to be implemented by
   the architecture. If the architecture does not allow the above two situations,
   then define these two functions to flush the tlb before set_pmd_at().

   - arch_flush_tlb_before_set_huge_page
   - arch_flush_tlb_before_set_pte_page

As a first step, we supported this feature on x86_64 and selectd the newly
introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.

In order to reduce overhead, we only handle the cases with a high probability
of generating empty PTE pages, and other cases will be filtered out, such as:

 - hugetlb vma (unsuitable)
 - userfaultfd_wp vma (may reinstall the pte entry)
 - writable private file mapping case (COW-ed anon page is not zapped)
 - etc

For userfaultfd_wp and writable private file mapping cases (and MADV_FREE case,
of course), consider scanning and freeing empty PTE pages asynchronously in
the future.

This series is based on next-20240805.

Comments and suggestions are welcome!

Thanks,
Qi

[1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/

Qi Zheng (7):
  mm: pgtable: make pte_offset_map_nolock() return pmdval
  mm: introduce CONFIG_PT_RECLAIM
  mm: pass address information to pmd_install()
  mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
  x86: mm: free page table pages by RCU instead of semi RCU
  x86: mm: define arch_flush_tlb_before_set_huge_page
  x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64

 Documentation/mm/split_page_table_lock.rst |   3 +-
 arch/arm/mm/fault-armv.c                   |   2 +-
 arch/powerpc/mm/pgtable.c                  |   2 +-
 arch/x86/Kconfig                           |   1 +
 arch/x86/include/asm/pgtable.h             |   6 +
 arch/x86/include/asm/tlb.h                 |  19 +++
 arch/x86/kernel/paravirt.c                 |   7 ++
 arch/x86/mm/pgtable.c                      |  23 +++-
 include/linux/hugetlb.h                    |   2 +-
 include/linux/mm.h                         |  13 +-
 include/linux/pgtable.h                    |  14 +++
 mm/Kconfig                                 |  14 +++
 mm/Makefile                                |   1 +
 mm/debug_vm_pgtable.c                      |   2 +-
 mm/filemap.c                               |   4 +-
 mm/gup.c                                   |   2 +-
 mm/huge_memory.c                           |   3 +
 mm/internal.h                              |  17 ++-
 mm/khugepaged.c                            |  32 +++--
 mm/memory.c                                |  21 ++--
 mm/migrate_device.c                        |   2 +-
 mm/mmu_gather.c                            |   9 +-
 mm/mprotect.c                              |   8 +-
 mm/mremap.c                                |   4 +-
 mm/page_vma_mapped.c                       |   2 +-
 mm/pgtable-generic.c                       |  21 ++--
 mm/pt_reclaim.c                            | 131 +++++++++++++++++++++
 mm/userfaultfd.c                           |  10 +-
 mm/vmscan.c                                |   2 +-
 29 files changed, 321 insertions(+), 56 deletions(-)
 create mode 100644 mm/pt_reclaim.c

-- 
2.20.1

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH v2 1/7] mm: pgtable: make pte_offset_map_nolock() return pmdval
  2024-08-05 12:55 [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
@ 2024-08-05 12:55 ` Qi Zheng
  2024-08-05 14:43   ` David Hildenbrand
  2024-08-05 12:55 ` [RFC PATCH v2 2/7] mm: introduce CONFIG_PT_RECLAIM Qi Zheng
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Qi Zheng @ 2024-08-05 12:55 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes
  Cc: linux-mm, linux-kernel, Qi Zheng

Make pte_offset_map_nolock() return pmdval so that we can recheck the
*pmd once the lock is taken. This is a preparation for freeing empty
PTE pages, no functional changes are expected.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 Documentation/mm/split_page_table_lock.rst |  3 ++-
 arch/arm/mm/fault-armv.c                   |  2 +-
 arch/powerpc/mm/pgtable.c                  |  2 +-
 include/linux/mm.h                         |  4 ++--
 mm/filemap.c                               |  2 +-
 mm/khugepaged.c                            |  4 ++--
 mm/memory.c                                |  4 ++--
 mm/mremap.c                                |  2 +-
 mm/page_vma_mapped.c                       |  2 +-
 mm/pgtable-generic.c                       | 21 ++++++++++++---------
 mm/userfaultfd.c                           |  4 ++--
 mm/vmscan.c                                |  2 +-
 12 files changed, 28 insertions(+), 24 deletions(-)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c04..e6a47d57531cd 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -18,7 +18,8 @@ There are helpers to lock/unlock a table and other accessor functions:
 	pointer to its PTE table lock, or returns NULL if no PTE table;
  - pte_offset_map_nolock()
 	maps PTE, returns pointer to PTE with pointer to its PTE table
-	lock (not taken), or returns NULL if no PTE table;
+	lock (not taken) and the value of its pmd entry, or returns NULL
+	if no PTE table;
  - pte_offset_map()
 	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
  - pte_unmap()
diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 831793cd6ff94..db07e6a05eb6e 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,7 +117,7 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	 * must use the nested version.  This also means we need to
 	 * open-code the spin-locking.
 	 */
-	pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
+	pte = pte_offset_map_nolock(vma->vm_mm, pmd, NULL, address, &ptl);
 	if (!pte)
 		return 0;
 
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 7316396e452d8..9b67d2a1457ed 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -398,7 +398,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	 */
 	if (pmd_none(*pmd))
 		return;
-	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+	pte = pte_offset_map_nolock(mm, pmd, NULL, addr, &ptl);
 	BUG_ON(!pte);
 	assert_spin_locked(ptl);
 	pte_unmap(pte);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 43b40334e9b28..b1ef2afe620c5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2937,8 +2937,8 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
-			unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdvalp,
+			     unsigned long addr, spinlock_t **ptlp);
 
 #define pte_unmap_unlock(pte, ptl)	do {		\
 	spin_unlock(ptl);				\
diff --git a/mm/filemap.c b/mm/filemap.c
index 67c3f5136db33..3285dffb64cf8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3231,7 +3231,7 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
 	if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
 		return 0;
 
-	ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, vmf->address,
+	ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, NULL, vmf->address,
 				     &vmf->ptl);
 	if (unlikely(!ptep))
 		return VM_FAULT_NOPAGE;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cdd1d8655a76b..91b93259ee214 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1009,7 +1009,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		};
 
 		if (!pte++) {
-			pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
+			pte = pte_offset_map_nolock(mm, pmd, NULL, address, &ptl);
 			if (!pte) {
 				mmap_read_unlock(mm);
 				result = SCAN_PMD_NULL;
@@ -1598,7 +1598,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
 		pml = pmd_lock(mm, pmd);
 
-	start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
+	start_pte = pte_offset_map_nolock(mm, pmd, NULL, haddr, &ptl);
 	if (!start_pte)		/* mmap_lock + page lock should prevent this */
 		goto abort;
 	if (!pml)
diff --git a/mm/memory.c b/mm/memory.c
index d6a9dcddaca4a..afd8a967fb953 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1108,7 +1108,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		ret = -ENOMEM;
 		goto out;
 	}
-	src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
+	src_pte = pte_offset_map_nolock(src_mm, src_pmd, NULL, addr, &src_ptl);
 	if (!src_pte) {
 		pte_unmap_unlock(dst_pte, dst_ptl);
 		/* ret == 0 */
@@ -5671,7 +5671,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 * it into a huge pmd: just retry later if so.
 		 */
 		vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
-						 vmf->address, &vmf->ptl);
+						 NULL, vmf->address, &vmf->ptl);
 		if (unlikely(!vmf->pte))
 			return 0;
 		vmf->orig_pte = ptep_get_lockless(vmf->pte);
diff --git a/mm/mremap.c b/mm/mremap.c
index e7ae140fc6409..f672d0218a6fe 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -175,7 +175,7 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		err = -EAGAIN;
 		goto out;
 	}
-	new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
+	new_pte = pte_offset_map_nolock(mm, new_pmd, NULL, new_addr, &new_ptl);
 	if (!new_pte) {
 		pte_unmap_unlock(old_pte, old_ptl);
 		err = -EAGAIN;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae5cc42aa2087..507701b7bcc1e 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -33,7 +33,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
 	 * Though, in most cases, page lock already protects this.
 	 */
 	pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
-					  pvmw->address, ptlp);
+					  NULL, pvmw->address, ptlp);
 	if (!pvmw->pte)
 		return false;
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711ac..443e3b34434a5 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -305,7 +305,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 	return NULL;
 }
 
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdvalp,
 			     unsigned long addr, spinlock_t **ptlp)
 {
 	pmd_t pmdval;
@@ -314,6 +314,8 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	pte = __pte_offset_map(pmd, addr, &pmdval);
 	if (likely(pte))
 		*ptlp = pte_lockptr(mm, &pmdval);
+	if (pmdvalp)
+		*pmdvalp = pmdval;
 	return pte;
 }
 
@@ -347,14 +349,15 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
  * and disconnected table.  Until pte_unmap(pte) unmaps and rcu_read_unlock()s
  * afterwards.
  *
- * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
- * but when successful, it also outputs a pointer to the spinlock in ptlp - as
- * pte_offset_map_lock() does, but in this case without locking it.  This helps
- * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
- * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
- * pointer for the page table that it returns.  In principle, the caller should
- * recheck *pmd once the lock is taken; in practice, no callsite needs that -
- * either the mmap_lock for write, or pte_same() check on contents, is enough.
+ * pte_offset_map_nolock(mm, pmd, pmdvalp, addr, ptlp), above, is like
+ * pte_offset_map(); but when successful, it also outputs a pointer to the
+ * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
+ * locking it.  This helps the caller to avoid a later pte_lockptr(mm, *pmd),
+ * which might by that time act on a changed *pmd: pte_offset_map_nolock()
+ * provides the correct spinlock pointer for the page table that it returns.
+ * In principle, the caller should recheck *pmd once the lock is taken; But in
+ * most cases, either the mmap_lock for write, or pte_same() check on contents,
+ * is enough.
  *
  * Note that free_pgtables(), used after unmapping detached vmas, or when
  * exiting the whole mm, does not take page table lock before freeing a page
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 3b7715ecf292a..aa3c9cc51cc36 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1143,7 +1143,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 				src_addr, src_addr + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 retry:
-	dst_pte = pte_offset_map_nolock(mm, dst_pmd, dst_addr, &dst_ptl);
+	dst_pte = pte_offset_map_nolock(mm, dst_pmd, NULL, dst_addr, &dst_ptl);
 
 	/* Retry if a huge pmd materialized from under us */
 	if (unlikely(!dst_pte)) {
@@ -1151,7 +1151,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		goto out;
 	}
 
-	src_pte = pte_offset_map_nolock(mm, src_pmd, src_addr, &src_ptl);
+	src_pte = pte_offset_map_nolock(mm, src_pmd, NULL, src_addr, &src_ptl);
 
 	/*
 	 * We held the mmap_lock for reading so MADV_DONTNEED
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 31d13462571e6..b00cd560c0e43 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3378,7 +3378,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int old_gen, new_gen = lru_gen_from_seq(max_seq);
 
-	pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+	pte = pte_offset_map_nolock(args->mm, pmd, NULL, start & PMD_MASK, &ptl);
 	if (!pte)
 		return false;
 	if (!spin_trylock(ptl)) {
-- 
2.20.1



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/7] mm: pgtable: make pte_offset_map_nolock() return pmdval
  2024-08-05 12:55 ` [RFC PATCH v2 1/7] mm: pgtable: make pte_offset_map_nolock() return pmdval Qi Zheng
@ 2024-08-05 14:43   ` David Hildenbrand
  2024-08-06  2:40     ` Qi Zheng
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2024-08-05 14:43 UTC (permalink / raw)
  To: Qi Zheng, hughd, willy, mgorman, muchun.song, vbabka, akpm,
	zokeefe, rientjes
  Cc: linux-mm, linux-kernel

On 05.08.24 14:55, Qi Zheng wrote:
> Make pte_offset_map_nolock() return pmdval so that we can recheck the
> *pmd once the lock is taken. This is a preparation for freeing empty
> PTE pages, no functional changes are expected.

Skimming the patches, only patch #4 updates one of the callsites 
(collapse_pte_mapped_thp).

Wouldn't we have to recheck if the PMD val changed in more cases after 
taking the PTL?

If not, would it make sense to have a separate function that returns the 
pmdval and we won't have to update each and every callsite?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/7] mm: pgtable: make pte_offset_map_nolock() return pmdval
  2024-08-05 14:43   ` David Hildenbrand
@ 2024-08-06  2:40     ` Qi Zheng
  2024-08-06 14:16       ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Qi Zheng @ 2024-08-06  2:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, linux-mm, linux-kernel, the arch/x86 maintainers

Hi David,

On 2024/8/5 22:43, David Hildenbrand wrote:
> On 05.08.24 14:55, Qi Zheng wrote:
>> Make pte_offset_map_nolock() return pmdval so that we can recheck the
>> *pmd once the lock is taken. This is a preparation for freeing empty
>> PTE pages, no functional changes are expected.
> 
> Skimming the patches, only patch #4 updates one of the callsites 
> (collapse_pte_mapped_thp).

In addition, retract_page_tables() and reclaim_pgtables_pmd_entry()
also used the pmdval returned by pte_offset_map_nolock().

> 
> Wouldn't we have to recheck if the PMD val changed in more cases after 
> taking the PTL?
> 
> If not, would it make sense to have a separate function that returns the 
> pmdval and we won't have to update each and every callsite?

pte_offset_map_nolock() had already obtained the pmdval previously, just
hadn't returned it. And updating those callsite is simple, so I think
there may not be a need to add a separate function.

Thanks,
Qi

> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/7] mm: pgtable: make pte_offset_map_nolock() return pmdval
  2024-08-06  2:40     ` Qi Zheng
@ 2024-08-06 14:16       ` David Hildenbrand
       [not found]         ` <f6c05526-5ac9-4597-9e80-099ea22fa0ae@bytedance.com>
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2024-08-06 14:16 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, linux-mm, linux-kernel, the arch/x86 maintainers

On 06.08.24 04:40, Qi Zheng wrote:
> Hi David,
> 
> On 2024/8/5 22:43, David Hildenbrand wrote:
>> On 05.08.24 14:55, Qi Zheng wrote:
>>> Make pte_offset_map_nolock() return pmdval so that we can recheck the
>>> *pmd once the lock is taken. This is a preparation for freeing empty
>>> PTE pages, no functional changes are expected.
>>
>> Skimming the patches, only patch #4 updates one of the callsites
>> (collapse_pte_mapped_thp).
> 
> In addition, retract_page_tables() and reclaim_pgtables_pmd_entry()
> also used the pmdval returned by pte_offset_map_nolock().

Right, and I am questioning if only touching these two is sufficient, 
and how we can make it clearer when someone actually has to recheck the PMD.

> 
>>
>> Wouldn't we have to recheck if the PMD val changed in more cases after
>> taking the PTL?
>>
>> If not, would it make sense to have a separate function that returns the
>> pmdval and we won't have to update each and every callsite?
> 
> pte_offset_map_nolock() had already obtained the pmdval previously, just
> hadn't returned it. And updating those callsite is simple, so I think
> there may not be a need to add a separate function.

Let me ask this way: why is retract_page_tables() and 
reclaim_pgtables_pmd_entry() different to the other ones, and how would 
someone using pte_offset_map_nolock() know what's to do here?

IIUC, we must check the PMDVAL after taking the PTL in case

(a) we want to modify the page table to turn pte_none() entries to
     !pte_none(). Because it could be that the page table was removed and
     now is all pte_none()

(b) we want to remove the page table ourselves and want to check if it
     has already been removed.

Is that it?

So my thinking is if another function variant can make that clearer.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

[parent not found: <f6c05526-5ac9-4597-9e80-099ea22fa0ae@bytedance.com>]

* Re: [RFC PATCH v2 1/7] mm: pgtable: make pte_offset_map_nolock() return pmdval
       [not found]         ` <f6c05526-5ac9-4597-9e80-099ea22fa0ae@bytedance.com>
@ 2024-08-09 16:54           ` David Hildenbrand
  2024-08-12  6:21             ` Qi Zheng
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2024-08-09 16:54 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, linux-mm, linux-kernel, the arch/x86 maintainers

On 07.08.24 05:08, Qi Zheng wrote:
> Hi David,
> 
> On 2024/8/6 22:16, David Hildenbrand wrote:
>> On 06.08.24 04:40, Qi Zheng wrote:
>>> Hi David,
>>>
>>> On 2024/8/5 22:43, David Hildenbrand wrote:
>>>> On 05.08.24 14:55, Qi Zheng wrote:
>>>>> Make pte_offset_map_nolock() return pmdval so that we can recheck the
>>>>> *pmd once the lock is taken. This is a preparation for freeing empty
>>>>> PTE pages, no functional changes are expected.
>>>>
>>>> Skimming the patches, only patch #4 updates one of the callsites
>>>> (collapse_pte_mapped_thp).
>>>
>>> In addition, retract_page_tables() and reclaim_pgtables_pmd_entry()
>>> also used the pmdval returned by pte_offset_map_nolock().
>>
>> Right, and I am questioning if only touching these two is sufficient,
>> and how we can make it clearer when someone actually has to recheck the
>> PMD.
>>
>>>
>>>>
>>>> Wouldn't we have to recheck if the PMD val changed in more cases after
>>>> taking the PTL?
>>>>
>>>> If not, would it make sense to have a separate function that returns the
>>>> pmdval and we won't have to update each and every callsite?
>>>
>>> pte_offset_map_nolock() had already obtained the pmdval previously, just
>>> hadn't returned it. And updating those callsite is simple, so I think
>>> there may not be a need to add a separate function.
>>
>> Let me ask this way: why is retract_page_tables() and
>> reclaim_pgtables_pmd_entry() different to the other ones, and how would
>> someone using pte_offset_map_nolock() know what's to do here?
> 
> If we acuqire the PTL (PTE or PMD lock) after calling
> pte_offset_map_nolock(), it means we may be modifying the corresponding
> pte or pmd entry. In that case, we need to perform a pmd_same() check
> after holding the PTL, just like in pte_offset_map_lock(), to prevent
> the possibility of the PTE page being reclaimed at that time.

Okay, what I thought.

> 
> If we call pte_offset_map_nolock() and do not need to acquire the PTL
> afterwards, it means we are only reading the PTE page. In this case, the
> rcu_read_lock() in pte_offset_map_nolock() will ensure that the PTE page
> cannot be reclaimed.
> 
>>
>> IIUC, we must check the PMDVAL after taking the PTL in case
>>
>> (a) we want to modify the page table to turn pte_none() entries to
>>       !pte_none(). Because it could be that the page table was removed and
>>       now is all pte_none()
>>
>> (b) we want to remove the page table ourselves and want to check if it
>>       has already been removed.
>>
>> Is that it?
> 
> Yes.
> 
>>
>> So my thinking is if another function variant can make that clearer.
> 
> OK, how about naming it pte_offset_map_before_lock?

That's the issue with some of the code: for example in 
filemap_fault_recheck_pte_none() we'll call pte_offset_map_nolock() and 
conditionally take the PTL. But we won't be modifying the pages tables.

Maybe something like:

pte_offset_map_readonly_nolock()

and

pte_offset_map_maywrite_nolock()

The latter would require you to pass the PMD pointer such that you have 
to really mess up to ignore what to do with it (check PMD same or not 
check PMD same if you really know what you are douing).

The first would not take a PMD pointer at all, because there is no need to.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/7] mm: pgtable: make pte_offset_map_nolock() return pmdval
  2024-08-09 16:54           ` David Hildenbrand
@ 2024-08-12  6:21             ` Qi Zheng
  2024-08-16  8:59               ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Qi Zheng @ 2024-08-12  6:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, linux-mm, linux-kernel, the arch/x86 maintainers

Hi David,

On 2024/8/10 00:54, David Hildenbrand wrote:
> On 07.08.24 05:08, Qi Zheng wrote:
>> Hi David,
>>
>> On 2024/8/6 22:16, David Hildenbrand wrote:
>>> On 06.08.24 04:40, Qi Zheng wrote:
>>>> Hi David,
>>>>
>>>> On 2024/8/5 22:43, David Hildenbrand wrote:
>>>>> On 05.08.24 14:55, Qi Zheng wrote:
>>>>>> Make pte_offset_map_nolock() return pmdval so that we can recheck the
>>>>>> *pmd once the lock is taken. This is a preparation for freeing empty
>>>>>> PTE pages, no functional changes are expected.
>>>>>
>>>>> Skimming the patches, only patch #4 updates one of the callsites
>>>>> (collapse_pte_mapped_thp).
>>>>
>>>> In addition, retract_page_tables() and reclaim_pgtables_pmd_entry()
>>>> also used the pmdval returned by pte_offset_map_nolock().
>>>
>>> Right, and I am questioning if only touching these two is sufficient,
>>> and how we can make it clearer when someone actually has to recheck the
>>> PMD.
>>>
>>>>
>>>>>
>>>>> Wouldn't we have to recheck if the PMD val changed in more cases after
>>>>> taking the PTL?
>>>>>
>>>>> If not, would it make sense to have a separate function that 
>>>>> returns the
>>>>> pmdval and we won't have to update each and every callsite?
>>>>
>>>> pte_offset_map_nolock() had already obtained the pmdval previously, 
>>>> just
>>>> hadn't returned it. And updating those callsite is simple, so I think
>>>> there may not be a need to add a separate function.
>>>
>>> Let me ask this way: why is retract_page_tables() and
>>> reclaim_pgtables_pmd_entry() different to the other ones, and how would
>>> someone using pte_offset_map_nolock() know what's to do here?
>>
>> If we acuqire the PTL (PTE or PMD lock) after calling
>> pte_offset_map_nolock(), it means we may be modifying the corresponding
>> pte or pmd entry. In that case, we need to perform a pmd_same() check
>> after holding the PTL, just like in pte_offset_map_lock(), to prevent
>> the possibility of the PTE page being reclaimed at that time.
> 
> Okay, what I thought.
> 
>>
>> If we call pte_offset_map_nolock() and do not need to acquire the PTL
>> afterwards, it means we are only reading the PTE page. In this case, the
>> rcu_read_lock() in pte_offset_map_nolock() will ensure that the PTE page
>> cannot be reclaimed.
>>
>>>
>>> IIUC, we must check the PMDVAL after taking the PTL in case
>>>
>>> (a) we want to modify the page table to turn pte_none() entries to
>>>       !pte_none(). Because it could be that the page table was 
>>> removed and
>>>       now is all pte_none()
>>>
>>> (b) we want to remove the page table ourselves and want to check if it
>>>       has already been removed.
>>>
>>> Is that it?
>>
>> Yes.
>>
>>>
>>> So my thinking is if another function variant can make that clearer.
>>
>> OK, how about naming it pte_offset_map_before_lock?
> 
> That's the issue with some of the code: for example in 
> filemap_fault_recheck_pte_none() we'll call pte_offset_map_nolock() and 
> conditionally take the PTL. But we won't be modifying the pages tables.
> 
> Maybe something like:
> 
> pte_offset_map_readonly_nolock()
> 
> and
> 
> pte_offset_map_maywrite_nolock()
> 
> The latter would require you to pass the PMD pointer such that you have 
> to really mess up to ignore what to do with it (check PMD same or not 
> check PMD same if you really know what you are douing).
> 
> The first would not take a PMD pointer at all, because there is no need to.

These two function names LGTM. Will do in the next version.

Thanks,
Qi

> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/7] mm: pgtable: make pte_offset_map_nolock() return pmdval
  2024-08-12  6:21             ` Qi Zheng
@ 2024-08-16  8:59               ` David Hildenbrand
  2024-08-16  9:21                 ` Qi Zheng
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2024-08-16  8:59 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, linux-mm, linux-kernel, the arch/x86 maintainers

On 12.08.24 08:21, Qi Zheng wrote:
> Hi David,
> 
> On 2024/8/10 00:54, David Hildenbrand wrote:
>> On 07.08.24 05:08, Qi Zheng wrote:
>>> Hi David,
>>>
>>> On 2024/8/6 22:16, David Hildenbrand wrote:
>>>> On 06.08.24 04:40, Qi Zheng wrote:
>>>>> Hi David,
>>>>>
>>>>> On 2024/8/5 22:43, David Hildenbrand wrote:
>>>>>> On 05.08.24 14:55, Qi Zheng wrote:
>>>>>>> Make pte_offset_map_nolock() return pmdval so that we can recheck the
>>>>>>> *pmd once the lock is taken. This is a preparation for freeing empty
>>>>>>> PTE pages, no functional changes are expected.
>>>>>>
>>>>>> Skimming the patches, only patch #4 updates one of the callsites
>>>>>> (collapse_pte_mapped_thp).
>>>>>
>>>>> In addition, retract_page_tables() and reclaim_pgtables_pmd_entry()
>>>>> also used the pmdval returned by pte_offset_map_nolock().
>>>>
>>>> Right, and I am questioning if only touching these two is sufficient,
>>>> and how we can make it clearer when someone actually has to recheck the
>>>> PMD.
>>>>
>>>>>
>>>>>>
>>>>>> Wouldn't we have to recheck if the PMD val changed in more cases after
>>>>>> taking the PTL?
>>>>>>
>>>>>> If not, would it make sense to have a separate function that
>>>>>> returns the
>>>>>> pmdval and we won't have to update each and every callsite?
>>>>>
>>>>> pte_offset_map_nolock() had already obtained the pmdval previously,
>>>>> just
>>>>> hadn't returned it. And updating those callsite is simple, so I think
>>>>> there may not be a need to add a separate function.
>>>>
>>>> Let me ask this way: why is retract_page_tables() and
>>>> reclaim_pgtables_pmd_entry() different to the other ones, and how would
>>>> someone using pte_offset_map_nolock() know what's to do here?
>>>
>>> If we acuqire the PTL (PTE or PMD lock) after calling
>>> pte_offset_map_nolock(), it means we may be modifying the corresponding
>>> pte or pmd entry. In that case, we need to perform a pmd_same() check
>>> after holding the PTL, just like in pte_offset_map_lock(), to prevent
>>> the possibility of the PTE page being reclaimed at that time.
>>
>> Okay, what I thought.
>>
>>>
>>> If we call pte_offset_map_nolock() and do not need to acquire the PTL
>>> afterwards, it means we are only reading the PTE page. In this case, the
>>> rcu_read_lock() in pte_offset_map_nolock() will ensure that the PTE page
>>> cannot be reclaimed.
>>>
>>>>
>>>> IIUC, we must check the PMDVAL after taking the PTL in case
>>>>
>>>> (a) we want to modify the page table to turn pte_none() entries to
>>>>        !pte_none(). Because it could be that the page table was
>>>> removed and
>>>>        now is all pte_none()
>>>>
>>>> (b) we want to remove the page table ourselves and want to check if it
>>>>        has already been removed.
>>>>
>>>> Is that it?
>>>
>>> Yes.
>>>
>>>>
>>>> So my thinking is if another function variant can make that clearer.
>>>
>>> OK, how about naming it pte_offset_map_before_lock?
>>
>> That's the issue with some of the code: for example in
>> filemap_fault_recheck_pte_none() we'll call pte_offset_map_nolock() and
>> conditionally take the PTL. But we won't be modifying the pages tables.
>>
>> Maybe something like:
>>
>> pte_offset_map_readonly_nolock()
>>
>> and
>>
>> pte_offset_map_maywrite_nolock()
>>
>> The latter would require you to pass the PMD pointer such that you have
>> to really mess up to ignore what to do with it (check PMD same or not
>> check PMD same if you really know what you are douing).
>>
>> The first would not take a PMD pointer at all, because there is no need to.
> 
> These two function names LGTM. Will do in the next version.

That is probably something you can send as a separate patch independent 
of this full series.

Then we might also get more review+thoughts from other folks!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/7] mm: pgtable: make pte_offset_map_nolock() return pmdval
  2024-08-16  8:59               ` David Hildenbrand
@ 2024-08-16  9:21                 ` Qi Zheng
  0 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-08-16  9:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, linux-mm, linux-kernel, the arch/x86 maintainers

Hi David,

On 2024/8/16 16:59, David Hildenbrand wrote:
> On 12.08.24 08:21, Qi Zheng wrote:
>> Hi David,
>>
>> On 2024/8/10 00:54, David Hildenbrand wrote:
>>> On 07.08.24 05:08, Qi Zheng wrote:
>>>> Hi David,
>>>>
>>>> On 2024/8/6 22:16, David Hildenbrand wrote:
>>>>> On 06.08.24 04:40, Qi Zheng wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> On 2024/8/5 22:43, David Hildenbrand wrote:
>>>>>>> On 05.08.24 14:55, Qi Zheng wrote:
>>>>>>>> Make pte_offset_map_nolock() return pmdval so that we can 
>>>>>>>> recheck the
>>>>>>>> *pmd once the lock is taken. This is a preparation for freeing 
>>>>>>>> empty
>>>>>>>> PTE pages, no functional changes are expected.
>>>>>>>
>>>>>>> Skimming the patches, only patch #4 updates one of the callsites
>>>>>>> (collapse_pte_mapped_thp).
>>>>>>
>>>>>> In addition, retract_page_tables() and reclaim_pgtables_pmd_entry()
>>>>>> also used the pmdval returned by pte_offset_map_nolock().
>>>>>
>>>>> Right, and I am questioning if only touching these two is sufficient,
>>>>> and how we can make it clearer when someone actually has to recheck 
>>>>> the
>>>>> PMD.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Wouldn't we have to recheck if the PMD val changed in more cases 
>>>>>>> after
>>>>>>> taking the PTL?
>>>>>>>
>>>>>>> If not, would it make sense to have a separate function that
>>>>>>> returns the
>>>>>>> pmdval and we won't have to update each and every callsite?
>>>>>>
>>>>>> pte_offset_map_nolock() had already obtained the pmdval previously,
>>>>>> just
>>>>>> hadn't returned it. And updating those callsite is simple, so I think
>>>>>> there may not be a need to add a separate function.
>>>>>
>>>>> Let me ask this way: why is retract_page_tables() and
>>>>> reclaim_pgtables_pmd_entry() different to the other ones, and how 
>>>>> would
>>>>> someone using pte_offset_map_nolock() know what's to do here?
>>>>
>>>> If we acuqire the PTL (PTE or PMD lock) after calling
>>>> pte_offset_map_nolock(), it means we may be modifying the corresponding
>>>> pte or pmd entry. In that case, we need to perform a pmd_same() check
>>>> after holding the PTL, just like in pte_offset_map_lock(), to prevent
>>>> the possibility of the PTE page being reclaimed at that time.
>>>
>>> Okay, what I thought.
>>>
>>>>
>>>> If we call pte_offset_map_nolock() and do not need to acquire the PTL
>>>> afterwards, it means we are only reading the PTE page. In this case, 
>>>> the
>>>> rcu_read_lock() in pte_offset_map_nolock() will ensure that the PTE 
>>>> page
>>>> cannot be reclaimed.
>>>>
>>>>>
>>>>> IIUC, we must check the PMDVAL after taking the PTL in case
>>>>>
>>>>> (a) we want to modify the page table to turn pte_none() entries to
>>>>>        !pte_none(). Because it could be that the page table was
>>>>> removed and
>>>>>        now is all pte_none()
>>>>>
>>>>> (b) we want to remove the page table ourselves and want to check if it
>>>>>        has already been removed.
>>>>>
>>>>> Is that it?
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>> So my thinking is if another function variant can make that clearer.
>>>>
>>>> OK, how about naming it pte_offset_map_before_lock?
>>>
>>> That's the issue with some of the code: for example in
>>> filemap_fault_recheck_pte_none() we'll call pte_offset_map_nolock() and
>>> conditionally take the PTL. But we won't be modifying the pages tables.
>>>
>>> Maybe something like:
>>>
>>> pte_offset_map_readonly_nolock()
>>>
>>> and
>>>
>>> pte_offset_map_maywrite_nolock()
>>>
>>> The latter would require you to pass the PMD pointer such that you have
>>> to really mess up to ignore what to do with it (check PMD same or not
>>> check PMD same if you really know what you are douing).
>>>
>>> The first would not take a PMD pointer at all, because there is no 
>>> need to.
>>
>> These two function names LGTM. Will do in the next version.
> 
> That is probably something you can send as a separate patch independent 
> of this full series.

I think it makes sense. I am analyzing whether the existing path of
calling pte_offset_map_nolock + spin_lock will be concurrent with
the path of reclaiming PTE pages in THP. If so, it actually needs to
be fixed first.

> 
> Then we might also get more review+thoughts from other folks!

I hope so!

Thanks,
Qi

> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH v2 2/7] mm: introduce CONFIG_PT_RECLAIM
  2024-08-05 12:55 [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
  2024-08-05 12:55 ` [RFC PATCH v2 1/7] mm: pgtable: make pte_offset_map_nolock() return pmdval Qi Zheng
@ 2024-08-05 12:55 ` Qi Zheng
  2024-08-06 14:25   ` David Hildenbrand
  2024-08-05 12:55 ` [RFC PATCH v2 3/7] mm: pass address information to pmd_install() Qi Zheng
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Qi Zheng @ 2024-08-05 12:55 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes
  Cc: linux-mm, linux-kernel, Qi Zheng

This configuration variable will be used to build the code needed
to free empty user page table pages.

This feature is not available on all architectures yet, so
ARCH_SUPPORTS_PT_RECLAIM is needed. We can remove it once all
architectures support this feature.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/Kconfig | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 3936fe4d26d91..c10741c54dcb1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1278,6 +1278,20 @@ config NUMA_EMU
 	  into virtual nodes when booted with "numa=fake=N", where N is the
 	  number of nodes. This is only useful for debugging.
 
+config ARCH_SUPPORTS_PT_RECLAIM
+	def_bool n
+
+config PT_RECLAIM
+	bool "reclaim empty user page table pages"
+	default y
+	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
+	select MMU_GATHER_RCU_TABLE_FREE
+	help
+	  Try to reclaim empty user page table pages in paths other that munmap
+	  and exit_mmap path.
+
+	  Note: now only empty user PTE page table pages will be reclaimed.
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.20.1



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 2/7] mm: introduce CONFIG_PT_RECLAIM
  2024-08-05 12:55 ` [RFC PATCH v2 2/7] mm: introduce CONFIG_PT_RECLAIM Qi Zheng
@ 2024-08-06 14:25   ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2024-08-06 14:25 UTC (permalink / raw)
  To: Qi Zheng, hughd, willy, mgorman, muchun.song, vbabka, akpm,
	zokeefe, rientjes
  Cc: linux-mm, linux-kernel

On 05.08.24 14:55, Qi Zheng wrote:
> This configuration variable will be used to build the code needed
> to free empty user page table pages.
> 
> This feature is not available on all architectures yet, so
> ARCH_SUPPORTS_PT_RECLAIM is needed. We can remove it once all
> architectures support this feature.

Please squash this into #4

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH v2 3/7] mm: pass address information to pmd_install()
  2024-08-05 12:55 [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
  2024-08-05 12:55 ` [RFC PATCH v2 1/7] mm: pgtable: make pte_offset_map_nolock() return pmdval Qi Zheng
  2024-08-05 12:55 ` [RFC PATCH v2 2/7] mm: introduce CONFIG_PT_RECLAIM Qi Zheng
@ 2024-08-05 12:55 ` Qi Zheng
  2024-08-05 12:55 ` [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single() Qi Zheng
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-08-05 12:55 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes
  Cc: linux-mm, linux-kernel, Qi Zheng

In the subsequent implementation of freeing empty page table pages,
we need the address information to flush tlb, so pass address to
pmd_install() in advance.

No functional changes.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/hugetlb.h |  2 +-
 include/linux/mm.h      |  9 +++++----
 mm/debug_vm_pgtable.c   |  2 +-
 mm/filemap.c            |  2 +-
 mm/gup.c                |  2 +-
 mm/internal.h           |  3 ++-
 mm/memory.c             | 15 ++++++++-------
 mm/migrate_device.c     |  2 +-
 mm/mprotect.c           |  8 ++++----
 mm/mremap.c             |  2 +-
 mm/userfaultfd.c        |  6 +++---
 11 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a76db143bffee..fcdcef367fffe 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -189,7 +189,7 @@ static inline pte_t *pte_offset_huge(pmd_t *pmd, unsigned long address)
 static inline pte_t *pte_alloc_huge(struct mm_struct *mm, pmd_t *pmd,
 				    unsigned long address)
 {
-	return pte_alloc(mm, pmd) ? NULL : pte_offset_huge(pmd, address);
+	return pte_alloc(mm, pmd, address) ? NULL : pte_offset_huge(pmd, address);
 }
 #endif
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b1ef2afe620c5..f0b821dcb085b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2758,7 +2758,7 @@ static inline void mm_inc_nr_ptes(struct mm_struct *mm) {}
 static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
 #endif
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
 int __pte_alloc_kernel(pmd_t *pmd);
 
 #if defined(CONFIG_MMU)
@@ -2945,13 +2945,14 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdvalp,
 	pte_unmap(pte);					\
 } while (0)
 
-#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
+#define pte_alloc(mm, pmd, addr)			\
+	(unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd, addr))
 
 #define pte_alloc_map(mm, pmd, address)			\
-	(pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
+	(pte_alloc(mm, pmd, address) ? NULL : pte_offset_map(pmd, address))
 
 #define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
-	(pte_alloc(mm, pmd) ?			\
+	(pte_alloc(mm, pmd, address) ?			\
 		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
 
 #define pte_alloc_kernel(pmd, address)			\
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index e4969fb54da34..18375744e1845 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1246,7 +1246,7 @@ static int __init init_args(struct pgtable_debug_args *args)
 	args->start_pmdp = pmd_offset(args->pudp, 0UL);
 	WARN_ON(!args->start_pmdp);
 
-	if (pte_alloc(args->mm, args->pmdp)) {
+	if (pte_alloc(args->mm, args->pmdp, args->vaddr)) {
 		pr_err("Failed to allocate pte entries\n");
 		ret = -ENOMEM;
 		goto error;
diff --git a/mm/filemap.c b/mm/filemap.c
index 3285dffb64cf8..efcb8ae3f235f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3453,7 +3453,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
 	}
 
 	if (pmd_none(*vmf->pmd) && vmf->prealloc_pte)
-		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
+		pmd_install(mm, vmf->pmd, vmf->address, &vmf->prealloc_pte);
 
 	return false;
 }
diff --git a/mm/gup.c b/mm/gup.c
index d19884e097fd2..53c3b73810150 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -972,7 +972,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		split_huge_pmd(vma, pmd, address);
 		/* If pmd was left empty, stuff a page table in there quickly */
-		return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
+		return pte_alloc(mm, pmd, address) ? ERR_PTR(-ENOMEM) :
 			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 	}
 	page = follow_huge_pmd(vma, address, pmd, flags, ctx);
diff --git a/mm/internal.h b/mm/internal.h
index 52f7fc4e8ac30..dfc992de01115 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -325,7 +325,8 @@ void folio_activate(struct folio *folio);
 void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		   struct vm_area_struct *start_vma, unsigned long floor,
 		   unsigned long ceiling, bool mm_wr_locked);
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+		 pgtable_t *pte);
 
 struct zap_details;
 void unmap_page_range(struct mmu_gather *tlb,
diff --git a/mm/memory.c b/mm/memory.c
index afd8a967fb953..fef1e425e4702 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -417,7 +417,8 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 	} while (vma);
 }
 
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+		 pgtable_t *pte)
 {
 	spinlock_t *ptl = pmd_lock(mm, pmd);
 
@@ -443,13 +444,13 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
 	spin_unlock(ptl);
 }
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
 {
 	pgtable_t new = pte_alloc_one(mm);
 	if (!new)
 		return -ENOMEM;
 
-	pmd_install(mm, pmd, &new);
+	pmd_install(mm, pmd, addr, &new);
 	if (new)
 		pte_free(mm, new);
 	return 0;
@@ -2115,7 +2116,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 
 	/* Allocate the PTE if necessary; takes PMD lock once only. */
 	ret = -ENOMEM;
-	if (pte_alloc(mm, pmd))
+	if (pte_alloc(mm, pmd, addr))
 		goto out;
 
 	while (pages_to_write_in_pmd) {
@@ -4686,7 +4687,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	 * Use pte_alloc() instead of pte_alloc_map(), so that OOM can
 	 * be distinguished from a transient failure of pte_offset_map().
 	 */
-	if (pte_alloc(vma->vm_mm, vmf->pmd))
+	if (pte_alloc(vma->vm_mm, vmf->pmd, vmf->address))
 		return VM_FAULT_OOM;
 
 	/* Use the zero-page for reads */
@@ -5033,8 +5034,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 		}
 
 		if (vmf->prealloc_pte)
-			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
-		else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+			pmd_install(vma->vm_mm, vmf->pmd, vmf->address, &vmf->prealloc_pte);
+		else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd, vmf->address)))
 			return VM_FAULT_OOM;
 	}
 
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6d66dc1c6ffa0..e4d2e19e6611d 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -598,7 +598,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		goto abort;
 	if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
 		goto abort;
-	if (pte_alloc(mm, pmdp))
+	if (pte_alloc(mm, pmdp, addr))
 		goto abort;
 	if (unlikely(anon_vma_prepare(vma)))
 		goto abort;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 37cf8d249405d..7b58db622f825 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -329,11 +329,11 @@ pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags)
  * allocation failures during page faults by kicking OOM and returning
  * error.
  */
-#define  change_pmd_prepare(vma, pmd, cp_flags)				\
+#define  change_pmd_prepare(vma, pmd, addr, cp_flags)			\
 	({								\
 		long err = 0;						\
 		if (unlikely(pgtable_populate_needed(vma, cp_flags))) {	\
-			if (pte_alloc(vma->vm_mm, pmd))			\
+			if (pte_alloc(vma->vm_mm, pmd, addr))		\
 				err = -ENOMEM;				\
 		}							\
 		err;							\
@@ -374,7 +374,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 again:
 		next = pmd_addr_end(addr, end);
 
-		ret = change_pmd_prepare(vma, pmd, cp_flags);
+		ret = change_pmd_prepare(vma, pmd, addr, cp_flags);
 		if (ret) {
 			pages = ret;
 			break;
@@ -401,7 +401,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 				 * cleared; make sure pmd populated if
 				 * necessary, then fall-through to pte level.
 				 */
-				ret = change_pmd_prepare(vma, pmd, cp_flags);
+				ret = change_pmd_prepare(vma, pmd, addr, cp_flags);
 				if (ret) {
 					pages = ret;
 					break;
diff --git a/mm/mremap.c b/mm/mremap.c
index f672d0218a6fe..7723d11e77cd2 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -628,7 +628,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		}
 		if (pmd_none(*old_pmd))
 			continue;
-		if (pte_alloc(new_vma->vm_mm, new_pmd))
+		if (pte_alloc(new_vma->vm_mm, new_pmd, new_addr))
 			break;
 		if (move_ptes(vma, old_pmd, old_addr, old_addr + extent,
 			      new_vma, new_pmd, new_addr, need_rmap_locks) < 0)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index aa3c9cc51cc36..41d659bd2589c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -796,7 +796,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 			break;
 		}
 		if (unlikely(pmd_none(dst_pmdval)) &&
-		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
+		    unlikely(__pte_alloc(dst_mm, dst_pmd, dst_addr))) {
 			err = -ENOMEM;
 			break;
 		}
@@ -1713,13 +1713,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 					err = -ENOENT;
 					break;
 				}
-				if (unlikely(__pte_alloc(mm, src_pmd))) {
+				if (unlikely(__pte_alloc(mm, src_pmd, src_addr))) {
 					err = -ENOMEM;
 					break;
 				}
 			}
 
-			if (unlikely(pte_alloc(mm, dst_pmd))) {
+			if (unlikely(pte_alloc(mm, dst_pmd, dst_addr))) {
 				err = -ENOMEM;
 				break;
 			}
-- 
2.20.1



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
  2024-08-05 12:55 [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (2 preceding siblings ...)
  2024-08-05 12:55 ` [RFC PATCH v2 3/7] mm: pass address information to pmd_install() Qi Zheng
@ 2024-08-05 12:55 ` Qi Zheng
  2024-08-06 14:40   ` David Hildenbrand
  2024-08-05 12:55 ` [RFC PATCH v2 5/7] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Qi Zheng @ 2024-08-05 12:55 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes
  Cc: linux-mm, linux-kernel, Qi Zheng

Now in order to pursue high performance, applications mostly use some
high-performance user-mode memory allocators, such as jemalloc or
tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
release page table memory, which may cause huge page table memory usage.

The following are a memory usage snapshot of one process which actually
happened on our server:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

In this case, most of the page table entries are empty. For such a PTE
page where all entries are empty, we can actually free it back to the
system for others to use.

As a first step, this commit attempts to synchronously free the empty PTE
pages in zap_page_range_single() (MADV_DONTNEED etc will invoke this). In
order to reduce overhead, we only handle the cases with a high probability
of generating empty PTE pages, and other cases will be filtered out, such
as:

 - hugetlb vma (unsuitable)
 - userfaultfd_wp vma (may reinstall the pte entry)
 - writable private file mapping case (COW-ed anon page is not zapped)
 - etc

For userfaultfd_wp and private file mapping cases (and MADV_FREE case, of
course), consider scanning and freeing empty PTE pages asynchronously in
the future.

The following code snippet can show the effect of optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.1 MB
VmPTE                102640 KB                         240 KB

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/pgtable.h |  14 +++++
 mm/Makefile             |   1 +
 mm/huge_memory.c        |   3 +
 mm/internal.h           |  14 +++++
 mm/khugepaged.c         |  30 +++++++--
 mm/memory.c             |   2 +
 mm/pt_reclaim.c         | 131 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 189 insertions(+), 6 deletions(-)
 create mode 100644 mm/pt_reclaim.c

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2a6a3cccfc367..572343650eb0f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -447,6 +447,20 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
 }
 #endif
 
+#ifndef arch_flush_tlb_before_set_huge_page
+static inline void arch_flush_tlb_before_set_huge_page(struct mm_struct *mm,
+						       unsigned long addr)
+{
+}
+#endif
+
+#ifndef arch_flush_tlb_before_set_pte_page
+static inline void arch_flush_tlb_before_set_pte_page(struct mm_struct *mm,
+						      unsigned long addr)
+{
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
diff --git a/mm/Makefile b/mm/Makefile
index ab5ed56c5c033..8bec86469c1d5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -145,3 +145,4 @@ obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
+obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 697fcf89f975b..0afbb1e45cdac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -999,6 +999,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		folio_add_new_anon_rmap(folio, vma, haddr, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(folio, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+		arch_flush_tlb_before_set_huge_page(vma->vm_mm, haddr);
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -1066,6 +1067,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
 	entry = mk_pmd(&zero_folio->page, vma->vm_page_prot);
 	entry = pmd_mkhuge(entry);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	arch_flush_tlb_before_set_huge_page(mm, haddr);
 	set_pmd_at(mm, haddr, pmd, entry);
 	mm_inc_nr_ptes(mm);
 }
@@ -1173,6 +1175,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pgtable = NULL;
 	}
 
+	arch_flush_tlb_before_set_huge_page(mm, addr);
 	set_pmd_at(mm, addr, pmd, entry);
 	update_mmu_cache_pmd(vma, addr, pmd);
 
diff --git a/mm/internal.h b/mm/internal.h
index dfc992de01115..09bd1cee7a523 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1441,4 +1441,18 @@ static inline bool try_to_accept_memory(struct zone *zone, unsigned int order)
 }
 #endif /* CONFIG_UNACCEPTED_MEMORY */
 
+#ifdef CONFIG_PT_RECLAIM
+void try_to_reclaim_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
+			     unsigned long start_addr, unsigned long end_addr,
+			     struct zap_details *details);
+#else
+static inline void try_to_reclaim_pgtables(struct mmu_gather *tlb,
+					   struct vm_area_struct *vma,
+					   unsigned long start_addr,
+					   unsigned long end_addr,
+					   struct zap_details *details)
+{
+}
+#endif /* CONFIG_PT_RECLAIM */
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 91b93259ee214..ffd3963b1c3d1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1598,7 +1598,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
 		pml = pmd_lock(mm, pmd);
 
-	start_pte = pte_offset_map_nolock(mm, pmd, NULL, haddr, &ptl);
+	start_pte = pte_offset_map_nolock(mm, pmd, &pgt_pmd, haddr, &ptl);
 	if (!start_pte)		/* mmap_lock + page lock should prevent this */
 		goto abort;
 	if (!pml)
@@ -1606,6 +1606,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	else if (ptl != pml)
 		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
+	/* pmd entry may be changed by others */
+	if (unlikely(IS_ENABLED(CONFIG_PT_RECLAIM) && !pml &&
+		     !pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
+		goto abort;
+
 	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
 	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
@@ -1651,6 +1656,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	/* step 4: remove empty page table */
 	if (!pml) {
 		pml = pmd_lock(mm, pmd);
+		if (unlikely(IS_ENABLED(CONFIG_PT_RECLAIM) &&
+			     !pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+			spin_unlock(pml);
+			goto pmd_change;
+		}
 		if (ptl != pml)
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 	}
@@ -1682,6 +1692,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		pte_unmap_unlock(start_pte, ptl);
 	if (pml && pml != ptl)
 		spin_unlock(pml);
+pmd_change:
 	if (notified)
 		mmu_notifier_invalidate_range_end(&range);
 drop_folio:
@@ -1703,6 +1714,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		spinlock_t *pml;
 		spinlock_t *ptl;
 		bool skipped_uffd = false;
+		pte_t *pte;
 
 		/*
 		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
@@ -1738,11 +1750,17 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 					addr, addr + HPAGE_PMD_SIZE);
 		mmu_notifier_invalidate_range_start(&range);
 
+		pte = pte_offset_map_nolock(mm, pmd, &pgt_pmd, addr, &ptl);
+		if (!pte)
+			goto skip;
+
 		pml = pmd_lock(mm, pmd);
-		ptl = pte_lockptr(mm, pmd);
 		if (ptl != pml)
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
+		if (unlikely(IS_ENABLED(CONFIG_PT_RECLAIM) &&
+		    !pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
+			goto unlock_skip;
 		/*
 		 * Huge page lock is still held, so normally the page table
 		 * must remain empty; and we have already skipped anon_vma
@@ -1758,11 +1776,11 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
 			pmdp_get_lockless_sync();
 		}
-
+unlock_skip:
+		pte_unmap_unlock(pte, ptl);
 		if (ptl != pml)
-			spin_unlock(ptl);
-		spin_unlock(pml);
-
+			spin_unlock(pml);
+skip:
 		mmu_notifier_invalidate_range_end(&range);
 
 		if (!skipped_uffd) {
diff --git a/mm/memory.c b/mm/memory.c
index fef1e425e4702..a8108451e4dac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -423,6 +423,7 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl = pmd_lock(mm, pmd);
 
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
+		arch_flush_tlb_before_set_pte_page(mm, addr);
 		mm_inc_nr_ptes(mm);
 		/*
 		 * Ensure all pte setup (eg. pte page lock and page clearing) are
@@ -1931,6 +1932,7 @@ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
 	 * could have been expanded for hugetlb pmd sharing.
 	 */
 	unmap_single_vma(&tlb, vma, address, end, details, false);
+	try_to_reclaim_pgtables(&tlb, vma, address, end, details);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
 	hugetlb_zap_end(vma, details);
diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
new file mode 100644
index 0000000000000..e375e7f2059f8
--- /dev/null
+++ b/mm/pt_reclaim.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/pagewalk.h>
+#include <linux/hugetlb.h>
+#include <asm-generic/tlb.h>
+#include <asm/pgalloc.h>
+
+#include "internal.h"
+
+/*
+ * Locking:
+ *  - already held the mmap read lock to traverse the pgtable
+ *  - use pmd lock for clearing pmd entry
+ *  - use pte lock for checking empty PTE page, and release it after clearing
+ *    pmd entry, then we can capture the changed pmd in pte_offset_map_lock()
+ *    etc after holding this pte lock. Thanks to this, we don't need to hold the
+ *    rmap-related locks.
+ *  - users of pte_offset_map_lock() etc all expect the PTE page to be stable by
+ *    using rcu lock, so PTE pages should be freed by RCU.
+ */
+static int reclaim_pgtables_pmd_entry(pmd_t *pmd, unsigned long addr,
+				      unsigned long next, struct mm_walk *walk)
+{
+	struct mm_struct *mm = walk->mm;
+	struct mmu_gather *tlb = walk->private;
+	pte_t *start_pte, *pte;
+	pmd_t pmdval;
+	spinlock_t *pml = NULL, *ptl;
+	int i;
+
+	start_pte = pte_offset_map_nolock(mm, pmd, &pmdval, addr, &ptl);
+	if (!start_pte)
+		return 0;
+
+	pml = pmd_lock(mm, pmd);
+	if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd))))
+		goto out_ptl;
+
+	/* Check if it is empty PTE page */
+	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
+		if (!pte_none(ptep_get(pte)))
+			goto out_ptl;
+	}
+	pte_unmap(start_pte);
+
+	pmd_clear(pmd);
+	if (ptl != pml)
+		spin_unlock(ptl);
+	spin_unlock(pml);
+
+	/*
+	 * NOTE:
+	 *   In order to reuse mmu_gather to batch flush tlb and free PTE pages,
+	 *   here tlb is not flushed before pmd lock is unlocked. This may
+	 *   result in the following two situations:
+	 *
+	 *   1) Userland can trigger page fault and fill a huge page, which will
+	 *      cause the existence of small size TLB and huge TLB for the same
+	 *      address.
+	 *
+	 *   2) Userland can also trigger page fault and fill a PTE page, which
+	 *      will cause the existence of two small size TLBs, but the PTE
+	 *      page they map are different.
+	 *
+	 * Some CPUs do not allow these, to solve this, we can define
+	 * arch_flush_tlb_before_set_{huge|pte}_page to detect this case and
+	 * flush TLB before filling a huge page or a PTE page in page fault
+	 * path.
+	 */
+	pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
+	mm_dec_nr_ptes(mm);
+
+	return 0;
+
+out_ptl:
+	pte_unmap_unlock(start_pte, ptl);
+	if (pml != ptl)
+		spin_unlock(pml);
+
+	return 0;
+}
+
+static const struct mm_walk_ops reclaim_pgtables_walk_ops = {
+	.pmd_entry = reclaim_pgtables_pmd_entry,
+	.walk_lock = PGWALK_RDLOCK,
+};
+
+void try_to_reclaim_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
+			     unsigned long start_addr, unsigned long end_addr,
+			     struct zap_details *details)
+{
+	unsigned long start = max(vma->vm_start, start_addr);
+	unsigned long end;
+
+	if (start >= vma->vm_end)
+		return;
+	end = min(vma->vm_end, end_addr);
+	if (end <= vma->vm_start)
+		return;
+
+	/* Skip hugetlb case  */
+	if (is_vm_hugetlb_page(vma))
+		return;
+
+	/* Leave this to the THP path to handle */
+	if (vma->vm_flags & VM_HUGEPAGE)
+		return;
+
+	/* userfaultfd_wp case may reinstall the pte entry, also skip */
+	if (userfaultfd_wp(vma))
+		return;
+
+	/*
+	 * For private file mapping, the COW-ed page is an anon page, and it
+	 * will not be zapped. For simplicity, skip the all writable private
+	 * file mapping cases.
+	 */
+	if (details && !vma_is_anonymous(vma) &&
+	    !(vma->vm_flags & VM_MAYSHARE) &&
+	    (vma->vm_flags & VM_WRITE))
+		return;
+
+	start = ALIGN(start, PMD_SIZE);
+	end = ALIGN_DOWN(end, PMD_SIZE);
+	if (end - start < PMD_SIZE)
+		return;
+
+	walk_page_range_vma(vma, start, end, &reclaim_pgtables_walk_ops, tlb);
+}
-- 
2.20.1



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
  2024-08-05 12:55 ` [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single() Qi Zheng
@ 2024-08-06 14:40   ` David Hildenbrand
       [not found]     ` <42942b4d-153e-43e2-bfb1-43db49f87e50@bytedance.com>
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2024-08-06 14:40 UTC (permalink / raw)
  To: Qi Zheng, hughd, willy, mgorman, muchun.song, vbabka, akpm,
	zokeefe, rientjes
  Cc: linux-mm, linux-kernel

On 05.08.24 14:55, Qi Zheng wrote:
> Now in order to pursue high performance, applications mostly use some
> high-performance user-mode memory allocators, such as jemalloc or
> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
> to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
> release page table memory, which may cause huge page table memory usage.
> 
> The following are a memory usage snapshot of one process which actually
> happened on our server:
> 
>          VIRT:  55t
>          RES:   590g
>          VmPTE: 110g
> 
> In this case, most of the page table entries are empty. For such a PTE
> page where all entries are empty, we can actually free it back to the
> system for others to use.
> 
> As a first step, this commit attempts to synchronously free the empty PTE
> pages in zap_page_range_single() (MADV_DONTNEED etc will invoke this). In
> order to reduce overhead, we only handle the cases with a high probability
> of generating empty PTE pages, and other cases will be filtered out, such
> as:

It doesn't make particular sense during munmap() where we will just 
remove the page tables manually directly afterwards. We should limit it 
to the !munmap case -- in particular MADV_DONTNEED.

To minimze the added overhead, I further suggest to only try reclaim 
asynchronously if we know that likely all ptes will be none, that is, 
when we just zapped *all* ptes of a PTE page table -- our range spans 
the complete PTE page table.

Just imagine someone zaps a single PTE, we really don't want to start 
scanning page tables and involve an (rather expensive) walk_page_range 
just to find out that there is still something mapped.

Last but not least, would there be a way to avoid the walk_page_range() 
and simply trigger it from zap_pte_range(), possibly still while holding 
the PTE table lock?

We might have to trylock the PMD, but that should be doable.

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 24+ messages in thread

[parent not found: <42942b4d-153e-43e2-bfb1-43db49f87e50@bytedance.com>]

* Re: [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
       [not found]     ` <42942b4d-153e-43e2-bfb1-43db49f87e50@bytedance.com>
@ 2024-08-16  9:22       ` David Hildenbrand
  2024-08-16 10:01         ` Qi Zheng
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2024-08-16  9:22 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, linux-mm, linux-kernel

On 07.08.24 05:58, Qi Zheng wrote:
> Hi David,
> 

Really sorry for the slow replies, I'm struggling with a mixture of 
public holidays, holiday and too many different discussions (well, and 
some stuff I have to finish myself).

> On 2024/8/6 22:40, David Hildenbrand wrote:
>> On 05.08.24 14:55, Qi Zheng wrote:
>>> Now in order to pursue high performance, applications mostly use some
>>> high-performance user-mode memory allocators, such as jemalloc or
>>> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
>>> to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
>>> release page table memory, which may cause huge page table memory usage.
>>>
>>> The following are a memory usage snapshot of one process which actually
>>> happened on our server:
>>>
>>>           VIRT:  55t
>>>           RES:   590g
>>>           VmPTE: 110g
>>>
>>> In this case, most of the page table entries are empty. For such a PTE
>>> page where all entries are empty, we can actually free it back to the
>>> system for others to use.
>>>
>>> As a first step, this commit attempts to synchronously free the empty PTE
>>> pages in zap_page_range_single() (MADV_DONTNEED etc will invoke this). In
>>> order to reduce overhead, we only handle the cases with a high
>>> probability
>>> of generating empty PTE pages, and other cases will be filtered out, such
>>> as:
>>
>> It doesn't make particular sense during munmap() where we will just
>> remove the page tables manually directly afterwards. We should limit it
>> to the !munmap case -- in particular MADV_DONTNEED.
> 
> munmap directly calls unmap_single_vma() instead of
> zap_page_range_single(), so the munmap case has already been excluded
> here. On the other hand, if we try to reclaim in zap_pte_range(), we
> need to identify the munmap case.
> 
> Of course, we could just modify the MADV_DONTNEED case instead of all
> the callers of zap_page_range_single(), perhaps we could add a new
> parameter to identify the MADV_DONTNEED case?

See below, zap_details might come in handy.

> 
>>
>> To minimze the added overhead, I further suggest to only try reclaim
>> asynchronously if we know that likely all ptes will be none, that is,
> 
> asynchronously? What you probably mean to say is synchronously, right?
> 
>> when we just zapped *all* ptes of a PTE page table -- our range spans
>> the complete PTE page table.
>>
>> Just imagine someone zaps a single PTE, we really don't want to start
>> scanning page tables and involve an (rather expensive) walk_page_range
>> just to find out that there is still something mapped.
> 
> In the munmap path, we first execute unmap and then reclaim the page
> tables:
> 
> unmap_vmas
> free_pgtables
> 
> Therefore, I think doing something similar in zap_page_range_single()
> would be more consistent:
> 
> unmap_single_vma
> try_to_reclaim_pgtables
> 
> And I think that the main overhead should be in flushing TLB and freeing
> the pages. Of course, I will do some performance testing to see the
> actual impact.
> 
>>
>> Last but not least, would there be a way to avoid the walk_page_range()
>> and simply trigger it from zap_pte_range(), possibly still while holding
>> the PTE table lock?
> 
> I've tried doing it that way before, but ultimately I did not choose to
> do it that way because of the following reasons:

I think we really should avoid another page table walk if possible.

> 
> 1. need to identify the munmap case

We already have "struct zap_details". Maybe we can extend that to 
specify what our intention are (either where we come from or whether we 
want to try ripping out apge tables directly).

> 2. trying to record the count of pte_none() within the original
>      zap_pte_range() loop is not very convenient. The most convenient
>      approach is still to loop 512 times to scan the PTE page.

Right, the code might need some reshuffling. As we might temporary drop 
the PTL (break case), fully relying on everything being pte_none() 
doesn't always work.

We could either handle it in zap_pmd_range(), after we processed a full 
PMD range. zap_pmd_range() knows for sure whether the full PMD range was 
covered, even if multiple zap_pte_range() calls were required.

Or we could indicate to zap_pte_range() the original range. Or we could 
make zap_pte_range() simply handle the retrying itself, and not get 
called multiple times for a single PMD range.

So the key points are:

(a) zap_pmd_range() should know for sure whether the full range is
     covered by the zap.
(b) zap_pte_range() knows whether it left any entries being (IOW, it n
     never ran into the "!should_zap_folio" case)
(c) we know whether we temporarily had to drop the PTL and someone might
     have converted pte_none() to something else.

Teaching zap_pte_range() to handle a full within-PMD range itself sounds 
cleanest.

Then we can handle it fully in zap_pte_range():

(a) if we had to leave entries behind (!pte_none()), no need to try
     ripping out the page table.

(b) if we didn't have to drop the PTL, we can remove the page table
     without even re-verifying whether the entries are pte_none(). We
     know they are. If we had to drop the PTL, we have to re-verify at
    least the PTEs that were not zapped in the last iteration.


So there is the chance to avoid pte_none() checks completely, or minimze 
them if we had to drop the PTL.

Anything I am missing? Please let me know if anything is unclear.

Reworking the retry logic for zap_pte_range(), to be called for a single 
PMD only once is likely the first step.

> 3. still need to release the pte lock, and then re-acquire the pmd lock
>      and pte lock.

Yes, if try-locking the PMD fails.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
  2024-08-16  9:22       ` David Hildenbrand
@ 2024-08-16 10:01         ` Qi Zheng
  2024-08-16 10:03           ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Qi Zheng @ 2024-08-16 10:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, linux-mm, linux-kernel



On 2024/8/16 17:22, David Hildenbrand wrote:
> On 07.08.24 05:58, Qi Zheng wrote:
>> Hi David,
>>
> 
> Really sorry for the slow replies, I'm struggling with a mixture of 
> public holidays, holiday and too many different discussions (well, and 
> some stuff I have to finish myself).
> 
>> On 2024/8/6 22:40, David Hildenbrand wrote:
>>> On 05.08.24 14:55, Qi Zheng wrote:
>>>> Now in order to pursue high performance, applications mostly use some
>>>> high-performance user-mode memory allocators, such as jemalloc or
>>>> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or 
>>>> MADV_FREE)
>>>> to release physical memory, but neither MADV_DONTNEED nor MADV_FREE 
>>>> will
>>>> release page table memory, which may cause huge page table memory 
>>>> usage.
>>>>
>>>> The following are a memory usage snapshot of one process which actually
>>>> happened on our server:
>>>>
>>>>           VIRT:  55t
>>>>           RES:   590g
>>>>           VmPTE: 110g
>>>>
>>>> In this case, most of the page table entries are empty. For such a PTE
>>>> page where all entries are empty, we can actually free it back to the
>>>> system for others to use.
>>>>
>>>> As a first step, this commit attempts to synchronously free the 
>>>> empty PTE
>>>> pages in zap_page_range_single() (MADV_DONTNEED etc will invoke 
>>>> this). In
>>>> order to reduce overhead, we only handle the cases with a high
>>>> probability
>>>> of generating empty PTE pages, and other cases will be filtered out, 
>>>> such
>>>> as:
>>>
>>> It doesn't make particular sense during munmap() where we will just
>>> remove the page tables manually directly afterwards. We should limit it
>>> to the !munmap case -- in particular MADV_DONTNEED.
>>
>> munmap directly calls unmap_single_vma() instead of
>> zap_page_range_single(), so the munmap case has already been excluded
>> here. On the other hand, if we try to reclaim in zap_pte_range(), we
>> need to identify the munmap case.
>>
>> Of course, we could just modify the MADV_DONTNEED case instead of all
>> the callers of zap_page_range_single(), perhaps we could add a new
>> parameter to identify the MADV_DONTNEED case?
> 
> See below, zap_details might come in handy.
> 
>>
>>>
>>> To minimze the added overhead, I further suggest to only try reclaim
>>> asynchronously if we know that likely all ptes will be none, that is,
>>
>> asynchronously? What you probably mean to say is synchronously, right?
>>
>>> when we just zapped *all* ptes of a PTE page table -- our range spans
>>> the complete PTE page table.
>>>
>>> Just imagine someone zaps a single PTE, we really don't want to start
>>> scanning page tables and involve an (rather expensive) walk_page_range
>>> just to find out that there is still something mapped.
>>
>> In the munmap path, we first execute unmap and then reclaim the page
>> tables:
>>
>> unmap_vmas
>> free_pgtables
>>
>> Therefore, I think doing something similar in zap_page_range_single()
>> would be more consistent:
>>
>> unmap_single_vma
>> try_to_reclaim_pgtables
>>
>> And I think that the main overhead should be in flushing TLB and freeing
>> the pages. Of course, I will do some performance testing to see the
>> actual impact.
>>
>>>
>>> Last but not least, would there be a way to avoid the walk_page_range()
>>> and simply trigger it from zap_pte_range(), possibly still while holding
>>> the PTE table lock?
>>
>> I've tried doing it that way before, but ultimately I did not choose to
>> do it that way because of the following reasons:
> 
> I think we really should avoid another page table walk if possible.
> 
>>
>> 1. need to identify the munmap case
> 
> We already have "struct zap_details". Maybe we can extend that to 
> specify what our intention are (either where we come from or whether we 
> want to try ripping out apge tables directly).
> 
>> 2. trying to record the count of pte_none() within the original
>>      zap_pte_range() loop is not very convenient. The most convenient
>>      approach is still to loop 512 times to scan the PTE page.
> 
> Right, the code might need some reshuffling. As we might temporary drop 
> the PTL (break case), fully relying on everything being pte_none() 
> doesn't always work.
> 
> We could either handle it in zap_pmd_range(), after we processed a full 
> PMD range. zap_pmd_range() knows for sure whether the full PMD range was 
> covered, even if multiple zap_pte_range() calls were required.
> 
> Or we could indicate to zap_pte_range() the original range. Or we could 
> make zap_pte_range() simply handle the retrying itself, and not get 
> called multiple times for a single PMD range.
> 
> So the key points are:
> 
> (a) zap_pmd_range() should know for sure whether the full range is
>      covered by the zap.
> (b) zap_pte_range() knows whether it left any entries being (IOW, it n
>      never ran into the "!should_zap_folio" case)
> (c) we know whether we temporarily had to drop the PTL and someone might
>      have converted pte_none() to something else.
> 
> Teaching zap_pte_range() to handle a full within-PMD range itself sounds 
> cleanest.

Agree.

> 
> Then we can handle it fully in zap_pte_range():
> 
> (a) if we had to leave entries behind (!pte_none()), no need to try
>      ripping out the page table.

Yes.

> 
> (b) if we didn't have to drop the PTL, we can remove the page table
>      without even re-verifying whether the entries are pte_none(). We

If we want to remove the PTE page, we must hold the pmd lock (for
clearing pmd entry). To prevent ABBA deadlock, we must first release the
pte lock and then re-acquire the pmd lock + pte lock. Right? If so, then
rechecking pte_none() is unavoidable. Unless we hold the pmd lock + pte
lock in advance to execute the original code loop.

>      know they are. If we had to drop the PTL, we have to re-verify at
>     least the PTEs that were not zapped in the last iteration.
> 
> 
> So there is the chance to avoid pte_none() checks completely, or minimze 
> them if we had to drop the PTL.
> 
> Anything I am missing? Please let me know if anything is unclear.
> 
> Reworking the retry logic for zap_pte_range(), to be called for a single 
> PMD only once is likely the first step.

Agree, will do.

> 
>> 3. still need to release the pte lock, and then re-acquire the pmd lock
>>      and pte lock.
> 
> Yes, if try-locking the PMD fails.
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
  2024-08-16 10:01         ` Qi Zheng
@ 2024-08-16 10:03           ` David Hildenbrand
  2024-08-16 10:07             ` Qi Zheng
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2024-08-16 10:03 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, linux-mm, linux-kernel

>>
>> (b) if we didn't have to drop the PTL, we can remove the page table
>>       without even re-verifying whether the entries are pte_none(). We
> 
> If we want to remove the PTE page, we must hold the pmd lock (for
> clearing pmd entry). To prevent ABBA deadlock, we must first release the
> pte lock and then re-acquire the pmd lock + pte lock. Right? If so, then
> rechecking pte_none() is unavoidable. Unless we hold the pmd lock + pte
> lock in advance to execute the original code loop.

Try-locking the PMD should work, and succeed in many cases, right? And 
if that fails, we could fallback to what you describe.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
  2024-08-16 10:03           ` David Hildenbrand
@ 2024-08-16 10:07             ` Qi Zheng
  0 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-08-16 10:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, linux-mm, linux-kernel



On 2024/8/16 18:03, David Hildenbrand wrote:
>>>
>>> (b) if we didn't have to drop the PTL, we can remove the page table
>>>       without even re-verifying whether the entries are pte_none(). We
>>
>> If we want to remove the PTE page, we must hold the pmd lock (for
>> clearing pmd entry). To prevent ABBA deadlock, we must first release the
>> pte lock and then re-acquire the pmd lock + pte lock. Right? If so, then
>> rechecking pte_none() is unavoidable. Unless we hold the pmd lock + pte
>> lock in advance to execute the original code loop.
> 
> Try-locking the PMD should work, and succeed in many cases, right? And 
> if that fails, we could fallback to what you describe.

Ah, right, I got it! Will try to do it in the next version.

Thanks,
Qi

> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH v2 5/7] x86: mm: free page table pages by RCU instead of semi RCU
  2024-08-05 12:55 [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (3 preceding siblings ...)
  2024-08-05 12:55 ` [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single() Qi Zheng
@ 2024-08-05 12:55 ` Qi Zheng
  2024-08-05 12:55 ` [RFC PATCH v2 6/7] x86: mm: define arch_flush_tlb_before_set_huge_page Qi Zheng
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-08-05 12:55 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes
  Cc: linux-mm, linux-kernel, Qi Zheng

Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages
will be freed by semi RCU, that is:

 - batch table freeing: asynchronous free by RCU
 - single table freeing: IPI + synchronous free

In this way, the page table can be lockless traversed by disabling IRQ in
paths such as fast GUP. But this is not enough to free the empty PTE page
table pages in paths other that munmap and exit_mmap path, because IPI
cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().

In preparation for supporting empty PTE page table pages reclaimation,
let single table also be freed by RCU like batch table freeing. Then we
can also use pte_offset_map() etc to prevent PTE page from being freed.

Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free
the page table pages:

 - The pt_rcu_head is unioned with pt_list and pmd_huge_pte.

 - For pt_list, it is used to manage the PGD page in x86. Fortunately
   tlb_remove_table() will not be used for free PGD pages, so it is safe
   to use pt_rcu_head.

 - For pmd_huge_pte, we will do zap_deposited_table() before freeing the
   PMD page, so it is also safe.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/x86/include/asm/tlb.h | 19 +++++++++++++++++++
 arch/x86/kernel/paravirt.c |  7 +++++++
 arch/x86/mm/pgtable.c      | 10 +++++++++-
 mm/mmu_gather.c            |  9 ++++++++-
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 580636cdc257b..e223b53a8b190 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -34,4 +34,23 @@ static inline void __tlb_remove_table(void *table)
 	free_page_and_swap_cache(table);
 }
 
+#ifdef CONFIG_PT_RECLAIM
+static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	free_page_and_swap_cache(page);
+}
+
+static inline void __tlb_remove_table_one(void *table)
+{
+	struct page *page;
+
+	page = table;
+	call_rcu(&page->rcu_head, __tlb_remove_table_one_rcu);
+}
+#define __tlb_remove_table_one __tlb_remove_table_one
+#endif /* CONFIG_PT_RECLAIM */
+
 #endif /* _ASM_X86_TLB_H */
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5358d43886adc..199b9a3813b4a 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -60,10 +60,17 @@ void __init native_pv_lock_init(void)
 		static_branch_disable(&virt_spin_lock_key);
 }
 
+#ifndef CONFIG_PT_RECLAIM
 static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
 	tlb_remove_page(tlb, table);
 }
+#else
+static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	tlb_remove_table(tlb, table);
+}
+#endif
 
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f5931499c2d6b..ea8522289c93d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -19,12 +19,20 @@ EXPORT_SYMBOL(physical_mask);
 #endif
 
 #ifndef CONFIG_PARAVIRT
+#ifndef CONFIG_PT_RECLAIM
 static inline
 void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
 	tlb_remove_page(tlb, table);
 }
-#endif
+#else
+static inline
+void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	tlb_remove_table(tlb, table);
+}
+#endif /* !CONFIG_PT_RECLAIM */
+#endif /* !CONFIG_PARAVIRT */
 
 gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
 
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99b3e9408aa0f..d948479ca09e6 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -311,10 +311,17 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 	}
 }
 
+#ifndef __tlb_remove_table_one
+static inline void __tlb_remove_table_one(void *table)
+{
+	__tlb_remove_table(table);
+}
+#endif
+
 static void tlb_remove_table_one(void *table)
 {
 	tlb_remove_table_sync_one();
-	__tlb_remove_table(table);
+	__tlb_remove_table_one(table);
 }
 
 static void tlb_table_flush(struct mmu_gather *tlb)
-- 
2.20.1



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH v2 6/7] x86: mm: define arch_flush_tlb_before_set_huge_page
  2024-08-05 12:55 [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (4 preceding siblings ...)
  2024-08-05 12:55 ` [RFC PATCH v2 5/7] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
@ 2024-08-05 12:55 ` Qi Zheng
  2024-08-05 12:55 ` [RFC PATCH v2 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-08-05 12:55 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes
  Cc: linux-mm, linux-kernel, Qi Zheng

When we use mmu_gather to batch flush tlb and free PTE pages, the TLB is
not flushed before pmd lock is unlocked. This may result in the following
two situations:

1) Userland can trigger page fault and fill a huge page, which will cause
   the existence of small size TLB and huge TLB for the same address.

2) Userland can also trigger page fault and fill a PTE page, which will
   cause the existence of two small size TLBs, but the PTE page they map
   are different.

According to Intel's TLB Application note (317080), some CPUs of x86 do
not allow the 1) case, so define arch_flush_tlb_before_set_huge_page to
detect and fix this issue.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/x86/include/asm/pgtable.h |  6 ++++++
 arch/x86/mm/pgtable.c          | 13 +++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e39311a89bf47..f93d964ab6a3e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1668,6 +1668,12 @@ void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte);
 #define arch_check_zapped_pmd arch_check_zapped_pmd
 void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd);
 
+#ifdef CONFIG_PT_RECLAIM
+#define arch_flush_tlb_before_set_huge_page arch_flush_tlb_before_set_huge_page
+void arch_flush_tlb_before_set_huge_page(struct mm_struct *mm,
+					 unsigned long addr);
+#endif
+
 #ifdef CONFIG_XEN_PV
 #define arch_has_hw_nonleaf_pmd_young arch_has_hw_nonleaf_pmd_young
 static inline bool arch_has_hw_nonleaf_pmd_young(void)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index ea8522289c93d..7e14cae819edd 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -934,3 +934,16 @@ void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd)
 	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
 			pmd_shstk(pmd));
 }
+
+#ifdef CONFIG_PT_RECLAIM
+void arch_flush_tlb_before_set_huge_page(struct mm_struct *mm,
+					 unsigned long addr)
+{
+	if (atomic_read(&mm->tlb_flush_pending)) {
+		unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
+		unsigned long end = start + PMD_SIZE;
+
+		flush_tlb_mm_range(mm, start, end, PAGE_SHIFT, false);
+	}
+}
+#endif /* CONFIG_PT_RECLAIM */
-- 
2.20.1



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH v2 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
  2024-08-05 12:55 [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (5 preceding siblings ...)
  2024-08-05 12:55 ` [RFC PATCH v2 6/7] x86: mm: define arch_flush_tlb_before_set_huge_page Qi Zheng
@ 2024-08-05 12:55 ` Qi Zheng
  2024-08-05 13:14 ` [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
  2024-08-06  3:31 ` Qi Zheng
  8 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-08-05 12:55 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes
  Cc: linux-mm, linux-kernel, Qi Zheng

Now, x86 has fully supported the CONFIG_PT_RECLAIM feature, and
reclaiming PTE pages is profitable only on 64-bit systems, so select
ARCH_SUPPORTS_PT_RECLAIM if X86_64.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2f611ffd0c9a4..fff6a7e6ea1de 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -316,6 +316,7 @@ config X86
 	select FUNCTION_ALIGNMENT_4B
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
 	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
+	select ARCH_SUPPORTS_PT_RECLAIM		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
-- 
2.20.1



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages
  2024-08-05 12:55 [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (6 preceding siblings ...)
  2024-08-05 12:55 ` [RFC PATCH v2 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng
@ 2024-08-05 13:14 ` Qi Zheng
  2024-08-06  3:31 ` Qi Zheng
  8 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-08-05 13:14 UTC (permalink / raw)
  To: the arch/x86 maintainers
  Cc: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes, linux-mm, linux-kernel

Add the x86 mailing list.

On 2024/8/5 20:55, Qi Zheng wrote:
> Changes in RFC v2:
>   - fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reproted by
>     kernel test robot
>   - use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid()
>     in retract_page_tables() (in [RFC PATCH 4/7])
>   - rebase onto the next-20240805
> 
> Hi all,
> 
> Previously, we tried to use a completely asynchronous method to reclaim empty
> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
> implement synchronous reclaimation in the case of madvise(MADV_DONTNEED) as the
> first step.
> 
> So this series aims to synchronously scan and reclaim empty user PTE pages in
> zap_page_range_single() (madvise(MADV_DONTNEED) etc will invoke this). In
> zap_page_range_single(), mmu_gather is used to perform batch tlb flushing and
> page freeing operations. Therefore, if we want to free the empty PTE page in
> this path, the most natural way is to add it to mmu_gather as well. There are
> two problems that need to be solved here:
> 
> 1. Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free
>     page table pages by semi RCU:
> 
>     - batch table freeing: asynchronous free by RCU
>     - single table freeing: IPI + synchronous free
> 
>     But this is not enough to free the empty PTE page table pages in paths other
>     that munmap and exit_mmap path, because IPI cannot be synchronized with
>     rcu_read_lock() in pte_offset_map{_lock}(). So we should let single table
>     also be freed by RCU like batch table freeing.
> 
> 2. When we use mmu_gather to batch flush tlb and free PTE pages, the TLB is not
>     flushed before pmd lock is unlocked. This may result in the following two
>     situations:
> 
>     1) Userland can trigger page fault and fill a huge page, which will cause
>        the existence of small size TLB and huge TLB for the same address.
> 
>     2) Userland can also trigger page fault and fill a PTE page, which will
>        cause the existence of two small size TLBs, but the PTE page they map
>        are different.
> 
>     For case 1), according to Intel's TLB Application note (317080), some CPUs of
>     x86 do not allow it:
> 
>     ```
>     If software modifies the paging structures so that the page size used for a
>     4-KByte range of linear addresses changes, the TLBs may subsequently contain
>     both ordinary and large-page translations for the address range.12 A reference
>     to a linear address in the address range may use either translation. Which of
>     the two translations is used may vary from one execution to another and the
>     choice may be implementation-specific.
> 
>     Software wishing to prevent this uncertainty should not write to a paging-
>     structure entry in a way that would change, for any linear address, both the
>     page size and either the page frame or attributes. It can instead use the
>     following algorithm: first mark the relevant paging-structure entry (e.g.,
>     PDE) not present; then invalidate any translations for the affected linear
>     addresses (see Section 5.2); and then modify the relevant paging-structure
>     entry to mark it present and establish translation(s) for the new page size.
>     ```
> 
>     We can also learn more information from the comments above pmdp_invalidate()
>     in __split_huge_pmd_locked().
> 
>     For case 2), we can see from the comments above ptep_clear_flush() in
>     wp_page_copy() that this situation is also not allowed. Even without
>     this patch series, madvise(MADV_DONTNEED) can also cause this situation:
> 
>             CPU 0                         CPU 1
> 
>     madvise (MADV_DONTNEED)
>     -->  clear pte entry
>          pte_unmap_unlock
>                                        touch and tlb miss
> 				      --> set pte entry
>          mmu_gather flush tlb
> 
>     But strangely, I didn't see any relevant fix code, maybe I missed something,
>     or is this guaranteed by userland?
> 
>     Anyway, this series defines the following two functions to be implemented by
>     the architecture. If the architecture does not allow the above two situations,
>     then define these two functions to flush the tlb before set_pmd_at().
> 
>     - arch_flush_tlb_before_set_huge_page
>     - arch_flush_tlb_before_set_pte_page
> 
> As a first step, we supported this feature on x86_64 and selectd the newly
> introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
> 
> In order to reduce overhead, we only handle the cases with a high probability
> of generating empty PTE pages, and other cases will be filtered out, such as:
> 
>   - hugetlb vma (unsuitable)
>   - userfaultfd_wp vma (may reinstall the pte entry)
>   - writable private file mapping case (COW-ed anon page is not zapped)
>   - etc
> 
> For userfaultfd_wp and writable private file mapping cases (and MADV_FREE case,
> of course), consider scanning and freeing empty PTE pages asynchronously in
> the future.
> 
> This series is based on next-20240805.
> 
> Comments and suggestions are welcome!
> 
> Thanks,
> Qi
> 
> [1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
> 
> Qi Zheng (7):
>    mm: pgtable: make pte_offset_map_nolock() return pmdval
>    mm: introduce CONFIG_PT_RECLAIM
>    mm: pass address information to pmd_install()
>    mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
>    x86: mm: free page table pages by RCU instead of semi RCU
>    x86: mm: define arch_flush_tlb_before_set_huge_page
>    x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
> 
>   Documentation/mm/split_page_table_lock.rst |   3 +-
>   arch/arm/mm/fault-armv.c                   |   2 +-
>   arch/powerpc/mm/pgtable.c                  |   2 +-
>   arch/x86/Kconfig                           |   1 +
>   arch/x86/include/asm/pgtable.h             |   6 +
>   arch/x86/include/asm/tlb.h                 |  19 +++
>   arch/x86/kernel/paravirt.c                 |   7 ++
>   arch/x86/mm/pgtable.c                      |  23 +++-
>   include/linux/hugetlb.h                    |   2 +-
>   include/linux/mm.h                         |  13 +-
>   include/linux/pgtable.h                    |  14 +++
>   mm/Kconfig                                 |  14 +++
>   mm/Makefile                                |   1 +
>   mm/debug_vm_pgtable.c                      |   2 +-
>   mm/filemap.c                               |   4 +-
>   mm/gup.c                                   |   2 +-
>   mm/huge_memory.c                           |   3 +
>   mm/internal.h                              |  17 ++-
>   mm/khugepaged.c                            |  32 +++--
>   mm/memory.c                                |  21 ++--
>   mm/migrate_device.c                        |   2 +-
>   mm/mmu_gather.c                            |   9 +-
>   mm/mprotect.c                              |   8 +-
>   mm/mremap.c                                |   4 +-
>   mm/page_vma_mapped.c                       |   2 +-
>   mm/pgtable-generic.c                       |  21 ++--
>   mm/pt_reclaim.c                            | 131 +++++++++++++++++++++
>   mm/userfaultfd.c                           |  10 +-
>   mm/vmscan.c                                |   2 +-
>   29 files changed, 321 insertions(+), 56 deletions(-)
>   create mode 100644 mm/pt_reclaim.c
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages
  2024-08-05 12:55 [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (7 preceding siblings ...)
  2024-08-05 13:14 ` [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
@ 2024-08-06  3:31 ` Qi Zheng
  2024-08-16  2:55   ` Qi Zheng
  8 siblings, 1 reply; 24+ messages in thread
From: Qi Zheng @ 2024-08-06  3:31 UTC (permalink / raw)
  To: david, hughd, willy, mgorman, muchun.song, vbabka, akpm, zokeefe,
	rientjes
  Cc: linux-mm, linux-kernel, the arch/x86 maintainers

Hi all,

On 2024/8/5 20:55, Qi Zheng wrote:

[...]

> 
> 2. When we use mmu_gather to batch flush tlb and free PTE pages, the TLB is not
>     flushed before pmd lock is unlocked. This may result in the following two
>     situations:
> 
>     1) Userland can trigger page fault and fill a huge page, which will cause
>        the existence of small size TLB and huge TLB for the same address.
> 
>     2) Userland can also trigger page fault and fill a PTE page, which will
>        cause the existence of two small size TLBs, but the PTE page they map
>        are different.
> 
>     For case 1), according to Intel's TLB Application note (317080), some CPUs of
>     x86 do not allow it:
> 
>     ```
>     If software modifies the paging structures so that the page size used for a
>     4-KByte range of linear addresses changes, the TLBs may subsequently contain
>     both ordinary and large-page translations for the address range.12 A reference
>     to a linear address in the address range may use either translation. Which of
>     the two translations is used may vary from one execution to another and the
>     choice may be implementation-specific.
> 
>     Software wishing to prevent this uncertainty should not write to a paging-
>     structure entry in a way that would change, for any linear address, both the
>     page size and either the page frame or attributes. It can instead use the
>     following algorithm: first mark the relevant paging-structure entry (e.g.,
>     PDE) not present; then invalidate any translations for the affected linear
>     addresses (see Section 5.2); and then modify the relevant paging-structure
>     entry to mark it present and establish translation(s) for the new page size.
>     ```
> 
>     We can also learn more information from the comments above pmdp_invalidate()
>     in __split_huge_pmd_locked().
> 
>     For case 2), we can see from the comments above ptep_clear_flush() in
>     wp_page_copy() that this situation is also not allowed. Even without
>     this patch series, madvise(MADV_DONTNEED) can also cause this situation:
> 
>             CPU 0                         CPU 1
> 
>     madvise (MADV_DONTNEED)
>     -->  clear pte entry
>          pte_unmap_unlock
>                                        touch and tlb miss
> 				      --> set pte entry
>          mmu_gather flush tlb
> 
>     But strangely, I didn't see any relevant fix code, maybe I missed something,
>     or is this guaranteed by userland?

I'm still quite confused about this, is there anyone who is familiar
with this part?

Thanks,
Qi

> 
>     Anyway, this series defines the following two functions to be implemented by
>     the architecture. If the architecture does not allow the above two situations,
>     then define these two functions to flush the tlb before set_pmd_at().
> 
>     - arch_flush_tlb_before_set_huge_page
>     - arch_flush_tlb_before_set_pte_page
> 

[...]

> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages
  2024-08-06  3:31 ` Qi Zheng
@ 2024-08-16  2:55   ` Qi Zheng
  0 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-08-16  2:55 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, mgorman, vbabka, akpm, zokeefe,
	rientjes
  Cc: linux-mm, linux-kernel, the arch/x86 maintainers



On 2024/8/6 11:31, Qi Zheng wrote:
> Hi all,
> 
> On 2024/8/5 20:55, Qi Zheng wrote:
> 
> [...]
> 
>>
>> 2. When we use mmu_gather to batch flush tlb and free PTE pages, the 
>> TLB is not
>>     flushed before pmd lock is unlocked. This may result in the 
>> following two
>>     situations:
>>
>>     1) Userland can trigger page fault and fill a huge page, which 
>> will cause
>>        the existence of small size TLB and huge TLB for the same address.
>>
>>     2) Userland can also trigger page fault and fill a PTE page, which 
>> will
>>        cause the existence of two small size TLBs, but the PTE page 
>> they map
>>        are different.
>>
>>     For case 1), according to Intel's TLB Application note (317080), 
>> some CPUs of
>>     x86 do not allow it:
>>
>>     ```
>>     If software modifies the paging structures so that the page size 
>> used for a
>>     4-KByte range of linear addresses changes, the TLBs may 
>> subsequently contain
>>     both ordinary and large-page translations for the address range.12 
>> A reference
>>     to a linear address in the address range may use either 
>> translation. Which of
>>     the two translations is used may vary from one execution to 
>> another and the
>>     choice may be implementation-specific.
>>
>>     Software wishing to prevent this uncertainty should not write to a 
>> paging-
>>     structure entry in a way that would change, for any linear 
>> address, both the
>>     page size and either the page frame or attributes. It can instead 
>> use the
>>     following algorithm: first mark the relevant paging-structure 
>> entry (e.g.,
>>     PDE) not present; then invalidate any translations for the 
>> affected linear
>>     addresses (see Section 5.2); and then modify the relevant 
>> paging-structure
>>     entry to mark it present and establish translation(s) for the new 
>> page size.
>>     ```
>>
>>     We can also learn more information from the comments above 
>> pmdp_invalidate()
>>     in __split_huge_pmd_locked().
>>
>>     For case 2), we can see from the comments above ptep_clear_flush() in
>>     wp_page_copy() that this situation is also not allowed. Even without
>>     this patch series, madvise(MADV_DONTNEED) can also cause this 
>> situation:
>>
>>             CPU 0                         CPU 1
>>
>>     madvise (MADV_DONTNEED)
>>     -->  clear pte entry
>>          pte_unmap_unlock
>>                                        touch and tlb miss
>>                       --> set pte entry
>>          mmu_gather flush tlb
>>
>>     But strangely, I didn't see any relevant fix code, maybe I missed 
>> something,
>>     or is this guaranteed by userland?
> 
> I'm still quite confused about this, is there anyone who is familiar
> with this part?

This is not a new issue introduced by this patch series, and I have
sent a separate RFC patch [1] to track this issue.

I will remove this part of the handling in the next version.

[1]. 
https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/

> 
> Thanks,
> Qi
> 
>>
>>     Anyway, this series defines the following two functions to be 
>> implemented by
>>     the architecture. If the architecture does not allow the above two 
>> situations,
>>     then define these two functions to flush the tlb before set_pmd_at().
>>
>>     - arch_flush_tlb_before_set_huge_page
>>     - arch_flush_tlb_before_set_pte_page
>>
> 
> [...]
> 
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-08-16 10:07 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-05 12:55 [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
2024-08-05 12:55 ` [RFC PATCH v2 1/7] mm: pgtable: make pte_offset_map_nolock() return pmdval Qi Zheng
2024-08-05 14:43   ` David Hildenbrand
2024-08-06  2:40     ` Qi Zheng
2024-08-06 14:16       ` David Hildenbrand
     [not found]         ` <f6c05526-5ac9-4597-9e80-099ea22fa0ae@bytedance.com>
2024-08-09 16:54           ` David Hildenbrand
2024-08-12  6:21             ` Qi Zheng
2024-08-16  8:59               ` David Hildenbrand
2024-08-16  9:21                 ` Qi Zheng
2024-08-05 12:55 ` [RFC PATCH v2 2/7] mm: introduce CONFIG_PT_RECLAIM Qi Zheng
2024-08-06 14:25   ` David Hildenbrand
2024-08-05 12:55 ` [RFC PATCH v2 3/7] mm: pass address information to pmd_install() Qi Zheng
2024-08-05 12:55 ` [RFC PATCH v2 4/7] mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single() Qi Zheng
2024-08-06 14:40   ` David Hildenbrand
     [not found]     ` <42942b4d-153e-43e2-bfb1-43db49f87e50@bytedance.com>
2024-08-16  9:22       ` David Hildenbrand
2024-08-16 10:01         ` Qi Zheng
2024-08-16 10:03           ` David Hildenbrand
2024-08-16 10:07             ` Qi Zheng
2024-08-05 12:55 ` [RFC PATCH v2 5/7] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
2024-08-05 12:55 ` [RFC PATCH v2 6/7] x86: mm: define arch_flush_tlb_before_set_huge_page Qi Zheng
2024-08-05 12:55 ` [RFC PATCH v2 7/7] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng
2024-08-05 13:14 ` [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages Qi Zheng
2024-08-06  3:31 ` Qi Zheng
2024-08-16  2:55   ` Qi Zheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox