linux-mm.kvack.org archive mirror
* [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock()
@ 2024-08-21  8:18 Qi Zheng
  2024-08-21  8:18 ` [PATCH 01/14] mm: pgtable: " Qi Zheng
                   ` (13 more replies)
  0 siblings, 14 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

Hi all,

As proposed by David Hildenbrand [1], this series introduces the following two
new helper functions to replace pte_offset_map_nolock().

1. pte_offset_map_readonly_nolock()
2. pte_offset_map_maywrite_nolock()

As the name suggests, pte_offset_map_readonly_nolock() is used for the
read-only case. In this case, only read-only operations will be performed
on the PTE page after the PTL is held. The RCU lock in
pte_offset_map_nolock() will ensure that the PTE page will not be freed,
and there is no need to worry about whether the pmd entry is modified.
Therefore, pte_offset_map_readonly_nolock() is just a renamed version of
pte_offset_map_nolock().

pte_offset_map_maywrite_nolock() is used for the may-write case. In this
case, the pte or pmd entry may be modified after the PTL is held, so we
need to ensure that the pmd entry has not been modified concurrently.
Therefore, in addition to the name change, it also outputs the pmdval when
successful. This can help the caller recheck *pmd once the PTL is taken.
In some cases we can pass NULL to pmdvalp: either the mmap_lock held for
write, or a pte_same() check on contents, is also enough to ensure that
the pmd entry is stable.

This series will convert all pte_offset_map_nolock() calls into the above
two helper functions one by one, and finally delete it completely.

This is also a preparation for reclaiming empty user PTE page table pages.

This series is based on the next-20240820.

Comments and suggestions are welcome!

Thanks,
Qi

[1]. https://lore.kernel.org/lkml/f79bbfc9-bb4c-4da4-9902-2e73817dd135@redhat.com/

Qi Zheng (14):
  mm: pgtable: introduce pte_offset_map_{readonly|maywrite}_nolock()
  arm: adjust_pte() use pte_offset_map_maywrite_nolock()
  powerpc: assert_pte_locked() use pte_offset_map_readonly_nolock()
  mm: filemap: filemap_fault_recheck_pte_none() use
    pte_offset_map_readonly_nolock()
  mm: khugepaged: __collapse_huge_page_swapin() use
    pte_offset_map_readonly_nolock()
  mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
  mm: khugepaged: collapse_pte_mapped_thp() use
    pte_offset_map_maywrite_nolock()
  mm: copy_pte_range() use pte_offset_map_maywrite_nolock()
  mm: mremap: move_ptes() use pte_offset_map_maywrite_nolock()
  mm: page_vma_mapped_walk: map_pte() use
    pte_offset_map_maywrite_nolock()
  mm: userfaultfd: move_pages_pte() use pte_offset_map_maywrite_nolock()
  mm: multi-gen LRU: walk_pte_range() use
    pte_offset_map_maywrite_nolock()
  mm: pgtable: remove pte_offset_map_nolock()
  mm: khugepaged: retract_page_tables() use
    pte_offset_map_maywrite_nolock()

 Documentation/mm/split_page_table_lock.rst |  6 +++-
 arch/arm/mm/fault-armv.c                   |  9 ++++-
 arch/powerpc/mm/pgtable.c                  |  2 +-
 include/linux/mm.h                         |  7 ++--
 mm/filemap.c                               |  4 +--
 mm/khugepaged.c                            | 39 ++++++++++++++++++--
 mm/memory.c                                | 13 +++++--
 mm/mremap.c                                |  7 +++-
 mm/page_vma_mapped.c                       | 24 ++++++++++---
 mm/pgtable-generic.c                       | 42 ++++++++++++++++------
 mm/userfaultfd.c                           | 12 +++++--
 mm/vmscan.c                                |  9 ++++-
 12 files changed, 143 insertions(+), 31 deletions(-)

-- 
2.20.1



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 01/14] mm: pgtable: introduce pte_offset_map_{readonly|maywrite}_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  2024-08-21  8:18 ` [PATCH 02/14] arm: adjust_pte() use pte_offset_map_maywrite_nolock() Qi Zheng
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:

1) After acquiring PTL, only read-only operations are performed on the PTE
   page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
   that the PTE page will not be freed, and there is no need to worry
   about whether the pmd entry is modified.

2) After acquiring PTL, the pte or pmd entries may be modified. At this
   time, we need to ensure that the pmd entry has not been modified
   concurrently.

To more clearly distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_readonly_nolock(). For 2), in
addition to changing the name to pte_offset_map_maywrite_nolock(), it also
outputs the pmdval when successful. This can help the caller recheck *pmd
once the PTL is taken. In some cases we can pass NULL to pmdvalp: either
the mmap_lock for write, or pte_same() check on contents, is also enough
to ensure that the pmd entry is stable.

Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 Documentation/mm/split_page_table_lock.rst |  7 ++++
 include/linux/mm.h                         |  5 +++
 mm/pgtable-generic.c                       | 43 ++++++++++++++++++++++
 3 files changed, 55 insertions(+)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c04..f54f717ae8bdf 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
  - pte_offset_map_nolock()
 	maps PTE, returns pointer to PTE with pointer to its PTE table
 	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_readonly_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_maywrite_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken) and the value of its pmd entry, or returns NULL
+	if no PTE table;
  - pte_offset_map()
 	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
  - pte_unmap()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00501f85f45f0..1fe0ceabcaf39 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2954,6 +2954,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 
 pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 			unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_readonly_nolock(struct mm_struct *mm, pmd_t *pmd,
+				      unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_maywrite_nolock(struct mm_struct *mm, pmd_t *pmd,
+				      unsigned long addr, pmd_t *pmdvalp,
+				      spinlock_t **ptlp);
 
 #define pte_unmap_unlock(pte, ptl)	do {		\
 	spin_unlock(ptl);				\
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711ac..29d1fd6fd2963 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
+pte_t *pte_offset_map_readonly_nolock(struct mm_struct *mm, pmd_t *pmd,
+				      unsigned long addr, spinlock_t **ptlp)
+{
+	pmd_t pmdval;
+	pte_t *pte;
+
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, &pmdval);
+	return pte;
+}
+
+pte_t *pte_offset_map_maywrite_nolock(struct mm_struct *mm, pmd_t *pmd,
+				      unsigned long addr, pmd_t *pmdvalp,
+				      spinlock_t **ptlp)
+{
+	pmd_t pmdval;
+	pte_t *pte;
+
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, &pmdval);
+	if (pmdvalp)
+		*pmdvalp = pmdval;
+	return pte;
+}
+
 /*
  * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
  * __pte_offset_map_lock() below, is usually called with the pmd pointer for
@@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
  * recheck *pmd once the lock is taken; in practice, no callsite needs that -
  * either the mmap_lock for write, or pte_same() check on contents, is enough.
  *
+ * pte_offset_map_readonly_nolock(mm, pmd, addr, ptlp), above, is like
+ * pte_offset_map(); but when successful, it also outputs a pointer to the
+ * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
+ * locking it.  This helps the caller to avoid a later pte_lockptr(mm, *pmd),
+ * which might by that time act on a changed *pmd: pte_offset_map_readonly_nolock()
+ * provides the correct spinlock pointer for the page table that it returns.
+ * In the read-only case, the caller does not need to recheck *pmd after the
+ * lock is taken, because the RCU lock ensures that the PTE page will not be freed.
+ *
+ * pte_offset_map_maywrite_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
+ * pte_offset_map_readonly_nolock(); but when successful, it also outputs the
+ * pmdval. In the may-write case, where the pte or pmd entries may be modified,
+ * this can help the caller recheck *pmd once the lock is taken. In some
+ * cases we can pass NULL to pmdvalp: either the mmap_lock for write, or
+ * pte_same() check on contents, is also enough.
+ *
  * Note that free_pgtables(), used after unmapping detached vmas, or when
  * exiting the whole mm, does not take page table lock before freeing a page
  * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
-- 
2.20.1




* [PATCH 02/14] arm: adjust_pte() use pte_offset_map_maywrite_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
  2024-08-21  8:18 ` [PATCH 01/14] mm: pgtable: " Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  2024-08-21  8:18 ` [PATCH 03/14] powerpc: assert_pte_locked() use pte_offset_map_readonly_nolock() Qi Zheng
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In do_adjust_pte(), we may modify the pte entry. At this time, the write
lock of mmap_lock is not held, and the pte_same() check is not performed
after the PTL is held. The corresponding pmd entry may have been modified
concurrently. Therefore, in order to ensure the stability of the pmd
entry, use pte_offset_map_maywrite_nolock() to replace
pte_offset_map_nolock(), and do a pmd_same() check after holding the PTL.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/arm/mm/fault-armv.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 831793cd6ff94..5371920ec0550 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -94,6 +94,7 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	pmd_t pmdval;
 	int ret;
 
 	pgd = pgd_offset(vma->vm_mm, address);
@@ -112,16 +113,22 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	if (pmd_none_or_clear_bad(pmd))
 		return 0;
 
+again:
 	/*
 	 * This is called while another page table is mapped, so we
 	 * must use the nested version.  This also means we need to
 	 * open-code the spin-locking.
 	 */
-	pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
+	pte = pte_offset_map_maywrite_nolock(vma->vm_mm, pmd, address, &pmdval, &ptl);
 	if (!pte)
 		return 0;
 
 	do_pte_lock(ptl);
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+		do_pte_unlock(ptl);
+		pte_unmap(pte);
+		goto again;
+	}
 
 	ret = do_adjust_pte(vma, address, pfn, pte);
 
-- 
2.20.1




* [PATCH 03/14] powerpc: assert_pte_locked() use pte_offset_map_readonly_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
  2024-08-21  8:18 ` [PATCH 01/14] mm: pgtable: " Qi Zheng
  2024-08-21  8:18 ` [PATCH 02/14] arm: adjust_pte() use pte_offset_map_maywrite_nolock() Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  2024-08-21  8:18 ` [PATCH 04/14] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In assert_pte_locked(), we just get the ptl and assert that it is already
held, so convert it to using pte_offset_map_readonly_nolock().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/powerpc/mm/pgtable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 7316396e452d8..ae4d236b5cef7 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -398,7 +398,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	 */
 	if (pmd_none(*pmd))
 		return;
-	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+	pte = pte_offset_map_readonly_nolock(mm, pmd, addr, &ptl);
 	BUG_ON(!pte);
 	assert_spin_locked(ptl);
 	pte_unmap(pte);
-- 
2.20.1




* [PATCH 04/14] mm: filemap: filemap_fault_recheck_pte_none() use pte_offset_map_readonly_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
                   ` (2 preceding siblings ...)
  2024-08-21  8:18 ` [PATCH 03/14] powerpc: assert_pte_locked() use pte_offset_map_readonly_nolock() Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  2024-08-21  8:18 ` [PATCH 05/14] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In filemap_fault_recheck_pte_none(), we just do a pte_none() check, so
convert it to using pte_offset_map_readonly_nolock().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/filemap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index d87c858465962..491eb92d6db1f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3228,8 +3228,8 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
 	if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
 		return 0;
 
-	ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, vmf->address,
-				     &vmf->ptl);
+	ptep = pte_offset_map_readonly_nolock(vma->vm_mm, vmf->pmd,
+					      vmf->address, &vmf->ptl);
 	if (unlikely(!ptep))
 		return VM_FAULT_NOPAGE;
 
-- 
2.20.1




* [PATCH 05/14] mm: khugepaged: __collapse_huge_page_swapin() use pte_offset_map_readonly_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
                   ` (3 preceding siblings ...)
  2024-08-21  8:18 ` [PATCH 04/14] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  2024-08-21  8:18 ` [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock() Qi Zheng
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In __collapse_huge_page_swapin(), we only take the ptl for the pte_same()
check in do_swap_page(). In other places, we directly use
pte_offset_map_lock(), so convert it to using
pte_offset_map_readonly_nolock().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/khugepaged.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cdd1d8655a76b..26c083c59f03f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1009,7 +1009,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		};
 
 		if (!pte++) {
-			pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
+			/*
+			 * Here the ptl is only used to check pte_same() in
+			 * do_swap_page(), so readonly version is enough.
+			 */
+			pte = pte_offset_map_readonly_nolock(mm, pmd, address, &ptl);
 			if (!pte) {
 				mmap_read_unlock(mm);
 				result = SCAN_PMD_NULL;
-- 
2.20.1




* [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
                   ` (4 preceding siblings ...)
  2024-08-21  8:18 ` [PATCH 05/14] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  2024-08-21  9:17   ` LEROY Christophe
  2024-08-21  8:18 ` [PATCH 07/14] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In handle_pte_fault(), we may modify the vmf->pte after acquiring the
vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
since we already do the pte_same() check, there is no need to get the
pmdval to do a pmd_same() check; just pass NULL to the pmdvalp parameter.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/memory.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 93c0c25433d02..d3378e98faf13 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 * pmd by anon khugepaged, since that takes mmap_lock in write
 		 * mode; but shmem or file collapse to THP could still morph
 		 * it into a huge pmd: just retry later if so.
+		 *
+		 * Use the maywrite version to indicate that vmf->pte will be
+		 * modified, but since we will use pte_same() to detect the
+		 * change of the pte entry, there is no need to get pmdval.
 		 */
-		vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
-						 vmf->address, &vmf->ptl);
+		vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
+							  vmf->pmd, vmf->address,
+							  NULL, &vmf->ptl);
 		if (unlikely(!vmf->pte))
 			return 0;
 		vmf->orig_pte = ptep_get_lockless(vmf->pte);
-- 
2.20.1




* [PATCH 07/14] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_maywrite_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
                   ` (5 preceding siblings ...)
  2024-08-21  8:18 ` [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock() Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  2024-08-21  8:18 ` [PATCH 08/14] mm: copy_pte_range() " Qi Zheng
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
acquiring the ptl, so convert it to using
pte_offset_map_maywrite_nolock(). At this time, the write lock of
mmap_lock is not held, and the pte_same() check is not performed after the
PTL is held. So we should get pgt_pmd and do a pmd_same() check after the
ptl is held.

For the case where the ptl is released first and then the pml is acquired,
the PTE page may have been freed, so we must do a pmd_same() check before
reacquiring the ptl.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/khugepaged.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 26c083c59f03f..8fcad0b368a08 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1602,7 +1602,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
 		pml = pmd_lock(mm, pmd);
 
-	start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
+	start_pte = pte_offset_map_maywrite_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
 	if (!start_pte)		/* mmap_lock + page lock should prevent this */
 		goto abort;
 	if (!pml)
@@ -1610,6 +1610,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	else if (ptl != pml)
 		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
+	if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
+		goto abort;
+
 	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
 	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
@@ -1655,6 +1658,16 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	/* step 4: remove empty page table */
 	if (!pml) {
 		pml = pmd_lock(mm, pmd);
+		/*
+		 * We called pte_unmap() and released the ptl before acquiring
+		 * the pml, which means we left the RCU critical section, so the
+		 * PTE page may have been freed. Thus we must do a pmd_same()
+		 * check before reacquiring the ptl.
+		 */
+		if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+			spin_unlock(pml);
+			goto pmd_change;
+		}
 		if (ptl != pml)
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 	}
@@ -1686,6 +1699,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		pte_unmap_unlock(start_pte, ptl);
 	if (pml && pml != ptl)
 		spin_unlock(pml);
+pmd_change:
 	if (notified)
 		mmu_notifier_invalidate_range_end(&range);
 drop_folio:
-- 
2.20.1




* [PATCH 08/14] mm: copy_pte_range() use pte_offset_map_maywrite_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
                   ` (6 preceding siblings ...)
  2024-08-21  8:18 ` [PATCH 07/14] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  2024-08-22  9:26   ` kernel test robot
  2024-08-21  8:18 ` [PATCH 09/14] mm: mremap: move_ptes() " Qi Zheng
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In copy_pte_range(), we may modify the src_pte entry after holding the
src_ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
since we already hold the write lock of mmap_lock, there is no need to get
the pmdval to do a pmd_same() check; just pass NULL to the pmdvalp
parameter.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/memory.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index d3378e98faf13..3016b3bf0c3b0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1083,6 +1083,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	struct mm_struct *src_mm = src_vma->vm_mm;
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
+	pmd_t pmdval;
 	pte_t ptent;
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress, max_nr, ret = 0;
@@ -1108,7 +1109,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		ret = -ENOMEM;
 		goto out;
 	}
-	src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
+	src_pte = pte_offset_map_maywrite_nolock(src_mm, src_pmd, addr, NULL,
+						 &src_ptl);
 	if (!src_pte) {
 		pte_unmap_unlock(dst_pte, dst_ptl);
 		/* ret == 0 */
-- 
2.20.1




* [PATCH 09/14] mm: mremap: move_ptes() use pte_offset_map_maywrite_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
                   ` (7 preceding siblings ...)
  2024-08-21  8:18 ` [PATCH 08/14] mm: copy_pte_range() " Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  2024-08-21  8:18 ` [PATCH 10/14] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so
convert it to using pte_offset_map_maywrite_nolock(). But since we already
hold the exclusive mmap_lock, there is no need to get the pmdval to do a
pmd_same() check; just pass NULL to the pmdvalp parameter.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/mremap.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index e7ae140fc6409..33a0ccf79c32d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -175,7 +175,12 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		err = -EAGAIN;
 		goto out;
 	}
-	new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
+	/*
+	 * Use the maywrite version to indicate that new_pte will be modified,
+	 * but since we hold the exclusive mmap_lock, there is no need to
+	 * recheck pmd_same() after acquiring the new_ptl.
+	 */
+	new_pte = pte_offset_map_maywrite_nolock(mm, new_pmd, new_addr, NULL, &new_ptl);
 	if (!new_pte) {
 		pte_unmap_unlock(old_pte, old_ptl);
 		err = -EAGAIN;
-- 
2.20.1




* [PATCH 10/14] mm: page_vma_mapped_walk: map_pte() use pte_offset_map_maywrite_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
                   ` (8 preceding siblings ...)
  2024-08-21  8:18 ` [PATCH 09/14] mm: mremap: move_ptes() " Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  2024-08-21  8:18 ` [PATCH 11/14] mm: userfaultfd: move_pages_pte() " Qi Zheng
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In the caller of map_pte(), we may modify pvmw->pte after acquiring
pvmw->ptl, so convert it to using pte_offset_map_maywrite_nolock(). At
this time, the write lock of mmap_lock is not held, and the pte_same()
check is not performed after pvmw->ptl is held, so we should get the
pmdval and do a pmd_same() check to ensure the stability of pvmw->pmd.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/page_vma_mapped.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae5cc42aa2087..da806f3a5047d 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -13,9 +13,11 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
 	return false;
 }
 
-static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
+static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
+		    spinlock_t **ptlp)
 {
 	pte_t ptent;
+	pmd_t pmdval;
 
 	if (pvmw->flags & PVMW_SYNC) {
 		/* Use the stricter lookup */
@@ -25,6 +27,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
 		return !!pvmw->pte;
 	}
 
+again:
 	/*
 	 * It is important to return the ptl corresponding to pte,
 	 * in case *pvmw->pmd changes underneath us; so we need to
@@ -32,10 +35,11 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
 	 * proceeds to loop over next ptes, and finds a match later.
 	 * Though, in most cases, page lock already protects this.
 	 */
-	pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
-					  pvmw->address, ptlp);
+	pvmw->pte = pte_offset_map_maywrite_nolock(pvmw->vma->vm_mm, pvmw->pmd,
+						   pvmw->address, &pmdval, ptlp);
 	if (!pvmw->pte)
 		return false;
+	*pmdvalp = pmdval;
 
 	ptent = ptep_get(pvmw->pte);
 
@@ -69,6 +73,12 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
 	}
 	pvmw->ptl = *ptlp;
 	spin_lock(pvmw->ptl);
+
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pvmw->pmd)))) {
+		spin_unlock(pvmw->ptl);
+		goto again;
+	}
+
 	return true;
 }
 
@@ -278,7 +288,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			step_forward(pvmw, PMD_SIZE);
 			continue;
 		}
-		if (!map_pte(pvmw, &ptl)) {
+		if (!map_pte(pvmw, &pmde, &ptl)) {
 			if (!pvmw->pte)
 				goto restart;
 			goto next_pte;
@@ -307,6 +317,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		if (!pvmw->ptl) {
 			pvmw->ptl = ptl;
 			spin_lock(pvmw->ptl);
+			if (unlikely(!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))) {
+				pte_unmap_unlock(pvmw->pte, pvmw->ptl);
+				pvmw->ptl = NULL;
+				pvmw->pte = NULL;
+				goto restart;
+			}
 		}
 		goto this_pte;
 	} while (pvmw->address < end);
-- 
2.20.1




* [PATCH 11/14] mm: userfaultfd: move_pages_pte() use pte_offset_map_maywrite_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
                   ` (9 preceding siblings ...)
  2024-08-21  8:18 ` [PATCH 10/14] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  2024-08-21  8:18 ` [PATCH 12/14] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In move_pages_pte(), we may modify the dst_pte and src_pte after acquiring
the ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
since we already do the pte_same() check, there is no need to get the
pmdval to do a pmd_same() check; just pass NULL to the pmdvalp parameter.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/userfaultfd.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 6ef42d9cd482e..310289fad2b32 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1146,7 +1146,13 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 				src_addr, src_addr + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 retry:
-	dst_pte = pte_offset_map_nolock(mm, dst_pmd, dst_addr, &dst_ptl);
+	/*
+	 * Use the maywrite version to indicate that dst_pte will be modified,
+	 * but since we will use pte_same() to detect the change of the pte
+	 * entry, there is no need to get pmdval.
+	 */
+	dst_pte = pte_offset_map_maywrite_nolock(mm, dst_pmd, dst_addr, NULL,
+						 &dst_ptl);
 
 	/* Retry if a huge pmd materialized from under us */
 	if (unlikely(!dst_pte)) {
@@ -1154,7 +1160,9 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		goto out;
 	}
 
-	src_pte = pte_offset_map_nolock(mm, src_pmd, src_addr, &src_ptl);
+	/* same as dst_pte */
+	src_pte = pte_offset_map_maywrite_nolock(mm, src_pmd, src_addr, NULL,
+						 &src_ptl);
 
 	/*
 	 * We held the mmap_lock for reading so MADV_DONTNEED
-- 
2.20.1




* [PATCH 12/14] mm: multi-gen LRU: walk_pte_range() use pte_offset_map_maywrite_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
                   ` (10 preceding siblings ...)
  2024-08-21  8:18 ` [PATCH 11/14] mm: userfaultfd: move_pages_pte() " Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  2024-08-21  8:18 ` [PATCH 13/14] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
  2024-08-21  8:18 ` [PATCH 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_maywrite_nolock() Qi Zheng
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In walk_pte_range(), we may modify the pte entry after holding the ptl, so
convert it to use pte_offset_map_maywrite_nolock(). At this point, the
write lock of mmap_lock is not held and no pte_same() check is performed
after the ptl is held, so we should get the pmdval and do a pmd_same()
check to ensure the stability of the pmd entry.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5dc96a843466a..a6620211f138a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3396,8 +3396,10 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int old_gen, new_gen = lru_gen_from_seq(max_seq);
+	pmd_t pmdval;
 
-	pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+	pte = pte_offset_map_maywrite_nolock(args->mm, pmd, start & PMD_MASK,
+					     &pmdval, &ptl);
 	if (!pte)
 		return false;
 	if (!spin_trylock(ptl)) {
@@ -3405,6 +3407,11 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		return false;
 	}
 
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+		pte_unmap_unlock(pte, ptl);
+		return false;
+	}
+
 	arch_enter_lazy_mmu_mode();
 restart:
 	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
-- 
2.20.1




* [PATCH 13/14] mm: pgtable: remove pte_offset_map_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
                   ` (11 preceding siblings ...)
  2024-08-21  8:18 ` [PATCH 12/14] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  2024-08-21  8:18 ` [PATCH 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_maywrite_nolock() Qi Zheng
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

Now that there are no remaining users of pte_offset_map_nolock(), remove it.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 Documentation/mm/split_page_table_lock.rst |  3 ---
 include/linux/mm.h                         |  2 --
 mm/pgtable-generic.c                       | 21 ---------------------
 3 files changed, 26 deletions(-)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index f54f717ae8bdf..596b425fb28e6 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -16,9 +16,6 @@ There are helpers to lock/unlock a table and other accessor functions:
  - pte_offset_map_lock()
 	maps PTE and takes PTE table lock, returns pointer to PTE with
 	pointer to its PTE table lock, or returns NULL if no PTE table;
- - pte_offset_map_nolock()
-	maps PTE, returns pointer to PTE with pointer to its PTE table
-	lock (not taken), or returns NULL if no PTE table;
  - pte_offset_map_readonly_nolock()
 	maps PTE, returns pointer to PTE with pointer to its PTE table
 	lock (not taken), or returns NULL if no PTE table;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1fe0ceabcaf39..f7c207c3ab701 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2952,8 +2952,6 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
-			unsigned long addr, spinlock_t **ptlp);
 pte_t *pte_offset_map_readonly_nolock(struct mm_struct *mm, pmd_t *pmd,
 				      unsigned long addr, spinlock_t **ptlp);
 pte_t *pte_offset_map_maywrite_nolock(struct mm_struct *mm, pmd_t *pmd,
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 29d1fd6fd2963..9ba2e423bdb41 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -305,18 +305,6 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 	return NULL;
 }
 
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
-			     unsigned long addr, spinlock_t **ptlp)
-{
-	pmd_t pmdval;
-	pte_t *pte;
-
-	pte = __pte_offset_map(pmd, addr, &pmdval);
-	if (likely(pte))
-		*ptlp = pte_lockptr(mm, &pmdval);
-	return pte;
-}
-
 pte_t *pte_offset_map_readonly_nolock(struct mm_struct *mm, pmd_t *pmd,
 				      unsigned long addr, spinlock_t **ptlp)
 {
@@ -374,15 +362,6 @@ pte_t *pte_offset_map_maywrite_nolock(struct mm_struct *mm, pmd_t *pmd,
  * and disconnected table.  Until pte_unmap(pte) unmaps and rcu_read_unlock()s
  * afterwards.
  *
- * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
- * but when successful, it also outputs a pointer to the spinlock in ptlp - as
- * pte_offset_map_lock() does, but in this case without locking it.  This helps
- * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
- * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
- * pointer for the page table that it returns.  In principle, the caller should
- * recheck *pmd once the lock is taken; in practice, no callsite needs that -
- * either the mmap_lock for write, or pte_same() check on contents, is enough.
- *
  * pte_offset_map_readonly_nolock(mm, pmd, addr, ptlp), above, is like
  * pte_offset_map(); but when successful, it also outputs a pointer to the
  * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
-- 
2.20.1




* [PATCH 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_maywrite_nolock()
  2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
                   ` (12 preceding siblings ...)
  2024-08-21  8:18 ` [PATCH 13/14] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
@ 2024-08-21  8:18 ` Qi Zheng
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  8:18 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In retract_page_tables(), we may modify the pmd entry after acquiring the
pml and ptl, so we should also check whether the pmd entry is stable. Use
pte_offset_map_maywrite_nolock() + pmd_same() to do this.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/khugepaged.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8fcad0b368a08..821c840b5b593 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		spinlock_t *pml;
 		spinlock_t *ptl;
 		bool skipped_uffd = false;
+		pte_t *pte;
 
 		/*
 		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
@@ -1756,11 +1757,25 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 					addr, addr + HPAGE_PMD_SIZE);
 		mmu_notifier_invalidate_range_start(&range);
 
+		pte = pte_offset_map_maywrite_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
+		if (!pte) {
+			mmu_notifier_invalidate_range_end(&range);
+			continue;
+		}
+
 		pml = pmd_lock(mm, pmd);
-		ptl = pte_lockptr(mm, pmd);
 		if (ptl != pml)
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
+		if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+			pte_unmap_unlock(pte, ptl);
+			if (ptl != pml)
+				spin_unlock(pml);
+			mmu_notifier_invalidate_range_end(&range);
+			continue;
+		}
+		pte_unmap(pte);
+
 		/*
 		 * Huge page lock is still held, so normally the page table
 		 * must remain empty; and we have already skipped anon_vma
-- 
2.20.1




* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
  2024-08-21  8:18 ` [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock() Qi Zheng
@ 2024-08-21  9:17   ` LEROY Christophe
  2024-08-21  9:24     ` Qi Zheng
  0 siblings, 1 reply; 26+ messages in thread
From: LEROY Christophe @ 2024-08-21  9:17 UTC (permalink / raw)
  To: Qi Zheng, david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev



Le 21/08/2024 à 10:18, Qi Zheng a écrit :
> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
> since we already do the pte_same() check, so there is no need to get
> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>   mm/memory.c | 9 +++++++--
>   1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 93c0c25433d02..d3378e98faf13 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>   		 * pmd by anon khugepaged, since that takes mmap_lock in write
>   		 * mode; but shmem or file collapse to THP could still morph
>   		 * it into a huge pmd: just retry later if so.
> +		 *
> +		 * Use the maywrite version to indicate that vmf->pte will be
> +		 * modified, but since we will use pte_same() to detect the
> +		 * change of the pte entry, there is no need to get pmdval.
>   		 */
> -		vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
> -						 vmf->address, &vmf->ptl);
> +		vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
> +							  vmf->pmd, vmf->address,
> +							  NULL, &vmf->ptl);

This might be a demonstration that the function name is becoming too long.

Can you find shorter names?

>   		if (unlikely(!vmf->pte))
>   			return 0;
>   		vmf->orig_pte = ptep_get_lockless(vmf->pte);


* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
  2024-08-21  9:17   ` LEROY Christophe
@ 2024-08-21  9:24     ` Qi Zheng
  2024-08-21  9:41       ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  9:24 UTC (permalink / raw)
  To: LEROY Christophe
  Cc: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



On 2024/8/21 17:17, LEROY Christophe wrote:
> 
> 
> Le 21/08/2024 à 10:18, Qi Zheng a écrit :
>> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
>> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
>> since we already do the pte_same() check, so there is no need to get
>> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>    mm/memory.c | 9 +++++++--
>>    1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 93c0c25433d02..d3378e98faf13 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>    		 * pmd by anon khugepaged, since that takes mmap_lock in write
>>    		 * mode; but shmem or file collapse to THP could still morph
>>    		 * it into a huge pmd: just retry later if so.
>> +		 *
>> +		 * Use the maywrite version to indicate that vmf->pte will be
>> +		 * modified, but since we will use pte_same() to detect the
>> +		 * change of the pte entry, there is no need to get pmdval.
>>    		 */
>> -		vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>> -						 vmf->address, &vmf->ptl);
>> +		vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>> +							  vmf->pmd, vmf->address,
>> +							  NULL, &vmf->ptl);
> 
> This might be the demonstration that the function name is becoming too long.
> 
> Can you find shorter names ?

Maybe use abbreviations?

pte_offset_map_ro_nolock()
pte_offset_map_rw_nolock()

> 
>>    		if (unlikely(!vmf->pte))
>>    			return 0;
>>    		vmf->orig_pte = ptep_get_lockless(vmf->pte);



* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
  2024-08-21  9:24     ` Qi Zheng
@ 2024-08-21  9:41       ` David Hildenbrand
  2024-08-21  9:51         ` Qi Zheng
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2024-08-21  9:41 UTC (permalink / raw)
  To: Qi Zheng, LEROY Christophe
  Cc: hughd, willy, muchun.song, vbabka, akpm, rppt, vishal.moola,
	peterx, ryan.roberts, linux-kernel, linux-mm, linux-arm-kernel,
	linuxppc-dev

On 21.08.24 11:24, Qi Zheng wrote:
> 
> 
> On 2024/8/21 17:17, LEROY Christophe wrote:
>>
>>
>> Le 21/08/2024 à 10:18, Qi Zheng a écrit :
>>> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
>>> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
>>> since we already do the pte_same() check, so there is no need to get
>>> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
>>>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>>     mm/memory.c | 9 +++++++--
>>>     1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 93c0c25433d02..d3378e98faf13 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>>     		 * pmd by anon khugepaged, since that takes mmap_lock in write
>>>     		 * mode; but shmem or file collapse to THP could still morph
>>>     		 * it into a huge pmd: just retry later if so.
>>> +		 *
>>> +		 * Use the maywrite version to indicate that vmf->pte will be
>>> +		 * modified, but since we will use pte_same() to detect the
>>> +		 * change of the pte entry, there is no need to get pmdval.
>>>     		 */
>>> -		vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>> -						 vmf->address, &vmf->ptl);
>>> +		vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>> +							  vmf->pmd, vmf->address,
>>> +							  NULL, &vmf->ptl);

I think we discussed that passing NULL should be forbidden for that 
function.

>>
>> This might be the demonstration that the function name is becoming too long.
>>
>> Can you find shorter names ?
> 
> Maybe use abbreviations?
> 
> pte_offset_map_ro_nolock()
> pte_offset_map_rw_nolock()

At least the "ro" is better, but "rw" does not express the "maywrite" -- 
because without taking the lock we are not allowed to write. But maybe 
"rw" is good enough for that if we document it properly.

And you can use up to 100 characters, if it helps readability.

-- 
Cheers,

David / dhildenb




* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
  2024-08-21  9:41       ` David Hildenbrand
@ 2024-08-21  9:51         ` Qi Zheng
  2024-08-21  9:53           ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-08-21  9:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: LEROY Christophe, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



On 2024/8/21 17:41, David Hildenbrand wrote:
> On 21.08.24 11:24, Qi Zheng wrote:
>>
>>
>> On 2024/8/21 17:17, LEROY Christophe wrote:
>>>
>>>
>>> Le 21/08/2024 à 10:18, Qi Zheng a écrit :
>>>> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
>>>> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
>>>> since we already do the pte_same() check, so there is no need to get
>>>> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>>     mm/memory.c | 9 +++++++--
>>>>     1 file changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 93c0c25433d02..d3378e98faf13 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct 
>>>> vm_fault *vmf)
>>>>              * pmd by anon khugepaged, since that takes mmap_lock in 
>>>> write
>>>>              * mode; but shmem or file collapse to THP could still 
>>>> morph
>>>>              * it into a huge pmd: just retry later if so.
>>>> +         *
>>>> +         * Use the maywrite version to indicate that vmf->pte will be
>>>> +         * modified, but since we will use pte_same() to detect the
>>>> +         * change of the pte entry, there is no need to get pmdval.
>>>>              */
>>>> -        vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>>> -                         vmf->address, &vmf->ptl);
>>>> +        vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>> +                              vmf->pmd, vmf->address,
>>>> +                              NULL, &vmf->ptl);
> 
> I think we discussed that passing NULL should be forbidden for that 
> function.

Yes, but for some maywrite cases, there is no need to get the pmdval to
do a pmd_same() check. So I passed NULL and added a comment to explain
this.

> 
>>>
>>> This might be the demonstration that the function name is becoming 
>>> too long.
>>>
>>> Can you find shorter names ?
>>
>> Maybe use abbreviations?
>>
>> pte_offset_map_ro_nolock()
>> pte_offset_map_rw_nolock()
> 
> At least the "ro" is better, but "rw" does not express the "maywrite" -- 
> because without taking the lock we are not allowed to write. But maybe 
> "rw" is good enough for that if we document it properly.

OK, will change to it in the next version.

> 
> And you can use up to 100 characters, if it helps readability

Got it.

> 



* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
  2024-08-21  9:51         ` Qi Zheng
@ 2024-08-21  9:53           ` David Hildenbrand
  2024-08-21 10:03             ` Qi Zheng
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2024-08-21  9:53 UTC (permalink / raw)
  To: Qi Zheng
  Cc: LEROY Christophe, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev

On 21.08.24 11:51, Qi Zheng wrote:
> 
> 
> On 2024/8/21 17:41, David Hildenbrand wrote:
>> On 21.08.24 11:24, Qi Zheng wrote:
>>>
>>>
>>> On 2024/8/21 17:17, LEROY Christophe wrote:
>>>>
>>>>
>>>> Le 21/08/2024 à 10:18, Qi Zheng a écrit :
>>>>> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
>>>>> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
>>>>> since we already do the pte_same() check, so there is no need to get
>>>>> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
>>>>>
>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>> ---
>>>>>      mm/memory.c | 9 +++++++--
>>>>>      1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 93c0c25433d02..d3378e98faf13 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct
>>>>> vm_fault *vmf)
>>>>>               * pmd by anon khugepaged, since that takes mmap_lock in
>>>>> write
>>>>>               * mode; but shmem or file collapse to THP could still
>>>>> morph
>>>>>               * it into a huge pmd: just retry later if so.
>>>>> +         *
>>>>> +         * Use the maywrite version to indicate that vmf->pte will be
>>>>> +         * modified, but since we will use pte_same() to detect the
>>>>> +         * change of the pte entry, there is no need to get pmdval.
>>>>>               */
>>>>> -        vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>>>> -                         vmf->address, &vmf->ptl);
>>>>> +        vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>>> +                              vmf->pmd, vmf->address,
>>>>> +                              NULL, &vmf->ptl);
>>
>> I think we discussed that passing NULL should be forbidden for that
>> function.
> 
> Yes, but for some maywrite case, there is no need to get pmdval to
> do pmd_same() check. So I passed NULL and added a comment to
> explain this.

I wonder if it's better to pass a dummy variable instead. One has to
think harder about why that is required compared to blindly passing "NULL" :)

-- 
Cheers,

David / dhildenb




* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
  2024-08-21  9:53           ` David Hildenbrand
@ 2024-08-21 10:03             ` Qi Zheng
  2024-08-22  9:29               ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 10:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: LEROY Christophe, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



On 2024/8/21 17:53, David Hildenbrand wrote:
> On 21.08.24 11:51, Qi Zheng wrote:
>>
>>
>> On 2024/8/21 17:41, David Hildenbrand wrote:
>>> On 21.08.24 11:24, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2024/8/21 17:17, LEROY Christophe wrote:
>>>>>
>>>>>
>>>>> Le 21/08/2024 à 10:18, Qi Zheng a écrit :
>>>>>> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
>>>>>> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock(). 
>>>>>> But
>>>>>> since we already do the pte_same() check, so there is no need to get
>>>>>> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
>>>>>>
>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>> ---
>>>>>>      mm/memory.c | 9 +++++++--
>>>>>>      1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 93c0c25433d02..d3378e98faf13 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct
>>>>>> vm_fault *vmf)
>>>>>>               * pmd by anon khugepaged, since that takes mmap_lock in
>>>>>> write
>>>>>>               * mode; but shmem or file collapse to THP could still
>>>>>> morph
>>>>>>               * it into a huge pmd: just retry later if so.
>>>>>> +         *
>>>>>> +         * Use the maywrite version to indicate that vmf->pte 
>>>>>> will be
>>>>>> +         * modified, but since we will use pte_same() to detect the
>>>>>> +         * change of the pte entry, there is no need to get pmdval.
>>>>>>               */
>>>>>> -        vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>>>>> -                         vmf->address, &vmf->ptl);
>>>>>> +        vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>>>> +                              vmf->pmd, vmf->address,
>>>>>> +                              NULL, &vmf->ptl);
>>>
>>> I think we discussed that passing NULL should be forbidden for that
>>> function.
>>
>> Yes, but for some maywrite case, there is no need to get pmdval to
>> do pmd_same() check. So I passed NULL and added a comment to
>> explain this.
> 
> I wonder if it's better to pass a dummy variable instead. One has to 
> think harder why that is required compared to blindly passing "NULL" :)

You are afraid that subsequent callers will abuse this function, right?
My initial concern was that this would add a useless local variable, but
perhaps that is not a big deal.

Both are fine for me. ;)

> 



* Re: [PATCH 08/14] mm: copy_pte_range() use pte_offset_map_maywrite_nolock()
  2024-08-21  8:18 ` [PATCH 08/14] mm: copy_pte_range() " Qi Zheng
@ 2024-08-22  9:26   ` kernel test robot
  0 siblings, 0 replies; 26+ messages in thread
From: kernel test robot @ 2024-08-22  9:26 UTC (permalink / raw)
  To: Qi Zheng, david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts
  Cc: oe-kbuild-all, linux-kernel, linux-mm, linux-arm-kernel,
	linuxppc-dev, Qi Zheng

Hi Qi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on powerpc/next powerpc/fixes linus/master v6.11-rc4 next-20240822]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/mm-pgtable-introduce-pte_offset_map_-readonly-maywrite-_nolock/20240821-162312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/05c311498fc8e7e9b2143c7b5fef6dc624cfc49f.1724226076.git.zhengqi.arch%40bytedance.com
patch subject: [PATCH 08/14] mm: copy_pte_range() use pte_offset_map_maywrite_nolock()
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20240822/202408221703.ioeASthY-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240822/202408221703.ioeASthY-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408221703.ioeASthY-lkp@intel.com/

All warnings (new ones prefixed by >>):

   mm/memory.c: In function 'copy_pte_range':
>> mm/memory.c:1086:15: warning: unused variable 'pmdval' [-Wunused-variable]
    1086 |         pmd_t pmdval;
         |               ^~~~~~


vim +/pmdval +1086 mm/memory.c

  1076	
  1077	static int
  1078	copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
  1079		       pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
  1080		       unsigned long end)
  1081	{
  1082		struct mm_struct *dst_mm = dst_vma->vm_mm;
  1083		struct mm_struct *src_mm = src_vma->vm_mm;
  1084		pte_t *orig_src_pte, *orig_dst_pte;
  1085		pte_t *src_pte, *dst_pte;
> 1086		pmd_t pmdval;
  1087		pte_t ptent;
  1088		spinlock_t *src_ptl, *dst_ptl;
  1089		int progress, max_nr, ret = 0;
  1090		int rss[NR_MM_COUNTERS];
  1091		swp_entry_t entry = (swp_entry_t){0};
  1092		struct folio *prealloc = NULL;
  1093		int nr;
  1094	
  1095	again:
  1096		progress = 0;
  1097		init_rss_vec(rss);
  1098	
  1099		/*
  1100		 * copy_pmd_range()'s prior pmd_none_or_clear_bad(src_pmd), and the
  1101		 * error handling here, assume that exclusive mmap_lock on dst and src
  1102		 * protects anon from unexpected THP transitions; with shmem and file
  1103		 * protected by mmap_lock-less collapse skipping areas with anon_vma
  1104		 * (whereas vma_needs_copy() skips areas without anon_vma).  A rework
  1105		 * can remove such assumptions later, but this is good enough for now.
  1106		 */
  1107		dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
  1108		if (!dst_pte) {
  1109			ret = -ENOMEM;
  1110			goto out;
  1111		}
  1112		src_pte = pte_offset_map_maywrite_nolock(src_mm, src_pmd, addr, NULL,
  1113							 &src_ptl);
  1114		if (!src_pte) {
  1115			pte_unmap_unlock(dst_pte, dst_ptl);
  1116			/* ret == 0 */
  1117			goto out;
  1118		}
  1119		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
  1120		orig_src_pte = src_pte;
  1121		orig_dst_pte = dst_pte;
  1122		arch_enter_lazy_mmu_mode();
  1123	
  1124		do {
  1125			nr = 1;
  1126	
  1127			/*
  1128			 * We are holding two locks at this point - either of them
  1129			 * could generate latencies in another task on another CPU.
  1130			 */
  1131			if (progress >= 32) {
  1132				progress = 0;
  1133				if (need_resched() ||
  1134				    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
  1135					break;
  1136			}
  1137			ptent = ptep_get(src_pte);
  1138			if (pte_none(ptent)) {
  1139				progress++;
  1140				continue;
  1141			}
  1142			if (unlikely(!pte_present(ptent))) {
  1143				ret = copy_nonpresent_pte(dst_mm, src_mm,
  1144							  dst_pte, src_pte,
  1145							  dst_vma, src_vma,
  1146							  addr, rss);
  1147				if (ret == -EIO) {
  1148					entry = pte_to_swp_entry(ptep_get(src_pte));
  1149					break;
  1150				} else if (ret == -EBUSY) {
  1151					break;
  1152				} else if (!ret) {
  1153					progress += 8;
  1154					continue;
  1155				}
  1156				ptent = ptep_get(src_pte);
  1157				VM_WARN_ON_ONCE(!pte_present(ptent));
  1158	
  1159				/*
  1160				 * Device exclusive entry restored, continue by copying
  1161				 * the now present pte.
  1162				 */
  1163				WARN_ON_ONCE(ret != -ENOENT);
  1164			}
  1165			/* copy_present_ptes() will clear `*prealloc' if consumed */
  1166			max_nr = (end - addr) / PAGE_SIZE;
  1167			ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
  1168						ptent, addr, max_nr, rss, &prealloc);
  1169			/*
  1170			 * If we need a pre-allocated page for this pte, drop the
  1171			 * locks, allocate, and try again.
  1172			 */
  1173			if (unlikely(ret == -EAGAIN))
  1174				break;
  1175			if (unlikely(prealloc)) {
  1176				/*
  1177				 * pre-alloc page cannot be reused by next time so as
  1178				 * to strictly follow mempolicy (e.g., alloc_page_vma()
  1179				 * will allocate page according to address).  This
  1180				 * could only happen if one pinned pte changed.
  1181				 */
  1182				folio_put(prealloc);
  1183				prealloc = NULL;
  1184			}
  1185			nr = ret;
  1186			progress += 8 * nr;
  1187		} while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr,
  1188			 addr != end);
  1189	
  1190		arch_leave_lazy_mmu_mode();
  1191		pte_unmap_unlock(orig_src_pte, src_ptl);
  1192		add_mm_rss_vec(dst_mm, rss);
  1193		pte_unmap_unlock(orig_dst_pte, dst_ptl);
  1194		cond_resched();
  1195	
  1196		if (ret == -EIO) {
  1197			VM_WARN_ON_ONCE(!entry.val);
  1198			if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
  1199				ret = -ENOMEM;
  1200				goto out;
  1201			}
  1202			entry.val = 0;
  1203		} else if (ret == -EBUSY) {
  1204			goto out;
  1205		} else if (ret ==  -EAGAIN) {
  1206			prealloc = folio_prealloc(src_mm, src_vma, addr, false);
  1207			if (!prealloc)
  1208				return -ENOMEM;
  1209		} else if (ret < 0) {
  1210			VM_WARN_ON_ONCE(1);
  1211		}
  1212	
  1213		/* We've captured and resolved the error. Reset, try again. */
  1214		ret = 0;
  1215	
  1216		if (addr != end)
  1217			goto again;
  1218	out:
  1219		if (unlikely(prealloc))
  1220			folio_put(prealloc);
  1221		return ret;
  1222	}
  1223	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
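The -EAGAIN handling in the copy_pte_range() excerpt above follows a drop-locks, allocate, retry shape: the loop cannot sleep under the page-table locks, so it bails out, the caller allocates, and control jumps back to `again`. A minimal user-space sketch of that pattern (all toy_* names are illustrative stand-ins, not kernel API):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy model of copy_pte_range()'s -EAGAIN flow: the copy loop runs
 * under the page-table locks and must not sleep, so when it needs a
 * fresh page it bails out with -EAGAIN; the caller drops the locks,
 * performs a sleeping allocation (as folio_prealloc() does), and
 * re-enters the loop at the same address.
 */
#define TOY_EAGAIN 11

/* One "locked" copy attempt: fails unless a page was preallocated. */
static int toy_copy_range(void **prealloc, int *copied)
{
	if (*prealloc == NULL)
		return -TOY_EAGAIN;	/* cannot allocate while locks are held */
	free(*prealloc);		/* page consumed by the copy */
	*prealloc = NULL;
	*copied = 1;
	return 0;
}

/* The caller's retry loop, mirroring the `goto again` structure. */
static int toy_copy_with_retry(int *passes)
{
	void *prealloc = NULL;
	int copied = 0, ret;

again:
	(*passes)++;
	ret = toy_copy_range(&prealloc, &copied);
	if (ret == -TOY_EAGAIN) {
		prealloc = malloc(64);	/* locks dropped: sleeping is allowed */
		if (!prealloc)
			return -1;	/* maps to the -ENOMEM exit above */
		goto again;
	}
	return ret;
}
```

In this model the first pass fails for lack of a page and the second succeeds, matching the two trips through the `again:` label the real code makes whenever a prealloc folio is needed.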


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
  2024-08-21 10:03             ` Qi Zheng
@ 2024-08-22  9:29               ` David Hildenbrand
  2024-08-22 12:17                 ` Qi Zheng
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2024-08-22  9:29 UTC (permalink / raw)
  To: Qi Zheng
  Cc: LEROY Christophe, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev

On 21.08.24 12:03, Qi Zheng wrote:
> 
> 
> On 2024/8/21 17:53, David Hildenbrand wrote:
>> On 21.08.24 11:51, Qi Zheng wrote:
>>>
>>>
>>> On 2024/8/21 17:41, David Hildenbrand wrote:
>>>> On 21.08.24 11:24, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 2024/8/21 17:17, LEROY Christophe wrote:
>>>>>>
>>>>>>
>>>>>> Le 21/08/2024 à 10:18, Qi Zheng a écrit :
>>>>>>> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
>>>>>>> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock().
>>>>>>> But
>>>>>>> since we already do the pte_same() check, so there is no need to get
>>>>>>> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
>>>>>>>
>>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>> ---
>>>>>>>       mm/memory.c | 9 +++++++--
>>>>>>>       1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>> index 93c0c25433d02..d3378e98faf13 100644
>>>>>>> --- a/mm/memory.c
>>>>>>> +++ b/mm/memory.c
>>>>>>> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct
>>>>>>> vm_fault *vmf)
>>>>>>>                * pmd by anon khugepaged, since that takes mmap_lock in
>>>>>>> write
>>>>>>>                * mode; but shmem or file collapse to THP could still
>>>>>>> morph
>>>>>>>                * it into a huge pmd: just retry later if so.
>>>>>>> +         *
>>>>>>> +         * Use the maywrite version to indicate that vmf->pte
>>>>>>> will be
>>>>>>> +         * modified, but since we will use pte_same() to detect the
>>>>>>> +         * change of the pte entry, there is no need to get pmdval.
>>>>>>>                */
>>>>>>> -        vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>>>>>> -                         vmf->address, &vmf->ptl);
>>>>>>> +        vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>>>>> +                              vmf->pmd, vmf->address,
>>>>>>> +                              NULL, &vmf->ptl);
>>>>
>>>> I think we discussed that passing NULL should be forbidden for that
>>>> function.
>>>
>>> Yes, but for some maywrite case, there is no need to get pmdval to
>>> do pmd_same() check. So I passed NULL and added a comment to
>>> explain this.
>>
>> I wonder if it's better to pass a dummy variable instead. One has to
>> think harder why that is required compared to blindly passing "NULL" :)
> 
> You are afraid that a subsequent caller will abuse this function, right?

Yes! "oh, I don't need a pmdval, why would I? let's just pass NULL, easy" :)

> My initial concern was that this would add a useless local variable, but
> perhaps that is not a big deal.

How many of these "special" instances do we have?

> 
> Both are fine for me. ;)

Also no strong opinion, but having to pass a variable makes you think 
what you are supposed to do with it and why it is not optional.

-- 
Cheers,

David / dhildenb
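For illustration, the contract being debated here, that a may-write caller must hand in storage for the pmd value so it can recheck *pmd once the PTL is held, could be sketched in user space as below; the toy_* names are stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins: in the kernel these are pmd_t, pte_t and the
 * page-table lock (PTL); everything here is purely illustrative. */
typedef unsigned long toy_pmd_t;
typedef unsigned long toy_pte_t;

static toy_pte_t toy_pte_page[4];

static int toy_pmd_same(toy_pmd_t a, toy_pmd_t b)
{
	return a == b;
}

/*
 * Sketch of the may-write contract: map the PTE page without taking
 * the PTL, but report the pmd value seen at map time so the caller
 * can recheck it after locking.  Passing NULL is the abuse discussed
 * above, so it is rejected (non-fatally) rather than accepted.
 */
static toy_pte_t *toy_map_maywrite_nolock(toy_pmd_t *pmd, unsigned long idx,
					  toy_pmd_t *pmdvalp)
{
	if (pmdvalp == NULL)	/* the kernel would warn here, not crash */
		return NULL;
	*pmdvalp = *pmd;
	return &toy_pte_page[idx];
}

/* Caller side: with the PTL held, verify the pmd entry did not move. */
static int toy_pmd_still_stable(toy_pmd_t pmdval, toy_pmd_t *pmd)
{
	/* ... spin_lock(ptl) would have happened here ... */
	return toy_pmd_same(pmdval, *pmd);	/* 0 means: unmap and retry */
}
```

A caller that can rely on pte_same(), as handle_pte_fault() does, never consults the reported value, which is exactly why a dummy variable rather than NULL keeps the interface honest.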




* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
  2024-08-22  9:29               ` David Hildenbrand
@ 2024-08-22 12:17                 ` Qi Zheng
  2024-08-22 12:19                   ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-08-22 12:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, LEROY Christophe, hughd, willy, muchun.song, vbabka,
	akpm, rppt, vishal.moola, peterx, ryan.roberts, linux-kernel,
	linux-mm, linux-arm-kernel, linuxppc-dev

Hi David,

On 2024/8/22 17:29, David Hildenbrand wrote:
> On 21.08.24 12:03, Qi Zheng wrote:
>>
>>

[...]

>>>>>>>> -        vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, 
>>>>>>>> vmf->pmd,
>>>>>>>> -                         vmf->address, &vmf->ptl);
>>>>>>>> +        vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>>>>>> +                              vmf->pmd, vmf->address,
>>>>>>>> +                              NULL, &vmf->ptl);
>>>>>
>>>>> I think we discussed that passing NULL should be forbidden for that
>>>>> function.
>>>>
>>>> Yes, but for some maywrite case, there is no need to get pmdval to
>>>> do pmd_same() check. So I passed NULL and added a comment to
>>>> explain this.
>>>
>>> I wonder if it's better to pass a dummy variable instead. One has to
>>> think harder why that is required compared to blindly passing "NULL" :)
>>
>> You are afraid that a subsequent caller will abuse this function, right?
> 
> Yes! "oh, I don't need a pmdval, why would I? let's just pass NULL, 
> easy" :)
> 
>> My initial concern was that this would add a useless local variable, but
>> perhaps that is not a big deal.
> 
> How many of these "special" instances do we have?

We have 5 such special instances.

> 
>>
>> Both are fine for me. ;)
> 
> Also no strong opinion, but having to pass a variable makes you think 
> what you are supposed to do with it and why it is not optional.

Yeah, I added 'BUG_ON(!pmdvalp);' in pte_offset_map_rw_nolock(), and
have updated the v2 version [1].

[1]. 
https://lore.kernel.org/lkml/cover.1724310149.git.zhengqi.arch@bytedance.com/

Thanks,
Qi

> 



* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
  2024-08-22 12:17                 ` Qi Zheng
@ 2024-08-22 12:19                   ` David Hildenbrand
  2024-08-22 12:22                     ` Qi Zheng
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2024-08-22 12:19 UTC (permalink / raw)
  To: Qi Zheng
  Cc: LEROY Christophe, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev

On 22.08.24 14:17, Qi Zheng wrote:
> Hi David,
> 
> On 2024/8/22 17:29, David Hildenbrand wrote:
>> On 21.08.24 12:03, Qi Zheng wrote:
>>>
>>>
> 
> [...]
> 
>>>>>>>>> -        vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm,
>>>>>>>>> vmf->pmd,
>>>>>>>>> -                         vmf->address, &vmf->ptl);
>>>>>>>>> +        vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>>>>>>> +                              vmf->pmd, vmf->address,
>>>>>>>>> +                              NULL, &vmf->ptl);
>>>>>>
>>>>>> I think we discussed that passing NULL should be forbidden for that
>>>>>> function.
>>>>>
>>>>> Yes, but for some maywrite case, there is no need to get pmdval to
>>>>> do pmd_same() check. So I passed NULL and added a comment to
>>>>> explain this.
>>>>
>>>> I wonder if it's better to pass a dummy variable instead. One has to
>>>> think harder why that is required compared to blindly passing "NULL" :)
>>>
>>> You are afraid that a subsequent caller will abuse this function, right?
>>
>> Yes! "oh, I don't need a pmdval, why would I? let's just pass NULL,
>> easy" :)
>>
>>> My initial concern was that this would add a useless local variable, but
>>> perhaps that is not a big deal.
>>
>> How many of these "special" instances do we have?
> 
> We have 5 such special instances.
> 
>>
>>>
>>> Both are fine for me. ;)
>>
>> Also no strong opinion, but having to pass a variable makes you think
>> what you are supposed to do with it and why it is not optional.
> 
> Yeah, I added 'BUG_ON(!pmdvalp);' in pte_offset_map_ro_nolock(), and
> have updated the v2 version [1].

No BUG_ON please :) VM_WARN_ON_ONCE() is good enough.

-- 
Cheers,

David / dhildenb
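The behavioural difference behind this request: BUG_ON() takes the whole machine down, whereas VM_WARN_ON_ONCE() logs the first violation and lets the caller fail gracefully. A rough user-space approximation (the TOY_ macro and toy_check_pmdvalp() are stand-ins, not the kernel definitions):

```c
#include <assert.h>
#include <stdio.h>

/* Rough user-space approximation of VM_WARN_ON_ONCE(): evaluate the
 * condition, log it the first time it is true, and hand the result
 * back so the caller can bail out instead of crashing.  Uses a GCC
 * statement expression, as the kernel macros do. */
#define TOY_VM_WARN_ON_ONCE(cond)				\
	({							\
		static int warned;				\
		int _c = !!(cond);				\
		if (_c && !warned) {				\
			warned = 1;				\
			fprintf(stderr, "WARNING (once): %s\n", #cond); \
		}						\
		_c;						\
	})

/* Hypothetical guard in the style the thread converged on: warn on a
 * NULL pmdvalp and refuse the mapping, rather than BUG_ON() panicking. */
static int toy_check_pmdvalp(const void *pmdvalp)
{
	if (TOY_VM_WARN_ON_ONCE(pmdvalp == NULL))
		return -1;
	return 0;
}
```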




* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
  2024-08-22 12:19                   ` David Hildenbrand
@ 2024-08-22 12:22                     ` Qi Zheng
  0 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-22 12:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: LEROY Christophe, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



On 2024/8/22 20:19, David Hildenbrand wrote:
> On 22.08.24 14:17, Qi Zheng wrote:
>> Hi David,
>>
>> On 2024/8/22 17:29, David Hildenbrand wrote:
>>> On 21.08.24 12:03, Qi Zheng wrote:
>>>>
>>>>
>>
>> [...]
>>
>>>>>>>>>> -        vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm,
>>>>>>>>>> vmf->pmd,
>>>>>>>>>> -                         vmf->address, &vmf->ptl);
>>>>>>>>>> +        vmf->pte = 
>>>>>>>>>> pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>>>>>>>> +                              vmf->pmd, vmf->address,
>>>>>>>>>> +                              NULL, &vmf->ptl);
>>>>>>>
>>>>>>> I think we discussed that passing NULL should be forbidden for that
>>>>>>> function.
>>>>>>
>>>>>> Yes, but for some maywrite case, there is no need to get pmdval to
>>>>>> do pmd_same() check. So I passed NULL and added a comment to
>>>>>> explain this.
>>>>>
>>>>> I wonder if it's better to pass a dummy variable instead. One has to
>>>>> think harder why that is required compared to blindly passing 
>>>>> "NULL" :)
>>>>
>>>> You are afraid that a subsequent caller will abuse this function, right?
>>>
>>> Yes! "oh, I don't need a pmdval, why would I? let's just pass NULL,
>>> easy" :)
>>>
>>>> My initial concern was that this would add a useless local variable, but
>>>> perhaps that is not a big deal.
>>>
>>> How many of these "special" instances do we have?
>>
>> We have 5 such special instances.
>>
>>>
>>>>
>>>> Both are fine for me. ;)
>>>
>>> Also no strong opinion, but having to pass a variable makes you think
>>> what you are supposed to do with it and why it is not optional.
>>
>> Yeah, I added 'BUG_ON(!pmdvalp);' in pte_offset_map_rw_nolock(), and
>> have updated the v2 version [1].
> 
> No BUG_ON please :) VM_WARN_ON_ONCE() is good enough.

Got it. Will do in the next version.

> 



end of thread, other threads:[~2024-08-22 12:22 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-21  8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
2024-08-21  8:18 ` [PATCH 01/14] mm: pgtable: " Qi Zheng
2024-08-21  8:18 ` [PATCH 02/14] arm: adjust_pte() use pte_offset_map_maywrite_nolock() Qi Zheng
2024-08-21  8:18 ` [PATCH 03/14] powerpc: assert_pte_locked() use pte_offset_map_readonly_nolock() Qi Zheng
2024-08-21  8:18 ` [PATCH 04/14] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
2024-08-21  8:18 ` [PATCH 05/14] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
2024-08-21  8:18 ` [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock() Qi Zheng
2024-08-21  9:17   ` LEROY Christophe
2024-08-21  9:24     ` Qi Zheng
2024-08-21  9:41       ` David Hildenbrand
2024-08-21  9:51         ` Qi Zheng
2024-08-21  9:53           ` David Hildenbrand
2024-08-21 10:03             ` Qi Zheng
2024-08-22  9:29               ` David Hildenbrand
2024-08-22 12:17                 ` Qi Zheng
2024-08-22 12:19                   ` David Hildenbrand
2024-08-22 12:22                     ` Qi Zheng
2024-08-21  8:18 ` [PATCH 07/14] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
2024-08-21  8:18 ` [PATCH 08/14] mm: copy_pte_range() " Qi Zheng
2024-08-22  9:26   ` kernel test robot
2024-08-21  8:18 ` [PATCH 09/14] mm: mremap: move_ptes() " Qi Zheng
2024-08-21  8:18 ` [PATCH 10/14] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
2024-08-21  8:18 ` [PATCH 11/14] mm: userfaultfd: move_pages_pte() " Qi Zheng
2024-08-21  8:18 ` [PATCH 12/14] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
2024-08-21  8:18 ` [PATCH 13/14] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
2024-08-21  8:18 ` [PATCH 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_maywrite_nolock() Qi Zheng
