* [PATCH 01/14] mm: pgtable: introduce pte_offset_map_{readonly|maywrite}_nolock()
2024-08-21 8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
@ 2024-08-21 8:18 ` Qi Zheng
2024-08-21 8:18 ` [PATCH 02/14] arm: adjust_pte() use pte_offset_map_maywrite_nolock() Qi Zheng
` (12 subsequent siblings)
13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 8:18 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:
1) After acquiring PTL, only read-only operations are performed on the PTE
page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified.
2) After acquiring PTL, the pte or pmd entries may be modified. At this
time, we need to ensure that the pmd entry has not been modified
concurrently.
To distinguish these two cases more clearly, this commit introduces two
new helper functions to replace pte_offset_map_nolock(). For case 1),
just rename it to pte_offset_map_readonly_nolock(). For case 2), in
addition to changing the name to pte_offset_map_maywrite_nolock(), also
output the pmdval on success. This helps the caller recheck *pmd once the
PTL is taken. In some cases we can pass NULL as pmdvalp: either holding
the mmap_lock for write, or a pte_same() check on contents, is also
enough to ensure that the pmd entry is stable.
Subsequent commits will convert callers of pte_offset_map_nolock() to
these two functions one by one, and finally delete it completely.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
Documentation/mm/split_page_table_lock.rst | 7 ++++
include/linux/mm.h | 5 +++
mm/pgtable-generic.c | 43 ++++++++++++++++++++++
3 files changed, 55 insertions(+)
diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c04..f54f717ae8bdf 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
- pte_offset_map_nolock()
maps PTE, returns pointer to PTE with pointer to its PTE table
lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_readonly_nolock()
+ maps PTE, returns pointer to PTE with pointer to its PTE table
+ lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_maywrite_nolock()
+ maps PTE, returns pointer to PTE with pointer to its PTE table
+ lock (not taken) and the value of its pmd entry, or returns NULL
+ if no PTE table;
- pte_offset_map()
maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
- pte_unmap()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00501f85f45f0..1fe0ceabcaf39 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2954,6 +2954,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_readonly_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_maywrite_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, pmd_t *pmdvalp,
+ spinlock_t **ptlp);
#define pte_unmap_unlock(pte, ptl) do { \
spin_unlock(ptl); \
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711ac..29d1fd6fd2963 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
return pte;
}
+pte_t *pte_offset_map_readonly_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp)
+{
+ pmd_t pmdval;
+ pte_t *pte;
+
+ pte = __pte_offset_map(pmd, addr, &pmdval);
+ if (likely(pte))
+ *ptlp = pte_lockptr(mm, &pmdval);
+ return pte;
+}
+
+pte_t *pte_offset_map_maywrite_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, pmd_t *pmdvalp,
+ spinlock_t **ptlp)
+{
+ pmd_t pmdval;
+ pte_t *pte;
+
+ pte = __pte_offset_map(pmd, addr, &pmdval);
+ if (likely(pte))
+ *ptlp = pte_lockptr(mm, &pmdval);
+ if (pmdvalp)
+ *pmdvalp = pmdval;
+ return pte;
+}
+
/*
* pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
* __pte_offset_map_lock() below, is usually called with the pmd pointer for
@@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
* recheck *pmd once the lock is taken; in practice, no callsite needs that -
* either the mmap_lock for write, or pte_same() check on contents, is enough.
*
+ * pte_offset_map_readonly_nolock(mm, pmd, addr, ptlp), above, is like
+ * pte_offset_map(); but when successful, it also outputs a pointer to the
+ * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
+ * locking it. This helps the caller to avoid a later pte_lockptr(mm, *pmd),
+ * which might by that time act on a changed *pmd: pte_offset_map_readonly_nolock()
+ * provides the correct spinlock pointer for the page table that it returns.
+ * For the readonly case, the caller does not need to recheck *pmd after the lock is
+ * taken, because the RCU lock will ensure that the PTE page will not be freed.
+ *
+ * pte_offset_map_maywrite_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
+ * pte_offset_map_readonly_nolock(); but when successful, it also outputs the
+ * pmdval. For cases where pte or pmd entries may be modified, that is, the maywrite
+ * case, this can help the caller recheck *pmd once the lock is taken. In some
+ * cases we can pass NULL to pmdvalp: either the mmap_lock for write, or
+ * pte_same() check on contents, is also enough.
+ *
* Note that free_pgtables(), used after unmapping detached vmas, or when
* exiting the whole mm, does not take page table lock before freeing a page
* table, and may not use RCU at all: "outsiders" like khugepaged should avoid
--
2.20.1
^ permalink raw reply [flat|nested] 26+ messages in thread

* [PATCH 02/14] arm: adjust_pte() use pte_offset_map_maywrite_nolock()
2024-08-21 8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
2024-08-21 8:18 ` [PATCH 01/14] mm: pgtable: " Qi Zheng
@ 2024-08-21 8:18 ` Qi Zheng
2024-08-21 8:18 ` [PATCH 03/14] powerpc: assert_pte_locked() use pte_offset_map_readonly_nolock() Qi Zheng
` (11 subsequent siblings)
13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 8:18 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In do_adjust_pte(), we may modify the pte entry. At this point, the write
lock of mmap_lock is not held, and no pte_same() check is performed after
the PTL is held, so the corresponding pmd entry may have been modified
concurrently. Therefore, to ensure the stability of the pmd entry,
use pte_offset_map_maywrite_nolock() instead of pte_offset_map_nolock(),
and do a pmd_same() check after holding the PTL.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
arch/arm/mm/fault-armv.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 831793cd6ff94..5371920ec0550 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -94,6 +94,7 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+ pmd_t pmdval;
int ret;
pgd = pgd_offset(vma->vm_mm, address);
@@ -112,16 +113,22 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
if (pmd_none_or_clear_bad(pmd))
return 0;
+again:
/*
* This is called while another page table is mapped, so we
* must use the nested version. This also means we need to
* open-code the spin-locking.
*/
- pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
+ pte = pte_offset_map_maywrite_nolock(vma->vm_mm, pmd, address, &pmdval, &ptl);
if (!pte)
return 0;
do_pte_lock(ptl);
+ if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+ do_pte_unlock(ptl);
+ pte_unmap(pte);
+ goto again;
+ }
ret = do_adjust_pte(vma, address, pfn, pte);
--
2.20.1
* [PATCH 03/14] powerpc: assert_pte_locked() use pte_offset_map_readonly_nolock()
2024-08-21 8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
2024-08-21 8:18 ` [PATCH 01/14] mm: pgtable: " Qi Zheng
2024-08-21 8:18 ` [PATCH 02/14] arm: adjust_pte() use pte_offset_map_maywrite_nolock() Qi Zheng
@ 2024-08-21 8:18 ` Qi Zheng
2024-08-21 8:18 ` [PATCH 04/14] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
` (10 subsequent siblings)
13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 8:18 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In assert_pte_locked(), we just get the ptl and assert that it is already
held, so convert it to use pte_offset_map_readonly_nolock().
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
arch/powerpc/mm/pgtable.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 7316396e452d8..ae4d236b5cef7 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -398,7 +398,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
*/
if (pmd_none(*pmd))
return;
- pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+ pte = pte_offset_map_readonly_nolock(mm, pmd, addr, &ptl);
BUG_ON(!pte);
assert_spin_locked(ptl);
pte_unmap(pte);
--
2.20.1
* [PATCH 04/14] mm: filemap: filemap_fault_recheck_pte_none() use pte_offset_map_readonly_nolock()
2024-08-21 8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
` (2 preceding siblings ...)
2024-08-21 8:18 ` [PATCH 03/14] powerpc: assert_pte_locked() use pte_offset_map_readonly_nolock() Qi Zheng
@ 2024-08-21 8:18 ` Qi Zheng
2024-08-21 8:18 ` [PATCH 05/14] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
` (9 subsequent siblings)
13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 8:18 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In filemap_fault_recheck_pte_none(), we only do a pte_none() check, so
convert it to use pte_offset_map_readonly_nolock().
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/filemap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index d87c858465962..491eb92d6db1f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3228,8 +3228,8 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
return 0;
- ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ ptep = pte_offset_map_readonly_nolock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
if (unlikely(!ptep))
return VM_FAULT_NOPAGE;
--
2.20.1
* [PATCH 05/14] mm: khugepaged: __collapse_huge_page_swapin() use pte_offset_map_readonly_nolock()
2024-08-21 8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
` (3 preceding siblings ...)
2024-08-21 8:18 ` [PATCH 04/14] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
@ 2024-08-21 8:18 ` Qi Zheng
2024-08-21 8:18 ` [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock() Qi Zheng
` (8 subsequent siblings)
13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 8:18 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In __collapse_huge_page_swapin(), the ptl is only used for the pte_same()
check in do_swap_page(); elsewhere, pte_offset_map_lock() is used
directly. So convert it to use pte_offset_map_readonly_nolock().
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/khugepaged.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cdd1d8655a76b..26c083c59f03f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1009,7 +1009,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
};
if (!pte++) {
- pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
+ /*
+ * Here the ptl is only used to check pte_same() in
+ * do_swap_page(), so readonly version is enough.
+ */
+ pte = pte_offset_map_readonly_nolock(mm, pmd, address, &ptl);
if (!pte) {
mmap_read_unlock(mm);
result = SCAN_PMD_NULL;
--
2.20.1
* [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
2024-08-21 8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
` (4 preceding siblings ...)
2024-08-21 8:18 ` [PATCH 05/14] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
@ 2024-08-21 8:18 ` Qi Zheng
2024-08-21 9:17 ` LEROY Christophe
2024-08-21 8:18 ` [PATCH 07/14] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
` (7 subsequent siblings)
13 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 8:18 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In handle_pte_fault(), we may modify vmf->pte after acquiring vmf->ptl,
so convert it to use pte_offset_map_maywrite_nolock(). But since we
already do the pte_same() check, there is no need to get pmdval to do a
pmd_same() check; just pass NULL as the pmdvalp parameter.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/memory.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 93c0c25433d02..d3378e98faf13 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* pmd by anon khugepaged, since that takes mmap_lock in write
* mode; but shmem or file collapse to THP could still morph
* it into a huge pmd: just retry later if so.
+ *
+ * Use the maywrite version to indicate that vmf->pte will be
+ * modified, but since we will use pte_same() to detect the
+ * change of the pte entry, there is no need to get pmdval.
*/
- vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+ vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
+ vmf->pmd, vmf->address,
+ NULL, &vmf->ptl);
if (unlikely(!vmf->pte))
return 0;
vmf->orig_pte = ptep_get_lockless(vmf->pte);
--
2.20.1
* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
2024-08-21 8:18 ` [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock() Qi Zheng
@ 2024-08-21 9:17 ` LEROY Christophe
2024-08-21 9:24 ` Qi Zheng
0 siblings, 1 reply; 26+ messages in thread
From: LEROY Christophe @ 2024-08-21 9:17 UTC (permalink / raw)
To: Qi Zheng, david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev
Le 21/08/2024 à 10:18, Qi Zheng a écrit :
> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
> since we already do the pte_same() check, so there is no need to get
> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> mm/memory.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 93c0c25433d02..d3378e98faf13 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> * pmd by anon khugepaged, since that takes mmap_lock in write
> * mode; but shmem or file collapse to THP could still morph
> * it into a huge pmd: just retry later if so.
> + *
> + * Use the maywrite version to indicate that vmf->pte will be
> + * modified, but since we will use pte_same() to detect the
> + * change of the pte entry, there is no need to get pmdval.
> */
> - vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> + vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
> + vmf->pmd, vmf->address,
> + NULL, &vmf->ptl);
This might be the demonstration that the function name is becoming too long.
Can you find shorter names ?
> if (unlikely(!vmf->pte))
> return 0;
> vmf->orig_pte = ptep_get_lockless(vmf->pte);
* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
2024-08-21 9:17 ` LEROY Christophe
@ 2024-08-21 9:24 ` Qi Zheng
2024-08-21 9:41 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 9:24 UTC (permalink / raw)
To: LEROY Christophe
Cc: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 2024/8/21 17:17, LEROY Christophe wrote:
>
>
> Le 21/08/2024 à 10:18, Qi Zheng a écrit :
>> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
>> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
>> since we already do the pte_same() check, so there is no need to get
>> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> mm/memory.c | 9 +++++++--
>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 93c0c25433d02..d3378e98faf13 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>> * pmd by anon khugepaged, since that takes mmap_lock in write
>> * mode; but shmem or file collapse to THP could still morph
>> * it into a huge pmd: just retry later if so.
>> + *
>> + * Use the maywrite version to indicate that vmf->pte will be
>> + * modified, but since we will use pte_same() to detect the
>> + * change of the pte entry, there is no need to get pmdval.
>> */
>> - vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>> - vmf->address, &vmf->ptl);
>> + vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>> + vmf->pmd, vmf->address,
>> + NULL, &vmf->ptl);
>
> This might be the demonstration that the function name is becoming too long.
>
> Can you find shorter names ?
Maybe use abbreviations?
pte_offset_map_ro_nolock()
pte_offset_map_rw_nolock()
>
>> if (unlikely(!vmf->pte))
>> return 0;
>> vmf->orig_pte = ptep_get_lockless(vmf->pte);
* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
2024-08-21 9:24 ` Qi Zheng
@ 2024-08-21 9:41 ` David Hildenbrand
2024-08-21 9:51 ` Qi Zheng
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2024-08-21 9:41 UTC (permalink / raw)
To: Qi Zheng, LEROY Christophe
Cc: hughd, willy, muchun.song, vbabka, akpm, rppt, vishal.moola,
peterx, ryan.roberts, linux-kernel, linux-mm, linux-arm-kernel,
linuxppc-dev
On 21.08.24 11:24, Qi Zheng wrote:
>
>
> On 2024/8/21 17:17, LEROY Christophe wrote:
>>
>>
>> Le 21/08/2024 à 10:18, Qi Zheng a écrit :
>>> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
>>> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
>>> since we already do the pte_same() check, so there is no need to get
>>> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
>>>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>> mm/memory.c | 9 +++++++--
>>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 93c0c25433d02..d3378e98faf13 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>> * pmd by anon khugepaged, since that takes mmap_lock in write
>>> * mode; but shmem or file collapse to THP could still morph
>>> * it into a huge pmd: just retry later if so.
>>> + *
>>> + * Use the maywrite version to indicate that vmf->pte will be
>>> + * modified, but since we will use pte_same() to detect the
>>> + * change of the pte entry, there is no need to get pmdval.
>>> */
>>> - vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>> - vmf->address, &vmf->ptl);
>>> + vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>> + vmf->pmd, vmf->address,
>>> + NULL, &vmf->ptl);
I think we discussed that passing NULL should be forbidden for that
function.
>>
>> This might be the demonstration that the function name is becoming too long.
>>
>> Can you find shorter names ?
>
> Maybe use abbreviations?
>
> pte_offset_map_ro_nolock()
> pte_offset_map_rw_nolock()
At least the "ro" is better, but "rw" does not express the "maywrite" --
because without taking the lock we are not allowed to write. But maybe
"rw" is good enough for that if we document it properly.
And you can use up to 100 characters, if it helps readability.
--
Cheers,
David / dhildenb
* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
2024-08-21 9:41 ` David Hildenbrand
@ 2024-08-21 9:51 ` Qi Zheng
2024-08-21 9:53 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 9:51 UTC (permalink / raw)
To: David Hildenbrand
Cc: LEROY Christophe, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 2024/8/21 17:41, David Hildenbrand wrote:
> On 21.08.24 11:24, Qi Zheng wrote:
>>
>>
>> On 2024/8/21 17:17, LEROY Christophe wrote:
>>>
>>>
>>> Le 21/08/2024 à 10:18, Qi Zheng a écrit :
>>>> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
>>>> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
>>>> since we already do the pte_same() check, so there is no need to get
>>>> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>> mm/memory.c | 9 +++++++--
>>>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 93c0c25433d02..d3378e98faf13 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct
>>>> vm_fault *vmf)
>>>> * pmd by anon khugepaged, since that takes mmap_lock in
>>>> write
>>>> * mode; but shmem or file collapse to THP could still
>>>> morph
>>>> * it into a huge pmd: just retry later if so.
>>>> + *
>>>> + * Use the maywrite version to indicate that vmf->pte will be
>>>> + * modified, but since we will use pte_same() to detect the
>>>> + * change of the pte entry, there is no need to get pmdval.
>>>> */
>>>> - vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>>> - vmf->address, &vmf->ptl);
>>>> + vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>> + vmf->pmd, vmf->address,
>>>> + NULL, &vmf->ptl);
>
> I think we discussed that passing NULL should be forbidden for that
> function.
Yes, but for some maywrite case, there is no need to get pmdval to
do pmd_same() check. So I passed NULL and added a comment to
explain this.
>
>>>
>>> This might be the demonstration that the function name is becoming
>>> too long.
>>>
>>> Can you find shorter names ?
>>
>> Maybe use abbreviations?
>>
>> pte_offset_map_ro_nolock()
>> pte_offset_map_rw_nolock()
>
> At least the "ro" is better, but "rw" does not express the "maywrite" --
> because without taking the lock we are not allowed to write. But maybe
> "rw" is good enough for that if we document it properly.
OK, will change to it in the next version.
>
> And you can use up to 100 characters, if it helps readability
Got it.
>
* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
2024-08-21 9:51 ` Qi Zheng
@ 2024-08-21 9:53 ` David Hildenbrand
2024-08-21 10:03 ` Qi Zheng
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2024-08-21 9:53 UTC (permalink / raw)
To: Qi Zheng
Cc: LEROY Christophe, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 21.08.24 11:51, Qi Zheng wrote:
>
>
> On 2024/8/21 17:41, David Hildenbrand wrote:
>> On 21.08.24 11:24, Qi Zheng wrote:
>>>
>>>
>>> On 2024/8/21 17:17, LEROY Christophe wrote:
>>>>
>>>>
>>>> Le 21/08/2024 à 10:18, Qi Zheng a écrit :
>>>>> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
>>>>> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
>>>>> since we already do the pte_same() check, so there is no need to get
>>>>> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
>>>>>
>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>> ---
>>>>> mm/memory.c | 9 +++++++--
>>>>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 93c0c25433d02..d3378e98faf13 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct
>>>>> vm_fault *vmf)
>>>>> * pmd by anon khugepaged, since that takes mmap_lock in
>>>>> write
>>>>> * mode; but shmem or file collapse to THP could still
>>>>> morph
>>>>> * it into a huge pmd: just retry later if so.
>>>>> + *
>>>>> + * Use the maywrite version to indicate that vmf->pte will be
>>>>> + * modified, but since we will use pte_same() to detect the
>>>>> + * change of the pte entry, there is no need to get pmdval.
>>>>> */
>>>>> - vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>>>> - vmf->address, &vmf->ptl);
>>>>> + vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>>> + vmf->pmd, vmf->address,
>>>>> + NULL, &vmf->ptl);
>>
>> I think we discussed that passing NULL should be forbidden for that
>> function.
>
> Yes, but for some maywrite case, there is no need to get pmdval to
> do pmd_same() check. So I passed NULL and added a comment to
> explain this.
I wonder if it's better to pass a dummy variable instead. One has to
think harder why that is required compared to blindly passing "NULL" :)
--
Cheers,
David / dhildenb
* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
2024-08-21 9:53 ` David Hildenbrand
@ 2024-08-21 10:03 ` Qi Zheng
2024-08-22 9:29 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 10:03 UTC (permalink / raw)
To: David Hildenbrand
Cc: LEROY Christophe, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 2024/8/21 17:53, David Hildenbrand wrote:
> On 21.08.24 11:51, Qi Zheng wrote:
>>
>>
>> On 2024/8/21 17:41, David Hildenbrand wrote:
>>> On 21.08.24 11:24, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2024/8/21 17:17, LEROY Christophe wrote:
>>>>>
>>>>>
>>>>> Le 21/08/2024 à 10:18, Qi Zheng a écrit :
>>>>>> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
>>>>>> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock().
>>>>>> But
>>>>>> since we already do the pte_same() check, so there is no need to get
>>>>>> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
>>>>>>
>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>> ---
>>>>>> mm/memory.c | 9 +++++++--
>>>>>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 93c0c25433d02..d3378e98faf13 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct
>>>>>> vm_fault *vmf)
>>>>>> * pmd by anon khugepaged, since that takes mmap_lock in
>>>>>> write
>>>>>> * mode; but shmem or file collapse to THP could still
>>>>>> morph
>>>>>> * it into a huge pmd: just retry later if so.
>>>>>> + *
>>>>>> + * Use the maywrite version to indicate that vmf->pte
>>>>>> will be
>>>>>> + * modified, but since we will use pte_same() to detect the
>>>>>> + * change of the pte entry, there is no need to get pmdval.
>>>>>> */
>>>>>> - vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>>>>> - vmf->address, &vmf->ptl);
>>>>>> + vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>>>> + vmf->pmd, vmf->address,
>>>>>> + NULL, &vmf->ptl);
>>>
>>> I think we discussed that passing NULL should be forbidden for that
>>> function.
>>
>> Yes, but for some maywrite case, there is no need to get pmdval to
>> do pmd_same() check. So I passed NULL and added a comment to
>> explain this.
>
> I wonder if it's better to pass a dummy variable instead. One has to
> think harder why that is required compared to blindly passing "NULL" :)
You are afraid that subsequent caller will abuse this function, right?
My initial concern was that this would add a useless local variable, but
perhaps that is not a big deal.
Both are fine for me. ;)
>
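For reference, the "dummy variable" alternative being discussed would look roughly like this (hypothetical fragment, not from the posted series):

```c
pmd_t dummy_pmdval;

/* Pass a throwaway pmdval instead of NULL: the value is deliberately
 * unused, because the later pte_same() check already detects any
 * concurrent change, but the explicit variable documents that the
 * output is being consciously ignored rather than optional. */
vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm, vmf->pmd,
					  vmf->address, &dummy_pmdval,
					  &vmf->ptl);
```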
* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
2024-08-21 10:03 ` Qi Zheng
@ 2024-08-22 9:29 ` David Hildenbrand
2024-08-22 12:17 ` Qi Zheng
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2024-08-22 9:29 UTC (permalink / raw)
To: Qi Zheng
Cc: LEROY Christophe, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 21.08.24 12:03, Qi Zheng wrote:
>
>
> On 2024/8/21 17:53, David Hildenbrand wrote:
>> On 21.08.24 11:51, Qi Zheng wrote:
>>>
>>>
>>> On 2024/8/21 17:41, David Hildenbrand wrote:
>>>> On 21.08.24 11:24, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 2024/8/21 17:17, LEROY Christophe wrote:
>>>>>>
>>>>>>
>>>>>> Le 21/08/2024 à 10:18, Qi Zheng a écrit :
>>>>>>> In handle_pte_fault(), we may modify the vmf->pte after acquiring the
>>>>>>> vmf->ptl, so convert it to using pte_offset_map_maywrite_nolock().
>>>>>>> But
>>>>>>> since we already do the pte_same() check, so there is no need to get
>>>>>>> pmdval to do pmd_same() check, just pass NULL to pmdvalp parameter.
>>>>>>>
>>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>> ---
>>>>>>> mm/memory.c | 9 +++++++--
>>>>>>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>> index 93c0c25433d02..d3378e98faf13 100644
>>>>>>> --- a/mm/memory.c
>>>>>>> +++ b/mm/memory.c
>>>>>>> @@ -5504,9 +5504,14 @@ static vm_fault_t handle_pte_fault(struct
>>>>>>> vm_fault *vmf)
>>>>>>> * pmd by anon khugepaged, since that takes mmap_lock in
>>>>>>> write
>>>>>>> * mode; but shmem or file collapse to THP could still
>>>>>>> morph
>>>>>>> * it into a huge pmd: just retry later if so.
>>>>>>> + *
>>>>>>> + * Use the maywrite version to indicate that vmf->pte
>>>>>>> will be
>>>>>>> + * modified, but since we will use pte_same() to detect the
>>>>>>> + * change of the pte entry, there is no need to get pmdval.
>>>>>>> */
>>>>>>> - vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>>>>>> - vmf->address, &vmf->ptl);
>>>>>>> + vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>>>>> + vmf->pmd, vmf->address,
>>>>>>> + NULL, &vmf->ptl);
>>>>
>>>> I think we discussed that passing NULL should be forbidden for that
>>>> function.
>>>
>>> Yes, but for some maywrite case, there is no need to get pmdval to
>>> do pmd_same() check. So I passed NULL and added a comment to
>>> explain this.
>>
>> I wonder if it's better to pass a dummy variable instead. One has to
>> think harder why that is required compared to blindly passing "NULL" :)
>
> You are afraid that a subsequent caller will abuse this function, right?
Yes! "oh, I don't need a pmdval, why would I? let's just pass NULL, easy" :)
> My initial concern was that this would add a useless local variable, but
> perhaps that is not a big deal.
How many of these "special" instances do we have?
>
> Both are fine for me. ;)
Also no strong opinion, but having to pass a variable makes you think
what you are supposed to do with it and why it is not optional.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
2024-08-22 9:29 ` David Hildenbrand
@ 2024-08-22 12:17 ` Qi Zheng
2024-08-22 12:19 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-08-22 12:17 UTC (permalink / raw)
To: David Hildenbrand
Cc: Qi Zheng, LEROY Christophe, hughd, willy, muchun.song, vbabka,
akpm, rppt, vishal.moola, peterx, ryan.roberts, linux-kernel,
linux-mm, linux-arm-kernel, linuxppc-dev
Hi David,
On 2024/8/22 17:29, David Hildenbrand wrote:
> On 21.08.24 12:03, Qi Zheng wrote:
>>
>>
[...]
>>>>>>>> - vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>>>>>>> - vmf->address, &vmf->ptl);
>>>>>>>> + vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>>>>>> + vmf->pmd, vmf->address,
>>>>>>>> + NULL, &vmf->ptl);
>>>>>
>>>>> I think we discussed that passing NULL should be forbidden for that
>>>>> function.
>>>>
>>>> Yes, but for some maywrite case, there is no need to get pmdval to
>>>> do pmd_same() check. So I passed NULL and added a comment to
>>>> explain this.
>>>
>>> I wonder if it's better to pass a dummy variable instead. One has to
>>> think harder why that is required compared to blindly passing "NULL" :)
>>
>> You are afraid that a subsequent caller will abuse this function, right?
>
> Yes! "oh, I don't need a pmdval, why would I? let's just pass NULL,
> easy" :)
>
>> My initial concern was that this would add a useless local variable, but
>> perhaps that is not a big deal.
>
> How many of these "special" instances do we have?
We have 5 such special instances.
>
>>
>> Both are fine for me. ;)
>
> Also no strong opinion, but having to pass a variable makes you think
> what you are supposed to do with it and why it is not optional.
Yeah, I added 'BUG_ON(!pmdvalp);' in pte_offset_map_ro_nolock(), and
have included this in the v2 version [1].
[1].
https://lore.kernel.org/lkml/cover.1724310149.git.zhengqi.arch@bytedance.com/
Thanks,
Qi
>
* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
2024-08-22 12:17 ` Qi Zheng
@ 2024-08-22 12:19 ` David Hildenbrand
2024-08-22 12:22 ` Qi Zheng
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2024-08-22 12:19 UTC (permalink / raw)
To: Qi Zheng
Cc: LEROY Christophe, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 22.08.24 14:17, Qi Zheng wrote:
> Hi David,
>
> On 2024/8/22 17:29, David Hildenbrand wrote:
>> On 21.08.24 12:03, Qi Zheng wrote:
>>>
>>>
>
> [...]
>
>>>>>>>>> - vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>>>>>>>> - vmf->address, &vmf->ptl);
>>>>>>>>> + vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>>>>>>> + vmf->pmd, vmf->address,
>>>>>>>>> + NULL, &vmf->ptl);
>>>>>>
>>>>>> I think we discussed that passing NULL should be forbidden for that
>>>>>> function.
>>>>>
>>>>> Yes, but for some maywrite case, there is no need to get pmdval to
>>>>> do pmd_same() check. So I passed NULL and added a comment to
>>>>> explain this.
>>>>
>>>> I wonder if it's better to pass a dummy variable instead. One has to
>>>> think harder why that is required compared to blindly passing "NULL" :)
>>>
>>> You are afraid that a subsequent caller will abuse this function, right?
>>
>> Yes! "oh, I don't need a pmdval, why would I? let's just pass NULL,
>> easy" :)
>>
>>> My initial concern was that this would add a useless local variable, but
>>> perhaps that is not a big deal.
>>
>> How many of these "special" instances do we have?
>
> We have 5 such special instances.
>
>>
>>>
>>> Both are fine for me. ;)
>>
>> Also no strong opinion, but having to pass a variable makes you think
>> what you are supposed to do with it and why it is not optional.
>
> Yeah, I added 'BUG_ON(!pmdvalp);' in pte_offset_map_ro_nolock(), and
> have included this in the v2 version [1].
No BUG_ON please :) VM_WARN_ON_ONCE() is good enough.
--
Cheers,
David / dhildenb
* Re: [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock()
2024-08-22 12:19 ` David Hildenbrand
@ 2024-08-22 12:22 ` Qi Zheng
0 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-22 12:22 UTC (permalink / raw)
To: David Hildenbrand
Cc: LEROY Christophe, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts, linux-kernel, linux-mm,
linux-arm-kernel, linuxppc-dev
On 2024/8/22 20:19, David Hildenbrand wrote:
> On 22.08.24 14:17, Qi Zheng wrote:
>> Hi David,
>>
>> On 2024/8/22 17:29, David Hildenbrand wrote:
>>> On 21.08.24 12:03, Qi Zheng wrote:
>>>>
>>>>
>>
>> [...]
>>
>>>>>>>>>> - vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>>>>>>>>> - vmf->address, &vmf->ptl);
>>>>>>>>>> + vmf->pte = pte_offset_map_maywrite_nolock(vmf->vma->vm_mm,
>>>>>>>>>> + vmf->pmd, vmf->address,
>>>>>>>>>> + NULL, &vmf->ptl);
>>>>>>>
>>>>>>> I think we discussed that passing NULL should be forbidden for that
>>>>>>> function.
>>>>>>
>>>>>> Yes, but for some maywrite case, there is no need to get pmdval to
>>>>>> do pmd_same() check. So I passed NULL and added a comment to
>>>>>> explain this.
>>>>>
>>>>> I wonder if it's better to pass a dummy variable instead. One has to
>>>>> think harder why that is required compared to blindly passing
>>>>> "NULL" :)
>>>>
>>>> You are afraid that a subsequent caller will abuse this function, right?
>>>
>>> Yes! "oh, I don't need a pmdval, why would I? let's just pass NULL,
>>> easy" :)
>>>
>>>> My initial concern was that this would add a useless local variable, but
>>>> perhaps that is not a big deal.
>>>
>>> How many of these "special" instances do we have?
>>
>> We have 5 such special instances.
>>
>>>
>>>>
>>>> Both are fine for me. ;)
>>>
>>> Also no strong opinion, but having to pass a variable makes you think
>>> what you are supposed to do with it and why it is not optional.
>>
>> Yeah, I added 'BUG_ON(!pmdvalp);' in pte_offset_map_ro_nolock(), and
>> have included this in the v2 version [1].
>
> No BUG_ON please :) VM_WARN_ON_ONCE() is good enough.
Got it. Will do in the next version.
>
* [PATCH 07/14] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_maywrite_nolock()
2024-08-21 8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
` (5 preceding siblings ...)
2024-08-21 8:18 ` [PATCH 06/14] mm: handle_pte_fault() use pte_offset_map_maywrite_nolock() Qi Zheng
@ 2024-08-21 8:18 ` Qi Zheng
2024-08-21 8:18 ` [PATCH 08/14] mm: copy_pte_range() " Qi Zheng
` (6 subsequent siblings)
13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 8:18 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
acquiring the ptl, so convert it to using pte_offset_map_maywrite_nolock().
At this point, the write lock of mmap_lock is not held and no pte_same()
check is performed after the PTL is taken, so we should get pgt_pmd and do
the pmd_same() check after taking the ptl.
For the case where the ptl is released first and then the pml is acquired,
the PTE page may have been freed, so we must do the pmd_same() check before
reacquiring the ptl.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/khugepaged.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 26c083c59f03f..8fcad0b368a08 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1602,7 +1602,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
pml = pmd_lock(mm, pmd);
- start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
+ start_pte = pte_offset_map_maywrite_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
if (!start_pte) /* mmap_lock + page lock should prevent this */
goto abort;
if (!pml)
@@ -1610,6 +1610,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
else if (ptl != pml)
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+ if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
+ goto abort;
+
/* step 2: clear page table and adjust rmap */
for (i = 0, addr = haddr, pte = start_pte;
i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
@@ -1655,6 +1658,16 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
/* step 4: remove empty page table */
if (!pml) {
pml = pmd_lock(mm, pmd);
+ /*
+ * We called pte_unmap() and released the ptl before acquiring
+ * the pml, which means we left the RCU critical section, so the
+ * PTE page may have been freed; thus we must do the pmd_same()
+ * check before reacquiring the ptl.
+ */
+ if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+ spin_unlock(pml);
+ goto pmd_change;
+ }
if (ptl != pml)
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
}
@@ -1686,6 +1699,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
pte_unmap_unlock(start_pte, ptl);
if (pml && pml != ptl)
spin_unlock(pml);
+pmd_change:
if (notified)
mmu_notifier_invalidate_range_end(&range);
drop_folio:
--
2.20.1
* [PATCH 08/14] mm: copy_pte_range() use pte_offset_map_maywrite_nolock()
2024-08-21 8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
` (6 preceding siblings ...)
2024-08-21 8:18 ` [PATCH 07/14] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
@ 2024-08-21 8:18 ` Qi Zheng
2024-08-22 9:26 ` kernel test robot
2024-08-21 8:18 ` [PATCH 09/14] mm: mremap: move_ptes() " Qi Zheng
` (5 subsequent siblings)
13 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 8:18 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In copy_pte_range(), we may modify the src_pte entry after acquiring the
src_ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
since we already hold the write lock of mmap_lock, there is no need to
get pmdval to do the pmd_same() check, so just pass NULL as the pmdvalp
parameter.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/memory.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index d3378e98faf13..3016b3bf0c3b0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1083,6 +1083,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
struct mm_struct *src_mm = src_vma->vm_mm;
pte_t *orig_src_pte, *orig_dst_pte;
pte_t *src_pte, *dst_pte;
+ pmd_t pmdval;
pte_t ptent;
spinlock_t *src_ptl, *dst_ptl;
int progress, max_nr, ret = 0;
@@ -1108,7 +1109,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
ret = -ENOMEM;
goto out;
}
- src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
+ src_pte = pte_offset_map_maywrite_nolock(src_mm, src_pmd, addr, NULL,
+ &src_ptl);
if (!src_pte) {
pte_unmap_unlock(dst_pte, dst_ptl);
/* ret == 0 */
--
2.20.1
* Re: [PATCH 08/14] mm: copy_pte_range() use pte_offset_map_maywrite_nolock()
2024-08-21 8:18 ` [PATCH 08/14] mm: copy_pte_range() " Qi Zheng
@ 2024-08-22 9:26 ` kernel test robot
0 siblings, 0 replies; 26+ messages in thread
From: kernel test robot @ 2024-08-22 9:26 UTC (permalink / raw)
To: Qi Zheng, david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: oe-kbuild-all, linux-kernel, linux-mm, linux-arm-kernel,
linuxppc-dev, Qi Zheng
Hi Qi,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on powerpc/next powerpc/fixes linus/master v6.11-rc4 next-20240822]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/mm-pgtable-introduce-pte_offset_map_-readonly-maywrite-_nolock/20240821-162312
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/05c311498fc8e7e9b2143c7b5fef6dc624cfc49f.1724226076.git.zhengqi.arch%40bytedance.com
patch subject: [PATCH 08/14] mm: copy_pte_range() use pte_offset_map_maywrite_nolock()
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20240822/202408221703.ioeASthY-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240822/202408221703.ioeASthY-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408221703.ioeASthY-lkp@intel.com/
All warnings (new ones prefixed by >>):
mm/memory.c: In function 'copy_pte_range':
>> mm/memory.c:1086:15: warning: unused variable 'pmdval' [-Wunused-variable]
1086 | pmd_t pmdval;
| ^~~~~~
vim +/pmdval +1086 mm/memory.c
1076
1077 static int
1078 copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
1079 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
1080 unsigned long end)
1081 {
1082 struct mm_struct *dst_mm = dst_vma->vm_mm;
1083 struct mm_struct *src_mm = src_vma->vm_mm;
1084 pte_t *orig_src_pte, *orig_dst_pte;
1085 pte_t *src_pte, *dst_pte;
> 1086 pmd_t pmdval;
1087 pte_t ptent;
1088 spinlock_t *src_ptl, *dst_ptl;
1089 int progress, max_nr, ret = 0;
1090 int rss[NR_MM_COUNTERS];
1091 swp_entry_t entry = (swp_entry_t){0};
1092 struct folio *prealloc = NULL;
1093 int nr;
1094
1095 again:
1096 progress = 0;
1097 init_rss_vec(rss);
1098
1099 /*
1100 * copy_pmd_range()'s prior pmd_none_or_clear_bad(src_pmd), and the
1101 * error handling here, assume that exclusive mmap_lock on dst and src
1102 * protects anon from unexpected THP transitions; with shmem and file
1103 * protected by mmap_lock-less collapse skipping areas with anon_vma
1104 * (whereas vma_needs_copy() skips areas without anon_vma). A rework
1105 * can remove such assumptions later, but this is good enough for now.
1106 */
1107 dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
1108 if (!dst_pte) {
1109 ret = -ENOMEM;
1110 goto out;
1111 }
1112 src_pte = pte_offset_map_maywrite_nolock(src_mm, src_pmd, addr, NULL,
1113 &src_ptl);
1114 if (!src_pte) {
1115 pte_unmap_unlock(dst_pte, dst_ptl);
1116 /* ret == 0 */
1117 goto out;
1118 }
1119 spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
1120 orig_src_pte = src_pte;
1121 orig_dst_pte = dst_pte;
1122 arch_enter_lazy_mmu_mode();
1123
1124 do {
1125 nr = 1;
1126
1127 /*
1128 * We are holding two locks at this point - either of them
1129 * could generate latencies in another task on another CPU.
1130 */
1131 if (progress >= 32) {
1132 progress = 0;
1133 if (need_resched() ||
1134 spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
1135 break;
1136 }
1137 ptent = ptep_get(src_pte);
1138 if (pte_none(ptent)) {
1139 progress++;
1140 continue;
1141 }
1142 if (unlikely(!pte_present(ptent))) {
1143 ret = copy_nonpresent_pte(dst_mm, src_mm,
1144 dst_pte, src_pte,
1145 dst_vma, src_vma,
1146 addr, rss);
1147 if (ret == -EIO) {
1148 entry = pte_to_swp_entry(ptep_get(src_pte));
1149 break;
1150 } else if (ret == -EBUSY) {
1151 break;
1152 } else if (!ret) {
1153 progress += 8;
1154 continue;
1155 }
1156 ptent = ptep_get(src_pte);
1157 VM_WARN_ON_ONCE(!pte_present(ptent));
1158
1159 /*
1160 * Device exclusive entry restored, continue by copying
1161 * the now present pte.
1162 */
1163 WARN_ON_ONCE(ret != -ENOENT);
1164 }
1165 /* copy_present_ptes() will clear `*prealloc' if consumed */
1166 max_nr = (end - addr) / PAGE_SIZE;
1167 ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
1168 ptent, addr, max_nr, rss, &prealloc);
1169 /*
1170 * If we need a pre-allocated page for this pte, drop the
1171 * locks, allocate, and try again.
1172 */
1173 if (unlikely(ret == -EAGAIN))
1174 break;
1175 if (unlikely(prealloc)) {
1176 /*
1177 * pre-alloc page cannot be reused by next time so as
1178 * to strictly follow mempolicy (e.g., alloc_page_vma()
1179 * will allocate page according to address). This
1180 * could only happen if one pinned pte changed.
1181 */
1182 folio_put(prealloc);
1183 prealloc = NULL;
1184 }
1185 nr = ret;
1186 progress += 8 * nr;
1187 } while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr,
1188 addr != end);
1189
1190 arch_leave_lazy_mmu_mode();
1191 pte_unmap_unlock(orig_src_pte, src_ptl);
1192 add_mm_rss_vec(dst_mm, rss);
1193 pte_unmap_unlock(orig_dst_pte, dst_ptl);
1194 cond_resched();
1195
1196 if (ret == -EIO) {
1197 VM_WARN_ON_ONCE(!entry.val);
1198 if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
1199 ret = -ENOMEM;
1200 goto out;
1201 }
1202 entry.val = 0;
1203 } else if (ret == -EBUSY) {
1204 goto out;
1205 } else if (ret == -EAGAIN) {
1206 prealloc = folio_prealloc(src_mm, src_vma, addr, false);
1207 if (!prealloc)
1208 return -ENOMEM;
1209 } else if (ret < 0) {
1210 VM_WARN_ON_ONCE(1);
1211 }
1212
1213 /* We've captured and resolved the error. Reset, try again. */
1214 ret = 0;
1215
1216 if (addr != end)
1217 goto again;
1218 out:
1219 if (unlikely(prealloc))
1220 folio_put(prealloc);
1221 return ret;
1222 }
1223
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* [PATCH 09/14] mm: mremap: move_ptes() use pte_offset_map_maywrite_nolock()
2024-08-21 8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
` (7 preceding siblings ...)
2024-08-21 8:18 ` [PATCH 08/14] mm: copy_pte_range() " Qi Zheng
@ 2024-08-21 8:18 ` Qi Zheng
2024-08-21 8:18 ` [PATCH 10/14] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
` (4 subsequent siblings)
13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 8:18 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so
convert it to using pte_offset_map_maywrite_nolock(). But since we already
hold the exclusive mmap_lock, there is no need to get pmdval to do the
pmd_same() check, so just pass NULL as the pmdvalp parameter.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/mremap.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index e7ae140fc6409..33a0ccf79c32d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -175,7 +175,12 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
err = -EAGAIN;
goto out;
}
- new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
+ /*
+ * Use the maywrite version to indicate that new_pte will be modified,
+ * but since we hold the exclusive mmap_lock, there is no need to
+ * recheck pmd_same() after acquiring the new_ptl.
+ */
+ new_pte = pte_offset_map_maywrite_nolock(mm, new_pmd, new_addr, NULL, &new_ptl);
if (!new_pte) {
pte_unmap_unlock(old_pte, old_ptl);
err = -EAGAIN;
--
2.20.1
* [PATCH 10/14] mm: page_vma_mapped_walk: map_pte() use pte_offset_map_maywrite_nolock()
2024-08-21 8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
` (8 preceding siblings ...)
2024-08-21 8:18 ` [PATCH 09/14] mm: mremap: move_ptes() " Qi Zheng
@ 2024-08-21 8:18 ` Qi Zheng
2024-08-21 8:18 ` [PATCH 11/14] mm: userfaultfd: move_pages_pte() " Qi Zheng
` (3 subsequent siblings)
13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 8:18 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In the caller of map_pte(), we may modify the pvmw->pte after acquiring
the pvmw->ptl, so convert it to using pte_offset_map_maywrite_nolock(). At
this point, the write lock of mmap_lock is not held and no pte_same()
check is performed after the pvmw->ptl is taken, so we should get pmdval
and do the pmd_same() check to ensure the stability of pvmw->pmd.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/page_vma_mapped.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae5cc42aa2087..da806f3a5047d 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -13,9 +13,11 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
return false;
}
-static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
+static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
+ spinlock_t **ptlp)
{
pte_t ptent;
+ pmd_t pmdval;
if (pvmw->flags & PVMW_SYNC) {
/* Use the stricter lookup */
@@ -25,6 +27,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
return !!pvmw->pte;
}
+again:
/*
* It is important to return the ptl corresponding to pte,
* in case *pvmw->pmd changes underneath us; so we need to
@@ -32,10 +35,11 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
* proceeds to loop over next ptes, and finds a match later.
* Though, in most cases, page lock already protects this.
*/
- pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
- pvmw->address, ptlp);
+ pvmw->pte = pte_offset_map_maywrite_nolock(pvmw->vma->vm_mm, pvmw->pmd,
+ pvmw->address, &pmdval, ptlp);
if (!pvmw->pte)
return false;
+ *pmdvalp = pmdval;
ptent = ptep_get(pvmw->pte);
@@ -69,6 +73,12 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
}
pvmw->ptl = *ptlp;
spin_lock(pvmw->ptl);
+
+ if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pvmw->pmd)))) {
+ spin_unlock(pvmw->ptl);
+ goto again;
+ }
+
return true;
}
@@ -278,7 +288,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
step_forward(pvmw, PMD_SIZE);
continue;
}
- if (!map_pte(pvmw, &ptl)) {
+ if (!map_pte(pvmw, &pmde, &ptl)) {
if (!pvmw->pte)
goto restart;
goto next_pte;
@@ -307,6 +317,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (!pvmw->ptl) {
pvmw->ptl = ptl;
spin_lock(pvmw->ptl);
+ if (unlikely(!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))) {
+ pte_unmap_unlock(pvmw->pte, pvmw->ptl);
+ pvmw->ptl = NULL;
+ pvmw->pte = NULL;
+ goto restart;
+ }
}
goto this_pte;
} while (pvmw->address < end);
--
2.20.1
* [PATCH 11/14] mm: userfaultfd: move_pages_pte() use pte_offset_map_maywrite_nolock()
2024-08-21 8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
` (9 preceding siblings ...)
2024-08-21 8:18 ` [PATCH 10/14] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
@ 2024-08-21 8:18 ` Qi Zheng
2024-08-21 8:18 ` [PATCH 12/14] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
` (2 subsequent siblings)
13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 8:18 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In move_pages_pte(), we may modify the dst_pte and src_pte after acquiring
the ptl, so convert it to using pte_offset_map_maywrite_nolock(). But
since we already do the pte_same() check, there is no need to get pmdval
to do the pmd_same() check, so just pass NULL as the pmdvalp parameter.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/userfaultfd.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 6ef42d9cd482e..310289fad2b32 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1146,7 +1146,13 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
src_addr, src_addr + PAGE_SIZE);
mmu_notifier_invalidate_range_start(&range);
retry:
- dst_pte = pte_offset_map_nolock(mm, dst_pmd, dst_addr, &dst_ptl);
+ /*
+ * Use the maywrite version to indicate that dst_pte will be modified,
+ * but since we will use pte_same() to detect the change of the pte
+ * entry, there is no need to get pmdval.
+ */
+ dst_pte = pte_offset_map_maywrite_nolock(mm, dst_pmd, dst_addr, NULL,
+ &dst_ptl);
/* Retry if a huge pmd materialized from under us */
if (unlikely(!dst_pte)) {
@@ -1154,7 +1160,9 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
goto out;
}
- src_pte = pte_offset_map_nolock(mm, src_pmd, src_addr, &src_ptl);
+ /* same as dst_pte */
+ src_pte = pte_offset_map_maywrite_nolock(mm, src_pmd, src_addr, NULL,
+ &src_ptl);
/*
* We held the mmap_lock for reading so MADV_DONTNEED
--
2.20.1
* [PATCH 12/14] mm: multi-gen LRU: walk_pte_range() use pte_offset_map_maywrite_nolock()
2024-08-21 8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
` (10 preceding siblings ...)
2024-08-21 8:18 ` [PATCH 11/14] mm: userfaultfd: move_pages_pte() " Qi Zheng
@ 2024-08-21 8:18 ` Qi Zheng
2024-08-21 8:18 ` [PATCH 13/14] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
2024-08-21 8:18 ` [PATCH 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_maywrite_nolock() Qi Zheng
13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 8:18 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In walk_pte_range(), we may modify the pte entry after holding the ptl, so
convert it to using pte_offset_map_maywrite_nolock(). At this point, the
write lock of mmap_lock is not held and no pte_same() check is performed
after the ptl is taken, so we should get pmdval and do the pmd_same()
check to ensure the stability of the pmd entry.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/vmscan.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5dc96a843466a..a6620211f138a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3396,8 +3396,10 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
DEFINE_MAX_SEQ(walk->lruvec);
int old_gen, new_gen = lru_gen_from_seq(max_seq);
+ pmd_t pmdval;
- pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+ pte = pte_offset_map_maywrite_nolock(args->mm, pmd, start & PMD_MASK,
+ &pmdval, &ptl);
if (!pte)
return false;
if (!spin_trylock(ptl)) {
@@ -3405,6 +3407,11 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
return false;
}
+ if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+ pte_unmap_unlock(pte, ptl);
+ return false;
+ }
+
arch_enter_lazy_mmu_mode();
restart:
for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
--
2.20.1
* [PATCH 13/14] mm: pgtable: remove pte_offset_map_nolock()
2024-08-21 8:18 [PATCH 00/14] introduce pte_offset_map_{readonly|maywrite}_nolock() Qi Zheng
` (11 preceding siblings ...)
2024-08-21 8:18 ` [PATCH 12/14] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
@ 2024-08-21 8:18 ` Qi Zheng
2024-08-21 8:18 ` [PATCH 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_maywrite_nolock() Qi Zheng
13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-08-21 8:18 UTC (permalink / raw)
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
Now that there are no users of pte_offset_map_nolock(), remove it.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
Documentation/mm/split_page_table_lock.rst | 3 ---
include/linux/mm.h | 2 --
mm/pgtable-generic.c | 21 ---------------------
3 files changed, 26 deletions(-)
diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index f54f717ae8bdf..596b425fb28e6 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -16,9 +16,6 @@ There are helpers to lock/unlock a table and other accessor functions:
- pte_offset_map_lock()
maps PTE and takes PTE table lock, returns pointer to PTE with
pointer to its PTE table lock, or returns NULL if no PTE table;
- - pte_offset_map_nolock()
- maps PTE, returns pointer to PTE with pointer to its PTE table
- lock (not taken), or returns NULL if no PTE table;
- pte_offset_map_readonly_nolock()
maps PTE, returns pointer to PTE with pointer to its PTE table
lock (not taken), or returns NULL if no PTE table;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1fe0ceabcaf39..f7c207c3ab701 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2952,8 +2952,6 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
return pte;
}
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
- unsigned long addr, spinlock_t **ptlp);
pte_t *pte_offset_map_readonly_nolock(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, spinlock_t **ptlp);
pte_t *pte_offset_map_maywrite_nolock(struct mm_struct *mm, pmd_t *pmd,
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 29d1fd6fd2963..9ba2e423bdb41 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -305,18 +305,6 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
return NULL;
}
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
- unsigned long addr, spinlock_t **ptlp)
-{
- pmd_t pmdval;
- pte_t *pte;
-
- pte = __pte_offset_map(pmd, addr, &pmdval);
- if (likely(pte))
- *ptlp = pte_lockptr(mm, &pmdval);
- return pte;
-}
-
pte_t *pte_offset_map_readonly_nolock(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, spinlock_t **ptlp)
{
@@ -374,15 +362,6 @@ pte_t *pte_offset_map_maywrite_nolock(struct mm_struct *mm, pmd_t *pmd,
* and disconnected table. Until pte_unmap(pte) unmaps and rcu_read_unlock()s
* afterwards.
*
- * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
- * but when successful, it also outputs a pointer to the spinlock in ptlp - as
- * pte_offset_map_lock() does, but in this case without locking it. This helps
- * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
- * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
- * pointer for the page table that it returns. In principle, the caller should
- * recheck *pmd once the lock is taken; in practice, no callsite needs that -
- * either the mmap_lock for write, or pte_same() check on contents, is enough.
- *
* pte_offset_map_readonly_nolock(mm, pmd, addr, ptlp), above, is like
* pte_offset_map(); but when successful, it also outputs a pointer to the
* spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
--
2.20.1
* [PATCH 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_maywrite_nolock()
From: Qi Zheng @ 2024-08-21 8:18 UTC
To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
vishal.moola, peterx, ryan.roberts
Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng
In retract_page_tables(), we may modify the pmd entry after acquiring the
pml and ptl, so we should also check whether the pmd entry is stable.
Use pte_offset_map_maywrite_nolock() + pmd_same() to do that.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/khugepaged.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8fcad0b368a08..821c840b5b593 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
spinlock_t *pml;
spinlock_t *ptl;
bool skipped_uffd = false;
+ pte_t *pte;
/*
* Check vma->anon_vma to exclude MAP_PRIVATE mappings that
@@ -1756,11 +1757,25 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
addr, addr + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
+ pte = pte_offset_map_maywrite_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
+ if (!pte) {
+ mmu_notifier_invalidate_range_end(&range);
+ continue;
+ }
+
pml = pmd_lock(mm, pmd);
- ptl = pte_lockptr(mm, pmd);
if (ptl != pml)
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+ if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+ pte_unmap_unlock(pte, ptl);
+ if (ptl != pml)
+ spin_unlock(pml);
+ mmu_notifier_invalidate_range_end(&range);
+ continue;
+ }
+ pte_unmap(pte);
+
/*
* Huge page lock is still held, so normally the page table
* must remain empty; and we have already skipped anon_vma
--
2.20.1