From: Qi Zheng <zhengqi.arch@bytedance.com>
To: Muchun Song <muchun.song@linux.dev>
Cc: LKML <linux-kernel@vger.kernel.org>,
Linux Memory Management List <linux-mm@kvack.org>,
linux-arm-kernel@lists.infradead.org,
linuxppc-dev@lists.ozlabs.org,
David Hildenbrand <david@redhat.com>,
hughd@google.com, willy@infradead.org, vbabka@kernel.org,
akpm@linux-foundation.org, rppt@kernel.org,
vishal.moola@gmail.com, peterx@redhat.com, ryan.roberts@arm.com,
christophe.leroy2@cs-soprasteria.com
Subject: Re: [PATCH v3 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
Date: Thu, 12 Sep 2024 17:28:39 +0800 [thread overview]
Message-ID: <da59b028-b472-4ac1-b893-2f17496fb384@bytedance.com> (raw)
In-Reply-To: <d02fe02d-a6c7-4157-bb7d-3fe235f21237@linux.dev>
Hi Muchun,
On 2024/9/6 15:20, Muchun Song wrote:
>
>
> On 2024/9/4 16:40, Qi Zheng wrote:
>> Currently, the usage of pte_offset_map_nolock() can be divided into the
>> following two cases:
>>
>> 1) After acquiring PTL, only read-only operations are performed on the
>>    PTE page. In this case, the RCU lock in pte_offset_map_nolock() will
>>    ensure that the PTE page will not be freed, and there is no need to
>>    worry about whether the pmd entry is modified.
>>
>> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>>    time, we need to ensure that the pmd entry has not been modified
>>    concurrently.
>>
>> To more clearly distinguish between these two cases, this commit
>> introduces two new helper functions to replace pte_offset_map_nolock().
>> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
>> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
>> pmdval when successful. It is applicable to may-write cases, where
>> modifications to the page table may happen after the corresponding
>> spinlock is taken. But users must make sure the page table is stable,
>> e.g. by checking pte_same() or by checking pmd_same() against the
>> output pmdval, before performing any write operations.
>>
>> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
>> will be read-only/read-write protected.
>>
>> Subsequent commits will convert pte_offset_map_nolock() into the above
>> two functions one by one, and finally completely delete it.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> Documentation/mm/split_page_table_lock.rst | 7 +++
>> include/linux/mm.h | 5 +++
>> mm/pgtable-generic.c | 50 ++++++++++++++++++++++
>> 3 files changed, 62 insertions(+)
>>
>> diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
>> index e4f6972eb6c04..08d0e706a32db 100644
>> --- a/Documentation/mm/split_page_table_lock.rst
>> +++ b/Documentation/mm/split_page_table_lock.rst
>> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
>> - pte_offset_map_nolock()
>> maps PTE, returns pointer to PTE with pointer to its PTE table
>> lock (not taken), or returns NULL if no PTE table;
>> + - pte_offset_map_ro_nolock()
>> + maps PTE, returns pointer to PTE with pointer to its PTE table
>> + lock (not taken), or returns NULL if no PTE table;
>> + - pte_offset_map_rw_nolock()
>> + maps PTE, returns pointer to PTE with pointer to its PTE table
>> + lock (not taken) and the value of its pmd entry, or returns NULL
>> + if no PTE table;
>> - pte_offset_map()
>> maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
>> - pte_unmap()
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index a7c74a840249a..1fde9242231c9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3006,6 +3006,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
>>  pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>  			     unsigned long addr, spinlock_t **ptlp);
>> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +				unsigned long addr, spinlock_t **ptlp);
>> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +				unsigned long addr, pmd_t *pmdvalp,
>> +				spinlock_t **ptlp);
>>  #define pte_unmap_unlock(pte, ptl) do { \
>>  	spin_unlock(ptl); \
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index a78a4adf711ac..262b7065a5a2e 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>  	return pte;
>>  }
>> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +				unsigned long addr, spinlock_t **ptlp)
>> +{
>> +	pmd_t pmdval;
>> +	pte_t *pte;
>> +
>> +	pte = __pte_offset_map(pmd, addr, &pmdval);
>> +	if (likely(pte))
>> +		*ptlp = pte_lockptr(mm, &pmdval);
>> +	return pte;
>> +}
>> +
>> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +				unsigned long addr, pmd_t *pmdvalp,
>> +				spinlock_t **ptlp)
>> +{
>> +	pmd_t pmdval;
>> +	pte_t *pte;
>> +
>> +	VM_WARN_ON_ONCE(!pmdvalp);
>> +	pte = __pte_offset_map(pmd, addr, &pmdval);
>> +	if (likely(pte))
>> +		*ptlp = pte_lockptr(mm, &pmdval);
>> +	*pmdvalp = pmdval;
>> +	return pte;
>> +}
>> +
>>  /*
>>   * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
>>   * __pte_offset_map_lock() below, is usually called with the pmd pointer for
>> @@ -356,6 +383,29 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>   * recheck *pmd once the lock is taken; in practice, no callsite needs that -
>>   * either the mmap_lock for write, or pte_same() check on contents, is enough.
>> *
>> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
>> + * but when successful, it also outputs a pointer to the spinlock in ptlp - as
>> + * pte_offset_map_lock() does, but in this case without locking it. This helps
>> + * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
>> + * act on a changed *pmd: pte_offset_map_ro_nolock() provides the correct
>> + * spinlock pointer for the page table that it returns. Even after grabbing
>> + * the spinlock, we might be looking either at a page table that is still
>> + * mapped or one that was unmapped and is about to get freed. But for R/O
>> + * access this is sufficient. So it is only applicable to read-only cases,
>> + * where no modification of the page table is allowed even after the
>> + * corresponding spinlock is taken.
>> + *
>> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
>> + * pmdval. It is applicable to may-write cases, where modifications to the
>> + * page table may happen after the corresponding spinlock is taken. But
>> + * users must make sure the page table is stable, e.g. by checking pte_same()
>> + * or by checking pmd_same() against the output pmdval, before performing
>> + * any write operations.
>
> Now we have two ways for users of pte_offset_map_rw_nolock() to ensure
> the stability of the PTE. In order to ease this operation, how about
> proposing a new helper (or two: one for pmd_same, another for pte_same),
> like pte_lock_stability (I am not good at naming, maybe you can do
> better), which helps users 1) hold the PTL, 2) check whether the PTE is
> stable, and 3) return true if the PTE is stable, otherwise false?
I've been trying to do this over the past few days, but found it is not
very convenient.
I introduced the following helpers:
#define __PTE_STABILITY(lock)						\
bool __pte_stability_##lock(pmd_t *pmd, pmd_t *orig_pmd, pte_t *pte,	\
			    pte_t *orig_pte, spinlock_t *ptlp)		\
{									\
	pte_spin_##lock(ptlp);						\
	if (orig_pte) {							\
		VM_WARN_ON_ONCE(pte_none(*orig_pte));			\
		return pte_same(*orig_pte, ptep_get(pte));		\
	}								\
	if (orig_pmd) {							\
		VM_WARN_ON_ONCE(pmd_none(*orig_pmd));			\
		return pmd_same(*orig_pmd, pmdp_get_lockless(pmd));	\
	}								\
	VM_WARN_ON_ONCE(1);						\
	return false;							\
}
__PTE_STABILITY(lock)
__PTE_STABILITY(lock_nested)

static inline bool pte_stability_lock(pmd_t *pmd, pmd_t *orig_pmd, pte_t *pte,
				      pte_t *orig_pte, spinlock_t *ptlp)
	__acquires(ptlp)
{
	return __pte_stability_lock(pmd, orig_pmd, pte, orig_pte, ptlp);
}

#ifdef CONFIG_SPLIT_PTE_PTLOCKS
static inline bool pte_stability_lock_nested(pmd_t *pmd, pmd_t *orig_pmd,
					     pte_t *pte, pte_t *orig_pte,
					     spinlock_t *ptlp)
	__acquires(ptlp)
{
	return __pte_stability_lock_nested(pmd, orig_pmd, pte, orig_pte, ptlp);
}

static inline void pte_stability_unlock_nested(spinlock_t *ptlp)
	__releases(ptlp)
{
	spin_unlock(ptlp);
}
#else
static inline bool pte_stability_lock_nested(pmd_t *pmd, pmd_t *orig_pmd,
					     pte_t *pte, pte_t *orig_pte,
					     spinlock_t *ptlp)
{
	return true;
}

static inline void pte_stability_unlock_nested(spinlock_t *ptlp)
{
}
#endif /* CONFIG_SPLIT_PTE_PTLOCKS */
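
For reference, a caller would then look roughly like the following (just
a sketch; the surrounding variables mm/pmd/addr are assumed):

	pmd_t pmdval;
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
	if (!pte)
		return;
	/* Take the PTL and verify the pmd entry did not change. */
	if (!pte_stability_lock(pmd, &pmdval, pte, NULL, ptl)) {
		pte_unmap_unlock(pte, ptl);
		return;
	}
	/* PTL held and pmd entry verified stable: writes are safe here. */
	pte_unmap_unlock(pte, ptl);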
and try to use them with pte_offset_map_rw_nolock() in the following
functions:
1. collapse_pte_mapped_thp
2. handle_pte_fault
3. map_pte
4. move_pages_pte
5. walk_pte_range
For 1, 2 and 3, the conversion is relatively simple, but 2 actually
already does a pte_same() check, so it does not reduce the amount of
code much.

For 4, the pte_same() checks have already been done, and it is not
easy to convert double_pt_lock() to use pte_stability_lock{_nested}().

For 5, it calls spin_trylock(), so we would need to introduce yet
another pte_stability_trylock() helper for it, which feels unnecessary.

There are not many places where pte_offset_map_rw_nolock() is called,
and some of them already do pte_same() checks, so maybe open-coding is
enough and there is no need to introduce more helper functions.
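
To make the comparison concrete, the open-coded version at such a call
site would be something like this (again just a sketch):

	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
	if (!pte)
		return;
	spin_lock(ptl);
	/* Recheck the pmd entry under the PTL before any modification. */
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		pte_unmap_unlock(pte, ptl);
		return;
	}
	/* ... perform the write operations ... */
	pte_unmap_unlock(pte, ptl);

which is only a couple of lines more than calling a dedicated helper.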
Thanks,
Qi
>
> Muchun,
> Thanks.
>
>> + *
>> + * Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
>> + * will be read-only/read-write protected.
>> + *
>>  * Note that free_pgtables(), used after unmapping detached vmas, or when
>>  * exiting the whole mm, does not take page table lock before freeing a page
>>  * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
>
Thread overview: 26+ messages
2024-09-04 8:40 [PATCH v3 00/14] " Qi Zheng
2024-09-04 8:40 ` [PATCH v3 01/14] mm: pgtable: " Qi Zheng
2024-09-06 7:20 ` Muchun Song
2024-09-12 9:28 ` Qi Zheng [this message]
2024-09-04 8:40 ` [PATCH v3 02/14] arm: adjust_pte() use pte_offset_map_rw_nolock() Qi Zheng
2024-09-04 8:40 ` [PATCH v3 03/14] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock() Qi Zheng
2024-09-04 8:40 ` [PATCH v3 04/14] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
2024-09-04 8:40 ` [PATCH v3 05/14] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
2024-09-04 8:40 ` [PATCH v3 06/14] mm: handle_pte_fault() use pte_offset_map_rw_nolock() Qi Zheng
2024-09-04 8:40 ` [PATCH v3 07/14] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
2024-09-04 8:40 ` [PATCH v3 08/14] mm: copy_pte_range() " Qi Zheng
2024-09-05 8:57 ` Muchun Song
2024-09-05 10:55 ` Qi Zheng
2024-09-04 8:40 ` [PATCH v3 09/14] mm: mremap: move_ptes() " Qi Zheng
2024-09-05 9:25 ` Muchun Song
2024-09-05 10:56 ` Qi Zheng
2024-09-04 8:40 ` [PATCH v3 10/14] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
2024-09-05 12:07 ` Muchun Song
2024-09-12 9:30 ` Qi Zheng
2024-09-04 8:40 ` [PATCH v3 11/14] mm: userfaultfd: move_pages_pte() " Qi Zheng
2024-09-05 12:20 ` Muchun Song
2024-09-04 8:40 ` [PATCH v3 12/14] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
2024-09-05 12:23 ` Muchun Song
2024-09-04 8:40 ` [PATCH v3 13/14] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
2024-09-05 12:23 ` Muchun Song
2024-09-04 8:40 ` [PATCH v3 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock() Qi Zheng