* [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
@ 2024-02-15 12:17 Ryan Roberts
2024-02-15 12:17 ` [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency() Ryan Roberts
` (4 more replies)
0 siblings, 5 replies; 42+ messages in thread
From: Ryan Roberts @ 2024-02-15 12:17 UTC (permalink / raw)
To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Andrew Morton, Muchun Song
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
This is an RFC for a series that aims to reduce the cost and complexity of
ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1].
The approach came from discussion with Mark and David [2].
It introduces a new helper, ptep_get_lockless_norecency(), which allows the
access and dirty bits in the returned pte to be incorrect. This relaxation
permits arm64's implementation to just read the single target pte, and avoids
having to iterate over the full contpte block to gather the access and dirty
bits, for the contpte case.
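To illustrate the cost argument (not the kernel implementation — the pte bit
layout, helper names and CONT_PTES value below are made up for the sketch), the
difference between the two lookups can be modelled in plain C. A lockless read
over a contpte block must OR in the young/dirty bits of every pte in the block,
so it touches CONT_PTES entries per call; the norecency variant reads one:

```c
#include <assert.h>
#include <stdint.h>

/* Toy pte model: bit 0 = young (accessed), bit 1 = dirty, higher bits
 * stand in for the pfn. CONT_PTES models the number of ptes in a contpte
 * block (16 for arm64 with 4K pages, but the value is illustrative). */
#define CONT_PTES 16
#define PTE_YOUNG (1ULL << 0)
#define PTE_DIRTY (1ULL << 1)

typedef uint64_t pte_t;

/* Models ptep_get_lockless() on a contpte block: the returned value must
 * reflect the young/dirty state of the whole block, so every call reads
 * all CONT_PTES entries. Iterating a range of CONT_PTES ptes therefore
 * costs ~CONT_PTES^2 reads. */
static pte_t get_lockless(const pte_t *block, int i)
{
	pte_t pte = block[i];

	for (int j = 0; j < CONT_PTES; j++)
		pte |= block[j] & (PTE_YOUNG | PTE_DIRTY);
	return pte;
}

/* Models ptep_get_lockless_norecency(): a single read; the young/dirty
 * bits may under-report the state of the block. */
static pte_t get_lockless_norecency(const pte_t *block, int i)
{
	return block[i];
}
```

With only block[3] dirty, get_lockless() reports the block as dirty from any
index, while get_lockless_norecency() only sees the dirty bit at index 3 —
exactly the relaxation the series permits.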
It turns out that none of the call sites using ptep_get_lockless() require
accurate access and dirty bit information, so we can also convert those sites,
although a couple of places need care (see patches 2 and 3).
Arguably patch 3 is a bit fragile, given the wide accessibility of
vmf->orig_pte. So it might make sense to drop this patch and stick to using
ptep_get_lockless() in the page fault path. I'm keen to hear opinions.
I've chosen the name "recency" because it's shortish and somewhat descriptive,
and is already used in a couple of places to mean similar things (see mglru and
damon). I'm open to other names if anyone has better ideas.
If consensus is that this approach is generally acceptable, I intend to create a
series in future to do a similar thing with ptep_get() -> ptep_get_norecency().
---
This series applies on top of [1].
[1] https://lore.kernel.org/linux-mm/20240215103205.2607016-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/a91cfe1c-289e-4828-8cfc-be34eb69a71b@redhat.com/
Thanks,
Ryan
Ryan Roberts (4):
mm: Introduce ptep_get_lockless_norecency()
mm/gup: Use ptep_get_lockless_norecency()
mm/memory: Use ptep_get_lockless_norecency() for orig_pte
arm64/mm: Override ptep_get_lockless_norecency()
 arch/arm64/include/asm/pgtable.h |  6 ++++
 include/linux/pgtable.h          | 55 ++++++++++++++++++++++++++++--
 kernel/events/core.c             |  2 +-
 mm/gup.c                         |  7 ++--
 mm/hugetlb.c                     |  2 +-
 mm/khugepaged.c                  |  2 +-
 mm/memory.c                      | 57 ++++++++++++++++++++------------
 mm/swap_state.c                  |  2 +-
 mm/swapfile.c                    |  2 +-
 9 files changed, 102 insertions(+), 33 deletions(-)
--
2.25.1
^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency()
  2024-02-15 12:17 [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 Ryan Roberts
@ 2024-02-15 12:17 ` Ryan Roberts
       [not found]   ` <7aefa967-43aa-490b-ae0d-7d1455402e89@redhat.com>
  2024-02-15 12:17 ` [RFC PATCH v1 2/4] mm/gup: Use ptep_get_lockless_norecency() Ryan Roberts
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 42+ messages in thread
From: Ryan Roberts @ 2024-02-15 12:17 UTC (permalink / raw)
  To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Andrew Morton, Muchun Song
  Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel

With the introduction of contpte mapping support for arm64, that
architecture's implementation of ptep_get_lockless() has become very
complex due to the need to gather access and dirty bits from across all
of the ptes in the contpte block. This requires careful implementation
to ensure the returned value is consistent (because its not possible to
read all ptes atomically), but even in the common case when there is no
racing modification, we have to read all ptes, which gives an ~O(n^2)
cost if the core-mm is iterating over a range, and performing a
ptep_get_lockless() on each pte.

Solve this by introducing ptep_get_lockless_norecency(), which does not
make any guarantees about access and dirty bits. Therefore it can simply
read the single target pte.

At the same time, convert all call sites that previously used
ptep_get_lockless() but don't care about access and dirty state.

We may want to do something similar for ptep_get() (i.e.
ptep_get_norecency()) in future; it doesn't suffer from the consistency
problem because the PTL serializes it with any modifications, but does
suffer the same O(n^2) cost.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
 kernel/events/core.c    |  2 +-
 mm/hugetlb.c            |  2 +-
 mm/khugepaged.c         |  2 +-
 mm/memory.c             |  2 +-
 mm/swap_state.c         |  2 +-
 mm/swapfile.c           |  2 +-
 7 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a36cf4e124b0..9dd40fdbd825 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 #endif /* CONFIG_PGTABLE_LEVELS > 2 */
 #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
 
-/*
- * We require that the PTE can be read atomically.
- */
 #ifndef ptep_get_lockless
+/**
+ * ptep_get_lockless - Get a pte without holding the page table lock. Young and
+ *                     dirty bits are guaranteed to accurately reflect the state
+ *                     of the pte at the time of the call.
+ * @ptep: Page table pointer for pte to get.
+ *
+ * If young and dirty information is not required, use
+ * ptep_get_lockless_norecency() which can be faster on some architectures.
+ *
+ * May be overridden by the architecture; otherwise, implemented using
+ * ptep_get(), on the assumption that it is atomic.
+ *
+ * Context: Any.
+ */
 static inline pte_t ptep_get_lockless(pte_t *ptep)
 {
 	return ptep_get(ptep);
 }
 #endif
 
+#ifndef ptep_get_lockless_norecency
+/**
+ * ptep_get_lockless_norecency - Get a pte without holding the page table lock.
+ *                               Young and dirty bits may not be accurate.
+ * @ptep: Page table pointer for pte to get.
+ *
+ * Prefer this over ptep_get_lockless() when young and dirty information is not
+ * required since it can be faster on some architectures.
+ *
+ * May be overridden by the architecture; otherwise, implemented using the more
+ * precise ptep_get_lockless().
+ *
+ * Context: Any.
+ */
+static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
+{
+	return ptep_get_lockless(ptep);
+}
+#endif
+
 #ifndef pmdp_get_lockless
 static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f0f0f71213a1..27991312d635 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7583,7 +7583,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 	if (!ptep)
 		goto again;
 
-	pte = ptep_get_lockless(ptep);
+	pte = ptep_get_lockless_norecency(ptep);
 	if (pte_present(pte))
 		size = pte_leaf_size(pte);
 	pte_unmap(ptep);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 68283e54c899..41dc44eb8454 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	if (pte) {
-		pte_t pteval = ptep_get_lockless(pte);
+		pte_t pteval = ptep_get_lockless_norecency(pte);
 
 		BUG_ON(pte_present(pteval) && !pte_huge(pteval));
 	}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2771fc043b3b..1a6c9ed8237a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 			}
 		}
 
-		vmf.orig_pte = ptep_get_lockless(pte);
+		vmf.orig_pte = ptep_get_lockless_norecency(pte);
 		if (!is_swap_pte(vmf.orig_pte))
 			continue;
diff --git a/mm/memory.c b/mm/memory.c
index 4dd8e35b593a..8e65fa1884f1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4353,7 +4353,7 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
 	int i;
 
 	for (i = 0; i < nr_pages; i++) {
-		if (!pte_none(ptep_get_lockless(pte + i)))
+		if (!pte_none(ptep_get_lockless_norecency(pte + i)))
 			return false;
 	}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2f540748f7c0..061c6c16c7ff 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -837,7 +837,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 			if (!pte)
 				break;
 		}
-		pentry = ptep_get_lockless(pte);
+		pentry = ptep_get_lockless_norecency(pte);
 		if (!is_swap_pte(pentry))
 			continue;
 		entry = pte_to_swp_entry(pentry);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d1bd8d1e17bd..c414dd238814 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1857,7 +1857,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			break;
 		}
 
-		ptent = ptep_get_lockless(pte);
+		ptent = ptep_get_lockless_norecency(pte);
 		if (!is_swap_pte(ptent))
 			continue;
-- 
2.25.1

^ permalink raw reply	[flat|nested] 42+ messages in thread
[parent not found: <7aefa967-43aa-490b-ae0d-7d1455402e89@redhat.com>]
* Re: [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency()
       [not found] ` <7aefa967-43aa-490b-ae0d-7d1455402e89@redhat.com>
@ 2024-03-26 16:39   ` Ryan Roberts
  2024-03-27  9:28     ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: Ryan Roberts @ 2024-03-26 16:39 UTC (permalink / raw)
  To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Andrew Morton, Muchun Song
  Cc: linux-arm-kernel, linux-mm, linux-kernel

On 26/03/2024 16:27, David Hildenbrand wrote:
> On 15.02.24 13:17, Ryan Roberts wrote:
>> With the introduction of contpte mapping support for arm64, that
>> architecture's implementation of ptep_get_lockless() has become very
>> complex due to the need to gather access and dirty bits from across all
>> of the ptes in the contpte block. This requires careful implementation
>> to ensure the returned value is consistent (because its not possible to
>> read all ptes atomically), but even in the common case when there is no
>> racing modification, we have to read all ptes, which gives an ~O(n^2)
>> cost if the core-mm is iterating over a range, and performing a
>> ptep_get_lockless() on each pte.
>>
>> Solve this by introducing ptep_get_lockless_norecency(), which does not
>> make any guarantees about access and dirty bits. Therefore it can simply
>> read the single target pte.
>>
>> At the same time, convert all call sites that previously used
>> ptep_get_lockless() but don't care about access and dirty state.
>>
> 
> I'd probably split that part off.

I thought the general guidance was to introduce new APIs in same patch they are
first used in? If I split this off, I'll have one patch for a new (unused) API,
then another for the first users.

> 
>> We may want to do something similar for ptep_get() (i.e.
>> ptep_get_norecency()) in future; it doesn't suffer from the consistency
>> problem because the PTL serializes it with any modifications, but does
>> suffer the same O(n^2) cost.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
>>   kernel/events/core.c    |  2 +-
>>   mm/hugetlb.c            |  2 +-
>>   mm/khugepaged.c         |  2 +-
>>   mm/memory.c             |  2 +-
>>   mm/swap_state.c         |  2 +-
>>   mm/swapfile.c           |  2 +-
>>   7 files changed, 40 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index a36cf4e124b0..9dd40fdbd825 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>>   #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>>   #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>>
>> -/*
>> - * We require that the PTE can be read atomically.
>> - */
>>   #ifndef ptep_get_lockless
>> +/**
>> + * ptep_get_lockless - Get a pte without holding the page table lock. Young and
>> + *                     dirty bits are guaranteed to accurately reflect the state
>> + *                     of the pte at the time of the call.
>> + * @ptep: Page table pointer for pte to get.
>> + *
>> + * If young and dirty information is not required, use
>> + * ptep_get_lockless_norecency() which can be faster on some architectures.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented using
>> + * ptep_get(), on the assumption that it is atomic.
>> + *
>> + * Context: Any.
>> + */
> 
> I think we usually say "Any context.". But I would just do it like idr.h:
> 
> "Any context. It is safe to call this function without locking in your code."
> 
> ... but is this true? We really want to say "without page table lock". Because
> there must be some way to prevent against concurrent page table freeing. For
> example, GUP-fast disables IRQs, whereby page table freeing code frees using RCU.

How about:

"
Context: Any context that guarrantees the page table can't be freed
concurrently. The page table lock is not required.
"

> 
>> static inline pte_t ptep_get_lockless(pte_t *ptep)
>> {
>>       return ptep_get(ptep);
>> }
>> #endif
>>
>> +#ifndef ptep_get_lockless_norecency
>> +/**
>> + * ptep_get_lockless_norecency - Get a pte without holding the page table lock.
>> + *                               Young and dirty bits may not be accurate.
>> + * @ptep: Page table pointer for pte to get.
>> + *
>> + * Prefer this over ptep_get_lockless() when young and dirty information is not
>> + * required since it can be faster on some architectures.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented using the more
>> + * precise ptep_get_lockless().
>> + *
>> + * Context: Any.
> 
> Same comment.
> 
>> + */
>> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
>> +{
>> +    return ptep_get_lockless(ptep);
>> +}
>> +#endif
> 
> [...]
> 
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 68283e54c899..41dc44eb8454 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct
>> vm_area_struct *vma,
>>       }
>>
>>       if (pte) {
>> -        pte_t pteval = ptep_get_lockless(pte);
>> +        pte_t pteval = ptep_get_lockless_norecency(pte);
>>
>>           BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>>       }
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 2771fc043b3b..1a6c9ed8237a 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct
>> *mm,
>>               }
>>           }
>>
>> -        vmf.orig_pte = ptep_get_lockless(pte);
>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte);
>>           if (!is_swap_pte(vmf.orig_pte))
>>               continue;
> 
> 
> Hm, I think you mentioned that we want to be careful with vmf.orig_pte.

Yeah good point. So I guess this should move to patch 3 (which may be dropped -
tbd)?

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency()
  2024-03-26 16:39   ` Ryan Roberts
@ 2024-03-27  9:28     ` David Hildenbrand
  2024-03-27  9:57       ` Ryan Roberts
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2024-03-27  9:28 UTC (permalink / raw)
  To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Andrew Morton, Muchun Song
  Cc: linux-arm-kernel, linux-mm, linux-kernel

On 26.03.24 17:39, Ryan Roberts wrote:
> On 26/03/2024 16:27, David Hildenbrand wrote:
>> On 15.02.24 13:17, Ryan Roberts wrote:
>>> With the introduction of contpte mapping support for arm64, that
>>> architecture's implementation of ptep_get_lockless() has become very
>>> complex due to the need to gather access and dirty bits from across all
>>> of the ptes in the contpte block. This requires careful implementation
>>> to ensure the returned value is consistent (because its not possible to
>>> read all ptes atomically), but even in the common case when there is no
>>> racing modification, we have to read all ptes, which gives an ~O(n^2)
>>> cost if the core-mm is iterating over a range, and performing a
>>> ptep_get_lockless() on each pte.
>>>
>>> Solve this by introducing ptep_get_lockless_norecency(), which does not
>>> make any guarantees about access and dirty bits. Therefore it can simply
>>> read the single target pte.
>>>
>>> At the same time, convert all call sites that previously used
>>> ptep_get_lockless() but don't care about access and dirty state.
>>>
>>
>> I'd probably split that part off.
> 
> I thought the general guidance was to introduce new APIs in same patch they are
> first used in? If I split this off, I'll have one patch for a new (unused) API,
> then another for the first users.

I don't know what exact guidance there is, but I tend to leave "non trivial
changes" to separate patches.

Some of the changes here are rather trivial (mm/hugetlb.c), and I agree that we
can perform them here.

At least the "vmf.orig_pte" looked "non-trivial" to me, thus my comment.

> 
>>
>>> We may want to do something similar for ptep_get() (i.e.
>>> ptep_get_norecency()) in future; it doesn't suffer from the consistency
>>> problem because the PTL serializes it with any modifications, but does
>>> suffer the same O(n^2) cost.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>    include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
>>>    kernel/events/core.c    |  2 +-
>>>    mm/hugetlb.c            |  2 +-
>>>    mm/khugepaged.c         |  2 +-
>>>    mm/memory.c             |  2 +-
>>>    mm/swap_state.c         |  2 +-
>>>    mm/swapfile.c           |  2 +-
>>>    7 files changed, 40 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index a36cf4e124b0..9dd40fdbd825 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>>>    #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>>>    #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>>>
>>> -/*
>>> - * We require that the PTE can be read atomically.
>>> - */
>>>    #ifndef ptep_get_lockless
>>> +/**
>>> + * ptep_get_lockless - Get a pte without holding the page table lock. Young
>>> and
>>> + *                     dirty bits are guaranteed to accurately reflect the
>>> state
>>> + *                     of the pte at the time of the call.
>>> + * @ptep: Page table pointer for pte to get.
>>> + *
>>> + * If young and dirty information is not required, use
>>> + * ptep_get_lockless_norecency() which can be faster on some architectures.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented using
>>> + * ptep_get(), on the assumption that it is atomic.
>>> + *
>>> + * Context: Any.
>>> + */
>>
>> I think we usually say "Any context.". But I would just do it like idr.h:
>>
>> "Any context. It is safe to call this function without locking in your code."
>>
>> ... but is this true? We really want to say "without page table lock". Because
>> there must be some way to prevent against concurrent page table freeing. For
>> example, GUP-fast disables IRQs, whereby page table freeing code frees using
>> RCU.
> 
> How about:
> 
> "
> Context: Any context that guarrantees the page table can't be freed

s/guarrantees/guarantees/

> concurrently. The page table lock is not required.
> "
> 

Sounds good.

>>
>>> static inline pte_t ptep_get_lockless(pte_t *ptep)
>>> {
>>>       return ptep_get(ptep);
>>> }
>>> #endif
>>>
>>> +#ifndef ptep_get_lockless_norecency
>>> +/**
>>> + * ptep_get_lockless_norecency - Get a pte without holding the page table
>>> lock.
>>> + *                               Young and dirty bits may not be accurate.
>>> + * @ptep: Page table pointer for pte to get.
>>> + *
>>> + * Prefer this over ptep_get_lockless() when young and dirty information is
>>> not
>>> + * required since it can be faster on some architectures.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented using the
>>> more
>>> + * precise ptep_get_lockless().
>>> + *
>>> + * Context: Any.
>>
>> Same comment.
>>
>>> + */
>>> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
>>> +{
>>> +    return ptep_get_lockless(ptep);
>>> +}
>>> +#endif
>>
>> [...]
>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 68283e54c899..41dc44eb8454 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct
>>> vm_area_struct *vma,
>>>       }
>>>
>>>       if (pte) {
>>> -        pte_t pteval = ptep_get_lockless(pte);
>>> +        pte_t pteval = ptep_get_lockless_norecency(pte);
>>>
>>>           BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>>>       }
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 2771fc043b3b..1a6c9ed8237a 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct
>>> *mm,
>>>               }
>>>           }
>>>
>>> -        vmf.orig_pte = ptep_get_lockless(pte);
>>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte);
>>>           if (!is_swap_pte(vmf.orig_pte))
>>>               continue;
>>
>>
>> Hm, I think you mentioned that we want to be careful with vmf.orig_pte.
> 
> Yeah good point. So I guess this should move to patch 3 (which may be dropped -
> tbd)?
> 

Yes. Or a separate one where you explain in detail why do_swap_page() can handle
it just fine.

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency()
  2024-03-27  9:28     ` David Hildenbrand
@ 2024-03-27  9:57       ` Ryan Roberts
  2024-03-27 17:02         ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: Ryan Roberts @ 2024-03-27  9:57 UTC (permalink / raw)
  To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Andrew Morton, Muchun Song
  Cc: linux-arm-kernel, linux-mm, linux-kernel

On 27/03/2024 09:28, David Hildenbrand wrote:
> On 26.03.24 17:39, Ryan Roberts wrote:
>> On 26/03/2024 16:27, David Hildenbrand wrote:
>>> On 15.02.24 13:17, Ryan Roberts wrote:
>>>> With the introduction of contpte mapping support for arm64, that
>>>> architecture's implementation of ptep_get_lockless() has become very
>>>> complex due to the need to gather access and dirty bits from across all
>>>> of the ptes in the contpte block. This requires careful implementation
>>>> to ensure the returned value is consistent (because its not possible to
>>>> read all ptes atomically), but even in the common case when there is no
>>>> racing modification, we have to read all ptes, which gives an ~O(n^2)
>>>> cost if the core-mm is iterating over a range, and performing a
>>>> ptep_get_lockless() on each pte.
>>>>
>>>> Solve this by introducing ptep_get_lockless_norecency(), which does not
>>>> make any guarantees about access and dirty bits. Therefore it can simply
>>>> read the single target pte.
>>>>
>>>> At the same time, convert all call sites that previously used
>>>> ptep_get_lockless() but don't care about access and dirty state.
>>>>
>>>
>>> I'd probably split that part off.
>>
>> I thought the general guidance was to introduce new APIs in same patch they are
>> first used in? If I split this off, I'll have one patch for a new (unused) API,
>> then another for the first users.
> 
> I don't know what exact guidance there is, but I tend to leave "non trivial
> changes" to separate patches.
> 
> Some of the changes here are rather trivial (mm/hugetlb.c), and I agree that we
> can perform them here.
> 
> At least the "vmf.orig_pte" looked "non-trivial" to me, thus my comment.

got it.

> 
>>
>>>
>>>> We may want to do something similar for ptep_get() (i.e.
>>>> ptep_get_norecency()) in future; it doesn't suffer from the consistency
>>>> problem because the PTL serializes it with any modifications, but does
>>>> suffer the same O(n^2) cost.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>     include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
>>>>     kernel/events/core.c    |  2 +-
>>>>     mm/hugetlb.c            |  2 +-
>>>>     mm/khugepaged.c         |  2 +-
>>>>     mm/memory.c             |  2 +-
>>>>     mm/swap_state.c         |  2 +-
>>>>     mm/swapfile.c           |  2 +-
>>>>     7 files changed, 40 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index a36cf4e124b0..9dd40fdbd825 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>>>>     #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>>>>     #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>>>>
>>>> -/*
>>>> - * We require that the PTE can be read atomically.
>>>> - */
>>>>     #ifndef ptep_get_lockless
>>>> +/**
>>>> + * ptep_get_lockless - Get a pte without holding the page table lock. Young
>>>> and
>>>> + *                     dirty bits are guaranteed to accurately reflect the
>>>> state
>>>> + *                     of the pte at the time of the call.
>>>> + * @ptep: Page table pointer for pte to get.
>>>> + *
>>>> + * If young and dirty information is not required, use
>>>> + * ptep_get_lockless_norecency() which can be faster on some architectures.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented using
>>>> + * ptep_get(), on the assumption that it is atomic.
>>>> + *
>>>> + * Context: Any.
>>>> + */
>>>
>>> I think we usually say "Any context.". But I would just do it like idr.h:
>>>
>>> "Any context. It is safe to call this function without locking in your code."
>>>
>>> ... but is this true? We really want to say "without page table lock". Because
>>> there must be some way to prevent against concurrent page table freeing. For
>>> example, GUP-fast disables IRQs, whereby page table freeing code frees using
>>> RCU.
>>
>> How about:
>>
>> "
>> Context: Any context that guarrantees the page table can't be freed
> 
> s/guarrantees/guarantees/
> 
>> concurrently. The page table lock is not required.
>> "
>>
> 
> Sounds good.

Great!

> 
>>>
>>>> static inline pte_t ptep_get_lockless(pte_t *ptep)
>>>> {
>>>>       return ptep_get(ptep);
>>>> }
>>>> #endif
>>>>
>>>> +#ifndef ptep_get_lockless_norecency
>>>> +/**
>>>> + * ptep_get_lockless_norecency - Get a pte without holding the page table
>>>> lock.
>>>> + *                               Young and dirty bits may not be accurate.
>>>> + * @ptep: Page table pointer for pte to get.
>>>> + *
>>>> + * Prefer this over ptep_get_lockless() when young and dirty information is
>>>> not
>>>> + * required since it can be faster on some architectures.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented using the
>>>> more
>>>> + * precise ptep_get_lockless().
>>>> + *
>>>> + * Context: Any.
>>>
>>> Same comment.
>>>
>>>> + */
>>>> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
>>>> +{
>>>> +    return ptep_get_lockless(ptep);
>>>> +}
>>>> +#endif
>>>
>>> [...]
>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index 68283e54c899..41dc44eb8454 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct
>>>> vm_area_struct *vma,
>>>>       }
>>>>
>>>>       if (pte) {
>>>> -        pte_t pteval = ptep_get_lockless(pte);
>>>> +        pte_t pteval = ptep_get_lockless_norecency(pte);
>>>>
>>>>           BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>>>>       }
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 2771fc043b3b..1a6c9ed8237a 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct
>>>> *mm,
>>>>               }
>>>>           }
>>>>
>>>> -        vmf.orig_pte = ptep_get_lockless(pte);
>>>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte);
>>>>           if (!is_swap_pte(vmf.orig_pte))
>>>>               continue;
>>>
>>>
>>> Hm, I think you mentioned that we want to be careful with vmf.orig_pte.
>>
>> Yeah good point. So I guess this should move to patch 3 (which may be dropped -
>> tbd)?
>>
> 
> Yes. Or a separate one where you explain in detail why do_swap_page() can handle
> it just fine.

Ahh no wait - I remember now; the reason I believe this is a "trivial" case is
because we only leak vmf.orig_pte to the rest of the world if its a swap entry.
And if its a swap entry, then ptep_get_lockless_norecency() is equivalent to
ptep_get_lockless() - the pte is not present so there are no access or dirty
bits. So I think this can stay here?

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency()
  2024-03-27  9:57       ` Ryan Roberts
@ 2024-03-27 17:02         ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2024-03-27 17:02 UTC (permalink / raw)
  To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Andrew Morton, Muchun Song
  Cc: linux-arm-kernel, linux-mm, linux-kernel

>>>>
>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>> index 68283e54c899..41dc44eb8454 100644
>>>>> --- a/mm/hugetlb.c
>>>>> +++ b/mm/hugetlb.c
>>>>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct
>>>>> vm_area_struct *vma,
>>>>>       }
>>>>>
>>>>>       if (pte) {
>>>>> -        pte_t pteval = ptep_get_lockless(pte);
>>>>> +        pte_t pteval = ptep_get_lockless_norecency(pte);
>>>>>
>>>>>           BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>>>>>       }
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 2771fc043b3b..1a6c9ed8237a 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct
>>>>> *mm,
>>>>>               }
>>>>>           }
>>>>>
>>>>> -        vmf.orig_pte = ptep_get_lockless(pte);
>>>>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte);
>>>>>           if (!is_swap_pte(vmf.orig_pte))
>>>>>               continue;
>>>>
>>>>
>>>> Hm, I think you mentioned that we want to be careful with vmf.orig_pte.
>>>
>>> Yeah good point. So I guess this should move to patch 3 (which may be dropped -
>>> tbd)?
>>>
>>
>> Yes. Or a separate one where you explain in detail why do_swap_page() can handle
>> it just fine.
> 
> Ahh no wait - I remember now; the reason I believe this is a "trivial" case is
> because we only leak vmf.orig_pte to the rest of the world if its a swap entry.

Ugh, yes!

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread
* [RFC PATCH v1 2/4] mm/gup: Use ptep_get_lockless_norecency()
  2024-02-15 12:17 [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 Ryan Roberts
  2024-02-15 12:17 ` [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency() Ryan Roberts
@ 2024-02-15 12:17 ` Ryan Roberts
  2024-03-26 16:30   ` David Hildenbrand
  2024-02-15 12:17 ` [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte Ryan Roberts
  ` (2 subsequent siblings)
  4 siblings, 1 reply; 42+ messages in thread
From: Ryan Roberts @ 2024-02-15 12:17 UTC (permalink / raw)
  To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Andrew Morton, Muchun Song
  Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel

Gup needs to read ptes locklessly, so it uses ptep_get_lockless().
However, the returned access and dirty bits are unimportant so let's
switch over to ptep_get_lockless_norecency().

The wrinkle is that gup needs to check that the pte hasn't changed once
it has pinned the folio following this model:

  pte = ptep_get_lockless_norecency(ptep)
  ...
  if (!pte_same(pte, ptep_get_lockless(ptep)))
      // RACE!
  ...

And now that pte may not contain correct access and dirty information,
the pte_same() comparison could spuriously fail. So let's introduce a
new pte_same_norecency() helper which will ignore the access and dirty
bits when doing the comparison.

Note that previously, ptep_get() was being used for the comparison; this
is technically incorrect because the PTL is not held. I've also
converted the comparison to use the preferred pmd_same() helper instead
of doing a raw value comparison.

As a side-effect, this new approach removes the possibility of
concurrent read/write to the page causing a spurious fast gup failure,
because the access and dirty bits are no longer used in the comparison.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 18 ++++++++++++++++++
 mm/gup.c                |  7 ++++---
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 9dd40fdbd825..8123affa8baf 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -936,6 +936,24 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
 }
 #endif
 
+/**
+ * pte_same_norecency - Compare pte_a and pte_b, ignoring young and dirty bits,
+ *                      if the ptes are present.
+ *
+ * @pte_a: First pte to compare.
+ * @pte_b: Second pte to compare.
+ *
+ * Returns 1 if the ptes match, else 0.
+ */
+static inline int pte_same_norecency(pte_t pte_a, pte_t pte_b)
+{
+	if (pte_present(pte_a))
+		pte_a = pte_mkold(pte_mkclean(pte_a));
+	if (pte_present(pte_b))
+		pte_b = pte_mkold(pte_mkclean(pte_b));
+	return pte_same(pte_a, pte_b);
+}
+
 #ifndef __HAVE_ARCH_PTE_UNUSED
 /*
  * Some architectures provide facilities to virtualization guests
diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d..0f96d0a5ec09 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2576,7 +2576,7 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 	if (!ptep)
 		return 0;
 	do {
-		pte_t pte = ptep_get_lockless(ptep);
+		pte_t pte = ptep_get_lockless_norecency(ptep);
 		struct page *page;
 		struct folio *folio;
 
@@ -2617,8 +2617,9 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 			goto pte_unmap;
 		}
 
-		if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
-		    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
+		if (unlikely(!pmd_same(pmd, *pmdp)) ||
+		    unlikely(!pte_same_norecency(pte,
+				ptep_get_lockless_norecency(ptep)))) {
 			gup_put_folio(folio, 1, flags);
 			goto pte_unmap;
 		}
-- 
2.25.1

^ permalink raw reply	[flat|nested] 42+ messages in thread
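The semantics of the pte_same_norecency() helper added above can be illustrated
with a small userspace model. This is a sketch, not kernel code: the pte bit
layout (present/young/dirty as bits 0-2) is invented for the example, and the
stub helpers mirror the kernel API names only in shape:

```c
#include <assert.h>
#include <stdint.h>

/* Toy pte layout for the sketch: bit 0 = present, bit 1 = young,
 * bit 2 = dirty; the remaining bits stand in for the pfn and other state. */
#define PTE_PRESENT (1ULL << 0)
#define PTE_YOUNG   (1ULL << 1)
#define PTE_DIRTY   (1ULL << 2)

typedef uint64_t pte_t;

static int pte_present(pte_t pte) { return pte & PTE_PRESENT; }
static pte_t pte_mkold(pte_t pte) { return pte & ~PTE_YOUNG; }
static pte_t pte_mkclean(pte_t pte) { return pte & ~PTE_DIRTY; }
static int pte_same(pte_t a, pte_t b) { return a == b; }

/* Mirrors the patch's helper: present ptes are compared with young/dirty
 * masked off, so a concurrent hardware or software update of those bits
 * cannot make the comparison spuriously fail. Non-present entries (e.g.
 * swap entries) are compared exactly, since their bits encode other state. */
static int pte_same_norecency(pte_t a, pte_t b)
{
	if (pte_present(a))
		a = pte_mkold(pte_mkclean(a));
	if (pte_present(b))
		b = pte_mkold(pte_mkclean(b));
	return pte_same(a, b);
}
```

Under this model, a pte that was marked dirty between the two reads in
gup_pte_range() still compares equal with pte_same_norecency(), whereas a pfn
change (a real race) is still detected.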
* Re: [RFC PATCH v1 2/4] mm/gup: Use ptep_get_lockless_norecency() 2024-02-15 12:17 ` [RFC PATCH v1 2/4] mm/gup: Use ptep_get_lockless_norecency() Ryan Roberts @ 2024-03-26 16:30 ` David Hildenbrand 2024-03-26 16:48 ` Ryan Roberts 0 siblings, 1 reply; 42+ messages in thread From: David Hildenbrand @ 2024-03-26 16:30 UTC (permalink / raw) To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 15.02.24 13:17, Ryan Roberts wrote: > Gup needs to read ptes locklessly, so it uses ptep_get_lockless(). > However, the returned access and dirty bits are unimportant so let's > switch over to ptep_get_lockless_norecency(). > > The wrinkle is that gup needs to check that the pte hasn't changed once > it has pinned the folio following this model: > > pte = ptep_get_lockless_norecency(ptep) > ... > if (!pte_same(pte, ptep_get_lockless(ptep))) > // RACE! > ... > > And now that pte may not contain correct access and dirty information, > the pte_same() comparison could spuriously fail. So let's introduce a > new pte_same_norecency() helper which will ignore the access and dirty > bits when doing the comparison. > > Note that previously, ptep_get() was being used for the comparison; this > is technically incorrect because the PTL is not held. I've also > converted the comparison to use the preferred pmd_same() helper instead > of doing a raw value comparison. > > As a side-effect, this new approach removes the possibility of > concurrent read/write to the page causing a spurious fast gup failure, > because the access and dirty bits are no longer used in the comparison. > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- [...] 
> #ifndef __HAVE_ARCH_PTE_UNUSED > /* > * Some architectures provide facilities to virtualization guests > diff --git a/mm/gup.c b/mm/gup.c > index df83182ec72d..0f96d0a5ec09 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2576,7 +2576,7 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > if (!ptep) > return 0; > do { > - pte_t pte = ptep_get_lockless(ptep); > + pte_t pte = ptep_get_lockless_norecency(ptep); > struct page *page; > struct folio *folio; > > @@ -2617,8 +2617,9 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr, > goto pte_unmap; > } > > - if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || > - unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) { > + if (unlikely(!pmd_same(pmd, *pmdp)) || > + unlikely(!pte_same_norecency(pte, > + ptep_get_lockless_norecency(ptep)))) { > gup_put_folio(folio, 1, flags); > goto pte_unmap; We pass the pte into pte_access_permitted(). It would be good to mention that you checked all implementations. Acked-by: David Hildenbrand <david@redhat.com> -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 2/4] mm/gup: Use ptep_get_lockless_norecency() 2024-03-26 16:30 ` David Hildenbrand @ 2024-03-26 16:48 ` Ryan Roberts 0 siblings, 0 replies; 42+ messages in thread From: Ryan Roberts @ 2024-03-26 16:48 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 26/03/2024 16:30, David Hildenbrand wrote: > On 15.02.24 13:17, Ryan Roberts wrote: >> Gup needs to read ptes locklessly, so it uses ptep_get_lockless(). >> However, the returned access and dirty bits are unimportant so let's >> switch over to ptep_get_lockless_norecency(). >> >> The wrinkle is that gup needs to check that the pte hasn't changed once >> it has pinned the folio following this model: >> >> pte = ptep_get_lockless_norecency(ptep) >> ... >> if (!pte_same(pte, ptep_get_lockless(ptep))) >> // RACE! >> ... >> >> And now that pte may not contain correct access and dirty information, >> the pte_same() comparison could spuriously fail. So let's introduce a >> new pte_same_norecency() helper which will ignore the access and dirty >> bits when doing the comparison. >> >> Note that previously, ptep_get() was being used for the comparison; this >> is technically incorrect because the PTL is not held. I've also >> converted the comparison to use the preferred pmd_same() helper instead >> of doing a raw value comparison. >> >> As a side-effect, this new approach removes the possibility of >> concurrent read/write to the page causing a spurious fast gup failure, >> because the access and dirty bits are no longer used in the comparison. >> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> --- > > [...] 
> >> #ifndef __HAVE_ARCH_PTE_UNUSED >> /* >> * Some architectures provide facilities to virtualization guests >> diff --git a/mm/gup.c b/mm/gup.c >> index df83182ec72d..0f96d0a5ec09 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -2576,7 +2576,7 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, >> unsigned long addr, >> if (!ptep) >> return 0; >> do { >> - pte_t pte = ptep_get_lockless(ptep); >> + pte_t pte = ptep_get_lockless_norecency(ptep); >> struct page *page; >> struct folio *folio; >> >> @@ -2617,8 +2617,9 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, >> unsigned long addr, >> goto pte_unmap; >> } >> >> - if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) || >> - unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) { >> + if (unlikely(!pmd_same(pmd, *pmdp)) || >> + unlikely(!pte_same_norecency(pte, >> + ptep_get_lockless_norecency(ptep)))) { >> gup_put_folio(folio, 1, flags); >> goto pte_unmap; > > We pass the pte into pte_access_permitted(). It would be good to mention that > you checked all implementations. TBH, I hadn't; I decided that since the "inaccurate access/dirty bits" was only possible on arm64, then only arm64's implementation needed checking. But given your comment, I just had a quick look at all impls. I didn't spot any problems where any impl needs the access/dirty bits. I'll add this to the commit log. > > Acked-by: David Hildenbrand <david@redhat.com> > ^ permalink raw reply [flat|nested] 42+ messages in thread
* [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte 2024-02-15 12:17 [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 Ryan Roberts 2024-02-15 12:17 ` [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency() Ryan Roberts 2024-02-15 12:17 ` [RFC PATCH v1 2/4] mm/gup: Use ptep_get_lockless_norecency() Ryan Roberts @ 2024-02-15 12:17 ` Ryan Roberts 2024-03-26 17:02 ` David Hildenbrand 2024-02-15 12:17 ` [RFC PATCH v1 4/4] arm64/mm: Override ptep_get_lockless_norecency() Ryan Roberts 2024-03-26 16:17 ` [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 David Hildenbrand 4 siblings, 1 reply; 42+ messages in thread From: Ryan Roberts @ 2024-02-15 12:17 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel Let's convert handle_pte_fault()'s use of ptep_get_lockless() to ptep_get_lockless_norecency() to save orig_pte. There are a number of places that follow this model: orig_pte = ptep_get_lockless(ptep) ... <lock> if (!pte_same(orig_pte, ptep_get(ptep))) // RACE! ... <unlock> So we need to be careful to convert all of those to use pte_same_norecency() so that the access and dirty bits are excluded from the comparison. Additionally there are a couple of places that genuinely rely on the access and dirty bits of orig_pte, but with some careful refactoring, we can use ptep_get() once we are holding the lock to achieve equivalent logic. 
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> --- mm/memory.c | 55 +++++++++++++++++++++++++++++++++-------------------- 1 file changed, 34 insertions(+), 21 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 8e65fa1884f1..075245314ec3 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3014,7 +3014,7 @@ static inline int pte_unmap_same(struct vm_fault *vmf) #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION) if (sizeof(pte_t) > sizeof(unsigned long)) { spin_lock(vmf->ptl); - same = pte_same(ptep_get(vmf->pte), vmf->orig_pte); + same = pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte); spin_unlock(vmf->ptl); } #endif @@ -3062,11 +3062,14 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src, * take a double page fault, so mark it accessed here. */ vmf->pte = NULL; - if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) { + if (!arch_has_hw_pte_young()) { pte_t entry; vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); - if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { + if (likely(vmf->pte)) + entry = ptep_get(vmf->pte); + if (unlikely(!vmf->pte || + !pte_same_norecency(entry, vmf->orig_pte))) { /* * Other thread has already handled the fault * and update local tlb only @@ -3077,9 +3080,11 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src, goto pte_unlock; } - entry = pte_mkyoung(vmf->orig_pte); - if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0)) - update_mmu_cache_range(vmf, vma, addr, vmf->pte, 1); + if (!pte_young(entry)) { + entry = pte_mkyoung(entry); + if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0)) + update_mmu_cache_range(vmf, vma, addr, vmf->pte, 1); + } } /* @@ -3094,7 +3099,8 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src, /* Re-validate under PTL if the page is still mapped */ vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl); - if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), 
vmf->orig_pte))) { + if (unlikely(!vmf->pte || + !pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte))) { /* The PTE changed under us, update local tlb */ if (vmf->pte) update_mmu_tlb(vma, addr, vmf->pte); @@ -3369,14 +3375,17 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) * Re-check the pte - we dropped the lock */ vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl); - if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { + if (likely(vmf->pte)) + entry = ptep_get(vmf->pte); + if (likely(vmf->pte && pte_same_norecency(entry, vmf->orig_pte))) { if (old_folio) { if (!folio_test_anon(old_folio)) { dec_mm_counter(mm, mm_counter_file(old_folio)); inc_mm_counter(mm, MM_ANONPAGES); } } else { - ksm_might_unmap_zero_page(mm, vmf->orig_pte); + /* Needs dirty bit so can't use vmf->orig_pte. */ + ksm_might_unmap_zero_page(mm, entry); inc_mm_counter(mm, MM_ANONPAGES); } flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte)); @@ -3494,7 +3503,7 @@ static vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf, struct folio *folio * We might have raced with another page fault while we released the * pte_offset_map_lock. */ - if (!pte_same(ptep_get(vmf->pte), vmf->orig_pte)) { + if (!pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte)) { update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); return VM_FAULT_NOPAGE; @@ -3883,7 +3892,8 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf) vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); - if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) + if (likely(vmf->pte && + pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte))) restore_exclusive_pte(vma, vmf->page, vmf->address, vmf->pte); if (vmf->pte) @@ -3928,7 +3938,7 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf) * quickly from a PTE_MARKER_UFFD_WP into PTE_MARKER_POISONED. 
* So is_pte_marker() check is not enough to safely drop the pte. */ - if (pte_same(vmf->orig_pte, ptep_get(vmf->pte))) + if (pte_same_norecency(vmf->orig_pte, ptep_get(vmf->pte))) pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); return 0; @@ -4028,8 +4038,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (unlikely(!vmf->pte || - !pte_same(ptep_get(vmf->pte), - vmf->orig_pte))) + !pte_same_norecency(ptep_get(vmf->pte), + vmf->orig_pte))) goto unlock; /* @@ -4117,7 +4127,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (likely(vmf->pte && - pte_same(ptep_get(vmf->pte), vmf->orig_pte))) + pte_same_norecency(ptep_get(vmf->pte), + vmf->orig_pte))) ret = VM_FAULT_OOM; goto unlock; } @@ -4187,7 +4198,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); - if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) + if (unlikely(!vmf->pte || + !pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte))) goto out_nomap; if (unlikely(!folio_test_uptodate(folio))) { @@ -4747,7 +4759,7 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio, static bool vmf_pte_changed(struct vm_fault *vmf) { if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID) - return !pte_same(ptep_get(vmf->pte), vmf->orig_pte); + return !pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte); return !pte_none(ptep_get(vmf->pte)); } @@ -5125,7 +5137,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) * the pfn may be screwed if the read is non atomic. 
*/ spin_lock(vmf->ptl); - if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { + if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte))) { pte_unmap_unlock(vmf->pte, vmf->ptl); goto out; } @@ -5197,7 +5209,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) vmf->address, &vmf->ptl); if (unlikely(!vmf->pte)) goto out; - if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { + if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), + vmf->orig_pte))) { pte_unmap_unlock(vmf->pte, vmf->ptl); goto out; } @@ -5343,7 +5356,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) vmf->address, &vmf->ptl); if (unlikely(!vmf->pte)) return 0; - vmf->orig_pte = ptep_get_lockless(vmf->pte); + vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte); vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID; if (pte_none(vmf->orig_pte)) { @@ -5363,7 +5376,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) spin_lock(vmf->ptl); entry = vmf->orig_pte; - if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { + if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), entry))) { update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); goto unlock; } -- 2.25.1 ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte 2024-02-15 12:17 ` [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte Ryan Roberts @ 2024-03-26 17:02 ` David Hildenbrand 2024-03-26 17:27 ` Ryan Roberts 0 siblings, 1 reply; 42+ messages in thread From: David Hildenbrand @ 2024-03-26 17:02 UTC (permalink / raw) To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 15.02.24 13:17, Ryan Roberts wrote: > Let's convert handle_pte_fault()'s use of ptep_get_lockless() to > ptep_get_lockless_norecency() to save orig_pte. > > There are a number of places that follow this model: > > orig_pte = ptep_get_lockless(ptep) > ... > <lock> > if (!pte_same(orig_pte, ptep_get(ptep))) > // RACE! > ... > <unlock> > > So we need to be careful to convert all of those to use > pte_same_norecency() so that the access and dirty bits are excluded from > the comparison. > > Additionally there are a couple of places that genuinely rely on the > access and dirty bits of orig_pte, but with some careful refactoring, we > can use ptep_get() once we are holding the lock to achieve equivalent > logic. We really should document that changed behavior somewhere where it can be easily found: that orig_pte might have incomplete/stale accessed/dirty information. 
> @@ -5343,7 +5356,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) > vmf->address, &vmf->ptl); > if (unlikely(!vmf->pte)) > return 0; > - vmf->orig_pte = ptep_get_lockless(vmf->pte); > + vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte); > vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID; > > if (pte_none(vmf->orig_pte)) { > @@ -5363,7 +5376,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) > > spin_lock(vmf->ptl); > entry = vmf->orig_pte; > - if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { > + if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), entry))) { > update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); > goto unlock; I was wondering about the following: Assume the PTE is not dirty. Thread 1 does vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte) /* not dirty */ /* Now, thread 2 ends up setting the PTE dirty under PT lock. */ spin_lock(vmf->ptl); entry = vmf->orig_pte; if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { ... } ... entry = pte_mkyoung(entry); if (ptep_set_access_flags(vmf->vma, ...) ... pte_unmap_unlock(vmf->pte, vmf->ptl); Generic ptep_set_access_flags() will do another pte_same() check and realize "hey, there was a change!" let's update the PTE! set_pte_at(vma->vm_mm, address, ptep, entry); would overwrite the dirty bit set by thread 2. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte 2024-03-26 17:02 ` David Hildenbrand @ 2024-03-26 17:27 ` Ryan Roberts 2024-03-26 17:38 ` David Hildenbrand 0 siblings, 1 reply; 42+ messages in thread From: Ryan Roberts @ 2024-03-26 17:27 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 26/03/2024 17:02, David Hildenbrand wrote: > On 15.02.24 13:17, Ryan Roberts wrote: >> Let's convert handle_pte_fault()'s use of ptep_get_lockless() to >> ptep_get_lockless_norecency() to save orig_pte. >> >> There are a number of places that follow this model: >> >> orig_pte = ptep_get_lockless(ptep) >> ... >> <lock> >> if (!pte_same(orig_pte, ptep_get(ptep))) >> // RACE! >> ... >> <unlock> >> >> So we need to be careful to convert all of those to use >> pte_same_norecency() so that the access and dirty bits are excluded from >> the comparison. >> >> Additionally there are a couple of places that genuinely rely on the >> access and dirty bits of orig_pte, but with some careful refactoring, we >> can use ptep_get() once we are holding the lock to achieve equivalent >> logic. > > We really should document that changed behavior somewhere where it can be easily > found: that orig_pte might have incomplete/stale accessed/dirty information. I could add it to the orig_pte definition in the `struct vm_fault`? 
> > >> @@ -5343,7 +5356,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) >> vmf->address, &vmf->ptl); >> if (unlikely(!vmf->pte)) >> return 0; >> - vmf->orig_pte = ptep_get_lockless(vmf->pte); >> + vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte); >> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID; >> >> if (pte_none(vmf->orig_pte)) { >> @@ -5363,7 +5376,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) >> >> spin_lock(vmf->ptl); >> entry = vmf->orig_pte; >> - if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { >> + if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), entry))) { >> update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); >> goto unlock; > > I was wondering about the following: > > Assume the PTE is not dirty. > > Thread 1 does Sorry not sure what threads have to do with this? How is the vmf shared between threads? What have I misunderstood... > > vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte) > /* not dirty */ > > /* Now, thread 2 ends up setting the PTE dirty under PT lock. */ > > spin_lock(vmf->ptl); > entry = vmf->orig_pte; > if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { > ... > } > ... > entry = pte_mkyoung(entry); Do you mean pte_mkdirty() here? You're talking about dirty everywhere else. > if (ptep_set_access_flags(vmf->vma, ...) > ... > pte_unmap_unlock(vmf->pte, vmf->ptl); > > > Generic ptep_set_access_flags() will do another pte_same() check and realize > "hey, there was a change!" let's update the PTE! > > set_pte_at(vma->vm_mm, address, ptep, entry); This is called from the generic ptep_set_access_flags() in your example, right? > > would overwrite the dirty bit set by thread 2. I'm not really sure what you are getting at... Is your concern that there is a race where the page could become dirty in the meantime and it now gets lost? I think that's why arm64 overrides ptep_set_access_flags(); since the hw can update access/dirty we have to deal with the races. 
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte 2024-03-26 17:27 ` Ryan Roberts @ 2024-03-26 17:38 ` David Hildenbrand 2024-03-26 17:48 ` Ryan Roberts 0 siblings, 1 reply; 42+ messages in thread From: David Hildenbrand @ 2024-03-26 17:38 UTC (permalink / raw) To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 26.03.24 18:27, Ryan Roberts wrote: > On 26/03/2024 17:02, David Hildenbrand wrote: >> On 15.02.24 13:17, Ryan Roberts wrote: >>> Let's convert handle_pte_fault()'s use of ptep_get_lockless() to >>> ptep_get_lockless_norecency() to save orig_pte. >>> >>> There are a number of places that follow this model: >>> >>> orig_pte = ptep_get_lockless(ptep) >>> ... >>> <lock> >>> if (!pte_same(orig_pte, ptep_get(ptep))) >>> // RACE! >>> ... >>> <unlock> >>> >>> So we need to be careful to convert all of those to use >>> pte_same_norecency() so that the access and dirty bits are excluded from >>> the comparison. >>> >>> Additionally there are a couple of places that genuinely rely on the >>> access and dirty bits of orig_pte, but with some careful refactoring, we >>> can use ptep_get() once we are holding the lock to achieve equivalent >>> logic. >> >> We really should document that changed behavior somewhere where it can be easily >> found: that orig_pte might have incomplete/stale accessed/dirty information. > > I could add it to the orig_pte definition in the `struct vm_fault`? 
> >> >> >>> @@ -5343,7 +5356,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) >>> vmf->address, &vmf->ptl); >>> if (unlikely(!vmf->pte)) >>> return 0; >>> - vmf->orig_pte = ptep_get_lockless(vmf->pte); >>> + vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte); >>> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID; >>> >>> if (pte_none(vmf->orig_pte)) { >>> @@ -5363,7 +5376,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) >>> >>> spin_lock(vmf->ptl); >>> entry = vmf->orig_pte; >>> - if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { >>> + if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), entry))) { >>> update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); >>> goto unlock; >> >> I was wondering about the following: >> >> Assume the PTE is not dirty. >> >> Thread 1 does > > Sorry not sure what threads have to do with this? How is the vmf shared between > threads? What have I misunderstood... Assume we have a HW that does not have HW-managed access/dirty bits. One that ends up using generic ptep_set_access_flags(). Access/dirty bits are always updated under PT lock. Then, imagine two threads. One is the the fault path here. another thread performs some other magic that sets the PTE dirty under PTL. > >> >> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte) >> /* not dirty */ >> >> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */ >> >> spin_lock(vmf->ptl); >> entry = vmf->orig_pte; >> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { >> ... >> } >> ... >> entry = pte_mkyoung(entry); > > Do you mean pte_mkdirty() here? You're talking about dirty everywhere else. No, that is just thread 1 seeing "oh, nothing to do" and then goes ahead and unconditionally does that in handle_pte_fault(). > >> if (ptep_set_access_flags(vmf->vma, ...) >> ... >> pte_unmap_unlock(vmf->pte, vmf->ptl); >> >> >> Generic ptep_set_access_flags() will do another pte_same() check and realize >> "hey, there was a change!" let's update the PTE! 
>> >> set_pte_at(vma->vm_mm, address, ptep, entry); > This is called from the generic ptep_set_access_flags() in your example, right? > Yes. >> >> would overwrite the dirty bit set by thread 2. > I'm not really sure what you are getting at... Is your concern that there is a > race where the page could become dirty in the meantime and it now gets lost? I > think that's why arm64 overrides ptep_set_access_flags(); since the hw can > update access/dirty we have to deal with the races. My concern is that your patch can in subtle ways lead to us losing PTE dirty bits on architectures that don't have the HW-managed dirty bit. They do exist ;) Arm64 should be fine in that regard. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte 2024-03-26 17:38 ` David Hildenbrand @ 2024-03-26 17:48 ` Ryan Roberts 2024-03-26 17:58 ` David Hildenbrand 0 siblings, 1 reply; 42+ messages in thread From: Ryan Roberts @ 2024-03-26 17:48 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 26/03/2024 17:38, David Hildenbrand wrote: > On 26.03.24 18:27, Ryan Roberts wrote: >> On 26/03/2024 17:02, David Hildenbrand wrote: >>> On 15.02.24 13:17, Ryan Roberts wrote: >>>> Let's convert handle_pte_fault()'s use of ptep_get_lockless() to >>>> ptep_get_lockless_norecency() to save orig_pte. >>>> >>>> There are a number of places that follow this model: >>>> >>>> orig_pte = ptep_get_lockless(ptep) >>>> ... >>>> <lock> >>>> if (!pte_same(orig_pte, ptep_get(ptep))) >>>> // RACE! >>>> ... >>>> <unlock> >>>> >>>> So we need to be careful to convert all of those to use >>>> pte_same_norecency() so that the access and dirty bits are excluded from >>>> the comparison. >>>> >>>> Additionally there are a couple of places that genuinely rely on the >>>> access and dirty bits of orig_pte, but with some careful refactoring, we >>>> can use ptep_get() once we are holding the lock to achieve equivalent >>>> logic. >>> >>> We really should document that changed behavior somewhere where it can be easily >>> found: that orig_pte might have incomplete/stale accessed/dirty information. >> >> I could add it to the orig_pte definition in the `struct vm_fault`? 
>> >>> >>> >>>> @@ -5343,7 +5356,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) >>>> vmf->address, &vmf->ptl); >>>> if (unlikely(!vmf->pte)) >>>> return 0; >>>> - vmf->orig_pte = ptep_get_lockless(vmf->pte); >>>> + vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte); >>>> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID; >>>> >>>> if (pte_none(vmf->orig_pte)) { >>>> @@ -5363,7 +5376,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) >>>> >>>> spin_lock(vmf->ptl); >>>> entry = vmf->orig_pte; >>>> - if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { >>>> + if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), entry))) { >>>> update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); >>>> goto unlock; >>> >>> I was wondering about the following: >>> >>> Assume the PTE is not dirty. >>> >>> Thread 1 does >> >> Sorry not sure what threads have to do with this? How is the vmf shared between >> threads? What have I misunderstood... > > Assume we have a HW that does not have HW-managed access/dirty bits. One that > ends up using generic ptep_set_access_flags(). Access/dirty bits are always > updated under PT lock. > > Then, imagine two threads. One is the the fault path here. another thread > performs some other magic that sets the PTE dirty under PTL. > >> >>> >>> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte) >>> /* not dirty */ >>> >>> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */ Ahh, this comment about thread 2 is not referring to the code immediately below it. It all makes much more sense now. :) >>> >>> spin_lock(vmf->ptl); >>> entry = vmf->orig_pte; >>> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { >>> ... >>> } >>> ... >>> entry = pte_mkyoung(entry); >> >> Do you mean pte_mkdirty() here? You're talking about dirty everywhere else. > > No, that is just thread 1 seeing "oh, nothing to do" and then goes ahead and > unconditionally does that in handle_pte_fault(). > >> >>> if (ptep_set_access_flags(vmf->vma, ...) >>> ... 
>>> pte_unmap_unlock(vmf->pte, vmf->ptl); >>> >>> >>> Generic ptep_set_access_flags() will do another pte_same() check and realize >>> "hey, there was a change!" let's update the PTE! >>> >>> set_pte_at(vma->vm_mm, address, ptep, entry); >> >> This is called from the generic ptep_set_access_flags() in your example, right? >> > > Yes. > >>> >>> would overwrite the dirty bit set by thread 2. >> >> I'm not really sure what you are getting at... Is your concern that there is a >> race where the page could become dirty in the meantime and it now gets lost? I >> think that's why arm64 overrides ptep_set_access_flags(); since the hw can >> update access/dirty we have to deal with the races. > > My concern is that your patch can in subtle ways lead to use losing PTE dirty > bits on architectures that don't have the HW-managed dirty bit. They do exist ;) But I think the example you give can already happen today? Thread 1 reads orig_pte = ptep_get_lockless(). So that's already racy, if thread 2 is going to set dirty just after the get, then thread 1 is going to set the PTE back to (a modified version of) orig_pte. Isn't it already broken? > > Arm64 should be fine in that regard. > There is plenty of arm64 HW that doesn't do HW access/dirty update. But our ptep_set_access_flags() can always deal with a racing update, even if that update originates from SW. Why do I have the feeling you're about to explain (very patiently) exactly why I'm wrong?... :) ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte 2024-03-26 17:48 ` Ryan Roberts @ 2024-03-26 17:58 ` David Hildenbrand 2024-03-27 9:51 ` Ryan Roberts 0 siblings, 1 reply; 42+ messages in thread From: David Hildenbrand @ 2024-03-26 17:58 UTC (permalink / raw) To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel >>>> >>>> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte) >>>> /* not dirty */ >>>> >>>> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */ > > Ahh, this comment about thread 2 is not referring to the code immediately below > it. It all makes much more sense now. :) Sorry :) > >>>> >>>> spin_lock(vmf->ptl); >>>> entry = vmf->orig_pte; >>>> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { >>>> ... >>>> } >>>> ... >>>> entry = pte_mkyoung(entry); >>> >>> Do you mean pte_mkdirty() here? You're talking about dirty everywhere else. >> >> No, that is just thread 1 seeing "oh, nothing to do" and then goes ahead and >> unconditionally does that in handle_pte_fault(). >> >>> >>>> if (ptep_set_access_flags(vmf->vma, ...) >>>> ... >>>> pte_unmap_unlock(vmf->pte, vmf->ptl); >>>> >>>> >>>> Generic ptep_set_access_flags() will do another pte_same() check and realize >>>> "hey, there was a change!" let's update the PTE! >>>> >>>> set_pte_at(vma->vm_mm, address, ptep, entry); >>> >>> This is called from the generic ptep_set_access_flags() in your example, right? >>> >> >> Yes. >> >>>> >>>> would overwrite the dirty bit set by thread 2. >>> >>> I'm not really sure what you are getting at... Is your concern that there is a >>> race where the page could become dirty in the meantime and it now gets lost? I >>> think that's why arm64 overrides ptep_set_access_flags(); since the hw can >>> update access/dirty we have to deal with the races. 
>>
>> My concern is that your patch can in subtle ways lead to us losing PTE dirty
>> bits on architectures that don't have the HW-managed dirty bit. They do exist ;)
>
> But I think the example you give can already happen today? Thread 1 reads
> orig_pte = ptep_get_lockless(). So that's already racy, if thread 2 is going to
> set dirty just after the get, then thread 1 is going to set the PTE back to (a
> modified version of) orig_pte. Isn't it already broken?

No, because the pte_same() check under PTL would have detected it, and we would
have backed out. And I think the problem comes to life when we convert
pte_same()->pte_same_norecency(), because we fail to protect PTE access/dirty
changes that happened under PTL from another thread.

But could be I am missing something :)

>> Arm64 should be fine in that regard.
>>
>
> There is plenty of arm64 HW that doesn't do HW access/dirty update. But our
> ptep_set_access_flags() can always deal with a racing update, even if that
> update originates from SW.
>
> Why do I have the feeling you're about to explain (very patiently) exactly why
> I'm wrong?... :)

heh ... or you'll tell me (very patiently) why I am wrong :)

--
Cheers,

David / dhildenb

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte 2024-03-26 17:58 ` David Hildenbrand @ 2024-03-27 9:51 ` Ryan Roberts 2024-03-27 17:05 ` David Hildenbrand 0 siblings, 1 reply; 42+ messages in thread From: Ryan Roberts @ 2024-03-27 9:51 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 26/03/2024 17:58, David Hildenbrand wrote: >>>>> >>>>> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte) >>>>> /* not dirty */ >>>>> >>>>> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */ >> >> Ahh, this comment about thread 2 is not referring to the code immediately below >> it. It all makes much more sense now. :) > > Sorry :) > >> >>>>> >>>>> spin_lock(vmf->ptl); >>>>> entry = vmf->orig_pte; >>>>> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { >>>>> ... >>>>> } >>>>> ... >>>>> entry = pte_mkyoung(entry); >>>> >>>> Do you mean pte_mkdirty() here? You're talking about dirty everywhere else. >>> >>> No, that is just thread 1 seeing "oh, nothing to do" and then goes ahead and >>> unconditionally does that in handle_pte_fault(). >>> >>>> >>>>> if (ptep_set_access_flags(vmf->vma, ...) >>>>> ... >>>>> pte_unmap_unlock(vmf->pte, vmf->ptl); >>>>> >>>>> >>>>> Generic ptep_set_access_flags() will do another pte_same() check and realize >>>>> "hey, there was a change!" let's update the PTE! >>>>> >>>>> set_pte_at(vma->vm_mm, address, ptep, entry); >>>> >>>> This is called from the generic ptep_set_access_flags() in your example, right? >>>> >>> >>> Yes. >>> >>>>> >>>>> would overwrite the dirty bit set by thread 2. >>>> >>>> I'm not really sure what you are getting at... Is your concern that there is a >>>> race where the page could become dirty in the meantime and it now gets lost? 
I
>>>> think that's why arm64 overrides ptep_set_access_flags(); since the hw can
>>>> update access/dirty we have to deal with the races.
>>>
>>> My concern is that your patch can in subtle ways lead to us losing PTE dirty
>>> bits on architectures that don't have the HW-managed dirty bit. They do exist ;)
>>
>> But I think the example you give can already happen today? Thread 1 reads
>> orig_pte = ptep_get_lockless(). So that's already racy, if thread 2 is going to
>> set dirty just after the get, then thread 1 is going to set the PTE back to (a
>> modified version of) orig_pte. Isn't it already broken?
>
> No, because the pte_same() check under PTL would have detected it, and we would
> have backed out. And I think the problem comes to life when we convert
> pte_same()->pte_same_norecency(), because we fail to protect PTE access/dirty
> changes that happened under PTL from another thread.

Ahh yep. Got it. I absolutely knew that you would be correct, but I still walked
right into it!

I think one could argue that the generic ptep_set_access_flags() is not
implementing its own spec:

"
... Only sets the access flags (dirty, accessed), as well as write permission.
Furthermore, we know it always gets set to a "more permissive" setting ...
"

Surely it should be folding the access and dirty bits from *ptep into entry if
they are set?

Regardless, I think this example proves that it's fragile and subtle. I'm not
really sure how to fix it more generally/robustly. Any thoughts? If not, perhaps
we are better off keeping ptep_get_lockless() around and only using
ptep_get_lockless_norecency() for the really obviously correct cases?

>
> But could be I am missing something :)
>
>>> Arm64 should be fine in that regard.
>>>
>>
>> There is plenty of arm64 HW that doesn't do HW access/dirty update. But our
>> ptep_set_access_flags() can always deal with a racing update, even if that
>> update originates from SW.
>>
>> Why do I have the feeling you're about to explain (very patiently) exactly why
>> I'm wrong?... :)
>
> heh ... or you'll tell me (very patiently) why I am wrong :)
>

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte 2024-03-27 9:51 ` Ryan Roberts @ 2024-03-27 17:05 ` David Hildenbrand 0 siblings, 0 replies; 42+ messages in thread From: David Hildenbrand @ 2024-03-27 17:05 UTC (permalink / raw) To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 27.03.24 10:51, Ryan Roberts wrote: > On 26/03/2024 17:58, David Hildenbrand wrote: >>>>>> >>>>>> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte) >>>>>> /* not dirty */ >>>>>> >>>>>> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */ >>> >>> Ahh, this comment about thread 2 is not referring to the code immediately below >>> it. It all makes much more sense now. :) >> >> Sorry :) >> >>> >>>>>> >>>>>> spin_lock(vmf->ptl); >>>>>> entry = vmf->orig_pte; >>>>>> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) { >>>>>> ... >>>>>> } >>>>>> ... >>>>>> entry = pte_mkyoung(entry); >>>>> >>>>> Do you mean pte_mkdirty() here? You're talking about dirty everywhere else. >>>> >>>> No, that is just thread 1 seeing "oh, nothing to do" and then goes ahead and >>>> unconditionally does that in handle_pte_fault(). >>>> >>>>> >>>>>> if (ptep_set_access_flags(vmf->vma, ...) >>>>>> ... >>>>>> pte_unmap_unlock(vmf->pte, vmf->ptl); >>>>>> >>>>>> >>>>>> Generic ptep_set_access_flags() will do another pte_same() check and realize >>>>>> "hey, there was a change!" let's update the PTE! >>>>>> >>>>>> set_pte_at(vma->vm_mm, address, ptep, entry); >>>>> >>>>> This is called from the generic ptep_set_access_flags() in your example, right? >>>>> >>>> >>>> Yes. >>>> >>>>>> >>>>>> would overwrite the dirty bit set by thread 2. >>>>> >>>>> I'm not really sure what you are getting at... Is your concern that there is a >>>>> race where the page could become dirty in the meantime and it now gets lost? 
I
>>>>> think that's why arm64 overrides ptep_set_access_flags(); since the hw can
>>>>> update access/dirty we have to deal with the races.
>>>>
>>>> My concern is that your patch can in subtle ways lead to us losing PTE dirty
>>>> bits on architectures that don't have the HW-managed dirty bit. They do exist ;)
>>>
>>> But I think the example you give can already happen today? Thread 1 reads
>>> orig_pte = ptep_get_lockless(). So that's already racy, if thread 2 is going to
>>> set dirty just after the get, then thread 1 is going to set the PTE back to (a
>>> modified version of) orig_pte. Isn't it already broken?
>>
>> No, because the pte_same() check under PTL would have detected it, and we would
>> have backed out. And I think the problem comes to life when we convert
>> pte_same()->pte_same_norecency(), because we fail to protect PTE access/dirty
>> changes that happened under PTL from another thread.
>
> Ahh yep. Got it. I absolutely knew that you would be correct, but I still walked
> right into it!
>
> I think one could argue that the generic ptep_set_access_flags() is not
> implementing its own spec:
>
> "
> ... Only sets the access flags (dirty, accessed), as well as write permission.
> Furthermore, we know it always gets set to a "more permissive" setting ...
> "
>
> Surely it should be folding the access and dirty bits from *ptep into entry if
> they are set?

Likely yes. Unless it's also used to clear access/dirty (don't think so, and
would not be documented). But the simplification made sense for now, because you
previously checked that pte_same(), and nobody can modify it concurrently.

>
> Regardless, I think this example proves that it's fragile and subtle. I'm not
> really sure how to fix it more generally/robustly. Any thoughts? If not, perhaps
> we are better off keeping ptep_get_lockless() around and only using
> ptep_get_lockless_norecency() for the really obviously correct cases?
Maybe one of the "sources of problems" is that we have a ptep_get_lockless_norecency() call followed by a pte_same() check, like done here. Not the source of all problems I believe, though ... -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 42+ messages in thread
* [RFC PATCH v1 4/4] arm64/mm: Override ptep_get_lockless_norecency() 2024-02-15 12:17 [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 Ryan Roberts ` (2 preceding siblings ...) 2024-02-15 12:17 ` [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte Ryan Roberts @ 2024-02-15 12:17 ` Ryan Roberts 2024-03-26 16:35 ` David Hildenbrand 2024-03-26 16:17 ` [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 David Hildenbrand 4 siblings, 1 reply; 42+ messages in thread From: Ryan Roberts @ 2024-02-15 12:17 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel Override ptep_get_lockless_norecency() when CONFIG_ARM64_CONTPTE is enabled. Because this API doesn't require the access and dirty bits to be accurate, for the contpte case, we can avoid reading all ptes in the contpte block to collect those bits, in contrast to ptep_get_lockless(). Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> --- arch/arm64/include/asm/pgtable.h | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 401087e8a43d..c0e4ccf74714 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1287,6 +1287,12 @@ static inline pte_t ptep_get_lockless(pte_t *ptep) return contpte_ptep_get_lockless(ptep); } +#define ptep_get_lockless_norecency ptep_get_lockless_norecency +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep) +{ + return __ptep_get(ptep); +} + static inline void set_pte(pte_t *ptep, pte_t pte) { /* -- 2.25.1 ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 4/4] arm64/mm: Override ptep_get_lockless_norecency() 2024-02-15 12:17 ` [RFC PATCH v1 4/4] arm64/mm: Override ptep_get_lockless_norecency() Ryan Roberts @ 2024-03-26 16:35 ` David Hildenbrand 0 siblings, 0 replies; 42+ messages in thread From: David Hildenbrand @ 2024-03-26 16:35 UTC (permalink / raw) To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 15.02.24 13:17, Ryan Roberts wrote: > Override ptep_get_lockless_norecency() when CONFIG_ARM64_CONTPTE is > enabled. Because this API doesn't require the access and dirty bits to > be accurate, for the contpte case, we can avoid reading all ptes in the > contpte block to collect those bits, in contrast to ptep_get_lockless(). > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- > arch/arm64/include/asm/pgtable.h | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > index 401087e8a43d..c0e4ccf74714 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -1287,6 +1287,12 @@ static inline pte_t ptep_get_lockless(pte_t *ptep) > return contpte_ptep_get_lockless(ptep); > } > > +#define ptep_get_lockless_norecency ptep_get_lockless_norecency > +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep) > +{ > + return __ptep_get(ptep); > +} > + > static inline void set_pte(pte_t *ptep, pte_t pte) > { > /* > -- > 2.25.1 > Reviewed-by: David Hildenbrand <david@redhat.com> -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-02-15 12:17 [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 Ryan Roberts ` (3 preceding siblings ...) 2024-02-15 12:17 ` [RFC PATCH v1 4/4] arm64/mm: Override ptep_get_lockless_norecency() Ryan Roberts @ 2024-03-26 16:17 ` David Hildenbrand 2024-03-26 16:31 ` Ryan Roberts 4 siblings, 1 reply; 42+ messages in thread From: David Hildenbrand @ 2024-03-26 16:17 UTC (permalink / raw) To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 15.02.24 13:17, Ryan Roberts wrote: > This is an RFC for a series that aims to reduce the cost and complexity of > ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1]. > The approach came from discussion with Mark and David [2]. > > It introduces a new helper, ptep_get_lockless_norecency(), which allows the > access and dirty bits in the returned pte to be incorrect. This relaxation > permits arm64's implementation to just read the single target pte, and avoids > having to iterate over the full contpte block to gather the access and dirty > bits, for the contpte case. > > It turns out that none of the call sites using ptep_get_lockless() require > accurate access and dirty bit information, so we can also convert those sites. > Although a couple of places need care (see patches 2 and 3). > > Arguably patch 3 is a bit fragile, given the wide accessibility of > vmf->orig_pte. So it might make sense to drop this patch and stick to using > ptep_get_lockless() in the page fault path. I'm keen to hear opinions. Yes. Especially as we have these pte_same() checks that might just fail now because of wrong accessed/dirty bits? Likely, we just want to read "the real deal" on both sides of the pte_same() handling. 
>
> I've chosen the name "recency" because it's shortish and somewhat descriptive,
> and is already used in a couple of places to mean similar things (see mglru and
> damon). I'm open to other names if anyone has better ideas.

Not a native speaker; works for me.

>
> If consensus is that this approach is generally acceptable, I intend to create a
> series in future to do a similar thing with ptep_get() -> ptep_get_norecency().

Yes, sounds good to me.

--
Cheers,

David / dhildenb

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-03-26 16:17 ` [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 David Hildenbrand @ 2024-03-26 16:31 ` Ryan Roberts [not found] ` <de143212-49ce-4c30-8bfa-4c0ff613f107@redhat.com> 0 siblings, 1 reply; 42+ messages in thread From: Ryan Roberts @ 2024-03-26 16:31 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 26/03/2024 16:17, David Hildenbrand wrote: > On 15.02.24 13:17, Ryan Roberts wrote: >> This is an RFC for a series that aims to reduce the cost and complexity of >> ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1]. >> The approach came from discussion with Mark and David [2]. >> >> It introduces a new helper, ptep_get_lockless_norecency(), which allows the >> access and dirty bits in the returned pte to be incorrect. This relaxation >> permits arm64's implementation to just read the single target pte, and avoids >> having to iterate over the full contpte block to gather the access and dirty >> bits, for the contpte case. >> >> It turns out that none of the call sites using ptep_get_lockless() require >> accurate access and dirty bit information, so we can also convert those sites. >> Although a couple of places need care (see patches 2 and 3). >> >> Arguably patch 3 is a bit fragile, given the wide accessibility of >> vmf->orig_pte. So it might make sense to drop this patch and stick to using >> ptep_get_lockless() in the page fault path. I'm keen to hear opinions. > > Yes. Especially as we have these pte_same() checks that might just fail now > because of wrong accessed/dirty bits? Which pte_same() checks are you referring to? I've changed them all to pte_same_norecency() which ignores the access/dirty bits when doing the comparison. 
>
> Likely, we just want to read "the real deal" on both sides of the pte_same()
> handling.

Sorry, I'm not sure I understand? You mean read the full pte including
access/dirty? That's the same as dropping the patch, right? Of course if we do
that, we still have to keep pte_get_lockless() around for this case. In an ideal
world we would convert everything over to ptep_get_lockless_norecency() and
delete ptep_get_lockless() to remove the ugliness from arm64.

>
>>
>> I've chosen the name "recency" because it's shortish and somewhat descriptive,
>> and is already used in a couple of places to mean similar things (see mglru and
>> damon). I'm open to other names if anyone has better ideas.
>
> Not a native speaker; works for me.
>
>>
>> If consensus is that this approach is generally acceptable, I intend to create a
>> series in future to do a similar thing with ptep_get() -> ptep_get_norecency().
>
> Yes, sounds good to me.
>

^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <de143212-49ce-4c30-8bfa-4c0ff613f107@redhat.com>]
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 [not found] ` <de143212-49ce-4c30-8bfa-4c0ff613f107@redhat.com> @ 2024-03-26 16:53 ` Ryan Roberts 2024-03-26 17:04 ` David Hildenbrand 0 siblings, 1 reply; 42+ messages in thread From: Ryan Roberts @ 2024-03-26 16:53 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 26/03/2024 16:34, David Hildenbrand wrote: > On 26.03.24 17:31, Ryan Roberts wrote: >> On 26/03/2024 16:17, David Hildenbrand wrote: >>> On 15.02.24 13:17, Ryan Roberts wrote: >>>> This is an RFC for a series that aims to reduce the cost and complexity of >>>> ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1]. >>>> The approach came from discussion with Mark and David [2]. >>>> >>>> It introduces a new helper, ptep_get_lockless_norecency(), which allows the >>>> access and dirty bits in the returned pte to be incorrect. This relaxation >>>> permits arm64's implementation to just read the single target pte, and avoids >>>> having to iterate over the full contpte block to gather the access and dirty >>>> bits, for the contpte case. >>>> >>>> It turns out that none of the call sites using ptep_get_lockless() require >>>> accurate access and dirty bit information, so we can also convert those sites. >>>> Although a couple of places need care (see patches 2 and 3). >>>> >>>> Arguably patch 3 is a bit fragile, given the wide accessibility of >>>> vmf->orig_pte. So it might make sense to drop this patch and stick to using >>>> ptep_get_lockless() in the page fault path. I'm keen to hear opinions. >>> >>> Yes. Especially as we have these pte_same() checks that might just fail now >>> because of wrong accessed/dirty bits? >> >> Which pte_same() checks are you referring to? 
I've changed them all to
>> pte_same_norecency() which ignores the access/dirty bits when doing the
>> comparison.
>
> I'm reading the patches just now. So I stumbled over that just after I wrote
> that, so I was missing that part from the description here.
>
>>
>>>
>>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>>> handling.
>>
>> Sorry I'm not sure I understand? You mean read the full pte including
>> access/dirty? That's the same as dropping the patch, right? Of course if we do
>> that, we still have to keep pte_get_lockless() around for this case. In an ideal
>> world we would convert everything over to ptep_get_lockless_norecency() and
>> delete ptep_get_lockless() to remove the ugliness from arm64.
>
> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any
> architecture.
>
> I do wonder if pte_same_norecency() should be defined per architecture and the
> default would be pte_same(). So we could avoid the mkold etc on all other
> architectures.

Wouldn't that break its semantics? The "norecency" of
ptep_get_lockless_norecency() means "recency information in the returned pte may
be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
access and dirty bits when you do the comparison".

I think you could only do the optimization you describe if you required that
pte_same_norecency() would only be given values returned by
ptep_get_lockless_norecency() (or ptep_get_norecency()). Even then, it's not
quite the same; if a page is accessed between gets, one will return true and the
other false.

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-03-26 16:53 ` Ryan Roberts @ 2024-03-26 17:04 ` David Hildenbrand 2024-03-26 17:32 ` Ryan Roberts 0 siblings, 1 reply; 42+ messages in thread From: David Hildenbrand @ 2024-03-26 17:04 UTC (permalink / raw) To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel >>>> >>>> Likely, we just want to read "the real deal" on both sides of the pte_same() >>>> handling. >>> >>> Sorry I'm not sure I understand? You mean read the full pte including >>> access/dirty? That's the same as dropping the patch, right? Of course if we do >>> that, we still have to keep pte_get_lockless() around for this case. In an ideal >>> world we would convert everything over to ptep_get_lockless_norecency() and >>> delete ptep_get_lockless() to remove the ugliness from arm64. >> >> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any >> architecture. >> >> I do wonder if pte_same_norecency() should be defined per architecture and the >> default would be pte_same(). So we could avoid the mkold etc on all other >> architectures. > > Wouldn't that break it's semantics? The "norecency" of > ptep_get_lockless_norecency() means "recency information in the returned pte may > be incorrect". But the "norecency" of pte_same_norecency() means "ignore the > access and dirty bits when you do the comparison". My idea was that ptep_get_lockless_norecency() would return the actual result on these architectures. So e.g., on x86, there would be no actual change in generated code. But yes, the documentation of these functions would have to be improved. Now I wonder if ptep_get_lockless_norecency() should actively clear dirty/accessed bits to more easily find any actual issues where the bits still matter ... 
>
> I think you could only do the optimization you describe if you required that
> pte_same_norecency() would only be given values returned by
> ptep_get_lockless_norecency() (or ptep_get_norecency()). Even then, it's not
> quite the same; if a page is accessed between gets, one will return true and the
> other false.

Right.

--
Cheers,

David / dhildenb

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-03-26 17:04 ` David Hildenbrand @ 2024-03-26 17:32 ` Ryan Roberts 2024-03-26 17:39 ` David Hildenbrand 0 siblings, 1 reply; 42+ messages in thread From: Ryan Roberts @ 2024-03-26 17:32 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 26/03/2024 17:04, David Hildenbrand wrote: >>>>> >>>>> Likely, we just want to read "the real deal" on both sides of the pte_same() >>>>> handling. >>>> >>>> Sorry I'm not sure I understand? You mean read the full pte including >>>> access/dirty? That's the same as dropping the patch, right? Of course if we do >>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>> ideal >>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>> >>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any >>> architecture. >>> >>> I do wonder if pte_same_norecency() should be defined per architecture and the >>> default would be pte_same(). So we could avoid the mkold etc on all other >>> architectures. >> >> Wouldn't that break it's semantics? The "norecency" of >> ptep_get_lockless_norecency() means "recency information in the returned pte may >> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >> access and dirty bits when you do the comparison". > > My idea was that ptep_get_lockless_norecency() would return the actual result on > these architectures. So e.g., on x86, there would be no actual change in > generated code. I think this is a bad plan... You'll end up with subtle differences between architectures. > > But yes, the documentation of these functions would have to be improved. 
> > Now I wonder if ptep_get_lockless_norecency() should actively clear > dirty/accessed bits to more easily find any actual issues where the bits still > matter ... I did a version that took that approach. Decided it was not as good as this way though. Now for the life of me, I can't remember my reasoning. > >> >> I think you could only do the optimization you describe if you required that >> pte_same_norecency() would only be given values returned by >> ptep_get_lockless_norecency() (or ptep_get_norecency()). Even then, its not >> quite the same; if a page is accessed between gets one will return true and the >> other false. > > Right. > ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-03-26 17:32 ` Ryan Roberts @ 2024-03-26 17:39 ` David Hildenbrand 2024-03-26 17:51 ` Ryan Roberts 0 siblings, 1 reply; 42+ messages in thread From: David Hildenbrand @ 2024-03-26 17:39 UTC (permalink / raw) To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 26.03.24 18:32, Ryan Roberts wrote: > On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>> >>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same() >>>>>> handling. >>>>> >>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>> access/dirty? That's the same as dropping the patch, right? Of course if we do >>>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>>> ideal >>>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>> >>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any >>>> architecture. >>>> >>>> I do wonder if pte_same_norecency() should be defined per architecture and the >>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>> architectures. >>> >>> Wouldn't that break it's semantics? The "norecency" of >>> ptep_get_lockless_norecency() means "recency information in the returned pte may >>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >>> access and dirty bits when you do the comparison". >> >> My idea was that ptep_get_lockless_norecency() would return the actual result on >> these architectures. So e.g., on x86, there would be no actual change in >> generated code. > > I think this is a bad plan... You'll end up with subtle differences between > architectures. 
> >> >> But yes, the documentation of these functions would have to be improved. >> >> Now I wonder if ptep_get_lockless_norecency() should actively clear >> dirty/accessed bits to more easily find any actual issues where the bits still >> matter ... > > I did a version that took that approach. Decided it was not as good as this way > though. Now for the life of me, I can't remember my reasoning. Maybe because there are some code paths that check accessed/dirty without "correctness" implications? For example, if the PTE is already dirty, no need to set it dirty etc? -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-03-26 17:39 ` David Hildenbrand @ 2024-03-26 17:51 ` Ryan Roberts 2024-03-27 9:34 ` David Hildenbrand 0 siblings, 1 reply; 42+ messages in thread From: Ryan Roberts @ 2024-03-26 17:51 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 26/03/2024 17:39, David Hildenbrand wrote: > On 26.03.24 18:32, Ryan Roberts wrote: >> On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>>> >>>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same() >>>>>>> handling. >>>>>> >>>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>>> access/dirty? That's the same as dropping the patch, right? Of course if >>>>>> we do >>>>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>>>> ideal >>>>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>>> >>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect >>>>> any >>>>> architecture. >>>>> >>>>> I do wonder if pte_same_norecency() should be defined per architecture and the >>>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>>> architectures. >>>> >>>> Wouldn't that break it's semantics? The "norecency" of >>>> ptep_get_lockless_norecency() means "recency information in the returned pte >>>> may >>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >>>> access and dirty bits when you do the comparison". >>> >>> My idea was that ptep_get_lockless_norecency() would return the actual result on >>> these architectures. So e.g., on x86, there would be no actual change in >>> generated code. >> >> I think this is a bad plan... 
You'll end up with subtle differences between >> architectures. >> >>> >>> But yes, the documentation of these functions would have to be improved. >>> >>> Now I wonder if ptep_get_lockless_norecency() should actively clear >>> dirty/accessed bits to more easily find any actual issues where the bits still >>> matter ... >> >> I did a version that took that approach. Decided it was not as good as this way >> though. Now for the life of me, I can't remember my reasoning. > > Maybe because there are some code paths that check accessed/dirty without > "correctness" implications? For example, if the PTE is already dirty, no need to > set it dirty etc? I think I decided I was penalizing the architectures that don't care because all their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so that I could feed its result into pte_same(). ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-03-26 17:51 ` Ryan Roberts @ 2024-03-27 9:34 ` David Hildenbrand 2024-03-27 10:01 ` Ryan Roberts 2024-04-03 12:59 ` Ryan Roberts 0 siblings, 2 replies; 42+ messages in thread From: David Hildenbrand @ 2024-03-27 9:34 UTC (permalink / raw) To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 26.03.24 18:51, Ryan Roberts wrote: > On 26/03/2024 17:39, David Hildenbrand wrote: >> On 26.03.24 18:32, Ryan Roberts wrote: >>> On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>>>> >>>>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same() >>>>>>>> handling. >>>>>>> >>>>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if >>>>>>> we do >>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>>>>> ideal >>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>>>> >>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect >>>>>> any >>>>>> architecture. >>>>>> >>>>>> I do wonder if pte_same_norecency() should be defined per architecture and the >>>>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>>>> architectures. >>>>> >>>>> Wouldn't that break it's semantics? The "norecency" of >>>>> ptep_get_lockless_norecency() means "recency information in the returned pte >>>>> may >>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >>>>> access and dirty bits when you do the comparison". >>>> >>>> My idea was that ptep_get_lockless_norecency() would return the actual result on >>>> these architectures. 
So e.g., on x86, there would be no actual change in >>>> generated code. >>> >>> I think this is a bad plan... You'll end up with subtle differences between >>> architectures. >>> >>>> >>>> But yes, the documentation of these functions would have to be improved. >>>> >>>> Now I wonder if ptep_get_lockless_norecency() should actively clear >>>> dirty/accessed bits to more easily find any actual issues where the bits still >>>> matter ... >>> >>> I did a version that took that approach. Decided it was not as good as this way >>> though. Now for the life of me, I can't remember my reasoning. >> >> Maybe because there are some code paths that check accessed/dirty without >> "correctness" implications? For example, if the PTE is already dirty, no need to >> set it dirty etc? > > I think I decided I was penalizing the architectures that don't care because all > their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly > clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so > that I could feed its result into pte_same(). True. With ptep_get_norecency() you're also penalizing other architectures. Therefore my original thought about making the behavior arch-specific, but the arch has to make sure to get the combination of ptep_get_lockless_norecency()+ptep_same_norecency() is right. So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must make sure to also ignore them in ptep_same_norecency(), and must be able to handle access/dirty bit changes differently. Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed accessed or dirty". Only the former would end up ignoring stuff here, the latter would not. But again, just some random thoughts how this affects other architectures and how we could avoid it. 
The issue I describe in patch #3 would be gone if ptep_same_norecency() would just do a ptep_same() check on other architectures -- and would make it easier to sell :) -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-03-27 9:34 ` David Hildenbrand @ 2024-03-27 10:01 ` Ryan Roberts 2024-04-03 12:59 ` Ryan Roberts 1 sibling, 0 replies; 42+ messages in thread From: Ryan Roberts @ 2024-03-27 10:01 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 27/03/2024 09:34, David Hildenbrand wrote: > On 26.03.24 18:51, Ryan Roberts wrote: >> On 26/03/2024 17:39, David Hildenbrand wrote: >>> On 26.03.24 18:32, Ryan Roberts wrote: >>>> On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>>>>> >>>>>>>>> Likely, we just want to read "the real deal" on both sides of the >>>>>>>>> pte_same() >>>>>>>>> handling. >>>>>>>> >>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if >>>>>>>> we do >>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>>>>>> ideal >>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>>>>> >>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect >>>>>>> any >>>>>>> architecture. >>>>>>> >>>>>>> I do wonder if pte_same_norecency() should be defined per architecture >>>>>>> and the >>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>>>>> architectures. >>>>>> >>>>>> Wouldn't that break it's semantics? The "norecency" of >>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte >>>>>> may >>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >>>>>> access and dirty bits when you do the comparison". 
>>>>> >>>>> My idea was that ptep_get_lockless_norecency() would return the actual >>>>> result on >>>>> these architectures. So e.g., on x86, there would be no actual change in >>>>> generated code. >>>> >>>> I think this is a bad plan... You'll end up with subtle differences between >>>> architectures. >>>> >>>>> >>>>> But yes, the documentation of these functions would have to be improved. >>>>> >>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear >>>>> dirty/accessed bits to more easily find any actual issues where the bits still >>>>> matter ... >>>> >>>> I did a version that took that approach. Decided it was not as good as this way >>>> though. Now for the life of me, I can't remember my reasoning. >>> >>> Maybe because there are some code paths that check accessed/dirty without >>> "correctness" implications? For example, if the PTE is already dirty, no need to >>> set it dirty etc? >> >> I think I decided I was penalizing the architectures that don't care because all >> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly >> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so >> that I could feed its result into pte_same(). > > True. With ptep_get_norecency() you're also penalizing other architectures. > Therefore my original thought about making the behavior arch-specific, but the > arch has to make sure to get the combination of > ptep_get_lockless_norecency()+ptep_same_norecency() is right. > > So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must > make sure to also ignore them in ptep_same_norecency(), and must be able to > handle access/dirty bit changes differently. > > Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed > accessed or dirty". Only the former would end up ignoring stuff here, the latter > would not. > > But again, just some random thoughts how this affects other architectures and > how we could avoid it. 
The issue I describe in patch #3 would be gone if > ptep_same_norecency() would just do a ptep_same() check on other architectures > -- and would make it easier to sell :) Perhaps - let me chew on that for a bit. It doesn't feel as easy as you suggest to me. But I can't put my finger on why... ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-03-27 9:34 ` David Hildenbrand 2024-03-27 10:01 ` Ryan Roberts @ 2024-04-03 12:59 ` Ryan Roberts 2024-04-08 8:36 ` David Hildenbrand 1 sibling, 1 reply; 42+ messages in thread From: Ryan Roberts @ 2024-04-03 12:59 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 27/03/2024 09:34, David Hildenbrand wrote: > On 26.03.24 18:51, Ryan Roberts wrote: >> On 26/03/2024 17:39, David Hildenbrand wrote: >>> On 26.03.24 18:32, Ryan Roberts wrote: >>>> On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>>>>> >>>>>>>>> Likely, we just want to read "the real deal" on both sides of the >>>>>>>>> pte_same() >>>>>>>>> handling. >>>>>>>> >>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if >>>>>>>> we do >>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>>>>>> ideal >>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>>>>> >>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect >>>>>>> any >>>>>>> architecture. >>>>>>> >>>>>>> I do wonder if pte_same_norecency() should be defined per architecture >>>>>>> and the >>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>>>>> architectures. >>>>>> >>>>>> Wouldn't that break it's semantics? The "norecency" of >>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte >>>>>> may >>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >>>>>> access and dirty bits when you do the comparison". 
>>>>> >>>>> My idea was that ptep_get_lockless_norecency() would return the actual >>>>> result on >>>>> these architectures. So e.g., on x86, there would be no actual change in >>>>> generated code. >>>> >>>> I think this is a bad plan... You'll end up with subtle differences between >>>> architectures. >>>> >>>>> >>>>> But yes, the documentation of these functions would have to be improved. >>>>> >>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear >>>>> dirty/accessed bits to more easily find any actual issues where the bits still >>>>> matter ... >>>> >>>> I did a version that took that approach. Decided it was not as good as this way >>>> though. Now for the life of me, I can't remember my reasoning. >>> >>> Maybe because there are some code paths that check accessed/dirty without >>> "correctness" implications? For example, if the PTE is already dirty, no need to >>> set it dirty etc? >> >> I think I decided I was penalizing the architectures that don't care because all >> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly >> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so >> that I could feed its result into pte_same(). > > True. With ptep_get_norecency() you're also penalizing other architectures. > Therefore my original thought about making the behavior arch-specific, but the > arch has to make sure to get the combination of > ptep_get_lockless_norecency()+ptep_same_norecency() is right. > > So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must > make sure to also ignore them in ptep_same_norecency(), and must be able to > handle access/dirty bit changes differently. > > Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed > accessed or dirty". Only the former would end up ignoring stuff here, the latter > would not. > > But again, just some random thoughts how this affects other architectures and > how we could avoid it. 
> The issue I describe in patch #3 would be gone if
> ptep_same_norecency() would just do a ptep_same() check on other architectures
> -- and would make it easier to sell :)
>

I've been thinking some more about this. I think your proposal is the following:


// ARM64
ptep_get_lockless_norecency()
{
  - returned access/dirty may be incorrect
  - returned access/dirty may be differently incorrect between 2 calls
}
pte_same_norecency()
{
  - ignore access/dirty when doing comparison
}
ptep_set_access_flags(ptep, pte)
{
  - must not assume access/dirty in pte are "more permissive" than
    access/dirty in *ptep
  - must only set access/dirty in *ptep, never clear
}


// Other arches: no change to generated code
ptep_get_lockless_norecency()
{
  return ptep_get_lockless();
}
pte_same_norecency()
{
  return pte_same();
}
ptep_set_access_flags(ptep, pte)
{
  - may assume access/dirty in pte are "more permissive" than access/dirty
    in *ptep
  - if no HW access/dirty updates, "*ptep = pte" always results in "more
    permissive" change
}

An arch either specializes all 3 or none of them.

This would allow us to get rid of ptep_get_lockless().

And it addresses the bug you found with ptep_set_access_flags().

BUT, I still have a nagging feeling that there are likely to be other similar
problems caused by ignoring access/dirty during pte_same_norecency(). I can't
convince myself that it's definitely all safe and robust.

So I'm leaning towards dropping patch 3 and therefore keeping
ptep_get_lockless() around.

Let me know if you have any insight that might help me change my mind :)

Thanks,
Ryan

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-04-03 12:59 ` Ryan Roberts @ 2024-04-08 8:36 ` David Hildenbrand 2024-04-09 16:35 ` Ryan Roberts 0 siblings, 1 reply; 42+ messages in thread From: David Hildenbrand @ 2024-04-08 8:36 UTC (permalink / raw) To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 03.04.24 14:59, Ryan Roberts wrote: > On 27/03/2024 09:34, David Hildenbrand wrote: >> On 26.03.24 18:51, Ryan Roberts wrote: >>> On 26/03/2024 17:39, David Hildenbrand wrote: >>>> On 26.03.24 18:32, Ryan Roberts wrote: >>>>> On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>>>>>> >>>>>>>>>> Likely, we just want to read "the real deal" on both sides of the >>>>>>>>>> pte_same() >>>>>>>>>> handling. >>>>>>>>> >>>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if >>>>>>>>> we do >>>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>>>>>>> ideal >>>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>>>>>> >>>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect >>>>>>>> any >>>>>>>> architecture. >>>>>>>> >>>>>>>> I do wonder if pte_same_norecency() should be defined per architecture >>>>>>>> and the >>>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>>>>>> architectures. >>>>>>> >>>>>>> Wouldn't that break it's semantics? The "norecency" of >>>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte >>>>>>> may >>>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >>>>>>> access and dirty bits when you do the comparison". 
>>>>>> >>>>>> My idea was that ptep_get_lockless_norecency() would return the actual >>>>>> result on >>>>>> these architectures. So e.g., on x86, there would be no actual change in >>>>>> generated code. >>>>> >>>>> I think this is a bad plan... You'll end up with subtle differences between >>>>> architectures. >>>>> >>>>>> >>>>>> But yes, the documentation of these functions would have to be improved. >>>>>> >>>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear >>>>>> dirty/accessed bits to more easily find any actual issues where the bits still >>>>>> matter ... >>>>> >>>>> I did a version that took that approach. Decided it was not as good as this way >>>>> though. Now for the life of me, I can't remember my reasoning. >>>> >>>> Maybe because there are some code paths that check accessed/dirty without >>>> "correctness" implications? For example, if the PTE is already dirty, no need to >>>> set it dirty etc? >>> >>> I think I decided I was penalizing the architectures that don't care because all >>> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly >>> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so >>> that I could feed its result into pte_same(). >> >> True. With ptep_get_norecency() you're also penalizing other architectures. >> Therefore my original thought about making the behavior arch-specific, but the >> arch has to make sure to get the combination of >> ptep_get_lockless_norecency()+ptep_same_norecency() is right. >> >> So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must >> make sure to also ignore them in ptep_same_norecency(), and must be able to >> handle access/dirty bit changes differently. >> >> Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed >> accessed or dirty". Only the former would end up ignoring stuff here, the latter >> would not. 
>> >> But again, just some random thoughts how this affects other architectures and >> how we could avoid it. The issue I describe in patch #3 would be gone if >> ptep_same_norecency() would just do a ptep_same() check on other architectures >> -- and would make it easier to sell :) >> > > I've been thinking some more about this. I think your proposal is the following: > > > // ARM64 > ptep_get_lockless_norecency() > { > - returned access/dirty may be incorrect > - returned access/dirty may be differently incorrect between 2 calls > } > pte_same_norecency() > { > - ignore access/dirty when doing comparison > } > ptep_set_access_flags(ptep, pte) > { > - must not assume access/dirty in pte are "more permissive" than > access/dirty in *ptep > - must only set access/dirty in *ptep, never clear > } > > > // Other arches: no change to generated code > ptep_get_lockless_norecency() > { > return ptep_get_lockless(); > } > pte_same_norecency() > { > return pte_same(); > } > ptep_set_access_flags(ptep, pte) > { > - may assume access/dirty in pte are "more permissive" than access/dirty > in *ptep > - if no HW access/dirty updates, "*ptep = pte" always results in "more > permissive" change > } > > An arch either specializes all 3 or none of them. > > This would allow us to get rid of ptep_get_lockless(). > > And it addresses the bug you found with ptep_set_access_flags(). > > > BUT, I still have a nagging feeling that there are likely to be other similar > problems caused by ignoring access/dirty during pte_same_norecency(). I can't > convince myself that its definitely all safe and robust. Right, we'd have to identify the other possible cases and document what an arch + common code must stick to to make it work. Some rules would be: if an arch implements ptep_get_lockless_norecency(): (1) Passing the result from ptep_get_lockless_norecency() to pte_same() is wrong. 
(2) Checking pte_young()/pte_old()/pte_dirty()/pte_clean() after
    ptep_get_lockless_norecency() is very likely wrong.

>
> So I'm leaning towards dropping patch 3 and therefore keeping
> ptep_get_lockless() around.
>
> Let me know if you have any insight that might help me change my mind :)

I'm wondering if it would help if we could find a better name (or concept) for
"norecency" here, that expresses that only on some archs we'd have that fuzzy
handling.

Keeping ptep_get_lockless() around for now sounds like the best alternative.

--
Cheers,

David / dhildenb

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-04-08 8:36 ` David Hildenbrand @ 2024-04-09 16:35 ` Ryan Roberts 2024-04-10 20:09 ` David Hildenbrand 0 siblings, 1 reply; 42+ messages in thread From: Ryan Roberts @ 2024-04-09 16:35 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 08/04/2024 09:36, David Hildenbrand wrote: > On 03.04.24 14:59, Ryan Roberts wrote: >> On 27/03/2024 09:34, David Hildenbrand wrote: >>> On 26.03.24 18:51, Ryan Roberts wrote: >>>> On 26/03/2024 17:39, David Hildenbrand wrote: >>>>> On 26.03.24 18:32, Ryan Roberts wrote: >>>>>> On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>>>>>>> >>>>>>>>>>> Likely, we just want to read "the real deal" on both sides of the >>>>>>>>>>> pte_same() >>>>>>>>>>> handling. >>>>>>>>>> >>>>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if >>>>>>>>>> we do >>>>>>>>>> that, we still have to keep pte_get_lockless() around for this case. >>>>>>>>>> In an >>>>>>>>>> ideal >>>>>>>>>> world we would convert everything over to >>>>>>>>>> ptep_get_lockless_norecency() and >>>>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>>>>>>> >>>>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really >>>>>>>>> affect >>>>>>>>> any >>>>>>>>> architecture. >>>>>>>>> >>>>>>>>> I do wonder if pte_same_norecency() should be defined per architecture >>>>>>>>> and the >>>>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>>>>>>> architectures. >>>>>>>> >>>>>>>> Wouldn't that break it's semantics? The "norecency" of >>>>>>>> ptep_get_lockless_norecency() means "recency information in the returned >>>>>>>> pte >>>>>>>> may >>>>>>>> be incorrect". 
But the "norecency" of pte_same_norecency() means "ignore >>>>>>>> the >>>>>>>> access and dirty bits when you do the comparison". >>>>>>> >>>>>>> My idea was that ptep_get_lockless_norecency() would return the actual >>>>>>> result on >>>>>>> these architectures. So e.g., on x86, there would be no actual change in >>>>>>> generated code. >>>>>> >>>>>> I think this is a bad plan... You'll end up with subtle differences between >>>>>> architectures. >>>>>> >>>>>>> >>>>>>> But yes, the documentation of these functions would have to be improved. >>>>>>> >>>>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear >>>>>>> dirty/accessed bits to more easily find any actual issues where the bits >>>>>>> still >>>>>>> matter ... >>>>>> >>>>>> I did a version that took that approach. Decided it was not as good as >>>>>> this way >>>>>> though. Now for the life of me, I can't remember my reasoning. >>>>> >>>>> Maybe because there are some code paths that check accessed/dirty without >>>>> "correctness" implications? For example, if the PTE is already dirty, no >>>>> need to >>>>> set it dirty etc? >>>> >>>> I think I decided I was penalizing the architectures that don't care because >>>> all >>>> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly >>>> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so >>>> that I could feed its result into pte_same(). >>> >>> True. With ptep_get_norecency() you're also penalizing other architectures. >>> Therefore my original thought about making the behavior arch-specific, but the >>> arch has to make sure to get the combination of >>> ptep_get_lockless_norecency()+ptep_same_norecency() is right. >>> >>> So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must >>> make sure to also ignore them in ptep_same_norecency(), and must be able to >>> handle access/dirty bit changes differently. 
>>> >>> Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed >>> accessed or dirty". Only the former would end up ignoring stuff here, the latter >>> would not. >>> >>> But again, just some random thoughts how this affects other architectures and >>> how we could avoid it. The issue I describe in patch #3 would be gone if >>> ptep_same_norecency() would just do a ptep_same() check on other architectures >>> -- and would make it easier to sell :) >>> >> >> I've been thinking some more about this. I think your proposal is the following: >> >> >> // ARM64 >> ptep_get_lockless_norecency() >> { >> - returned access/dirty may be incorrect >> - returned access/dirty may be differently incorrect between 2 calls >> } >> pte_same_norecency() >> { >> - ignore access/dirty when doing comparison >> } >> ptep_set_access_flags(ptep, pte) >> { >> - must not assume access/dirty in pte are "more permissive" than >> access/dirty in *ptep >> - must only set access/dirty in *ptep, never clear >> } >> >> >> // Other arches: no change to generated code >> ptep_get_lockless_norecency() >> { >> return ptep_get_lockless(); >> } >> pte_same_norecency() >> { >> return pte_same(); >> } >> ptep_set_access_flags(ptep, pte) >> { >> - may assume access/dirty in pte are "more permissive" than access/dirty >> in *ptep >> - if no HW access/dirty updates, "*ptep = pte" always results in "more >> permissive" change >> } >> >> An arch either specializes all 3 or none of them. >> >> This would allow us to get rid of ptep_get_lockless(). >> >> And it addresses the bug you found with ptep_set_access_flags(). >> >> >> BUT, I still have a nagging feeling that there are likely to be other similar >> problems caused by ignoring access/dirty during pte_same_norecency(). I can't >> convince myself that its definitely all safe and robust. > > Right, we'd have to identify the other possible cases and document what an arch > + common code must stick to to make it work. 
>
> Some rules would be: if an arch implements ptep_get_lockless_norecency():
>
> (1) Passing the result from ptep_get_lockless_norecency() to pte_same()
> is wrong.
> (2) Checking pte_young()/pte_old/pte_dirty()/pte_clean() after
> ptep_get_lockless_norecency() is very likely wrong.
>
>
>>
>> So I'm leaning towards dropping patch 3 and therefore keeping
> ptep_get_lockless() around.
>
> Let me know if you have any insight that might help me change my mind :)
> I'm wondering if it would help if we could find a better name (or concept) for
> "norecency" here, that expresses that only on some archs we'd have that fuzzy
> handling.
>
> Keeping ptep_get_lockless() around for now sounds like the best alternative.

Separately to this I've been playing with an idea to add support for uffd-wp
and soft-dirty SW PTE bits for arm64; it boils down to keeping the SW bits in
separate storage, linked from the ptdesc. And we have some constant HW PTE bits
that we can remove and replace with those SW bits so we can keep the pte_t the
same size and abstract it all with ptep_get() and set_ptes(). It was all
looking straightforward until I got to ptep_get_lockless(). Now that there are
2 separate locations for PTE bits, I can't read it all atomically.

So I've been looking at all this again, and getting myself even more confused.

I believe there are 2 classes of ptep_get_lockless() caller:

1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
2) everyone else

--

(1) doesn't really care if orig_pte is consistent or not. It just wants to read
a value, do some speculative work based on that value, then lock the PTL, and
check the value hasn't changed. If it has changed, it backs out. So we don't
actually need any "lockless" guarantees here; we could just use ptep_get().
In fact, prior to Hugh's commit 26e1a0c3277d ("mm: use pmdp_get_lockless()
without surplus barrier()"), we had this:

        vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
-       vmf->orig_pte = *vmf->pte;
+       vmf->orig_pte = ptep_get_lockless(vmf->pte);
        vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;

-       /*
-        * some architectures can have larger ptes than wordsize,
-        * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
-        * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
-        * accesses.  The code below just needs a consistent view
-        * for the ifs and we later double check anyway with the
-        * ptl lock held. So here a barrier will do.
-        */
-       barrier();
        if (pte_none(vmf->orig_pte)) {

--

(2) All the other users require that a subset of the pte fields are
self-consistent; specifically they don't care about access, dirty, uffd-wp or
soft-dirty. arm64 can guarantee that all the other bits are self-consistent
just by calling ptep_get().

--

So, I'm making the bold claim that it was never necessary to specialize
ptep_get_lockless() on arm64, and I *think* we could just delete it so that
ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
aim without needing to introduce "norecency" variants.

Additionally I propose documenting ptep_get_lockless() to describe the set of
fields that are guaranteed to be self-consistent and the remaining fields which
are consistent only on a best-effort basis.

Could it be this easy? My head is hurting...

Thanks,
Ryan

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-04-09 16:35 ` Ryan Roberts @ 2024-04-10 20:09 ` David Hildenbrand 2024-04-11 9:45 ` Ryan Roberts 0 siblings, 1 reply; 42+ messages in thread From: David Hildenbrand @ 2024-04-10 20:09 UTC (permalink / raw) To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel [...] Skipping the ptdesc stuff we discussed offline, to not get distracted. :) This stuff is killing me, sorry for the lengthy reply ... > > So I've been looking at all this again, and getting myself even more confused. > > I believe there are 2 classes of ptep_get_lockless() caller: > > 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault() > 2) everyone else Likely only completely lockless page table walkers where we *cannot* recheck under PTL is special. Essentially where we disable interrupts and rely on TLB flushes to sync against concurrent changes. Let's take a look where ptep_get_lockless() comes from: commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3 Author: Peter Zijlstra <peterz@infradead.org> Date: Fri Nov 13 11:41:40 2020 +0100 mm/gup: Provide gup_get_pte() more generic In order to write another lockless page-table walker, we need gup_get_pte() exposed. While doing that, rename it to ptep_get_lockless() to match the existing ptep_get() naming. GUP-fast, when we were still relying on TLB flushes to sync against GUP-fast. "With get_user_pages_fast(), we walk down the pagetables without taking any locks. For this we would like to load the pointers atomically, but sometimes that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). 
What we do have is the guarantee that a PTE will only either go from not
present to present, or present to not present or both -- it will not switch to
a completely different present page without a TLB flush in between; something
that we are blocking by holding interrupts off."

Later, we added support for GUP-fast that introduced the !TLB variant:

commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
Author: Steve Capper <steve.capper@linaro.org>
Date:   Thu Oct 9 15:29:14 2014 -0700

    mm: introduce a general RCU get_user_pages_fast()

With the pattern

	/*
	 * In the line below we are assuming that the pte can be read
	 * atomically. If this is not the case for your architecture,
	 * please wrap this in a helper function!
	 *
	 * for an example see gup_get_pte in arch/x86/mm/gup.c
	 */
	pte_t pte = ACCESS_ONCE(*ptep);
	...
	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
	...

Whereby the mentioned arch/x86/mm/gup.c code did a straight

	pte_t pte = gup_get_pte(ptep)

without any re-reading of PTEs. The PTE that was read was required to be sane,
thus the lengthy comment above ptep_get_lockless() that talks about TLB
flushes.

The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is still
full of details about TLB flushes syncing against GUP-fast.

But as you note, we use it even in contexts where we don't disable interrupts
and the TLB flush can't help. We don't disable interrupts during page faults
... so most of the things described in ptep_get_lockless() don't really apply.

That's also the reason why ...

>
>         vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> -       vmf->orig_pte = *vmf->pte;
> +       vmf->orig_pte = ptep_get_lockless(vmf->pte);
>         vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>
> -       /*
> -        * some architectures can have larger ptes than wordsize,
> -        * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
> -        * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
> -        * accesses.  The code below just needs a consistent view
> -        * for the ifs and we later double check anyway with the
> -        * ptl lock held. So here a barrier will do.
> -        */
> -       barrier();
>         if (pte_none(vmf->orig_pte)) {

... that code was in place. We would possibly read garbage PTEs, but would
recheck *under PTL* (where the PTE can no longer change) that what we read
wasn't garbage and didn't change.

>
> --
>
> (2) All the other users require that a subset of the pte fields are
> self-consistent; specifically they don't care about access, dirty, uffd-wp or
> soft-dirty. arm64 can guarantee that all the other bits are self-consistent
> just by calling ptep_get().

Let's focus on access+dirty for now ;)

>
> --
>
> So, I'm making the bold claim that it was never necessary to specialize
> ptep_get_lockless() on arm64, and I *think* we could just delete it so that
> ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
> aim without needing to introduce "norecency" variants.
>
> Additionally I propose documenting ptep_get_lockless() to describe the set of
> fields that are guaranteed to be self-consistent and the remaining fields which
> are consistent only on a best-effort basis.
>
> Could it be this easy? My head is hurting...

I think what has to happen is:

(1) ptep_get_lockless() must return the same value as ptep_get() as long as
    there are no races. No removal/addition of access/dirty bits etc.

(2) Lockless page table walkers that later verify under the PTL can handle
    serious "garbage PTEs". This is our page fault handler.

(3) Lockless page table walkers that cannot verify under PTL cannot handle
    arbitrary garbage PTEs. This is GUP-fast. Two options:

(3a) ptep_get_lockless() can atomically read the PTE: We re-check later if the
     atomically-read PTE is still unchanged (without PTL). No IPI for TLB
     flushes required. This is the common case. HW might concurrently set
     access/dirty bits, so we can race with that. But we don't read garbage.
(3b) pte_get_lockless() cannot atomically read the PTE: We need special magic to read somewhat-sane PTEs and need IPIs during TLB flushes to sync against serious PTE changes (e.g., present -> present). This is weird x86-PAE. If ptep_get() on arm64 can do (1), (2) and (3a), we might be good. My head is hurting ... -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
2024-04-10 20:09 ` David Hildenbrand
@ 2024-04-11 9:45 ` Ryan Roberts
[not found] ` <70a36403-aefd-4311-b612-84e602465689@redhat.com>
0 siblings, 1 reply; 42+ messages in thread
From: Ryan Roberts @ 2024-04-11 9:45 UTC (permalink / raw)
To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Andrew Morton, Muchun Song
Cc: linux-arm-kernel, linux-mm, linux-kernel

On 10/04/2024 21:09, David Hildenbrand wrote:
> [...]
>
> Skipping the ptdesc stuff we discussed offline, to not get distracted. :)
>
> This stuff is killing me, sorry for the lengthy reply ...

No problem - thanks for taking the time to think it through and reply with such clarity. :)

>
>>
>> So I've been looking at all this again, and getting myself even more confused.
>>
>> I believe there are 2 classes of ptep_get_lockless() caller:
>>
>> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
>> 2) everyone else
>
> Likely only completely lockless page table walkers where we *cannot* recheck under PTL are special. Essentially where we disable interrupts and rely on TLB flushes to sync against concurrent changes.

Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and "lockless walkers that never take the PTL".

Detail: the part about disabling interrupts and TLB flush syncing is arch-specific. That's not how arm64 does it (the hw broadcasts the TLBIs). But you make that clear further down.

>
> Let's take a look where ptep_get_lockless() comes from:
>
> commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Fri Nov 13 11:41:40 2020 +0100
>
>     mm/gup: Provide gup_get_pte() more generic
>
>     In order to write another lockless page-table walker, we need
>     gup_get_pte() exposed. While doing that, rename it to
>     ptep_get_lockless() to match the existing ptep_get() naming.
>
>
> GUP-fast, when we were still relying on TLB flushes to sync against GUP-fast.
>
> "With get_user_pages_fast(), we walk down the pagetables without taking any locks. For this we would like to load the pointers atomically, but sometimes that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What we do have is the guarantee that a PTE will only either go from not present to present, or present to not present or both -- it will not switch to a completely different present page without a TLB flush in between; something that we are blocking by holding interrupts off."
>
> Later, we added support for GUP-fast that introduced the !TLB variant:
>
> commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
> Author: Steve Capper <steve.capper@linaro.org>
> Date:   Thu Oct 9 15:29:14 2014 -0700
>
>     mm: introduce a general RCU get_user_pages_fast()
>
> With the pattern
>
> 	/*
> 	 * In the line below we are assuming that the pte can be read
> 	 * atomically. If this is not the case for your architecture,
> 	 * please wrap this in a helper function!
> 	 *
> 	 * for an example see gup_get_pte in arch/x86/mm/gup.c
> 	 */
> 	pte_t pte = ACCESS_ONCE(*ptep);
> 	...
> 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> 	...
>
>
> Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte = gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read was required to be sane, thus the lengthy comment above ptep_get_lockless() that talks about TLB flushes.
>
> The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is still full of details about TLB flushes syncing against GUP-fast. But as you note, we use it even in contexts where we don't disable interrupts and the TLB flush can't help.
>
> We don't disable interrupts during page faults ... so most of the things described in ptep_get_lockless() don't really apply.
>
> That's also the reason why ...

>>
>>          vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> -       vmf->orig_pte = *vmf->pte;
>> +       vmf->orig_pte = ptep_get_lockless(vmf->pte);
>>         vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>>
>> -       /*
>> -        * some architectures can have larger ptes than wordsize,
>> -        * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
>> -        * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
>> -        * accesses. The code below just needs a consistent view
>> -        * for the ifs and we later double check anyway with the
>> -        * ptl lock held. So here a barrier will do.
>> -        */
>> -       barrier();
>>         if (pte_none(vmf->orig_pte)) {
>
> ... that code was in place. We would possibly read garbage PTEs, but would recheck *under PTL* (where the PTE can no longer change) that what we read wasn't garbage and didn't change.

Agreed.

>
>>
>> --
>>
>> (2) All the other users require that a subset of the pte fields are self-consistent; specifically they don't care about access, dirty, uffd-wp or soft-dirty. arm64 can guarantee that all the other bits are self-consistent just by calling ptep_get().
>
> Let's focus on access+dirty for now ;)
>
>>
>> --
>>
>> So, I'm making the bold claim that it was never necessary to specialize pte_get_lockless() on arm64, and I *think* we could just delete it so that ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original aim without needing to introduce "norecency" variants.
>>
>> Additionally I propose documenting ptep_get_lockless() to describe the set of fields that are guaranteed to be self-consistent and the remaining fields which are self-consistent only with best-effort.
>>
>> Could it be this easy? My head is hurting...
>
> I think what has to happen is:
>
> (1) pte_get_lockless() must return the same value as ptep_get() as long as there are no races. No removal/addition of access/dirty bits etc.

Today's arm64 ptep_get() guarantees this.

>
> (2) Lockless page table walkers that later verify under the PTL can handle serious "garbage PTEs". This is our page fault handler.

This isn't really a property of a ptep_get_lockless(); it's a statement about a class of users. I agree with the statement.

>
> (3) Lockless page table walkers that cannot verify under PTL cannot handle arbitrary garbage PTEs. This is GUP-fast. Two options:
>
> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes required. This is the common case. HW might concurrently set access/dirty bits, so we can race with that. But we don't read garbage.

Today's arm64 ptep_get() cannot guarantee that the access/dirty bits are consistent for contpte ptes. That's the bit that complicates the current ptep_get_lockless() implementation.

But the point I was trying to make is that GUP-fast does not actually care about *all* the fields being consistent (e.g. access/dirty). So we could spec pte_get_lockless() to say that "all fields in the returned pte are guaranteed to be self-consistent except for access and dirty information, which may be inconsistent if a racing modification occurred".

This could mean that the access/dirty state *does* change for a given page while GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think* that failing to detect this is benign.

Aside: GUP-fast currently rechecks the pte originally obtained with ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform to (1), so either it returns the same pte or it returns a different pte or garbage. But that garbage could just happen to be the same as the originally obtained pte. So in that case, it would have a false match. I think this needs to be changed to ptep_get_lockless()?

>
> (3b) pte_get_lockless() cannot atomically read the PTE: We need special magic to read somewhat-sane PTEs and need IPIs during TLB flushes to sync against serious PTE changes (e.g., present -> present). This is weird x86-PAE.
>
>
> If ptep_get() on arm64 can do (1), (2) and (3a), we might be good.
>
> My head is hurting ...
>

^ permalink raw reply	[flat|nested] 42+ messages in thread
[parent not found: <70a36403-aefd-4311-b612-84e602465689@redhat.com>]
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
[not found] ` <70a36403-aefd-4311-b612-84e602465689@redhat.com>
@ 2024-04-15 9:28 ` Ryan Roberts
[not found] ` <3e50030d-2289-4470-a727-a293baa21618@redhat.com>
0 siblings, 1 reply; 42+ messages in thread
From: Ryan Roberts @ 2024-04-15 9:28 UTC (permalink / raw)
To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Andrew Morton, Muchun Song
Cc: linux-arm-kernel, linux-mm, linux-kernel

On 12/04/2024 21:16, David Hildenbrand wrote:
>>
>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and "lockless walkers that never take the PTL".
>>
>> Detail: the part about disabling interrupts and TLB flush syncing is arch-specific. That's not how arm64 does it (the hw broadcasts the TLBIs). But you make that clear further down.
>
> Yes, but disabling interrupts is also required for RCU-freeing of page tables such that they can be walked safely. The TLB flush IPI is arch-specific and indeed to sync against PTE invalidation (before generic GUP-fast).
> [...]
>
>>>>
>>>> Could it be this easy? My head is hurting...
>>>
>>> I think what has to happen is:
>>>
>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as there are no races. No removal/addition of access/dirty bits etc.
>>
>> Today's arm64 ptep_get() guarantees this.
>>
>>>
>>> (2) Lockless page table walkers that later verify under the PTL can handle serious "garbage PTEs". This is our page fault handler.
>>
>> This isn't really a property of a ptep_get_lockless(); it's a statement about a class of users. I agree with the statement.
>
> Yes. That's a requirement for the user of ptep_get_lockless(), such as page fault handlers. Well, mostly "not GUP".
>
>>
>>>
>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>
>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes required. This is the common case. HW might concurrently set access/dirty bits, so we can race with that. But we don't read garbage.
>>
>> Today's arm64 ptep_get() cannot guarantee that the access/dirty bits are consistent for contpte ptes. That's the bit that complicates the current ptep_get_lockless() implementation.
>>
>> But the point I was trying to make is that GUP-fast does not actually care about *all* the fields being consistent (e.g. access/dirty). So we could spec pte_get_lockless() to say that "all fields in the returned pte are guaranteed to be self-consistent except for access and dirty information, which may be inconsistent if a racing modification occurred".
>
> We *might* have KVM in the future want to check that a PTE is dirty, such that we can only allow dirty PTEs to be writable in a secondary MMU. That's not there yet, but one thing I was discussing on the list recently. Buried in:
>
> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com
>
> We wouldn't care about racing modifications, as long as MMU notifiers will properly notify us when the PTE would lose its dirty bits.
>
> But getting false-positive dirty bits would be problematic.
>
>>
>> This could mean that the access/dirty state *does* change for a given page while GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think* that failing to detect this is benign.
>
> I mean, HW could just set the dirty/access bit immediately after the check. So if HW concurrently sets the bit and we don't observe that change when we recheck, I think that would be perfectly fine.

Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or soft-dirty or uffd-wp).

But if you don't want to change the ptep_get_lockless() spec to explicitly allow this (because you have the KVM use case where false-positive dirty is problematic), then I think we are stuck with ptep_get_lockless() as implemented for arm64 today.

>
>>
>> Aside: GUP-fast currently rechecks the pte originally obtained with ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform to (1), so either it returns the same pte or it returns a different pte or garbage. But that garbage could just happen to be the same as the originally obtained pte. So in that case, it would have a false match. I think this needs to be changed to ptep_get_lockless()?
>
> I *think* it's fine, because the case where it would make a difference (x86-PAE) still requires the TLB flush IPI to sync against PTE changes, and that check would likely be wrong in one way or the other. So for x86-pae, that check is just moot either way.
>
> That's my theory, at least.
>
> (but this "let's fake-read atomically although we don't, but let's do like we could in some specific circumstances" is really hard to get)
>
> I was wondering a while ago if we are missing a memory barrier before the check, but I think the one from obtaining the page reference gets the job done (at least that's what I remember).
>

^ permalink raw reply	[flat|nested] 42+ messages in thread
[parent not found: <3e50030d-2289-4470-a727-a293baa21618@redhat.com>]
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
[not found] ` <3e50030d-2289-4470-a727-a293baa21618@redhat.com>
@ 2024-04-15 13:30 ` Ryan Roberts
[not found] ` <969dc6c3-2764-4a35-9fa6-7596832fb2a3@redhat.com>
0 siblings, 1 reply; 42+ messages in thread
From: Ryan Roberts @ 2024-04-15 13:30 UTC (permalink / raw)
To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Andrew Morton, Muchun Song
Cc: linux-arm-kernel, linux-mm, linux-kernel

On 15/04/2024 11:57, David Hildenbrand wrote:
> On 15.04.24 11:28, Ryan Roberts wrote:
>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>
>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and "lockless walkers that never take the PTL".
>>>>
>>>> Detail: the part about disabling interrupts and TLB flush syncing is arch-specific. That's not how arm64 does it (the hw broadcasts the TLBIs). But you make that clear further down.
>>>
>>> Yes, but disabling interrupts is also required for RCU-freeing of page tables such that they can be walked safely. The TLB flush IPI is arch-specific and indeed to sync against PTE invalidation (before generic GUP-fast).
>>> [...]
>>>
>>>>>>
>>>>>> Could it be this easy? My head is hurting...
>>>>>
>>>>> I think what has to happen is:
>>>>>
>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as there are no races. No removal/addition of access/dirty bits etc.
>>>>
>>>> Today's arm64 ptep_get() guarantees this.
>>>>
>>>>>
>>>>> (2) Lockless page table walkers that later verify under the PTL can handle serious "garbage PTEs". This is our page fault handler.
>>>>
>>>> This isn't really a property of a ptep_get_lockless(); it's a statement about a class of users. I agree with the statement.
>>>
>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page fault handlers. Well, mostly "not GUP".
>>>
>>>>
>>>>>
>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>
>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes required. This is the common case. HW might concurrently set access/dirty bits, so we can race with that. But we don't read garbage.
>>>>
>>>> Today's arm64 ptep_get() cannot guarantee that the access/dirty bits are consistent for contpte ptes. That's the bit that complicates the current ptep_get_lockless() implementation.
>>>>
>>>> But the point I was trying to make is that GUP-fast does not actually care about *all* the fields being consistent (e.g. access/dirty). So we could spec pte_get_lockless() to say that "all fields in the returned pte are guaranteed to be self-consistent except for access and dirty information, which may be inconsistent if a racing modification occurred".
>>>
>>> We *might* have KVM in the future want to check that a PTE is dirty, such that we can only allow dirty PTEs to be writable in a secondary MMU. That's not there yet, but one thing I was discussing on the list recently. Buried in:
>>>
>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com
>>>
>>> We wouldn't care about racing modifications, as long as MMU notifiers will properly notify us when the PTE would lose its dirty bits.
>>>
>>> But getting false-positive dirty bits would be problematic.
>>>
>>>>
>>>> This could mean that the access/dirty state *does* change for a given page while GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think* that failing to detect this is benign.
>>>
>>> I mean, HW could just set the dirty/access bit immediately after the check. So if HW concurrently sets the bit and we don't observe that change when we recheck, I think that would be perfectly fine.
>>
>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or soft-dirty or uffd-wp).
>>
>> But if you don't want to change the ptep_get_lockless() spec to explicitly allow this (because you have the KVM use case where false-positive dirty is problematic), then I think we are stuck with ptep_get_lockless() as implemented for arm64 today.
>
> At least regarding the dirty bit, we'd have to guarantee that if ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck would be able to catch that.
>
> Would that be possible?

Hmm maybe. My head hurts. Let me try to work through some examples...

Let's imagine for this example, a contpte block is 4 PTEs. Let's say PTEs 0, 1, 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is stored in PTE0 for the contpte block, and it is dirty.

Now let's say there are 2 racing threads:

 - ThreadA is doing a GUP-fast for PTE3
 - ThreadB is remapping order-0 FolioB at PTE0

(ptep_get_lockless() below is actually arm64's ptep_get() for the sake of the example - today's arm64 ptep_get_lockless() can handle the below correctly).

ThreadA                           ThreadB
=======                           =======

gup_pte_range()
  pte1 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)
                                  mmap(PTE0)
                                    clear_pte(PTE0)
                                      unfold(PTE0 - PTE3)
                                        WRITE_ONCE(PTE0, 0)
                                        WRITE_ONCE(PTE1, 0)
                                        WRITE_ONCE(PTE2, 0)
    READ_ONCE(PTE0) (for a/d)  << CLEAN!!
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  <do speculative work with pte1 content>
  pte2 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)
    READ_ONCE(PTE0) (for a/d)
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  true = pte_same(pte1, pte2)
                                        WRITE_ONCE(PTE3, 0)
                                        TLBI
                                        WRITE_ONCE(PTE0, <orig & ~CONT>)
                                        WRITE_ONCE(PTE1, <orig & ~CONT>)
                                        WRITE_ONCE(PTE2, <orig & ~CONT>)
                                        WRITE_ONCE(PTE3, <orig & ~CONT>)
                                      WRITE_ONCE(PTE0, 0)
                                    set_pte_at(PTE0, <new>)

This example shows how a *false-negative* can be returned for the dirty state, which isn't detected by the check.

I've been unable to come up with an example where a *false-positive* can be returned for dirty state without the second ptep_get_lockless() noticing. In this second example, let's assume everything is the same except FolioA is initially clean:

ThreadA                           ThreadB
=======                           =======

gup_pte_range()
  pte1 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)
                                  mmap(PTE0)
                                    clear_pte(PTE0)
                                      unfold(PTE0 - PTE3)
                                        WRITE_ONCE(PTE0, 0)
                                        WRITE_ONCE(PTE1, 0)
                                        WRITE_ONCE(PTE2, 0)
                                        WRITE_ONCE(PTE3, 0)
                                        TLBI
                                        WRITE_ONCE(PTE0, <orig & ~CONT>)
                                        WRITE_ONCE(PTE1, <orig & ~CONT>)
                                        WRITE_ONCE(PTE2, <orig & ~CONT>)
                                        WRITE_ONCE(PTE3, <orig & ~CONT>)
                                      WRITE_ONCE(PTE0, 0)
                                    set_pte_at(PTE0, <new>)
                                  write to FolioB - HW sets PTE0's dirty
    READ_ONCE(PTE0) (for a/d)  << DIRTY!!
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  <do speculative work with pte1 content>
  pte2 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)  << BUT THIS IS FOR FolioB
    READ_ONCE(PTE0) (for a/d)
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  false = pte_same(pte1, pte2)  << So this fails

The only way I can see false-positive not being caught in the second example is if ThreadB subsequently remaps the original folio, so you have an ABA scenario. But these lockless table walkers are already susceptible to that.

I think all the same arguments can be extended to the access bit.

For me this is all getting rather subtle and difficult to reason about and even harder to spec in a comprehensible way. The best I could come up with is:

"All fields in the returned pte are guaranteed to be self-consistent except for access and dirty information, which may be inconsistent if a racing modification occurred. Additionally it is guaranteed that false-positive access and/or dirty information is not possible if 2 calls are made and both ptes are the same. Only false-negative access and/or dirty information is possible in this scenario."

which is starting to sound bonkers. Personally I think we are better off at this point, just keeping today's arm64 ptep_get_lockless().

Thanks,
Ryan

^ permalink raw reply	[flat|nested] 42+ messages in thread
[parent not found: <969dc6c3-2764-4a35-9fa6-7596832fb2a3@redhat.com>]
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
[not found] ` <969dc6c3-2764-4a35-9fa6-7596832fb2a3@redhat.com>
@ 2024-04-15 14:34 ` Ryan Roberts
[not found] ` <11b1c25b-3e20-4acf-9be5-57b508266c5b@redhat.com>
0 siblings, 1 reply; 42+ messages in thread
From: Ryan Roberts @ 2024-04-15 14:34 UTC (permalink / raw)
To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Andrew Morton, Muchun Song
Cc: linux-arm-kernel, linux-mm, linux-kernel

On 15/04/2024 15:23, David Hildenbrand wrote:
> On 15.04.24 15:30, Ryan Roberts wrote:
>> On 15/04/2024 11:57, David Hildenbrand wrote:
>>> On 15.04.24 11:28, Ryan Roberts wrote:
>>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>>
>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and "lockless walkers that never take the PTL".
>>>>>>
>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is arch-specific. That's not how arm64 does it (the hw broadcasts the TLBIs). But you make that clear further down.
>>>>>
>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page tables such that they can be walked safely. The TLB flush IPI is arch-specific and indeed to sync against PTE invalidation (before generic GUP-fast).
>>>>> [...]
>>>>>
>>>>>>>>
>>>>>>>> Could it be this easy? My head is hurting...
>>>>>>>
>>>>>>> I think what has to happen is:
>>>>>>>
>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as there are no races. No removal/addition of access/dirty bits etc.
>>>>>>
>>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>>
>>>>>>>
>>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle serious "garbage PTEs". This is our page fault handler.
>>>>>>
>>>>>> This isn't really a property of a ptep_get_lockless(); it's a statement about a class of users. I agree with the statement.
>>>>>
>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page fault handlers. Well, mostly "not GUP".
>>>>>
>>>>>>
>>>>>>>
>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>>>
>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes required. This is the common case. HW might concurrently set access/dirty bits, so we can race with that. But we don't read garbage.
>>>>>>
>>>>>> Today's arm64 ptep_get() cannot guarantee that the access/dirty bits are consistent for contpte ptes. That's the bit that complicates the current ptep_get_lockless() implementation.
>>>>>>
>>>>>> But the point I was trying to make is that GUP-fast does not actually care about *all* the fields being consistent (e.g. access/dirty). So we could spec pte_get_lockless() to say that "all fields in the returned pte are guaranteed to be self-consistent except for access and dirty information, which may be inconsistent if a racing modification occurred".
>>>>>
>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such that we can only allow dirty PTEs to be writable in a secondary MMU. That's not there yet, but one thing I was discussing on the list recently. Buried in:
>>>>>
>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com
>>>>>
>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will properly notify us when the PTE would lose its dirty bits.
>>>>>
>>>>> But getting false-positive dirty bits would be problematic.
>>>>>
>>>>>>
>>>>>> This could mean that the access/dirty state *does* change for a given page while GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think* that failing to detect this is benign.
>>>>>
>>>>> I mean, HW could just set the dirty/access bit immediately after the check. So if HW concurrently sets the bit and we don't observe that change when we recheck, I think that would be perfectly fine.
>>>>
>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or soft-dirty or uffd-wp).
>>>>
>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly allow this (because you have the KVM use case where false-positive dirty is problematic), then I think we are stuck with ptep_get_lockless() as implemented for arm64 today.
>>>
>>> At least regarding the dirty bit, we'd have to guarantee that if ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck would be able to catch that.
>>>
>>> Would that be possible?
>>
>> Hmm maybe. My head hurts. Let me try to work through some examples...
>>
>> Let's imagine for this example, a contpte block is 4 PTEs. Let's say PTEs 0, 1, 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is stored in PTE0 for the contpte block, and it is dirty.
>>
>> Now let's say there are 2 racing threads:
>>
>>  - ThreadA is doing a GUP-fast for PTE3
>>  - ThreadB is remapping order-0 FolioB at PTE0
>>
>> (ptep_get_lockless() below is actually arm64's ptep_get() for the sake of the example - today's arm64 ptep_get_lockless() can handle the below correctly).
>>
>> ThreadA                           ThreadB
>> =======                           =======
>>
>> gup_pte_range()
>>   pte1 = ptep_get_lockless(PTE3)
>>     READ_ONCE(PTE3)
>>                                   mmap(PTE0)
>>                                     clear_pte(PTE0)
>>                                       unfold(PTE0 - PTE3)
>>                                         WRITE_ONCE(PTE0, 0)
>>                                         WRITE_ONCE(PTE1, 0)
>>                                         WRITE_ONCE(PTE2, 0)
>>     READ_ONCE(PTE0) (for a/d)  << CLEAN!!
>>     READ_ONCE(PTE1) (for a/d)
>>     READ_ONCE(PTE2) (for a/d)
>>     READ_ONCE(PTE3) (for a/d)
>>   <do speculative work with pte1 content>
>>   pte2 = ptep_get_lockless(PTE3)
>>     READ_ONCE(PTE3)
>>     READ_ONCE(PTE0) (for a/d)
>>     READ_ONCE(PTE1) (for a/d)
>>     READ_ONCE(PTE2) (for a/d)
>>     READ_ONCE(PTE3) (for a/d)
>>   true = pte_same(pte1, pte2)
>>                                         WRITE_ONCE(PTE3, 0)
>>                                         TLBI
>>                                         WRITE_ONCE(PTE0, <orig & ~CONT>)
>>                                         WRITE_ONCE(PTE1, <orig & ~CONT>)
>>                                         WRITE_ONCE(PTE2, <orig & ~CONT>)
>>                                         WRITE_ONCE(PTE3, <orig & ~CONT>)
>>                                       WRITE_ONCE(PTE0, 0)
>>                                     set_pte_at(PTE0, <new>)
>>
>> This example shows how a *false-negative* can be returned for the dirty state, which isn't detected by the check.
>>
>> I've been unable to come up with an example where a *false-positive* can be returned for dirty state without the second ptep_get_lockless() noticing. In this second example, let's assume everything is the same except FolioA is initially clean:
>>
>> ThreadA                           ThreadB
>> =======                           =======
>>
>> gup_pte_range()
>>   pte1 = ptep_get_lockless(PTE3)
>>     READ_ONCE(PTE3)
>>                                   mmap(PTE0)
>>                                     clear_pte(PTE0)
>>                                       unfold(PTE0 - PTE3)
>>                                         WRITE_ONCE(PTE0, 0)
>>                                         WRITE_ONCE(PTE1, 0)
>>                                         WRITE_ONCE(PTE2, 0)
>>                                         WRITE_ONCE(PTE3, 0)
>>                                         TLBI
>>                                         WRITE_ONCE(PTE0, <orig & ~CONT>)
>>                                         WRITE_ONCE(PTE1, <orig & ~CONT>)
>>                                         WRITE_ONCE(PTE2, <orig & ~CONT>)
>>                                         WRITE_ONCE(PTE3, <orig & ~CONT>)
>>                                       WRITE_ONCE(PTE0, 0)
>>                                     set_pte_at(PTE0, <new>)
>>                                   write to FolioB - HW sets PTE0's dirty
>>     READ_ONCE(PTE0) (for a/d)  << DIRTY!!
>>     READ_ONCE(PTE1) (for a/d)
>>     READ_ONCE(PTE2) (for a/d)
>>     READ_ONCE(PTE3) (for a/d)
>>   <do speculative work with pte1 content>
>>   pte2 = ptep_get_lockless(PTE3)
>>     READ_ONCE(PTE3)  << BUT THIS IS FOR FolioB
>>     READ_ONCE(PTE0) (for a/d)
>>     READ_ONCE(PTE1) (for a/d)
>>     READ_ONCE(PTE2) (for a/d)
>>     READ_ONCE(PTE3) (for a/d)
>>   false = pte_same(pte1, pte2)  << So this fails
>>
>> The only way I can see false-positive not being caught in the second example is if ThreadB subsequently remaps the original folio, so you have an ABA scenario. But these lockless table walkers are already susceptible to that.
>>
>> I think all the same arguments can be extended to the access bit.
>>
>> For me this is all getting rather subtle and difficult to reason about and even harder to spec in a comprehensible way. The best I could come up with is:
>>
>> "All fields in the returned pte are guaranteed to be self-consistent except for access and dirty information, which may be inconsistent if a racing modification occurred. Additionally it is guaranteed that false-positive access and/or dirty information is not possible if 2 calls are made and both ptes are the same. Only false-negative access and/or dirty information is possible in this scenario."
>>
>> which is starting to sound bonkers. Personally I think we are better off at this point, just keeping today's arm64 ptep_get_lockless().
>
> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that would sync against GUP-fast?

No, the broadcast is in HW. There is no IPI.

^ permalink raw reply	[flat|nested] 42+ messages in thread
[parent not found: <11b1c25b-3e20-4acf-9be5-57b508266c5b@redhat.com>]
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 [not found] ` <11b1c25b-3e20-4acf-9be5-57b508266c5b@redhat.com> @ 2024-04-15 15:17 ` Ryan Roberts 2024-04-15 15:22 ` David Hildenbrand 0 siblings, 1 reply; 42+ messages in thread From: Ryan Roberts @ 2024-04-15 15:17 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 15/04/2024 15:58, David Hildenbrand wrote: > On 15.04.24 16:34, Ryan Roberts wrote: >> On 15/04/2024 15:23, David Hildenbrand wrote: >>> On 15.04.24 15:30, Ryan Roberts wrote: >>>> On 15/04/2024 11:57, David Hildenbrand wrote: >>>>> On 15.04.24 11:28, Ryan Roberts wrote: >>>>>> On 12/04/2024 21:16, David Hildenbrand wrote: >>>>>>>> >>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and >>>>>>>> "lockless walkers that never take the PTL". >>>>>>>> >>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is >>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the >>>>>>>> TLBIs). But >>>>>>>> you make that clear further down. >>>>>>> >>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page >>>>>>> tables >>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and >>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast). >>>>>>> [...] >>>>>>> >>>>>>>>>> >>>>>>>>>> Could it be this easy? My head is hurting... >>>>>>>>> >>>>>>>>> I think what has to happen is: >>>>>>>>> >>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as >>>>>>>>> there >>>>>>>>> are no races. No removal/addition of access/dirty bits etc. >>>>>>>> >>>>>>>> Today's arm64 ptep_get() guarantees this. >>>>>>>> >>>>>>>>> >>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle >>>>>>>>> serious "garbage PTEs". 
This is our page fault handler. >>>>>>>> >>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement >>>>>>>> about a >>>>>>>> class of users. I agree with the statement. >>>>>>> >>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page >>>>>>> fault handlers. Well, mostly "not GUP". >>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle >>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options: >>>>>>>>> >>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if >>>>>>>>> the >>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB >>>>>>>>> flushes >>>>>>>>> required. This is the common case. HW might concurrently set access/dirty >>>>>>>>> bits, >>>>>>>>> so we can race with that. But we don't read garbage. >>>>>>>> >>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are >>>>>>>> consistent for contpte ptes. That's the bit that complicates the current >>>>>>>> ptep_get_lockless() implementation. >>>>>>>> >>>>>>>> But the point I was trying to make is that GUP-fast does not actually care >>>>>>>> about >>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec >>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are >>>>>>>> guarranteed >>>>>>>> to be self-consistent except for access and dirty information, which may be >>>>>>>> inconsistent if a racing modification occured". >>>>>>> >>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such >>>>>>> that >>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not >>>>>>> there >>>>>>> yet, but one thing I was discussing on the list recently. 
Burried in: >>>>>>> >>>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com >>>>>>> >>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will >>>>>>> properly notify us when the PTE would lose its dirty bits. >>>>>>> >>>>>>> But getting false-positive dirty bits would be problematic. >>>>>>> >>>>>>>> >>>>>>>> This could mean that the access/dirty state *does* change for a given page >>>>>>>> while >>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I >>>>>>>> *think* >>>>>>>> that failing to detect this is benign. >>>>>>> >>>>>>> I mean, HW could just set the dirty/access bit immediately after the >>>>>>> check. So >>>>>>> if HW concurrently sets the bit and we don't observe that change when we >>>>>>> recheck, I think that would be perfectly fine. >>>>>> >>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or >>>>>> soft-dirty or uffd-wp). >>>>>> >>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly >>>>>> allow >>>>>> this (because you have the KVM use case where false-positive dirty is >>>>>> problematic), then I think we are stuck with ptep_get_lockless() as >>>>>> implemented >>>>>> for arm64 today. >>>>> >>>>> At least regarding the dirty bit, we'd have to guarantee that if >>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck >>>>> would be able to catch that. >>>>> >>>>> Would that be possible? >>>> >>>> Hmm maybe. My head hurts. Let me try to work through some examples... >>>> >>>> >>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1, >>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is >>>> stored in >>>> PTE0 for the contpte block, and it is dirty. 
>>>> >>>> Now let's say there are 2 racing threads: >>>> >>>> - ThreadA is doing a GUP-fast for PTE3 >>>> - ThreadB is remapping order-0 FolioB at PTE0 >>>> >>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the >>>> example - today's arm64 ptep_get_lockless() can handle the below correctly). >>>> >>>> ThreadA ThreadB >>>> ======= ======= >>>> >>>> gup_pte_range() >>>> pte1 = ptep_get_lockless(PTE3) >>>> READ_ONCE(PTE3) >>>> mmap(PTE0) >>>> clear_pte(PTE0) >>>> unfold(PTE0 - PTE3) >>>> WRITE_ONCE(PTE0, 0) >>>> WRITE_ONCE(PTE1, 0) >>>> WRITE_ONCE(PTE2, 0) >>>> READ_ONCE(PTE0) (for a/d) << CLEAN!! >>>> READ_ONCE(PTE1) (for a/d) >>>> READ_ONCE(PTE2) (for a/d) >>>> READ_ONCE(PTE3) (for a/d) >>>> <do speculative work with pte1 content> >>>> pte2 = ptep_get_lockless(PTE3) >>>> READ_ONCE(PTE3) >>>> READ_ONCE(PTE0) (for a/d) >>>> READ_ONCE(PTE1) (for a/d) >>>> READ_ONCE(PTE2) (for a/d) >>>> READ_ONCE(PTE3) (for a/d) >>>> true = pte_same(pte1, pte2) >>>> WRITE_ONCE(PTE3, 0) >>>> TLBI >>>> WRITE_ONCE(PTE0, <orig & ~CONT>) >>>> WRITE_ONCE(PTE1, <orig & ~CONT>) >>>> WRITE_ONCE(PTE2, <orig & ~CONT>) >>>> WRITE_ONCE(PTE3, <orig & ~CONT>) >>>> WRITE_ONCE(PTE0, 0) >>>> set_pte_at(PTE0, <new>) >>>> >>>> This example shows how a *false-negative* can be returned for the dirty state, >>>> which isn't detected by the check. >>>> >>>> I've been unable to come up with an example where a *false-positive* can be >>>> returned for dirty state without the second ptep_get_lockless() noticing. 
In >>>> this second example, let's assume everything is the same execpt FolioA is >>>> initially clean: >>>> >>>> ThreadA ThreadB >>>> ======= ======= >>>> >>>> gup_pte_range() >>>> pte1 = ptep_get_lockless(PTE3) >>>> READ_ONCE(PTE3) >>>> mmap(PTE0) >>>> clear_pte(PTE0) >>>> unfold(PTE0 - PTE3) >>>> WRITE_ONCE(PTE0, 0) >>>> WRITE_ONCE(PTE1, 0) >>>> WRITE_ONCE(PTE2, 0) >>>> WRITE_ONCE(PTE3, 0) >>>> TLBI >>>> WRITE_ONCE(PTE0, <orig & ~CONT>) >>>> WRITE_ONCE(PTE1, <orig & ~CONT>) >>>> WRITE_ONCE(PTE2, <orig & ~CONT>) >>>> WRITE_ONCE(PTE3, <orig & ~CONT>) >>>> WRITE_ONCE(PTE0, 0) >>>> set_pte_at(PTE0, <new>) >>>> write to FolioB - HW sets PTE0's dirty >>>> READ_ONCE(PTE0) (for a/d) << DIRTY!! >>>> READ_ONCE(PTE1) (for a/d) >>>> READ_ONCE(PTE2) (for a/d) >>>> READ_ONCE(PTE3) (for a/d) >>>> <do speculative work with pte1 content> >>>> pte2 = ptep_get_lockless(PTE3) >>>> READ_ONCE(PTE3) << BUT THIS IS FOR FolioB >>>> READ_ONCE(PTE0) (for a/d) >>>> READ_ONCE(PTE1) (for a/d) >>>> READ_ONCE(PTE2) (for a/d) >>>> READ_ONCE(PTE3) (for a/d) >>>> false = pte_same(pte1, pte2) << So this fails >>>> >>>> The only way I can see false-positive not being caught in the second example is >>>> if ThreadB subseuently remaps the original folio, so you have an ABA scenario. >>>> But these lockless table walkers are already suseptible to that. >>>> >>>> I think all the same arguments can be extended to the access bit. >>>> >>>> >>>> For me this is all getting rather subtle and difficult to reason about and even >>>> harder to spec in a comprehensible way. The best I could come up with is: >>>> >>>> "All fields in the returned pte are guarranteed to be self-consistent except >>>> for >>>> access and dirty information, which may be inconsistent if a racing >>>> modification >>>> occured. Additionally it is guranteed that false-positive access and/or dirty >>>> information is not possible if 2 calls are made and both ptes are the same. 
>>>> Only >>>> false-negative access and/or dirty information is possible in this scenario." >>>> >>>> which is starting to sound bonkers. Personally I think we are better off at >>>> this >>>> point, just keeping today's arm64 ptep_get_lockless(). >>> >>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that >>> would sync against GUP-fast? >> >> No, the broadcast is in HW. There is no IPI. > > Okay ... > > I agree that the semantics are a bit weird, but if we could get rid of > ptep_get_lockless() on arm64, that would also be nice. > > > Something I've been thinking of ... just to share what I've had in mind. The two > types of users we currently have are: > > (1) ptep_get_lockless() followed by ptep_get() check under PTL. > > (2) ptep_get_lockless() followed by ptep_get() check without PTL. > > What if we had the following instead: > > (1) ptep_get_lockless() followed by ptep_get() check under PTL. > > (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without > PTL. > > And on arm64 let > > (1) ptep_get_lockless() be ptep_get() > > (2) ptep_get_gup_fast() be __ptep_get(). > > That would mean, that (2) would not care if another cont-pte is dirty, because > we don't collect access+dirty bits. That way, we avoid any races with concurrent > unfolding etc. The only "problamtic" thing is that pte_mkdirty() -> set_ptes() > would have to set all cont-PTEs dirty, even if any of these already is dirty. I don't think the "problematic" thing is actually a problem; set_ptes() will always set the dirty bit to the same value for all ptes it covers (and if you do set_ptes() on a partial contpte block, it will be unfolded first). Although I suspect I've misunderstood what you meant there... The potential problem I see with this is that the Arm ARM doesn't specify which PTE of a contpte block the HW stores a/d in. So the HW _could_ update them randomly and this could spuriously increase your check failure rate. 
In reality I believe most implementations will update the PTE for the address
that caused the TLB to be populated. But in some cases, you could have eviction
(due to pressure or explicit invalidation) followed by re-population due to
faulting on a different page of the contpte block. In this case you would see
this type of problem too.

But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
potentially false-negatives for access and dirty? Just with a much higher chance
of getting a false-negative. How is this helping?

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-04-15 15:17 ` Ryan Roberts @ 2024-04-15 15:22 ` David Hildenbrand 2024-04-15 15:53 ` Ryan Roberts 0 siblings, 1 reply; 42+ messages in thread From: David Hildenbrand @ 2024-04-15 15:22 UTC (permalink / raw) To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 15.04.24 17:17, Ryan Roberts wrote: > On 15/04/2024 15:58, David Hildenbrand wrote: >> On 15.04.24 16:34, Ryan Roberts wrote: >>> On 15/04/2024 15:23, David Hildenbrand wrote: >>>> On 15.04.24 15:30, Ryan Roberts wrote: >>>>> On 15/04/2024 11:57, David Hildenbrand wrote: >>>>>> On 15.04.24 11:28, Ryan Roberts wrote: >>>>>>> On 12/04/2024 21:16, David Hildenbrand wrote: >>>>>>>>> >>>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and >>>>>>>>> "lockless walkers that never take the PTL". >>>>>>>>> >>>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is >>>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the >>>>>>>>> TLBIs). But >>>>>>>>> you make that clear further down. >>>>>>>> >>>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page >>>>>>>> tables >>>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and >>>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast). >>>>>>>> [...] >>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Could it be this easy? My head is hurting... >>>>>>>>>> >>>>>>>>>> I think what has to happen is: >>>>>>>>>> >>>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as >>>>>>>>>> there >>>>>>>>>> are no races. No removal/addition of access/dirty bits etc. >>>>>>>>> >>>>>>>>> Today's arm64 ptep_get() guarantees this. 
>>>>>>>>> >>>>>>>>>> >>>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle >>>>>>>>>> serious "garbage PTEs". This is our page fault handler. >>>>>>>>> >>>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement >>>>>>>>> about a >>>>>>>>> class of users. I agree with the statement. >>>>>>>> >>>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page >>>>>>>> fault handlers. Well, mostly "not GUP". >>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle >>>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options: >>>>>>>>>> >>>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if >>>>>>>>>> the >>>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB >>>>>>>>>> flushes >>>>>>>>>> required. This is the common case. HW might concurrently set access/dirty >>>>>>>>>> bits, >>>>>>>>>> so we can race with that. But we don't read garbage. >>>>>>>>> >>>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are >>>>>>>>> consistent for contpte ptes. That's the bit that complicates the current >>>>>>>>> ptep_get_lockless() implementation. >>>>>>>>> >>>>>>>>> But the point I was trying to make is that GUP-fast does not actually care >>>>>>>>> about >>>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec >>>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are >>>>>>>>> guarranteed >>>>>>>>> to be self-consistent except for access and dirty information, which may be >>>>>>>>> inconsistent if a racing modification occured". >>>>>>>> >>>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such >>>>>>>> that >>>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not >>>>>>>> there >>>>>>>> yet, but one thing I was discussing on the list recently. 
Burried in: >>>>>>>> >>>>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com >>>>>>>> >>>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will >>>>>>>> properly notify us when the PTE would lose its dirty bits. >>>>>>>> >>>>>>>> But getting false-positive dirty bits would be problematic. >>>>>>>> >>>>>>>>> >>>>>>>>> This could mean that the access/dirty state *does* change for a given page >>>>>>>>> while >>>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I >>>>>>>>> *think* >>>>>>>>> that failing to detect this is benign. >>>>>>>> >>>>>>>> I mean, HW could just set the dirty/access bit immediately after the >>>>>>>> check. So >>>>>>>> if HW concurrently sets the bit and we don't observe that change when we >>>>>>>> recheck, I think that would be perfectly fine. >>>>>>> >>>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or >>>>>>> soft-dirty or uffd-wp). >>>>>>> >>>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly >>>>>>> allow >>>>>>> this (because you have the KVM use case where false-positive dirty is >>>>>>> problematic), then I think we are stuck with ptep_get_lockless() as >>>>>>> implemented >>>>>>> for arm64 today. >>>>>> >>>>>> At least regarding the dirty bit, we'd have to guarantee that if >>>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck >>>>>> would be able to catch that. >>>>>> >>>>>> Would that be possible? >>>>> >>>>> Hmm maybe. My head hurts. Let me try to work through some examples... >>>>> >>>>> >>>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1, >>>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is >>>>> stored in >>>>> PTE0 for the contpte block, and it is dirty. 
>>>>> >>>>> Now let's say there are 2 racing threads: >>>>> >>>>> - ThreadA is doing a GUP-fast for PTE3 >>>>> - ThreadB is remapping order-0 FolioB at PTE0 >>>>> >>>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the >>>>> example - today's arm64 ptep_get_lockless() can handle the below correctly). >>>>> >>>>> ThreadA ThreadB >>>>> ======= ======= >>>>> >>>>> gup_pte_range() >>>>> pte1 = ptep_get_lockless(PTE3) >>>>> READ_ONCE(PTE3) >>>>> mmap(PTE0) >>>>> clear_pte(PTE0) >>>>> unfold(PTE0 - PTE3) >>>>> WRITE_ONCE(PTE0, 0) >>>>> WRITE_ONCE(PTE1, 0) >>>>> WRITE_ONCE(PTE2, 0) >>>>> READ_ONCE(PTE0) (for a/d) << CLEAN!! >>>>> READ_ONCE(PTE1) (for a/d) >>>>> READ_ONCE(PTE2) (for a/d) >>>>> READ_ONCE(PTE3) (for a/d) >>>>> <do speculative work with pte1 content> >>>>> pte2 = ptep_get_lockless(PTE3) >>>>> READ_ONCE(PTE3) >>>>> READ_ONCE(PTE0) (for a/d) >>>>> READ_ONCE(PTE1) (for a/d) >>>>> READ_ONCE(PTE2) (for a/d) >>>>> READ_ONCE(PTE3) (for a/d) >>>>> true = pte_same(pte1, pte2) >>>>> WRITE_ONCE(PTE3, 0) >>>>> TLBI >>>>> WRITE_ONCE(PTE0, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE1, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE2, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE3, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE0, 0) >>>>> set_pte_at(PTE0, <new>) >>>>> >>>>> This example shows how a *false-negative* can be returned for the dirty state, >>>>> which isn't detected by the check. >>>>> >>>>> I've been unable to come up with an example where a *false-positive* can be >>>>> returned for dirty state without the second ptep_get_lockless() noticing. 
In >>>>> this second example, let's assume everything is the same execpt FolioA is >>>>> initially clean: >>>>> >>>>> ThreadA ThreadB >>>>> ======= ======= >>>>> >>>>> gup_pte_range() >>>>> pte1 = ptep_get_lockless(PTE3) >>>>> READ_ONCE(PTE3) >>>>> mmap(PTE0) >>>>> clear_pte(PTE0) >>>>> unfold(PTE0 - PTE3) >>>>> WRITE_ONCE(PTE0, 0) >>>>> WRITE_ONCE(PTE1, 0) >>>>> WRITE_ONCE(PTE2, 0) >>>>> WRITE_ONCE(PTE3, 0) >>>>> TLBI >>>>> WRITE_ONCE(PTE0, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE1, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE2, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE3, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE0, 0) >>>>> set_pte_at(PTE0, <new>) >>>>> write to FolioB - HW sets PTE0's dirty >>>>> READ_ONCE(PTE0) (for a/d) << DIRTY!! >>>>> READ_ONCE(PTE1) (for a/d) >>>>> READ_ONCE(PTE2) (for a/d) >>>>> READ_ONCE(PTE3) (for a/d) >>>>> <do speculative work with pte1 content> >>>>> pte2 = ptep_get_lockless(PTE3) >>>>> READ_ONCE(PTE3) << BUT THIS IS FOR FolioB >>>>> READ_ONCE(PTE0) (for a/d) >>>>> READ_ONCE(PTE1) (for a/d) >>>>> READ_ONCE(PTE2) (for a/d) >>>>> READ_ONCE(PTE3) (for a/d) >>>>> false = pte_same(pte1, pte2) << So this fails >>>>> >>>>> The only way I can see false-positive not being caught in the second example is >>>>> if ThreadB subseuently remaps the original folio, so you have an ABA scenario. >>>>> But these lockless table walkers are already suseptible to that. >>>>> >>>>> I think all the same arguments can be extended to the access bit. >>>>> >>>>> >>>>> For me this is all getting rather subtle and difficult to reason about and even >>>>> harder to spec in a comprehensible way. The best I could come up with is: >>>>> >>>>> "All fields in the returned pte are guarranteed to be self-consistent except >>>>> for >>>>> access and dirty information, which may be inconsistent if a racing >>>>> modification >>>>> occured. Additionally it is guranteed that false-positive access and/or dirty >>>>> information is not possible if 2 calls are made and both ptes are the same. 
>>>>> Only
>>>>> false-negative access and/or dirty information is possible in this scenario."
>>>>>
>>>>> which is starting to sound bonkers. Personally I think we are better off at this
>>>>> point, just keeping today's arm64 ptep_get_lockless().
>>>>
>>>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
>>>> would sync against GUP-fast?
>>>
>>> No, the broadcast is in HW. There is no IPI.
>>
>> Okay ...
>>
>> I agree that the semantics are a bit weird, but if we could get rid of
>> ptep_get_lockless() on arm64, that would also be nice.
>>
>> Something I've been thinking of ... just to share what I've had in mind. The two
>> types of users we currently have are:
>>
>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>
>> (2) ptep_get_lockless() followed by ptep_get() check without PTL.
>>
>> What if we had the following instead:
>>
>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>
>> (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without
>>     PTL.
>>
>> And on arm64 let
>>
>> (1) ptep_get_lockless() be ptep_get()
>>
>> (2) ptep_get_gup_fast() be __ptep_get().
>>
>> That would mean, that (2) would not care if another cont-pte is dirty, because
>> we don't collect access+dirty bits. That way, we avoid any races with concurrent
>> unfolding etc. The only "problematic" thing is that pte_mkdirty() -> set_ptes()
>> would have to set all cont-PTEs dirty, even if any of these already is dirty.
>
> I don't think the "problematic" thing is actually a problem; set_ptes() will
> always set the dirty bit to the same value for all ptes it covers (and if you do
> set_ptes() on a partial contpte block, it will be unfolded first). Although I
> suspect I've misunderstood what you meant there...

It's more code like the following that I am concerned about.

	if (pte_dirty()) {
		/* Great, nothing to do */
	} else {
		pte_mkdirty();
		set_ptes();
		...
	}

>
> The potential problem I see with this is that the Arm ARM doesn't specify which
> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
> randomly and this could spuriously increase your check failure rate. In reality
> I believe most implementations will update the PTE for the address that caused
> the TLB to be populated. But in some cases, you could have eviction (due to
> pressure or explicit invalidation) followed by re-population due to faulting on
> a different page of the contpte block. In this case you would see this type of
> problem too.
>
> But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
> potentially false-negatives for access and dirty? Just with a much higher chance
> of getting a false-negative. How is this helping?

You are performing an atomic read like GUP-fast wants you to. So there are no
races to worry about like on other architectures: HW might *set* the dirty bit
concurrently, but that's just fine.

The whole races you describe with concurrent folding/unfolding/ ... are
irrelevant.

To me that sounds ... much simpler ;)

But again, just something I've been thinking about. The reuse of
ptep_get_lockless() outside GUP code might not have been the wisest choice.

--
Cheers,

David / dhildenb

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 2024-04-15 15:22 ` David Hildenbrand @ 2024-04-15 15:53 ` Ryan Roberts 2024-04-15 16:02 ` David Hildenbrand 0 siblings, 1 reply; 42+ messages in thread From: Ryan Roberts @ 2024-04-15 15:53 UTC (permalink / raw) To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton, Muchun Song Cc: linux-arm-kernel, linux-mm, linux-kernel On 15/04/2024 16:22, David Hildenbrand wrote: > On 15.04.24 17:17, Ryan Roberts wrote: >> On 15/04/2024 15:58, David Hildenbrand wrote: >>> On 15.04.24 16:34, Ryan Roberts wrote: >>>> On 15/04/2024 15:23, David Hildenbrand wrote: >>>>> On 15.04.24 15:30, Ryan Roberts wrote: >>>>>> On 15/04/2024 11:57, David Hildenbrand wrote: >>>>>>> On 15.04.24 11:28, Ryan Roberts wrote: >>>>>>>> On 12/04/2024 21:16, David Hildenbrand wrote: >>>>>>>>>> >>>>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and >>>>>>>>>> "lockless walkers that never take the PTL". >>>>>>>>>> >>>>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is >>>>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the >>>>>>>>>> TLBIs). But >>>>>>>>>> you make that clear further down. >>>>>>>>> >>>>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page >>>>>>>>> tables >>>>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific >>>>>>>>> and >>>>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast). >>>>>>>>> [...] >>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Could it be this easy? My head is hurting... >>>>>>>>>>> >>>>>>>>>>> I think what has to happen is: >>>>>>>>>>> >>>>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as >>>>>>>>>>> long as >>>>>>>>>>> there >>>>>>>>>>> are no races. No removal/addition of access/dirty bits etc. >>>>>>>>>> >>>>>>>>>> Today's arm64 ptep_get() guarantees this. 
>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can >>>>>>>>>>> handle >>>>>>>>>>> serious "garbage PTEs". This is our page fault handler. >>>>>>>>>> >>>>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement >>>>>>>>>> about a >>>>>>>>>> class of users. I agree with the statement. >>>>>>>>> >>>>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as >>>>>>>>> page >>>>>>>>> fault handlers. Well, mostly "not GUP". >>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot >>>>>>>>>>> handle >>>>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options: >>>>>>>>>>> >>>>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check >>>>>>>>>>> later if >>>>>>>>>>> the >>>>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB >>>>>>>>>>> flushes >>>>>>>>>>> required. This is the common case. HW might concurrently set >>>>>>>>>>> access/dirty >>>>>>>>>>> bits, >>>>>>>>>>> so we can race with that. But we don't read garbage. >>>>>>>>>> >>>>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are >>>>>>>>>> consistent for contpte ptes. That's the bit that complicates the current >>>>>>>>>> ptep_get_lockless() implementation. >>>>>>>>>> >>>>>>>>>> But the point I was trying to make is that GUP-fast does not actually >>>>>>>>>> care >>>>>>>>>> about >>>>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec >>>>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are >>>>>>>>>> guarranteed >>>>>>>>>> to be self-consistent except for access and dirty information, which >>>>>>>>>> may be >>>>>>>>>> inconsistent if a racing modification occured". >>>>>>>>> >>>>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such >>>>>>>>> that >>>>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. 
That's not >>>>>>>>> there >>>>>>>>> yet, but one thing I was discussing on the list recently. Burried in: >>>>>>>>> >>>>>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com >>>>>>>>> >>>>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will >>>>>>>>> properly notify us when the PTE would lose its dirty bits. >>>>>>>>> >>>>>>>>> But getting false-positive dirty bits would be problematic. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> This could mean that the access/dirty state *does* change for a given >>>>>>>>>> page >>>>>>>>>> while >>>>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I >>>>>>>>>> *think* >>>>>>>>>> that failing to detect this is benign. >>>>>>>>> >>>>>>>>> I mean, HW could just set the dirty/access bit immediately after the >>>>>>>>> check. So >>>>>>>>> if HW concurrently sets the bit and we don't observe that change when we >>>>>>>>> recheck, I think that would be perfectly fine. >>>>>>>> >>>>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or >>>>>>>> soft-dirty or uffd-wp). >>>>>>>> >>>>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly >>>>>>>> allow >>>>>>>> this (because you have the KVM use case where false-positive dirty is >>>>>>>> problematic), then I think we are stuck with ptep_get_lockless() as >>>>>>>> implemented >>>>>>>> for arm64 today. >>>>>>> >>>>>>> At least regarding the dirty bit, we'd have to guarantee that if >>>>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck >>>>>>> would be able to catch that. >>>>>>> >>>>>>> Would that be possible? >>>>>> >>>>>> Hmm maybe. My head hurts. Let me try to work through some examples... >>>>>> >>>>>> >>>>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs >>>>>> 0, 1, >>>>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is >>>>>> stored in >>>>>> PTE0 for the contpte block, and it is dirty. 
>>>>>> >>>>>> Now let's say there are 2 racing threads: >>>>>> >>>>>> - ThreadA is doing a GUP-fast for PTE3 >>>>>> - ThreadB is remapping order-0 FolioB at PTE0 >>>>>> >>>>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the >>>>>> example - today's arm64 ptep_get_lockless() can handle the below correctly). >>>>>> >>>>>> ThreadA ThreadB >>>>>> ======= ======= >>>>>> >>>>>> gup_pte_range() >>>>>> pte1 = ptep_get_lockless(PTE3) >>>>>> READ_ONCE(PTE3) >>>>>> mmap(PTE0) >>>>>> clear_pte(PTE0) >>>>>> unfold(PTE0 - PTE3) >>>>>> WRITE_ONCE(PTE0, 0) >>>>>> WRITE_ONCE(PTE1, 0) >>>>>> WRITE_ONCE(PTE2, 0) >>>>>> READ_ONCE(PTE0) (for a/d) << CLEAN!! >>>>>> READ_ONCE(PTE1) (for a/d) >>>>>> READ_ONCE(PTE2) (for a/d) >>>>>> READ_ONCE(PTE3) (for a/d) >>>>>> <do speculative work with pte1 content> >>>>>> pte2 = ptep_get_lockless(PTE3) >>>>>> READ_ONCE(PTE3) >>>>>> READ_ONCE(PTE0) (for a/d) >>>>>> READ_ONCE(PTE1) (for a/d) >>>>>> READ_ONCE(PTE2) (for a/d) >>>>>> READ_ONCE(PTE3) (for a/d) >>>>>> true = pte_same(pte1, pte2) >>>>>> WRITE_ONCE(PTE3, 0) >>>>>> TLBI >>>>>> WRITE_ONCE(PTE0, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE1, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE2, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE3, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE0, 0) >>>>>> set_pte_at(PTE0, <new>) >>>>>> >>>>>> This example shows how a *false-negative* can be returned for the dirty >>>>>> state, >>>>>> which isn't detected by the check. >>>>>> >>>>>> I've been unable to come up with an example where a *false-positive* can be >>>>>> returned for dirty state without the second ptep_get_lockless() noticing. 
In >>>>>> this second example, let's assume everything is the same execpt FolioA is >>>>>> initially clean: >>>>>> >>>>>> ThreadA ThreadB >>>>>> ======= ======= >>>>>> >>>>>> gup_pte_range() >>>>>> pte1 = ptep_get_lockless(PTE3) >>>>>> READ_ONCE(PTE3) >>>>>> mmap(PTE0) >>>>>> clear_pte(PTE0) >>>>>> unfold(PTE0 - PTE3) >>>>>> WRITE_ONCE(PTE0, 0) >>>>>> WRITE_ONCE(PTE1, 0) >>>>>> WRITE_ONCE(PTE2, 0) >>>>>> WRITE_ONCE(PTE3, 0) >>>>>> TLBI >>>>>> WRITE_ONCE(PTE0, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE1, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE2, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE3, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE0, 0) >>>>>> set_pte_at(PTE0, <new>) >>>>>> write to FolioB - HW sets PTE0's dirty >>>>>> READ_ONCE(PTE0) (for a/d) << DIRTY!! >>>>>> READ_ONCE(PTE1) (for a/d) >>>>>> READ_ONCE(PTE2) (for a/d) >>>>>> READ_ONCE(PTE3) (for a/d) >>>>>> <do speculative work with pte1 content> >>>>>> pte2 = ptep_get_lockless(PTE3) >>>>>> READ_ONCE(PTE3) << BUT THIS IS FOR FolioB >>>>>> READ_ONCE(PTE0) (for a/d) >>>>>> READ_ONCE(PTE1) (for a/d) >>>>>> READ_ONCE(PTE2) (for a/d) >>>>>> READ_ONCE(PTE3) (for a/d) >>>>>> false = pte_same(pte1, pte2) << So this fails >>>>>> >>>>>> The only way I can see false-positive not being caught in the second >>>>>> example is >>>>>> if ThreadB subseuently remaps the original folio, so you have an ABA >>>>>> scenario. >>>>>> But these lockless table walkers are already suseptible to that. >>>>>> >>>>>> I think all the same arguments can be extended to the access bit. >>>>>> >>>>>> >>>>>> For me this is all getting rather subtle and difficult to reason about and >>>>>> even >>>>>> harder to spec in a comprehensible way. The best I could come up with is: >>>>>> >>>>>> "All fields in the returned pte are guarranteed to be self-consistent except >>>>>> for >>>>>> access and dirty information, which may be inconsistent if a racing >>>>>> modification >>>>>> occured. 
>>>>>> Additionally it is guaranteed that false-positive access and/or dirty
>>>>>> information is not possible if 2 calls are made and both ptes are the same.
>>>>>> Only false-negative access and/or dirty information is possible in this
>>>>>> scenario."
>>>>>>
>>>>>> which is starting to sound bonkers. Personally I think we are better off at
>>>>>> this point, just keeping today's arm64 ptep_get_lockless().
>>>>>
>>>>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
>>>>> would sync against GUP-fast?
>>>>
>>>> No, the broadcast is in HW. There is no IPI.
>>>
>>> Okay ...
>>>
>>> I agree that the semantics are a bit weird, but if we could get rid of
>>> ptep_get_lockless() on arm64, that would also be nice.
>>>
>>>
>>> Something I've been thinking of ... just to share what I've had in mind. The two
>>> types of users we currently have are:
>>>
>>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>>
>>> (2) ptep_get_lockless() followed by ptep_get() check without PTL.
>>>
>>> What if we had the following instead:
>>>
>>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>>
>>> (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without PTL.
>>>
>>> And on arm64 let
>>>
>>> (1) ptep_get_lockless() be ptep_get()
>>>
>>> (2) ptep_get_gup_fast() be __ptep_get().
>>>
>>> That would mean, that (2) would not care if another cont-pte is dirty, because
>>> we don't collect access+dirty bits. That way, we avoid any races with concurrent
>>> unfolding etc. The only "problematic" thing is that pte_mkdirty() -> set_ptes()
>>> would have to set all cont-PTEs dirty, even if any of these already is dirty.
>>
>> I don't think the "problematic" thing is actually a problem; set_ptes() will
>> always set the dirty bit to the same value for all ptes it covers (and if you do
>> set_ptes() on a partial contpte block, it will be unfolded first). Although I
>> suspect I've misunderstood what you meant there...
>
> It's more code like the following that I am concerned about.
>
> if (pte_dirty()) {
>     /* Great, nothing to do */
> } else {
>     pte_mkdirty();
>     set_ptes();
>     ...
> }

OK I see, so you're worried about unnecessary unfolding that the false-negative
dirty reporting could cause? I think the best solution there would be for the
core to use the clear_young_dirty_ptes(CYDP_CLEAR_DIRTY) API that Lance adds in
his series at [1]. That would avoid any unfolding and just dirty all contpte
block(s) touched by the range.

[1] https://lore.kernel.org/linux-mm/20240413002219.71246-1-ioworker0@gmail.com/

>
>>
>> The potential problem I see with this is that the Arm ARM doesn't specify which
>> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
>> randomly and this could spuriously increase your check failure rate. In reality
>> I believe most implementations will update the PTE for the address that caused
>> the TLB to be populated. But in some cases, you could have eviction (due to
>> pressure or explicit invalidation) followed by re-population due to faulting on
>> a different page of the contpte block. In this case you would see this type of
>> problem too.
>>
>> But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
>> potentially false-negatives for access and dirty? Just with a much higher chance
>> of getting a false-negative. How is this helping?
>
> You are performing an atomic read like GUP-fast wants you to. So there are no
> races to worry about like on other architectures: HW might *set* the dirty bit
> concurrently, but that's just fine.

But you can still see false-negatives for access and dirty...

>
> The whole races you describe with concurrent folding/unfolding/ ... are
> irrelevant.

And I think I convinced myself that you will only see false-negatives with
today's arm64 ptep_get(). But an order of magnitude fewer than with your
proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of
those).
>
> To me that sounds ... much simpler ;) But again, just something I've been
> thinking about.

OK so this approach upgrades my "I'm fairly sure we never see false-positives"
to "we definitely never see false-positives". But it certainly increases the
quantity of false-negatives.

>
> The reuse of pte_get_lockless() outside GUP code might not have been the wisest
> choice.
>

If you want to go down the ptep_get_gup_fast() route, you've still got to be
able to spec it, and I think it will land pretty close to my most recent stab at
respec'ing ptep_get_lockless() a couple of replies up on this thread.

Where would your proposal leave the KVM use case? If you call it
ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would
be left with ptep_get()...

Sorry this thread is getting so long. Just to summarise, I think there are
currently 3 solutions on the table:

  - ptep_get_lockless() remains as is
  - ptep_get_lockless() wraps ptep_get()
  - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename)

Based on discussion so far, that's also the order of my preference.

Perhaps it's useful to enumerate why we dislike the current ptep_get_lockless()?
I think it's just the potential for looping in the face of concurrent
modifications?

^ permalink raw reply	[flat|nested] 42+ messages in thread
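[Editorial sketch] The pte_same() check that both race examples above lean on is the standard GUP-fast double-read pattern: read the pte, do speculative work, re-read, and only commit if nothing changed. The following userspace C model illustrates just that pattern; the names (pte_t, ptep_get_lockless, pte_same, gup_fast_one) mirror the kernel API but the types and logic are deliberately simplified assumptions, not the kernel's actual implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Simplified model: a pte is just a 64-bit value. */
typedef uint64_t pte_t;

/* Single atomic read of the target pte; no a/d gathering from neighbours. */
static pte_t ptep_get_lockless(_Atomic pte_t *ptep)
{
	return atomic_load_explicit(ptep, memory_order_relaxed);
}

/* Bitwise equality stands in for the kernel's pte_same(). */
static int pte_same(pte_t a, pte_t b)
{
	return a == b;
}

/*
 * The GUP-fast shape: returns 1 if the speculative work done between the
 * two reads may be committed, 0 if a concurrent modification was detected
 * and the walk must be retried or abandoned.
 */
static int gup_fast_one(_Atomic pte_t *ptep)
{
	pte_t pte1 = ptep_get_lockless(ptep);
	/* ... speculative work with pte1 (e.g. folio lookup, ref grab) ... */
	pte_t pte2 = ptep_get_lockless(ptep);
	return pte_same(pte1, pte2);
}
```

With no concurrent writer the check passes; if a "racing thread" stores a different value between the two reads, pte_same() fails and the caller backs off - which is exactly why only effects invisible to pte_same() (like the a/d false-negatives above) can slip through.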
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
  2024-04-15 15:53       ` Ryan Roberts
@ 2024-04-15 16:02         ` David Hildenbrand
  2024-04-23 10:15           ` Ryan Roberts
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2024-04-15 16:02 UTC (permalink / raw)
To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Andrew Morton, Muchun Song
Cc: linux-arm-kernel, linux-mm, linux-kernel

>>> The potential problem I see with this is that the Arm ARM doesn't specify which
>>> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
>>> randomly and this could spuriously increase your check failure rate. In reality
>>> I believe most implementations will update the PTE for the address that caused
>>> the TLB to be populated. But in some cases, you could have eviction (due to
>>> pressure or explicit invalidation) followed by re-population due to faulting on
>>> a different page of the contpte block. In this case you would see this type of
>>> problem too.
>>>
>>> But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
>>> potentially false-negatives for access and dirty? Just with a much higher chance
>>> of getting a false-negative. How is this helping?
>>
>> You are performing an atomic read like GUP-fast wants you to. So there are no
>> races to worry about like on other architectures: HW might *set* the dirty bit
>> concurrently, but that's just fine.
>
> But you can still see false-negatives for access and dirty...

Yes.

>
>>
>> The whole races you describe with concurrent folding/unfolding/ ... are irrelevant.
>
> And I think I convinced myself that you will only see false-negatives with
> today's arm64 ptep_get(). But an order of magnitude fewer than with your
> proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of those).
>
>>
>> To me that sounds ...
>> much simpler ;) But again, just something I've been
>> thinking about.
>
> OK so this approach upgrades my "I'm fairly sure we never see false-positives"
> to "we definitely never see false-positives". But it certainly increases the
> quantity of false-negatives.

Yes.

>
>>
>> The reuse of pte_get_lockless() outside GUP code might not have been the wisest
>> choice.
>>
>
> If you want to go down the ptep_get_gup_fast() route, you've still got to be
> able to spec it, and I think it will land pretty close to my most recent stab at
> respec'ing ptep_get_lockless() a couple of replies up on this thread.
>
> Where would your proposal leave the KVM use case? If you call it
> ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would
> be left with ptep_get()...

It's using GUP-fast.

>
> Sorry this thread is getting so long. Just to summarise, I think there are
> currently 3 solutions on the table:
>
>   - ptep_get_lockless() remains as is
>   - ptep_get_lockless() wraps ptep_get()
>   - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename)
>
> Based on discussion so far, that's also the order of my preference.

(1) seems like the easiest thing to do.

>
> Perhaps it's useful to enumerate why we dislike the current ptep_get_lockless()?

Well, you sent that patch series with "that aims to reduce the cost and
complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve that. :)

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread
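[Editorial sketch] The trade-off under discussion - a single-pte read (the proposed ptep_get_gup_fast(), i.e. arm64's __ptep_get()) misses a dirty bit that the HW parked in a neighbouring pte of the contpte block, while a block-wide gather (today's ptep_get()) sees it - can be made concrete with a small userspace model. All names, bit positions, and the block size are illustrative assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define CONT_PTES  16              /* assumed ptes per contpte block */
#define PTE_DIRTY  (1ull << 55)    /* illustrative bit position only */

typedef uint64_t pte_t;

/*
 * Model of the proposed single-read helper: return just the target pte,
 * making no attempt to gather a/d bits from the rest of the block.
 */
static pte_t sketch_get_gup_fast(const pte_t *block, unsigned int idx)
{
	return block[idx];
}

/*
 * Model of the block-wide gather: OR the dirty bit in from every pte of
 * the contpte block, so a dirty bit parked on any neighbour is reported.
 */
static pte_t sketch_get_gather(const pte_t *block, unsigned int idx)
{
	pte_t pte = block[idx];

	for (unsigned int i = 0; i < CONT_PTES; i++)
		pte |= block[i] & PTE_DIRTY;
	return pte;
}
```

If the HW happened to set dirty on pte 0 of the block while pte 3 is queried, the single read reports clean (a false-negative) and the gather reports dirty - which is the "order of magnitude fewer false-negatives" point made above.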
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
  2024-04-15 16:02         ` David Hildenbrand
@ 2024-04-23 10:15           ` Ryan Roberts
  2024-04-23 10:18             ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: Ryan Roberts @ 2024-04-23 10:15 UTC (permalink / raw)
To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Andrew Morton, Muchun Song
Cc: linux-arm-kernel, linux-mm, linux-kernel

Hi David,

Sorry for the slow reply on this; it was due to a combination of thinking a bit
more about the options here and being out on holiday.

On 15/04/2024 17:02, David Hildenbrand wrote:
>>>> The potential problem I see with this is that the Arm ARM doesn't specify which
>>>> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
>>>> randomly and this could spuriously increase your check failure rate. In reality
>>>> I believe most implementations will update the PTE for the address that caused
>>>> the TLB to be populated. But in some cases, you could have eviction (due to
>>>> pressure or explicit invalidation) followed by re-population due to faulting on
>>>> a different page of the contpte block. In this case you would see this type of
>>>> problem too.
>>>>
>>>> But ultimately, isn't this basically equivalent to ptep_get_lockless()
>>>> returning
>>>> potentially false-negatives for access and dirty? Just with a much higher
>>>> chance
>>>> of getting a false-negative. How is this helping?
>>>
>>> You are performing an atomic read like GUP-fast wants you to. So there are no
>>> races to worry about like on other architectures: HW might *set* the dirty bit
>>> concurrently, but that's just fine.
>>
>> But you can still see false-negatives for access and dirty...
>
> Yes.
>
>>
>>>
>>> The whole races you describe with concurrent folding/unfolding/ ... are
>>> irrelevant.
>>
>> And I think I convinced myself that you will only see false-negatives with
>> today's arm64 ptep_get(). But an order of magnitude fewer than with your
>> proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of
>> those).
>>
>>>
>>> To me that sounds ... much simpler ;) But again, just something I've been
>>> thinking about.
>>
>> OK so this approach upgrades my "I'm fairly sure we never see false-positives"
>> to "we definitely never see false-positives". But it certainly increases the
>> quantity of false-negatives.
>
> Yes.
>
>>
>>>
>>> The reuse of pte_get_lockless() outside GUP code might not have been the wisest
>>> choice.
>>>
>>
>> If you want to go down the ptep_get_gup_fast() route, you've still got to be
>> able to spec it, and I think it will land pretty close to my most recent stab at
>> respec'ing ptep_get_lockless() a couple of replies up on this thread.
>>
>> Where would your proposal leave the KVM use case? If you call it
>> ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would
>> be left with ptep_get()...
>
> It's using GUP-fast.
>
>>
>> Sorry this thread is getting so long. Just to summarise, I think there are
>> currently 3 solutions on the table:
>>
>>   - ptep_get_lockless() remains as is
>>   - ptep_get_lockless() wraps ptep_get()
>>   - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename)
>>
>> Based on discussion so far, that's also the order of my preference.
>
> (1) seems like the easiest thing to do.

Yes, I'm very much in favour of easy.

>
>>
>> Perhaps it's useful to enumerate why we dislike the current ptep_get_lockless()?
>
> Well, you sent that patch series with "that aims to reduce the cost and
> complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve that. :)

Touché! I'd half forgotten that we were having this conversation in the context
of this series!

I guess your ptep_get_gup_fast() approach is very similar to
ptep_get_lockless_norecency()...
So we are back to the beginning :)

But ultimately I've come to the conclusion that it is easy to reason about the
current arm64 ptep_get_lockless() implementation and see that it's correct. The
other options both have their drawbacks.

Yes, there is a loop in the current implementation that would be nice to get rid
of, but I don't think it is really any worse than the cmpxchg loops we already
have in other helpers.

I'm not planning to pursue this any further. Thanks for the useful discussion
(as always).

^ permalink raw reply	[flat|nested] 42+ messages in thread
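[Editorial sketch] The "loop in the current implementation" referred to above is the retry in arm64's contpte-aware ptep_get_lockless(): read the target pte, gather access/dirty from the whole contpte block, then re-read the target and start over if it changed underneath the gather. The following userspace C model shows the shape of that loop; the bit positions, block size, and exact retry condition are simplified assumptions rather than the kernel's real code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define CONT_PTES  16              /* assumed ptes per contpte block */
#define PTE_CONT   (1ull << 52)    /* illustrative bit positions only */
#define PTE_DIRTY  (1ull << 55)
#define PTE_YOUNG  (1ull << 10)

typedef uint64_t pte_t;

/*
 * Model of the retry loop: if the pte is part of a contpte block, OR in
 * young/dirty from every pte of the block, then confirm the target pte
 * did not change while we were gathering; if it did (e.g. a concurrent
 * fold/unfold), the gathered a/d bits may be torn, so retry from scratch.
 */
static pte_t ptep_get_lockless(_Atomic pte_t *block, unsigned int idx)
{
	for (;;) {
		pte_t orig = atomic_load(&block[idx]);

		/* Not a contpte mapping: a single atomic read suffices. */
		if (!(orig & PTE_CONT))
			return orig;

		pte_t pte = orig;
		for (unsigned int i = 0; i < CONT_PTES; i++)
			pte |= atomic_load(&block[i]) & (PTE_DIRTY | PTE_YOUNG);

		/* Target unchanged: the gathered a/d bits are consistent. */
		if (atomic_load(&block[idx]) == orig)
			return pte;
		/* Otherwise loop and retry, like a cmpxchg-style helper. */
	}
}
```

Under no contention the loop runs once; the comparison to cmpxchg loops above is apt because both only iterate while a concurrent writer keeps invalidating the snapshot.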
* Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
  2024-04-23 10:15           ` Ryan Roberts
@ 2024-04-23 10:18             ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2024-04-23 10:18 UTC (permalink / raw)
To: Ryan Roberts, Mark Rutland, Catalin Marinas, Will Deacon,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Andrew Morton, Muchun Song
Cc: linux-arm-kernel, linux-mm, linux-kernel

On 23.04.24 12:15, Ryan Roberts wrote:
> Hi David,
>
> Sorry for the slow reply on this; it was due to a combination of thinking a bit
> more about the options here and being out on holiday.
>

No worries, there are things more important in life than ptep_get_lockless() :D

>> (1) seems like the easiest thing to do.
>
> Yes, I'm very much in favour of easy.
>
>>
>>>
>>> Perhaps it's useful to enumerate why we dislike the current ptep_get_lockless()?
>>
>> Well, you sent that patch series with "that aims to reduce the cost and
>> complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve that. :)
>
> Touché! I'd half forgotten that we were having this conversation in the context
> of this series!
>
> I guess your ptep_get_gup_fast() approach is very similar to
> ptep_get_lockless_norecency()... So we are back to the beginning :)

Except that it would be limited to GUP-fast :)

>
> But ultimately I've come to the conclusion that it is easy to reason about the
> current arm64 ptep_get_lockless() implementation and see that it's correct. The
> other options both have their drawbacks.

Yes.

>
> Yes, there is a loop in the current implementation that would be nice to get rid
> of, but I don't think it is really any worse than the cmpxchg loops we already
> have in other helpers.
>
> I'm not planning to pursue this any further. Thanks for the useful discussion
> (as always).

Makes sense to me. Let's leave it as is for the time being.
(and also see if a GUP-fast user that needs precise dirty/accessed actually
gets real)

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread
end of thread, other threads:[~2024-04-23 10:18 UTC | newest]
Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
2024-02-15 12:17 [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 Ryan Roberts
2024-02-15 12:17 ` [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency() Ryan Roberts
[not found] ` <7aefa967-43aa-490b-ae0d-7d1455402e89@redhat.com>
2024-03-26 16:39 ` Ryan Roberts
2024-03-27 9:28 ` David Hildenbrand
2024-03-27 9:57 ` Ryan Roberts
2024-03-27 17:02 ` David Hildenbrand
2024-02-15 12:17 ` [RFC PATCH v1 2/4] mm/gup: Use ptep_get_lockless_norecency() Ryan Roberts
2024-03-26 16:30 ` David Hildenbrand
2024-03-26 16:48 ` Ryan Roberts
2024-02-15 12:17 ` [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte Ryan Roberts
2024-03-26 17:02 ` David Hildenbrand
2024-03-26 17:27 ` Ryan Roberts
2024-03-26 17:38 ` David Hildenbrand
2024-03-26 17:48 ` Ryan Roberts
2024-03-26 17:58 ` David Hildenbrand
2024-03-27 9:51 ` Ryan Roberts
2024-03-27 17:05 ` David Hildenbrand
2024-02-15 12:17 ` [RFC PATCH v1 4/4] arm64/mm: Override ptep_get_lockless_norecency() Ryan Roberts
2024-03-26 16:35 ` David Hildenbrand
2024-03-26 16:17 ` [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 David Hildenbrand
2024-03-26 16:31 ` Ryan Roberts
[not found] ` <de143212-49ce-4c30-8bfa-4c0ff613f107@redhat.com>
2024-03-26 16:53 ` Ryan Roberts
2024-03-26 17:04 ` David Hildenbrand
2024-03-26 17:32 ` Ryan Roberts
2024-03-26 17:39 ` David Hildenbrand
2024-03-26 17:51 ` Ryan Roberts
2024-03-27 9:34 ` David Hildenbrand
2024-03-27 10:01 ` Ryan Roberts
2024-04-03 12:59 ` Ryan Roberts
2024-04-08 8:36 ` David Hildenbrand
2024-04-09 16:35 ` Ryan Roberts
2024-04-10 20:09 ` David Hildenbrand
2024-04-11 9:45 ` Ryan Roberts
[not found] ` <70a36403-aefd-4311-b612-84e602465689@redhat.com>
2024-04-15 9:28 ` Ryan Roberts
[not found] ` <3e50030d-2289-4470-a727-a293baa21618@redhat.com>
2024-04-15 13:30 ` Ryan Roberts
[not found] ` <969dc6c3-2764-4a35-9fa6-7596832fb2a3@redhat.com>
2024-04-15 14:34 ` Ryan Roberts
[not found] ` <11b1c25b-3e20-4acf-9be5-57b508266c5b@redhat.com>
2024-04-15 15:17 ` Ryan Roberts
2024-04-15 15:22 ` David Hildenbrand
2024-04-15 15:53 ` Ryan Roberts
2024-04-15 16:02 ` David Hildenbrand
2024-04-23 10:15 ` Ryan Roberts
2024-04-23 10:18 ` David Hildenbrand