* [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
@ 2026-03-03 6:37 Piotr Jaroszynski
2026-03-03 7:19 ` James Houghton
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Piotr Jaroszynski @ 2026-03-03 6:37 UTC (permalink / raw)
To: Ryan Roberts, Will Deacon, Catalin Marinas, linux-arm-kernel, linux-mm
Cc: Piotr Jaroszynski, Alistair Popple, Jason Gunthorpe,
John Hubbard, Zi Yan, Breno Leitao, stable
contpte_ptep_set_access_flags() compared the gathered ptep_get() value
against the requested entry to detect no-ops. ptep_get() ORs AF/dirty
from all sub-PTEs in the CONT block, so a dirty sibling can make the
target appear already-dirty. When the gathered value matches entry, the
function returns 0 even though the target sub-PTE still has PTE_RDONLY
set in hardware.
For CPU page-table walks this is benign: with FEAT_HAFDBS the hardware
may set AF/dirty on any sub-PTE and the CPU TLB treats the gathered
result as authoritative for the entire range. But an SMMU without HTTU
(or with HA/HD disabled in CD.TCR) evaluates each descriptor
individually and will keep raising F_PERMISSION on the unchanged target
sub-PTE, causing an infinite fault loop.
Gathering can therefore cause false no-ops when only a sibling has been
updated:
- write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
- read faults: target still lacks PTE_AF
Fix by checking all sub-PTEs' access flags individually (not via the
gathered view) before returning no-op, and use the raw target PTE for
the write-bit unfold decision. The access-flag mask matches the one
used by __ptep_set_access_flags().
Per Arm ARM (DDI 0487) D8.7.1 ("The Contiguous bit"), any sub-PTE in a CONT
range may become the effective cached translation and software must
maintain consistent attributes across the range.
Fixes: 4602e5757bcc ("arm64/mm: wire up PTE_CONT for user mappings")
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Breno Leitao <leitao@debian.org>
Cc: stable@vger.kernel.org
Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
---
arch/arm64/mm/contpte.c | 47 +++++++++++++++++++++++++++++++++++++----
1 file changed, 43 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..9868bfe4607c 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -390,6 +390,23 @@ void contpte_clear_young_dirty_ptes(struct vm_area_struct *vma,
}
EXPORT_SYMBOL_GPL(contpte_clear_young_dirty_ptes);
+static bool contpte_all_subptes_match_access_flags(pte_t *ptep, pte_t entry)
+{
+ pte_t *cont_ptep = contpte_align_down(ptep);
+ const pteval_t access_mask = PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
+ pteval_t entry_access = pte_val(entry) & access_mask;
+ int i;
+
+ for (i = 0; i < CONT_PTES; i++) {
+ pteval_t pte_access = pte_val(__ptep_get(cont_ptep + i)) & access_mask;
+
+ if (pte_access != entry_access)
+ return false;
+ }
+
+ return true;
+}
+
int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty)
@@ -399,13 +416,35 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
int i;
/*
- * Gather the access/dirty bits for the contiguous range. If nothing has
- * changed, its a noop.
+ * Check whether all sub-PTEs in the CONT block already have the
+ * requested access flags, using raw per-PTE values rather than the
+ * gathered ptep_get() view.
+ *
+ * ptep_get() gathers AF/dirty state across the whole CONT block,
+ * which is correct for CPU TLB semantics: with FEAT_HAFDBS the
+ * hardware may set AF/dirty on any sub-PTE and the CPU TLB treats
+ * the gathered result as authoritative for the entire range. But an
+ * SMMU without HTTU (or with HA/HD disabled in CD.TCR) evaluates
+ * each descriptor individually and will keep faulting on the target
+ * sub-PTE if its flags haven't actually been updated. Gathering can
+ * therefore cause false no-ops when only a sibling has been updated:
+ * - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
+ * - read faults: target still lacks PTE_AF
+ *
+ * Per Arm ARM (DDI 0487) D8.7.1, any sub-PTE in a CONT range may
+ * become the effective cached translation, so all entries must have
+ * consistent attributes. Check the full CONT block before returning
+ * no-op, and when any sub-PTE mismatches, proceed to update the whole
+ * range.
*/
- orig_pte = pte_mknoncont(ptep_get(ptep));
- if (pte_val(orig_pte) == pte_val(entry))
+ if (contpte_all_subptes_match_access_flags(ptep, entry))
return 0;
+ /*
+ * Use raw target pte (not gathered) for write-bit unfold decision.
+ */
+ orig_pte = pte_mknoncont(__ptep_get(ptep));
+
/*
* We can fix up access/dirty bits without having to unfold the contig
* range. But if the write bit is changing, we must unfold.
--
2.22.1.7.gac84d6e93c.dirty
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-03 6:37 [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults Piotr Jaroszynski
@ 2026-03-03 7:19 ` James Houghton
2026-03-03 12:45 ` Jason Gunthorpe
2026-03-03 21:40 ` Piotr Jaroszynski
2026-03-03 8:38 ` Ryan Roberts
2026-03-04 11:17 ` Catalin Marinas
2 siblings, 2 replies; 11+ messages in thread
From: James Houghton @ 2026-03-03 7:19 UTC (permalink / raw)
To: Piotr Jaroszynski
Cc: Ryan Roberts, Will Deacon, Catalin Marinas, linux-arm-kernel,
linux-mm, Alistair Popple, Jason Gunthorpe, John Hubbard, Zi Yan,
Breno Leitao, stable
On Mon, Mar 2, 2026 at 10:38 PM Piotr Jaroszynski
<pjaroszynski@nvidia.com> wrote:
>
> contpte_ptep_set_access_flags() compared the gathered ptep_get() value
> against the requested entry to detect no-ops. ptep_get() ORs AF/dirty
> from all sub-PTEs in the CONT block, so a dirty sibling can make the
> target appear already-dirty. When the gathered value matches entry, the
> function returns 0 even though the target sub-PTE still has PTE_RDONLY
> set in hardware.
>
> For CPU page-table walks this is benign: with FEAT_HAFDBS the hardware
> may set AF/dirty on any sub-PTE and the CPU TLB treats the gathered
> result as authoritative for the entire range. But an SMMU without HTTU
> (or with HA/HD disabled in CD.TCR) evaluates each descriptor
> individually and will keep raising F_PERMISSION on the unchanged target
> sub-PTE, causing an infinite fault loop.
>
> Gathering can therefore cause false no-ops when only a sibling has been
> updated:
> - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
> - read faults: target still lacks PTE_AF
>
> Fix by checking all sub-PTEs' access flags individually (not via the
> gathered view) before returning no-op, and use the raw target PTE for
> the write-bit unfold decision. The access-flag mask matches the one
> used by __ptep_set_access_flags().
>
> Per Arm ARM (DDI 0487) D8.7.1 ("The Contiguous bit"), any sub-PTE in a CONT
> range may become the effective cached translation and software must
> maintain consistent attributes across the range.
>
> Fixes: 4602e5757bcc ("arm64/mm: wire up PTE_CONT for user mappings")
>
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Breno Leitao <leitao@debian.org>
> Cc: stable@vger.kernel.org
> Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
Thanks for the fix!
This is similar (sort of) to a HugeTLB page fault loop I stumbled upon
a while ago[1]. (I wonder if there have been more cases like this.)
Feel free to add:
Reviewed-by: James Houghton <jthoughton@google.com>
[1] https://lore.kernel.org/all/20231204172646.2541916-1-jthoughton@google.com
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-03 6:37 [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults Piotr Jaroszynski
2026-03-03 7:19 ` James Houghton
@ 2026-03-03 8:38 ` Ryan Roberts
2026-03-03 18:40 ` Piotr Jaroszynski
2026-03-04 11:17 ` Catalin Marinas
2 siblings, 1 reply; 11+ messages in thread
From: Ryan Roberts @ 2026-03-03 8:38 UTC (permalink / raw)
To: Piotr Jaroszynski, Will Deacon, Catalin Marinas,
linux-arm-kernel, linux-mm
Cc: Alistair Popple, Jason Gunthorpe, John Hubbard, Zi Yan,
Breno Leitao, stable
On 03/03/2026 06:37, Piotr Jaroszynski wrote:
> contpte_ptep_set_access_flags() compared the gathered ptep_get() value
> against the requested entry to detect no-ops. ptep_get() ORs AF/dirty
> from all sub-PTEs in the CONT block, so a dirty sibling can make the
> target appear already-dirty. When the gathered value matches entry, the
> function returns 0 even though the target sub-PTE still has PTE_RDONLY
> set in hardware.
>
> For CPU page-table walks this is benign: with FEAT_HAFDBS the hardware
> may set AF/dirty on any sub-PTE and the CPU TLB treats the gathered
> result as authoritative for the entire range. But an SMMU without HTTU
> (or with HA/HD disabled in CD.TCR) evaluates each descriptor
> individually and will keep raising F_PERMISSION on the unchanged target
> sub-PTE, causing an infinite fault loop.
Ouch; thanks for the fix!
>
> Gathering can therefore cause false no-ops when only a sibling has been
> updated:
> - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
> - read faults: target still lacks PTE_AF
>
> Fix by checking all sub-PTEs' access flags individually (not via the
> gathered view) before returning no-op, and use the raw target PTE for
> the write-bit unfold decision. The access-flag mask matches the one
> used by __ptep_set_access_flags().
>
> Per Arm ARM (DDI 0487) D8.7.1 ("The Contiguous bit"), any sub-PTE in a CONT
> range may become the effective cached translation and software must
> maintain consistent attributes across the range.
>
> Fixes: 4602e5757bcc ("arm64/mm: wire up PTE_CONT for user mappings")
>
nit: there shouldn't be whitespace here.
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Breno Leitao <leitao@debian.org>
> Cc: stable@vger.kernel.org
> Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
This fix looks good to me:
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/mm/contpte.c | 47 +++++++++++++++++++++++++++++++++++++----
> 1 file changed, 43 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..9868bfe4607c 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -390,6 +390,23 @@ void contpte_clear_young_dirty_ptes(struct vm_area_struct *vma,
> }
> EXPORT_SYMBOL_GPL(contpte_clear_young_dirty_ptes);
>
> +static bool contpte_all_subptes_match_access_flags(pte_t *ptep, pte_t entry)
> +{
> + pte_t *cont_ptep = contpte_align_down(ptep);
> + const pteval_t access_mask = PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
> + pteval_t entry_access = pte_val(entry) & access_mask;
> + int i;
> +
> + for (i = 0; i < CONT_PTES; i++) {
> + pteval_t pte_access = pte_val(__ptep_get(cont_ptep + i)) & access_mask;
> +
> + if (pte_access != entry_access)
> + return false;
> + }
There are 2 forms of "dirty": HW and SW. Here you are testing that all ptes in
the contpte block have the same form of dirty, which I think is the correct
thing to do. You could relax it to just test that every pte has one of the forms
of dirty. But in that case, if a pte is sw-dirty but not hw-dirty, then the
PTE_RDONLY bit remains set and the SMMU will fault, I think?
If my reasoning is correct, then I think arm64 hugetlb has a similar bug; See
__cont_access_flags_changed(), which just checks for any form of dirty. So I
guess hugetlb is buggy in the same way and should be fixed to use this more
stringent approach?
Thanks,
Ryan
> +
> + return true;
> +}
> +
> int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep,
> pte_t entry, int dirty)
> @@ -399,13 +416,35 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> int i;
>
> /*
> - * Gather the access/dirty bits for the contiguous range. If nothing has
> - * changed, its a noop.
> + * Check whether all sub-PTEs in the CONT block already have the
> + * requested access flags, using raw per-PTE values rather than the
> + * gathered ptep_get() view.
> + *
> + * ptep_get() gathers AF/dirty state across the whole CONT block,
> + * which is correct for CPU TLB semantics: with FEAT_HAFDBS the
> + * hardware may set AF/dirty on any sub-PTE and the CPU TLB treats
> + * the gathered result as authoritative for the entire range. But an
> + * SMMU without HTTU (or with HA/HD disabled in CD.TCR) evaluates
> + * each descriptor individually and will keep faulting on the target
> + * sub-PTE if its flags haven't actually been updated. Gathering can
> + * therefore cause false no-ops when only a sibling has been updated:
> + * - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
> + * - read faults: target still lacks PTE_AF
> + *
> + * Per Arm ARM (DDI 0487) D8.7.1, any sub-PTE in a CONT range may
> + * become the effective cached translation, so all entries must have
> + * consistent attributes. Check the full CONT block before returning
> + * no-op, and when any sub-PTE mismatches, proceed to update the whole
> + * range.
> */
> - orig_pte = pte_mknoncont(ptep_get(ptep));
> - if (pte_val(orig_pte) == pte_val(entry))
> + if (contpte_all_subptes_match_access_flags(ptep, entry))
> return 0;
>
> + /*
> + * Use raw target pte (not gathered) for write-bit unfold decision.
> + */
> + orig_pte = pte_mknoncont(__ptep_get(ptep));
> +
> /*
> * We can fix up access/dirty bits without having to unfold the contig
> * range. But if the write bit is changing, we must unfold.
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-03 7:19 ` James Houghton
@ 2026-03-03 12:45 ` Jason Gunthorpe
2026-03-03 21:40 ` Piotr Jaroszynski
1 sibling, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2026-03-03 12:45 UTC (permalink / raw)
To: James Houghton
Cc: Piotr Jaroszynski, Ryan Roberts, Will Deacon, Catalin Marinas,
linux-arm-kernel, linux-mm, Alistair Popple, John Hubbard,
Zi Yan, Breno Leitao, stable
On Mon, Mar 02, 2026 at 11:19:46PM -0800, James Houghton wrote:
> This is similar (sort of) to a HugeTLB page fault loop I stumbled upon
> a while ago[1]. (I wonder if there have been more cases like this.)
We spent some time here looking at what the ARM spec requires, and
what kinds of HW implementations are possible..
I think the consensus is that SW must never write individual CONT PTEs
that are different from each other.
For example, a possible HW implementation could fetch a cache line of
PTEs, see the CONT bit at the VA's PTE, and then always select the
first PTE in the cache line to load the TLB.
Alternatively HW could just always load the PTE from the VA that is
faulting.
Since the Arm ARM doesn't restrict this, SW must assume both possibilities,
which means it must synchronize all PTEs since it cannot predict which PTE
will be loaded by HW.
DBM will only write to a single PTE, which means HW can create
inconsistent PTEs. However if SW is fixing a fault, especially for
different walker IP (SMMU vs CPU), then it must restore the
synchronization since the PTE selection of one walker may not be
compatible with the HW update of the other.
Thus, I think any place where SW is creating inconsistent PTEs, or
leaving PTEs inconsistent after a fault, is a bug. We found a Rule
statement that seemed to require this as well.
Jason
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-03 8:38 ` Ryan Roberts
@ 2026-03-03 18:40 ` Piotr Jaroszynski
2026-03-03 19:12 ` Jason Gunthorpe
0 siblings, 1 reply; 11+ messages in thread
From: Piotr Jaroszynski @ 2026-03-03 18:40 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Catalin Marinas, linux-arm-kernel, linux-mm,
Alistair Popple, Jason Gunthorpe, John Hubbard, Zi Yan,
Breno Leitao, stable
On Tue, Mar 03, 2026 at 08:38:23AM +0000, Ryan Roberts wrote:
> On 03/03/2026 06:37, Piotr Jaroszynski wrote:
> > contpte_ptep_set_access_flags() compared the gathered ptep_get() value
> > against the requested entry to detect no-ops. ptep_get() ORs AF/dirty
> > from all sub-PTEs in the CONT block, so a dirty sibling can make the
> > target appear already-dirty. When the gathered value matches entry, the
> > function returns 0 even though the target sub-PTE still has PTE_RDONLY
> > set in hardware.
> >
> > For CPU page-table walks this is benign: with FEAT_HAFDBS the hardware
> > may set AF/dirty on any sub-PTE and the CPU TLB treats the gathered
> > result as authoritative for the entire range. But an SMMU without HTTU
> > (or with HA/HD disabled in CD.TCR) evaluates each descriptor
> > individually and will keep raising F_PERMISSION on the unchanged target
> > sub-PTE, causing an infinite fault loop.
>
> Ouch; thanks for the fix!
>
> >
> > Gathering can therefore cause false no-ops when only a sibling has been
> > updated:
> > - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
> > - read faults: target still lacks PTE_AF
> >
> > Fix by checking all sub-PTEs' access flags individually (not via the
> > gathered view) before returning no-op, and use the raw target PTE for
> > the write-bit unfold decision. The access-flag mask matches the one
> > used by __ptep_set_access_flags().
> >
> > Per Arm ARM (DDI 0487) D8.7.1 ("The Contiguous bit"), any sub-PTE in a CONT
> > range may become the effective cached translation and software must
> > maintain consistent attributes across the range.
> >
> > Fixes: 4602e5757bcc ("arm64/mm: wire up PTE_CONT for user mappings")
> >
>
> nit: there shouldn't be whitespace here.
>
> > Reviewed-by: Alistair Popple <apopple@nvidia.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Breno Leitao <leitao@debian.org>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
>
> This fix looks good to me:
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Thanks!
>
>
> > ---
> > arch/arm64/mm/contpte.c | 47 +++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 43 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > index bcac4f55f9c1..9868bfe4607c 100644
> > --- a/arch/arm64/mm/contpte.c
> > +++ b/arch/arm64/mm/contpte.c
> > @@ -390,6 +390,23 @@ void contpte_clear_young_dirty_ptes(struct vm_area_struct *vma,
> > }
> > EXPORT_SYMBOL_GPL(contpte_clear_young_dirty_ptes);
> >
> > +static bool contpte_all_subptes_match_access_flags(pte_t *ptep, pte_t entry)
> > +{
> > + pte_t *cont_ptep = contpte_align_down(ptep);
> > + const pteval_t access_mask = PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
> > + pteval_t entry_access = pte_val(entry) & access_mask;
> > + int i;
> > +
> > + for (i = 0; i < CONT_PTES; i++) {
> > + pteval_t pte_access = pte_val(__ptep_get(cont_ptep + i)) & access_mask;
> > +
> > + if (pte_access != entry_access)
> > + return false;
> > + }
>
> There are 2 forms of "dirty"; HW and SW. Here you are testing that all ptes in
> the contpte block have the same form of dirty, which I think is the correct
> thing to do. You could relax to just test that every pte has one of the forms of
> dirty, But in that case, if a pte is sw-dirty but not hw-dirty, then the
> PTE_RDONLY bit remains set and the SMMU will fault, I think?
Yes, for SMMU we need to make sure the HW-dirty state is correct or it
will keep faulting. And while we are doing it, it makes sense to just
update all the bits to be consistent.
>
> If my reasoning is correct, then I think arm64 hugetlb has a similar bug; See
> __cont_access_flags_changed(), which just checks for any form of dirty. So I
> guess hugetlb is buggy in the same way and should be fixed to use this more
> stringent approach?
Given sw-dirty is managed by sw, is it correct for sw to ever create a
PTE that's sw-dirty but not hw-dirty? If not, then I think it will still
work fine for the SMMU case as sw-dirty implies hw-dirty, and if it's
missing then we will set both. But for thoroughness it could make sense
to be stricter and add some comments there as it does feel a little
fragile. I'm very new to this area though so probably best for others to
comment and tackle this.
Thanks,
Piotr
>
> Thanks,
> Ryan
>
> > +
> > + return true;
> > +}
> > +
> > int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> > unsigned long addr, pte_t *ptep,
> > pte_t entry, int dirty)
> > @@ -399,13 +416,35 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> > int i;
> >
> > /*
> > - * Gather the access/dirty bits for the contiguous range. If nothing has
> > - * changed, its a noop.
> > + * Check whether all sub-PTEs in the CONT block already have the
> > + * requested access flags, using raw per-PTE values rather than the
> > + * gathered ptep_get() view.
> > + *
> > + * ptep_get() gathers AF/dirty state across the whole CONT block,
> > + * which is correct for CPU TLB semantics: with FEAT_HAFDBS the
> > + * hardware may set AF/dirty on any sub-PTE and the CPU TLB treats
> > + * the gathered result as authoritative for the entire range. But an
> > + * SMMU without HTTU (or with HA/HD disabled in CD.TCR) evaluates
> > + * each descriptor individually and will keep faulting on the target
> > + * sub-PTE if its flags haven't actually been updated. Gathering can
> > + * therefore cause false no-ops when only a sibling has been updated:
> > + * - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
> > + * - read faults: target still lacks PTE_AF
> > + *
> > + * Per Arm ARM (DDI 0487) D8.7.1, any sub-PTE in a CONT range may
> > + * become the effective cached translation, so all entries must have
> > + * consistent attributes. Check the full CONT block before returning
> > + * no-op, and when any sub-PTE mismatches, proceed to update the whole
> > + * range.
> > */
> > - orig_pte = pte_mknoncont(ptep_get(ptep));
> > - if (pte_val(orig_pte) == pte_val(entry))
> > + if (contpte_all_subptes_match_access_flags(ptep, entry))
> > return 0;
> >
> > + /*
> > + * Use raw target pte (not gathered) for write-bit unfold decision.
> > + */
> > + orig_pte = pte_mknoncont(__ptep_get(ptep));
> > +
> > /*
> > * We can fix up access/dirty bits without having to unfold the contig
> > * range. But if the write bit is changing, we must unfold.
>
>
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-03 18:40 ` Piotr Jaroszynski
@ 2026-03-03 19:12 ` Jason Gunthorpe
2026-03-04 12:20 ` Ryan Roberts
0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2026-03-03 19:12 UTC (permalink / raw)
To: Piotr Jaroszynski
Cc: Ryan Roberts, Will Deacon, Catalin Marinas, linux-arm-kernel,
linux-mm, Alistair Popple, John Hubbard, Zi Yan, Breno Leitao,
stable
On Tue, Mar 03, 2026 at 10:40:00AM -0800, Piotr Jaroszynski wrote:
> > If my reasoning is correct, then I think arm64 hugetlb has a similar bug; See
> > __cont_access_flags_changed(), which just checks for any form of dirty. So I
> > guess hugetlb is buggy in the same way and should be fixed to use this more
> > stringent approach?
>
> Given sw-dirty is managed by sw, is it correct for sw to ever create a
> PTE that's sw-dirty but not hw-dirty? If not, then I think it will still
> work fine for the SMMU case as sw-dirty implies hw-dirty, and if it's
> missing then we will set both. But for thoroughness it could make sense
> to be stricter and add some comments there as it does feel a little
> fragile. I'm very new to this area though so probably best for others to
> comment and tackle this.
I also am not so familiar with SW dirty, but the big thing to be
sensitive to here is that the CPU can get by with setting something as
RDONLY | DBM and the CPU will flip off RDONLY without issue, so you
might not notice it.
But if handle_mm_fault() is called on something RDONLY|DBM it *MUST*
clear the RDONLY for all CONT entries, or fail the fault. Otherwise
the SMMU SVA explodes.
However, I agree the __cont_access_flags_changed() looks concerning.
I spent a few AI $$ to cross check this and it echo's Ryan's concern
but did not find a specific counter-example where PTE_WRITE |
PTE_DIRTY | PTE_RDONLY can happen for hugetlb. I don't really trust
this negative result, so I'm inclined to agree it should be made more
robust. I like the look of the patch (below) it proposed too.
AI ramblings:
PROMPT:
Review 20260303063751.2531716-1-pjaroszynski@nvidia.com.mbx and focus
on __cont_access_flags_changed(), huge_ptep_set_access_flags(),
hugetlb_fault(). Does it have the same issue as this patch is fixing,
considering the following additional note:
A big thing to be sensitive to here is that the CPU can get by with
setting something as RDONLY | DBM and the CPU will flip off RDONLY
without issue, so you might not notice it.
But if handle_mm_fault() is called on something RDONLY|DBM it *MUST*
clear the RDONLY for all CONT entries, or fail the fault.
====
Analysis: Does hugetlb __cont_access_flags_changed() have the same bug as the contpte patch?
The contpte bug (what the patch fixes)
The old contpte_ptep_set_access_flags() (contpte.c:614) used ptep_get()
which gathers AF/dirty from ALL sub-PTEs via OR, then compared the
gathered result against entry:
// OLD (buggy):
orig_pte = pte_mknoncont(ptep_get(ptep)); // gathers across all sub-PTEs!
if (pte_val(orig_pte) == pte_val(entry))
return 0;
A dirty sibling makes the gathered PTE look already-dirty, even though
the target sub-PTE still has PTE_RDONLY in hardware. False no-op → SMMU
infinite fault loop.
The fix replaces this with contpte_all_subptes_match_access_flags()
which checks each sub-PTE’s raw bit values
(PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY).
The hugetlb code
__cont_access_flags_changed() (hugetlbpage.c:400-418) does NOT use
gathering for the no-op check – it reads each sub-PTE individually via
__ptep_get():
for (i = 0; i < ncontig; i++) {
pte_t orig_pte = __ptep_get(ptep + i);
if (pte_dirty(pte) != pte_dirty(orig_pte))
return 1;
if (pte_young(pte) != pte_young(orig_pte))
return 1;
}
So it does NOT have the exact same gathering-based false no-op bug.
But it does have a weaker form of the same conceptual problem
The check uses pte_dirty() which is:
#define pte_hw_dirty(pte) (pte_write(pte) && !pte_rdonly(pte))
#define pte_sw_dirty(pte) (!!(pte_val(pte) & PTE_DIRTY))
#define pte_dirty(pte) (pte_sw_dirty(pte) || pte_hw_dirty(pte))
It abstracts over HW-dirty and SW-dirty. It does NOT check PTE_RDONLY
directly.
If a sub-PTE were in the state PTE_WRITE | PTE_DIRTY | PTE_RDONLY
(SW-dirty but HW-readonly), then pte_dirty() returns true (via
sw_dirty), and the check would say “no change needed” – but PTE_RDONLY
is still set in hardware, which the SMMU would fault on.
The contpte fix’s contpte_all_subptes_match_access_flags() would catch
this because it compares the raw PTE_RDONLY bit directly.
Does PTE_WRITE | PTE_DIRTY | PTE_RDONLY actually occur?
=======================================================
Looking at all the arm64 PTE manipulation functions:
- pte_mkdirty(): sets PTE_DIRTY AND clears PTE_RDONLY (if writable)
- pte_mkwrite_novma(): sets PTE_WRITE, clears PTE_RDONLY if PTE_DIRTY
- pte_wrprotect(): saves HW dirty to SW dirty, sets PTE_RDONLY, clears
PTE_WRITE
- pte_mkclean(): clears PTE_DIRTY, sets PTE_RDONLY
- HAFDBS: can only CLEAR PTE_RDONLY, never set it
None of these create PTE_DIRTY | PTE_WRITE | PTE_RDONLY. The invariant
holds: for writable pages, PTE_DIRTY => !PTE_RDONLY.
Tracing the RDONLY|DBM scenario through hugetlb_fault()
For the specific scenario: all sub-PTEs are PTE_WRITE | PTE_RDONLY
(RDONLY|DBM, clean writable), and handle_mm_fault() is called:
1. huge_ptep_get() gathers – no dirty siblings, so vmf.orig_pte =
PTE_WRITE | PTE_RDONLY | PTE_AF
2. huge_pte_mkdirty() sets PTE_DIRTY, clears PTE_RDONLY
3. entry = PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
4. __cont_access_flags_changed():
- All sub-PTEs: PTE_WRITE | PTE_RDONLY | PTE_AF
- pte_dirty(entry) = true, pte_dirty(sub) = false (RDONLY set, no
PTE_DIRTY)
- Mismatch detected -> returns 1 -> BBM proceeds
5. BBM rewrites ALL PTEs with
PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
6. RDONLY cleared for all entries. Correct.
With some siblings HW-dirtied by CPU (e.g., PTE #3 has !PTE_RDONLY):
1. huge_ptep_get() gathers: sees HW-dirty sibling, calls pte_mkdirty()
on gathered
2. entry = PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
3. __cont_access_flags_changed():
- PTE #3 (!PTE_RDONLY): pte_dirty() = true -> match
- PTE #0 (PTE_RDONLY): pte_dirty() = false -> mismatch -> returns
1
4. BBM proceeds. All PTEs rewritten. Correct.
With ALL siblings HW-dirtied by CPU:
1. All sub-PTEs: PTE_WRITE | !PTE_RDONLY | PTE_AF (all HW dirty)
2. entry: PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
3. __cont_access_flags_changed(): all pte_dirty() match -> returns 0
(no-op)
4. But RDONLY is already clear on all PTEs, so SMMU won’t fault.
Correct.
Conclusion
==========
The hugetlb code does NOT have the same bug in the sense that it doesn’t
produce false no-ops with current code. The key difference is
__cont_access_flags_changed() checks each sub-PTE individually, not a
gathered view.
However, Ryan Roberts’ observation is valid:
__cont_access_flags_changed() has a latent weakness. It relies on the
invariant PTE_DIRTY => !PTE_RDONLY (for writable pages) rather than
checking PTE_RDONLY directly. If this invariant were ever violated
(e.g., by a future code change), the no-op check could falsely pass,
leaving RDONLY set in hardware and causing SMMU fault loops. The contpte
fix’s approach of checking raw bit values is more robust and should be
adopted by the hugetlb code as a defense-in-depth measure.
The recommended fix for hugetlb: replace the pte_dirty()/pte_young()
checks in __cont_access_flags_changed() with a raw-bitmask comparison
similar to contpte_all_subptes_match_access_flags(), using the mask
PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY.
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index a42c05cf564082..34e091b398123e 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -389,28 +389,24 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
}
/*
- * huge_ptep_set_access_flags will update access flags (dirty, accesssed)
+ * huge_ptep_set_access_flags will update access flags (dirty, accessed)
* and write permission.
*
- * For a contiguous huge pte range we need to check whether or not write
- * permission has to change only on the first pte in the set. Then for
- * all the contiguous ptes we need to check whether or not there is a
- * discrepancy between dirty or young.
+ * Check all sub-PTEs' raw access flag bits rather than using the abstracted
+ * pte_dirty()/pte_young() helpers which conflate HW-dirty and SW-dirty.
+ * This ensures PTE_RDONLY is checked directly: a sub-PTE that is SW-dirty
+ * (PTE_DIRTY set) but still has PTE_RDONLY would be missed by pte_dirty()
+ * but will cause an SMMU without HTTU to keep faulting. The access flag
+ * mask matches the one used by __ptep_set_access_flags().
*/
static int __cont_access_flags_changed(pte_t *ptep, pte_t pte, int ncontig)
{
+ const pteval_t access_mask = PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
+ pteval_t pte_access = pte_val(pte) & access_mask;
int i;
- if (pte_write(pte) != pte_write(__ptep_get(ptep)))
- return 1;
-
for (i = 0; i < ncontig; i++) {
- pte_t orig_pte = __ptep_get(ptep + i);
-
- if (pte_dirty(pte) != pte_dirty(orig_pte))
- return 1;
-
- if (pte_young(pte) != pte_young(orig_pte))
+ if ((pte_val(__ptep_get(ptep + i)) & access_mask) != pte_access)
return 1;
}
Jason
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-03 7:19 ` James Houghton
2026-03-03 12:45 ` Jason Gunthorpe
@ 2026-03-03 21:40 ` Piotr Jaroszynski
1 sibling, 0 replies; 11+ messages in thread
From: Piotr Jaroszynski @ 2026-03-03 21:40 UTC (permalink / raw)
To: James Houghton
Cc: Ryan Roberts, Will Deacon, Catalin Marinas, linux-arm-kernel,
linux-mm, Alistair Popple, Jason Gunthorpe, John Hubbard, Zi Yan,
Breno Leitao, stable
On Mon, Mar 02, 2026 at 11:19:46PM -0800, James Houghton wrote:
> On Mon, Mar 2, 2026 at 10:38 PM Piotr Jaroszynski
> <pjaroszynski@nvidia.com> wrote:
> >
> > contpte_ptep_set_access_flags() compared the gathered ptep_get() value
> > against the requested entry to detect no-ops. ptep_get() ORs AF/dirty
> > from all sub-PTEs in the CONT block, so a dirty sibling can make the
> > target appear already-dirty. When the gathered value matches entry, the
> > function returns 0 even though the target sub-PTE still has PTE_RDONLY
> > set in hardware.
> >
> > For CPU page-table walks this is benign: with FEAT_HAFDBS the hardware
> > may set AF/dirty on any sub-PTE and the CPU TLB treats the gathered
> > result as authoritative for the entire range. But an SMMU without HTTU
> > (or with HA/HD disabled in CD.TCR) evaluates each descriptor
> > individually and will keep raising F_PERMISSION on the unchanged target
> > sub-PTE, causing an infinite fault loop.
> >
> > Gathering can therefore cause false no-ops when only a sibling has been
> > updated:
> > - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
> > - read faults: target still lacks PTE_AF
> >
> > Fix by checking all sub-PTEs' access flags individually (not via the
> > gathered view) before returning no-op, and use the raw target PTE for
> > the write-bit unfold decision. The access-flag mask matches the one
> > used by __ptep_set_access_flags().
> >
> > Per Arm ARM (DDI 0487) D8.7.1 ("The Contiguous bit"), any sub-PTE in a CONT
> > range may become the effective cached translation and software must
> > maintain consistent attributes across the range.
> >
> > Fixes: 4602e5757bcc ("arm64/mm: wire up PTE_CONT for user mappings")
> >
> > Reviewed-by: Alistair Popple <apopple@nvidia.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Breno Leitao <leitao@debian.org>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
>
> Thanks for the fix!
>
> This is similar (sort of) to a HugeTLB page fault loop I stumbled upon
> a while ago[1]. (I wonder if there have been more cases like this.)
I see that your commit 3c0696076aad ("arm64: mm: Always make sw-dirty
PTEs hw-dirty in pte_modify") from that discussion was picked up and
it's very relevant for the hugetlb exposure question. With your patch,
do we have a guarantee that sw-dirty implies hw-dirty in all cases? If
yes, then there should be no exposure for that path. But it still makes
sense to make it more explicit.
>
> Feel free to add:
>
> Reviewed-by: James Houghton <jthoughton@google.com>
Thanks!
>
> [1] https://lore.kernel.org/all/20231204172646.2541916-1-jthoughton@google.com
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-03 6:37 [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults Piotr Jaroszynski
2026-03-03 7:19 ` James Houghton
2026-03-03 8:38 ` Ryan Roberts
@ 2026-03-04 11:17 ` Catalin Marinas
2026-03-04 13:43 ` Jason Gunthorpe
2 siblings, 1 reply; 11+ messages in thread
From: Catalin Marinas @ 2026-03-04 11:17 UTC (permalink / raw)
To: Piotr Jaroszynski
Cc: Ryan Roberts, Will Deacon, linux-arm-kernel, linux-mm,
Alistair Popple, Jason Gunthorpe, John Hubbard, Zi Yan,
Breno Leitao, stable
On Mon, Mar 02, 2026 at 10:37:51PM -0800, Piotr Jaroszynski wrote:
> contpte_ptep_set_access_flags() compared the gathered ptep_get() value
> against the requested entry to detect no-ops. ptep_get() ORs AF/dirty
> from all sub-PTEs in the CONT block, so a dirty sibling can make the
> target appear already-dirty. When the gathered value matches entry, the
> function returns 0 even though the target sub-PTE still has PTE_RDONLY
> set in hardware.
>
> For CPU page-table walks this is benign: with FEAT_HAFDBS the hardware
> may set AF/dirty on any sub-PTE and the CPU TLB treats the gathered
> result as authoritative for the entire range. But an SMMU without HTTU
> (or with HA/HD disabled in CD.TCR) evaluates each descriptor
> individually and will keep raising F_PERMISSION on the unchanged target
> sub-PTE, causing an infinite fault loop.
This can also happen if not all CPUs support the hardware updates of the
AF/dirty bits.
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..9868bfe4607c 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -390,6 +390,23 @@ void contpte_clear_young_dirty_ptes(struct vm_area_struct *vma,
> }
> EXPORT_SYMBOL_GPL(contpte_clear_young_dirty_ptes);
>
> +static bool contpte_all_subptes_match_access_flags(pte_t *ptep, pte_t entry)
> +{
> + pte_t *cont_ptep = contpte_align_down(ptep);
> + const pteval_t access_mask = PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
> + pteval_t entry_access = pte_val(entry) & access_mask;
> + int i;
> +
> + for (i = 0; i < CONT_PTES; i++) {
> + pteval_t pte_access = pte_val(__ptep_get(cont_ptep + i)) & access_mask;
> +
> + if (pte_access != entry_access)
> + return false;
> + }
> +
> + return true;
> +}
> +
> int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep,
> pte_t entry, int dirty)
> @@ -399,13 +416,35 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> int i;
>
> /*
> - * Gather the access/dirty bits for the contiguous range. If nothing has
> - * changed, its a noop.
> + * Check whether all sub-PTEs in the CONT block already have the
> + * requested access flags, using raw per-PTE values rather than the
> + * gathered ptep_get() view.
> + *
> + * ptep_get() gathers AF/dirty state across the whole CONT block,
> + * which is correct for CPU TLB semantics: with FEAT_HAFDBS the
> + * hardware may set AF/dirty on any sub-PTE and the CPU TLB treats
> + * the gathered result as authoritative for the entire range. But an
> + * SMMU without HTTU (or with HA/HD disabled in CD.TCR) evaluates
> + * each descriptor individually and will keep faulting on the target
> + * sub-PTE if its flags haven't actually been updated. Gathering can
> + * therefore cause false no-ops when only a sibling has been updated:
> + * - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
> + * - read faults: target still lacks PTE_AF
> + *
> + * Per Arm ARM (DDI 0487) D8.7.1, any sub-PTE in a CONT range may
> + * become the effective cached translation, so all entries must have
> + * consistent attributes. Check the full CONT block before returning
> + * no-op, and when any sub-PTE mismatches, proceed to update the whole
> + * range.
> */
> - orig_pte = pte_mknoncont(ptep_get(ptep));
> - if (pte_val(orig_pte) == pte_val(entry))
> + if (contpte_all_subptes_match_access_flags(ptep, entry))
> return 0;
Actually, do we need to loop over all the ptes? I think it is sufficient to
only check the one at ptep, since it is the one that triggered the fault.
Instead of ptep_get(ptep), use __ptep_get(ptep). The rest of the
function sets the flags correctly for all the ptes in the contig range.
--
Catalin
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-03 19:12 ` Jason Gunthorpe
@ 2026-03-04 12:20 ` Ryan Roberts
2026-03-04 13:44 ` Jason Gunthorpe
0 siblings, 1 reply; 11+ messages in thread
From: Ryan Roberts @ 2026-03-04 12:20 UTC (permalink / raw)
To: Jason Gunthorpe, Piotr Jaroszynski
Cc: Will Deacon, Catalin Marinas, linux-arm-kernel, linux-mm,
Alistair Popple, John Hubbard, Zi Yan, Breno Leitao, stable
On 03/03/2026 19:12, Jason Gunthorpe wrote:
> On Tue, Mar 03, 2026 at 10:40:00AM -0800, Piotr Jaroszynski wrote:
>
>>> If my reasoning is correct, then I think arm64 hugetlb has a similar bug; See
>>> __cont_access_flags_changed(), which just checks for any form of dirty. So I
>>> guess hugetlb is buggy in the same way and should be fixed to use this more
>>> stringent approach?
>>
>> Given sw-dirty is managed by sw, is it correct for sw to ever create a
>> PTE that's sw-dirty but not hw-dirty?
It's possible to have a dirty pte that you subsequently make read-only by
calling pte_wrprotect().
These are the valid states for the bits:
 * Dirty  Writable | PTE_RDONLY  PTE_WRITE  PTE_DIRTY (sw)
 *   0       0     |     1          0          0
 *   0       1     |     1          1          0
 *   1       0     |     1          0          1
 *   1       1     |     0          1          x
But I guess only PTE_RDONLY|PTE_WRITE|PTE_DIRTY causes a problem, which can't
happen.
>> If not, then I think it will still
>> work fine for the SMMU case as sw-dirty implies hw-dirty, and if it's
>> missing then we will set both. But for thoroughness it could make sense
>> to be stricter and add some comments there as it does feel a little
>> fragile. I'm very new to this area though so probably best for others to
>> comment and tackle this.
>
> I also am not so familiar with SW dirty, but the big thing to be
> sensitive to here is that the CPU can get by with setting something as
> RDONLY | DBM and the CPU will flip off RDONLY without issue, so you
> might not notice it.
>
> But if handle_mm_fault() is called on something RDONLY|DBM it *MUST*
> clear the RDONLY for all CONT entries, or fail the fault. Otherwise
> the SMMU SVA explodes.
Got it.
>
> However, I agree the __cont_access_flags_changed() looks concerning.
>
> I spent a few AI $$ to cross-check this and it echoes Ryan's concern
> but did not find a specific counter-example where PTE_WRITE |
> PTE_DIRTY | PTE_RDONLY can happen for hugetlb.
The table I shared above (from pgtable.h) agrees that
PTE_WRITE|PTE_DIRTY|PTE_RDONLY can never happen.
> I don't really trust
> this negative result, so I'm inclined to agree it should be made more
> robust. I like the look of the patch (below) it proposed too.
>
> AI ramblings:
> PROMPT:
> Review 20260303063751.2531716-1-pjaroszynski@nvidia.com.mbx and focus
> on __cont_access_flags_changed(), huge_ptep_set_access_flags(),
> hugetlb_fault(). Does it have the same issue as this patch is fixing,
> considering the following additional note:
>
> A big thing to be sensitive to here is that the CPU can get by with
> setting something as RDONLY | DBM and the CPU will flip off RDONLY
> without issue, so you might not notice it.
>
> But if handle_mm_fault() is called on something RDONLY|DBM it *MUST*
> clear the RDONLY for all CONT entries, or fail the fault.
> ====
>
> Analysis: Does hugetlb __cont_access_flags_changed() have the same bug as the contpte patch?
>
> The contpte bug (what the patch fixes)
>
> The old contpte_ptep_set_access_flags() (contpte.c:614) used ptep_get()
> which gathers AF/dirty from ALL sub-PTEs via OR, then compared the
> gathered result against entry:
>
> // OLD (buggy):
> orig_pte = pte_mknoncont(ptep_get(ptep)); // gathers across all sub-PTEs!
> if (pte_val(orig_pte) == pte_val(entry))
> return 0;
>
> A dirty sibling makes the gathered PTE look already-dirty, even though
> the target sub-PTE still has PTE_RDONLY in hardware. False no-op → SMMU
> infinite fault loop.
>
> The fix replaces this with contpte_all_subptes_match_access_flags()
> which checks each sub-PTE’s raw bit values
> (PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY).
>
> The hugetlb code
>
> __cont_access_flags_changed() (hugetlbpage.c:400-418) does NOT use
> gathering for the no-op check – it reads each sub-PTE individually via
> __ptep_get():
>
> for (i = 0; i < ncontig; i++) {
> pte_t orig_pte = __ptep_get(ptep + i);
> if (pte_dirty(pte) != pte_dirty(orig_pte))
> return 1;
> if (pte_young(pte) != pte_young(orig_pte))
> return 1;
> }
>
> So it does NOT have the exact same gathering-based false no-op bug.
>
> But it does have a weaker form of the same conceptual problem
>
> The check uses pte_dirty() which is:
>
> #define pte_hw_dirty(pte) (pte_write(pte) && !pte_rdonly(pte))
> #define pte_sw_dirty(pte) (!!(pte_val(pte) & PTE_DIRTY))
> #define pte_dirty(pte) (pte_sw_dirty(pte) || pte_hw_dirty(pte))
>
> It abstracts over HW-dirty and SW-dirty. It does NOT check PTE_RDONLY
> directly.
>
> If a sub-PTE were in the state PTE_WRITE | PTE_DIRTY | PTE_RDONLY
> (SW-dirty but HW-readonly), then pte_dirty() returns true (via
> sw_dirty), and the check would say “no change needed” – but PTE_RDONLY
> is still set in hardware, which the SMMU would fault on.
>
> The contpte fix’s contpte_all_subptes_match_access_flags() would catch
> this because it compares the raw PTE_RDONLY bit directly.
>
> Does PTE_WRITE | PTE_DIRTY | PTE_RDONLY actually occur?
> =======================================================
>
> Looking at all the arm64 PTE manipulation functions:
>
> - pte_mkdirty(): sets PTE_DIRTY AND clears PTE_RDONLY (if writable)
> - pte_mkwrite_novma(): sets PTE_WRITE, clears PTE_RDONLY if PTE_DIRTY
> - pte_wrprotect(): saves HW dirty to SW dirty, sets PTE_RDONLY, clears
> PTE_WRITE
> - pte_mkclean(): clears PTE_DIRTY, sets PTE_RDONLY
> - HAFDBS: can only CLEAR PTE_RDONLY, never set it
>
> None of these create PTE_DIRTY | PTE_WRITE | PTE_RDONLY. The invariant
> holds: for writable pages, PTE_DIRTY => !PTE_RDONLY.
>
> Tracing the RDONLY|DBM scenario through hugetlb_fault()
>
> For the specific scenario: all sub-PTEs are PTE_WRITE | PTE_RDONLY
> (RDONLY|DBM, clean writable), and handle_mm_fault() is called:
>
> 1. huge_ptep_get() gathers – no dirty siblings, so vmf.orig_pte =
> PTE_WRITE | PTE_RDONLY | PTE_AF
> 2. huge_pte_mkdirty() sets PTE_DIRTY, clears PTE_RDONLY
> 3. entry = PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
> 4. __cont_access_flags_changed():
> - All sub-PTEs: PTE_WRITE | PTE_RDONLY | PTE_AF
> - pte_dirty(entry) = true, pte_dirty(sub) = false (RDONLY set, no
> PTE_DIRTY)
> - Mismatch detected -> returns 1 -> BBM proceeds
> 5. BBM rewrites ALL PTEs with
> PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
> 6. RDONLY cleared for all entries. Correct.
>
> With some siblings HW-dirtied by CPU (e.g., PTE #3 has !PTE_RDONLY):
>
> 1. huge_ptep_get() gathers: sees HW-dirty sibling, calls pte_mkdirty()
> on gathered
> 2. entry = PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
> 3. __cont_access_flags_changed():
> - PTE #3 (!PTE_RDONLY): pte_dirty() = true -> match
> - PTE #0 (PTE_RDONLY): pte_dirty() = false -> mismatch -> returns
> 1
> 4. BBM proceeds. All PTEs rewritten. Correct.
>
> With ALL siblings HW-dirtied by CPU:
>
> 1. All sub-PTEs: PTE_WRITE | !PTE_RDONLY | PTE_AF (all HW dirty)
> 2. entry: PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
> 3. __cont_access_flags_changed(): all pte_dirty() match -> returns 0
> (no-op)
> 4. But RDONLY is already clear on all PTEs, so SMMU won’t fault.
> Correct.
>
> Conclusion
> ==========
>
> The hugetlb code does NOT have the same bug in the sense that it doesn’t
> produce false no-ops with current code. The key difference is
> __cont_access_flags_changed() checks each sub-PTE individually, not a
> gathered view.
>
> However, Ryan Roberts’ observation is valid:
> __cont_access_flags_changed() has a latent weakness. It relies on the
> invariant PTE_DIRTY => !PTE_RDONLY (for writable pages) rather than
> checking PTE_RDONLY directly. If this invariant were ever violated
> (e.g., by a future code change), the no-op check could falsely pass,
> leaving RDONLY set in hardware and causing SMMU fault loops. The contpte
> fix’s approach of checking raw bit values is more robust and should be
> adopted by the hugetlb code as a defense-in-depth measure.
>
> The recommended fix for hugetlb: replace the pte_dirty()/pte_young()
> checks in __cont_access_flags_changed() with a raw-bitmask comparison
> similar to contpte_all_subptes_match_access_flags(), using the mask
> PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY.
>
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index a42c05cf564082..34e091b398123e 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -389,28 +389,24 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
> }
>
> /*
> - * huge_ptep_set_access_flags will update access flags (dirty, accesssed)
> + * huge_ptep_set_access_flags will update access flags (dirty, accessed)
> * and write permission.
> *
> - * For a contiguous huge pte range we need to check whether or not write
> - * permission has to change only on the first pte in the set. Then for
> - * all the contiguous ptes we need to check whether or not there is a
> - * discrepancy between dirty or young.
> + * Check all sub-PTEs' raw access flag bits rather than using the abstracted
> + * pte_dirty()/pte_young() helpers which conflate HW-dirty and SW-dirty.
> + * This ensures PTE_RDONLY is checked directly: a sub-PTE that is SW-dirty
> + * (PTE_DIRTY set) but still has PTE_RDONLY would be missed by pte_dirty()
> + * but will cause an SMMU without HTTU to keep faulting. The access flag
> + * mask matches the one used by __ptep_set_access_flags().
> */
> static int __cont_access_flags_changed(pte_t *ptep, pte_t pte, int ncontig)
> {
> + const pteval_t access_mask = PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
> + pteval_t pte_access = pte_val(pte) & access_mask;
> int i;
>
> - if (pte_write(pte) != pte_write(__ptep_get(ptep)))
> - return 1;
> -
> for (i = 0; i < ncontig; i++) {
> - pte_t orig_pte = __ptep_get(ptep + i);
> -
> - if (pte_dirty(pte) != pte_dirty(orig_pte))
> - return 1;
> -
> - if (pte_young(pte) != pte_young(orig_pte))
> + if ((pte_val(__ptep_get(ptep + i)) & access_mask) != pte_access)
> return 1;
> }
I think, based on all the above, the current version is actually not buggy. But
I'm only willing to go to 95% confidence :)
The change looks reasonable though, if you want to be safe.
Thanks,
Ryan
>
>
> Jason
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-04 11:17 ` Catalin Marinas
@ 2026-03-04 13:43 ` Jason Gunthorpe
0 siblings, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2026-03-04 13:43 UTC (permalink / raw)
To: Catalin Marinas
Cc: Piotr Jaroszynski, Ryan Roberts, Will Deacon, linux-arm-kernel,
linux-mm, Alistair Popple, John Hubbard, Zi Yan, Breno Leitao,
stable
On Wed, Mar 04, 2026 at 11:17:08AM +0000, Catalin Marinas wrote:
> > @@ -399,13 +416,35 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> > int i;
> >
> > /*
> > - * Gather the access/dirty bits for the contiguous range. If nothing has
> > - * changed, its a noop.
> > + * Check whether all sub-PTEs in the CONT block already have the
> > + * requested access flags, using raw per-PTE values rather than the
> > + * gathered ptep_get() view.
> > + *
> > + * ptep_get() gathers AF/dirty state across the whole CONT block,
> > + * which is correct for CPU TLB semantics: with FEAT_HAFDBS the
> > + * hardware may set AF/dirty on any sub-PTE and the CPU TLB treats
> > + * the gathered result as authoritative for the entire range. But an
> > + * SMMU without HTTU (or with HA/HD disabled in CD.TCR) evaluates
> > + * each descriptor individually and will keep faulting on the target
> > + * sub-PTE if its flags haven't actually been updated. Gathering can
> > + * therefore cause false no-ops when only a sibling has been updated:
> > + * - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
> > + * - read faults: target still lacks PTE_AF
> > + *
> > + * Per Arm ARM (DDI 0487) D8.7.1, any sub-PTE in a CONT range may
> > + * become the effective cached translation, so all entries must have
> > + * consistent attributes. Check the full CONT block before returning
> > + * no-op, and when any sub-PTE mismatches, proceed to update the whole
> > + * range.
> > */
> > - orig_pte = pte_mknoncont(ptep_get(ptep));
> > - if (pte_val(orig_pte) == pte_val(entry))
> > + if (contpte_all_subptes_match_access_flags(ptep, entry))
> > return 0;
>
> Actually, do we need to loop over all the ptes? I think it is sufficient to
> only check the one at ptep, since it is the one that triggered the
> fault.
With CONT we should not be thinking "the one that triggered the
fault".
The PTE that triggered the fault is the PTE that the HW happened to
load into the TLB, we cannot assume it is the sub PTE we are faulting
at. For instance it could be a sub PTE for a completely unrelated
access at a different VA that got cached.
Again, the requirement here is that a fault on a CONT PTE must fix all
the access flags to be consistent or fail. It cannot resume the fault
and leave the sub PTEs inconsistent, as the HW is always allowed to
load the RDONLY one for any access within the CONT range.
Jason
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-04 12:20 ` Ryan Roberts
@ 2026-03-04 13:44 ` Jason Gunthorpe
0 siblings, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2026-03-04 13:44 UTC (permalink / raw)
To: Ryan Roberts
Cc: Piotr Jaroszynski, Will Deacon, Catalin Marinas,
linux-arm-kernel, linux-mm, Alistair Popple, John Hubbard,
Zi Yan, Breno Leitao, stable
On Wed, Mar 04, 2026 at 12:20:31PM +0000, Ryan Roberts wrote:
> > static int __cont_access_flags_changed(pte_t *ptep, pte_t pte, int ncontig)
> > {
> > + const pteval_t access_mask = PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
> > + pteval_t pte_access = pte_val(pte) & access_mask;
> > int i;
> >
> > - if (pte_write(pte) != pte_write(__ptep_get(ptep)))
> > - return 1;
> > -
> > for (i = 0; i < ncontig; i++) {
> > - pte_t orig_pte = __ptep_get(ptep + i);
> > -
> > - if (pte_dirty(pte) != pte_dirty(orig_pte))
> > - return 1;
> > -
> > - if (pte_young(pte) != pte_young(orig_pte))
> > + if ((pte_val(__ptep_get(ptep + i)) & access_mask) != pte_access)
> > return 1;
> > }
>
> I think, based on all the above, the current version is actually not buggy. But
> I'm only willing to go to 95% confidence :)
If I understand it right, James already found and fixed a case that
violated the invariant, right? Maybe it is wise to be robust since
things lock up if it goes wrong.
Jason