* [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
@ 2026-03-03 6:37 Piotr Jaroszynski
2026-03-03 7:19 ` James Houghton
2026-03-03 8:38 ` Ryan Roberts
0 siblings, 2 replies; 6+ messages in thread
From: Piotr Jaroszynski @ 2026-03-03 6:37 UTC (permalink / raw)
To: Ryan Roberts, Will Deacon, Catalin Marinas, linux-arm-kernel, linux-mm
Cc: Piotr Jaroszynski, Alistair Popple, Jason Gunthorpe,
John Hubbard, Zi Yan, Breno Leitao, stable
contpte_ptep_set_access_flags() compared the gathered ptep_get() value
against the requested entry to detect no-ops. ptep_get() ORs AF/dirty
from all sub-PTEs in the CONT block, so a dirty sibling can make the
target appear already-dirty. When the gathered value matches entry, the
function returns 0 even though the target sub-PTE still has PTE_RDONLY
set in hardware.
For CPU page-table walks this is benign: with FEAT_HAFDBS the hardware
may set AF/dirty on any sub-PTE and the CPU TLB treats the gathered
result as authoritative for the entire range. But an SMMU without HTTU
(or with HA/HD disabled in CD.TCR) evaluates each descriptor
individually and will keep raising F_PERMISSION on the unchanged target
sub-PTE, causing an infinite fault loop.
Gathering can therefore cause false no-ops when only a sibling has been
updated:
- write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
- read faults: target still lacks PTE_AF
Fix by checking all sub-PTEs' access flags individually (not via the
gathered view) before returning no-op, and use the raw target PTE for
the write-bit unfold decision. The access-flag mask matches the one
used by __ptep_set_access_flags().
Per Arm ARM (DDI 0487) D8.7.1 ("The Contiguous bit"), any sub-PTE in a CONT
range may become the effective cached translation and software must
maintain consistent attributes across the range.
Fixes: 4602e5757bcc ("arm64/mm: wire up PTE_CONT for user mappings")
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Breno Leitao <leitao@debian.org>
Cc: stable@vger.kernel.org
Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
---
arch/arm64/mm/contpte.c | 47 +++++++++++++++++++++++++++++++++++++----
1 file changed, 43 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..9868bfe4607c 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -390,6 +390,23 @@ void contpte_clear_young_dirty_ptes(struct vm_area_struct *vma,
}
EXPORT_SYMBOL_GPL(contpte_clear_young_dirty_ptes);
+static bool contpte_all_subptes_match_access_flags(pte_t *ptep, pte_t entry)
+{
+ pte_t *cont_ptep = contpte_align_down(ptep);
+ const pteval_t access_mask = PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
+ pteval_t entry_access = pte_val(entry) & access_mask;
+ int i;
+
+ for (i = 0; i < CONT_PTES; i++) {
+ pteval_t pte_access = pte_val(__ptep_get(cont_ptep + i)) & access_mask;
+
+ if (pte_access != entry_access)
+ return false;
+ }
+
+ return true;
+}
+
int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty)
@@ -399,13 +416,35 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
int i;
/*
- * Gather the access/dirty bits for the contiguous range. If nothing has
- * changed, its a noop.
+ * Check whether all sub-PTEs in the CONT block already have the
+ * requested access flags, using raw per-PTE values rather than the
+ * gathered ptep_get() view.
+ *
+ * ptep_get() gathers AF/dirty state across the whole CONT block,
+ * which is correct for CPU TLB semantics: with FEAT_HAFDBS the
+ * hardware may set AF/dirty on any sub-PTE and the CPU TLB treats
+ * the gathered result as authoritative for the entire range. But an
+ * SMMU without HTTU (or with HA/HD disabled in CD.TCR) evaluates
+ * each descriptor individually and will keep faulting on the target
+ * sub-PTE if its flags haven't actually been updated. Gathering can
+ * therefore cause false no-ops when only a sibling has been updated:
+ * - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
+ * - read faults: target still lacks PTE_AF
+ *
+ * Per Arm ARM (DDI 0487) D8.7.1, any sub-PTE in a CONT range may
+ * become the effective cached translation, so all entries must have
+ * consistent attributes. Check the full CONT block before returning
+ * no-op, and when any sub-PTE mismatches, proceed to update the whole
+ * range.
*/
- orig_pte = pte_mknoncont(ptep_get(ptep));
- if (pte_val(orig_pte) == pte_val(entry))
+ if (contpte_all_subptes_match_access_flags(ptep, entry))
return 0;
+ /*
+ * Use raw target pte (not gathered) for write-bit unfold decision.
+ */
+ orig_pte = pte_mknoncont(__ptep_get(ptep));
+
/*
* We can fix up access/dirty bits without having to unfold the contig
* range. But if the write bit is changing, we must unfold.
--
2.22.1.7.gac84d6e93c.dirty
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-03 6:37 [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults Piotr Jaroszynski
@ 2026-03-03 7:19 ` James Houghton
2026-03-03 12:45 ` Jason Gunthorpe
2026-03-03 8:38 ` Ryan Roberts
1 sibling, 1 reply; 6+ messages in thread
From: James Houghton @ 2026-03-03 7:19 UTC (permalink / raw)
To: Piotr Jaroszynski
Cc: Ryan Roberts, Will Deacon, Catalin Marinas, linux-arm-kernel,
linux-mm, Alistair Popple, Jason Gunthorpe, John Hubbard, Zi Yan,
Breno Leitao, stable
On Mon, Mar 2, 2026 at 10:38 PM Piotr Jaroszynski
<pjaroszynski@nvidia.com> wrote:
>
> contpte_ptep_set_access_flags() compared the gathered ptep_get() value
> against the requested entry to detect no-ops. ptep_get() ORs AF/dirty
> from all sub-PTEs in the CONT block, so a dirty sibling can make the
> target appear already-dirty. When the gathered value matches entry, the
> function returns 0 even though the target sub-PTE still has PTE_RDONLY
> set in hardware.
>
> For CPU page-table walks this is benign: with FEAT_HAFDBS the hardware
> may set AF/dirty on any sub-PTE and the CPU TLB treats the gathered
> result as authoritative for the entire range. But an SMMU without HTTU
> (or with HA/HD disabled in CD.TCR) evaluates each descriptor
> individually and will keep raising F_PERMISSION on the unchanged target
> sub-PTE, causing an infinite fault loop.
>
> Gathering can therefore cause false no-ops when only a sibling has been
> updated:
> - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
> - read faults: target still lacks PTE_AF
>
> Fix by checking all sub-PTEs' access flags individually (not via the
> gathered view) before returning no-op, and use the raw target PTE for
> the write-bit unfold decision. The access-flag mask matches the one
> used by __ptep_set_access_flags().
>
> Per Arm ARM (DDI 0487) D8.7.1 ("The Contiguous bit"), any sub-PTE in a CONT
> range may become the effective cached translation and software must
> maintain consistent attributes across the range.
>
> Fixes: 4602e5757bcc ("arm64/mm: wire up PTE_CONT for user mappings")
>
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Breno Leitao <leitao@debian.org>
> Cc: stable@vger.kernel.org
> Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
Thanks for the fix!
This is similar (sort of) to a HugeTLB page fault loop I stumbled upon
a while ago[1]. (I wonder if there have been more cases like this.)
Feel free to add:
Reviewed-by: James Houghton <jthoughton@google.com>
[1] https://lore.kernel.org/all/20231204172646.2541916-1-jthoughton@google.com
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-03 6:37 [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults Piotr Jaroszynski
2026-03-03 7:19 ` James Houghton
@ 2026-03-03 8:38 ` Ryan Roberts
2026-03-03 18:40 ` Piotr Jaroszynski
1 sibling, 1 reply; 6+ messages in thread
From: Ryan Roberts @ 2026-03-03 8:38 UTC (permalink / raw)
To: Piotr Jaroszynski, Will Deacon, Catalin Marinas,
linux-arm-kernel, linux-mm
Cc: Alistair Popple, Jason Gunthorpe, John Hubbard, Zi Yan,
Breno Leitao, stable
On 03/03/2026 06:37, Piotr Jaroszynski wrote:
> contpte_ptep_set_access_flags() compared the gathered ptep_get() value
> against the requested entry to detect no-ops. ptep_get() ORs AF/dirty
> from all sub-PTEs in the CONT block, so a dirty sibling can make the
> target appear already-dirty. When the gathered value matches entry, the
> function returns 0 even though the target sub-PTE still has PTE_RDONLY
> set in hardware.
>
> For CPU page-table walks this is benign: with FEAT_HAFDBS the hardware
> may set AF/dirty on any sub-PTE and the CPU TLB treats the gathered
> result as authoritative for the entire range. But an SMMU without HTTU
> (or with HA/HD disabled in CD.TCR) evaluates each descriptor
> individually and will keep raising F_PERMISSION on the unchanged target
> sub-PTE, causing an infinite fault loop.
Ouch; thanks for the fix!
>
> Gathering can therefore cause false no-ops when only a sibling has been
> updated:
> - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
> - read faults: target still lacks PTE_AF
>
> Fix by checking all sub-PTEs' access flags individually (not via the
> gathered view) before returning no-op, and use the raw target PTE for
> the write-bit unfold decision. The access-flag mask matches the one
> used by __ptep_set_access_flags().
>
> Per Arm ARM (DDI 0487) D8.7.1 ("The Contiguous bit"), any sub-PTE in a CONT
> range may become the effective cached translation and software must
> maintain consistent attributes across the range.
>
> Fixes: 4602e5757bcc ("arm64/mm: wire up PTE_CONT for user mappings")
>
nit: there shouldn't be whitespace here.
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Breno Leitao <leitao@debian.org>
> Cc: stable@vger.kernel.org
> Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
This fix looks good to me:
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/mm/contpte.c | 47 +++++++++++++++++++++++++++++++++++++----
> 1 file changed, 43 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..9868bfe4607c 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -390,6 +390,23 @@ void contpte_clear_young_dirty_ptes(struct vm_area_struct *vma,
> }
> EXPORT_SYMBOL_GPL(contpte_clear_young_dirty_ptes);
>
> +static bool contpte_all_subptes_match_access_flags(pte_t *ptep, pte_t entry)
> +{
> + pte_t *cont_ptep = contpte_align_down(ptep);
> + const pteval_t access_mask = PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
> + pteval_t entry_access = pte_val(entry) & access_mask;
> + int i;
> +
> + for (i = 0; i < CONT_PTES; i++) {
> + pteval_t pte_access = pte_val(__ptep_get(cont_ptep + i)) & access_mask;
> +
> + if (pte_access != entry_access)
> + return false;
> + }
There are 2 forms of "dirty": HW and SW. Here you are testing that all ptes in
the contpte block have the same form of dirty, which I think is the correct
thing to do. You could relax to just test that every pte has one of the forms of
dirty. But in that case, if a pte is sw-dirty but not hw-dirty, then the
PTE_RDONLY bit remains set and the SMMU will fault, I think?
If my reasoning is correct, then I think arm64 hugetlb has a similar bug; see
__cont_access_flags_changed(), which just checks for any form of dirty. So I
guess hugetlb is buggy in the same way and should be fixed to use this more
stringent approach?
Thanks,
Ryan
> +
> + return true;
> +}
> +
> int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep,
> pte_t entry, int dirty)
> @@ -399,13 +416,35 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> int i;
>
> /*
> - * Gather the access/dirty bits for the contiguous range. If nothing has
> - * changed, its a noop.
> + * Check whether all sub-PTEs in the CONT block already have the
> + * requested access flags, using raw per-PTE values rather than the
> + * gathered ptep_get() view.
> + *
> + * ptep_get() gathers AF/dirty state across the whole CONT block,
> + * which is correct for CPU TLB semantics: with FEAT_HAFDBS the
> + * hardware may set AF/dirty on any sub-PTE and the CPU TLB treats
> + * the gathered result as authoritative for the entire range. But an
> + * SMMU without HTTU (or with HA/HD disabled in CD.TCR) evaluates
> + * each descriptor individually and will keep faulting on the target
> + * sub-PTE if its flags haven't actually been updated. Gathering can
> + * therefore cause false no-ops when only a sibling has been updated:
> + * - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
> + * - read faults: target still lacks PTE_AF
> + *
> + * Per Arm ARM (DDI 0487) D8.7.1, any sub-PTE in a CONT range may
> + * become the effective cached translation, so all entries must have
> + * consistent attributes. Check the full CONT block before returning
> + * no-op, and when any sub-PTE mismatches, proceed to update the whole
> + * range.
> */
> - orig_pte = pte_mknoncont(ptep_get(ptep));
> - if (pte_val(orig_pte) == pte_val(entry))
> + if (contpte_all_subptes_match_access_flags(ptep, entry))
> return 0;
>
> + /*
> + * Use raw target pte (not gathered) for write-bit unfold decision.
> + */
> + orig_pte = pte_mknoncont(__ptep_get(ptep));
> +
> /*
> * We can fix up access/dirty bits without having to unfold the contig
> * range. But if the write bit is changing, we must unfold.
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-03 7:19 ` James Houghton
@ 2026-03-03 12:45 ` Jason Gunthorpe
0 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2026-03-03 12:45 UTC (permalink / raw)
To: James Houghton
Cc: Piotr Jaroszynski, Ryan Roberts, Will Deacon, Catalin Marinas,
linux-arm-kernel, linux-mm, Alistair Popple, John Hubbard,
Zi Yan, Breno Leitao, stable
On Mon, Mar 02, 2026 at 11:19:46PM -0800, James Houghton wrote:
> This is similar (sort of) to a HugeTLB page fault loop I stumbled upon
> a while ago[1]. (I wonder if there have been more cases like this.)
We spent some time here looking at what the ARM spec requires, and
what kinds of HW implementations are possible..
I think the consensus is that SW must never write individual CONT PTEs
that are different from each other.
For example, a possible HW implementation could fetch a cache line of
PTEs, see the CONT bit at the VA's PTE, and then always select the
first PTE in the cache line to load the TLB.
Alternatively HW could just always load the PTE from the VA that is
faulting.
Since ARM ARM doesn't restrict this, SW must assume both possibilities
and that means it must synchronize all PTEs since it cannot predict
which PTE will be loaded by HW.
DBM will only write to a single PTE, which means HW can create
inconsistent PTEs. However if SW is fixing a fault, especially for
different walker IP (SMMU vs CPU), then it must restore the
synchronization since the PTE selection of one walker may not be
compatible with the HW update of the other.
Thus, I think any place where SW is creating inconsistent PTEs, or
leaving PTEs inconsistent after a fault is a bug.. We found a Rule
statement that seemed to require this as well.
Jason
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-03 8:38 ` Ryan Roberts
@ 2026-03-03 18:40 ` Piotr Jaroszynski
2026-03-03 19:12 ` Jason Gunthorpe
0 siblings, 1 reply; 6+ messages in thread
From: Piotr Jaroszynski @ 2026-03-03 18:40 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Catalin Marinas, linux-arm-kernel, linux-mm,
Alistair Popple, Jason Gunthorpe, John Hubbard, Zi Yan,
Breno Leitao, stable
On Tue, Mar 03, 2026 at 08:38:23AM +0000, Ryan Roberts wrote:
> On 03/03/2026 06:37, Piotr Jaroszynski wrote:
> > contpte_ptep_set_access_flags() compared the gathered ptep_get() value
> > against the requested entry to detect no-ops. ptep_get() ORs AF/dirty
> > from all sub-PTEs in the CONT block, so a dirty sibling can make the
> > target appear already-dirty. When the gathered value matches entry, the
> > function returns 0 even though the target sub-PTE still has PTE_RDONLY
> > set in hardware.
> >
> > For CPU page-table walks this is benign: with FEAT_HAFDBS the hardware
> > may set AF/dirty on any sub-PTE and the CPU TLB treats the gathered
> > result as authoritative for the entire range. But an SMMU without HTTU
> > (or with HA/HD disabled in CD.TCR) evaluates each descriptor
> > individually and will keep raising F_PERMISSION on the unchanged target
> > sub-PTE, causing an infinite fault loop.
>
> Ouch; thanks for the fix!
>
> >
> > Gathering can therefore cause false no-ops when only a sibling has been
> > updated:
> > - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
> > - read faults: target still lacks PTE_AF
> >
> > Fix by checking all sub-PTEs' access flags individually (not via the
> > gathered view) before returning no-op, and use the raw target PTE for
> > the write-bit unfold decision. The access-flag mask matches the one
> > used by __ptep_set_access_flags().
> >
> > Per Arm ARM (DDI 0487) D8.7.1 ("The Contiguous bit"), any sub-PTE in a CONT
> > range may become the effective cached translation and software must
> > maintain consistent attributes across the range.
> >
> > Fixes: 4602e5757bcc ("arm64/mm: wire up PTE_CONT for user mappings")
> >
>
> nit: there shouldn't be whitespace here.
>
> > Reviewed-by: Alistair Popple <apopple@nvidia.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Breno Leitao <leitao@debian.org>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
>
> This fix looks good to me:
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Thanks!
>
>
> > ---
> > arch/arm64/mm/contpte.c | 47 +++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 43 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > index bcac4f55f9c1..9868bfe4607c 100644
> > --- a/arch/arm64/mm/contpte.c
> > +++ b/arch/arm64/mm/contpte.c
> > @@ -390,6 +390,23 @@ void contpte_clear_young_dirty_ptes(struct vm_area_struct *vma,
> > }
> > EXPORT_SYMBOL_GPL(contpte_clear_young_dirty_ptes);
> >
> > +static bool contpte_all_subptes_match_access_flags(pte_t *ptep, pte_t entry)
> > +{
> > + pte_t *cont_ptep = contpte_align_down(ptep);
> > + const pteval_t access_mask = PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
> > + pteval_t entry_access = pte_val(entry) & access_mask;
> > + int i;
> > +
> > + for (i = 0; i < CONT_PTES; i++) {
> > + pteval_t pte_access = pte_val(__ptep_get(cont_ptep + i)) & access_mask;
> > +
> > + if (pte_access != entry_access)
> > + return false;
> > + }
>
> There are 2 forms of "dirty"; HW and SW. Here you are testing that all ptes in
> the contpte block have the same form of dirty, which I think is the correct
> thing to do. You could relax to just test that every pte has one of the forms of
> dirty, But in that case, if a pte is sw-dirty but not hw-dirty, then the
> PTE_RDONLY bit remains set and the SMMU will fault, I think?
Yes, for SMMU we need to make sure the HW-dirty state is correct or it
will keep faulting. And while we are doing it, it makes sense to just
update all the bits to be consistent.
>
> If my reasoning is correct, then I think arm64 hugetlb has a similar bug; See
> __cont_access_flags_changed(), which just checks for any form of dirty. So I
> guess hugetlb is buggy in the same way and should be fixed to use this more
> stringent approach?
Given sw-dirty is managed by sw, is it correct for sw to ever create a
PTE that's sw-dirty but not hw-dirty? If not, then I think it will still
work fine for the SMMU case as sw-dirty implies hw-dirty, and if it's
missing then we will set both. But for thoroughness it could make sense
to be stricter and add some comments there as it does feel a little
fragile. I'm very new to this area though so probably best for others to
comment and tackle this.
Thanks,
Piotr
>
> Thanks,
> Ryan
>
> > +
> > + return true;
> > +}
> > +
> > int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> > unsigned long addr, pte_t *ptep,
> > pte_t entry, int dirty)
> > @@ -399,13 +416,35 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> > int i;
> >
> > /*
> > - * Gather the access/dirty bits for the contiguous range. If nothing has
> > - * changed, its a noop.
> > + * Check whether all sub-PTEs in the CONT block already have the
> > + * requested access flags, using raw per-PTE values rather than the
> > + * gathered ptep_get() view.
> > + *
> > + * ptep_get() gathers AF/dirty state across the whole CONT block,
> > + * which is correct for CPU TLB semantics: with FEAT_HAFDBS the
> > + * hardware may set AF/dirty on any sub-PTE and the CPU TLB treats
> > + * the gathered result as authoritative for the entire range. But an
> > + * SMMU without HTTU (or with HA/HD disabled in CD.TCR) evaluates
> > + * each descriptor individually and will keep faulting on the target
> > + * sub-PTE if its flags haven't actually been updated. Gathering can
> > + * therefore cause false no-ops when only a sibling has been updated:
> > + * - write faults: target still has PTE_RDONLY (needs PTE_RDONLY cleared)
> > + * - read faults: target still lacks PTE_AF
> > + *
> > + * Per Arm ARM (DDI 0487) D8.7.1, any sub-PTE in a CONT range may
> > + * become the effective cached translation, so all entries must have
> > + * consistent attributes. Check the full CONT block before returning
> > + * no-op, and when any sub-PTE mismatches, proceed to update the whole
> > + * range.
> > */
> > - orig_pte = pte_mknoncont(ptep_get(ptep));
> > - if (pte_val(orig_pte) == pte_val(entry))
> > + if (contpte_all_subptes_match_access_flags(ptep, entry))
> > return 0;
> >
> > + /*
> > + * Use raw target pte (not gathered) for write-bit unfold decision.
> > + */
> > + orig_pte = pte_mknoncont(__ptep_get(ptep));
> > +
> > /*
> > * We can fix up access/dirty bits without having to unfold the contig
> > * range. But if the write bit is changing, we must unfold.
>
>
* Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
2026-03-03 18:40 ` Piotr Jaroszynski
@ 2026-03-03 19:12 ` Jason Gunthorpe
0 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2026-03-03 19:12 UTC (permalink / raw)
To: Piotr Jaroszynski
Cc: Ryan Roberts, Will Deacon, Catalin Marinas, linux-arm-kernel,
linux-mm, Alistair Popple, John Hubbard, Zi Yan, Breno Leitao,
stable
On Tue, Mar 03, 2026 at 10:40:00AM -0800, Piotr Jaroszynski wrote:
> > If my reasoning is correct, then I think arm64 hugetlb has a similar bug; See
> > __cont_access_flags_changed(), which just checks for any form of dirty. So I
> > guess hugetlb is buggy in the same way and should be fixed to use this more
> > stringent approach?
>
> Given sw-dirty is managed by sw, is it correct for sw to ever create a
> PTE that's sw-dirty but not hw-dirty? If not, then I think it will still
> work fine for the SMMU case as sw-dirty implies hw-dirty, and if it's
> missing then we will set both. But for thoroughness it could make sense
> to be stricter and add some comments there as it does feel a little
> fragile. I'm very new to this area though so probably best for others to
> comment and tackle this.
I also am not so familiar with SW dirty, but the big thing to be
sensitive to here is that the CPU can get by with setting something as
RDONLY | DBM and the CPU will flip off RDONLY without issue, so you
might not notice it.
But if handle_mm_fault() is called on something RDONLY|DBM it *MUST*
clear the RDONLY for all CONT entries, or fail the fault. Otherwise
the SMMU SVA explodes.
However, I agree the __cont_access_flags_changed() looks concerning.
I spent a few AI $$ to cross-check this and it echoes Ryan's concern
but did not find a specific counter-example where PTE_WRITE |
PTE_DIRTY | PTE_RDONLY can happen for hugetlb. I don't really trust
this negative result, so I'm inclined to agree it should be made more
robust. I like the look of the patch (below) it proposed too.
AI ramblings:
PROMPT:
Review 20260303063751.2531716-1-pjaroszynski@nvidia.com.mbx and focus
on __cont_access_flags_changed(), huge_ptep_set_access_flags(),
hugetlb_fault(). Does it have the same issue as this patch is fixing,
considering the following additional note:
A big thing to be sensitive to here is that the CPU can get by with
setting something as RDONLY | DBM and the CPU will flip off RDONLY
without issue, so you might not notice it.
But if handle_mm_fault() is called on something RDONLY|DBM it *MUST*
clear the RDONLY for all CONT entries, or fail the fault.
====
Analysis: Does hugetlb __cont_access_flags_changed() have the same bug as the contpte patch?
The contpte bug (what the patch fixes)
The old contpte_ptep_set_access_flags() (contpte.c:614) used ptep_get()
which gathers AF/dirty from ALL sub-PTEs via OR, then compared the
gathered result against entry:
// OLD (buggy):
orig_pte = pte_mknoncont(ptep_get(ptep)); // gathers across all sub-PTEs!
if (pte_val(orig_pte) == pte_val(entry))
return 0;
A dirty sibling makes the gathered PTE look already-dirty, even though
the target sub-PTE still has PTE_RDONLY in hardware. False no-op → SMMU
infinite fault loop.
The fix replaces this with contpte_all_subptes_match_access_flags()
which checks each sub-PTE’s raw bit values
(PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY).
The hugetlb code
__cont_access_flags_changed() (hugetlbpage.c:400-418) does NOT use
gathering for the no-op check – it reads each sub-PTE individually via
__ptep_get():
for (i = 0; i < ncontig; i++) {
pte_t orig_pte = __ptep_get(ptep + i);
if (pte_dirty(pte) != pte_dirty(orig_pte))
return 1;
if (pte_young(pte) != pte_young(orig_pte))
return 1;
}
So it does NOT have the exact same gathering-based false no-op bug.
But it does have a weaker form of the same conceptual problem
The check uses pte_dirty() which is:
#define pte_hw_dirty(pte) (pte_write(pte) && !pte_rdonly(pte))
#define pte_sw_dirty(pte) (!!(pte_val(pte) & PTE_DIRTY))
#define pte_dirty(pte) (pte_sw_dirty(pte) || pte_hw_dirty(pte))
It abstracts over HW-dirty and SW-dirty. It does NOT check PTE_RDONLY
directly.
If a sub-PTE were in the state PTE_WRITE | PTE_DIRTY | PTE_RDONLY
(SW-dirty but HW-readonly), then pte_dirty() returns true (via
sw_dirty), and the check would say “no change needed” – but PTE_RDONLY
is still set in hardware, which the SMMU would fault on.
The contpte fix’s contpte_all_subptes_match_access_flags() would catch
this because it compares the raw PTE_RDONLY bit directly.
Does PTE_WRITE | PTE_DIRTY | PTE_RDONLY actually occur?
=======================================================
Looking at all the arm64 PTE manipulation functions:
- pte_mkdirty(): sets PTE_DIRTY AND clears PTE_RDONLY (if writable)
- pte_mkwrite_novma(): sets PTE_WRITE, clears PTE_RDONLY if PTE_DIRTY
- pte_wrprotect(): saves HW dirty to SW dirty, sets PTE_RDONLY, clears
PTE_WRITE
- pte_mkclean(): clears PTE_DIRTY, sets PTE_RDONLY
- HAFDBS: can only CLEAR PTE_RDONLY, never set it
None of these create PTE_DIRTY | PTE_WRITE | PTE_RDONLY. The invariant
holds: for writable pages, PTE_DIRTY => !PTE_RDONLY.
Tracing the RDONLY|DBM scenario through hugetlb_fault()
For the specific scenario: all sub-PTEs are PTE_WRITE | PTE_RDONLY
(RDONLY|DBM, clean writable), and handle_mm_fault() is called:
1. huge_ptep_get() gathers – no dirty siblings, so vmf.orig_pte =
PTE_WRITE | PTE_RDONLY | PTE_AF
2. huge_pte_mkdirty() sets PTE_DIRTY, clears PTE_RDONLY
3. entry = PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
4. __cont_access_flags_changed():
- All sub-PTEs: PTE_WRITE | PTE_RDONLY | PTE_AF
- pte_dirty(entry) = true, pte_dirty(sub) = false (RDONLY set, no
PTE_DIRTY)
- Mismatch detected -> returns 1 -> BBM proceeds
5. BBM rewrites ALL PTEs with
PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
6. RDONLY cleared for all entries. Correct.
With some siblings HW-dirtied by CPU (e.g., PTE #3 has !PTE_RDONLY):
1. huge_ptep_get() gathers: sees HW-dirty sibling, calls pte_mkdirty()
on gathered
2. entry = PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
3. __cont_access_flags_changed():
- PTE #3 (!PTE_RDONLY): pte_dirty() = true -> match
- PTE #0 (PTE_RDONLY): pte_dirty() = false -> mismatch -> returns
1
4. BBM proceeds. All PTEs rewritten. Correct.
With ALL siblings HW-dirtied by CPU:
1. All sub-PTEs: PTE_WRITE | !PTE_RDONLY | PTE_AF (all HW dirty)
2. entry: PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
3. __cont_access_flags_changed(): all pte_dirty() match -> returns 0
(no-op)
4. But RDONLY is already clear on all PTEs, so SMMU won’t fault.
Correct.
Conclusion
==========
The hugetlb code does NOT have the same bug in the sense that it doesn’t
produce false no-ops with current code. The key difference is
__cont_access_flags_changed() checks each sub-PTE individually, not a
gathered view.
However, Ryan Roberts’ observation is valid:
__cont_access_flags_changed() has a latent weakness. It relies on the
invariant PTE_DIRTY => !PTE_RDONLY (for writable pages) rather than
checking PTE_RDONLY directly. If this invariant were ever violated
(e.g., by a future code change), the no-op check could falsely pass,
leaving RDONLY set in hardware and causing SMMU fault loops. The contpte
fix’s approach of checking raw bit values is more robust and should be
adopted by the hugetlb code as a defense-in-depth measure.
The recommended fix for hugetlb: replace the pte_dirty()/pte_young()
checks in __cont_access_flags_changed() with a raw-bitmask comparison
similar to contpte_all_subptes_match_access_flags(), using the mask
PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY.
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index a42c05cf564082..34e091b398123e 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -389,28 +389,24 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
}
/*
- * huge_ptep_set_access_flags will update access flags (dirty, accesssed)
+ * huge_ptep_set_access_flags will update access flags (dirty, accessed)
* and write permission.
*
- * For a contiguous huge pte range we need to check whether or not write
- * permission has to change only on the first pte in the set. Then for
- * all the contiguous ptes we need to check whether or not there is a
- * discrepancy between dirty or young.
+ * Check all sub-PTEs' raw access flag bits rather than using the abstracted
+ * pte_dirty()/pte_young() helpers which conflate HW-dirty and SW-dirty.
+ * This ensures PTE_RDONLY is checked directly: a sub-PTE that is SW-dirty
+ * (PTE_DIRTY set) but still has PTE_RDONLY would be missed by pte_dirty()
+ * but will cause an SMMU without HTTU to keep faulting. The access flag
+ * mask matches the one used by __ptep_set_access_flags().
*/
static int __cont_access_flags_changed(pte_t *ptep, pte_t pte, int ncontig)
{
+ const pteval_t access_mask = PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
+ pteval_t pte_access = pte_val(pte) & access_mask;
int i;
- if (pte_write(pte) != pte_write(__ptep_get(ptep)))
- return 1;
-
for (i = 0; i < ncontig; i++) {
- pte_t orig_pte = __ptep_get(ptep + i);
-
- if (pte_dirty(pte) != pte_dirty(orig_pte))
- return 1;
-
- if (pte_young(pte) != pte_young(orig_pte))
+ if ((pte_val(__ptep_get(ptep + i)) & access_mask) != pte_access)
return 1;
}
Jason