Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Jason Gunthorpe <jgg@nvidia.com>
To: Piotr Jaroszynski <pjaroszynski@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
	Will Deacon <will@kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
	Alistair Popple <apopple@nvidia.com>,
	John Hubbard <jhubbard@nvidia.com>, Zi Yan <ziy@nvidia.com>,
	Breno Leitao <leitao@debian.org>,
	stable@vger.kernel.org
Subject: Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
Date: Tue, 3 Mar 2026 15:12:17 -0400	[thread overview]
Message-ID: <20260303191217.GD972761@nvidia.com> (raw)
In-Reply-To: <aacohVRfAK46lOjo@box>

On Tue, Mar 03, 2026 at 10:40:00AM -0800, Piotr Jaroszynski wrote:

> > If my reasoning is correct, then I think arm64 hugetlb has a similar bug; See
> > __cont_access_flags_changed(), which just checks for any form of dirty. So I
> > guess hugetlb is buggy in the same way and should be fixed to use this more
> > stringent approach?
> 
> Given sw-dirty is managed by sw, is it correct for sw to ever create a
> PTE that's sw-dirty but not hw-dirty? If not, then I think it will still
> work fine for the SMMU case as sw-dirty implies hw-dirty, and if it's
> missing then we will set both. But for thoroughness it could make sense
> to be stricter and add some comments there as it does feel a little
> fragile. I'm very new to this area though so probably best for others to
> comment and tackle this.

I also am not so familiar with SW dirty, but the big thing to be
sensitive to here is that the CPU can get by with setting something as
RDONLY | DBM and the CPU will flip off RDONLY without issue, so you
might not notice it.

But if handle_mm_fault() is called on something RDONLY|DBM it *MUST*
clear the RDONLY for all CONT entries, or fail the fault. Otherwise
the SMMU SVA explodes.

However, I agree the __cont_access_flags_changed() looks concerning.

I spent a few AI $$ to cross check this and it echo's Ryan's concern
but did not find a specific counter-example where PTE_WRITE |
PTE_DIRTY | PTE_RDONLY can happen for hugetlb. I don't really trust
this negative result, so I'm inclined to agree it should be made more
robust. I like the look of the patch (below) it proposed too.

AI ramblings:
PROMPT:
Review 20260303063751.2531716-1-pjaroszynski@nvidia.com.mbx and focus
on __cont_access_flags_changed(), huge_ptep_set_access_flags(),
hugetlb_fault(). Does it have the same issue as this patch is fixing,
considering the following additional note:

A big thing to be sensitive to here is that the CPU can get by with
setting something as RDONLY | DBM and the CPU will flip off RDONLY
without issue, so you might not notice it.

But if handle_mm_fault() is called on something RDONLY|DBM it *MUST*
clear the RDONLY for all CONT entries, or fail the fault.
====

Analysis: Does hugetlb __cont_access_flags_changed() have the same bug as the contpte patch?

The contpte bug (what the patch fixes)

The old contpte_ptep_set_access_flags() (contpte.c:614) used ptep_get()
which gathers AF/dirty from ALL sub-PTEs via OR, then compared the
gathered result against entry:

    // OLD (buggy):
    orig_pte = pte_mknoncont(ptep_get(ptep));   // gathers across all sub-PTEs!
    if (pte_val(orig_pte) == pte_val(entry))
        return 0;

A dirty sibling makes the gathered PTE look already-dirty, even though
the target sub-PTE still has PTE_RDONLY in hardware. False no-op → SMMU
infinite fault loop.

The fix replaces this with contpte_all_subptes_match_access_flags()
which checks each sub-PTE’s raw bit values
(PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY).

The hugetlb code

__cont_access_flags_changed() (hugetlbpage.c:400-418) does NOT use
gathering for the no-op check – it reads each sub-PTE individually via
__ptep_get():

    for (i = 0; i < ncontig; i++) {
        pte_t orig_pte = __ptep_get(ptep + i);
        if (pte_dirty(pte) != pte_dirty(orig_pte))
            return 1;
        if (pte_young(pte) != pte_young(orig_pte))
            return 1;
    }

So it does NOT have the exact same gathering-based false no-op bug.

But it does have a weaker form of the same conceptual problem

The check uses pte_dirty() which is:

    #define pte_hw_dirty(pte)  (pte_write(pte) && !pte_rdonly(pte))
    #define pte_sw_dirty(pte)  (!!(pte_val(pte) & PTE_DIRTY))
    #define pte_dirty(pte)     (pte_sw_dirty(pte) || pte_hw_dirty(pte))

It abstracts over HW-dirty and SW-dirty. It does NOT check PTE_RDONLY
directly.

If a sub-PTE were in the state PTE_WRITE | PTE_DIRTY | PTE_RDONLY
(SW-dirty but HW-readonly), then pte_dirty() returns true (via
sw_dirty), and the check would say “no change needed” – but PTE_RDONLY
is still set in hardware, which the SMMU would fault on.

The contpte fix’s contpte_all_subptes_match_access_flags() would catch
this because it compares the raw PTE_RDONLY bit directly.

Does PTE_WRITE | PTE_DIRTY | PTE_RDONLY actually occur?
=======================================================

Looking at all the arm64 PTE manipulation functions:

-   pte_mkdirty(): sets PTE_DIRTY AND clears PTE_RDONLY (if writable)
-   pte_mkwrite_novma(): sets PTE_WRITE, clears PTE_RDONLY if PTE_DIRTY
-   pte_wrprotect(): saves HW dirty to SW dirty, sets PTE_RDONLY, clears
    PTE_WRITE
-   pte_mkclean(): clears PTE_DIRTY, sets PTE_RDONLY
-   HAFDBS: can only CLEAR PTE_RDONLY, never set it

None of these create PTE_DIRTY | PTE_WRITE | PTE_RDONLY. The invariant
holds: for writable pages, PTE_DIRTY => !PTE_RDONLY.

Tracing the RDONLY|DBM scenario through hugetlb_fault()

For the specific scenario: all sub-PTEs are PTE_WRITE | PTE_RDONLY
(RDONLY|DBM, clean writable), and handle_mm_fault() is called:

1.  huge_ptep_get() gathers – no dirty siblings, so vmf.orig_pte =
    PTE_WRITE | PTE_RDONLY | PTE_AF
2.  huge_pte_mkdirty() sets PTE_DIRTY, clears PTE_RDONLY
3.  entry = PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
4.  __cont_access_flags_changed():
    -   All sub-PTEs: PTE_WRITE | PTE_RDONLY | PTE_AF
    -   pte_dirty(entry) = true, pte_dirty(sub) = false (RDONLY set, no
        PTE_DIRTY)
    -   Mismatch detected -> returns 1 -> BBM proceeds
5.  BBM rewrites ALL PTEs with
    PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
6.  RDONLY cleared for all entries. Correct.

With some siblings HW-dirtied by CPU (e.g., PTE #3 has !PTE_RDONLY):

1.  huge_ptep_get() gathers: sees HW-dirty sibling, calls pte_mkdirty()
    on gathered
2.  entry = PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
3.  __cont_access_flags_changed():
    -   PTE #3 (!PTE_RDONLY): pte_dirty() = true -> match
    -   PTE #0 (PTE_RDONLY): pte_dirty() = false -> mismatch -> returns
        1
4.  BBM proceeds. All PTEs rewritten. Correct.

With ALL siblings HW-dirtied by CPU:

1.  All sub-PTEs: PTE_WRITE | !PTE_RDONLY | PTE_AF (all HW dirty)
2.  entry: PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
3.  __cont_access_flags_changed(): all pte_dirty() match -> returns 0
    (no-op)
4.  But RDONLY is already clear on all PTEs, so SMMU won’t fault.
    Correct.

Conclusion
==========

The hugetlb code does NOT have the same bug in the sense that it doesn’t
produce false no-ops with current code. The key difference is
__cont_access_flags_changed() checks each sub-PTE individually, not a
gathered view.

However, Ryan Roberts’ observation is valid:
__cont_access_flags_changed() has a latent weakness. It relies on the
invariant PTE_DIRTY => !PTE_RDONLY (for writable pages) rather than
checking PTE_RDONLY directly. If this invariant were ever violated
(e.g., by a future code change), the no-op check could falsely pass,
leaving RDONLY set in hardware and causing SMMU fault loops. The contpte
fix’s approach of checking raw bit values is more robust and should be
adopted by the hugetlb code as a defense-in-depth measure.

The recommended fix for hugetlb: replace the pte_dirty()/pte_young()
checks in __cont_access_flags_changed() with a raw-bitmask comparison
similar to contpte_all_subptes_match_access_flags(), using the mask
PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY.

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index a42c05cf564082..34e091b398123e 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -389,28 +389,24 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 }

 /*
- * huge_ptep_set_access_flags will update access flags (dirty, accesssed)
+ * huge_ptep_set_access_flags will update access flags (dirty, accessed)
  * and write permission.
  *
- * For a contiguous huge pte range we need to check whether or not write
- * permission has to change only on the first pte in the set. Then for
- * all the contiguous ptes we need to check whether or not there is a
- * discrepancy between dirty or young.
+ * Check all sub-PTEs' raw access flag bits rather than using the abstracted
+ * pte_dirty()/pte_young() helpers which conflate HW-dirty and SW-dirty.
+ * This ensures PTE_RDONLY is checked directly: a sub-PTE that is SW-dirty
+ * (PTE_DIRTY set) but still has PTE_RDONLY would be missed by pte_dirty()
+ * but will cause an SMMU without HTTU to keep faulting.  The access flag
+ * mask matches the one used by __ptep_set_access_flags().
  */
 static int __cont_access_flags_changed(pte_t *ptep, pte_t pte, int ncontig)
 {
+	const pteval_t access_mask = PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
+	pteval_t pte_access = pte_val(pte) & access_mask;
 	int i;

-	if (pte_write(pte) != pte_write(__ptep_get(ptep)))
-		return 1;
-
 	for (i = 0; i < ncontig; i++) {
-		pte_t orig_pte = __ptep_get(ptep + i);
-
-		if (pte_dirty(pte) != pte_dirty(orig_pte))
-			return 1;
-
-		if (pte_young(pte) != pte_young(orig_pte))
+		if ((pte_val(__ptep_get(ptep + i)) & access_mask) != pte_access)
 			return 1;
 	}

Jason

     prev parent reply	other threads:[~2026-03-03 19:12 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-03  6:37 Piotr Jaroszynski
2026-03-03  7:19 ` James Houghton
2026-03-03 12:45   ` Jason Gunthorpe
2026-03-03 21:40   ` Piotr Jaroszynski
2026-03-03  8:38 ` Ryan Roberts
2026-03-03 18:40   ` Piotr Jaroszynski
2026-03-03 19:12     ` Jason Gunthorpe [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260303191217.GD972761@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=apopple@nvidia.com \
    --cc=catalin.marinas@arm.com \
    --cc=jhubbard@nvidia.com \
    --cc=leitao@debian.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-mm@kvack.org \
    --cc=pjaroszynski@nvidia.com \
    --cc=ryan.roberts@arm.com \
    --cc=stable@vger.kernel.org \
    --cc=will@kernel.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox