linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Ryan Roberts <ryan.roberts@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>,
	Piotr Jaroszynski <pjaroszynski@nvidia.com>
Cc: Will Deacon <will@kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
	Alistair Popple <apopple@nvidia.com>,
	John Hubbard <jhubbard@nvidia.com>, Zi Yan <ziy@nvidia.com>,
	Breno Leitao <leitao@debian.org>,
	stable@vger.kernel.org
Subject: Re: [PATCH] arm64: contpte: fix set_access_flags() no-op check for SMMU/ATS faults
Date: Wed, 4 Mar 2026 12:20:31 +0000	[thread overview]
Message-ID: <1080db49-2f83-4fec-ba73-94c6b3a8f7fa@arm.com> (raw)
In-Reply-To: <20260303191217.GD972761@nvidia.com>

On 03/03/2026 19:12, Jason Gunthorpe wrote:
> On Tue, Mar 03, 2026 at 10:40:00AM -0800, Piotr Jaroszynski wrote:
> 
>>> If my reasoning is correct, then I think arm64 hugetlb has a similar bug; See
>>> __cont_access_flags_changed(), which just checks for any form of dirty. So I
>>> guess hugetlb is buggy in the same way and should be fixed to use this more
>>> stringent approach?
>>
>> Given sw-dirty is managed by sw, is it correct for sw to ever create a
>> PTE that's sw-dirty but not hw-dirty? 

It's possible to have a dirty pte that you subsequently make read-only by
calling pte_wrprotect().

These are the valid states for the bits:

 * Dirty  Writable | PTE_RDONLY  PTE_WRITE  PTE_DIRTY (sw)
 *   0      0      |   1           0          0
 *   0      1      |   1           1          0
 *   1      0      |   1           0          1
 *   1      1      |   0           1          x

But I guess only PTE_RDONLY|PTE_WRITE|PTE_DIRTY causes a problem, which can't
happen.


>> If not, then I think it will still
>> work fine for the SMMU case as sw-dirty implies hw-dirty, and if it's
>> missing then we will set both. But for thoroughness it could make sense
>> to be stricter and add some comments there as it does feel a little
>> fragile. I'm very new to this area though so probably best for others to
>> comment and tackle this.
> 
> I also am not so familiar with SW dirty, but the big thing to be
> sensitive to here is that the CPU can get by with setting something as
> RDONLY | DBM and the CPU will flip off RDONLY without issue, so you
> might not notice it.
> 
> But if handle_mm_fault() is called on something RDONLY|DBM it *MUST*
> clear the RDONLY for all CONT entries, or fail the fault. Otherwise
> the SMMU SVA explodes.

Got it.

> 
> However, I agree the __cont_access_flags_changed() looks concerning.
> 
> I spent a few AI $$ to cross check this and it echo's Ryan's concern
> but did not find a specific counter-example where PTE_WRITE |
> PTE_DIRTY | PTE_RDONLY can happen for hugetlb. 

The table I shared above (from pgtable.h) agrees that
PTE_WRITE|PTE_DIRTY|PTE_RDONLY can never happen.

> I don't really trust
> this negative result, so I'm inclined to agree it should be made more
> robust. I like the look of the patch (below) it proposed too.
> 
> AI ramblings:
> PROMPT:
> Review 20260303063751.2531716-1-pjaroszynski@nvidia.com.mbx and focus
> on __cont_access_flags_changed(), huge_ptep_set_access_flags(),
> hugetlb_fault(). Does it have the same issue as this patch is fixing,
> considering the following additional note:
> 
> A big thing to be sensitive to here is that the CPU can get by with
> setting something as RDONLY | DBM and the CPU will flip off RDONLY
> without issue, so you might not notice it.
> 
> But if handle_mm_fault() is called on something RDONLY|DBM it *MUST*
> clear the RDONLY for all CONT entries, or fail the fault.
> ====
> 
> Analysis: Does hugetlb __cont_access_flags_changed() have the same bug as the contpte patch?
> 
> The contpte bug (what the patch fixes)
> 
> The old contpte_ptep_set_access_flags() (contpte.c:614) used ptep_get()
> which gathers AF/dirty from ALL sub-PTEs via OR, then compared the
> gathered result against entry:
> 
>     // OLD (buggy):
>     orig_pte = pte_mknoncont(ptep_get(ptep));   // gathers across all sub-PTEs!
>     if (pte_val(orig_pte) == pte_val(entry))
>         return 0;
> 
> A dirty sibling makes the gathered PTE look already-dirty, even though
> the target sub-PTE still has PTE_RDONLY in hardware. False no-op → SMMU
> infinite fault loop.
> 
> The fix replaces this with contpte_all_subptes_match_access_flags()
> which checks each sub-PTE’s raw bit values
> (PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY).
> 
> The hugetlb code
> 
> __cont_access_flags_changed() (hugetlbpage.c:400-418) does NOT use
> gathering for the no-op check – it reads each sub-PTE individually via
> __ptep_get():
> 
>     for (i = 0; i < ncontig; i++) {
>         pte_t orig_pte = __ptep_get(ptep + i);
>         if (pte_dirty(pte) != pte_dirty(orig_pte))
>             return 1;
>         if (pte_young(pte) != pte_young(orig_pte))
>             return 1;
>     }
> 
> So it does NOT have the exact same gathering-based false no-op bug.
> 
> But it does have a weaker form of the same conceptual problem
> 
> The check uses pte_dirty() which is:
> 
>     #define pte_hw_dirty(pte)  (pte_write(pte) && !pte_rdonly(pte))
>     #define pte_sw_dirty(pte)  (!!(pte_val(pte) & PTE_DIRTY))
>     #define pte_dirty(pte)     (pte_sw_dirty(pte) || pte_hw_dirty(pte))
> 
> It abstracts over HW-dirty and SW-dirty. It does NOT check PTE_RDONLY
> directly.
> 
> If a sub-PTE were in the state PTE_WRITE | PTE_DIRTY | PTE_RDONLY
> (SW-dirty but HW-readonly), then pte_dirty() returns true (via
> sw_dirty), and the check would say “no change needed” – but PTE_RDONLY
> is still set in hardware, which the SMMU would fault on.
> 
> The contpte fix’s contpte_all_subptes_match_access_flags() would catch
> this because it compares the raw PTE_RDONLY bit directly.
> 
> Does PTE_WRITE | PTE_DIRTY | PTE_RDONLY actually occur?
> =======================================================
> 
> Looking at all the arm64 PTE manipulation functions:
> 
> -   pte_mkdirty(): sets PTE_DIRTY AND clears PTE_RDONLY (if writable)
> -   pte_mkwrite_novma(): sets PTE_WRITE, clears PTE_RDONLY if PTE_DIRTY
> -   pte_wrprotect(): saves HW dirty to SW dirty, sets PTE_RDONLY, clears
>     PTE_WRITE
> -   pte_mkclean(): clears PTE_DIRTY, sets PTE_RDONLY
> -   HAFDBS: can only CLEAR PTE_RDONLY, never set it
> 
> None of these create PTE_DIRTY | PTE_WRITE | PTE_RDONLY. The invariant
> holds: for writable pages, PTE_DIRTY => !PTE_RDONLY.
> 
> Tracing the RDONLY|DBM scenario through hugetlb_fault()
> 
> For the specific scenario: all sub-PTEs are PTE_WRITE | PTE_RDONLY
> (RDONLY|DBM, clean writable), and handle_mm_fault() is called:
> 
> 1.  huge_ptep_get() gathers – no dirty siblings, so vmf.orig_pte =
>     PTE_WRITE | PTE_RDONLY | PTE_AF
> 2.  huge_pte_mkdirty() sets PTE_DIRTY, clears PTE_RDONLY
> 3.  entry = PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
> 4.  __cont_access_flags_changed():
>     -   All sub-PTEs: PTE_WRITE | PTE_RDONLY | PTE_AF
>     -   pte_dirty(entry) = true, pte_dirty(sub) = false (RDONLY set, no
>         PTE_DIRTY)
>     -   Mismatch detected -> returns 1 -> BBM proceeds
> 5.  BBM rewrites ALL PTEs with
>     PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
> 6.  RDONLY cleared for all entries. Correct.
> 
> With some siblings HW-dirtied by CPU (e.g., PTE #3 has !PTE_RDONLY):
> 
> 1.  huge_ptep_get() gathers: sees HW-dirty sibling, calls pte_mkdirty()
>     on gathered
> 2.  entry = PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
> 3.  __cont_access_flags_changed():
>     -   PTE #3 (!PTE_RDONLY): pte_dirty() = true -> match
>     -   PTE #0 (PTE_RDONLY): pte_dirty() = false -> mismatch -> returns
>         1
> 4.  BBM proceeds. All PTEs rewritten. Correct.
> 
> With ALL siblings HW-dirtied by CPU:
> 
> 1.  All sub-PTEs: PTE_WRITE | !PTE_RDONLY | PTE_AF (all HW dirty)
> 2.  entry: PTE_WRITE | PTE_DIRTY | !PTE_RDONLY | PTE_AF
> 3.  __cont_access_flags_changed(): all pte_dirty() match -> returns 0
>     (no-op)
> 4.  But RDONLY is already clear on all PTEs, so SMMU won’t fault.
>     Correct.
> 
> Conclusion
> ==========
> 
> The hugetlb code does NOT have the same bug in the sense that it doesn’t
> produce false no-ops with current code. The key difference is
> __cont_access_flags_changed() checks each sub-PTE individually, not a
> gathered view.
> 
> However, Ryan Roberts’ observation is valid:
> __cont_access_flags_changed() has a latent weakness. It relies on the
> invariant PTE_DIRTY => !PTE_RDONLY (for writable pages) rather than
> checking PTE_RDONLY directly. If this invariant were ever violated
> (e.g., by a future code change), the no-op check could falsely pass,
> leaving RDONLY set in hardware and causing SMMU fault loops. The contpte
> fix’s approach of checking raw bit values is more robust and should be
> adopted by the hugetlb code as a defense-in-depth measure.
> 
> The recommended fix for hugetlb: replace the pte_dirty()/pte_young()
> checks in __cont_access_flags_changed() with a raw-bitmask comparison
> similar to contpte_all_subptes_match_access_flags(), using the mask
> PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY.
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index a42c05cf564082..34e091b398123e 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -389,28 +389,24 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
>  }
>  
>  /*
> - * huge_ptep_set_access_flags will update access flags (dirty, accesssed)
> + * huge_ptep_set_access_flags will update access flags (dirty, accessed)
>   * and write permission.
>   *
> - * For a contiguous huge pte range we need to check whether or not write
> - * permission has to change only on the first pte in the set. Then for
> - * all the contiguous ptes we need to check whether or not there is a
> - * discrepancy between dirty or young.
> + * Check all sub-PTEs' raw access flag bits rather than using the abstracted
> + * pte_dirty()/pte_young() helpers which conflate HW-dirty and SW-dirty.
> + * This ensures PTE_RDONLY is checked directly: a sub-PTE that is SW-dirty
> + * (PTE_DIRTY set) but still has PTE_RDONLY would be missed by pte_dirty()
> + * but will cause an SMMU without HTTU to keep faulting.  The access flag
> + * mask matches the one used by __ptep_set_access_flags().
>   */
>  static int __cont_access_flags_changed(pte_t *ptep, pte_t pte, int ncontig)
>  {
> +	const pteval_t access_mask = PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
> +	pteval_t pte_access = pte_val(pte) & access_mask;
>  	int i;
>  
> -	if (pte_write(pte) != pte_write(__ptep_get(ptep)))
> -		return 1;
> -
>  	for (i = 0; i < ncontig; i++) {
> -		pte_t orig_pte = __ptep_get(ptep + i);
> -
> -		if (pte_dirty(pte) != pte_dirty(orig_pte))
> -			return 1;
> -
> -		if (pte_young(pte) != pte_young(orig_pte))
> +		if ((pte_val(__ptep_get(ptep + i)) & access_mask) != pte_access)
>  			return 1;
>  	}

I think, based on all the above, the current version is actually not buggy. But
I'm only willing to go to 95% confidence :)

The change looks reasonable though, if you want to be safe.

Thanks,
Ryan


>  
> 
> Jason



  reply	other threads:[~2026-03-04 12:20 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-03  6:37 Piotr Jaroszynski
2026-03-03  7:19 ` James Houghton
2026-03-03 12:45   ` Jason Gunthorpe
2026-03-03 21:40   ` Piotr Jaroszynski
2026-03-03  8:38 ` Ryan Roberts
2026-03-03 18:40   ` Piotr Jaroszynski
2026-03-03 19:12     ` Jason Gunthorpe
2026-03-04 12:20       ` Ryan Roberts [this message]
2026-03-04 13:44         ` Jason Gunthorpe
2026-03-04 11:17 ` Catalin Marinas
2026-03-04 13:43   ` Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1080db49-2f83-4fec-ba73-94c6b3a8f7fa@arm.com \
    --to=ryan.roberts@arm.com \
    --cc=apopple@nvidia.com \
    --cc=catalin.marinas@arm.com \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=leitao@debian.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-mm@kvack.org \
    --cc=pjaroszynski@nvidia.com \
    --cc=stable@vger.kernel.org \
    --cc=will@kernel.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox