From: Ryan Roberts <ryan.roberts@arm.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>, Ard Biesheuvel <ardb@kernel.org>,
Marc Zyngier <maz@kernel.org>, James Morse <james.morse@arm.com>,
Andrey Ryabinin <ryabinin.a.a@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>,
Mark Rutland <mark.rutland@arm.com>,
David Hildenbrand <david@redhat.com>,
Kefeng Wang <wangkefeng.wang@huawei.com>,
John Hubbard <jhubbard@nvidia.com>, Zi Yan <ziy@nvidia.com>,
Barry Song <21cnbao@gmail.com>,
Alistair Popple <apopple@nvidia.com>,
Yang Shi <shy828301@gmail.com>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"H. Peter Anvin" <hpa@zytor.com>,
linux-arm-kernel@lists.infradead.org, x86@kernel.org,
linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings
Date: Fri, 16 Feb 2024 12:53:43 +0000 [thread overview]
Message-ID: <892caa6a-e4fe-4009-aa33-0570526961c5@arm.com> (raw)
In-Reply-To: <Zc9UQy-mtYAzNWm2@arm.com>
Hi Catalin,
Thanks for the review! Comments below...
On 16/02/2024 12:25, Catalin Marinas wrote:
> On Thu, Feb 15, 2024 at 10:31:59AM +0000, Ryan Roberts wrote:
>> arch/arm64/mm/contpte.c | 285 +++++++++++++++++++++++++++++++
>
> Nitpick: I think most symbols in contpte.c can be EXPORT_SYMBOL_GPL().
> We don't expect them to be used by random out of tree modules. In fact,
> do we expect them to end up in modules at all? Most seem to be called
> from the core mm code.
The problem is that the contpte_* symbols are called from the ptep_* inline
functions. So wherever those inlines are called from modules, we need to ensure
the contpte_* symbols are available to link against.
John Hubbard originally reported this problem against v1 and I enumerated all
the drivers that call into the ptep_* inlines here:
https://lore.kernel.org/linux-arm-kernel/b994ff89-1a1f-26ca-9479-b08c77f94be8@arm.com/#t
So they definitely need to be exported. Perhaps we can tighten it to
EXPORT_SYMBOL_GPL(), but I was being cautious as I didn't want to break anything
out-of-tree. I'm not sure what the normal policy is; arm64 seems to use roughly
equal amounts of both.
>
>> +#define ptep_get_lockless ptep_get_lockless
>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>> +{
>> + pte_t pte = __ptep_get(ptep);
>> +
>> + if (likely(!pte_valid_cont(pte)))
>> + return pte;
>> +
>> + return contpte_ptep_get_lockless(ptep);
>> +}
> [...]
>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>> +{
>> + /*
>> + * Gather access/dirty bits, which may be populated in any of the ptes
>> + * of the contig range. We may not be holding the PTL, so any contiguous
>> + * range may be unfolded/modified/refolded under our feet. Therefore we
>> + * ensure we read a _consistent_ contpte range by checking that all ptes
>> + * in the range are valid and have CONT_PTE set, that all pfns are
>> + * contiguous and that all pgprots are the same (ignoring access/dirty).
>> + * If we find a pte that is not consistent, then we must be racing with
>> + * an update so start again. If the target pte does not have CONT_PTE
>> + * set then that is considered consistent on its own because it is not
>> + * part of a contpte range.
>> +	 */
>
> I can't get my head around this lockless API. Maybe it works fine (and
> may have been discussed already) but we should document what the races
> are, why it works, what the memory ordering requirements are. For
> example, the generic (well, x86 PAE) ptep_get_lockless() only needs to
> ensure that the low/high 32 bits of a pte are consistent and there are
> some ordering rules on how these are updated.
>
> Does the arm64 implementation only need to be correct w.r.t. the
> access/dirty bits? Since we can read orig_ptep atomically, I assume the
> only other updates from unfolding would set the dirty/access bits.
>
>> +
>> + pgprot_t orig_prot;
>> + unsigned long pfn;
>> + pte_t orig_pte;
>> + pgprot_t prot;
>> + pte_t *ptep;
>> + pte_t pte;
>> + int i;
>> +
>> +retry:
>> + orig_pte = __ptep_get(orig_ptep);
>> +
>> + if (!pte_valid_cont(orig_pte))
>> + return orig_pte;
>> +
>> + orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>> + ptep = contpte_align_down(orig_ptep);
>> + pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>> +
>> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> + pte = __ptep_get(ptep);
>> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>
> We don't have any ordering guarantees in how the ptes in this range are
> read or written in the contpte_set_ptes() and the fold/unfold functions.
> We might not need them given all the other checks below but it's worth
> adding a comment.
>
>> +
>> + if (!pte_valid_cont(pte) ||
>> + pte_pfn(pte) != pfn ||
>> + pgprot_val(prot) != pgprot_val(orig_prot))
>> + goto retry;
>
> I think this also needs some comment. I get the !pte_valid_cont() check
> to attempt retrying when racing with unfolding. Are the other checks
> needed to detect re-folding with different protection or pfn?
>
>> +
>> + if (pte_dirty(pte))
>> + orig_pte = pte_mkdirty(orig_pte);
>> +
>> + if (pte_young(pte))
>> + orig_pte = pte_mkyoung(orig_pte);
>> + }
>
> After writing the comments above, I think I figured out that the whole
> point of this loop is to check that the ptes in the contig range are
> still consistent and the only variation allowed is the dirty/young
> state to be passed to the orig_pte returned. The original pte may have
> been updated by the time this loop finishes but I don't think it
> matters, it wouldn't be any different than reading a single pte and
> returning it while it is being updated.
Correct. The pte can be updated at any time: before, after or during the reads.
That was always the case. But now we have to cope with a whole contpte block
being repainted while we are reading it. So we are just checking to make sure
that all the ptes that we read from the contpte block are consistent with
each other, and therefore we can trust that the access/dirty bits we gathered
are consistent.
>
> If you can make this easier to parse (in a few years time) with an
> additional patch adding some more comments, that would be great. For
> this patch:
I already have a big block comment at the top, which was trying to explain it.
Clearly not well enough though. I'll add more comments as a follow up patch when
I get back from holiday.
>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Thanks!
>