From: Ryan Roberts <ryan.roberts@arm.com>
To: David Hildenbrand <david@redhat.com>,
Mark Rutland <mark.rutland@arm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>,
Adrian Hunter <adrian.hunter@intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
Muchun Song <muchun.song@linux.dev>
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
Date: Thu, 11 Apr 2024 10:45:38 +0100
Message-ID: <81aa23ca-18b1-4430-9ad1-00a2c5af8fc2@arm.com>
In-Reply-To: <5a23518b-7974-4b03-bd6e-80ecf6c39484@redhat.com>
On 10/04/2024 21:09, David Hildenbrand wrote:
> [...]
>
> Skipping the ptdesc stuff we discussed offline, to not get distracted. :)
>
> This stuff is killing me, sorry for the lengthy reply ...
No problem - thanks for taking the time to think it through and reply with such
clarity. :)
>
>>
>> So I've been looking at all this again, and getting myself even more confused.
>>
>> I believe there are 2 classes of ptep_get_lockless() caller:
>>
>> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
>> 2) everyone else
>
> Likely only completely lockless page table walkers where we *cannot* recheck
> under PTL is special. Essentially where we disable interrupts and rely on TLB
> flushes to sync against concurrent changes.
Yes, agreed - 2 types: "lockless walkers that later recheck under PTL" and
"lockless walkers that never take the PTL".

Detail: the part about disabling interrupts and TLB flush syncing is
arch-specific. That's not how arm64 does it (the HW broadcasts the TLBIs). But
you make that clear further down.
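(For reference, GUP-fast's walk is wrapped roughly like this in
lockless_pages_from_mm() - simplified, details elided:

	local_irq_save(flags);
	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);

The IRQ-off window is what keeps the page tables from being freed under the
walker; x86 additionally relies on it to block TLB-flush IPIs, whereas arm64
frees tables via RCU and broadcasts TLBIs in HW.)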
>
> Let's take a look where ptep_get_lockless() comes from:
>
> commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3
> Author: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Nov 13 11:41:40 2020 +0100
>
> mm/gup: Provide gup_get_pte() more generic
>
> In order to write another lockless page-table walker, we need
> gup_get_pte() exposed. While doing that, rename it to
> ptep_get_lockless() to match the existing ptep_get() naming.
>
>
> GUP-fast, when we were still relying on TLB flushes to sync against GUP-fast.
>
> "With get_user_pages_fast(), we walk down the pagetables without taking any
> locks. For this we would like to load the pointers atomically, but sometimes
> that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What we
> do have is the guarantee that a PTE will only either go from not present to
> present, or present to not present or both -- it will not switch to a completely
> different present page without a TLB flush in between; something that we are
> blocking by holding interrupts off."
>
> Later, we added support for GUP-fast that introduced the !TLB variant:
>
> commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
> Author: Steve Capper <steve.capper@linaro.org>
> Date: Thu Oct 9 15:29:14 2014 -0700
>
> mm: introduce a general RCU get_user_pages_fast()
>
> With the pattern
>
> /*
> * In the line below we are assuming that the pte can be read
> * atomically. If this is not the case for your architecture,
> * please wrap this in a helper function!
> *
> * for an example see gup_get_pte in arch/x86/mm/gup.c
> */
> pte_t pte = ACCESS_ONCE(*ptep);
> ...
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> ...
>
>
> Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte =
> gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read was
> required to be sane, thus the lengthy comment above ptep_get_lockless() that
> talks about TLB flushes.
>
> The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is still
> full of details about TLB flushes syncing against GUP-fast. But as you note, we
> use it even in contexts where we don't disable interrupts and the TLB flush
> can't help.
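(For context, the CONFIG_GUP_GET_PXX_LOW_HIGH variant reads the two halves
non-atomically and retries on a torn read; roughly:

	static inline pte_t ptep_get_lockless(pte_t *ptep)
	{
		pte_t pte;

		do {
			/*
			 * A concurrent updater changes the low word (it
			 * holds the present bit), so re-reading it detects
			 * a torn read.
			 */
			pte.pte_low = ptep->pte_low;
			smp_rmb();
			pte.pte_high = ptep->pte_high;
			smp_rmb();
		} while (unlikely(pte.pte_low != ptep->pte_low));

		return pte;
	}

The loop only detects tearing; it's the IRQ-off + TLB-flush-IPI protocol that
prevents the PTE switching to a different present page between the two reads.)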
>
> We don't disable interrupts during page faults ... so most of the things
> described in ptep_get_lockless() don't really apply.
>
> That's also the reason why ...
>>
>> vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> - vmf->orig_pte = *vmf->pte;
>> + vmf->orig_pte = ptep_get_lockless(vmf->pte);
>> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>>
>> - /*
>> - * some architectures can have larger ptes than wordsize,
>> - * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
>> - * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
>> - * accesses. The code below just needs a consistent view
>> - * for the ifs and we later double check anyway with the
>> - * ptl lock held. So here a barrier will do.
>> - */
>> - barrier();
>> if (pte_none(vmf->orig_pte)) {
>
> ... that code was in place. We would possibly read garbage PTEs, but would
> recheck *under PTL* (where the PTE can no longer change) that what we read
> wasn't garbage and didn't change.
Agreed.
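(i.e. the pattern is: read lockless, then verify under the PTL before acting on
it. Sketch only - details vary by call site:

	vmf->orig_pte = ptep_get_lockless(vmf->pte);
	...
	spin_lock(vmf->ptl);
	if (!pte_same(ptep_get(vmf->pte), vmf->orig_pte)) {
		/* Lockless read raced (or was garbage); retry the fault. */
		spin_unlock(vmf->ptl);
		return 0;
	}
	/* From here, orig_pte is known to be the real PTE value. */

So a garbage value read without the PTL is caught before it can be acted on.)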
>
>>
>> --
>>
>> (2) All the other users require that a subset of the pte fields are
>> self-consistent; specifically they don't care about access, dirty, uffd-wp or
>> soft-dirty. arm64 can guarantee that all the other bits are self-consistent
>> just by calling ptep_get().
>
> Let's focus on access+dirty for now ;)
>
>>
>> --
>>
>> So, I'm making the bold claim that it was never necessary to specialize
>> ptep_get_lockless() on arm64, and I *think* we could just delete it so that
>> ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
>> aim without needing to introduce "norecency" variants.
>>
>> Additionally I propose documenting ptep_get_lockless() to describe the set of
>> fields that are guaranteed to be self-consistent and the remaining fields which
>> are self-consistent only on a best-effort basis.
>>
>> Could it be this easy? My head is hurting...
>
> I think what has to happen is:
>
> (1) ptep_get_lockless() must return the same value as ptep_get() as long as
> there are no races. No removal/addition of access/dirty bits etc.
Today's arm64 ptep_get() guarantees this.
>
> (2) Lockless page table walkers that later verify under the PTL can handle
> serious "garbage PTEs". This is our page fault handler.
This isn't really a property of ptep_get_lockless(); it's a statement about a
class of users. I agree with the statement.
>
> (3) Lockless page table walkers that cannot verify under PTL cannot handle
> arbitrary garbage PTEs. This is GUP-fast. Two options:
>
> (3a) ptep_get_lockless() can atomically read the PTE: We re-check later if the
> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
> required. This is the common case. HW might concurrently set access/dirty bits,
> so we can race with that. But we don't read garbage.
Today's arm64 ptep_get() cannot guarantee that the access/dirty bits are
consistent for contpte ptes. That's the bit that complicates the current
ptep_get_lockless() implementation.
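(The complication being that, for contpte ptes, ptep_get() has to gather the
access/dirty bits from every pte in the contig block; simplified from
arch/arm64/mm/contpte.c:

	pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
	{
		pte_t pte;
		int i;

		/*
		 * OR in any access/dirty bits that HW may have set on any
		 * pte of the contig block.
		 */
		ptep = contpte_align_down(ptep);

		for (i = 0; i < CONT_PTES; i++, ptep++) {
			pte = __ptep_get(ptep);

			if (pte_dirty(pte))
				orig_pte = pte_mkdirty(orig_pte);
			if (pte_young(pte))
				orig_pte = pte_mkyoung(orig_pte);
		}

		return orig_pte;
	}

Without the PTL, those CONT_PTES reads can race with HW access/dirty updates
and with fold/unfold, so the gathered bits are only best-effort.)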
But the point I was trying to make is that GUP-fast does not actually care about
*all* the fields being consistent (e.g. access/dirty). So we could spec
ptep_get_lockless() to say that "all fields in the returned pte are guaranteed
to be self-consistent except for access and dirty information, which may be
inconsistent if a racing modification occurred".
This could mean that the access/dirty state *does* change for a given page while
GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
that failing to detect this is benign.
Aside: GUP-fast currently rechecks the pte originally obtained with
ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform
to (1), so it either returns the same pte, a different pte, or garbage. But that
garbage could just happen to be the same as the originally obtained pte, in
which case we would get a false match. I think this needs to be changed to
ptep_get_lockless()?
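(The recheck in question, from gup_pte_range(), abridged:

	pte_t pte = ptep_get_lockless(ptep);
	...
	/* Pin the page, then verify the pte didn't change under us. */
	if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}

If that second, plain ptep_get() can return garbage that happens to equal the
original pte, the recheck would pass when it shouldn't.)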
>
> (3b) ptep_get_lockless() cannot atomically read the PTE: We need special magic to
> read somewhat-sane PTEs and need IPIs during TLB flushes to sync against serious
> PTE changes (e.g., present -> present). This is weird x86-PAE.
>
>
> If ptep_get() on arm64 can do (1), (2) and (3a), we might be good.
>
> My head is hurting ...
>