From: Ryan Roberts <ryan.roberts@arm.com>
To: David Hildenbrand <david@redhat.com>,
Mark Rutland <mark.rutland@arm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>,
Adrian Hunter <adrian.hunter@intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
Muchun Song <muchun.song@linux.dev>
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
Date: Thu, 11 Apr 2024 10:45:38 +0100
Message-ID: <81aa23ca-18b1-4430-9ad1-00a2c5af8fc2@arm.com>
In-Reply-To: <5a23518b-7974-4b03-bd6e-80ecf6c39484@redhat.com>
On 10/04/2024 21:09, David Hildenbrand wrote:
> [...]
>
> Skipping the ptdesc stuff we discussed offline, to not get distracted. :)
>
> This stuff is killing me, sorry for the lengthy reply ...
No problem - thanks for taking the time to think it through and reply with such
clarity. :)
>
>>
>> So I've been looking at all this again, and getting myself even more confused.
>>
>> I believe there are 2 classes of ptep_get_lockless() caller:
>>
>> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
>> 2) everyone else
>
> Likely only completely lockless page table walkers where we *cannot* recheck
> under PTL is special. Essentially where we disable interrupts and rely on TLB
> flushes to sync against concurrent changes.
Yes, agreed - 2 types: "lockless walkers that later recheck under PTL" and
"lockless walkers that never take the PTL".

Detail: the part about disabling interrupts and TLB flush syncing is
arch-specific. That's not how arm64 does it (the HW broadcasts the TLBIs). But
you make that clear further down.
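(For reference, GUP-fast's walk is wrapped roughly like this in
lockless_pages_from_mm() - simplified, details elided:

	local_irq_save(flags);
	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);

The IRQ-off window is what keeps the page tables from being freed under the
walker; x86 additionally relies on it to block TLB-flush IPIs, whereas arm64
frees tables via RCU and broadcasts TLBIs in HW.)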
>
> Let's take a look where ptep_get_lockless() comes from:
>
> commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3
> Author: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Nov 13 11:41:40 2020 +0100
>
> mm/gup: Provide gup_get_pte() more generic
>
> In order to write another lockless page-table walker, we need
> gup_get_pte() exposed. While doing that, rename it to
> ptep_get_lockless() to match the existing ptep_get() naming.
>
>
> GUP-fast, when we were still relying on TLB flushes to sync against GUP-fast.
>
> "With get_user_pages_fast(), we walk down the pagetables without taking any
> locks. For this we would like to load the pointers atomically, but sometimes
> that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What we
> do have is the guarantee that a PTE will only either go from not present to
> present, or present to not present or both -- it will not switch to a completely
> different present page without a TLB flush in between; something that we are
> blocking by holding interrupts off."
>
> Later, we added support for GUP-fast that introduced the !TLB variant:
>
> commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
> Author: Steve Capper <steve.capper@linaro.org>
> Date: Thu Oct 9 15:29:14 2014 -0700
>
> mm: introduce a general RCU get_user_pages_fast()
>
> With the pattern
>
> /*
> * In the line below we are assuming that the pte can be read
> * atomically. If this is not the case for your architecture,
> * please wrap this in a helper function!
> *
> * for an example see gup_get_pte in arch/x86/mm/gup.c
> */
> pte_t pte = ACCESS_ONCE(*ptep);
> ...
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> ...
>
>
> Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte =
> gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read was
> required to be sane, thus the lengthy comment above ptep_get_lockless() that
> talks about TLB flushes.
>
> The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is still
> full of details about TLB flushes syncing against GUP-fast. But as you note, we
> use it even in contexts where we don't disable interrupts and the TLB flush
> can't help.
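(For context, the CONFIG_GUP_GET_PXX_LOW_HIGH variant reads the two halves
non-atomically and retries on a torn read; roughly:

	static inline pte_t ptep_get_lockless(pte_t *ptep)
	{
		pte_t pte;

		do {
			/*
			 * A concurrent updater changes the low word (it
			 * holds the present bit), so re-reading it detects
			 * a torn read.
			 */
			pte.pte_low = ptep->pte_low;
			smp_rmb();
			pte.pte_high = ptep->pte_high;
			smp_rmb();
		} while (unlikely(pte.pte_low != ptep->pte_low));

		return pte;
	}

The loop only detects tearing; it's the IRQ-off + TLB-flush-IPI protocol that
prevents the PTE switching to a different present page between the two reads.)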
>
> We don't disable interrupts during page faults ... so most of the things
> described in ptep_get_lockless() don't really apply.
>
> That's also the reason why ...
>>
>> vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> - vmf->orig_pte = *vmf->pte;
>> + vmf->orig_pte = ptep_get_lockless(vmf->pte);
>> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>>
>> - /*
>> - * some architectures can have larger ptes than wordsize,
>> - * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
>> - * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
>> - * accesses. The code below just needs a consistent view
>> - * for the ifs and we later double check anyway with the
>> - * ptl lock held. So here a barrier will do.
>> - */
>> - barrier();
>> if (pte_none(vmf->orig_pte)) {
>
> ... that code was in place. We would possibly read garbage PTEs, but would
> recheck *under PTL* (where the PTE can no longer change) that what we read
> wasn't garbage and didn't change.
Agreed.
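(i.e. the pattern is: read lockless, then verify under the PTL before acting on
it. Sketch only - details vary by call site:

	vmf->orig_pte = ptep_get_lockless(vmf->pte);
	...
	spin_lock(vmf->ptl);
	if (!pte_same(ptep_get(vmf->pte), vmf->orig_pte)) {
		/* Lockless read raced (or was garbage); retry the fault. */
		spin_unlock(vmf->ptl);
		return 0;
	}
	/* From here, orig_pte is known to be the real PTE value. */

So a garbage value read without the PTL is caught before it can be acted on.)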
>
>>
>> --
>>
>> (2) All the other users require that a subset of the pte fields are
>> self-consistent; specifically they don't care about access, dirty, uffd-wp or
>> soft-dirty. arm64 can guarantee that all the other bits are self-consistent
>> just by calling ptep_get().
>
> Let's focus on access+dirty for now ;)
>
>>
>> --
>>
>> So, I'm making the bold claim that it was never necessary to specialize
>> ptep_get_lockless() on arm64, and I *think* we could just delete it so that
>> ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
>> aim without needing to introduce "norecency" variants.
>>
>> Additionally I propose documenting ptep_get_lockless() to describe the set of
>> fields that are guaranteed to be self-consistent and the remaining fields which
>> are self-consistent only on a best-effort basis.
>>
>> Could it be this easy? My head is hurting...
>
> I think what has to happen is:
>
> (1) ptep_get_lockless() must return the same value as ptep_get() as long as
> there are no races. No removal/addition of access/dirty bits etc.
Today's arm64 ptep_get() guarantees this.
>
> (2) Lockless page table walkers that later verify under the PTL can handle
> serious "garbage PTEs". This is our page fault handler.
This isn't really a property of ptep_get_lockless(); it's a statement about a
class of users. I agree with the statement.
>
> (3) Lockless page table walkers that cannot verify under PTL cannot handle
> arbitrary garbage PTEs. This is GUP-fast. Two options:
>
> (3a) ptep_get_lockless() can atomically read the PTE: We re-check later if the
> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
> required. This is the common case. HW might concurrently set access/dirty bits,
> so we can race with that. But we don't read garbage.
Today's arm64 ptep_get() cannot guarantee that the access/dirty bits are
consistent for contpte ptes. That's the bit that complicates the current
ptep_get_lockless() implementation.
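(The complication being that, for contpte ptes, ptep_get() has to gather the
access/dirty bits from every pte in the contig block; simplified from
arch/arm64/mm/contpte.c:

	pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
	{
		pte_t pte;
		int i;

		/*
		 * OR in any access/dirty bits that HW may have set on any
		 * pte of the contig block.
		 */
		ptep = contpte_align_down(ptep);

		for (i = 0; i < CONT_PTES; i++, ptep++) {
			pte = __ptep_get(ptep);

			if (pte_dirty(pte))
				orig_pte = pte_mkdirty(orig_pte);
			if (pte_young(pte))
				orig_pte = pte_mkyoung(orig_pte);
		}

		return orig_pte;
	}

Without the PTL, those CONT_PTES reads can race with HW access/dirty updates
and with fold/unfold, so the gathered bits are only best-effort.)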
But the point I was trying to make is that GUP-fast does not actually care about
*all* the fields being consistent (e.g. access/dirty). So we could spec
ptep_get_lockless() to say that "all fields in the returned pte are guaranteed
to be self-consistent except for access and dirty information, which may be
inconsistent if a racing modification occurred".
This could mean that the access/dirty state *does* change for a given page while
GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
that failing to detect this is benign.
Aside: GUP-fast currently rechecks the pte originally obtained with
ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform
to (1), so it either returns the same pte, a different pte, or garbage. But that
garbage could just happen to be the same as the originally obtained pte, in
which case we would get a false match. I think this needs to be changed to
ptep_get_lockless()?
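(The recheck in question, from gup_pte_range(), abridged:

	pte_t pte = ptep_get_lockless(ptep);
	...
	/* Pin the page, then verify the pte didn't change under us. */
	if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}

If that second, plain ptep_get() can return garbage that happens to equal the
original pte, the recheck would pass when it shouldn't.)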
>
> (3b) ptep_get_lockless() cannot atomically read the PTE: We need special magic to
> read somewhat-sane PTEs and need IPIs during TLB flushes to sync against serious
> PTE changes (e.g., present -> present). This is weird x86-PAE.
>
>
> If ptep_get() on arm64 can do (1), (2) and (3a), we might be good.
>
> My head is hurting ...
>