From: Rik van Riel <riel@surriel.com>
To: Nadav Amit <nadav.amit@gmail.com>, x86@kernel.org
Cc: linux-kernel@vger.kernel.org, bp@alien8.de, peterz@infradead.org,
dave.hansen@linux.intel.com, zhengqi.arch@bytedance.com,
thomas.lendacky@amd.com, kernel-team@meta.com,
linux-mm@kvack.org, akpm@linux-foundation.org, jannh@google.com,
mhklinux@outlook.com, andrew.cooper3@citrix.com
Subject: Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
Date: Mon, 20 Jan 2025 11:09:19 -0500
Message-ID: <e9858d9d264cbdb27db7f3df237ba2410522150e.camel@surriel.com>
In-Reply-To: <84ba1c3e-d975-458f-89f5-a6f5d04a3d22@gmail.com>

On Mon, 2025-01-20 at 16:02 +0200, Nadav Amit wrote:
>
>
> On 20/01/2025 4:40, Rik van Riel wrote:
> >
> > +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> > +{
> > +	VM_WARN_ON_ONCE(1);
>
> Not sure why not use VM_WARN_ONCE() instead, with a more
> informative message (anyhow, a string is allocated for it).
>
VM_WARN_ON_ONCE only has a condition, not a message.
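
For reference, the two variants in include/linux/mmdebug.h look
roughly like this (condition-only versus printk-style message):

	VM_WARN_ON_ONCE(cond);		/* condition only, no message */
	VM_WARN_ONCE(cond, fmt, ...);	/* condition plus a message */

So switching here would mean something like the line below, where
the message text is of course just an example:

	VM_WARN_ONCE(1, "broadcast TLB flush without INVLPGB support");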
> >
> > +static u16 get_global_asid(void)
> > +{
> > +	lockdep_assert_held(&global_asid_lock);
> > +
> > +	do {
> > +		u16 start = last_global_asid;
> > +		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> > +
> > +		if (asid >= MAX_ASID_AVAILABLE) {
> > +			reset_global_asid_space();
> > +			continue;
> > +		}
>
> I think that unless something is awfully wrong, you are supposed to
> at most call reset_global_asid_space() once. So if that's the case,
> why not do it this way?
>
> Instead, you can get rid of the loop and just do:
>
>	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
>
> If you want, you can warn if asid >= MAX_ASID_AVAILABLE and have
> some fallback. But the loop is just confusing, in my opinion, for
> no reason.
I can get rid of the loop. You're right that the code
can just call find_next_zero_bit after calling
reset_global_asid_space.
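
Roughly like this (untested sketch; the bookkeeping that claims the
ASID after a successful search stays as in the patch, and
reset_global_asid_space() is assumed to reset last_global_asid):

	static u16 get_global_asid(void)
	{
		u16 asid;

		lockdep_assert_held(&global_asid_lock);

		asid = find_next_zero_bit(global_asid_used,
					  MAX_ASID_AVAILABLE,
					  last_global_asid);

		if (asid >= MAX_ASID_AVAILABLE) {
			/* Global ASID space used up; reclaim freed ones. */
			reset_global_asid_space();
			asid = find_next_zero_bit(global_asid_used,
						  MAX_ASID_AVAILABLE,
						  last_global_asid);
			/* A second failure would mean a bug somewhere. */
			VM_WARN_ON_ONCE(asid >= MAX_ASID_AVAILABLE);
		}

		return asid;
	}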
>
> > +	/* Slower check to make sure. */
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		/* Skip the CPUs that aren't really running this process. */
> > +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> > +			continue;
>
> Then perhaps at least add a comment next to loaded_mm, saying that
> it's not private per se, but is rarely accessed by other cores?
>
I don't see any comment in struct tlb_state that
suggests it was ever private to begin with.
Which comment are you referring to that should
be edited?
> >
> > +
> > +	/*
> > +	 * The transition from IPI TLB flushing, with a dynamic ASID,
> > +	 * and broadcast TLB flushing, using a global ASID, uses memory
> > +	 * ordering for synchronization.
> > +	 *
> > +	 * While the process has threads still using a dynamic ASID,
> > +	 * TLB invalidation IPIs continue to get sent.
> > +	 *
> > +	 * This code sets asid_transition first, before assigning the
> > +	 * global ASID.
> > +	 *
> > +	 * The TLB flush code will only verify the ASID transition
> > +	 * after it has seen the new global ASID for the process.
> > +	 */
> > +	WRITE_ONCE(mm->context.asid_transition, true);
> > +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
>
> I know it is likely correct in practice (due to the TSO memory
> model), but it is not clear, at least to me, how the order of those
> writes affects the rest of the code. I managed to figure out how it
> relates to the reads in flush_tlb_mm_range() and
> native_flush_tlb_multi(), but I wouldn't say it is trivial, and it
> is worth a comment (or smp_wmb()/smp_rmb()).
>
What kind of wording should we add here to make it
easier to understand?
"The TLB invalidation code reads these variables in
the opposite order in which they are written" ?
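
If we went with explicit barriers instead, the pairing would look
something like this (just a sketch; on x86 both barriers reduce to
compiler barriers, which is why the plain WRITE_ONCE/READ_ONCE
ordering works today):

	/* Writer side (this code): asid_transition before global_asid. */
	WRITE_ONCE(mm->context.asid_transition, true);
	smp_wmb();
	WRITE_ONCE(mm->context.global_asid, get_global_asid());

	/* Reader side (flush paths): global_asid before asid_transition. */
	asid = READ_ONCE(mm->context.global_asid);
	smp_rmb();	/* pairs with the smp_wmb() above */
	transitioning = READ_ONCE(mm->context.asid_transition);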
> > +		/*
> > +		 * If at least one CPU is not using the global ASID yet,
> > +		 * send a TLB flush IPI. The IPI should cause stragglers
> > +		 * to transition soon.
> > +		 *
> > +		 * This can race with the CPU switching to another task;
> > +		 * that results in a (harmless) extra IPI.
> > +		 */
> > +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
> > +			flush_tlb_multi(mm_cpumask(info->mm), info);
> > +			return;
>
> I am trying to figure out why we return here. The transition might
> not be over? Why is it "soon"? Wouldn't flush_tlb_func() reload it
> unconditionally?
The transition _should_ be over, but what if another
CPU got an NMI while in the middle of switch_mm_irqs_off,
and set its own bit in the mm_cpumask after we send this
IPI?
On the other hand, if it sets its mm_cpumask bit after
this point, it will also load the mm->context.global_asid
after this point, and should definitely get the new ASID.
I think we are probably fine to set asid_transition to
false here, but I've had to tweak this code so much over
the past months that I don't feel super confident any more :)
>
> > +	/*
> > +	 * TLB flushes with INVLPGB are kicked off asynchronously.
> > +	 * The inc_mm_tlb_gen() guarantees page table updates are done
> > +	 * before these TLB flushes happen.
> > +	 */
> > +	if (info->end == TLB_FLUSH_ALL) {
> > +		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
> > +		/* Do any CPUs supporting INVLPGB need PTI? */
> > +		if (static_cpu_has(X86_FEATURE_PTI))
> > +			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
> > +	} else for (; addr < info->end; addr += nr << info->stride_shift) {
>
> I guess I was wrong, and do-while was cleaner here.
>
> And I guess this is now a bug, if info->stride_shift > PMD_SHIFT...
>
We set maxnr to 1 for larger stride shifts at the top of the function:

	/* Flushing multiple pages at once is not supported with 1GB pages. */
	if (info->stride_shift > PMD_SHIFT)
		maxnr = 1;
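
With that clamp, each iteration also bounds nr to the remaining
range, so the loop never steps past info->end. The arithmetic is
roughly this (a sketch; maxnr is assumed to be initialized from the
INVLPGB count CPUID value added earlier in this series):

	/* Flush at least one, and at most maxnr, pages per INVLPGB. */
	nr = (info->end - addr) >> info->stride_shift;
	nr = clamp_val(nr, 1, maxnr);

With 1GB pages that means one page per INVLPGB, advancing addr by
one full stride each time.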
> [ I guess the cleanest way was to change get_flush_tlb_info() to
> mask the low bits of start and end based on
> ((1ull << stride_shift) - 1). But whatever... ]
I'll change it back :)
I'm just happy this code is getting lots of attention,
and we're improving it with time.
> > @@ -573,6 +874,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> >  		    !cpumask_test_cpu(cpu, mm_cpumask(next))))
> >  			cpumask_set_cpu(cpu, mm_cpumask(next));
> >
> > +		/*
> > +		 * Check if the current mm is transitioning to a new ASID.
> > +		 */
> > +		if (needs_global_asid_reload(next, prev_asid)) {
> > +			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> > +
> > +			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> > +			goto reload_tlb;
>
> Not a fan of the gotos when they are not really needed, and I don't
> think one is really needed here. Especially since the name of the
> "reload_tlb" label does not really convey that the page tables are
> reloaded at that point.
In this particular case, the CPU continues running with
the same page tables, but with a different PCID.
>
> > +		}
> > +
> > +		/*
> > +		 * Broadcast TLB invalidation keeps this PCID up to date
> > +		 * all the time.
> > +		 */
> > +		if (is_global_asid(prev_asid))
> > +			return;
>
> Hard for me to convince myself that this is safe.
When a process uses a global ASID, we always send
out TLB invalidations using INVLPGB.
The global ASID should always be up to date.
>
> > @@ -769,6 +1092,16 @@ static void flush_tlb_func(void *info)
> >  	if (unlikely(loaded_mm == &init_mm))
> >  		return;
> >
> > +	/* Reload the ASID if transitioning into or out of a global ASID */
> > +	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
> > +		switch_mm_irqs_off(NULL, loaded_mm, NULL);
>
> I understand you want to reuse that logic, but it doesn't seem
> reasonable to me. It both fails to convey what you want to do, and
> can lead to undesired operations, cpu_tlbstate_update_lam() for
> instance. Probably the impact on performance is minor, but it is an
> opening for future mistakes.
My worry with having a separate code path here is
that the separate code path could bit rot, and we
could introduce bugs that way.
I would rather have a tiny performance impact in
what is a rare code path, than a rare (and hard
to track down) memory corruption due to bit rot.
>
> > +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> > +	}
> > +
> > +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
> > +	if (is_global_asid(loaded_mm_asid))
> > +		return;
>
> The comment does not make it clear to me, and I cannot clearly
> explain to myself, why it is guaranteed that all the IPI TLB
> flushes which were potentially issued before the transition are not
> needed.
>
IPI TLB flushes that were issued before the transition went
to the CPUs when they were using dynamic ASIDs (numbers 1-5).
Reloading the TLB with a different PCID, even pointed at the
same page tables, means that the TLB should load the
translations fresh from the page tables, and not re-use any
that it had previously loaded under a different PCID.
--
All Rights Reversed.