From: Rik van Riel <riel@surriel.com>
To: Nadav Amit <nadav.amit@gmail.com>, x86@kernel.org
Cc: linux-kernel@vger.kernel.org, bp@alien8.de, peterz@infradead.org,
	dave.hansen@linux.intel.com, zhengqi.arch@bytedance.com,
	thomas.lendacky@amd.com, kernel-team@meta.com,
	linux-mm@kvack.org, akpm@linux-foundation.org, jannh@google.com,
	mhklinux@outlook.com, andrew.cooper3@citrix.com
Subject: Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
Date: Mon, 20 Jan 2025 11:09:19 -0500
Message-ID: <e9858d9d264cbdb27db7f3df237ba2410522150e.camel@surriel.com>
In-Reply-To: <84ba1c3e-d975-458f-89f5-a6f5d04a3d22@gmail.com>

On Mon, 2025-01-20 at 16:02 +0200, Nadav Amit wrote:
> 
> 
> On 20/01/2025 4:40, Rik van Riel wrote:
> > 
> > +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> > +{
> > +	VM_WARN_ON_ONCE(1);
> 
> Not sure why not to use VM_WARN_ONCE() instead, with a more
> informative message (a string is allocated for it anyhow).
> 
VM_WARN_ON_ONCE() only takes a condition, not a message.
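
For reference, the difference looks like this (the message text here
is purely illustrative):

	/* Condition only; you get the file and line, nothing else. */
	VM_WARN_ON_ONCE(1);

	/* Condition plus a message, at the cost of the string. */
	VM_WARN_ONCE(1, "broadcast_tlb_flush() called without INVLPGB");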

> > 
> > +static u16 get_global_asid(void)
> > +{
> > +	lockdep_assert_held(&global_asid_lock);
> > +
> > +	do {
> > +		u16 start = last_global_asid;
> > +		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> > +
> > +		if (asid >= MAX_ASID_AVAILABLE) {
> > +			reset_global_asid_space();
> > +			continue;
> > +		}
> 
> I think that unless something is awfully wrong, you are supposed to
> call reset_global_asid_space() at most once. So if that's the case,
> why not do it this way?
> 
> Instead, you can get rid of the loop and just do:
> 
> 	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> 
> If you want, you can warn if asid >= MAX_ASID_AVAILABLE and have some
> fallback. But the loop is just confusing, in my opinion, for no reason.

I can get rid of the loop. You're right that the code
can just call find_next_zero_bit after calling
reset_global_asid_space.
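
Something like this untested sketch, assuming reset_global_asid_space()
moves last_global_asid back to the start of the range (the bookkeeping
that marks the returned ASID as used is left out):

	static u16 get_global_asid(void)
	{
		u16 asid;

		lockdep_assert_held(&global_asid_lock);

		/* Grab the next free global ASID. */
		asid = find_next_zero_bit(global_asid_used,
					  MAX_ASID_AVAILABLE, last_global_asid);

		if (asid >= MAX_ASID_AVAILABLE) {
			/* Ran off the end; reclaim freed ASIDs and retry once. */
			reset_global_asid_space();
			asid = find_next_zero_bit(global_asid_used,
						  MAX_ASID_AVAILABLE,
						  last_global_asid);

			/* Only possible if every global ASID is in use. */
			VM_WARN_ON_ONCE(asid >= MAX_ASID_AVAILABLE);
		}

		return asid;
	}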

> 
> > +	/* Slower check to make sure. */
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		/* Skip the CPUs that aren't really running this process. */
> > +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> > +			continue;
> 
> Then perhaps at least add a comment next to loaded_mm, that it's not
> private per se, but rarely accessed by other cores?
> 
I don't see any comment in struct tlb_state that
suggests it was ever private to begin with.

Which comment are you referring to that should
be edited?

> > 
> > +
> > +	/*
> > +	 * The transition from IPI TLB flushing, with a dynamic ASID,
> > +	 * and broadcast TLB flushing, using a global ASID, uses memory
> > +	 * ordering for synchronization.
> > +	 *
> > +	 * While the process has threads still using a dynamic ASID,
> > +	 * TLB invalidation IPIs continue to get sent.
> > +	 *
> > +	 * This code sets asid_transition first, before assigning the
> > +	 * global ASID.
> > +	 *
> > +	 * The TLB flush code will only verify the ASID transition
> > +	 * after it has seen the new global ASID for the process.
> > +	 */
> > +	WRITE_ONCE(mm->context.asid_transition, true);
> > +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
> 
> I know it is likely correct in practice (due to the TSO memory
> model), but it is not clear, at least to me, how the order of those
> writes affects the rest of the code. I managed to figure out how it
> relates to the reads in flush_tlb_mm_range() and
> native_flush_tlb_multi(), but I wouldn't say it is trivial, and it
> deserves a comment (or smp_wmb()/smp_rmb()).
> 

What kind of wording should we add here to make it
easier to understand?

"The TLB invalidation code reads these variables in
 the opposite order in which they are written" ?
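
Or we could spell the pairing out in a sketch next to the reads
(untested, and the local variable names here are mine, not from the
patch):

	/*
	 * Pairs with the WRITE_ONCE()s in the transition code: the
	 * writer sets asid_transition before global_asid, and we read
	 * them in the opposite order, so once we see the global ASID
	 * we are also guaranteed to see asid_transition == true for
	 * as long as the transition is still in progress.
	 */
	u16 global_asid = READ_ONCE(mm->context.global_asid);
	bool in_transition = false;

	if (global_asid)
		in_transition = READ_ONCE(mm->context.asid_transition);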


> > +		/*
> > +		 * If at least one CPU is not using the global ASID yet,
> > +		 * send a TLB flush IPI. The IPI should cause stragglers
> > +		 * to transition soon.
> > +		 *
> > +		 * This can race with the CPU switching to another task;
> > +		 * that results in a (harmless) extra IPI.
> > +		 */
> > +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
> > +			flush_tlb_multi(mm_cpumask(info->mm), info);
> > +			return;
> 
> I am trying to figure out why we return here. The transition might
> not be over? Why is it "soon"? Wouldn't flush_tlb_func() reload it
> unconditionally?

The transition _should_ be over, but what if another
CPU got an NMI while in the middle of switch_mm_irqs_off,
and set its own bit in the mm_cpumask after we send this
IPI?

On the other hand, if it sets its mm_cpumask bit after
this point, it will also load the mm->context.global_asid
after this point, and should definitely get the new ASID.

I think we are probably fine to set asid_transition to
false here, but I've had to tweak this code so much over
the past months that I don't feel super confident any more :)

> 
> > +	/*
> > +	 * TLB flushes with INVLPGB are kicked off asynchronously.
> > +	 * The inc_mm_tlb_gen() guarantees page table updates are done
> > +	 * before these TLB flushes happen.
> > +	 */
> > +	if (info->end == TLB_FLUSH_ALL) {
> > +		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
> > +		/* Do any CPUs supporting INVLPGB need PTI? */
> > +		if (static_cpu_has(X86_FEATURE_PTI))
> > +			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
> > +	} else for (; addr < info->end; addr += nr << info->stride_shift) {
> 
> I guess I was wrong, and do-while was cleaner here.
> 
> And I guess this is now a bug, if info->stride_shift > PMD_SHIFT...
> 
We set maxnr to 1 for larger stride shifts at the top of the function:

        /* Flushing multiple pages at once is not supported with 1GB pages. */
        if (info->stride_shift > PMD_SHIFT)
                maxnr = 1;

> [ I guess the cleanest way was to change get_flush_tlb_info to mask
> the low bits of start and end based on ((1ull << stride_shift) - 1).
> But whatever... ]

I'll change it back :)
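
If we ever do the masking in get_flush_tlb_info() instead, I imagine
it would be something like this (untested sketch, rounding start down
and end up to the stride size):

	unsigned long mask = (1UL << stride_shift) - 1;

	/* Align the flush range to full stride_shift sized pages. */
	start &= ~mask;
	if (end != TLB_FLUSH_ALL)
		end = (end + mask) & ~mask;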

I'm just happy this code is getting lots of attention,
and we're improving it with time.


> > @@ -573,6 +874,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> >   				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
> >   			cpumask_set_cpu(cpu, mm_cpumask(next));
> >   
> > +		/*
> > +		 * Check if the current mm is transitioning to a new ASID.
> > +		 */
> > +		if (needs_global_asid_reload(next, prev_asid)) {
> > +			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> > +
> > +			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> > +			goto reload_tlb;
> 
> Not a fan of the gotos when they are not really needed, and I don't
> think one is really needed here. Especially since the name of the tag
> "reload_tlb" does not really convey that the page tables are reloaded
> at that point.

In this particular case, the CPU continues running with
the same page tables, but with a different PCID.

> 
> > +		}
> > +
> > +		/*
> > +		 * Broadcast TLB invalidation keeps this PCID up to date
> > +		 * all the time.
> > +		 */
> > +		if (is_global_asid(prev_asid))
> > +			return;
> 
> Hard for me to convince myself

When a process uses a global ASID, we always send
out TLB invalidations using INVLPGB.

The global ASID should always be up to date.

> 
> > @@ -769,6 +1092,16 @@ static void flush_tlb_func(void *info)
> >   	if (unlikely(loaded_mm == &init_mm))
> >   		return;
> >   
> > +	/* Reload the ASID if transitioning into or out of a global ASID */
> > +	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
> > +		switch_mm_irqs_off(NULL, loaded_mm, NULL);
> 
> I understand you want to reuse that logic, but it doesn't seem
> reasonable to me. It both doesn't convey what you want to do, and can
> lead to undesired operations, cpu_tlbstate_update_lam() for instance.
> Probably the impact on performance is minor, but it is an opening for
> future mistakes.

My worry with having a separate code path here is
that the separate code path could bit rot, and we
could introduce bugs that way.

I would rather have a tiny performance impact in
what is a rare code path, than a rare (and hard
to track down) memory corruption due to bit rot.


> 
> > +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> > +	}
> > +
> > +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
> > +	if (is_global_asid(loaded_mm_asid))
> > +		return;
> 
> The comment does not make it clear to me, and I cannot clearly
> explain to myself, why it is guaranteed that all the IPI TLB flushes
> which were potentially issued before the transition are not needed.
> 
IPI TLB flushes that were issued before the transition went
to the CPUs while they were using dynamic ASIDs (numbers 1-5).

Reloading the TLB with a different PCID, even one pointed at the
same page tables, means the TLB must load the translations fresh
from the page tables, and cannot re-use any that it had previously
loaded under a different PCID.
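
In code terms, that reload is roughly just a CR3 write with the same
page table root but a new PCID (sketch using build_cr3() from
arch/x86/mm/tlb.c; new_asid and new_lam stand in for whatever the
caller computed):

	/*
	 * Same pgd, different PCID. Translations the TLB cached under
	 * the old PCID stay tagged with that PCID and are simply
	 * never matched again.
	 */
	write_cr3(build_cr3(next->pgd, new_asid, new_lam));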


-- 
All Rights Reversed.


Thread overview: 36+ messages
2025-01-20  2:40 [PATCH v6 00/12] AMD broadcast TLB invalidation Rik van Riel
2025-01-20  2:40 ` [PATCH v6 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
2025-01-20 19:32   ` David Hildenbrand
2025-01-20  2:40 ` [PATCH v6 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call Rik van Riel
2025-01-20 19:47   ` David Hildenbrand
2025-01-21  1:03     ` Rik van Riel
2025-01-21  7:46       ` David Hildenbrand
2025-01-21  8:54         ` Peter Zijlstra
2025-01-22 15:48           ` Rik van Riel
2025-01-20  2:40 ` [PATCH v6 03/12] x86/mm: consolidate full flush threshold decision Rik van Riel
2025-01-20  2:40 ` [PATCH v6 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
2025-01-20  2:40 ` [PATCH v6 05/12] x86/mm: add INVLPGB support code Rik van Riel
2025-01-21  9:45   ` Peter Zijlstra
2025-01-22 16:58     ` Rik van Riel
2025-01-20  2:40 ` [PATCH v6 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
2025-01-20  2:40 ` [PATCH v6 07/12] x86/tlb: use INVLPGB in flush_tlb_all Rik van Riel
2025-01-20  2:40 ` [PATCH v6 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
2025-01-20  2:40 ` [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
2025-01-20 14:02   ` Nadav Amit
2025-01-20 16:09     ` Rik van Riel [this message]
2025-01-20 20:04       ` Nadav Amit
2025-01-20 22:44         ` Rik van Riel
2025-01-21  7:31           ` Nadav Amit
2025-01-21  9:55   ` Peter Zijlstra
2025-01-21 10:33     ` Peter Zijlstra
2025-01-23  1:40       ` Rik van Riel
2025-01-21 18:48     ` Dave Hansen
2025-01-22  8:38   ` Peter Zijlstra
2025-01-23  1:13     ` Rik van Riel
2025-01-23  9:07       ` Peter Zijlstra
2025-01-23 12:42         ` Rik van Riel
2025-01-20  2:40 ` [PATCH v6 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code Rik van Riel
2025-01-20  2:40 ` [PATCH v6 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
2025-01-20  2:40 ` [PATCH v6 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
2025-01-20  5:58 ` [PATCH v6 00/12] AMD broadcast TLB invalidation Michael Kelley
2025-01-24 11:41 ` Manali Shukla
