linux-mm.kvack.org archive mirror
From: Thomas Gleixner <tglx@linutronix.de>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Uladzislau Rezki <urezki@gmail.com>,
	"Russell King (Oracle)" <linux@armlinux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>, Christoph Hellwig <hch@lst.de>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Baoquan He <bhe@redhat.com>, John Ogness <jogness@linutronix.de>,
	linux-arm-kernel@lists.infradead.org,
	Mark Rutland <mark.rutland@arm.com>,
	Marc Zyngier <maz@kernel.org>,
	x86@kernel.org
Subject: Re: Excessive TLB flush ranges
Date: Wed, 17 May 2023 12:31:04 +0200	[thread overview]
Message-ID: <87ttwb5jx3.ffs@tglx> (raw)
In-Reply-To: <ABA0B923-FD56-4787-9ED1-994BEA13C496@gmail.com>

Nadav!

On Tue, May 16 2023 at 18:23, Nadav Amit wrote:
>> On May 16, 2023, at 5:23 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> I'm not ignoring them and I'm well aware of these issues. No need to
>>> repeat them over and over. I'm old but not senile yet.
>
> Thomas, no disrespect was intended. I initially just sent the link and I
> had a sense (based on my past experience) that nobody clicked on it.

All good.

>> It makes a whole lot of difference whether you do 5 IPIs in a row
>> which all need to get a cache line updated or if you have _one_ which
>> needs a couple of cache lines updated.
>
> Obviously, if the question is 5 IPIs or 1 IPI with more flushing data,
> the 1 IPI wins. The question I was focusing on is whether to send 1 IPI
> with a potentially global flush or with a detailed list of ranges to flush.

Correct, and there is obviously a tradeoff too, which has yet to be
determined.
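To make that tradeoff concrete, here is a minimal sketch (all names and the threshold value are hypothetical, not kernel API): one IPI is sent either way, and the handler decides between a full TLB flush and walking the precise range list based on the total page count. Finding the real break-even point is exactly what has "yet to be determined".

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flush descriptor; not the kernel's data structure. */
struct flush_range {
	unsigned long start;
	unsigned long end;	/* exclusive */
};

#define PAGE_SHIFT 12

/* Hypothetical break-even point: above this many pages, a full flush is
 * assumed cheaper than issuing per-page INVLPGs. */
#define FLUSH_PAGE_CEILING 32

/* Decide whether the IPI handler should do a full flush instead of
 * walking the range list page by page. */
static bool want_full_flush(const struct flush_range *ranges, size_t nr)
{
	unsigned long pages = 0;

	for (size_t i = 0; i < nr; i++)
		pages += (ranges[i].end - ranges[i].start) >> PAGE_SHIFT;

	return pages > FLUSH_PAGE_CEILING;
}
```

The point of the sketch is only that the decision can be made locally in the handler from data shipped with the single IPI; the right ceiling would have to come from measurement.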

>> INVLPG is not serializing, so the CPU can pull in the next required cache
>> line(s) on the VA list during that.
>
> Indeed, but ChatGPT says (yes, I see you making fun of me already):
> “however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
> can cause a pipeline stall because the TLB entry invalidation must be
> completed before subsequent instructions that might rely on the TLB can
> be executed correctly.”
>
> So I am not sure that your claim is exactly correct.

The key is a subsequent instruction which might depend on the to-be-flushed
TLB entry. That's obvious, but I'm having a hard time constructing that
dependent instruction in this case.

>> These cache lines are _not_
>> contended at that point because _all_ of these data structures are no
>> longer globally accessible (mis-speculation aside) and therefore not
>> exclusive (misalignment aside, but you have to prove that this is an
>> issue).
>
> This is not entirely true. Indeed, whether you have 1 remote core or N
> remote cores is not the whole issue (putting aside NUMA). But you will get
> first a snoop to the initiator cache by the responding core, and then,
> after the TLB invalidation is completed, an RFO by the initiator once
> it writes to the cache again. If the invalidation data is on the stack
> (as you did), this is even more likely to happen shortly after.

That's correct and there might be smarter ways to handle that list muck.
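One illustrative way to soften the snoop-then-RFO pattern described above (every name here is hypothetical, sketched under the assumption of 64-byte cache lines): keep the flush descriptor out of initiator stack memory, and put the responder-written completion counter on its own cache line so responder writes do not bounce the line holding the range data the responders are reading.

```c
#include <stddef.h>
#include <stdatomic.h>

struct flush_range {
	unsigned long start, end;
};

#define CACHELINE 64

/* Hypothetical flush request layout.  The ranges/nr fields are written
 * once by the initiator and then only read by responders; the completion
 * counter is written by responders, so it lives on a separate cache line
 * to keep responder RFOs off the read-mostly data. */
struct flush_request {
	struct flush_range *ranges __attribute__((aligned(CACHELINE)));
	size_t nr;

	_Atomic int done __attribute__((aligned(CACHELINE)));
};
```

Whether the extra footprint pays for itself against the stack-based variant is again something only measurement can settle.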

>> So just dismissing this based on 10 year old experience is not really
>> helpful, though I'm happy to confirm your points once I've had the time
>> and opportunity to actually run real testing on it, unless you beat me
>> to it.
>
> I really don’t know what “dismissing” you are talking about.

Sorry, I was overreacting due to increased grumpiness.

> I do have relatively recent experience with the overhead of caching
> effects on TLB shootdown time. It can become very apparent. You can
> find some numbers in, for instance, the patch of mine I quoted in my
> previous email.
>
> There are additional opportunities to reduce the caching effect for
> x86, such as combining the SMP-code metadata with the TLB-invalidation
> metadata (which is out of scope), which I saw having a performance
> benefit. That's all to say that the caching effect is not something to
> be considered obsolete.

I never claimed that it does not matter. That's surely part of the
decision making that needs to be investigated.

>> The point is that the generic vmalloc code is making assumptions which
>> are x86-centric and not even necessarily true on x86.
>> 
>> Whether or not this is beneficial on x86 is a completely separate
>> debate.
>
> I fully understand that if you reduce multiple TLB shootdowns (IPI-wise)
> into 1, it is (pretty much) all benefit and there is no tradeoff. I was
> focusing on the question of whether it is beneficial also to do precise
> TLB flushing, and the tradeoff there is less clear (especially since the
> kernel uses 2MB pages).

For the vmalloc() area mappings? Not really.

> My experience with non-IPI based TLB invalidations is more limited. IIUC
> the usage model is that the TLB shootdowns should be invoked ASAP
> (perhaps each range can be batched, but there is no sense of batching
> multiple ranges), and then later you would issue some barrier to ensure
> prior TLB shootdown invocations have been completed.
>
> If that is the (use) case, I am not sure the abstraction you used in
> your prototype is the best one.

The way arm/arm64 implement that in software is:

    magic_barrier1();
    flush_range_with_magic_opcodes();
    magic_barrier2();

And for that use case having the list with individual ranges is not
really wrong.

Maybe ARM[64] could do this smarter, but I assume that would require
rewriting a lot of code.
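A sketch of why the list of ranges fits the barrier/flush/barrier pattern above (the barrier and opcode names are placeholders, as in the pseudocode, not real arm64 kernel code): one leading barrier, then every page of every range is invalidated with the per-VA opcode, then one trailing barrier waits for all of them at once, so adding ranges adds flush operations but no extra barriers.

```c
#include <stddef.h>

struct flush_range {
	unsigned long start, end;	/* end exclusive */
};

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Placeholder stubs; on arm64 these would be barrier and TLB
 * maintenance instructions. */
static void magic_barrier1(void) { }
static void magic_barrier2(void) { }
static void flush_one_va(unsigned long va) { (void)va; }

/* Walk the whole range list between a single pair of barriers.
 * Returns the number of per-VA invalidate operations issued. */
static size_t flush_range_list(const struct flush_range *r, size_t nr)
{
	size_t ops = 0;

	magic_barrier1();
	for (size_t i = 0; i < nr; i++)
		for (unsigned long va = r[i].start; va < r[i].end; va += PAGE_SIZE) {
			flush_one_va(va);
			ops++;
		}
	magic_barrier2();

	return ops;
}
```

That is, the list amortizes the (expensive) completion barrier over all ranges, which is why the per-range representation is "not really wrong" for this use case.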

>> There is also a debate required whether a wholesale "flush on _ALL_
>> CPUs' is justified when some of those CPUs are completely isolated and
>> have absolutely no chance to be affected by that. This process-bound
>> seccomp/BPF muck clearly does not justify kicking isolated CPUs out of
>> their computation in user space just because…
>
> I hope you would excuse my ignorance (I am sure you won't), but aren't
> the seccomp/BPF VMAP ranges mapped in all processes (considering
> PTI of course)? Are you suggesting you want a per-process kernel
> address space? (which can make sense, I guess)

Right. The BPF muck is mapped in the global kernel space, but e.g. the
seccomp filters are individual per process. At least that's how I
understand it, but I might be completely wrong.

Thanks,

        tglx

