Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms

From: "Andy Lutomirski" <luto@kernel.org>
To: "Linus Torvalds" <torvalds@linux-foundation.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	"Nicholas Piggin" <npiggin@gmail.com>,
	"Anton Blanchard" <anton@ozlabs.org>,
	"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
	"Paul Mackerras" <paulus@ozlabs.org>,
	"Randy Dunlap" <rdunlap@infradead.org>,
	linux-arch <linux-arch@vger.kernel.org>,
	"the arch/x86 maintainers" <x86@kernel.org>,
	"Rik van Riel" <riel@surriel.com>,
	"Dave Hansen" <dave.hansen@intel.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	"Nadav Amit" <nadav.amit@gmail.com>,
	"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
Subject: Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
Date: Sat, 08 Jan 2022 15:04:14 -0700
Message-ID: <3586aa63-2dd2-4569-b9b9-f51080962ff2@www.fastmail.com>
In-Reply-To: <CAHk-=wj4LZaFB5HjZmzf7xLFSCcQri-WWqOEJHwQg0QmPRSdQA@mail.gmail.com>

On Sat, Jan 8, 2022, at 12:22 PM, Linus Torvalds wrote:
> On Sat, Jan 8, 2022 at 8:44 AM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> To improve scalability, this patch adds a percpu hazard pointer scheme to
>> keep lazily-used mms alive.  Each CPU has a single pointer to an mm that
>> must not be freed, and __mmput() checks the pointers belonging to all CPUs
>> that might be lazily using the mm in question.
>
> Ugh. This feels horribly fragile to me, and also looks like it makes
> some common cases potentially quite expensive for machines with large
> CPU counts if they don't do that mm_cpumask optimization - which in
> turn feels quite fragile as well.
>
> IOW, this just feels *complicated*.
>
> And I think it's overly so. I get the strong feeling that we could
> make the rules much simpler and more straightforward.
>
> For example, how about we make the rules be

There there, Linus, not everything is as simple^Wincapable as x86 bare metal, and mm_cpumask does not have useful cross-arch semantics.  Is that good?

>
>  - a lazy TLB mm reference requires that there's an actual active user
> of that mm (ie "mm_users > 0")
>
>  - the last mm_users decrement (ie __mmput) forces a TLB flush, and
> that TLB flush must make sure that no lazy users exist (which I think
> it does already anyway).

It does, on x86 bare metal, in exit_mmap().  It’s implicit, but it could be made explicit, as below.

>
> Doesn't that seem like a really simple set of rules?
>
> And the nice thing about it is that we *already* do that required TLB
> flush in all normal circumstances. __mmput() already calls
> exit_mmap(), and exit_mm() already forces that TLB flush in every
> normal situation.

Exactly. On x86 bare metal and similar architectures, this flush is done by IPI, which involves a loop over all CPUs that might be using the mm.  And other patches in this series add the core ability to shoot down the lazy TLB cleanly, so that the CPU core drops its reference, and wire it up for x86.

>
> So we might have to make sure that every architecture really does that
> "drop lazy mms on TLB flush", and maybe add a flag to the existing
> 'struct mmu_gather tlb' to make sure that flush actually always
> happens (even if the process somehow managed to unmap all vma's even
> before exiting).

So this requires that all architectures actually walk all relevant CPUs to see if an IPI is needed and send that IPI.  On architectures that actually need an IPI anyway (x86 bare metal, powerpc (I think), and others), fine. But on architectures with a broadcast-to-all-CPUs flush (ARM64 IIUC), the extra IPI will be much, much slower than a simple load-acquire in a loop.

In fact, arm64 doesn’t even track mm_cpumask at all, last time I checked, so even an IPI lazy shootdown would require looping over *all* CPUs, doing a load-acquire, and possibly doing an IPI. I much prefer doing a load-acquire and possibly a cmpxchg.
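
To make the "load-acquire and possibly a cmpxchg" loop concrete, here is a minimal userspace C11 sketch of the teardown-side scan. It is not the code from the patch. The names `struct mm`, `lazy_slot`, `LAZY_HAS_REF`, and `fixup_lazy_refs()` are hypothetical, `NR_CPUS` is fixed at 4 for illustration, and the CPU-side logic that publishes and clears a slot is omitted. The assumed convention is that a slot holds either 0, a plain mm pointer (a lazy user with no refcount), or the pointer with its low bit set (the lazy user now owns a real reference).

```c
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS 4                 /* assumption: small fixed CPU count */
#define LAZY_HAS_REF ((uintptr_t)1) /* hypothetical tag: slot owns a real ref */

struct mm {
	atomic_int refs;          /* stands in for the mm's reference count */
};

/* One hazard slot per CPU: 0, an mm pointer, or an mm pointer | LAZY_HAS_REF. */
static _Atomic(uintptr_t) lazy_slot[NR_CPUS];

/*
 * Teardown side: scan every slot with a load-acquire.  Only slots that
 * still name this mm cost anything more: take a reference on the lazy
 * user's behalf and hand it over with a cmpxchg.  Returns the number of
 * slots converted.
 */
static int fixup_lazy_refs(struct mm *mm)
{
	int converted = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		uintptr_t seen = atomic_load_explicit(&lazy_slot[cpu],
						      memory_order_acquire);

		if (seen != (uintptr_t)mm)
			continue;	/* common case: nothing to do */

		atomic_fetch_add(&mm->refs, 1);
		if (atomic_compare_exchange_strong_explicit(
			    &lazy_slot[cpu], &seen,
			    (uintptr_t)mm | LAZY_HAS_REF,
			    memory_order_acq_rel, memory_order_acquire))
			converted++;
		else
			/* The CPU changed its slot first; undo our ref. */
			atomic_fetch_sub(&mm->refs, 1);
	}
	return converted;
}
```

In this model a CPU that is not lazily using the mm costs one load and a compare, with no interrupt and no wakeup, which is the property being argued for above. A real implementation would also need the matching ordering on the CPU side when it publishes or clears its slot.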

(And x86 PV can do hypercall flushes.  If a bunch of vCPUs are not running, an IPI shootdown will end up sleeping until they run, whereas this patch will allow the hypervisor to leave them asleep and thus to finish __mmput without waking them. This only matters on a CPU-oversubscribed host, but still.  And it kind of looks like hardware remote flushes are coming in AMD land eventually.)

But yes, I fully agree that this patch is complicated and subtle.

>
> Is there something silly I'm missing? Somebody pat me on the head, and
> say "There, there, Linus, don't try to get involved with things you
> don't understand.." and explain to me in small words.
>
>                   Linus
