From: "Andy Lutomirski" <luto@kernel.org>
To: "Linus Torvalds" <torvalds@linux-foundation.org>,
"Will Deacon" <will@kernel.org>,
"Catalin Marinas" <catalin.marinas@arm.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
Linux-MM <linux-mm@kvack.org>,
"Nicholas Piggin" <npiggin@gmail.com>,
"Anton Blanchard" <anton@ozlabs.org>,
"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
"Paul Mackerras" <paulus@ozlabs.org>,
"Randy Dunlap" <rdunlap@infradead.org>,
linux-arch <linux-arch@vger.kernel.org>,
"the arch/x86 maintainers" <x86@kernel.org>,
"Rik van Riel" <riel@surriel.com>,
"Dave Hansen" <dave.hansen@intel.com>,
"Peter Zijlstra (Intel)" <peterz@infradead.org>,
"Nadav Amit" <nadav.amit@gmail.com>,
"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
Subject: Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
Date: Sat, 08 Jan 2022 19:58:39 -0800
Message-ID: <430e3db1-693f-4d46-bebf-0a953fe6c2fc@www.fastmail.com>
In-Reply-To: <CAHk-=wi2MtYYTs08RKHtj9Vtm0dri-saWwYh0tj=QUUUDSJFRQ@mail.gmail.com>
On Sat, Jan 8, 2022, at 4:53 PM, Linus Torvalds wrote:
> [ Let's try this again, without the html crud this time. Apologies to
> the people who see this reply twice ]
>
> On Sat, Jan 8, 2022 at 2:04 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> So this requires that all architectures actually walk all relevant
>> CPUs to see if an IPI is needed and send that IPI. On architectures
>> that actually need an IPI anyway (x86 bare metal, powerpc (I think)
> and others), fine. But on architectures with a broadcast-to-all-CPUs
>> flush (ARM64 IIUC), then the extra IPI will be much much slower than a
>> simple load-acquire in a loop.
>
> ... hmm. How about a hybrid scheme?
>
> (a) architectures that already require that IPI anyway for TLB
> invalidation (ie x86, but others too), just make the rule be that the
> TLB flush by exit_mmap() gets rid of any lazy TLB mm references. Which
> they already do.
>
> (b) architectures like arm64 that do hw-assisted TLB shootdown will
> have an ASID allocator model, and what you do is to use that to either
> (b') increment/decrement the mm_count at mm ASID allocation/freeing time
> (b'') use the existing ASID tracking data to find the CPUs that
> have that ASID
>
> (c) can you really imagine hardware TLB shootdown without ASID
> allocation? That doesn't seem to make sense. But if it exists, maybe
> that kind of crazy case would do the percpu array walking.
>
So I can go over a handful of TLB flush schemes:
1. x86 bare metal. As noted, just plain shootdown would work. (Unless we switch to inexact mm_cpumask() tracking, which might be enough of a win that it's worth it.) Right now, "ASID" (i.e. PCID, thanks Intel) is allocated per CPU. ASIDs are never explicitly freed -- they just expire off a percpu LRU. The data structures have no idea whether an mm still exists -- instead they track mm->context.ctx_id, which is 64 bits and never reused.
2. x86 paravirt. This is just like bare metal except there's a hypercall to flush a specific target CPU. (I think this is mutually exclusive with PCID, but I haven't looked hard enough to be sure exactly what is implemented right now. It could be an operation to flush (cpu, pcid), but that gets awkward for reasons that aren't too relevant to this discussion.) In this model, the exit_mmap() shootdown would either need to switch to a non-paravirt flush or we need a fancy mm_count solution of some sort.
3. Hypothetical better x86. AMD has INVLPGB, which is almost useless right now. But it's *so* close to being very useful, and I've asked engineers at AMD and Intel to improve this. Specifically, I want PCID to be widened to 64 bits. (This would, as I understand it, not affect the TLB hardware at all. It would affect the tiny table that sits in front of the real PCID and maintains the illusion that PCID is 12 bits, and it would affect the MOV CR3 instruction. The latter makes it complicated.) And INVLPGB would invalidate a given 64-bit PCID system-wide. In this model, there would be no such thing as freeing an ASID. So I think we would want something very much like this patch.
4. ARM64. I only barely understand it, but I think it's an intermediate scheme with ASIDs that are wide enough to be useful but narrow enough to run out on occasion. I don't think they're tracked -- I think the whole world just gets invalidated when they overflow. I could be wrong.
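To make the generational-ASID idea in scheme 4 concrete, here's a toy allocator in plain userspace C. This is my simplified model of the general rollover pattern, not the actual arm64 code: every mm caches a (generation, asid) pair, the fast path just checks the generation, and when the ASID space runs out the generation bumps and the whole world gets invalidated -- which is exactly why there's no per-mm "free ASID" event to hook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_ASIDS 4  /* tiny on purpose so rollover is easy to hit */

struct toy_mm {
	uint64_t asid_gen;  /* generation this mm's ASID belongs to */
	uint32_t asid;
};

static uint64_t cur_gen = 1;
static uint32_t next_asid = 1;  /* ASID 0 reserved */
static bool flushed_everything;

/*
 * Return a valid ASID for @mm, rolling the generation over (and
 * pretending to invalidate all TLBs) when the ASID space runs out.
 */
static uint32_t get_asid(struct toy_mm *mm)
{
	if (mm->asid_gen == cur_gen)
		return mm->asid;	/* still valid: the fast path */

	if (next_asid == NUM_ASIDS) {
		/* Out of ASIDs: bump the generation, flush the world. */
		cur_gen++;
		next_asid = 1;
		flushed_everything = true;
	}
	mm->asid = next_asid++;
	mm->asid_gen = cur_gen;
	return mm->asid;
}
```

Note that nothing here ever learns that a toy_mm went away; stale ASIDs just die with the generation, which is the property that makes "expire the ASID when the mm is freed" circular.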
In any event, ASID lifetimes aren't a magic solution -- how do we know when to expire an ASID? Presumably it would be when an mm is fully freed (__mmdrop), which gets us right back to square one.
What I particularly like about my patch, though, is that while it's subtle, it's subtle just once: I think it can handle all the interesting arch cases by merely redefining for_each_possible_lazymm_cpu() to do the right thing.
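For anyone following along who hasn't read patch 16, the hazard-pointer scheme can be sketched in single-threaded userspace C like this. It's only a sketch of the shape of the idea: the real patch uses per-CPU storage with store-release/load-acquire ordering (or an IPI) where the comments say so, and for_each_possible_lazymm_cpu() is the arch hook that narrows which slots the exit path has to scan.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4

struct mm {
	int id;
};

/*
 * One hazard slot per CPU: non-NULL means that CPU may be using the
 * mm as its lazy mm, so it must not be freed out from under it.
 */
static struct mm *lazy_mm[NR_CPUS];

static void enter_lazy(int cpu, struct mm *mm)
{
	lazy_mm[cpu] = mm;	/* real code: store-release / barrier */
}

static void exit_lazy(int cpu)
{
	lazy_mm[cpu] = NULL;
}

/*
 * exit_mmap()-time walk: find every CPU still advertising a hazard
 * pointer to @mm and force it off -- the stand-in for sending an IPI
 * or leaning on the arch's broadcast flush.  Returns how many CPUs
 * were still lazily using @mm.
 */
static int shoot_down_lazy_users(struct mm *mm)
{
	int cpu, hit = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {	/* for_each_possible_lazymm_cpu() */
		if (lazy_mm[cpu] == mm) {	/* real code: load-acquire */
			lazy_mm[cpu] = NULL;
			hit++;
		}
	}
	return hit;
}
```

The point is that the freeing side only ever scans the slots; no per-mm refcount bounce is needed on the lazy-switch fast path.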
> (Honesty in advertising: I don't know the arm64 ASID code - I used to
> know the old alpha version I wrote in a previous lifetime - but afaik
> any ASID allocator has to be able to track CPUs that have a
> particular ASID in use and be able to invalidate it).
>
> Hmm. The x86 maintainers are on this thread, but they aren't even the
> problem. Adding Catalin and Will to this, I think they should know
> if/how this would fit with the arm64 ASID allocator.
>
Well, I am an x86 mm maintainer, and there is definitely a performance problem on large x86 systems right now. :)
> Will/Catalin, background here:
>
>
> https://lore.kernel.org/all/CAHk-=wj4LZaFB5HjZmzf7xLFSCcQri-WWqOEJHwQg0QmPRSdQA@mail.gmail.com/
>
> for my objection to that special "keep non-refcounted magic per-cpu
> pointer to lazy tlb mm".
>
> Linus
--Andy