Date: Sun, 06 Dec 2020 13:59:02 +1000
From: Nicholas Piggin
Subject: Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
To: Andy Lutomirski
Cc: Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM, linuxppc-dev, Mathieu Desnoyers, Peter Zijlstra, X86 ML
Message-Id: <1607224014.8xeujbleij.astroid@bobo.none>
References: <1607152918.fkgmomgfw9.astroid@bobo.none> <116A6B40-C77B-4B6A-897B-18342CD62CEC@amacapital.net> <1607209402.fogfsh8ov4.astroid@bobo.none>

Excerpts from Andy Lutomirski's message of December 6, 2020 10:36 am:
> On Sat, Dec 5, 2020 at 3:15 PM Nicholas Piggin wrote:
>>
>> Excerpts from Andy Lutomirski's message of December 6, 2020 2:11 am:
>> >
>
>> If an mm was lazy tlb for a kernel thread and then it becomes unlazy,
>> and if switch_mm is serialising but return to user is not, then you
>> need a serialising instruction somewhere before return to user. unlazy
>> is the logical place to add that, because the lazy tlb mm (i.e.,
>> switching to a kernel thread and back without switching mm) is what
>> opens the hole.
>
> The issue here is that unlazying on x86 sometimes serializes and
> sometimes doesn't.

That's additional state that x86 keeps around though, which is fine.
It can optimise that case if it knows it's already serialised.

> It's straightforward to add logic to the x86 code
> to serialize specifically in the case where membarrier core sync is
> registered and unlazying would otherwise not serialize, but trying to
> define sensible semantics for this in a call to core code seems
> complicated.

It's not though, it's a call from core code (to arch code).

> (Specifically, the x86 code only sometimes sends IPIs to
> lazy CPUs for TLB flushes. If an IPI is skipped, then unlazying will
> flush the TLB, and that operation is serializing.)
>
> The whole lazy thing is IMO a red herring for membarrier(). The
> membarrier() logic requires that switching *logical* mms
> (rq->curr->mm) serializes before user mode if the new mm is registered
> for core sync.

It's not a red herring: the reason the IPI gets skipped is that we go
to a kernel thread -- that's all core code and core lazy tlb handling.
That x86 might do some additional serialising ops during un-lazy in
some cases doesn't make it a red herring, it just means that you can
take advantage of it and avoid doing an extra serialising op.
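
To make that concrete, the sort of thing I have in mind for the x86
side of exit_lazy_tlb() is sketched below. It's only there to show the
shape of it: the membarrier state check and sync_core_before_usermode()
use the existing names, but the "already serialised" tracking is a
made-up per-cpu flag standing in for whatever state x86 wants to keep.

	/*
	 * Rough sketch only. Assumes the membarrier definitions from
	 * <linux/sched/mm.h>; cpu_tlbstate_serialised is hypothetical.
	 */
	static DEFINE_PER_CPU(bool, cpu_tlbstate_serialised);

	static inline void exit_lazy_tlb(struct mm_struct *mm,
					 struct task_struct *tsk)
	{
		/* Nothing to do unless this mm registered for core sync. */
		if (!(atomic_read(&mm->membarrier_state) &
		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))
			return;

		/*
		 * If un-lazying already did a serialising operation (e.g.
		 * the TLB flush that a skipped shootdown IPI deferred to
		 * this point), there is nothing extra to pay.
		 */
		if (this_cpu_read(cpu_tlbstate_serialised))
			return;

		/* Otherwise serialise before returning to user. */
		sync_core_before_usermode();
	}

That is, core code just tells the arch "this mm is no longer lazy
here", and the arch decides whether that costs anything.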
> AFAICT the only architecture on which this isn't
> automatic is x86, and somehow the logic turned into "actually changing
> rq->curr->mm serializes, but unlazying only sometimes serializes, so
> we need to do an extra serialization step on unlazying operations"
> instead of "tell x86 to make sure it always serializes when it
> switches logical mms". The latter is easy to specify and easy to
> implement.
>
>>
>> How do you mean? exit_lazy_tlb is the opposite: core scheduler notifying
>> arch code about when an mm becomes not-lazy, and nothing to do with
>> membarrier at all even. It's a convenient hook to do your un-lazying.
>> I guess you can also do it by checking things in switch_mm and keeping
>> state in arch code, I just don't think that's necessarily the best place
>> to put it.
>
> I'm confused. I just re-read your patches, and it looks like you have
> arch code calling exit_lazy_tlb().

More for code-comment / consistency than anything else. They are
entirely arch hooks.

> On x86, if we do a TLB shootdown
> IPI to a lazy CPU, the IPI handler will unlazy that CPU (by switching
> to init_mm for real), and we have no way to notify the core scheduler
> about this, so we don't. The result is that the core scheduler state
> and the x86 state get out of sync. If the core scheduler
> subsequently switches us back to the mm that it thinks we were still
> using lazily, then from the x86 code's perspective we're not
> unlazying -- we're just doing a regular switch from init_mm to some
> other mm. This is why x86's switch_mm_irqs_off() totally ignores its
> 'prev' argument.

You actually do have such a way to do that now that we've (hopefully)
closed the races, and I think you should use it, which might make
things simpler for you. See patch 6 do_shoot_lazy_tlb().

> I'm honestly a bit surprised that other architectures don't do the
> same thing. I suppose that some architectures don't use shootdown
> IPIs at all, in which case there doesn't seem to be any good reason to
> aggressively unlazy.

powerpc/radix has been doing that (in some cases) for a few years now.
It just doesn't fully exploit that in the final TLB shootdown to always
clean them all up and avoid the subsequent shoot-lazies IPI, but it
could be more aggressive there.

The powerpc virtualised hash architecture is the traditional one and
isn't conducive to this (translation management is done via hcalls,
and the hypervisor maintains the TLB), so I suspect that's why it
wasn't done earlier there. That will continue to rely on shoot-lazies.

> (Oddly, despite the fact that, since Ivy Bridge, x86 has a "just flush
> the TLB" instruction, that instruction is considerably slower than the
> old "switch mm and flush" operation. So the operation "switch to
> init_mm" is only ever any slower than "flush and stay lazy" if we get
> lucky and unlazy to the same mm before we get a second TLB shootdown
> *and* if unlazying to the same mm would not have needed to flush. I
> spent quite a bit of time tuning this stuff and being quite surprised
> at the bizarre performance properties of Intel's TLB management
> instructions.)

Well, you also cause an extra mm switch in case you returned to the
same mm. Which probably isn't uncommon (app<->idle).
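
(To make the earlier reference concrete: do_shoot_lazy_tlb() in patch 6
is roughly the shape below. This is an illustrative sketch rather than
a verbatim quote of the patch, so treat the names and details loosely.)

	/*
	 * When an mm is finally torn down, IPI any CPU that may still
	 * have it as its lazy active_mm and make it switch to init_mm.
	 */
	static void do_shoot_lazy_tlb(void *arg)
	{
		struct mm_struct *mm = arg;

		if (current->active_mm == mm) {
			WARN_ON_ONCE(current->mm); /* must be a kernel thread */
			current->active_mm = &init_mm;
			switch_mm(mm, &init_mm, current);
		}
	}

	static void shoot_lazy_tlbs(struct mm_struct *mm)
	{
		/* Run the above on every CPU that has had this mm active. */
		on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb,
				 (void *)mm, 1);
	}

Once that has run, no CPU can be using the mm lazily any more, and the
arch knows exactly when each CPU stopped.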
>>
>> So membarrier code is unchanged (it cares that the serialise is done at
>> un-lazy time), core code is simpler (no knowledge of this membarrier
>> quirk, and it already knows about lazy-tlb so the calls actually improve
>> the documentation), and x86 code I would argue becomes nicer (or no real
>> difference at worst) because you can move some exit lazy tlb handling to
>> that specific call rather than decipher it from switch_mm.
>
> As above, I can't move the exit-lazy handling because the scheduler
> doesn't know when I'm unlazying.

As above, you can actually tell it. But even if you don't do that, in
the current scheme it's still telling you a superset of what you need,
so you'd just move your extra checks there.

>
>>
>> >
>> > I'm currently trying to document how membarrier actually works, and
>> > hopefully this will result in untangling membarrier from mmdrop() and
>> > such.
>>
>> That would be nice.
>
> It's still a work in progress. I haven't actually convinced myself
> that the non-IPI case in membarrier() is correct, nor have I convinced
> myself that it's incorrect.
>
> Anyway, I think that my patch is a bit incorrect and I either need a
> barrier somewhere (which may already exist) or a store-release to
> lazy_mm to make sure that all accesses to the lazy mm are done before
> lazy_mm is freed. On x86, even aside from the fact that all stores
> are releases, this isn't needed -- stopping using an mm is itself a
> full barrier. Will this be a performance issue on power?

store-release is lwsync on power. Not so bad as a full barrier, but
probably not wonderful. The fast path would be worse than shoot-lazies
of course, but may not be prohibitive.

I'm still going to pursue shoot-lazies for the merge window. As you
see it's about a dozen lines and an if (IS_ENABLED(... in core code.
Your change is common code, but a significant complexity (which
affects all archs), so it needs a lot more review and testing at this
point.

If x86 is already shooting lazies in its final TLB flush, I don't know
why you're putting so much effort in though; surely it's more
complexity and (even slightly) more cost there too.
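
For what it's worth, the store-release variant I'm comparing it against
is, as I understand it, something like the following. This is purely a
sketch with made-up names, not your actual patch:

	/* Per-cpu publication of the lazy mm instead of refcounting. */
	static DEFINE_PER_CPU(struct mm_struct *, lazy_mm);

	static inline void lazy_mm_start(struct mm_struct *mm)
	{
		/* This CPU may touch mm (page tables etc.) while lazy. */
		this_cpu_write(lazy_mm, mm);
	}

	static inline void lazy_mm_stop(void)
	{
		/*
		 * Order all prior accesses to the mm before the cleared
		 * pointer becomes visible, so the teardown side that
		 * observes NULL can safely free the mm. This release is
		 * the lwsync I mean on power; on x86 a plain store is
		 * already a release.
		 */
		smp_store_release(this_cpu_ptr(&lazy_mm), NULL);
	}

Shoot-lazies keeps the context switch fast path exactly as it is and
only pays at the (rare) final teardown, which is why I prefer it.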
>
>>
>> >
>> > A silly part of this is that x86 already has a high quality
>> > implementation of most of membarrier(): flush_tlb_mm(). If you flush
>> > an mm's TLB, we carefully propagate the flush to all threads, with
>> > attention to memory ordering. We can't use this directly as an
>> > arch-specific implementation of membarrier because it has the annoying
>> > side effect of flushing the TLB and because upcoming hardware might be
>> > able to flush without guaranteeing a core sync. (Upcoming means Zen
>> > 3, but the Zen 3 implementation is sadly not usable by Linux.)
>> >
>>
>> A hardware broadcast TLB flush, you mean? What makes it unusable by
>> Linux out of curiosity?
>
> The new instruction is INVLPGB. Unfortunately, x86's ASID field is
> very narrow, and there's no way we can give each mm the same ASID
> across all CPUs, which means we can't accurately target the flush at
> the correct set of TLB entries. I've asked engineers at both Intel
> and AMD to widen the ASID field, but that will end up being
> complicated -- x86 has run out of bits in its absurdly overloaded CR3
> encoding, and widening the ASID to any reasonable size would require
> adding a new way to switch mms. There are lots of reasons that x86
> should do that anyway [0], but it would be a big project and I'm not
> sure that either company is interested in big projects like that.

Interesting, thanks. powerpc has a PID register for guest ASIDs that
implements about 20 bits.

The IPI is very flexible though: it allows more complex / fine-grained
flushes and also lets software state be updated, so we've started using
it a bit. I haven't seen much software where the performance of IPIs is
prohibitive these days. Maybe improvements to threaded malloc, JVMs,
databases, etc. have reduced the number of flushes.

> [0] On x86, you can't switch between (64-bit execution, 48-bit virtual
> address space) and (64-bit execution, 57-bit address space) without
> exiting 64-bit mode in the middle. This is because the way that the
> addressing mode is split among multiple registers prevents a single
> instruction from switching between the two states. This is absolutely
> delightful for anyone trying to boot an OS on a system with a very,
> very large amount of memory.
>

powerpc has some issues like that with context switching guest / host
state: there are several MMU registers involved that can't be switched
with a single instruction. It doesn't require such a big hammer, just a
careful sequence to switch things.

Thanks,
Nick