From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id F0B50C433F5 for ; Tue, 11 Jan 2022 03:10:17 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 735776B0072; Mon, 10 Jan 2022 22:10:17 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 6E4F46B0073; Mon, 10 Jan 2022 22:10:17 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5AC946B0074; Mon, 10 Jan 2022 22:10:17 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0237.hostedemail.com [216.40.44.237]) by kanga.kvack.org (Postfix) with ESMTP id 4A9586B0072 for ; Mon, 10 Jan 2022 22:10:17 -0500 (EST) Received: from smtpin26.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id E2F7C181CB2A2 for ; Tue, 11 Jan 2022 03:10:16 +0000 (UTC) X-FDA: 79016527632.26.0E0B9FD Received: from mail-pj1-f41.google.com (mail-pj1-f41.google.com [209.85.216.41]) by imf02.hostedemail.com (Postfix) with ESMTP id 80FD880005 for ; Tue, 11 Jan 2022 03:10:16 +0000 (UTC) Received: by mail-pj1-f41.google.com with SMTP id lr15-20020a17090b4b8f00b001b19671cbebso3367241pjb.1 for ; Mon, 10 Jan 2022 19:10:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=date:from:subject:to:cc:references:in-reply-to:mime-version :message-id:content-transfer-encoding; bh=KGhXskeWeF++6Qq+O+6sYoul9PU/Gvc9NNDfObNCvXk=; b=LFnoIQBYPLd7U8Y7ffu26TgmqHwmHz/RaLrxz/t89/ObSSbNqZNcEOdQkvs6D9jBE+ VrpQ9/RaKScu2kv0ANIrwzLJAEHCkTjF56emG0yJPUBaYtdYkPX7v8HdKBSsDhv0WoPl ZSsSjNlJVQ97pPSvPlDSEXaVoZosCQL8AePVpTsN8b6z4samVeIqsl9IkkB4/F/9ZOFV f6WqV3eUB9ABhfzPIk4ExYnG2EdT24wr8bJlbEIPyqjvfeojs/+Uhlkn0MK3xuj++rQr aE+aL6xq9VGenG/StzZztTMgqA8FEVp7h+oA2KttoY52g0pVUDBOArpqvlgDnQZigrCv O73A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:subject:to:cc:references:in-reply-to :mime-version:message-id:content-transfer-encoding; bh=KGhXskeWeF++6Qq+O+6sYoul9PU/Gvc9NNDfObNCvXk=; b=FoJy/y9JOMbfO05aUoaGZ0avUo3COU5UxIuOF/BcQcd9B7csJQtEWH6Iiil4oqMj7z OcCCkunup1l96Rom69OQjoWP4vg9i3TRuA8hjroj6uzt5FmdBrP6KIuzG2vFXE+glJaM mXz0GNE+qAHjj2boppVR0qlXSGql+heqT6RZJ38I6k+PYYDc7m4kkLhZ5xuVEHHA8MOn rUeVHUl+GnfaJHvkw7H3HtTZZoI+09gLTUP+tglhF3YxuVY35S7XTGXptjBWuNTAjPgj 1ieVmVbbQFey28VAXsETrtNBMKoYdNWNEgJLZ4yVTFMjHAAtFbT1t/ZCuevCXI4q9UYG 3FAA== X-Gm-Message-State: AOAM531kO6HoEOBPhBBsNYXq7Ns1L4B+j4Z1nqMi9AiW9hJc95HM/Q3I bVgz8Fh0OCop1yEwiedyCq4= X-Google-Smtp-Source: ABdhPJzUnuqhIAjqQ+jrRPi+mo64hhbmaXgiYDjCvcR6PCYx/fPkQh5u/JokOwh7Q3nZKMfi5nQQnw== X-Received: by 2002:a17:902:830b:b0:149:5f4d:e178 with SMTP id bd11-20020a170902830b00b001495f4de178mr2674408plb.152.1641870615090; Mon, 10 Jan 2022 19:10:15 -0800 (PST) Received: from localhost (124-171-74-95.tpgi.com.au. [124.171.74.95]) by smtp.gmail.com with ESMTPSA id d13sm8354088pfl.18.2022.01.10.19.10.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 10 Jan 2022 19:10:14 -0800 (PST) Date: Tue, 11 Jan 2022 13:10:09 +1000 From: Nicholas Piggin Subject: Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms To: Andy Lutomirski , Linus Torvalds Cc: Andrew Morton , Anton Blanchard , Benjamin Herrenschmidt , Catalin Marinas , Dave Hansen , linux-arch , Linux-MM , Mathieu Desnoyers , Nadav Amit , Paul Mackerras , "Peter Zijlstra (Intel)" , Randy Dunlap , Rik van Riel , Will Deacon , the arch/x86 maintainers References: <7c9c388c388df8e88bb5d14828053ac0cb11cf69.1641659630.git.luto@kernel.org> <3586aa63-2dd2-4569-b9b9-f51080962ff2@www.fastmail.com> <430e3db1-693f-4d46-bebf-0a953fe6c2fc@www.fastmail.com> <484a7f37-ceed-44f6-8629-0e67a0860dc8@www.fastmail.com> <1641790309.2vqc26hwm3.astroid@bobo.none> <0d905aef-f53c-4102-931f-a22edd084fae@www.fastmail.com> In-Reply-To: <0d905aef-f53c-4102-931f-a22edd084fae@www.fastmail.com> MIME-Version: 1.0 Message-Id: <1641867901.xve7qy78q6.astroid@bobo.none> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 80FD880005 X-Stat-Signature: irkk4qzcnnoy9481je85skcu568d36gr Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=LFnoIQBY; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf02.hostedemail.com: domain of npiggin@gmail.com designates 209.85.216.41 as permitted sender) smtp.mailfrom=npiggin@gmail.com X-HE-Tag: 1641870616-1172 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Excerpts from Andy Lutomirski's message of January 11, 2022 6:52 am: >=20 >=20 > On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote: >> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am: >>> [ Ugh, I actually went back and looked at Nick's patches again, to >>> just verify my memory, and they weren't as pretty as I thought they >>> were ] >>>=20 >>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds >>> wrote: >>>> >>>> I'd much rather have a *much* smaller patch that says "on x86 and >>>> powerpc, we don't need this overhead at all". >>>=20 >>> For some reason I thought Nick's patch worked at "last mmput" time and >>> the TLB flush IPIs that happen at that point anyway would then make >>> sure any lazy TLB is cleaned up. >>>=20 >>> But that's not actually what it does. It ties the >>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the >>> last mmdrop() instead. Because it really tied the whole logic to the >>> mm_count logic (and made lazy tlb to not do mm_count) rather than the >>> mm_users thing I mis-remembered it doing. >> >> It does this because on powerpc with hash MMU, we can't use IPIs for >> TLB shootdowns. >=20 > I know nothing about powerpc=E2=80=99s mmu. If you can=E2=80=99t do IPI s= hootdowns, The paravirtualised hash MMU environment doesn't because it has a single=20 level translation and the guest uses hypercalls to insert and remove=20 translations and the hypervisor flushes TLBs. The HV could flush TLBs with IPIs but obviously the guest can't use those to execute the lazy switch. In radix guests (and all bare metal) the OS flushes its own TLBs. We are moving over to radix, but powerpc also has a hardware broadcast=20 flush instruction which can be a bit faster than IPIs and is usable by=20 bare metal and radix guests so those can also avoid the IPIs if they=20 want. Part of the powerpc patch I just sent to combine the lazy switch=20 with the final TLB flush is to force it to always take the IPI path and=20 not use TLBIE instruction on the final exit. So hazard points could avoid some IPIs there too. > it sounds like the hazard pointer scheme might actually be pretty good. Some IPIs in the exit path just aren't that big a concern. I measured, got numbers, tried to irritate it, just wasn't really a problem. Some archs use IPIs for all threaded TLB shootdowns and exits not that frequent. Very fast short lived processes that do a lot of exits just don't tend to spread across a lot of CPUs leaving lazy tlb mms to shoot, and long lived and multi threaded ones that do don't exit at high rates. So from what I can see it's premature optimization. Actually maybe not even optimization because IIRC it adds complexity and even a barrier on powerpc in the context switch path which is a lot more critical than exit() for us we don't want slowdowns there. It's a pretty high complexity boutique kind of synchronization. Which don't get me wrong is the kind of thing I like, it is clever and may be perfectly bug free but it needs to prove itself over the simple dumb shoot lazies approach. >>> So at least some of my arguments were based on me just mis-remembering >>> what Nick's patch actually did (mainly because I mentally recreated >>> the patch from "Nick did something like this" and what I thought would >>> be the way to do it on x86). >> >> With powerpc with the radix MMU using IPI based shootdowns, we can=20 >> actually do the switch-away-from-lazy on the final TLB flush and the >> final broadcast shootdown thing becomes a no-op. I didn't post that >> additional patch because it's powerpc-specific and I didn't want to >> post more code so widely. >> >>> So I guess I have to recant my arguments. >>>=20 >>> I still think my "get rid of lazy at last mmput" model should work, >>> and would be a perfect match for x86, but I can't really point to Nick >>> having done that. >>>=20 >>> So I was full of BS. >>>=20 >>> Hmm. I'd love to try to actually create a patch that does that "Nick >>> thing", but on last mmput() (ie when __mmput triggers). Because I >>> think this is interesting. But then I look at my schedule for the >>> upcoming week, and I go "I don't have a leg to stand on in this >>> discussion, and I'm just all hot air". >> >> I agree Andy's approach is very complicated and adds more overhead than >> necessary for powerpc, which is why I don't want to use it. I'm still >> not entirely sure what the big problem would be to convert x86 to use >> it, I admit I haven't kept up with the exact details of its lazy tlb >> mm handling recently though. >=20 > The big problem is the entire remainder of this series! If x86 is going = to do shootdowns without mm_count, I want the result to work and be maintai= nable. A few of the issues that needed solving: >=20 > - x86 tracks usage of the lazy mm on CPUs that have it loaded into the MM= U, not CPUs that have it in active_mm. Getting this in sync needed core ch= anges. Definitely should have been done at the time x86 deviated, but better=20 late than never. >=20 > - mmgrab and mmdrop are barriers, and core code relies on that. If we get= rid of a bunch of calls (conditionally), we need to stop depending on the = barriers. I fixed this. membarrier relied on a call that mmdrop was providing. Adding a smp_mb() instead if mmdrop is a no-op is fine. Patches changing membarrier's=20 ordering requirements can be concurrent and are not fundmentally tied to lazy tlb mm switching, it just reuses an existing ordering point. > - There were too many mmgrab and mmdrop calls, and the call sites had dif= ferent semantics and different refcounting rules (thanks, kthread). I clea= ned this up. Seems like a decent cleanup. Again lazy tlb specific, just general switch code should be factored and better contained in kernel/sched/ which is fine, but concurrent to lazy tlb improvements. > - If we do a shootdown instead of a refcount, then, when exit() tears dow= n its mm, we are lazily using *that* mm when we do the shootdowns. If activ= e_mm continues to point to the being-freed mm and an NMI tries to dereferen= ce it, we=E2=80=99re toast. I fixed those issues. My shoot lazies patch has no such issues with that AFAIKS. What exact=20 issue is it and where did you fix it? >=20 > - If we do a UEFI runtime service call while lazy or a text_poke while la= zy and the mm goes away while this is happening, we would blow up. Refcount= ing prevents this but, in current kernels, a shootdown IPI on x86 would not= prevent this. I fixed these issues (and removed duplicate code). >=20 > My point here is that the current lazy mm code is a huge mess. 90% of the= complexity in this series is cleaning up core messiness and x86 messiness.= I would still like to get rid of ->active_mm entirely (it appears to serve= no good purpose on any architecture), it that can be saved for later, I t= hink. I disagree, the lazy mm code is very clean and simple. And I can't see=20 how you would propose to remove active_mm from core code I'm skeptical but would be very interested to see, but that's nothing to do with my shoot lazies patch and can also be concurrent except for mechanical merge issues. Thanks, Nick