From: Linus Torvalds
Date: Sun, 9 Jan 2022 19:51:26 -0800
Subject: Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
To: Rik van Riel
Cc: Andy Lutomirski, Will Deacon, Catalin Marinas, Andrew Morton,
 Linux-MM, Nicholas Piggin, Anton Blanchard, Benjamin Herrenschmidt,
 Paul Mackerras, Randy Dunlap, linux-arch,
 "the arch/x86 maintainers", Dave Hansen, "Peter Zijlstra (Intel)",
 Nadav Amit, Mathieu Desnoyers
References: <7c9c388c388df8e88bb5d14828053ac0cb11cf69.1641659630.git.luto@kernel.org>
 <3586aa63-2dd2-4569-b9b9-f51080962ff2@www.fastmail.com>
 <430e3db1-693f-4d46-bebf-0a953fe6c2fc@www.fastmail.com>
 <484a7f37-ceed-44f6-8629-0e67a0860dc8@www.fastmail.com>
 <00f58dff-9df5-45ac-a078-d852f13b1dfe@www.fastmail.com>

On Sun, Jan 9, 2022 at 6:40 PM Rik van Riel wrote:
>
> Also, while 800 loads is kinda expensive, it is a heck of
> a lot less expensive than 800 IPIs.

Rik, the IPI's you have to do *anyway*. So there are exactly zero extra IPI's.

Go take a look. It's part of the whole "flush TLB's" thing in __mmput().

So let me explain one more time what I think we should have done, at
least on x86:

 (1) stop refcounting active_mm entries entirely on x86

Why can we do that? Because instead of worrying about doing those
mm_count games for the active_mm reference, we realize that any
active_mm has to have a _regular_ mm associated with it, and it has a
'mm_users' count.

And when that mm_users count goes to zero, we have:

 (2) mmput -> __mmput -> exit_mmap(), which already has to flush all
     TLB's because it's tearing down the page tables

And since it has to flush those TLB's as part of tearing down the page
tables, we on x86 then have:

 (3) that TLB flush will have to do the IPI's to anybody who has that
     mm active already

and that IPI has to be done *regardless*. And the TLB flushing done by
that IPI? That code already clears the lazy status (and not doing so
would be pointless and in fact wrong).

Notice? There isn't some "800 loads". There isn't some "800 IPI's".
And there isn't any refcounting cost of the lazy TLB.

Well, right now there *is* that refcounting cost, but my point is that
I don't think it should exist.

It shouldn't exist as an atomic access to mm_count (with those cache
ping-pongs when you have a lot of threads across a lot of CPUs), but
it *also* shouldn't exist as a "lightweight hazard pointer".

See my point? I think the lazy-tlb refcounting we do is pointless if
you have to do IPI's for TLB flushes.

Note: the above is for x86, which has to do the IPI's anyway (and it's
very possible that if you don't have to do IPI's because you have HW
TLB coherency, maybe lazy TLB's aren't what you should be using, but I
think that should be a separate discussion).

And yes, right now we do that pointless reference counting, because it
was simple and straightforward, and people historically didn't see it
as a problem.

Plus we had to have that whole secondary 'mm_count' anyway for other
reasons, since we use it for things that need to keep a ref to 'struct
mm_struct' around regardless of page table counts (eg a lot of /proc
references to 'struct mm_struct' do not want to keep forced references
to user page tables alive).

But I think conceptually mm_count (ie mmgrab/mmdrop) was always really
dodgy for lazy TLB. Lazy TLB really cares about the page tables still
being there, and that's not what mm_count is ostensibly about. That's
really what mm_users is about.

Yet mmgrab/mmdrop is exactly what the lazy TLB code uses, even if it's
technically odd (ie mmgrab really only keeps the 'struct mm' around,
but not the vma's and page tables).

Side note: you can see the effects of this mis-use of mmgrab/mmdrop in
how we tear down _almost_ all the page table state in __mmput().
But look at what we keep until the final __mmdrop, even though there
are no page tables left:

        mm_free_pgd(mm);
        destroy_context(mm);

exactly because even though we've torn down all the page tables
earlier, we had to keep the page table *root* around for the lazy
case.

It's kind of a layering violation, but it comes from that lazy-tlb
mm_count use, and so we have that odd situation where the page table
directory lifetime is very different from the rest of the page table
lifetimes.

(You can easily make excuses for it by just saying that "mm_users" is
the user-space page table user count, and that the page directory has
a different lifetime because it's also about the kernel page tables,
so it's all a bit of a gray area, but I do think it's also a bit of a
sign of how our refcounting for lazy-tlb is a bit dodgy).

             Linus
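
Editorial sketch, not part of the original message: steps (1) to (3)
above can be modelled in a few lines of userspace C. This is a toy
under stated assumptions, not kernel code. Every name prefixed with
toy_ is hypothetical, the per-CPU array stands in for whatever
tracking the real TLB-flush code uses, and an "IPI" is just a function
call. The only point is that a lazy holder takes no reference, and the
flush that teardown has to do anyway is what clears the lazy status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 4

/* Toy stand-in for struct mm_struct; all names here are hypothetical. */
struct toy_mm {
	int mm_users;      /* users of the user-space page tables */
	bool page_tables;  /* true while the page tables still exist */
};

/* Per-CPU lazily borrowed mm, or NULL. No refcount is taken for it. */
struct toy_mm *toy_lazy_mm[TOY_NR_CPUS];

/* Number of simulated flush IPIs sent so far. */
int toy_ipis_sent;

/* Models the TLB-flush IPI handler: it also clears the lazy status. */
void toy_flush_ipi(int cpu, struct toy_mm *mm)
{
	toy_ipis_sent++;
	if (toy_lazy_mm[cpu] == mm)
		toy_lazy_mm[cpu] = NULL;
}

/* Models exit_mmap(): flush every CPU still using mm, then tear down. */
void toy_exit_mmap(struct toy_mm *mm)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_lazy_mm[cpu] == mm)
			toy_flush_ipi(cpu, mm);
	mm->page_tables = false;
}

/* Models mmput(): the last mm_users reference triggers the teardown. */
void toy_mmput(struct toy_mm *mm)
{
	if (--mm->mm_users == 0)
		toy_exit_mmap(mm);
}

/* Counts the CPUs that still hold mm lazily. */
int toy_lazy_holders(const struct toy_mm *mm)
{
	int n = 0;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_lazy_mm[cpu] == mm)
			n++;
	return n;
}
```

To try it, set mm_users to 1, point two entries of toy_lazy_mm at the
same toy_mm, and call toy_mmput() from a main() of your own: the model
sends one simulated IPI per lazy CPU, and toy_lazy_holders() drops to
zero without any per-CPU reference count having been touched.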