From: Linus Torvalds
Date: Sun, 19 Dec 2021 09:27:17 -0800
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)
To: Nadav Amit
Cc: David Hildenbrand, Jason Gunthorpe, Linux Kernel Mailing List,
 Andrew Morton, Hugh Dickins, David Rientjes, Shakeel Butt, John Hubbard,
 Mike Kravetz, Mike Rapoport, Yang Shi, "Kirill A . Shutemov",
 Matthew Wilcox, Vlastimil Babka, Jann Horn, Michal Hocko, Rik van Riel,
 Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
 Christoph Hellwig, Oleg Nesterov, Jan Kara, Linux-MM,
 "open list:KERNEL SELFTEST FRAMEWORK", "open list:DOCUMENTATION"
In-Reply-To: <5A7D771C-FF95-465E-95F6-CD249FE28381@vmware.com>

On Sat, Dec 18, 2021 at 10:02 PM Nadav Amit wrote:
>
> I found my old messy code for the software-PTE thing.
>
> I see that eventually I decided to hold a pointer to the “extra PTEs”
> of each page in the PMD-page-struct. [ I also implemented the 2-adjacent
> pages approach but this code is long gone. ]

Ok, I understand why that ends up being the choice, but it makes it
too ugly and messy to look up to be worth it, I think.

> I still don’t know what exactly you have in mind for making use
> out of it for the COW issue.

So the truly fundamental question for COW (and for a long-term GUP) is
fairly simple:

 - Is the page I have truly owned exclusively by this VM?

If that _isn't_ the case, you absolutely have to COW.

If that _is_ the case, you can re-use the page.

That is really it, boiled down to the pure basics.

And if you aren't sure whether you are the ultimate and only authority
over the page, then COW is the "safer" option, in that breaking
sharing is fundamentally better than over-sharing.

Now, the reason I like "page_count()==1" is that it is a 100% certain
way to know that you own the page absolutely and clearly.

There is no question what-so-ever about it.

And the reason I hate "page_mapcount()==1" with a passion is that it
is NOTHING OF THE KIND. It is an entirely meaningless number. It
doesn't mean anything at all.

Even if the page mapcount is exactly right, it could easily and
trivially be a result of "fork, then unmap in either parent or child".

Now that page_mapcount() is unquestionably 1, but despite that, at
some point the page was shared by another VM, and you can not know
whether you really have exclusive access.

And that "even if page mapcount is exactly right" is a big issue in
itself, as I hope I've explained.

It requires page locking, it requires that you take swapcache users
into account, it is just a truly messy and messed up thing.

There really is absolutely no reason for page_mapcount to exist. It's
a mistake. We have it for completely broken historical reasons.

It's WRONG.

Now, if "page_count()==1" is so great, what is the issue? Problem solved.

No, while page_count()==1 is one really fundamental marker (unlike the
mapcount), it does have problems too.

Because yes, "page_count()==1" does mean that you have truly exclusive
ownership of the page, but the reverse is not true.

The way the current regular VM code handles that "the reverse is not
true" is by making "the page is writable" be the second way you can
say "you clearly have full ownership of the page".

So that's why you then have the "maybe_pinned()" thing in fork() and
in swap cache creation that keeps such a page writable, and doesn't do
the virtual copy and make it read-only again.
But that's also why it has problems with write-protect (whether
mprotect or uffd_wp).

Anyway, that was a long explanation to make the thinking clear, and
finally come to the actual answer to your question:

Adding another bit in the page tables - *purely* to say "this VM owns
the page outright" - would be fairly powerful. And fairly simple.

Then any COW event will set that bit - because when you actually COW,
the page you install is *yours*. No questions asked.

And fork() would simply clear that bit (unless the page was one of the
pinned pages that we simply copy).

See how simple that kind of concept is.

And please, see how INCREDIBLY BROKEN page_mapcount() is. It really
fundamentally is pure and utter garbage. It in no way says "I have
exclusive ownership of this page", because even if the mapcount is 1
*now*, it could have been something else earlier, and some other VM
could have gotten a reference to it before the current VM did so.

This is why I will categorically NAK any stupid attempt to re-introduce
page_mapcount() for COW or GUP handling. It's unacceptably
fundamentally broken.

Btw, the extra bit doesn't really have to be in the page tables. It
could be a bit in the page itself. We could add another page bit that
we just clear when we do the "add ref to page as you make a virtual
copy during fork() etc".

And no, we can't use "pincount" either, because it's not exact. The
fact that the page count is so elevated that we think it's pinned is a
_heuristic_, and that's ok when you have the opposite problem, and ask
"*might* this page be pinned". You want to never get a false negative,
but it can get a false positive.

                 Linus