From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0FA1FC433EF for ; Fri, 17 Dec 2021 20:18:24 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 73DF46B0071; Fri, 17 Dec 2021 15:18:14 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 6C5666B0073; Fri, 17 Dec 2021 15:18:14 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 566396B0074; Fri, 17 Dec 2021 15:18:14 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0006.hostedemail.com [216.40.44.6]) by kanga.kvack.org (Postfix) with ESMTP id 433C16B0071 for ; Fri, 17 Dec 2021 15:18:14 -0500 (EST) Received: from smtpin31.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 085E4180DA199 for ; Fri, 17 Dec 2021 20:18:04 +0000 (UTC) X-FDA: 78928397688.31.4BFD725 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf19.hostedemail.com (Postfix) with ESMTP id E1FB21A0024 for ; Fri, 17 Dec 2021 20:18:02 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1639772282; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=2c0VJ6vHzyhRBdOsA/CnnUPbNYIb49wsJ7s121sWUXc=; b=Q7hPSNCO0OB4ewxvHR+/rFt3Lvl5v1t5QCDcRFhE6vNtFrpR2DoCUUHhy4OGxQG86a6yUt K2UQ+kdBo2zp9CeyQ/FHPQeB8N13ScoUPD2uHKsrsL2AFD/ZaSc+yk6MHJ2Fs4zTKAFHcW s2qHttDb2yXhVkrsgR6d3hyPCEWs3l4= Received: from mail-wm1-f71.google.com (mail-wm1-f71.google.com [209.85.128.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-509-V0GKTD3oPL2Tew31CR-DUw-1; Fri, 17 Dec 2021 15:18:01 -0500 X-MC-Unique: V0GKTD3oPL2Tew31CR-DUw-1 Received: by mail-wm1-f71.google.com with SMTP id j20-20020a05600c1c1400b00343ad0c4c40so3462500wms.0 for ; Fri, 17 Dec 2021 12:18:01 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:message-id:date:mime-version:user-agent :content-language:to:cc:references:from:organization:subject :in-reply-to:content-transfer-encoding; bh=2c0VJ6vHzyhRBdOsA/CnnUPbNYIb49wsJ7s121sWUXc=; b=oYKgkX7QqOfXsy33o054s/T5AjW5KufkJmK/+n0Vu4Xjl34T5+FFf4UQoF1tj6NYfl LJaHX+zarQC8VPJa/eRsXbuf2nyt7bRUB7nqtKgLTKbP5tj6NAPLSisqB/0H0Gg7vqPW L/O6MtIoHgBfQd3rqTbYww/Fjkf9YiLDpkGDjBcJ5EAle85E014jodbjTRJpgdy/UM+J dUdQO1hvG3oZdzAwEwgdE0PUaZz7GVhOF7NZglx7DiYMr9g4zFMDt31lo1gS/naj5db8 NeVURKjgg5itTz65jIoR9RKIkRA+pw/kwc2xxOgR1r0+T8XhAv6zw68iPam4MHWlreOR ePRg== X-Gm-Message-State: AOAM530hRnC5xKUDK4xprtrY/TVl73xtT0i2IxBWHR2jfZCEoEBlbgFt hrrEbeEpy29tR0YE19L+4pN09SDceMypURESETE2Msel+0IK1rm+rppVOew3BLAOkiRt3P7ZKWN LWDrm8uLEzQE= X-Received: by 2002:a1c:9d55:: with SMTP id g82mr8940534wme.58.1639772280205; Fri, 17 Dec 2021 12:18:00 -0800 (PST) X-Google-Smtp-Source: ABdhPJxLQJHbpNL5QwotNaX3cEFF/5ZCFCReDKEtVQIwIqRfkG0LtWKZ65TBS/l/u2j7qT9xkJUMWw== X-Received: by 2002:a1c:9d55:: with SMTP id g82mr8940497wme.58.1639772279937; Fri, 17 Dec 2021 12:17:59 -0800 (PST) Received: from [192.168.3.132] (p4ff234b8.dip0.t-ipconnect.de. [79.242.52.184]) by smtp.gmail.com with ESMTPSA id o64sm8355019wme.28.2021.12.17.12.17.58 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 17 Dec 2021 12:17:59 -0800 (PST) Message-ID: <54c492d7-ddcd-dcd0-7209-efb2847adf7c@redhat.com> Date: Fri, 17 Dec 2021 21:17:58 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0 To: Linus Torvalds Cc: Linux Kernel Mailing List , Andrew Morton , Hugh Dickins , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , Linux-MM , "open list:KERNEL SELFTEST FRAMEWORK" , "open list:DOCUMENTATION" References: <20211217113049.23850-1-david@redhat.com> <20211217113049.23850-7-david@redhat.com> From: David Hildenbrand Organization: Red Hat Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb) In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Rspamd-Queue-Id: E1FB21A0024 X-Stat-Signature: 4b9i3gmbt99ozf1etzc76sppgbonxs3e Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=Q7hPSNCO; dmarc=pass (policy=none) header.from=redhat.com; spf=none (imf19.hostedemail.com: domain of david@redhat.com has no SPF policy when checking 170.10.129.124) smtp.mailfrom=david@redhat.com X-Rspamd-Server: rspam11 X-HE-Tag: 1639772282-945211 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 17.12.21 20:22, Linus Torvalds wrote: > On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds > wrote: >> >> If we are doing a COW, we need an *exclusive* access to the page. That >> is not mapcount, that is the page ref. >> >> mapcount is insane, and I think this is making this worse again. > > Maybe I'm misreading this, but afaik > > - get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE. > > This just increments the page count, because mapcount == 1. > > - fork() > > - unmap in the original > > - child now has "mapcount == 1" on a page again, but refcount is > elevated, and child HAS TO COW before writing. Hi Linus, This is just GUP before fork(), which is in general problematic/incompatible with sharing. What we're concerned about in the context of this series (see the security issue) is GUP after fork(). And we're not changing GUP before fork() or even the COW logic in the context of this series. I agree that GUP before fork() has to be handled differently: during fork(): don't share something that cannot possibly be shared in a safe way. Don't allow COW semantics for something that is just broken with COW. > > Notice? "mapcount" is complete BS. The number of times a page is > mapped is irrelevant for COW. All that matters is that we get an > exclusive access to the page before we can write to it. We have to be very careful about the two sides of the story: GUP before fork and GUP after fork. > > Anybody who takes mapcount into account at COW time is broken, and it > worries me how this is all mixing up with the COW logic. > > Now, maybe this "unshare" case is sufficiently different from COW that > it's ok to look at mapcount for FAULT_FLAG_UNSHARE, as long as it > doesn't happen for a real COW. > > But honestly, for "unshare", I still don't see that the mapcount > matters. What does "mapcount == 1" mean? Why is it meaningful? I'll reply to your first mail in a sec. GUP is the problem with COW, not ordinary processes mapping a page (mapcount), where you will only get new sharers during fork() -- in a very controlled way. So GUP has to take care to unshare *before* taking a reference, such that we can never reach the point of missed COW. GUP really is the problematic bit with it all. Without GUP, we'd be living in a wonderful world in regards to COW. > > Because if COW does things right, and always breaks a COW based on > refcount, then what's the problem with taking a read-only ref to the > page whether it is mapped multiple times or mapped just once? Anybody > who already had write access to the page can write to it regardless, > and any new writers go through COW and get a new page. Let's just take a look at what refcount does *wrong*. Let's use an adjusted version of your example above, because it's a perfect fit: 1. mem = mmap(pagesize, MAP_PRIVATE) -> refcount == 1 2. memset(mem, 0, pagesize); /* Page is mapped R/W */ 3. fork() /* Page gets mapped R/O */ -> refcount > 1 4. child quits -> refcount == 1 5. Take a R/O pin (RDMA, VFIO, ...) -> refcount > 1 6. memset(mem, 0xff, pagesize); -> Write fault -> COW And GUP sees something different than our MM -- and this is perfectly valid, the R/O pin is just reading page content we might be modifying afterwrds. Take out 3. and 4. and it works as expected. This wouldn't happen when relying on the mapcount. And 3+4 can really be anything that results in a R/O mapping of an anonymous page, even if it's just swapout followed by read fault that maps the page R/O. > > I must be missing something realyl fundamental here, but to me it > really reads like "mapcount can fundamentally never be relevant for > COW, and if it's not relevant for COW, how can it be relevant for a > read-only copy?" It really is the right value to use. Only GUP is the problematic bit that has to trigger unsharing to not mess up COW logic later. Take GUP out of the equation and COW just works as expected with the mapcount -- as long as we can read an atomic value and synchronize against fork. (again, still composing the other mail :) ) -- Thanks, David / dhildenb