Date: Tue, 22 Dec 2020 16:39:46 -0700
From: Yu Zhao
To: Andrea Arcangeli
Cc: Andy Lutomirski, Linus Torvalds, Peter Xu, Nadav Amit, linux-mm,
	lkml, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, stable,
	Minchan Kim, Will Deacon, Peter Zijlstra
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
Message-ID:
References: <9E301C7C-882A-4E0F-8D6D-1170E792065A@gmail.com>
 <1FCC8F93-FF29-44D3-A73A-DF943D056680@gmail.com>
 <20201221223041.GL6640@xz-x1>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:

On Tue, Dec 22, 2020 at 05:02:37PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 22, 2020 at 02:14:41PM -0700, Yu Zhao wrote:
> > This works but I don't prefer this option because 1) this is a new
> > way of making pte_wrprotect safe and 2) it's specific to uffd and
> > can't be applied to clear_soft_dirty() which has no "catcher". No
>
> I didn't look into the clear_soft_dirty issue; I can look into that.
>
> To avoid having to deal with a 3rd solution, it will have to use one
> of the two:
>
> 1) avoid deferring the tlb flush and enforce a sync flush before
>    dropping the PT lock even if mm_tlb_flush_pending is true ->
>    write_protect_page in KSM
>
> 2) add its own new catcher for its own specific marker (for UFFD-WP
>    the marker is _PAGE_UFFD_WP, for change_prot_numa it is a
>    PROT_NONE pte on a vma whose vm_page_prot is not PROT_NONE, for
>    soft dirty it could be _PAGE_SOFT_DIRTY) to send the page fault
>    to a dead end before the pte value is interpreted.
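To make sure I read option 1) correctly, the sketch below is what I
think it amounts to, loosely modeled on write_protect_page() in
mm/ksm.c. The helper name and surrounding context are mine, purely
for illustration; it's not a tested patch:

static void wrprotect_pte_sync(struct vm_area_struct *vma,
			       unsigned long addr, pte_t *ptep)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t entry;

	/*
	 * If the pte is still writable, or if a batched flush is
	 * pending, flush synchronously before dropping the PT lock,
	 * so no other CPU can keep writing through a stale RW TLB
	 * entry after the pte reads as wrprotected.
	 */
	if (pte_write(*ptep) || mm_tlb_flush_pending(mm)) {
		flush_cache_page(vma, addr, pte_pfn(*ptep));
		entry = ptep_clear_flush(vma, addr, ptep);
		entry = pte_wrprotect(entry);
		set_pte_at(mm, addr, ptep, entry);
	}
}

If that's the right reading, the cost is one synchronous remote flush
per pte whenever a deferred flush is pending, which seems acceptable
for a single-page wrprotect.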
> > matter how good the documentation about this new way is now, it
> > will be out of date, speaking from my personal experience.
>
> A common catcher for all 3 is not possible because each catcher
> depends on whatever marker and whatever pte value they set, which may
> lead to a different deterministic path where to put the catcher, or
> even to multiple paths. do_numa_page already requires a catcher in a
> different place.
>
> Or well, a common catcher for all 3 is technically possible, but it'd
> perform slower, requiring to check things twice.
>
> But perhaps soft dirty can use the same catcher as uffd-wp, given the
> similarity?
>
> > I'd go with what Nadav has -- the memory corruption problem has
> > been there for a while and nobody has complained except Nadav. This
> > makes me think people are less likely to notice any performance
> > issues from holding mmap_lock for write in his patch either.
>
> uffd-wp is a fairly new feature and the current users are irrelevant;
> keeping it optimal is just for the future potential.
>
> > But I can't say I have zero concern with the potential performance
> > impact given that I have been expecting the fix to go to stable,
> > which I care about most. So the next option on my list is to have a
>
> Actually stable would be fine going with Nadav's patch and taking
> mmap_lock_write unconditionally. The optimal fix only matters for the
> future potential, so it's only relevant for Linus's tree.
>
> However, the feature is recent enough that it won't require a deep
> backport, so the optimal fix is likely fine for stable as well;
> generally stable prefers the same fix as upstream when there's no
> major backport rejection issue.
>
> The alternative solution for uffd is to do the deferred flush under
> mmap_lock_write if len is > HPAGE_PMD_SIZE, or to tell
> change_protection not to defer the flush and to take mmap_lock_read
> for <= HPAGE_PMD_SIZE. That will avoid depending on the catcher, and
> then userfaultfd_writeprotect(mode_wp=true) and
> userfaultfd_writeprotect(mode_wp=false) can even run in parallel at
> all times. The con is that a large userfaultfd_writeprotect will
> block for I/O, which would happen at least in the postcopy live
> snapshotting use case.
>
> The main con is that it'd require modifying change_protection, so it
> actually looks more intrusive, not less.
>
> Overall, anything that allows wrprotecting 1 pte with only
> mmap_lock_read, exactly like KSM's write_protect_page, would be
> enough for uffd-wp.
>
> What isn't ok in terms of future potential is the unconditional
> mmap_lock_write as in the originally suggested patch, in my view.
> That doesn't mean we can't take mmap_lock_write when the operation is
> large and there is actually more benefit from deferring the flush.
>
> > common "catcher" in do_wp_page() which singles out pages that have
> > page_mapcount equal to one and reuses them instead of trying to
>
> I don't think the page_mapcount matters here. If the wp page reuse
> were more accurate (as it was before), we wouldn't notice this issue,
> but it would still be a bug that there were stale TLB entries. It
> worked by luck.

Can you please correct my understanding below? Thank you.

Generally speaking, we have four possible combinations relevant to
this discussion:

  1) anon, COW
  2) anon, non-COW
  3) file, COW
  4) file, non-COW

Assume we pte_wrprotect while holding mmap_lock for read; all four
combinations can be routed to do_wp_page(). The difference is that 1)
and 3) are guaranteed to be RO when they become COW, so what we do on
top of their existing state doesn't make any difference.

2) is the false positive caused by what we do, and it's what produces
the memory corruption: do_wp_page() tries to make copies of pages that
seem to be RO but may still have stale RW TLB entries pending flush.
If we grab mmap_lock for write, we block the fault that tries to copy
before we flush. Once the flush is done, we'll have two faulting CPUs:
one that had the stale entry and would otherwise have done the damage,
and the other that tries to copy. Then there is the silly part: we
make a copy and swap it in for the only user who already has it...

We are talking about non-COW anon pages here -- they can't be mapped
more than once. So why not just identify them by checking
page_mapcount == 1 and then unconditionally reuse them? (This is
probably where I've missed things.)

4) may also be a false positive (there are other legit cases relying
on it), and it's correctly identified by !PageAnon() +
VM_WRITE|VM_SHARED. For this category, we always reuse, so we hitch a
free ride.
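To make the question concrete, the check I have in mind would be
something along the lines of the sketch below -- untested, the helper
name is made up, and it's only meant to show where the page_mapcount
test would sit relative to do_wp_page():

/*
 * Reuse a non-COW anon page instead of copying it. A non-COW anon
 * page can't be mapped more than once, so page_mapcount == 1 (and
 * not being in the swap cache) should mean we are the sole user and
 * can simply restore write permission.
 */
static vm_fault_t wp_page_reuse_exclusive(struct vm_fault *vmf)
{
	struct page *page = vmf->page;

	if (page && PageAnon(page) && !PageSwapCache(page) &&
	    page_mapcount(page) == 1) {
		wp_page_reuse(vmf);
		return VM_FAULT_WRITE;
	}
	/* Anything else falls back to the existing copy path. */
	return VM_FAULT_FALLBACK;
}

do_wp_page() would call this before wp_page_copy() and only fall
through to the copy path on VM_FAULT_FALLBACK.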