From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=6rh8=FZ=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.5 required=3.0 tests=BAYES_00,DKIM_ADSP_CUSTOM_MED,
	DKIM_INVALID,DKIM_SIGNED,FSL_HELO_FAKE,HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id AFC4FC433DB
	for <linux-mm@archiver.kernel.org>; Mon, 21 Dec 2020 04:44:27 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 415D522CB2
	for <linux-mm@archiver.kernel.org>; Mon, 21 Dec 2020 04:44:27 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 415D522CB2
Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 8B9976B0068; Sun, 20 Dec 2020 23:44:26 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 868386B006C; Sun, 20 Dec 2020 23:44:26 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 77FCB6B006E; Sun, 20 Dec 2020 23:44:26 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0212.hostedemail.com [216.40.44.212])
	by kanga.kvack.org (Postfix) with ESMTP id 5F60F6B0068
	for <linux-mm@kvack.org>; Sun, 20 Dec 2020 23:44:26 -0500 (EST)
Received: from smtpin13.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay01.hostedemail.com (Postfix) with ESMTP id 2B7B6180AD83B
	for <linux-mm@kvack.org>; Mon, 21 Dec 2020 04:44:26 +0000 (UTC)
X-FDA: 77616048132.13.aunt52_0407d8027454
Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251])
	by smtpin13.hostedemail.com (Postfix) with ESMTP id 0E2CD18140B85
	for <linux-mm@kvack.org>; Mon, 21 Dec 2020 04:44:26 +0000 (UTC)
X-HE-Tag: aunt52_0407d8027454
X-Filterd-Recvd-Size: 10520
Received: from mail-il1-f172.google.com (mail-il1-f172.google.com [209.85.166.172])
	by imf40.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Mon, 21 Dec 2020 04:44:25 +0000 (UTC)
Received: by mail-il1-f172.google.com with SMTP id q1so7772770ilt.6
        for <linux-mm@kvack.org>; Sun, 20 Dec 2020 20:44:25 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20161025;
        h=date:from:to:cc:subject:message-id:references:mime-version
         :content-disposition:content-transfer-encoding:in-reply-to;
        bh=p8JksJ1UE+8rqp4HYul0mL9E6FM1BsI4Tkdvg3Hw2OY=;
        b=R1h03FujlY+KgN5boGgDxp45kSuW98V/MCScMBkzOiKe02XkYO8H8H0Ilp0WctVBsN
         v+dpxuv17skD1Ux1jEuCx+2nPsM7GHd3Hl2pvGl5BRe0zRfmxpd24I7ra4FIcqTiQJZV
         5cFk+m/AgBYi8V8K1lLtFX2G0143VugwnvFmAWxWHIiZKRqkFsrnZjE/kDSjrkVoVwiU
         1rSTz40MZ4E4lspUpO+A4egnLtrkg/QtaT6FPikQLR7MGSAkB7GCkaDO4p+tVE8exRTD
         KulBRNcUBr+GtUVnC/7y/ZhdK/bROJdDFPRpIq5I0TXofzfBlKUvcHNIx/YHzl6ACNuE
         tj7w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:from:to:cc:subject:message-id:references
         :mime-version:content-disposition:content-transfer-encoding
         :in-reply-to;
        bh=p8JksJ1UE+8rqp4HYul0mL9E6FM1BsI4Tkdvg3Hw2OY=;
        b=Rr2wUYkH3ids5g6W7sicROJpI4rvuPacTx06aUOPddepkIzKR2Ua5z7rYmn7z8g5i6
         XLgGpllA5KBsLtZ7x6Wm3jGnKaLu1RsO5HDoCzw/RE3skq5aWHhapRcQBCBVvQjj8lQO
         tO4pD9cSPYyC1c1HgIBBmzL0c02sxgnF+Rof+Z3jbnJn+MgQlAQx6sYtFuAxnGoLXwFq
         ju9V9vdX7LdZlp5fVJjNBu+MHzomMwgBWt4Ygw5XZ45xuLf8boVHEcF0LfwHXw+nYf5H
         6E4x4vS75C87HsvQb2rrA5NR2yjqfrA2p4mnpbzDhpOuhdD7uV4TsCOl30vRtJbZMLnW
         5fRQ==
X-Gm-Message-State: AOAM531JXkMb08r/j2gezWI7YADkqxxKwBMRr//zrVr3QpN5ITZcYW4G
	SSn690QLHF+ASvLNqTj4+e2VMQ==
X-Google-Smtp-Source: ABdhPJwrDHPeqcY3HZhvnXxaX4FDb199mgH0soAhXc71NIMSD+41+jSEqh4vkN90QyGaFMwqWa9P9g==
X-Received: by 2002:a92:ce47:: with SMTP id a7mr15272146ilr.261.1608525864712;
        Sun, 20 Dec 2020 20:44:24 -0800 (PST)
Received: from google.com ([2620:15c:183:200:7220:84ff:fe09:2d90])
        by smtp.gmail.com with ESMTPSA id q5sm12155640ile.48.2020.12.20.20.44.23
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Sun, 20 Dec 2020 20:44:23 -0800 (PST)
Date: Sun, 20 Dec 2020 21:44:19 -0700
From: Yu Zhao <yuzhao@google.com>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>, linux-mm <linux-mm@kvack.org>,
	Peter Xu <peterx@redhat.com>, lkml <linux-kernel@vger.kernel.org>,
	Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>, stable@vger.kernel.org,
	minchan@kernel.org, Andy Lutomirski <luto@kernel.org>,
	Will Deacon <will@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
Message-ID: <X+AoIwSqkKMo7Oxg@google.com>
References: <20201219043006.2206347-1-namit@vmware.com>
 <X95RRZ3hkebEmmaj@redhat.com>
 <EDC00345-B46E-4396-8379-98E943723809@gmail.com>
 <X97pprdcRXusLGnq@google.com>
 <DDA15360-D6D4-46A8-95A4-5EE34107A407@gmail.com>
 <X98fZOiLNmnDQKhN@google.com>
 <3680387D-65F1-4078-A19D-F77DE8544B96@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <3680387D-65F1-4078-A19D-F77DE8544B96@gmail.com>
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Sun, Dec 20, 2020 at 07:33:09PM -0800, Nadav Amit wrote:
> > On Dec 20, 2020, at 1:54 AM, Yu Zhao <yuzhao@google.com> wrote:
> >=20
> > On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:
> >>> On Dec 19, 2020, at 10:05 PM, Yu Zhao <yuzhao@google.com> wrote:
> >>>=20
> >>> On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote:
> >>>> [ cc=E2=80=99ing some more people who have experience with similar=
 problems ]
> >>>>=20
> >>>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.c=
om> wrote:
> >>>>>=20
> >>>>> Hello,
> >>>>>=20
> >>>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >>>>>> Analyzing this problem indicates that there is a real bug since
> >>>>>> mmap_lock is only taken for read in mwriteprotect_range(). This =
might
> >>>>>=20
> >>>>> Never having to take the mmap_sem for writing, and in turn never
> >>>>> blocking, in order to modify the pagetables is quite an important
> >>>>> feature in uffd that justifies uffd instead of mprotect. It's not=
 the
> >>>>> most important reason to use uffd, but it'd be nice if that guara=
ntee
> >>>>> would remain also for the UFFDIO_WRITEPROTECT API, not only for t=
he
> >>>>> other pgtable manipulations.
> >>>>>=20
> >>>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>>>>>=20
> >>>>>> cpu0				cpu1
> >>>>>> ----				----
> >>>>>> userfaultfd_writeprotect()
> >>>>>> [ write-protecting ]
> >>>>>> mwriteprotect_range()
> >>>>>> mmap_read_lock()
> >>>>>> change_protection()
> >>>>>> change_protection_range()
> >>>>>> ...
> >>>>>> change_pte_range()
> >>>>>> [ defer TLB flushes]
> >>>>>> 				userfaultfd_writeprotect()
> >>>>>> 				 mmap_read_lock()
> >>>>>> 				 change_protection()
> >>>>>> 				 [ write-unprotect ]
> >>>>>> 				 ...
> >>>>>> 				  [ unprotect PTE logically ]
> >>>>>> 				...
> >>>>>> 				[ page-fault]
> >>>>>> 				...
> >>>>>> 				wp_page_copy()
> >>>>>> 				[ set new writable page in PTE]
> >>>=20
> >>> I don't see any problem in this example -- wp_page_copy() calls
> >>> ptep_clear_flush_notify(), which should take care of the stale entr=
y
> >>> left by cpu0.
> >>>=20
> >>> That being said, I suspect the memory corruption you observed is
> >>> related this example, with cpu1 running something else that flushes
> >>> conditionally depending on pte_write().
> >>>=20
> >>> Do you know which type of pages were corrupted? file, anon, etc.
> >>=20
> >> First, Yu, you are correct. My analysis is incorrect, but let me hav=
e
> >> another try (below). To answer your (and Andrea=E2=80=99s) question =
- this happens
> >> with upstream without any changes, excluding a small fix to the self=
test,
> >> since it failed (got stuck) due to missing wake events. [1]
> >>=20
> >> We are talking about anon memory.
> >>=20
> >> So to correct myself, I think that what I really encountered was act=
ually
> >> during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed).=
 The
> >> problem was that in this case the =E2=80=9Cwrite=E2=80=9D-bit was re=
moved during unprotect.
> >=20
> > Thanks. You are right about when the problem happens: UFD write-
> > UNprotecting. But it's not UFD write-UNprotecting that removes the
> > writable bit -- the bit can only be removed during COW or UFD
> > write-protecting. So your original example was almost correct, except
> > the last line describing cpu1.
>=20
> The scenario is a bit confusing, so stay with me. The idea behind uffd
> unprotect is indeed only to mark the PTE logically as uffd-unprotected,=
 and
> not to *set* the writable bit, allowing the #PF handler to do COW or
> whatever correctly upon #PF.

Right.

> However, the problem that we have is that if a page is already writable=
,
> write-unprotect *clears* the writable bit, making it write-protected (a=
t
> least for anonymous pages). This is not good from performance point-of-=
view,
> but also a correctness issue, as I pointed out.
>=20
> In some more detail: mwriteprotect_range() uses vm_get_page_prot() to
> compute the new protection. For anonymous private memory, at least on x=
86,
> this means the write-bit in the protection is clear. So later,
> change_pte_range() *clears* the write-bit during *unprotection*.

Agreed.

> That=E2=80=99s the reason the second part of my patch - the change to p=
reserve_write
> - fixes the problem.

Yes, it fixes a part of the problem.

But what if there is no writable bit there for you to preserve? If the
bit was cleared by COW, e.g., KSM, fork, etc., no problem, because they
guarantee the flush. If it was cleared by a priory UFD write-protecting,
you still would run into the same problem because of the deterred flush.
And you are trying to fix this part of the problem by grabbing
mmap_write_lock. I think I understand your patch correctly.

Both parts of the problem were introduced by the commits I listed, and
your patch does fix the problem. I'm just saying it's not an optimal fix
because for the second part of the problem:
1) there is no need to grab mmap_write_lock
2) there is no need to make a copy in do_wp_page() if the write bit was
cleared by UFD write-protecting (non-COW case).

The fix should be done in do_wp_page(), i.e., to handle non-COW pages
correctly. Preserving the write bit can be considered separately as an
optimization, not a fix. (It eliminates unnecessary page faults.)

> > The problem is how do_wp_page() handles non-COW pages. (For COW pages=
,
> > do_wp_page() works correctly by either reusing an existing page or
> > make a new copy out of it.) In UFD case, the existing page may not
> > have been properly write-protected. As you pointed out, the tlb flush
> > may not be done yet. Making a copy can potentially race with the
> > writer on cpu2.
>=20
> Just to clarify the difference - You regard a scenario of UFFD
> write-protect, while I am pretty sure the problem I encountered is duri=
ng
> write-unprotect.
>=20
> I am not sure we are on the same page (but we may be). The problem I ha=
ve is
> with cow_user_page() that is called by do_wp_page() before any TLB flus=
h
> took place (either by change_protection_range() or by do_wp_page() whic=
h
> does flush, but after the copy).
>=20
> Let me know if you regard a different scenario.
>=20
> > Should we fix the problem by ensuring integrity of the copy? IMO, no,
> > because do_wp_page() shouldn't copy at all in this case. It seems it
> > was recently broken by
> >=20
> >  be068f29034f mm: fix misplaced unlock_page in do_wp_page()
> >  09854ba94c6a mm: do_wp_page() simplification
> >=20
> > I haven't study them carefully. But if you could just revert them and
> > run the test again, we'd know where exactly to look at next.
>=20
> These patches regard the wp_page_reuse() case, which makes me think we
> are not on the same page. I do not see a problem with wp_page_reuse()
> since it does not make a copy of the page. If you can explain what I
> am missing, it would be great.
>=20