From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 12 Jan 2021 13:49:43 -0700
From: Yu Zhao <yuzhao@google.com>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Vinayak Menon <vinmenon@codeaurora.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andy Lutomirski <luto@kernel.org>, Peter Xu <peterx@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	linux-mm <linux-mm@kvack.org>, lkml <linux-kernel@vger.kernel.org>,
	Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	stable <stable@vger.kernel.org>, Minchan Kim <minchan@kernel.org>,
	Will Deacon <will@kernel.org>, surenb@google.com
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
Message-ID: <X/4LZ5NpCwYLgW1s@google.com>
References: <CALCETrV=8tY7h=aaudWBEn-MJnNkm2wz5qjH49SYqwkjYTpOaA@mail.gmail.com>
 <CAHk-=wj=CcOHQpG0cUGfoMCt2=Uaifpqq-p-mMOmW8XmrBn4fQ@mail.gmail.com>
 <20210105153727.GK3040@hirez.programming.kicks-ass.net>
 <bfb1cbe6-a705-469d-c95a-776624817e33@codeaurora.org>
 <0201238b-e716-2a3c-e9ea-d5294ff77525@linux.vnet.ibm.com>
 <X/3VE64nr91WCtuM@hirez.programming.kicks-ass.net>
 <ec912505-ed4d-a45d-2ed4-7586919da4de@linux.vnet.ibm.com>
 <C7D5A74C-25BF-458A-AAD9-61E484B9F225@gmail.com>
 <X/3+6ZnRCNOwhjGT@google.com>
 <2C7AE23B-ACA3-4D55-A907-AF781C5608F0@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <2C7AE23B-ACA3-4D55-A907-AF781C5608F0@gmail.com>
Content-Transfer-Encoding: quoted-printable
List-ID: <linux-mm.kvack.org>

On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
> > On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> >>> On Jan 12, 2021, at 11:02 AM, Laurent Dufour <ldufour@linux.vnet.ibm.com> wrote:
> >>>
> >>> On 12/01/2021 at 17:57, Peter Zijlstra wrote:
> >>>> On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
> >>>>> On 12/01/2021 at 12:43, Vinayak Menon wrote:
> >>>>>> Possibility of race against other PTE modifiers
> >>>>>>
> >>>>>> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
> >>>>>> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
> >>>> Right, that's exactly the kind of thing I was worried about.
> >>>>>> 2) mprotect - change_protection in mprotect which does the deferred flush is
> >>>>>> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults
> >>>>>> on those VMAs.
> >>>> Sure, mprotect also changes vm_flags, so it really needs that anyway.
> >>>>>> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
> >>>>>> But SPF does not take UFFD faults.
> >>>>>> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
> >>>>>> (2) above.
> >>>>>> 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
> >>>> What happened to shared/file-backed stuff? ISTR I had that working.
> >>>
> >>> File-backed mappings are not processed in a speculative way; there were options to manage some of them depending on the underlying file system, but that's still not done.
> >>>
> >>> Shared anonymous mappings are also not yet handled in a speculative way (vm_ops is not null).
> >>>
> >>>>>> Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
> >>>>>> transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
> >>>> The tricky one is demotion, specifically write to non-write.
> >>>>>> I could not see a case where speculative path cannot see a PTE update done via
> >>>>>> a fault on another CPU.
> >>>> One you didn't mention is the NUMA balancing scanning crud; although I
> >>>> think that's fine, losing a PTE update there is harmless. But I've not
> >>>> thought overly hard on it.
> >>>
> >>> That's a good point, I need to double check on that side.
> >>>
> >>>>> You explained it fine. Indeed SPF is handling deferred TLB invalidation by
> >>>>> marking the VMA through vm_write_begin/end(), as for the fork case you
> >>>>> mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE
> >>>>> values read are valid.
> >>>> That should indeed work, but are we really sure we covered them all?
> >>>> Should we invest in better TLBI APIs to make sure we can't get this
> >>>> wrong?
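
[ For readers who have not looked at the SPF series: below is a rough
userspace model of the seqcount idea quoted above, where a writer marks
the VMA while it changes PTEs and a speculative reader bails out if it
overlaps with a writer. This is only a sketch. The names vma_model,
model_write_begin(), model_write_end() and model_spec_read() are made
up for illustration, it ignores the PTL entirely, and it is not the
code behind vm_write_begin()/vm_write_end(). ]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VMA that carries a sequence counter. */
struct vma_model {
	atomic_uint seq;	/* odd while a writer is changing PTEs */
	atomic_uint prot;	/* stand-in for the state being changed */
};

/* Writer side: make the counter odd before touching the state. */
static void model_write_begin(struct vma_model *v)
{
	unsigned int old = atomic_fetch_add(&v->seq, 1);

	assert((old & 1) == 0);	/* this toy model has no nested writers */
}

/* Writer side: make the counter even again once the change is done. */
static void model_write_end(struct vma_model *v)
{
	atomic_fetch_add(&v->seq, 1);
}

/*
 * Speculative reader: succeeds only if no writer was active when it
 * started and the counter did not move while it read the state.
 * Sequentially consistent atomics are used throughout for simplicity.
 */
static bool model_spec_read(struct vma_model *v, unsigned int *out)
{
	unsigned int start = atomic_load(&v->seq);
	unsigned int val;

	if (start & 1)
		return false;	/* writer in progress, take the slow path */

	val = atomic_load(&v->prot);

	if (atomic_load(&v->seq) != start)
		return false;	/* a writer ran concurrently, retry or bail */

	*out = val;
	return true;
}
```

[ A caller would wrap every state change in model_write_begin() and
model_write_end(), and fall back to the non-speculative path whenever
model_spec_read() returns false. ]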
> >>>
> >>> That may be a good option to identify deferred TLB invalidation but I've no clue on what this API would look like.
> >>
> >> I will send an RFC soon for per-table deferred TLB flushes tracking.
> >> The basic idea is to save a generation in the page-struct that tracks
> >> when deferred PTE change took place, and track whenever a TLB flush
> >> completed. In addition, other users - such as mprotect - would use
> >> the tlb_gather interface.
> >>
> >> Unfortunately, due to limited space in page-struct this would only
> >> be possible for 64-bit (and my implementation is only for x86-64).
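
[ One possible reading of the generation idea quoted above, as a
single-threaded userspace toy. The RFC had not been posted at this
point, so everything here is an assumption: the struct and function
names are hypothetical, and the real scheme would need atomics and
memory barriers that this sketch leaves out. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-mm state. */
struct mm_model {
	uint64_t tlb_gen;	/* bumped for every deferred PTE change */
	uint64_t flushed_gen;	/* highest generation covered by a completed flush */
};

/* Hypothetical per-page-table state (what would live in the page-struct). */
struct table_model {
	uint64_t deferred_gen;	/* generation of the last deferred change, 0 if none */
};

/* Record that a PTE in this table was changed without flushing yet. */
static void model_defer_pte_change(struct mm_model *mm, struct table_model *t)
{
	t->deferred_gen = ++mm->tlb_gen;
}

/* Record that a TLB flush covering generations up to 'gen' has completed. */
static void model_flush_done(struct mm_model *mm, uint64_t gen)
{
	assert(gen <= mm->tlb_gen);
	if (gen > mm->flushed_gen)
		mm->flushed_gen = gen;
}

/* A table may still have stale TLB entries if its change is newer than the last flush. */
static bool model_table_flush_pending(const struct mm_model *mm,
				      const struct table_model *t)
{
	return t->deferred_gen > mm->flushed_gen;
}
```

[ In this model a fault handler would only need to flush when
model_table_flush_pending() is true for the table it is looking at,
instead of whenever any flush is pending in the whole mm. ]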
> >
> > I don't want to discourage you but I don't think this would end up
> > well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> > and I've run into problems with this before while trying to do
> > something similar.
>
> Discourage, discourage. Better now than later.
>
> It will be relatively easy to extend the scheme to be per-VMA instead of
> per-table for architectures that prefer it this way. It does require
> TLB-generation tracking though, which Andy only implemented for x86, so I
> will focus on x86-64 right now.
>
> [ For per-VMA it would require an additional cmpxchg, I presume to save the
> last deferred generation though. ]
>
> > I'd recommend per-vma and per-category (unmapping, clearing writable
> > and clearing dirty) tracking, which only rely on arch-independent data
> > structures, i.e., vm_area_struct and mm_struct.
>
> I think that tracking changes on “what was changed” granularity is harder
> and more fragile.
>
> Let me finish trying to clean up the mess first, since fullmm and
> need_flush_all semantics were mixed; there are 3 different flushing schemes
> for mprotect(), munmap() and try_to_unmap(); there are missing memory
> barriers; mprotect() performs TLB flushes even when permissions are
> promoted; etc.
>
> There are also some optimizations that we discussed before, such as on x86 -
> RW->RO does not require a TLB flush if a PTE is not dirty, etc.
>
> I am trying to finish something so you can say how terrible it is, so I will
> not waste too much time. ;-)
>
> >> It would still require to do the copying while holding the PTL though.
> >
> > IMO, this is unacceptable. Most archs don't support per-table PTL, and
> > even x86_64 can be configured to use per-mm PTL. What if we want to
> > support a larger page size in the future?
> >
> > It seems to me the only way to solve the problem with self-explanatory
> > code and without performance impact is to check mm_tlb_flush_pending
> > and the writable bit (and two other cases I mentioned above) at the
> > same time. Of course, this requires a lot of effort to audit the
> > existing uses, clean them up and properly wrap them up with new
> > primitives, BUG_ON all invalid cases and document the exact workflow
> > to prevent misuses.
> >
> > I've mentioned the following before -- it only demonstrates the rough
> > idea.
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 5e9ca612d7d7..af38c5ee327e 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> > 		goto unlock;
> > 	}
> > 	if (vmf->flags & FAULT_FLAG_WRITE) {
> > -		if (!pte_write(entry))
> > +		if (!pte_write(entry)) {
> > +			if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> > +				flush_tlb_page(vmf->vma, vmf->address);
> > 			return do_wp_page(vmf);
> > +		}
> > 		entry = pte_mkdirty(entry);
> > 	}
> > 	entry =3D pte_mkyoung(entry);
>
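
[ To make the control flow of the diff above easier to poke at outside
the kernel, here is a small userspace model of it. It only mirrors the
branch structure: the flush and do_wp_page() are replaced by a counter
and a return code, and mm_model, pte_model and model_handle_pte_fault()
are hypothetical names, not kernel interfaces. ]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-mm state. */
struct mm_model {
	int tlb_flush_pending;	/* > 0 while some CPU has deferred a TLB flush */
	int flushes;		/* number of page flushes issued, for inspection */
};

/* Hypothetical PTE with only the bits this path cares about. */
struct pte_model {
	bool write;
	bool dirty;
	bool young;
};

enum fault_result {
	FAULT_WP,	/* stands in for calling do_wp_page() */
	FAULT_FIXED,	/* access bits updated, fault resolved */
};

static enum fault_result model_handle_pte_fault(struct mm_model *mm,
						struct pte_model *pte,
						bool write_fault)
{
	assert(mm->tlb_flush_pending >= 0);

	if (write_fault) {
		if (!pte->write) {
			/*
			 * Mirrors the added check: with a deferred flush
			 * outstanding, another CPU may still hold a stale
			 * writable TLB entry, so flush before the WP path.
			 */
			if (mm->tlb_flush_pending > 0)
				mm->flushes++;	/* stands in for flush_tlb_page() */
			return FAULT_WP;
		}
		pte->dirty = true;
	}
	pte->young = true;
	return FAULT_FIXED;
}
```

[ The point of the model is that the extra flush is only paid on write
faults against non-writable PTEs while a deferred flush is pending; all
other faults take the same path as before. ]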
> I understand. This might be required, regardless of the deferred flushes
> detection scheme. If we assume that no write-unprotect requires a COW (which
> should be true in this case, since we take a reference on the page), your
> proposal should be sufficient.
>
> Still, I think that there are many unnecessary TLB flushes right now,
> and others that might be missed due to the overly complicated invalidation
> schemes.
>
> Regardless, as Andrea pointed out, this requires first to figure out the
> semantics of mprotect() and friends when pages are pinned.

Thanks, I appreciate your effort. I'd be glad to review whatever you
come up with.