From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.6 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8E79CCA9EAF for ; Fri, 25 Oct 2019 03:14:14 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 0DD6020867 for ; Fri, 25 Oct 2019 03:14:13 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="uWp25hDn" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 0DD6020867 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 73FBF6B0003; Thu, 24 Oct 2019 23:14:13 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 6F05D6B0006; Thu, 24 Oct 2019 23:14:13 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5B76B6B0007; Thu, 24 Oct 2019 23:14:13 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0189.hostedemail.com [216.40.44.189]) by kanga.kvack.org (Postfix) with ESMTP id 28F7F6B0003 for ; Thu, 24 Oct 2019 23:14:13 -0400 (EDT) Received: from smtpin25.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with SMTP id B82718249980 for ; Fri, 25 Oct 2019 03:14:12 +0000 (UTC) X-FDA: 76080838344.25.wren46_24abdc795183e X-HE-Tag: wren46_24abdc795183e X-Filterd-Recvd-Size: 13717 Received: from mail-lf1-f65.google.com (mail-lf1-f65.google.com [209.85.167.65]) by imf49.hostedemail.com (Postfix) with ESMTP for ; Fri, 25 Oct 2019 03:14:11 +0000 (UTC) Received: by mail-lf1-f65.google.com with SMTP id g3so438542lfb.11 for ; Thu, 24 Oct 2019 20:14:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=lZ1Us2d8fBQL28EAatJUcUl206Vy5KcFevyjqCQWRFU=; b=uWp25hDnmCPJQbFPRzpZwTY6meu3gGDHJ0xuCSMkzTi1HI7c8ns7hPiKEibfT/kxuX RtE4OcrE2g4OuGdjyWKpIecwVeMVMSbmf9q4qwDHBT0QHRfX/kw4NgH2puwLpipY1V55 9LahlKzprJD9xuGyU0S6NDEO7n0GHYvrrLLkOb/HkMGKlKZEBc6PhwjjWyxAldDvveXn KWxLKpMiGDpMp80hFnfvCCjMzCGAT2P7K0dgnl8l1ZJkbOVUO2f4QV0dyyVwaCjXQJ0S 7l8RY22QkV2Bq2o89Owy07xMNyb770tJegJZ0WKDk4KOyoU7/SZ/ktL0v56598k/Lw8H jdXw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=lZ1Us2d8fBQL28EAatJUcUl206Vy5KcFevyjqCQWRFU=; b=bqevXQOVYT1TZb2IL6Ut+DsXOGr/4ntrcbgQ9B6JLxNGbgMLCdHNLh1EAHMha/RadB KYPE7LJz16m9Io8GJlinS3+OzjwR8hSASrxycq5v1a8k0o4Ep4zMqUnHsbzLDZq9QaaY pSSX9wT8ovxNqofjZq7acNXE9hrKgtEmVyGgySt9dNOReubF6Npm0gEKG4kJniuvCbke HdjxIv0qo7ovgeu+xucbI6Q6Sq4okkiY9ge2/tpnHFsFkxjQopZjCJDr1k/olvbbhwIG KfNzK2oYk0nRJVY8vaEcnUdDiQ+Ya+fPSeHs2HeJZD4YBsmVEd9nEvKdiXuNy/wjEq8r lZdw== X-Gm-Message-State: APjAAAXHdmW+eneDyudxq0dNuGB3QeNoUZP/xeoRLTaXhS4vc0BfUfhe y2fucTiCzwvlZiZ6kdW6zIQhejBX2JRUCyilrhk= X-Google-Smtp-Source: APXvYqxaub15caCsjtAjPlhf0wXxYZnUZkI9ps3uXUjsNRTrjw2p7LPf7zF1owe5K63u4gz9efGm1fU8buDEmYlI6Hw= X-Received: by 2002:a19:ac04:: with SMTP id g4mr862462lfc.63.1571973250294; Thu, 24 Oct 2019 20:14:10 -0700 (PDT) MIME-Version: 1.0 References: <20180926031858.9692-1-aneesh.kumar@linux.ibm.com> In-Reply-To: <20180926031858.9692-1-aneesh.kumar@linux.ibm.com> From: "Figo.zhang" Date: Fri, 25 Oct 2019 11:13:58 +0800 Message-ID: Subject: Re: [PATCH V2] mm: Recheck page table entry with page table lock held To: "Aneesh Kumar K.V" Cc: Andrew Morton , "Kirill A . Shutemov" , Linux MM , LKML Content-Type: multipart/alternative; boundary="0000000000002e2ab50595b38bde" X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: --0000000000002e2ab50595b38bde Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Aneesh Kumar K.V =E4=BA=8E2018=E5=B9=B49=E6=9C= =8826=E6=97=A5=E5=91=A8=E4=B8=89 =E4=B8=8A=E5=8D=8811:19=E5=86=99=E9=81=93= =EF=BC=9A > We clear the pte temporarily during read/modify/write update of the pte. > If we > take a page fault while the pte is cleared, the application can get > SIGBUS. One > such case is with remap_pfn_range without a backing vm_ops->fault callbac= k. > do_fault will return SIGBUS in that case. > what is " remap_pfn_range without a backing vm_ops->fault callback ", would you like elaborate the scenario? is it the case using remap_pfn_range() in drivers mmap() file operations? if in that case, why it will trap into do_fault? > > cpu 0 cpu1 > mprotect() > ptep_modify_prot_start()/pte cleared. > . > . page fault. > . > . > prep_modify_prot_commit() i am confusing this scenario, when CPU0 will call in change_pte_range()->ptep_modify_prot_start() to clear the pte content, and on the other thread, in handle_pte_fault(), pte_offset_map() can get the pte, and the pte is not invalid, it's pte is valid but just the content is all zero, so why it will call into do_fault? in handle_pte_fault(): vmf->pte =3D pte_offset_map(vmf->pmd, vmf->address); if (!vmf->pte) { return do_fault(vmf); } > > > Fix this by taking page table lock and rechecking for pte_none. > > Signed-off-by: Aneesh Kumar K.V > --- > V1: > * update commit message. > > mm/memory.c | 31 +++++++++++++++++++++++++++---- > 1 file changed, 27 insertions(+), 4 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index c467102a5cbc..c2f933184303 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3745,10 +3745,33 @@ static vm_fault_t do_fault(struct vm_fault *vmf) > struct vm_area_struct *vma =3D vmf->vma; > vm_fault_t ret; > > - /* The VMA was not fully populated on mmap() or missing > VM_DONTEXPAND */ > - if (!vma->vm_ops->fault) > - ret =3D VM_FAULT_SIGBUS; > - else if (!(vmf->flags & FAULT_FLAG_WRITE)) > + /* > + * The VMA was not fully populated on mmap() or missing > VM_DONTEXPAND > + */ > + if (!vma->vm_ops->fault) { > + > + /* > + * pmd entries won't be marked none during a R/M/W cycle. > + */ > + if (unlikely(pmd_none(*vmf->pmd))) > + ret =3D VM_FAULT_SIGBUS; > + else { > + vmf->ptl =3D pte_lockptr(vmf->vma->vm_mm, vmf->pm= d); > + /* > + * Make sure this is not a temporary clearing of > pte > + * by holding ptl and checking again. A R/M/W > update > + * of pte involves: take ptl, clearing the pte so > that > + * we don't have concurrent modification by > hardware > + * followed by an update. > + */ > + spin_lock(vmf->ptl); > + if (unlikely(pte_none(*vmf->pte))) > + ret =3D VM_FAULT_SIGBUS; > + else > + ret =3D VM_FAULT_NOPAGE; > + spin_unlock(vmf->ptl); > + } > + } else if (!(vmf->flags & FAULT_FLAG_WRITE)) > ret =3D do_read_fault(vmf); > else if (!(vma->vm_flags & VM_SHARED)) > ret =3D do_cow_fault(vmf); > -- > 2.17.1 > > --0000000000002e2ab50595b38bde Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


=
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> =E4=BA=8E2= 018=E5=B9=B49=E6=9C=8826=E6=97=A5=E5=91=A8=E4=B8=89 =E4=B8=8A=E5=8D=8811:19= =E5=86=99=E9=81=93=EF=BC=9A
We clear the pte temporarily during read/modify/write update of= the pte. If we
take a page fault while the pte is cleared, the application can get SIGBUS.= One
such case is with remap_pfn_range without a backing vm_ops->fault callba= ck.
do_fault will return SIGBUS in that case.
what is &quo= t; remap_pfn_range without a backing vm_ops->fault callback ", would you like=C2=A0 elaborate the scenario?=C2=A0
=C2=A0= is it the case using remap_pfn_range()=C2=A0 in drivers mmap() file operati= ons?
if in that case, why it will trap into do_fault?

cpu 0=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0cpu1
mprotect()
ptep_modify_prot_start()/pte cleared.
.
.=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0page fault.
.
.
prep_modify_prot_commit()

=C2=A0 i am confu= sing this=C2=A0 scenario, when CPU0 will call in=C2=A0change_pte_range()-&g= t;ptep_modify_prot_start() to clear the pte content, and=C2=A0
on= the other thread, in=C2=A0handle_pte_fault(),=C2=A0pte_offset_map() can ge= t the pte, and the pte is not invalid, it's pte is valid but just the c= ontent is all zero, so why it will call into do_fault?

=
in=C2=A0 handle_pte_fault():=C2=A0
=C2=A0 =C2=A0 vmf->pte =3D pte_offse= t_map(vmf->pmd, vmf->address);
=C2=A0 =C2=A0 if (!vmf-&= gt;pte) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 return do_fault(vmf)= ;
=C2=A0 =C2=A0 }
=C2=A0
=C2=A0

=C2=A0
Fix this by taking page table lock and rechecking for pte_none.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
V1:
* update commit message.

=C2=A0mm/memory.c | 31 +++++++++++++++++++++++++++----
=C2=A01 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c467102a5cbc..c2f933184303 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3745,10 +3745,33 @@ static vm_fault_t do_fault(struct vm_fault *vmf) =C2=A0 =C2=A0 =C2=A0 =C2=A0 struct vm_area_struct *vma =3D vmf->vma;
=C2=A0 =C2=A0 =C2=A0 =C2=A0 vm_fault_t ret;

-=C2=A0 =C2=A0 =C2=A0 =C2=A0/* The VMA was not fully populated on mmap() or= missing VM_DONTEXPAND */
-=C2=A0 =C2=A0 =C2=A0 =C2=A0if (!vma->vm_ops->fault)
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ret =3D VM_FAULT_SI= GBUS;
-=C2=A0 =C2=A0 =C2=A0 =C2=A0else if (!(vmf->flags & FAULT_FLAG_WRITE= ))
+=C2=A0 =C2=A0 =C2=A0 =C2=A0/*
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 * The VMA was not fully populated on mmap() or= missing VM_DONTEXPAND
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 */
+=C2=A0 =C2=A0 =C2=A0 =C2=A0if (!vma->vm_ops->fault) {
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0/*
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 * pmd entries won&= #39;t be marked none during a R/M/W cycle.
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 */
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (unlikely(pmd_no= ne(*vmf->pmd)))
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0ret =3D VM_FAULT_SIGBUS;
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0else {
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0vmf->ptl =3D pte_lockptr(vmf->vma->vm_mm, vmf->pmd);<= br> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0/*
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 * Make sure this is not a temporary clearing of pte
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 * by holding ptl and checking again. A R/M/W update
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 * of pte involves: take ptl, clearing the pte so that
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 * we don't have concurrent modification by hardware
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 * followed by an update.
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 */
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0spin_lock(vmf->ptl);
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0if (unlikely(pte_none(*vmf->pte)))
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ret =3D VM_FAULT_SIGBUS;
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0else
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ret =3D VM_FAULT_NOPAGE;
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0spin_unlock(vmf->ptl);
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}
+=C2=A0 =C2=A0 =C2=A0 =C2=A0} else if (!(vmf->flags & FAULT_FLAG_WRI= TE))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ret =3D do_read_fau= lt(vmf);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 else if (!(vma->vm_flags & VM_SHARED)) =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ret =3D do_cow_faul= t(vmf);
--
2.17.1

--0000000000002e2ab50595b38bde--