Date: Fri, 11 Sep 2020 21:03:06 +0200
From: Vasily Gorbik
To: Jason Gunthorpe, John Hubbard, Linus Torvalds
Cc: Gerald Schaefer, Alexander Gordeev, Peter Zijlstra, Dave Hansen, LKML,
 linux-mm, linux-arch, Andrew Morton, Russell King, Mike Rapoport,
 Catalin Marinas, Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
 Paul Mackerras, Jeff Dike, Richard Weinberger, Dave Hansen, Andy Lutomirski,
 Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
 Andrey Ryabinin, linux-x86, linux-arm, linux-power, linux-sparc, linux-um,
 linux-s390, Heiko Carstens, Christian Borntraeger, Claudio Imbrenda
Subject: [PATCH] mm/gup: fix gup_fast with dynamic page table folding
Message-ID:
References: <20200911070939.GB1362448@hirez.programming.kicks-ass.net>
In-Reply-To: <20200911070939.GB1362448@hirez.programming.kicks-ass.net>
Currently, to make sure that every page table entry is read just once,
the gup_fast walks perform READ_ONCE and pass the pXd value down to the
next gup_pXd_range function by value, e.g.:

static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
			 unsigned int flags, struct page **pages, int *nr)
...
	pudp = pud_offset(&p4d, addr);

This function passes a reference to that local value copy to pXd_offset,
and might get the very same pointer back in return. This happens when
the level is folded (as on most architectures), and such a pointer must
not be iterated.

On s390 the logic is more complex: each task might use 5-, 4- or 3-level
address translation and hence have different levels folded, so a
non-iterable pointer to a local copy leads to severe problems.
Here is an example of what happens with gup_fast on s390, for a task
with 3-level paging, crossing a 2 GB pud boundary:

// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
			 unsigned int flags, struct page **pages, int *nr)
{
	unsigned long next;
	pud_t *pudp;

	// pud_offset returns &p4d itself (a pointer to a value on the stack)
	pudp = pud_offset(&p4d, addr);
	do {
		// on the second iteration this reads a "random" stack value
		pud_t pud = READ_ONCE(*pudp);

		// next = 0x10080000000, since PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
		next = pud_addr_end(addr, end);
		...
	} while (pudp++, addr = next, addr != end); // pudp++ iterating over the stack

	return 1;
}

This happens since s390 moved to common gup code with commit
d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and
commit 1a42010cdc26 ("s390/mm: convert to the generic
get_user_pages_fast code"). s390 tried to mimic static level folding by
changing the pXd_offset primitives to always calculate the top-level
page table offset in pgd_offset, and to just return the passed pointer
when pXd_offset has to act as folded.

What is crucial for gup_fast, and what has been overlooked, is that
PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
And the latter is not possible with dynamic folding.

To fix the issue, in addition to the pXd values, pass the original pXdp
pointers down to the gup_pXd_range functions, and introduce
pXd_offset_lockless helpers, which take an additional pXd entry value
parameter.
This has already been discussed in
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1

Cc: # 5.2+
Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Reviewed-by: Gerald Schaefer
Reviewed-by: Alexander Gordeev
Signed-off-by: Vasily Gorbik
---
 arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
 include/linux/pgtable.h         | 10 ++++++++
 mm/gup.c                        | 18 +++++++-------
 3 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..b55561cc8786 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
 
 #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
 
-static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
 {
-	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
-		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
-	return (p4d_t *) pgd;
+	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
+		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
+	return (p4d_t *) pgdp;
 }
+#define p4d_offset_lockless p4d_offset_lockless
 
-static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
 {
-	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
-		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
-	return (pud_t *) p4d;
+	return p4d_offset_lockless(pgdp, *pgdp, address);
+}
+
+static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
+{
+	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
+		return (pud_t *) p4d_deref(p4d) + pud_index(address);
+	return (pud_t *) p4dp;
+}
+#define pud_offset_lockless pud_offset_lockless
+
+static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
+{
+	return pud_offset_lockless(p4dp, *p4dp, address);
 }
 #define pud_offset pud_offset
 
-static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
+static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
+{
+	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
+		return (pmd_t *) pud_deref(pud) + pmd_index(address);
+	return (pmd_t *) pudp;
+}
+#define pmd_offset_lockless pmd_offset_lockless
+
+static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
 {
-	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
-		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
-	return (pmd_t *) pud;
+	return pmd_offset_lockless(pudp, *pudp, address);
 }
 #define pmd_offset pmd_offset
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..e899d3506671 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
 #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
 #endif
 
+#ifndef p4d_offset_lockless
+#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&pgd, address)
+#endif
+#ifndef pud_offset_lockless
+#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&p4d, address)
+#endif
+#ifndef pmd_offset_lockless
+#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&pud, address)
+#endif
+
 /*
  * p?d_leaf() - true if this entry is a final mapping to a physical address.
  * This differs from p?d_huge() by the fact that they are always available (if
diff --git a/mm/gup.c b/mm/gup.c
index e5739a1974d5..578bf5bd8bf8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	return 1;
 }
 
-static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pmd_t *pmdp;
 
-	pmdp = pmd_offset(&pud, addr);
+	pmdp = pmd_offset_lockless(pudp, pud, addr);
 	do {
 		pmd_t pmd = READ_ONCE(*pmdp);
 
@@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	return 1;
 }
 
-static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
+static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pud_t *pudp;
 
-	pudp = pud_offset(&p4d, addr);
+	pudp = pud_offset_lockless(p4dp, p4d, addr);
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
@@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
 					 PUD_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
+		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
 			return 0;
 	} while (pudp++, addr = next, addr != end);
 
 	return 1;
 }
 
-static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
 	unsigned long next;
 	p4d_t *p4dp;
 
-	p4dp = p4d_offset(&pgd, addr);
+	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
 	do {
 		p4d_t p4d = READ_ONCE(*p4dp);
 
@@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
 					 P4D_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
+		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
 			return 0;
 	} while (p4dp++, addr = next, addr != end);
 
@@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 		if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
 				 PGDIR_SHIFT, next, flags, pages, nr))
 			return;
-		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
+		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
 			return;
 	} while (pgdp++, addr = next, addr != end);
 }
-- 
⣿⣿⣿⣿⢋⡀⣀⠹⣿⣿⣿⣿
⣿⣿⣿⣿⠠⣶⡦⠀⣿⣿⣿⣿
⣿⣿⣿⠏⣴⣮⣴⣧⠈⢿⣿⣿
⣿⣿⡏⢰⣿⠖⣠⣿⡆⠈⣿⣿
⣿⢛⣵⣄⠙⣶⣶⡟⣅⣠⠹⣿
⣿⣜⣛⠻⢎⣉⣉⣀⠿⣫⣵⣿