Date: Thu, 23 Mar 2023 18:11:55 -0400
From: Peter Xu <peterx@redhat.com>
To: Muhammad Usama Anjum
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Mike Rapoport, Nadav Amit, linux-stable, David Hildenbrand
Subject: Re: [PATCH] mm/hugetlb: Fix uffd wr-protection for CoW optimization path
References: <20230321191840.1897940-1-peterx@redhat.com> <44aae7fc-fb1f-b38e-bc17-504abf054e3f@redhat.com>
On Thu, Mar 23, 2023 at 08:33:07PM +0500, Muhammad Usama Anjum wrote:
> Hi Peter,
>
> Sorry for late reply.
>
> On 3/22/23 12:50 AM, Peter Xu wrote:
> > On Tue, Mar 21, 2023 at 08:36:35PM +0100, David Hildenbrand wrote:
> >> On 21.03.23 20:18, Peter Xu wrote:
> >>> This patch fixes an issue that a hugetlb uffd-wr-protected mapping can be
> >>> writable even with uffd-wp bit set.  It only happens with all these
> >>> conditions met: (1) hugetlb memory (2) private mapping (3) original mapping
> >>> was missing, then (4) being wr-protected (IOW, pte marker installed).  Then
> >>> write to the page to trigger.
> >>>
> >>> Userfaultfd-wp trap for hugetlb was implemented in hugetlb_fault() before
> >>> even reaching hugetlb_wp() to avoid taking more locks that userfault won't
> >>> need.  However there's one CoW optimization path for missing hugetlb page
> >>> that can trigger hugetlb_wp() inside hugetlb_no_page(), that can bypass the
> >>> userfaultfd-wp traps.
> >>>
> >>> A few ways to resolve this:
> >>>
> >>> (1) Skip the CoW optimization for hugetlb private mapping, considering
> >>> that private mappings for hugetlb should be very rare, so it may not
> >>> really be helpful to major workloads.  The worst case is we only skip the
> >>> optimization if userfaultfd_wp(vma)==true, because uffd-wp needs another
> >>> fault anyway.
> >>>
> >>> (2) Move the userfaultfd-wp handling for hugetlb from hugetlb_fault()
> >>> into hugetlb_wp().  The major cons is there're a bunch of locks taken
> >>> when calling hugetlb_wp(), and that will make the changeset unnecessarily
> >>> complicated due to the lock operations.
> >>>
> >>> (3) Carry over uffd-wp bit in hugetlb_wp(), so it'll need to fault again
> >>> for uffd-wp privately mapped pages.
> >>>
> >>> This patch chose option (3) which contains the minimum changeset (simplest
> >>> for backport) and also make sure hugetlb_wp() itself will start to be
> >>> always safe with uffd-wp ptes even if called elsewhere in the future.
> >>>
> >>> This patch will be needed for v5.19+ hence copy stable.
> >>>
> >>> Reported-by: Muhammad Usama Anjum
> >>> Cc: linux-stable
> >>> Fixes: 166f3ecc0daf ("mm/hugetlb: hook page faults for uffd write protection")
> >>> Signed-off-by: Peter Xu
> >>> ---
> >>>  mm/hugetlb.c | 8 +++++---
> >>>  1 file changed, 5 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>> index 8bfd07f4c143..22337b191eae 100644
> >>> --- a/mm/hugetlb.c
> >>> +++ b/mm/hugetlb.c
> >>> @@ -5478,7 +5478,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
> >>>  		       struct folio *pagecache_folio, spinlock_t *ptl)
> >>>  {
> >>>  	const bool unshare = flags & FAULT_FLAG_UNSHARE;
> >>> -	pte_t pte;
> >>> +	pte_t pte, newpte;
> >>>  	struct hstate *h = hstate_vma(vma);
> >>>  	struct page *old_page;
> >>>  	struct folio *new_folio;
> >>> @@ -5622,8 +5622,10 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
> >>>  		mmu_notifier_invalidate_range(mm, range.start, range.end);
> >>>  		page_remove_rmap(old_page, vma, true);
> >>>  		hugepage_add_new_anon_rmap(new_folio, vma, haddr);
> >>> -		set_huge_pte_at(mm, haddr, ptep,
> >>> -				make_huge_pte(vma, &new_folio->page, !unshare));
> >>> +		newpte = make_huge_pte(vma, &new_folio->page, !unshare);
> >>> +		if (huge_pte_uffd_wp(pte))
> >>> +			newpte = huge_pte_mkuffd_wp(newpte);
> >>> +		set_huge_pte_at(mm, haddr, ptep, newpte);
> >>>  		folio_set_hugetlb_migratable(new_folio);
> >>>  		/* Make the old page be freed below */
> >>>  		new_folio = page_folio(old_page);
> >>
> >> Looks correct to me. Do we have a reproducer?
> >
> > I used a reproducer for the async mode I wrote (patch 2 attached, need to
> > change to VM_PRIVATE):
> >
> > https://lore.kernel.org/all/ZBNr4nohj%2FTw4Zhw@x1n/
> >
> > I don't think kernel kselftest can trigger it because we don't do strict
> > checks yet with uffd-wp bits.  I've already started looking into cleanup
> > the test cases and I do plan to add new tests to cover this.
> >
> > Meanwhile, let's also wait for an ack from Muhammad.  Even though the async
> > mode is not part of the code base, it'll be a good test for verifying every
> > single uffd-wp bit being set or cleared as expected.
>
> I've tested by applying this patch, but the bug is still there.  Just like
> Peter has mentioned, we are using our in-progress patches related to the
> pagemap_scan ioctl and userfaultfd wp async patches to reproduce it.
>
> To reproduce, please build the kernel and run the pagemap_ioctl test in mm
> from the hugetlb_mem_reproducer branch:
> https://gitlab.collabora.com/usama.anjum/linux-mainline/-/tree/hugetlb_mem_reproducer
>
> In case you have any question on how to reproduce, please let me know.
> I'll try to provide a cleaner alternative.

Hmm, I think my current fix is incomplete, if not wrong.  The root cause
should still be valid; however, I overlooked another path:

	if (page_mapcount(old_page) == 1 && PageAnon(old_page)) {
		if (!PageAnonExclusive(old_page))
			page_move_anon_rmap(old_page, vma);
		if (likely(!unshare))
			set_huge_ptep_writable(vma, haddr, ptep);

		delayacct_wpcopy_end();
		return 0;
	}

We should bail out early in this path too.  It'll be even easier if we
always bail out of hugetlb_wp() as soon as uffd-wp is detected, because
userfault should always be handled before any decision to CoW.

v2 attached.  Please give it another shot.

Thanks,

-- 
Peter Xu

[-- Attachment: 0001-mm-hugetlb-Fix-uffd-wr-protection-for-CoW-optimizati.patch --]

From 4a294f9ec5d2ba94a6a7ecf03bd096ea35902f2f Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 21 Mar 2023 14:58:42 -0400
Subject: [PATCH v2] mm/hugetlb: Fix uffd wr-protection for CoW optimization
 path

This patch fixes an issue that a hugetlb uffd-wr-protected mapping can be
writable even with uffd-wp bit set.
It only happens with hugetlb private mappings, when someone first
wr-protects a missing pte (which will install a pte marker), followed by a
write to the page.  That will trigger a missing fault and an optimized CoW
in the same fault stack.

Userfaultfd-wp trap for hugetlb was implemented in hugetlb_fault() before
even reaching hugetlb_wp() to avoid taking more locks that userfault won't
need.  However there's one CoW optimization path for missing hugetlb pages
that can trigger hugetlb_wp() inside hugetlb_no_page(), and that can bypass
the userfaultfd-wp traps.

A few ways to resolve this:

(1) Skip the CoW optimization for hugetlb private mappings, considering
that private mappings for hugetlb should be very rare, so it may not
really be helpful to major workloads.  The worst case is we only skip the
optimization if userfaultfd_wp(vma)==true, because uffd-wp needs another
fault anyway.

(2) Move the userfaultfd-wp handling for hugetlb from hugetlb_fault()
into hugetlb_wp().  The major cons is there're a bunch of locks taken
when calling hugetlb_wp(), and that will make the changeset unnecessarily
complicated due to the lock operations.

(3) Skip hugetlb_wp() while the uffd-wp bit is still set.  It means it
requires another hugetlb_fault() to resolve the uffd-wp bit first.

This patch chose option (3), which contains the minimum changeset
(simplest for backport) and also makes sure hugetlb_wp() itself will start
to be always safe with uffd-wp ptes even if called elsewhere in the future.

This patch will be needed for v5.19+ hence copy stable.
Reported-by: Muhammad Usama Anjum
Cc: linux-stable
Fixes: 166f3ecc0daf ("mm/hugetlb: hook page faults for uffd write protection")
Signed-off-by: Peter Xu
---
 mm/hugetlb.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8bfd07f4c143..b60959f2a3f0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5478,7 +5478,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		       struct folio *pagecache_folio, spinlock_t *ptl)
 {
 	const bool unshare = flags & FAULT_FLAG_UNSHARE;
-	pte_t pte;
+	pte_t pte = huge_ptep_get(ptep);
 	struct hstate *h = hstate_vma(vma);
 	struct page *old_page;
 	struct folio *new_folio;
@@ -5487,6 +5487,17 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long haddr = address & huge_page_mask(h);
 	struct mmu_notifier_range range;
 
+	/*
+	 * Never handle CoW for uffd-wp protected pages.  It should be only
+	 * handled when the uffd-wp protection is removed.
+	 *
+	 * Note that only the CoW optimization path can trigger this and
+	 * got skipped, because hugetlb_fault() will always resolve uffd-wp
+	 * bit first.
+	 */
+	if (huge_pte_uffd_wp(pte))
+		return 0;
+
 	/*
 	 * hugetlb does not support FOLL_FORCE-style write faults that keep the
 	 * PTE mapped R/O such as maybe_mkwrite() would do.
@@ -5500,7 +5511,6 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
-	pte = huge_ptep_get(ptep);
 	old_page = pte_page(pte);
 
 	delayacct_wpcopy_start();
-- 
2.39.1