From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: peterx@redhat.com, Nadav Amit, Hugh Dickins, David Hildenbrand,
    Axel Rasmussen, Matthew Wilcox, Alistair Popple, Mike Rapoport,
    Andrew Morton, Jerome Glisse, Mike Kravetz, "Kirill A. Shutemov",
    Andrea Arcangeli
Subject: [PATCH v7 08/23] mm/shmem: Allow uffd wr-protect none pte for file-backed mem
Date: Fri, 4 Mar 2022 13:16:53 +0800
Message-Id: <20220304051708.86193-9-peterx@redhat.com>
In-Reply-To: <20220304051708.86193-1-peterx@redhat.com>
References: <20220304051708.86193-1-peterx@redhat.com>

File-backed memory differs from anonymous memory in that even if the pte
is missing, the data could still reside either in the file or in the
page/swap cache.  So when wr-protecting a pte, we need to consider none
ptes too.

We do that by installing the uffd-wp pte markers when necessary.  Then,
when there is a future write to the pte, the fault handler will take the
special path to first fault in the page read-only, and then report to the
userfaultfd server with a wr-protect message.

On the other hand, when unprotecting a page, it is also possible that the
pte got unmapped but replaced by the special uffd-wp marker.  In that case
we need to be able to turn the uffd-wp pte marker back into a none pte, so
that the next access to the page faults in as usual.

Special care needs to be taken throughout the change_protection_range()
process.  Since we now allow userspace to wr-protect a none pte, we need
to be able to pre-populate the page table entries when we see an
(!anonymous && MM_CP_UFFD_WP) request; otherwise change_protection_range()
will simply skip any pgtable entry that does not exist.

For example, the pgtable can be missing for a whole 2M pmd chunk while the
page cache exists for that 2M range.  When we want to wr-protect one 4K
page within that pmd range, we need to pre-populate the pgtable and
install the pte marker, so that we get a message (and the writing thread
blocks) when the page cache of that 4K page is written to.  Without
pre-populating the pmd, change_protection() would simply skip the whole
pmd.

Note that this patch only covers small pages (the pte level); it does not
cover transparent huge pages yet.  That will be done in a later patch, for
which this one is also a preparation.
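For illustration only (this sketch is not part of the patch; error
handling is omitted and the memfd name and sizes are arbitrary), a minimal
userspace sequence that exercises the new behaviour: register a shmem
mapping with userfaultfd in wr-protect mode, then wr-protect a
never-faulted 4K page, which with this change installs the pte marker
instead of being silently skipped.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int main(void)
  {
          size_t len = 2UL << 20;                 /* one 2M pmd worth of shmem */
          long psize = sysconf(_SC_PAGESIZE);
          int memfd = memfd_create("uffd-wp-test", 0);
          char *map;
          int uffd;

          ftruncate(memfd, len);
          map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);

          uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
          struct uffdio_api api = { .api = UFFD_API };
          ioctl(uffd, UFFDIO_API, &api);

          struct uffdio_register reg = {
                  .range = { .start = (unsigned long)map, .len = len },
                  .mode = UFFDIO_REGISTER_MODE_WP,
          };
          ioctl(uffd, UFFDIO_REGISTER, &reg);

          /*
           * The first 4K page has never been faulted in, so its pte is
           * none.  With this patch the request below installs
           * PTE_MARKER_UFFD_WP instead of being skipped by
           * change_protection().
           */
          struct uffdio_writeprotect wp = {
                  .range = { .start = (unsigned long)map, .len = psize },
                  .mode = UFFDIO_WRITEPROTECT_MODE_WP,
          };
          ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

          /*
           * A later write to map[0] will now first fault the page in
           * read-only and then deliver a UFFD_PAGEFAULT_FLAG_WP event on
           * uffd instead of completing silently.
           */
          return 0;
  }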
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mprotect.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 62 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6d179c720089..4878b6b99df9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -184,8 +185,16 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				newpte = pte_swp_mksoft_dirty(newpte);
 			if (pte_swp_uffd_wp(oldpte))
 				newpte = pte_swp_mkuffd_wp(newpte);
-		} else if (is_pte_marker_entry(entry)) {
-			/* Skip it, the same as none pte */
+		} else if (pte_marker_entry_uffd_wp(entry)) {
+			/*
+			 * If this is a uffd-wp pte marker and we'd
+			 * like to unprotect it, drop it; the next page
+			 * fault will trigger without uffd trapping.
+			 */
+			if (uffd_wp_resolve) {
+				pte_clear(vma->vm_mm, addr, pte);
+				pages++;
+			}
 			continue;
 		} else {
 			newpte = oldpte;
@@ -200,6 +209,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				set_pte_at(vma->vm_mm, addr, pte, newpte);
 				pages++;
 			}
+		} else {
+			/* It must be a none pte, or what else?.. */
+			WARN_ON_ONCE(!pte_none(oldpte));
+			if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
+				/*
+				 * For file-backed mem, we need to be able
+				 * to wr-protect a none pte, because even if
+				 * the pte is none, the page/swap cache could
+				 * exist.  Do that by installing a marker.
+				 */
+				set_pte_at(vma->vm_mm, addr, pte,
+					   make_pte_marker(PTE_MARKER_UFFD_WP));
+				pages++;
+			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
@@ -233,6 +256,39 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 	return 0;
 }
 
+/* Return true if we're uffd wr-protecting file-backed memory, or false */
+static inline bool
+uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
+{
+	return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma);
+}
+
+/*
+ * If wr-protecting the range for file-backed memory, populate the pgtable
+ * for the case when the pgtable is empty but the page cache exists.  When
+ * {pte|pmd|...}_alloc() fails it means we are out of memory; stop there.
+ */
+#define change_pmd_prepare(vma, pmd, cp_flags)				\
+	do {								\
+		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+			if (WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd)))	\
+				break;					\
+		}							\
+	} while (0)
+/*
+ * This is the generic pud/p4d/pgd version of change_pmd_prepare().  We need
+ * a separate change_pmd_prepare() because pte_alloc() returns 0 on success,
+ * while {pmd|pud|p4d}_alloc() returns a valid pointer on success.
+ */
+#define change_prepare(vma, high, low, addr, cp_flags)			\
+	do {								\
+		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+			low##_t *p = low##_alloc(vma->vm_mm, high, addr); \
+			if (WARN_ON_ONCE(p == NULL))			\
+				break;					\
+		}							\
+	} while (0)
+
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
@@ -251,6 +307,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 		next = pmd_addr_end(addr, end);
 
+		change_pmd_prepare(vma, pmd, cp_flags);
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
 		 * held for read.  It's possible a parallel update to occur
@@ -316,6 +373,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
+		change_prepare(vma, pud, pmd, addr, cp_flags);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -336,6 +394,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 	p4d = p4d_offset(pgd, addr);
 	do {
 		next = p4d_addr_end(addr, end);
+		change_prepare(vma, p4d, pud, addr, cp_flags);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -361,6 +420,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	inc_tlb_flush_pending(mm);
 	do {
 		next = pgd_addr_end(addr, end);
+		change_prepare(vma, pgd, p4d, addr, cp_flags);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
-- 
2.32.0
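As a side note (illustration only, not part of the patch): with the macros
above, the pud-level call change_prepare(vma, pud, pmd, addr, cp_flags)
expands to roughly the following, which makes the pte_alloc()-vs-pointer
asymmetry mentioned in the comment concrete:

  /* Approximate expansion of change_prepare(vma, pud, pmd, addr, cp_flags) */
  do {
          if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {
                  /* low##_alloc -> pmd_alloc(): returns a pointer, NULL on failure */
                  pmd_t *p = pmd_alloc(vma->vm_mm, pud, addr);
                  if (WARN_ON_ONCE(p == NULL))
                          break;          /* out of memory: give up on this range */
          }
  } while (0);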