From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: peterx@redhat.com, Nadav Amit, Hugh Dickins, David Hildenbrand,
 Axel Rasmussen, Matthew Wilcox, Alistair Popple, Mike Rapoport,
 Andrew Morton, Jerome Glisse, Mike Kravetz, "Kirill A . Shutemov",
 Andrea Arcangeli
Subject: [PATCH v7 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler
Date: Fri, 4 Mar 2022 13:16:51 +0800
Message-Id: <20220304051708.86193-7-peterx@redhat.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <20220304051708.86193-1-peterx@redhat.com>
References: <20220304051708.86193-1-peterx@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"

File-backed memories are prone to unmap/swap, so their ptes are always unstable: the pages can easily be faulted back later through the page cache. This could lead to the uffd-wp bit getting lost when such memory is unmapped or swapped out. One example is shmem. PTE markers are needed to store that information.

This patch prepares for that by adding the handling of uffd-wp pte markers first, before the markers are installed elsewhere in the series, so that the page fault handler can recognize them.

Handling a uffd-wp pte marker is similar to handling a missing fault; we handle this "missing fault" when we see the pte marker, while making sure the marker information is preserved during processing of the fault.

This is a slow path of uffd-wp handling, because zapping of wr-protected shmem ptes should be rare. So far it should only trigger in two conditions:

  (1) When trying to punch holes in shmem_fallocate(), there is an
      optimization to zap the pgtables before evicting the page.

  (2) When swapping out shmem pages.

Because of this, the page fault handling is simplified too by not sending the wr-protect message on the 1st page fault; instead the page will be installed read-only, so the uffd-wp message will be generated on the next (write) fault, which will trigger the do_wp_page() path of general uffd-wp handling.

Disable fault-around for all uffd-wp registered ranges for extra safety, just like uffd-minor faults, and clean the code up.
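
For context, a minimal userspace sketch (illustration only, not part of this patch) of how a monitor could arm uffd-wp on a shmem-backed mapping with this series applied. The mapping size is made up for the example and error handling is omitted; only the existing userfaultfd ioctls are used:

  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int main(void)
  {
  	size_t len = 16UL << 12;	/* arbitrary example length */
  	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

  	/* Negotiate the API and ask for write-protect fault reporting */
  	struct uffdio_api api = {
  		.api = UFFD_API,
  		.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
  	};
  	ioctl(uffd, UFFDIO_API, &api);

  	/* MAP_SHARED|MAP_ANONYMOUS gives a shmem-backed mapping */
  	char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
  			  MAP_SHARED | MAP_ANONYMOUS, -1, 0);

  	/* Register the range in wp mode... */
  	struct uffdio_register reg = {
  		.range = { .start = (unsigned long)addr, .len = len },
  		.mode = UFFDIO_REGISTER_MODE_WP,
  	};
  	ioctl(uffd, UFFDIO_REGISTER, &reg);

  	/* ...and write-protect it; writes now generate uffd-wp messages */
  	struct uffdio_writeprotect wp = {
  		.range = { .start = (unsigned long)addr, .len = len },
  		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
  	};
  	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

  	/* read() struct uffd_msg events from uffd and resolve them here */
  	return 0;
  }

With the range armed this way, writes trap into the monitor, and the pte markers handled below keep that protection intact even if the shmem ptes are zapped or the pages swapped out before the next fault.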
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h | 17 +++++++++
 mm/memory.c                   | 67 ++++++++++++++++++++++++++++++-----
 2 files changed, 75 insertions(+), 9 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index bd09c3c89b59..827e38b7be65 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -96,6 +96,18 @@ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
 	return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
 }
 
+/*
+ * Don't do fault around for either WP or MINOR registered uffd range.  For
+ * MINOR registered range, fault around will be a total disaster and ptes can
+ * be installed without notifications; for WP it should mostly be fine as long
+ * as the fault around checks for pte_none() before the installation, however
+ * to be super safe we just forbid it.
+ */
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
+}
+
 static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_UFFD_MISSING;
@@ -236,6 +248,11 @@ static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
 {
 }
 
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
diff --git a/mm/memory.c b/mm/memory.c
index cdd0d108d3ee..f509ddf2ad39 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3512,6 +3512,39 @@ static inline bool should_try_to_free_swap(struct page *page,
 		page_count(page) == 2;
 }
 
+static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
+{
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+				       vmf->address, &vmf->ptl);
+	/*
+	 * Be careful so that we will only recover a special uffd-wp pte into a
+	 * none pte.  Otherwise it means the pte could have changed, so retry.
+	 */
+	if (is_pte_marker(*vmf->pte))
+		pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	return 0;
+}
+
+/*
+ * This is actually a page-missing access, but with uffd-wp special pte
+ * installed.  It means this pte was wr-protected before being unmapped.
+ */
+static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
+{
+	/*
+	 * Just in case there're leftover special ptes even after the region
+	 * got unregistered - we can simply clear them.  We can also do that
+	 * proactively when e.g. when we do UFFDIO_UNREGISTER upon some uffd-wp
+	 * ranges, but it should be more efficient to be done lazily here.
+	 */
+	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+		return pte_marker_clear(vmf);
+
+	/* do_fault() can handle pte markers too like none pte */
+	return do_fault(vmf);
+}
+
 static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 {
 	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
@@ -3525,8 +3558,11 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
 		return VM_FAULT_SIGBUS;
 
-	/* TODO: handle pte markers */
-	return 0;
+	if (pte_marker_entry_uffd_wp(entry))
+		return pte_marker_handle_uffd_wp(vmf);
+
+	/* This is an unknown pte marker */
+	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -4051,6 +4087,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool prefault = vmf->address != addr;
 	pte_t entry;
@@ -4065,6 +4102,8 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	if (unlikely(uffd_wp))
+		entry = pte_mkuffd_wp(pte_wrprotect(entry));
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -4238,9 +4277,21 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 	return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
 }
 
+/* Return true if we should do read fault-around, false otherwise */
+static inline bool should_fault_around(struct vm_fault *vmf)
+{
+	/* No ->map_pages?  No way to fault around... */
+	if (!vmf->vma->vm_ops->map_pages)
+		return false;
+
+	if (uffd_disable_fault_around(vmf->vma))
+		return false;
+
+	return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
 static vm_fault_t do_read_fault(struct vm_fault *vmf)
 {
-	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret = 0;
 
 	/*
@@ -4248,12 +4299,10 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	 * if page by the offset is not ready to be mapped (cold cache or
 	 * something).
 	 */
-	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
-		if (likely(!userfaultfd_minor(vmf->vma))) {
-			ret = do_fault_around(vmf);
-			if (ret)
-				return ret;
-		}
+	if (should_fault_around(vmf)) {
+		ret = do_fault_around(vmf);
+		if (ret)
+			return ret;
 	}
 
 	ret = __do_fault(vmf);
-- 
2.32.0