Date: Mon, 26 Apr 2021 17:16:53 -0400
From: Peter Xu <peterx@redhat.com>
To: Mike Kravetz
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Mike Rapoport,
 Nadav Amit, Jerome Glisse, Hugh Dickins, Andrea Arcangeli, Andrew Morton,
 "Kirill A . Shutemov", Axel Rasmussen, Matthew Wilcox
Subject: Re: [PATCH 21/23] hugetlb/userfaultfd: Only drop uffd-wp special pte if required
Message-ID: <20210426211653.GH85002@xz-x1>
References: <20210323004912.35132-1-peterx@redhat.com>
 <20210323005054.35973-1-peterx@redhat.com>
 <02712955-0552-f82a-0ab8-460d63af3519@oracle.com>
In-Reply-To: <02712955-0552-f82a-0ab8-460d63af3519@oracle.com>

On Fri, Apr 23, 2021 at 01:33:08PM -0700, Mike Kravetz wrote:
> On 3/22/21 5:50 PM, Peter Xu wrote:
> > Just like what we've done with the shmem uffd-wp special ptes, we
> > shouldn't drop the uffd-wp special swap pte for hugetlb either, unless
> > we're going to unmap the whole vma, or we're punching a hole with safe
> > locks held.
> > 
> > For example, remove_inode_hugepages() is safe to drop uffd-wp ptes,
> > because it has taken the hugetlb fault mutex so that no concurrent page
> > fault can trigger.  Meanwhile the call to hugetlb_vmdelete_list() in
> > hugetlbfs_punch_hole() is not safe.  That's why the former call is made
> > with ZAP_FLAG_DROP_FILE_UFFD_WP, while the latter one cannot be.
> 
> This commit message is a bit confusing, but the hugetlb hole punch code
> path is a bit confusing. :)  How about something like this?
> 
> As with shmem uffd-wp special ptes, only drop the uffd-wp special swap
> pte if unmapping an entire vma or synchronized such that faults can not
> race with the unmap operation.  This requires passing zap_flags all the
> way to the lowest level hugetlb unmap routine: __unmap_hugepage_range.
> 
> In general, unmap calls originated in hugetlbfs code will pass the
> ZAP_FLAG_DROP_FILE_UFFD_WP flag as synchronization is in place to prevent
> faults.  The exception is hole punch which will first unmap without any
> synchronization.  Later when hole punch actually removes the page from
> the file, it will check to see if there was a subsequent fault and if
> so take the hugetlb fault mutex while unmapping again.  This second
> unmap will pass in ZAP_FLAG_DROP_FILE_UFFD_WP.

Sure, I can replace mine.  Maybe it's confusing because I didn't explain
the reasoning well enough.

The core justification for "whether to apply the ZAP_FLAG_DROP_FILE_UFFD_WP
flag when unmapping a hugetlb range" is (IMHO): we should never reach a
state where a page fault could erroneously fault in a wr-protected
page-cache page as writable, even for an extremely short period.  That
could happen if, e.g.,
we pass ZAP_FLAG_DROP_FILE_UFFD_WP in hugetlbfs_punch_hole() when calling
hugetlb_vmdelete_list(): if a page fault triggers after that call and before
the remove_inode_hugepages() right after it, the page cache can be mapped
writable again in that small window, which can cause data corruption.

> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  fs/hugetlbfs/inode.c    | 15 +++++++++------
> >  include/linux/hugetlb.h | 13 ++++++++-----
> >  mm/hugetlb.c            | 27 +++++++++++++++++++++------
> >  mm/memory.c             |  5 ++++-
> >  4 files changed, 42 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index d81f52b87bd7..5fe19e801a2b 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -399,7 +399,8 @@ static void remove_huge_page(struct page *page)
> >  }
> >  
> >  static void
> > -hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
> > +hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> > +		      unsigned long zap_flags)
> >  {
> >  	struct vm_area_struct *vma;
> >  
> > @@ -432,7 +433,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
> >  		}
> >  
> >  		unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
> > -				     NULL);
> > +				     NULL, zap_flags);
> >  	}
> >  }
> >  
> > @@ -513,7 +514,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> >  			mutex_lock(&hugetlb_fault_mutex_table[hash]);
> >  			hugetlb_vmdelete_list(&mapping->i_mmap,
> >  				index * pages_per_huge_page(h),
> > -				(index + 1) * pages_per_huge_page(h));
> > +				(index + 1) * pages_per_huge_page(h),
> > +				ZAP_FLAG_DROP_FILE_UFFD_WP);
> >  			i_mmap_unlock_write(mapping);
> >  		}
> >  
> > @@ -579,7 +581,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
> >  	i_mmap_lock_write(mapping);
> >  	i_size_write(inode, offset);
> >  	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
> > -		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
> > +		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
> > +				      ZAP_FLAG_DROP_FILE_UFFD_WP);
> >  	i_mmap_unlock_write(mapping);
> >  	remove_inode_hugepages(inode, offset, LLONG_MAX);
> >  }
> > @@ -612,8 +615,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> >  		i_mmap_lock_write(mapping);
> >  		if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
> >  			hugetlb_vmdelete_list(&mapping->i_mmap,
> > -					hole_start >> PAGE_SHIFT,
> > -					hole_end >> PAGE_SHIFT);
> > +					hole_start >> PAGE_SHIFT,
> > +					hole_end >> PAGE_SHIFT, 0);
> >  		i_mmap_unlock_write(mapping);
> >  		remove_inode_hugepages(inode, hole_start, hole_end);
> >  		inode_unlock(inode);
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 92710600596e..4047fa042782 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -121,14 +121,15 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> >  			 unsigned long *, unsigned long *, long, unsigned int,
> >  			 int *);
> >  void unmap_hugepage_range(struct vm_area_struct *,
> > -			  unsigned long, unsigned long, struct page *);
> > +			  unsigned long, unsigned long, struct page *,
> > +			  unsigned long);
> >  void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> >  			  struct vm_area_struct *vma,
> >  			  unsigned long start, unsigned long end,
> > -			  struct page *ref_page);
> > +			  struct page *ref_page, unsigned long zap_flags);
> >  void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  			    unsigned long start, unsigned long end,
> > -			    struct page *ref_page);
> > +			    struct page *ref_page, unsigned long zap_flags);
> 
> Nothing introduced with your patch, but it seems __unmap_hugepage_range_final
> does not need to be in the header and can be static in hugetlb.c.

It seems to be used in unmap_single_vma() of mm/memory.c?
> 
> Everything else looks good,
> 
> Reviewed-by: Mike Kravetz

Please let me know whether you want my extra paragraph in the commit
message; then I'll coordinate accordingly with the R-b.  Thanks!

-- 
Peter Xu