Date: Tue, 1 Dec 2020 13:31:21 -0800 (PST)
From: Hugh Dickins
To: Peter Xu
Cc: Andrea Arcangeli, Matthew Wilcox, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, Andrew Morton, Hugh Dickins, Mike Rapoport
Subject: Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads
In-Reply-To: <20201128152903.GK6573@xz-x1>
References: <20201126222359.8120-1-peterx@redhat.com> <20201127122224.GX4327@casper.infradead.org> <20201128152903.GK6573@xz-x1>

On Sat, 28 Nov 2020, Peter Xu wrote:
> On Fri, Nov 27, 2020 at 07:33:39PM -0500, Andrea Arcangeli wrote:
> >
> > Now it would be ok if filemap_map_pages re-filled the ptes, after they
> > were zapped the first time by fallocate, and then the fallocate would
> > zap them again before truncating the page, but I don't see a second
> > pte zap... there's just the below single call of unmap_mapping_range:
> >
> > 	if ((u64)unmap_end > (u64)unmap_start)
> > 		unmap_mapping_range(mapping, unmap_start,
> > 				    1 + unmap_end - unmap_start, 0);
> > 	shmem_truncate_range(inode, offset, offset + len - 1);
>
> IMHO the 2nd pte zap is inside shmem_truncate_range(), where we will call
> truncate_inode_page() then onto truncate_cleanup_page().

Correct.

> Since we're at it, some more context: this is actually where I started to
> notice the issue, that we'll try to pre-unmap the whole region first before
> shmem_truncate_range().  The thing is shmem_truncate_range() will zap the
> ptes too, in an even safer way (with page lock held).  So before I came up
> with the current patch, I also tried the below patch, and it also fixes
> the data corruption issue:
>
> -----8<-----
> diff --git a/mm/shmem.c b/mm/shmem.c
> index efa19e33e470..b275f401fe1f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2777,7 +2777,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  	inode_lock(inode);
>  
>  	if (mode & FALLOC_FL_PUNCH_HOLE) {
> -		struct address_space *mapping = file->f_mapping;
>  		loff_t unmap_start = round_up(offset, PAGE_SIZE);
>  		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
>  		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
> @@ -2795,9 +2794,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  		inode->i_private = &shmem_falloc;
>  		spin_unlock(&inode->i_lock);
>  
> -		if ((u64)unmap_end > (u64)unmap_start)
> -			unmap_mapping_range(mapping, unmap_start,
> -					    1 + unmap_end - unmap_start, 0);
>  		shmem_truncate_range(inode, offset, offset + len - 1);
>  		/* No need to unmap again: hole-punching leaves COWed pages */
> -----8<-----
>
> The above code has existed since the first day of shmem fallocate, so I
> was wondering why we did that.  One reason could be performance: the
> pre-unmap via an explicit unmap_mapping_range() walks the page tables
> once rather than once per page, so when the hole is huge it could benefit
> us, since truncate_cleanup_page() is then able to avoid those per-page
> walks because page_mapped() is more likely to be zero:
>
> 	if (page_mapped(page)) {
> 		pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
> 		unmap_mapping_pages(mapping, page->index, nr, false);
> 	}
>
> It would be good if Hugh can help confirm whether my understanding is
> correct, though..

Correct.  Code duplicated from mm/truncate.c, but with more comments over
there, in truncate_pagecache() and in truncate_pagecache_range().

> As a summary, that's why I didn't try to remove the optimization (which
> also fixes the issue) but instead tried to disable fault around for uffd,
> which sounds like a better thing to do.
>
> > So looking at filemap_map_pages in shmem, I'm really wondering if the
> > non-uffd case is correct or not.
> >
> > Do we end up with ptes pointing to non pagecache, so then the virtual
> > mapping is out of sync also with read/write syscalls that will then
> > allocate another shmem page out of sync of the ptes?

No, as you point out, the second pte zap, under page lock, keeps it safe.

> > If a real page fault happened instead of filemap_map_pages, the
> > shmem_fault() would block during fallocate punch hole by checking
> > inode->i_private, as the comment says:
> >
> > 	 * So refrain from
> > 	 * faulting pages into the hole while it's being punched.
> >
> > It's not immediately clear where filemap_map_pages refrains from
> > faulting pages into the hole while it's being punched... given it's
> > ignoring inode->i_private.

Please don't ever rely on that i_private business for correctness: as more
of the comment around "So refrain from" explains, it was added to avoid a
livelock with the trinity fuzzer, and I'd gladly delete it all, if I could
ever be confident that enough has changed in the intervening years that
it's no longer necessary.

It tends to prevent shmem faults racing hole punches in the same area
(without quite guaranteeing no race at all).  But without the livelock
issue, I'd just have gone on letting them race.  "Punch hole" == "Lose
data" - though I can imagine that UFFD would like more control over that.

Since map_pages is driven by faulting, it should already be throttled too.
But Andrea in his next mail goes on to see other issues with UFFD_WP in
relation to shmem swap, so this thread is probably dead now.
If there's a bit to spare for UFFD_WP in the anonymous swap entry, then
that bit should be usable for shmem too: but a shmem page (and its swap
entry) can be shared between many different users, so I don't know
whether that will make sense.

Hugh