Date: Mon, 29 Mar 2021 18:50:42 -0400
From: Peter Xu
To: Axel Rasmussen
Cc: Alexander Viro, Andrea Arcangeli, Andrew Morton, Hugh Dickins,
    Jerome Glisse, Joe Perches, Lokesh Gidra, Mike Rapoport, Shaohua Li,
    Shuah Khan, Stephen Rothwell, Wang Qing, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    linux-kselftest@vger.kernel.org, Brian Geffon, Cannon Matthews,
    "Dr . David Alan Gilbert", David Rientjes, Michel Lespinasse,
    Mina Almasry, Oliver Upton
Subject: Re: [PATCH v2] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTINUE behavior
Message-ID: <20210329225042.GF429942@xz-x1>
References: <20210329200659.65971-1-axelrasmussen@google.com>
In-Reply-To: <20210329200659.65971-1-axelrasmussen@google.com>

On Mon, Mar 29, 2021 at 01:06:59PM -0700, Axel Rasmussen wrote:
> Previously, we shared too much of the code with COPY and ZEROPAGE, so we
> manipulated things in various invalid ways:
>
> - Previously, we unconditionally called shmem_inode_acct_block. In the
>   continue case, we're looking up an existing page which would have been
>   accounted for properly when it was allocated. So doing it twice
>   results in double-counting, and eventually leaking.
>
> - Previously, we made the pte writable whenever the VMA was writable.
>   However, for continue, consider this case:
>
>   1. A tmpfs file was created
>   2. The non-UFFD-registered side mmap()-s with MAP_SHARED
>   3. The UFFD-registered side mmap()-s with MAP_PRIVATE
>
>   In this case, even though the UFFD-registered VMA may be writable, we
>   still want CoW behavior. So, check for this case and don't make the
>   pte writable.
>
> - The offset / max_off checking doesn't necessarily hurt anything, but
>   it's not needed in the CONTINUE case, so skip it.
>
> - Previously, we unconditionally called ClearPageDirty() in the error
>   path. In the continue case though, since this is an existing page, it
>   might have already been dirty before we started touching it. So,
>   remember whether or not it was dirty before we set_page_dirty(), and
>   only clear the bit if it wasn't dirty before.
>
> - Previously, we unconditionally removed the page from the page cache in
>   the error path. But in the continue case, we didn't add it - it was
>   already there because the page is present in some second
>   (non-UFFD-registered) mapping. So, removing it is invalid.
>
> Because the error handling issues are easy to exercise in the selftest,
> make a small modification there to do so.
>
> Finally, refactor shmem_mcopy_atomic_pte a bit. By this point, we've
> added a lot of "if (!is_continue)"-s everywhere. It's cleaner to just
> check for that mode first thing, and then "goto" down to where the parts
> we actually want are. This leaves the code in between cleaner.
>
> Changes since v1:
> - Refactor to skip ahead with goto, instead of adding several more
>   "if (!is_continue)".
> - Fix unconditional ClearPageDirty().
> - Don't pte_mkwrite() when is_continue && !VM_SHARED.
>
> Fixes: 00da60b9d0a0 ("userfaultfd: support minor fault handling for shmem")
> Signed-off-by: Axel Rasmussen
> ---
>  mm/shmem.c                               | 67 ++++++++++++++----------
>  tools/testing/selftests/vm/userfaultfd.c | 12 +++++
>  2 files changed, 51 insertions(+), 28 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d2e0e81b7d2e..8ab1f1f29987 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2378,17 +2378,22 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>          pte_t _dst_pte, *dst_pte;
>          int ret;
>          pgoff_t offset, max_off;
> -
> -        ret = -ENOMEM;
> -        if (!shmem_inode_acct_block(inode, 1))
> -                goto out;
> +        int writable;
> +        bool was_dirty;
>
>          if (is_continue) {
>                  ret = -EFAULT;
>                  page = find_lock_page(mapping, pgoff);
>                  if (!page)
> -                        goto out_unacct_blocks;
> -        } else if (!*pagep) {
> +                        goto out;
> +                goto install_ptes;
> +        }
> +
> +        ret = -ENOMEM;
> +        if (!shmem_inode_acct_block(inode, 1))
> +                goto out;
> +
> +        if (!*pagep) {
>                  page = shmem_alloc_page(gfp, info, pgoff);
>                  if (!page)
>                          goto out_unacct_blocks;
> @@ -2415,13 +2420,11 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>                  *pagep = NULL;
>          }
>
> -        if (!is_continue) {
> -                VM_BUG_ON(PageSwapBacked(page));
> -                VM_BUG_ON(PageLocked(page));
> -                __SetPageLocked(page);
> -                __SetPageSwapBacked(page);
> -                __SetPageUptodate(page);
> -        }
> +        VM_BUG_ON(PageSwapBacked(page));
> +        VM_BUG_ON(PageLocked(page));
> +        __SetPageLocked(page);
> +        __SetPageSwapBacked(page);
> +        __SetPageUptodate(page);
>
>          ret = -EFAULT;
>          offset = linear_page_index(dst_vma, dst_addr);
> @@ -2429,16 +2432,18 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>          if (unlikely(offset >= max_off))
>                  goto out_release;
>
> -        /* If page wasn't already in the page cache, add it. */
> -        if (!is_continue) {
> -                ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
> -                                              gfp & GFP_RECLAIM_MASK, dst_mm);
> -                if (ret)
> -                        goto out_release;
> -        }
> +        ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
> +                                      gfp & GFP_RECLAIM_MASK, dst_mm);
> +        if (ret)
> +                goto out_release;
>
> +install_ptes:
>          _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> -        if (dst_vma->vm_flags & VM_WRITE)
> +        /* For CONTINUE on a non-shared VMA, don't pte_mkwrite for CoW. */
> +        writable = is_continue && !(dst_vma->vm_flags & VM_SHARED)
> +                        ? 0
> +                        : dst_vma->vm_flags & VM_WRITE;
> +        if (writable)
>                  _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
>          else {
>                  /*
> @@ -2448,15 +2453,18 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>                   * unconditionally before unlock_page(), but doing it
>                   * only if VM_WRITE is not set is faster.
>                   */
> +                was_dirty = PageDirty(page);
>                  set_page_dirty(page);
>          }
>
>          dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
>
> -        ret = -EFAULT;
> -        max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> -        if (unlikely(offset >= max_off))
> -                goto out_release_unlock;
> +        if (!is_continue) {
> +                ret = -EFAULT;
> +                max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> +                if (unlikely(offset >= max_off))
> +                        goto out_release_unlock;
> +        }

I think you're right, but I wouldn't touch this code, since the new version
slightly hurts readability while skipping the check won't save a lot of CPU
cycles.  No strong opinion though.

>
>          ret = -EEXIST;
>          if (!pte_none(*dst_pte))
> @@ -2485,13 +2493,16 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>          return ret;
>  out_release_unlock:
>          pte_unmap_unlock(dst_pte, ptl);
> -        ClearPageDirty(page);
> -        delete_from_page_cache(page);
> +        if (!was_dirty)
> +                ClearPageDirty(page);

This could be reading an uninitialized was_dirty from the stack, which does
not seem right.  Maybe simply drop this ClearPageDirty()?  I'm not sure
whether the page freeing code will complain, but if not, I think an extra
dirty bit is better than a lost one, as the latter corrupts data while the
former happens anyway, not to mention this is an error path.  Maybe
acceptable?

> +        if (!is_continue)
> +                delete_from_page_cache(page);
>  out_release:
>          unlock_page(page);
>          put_page(page);
>  out_unacct_blocks:
> -        shmem_inode_unacct_blocks(inode, 1);
> +        if (!is_continue)
> +                shmem_inode_unacct_blocks(inode, 1);
>          goto out;
>  }
>  #endif /* CONFIG_USERFAULTFD */
> diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> index f6c86b036d0f..d8541a59dae5 100644
> --- a/tools/testing/selftests/vm/userfaultfd.c
> +++ b/tools/testing/selftests/vm/userfaultfd.c
> @@ -485,6 +485,7 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
>  static void continue_range(int ufd, __u64 start, __u64 len)
>  {
>          struct uffdio_continue req;
> +        int ret;
>
>          req.range.start = start;
>          req.range.len = len;
> @@ -493,6 +494,17 @@ static void continue_range(int ufd, __u64 start, __u64 len)
>          if (ioctl(ufd, UFFDIO_CONTINUE, &req))
>                  err("UFFDIO_CONTINUE failed for address 0x%" PRIx64,
>                      (uint64_t)start);
> +
> +        /*
> +         * Error handling within the kernel for continue is subtly different
> +         * from copy or zeropage, so it may be a source of bugs. Trigger an
> +         * error (-EEXIST) on purpose, to verify doing so doesn't cause a BUG.
> +         */
> +        req.mapped = 0;
> +        ret = ioctl(ufd, UFFDIO_CONTINUE, &req);
> +        if (ret >= 0 || req.mapped != -EEXIST)
> +                err("failed to exercise UFFDIO_CONTINUE error handling, ret=%d, mapped=%" PRId64,
> +                    ret, req.mapped);
>  }
>
>  static void *locking_thread(void *arg)
> --
> 2.31.0.291.g576ba9dcdaf-goog
>

-- 
Peter Xu