From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton, Marty Mcfadden, "Maya B. Gokhale", Andrea Arcangeli,
	Linus Torvalds, Jann Horn, Christoph Hellwig, Oleg Nesterov,
	Kirill Shutemov, Jan Kara
Subject: [PATCH v3] mm/gup: Allow real explicit breaking of COW
Date: Tue, 11 Aug 2020 14:39:50 -0400
Message-Id: <20200811183950.10603-1-peterx@redhat.com>

Starting from commit 17839856fd58 ("gup: document and work around "COW can
break either way" issue", 2020-06-02), explicit copy-on-write behavior is
enforced for private gup pages even for read-only access.  This is achieved
by always passing FOLL_WRITE to emulate a write.
That should fix the COW issue we were facing, however the above commit
could also break userfaultfd-wp and applications like umapsort [1,2].

The general routine of a umap-like program is: a userspace library manages
page allocations, and it evicts the least recently used pages from memory
to external storage (e.g., file systems).  Below are the general steps to
evict an in-memory page in the uffd service thread when the page pool is
full:

  (1) UFFDIO_WRITEPROTECT with mode=WP on some to-be-evicted page P, so
      that further writes to page P will block (keep page P clean)
  (2) Copy page P to external storage (e.g. file system)
  (3) MADV_DONTNEED to evict page P

Here step (1) makes sure that the page to dump will always be up-to-date,
so that the page snapshot in the file system is consistent with the one
that was in memory.  However with commit 17839856fd58, step (2) can
potentially hang itself, because e.g. if we use write() on a file system
fd to dump the page data, that will be translated into a read gup request
in the file system driver to read the page content, and the read gup will
in turn be translated into a write gup due to the newly enforced COW
behavior.  This write gup will then trigger handle_userfault() and hang
the uffd service thread itself.

The problem would also go away if we replaced the write() to the file
system with a memory write to an mmaped region in the userspace library,
because normal page faults do not enforce COW, only gup is affected.
However we cannot forbid users from using write() or any other form of
kernel level read gup.

One solution is actually already mentioned in commit 17839856fd58, which
is to provide explicit BREAK_COW semantics for enforced COW.  Then we can
still use FAULT_FLAG_WRITE to identify whether this is a "real write
request" or an "enforced COW (read) request".
With the enforced COW, we also need to inherit the UFFD_WP bit during COW,
because now COW can happen with UFFD_WP enabled (previously it could not).

While at it, rename the variable in __handle_mm_fault() from "dirty" to
"cow" to better suit its functionality.

[1] https://github.com/LLNL/umap-apps/blob/develop/src/umapsort/umapsort.cpp
[2] https://github.com/LLNL/umap

CC: Marty Mcfadden
CC: Maya B. Gokhale
CC: Andrea Arcangeli
CC: Linus Torvalds
CC: Andrew Morton
CC: Jann Horn
CC: Christoph Hellwig
CC: Oleg Nesterov
CC: Kirill Shutemov
CC: Jan Kara
Fixes: 17839856fd58 ("gup: document and work around "COW can break either way" issue")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
v3:
- inherit UFFD_WP bit for COW too
- take care of huge page cases
- more comments
v2:
- apply FAULT_FLAG_BREAK_COW correctly when FOLL_BREAK_COW [Christoph]
- removed comments above do_wp_page which seemed redundant
---
 include/linux/mm.h |  3 +++
 mm/gup.c           |  6 ++++--
 mm/huge_memory.c   | 12 +++++++++++-
 mm/memory.c        | 39 +++++++++++++++++++++++++++++++--------
 4 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6a82f9bccd7..a1f5c92b44cb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -409,6 +409,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read).
 *
 * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
 * whether we would allow page faults to retry by specifying these two
@@ -439,6 +440,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE		0x80
 #define FAULT_FLAG_INSTRUCTION		0x100
 #define FAULT_FLAG_INTERRUPTIBLE	0x200
+#define FAULT_FLAG_BREAK_COW		0x400
 
 /*
  * The default fault flags that should be used by most of the
@@ -2756,6 +2758,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_BREAK_COW	0x100000 /* request for explicit COW (even for read) */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index d8a33dd1430d..c33e84ab9c36 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -870,6 +870,8 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		return -ENOENT;
 	if (*flags & FOLL_WRITE)
 		fault_flags |= FAULT_FLAG_WRITE;
+	if (*flags & FOLL_BREAK_COW)
+		fault_flags |= FAULT_FLAG_BREAK_COW;
 	if (*flags & FOLL_REMOTE)
 		fault_flags |= FAULT_FLAG_REMOTE;
 	if (locked)
@@ -1076,7 +1078,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		}
 		if (is_vm_hugetlb_page(vma)) {
 			if (should_force_cow_break(vma, foll_flags))
-				foll_flags |= FOLL_WRITE;
+				foll_flags |= FOLL_BREAK_COW;
 			i = follow_hugetlb_page(mm, vma, pages, vmas,
 					&start, &nr_pages, i,
 					foll_flags, locked);
@@ -1095,7 +1097,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		}
 
 		if (should_force_cow_break(vma, foll_flags))
-			foll_flags |= FOLL_WRITE;
+			foll_flags |= FOLL_BREAK_COW;
 
 retry:
 		/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 206f52b36ffb..c88f773d03af 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1296,7 +1296,17 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	if (reuse_swap_page(page, NULL)) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
-		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkdirty(entry);
+		if (pmd_uffd_wp(orig_pmd))
+			/*
+			 * This can happen when an uffd-wp protected page is
+			 * copied due to enforced COW.  When it happens, we
+			 * need to keep the uffd-wp bit even after COW, and
+			 * make sure the write bit is kept cleared.
+			 */
+			entry = pmd_mkuffd_wp(pmd_wrprotect(entry));
+		else
+			entry = maybe_pmd_mkwrite(entry, vma);
 		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
 			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		unlock_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index c39a13b09602..b27b555a9df8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2706,7 +2706,17 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = pte_sw_mkyoung(entry);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		entry = pte_mkdirty(entry);
+		if (pte_uffd_wp(vmf->orig_pte))
+			/*
+			 * This can happen when an uffd-wp protected page is
+			 * copied due to enforced COW.  When it happens, we
+			 * need to keep the uffd-wp bit even after COW, and
+			 * make sure the write bit is kept cleared.
+			 */
+			entry = pte_mkuffd_wp(pte_wrprotect(entry));
+		else
+			entry = maybe_mkwrite(entry, vma);
 		/*
 		 * Clear the pte entry and flush it first, before updating the
 		 * pte with the new entry. This will avoid a race condition
@@ -2900,7 +2910,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+	/*
+	 * Userfaultfd-wp only cares about real writes. E.g., enforced COW for
+	 * read does not count.
+	 * When that happens, we will do the COW with the
+	 * UFFD_WP bit inherited from the original PTE/PMD.
+	 */
+	if ((vmf->flags & FAULT_FLAG_WRITE) &&
+	    userfaultfd_pte_wp(vma, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
 	}
@@ -3290,7 +3306,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		put_page(swapcache);
 	}
 
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		ret |= do_wp_page(vmf);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
@@ -4117,7 +4133,14 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	if (vma_is_anonymous(vmf->vma)) {
-		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
+		/*
+		 * Userfaultfd-wp only cares about real writes. E.g., enforced
+		 * COW for read does not count. When that happens, we will do
+		 * the COW with the UFFD_WP bit inherited from the original
+		 * PTE/PMD.
+		 */
+		if ((vmf->flags & FAULT_FLAG_WRITE) &&
+		    userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
 		return do_huge_pmd_wp_page(vmf, orig_pmd);
 	}
@@ -4241,7 +4264,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
 		goto unlock;
 	}
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		if (!pte_write(entry))
 			return do_wp_page(vmf);
 		entry = pte_mkdirty(entry);
@@ -4281,7 +4304,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		.pgoff = linear_page_index(vma, address),
 		.gfp_mask = __get_fault_gfp_mask(vma),
 	};
-	unsigned int dirty = flags & FAULT_FLAG_WRITE;
+	bool cow = flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW);
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4308,7 +4331,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 
 		/* NUMA case for anonymous PUDs would go here
 */
 
-		if (dirty && !pud_write(orig_pud)) {
+		if (cow && !pud_write(orig_pud)) {
 			ret = wp_huge_pud(&vmf, orig_pud);
 			if (!(ret & VM_FAULT_FALLBACK))
 				return ret;
@@ -4346,7 +4369,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 			return do_huge_pmd_numa_page(&vmf, orig_pmd);
 
-		if (dirty && !pmd_write(orig_pmd)) {
+		if (cow && !pmd_write(orig_pmd)) {
 			ret = wp_huge_pmd(&vmf, orig_pmd);
 			if (!(ret & VM_FAULT_FALLBACK))
 				return ret;
-- 
2.26.2