From: David Stevens <stevensd@chromium.org>
To: linux-mm@kvack.org, Peter Xu <peterx@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Yang Shi <shy828301@gmail.com>,
	David Hildenbrand <david@redhat.com>,
	Hugh Dickins <hughd@google.com>,
	linux-kernel@vger.kernel.org,
	David Stevens <stevensd@chromium.org>
Subject: [PATCH v2] mm/khugepaged: skip shmem with userfaultfd
Date: Mon,  6 Feb 2023 20:28:56 +0900
Message-ID: <20230206112856.1802547-1-stevensd@google.com>

From: David Stevens <stevensd@chromium.org>

Collapsing memory will result in any empty pages in the target range
being filled by the new THP. If userspace has a userfaultfd registered
with MODE_MISSING, then it may expect a UFFD_EVENT_PAGEFAULT for any
page it knows to be missing after the userfaultfd was registered.
Taken together, these two facts mean that khugepaged must take care
when collapsing pages in shmem so that it does not break the
userfaultfd API.
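
For reference, a minimal sketch of such a registration from userspace
(hypothetical helper; addr/len are assumed to describe an existing
shmem mapping; error handling omitted):

	#include <fcntl.h>
	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static void register_uffd_missing(void *addr, size_t len)
	{
		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
		struct uffdio_api api = { .api = UFFD_API };
		struct uffdio_register reg = {
			.range = { .start = (unsigned long)addr, .len = len },
			.mode = UFFDIO_REGISTER_MODE_MISSING,
		};

		ioctl(uffd, UFFDIO_API, &api);
		ioctl(uffd, UFFDIO_REGISTER, &reg);
		/*
		 * From here on, userspace may expect a UFFD_EVENT_PAGEFAULT
		 * message for any page in [addr, addr + len) that it knows
		 * to be missing, rather than the kernel filling it silently.
		 */
	}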

This change first makes sure that the intermediate page cache state
during collapse is not visible, by deferring the filling of gaps until
after the page cache lock has been acquired for the final time. This is
necessary because the synchronization provided by locking hpage is
insufficient for functions which operate on the page cache without
locking individual pages to examine their content (e.g.
shmem_mfill_atomic_pte).
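
Schematically, the problematic pattern has this shape (illustrative
only, not the exact kernel code):

	/*
	 * A UFFDIO_COPY into shmem only checks whether the page cache
	 * slot is occupied; it never locks the page it finds there. A
	 * locked-but-not-up-to-date hpage sitting in a gap is therefore
	 * indistinguishable from real data.
	 */
	if (xa_load(&mapping->i_pages, index))
		return -EEXIST;	/* treated as "page already present" */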

This refactoring allows us to iterate over i_mmap to check for any VMAs
with userfaultfds and then finalize the collapse if no such VMAs exist,
all while holding the page cache lock. Since no mm locks are held, it is
necessary to add smp_rmb()/smp_wmb() to ensure that userfaultfd updates to
vm_flags are visible to khugepaged. However, no further locking of
userfaultfd state is necessary. Although new userfaultfds can be
registered concurrently with finalizing the collapse, any missing pages
that are being replaced can no longer be observed by userspace, so there
is no data race.
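
Schematically, the barrier pairing looks like this (simplified):

	userfaultfd registration                khugepaged
	------------------------                ----------
	vma->vm_flags = flags;                  hide gaps under xa_lock
	smp_wmb();                              smp_rmb();
	may observe page cache state            userfaultfd_missing() on
	                                        each VMA in i_mmap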

This fix is targeted at khugepaged, but the change also applies to
MADV_COLLAPSE. The fact that the intermediate page cache state before
the rollback of a failed collapse can no longer be observed is
technically a userspace-visible change (via at least SEEK_DATA and
SEEK_HOLE), but it is exceedingly unlikely that anything relies on being
able to observe that transient state.
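
For example, the transient state was in principle observable as follows
(illustrative; shmem_fd is assumed to be an open shmem file with a hole
at offset 0 that is being collapsed):

	#define _GNU_SOURCE	/* for SEEK_DATA */
	#include <unistd.h>

	/*
	 * While the collapse was in flight, this could briefly report
	 * the hole as data, even if the collapse was later rolled back.
	 */
	off_t off = lseek(shmem_fd, 0, SEEK_DATA);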

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 fs/userfaultfd.c |  2 ++
 mm/khugepaged.c  | 67 ++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 59 insertions(+), 10 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index cc694846617a..6ddfcff11920 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -114,6 +114,8 @@ static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
 	const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP;
 
 	vma->vm_flags = flags;
+	/* Pairs with smp_rmb() in khugepaged's collapse_file() */
+	smp_wmb();
 	/*
 	 * For shared mappings, we want to enable writenotify while
 	 * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 79be13133322..97435c226b18 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -55,6 +55,7 @@ enum scan_result {
 	SCAN_CGROUP_CHARGE_FAIL,
 	SCAN_TRUNCATED,
 	SCAN_PAGE_HAS_PRIVATE,
+	SCAN_PAGE_FILLED,
 };
 
 #define CREATE_TRACE_POINTS
@@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
  *  - allocate and lock a new huge page;
  *  - scan page cache replacing old pages with the new one
  *    + swap/gup in pages if necessary;
- *    + fill in gaps;
  *    + keep old pages around in case rollback is required;
+ *  - finalize updates to the page cache;
  *  - if replacing succeeds:
  *    + copy data over;
  *    + free old pages;
@@ -1747,6 +1748,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
 	int nr_none = 0, result = SCAN_SUCCEED;
 	bool is_shmem = shmem_file(file);
+	bool i_mmap_locked = false;
 	int nr = 0;
 
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
@@ -1780,8 +1782,14 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 
 	/*
 	 * At this point the hpage is locked and not up-to-date.
-	 * It's safe to insert it into the page cache, because nobody would
-	 * be able to map it or use it in another way until we unlock it.
+	 *
+	 * While iterating, we may drop the page cache lock multiple times. It
+	 * is safe to replace pages in the page cache with hpage while doing so
+	 * because nobody is able to map or otherwise access the content of
+	 * hpage until we unlock it. However, we cannot insert hpage into empty
+	 * indices until we know we won't have to drop the page cache lock
+	 * again, as doing so would let things which only check the presence
+	 * of pages in the page cache see a state that may yet be rolled back.
 	 */
 
 	xas_set(&xas, start);
@@ -1802,13 +1810,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 						result = SCAN_TRUNCATED;
 						goto xa_locked;
 					}
-					xas_set(&xas, index);
+					xas_set(&xas, index + 1);
 				}
 				if (!shmem_charge(mapping->host, 1)) {
 					result = SCAN_FAIL;
 					goto xa_locked;
 				}
-				xas_store(&xas, hpage);
 				nr_none++;
 				continue;
 			}
@@ -1967,6 +1974,46 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		put_page(page);
 		goto xa_unlocked;
 	}
+
+	if (nr_none) {
+		struct vm_area_struct *vma;
+		int nr_none_check = 0;
+
+		xas_unlock_irq(&xas);
+		i_mmap_lock_read(mapping);
+		i_mmap_locked = true;
+		xas_lock_irq(&xas);
+
+		xas_set(&xas, start);
+		for (index = start; index < end; index++) {
+			if (!xas_next(&xas))
+				nr_none_check++;
+		}
+
+		if (nr_none != nr_none_check) {
+			result = SCAN_PAGE_FILLED;
+			goto xa_locked;
+		}
+
+		/*
+		 * If userspace observed a missing page in a VMA with an armed
+		 * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
+		 * that page, so we need to roll back to avoid suppressing such
+		 * an event. Any userfaultfds armed after this point will not be
+		 * able to observe any missing pages, since the page cache is
+		 * locked until after the collapse is completed.
+		 *
+		 * Pairs with smp_wmb() in userfaultfd_set_vm_flags().
+		 */
+		smp_rmb();
+		vma_interval_tree_foreach(vma, &mapping->i_mmap, start, end - 1) {
+			if (userfaultfd_missing(vma)) {
+				result = SCAN_EXCEED_NONE_PTE;
+				goto xa_locked;
+			}
+		}
+	}
+
 	nr = thp_nr_pages(hpage);
 
 	if (is_shmem)
@@ -2000,6 +2047,8 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	xas_store(&xas, hpage);
 xa_locked:
 	xas_unlock_irq(&xas);
+	if (i_mmap_locked)
+		i_mmap_unlock_read(mapping);
 xa_unlocked:
 
 	/*
@@ -2065,15 +2114,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		}
 
 		xas_set(&xas, start);
-		xas_for_each(&xas, page, end - 1) {
+		end = index;
+		for (index = start; index < end; index++) {
+			xas_next(&xas);
 			page = list_first_entry_or_null(&pagelist,
 					struct page, lru);
 			if (!page || xas.xa_index < page->index) {
-				if (!nr_none)
-					break;
 				nr_none--;
-				/* Put holes back where they were */
-				xas_store(&xas, NULL);
 				continue;
 			}
 
-- 
2.39.1.519.gcb327c4b5f-goog


