linux-mm.kvack.org archive mirror
* [PATCH v7 0/4] per-vma locks in userfaultfd
@ 2024-02-15 18:27 Lokesh Gidra
  2024-02-15 18:27 ` [PATCH v7 1/4] userfaultfd: move userfaultfd_ctx struct to header file Lokesh Gidra
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Lokesh Gidra @ 2024-02-15 18:27 UTC (permalink / raw)
  To: akpm
  Cc: lokeshgidra, linux-fsdevel, linux-mm, linux-kernel, selinux,
	surenb, kernel-team, aarcange, peterx, david, axelrasmussen,
	bgeffon, willy, jannh, kaleshsingh, ngeoffray, timmurray, rppt,
	Liam.Howlett, ryan.roberts

Performing userfaultfd operations (copy, move, etc.) in the critical
section of mmap_lock (read-mode) causes significant contention on the
lock when operations requiring the lock in write-mode take place
concurrently. We can use per-vma locks instead to significantly reduce
this contention.
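
To illustrate, the locking pattern this series moves to looks roughly
like the sketch below (do_uffd_op() is a placeholder, not a function
from the patches; error handling omitted):

    /*
     * Before: the whole operation runs inside an mmap_lock (read)
     * critical section:
     *
     *     mmap_read_lock(mm);
     *     err = do_uffd_op(mm, dst, src, len);
     *     mmap_read_unlock(mm);
     *
     * After: try the per-vma read lock first, and fall back to a short
     * mmap_lock (read) section only to look up and lock the vma.
     */
    vma = lock_vma_under_rcu(mm, dst);
    if (!vma) {
        mmap_read_lock(mm);
        vma = vma_lookup(mm, dst);
        if (vma)
            /* no vma_start_write() can race while mmap_lock is held */
            down_read(&vma->vm_lock->lock);
        mmap_read_unlock(mm);
    }
    err = do_uffd_op(vma, dst, src, len);  /* mmap_lock not held here */
    vma_end_read(vma);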

Android runtime's Garbage Collector uses userfaultfd for concurrent
compaction. mmap_lock contention during compaction potentially causes a
jittery experience for the user. During one such reproducible scenario,
we observed the following improvements with this patch-set:

- Wall clock time of compaction phase came down from ~3s to <500ms
- Uninterruptible sleep time (across all threads in the process) was
  ~10ms (none in mmap_lock) during compaction, instead of >20s

Changes since v6 [6]:
- Added a patch to add vma_assert_locked() for !CONFIG_PER_VMA_LOCK, per
  Suren Baghdasaryan
- Replaced mmap_assert_locked() with vma_assert_locked() in
  move_pages_huge_pmd(), per Suren Baghdasaryan

Changes since v5 [5]:
- Use abstract function names (like uffd_mfill_lock/uffd_mfill_unlock)
  to avoid using too many #ifdef's, per Suren Baghdasaryan and Liam
  Howlett
- Use 'unlikely' (as earlier) for anon_vma-related checks, per Liam Howlett
- Eliminate redundant ptr->err->ptr conversion, per Liam Howlett
- Use 'int' instead of 'long' for error return type, per Liam Howlett

Changes since v4 [4]:
- Fix possible deadlock in find_and_lock_vmas() which may arise if
  lock_vma() is used for both src and dst vmas.
- Ensure we lock the vma only once if src and dst vmas are the same.
- Fix error handling in move_pages() after successfully locking vmas.
- Introduce helper function for finding dst vma and preparing its
  anon_vma when done in mmap_lock critical section, per Liam Howlett.
- Introduce helper function for finding dst and src vmas when done in
  mmap_lock critical section.

Changes since v3 [3]:
- Rename function names to clearly reflect which lock is being taken,
  per Liam Howlett.
- Have separate functions and abstractions in mm/userfaultfd.c to avoid
  confusion around which lock is being acquired/released, per Liam Howlett.
- Prepare anon_vma for all private vmas, anonymous or file-backed,
  per Jann Horn.

Changes since v2 [2]:
- Implement and use lock_vma(), which falls back to an mmap_lock critical
  section to take the per-vma lock when lock_vma_under_rcu() fails,
  per Liam R. Howlett. This helps simplify the code and also avoids
  performing the entire userfaultfd operation under mmap_lock.

Changes since v1 [1]:
- rebase patches on 'mm-unstable' branch

[1] https://lore.kernel.org/all/20240126182647.2748949-1-lokeshgidra@google.com/
[2] https://lore.kernel.org/all/20240129193512.123145-1-lokeshgidra@google.com/
[3] https://lore.kernel.org/all/20240206010919.1109005-1-lokeshgidra@google.com/
[4] https://lore.kernel.org/all/20240208212204.2043140-1-lokeshgidra@google.com/
[5] https://lore.kernel.org/all/20240213001920.3551772-1-lokeshgidra@google.com/
[6] https://lore.kernel.org/all/20240213215741.3816570-1-lokeshgidra@google.com/

Lokesh Gidra (4):
  userfaultfd: move userfaultfd_ctx struct to header file
  userfaultfd: protect mmap_changing with rw_sem in userfaultfd_ctx
  mm: add vma_assert_locked() for !CONFIG_PER_VMA_LOCK
  userfaultfd: use per-vma locks in userfaultfd operations

 fs/userfaultfd.c              |  86 ++-----
 include/linux/mm.h            |   5 +
 include/linux/userfaultfd_k.h |  75 ++++--
 mm/huge_memory.c              |   5 +-
 mm/userfaultfd.c              | 438 +++++++++++++++++++++++++---------
 5 files changed, 413 insertions(+), 196 deletions(-)

-- 
2.43.0.687.g38aa6559b0-goog




* [PATCH v7 1/4] userfaultfd: move userfaultfd_ctx struct to header file
  2024-02-15 18:27 [PATCH v7 0/4] per-vma locks in userfaultfd Lokesh Gidra
@ 2024-02-15 18:27 ` Lokesh Gidra
  2024-02-15 18:27 ` [PATCH v7 2/4] userfaultfd: protect mmap_changing with rw_sem in userfaultfd_ctx Lokesh Gidra
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Lokesh Gidra @ 2024-02-15 18:27 UTC (permalink / raw)
  To: akpm
  Cc: lokeshgidra, linux-fsdevel, linux-mm, linux-kernel, selinux,
	surenb, kernel-team, aarcange, peterx, david, axelrasmussen,
	bgeffon, willy, jannh, kaleshsingh, ngeoffray, timmurray, rppt,
	Liam.Howlett, ryan.roberts

Move the struct to userfaultfd_k.h so that it is accessible from
mm/userfaultfd.c. There are no other changes to the struct.

This is required to prepare for using per-vma locks in userfaultfd
operations.

Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 fs/userfaultfd.c              | 39 -----------------------------------
 include/linux/userfaultfd_k.h | 39 +++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 05c8e8a05427..58331b83d648 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -50,45 +50,6 @@ static struct ctl_table vm_userfaultfd_table[] = {
 
 static struct kmem_cache *userfaultfd_ctx_cachep __ro_after_init;
 
-/*
- * Start with fault_pending_wqh and fault_wqh so they're more likely
- * to be in the same cacheline.
- *
- * Locking order:
- *	fd_wqh.lock
- *		fault_pending_wqh.lock
- *			fault_wqh.lock
- *		event_wqh.lock
- *
- * To avoid deadlocks, IRQs must be disabled when taking any of the above locks,
- * since fd_wqh.lock is taken by aio_poll() while it's holding a lock that's
- * also taken in IRQ context.
- */
-struct userfaultfd_ctx {
-	/* waitqueue head for the pending (i.e. not read) userfaults */
-	wait_queue_head_t fault_pending_wqh;
-	/* waitqueue head for the userfaults */
-	wait_queue_head_t fault_wqh;
-	/* waitqueue head for the pseudo fd to wakeup poll/read */
-	wait_queue_head_t fd_wqh;
-	/* waitqueue head for events */
-	wait_queue_head_t event_wqh;
-	/* a refile sequence protected by fault_pending_wqh lock */
-	seqcount_spinlock_t refile_seq;
-	/* pseudo fd refcounting */
-	refcount_t refcount;
-	/* userfaultfd syscall flags */
-	unsigned int flags;
-	/* features requested from the userspace */
-	unsigned int features;
-	/* released */
-	bool released;
-	/* memory mappings are changing because of non-cooperative event */
-	atomic_t mmap_changing;
-	/* mm with one ore more vmas attached to this userfaultfd_ctx */
-	struct mm_struct *mm;
-};
-
 struct userfaultfd_fork_ctx {
 	struct userfaultfd_ctx *orig;
 	struct userfaultfd_ctx *new;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e4056547fbe6..691d928ee864 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -36,6 +36,45 @@
 #define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
 
+/*
+ * Start with fault_pending_wqh and fault_wqh so they're more likely
+ * to be in the same cacheline.
+ *
+ * Locking order:
+ *	fd_wqh.lock
+ *		fault_pending_wqh.lock
+ *			fault_wqh.lock
+ *		event_wqh.lock
+ *
+ * To avoid deadlocks, IRQs must be disabled when taking any of the above locks,
+ * since fd_wqh.lock is taken by aio_poll() while it's holding a lock that's
+ * also taken in IRQ context.
+ */
+struct userfaultfd_ctx {
+	/* waitqueue head for the pending (i.e. not read) userfaults */
+	wait_queue_head_t fault_pending_wqh;
+	/* waitqueue head for the userfaults */
+	wait_queue_head_t fault_wqh;
+	/* waitqueue head for the pseudo fd to wakeup poll/read */
+	wait_queue_head_t fd_wqh;
+	/* waitqueue head for events */
+	wait_queue_head_t event_wqh;
+	/* a refile sequence protected by fault_pending_wqh lock */
+	seqcount_spinlock_t refile_seq;
+	/* pseudo fd refcounting */
+	refcount_t refcount;
+	/* userfaultfd syscall flags */
+	unsigned int flags;
+	/* features requested from the userspace */
+	unsigned int features;
+	/* released */
+	bool released;
+	/* memory mappings are changing because of non-cooperative event */
+	atomic_t mmap_changing;
+	/* mm with one ore more vmas attached to this userfaultfd_ctx */
+	struct mm_struct *mm;
+};
+
 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
 /* A combined operation mode + behavior flags. */
-- 
2.43.0.687.g38aa6559b0-goog




* [PATCH v7 2/4] userfaultfd: protect mmap_changing with rw_sem in userfaultfd_ctx
  2024-02-15 18:27 [PATCH v7 0/4] per-vma locks in userfaultfd Lokesh Gidra
  2024-02-15 18:27 ` [PATCH v7 1/4] userfaultfd: move userfaultfd_ctx struct to header file Lokesh Gidra
@ 2024-02-15 18:27 ` Lokesh Gidra
  2024-02-15 18:27 ` [PATCH v7 3/4] mm: add vma_assert_locked() for !CONFIG_PER_VMA_LOCK Lokesh Gidra
  2024-02-15 18:27 ` [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations Lokesh Gidra
  3 siblings, 0 replies; 15+ messages in thread
From: Lokesh Gidra @ 2024-02-15 18:27 UTC (permalink / raw)
  To: akpm
  Cc: lokeshgidra, linux-fsdevel, linux-mm, linux-kernel, selinux,
	surenb, kernel-team, aarcange, peterx, david, axelrasmussen,
	bgeffon, willy, jannh, kaleshsingh, ngeoffray, timmurray, rppt,
	Liam.Howlett, ryan.roberts

Currently, increments of and loads from mmap_changing always happen
within the mmap_lock critical section. This ensures that if userspace
requests event notification for non-cooperative operations (e.g.
mremap), userfaultfd operations don't occur concurrently.

The same guarantee can be achieved by using a separate read-write
semaphore in userfaultfd_ctx, such that increments are done in
write-mode and loads in read-mode, thereby eliminating the dependency
on mmap_lock for this purpose.

This is a preparatory step before we replace mmap_lock usage with
per-vma locks in fill/move ioctls.
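
Schematically, the two sides of the new semaphore look like this
(condensed from the diff below; the operation itself is elided):

    /* non-cooperative event side (fork, mremap, remove, unmap): write mode */
    down_write(&ctx->map_changing_lock);
    atomic_inc(&ctx->mmap_changing);
    up_write(&ctx->map_changing_lock);

    /* userfaultfd operation side (fill/move/wp): read mode */
    down_read(&ctx->map_changing_lock);
    if (atomic_read(&ctx->mmap_changing))
        err = -EAGAIN;  /* a non-cooperative event is in progress, retry later */
    else
        err = ...;      /* perform the requested fill/move/wp operation */
    up_read(&ctx->map_changing_lock);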

Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 fs/userfaultfd.c              | 40 ++++++++++++----------
 include/linux/userfaultfd_k.h | 31 ++++++++++--------
 mm/userfaultfd.c              | 62 ++++++++++++++++++++---------------
 3 files changed, 75 insertions(+), 58 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 58331b83d648..c00a021bcce4 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -685,12 +685,15 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 		ctx->flags = octx->flags;
 		ctx->features = octx->features;
 		ctx->released = false;
+		init_rwsem(&ctx->map_changing_lock);
 		atomic_set(&ctx->mmap_changing, 0);
 		ctx->mm = vma->vm_mm;
 		mmgrab(ctx->mm);
 
 		userfaultfd_ctx_get(octx);
+		down_write(&octx->map_changing_lock);
 		atomic_inc(&octx->mmap_changing);
+		up_write(&octx->map_changing_lock);
 		fctx->orig = octx;
 		fctx->new = ctx;
 		list_add_tail(&fctx->list, fcs);
@@ -737,7 +740,9 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
 	if (ctx->features & UFFD_FEATURE_EVENT_REMAP) {
 		vm_ctx->ctx = ctx;
 		userfaultfd_ctx_get(ctx);
+		down_write(&ctx->map_changing_lock);
 		atomic_inc(&ctx->mmap_changing);
+		up_write(&ctx->map_changing_lock);
 	} else {
 		/* Drop uffd context if remap feature not enabled */
 		vma_start_write(vma);
@@ -783,7 +788,9 @@ bool userfaultfd_remove(struct vm_area_struct *vma,
 		return true;
 
 	userfaultfd_ctx_get(ctx);
+	down_write(&ctx->map_changing_lock);
 	atomic_inc(&ctx->mmap_changing);
+	up_write(&ctx->map_changing_lock);
 	mmap_read_unlock(mm);
 
 	msg_init(&ewq.msg);
@@ -825,7 +832,9 @@ int userfaultfd_unmap_prep(struct vm_area_struct *vma, unsigned long start,
 		return -ENOMEM;
 
 	userfaultfd_ctx_get(ctx);
+	down_write(&ctx->map_changing_lock);
 	atomic_inc(&ctx->mmap_changing);
+	up_write(&ctx->map_changing_lock);
 	unmap_ctx->ctx = ctx;
 	unmap_ctx->start = start;
 	unmap_ctx->end = end;
@@ -1709,9 +1718,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	if (uffdio_copy.mode & UFFDIO_COPY_MODE_WP)
 		flags |= MFILL_ATOMIC_WP;
 	if (mmget_not_zero(ctx->mm)) {
-		ret = mfill_atomic_copy(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
-					uffdio_copy.len, &ctx->mmap_changing,
-					flags);
+		ret = mfill_atomic_copy(ctx, uffdio_copy.dst, uffdio_copy.src,
+					uffdio_copy.len, flags);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
@@ -1761,9 +1769,8 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 		goto out;
 
 	if (mmget_not_zero(ctx->mm)) {
-		ret = mfill_atomic_zeropage(ctx->mm, uffdio_zeropage.range.start,
-					   uffdio_zeropage.range.len,
-					   &ctx->mmap_changing);
+		ret = mfill_atomic_zeropage(ctx, uffdio_zeropage.range.start,
+					   uffdio_zeropage.range.len);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
@@ -1818,9 +1825,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 		return -EINVAL;
 
 	if (mmget_not_zero(ctx->mm)) {
-		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
-					  uffdio_wp.range.len, mode_wp,
-					  &ctx->mmap_changing);
+		ret = mwriteprotect_range(ctx, uffdio_wp.range.start,
+					  uffdio_wp.range.len, mode_wp);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
@@ -1870,9 +1876,8 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 		flags |= MFILL_ATOMIC_WP;
 
 	if (mmget_not_zero(ctx->mm)) {
-		ret = mfill_atomic_continue(ctx->mm, uffdio_continue.range.start,
-					    uffdio_continue.range.len,
-					    &ctx->mmap_changing, flags);
+		ret = mfill_atomic_continue(ctx, uffdio_continue.range.start,
+					    uffdio_continue.range.len, flags);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
@@ -1925,9 +1930,8 @@ static inline int userfaultfd_poison(struct userfaultfd_ctx *ctx, unsigned long
 		goto out;
 
 	if (mmget_not_zero(ctx->mm)) {
-		ret = mfill_atomic_poison(ctx->mm, uffdio_poison.range.start,
-					  uffdio_poison.range.len,
-					  &ctx->mmap_changing, 0);
+		ret = mfill_atomic_poison(ctx, uffdio_poison.range.start,
+					  uffdio_poison.range.len, 0);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
@@ -2003,13 +2007,14 @@ static int userfaultfd_move(struct userfaultfd_ctx *ctx,
 	if (mmget_not_zero(mm)) {
 		mmap_read_lock(mm);
 
-		/* Re-check after taking mmap_lock */
+		/* Re-check after taking map_changing_lock */
+		down_read(&ctx->map_changing_lock);
 		if (likely(!atomic_read(&ctx->mmap_changing)))
 			ret = move_pages(ctx, mm, uffdio_move.dst, uffdio_move.src,
 					 uffdio_move.len, uffdio_move.mode);
 		else
 			ret = -EAGAIN;
-
+		up_read(&ctx->map_changing_lock);
 		mmap_read_unlock(mm);
 		mmput(mm);
 	} else {
@@ -2216,6 +2221,7 @@ static int new_userfaultfd(int flags)
 	ctx->flags = flags;
 	ctx->features = 0;
 	ctx->released = false;
+	init_rwsem(&ctx->map_changing_lock);
 	atomic_set(&ctx->mmap_changing, 0);
 	ctx->mm = current->mm;
 	/* prevent the mm struct to be freed */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 691d928ee864..3210c3552976 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -69,6 +69,13 @@ struct userfaultfd_ctx {
 	unsigned int features;
 	/* released */
 	bool released;
+	/*
+	 * Prevents userfaultfd operations (fill/move/wp) from happening while
+	 * some non-cooperative event(s) is taking place. Increments are done
+	 * in write-mode. Whereas, userfaultfd operations, which includes
+	 * reading mmap_changing, is done under read-mode.
+	 */
+	struct rw_semaphore map_changing_lock;
 	/* memory mappings are changing because of non-cooperative event */
 	atomic_t mmap_changing;
 	/* mm with one ore more vmas attached to this userfaultfd_ctx */
@@ -113,22 +120,18 @@ extern int mfill_atomic_install_pte(pmd_t *dst_pmd,
 				    unsigned long dst_addr, struct page *page,
 				    bool newly_allocated, uffd_flags_t flags);
 
-extern ssize_t mfill_atomic_copy(struct mm_struct *dst_mm, unsigned long dst_start,
+extern ssize_t mfill_atomic_copy(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 				 unsigned long src_start, unsigned long len,
-				 atomic_t *mmap_changing, uffd_flags_t flags);
-extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm,
+				 uffd_flags_t flags);
+extern ssize_t mfill_atomic_zeropage(struct userfaultfd_ctx *ctx,
 				     unsigned long dst_start,
-				     unsigned long len,
-				     atomic_t *mmap_changing);
-extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start,
-				     unsigned long len, atomic_t *mmap_changing,
-				     uffd_flags_t flags);
-extern ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
-				   unsigned long len, atomic_t *mmap_changing,
-				   uffd_flags_t flags);
-extern int mwriteprotect_range(struct mm_struct *dst_mm,
-			       unsigned long start, unsigned long len,
-			       bool enable_wp, atomic_t *mmap_changing);
+				     unsigned long len);
+extern ssize_t mfill_atomic_continue(struct userfaultfd_ctx *ctx, unsigned long dst_start,
+				     unsigned long len, uffd_flags_t flags);
+extern ssize_t mfill_atomic_poison(struct userfaultfd_ctx *ctx, unsigned long start,
+				   unsigned long len, uffd_flags_t flags);
+extern int mwriteprotect_range(struct userfaultfd_ctx *ctx, unsigned long start,
+			       unsigned long len, bool enable_wp);
 extern long uffd_wp_range(struct vm_area_struct *vma,
 			  unsigned long start, unsigned long len, bool enable_wp);
 
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9cc93cc1330b..74aad0831e40 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -353,11 +353,11 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
  * called with mmap_lock held, it will release mmap_lock before returning.
  */
 static __always_inline ssize_t mfill_atomic_hugetlb(
+					      struct userfaultfd_ctx *ctx,
 					      struct vm_area_struct *dst_vma,
 					      unsigned long dst_start,
 					      unsigned long src_start,
 					      unsigned long len,
-					      atomic_t *mmap_changing,
 					      uffd_flags_t flags)
 {
 	struct mm_struct *dst_mm = dst_vma->vm_mm;
@@ -379,6 +379,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 	 * feature is not supported.
 	 */
 	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
+		up_read(&ctx->map_changing_lock);
 		mmap_read_unlock(dst_mm);
 		return -EINVAL;
 	}
@@ -463,6 +464,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
+			up_read(&ctx->map_changing_lock);
 			mmap_read_unlock(dst_mm);
 			BUG_ON(!folio);
 
@@ -473,12 +475,13 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 				goto out;
 			}
 			mmap_read_lock(dst_mm);
+			down_read(&ctx->map_changing_lock);
 			/*
 			 * If memory mappings are changing because of non-cooperative
 			 * operation (e.g. mremap) running in parallel, bail out and
 			 * request the user to retry later
 			 */
-			if (mmap_changing && atomic_read(mmap_changing)) {
+			if (atomic_read(&ctx->mmap_changing)) {
 				err = -EAGAIN;
 				break;
 			}
@@ -501,6 +504,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 	}
 
 out_unlock:
+	up_read(&ctx->map_changing_lock);
 	mmap_read_unlock(dst_mm);
 out:
 	if (folio)
@@ -512,11 +516,11 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 }
 #else /* !CONFIG_HUGETLB_PAGE */
 /* fail at build time if gcc attempts to use this */
-extern ssize_t mfill_atomic_hugetlb(struct vm_area_struct *dst_vma,
+extern ssize_t mfill_atomic_hugetlb(struct userfaultfd_ctx *ctx,
+				    struct vm_area_struct *dst_vma,
 				    unsigned long dst_start,
 				    unsigned long src_start,
 				    unsigned long len,
-				    atomic_t *mmap_changing,
 				    uffd_flags_t flags);
 #endif /* CONFIG_HUGETLB_PAGE */
 
@@ -564,13 +568,13 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 	return err;
 }
 
-static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
+static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 					    unsigned long dst_start,
 					    unsigned long src_start,
 					    unsigned long len,
-					    atomic_t *mmap_changing,
 					    uffd_flags_t flags)
 {
+	struct mm_struct *dst_mm = ctx->mm;
 	struct vm_area_struct *dst_vma;
 	ssize_t err;
 	pmd_t *dst_pmd;
@@ -600,8 +604,9 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
 	 * operation (e.g. mremap) running in parallel, bail out and
 	 * request the user to retry later
 	 */
+	down_read(&ctx->map_changing_lock);
 	err = -EAGAIN;
-	if (mmap_changing && atomic_read(mmap_changing))
+	if (atomic_read(&ctx->mmap_changing))
 		goto out_unlock;
 
 	/*
@@ -633,8 +638,8 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
 	 * If this is a HUGETLB vma, pass off to appropriate routine
 	 */
 	if (is_vm_hugetlb_page(dst_vma))
-		return  mfill_atomic_hugetlb(dst_vma, dst_start, src_start,
-					     len, mmap_changing, flags);
+		return  mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
+					     src_start, len, flags);
 
 	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
@@ -693,6 +698,7 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
 		if (unlikely(err == -ENOENT)) {
 			void *kaddr;
 
+			up_read(&ctx->map_changing_lock);
 			mmap_read_unlock(dst_mm);
 			BUG_ON(!folio);
 
@@ -723,6 +729,7 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
 	}
 
 out_unlock:
+	up_read(&ctx->map_changing_lock);
 	mmap_read_unlock(dst_mm);
 out:
 	if (folio)
@@ -733,34 +740,33 @@ static __always_inline ssize_t mfill_atomic(struct mm_struct *dst_mm,
 	return copied ? copied : err;
 }
 
-ssize_t mfill_atomic_copy(struct mm_struct *dst_mm, unsigned long dst_start,
+ssize_t mfill_atomic_copy(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 			  unsigned long src_start, unsigned long len,
-			  atomic_t *mmap_changing, uffd_flags_t flags)
+			  uffd_flags_t flags)
 {
-	return mfill_atomic(dst_mm, dst_start, src_start, len, mmap_changing,
+	return mfill_atomic(ctx, dst_start, src_start, len,
 			    uffd_flags_set_mode(flags, MFILL_ATOMIC_COPY));
 }
 
-ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm, unsigned long start,
-			      unsigned long len, atomic_t *mmap_changing)
+ssize_t mfill_atomic_zeropage(struct userfaultfd_ctx *ctx,
+			      unsigned long start,
+			      unsigned long len)
 {
-	return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
+	return mfill_atomic(ctx, start, 0, len,
 			    uffd_flags_set_mode(0, MFILL_ATOMIC_ZEROPAGE));
 }
 
-ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start,
-			      unsigned long len, atomic_t *mmap_changing,
-			      uffd_flags_t flags)
+ssize_t mfill_atomic_continue(struct userfaultfd_ctx *ctx, unsigned long start,
+			      unsigned long len, uffd_flags_t flags)
 {
-	return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
+	return mfill_atomic(ctx, start, 0, len,
 			    uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE));
 }
 
-ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
-			    unsigned long len, atomic_t *mmap_changing,
-			    uffd_flags_t flags)
+ssize_t mfill_atomic_poison(struct userfaultfd_ctx *ctx, unsigned long start,
+			    unsigned long len, uffd_flags_t flags)
 {
-	return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
+	return mfill_atomic(ctx, start, 0, len,
 			    uffd_flags_set_mode(flags, MFILL_ATOMIC_POISON));
 }
 
@@ -793,10 +799,10 @@ long uffd_wp_range(struct vm_area_struct *dst_vma,
 	return ret;
 }
 
-int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
-			unsigned long len, bool enable_wp,
-			atomic_t *mmap_changing)
+int mwriteprotect_range(struct userfaultfd_ctx *ctx, unsigned long start,
+			unsigned long len, bool enable_wp)
 {
+	struct mm_struct *dst_mm = ctx->mm;
 	unsigned long end = start + len;
 	unsigned long _start, _end;
 	struct vm_area_struct *dst_vma;
@@ -820,8 +826,9 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	 * operation (e.g. mremap) running in parallel, bail out and
 	 * request the user to retry later
 	 */
+	down_read(&ctx->map_changing_lock);
 	err = -EAGAIN;
-	if (mmap_changing && atomic_read(mmap_changing))
+	if (atomic_read(&ctx->mmap_changing))
 		goto out_unlock;
 
 	err = -ENOENT;
@@ -850,6 +857,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 		err = 0;
 	}
 out_unlock:
+	up_read(&ctx->map_changing_lock);
 	mmap_read_unlock(dst_mm);
 	return err;
 }
-- 
2.43.0.687.g38aa6559b0-goog




* [PATCH v7 3/4] mm: add vma_assert_locked() for !CONFIG_PER_VMA_LOCK
  2024-02-15 18:27 [PATCH v7 0/4] per-vma locks in userfaultfd Lokesh Gidra
  2024-02-15 18:27 ` [PATCH v7 1/4] userfaultfd: move userfaultfd_ctx struct to header file Lokesh Gidra
  2024-02-15 18:27 ` [PATCH v7 2/4] userfaultfd: protect mmap_changing with rw_sem in userfaultfd_ctx Lokesh Gidra
@ 2024-02-15 18:27 ` Lokesh Gidra
  2024-02-15 22:19   ` Suren Baghdasaryan
  2024-02-15 18:27 ` [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations Lokesh Gidra
  3 siblings, 1 reply; 15+ messages in thread
From: Lokesh Gidra @ 2024-02-15 18:27 UTC (permalink / raw)
  To: akpm
  Cc: lokeshgidra, linux-fsdevel, linux-mm, linux-kernel, selinux,
	surenb, kernel-team, aarcange, peterx, david, axelrasmussen,
	bgeffon, willy, jannh, kaleshsingh, ngeoffray, timmurray, rppt,
	Liam.Howlett, ryan.roberts

vma_assert_locked() is needed to replace mmap_assert_locked() once we
start using per-vma locks in userfaultfd operations.

In the !CONFIG_PER_VMA_LOCK case, when the mm is locked, it implies that
the given VMA is locked.

Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
---
 include/linux/mm.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3c85634b186c..5ece3ad34ef8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -781,6 +781,11 @@ static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	return NULL;
 }
 
+static inline void vma_assert_locked(struct vm_area_struct *vma)
+{
+	mmap_assert_locked(vma->vm_mm);
+}
+
 static inline void release_fault_lock(struct vm_fault *vmf)
 {
 	mmap_read_unlock(vmf->vma->vm_mm);
-- 
2.43.0.687.g38aa6559b0-goog




* [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations
  2024-02-15 18:27 [PATCH v7 0/4] per-vma locks in userfaultfd Lokesh Gidra
                   ` (2 preceding siblings ...)
  2024-02-15 18:27 ` [PATCH v7 3/4] mm: add vma_assert_locked() for !CONFIG_PER_VMA_LOCK Lokesh Gidra
@ 2024-02-15 18:27 ` Lokesh Gidra
  2025-01-23  4:14   ` Barry Song
  3 siblings, 1 reply; 15+ messages in thread
From: Lokesh Gidra @ 2024-02-15 18:27 UTC (permalink / raw)
  To: akpm
  Cc: lokeshgidra, linux-fsdevel, linux-mm, linux-kernel, selinux,
	surenb, kernel-team, aarcange, peterx, david, axelrasmussen,
	bgeffon, willy, jannh, kaleshsingh, ngeoffray, timmurray, rppt,
	Liam.Howlett, ryan.roberts

All userfaultfd operations, except write-protect, opportunistically use
per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
critical section.

Write-protect operation requires mmap_lock as it iterates over multiple
vmas.
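
Condensed from the lock_vma() helper added below, the opportunistic
locking works roughly as follows in the CONFIG_PER_VMA_LOCK case (error
handling trimmed):

    vma = lock_vma_under_rcu(mm, address);    /* fast path: no mmap_lock */
    if (vma && !(vma->vm_flags & VM_SHARED) && unlikely(!vma->anon_vma)) {
        /* private vma without anon_vma: prepare it under mmap_lock below */
        vma_end_read(vma);
        vma = NULL;
    }
    if (!vma) {
        mmap_read_lock(mm);
        vma = find_vma_and_prepare_anon(mm, address);
        if (!IS_ERR(vma))
            /* lock the vma while mmap_lock keeps writers away */
            down_read(&vma->vm_lock->lock);
        mmap_read_unlock(mm);
    }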

Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 fs/userfaultfd.c              |  13 +-
 include/linux/userfaultfd_k.h |   5 +-
 mm/huge_memory.c              |   5 +-
 mm/userfaultfd.c              | 380 ++++++++++++++++++++++++++--------
 4 files changed, 299 insertions(+), 104 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index c00a021bcce4..60dcfafdc11a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -2005,17 +2005,8 @@ static int userfaultfd_move(struct userfaultfd_ctx *ctx,
 		return -EINVAL;
 
 	if (mmget_not_zero(mm)) {
-		mmap_read_lock(mm);
-
-		/* Re-check after taking map_changing_lock */
-		down_read(&ctx->map_changing_lock);
-		if (likely(!atomic_read(&ctx->mmap_changing)))
-			ret = move_pages(ctx, mm, uffdio_move.dst, uffdio_move.src,
-					 uffdio_move.len, uffdio_move.mode);
-		else
-			ret = -EAGAIN;
-		up_read(&ctx->map_changing_lock);
-		mmap_read_unlock(mm);
+		ret = move_pages(ctx, uffdio_move.dst, uffdio_move.src,
+				 uffdio_move.len, uffdio_move.mode);
 		mmput(mm);
 	} else {
 		return -ESRCH;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 3210c3552976..05d59f74fc88 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -138,9 +138,8 @@ extern long uffd_wp_range(struct vm_area_struct *vma,
 /* move_pages */
 void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
 void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
-ssize_t move_pages(struct userfaultfd_ctx *ctx, struct mm_struct *mm,
-		   unsigned long dst_start, unsigned long src_start,
-		   unsigned long len, __u64 flags);
+ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
+		   unsigned long src_start, unsigned long len, __u64 flags);
 int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pmd_t dst_pmdval,
 			struct vm_area_struct *dst_vma,
 			struct vm_area_struct *src_vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 016e20bd813e..c337ebb4f7ab 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2158,7 +2158,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 #ifdef CONFIG_USERFAULTFD
 /*
- * The PT lock for src_pmd and the mmap_lock for reading are held by
+ * The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by
  * the caller, but it must return after releasing the page_table_lock.
  * Just move the page from src_pmd to dst_pmd if possible.
  * Return zero if succeeded in moving the page, -EAGAIN if it needs to be
@@ -2181,7 +2181,8 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 	src_ptl = pmd_lockptr(mm, src_pmd);
 
 	lockdep_assert_held(src_ptl);
-	mmap_assert_locked(mm);
+	vma_assert_locked(src_vma);
+	vma_assert_locked(dst_vma);
 
 	/* Sanity checks before the operation */
 	if (WARN_ON_ONCE(!pmd_none(dst_pmdval)) || WARN_ON_ONCE(src_addr & ~HPAGE_PMD_MASK) ||
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 74aad0831e40..4744d6a96f96 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -20,19 +20,11 @@
 #include "internal.h"
 
 static __always_inline
-struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
-				    unsigned long dst_start,
-				    unsigned long len)
+bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
 {
-	/*
-	 * Make sure that the dst range is both valid and fully within a
-	 * single existing vma.
-	 */
-	struct vm_area_struct *dst_vma;
-
-	dst_vma = find_vma(dst_mm, dst_start);
-	if (!range_in_vma(dst_vma, dst_start, dst_start + len))
-		return NULL;
+	/* Make sure that the dst range is fully within dst_vma. */
+	if (dst_end > dst_vma->vm_end)
+		return false;
 
 	/*
 	 * Check the vma is registered in uffd, this is required to
@@ -40,11 +32,122 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
 	 * time.
 	 */
 	if (!dst_vma->vm_userfaultfd_ctx.ctx)
-		return NULL;
+		return false;
+
+	return true;
+}
+
+static __always_inline
+struct vm_area_struct *find_vma_and_prepare_anon(struct mm_struct *mm,
+						 unsigned long addr)
+{
+	struct vm_area_struct *vma;
+
+	mmap_assert_locked(mm);
+	vma = vma_lookup(mm, addr);
+	if (!vma)
+		vma = ERR_PTR(-ENOENT);
+	else if (!(vma->vm_flags & VM_SHARED) &&
+		 unlikely(anon_vma_prepare(vma)))
+		vma = ERR_PTR(-ENOMEM);
+
+	return vma;
+}
+
+#ifdef CONFIG_PER_VMA_LOCK
+/*
+ * lock_vma() - Lookup and lock vma corresponding to @address.
+ * @mm: mm to search vma in.
+ * @address: address that the vma should contain.
+ *
+ * Should be called without holding mmap_lock. vma should be unlocked after use
+ * with unlock_vma().
+ *
+ * Return: A locked vma containing @address, -ENOENT if no vma is found, or
+ * -ENOMEM if anon_vma couldn't be allocated.
+ */
+static struct vm_area_struct *lock_vma(struct mm_struct *mm,
+				       unsigned long address)
+{
+	struct vm_area_struct *vma;
+
+	vma = lock_vma_under_rcu(mm, address);
+	if (vma) {
+		/*
+		 * lock_vma_under_rcu() only checks anon_vma for private
+		 * anonymous mappings. But we need to ensure it is assigned in
+		 * private file-backed vmas as well.
+		 */
+		if (!(vma->vm_flags & VM_SHARED) && unlikely(!vma->anon_vma))
+			vma_end_read(vma);
+		else
+			return vma;
+	}
+
+	mmap_read_lock(mm);
+	vma = find_vma_and_prepare_anon(mm, address);
+	if (!IS_ERR(vma)) {
+		/*
+		 * We cannot use vma_start_read() as it may fail due to
+		 * false locked (see comment in vma_start_read()). We
+		 * can avoid that by directly locking vm_lock under
+		 * mmap_lock, which guarantees that nobody can lock the
+		 * vma for write (vma_start_write()) under us.
+		 */
+		down_read(&vma->vm_lock->lock);
+	}
+
+	mmap_read_unlock(mm);
+	return vma;
+}
+
+static struct vm_area_struct *uffd_mfill_lock(struct mm_struct *dst_mm,
+					      unsigned long dst_start,
+					      unsigned long len)
+{
+	struct vm_area_struct *dst_vma;
 
+	dst_vma = lock_vma(dst_mm, dst_start);
+	if (IS_ERR(dst_vma) || validate_dst_vma(dst_vma, dst_start + len))
+		return dst_vma;
+
+	vma_end_read(dst_vma);
+	return ERR_PTR(-ENOENT);
+}
+
+static void uffd_mfill_unlock(struct vm_area_struct *vma)
+{
+	vma_end_read(vma);
+}
+
+#else
+
+static struct vm_area_struct *uffd_mfill_lock(struct mm_struct *dst_mm,
+					      unsigned long dst_start,
+					      unsigned long len)
+{
+	struct vm_area_struct *dst_vma;
+
+	mmap_read_lock(dst_mm);
+	dst_vma = find_vma_and_prepare_anon(dst_mm, dst_start);
+	if (IS_ERR(dst_vma))
+		goto out_unlock;
+
+	if (validate_dst_vma(dst_vma, dst_start + len))
+		return dst_vma;
+
+	dst_vma = ERR_PTR(-ENOENT);
+out_unlock:
+	mmap_read_unlock(dst_mm);
 	return dst_vma;
 }
 
+static void uffd_mfill_unlock(struct vm_area_struct *vma)
+{
+	mmap_read_unlock(vma->vm_mm);
+}
+#endif
+
 /* Check if dst_addr is outside of file's size. Must be called with ptl held. */
 static bool mfill_file_over_size(struct vm_area_struct *dst_vma,
 				 unsigned long dst_addr)
@@ -350,7 +453,8 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 #ifdef CONFIG_HUGETLB_PAGE
 /*
  * mfill_atomic processing for HUGETLB vmas.  Note that this routine is
- * called with mmap_lock held, it will release mmap_lock before returning.
+ * called with either vma-lock or mmap_lock held, it will release the lock
+ * before returning.
  */
 static __always_inline ssize_t mfill_atomic_hugetlb(
 					      struct userfaultfd_ctx *ctx,
@@ -361,7 +465,6 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 					      uffd_flags_t flags)
 {
 	struct mm_struct *dst_mm = dst_vma->vm_mm;
-	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	ssize_t err;
 	pte_t *dst_pte;
 	unsigned long src_addr, dst_addr;
@@ -380,7 +483,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 	 */
 	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
 		up_read(&ctx->map_changing_lock);
-		mmap_read_unlock(dst_mm);
+		uffd_mfill_unlock(dst_vma);
 		return -EINVAL;
 	}
 
@@ -403,24 +506,28 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 	 * retry, dst_vma will be set to NULL and we must lookup again.
 	 */
 	if (!dst_vma) {
+		dst_vma = uffd_mfill_lock(dst_mm, dst_start, len);
+		if (IS_ERR(dst_vma)) {
+			err = PTR_ERR(dst_vma);
+			goto out;
+		}
+
 		err = -ENOENT;
-		dst_vma = find_dst_vma(dst_mm, dst_start, len);
-		if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
-			goto out_unlock;
+		if (!is_vm_hugetlb_page(dst_vma))
+			goto out_unlock_vma;
 
 		err = -EINVAL;
 		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
-			goto out_unlock;
-
-		vm_shared = dst_vma->vm_flags & VM_SHARED;
-	}
+			goto out_unlock_vma;
 
-	/*
-	 * If not shared, ensure the dst_vma has a anon_vma.
-	 */
-	err = -ENOMEM;
-	if (!vm_shared) {
-		if (unlikely(anon_vma_prepare(dst_vma)))
+		/*
+		 * If memory mappings are changing because of non-cooperative
+		 * operation (e.g. mremap) running in parallel, bail out and
+		 * request the user to retry later
+		 */
+		down_read(&ctx->map_changing_lock);
+		err = -EAGAIN;
+		if (atomic_read(&ctx->mmap_changing))
 			goto out_unlock;
 	}
 
@@ -465,7 +572,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 
 		if (unlikely(err == -ENOENT)) {
 			up_read(&ctx->map_changing_lock);
-			mmap_read_unlock(dst_mm);
+			uffd_mfill_unlock(dst_vma);
 			BUG_ON(!folio);
 
 			err = copy_folio_from_user(folio,
@@ -474,17 +581,6 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 				err = -EFAULT;
 				goto out;
 			}
-			mmap_read_lock(dst_mm);
-			down_read(&ctx->map_changing_lock);
-			/*
-			 * If memory mappings are changing because of non-cooperative
-			 * operation (e.g. mremap) running in parallel, bail out and
-			 * request the user to retry later
-			 */
-			if (atomic_read(&ctx->mmap_changing)) {
-				err = -EAGAIN;
-				break;
-			}
 
 			dst_vma = NULL;
 			goto retry;
@@ -505,7 +601,8 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 
 out_unlock:
 	up_read(&ctx->map_changing_lock);
-	mmap_read_unlock(dst_mm);
+out_unlock_vma:
+	uffd_mfill_unlock(dst_vma);
 out:
 	if (folio)
 		folio_put(folio);
@@ -597,7 +694,15 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	copied = 0;
 	folio = NULL;
 retry:
-	mmap_read_lock(dst_mm);
+	/*
+	 * Make sure the vma is not shared, that the dst range is
+	 * both valid and fully within a single existing vma.
+	 */
+	dst_vma = uffd_mfill_lock(dst_mm, dst_start, len);
+	if (IS_ERR(dst_vma)) {
+		err = PTR_ERR(dst_vma);
+		goto out;
+	}
 
 	/*
 	 * If memory mappings are changing because of non-cooperative
@@ -609,15 +714,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	if (atomic_read(&ctx->mmap_changing))
 		goto out_unlock;
 
-	/*
-	 * Make sure the vma is not shared, that the dst range is
-	 * both valid and fully within a single existing vma.
-	 */
-	err = -ENOENT;
-	dst_vma = find_dst_vma(dst_mm, dst_start, len);
-	if (!dst_vma)
-		goto out_unlock;
-
 	err = -EINVAL;
 	/*
 	 * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
@@ -647,16 +743,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
 		goto out_unlock;
 
-	/*
-	 * Ensure the dst_vma has a anon_vma or this page
-	 * would get a NULL anon_vma when moved in the
-	 * dst_vma.
-	 */
-	err = -ENOMEM;
-	if (!(dst_vma->vm_flags & VM_SHARED) &&
-	    unlikely(anon_vma_prepare(dst_vma)))
-		goto out_unlock;
-
 	while (src_addr < src_start + len) {
 		pmd_t dst_pmdval;
 
@@ -699,7 +785,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 			void *kaddr;
 
 			up_read(&ctx->map_changing_lock);
-			mmap_read_unlock(dst_mm);
+			uffd_mfill_unlock(dst_vma);
 			BUG_ON(!folio);
 
 			kaddr = kmap_local_folio(folio, 0);
@@ -730,7 +816,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 
 out_unlock:
 	up_read(&ctx->map_changing_lock);
-	mmap_read_unlock(dst_mm);
+	uffd_mfill_unlock(dst_vma);
 out:
 	if (folio)
 		folio_put(folio);
@@ -1267,27 +1353,136 @@ static int validate_move_areas(struct userfaultfd_ctx *ctx,
 	if (!vma_is_anonymous(src_vma) || !vma_is_anonymous(dst_vma))
 		return -EINVAL;
 
+	return 0;
+}
+
+static __always_inline
+int find_vmas_mm_locked(struct mm_struct *mm,
+			unsigned long dst_start,
+			unsigned long src_start,
+			struct vm_area_struct **dst_vmap,
+			struct vm_area_struct **src_vmap)
+{
+	struct vm_area_struct *vma;
+
+	mmap_assert_locked(mm);
+	vma = find_vma_and_prepare_anon(mm, dst_start);
+	if (IS_ERR(vma))
+		return PTR_ERR(vma);
+
+	*dst_vmap = vma;
+	/* Skip finding src_vma if src_start is in dst_vma */
+	if (src_start >= vma->vm_start && src_start < vma->vm_end)
+		goto out_success;
+
+	vma = vma_lookup(mm, src_start);
+	if (!vma)
+		return -ENOENT;
+out_success:
+	*src_vmap = vma;
+	return 0;
+}
+
+#ifdef CONFIG_PER_VMA_LOCK
+static int uffd_move_lock(struct mm_struct *mm,
+			  unsigned long dst_start,
+			  unsigned long src_start,
+			  struct vm_area_struct **dst_vmap,
+			  struct vm_area_struct **src_vmap)
+{
+	struct vm_area_struct *vma;
+	int err;
+
+	vma = lock_vma(mm, dst_start);
+	if (IS_ERR(vma))
+		return PTR_ERR(vma);
+
+	*dst_vmap = vma;
 	/*
-	 * Ensure the dst_vma has a anon_vma or this page
-	 * would get a NULL anon_vma when moved in the
-	 * dst_vma.
+	 * Skip finding src_vma if src_start is in dst_vma. This also ensures
+	 * that we don't lock the same vma twice.
 	 */
-	if (unlikely(anon_vma_prepare(dst_vma)))
-		return -ENOMEM;
+	if (src_start >= vma->vm_start && src_start < vma->vm_end) {
+		*src_vmap = vma;
+		return 0;
+	}
 
-	return 0;
+	/*
+	 * Using lock_vma() to get src_vma can lead to following deadlock:
+	 *
+	 * Thread1				Thread2
+	 * -------				-------
+	 * vma_start_read(dst_vma)
+	 *					mmap_write_lock(mm)
+	 *					vma_start_write(src_vma)
+	 * vma_start_read(src_vma)
+	 * mmap_read_lock(mm)
+	 *					vma_start_write(dst_vma)
+	 */
+	*src_vmap = lock_vma_under_rcu(mm, src_start);
+	if (likely(*src_vmap))
+		return 0;
+
+	/* Undo any locking and retry in mmap_lock critical section */
+	vma_end_read(*dst_vmap);
+
+	mmap_read_lock(mm);
+	err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
+	if (!err) {
+		/*
+		 * See comment in lock_vma() as to why not using
+		 * vma_start_read() here.
+		 */
+		down_read(&(*dst_vmap)->vm_lock->lock);
+		if (*dst_vmap != *src_vmap)
+			down_read(&(*src_vmap)->vm_lock->lock);
+	}
+	mmap_read_unlock(mm);
+	return err;
+}
+
+static void uffd_move_unlock(struct vm_area_struct *dst_vma,
+			     struct vm_area_struct *src_vma)
+{
+	vma_end_read(src_vma);
+	if (src_vma != dst_vma)
+		vma_end_read(dst_vma);
 }
 
+#else
+
+static int uffd_move_lock(struct mm_struct *mm,
+			  unsigned long dst_start,
+			  unsigned long src_start,
+			  struct vm_area_struct **dst_vmap,
+			  struct vm_area_struct **src_vmap)
+{
+	int err;
+
+	mmap_read_lock(mm);
+	err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
+	if (err)
+		mmap_read_unlock(mm);
+	return err;
+}
+
+static void uffd_move_unlock(struct vm_area_struct *dst_vma,
+			     struct vm_area_struct *src_vma)
+{
+	mmap_assert_locked(src_vma->vm_mm);
+	mmap_read_unlock(dst_vma->vm_mm);
+}
+#endif
+
 /**
  * move_pages - move arbitrary anonymous pages of an existing vma
  * @ctx: pointer to the userfaultfd context
- * @mm: the address space to move pages
  * @dst_start: start of the destination virtual memory range
  * @src_start: start of the source virtual memory range
  * @len: length of the virtual memory range
  * @mode: flags from uffdio_move.mode
  *
- * Must be called with mmap_lock held for read.
+ * It will either use the mmap_lock in read mode or per-vma locks
  *
  * move_pages() remaps arbitrary anonymous pages atomically in zero
  * copy. It only works on non shared anonymous pages because those can
@@ -1355,10 +1550,10 @@ static int validate_move_areas(struct userfaultfd_ctx *ctx,
  * could be obtained. This is the only additional complexity added to
  * the rmap code to provide this anonymous page remapping functionality.
  */
-ssize_t move_pages(struct userfaultfd_ctx *ctx, struct mm_struct *mm,
-		   unsigned long dst_start, unsigned long src_start,
-		   unsigned long len, __u64 mode)
+ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
+		   unsigned long src_start, unsigned long len, __u64 mode)
 {
+	struct mm_struct *mm = ctx->mm;
 	struct vm_area_struct *src_vma, *dst_vma;
 	unsigned long src_addr, dst_addr;
 	pmd_t *src_pmd, *dst_pmd;
@@ -1376,28 +1571,34 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, struct mm_struct *mm,
 	    WARN_ON_ONCE(dst_start + len <= dst_start))
 		goto out;
 
+	err = uffd_move_lock(mm, dst_start, src_start, &dst_vma, &src_vma);
+	if (err)
+		goto out;
+
+	/* Re-check after taking map_changing_lock */
+	err = -EAGAIN;
+	down_read(&ctx->map_changing_lock);
+	if (likely(atomic_read(&ctx->mmap_changing)))
+		goto out_unlock;
 	/*
 	 * Make sure the vma is not shared, that the src and dst remap
 	 * ranges are both valid and fully within a single existing
 	 * vma.
 	 */
-	src_vma = find_vma(mm, src_start);
-	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
-		goto out;
-	if (src_start < src_vma->vm_start ||
-	    src_start + len > src_vma->vm_end)
-		goto out;
+	err = -EINVAL;
+	if (src_vma->vm_flags & VM_SHARED)
+		goto out_unlock;
+	if (src_start + len > src_vma->vm_end)
+		goto out_unlock;
 
-	dst_vma = find_vma(mm, dst_start);
-	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
-		goto out;
-	if (dst_start < dst_vma->vm_start ||
-	    dst_start + len > dst_vma->vm_end)
-		goto out;
+	if (dst_vma->vm_flags & VM_SHARED)
+		goto out_unlock;
+	if (dst_start + len > dst_vma->vm_end)
+		goto out_unlock;
 
 	err = validate_move_areas(ctx, src_vma, dst_vma);
 	if (err)
-		goto out;
+		goto out_unlock;
 
 	for (src_addr = src_start, dst_addr = dst_start;
 	     src_addr < src_start + len;) {
@@ -1514,6 +1715,9 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, struct mm_struct *mm,
 		moved += step_size;
 	}
 
+out_unlock:
+	up_read(&ctx->map_changing_lock);
+	uffd_move_unlock(dst_vma, src_vma);
 out:
 	VM_WARN_ON(moved < 0);
 	VM_WARN_ON(err > 0);
-- 
2.43.0.687.g38aa6559b0-goog




* Re: [PATCH v7 3/4] mm: add vma_assert_locked() for !CONFIG_PER_VMA_LOCK
  2024-02-15 18:27 ` [PATCH v7 3/4] mm: add vma_assert_locked() for !CONFIG_PER_VMA_LOCK Lokesh Gidra
@ 2024-02-15 22:19   ` Suren Baghdasaryan
  0 siblings, 0 replies; 15+ messages in thread
From: Suren Baghdasaryan @ 2024-02-15 22:19 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: akpm, linux-fsdevel, linux-mm, linux-kernel, selinux,
	kernel-team, aarcange, peterx, david, axelrasmussen, bgeffon,
	willy, jannh, kaleshsingh, ngeoffray, timmurray, rppt,
	Liam.Howlett, ryan.roberts

On Thu, Feb 15, 2024 at 10:28 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> vma_assert_locked() is needed to replace mmap_assert_locked() once we
> start using per-vma locks in userfaultfd operations.
>
> In !CONFIG_PER_VMA_LOCK case when mm is locked, it implies that the
> given VMA is locked.

Yes, makes sense. With per-vma locks used in more places, this makes
replacing mmap_assert_locked() with vma_assert_locked() very
straightforward.

>
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  include/linux/mm.h | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3c85634b186c..5ece3ad34ef8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -781,6 +781,11 @@ static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>         return NULL;
>  }
>
> +static inline void vma_assert_locked(struct vm_area_struct *vma)
> +{
> +       mmap_assert_locked(vma->vm_mm);
> +}
> +
>  static inline void release_fault_lock(struct vm_fault *vmf)
>  {
>         mmap_read_unlock(vmf->vma->vm_mm);
> --
> 2.43.0.687.g38aa6559b0-goog
>



* Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations
  2024-02-15 18:27 ` [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations Lokesh Gidra
@ 2025-01-23  4:14   ` Barry Song
  2025-01-23 16:48     ` Lorenzo Stoakes
                       ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Barry Song @ 2025-01-23  4:14 UTC (permalink / raw)
  To: lokeshgidra
  Cc: Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon, david,
	jannh, kaleshsingh, kernel-team, linux-fsdevel, linux-kernel,
	linux-mm, ngeoffray, peterx, rppt, ryan.roberts, selinux, surenb,
	timmurray, willy

> All userfaultfd operations, except write-protect, opportunistically use
> per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> critical section.
> 
> Write-protect operation requires mmap_lock as it iterates over multiple
> vmas.

Hi Lokesh,

Apologies for reviving this old thread. We truly appreciate the excellent work
you’ve done in transitioning many userfaultfd operations to per-VMA locks.

However, we’ve noticed that userfaultfd still remains one of the largest users
of mmap_lock for write operations; the other large user, binder, was recently
addressed by Carlos Llamas's "binder: faster page installations" series:

https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@google.com/

The HeapTaskDaemon (Java GC) might frequently perform userfaultfd_register()
and userfaultfd_unregister() operations, both of which require the mmap_lock
in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
lower-priority background task, there are cases where, after acquiring the
mmap_lock, it gets preempted by other tasks. As a result, even high-priority
threads waiting for the mmap_lock, whether in writer or reader mode, can
end up experiencing significant delays (the delay can reach several hundred
milliseconds in the worst case).

We haven’t yet identified an ideal solution for this. However, the Java heap
appears to behave like a "volatile" vma in its usage. A somewhat simplistic
idea would be to designate a specific region of the user address space as
"volatile" and restrict all "volatile" VMAs to this isolated region.

We could add a MAP_VOLATILE flag to mmap. VMAs created with this flag would be
mapped to the volatile space, while those without it would be mapped to the
non-volatile space.

         ┌────────────┐TASK_SIZE             
         │            │                      
         │            │                      
         │            │mmap VOLATILE         
         ├────────────┤                      
         │            │                      
         │            │                      
         │            │                      
         │            │                      
         │            │default mmap          
         │            │                      
         │            │                      
         └────────────┘   

VMAs in the volatile region are assigned their own volatile_mmap_lock,
which is independent of the mmap_lock for the non-volatile region.
Additionally, we ensure that no single VMA spans the boundary between
the volatile and non-volatile regions. This separation prevents the
frequent modifications of a small number of volatile VMAs from blocking
other operations on a large number of non-volatile VMAs.

The implementation itself wouldn’t be overly complex, but the design
might come across as somewhat hacky.
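
Purely as a hypothetical illustration of the proposed interface
(MAP_VOLATILE is not an existing flag; the behaviour in the comment is
the proposal above, not current kernel semantics):

    /* hypothetical userspace usage; MAP_VOLATILE does not exist today */
    void *heap = mmap(NULL, heap_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_VOLATILE, -1, 0);
    /*
     * VMAs created this way would live in the volatile region and be
     * serialized by volatile_mmap_lock, so frequent uffd
     * register/unregister splits and merges would not block mmap_lock
     * users in the non-volatile region.
     */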

Lastly, I have two questions:

1. Have you observed similar issues where userfaultfd continues to
cause lock contention and priority inversion?

2. If so, do you have any ideas or suggestions on how to address this
problem?

> 
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
>  fs/userfaultfd.c              |  13 +-
>  include/linux/userfaultfd_k.h |   5 +-
>  mm/huge_memory.c              |   5 +-
>  mm/userfaultfd.c              | 380 ++++++++++++++++++++++++++--------
>  4 files changed, 299 insertions(+), 104 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index c00a021bcce4..60dcfafdc11a 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c

Thanks
Barry




* Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations
  2025-01-23  4:14   ` Barry Song
@ 2025-01-23 16:48     ` Lorenzo Stoakes
  2025-01-23 17:07       ` Lorenzo Stoakes
  2025-01-23 17:14       ` David Hildenbrand
  2025-01-23 16:52     ` Liam R. Howlett
  2025-01-23 17:14     ` Matthew Wilcox
  2 siblings, 2 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2025-01-23 16:48 UTC (permalink / raw)
  To: Barry Song
  Cc: lokeshgidra, Liam.Howlett, aarcange, akpm, axelrasmussen,
	bgeffon, david, jannh, kaleshsingh, kernel-team, linux-fsdevel,
	linux-kernel, linux-mm, ngeoffray, peterx, rppt, ryan.roberts,
	selinux, surenb, timmurray, willy

+cc vbabka

I realise this is resurrecting an old thread, but it's helpful to cc us mmap
maintainers, as my mail at least is a nightmare :P Thanks!

On Thu, Jan 23, 2025 at 05:14:27PM +1300, Barry Song wrote:
> > All userfaultfd operations, except write-protect, opportunistically use
> > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> > critical section.
> >
> > Write-protect operation requires mmap_lock as it iterates over multiple
> > vmas.
>
> Hi Lokesh,
>
> Apologies for reviving this old thread. We truly appreciate the excellent work
> you’ve done in transitioning many userfaultfd operations to per-VMA locks.

Let me also say - thanks Lokesh!

>
> However, we’ve noticed that userfaultfd still remains one of the largest users
> of mmap_lock for write operations, with the other—binder—having been recently
> addressed by Carlos Llamas's "binder: faster page installations" series:
>
> https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@google.com/
>
> The HeapTaskDaemon(Java GC) might frequently perform userfaultfd_register()
> and userfaultfd_unregister() operations, both of which require the mmap_lock
> in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
> lower-priority background task, there are cases where, after acquiring the
> mmap_lock, it gets preempted by other tasks. As a result, even high-priority
> threads waiting for the mmap_lock — whether in writer or reader mode—can
> end up experiencing significant delays(The delay can reach several hundred
> milliseconds in the worst case.)

Thanks for reporting - this strikes me as an important report, but I'm not
sure about your proposed solution :)

However I wonder if we can't look at more efficiently handling the locking
around uffd register/unregister?

I think uffd suffers from a complexity problem, in that there are multiple
moving parts and complicated interactions especially for each of the
different kinds of events uffd can deal with.

Also ordering matters a great deal within uffd. Ideally we'd not have uffd
effectively have open-coded hooks all over the mm code, but that ship
sailed long ago and so we need to really carefully assess how locking
changes impacts uffd behaviour.

>
> We haven’t yet identified an ideal solution for this. However, the Java heap
> appears to behave like a "volatile" vma in its usage. A somewhat simplistic
> idea would be to designate a specific region of the user address space as
> "volatile" and restrict all "volatile" VMAs to this isolated region.
>
> We may have a MAP_VOLATILE flag to mmap. VMA regions with this flag will be
> mapped to the volatile space, while those without it will be mapped to the
> non-volatile space.

This feels like a major thing to do just to suit specific uffd usages, a
feature that not everybody makes use of.

The virtual address space is a somewhat sensitive subject (see the
relatively recent discussion on a proposed MAP_BELOW flag), and has a lot
of landmines you can step on.

How would this range be determined? How would this interact with people
doing 'interesting' things with MAP_FIXED mappings? What if somebody
MAP_FIXED into a volatile region and gets this behaviour by mistake?

How do the two locks interact with one another and vma locks? I mean we
already have a very complicated set of interactions between the mmap sem
and vma locks where the mmap sem is taken to lock the mmap (of course), but
now we'd have two?

Also you have to write lock when you modify the VMA tree, would we have two
maple trees now? And what about the overhead?

I feel like this is not the right route for this.

>
>          ┌────────────┐TASK_SIZE
>          │            │
>          │            │
>          │            │mmap VOLATILE
>          ┼────────────┤
>          │            │
>          │            │
>          │            │
>          │            │
>          │            │default mmap
>          │            │
>          │            │
>          └────────────┘
>
> VMAs in the volatile region are assigned their own volatile_mmap_lock,
> which is independent of the mmap_lock for the non-volatile region.
> Additionally, we ensure that no single VMA spans the boundary between
> the volatile and non-volatile regions. This separation prevents the
> frequent modifications of a small number of volatile VMAs from blocking
> other operations on a large number of non-volatile VMAs.

I think really overall this will be solving one can of worms by introducing
another can of very large worms in space :P but perhaps I am missing
details here.

>
> The implementation itself wouldn’t be overly complex, but the design
> might come across as somewhat hacky.
>
> Lastly, I have two questions:
>
> 1. Have you observed similar issues where userfaultfd continues to
> cause lock contention and priority inversion?
>
> 2. If so, do you have any ideas or suggestions on how to address this
> problem?

Addressing and investigating this however is VERY IMPORTANT. Let's perhaps
investigate how we might find a uffd-specific means of addressing these
problems.

>
> >
> > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > ---
> >  fs/userfaultfd.c              |  13 +-
> >  include/linux/userfaultfd_k.h |   5 +-
> >  mm/huge_memory.c              |   5 +-
> >  mm/userfaultfd.c              | 380 ++++++++++++++++++++++++++--------
> >  4 files changed, 299 insertions(+), 104 deletions(-)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index c00a021bcce4..60dcfafdc11a 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
>
> Thanks
> Barry
>

Cheers, Lorenzo



* Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations
  2025-01-23  4:14   ` Barry Song
  2025-01-23 16:48     ` Lorenzo Stoakes
@ 2025-01-23 16:52     ` Liam R. Howlett
  2025-01-23 18:45       ` Lokesh Gidra
  2025-01-23 17:14     ` Matthew Wilcox
  2 siblings, 1 reply; 15+ messages in thread
From: Liam R. Howlett @ 2025-01-23 16:52 UTC (permalink / raw)
  To: Barry Song
  Cc: lokeshgidra, aarcange, akpm, axelrasmussen, bgeffon, david,
	jannh, kaleshsingh, kernel-team, linux-fsdevel, linux-kernel,
	linux-mm, ngeoffray, peterx, rppt, ryan.roberts, selinux, surenb,
	timmurray, willy

* Barry Song <21cnbao@gmail.com> [250122 23:14]:
> > All userfaultfd operations, except write-protect, opportunistically use
> > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> > critical section.
> > 
> > Write-protect operation requires mmap_lock as it iterates over multiple
> > vmas.
>
> Hi Lokesh,
> 
> Apologies for reviving this old thread. We truly appreciate the excellent work
> you’ve done in transitioning many userfaultfd operations to per-VMA locks.
> 
> However, we’ve noticed that userfaultfd still remains one of the largest users
> of mmap_lock for write operations, with the other—binder—having been recently
> addressed by Carlos Llamas's "binder: faster page installations" series:
> 
> https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@google.com/
> 
> The HeapTaskDaemon (Java GC) might frequently perform userfaultfd_register()
> and userfaultfd_unregister() operations, both of which require the mmap_lock
> in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
> lower-priority background task, there are cases where, after acquiring the
> mmap_lock, it gets preempted by other tasks. As a result, even high-priority
> threads waiting for the mmap_lock — whether in writer or reader mode — can
> end up experiencing significant delays (the delay can reach several hundred
> milliseconds in the worst case).

This needs an RFC or proposal or a discussion - certainly not a reply to
an old v7 patch set.  I'd want neon lights and stuff directing people to
this topic.

> 
> We haven’t yet identified an ideal solution for this. However, the Java heap
> appears to behave like a "volatile" vma in its usage. A somewhat simplistic
> idea would be to designate a specific region of the user address space as
> "volatile" and restrict all "volatile" VMAs to this isolated region.

I'm going to assume the uffd changes are in the volatile area?  But
really, maybe you mean the opposite...  I'll just assume I guessed
correctly here.  Because both sides of this are competing for the write
lock.

> 
> We may have a MAP_VOLATILE flag to mmap. VMA regions with this flag will be
> mapped to the volatile space, while those without it will be mapped to the
> non-volatile space.
> 
>          ┌────────────┐TASK_SIZE             
>          │            │                      
>          │            │                      
>          │            │mmap VOLATILE         
>          ┼────────────┤                      
>          │            │                      
>          │            │                      
>          │            │                      
>          │            │                      
>          │            │default mmap          
>          │            │                      
>          │            │                      
>          └────────────┘   

No, this is way too complicated for what you are trying to work around.

You are proposing a segmented layout of the virtual memory area so that
an optional (userfaultfd) component can avoid a lock - which already has
another optional (vma locking) workaround.

I think we need to stand back and look at what we're doing here in
regards to userfaultfd and how it interacts with everything.  Things
have gotten complex and we're going in the wrong direction.

I suggest there is an easier way to avoid the contention, and maybe try
to rectify some of the uffd code to fit better with the evolved use
cases and vma locking.

> 
> VMAs in the volatile region are assigned their own volatile_mmap_lock,
> which is independent of the mmap_lock for the non-volatile region.
> Additionally, we ensure that no single VMA spans the boundary between
> the volatile and non-volatile regions. This separation prevents the
> frequent modifications of a small number of volatile VMAs from blocking
> other operations on a large number of non-volatile VMAs.
> 
> The implementation itself wouldn’t be overly complex, but the design
> might come across as somewhat hacky.
> 
> Lastly, I have two questions:
> 
> 1. Have you observed similar issues where userfaultfd continues to
> cause lock contention and priority inversion?
> 
> 2. If so, do you have any ideas or suggestions on how to address this
> problem?

These are good questions.

I have a few of my own about what you described:

- What is causing your application to register/unregister so many uffds?

- Do the writes to the vmas overlap the register/unregister area
  today?  That is, do you have writes besides register/unregister going
  into your proposed volatile area or uffd modifications happening in
  the 'default mmap' area you specify above?

Barry, this is a good LSF topic - will you be there?  I hope to attend.

Something along the lines of "Userfaultfd contention, interactions, and
mitigations".

Thanks,
Liam




* Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations
  2025-01-23 16:48     ` Lorenzo Stoakes
@ 2025-01-23 17:07       ` Lorenzo Stoakes
  2025-01-23 17:14       ` David Hildenbrand
  1 sibling, 0 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2025-01-23 17:07 UTC (permalink / raw)
  To: Barry Song
  Cc: lokeshgidra, Liam.Howlett, aarcange, akpm, axelrasmussen,
	bgeffon, david, jannh, kaleshsingh, kernel-team, linux-fsdevel,
	linux-kernel, linux-mm, ngeoffray, peterx, rppt, ryan.roberts,
	selinux, surenb, timmurray, willy, Vlastimil Babka

On Thu, Jan 23, 2025 at 04:48:10PM +0000, Lorenzo Stoakes wrote:
> +cc vbabka

Sorry, my mail client was messing up and I forgot to actually cc... fixed :)

>
> I realise this is resurrecting an old thread, but helpful to cc- us mmap
> maintainers as my mail at least is a nightmare :P Thanks!
>
> On Thu, Jan 23, 2025 at 05:14:27PM +1300, Barry Song wrote:
> > > All userfaultfd operations, except write-protect, opportunistically use
> > > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> > > critical section.
> > >
> > > Write-protect operation requires mmap_lock as it iterates over multiple
> > > vmas.
> >
> > Hi Lokesh,
> >
> > Apologies for reviving this old thread. We truly appreciate the excellent work
> > you’ve done in transitioning many userfaultfd operations to per-VMA locks.
>
> Let me also say - thanks Lokesh!
>
> >
> > However, we’ve noticed that userfaultfd still remains one of the largest users
> > of mmap_lock for write operations, with the other—binder—having been recently
> > addressed by Carlos Llamas's "binder: faster page installations" series:
> >
> > https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@google.com/
> >
> > The HeapTaskDaemon (Java GC) might frequently perform userfaultfd_register()
> > and userfaultfd_unregister() operations, both of which require the mmap_lock
> > in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
> > lower-priority background task, there are cases where, after acquiring the
> > mmap_lock, it gets preempted by other tasks. As a result, even high-priority
> > threads waiting for the mmap_lock — whether in writer or reader mode — can
> > end up experiencing significant delays (the delay can reach several hundred
> > milliseconds in the worst case).
>
> Thanks for reporting - this strikes me as an important report, but I'm not
> sure about your proposed solution :)
>
> However I wonder if we can't look at more efficiently handling the locking
> around uffd register/unregister?
>
> I think uffd suffers from a complexity problem, in that there are multiple
> moving parts and complicated interactions, especially for each of the
> different kinds of events uffd can deal with.
>
> Also, ordering matters a great deal within uffd. Ideally uffd wouldn't
> effectively have open-coded hooks all over the mm code, but that ship
> sailed long ago, and so we need to really carefully assess how locking
> changes impact uffd behaviour.
>
> >
> > We haven’t yet identified an ideal solution for this. However, the Java heap
> > appears to behave like a "volatile" vma in its usage. A somewhat simplistic
> > idea would be to designate a specific region of the user address space as
> > "volatile" and restrict all "volatile" VMAs to this isolated region.
> >
> > We may have a MAP_VOLATILE flag to mmap. VMA regions with this flag will be
> > mapped to the volatile space, while those without it will be mapped to the
> > non-volatile space.
>
> This feels like a major thing to do just to suit specific uffd usages, a
> feature that not everybody makes use of.
>
> The virtual address space is a somewhat sensitive subject (see the
> relatively recent discussion on a proposed MAP_BELOW flag), and has a lot
> of landmines you can step on.
>
> How would this range be determined? How would this interact with people
> doing 'interesting' things with MAP_FIXED mappings? What if somebody
> MAP_FIXED into a volatile region and gets this behaviour by mistake?
>
> How do the two locks interact with one another and vma locks? I mean we
> already have a very complicated set of interactions between the mmap sem
> and vma locks where the mmap sem is taken to lock the mmap (of course), but
> now we'd have two?
>
> Also you have to write lock when you modify the VMA tree, would we have two
> maple trees now? And what about the overhead?
>
> I feel like this is not the right route for this.
>
> >
> >          ┌────────────┐TASK_SIZE
> >          │            │
> >          │            │
> >          │            │mmap VOLATILE
> >          ┼────────────┤
> >          │            │
> >          │            │
> >          │            │
> >          │            │
> >          │            │default mmap
> >          │            │
> >          │            │
> >          └────────────┘
> >
> > VMAs in the volatile region are assigned their own volatile_mmap_lock,
> > which is independent of the mmap_lock for the non-volatile region.
> > Additionally, we ensure that no single VMA spans the boundary between
> > the volatile and non-volatile regions. This separation prevents the
> > frequent modifications of a small number of volatile VMAs from blocking
> > other operations on a large number of non-volatile VMAs.
>
> I think really overall this will be solving one can of worms by introducing
> another can of very large worms in space :P but perhaps I am missing
> details here.
>
> >
> > The implementation itself wouldn’t be overly complex, but the design
> > might come across as somewhat hacky.
> >
> > Lastly, I have two questions:
> >
> > 1. Have you observed similar issues where userfaultfd continues to
> > cause lock contention and priority inversion?
> >
> > 2. If so, do you have any ideas or suggestions on how to address this
> > problem?
>
> Addressing and investigating this however is VERY IMPORTANT. Let's perhaps
> investigate how we might find a uffd-specific means of addressing these
> problems.
>
> >
> > >
> > > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
> > > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > ---
> > >  fs/userfaultfd.c              |  13 +-
> > >  include/linux/userfaultfd_k.h |   5 +-
> > >  mm/huge_memory.c              |   5 +-
> > >  mm/userfaultfd.c              | 380 ++++++++++++++++++++++++++--------
> > >  4 files changed, 299 insertions(+), 104 deletions(-)
> > >
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index c00a021bcce4..60dcfafdc11a 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> >
> > Thanks
> > Barry
> >
>
> Cheers, Lorenzo



* Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations
  2025-01-23  4:14   ` Barry Song
  2025-01-23 16:48     ` Lorenzo Stoakes
  2025-01-23 16:52     ` Liam R. Howlett
@ 2025-01-23 17:14     ` Matthew Wilcox
  2 siblings, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2025-01-23 17:14 UTC (permalink / raw)
  To: Barry Song
  Cc: lokeshgidra, Liam.Howlett, aarcange, akpm, axelrasmussen,
	bgeffon, david, jannh, kaleshsingh, kernel-team, linux-fsdevel,
	linux-kernel, linux-mm, ngeoffray, peterx, rppt, ryan.roberts,
	selinux, surenb, timmurray

On Thu, Jan 23, 2025 at 05:14:27PM +1300, Barry Song wrote:
> However, we’ve noticed that userfaultfd still remains one of the largest users
> of mmap_lock for write operations, with the other—binder—having been recently
> addressed by Carlos Llamas's "binder: faster page installations" series:
> 
> https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@google.com/
> 
> The HeapTaskDaemon(Java GC) might frequently perform userfaultfd_register()
> and userfaultfd_unregister() operations, both of which require the mmap_lock
> in write mode to either split or merge VMAs. Since HeapTaskDaemon is a

I don't think that's an innate necessity.  It does require quite some
work in order to fix it though.  Since it was so much work, nobody's
been motivated to fix it ... but now you are!  Congratulations ;-)

The reason we need the mmap_lock in write mode is that the VMA tree is
currently protected by that mmap_lock.  It shouldn't be.  Logically, it
should be protected by the ma_lock spinlock.  But today we use external
locking (ie the mmap_lock) because fixing that was a huge amount of work
that we didn't have time to do.
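
Concretely, the arrangement described above looks roughly like this
(a simplified sketch, not the exact kernel source): the VMA tree is
created with MT_FLAGS_LOCK_EXTERN, so it trusts an external lock (the
mmap_lock) instead of its internal ma_lock spinlock, and any store into
the tree has to happen under mmap_write_lock().

#include <linux/maple_tree.h>
#include <linux/mm.h>
#include <linux/mmap_lock.h>

static int store_vma_sketch(struct mm_struct *mm, struct vm_area_struct *vma,
			    unsigned long start, unsigned long end)
{
	MA_STATE(mas, &mm->mm_mt, start, end - 1);
	int err;

	mmap_write_lock(mm);			/* the agreed external lock */
	err = mas_store_gfp(&mas, vma, GFP_KERNEL);
	mmap_write_unlock(mm);

	return err;
}
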

This MAP_VOLATILE idea is never going to fly.




* Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations
  2025-01-23 16:48     ` Lorenzo Stoakes
  2025-01-23 17:07       ` Lorenzo Stoakes
@ 2025-01-23 17:14       ` David Hildenbrand
  2025-01-23 17:46         ` Suren Baghdasaryan
  1 sibling, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2025-01-23 17:14 UTC (permalink / raw)
  To: Lorenzo Stoakes, Barry Song
  Cc: lokeshgidra, Liam.Howlett, aarcange, akpm, axelrasmussen,
	bgeffon, jannh, kaleshsingh, kernel-team, linux-fsdevel,
	linux-kernel, linux-mm, ngeoffray, peterx, rppt, ryan.roberts,
	selinux, surenb, timmurray, willy

>>
>>           ┌────────────┐TASK_SIZE
>>           │            │
>>           │            │
>>           │            │mmap VOLATILE
>>           ┼────────────┤
>>           │            │
>>           │            │
>>           │            │
>>           │            │
>>           │            │default mmap
>>           │            │
>>           │            │
>>           └────────────┘
>>
>> VMAs in the volatile region are assigned their own volatile_mmap_lock,
>> which is independent of the mmap_lock for the non-volatile region.
>> Additionally, we ensure that no single VMA spans the boundary between
>> the volatile and non-volatile regions. This separation prevents the
>> frequent modifications of a small number of volatile VMAs from blocking
>> other operations on a large number of non-volatile VMAs.
> 
> I think really overall this will be solving one can of worms by introducing
> another can of very large worms in space :P but perhaps I am missing
> details here.

Fully agreed; not a big fan :)

-- 
Cheers,

David / dhildenb




* Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations
  2025-01-23 17:14       ` David Hildenbrand
@ 2025-01-23 17:46         ` Suren Baghdasaryan
  0 siblings, 0 replies; 15+ messages in thread
From: Suren Baghdasaryan @ 2025-01-23 17:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Barry Song, lokeshgidra, Liam.Howlett, aarcange,
	akpm, axelrasmussen, bgeffon, jannh, kaleshsingh, kernel-team,
	linux-fsdevel, linux-kernel, linux-mm, ngeoffray, peterx, rppt,
	ryan.roberts, selinux, timmurray, willy

On Thu, Jan 23, 2025 at 9:15 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>
> >>           ┌────────────┐TASK_SIZE
> >>           │            │
> >>           │            │
> >>           │            │mmap VOLATILE
> >>           ┼────────────┤
> >>           │            │
> >>           │            │
> >>           │            │
> >>           │            │
> >>           │            │default mmap
> >>           │            │
> >>           │            │
> >>           └────────────┘
> >>
> >> VMAs in the volatile region are assigned their own volatile_mmap_lock,
> >> which is independent of the mmap_lock for the non-volatile region.
> >> Additionally, we ensure that no single VMA spans the boundary between
> >> the volatile and non-volatile regions. This separation prevents the
> >> frequent modifications of a small number of volatile VMAs from blocking
> >> other operations on a large number of non-volatile VMAs.
> >
> > I think really overall this will be solving one can of worms by introducing
> > another can of very large worms in space :P but perhaps I am missing
> > details here.
>
> Fully agreed; not a big fan :)

+1. Let's not add more coarse-grained locks in mm. Discussing this at
LSFMM as Liam suggested would be a good idea. I'm definitely
interested.

>
> --
> Cheers,
>
> David / dhildenb
>



* Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations
  2025-01-23 16:52     ` Liam R. Howlett
@ 2025-01-23 18:45       ` Lokesh Gidra
  2025-01-27 22:08         ` Barry Song
  0 siblings, 1 reply; 15+ messages in thread
From: Lokesh Gidra @ 2025-01-23 18:45 UTC (permalink / raw)
  To: Liam R. Howlett, Barry Song, lokeshgidra, aarcange, akpm,
	axelrasmussen, bgeffon, david, jannh, kaleshsingh, kernel-team,
	linux-fsdevel, linux-kernel, linux-mm, ngeoffray, peterx, rppt,
	ryan.roberts, selinux, surenb, timmurray, willy

On Thu, Jan 23, 2025 at 8:52 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Barry Song <21cnbao@gmail.com> [250122 23:14]:
> > > All userfaultfd operations, except write-protect, opportunistically use
> > > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> > > critical section.
> > >
> > > Write-protect operation requires mmap_lock as it iterates over multiple
> > > vmas.
> >
> > Hi Lokesh,
> >
> > Apologies for reviving this old thread. We truly appreciate the excellent work
> > you’ve done in transitioning many userfaultfd operations to per-VMA locks.
> >
> > However, we’ve noticed that userfaultfd still remains one of the largest users
> > of mmap_lock for write operations, with the other—binder—having been recently
> > addressed by Carlos Llamas's "binder: faster page installations" series:
> >
> > https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@google.com/
> >
> > The HeapTaskDaemon (Java GC) might frequently perform userfaultfd_register()
> > and userfaultfd_unregister() operations, both of which require the mmap_lock
> > in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
> > lower-priority background task, there are cases where, after acquiring the
> > mmap_lock, it gets preempted by other tasks. As a result, even high-priority
> > threads waiting for the mmap_lock — whether in writer or reader mode — can
> > end up experiencing significant delays (the delay can reach several hundred
> > milliseconds in the worst case).

Do you happen to have some trace that I can take a look at?
>
> This needs an RFC or proposal or a discussion - certainly not a reply to
> an old v7 patch set.  I'd want neon lights and stuff directing people to
> this topic.
>
> >
> > We haven’t yet identified an ideal solution for this. However, the Java heap
> > appears to behave like a "volatile" vma in its usage. A somewhat simplistic
> > idea would be to designate a specific region of the user address space as
> > "volatile" and restrict all "volatile" VMAs to this isolated region.
>
> I'm going to assume the uffd changes are in the volatile area?  But
> really, maybe you mean the opposite...  I'll just assume I guessed
> correctly here.  Because both sides of this are competing for the write
> lock.
>
> >
> > We may have a MAP_VOLATILE flag to mmap. VMA regions with this flag will be
> > mapped to the volatile space, while those without it will be mapped to the
> > non-volatile space.
> >
> >          ┌────────────┐TASK_SIZE
> >          │            │
> >          │            │
> >          │            │mmap VOLATILE
> >          ┼────────────┤
> >          │            │
> >          │            │
> >          │            │
> >          │            │
> >          │            │default mmap
> >          │            │
> >          │            │
> >          └────────────┘
>
> No, this is way too complicated for what you are trying to work around.
>
> You are proposing a segmented layout of the virtual memory area so that
> an optional (userfaultfd) component can avoid a lock - which already has
> another optional (vma locking) workaround.
>
> I think we need to stand back and look at what we're doing here in
> regards to userfaultfd and how it interacts with everything.  Things
> have gotten complex and we're going in the wrong direction.
>
> I suggest there is an easier way to avoid the contention, and maybe try
> to rectify some of the uffd code to fit better with the evolved use
> cases and vma locking.
>
> >
> > VMAs in the volatile region are assigned their own volatile_mmap_lock,
> > which is independent of the mmap_lock for the non-volatile region.
> > Additionally, we ensure that no single VMA spans the boundary between
> > the volatile and non-volatile regions. This separation prevents the
> > frequent modifications of a small number of volatile VMAs from blocking
> > other operations on a large number of non-volatile VMAs.
> >
> > The implementation itself wouldn’t be overly complex, but the design
> > might come across as somewhat hacky.

I agree with others. Your proposal sounds too radical and doesn't seem
necessary to me. I'd like to see the traces and understand how
real/frequent the issue is.
> >
> > Lastly, I have two questions:
> >
> > 1. Have you observed similar issues where userfaultfd continues to
> > cause lock contention and priority inversion?

We haven't seen any such cases so far. But due to some other reasons,
we are seriously considering temporarily increasing the GC-thread's
priority when it is running a stop-the-world pause.
> >
> > 2. If so, do you have any ideas or suggestions on how to address this
> > problem?

There are possible userspace solutions to reduce/eliminate the number
of times userfaultfd register/unregister are done during a GC. I
didn't do it due to the added complexity it would introduce to the GC's
code.
>
> These are good questions.
>
> I have a few of my own about what you described:
>
> - What is causing your application to register/unregister so many uffds?

In every GC invocation, we have two userfaultfd_register() + mremap()
in a stop-the-world pause, and then two userfaultfd_unregister() at
the end of GC. The problematic ones ought to be the ones in the pause,
as we want to keep it as short as possible. The reason we want to
register/unregister the heap during GC is so that the overhead of
userfaults can be avoided when GC is not active.
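
For context, that sequence boils down to the standard userfaultfd ioctls;
a minimal userspace sketch (the mode flag is illustrative, and the uffd
file descriptor itself comes from the userfaultfd(2) syscall plus a
UFFDIO_API handshake, elided here):

#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Both ioctls end up taking mmap_lock for write in the kernel. */
static int uffd_register_range(int uffd, void *addr, size_t len)
{
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,	/* illustrative mode */
	};
	return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

static int uffd_unregister_range(int uffd, void *addr, size_t len)
{
	struct uffdio_range rng = {
		.start = (unsigned long)addr,
		.len   = len,
	};
	return ioctl(uffd, UFFDIO_UNREGISTER, &rng);
}
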

>
> > - Do the writes to the vmas overlap the register/unregister area
>   today?  That is, do you have writes besides register/unregister going
>   into your proposed volatile area or uffd modifications happening in
>   the 'default mmap' area you specify above?

That shouldn't be the case. The access to uffd registered VMAs should
start *after* registration. That's the reason it is done in a pause.
AFAIK, the source of contention is if some native (non-java) thread,
which is not participating in the pause, does a mmap_lock write
operation (mmap/munmap/mprotect/mremap/mlock etc.) elsewhere in the
address space. The heap can't be involved.
>
> Barry, this is a good LSF topic - will you be there?  I hope to attend.
>
> > Something along the lines of "Userfaultfd contention, interactions, and
> mitigations".
>
> Thanks,
> Liam
>



* Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations
  2025-01-23 18:45       ` Lokesh Gidra
@ 2025-01-27 22:08         ` Barry Song
  0 siblings, 0 replies; 15+ messages in thread
From: Barry Song @ 2025-01-27 22:08 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Liam R. Howlett, aarcange, akpm, axelrasmussen, bgeffon, david,
	jannh, kaleshsingh, kernel-team, linux-fsdevel, linux-kernel,
	linux-mm, ngeoffray, peterx, rppt, ryan.roberts, selinux, surenb,
	timmurray, willy

On Fri, Jan 24, 2025 at 2:45 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Thu, Jan 23, 2025 at 8:52 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Barry Song <21cnbao@gmail.com> [250122 23:14]:
> > > > All userfaultfd operations, except write-protect, opportunistically use
> > > > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> > > > critical section.
> > > >
> > > > Write-protect operation requires mmap_lock as it iterates over multiple
> > > > vmas.
> > >
> > > Hi Lokesh,
> > >
> > > Apologies for reviving this old thread. We truly appreciate the excellent work
> > > you’ve done in transitioning many userfaultfd operations to per-VMA locks.
> > >
> > > However, we’ve noticed that userfaultfd still remains one of the largest users
> > > of mmap_lock for write operations, with the other—binder—having been recently
> > > addressed by Carlos Llamas's "binder: faster page installations" series:
> > >
> > > https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@google.com/
> > >
> > > The HeapTaskDaemon (Java GC) might frequently perform userfaultfd_register()
> > > and userfaultfd_unregister() operations, both of which require the mmap_lock
> > > in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
> > > lower-priority background task, there are cases where, after acquiring the
> > > mmap_lock, it gets preempted by other tasks. As a result, even high-priority
> > > threads waiting for the mmap_lock — whether in writer or reader mode — can
> > > end up experiencing significant delays (the delay can reach several hundred
> > > milliseconds in the worst case).
>
> Do you happen to have some trace that I can take a look at?

We observed a rough trace in Android Studio showing the HeapTaskDaemon
stuck in a runnable state after holding the mmap_lock for 1 second, while other
threads were waiting for the lock.

Our team will assist in collecting a detailed trace, but everyone is
currently on an extended Chinese New Year holiday. Apologies, this may
delay the process until after February 8.

> >
> > This needs an RFC or proposal or a discussion - certainly not a reply to
> > an old v7 patch set.  I'd want neon lights and stuff directing people to
> > this topic.
> >
> > >
> > > We haven’t yet identified an ideal solution for this. However, the Java heap
> > > appears to behave like a "volatile" vma in its usage. A somewhat simplistic
> > > idea would be to designate a specific region of the user address space as
> > > "volatile" and restrict all "volatile" VMAs to this isolated region.
> >
> > I'm going to assume the uffd changes are in the volatile area?  But
> > really, maybe you mean the opposite...  I'll just assume I guessed
> > correctly here.  Because both sides of this are competing for the write
> > lock.
> >
> > >
> > > We may have a MAP_VOLATILE flag to mmap. VMA regions with this flag will be
> > > mapped to the volatile space, while those without it will be mapped to the
> > > non-volatile space.
> > >
> > >          ┌────────────┐TASK_SIZE
> > >          │            │
> > >          │            │
> > >          │            │mmap VOLATILE
> > >          ┼────────────┤
> > >          │            │
> > >          │            │
> > >          │            │
> > >          │            │
> > >          │            │default mmap
> > >          │            │
> > >          │            │
> > >          └────────────┘
> >
> > No, this is way too complicated for what you are trying to work around.
> >
> > You are proposing a segmented layout of the virtual memory area so that
> > an optional (userfaultfd) component can avoid a lock - which already has
> > another optional (vma locking) workaround.
> >
> > I think we need to stand back and look at what we're doing here in
> > regards to userfaultfd and how it interacts with everything.  Things
> > have gotten complex and we're going in the wrong direction.
> >
> > I suggest there is an easier way to avoid the contention, and maybe try
> > to rectify some of the uffd code to fit better with the evolved use
> > cases and vma locking.
> >
> > >
> > > VMAs in the volatile region are assigned their own volatile_mmap_lock,
> > > which is independent of the mmap_lock for the non-volatile region.
> > > Additionally, we ensure that no single VMA spans the boundary between
> > > the volatile and non-volatile regions. This separation prevents the
> > > frequent modifications of a small number of volatile VMAs from blocking
> > > other operations on a large number of non-volatile VMAs.
> > >
> > > The implementation itself wouldn’t be overly complex, but the design
> > > might come across as somewhat hacky.
>
> I agree with others. Your proposal sounds too radical and doesn't seem
> necessary to me. I'd like to see the traces and understand how
> real/frequent the issue is.

No worries, I figured the idea might not be well-received since it was more of
a hack. I was just trying to explain that some VMAs might contribute more to
mmap_lock contention (the volatile ones), while others might not.

> > >
> > > Lastly, I have two questions:
> > >
> > > 1. Have you observed similar issues where userfaultfd continues to
> > > cause lock contention and priority inversion?
>
> We haven't seen any such cases so far. But due to some other reasons,
> we are seriously considering temporarily increasing the GC-thread's
> priority when it is running a stop-the-world pause.
> > >
> > > 2. If so, do you have any ideas or suggestions on how to address this
> > > problem?
>
> There are possible userspace solutions to reduce/eliminate the number
> of times userfaultfd register/unregister are done during a GC. I
> didn't do it due to the added complexity it would introduce to the GC's
> code.
> >
> > These are good questions.
> >
> > I have a few of my own about what you described:
> >
> > - What is causing your application to register/unregister so many uffds?
>
> In every GC invocation, we have two userfaultfd_register() + mremap()
> in a stop-the-world pause, and then two userfaultfd_unregister() at
> the end of GC. The problematic ones ought to be the ones in the pause,
> as we want to keep it as short as possible. The reason we want to
> register/unregister the heap during GC is so that the overhead of
> userfaults can be avoided when GC is not active.
>
> >
> > - Do the writes to the vmas overlap the register/unregister area
> >   today?  That is, do you have writes besides register/unregister going
> >   into your proposed volatile area or uffd modifications happening in
> >   the 'default mmap' area you specify above?
>
> That shouldn't be the case. The access to uffd registered VMAs should
> start *after* registration. That's the reason it is done in a pause.
> AFAIK, the source of contention is if some native (non-java) thread,
> which is not participating in the pause, does a mmap_lock write
> operation (mmap/munmap/mprotect/mremap/mlock etc.) elsewhere in the
> address space. The heap can't be involved.

Exactly. Essentially, we observe that the GC holds the mmap_lock but
gets preempted for an extended period, causing other tasks performing
mmap-like operations to wait for the GC to release the lock.

> >
> > Barry, this is a good LSF topic - will you be there?  I hope to attend.
> >
> > Something along the lines of "Userfaultfd contention, interactions, and
> > mitigations".

Thank you for your interest in this topic.

It's unlikely that a travel budget will be available, so I won’t be attending
in person. I might apply for virtual attendance to participate in some
discussions, but I don’t plan to run a session remotely—too many things
can go wrong.

> >
> > Thanks,
> > Liam
> >

Thanks
Barry



Thread overview: 15+ messages
2024-02-15 18:27 [PATCH v7 0/4] per-vma locks in userfaultfd Lokesh Gidra
2024-02-15 18:27 ` [PATCH v7 1/4] userfaultfd: move userfaultfd_ctx struct to header file Lokesh Gidra
2024-02-15 18:27 ` [PATCH v7 2/4] userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx Lokesh Gidra
2024-02-15 18:27 ` [PATCH v7 3/4] mm: add vma_assert_locked() for !CONFIG_PER_VMA_LOCK Lokesh Gidra
2024-02-15 22:19   ` Suren Baghdasaryan
2024-02-15 18:27 ` [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd operations Lokesh Gidra
2025-01-23  4:14   ` Barry Song
2025-01-23 16:48     ` Lorenzo Stoakes
2025-01-23 17:07       ` Lorenzo Stoakes
2025-01-23 17:14       ` David Hildenbrand
2025-01-23 17:46         ` Suren Baghdasaryan
2025-01-23 16:52     ` Liam R. Howlett
2025-01-23 18:45       ` Lokesh Gidra
2025-01-27 22:08         ` Barry Song
2025-01-23 17:14     ` Matthew Wilcox
