linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/5] mm, kvm: add guest_memfd support for uffd minor faults
@ 2025-11-30 11:18 Mike Rapoport
  2025-11-30 11:18 ` [PATCH v3 1/5] userfaultfd: move vma_can_userfault out of line Mike Rapoport
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Mike Rapoport @ 2025-11-30 11:18 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
	David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Nikita Kalyazin,
	Paolo Bonzini, Peter Xu, Sean Christopherson, Shuah Khan,
	Suren Baghdasaryan, Vlastimil Babka, linux-kernel, kvm,
	linux-kselftest

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Hi,

These patches allow guest_memfd to notify userspace about minor page
faults using userfaultfd and let userspace to resolve these page faults
using UFFDIO_CONTINUE.

To allow UFFDIO_CONTINUE outside of the core mm I added a
get_folio_noalloc() callback to vm_ops that allows an address space
backing a VMA to return a folio that exists in it's page cache (patch 2)

In order for guest_memfd to notify userspace about page faults, there is a
new VM_FAULT_UFFD_MINOR that a ->fault() handler can return to inform the
page fault handler that it needs to call handle_userfault() to complete the
fault (patch 3).
 
Patch 4 plumbs these new goodies into guest_memfd.

This series is the minimal change I've been able to come up with to allow
integration of guest_memfd with uffd and while refactoring uffd and making
mfill_atomic() flow more linear would have been a nice improvement, it's
way out of the scope of enabling uffd with guest_memfd.

v3 changes:
* rename ->get_folio() to ->get_folio_noalloc()
* fix build errors reported by kbuild
* pull handling of UFFD_MINOR out of hotpath in __do_fault()
* update guest_memfs changes so its ->fault() and ->get_folio_noalloc()
  follow the same semantics as shmem and hugetlb.
* s/MISSING/MINOR/g in changelogs
* added review tags

v2: https://lore.kernel.org/all/20251125183840.2368510-1-rppt@kernel.org
* rename ->get_shared_folio() to ->get_folio()
* hardwire VM_FAULF_UFFD_MINOR to 0 when CONFIG_USERFAULTFD=n

v1: https://patch.msgid.link/20251123102707.559422-1-rppt@kernel.org
* Introduce VM_FAULF_UFFD_MINOR to avoid exporting handle_userfault()
* Simplify vma_can_mfill_atomic()
* Rename get_pagecache_folio() to get_shared_folio() and use inode
  instead of vma as its argument

rfc: https://patch.msgid.link/20251117114631.2029447-1-rppt@kernel.org

Mike Rapoport (Microsoft) (4):
  userfaultfd: move vma_can_userfault out of line
  userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
  mm: introduce VM_FAULT_UFFD_MINOR fault reason
  guest_memfd: add support for userfaultfd minor mode

Nikita Kalyazin (1):
  KVM: selftests: test userfaultfd minor for guest_memfd

 include/linux/mm.h                            |  9 ++
 include/linux/mm_types.h                      | 10 +-
 include/linux/userfaultfd_k.h                 | 36 +------
 mm/memory.c                                   |  5 +-
 mm/shmem.c                                    | 20 +++-
 mm/userfaultfd.c                              | 80 ++++++++++++---
 .../testing/selftests/kvm/guest_memfd_test.c  | 97 +++++++++++++++++++
 virt/kvm/guest_memfd.c                        | 33 ++++++-
 8 files changed, 236 insertions(+), 54 deletions(-)


base-commit: ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d
-- 
2.51.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 1/5] userfaultfd: move vma_can_userfault out of line
  2025-11-30 11:18 [PATCH v3 0/5] mm, kvm: add guest_memfd support for uffd minor faults Mike Rapoport
@ 2025-11-30 11:18 ` Mike Rapoport
  2025-11-30 11:18 ` [PATCH v3 2/5] userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE Mike Rapoport
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Mike Rapoport @ 2025-11-30 11:18 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
	David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Nikita Kalyazin,
	Paolo Bonzini, Peter Xu, Sean Christopherson, Shuah Khan,
	Suren Baghdasaryan, Vlastimil Babka, linux-kernel, kvm,
	linux-kselftest, David Hildenbrand (Red Hat)

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

vma_can_userfault() has grown pretty big and it's not called on
performance critical path.

Move it out of line.

No functional changes.

Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 include/linux/userfaultfd_k.h | 36 ++---------------------------------
 mm/userfaultfd.c              | 34 +++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 34 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index c0e716aec26a..e4f43e7b063f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -208,40 +208,8 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return vma->vm_flags & __VM_UFFD_FLAGS;
 }
 
-static inline bool vma_can_userfault(struct vm_area_struct *vma,
-				     vm_flags_t vm_flags,
-				     bool wp_async)
-{
-	vm_flags &= __VM_UFFD_FLAGS;
-
-	if (vma->vm_flags & VM_DROPPABLE)
-		return false;
-
-	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
-		return false;
-
-	/*
-	 * If wp async enabled, and WP is the only mode enabled, allow any
-	 * memory type.
-	 */
-	if (wp_async && (vm_flags == VM_UFFD_WP))
-		return true;
-
-#ifndef CONFIG_PTE_MARKER_UFFD_WP
-	/*
-	 * If user requested uffd-wp but not enabled pte markers for
-	 * uffd-wp, then shmem & hugetlbfs are not supported but only
-	 * anonymous.
-	 */
-	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
-		return false;
-#endif
-
-	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	    vma_is_shmem(vma);
-}
+bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
+		       bool wp_async);
 
 static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
 {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af61b95c89e4..8dc964389b0d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1977,6 +1977,40 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 	return moved ? moved : err;
 }
 
+bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
+		       bool wp_async)
+{
+	vm_flags &= __VM_UFFD_FLAGS;
+
+	if (vma->vm_flags & VM_DROPPABLE)
+		return false;
+
+	if ((vm_flags & VM_UFFD_MINOR) &&
+	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+		return false;
+
+	/*
+	 * If wp async enabled, and WP is the only mode enabled, allow any
+	 * memory type.
+	 */
+	if (wp_async && (vm_flags == VM_UFFD_WP))
+		return true;
+
+#ifndef CONFIG_PTE_MARKER_UFFD_WP
+	/*
+	 * If user requested uffd-wp but not enabled pte markers for
+	 * uffd-wp, then shmem & hugetlbfs are not supported but only
+	 * anonymous.
+	 */
+	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
+		return false;
+#endif
+
+	/* By default, allow any of anon|shmem|hugetlb */
+	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+	    vma_is_shmem(vma);
+}
+
 static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
 				     vm_flags_t vm_flags)
 {
-- 
2.51.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 2/5] userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
  2025-11-30 11:18 [PATCH v3 0/5] mm, kvm: add guest_memfd support for uffd minor faults Mike Rapoport
  2025-11-30 11:18 ` [PATCH v3 1/5] userfaultfd: move vma_can_userfault out of line Mike Rapoport
@ 2025-11-30 11:18 ` Mike Rapoport
  2025-11-30 11:18 ` [PATCH v3 3/5] mm: introduce VM_FAULT_UFFD_MINOR fault reason Mike Rapoport
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Mike Rapoport @ 2025-11-30 11:18 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
	David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Nikita Kalyazin,
	Paolo Bonzini, Peter Xu, Sean Christopherson, Shuah Khan,
	Suren Baghdasaryan, Vlastimil Babka, linux-kernel, kvm,
	linux-kselftest, David Hildenbrand (Red Hat)

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE
it needs to get a folio that already exists in the pagecache backing
that VMA.

Instead of using shmem_get_folio() for that, add a get_folio_noalloc()
method to 'struct vm_operations_struct' that will return a folio if it
exists in the VMA's pagecache at given pgoff.

Implement get_folio_noalloc() method for shmem and slightly refactor
userfaultfd's mfill_atomic() and mfill_atomic_pte_continue() to support
this new API.

Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 include/linux/mm.h |  9 ++++++++
 mm/shmem.c         | 18 ++++++++++++++++
 mm/userfaultfd.c   | 52 +++++++++++++++++++++++++++++-----------------
 3 files changed, 60 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7c79b3369b82..6351a9cde360 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -690,6 +690,15 @@ struct vm_operations_struct {
 	struct page *(*find_normal_page)(struct vm_area_struct *vma,
 					 unsigned long addr);
 #endif /* CONFIG_FIND_NORMAL_PAGE */
+#ifdef CONFIG_USERFAULTFD
+	/*
+	 * Called by userfault to resolve UFFDIO_CONTINUE request.
+	 * Should return the folio found at pgoff in the VMA's pagecache if it
+	 * exists or ERR_PTR otherwise.
+	 * The returned folio is locked and with reference held.
+	 */
+	struct folio *(*get_folio_noalloc)(struct inode *inode, pgoff_t pgoff);
+#endif
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/shmem.c b/mm/shmem.c
index 5a3f0f754dc0..9f8c54ad0e32 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3262,6 +3262,18 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 	shmem_inode_unacct_blocks(inode, 1);
 	return ret;
 }
+
+static struct folio *shmem_get_folio_noalloc(struct inode *inode, pgoff_t pgoff)
+{
+	struct folio *folio;
+	int err;
+
+	err = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+	if (err)
+		return ERR_PTR(err);
+
+	return folio;
+}
 #endif /* CONFIG_USERFAULTFD */
 
 #ifdef CONFIG_TMPFS
@@ -5294,6 +5306,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.get_folio_noalloc	= shmem_get_folio_noalloc,
+#endif
 };
 
 static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5303,6 +5318,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.get_folio_noalloc	= shmem_get_folio_noalloc,
+#endif
 };
 
 int shmem_init_fs_context(struct fs_context *fc)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 8dc964389b0d..5610f29dac73 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -388,15 +388,12 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 	struct page *page;
 	int ret;
 
-	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+	folio = dst_vma->vm_ops->get_folio_noalloc(inode, pgoff);
 	/* Our caller expects us to return -EFAULT if we failed to find folio */
-	if (ret == -ENOENT)
-		ret = -EFAULT;
-	if (ret)
-		goto out;
-	if (!folio) {
-		ret = -EFAULT;
-		goto out;
+	if (IS_ERR_OR_NULL(folio)) {
+		if (PTR_ERR(folio) == -ENOENT || !folio)
+			return -EFAULT;
+		return PTR_ERR(folio);
 	}
 
 	page = folio_file_page(folio, pgoff);
@@ -411,13 +408,12 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 		goto out_release;
 
 	folio_unlock(folio);
-	ret = 0;
-out:
-	return ret;
+	return 0;
+
 out_release:
 	folio_unlock(folio);
 	folio_put(folio);
-	goto out;
+	return ret;
 }
 
 /* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
@@ -694,6 +690,15 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 	return err;
 }
 
+static __always_inline bool vma_can_mfill_atomic(struct vm_area_struct *vma,
+						 uffd_flags_t flags)
+{
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+		return vma->vm_ops && vma->vm_ops->get_folio_noalloc;
+
+	return vma_is_anonymous(vma) || vma_is_shmem(vma);
+}
+
 static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 					    unsigned long dst_start,
 					    unsigned long src_start,
@@ -766,10 +771,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		return  mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
 					     src_start, len, flags);
 
-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
-		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) &&
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+	if (!vma_can_mfill_atomic(dst_vma, flags))
 		goto out_unlock;
 
 	while (src_addr < src_start + len) {
@@ -1985,9 +1987,21 @@ bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
 	if (vma->vm_flags & VM_DROPPABLE)
 		return false;
 
-	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
-		return false;
+	if (vm_flags & VM_UFFD_MINOR) {
+		/*
+		 * If only MINOR mode is requested and we can request an
+		 * existing folio from VMA's page cache, allow it
+		 */
+		if (vm_flags == VM_UFFD_MINOR && vma->vm_ops &&
+		    vma->vm_ops->get_folio_noalloc)
+			return true;
+		/*
+		 * Only hugetlb and shmem can support MINOR mode in combination
+		 * with other modes
+		 */
+		if (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma))
+			return false;
+	}
 
 	/*
 	 * If wp async enabled, and WP is the only mode enabled, allow any
-- 
2.51.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 3/5] mm: introduce VM_FAULT_UFFD_MINOR fault reason
  2025-11-30 11:18 [PATCH v3 0/5] mm, kvm: add guest_memfd support for uffd minor faults Mike Rapoport
  2025-11-30 11:18 ` [PATCH v3 1/5] userfaultfd: move vma_can_userfault out of line Mike Rapoport
  2025-11-30 11:18 ` [PATCH v3 2/5] userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE Mike Rapoport
@ 2025-11-30 11:18 ` Mike Rapoport
  2025-12-01  8:59   ` David Hildenbrand (Red Hat)
  2025-11-30 11:18 ` [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode Mike Rapoport
  2025-11-30 11:18 ` [PATCH v3 5/5] KVM: selftests: test userfaultfd minor for guest_memfd Mike Rapoport
  4 siblings, 1 reply; 20+ messages in thread
From: Mike Rapoport @ 2025-11-30 11:18 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
	David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Nikita Kalyazin,
	Paolo Bonzini, Peter Xu, Sean Christopherson, Shuah Khan,
	Suren Baghdasaryan, Vlastimil Babka, linux-kernel, kvm,
	linux-kselftest, David Hildenbrand (Red Hat)

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

When a VMA is registered with userfaulfd in minor mode, its ->fault()
method should check if a folio exists in the page cache and if yes
->fault() should call handle_userfault(VM_UFFD_MINOR).

Instead of calling handle_userfault() directly from a specific ->fault()
implementation introduce new fault reason VM_FAULT_UFFD_MINOR that will
notify the core page fault handler that it should call
handle_userfaultfd(VM_UFFD_MINOR) to complete a page fault.

Replace a call to handle_userfault(VM_UFFD_MINOR) in shmem and use the
new VM_FAULT_UFFD_MINOR there instead.

For configurations that don't enable CONFIG_USERFAULTFD,
VM_FAULT_UFFD_MINOR is set to 0.

Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 include/linux/mm_types.h | 10 +++++++++-
 mm/memory.c              |  5 ++++-
 mm/shmem.c               |  2 +-
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 90e5790c318f..c92a52c572c3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1523,6 +1523,8 @@ typedef __bitwise unsigned int vm_fault_t;
  *				fsync() to complete (for synchronous page faults
  *				in DAX)
  * @VM_FAULT_COMPLETED:		->fault completed, meanwhile mmap lock released
+ * @VM_FAULT_UFFD_MINOR:	->fault did not modify page tables and needs
+ *				handle_userfault(VM_UFFD_MINOR) to complete
  * @VM_FAULT_HINDEX_MASK:	mask HINDEX value
  *
  */
@@ -1540,6 +1542,11 @@ enum vm_fault_reason {
 	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
 	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
 	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
+#ifdef CONFIG_USERFAULTFD
+	VM_FAULT_UFFD_MINOR	= (__force vm_fault_t)0x008000,
+#else
+	VM_FAULT_UFFD_MINOR	= (__force vm_fault_t)0x000000,
+#endif
 	VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
 };
 
@@ -1564,7 +1571,8 @@ enum vm_fault_reason {
 	{ VM_FAULT_FALLBACK,            "FALLBACK" },	\
 	{ VM_FAULT_DONE_COW,            "DONE_COW" },	\
 	{ VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" },	\
-	{ VM_FAULT_COMPLETED,           "COMPLETED" }
+	{ VM_FAULT_COMPLETED,           "COMPLETED" },	\
+	{ VM_FAULT_UFFD_MINOR,		"UFFD_MINOR" }	\
 
 struct vm_special_mapping {
 	const char *name;	/* The name, e.g. "[vdso]". */
diff --git a/mm/memory.c b/mm/memory.c
index b59ae7ce42eb..8d2180ec6933 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5280,8 +5280,11 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
-			    VM_FAULT_DONE_COW)))
+			    VM_FAULT_DONE_COW | VM_FAULT_UFFD_MINOR))) {
+		if (ret & VM_FAULT_UFFD_MINOR)
+			return handle_userfault(vmf, VM_UFFD_MINOR);
 		return ret;
+	}
 
 	folio = page_folio(vmf->page);
 	if (unlikely(PageHWPoison(vmf->page))) {
diff --git a/mm/shmem.c b/mm/shmem.c
index 9f8c54ad0e32..2c32e2398e99 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2460,7 +2460,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	if (folio && vma && userfaultfd_minor(vma)) {
 		if (!xa_is_value(folio))
 			folio_put(folio);
-		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
+		*fault_type = VM_FAULT_UFFD_MINOR;
 		return 0;
 	}
 
-- 
2.51.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-11-30 11:18 [PATCH v3 0/5] mm, kvm: add guest_memfd support for uffd minor faults Mike Rapoport
                   ` (2 preceding siblings ...)
  2025-11-30 11:18 ` [PATCH v3 3/5] mm: introduce VM_FAULT_UFFD_MINOR fault reason Mike Rapoport
@ 2025-11-30 11:18 ` Mike Rapoport
  2025-12-01  9:12   ` David Hildenbrand (Red Hat)
  2025-12-01 13:39   ` Nikita Kalyazin
  2025-11-30 11:18 ` [PATCH v3 5/5] KVM: selftests: test userfaultfd minor for guest_memfd Mike Rapoport
  4 siblings, 2 replies; 20+ messages in thread
From: Mike Rapoport @ 2025-11-30 11:18 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
	David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Nikita Kalyazin,
	Paolo Bonzini, Peter Xu, Sean Christopherson, Shuah Khan,
	Suren Baghdasaryan, Vlastimil Babka, linux-kernel, kvm,
	linux-kselftest

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

userfaultfd notifications about minor page faults used for live migration
and snapshotting of VMs with memory backed by shared hugetlbfs or tmpfs
mappings as described in detail in commit 7677f7fd8be7 ("userfaultfd: add
minor fault registration mode").

To use the same mechanism for VMs that use guest_memfd to map their memory,
guest_memfd should support userfaultfd minor mode.

Extend ->fault() method of guest_memfd with ability to notify core page
fault handler that a page fault requires handle_userfault(VM_UFFD_MINOR) to
complete and add implementation of ->get_folio_noalloc() to guest_memfd
vm_ops.

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 virt/kvm/guest_memfd.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index ffadc5ee8e04..dca6e373937b 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#include <linux/userfaultfd_k.h>
 
 #include "kvm_mm.h"
 
@@ -359,7 +360,15 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 	if (!((u64)inode->i_private & GUEST_MEMFD_FLAG_INIT_SHARED))
 		return VM_FAULT_SIGBUS;
 
-	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+	folio = filemap_lock_folio(inode->i_mapping, vmf->pgoff);
+	if (!IS_ERR_OR_NULL(folio) && userfaultfd_minor(vmf->vma)) {
+		ret = VM_FAULT_UFFD_MINOR;
+		goto out_folio;
+	}
+
+	if (PTR_ERR(folio) == -ENOENT)
+		folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+
 	if (IS_ERR(folio)) {
 		int err = PTR_ERR(folio);
 
@@ -390,8 +399,30 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 	return ret;
 }
 
+#ifdef CONFIG_USERFAULTFD
+static struct folio *kvm_gmem_get_folio_noalloc(struct inode *inode,
+						pgoff_t pgoff)
+{
+	struct folio *folio;
+
+	folio = filemap_lock_folio(inode->i_mapping, pgoff);
+	if (IS_ERR_OR_NULL(folio))
+		return folio;
+
+	if (!folio_test_uptodate(folio)) {
+		clear_highpage(folio_page(folio, 0));
+		kvm_gmem_mark_prepared(folio);
+	}
+
+	return folio;
+}
+#endif
+
 static const struct vm_operations_struct kvm_gmem_vm_ops = {
 	.fault = kvm_gmem_fault_user_mapping,
+#ifdef CONFIG_USERFAULTFD
+	.get_folio_noalloc	= kvm_gmem_get_folio_noalloc,
+#endif
 };
 
 static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
-- 
2.51.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 5/5] KVM: selftests: test userfaultfd minor for guest_memfd
  2025-11-30 11:18 [PATCH v3 0/5] mm, kvm: add guest_memfd support for uffd minor faults Mike Rapoport
                   ` (3 preceding siblings ...)
  2025-11-30 11:18 ` [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode Mike Rapoport
@ 2025-11-30 11:18 ` Mike Rapoport
  4 siblings, 0 replies; 20+ messages in thread
From: Mike Rapoport @ 2025-11-30 11:18 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
	David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Nikita Kalyazin,
	Paolo Bonzini, Peter Xu, Sean Christopherson, Shuah Khan,
	Suren Baghdasaryan, Vlastimil Babka, linux-kernel, kvm,
	linux-kselftest

From: Nikita Kalyazin <kalyazin@amazon.com>

The test demonstrates that a minor userfaultfd event in guest_memfd can
be resolved via a memcpy followed by a UFFDIO_CONTINUE ioctl.

Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 .../testing/selftests/kvm/guest_memfd_test.c  | 97 +++++++++++++++++++
 1 file changed, 97 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index e7d9aeb418d3..d0cf57d41cc9 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -10,13 +10,17 @@
 #include <errno.h>
 #include <stdio.h>
 #include <fcntl.h>
+#include <pthread.h>
 
 #include <linux/bitmap.h>
 #include <linux/falloc.h>
 #include <linux/sizes.h>
+#include <linux/userfaultfd.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
 
 #include "kvm_util.h"
 #include "test_util.h"
@@ -254,6 +258,98 @@ static void test_guest_memfd_flags(struct kvm_vm *vm)
 	}
 }
 
+struct fault_args {
+	char *addr;
+	volatile char value;
+};
+
+static void *fault_thread_fn(void *arg)
+{
+	struct fault_args *args = arg;
+
+	/* Trigger page fault */
+	args->value = *args->addr;
+	return NULL;
+}
+
+static void test_uffd_minor(int fd, size_t total_size)
+{
+	struct uffdio_api uffdio_api = {
+		.api = UFFD_API,
+	};
+	struct uffdio_register uffd_reg;
+	struct uffdio_continue uffd_cont;
+	struct uffd_msg msg;
+	struct fault_args args;
+	pthread_t fault_thread;
+	void *mem, *mem_nofault;
+	int uffd, ret;
+	off_t offset = page_size;
+	void *fault_addr;
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
+	TEST_ASSERT(uffd != -1, "userfaultfd creation should succeed");
+
+	ret = ioctl(uffd, UFFDIO_API, &uffdio_api);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_API) should succeed");
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should succeed");
+
+	mem_nofault = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem_nofault != MAP_FAILED, "mmap should succeed");
+
+	uffd_reg.range.start = (unsigned long)mem;
+	uffd_reg.range.len = total_size;
+	uffd_reg.mode = UFFDIO_REGISTER_MODE_MINOR;
+	ret = ioctl(uffd, UFFDIO_REGISTER, &uffd_reg);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_REGISTER) should succeed");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			offset, page_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) should succeed");
+
+	memset(mem_nofault + offset, 0xaa, page_size);
+
+	fault_addr = mem + offset;
+	args.addr = fault_addr;
+
+	ret = pthread_create(&fault_thread, NULL, fault_thread_fn, &args);
+	TEST_ASSERT(ret == 0, "pthread_create should succeed");
+
+	ret = read(uffd, &msg, sizeof(msg));
+	TEST_ASSERT(ret != -1, "read from userfaultfd should succeed");
+	TEST_ASSERT(msg.event == UFFD_EVENT_PAGEFAULT, "event type should be pagefault");
+	TEST_ASSERT((void *)(msg.arg.pagefault.address & ~(page_size - 1)) == fault_addr,
+		    "pagefault should occur at expected address");
+
+
+	uffd_cont.range.start = (unsigned long)fault_addr;
+	uffd_cont.range.len = page_size;
+	uffd_cont.mode = 0;
+	ret = ioctl(uffd, UFFDIO_CONTINUE, &uffd_cont);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_CONTINUE) should succeed");
+
+	/*
+	 * wait for fault_thread to finish to make sure fault happened and was
+	 * resolved before we verify the values
+	 */
+	ret = pthread_join(fault_thread, NULL);
+	TEST_ASSERT(ret == 0, "pthread_join should succeed");
+
+	TEST_ASSERT(args.value == *(char *)(mem_nofault + offset),
+		    "memory should contain the value that was copied");
+	TEST_ASSERT(args.value == *(char *)(mem + offset),
+		    "no further fault is expected");
+
+	ret = munmap(mem_nofault, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+
+	ret = munmap(mem, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+	close(uffd);
+}
+
 #define gmem_test(__test, __vm, __flags)				\
 do {									\
 	int fd = vm_create_guest_memfd(__vm, page_size * 4, __flags);	\
@@ -273,6 +369,7 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags)
 		if (flags & GUEST_MEMFD_FLAG_INIT_SHARED) {
 			gmem_test(mmap_supported, vm, flags);
 			gmem_test(fault_overflow, vm, flags);
+			gmem_test(uffd_minor, vm, flags);
 		} else {
 			gmem_test(fault_private, vm, flags);
 		}
-- 
2.51.0



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/5] mm: introduce VM_FAULT_UFFD_MINOR fault reason
  2025-11-30 11:18 ` [PATCH v3 3/5] mm: introduce VM_FAULT_UFFD_MINOR fault reason Mike Rapoport
@ 2025-12-01  8:59   ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-01  8:59 UTC (permalink / raw)
  To: Mike Rapoport, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
	Hugh Dickins, James Houghton, Liam R. Howlett, Lorenzo Stoakes,
	Michal Hocko, Nikita Kalyazin, Paolo Bonzini, Peter Xu,
	Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
	Vlastimil Babka, linux-kernel, kvm, linux-kselftest

On 11/30/25 12:18, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> When a VMA is registered with userfaulfd in minor mode, its ->fault()
> method should check if a folio exists in the page cache and if yes
> ->fault() should call handle_userfault(VM_UFFD_MINOR).
> 
> Instead of calling handle_userfault() directly from a specific ->fault()
> implementation introduce new fault reason VM_FAULT_UFFD_MINOR that will
> notify the core page fault handler that it should call
> handle_userfaultfd(VM_UFFD_MINOR) to complete a page fault.
> 
> Replace a call to handle_userfault(VM_UFFD_MINOR) in shmem and use the
> new VM_FAULT_UFFD_MINOR there instead.
> 
> For configurations that don't enable CONFIG_USERFAULTFD,
> VM_FAULT_UFFD_MINOR is set to 0.
> 
> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---

LGTM and does not look too invasive for me.

Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-11-30 11:18 ` [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode Mike Rapoport
@ 2025-12-01  9:12   ` David Hildenbrand (Red Hat)
  2025-12-01 13:39   ` Nikita Kalyazin
  1 sibling, 0 replies; 20+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-01  9:12 UTC (permalink / raw)
  To: Mike Rapoport, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
	Hugh Dickins, James Houghton, Liam R. Howlett, Lorenzo Stoakes,
	Michal Hocko, Nikita Kalyazin, Paolo Bonzini, Peter Xu,
	Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
	Vlastimil Babka, linux-kernel, kvm, linux-kselftest

On 11/30/25 12:18, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> userfaultfd notifications about minor page faults used for live migration
> and snapshotting of VMs with memory backed by shared hugetlbfs or tmpfs
> mappings as described in detail in commit 7677f7fd8be7 ("userfaultfd: add
> minor fault registration mode").
> 
> To use the same mechanism for VMs that use guest_memfd to map their memory,
> guest_memfd should support userfaultfd minor mode.
> 
> Extend ->fault() method of guest_memfd with ability to notify core page
> fault handler that a page fault requires handle_userfault(VM_UFFD_MINOR) to
> complete and add implementation of ->get_folio_noalloc() to guest_memfd
> vm_ops.
> 
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>   virt/kvm/guest_memfd.c | 33 ++++++++++++++++++++++++++++++++-
>   1 file changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index ffadc5ee8e04..dca6e373937b 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -4,6 +4,7 @@
>   #include <linux/kvm_host.h>
>   #include <linux/pagemap.h>
>   #include <linux/anon_inodes.h>
> +#include <linux/userfaultfd_k.h>
>   
>   #include "kvm_mm.h"
>   
> @@ -359,7 +360,15 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>   	if (!((u64)inode->i_private & GUEST_MEMFD_FLAG_INIT_SHARED))
>   		return VM_FAULT_SIGBUS;
>   
> -	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> +	folio = filemap_lock_folio(inode->i_mapping, vmf->pgoff);
> +	if (!IS_ERR_OR_NULL(folio) && userfaultfd_minor(vmf->vma)) {

Can we ever get NULL here?

> +		ret = VM_FAULT_UFFD_MINOR;
> +		goto out_folio;
> +	}
> +
> +	if (PTR_ERR(folio) == -ENOENT)
> +		folio = kvm_gmem_get_folio(inode, vmf->pgoff);

Was briefly wondering what the performance impact of that two-step 
approach is (two lookups in case we have to create it IIUC)

Wouldn't it be better to limit it to the userfaultfd_minor(vmf->vma) case?


if (userfaultfd_minor(vmf->vma)) {
	folio = filemap_lock_folio(inode->i_mapping, vmf->pgoff);
	if (!IS_ERR(folio)) {
		ret = VM_FAULT_UFFD_MINOR;
		goto out_folio;
	}
} else {
	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
}

if (IS_ERR(folio)) {
...

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-11-30 11:18 ` [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode Mike Rapoport
  2025-12-01  9:12   ` David Hildenbrand (Red Hat)
@ 2025-12-01 13:39   ` Nikita Kalyazin
  2025-12-01 15:54     ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2025-12-01 13:39 UTC (permalink / raw)
  To: Mike Rapoport, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
	David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
	Lorenzo Stoakes, Michal Hocko, Paolo Bonzini, Peter Xu,
	Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
	Vlastimil Babka, linux-kernel, kvm, linux-kselftest



On 30/11/2025 11:18, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> userfaultfd notifications about minor page faults used for live migration
> and snapshotting of VMs with memory backed by shared hugetlbfs or tmpfs
> mappings as described in detail in commit 7677f7fd8be7 ("userfaultfd: add
> minor fault registration mode").
> 
> To use the same mechanism for VMs that use guest_memfd to map their memory,
> guest_memfd should support userfaultfd minor mode.
> 
> Extend ->fault() method of guest_memfd with ability to notify core page
> fault handler that a page fault requires handle_userfault(VM_UFFD_MINOR) to
> complete and add implementation of ->get_folio_noalloc() to guest_memfd
> vm_ops.
> 
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>   virt/kvm/guest_memfd.c | 33 ++++++++++++++++++++++++++++++++-
>   1 file changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index ffadc5ee8e04..dca6e373937b 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -4,6 +4,7 @@
>   #include <linux/kvm_host.h>
>   #include <linux/pagemap.h>
>   #include <linux/anon_inodes.h>
> +#include <linux/userfaultfd_k.h>
> 
>   #include "kvm_mm.h"
> 
> @@ -359,7 +360,15 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>          if (!((u64)inode->i_private & GUEST_MEMFD_FLAG_INIT_SHARED))
>                  return VM_FAULT_SIGBUS;
> 
> -       folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> +       folio = filemap_lock_folio(inode->i_mapping, vmf->pgoff);
> +       if (!IS_ERR_OR_NULL(folio) && userfaultfd_minor(vmf->vma)) {
> +               ret = VM_FAULT_UFFD_MINOR;
> +               goto out_folio;
> +       }

I realised that I might have been wrong in [1] saying that the noalloc 
get folio was ok for our use case.  Unfortunately we rely on a minor 
fault to get generated even when the page is being allocated.  Peter and 
I discussed it originally in [2].  Since we want to populate guest 
memory with the content supplied by userspace on demand, we have to be 
able to intercept the very first access, meaning we either need a minor 
or major UFFD event for that.  We decided to make use of the minor at 
the time.  If we have to preserve the shmem semantics, it forces us to 
implement support for major/UFFDIO_COPY.

[1] 
https://lore.kernel.org/all/4405c306-9d7c-4fd6-9ea6-2ed1b73f5c2e@amazon.com
[2] https://lore.kernel.org/kvm/Z9HhTjEWtM58Zfxf@x1.local

> +
> +       if (PTR_ERR(folio) == -ENOENT)
> +               folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> +
>          if (IS_ERR(folio)) {
>                  int err = PTR_ERR(folio);
> 
> @@ -390,8 +399,30 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>          return ret;
>   }
> 
> +#ifdef CONFIG_USERFAULTFD
> +static struct folio *kvm_gmem_get_folio_noalloc(struct inode *inode,
> +                                               pgoff_t pgoff)
> +{
> +       struct folio *folio;
> +
> +       folio = filemap_lock_folio(inode->i_mapping, pgoff);
> +       if (IS_ERR_OR_NULL(folio))
> +               return folio;
> +
> +       if (!folio_test_uptodate(folio)) {
> +               clear_highpage(folio_page(folio, 0));
> +               kvm_gmem_mark_prepared(folio);
> +       }
> +
> +       return folio;
> +}
> +#endif
> +
>   static const struct vm_operations_struct kvm_gmem_vm_ops = {
>          .fault = kvm_gmem_fault_user_mapping,
> +#ifdef CONFIG_USERFAULTFD
> +       .get_folio_noalloc      = kvm_gmem_get_folio_noalloc,
> +#endif
>   };
> 
>   static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> --
> 2.51.0
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-12-01 13:39   ` Nikita Kalyazin
@ 2025-12-01 15:54     ` David Hildenbrand (Red Hat)
  2025-12-01 16:48       ` Nikita Kalyazin
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-01 15:54 UTC (permalink / raw)
  To: kalyazin, Mike Rapoport, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
	Hugh Dickins, James Houghton, Liam R. Howlett, Lorenzo Stoakes,
	Michal Hocko, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, linux-kernel,
	kvm, linux-kselftest

On 12/1/25 14:39, Nikita Kalyazin wrote:
> 
> 
> On 30/11/2025 11:18, Mike Rapoport wrote:
>> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>>
>> userfaultfd notifications about minor page faults used for live migration
>> and snapshotting of VMs with memory backed by shared hugetlbfs or tmpfs
>> mappings as described in detail in commit 7677f7fd8be7 ("userfaultfd: add
>> minor fault registration mode").
>>
>> To use the same mechanism for VMs that use guest_memfd to map their memory,
>> guest_memfd should support userfaultfd minor mode.
>>
>> Extend ->fault() method of guest_memfd with ability to notify core page
>> fault handler that a page fault requires handle_userfault(VM_UFFD_MINOR) to
>> complete and add implementation of ->get_folio_noalloc() to guest_memfd
>> vm_ops.
>>
>> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
>> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>> ---
>>    virt/kvm/guest_memfd.c | 33 ++++++++++++++++++++++++++++++++-
>>    1 file changed, 32 insertions(+), 1 deletion(-)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index ffadc5ee8e04..dca6e373937b 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -4,6 +4,7 @@
>>    #include <linux/kvm_host.h>
>>    #include <linux/pagemap.h>
>>    #include <linux/anon_inodes.h>
>> +#include <linux/userfaultfd_k.h>
>>
>>    #include "kvm_mm.h"
>>
>> @@ -359,7 +360,15 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>>           if (!((u64)inode->i_private & GUEST_MEMFD_FLAG_INIT_SHARED))
>>                   return VM_FAULT_SIGBUS;
>>
>> -       folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>> +       folio = filemap_lock_folio(inode->i_mapping, vmf->pgoff);
>> +       if (!IS_ERR_OR_NULL(folio) && userfaultfd_minor(vmf->vma)) {
>> +               ret = VM_FAULT_UFFD_MINOR;
>> +               goto out_folio;
>> +       }
> 
> I realised that I might have been wrong in [1] saying that the noalloc
> get folio was ok for our use case.  Unfortunately we rely on a minor
> fault to get generated even when the page is being allocated.  Peter and
> I discussed it originally in [2].  Since we want to populate guest
> memory with the content supplied by userspace on demand, we have to be
> able to intercept the very first access, meaning we either need a minor
> or major UFFD event for that.  We decided to make use of the minor at
> the time.  If we have to preserve the shmem semantics, it forces us to
> implement support for major/UFFDIO_COPY.

If we want missing semantics then likely we should be adding ... missing 
support? :)

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-12-01 15:54     ` David Hildenbrand (Red Hat)
@ 2025-12-01 16:48       ` Nikita Kalyazin
  2025-12-01 18:35         ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2025-12-01 16:48 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), Mike Rapoport, linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
	Hugh Dickins, James Houghton, Liam R. Howlett, Lorenzo Stoakes,
	Michal Hocko, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, linux-kernel,
	kvm, linux-kselftest



On 01/12/2025 15:54, David Hildenbrand (Red Hat) wrote:
> On 12/1/25 14:39, Nikita Kalyazin wrote:
>>
>>
>> On 30/11/2025 11:18, Mike Rapoport wrote:
>>> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>>>
>>> userfaultfd notifications about minor page faults used for live 
>>> migration
>>> and snapshotting of VMs with memory backed by shared hugetlbfs or tmpfs
>>> mappings as described in detail in commit 7677f7fd8be7 ("userfaultfd: 
>>> add
>>> minor fault registration mode").
>>>
>>> To use the same mechanism for VMs that use guest_memfd to map their 
>>> memory,
>>> guest_memfd should support userfaultfd minor mode.
>>>
>>> Extend ->fault() method of guest_memfd with ability to notify core page
>>> fault handler that a page fault requires 
>>> handle_userfault(VM_UFFD_MINOR) to
>>> complete and add implementation of ->get_folio_noalloc() to guest_memfd
>>> vm_ops.
>>>
>>> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
>>> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>>> ---
>>>    virt/kvm/guest_memfd.c | 33 ++++++++++++++++++++++++++++++++-
>>>    1 file changed, 32 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>>> index ffadc5ee8e04..dca6e373937b 100644
>>> --- a/virt/kvm/guest_memfd.c
>>> +++ b/virt/kvm/guest_memfd.c
>>> @@ -4,6 +4,7 @@
>>>    #include <linux/kvm_host.h>
>>>    #include <linux/pagemap.h>
>>>    #include <linux/anon_inodes.h>
>>> +#include <linux/userfaultfd_k.h>
>>>
>>>    #include "kvm_mm.h"
>>>
>>> @@ -359,7 +360,15 @@ static vm_fault_t 
>>> kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>>>           if (!((u64)inode->i_private & GUEST_MEMFD_FLAG_INIT_SHARED))
>>>                   return VM_FAULT_SIGBUS;
>>>
>>> -       folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>>> +       folio = filemap_lock_folio(inode->i_mapping, vmf->pgoff);
>>> +       if (!IS_ERR_OR_NULL(folio) && userfaultfd_minor(vmf->vma)) {
>>> +               ret = VM_FAULT_UFFD_MINOR;
>>> +               goto out_folio;
>>> +       }
>>
>> I realised that I might have been wrong in [1] saying that the noalloc
>> get folio was ok for our use case.  Unfortunately we rely on a minor
>> fault to get generated even when the page is being allocated.  Peter and
>> I discussed it originally in [2].  Since we want to populate guest
>> memory with the content supplied by userspace on demand, we have to be
>> able to intercept the very first access, meaning we either need a minor
>> or major UFFD event for that.  We decided to make use of the minor at
>> the time.  If we have to preserve the shmem semantics, it forces us to
>> implement support for major/UFFDIO_COPY.
> 
> If we want missing semantics then likely we should be adding ... missing
> support? :)

I believe I found the precise point where we convinced ourselves that 
minor support was sufficient: [1].  If at this moment we don't find that 
reasoning valid anymore, then indeed implementing missing is the only 
option.

[1] https://lore.kernel.org/kvm/Z9GsIDVYWoV8d8-C@x1.local

> 
> -- 
> Cheers
> 
> David



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-12-01 16:48       ` Nikita Kalyazin
@ 2025-12-01 18:35         ` Peter Xu
  2025-12-01 20:12           ` Nikita Kalyazin
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2025-12-01 18:35 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: David Hildenbrand (Red Hat),
	Mike Rapoport, linux-mm, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Baolin Wang, Hugh Dickins, James Houghton,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Paolo Bonzini,
	Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
	Vlastimil Babka, linux-kernel, kvm, linux-kselftest

On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
> I believe I found the precise point where we convinced ourselves that minor
> support was sufficient: [1].  If at this moment we don't find that reasoning
> valid anymore, then indeed implementing missing is the only option.
> 
> [1] https://lore.kernel.org/kvm/Z9GsIDVYWoV8d8-C@x1.local

Now after I re-read the discussion, I may have made a wrong statement
there, sorry.  I could have got slightly confused on when the write()
syscall can be involved.

I agree if you want to get an event when cache missed with the current uffd
definitions and when pre-population is forbidden, then MISSING trap is
required.  That is, with/without the need of UFFDIO_COPY being available.

Do I understand it right that UFFDIO_COPY is not allowed in your case, but
only write()?

One way that might work this around, is introducing a new UFFD_FEATURE bit
allowing the MINOR registration to trap all pgtable faults, which will
change the MINOR fault semantics.

That'll need some further thoughts, meanwhile we may also want to make sure
the old shmem/hugetlbfs semantics are kept (e.g. they should fail MINOR
registers if the new feature bit is enabled in the ctx somehow; or support
them properly in the codebase).

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-12-01 18:35         ` Peter Xu
@ 2025-12-01 20:12           ` Nikita Kalyazin
  2025-12-01 20:57             ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2025-12-01 20:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand (Red Hat),
	Mike Rapoport, linux-mm, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Baolin Wang, Hugh Dickins, James Houghton,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Paolo Bonzini,
	Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
	Vlastimil Babka, linux-kernel, kvm, linux-kselftest



On 01/12/2025 18:35, Peter Xu wrote:
> On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
>> I believe I found the precise point where we convinced ourselves that minor
>> support was sufficient: [1].  If at this moment we don't find that reasoning
>> valid anymore, then indeed implementing missing is the only option.
>>
>> [1] https://lore.kernel.org/kvm/Z9GsIDVYWoV8d8-C@x1.local
> 
> Now after I re-read the discussion, I may have made a wrong statement
> there, sorry.  I could have got slightly confused on when the write()
> syscall can be involved.
> 
> I agree if you want to get an event when cache missed with the current uffd
> definitions and when pre-population is forbidden, then MISSING trap is
> required.  That is, with/without the need of UFFDIO_COPY being available.
> 
> Do I understand it right that UFFDIO_COPY is not allowed in your case, but
> only write()?

No, UFFDIO_COPY would work perfectly fine.  We will still use write() 
whenever we resolve stage-2 faults as they aren't visible to UFFD.  When 
a userfault occurs at an offset that already has a page in the cache, we 
will have to keep using UFFDIO_CONTINUE so it looks like both will be 
required:

  - user mapping major fault -> UFFDIO_COPY (fills the cache and sets up 
userspace PT)
  - user mapping minor fault -> UFFDIO_CONTINUE (only sets up userspace PT)
  - stage-2 fault -> write() (only fills the cache)

> 
> One way that might work this around, is introducing a new UFFD_FEATURE bit
> allowing the MINOR registration to trap all pgtable faults, which will
> change the MINOR fault semantics.

This would equally work for us.  I suppose this MINOR+MAJOR semantics 
would be more intrusive from the API point of view though.

> 
> That'll need some further thoughts, meanwhile we may also want to make sure
> the old shmem/hugetlbfs semantics are kept (e.g. they should fail MINOR
> registers if the new feature bit is enabled in the ctx somehow; or support
> them properly in the codebase).
> 
> Thanks,
> 
> --
> Peter Xu
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-12-01 20:12           ` Nikita Kalyazin
@ 2025-12-01 20:57             ` Peter Xu
  2025-12-02 11:50               ` Nikita Kalyazin
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2025-12-01 20:57 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: David Hildenbrand (Red Hat),
	Mike Rapoport, linux-mm, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Baolin Wang, Hugh Dickins, James Houghton,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Paolo Bonzini,
	Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
	Vlastimil Babka, linux-kernel, kvm, linux-kselftest

On Mon, Dec 01, 2025 at 08:12:38PM +0000, Nikita Kalyazin wrote:
> 
> 
> On 01/12/2025 18:35, Peter Xu wrote:
> > On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
> > > I believe I found the precise point where we convinced ourselves that minor
> > > support was sufficient: [1].  If at this moment we don't find that reasoning
> > > valid anymore, then indeed implementing missing is the only option.
> > > 
> > > [1] https://lore.kernel.org/kvm/Z9GsIDVYWoV8d8-C@x1.local
> > 
> > Now after I re-read the discussion, I may have made a wrong statement
> > there, sorry.  I could have got slightly confused on when the write()
> > syscall can be involved.
> > 
> > I agree if you want to get an event when cache missed with the current uffd
> > definitions and when pre-population is forbidden, then MISSING trap is
> > required.  That is, with/without the need of UFFDIO_COPY being available.
> > 
> > Do I understand it right that UFFDIO_COPY is not allowed in your case, but
> > only write()?
> 
> No, UFFDIO_COPY would work perfectly fine.  We will still use write()
> whenever we resolve stage-2 faults as they aren't visible to UFFD.  When a
> userfault occurs at an offset that already has a page in the cache, we will
> have to keep using UFFDIO_CONTINUE so it looks like both will be required:
> 
>  - user mapping major fault -> UFFDIO_COPY (fills the cache and sets up
> userspace PT)
>  - user mapping minor fault -> UFFDIO_CONTINUE (only sets up userspace PT)
>  - stage-2 fault -> write() (only fills the cache)

Is stage-2 fault about KVM_MEMORY_EXIT_FLAG_USERFAULT, per James's series?

It looks fine indeed, but it looks slightly weird then, as you'll have two
ways to populate the page cache.  Logically here atomicity is indeed not
needed when you trap both MISSING + MINOR.

> 
> > 
> > One way that might work this around, is introducing a new UFFD_FEATURE bit
> > allowing the MINOR registration to trap all pgtable faults, which will
> > change the MINOR fault semantics.
> 
> This would equally work for us.  I suppose this MINOR+MAJOR semantics would
> be more intrusive from the API point of view though.

Yes it is, it's just that I don't know whether it'll be harder when you
want to completely support UFFDIO_COPY here, per previous discussions.

After a 2nd thought, such UFFD_FEATURE is probably not a good design,
because it essentially means that feature bit will functionally overlap
with what MISSING trap was trying to do, however duplicating that concept
in a VMA that was registered as MINOR only.

Maybe it's possible instead if we allow a module to support MISSING trap,
but without supporting UFFDIO_COPY ioctl.

That is, the MISSING events will be properly generated if MISSING traps are
supported, however the module needs to provide its own way to resolve it if
UFFDIO_COPY ioctl isn't available.  Gmem is fine in this case as long as
it'll always be registered with both MISSING+MINOR traps, then resolving
using write()s would work.

Such would be possible when with something like my v3 previously:

https://lore.kernel.org/all/20250926211650.525109-1-peterx@redhat.com/#t

Then gmem needs to declare VM_UFFD_MISSING + VM_UFFD_MINOR in
uffd_features, but _UFFDIO_CONTINUE only (without _UFFDIO_COPY) in
uffd_ioctls.

Since Mike already took this series over, I'll leave that to you all to
decide.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-12-01 20:57             ` Peter Xu
@ 2025-12-02 11:50               ` Nikita Kalyazin
  2025-12-02 15:36                 ` Peter Xu
  2025-12-03  9:23                 ` David Hildenbrand (Red Hat)
  0 siblings, 2 replies; 20+ messages in thread
From: Nikita Kalyazin @ 2025-12-02 11:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand (Red Hat),
	Mike Rapoport, linux-mm, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Baolin Wang, Hugh Dickins, James Houghton,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Paolo Bonzini,
	Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
	Vlastimil Babka, linux-kernel, kvm, linux-kselftest



On 01/12/2025 20:57, Peter Xu wrote:
> On Mon, Dec 01, 2025 at 08:12:38PM +0000, Nikita Kalyazin wrote:
>>
>>
>> On 01/12/2025 18:35, Peter Xu wrote:
>>> On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
>>>> I believe I found the precise point where we convinced ourselves that minor
>>>> support was sufficient: [1].  If at this moment we don't find that reasoning
>>>> valid anymore, then indeed implementing missing is the only option.
>>>>
>>>> [1] https://lore.kernel.org/kvm/Z9GsIDVYWoV8d8-C@x1.local
>>>
>>> Now after I re-read the discussion, I may have made a wrong statement
>>> there, sorry.  I could have got slightly confused on when the write()
>>> syscall can be involved.
>>>
>>> I agree if you want to get an event when cache missed with the current uffd
>>> definitions and when pre-population is forbidden, then MISSING trap is
>>> required.  That is, with/without the need of UFFDIO_COPY being available.
>>>
>>> Do I understand it right that UFFDIO_COPY is not allowed in your case, but
>>> only write()?
>>
>> No, UFFDIO_COPY would work perfectly fine.  We will still use write()
>> whenever we resolve stage-2 faults as they aren't visible to UFFD.  When a
>> userfault occurs at an offset that already has a page in the cache, we will
>> have to keep using UFFDIO_CONTINUE so it looks like both will be required:
>>
>>   - user mapping major fault -> UFFDIO_COPY (fills the cache and sets up
>> userspace PT)
>>   - user mapping minor fault -> UFFDIO_CONTINUE (only sets up userspace PT)
>>   - stage-2 fault -> write() (only fills the cache)
> 
> Is stage-2 fault about KVM_MEMORY_EXIT_FLAG_USERFAULT, per James's series?

Yes, that's the one ([1]).

[1] 
https://lore.kernel.org/kvm/20250618042424.330664-1-jthoughton@google.com

> 
> It looks fine indeed, but it looks slightly weird then, as you'll have two
> ways to populate the page cache.  Logically here atomicity is indeed not
> needed when you trap both MISSING + MINOR.

I reran the test based on the UFFDIO_COPY prototype I had using your 
series [2], and UFFDIO_COPY is slower than write() to populate 512 MiB: 
237 vs 202 ms (+17%).  Even though UFFDIO_COPY alone is functionally 
sufficient, I would prefer to have an option to use write() where 
possible and only falling back to UFFDIO_COPY for userspace faults to 
have better performance.

[2] 
https://lore.kernel.org/all/7666ee96-6f09-4dc1-8cb2-002a2d2a29cf@amazon.com

> 
>>
>>>
>>> One way that might work this around, is introducing a new UFFD_FEATURE bit
>>> allowing the MINOR registration to trap all pgtable faults, which will
>>> change the MINOR fault semantics.
>>
>> This would equally work for us.  I suppose this MINOR+MAJOR semantics would
>> be more intrusive from the API point of view though.
> 
> Yes it is, it's just that I don't know whether it'll be harder when you
> want to completely support UFFDIO_COPY here, per previous discussions.
> 
> After a 2nd thought, such UFFD_FEATURE is probably not a good design,
> because it essentially means that feature bit will functionally overlap
> with what MISSING trap was trying to do, however duplicating that concept
> in a VMA that was registered as MINOR only.
> 
> Maybe it's possible instead if we allow a module to support MISSING trap,
> but without supporting UFFDIO_COPY ioctl.
> 
> That is, the MISSING events will be properly generated if MISSING traps are
> supported, however the module needs to provide its own way to resolve it if
> UFFDIO_COPY ioctl isn't available.  Gmem is fine in this case as long as
> it'll always be registered with both MISSING+MINOR traps, then resolving
> using write()s would work.

Yes, this would also work for me.  This is almost how it was 
(accidentally) working until this version of the patches.  If this is a 
lighter undertaking compared to implementing UFFDIO_COPY, I'd be happy 
with it too.

> 
> Such would be possible when with something like my v3 previously:
> 
> https://lore.kernel.org/all/20250926211650.525109-1-peterx@redhat.com/#t
> 
> Then gmem needs to declare VM_UFFD_MISSING + VM_UFFD_MINOR in
> uffd_features, but _UFFDIO_CONTINUE only (without _UFFDIO_COPY) in
> uffd_ioctls.
> 
> Since Mike already took this series over, I'll leave that to you all to
> decide.

Thanks for you input, Peter.

> 
> --
> Peter Xu
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-12-02 11:50               ` Nikita Kalyazin
@ 2025-12-02 15:36                 ` Peter Xu
  2025-12-02 15:59                   ` Nikita Kalyazin
  2025-12-03  9:23                 ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Xu @ 2025-12-02 15:36 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: David Hildenbrand (Red Hat),
	Mike Rapoport, linux-mm, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Baolin Wang, Hugh Dickins, James Houghton,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Paolo Bonzini,
	Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
	Vlastimil Babka, linux-kernel, kvm, linux-kselftest

On Tue, Dec 02, 2025 at 11:50:31AM +0000, Nikita Kalyazin wrote:
> > It looks fine indeed, but it looks slightly weird then, as you'll have two
> > ways to populate the page cache.  Logically here atomicity is indeed not
> > needed when you trap both MISSING + MINOR.
> 
> I reran the test based on the UFFDIO_COPY prototype I had using your series
> [2], and UFFDIO_COPY is slower than write() to populate 512 MiB: 237 vs 202
> ms (+17%).  Even though UFFDIO_COPY alone is functionally sufficient, I
> would prefer to have an option to use write() where possible and only
> falling back to UFFDIO_COPY for userspace faults to have better performance.

Yes, write() should be fine.

Especially to gmem, I guess write() support is needed when VMAs cannot be
mapped at all in strict CoCo context, so it needs to be available one way
or another.

IIUC it's because UFFDIO_COPY (or memcpy(), I recall you used to test that
instead) will involve pgtable operations.  So I wonder if the VMA mapping
the gmem will still be accessed at some point later (either private->share
convertable ones for device DMAs for CoCo, or fully shared non-CoCo use
case), then the pgtable overhead will happen later for a write()-styled
fault resolution.

From that POV, above number makes sense.

Thanks for the extra testing results.

> 
> [2]
> https://lore.kernel.org/all/7666ee96-6f09-4dc1-8cb2-002a2d2a29cf@amazon.com

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-12-02 15:36                 ` Peter Xu
@ 2025-12-02 15:59                   ` Nikita Kalyazin
  0 siblings, 0 replies; 20+ messages in thread
From: Nikita Kalyazin @ 2025-12-02 15:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand (Red Hat),
	Mike Rapoport, linux-mm, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Baolin Wang, Hugh Dickins, James Houghton,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Paolo Bonzini,
	Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
	Vlastimil Babka, linux-kernel, kvm, linux-kselftest




On 02/12/2025 15:36, Peter Xu wrote:
> On Tue, Dec 02, 2025 at 11:50:31AM +0000, Nikita Kalyazin wrote:
>>> It looks fine indeed, but it looks slightly weird then, as you'll have two
>>> ways to populate the page cache.  Logically here atomicity is indeed not
>>> needed when you trap both MISSING + MINOR.
>>
>> I reran the test based on the UFFDIO_COPY prototype I had using your series
>> [2], and UFFDIO_COPY is slower than write() to populate 512 MiB: 237 vs 202
>> ms (+17%).  Even though UFFDIO_COPY alone is functionally sufficient, I
>> would prefer to have an option to use write() where possible and only
>> falling back to UFFDIO_COPY for userspace faults to have better performance.
> 
> Yes, write() should be fine.
> 
> Especially to gmem, I guess write() support is needed when VMAs cannot be
> mapped at all in strict CoCo context, so it needs to be available one way
> or another.

write() is supposed to be supported only for shared memory, ie 
accessible to the host.  AFAIK private memory should be populated via 
other mechanisms.

> 
> IIUC it's because UFFDIO_COPY (or memcpy(), I recall you used to test that
> instead) will involve pgtable operations.
Yes, for memcpy() it's even worse because it triggers VMA faults for 
every page.  UFFDIO_COPY's overhead is lower because the only extra 
thing it does compared to write() is installing user PTs.

> instead) will involve pgtable operations.  So I wonder if the VMA mapping
> the gmem will still be accessed at some point later (either private->share
> convertable ones for device DMAs for CoCo, or fully shared non-CoCo use
> case), then the pgtable overhead will happen later for a write()-styled
> fault resolution.

At least in Firecracker use case, only pages that are related to PV 
devices are going to get accessed by the VMM via user PTs (such as 
virtio queues and buffers).  The majority of pages are only touched by 
vCPUs via stage-2 mappings and are never accessed via user PTs.

> 
>  From that POV, above number makes sense.
> 
> Thanks for the extra testing results.
> 
>>
>> [2]
>> https://lore.kernel.org/all/7666ee96-6f09-4dc1-8cb2-002a2d2a29cf@amazon.com
> 
> --
> Peter Xu
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-12-02 11:50               ` Nikita Kalyazin
  2025-12-02 15:36                 ` Peter Xu
@ 2025-12-03  9:23                 ` David Hildenbrand (Red Hat)
  2025-12-03 10:03                   ` Nikita Kalyazin
  1 sibling, 1 reply; 20+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-03  9:23 UTC (permalink / raw)
  To: kalyazin, Peter Xu
  Cc: Mike Rapoport, linux-mm, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Baolin Wang, Hugh Dickins, James Houghton,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Paolo Bonzini,
	Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
	Vlastimil Babka, linux-kernel, kvm, linux-kselftest

On 12/2/25 12:50, Nikita Kalyazin wrote:
> 
> 
> On 01/12/2025 20:57, Peter Xu wrote:
>> On Mon, Dec 01, 2025 at 08:12:38PM +0000, Nikita Kalyazin wrote:
>>>
>>>
>>> On 01/12/2025 18:35, Peter Xu wrote:
>>>> On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
>>>>> I believe I found the precise point where we convinced ourselves that minor
>>>>> support was sufficient: [1].  If at this moment we don't find that reasoning
>>>>> valid anymore, then indeed implementing missing is the only option.
>>>>>
>>>>> [1] https://lore.kernel.org/kvm/Z9GsIDVYWoV8d8-C@x1.local
>>>>
>>>> Now after I re-read the discussion, I may have made a wrong statement
>>>> there, sorry.  I could have got slightly confused on when the write()
>>>> syscall can be involved.
>>>>
>>>> I agree if you want to get an event when cache missed with the current uffd
>>>> definitions and when pre-population is forbidden, then MISSING trap is
>>>> required.  That is, with/without the need of UFFDIO_COPY being available.
>>>>
>>>> Do I understand it right that UFFDIO_COPY is not allowed in your case, but
>>>> only write()?
>>>
>>> No, UFFDIO_COPY would work perfectly fine.  We will still use write()
>>> whenever we resolve stage-2 faults as they aren't visible to UFFD.  When a
>>> userfault occurs at an offset that already has a page in the cache, we will
>>> have to keep using UFFDIO_CONTINUE so it looks like both will be required:
>>>
>>>    - user mapping major fault -> UFFDIO_COPY (fills the cache and sets up
>>> userspace PT)
>>>    - user mapping minor fault -> UFFDIO_CONTINUE (only sets up userspace PT)
>>>    - stage-2 fault -> write() (only fills the cache)
>>
>> Is stage-2 fault about KVM_MEMORY_EXIT_FLAG_USERFAULT, per James's series?
> 
> Yes, that's the one ([1]).
> 
> [1]
> https://lore.kernel.org/kvm/20250618042424.330664-1-jthoughton@google.com
> 
>>
>> It looks fine indeed, but it looks slightly weird then, as you'll have two
>> ways to populate the page cache.  Logically here atomicity is indeed not
>> needed when you trap both MISSING + MINOR.
> 
> I reran the test based on the UFFDIO_COPY prototype I had using your
> series [2], and UFFDIO_COPY is slower than write() to populate 512 MiB:
> 237 vs 202 ms (+17%).  Even though UFFDIO_COPY alone is functionally
> sufficient, I would prefer to have an option to use write() where
> possible and only falling back to UFFDIO_COPY for userspace faults to
> have better performance.

Just so I understand correctly: we could even do without UFFDIO_COPY for 
that scenario by using write() + minor faults?

But what you are saying is that there might be a performance benefit in 
using UFFDIO_COPY for userspace faults, to avoid the write()+minor fault 
overhead?

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-12-03  9:23                 ` David Hildenbrand (Red Hat)
@ 2025-12-03 10:03                   ` Nikita Kalyazin
  2025-12-04 17:27                     ` Nikita Kalyazin
  0 siblings, 1 reply; 20+ messages in thread
From: Nikita Kalyazin @ 2025-12-03 10:03 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), Peter Xu
  Cc: Mike Rapoport, linux-mm, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Baolin Wang, Hugh Dickins, James Houghton,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Paolo Bonzini,
	Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
	Vlastimil Babka, linux-kernel, kvm, linux-kselftest



On 03/12/2025 09:23, David Hildenbrand (Red Hat) wrote:
> On 12/2/25 12:50, Nikita Kalyazin wrote:
>>
>>
>> On 01/12/2025 20:57, Peter Xu wrote:
>>> On Mon, Dec 01, 2025 at 08:12:38PM +0000, Nikita Kalyazin wrote:
>>>>
>>>>
>>>> On 01/12/2025 18:35, Peter Xu wrote:
>>>>> On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
>>>>>> I believe I found the precise point where we convinced ourselves 
>>>>>> that minor
>>>>>> support was sufficient: [1].  If at this moment we don't find that 
>>>>>> reasoning
>>>>>> valid anymore, then indeed implementing missing is the only option.
>>>>>>
>>>>>> [1] https://lore.kernel.org/kvm/Z9GsIDVYWoV8d8-C@x1.local
>>>>>
>>>>> Now after I re-read the discussion, I may have made a wrong statement
>>>>> there, sorry.  I could have got slightly confused on when the write()
>>>>> syscall can be involved.
>>>>>
>>>>> I agree if you want to get an event when cache missed with the 
>>>>> current uffd
>>>>> definitions and when pre-population is forbidden, then MISSING trap is
>>>>> required.  That is, with/without the need of UFFDIO_COPY being 
>>>>> available.
>>>>>
>>>>> Do I understand it right that UFFDIO_COPY is not allowed in your 
>>>>> case, but
>>>>> only write()?
>>>>
>>>> No, UFFDIO_COPY would work perfectly fine.  We will still use write()
>>>> whenever we resolve stage-2 faults as they aren't visible to UFFD.  
>>>> When a
>>>> userfault occurs at an offset that already has a page in the cache, 
>>>> we will
>>>> have to keep using UFFDIO_CONTINUE so it looks like both will be 
>>>> required:
>>>>
>>>>    - user mapping major fault -> UFFDIO_COPY (fills the cache and 
>>>> sets up
>>>> userspace PT)
>>>>    - user mapping minor fault -> UFFDIO_CONTINUE (only sets up 
>>>> userspace PT)
>>>>    - stage-2 fault -> write() (only fills the cache)
>>>
>>> Is stage-2 fault about KVM_MEMORY_EXIT_FLAG_USERFAULT, per James's 
>>> series?
>>
>> Yes, that's the one ([1]).
>>
>> [1]
>> https://lore.kernel.org/kvm/20250618042424.330664-1-jthoughton@google.com
>>
>>>
>>> It looks fine indeed, but it looks slightly weird then, as you'll 
>>> have two
>>> ways to populate the page cache.  Logically here atomicity is indeed not
>>> needed when you trap both MISSING + MINOR.
>>
>> I reran the test based on the UFFDIO_COPY prototype I had using your
>> series [2], and UFFDIO_COPY is slower than write() to populate 512 MiB:
>> 237 vs 202 ms (+17%).  Even though UFFDIO_COPY alone is functionally
>> sufficient, I would prefer to have an option to use write() where
>> possible and only falling back to UFFDIO_COPY for userspace faults to
>> have better performance.
> 
> Just so I understand correctly: we could even do without UFFDIO_COPY for
> that scenario by using write() + minor faults?

We still need major fault notifications as well (which we were 
accidentally generating until this version).  But we can resolve them 
with write() + UFFDIO_CONTINUE instead of UFFDIO_COPY.

> 
> But what you are saying is that there might be a performance benefit in
> using UFFDIO_COPY for userspace faults, to avoid the write()+minor fault
> overhead?

UFFDIO_COPY _may_ be faster to resolve userspace faults because it's a 
single syscall instead of two, but the amount of userspace faults, at 
least in our scenario, is negligible compared to the amount of stage-2 
faults, so I wouldn't use it as an argument for supporting UFFDIO_COPY 
if it can be avoided.

> 
> -- 
> Cheers
> 
> David



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
  2025-12-03 10:03                   ` Nikita Kalyazin
@ 2025-12-04 17:27                     ` Nikita Kalyazin
  0 siblings, 0 replies; 20+ messages in thread
From: Nikita Kalyazin @ 2025-12-04 17:27 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), Peter Xu
  Cc: Mike Rapoport, linux-mm, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Baolin Wang, Hugh Dickins, James Houghton,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Paolo Bonzini,
	Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
	Vlastimil Babka, linux-kernel, kvm, linux-kselftest



On 03/12/2025 10:03, Nikita Kalyazin wrote:
> On 03/12/2025 09:23, David Hildenbrand (Red Hat) wrote:
>> On 12/2/25 12:50, Nikita Kalyazin wrote:
>>>
>>>
>>> On 01/12/2025 20:57, Peter Xu wrote:
>>>> On Mon, Dec 01, 2025 at 08:12:38PM +0000, Nikita Kalyazin wrote:
>>>>>
>>>>>
>>>>> On 01/12/2025 18:35, Peter Xu wrote:
>>>>>> On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
>>>>>>> I believe I found the precise point where we convinced ourselves 
>>>>>>> that minor
>>>>>>> support was sufficient: [1].  If at this moment we don't find 
>>>>>>> that reasoning
>>>>>>> valid anymore, then indeed implementing missing is the only option.
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/kvm/Z9GsIDVYWoV8d8-C@x1.local
>>>>>>
>>>>>> Now after I re-read the discussion, I may have made a wrong statement
>>>>>> there, sorry.  I could have got slightly confused on when the write()
>>>>>> syscall can be involved.
>>>>>>
>>>>>> I agree if you want to get an event when cache missed with the 
>>>>>> current uffd
>>>>>> definitions and when pre-population is forbidden, then MISSING 
>>>>>> trap is
>>>>>> required.  That is, with/without the need of UFFDIO_COPY being 
>>>>>> available.
>>>>>>
>>>>>> Do I understand it right that UFFDIO_COPY is not allowed in your 
>>>>>> case, but
>>>>>> only write()?
>>>>>
>>>>> No, UFFDIO_COPY would work perfectly fine.  We will still use write()
>>>>> whenever we resolve stage-2 faults as they aren't visible to UFFD. 
>>>>> When a
>>>>> userfault occurs at an offset that already has a page in the cache, 
>>>>> we will
>>>>> have to keep using UFFDIO_CONTINUE so it looks like both will be 
>>>>> required:
>>>>>
>>>>>    - user mapping major fault -> UFFDIO_COPY (fills the cache and 
>>>>> sets up
>>>>> userspace PT)
>>>>>    - user mapping minor fault -> UFFDIO_CONTINUE (only sets up 
>>>>> userspace PT)
>>>>>    - stage-2 fault -> write() (only fills the cache)
>>>>
>>>> Is stage-2 fault about KVM_MEMORY_EXIT_FLAG_USERFAULT, per James's 
>>>> series?
>>>
>>> Yes, that's the one ([1]).
>>>
>>> [1]
>>> https://lore.kernel.org/kvm/20250618042424.330664-1- 
>>> jthoughton@google.com
>>>
>>>>
>>>> It looks fine indeed, but it looks slightly weird then, as you'll 
>>>> have two
>>>> ways to populate the page cache.  Logically here atomicity is indeed 
>>>> not
>>>> needed when you trap both MISSING + MINOR.
>>>
>>> I reran the test based on the UFFDIO_COPY prototype I had using your
>>> series [2], and UFFDIO_COPY is slower than write() to populate 512 MiB:
>>> 237 vs 202 ms (+17%).  Even though UFFDIO_COPY alone is functionally
>>> sufficient, I would prefer to have an option to use write() where
>>> possible and only falling back to UFFDIO_COPY for userspace faults to
>>> have better performance.
>>
>> Just so I understand correctly: we could even do without UFFDIO_COPY for
>> that scenario by using write() + minor faults?
> 
> We still need major fault notifications as well (which we were 
> accidentally generating until this version).  But we can resolve them 
> with write() + UFFDIO_CONTINUE instead of UFFDIO_COPY.

We had a conversation about that at the guest_memfd sync today:

Q: Is it possible from the API point of view to support MISSING 
notifications without supporting UFFDIO_COPY?

A: The manpage [1] says on UFFDIO_REGISTER_MODE_MISSING that "the page 
fault is resolved from user-space by either an UFFDIO_COPY or an 
UFFDIO_ZEROPAGE ioctl", but I don't think it's actually enforced 
anywhere in the code.

Q: UFFDIO_COPY is supposed to provide atomic semantics, while write() + 
UFFDIO_CONTINUE does not. Is it a problem?

A: It isn't a problem for the particular Firecracker use case because 1) 
vCPUs can be prevented from seeing partially populated pages in the 
cache via KVM userfault intercept [2] and 2) we do not use other 
userspace mappings.  However, as James pointed, in the general case, 
other actors may observe partially populated pages via other userspace 
mappings.

[1] https://man7.org/linux/man-pages/man2/userfaultfd.2.html
[2] 
https://lore.kernel.org/kvm/20250618042424.330664-1-jthoughton@google.com

> 
>>
>> But what you are saying is that there might be a performance benefit in
>> using UFFDIO_COPY for userspace faults, to avoid the write()+minor fault
>> overhead?
> 
> UFFDIO_COPY _may_ be faster to resolve userspace faults because it's a 
> single syscall instead of two, but the amount of userspace faults, at 
> least in our scenario, is negligible compared to the amount of stage-2 
> faults, so I wouldn't use it as an argument for supporting UFFDIO_COPY 
> if it can be avoided.
> 
>>
>> -- 
>> Cheers
>>
>> David
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2025-12-04 17:27 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-30 11:18 [PATCH v3 0/5] mm, kvm: add guest_memfd support for uffd minor faults Mike Rapoport
2025-11-30 11:18 ` [PATCH v3 1/5] userfaultfd: move vma_can_userfault out of line Mike Rapoport
2025-11-30 11:18 ` [PATCH v3 2/5] userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE Mike Rapoport
2025-11-30 11:18 ` [PATCH v3 3/5] mm: introduce VM_FAULT_UFFD_MINOR fault reason Mike Rapoport
2025-12-01  8:59   ` David Hildenbrand (Red Hat)
2025-11-30 11:18 ` [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode Mike Rapoport
2025-12-01  9:12   ` David Hildenbrand (Red Hat)
2025-12-01 13:39   ` Nikita Kalyazin
2025-12-01 15:54     ` David Hildenbrand (Red Hat)
2025-12-01 16:48       ` Nikita Kalyazin
2025-12-01 18:35         ` Peter Xu
2025-12-01 20:12           ` Nikita Kalyazin
2025-12-01 20:57             ` Peter Xu
2025-12-02 11:50               ` Nikita Kalyazin
2025-12-02 15:36                 ` Peter Xu
2025-12-02 15:59                   ` Nikita Kalyazin
2025-12-03  9:23                 ` David Hildenbrand (Red Hat)
2025-12-03 10:03                   ` Nikita Kalyazin
2025-12-04 17:27                     ` Nikita Kalyazin
2025-11-30 11:18 ` [PATCH v3 5/5] KVM: selftests: test userfaultfd minor for guest_memfd Mike Rapoport

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox