* [RFC PATCH 1/4] userfaultfd: move vma_can_userfault out of line
2025-11-17 11:46 [RFC PATCH 0/4] mm, kvm: add guest_memfd support for uffd minor faults Mike Rapoport
@ 2025-11-17 11:46 ` Mike Rapoport
2025-11-17 17:00 ` David Hildenbrand (Red Hat)
2025-11-17 11:46 ` [RFC PATCH 2/4] userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE Mike Rapoport
` (3 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2025-11-17 11:46 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Baolin Wang, David Hildenbrand,
Hugh Dickins, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Mike Rapoport, Nikita Kalyazin, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
vma_can_userfault() has grown pretty big and it's not called on a
performance-critical path.
Move it out of line.
No functional changes.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/userfaultfd_k.h | 36 ++---------------------------------
mm/userfaultfd.c | 34 +++++++++++++++++++++++++++++++++
2 files changed, 36 insertions(+), 34 deletions(-)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index c0e716aec26a..e4f43e7b063f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -208,40 +208,8 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
return vma->vm_flags & __VM_UFFD_FLAGS;
}
-static inline bool vma_can_userfault(struct vm_area_struct *vma,
- vm_flags_t vm_flags,
- bool wp_async)
-{
- vm_flags &= __VM_UFFD_FLAGS;
-
- if (vma->vm_flags & VM_DROPPABLE)
- return false;
-
- if ((vm_flags & VM_UFFD_MINOR) &&
- (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
- return false;
-
- /*
- * If wp async enabled, and WP is the only mode enabled, allow any
- * memory type.
- */
- if (wp_async && (vm_flags == VM_UFFD_WP))
- return true;
-
-#ifndef CONFIG_PTE_MARKER_UFFD_WP
- /*
- * If user requested uffd-wp but not enabled pte markers for
- * uffd-wp, then shmem & hugetlbfs are not supported but only
- * anonymous.
- */
- if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
- return false;
-#endif
-
- /* By default, allow any of anon|shmem|hugetlb */
- return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
- vma_is_shmem(vma);
-}
+bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
+ bool wp_async);
static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
{
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af61b95c89e4..8dc964389b0d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1977,6 +1977,40 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
return moved ? moved : err;
}
+bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
+ bool wp_async)
+{
+ vm_flags &= __VM_UFFD_FLAGS;
+
+ if (vma->vm_flags & VM_DROPPABLE)
+ return false;
+
+ if ((vm_flags & VM_UFFD_MINOR) &&
+ (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+ return false;
+
+ /*
+ * If wp async enabled, and WP is the only mode enabled, allow any
+ * memory type.
+ */
+ if (wp_async && (vm_flags == VM_UFFD_WP))
+ return true;
+
+#ifndef CONFIG_PTE_MARKER_UFFD_WP
+ /*
+ * If user requested uffd-wp but not enabled pte markers for
+ * uffd-wp, then shmem & hugetlbfs are not supported but only
+ * anonymous.
+ */
+ if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
+ return false;
+#endif
+
+ /* By default, allow any of anon|shmem|hugetlb */
+ return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+ vma_is_shmem(vma);
+}
+
static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
vm_flags_t vm_flags)
{
--
2.50.1
* Re: [RFC PATCH 1/4] userfaultfd: move vma_can_userfault out of line
2025-11-17 11:46 ` [RFC PATCH 1/4] userfaultfd: move vma_can_userfault out of line Mike Rapoport
@ 2025-11-17 17:00 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-17 17:00 UTC (permalink / raw)
To: Mike Rapoport, linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Baolin Wang, Hugh Dickins,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Nikita Kalyazin,
Paolo Bonzini, Peter Xu, Sean Christopherson, Shuah Khan,
Suren Baghdasaryan, Vlastimil Babka, linux-kernel, kvm,
linux-kselftest
On 17.11.25 12:46, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> vma_can_userfault() has grown pretty big and it's not called on a
> performance-critical path.
>
> Move it out of line.
>
> No functional changes.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
--
Cheers
David
* [RFC PATCH 2/4] userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
2025-11-17 11:46 [RFC PATCH 0/4] mm, kvm: add guest_memfd support for uffd minor faults Mike Rapoport
2025-11-17 11:46 ` [RFC PATCH 1/4] userfaultfd: move vma_can_userfault out of line Mike Rapoport
@ 2025-11-17 11:46 ` Mike Rapoport
2025-11-17 17:08 ` David Hildenbrand (Red Hat)
2025-11-17 11:46 ` [RFC PATCH 3/4] userfaultfd, guest_memfd: support userfault minor mode in guest_memfd Mike Rapoport
` (2 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2025-11-17 11:46 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Baolin Wang, David Hildenbrand,
Hugh Dickins, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Mike Rapoport, Nikita Kalyazin, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE
it needs to get a folio that already exists in the pagecache backing
that VMA.
Instead of using shmem_get_folio() for that, add a get_pagecache_folio()
method to 'struct vm_operations_struct' that will return a folio if it
exists in the VMA's pagecache at given pgoff.
Implement get_pagecache_folio() method for shmem and slightly refactor
userfaultfd's mfill_atomic() and mfill_atomic_pte_continue() to support
this new API.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/mm.h | 9 +++++++
mm/shmem.c | 20 ++++++++++++++++
mm/userfaultfd.c | 60 ++++++++++++++++++++++++++++++----------------
3 files changed, 69 insertions(+), 20 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16b33bacc32..c35c1e1ac4dd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -690,6 +690,15 @@ struct vm_operations_struct {
struct page *(*find_normal_page)(struct vm_area_struct *vma,
unsigned long addr);
#endif /* CONFIG_FIND_NORMAL_PAGE */
+#ifdef CONFIG_USERFAULTFD
+ /*
+ * Called by userfault to resolve UFFDIO_CONTINUE request.
+ * Should return the folio found at pgoff in the VMA's pagecache if it
+ * exists or ERR_PTR otherwise.
+ */
+ struct folio *(*get_pagecache_folio)(struct vm_area_struct *vma,
+ pgoff_t pgoff);
+#endif
};
#ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/shmem.c b/mm/shmem.c
index b9081b817d28..4ac122284bff 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3260,6 +3260,20 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
shmem_inode_unacct_blocks(inode, 1);
return ret;
}
+
+static struct folio *shmem_get_pagecache_folio(struct vm_area_struct *vma,
+ pgoff_t pgoff)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ struct folio *folio;
+ int err;
+
+ err = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+ if (err)
+ return ERR_PTR(err);
+
+ return folio;
+}
#endif /* CONFIG_USERFAULTFD */
#ifdef CONFIG_TMPFS
@@ -5292,6 +5306,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
.set_policy = shmem_set_policy,
.get_policy = shmem_get_policy,
#endif
+#ifdef CONFIG_USERFAULTFD
+ .get_pagecache_folio = shmem_get_pagecache_folio,
+#endif
};
static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5301,6 +5318,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
.set_policy = shmem_set_policy,
.get_policy = shmem_get_policy,
#endif
+#ifdef CONFIG_USERFAULTFD
+ .get_pagecache_folio = shmem_get_pagecache_folio,
+#endif
};
int shmem_init_fs_context(struct fs_context *fc)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 8dc964389b0d..60b3183a72c0 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -382,21 +382,17 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
unsigned long dst_addr,
uffd_flags_t flags)
{
- struct inode *inode = file_inode(dst_vma->vm_file);
pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
struct folio *folio;
struct page *page;
int ret;
- ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+ folio = dst_vma->vm_ops->get_pagecache_folio(dst_vma, pgoff);
/* Our caller expects us to return -EFAULT if we failed to find folio */
- if (ret == -ENOENT)
- ret = -EFAULT;
- if (ret)
- goto out;
- if (!folio) {
- ret = -EFAULT;
- goto out;
+ if (IS_ERR_OR_NULL(folio)) {
+ if (PTR_ERR(folio) == -ENOENT || !folio)
+ return -EFAULT;
+ return PTR_ERR(folio);
}
page = folio_file_page(folio, pgoff);
@@ -411,13 +407,12 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
goto out_release;
folio_unlock(folio);
- ret = 0;
-out:
- return ret;
+ return 0;
+
out_release:
folio_unlock(folio);
folio_put(folio);
- goto out;
+ return ret;
}
/* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
@@ -694,6 +689,22 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
return err;
}
+static __always_inline bool vma_can_mfill_atomic(struct vm_area_struct *vma,
+ uffd_flags_t flags)
+{
+ if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
+ if (vma->vm_ops && vma->vm_ops->get_pagecache_folio)
+ return true;
+ else
+ return false;
+ }
+
+ if (vma_is_anonymous(vma) || vma_is_shmem(vma))
+ return true;
+
+ return false;
+}
+
static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
unsigned long dst_start,
unsigned long src_start,
@@ -766,10 +777,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
src_start, len, flags);
- if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
- goto out_unlock;
- if (!vma_is_shmem(dst_vma) &&
- uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+ if (!vma_can_mfill_atomic(dst_vma, flags))
goto out_unlock;
while (src_addr < src_start + len) {
@@ -1985,9 +1993,21 @@ bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
if (vma->vm_flags & VM_DROPPABLE)
return false;
- if ((vm_flags & VM_UFFD_MINOR) &&
- (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
- return false;
+ if (vm_flags & VM_UFFD_MINOR) {
+ /*
+ * If only MINOR mode is requested and we can request an
+ * existing folio from VMA's page cache, allow it
+ */
+ if (vm_flags == VM_UFFD_MINOR && vma->vm_ops &&
+ vma->vm_ops->get_pagecache_folio)
+ return true;
+ /*
+ * Only hugetlb and shmem can support MINOR mode in combination
+ * with other modes
+ */
+ if (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma))
+ return false;
+ }
/*
* If wp async enabled, and WP is the only mode enabled, allow any
--
2.50.1
* Re: [RFC PATCH 2/4] userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
2025-11-17 11:46 ` [RFC PATCH 2/4] userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE Mike Rapoport
@ 2025-11-17 17:08 ` David Hildenbrand (Red Hat)
2025-11-21 11:52 ` Mike Rapoport
0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-17 17:08 UTC (permalink / raw)
To: Mike Rapoport, linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Baolin Wang, Hugh Dickins,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Nikita Kalyazin,
Paolo Bonzini, Peter Xu, Sean Christopherson, Shuah Khan,
Suren Baghdasaryan, Vlastimil Babka, linux-kernel, kvm,
linux-kselftest
On 17.11.25 12:46, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE
> it needs to get a folio that already exists in the pagecache backing
> that VMA.
>
> Instead of using shmem_get_folio() for that, add a get_pagecache_folio()
> method to 'struct vm_operations_struct' that will return a folio if it
> exists in the VMA's pagecache at given pgoff.
>
> Implement get_pagecache_folio() method for shmem and slightly refactor
> userfaultfd's mfill_atomic() and mfill_atomic_pte_continue() to support
> this new API.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> include/linux/mm.h | 9 +++++++
> mm/shmem.c | 20 ++++++++++++++++
> mm/userfaultfd.c | 60 ++++++++++++++++++++++++++++++----------------
> 3 files changed, 69 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d16b33bacc32..c35c1e1ac4dd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -690,6 +690,15 @@ struct vm_operations_struct {
> struct page *(*find_normal_page)(struct vm_area_struct *vma,
> unsigned long addr);
> #endif /* CONFIG_FIND_NORMAL_PAGE */
> +#ifdef CONFIG_USERFAULTFD
> + /*
> + * Called by userfault to resolve UFFDIO_CONTINUE request.
> + * Should return the folio found at pgoff in the VMA's pagecache if it
> + * exists or ERR_PTR otherwise.
> + */
What are the locking +refcount rules? Without looking at the code, I
would assume we return with a folio reference held and the folio locked?
> + struct folio *(*get_pagecache_folio)(struct vm_area_struct *vma,
> + pgoff_t pgoff);
The combination of VMA + pgoff looks weird at first. Would vma + addr or
vma+vma_offset into vma be better?
But it also makes me wonder if the callback would ever even require the
VMA, or actually only vma->vm_file?
Thinking out loud, I wonder if one could just call that "get_folio" or
"get_shared_folio" (IOW, never an anon folio in a MAP_PRIVATE mapping).
> +#endif
> };
>
> #ifdef CONFIG_NUMA_BALANCING
> diff --git a/mm/shmem.c b/mm/shmem.c
> index b9081b817d28..4ac122284bff 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3260,6 +3260,20 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
> shmem_inode_unacct_blocks(inode, 1);
> return ret;
> }
> +
> +static struct folio *shmem_get_pagecache_folio(struct vm_area_struct *vma,
> + pgoff_t pgoff)
> +{
> + struct inode *inode = file_inode(vma->vm_file);
> + struct folio *folio;
> + int err;
> +
> + err = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
> + if (err)
> + return ERR_PTR(err);
> +
> + return folio;
> +}
> #endif /* CONFIG_USERFAULTFD */
>
> #ifdef CONFIG_TMPFS
> @@ -5292,6 +5306,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
> .set_policy = shmem_set_policy,
> .get_policy = shmem_get_policy,
> #endif
> +#ifdef CONFIG_USERFAULTFD
> + .get_pagecache_folio = shmem_get_pagecache_folio,
> +#endif
> };
>
> static const struct vm_operations_struct shmem_anon_vm_ops = {
> @@ -5301,6 +5318,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
> .set_policy = shmem_set_policy,
> .get_policy = shmem_get_policy,
> #endif
> +#ifdef CONFIG_USERFAULTFD
> + .get_pagecache_folio = shmem_get_pagecache_folio,
> +#endif
> };
>
> int shmem_init_fs_context(struct fs_context *fc)
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 8dc964389b0d..60b3183a72c0 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -382,21 +382,17 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
> unsigned long dst_addr,
> uffd_flags_t flags)
> {
> - struct inode *inode = file_inode(dst_vma->vm_file);
> pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> struct folio *folio;
> struct page *page;
> int ret;
>
> - ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
> + folio = dst_vma->vm_ops->get_pagecache_folio(dst_vma, pgoff);
> /* Our caller expects us to return -EFAULT if we failed to find folio */
> - if (ret == -ENOENT)
> - ret = -EFAULT;
> - if (ret)
> - goto out;
> - if (!folio) {
> - ret = -EFAULT;
> - goto out;
> + if (IS_ERR_OR_NULL(folio)) {
> + if (PTR_ERR(folio) == -ENOENT || !folio)
> + return -EFAULT;
> + return PTR_ERR(folio);
> }
>
> page = folio_file_page(folio, pgoff);
> @@ -411,13 +407,12 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
> goto out_release;
>
> folio_unlock(folio);
> - ret = 0;
> -out:
> - return ret;
> + return 0;
> +
> out_release:
> folio_unlock(folio);
> folio_put(folio);
> - goto out;
> + return ret;
> }
>
> /* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
> @@ -694,6 +689,22 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
> return err;
> }
>
> +static __always_inline bool vma_can_mfill_atomic(struct vm_area_struct *vma,
> + uffd_flags_t flags)
> +{
> + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
> + if (vma->vm_ops && vma->vm_ops->get_pagecache_folio)
> + return true;
> + else
> + return false;
Probably easier to read is
return vma->vm_ops && vma->vm_ops->get_pagecache_folio;
> + }
> +
> + if (vma_is_anonymous(vma) || vma_is_shmem(vma))
> + return true;
> +
> + return false;
Could also be simplified to:
return vma_is_anonymous(vma) || vma_is_shmem(vma);
--
Cheers
David
* Re: [RFC PATCH 2/4] userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
2025-11-17 17:08 ` David Hildenbrand (Red Hat)
@ 2025-11-21 11:52 ` Mike Rapoport
0 siblings, 0 replies; 13+ messages in thread
From: Mike Rapoport @ 2025-11-21 11:52 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Baolin Wang,
Hugh Dickins, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Nikita Kalyazin, Paolo Bonzini, Peter Xu, Sean Christopherson,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, linux-kernel,
kvm, linux-kselftest
On Mon, Nov 17, 2025 at 06:08:57PM +0100, David Hildenbrand (Red Hat) wrote:
> On 17.11.25 12:46, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE
> > it needs to get a folio that already exists in the pagecache backing
> > that VMA.
> >
> > Instead of using shmem_get_folio() for that, add a get_pagecache_folio()
> > method to 'struct vm_operations_struct' that will return a folio if it
> > exists in the VMA's pagecache at given pgoff.
> >
> > Implement get_pagecache_folio() method for shmem and slightly refactor
> > userfaultfd's mfill_atomic() and mfill_atomic_pte_continue() to support
> > this new API.
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> > include/linux/mm.h | 9 +++++++
> > mm/shmem.c | 20 ++++++++++++++++
> > mm/userfaultfd.c | 60 ++++++++++++++++++++++++++++++----------------
> > 3 files changed, 69 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index d16b33bacc32..c35c1e1ac4dd 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -690,6 +690,15 @@ struct vm_operations_struct {
> > struct page *(*find_normal_page)(struct vm_area_struct *vma,
> > unsigned long addr);
> > #endif /* CONFIG_FIND_NORMAL_PAGE */
> > +#ifdef CONFIG_USERFAULTFD
> > + /*
> > + * Called by userfault to resolve UFFDIO_CONTINUE request.
> > + * Should return the folio found at pgoff in the VMA's pagecache if it
> > + * exists or ERR_PTR otherwise.
> > + */
>
> What are the locking +refcount rules? Without looking at the code, I would
> assume we return with a folio reference held and the folio locked?
Right, will add it to the comment
> > + struct folio *(*get_pagecache_folio)(struct vm_area_struct *vma,
> > + pgoff_t pgoff);
>
>
> The combination of VMA + pgoff looks weird at first. Would vma + addr or
> vma+vma_offset into vma be better?
Copied from map_pages() :)
> But it also makes me wonder if the callback would ever even require the VMA,
> or actually only vma->vm_file?
It's actually the inode, so I'm going to pass that instead of the vma.
> Thinking out loud, I wonder if one could just call that "get_folio" or
> "get_shared_folio" (IOW, never an anon folio in a MAP_PRIVATE mapping).
Naming is hard :)
get_shared_folio() sounds good to me, so unless there are other suggestions
I'll stick with it.
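To make that concrete, here is a minimal sketch of how the reworked hook
could look after the rename and the switch to the inode, with the locking
rules from the earlier question spelled out in the comment. The
get_shared_folio() name and the inode-based signature are assumptions drawn
from this discussion, not the posted patches:

	/*
	 * Called by userfaultfd to resolve a UFFDIO_CONTINUE request.
	 * Returns the folio found at pgoff in the inode's page cache if it
	 * exists, or an ERR_PTR otherwise. On success the folio is returned
	 * locked and with a reference held; the caller must unlock and put it.
	 */
	struct folio *(*get_shared_folio)(struct inode *inode, pgoff_t pgoff);

The shmem implementation would then shrink to roughly:

	static struct folio *shmem_get_shared_folio(struct inode *inode, pgoff_t pgoff)
	{
		struct folio *folio;
		int err;

		/* SGP_NOALLOC: only return a folio that already exists */
		err = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
		if (err)
			return ERR_PTR(err);

		return folio;
	}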
> > +#endif
> > };
> > #ifdef CONFIG_NUMA_BALANCING
...
> > +static __always_inline bool vma_can_mfill_atomic(struct vm_area_struct *vma,
> > + uffd_flags_t flags)
> > +{
> > + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
> > + if (vma->vm_ops && vma->vm_ops->get_pagecache_folio)
> > + return true;
> > + else
> > + return false;
>
> Probably easier to read is
>
> return vma->vm_ops && vma->vm_ops->get_pagecache_folio;
>
> > + }
> > +
> > + if (vma_is_anonymous(vma) || vma_is_shmem(vma))
> > + return true;
> > +
> > + return false;
>
>
> Could also be simplified to:
>
> return vma_is_anonymous(vma) || vma_is_shmem(vma);
Agreed with both of them.
> --
> Cheers
>
> David
--
Sincerely yours,
Mike.
* [RFC PATCH 3/4] userfaultfd, guest_memfd: support userfault minor mode in guest_memfd
2025-11-17 11:46 [RFC PATCH 0/4] mm, kvm: add guest_memfd support for uffd minor faults Mike Rapoport
2025-11-17 11:46 ` [RFC PATCH 1/4] userfaultfd: move vma_can_userfault out of line Mike Rapoport
2025-11-17 11:46 ` [RFC PATCH 2/4] userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE Mike Rapoport
@ 2025-11-17 11:46 ` Mike Rapoport
2025-11-18 16:41 ` David Hildenbrand (Red Hat)
2025-11-17 11:46 ` [RFC PATCH 4/4] KVM: selftests: test userfaultfd minor for guest_memfd Mike Rapoport
2025-11-17 17:55 ` [RFC PATCH 0/4] mm, kvm: add guest_memfd support for uffd minor faults Nikita Kalyazin
4 siblings, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2025-11-17 11:46 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Baolin Wang, David Hildenbrand,
Hugh Dickins, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Mike Rapoport, Nikita Kalyazin, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
* Export handle_userfault() for the KVM module so that the fault() handler
in guest_memfd can notify userspace about page faults in its address space.
* Implement get_pagecache_folio() for guest_memfd.
* And finally, introduce UFFD_FEATURE_MINOR_GENERIC that will allow
using userfaultfd minor mode with memory types other than shmem and
hugetlb provided they are allowed to call handle_userfault() and
implement get_pagecache_folio().
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
fs/userfaultfd.c | 4 +++-
include/uapi/linux/userfaultfd.h | 8 +++++++-
virt/kvm/guest_memfd.c | 30 ++++++++++++++++++++++++++++++
3 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 54c6cc7fe9c6..964fa2662d5c 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -537,6 +537,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
out:
return ret;
}
+EXPORT_SYMBOL_FOR_MODULES(handle_userfault, "kvm");
static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
struct userfaultfd_wait_queue *ewq)
@@ -1978,7 +1979,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
uffdio_api.features = UFFD_API_FEATURES;
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
uffdio_api.features &=
- ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
+ ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM |
+ UFFD_FEATURE_MINOR_GENERIC);
#endif
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 2841e4ea8f2c..c5cbd4a5a26e 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -42,7 +42,8 @@
UFFD_FEATURE_WP_UNPOPULATED | \
UFFD_FEATURE_POISON | \
UFFD_FEATURE_WP_ASYNC | \
- UFFD_FEATURE_MOVE)
+ UFFD_FEATURE_MOVE | \
+ UFFD_FEATURE_MINOR_GENERIC)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -210,6 +211,10 @@ struct uffdio_api {
* UFFD_FEATURE_MINOR_SHMEM indicates the same support as
* UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead.
*
+ * UFFD_FEATURE_MINOR_GENERIC indicates that minor faults can be
+ * intercepted for file-backed memory in case subsystem backing this
+ * memory supports it.
+ *
* UFFD_FEATURE_EXACT_ADDRESS indicates that the exact address of page
* faults would be provided and the offset within the page would not be
* masked.
@@ -248,6 +253,7 @@ struct uffdio_api {
#define UFFD_FEATURE_POISON (1<<14)
#define UFFD_FEATURE_WP_ASYNC (1<<15)
#define UFFD_FEATURE_MOVE (1<<16)
+#define UFFD_FEATURE_MINOR_GENERIC (1<<17)
__u64 features;
__u64 ioctls;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fbca8c0972da..5e3c63307fdf 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
#include <linux/kvm_host.h>
#include <linux/pagemap.h>
#include <linux/anon_inodes.h>
+#include <linux/userfaultfd_k.h>
#include "kvm_mm.h"
@@ -369,6 +370,12 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
return vmf_error(err);
}
+ if (userfaultfd_minor(vmf->vma)) {
+ folio_unlock(folio);
+ folio_put(folio);
+ return handle_userfault(vmf, VM_UFFD_MINOR);
+ }
+
if (WARN_ON_ONCE(folio_test_large(folio))) {
ret = VM_FAULT_SIGBUS;
goto out_folio;
@@ -390,8 +397,31 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
return ret;
}
+#ifdef CONFIG_USERFAULTFD
+static struct folio *kvm_gmem_get_pagecache_folio(struct vm_area_struct *vma,
+ pgoff_t pgoff)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ struct folio *folio;
+
+ folio = kvm_gmem_get_folio(inode, pgoff);
+ if (IS_ERR_OR_NULL(folio))
+ return folio;
+
+ if (!folio_test_uptodate(folio)) {
+ clear_highpage(folio_page(folio, 0));
+ kvm_gmem_mark_prepared(folio);
+ }
+
+ return folio;
+}
+#endif
+
static const struct vm_operations_struct kvm_gmem_vm_ops = {
.fault = kvm_gmem_fault_user_mapping,
+#ifdef CONFIG_USERFAULTFD
+ .get_pagecache_folio = kvm_gmem_get_pagecache_folio,
+#endif
};
static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
--
2.50.1
* Re: [RFC PATCH 3/4] userfaultfd, guest_memfd: support userfault minor mode in guest_memfd
2025-11-17 11:46 ` [RFC PATCH 3/4] userfaultfd, guest_memfd: support userfault minor mode in guest_memfd Mike Rapoport
@ 2025-11-18 16:41 ` David Hildenbrand (Red Hat)
2025-11-21 11:59 ` Mike Rapoport
0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-18 16:41 UTC (permalink / raw)
To: Mike Rapoport, linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Baolin Wang, Hugh Dickins,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Nikita Kalyazin,
Paolo Bonzini, Peter Xu, Sean Christopherson, Shuah Khan,
Suren Baghdasaryan, Vlastimil Babka, linux-kernel, kvm,
linux-kselftest
On 17.11.25 12:46, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> * Export handle_userfault() for the KVM module so that the fault() handler
> in guest_memfd can notify userspace about page faults in its address space.
> * Implement get_pagecache_folio() for guest_memfd.
> * And finally, introduce UFFD_FEATURE_MINOR_GENERIC that will allow
> using userfaultfd minor mode with memory types other than shmem and
> hugetlb provided they are allowed to call handle_userfault() and
> implement get_pagecache_folio().
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> fs/userfaultfd.c | 4 +++-
> include/uapi/linux/userfaultfd.h | 8 +++++++-
> virt/kvm/guest_memfd.c | 30 ++++++++++++++++++++++++++++++
> 3 files changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 54c6cc7fe9c6..964fa2662d5c 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -537,6 +537,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> out:
> return ret;
> }
> +EXPORT_SYMBOL_FOR_MODULES(handle_userfault, "kvm");
>
> static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
> struct userfaultfd_wait_queue *ewq)
> @@ -1978,7 +1979,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
> uffdio_api.features = UFFD_API_FEATURES;
> #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
> uffdio_api.features &=
> - ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
> + ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM |
> + UFFD_FEATURE_MINOR_GENERIC);
> #endif
> #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 2841e4ea8f2c..c5cbd4a5a26e 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -42,7 +42,8 @@
> UFFD_FEATURE_WP_UNPOPULATED | \
> UFFD_FEATURE_POISON | \
> UFFD_FEATURE_WP_ASYNC | \
> - UFFD_FEATURE_MOVE)
> + UFFD_FEATURE_MOVE | \
> + UFFD_FEATURE_MINOR_GENERIC)
> #define UFFD_API_IOCTLS \
> ((__u64)1 << _UFFDIO_REGISTER | \
> (__u64)1 << _UFFDIO_UNREGISTER | \
> @@ -210,6 +211,10 @@ struct uffdio_api {
> * UFFD_FEATURE_MINOR_SHMEM indicates the same support as
> * UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead.
> *
> + * UFFD_FEATURE_MINOR_GENERIC indicates that minor faults can be
> + * intercepted for file-backed memory in case subsystem backing this
> + * memory supports it.
> + *
> * UFFD_FEATURE_EXACT_ADDRESS indicates that the exact address of page
> * faults would be provided and the offset within the page would not be
> * masked.
> @@ -248,6 +253,7 @@ struct uffdio_api {
> #define UFFD_FEATURE_POISON (1<<14)
> #define UFFD_FEATURE_WP_ASYNC (1<<15)
> #define UFFD_FEATURE_MOVE (1<<16)
> +#define UFFD_FEATURE_MINOR_GENERIC (1<<17)
> __u64 features;
>
> __u64 ioctls;
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index fbca8c0972da..5e3c63307fdf 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -4,6 +4,7 @@
> #include <linux/kvm_host.h>
> #include <linux/pagemap.h>
> #include <linux/anon_inodes.h>
> +#include <linux/userfaultfd_k.h>
>
> #include "kvm_mm.h"
>
> @@ -369,6 +370,12 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
> return vmf_error(err);
> }
>
> + if (userfaultfd_minor(vmf->vma)) {
> + folio_unlock(folio);
> + folio_put(folio);
> + return handle_userfault(vmf, VM_UFFD_MINOR);
> + }
Staring at things like VM_FAULT_NEEDDSYNC, I'm wondering whether we could have a
new return value from ->fault that would indicate that
handle_userfault(vmf, VM_UFFD_MINOR) should be called.
Maybe some VM_FAULT_UFFD_MINOR or simply VM_FAULT_USERFAULTFD and we
can just derive that it is VM_UFFD_MINOR.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4f66a3206a63c..2cf17da880f0e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1601,6 +1601,8 @@ typedef __bitwise unsigned int vm_fault_t;
* fsync() to complete (for synchronous page faults
* in DAX)
* @VM_FAULT_COMPLETED: ->fault completed, meanwhile mmap lock released
+ * @VM_FAULT_USERFAULTFD: ->fault did not modify page tables and needs
+ * handle_userfault() to complete
* @VM_FAULT_HINDEX_MASK: mask HINDEX value
*
*/
@@ -1618,6 +1620,7 @@ enum vm_fault_reason {
VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000,
VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000,
VM_FAULT_COMPLETED = (__force vm_fault_t)0x004000,
+ VM_FAULT_USERFAULTFD = (__force vm_fault_t)0x008000,
VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000,
};
@@ -1642,6 +1645,7 @@ enum vm_fault_reason {
{ VM_FAULT_FALLBACK, "FALLBACK" }, \
{ VM_FAULT_DONE_COW, "DONE_COW" }, \
{ VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \
+ { VM_FAULT_USERFAULTFD, "USERFAULTFD" },\
{ VM_FAULT_COMPLETED, "COMPLETED" }
struct vm_special_mapping {
IIUC, we have exactly two invocations of ->fault(vmf) in memory.c where
we would have to handle it. And the return value would never leave
the core.
That way, we wouldn't have to export handle_userfault().
Just a thought ...
--
Cheers
David
* Re: [RFC PATCH 3/4] userfaultfd, guest_memfd: support userfault minor mode in guest_memfd
2025-11-18 16:41 ` David Hildenbrand (Red Hat)
@ 2025-11-21 11:59 ` Mike Rapoport
0 siblings, 0 replies; 13+ messages in thread
From: Mike Rapoport @ 2025-11-21 11:59 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Baolin Wang,
Hugh Dickins, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Nikita Kalyazin, Paolo Bonzini, Peter Xu, Sean Christopherson,
Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, linux-kernel,
kvm, linux-kselftest
On Tue, Nov 18, 2025 at 05:41:13PM +0100, David Hildenbrand (Red Hat) wrote:
> On 17.11.25 12:46, Mike Rapoport wrote:
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index fbca8c0972da..5e3c63307fdf 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -4,6 +4,7 @@
> > #include <linux/kvm_host.h>
> > #include <linux/pagemap.h>
> > #include <linux/anon_inodes.h>
> > +#include <linux/userfaultfd_k.h>
> > #include "kvm_mm.h"
> > @@ -369,6 +370,12 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
> > return vmf_error(err);
> > }
> > + if (userfaultfd_minor(vmf->vma)) {
> > + folio_unlock(folio);
> > + folio_put(folio);
> > + return handle_userfault(vmf, VM_UFFD_MINOR);
> > + }
>
> Staring at things like VM_FAULT_NEEDDSYNC, I'm wondering whether we could have a
> new return value from ->fault that would indicate that
> handle_userfault(vmf, VM_UFFD_MINOR) should be called.
>
> Maybe some VM_FAULT_UFFD_MINOR or simply VM_FAULT_USERFAULTFD and we
> can just derive that it is VM_UFFD_MINOR.
_UFFD_MINOR sounds better; maybe we'll want something for MISSING later on.
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4f66a3206a63c..2cf17da880f0e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1601,6 +1601,8 @@ typedef __bitwise unsigned int vm_fault_t;
> * fsync() to complete (for synchronous page faults
> * in DAX)
> * @VM_FAULT_COMPLETED: ->fault completed, meanwhile mmap lock released
> + * @VM_FAULT_USERFAULTFD: ->fault did not modify page tables and needs
> + * handle_userfault() to complete
> * @VM_FAULT_HINDEX_MASK: mask HINDEX value
> *
> */
> @@ -1618,6 +1620,7 @@ enum vm_fault_reason {
> VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000,
> VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000,
> VM_FAULT_COMPLETED = (__force vm_fault_t)0x004000,
> + VM_FAULT_USERFAULTFD = (__force vm_fault_t)0x008000,
> VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000,
> };
> @@ -1642,6 +1645,7 @@ enum vm_fault_reason {
> { VM_FAULT_FALLBACK, "FALLBACK" }, \
> { VM_FAULT_DONE_COW, "DONE_COW" }, \
> { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \
> + { VM_FAULT_USERFAULTFD, "USERFAULTFD" },\
> { VM_FAULT_COMPLETED, "COMPLETED" }
> struct vm_special_mapping {
>
>
> IIUC, we have exactly two invocations of ->fault(vmf) in memory.c where
> we would have to handle it. And the return value would never leave
> the core.
I've found only one :/
But nevertheless, I like the idea to return VM_FAULT_UFFD_MINOR from
->fault() and then call handle_userfault() from __do_fault().
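As a rough sketch of that flow (not the actual mm/memory.c code, and
VM_FAULT_UFFD_MINOR is still only the hypothetical flag from the draft
above), the ->fault() call site inside __do_fault() could dispatch it like:

	ret = vma->vm_ops->fault(vmf);
	if (ret & VM_FAULT_UFFD_MINOR) {
		/*
		 * ->fault() found the folio in the page cache but left the
		 * page tables untouched; let userspace resolve the minor
		 * fault with UFFDIO_CONTINUE.
		 */
		return handle_userfault(vmf, VM_UFFD_MINOR);
	}

With that, handle_userfault() stays internal to core mm and guest_memfd's
->fault() only needs to return the new value.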
> That way, we wouldn't have to export handle_userfault().
>
> Just a thought ...
>
> --
> Cheers
>
> David
--
Sincerely yours,
Mike.
* [RFC PATCH 4/4] KVM: selftests: test userfaultfd minor for guest_memfd
2025-11-17 11:46 [RFC PATCH 0/4] mm, kvm: add guest_memfd support for uffd minor faults Mike Rapoport
` (2 preceding siblings ...)
2025-11-17 11:46 ` [RFC PATCH 3/4] userfaultfd, guest_memfd: support userfault minor mode in guest_memfd Mike Rapoport
@ 2025-11-17 11:46 ` Mike Rapoport
2025-11-17 17:55 ` [RFC PATCH 0/4] mm, kvm: add guest_memfd support for uffd minor faults Nikita Kalyazin
4 siblings, 0 replies; 13+ messages in thread
From: Mike Rapoport @ 2025-11-17 11:46 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Baolin Wang, David Hildenbrand,
Hugh Dickins, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Mike Rapoport, Nikita Kalyazin, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: Nikita Kalyazin <kalyazin@amazon.com>
The test demonstrates that a minor userfaultfd event in guest_memfd can
be resolved via a memcpy followed by a UFFDIO_CONTINUE ioctl.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
.../testing/selftests/kvm/guest_memfd_test.c | 103 ++++++++++++++++++
1 file changed, 103 insertions(+)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index e7d9aeb418d3..a5d3ed21d7bb 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -10,13 +10,17 @@
#include <errno.h>
#include <stdio.h>
#include <fcntl.h>
+#include <pthread.h>
#include <linux/bitmap.h>
#include <linux/falloc.h>
#include <linux/sizes.h>
+#include <linux/userfaultfd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
#include "kvm_util.h"
#include "test_util.h"
@@ -254,6 +258,104 @@ static void test_guest_memfd_flags(struct kvm_vm *vm)
}
}
+struct fault_args {
+ char *addr;
+ volatile char value;
+};
+
+static void *fault_thread_fn(void *arg)
+{
+ struct fault_args *args = arg;
+
+ /* Trigger page fault */
+ args->value = *args->addr;
+ return NULL;
+}
+
+static void test_uffd_minor(int fd, size_t total_size)
+{
+ struct uffdio_api uffdio_api = {
+ .api = UFFD_API,
+ .features = UFFD_FEATURE_MINOR_GENERIC,
+ };
+ struct uffdio_register uffd_reg;
+ struct uffdio_continue uffd_cont;
+ struct uffd_msg msg;
+ struct fault_args args;
+ pthread_t fault_thread;
+ void *mem, *mem_nofault, *buf = NULL;
+ int uffd, ret;
+ off_t offset = page_size;
+ void *fault_addr;
+
+ ret = posix_memalign(&buf, page_size, total_size);
+ TEST_ASSERT_EQ(ret, 0);
+
+ memset(buf, 0xaa, total_size);
+
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
+ TEST_ASSERT(uffd != -1, "userfaultfd creation should succeed");
+
+ ret = ioctl(uffd, UFFDIO_API, &uffdio_api);
+ TEST_ASSERT(ret != -1, "ioctl(UFFDIO_API) should succeed");
+
+ mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem != MAP_FAILED, "mmap should succeed");
+
+ mem_nofault = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem_nofault != MAP_FAILED, "mmap should succeed");
+
+ uffd_reg.range.start = (unsigned long)mem;
+ uffd_reg.range.len = total_size;
+ uffd_reg.mode = UFFDIO_REGISTER_MODE_MINOR;
+ ret = ioctl(uffd, UFFDIO_REGISTER, &uffd_reg);
+ TEST_ASSERT(ret != -1, "ioctl(UFFDIO_REGISTER) should succeed");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ offset, page_size);
+ TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) should succeed");
+
+ fault_addr = mem + offset;
+ args.addr = fault_addr;
+
+ ret = pthread_create(&fault_thread, NULL, fault_thread_fn, &args);
+ TEST_ASSERT(ret == 0, "pthread_create should succeed");
+
+ ret = read(uffd, &msg, sizeof(msg));
+ TEST_ASSERT(ret != -1, "read from userfaultfd should succeed");
+ TEST_ASSERT(msg.event == UFFD_EVENT_PAGEFAULT, "event type should be pagefault");
+ TEST_ASSERT((void *)(msg.arg.pagefault.address & ~(page_size - 1)) == fault_addr,
+ "pagefault should occur at expected address");
+
+ memcpy(mem_nofault + offset, buf + offset, page_size);
+
+ uffd_cont.range.start = (unsigned long)fault_addr;
+ uffd_cont.range.len = page_size;
+ uffd_cont.mode = 0;
+ ret = ioctl(uffd, UFFDIO_CONTINUE, &uffd_cont);
+ TEST_ASSERT(ret != -1, "ioctl(UFFDIO_CONTINUE) should succeed");
+
+ /*
+ * wait for fault_thread to finish to make sure fault happened and was
+ * resolved before we verify the values
+ */
+ ret = pthread_join(fault_thread, NULL);
+ TEST_ASSERT(ret == 0, "pthread_join should succeed");
+
+ TEST_ASSERT(args.value == *(char *)(mem_nofault + offset),
+ "memory should contain the value that was copied");
+ TEST_ASSERT(args.value == *(char *)(mem + offset),
+ "no further fault is expected");
+
+ ret = munmap(mem_nofault, total_size);
+ TEST_ASSERT(!ret, "munmap should succeed");
+
+ ret = munmap(mem, total_size);
+ TEST_ASSERT(!ret, "munmap should succeed");
+ free(buf);
+ close(uffd);
+}
+
#define gmem_test(__test, __vm, __flags) \
do { \
int fd = vm_create_guest_memfd(__vm, page_size * 4, __flags); \
@@ -273,6 +375,7 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags)
if (flags & GUEST_MEMFD_FLAG_INIT_SHARED) {
gmem_test(mmap_supported, vm, flags);
gmem_test(fault_overflow, vm, flags);
+ gmem_test(uffd_minor, vm, flags);
} else {
gmem_test(fault_private, vm, flags);
}
--
2.50.1
* Re: [RFC PATCH 0/4] mm, kvm: add guest_memfd support for uffd minor faults
2025-11-17 11:46 [RFC PATCH 0/4] mm, kvm: add guest_memfd support for uffd minor faults Mike Rapoport
` (3 preceding siblings ...)
2025-11-17 11:46 ` [RFC PATCH 4/4] KVM: selftests: test userfaultfd minor for guest_memfd Mike Rapoport
@ 2025-11-17 17:55 ` Nikita Kalyazin
2025-11-17 19:39 ` Peter Xu
4 siblings, 1 reply; 13+ messages in thread
From: Nikita Kalyazin @ 2025-11-17 17:55 UTC (permalink / raw)
To: Mike Rapoport, linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Baolin Wang, David Hildenbrand,
Hugh Dickins, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Paolo Bonzini, Peter Xu, Sean Christopherson, Shuah Khan,
Suren Baghdasaryan, Vlastimil Babka, linux-kernel, kvm,
linux-kselftest
On 17/11/2025 11:46, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Hi,
>
> These patches allow guest_memfd to notify userspace about minor page
> faults using userfaultfd and let userspace resolve these page faults
> using UFFDIO_CONTINUE.
>
> To allow UFFDIO_CONTINUE outside of the core mm I added a
> get_pagecache_folio() callback to vm_ops that allows an address space
> backing a VMA to return a folio that exists in its page cache (patch 2).
>
> In order for guest_memfd to notify userspace about page faults, it has to
> call handle_userfault(), and since guest_memfd may be part of the kvm module,
> handle_userfault() is exported to the kvm module (patch 3).
>
> Note that patch 3 changelog does not provide motivation for enabling uffd
> in guest_memfd, mainly because I can't say I understand why that is
> required :)
> Would be great to hear from KVM folks about it.
Hi Mike,
Thanks for posting it!
In our use case, Firecracker snapshot-restore using UFFD [1], we will
use UFFD minor/continue to respond to guest_memfd faults in user
mappings primarily due to VMM accesses that are required for PV (virtio)
device emulation and also KVM accesses when decoding MMIO operations on x86.
Nikita
[1]
https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md
>
> This series is the minimal change I've been able to come up with to allow
> integration of guest_memfd with uffd and while refactoring uffd and making
> mfill_atomic() flow more linear would have been a nice improvement, it's
> way out of the scope of enabling uffd with guest_memfd.
>
> Mike Rapoport (Microsoft) (3):
> userfaultfd: move vma_can_userfault out of line
> userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
> userfaultfd, guest_memfd: support userfault minor mode in guest_memfd
>
> Nikita Kalyazin (1):
> KVM: selftests: test userfaultfd minor for guest_memfd
>
> fs/userfaultfd.c | 4 +-
> include/linux/mm.h | 9 ++
> include/linux/userfaultfd_k.h | 36 +-----
> include/uapi/linux/userfaultfd.h | 8 +-
> mm/shmem.c | 20 ++++
> mm/userfaultfd.c | 88 ++++++++++++---
> .../testing/selftests/kvm/guest_memfd_test.c | 103 ++++++++++++++++++
> virt/kvm/guest_memfd.c | 30 +++++
> 8 files changed, 245 insertions(+), 53 deletions(-)
>
>
> base-commit: 6146a0f1dfae5d37442a9ddcba012add260bceb0
> --
> 2.50.1
>
* Re: [RFC PATCH 0/4] mm, kvm: add guest_memfd support for uffd minor faults
2025-11-17 17:55 ` [RFC PATCH 0/4] mm, kvm: add guest_memfd support for uffd minor faults Nikita Kalyazin
@ 2025-11-17 19:39 ` Peter Xu
2025-11-19 17:35 ` Nikita Kalyazin
0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2025-11-17 19:39 UTC (permalink / raw)
To: Nikita Kalyazin
Cc: Mike Rapoport, linux-mm, Andrea Arcangeli, Andrew Morton,
Baolin Wang, David Hildenbrand, Hugh Dickins, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Mon, Nov 17, 2025 at 05:55:46PM +0000, Nikita Kalyazin wrote:
> In our use case, Firecracker snapshot-restore using UFFD [1], we will use
> UFFD minor/continue to respond to guest_memfd faults in user mappings
> primarily due to VMM accesses that are required for PV (virtio) device
> emulation and also KVM accesses when decoding MMIO operations on x86.
I'm curious if firecracker plans to support live snapshot save. With
something like ioctls_supported flags, guest-memfd could declare wr-protect
support easily too, and synchronous userfaultfd wr-protect traps would be an
efficient way to do a live save.
I'm guessing it's not an immediate demand now, or this series would already
have asked for both MINOR and WP, but I just want to raise this question.
QEMU already supports live snapshot save, so it would be good if gmem could
also support WP at some point, but that can be done later too.
Thanks,
--
Peter Xu
* Re: [RFC PATCH 0/4] mm, kvm: add guest_memfd support for uffd minor faults
2025-11-17 19:39 ` Peter Xu
@ 2025-11-19 17:35 ` Nikita Kalyazin
0 siblings, 0 replies; 13+ messages in thread
From: Nikita Kalyazin @ 2025-11-19 17:35 UTC (permalink / raw)
To: Peter Xu
Cc: Mike Rapoport, linux-mm, Andrea Arcangeli, Andrew Morton,
Baolin Wang, David Hildenbrand, Hugh Dickins, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On 17/11/2025 19:39, Peter Xu wrote:
> On Mon, Nov 17, 2025 at 05:55:46PM +0000, Nikita Kalyazin wrote:
>> In our use case, Firecracker snapshot-restore using UFFD [1], we will use
>> UFFD minor/continue to respond to guest_memfd faults in user mappings
>> primarily due to VMM accesses that are required for PV (virtio) device
>> emulation and also KVM accesses when decoding MMIO operations on x86.
>
> I'm curious if firecracker plans to support live snapshot save. With
> something like ioctls_supported flags, guest-memfd could declare wr-protect
> support easily too, and synchronous userfaultfd wr-protect traps would be an
> efficient way to do a live save.
>
> I'm guessing it's not an immediate demand now, or this series would already
> have asked for both MINOR and WP, but I just want to raise this question.
> QEMU already supports live snapshot save, so it would be good if gmem could
> also support WP at some point, but that can be done later too.
No, live snapshots haven't been on our plan so far.
>
> Thanks,
>
> --
> Peter Xu
>