* [PATCH RFC 01/17] userfaultfd: introduce mfill_copy_folio_locked() helper
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd support in guest_memfd Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-02-03 17:45 ` Peter Xu
2026-01-27 19:29 ` [PATCH RFC 02/17] userfaultfd: introduce struct mfill_state Mike Rapoport
` (16 subsequent siblings)
17 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Split the copying of data while locks are held from mfill_atomic_pte_copy()
into a helper function, mfill_copy_folio_locked().
This improves code readability and makes the complex
mfill_atomic_pte_copy() function easier to comprehend.
No functional change.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
mm/userfaultfd.c | 59 ++++++++++++++++++++++++++++--------------------
1 file changed, 35 insertions(+), 24 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e6dfd5f28acd..a0885d543f22 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -238,6 +238,40 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
return ret;
}
+static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
+{
+ void *kaddr;
+ int ret;
+
+ kaddr = kmap_local_folio(folio, 0);
+ /*
+ * The read mmap_lock is held here. Despite the
+ * mmap_lock being read recursive a deadlock is still
+ * possible if a writer has taken a lock. For example:
+ *
+ * process A thread 1 takes read lock on own mmap_lock
+ * process A thread 2 calls mmap, blocks taking write lock
+ * process B thread 1 takes page fault, read lock on own mmap lock
+ * process B thread 2 calls mmap, blocks taking write lock
+ * process A thread 1 blocks taking read lock on process B
+ * process B thread 1 blocks taking read lock on process A
+ *
+ * Disable page faults to prevent potential deadlock
+ * and retry the copy outside the mmap_lock.
+ */
+ pagefault_disable();
+ ret = copy_from_user(kaddr, (const void __user *) src_addr,
+ PAGE_SIZE);
+ pagefault_enable();
+ kunmap_local(kaddr);
+
+ if (ret)
+ return -EFAULT;
+
+ flush_dcache_folio(folio);
+ return ret;
+}
+
static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
@@ -245,7 +279,6 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
uffd_flags_t flags,
struct folio **foliop)
{
- void *kaddr;
int ret;
struct folio *folio;
@@ -256,27 +289,7 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
if (!folio)
goto out;
- kaddr = kmap_local_folio(folio, 0);
- /*
- * The read mmap_lock is held here. Despite the
- * mmap_lock being read recursive a deadlock is still
- * possible if a writer has taken a lock. For example:
- *
- * process A thread 1 takes read lock on own mmap_lock
- * process A thread 2 calls mmap, blocks taking write lock
- * process B thread 1 takes page fault, read lock on own mmap lock
- * process B thread 2 calls mmap, blocks taking write lock
- * process A thread 1 blocks taking read lock on process B
- * process B thread 1 blocks taking read lock on process A
- *
- * Disable page faults to prevent potential deadlock
- * and retry the copy outside the mmap_lock.
- */
- pagefault_disable();
- ret = copy_from_user(kaddr, (const void __user *) src_addr,
- PAGE_SIZE);
- pagefault_enable();
- kunmap_local(kaddr);
+ ret = mfill_copy_folio_locked(folio, src_addr);
/* fallback to copy_from_user outside mmap_lock */
if (unlikely(ret)) {
@@ -285,8 +298,6 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
/* don't free the page */
goto out;
}
-
- flush_dcache_folio(folio);
} else {
folio = *foliop;
*foliop = NULL;
--
2.51.0
* Re: [PATCH RFC 01/17] userfaultfd: introduce mfill_copy_folio_locked() helper
2026-01-27 19:29 ` [PATCH RFC 01/17] userfaultfd: introduce mfill_copy_folio_locked() helper Mike Rapoport
@ 2026-02-03 17:45 ` Peter Xu
2026-02-08 9:49 ` Mike Rapoport
0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2026-02-03 17:45 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Tue, Jan 27, 2026 at 09:29:20PM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Split the copying of data while locks are held from mfill_atomic_pte_copy()
> into a helper function, mfill_copy_folio_locked().
>
> This improves code readability and makes the complex
> mfill_atomic_pte_copy() function easier to comprehend.
>
> No functional change.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
The movement looks all fine,
Acked-by: Peter Xu <peterx@redhat.com>
Just one pure question to ask.
> ---
> mm/userfaultfd.c | 59 ++++++++++++++++++++++++++++--------------------
> 1 file changed, 35 insertions(+), 24 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index e6dfd5f28acd..a0885d543f22 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -238,6 +238,40 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
> return ret;
> }
>
> +static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
> +{
> + void *kaddr;
> + int ret;
> +
> + kaddr = kmap_local_folio(folio, 0);
> + /*
> + * The read mmap_lock is held here. Despite the
> + * mmap_lock being read recursive a deadlock is still
> + * possible if a writer has taken a lock. For example:
> + *
> + * process A thread 1 takes read lock on own mmap_lock
> + * process A thread 2 calls mmap, blocks taking write lock
> + * process B thread 1 takes page fault, read lock on own mmap lock
> + * process B thread 2 calls mmap, blocks taking write lock
> + * process A thread 1 blocks taking read lock on process B
> + * process B thread 1 blocks taking read lock on process A
While moving, I wonder if we need this complex use case to describe the
deadlock. Shouldn't this already happen with 1 process only?
process A thread 1 takes read lock (e.g. reaching here but
before copy_from_user)
process A thread 2 calls mmap, blocks taking write lock
process A thread 1 goes on copy_from_user(), trigger page fault,
then tries to re-take the read lock
IIUC the above should already cause a deadlock when the rwsem prioritizes
the write lock here.
> + *
> + * Disable page faults to prevent potential deadlock
> + * and retry the copy outside the mmap_lock.
> + */
> + pagefault_disable();
> + ret = copy_from_user(kaddr, (const void __user *) src_addr,
> + PAGE_SIZE);
> + pagefault_enable();
> + kunmap_local(kaddr);
> +
> + if (ret)
> + return -EFAULT;
> +
> + flush_dcache_folio(folio);
> + return ret;
> +}
> +
> static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
> struct vm_area_struct *dst_vma,
> unsigned long dst_addr,
> @@ -245,7 +279,6 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
> uffd_flags_t flags,
> struct folio **foliop)
> {
> - void *kaddr;
> int ret;
> struct folio *folio;
>
> @@ -256,27 +289,7 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
> if (!folio)
> goto out;
>
> - kaddr = kmap_local_folio(folio, 0);
> - /*
> - * The read mmap_lock is held here. Despite the
> - * mmap_lock being read recursive a deadlock is still
> - * possible if a writer has taken a lock. For example:
> - *
> - * process A thread 1 takes read lock on own mmap_lock
> - * process A thread 2 calls mmap, blocks taking write lock
> - * process B thread 1 takes page fault, read lock on own mmap lock
> - * process B thread 2 calls mmap, blocks taking write lock
> - * process A thread 1 blocks taking read lock on process B
> - * process B thread 1 blocks taking read lock on process A
> - *
> - * Disable page faults to prevent potential deadlock
> - * and retry the copy outside the mmap_lock.
> - */
> - pagefault_disable();
> - ret = copy_from_user(kaddr, (const void __user *) src_addr,
> - PAGE_SIZE);
> - pagefault_enable();
> - kunmap_local(kaddr);
> + ret = mfill_copy_folio_locked(folio, src_addr);
>
> /* fallback to copy_from_user outside mmap_lock */
> if (unlikely(ret)) {
> @@ -285,8 +298,6 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
> /* don't free the page */
> goto out;
> }
> -
> - flush_dcache_folio(folio);
> } else {
> folio = *foliop;
> *foliop = NULL;
> --
> 2.51.0
>
--
Peter Xu
* Re: [PATCH RFC 01/17] userfaultfd: introduce mfill_copy_folio_locked() helper
2026-02-03 17:45 ` Peter Xu
@ 2026-02-08 9:49 ` Mike Rapoport
0 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-02-08 9:49 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
Hi Peter,
On Tue, Feb 03, 2026 at 12:45:02PM -0500, Peter Xu wrote:
> On Tue, Jan 27, 2026 at 09:29:20PM +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > Split the copying of data while locks are held from mfill_atomic_pte_copy()
> > into a helper function, mfill_copy_folio_locked().
> >
> > This improves code readability and makes the complex
> > mfill_atomic_pte_copy() function easier to comprehend.
> >
> > No functional change.
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>
> The movement looks all fine,
>
> Acked-by: Peter Xu <peterx@redhat.com>
Thanks!
> Just one pure question to ask.
>
> > ---
> > mm/userfaultfd.c | 59 ++++++++++++++++++++++++++++--------------------
> > 1 file changed, 35 insertions(+), 24 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index e6dfd5f28acd..a0885d543f22 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -238,6 +238,40 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
> > return ret;
> > }
> >
> > +static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
> > +{
> > + void *kaddr;
> > + int ret;
> > +
> > + kaddr = kmap_local_folio(folio, 0);
> > + /*
> > + * The read mmap_lock is held here. Despite the
> > + * mmap_lock being read recursive a deadlock is still
> > + * possible if a writer has taken a lock. For example:
> > + *
> > + * process A thread 1 takes read lock on own mmap_lock
> > + * process A thread 2 calls mmap, blocks taking write lock
> > + * process B thread 1 takes page fault, read lock on own mmap lock
> > + * process B thread 2 calls mmap, blocks taking write lock
> > + * process A thread 1 blocks taking read lock on process B
> > + * process B thread 1 blocks taking read lock on process A
>
> While moving, I wonder if we need this complex use case to describe the
> deadlock. Shouldn't this already happen with 1 process only?
>
> process A thread 1 takes read lock (e.g. reaching here but
> before copy_from_user)
> process A thread 2 calls mmap, blocks taking write lock
> process A thread 1 goes on copy_from_user(), trigger page fault,
> then tries to re-take the read lock
>
> IIUC the above should already cause a deadlock when the rwsem prioritizes
> the write lock here.
We surely can improve the description here, but it should be a separate
patch with its own changelog and it's out of scope of this series.
> > + *
> > + * Disable page faults to prevent potential deadlock
> > + * and retry the copy outside the mmap_lock.
> > + */
> > + pagefault_disable();
> > + ret = copy_from_user(kaddr, (const void __user *) src_addr,
> > + PAGE_SIZE);
> > + pagefault_enable();
> > + kunmap_local(kaddr);
--
Sincerely yours,
Mike.
* [PATCH RFC 02/17] userfaultfd: introduce struct mfill_state
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd support in guest_memfd Mike Rapoport
2026-01-27 19:29 ` [PATCH RFC 01/17] userfaultfd: introduce mfill_copy_folio_locked() helper Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-01-27 19:29 ` [PATCH RFC 03/17] userfaultfd: introduce mfill_get_pmd() helper Mike Rapoport
` (15 subsequent siblings)
17 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
mfill_atomic() passes a lot of parameters down to its callees.
Aggregate them all into mfill_state structure and pass this structure to
functions that implement various UFFDIO_ commands.
Tracking the state in a structure will allow moving the code that retries
copying of data for UFFDIO_COPY into mfill_atomic_pte_copy() and make the
loop in mfill_atomic() identical for all UFFDIO operations on PTE-mapped
memory.
The mfill_state definition is deliberately local to mm/userfaultfd.c, hence
shmem_mfill_atomic_pte() is not updated.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
mm/userfaultfd.c | 148 ++++++++++++++++++++++++++---------------------
1 file changed, 82 insertions(+), 66 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index a0885d543f22..6a0697c93ff4 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -20,6 +20,20 @@
#include "internal.h"
#include "swap.h"
+struct mfill_state {
+ struct userfaultfd_ctx *ctx;
+ unsigned long src_start;
+ unsigned long dst_start;
+ unsigned long len;
+ uffd_flags_t flags;
+
+ struct vm_area_struct *vma;
+ unsigned long src_addr;
+ unsigned long dst_addr;
+ struct folio *folio;
+ pmd_t *pmd;
+};
+
static __always_inline
bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
{
@@ -272,17 +286,17 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
return ret;
}
-static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
- struct vm_area_struct *dst_vma,
- unsigned long dst_addr,
- unsigned long src_addr,
- uffd_flags_t flags,
- struct folio **foliop)
+static int mfill_atomic_pte_copy(struct mfill_state *state)
{
- int ret;
+ struct vm_area_struct *dst_vma = state->vma;
+ unsigned long dst_addr = state->dst_addr;
+ unsigned long src_addr = state->src_addr;
+ uffd_flags_t flags = state->flags;
+ pmd_t *dst_pmd = state->pmd;
struct folio *folio;
+ int ret;
- if (!*foliop) {
+ if (!state->folio) {
ret = -ENOMEM;
folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma,
dst_addr);
@@ -294,13 +308,13 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
/* fallback to copy_from_user outside mmap_lock */
if (unlikely(ret)) {
ret = -ENOENT;
- *foliop = folio;
+ state->folio = folio;
/* don't free the page */
goto out;
}
} else {
- folio = *foliop;
- *foliop = NULL;
+ folio = state->folio;
+ state->folio = NULL;
}
/*
@@ -357,10 +371,11 @@ static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd,
return ret;
}
-static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
- struct vm_area_struct *dst_vma,
- unsigned long dst_addr)
+static int mfill_atomic_pte_zeropage(struct mfill_state *state)
{
+ struct vm_area_struct *dst_vma = state->vma;
+ unsigned long dst_addr = state->dst_addr;
+ pmd_t *dst_pmd = state->pmd;
pte_t _dst_pte, *dst_pte;
spinlock_t *ptl;
int ret;
@@ -392,13 +407,14 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
}
/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
-static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
- struct vm_area_struct *dst_vma,
- unsigned long dst_addr,
- uffd_flags_t flags)
+static int mfill_atomic_pte_continue(struct mfill_state *state)
{
- struct inode *inode = file_inode(dst_vma->vm_file);
+ struct vm_area_struct *dst_vma = state->vma;
+ unsigned long dst_addr = state->dst_addr;
pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+ struct inode *inode = file_inode(dst_vma->vm_file);
+ uffd_flags_t flags = state->flags;
+ pmd_t *dst_pmd = state->pmd;
struct folio *folio;
struct page *page;
int ret;
@@ -436,15 +452,15 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
}
/* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
-static int mfill_atomic_pte_poison(pmd_t *dst_pmd,
- struct vm_area_struct *dst_vma,
- unsigned long dst_addr,
- uffd_flags_t flags)
+static int mfill_atomic_pte_poison(struct mfill_state *state)
{
- int ret;
+ struct vm_area_struct *dst_vma = state->vma;
struct mm_struct *dst_mm = dst_vma->vm_mm;
+ unsigned long dst_addr = state->dst_addr;
+ pmd_t *dst_pmd = state->pmd;
pte_t _dst_pte, *dst_pte;
spinlock_t *ptl;
+ int ret;
_dst_pte = make_pte_marker(PTE_MARKER_POISONED);
ret = -EAGAIN;
@@ -668,22 +684,20 @@ extern ssize_t mfill_atomic_hugetlb(struct userfaultfd_ctx *ctx,
uffd_flags_t flags);
#endif /* CONFIG_HUGETLB_PAGE */
-static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
- struct vm_area_struct *dst_vma,
- unsigned long dst_addr,
- unsigned long src_addr,
- uffd_flags_t flags,
- struct folio **foliop)
+static __always_inline ssize_t mfill_atomic_pte(struct mfill_state *state)
{
+ struct vm_area_struct *dst_vma = state->vma;
+ unsigned long src_addr = state->src_addr;
+ unsigned long dst_addr = state->dst_addr;
+ struct folio **foliop = &state->folio;
+ uffd_flags_t flags = state->flags;
+ pmd_t *dst_pmd = state->pmd;
ssize_t err;
- if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
- return mfill_atomic_pte_continue(dst_pmd, dst_vma,
- dst_addr, flags);
- } else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
- return mfill_atomic_pte_poison(dst_pmd, dst_vma,
- dst_addr, flags);
- }
+ if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+ return mfill_atomic_pte_continue(state);
+ if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON))
+ return mfill_atomic_pte_poison(state);
/*
* The normal page fault path for a shmem will invoke the
@@ -697,12 +711,9 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
*/
if (!(dst_vma->vm_flags & VM_SHARED)) {
if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY))
- err = mfill_atomic_pte_copy(dst_pmd, dst_vma,
- dst_addr, src_addr,
- flags, foliop);
+ err = mfill_atomic_pte_copy(state);
else
- err = mfill_atomic_pte_zeropage(dst_pmd,
- dst_vma, dst_addr);
+ err = mfill_atomic_pte_zeropage(state);
} else {
err = shmem_mfill_atomic_pte(dst_pmd, dst_vma,
dst_addr, src_addr,
@@ -718,13 +729,20 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
unsigned long len,
uffd_flags_t flags)
{
+ struct mfill_state state = (struct mfill_state){
+ .ctx = ctx,
+ .dst_start = dst_start,
+ .src_start = src_start,
+ .flags = flags,
+
+ .src_addr = src_start,
+ .dst_addr = dst_start,
+ };
struct mm_struct *dst_mm = ctx->mm;
struct vm_area_struct *dst_vma;
+ long copied = 0;
ssize_t err;
pmd_t *dst_pmd;
- unsigned long src_addr, dst_addr;
- long copied;
- struct folio *folio;
/*
* Sanitize the command parameters:
@@ -736,10 +754,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
VM_WARN_ON_ONCE(src_start + len <= src_start);
VM_WARN_ON_ONCE(dst_start + len <= dst_start);
- src_addr = src_start;
- dst_addr = dst_start;
- copied = 0;
- folio = NULL;
retry:
/*
* Make sure the vma is not shared, that the dst range is
@@ -790,12 +804,14 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
goto out_unlock;
- while (src_addr < src_start + len) {
- pmd_t dst_pmdval;
+ state.vma = dst_vma;
- VM_WARN_ON_ONCE(dst_addr >= dst_start + len);
+ while (state.src_addr < src_start + len) {
+ VM_WARN_ON_ONCE(state.dst_addr >= dst_start + len);
+
+ pmd_t dst_pmdval;
- dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
+ dst_pmd = mm_alloc_pmd(dst_mm, state.dst_addr);
if (unlikely(!dst_pmd)) {
err = -ENOMEM;
break;
@@ -827,34 +843,34 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
* tables under us; pte_offset_map_lock() will deal with that.
*/
- err = mfill_atomic_pte(dst_pmd, dst_vma, dst_addr,
- src_addr, flags, &folio);
+ state.pmd = dst_pmd;
+ err = mfill_atomic_pte(&state);
cond_resched();
if (unlikely(err == -ENOENT)) {
void *kaddr;
up_read(&ctx->map_changing_lock);
- uffd_mfill_unlock(dst_vma);
- VM_WARN_ON_ONCE(!folio);
+ uffd_mfill_unlock(state.vma);
+ VM_WARN_ON_ONCE(!state.folio);
- kaddr = kmap_local_folio(folio, 0);
+ kaddr = kmap_local_folio(state.folio, 0);
err = copy_from_user(kaddr,
- (const void __user *) src_addr,
+ (const void __user *)state.src_addr,
PAGE_SIZE);
kunmap_local(kaddr);
if (unlikely(err)) {
err = -EFAULT;
goto out;
}
- flush_dcache_folio(folio);
+ flush_dcache_folio(state.folio);
goto retry;
} else
- VM_WARN_ON_ONCE(folio);
+ VM_WARN_ON_ONCE(state.folio);
if (!err) {
- dst_addr += PAGE_SIZE;
- src_addr += PAGE_SIZE;
+ state.dst_addr += PAGE_SIZE;
+ state.src_addr += PAGE_SIZE;
copied += PAGE_SIZE;
if (fatal_signal_pending(current))
@@ -866,10 +882,10 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
out_unlock:
up_read(&ctx->map_changing_lock);
- uffd_mfill_unlock(dst_vma);
+ uffd_mfill_unlock(state.vma);
out:
- if (folio)
- folio_put(folio);
+ if (state.folio)
+ folio_put(state.folio);
VM_WARN_ON_ONCE(copied < 0);
VM_WARN_ON_ONCE(err > 0);
VM_WARN_ON_ONCE(!copied && !err);
--
2.51.0
* [PATCH RFC 03/17] userfaultfd: introduce mfill_get_pmd() helper
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd support in guest_memfd Mike Rapoport
2026-01-27 19:29 ` [PATCH RFC 01/17] userfaultfd: introduce mfill_copy_folio_locked() helper Mike Rapoport
2026-01-27 19:29 ` [PATCH RFC 02/17] userfaultfd: introduce struct mfill_state Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-01-27 19:29 ` [PATCH RFC 04/17] userfaultfd: introduce mfill_get_vma() and mfill_put_vma() Mike Rapoport
` (14 subsequent siblings)
17 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
There is a lengthy code chunk in mfill_atomic() that establishes the PMD
for UFFDIO operations. This code may be called twice: the first time when
the copy is performed with VMA/mm locks held, and the second time after
the copy is retried with the locks dropped.
Move the code that establishes a PMD into a helper function so it can be
reused later during refactoring of mfill_atomic_pte_copy().
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
mm/userfaultfd.c | 103 ++++++++++++++++++++++++-----------------------
1 file changed, 53 insertions(+), 50 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 6a0697c93ff4..9dd285b13f3b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -157,6 +157,57 @@ static void uffd_mfill_unlock(struct vm_area_struct *vma)
}
#endif
+static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+
+ pgd = pgd_offset(mm, address);
+ p4d = p4d_alloc(mm, pgd, address);
+ if (!p4d)
+ return NULL;
+ pud = pud_alloc(mm, p4d, address);
+ if (!pud)
+ return NULL;
+ /*
+ * Note that we didn't run this because the pmd was
+ * missing, the *pmd may be already established and in
+ * turn it may also be a trans_huge_pmd.
+ */
+ return pmd_alloc(mm, pud, address);
+}
+
+static int mfill_get_pmd(struct mfill_state *state)
+{
+ struct mm_struct *dst_mm = state->ctx->mm;
+ pmd_t *dst_pmd;
+ pmd_t dst_pmdval;
+
+ dst_pmd = mm_alloc_pmd(dst_mm, state->dst_addr);
+ if (unlikely(!dst_pmd))
+ return -ENOMEM;
+
+ dst_pmdval = pmdp_get_lockless(dst_pmd);
+ if (unlikely(pmd_none(dst_pmdval)) &&
+ unlikely(__pte_alloc(dst_mm, dst_pmd)))
+ return -ENOMEM;
+
+ dst_pmdval = pmdp_get_lockless(dst_pmd);
+ /*
+ * If the dst_pmd is THP don't override it and just be strict.
+ * (This includes the case where the PMD used to be THP and
+ * changed back to none after __pte_alloc().)
+ */
+ if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval)))
+ return -EEXIST;
+ if (unlikely(pmd_bad(dst_pmdval)))
+ return -EFAULT;
+
+ state->pmd = dst_pmd;
+ return 0;
+}
+
/* Check if dst_addr is outside of file's size. Must be called with ptl held. */
static bool mfill_file_over_size(struct vm_area_struct *dst_vma,
unsigned long dst_addr)
@@ -489,27 +540,6 @@ static int mfill_atomic_pte_poison(struct mfill_state *state)
return ret;
}
-static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
-{
- pgd_t *pgd;
- p4d_t *p4d;
- pud_t *pud;
-
- pgd = pgd_offset(mm, address);
- p4d = p4d_alloc(mm, pgd, address);
- if (!p4d)
- return NULL;
- pud = pud_alloc(mm, p4d, address);
- if (!pud)
- return NULL;
- /*
- * Note that we didn't run this because the pmd was
- * missing, the *pmd may be already established and in
- * turn it may also be a trans_huge_pmd.
- */
- return pmd_alloc(mm, pud, address);
-}
-
#ifdef CONFIG_HUGETLB_PAGE
/*
* mfill_atomic processing for HUGETLB vmas. Note that this routine is
@@ -742,7 +772,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
struct vm_area_struct *dst_vma;
long copied = 0;
ssize_t err;
- pmd_t *dst_pmd;
/*
* Sanitize the command parameters:
@@ -809,41 +838,15 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
while (state.src_addr < src_start + len) {
VM_WARN_ON_ONCE(state.dst_addr >= dst_start + len);
- pmd_t dst_pmdval;
-
- dst_pmd = mm_alloc_pmd(dst_mm, state.dst_addr);
- if (unlikely(!dst_pmd)) {
- err = -ENOMEM;
+ err = mfill_get_pmd(&state);
+ if (err)
break;
- }
- dst_pmdval = pmdp_get_lockless(dst_pmd);
- if (unlikely(pmd_none(dst_pmdval)) &&
- unlikely(__pte_alloc(dst_mm, dst_pmd))) {
- err = -ENOMEM;
- break;
- }
- dst_pmdval = pmdp_get_lockless(dst_pmd);
- /*
- * If the dst_pmd is THP don't override it and just be strict.
- * (This includes the case where the PMD used to be THP and
- * changed back to none after __pte_alloc().)
- */
- if (unlikely(!pmd_present(dst_pmdval) ||
- pmd_trans_huge(dst_pmdval))) {
- err = -EEXIST;
- break;
- }
- if (unlikely(pmd_bad(dst_pmdval))) {
- err = -EFAULT;
- break;
- }
/*
* For shmem mappings, khugepaged is allowed to remove page
* tables under us; pte_offset_map_lock() will deal with that.
*/
- state.pmd = dst_pmd;
err = mfill_atomic_pte(&state);
cond_resched();
--
2.51.0
* [PATCH RFC 04/17] userfaultfd: introduce mfill_get_vma() and mfill_put_vma()
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd support in guest_memfd Mike Rapoport
` (2 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 03/17] userfaultfd: introduce mfill_get_pmd() helper Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-02-02 21:49 ` Peter Xu
2026-01-27 19:29 ` [PATCH RFC 05/17] userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy() Mike Rapoport
` (13 subsequent siblings)
17 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Split the code that finds, locks and verifies VMA from mfill_atomic()
into a helper function.
This function will be used later during refactoring of
mfill_atomic_pte_copy().
Add a counterpart mfill_put_vma() helper that unlocks the VMA and
releases map_changing_lock.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
mm/userfaultfd.c | 124 ++++++++++++++++++++++++++++-------------------
1 file changed, 73 insertions(+), 51 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9dd285b13f3b..45d8f04aaf4f 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -157,6 +157,73 @@ static void uffd_mfill_unlock(struct vm_area_struct *vma)
}
#endif
+static void mfill_put_vma(struct mfill_state *state)
+{
+ up_read(&state->ctx->map_changing_lock);
+ uffd_mfill_unlock(state->vma);
+ state->vma = NULL;
+}
+
+static int mfill_get_vma(struct mfill_state *state)
+{
+ struct userfaultfd_ctx *ctx = state->ctx;
+ uffd_flags_t flags = state->flags;
+ struct vm_area_struct *dst_vma;
+ int err;
+
+ /*
+ * Make sure the vma is not shared, that the dst range is
+ * both valid and fully within a single existing vma.
+ */
+ dst_vma = uffd_mfill_lock(ctx->mm, state->dst_start, state->len);
+ if (IS_ERR(dst_vma))
+ return PTR_ERR(dst_vma);
+
+ /*
+ * If memory mappings are changing because of non-cooperative
+ * operation (e.g. mremap) running in parallel, bail out and
+ * request the user to retry later
+ */
+ down_read(&ctx->map_changing_lock);
+ err = -EAGAIN;
+ if (atomic_read(&ctx->mmap_changing))
+ goto out_unlock;
+
+ err = -EINVAL;
+
+ /*
+ * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
+ * it will overwrite vm_ops, so vma_is_anonymous must return false.
+ */
+ if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) &&
+ dst_vma->vm_flags & VM_SHARED))
+ goto out_unlock;
+
+ /*
+ * validate 'mode' now that we know the dst_vma: don't allow
+ * a wrprotect copy if the userfaultfd didn't register as WP.
+ */
+ if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
+ goto out_unlock;
+
+ if (is_vm_hugetlb_page(dst_vma))
+ goto out;
+
+ if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
+ goto out_unlock;
+ if (!vma_is_shmem(dst_vma) &&
+ uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+ goto out_unlock;
+
+out:
+ state->vma = dst_vma;
+ return 0;
+
+out_unlock:
+ mfill_put_vma(state);
+ return err;
+}
+
static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
{
pgd_t *pgd;
@@ -768,8 +835,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
.src_addr = src_start,
.dst_addr = dst_start,
};
- struct mm_struct *dst_mm = ctx->mm;
- struct vm_area_struct *dst_vma;
long copied = 0;
ssize_t err;
@@ -784,57 +849,17 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
VM_WARN_ON_ONCE(dst_start + len <= dst_start);
retry:
- /*
- * Make sure the vma is not shared, that the dst range is
- * both valid and fully within a single existing vma.
- */
- dst_vma = uffd_mfill_lock(dst_mm, dst_start, len);
- if (IS_ERR(dst_vma)) {
- err = PTR_ERR(dst_vma);
+ err = mfill_get_vma(&state);
+ if (err)
goto out;
- }
-
- /*
- * If memory mappings are changing because of non-cooperative
- * operation (e.g. mremap) running in parallel, bail out and
- * request the user to retry later
- */
- down_read(&ctx->map_changing_lock);
- err = -EAGAIN;
- if (atomic_read(&ctx->mmap_changing))
- goto out_unlock;
-
- err = -EINVAL;
- /*
- * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
- * it will overwrite vm_ops, so vma_is_anonymous must return false.
- */
- if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) &&
- dst_vma->vm_flags & VM_SHARED))
- goto out_unlock;
-
- /*
- * validate 'mode' now that we know the dst_vma: don't allow
- * a wrprotect copy if the userfaultfd didn't register as WP.
- */
- if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
- goto out_unlock;
/*
* If this is a HUGETLB vma, pass off to appropriate routine
*/
- if (is_vm_hugetlb_page(dst_vma))
- return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
+ if (is_vm_hugetlb_page(state.vma))
+ return mfill_atomic_hugetlb(ctx, state.vma, dst_start,
src_start, len, flags);
- if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
- goto out_unlock;
- if (!vma_is_shmem(dst_vma) &&
- uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
- goto out_unlock;
-
- state.vma = dst_vma;
-
while (state.src_addr < src_start + len) {
VM_WARN_ON_ONCE(state.dst_addr >= dst_start + len);
@@ -853,8 +878,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
if (unlikely(err == -ENOENT)) {
void *kaddr;
- up_read(&ctx->map_changing_lock);
- uffd_mfill_unlock(state.vma);
+ mfill_put_vma(&state);
VM_WARN_ON_ONCE(!state.folio);
kaddr = kmap_local_folio(state.folio, 0);
@@ -883,9 +907,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
break;
}
-out_unlock:
- up_read(&ctx->map_changing_lock);
- uffd_mfill_unlock(state.vma);
+ mfill_put_vma(&state);
out:
if (state.folio)
folio_put(state.folio);
--
2.51.0
* Re: [PATCH RFC 04/17] userfaultfd: introduce mfill_get_vma() and mfill_put_vma()
2026-01-27 19:29 ` [PATCH RFC 04/17] userfaultfd: introduce mfill_get_vma() and mfill_put_vma() Mike Rapoport
@ 2026-02-02 21:49 ` Peter Xu
2026-02-08 9:54 ` Mike Rapoport
0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2026-02-02 21:49 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
Hi, Mike,
On Tue, Jan 27, 2026 at 09:29:23PM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Split the code that finds, locks and verifies VMA from mfill_atomic()
> into a helper function.
>
> This function will be used later during refactoring of
> mfill_atomic_pte_copy().
>
> Add a counterpart mfill_put_vma() helper that unlocks the VMA and
> releases map_changing_lock.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> mm/userfaultfd.c | 124 ++++++++++++++++++++++++++++-------------------
> 1 file changed, 73 insertions(+), 51 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 9dd285b13f3b..45d8f04aaf4f 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -157,6 +157,73 @@ static void uffd_mfill_unlock(struct vm_area_struct *vma)
> }
> #endif
>
> +static void mfill_put_vma(struct mfill_state *state)
> +{
> + up_read(&state->ctx->map_changing_lock);
> + uffd_mfill_unlock(state->vma);
> + state->vma = NULL;
> +}
> +
> +static int mfill_get_vma(struct mfill_state *state)
> +{
> + struct userfaultfd_ctx *ctx = state->ctx;
> + uffd_flags_t flags = state->flags;
> + struct vm_area_struct *dst_vma;
> + int err;
> +
> + /*
> + * Make sure the vma is not shared, that the dst range is
> + * both valid and fully within a single existing vma.
> + */
> + dst_vma = uffd_mfill_lock(ctx->mm, state->dst_start, state->len);
> + if (IS_ERR(dst_vma))
> + return PTR_ERR(dst_vma);
> +
> + /*
> + * If memory mappings are changing because of non-cooperative
> + * operation (e.g. mremap) running in parallel, bail out and
> + * request the user to retry later
> + */
> + down_read(&ctx->map_changing_lock);
> + err = -EAGAIN;
> + if (atomic_read(&ctx->mmap_changing))
> + goto out_unlock;
> +
> + err = -EINVAL;
> +
> + /*
> + * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
> + * it will overwrite vm_ops, so vma_is_anonymous must return false.
> + */
> + if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) &&
> + dst_vma->vm_flags & VM_SHARED))
> + goto out_unlock;
> +
> + /*
> + * validate 'mode' now that we know the dst_vma: don't allow
> + * a wrprotect copy if the userfaultfd didn't register as WP.
> + */
> + if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
> + goto out_unlock;
> +
> + if (is_vm_hugetlb_page(dst_vma))
> + goto out;
> +
> + if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
> + goto out_unlock;
> + if (!vma_is_shmem(dst_vma) &&
> + uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> + goto out_unlock;
IMHO it's a bit weird to check for vma permissions in a get_vma() function.
Also, in the follow-up patch it'll also be reused in
mfill_copy_folio_retry(), which doesn't need to check vma permissions.
Maybe we can introduce mfill_vma_check() for these two checks? Then we can
also drop the slightly weird is_vm_hugetlb_page() check (and "out" label)
above.
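Something along these lines, only as an untested sketch (the name
mfill_vma_check() is just a suggestion, not part of this series):
	static int mfill_vma_check(struct vm_area_struct *dst_vma,
				   uffd_flags_t flags)
	{
		/* only anonymous and shmem VMAs are handled here */
		if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
			return -EINVAL;
		/* UFFDIO_CONTINUE is only valid for shmem */
		if (!vma_is_shmem(dst_vma) &&
		    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
			return -EINVAL;
		return 0;
	}
mfill_get_vma() would then stay purely about lookup and locking, and the
retry path could skip the check.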
> +
> +out:
> + state->vma = dst_vma;
> + return 0;
> +
> +out_unlock:
> + mfill_put_vma(state);
> + return err;
> +}
> +
> static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
> {
> pgd_t *pgd;
> @@ -768,8 +835,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
> .src_addr = src_start,
> .dst_addr = dst_start,
> };
> - struct mm_struct *dst_mm = ctx->mm;
> - struct vm_area_struct *dst_vma;
> long copied = 0;
> ssize_t err;
>
> @@ -784,57 +849,17 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
> VM_WARN_ON_ONCE(dst_start + len <= dst_start);
>
> retry:
> - /*
> - * Make sure the vma is not shared, that the dst range is
> - * both valid and fully within a single existing vma.
> - */
> - dst_vma = uffd_mfill_lock(dst_mm, dst_start, len);
> - if (IS_ERR(dst_vma)) {
> - err = PTR_ERR(dst_vma);
> + err = mfill_get_vma(&state);
> + if (err)
> goto out;
> - }
> -
> - /*
> - * If memory mappings are changing because of non-cooperative
> - * operation (e.g. mremap) running in parallel, bail out and
> - * request the user to retry later
> - */
> - down_read(&ctx->map_changing_lock);
> - err = -EAGAIN;
> - if (atomic_read(&ctx->mmap_changing))
> - goto out_unlock;
> -
> - err = -EINVAL;
> - /*
> - * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
> - * it will overwrite vm_ops, so vma_is_anonymous must return false.
> - */
> - if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) &&
> - dst_vma->vm_flags & VM_SHARED))
> - goto out_unlock;
> -
> - /*
> - * validate 'mode' now that we know the dst_vma: don't allow
> - * a wrprotect copy if the userfaultfd didn't register as WP.
> - */
> - if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
> - goto out_unlock;
>
> /*
> * If this is a HUGETLB vma, pass off to appropriate routine
> */
> - if (is_vm_hugetlb_page(dst_vma))
> - return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
> + if (is_vm_hugetlb_page(state.vma))
> + return mfill_atomic_hugetlb(ctx, state.vma, dst_start,
> src_start, len, flags);
>
> - if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
> - goto out_unlock;
> - if (!vma_is_shmem(dst_vma) &&
> - uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> - goto out_unlock;
> -
> - state.vma = dst_vma;
> -
> while (state.src_addr < src_start + len) {
> VM_WARN_ON_ONCE(state.dst_addr >= dst_start + len);
>
> @@ -853,8 +878,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
> if (unlikely(err == -ENOENT)) {
> void *kaddr;
>
> - up_read(&ctx->map_changing_lock);
> - uffd_mfill_unlock(state.vma);
> + mfill_put_vma(&state);
> VM_WARN_ON_ONCE(!state.folio);
>
> kaddr = kmap_local_folio(state.folio, 0);
> @@ -883,9 +907,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
> break;
> }
>
> -out_unlock:
> - up_read(&ctx->map_changing_lock);
> - uffd_mfill_unlock(state.vma);
> + mfill_put_vma(&state);
> out:
> if (state.folio)
> folio_put(state.folio);
> --
> 2.51.0
>
--
Peter Xu
* Re: [PATCH RFC 04/17] userfaultfd: introduce mfill_get_vma() and mfill_put_vma()
2026-02-02 21:49 ` Peter Xu
@ 2026-02-08 9:54 ` Mike Rapoport
0 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-02-08 9:54 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
Hi Peter,
On Mon, Feb 02, 2026 at 04:49:09PM -0500, Peter Xu wrote:
> Hi, Mike,
>
> On Tue, Jan 27, 2026 at 09:29:23PM +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > Split the code that finds, locks and verifies VMA from mfill_atomic()
> > into a helper function.
> >
> > This function will be used later during refactoring of
> > mfill_atomic_pte_copy().
> >
> > Add a counterpart mfill_put_vma() helper that unlocks the VMA and
> > releases map_changing_lock.
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> > mm/userfaultfd.c | 124 ++++++++++++++++++++++++++++-------------------
> > 1 file changed, 73 insertions(+), 51 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 9dd285b13f3b..45d8f04aaf4f 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -157,6 +157,73 @@ static void uffd_mfill_unlock(struct vm_area_struct *vma)
> > }
> > #endif
> >
> > +static void mfill_put_vma(struct mfill_state *state)
> > +{
> > + up_read(&state->ctx->map_changing_lock);
> > + uffd_mfill_unlock(state->vma);
> > + state->vma = NULL;
> > +}
> > +
> > +static int mfill_get_vma(struct mfill_state *state)
> > +{
> > + struct userfaultfd_ctx *ctx = state->ctx;
> > + uffd_flags_t flags = state->flags;
> > + struct vm_area_struct *dst_vma;
> > + int err;
> > +
> > + /*
> > + * Make sure the vma is not shared, that the dst range is
> > + * both valid and fully within a single existing vma.
> > + */
> > + dst_vma = uffd_mfill_lock(ctx->mm, state->dst_start, state->len);
> > + if (IS_ERR(dst_vma))
> > + return PTR_ERR(dst_vma);
> > +
> > + /*
> > + * If memory mappings are changing because of non-cooperative
> > + * operation (e.g. mremap) running in parallel, bail out and
> > + * request the user to retry later
> > + */
> > + down_read(&ctx->map_changing_lock);
> > + err = -EAGAIN;
> > + if (atomic_read(&ctx->mmap_changing))
> > + goto out_unlock;
> > +
> > + err = -EINVAL;
> > +
> > + /*
> > + * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
> > + * it will overwrite vm_ops, so vma_is_anonymous must return false.
> > + */
> > + if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) &&
> > + dst_vma->vm_flags & VM_SHARED))
> > + goto out_unlock;
> > +
> > + /*
> > + * validate 'mode' now that we know the dst_vma: don't allow
> > + * a wrprotect copy if the userfaultfd didn't register as WP.
> > + */
> > + if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
> > + goto out_unlock;
> > +
> > + if (is_vm_hugetlb_page(dst_vma))
> > + goto out;
> > +
> > + if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
> > + goto out_unlock;
> > + if (!vma_is_shmem(dst_vma) &&
> > + uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> > + goto out_unlock;
>
> IMHO it's a bit weird to check for vma permissions in a get_vma() function.
>
> Also, in the follow-up patch it'll also be reused in
> mfill_copy_folio_retry(), which doesn't need to check vma permissions.
>
> Maybe we can introduce mfill_vma_check() for these two checks? Then we can
> also drop the slightly weird is_vm_hugetlb_page() check (and "out" label)
> above.
This version of get_vma() keeps the checks exactly as they were when we
were retrying after dropping the lock, and I prefer to keep them this way to
begin with.
Later we can optimize this further once the dust settles on these changes.
--
Sincerely yours,
Mike.
* [PATCH RFC 05/17] userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy()
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd support in guest_memfd Mike Rapoport
` (3 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 04/17] userfaultfd: introduce mfill_get_vma() and mfill_put_vma() Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-02-02 21:23 ` Peter Xu
2026-01-27 19:29 ` [PATCH RFC 06/17] userfaultfd: move vma_can_userfault out of line Mike Rapoport
` (12 subsequent siblings)
17 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Implementation of UFFDIO_COPY for anonymous memory might fail to copy
data from the userspace buffer when the destination VMA is locked
(either with mm_lock or with per-VMA lock).
In that case, mfill_atomic() releases the locks, retries copying the
data with locks dropped and then re-locks the destination VMA and
re-establishes PMD.
Since this retry-reget dance is only relevant for UFFDIO_COPY and it
never happens for other UFFDIO_ operations, make it a part of
mfill_atomic_pte_copy() that actually implements UFFDIO_COPY for
anonymous memory.
shmem implementation will be updated later and the loop in
mfill_atomic() will be adjusted afterwards.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
mm/userfaultfd.c | 70 +++++++++++++++++++++++++++++++-----------------
1 file changed, 46 insertions(+), 24 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 45d8f04aaf4f..01a2b898fa40 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -404,35 +404,57 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
return ret;
}
+static int mfill_copy_folio_retry(struct mfill_state *state, struct folio *folio)
+{
+ unsigned long src_addr = state->src_addr;
+ void *kaddr;
+ int err;
+
+ /* retry copying with mm_lock dropped */
+ mfill_put_vma(state);
+
+ kaddr = kmap_local_folio(folio, 0);
+ err = copy_from_user(kaddr, (const void __user *) src_addr, PAGE_SIZE);
+ kunmap_local(kaddr);
+ if (unlikely(err))
+ return -EFAULT;
+
+ flush_dcache_folio(folio);
+
+ /* reget VMA and PMD, they could change underneath us */
+ err = mfill_get_vma(state);
+ if (err)
+ return err;
+
+ err = mfill_get_pmd(state);
+ if (err)
+ return err;
+
+ return 0;
+}
+
static int mfill_atomic_pte_copy(struct mfill_state *state)
{
- struct vm_area_struct *dst_vma = state->vma;
unsigned long dst_addr = state->dst_addr;
unsigned long src_addr = state->src_addr;
uffd_flags_t flags = state->flags;
- pmd_t *dst_pmd = state->pmd;
struct folio *folio;
int ret;
- if (!state->folio) {
- ret = -ENOMEM;
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma,
- dst_addr);
- if (!folio)
- goto out;
+ folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, state->vma, dst_addr);
+ if (!folio)
+ return -ENOMEM;
- ret = mfill_copy_folio_locked(folio, src_addr);
+ ret = -ENOMEM;
+ if (mem_cgroup_charge(folio, state->vma->vm_mm, GFP_KERNEL))
+ goto out_release;
+ ret = mfill_copy_folio_locked(folio, src_addr);
+ if (unlikely(ret)) {
/* fallback to copy_from_user outside mmap_lock */
- if (unlikely(ret)) {
- ret = -ENOENT;
- state->folio = folio;
- /* don't free the page */
- goto out;
- }
- } else {
- folio = state->folio;
- state->folio = NULL;
+ ret = mfill_copy_folio_retry(state, folio);
+ if (ret)
+ goto out_release;
}
/*
@@ -442,17 +464,16 @@ static int mfill_atomic_pte_copy(struct mfill_state *state)
*/
__folio_mark_uptodate(folio);
- ret = -ENOMEM;
- if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL))
- goto out_release;
-
- ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
+ ret = mfill_atomic_install_pte(state->pmd, state->vma, dst_addr,
&folio->page, true, flags);
if (ret)
goto out_release;
out:
return ret;
out_release:
+ /* Don't return -ENOENT so that our caller won't retry */
+ if (ret == -ENOENT)
+ ret = -EFAULT;
folio_put(folio);
goto out;
}
@@ -907,7 +928,8 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
break;
}
- mfill_put_vma(&state);
+ if (state.vma)
+ mfill_put_vma(&state);
out:
if (state.folio)
folio_put(state.folio);
--
2.51.0
* Re: [PATCH RFC 05/17] userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy()
2026-01-27 19:29 ` [PATCH RFC 05/17] userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy() Mike Rapoport
@ 2026-02-02 21:23 ` Peter Xu
2026-02-08 10:01 ` Mike Rapoport
0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2026-02-02 21:23 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
Hi, Mike,
On Tue, Jan 27, 2026 at 09:29:24PM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Implementation of UFFDIO_COPY for anonymous memory might fail to copy
> data from the userspace buffer when the destination VMA is locked
> (either with mm_lock or with per-VMA lock).
>
> In that case, mfill_atomic() releases the locks, retries copying the
> data with locks dropped and then re-locks the destination VMA and
> re-establishes PMD.
>
> Since this retry-reget dance is only relevant for UFFDIO_COPY and it
> never happens for other UFFDIO_ operations, make it a part of
> mfill_atomic_pte_copy() that actually implements UFFDIO_COPY for
> anonymous memory.
>
> shmem implementation will be updated later and the loop in
> mfill_atomic() will be adjusted afterwards.
Thanks for the refactoring. Looks good to me in general, only some
nitpicks inline.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> mm/userfaultfd.c | 70 +++++++++++++++++++++++++++++++-----------------
> 1 file changed, 46 insertions(+), 24 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 45d8f04aaf4f..01a2b898fa40 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -404,35 +404,57 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
> return ret;
> }
>
> +static int mfill_copy_folio_retry(struct mfill_state *state, struct folio *folio)
> +{
> + unsigned long src_addr = state->src_addr;
> + void *kaddr;
> + int err;
> +
> + /* retry copying with mm_lock dropped */
> + mfill_put_vma(state);
> +
> + kaddr = kmap_local_folio(folio, 0);
> + err = copy_from_user(kaddr, (const void __user *) src_addr, PAGE_SIZE);
> + kunmap_local(kaddr);
> + if (unlikely(err))
> + return -EFAULT;
> +
> + flush_dcache_folio(folio);
> +
> + /* reget VMA and PMD, they could change underneath us */
> + err = mfill_get_vma(state);
> + if (err)
> + return err;
> +
> + err = mfill_get_pmd(state);
> + if (err)
> + return err;
> +
> + return 0;
> +}
> +
> static int mfill_atomic_pte_copy(struct mfill_state *state)
> {
> - struct vm_area_struct *dst_vma = state->vma;
> unsigned long dst_addr = state->dst_addr;
> unsigned long src_addr = state->src_addr;
> uffd_flags_t flags = state->flags;
> - pmd_t *dst_pmd = state->pmd;
> struct folio *folio;
> int ret;
>
> - if (!state->folio) {
> - ret = -ENOMEM;
> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma,
> - dst_addr);
> - if (!folio)
> - goto out;
> + folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, state->vma, dst_addr);
> + if (!folio)
> + return -ENOMEM;
>
> - ret = mfill_copy_folio_locked(folio, src_addr);
> + ret = -ENOMEM;
> + if (mem_cgroup_charge(folio, state->vma->vm_mm, GFP_KERNEL))
> + goto out_release;
>
> + ret = mfill_copy_folio_locked(folio, src_addr);
> + if (unlikely(ret)) {
> /* fallback to copy_from_user outside mmap_lock */
> - if (unlikely(ret)) {
> - ret = -ENOENT;
> - state->folio = folio;
> - /* don't free the page */
> - goto out;
> - }
> - } else {
> - folio = state->folio;
> - state->folio = NULL;
> + ret = mfill_copy_folio_retry(state, folio);
Yes, I agree this should work and should avoid the previous ENOENT
processing that might be hard to follow. It'll move the complexity into
mfill_state though (e.g., now the vma lock state is unknown after this
function returns), but I guess it's fine.
> + if (ret)
> + goto out_release;
> }
>
> /*
> @@ -442,17 +464,16 @@ static int mfill_atomic_pte_copy(struct mfill_state *state)
> */
> __folio_mark_uptodate(folio);
Since success path should make sure vma lock held when reaching here, but
now with mfill_copy_folio_retry()'s presence it's not as clear as before,
maybe we add an assertion for that here before installing ptes? No strong
feelings.
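For example (just a sketch, assuming state->vma is only non-NULL while the
vma lock is held, so it can stand in for the lock state):
	/* the vma lock must still be held when installing the pte */
	VM_WARN_ON_ONCE(!state->vma);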
>
> - ret = -ENOMEM;
> - if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL))
> - goto out_release;
> -
> - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
> + ret = mfill_atomic_install_pte(state->pmd, state->vma, dst_addr,
> &folio->page, true, flags);
> if (ret)
> goto out_release;
> out:
> return ret;
> out_release:
> + /* Don't return -ENOENT so that our caller won't retry */
> + if (ret == -ENOENT)
> + ret = -EFAULT;
I recall the code removed is the only path that can return ENOENT? Then
maybe this line isn't needed?
> folio_put(folio);
> goto out;
> }
> @@ -907,7 +928,8 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
> break;
> }
>
> - mfill_put_vma(&state);
> + if (state.vma)
I wonder if we should move this check into mfill_put_vma() directly;
otherwise it might be overlooked if we put_vma in other paths.
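I.e., an untested sketch on top of this patch:
	static void mfill_put_vma(struct mfill_state *state)
	{
		/* make put idempotent so callers don't need the check */
		if (!state->vma)
			return;
		up_read(&state->ctx->map_changing_lock);
		uffd_mfill_unlock(state->vma);
		state->vma = NULL;
	}
Then every caller could call it unconditionally.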
> + mfill_put_vma(&state);
> out:
> if (state.folio)
> folio_put(state.folio);
> --
> 2.51.0
>
--
Peter Xu
* Re: [PATCH RFC 05/17] userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy()
2026-02-02 21:23 ` Peter Xu
@ 2026-02-08 10:01 ` Mike Rapoport
0 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-02-08 10:01 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Mon, Feb 02, 2026 at 04:23:16PM -0500, Peter Xu wrote:
> Hi, Mike,
>
> On Tue, Jan 27, 2026 at 09:29:24PM +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > Implementation of UFFDIO_COPY for anonymous memory might fail to copy
> > data from the userspace buffer when the destination VMA is locked
> > (either with mm_lock or with per-VMA lock).
> >
> > In that case, mfill_atomic() releases the locks, retries copying the
> > data with locks dropped and then re-locks the destination VMA and
> > re-establishes PMD.
> >
> > Since this retry-reget dance is only relevant for UFFDIO_COPY and it
> > never happens for other UFFDIO_ operations, make it a part of
> > mfill_atomic_pte_copy() that actually implements UFFDIO_COPY for
> > anonymous memory.
> >
> > shmem implementation will be updated later and the loop in
> > mfill_atomic() will be adjusted afterwards.
>
> Thanks for the refactoring. Looks good to me in general, only some
> nitpicks inline.
>
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> > mm/userfaultfd.c | 70 +++++++++++++++++++++++++++++++-----------------
> > 1 file changed, 46 insertions(+), 24 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 45d8f04aaf4f..01a2b898fa40 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -404,35 +404,57 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
> > return ret;
> > }
> >
> > +static int mfill_copy_folio_retry(struct mfill_state *state, struct folio *folio)
> > +{
> > + unsigned long src_addr = state->src_addr;
> > + void *kaddr;
> > + int err;
> > +
> > + /* retry copying with mm_lock dropped */
> > + mfill_put_vma(state);
> > +
> > + kaddr = kmap_local_folio(folio, 0);
> > + err = copy_from_user(kaddr, (const void __user *) src_addr, PAGE_SIZE);
> > + kunmap_local(kaddr);
> > + if (unlikely(err))
> > + return -EFAULT;
> > +
> > + flush_dcache_folio(folio);
> > +
> > + /* reget VMA and PMD, they could change underneath us */
> > + err = mfill_get_vma(state);
> > + if (err)
> > + return err;
> > +
> > + err = mfill_get_pmd(state);
> > + if (err)
> > + return err;
> > +
> > + return 0;
> > +}
> > +
> > static int mfill_atomic_pte_copy(struct mfill_state *state)
> > {
> > - struct vm_area_struct *dst_vma = state->vma;
> > unsigned long dst_addr = state->dst_addr;
> > unsigned long src_addr = state->src_addr;
> > uffd_flags_t flags = state->flags;
> > - pmd_t *dst_pmd = state->pmd;
> > struct folio *folio;
> > int ret;
> >
> > - if (!state->folio) {
> > - ret = -ENOMEM;
> > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma,
> > - dst_addr);
> > - if (!folio)
> > - goto out;
> > + folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, state->vma, dst_addr);
> > + if (!folio)
> > + return -ENOMEM;
> >
> > - ret = mfill_copy_folio_locked(folio, src_addr);
> > + ret = -ENOMEM;
> > + if (mem_cgroup_charge(folio, state->vma->vm_mm, GFP_KERNEL))
> > + goto out_release;
> >
> > + ret = mfill_copy_folio_locked(folio, src_addr);
> > + if (unlikely(ret)) {
> > /* fallback to copy_from_user outside mmap_lock */
> > - if (unlikely(ret)) {
> > - ret = -ENOENT;
> > - state->folio = folio;
> > - /* don't free the page */
> > - goto out;
> > - }
> > - } else {
> > - folio = state->folio;
> > - state->folio = NULL;
> > + ret = mfill_copy_folio_retry(state, folio);
>
> Yes, I agree this should work and should avoid the previous ENOENT
> processing that might be hard to follow. It'll move the complexity into
> mfill_state though (e.g., now the vma lock state after this function
> returns is unclear..), but I guess it's fine.
When this function returns success the VMA is locked. If the function fails
it does not matter whether the VMA is locked.
I'll add some comments.
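Roughly something like this as the contract (a sketch only, the final
wording may differ):

	/*
	 * mfill_copy_folio_retry() drops the VMA/mmap lock to redo the copy.
	 * On success it returns with the destination VMA re-locked and
	 * state->vma/state->pmd valid again; on failure the lock state does
	 * not matter because the caller bails out without touching page
	 * tables.
	 */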
> > + if (ret)
> > + goto out_release;
> > }
> >
> > /*
> > @@ -442,17 +464,16 @@ static int mfill_atomic_pte_copy(struct mfill_state *state)
> > */
> > __folio_mark_uptodate(folio);
>
> Since success path should make sure vma lock held when reaching here, but
> now with mfill_copy_folio_retry()'s presence it's not as clear as before,
> maybe we add an assertion for that here before installing ptes? No strong
> feelings.
I'll add comments.
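For example (just a sketch, assuming the usual lock assertion helpers can
be used at this point):

	/* the retry path must return with the VMA re-locked */
	vma_assert_locked(state->vma);

right before mfill_atomic_install_pte().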
> >
> > - ret = -ENOMEM;
> > - if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL))
> > - goto out_release;
> > -
> > - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
> > + ret = mfill_atomic_install_pte(state->pmd, state->vma, dst_addr,
> > &folio->page, true, flags);
> > if (ret)
> > goto out_release;
> > out:
> > return ret;
> > out_release:
> > + /* Don't return -ENOENT so that our caller won't retry */
> > + if (ret == -ENOENT)
> > + ret = -EFAULT;
>
> I recall the code removed is the only path that can return ENOENT? Then
> maybe this line isn't needed?
I didn't want to audit all potential errors, and this is a temporary safety
measure to avoid breaking bisection. It is removed anyway in the
following patches.
> > folio_put(folio);
> > goto out;
> > }
> > @@ -907,7 +928,8 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
> > break;
> > }
> >
> > - mfill_put_vma(&state);
> > + if (state.vma)
>
> I wonder if we should move this check into mfill_put_vma() directly, it
> might be overlooked if we'll put_vma in other paths otherwise.
Yeah, I'll check this.
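Probably as simple as (a sketch; the existing unlock logic stays as is):

static void mfill_put_vma(struct mfill_state *state)
{
	if (!state->vma)
		return;

	/* ... current unlock/put of the VMA unchanged ... */

	state->vma = NULL;
}

so that callers can call it unconditionally.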
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH RFC 06/17] userfaultfd: move vma_can_userfault out of line
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Mike Rapoport
` (4 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 05/17] userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy() Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-01-27 19:29 ` [PATCH RFC 07/17] userfaultfd: introduce vm_uffd_ops Mike Rapoport
` (11 subsequent siblings)
17 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest,
David Hildenbrand (Red Hat)
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
vma_can_userfault() has grown pretty big and it's not called on
performance critical path.
Move it out of line.
No functional changes.
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/userfaultfd_k.h | 35 ++---------------------------------
mm/userfaultfd.c | 33 +++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+), 33 deletions(-)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index fd5f42765497..a49cf750e803 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -208,39 +208,8 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
return vma->vm_flags & __VM_UFFD_FLAGS;
}
-static inline bool vma_can_userfault(struct vm_area_struct *vma,
- vm_flags_t vm_flags,
- bool wp_async)
-{
- vm_flags &= __VM_UFFD_FLAGS;
-
- if (vma->vm_flags & VM_DROPPABLE)
- return false;
-
- if ((vm_flags & VM_UFFD_MINOR) &&
- (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
- return false;
-
- /*
- * If wp async enabled, and WP is the only mode enabled, allow any
- * memory type.
- */
- if (wp_async && (vm_flags == VM_UFFD_WP))
- return true;
-
- /*
- * If user requested uffd-wp but not enabled pte markers for
- * uffd-wp, then shmem & hugetlbfs are not supported but only
- * anonymous.
- */
- if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) &&
- !vma_is_anonymous(vma))
- return false;
-
- /* By default, allow any of anon|shmem|hugetlb */
- return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
- vma_is_shmem(vma);
-}
+bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
+ bool wp_async);
static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
{
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 01a2b898fa40..786f0a245675 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -2016,6 +2016,39 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
return moved ? moved : err;
}
+bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
+ bool wp_async)
+{
+ vm_flags &= __VM_UFFD_FLAGS;
+
+ if (vma->vm_flags & VM_DROPPABLE)
+ return false;
+
+ if ((vm_flags & VM_UFFD_MINOR) &&
+ (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+ return false;
+
+ /*
+ * If wp async enabled, and WP is the only mode enabled, allow any
+ * memory type.
+ */
+ if (wp_async && (vm_flags == VM_UFFD_WP))
+ return true;
+
+ /*
+ * If user requested uffd-wp but not enabled pte markers for
+ * uffd-wp, then shmem & hugetlbfs are not supported but only
+ * anonymous.
+ */
+ if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) &&
+ !vma_is_anonymous(vma))
+ return false;
+
+ /* By default, allow any of anon|shmem|hugetlb */
+ return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+ vma_is_shmem(vma);
+}
+
static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
vm_flags_t vm_flags)
{
--
2.51.0
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH RFC 07/17] userfaultfd: introduce vm_uffd_ops
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Mike Rapoport
` (5 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 06/17] userfaultfd: move vma_can_userfault out of line Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-02-02 21:36 ` Peter Xu
2026-01-27 19:29 ` [PATCH RFC 08/17] userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE Mike Rapoport
` (10 subsequent siblings)
17 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Current userfaultfd implementation works only with memory managed by
core MM: anonymous, shmem and hugetlb.
First, there is no fundamental reason to limit userfaultfd support only
to the core memory types and userfaults can be handled similarly to
regular page faults provided a VMA owner implements appropriate
callbacks.
Second, historically various code paths were conditioned on
vma_is_anonymous(), vma_is_shmem() and is_vm_hugetlb_page() and some of
these conditions can be expressed as operations implemented by a
particular memory type.
Introduce vm_uffd_ops extension to vm_operations_struct that will
delegate memory type specific operations to a VMA owner.
Operations for anonymous memory are handled internally in userfaultfd
using anon_uffd_ops that is implicitly assigned to anonymous VMAs.
Start with a single operation, ->can_userfault() that will verify that a
VMA meets requirements for userfaultfd support at registration time.
Implement that method for anonymous, shmem and hugetlb and move relevant
parts of vma_can_userfault() into the new callbacks.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/mm.h | 5 +++++
include/linux/userfaultfd_k.h | 6 +++++
mm/hugetlb.c | 21 ++++++++++++++++++
mm/shmem.c | 23 ++++++++++++++++++++
mm/userfaultfd.c | 41 ++++++++++++++++++++++-------------
5 files changed, 81 insertions(+), 15 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 15076261d0c2..3c2caff646c3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -732,6 +732,8 @@ struct vm_fault {
*/
};
+struct vm_uffd_ops;
+
/*
* These are the virtual MM functions - opening of an area, closing and
* unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -817,6 +819,9 @@ struct vm_operations_struct {
struct page *(*find_normal_page)(struct vm_area_struct *vma,
unsigned long addr);
#endif /* CONFIG_FIND_NORMAL_PAGE */
+#ifdef CONFIG_USERFAULTFD
+ const struct vm_uffd_ops *uffd_ops;
+#endif
};
#ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index a49cf750e803..56e85ab166c7 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -80,6 +80,12 @@ struct userfaultfd_ctx {
extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
+/* VMA userfaultfd operations */
+struct vm_uffd_ops {
+ /* Checks if a VMA can support userfaultfd */
+ bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags);
+};
+
/* A combined operation mode + behavior flags. */
typedef unsigned int __bitwise uffd_flags_t;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 51273baec9e5..909131910c43 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4797,6 +4797,24 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
return 0;
}
+#ifdef CONFIG_USERFAULTFD
+static bool hugetlb_can_userfault(struct vm_area_struct *vma,
+ vm_flags_t vm_flags)
+{
+ /*
+ * If user requested uffd-wp but not enabled pte markers for
+ * uffd-wp, then hugetlb is not supported.
+ */
+ if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP))
+ return false;
+ return true;
+}
+
+static const struct vm_uffd_ops hugetlb_uffd_ops = {
+ .can_userfault = hugetlb_can_userfault,
+};
+#endif
+
/*
* When a new function is introduced to vm_operations_struct and added
* to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
@@ -4810,6 +4828,9 @@ const struct vm_operations_struct hugetlb_vm_ops = {
.close = hugetlb_vm_op_close,
.may_split = hugetlb_vm_op_split,
.pagesize = hugetlb_vm_op_pagesize,
+#ifdef CONFIG_USERFAULTFD
+ .uffd_ops = &hugetlb_uffd_ops,
+#endif
};
static pte_t make_huge_pte(struct vm_area_struct *vma, struct folio *folio,
diff --git a/mm/shmem.c b/mm/shmem.c
index ec6c01378e9d..9b82cda271c4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5290,6 +5290,23 @@ static const struct super_operations shmem_ops = {
#endif
};
+#ifdef CONFIG_USERFAULTFD
+static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
+{
+ /*
+ * If user requested uffd-wp but not enabled pte markers for
+ * uffd-wp, then shmem is not supported.
+ */
+ if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP))
+ return false;
+ return true;
+}
+
+static const struct vm_uffd_ops shmem_uffd_ops = {
+ .can_userfault = shmem_can_userfault,
+};
+#endif
+
static const struct vm_operations_struct shmem_vm_ops = {
.fault = shmem_fault,
.map_pages = filemap_map_pages,
@@ -5297,6 +5314,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
.set_policy = shmem_set_policy,
.get_policy = shmem_get_policy,
#endif
+#ifdef CONFIG_USERFAULTFD
+ .uffd_ops = &shmem_uffd_ops,
+#endif
};
static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5306,6 +5326,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
.set_policy = shmem_set_policy,
.get_policy = shmem_get_policy,
#endif
+#ifdef CONFIG_USERFAULTFD
+ .uffd_ops = &shmem_uffd_ops,
+#endif
};
int shmem_init_fs_context(struct fs_context *fc)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 786f0a245675..d035f5e17f07 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -34,6 +34,25 @@ struct mfill_state {
pmd_t *pmd;
};
+static bool anon_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
+{
+ /* anonymous memory does not support MINOR mode */
+ if (vm_flags & VM_UFFD_MINOR)
+ return false;
+ return true;
+}
+
+static const struct vm_uffd_ops anon_uffd_ops = {
+ .can_userfault = anon_can_userfault,
+};
+
+static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
+{
+ if (vma_is_anonymous(vma))
+ return &anon_uffd_ops;
+ return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
+}
+
static __always_inline
bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
{
@@ -2019,13 +2038,15 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
bool wp_async)
{
- vm_flags &= __VM_UFFD_FLAGS;
+ const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
- if (vma->vm_flags & VM_DROPPABLE)
+ /* only VMAs that implement vm_uffd_ops are supported */
+ if (!ops)
return false;
- if ((vm_flags & VM_UFFD_MINOR) &&
- (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+ vm_flags &= __VM_UFFD_FLAGS;
+
+ if (vma->vm_flags & VM_DROPPABLE)
return false;
/*
@@ -2035,18 +2056,8 @@ bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
if (wp_async && (vm_flags == VM_UFFD_WP))
return true;
- /*
- * If user requested uffd-wp but not enabled pte markers for
- * uffd-wp, then shmem & hugetlbfs are not supported but only
- * anonymous.
- */
- if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) &&
- !vma_is_anonymous(vma))
- return false;
-
/* By default, allow any of anon|shmem|hugetlb */
- return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
- vma_is_shmem(vma);
+ return ops->can_userfault(vma, vm_flags);
}
static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
--
2.51.0
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC 07/17] userfaultfd: introduce vm_uffd_ops
2026-01-27 19:29 ` [PATCH RFC 07/17] userfaultfd: introduce vm_uffd_ops Mike Rapoport
@ 2026-02-02 21:36 ` Peter Xu
2026-02-08 10:13 ` Mike Rapoport
0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2026-02-02 21:36 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Tue, Jan 27, 2026 at 09:29:26PM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Current userfaultfd implementation works only with memory managed by
> core MM: anonymous, shmem and hugetlb.
>
> First, there is no fundamental reason to limit userfaultfd support only
> to the core memory types and userfaults can be handled similarly to
> regular page faults provided a VMA owner implements appropriate
> callbacks.
>
> Second, historically various code paths were conditioned on
> vma_is_anonymous(), vma_is_shmem() and is_vm_hugetlb_page() and some of
> these conditions can be expressed as operations implemented by a
> particular memory type.
>
> Introduce vm_uffd_ops extension to vm_operations_struct that will
> delegate memory type specific operations to a VMA owner.
>
> Operations for anonymous memory are handled internally in userfaultfd
> using anon_uffd_ops that is implicitly assigned to anonymous VMAs.
>
> Start with a single operation, ->can_userfault() that will verify that a
> VMA meets requirements for userfaultfd support at registration time.
>
> Implement that method for anonymous, shmem and hugetlb and move relevant
> parts of vma_can_userfault() into the new callbacks.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> include/linux/mm.h | 5 +++++
> include/linux/userfaultfd_k.h | 6 +++++
> mm/hugetlb.c | 21 ++++++++++++++++++
> mm/shmem.c | 23 ++++++++++++++++++++
> mm/userfaultfd.c | 41 ++++++++++++++++++++++-------------
> 5 files changed, 81 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 15076261d0c2..3c2caff646c3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -732,6 +732,8 @@ struct vm_fault {
> */
> };
>
> +struct vm_uffd_ops;
> +
> /*
> * These are the virtual MM functions - opening of an area, closing and
> * unmapping it (needed to keep files on disk up-to-date etc), pointer
> @@ -817,6 +819,9 @@ struct vm_operations_struct {
> struct page *(*find_normal_page)(struct vm_area_struct *vma,
> unsigned long addr);
> #endif /* CONFIG_FIND_NORMAL_PAGE */
> +#ifdef CONFIG_USERFAULTFD
> + const struct vm_uffd_ops *uffd_ops;
> +#endif
> };
>
> #ifdef CONFIG_NUMA_BALANCING
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index a49cf750e803..56e85ab166c7 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -80,6 +80,12 @@ struct userfaultfd_ctx {
>
> extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
>
> +/* VMA userfaultfd operations */
> +struct vm_uffd_ops {
> + /* Checks if a VMA can support userfaultfd */
> + bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags);
> +};
> +
> /* A combined operation mode + behavior flags. */
> typedef unsigned int __bitwise uffd_flags_t;
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 51273baec9e5..909131910c43 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4797,6 +4797,24 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
> return 0;
> }
>
> +#ifdef CONFIG_USERFAULTFD
> +static bool hugetlb_can_userfault(struct vm_area_struct *vma,
> + vm_flags_t vm_flags)
> +{
> + /*
> + * If user requested uffd-wp but not enabled pte markers for
> + * uffd-wp, then hugetlb is not supported.
> + */
> + if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP))
> + return false;
IMHO we don't need to dup this for every vm_uffd_ops driver. It might be
unnecessary to even make driver be aware how pte marker plays the role
here, because pte markers are needed for all page cache file systems
anyway. There should have no outliers. Instead we can just let
can_userfault() report whether the driver generically supports userfaultfd,
leaving the detail checks for core mm.
I understand you wanted to also make anon to be a driver, so this line
won't apply to anon. However IMHO anon is special enough so we can still
make this in the generic path.
> + return true;
> +}
> +
> +static const struct vm_uffd_ops hugetlb_uffd_ops = {
> + .can_userfault = hugetlb_can_userfault,
> +};
> +#endif
> +
> /*
> * When a new function is introduced to vm_operations_struct and added
> * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
> @@ -4810,6 +4828,9 @@ const struct vm_operations_struct hugetlb_vm_ops = {
> .close = hugetlb_vm_op_close,
> .may_split = hugetlb_vm_op_split,
> .pagesize = hugetlb_vm_op_pagesize,
> +#ifdef CONFIG_USERFAULTFD
> + .uffd_ops = &hugetlb_uffd_ops,
> +#endif
> };
>
> static pte_t make_huge_pte(struct vm_area_struct *vma, struct folio *folio,
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ec6c01378e9d..9b82cda271c4 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -5290,6 +5290,23 @@ static const struct super_operations shmem_ops = {
> #endif
> };
>
> +#ifdef CONFIG_USERFAULTFD
> +static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
> +{
> + /*
> + * If user requested uffd-wp but not enabled pte markers for
> + * uffd-wp, then shmem is not supported.
> + */
> + if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP))
> + return false;
> + return true;
> +}
> +
> +static const struct vm_uffd_ops shmem_uffd_ops = {
> + .can_userfault = shmem_can_userfault,
> +};
> +#endif
> +
> static const struct vm_operations_struct shmem_vm_ops = {
> .fault = shmem_fault,
> .map_pages = filemap_map_pages,
> @@ -5297,6 +5314,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
> .set_policy = shmem_set_policy,
> .get_policy = shmem_get_policy,
> #endif
> +#ifdef CONFIG_USERFAULTFD
> + .uffd_ops = &shmem_uffd_ops,
> +#endif
> };
>
> static const struct vm_operations_struct shmem_anon_vm_ops = {
> @@ -5306,6 +5326,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
> .set_policy = shmem_set_policy,
> .get_policy = shmem_get_policy,
> #endif
> +#ifdef CONFIG_USERFAULTFD
> + .uffd_ops = &shmem_uffd_ops,
> +#endif
> };
>
> int shmem_init_fs_context(struct fs_context *fc)
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 786f0a245675..d035f5e17f07 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -34,6 +34,25 @@ struct mfill_state {
> pmd_t *pmd;
> };
>
> +static bool anon_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
> +{
> + /* anonymous memory does not support MINOR mode */
> + if (vm_flags & VM_UFFD_MINOR)
> + return false;
> + return true;
> +}
> +
> +static const struct vm_uffd_ops anon_uffd_ops = {
> + .can_userfault = anon_can_userfault,
> +};
> +
> +static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
> +{
> + if (vma_is_anonymous(vma))
> + return &anon_uffd_ops;
> + return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
> +}
> +
> static __always_inline
> bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> {
> @@ -2019,13 +2038,15 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
> bool wp_async)
> {
> - vm_flags &= __VM_UFFD_FLAGS;
> + const struct vm_uffd_ops *ops = vma_uffd_ops(vma);
>
> - if (vma->vm_flags & VM_DROPPABLE)
> + /* only VMAs that implement vm_uffd_ops are supported */
> + if (!ops)
> return false;
>
> - if ((vm_flags & VM_UFFD_MINOR) &&
> - (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
> + vm_flags &= __VM_UFFD_FLAGS;
> +
> + if (vma->vm_flags & VM_DROPPABLE)
> return false;
>
> /*
> @@ -2035,18 +2056,8 @@ bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
> if (wp_async && (vm_flags == VM_UFFD_WP))
> return true;
>
> - /*
> - * If user requested uffd-wp but not enabled pte markers for
> - * uffd-wp, then shmem & hugetlbfs are not supported but only
> - * anonymous.
> - */
> - if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) &&
> - !vma_is_anonymous(vma))
> - return false;
> -
> /* By default, allow any of anon|shmem|hugetlb */
> - return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
> - vma_is_shmem(vma);
> + return ops->can_userfault(vma, vm_flags);
> }
>
> static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
> --
> 2.51.0
>
--
Peter Xu
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC 07/17] userfaultfd: introduce vm_uffd_ops
2026-02-02 21:36 ` Peter Xu
@ 2026-02-08 10:13 ` Mike Rapoport
2026-02-11 19:35 ` Peter Xu
0 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2026-02-08 10:13 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
Hi Peter,
On Mon, Feb 02, 2026 at 04:36:40PM -0500, Peter Xu wrote:
> On Tue, Jan 27, 2026 at 09:29:26PM +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > Current userfaultfd implementation works only with memory managed by
> > core MM: anonymous, shmem and hugetlb.
> >
> > First, there is no fundamental reason to limit userfaultfd support only
> > to the core memory types and userfaults can be handled similarly to
> > regular page faults provided a VMA owner implements appropriate
> > callbacks.
> >
> > Second, historically various code paths were conditioned on
> > vma_is_anonymous(), vma_is_shmem() and is_vm_hugetlb_page() and some of
> > these conditions can be expressed as operations implemented by a
> > particular memory type.
> >
> > Introduce vm_uffd_ops extension to vm_operations_struct that will
> > delegate memory type specific operations to a VMA owner.
> >
> > Operations for anonymous memory are handled internally in userfaultfd
> > using anon_uffd_ops that is implicitly assigned to anonymous VMAs.
> >
> > Start with a single operation, ->can_userfault() that will verify that a
> > VMA meets requirements for userfaultfd support at registration time.
> >
> > Implement that method for anonymous, shmem and hugetlb and move relevant
> > parts of vma_can_userfault() into the new callbacks.
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> > include/linux/mm.h | 5 +++++
> > include/linux/userfaultfd_k.h | 6 +++++
> > mm/hugetlb.c | 21 ++++++++++++++++++
> > mm/shmem.c | 23 ++++++++++++++++++++
> > mm/userfaultfd.c | 41 ++++++++++++++++++++++-------------
> > 5 files changed, 81 insertions(+), 15 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 15076261d0c2..3c2caff646c3 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -732,6 +732,8 @@ struct vm_fault {
> > */
> > };
> >
> > +struct vm_uffd_ops;
> > +
> > /*
> > * These are the virtual MM functions - opening of an area, closing and
> > * unmapping it (needed to keep files on disk up-to-date etc), pointer
> > @@ -817,6 +819,9 @@ struct vm_operations_struct {
> > struct page *(*find_normal_page)(struct vm_area_struct *vma,
> > unsigned long addr);
> > #endif /* CONFIG_FIND_NORMAL_PAGE */
> > +#ifdef CONFIG_USERFAULTFD
> > + const struct vm_uffd_ops *uffd_ops;
> > +#endif
> > };
> >
> > #ifdef CONFIG_NUMA_BALANCING
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index a49cf750e803..56e85ab166c7 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -80,6 +80,12 @@ struct userfaultfd_ctx {
> >
> > extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
> >
> > +/* VMA userfaultfd operations */
> > +struct vm_uffd_ops {
> > + /* Checks if a VMA can support userfaultfd */
> > + bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags);
> > +};
> > +
> > /* A combined operation mode + behavior flags. */
> > typedef unsigned int __bitwise uffd_flags_t;
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 51273baec9e5..909131910c43 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4797,6 +4797,24 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
> > return 0;
> > }
> >
> > +#ifdef CONFIG_USERFAULTFD
> > +static bool hugetlb_can_userfault(struct vm_area_struct *vma,
> > + vm_flags_t vm_flags)
> > +{
> > + /*
> > + * If user requested uffd-wp but not enabled pte markers for
> > + * uffd-wp, then hugetlb is not supported.
> > + */
> > + if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP))
> > + return false;
>
> IMHO we don't need to dup this for every vm_uffd_ops driver. It might be
> unnecessary to even make driver be aware how pte marker plays the role
> here, because pte markers are needed for all page cache file systems
> anyway. There should have no outliers. Instead we can just let
> can_userfault() report whether the driver generically supports userfaultfd,
> leaving the detail checks for core mm.
>
> I understand you wanted to also make anon to be a driver, so this line
> won't apply to anon. However IMHO anon is special enough so we can still
> make this in the generic path.
Well, the idea is to drop all vma_is*() in can_userfault(). And maybe
eventually in entire mm/userfaultfd.c
If all page cache filesystems need this, something like this should work,
right?
if (!uffd_supports_wp_marker() && (vma->vm_flags & VM_SHARED) &&
(vm_flags & VM_UFFD_WP))
return false;
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC 07/17] userfaultfd: introduce vm_uffd_ops
2026-02-08 10:13 ` Mike Rapoport
@ 2026-02-11 19:35 ` Peter Xu
2026-02-15 17:47 ` Mike Rapoport
0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2026-02-11 19:35 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Sun, Feb 08, 2026 at 12:13:45PM +0200, Mike Rapoport wrote:
> Hi Peter,
>
> On Mon, Feb 02, 2026 at 04:36:40PM -0500, Peter Xu wrote:
> > On Tue, Jan 27, 2026 at 09:29:26PM +0200, Mike Rapoport wrote:
> > > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > >
> > > Current userfaultfd implementation works only with memory managed by
> > > core MM: anonymous, shmem and hugetlb.
> > >
> > > First, there is no fundamental reason to limit userfaultfd support only
> > > to the core memory types and userfaults can be handled similarly to
> > > regular page faults provided a VMA owner implements appropriate
> > > callbacks.
> > >
> > > Second, historically various code paths were conditioned on
> > > vma_is_anonymous(), vma_is_shmem() and is_vm_hugetlb_page() and some of
> > > these conditions can be expressed as operations implemented by a
> > > particular memory type.
> > >
> > > Introduce vm_uffd_ops extension to vm_operations_struct that will
> > > delegate memory type specific operations to a VMA owner.
> > >
> > > Operations for anonymous memory are handled internally in userfaultfd
> > > using anon_uffd_ops that is implicitly assigned to anonymous VMAs.
> > >
> > > Start with a single operation, ->can_userfault() that will verify that a
> > > VMA meets requirements for userfaultfd support at registration time.
> > >
> > > Implement that method for anonymous, shmem and hugetlb and move relevant
> > > parts of vma_can_userfault() into the new callbacks.
> > >
> > > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > > ---
> > > include/linux/mm.h | 5 +++++
> > > include/linux/userfaultfd_k.h | 6 +++++
> > > mm/hugetlb.c | 21 ++++++++++++++++++
> > > mm/shmem.c | 23 ++++++++++++++++++++
> > > mm/userfaultfd.c | 41 ++++++++++++++++++++++-------------
> > > 5 files changed, 81 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 15076261d0c2..3c2caff646c3 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -732,6 +732,8 @@ struct vm_fault {
> > > */
> > > };
> > >
> > > +struct vm_uffd_ops;
> > > +
> > > /*
> > > * These are the virtual MM functions - opening of an area, closing and
> > > * unmapping it (needed to keep files on disk up-to-date etc), pointer
> > > @@ -817,6 +819,9 @@ struct vm_operations_struct {
> > > struct page *(*find_normal_page)(struct vm_area_struct *vma,
> > > unsigned long addr);
> > > #endif /* CONFIG_FIND_NORMAL_PAGE */
> > > +#ifdef CONFIG_USERFAULTFD
> > > + const struct vm_uffd_ops *uffd_ops;
> > > +#endif
> > > };
> > >
> > > #ifdef CONFIG_NUMA_BALANCING
> > > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > > index a49cf750e803..56e85ab166c7 100644
> > > --- a/include/linux/userfaultfd_k.h
> > > +++ b/include/linux/userfaultfd_k.h
> > > @@ -80,6 +80,12 @@ struct userfaultfd_ctx {
> > >
> > > extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
> > >
> > > +/* VMA userfaultfd operations */
> > > +struct vm_uffd_ops {
> > > + /* Checks if a VMA can support userfaultfd */
> > > + bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags);
> > > +};
> > > +
> > > /* A combined operation mode + behavior flags. */
> > > typedef unsigned int __bitwise uffd_flags_t;
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 51273baec9e5..909131910c43 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -4797,6 +4797,24 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
> > > return 0;
> > > }
> > >
> > > +#ifdef CONFIG_USERFAULTFD
> > > +static bool hugetlb_can_userfault(struct vm_area_struct *vma,
> > > + vm_flags_t vm_flags)
> > > +{
> > > + /*
> > > + * If user requested uffd-wp but not enabled pte markers for
> > > + * uffd-wp, then hugetlb is not supported.
> > > + */
> > > + if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP))
> > > + return false;
> >
> > IMHO we don't need to dup this for every vm_uffd_ops driver. It might be
> > unnecessary to even make driver be aware how pte marker plays the role
> > here, because pte markers are needed for all page cache file systems
> > anyway. There should have no outliers. Instead we can just let
> > can_userfault() report whether the driver generically supports userfaultfd,
> > leaving the detail checks for core mm.
> >
> > I understand you wanted to also make anon to be a driver, so this line
> > won't apply to anon. However IMHO anon is special enough so we can still
> > make this in the generic path.
>
> Well, the idea is to drop all vma_is*() in can_userfault(). And maybe
> eventually in entire mm/userfaultfd.c
>
> If all page cache filesystems need this, something like this should work,
> right?
>
> if (!uffd_supports_wp_marker() && (vma->vm_flags & VM_SHARED) &&
> (vm_flags & VM_UFFD_WP))
> return false;
Sorry for a late response.
IIUC we can't check against VM_SHARED, because we need pte markers also for
MAP_PRIVATE on file mappings.
The need for pte markers comes from the fact that the vma has a page cache
backing it, rather than from whether it's a shared or private mapping.
Consider a file-mapping vma with MAP_PRIVATE: if we wr-protect the vma with
nothing populated, we still want to get notified whenever there's a write.
So the original check should be good.
I'm fine with most of the rest comments in this series I left and I'm OK if
you prefer settle things down first. For this one, I still want to see if
we can move this to uffd core code.
The whole point is that I want zero info about pte markers leaked into the
module ops.
For that, IMHO it'll be fine if we use vma_is_anonymous() in uffd core
code once.
Actually, I don't think uffd core can get rid of handling anon specially.
With this series applied, mfill_atomic_pte_copy() will still need to
hard-code anon processing on MAP_PRIVATE and I don't think it can go away..
mfill_atomic_pte_copy():
if (!(state->vma->vm_flags & VM_SHARED))
ops = &anon_uffd_ops;
IMHO using vma_is_anonymous() one more time should be better than
leaking the whole pte marker concept to modules. So the driver should only
report if the driver supports UFFD_WP in general. It shouldn't care about
anything the core mm would already do otherwise, including this one on
"whether system config / arch has globally enabled pte markers" and the
relation between that config and the WP feature impl details.
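In other words, keep something like the pre-series check in the core
vma_can_userfault() (a sketch):

	/*
	 * pte markers are a core mm detail: without them uffd-wp can only
	 * work on anonymous memory, regardless of what the driver reports.
	 */
	if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) &&
	    !vma_is_anonymous(vma))
		return false;

	return ops->can_userfault(vma, vm_flags);

and let ->can_userfault() only report whether the memory type supports the
requested modes at all.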
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC 07/17] userfaultfd: introduce vm_uffd_ops
2026-02-11 19:35 ` Peter Xu
@ 2026-02-15 17:47 ` Mike Rapoport
2026-02-18 21:34 ` Peter Xu
0 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2026-02-15 17:47 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Wed, Feb 11, 2026 at 02:35:23PM -0500, Peter Xu wrote:
> On Sun, Feb 08, 2026 at 12:13:45PM +0200, Mike Rapoport wrote:
> > >
> > > I understand you wanted to also make anon to be a driver, so this line
> > > won't apply to anon. However IMHO anon is special enough so we can still
> > > make this in the generic path.
> >
> > Well, the idea is to drop all vma_is*() in can_userfault(). And maybe
> > eventually in entire mm/userfaultfd.c
> >
> > If all page cache filesystems need this, something like this should work,
> > right?
> >
> > if (!uffd_supports_wp_marker() && (vma->vm_flags & VM_SHARED) &&
> > (vm_flags & VM_UFFD_WP))
> > return false;
>
> Sorry for a late response.
>
> IMHO using vma_is_anonymous() for one more time should be better than
> leaking pte marker whole concept to modules. So the driver should only
> report if the driver supports UFFD_WP in general. It shouldn't care about
> anything the core mm would already do otherwise, including this one on
> "whether system config / arch has globally enabled pte markers" and the
> relation between that config and the WP feature impl details.
I agree. Will move the check for the markers back into userfaultfd.c
> Thanks,
>
> --
> Peter Xu
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC 07/17] userfaultfd: introduce vm_uffd_ops
2026-02-15 17:47 ` Mike Rapoport
@ 2026-02-18 21:34 ` Peter Xu
0 siblings, 0 replies; 41+ messages in thread
From: Peter Xu @ 2026-02-18 21:34 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Sun, Feb 15, 2026 at 07:47:19PM +0200, Mike Rapoport wrote:
> I agree. Will move the check for the markers back into userfaultfd.c
Thank you!
--
Peter Xu
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH RFC 08/17] userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Mike Rapoport
` (6 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 07/17] userfaultfd: introduce vm_uffd_ops Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-01-27 19:29 ` [PATCH RFC 09/17] userfaultfd: introduce vm_uffd_ops->alloc_folio() Mike Rapoport
` (9 subsequent siblings)
17 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE
it needs to get a folio that already exists in the pagecache backing
that VMA.
Instead of using shmem_get_folio() for that, add a get_folio_noalloc()
method to 'struct vm_uffd_ops' that will return a folio if it exists in
the VMA's pagecache at the given pgoff.
Implement get_folio_noalloc() method for shmem and slightly refactor
userfaultfd's mfill_get_vma() and mfill_atomic_pte_continue() to support
this new API.
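For a memory type backed purely by the page cache the callback can be as
simple as (an illustrative sketch, the name is made up):

static struct folio *foo_get_folio_noalloc(struct inode *inode, pgoff_t pgoff)
{
	/* locked folio with a reference held, or ERR_PTR(-ENOENT) */
	return filemap_lock_folio(inode->i_mapping, pgoff);
}

shmem cannot do that directly because its folios may live in swap, hence
shmem_get_folio(..., SGP_NOALLOC) below.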
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/userfaultfd_k.h | 7 +++++++
mm/shmem.c | 15 ++++++++++++++-
mm/userfaultfd.c | 32 ++++++++++++++++----------------
3 files changed, 37 insertions(+), 17 deletions(-)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 56e85ab166c7..66dfc3c164e6 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -84,6 +84,13 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
struct vm_uffd_ops {
/* Checks if a VMA can support userfaultfd */
bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags);
+ /*
+ * Called to resolve UFFDIO_CONTINUE request.
+ * Should return the folio found at pgoff in the VMA's pagecache if it
+ * exists or ERR_PTR otherwise.
+ * The returned folio is locked and with reference held.
+ */
+ struct folio *(*get_folio_noalloc)(struct inode *inode, pgoff_t pgoff);
};
/* A combined operation mode + behavior flags. */
diff --git a/mm/shmem.c b/mm/shmem.c
index 9b82cda271c4..87cd8d2fdb97 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5291,6 +5291,18 @@ static const struct super_operations shmem_ops = {
};
#ifdef CONFIG_USERFAULTFD
+static struct folio *shmem_get_folio_noalloc(struct inode *inode, pgoff_t pgoff)
+{
+ struct folio *folio;
+ int err;
+
+ err = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+ if (err)
+ return ERR_PTR(err);
+
+ return folio;
+}
+
static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
{
/*
@@ -5303,7 +5315,8 @@ static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
}
static const struct vm_uffd_ops shmem_uffd_ops = {
- .can_userfault = shmem_can_userfault,
+ .can_userfault = shmem_can_userfault,
+ .get_folio_noalloc = shmem_get_folio_noalloc,
};
#endif
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d035f5e17f07..f0e6336015f1 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -188,6 +188,7 @@ static int mfill_get_vma(struct mfill_state *state)
struct userfaultfd_ctx *ctx = state->ctx;
uffd_flags_t flags = state->flags;
struct vm_area_struct *dst_vma;
+ const struct vm_uffd_ops *ops;
int err;
/*
@@ -228,10 +229,12 @@ static int mfill_get_vma(struct mfill_state *state)
if (is_vm_hugetlb_page(dst_vma))
goto out;
- if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
+ ops = vma_uffd_ops(dst_vma);
+ if (!ops)
goto out_unlock;
- if (!vma_is_shmem(dst_vma) &&
- uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+
+ if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) &&
+ !ops->get_folio_noalloc)
goto out_unlock;
out:
@@ -568,6 +571,7 @@ static int mfill_atomic_pte_zeropage(struct mfill_state *state)
static int mfill_atomic_pte_continue(struct mfill_state *state)
{
struct vm_area_struct *dst_vma = state->vma;
+ const struct vm_uffd_ops *ops = vma_uffd_ops(dst_vma);
unsigned long dst_addr = state->dst_addr;
pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
struct inode *inode = file_inode(dst_vma->vm_file);
@@ -577,16 +581,13 @@ static int mfill_atomic_pte_continue(struct mfill_state *state)
struct page *page;
int ret;
- ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+ if (!ops)
+ return -EOPNOTSUPP;
+
+ folio = ops->get_folio_noalloc(inode, pgoff);
/* Our caller expects us to return -EFAULT if we failed to find folio */
- if (ret == -ENOENT)
- ret = -EFAULT;
- if (ret)
- goto out;
- if (!folio) {
- ret = -EFAULT;
- goto out;
- }
+ if (IS_ERR_OR_NULL(folio))
+ return -EFAULT;
page = folio_file_page(folio, pgoff);
if (PageHWPoison(page)) {
@@ -600,13 +601,12 @@ static int mfill_atomic_pte_continue(struct mfill_state *state)
goto out_release;
folio_unlock(folio);
- ret = 0;
-out:
- return ret;
+ return 0;
+
out_release:
folio_unlock(folio);
folio_put(folio);
- goto out;
+ return ret;
}
/* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
--
2.51.0
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH RFC 09/17] userfaultfd: introduce vm_uffd_ops->alloc_folio()
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Mike Rapoport
` (7 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 08/17] userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-02-02 22:13 ` Peter Xu
2026-01-27 19:29 ` [PATCH RFC 10/17] shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops Mike Rapoport
` (8 subsequent siblings)
17 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
and use it to refactor mfill_atomic_pte_zeroed_folio() and
mfill_atomic_pte_copy().
mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy() perform
almost identical actions:
* allocate a folio
* update folio contents (either copy from userspace or fill with zeros)
* update page tables with the new folio
Split out a __mfill_atomic_pte() helper that handles both cases and uses
the newly introduced vm_uffd_ops->alloc_folio() to allocate the folio.
Pass the ops structure from the callers to __mfill_atomic_pte() to later
allow using anon_uffd_ops for MAP_PRIVATE mappings of file-backed VMAs.
Note that the new ops method is called alloc_folio() rather than
folio_alloc() to avoid a clash with the alloc_tag macro folio_alloc().
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/userfaultfd_k.h | 6 +++
mm/userfaultfd.c | 92 ++++++++++++++++++-----------------
2 files changed, 54 insertions(+), 44 deletions(-)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 66dfc3c164e6..4d8b879eed91 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -91,6 +91,12 @@ struct vm_uffd_ops {
* The returned folio is locked and with reference held.
*/
struct folio *(*get_folio_noalloc)(struct inode *inode, pgoff_t pgoff);
+ /*
+ * Called during resolution of UFFDIO_COPY request.
+ * Should allocate and return a folio or NULL if allocation fails.
+ */
+ struct folio *(*alloc_folio)(struct vm_area_struct *vma,
+ unsigned long addr);
};
/* A combined operation mode + behavior flags. */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index f0e6336015f1..b3c12630769c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -42,8 +42,26 @@ static bool anon_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
return true;
}
+static struct folio *anon_alloc_folio(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct folio *folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
+ addr);
+
+ if (!folio)
+ return NULL;
+
+ if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) {
+ folio_put(folio);
+ return NULL;
+ }
+
+ return folio;
+}
+
static const struct vm_uffd_ops anon_uffd_ops = {
.can_userfault = anon_can_userfault,
+ .alloc_folio = anon_alloc_folio,
};
static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
@@ -455,7 +473,8 @@ static int mfill_copy_folio_retry(struct mfill_state *state, struct folio *folio
return 0;
}
-static int mfill_atomic_pte_copy(struct mfill_state *state)
+static int __mfill_atomic_pte(struct mfill_state *state,
+ const struct vm_uffd_ops *ops)
{
unsigned long dst_addr = state->dst_addr;
unsigned long src_addr = state->src_addr;
@@ -463,20 +482,22 @@ static int mfill_atomic_pte_copy(struct mfill_state *state)
struct folio *folio;
int ret;
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, state->vma, dst_addr);
+ folio = ops->alloc_folio(state->vma, state->dst_addr);
if (!folio)
return -ENOMEM;
- ret = -ENOMEM;
- if (mem_cgroup_charge(folio, state->vma->vm_mm, GFP_KERNEL))
- goto out_release;
-
- ret = mfill_copy_folio_locked(folio, src_addr);
- if (unlikely(ret)) {
+ if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) {
+ ret = mfill_copy_folio_locked(folio, src_addr);
/* fallback to copy_from_user outside mmap_lock */
- ret = mfill_copy_folio_retry(state, folio);
- if (ret)
- goto out_release;
+ if (unlikely(ret)) {
+ ret = mfill_copy_folio_retry(state, folio);
+ if (ret)
+ goto err_folio_put;
+ }
+ } else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
+ clear_user_highpage(&folio->page, state->dst_addr);
+ } else {
+ VM_WARN_ONCE(1, "unknown UFFDIO operation");
}
/*
@@ -489,47 +510,30 @@ static int mfill_atomic_pte_copy(struct mfill_state *state)
ret = mfill_atomic_install_pte(state->pmd, state->vma, dst_addr,
&folio->page, true, flags);
if (ret)
- goto out_release;
-out:
- return ret;
-out_release:
+ goto err_folio_put;
+
+ return 0;
+
+err_folio_put:
+ folio_put(folio);
/* Don't return -ENOENT so that our caller won't retry */
if (ret == -ENOENT)
ret = -EFAULT;
- folio_put(folio);
- goto out;
+ return ret;
}
-static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd,
- struct vm_area_struct *dst_vma,
- unsigned long dst_addr)
+static int mfill_atomic_pte_copy(struct mfill_state *state)
{
- struct folio *folio;
- int ret = -ENOMEM;
-
- folio = vma_alloc_zeroed_movable_folio(dst_vma, dst_addr);
- if (!folio)
- return ret;
-
- if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL))
- goto out_put;
+ const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
- /*
- * The memory barrier inside __folio_mark_uptodate makes sure that
- * zeroing out the folio become visible before mapping the page
- * using set_pte_at(). See do_anonymous_page().
- */
- __folio_mark_uptodate(folio);
+ return __mfill_atomic_pte(state, ops);
+}
- ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
- &folio->page, true, 0);
- if (ret)
- goto out_put;
+static int mfill_atomic_pte_zeroed_folio(struct mfill_state *state)
+{
+ const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
- return 0;
-out_put:
- folio_put(folio);
- return ret;
+ return __mfill_atomic_pte(state, ops);
}
static int mfill_atomic_pte_zeropage(struct mfill_state *state)
@@ -542,7 +546,7 @@ static int mfill_atomic_pte_zeropage(struct mfill_state *state)
int ret;
if (mm_forbids_zeropage(dst_vma->vm_mm))
- return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr);
+ return mfill_atomic_pte_zeroed_folio(state);
_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
dst_vma->vm_page_prot));
--
2.51.0
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC 09/17] userfaultfd: introduce vm_uffd_ops->alloc_folio()
2026-01-27 19:29 ` [PATCH RFC 09/17] userfaultfd: introduce vm_uffd_ops->alloc_folio() Mike Rapoport
@ 2026-02-02 22:13 ` Peter Xu
2026-02-08 10:22 ` Mike Rapoport
0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2026-02-02 22:13 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Tue, Jan 27, 2026 at 09:29:28PM +0200, Mike Rapoport wrote:
[...]
> -static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd,
> - struct vm_area_struct *dst_vma,
> - unsigned long dst_addr)
> +static int mfill_atomic_pte_copy(struct mfill_state *state)
> {
> - struct folio *folio;
> - int ret = -ENOMEM;
> -
> - folio = vma_alloc_zeroed_movable_folio(dst_vma, dst_addr);
> - if (!folio)
> - return ret;
> -
> - if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL))
> - goto out_put;
> + const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
>
> - /*
> - * The memory barrier inside __folio_mark_uptodate makes sure that
> - * zeroing out the folio become visible before mapping the page
> - * using set_pte_at(). See do_anonymous_page().
> - */
> - __folio_mark_uptodate(folio);
> + return __mfill_atomic_pte(state, ops);
> +}
>
> - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
> - &folio->page, true, 0);
> - if (ret)
> - goto out_put;
> +static int mfill_atomic_pte_zeroed_folio(struct mfill_state *state)
> +{
> + const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
>
> - return 0;
> -out_put:
> - folio_put(folio);
> - return ret;
> + return __mfill_atomic_pte(state, ops);
> }
>
> static int mfill_atomic_pte_zeropage(struct mfill_state *state)
> @@ -542,7 +546,7 @@ static int mfill_atomic_pte_zeropage(struct mfill_state *state)
> int ret;
>
> if (mm_forbids_zeropage(dst_vma->vm_mm))
> - return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr);
> + return mfill_atomic_pte_zeroed_folio(state);
After this patch, mfill_atomic_pte_zeroed_folio() should be 100% the same
impl with mfill_atomic_pte_copy(), so IIUC we can drop it.
>
> _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
> dst_vma->vm_page_prot));
> --
> 2.51.0
>
--
Peter Xu
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC 09/17] userfaultfd: introduce vm_uffd_ops->alloc_folio()
2026-02-02 22:13 ` Peter Xu
@ 2026-02-08 10:22 ` Mike Rapoport
2026-02-11 19:37 ` Peter Xu
0 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2026-02-08 10:22 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Mon, Feb 02, 2026 at 05:13:20PM -0500, Peter Xu wrote:
> On Tue, Jan 27, 2026 at 09:29:28PM +0200, Mike Rapoport wrote:
>
> [...]
>
> > -static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd,
> > - struct vm_area_struct *dst_vma,
> > - unsigned long dst_addr)
> > +static int mfill_atomic_pte_copy(struct mfill_state *state)
> > {
> > - struct folio *folio;
> > - int ret = -ENOMEM;
> > -
> > - folio = vma_alloc_zeroed_movable_folio(dst_vma, dst_addr);
> > - if (!folio)
> > - return ret;
> > -
> > - if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL))
> > - goto out_put;
> > + const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
> >
> > - /*
> > - * The memory barrier inside __folio_mark_uptodate makes sure that
> > - * zeroing out the folio become visible before mapping the page
> > - * using set_pte_at(). See do_anonymous_page().
> > - */
> > - __folio_mark_uptodate(folio);
> > + return __mfill_atomic_pte(state, ops);
> > +}
> >
> > - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
> > - &folio->page, true, 0);
> > - if (ret)
> > - goto out_put;
> > +static int mfill_atomic_pte_zeroed_folio(struct mfill_state *state)
> > +{
> > + const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
> >
> > - return 0;
> > -out_put:
> > - folio_put(folio);
> > - return ret;
> > + return __mfill_atomic_pte(state, ops);
> > }
> >
> > static int mfill_atomic_pte_zeropage(struct mfill_state *state)
> > @@ -542,7 +546,7 @@ static int mfill_atomic_pte_zeropage(struct mfill_state *state)
> > int ret;
> >
> > if (mm_forbids_zeropage(dst_vma->vm_mm))
> > - return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr);
> > + return mfill_atomic_pte_zeroed_folio(state);
>
> After this patch, mfill_atomic_pte_zeroed_folio() should be 100% the same
> impl with mfill_atomic_pte_copy(), so IIUC we can drop it.
It will be slightly different after the next patch to emphasize that
copying into MAP_PRIVATE actually creates anonymous memory.
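Concretely, the follow-up patch makes the copy path pick the anonymous ops
for MAP_PRIVATE mappings, roughly along these lines (condensed from patch
10/17; anon_uffd_ops is the ops set introduced earlier in the series for
anonymous memory):

static int mfill_atomic_pte_copy(struct mfill_state *state)
{
	const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);

	/* private mappings are COWed right away, so back them with anon memory */
	if (!(state->vma->vm_flags & VM_SHARED))
		ops = &anon_uffd_ops;

	return __mfill_atomic_pte(state, ops);
}

while mfill_atomic_pte_zeroed_folio() keeps using the VMA's own ops.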
> > _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
> > dst_vma->vm_page_prot));
> > --
> > 2.51.0
> >
>
> --
> Peter Xu
>
--
Sincerely yours,
Mike.
^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH RFC 09/17] userfaultfd: introduce vm_uffd_ops->alloc_folio()
2026-02-08 10:22 ` Mike Rapoport
@ 2026-02-11 19:37 ` Peter Xu
0 siblings, 0 replies; 41+ messages in thread
From: Peter Xu @ 2026-02-11 19:37 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Sun, Feb 08, 2026 at 12:22:52PM +0200, Mike Rapoport wrote:
> On Mon, Feb 02, 2026 at 05:13:20PM -0500, Peter Xu wrote:
> > On Tue, Jan 27, 2026 at 09:29:28PM +0200, Mike Rapoport wrote:
> >
> > [...]
> >
> > > -static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd,
> > > - struct vm_area_struct *dst_vma,
> > > - unsigned long dst_addr)
> > > +static int mfill_atomic_pte_copy(struct mfill_state *state)
> > > {
> > > - struct folio *folio;
> > > - int ret = -ENOMEM;
> > > -
> > > - folio = vma_alloc_zeroed_movable_folio(dst_vma, dst_addr);
> > > - if (!folio)
> > > - return ret;
> > > -
> > > - if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL))
> > > - goto out_put;
> > > + const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
> > >
> > > - /*
> > > - * The memory barrier inside __folio_mark_uptodate makes sure that
> > > - * zeroing out the folio become visible before mapping the page
> > > - * using set_pte_at(). See do_anonymous_page().
> > > - */
> > > - __folio_mark_uptodate(folio);
> > > + return __mfill_atomic_pte(state, ops);
> > > +}
> > >
> > > - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
> > > - &folio->page, true, 0);
> > > - if (ret)
> > > - goto out_put;
> > > +static int mfill_atomic_pte_zeroed_folio(struct mfill_state *state)
> > > +{
> > > + const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
> > >
> > > - return 0;
> > > -out_put:
> > > - folio_put(folio);
> > > - return ret;
> > > + return __mfill_atomic_pte(state, ops);
> > > }
> > >
> > > static int mfill_atomic_pte_zeropage(struct mfill_state *state)
> > > @@ -542,7 +546,7 @@ static int mfill_atomic_pte_zeropage(struct mfill_state *state)
> > > int ret;
> > >
> > > if (mm_forbids_zeropage(dst_vma->vm_mm))
> > > - return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr);
> > > + return mfill_atomic_pte_zeroed_folio(state);
> >
> > After this patch, mfill_atomic_pte_zeroed_folio() should be 100% the same
> > impl with mfill_atomic_pte_copy(), so IIUC we can drop it.
>
> It will be slightly different after the next patch to emphasize that
> copying into MAP_PRIVATE actually creates anonymous memory.
True. It might be helpful to leave a line in the commit message noting that
it's intentional to temporarily have two functions doing the same thing, but
I'm OK either way.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH RFC 10/17] shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Mike Rapoport
` (8 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 09/17] userfaultfd: introduce vm_uffd_ops->alloc_folio() Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-02-03 17:40 ` Peter Xu
2026-01-27 19:29 ` [PATCH RFC 11/17] userfaultfd: mfill_atomic() remove retry logic Mike Rapoport
` (7 subsequent siblings)
17 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Add filemap_add() and filemap_remove() methods to vm_uffd_ops and use
them in __mfill_atomic_pte() to add shmem folios to page cache and
remove them in case of error.
Implement these methods in shmem along with vm_uffd_ops->alloc_folio()
and drop shmem_mfill_atomic_pte().
Since userfaultfd now does not reference any functions from shmem, drop
include of linux/shmem_fs.h from mm/userfaultfd.c
mfill_atomic_install_pte() is not used anywhere outside of
mm/userfaultfd, make it static.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
fixup
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/shmem_fs.h | 14 ----
include/linux/userfaultfd_k.h | 20 +++--
mm/shmem.c | 148 ++++++++++++----------------------
mm/userfaultfd.c | 79 +++++++++---------
4 files changed, 106 insertions(+), 155 deletions(-)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index e2069b3179c4..754f17e5b53c 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -223,20 +223,6 @@ static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
extern bool shmem_charge(struct inode *inode, long pages);
-#ifdef CONFIG_USERFAULTFD
-#ifdef CONFIG_SHMEM
-extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
- struct vm_area_struct *dst_vma,
- unsigned long dst_addr,
- unsigned long src_addr,
- uffd_flags_t flags,
- struct folio **foliop);
-#else /* !CONFIG_SHMEM */
-#define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \
- src_addr, flags, foliop) ({ BUG(); 0; })
-#endif /* CONFIG_SHMEM */
-#endif /* CONFIG_USERFAULTFD */
-
/*
* Used space is stored as unsigned 64-bit value in bytes but
* quota core supports only signed 64-bit values so use that
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 4d8b879eed91..75d5b09f2560 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -97,6 +97,21 @@ struct vm_uffd_ops {
*/
struct folio *(*alloc_folio)(struct vm_area_struct *vma,
unsigned long addr);
+ /*
+ * Called during resolution of UFFDIO_COPY request.
+ * Should lock the folio and add it to VMA's page cache.
+ * Returns 0 on success, error code on failre.
+ */
+ int (*filemap_add)(struct folio *folio, struct vm_area_struct *vma,
+ unsigned long addr);
+ /*
+ * Called during resolution of UFFDIO_COPY request on the error
+ * handling path.
+ * Should revert the operation of ->filemap_add().
+ * The folio should be unlocked, but the reference to it should not be
+ * dropped.
+ */
+ void (*filemap_remove)(struct folio *folio, struct vm_area_struct *vma);
};
/* A combined operation mode + behavior flags. */
@@ -130,11 +145,6 @@ static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_at
/* Flags controlling behavior. These behavior changes are mode-independent. */
#define MFILL_ATOMIC_WP MFILL_ATOMIC_FLAG(0)
-extern int mfill_atomic_install_pte(pmd_t *dst_pmd,
- struct vm_area_struct *dst_vma,
- unsigned long dst_addr, struct page *page,
- bool newly_allocated, uffd_flags_t flags);
-
extern ssize_t mfill_atomic_copy(struct userfaultfd_ctx *ctx, unsigned long dst_start,
unsigned long src_start, unsigned long len,
uffd_flags_t flags);
diff --git a/mm/shmem.c b/mm/shmem.c
index 87cd8d2fdb97..6f0485f76cb8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3169,118 +3169,73 @@ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap,
#endif /* CONFIG_TMPFS_QUOTA */
#ifdef CONFIG_USERFAULTFD
-int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
- struct vm_area_struct *dst_vma,
- unsigned long dst_addr,
- unsigned long src_addr,
- uffd_flags_t flags,
- struct folio **foliop)
-{
- struct inode *inode = file_inode(dst_vma->vm_file);
- struct shmem_inode_info *info = SHMEM_I(inode);
+static struct folio *shmem_mfill_folio_alloc(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct inode *inode = file_inode(vma->vm_file);
struct address_space *mapping = inode->i_mapping;
+ struct shmem_inode_info *info = SHMEM_I(inode);
+ pgoff_t pgoff = linear_page_index(vma, addr);
gfp_t gfp = mapping_gfp_mask(mapping);
- pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
- void *page_kaddr;
struct folio *folio;
- int ret;
- pgoff_t max_off;
-
- if (shmem_inode_acct_blocks(inode, 1)) {
- /*
- * We may have got a page, returned -ENOENT triggering a retry,
- * and now we find ourselves with -ENOMEM. Release the page, to
- * avoid a BUG_ON in our caller.
- */
- if (unlikely(*foliop)) {
- folio_put(*foliop);
- *foliop = NULL;
- }
- return -ENOMEM;
- }
- if (!*foliop) {
- ret = -ENOMEM;
- folio = shmem_alloc_folio(gfp, 0, info, pgoff);
- if (!folio)
- goto out_unacct_blocks;
+ if (unlikely(pgoff >= DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE)))
+ return NULL;
- if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) {
- page_kaddr = kmap_local_folio(folio, 0);
- /*
- * The read mmap_lock is held here. Despite the
- * mmap_lock being read recursive a deadlock is still
- * possible if a writer has taken a lock. For example:
- *
- * process A thread 1 takes read lock on own mmap_lock
- * process A thread 2 calls mmap, blocks taking write lock
- * process B thread 1 takes page fault, read lock on own mmap lock
- * process B thread 2 calls mmap, blocks taking write lock
- * process A thread 1 blocks taking read lock on process B
- * process B thread 1 blocks taking read lock on process A
- *
- * Disable page faults to prevent potential deadlock
- * and retry the copy outside the mmap_lock.
- */
- pagefault_disable();
- ret = copy_from_user(page_kaddr,
- (const void __user *)src_addr,
- PAGE_SIZE);
- pagefault_enable();
- kunmap_local(page_kaddr);
-
- /* fallback to copy_from_user outside mmap_lock */
- if (unlikely(ret)) {
- *foliop = folio;
- ret = -ENOENT;
- /* don't free the page */
- goto out_unacct_blocks;
- }
+ folio = shmem_alloc_folio(gfp, 0, info, pgoff);
+ if (!folio)
+ return NULL;
- flush_dcache_folio(folio);
- } else { /* ZEROPAGE */
- clear_user_highpage(&folio->page, dst_addr);
- }
- } else {
- folio = *foliop;
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
- *foliop = NULL;
+ if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) {
+ folio_put(folio);
+ return NULL;
}
- VM_BUG_ON(folio_test_locked(folio));
- VM_BUG_ON(folio_test_swapbacked(folio));
+ return folio;
+}
+
+static int shmem_mfill_filemap_add(struct folio *folio,
+ struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ struct address_space *mapping = inode->i_mapping;
+ pgoff_t pgoff = linear_page_index(vma, addr);
+ gfp_t gfp = mapping_gfp_mask(mapping);
+ int err;
+
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
- __folio_mark_uptodate(folio);
-
- ret = -EFAULT;
- max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
- if (unlikely(pgoff >= max_off))
- goto out_release;
- ret = mem_cgroup_charge(folio, dst_vma->vm_mm, gfp);
- if (ret)
- goto out_release;
- ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
- if (ret)
- goto out_release;
+ err = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
+ if (err)
+ goto err_unlock;
- ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
- &folio->page, true, flags);
- if (ret)
- goto out_delete_from_cache;
+ if (shmem_inode_acct_blocks(inode, 1)) {
+ err = -ENOMEM;
+ goto err_delete_from_cache;
+ }
+ folio_add_lru(folio);
shmem_recalc_inode(inode, 1, 0);
- folio_unlock(folio);
+
return 0;
-out_delete_from_cache:
+
+err_delete_from_cache:
filemap_remove_folio(folio);
-out_release:
+err_unlock:
+ folio_unlock(folio);
+ return err;
+}
+
+static void shmem_mfill_filemap_remove(struct folio *folio,
+ struct vm_area_struct *vma)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+
+ filemap_remove_folio(folio);
+ shmem_recalc_inode(inode, 0, 0);
folio_unlock(folio);
- folio_put(folio);
-out_unacct_blocks:
- shmem_inode_unacct_blocks(inode, 1);
- return ret;
}
#endif /* CONFIG_USERFAULTFD */
@@ -5317,6 +5272,9 @@ static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
static const struct vm_uffd_ops shmem_uffd_ops = {
.can_userfault = shmem_can_userfault,
.get_folio_noalloc = shmem_get_folio_noalloc,
+ .alloc_folio = shmem_mfill_folio_alloc,
+ .filemap_add = shmem_mfill_filemap_add,
+ .filemap_remove = shmem_mfill_filemap_remove,
};
#endif
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index b3c12630769c..54aa195237ba 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -14,7 +14,6 @@
#include <linux/userfaultfd_k.h>
#include <linux/mmu_notifier.h>
#include <linux/hugetlb.h>
-#include <linux/shmem_fs.h>
#include <asm/tlbflush.h>
#include <asm/tlb.h>
#include "internal.h"
@@ -337,10 +336,10 @@ static bool mfill_file_over_size(struct vm_area_struct *dst_vma,
* This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem
* and anon, and for both shared and private VMAs.
*/
-int mfill_atomic_install_pte(pmd_t *dst_pmd,
- struct vm_area_struct *dst_vma,
- unsigned long dst_addr, struct page *page,
- bool newly_allocated, uffd_flags_t flags)
+static int mfill_atomic_install_pte(pmd_t *dst_pmd,
+ struct vm_area_struct *dst_vma,
+ unsigned long dst_addr, struct page *page,
+ uffd_flags_t flags)
{
int ret;
struct mm_struct *dst_mm = dst_vma->vm_mm;
@@ -384,9 +383,6 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
goto out_unlock;
if (page_in_cache) {
- /* Usually, cache pages are already added to LRU */
- if (newly_allocated)
- folio_add_lru(folio);
folio_add_file_rmap_pte(folio, page, dst_vma);
} else {
folio_add_new_anon_rmap(folio, dst_vma, dst_addr, RMAP_EXCLUSIVE);
@@ -401,6 +397,9 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+ if (page_in_cache)
+ folio_unlock(folio);
+
/* No need to invalidate - it was non-present before */
update_mmu_cache(dst_vma, dst_addr, dst_pte);
ret = 0;
@@ -507,13 +506,22 @@ static int __mfill_atomic_pte(struct mfill_state *state,
*/
__folio_mark_uptodate(folio);
+ if (ops->filemap_add) {
+ ret = ops->filemap_add(folio, state->vma, state->dst_addr);
+ if (ret)
+ goto err_folio_put;
+ }
+
ret = mfill_atomic_install_pte(state->pmd, state->vma, dst_addr,
- &folio->page, true, flags);
+ &folio->page, flags);
if (ret)
- goto err_folio_put;
+ goto err_filemap_remove;
return 0;
+err_filemap_remove:
+ if (ops->filemap_remove)
+ ops->filemap_remove(folio, state->vma);
err_folio_put:
folio_put(folio);
/* Don't return -ENOENT so that our caller won't retry */
@@ -526,6 +534,18 @@ static int mfill_atomic_pte_copy(struct mfill_state *state)
{
const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
+ /*
+ * The normal page fault path for a MAP_PRIVATE mapping in a
+ * file-backed VMA will invoke the fault, fill the hole in the file and
+ * COW it right away. The result generates plain anonymous memory.
+ * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll
+ * generate anonymous memory directly without actually filling the
+ * hole. For the MAP_PRIVATE case the robustness check only happens in
+ * the pagetable (to verify it's still none) and not in the page cache.
+ */
+ if (!(state->vma->vm_flags & VM_SHARED))
+ ops = &anon_uffd_ops;
+
return __mfill_atomic_pte(state, ops);
}
@@ -545,7 +565,8 @@ static int mfill_atomic_pte_zeropage(struct mfill_state *state)
spinlock_t *ptl;
int ret;
- if (mm_forbids_zeropage(dst_vma->vm_mm))
+ if (mm_forbids_zeropage(dst_vma->vm_mm) ||
+ (dst_vma->vm_flags & VM_SHARED))
return mfill_atomic_pte_zeroed_folio(state);
_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
@@ -600,11 +621,10 @@ static int mfill_atomic_pte_continue(struct mfill_state *state)
}
ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
- page, false, flags);
+ page, flags);
if (ret)
goto out_release;
- folio_unlock(folio);
return 0;
out_release:
@@ -827,41 +847,18 @@ extern ssize_t mfill_atomic_hugetlb(struct userfaultfd_ctx *ctx,
static __always_inline ssize_t mfill_atomic_pte(struct mfill_state *state)
{
- struct vm_area_struct *dst_vma = state->vma;
- unsigned long src_addr = state->src_addr;
- unsigned long dst_addr = state->dst_addr;
- struct folio **foliop = &state->folio;
uffd_flags_t flags = state->flags;
- pmd_t *dst_pmd = state->pmd;
- ssize_t err;
if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
return mfill_atomic_pte_continue(state);
if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON))
return mfill_atomic_pte_poison(state);
+ if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY))
+ return mfill_atomic_pte_copy(state);
+ if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE))
+ return mfill_atomic_pte_zeropage(state);
- /*
- * The normal page fault path for a shmem will invoke the
- * fault, fill the hole in the file and COW it right away. The
- * result generates plain anonymous memory. So when we are
- * asked to fill an hole in a MAP_PRIVATE shmem mapping, we'll
- * generate anonymous memory directly without actually filling
- * the hole. For the MAP_PRIVATE case the robustness check
- * only happens in the pagetable (to verify it's still none)
- * and not in the radix tree.
- */
- if (!(dst_vma->vm_flags & VM_SHARED)) {
- if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY))
- err = mfill_atomic_pte_copy(state);
- else
- err = mfill_atomic_pte_zeropage(state);
- } else {
- err = shmem_mfill_atomic_pte(dst_pmd, dst_vma,
- dst_addr, src_addr,
- flags, foliop);
- }
-
- return err;
+ return -EOPNOTSUPP;
}
static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
--
2.51.0
^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH RFC 10/17] shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
2026-01-27 19:29 ` [PATCH RFC 10/17] shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops Mike Rapoport
@ 2026-02-03 17:40 ` Peter Xu
2026-02-08 10:35 ` Mike Rapoport
0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2026-02-03 17:40 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Tue, Jan 27, 2026 at 09:29:29PM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Add filemap_add() and filemap_remove() methods to vm_uffd_ops and use
> them in __mfill_atomic_pte() to add shmem folios to page cache and
> remove them in case of error.
>
> Implement these methods in shmem along with vm_uffd_ops->alloc_folio()
> and drop shmem_mfill_atomic_pte().
>
> Since userfaultfd now does not reference any functions from shmem, drop
> include of linux/shmem_fs.h from mm/userfaultfd.c
>
> mfill_atomic_install_pte() is not used anywhere outside of
> mm/userfaultfd, make it static.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
This patch looks like a real nice cleanup on its own, thanks Mike!
I had never really read into the shmem accounting, but after reading some
of the code now I don't see any issue with your change. We can also wait for
some shmem developers to double-check it. Comments inline below on what I
spotted.
>
> fixup
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
[unexpected lines can be removed here]
> ---
> include/linux/shmem_fs.h | 14 ----
> include/linux/userfaultfd_k.h | 20 +++--
> mm/shmem.c | 148 ++++++++++++----------------------
> mm/userfaultfd.c | 79 +++++++++---------
> 4 files changed, 106 insertions(+), 155 deletions(-)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index e2069b3179c4..754f17e5b53c 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -223,20 +223,6 @@ static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
>
> extern bool shmem_charge(struct inode *inode, long pages);
>
> -#ifdef CONFIG_USERFAULTFD
> -#ifdef CONFIG_SHMEM
> -extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
> - struct vm_area_struct *dst_vma,
> - unsigned long dst_addr,
> - unsigned long src_addr,
> - uffd_flags_t flags,
> - struct folio **foliop);
> -#else /* !CONFIG_SHMEM */
> -#define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \
> - src_addr, flags, foliop) ({ BUG(); 0; })
> -#endif /* CONFIG_SHMEM */
> -#endif /* CONFIG_USERFAULTFD */
> -
> /*
> * Used space is stored as unsigned 64-bit value in bytes but
> * quota core supports only signed 64-bit values so use that
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 4d8b879eed91..75d5b09f2560 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -97,6 +97,21 @@ struct vm_uffd_ops {
> */
> struct folio *(*alloc_folio)(struct vm_area_struct *vma,
> unsigned long addr);
> + /*
> + * Called during resolution of UFFDIO_COPY request.
> + * Should lock the folio and add it to VMA's page cache.
> + * Returns 0 on success, error code on failre.
failure
> + */
> + int (*filemap_add)(struct folio *folio, struct vm_area_struct *vma,
> + unsigned long addr);
> + /*
> + * Called during resolution of UFFDIO_COPY request on the error
> + * handling path.
> + * Should revert the operation of ->filemap_add().
> + * The folio should be unlocked, but the reference to it should not be
> + * dropped.
Might it be slightly misleading to explicitly mention this? The page cache
also holds references, and IIUC they need to be dropped there. But I get
your point about keeping the last refcount that comes from the allocation.
IMHO "should revert the operation of ->filemap_add()" is good enough and
accurately describes it.
> + */
> + void (*filemap_remove)(struct folio *folio, struct vm_area_struct *vma);
> };
>
> /* A combined operation mode + behavior flags. */
> @@ -130,11 +145,6 @@ static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_at
> /* Flags controlling behavior. These behavior changes are mode-independent. */
> #define MFILL_ATOMIC_WP MFILL_ATOMIC_FLAG(0)
>
> -extern int mfill_atomic_install_pte(pmd_t *dst_pmd,
> - struct vm_area_struct *dst_vma,
> - unsigned long dst_addr, struct page *page,
> - bool newly_allocated, uffd_flags_t flags);
> -
> extern ssize_t mfill_atomic_copy(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> unsigned long src_start, unsigned long len,
> uffd_flags_t flags);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 87cd8d2fdb97..6f0485f76cb8 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3169,118 +3169,73 @@ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap,
> #endif /* CONFIG_TMPFS_QUOTA */
>
> #ifdef CONFIG_USERFAULTFD
> -int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
> - struct vm_area_struct *dst_vma,
> - unsigned long dst_addr,
> - unsigned long src_addr,
> - uffd_flags_t flags,
> - struct folio **foliop)
> -{
> - struct inode *inode = file_inode(dst_vma->vm_file);
> - struct shmem_inode_info *info = SHMEM_I(inode);
> +static struct folio *shmem_mfill_folio_alloc(struct vm_area_struct *vma,
> + unsigned long addr)
> +{
> + struct inode *inode = file_inode(vma->vm_file);
> struct address_space *mapping = inode->i_mapping;
> + struct shmem_inode_info *info = SHMEM_I(inode);
> + pgoff_t pgoff = linear_page_index(vma, addr);
> gfp_t gfp = mapping_gfp_mask(mapping);
> - pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> - void *page_kaddr;
> struct folio *folio;
> - int ret;
> - pgoff_t max_off;
> -
> - if (shmem_inode_acct_blocks(inode, 1)) {
> - /*
> - * We may have got a page, returned -ENOENT triggering a retry,
> - * and now we find ourselves with -ENOMEM. Release the page, to
> - * avoid a BUG_ON in our caller.
> - */
> - if (unlikely(*foliop)) {
> - folio_put(*foliop);
> - *foliop = NULL;
> - }
> - return -ENOMEM;
> - }
>
> - if (!*foliop) {
> - ret = -ENOMEM;
> - folio = shmem_alloc_folio(gfp, 0, info, pgoff);
> - if (!folio)
> - goto out_unacct_blocks;
> + if (unlikely(pgoff >= DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE)))
> + return NULL;
>
> - if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) {
> - page_kaddr = kmap_local_folio(folio, 0);
> - /*
> - * The read mmap_lock is held here. Despite the
> - * mmap_lock being read recursive a deadlock is still
> - * possible if a writer has taken a lock. For example:
> - *
> - * process A thread 1 takes read lock on own mmap_lock
> - * process A thread 2 calls mmap, blocks taking write lock
> - * process B thread 1 takes page fault, read lock on own mmap lock
> - * process B thread 2 calls mmap, blocks taking write lock
> - * process A thread 1 blocks taking read lock on process B
> - * process B thread 1 blocks taking read lock on process A
> - *
> - * Disable page faults to prevent potential deadlock
> - * and retry the copy outside the mmap_lock.
> - */
> - pagefault_disable();
> - ret = copy_from_user(page_kaddr,
> - (const void __user *)src_addr,
> - PAGE_SIZE);
> - pagefault_enable();
> - kunmap_local(page_kaddr);
> -
> - /* fallback to copy_from_user outside mmap_lock */
> - if (unlikely(ret)) {
> - *foliop = folio;
> - ret = -ENOENT;
> - /* don't free the page */
> - goto out_unacct_blocks;
> - }
> + folio = shmem_alloc_folio(gfp, 0, info, pgoff);
> + if (!folio)
> + return NULL;
>
> - flush_dcache_folio(folio);
> - } else { /* ZEROPAGE */
> - clear_user_highpage(&folio->page, dst_addr);
> - }
> - } else {
> - folio = *foliop;
> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> - *foliop = NULL;
> + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) {
> + folio_put(folio);
> + return NULL;
> }
>
> - VM_BUG_ON(folio_test_locked(folio));
> - VM_BUG_ON(folio_test_swapbacked(folio));
> + return folio;
> +}
> +
> +static int shmem_mfill_filemap_add(struct folio *folio,
> + struct vm_area_struct *vma,
> + unsigned long addr)
> +{
> + struct inode *inode = file_inode(vma->vm_file);
> + struct address_space *mapping = inode->i_mapping;
> + pgoff_t pgoff = linear_page_index(vma, addr);
> + gfp_t gfp = mapping_gfp_mask(mapping);
> + int err;
> +
> __folio_set_locked(folio);
> __folio_set_swapbacked(folio);
> - __folio_mark_uptodate(folio);
> -
> - ret = -EFAULT;
> - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> - if (unlikely(pgoff >= max_off))
> - goto out_release;
>
> - ret = mem_cgroup_charge(folio, dst_vma->vm_mm, gfp);
> - if (ret)
> - goto out_release;
> - ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
> - if (ret)
> - goto out_release;
> + err = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
> + if (err)
> + goto err_unlock;
>
> - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
> - &folio->page, true, flags);
> - if (ret)
> - goto out_delete_from_cache;
> + if (shmem_inode_acct_blocks(inode, 1)) {
We used to do this early, before allocation; IOW, I think we still have the
option to leave this to the alloc_folio() hook. However, I don't see an issue
with keeping it in filemap_add() either. Maybe this movement should be spelled
out in the commit message anyway, explaining how the decision was made.
IIUC it's indeed safe we move this acct_blocks() here, I even see Hugh
mentioned such in an older commit 3022fd7af96, but Hugh left uffd alone at
that time:
Userfaultfd is a foreign country: they do things differently there, and
for good reason - to avoid mmap_lock deadlock. Leave ordering in
shmem_mfill_atomic_pte() untouched for now, but I would rather like to
mesh it better with shmem_get_folio_gfp() in the future.
I'm not sure if that's also what you wanted to do - to make the userfaultfd
code work similarly to what shmem_alloc_and_add_folio() does right now.
Maybe you want to mention that too somewhere in the commit log when posting
a formal patch.
One thing not directly relevant: shmem_alloc_and_add_folio() also does a
proper recalc of the inode allocation info when acct_blocks() fails. But
if that's a problem, it's pre-existing for userfaultfd, so IIUC we can
also leave it alone until someone (maybe a quota user) complains about shmem
allocation failures on UFFDIO_COPY.. It's just that a similar problem seems
to exist here in the userfaultfd path.
> + err = -ENOMEM;
> + goto err_delete_from_cache;
> + }
>
> + folio_add_lru(folio);
This change is pretty separate from the rest of the work, but looks correct
to me: IIUC we moved the lru add earlier now, and it should be safe as long as
we're holding the folio lock all through the process, and folio_put()
(ultimately, __page_cache_release()) will always properly undo the lru change.
Please help double-check whether my understanding is correct.
> shmem_recalc_inode(inode, 1, 0);
> - folio_unlock(folio);
> +
> return 0;
> -out_delete_from_cache:
> +
> +err_delete_from_cache:
> filemap_remove_folio(folio);
> -out_release:
> +err_unlock:
> + folio_unlock(folio);
> + return err;
> +}
> +
> +static void shmem_mfill_filemap_remove(struct folio *folio,
> + struct vm_area_struct *vma)
> +{
> + struct inode *inode = file_inode(vma->vm_file);
> +
> + filemap_remove_folio(folio);
> + shmem_recalc_inode(inode, 0, 0);
> folio_unlock(folio);
> - folio_put(folio);
> -out_unacct_blocks:
> - shmem_inode_unacct_blocks(inode, 1);
This looks wrong, or maybe I miss somewhere we did the unacct_blocks()?
> - return ret;
> }
> #endif /* CONFIG_USERFAULTFD */
>
> @@ -5317,6 +5272,9 @@ static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
> static const struct vm_uffd_ops shmem_uffd_ops = {
> .can_userfault = shmem_can_userfault,
> .get_folio_noalloc = shmem_get_folio_noalloc,
> + .alloc_folio = shmem_mfill_folio_alloc,
> + .filemap_add = shmem_mfill_filemap_add,
> + .filemap_remove = shmem_mfill_filemap_remove,
> };
> #endif
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index b3c12630769c..54aa195237ba 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -14,7 +14,6 @@
> #include <linux/userfaultfd_k.h>
> #include <linux/mmu_notifier.h>
> #include <linux/hugetlb.h>
> -#include <linux/shmem_fs.h>
> #include <asm/tlbflush.h>
> #include <asm/tlb.h>
> #include "internal.h"
> @@ -337,10 +336,10 @@ static bool mfill_file_over_size(struct vm_area_struct *dst_vma,
> * This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem
> * and anon, and for both shared and private VMAs.
> */
> -int mfill_atomic_install_pte(pmd_t *dst_pmd,
> - struct vm_area_struct *dst_vma,
> - unsigned long dst_addr, struct page *page,
> - bool newly_allocated, uffd_flags_t flags)
> +static int mfill_atomic_install_pte(pmd_t *dst_pmd,
> + struct vm_area_struct *dst_vma,
> + unsigned long dst_addr, struct page *page,
> + uffd_flags_t flags)
> {
> int ret;
> struct mm_struct *dst_mm = dst_vma->vm_mm;
> @@ -384,9 +383,6 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
> goto out_unlock;
>
> if (page_in_cache) {
> - /* Usually, cache pages are already added to LRU */
> - if (newly_allocated)
> - folio_add_lru(folio);
> folio_add_file_rmap_pte(folio, page, dst_vma);
> } else {
> folio_add_new_anon_rmap(folio, dst_vma, dst_addr, RMAP_EXCLUSIVE);
> @@ -401,6 +397,9 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
>
> set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
>
> + if (page_in_cache)
> + folio_unlock(folio);
Nitpick: another small change that looks correct, but IMHO it would be nice to
either make it a small separate patch or mention it in the commit message.
> +
> /* No need to invalidate - it was non-present before */
> update_mmu_cache(dst_vma, dst_addr, dst_pte);
> ret = 0;
> @@ -507,13 +506,22 @@ static int __mfill_atomic_pte(struct mfill_state *state,
> */
> __folio_mark_uptodate(folio);
>
> + if (ops->filemap_add) {
> + ret = ops->filemap_add(folio, state->vma, state->dst_addr);
> + if (ret)
> + goto err_folio_put;
> + }
> +
> ret = mfill_atomic_install_pte(state->pmd, state->vma, dst_addr,
> - &folio->page, true, flags);
> + &folio->page, flags);
> if (ret)
> - goto err_folio_put;
> + goto err_filemap_remove;
>
> return 0;
>
> +err_filemap_remove:
> + if (ops->filemap_remove)
> + ops->filemap_remove(folio, state->vma);
> err_folio_put:
> folio_put(folio);
> /* Don't return -ENOENT so that our caller won't retry */
> @@ -526,6 +534,18 @@ static int mfill_atomic_pte_copy(struct mfill_state *state)
> {
> const struct vm_uffd_ops *ops = vma_uffd_ops(state->vma);
>
> + /*
> + * The normal page fault path for a MAP_PRIVATE mapping in a
> + * file-backed VMA will invoke the fault, fill the hole in the file and
> + * COW it right away. The result generates plain anonymous memory.
> + * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll
> + * generate anonymous memory directly without actually filling the
> + * hole. For the MAP_PRIVATE case the robustness check only happens in
> + * the pagetable (to verify it's still none) and not in the page cache.
> + */
> + if (!(state->vma->vm_flags & VM_SHARED))
> + ops = &anon_uffd_ops;
> +
> return __mfill_atomic_pte(state, ops);
> }
>
> @@ -545,7 +565,8 @@ static int mfill_atomic_pte_zeropage(struct mfill_state *state)
> spinlock_t *ptl;
> int ret;
>
> - if (mm_forbids_zeropage(dst_vma->vm_mm))
> + if (mm_forbids_zeropage(dst_vma->vm_mm) ||
> + (dst_vma->vm_flags & VM_SHARED))
> return mfill_atomic_pte_zeroed_folio(state);
>
> _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
> @@ -600,11 +621,10 @@ static int mfill_atomic_pte_continue(struct mfill_state *state)
> }
>
> ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
> - page, false, flags);
> + page, flags);
> if (ret)
> goto out_release;
>
> - folio_unlock(folio);
> return 0;
>
> out_release:
> @@ -827,41 +847,18 @@ extern ssize_t mfill_atomic_hugetlb(struct userfaultfd_ctx *ctx,
>
> static __always_inline ssize_t mfill_atomic_pte(struct mfill_state *state)
> {
> - struct vm_area_struct *dst_vma = state->vma;
> - unsigned long src_addr = state->src_addr;
> - unsigned long dst_addr = state->dst_addr;
> - struct folio **foliop = &state->folio;
> uffd_flags_t flags = state->flags;
> - pmd_t *dst_pmd = state->pmd;
> - ssize_t err;
>
> if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> return mfill_atomic_pte_continue(state);
> if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON))
> return mfill_atomic_pte_poison(state);
> + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY))
> + return mfill_atomic_pte_copy(state);
> + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE))
> + return mfill_atomic_pte_zeropage(state);
>
> - /*
> - * The normal page fault path for a shmem will invoke the
> - * fault, fill the hole in the file and COW it right away. The
> - * result generates plain anonymous memory. So when we are
> - * asked to fill an hole in a MAP_PRIVATE shmem mapping, we'll
> - * generate anonymous memory directly without actually filling
> - * the hole. For the MAP_PRIVATE case the robustness check
> - * only happens in the pagetable (to verify it's still none)
> - * and not in the radix tree.
> - */
> - if (!(dst_vma->vm_flags & VM_SHARED)) {
> - if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY))
> - err = mfill_atomic_pte_copy(state);
> - else
> - err = mfill_atomic_pte_zeropage(state);
> - } else {
> - err = shmem_mfill_atomic_pte(dst_pmd, dst_vma,
> - dst_addr, src_addr,
> - flags, foliop);
> - }
It's great to merge these otherwise.
Thanks!
> -
> - return err;
> + return -EOPNOTSUPP;
> }
>
> static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
> --
> 2.51.0
>
--
Peter Xu
^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH RFC 10/17] shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
2026-02-03 17:40 ` Peter Xu
@ 2026-02-08 10:35 ` Mike Rapoport
2026-02-11 20:00 ` Peter Xu
0 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2026-02-08 10:35 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Tue, Feb 03, 2026 at 12:40:26PM -0500, Peter Xu wrote:
> On Tue, Jan 27, 2026 at 09:29:29PM +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > Add filemap_add() and filemap_remove() methods to vm_uffd_ops and use
> > them in __mfill_atomic_pte() to add shmem folios to page cache and
> > remove them in case of error.
> >
> > Implement these methods in shmem along with vm_uffd_ops->alloc_folio()
> > and drop shmem_mfill_atomic_pte().
> >
> > Since userfaultfd now does not reference any functions from shmem, drop
> > include of linux/shmem_fs.h from mm/userfaultfd.c
> >
> > mfill_atomic_install_pte() is not used anywhere outside of
> > mm/userfaultfd, make it static.
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>
> This patch looks like a real nice cleanup on its own, thanks Mike!
>
> I guess I never tried to read into shmem accountings, now after I read some
> of the codes I don't see any issue with your change. We can also wait for
> some shmem developers double check those. Comments inline below on
> something I spot.
>
> >
> > fixup
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>
> [unexpected lines can be removed here]
Sure :)
> > ---
> > include/linux/shmem_fs.h | 14 ----
> > include/linux/userfaultfd_k.h | 20 +++--
> > mm/shmem.c | 148 ++++++++++++----------------------
> > mm/userfaultfd.c | 79 +++++++++---------
> > 4 files changed, 106 insertions(+), 155 deletions(-)
> >
> > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> > index e2069b3179c4..754f17e5b53c 100644
> > --- a/include/linux/shmem_fs.h
> > +++ b/include/linux/shmem_fs.h
> > @@ -97,6 +97,21 @@ struct vm_uffd_ops {
> > */
> > struct folio *(*alloc_folio)(struct vm_area_struct *vma,
> > unsigned long addr);
> > + /*
> > + * Called during resolution of UFFDIO_COPY request.
> > + * Should lock the folio and add it to VMA's page cache.
> > + * Returns 0 on success, error code on failre.
>
> failure
Thanks, will fix.
> > + */
> > + int (*filemap_add)(struct folio *folio, struct vm_area_struct *vma,
> > + unsigned long addr);
> > + /*
> > + * Called during resolution of UFFDIO_COPY request on the error
> > + * handling path.
> > + * Should revert the operation of ->filemap_add().
> > + * The folio should be unlocked, but the reference to it should not be
> > + * dropped.
>
> Might be slightly misleading to explicitly mention this? As page cache
> also holds references and IIUC they need to be dropped there. But I get
> your point, on keeping the last refcount due to allocation.
>
> IMHO the "should revert the operation of ->filemap_add()" is good enough
> and accurately describes it.
Yeah, sounds good.
> > + */
> > + void (*filemap_remove)(struct folio *folio, struct vm_area_struct *vma);
> > };
> >
> > /* A combined operation mode + behavior flags. */
...
> > +static int shmem_mfill_filemap_add(struct folio *folio,
> > + struct vm_area_struct *vma,
> > + unsigned long addr)
> > +{
> > + struct inode *inode = file_inode(vma->vm_file);
> > + struct address_space *mapping = inode->i_mapping;
> > + pgoff_t pgoff = linear_page_index(vma, addr);
> > + gfp_t gfp = mapping_gfp_mask(mapping);
> > + int err;
> > +
> > __folio_set_locked(folio);
> > __folio_set_swapbacked(folio);
> > - __folio_mark_uptodate(folio);
> > -
> > - ret = -EFAULT;
> > - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > - if (unlikely(pgoff >= max_off))
> > - goto out_release;
> >
> > - ret = mem_cgroup_charge(folio, dst_vma->vm_mm, gfp);
> > - if (ret)
> > - goto out_release;
> > - ret = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
> > - if (ret)
> > - goto out_release;
> > + err = shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp);
> > + if (err)
> > + goto err_unlock;
> >
> > - ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
> > - &folio->page, true, flags);
> > - if (ret)
> > - goto out_delete_from_cache;
> > + if (shmem_inode_acct_blocks(inode, 1)) {
>
> We used to do this early before allocation, IOW, I think we still have an
> option to leave this to alloc_folio() hook. However I don't see an issue
> either keeping it in filemap_add(). Maybe this movement should better be
> spelled out in the commit message anyway on how this decision is made.
>
> IIUC it's indeed safe we move this acct_blocks() here, I even see Hugh
> mentioned such in an older commit 3022fd7af96, but Hugh left uffd alone at
> that time:
>
> Userfaultfd is a foreign country: they do things differently there, and
> for good reason - to avoid mmap_lock deadlock. Leave ordering in
> shmem_mfill_atomic_pte() untouched for now, but I would rather like to
> mesh it better with shmem_get_folio_gfp() in the future.
>
> I'm not sure if that's also what you wanted to do - to make userfaultfd
> code work similarly like what shmem_alloc_and_add_folio() does right now.
> Maybe you want to mention that too somewhere in the commit log when posting
> a formal patch.
>
> One thing not directly relevant is, shmem_alloc_and_add_folio() also does
> proper recalc of inode allocation info when acct_blocks() fails here. But
> if that's a problem, that's pre-existing for userfaultfd, so IIUC we can
> also leave it alone until someone (maybe quota user) complains about shmem
> allocation failures on UFFDIO_COPY.. It's just that it looks similar
> problem here in userfaultfd path.
I actually wanted to have the ordering as close as possible to
shmem_alloc_and_add_folio(); that's the first reason for moving acct_blocks
to ->filemap_add().
Another reason is that it simplifies the rollback in case of a failure, as
shmem_recalc_inode(inode, 0, 0) in ->filemap_remove() takes care of the
block accounting as well.
> > + err = -ENOMEM;
> > + goto err_delete_from_cache;
> > + }
> >
> > + folio_add_lru(folio);
>
> This change is pretty separate from the work, but looks correct to me: IIUC
> we moved the lru add earlier now, and it should be safe as long as we're
> holding folio lock all through the process, and folio_put() (ultimately,
> __page_cache_release()) will always properly undo the lru change. Please
> help double check if my understanding is correct.
This follows shmem_alloc_and_add_folio(), and my understanding as well is that
this is safe as long as we hold the folio lock.
> > +static void shmem_mfill_filemap_remove(struct folio *folio,
> > + struct vm_area_struct *vma)
> > +{
> > + struct inode *inode = file_inode(vma->vm_file);
> > +
> > + filemap_remove_folio(folio);
> > + shmem_recalc_inode(inode, 0, 0);
> > folio_unlock(folio);
> > - folio_put(folio);
> > -out_unacct_blocks:
> > - shmem_inode_unacct_blocks(inode, 1);
>
> This looks wrong, or maybe I miss somewhere we did the unacct_blocks()?
This is handled by shmem_recalc_inode(inode, 0, 0).
> > @@ -401,6 +397,9 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
> >
> > set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> >
> > + if (page_in_cache)
> > + folio_unlock(folio);
>
> Nitpick: another small change that looks correct, but IMHO would be nice to
> either make it a small separate patch, or mention in the commit message.
I'll address this in the commit log.
> > +
> > /* No need to invalidate - it was non-present before */
> > update_mmu_cache(dst_vma, dst_addr, dst_pte);
> > ret = 0;
--
Sincerely yours,
Mike.
^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH RFC 10/17] shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
2026-02-08 10:35 ` Mike Rapoport
@ 2026-02-11 20:00 ` Peter Xu
2026-02-15 17:45 ` Mike Rapoport
0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2026-02-11 20:00 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Sun, Feb 08, 2026 at 12:35:43PM +0200, Mike Rapoport wrote:
> > > +static void shmem_mfill_filemap_remove(struct folio *folio,
> > > + struct vm_area_struct *vma)
> > > +{
> > > + struct inode *inode = file_inode(vma->vm_file);
> > > +
> > > + filemap_remove_folio(folio);
> > > + shmem_recalc_inode(inode, 0, 0);
> > > folio_unlock(folio);
> > > - folio_put(folio);
> > > -out_unacct_blocks:
> > > - shmem_inode_unacct_blocks(inode, 1);
> >
> > This looks wrong, or maybe I miss somewhere we did the unacct_blocks()?
>
> This is handled by shmem_recalc_inode(inode, 0, 0).
IIUC shmem_recalc_inode() only fixes up shmem_inode_info for a possibly
changed inode->i_mapping->nrpages. It's not for reverting the
accounting in the failure paths here.
OTOH, we still need to maintain the rest of the accounting by correctly
invoking shmem_inode_unacct_blocks(). One thing we can try is
testing this series against either shmem quota support (since 2023, IIUC
it's relevant to the "quota" mount option) or the max_blocks accounting (IIUC,
the "size" mount option), etc. Any of those should show a difference if my
understanding is correct.
So IIUC we still need the unacct_blocks(); please kindly help double-check.
Thanks,
--
Peter Xu
^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH RFC 10/17] shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
2026-02-11 20:00 ` Peter Xu
@ 2026-02-15 17:45 ` Mike Rapoport
2026-02-18 21:45 ` Peter Xu
0 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2026-02-15 17:45 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Wed, Feb 11, 2026 at 03:00:58PM -0500, Peter Xu wrote:
> On Sun, Feb 08, 2026 at 12:35:43PM +0200, Mike Rapoport wrote:
> > > > +static void shmem_mfill_filemap_remove(struct folio *folio,
> > > > + struct vm_area_struct *vma)
> > > > +{
> > > > + struct inode *inode = file_inode(vma->vm_file);
> > > > +
> > > > + filemap_remove_folio(folio);
> > > > + shmem_recalc_inode(inode, 0, 0);
> > > > folio_unlock(folio);
> > > > - folio_put(folio);
> > > > -out_unacct_blocks:
> > > > - shmem_inode_unacct_blocks(inode, 1);
> > >
> > > This looks wrong, or maybe I miss somewhere we did the unacct_blocks()?
> >
> > This is handled by shmem_recalc_inode(inode, 0, 0).
>
> IIUC shmem_recalc_inode() only does the fixup of shmem_inode_info over
> possiblly changing inode->i_mapping->nrpages. It's not for reverting the
> accounting in the failure paths here.
>
> OTOH, we still need to maintain accounting for the rest things with
> correctly invoke shmem_inode_unacct_blocks(). One thing we can try is
> testing this series against either shmem quota support (since 2023, IIUC
> it's relevant to "quota" mount option), or max_blocks accountings (IIUC,
> "size" mount option), etc. Any of those should reflect a difference if my
> understanding is correct.
>
> So IIUC we still need the unacct_blocks(), please kindly help double check.
I followed the shmem_get_folio_gfp() error handling, and unless I missed
something we should have the same sequence for uffd.
In shmem_mfill_filemap_add() we increment both i_mapping->nrpages and
info->alloced, in shmem_add_to_page_cache() and
shmem_recalc_inode(inode, 1, 0) respectively.
Then in shmem_mfill_filemap_remove() the call to filemap_remove_folio()
decrements i_mapping->nrpages, and shmem_recalc_inode(inode, 0, 0) will see
freed=1 and will call shmem_inode_unacct_blocks().
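In code form, a condensed view of the two hooks' accounting from this patch
(the comments paraphrase the behaviour described above rather than quoting
mm/shmem.c verbatim):

	/* shmem_mfill_filemap_add(), success path */
	shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp); /* nrpages++ */
	shmem_inode_acct_blocks(inode, 1);                         /* charge one block */
	folio_add_lru(folio);
	shmem_recalc_inode(inode, 1, 0);                           /* info->alloced++ */

	/* shmem_mfill_filemap_remove(), error rollback */
	filemap_remove_folio(folio);     /* nrpages-- */
	shmem_recalc_inode(inode, 0, 0); /* alloced - swapped - nrpages == 1, so this
					    both decrements info->alloced and calls
					    shmem_inode_unacct_blocks() */
	folio_unlock(folio);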
> Thanks,
>
> --
> Peter Xu
>
--
Sincerely yours,
Mike.
^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH RFC 10/17] shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
2026-02-15 17:45 ` Mike Rapoport
@ 2026-02-18 21:45 ` Peter Xu
0 siblings, 0 replies; 41+ messages in thread
From: Peter Xu @ 2026-02-18 21:45 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Sun, Feb 15, 2026 at 07:45:04PM +0200, Mike Rapoport wrote:
> On Wed, Feb 11, 2026 at 03:00:58PM -0500, Peter Xu wrote:
> > On Sun, Feb 08, 2026 at 12:35:43PM +0200, Mike Rapoport wrote:
> > > > > +static void shmem_mfill_filemap_remove(struct folio *folio,
> > > > > + struct vm_area_struct *vma)
> > > > > +{
> > > > > + struct inode *inode = file_inode(vma->vm_file);
> > > > > +
> > > > > + filemap_remove_folio(folio);
> > > > > + shmem_recalc_inode(inode, 0, 0);
> > > > > folio_unlock(folio);
> > > > > - folio_put(folio);
> > > > > -out_unacct_blocks:
> > > > > - shmem_inode_unacct_blocks(inode, 1);
> > > >
> > > > This looks wrong, or maybe I miss somewhere we did the unacct_blocks()?
> > >
> > > This is handled by shmem_recalc_inode(inode, 0, 0).
> >
> > IIUC shmem_recalc_inode() only does the fixup of shmem_inode_info over
> > possiblly changing inode->i_mapping->nrpages. It's not for reverting the
> > accounting in the failure paths here.
> >
> > OTOH, we still need to maintain accounting for the rest things with
> > correctly invoke shmem_inode_unacct_blocks(). One thing we can try is
> > testing this series against either shmem quota support (since 2023, IIUC
> > it's relevant to "quota" mount option), or max_blocks accountings (IIUC,
> > "size" mount option), etc. Any of those should reflect a difference if my
> > understanding is correct.
> >
> > So IIUC we still need the unacct_blocks(), please kindly help double check.
>
> I followed shmem_get_folio_gfp() error handling, and unless I missed
> something we should have the same sequence with uffd.
>
> In shmem_mfill_filemap_add() we increment both i_mapping->nrpages and
> info->alloced in shmem_add_to_page_cache() and
> shmem_recalc_inode(inode, 1, 0) respectively.
>
> Then in shmem_filemap_remove() the call to filemap_remove_folio()
> decrements i_mapping->nrpages and shmem_recalc_inode(inode, 0, 0) will see
> freed=1 and will call shmem_inode_unacct_blocks().
You're correct. I guess I was misled by the comments above
shmem_recalc_inode() when reading this part, assuming it's only for the
cases where nrpages changed under the hood.. :)
I believe we need shmem_recalc_inode(inode, 0, 0) to make sure
info->alloced is properly decremented; explicit shmem_inode_unacct_blocks()
calls would otherwise miss that, due to the reordering of the shmem
accounting in this patch.
It's slightly tricky to use these functions; I wonder if we want to
mention them in the commit log, but I'm OK either way.
Thanks for double checking!
--
Peter Xu
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH RFC 11/17] userfaultfd: mfill_atomic() remove retry logic
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Mike Rapoport
` (9 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 10/17] shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-01-27 19:29 ` [PATCH RFC 12/17] mm: introduce VM_FAULT_UFFD_MINOR fault reason Mike Rapoport
` (6 subsequent siblings)
17 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Since __mfill_atomic_pte() handles the retry for both anonymous and
shmem, there is no need to retry copying the data from userspace in
the loop in mfill_atomic().
Drop the retry logic from mfill_atomic().
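After the removal, the per-page body of the loop in mfill_atomic() boils
down to roughly the following (a paraphrased sketch based on the context
lines of this diff, not the exact resulting code):

	err = mfill_atomic_pte(&state);
	cond_resched();

	if (!err) {
		state.dst_addr += PAGE_SIZE;
		state.src_addr += PAGE_SIZE;
		copied += PAGE_SIZE;
	}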
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
mm/userfaultfd.c | 24 ------------------------
1 file changed, 24 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 54aa195237ba..1bd7631463c6 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -29,7 +29,6 @@ struct mfill_state {
struct vm_area_struct *vma;
unsigned long src_addr;
unsigned long dst_addr;
- struct folio *folio;
pmd_t *pmd;
};
@@ -889,7 +888,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
VM_WARN_ON_ONCE(src_start + len <= src_start);
VM_WARN_ON_ONCE(dst_start + len <= dst_start);
-retry:
err = mfill_get_vma(&state);
if (err)
goto out;
@@ -916,26 +914,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
err = mfill_atomic_pte(&state);
cond_resched();
- if (unlikely(err == -ENOENT)) {
- void *kaddr;
-
- mfill_put_vma(&state);
- VM_WARN_ON_ONCE(!state.folio);
-
- kaddr = kmap_local_folio(state.folio, 0);
- err = copy_from_user(kaddr,
- (const void __user *)state.src_addr,
- PAGE_SIZE);
- kunmap_local(kaddr);
- if (unlikely(err)) {
- err = -EFAULT;
- goto out;
- }
- flush_dcache_folio(state.folio);
- goto retry;
- } else
- VM_WARN_ON_ONCE(state.folio);
-
if (!err) {
state.dst_addr += PAGE_SIZE;
state.src_addr += PAGE_SIZE;
@@ -951,8 +929,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
if (state.vma)
mfill_put_vma(&state);
out:
- if (state.folio)
- folio_put(state.folio);
VM_WARN_ON_ONCE(copied < 0);
VM_WARN_ON_ONCE(err > 0);
VM_WARN_ON_ONCE(!copied && !err);
--
2.51.0
^ permalink raw reply	[flat|nested] 41+ messages in thread
* [PATCH RFC 12/17] mm: introduce VM_FAULT_UFFD_MINOR fault reason
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Mike Rapoport
` (10 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 11/17] userfaultfd: mfill_atomic() remove retry logic Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-01-27 19:29 ` [PATCH RFC 13/17] mm: introduce VM_FAULT_UFFD_MISSING " Mike Rapoport
` (5 subsequent siblings)
17 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest,
David Hildenbrand (Red Hat)
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
When a VMA is registered with userfaultfd in minor mode, its ->fault()
method should check whether a folio exists in the page cache and, if it
does, call handle_userfault(VM_UFFD_MINOR).
Instead of calling handle_userfault() directly from a specific ->fault()
implementation, introduce a new fault reason VM_FAULT_UFFD_MINOR that
notifies the core page fault handler that it should call
handle_userfault(VM_UFFD_MINOR) to complete the page fault.
Replace the call to handle_userfault(VM_UFFD_MINOR) in shmem with the new
VM_FAULT_UFFD_MINOR.
For configurations that don't enable CONFIG_USERFAULTFD,
VM_FAULT_UFFD_MINOR is set to 0.
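As an illustration of the intended usage, a minimal sketch of a ->fault()
handler signalling a minor fault (example_fault() and its plain filemap
lookup are only illustrative, they are not part of this series):

static vm_fault_t example_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct folio *folio;

	if (userfaultfd_minor(vma)) {
		folio = filemap_get_folio(vma->vm_file->f_mapping, vmf->pgoff);
		if (!IS_ERR(folio)) {
			folio_put(folio);
			/* __do_fault() turns this into handle_userfault(VM_UFFD_MINOR) */
			return VM_FAULT_UFFD_MINOR;
		}
	}

	/* otherwise resolve the fault as usual */
	return filemap_fault(vmf);
}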
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/mm_types.h | 10 +++++++++-
mm/memory.c | 5 ++++-
mm/shmem.c | 2 +-
3 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 42af2292951d..b25ac322bfbf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1555,6 +1555,8 @@ typedef __bitwise unsigned int vm_fault_t;
* fsync() to complete (for synchronous page faults
* in DAX)
* @VM_FAULT_COMPLETED: ->fault completed, meanwhile mmap lock released
+ * @VM_FAULT_UFFD_MINOR: ->fault did not modify page tables and needs
+ * handle_userfault(VM_UFFD_MINOR) to complete
* @VM_FAULT_HINDEX_MASK: mask HINDEX value
*
*/
@@ -1572,6 +1574,11 @@ enum vm_fault_reason {
VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000,
VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000,
VM_FAULT_COMPLETED = (__force vm_fault_t)0x004000,
+#ifdef CONFIG_USERFAULTFD
+ VM_FAULT_UFFD_MINOR = (__force vm_fault_t)0x008000,
+#else
+ VM_FAULT_UFFD_MINOR = (__force vm_fault_t)0x000000,
+#endif
VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000,
};
@@ -1596,7 +1603,8 @@ enum vm_fault_reason {
{ VM_FAULT_FALLBACK, "FALLBACK" }, \
{ VM_FAULT_DONE_COW, "DONE_COW" }, \
{ VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \
- { VM_FAULT_COMPLETED, "COMPLETED" }
+ { VM_FAULT_COMPLETED, "COMPLETED" }, \
+ { VM_FAULT_UFFD_MINOR, "UFFD_MINOR" }
struct vm_special_mapping {
const char *name; /* The name, e.g. "[vdso]". */
diff --git a/mm/memory.c b/mm/memory.c
index 2a55edc48a65..fcb3e0c3113e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5319,8 +5319,11 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
ret = vma->vm_ops->fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
- VM_FAULT_DONE_COW)))
+ VM_FAULT_DONE_COW | VM_FAULT_UFFD_MINOR))) {
+ if (ret & VM_FAULT_UFFD_MINOR)
+ return handle_userfault(vmf, VM_UFFD_MINOR);
return ret;
+ }
folio = page_folio(vmf->page);
if (unlikely(PageHWPoison(vmf->page))) {
diff --git a/mm/shmem.c b/mm/shmem.c
index 6f0485f76cb8..6aa905147c0c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2481,7 +2481,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
if (folio && vma && userfaultfd_minor(vma)) {
if (!xa_is_value(folio))
folio_put(folio);
- *fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
+ *fault_type = VM_FAULT_UFFD_MINOR;
return 0;
}
--
2.51.0
* [PATCH RFC 13/17] mm: introduce VM_FAULT_UFFD_MISSING fault reason
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Mike Rapoport
` (11 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 12/17] mm: introduce VM_FAULT_UFFD_MINOR fault reason Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-01-27 19:29 ` [PATCH RFC 14/17] KVM: guest_memfd: implement userfaultfd minor mode Mike Rapoport
` (4 subsequent siblings)
17 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: Nikita Kalyazin <kalyazin@amazon.com>
When a VMA is registered with userfaultfd in missing mode, its ->fault()
method should check whether a folio exists in the page cache and, if it
does not, call handle_userfault(VM_UFFD_MISSING).
Instead of calling handle_userfault() directly from a specific ->fault()
implementation, introduce a new fault reason VM_FAULT_UFFD_MISSING that
notifies the core page fault handler that it should call
handle_userfault(VM_UFFD_MISSING) to complete the page fault.
Replace the call to handle_userfault(VM_UFFD_MISSING) in shmem with the
new VM_FAULT_UFFD_MISSING.
For configurations that don't enable CONFIG_USERFAULTFD,
VM_FAULT_UFFD_MISSING is set to 0.
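One detail worth spelling out (my reading, not stated explicitly in the
patch): defining the flag as 0 for !CONFIG_USERFAULTFD keeps the core
handler's dispatch unconditional, because the bitwise test folds to a
compile-time false and the branch is eliminated:

	/* With CONFIG_USERFAULTFD=n, VM_FAULT_UFFD_MISSING == 0, so: */
	if (ret & VM_FAULT_UFFD_MISSING)	/* constant false, compiled out */
		return handle_userfault(vmf, VM_UFFD_MISSING);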
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/mm_types.h | 7 ++++++-
mm/memory.c | 5 ++++-
mm/shmem.c | 2 +-
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b25ac322bfbf..a061c43e835b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1557,6 +1557,8 @@ typedef __bitwise unsigned int vm_fault_t;
* @VM_FAULT_COMPLETED: ->fault completed, meanwhile mmap lock released
* @VM_FAULT_UFFD_MINOR: ->fault did not modify page tables and needs
* handle_userfault(VM_UFFD_MINOR) to complete
+ * @VM_FAULT_UFFD_MISSING: ->fault did not modify page tables and needs
+ * handle_userfault(VM_UFFD_MISSING) to complete
* @VM_FAULT_HINDEX_MASK: mask HINDEX value
*
*/
@@ -1576,8 +1578,10 @@ enum vm_fault_reason {
VM_FAULT_COMPLETED = (__force vm_fault_t)0x004000,
#ifdef CONFIG_USERFAULTFD
VM_FAULT_UFFD_MINOR = (__force vm_fault_t)0x008000,
+ VM_FAULT_UFFD_MISSING = (__force vm_fault_t)0x010000,
#else
VM_FAULT_UFFD_MINOR = (__force vm_fault_t)0x000000,
+ VM_FAULT_UFFD_MISSING = (__force vm_fault_t)0x000000,
#endif
VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000,
};
@@ -1604,7 +1608,8 @@ enum vm_fault_reason {
{ VM_FAULT_DONE_COW, "DONE_COW" }, \
{ VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \
{ VM_FAULT_COMPLETED, "COMPLETED" }, \
- { VM_FAULT_UFFD_MINOR, "UFFD_MINOR" }
+ { VM_FAULT_UFFD_MINOR, "UFFD_MINOR" }, \
+ { VM_FAULT_UFFD_MISSING, "UFFD_MISSING" }
struct vm_special_mapping {
const char *name; /* The name, e.g. "[vdso]". */
diff --git a/mm/memory.c b/mm/memory.c
index fcb3e0c3113e..f72e69a43b68 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5319,9 +5319,12 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
ret = vma->vm_ops->fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
- VM_FAULT_DONE_COW | VM_FAULT_UFFD_MINOR))) {
+ VM_FAULT_DONE_COW | VM_FAULT_UFFD_MINOR |
+ VM_FAULT_UFFD_MISSING))) {
if (ret & VM_FAULT_UFFD_MINOR)
return handle_userfault(vmf, VM_UFFD_MINOR);
+ if (ret & VM_FAULT_UFFD_MISSING)
+ return handle_userfault(vmf, VM_UFFD_MISSING);
return ret;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 6aa905147c0c..1bc544cab2a8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2530,7 +2530,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
*/
if (vma && userfaultfd_missing(vma)) {
- *fault_type = handle_userfault(vmf, VM_UFFD_MISSING);
+ *fault_type = VM_FAULT_UFFD_MISSING;
return 0;
}
--
2.51.0
* [PATCH RFC 14/17] KVM: guest_memfd: implement userfaultfd minor mode
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Mike Rapoport
` (12 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 13/17] mm: introduce VM_FAULT_UFFD_MISSING " Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-01-27 19:29 ` [PATCH RFC 15/17] KVM: guest_memfd: implement userfaultfd missing mode Mike Rapoport
` (3 subsequent siblings)
17 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: Nikita Kalyazin <kalyazin@amazon.com>
userfaultfd notifications about minor page faults are used for live
migration and snapshotting of VMs whose memory is backed by shared
hugetlbfs or tmpfs mappings, as described in detail in commit
7677f7fd8be7 ("userfaultfd: add minor fault registration mode").
To use the same mechanism for VMs that map their memory with
guest_memfd, guest_memfd should support userfaultfd minor mode.
Extend the ->fault() method of guest_memfd with the ability to notify
the core page fault handler that a page fault requires
handle_userfault(VM_UFFD_MINOR) to complete, and add vm_uffd_ops to the
guest_memfd vm_ops with implementations of the ->can_userfault() and
->get_folio_noalloc() methods.
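For context, since the vm_uffd_ops definition is introduced earlier in
the series and is not quoted here, the two hooks this patch implements
have roughly the following shape (reconstructed from the hunks in this
thread, so treat it as a sketch rather than the authoritative
definition):

struct vm_uffd_ops {
	/* Checks if a VMA can support userfaultfd */
	bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags);
	/*
	 * Returns the folio at pgoff in the VMA's page cache, locked and
	 * with a reference held, or an ERR_PTR if it does not exist.
	 */
	struct folio *(*get_folio_noalloc)(struct inode *inode, pgoff_t pgoff);
	/* further hooks (alloc_folio, filemap_add, ...) appear in later patches */
};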
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
virt/kvm/guest_memfd.c | 76 ++++++++++++++++++++++++++++++++++++------
1 file changed, 65 insertions(+), 11 deletions(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fdaea3422c30..087e7632bf70 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -7,6 +7,7 @@
#include <linux/mempolicy.h>
#include <linux/pseudo_fs.h>
#include <linux/pagemap.h>
+#include <linux/userfaultfd_k.h>
#include "kvm_mm.h"
@@ -121,6 +122,26 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
return r;
}
+static struct folio *kvm_gmem_get_folio_noalloc(struct inode *inode, pgoff_t pgoff)
+{
+ return __filemap_get_folio(inode->i_mapping, pgoff,
+ FGP_LOCK | FGP_ACCESSED, 0);
+}
+
+static struct folio *__kvm_gmem_folio_alloc(struct inode *inode, pgoff_t index)
+{
+ struct mempolicy *policy;
+ struct folio *folio;
+
+ policy = mpol_shared_policy_lookup(&GMEM_I(inode)->policy, index);
+ folio = __filemap_get_folio_mpol(inode->i_mapping, index,
+ FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
+ mapping_gfp_mask(inode->i_mapping), policy);
+ mpol_cond_put(policy);
+
+ return folio;
+}
+
/*
* Returns a locked folio on success. The caller is responsible for
* setting the up-to-date flag before the memory is mapped into the guest.
@@ -133,25 +154,17 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
{
/* TODO: Support huge pages. */
- struct mempolicy *policy;
struct folio *folio;
/*
* Fast-path: See if folio is already present in mapping to avoid
* policy_lookup.
*/
- folio = __filemap_get_folio(inode->i_mapping, index,
- FGP_LOCK | FGP_ACCESSED, 0);
+ folio = kvm_gmem_get_folio_noalloc(inode, index);
if (!IS_ERR(folio))
return folio;
- policy = mpol_shared_policy_lookup(&GMEM_I(inode)->policy, index);
- folio = __filemap_get_folio_mpol(inode->i_mapping, index,
- FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
- mapping_gfp_mask(inode->i_mapping), policy);
- mpol_cond_put(policy);
-
- return folio;
+ return __kvm_gmem_folio_alloc(inode, index);
}
static enum kvm_gfn_range_filter kvm_gmem_get_invalidate_filter(struct inode *inode)
@@ -405,7 +418,24 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
if (!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED))
return VM_FAULT_SIGBUS;
- folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+ folio = __filemap_get_folio(inode->i_mapping, vmf->pgoff,
+ FGP_LOCK | FGP_ACCESSED, 0);
+
+ if (userfaultfd_armed(vmf->vma)) {
+ /*
+ * If userfaultfd is registered in minor mode and a folio
+ * exists, return VM_FAULT_UFFD_MINOR to trigger the
+ * userfaultfd handler.
+ */
+ if (userfaultfd_minor(vmf->vma) && !IS_ERR_OR_NULL(folio)) {
+ ret = VM_FAULT_UFFD_MINOR;
+ goto out_folio;
+ }
+ }
+
+ /* folio not in the pagecache, try to allocate */
+ if (IS_ERR(folio))
+ folio = __kvm_gmem_folio_alloc(inode, vmf->pgoff);
if (IS_ERR(folio)) {
if (PTR_ERR(folio) == -EAGAIN)
return VM_FAULT_RETRY;
@@ -462,12 +492,36 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
}
#endif /* CONFIG_NUMA */
+#ifdef CONFIG_USERFAULTFD
+static bool kvm_gmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+
+ /*
+ * Only support userfaultfd for guest_memfd with INIT_SHARED flag.
+ * This ensures the memory can be mapped to userspace.
+ */
+ if (!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED))
+ return false;
+
+ return true;
+}
+
+static const struct vm_uffd_ops kvm_gmem_uffd_ops = {
+ .can_userfault = kvm_gmem_can_userfault,
+ .get_folio_noalloc = kvm_gmem_get_folio_noalloc,
+};
+#endif /* CONFIG_USERFAULTFD */
+
static const struct vm_operations_struct kvm_gmem_vm_ops = {
.fault = kvm_gmem_fault_user_mapping,
#ifdef CONFIG_NUMA
.get_policy = kvm_gmem_get_policy,
.set_policy = kvm_gmem_set_policy,
#endif
+#ifdef CONFIG_USERFAULTFD
+ .uffd_ops = &kvm_gmem_uffd_ops,
+#endif
};
static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
--
2.51.0
* [PATCH RFC 15/17] KVM: guest_memfd: implement userfaultfd missing mode
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Mike Rapoport
` (13 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 14/17] KVM: guest_memfd: implement userfaultfd minor mode Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-01-27 19:29 ` [PATCH RFC 16/17] KVM: selftests: test userfaultfd minor for guest_memfd Mike Rapoport
` (2 subsequent siblings)
17 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: Nikita Kalyazin <kalyazin@amazon.com>
userfaultfd missing mode allows populating guest memory with the content
supplied by userspace on demand.
Extend the guest_memfd implementation of vm_uffd_ops to support MISSING
mode.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
virt/kvm/guest_memfd.c | 60 +++++++++++++++++++++++++++++++++++++++++-
1 file changed, 59 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 087e7632bf70..14cca057fc0e 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -431,6 +431,14 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
ret = VM_FAULT_UFFD_MINOR;
goto out_folio;
}
+
+ /*
+ * Check if userfaultfd is registered in missing mode. If so,
+ * check if a folio exists in the page cache. If not, return
+ * VM_FAULT_UFFD_MISSING to trigger the userfaultfd handler.
+ */
+ if (userfaultfd_missing(vmf->vma) && IS_ERR_OR_NULL(folio))
+ return VM_FAULT_UFFD_MISSING;
}
/* folio not in the pagecache, try to allocate */
@@ -507,9 +515,59 @@ static bool kvm_gmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_fla
return true;
}
+static struct folio *kvm_gmem_folio_alloc(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ pgoff_t pgoff = linear_page_index(vma, addr);
+ struct mempolicy *mpol;
+ struct folio *folio;
+ gfp_t gfp;
+
+ if (unlikely(pgoff >= (i_size_read(inode) >> PAGE_SHIFT)))
+ return NULL;
+
+ gfp = mapping_gfp_mask(inode->i_mapping);
+ mpol = mpol_shared_policy_lookup(&GMEM_I(inode)->policy, pgoff);
+ mpol = mpol ?: get_task_policy(current);
+ folio = folio_alloc_mpol(gfp, 0, mpol, pgoff, numa_node_id());
+ mpol_cond_put(mpol);
+
+ return folio;
+}
+
+static int kvm_gmem_filemap_add(struct folio *folio,
+ struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ struct address_space *mapping = inode->i_mapping;
+ pgoff_t pgoff = linear_page_index(vma, addr);
+ int err;
+
+ __folio_set_locked(folio);
+ err = filemap_add_folio(mapping, folio, pgoff, GFP_KERNEL);
+ if (err) {
+ folio_unlock(folio);
+ return err;
+ }
+
+ return 0;
+}
+
+static void kvm_gmem_filemap_remove(struct folio *folio,
+ struct vm_area_struct *vma)
+{
+ filemap_remove_folio(folio);
+ folio_unlock(folio);
+}
+
static const struct vm_uffd_ops kvm_gmem_uffd_ops = {
- .can_userfault = kvm_gmem_can_userfault,
+ .can_userfault = kvm_gmem_can_userfault,
.get_folio_noalloc = kvm_gmem_get_folio_noalloc,
+ .alloc_folio = kvm_gmem_folio_alloc,
+ .filemap_add = kvm_gmem_filemap_add,
+ .filemap_remove = kvm_gmem_filemap_remove,
};
#endif /* CONFIG_USERFAULTFD */
--
2.51.0
* [PATCH RFC 16/17] KVM: selftests: test userfaultfd minor for guest_memfd
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Mike Rapoport
` (14 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 15/17] KVM: guest_memfd: implement userfaultfd missing mode Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-01-27 19:29 ` [PATCH RFC 17/17] KVM: selftests: test userfaultfd missing " Mike Rapoport
2026-02-03 20:56 ` [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Peter Xu
17 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: Nikita Kalyazin <kalyazin@amazon.com>
The test demonstrates that a minor userfaultfd event in guest_memfd can
be resolved via a memcpy followed by a UFFDIO_CONTINUE ioctl.
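Condensed, the resolution sequence the test exercises is a memcpy
through a second, unregistered mapping followed by UFFDIO_CONTINUE on
the registered one (variable names mirror those used in the test below):

	struct uffdio_continue cont = {
		.range.start = (unsigned long)fault_addr,
		.range.len = page_size,
		.mode = 0,
	};

	/* populate the page cache via the mapping without uffd registered */
	memcpy(mem_nofault + offset, buf + offset, page_size);
	/* then install the page table entry for the faulting mapping */
	ioctl(uffd, UFFDIO_CONTINUE, &cont);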
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
.../testing/selftests/kvm/guest_memfd_test.c | 113 ++++++++++++++++++
1 file changed, 113 insertions(+)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 618c937f3c90..7612819e340a 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -10,13 +10,17 @@
#include <errno.h>
#include <stdio.h>
#include <fcntl.h>
+#include <pthread.h>
#include <linux/bitmap.h>
#include <linux/falloc.h>
#include <linux/sizes.h>
+#include <linux/userfaultfd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
#include "kvm_util.h"
#include "numaif.h"
@@ -329,6 +333,112 @@ static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
close(fd1);
}
+struct fault_args {
+ char *addr;
+ char value;
+};
+
+static void *fault_thread_fn(void *arg)
+{
+ struct fault_args *args = arg;
+
+ /* Trigger page fault */
+ args->value = *args->addr;
+ return NULL;
+}
+
+static void test_uffd_minor(int fd, size_t total_size)
+{
+ struct uffdio_register uffd_reg;
+ struct uffdio_continue uffd_cont;
+ struct uffd_msg msg;
+ struct fault_args args;
+ pthread_t fault_thread;
+ void *mem, *mem_nofault, *buf = NULL;
+ int uffd, ret;
+ off_t offset = page_size;
+ void *fault_addr;
+ const char test_val = 0xcd;
+
+ ret = posix_memalign(&buf, page_size, total_size);
+ TEST_ASSERT_EQ(ret, 0);
+ memset(buf, test_val, total_size);
+
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
+ TEST_ASSERT(uffd != -1, "userfaultfd creation should succeed");
+
+ struct uffdio_api uffdio_api = {
+ .api = UFFD_API,
+ .features = 0,
+ };
+ ret = ioctl(uffd, UFFDIO_API, &uffdio_api);
+ TEST_ASSERT(ret != -1, "ioctl(UFFDIO_API) should succeed");
+
+ /* Map the guest_memfd twice: once with UFFD registered, once without */
+ mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem != MAP_FAILED, "mmap should succeed");
+
+ mem_nofault = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem_nofault != MAP_FAILED, "mmap should succeed");
+
+ /* Register UFFD_MINOR on the first mapping */
+ uffd_reg.range.start = (unsigned long)mem;
+ uffd_reg.range.len = total_size;
+ uffd_reg.mode = UFFDIO_REGISTER_MODE_MINOR;
+ ret = ioctl(uffd, UFFDIO_REGISTER, &uffd_reg);
+ TEST_ASSERT(ret != -1, "ioctl(UFFDIO_REGISTER) should succeed");
+
+ /*
+ * Populate the page in the page cache first via mem_nofault.
+ * This is required for UFFD_MINOR - the page must exist in the cache.
+ * Write test data to the page.
+ */
+ memcpy(mem_nofault + offset, buf + offset, page_size);
+
+ /*
+ * Now access the same page via mem (which has UFFD_MINOR registered).
+ * Since the page exists in the cache, this should trigger UFFD_MINOR.
+ */
+ fault_addr = mem + offset;
+ args.addr = fault_addr;
+
+ ret = pthread_create(&fault_thread, NULL, fault_thread_fn, &args);
+ TEST_ASSERT(ret == 0, "pthread_create should succeed");
+
+ ret = read(uffd, &msg, sizeof(msg));
+ TEST_ASSERT(ret != -1, "read from userfaultfd should succeed");
+ TEST_ASSERT(msg.event == UFFD_EVENT_PAGEFAULT, "event type should be pagefault");
+ TEST_ASSERT((void *)(msg.arg.pagefault.address & ~(page_size - 1)) == fault_addr,
+ "pagefault should occur at expected address");
+ TEST_ASSERT(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR,
+ "pagefault should be minor fault");
+
+ /* Resolve the minor fault with UFFDIO_CONTINUE */
+ uffd_cont.range.start = (unsigned long)fault_addr;
+ uffd_cont.range.len = page_size;
+ uffd_cont.mode = 0;
+ ret = ioctl(uffd, UFFDIO_CONTINUE, &uffd_cont);
+ TEST_ASSERT(ret != -1, "ioctl(UFFDIO_CONTINUE) should succeed");
+
+ /* Wait for the faulting thread to complete */
+ ret = pthread_join(fault_thread, NULL);
+ TEST_ASSERT(ret == 0, "pthread_join should succeed");
+
+ /* Verify the thread read the correct value */
+ TEST_ASSERT(args.value == test_val,
+ "memory should contain the value that was written");
+ TEST_ASSERT(*(char *)(mem + offset) == test_val,
+ "no further fault is expected");
+
+ ret = munmap(mem_nofault, total_size);
+ TEST_ASSERT(!ret, "munmap should succeed");
+
+ ret = munmap(mem, total_size);
+ TEST_ASSERT(!ret, "munmap should succeed");
+ free(buf);
+ close(uffd);
+}
+
static void test_guest_memfd_flags(struct kvm_vm *vm)
{
uint64_t valid_flags = vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_FLAGS);
@@ -383,6 +493,9 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags)
gmem_test(file_size, vm, flags);
gmem_test(fallocate, vm, flags);
gmem_test(invalid_punch_hole, vm, flags);
+
+ if (flags & GUEST_MEMFD_FLAG_INIT_SHARED)
+ gmem_test(uffd_minor, vm, flags);
}
static void test_guest_memfd(unsigned long vm_type)
--
2.51.0
* [PATCH RFC 17/17] KVM: selftests: test userfaultfd missing for guest_memfd
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Mike Rapoport
` (15 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 16/17] KVM: selftests: test userfaultfd minor for guest_memfd Mike Rapoport
@ 2026-01-27 19:29 ` Mike Rapoport
2026-02-03 20:56 ` [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Peter Xu
17 siblings, 0 replies; 41+ messages in thread
From: Mike Rapoport @ 2026-01-27 19:29 UTC (permalink / raw)
To: linux-mm
Cc: Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Baolin Wang,
David Hildenbrand, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
From: Nikita Kalyazin <kalyazin@amazon.com>
The test demonstrates that a missing userfaultfd event in guest_memfd
can be resolved via a UFFDIO_COPY ioctl.
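Condensed, the resolution here is a single UFFDIO_COPY that allocates
the page, copies the payload and maps it in one ioctl (variable names
mirror those used in the test below):

	struct uffdio_copy copy = {
		.dst = (unsigned long)fault_addr,
		.src = (unsigned long)(buf + offset),
		.len = page_size,
		.mode = 0,
	};

	ioctl(uffd, UFFDIO_COPY, &copy);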
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
.../testing/selftests/kvm/guest_memfd_test.c | 80 ++++++++++++++++++-
1 file changed, 79 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 7612819e340a..f77e70d22175 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -439,6 +439,82 @@ static void test_uffd_minor(int fd, size_t total_size)
close(uffd);
}
+static void test_uffd_missing(int fd, size_t total_size)
+{
+ struct uffdio_register uffd_reg;
+ struct uffdio_copy uffd_copy;
+ struct uffd_msg msg;
+ struct fault_args args;
+ pthread_t fault_thread;
+ void *mem, *buf = NULL;
+ int uffd, ret;
+ off_t offset = page_size;
+ void *fault_addr;
+ const char test_val = 0xab;
+
+ ret = posix_memalign(&buf, page_size, total_size);
+ TEST_ASSERT_EQ(ret, 0);
+ memset(buf, test_val, total_size);
+
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
+ TEST_ASSERT(uffd != -1, "userfaultfd creation should succeed");
+
+ struct uffdio_api uffdio_api = {
+ .api = UFFD_API,
+ .features = 0,
+ };
+ ret = ioctl(uffd, UFFDIO_API, &uffdio_api);
+ TEST_ASSERT(ret != -1, "ioctl(UFFDIO_API) should succeed");
+
+ mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem != MAP_FAILED, "mmap should succeed");
+
+ uffd_reg.range.start = (unsigned long)mem;
+ uffd_reg.range.len = total_size;
+ uffd_reg.mode = UFFDIO_REGISTER_MODE_MISSING;
+ ret = ioctl(uffd, UFFDIO_REGISTER, &uffd_reg);
+ TEST_ASSERT(ret != -1, "ioctl(UFFDIO_REGISTER) should succeed");
+
+ fault_addr = mem + offset;
+ args.addr = fault_addr;
+
+ ret = pthread_create(&fault_thread, NULL, fault_thread_fn, &args);
+ TEST_ASSERT(ret == 0, "pthread_create should succeed");
+
+ ret = read(uffd, &msg, sizeof(msg));
+ TEST_ASSERT(ret != -1, "read from userfaultfd should succeed");
+ TEST_ASSERT(msg.event == UFFD_EVENT_PAGEFAULT, "event type should be pagefault");
+ TEST_ASSERT((void *)(msg.arg.pagefault.address & ~(page_size - 1)) == fault_addr,
+ "pagefault should occur at expected address");
+ TEST_ASSERT(!(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP),
+ "pagefault should not be write-protect");
+
+ uffd_copy.dst = (unsigned long)fault_addr;
+ uffd_copy.src = (unsigned long)(buf + offset);
+ uffd_copy.len = page_size;
+ uffd_copy.mode = 0;
+ ret = ioctl(uffd, UFFDIO_COPY, &uffd_copy);
+ TEST_ASSERT(ret != -1, "ioctl(UFFDIO_COPY) should succeed");
+
+ /* Wait for the faulting thread to complete - this provides the memory barrier */
+ ret = pthread_join(fault_thread, NULL);
+ TEST_ASSERT(ret == 0, "pthread_join should succeed");
+
+ /*
+ * Now it's safe to check args.value - the thread has completed
+ * and memory is synchronized
+ */
+ TEST_ASSERT(args.value == test_val,
+ "memory should contain the value that was copied");
+ TEST_ASSERT(*(char *)(mem + offset) == test_val,
+ "no further fault is expected");
+
+ ret = munmap(mem, total_size);
+ TEST_ASSERT(!ret, "munmap should succeed");
+ free(buf);
+ close(uffd);
+}
+
static void test_guest_memfd_flags(struct kvm_vm *vm)
{
uint64_t valid_flags = vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_FLAGS);
@@ -494,8 +570,10 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags)
gmem_test(fallocate, vm, flags);
gmem_test(invalid_punch_hole, vm, flags);
- if (flags & GUEST_MEMFD_FLAG_INIT_SHARED)
+ if (flags & GUEST_MEMFD_FLAG_INIT_SHARED) {
gmem_test(uffd_minor, vm, flags);
+ gmem_test(uffd_missing, vm, flags);
+ }
}
static void test_guest_memfd(unsigned long vm_type)
--
2.51.0
* Re: [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd
2026-01-27 19:29 [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Mike Rapoport
` (16 preceding siblings ...)
2026-01-27 19:29 ` [PATCH RFC 17/17] KVM: selftests: test userfaultfd missing " Mike Rapoport
@ 2026-02-03 20:56 ` Peter Xu
2026-02-09 15:35 ` David Hildenbrand (Arm)
17 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2026-02-03 20:56 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Tue, Jan 27, 2026 at 09:29:19PM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Hi,
>
> These patches enable support for userfaultfd in guest_memfd.
> They are quite different from the latest posting [1] so I'm restarting the
> versioning. As there was a lot of tension around the topic, this is an RFC
> to get some feedback and see how we can move forward.
>
> As the ground work I refactored userfaultfd handling of PTE-based memory types
> (anonymous and shmem) and converted them to use vm_uffd_ops for allocating a
> folio or getting an existing folio from the page cache. shmem also implements
> callbacks that add a folio to the page cache after the data passed in
> UFFDIO_COPY was copied and remove the folio from the page cache if page table
> update fails.
>
> In order for guest_memfd to notify userspace about page faults, there are new
> VM_FAULT_UFFD_MINOR and VM_FAULT_UFFD_MISSING that a ->fault() handler can
> return to inform the page fault handler that it needs to call
> handle_userfault() to complete the fault.
>
> Nikita helped to plumb these new goodies into guest_memfd and provided basic
> tests to verify that guest_memfd works with userfaultfd.
>
> I deliberately left hugetlb out, at least for the most part.
> hugetlb handles acquisition of VMA and more importantly establishing of parent
> page table entry differently than PTE-based memory types. This is a different
> abstraction level than what vm_uffd_ops provides and people objected to
> exposing such low level APIs as a part of VMA operations.
>
> Also, to enable uffd in guest_memfd refactoring of hugetlb is not needed and I
> prefer to delay it until the dust settles after the changes in this set.
>
> [1] https://lore.kernel.org/all/20251130111812.699259-1-rppt@kernel.org
>
> Mike Rapoport (Microsoft) (12):
> userfaultfd: introduce mfill_copy_folio_locked() helper
> userfaultfd: introduce struct mfill_state
> userfaultfd: introduce mfill_get_pmd() helper.
> userfaultfd: introduce mfill_get_vma() and mfill_put_vma()
> userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy()
> userfaultfd: move vma_can_userfault out of line
> userfaultfd: introduce vm_uffd_ops
> userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
> userfaultfd: introduce vm_uffd_ops->alloc_folio()
> shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
> userfaultfd: mfill_atomic() remove retry logic
> mm: introduce VM_FAULT_UFFD_MINOR fault reason
>
> Nikita Kalyazin (5):
> mm: introduce VM_FAULT_UFFD_MISSING fault reason
> KVM: guest_memfd: implement userfaultfd minor mode
> KVM: guest_memfd: implement userfaultfd missing mode
> KVM: selftests: test userfaultfd minor for guest_memfd
> KVM: selftests: test userfaultfd missing for guest_memfd
>
> include/linux/mm.h | 5 +
> include/linux/mm_types.h | 15 +-
> include/linux/shmem_fs.h | 14 -
> include/linux/userfaultfd_k.h | 74 +-
> mm/hugetlb.c | 21 +
> mm/memory.c | 8 +-
> mm/shmem.c | 188 +++--
> mm/userfaultfd.c | 671 ++++++++++--------
> .../testing/selftests/kvm/guest_memfd_test.c | 191 +++++
> virt/kvm/guest_memfd.c | 134 +++-
> 10 files changed, 871 insertions(+), 450 deletions(-)
Mike,
The idea looks good to me, thanks for this work! Your approach to
UFFDIO_COPY over anon/shmem works nicely for me.
If you remember, I previously raised a concern about introducing two new
fault retvals only for userfaultfd:
https://lore.kernel.org/all/aShb8J18BaRrsA-u@x1.local/
IMHO they not only unnecessarily leak userfaultfd information into the
core fault definitions, but also cause code duplication. I still think
we should avoid them.
This time, I've attached a smoke-tested patch removing both of them.
It's pretty small and it runs fine with all old/new userfaultfd tests
(including the gmem ones). Feel free to have a look at the end.
I understand you want to avoid adding more complexity to this series; if
you want, I can also prepare such a patch after this series lands to
remove the two retvals. I'd still like to know what you think about it,
though, so let me know if you have any comments.
Note that it may indeed need some perf tests to make sure there's zero
overhead after this change. Currently there is still some trivial
overhead (e.g. unnecessary folio locks), but IIUC we can avoid even
that.
Thanks,
===8<===
From 5379d084494b17281f3e5365104a7edbdbe53759 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 3 Feb 2026 15:07:58 -0500
Subject: [PATCH] mm/userfaultfd: Remove two userfaultfd fault retvals
They're not needed with vm_uffd_ops, so we can remove both of them.
As a side benefit, drivers no longer need to process userfaultfd
missing / minor faults in their main fault handlers.
This patch makes get_folio_noalloc() required for either MISSING or
MINOR faults, but that's not a problem, as it should be lightweight and
the only current user outside mm (gmem) will support both anyway.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/mm_types.h | 15 +-----------
include/linux/userfaultfd_k.h | 2 +-
mm/memory.c | 45 +++++++++++++++++++++++++++++------
mm/shmem.c | 12 ----------
virt/kvm/guest_memfd.c | 20 ----------------
5 files changed, 40 insertions(+), 54 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a6d32470a78a3..3cc8ae7228860 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1612,10 +1612,6 @@ typedef __bitwise unsigned int vm_fault_t;
* fsync() to complete (for synchronous page faults
* in DAX)
* @VM_FAULT_COMPLETED: ->fault completed, meanwhile mmap lock released
- * @VM_FAULT_UFFD_MINOR: ->fault did not modify page tables and needs
- * handle_userfault(VM_UFFD_MINOR) to complete
- * @VM_FAULT_UFFD_MISSING: ->fault did not modify page tables and needs
- * handle_userfault(VM_UFFD_MISSING) to complete
* @VM_FAULT_HINDEX_MASK: mask HINDEX value
*
*/
@@ -1633,13 +1629,6 @@ enum vm_fault_reason {
VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000,
VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000,
VM_FAULT_COMPLETED = (__force vm_fault_t)0x004000,
-#ifdef CONFIG_USERFAULTFD
- VM_FAULT_UFFD_MINOR = (__force vm_fault_t)0x008000,
- VM_FAULT_UFFD_MISSING = (__force vm_fault_t)0x010000,
-#else
- VM_FAULT_UFFD_MINOR = (__force vm_fault_t)0x000000,
- VM_FAULT_UFFD_MISSING = (__force vm_fault_t)0x000000,
-#endif
VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000,
};
@@ -1664,9 +1653,7 @@ enum vm_fault_reason {
{ VM_FAULT_FALLBACK, "FALLBACK" }, \
{ VM_FAULT_DONE_COW, "DONE_COW" }, \
{ VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \
- { VM_FAULT_COMPLETED, "COMPLETED" }, \
- { VM_FAULT_UFFD_MINOR, "UFFD_MINOR" }, \
- { VM_FAULT_UFFD_MISSING, "UFFD_MISSING" }
+ { VM_FAULT_COMPLETED, "COMPLETED" }
struct vm_special_mapping {
const char *name; /* The name, e.g. "[vdso]". */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 75d5b09f2560c..5923e32de53b5 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -85,7 +85,7 @@ struct vm_uffd_ops {
/* Checks if a VMA can support userfaultfd */
bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags);
/*
- * Called to resolve UFFDIO_CONTINUE request.
+ * Required by any uffd driver for either MISSING or MINOR fault.
* Should return the folio found at pgoff in the VMA's pagecache if it
* exists or ERR_PTR otherwise.
* The returned folio is locked and with reference held.
diff --git a/mm/memory.c b/mm/memory.c
index 456344938c72b..098febb761acc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5338,6 +5338,33 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
return VM_FAULT_OOM;
}
+static vm_fault_t fault_process_userfaultfd(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct inode *inode = file_inode(vma->vm_file);
+ /*
+ * NOTE: we could double check this hook present when
+ * UFFDIO_REGISTER on MISSING or MINOR for a file driver.
+ */
+ struct folio *folio =
+ vma->vm_ops->uffd_ops->get_folio_noalloc(inode, vmf->pgoff);
+
+ if (!IS_ERR_OR_NULL(folio)) {
+ /*
+ * TODO: provide a flag for get_folio_noalloc() to avoid
+ * locking (or even the extra reference?)
+ */
+ folio_unlock(folio);
+ folio_put(folio);
+ if (userfaultfd_minor(vma))
+ return handle_userfault(vmf, VM_UFFD_MINOR);
+ } else {
+ return handle_userfault(vmf, VM_UFFD_MISSING);
+ }
+
+ return 0;
+}
+
/*
* The mmap_lock must have been held on entry, and may have been
* released depending on flags and vma->vm_ops->fault() return value.
@@ -5370,16 +5397,20 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
return VM_FAULT_OOM;
}
+ /*
+ * If this is an userfaultfd trap, process it in advance before
+ * triggering the genuine fault handler.
+ */
+ if (userfaultfd_missing(vma) || userfaultfd_minor(vma)) {
+ ret = fault_process_userfaultfd(vmf);
+ if (ret)
+ return ret;
+ }
+
ret = vma->vm_ops->fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
- VM_FAULT_DONE_COW | VM_FAULT_UFFD_MINOR |
- VM_FAULT_UFFD_MISSING))) {
- if (ret & VM_FAULT_UFFD_MINOR)
- return handle_userfault(vmf, VM_UFFD_MINOR);
- if (ret & VM_FAULT_UFFD_MISSING)
- return handle_userfault(vmf, VM_UFFD_MISSING);
+ VM_FAULT_DONE_COW)))
return ret;
- }
folio = page_folio(vmf->page);
if (unlikely(PageHWPoison(vmf->page))) {
diff --git a/mm/shmem.c b/mm/shmem.c
index eafd7986fc2ec..5286f28b3e443 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2484,13 +2484,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
fault_mm = vma ? vma->vm_mm : NULL;
folio = filemap_get_entry(inode->i_mapping, index);
- if (folio && vma && userfaultfd_minor(vma)) {
- if (!xa_is_value(folio))
- folio_put(folio);
- *fault_type = VM_FAULT_UFFD_MINOR;
- return 0;
- }
-
if (xa_is_value(folio)) {
error = shmem_swapin_folio(inode, index, &folio,
sgp, gfp, vma, fault_type);
@@ -2535,11 +2528,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
* Fast cache lookup and swap lookup did not find it: allocate.
*/
- if (vma && userfaultfd_missing(vma)) {
- *fault_type = VM_FAULT_UFFD_MISSING;
- return 0;
- }
-
/* Find hugepage orders that are allowed for anonymous shmem and tmpfs. */
orders = shmem_allowable_huge_orders(inode, vma, index, write_end, false);
if (orders > 0) {
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 14cca057fc0ec..bd0de685f42f8 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -421,26 +421,6 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
folio = __filemap_get_folio(inode->i_mapping, vmf->pgoff,
FGP_LOCK | FGP_ACCESSED, 0);
- if (userfaultfd_armed(vmf->vma)) {
- /*
- * If userfaultfd is registered in minor mode and a folio
- * exists, return VM_FAULT_UFFD_MINOR to trigger the
- * userfaultfd handler.
- */
- if (userfaultfd_minor(vmf->vma) && !IS_ERR_OR_NULL(folio)) {
- ret = VM_FAULT_UFFD_MINOR;
- goto out_folio;
- }
-
- /*
- * Check if userfaultfd is registered in missing mode. If so,
- * check if a folio exists in the page cache. If not, return
- * VM_FAULT_UFFD_MISSING to trigger the userfaultfd handler.
- */
- if (userfaultfd_missing(vmf->vma) && IS_ERR_OR_NULL(folio))
- return VM_FAULT_UFFD_MISSING;
- }
-
/* folio not in the pagecache, try to allocate */
if (IS_ERR(folio))
folio = __kvm_gmem_folio_alloc(inode, vmf->pgoff);
--
2.50.1
--
Peter Xu
* Re: [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd
2026-02-03 20:56 ` [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd Peter Xu
@ 2026-02-09 15:35 ` David Hildenbrand (Arm)
2026-02-11 6:04 ` Mike Rapoport
0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-09 15:35 UTC (permalink / raw)
To: Peter Xu, Mike Rapoport
Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Axel Rasmussen,
Baolin Wang, Hugh Dickins, James Houghton, Liam R. Howlett,
Lorenzo Stoakes, Michal Hocko, Muchun Song, Nikita Kalyazin,
Oscar Salvador, Paolo Bonzini, Sean Christopherson, Shuah Khan,
Suren Baghdasaryan, Vlastimil Babka, linux-kernel, kvm,
linux-kselftest
On 2/3/26 21:56, Peter Xu wrote:
> On Tue, Jan 27, 2026 at 09:29:19PM +0200, Mike Rapoport wrote:
>> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>>
>> Hi,
>>
>> These patches enable support for userfaultfd in guest_memfd.
>> They are quite different from the latest posting [1] so I'm restarting the
>> versioning. As there was a lot of tension around the topic, this is an RFC
>> to get some feedback and see how we can move forward.
>>
>> As the ground work I refactored userfaultfd handling of PTE-based memory types
>> (anonymous and shmem) and converted them to use vm_uffd_ops for allocating a
>> folio or getting an existing folio from the page cache. shmem also implements
>> callbacks that add a folio to the page cache after the data passed in
>> UFFDIO_COPY was copied and remove the folio from the page cache if page table
>> update fails.
>>
>> In order for guest_memfd to notify userspace about page faults, there are new
>> VM_FAULT_UFFD_MINOR and VM_FAULT_UFFD_MISSING that a ->fault() handler can
>> return to inform the page fault handler that it needs to call
>> handle_userfault() to complete the fault.
>>
>> Nikita helped to plumb these new goodies into guest_memfd and provided basic
>> tests to verify that guest_memfd works with userfaultfd.
>>
>> I deliberately left hugetlb out, at least for the most part.
>> hugetlb handles acquisition of VMA and more importantly establishing of parent
>> page table entry differently than PTE-based memory types. This is a different
>> abstraction level than what vm_uffd_ops provides and people objected to
>> exposing such low level APIs as a part of VMA operations.
>>
>> Also, to enable uffd in guest_memfd refactoring of hugetlb is not needed and I
>> prefer to delay it until the dust settles after the changes in this set.
>>
>> [1] https://lore.kernel.org/all/20251130111812.699259-1-rppt@kernel.org
>>
>> Mike Rapoport (Microsoft) (12):
>> userfaultfd: introduce mfill_copy_folio_locked() helper
>> userfaultfd: introduce struct mfill_state
>> userfaultfd: introduce mfill_get_pmd() helper.
>> userfaultfd: introduce mfill_get_vma() and mfill_put_vma()
>> userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy()
>> userfaultfd: move vma_can_userfault out of line
>> userfaultfd: introduce vm_uffd_ops
>> userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
>> userfaultfd: introduce vm_uffd_ops->alloc_folio()
>> shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
>> userfaultfd: mfill_atomic() remove retry logic
>> mm: introduce VM_FAULT_UFFD_MINOR fault reason
>>
>> Nikita Kalyazin (5):
>> mm: introduce VM_FAULT_UFFD_MISSING fault reason
>> KVM: guest_memfd: implement userfaultfd minor mode
>> KVM: guest_memfd: implement userfaultfd missing mode
>> KVM: selftests: test userfaultfd minor for guest_memfd
>> KVM: selftests: test userfaultfd missing for guest_memfd
>>
>> include/linux/mm.h | 5 +
>> include/linux/mm_types.h | 15 +-
>> include/linux/shmem_fs.h | 14 -
>> include/linux/userfaultfd_k.h | 74 +-
>> mm/hugetlb.c | 21 +
>> mm/memory.c | 8 +-
>> mm/shmem.c | 188 +++--
>> mm/userfaultfd.c | 671 ++++++++++--------
>> .../testing/selftests/kvm/guest_memfd_test.c | 191 +++++
>> virt/kvm/guest_memfd.c | 134 +++-
>> 10 files changed, 871 insertions(+), 450 deletions(-)
>
> Mike,
>
> The idea looks good to me, thanks for this work! Your approach to
> UFFDIO_COPY over anon/shmem works nicely for me.
>
> If you remember, I previously raised a concern about introducing two new
> fault retvals only for userfaultfd:
>
> https://lore.kernel.org/all/aShb8J18BaRrsA-u@x1.local/
>
> IMHO they not only unnecessarily leak userfaultfd information into the
> core fault definitions, but also cause code duplication. I still think
> we should avoid them.
>
> This time, I've attached a smoke-tested patch removing both of them.
>
> It's pretty small and it runs fine with all old/new userfaultfd tests
> (including the gmem ones). Feel free to have a look at the end.
>
> I understand you want to avoid adding more complexity to this series; if
> you want, I can also prepare such a patch after this series lands to
> remove the two retvals. I'd still like to know what you think about it,
> though, so let me know if you have any comments.
>
> Note that it may indeed need some perf tests to make sure there's zero
> overhead after this change. Currently there is still some trivial
> overhead (e.g. unnecessary folio locks), but IIUC we can avoid even
> that.
>
> Thanks,
>
> ===8<===
> From 5379d084494b17281f3e5365104a7edbdbe53759 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Tue, 3 Feb 2026 15:07:58 -0500
> Subject: [PATCH] mm/userfaultfd: Remove two userfaultfd fault retvals
>
> They're not needed with vm_uffd_ops, so we can remove both of them.
> As a side benefit, drivers no longer need to process userfaultfd
> missing / minor faults in their main fault handlers.
>
> This patch makes get_folio_noalloc() required for either MISSING or
> MINOR faults, but that's not a problem, as it should be lightweight and
> the only current user outside mm (gmem) will support both anyway.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/linux/mm_types.h | 15 +-----------
> include/linux/userfaultfd_k.h | 2 +-
> mm/memory.c | 45 +++++++++++++++++++++++++++++------
> mm/shmem.c | 12 ----------
> virt/kvm/guest_memfd.c | 20 ----------------
> 5 files changed, 40 insertions(+), 54 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index a6d32470a78a3..3cc8ae7228860 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1612,10 +1612,6 @@ typedef __bitwise unsigned int vm_fault_t;
> * fsync() to complete (for synchronous page faults
> * in DAX)
> * @VM_FAULT_COMPLETED: ->fault completed, meanwhile mmap lock released
> - * @VM_FAULT_UFFD_MINOR: ->fault did not modify page tables and needs
> - * handle_userfault(VM_UFFD_MINOR) to complete
> - * @VM_FAULT_UFFD_MISSING: ->fault did not modify page tables and needs
> - * handle_userfault(VM_UFFD_MISSING) to complete
> * @VM_FAULT_HINDEX_MASK: mask HINDEX value
> *
> */
> @@ -1633,13 +1629,6 @@ enum vm_fault_reason {
> VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000,
> VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000,
> VM_FAULT_COMPLETED = (__force vm_fault_t)0x004000,
> -#ifdef CONFIG_USERFAULTFD
> - VM_FAULT_UFFD_MINOR = (__force vm_fault_t)0x008000,
> - VM_FAULT_UFFD_MISSING = (__force vm_fault_t)0x010000,
> -#else
> - VM_FAULT_UFFD_MINOR = (__force vm_fault_t)0x000000,
> - VM_FAULT_UFFD_MISSING = (__force vm_fault_t)0x000000,
> -#endif
> VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000,
> };
>
> @@ -1664,9 +1653,7 @@ enum vm_fault_reason {
> { VM_FAULT_FALLBACK, "FALLBACK" }, \
> { VM_FAULT_DONE_COW, "DONE_COW" }, \
> { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \
> - { VM_FAULT_COMPLETED, "COMPLETED" }, \
> - { VM_FAULT_UFFD_MINOR, "UFFD_MINOR" }, \
> - { VM_FAULT_UFFD_MISSING, "UFFD_MISSING" }
> + { VM_FAULT_COMPLETED, "COMPLETED" }
>
> struct vm_special_mapping {
> const char *name; /* The name, e.g. "[vdso]". */
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 75d5b09f2560c..5923e32de53b5 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -85,7 +85,7 @@ struct vm_uffd_ops {
> /* Checks if a VMA can support userfaultfd */
> bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags);
> /*
> - * Called to resolve UFFDIO_CONTINUE request.
> + * Required by any uffd driver for either MISSING or MINOR fault.
> * Should return the folio found at pgoff in the VMA's pagecache if it
> * exists or ERR_PTR otherwise.
> * The returned folio is locked and with reference held.
> diff --git a/mm/memory.c b/mm/memory.c
> index 456344938c72b..098febb761acc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5338,6 +5338,33 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> return VM_FAULT_OOM;
> }
>
> +static vm_fault_t fault_process_userfaultfd(struct vm_fault *vmf)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + struct inode *inode = file_inode(vma->vm_file);
> + /*
> + * NOTE: we could double check this hook present when
> + * UFFDIO_REGISTER on MISSING or MINOR for a file driver.
> + */
> + struct folio *folio =
> + vma->vm_ops->uffd_ops->get_folio_noalloc(inode, vmf->pgoff);
> +
> + if (!IS_ERR_OR_NULL(folio)) {
> + /*
> + * TODO: provide a flag for get_folio_noalloc() to avoid
> + * locking (or even the extra reference?)
> + */
> + folio_unlock(folio);
> + folio_put(folio);
> + if (userfaultfd_minor(vma))
> + return handle_userfault(vmf, VM_UFFD_MINOR);
> + } else {
> + return handle_userfault(vmf, VM_UFFD_MISSING);
> + }
> +
> + return 0;
> +}
> +
> /*
> * The mmap_lock must have been held on entry, and may have been
> * released depending on flags and vma->vm_ops->fault() return value.
> @@ -5370,16 +5397,20 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> return VM_FAULT_OOM;
> }
>
> + /*
> + * If this is an userfaultfd trap, process it in advance before
> + * triggering the genuine fault handler.
> + */
> + if (userfaultfd_missing(vma) || userfaultfd_minor(vma)) {
> + ret = fault_process_userfaultfd(vmf);
> + if (ret)
> + return ret;
> + }
> +
> ret = vma->vm_ops->fault(vmf);
> if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
> - VM_FAULT_DONE_COW | VM_FAULT_UFFD_MINOR |
> - VM_FAULT_UFFD_MISSING))) {
> - if (ret & VM_FAULT_UFFD_MINOR)
> - return handle_userfault(vmf, VM_UFFD_MINOR);
> - if (ret & VM_FAULT_UFFD_MISSING)
> - return handle_userfault(vmf, VM_UFFD_MISSING);
> + VM_FAULT_DONE_COW)))
> return ret;
> - }
>
> folio = page_folio(vmf->page);
> if (unlikely(PageHWPoison(vmf->page))) {
> diff --git a/mm/shmem.c b/mm/shmem.c
> index eafd7986fc2ec..5286f28b3e443 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2484,13 +2484,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> fault_mm = vma ? vma->vm_mm : NULL;
>
> folio = filemap_get_entry(inode->i_mapping, index);
> - if (folio && vma && userfaultfd_minor(vma)) {
> - if (!xa_is_value(folio))
> - folio_put(folio);
> - *fault_type = VM_FAULT_UFFD_MINOR;
> - return 0;
> - }
> -
> if (xa_is_value(folio)) {
> error = shmem_swapin_folio(inode, index, &folio,
> sgp, gfp, vma, fault_type);
> @@ -2535,11 +2528,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> * Fast cache lookup and swap lookup did not find it: allocate.
> */
>
> - if (vma && userfaultfd_missing(vma)) {
> - *fault_type = VM_FAULT_UFFD_MISSING;
> - return 0;
> - }
> -
> /* Find hugepage orders that are allowed for anonymous shmem and tmpfs. */
> orders = shmem_allowable_huge_orders(inode, vma, index, write_end, false);
> if (orders > 0) {
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 14cca057fc0ec..bd0de685f42f8 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -421,26 +421,6 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
> folio = __filemap_get_folio(inode->i_mapping, vmf->pgoff,
> FGP_LOCK | FGP_ACCESSED, 0);
>
> - if (userfaultfd_armed(vmf->vma)) {
> - /*
> - * If userfaultfd is registered in minor mode and a folio
> - * exists, return VM_FAULT_UFFD_MINOR to trigger the
> - * userfaultfd handler.
> - */
> - if (userfaultfd_minor(vmf->vma) && !IS_ERR_OR_NULL(folio)) {
> - ret = VM_FAULT_UFFD_MINOR;
> - goto out_folio;
> - }
> -
> - /*
> - * Check if userfaultfd is registered in missing mode. If so,
> - * check if a folio exists in the page cache. If not, return
> - * VM_FAULT_UFFD_MISSING to trigger the userfaultfd handler.
> - */
> - if (userfaultfd_missing(vmf->vma) && IS_ERR_OR_NULL(folio))
> - return VM_FAULT_UFFD_MISSING;
> - }
> -
> /* folio not in the pagecache, try to allocate */
> if (IS_ERR(folio))
> folio = __kvm_gmem_folio_alloc(inode, vmf->pgoff);
That looks better in general. We should likely find a better/more
consistent name for fault_process_userfaultfd().
--
Cheers,
David
* Re: [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd
2026-02-09 15:35 ` David Hildenbrand (Arm)
@ 2026-02-11 6:04 ` Mike Rapoport
2026-02-11 9:52 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 41+ messages in thread
From: Mike Rapoport @ 2026-02-11 6:04 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Peter Xu, linux-mm, Andrea Arcangeli, Andrew Morton,
Axel Rasmussen, Baolin Wang, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
On Mon, Feb 09, 2026 at 04:35:47PM +0100, David Hildenbrand (Arm) wrote:
> On 2/3/26 21:56, Peter Xu wrote:
>
> > +static vm_fault_t fault_process_userfaultfd(struct vm_fault *vmf)
> > +{
> > + struct vm_area_struct *vma = vmf->vma;
> > + struct inode *inode = file_inode(vma->vm_file);
> > + /*
> > + * NOTE: we could double check this hook present when
> > + * UFFDIO_REGISTER on MISSING or MINOR for a file driver.
> > + */
> > + struct folio *folio =
> > + vma->vm_ops->uffd_ops->get_folio_noalloc(inode, vmf->pgoff);
> > +
> > + if (!IS_ERR_OR_NULL(folio)) {
> > + /*
> > + * TODO: provide a flag for get_folio_noalloc() to avoid
> > + * locking (or even the extra reference?)
> > + */
> > + folio_unlock(folio);
> > + folio_put(folio);
> > + if (userfaultfd_minor(vma))
> > + return handle_userfault(vmf, VM_UFFD_MINOR);
> > + } else {
> > + return handle_userfault(vmf, VM_UFFD_MISSING);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > /*
> > * The mmap_lock must have been held on entry, and may have been
> > * released depending on flags and vma->vm_ops->fault() return value.
> > @@ -5370,16 +5397,20 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> > return VM_FAULT_OOM;
> > }
> > + /*
> > + * If this is an userfaultfd trap, process it in advance before
> > + * triggering the genuine fault handler.
> > + */
> > + if (userfaultfd_missing(vma) || userfaultfd_minor(vma)) {
> > + ret = fault_process_userfaultfd(vmf);
> > + if (ret)
> > + return ret;
> > + }
I agree this is neater than handling the VM_FAULT_UFFD_* retvals.
I'd just move the checks for userfaultfd_minor() and userfaultfd_missing()
inside fault_process_userfaultfd().
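Roughly, with the checks folded in and the __do_userfault() name
suggested below, I'd imagine something like this (untested sketch,
derived from your patch):

static vm_fault_t __do_userfault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct inode *inode = file_inode(vma->vm_file);
	struct folio *folio;

	if (!userfaultfd_missing(vma) && !userfaultfd_minor(vma))
		return 0;

	folio = vma->vm_ops->uffd_ops->get_folio_noalloc(inode, vmf->pgoff);
	if (IS_ERR_OR_NULL(folio)) {
		if (userfaultfd_missing(vma))
			return handle_userfault(vmf, VM_UFFD_MISSING);
		return 0;
	}

	folio_unlock(folio);
	folio_put(folio);
	if (userfaultfd_minor(vma))
		return handle_userfault(vmf, VM_UFFD_MINOR);

	return 0;
}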
> > +
> > ret = vma->vm_ops->fault(vmf);
> > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
> > - VM_FAULT_DONE_COW | VM_FAULT_UFFD_MINOR |
> > - VM_FAULT_UFFD_MISSING))) {
> > - if (ret & VM_FAULT_UFFD_MINOR)
> > - return handle_userfault(vmf, VM_UFFD_MINOR);
> > - if (ret & VM_FAULT_UFFD_MISSING)
> > - return handle_userfault(vmf, VM_UFFD_MISSING);
> > + VM_FAULT_DONE_COW)))
> > return ret;
> > - }
> > folio = page_folio(vmf->page);
> > if (unlikely(PageHWPoison(vmf->page))) {
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index eafd7986fc2ec..5286f28b3e443 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2484,13 +2484,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> > fault_mm = vma ? vma->vm_mm : NULL;
> > folio = filemap_get_entry(inode->i_mapping, index);
> > - if (folio && vma && userfaultfd_minor(vma)) {
> > - if (!xa_is_value(folio))
> > - folio_put(folio);
> > - *fault_type = VM_FAULT_UFFD_MINOR;
> > - return 0;
> > - }
> > -
> > if (xa_is_value(folio)) {
> > error = shmem_swapin_folio(inode, index, &folio,
> > sgp, gfp, vma, fault_type);
> > @@ -2535,11 +2528,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> > * Fast cache lookup and swap lookup did not find it: allocate.
> > */
> > - if (vma && userfaultfd_missing(vma)) {
> > - *fault_type = VM_FAULT_UFFD_MISSING;
> > - return 0;
> > - }
> > -
> > /* Find hugepage orders that are allowed for anonymous shmem and tmpfs. */
> > orders = shmem_allowable_huge_orders(inode, vma, index, write_end, false);
> > if (orders > 0) {
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 14cca057fc0ec..bd0de685f42f8 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -421,26 +421,6 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
> > folio = __filemap_get_folio(inode->i_mapping, vmf->pgoff,
> > FGP_LOCK | FGP_ACCESSED, 0);
> > - if (userfaultfd_armed(vmf->vma)) {
> > - /*
> > - * If userfaultfd is registered in minor mode and a folio
> > - * exists, return VM_FAULT_UFFD_MINOR to trigger the
> > - * userfaultfd handler.
> > - */
> > - if (userfaultfd_minor(vmf->vma) && !IS_ERR_OR_NULL(folio)) {
> > - ret = VM_FAULT_UFFD_MINOR;
> > - goto out_folio;
> > - }
> > -
> > - /*
> > - * Check if userfaultfd is registered in missing mode. If so,
> > - * check if a folio exists in the page cache. If not, return
> > - * VM_FAULT_UFFD_MISSING to trigger the userfaultfd handler.
> > - */
> > - if (userfaultfd_missing(vmf->vma) && IS_ERR_OR_NULL(folio))
> > - return VM_FAULT_UFFD_MISSING;
> > - }
> > -
> > /* folio not in the pagecache, try to allocate */
> > if (IS_ERR(folio))
> > folio = __kvm_gmem_folio_alloc(inode, vmf->pgoff);
>
> That looks better in general. We should likely find a better/more consistent
> name for fault_process_userfaultfd().
__do_userfault()? :)
> --
> Cheers,
>
> David
--
Sincerely yours,
Mike.
* Re: [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd
2026-02-11 6:04 ` Mike Rapoport
@ 2026-02-11 9:52 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 9:52 UTC (permalink / raw)
To: Mike Rapoport
Cc: Peter Xu, linux-mm, Andrea Arcangeli, Andrew Morton,
Axel Rasmussen, Baolin Wang, Hugh Dickins, James Houghton,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Muchun Song,
Nikita Kalyazin, Oscar Salvador, Paolo Bonzini,
Sean Christopherson, Shuah Khan, Suren Baghdasaryan,
Vlastimil Babka, linux-kernel, kvm, linux-kselftest
>>> -
>>> /* folio not in the pagecache, try to allocate */
>>> if (IS_ERR(folio))
>>> folio = __kvm_gmem_folio_alloc(inode, vmf->pgoff);
>>
>> That looks better in general. We should likely find a better/more consistent
>> name for fault_process_userfaultfd().
>
> __do_userfault()? :)
Yes, that matches the naming scheme :)
--
Cheers,
David