Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Nikita Kalyazin <kalyazin@amazon.com>
To: Peter Xu <peterx@redhat.com>
Cc: <akpm@linux-foundation.org>, <pbonzini@redhat.com>,
	<shuah@kernel.org>, <viro@zeniv.linux.org.uk>,
	<brauner@kernel.org>, <muchun.song@linux.dev>, <hughd@google.com>,
	<kvm@vger.kernel.org>, <linux-kselftest@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
	<linux-fsdevel@vger.kernel.org>, <jack@suse.cz>,
	<lorenzo.stoakes@oracle.com>, <Liam.Howlett@oracle.com>,
	<jannh@google.com>, <ryan.roberts@arm.com>, <david@redhat.com>,
	<jthoughton@google.com>, <graf@amazon.de>, <jgowans@amazon.com>,
	<roypat@amazon.co.uk>, <derekmn@amazon.com>, <nsaenz@amazon.es>,
	<xmarcalx@amazon.com>
Subject: Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
Date: Fri, 20 Jun 2025 13:00:24 +0100	[thread overview]
Message-ID: <2097f155-c459-40e1-93e8-3d501ae66b42@amazon.com> (raw)
In-Reply-To: <aEl9CNGLY0Sil7nq@x1.local>

On 11/06/2025 13:56, Peter Xu wrote:
> On Wed, Jun 11, 2025 at 01:09:32PM +0100, Nikita Kalyazin wrote:
>>
>>
>> On 10/06/2025 23:22, Peter Xu wrote:
>>> On Fri, Apr 04, 2025 at 03:43:47PM +0000, Nikita Kalyazin wrote:
>>>> Remove shmem-specific code from UFFDIO_CONTINUE implementation for
>>>> non-huge pages by calling vm_ops->fault().  A new VMF flag,
>>>> FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid recursive call to
>>>> handle_userfault().
>>>
>>> It's not clear yet on why this is needed to be generalized out of the blue.
>>>
>>> Some mentioning of guest_memfd use case might help for other reviewers, or
>>> some mention of the need to introduce userfaultfd support in kernel
>>> modules.
>>
>> Hi Peter,
>>
>> Sounds fair, thank you.
>>
>>>>
>>>> Suggested-by: James Houghton <jthoughton@google.com>
>>>> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
>>>> ---
>>>>    include/linux/mm_types.h |  4 ++++
>>>>    mm/hugetlb.c             |  2 +-
>>>>    mm/shmem.c               |  9 ++++++---
>>>>    mm/userfaultfd.c         | 37 +++++++++++++++++++++++++++----------
>>>>    4 files changed, 38 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 0234f14f2aa6..2f26ee9742bf 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
>>>>     * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
>>>>     *                        We should only access orig_pte if this flag set.
>>>>     * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
>>>> + * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
>>>> + *                                 minor handler as it is being called by the
>>>> + *                                 userfaultfd code itself.
>>>
>>> We probably shouldn't leak the "CONTINUE" concept to mm core if possible,
>>> as it's not easy to follow when without userfault minor context.  It might
>>> be better to use generic terms like NO_USERFAULT.
>>
>> Yes, I agree, can name it more generically.
>>
>>> Said that, I wonder if we'll need to add a vm_ops anyway in the latter
>>> patch, whether we can also avoid reusing fault() but instead resolve the
>>> page faults using the vm_ops hook too.  That might be helpful because then
>>> we can avoid this new FAULT_FLAG_* that is totally not useful to
>>> non-userfault users, meanwhile we also don't need to hand-cook the vm_fault
>>> struct below just to suite the current fault() interfacing.
>>
>> I'm not sure I fully understand that.  Calling fault() op helps us reuse the
>> FS specifics when resolving the fault.  I get that the new op can imply the
>> userfault flag so the flag doesn't need to be exposed to mm, but doing so
>> will bring duplication of the logic within FSes between this new op and the
>> fault(), unless we attempt to factor common parts out.  For example, for
>> shmem_get_folio_gfp(), we would still need to find a way to suppress the
>> call to handle_userfault() when shmem_get_folio_gfp() is called from the new
>> op.  Is that what you're proposing?
> 
> Yes it is what I was proposing.  shmem_get_folio_gfp() always has that
> handling when vmf==NULL, then vma==NULL and userfault will be skipped.
> 
> So what I was thinking is one vm_ops.userfaultfd_request(req), where req
> can be:
> 
>    (1) UFFD_REQ_GET_SUPPORTED: this should, for existing RAM-FSes return
>        both MISSING/WP/MINOR.  Here WP should mean sync-wp tracking, async
>        was so far by default almost supported everywhere except
>        VM_DROPPABLE. For guest-memfd in the future, we can return MINOR only
>        as of now (even if I think it shouldn't be hard to support the rest
>        two..).
> 
>    (2) UFFD_REQ_FAULT_RESOLVE: this should play the fault() role but well
>        defined to suite userfault's need on fault resolutions.  It likely
>        doesn't need vmf as the parameter, but likely (when anon isn't taking
>        into account, after all anon have vm_ops==NULL..) the inode and
>        offsets, perhaps some flag would be needed to identify MISSING or
>        MINOR faults, for example.
> 
> Maybe some more.
> 
> I was even thinking whether we could merge hugetlb into the picture too on
> generalize its fault resolutions.  Hugetlb was always special, maye this is
> a chance too to make it generalized, but it doesn't need to happen in one
> shot even if it could work.  We could start with shmem.
> 
> So this does sound like slightly involved, and I'm not yet 100% sure this
> will work, but likely.  If you want, I can take a stab at this this week or
> next just to see whether it'll work in general.  I also don't expect this
> to depend on guest-memfd at all - it can be alone a refactoring making
> userfault module-ready.

Thanks for explaining that.  I played a bit with it myself and it 
appears to be working for the MISSING mode for both shmem and 
guest_memfd.  Attaching my sketch below.  Please let me know if this is 
how you see it.

I found that arguments and return values are significantly different 
between the two request types, which may be a bit confusing, although we 
do not expect many callers of those.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8483e09aeb2c..eb30b23b24d3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -603,6 +603,16 @@ struct vm_fault {
  					 */
  };

+#ifdef CONFIG_USERFAULTFD
+/*
+ * Used by userfaultfd_request().
+ */
+enum uffd_req {
+	UFFD_REQ_GET_SUPPORTED, /* query supported userfaulfd modes */
+	UFFD_REQ_FAULT_RESOLVE, /* request fault resolution */
+};
+#endif
+
  /*
   * These are the virtual MM functions - opening of an area, closing and
   * unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -680,6 +690,22 @@ struct vm_operations_struct {
  	 */
  	struct page *(*find_special_page)(struct vm_area_struct *vma,
  					  unsigned long addr);
+
+#ifdef CONFIG_USERFAULTFD
+	/*
+	 * Called by the userfaultfd code to query supported modes or request
+	 * fault resolution.
+	 * If called with req UFFD_REQ_GET_SUPPORTED, it returns a bitmask
+	 * of modes as in struct uffdio_register.  No other arguments are
+	 * used.
+	 * If called with req UFFD_REQ_FAULT_RESOLVE, it resolves the fault
+	 * using the mode specified in the mode argument. The inode, pgoff and
+	 * foliop arguments must be set accordingly.
+	 */
+	int (*userfaultfd_request)(enum uffd_req req, int mode,
+				   struct inode *inode, pgoff_t pgoff,
+				   struct folio **foliop);
+#endif
  };

  #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 75342022d144..1cabb925da0e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -222,7 +222,11 @@ static inline bool vma_can_userfault(struct 
vm_area_struct *vma,
  		return false;

  	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+	    (!is_vm_hugetlb_page(vma) &&
+	     !vma->vm_ops->userfaultfd_request &&
+	     !(vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+						NULL, 0, NULL) &
+	           UFFDIO_REGISTER_MODE_MINOR)))
  		return false;

  	/*
@@ -243,8 +247,11 @@ static inline bool vma_can_userfault(struct 
vm_area_struct *vma,
  #endif

  	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	    vma_is_shmem(vma);
+	return vma_is_anonymous(vma) ||
+	    is_vm_hugetlb_page(vma) ||
+	    (vma->vm_ops->userfaultfd_request &&
+	     vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0, NULL,
+					      0, NULL));
  }

  static inline bool vma_has_uffd_without_event_remap(struct 
vm_area_struct *vma)
diff --git a/mm/shmem.c b/mm/shmem.c
index 1ede0800e846..a5b5c4131dcf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5203,6 +5203,40 @@ static int shmem_error_remove_folio(struct 
address_space *mapping,
  	return 0;
  }

+#ifdef CONFIG_USERFAULTFD
+static int shmem_userfaultfd_request(enum uffd_req req, int mode,
+				     struct inode *inode, pgoff_t pgoff,
+				     struct folio **foliop)
+{
+	int ret;
+
+	switch (req) {
+	case UFFD_REQ_GET_SUPPORTED:
+		ret =
+		    UFFDIO_REGISTER_MODE_MISSING |
+		    UFFDIO_REGISTER_MODE_WP |
+		    UFFDIO_REGISTER_MODE_MINOR;
+		break;
+	case UFFD_REQ_FAULT_RESOLVE:
+		ret = shmem_get_folio(inode, pgoff, 0, foliop, SGP_NOALLOC);
+		if (ret == -ENOENT)
+			ret = -EFAULT;
+		if (ret)
+			break;
+		if (!*foliop) {
+			ret = -EFAULT;
+			break;
+		}
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+#endif
+
  static const struct address_space_operations shmem_aops = {
  	.writepage	= shmem_writepage,
  	.dirty_folio	= noop_dirty_folio,
@@ -5306,6 +5340,9 @@ static const struct vm_operations_struct 
shmem_vm_ops = {
  	.set_policy     = shmem_set_policy,
  	.get_policy     = shmem_get_policy,
  #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_request = shmem_userfaultfd_request,
+#endif
  };

  static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5315,6 +5352,9 @@ static const struct vm_operations_struct 
shmem_anon_vm_ops = {
  	.set_policy     = shmem_set_policy,
  	.get_policy     = shmem_get_policy,
  #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_request = shmem_userfaultfd_request,
+#endif
  };

  int shmem_init_fs_context(struct fs_context *fc)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d06453fa8aba..efc150bf5691 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -392,16 +392,18 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
  	struct page *page;
  	int ret;

-	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
-	/* Our caller expects us to return -EFAULT if we failed to find folio */
-	if (ret == -ENOENT)
-		ret = -EFAULT;
+	if (!dst_vma->vm_ops->userfaultfd_request ||
+	    !(dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+						   NULL, 0, NULL) &
+	      UFFDIO_REGISTER_MODE_MINOR)) {
+		return -EFAULT;
+	}
+
+	ret = dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_FAULT_RESOLVE,
+						   UFFDIO_REGISTER_MODE_MINOR,
+						   inode, pgoff, &folio);
  	if (ret)
  		goto out;
-	if (!folio) {
-		ret = -EFAULT;
-		goto out;
-	}

  	page = folio_file_page(folio, pgoff);
  	if (PageHWPoison(page)) {
@@ -770,10 +772,10 @@ static __always_inline ssize_t mfill_atomic(struct 
userfaultfd_ctx *ctx,
  		return  mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
  					     src_start, len, flags);

-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
-		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) &&
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+	if (!vma_is_anonymous(dst_vma) &&
+            (!dst_vma->vm_ops->userfaultfd_request ||
+	     (!dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+						    NULL, 0, NULL))))
  		goto out_unlock;

  	while (src_addr < src_start + len) {

> 
> Thanks,
> 
> --
> Peter Xu
>

next prev parent reply	other threads:[~2025-06-20 12:00 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-04 15:43 [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Nikita Kalyazin
2025-04-04 15:43 ` [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs Nikita Kalyazin
2025-06-10 22:22   ` Peter Xu
2025-06-11 12:09     ` Nikita Kalyazin
2025-06-11 12:56       ` Peter Xu
2025-06-20 12:00         ` Nikita Kalyazin [this message]
2025-06-20 15:21           ` Peter Xu
2025-06-20 16:51             ` Nikita Kalyazin
2025-04-04 15:43 ` [PATCH v3 2/6] mm: provide can_userfault vma operation Nikita Kalyazin
2025-04-04 15:43 ` [PATCH v3 3/6] mm: userfaultfd: use " Nikita Kalyazin
2025-04-04 15:43 ` [PATCH v3 4/6] KVM: guest_memfd: add support for userfaultfd minor Nikita Kalyazin
2025-06-10 22:25   ` Peter Xu
2025-06-11 12:09     ` Nikita Kalyazin
2025-04-04 15:43 ` [PATCH v3 5/6] mm: userfaultfd: add UFFD_FEATURE_MINOR_GUEST_MEMFD Nikita Kalyazin
2025-04-04 15:43 ` [PATCH v3 6/6] KVM: selftests: test userfaultfd minor for guest_memfd Nikita Kalyazin
2025-04-04 16:33 ` [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Lorenzo Stoakes
2025-04-04 16:56   ` Nikita Kalyazin
2025-04-04 16:59     ` Lorenzo Stoakes
2025-04-04 17:12 ` Liam R. Howlett
2025-04-07 11:04   ` Nikita Kalyazin
2025-04-07 13:40     ` Liam R. Howlett
2025-04-07 14:04       ` Nikita Kalyazin
2025-04-07 14:24         ` Liam R. Howlett
2025-04-07 14:46           ` David Hildenbrand
2025-04-07 15:14             ` Lorenzo Stoakes
2025-04-07 15:26               ` David Hildenbrand
2025-04-08  8:20             ` Christian Brauner
2025-04-08 13:15               ` Ackerley Tng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2097f155-c459-40e1-93e8-3d501ae66b42@amazon.com \
    --to=kalyazin@amazon.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=david@redhat.com \
    --cc=derekmn@amazon.com \
    --cc=graf@amazon.de \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgowans@amazon.com \
    --cc=jthoughton@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=muchun.song@linux.dev \
    --cc=nsaenz@amazon.es \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=roypat@amazon.co.uk \
    --cc=ryan.roberts@arm.com \
    --cc=shuah@kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=xmarcalx@amazon.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox