linux-mm.kvack.org archive mirror
From: James Houghton <jthoughton@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: Nikita Kalyazin <kalyazin@amazon.com>,
	akpm@linux-foundation.org, pbonzini@redhat.com,
	 shuah@kernel.org, kvm@vger.kernel.org,
	linux-kselftest@vger.kernel.org,  linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, lorenzo.stoakes@oracle.com,
	 david@redhat.com, ryan.roberts@arm.com,
	quic_eberman@quicinc.com,  graf@amazon.de, jgowans@amazon.com,
	roypat@amazon.co.uk, derekmn@amazon.com,  nsaenz@amazon.es,
	xmarcalx@amazon.com
Subject: Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing
Date: Wed, 5 Mar 2025 11:35:27 -0800
Message-ID: <CADrL8HXOQ=RuhjTEmMBJrWYkcBaGrqtXmhzPDAo1BE3EWaBk4g@mail.gmail.com>
In-Reply-To: <Z8YfOVYvbwlZST0J@x1.local>

On Mon, Mar 3, 2025 at 1:29 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Mar 03, 2025 at 01:30:06PM +0000, Nikita Kalyazin wrote:
> > This series is built on top of the v3 write syscall support [1].
> >
> > With James's KVM userfault [2], it is possible to handle stage-2 faults
> > in guest_memfd in userspace.  However, KVM itself also triggers faults
> > in guest_memfd in some cases, for example: PV interfaces like kvmclock,
> > PV EOI and page table walking code when fetching the MMIO instruction on
> > x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
> > that KVM would be accessing those pages via userspace page tables.  In
> > order for such faults to be handled in userspace, guest_memfd needs to
> > support userfaultfd.
> >
> > This series proposes limited support for userfaultfd in guest_memfd:
> >  - userfaultfd support is conditional on `CONFIG_KVM_GMEM_SHARED_MEM`
> >    (as is fault support in general)
> >  - Only the `page missing` event is currently supported
> >  - Userspace is expected to respond to the event with the `write`
> >    syscall followed by the `UFFDIO_CONTINUE` ioctl to unblock the
> >    faulting process.  Note that we can't use `UFFDIO_COPY` here
> >    because the userfaultfd code does not know how to prepare
> >    guest_memfd pages, e.g. remove them from the direct map [4].
> >
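To restate that proposed flow in code, here is a hypothetical userspace
sketch (the uffd_msg read and the fault-address-to-offset translation
are elided, and I'm assuming the write support this series builds on
accepts pwrite() at an offset):

#include <err.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Fill a missing gmem page, then unblock the faulting thread. */
static void resolve_missing(int uffd, int gmem_fd, void *page_buf,
                            size_t page_size, off_t fault_offset,
                            unsigned long fault_addr)
{
        /* 1) Prepare the page contents via the gmem write() support. */
        if (pwrite(gmem_fd, page_buf, page_size, fault_offset) !=
            (ssize_t)page_size)
                err(1, "pwrite");

        /*
         * 2) The page is now in the page cache, so UFFDIO_CONTINUE only
         * has to install the PTE and wake the faulting thread.
         */
        struct uffdio_continue cont = {
                .range = { .start = fault_addr, .len = page_size },
        };
        if (ioctl(uffd, UFFDIO_CONTINUE, &cont))
                err(1, "UFFDIO_CONTINUE");
}
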
> > Not included in this series:
> >  - Proper interface for userfaultfd to recognise guest_memfd mappings
> >  - Proper handling of truncation cases after locking the page
> >
> > Request for comments:
> >  - Is it a sensible workflow for guest_memfd to resolve a userfaultfd
> >    `page missing` event with the `write` syscall + `UFFDIO_CONTINUE`?
> >    An alternative is teaching `UFFDIO_COPY` how to deal with
> >    guest_memfd pages.
>
> Probably not.  I don't see what protects against a thread faulting
> concurrently while write() is in progress and seeing partial data.  Since
> you only check the page cache, such a fault would pass the check, and the
> partially written page would be faulted in.

+1 here.

I think the simplest way to make this work would be to also check
folio_test_uptodate() in the userfaultfd_missing() check[1]. It would
pair with kvm_gmem_mark_prepared() in the write() path[2].

I'm not sure if that's the "right" way, but I think it would prevent
threads from reading data while it is still being written.

[1]: https://lore.kernel.org/kvm/20250303133011.44095-3-kalyazin@amazon.com/
[2]: https://lore.kernel.org/kvm/20250303130838.28812-2-kalyazin@amazon.com/
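
Concretely, I'm imagining something like this in the gmem fault
handler (a sketch only; names and locking are simplified relative to
the actual patch in [1]):

static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
{
        struct inode *inode = file_inode(vmf->vma->vm_file);
        struct folio *folio;

        folio = filemap_lock_folio(inode->i_mapping, vmf->pgoff);

        /*
         * A present-but-not-uptodate folio means write() hasn't finished
         * preparing it; treat it like a missing page and let userspace
         * resolve the fault rather than mapping partial data.
         */
        if ((IS_ERR(folio) || !folio_test_uptodate(folio)) &&
            userfaultfd_missing(vmf->vma)) {
                if (!IS_ERR(folio)) {
                        folio_unlock(folio);
                        folio_put(folio);
                }
                return handle_userfault(vmf, VM_UFFD_MISSING);
        }

        /* ... normal allocation/mapping path continues here ... */
}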

> I think we may need to either go with full MISSING or full MINOR traps.

I agree, and just to clarify: you've basically implemented the MISSING
model, just using write() to resolve userfaults instead of
UFFDIO_COPY. The UFFDIO_CONTINUE implementation you have isn't really
doing much; when the page cache has a page, the fault handler will
populate the PTE for you.

I think it's probably simpler to implement the MINOR model, where
userspace can populate the page cache however it wants; write() is
perfectly fine/natural. UFFDIO_CONTINUE just needs to populate PTEs
for gmem, and the fault handler needs to check for the presence of
PTEs. The `struct vm_fault` you have should contain enough info.
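
In other words, the fault handler's check could be as small as this
(a sketch; it mirrors what shmem and hugetlb already do for
VM_UFFD_MINOR):

        /* In the gmem fault handler, after the page cache lookup: */
        if (!IS_ERR(folio) && userfaultfd_minor(vmf->vma)) {
                /* Folio present but no PTE: this is a MINOR fault. */
                folio_unlock(folio);
                folio_put(folio);
                return handle_userfault(vmf, VM_UFFD_MINOR);
        }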

> One thing to mention is that we probably need MINOR sooner or later to
> support gmem huge pages.  The thing is, for huge folios in gmem we can't
> rely on a page cache miss, as we always need to allocate in hugetlb sizes.
>
> >  - What is the way forward to make userfaultfd code aware of guest_memfd?
> >    I saw that Patrick hit a somewhat similar problem in [5] when trying
> >    to use direct map manipulation functions in KVM; David pointed him at
> >    Elliot's guestmem library [6], which might include a shim for that.
> >    Would the library be the right place to expose the required
> >    interfaces, like `vma_is_gmem`?
>
> Not sure what's best to do here, but IIUC the approach this series takes
> may not work, as one can't reference a KVM symbol from core mm.
>
> One trick I have used so far is leveraging vm_ops and providing a hook
> function to report gmem-specific properties.  I did not yet dare to
> overload vm_area_struct itself, but I'm thinking vm_ops might be more
> acceptable.  E.g. something like this:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5e742738240c..b068bb79fdbc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -653,8 +653,26 @@ struct vm_operations_struct {
>          */
>         struct page *(*find_special_page)(struct vm_area_struct *vma,
>                                           unsigned long addr);
> +       /*
> +        * When set, returns the bitmask of folio orders allowed for
> +        * faults in mmap() ranges (e.g. for follow-up huge_fault()).
> +        * Drivers can use this to bypass THP setup for specific VMAs.
> +        */
> +       unsigned long (*get_supported_orders)(struct vm_area_struct *vma);
>  };
>
> +static inline bool vma_has_supported_orders(struct vm_area_struct *vma)
> +{
> +       return vma->vm_ops && vma->vm_ops->get_supported_orders;
> +}
> +
> +static inline unsigned long vma_get_supported_orders(struct vm_area_struct *vma)
> +{
> +       if (!vma_has_supported_orders(vma))
> +               return 0;
> +       return vma->vm_ops->get_supported_orders(vma);
> +}
> +
>
> In my case I used that to let gmem report huge page support on faults.
>
> That said, the above only exists in my own tree so far, so I also don't
> know whether something like that would be accepted (even if it works for
> you).
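
FWIW, the gmem side of a hook like that looks simple enough.
Something like this sketch, where kvm_gmem_supported_orders() is a
hypothetical helper that knows what the backing store can allocate:

static unsigned long kvm_gmem_get_supported_orders(struct vm_area_struct *vma)
{
        /* E.g. BIT(PMD_ORDER) | BIT(0) for a hugepage-backed gmem. */
        return kvm_gmem_supported_orders(file_inode(vma->vm_file));
}

static const struct vm_operations_struct kvm_gmem_vm_ops = {
        .fault                  = kvm_gmem_fault,
        .get_supported_orders   = kvm_gmem_get_supported_orders,
};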

I think it might be useful to implement an fs-generic MINOR mode. The
fault handler is already easy enough to do generically (though it
would become more difficult to determine whether a "MINOR" fault is
actually a MISSING fault; at least for my userspace, the distinction
isn't important. :)) So the question becomes: what should
UFFDIO_CONTINUE look like?

And I think it would be nice if UFFDIO_CONTINUE just called
vm_ops->fault() to get the page we want to map and then mapped it,
instead of having shmem-specific and hugetlb-specific versions (though
maybe we need to keep the hugetlb specialization...). That would avoid
putting kvm/gmem/etc. symbols in mm/userfaultfd code.
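
Roughly like this (a sketch only; the locking, refcounting, and
VM_FAULT_RETRY handling need much more care, and I'm going from
memory on the mfill_atomic_install_pte() signature):

static int mfill_continue_generic(pmd_t *dst_pmd,
                                  struct vm_area_struct *dst_vma,
                                  unsigned long dst_addr,
                                  uffd_flags_t flags)
{
        struct vm_fault vmf = {
                .vma = dst_vma,
                .address = dst_addr,
                .pgoff = linear_page_index(dst_vma, dst_addr),
                .flags = FAULT_FLAG_USER,
        };
        vm_fault_t ret;

        /* Ask the VMA's own fault handler to find (or create) the page. */
        ret = dst_vma->vm_ops->fault(&vmf);
        if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY))
                return -EFAULT;

        /* vmf.page holds a reference; install the PTE as CONTINUE does. */
        return mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
                                        vmf.page, false, flags);
}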

I've actually wanted to do this for a while but haven't had a good
reason to pursue it. I wonder if it can be done in a
backwards-compatible fashion...


Thread overview: 22+ messages
2025-03-03 13:30 [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 1/5] KVM: guest_memfd: add kvm_gmem_vma_is_gmem Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 2/5] KVM: guest_memfd: add support for uffd missing Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 3/5] mm: userfaultfd: allow to register userfaultfd for guest_memfd Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 4/5] mm: userfaultfd: support continue " Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 5/5] KVM: selftests: add uffd missing test " Nikita Kalyazin
2025-03-03 21:29 ` [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing Peter Xu
2025-03-05 19:35   ` James Houghton [this message]
2025-03-05 20:29     ` Peter Xu
2025-03-10 18:12       ` Nikita Kalyazin
2025-03-10 19:57         ` Peter Xu
2025-03-11 16:56           ` Nikita Kalyazin
2025-03-12 15:45             ` Peter Xu
2025-03-12 17:07               ` Nikita Kalyazin
2025-03-12 19:32                 ` Peter Xu
2025-03-13 15:25                   ` Nikita Kalyazin
2025-03-13 19:12                     ` Peter Xu
2025-03-13 22:13                       ` Nikita Kalyazin
2025-03-13 22:38                         ` Peter Xu
2025-03-14 17:12                           ` Nikita Kalyazin
2025-03-14 18:32                             ` Peter Xu
2025-03-14 20:04                             ` Peter Xu