From: James Houghton <jthoughton@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: Nikita Kalyazin <kalyazin@amazon.com>,
akpm@linux-foundation.org, pbonzini@redhat.com,
shuah@kernel.org, kvm@vger.kernel.org,
linux-kselftest@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, lorenzo.stoakes@oracle.com,
david@redhat.com, ryan.roberts@arm.com,
quic_eberman@quicinc.com, graf@amazon.de, jgowans@amazon.com,
roypat@amazon.co.uk, derekmn@amazon.com, nsaenz@amazon.es,
xmarcalx@amazon.com
Subject: Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing
Date: Wed, 5 Mar 2025 11:35:27 -0800 [thread overview]
Message-ID: <CADrL8HXOQ=RuhjTEmMBJrWYkcBaGrqtXmhzPDAo1BE3EWaBk4g@mail.gmail.com> (raw)
In-Reply-To: <Z8YfOVYvbwlZST0J@x1.local>
On Mon, Mar 3, 2025 at 1:29 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Mar 03, 2025 at 01:30:06PM +0000, Nikita Kalyazin wrote:
> > This series is built on top of the v3 write syscall support [1].
> >
> > With James's KVM userfault [2], it is possible to handle stage-2 faults
> > in guest_memfd in userspace. However, KVM itself also triggers faults
> > in guest_memfd in some cases, for example: PV interfaces like kvmclock,
> > PV EOI and page table walking code when fetching the MMIO instruction on
> > x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
> > that KVM would be accessing those pages via userspace page tables. In
> > order for such faults to be handled in userspace, guest_memfd needs to
> > support userfaultfd.
> >
> > This series proposes a limited support for userfaultfd in guest_memfd:
> > - userfaultfd support is conditional to `CONFIG_KVM_GMEM_SHARED_MEM`
> > (as is fault support in general)
> > - Only `page missing` event is currently supported
> > - Userspace is supposed to respond to the event with the `write`
> > syscall followed by `UFFDIO_CONTINUE` ioctl to unblock the faulting
> > process. Note that we can't use `UFFDIO_COPY` here because
> > userfaulfd code does not know how to prepare guest_memfd pages, eg
> > remove them from direct map [4].
> >
> > Not included in this series:
> > - Proper interface for userfaultfd to recognise guest_memfd mappings
> > - Proper handling of truncation cases after locking the page
> >
> > Request for comments:
> > - Is it a sensible workflow for guest_memfd to resolve a userfault
> > `page missing` event with `write` syscall + `UFFDIO_CONTINUE`? One
> > of the alternatives is teaching `UFFDIO_COPY` how to deal with
> > guest_memfd pages.
>
> Probably not.. I don't see what protects a thread fault concurrently
> during write() happening, seeing partial data. Since you check the page
> cache it'll let it pass, but the partial page will be faulted in there.
+1 here.
I think the simplest way to make it work would be to also check
folio_test_uptodate() in the userfaultfd_missing() check[1]. It would
pair with kvm_gmem_mark_prepared() in the write() path[2].
I'm not sure if that's the "right" way, I think it would prevent
threads from reading data as it is written.
[1]: https://lore.kernel.org/kvm/20250303133011.44095-3-kalyazin@amazon.com/
[2]: https://lore.kernel.org/kvm/20250303130838.28812-2-kalyazin@amazon.com/
> I think we may need to either go with full MISSING or full MINOR traps.
I agree, and just to clarify: you've basically implemented the MISSING
model, just using write() to resolve userfaults instead of
UFFDIO_COPY. The UFFDIO_CONTINUE implementation you have isn't really
doing much; when the page cache has a page, the fault handler will
populate the PTE for you.
I think it's probably simpler to implement the MINOR model, where
userspace can populate the page cache however it wants; write() is
perfectly fine/natural. UFFDIO_CONTINUE just needs to populate PTEs
for gmem, and the fault handler needs to check for the presence of
PTEs. The `struct vm_fault` you have should contain enough info.
> One thing to mention is we probably need MINOR sooner or later to support
> gmem huge pages. The thing is for huge folios in gmem we can't rely on
> missing in page cache, as we always need to allocate in hugetlb sizes.
>
> > - What is a way forward to make userfaultfd code aware of guest_memfd?
> > I saw that Patrick hit a somewhat similar problem in [5] when trying
> > to use direct map manipulation functions in KVM and was pointed by
> > David at Elliot's guestmem library [6] that might include a shim for that.
> > Would the library be the right place to expose required interfaces like
> > `vma_is_gmem`?
>
> Not sure what's the best to do, but IIUC the current way this series uses
> may not work as long as one tries to reference a kvm symbol from core mm..
>
> One trick I used so far is leveraging vm_ops and provide hook function to
> report specialties when it's gmem. In general, I did not yet dare to
> overload vm_area_struct, but I'm thinking maybe vm_ops is more possible to
> be accepted. E.g. something like this:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5e742738240c..b068bb79fdbc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -653,8 +653,26 @@ struct vm_operations_struct {
> */
> struct page *(*find_special_page)(struct vm_area_struct *vma,
> unsigned long addr);
> + /*
> + * When set, return the allowed orders bitmask in faults of mmap()
> + * ranges (e.g. for follow up huge_fault() processing). Drivers
> + * can use this to bypass THP setups for specific types of VMAs.
> + */
> + unsigned long (*get_supported_orders)(struct vm_area_struct *vma);
> };
>
> +static inline bool vma_has_supported_orders(struct vm_area_struct *vma)
> +{
> + return vma->vm_ops && vma->vm_ops->get_supported_orders;
> +}
> +
> +static inline unsigned long vma_get_supported_orders(struct vm_area_struct *vma)
> +{
> + if (!vma_has_supported_orders(vma))
> + return 0;
> + return vma->vm_ops->get_supported_orders(vma);
> +}
> +
>
> In my case I used that to allow gmem report huge page supports on faults.
>
> Said that, above only existed in my own tree so far, so I also don't know
> whether something like that could be accepted (even if it'll work for you).
I think it might be useful to implement an fs-generic MINOR mode. The
fault handler is already easy enough to do generically (though it
would become more difficult to determine if the "MINOR" fault is
actually a MISSING fault, but at least for my userspace, the
distinction isn't important. :)) So the question becomes: what should
UFFDIO_CONTINUE look like?
And I think it would be nice if UFFDIO_CONTINUE just called
vm_ops->fault() to get the page we want to map and then mapped it,
instead of having shmem-specific and hugetlb-specific versions (though
maybe we need to keep the hugetlb specialization...). That would avoid
putting kvm/gmem/etc. symbols in mm/userfaultfd code.
I've actually wanted to do this for a while but haven't had a good
reason to pursue it. I wonder if it can be done in a
backwards-compatible fashion...
next prev parent reply other threads:[~2025-03-05 19:36 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-03 13:30 Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 1/5] KVM: guest_memfd: add kvm_gmem_vma_is_gmem Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 2/5] KVM: guest_memfd: add support for uffd missing Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 3/5] mm: userfaultfd: allow to register userfaultfd for guest_memfd Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 4/5] mm: userfaultfd: support continue " Nikita Kalyazin
2025-03-03 13:30 ` [RFC PATCH 5/5] KVM: selftests: add uffd missing test " Nikita Kalyazin
2025-03-03 21:29 ` [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing Peter Xu
2025-03-05 19:35 ` James Houghton [this message]
2025-03-05 20:29 ` Peter Xu
2025-03-10 18:12 ` Nikita Kalyazin
2025-03-10 19:57 ` Peter Xu
2025-03-11 16:56 ` Nikita Kalyazin
2025-03-12 15:45 ` Peter Xu
2025-03-12 17:07 ` Nikita Kalyazin
2025-03-12 19:32 ` Peter Xu
2025-03-13 15:25 ` Nikita Kalyazin
2025-03-13 19:12 ` Peter Xu
2025-03-13 22:13 ` Nikita Kalyazin
2025-03-13 22:38 ` Peter Xu
2025-03-14 17:12 ` Nikita Kalyazin
2025-03-14 18:32 ` Peter Xu
2025-03-14 20:04 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CADrL8HXOQ=RuhjTEmMBJrWYkcBaGrqtXmhzPDAo1BE3EWaBk4g@mail.gmail.com' \
--to=jthoughton@google.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=derekmn@amazon.com \
--cc=graf@amazon.de \
--cc=jgowans@amazon.com \
--cc=kalyazin@amazon.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=nsaenz@amazon.es \
--cc=pbonzini@redhat.com \
--cc=peterx@redhat.com \
--cc=quic_eberman@quicinc.com \
--cc=roypat@amazon.co.uk \
--cc=ryan.roberts@arm.com \
--cc=shuah@kernel.org \
--cc=xmarcalx@amazon.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox