* [PATCH RFC 0/3] KVM: guest_memfd: Rework preparation/population flows in prep for in-place conversion
@ 2025-11-13 23:07 Michael Roth
From: Michael Roth @ 2025-11-13 23:07 UTC (permalink / raw)
To: kvm
Cc: linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini,
seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve,
ackerleytng, aik, ira.weiny, yan.y.zhao
This patchset is also available at:
https://github.com/AMDESE/linux/tree/gmem-populate-rework-rfc1
and is based on top of kvm-x86/next (kvm-x86-next-2025.11.07)
Overview
--------
Yan previously posted a series[1] that reworked kvm_gmem_populate() to deal
with potential locking issues that might arise once in-place conversion
support[2] is added for guest_memfd. To quote Yan's original summary of the
issues:
(1)
In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
invalidation lock for protecting its preparedness tracking. Similarly, the
in-place conversion version of guest_memfd series by Ackerley also requires
kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].
kvm_gmem_get_pfn
filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
filemap invalidation lock.
(2)
Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
resulting in the following lock sequence in tdx_vcpu_init_mem_region():
- filemap invalidation lock --> mm->mmap_lock
However, in future code, the shared filemap invalidation lock will be held
in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
- mm->mmap_lock --> filemap invalidation lock
This creates an AB-BA deadlock issue.
Sean has since addressed (1) with his series[3], which avoids calling
kvm_gmem_get_pfn() within the TDX post-populate callback to re-fetch the
PFN that was already passed to it.
This series aims to address (2), which is still outstanding, and does so based
heavily on Sean's suggested approach[4] of hoisting the get_user_pages_fast()
out of the TDX post-populate callback so that it can be performed before the
filemap invalidate lock is taken, making the ABBA deadlock impossible.
It additionally removes 'preparation' tracking from guest_memfd, which would
similarly complicate locking considerations in the context of in-place
conversion (and even more so in the context of hugepage support). This has
been discussed during both the guest_memfd calls and PUCK calls, and so far
no strong objections have been raised, so hopefully that particular change
isn't too controversial.
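
To make the intended ordering concrete, here is a rough sketch of the
hoisted-GUP flow described above. It is illustrative only and not the literal
patch 3 code: the post-populate callback signature with src_pages/src_offset
is an assumption based on item (A) below, and the gmem folio/pfn lookup is
elided.

/*
 * Illustrative sketch only, not the patch 3 implementation: pin the
 * userspace source pages *before* taking the filemap invalidate lock,
 * so mm->mmap_lock is never taken while that lock is held.
 */
typedef int (*post_populate_fn)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
				struct page **src_pages, size_t src_offset,
				void *opaque);

static long gmem_populate_page(struct kvm *kvm, struct file *file, gfn_t gfn,
			       kvm_pfn_t pfn, void __user *src,
			       post_populate_fn post_populate, void *opaque)
{
	struct page *src_pages[2] = {};
	size_t src_offset = src ? offset_in_page(src) : 0;
	/* an unaligned source straddles two source pages (see item (A)) */
	int i, nr_src = src ? (src_offset ? 2 : 1) : 0;
	long ret;

	/* 1) GUP the source first; this may take mm->mmap_lock */
	if (nr_src && get_user_pages_fast((unsigned long)src & PAGE_MASK,
					  nr_src, 0, src_pages) != nr_src)
		return -EFAULT;

	/* 2) Only then take the filemap invalidate lock for the gmem side */
	filemap_invalidate_lock(file->f_mapping);
	/* (the real code looks up the target gmem folio/pfn here) */
	ret = post_populate(kvm, gfn, pfn, src_pages, src_offset, opaque);
	filemap_invalidate_unlock(file->f_mapping);

	for (i = 0; i < nr_src; i++)
		put_page(src_pages[i]);

	return ret;
}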
Some items worth noting/discussing
----------------------------------
(A) Unlike TDX, which has always enforced that the source address used to
populate the contents of gmem pages via kvm_gmem_populate() is
page-aligned, SNP has explicitly allowed non-page-aligned source
addresses. This unfortunately means that instead of a simple 1:1
correspondence between source/target pages, post-populate callbacks
need to be able to handle straddling multiple source pages to populate
a single target page within guest_memfd, which complicates the handling
(a sketch of that handling follows at the end of this item). While the
changes to the SNP post-populate callback in patch #3 are not
horrendous, they certainly are not ideal.
However, architectures that never allowed a non-page-aligned source
address can essentially ignore the src_pages/src_offset considerations
and simply assume/enforce src_offset is 0, and that src_pages[0] is the
source struct page of relevance for each call.
That said, it would be possible to have SNP copy the unaligned source data
into an intermediate set of bounce-buffer pages before passing them to some
variant of kvm_gmem_populate() that skips the GUP and works directly with
the kernel-allocated bounce pages. However, there is a performance hit
there, and potentially some additional complexity in the interfaces to
handle the different flow, so it's not clear whether the trade-off is
worth it.
Another potential approach would be to take advantage of the fact that
all *known* VMM implementations of SNP do use page-aligned source
addresses, so it *may* be justifiable to retroactively enforce this as
a requirement so that the post-populate callbacks can be simplified
accordingly.
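
As a concrete illustration of the straddling case in (A), below is a
hypothetical helper (not the actual patch 3 SNP code) showing how a
post-populate callback could assemble one target page's worth of data from
the pinned src_pages using src_offset:

/*
 * Hypothetical helper, not taken from patch 3: copy PAGE_SIZE bytes of
 * source data that may straddle two pinned source pages when the source
 * address was not page-aligned (src_offset != 0).
 */
static void copy_from_src_pages(void *dst, struct page **src_pages,
				size_t src_offset)
{
	size_t first = PAGE_SIZE - src_offset;
	void *src;

	src = kmap_local_page(src_pages[0]);
	memcpy(dst, src + src_offset, first);
	kunmap_local(src);

	if (src_offset) {
		/* the rest of the target page comes from the next source page */
		src = kmap_local_page(src_pages[1]);
		memcpy(dst + first, src, src_offset);
		kunmap_local(src);
	}
}

Architectures that have always required an aligned source can treat
src_offset as 0 and only ever touch src_pages[0], as noted above.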
(B) While one of the aims of this rework is to keep allowing a separate
source address to be passed to kvm_gmem_populate() even though the gmem
pages can be populated in-place from userspace beforehand, issues still
arise if the source address itself has the KVM_MEMORY_ATTRIBUTE_PRIVATE
attribute set, e.g. if the source and target addresses are the same
page. One line of reasoning is that KVM_MEMORY_ATTRIBUTE_PRIVATE implies
the memory cannot be used as the source of a GUP/copy_from_user(), and
thus cases like source==target are naturally disallowed. In that case,
userspace has no choice but to populate pages in-place *prior* to
setting the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute (as
kvm_gmem_populate() requires), and to pass NULL for the source so that
the GUP can be skipped (a sketch of this ordering follows at the end of
this item). Otherwise, the GUP will trigger the shared memory fault
path, which will SIGBUS when it sees that it is faulting in pages for
which KVM_MEMORY_ATTRIBUTE_PRIVATE is set.
While workable, this would at the very least involve documentation
updates to KVM_TDX_INIT_MEM_REGION/KVM_SEV_SNP_LAUNCH_UPDATE to cover
these soon-to-be-possible scenarios. Ira posted a patch separately
that demonstrates how a NULL source could be safely handled within
the TDX post-populate callback[5].
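
For reference, the userspace ordering implied by (B) would look roughly like
the sketch below. It assumes the guest_memfd supports mmap() for in-place
population (part of the in-place conversion work), and the arch-specific
populate ioctl (KVM_SEV_SNP_LAUNCH_UPDATE or KVM_TDX_INIT_MEM_REGION) is
only indicated by a comment rather than spelled out:

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Sketch only: error paths simplified, gmem file offset assumed to be 0. */
static int populate_in_place(int vm_fd, int gmem_fd, __u64 gpa, __u64 size,
			     const void *payload)
{
	struct kvm_memory_attributes attrs = {
		.address = gpa,
		.size = size,
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
	};
	void *dst;

	/* 1) Write the initial contents while the pages are still shared. */
	dst = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, gmem_fd, 0);
	if (dst == MAP_FAILED)
		return -1;
	memcpy(dst, payload, size);
	munmap(dst, size);

	/* 2) Mark the range private *before* the populate call. */
	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
		return -1;

	/*
	 * 3) Issue the arch populate ioctl with a NULL/0 source so that
	 *    kvm_gmem_populate() skips the GUP and measures/encrypts the
	 *    contents already in place.
	 */
	return 0;
}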
Known issues / TODO
-------------------
- Compile-tested only for the TDX bits (testing/feedback welcome!)
Thanks,
Mike
[1] https://lore.kernel.org/kvm/20250703062641.3247-1-yan.y.zhao@intel.com/
[2] https://lore.kernel.org/kvm/cover.1760731772.git.ackerleytng@google.com/
[3] https://lore.kernel.org/kvm/20251030200951.3402865-1-seanjc@google.com/
[4] https://lore.kernel.org/kvm/aHEwT4X0RcfZzHlt@google.com/
[5] https://lore.kernel.org/kvm/20251105-tdx-init-in-place-v1-1-1196b67d0423@intel.com/
----------------------------------------------------------------
Michael Roth (3):
KVM: guest_memfd: Remove preparation tracking
KVM: TDX: Document alignment requirements for KVM_TDX_INIT_MEM_REGION
KVM: guest_memfd: GUP source pages prior to populating guest memory
Documentation/virt/kvm/x86/intel-tdx.rst | 2 +-
arch/x86/kvm/svm/sev.c | 40 +++++++++-----
arch/x86/kvm/vmx/tdx.c | 20 +++----
include/linux/kvm_host.h | 3 +-
virt/kvm/guest_memfd.c | 89 ++++++++++++++++++--------------
5 files changed, 88 insertions(+), 66 deletions(-)
^ permalink raw reply [flat|nested] 35+ messages in thread* [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-11-13 23:07 [PATCH RFC 0/3] KVM: guest_memfd: Rework preparation/population flows in prep for in-place conversion Michael Roth @ 2025-11-13 23:07 ` Michael Roth 2025-11-17 23:58 ` Ackerley Tng 2025-11-20 9:12 ` Yan Zhao 2025-11-13 23:07 ` [PATCH 2/3] KVM: TDX: Document alignment requirements for KVM_TDX_INIT_MEM_REGION Michael Roth 2025-11-13 23:07 ` [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory Michael Roth 2 siblings, 2 replies; 35+ messages in thread From: Michael Roth @ 2025-11-13 23:07 UTC (permalink / raw) To: kvm Cc: linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny, yan.y.zhao guest_memfd currently uses the folio uptodate flag to track: 1) whether or not a page has been cleared before initial usage 2) whether or not the architecture hooks have been issued to put the page in a private state as defined by the architecture In practice, 2) is only actually being tracked for SEV-SNP VMs, and there do not seem to be any plans/reasons that would suggest this will change in the future, so this additional tracking/complexity is not really providing any general benefit to guest_memfd users. Future plans around in-place conversion and hugepage support, where the per-folio uptodate flag is planned to be used purely to track the initial clearing of folios, whereas conversion operations could trigger multiple transitions between 'prepared' and 'unprepared' and thus need separate tracking, will make the burden of tracking this information within guest_memfd even more complex, since preparation generally happens during fault time, on the "read-side" of any global locks that might protect state tracked by guest_memfd, and so may require more complex locking schemes to allow for concurrent handling of page faults for multiple vCPUs where the "preparedness" state tracked by guest_memfd might need to be updated as part of handling the fault. Instead of keeping this current/future complexity within guest_memfd for what is essentially just SEV-SNP, just drop the tracking for 2) and have the arch-specific preparation hooks get triggered unconditionally on every fault so the arch-specific hooks can check the preparation state directly and decide whether or not a folio still needs additional preparation. In the case of SEV-SNP, the preparation state is already checked again via the preparation hooks to avoid double-preparation, so nothing extra needs to be done to update the handling of things there. Signed-off-by: Michael Roth <michael.roth@amd.com> --- virt/kvm/guest_memfd.c | 47 ++++++++++++++---------------------------- 1 file changed, 15 insertions(+), 32 deletions(-) diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index fdaea3422c30..9160379df378 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -76,11 +76,6 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo return 0; } -static inline void kvm_gmem_mark_prepared(struct folio *folio) -{ - folio_mark_uptodate(folio); -} - /* * Process @folio, which contains @gfn, so that the guest can use it. * The folio must be locked and the gfn must be contained in @slot. 
@@ -90,13 +85,7 @@ static inline void kvm_gmem_mark_prepared(struct folio *folio) static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn, struct folio *folio) { - unsigned long nr_pages, i; pgoff_t index; - int r; - - nr_pages = folio_nr_pages(folio); - for (i = 0; i < nr_pages; i++) - clear_highpage(folio_page(folio, i)); /* * Preparing huge folios should always be safe, since it should @@ -114,11 +103,8 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot, WARN_ON(!IS_ALIGNED(slot->gmem.pgoff, folio_nr_pages(folio))); index = kvm_gmem_get_index(slot, gfn); index = ALIGN_DOWN(index, folio_nr_pages(folio)); - r = __kvm_gmem_prepare_folio(kvm, slot, index, folio); - if (!r) - kvm_gmem_mark_prepared(folio); - return r; + return __kvm_gmem_prepare_folio(kvm, slot, index, folio); } /* @@ -420,7 +406,7 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf) if (!folio_test_uptodate(folio)) { clear_highpage(folio_page(folio, 0)); - kvm_gmem_mark_prepared(folio); + folio_mark_uptodate(folio); } vmf->page = folio_file_page(folio, vmf->pgoff); @@ -757,7 +743,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot) static struct folio *__kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot, pgoff_t index, kvm_pfn_t *pfn, - bool *is_prepared, int *max_order) + int *max_order) { struct file *slot_file = READ_ONCE(slot->gmem.file); struct gmem_file *f = file->private_data; @@ -787,7 +773,6 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file, if (max_order) *max_order = 0; - *is_prepared = folio_test_uptodate(folio); return folio; } @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, { pgoff_t index = kvm_gmem_get_index(slot, gfn); struct folio *folio; - bool is_prepared = false; int r = 0; CLASS(gmem_get_file, file)(slot); if (!file) return -EFAULT; - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order); + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order); if (IS_ERR(folio)) return PTR_ERR(folio); - if (!is_prepared) - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); + if (!folio_test_uptodate(folio)) { + unsigned long i, nr_pages = folio_nr_pages(folio); + + for (i = 0; i < nr_pages; i++) + clear_highpage(folio_page(folio, i)); + folio_mark_uptodate(folio); + } + + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); folio_unlock(folio); @@ -852,7 +843,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long struct folio *folio; gfn_t gfn = start_gfn + i; pgoff_t index = kvm_gmem_get_index(slot, gfn); - bool is_prepared = false; kvm_pfn_t pfn; if (signal_pending(current)) { @@ -860,19 +850,12 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long break; } - folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order); + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); if (IS_ERR(folio)) { ret = PTR_ERR(folio); break; } - if (is_prepared) { - folio_unlock(folio); - folio_put(folio); - ret = -EEXIST; - break; - } - folio_unlock(folio); WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || (npages - i) < (1 << max_order)); @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long p = src ? 
src + i * PAGE_SIZE : NULL; ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); if (!ret) - kvm_gmem_mark_prepared(folio); + folio_mark_uptodate(folio); put_folio_and_exit: folio_put(folio); -- 2.25.1 ^ permalink raw reply [flat|nested] 35+ messages in thread
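As an aside on the commit message's point that the SNP prepare hook already
checks the page state itself: the pattern is roughly the following sketch.
This is illustrative only and not the actual sev.c code, although
snp_lookup_rmpentry() and rmp_make_private() are the real low-level helpers:

/*
 * Illustrative only: skip preparation when the RMP entry already shows
 * the page as assigned, i.e. it was prepared on an earlier fault.
 */
static int snp_prepare_pfn_if_needed(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
{
	bool assigned;
	int level, rc;

	rc = snp_lookup_rmpentry(pfn, &assigned, &level);
	if (rc)
		return rc;
	if (assigned)	/* already private in the RMP: nothing to do */
		return 0;

	return rmp_make_private(pfn, gfn_to_gpa(gfn), PG_LEVEL_4K,
				sev_get_asid(kvm), false);
}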
* Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-11-13 23:07 ` [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking Michael Roth @ 2025-11-17 23:58 ` Ackerley Tng 2025-11-19 0:18 ` Michael Roth 2025-11-20 9:12 ` Yan Zhao 1 sibling, 1 reply; 35+ messages in thread From: Ackerley Tng @ 2025-11-17 23:58 UTC (permalink / raw) To: Michael Roth, kvm Cc: linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, aik, ira.weiny, yan.y.zhao Michael Roth <michael.roth@amd.com> writes: > guest_memfd currently uses the folio uptodate flag to track: > > 1) whether or not a page has been cleared before initial usage > 2) whether or not the architecture hooks have been issued to put the > page in a private state as defined by the architecture > > In practice, 2) is only actually being tracked for SEV-SNP VMs, and > there do not seem to be any plans/reasons that would suggest this will > change in the future, so this additional tracking/complexity is not > really providing any general benefit to guest_memfd users. Future plans > around in-place conversion and hugepage support, where the per-folio > uptodate flag is planned to be used purely to track the initial clearing > of folios, whereas conversion operations could trigger multiple > transitions between 'prepared' and 'unprepared' and thus need separate > tracking, will make the burden of tracking this information within > guest_memfd even more complex, since preparation generally happens > during fault time, on the "read-side" of any global locks that might > protect state tracked by guest_memfd, and so may require more complex > locking schemes to allow for concurrent handling of page faults for > multiple vCPUs where the "preparedness" state tracked by guest_memfd > might need to be updated as part of handling the fault. > > Instead of keeping this current/future complexity within guest_memfd for > what is essentially just SEV-SNP, just drop the tracking for 2) and have > the arch-specific preparation hooks get triggered unconditionally on > every fault so the arch-specific hooks can check the preparation state > directly and decide whether or not a folio still needs additional > preparation. In the case of SEV-SNP, the preparation state is already > checked again via the preparation hooks to avoid double-preparation, so > nothing extra needs to be done to update the handling of things there. > This looks good to me, thanks! What do you think of moving preparation (or SNP-specific work) to be done when the page is actually mapped by KVM instead? So whatever's done in preparation is now called from KVM instead of within guest_memfd [1]? I'm concerned about how this preparation needs to be done for the entire folio. With huge pages, could it be weird if actually only one page in the huge page is faulted in, and hence only that one page needs to be prepared, instead of the entire huge page? In the other series [2], there was a part about how guest_memfd should invalidate the shared status on conversion from private to shared. Is that still an intended step, after this series to remove preparation tracking? 
[1] https://lore.kernel.org/all/diqzcy7op5wg.fsf@google.com/ [2] https://lore.kernel.org/all/20250613005400.3694904-4-michael.roth@amd.com/ > Signed-off-by: Michael Roth <michael.roth@amd.com> > --- > virt/kvm/guest_memfd.c | 47 ++++++++++++++---------------------------- > 1 file changed, 15 insertions(+), 32 deletions(-) > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > index fdaea3422c30..9160379df378 100644 > --- a/virt/kvm/guest_memfd.c > +++ b/virt/kvm/guest_memfd.c > @@ -76,11 +76,6 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo > return 0; > } > > -static inline void kvm_gmem_mark_prepared(struct folio *folio) > -{ > - folio_mark_uptodate(folio); > -} > - > /* > * Process @folio, which contains @gfn, so that the guest can use it. > * The folio must be locked and the gfn must be contained in @slot. > @@ -90,13 +85,7 @@ static inline void kvm_gmem_mark_prepared(struct folio *folio) > static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot, > gfn_t gfn, struct folio *folio) > { > - unsigned long nr_pages, i; > pgoff_t index; > - int r; > - > - nr_pages = folio_nr_pages(folio); > - for (i = 0; i < nr_pages; i++) > - clear_highpage(folio_page(folio, i)); > > /* > * Preparing huge folios should always be safe, since it should > @@ -114,11 +103,8 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot, > WARN_ON(!IS_ALIGNED(slot->gmem.pgoff, folio_nr_pages(folio))); > index = kvm_gmem_get_index(slot, gfn); > index = ALIGN_DOWN(index, folio_nr_pages(folio)); > - r = __kvm_gmem_prepare_folio(kvm, slot, index, folio); > - if (!r) > - kvm_gmem_mark_prepared(folio); > > - return r; > + return __kvm_gmem_prepare_folio(kvm, slot, index, folio); > } > > /* > @@ -420,7 +406,7 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf) > > if (!folio_test_uptodate(folio)) { > clear_highpage(folio_page(folio, 0)); > - kvm_gmem_mark_prepared(folio); > + folio_mark_uptodate(folio); > } > > vmf->page = folio_file_page(folio, vmf->pgoff); > @@ -757,7 +743,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot) > static struct folio *__kvm_gmem_get_pfn(struct file *file, > struct kvm_memory_slot *slot, > pgoff_t index, kvm_pfn_t *pfn, > - bool *is_prepared, int *max_order) > + int *max_order) > { > struct file *slot_file = READ_ONCE(slot->gmem.file); > struct gmem_file *f = file->private_data; > @@ -787,7 +773,6 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file, > if (max_order) > *max_order = 0; > > - *is_prepared = folio_test_uptodate(folio); > return folio; > } > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > { > pgoff_t index = kvm_gmem_get_index(slot, gfn); > struct folio *folio; > - bool is_prepared = false; > int r = 0; > > CLASS(gmem_get_file, file)(slot); > if (!file) > return -EFAULT; > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order); > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order); > if (IS_ERR(folio)) > return PTR_ERR(folio); > > - if (!is_prepared) > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > + if (!folio_test_uptodate(folio)) { > + unsigned long i, nr_pages = folio_nr_pages(folio); > + > + for (i = 0; i < nr_pages; i++) > + clear_highpage(folio_page(folio, i)); > + folio_mark_uptodate(folio); > + } > + > + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > folio_unlock(folio); > > @@ -852,7 +843,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, 
void __user *src, long > struct folio *folio; > gfn_t gfn = start_gfn + i; > pgoff_t index = kvm_gmem_get_index(slot, gfn); > - bool is_prepared = false; > kvm_pfn_t pfn; > > if (signal_pending(current)) { > @@ -860,19 +850,12 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > break; > } > > - folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order); > + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); > if (IS_ERR(folio)) { > ret = PTR_ERR(folio); > break; > } > > - if (is_prepared) { > - folio_unlock(folio); > - folio_put(folio); > - ret = -EEXIST; > - break; > - } > - > folio_unlock(folio); > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > (npages - i) < (1 << max_order)); > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > p = src ? src + i * PAGE_SIZE : NULL; > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > if (!ret) > - kvm_gmem_mark_prepared(folio); > + folio_mark_uptodate(folio); > > put_folio_and_exit: > folio_put(folio); > -- > 2.25.1 ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-11-17 23:58 ` Ackerley Tng @ 2025-11-19 0:18 ` Michael Roth 0 siblings, 0 replies; 35+ messages in thread From: Michael Roth @ 2025-11-19 0:18 UTC (permalink / raw) To: Ackerley Tng Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, aik, ira.weiny, yan.y.zhao On Mon, Nov 17, 2025 at 03:58:46PM -0800, Ackerley Tng wrote: > Michael Roth <michael.roth@amd.com> writes: > > > guest_memfd currently uses the folio uptodate flag to track: > > > > 1) whether or not a page has been cleared before initial usage > > 2) whether or not the architecture hooks have been issued to put the > > page in a private state as defined by the architecture > > > > In practice, 2) is only actually being tracked for SEV-SNP VMs, and > > there do not seem to be any plans/reasons that would suggest this will > > change in the future, so this additional tracking/complexity is not > > really providing any general benefit to guest_memfd users. Future plans > > around in-place conversion and hugepage support, where the per-folio > > uptodate flag is planned to be used purely to track the initial clearing > > of folios, whereas conversion operations could trigger multiple > > transitions between 'prepared' and 'unprepared' and thus need separate > > tracking, will make the burden of tracking this information within > > guest_memfd even more complex, since preparation generally happens > > during fault time, on the "read-side" of any global locks that might > > protect state tracked by guest_memfd, and so may require more complex > > locking schemes to allow for concurrent handling of page faults for > > multiple vCPUs where the "preparedness" state tracked by guest_memfd > > might need to be updated as part of handling the fault. > > > > Instead of keeping this current/future complexity within guest_memfd for > > what is essentially just SEV-SNP, just drop the tracking for 2) and have > > the arch-specific preparation hooks get triggered unconditionally on > > every fault so the arch-specific hooks can check the preparation state > > directly and decide whether or not a folio still needs additional > > preparation. In the case of SEV-SNP, the preparation state is already > > checked again via the preparation hooks to avoid double-preparation, so > > nothing extra needs to be done to update the handling of things there. > > > > This looks good to me, thanks! > > What do you think of moving preparation (or SNP-specific work) to be > done when the page is actually mapped by KVM instead? So whatever's done > in preparation is now called from KVM instead of within guest_memfd [1]? Now that preparation tracking is removed, it is now completely decoupled from the kvm_gmem_populate() path and fully contained in kvm_gmem_get_pfn(), where it becomes a lot more straightforward to move this into the KVM MMU fault path. But gmem currently also handles the inverse operation via the gmem_invalidate() hooks, which is driven separately from the KVM MMU notifiers. And it's not so simple to just plumb it into those paths, but invalidation in this sense involves clearing the 'validated' bit in the RMP table for the page, which is a destructive operation, whereas the notifiers as they exist today can be using for non-destructive operations like simply rebuilding stage2 mappings. 
So we'd probably need to think through what that would look like if we really want to move preparation/un-preparation out of gmem. So I think it makes sense to consider this patch as-is as a stepping stone toward that, but I don't have any objection to going that direction. Curious what others have to say though. > > I'm concerned about how this preparation needs to be done for the entire > folio. With huge pages, could it be weird if actually only one page in > the huge page is faulted in, and hence only that one page needs to be > prepared, instead of the entire huge page? In previous iterations of THP support for SNP[1] I think this worked out okay. You'd prepare optimistically prepare the whole huge folio, and if KVM mapped it as, say, 4K, you'd get an RMP fault and PSMASH the RMP table to smaller 4K/prepare entries. But that was before in-place conversion was in the picture, so we didn't have to worry about ever converting those other prepared entries to a shared state, so you could defer everything until folio cleanup. For in-place we'd need to take the memory attributes for the range we are mapping into account and clamp the range down to a smaller order accordingly before issuing the prepare hook. But I think it would still be doable. Maybe more directly would be to let KVM MMU tell us the max mapping level it will be using so we can just defer all the attribute handling to KVM. But this same approach could still be done with gmem issuing the prepare hooks in the case of in-place conversion. So I think it's doable either way... hard to tell what approach is cleaner without some hugepage patches on top. I'm still trying to get update THP on top of your in-place conversion patches posted and maybe it'll be easier to see what things would look like in that context. [1] https://lore.kernel.org/kvm/20241212063635.712877-1-michael.roth@amd.com/ > > In the other series [2], there was a part about how guest_memfd should > invalidate the shared status on conversion from private to shared. Is > that still an intended step, after this series to remove preparation > tracking? Yes, I was still planning to have gmem drive prepare/invalidate where needed. If we move things out to MMU that will require some rethinking however. Thanks, Mike > > [1] https://lore.kernel.org/all/diqzcy7op5wg.fsf@google.com/ > [2] https://lore.kernel.org/all/20250613005400.3694904-4-michael.roth@amd.com/ > > > Signed-off-by: Michael Roth <michael.roth@amd.com> > > --- > > virt/kvm/guest_memfd.c | 47 ++++++++++++++---------------------------- > > 1 file changed, 15 insertions(+), 32 deletions(-) > > > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > > index fdaea3422c30..9160379df378 100644 > > --- a/virt/kvm/guest_memfd.c > > +++ b/virt/kvm/guest_memfd.c > > @@ -76,11 +76,6 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo > > return 0; > > } > > > > -static inline void kvm_gmem_mark_prepared(struct folio *folio) > > -{ > > - folio_mark_uptodate(folio); > > -} > > - > > /* > > * Process @folio, which contains @gfn, so that the guest can use it. > > * The folio must be locked and the gfn must be contained in @slot. 
> > @@ -90,13 +85,7 @@ static inline void kvm_gmem_mark_prepared(struct folio *folio) > > static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot, > > gfn_t gfn, struct folio *folio) > > { > > - unsigned long nr_pages, i; > > pgoff_t index; > > - int r; > > - > > - nr_pages = folio_nr_pages(folio); > > - for (i = 0; i < nr_pages; i++) > > - clear_highpage(folio_page(folio, i)); > > > > /* > > * Preparing huge folios should always be safe, since it should > > @@ -114,11 +103,8 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot, > > WARN_ON(!IS_ALIGNED(slot->gmem.pgoff, folio_nr_pages(folio))); > > index = kvm_gmem_get_index(slot, gfn); > > index = ALIGN_DOWN(index, folio_nr_pages(folio)); > > - r = __kvm_gmem_prepare_folio(kvm, slot, index, folio); > > - if (!r) > > - kvm_gmem_mark_prepared(folio); > > > > - return r; > > + return __kvm_gmem_prepare_folio(kvm, slot, index, folio); > > } > > > > /* > > @@ -420,7 +406,7 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf) > > > > if (!folio_test_uptodate(folio)) { > > clear_highpage(folio_page(folio, 0)); > > - kvm_gmem_mark_prepared(folio); > > + folio_mark_uptodate(folio); > > } > > > > vmf->page = folio_file_page(folio, vmf->pgoff); > > @@ -757,7 +743,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot) > > static struct folio *__kvm_gmem_get_pfn(struct file *file, > > struct kvm_memory_slot *slot, > > pgoff_t index, kvm_pfn_t *pfn, > > - bool *is_prepared, int *max_order) > > + int *max_order) > > { > > struct file *slot_file = READ_ONCE(slot->gmem.file); > > struct gmem_file *f = file->private_data; > > @@ -787,7 +773,6 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file, > > if (max_order) > > *max_order = 0; > > > > - *is_prepared = folio_test_uptodate(folio); > > return folio; > > } > > > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > { > > pgoff_t index = kvm_gmem_get_index(slot, gfn); > > struct folio *folio; > > - bool is_prepared = false; > > int r = 0; > > > > CLASS(gmem_get_file, file)(slot); > > if (!file) > > return -EFAULT; > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order); > > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order); > > if (IS_ERR(folio)) > > return PTR_ERR(folio); > > > > - if (!is_prepared) > > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > + if (!folio_test_uptodate(folio)) { > > + unsigned long i, nr_pages = folio_nr_pages(folio); > > + > > + for (i = 0; i < nr_pages; i++) > > + clear_highpage(folio_page(folio, i)); > > + folio_mark_uptodate(folio); > > + } > > + > > + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > > > folio_unlock(folio); > > > > @@ -852,7 +843,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > struct folio *folio; > > gfn_t gfn = start_gfn + i; > > pgoff_t index = kvm_gmem_get_index(slot, gfn); > > - bool is_prepared = false; > > kvm_pfn_t pfn; > > > > if (signal_pending(current)) { > > @@ -860,19 +850,12 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > break; > > } > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order); > > + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); > > if (IS_ERR(folio)) { > > ret = PTR_ERR(folio); > > break; > > } > > > > - if (is_prepared) { > > - folio_unlock(folio); > > - folio_put(folio); > > - ret = -EEXIST; > > - break; > > - } > > - > 
> folio_unlock(folio); > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > (npages - i) < (1 << max_order)); > > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > p = src ? src + i * PAGE_SIZE : NULL; > > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > > if (!ret) > > - kvm_gmem_mark_prepared(folio); > > + folio_mark_uptodate(folio); > > > > put_folio_and_exit: > > folio_put(folio); > > -- > > 2.25.1 > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-11-13 23:07 ` [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking Michael Roth 2025-11-17 23:58 ` Ackerley Tng @ 2025-11-20 9:12 ` Yan Zhao 2025-11-21 12:43 ` Michael Roth 1 sibling, 1 reply; 35+ messages in thread From: Yan Zhao @ 2025-11-20 9:12 UTC (permalink / raw) To: Michael Roth Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote: > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > { > pgoff_t index = kvm_gmem_get_index(slot, gfn); > struct folio *folio; > - bool is_prepared = false; > int r = 0; > > CLASS(gmem_get_file, file)(slot); > if (!file) > return -EFAULT; > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order); > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order); > if (IS_ERR(folio)) > return PTR_ERR(folio); > > - if (!is_prepared) > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > + if (!folio_test_uptodate(folio)) { > + unsigned long i, nr_pages = folio_nr_pages(folio); > + > + for (i = 0; i < nr_pages; i++) > + clear_highpage(folio_page(folio, i)); > + folio_mark_uptodate(folio); Here, the entire folio is cleared only when the folio is not marked uptodate. Then, please check my questions at the bottom > + } > + > + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > folio_unlock(folio); > > @@ -852,7 +843,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > struct folio *folio; > gfn_t gfn = start_gfn + i; > pgoff_t index = kvm_gmem_get_index(slot, gfn); > - bool is_prepared = false; > kvm_pfn_t pfn; > > if (signal_pending(current)) { > @@ -860,19 +850,12 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > break; > } > > - folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order); > + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); > if (IS_ERR(folio)) { > ret = PTR_ERR(folio); > break; > } > > - if (is_prepared) { > - folio_unlock(folio); > - folio_put(folio); > - ret = -EEXIST; > - break; > - } > - > folio_unlock(folio); > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > (npages - i) < (1 << max_order)); TDX could hit this warning easily when npages == 1, max_order == 9. > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > p = src ? src + i * PAGE_SIZE : NULL; > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > if (!ret) > - kvm_gmem_mark_prepared(folio); > + folio_mark_uptodate(folio); As also asked in [1], why is the entire folio marked as uptodate here? Why does kvm_gmem_get_pfn() clear all pages of a huge folio when the folio isn't marked uptodate? It's possible (at least for TDX) that a huge folio is only partially populated by kvm_gmem_populate(). Then kvm_gmem_get_pfn() faults in another part of the huge folio. For example, in TDX, GFN 0x81f belongs to the init memory region, while GFN 0x820 is faulted after TD is running. However, these two GFNs can belong to the same folio of order 9. Note: the current code should not impact TDX. I'm just asking out of curiosity:) [1] https://lore.kernel.org/all/aQ3uj4BZL6uFQzrD@yzhao56-desk.sh.intel.com/ ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-11-20 9:12 ` Yan Zhao @ 2025-11-21 12:43 ` Michael Roth 2025-11-25 3:13 ` Yan Zhao 0 siblings, 1 reply; 35+ messages in thread From: Michael Roth @ 2025-11-21 12:43 UTC (permalink / raw) To: Yan Zhao Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny On Thu, Nov 20, 2025 at 05:12:55PM +0800, Yan Zhao wrote: > On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote: > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > { > > pgoff_t index = kvm_gmem_get_index(slot, gfn); > > struct folio *folio; > > - bool is_prepared = false; > > int r = 0; > > > > CLASS(gmem_get_file, file)(slot); > > if (!file) > > return -EFAULT; > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order); > > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order); > > if (IS_ERR(folio)) > > return PTR_ERR(folio); > > > > - if (!is_prepared) > > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > + if (!folio_test_uptodate(folio)) { > > + unsigned long i, nr_pages = folio_nr_pages(folio); > > + > > + for (i = 0; i < nr_pages; i++) > > + clear_highpage(folio_page(folio, i)); > > + folio_mark_uptodate(folio); > Here, the entire folio is cleared only when the folio is not marked uptodate. > Then, please check my questions at the bottom Yes, in this patch at least where I tried to mirror the current logic. I would not be surprised if we need to rework things for inplace/hugepage support though, but decoupling 'preparation' from the uptodate flag is the main goal here. > > > + } > > + > > + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > > > folio_unlock(folio); > > > > @@ -852,7 +843,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > struct folio *folio; > > gfn_t gfn = start_gfn + i; > > pgoff_t index = kvm_gmem_get_index(slot, gfn); > > - bool is_prepared = false; > > kvm_pfn_t pfn; > > > > if (signal_pending(current)) { > > @@ -860,19 +850,12 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > break; > > } > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order); > > + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); > > if (IS_ERR(folio)) { > > ret = PTR_ERR(folio); > > break; > > } > > > > - if (is_prepared) { > > - folio_unlock(folio); > > - folio_put(folio); > > - ret = -EEXIST; > > - break; > > - } > > - > > folio_unlock(folio); > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > (npages - i) < (1 << max_order)); > TDX could hit this warning easily when npages == 1, max_order == 9. Yes, this will need to change to handle that. I don't think I had to change this for previous iterations of SNP hugepage support, but there are definitely cases where a sub-2M range might get populated even though it's backed by a 2M folio, so I'm not sure why I didn't hit it there. But I'm taking Sean's cue on touching as little of the existing hugepage logic as possible in this particular series so we can revisit the remaining changes with some better context. > > > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > p = src ? 
src + i * PAGE_SIZE : NULL; > > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > > if (!ret) > > - kvm_gmem_mark_prepared(folio); > > + folio_mark_uptodate(folio); > As also asked in [1], why is the entire folio marked as uptodate here? Why does > kvm_gmem_get_pfn() clear all pages of a huge folio when the folio isn't marked > uptodate? Quoting your example from[1] for more context: > I also have a question about this patch: > > Suppose there's a 2MB huge folio A, where > A1 and A2 are 4KB pages belonging to folio A. > > (1) kvm_gmem_populate() invokes __kvm_gmem_get_pfn() and gets folio A. > It adds page A1 and invokes folio_mark_uptodate() on folio A. In SNP hugepage patchset you responded to, it would only mark A1 as prepared/cleared. There was 4K-granularity tracking added to handle this. There was an odd subtlety in that series though: it was defaulting to the folio_order() for the prep-tracking/post-populate, but it would then clamp it down based on the max order possible according whether that particular order was a homogenous range of KVM_MEMORY_ATTRIBUTE_PRIVATE. Which is not a great way to handle things, and I don't remember if I'd actually intended to implement it that way or not... that's probably why I never tripped over the WARN_ON() above, now that I think of it. But neither of these these apply to any current plans for hugepage support that I'm aware of, so probably not worth working through what that series did and look at this from a fresh perspective. > > (2) kvm_gmem_get_pfn() later faults in page A2. > As folio A is uptodate, clear_highpage() is not invoked on page A2. > kvm_gmem_prepare_folio() is invoked on the whole folio A. > > (2) could occur at least in TDX when only a part the 2MB page is added as guest > initial memory. > > My questions: > - Would (2) occur on SEV? > - If it does, is the lack of clear_highpage() on A2 a problem ? > - Is invoking gmem_prepare on page A1 a problem? Assuming this patch goes upstream in some form, we will now have the following major differences versus previous code: 1) uptodate flag only tracks whether a folio has been cleared 2) gmem always calls kvm_arch_gmem_prepare() via kvm_gmem_get_pfn() and the architecture can handle it's own tracking at whatever granularity it likes. My hope is that 1) can similarly be done in such a way that gmem does not need to track things at sub-hugepage granularity and necessitate the need for some new data structure/state/flag to track sub-page status. My understanding based on prior discussion in guest_memfd calls was that it would be okay to go ahead and clear the entire folio at initial allocation time, and basically never mess with it again. It was also my understanding that for TDX it might even be optimal to completely skip clearing the folio if it is getting mapped into SecureEPT as a hugepage since the TDX module would handle that, but that maybe conversely after private->shared there would be some need to reclear... I'll try to find that discussion and refresh. Vishal I believe suggested some flags to provide more control over this behavior. > > It's possible (at least for TDX) that a huge folio is only partially populated > by kvm_gmem_populate(). Then kvm_gmem_get_pfn() faults in another part of the > huge folio. For example, in TDX, GFN 0x81f belongs to the init memory region, > while GFN 0x820 is faulted after TD is running. However, these two GFNs can > belong to the same folio of order 9. 
Would the above scheme of clearing the entire folio up front and not re-clearing at fault time work for this case? Thanks, Mike > > Note: the current code should not impact TDX. I'm just asking out of curiosity:) > > [1] https://lore.kernel.org/all/aQ3uj4BZL6uFQzrD@yzhao56-desk.sh.intel.com/ > > ^ permalink raw reply [flat|nested] 35+ messages in thread
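A minimal sketch of the "clear the whole folio once, up front" idea discussed
above, using the uptodate flag purely as has-been-cleared state. This mirrors
the logic already in patch 1 rather than introducing anything new; the helper
name is made up:

/* Hypothetical helper: clear a gmem folio exactly once. Folio is locked. */
static void kvm_gmem_clear_folio_once(struct folio *folio)
{
	unsigned long i, nr_pages = folio_nr_pages(folio);

	if (folio_test_uptodate(folio))
		return;

	for (i = 0; i < nr_pages; i++)
		clear_highpage(folio_page(folio, i));

	folio_mark_uptodate(folio);
}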
* Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-11-21 12:43 ` Michael Roth @ 2025-11-25 3:13 ` Yan Zhao 2025-12-01 1:35 ` Vishal Annapurve 2025-12-01 23:44 ` Michael Roth 0 siblings, 2 replies; 35+ messages in thread From: Yan Zhao @ 2025-11-25 3:13 UTC (permalink / raw) To: Michael Roth Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny On Fri, Nov 21, 2025 at 06:43:14AM -0600, Michael Roth wrote: > On Thu, Nov 20, 2025 at 05:12:55PM +0800, Yan Zhao wrote: > > On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote: > > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > > { > > > pgoff_t index = kvm_gmem_get_index(slot, gfn); > > > struct folio *folio; > > > - bool is_prepared = false; > > > int r = 0; > > > > > > CLASS(gmem_get_file, file)(slot); > > > if (!file) > > > return -EFAULT; > > > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order); > > > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order); > > > if (IS_ERR(folio)) > > > return PTR_ERR(folio); > > > > > > - if (!is_prepared) > > > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > > + if (!folio_test_uptodate(folio)) { > > > + unsigned long i, nr_pages = folio_nr_pages(folio); > > > + > > > + for (i = 0; i < nr_pages; i++) > > > + clear_highpage(folio_page(folio, i)); > > > + folio_mark_uptodate(folio); > > Here, the entire folio is cleared only when the folio is not marked uptodate. > > Then, please check my questions at the bottom > > Yes, in this patch at least where I tried to mirror the current logic. I > would not be surprised if we need to rework things for inplace/hugepage > support though, but decoupling 'preparation' from the uptodate flag is > the main goal here. Could you elaborate a little why the decoupling is needed if it's not for hugepage? > > > + } > > > + > > > + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > > > > > folio_unlock(folio); > > > > > > @@ -852,7 +843,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > struct folio *folio; > > > gfn_t gfn = start_gfn + i; > > > pgoff_t index = kvm_gmem_get_index(slot, gfn); > > > - bool is_prepared = false; > > > kvm_pfn_t pfn; > > > > > > if (signal_pending(current)) { > > > @@ -860,19 +850,12 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > break; > > > } > > > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order); > > > + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); > > > if (IS_ERR(folio)) { > > > ret = PTR_ERR(folio); > > > break; > > > } > > > > > > - if (is_prepared) { > > > - folio_unlock(folio); > > > - folio_put(folio); > > > - ret = -EEXIST; > > > - break; > > > - } > > > - > > > folio_unlock(folio); > > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > > (npages - i) < (1 << max_order)); > > TDX could hit this warning easily when npages == 1, max_order == 9. > > Yes, this will need to change to handle that. I don't think I had to > change this for previous iterations of SNP hugepage support, but > there are definitely cases where a sub-2M range might get populated > even though it's backed by a 2M folio, so I'm not sure why I didn't > hit it there. 
> > But I'm taking Sean's cue on touching as little of the existing > hugepage logic as possible in this particular series so we can revisit > the remaining changes with some better context. Frankly, I don't understand why this patch 1 is required if we only want "moving GUP out of post_populate()" to work for 4KB folios. > > > > > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > p = src ? src + i * PAGE_SIZE : NULL; > > > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > > > if (!ret) > > > - kvm_gmem_mark_prepared(folio); > > > + folio_mark_uptodate(folio); > > As also asked in [1], why is the entire folio marked as uptodate here? Why does > > kvm_gmem_get_pfn() clear all pages of a huge folio when the folio isn't marked > > uptodate? > > Quoting your example from[1] for more context: > > > I also have a question about this patch: > > > > Suppose there's a 2MB huge folio A, where > > A1 and A2 are 4KB pages belonging to folio A. > > > > (1) kvm_gmem_populate() invokes __kvm_gmem_get_pfn() and gets folio A. > > It adds page A1 and invokes folio_mark_uptodate() on folio A. > > In SNP hugepage patchset you responded to, it would only mark A1 as You mean code in https://github.com/amdese/linux/commits/snp-inplace-conversion-rfc1 ? > prepared/cleared. There was 4K-granularity tracking added to handle this. I don't find the code that marks only A1 as "prepared/cleared". Instead, I just found folio_mark_uptodate() is invoked by kvm_gmem_populate() to mark the entire folio A as uptodate. However, according to your statement below that "uptodate flag only tracks whether a folio has been cleared", I don't follow why and where the entire folio A would be cleared if kvm_gmem_populate() only adds page A1. > There was an odd subtlety in that series though: it was defaulting to the > folio_order() for the prep-tracking/post-populate, but it would then clamp > it down based on the max order possible according whether that particular > order was a homogenous range of KVM_MEMORY_ATTRIBUTE_PRIVATE. Which is not > a great way to handle things, and I don't remember if I'd actually intended > to implement it that way or not... that's probably why I never tripped over > the WARN_ON() above, now that I think of it. > > But neither of these these apply to any current plans for hugepage support > that I'm aware of, so probably not worth working through what that series > did and look at this from a fresh perspective. > > > > > (2) kvm_gmem_get_pfn() later faults in page A2. > > As folio A is uptodate, clear_highpage() is not invoked on page A2. > > kvm_gmem_prepare_folio() is invoked on the whole folio A. > > > > (2) could occur at least in TDX when only a part the 2MB page is added as guest > > initial memory. > > > > My questions: > > - Would (2) occur on SEV? > > - If it does, is the lack of clear_highpage() on A2 a problem ? > > - Is invoking gmem_prepare on page A1 a problem? > > Assuming this patch goes upstream in some form, we will now have the > following major differences versus previous code: > > 1) uptodate flag only tracks whether a folio has been cleared > 2) gmem always calls kvm_arch_gmem_prepare() via kvm_gmem_get_pfn() and > the architecture can handle it's own tracking at whatever granularity > it likes. 2) looks good to me. 
> My hope is that 1) can similarly be done in such a way that gmem does not > need to track things at sub-hugepage granularity and necessitate the need > for some new data structure/state/flag to track sub-page status. I actually don't understand what uptodate flag helps gmem to track. Why can't clear_highpage() be done inside arch specific code? TDX doesn't need this clearing after all. > My understanding based on prior discussion in guest_memfd calls was that > it would be okay to go ahead and clear the entire folio at initial allocation > time, and basically never mess with it again. It was also my understanding That's where I don't follow in this patch. I don't see where the entire folio A is cleared if it's only partially mapped by kvm_gmem_populate(). kvm_gmem_get_pfn() won't clear folio A either due to kvm_gmem_populate() has set the uptodate flag. > that for TDX it might even be optimal to completely skip clearing the folio > if it is getting mapped into SecureEPT as a hugepage since the TDX module > would handle that, but that maybe conversely after private->shared there > would be some need to reclear... I'll try to find that discussion and > refresh. Vishal I believe suggested some flags to provide more control over > this behavior. > > > > > It's possible (at least for TDX) that a huge folio is only partially populated > > by kvm_gmem_populate(). Then kvm_gmem_get_pfn() faults in another part of the > > huge folio. For example, in TDX, GFN 0x81f belongs to the init memory region, > > while GFN 0x820 is faulted after TD is running. However, these two GFNs can > > belong to the same folio of order 9. > > Would the above scheme of clearing the entire folio up front and not re-clearing > at fault time work for this case? This case doesn't affect TDX, because TDX clearing private pages internally in SEAM APIs. So, as long as kvm_gmem_get_pfn() does not invoke clear_highpage() after making a folio private, it works fine for TDX. I was just trying to understand why SNP needs the clearing of entire folio in kvm_gmem_get_pfn() while I don't see how the entire folio is cleared when it's partially mapped in kvm_gmem_populate(). Also, I'm wondering if it would be better if SNP could move the clearing of folio into something like kvm_arch_gmem_clear(), just as kvm_arch_gmem_prepare() which is always invoked by kvm_gmem_get_pfn() and the architecture can handle it's own tracking at whatever granularity. > > Note: the current code should not impact TDX. I'm just asking out of curiosity:) > > > > [1] https://lore.kernel.org/all/aQ3uj4BZL6uFQzrD@yzhao56-desk.sh.intel.com/ > > > > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-11-25 3:13 ` Yan Zhao @ 2025-12-01 1:35 ` Vishal Annapurve 2025-12-01 2:51 ` Yan Zhao 2025-12-01 23:44 ` Michael Roth 1 sibling, 1 reply; 35+ messages in thread From: Vishal Annapurve @ 2025-12-01 1:35 UTC (permalink / raw) To: Yan Zhao Cc: Michael Roth, kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, ackerleytng, aik, ira.weiny On Mon, Nov 24, 2025 at 7:15 PM Yan Zhao <yan.y.zhao@intel.com> wrote: > > On Fri, Nov 21, 2025 at 06:43:14AM -0600, Michael Roth wrote: > > On Thu, Nov 20, 2025 at 05:12:55PM +0800, Yan Zhao wrote: > > > On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote: > > > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > > > { > > > > pgoff_t index = kvm_gmem_get_index(slot, gfn); > > > > struct folio *folio; > > > > - bool is_prepared = false; > > > > int r = 0; > > > > > > > > CLASS(gmem_get_file, file)(slot); > > > > if (!file) > > > > return -EFAULT; > > > > > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order); > > > > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order); > > > > if (IS_ERR(folio)) > > > > return PTR_ERR(folio); > > > > > > > > - if (!is_prepared) > > > > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > > > + if (!folio_test_uptodate(folio)) { > > > > + unsigned long i, nr_pages = folio_nr_pages(folio); > > > > + > > > > + for (i = 0; i < nr_pages; i++) > > > > + clear_highpage(folio_page(folio, i)); > > > > + folio_mark_uptodate(folio); > > > Here, the entire folio is cleared only when the folio is not marked uptodate. > > > Then, please check my questions at the bottom > > > > Yes, in this patch at least where I tried to mirror the current logic. I > > would not be surprised if we need to rework things for inplace/hugepage > > support though, but decoupling 'preparation' from the uptodate flag is > > the main goal here. > Could you elaborate a little why the decoupling is needed if it's not for > hugepage? IMO, decoupling is useful in general and we don't necessarily need to wait till hugepage support lands to clean up this logic. Current preparation logic has created some confusion regarding multiple features for guest_memfd under discussion such as generic write, uffd support, and direct map removal. It would be useful to simplify the guest_memfd logic in this regard. 
> > > > > > + } > > > > + > > > > + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > > > > > > > folio_unlock(folio); > > > > > > > > @@ -852,7 +843,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > struct folio *folio; > > > > gfn_t gfn = start_gfn + i; > > > > pgoff_t index = kvm_gmem_get_index(slot, gfn); > > > > - bool is_prepared = false; > > > > kvm_pfn_t pfn; > > > > > > > > if (signal_pending(current)) { > > > > @@ -860,19 +850,12 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > break; > > > > } > > > > > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order); > > > > + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); > > > > if (IS_ERR(folio)) { > > > > ret = PTR_ERR(folio); > > > > break; > > > > } > > > > > > > > - if (is_prepared) { > > > > - folio_unlock(folio); > > > > - folio_put(folio); > > > > - ret = -EEXIST; > > > > - break; > > > > - } > > > > - > > > > folio_unlock(folio); > > > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > > > (npages - i) < (1 << max_order)); > > > TDX could hit this warning easily when npages == 1, max_order == 9. > > > > Yes, this will need to change to handle that. I don't think I had to > > change this for previous iterations of SNP hugepage support, but > > there are definitely cases where a sub-2M range might get populated > > even though it's backed by a 2M folio, so I'm not sure why I didn't > > hit it there. > > > > But I'm taking Sean's cue on touching as little of the existing > > hugepage logic as possible in this particular series so we can revisit > > the remaining changes with some better context. > Frankly, I don't understand why this patch 1 is required if we only want "moving > GUP out of post_populate()" to work for 4KB folios. > > > > > > > > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > p = src ? src + i * PAGE_SIZE : NULL; > > > > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > > > > if (!ret) > > > > - kvm_gmem_mark_prepared(folio); > > > > + folio_mark_uptodate(folio); > > > As also asked in [1], why is the entire folio marked as uptodate here? Why does > > > kvm_gmem_get_pfn() clear all pages of a huge folio when the folio isn't marked > > > uptodate? > > > > Quoting your example from[1] for more context: > > > > > I also have a question about this patch: > > > > > > Suppose there's a 2MB huge folio A, where > > > A1 and A2 are 4KB pages belonging to folio A. > > > > > > (1) kvm_gmem_populate() invokes __kvm_gmem_get_pfn() and gets folio A. > > > It adds page A1 and invokes folio_mark_uptodate() on folio A. > > > > In SNP hugepage patchset you responded to, it would only mark A1 as > You mean code in > https://github.com/amdese/linux/commits/snp-inplace-conversion-rfc1 ? > > > prepared/cleared. There was 4K-granularity tracking added to handle this. > I don't find the code that marks only A1 as "prepared/cleared". > Instead, I just found folio_mark_uptodate() is invoked by kvm_gmem_populate() > to mark the entire folio A as uptodate. > > However, according to your statement below that "uptodate flag only tracks > whether a folio has been cleared", I don't follow why and where the entire folio > A would be cleared if kvm_gmem_populate() only adds page A1. 
I think kvm_gmem_populate() is currently only used by SNP and TDX logic, I don't see an issue with marking the complete folio as uptodate even if its partially updated by kvm_gmem_populate() paths as the private memory will eventually get initialized anyways. > > > There was an odd subtlety in that series though: it was defaulting to the > > folio_order() for the prep-tracking/post-populate, but it would then clamp > > it down based on the max order possible according whether that particular > > order was a homogenous range of KVM_MEMORY_ATTRIBUTE_PRIVATE. Which is not > > a great way to handle things, and I don't remember if I'd actually intended > > to implement it that way or not... that's probably why I never tripped over > > the WARN_ON() above, now that I think of it. > > > > But neither of these these apply to any current plans for hugepage support > > that I'm aware of, so probably not worth working through what that series > > did and look at this from a fresh perspective. > > > > > > > > (2) kvm_gmem_get_pfn() later faults in page A2. > > > As folio A is uptodate, clear_highpage() is not invoked on page A2. > > > kvm_gmem_prepare_folio() is invoked on the whole folio A. > > > > > > (2) could occur at least in TDX when only a part the 2MB page is added as guest > > > initial memory. > > > > > > My questions: > > > - Would (2) occur on SEV? > > > - If it does, is the lack of clear_highpage() on A2 a problem ? > > > - Is invoking gmem_prepare on page A1 a problem? > > > > Assuming this patch goes upstream in some form, we will now have the > > following major differences versus previous code: > > > > 1) uptodate flag only tracks whether a folio has been cleared > > 2) gmem always calls kvm_arch_gmem_prepare() via kvm_gmem_get_pfn() and > > the architecture can handle it's own tracking at whatever granularity > > it likes. > 2) looks good to me. > > > My hope is that 1) can similarly be done in such a way that gmem does not > > need to track things at sub-hugepage granularity and necessitate the need > > for some new data structure/state/flag to track sub-page status. > I actually don't understand what uptodate flag helps gmem to track. > Why can't clear_highpage() be done inside arch specific code? TDX doesn't need > this clearing after all. Target audience for guest_memfd includes non-confidential VMs as well. Inline with shmem and other filesystems, guest_memfd should clear pages on fault before handing them out to the users. There should be a way to opt-out of this behavior for certain private faults like for SNP/TDX and possibly for CCA as well. > > > My understanding based on prior discussion in guest_memfd calls was that > > it would be okay to go ahead and clear the entire folio at initial allocation > > time, and basically never mess with it again. It was also my understanding > That's where I don't follow in this patch. > I don't see where the entire folio A is cleared if it's only partially mapped by > kvm_gmem_populate(). kvm_gmem_get_pfn() won't clear folio A either due to > kvm_gmem_populate() has set the uptodate flag. Since kvm_gmem_populate() is specific to SNP and TDX VMs, I don't think this behavior is concerning. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-12-01 1:35 ` Vishal Annapurve @ 2025-12-01 2:51 ` Yan Zhao 2025-12-01 19:33 ` Vishal Annapurve 0 siblings, 1 reply; 35+ messages in thread From: Yan Zhao @ 2025-12-01 2:51 UTC (permalink / raw) To: Vishal Annapurve Cc: Michael Roth, kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, ackerleytng, aik, ira.weiny On Sun, Nov 30, 2025 at 05:35:41PM -0800, Vishal Annapurve wrote: > On Mon, Nov 24, 2025 at 7:15 PM Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > > p = src ? src + i * PAGE_SIZE : NULL; > > > > > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > > > > > if (!ret) > > > > > - kvm_gmem_mark_prepared(folio); > > > > > + folio_mark_uptodate(folio); > > > > As also asked in [1], why is the entire folio marked as uptodate here? Why does > > > > kvm_gmem_get_pfn() clear all pages of a huge folio when the folio isn't marked > > > > uptodate? > > > > > > Quoting your example from[1] for more context: > > > > > > > I also have a question about this patch: > > > > > > > > Suppose there's a 2MB huge folio A, where > > > > A1 and A2 are 4KB pages belonging to folio A. > > > > > > > > (1) kvm_gmem_populate() invokes __kvm_gmem_get_pfn() and gets folio A. > > > > It adds page A1 and invokes folio_mark_uptodate() on folio A. > > > > > > In SNP hugepage patchset you responded to, it would only mark A1 as > > You mean code in > > https://github.com/amdese/linux/commits/snp-inplace-conversion-rfc1 ? > > > > > prepared/cleared. There was 4K-granularity tracking added to handle this. > > I don't find the code that marks only A1 as "prepared/cleared". > > Instead, I just found folio_mark_uptodate() is invoked by kvm_gmem_populate() > > to mark the entire folio A as uptodate. > > > > However, according to your statement below that "uptodate flag only tracks > > whether a folio has been cleared", I don't follow why and where the entire folio > > A would be cleared if kvm_gmem_populate() only adds page A1. > > I think kvm_gmem_populate() is currently only used by SNP and TDX > logic, I don't see an issue with marking the complete folio as > uptodate even if its partially updated by kvm_gmem_populate() paths as > the private memory will eventually get initialized anyways. Still using the above example, If only page A1 is passed to sev_gmem_post_populate(), will SNP initialize the entire folio A? - if yes, could you kindly point me to the code that does this? . - if sev_gmem_post_populate() only initializes page A1, after marking the complete folio A as uptodate in kvm_gmem_populate(), later faulting in page A2 in kvm_gmem_get_pfn() will not clear page A2 by invoking clear_highpage(), since the entire folio A is uptodate. I don't understand why this is OK. Or what's the purpose of invoking clear_highpage() on other folios? Thanks Yan ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-12-01 2:51 ` Yan Zhao @ 2025-12-01 19:33 ` Vishal Annapurve 2025-12-02 9:16 ` Yan Zhao 0 siblings, 1 reply; 35+ messages in thread From: Vishal Annapurve @ 2025-12-01 19:33 UTC (permalink / raw) To: Yan Zhao Cc: Michael Roth, kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, ackerleytng, aik, ira.weiny On Sun, Nov 30, 2025 at 6:53 PM Yan Zhao <yan.y.zhao@intel.com> wrote: > > On Sun, Nov 30, 2025 at 05:35:41PM -0800, Vishal Annapurve wrote: > > On Mon, Nov 24, 2025 at 7:15 PM Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > > > p = src ? src + i * PAGE_SIZE : NULL; > > > > > > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > > > > > > if (!ret) > > > > > > - kvm_gmem_mark_prepared(folio); > > > > > > + folio_mark_uptodate(folio); > > > > > As also asked in [1], why is the entire folio marked as uptodate here? Why does > > > > > kvm_gmem_get_pfn() clear all pages of a huge folio when the folio isn't marked > > > > > uptodate? > > > > > > > > Quoting your example from[1] for more context: > > > > > > > > > I also have a question about this patch: > > > > > > > > > > Suppose there's a 2MB huge folio A, where > > > > > A1 and A2 are 4KB pages belonging to folio A. > > > > > > > > > > (1) kvm_gmem_populate() invokes __kvm_gmem_get_pfn() and gets folio A. > > > > > It adds page A1 and invokes folio_mark_uptodate() on folio A. > > > > > > > > In SNP hugepage patchset you responded to, it would only mark A1 as > > > You mean code in > > > https://github.com/amdese/linux/commits/snp-inplace-conversion-rfc1 ? > > > > > > > prepared/cleared. There was 4K-granularity tracking added to handle this. > > > I don't find the code that marks only A1 as "prepared/cleared". > > > Instead, I just found folio_mark_uptodate() is invoked by kvm_gmem_populate() > > > to mark the entire folio A as uptodate. > > > > > > However, according to your statement below that "uptodate flag only tracks > > > whether a folio has been cleared", I don't follow why and where the entire folio > > > A would be cleared if kvm_gmem_populate() only adds page A1. > > > > I think kvm_gmem_populate() is currently only used by SNP and TDX > > logic, I don't see an issue with marking the complete folio as > > uptodate even if its partially updated by kvm_gmem_populate() paths as > > the private memory will eventually get initialized anyways. > Still using the above example, > If only page A1 is passed to sev_gmem_post_populate(), will SNP initialize the > entire folio A? > - if yes, could you kindly point me to the code that does this? . > - if sev_gmem_post_populate() only initializes page A1, after marking the > complete folio A as uptodate in kvm_gmem_populate(), later faulting in page A2 > in kvm_gmem_get_pfn() will not clear page A2 by invoking clear_highpage(), > since the entire folio A is uptodate. I don't understand why this is OK. > Or what's the purpose of invoking clear_highpage() on other folios? I think sev_gmem_post_populate() only initializes the ranges marked for snp_launch_update(). Since the current code lacks a hugepage provider, the kvm_gmem_populate() doesn't need to explicitly clear anything for 4K backings during kvm_gmem_populate(). I see your point. 
Once a hugepage provider lands, kvm_gmem_populate() can first invoke clear_highpage() or an equivalent API on a complete huge folio before calling the architecture-specific post-populate hook to keep the implementation consistent. Subsequently, we need to figure out a way to avoid this clearing for SNP/TDX/CCA private faults. > > Thanks > Yan ^ permalink raw reply [flat|nested] 35+ messages in thread
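To make the above concrete, here is one possible shape of the kvm_gmem_populate() loop with an up-front clear of the whole folio. This is an illustrative sketch only, not code from this series; the variable names follow the hunks quoted earlier in the thread.

	folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order);
	if (IS_ERR(folio)) {
		ret = PTR_ERR(folio);
		break;
	}

	/*
	 * Clear the entire folio before handing it to the post-populate
	 * hook, so a partially-populated huge folio is still fully
	 * initialized by the time it is marked uptodate below.
	 */
	if (!folio_test_uptodate(folio)) {
		unsigned long j, nr_pages = folio_nr_pages(folio);

		for (j = 0; j < nr_pages; j++)
			clear_highpage(folio_page(folio, j));
	}

	folio_unlock(folio);

	p = src ? src + i * PAGE_SIZE : NULL;
	ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
	if (!ret)
		folio_mark_uptodate(folio);

Whether the clear can then be skipped for SNP/TDX/CCA private pages, as noted above, is a separate question.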
* Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-12-01 19:33 ` Vishal Annapurve @ 2025-12-02 9:16 ` Yan Zhao 0 siblings, 0 replies; 35+ messages in thread From: Yan Zhao @ 2025-12-02 9:16 UTC (permalink / raw) To: Vishal Annapurve Cc: Michael Roth, kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, ackerleytng, aik, ira.weiny On Mon, Dec 01, 2025 at 11:33:18AM -0800, Vishal Annapurve wrote: > On Sun, Nov 30, 2025 at 6:53 PM Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > On Sun, Nov 30, 2025 at 05:35:41PM -0800, Vishal Annapurve wrote: > > > On Mon, Nov 24, 2025 at 7:15 PM Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > > > > p = src ? src + i * PAGE_SIZE : NULL; > > > > > > > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > > > > > > > if (!ret) > > > > > > > - kvm_gmem_mark_prepared(folio); > > > > > > > + folio_mark_uptodate(folio); > > > > > > As also asked in [1], why is the entire folio marked as uptodate here? Why does > > > > > > kvm_gmem_get_pfn() clear all pages of a huge folio when the folio isn't marked > > > > > > uptodate? > > > > > > > > > > Quoting your example from[1] for more context: > > > > > > > > > > > I also have a question about this patch: > > > > > > > > > > > > Suppose there's a 2MB huge folio A, where > > > > > > A1 and A2 are 4KB pages belonging to folio A. > > > > > > > > > > > > (1) kvm_gmem_populate() invokes __kvm_gmem_get_pfn() and gets folio A. > > > > > > It adds page A1 and invokes folio_mark_uptodate() on folio A. > > > > > > > > > > In SNP hugepage patchset you responded to, it would only mark A1 as > > > > You mean code in > > > > https://github.com/amdese/linux/commits/snp-inplace-conversion-rfc1 ? > > > > > > > > > prepared/cleared. There was 4K-granularity tracking added to handle this. > > > > I don't find the code that marks only A1 as "prepared/cleared". > > > > Instead, I just found folio_mark_uptodate() is invoked by kvm_gmem_populate() > > > > to mark the entire folio A as uptodate. > > > > > > > > However, according to your statement below that "uptodate flag only tracks > > > > whether a folio has been cleared", I don't follow why and where the entire folio > > > > A would be cleared if kvm_gmem_populate() only adds page A1. > > > > > > I think kvm_gmem_populate() is currently only used by SNP and TDX > > > logic, I don't see an issue with marking the complete folio as > > > uptodate even if its partially updated by kvm_gmem_populate() paths as > > > the private memory will eventually get initialized anyways. > > Still using the above example, > > If only page A1 is passed to sev_gmem_post_populate(), will SNP initialize the > > entire folio A? > > - if yes, could you kindly point me to the code that does this? . > > - if sev_gmem_post_populate() only initializes page A1, after marking the > > complete folio A as uptodate in kvm_gmem_populate(), later faulting in page A2 > > in kvm_gmem_get_pfn() will not clear page A2 by invoking clear_highpage(), > > since the entire folio A is uptodate. I don't understand why this is OK. > > Or what's the purpose of invoking clear_highpage() on other folios? > > I think sev_gmem_post_populate() only initializes the ranges marked > for snp_launch_update(). 
Since the current code lacks a hugepage > provider, the kvm_gmem_populate() doesn't need to explicitly clear > anything for 4K backings during kvm_gmem_populate(). > > I see your point. Once a hugepage provider lands, kvm_gmem_populate() > can first invoke clear_highpage() or an equivalent API on a complete > huge folio before calling the architecture-specific post-populate hook > to keep the implementation consistent. Maybe do the clear_highpage() in kvm_gmem_get_folio() instead? Once in-place copy comes to kvm_gmem_populate(), kvm_gmem_get_folio() can be invoked first for the shared memory, so the clear_highpage() there happens before userspace writes to the shared memory. No clear_highpage() is then required when kvm_gmem_populate() invokes __kvm_gmem_get_pfn() to get the folio again. > Subsequently, we need to figure out a way to avoid this clearing for > SNP/TDX/CCA private faults. > > > > Thanks > > Yan ^ permalink raw reply [flat|nested] 35+ messages in thread
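A minimal sketch of the alternative suggested here, zeroing once at allocation time (again an assumption for illustration; today's kvm_gmem_get_folio() is essentially just a filemap_grab_folio() wrapper and does no clearing):

static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
{
	struct folio *folio = filemap_grab_folio(inode->i_mapping, index);

	if (IS_ERR(folio))
		return folio;

	/*
	 * Zero the folio exactly once, when it is first allocated, so
	 * neither kvm_gmem_get_pfn() nor kvm_gmem_populate() has to
	 * reason about (partial) clearing later on.
	 */
	if (!folio_test_uptodate(folio)) {
		unsigned long i, nr_pages = folio_nr_pages(folio);

		for (i = 0; i < nr_pages; i++)
			clear_highpage(folio_page(folio, i));
		folio_mark_uptodate(folio);
	}

	return folio;
}

The trade-off raised later in the thread applies: preallocation via fallocate() would then zero pages eagerly rather than lazily at fault time.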
* Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-11-25 3:13 ` Yan Zhao 2025-12-01 1:35 ` Vishal Annapurve @ 2025-12-01 23:44 ` Michael Roth 2025-12-02 9:17 ` Yan Zhao 1 sibling, 1 reply; 35+ messages in thread From: Michael Roth @ 2025-12-01 23:44 UTC (permalink / raw) To: Yan Zhao Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny On Tue, Nov 25, 2025 at 11:13:25AM +0800, Yan Zhao wrote: > On Fri, Nov 21, 2025 at 06:43:14AM -0600, Michael Roth wrote: > > On Thu, Nov 20, 2025 at 05:12:55PM +0800, Yan Zhao wrote: > > > On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote: > > > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > > > { > > > > pgoff_t index = kvm_gmem_get_index(slot, gfn); > > > > struct folio *folio; > > > > - bool is_prepared = false; > > > > int r = 0; > > > > > > > > CLASS(gmem_get_file, file)(slot); > > > > if (!file) > > > > return -EFAULT; > > > > > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order); > > > > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order); > > > > if (IS_ERR(folio)) > > > > return PTR_ERR(folio); > > > > > > > > - if (!is_prepared) > > > > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > > > + if (!folio_test_uptodate(folio)) { > > > > + unsigned long i, nr_pages = folio_nr_pages(folio); > > > > + > > > > + for (i = 0; i < nr_pages; i++) > > > > + clear_highpage(folio_page(folio, i)); > > > > + folio_mark_uptodate(folio); > > > Here, the entire folio is cleared only when the folio is not marked uptodate. > > > Then, please check my questions at the bottom > > > > Yes, in this patch at least where I tried to mirror the current logic. I > > would not be surprised if we need to rework things for inplace/hugepage > > support though, but decoupling 'preparation' from the uptodate flag is > > the main goal here. > Could you elaborate a little why the decoupling is needed if it's not for > hugepage? For instance, for in-place conversion: 1. initial allocation: clear, set uptodate, fault in as private 2. private->shared: call invalidate hook, fault in as shared 3. shared->private: call prep hook, fault in as private Here, 2/3 need to track where the current state is shared/private in order to make appropriate architecture-specific changes (e.g. RMP table updates). But we want to allow for non-destructive in-place conversion, where a page is 'uptodate', but not in the desired shared/private state. So 'uptodate' becomes a separate piece of state, which is still reasonable for gmem to track in the current 4K-only implementation, and provides for a reasonable approach to upstreaming in-place conversion, which isn't far off for either SNP or TDX. For hugepages, we'll have other things to consider, but those things are probably still somewhat far off, and so we shouldn't block steps toward in-place conversion based on uncertainty around hugepages. I think it's gotten enough attention at least that we know it *can* work, e.g. even if we take the inefficient/easy route of zero'ing the whole folio on initial access, setting it uptodate, and never doing anything with uptodate again, it's still a usable implementation. 
> > > > > > + } > > > > + > > > > + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > > > > > > > folio_unlock(folio); > > > > > > > > @@ -852,7 +843,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > struct folio *folio; > > > > gfn_t gfn = start_gfn + i; > > > > pgoff_t index = kvm_gmem_get_index(slot, gfn); > > > > - bool is_prepared = false; > > > > kvm_pfn_t pfn; > > > > > > > > if (signal_pending(current)) { > > > > @@ -860,19 +850,12 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > break; > > > > } > > > > > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order); > > > > + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); > > > > if (IS_ERR(folio)) { > > > > ret = PTR_ERR(folio); > > > > break; > > > > } > > > > > > > > - if (is_prepared) { > > > > - folio_unlock(folio); > > > > - folio_put(folio); > > > > - ret = -EEXIST; > > > > - break; > > > > - } > > > > - > > > > folio_unlock(folio); > > > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > > > (npages - i) < (1 << max_order)); > > > TDX could hit this warning easily when npages == 1, max_order == 9. > > > > Yes, this will need to change to handle that. I don't think I had to > > change this for previous iterations of SNP hugepage support, but > > there are definitely cases where a sub-2M range might get populated > > even though it's backed by a 2M folio, so I'm not sure why I didn't > > hit it there. > > > > But I'm taking Sean's cue on touching as little of the existing > > hugepage logic as possible in this particular series so we can revisit > > the remaining changes with some better context. > Frankly, I don't understand why this patch 1 is required if we only want "moving > GUP out of post_populate()" to work for 4KB folios. Above I outlined one of the use-cases for in-place conversion. During the 2 PUCK sessions prior to this RFC, Sean also mentioned some potential that other deadlocks might exist in current code due to how the locking is currently handled, and that we should consider this as a general cleanup against current kvm/next, but I leave that to Sean to elaborate on. Personally I think this series makes sense against kvm/next regardless: tracking preparation in gmem is basically already broken: everyone ignores it except SNP, so it was never performing that duty as-designed. So we are now simplying uptodate flag to no longer include this extra state-tracking, and leaving it for architecture-specific tracking. I can't see that be anything but beneficial to future gmem changes. > > > > > > > > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > p = src ? src + i * PAGE_SIZE : NULL; > > > > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > > > > if (!ret) > > > > - kvm_gmem_mark_prepared(folio); > > > > + folio_mark_uptodate(folio); > > > As also asked in [1], why is the entire folio marked as uptodate here? Why does > > > kvm_gmem_get_pfn() clear all pages of a huge folio when the folio isn't marked > > > uptodate? > > > > Quoting your example from[1] for more context: > > > > > I also have a question about this patch: > > > > > > Suppose there's a 2MB huge folio A, where > > > A1 and A2 are 4KB pages belonging to folio A. > > > > > > (1) kvm_gmem_populate() invokes __kvm_gmem_get_pfn() and gets folio A. > > > It adds page A1 and invokes folio_mark_uptodate() on folio A. 
> > > > In SNP hugepage patchset you responded to, it would only mark A1 as > You mean code in > https://github.com/amdese/linux/commits/snp-inplace-conversion-rfc1 ? No, sorry, I got my references mixed up. The only publically-posted version of SNP hugepage support is the THP series that does not involve in-place conversion, and that's what I was referencing. It's there where per-4K bitmap was added to track preparation, and in that series page-clearing/preparation are still coupled to some degree so per-4K tracking of page-clearing was still possible and that's how it was handled: https://github.com/AMDESE/linux/blob/snp-prepare-thp-rfc1/virt/kvm/guest_memfd.c#L992 but that can be considered an abandoned approach so I wouldn't spend much time referencing that. > > > prepared/cleared. There was 4K-granularity tracking added to handle this. > I don't find the code that marks only A1 as "prepared/cleared". > Instead, I just found folio_mark_uptodate() is invoked by kvm_gmem_populate() > to mark the entire folio A as uptodate. > > However, according to your statement below that "uptodate flag only tracks > whether a folio has been cleared", I don't follow why and where the entire folio > A would be cleared if kvm_gmem_populate() only adds page A1. > > > There was an odd subtlety in that series though: it was defaulting to the > > folio_order() for the prep-tracking/post-populate, but it would then clamp > > it down based on the max order possible according whether that particular > > order was a homogenous range of KVM_MEMORY_ATTRIBUTE_PRIVATE. Which is not > > a great way to handle things, and I don't remember if I'd actually intended > > to implement it that way or not... that's probably why I never tripped over > > the WARN_ON() above, now that I think of it. > > > > But neither of these these apply to any current plans for hugepage support > > that I'm aware of, so probably not worth working through what that series > > did and look at this from a fresh perspective. > > > > > > > > (2) kvm_gmem_get_pfn() later faults in page A2. > > > As folio A is uptodate, clear_highpage() is not invoked on page A2. > > > kvm_gmem_prepare_folio() is invoked on the whole folio A. > > > > > > (2) could occur at least in TDX when only a part the 2MB page is added as guest > > > initial memory. > > > > > > My questions: > > > - Would (2) occur on SEV? > > > - If it does, is the lack of clear_highpage() on A2 a problem ? > > > - Is invoking gmem_prepare on page A1 a problem? > > > > Assuming this patch goes upstream in some form, we will now have the > > following major differences versus previous code: > > > > 1) uptodate flag only tracks whether a folio has been cleared > > 2) gmem always calls kvm_arch_gmem_prepare() via kvm_gmem_get_pfn() and > > the architecture can handle it's own tracking at whatever granularity > > it likes. > 2) looks good to me. > > > My hope is that 1) can similarly be done in such a way that gmem does not > > need to track things at sub-hugepage granularity and necessitate the need > > for some new data structure/state/flag to track sub-page status. > I actually don't understand what uptodate flag helps gmem to track. > Why can't clear_highpage() be done inside arch specific code? TDX doesn't need > this clearing after all. It could. E.g. via the kernel-internal gmem flag that I mentioned in my earlier reply, or some alternative. 
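For illustration, one way the kernel-internal flag idea could look in the kvm_gmem_get_pfn() path. The flag name and where the flags live are assumptions rather than anything posted; 'gmem' stands in for wherever guest_memfd ends up storing such state.

	if (!folio_test_uptodate(folio)) {
		/*
		 * Architectures whose firmware/module scrubs private pages
		 * itself (e.g. TDX) could opt out of the software clear.
		 */
		if (!(gmem->flags & GMEM_NO_CLEAR_ON_FAULT)) {
			unsigned long i, nr_pages = folio_nr_pages(folio);

			for (i = 0; i < nr_pages; i++)
				clear_highpage(folio_page(folio, i));
		}
		folio_mark_uptodate(folio);
	}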
In the context of this series, the uptodate flag continues to instruct kvm_gmem_get_pfn() that it does not need to re-clear pages, because a prior kvm_gmem_get_pfn() or kvm_gmem_populate() already initialized the folio, and it is no longer tied to any notion of preparedness-tracking. What use uptodate will have in the context of hugepages: I'm not sure. For non-in-place conversion, it's tempting to just let it continue to be per-folio and require clearing the whole folio on initial access, but it's not efficient. It may make sense to farm it out to post-populate/prep hooks instead, as you're suggesting for TDX. But then, for in-place conversion, you have to deal with pages initially faulted in as shared. They might be split prior to initial access as a private page, where we can't assume TDX will have scrubbed things. So in that case it might still make sense to rely on it. Definitely things that require some more thought. But having it inextricably tied to preparedness just makes preparation tracking similarly more complicated, as it pulls it back into gmem when that does not seem to be the direction any architecture other than SNP wants to go. > > > My understanding based on prior discussion in guest_memfd calls was that > > it would be okay to go ahead and clear the entire folio at initial allocation > > time, and basically never mess with it again. It was also my understanding > That's where I don't follow in this patch. > I don't see where the entire folio A is cleared if it's only partially mapped by > kvm_gmem_populate(). kvm_gmem_get_pfn() won't clear folio A either due to > kvm_gmem_populate() has set the uptodate flag. > > > that for TDX it might even be optimal to completely skip clearing the folio > > if it is getting mapped into SecureEPT as a hugepage since the TDX module > > would handle that, but that maybe conversely after private->shared there > > would be some need to reclear... I'll try to find that discussion and > > refresh. Vishal I believe suggested some flags to provide more control over > > this behavior. > > > > > > > > It's possible (at least for TDX) that a huge folio is only partially populated > > > > by kvm_gmem_populate(). Then kvm_gmem_get_pfn() faults in another part of the > > > > huge folio. For example, in TDX, GFN 0x81f belongs to the init memory region, > > > > while GFN 0x820 is faulted after TD is running. However, these two GFNs can > > > > belong to the same folio of order 9. > > > > > > Would the above scheme of clearing the entire folio up front and not re-clearing > > > at fault time work for this case? > > This case doesn't affect TDX, because TDX clearing private pages internally in > > SEAM APIs. So, as long as kvm_gmem_get_pfn() does not invoke clear_highpage() > > after making a folio private, it works fine for TDX. > > > > I was just trying to understand why SNP needs the clearing of entire folio in > > kvm_gmem_get_pfn() while I don't see how the entire folio is cleared when it's > > partially mapped in kvm_gmem_populate(). > > Also, I'm wondering if it would be better if SNP could move the clearing of > > folio into something like kvm_arch_gmem_clear(), just as kvm_arch_gmem_prepare() > > which is always invoked by kvm_gmem_get_pfn() and the architecture can handle > > it's own tracking at whatever granularity. Possibly, but I touched elsewhere on where in-place conversion might trip up this approach. At least decoupling them allows for the prep side of things to be moved to architecture-specific tracking. We can deal with uptodate separately I think.
-Mike > > > > > Note: the current code should not impact TDX. I'm just asking out of curiosity:) > > > > > > [1] https://lore.kernel.org/all/aQ3uj4BZL6uFQzrD@yzhao56-desk.sh.intel.com/ > > > > > > ^ permalink raw reply [flat|nested] 35+ messages in thread
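As a strawman for the kvm_arch_gmem_clear() idea discussed above (purely illustrative; no such hook exists in this series or upstream), guest_memfd could delegate the clearing the same way it already delegates preparation via kvm_arch_gmem_prepare(), with the x86 side looking roughly like:

int kvm_arch_gmem_clear(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int nr_pages)
{
	int i;

	/* TDX: the TDX module scrubs private pages, so nothing to do here. */
	if (kvm->arch.vm_type == KVM_X86_TDX_VM)
		return 0;

	/* SNP and everything else: keep the software clear. */
	for (i = 0; i < nr_pages; i++)
		clear_highpage(pfn_to_page(pfn + i));

	return 0;
}

As the reply above notes, in-place conversion may complicate this, since a page faulted first as shared cannot rely on the architecture scrubbing it later.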
* Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-12-01 23:44 ` Michael Roth @ 2025-12-02 9:17 ` Yan Zhao 2025-12-03 13:47 ` Michael Roth 0 siblings, 1 reply; 35+ messages in thread From: Yan Zhao @ 2025-12-02 9:17 UTC (permalink / raw) To: Michael Roth Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny On Mon, Dec 01, 2025 at 05:44:47PM -0600, Michael Roth wrote: > On Tue, Nov 25, 2025 at 11:13:25AM +0800, Yan Zhao wrote: > > On Fri, Nov 21, 2025 at 06:43:14AM -0600, Michael Roth wrote: > > > On Thu, Nov 20, 2025 at 05:12:55PM +0800, Yan Zhao wrote: > > > > On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote: > > > > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > > > > { > > > > > pgoff_t index = kvm_gmem_get_index(slot, gfn); > > > > > struct folio *folio; > > > > > - bool is_prepared = false; > > > > > int r = 0; > > > > > > > > > > CLASS(gmem_get_file, file)(slot); > > > > > if (!file) > > > > > return -EFAULT; > > > > > > > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order); > > > > > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order); > > > > > if (IS_ERR(folio)) > > > > > return PTR_ERR(folio); > > > > > > > > > > - if (!is_prepared) > > > > > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > > > > + if (!folio_test_uptodate(folio)) { > > > > > + unsigned long i, nr_pages = folio_nr_pages(folio); > > > > > + > > > > > + for (i = 0; i < nr_pages; i++) > > > > > + clear_highpage(folio_page(folio, i)); > > > > > + folio_mark_uptodate(folio); > > > > Here, the entire folio is cleared only when the folio is not marked uptodate. > > > > Then, please check my questions at the bottom > > > > > > Yes, in this patch at least where I tried to mirror the current logic. I > > > would not be surprised if we need to rework things for inplace/hugepage > > > support though, but decoupling 'preparation' from the uptodate flag is > > > the main goal here. > > Could you elaborate a little why the decoupling is needed if it's not for > > hugepage? > > For instance, for in-place conversion: > > 1. initial allocation: clear, set uptodate, fault in as private > 2. private->shared: call invalidate hook, fault in as shared > 3. shared->private: call prep hook, fault in as private > > Here, 2/3 need to track where the current state is shared/private in > order to make appropriate architecture-specific changes (e.g. RMP table > updates). But we want to allow for non-destructive in-place conversion, > where a page is 'uptodate', but not in the desired shared/private state. > So 'uptodate' becomes a separate piece of state, which is still > reasonable for gmem to track in the current 4K-only implementation, and > provides for a reasonable approach to upstreaming in-place conversion, > which isn't far off for either SNP or TDX. To me, "1. initial allocation: clear, set uptodate" is more appropriate to be done in kvm_gmem_get_folio(), instead of in kvm_gmem_get_pfn(). With it, below looks reasonable to me. > For hugepages, we'll have other things to consider, but those things are > probably still somewhat far off, and so we shouldn't block steps toward > in-place conversion based on uncertainty around hugepages. I think it's > gotten enough attention at least that we know it *can* work, e.g. 
even > if we take the inefficient/easy route of zero'ing the whole folio on > initial access, setting it uptodate, and never doing anything with > uptodate again, it's still a usable implementation. <...> > > > Assuming this patch goes upstream in some form, we will now have the > > > following major differences versus previous code: > > > > > > 1) uptodate flag only tracks whether a folio has been cleared > > > 2) gmem always calls kvm_arch_gmem_prepare() via kvm_gmem_get_pfn() and > > > the architecture can handle it's own tracking at whatever granularity > > > it likes. > > 2) looks good to me. > > > > > My hope is that 1) can similarly be done in such a way that gmem does not > > > need to track things at sub-hugepage granularity and necessitate the need > > > for some new data structure/state/flag to track sub-page status. > > I actually don't understand what uptodate flag helps gmem to track. > > Why can't clear_highpage() be done inside arch specific code? TDX doesn't need > > this clearing after all. > > It could. E.g. via the kernel-internal gmem flag that I mentioned in my > earlier reply, or some alternative. > > In the context of this series, uptodate flag continues to instruct > kvm_gmem_get_pfn() that it doesn't not need to re-clear pages, because > a prior kvm_gmem_get_pfn() or kvm_gmem_populate() already initialized > the folio, and it is no longer tied to any notion of > preparedness-tracking. > > What use uptodate will have in the context of hugepages: I'm not sure. > For non-in-place conversion, it's tempting to just let it continue to be > per-folio and require clearing the whole folio on initial access, but > it's not efficient. It may make sense to farm it out to > post-populate/prep hooks instead, as you're suggesting for TDX. > > But then, for in-place conversion, you have to deal with pages initially > faulted in as shared. They might be split prior to initial access as a > private page, where we can't assume TDX will have scrubbed things. So in > that case it might still make sense to rely on it. > > Definitely things that require some more thought. But having it inextricably > tied to preparedness just makes preparation tracking similarly more > complicated as it pulls it back into gmem when that does not seem to be > the direction any architectures other SNP have/want to go. > > > > > > My understanding based on prior discussion in guest_memfd calls was that > > > it would be okay to go ahead and clear the entire folio at initial allocation > > > time, and basically never mess with it again. It was also my understanding > > That's where I don't follow in this patch. > > I don't see where the entire folio A is cleared if it's only partially mapped by > > kvm_gmem_populate(). kvm_gmem_get_pfn() won't clear folio A either due to > > kvm_gmem_populate() has set the uptodate flag. > > > > > that for TDX it might even be optimal to completely skip clearing the folio > > > if it is getting mapped into SecureEPT as a hugepage since the TDX module > > > would handle that, but that maybe conversely after private->shared there > > > would be some need to reclear... I'll try to find that discussion and > > > refresh. Vishal I believe suggested some flags to provide more control over > > > this behavior. > > > > > > > > > > > It's possible (at least for TDX) that a huge folio is only partially populated > > > > by kvm_gmem_populate(). Then kvm_gmem_get_pfn() faults in another part of the > > > > huge folio. 
For example, in TDX, GFN 0x81f belongs to the init memory region, > > > > while GFN 0x820 is faulted after TD is running. However, these two GFNs can > > > > belong to the same folio of order 9. > > > > > > Would the above scheme of clearing the entire folio up front and not re-clearing > > > at fault time work for this case? > > This case doesn't affect TDX, because TDX clearing private pages internally in > > SEAM APIs. So, as long as kvm_gmem_get_pfn() does not invoke clear_highpage() > > after making a folio private, it works fine for TDX. > > > > I was just trying to understand why SNP needs the clearing of entire folio in > > kvm_gmem_get_pfn() while I don't see how the entire folio is cleared when it's > > partially mapped in kvm_gmem_populate(). > > Also, I'm wondering if it would be better if SNP could move the clearing of > > folio into something like kvm_arch_gmem_clear(), just as kvm_arch_gmem_prepare() > > which is always invoked by kvm_gmem_get_pfn() and the architecture can handle > > it's own tracking at whatever granularity. > > Possibly, but I touched elsewhere on where in-place conversion might > trip up this approach. At least decoupling them allows for the prep side > of things to be moved to architecture-specific tracking. We can deal > with uptodate separately I think. > > -Mike > > > > > > > > > Note: the current code should not impact TDX. I'm just asking out of curiosity:) > > > > > > > > [1] https://lore.kernel.org/all/aQ3uj4BZL6uFQzrD@yzhao56-desk.sh.intel.com/ > > > > > > > > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-12-02 9:17 ` Yan Zhao @ 2025-12-03 13:47 ` Michael Roth 2025-12-05 3:54 ` Yan Zhao 0 siblings, 1 reply; 35+ messages in thread From: Michael Roth @ 2025-12-03 13:47 UTC (permalink / raw) To: Yan Zhao Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny On Tue, Dec 02, 2025 at 05:17:20PM +0800, Yan Zhao wrote: > On Mon, Dec 01, 2025 at 05:44:47PM -0600, Michael Roth wrote: > > On Tue, Nov 25, 2025 at 11:13:25AM +0800, Yan Zhao wrote: > > > On Fri, Nov 21, 2025 at 06:43:14AM -0600, Michael Roth wrote: > > > > On Thu, Nov 20, 2025 at 05:12:55PM +0800, Yan Zhao wrote: > > > > > On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote: > > > > > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > > > > > { > > > > > > pgoff_t index = kvm_gmem_get_index(slot, gfn); > > > > > > struct folio *folio; > > > > > > - bool is_prepared = false; > > > > > > int r = 0; > > > > > > > > > > > > CLASS(gmem_get_file, file)(slot); > > > > > > if (!file) > > > > > > return -EFAULT; > > > > > > > > > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order); > > > > > > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order); > > > > > > if (IS_ERR(folio)) > > > > > > return PTR_ERR(folio); > > > > > > > > > > > > - if (!is_prepared) > > > > > > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > > > > > + if (!folio_test_uptodate(folio)) { > > > > > > + unsigned long i, nr_pages = folio_nr_pages(folio); > > > > > > + > > > > > > + for (i = 0; i < nr_pages; i++) > > > > > > + clear_highpage(folio_page(folio, i)); > > > > > > + folio_mark_uptodate(folio); > > > > > Here, the entire folio is cleared only when the folio is not marked uptodate. > > > > > Then, please check my questions at the bottom > > > > > > > > Yes, in this patch at least where I tried to mirror the current logic. I > > > > would not be surprised if we need to rework things for inplace/hugepage > > > > support though, but decoupling 'preparation' from the uptodate flag is > > > > the main goal here. > > > Could you elaborate a little why the decoupling is needed if it's not for > > > hugepage? > > > > For instance, for in-place conversion: > > > > 1. initial allocation: clear, set uptodate, fault in as private > > 2. private->shared: call invalidate hook, fault in as shared > > 3. shared->private: call prep hook, fault in as private > > > > Here, 2/3 need to track where the current state is shared/private in > > order to make appropriate architecture-specific changes (e.g. RMP table > > updates). But we want to allow for non-destructive in-place conversion, > > where a page is 'uptodate', but not in the desired shared/private state. > > So 'uptodate' becomes a separate piece of state, which is still > > reasonable for gmem to track in the current 4K-only implementation, and > > provides for a reasonable approach to upstreaming in-place conversion, > > which isn't far off for either SNP or TDX. > To me, "1. initial allocation: clear, set uptodate" is more appropriate to > be done in kvm_gmem_get_folio(), instead of in kvm_gmem_get_pfn(). The downside is that preallocating originally involved just preallocating, and zero'ing would happen lazily during fault time. (or upfront via KVM_PRE_FAULT_MEMORY). 
But in the context of the hugetlb RFC, it certainly looks cleaner to do it there, since it could be done before any potential splitting activity, so then all the tail pages can inherit that initial uptodate flag. We'd probably want to weigh that these trade-offs based on how it will affect hugepages, but that would be clearer in the context of a new posting of hugepage support on top of these changes. So I think it's better to address that change as a follow-up so we can consider it with more context. > > With it, below looks reasonable to me. > > For hugepages, we'll have other things to consider, but those things are > > probably still somewhat far off, and so we shouldn't block steps toward > > in-place conversion based on uncertainty around hugepages. I think it's > > gotten enough attention at least that we know it *can* work, e.g. even > > if we take the inefficient/easy route of zero'ing the whole folio on > > initial access, setting it uptodate, and never doing anything with > > uptodate again, it's still a usable implementation. To me as well. But in the context of this series, against kvm/next, it creates userspace-visible changes to pre-allocation behavior that we can't justify in the context of this series alone, so as above I think that's something to save for hugepage follow-up. FWIW though, I ended up taking this approach for the hugetlb-based test branch, to address the fact (after you reminded me) that I wasn't fully zero'ing folio's in the kvm_gmem_populate() path: https://github.com/AMDESE/linux/commit/253fb30b076d81232deba0b02db070d5bc2b90b5 So maybe for your testing you could do similar. For newer hugepage support I'll likely do similar, but I don't think any of this reasoning or changes makes sense to people reviewing this series without already being aware of hugepage plans/development, so that's why I'm trying to keep this series more focused on in-place conversion enablement, because hugepage plans might be massively reworked for next posting based on LPC talks and changes in direction mentioned in recent guest_memfd calls and we are basically just hand-waving about what it will look like at this point while blocking other efforts. -Mike > > <...> > > > > Assuming this patch goes upstream in some form, we will now have the > > > > following major differences versus previous code: > > > > > > > > 1) uptodate flag only tracks whether a folio has been cleared > > > > 2) gmem always calls kvm_arch_gmem_prepare() via kvm_gmem_get_pfn() and > > > > the architecture can handle it's own tracking at whatever granularity > > > > it likes. > > > 2) looks good to me. > > > > > > > My hope is that 1) can similarly be done in such a way that gmem does not > > > > need to track things at sub-hugepage granularity and necessitate the need > > > > for some new data structure/state/flag to track sub-page status. > > > I actually don't understand what uptodate flag helps gmem to track. > > > Why can't clear_highpage() be done inside arch specific code? TDX doesn't need > > > this clearing after all. > > > > It could. E.g. via the kernel-internal gmem flag that I mentioned in my > > earlier reply, or some alternative. > > > > In the context of this series, uptodate flag continues to instruct > > kvm_gmem_get_pfn() that it doesn't not need to re-clear pages, because > > a prior kvm_gmem_get_pfn() or kvm_gmem_populate() already initialized > > the folio, and it is no longer tied to any notion of > > preparedness-tracking. 
> > > > What use uptodate will have in the context of hugepages: I'm not sure. > > For non-in-place conversion, it's tempting to just let it continue to be > > per-folio and require clearing the whole folio on initial access, but > > it's not efficient. It may make sense to farm it out to > > post-populate/prep hooks instead, as you're suggesting for TDX. > > > > But then, for in-place conversion, you have to deal with pages initially > > faulted in as shared. They might be split prior to initial access as a > > private page, where we can't assume TDX will have scrubbed things. So in > > that case it might still make sense to rely on it. > > > > Definitely things that require some more thought. But having it inextricably > > tied to preparedness just makes preparation tracking similarly more > > complicated as it pulls it back into gmem when that does not seem to be > > the direction any architectures other SNP have/want to go. > > > > > > > > > My understanding based on prior discussion in guest_memfd calls was that > > > > it would be okay to go ahead and clear the entire folio at initial allocation > > > > time, and basically never mess with it again. It was also my understanding > > > That's where I don't follow in this patch. > > > I don't see where the entire folio A is cleared if it's only partially mapped by > > > kvm_gmem_populate(). kvm_gmem_get_pfn() won't clear folio A either due to > > > kvm_gmem_populate() has set the uptodate flag. > > > > > > > that for TDX it might even be optimal to completely skip clearing the folio > > > > if it is getting mapped into SecureEPT as a hugepage since the TDX module > > > > would handle that, but that maybe conversely after private->shared there > > > > would be some need to reclear... I'll try to find that discussion and > > > > refresh. Vishal I believe suggested some flags to provide more control over > > > > this behavior. > > > > > > > > > > > > > > It's possible (at least for TDX) that a huge folio is only partially populated > > > > > by kvm_gmem_populate(). Then kvm_gmem_get_pfn() faults in another part of the > > > > > huge folio. For example, in TDX, GFN 0x81f belongs to the init memory region, > > > > > while GFN 0x820 is faulted after TD is running. However, these two GFNs can > > > > > belong to the same folio of order 9. > > > > > > > > Would the above scheme of clearing the entire folio up front and not re-clearing > > > > at fault time work for this case? > > > This case doesn't affect TDX, because TDX clearing private pages internally in > > > SEAM APIs. So, as long as kvm_gmem_get_pfn() does not invoke clear_highpage() > > > after making a folio private, it works fine for TDX. > > > > > > I was just trying to understand why SNP needs the clearing of entire folio in > > > kvm_gmem_get_pfn() while I don't see how the entire folio is cleared when it's > > > partially mapped in kvm_gmem_populate(). > > > Also, I'm wondering if it would be better if SNP could move the clearing of > > > folio into something like kvm_arch_gmem_clear(), just as kvm_arch_gmem_prepare() > > > which is always invoked by kvm_gmem_get_pfn() and the architecture can handle > > > it's own tracking at whatever granularity. > > > > Possibly, but I touched elsewhere on where in-place conversion might > > trip up this approach. At least decoupling them allows for the prep side > > of things to be moved to architecture-specific tracking. We can deal > > with uptodate separately I think. 
> > > > -Mike > > > > > > > > > > > > > Note: the current code should not impact TDX. I'm just asking out of curiosity:) > > > > > > > > > > [1] https://lore.kernel.org/all/aQ3uj4BZL6uFQzrD@yzhao56-desk.sh.intel.com/ > > > > > > > > > > > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking 2025-12-03 13:47 ` Michael Roth @ 2025-12-05 3:54 ` Yan Zhao 0 siblings, 0 replies; 35+ messages in thread From: Yan Zhao @ 2025-12-05 3:54 UTC (permalink / raw) To: Michael Roth Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny On Wed, Dec 03, 2025 at 07:47:17AM -0600, Michael Roth wrote: > On Tue, Dec 02, 2025 at 05:17:20PM +0800, Yan Zhao wrote: > > On Mon, Dec 01, 2025 at 05:44:47PM -0600, Michael Roth wrote: > > > On Tue, Nov 25, 2025 at 11:13:25AM +0800, Yan Zhao wrote: > > > > On Fri, Nov 21, 2025 at 06:43:14AM -0600, Michael Roth wrote: > > > > > On Thu, Nov 20, 2025 at 05:12:55PM +0800, Yan Zhao wrote: > > > > > > On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote: > > > > > > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > > > > > > { > > > > > > > pgoff_t index = kvm_gmem_get_index(slot, gfn); > > > > > > > struct folio *folio; > > > > > > > - bool is_prepared = false; > > > > > > > int r = 0; > > > > > > > > > > > > > > CLASS(gmem_get_file, file)(slot); > > > > > > > if (!file) > > > > > > > return -EFAULT; > > > > > > > > > > > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order); > > > > > > > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order); > > > > > > > if (IS_ERR(folio)) > > > > > > > return PTR_ERR(folio); > > > > > > > > > > > > > > - if (!is_prepared) > > > > > > > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); > > > > > > > + if (!folio_test_uptodate(folio)) { > > > > > > > + unsigned long i, nr_pages = folio_nr_pages(folio); > > > > > > > + > > > > > > > + for (i = 0; i < nr_pages; i++) > > > > > > > + clear_highpage(folio_page(folio, i)); > > > > > > > + folio_mark_uptodate(folio); > > > > > > Here, the entire folio is cleared only when the folio is not marked uptodate. > > > > > > Then, please check my questions at the bottom > > > > > > > > > > Yes, in this patch at least where I tried to mirror the current logic. I > > > > > would not be surprised if we need to rework things for inplace/hugepage > > > > > support though, but decoupling 'preparation' from the uptodate flag is > > > > > the main goal here. > > > > Could you elaborate a little why the decoupling is needed if it's not for > > > > hugepage? > > > > > > For instance, for in-place conversion: > > > > > > 1. initial allocation: clear, set uptodate, fault in as private > > > 2. private->shared: call invalidate hook, fault in as shared > > > 3. shared->private: call prep hook, fault in as private > > > > > > Here, 2/3 need to track where the current state is shared/private in > > > order to make appropriate architecture-specific changes (e.g. RMP table > > > updates). But we want to allow for non-destructive in-place conversion, > > > where a page is 'uptodate', but not in the desired shared/private state. > > > So 'uptodate' becomes a separate piece of state, which is still > > > reasonable for gmem to track in the current 4K-only implementation, and > > > provides for a reasonable approach to upstreaming in-place conversion, > > > which isn't far off for either SNP or TDX. > > To me, "1. initial allocation: clear, set uptodate" is more appropriate to > > be done in kvm_gmem_get_folio(), instead of in kvm_gmem_get_pfn(). 
> > The downside is that preallocating originally involved just > preallocating, and zero'ing would happen lazily during fault time. (or > upfront via KVM_PRE_FAULT_MEMORY). > > But in the context of the hugetlb RFC, it certainly looks cleaner to do > it there, since it could be done before any potential splitting activity, > so then all the tail pages can inherit that initial uptodate flag. > > We'd probably want to weigh that these trade-offs based on how it will > affect hugepages, but that would be clearer in the context of a new > posting of hugepage support on top of these changes. So I think it's > better to address that change as a follow-up so we can consider it with > more context. > > > > > With it, below looks reasonable to me. > > > For hugepages, we'll have other things to consider, but those things are > > > probably still somewhat far off, and so we shouldn't block steps toward > > > in-place conversion based on uncertainty around hugepages. I think it's > > > gotten enough attention at least that we know it *can* work, e.g. even > > > if we take the inefficient/easy route of zero'ing the whole folio on > > > initial access, setting it uptodate, and never doing anything with > > > uptodate again, it's still a usable implementation. > > To me as well. But in the context of this series, against kvm/next, it > creates userspace-visible changes to pre-allocation behavior that we > can't justify in the context of this series alone, so as above I think > that's something to save for hugepage follow-up. > > FWIW though, I ended up taking this approach for the hugetlb-based test > branch, to address the fact (after you reminded me) that I wasn't fully > zero'ing folio's in the kvm_gmem_populate() path: > > https://github.com/AMDESE/linux/commit/253fb30b076d81232deba0b02db070d5bc2b90b5 > > So maybe for your testing you could do similar. For newer hugepage > support I'll likely do similar, but I don't think any of this reasoning > or changes makes sense to people reviewing this series without already > being aware of hugepage plans/development, so that's why I'm trying to > keep this series more focused on in-place conversion enablement, because > hugepage plans might be massively reworked for next posting based on LPC > talks and changes in direction mentioned in recent guest_memfd calls and > we are basically just hand-waving about what it will look like at this > point while blocking other efforts. > Got it. Thanks for the explanation! ^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 2/3] KVM: TDX: Document alignment requirements for KVM_TDX_INIT_MEM_REGION 2025-11-13 23:07 [PATCH RFC 0/3] KVM: guest_memfd: Rework preparation/population flows in prep for in-place conversion Michael Roth 2025-11-13 23:07 ` [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking Michael Roth @ 2025-11-13 23:07 ` Michael Roth 2025-11-13 23:07 ` [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory Michael Roth 2 siblings, 0 replies; 35+ messages in thread From: Michael Roth @ 2025-11-13 23:07 UTC (permalink / raw) To: kvm Cc: linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny, yan.y.zhao Since it was never possible to use a non-PAGE_SIZE-aligned @source_addr, go ahead and document this as a requirement, and add a KVM_BUG_ON() in the post-populate callback handler to ensure future reworks to guest_memfd do not violate this constraint. Signed-off-by: Michael Roth <michael.roth@amd.com> --- Documentation/virt/kvm/x86/intel-tdx.rst | 2 +- arch/x86/kvm/vmx/tdx.c | 3 +++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst index 5efac62c92c7..6a222e9d0954 100644 --- a/Documentation/virt/kvm/x86/intel-tdx.rst +++ b/Documentation/virt/kvm/x86/intel-tdx.rst @@ -156,7 +156,7 @@ KVM_TDX_INIT_MEM_REGION :Returns: 0 on success, <0 on error Initialize @nr_pages TDX guest private memory starting from @gpa with userspace -provided data from @source_addr. +provided data from @source_addr. @source_addr must be PAGE_SIZE-aligned. Note, before calling this sub command, memory attribute of the range [gpa, gpa + nr_pages] needs to be private. Userspace can use diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index 3cf80babc3c1..57ed101a1181 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -3127,6 +3127,9 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) return -EIO; + if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm)) + return -EINVAL; + /* * Get the source page if it has been faulted in. Return failure if the * source page has been swapped out or unmapped in primary memory. -- 2.25.1 ^ permalink raw reply [flat|nested] 35+ messages in thread
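For VMM authors, a minimal userspace sketch of meeting the alignment requirement documented above. This assumes the in-tree TDX uAPI (struct kvm_tdx_cmd and struct kvm_tdx_init_mem_region issued via KVM_MEMORY_ENCRYPT_OP on the vCPU fd); error handling is reduced to the essentials and the GPA range is assumed to already be private guest_memfd-backed memory.

#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define PG_SZ 4096UL

static int tdx_init_mem_region(int vcpu_fd, const void *payload, size_t len, __u64 gpa)
{
	size_t aligned = (len + PG_SZ - 1) & ~(PG_SZ - 1);
	struct kvm_tdx_init_mem_region region;
	struct kvm_tdx_cmd cmd = { 0 };
	void *src;
	int ret;

	/* @source_addr must be PAGE_SIZE-aligned, so use an aligned bounce buffer. */
	if (posix_memalign(&src, PG_SZ, aligned))
		return -1;
	memset(src, 0, aligned);
	memcpy(src, payload, len);

	region.source_addr = (__u64)(unsigned long)src;
	region.gpa = gpa;
	region.nr_pages = aligned / PG_SZ;

	cmd.id = KVM_TDX_INIT_MEM_REGION;
	cmd.flags = KVM_TDX_MEASURE_MEMORY_REGION;
	cmd.data = (__u64)(unsigned long)&region;

	ret = ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
	free(src);
	return ret;
}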
* [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-11-13 23:07 [PATCH RFC 0/3] KVM: guest_memfd: Rework preparation/population flows in prep for in-place conversion Michael Roth 2025-11-13 23:07 ` [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking Michael Roth 2025-11-13 23:07 ` [PATCH 2/3] KVM: TDX: Document alignment requirements for KVM_TDX_INIT_MEM_REGION Michael Roth @ 2025-11-13 23:07 ` Michael Roth 2025-11-20 9:11 ` Yan Zhao 2025-11-20 19:34 ` Ira Weiny 2 siblings, 2 replies; 35+ messages in thread From: Michael Roth @ 2025-11-13 23:07 UTC (permalink / raw) To: kvm Cc: linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny, yan.y.zhao Currently the post-populate callbacks handle copying source pages into private GPA ranges backed by guest_memfd, where kvm_gmem_populate() acquires the filemap invalidate lock, then calls a post-populate callback which may issue a get_user_pages() on the source pages prior to copying them into the private GPA (e.g. TDX). This will not be compatible with in-place conversion, where the userspace page fault path will attempt to acquire the filemap invalidate lock while holding the mm->mmap_lock, leading to a potential ABBA deadlock[1]. Address this by hoisting the GUP above the filemap invalidate lock so that these page fault paths can be taken early, prior to acquiring the filemap invalidate lock. It's not currently clear whether this issue is reachable with the current implementation of guest_memfd, which doesn't support in-place conversion. However, this change does provide a consistent mechanism for passing stable source/target PFNs to callbacks rather than punting to vendor-specific code, which allows for more commonality across architectures and may be worthwhile even without in-place conversion.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> --- arch/x86/kvm/svm/sev.c | 40 ++++++++++++++++++++++++++------------ arch/x86/kvm/vmx/tdx.c | 21 +++++--------------- include/linux/kvm_host.h | 3 ++- virt/kvm/guest_memfd.c | 42 ++++++++++++++++++++++++++++++++++------ 4 files changed, 71 insertions(+), 35 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 0835c664fbfd..d0ac710697a2 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2260,7 +2260,8 @@ struct sev_gmem_populate_args { }; static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn, - void __user *src, int order, void *opaque) + struct page **src_pages, loff_t src_offset, + int order, void *opaque) { struct sev_gmem_populate_args *sev_populate_args = opaque; struct kvm_sev_info *sev = to_kvm_sev_info(kvm); @@ -2268,7 +2269,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf int npages = (1 << order); gfn_t gfn; - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src)) + if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_pages)) return -EINVAL; for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) { @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf goto err; } - if (src) { - void *vaddr = kmap_local_pfn(pfn + i); + if (src_pages) { + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); + void *dst_vaddr = kmap_local_pfn(pfn + i); - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { - ret = -EFAULT; - goto err; + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset); + kunmap_local(src_vaddr); + + if (src_offset) { + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); + + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset); + kunmap_local(src_vaddr); } - kunmap_local(vaddr); + + kunmap_local(dst_vaddr); } ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K, @@ -2331,12 +2339,20 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf if (!snp_page_reclaim(kvm, pfn + i) && sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID && sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) { - void *vaddr = kmap_local_pfn(pfn + i); + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); + void *dst_vaddr = kmap_local_pfn(pfn + i); - if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE)) - pr_debug("Failed to write CPUID page back to userspace\n"); + memcpy(src_vaddr + src_offset, dst_vaddr, PAGE_SIZE - src_offset); + kunmap_local(src_vaddr); + + if (src_offset) { + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); + + memcpy(src_vaddr, dst_vaddr + PAGE_SIZE - src_offset, src_offset); + kunmap_local(src_vaddr); + } - kunmap_local(vaddr); + kunmap_local(dst_vaddr); } /* pfn + i is hypervisor-owned now, so skip below cleanup for it. 
*/ diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index 57ed101a1181..dd5439ec1473 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -3115,37 +3115,26 @@ struct tdx_gmem_post_populate_arg { }; static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, - void __user *src, int order, void *_arg) + struct page **src_pages, loff_t src_offset, + int order, void *_arg) { struct tdx_gmem_post_populate_arg *arg = _arg; struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); u64 err, entry, level_state; gpa_t gpa = gfn_to_gpa(gfn); - struct page *src_page; int ret, i; if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) return -EIO; - if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm)) + /* Source should be page-aligned, in which case src_offset will be 0. */ + if (KVM_BUG_ON(src_offset)) return -EINVAL; - /* - * Get the source page if it has been faulted in. Return failure if the - * source page has been swapped out or unmapped in primary memory. - */ - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); - if (ret < 0) - return ret; - if (ret != 1) - return -ENOMEM; - - kvm_tdx->page_add_src = src_page; + kvm_tdx->page_add_src = src_pages[i]; ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); kvm_tdx->page_add_src = NULL; - put_page(src_page); - if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION)) return ret; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index d93f75b05ae2..7e9d2403c61f 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2581,7 +2581,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord * Returns the number of pages that were populated. */ typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, - void __user *src, int order, void *opaque); + struct page **src_pages, loff_t src_offset, + int order, void *opaque); long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages, kvm_gmem_populate_cb post_populate, void *opaque); diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 9160379df378..e9ac3fd4fd8f 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -814,14 +814,17 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn); #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE + +#define GMEM_GUP_NPAGES (1UL << PMD_ORDER) + long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages, kvm_gmem_populate_cb post_populate, void *opaque) { struct kvm_memory_slot *slot; - void __user *p; - + struct page **src_pages; int ret = 0, max_order; - long i; + loff_t src_offset = 0; + long i, src_npages; lockdep_assert_held(&kvm->slots_lock); @@ -836,9 +839,28 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long if (!file) return -EFAULT; + npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); + npages = min_t(ulong, npages, GMEM_GUP_NPAGES); + + if (src) { + src_npages = IS_ALIGNED((unsigned long)src, PAGE_SIZE) ? 
npages : npages + 1; + + src_pages = kmalloc_array(src_npages, sizeof(struct page *), GFP_KERNEL); + if (!src_pages) + return -ENOMEM; + + ret = get_user_pages_fast((unsigned long)src, src_npages, 0, src_pages); + if (ret < 0) + return ret; + + if (ret != src_npages) + return -ENOMEM; + + src_offset = (loff_t)(src - PTR_ALIGN_DOWN(src, PAGE_SIZE)); + } + filemap_invalidate_lock(file->f_mapping); - npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); for (i = 0; i < npages; i += (1 << max_order)) { struct folio *folio; gfn_t gfn = start_gfn + i; @@ -869,8 +891,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long max_order--; } - p = src ? src + i * PAGE_SIZE : NULL; - ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); + ret = post_populate(kvm, gfn, pfn, src ? &src_pages[i] : NULL, + src_offset, max_order, opaque); if (!ret) folio_mark_uptodate(folio); @@ -882,6 +904,14 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long filemap_invalidate_unlock(file->f_mapping); + if (src) { + long j; + + for (j = 0; j < src_npages; j++) + put_page(src_pages[j]); + kfree(src_pages); + } + return ret && !i ? ret : i; } EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_populate); -- 2.25.1 ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-11-13 23:07 ` [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory Michael Roth @ 2025-11-20 9:11 ` Yan Zhao 2025-11-21 13:01 ` Michael Roth 2025-11-20 19:34 ` Ira Weiny 1 sibling, 1 reply; 35+ messages in thread From: Yan Zhao @ 2025-11-20 9:11 UTC (permalink / raw) To: Michael Roth Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny On Thu, Nov 13, 2025 at 05:07:59PM -0600, Michael Roth wrote: > Currently the post-populate callbacks handle copying source pages into > private GPA ranges backed by guest_memfd, where kvm_gmem_populate() > acquires the filemap invalidate lock, then calls a post-populate > callback which may issue a get_user_pages() on the source pages prior to > copying them into the private GPA (e.g. TDX). > > This will not be compatible with in-place conversion, where the > userspace page fault path will attempt to acquire filemap invalidate > lock while holding the mm->mmap_lock, leading to a potential ABBA > deadlock[1]. > > Address this by hoisting the GUP above the filemap invalidate lock so > that these page faults path can be taken early, prior to acquiring the > filemap invalidate lock. > > It's not currently clear whether this issue is reachable with the > current implementation of guest_memfd, which doesn't support in-place > conversion, however it does provide a consistent mechanism to provide > stable source/target PFNs to callbacks rather than punting to > vendor-specific code, which allows for more commonality across > architectures, which may be worthwhile even without in-place conversion. > > Suggested-by: Sean Christopherson <seanjc@google.com> > Signed-off-by: Michael Roth <michael.roth@amd.com> > --- > arch/x86/kvm/svm/sev.c | 40 ++++++++++++++++++++++++++------------ > arch/x86/kvm/vmx/tdx.c | 21 +++++--------------- > include/linux/kvm_host.h | 3 ++- > virt/kvm/guest_memfd.c | 42 ++++++++++++++++++++++++++++++++++------ > 4 files changed, 71 insertions(+), 35 deletions(-) > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > index 0835c664fbfd..d0ac710697a2 100644 > --- a/arch/x86/kvm/svm/sev.c > +++ b/arch/x86/kvm/svm/sev.c > @@ -2260,7 +2260,8 @@ struct sev_gmem_populate_args { > }; > > static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn, > - void __user *src, int order, void *opaque) > + struct page **src_pages, loff_t src_offset, > + int order, void *opaque) > { > struct sev_gmem_populate_args *sev_populate_args = opaque; > struct kvm_sev_info *sev = to_kvm_sev_info(kvm); > @@ -2268,7 +2269,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > int npages = (1 << order); > gfn_t gfn; > > - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src)) > + if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_pages)) > return -EINVAL; > > for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) { > @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > goto err; > } > > - if (src) { > - void *vaddr = kmap_local_pfn(pfn + i); > + if (src_pages) { > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { > - ret = -EFAULT; > - goto 
err; > + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset); > + kunmap_local(src_vaddr); > + > + if (src_offset) { > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > + > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset); > + kunmap_local(src_vaddr); IIUC, src_offset is the src's offset from the first page. e.g., src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512. Then it looks like the two memcpy() calls here only work when npages == 1 ? > } > - kunmap_local(vaddr); > + > + kunmap_local(dst_vaddr); > } > > ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K, > @@ -2331,12 +2339,20 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > if (!snp_page_reclaim(kvm, pfn + i) && > sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID && > sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) { > - void *vaddr = kmap_local_pfn(pfn + i); > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > - if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE)) > - pr_debug("Failed to write CPUID page back to userspace\n"); > + memcpy(src_vaddr + src_offset, dst_vaddr, PAGE_SIZE - src_offset); > + kunmap_local(src_vaddr); > + > + if (src_offset) { > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > + > + memcpy(src_vaddr, dst_vaddr + PAGE_SIZE - src_offset, src_offset); > + kunmap_local(src_vaddr); > + } > > - kunmap_local(vaddr); > + kunmap_local(dst_vaddr); > } > > /* pfn + i is hypervisor-owned now, so skip below cleanup for it. */ > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index 57ed101a1181..dd5439ec1473 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -3115,37 +3115,26 @@ struct tdx_gmem_post_populate_arg { > }; > > static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > - void __user *src, int order, void *_arg) > + struct page **src_pages, loff_t src_offset, > + int order, void *_arg) > { > struct tdx_gmem_post_populate_arg *arg = _arg; > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > u64 err, entry, level_state; > gpa_t gpa = gfn_to_gpa(gfn); > - struct page *src_page; > int ret, i; > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) > return -EIO; > > - if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm)) > + /* Source should be page-aligned, in which case src_offset will be 0. */ > + if (KVM_BUG_ON(src_offset)) if (KVM_BUG_ON(src_offset, kvm)) > return -EINVAL; > > - /* > - * Get the source page if it has been faulted in. Return failure if the > - * source page has been swapped out or unmapped in primary memory. > - */ > - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); > - if (ret < 0) > - return ret; > - if (ret != 1) > - return -ENOMEM; > - > - kvm_tdx->page_add_src = src_page; > + kvm_tdx->page_add_src = src_pages[i]; src_pages[0] ? i is not initialized. Should there also be a KVM_BUG_ON(order > 0, kvm) ? > ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); > kvm_tdx->page_add_src = NULL; > > - put_page(src_page); > - > if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION)) > return ret; > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index d93f75b05ae2..7e9d2403c61f 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -2581,7 +2581,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord > * Returns the number of pages that were populated. 
> */ > typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > - void __user *src, int order, void *opaque); > + struct page **src_pages, loff_t src_offset, > + int order, void *opaque); > > long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages, > kvm_gmem_populate_cb post_populate, void *opaque); > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > index 9160379df378..e9ac3fd4fd8f 100644 > --- a/virt/kvm/guest_memfd.c > +++ b/virt/kvm/guest_memfd.c > @@ -814,14 +814,17 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn); > > #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE > + > +#define GMEM_GUP_NPAGES (1UL << PMD_ORDER) Limiting GMEM_GUP_NPAGES to PMD_ORDER may only work when the max_order of a huge folio is 2MB. What if the max_order returned from __kvm_gmem_get_pfn() is 1GB when src_pages[] can only hold up to 512 pages? Increasing GMEM_GUP_NPAGES to (1UL << PUD_ORDER) is probabaly not a good idea. Given both TDX/SNP map at 4KB granularity, why not just invoke post_populate() per 4KB while removing the max_order from post_populate() parameters, as done in Sean's sketch patch [1]? Then the WARN_ON() in kvm_gmem_populate() can be removed, which would be easily triggered by TDX when max_order > 0 && npages == 1: WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || (npages - i) < (1 << max_order)); [1] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/ > long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages, > kvm_gmem_populate_cb post_populate, void *opaque) > { > struct kvm_memory_slot *slot; > - void __user *p; > - > + struct page **src_pages; > int ret = 0, max_order; > - long i; > + loff_t src_offset = 0; > + long i, src_npages; > > lockdep_assert_held(&kvm->slots_lock); > > @@ -836,9 +839,28 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > if (!file) > return -EFAULT; > > + npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); > + npages = min_t(ulong, npages, GMEM_GUP_NPAGES); > + > + if (src) { > + src_npages = IS_ALIGNED((unsigned long)src, PAGE_SIZE) ? npages : npages + 1; > + > + src_pages = kmalloc_array(src_npages, sizeof(struct page *), GFP_KERNEL); > + if (!src_pages) > + return -ENOMEM; > + > + ret = get_user_pages_fast((unsigned long)src, src_npages, 0, src_pages); > + if (ret < 0) > + return ret; > + > + if (ret != src_npages) > + return -ENOMEM; > + > + src_offset = (loff_t)(src - PTR_ALIGN_DOWN(src, PAGE_SIZE)); > + } > + > filemap_invalidate_lock(file->f_mapping); > > - npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); > for (i = 0; i < npages; i += (1 << max_order)) { > struct folio *folio; > gfn_t gfn = start_gfn + i; > @@ -869,8 +891,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > max_order--; > } > > - p = src ? src + i * PAGE_SIZE : NULL; > - ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > + ret = post_populate(kvm, gfn, pfn, src ? &src_pages[i] : NULL, > + src_offset, max_order, opaque); Why src_offset is not 0 starting from the 2nd page? 
> if (!ret) > folio_mark_uptodate(folio); > > @@ -882,6 +904,14 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > filemap_invalidate_unlock(file->f_mapping); > > + if (src) { > + long j; > + > + for (j = 0; j < src_npages; j++) > + put_page(src_pages[j]); > + kfree(src_pages); > + } > + > return ret && !i ? ret : i; > } > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_populate); > -- > 2.25.1 > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-11-20 9:11 ` Yan Zhao @ 2025-11-21 13:01 ` Michael Roth 2025-11-24 9:31 ` Yan Zhao 2025-12-01 1:44 ` Vishal Annapurve 0 siblings, 2 replies; 35+ messages in thread From: Michael Roth @ 2025-11-21 13:01 UTC (permalink / raw) To: Yan Zhao Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny On Thu, Nov 20, 2025 at 05:11:48PM +0800, Yan Zhao wrote: > On Thu, Nov 13, 2025 at 05:07:59PM -0600, Michael Roth wrote: > > Currently the post-populate callbacks handle copying source pages into > > private GPA ranges backed by guest_memfd, where kvm_gmem_populate() > > acquires the filemap invalidate lock, then calls a post-populate > > callback which may issue a get_user_pages() on the source pages prior to > > copying them into the private GPA (e.g. TDX). > > > > This will not be compatible with in-place conversion, where the > > userspace page fault path will attempt to acquire filemap invalidate > > lock while holding the mm->mmap_lock, leading to a potential ABBA > > deadlock[1]. > > > > Address this by hoisting the GUP above the filemap invalidate lock so > > that these page faults path can be taken early, prior to acquiring the > > filemap invalidate lock. > > > > It's not currently clear whether this issue is reachable with the > > current implementation of guest_memfd, which doesn't support in-place > > conversion, however it does provide a consistent mechanism to provide > > stable source/target PFNs to callbacks rather than punting to > > vendor-specific code, which allows for more commonality across > > architectures, which may be worthwhile even without in-place conversion. 
> > > > Suggested-by: Sean Christopherson <seanjc@google.com> > > Signed-off-by: Michael Roth <michael.roth@amd.com> > > --- > > arch/x86/kvm/svm/sev.c | 40 ++++++++++++++++++++++++++------------ > > arch/x86/kvm/vmx/tdx.c | 21 +++++--------------- > > include/linux/kvm_host.h | 3 ++- > > virt/kvm/guest_memfd.c | 42 ++++++++++++++++++++++++++++++++++------ > > 4 files changed, 71 insertions(+), 35 deletions(-) > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > index 0835c664fbfd..d0ac710697a2 100644 > > --- a/arch/x86/kvm/svm/sev.c > > +++ b/arch/x86/kvm/svm/sev.c > > @@ -2260,7 +2260,8 @@ struct sev_gmem_populate_args { > > }; > > > > static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn, > > - void __user *src, int order, void *opaque) > > + struct page **src_pages, loff_t src_offset, > > + int order, void *opaque) > > { > > struct sev_gmem_populate_args *sev_populate_args = opaque; > > struct kvm_sev_info *sev = to_kvm_sev_info(kvm); > > @@ -2268,7 +2269,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > int npages = (1 << order); > > gfn_t gfn; > > > > - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src)) > > + if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_pages)) > > return -EINVAL; > > > > for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) { > > @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > goto err; > > } > > > > - if (src) { > > - void *vaddr = kmap_local_pfn(pfn + i); > > + if (src_pages) { > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { > > - ret = -EFAULT; > > - goto err; > > + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset); > > + kunmap_local(src_vaddr); > > + > > + if (src_offset) { > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > + > > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset); > > + kunmap_local(src_vaddr); > IIUC, src_offset is the src's offset from the first page. e.g., > src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512. > > Then it looks like the two memcpy() calls here only work when npages == 1 ? src_offset ends up being the offset into the pair of src pages that we are using to fully populate a single dest page with each iteration. So if we start at src_offset, read a page worth of data, then we are now at src_offset in the next src page and the loop continues that way even if npages > 1. If src_offset is 0 we never have to bother with straddling 2 src pages so the 2nd memcpy is skipped on every iteration. That's the intent at least. Is there a flaw in the code/reasoning that I missed? 
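In other words, each destination page is assembled from the tail of src_pages[i] plus (when src_offset != 0) the head of src_pages[i + 1], with the same src_offset applying on every iteration. As a standalone userspace sketch of that same arithmetic (illustrative only, assuming 4KiB pages; the names mirror the patch, but this is not the kernel code itself):

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096UL

/*
 * Fill destination page i from src_pages[i] starting at src_offset,
 * taking the straddled remainder from src_pages[i + 1] when the source
 * buffer was not page-aligned (src_offset != 0).
 */
static void copy_dst_page(uint8_t *dst, uint8_t **src_pages, long i,
                          unsigned long src_offset)
{
        memcpy(dst, src_pages[i] + src_offset, PAGE_SIZE - src_offset);

        if (src_offset)
                memcpy(dst + PAGE_SIZE - src_offset, src_pages[i + 1],
                       src_offset);
}

int main(void)
{
        static uint8_t src[3 * PAGE_SIZE], dst[2 * PAGE_SIZE];
        uint8_t *src_pages[3] = { src, src + PAGE_SIZE, src + 2 * PAGE_SIZE };
        unsigned long src_offset = 0x100; /* unaligned source, as in the example above */
        long i;

        memset(src, 0xaa, sizeof(src));
        /* npages == 2 with a non-zero src_offset: four memcpy() calls total. */
        for (i = 0; i < 2; i++)
                copy_dst_page(dst + i * PAGE_SIZE, src_pages, i, src_offset);
        return 0;
}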
> > > } > > - kunmap_local(vaddr); > > + > > + kunmap_local(dst_vaddr); > > } > > > > ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K, > > @@ -2331,12 +2339,20 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > if (!snp_page_reclaim(kvm, pfn + i) && > > sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID && > > sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) { > > - void *vaddr = kmap_local_pfn(pfn + i); > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > - if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE)) > > - pr_debug("Failed to write CPUID page back to userspace\n"); > > + memcpy(src_vaddr + src_offset, dst_vaddr, PAGE_SIZE - src_offset); > > + kunmap_local(src_vaddr); > > + > > + if (src_offset) { > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > + > > + memcpy(src_vaddr, dst_vaddr + PAGE_SIZE - src_offset, src_offset); > > + kunmap_local(src_vaddr); > > + } > > > > - kunmap_local(vaddr); > > + kunmap_local(dst_vaddr); > > } > > > > /* pfn + i is hypervisor-owned now, so skip below cleanup for it. */ > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > > index 57ed101a1181..dd5439ec1473 100644 > > --- a/arch/x86/kvm/vmx/tdx.c > > +++ b/arch/x86/kvm/vmx/tdx.c > > @@ -3115,37 +3115,26 @@ struct tdx_gmem_post_populate_arg { > > }; > > > > static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > - void __user *src, int order, void *_arg) > > + struct page **src_pages, loff_t src_offset, > > + int order, void *_arg) > > { > > struct tdx_gmem_post_populate_arg *arg = _arg; > > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > > u64 err, entry, level_state; > > gpa_t gpa = gfn_to_gpa(gfn); > > - struct page *src_page; > > int ret, i; > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) > > return -EIO; > > > > - if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm)) > > + /* Source should be page-aligned, in which case src_offset will be 0. */ > > + if (KVM_BUG_ON(src_offset)) > if (KVM_BUG_ON(src_offset, kvm)) > > > return -EINVAL; > > > > - /* > > - * Get the source page if it has been faulted in. Return failure if the > > - * source page has been swapped out or unmapped in primary memory. > > - */ > > - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); > > - if (ret < 0) > > - return ret; > > - if (ret != 1) > > - return -ENOMEM; > > - > > - kvm_tdx->page_add_src = src_page; > > + kvm_tdx->page_add_src = src_pages[i]; > src_pages[0] ? i is not initialized. Sorry, I switched on TDX options for compile testing but I must have done a sloppy job confirming it actually built. I'll re-test push these and squash in the fixes in the github tree. > > Should there also be a KVM_BUG_ON(order > 0, kvm) ? Seems reasonable, but I'm not sure this is the right patch. Maybe I could squash it into the preceeding documentation patch so as to not give the impression this patch changes those expectations in any way. 
> > > ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); > > kvm_tdx->page_add_src = NULL; > > > > - put_page(src_page); > > - > > if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION)) > > return ret; > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index d93f75b05ae2..7e9d2403c61f 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -2581,7 +2581,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord > > * Returns the number of pages that were populated. > > */ > > typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > - void __user *src, int order, void *opaque); > > + struct page **src_pages, loff_t src_offset, > > + int order, void *opaque); > > > > long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages, > > kvm_gmem_populate_cb post_populate, void *opaque); > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > > index 9160379df378..e9ac3fd4fd8f 100644 > > --- a/virt/kvm/guest_memfd.c > > +++ b/virt/kvm/guest_memfd.c > > @@ -814,14 +814,17 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn); > > > > #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE > > + > > +#define GMEM_GUP_NPAGES (1UL << PMD_ORDER) > Limiting GMEM_GUP_NPAGES to PMD_ORDER may only work when the max_order of a huge > folio is 2MB. What if the max_order returned from __kvm_gmem_get_pfn() is 1GB > when src_pages[] can only hold up to 512 pages? This was necessarily chosen in prep for hugepages, but more about my unease at letting userspace GUP arbitrarilly large ranges. PMD_ORDER happens to align with 2MB hugepages while seeming like a reasonable batching value so that's why I chose it. Even with 1GB support, I wasn't really planning to increase it. SNP doesn't really make use of RMP sizes >2MB, and it sounds like TDX handles promotion in a completely different path. So atm I'm leaning toward just letting GMEM_GUP_NPAGES be the cap for the max page size we support for kvm_gmem_populate() path and not bothering to change it until a solid use-case arises. > > Increasing GMEM_GUP_NPAGES to (1UL << PUD_ORDER) is probabaly not a good idea. > > Given both TDX/SNP map at 4KB granularity, why not just invoke post_populate() > per 4KB while removing the max_order from post_populate() parameters, as done > in Sean's sketch patch [1]? That's an option too, but SNP can make use of 2MB pages in the post-populate callback so I don't want to shut the door on that option just yet if it's not too much of a pain to work in. 
Given the guest BIOS lives primarily in 1 or 2 of these 2MB regions the benefits might be worthwhile, and SNP doesn't have a post-post-populate promotion path like TDX (at least, not one that would help much for guest boot times) Thanks, Mike > > Then the WARN_ON() in kvm_gmem_populate() can be removed, which would be easily > triggered by TDX when max_order > 0 && npages == 1: > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > (npages - i) < (1 << max_order)); > > > [1] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/ > > > long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages, > > kvm_gmem_populate_cb post_populate, void *opaque) > > { > > struct kvm_memory_slot *slot; > > - void __user *p; > > - > > + struct page **src_pages; > > int ret = 0, max_order; > > - long i; > > + loff_t src_offset = 0; > > + long i, src_npages; > > > > lockdep_assert_held(&kvm->slots_lock); > > > > @@ -836,9 +839,28 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > if (!file) > > return -EFAULT; > > > > + npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); > > + npages = min_t(ulong, npages, GMEM_GUP_NPAGES); > > + > > + if (src) { > > + src_npages = IS_ALIGNED((unsigned long)src, PAGE_SIZE) ? npages : npages + 1; > > + > > + src_pages = kmalloc_array(src_npages, sizeof(struct page *), GFP_KERNEL); > > + if (!src_pages) > > + return -ENOMEM; > > + > > + ret = get_user_pages_fast((unsigned long)src, src_npages, 0, src_pages); > > + if (ret < 0) > > + return ret; > > + > > + if (ret != src_npages) > > + return -ENOMEM; > > + > > + src_offset = (loff_t)(src - PTR_ALIGN_DOWN(src, PAGE_SIZE)); > > + } > > + > > filemap_invalidate_lock(file->f_mapping); > > > > - npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); > > for (i = 0; i < npages; i += (1 << max_order)) { > > struct folio *folio; > > gfn_t gfn = start_gfn + i; > > @@ -869,8 +891,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > max_order--; > > } > > > > - p = src ? src + i * PAGE_SIZE : NULL; > > - ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > > + ret = post_populate(kvm, gfn, pfn, src ? &src_pages[i] : NULL, > > + src_offset, max_order, opaque); > Why src_offset is not 0 starting from the 2nd page? > > > if (!ret) > > folio_mark_uptodate(folio); > > > > @@ -882,6 +904,14 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > filemap_invalidate_unlock(file->f_mapping); > > > > + if (src) { > > + long j; > > + > > + for (j = 0; j < src_npages; j++) > > + put_page(src_pages[j]); > > + kfree(src_pages); > > + } > > + > > return ret && !i ? ret : i; > > } > > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_populate); > > -- > > 2.25.1 > > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-11-21 13:01 ` Michael Roth @ 2025-11-24 9:31 ` Yan Zhao 2025-11-24 15:53 ` Ira Weiny ` (2 more replies) 2025-12-01 1:44 ` Vishal Annapurve 1 sibling, 3 replies; 35+ messages in thread From: Yan Zhao @ 2025-11-24 9:31 UTC (permalink / raw) To: Michael Roth Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny On Fri, Nov 21, 2025 at 07:01:44AM -0600, Michael Roth wrote: > On Thu, Nov 20, 2025 at 05:11:48PM +0800, Yan Zhao wrote: > > On Thu, Nov 13, 2025 at 05:07:59PM -0600, Michael Roth wrote: > > > Currently the post-populate callbacks handle copying source pages into > > > private GPA ranges backed by guest_memfd, where kvm_gmem_populate() > > > acquires the filemap invalidate lock, then calls a post-populate > > > callback which may issue a get_user_pages() on the source pages prior to > > > copying them into the private GPA (e.g. TDX). > > > > > > This will not be compatible with in-place conversion, where the > > > userspace page fault path will attempt to acquire filemap invalidate > > > lock while holding the mm->mmap_lock, leading to a potential ABBA > > > deadlock[1]. > > > > > > Address this by hoisting the GUP above the filemap invalidate lock so > > > that these page faults path can be taken early, prior to acquiring the > > > filemap invalidate lock. > > > > > > It's not currently clear whether this issue is reachable with the > > > current implementation of guest_memfd, which doesn't support in-place > > > conversion, however it does provide a consistent mechanism to provide > > > stable source/target PFNs to callbacks rather than punting to > > > vendor-specific code, which allows for more commonality across > > > architectures, which may be worthwhile even without in-place conversion. 
> > > > > > Suggested-by: Sean Christopherson <seanjc@google.com> > > > Signed-off-by: Michael Roth <michael.roth@amd.com> > > > --- > > > arch/x86/kvm/svm/sev.c | 40 ++++++++++++++++++++++++++------------ > > > arch/x86/kvm/vmx/tdx.c | 21 +++++--------------- > > > include/linux/kvm_host.h | 3 ++- > > > virt/kvm/guest_memfd.c | 42 ++++++++++++++++++++++++++++++++++------ > > > 4 files changed, 71 insertions(+), 35 deletions(-) > > > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > > index 0835c664fbfd..d0ac710697a2 100644 > > > --- a/arch/x86/kvm/svm/sev.c > > > +++ b/arch/x86/kvm/svm/sev.c > > > @@ -2260,7 +2260,8 @@ struct sev_gmem_populate_args { > > > }; > > > > > > static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn, > > > - void __user *src, int order, void *opaque) > > > + struct page **src_pages, loff_t src_offset, > > > + int order, void *opaque) > > > { > > > struct sev_gmem_populate_args *sev_populate_args = opaque; > > > struct kvm_sev_info *sev = to_kvm_sev_info(kvm); > > > @@ -2268,7 +2269,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > int npages = (1 << order); > > > gfn_t gfn; > > > > > > - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src)) > > > + if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_pages)) > > > return -EINVAL; > > > > > > for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) { > > > @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > goto err; > > > } > > > > > > - if (src) { > > > - void *vaddr = kmap_local_pfn(pfn + i); > > > + if (src_pages) { > > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > > > - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { > > > - ret = -EFAULT; > > > - goto err; > > > + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset); > > > + kunmap_local(src_vaddr); > > > + > > > + if (src_offset) { > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > + > > > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset); > > > + kunmap_local(src_vaddr); > > IIUC, src_offset is the src's offset from the first page. e.g., > > src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512. > > > > Then it looks like the two memcpy() calls here only work when npages == 1 ? > > src_offset ends up being the offset into the pair of src pages that we > are using to fully populate a single dest page with each iteration. So > if we start at src_offset, read a page worth of data, then we are now at > src_offset in the next src page and the loop continues that way even if > npages > 1. > > If src_offset is 0 we never have to bother with straddling 2 src pages so > the 2nd memcpy is skipped on every iteration. > > That's the intent at least. Is there a flaw in the code/reasoning that I > missed? Oh, I got you. SNP expects a single src_offset applies for each src page. So if npages = 2, there're 4 memcpy() calls. src: |---------|---------|---------| (VA contiguous) ^ ^ ^ | | | dst: |---------|---------| (PA contiguous) I previously incorrectly thought kvm_gmem_populate() should pass in src_offset as 0 for the 2nd src page. Would you consider checking if params.uaddr is PAGE_ALIGNED() in snp_launch_update() to simplify the design? 
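For concreteness, the check being suggested would be something on the order of the following fragment early in snp_launch_update(), before kvm_gmem_populate() is ever reached (a sketch only; the params.uaddr field and PAGE_ALIGNED() usage are taken from this discussion, not from a tested patch):

	if (!PAGE_ALIGNED(params.uaddr))
		return -EINVAL;

which would let the SNP post-populate callback assume src_offset == 0, the same way the TDX path already does via the KVM_BUG_ON() added in patch 2/3.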
> > > > > } > > > - kunmap_local(vaddr); > > > + > > > + kunmap_local(dst_vaddr); > > > } > > > > > > ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K, > > > @@ -2331,12 +2339,20 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > if (!snp_page_reclaim(kvm, pfn + i) && > > > sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID && > > > sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) { > > > - void *vaddr = kmap_local_pfn(pfn + i); > > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > > > - if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE)) > > > - pr_debug("Failed to write CPUID page back to userspace\n"); > > > + memcpy(src_vaddr + src_offset, dst_vaddr, PAGE_SIZE - src_offset); > > > + kunmap_local(src_vaddr); > > > + > > > + if (src_offset) { > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > + > > > + memcpy(src_vaddr, dst_vaddr + PAGE_SIZE - src_offset, src_offset); > > > + kunmap_local(src_vaddr); > > > + } > > > > > > - kunmap_local(vaddr); > > > + kunmap_local(dst_vaddr); > > > } > > > > > > /* pfn + i is hypervisor-owned now, so skip below cleanup for it. */ > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > > > index 57ed101a1181..dd5439ec1473 100644 > > > --- a/arch/x86/kvm/vmx/tdx.c > > > +++ b/arch/x86/kvm/vmx/tdx.c > > > @@ -3115,37 +3115,26 @@ struct tdx_gmem_post_populate_arg { > > > }; > > > > > > static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > > - void __user *src, int order, void *_arg) > > > + struct page **src_pages, loff_t src_offset, > > > + int order, void *_arg) > > > { > > > struct tdx_gmem_post_populate_arg *arg = _arg; > > > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > > > u64 err, entry, level_state; > > > gpa_t gpa = gfn_to_gpa(gfn); > > > - struct page *src_page; > > > int ret, i; > > > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) > > > return -EIO; > > > > > > - if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm)) > > > + /* Source should be page-aligned, in which case src_offset will be 0. */ > > > + if (KVM_BUG_ON(src_offset)) > > if (KVM_BUG_ON(src_offset, kvm)) > > > > > return -EINVAL; > > > > > > - /* > > > - * Get the source page if it has been faulted in. Return failure if the > > > - * source page has been swapped out or unmapped in primary memory. > > > - */ > > > - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); > > > - if (ret < 0) > > > - return ret; > > > - if (ret != 1) > > > - return -ENOMEM; > > > - > > > - kvm_tdx->page_add_src = src_page; > > > + kvm_tdx->page_add_src = src_pages[i]; > > src_pages[0] ? i is not initialized. > > Sorry, I switched on TDX options for compile testing but I must have done a > sloppy job confirming it actually built. I'll re-test push these and squash > in the fixes in the github tree. > > > > > Should there also be a KVM_BUG_ON(order > 0, kvm) ? > > Seems reasonable, but I'm not sure this is the right patch. Maybe I > could squash it into the preceeding documentation patch so as to not > give the impression this patch changes those expectations in any way. I don't think it should be documented as a user requirement. However, we need to comment out that this assertion is due to that tdx_vcpu_init_mem_region() passes npages as 1 to kvm_gmem_populate(). 
> > > > > ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); > > > kvm_tdx->page_add_src = NULL; > > > > > > - put_page(src_page); > > > - > > > if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION)) > > > return ret; > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > index d93f75b05ae2..7e9d2403c61f 100644 > > > --- a/include/linux/kvm_host.h > > > +++ b/include/linux/kvm_host.h > > > @@ -2581,7 +2581,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord > > > * Returns the number of pages that were populated. > > > */ > > > typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > > - void __user *src, int order, void *opaque); > > > + struct page **src_pages, loff_t src_offset, > > > + int order, void *opaque); > > > > > > long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages, > > > kvm_gmem_populate_cb post_populate, void *opaque); > > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > > > index 9160379df378..e9ac3fd4fd8f 100644 > > > --- a/virt/kvm/guest_memfd.c > > > +++ b/virt/kvm/guest_memfd.c > > > @@ -814,14 +814,17 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn); > > > > > > #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE > > > + > > > +#define GMEM_GUP_NPAGES (1UL << PMD_ORDER) > > Limiting GMEM_GUP_NPAGES to PMD_ORDER may only work when the max_order of a huge > > folio is 2MB. What if the max_order returned from __kvm_gmem_get_pfn() is 1GB > > when src_pages[] can only hold up to 512 pages? > > This was necessarily chosen in prep for hugepages, but more about my > unease at letting userspace GUP arbitrarilly large ranges. PMD_ORDER > happens to align with 2MB hugepages while seeming like a reasonable > batching value so that's why I chose it. > > Even with 1GB support, I wasn't really planning to increase it. SNP > doesn't really make use of RMP sizes >2MB, and it sounds like TDX > handles promotion in a completely different path. So atm I'm leaning > toward just letting GMEM_GUP_NPAGES be the cap for the max page size we > support for kvm_gmem_populate() path and not bothering to change it > until a solid use-case arises. The problem is that with hugetlb-based guest_memfd, the folio itself could be of 1GB, though SNP and TDX can force mapping at only 4KB. Then since max_order = folio_order(folio) (at least in the tree for [1]), WARN_ON() in kvm_gmem_populate() could still be hit: folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || (npages - i) < (1 << max_order)); TDX is even easier to hit this warning because it always passes npages as 1. [1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com > > Increasing GMEM_GUP_NPAGES to (1UL << PUD_ORDER) is probabaly not a good idea. > > > > Given both TDX/SNP map at 4KB granularity, why not just invoke post_populate() > > per 4KB while removing the max_order from post_populate() parameters, as done > > in Sean's sketch patch [1]? > > That's an option too, but SNP can make use of 2MB pages in the > post-populate callback so I don't want to shut the door on that option > just yet if it's not too much of a pain to work in. 
Given the guest BIOS > lives primarily in 1 or 2 of these 2MB regions the benefits might be > worthwhile, and SNP doesn't have a post-post-populate promotion path > like TDX (at least, not one that would help much for guest boot times) I see. So, what about below change? --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -878,11 +878,10 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long } folio_unlock(folio); - WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || - (npages - i) < (1 << max_order)); ret = -EINVAL; - while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), + while (!IS_ALIGNED(gfn, 1 << max_order) || (npages - i) < (1 << max_order) || + !kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), KVM_MEMORY_ATTRIBUTE_PRIVATE, KVM_MEMORY_ATTRIBUTE_PRIVATE)) { if (!max_order) > > > > > Then the WARN_ON() in kvm_gmem_populate() can be removed, which would be easily > > triggered by TDX when max_order > 0 && npages == 1: > > > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > (npages - i) < (1 << max_order)); > > > > > > [1] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/ > > > > > long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages, > > > kvm_gmem_populate_cb post_populate, void *opaque) > > > { > > > struct kvm_memory_slot *slot; > > > - void __user *p; > > > - > > > + struct page **src_pages; > > > int ret = 0, max_order; > > > - long i; > > > + loff_t src_offset = 0; > > > + long i, src_npages; > > > > > > lockdep_assert_held(&kvm->slots_lock); > > > > > > @@ -836,9 +839,28 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > if (!file) > > > return -EFAULT; > > > > > > + npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); > > > + npages = min_t(ulong, npages, GMEM_GUP_NPAGES); > > > + > > > + if (src) { > > > + src_npages = IS_ALIGNED((unsigned long)src, PAGE_SIZE) ? npages : npages + 1; > > > + > > > + src_pages = kmalloc_array(src_npages, sizeof(struct page *), GFP_KERNEL); > > > + if (!src_pages) > > > + return -ENOMEM; > > > + > > > + ret = get_user_pages_fast((unsigned long)src, src_npages, 0, src_pages); > > > + if (ret < 0) > > > + return ret; > > > + > > > + if (ret != src_npages) > > > + return -ENOMEM; > > > + > > > + src_offset = (loff_t)(src - PTR_ALIGN_DOWN(src, PAGE_SIZE)); > > > + } > > > + > > > filemap_invalidate_lock(file->f_mapping); > > > > > > - npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); > > > for (i = 0; i < npages; i += (1 << max_order)) { > > > struct folio *folio; > > > gfn_t gfn = start_gfn + i; > > > @@ -869,8 +891,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > max_order--; > > > } > > > > > > - p = src ? src + i * PAGE_SIZE : NULL; > > > - ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > > > + ret = post_populate(kvm, gfn, pfn, src ? &src_pages[i] : NULL, > > > + src_offset, max_order, opaque); > > Why src_offset is not 0 starting from the 2nd page? > > > > > if (!ret) > > > folio_mark_uptodate(folio); > > > > > > @@ -882,6 +904,14 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > > > filemap_invalidate_unlock(file->f_mapping); > > > > > > + if (src) { > > > + long j; > > > + > > > + for (j = 0; j < src_npages; j++) > > > + put_page(src_pages[j]); > > > + kfree(src_pages); > > > + } > > > + > > > return ret && !i ? 
ret : i; > > > } > > > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_populate); > > > -- > > > 2.25.1 > > > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-11-24 9:31 ` Yan Zhao @ 2025-11-24 15:53 ` Ira Weiny 2025-11-25 3:12 ` Yan Zhao 2025-12-01 1:47 ` Vishal Annapurve 2025-12-01 22:13 ` Michael Roth 2 siblings, 1 reply; 35+ messages in thread From: Ira Weiny @ 2025-11-24 15:53 UTC (permalink / raw) To: Yan Zhao, Michael Roth Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny Yan Zhao wrote: > On Fri, Nov 21, 2025 at 07:01:44AM -0600, Michael Roth wrote: > > On Thu, Nov 20, 2025 at 05:11:48PM +0800, Yan Zhao wrote: > > > On Thu, Nov 13, 2025 at 05:07:59PM -0600, Michael Roth wrote: [snip] > > > > @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > goto err; > > > > } > > > > > > > > - if (src) { > > > > - void *vaddr = kmap_local_pfn(pfn + i); > > > > + if (src_pages) { > > > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > > > > > - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { > > > > - ret = -EFAULT; > > > > - goto err; > > > > + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset); > > > > + kunmap_local(src_vaddr); > > > > + > > > > + if (src_offset) { > > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > > + > > > > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset); > > > > + kunmap_local(src_vaddr); > > > IIUC, src_offset is the src's offset from the first page. e.g., > > > src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512. > > > > > > Then it looks like the two memcpy() calls here only work when npages == 1 ? > > > > src_offset ends up being the offset into the pair of src pages that we > > are using to fully populate a single dest page with each iteration. So > > if we start at src_offset, read a page worth of data, then we are now at > > src_offset in the next src page and the loop continues that way even if > > npages > 1. > > > > If src_offset is 0 we never have to bother with straddling 2 src pages so > > the 2nd memcpy is skipped on every iteration. > > > > That's the intent at least. Is there a flaw in the code/reasoning that I > > missed? > Oh, I got you. SNP expects a single src_offset applies for each src page. > > So if npages = 2, there're 4 memcpy() calls. > > src: |---------|---------|---------| (VA contiguous) > ^ ^ ^ > | | | > dst: |---------|---------| (PA contiguous) > I'm not following the above diagram. Either src and dst are aligned and src_pages points to exactly one page. OR not aligned and src_pages points to 2 pages. src: |---------|---------| (VA contiguous) ^ ^ | | dst: |---------| (PA contiguous) Regardless I think this is all bike shedding over a feature which I really don't think buys us much trying to allow the src to be missaligned. > > I previously incorrectly thought kvm_gmem_populate() should pass in src_offset > as 0 for the 2nd src page. > > Would you consider checking if params.uaddr is PAGE_ALIGNED() in > snp_launch_update() to simplify the design? I think this would help a lot... ATM I'm not even sure the algorithm works if order is not 0. [snip] > > > > Increasing GMEM_GUP_NPAGES to (1UL << PUD_ORDER) is probabaly not a good idea. 
> > > > > > Given both TDX/SNP map at 4KB granularity, why not just invoke post_populate() > > > per 4KB while removing the max_order from post_populate() parameters, as done > > > in Sean's sketch patch [1]? > > > > That's an option too, but SNP can make use of 2MB pages in the > > post-populate callback so I don't want to shut the door on that option > > just yet if it's not too much of a pain to work in. Given the guest BIOS > > lives primarily in 1 or 2 of these 2MB regions the benefits might be > > worthwhile, and SNP doesn't have a post-post-populate promotion path > > like TDX (at least, not one that would help much for guest boot times) > I see. > > So, what about below change? I'm not following what this change has to do with moving GUP out of the post_populate calls? Ira > > --- a/virt/kvm/guest_memfd.c > +++ b/virt/kvm/guest_memfd.c > @@ -878,11 +878,10 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > } > > folio_unlock(folio); > - WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > - (npages - i) < (1 << max_order)); > > ret = -EINVAL; > - while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), > + while (!IS_ALIGNED(gfn, 1 << max_order) || (npages - i) < (1 << max_order) || > + !kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), > KVM_MEMORY_ATTRIBUTE_PRIVATE, > KVM_MEMORY_ATTRIBUTE_PRIVATE)) { > if (!max_order) > > > [snip] ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-11-24 15:53 ` Ira Weiny @ 2025-11-25 3:12 ` Yan Zhao 0 siblings, 0 replies; 35+ messages in thread From: Yan Zhao @ 2025-11-25 3:12 UTC (permalink / raw) To: Ira Weiny Cc: Michael Roth, kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik On Mon, Nov 24, 2025 at 09:53:03AM -0600, Ira Weiny wrote: > Yan Zhao wrote: > > On Fri, Nov 21, 2025 at 07:01:44AM -0600, Michael Roth wrote: > > > On Thu, Nov 20, 2025 at 05:11:48PM +0800, Yan Zhao wrote: > > > > On Thu, Nov 13, 2025 at 05:07:59PM -0600, Michael Roth wrote: > > [snip] > > > > > > @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > > goto err; > > > > > } > > > > > > > > > > - if (src) { > > > > > - void *vaddr = kmap_local_pfn(pfn + i); > > > > > + if (src_pages) { > > > > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > > > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > > > > > > > - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { > > > > > - ret = -EFAULT; > > > > > - goto err; > > > > > + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset); > > > > > + kunmap_local(src_vaddr); > > > > > + > > > > > + if (src_offset) { > > > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > > > + > > > > > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset); > > > > > + kunmap_local(src_vaddr); > > > > IIUC, src_offset is the src's offset from the first page. e.g., > > > > src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512. > > > > > > > > Then it looks like the two memcpy() calls here only work when npages == 1 ? > > > > > > src_offset ends up being the offset into the pair of src pages that we > > > are using to fully populate a single dest page with each iteration. So > > > if we start at src_offset, read a page worth of data, then we are now at > > > src_offset in the next src page and the loop continues that way even if > > > npages > 1. > > > > > > If src_offset is 0 we never have to bother with straddling 2 src pages so > > > the 2nd memcpy is skipped on every iteration. > > > > > > That's the intent at least. Is there a flaw in the code/reasoning that I > > > missed? > > Oh, I got you. SNP expects a single src_offset applies for each src page. > > > > So if npages = 2, there're 4 memcpy() calls. > > > > src: |---------|---------|---------| (VA contiguous) > > ^ ^ ^ > > | | | > > dst: |---------|---------| (PA contiguous) > > > > I'm not following the above diagram. Either src and dst are aligned and Hmm, the src/dst legend in the above diagram just denotes source and target, not the actual src user pointer. > src_pages points to exactly one page. OR not aligned and src_pages points > to 2 pages. > > src: |---------|---------| (VA contiguous) > ^ ^ > | | > dst: |---------| (PA contiguous) > > Regardless I think this is all bike shedding over a feature which I really > don't think buys us much trying to allow the src to be missaligned. > > > > > I previously incorrectly thought kvm_gmem_populate() should pass in src_offset > > as 0 for the 2nd src page. > > > > Would you consider checking if params.uaddr is PAGE_ALIGNED() in > > snp_launch_update() to simplify the design? > > I think this would help a lot... ATM I'm not even sure the algorithm > works if order is not 0. 
> > [snip] > > > > > > > Increasing GMEM_GUP_NPAGES to (1UL << PUD_ORDER) is probabaly not a good idea. > > > > > > > > Given both TDX/SNP map at 4KB granularity, why not just invoke post_populate() > > > > per 4KB while removing the max_order from post_populate() parameters, as done > > > > in Sean's sketch patch [1]? > > > > > > That's an option too, but SNP can make use of 2MB pages in the > > > post-populate callback so I don't want to shut the door on that option > > > just yet if it's not too much of a pain to work in. Given the guest BIOS > > > lives primarily in 1 or 2 of these 2MB regions the benefits might be > > > worthwhile, and SNP doesn't have a post-post-populate promotion path > > > like TDX (at least, not one that would help much for guest boot times) > > I see. > > > > So, what about below change? > > I'm not following what this change has to do with moving GUP out of the > post_populate calls? Without this change, TDX (and possibly SNP) would hit a warning when max_order>0. (either GUP in 4KB granularity or this change can get rid of the warning). Since this series already contains changes for 2MB pages (e.g., batched GUP to allow SNP to map 2MB pages, and actually we don't need the change in patch 1 without considering huge pages), I don't see any reason to leave this change out of tree. Note: kvm_gmem_populate() already contains the logic of while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), KVM_MEMORY_ATTRIBUTE_PRIVATE, KVM_MEMORY_ATTRIBUTE_PRIVATE)) { if (!max_order) goto put_folio_and_exit; max_order--; } Also, the series is titled "Rework preparation/population flows in prep for in-place conversion", so it's not just about "moving GUP out of the post_populate", right? :) > > --- a/virt/kvm/guest_memfd.c > > +++ b/virt/kvm/guest_memfd.c > > @@ -878,11 +878,10 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > } > > > > folio_unlock(folio); > > - WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > - (npages - i) < (1 << max_order)); > > > > ret = -EINVAL; > > - while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), > > + while (!IS_ALIGNED(gfn, 1 << max_order) || (npages - i) < (1 << max_order) || > > + !kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), > > KVM_MEMORY_ATTRIBUTE_PRIVATE, > > KVM_MEMORY_ATTRIBUTE_PRIVATE)) { > > if (!max_order) > > > > > > > > [snip] ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-11-24 9:31 ` Yan Zhao 2025-11-24 15:53 ` Ira Weiny @ 2025-12-01 1:47 ` Vishal Annapurve 2025-12-01 21:03 ` Michael Roth 2025-12-01 22:13 ` Michael Roth 2 siblings, 1 reply; 35+ messages in thread From: Vishal Annapurve @ 2025-12-01 1:47 UTC (permalink / raw) To: Yan Zhao Cc: Michael Roth, kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, ackerleytng, aik, ira.weiny On Mon, Nov 24, 2025 at 1:34 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > + if (src_offset) { > > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > > + > > > > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset); > > > > + kunmap_local(src_vaddr); > > > IIUC, src_offset is the src's offset from the first page. e.g., > > > src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512. > > > > > > Then it looks like the two memcpy() calls here only work when npages == 1 ? > > > > src_offset ends up being the offset into the pair of src pages that we > > are using to fully populate a single dest page with each iteration. So > > if we start at src_offset, read a page worth of data, then we are now at > > src_offset in the next src page and the loop continues that way even if > > npages > 1. > > > > If src_offset is 0 we never have to bother with straddling 2 src pages so > > the 2nd memcpy is skipped on every iteration. > > > > That's the intent at least. Is there a flaw in the code/reasoning that I > > missed? > Oh, I got you. SNP expects a single src_offset applies for each src page. > > So if npages = 2, there're 4 memcpy() calls. > > src: |---------|---------|---------| (VA contiguous) > ^ ^ ^ > | | | > dst: |---------|---------| (PA contiguous) > > > I previously incorrectly thought kvm_gmem_populate() should pass in src_offset > as 0 for the 2nd src page. > > Would you consider checking if params.uaddr is PAGE_ALIGNED() in > snp_launch_update() to simplify the design? > IIUC, this ship has sailed, as asserting this would break existing userspace which can pass unaligned userspace buffers. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-12-01 1:47 ` Vishal Annapurve @ 2025-12-01 21:03 ` Michael Roth 0 siblings, 0 replies; 35+ messages in thread From: Michael Roth @ 2025-12-01 21:03 UTC (permalink / raw) To: Vishal Annapurve Cc: Yan Zhao, kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, ackerleytng, aik, ira.weiny On Sun, Nov 30, 2025 at 05:47:37PM -0800, Vishal Annapurve wrote: > On Mon, Nov 24, 2025 at 1:34 AM Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > > + if (src_offset) { > > > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > > > + > > > > > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset); > > > > > + kunmap_local(src_vaddr); > > > > IIUC, src_offset is the src's offset from the first page. e.g., > > > > src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512. > > > > > > > > Then it looks like the two memcpy() calls here only work when npages == 1 ? > > > > > > src_offset ends up being the offset into the pair of src pages that we > > > are using to fully populate a single dest page with each iteration. So > > > if we start at src_offset, read a page worth of data, then we are now at > > > src_offset in the next src page and the loop continues that way even if > > > npages > 1. > > > > > > If src_offset is 0 we never have to bother with straddling 2 src pages so > > > the 2nd memcpy is skipped on every iteration. > > > > > > That's the intent at least. Is there a flaw in the code/reasoning that I > > > missed? > > Oh, I got you. SNP expects a single src_offset applies for each src page. > > > > So if npages = 2, there're 4 memcpy() calls. > > > > src: |---------|---------|---------| (VA contiguous) > > ^ ^ ^ > > | | | > > dst: |---------|---------| (PA contiguous) > > > > > > I previously incorrectly thought kvm_gmem_populate() should pass in src_offset > > as 0 for the 2nd src page. > > > > Would you consider checking if params.uaddr is PAGE_ALIGNED() in > > snp_launch_update() to simplify the design? > > > > IIUC, this ship has sailed, as asserting this would break existing > userspace which can pass unaligned userspace buffers. Actually, on the PUCK call before I sent this patchset Sean/Paolo seemed to be okay with the prospect of enforcing that params.uaddr is PAGE_ALIGNED(), since all *known* userspace implementations do use a page-aligned params.uaddr and this would be highly unlikely to have any serious fallout. However, it was suggested that I post the RFC with non-page-aligned handling intact so we can have some further discussion about it. That would be one of the 3 approaches listed under (A) in the cover letter. (Sean proposed another option that he might still advocate for, also listed in the cover letter under (A), but wanted to see what this looked like first). Personally, I'm fine with forcing params.uaddr to be page-aligned. But there is still some slight risk that some VMM out there flying under the radar will surface this userspace breakage and that won't be fun to deal with. IMO, if an implementation wants to enforce page alignment, they can simply assert(src_offset == 0) in the post-populate callback and just treat src_pages[0] as if it was the only src input, like what was done in the tdx_gmem_post_populate() callback here. The overall changes seemed trivial enough that I don't see it being a headache for platforms that enforce that the src pointer is page-aligned. 
And for platforms like SNP that don't, it does not seem like a huge headache to straddle 2 src pages for each PFN we're populating. Maybe some better comments/documentation around kvm_gmem_populate() would more effectively alleviate potential confusion from new users of the proposed interface. -Mike ^ permalink raw reply [flat|nested] 35+ messages in thread
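For a platform that does enforce a page-aligned src, the proposed callback signature collapses to the simple 1:1 form described above. The following is only a sketch under that assumption (example_gmem_post_populate() is a hypothetical name, not a function from this series), and it shows just the copy step; the vendor-specific conversion/measurement calls are omitted. With src_offset guaranteed to be zero, src_pages[i] maps directly onto pfn + i and the straddling path disappears entirely.

static int example_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
				      struct page **src_pages, loff_t src_offset,
				      int order, void *opaque)
{
	int i;

	/* Alignment is assumed to be enforced at ioctl time, e.g. on params.uaddr. */
	if (KVM_BUG_ON(src_offset, kvm))
		return -EINVAL;

	for (i = 0; i < (1 << order); i++) {
		void *dst = kmap_local_pfn(pfn + i);
		void *src = kmap_local_page(src_pages[i]);

		/* 1:1 copy, no second source page to straddle into. */
		memcpy(dst, src, PAGE_SIZE);
		kunmap_local(src);
		kunmap_local(dst);
	}

	return 0;
}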
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-11-24 9:31 ` Yan Zhao 2025-11-24 15:53 ` Ira Weiny 2025-12-01 1:47 ` Vishal Annapurve @ 2025-12-01 22:13 ` Michael Roth 2025-12-03 2:46 ` Yan Zhao 2 siblings, 1 reply; 35+ messages in thread From: Michael Roth @ 2025-12-01 22:13 UTC (permalink / raw) To: Yan Zhao Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny On Mon, Nov 24, 2025 at 05:31:46PM +0800, Yan Zhao wrote: > On Fri, Nov 21, 2025 at 07:01:44AM -0600, Michael Roth wrote: > > On Thu, Nov 20, 2025 at 05:11:48PM +0800, Yan Zhao wrote: > > > On Thu, Nov 13, 2025 at 05:07:59PM -0600, Michael Roth wrote: > > > > Currently the post-populate callbacks handle copying source pages into > > > > private GPA ranges backed by guest_memfd, where kvm_gmem_populate() > > > > acquires the filemap invalidate lock, then calls a post-populate > > > > callback which may issue a get_user_pages() on the source pages prior to > > > > copying them into the private GPA (e.g. TDX). > > > > > > > > This will not be compatible with in-place conversion, where the > > > > userspace page fault path will attempt to acquire filemap invalidate > > > > lock while holding the mm->mmap_lock, leading to a potential ABBA > > > > deadlock[1]. > > > > > > > > Address this by hoisting the GUP above the filemap invalidate lock so > > > > that these page faults path can be taken early, prior to acquiring the > > > > filemap invalidate lock. > > > > > > > > It's not currently clear whether this issue is reachable with the > > > > current implementation of guest_memfd, which doesn't support in-place > > > > conversion, however it does provide a consistent mechanism to provide > > > > stable source/target PFNs to callbacks rather than punting to > > > > vendor-specific code, which allows for more commonality across > > > > architectures, which may be worthwhile even without in-place conversion. 
> > > > > > > > Suggested-by: Sean Christopherson <seanjc@google.com> > > > > Signed-off-by: Michael Roth <michael.roth@amd.com> > > > > --- > > > > arch/x86/kvm/svm/sev.c | 40 ++++++++++++++++++++++++++------------ > > > > arch/x86/kvm/vmx/tdx.c | 21 +++++--------------- > > > > include/linux/kvm_host.h | 3 ++- > > > > virt/kvm/guest_memfd.c | 42 ++++++++++++++++++++++++++++++++++------ > > > > 4 files changed, 71 insertions(+), 35 deletions(-) > > > > > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > > > index 0835c664fbfd..d0ac710697a2 100644 > > > > --- a/arch/x86/kvm/svm/sev.c > > > > +++ b/arch/x86/kvm/svm/sev.c > > > > @@ -2260,7 +2260,8 @@ struct sev_gmem_populate_args { > > > > }; > > > > > > > > static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn, > > > > - void __user *src, int order, void *opaque) > > > > + struct page **src_pages, loff_t src_offset, > > > > + int order, void *opaque) > > > > { > > > > struct sev_gmem_populate_args *sev_populate_args = opaque; > > > > struct kvm_sev_info *sev = to_kvm_sev_info(kvm); > > > > @@ -2268,7 +2269,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > int npages = (1 << order); > > > > gfn_t gfn; > > > > > > > > - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src)) > > > > + if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_pages)) > > > > return -EINVAL; > > > > > > > > for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) { > > > > @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > goto err; > > > > } > > > > > > > > - if (src) { > > > > - void *vaddr = kmap_local_pfn(pfn + i); > > > > + if (src_pages) { > > > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > > > > > - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { > > > > - ret = -EFAULT; > > > > - goto err; > > > > + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset); > > > > + kunmap_local(src_vaddr); > > > > + > > > > + if (src_offset) { > > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > > + > > > > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset); > > > > + kunmap_local(src_vaddr); > > > IIUC, src_offset is the src's offset from the first page. e.g., > > > src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512. > > > > > > Then it looks like the two memcpy() calls here only work when npages == 1 ? > > > > src_offset ends up being the offset into the pair of src pages that we > > are using to fully populate a single dest page with each iteration. So > > if we start at src_offset, read a page worth of data, then we are now at > > src_offset in the next src page and the loop continues that way even if > > npages > 1. > > > > If src_offset is 0 we never have to bother with straddling 2 src pages so > > the 2nd memcpy is skipped on every iteration. > > > > That's the intent at least. Is there a flaw in the code/reasoning that I > > missed? > Oh, I got you. SNP expects a single src_offset applies for each src page. > > So if npages = 2, there're 4 memcpy() calls. > > src: |---------|---------|---------| (VA contiguous) > ^ ^ ^ > | | | > dst: |---------|---------| (PA contiguous) > > > I previously incorrectly thought kvm_gmem_populate() should pass in src_offset > as 0 for the 2nd src page. 
> > Would you consider checking if params.uaddr is PAGE_ALIGNED() in > snp_launch_update() to simplify the design? This was an option mentioned in the cover letter and during PUCK. I am not opposed if that's the direction we decide, but I also don't think it makes big difference since: int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, struct page **src_pages, loff_t src_offset, int order, void *opaque); basically reduces to Sean's originally proposed: int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, struct page *src_pages, int order, void *opaque); for any platform that enforces that the src is page-aligned, which doesn't seem like a huge technical burden, IMO, despite me initially thinking it would be gross when I brought this up during the PUCK call that preceeding this posting. > > > > > > > > } > > > > - kunmap_local(vaddr); > > > > + > > > > + kunmap_local(dst_vaddr); > > > > } > > > > > > > > ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K, > > > > @@ -2331,12 +2339,20 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > if (!snp_page_reclaim(kvm, pfn + i) && > > > > sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID && > > > > sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) { > > > > - void *vaddr = kmap_local_pfn(pfn + i); > > > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > > > > > - if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE)) > > > > - pr_debug("Failed to write CPUID page back to userspace\n"); > > > > + memcpy(src_vaddr + src_offset, dst_vaddr, PAGE_SIZE - src_offset); > > > > + kunmap_local(src_vaddr); > > > > + > > > > + if (src_offset) { > > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > > + > > > > + memcpy(src_vaddr, dst_vaddr + PAGE_SIZE - src_offset, src_offset); > > > > + kunmap_local(src_vaddr); > > > > + } > > > > > > > > - kunmap_local(vaddr); > > > > + kunmap_local(dst_vaddr); > > > > } > > > > > > > > /* pfn + i is hypervisor-owned now, so skip below cleanup for it. */ > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > > > > index 57ed101a1181..dd5439ec1473 100644 > > > > --- a/arch/x86/kvm/vmx/tdx.c > > > > +++ b/arch/x86/kvm/vmx/tdx.c > > > > @@ -3115,37 +3115,26 @@ struct tdx_gmem_post_populate_arg { > > > > }; > > > > > > > > static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > > > - void __user *src, int order, void *_arg) > > > > + struct page **src_pages, loff_t src_offset, > > > > + int order, void *_arg) > > > > { > > > > struct tdx_gmem_post_populate_arg *arg = _arg; > > > > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > > > > u64 err, entry, level_state; > > > > gpa_t gpa = gfn_to_gpa(gfn); > > > > - struct page *src_page; > > > > int ret, i; > > > > > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) > > > > return -EIO; > > > > > > > > - if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm)) > > > > + /* Source should be page-aligned, in which case src_offset will be 0. */ > > > > + if (KVM_BUG_ON(src_offset)) > > > if (KVM_BUG_ON(src_offset, kvm)) > > > > > > > return -EINVAL; > > > > > > > > - /* > > > > - * Get the source page if it has been faulted in. Return failure if the > > > > - * source page has been swapped out or unmapped in primary memory. 
> > > > - */ > > > > - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); > > > > - if (ret < 0) > > > > - return ret; > > > > - if (ret != 1) > > > > - return -ENOMEM; > > > > - > > > > - kvm_tdx->page_add_src = src_page; > > > > + kvm_tdx->page_add_src = src_pages[i]; > > > src_pages[0] ? i is not initialized. > > > > Sorry, I switched on TDX options for compile testing but I must have done a > > sloppy job confirming it actually built. I'll re-test push these and squash > > in the fixes in the github tree. > > > > > > > > Should there also be a KVM_BUG_ON(order > 0, kvm) ? > > > > Seems reasonable, but I'm not sure this is the right patch. Maybe I > > could squash it into the preceeding documentation patch so as to not > > give the impression this patch changes those expectations in any way. > I don't think it should be documented as a user requirement. I didn't necessarily mean in the documentation, but mainly some patch other than this. If we add that check here as part of this patch, we give the impression that the order expectations are changing as a result of the changes here, when in reality they are exactly the same as before. If not the documentation patch here, then I don't think it really fits in this series at all and would be more of a standalone patch against kvm/next. The change here: - if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm)) + /* Source should be page-aligned, in which case src_offset will be 0. */ + if (KVM_BUG_ON(src_offset)) made sense as part of this patch, because now that we are passing struct page *src_pages, we can no longer infer alignment from 'src' field, and instead need to infer it from src_offset being 0. > > However, we need to comment out that this assertion is due to that > tdx_vcpu_init_mem_region() passes npages as 1 to kvm_gmem_populate(). You mean for the KVM_BUG_ON(order > 0, kvm) you're proposing to add? Again, if feels awkward to address this as part of this series since it is an existing/unchanged behavior and not really the intent of this patchset. > > > > > > > > ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); > > > > kvm_tdx->page_add_src = NULL; > > > > > > > > - put_page(src_page); > > > > - > > > > if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION)) > > > > return ret; > > > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > > index d93f75b05ae2..7e9d2403c61f 100644 > > > > --- a/include/linux/kvm_host.h > > > > +++ b/include/linux/kvm_host.h > > > > @@ -2581,7 +2581,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord > > > > * Returns the number of pages that were populated. 
> > > > */ > > > > typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > > > - void __user *src, int order, void *opaque); > > > > + struct page **src_pages, loff_t src_offset, > > > > + int order, void *opaque); > > > > > > > > long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages, > > > > kvm_gmem_populate_cb post_populate, void *opaque); > > > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > > > > index 9160379df378..e9ac3fd4fd8f 100644 > > > > --- a/virt/kvm/guest_memfd.c > > > > +++ b/virt/kvm/guest_memfd.c > > > > @@ -814,14 +814,17 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > > > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn); > > > > > > > > #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE > > > > + > > > > +#define GMEM_GUP_NPAGES (1UL << PMD_ORDER) > > > Limiting GMEM_GUP_NPAGES to PMD_ORDER may only work when the max_order of a huge > > > folio is 2MB. What if the max_order returned from __kvm_gmem_get_pfn() is 1GB > > > when src_pages[] can only hold up to 512 pages? > > > > This was necessarily chosen in prep for hugepages, but more about my > > unease at letting userspace GUP arbitrarilly large ranges. PMD_ORDER > > happens to align with 2MB hugepages while seeming like a reasonable > > batching value so that's why I chose it. > > > > Even with 1GB support, I wasn't really planning to increase it. SNP > > doesn't really make use of RMP sizes >2MB, and it sounds like TDX > > handles promotion in a completely different path. So atm I'm leaning > > toward just letting GMEM_GUP_NPAGES be the cap for the max page size we > > support for kvm_gmem_populate() path and not bothering to change it > > until a solid use-case arises. > The problem is that with hugetlb-based guest_memfd, the folio itself could be > of 1GB, though SNP and TDX can force mapping at only 4KB. If TDX wants to unload handling of page-clearing to its per-page post-populate callback and tie that its shared/private tracking that's perfectly fine by me. *How* TDX tells gmem it wants this different behavior is a topic for a follow-up patchset, Vishal suggested kernel-internal flags to kvm_gmem_create(), which seemed reasonable to me. In that case, uptodate flag would probably just default to set and punt to post-populate/prep hooks, because we absolutely *do not* want to have to re-introduce per-4K tracking of this type of state within gmem, since getting rid of that sort of tracking requirement within gmem is the entire motivation of this series. And since, within this series, the uptodate flag and prep-tracking both have the same 4K granularity, it seems unecessary to address this here. If you were to send a patchset on top of this (or even independently) that introduces said kernel-internal gmem flag to offload uptodate-tracking to post-populate/prep hooks, and utilize it to optimize the current 4K-only TDX implementation by letting TDX module handle the initial page-clearing, then I think that change/discussion can progress without being blocked in any major way by this series. But I don't think we need to flesh all that out here, so long as we are aware of this as a future change/requirement and have reasonable indication that it is compatible with this series. 
> > Then since max_order = folio_order(folio) (at least in the tree for [1]), > WARN_ON() in kvm_gmem_populate() could still be hit: > > folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > (npages - i) < (1 << max_order)); Yes, in the SNP implementation of hugetlb I ended up removing this warning, and in that case I also ended up forcing kvm_gmem_populate() to be 4K-only: https://github.com/AMDESE/linux/blob/snp-hugetlb-v2-wip0/virt/kvm/guest_memfd.c#L2372 but it makes a lot more sense to make those restrictions and changes in the context of hugepage support, rather than this series which is trying very hard to not do hugepage enablement, but simply keep what's partially there intact while reworking other things that have proven to be continued impediments to both in-place conversion and hugepage enablement. Also, there's talk now of enabling hugepages even without in-place conversion for hugetlbfs, and that will likely be the same path we follow for THP to remain in alignment. Rather than anticipating what all these changes will mean WRT hugepage implementation/requirements, I think it will be fruitful to remove some of the baggage that will complicate that process/discussion like this patchset attempts. -Mike > > TDX is even easier to hit this warning because it always passes npages as 1. > > [1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com > > > > > Increasing GMEM_GUP_NPAGES to (1UL << PUD_ORDER) is probabaly not a good idea. > > > > > > Given both TDX/SNP map at 4KB granularity, why not just invoke post_populate() > > > per 4KB while removing the max_order from post_populate() parameters, as done > > > in Sean's sketch patch [1]? > > > > That's an option too, but SNP can make use of 2MB pages in the > > post-populate callback so I don't want to shut the door on that option > > just yet if it's not too much of a pain to work in. Given the guest BIOS > > lives primarily in 1 or 2 of these 2MB regions the benefits might be > > worthwhile, and SNP doesn't have a post-post-populate promotion path > > like TDX (at least, not one that would help much for guest boot times) > I see. > > So, what about below change? 
> > --- a/virt/kvm/guest_memfd.c > +++ b/virt/kvm/guest_memfd.c > @@ -878,11 +878,10 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > } > > folio_unlock(folio); > - WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > - (npages - i) < (1 << max_order)); > > ret = -EINVAL; > - while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), > + while (!IS_ALIGNED(gfn, 1 << max_order) || (npages - i) < (1 << max_order) || > + !kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), > KVM_MEMORY_ATTRIBUTE_PRIVATE, > KVM_MEMORY_ATTRIBUTE_PRIVATE)) { > if (!max_order) > > > > > > > > > > > Then the WARN_ON() in kvm_gmem_populate() can be removed, which would be easily > > > triggered by TDX when max_order > 0 && npages == 1: > > > > > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > > (npages - i) < (1 << max_order)); > > > > > > > > > [1] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/ > > > > > > > long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages, > > > > kvm_gmem_populate_cb post_populate, void *opaque) > > > > { > > > > struct kvm_memory_slot *slot; > > > > - void __user *p; > > > > - > > > > + struct page **src_pages; > > > > int ret = 0, max_order; > > > > - long i; > > > > + loff_t src_offset = 0; > > > > + long i, src_npages; > > > > > > > > lockdep_assert_held(&kvm->slots_lock); > > > > > > > > @@ -836,9 +839,28 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > if (!file) > > > > return -EFAULT; > > > > > > > > + npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); > > > > + npages = min_t(ulong, npages, GMEM_GUP_NPAGES); > > > > + > > > > + if (src) { > > > > + src_npages = IS_ALIGNED((unsigned long)src, PAGE_SIZE) ? npages : npages + 1; > > > > + > > > > + src_pages = kmalloc_array(src_npages, sizeof(struct page *), GFP_KERNEL); > > > > + if (!src_pages) > > > > + return -ENOMEM; > > > > + > > > > + ret = get_user_pages_fast((unsigned long)src, src_npages, 0, src_pages); > > > > + if (ret < 0) > > > > + return ret; > > > > + > > > > + if (ret != src_npages) > > > > + return -ENOMEM; > > > > + > > > > + src_offset = (loff_t)(src - PTR_ALIGN_DOWN(src, PAGE_SIZE)); > > > > + } > > > > + > > > > filemap_invalidate_lock(file->f_mapping); > > > > > > > > - npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); > > > > for (i = 0; i < npages; i += (1 << max_order)) { > > > > struct folio *folio; > > > > gfn_t gfn = start_gfn + i; > > > > @@ -869,8 +891,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > max_order--; > > > > } > > > > > > > > - p = src ? src + i * PAGE_SIZE : NULL; > > > > - ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > > > > + ret = post_populate(kvm, gfn, pfn, src ? &src_pages[i] : NULL, > > > > + src_offset, max_order, opaque); > > > Why src_offset is not 0 starting from the 2nd page? > > > > > > > if (!ret) > > > > folio_mark_uptodate(folio); > > > > > > > > @@ -882,6 +904,14 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > > > > > filemap_invalidate_unlock(file->f_mapping); > > > > > > > > + if (src) { > > > > + long j; > > > > + > > > > + for (j = 0; j < src_npages; j++) > > > > + put_page(src_pages[j]); > > > > + kfree(src_pages); > > > > + } > > > > + > > > > return ret && !i ? 
ret : i; > > > > } > > > > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_populate); > > > > -- > > > > 2.25.1 > > > > ^ permalink raw reply [flat|nested] 35+ messages in thread
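As a quick illustration of the src_npages/src_offset arithmetic in the kvm_gmem_populate() hunk quoted above: a misaligned src needs one extra pinned source page, and src_offset is simply src's byte offset within its first page. The standalone userspace sketch below (gup_plan() is a made-up helper, not a kernel interface) prints what would be pinned for Yan's 0x7fea82684100 example; the real code additionally clamps npages against the slot size and GMEM_GUP_NPAGES before doing this.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

static void gup_plan(uintptr_t src, unsigned long npages)
{
	unsigned long src_offset = src & (PAGE_SIZE - 1);
	unsigned long src_npages = src_offset ? npages + 1 : npages;

	printf("src=0x%lx npages=%lu -> pin %lu src pages, src_offset=0x%lx\n",
	       (unsigned long)src, npages, src_npages, src_offset);
}

int main(void)
{
	gup_plan(0x7fea82684000UL, 512);	/* aligned:    512 pages, offset 0     */
	gup_plan(0x7fea82684100UL, 512);	/* misaligned: 513 pages, offset 0x100 */
	return 0;
}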
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-12-01 22:13 ` Michael Roth @ 2025-12-03 2:46 ` Yan Zhao 2025-12-03 14:26 ` Michael Roth 0 siblings, 1 reply; 35+ messages in thread From: Yan Zhao @ 2025-12-03 2:46 UTC (permalink / raw) To: Michael Roth Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny On Mon, Dec 01, 2025 at 04:13:55PM -0600, Michael Roth wrote: > On Mon, Nov 24, 2025 at 05:31:46PM +0800, Yan Zhao wrote: > > On Fri, Nov 21, 2025 at 07:01:44AM -0600, Michael Roth wrote: > > > On Thu, Nov 20, 2025 at 05:11:48PM +0800, Yan Zhao wrote: > > > > On Thu, Nov 13, 2025 at 05:07:59PM -0600, Michael Roth wrote: > > > > > Currently the post-populate callbacks handle copying source pages into > > > > > private GPA ranges backed by guest_memfd, where kvm_gmem_populate() > > > > > acquires the filemap invalidate lock, then calls a post-populate > > > > > callback which may issue a get_user_pages() on the source pages prior to > > > > > copying them into the private GPA (e.g. TDX). > > > > > > > > > > This will not be compatible with in-place conversion, where the > > > > > userspace page fault path will attempt to acquire filemap invalidate > > > > > lock while holding the mm->mmap_lock, leading to a potential ABBA > > > > > deadlock[1]. > > > > > > > > > > Address this by hoisting the GUP above the filemap invalidate lock so > > > > > that these page faults path can be taken early, prior to acquiring the > > > > > filemap invalidate lock. > > > > > > > > > > It's not currently clear whether this issue is reachable with the > > > > > current implementation of guest_memfd, which doesn't support in-place > > > > > conversion, however it does provide a consistent mechanism to provide > > > > > stable source/target PFNs to callbacks rather than punting to > > > > > vendor-specific code, which allows for more commonality across > > > > > architectures, which may be worthwhile even without in-place conversion. 
> > > > > > > > > > Suggested-by: Sean Christopherson <seanjc@google.com> > > > > > Signed-off-by: Michael Roth <michael.roth@amd.com> > > > > > --- > > > > > arch/x86/kvm/svm/sev.c | 40 ++++++++++++++++++++++++++------------ > > > > > arch/x86/kvm/vmx/tdx.c | 21 +++++--------------- > > > > > include/linux/kvm_host.h | 3 ++- > > > > > virt/kvm/guest_memfd.c | 42 ++++++++++++++++++++++++++++++++++------ > > > > > 4 files changed, 71 insertions(+), 35 deletions(-) > > > > > > > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > > > > index 0835c664fbfd..d0ac710697a2 100644 > > > > > --- a/arch/x86/kvm/svm/sev.c > > > > > +++ b/arch/x86/kvm/svm/sev.c > > > > > @@ -2260,7 +2260,8 @@ struct sev_gmem_populate_args { > > > > > }; > > > > > > > > > > static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn, > > > > > - void __user *src, int order, void *opaque) > > > > > + struct page **src_pages, loff_t src_offset, > > > > > + int order, void *opaque) > > > > > { > > > > > struct sev_gmem_populate_args *sev_populate_args = opaque; > > > > > struct kvm_sev_info *sev = to_kvm_sev_info(kvm); > > > > > @@ -2268,7 +2269,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > > int npages = (1 << order); > > > > > gfn_t gfn; > > > > > > > > > > - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src)) > > > > > + if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_pages)) > > > > > return -EINVAL; > > > > > > > > > > for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) { > > > > > @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > > goto err; > > > > > } > > > > > > > > > > - if (src) { > > > > > - void *vaddr = kmap_local_pfn(pfn + i); > > > > > + if (src_pages) { > > > > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > > > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > > > > > > > - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { > > > > > - ret = -EFAULT; > > > > > - goto err; > > > > > + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset); > > > > > + kunmap_local(src_vaddr); > > > > > + > > > > > + if (src_offset) { > > > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > > > + > > > > > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset); > > > > > + kunmap_local(src_vaddr); > > > > IIUC, src_offset is the src's offset from the first page. e.g., > > > > src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512. > > > > > > > > Then it looks like the two memcpy() calls here only work when npages == 1 ? > > > > > > src_offset ends up being the offset into the pair of src pages that we > > > are using to fully populate a single dest page with each iteration. So > > > if we start at src_offset, read a page worth of data, then we are now at > > > src_offset in the next src page and the loop continues that way even if > > > npages > 1. > > > > > > If src_offset is 0 we never have to bother with straddling 2 src pages so > > > the 2nd memcpy is skipped on every iteration. > > > > > > That's the intent at least. Is there a flaw in the code/reasoning that I > > > missed? > > Oh, I got you. SNP expects a single src_offset applies for each src page. > > > > So if npages = 2, there're 4 memcpy() calls. 
> > > > src: |---------|---------|---------| (VA contiguous) > > ^ ^ ^ > > | | | > > dst: |---------|---------| (PA contiguous) > > > > > > I previously incorrectly thought kvm_gmem_populate() should pass in src_offset > > as 0 for the 2nd src page. > > > > Would you consider checking if params.uaddr is PAGE_ALIGNED() in > > snp_launch_update() to simplify the design? > > This was an option mentioned in the cover letter and during PUCK. I am > not opposed if that's the direction we decide, but I also don't think > it makes big difference since: > > int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > struct page **src_pages, loff_t src_offset, > int order, void *opaque); > > basically reduces to Sean's originally proposed: > > int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > struct page *src_pages, int order, > void *opaque); Hmm, the requirement of having each copy to dst_page account for src_offset (which actually results in 2 copies) is quite confusing. I initially thought the src_offset only applied to the first dst_page. This will also cause kvm_gmem_populate() to allocate 1 more src_npages than npages for dst pages. > for any platform that enforces that the src is page-aligned, which > doesn't seem like a huge technical burden, IMO, despite me initially > thinking it would be gross when I brought this up during the PUCK call > that preceeding this posting. > > > > > > > > > > > } > > > > > - kunmap_local(vaddr); > > > > > + > > > > > + kunmap_local(dst_vaddr); > > > > > } > > > > > > > > > > ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K, > > > > > @@ -2331,12 +2339,20 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > > if (!snp_page_reclaim(kvm, pfn + i) && > > > > > sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID && > > > > > sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) { > > > > > - void *vaddr = kmap_local_pfn(pfn + i); > > > > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > > > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > > > > > > > - if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE)) > > > > > - pr_debug("Failed to write CPUID page back to userspace\n"); > > > > > + memcpy(src_vaddr + src_offset, dst_vaddr, PAGE_SIZE - src_offset); > > > > > + kunmap_local(src_vaddr); > > > > > + > > > > > + if (src_offset) { > > > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > > > + > > > > > + memcpy(src_vaddr, dst_vaddr + PAGE_SIZE - src_offset, src_offset); > > > > > + kunmap_local(src_vaddr); > > > > > + } > > > > > > > > > > - kunmap_local(vaddr); > > > > > + kunmap_local(dst_vaddr); > > > > > } > > > > > > > > > > /* pfn + i is hypervisor-owned now, so skip below cleanup for it. 
*/ > > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > > > > > index 57ed101a1181..dd5439ec1473 100644 > > > > > --- a/arch/x86/kvm/vmx/tdx.c > > > > > +++ b/arch/x86/kvm/vmx/tdx.c > > > > > @@ -3115,37 +3115,26 @@ struct tdx_gmem_post_populate_arg { > > > > > }; > > > > > > > > > > static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > > > > - void __user *src, int order, void *_arg) > > > > > + struct page **src_pages, loff_t src_offset, > > > > > + int order, void *_arg) > > > > > { > > > > > struct tdx_gmem_post_populate_arg *arg = _arg; > > > > > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > > > > > u64 err, entry, level_state; > > > > > gpa_t gpa = gfn_to_gpa(gfn); > > > > > - struct page *src_page; > > > > > int ret, i; > > > > > > > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) > > > > > return -EIO; > > > > > > > > > > - if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm)) > > > > > + /* Source should be page-aligned, in which case src_offset will be 0. */ > > > > > + if (KVM_BUG_ON(src_offset)) > > > > if (KVM_BUG_ON(src_offset, kvm)) > > > > > > > > > return -EINVAL; > > > > > > > > > > - /* > > > > > - * Get the source page if it has been faulted in. Return failure if the > > > > > - * source page has been swapped out or unmapped in primary memory. > > > > > - */ > > > > > - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); > > > > > - if (ret < 0) > > > > > - return ret; > > > > > - if (ret != 1) > > > > > - return -ENOMEM; > > > > > - > > > > > - kvm_tdx->page_add_src = src_page; > > > > > + kvm_tdx->page_add_src = src_pages[i]; > > > > src_pages[0] ? i is not initialized. > > > > > > Sorry, I switched on TDX options for compile testing but I must have done a > > > sloppy job confirming it actually built. I'll re-test push these and squash > > > in the fixes in the github tree. > > > > > > > > > > > Should there also be a KVM_BUG_ON(order > 0, kvm) ? > > > > > > Seems reasonable, but I'm not sure this is the right patch. Maybe I > > > could squash it into the preceeding documentation patch so as to not > > > give the impression this patch changes those expectations in any way. > > I don't think it should be documented as a user requirement. > > I didn't necessarily mean in the documentation, but mainly some patch > other than this. If we add that check here as part of this patch, we > give the impression that the order expectations are changing as a result > of the changes here, when in reality they are exactly the same as > before. > > If not the documentation patch here, then I don't think it really fits > in this series at all and would be more of a standalone patch against > kvm/next. > > The change here: > > - if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm)) > + /* Source should be page-aligned, in which case src_offset will be 0. */ > + if (KVM_BUG_ON(src_offset)) > > made sense as part of this patch, because now that we are passing struct > page *src_pages, we can no longer infer alignment from 'src' field, and > instead need to infer it from src_offset being 0. > > > > > However, we need to comment out that this assertion is due to that > > tdx_vcpu_init_mem_region() passes npages as 1 to kvm_gmem_populate(). > > You mean for the KVM_BUG_ON(order > 0, kvm) you're proposing to add? > Again, if feels awkward to address this as part of this series since it > is an existing/unchanged behavior and not really the intent of this > patchset. That's true. src_pages[0] just makes it more eye-catching. 
What about just adding a comment for src_pages[0] instead of KVM_BUG_ON()? > > > > > ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); > > > > > kvm_tdx->page_add_src = NULL; > > > > > > > > > > - put_page(src_page); > > > > > - > > > > > if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION)) > > > > > return ret; > > > > > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > > > index d93f75b05ae2..7e9d2403c61f 100644 > > > > > --- a/include/linux/kvm_host.h > > > > > +++ b/include/linux/kvm_host.h > > > > > @@ -2581,7 +2581,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord > > > > > * Returns the number of pages that were populated. > > > > > */ > > > > > typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > > > > - void __user *src, int order, void *opaque); > > > > > + struct page **src_pages, loff_t src_offset, > > > > > + int order, void *opaque); > > > > > > > > > > long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages, > > > > > kvm_gmem_populate_cb post_populate, void *opaque); > > > > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > > > > > index 9160379df378..e9ac3fd4fd8f 100644 > > > > > --- a/virt/kvm/guest_memfd.c > > > > > +++ b/virt/kvm/guest_memfd.c > > > > > @@ -814,14 +814,17 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > > > > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn); > > > > > > > > > > #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE > > > > > + > > > > > +#define GMEM_GUP_NPAGES (1UL << PMD_ORDER) > > > > Limiting GMEM_GUP_NPAGES to PMD_ORDER may only work when the max_order of a huge > > > > folio is 2MB. What if the max_order returned from __kvm_gmem_get_pfn() is 1GB > > > > when src_pages[] can only hold up to 512 pages? > > > > > > This was necessarily chosen in prep for hugepages, but more about my > > > unease at letting userspace GUP arbitrarilly large ranges. PMD_ORDER > > > happens to align with 2MB hugepages while seeming like a reasonable > > > batching value so that's why I chose it. > > > > > > Even with 1GB support, I wasn't really planning to increase it. SNP > > > doesn't really make use of RMP sizes >2MB, and it sounds like TDX > > > handles promotion in a completely different path. So atm I'm leaning > > > toward just letting GMEM_GUP_NPAGES be the cap for the max page size we > > > support for kvm_gmem_populate() path and not bothering to change it > > > until a solid use-case arises. > > The problem is that with hugetlb-based guest_memfd, the folio itself could be > > of 1GB, though SNP and TDX can force mapping at only 4KB. > > If TDX wants to unload handling of page-clearing to its per-page > post-populate callback and tie that its shared/private tracking that's > perfectly fine by me. > > *How* TDX tells gmem it wants this different behavior is a topic for a > follow-up patchset, Vishal suggested kernel-internal flags to > kvm_gmem_create(), which seemed reasonable to me. In that case, uptodate Not sure which flag you are referring to with "Vishal suggested kernel-internal flags to kvm_gmem_create()". However, my point is that when the backend folio is 1GB in size (leading to max_order being PUD_ORDER), even if SNP only maps at 2MB to RMP, it may hit the warning of "!IS_ALIGNED(gfn, 1 << max_order)". For TDX, it's worse because it always passes npages as 1, so it will also hit the warning of "(npages - i) < (1 << max_order)". 
Given that this patch already considers huge pages for SNP, it feels half-baked to leave the WARN_ON() for future handling. WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || (npages - i) < (1 << max_order)); > flag would probably just default to set and punt to post-populate/prep > hooks, because we absolutely *do not* want to have to re-introduce per-4K > tracking of this type of state within gmem, since getting rid of that sort > of tracking requirement within gmem is the entire motivation of this > series. And since, within this series, the uptodate flag and > prep-tracking both have the same 4K granularity, it seems unecessary to > address this here. > > If you were to send a patchset on top of this (or even independently) that > introduces said kernel-internal gmem flag to offload uptodate-tracking to > post-populate/prep hooks, and utilize it to optimize the current 4K-only > TDX implementation by letting TDX module handle the initial > page-clearing, then I think that change/discussion can progress without > being blocked in any major way by this series. > > But I don't think we need to flesh all that out here, so long as we are > aware of this as a future change/requirement and have reasonable > indication that it is compatible with this series. > > > > > Then since max_order = folio_order(folio) (at least in the tree for [1]), > > WARN_ON() in kvm_gmem_populate() could still be hit: > > > > folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > (npages - i) < (1 << max_order)); > > Yes, in the SNP implementation of hugetlb I ended up removing this > warning, and in that case I also ended up forcing kvm_gmem_populate() to > be 4K-only: > > https://github.com/AMDESE/linux/blob/snp-hugetlb-v2-wip0/virt/kvm/guest_memfd.c#L2372 For 1G (aka HugeTLB) page, this fix is also needed, which was missed in [1] and I pointed out to Ackerley at [2]. [1] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2 [2] https://lore.kernel.org/all/aFPGPVbzo92t565h@yzhao56-desk.sh.intel.com/ > but it makes a lot more sense to make those restrictions and changes in > the context of hugepage support, rather than this series which is trying > very hard to not do hugepage enablement, but simply keep what's partially > there intact while reworking other things that have proven to be > continued impediments to both in-place conversion and hugepage > enablement. Not sure how fixing the warning in this series could impede hugepage enabling :) But if you prefer, I don't mind keeping it locally for longer. > Also, there's talk now of enabling hugepages even without in-place > conversion for hugetlbfs, and that will likely be the same path we > follow for THP to remain in alignment. Rather than anticipating what all > these changes will mean WRT hugepage implementation/requirements, I > think it will be fruitful to remove some of the baggage that will > complicate that process/discussion like this patchset attempts. > > -Mike > > > > > TDX is even easier to hit this warning because it always passes npages as 1. > > > > [1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com > > > > > > > > Increasing GMEM_GUP_NPAGES to (1UL << PUD_ORDER) is probabaly not a good idea. > > > > > > > > Given both TDX/SNP map at 4KB granularity, why not just invoke post_populate() > > > > per 4KB while removing the max_order from post_populate() parameters, as done > > > > in Sean's sketch patch [1]? 
> > > > > > That's an option too, but SNP can make use of 2MB pages in the > > > post-populate callback so I don't want to shut the door on that option > > > just yet if it's not too much of a pain to work in. Given the guest BIOS > > > lives primarily in 1 or 2 of these 2MB regions the benefits might be > > > worthwhile, and SNP doesn't have a post-post-populate promotion path > > > like TDX (at least, not one that would help much for guest boot times) > > I see. > > > > So, what about below change? > > > > --- a/virt/kvm/guest_memfd.c > > +++ b/virt/kvm/guest_memfd.c > > @@ -878,11 +878,10 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > } > > > > folio_unlock(folio); > > - WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > - (npages - i) < (1 << max_order)); > > > > ret = -EINVAL; > > - while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), > > + while (!IS_ALIGNED(gfn, 1 << max_order) || (npages - i) < (1 << max_order) || > > + !kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), > > KVM_MEMORY_ATTRIBUTE_PRIVATE, > > KVM_MEMORY_ATTRIBUTE_PRIVATE)) { > > if (!max_order) > > > > > > > > > > > > > > > > > Then the WARN_ON() in kvm_gmem_populate() can be removed, which would be easily > > > > triggered by TDX when max_order > 0 && npages == 1: > > > > > > > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > > > (npages - i) < (1 << max_order)); > > > > > > > > > > > > [1] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/ > > > > > > > > > long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages, > > > > > kvm_gmem_populate_cb post_populate, void *opaque) > > > > > { > > > > > struct kvm_memory_slot *slot; > > > > > - void __user *p; > > > > > - > > > > > + struct page **src_pages; > > > > > int ret = 0, max_order; > > > > > - long i; > > > > > + loff_t src_offset = 0; > > > > > + long i, src_npages; > > > > > > > > > > lockdep_assert_held(&kvm->slots_lock); > > > > > > > > > > @@ -836,9 +839,28 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > > if (!file) > > > > > return -EFAULT; > > > > > > > > > > + npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); > > > > > + npages = min_t(ulong, npages, GMEM_GUP_NPAGES); > > > > > + > > > > > + if (src) { > > > > > + src_npages = IS_ALIGNED((unsigned long)src, PAGE_SIZE) ? npages : npages + 1; > > > > > + > > > > > + src_pages = kmalloc_array(src_npages, sizeof(struct page *), GFP_KERNEL); > > > > > + if (!src_pages) > > > > > + return -ENOMEM; > > > > > + > > > > > + ret = get_user_pages_fast((unsigned long)src, src_npages, 0, src_pages); > > > > > + if (ret < 0) > > > > > + return ret; > > > > > + > > > > > + if (ret != src_npages) > > > > > + return -ENOMEM; > > > > > + > > > > > + src_offset = (loff_t)(src - PTR_ALIGN_DOWN(src, PAGE_SIZE)); > > > > > + } > > > > > + > > > > > filemap_invalidate_lock(file->f_mapping); > > > > > > > > > > - npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); > > > > > for (i = 0; i < npages; i += (1 << max_order)) { > > > > > struct folio *folio; > > > > > gfn_t gfn = start_gfn + i; > > > > > @@ -869,8 +891,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > > max_order--; > > > > > } > > > > > > > > > > - p = src ? src + i * PAGE_SIZE : NULL; > > > > > - ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > > > > > + ret = post_populate(kvm, gfn, pfn, src ? 
&src_pages[i] : NULL, > > > > > + src_offset, max_order, opaque); > > > > Why src_offset is not 0 starting from the 2nd page? > > > > > > > > > if (!ret) > > > > > folio_mark_uptodate(folio); > > > > > > > > > > @@ -882,6 +904,14 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > > > > > > > filemap_invalidate_unlock(file->f_mapping); > > > > > > > > > > + if (src) { > > > > > + long j; > > > > > + > > > > > + for (j = 0; j < src_npages; j++) > > > > > + put_page(src_pages[j]); > > > > > + kfree(src_pages); > > > > > + } > > > > > + > > > > > return ret && !i ? ret : i; > > > > > } > > > > > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_populate); > > > > > -- > > > > > 2.25.1 > > > > > ^ permalink raw reply [flat|nested] 35+ messages in thread
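The effect of the diff Yan proposes above is easier to see in isolation: instead of warning, the loop steps max_order down until both the gfn alignment and the remaining page count permit the mapping. Below is a minimal standalone userspace model of just that clamping logic; clamp_order() is an illustrative name only, and the kvm_range_has_memory_attributes() check from the real loop is deliberately left out.

#include <stdio.h>

static int clamp_order(unsigned long gfn, long remaining, int max_order)
{
	while ((gfn & ((1UL << max_order) - 1)) || remaining < (1L << max_order)) {
		if (!max_order)
			return -1;	/* nothing fits: mirrors the -EINVAL exit */
		max_order--;
	}
	return max_order;
}

int main(void)
{
	/* 1GB-backed folio (order 18) but TDX passes npages == 1: clamps to 0. */
	printf("%d\n", clamp_order(0, 1, 18));
	/* gfn misaligned for 2MB (order 9): clamps to 0. */
	printf("%d\n", clamp_order(3, 512, 9));
	/* aligned 2MB chunk with 512 pages remaining: stays at 9. */
	printf("%d\n", clamp_order(512, 512, 9));
	return 0;
}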
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-12-03 2:46 ` Yan Zhao @ 2025-12-03 14:26 ` Michael Roth 2025-12-03 20:59 ` FirstName LastName ` (2 more replies) 0 siblings, 3 replies; 35+ messages in thread From: Michael Roth @ 2025-12-03 14:26 UTC (permalink / raw) To: Yan Zhao Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny On Wed, Dec 03, 2025 at 10:46:27AM +0800, Yan Zhao wrote: > On Mon, Dec 01, 2025 at 04:13:55PM -0600, Michael Roth wrote: > > On Mon, Nov 24, 2025 at 05:31:46PM +0800, Yan Zhao wrote: > > > On Fri, Nov 21, 2025 at 07:01:44AM -0600, Michael Roth wrote: > > > > On Thu, Nov 20, 2025 at 05:11:48PM +0800, Yan Zhao wrote: > > > > > On Thu, Nov 13, 2025 at 05:07:59PM -0600, Michael Roth wrote: > > > > > > Currently the post-populate callbacks handle copying source pages into > > > > > > private GPA ranges backed by guest_memfd, where kvm_gmem_populate() > > > > > > acquires the filemap invalidate lock, then calls a post-populate > > > > > > callback which may issue a get_user_pages() on the source pages prior to > > > > > > copying them into the private GPA (e.g. TDX). > > > > > > > > > > > > This will not be compatible with in-place conversion, where the > > > > > > userspace page fault path will attempt to acquire filemap invalidate > > > > > > lock while holding the mm->mmap_lock, leading to a potential ABBA > > > > > > deadlock[1]. > > > > > > > > > > > > Address this by hoisting the GUP above the filemap invalidate lock so > > > > > > that these page faults path can be taken early, prior to acquiring the > > > > > > filemap invalidate lock. > > > > > > > > > > > > It's not currently clear whether this issue is reachable with the > > > > > > current implementation of guest_memfd, which doesn't support in-place > > > > > > conversion, however it does provide a consistent mechanism to provide > > > > > > stable source/target PFNs to callbacks rather than punting to > > > > > > vendor-specific code, which allows for more commonality across > > > > > > architectures, which may be worthwhile even without in-place conversion. 
> > > > > > > > > > > > Suggested-by: Sean Christopherson <seanjc@google.com> > > > > > > Signed-off-by: Michael Roth <michael.roth@amd.com> > > > > > > --- > > > > > > arch/x86/kvm/svm/sev.c | 40 ++++++++++++++++++++++++++------------ > > > > > > arch/x86/kvm/vmx/tdx.c | 21 +++++--------------- > > > > > > include/linux/kvm_host.h | 3 ++- > > > > > > virt/kvm/guest_memfd.c | 42 ++++++++++++++++++++++++++++++++++------ > > > > > > 4 files changed, 71 insertions(+), 35 deletions(-) > > > > > > > > > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > > > > > index 0835c664fbfd..d0ac710697a2 100644 > > > > > > --- a/arch/x86/kvm/svm/sev.c > > > > > > +++ b/arch/x86/kvm/svm/sev.c > > > > > > @@ -2260,7 +2260,8 @@ struct sev_gmem_populate_args { > > > > > > }; > > > > > > > > > > > > static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn, > > > > > > - void __user *src, int order, void *opaque) > > > > > > + struct page **src_pages, loff_t src_offset, > > > > > > + int order, void *opaque) > > > > > > { > > > > > > struct sev_gmem_populate_args *sev_populate_args = opaque; > > > > > > struct kvm_sev_info *sev = to_kvm_sev_info(kvm); > > > > > > @@ -2268,7 +2269,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > > > int npages = (1 << order); > > > > > > gfn_t gfn; > > > > > > > > > > > > - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src)) > > > > > > + if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_pages)) > > > > > > return -EINVAL; > > > > > > > > > > > > for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) { > > > > > > @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > > > goto err; > > > > > > } > > > > > > > > > > > > - if (src) { > > > > > > - void *vaddr = kmap_local_pfn(pfn + i); > > > > > > + if (src_pages) { > > > > > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > > > > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > > > > > > > > > - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { > > > > > > - ret = -EFAULT; > > > > > > - goto err; > > > > > > + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset); > > > > > > + kunmap_local(src_vaddr); > > > > > > + > > > > > > + if (src_offset) { > > > > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > > > > + > > > > > > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset); > > > > > > + kunmap_local(src_vaddr); > > > > > IIUC, src_offset is the src's offset from the first page. e.g., > > > > > src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512. > > > > > > > > > > Then it looks like the two memcpy() calls here only work when npages == 1 ? > > > > > > > > src_offset ends up being the offset into the pair of src pages that we > > > > are using to fully populate a single dest page with each iteration. So > > > > if we start at src_offset, read a page worth of data, then we are now at > > > > src_offset in the next src page and the loop continues that way even if > > > > npages > 1. > > > > > > > > If src_offset is 0 we never have to bother with straddling 2 src pages so > > > > the 2nd memcpy is skipped on every iteration. > > > > > > > > That's the intent at least. Is there a flaw in the code/reasoning that I > > > > missed? > > > Oh, I got you. SNP expects a single src_offset applies for each src page. 
> > > > > > So if npages = 2, there're 4 memcpy() calls. > > > > > > src: |---------|---------|---------| (VA contiguous) > > > ^ ^ ^ > > > | | | > > > dst: |---------|---------| (PA contiguous) > > > > > > > > > I previously incorrectly thought kvm_gmem_populate() should pass in src_offset > > > as 0 for the 2nd src page. > > > > > > Would you consider checking if params.uaddr is PAGE_ALIGNED() in > > > snp_launch_update() to simplify the design? > > > > This was an option mentioned in the cover letter and during PUCK. I am > > not opposed if that's the direction we decide, but I also don't think > > it makes big difference since: > > > > int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > struct page **src_pages, loff_t src_offset, > > int order, void *opaque); > > > > basically reduces to Sean's originally proposed: > > > > int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > struct page *src_pages, int order, > > void *opaque); > > Hmm, the requirement of having each copy to dst_page account for src_offset > (which actually results in 2 copies) is quite confusing. I initially thought the > src_offset only applied to the first dst_page. What I'm wondering though is if I'd done a better job of documenting this aspect, e.g. with the following comment added above kvm_gmem_populate_cb: /* * ... * 'src_pages': array of GUP'd struct page pointers corresponding to * the pages that store the data that is to be copied * into the HPA corresponding to 'pfn' * 'src_offset': byte offset, relative to the first page in the array * of pages pointed to by 'src_pages', to begin copying * the data from. * * NOTE: if the caller of kvm_gmem_populate() enforces that 'src' is * page-aligned, then 'src_offset' will always be zero, and src_pages * will contain only 1 page to copy from, beginning at byte offset 0. * In this case, callers can assume src_offset is 0. */ int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, struct page **src_pages, loff_t src_offset, int order, void *opaque); could the confusion have been avoided, or is it still unwieldly? I don't mind that users like SNP need to deal with the extra bits, but I'm hoping for users like TDX it isn't too cludgy. > > This will also cause kvm_gmem_populate() to allocate 1 more src_npages than > npages for dst pages. That's more of a decision on the part of userspace deciding to use non-page-aligned 'src' pointer to begin with. > > > for any platform that enforces that the src is page-aligned, which > > doesn't seem like a huge technical burden, IMO, despite me initially > > thinking it would be gross when I brought this up during the PUCK call > > that preceeding this posting. 
> > > > > > > > > > > > > > } > > > > > > - kunmap_local(vaddr); > > > > > > + > > > > > > + kunmap_local(dst_vaddr); > > > > > > } > > > > > > > > > > > > ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K, > > > > > > @@ -2331,12 +2339,20 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > > > if (!snp_page_reclaim(kvm, pfn + i) && > > > > > > sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID && > > > > > > sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) { > > > > > > - void *vaddr = kmap_local_pfn(pfn + i); > > > > > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > > > > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > > > > > > > > > - if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE)) > > > > > > - pr_debug("Failed to write CPUID page back to userspace\n"); > > > > > > + memcpy(src_vaddr + src_offset, dst_vaddr, PAGE_SIZE - src_offset); > > > > > > + kunmap_local(src_vaddr); > > > > > > + > > > > > > + if (src_offset) { > > > > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > > > > + > > > > > > + memcpy(src_vaddr, dst_vaddr + PAGE_SIZE - src_offset, src_offset); > > > > > > + kunmap_local(src_vaddr); > > > > > > + } > > > > > > > > > > > > - kunmap_local(vaddr); > > > > > > + kunmap_local(dst_vaddr); > > > > > > } > > > > > > > > > > > > /* pfn + i is hypervisor-owned now, so skip below cleanup for it. */ > > > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > > > > > > index 57ed101a1181..dd5439ec1473 100644 > > > > > > --- a/arch/x86/kvm/vmx/tdx.c > > > > > > +++ b/arch/x86/kvm/vmx/tdx.c > > > > > > @@ -3115,37 +3115,26 @@ struct tdx_gmem_post_populate_arg { > > > > > > }; > > > > > > > > > > > > static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > > > > > - void __user *src, int order, void *_arg) > > > > > > + struct page **src_pages, loff_t src_offset, > > > > > > + int order, void *_arg) > > > > > > { > > > > > > struct tdx_gmem_post_populate_arg *arg = _arg; > > > > > > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > > > > > > u64 err, entry, level_state; > > > > > > gpa_t gpa = gfn_to_gpa(gfn); > > > > > > - struct page *src_page; > > > > > > int ret, i; > > > > > > > > > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) > > > > > > return -EIO; > > > > > > > > > > > > - if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm)) > > > > > > + /* Source should be page-aligned, in which case src_offset will be 0. */ > > > > > > + if (KVM_BUG_ON(src_offset)) > > > > > if (KVM_BUG_ON(src_offset, kvm)) > > > > > > > > > > > return -EINVAL; > > > > > > > > > > > > - /* > > > > > > - * Get the source page if it has been faulted in. Return failure if the > > > > > > - * source page has been swapped out or unmapped in primary memory. > > > > > > - */ > > > > > > - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); > > > > > > - if (ret < 0) > > > > > > - return ret; > > > > > > - if (ret != 1) > > > > > > - return -ENOMEM; > > > > > > - > > > > > > - kvm_tdx->page_add_src = src_page; > > > > > > + kvm_tdx->page_add_src = src_pages[i]; > > > > > src_pages[0] ? i is not initialized. > > > > > > > > Sorry, I switched on TDX options for compile testing but I must have done a > > > > sloppy job confirming it actually built. I'll re-test push these and squash > > > > in the fixes in the github tree. > > > > > > > > > > > > > > Should there also be a KVM_BUG_ON(order > 0, kvm) ? 
> > > > > > > > Seems reasonable, but I'm not sure this is the right patch. Maybe I > > > > could squash it into the preceeding documentation patch so as to not > > > > give the impression this patch changes those expectations in any way. > > > I don't think it should be documented as a user requirement. > > > > I didn't necessarily mean in the documentation, but mainly some patch > > other than this. If we add that check here as part of this patch, we > > give the impression that the order expectations are changing as a result > > of the changes here, when in reality they are exactly the same as > > before. > > > > If not the documentation patch here, then I don't think it really fits > > in this series at all and would be more of a standalone patch against > > kvm/next. > > > > The change here: > > > > - if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm)) > > + /* Source should be page-aligned, in which case src_offset will be 0. */ > > + if (KVM_BUG_ON(src_offset)) > > > > made sense as part of this patch, because now that we are passing struct > > page *src_pages, we can no longer infer alignment from 'src' field, and > > instead need to infer it from src_offset being 0. > > > > > > > > However, we need to comment out that this assertion is due to that > > > tdx_vcpu_init_mem_region() passes npages as 1 to kvm_gmem_populate(). > > > > You mean for the KVM_BUG_ON(order > 0, kvm) you're proposing to add? > > Again, if feels awkward to address this as part of this series since it > > is an existing/unchanged behavior and not really the intent of this > > patchset. > That's true. src_pages[0] just makes it more eye-catching. > What about just adding a comment for src_pages[0] instead of KVM_BUG_ON()? That seems fair/relevant for this series. > > > > > > > ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); > > > > > > kvm_tdx->page_add_src = NULL; > > > > > > > > > > > > - put_page(src_page); > > > > > > - > > > > > > if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION)) > > > > > > return ret; > > > > > > > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > > > > index d93f75b05ae2..7e9d2403c61f 100644 > > > > > > --- a/include/linux/kvm_host.h > > > > > > +++ b/include/linux/kvm_host.h > > > > > > @@ -2581,7 +2581,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord > > > > > > * Returns the number of pages that were populated. > > > > > > */ > > > > > > typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > > > > > - void __user *src, int order, void *opaque); > > > > > > + struct page **src_pages, loff_t src_offset, > > > > > > + int order, void *opaque); > > > > > > > > > > > > long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages, > > > > > > kvm_gmem_populate_cb post_populate, void *opaque); > > > > > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > > > > > > index 9160379df378..e9ac3fd4fd8f 100644 > > > > > > --- a/virt/kvm/guest_memfd.c > > > > > > +++ b/virt/kvm/guest_memfd.c > > > > > > @@ -814,14 +814,17 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > > > > > > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn); > > > > > > > > > > > > #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE > > > > > > + > > > > > > +#define GMEM_GUP_NPAGES (1UL << PMD_ORDER) > > > > > Limiting GMEM_GUP_NPAGES to PMD_ORDER may only work when the max_order of a huge > > > > > folio is 2MB. 
What if the max_order returned from __kvm_gmem_get_pfn() is 1GB > > > > > when src_pages[] can only hold up to 512 pages? > > > > > > > > This was necessarily chosen in prep for hugepages, but more about my > > > > unease at letting userspace GUP arbitrarilly large ranges. PMD_ORDER > > > > happens to align with 2MB hugepages while seeming like a reasonable > > > > batching value so that's why I chose it. > > > > > > > > Even with 1GB support, I wasn't really planning to increase it. SNP > > > > doesn't really make use of RMP sizes >2MB, and it sounds like TDX > > > > handles promotion in a completely different path. So atm I'm leaning > > > > toward just letting GMEM_GUP_NPAGES be the cap for the max page size we > > > > support for kvm_gmem_populate() path and not bothering to change it > > > > until a solid use-case arises. > > > The problem is that with hugetlb-based guest_memfd, the folio itself could be > > > of 1GB, though SNP and TDX can force mapping at only 4KB. > > > > If TDX wants to unload handling of page-clearing to its per-page > > post-populate callback and tie that its shared/private tracking that's > > perfectly fine by me. > > > > *How* TDX tells gmem it wants this different behavior is a topic for a > > follow-up patchset, Vishal suggested kernel-internal flags to > > kvm_gmem_create(), which seemed reasonable to me. In that case, uptodate > Not sure which flag you are referring to with "Vishal suggested kernel-internal > flags to kvm_gmem_create()". > > However, my point is that when the backend folio is 1GB in size (leading to > max_order being PUD_ORDER), even if SNP only maps at 2MB to RMP, it may hit the > warning of "!IS_ALIGNED(gfn, 1 << max_order)". I think I've had to remove that warning every time I start working on some new spin of THP/hugetlbfs-based SNP. I'm not objecting to that. But it's obvious there, in those contexts, and I can explain exactly why it's being removed. It's not obvious in this series, where all we have are hand-wavy thoughts about what hugepages will look like. For all we know we might decide that kvm_gmem_populate() path should just pre-split hugepages to make all the logic easier, or we decide to lock it in at 4K-only and just strip all the hugepage stuff out. I don't really know, and this doesn't seem like the place to try to hash all that out when nothing in this series will cause this existing WARN_ON to be tripped. > > For TDX, it's worse because it always passes npages as 1, so it will also hit > the warning of "(npages - i) < (1 << max_order)". > > Given that this patch already considers huge pages for SNP, it feels half-baked > to leave the WARN_ON() for future handling. > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > (npages - i) < (1 << max_order)); > > > flag would probably just default to set and punt to post-populate/prep > > hooks, because we absolutely *do not* want to have to re-introduce per-4K > > tracking of this type of state within gmem, since getting rid of that sort > > of tracking requirement within gmem is the entire motivation of this > > series. And since, within this series, the uptodate flag and > > prep-tracking both have the same 4K granularity, it seems unecessary to > > address this here. 
> > > > If you were to send a patchset on top of this (or even independently) that > > introduces said kernel-internal gmem flag to offload uptodate-tracking to > > post-populate/prep hooks, and utilize it to optimize the current 4K-only > > TDX implementation by letting TDX module handle the initial > > page-clearing, then I think that change/discussion can progress without > > being blocked in any major way by this series. > > > > But I don't think we need to flesh all that out here, so long as we are > > aware of this as a future change/requirement and have reasonable > > indication that it is compatible with this series. > > > > > > > > Then since max_order = folio_order(folio) (at least in the tree for [1]), > > > WARN_ON() in kvm_gmem_populate() could still be hit: > > > > > > folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); > > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > > (npages - i) < (1 << max_order)); > > > > Yes, in the SNP implementation of hugetlb I ended up removing this > > warning, and in that case I also ended up forcing kvm_gmem_populate() to > > be 4K-only: > > > > https://github.com/AMDESE/linux/blob/snp-hugetlb-v2-wip0/virt/kvm/guest_memfd.c#L2372 > > For 1G (aka HugeTLB) page, this fix is also needed, which was missed in [1] and > I pointed out to Ackerley at [2]. > > [1] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2 > [2] https://lore.kernel.org/all/aFPGPVbzo92t565h@yzhao56-desk.sh.intel.com/ Yes, we'll likely need some kind of change here. I think, if we're trying to find common ground to build hugepage support on, you can assume this will be removed. But I just don't think we need to squash that into this series in order to make progress on those ends. > > > but it makes a lot more sense to make those restrictions and changes in > > the context of hugepage support, rather than this series which is trying > > very hard to not do hugepage enablement, but simply keep what's partially > > there intact while reworking other things that have proven to be > > continued impediments to both in-place conversion and hugepage > > enablement. > Not sure how fixing the warning in this series could impede hugepage enabling :) > > But if you prefer, I don't mind keeping it locally for longer. It's the whole burden of needing to anticipate hugepage design, while it is in a state of potentially massive flux just before LPC, in order to make tiny incremental progress toward enabling in-place conversion, which is something I think we can get upstream much sooner. Look at your changelog for the change above, for instance: it has no relevance in the context of this series. What do I put in its place? Bug reports about my experimental tree? It's just not the right place to try to justify these changes. And most of this weirdness stems from the fact that we prematurely added partial hugepage enablement to begin with. Let's not repeat these mistakes, and address changes in the proper context where we know they make sense. I considered stripping out the existing hugepage support as a pre-patch to avoid leaving these uncertainties in place while we are reworking things, but it felt like needless churn. But that's where I'm coming from with this series: let's get in-place conversion landed, get the API fleshed out, get it working, and then re-assess hugepages with all these common/intersecting bits out of the way. If we can remove some obstacles for hugepages as part of that, great, but that is not the main intent here. 
-Mike > > > Also, there's talk now of enabling hugepages even without in-place > > conversion for hugetlbfs, and that will likely be the same path we > > follow for THP to remain in alignment. Rather than anticipating what all > > these changes will mean WRT hugepage implementation/requirements, I > > think it will be fruitful to remove some of the baggage that will > > complicate that process/discussion like this patchset attempts. > > > > -Mike > > > > > > > > TDX is even easier to hit this warning because it always passes npages as 1. > > > > > > [1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com > > > > > > > > > > > Increasing GMEM_GUP_NPAGES to (1UL << PUD_ORDER) is probabaly not a good idea. > > > > > > > > > > Given both TDX/SNP map at 4KB granularity, why not just invoke post_populate() > > > > > per 4KB while removing the max_order from post_populate() parameters, as done > > > > > in Sean's sketch patch [1]? > > > > > > > > That's an option too, but SNP can make use of 2MB pages in the > > > > post-populate callback so I don't want to shut the door on that option > > > > just yet if it's not too much of a pain to work in. Given the guest BIOS > > > > lives primarily in 1 or 2 of these 2MB regions the benefits might be > > > > worthwhile, and SNP doesn't have a post-post-populate promotion path > > > > like TDX (at least, not one that would help much for guest boot times) > > > I see. > > > > > > So, what about below change? > > > > > > --- a/virt/kvm/guest_memfd.c > > > +++ b/virt/kvm/guest_memfd.c > > > @@ -878,11 +878,10 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > } > > > > > > folio_unlock(folio); > > > - WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > > - (npages - i) < (1 << max_order)); > > > > > > ret = -EINVAL; > > > - while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), > > > + while (!IS_ALIGNED(gfn, 1 << max_order) || (npages - i) < (1 << max_order) || > > > + !kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), > > > KVM_MEMORY_ATTRIBUTE_PRIVATE, > > > KVM_MEMORY_ATTRIBUTE_PRIVATE)) { > > > if (!max_order) > > > > > > > > > > > > > > > > > > > > > > > Then the WARN_ON() in kvm_gmem_populate() can be removed, which would be easily > > > > > triggered by TDX when max_order > 0 && npages == 1: > > > > > > > > > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > > > > > (npages - i) < (1 << max_order)); > > > > > > > > > > > > > > > [1] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/ > > > > > > > > > > > long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages, > > > > > > kvm_gmem_populate_cb post_populate, void *opaque) > > > > > > { > > > > > > struct kvm_memory_slot *slot; > > > > > > - void __user *p; > > > > > > - > > > > > > + struct page **src_pages; > > > > > > int ret = 0, max_order; > > > > > > - long i; > > > > > > + loff_t src_offset = 0; > > > > > > + long i, src_npages; > > > > > > > > > > > > lockdep_assert_held(&kvm->slots_lock); > > > > > > > > > > > > @@ -836,9 +839,28 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > > > if (!file) > > > > > > return -EFAULT; > > > > > > > > > > > > + npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); > > > > > > + npages = min_t(ulong, npages, GMEM_GUP_NPAGES); > > > > > > + > > > > > > + if (src) { > > > > > > + src_npages = IS_ALIGNED((unsigned long)src, PAGE_SIZE) ? 
npages : npages + 1; > > > > > > + > > > > > > + src_pages = kmalloc_array(src_npages, sizeof(struct page *), GFP_KERNEL); > > > > > > + if (!src_pages) > > > > > > + return -ENOMEM; > > > > > > + > > > > > > + ret = get_user_pages_fast((unsigned long)src, src_npages, 0, src_pages); > > > > > > + if (ret < 0) > > > > > > + return ret; > > > > > > + > > > > > > + if (ret != src_npages) > > > > > > + return -ENOMEM; > > > > > > + > > > > > > + src_offset = (loff_t)(src - PTR_ALIGN_DOWN(src, PAGE_SIZE)); > > > > > > + } > > > > > > + > > > > > > filemap_invalidate_lock(file->f_mapping); > > > > > > > > > > > > - npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); > > > > > > for (i = 0; i < npages; i += (1 << max_order)) { > > > > > > struct folio *folio; > > > > > > gfn_t gfn = start_gfn + i; > > > > > > @@ -869,8 +891,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > > > max_order--; > > > > > > } > > > > > > > > > > > > - p = src ? src + i * PAGE_SIZE : NULL; > > > > > > - ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > > > > > > + ret = post_populate(kvm, gfn, pfn, src ? &src_pages[i] : NULL, > > > > > > + src_offset, max_order, opaque); > > > > > Why src_offset is not 0 starting from the 2nd page? > > > > > > > > > > > if (!ret) > > > > > > folio_mark_uptodate(folio); > > > > > > > > > > > > @@ -882,6 +904,14 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > > > > > > > > > > > > filemap_invalidate_unlock(file->f_mapping); > > > > > > > > > > > > + if (src) { > > > > > > + long j; > > > > > > + > > > > > > + for (j = 0; j < src_npages; j++) > > > > > > + put_page(src_pages[j]); > > > > > > + kfree(src_pages); > > > > > > + } > > > > > > + > > > > > > return ret && !i ? ret : i; > > > > > > } > > > > > > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_populate); > > > > > > -- > > > > > > 2.25.1 > > > > > > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-12-03 14:26 ` Michael Roth @ 2025-12-03 20:59 ` FirstName LastName 2025-12-03 23:12 ` Michael Roth 2025-12-03 21:01 ` Ira Weiny 2025-12-05 3:38 ` Yan Zhao 2 siblings, 1 reply; 35+ messages in thread From: FirstName LastName @ 2025-12-03 20:59 UTC (permalink / raw) To: michael.roth Cc: ackerleytng, aik, ashish.kalra, david, ira.weiny, kvm, liam.merwick, linux-coco, linux-kernel, linux-mm, pbonzini, seanjc, thomas.lendacky, vannapurve, vbabka, yan.y.zhao > > > > > but it makes a lot more sense to make those restrictions and changes in > > > the context of hugepage support, rather than this series which is trying > > > very hard to not do hugepage enablement, but simply keep what's partially > > > there intact while reworking other things that have proven to be > > > continued impediments to both in-place conversion and hugepage > > > enablement. > > Not sure how fixing the warning in this series could impede hugepage enabling :) > > > > But if you prefer, I don't mind keeping it locally for longer. > > It's the whole burden of needing to anticipate hugepage design, while it > is in a state of potentially massive flux just before LPC, in order to > make tiny incremental progress toward enabling in-place conversion, > which is something I think we can get upstream much sooner. Look at your > changelog for the change above, for instance: it has no relevance in the > context of this series. What do I put in its place? Bug reports about > my experimental tree? It's just not the right place to try to justify > these changes. > > And most of this weirdness stems from the fact that we prematurely added > partial hugepage enablement to begin with. Let's not repeat these mistakes, > and address changes in the proper context where we know they make sense. > > I considered stripping out the existing hugepage support as a pre-patch > to avoid leaving these uncertainties in place while we are reworking > things, but it felt like needless churn. But that's where I'm coming I think simplifying this implementation to handle populate at 4K pages is worth considering at this stage and we could optimize for huge page granularity population in future based on the need. e.g. 4K page based population logic will keep things simple and can be further simplified if we can add PAGE_ALIGNED(params.uaddr) restriction. Extending Sean's suggestion earlier, compile tested only. 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index f59c65abe3cf..224e79ab8f86 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2267,66 +2267,56 @@ struct sev_gmem_populate_args { int fw_error; }; -static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn, - void __user *src, int order, void *opaque) +static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, + struct page *src_page, void *opaque) { struct sev_gmem_populate_args *sev_populate_args = opaque; struct kvm_sev_info *sev = to_kvm_sev_info(kvm); - int n_private = 0, ret, i; - int npages = (1 << order); - gfn_t gfn; + int ret; - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src)) + if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page)) return -EINVAL; - for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) { - struct sev_data_snp_launch_update fw_args = {0}; - bool assigned = false; - int level; - - ret = snp_lookup_rmpentry((u64)pfn + i, &assigned, &level); - if (ret || assigned) { - pr_debug("%s: Failed to ensure GFN 0x%llx RMP entry is initial shared state, ret: %d assigned: %d\n", - __func__, gfn, ret, assigned); - ret = ret ? -EINVAL : -EEXIST; - goto err; - } + struct sev_data_snp_launch_update fw_args = {0}; + bool assigned = false; + int level; - if (src) { - void *vaddr = kmap_local_pfn(pfn + i); + ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level); + if (ret || assigned) { + pr_debug("%s: Failed to ensure GFN 0x%llx RMP entry is initial shared state, ret: %d assigned: %d\n", + __func__, gfn, ret, assigned); + ret = ret ? -EINVAL : -EEXIST; + goto err; + } - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { - ret = -EFAULT; - goto err; - } - kunmap_local(vaddr); - } + if (src_page) { + void *vaddr = kmap_local_pfn(pfn); - ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K, - sev_get_asid(kvm), true); - if (ret) - goto err; + memcpy(vaddr, page_address(src_page), PAGE_SIZE); + kunmap_local(vaddr); + } - n_private++; + ret = rmp_make_private(pfn, gfn << PAGE_SHIFT, PG_LEVEL_4K, + sev_get_asid(kvm), true); + if (ret) + goto err; - fw_args.gctx_paddr = __psp_pa(sev->snp_context); - fw_args.address = __sme_set(pfn_to_hpa(pfn + i)); - fw_args.page_size = PG_LEVEL_TO_RMP(PG_LEVEL_4K); - fw_args.page_type = sev_populate_args->type; + fw_args.gctx_paddr = __psp_pa(sev->snp_context); + fw_args.address = __sme_set(pfn_to_hpa(pfn)); + fw_args.page_size = PG_LEVEL_TO_RMP(PG_LEVEL_4K); + fw_args.page_type = sev_populate_args->type; - ret = __sev_issue_cmd(sev_populate_args->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, - &fw_args, &sev_populate_args->fw_error); - if (ret) - goto fw_err; - } + ret = __sev_issue_cmd(sev_populate_args->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, + &fw_args, &sev_populate_args->fw_error); + if (ret) + goto fw_err; return 0; fw_err: /* * If the firmware command failed handle the reclaim and cleanup of that - * PFN specially vs. prior pages which can be cleaned up below without - * needing to reclaim in advance. + * PFN specially. * * Additionally, when invalid CPUID function entries are detected, * firmware writes the expected values into the page and leaves it @@ -2336,25 +2326,18 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf * information to provide information on which CPUID leaves/fields * failed CPUID validation. 
*/ - if (!snp_page_reclaim(kvm, pfn + i) && + if (!snp_page_reclaim(kvm, pfn) && sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID && sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) { - void *vaddr = kmap_local_pfn(pfn + i); - - if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE)) - pr_debug("Failed to write CPUID page back to userspace\n"); + void *vaddr = kmap_local_pfn(pfn); + memcpy(page_address(src_page), vaddr, PAGE_SIZE); kunmap_local(vaddr); } - /* pfn + i is hypervisor-owned now, so skip below cleanup for it. */ - n_private--; - err: - pr_debug("%s: exiting with error ret %d (fw_error %d), restoring %d gmem PFNs to shared.\n", - __func__, ret, sev_populate_args->fw_error, n_private); - for (i = 0; i < n_private; i++) - kvm_rmp_make_shared(kvm, pfn + i, PG_LEVEL_4K); + pr_debug("%s: exiting with error ret %d (fw_error %d)\n", + __func__, ret, sev_populate_args->fw_error); return ret; } diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index 2d7a4d52ccfb..acdcb802d9f2 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -3118,34 +3118,21 @@ struct tdx_gmem_post_populate_arg { }; static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, - void __user *src, int order, void *_arg) + struct page *src_page, void *_arg) { struct tdx_gmem_post_populate_arg *arg = _arg; struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); u64 err, entry, level_state; gpa_t gpa = gfn_to_gpa(gfn); - struct page *src_page; - int ret, i; + int ret; if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) return -EIO; - /* - * Get the source page if it has been faulted in. Return failure if the - * source page has been swapped out or unmapped in primary memory. - */ - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); - if (ret < 0) - return ret; - if (ret != 1) - return -ENOMEM; - kvm_tdx->page_add_src = src_page; ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); kvm_tdx->page_add_src = NULL; - put_page(src_page); - if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION)) return ret; @@ -3156,11 +3143,9 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, * mmu_notifier events can't reach S-EPT entries, and KVM's internal * zapping flows are mutually exclusive with S-EPT mappings. 
*/ - for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { - err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state); - if (TDX_BUG_ON_2(err, TDH_MR_EXTEND, entry, level_state, kvm)) - return -EIO; - } + err = tdh_mr_extend(&kvm_tdx->td, gpa, &entry, &level_state); + if (TDX_BUG_ON_2(err, TDH_MR_EXTEND, entry, level_state, kvm)) + return -EIO; return 0; } @@ -3196,38 +3181,26 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c return -EINVAL; ret = 0; - while (region.nr_pages) { - if (signal_pending(current)) { - ret = -EINTR; - break; - } - - arg = (struct tdx_gmem_post_populate_arg) { - .vcpu = vcpu, - .flags = cmd->flags, - }; - gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa), - u64_to_user_ptr(region.source_addr), - 1, tdx_gmem_post_populate, &arg); - if (gmem_ret < 0) { - ret = gmem_ret; - break; - } + arg = (struct tdx_gmem_post_populate_arg) { + .vcpu = vcpu, + .flags = cmd->flags, + }; + gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa), + u64_to_user_ptr(region.source_addr), + region.nr_pages, tdx_gmem_post_populate, &arg); + if (gmem_ret < 0) + ret = gmem_ret; - if (gmem_ret != 1) { - ret = -EIO; - break; - } + if (gmem_ret != region.nr_pages) + ret = -EIO; - region.source_addr += PAGE_SIZE; - region.gpa += PAGE_SIZE; - region.nr_pages--; + if (gmem_ret >= 0) { + region.source_addr += gmem_ret * PAGE_SIZE; + region.gpa += gmem_ret * PAGE_SIZE; - cond_resched(); + if (copy_to_user(u64_to_user_ptr(cmd->data), ®ion, sizeof(region))) + ret = -EFAULT; } - - if (copy_to_user(u64_to_user_ptr(cmd->data), ®ion, sizeof(region))) - ret = -EFAULT; return ret; } diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index d93f75b05ae2..263e75f90e91 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2581,7 +2581,7 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord * Returns the number of pages that were populated. 
*/ typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, - void __user *src, int order, void *opaque); + struct page *src_page, void *opaque); long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages, kvm_gmem_populate_cb post_populate, void *opaque); diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 2e62bf882aa8..550dc818748b 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -85,7 +85,6 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn, struct folio *folio) { - unsigned long nr_pages, i; pgoff_t index; /* @@ -794,7 +793,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, return PTR_ERR(folio); if (!folio_test_uptodate(folio)) { - clear_huge_page(&folio->page, 0, folio_nr_pages(folio)); + clear_highpage(folio_page(folio, 0)); folio_mark_uptodate(folio); } @@ -812,13 +811,54 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn); #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE +static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot, + struct file *file, gfn_t gfn, struct page *src_page, + kvm_gmem_populate_cb post_populate, void *opaque) +{ + pgoff_t index = kvm_gmem_get_index(slot, gfn); + struct gmem_inode *gi; + struct folio *folio; + int ret, max_order; + kvm_pfn_t pfn; + + gi = GMEM_I(file_inode(file)); + + filemap_invalidate_lock(file->f_mapping); + + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); + if (IS_ERR(folio)) { + ret = PTR_ERR(folio); + goto out_unlock; + } + + folio_unlock(folio); + + if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1, + KVM_MEMORY_ATTRIBUTE_PRIVATE, + KVM_MEMORY_ATTRIBUTE_PRIVATE)) { + ret = -EINVAL; + goto out_put_folio; + } + + ret = post_populate(kvm, gfn, pfn, src_page, opaque); + if (!ret) + folio_mark_uptodate(folio); + +out_put_folio: + folio_put(folio); +out_unlock: + filemap_invalidate_unlock(file->f_mapping); + return ret; +} + long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages, kvm_gmem_populate_cb post_populate, void *opaque) { + struct page *src_aligned_page = NULL; struct kvm_memory_slot *slot; + struct page *src_page = NULL; void __user *p; - - int ret = 0, max_order; + int ret = 0; long i; lockdep_assert_held(&kvm->slots_lock); @@ -834,52 +874,50 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long if (!file) return -EFAULT; - filemap_invalidate_lock(file->f_mapping); + if (src && !PAGE_ALIGNED(src)) { + src_page = alloc_page(GFP_KERNEL_ACCOUNT); + if (!src_page) + return -ENOMEM; + } npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); - for (i = 0; i < npages; i += (1 << max_order)) { - struct folio *folio; - gfn_t gfn = start_gfn + i; - pgoff_t index = kvm_gmem_get_index(slot, gfn); - kvm_pfn_t pfn; - + for (i = 0; i < npages; i++) { if (signal_pending(current)) { ret = -EINTR; break; } - folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); - if (IS_ERR(folio)) { - ret = PTR_ERR(folio); - break; - } - - folio_unlock(folio); - WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || - (npages - i) < (1 << max_order)); + p = src ? 
src + i * PAGE_SIZE : NULL; - ret = -EINVAL; - while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), - KVM_MEMORY_ATTRIBUTE_PRIVATE, - KVM_MEMORY_ATTRIBUTE_PRIVATE)) { - if (!max_order) - goto put_folio_and_exit; - max_order--; + if (p) { + if (src_page) { + if (copy_from_user(page_address(src_page), p, PAGE_SIZE)) { + ret = -EFAULT; + break; + } + src_aligned_page = src_page; + } else { + ret = get_user_pages((unsigned long)p, 1, 0, &src_aligned_page); + if (ret < 0) + break; + if (ret != 1) { + ret = -ENOMEM; + break; + } + } } - p = src ? src + i * PAGE_SIZE : NULL; - ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); - if (!ret) - folio_mark_uptodate(folio); + ret = __kvm_gmem_populate(kvm, slot, file, start_gfn + i, src_aligned_page, + post_populate, opaque); + if (p && !src_page) + put_page(src_aligned_page); -put_folio_and_exit: - folio_put(folio); if (ret) break; } - filemap_invalidate_unlock(file->f_mapping); - + if (src_page) + __free_page(src_page); return ret && !i ? ret : i; } EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_populate); -- 2.52.0.177.g9f829587af-goog ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-12-03 20:59 ` FirstName LastName @ 2025-12-03 23:12 ` Michael Roth 0 siblings, 0 replies; 35+ messages in thread From: Michael Roth @ 2025-12-03 23:12 UTC (permalink / raw) To: FirstName LastName Cc: ackerleytng, aik, ashish.kalra, david, ira.weiny, kvm, liam.merwick, linux-coco, linux-kernel, linux-mm, pbonzini, seanjc, thomas.lendacky, vbabka, yan.y.zhao On Wed, Dec 03, 2025 at 08:59:10PM +0000, FirstName LastName wrote: > > > > > > > but it makes a lot more sense to make those restrictions and changes in > > > > the context of hugepage support, rather than this series which is trying > > > > very hard to not do hugepage enablement, but simply keep what's partially > > > > there intact while reworking other things that have proven to be > > > > continued impediments to both in-place conversion and hugepage > > > > enablement. > > > Not sure how fixing the warning in this series could impede hugepage enabling :) > > > > > > But if you prefer, I don't mind keeping it locally for longer. > > > > It's the whole burden of needing to anticipate hugepage design, while it > > is in a state of potentially massive flux just before LPC, in order to > > make tiny incremental progress toward enabling in-place conversion, > > which is something I think we can get upstream much sooner. Look at your > > changelog for the change above, for instance: it has no relevance in the > > context of this series. What do I put in its place? Bug reports about > > my experimental tree? It's just not the right place to try to justify > > these changes. > > > > And most of this weirdness stems from the fact that we prematurely added > > partial hugepage enablement to begin with. Let's not repeat these mistakes, > > and address changes in the proper context where we know they make sense. > > > > I considered stripping out the existing hugepage support as a pre-patch > > to avoid leaving these uncertainties in place while we are reworking > > things, but it felt like needless churn. But that's where I'm coming > > I think simplifying this implementation to handle populate at 4K pages is worth > considering at this stage and we could optimize for huge page granularity > population in future based on the need. That's probably for the best, after all. Though I think a separate pre-patch to remove the hugepage stuff would be cleaner, as it obfuscates the GUP changes quite a bit, which are already pretty subtle as-is. I'll plan to do this for the next spin, if there are no objections raised in the meantime. > > e.g. 4K page based population logic will keep things simple and can be > further simplified if we can add PAGE_ALIGNED(params.uaddr) restriction. I'm still hesitant to pull the trigger on retroactively enforcing page-aligned uaddr for SNP, but if the maintainers are good with it then no objection from me. > Extending Sean's suggestion earlier, compile tested only. Thanks! 
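For reference, the retroactive restriction being discussed would amount to roughly the following up-front check in snp_launch_update(), ahead of the kvm_gmem_populate() call (sketch only, not part of any posted patch):

	/* Hypothetical: reject non-page-aligned source buffers outright. */
	if (params.uaddr && !PAGE_ALIGNED(params.uaddr))
		return -EINVAL;

The hesitation is purely about rejecting buffers that were previously accepted, hence wanting maintainer buy-in before going that route.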
-Mike > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > index f59c65abe3cf..224e79ab8f86 100644 > --- a/arch/x86/kvm/svm/sev.c > +++ b/arch/x86/kvm/svm/sev.c > @@ -2267,66 +2267,56 @@ struct sev_gmem_populate_args { > int fw_error; > }; > > -static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn, > - void __user *src, int order, void *opaque) > +static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > + struct page *src_page, void *opaque) > { > struct sev_gmem_populate_args *sev_populate_args = opaque; > struct kvm_sev_info *sev = to_kvm_sev_info(kvm); > - int n_private = 0, ret, i; > - int npages = (1 << order); > - gfn_t gfn; > + int ret; > > - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src)) > + if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_page)) > return -EINVAL; > > - for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) { > - struct sev_data_snp_launch_update fw_args = {0}; > - bool assigned = false; > - int level; > - > - ret = snp_lookup_rmpentry((u64)pfn + i, &assigned, &level); > - if (ret || assigned) { > - pr_debug("%s: Failed to ensure GFN 0x%llx RMP entry is initial shared state, ret: %d assigned: %d\n", > - __func__, gfn, ret, assigned); > - ret = ret ? -EINVAL : -EEXIST; > - goto err; > - } > + struct sev_data_snp_launch_update fw_args = {0}; > + bool assigned = false; > + int level; > > - if (src) { > - void *vaddr = kmap_local_pfn(pfn + i); > + ret = snp_lookup_rmpentry((u64)pfn, &assigned, &level); > + if (ret || assigned) { > + pr_debug("%s: Failed to ensure GFN 0x%llx RMP entry is initial shared state, ret: %d assigned: %d\n", > + __func__, gfn, ret, assigned); > + ret = ret ? -EINVAL : -EEXIST; > + goto err; > + } > > - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { > - ret = -EFAULT; > - goto err; > - } > - kunmap_local(vaddr); > - } > + if (src_page) { > + void *vaddr = kmap_local_pfn(pfn); > > - ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K, > - sev_get_asid(kvm), true); > - if (ret) > - goto err; > + memcpy(vaddr, page_address(src_page), PAGE_SIZE); > + kunmap_local(vaddr); > + } > > - n_private++; > + ret = rmp_make_private(pfn, gfn << PAGE_SHIFT, PG_LEVEL_4K, > + sev_get_asid(kvm), true); > + if (ret) > + goto err; > > - fw_args.gctx_paddr = __psp_pa(sev->snp_context); > - fw_args.address = __sme_set(pfn_to_hpa(pfn + i)); > - fw_args.page_size = PG_LEVEL_TO_RMP(PG_LEVEL_4K); > - fw_args.page_type = sev_populate_args->type; > + fw_args.gctx_paddr = __psp_pa(sev->snp_context); > + fw_args.address = __sme_set(pfn_to_hpa(pfn)); > + fw_args.page_size = PG_LEVEL_TO_RMP(PG_LEVEL_4K); > + fw_args.page_type = sev_populate_args->type; > > - ret = __sev_issue_cmd(sev_populate_args->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, > - &fw_args, &sev_populate_args->fw_error); > - if (ret) > - goto fw_err; > - } > + ret = __sev_issue_cmd(sev_populate_args->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, > + &fw_args, &sev_populate_args->fw_error); > + if (ret) > + goto fw_err; > > return 0; > > fw_err: > /* > * If the firmware command failed handle the reclaim and cleanup of that > - * PFN specially vs. prior pages which can be cleaned up below without > - * needing to reclaim in advance. > + * PFN specially. 
> * > * Additionally, when invalid CPUID function entries are detected, > * firmware writes the expected values into the page and leaves it > @@ -2336,25 +2326,18 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > * information to provide information on which CPUID leaves/fields > * failed CPUID validation. > */ > - if (!snp_page_reclaim(kvm, pfn + i) && > + if (!snp_page_reclaim(kvm, pfn) && > sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID && > sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) { > - void *vaddr = kmap_local_pfn(pfn + i); > - > - if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE)) > - pr_debug("Failed to write CPUID page back to userspace\n"); > + void *vaddr = kmap_local_pfn(pfn); > > + memcpy(page_address(src_page), vaddr, PAGE_SIZE); > kunmap_local(vaddr); > } > > - /* pfn + i is hypervisor-owned now, so skip below cleanup for it. */ > - n_private--; > - > err: > - pr_debug("%s: exiting with error ret %d (fw_error %d), restoring %d gmem PFNs to shared.\n", > - __func__, ret, sev_populate_args->fw_error, n_private); > - for (i = 0; i < n_private; i++) > - kvm_rmp_make_shared(kvm, pfn + i, PG_LEVEL_4K); > + pr_debug("%s: exiting with error ret %d (fw_error %d)\n", > + __func__, ret, sev_populate_args->fw_error); > > return ret; > } > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index 2d7a4d52ccfb..acdcb802d9f2 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -3118,34 +3118,21 @@ struct tdx_gmem_post_populate_arg { > }; > > static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > - void __user *src, int order, void *_arg) > + struct page *src_page, void *_arg) > { > struct tdx_gmem_post_populate_arg *arg = _arg; > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > u64 err, entry, level_state; > gpa_t gpa = gfn_to_gpa(gfn); > - struct page *src_page; > - int ret, i; > + int ret; > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) > return -EIO; > > - /* > - * Get the source page if it has been faulted in. Return failure if the > - * source page has been swapped out or unmapped in primary memory. > - */ > - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); > - if (ret < 0) > - return ret; > - if (ret != 1) > - return -ENOMEM; > - > kvm_tdx->page_add_src = src_page; > ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); > kvm_tdx->page_add_src = NULL; > > - put_page(src_page); > - > if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION)) > return ret; > > @@ -3156,11 +3143,9 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > * mmu_notifier events can't reach S-EPT entries, and KVM's internal > * zapping flows are mutually exclusive with S-EPT mappings. 
> */ > - for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { > - err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state); > - if (TDX_BUG_ON_2(err, TDH_MR_EXTEND, entry, level_state, kvm)) > - return -EIO; > - } > + err = tdh_mr_extend(&kvm_tdx->td, gpa, &entry, &level_state); > + if (TDX_BUG_ON_2(err, TDH_MR_EXTEND, entry, level_state, kvm)) > + return -EIO; > > return 0; > } > @@ -3196,38 +3181,26 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c > return -EINVAL; > > ret = 0; > - while (region.nr_pages) { > - if (signal_pending(current)) { > - ret = -EINTR; > - break; > - } > - > - arg = (struct tdx_gmem_post_populate_arg) { > - .vcpu = vcpu, > - .flags = cmd->flags, > - }; > - gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa), > - u64_to_user_ptr(region.source_addr), > - 1, tdx_gmem_post_populate, &arg); > - if (gmem_ret < 0) { > - ret = gmem_ret; > - break; > - } > + arg = (struct tdx_gmem_post_populate_arg) { > + .vcpu = vcpu, > + .flags = cmd->flags, > + }; > + gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa), > + u64_to_user_ptr(region.source_addr), > + region.nr_pages, tdx_gmem_post_populate, &arg); > + if (gmem_ret < 0) > + ret = gmem_ret; > > - if (gmem_ret != 1) { > - ret = -EIO; > - break; > - } > + if (gmem_ret != region.nr_pages) > + ret = -EIO; > > - region.source_addr += PAGE_SIZE; > - region.gpa += PAGE_SIZE; > - region.nr_pages--; > + if (gmem_ret >= 0) { > + region.source_addr += gmem_ret * PAGE_SIZE; > + region.gpa += gmem_ret * PAGE_SIZE; > > - cond_resched(); > + if (copy_to_user(u64_to_user_ptr(cmd->data), ®ion, sizeof(region))) > + ret = -EFAULT; > } > - > - if (copy_to_user(u64_to_user_ptr(cmd->data), ®ion, sizeof(region))) > - ret = -EFAULT; > return ret; > } > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index d93f75b05ae2..263e75f90e91 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -2581,7 +2581,7 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord > * Returns the number of pages that were populated. 
> */ > typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > - void __user *src, int order, void *opaque); > + struct page *src_page, void *opaque); > > long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages, > kvm_gmem_populate_cb post_populate, void *opaque); > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > index 2e62bf882aa8..550dc818748b 100644 > --- a/virt/kvm/guest_memfd.c > +++ b/virt/kvm/guest_memfd.c > @@ -85,7 +85,6 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo > static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot, > gfn_t gfn, struct folio *folio) > { > - unsigned long nr_pages, i; > pgoff_t index; > > /* > @@ -794,7 +793,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > return PTR_ERR(folio); > > if (!folio_test_uptodate(folio)) { > - clear_huge_page(&folio->page, 0, folio_nr_pages(folio)); > + clear_highpage(folio_page(folio, 0)); > folio_mark_uptodate(folio); > } > > @@ -812,13 +811,54 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn); > > #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE > +static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot, > + struct file *file, gfn_t gfn, struct page *src_page, > + kvm_gmem_populate_cb post_populate, void *opaque) > +{ > + pgoff_t index = kvm_gmem_get_index(slot, gfn); > + struct gmem_inode *gi; > + struct folio *folio; > + int ret, max_order; > + kvm_pfn_t pfn; > + > + gi = GMEM_I(file_inode(file)); > + > + filemap_invalidate_lock(file->f_mapping); > + > + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); > + if (IS_ERR(folio)) { > + ret = PTR_ERR(folio); > + goto out_unlock; > + } > + > + folio_unlock(folio); > + > + if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1, > + KVM_MEMORY_ATTRIBUTE_PRIVATE, > + KVM_MEMORY_ATTRIBUTE_PRIVATE)) { > + ret = -EINVAL; > + goto out_put_folio; > + } > + > + ret = post_populate(kvm, gfn, pfn, src_page, opaque); > + if (!ret) > + folio_mark_uptodate(folio); > + > +out_put_folio: > + folio_put(folio); > +out_unlock: > + filemap_invalidate_unlock(file->f_mapping); > + return ret; > +} > + > long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages, > kvm_gmem_populate_cb post_populate, void *opaque) > { > + struct page *src_aligned_page = NULL; > struct kvm_memory_slot *slot; > + struct page *src_page = NULL; > void __user *p; > - > - int ret = 0, max_order; > + int ret = 0; > long i; > > lockdep_assert_held(&kvm->slots_lock); > @@ -834,52 +874,50 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long > if (!file) > return -EFAULT; > > - filemap_invalidate_lock(file->f_mapping); > + if (src && !PAGE_ALIGNED(src)) { > + src_page = alloc_page(GFP_KERNEL_ACCOUNT); > + if (!src_page) > + return -ENOMEM; > + } > > npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages); > - for (i = 0; i < npages; i += (1 << max_order)) { > - struct folio *folio; > - gfn_t gfn = start_gfn + i; > - pgoff_t index = kvm_gmem_get_index(slot, gfn); > - kvm_pfn_t pfn; > - > + for (i = 0; i < npages; i++) { > if (signal_pending(current)) { > ret = -EINTR; > break; > } > > - folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order); > - if (IS_ERR(folio)) { > - ret = PTR_ERR(folio); > - break; > - } > - > - folio_unlock(folio); > - WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || > - (npages - 
i) < (1 << max_order)); > + p = src ? src + i * PAGE_SIZE : NULL; > > - ret = -EINVAL; > - while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order), > - KVM_MEMORY_ATTRIBUTE_PRIVATE, > - KVM_MEMORY_ATTRIBUTE_PRIVATE)) { > - if (!max_order) > - goto put_folio_and_exit; > - max_order--; > + if (p) { > + if (src_page) { > + if (copy_from_user(page_address(src_page), p, PAGE_SIZE)) { > + ret = -EFAULT; > + break; > + } > + src_aligned_page = src_page; > + } else { > + ret = get_user_pages((unsigned long)p, 1, 0, &src_aligned_page); > + if (ret < 0) > + break; > + if (ret != 1) { > + ret = -ENOMEM; > + break; > + } > + } > } > > - p = src ? src + i * PAGE_SIZE : NULL; > - ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); > - if (!ret) > - folio_mark_uptodate(folio); > + ret = __kvm_gmem_populate(kvm, slot, file, start_gfn + i, src_aligned_page, > + post_populate, opaque); > + if (p && !src_page) > + put_page(src_aligned_page); > > -put_folio_and_exit: > - folio_put(folio); > if (ret) > break; > } > > - filemap_invalidate_unlock(file->f_mapping); > - > + if (src_page) > + __free_page(src_page); > return ret && !i ? ret : i; > } > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_populate); > -- > 2.52.0.177.g9f829587af-goog > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-12-03 14:26 ` Michael Roth 2025-12-03 20:59 ` FirstName LastName @ 2025-12-03 21:01 ` Ira Weiny 2025-12-03 23:07 ` Michael Roth 2025-12-05 3:38 ` Yan Zhao 2 siblings, 1 reply; 35+ messages in thread From: Ira Weiny @ 2025-12-03 21:01 UTC (permalink / raw) To: Michael Roth, Yan Zhao Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny Michael Roth wrote: > On Wed, Dec 03, 2025 at 10:46:27AM +0800, Yan Zhao wrote: > > On Mon, Dec 01, 2025 at 04:13:55PM -0600, Michael Roth wrote: > > > On Mon, Nov 24, 2025 at 05:31:46PM +0800, Yan Zhao wrote: > > > > On Fri, Nov 21, 2025 at 07:01:44AM -0600, Michael Roth wrote: > > > > > On Thu, Nov 20, 2025 at 05:11:48PM +0800, Yan Zhao wrote: > > > > > > On Thu, Nov 13, 2025 at 05:07:59PM -0600, Michael Roth wrote: [snip] > > > > > > > --- > > > > > > > arch/x86/kvm/svm/sev.c | 40 ++++++++++++++++++++++++++------------ > > > > > > > arch/x86/kvm/vmx/tdx.c | 21 +++++--------------- > > > > > > > include/linux/kvm_host.h | 3 ++- > > > > > > > virt/kvm/guest_memfd.c | 42 ++++++++++++++++++++++++++++++++++------ > > > > > > > 4 files changed, 71 insertions(+), 35 deletions(-) > > > > > > > > > > > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > > > > > > index 0835c664fbfd..d0ac710697a2 100644 > > > > > > > --- a/arch/x86/kvm/svm/sev.c > > > > > > > +++ b/arch/x86/kvm/svm/sev.c > > > > > > > @@ -2260,7 +2260,8 @@ struct sev_gmem_populate_args { > > > > > > > }; > > > > > > > > > > > > > > static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn, > > > > > > > - void __user *src, int order, void *opaque) > > > > > > > + struct page **src_pages, loff_t src_offset, > > > > > > > + int order, void *opaque) > > > > > > > { > > > > > > > struct sev_gmem_populate_args *sev_populate_args = opaque; > > > > > > > struct kvm_sev_info *sev = to_kvm_sev_info(kvm); > > > > > > > @@ -2268,7 +2269,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > > > > int npages = (1 << order); > > > > > > > gfn_t gfn; > > > > > > > > > > > > > > - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src)) > > > > > > > + if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_pages)) > > > > > > > return -EINVAL; > > > > > > > > > > > > > > for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) { > > > > > > > @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > > > > goto err; > > > > > > > } > > > > > > > > > > > > > > - if (src) { > > > > > > > - void *vaddr = kmap_local_pfn(pfn + i); > > > > > > > + if (src_pages) { > > > > > > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > > > > > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > > > > > > > > > > > - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { > > > > > > > - ret = -EFAULT; > > > > > > > - goto err; > > > > > > > + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset); > > > > > > > + kunmap_local(src_vaddr); > > > > > > > + > > > > > > > + if (src_offset) { > > > > > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > > > > > + > > > > > > > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset); > > > > > > > + kunmap_local(src_vaddr); > > > > > 
> IIUC, src_offset is the src's offset from the first page. e.g., > > > > > > src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512. > > > > > > > > > > > > Then it looks like the two memcpy() calls here only work when npages == 1 ? > > > > > > > > > > src_offset ends up being the offset into the pair of src pages that we > > > > > are using to fully populate a single dest page with each iteration. So > > > > > if we start at src_offset, read a page worth of data, then we are now at > > > > > src_offset in the next src page and the loop continues that way even if > > > > > npages > 1. > > > > > > > > > > If src_offset is 0 we never have to bother with straddling 2 src pages so > > > > > the 2nd memcpy is skipped on every iteration. > > > > > > > > > > That's the intent at least. Is there a flaw in the code/reasoning that I > > > > > missed? > > > > Oh, I got you. SNP expects a single src_offset applies for each src page. > > > > > > > > So if npages = 2, there're 4 memcpy() calls. > > > > > > > > src: |---------|---------|---------| (VA contiguous) > > > > ^ ^ ^ > > > > | | | > > > > dst: |---------|---------| (PA contiguous) > > > > > > > > > > > > I previously incorrectly thought kvm_gmem_populate() should pass in src_offset > > > > as 0 for the 2nd src page. > > > > > > > > Would you consider checking if params.uaddr is PAGE_ALIGNED() in > > > > snp_launch_update() to simplify the design? > > > > > > This was an option mentioned in the cover letter and during PUCK. I am > > > not opposed if that's the direction we decide, but I also don't think > > > it makes big difference since: > > > > > > int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > > struct page **src_pages, loff_t src_offset, > > > int order, void *opaque); > > > > > > basically reduces to Sean's originally proposed: > > > > > > int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > > struct page *src_pages, int order, > > > void *opaque); > > > > Hmm, the requirement of having each copy to dst_page account for src_offset > > (which actually results in 2 copies) is quite confusing. I initially thought the > > src_offset only applied to the first dst_page. > > What I'm wondering though is if I'd done a better job of documenting > this aspect, e.g. with the following comment added above > kvm_gmem_populate_cb: > > /* > * ... > * 'src_pages': array of GUP'd struct page pointers corresponding to > * the pages that store the data that is to be copied > * into the HPA corresponding to 'pfn' > * 'src_offset': byte offset, relative to the first page in the array > * of pages pointed to by 'src_pages', to begin copying > * the data from. > * > * NOTE: if the caller of kvm_gmem_populate() enforces that 'src' is > * page-aligned, then 'src_offset' will always be zero, and src_pages > * will contain only 1 page to copy from, beginning at byte offset 0. > * In this case, callers can assume src_offset is 0. > */ > int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > struct page **src_pages, loff_t src_offset, > int order, void *opaque); > > could the confusion have been avoided, or is it still unwieldly? > > I don't mind that users like SNP need to deal with the extra bits, but > I'm hoping for users like TDX it isn't too cludgy. FWIW I don't think the TDX code was a problem. I was trying to review the SNP code for correctness and it was confusing enough that I was concerned the investment is not worth the cost. 
I'll reiterate that the in-place conversion _use_ _case_ requires user space to keep the 'source' (i.e. the page) aligned because it is all getting converted anyway. So I'm not seeing a good use case for supporting this. But Vishal seemed to think there was, so... Given this potential use case, the above comment is clearer. FWIW, I think this is going to get even more complex if the src/dest page sizes are mismatched. But that algorithm can be reviewed at that time, not now. > > > > This will also cause kvm_gmem_populate() to allocate 1 more src_npages than > > npages for dst pages. > > That's more of a decision on the part of userspace deciding to use > non-page-aligned 'src' pointer to begin with. IIRC this is where I think there might be an issue with the code. The code used PAGE_SIZE for the memcpy()s. Is it clear that user space must have a buffer >= PAGE_SIZE when src_offset != 0? I did not see that check, and/or it was not clear to me how that works. [snip] > > > > > > > > > > This was necessarily chosen in prep for hugepages, but more about my > > > > > unease at letting userspace GUP arbitrarilly large ranges. PMD_ORDER > > > > > happens to align with 2MB hugepages while seeming like a reasonable > > > > > batching value so that's why I chose it. > > > > > > > > > > Even with 1GB support, I wasn't really planning to increase it. SNP > > > > > doesn't really make use of RMP sizes >2MB, and it sounds like TDX > > > > > handles promotion in a completely different path. So atm I'm leaning > > > > > toward just letting GMEM_GUP_NPAGES be the cap for the max page size we > > > > > support for kvm_gmem_populate() path and not bothering to change it > > > > > until a solid use-case arises. > > > > The problem is that with hugetlb-based guest_memfd, the folio itself could be > > > > of 1GB, though SNP and TDX can force mapping at only 4KB. > > > > > > If TDX wants to unload handling of page-clearing to its per-page > > > post-populate callback and tie that its shared/private tracking that's > > > perfectly fine by me. > > > > > > *How* TDX tells gmem it wants this different behavior is a topic for a > > > follow-up patchset, Vishal suggested kernel-internal flags to > > > kvm_gmem_create(), which seemed reasonable to me. In that case, uptodate > > Not sure which flag you are referring to with "Vishal suggested kernel-internal > > flags to kvm_gmem_create()". > > > > However, my point is that when the backend folio is 1GB in size (leading to > > max_order being PUD_ORDER), even if SNP only maps at 2MB to RMP, it may hit the > > warning of "!IS_ALIGNED(gfn, 1 << max_order)". > > I think I've had to remove that warning every time I start working on > some new spin of THP/hugetlbfs-based SNP. I'm not objecting to that. But it's > obvious there, in those contexts, and I can explain exactly why it's being > removed. > > It's not obvious in this series, where all we have are hand-wavy thoughts > about what hugepages will look like. For all we know we might decide that > kvm_gmem_populate() path should just pre-split hugepages to make all the > logic easier, or we decide to lock it in at 4K-only and just strip all the > hugepage stuff out. Yeah, don't do that. > I don't really know, and this doesn't seem like the place > to try to hash all that out when nothing in this series will cause this > existing WARN_ON to be tripped. Agreed. 
[snip] > > > > > > but it makes a lot more sense to make those restrictions and changes in > > > the context of hugepage support, rather than this series which is trying > > > very hard to not do hugepage enablement, but simply keep what's partially > > > there intact while reworking other things that have proven to be > > > continued impediments to both in-place conversion and hugepage > > > enablement. > > Not sure how fixing the warning in this series could impede hugepage enabling :) > > > > But if you prefer, I don't mind keeping it locally for longer. > > It's the whole burden of needing to anticipate hugepage design, while it > is in a state of potentially massive flux just before LPC, in order to > make tiny incremental progress toward enabling in-place conversion, > which is something I think we can get upstream much sooner. Look at your > changelog for the change above, for instance: it has no relevance in the > context of this series. What do I put in its place? Bug reports about > my experimental tree? It's just not the right place to try to justify > these changes. > > And most of this weirdness stems from the fact that we prematurely added > partial hugepage enablement to begin with. Let's not repeat these mistakes, > and address changes in the proper context where we know they make sense. > > I considered stripping out the existing hugepage support as a pre-patch > to avoid leaving these uncertainties in place while we are reworking > things, but it felt like needless churn. But that's where I'm coming > from with this series: let's get in-place conversion landed, get the API > fleshed out, get it working, and then re-assess hugepages with all these > common/intersecting bits out of the way. If we can remove some obstacles > for hugepages as part of that, great, but that is not the main intent > here. I'd like to second what Mike is saying here. The entire discussion about hugepage support is premature for this series. Ira [snip] ^ permalink raw reply [flat|nested] 35+ messages in thread
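(For context, the simplification Yan floated earlier and Ira argues for here, i.e. requiring a page-aligned source, would amount to an extra check in snp_launch_update() along these lines. This is only a sketch, not part of the posted series, and it assumes the existing 'params' naming in that handler.)

	/* Sketch: reject a non-page-aligned source buffer up front so that
	 * src_offset is always 0 and the straddling copies go away.
	 */
	if (params.uaddr && !PAGE_ALIGNED(params.uaddr))
		return -EINVAL;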
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-12-03 21:01 ` Ira Weiny @ 2025-12-03 23:07 ` Michael Roth 0 siblings, 0 replies; 35+ messages in thread From: Michael Roth @ 2025-12-03 23:07 UTC (permalink / raw) To: Ira Weiny Cc: Yan Zhao, kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik On Wed, Dec 03, 2025 at 03:01:24PM -0600, Ira Weiny wrote: > Michael Roth wrote: > > On Wed, Dec 03, 2025 at 10:46:27AM +0800, Yan Zhao wrote: > > > On Mon, Dec 01, 2025 at 04:13:55PM -0600, Michael Roth wrote: > > > > On Mon, Nov 24, 2025 at 05:31:46PM +0800, Yan Zhao wrote: > > > > > On Fri, Nov 21, 2025 at 07:01:44AM -0600, Michael Roth wrote: > > > > > > On Thu, Nov 20, 2025 at 05:11:48PM +0800, Yan Zhao wrote: > > > > > > > On Thu, Nov 13, 2025 at 05:07:59PM -0600, Michael Roth wrote: > > [snip] > > > > > > > > > --- > > > > > > > > arch/x86/kvm/svm/sev.c | 40 ++++++++++++++++++++++++++------------ > > > > > > > > arch/x86/kvm/vmx/tdx.c | 21 +++++--------------- > > > > > > > > include/linux/kvm_host.h | 3 ++- > > > > > > > > virt/kvm/guest_memfd.c | 42 ++++++++++++++++++++++++++++++++++------ > > > > > > > > 4 files changed, 71 insertions(+), 35 deletions(-) > > > > > > > > > > > > > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c > > > > > > > > index 0835c664fbfd..d0ac710697a2 100644 > > > > > > > > --- a/arch/x86/kvm/svm/sev.c > > > > > > > > +++ b/arch/x86/kvm/svm/sev.c > > > > > > > > @@ -2260,7 +2260,8 @@ struct sev_gmem_populate_args { > > > > > > > > }; > > > > > > > > > > > > > > > > static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn, > > > > > > > > - void __user *src, int order, void *opaque) > > > > > > > > + struct page **src_pages, loff_t src_offset, > > > > > > > > + int order, void *opaque) > > > > > > > > { > > > > > > > > struct sev_gmem_populate_args *sev_populate_args = opaque; > > > > > > > > struct kvm_sev_info *sev = to_kvm_sev_info(kvm); > > > > > > > > @@ -2268,7 +2269,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > > > > > int npages = (1 << order); > > > > > > > > gfn_t gfn; > > > > > > > > > > > > > > > > - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src)) > > > > > > > > + if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_pages)) > > > > > > > > return -EINVAL; > > > > > > > > > > > > > > > > for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) { > > > > > > > > @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > > > > > > > > goto err; > > > > > > > > } > > > > > > > > > > > > > > > > - if (src) { > > > > > > > > - void *vaddr = kmap_local_pfn(pfn + i); > > > > > > > > + if (src_pages) { > > > > > > > > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > > > > > > > > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > > > > > > > > > > > > > > > - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { > > > > > > > > - ret = -EFAULT; > > > > > > > > - goto err; > > > > > > > > + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset); > > > > > > > > + kunmap_local(src_vaddr); > > > > > > > > + > > > > > > > > + if (src_offset) { > > > > > > > > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > > > > > > > > + > > > > > > > > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, 
src_vaddr, src_offset); > > > > > > > > + kunmap_local(src_vaddr); > > > > > > > IIUC, src_offset is the src's offset from the first page. e.g., > > > > > > > src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512. > > > > > > > > > > > > > > Then it looks like the two memcpy() calls here only work when npages == 1 ? > > > > > > > > > > > > src_offset ends up being the offset into the pair of src pages that we > > > > > > are using to fully populate a single dest page with each iteration. So > > > > > > if we start at src_offset, read a page worth of data, then we are now at > > > > > > src_offset in the next src page and the loop continues that way even if > > > > > > npages > 1. > > > > > > > > > > > > If src_offset is 0 we never have to bother with straddling 2 src pages so > > > > > > the 2nd memcpy is skipped on every iteration. > > > > > > > > > > > > That's the intent at least. Is there a flaw in the code/reasoning that I > > > > > > missed? > > > > > Oh, I got you. SNP expects a single src_offset applies for each src page. > > > > > > > > > > So if npages = 2, there're 4 memcpy() calls. > > > > > > > > > > src: |---------|---------|---------| (VA contiguous) > > > > > ^ ^ ^ > > > > > | | | > > > > > dst: |---------|---------| (PA contiguous) > > > > > > > > > > > > > > > I previously incorrectly thought kvm_gmem_populate() should pass in src_offset > > > > > as 0 for the 2nd src page. > > > > > > > > > > Would you consider checking if params.uaddr is PAGE_ALIGNED() in > > > > > snp_launch_update() to simplify the design? > > > > > > > > This was an option mentioned in the cover letter and during PUCK. I am > > > > not opposed if that's the direction we decide, but I also don't think > > > > it makes big difference since: > > > > > > > > int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > > > struct page **src_pages, loff_t src_offset, > > > > int order, void *opaque); > > > > > > > > basically reduces to Sean's originally proposed: > > > > > > > > int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > > > struct page *src_pages, int order, > > > > void *opaque); > > > > > > Hmm, the requirement of having each copy to dst_page account for src_offset > > > (which actually results in 2 copies) is quite confusing. I initially thought the > > > src_offset only applied to the first dst_page. > > > > What I'm wondering though is if I'd done a better job of documenting > > this aspect, e.g. with the following comment added above > > kvm_gmem_populate_cb: > > > > /* > > * ... > > * 'src_pages': array of GUP'd struct page pointers corresponding to > > * the pages that store the data that is to be copied > > * into the HPA corresponding to 'pfn' > > * 'src_offset': byte offset, relative to the first page in the array > > * of pages pointed to by 'src_pages', to begin copying > > * the data from. > > * > > * NOTE: if the caller of kvm_gmem_populate() enforces that 'src' is > > * page-aligned, then 'src_offset' will always be zero, and src_pages > > * will contain only 1 page to copy from, beginning at byte offset 0. > > * In this case, callers can assume src_offset is 0. > > */ > > int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > > struct page **src_pages, loff_t src_offset, > > int order, void *opaque); > > > > could the confusion have been avoided, or is it still unwieldly? 
> > > > I don't mind that users like SNP need to deal with the extra bits, but > > I'm hoping for users like TDX it isn't too cludgy. > > FWIW I don't think the TDX code was a problem. I was trying to review the > SNP code for correctness and it was confusing enough that I was concerned > the investment is not worth the cost. I think it would only be worth it if we have some reasonable indication that someone is using SNP_LAUNCH_UPDATE with un-aligned 'uaddr'/'src' parameter, or anticipate that a future architecture would rely on such a thing. I don't *think* there is, but if that guess it wrong then someone out there will be very grumpy. I'm not sure what the threshold is for greenlighting a userspace API change like that though, so I'd prefer to weasel out of that responsibility by assuming we need to support non-page-aligned src until the maintainers tell me it's okay/warranted. :) > > I'll re-iterate that the in-place conversion _use_ _case_ requires user > space to keep the 'source' (ie the page) aligned because it is all getting > converted anyway. So I'm not seeing a good use case for supporting this. > But Vishal seemed to think there was so... I think Sean wanted to leave open the possibility of using a src that isn't necessarily the same page as the destination. In this series, it is actually not possible to use 'src' at all if the src/dst are the same, since that means that src would have been marked with KVM_MEMORY_ATTRIBUTE_PRIVATE in advance of calling kvm_gmem_populate(), in which case GUP would trigger the SIGBUS handling in kvm_gmem_fault_user_mapping(). But I consider that a feature, since it's more efficient to let userspace initialize it in advance, prior to marking it as PRIVATE and calling whatever ioctl triggers kvm_gmem_populate(), and it gets naturally enforced with that existing checks in kvm_gmem_populate(). So, if src==dst, userspace would need to pass src==0 > > Given this potential use case; the above comment is more clear. > > FWIW, I think this is going to get even more complex if the src/dest page > sizes are miss-matched. But that algorithm can be reviewed at that time, > not now. > > > > > > > This will also cause kvm_gmem_populate() to allocate 1 more src_npages than > > > npages for dst pages. > > > > That's more of a decision on the part of userspace deciding to use > > non-page-aligned 'src' pointer to begin with. > > IIRC this is where I think there might be an issue with the code. The > code used PAGE_SIZE for the memcpy's. Is it clear that user space must > have a buffer >= PAGE_SIZE when src_offset != 0? > > I did not see that check; and/or I was not clear how that works. Yes, for SNP_LAUNCH_UPDATE at least, it is documented that the 'len' must be non-0 and aligned to 4K increments, and that's enforced in snp_launch_update() handler. I don't quite remember why we didn't just make it a 'npages' argument but I remember there being some reasoning for that. > > [snip] > > > > > > > > > > > > > This was necessarily chosen in prep for hugepages, but more about my > > > > > > unease at letting userspace GUP arbitrarilly large ranges. PMD_ORDER > > > > > > happens to align with 2MB hugepages while seeming like a reasonable > > > > > > batching value so that's why I chose it. > > > > > > > > > > > > Even with 1GB support, I wasn't really planning to increase it. SNP > > > > > > doesn't really make use of RMP sizes >2MB, and it sounds like TDX > > > > > > handles promotion in a completely different path. 
So atm I'm leaning > > > > > > toward just letting GMEM_GUP_NPAGES be the cap for the max page size we > > > > > > support for kvm_gmem_populate() path and not bothering to change it > > > > > > until a solid use-case arises. > > > > > The problem is that with hugetlb-based guest_memfd, the folio itself could be > > > > > of 1GB, though SNP and TDX can force mapping at only 4KB. > > > > > > > > If TDX wants to unload handling of page-clearing to its per-page > > > > post-populate callback and tie that its shared/private tracking that's > > > > perfectly fine by me. > > > > > > > > *How* TDX tells gmem it wants this different behavior is a topic for a > > > > follow-up patchset, Vishal suggested kernel-internal flags to > > > > kvm_gmem_create(), which seemed reasonable to me. In that case, uptodate > > > Not sure which flag you are referring to with "Vishal suggested kernel-internal > > > flags to kvm_gmem_create()". > > > > > > However, my point is that when the backend folio is 1GB in size (leading to > > > max_order being PUD_ORDER), even if SNP only maps at 2MB to RMP, it may hit the > > > warning of "!IS_ALIGNED(gfn, 1 << max_order)". > > > > I think I've had to remove that warning every time I start working on > > some new spin of THP/hugetlbfs-based SNP. I'm not objecting to that. But it's > > obvious there, in those contexts, and I can explain exactly why it's being > > removed. > > > > It's not obvious in this series, where all we have are hand-wavy thoughts > > about what hugepages will look like. For all we know we might decide that > > kvm_gmem_populate() path should just pre-split hugepages to make all the > > logic easier, or we decide to lock it in at 4K-only and just strip all the > > hugepage stuff out. > > Yea don't do that. > > > I don't really know, and this doesn't seem like the place > > to try to hash all that out when nothing in this series will cause this > > existing WARN_ON to be tripped. > > Agreed. > > > [snip] > > > > > > > > > > but it makes a lot more sense to make those restrictions and changes in > > > > the context of hugepage support, rather than this series which is trying > > > > very hard to not do hugepage enablement, but simply keep what's partially > > > > there intact while reworking other things that have proven to be > > > > continued impediments to both in-place conversion and hugepage > > > > enablement. > > > Not sure how fixing the warning in this series could impede hugepage enabling :) > > > > > > But if you prefer, I don't mind keeping it locally for longer. > > > > It's the whole burden of needing to anticipate hugepage design, while it > > is in a state of potentially massive flux just before LPC, in order to > > make tiny incremental progress toward enabling in-place conversion, > > which is something I think we can get upstream much sooner. Look at your > > changelog for the change above, for instance: it has no relevance in the > > context of this series. What do I put in its place? Bug reports about > > my experimental tree? It's just not the right place to try to justify > > these changes. > > > > And most of this weirdness stems from the fact that we prematurely added > > partial hugepage enablement to begin with. Let's not repeat these mistakes, > > and address changes in the proper context where we know they make sense. > > > > I considered stripping out the existing hugepage support as a pre-patch > > to avoid leaving these uncertainties in place while we are reworking > > things, but it felt like needless churn. 
But that's where I'm coming > > from with this series: let's get in-place conversion landed, get the API > > fleshed out, get it working, and then re-assess hugepages with all these > > common/intersecting bits out of the way. If we can remove some obstacles > > for hugepages as part of that, great, but that is not the main intent > > here. > > I'd like to second what Mike is saying here. The entire discussion about > hugepage support is premature for this series. Yah, maybe a clean slate, removing the existing hugepage bits as Vishal is suggesting, is the best way to free ourselves to address these things incrementally without the historical baggage. -Mike > > Ira > > [snip] ^ permalink raw reply [flat|nested] 35+ messages in thread
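(The SNP_LAUNCH_UPDATE validation Michael refers to is roughly the following, paraphrased from the snp_launch_update() handler; the exact form in the tree may differ. It is what guarantees the source buffer always covers whole 4K increments, even when 'uaddr' itself is not page-aligned.)

	/* 'len' must be non-zero and a multiple of 4K, so a populate request
	 * always spans whole destination pages.
	 */
	if (!params.len || !IS_ALIGNED(params.len, PAGE_SIZE))
		return -EINVAL;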
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory
  2025-12-03 14:26 ` Michael Roth
  2025-12-03 20:59 ` FirstName LastName
  2025-12-03 21:01 ` Ira Weiny
@ 2025-12-05 3:38 ` Yan Zhao
  2 siblings, 0 replies; 35+ messages in thread
From: Yan Zhao @ 2025-12-05 3:38 UTC (permalink / raw)
To: Michael Roth
Cc: kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini,
	seanjc, vbabka, ashish.kalra, liam.merwick, david, Annapurve, Vishal,
	ackerleytng, aik, Weiny, Ira

On Wed, Dec 03, 2025 at 10:26:48PM +0800, Michael Roth wrote:
> Look at your
> changelog for the change above, for instance: it has no relevance in the
> context of this series. What do I put in its place? Bug reports about
> my experimental tree? It's just not the right place to try to justify
> these changes.
The following diff is reasonable for this series (if npages is up to 2MB):

--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -878,11 +878,10 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 		}
 		folio_unlock(folio);
-		WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) ||
-			(npages - i) < (1 << max_order));
 		ret = -EINVAL;
-		while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order),
+		while (!IS_ALIGNED(gfn, 1 << max_order) || (npages - i) < (1 << max_order) ||
+		       !kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order),
							 KVM_MEMORY_ATTRIBUTE_PRIVATE,
							 KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
 			if (!max_order)

because:
1. kmalloc_array() + GUP 2MB src pages + returning -ENOMEM in "Hunk 1" is a
   waste if max_order is always 0.
2. If we allow max_order > 0, then we must remove the WARN_ON().
3. When start_gfn is not 2MB aligned, just allocating a 4KB src page each
   round is enough (as in Sean's sketch patch).

Hunk 1:
-------------------------------------------------------------------
	src_npages = IS_ALIGNED((unsigned long)src, PAGE_SIZE) ? npages : npages + 1;

	src_pages = kmalloc_array(src_npages, sizeof(struct page *), GFP_KERNEL);
	if (!src_pages)
		return -ENOMEM;

	ret = get_user_pages_fast((unsigned long)src, src_npages, 0, src_pages);
	if (ret < 0)
		return ret;

	if (ret != src_npages)
		return -ENOMEM;

Hunk 2:
-------------------------------------------------------------------
	for (i = 0; i < npages; i += (1 << max_order)) {
		...
		folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order);

		WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) ||
			(npages - i) < (1 << max_order));

		ret = post_populate(kvm, gfn, pfn, src ? &src_pages[i] : NULL,
				    src_offset, max_order, opaque);
		...
	}

^ permalink raw reply	[flat|nested] 35+ messages in thread
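(To put numbers on the cost called out in point 1 above, the earlier example from this thread works through as follows; this is an illustration only, reusing the names from Hunk 1.)

	/* Illustration: a non-page-aligned src pins one extra source page. */
	void __user *src = (void __user *)0x7fea82684100;	/* src_offset = 0x100 */
	long npages = 512;					/* 2MB worth of destination pages */
	long src_npages = IS_ALIGNED((unsigned long)src, PAGE_SIZE) ? npages : npages + 1;

	/*
	 * src_npages == 513: bytes [0x100, 4K) of source page 0 plus bytes
	 * [0, 0x100) of source page 512 round out the 512 destination pages.
	 */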
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-11-21 13:01 ` Michael Roth 2025-11-24 9:31 ` Yan Zhao @ 2025-12-01 1:44 ` Vishal Annapurve 2025-12-03 23:48 ` Michael Roth 1 sibling, 1 reply; 35+ messages in thread From: Vishal Annapurve @ 2025-12-01 1:44 UTC (permalink / raw) To: Michael Roth Cc: Yan Zhao, kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, ackerleytng, aik, ira.weiny On Fri, Nov 21, 2025 at 5:02 AM Michael Roth <michael.roth@amd.com> wrote: > > > > > Increasing GMEM_GUP_NPAGES to (1UL << PUD_ORDER) is probabaly not a good idea. > > > > Given both TDX/SNP map at 4KB granularity, why not just invoke post_populate() > > per 4KB while removing the max_order from post_populate() parameters, as done > > in Sean's sketch patch [1]? > > That's an option too, but SNP can make use of 2MB pages in the > post-populate callback so I don't want to shut the door on that option > just yet if it's not too much of a pain to work in. Given the guest BIOS > lives primarily in 1 or 2 of these 2MB regions the benefits might be > worthwhile, and SNP doesn't have a post-post-populate promotion path > like TDX (at least, not one that would help much for guest boot times) Given the small initial payload size, do you really think optimizing for setting up huge page-aligned RMP entries is worthwhile? The code becomes somewhat complex when trying to get this scenario working and IIUC it depends on userspace-passed initial payload regions aligning to the huge page size. What happens if userspace tries to trigger snp_launch_update() for two unaligned regions within the same huge page? What Sean suggested earlier[1] seems relatively simpler to maintain. [1] https://lore.kernel.org/kvm/aHEwT4X0RcfZzHlt@google.com/ > > Thanks, > > Mike ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-12-01 1:44 ` Vishal Annapurve @ 2025-12-03 23:48 ` Michael Roth 0 siblings, 0 replies; 35+ messages in thread From: Michael Roth @ 2025-12-03 23:48 UTC (permalink / raw) To: Vishal Annapurve Cc: Yan Zhao, kvm, linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, ackerleytng, aik, ira.weiny On Sun, Nov 30, 2025 at 05:44:31PM -0800, Vishal Annapurve wrote: > On Fri, Nov 21, 2025 at 5:02 AM Michael Roth <michael.roth@amd.com> wrote: > > > > > > > > Increasing GMEM_GUP_NPAGES to (1UL << PUD_ORDER) is probabaly not a good idea. > > > > > > Given both TDX/SNP map at 4KB granularity, why not just invoke post_populate() > > > per 4KB while removing the max_order from post_populate() parameters, as done > > > in Sean's sketch patch [1]? > > > > That's an option too, but SNP can make use of 2MB pages in the > > post-populate callback so I don't want to shut the door on that option > > just yet if it's not too much of a pain to work in. Given the guest BIOS > > lives primarily in 1 or 2 of these 2MB regions the benefits might be > > worthwhile, and SNP doesn't have a post-post-populate promotion path > > like TDX (at least, not one that would help much for guest boot times) > > Given the small initial payload size, do you really think optimizing > for setting up huge page-aligned RMP entries is worthwhile? Missed this reply earlier. It could be, but would probably be worthwhile to do some performance analysis before considering that so we can simplify in the meantime. > The code becomes somewhat complex when trying to get this scenario > working and IIUC it depends on userspace-passed initial payload > regions aligning to the huge page size. What happens if userspace > tries to trigger snp_launch_update() for two unaligned regions within > the same huge page? We'd need to gate the order that we pass to post-populate callback according to each individual call. For 2MB pages we'd end up with 4K behavior. For 1GB pages, there's some potential of using 2MB order for each region if gpa/dst/len are aligned, but without the buddy-style 1G->2M-4K splitting stuff, we'd likely need to split to 4K at some point and then the 2MB RMP entry would get PSMASH'd to 4K anyway. But maybe the 1GB could remain intact for long enough to get through a decent portion of OVMF boot before we end up creating a mixed range... not sure. But yes, this also seems like functionality that's premature to prep for, so just locking it in at 4K is probably best for now. -Mike > > What Sean suggested earlier[1] seems relatively simpler to maintain. > > [1] https://lore.kernel.org/kvm/aHEwT4X0RcfZzHlt@google.com/ > > > > > Thanks, > > > > Mike ^ permalink raw reply [flat|nested] 35+ messages in thread
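(The per-call gating described above might look roughly like the following. This is a hypothetical helper, not something from the posted series: clamp the order handed to the post-populate callback until the gfn, the source alignment, and the remaining length all line up, falling back to 4K otherwise.)

	/* Hypothetical sketch: pick the largest order the current call can use. */
	static int gmem_populate_order(gfn_t gfn, loff_t src_offset,
				       long remaining_pages, int max_order)
	{
		int order = max_order;

		while (order &&
		       (!IS_ALIGNED(gfn, 1UL << order) ||
			src_offset ||			/* unaligned src forces 4K */
			remaining_pages < (1L << order)))
			order--;

		return order;
	}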
* Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory 2025-11-13 23:07 ` [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory Michael Roth 2025-11-20 9:11 ` Yan Zhao @ 2025-11-20 19:34 ` Ira Weiny 1 sibling, 0 replies; 35+ messages in thread From: Ira Weiny @ 2025-11-20 19:34 UTC (permalink / raw) To: Michael Roth, kvm Cc: linux-coco, linux-mm, linux-kernel, thomas.lendacky, pbonzini, seanjc, vbabka, ashish.kalra, liam.merwick, david, vannapurve, ackerleytng, aik, ira.weiny, yan.y.zhao Michael Roth wrote: > Currently the post-populate callbacks handle copying source pages into > private GPA ranges backed by guest_memfd, where kvm_gmem_populate() > acquires the filemap invalidate lock, then calls a post-populate > callback which may issue a get_user_pages() on the source pages prior to > copying them into the private GPA (e.g. TDX). > > This will not be compatible with in-place conversion, where the > userspace page fault path will attempt to acquire filemap invalidate > lock while holding the mm->mmap_lock, leading to a potential ABBA > deadlock[1]. > > Address this by hoisting the GUP above the filemap invalidate lock so > that these page faults path can be taken early, prior to acquiring the > filemap invalidate lock. > > It's not currently clear whether this issue is reachable with the > current implementation of guest_memfd, which doesn't support in-place > conversion, however it does provide a consistent mechanism to provide > stable source/target PFNs to callbacks rather than punting to > vendor-specific code, which allows for more commonality across > architectures, which may be worthwhile even without in-place conversion. After thinking on the alignment issue: In the direction we are going, in-place conversion, I'm struggling to see why keeping the complexity of allowing a miss-aligned src pointer for the data (which BTW seems to require at least an aligned size to (x * PAGE_SIZE to not leak data?) is valuable. Once in-place is complete the entire page needs to be converted to private and so it seems keeping that alignment just makes things cleaner without really restricting any known use cases. General comments below. [snip] > @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > goto err; > } > > - if (src) { > - void *vaddr = kmap_local_pfn(pfn + i); > + if (src_pages) { > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) { > - ret = -EFAULT; > - goto err; > + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset); > + kunmap_local(src_vaddr); > + > + if (src_offset) { > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > + > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset); ^^^^^^^^^^ PAGE_SIZE - src_offset? 
> + kunmap_local(src_vaddr); > } > - kunmap_local(vaddr); > + > + kunmap_local(dst_vaddr); > } > > ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K, > @@ -2331,12 +2339,20 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf > if (!snp_page_reclaim(kvm, pfn + i) && > sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID && > sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) { > - void *vaddr = kmap_local_pfn(pfn + i); > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i])); > + void *dst_vaddr = kmap_local_pfn(pfn + i); > > - if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE)) > - pr_debug("Failed to write CPUID page back to userspace\n"); > + memcpy(src_vaddr + src_offset, dst_vaddr, PAGE_SIZE - src_offset); > + kunmap_local(src_vaddr); > + > + if (src_offset) { > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1])); > + > + memcpy(src_vaddr, dst_vaddr + PAGE_SIZE - src_offset, src_offset); ^^^^^^^^^^ PAGE_SIZE - src_offset? > + kunmap_local(src_vaddr); > + } > > - kunmap_local(vaddr); > + kunmap_local(dst_vaddr); > } > > /* pfn + i is hypervisor-owned now, so skip below cleanup for it. */ > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index 57ed101a1181..dd5439ec1473 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -3115,37 +3115,26 @@ struct tdx_gmem_post_populate_arg { > }; > > static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, > - void __user *src, int order, void *_arg) > + struct page **src_pages, loff_t src_offset, > + int order, void *_arg) > { > struct tdx_gmem_post_populate_arg *arg = _arg; > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > u64 err, entry, level_state; > gpa_t gpa = gfn_to_gpa(gfn); > - struct page *src_page; > int ret, i; > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) > return -EIO; > > - if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm)) > + /* Source should be page-aligned, in which case src_offset will be 0. */ > + if (KVM_BUG_ON(src_offset)) This failed to compile, need the kvm parameter in the macro. > return -EINVAL; > > - /* > - * Get the source page if it has been faulted in. Return failure if the > - * source page has been swapped out or unmapped in primary memory. > - */ > - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page); > - if (ret < 0) > - return ret; > - if (ret != 1) > - return -ENOMEM; > - > - kvm_tdx->page_add_src = src_page; > + kvm_tdx->page_add_src = src_pages[i]; i is uninitialized here. src_pages[0] should be fine. All the src_offset stuff in the rest of the patch would just go away if we made that restriction for SNP. You mentioned there was not a real use case for it. Also technically I think TDX _could_ do the same thing SNP populate is doing. The kernel could map the buffer do the offset copy to the destination page and do the in-place encryption. But I've not tried it because I really think this is more effort than the whole kernel needs to do. If a use case becomes necessary it may be better to have that in the gmem core once TDX is tested. Thoughts? Ira [snip] ^ permalink raw reply [flat|nested] 35+ messages in thread
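(Taken together, the two fixups Ira flags would leave the TDX hunk looking roughly like this; a sketch against the patch above, not a new proposal.)

	/* KVM_BUG_ON() needs the vm as its second argument. */
	if (KVM_BUG_ON(src_offset, kvm))
		return -EINVAL;

	/* 'i' is not initialized at this point; the single source page for
	 * this 4K populate is src_pages[0].
	 */
	kvm_tdx->page_add_src = src_pages[0];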
End of thread.

Thread overview: 35+ messages

2025-11-13 23:07 [PATCH RFC 0/3] KVM: guest_memfd: Rework preparation/population flows in prep for in-place conversion Michael Roth
2025-11-13 23:07 ` [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking Michael Roth
2025-11-17 23:58 ` Ackerley Tng
2025-11-19 0:18 ` Michael Roth
2025-11-20 9:12 ` Yan Zhao
2025-11-21 12:43 ` Michael Roth
2025-11-25 3:13 ` Yan Zhao
2025-12-01 1:35 ` Vishal Annapurve
2025-12-01 2:51 ` Yan Zhao
2025-12-01 19:33 ` Vishal Annapurve
2025-12-02 9:16 ` Yan Zhao
2025-12-01 23:44 ` Michael Roth
2025-12-02 9:17 ` Yan Zhao
2025-12-03 13:47 ` Michael Roth
2025-12-05 3:54 ` Yan Zhao
2025-11-13 23:07 ` [PATCH 2/3] KVM: TDX: Document alignment requirements for KVM_TDX_INIT_MEM_REGION Michael Roth
2025-11-13 23:07 ` [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory Michael Roth
2025-11-20 9:11 ` Yan Zhao
2025-11-21 13:01 ` Michael Roth
2025-11-24 9:31 ` Yan Zhao
2025-11-24 15:53 ` Ira Weiny
2025-11-25 3:12 ` Yan Zhao
2025-12-01 1:47 ` Vishal Annapurve
2025-12-01 21:03 ` Michael Roth
2025-12-01 22:13 ` Michael Roth
2025-12-03 2:46 ` Yan Zhao
2025-12-03 14:26 ` Michael Roth
2025-12-03 20:59 ` FirstName LastName
2025-12-03 23:12 ` Michael Roth
2025-12-03 21:01 ` Ira Weiny
2025-12-03 23:07 ` Michael Roth
2025-12-05 3:38 ` Yan Zhao
2025-12-01 1:44 ` Vishal Annapurve
2025-12-03 23:48 ` Michael Roth
2025-11-20 19:34 ` Ira Weiny