From: Ira Weiny <ira.weiny@intel.com>
To: Michael Roth <michael.roth@amd.com>, Yan Zhao <yan.y.zhao@intel.com>
Cc: <kvm@vger.kernel.org>, <linux-coco@lists.linux.dev>,
	<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<thomas.lendacky@amd.com>, <pbonzini@redhat.com>,
	<seanjc@google.com>, <vbabka@suse.cz>, <ashish.kalra@amd.com>,
	<liam.merwick@oracle.com>, <david@redhat.com>,
	<vannapurve@google.com>, <ackerleytng@google.com>, <aik@amd.com>,
	<ira.weiny@intel.com>
Subject: Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory
Date: Wed, 3 Dec 2025 15:01:24 -0600
Message-ID: <6930a5242dd1b_307bf1002@iweiny-mobl.notmuch>
In-Reply-To: <20251203142648.trx6sslxvxr26yzd@amd.com>

Michael Roth wrote:
> On Wed, Dec 03, 2025 at 10:46:27AM +0800, Yan Zhao wrote:
> > On Mon, Dec 01, 2025 at 04:13:55PM -0600, Michael Roth wrote:
> > > On Mon, Nov 24, 2025 at 05:31:46PM +0800, Yan Zhao wrote:
> > > > On Fri, Nov 21, 2025 at 07:01:44AM -0600, Michael Roth wrote:
> > > > > On Thu, Nov 20, 2025 at 05:11:48PM +0800, Yan Zhao wrote:
> > > > > > On Thu, Nov 13, 2025 at 05:07:59PM -0600, Michael Roth wrote:

[snip]

> > > > > > > ---
> > > > > > >  arch/x86/kvm/svm/sev.c   | 40 ++++++++++++++++++++++++++------------
> > > > > > >  arch/x86/kvm/vmx/tdx.c   | 21 +++++---------------
> > > > > > >  include/linux/kvm_host.h |  3 ++-
> > > > > > >  virt/kvm/guest_memfd.c   | 42 ++++++++++++++++++++++++++++++++++------
> > > > > > >  4 files changed, 71 insertions(+), 35 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > > > > > > index 0835c664fbfd..d0ac710697a2 100644
> > > > > > > --- a/arch/x86/kvm/svm/sev.c
> > > > > > > +++ b/arch/x86/kvm/svm/sev.c
> > > > > > > @@ -2260,7 +2260,8 @@ struct sev_gmem_populate_args {
> > > > > > >  };
> > > > > > >  
> > > > > > >  static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn,
> > > > > > > -				  void __user *src, int order, void *opaque)
> > > > > > > +				  struct page **src_pages, loff_t src_offset,
> > > > > > > +				  int order, void *opaque)
> > > > > > >  {
> > > > > > >  	struct sev_gmem_populate_args *sev_populate_args = opaque;
> > > > > > >  	struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
> > > > > > > @@ -2268,7 +2269,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf
> > > > > > >  	int npages = (1 << order);
> > > > > > >  	gfn_t gfn;
> > > > > > >  
> > > > > > > -	if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src))
> > > > > > > +	if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_pages))
> > > > > > >  		return -EINVAL;
> > > > > > >  
> > > > > > >  	for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) {
> > > > > > > @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf
> > > > > > >  			goto err;
> > > > > > >  		}
> > > > > > >  
> > > > > > > -		if (src) {
> > > > > > > -			void *vaddr = kmap_local_pfn(pfn + i);
> > > > > > > +		if (src_pages) {
> > > > > > > +			void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i]));
> > > > > > > +			void *dst_vaddr = kmap_local_pfn(pfn + i);
> > > > > > >  
> > > > > > > -			if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) {
> > > > > > > -				ret = -EFAULT;
> > > > > > > -				goto err;
> > > > > > > +			memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset);
> > > > > > > +			kunmap_local(src_vaddr);
> > > > > > > +
> > > > > > > +			if (src_offset) {
> > > > > > > +				src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1]));
> > > > > > > +
> > > > > > > +				memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset);
> > > > > > > +				kunmap_local(src_vaddr);
> > > > > > IIUC, src_offset is src's offset into the first page. e.g.,
> > > > > > src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512.
> > > > > > 
> > > > > > Then it looks like the two memcpy() calls here only work when npages == 1 ?
> > > > > 
> > > > > src_offset ends up being the offset into the pair of src pages that we
> > > > > are using to fully populate a single dest page with each iteration. So
> > > > > if we start at src_offset, read a page worth of data, then we are now at
> > > > > src_offset in the next src page and the loop continues that way even if
> > > > > npages > 1.
> > > > > 
> > > > > If src_offset is 0 we never have to bother with straddling 2 src pages so
> > > > > the 2nd memcpy is skipped on every iteration.
> > > > > 
> > > > > That's the intent at least. Is there a flaw in the code/reasoning that I
> > > > > missed?
> > > > Oh, I got you. SNP expects a single src_offset to apply to each src page.
> > > > 
> > > > So if npages = 2, there are 4 memcpy() calls.
> > > > 
> > > > src:  |---------|---------|---------|  (VA contiguous)
> > > >           ^         ^         ^
> > > >           |         |         |
> > > > dst:      |---------|---------|   (PA contiguous)
> > > > 
> > > > 
> > > > I previously incorrectly thought kvm_gmem_populate() should pass in src_offset
> > > > as 0 for the 2nd src page.
> > > > 
> > > > Would you consider checking if params.uaddr is PAGE_ALIGNED() in
> > > > snp_launch_update() to simplify the design?
> > > 
> > > This was an option mentioned in the cover letter and during PUCK. I am
> > > not opposed if that's the direction we decide, but I also don't think
> > > it makes a big difference since:
> > > 
> > >    int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > >                                struct page **src_pages, loff_t src_offset,
> > >                                int order, void *opaque);
> > > 
> > > basically reduces to Sean's originally proposed:
> > > 
> > >    int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > >                                struct page *src_pages, int order,
> > >                                void *opaque);
> > 
> > Hmm, the requirement of having each copy to a dst_page account for src_offset
> > (which actually results in 2 copies per dst_page) is quite confusing. I
> > initially thought the src_offset only applied to the first dst_page.
> 
> What I'm wondering though is if I'd done a better job of documenting
> this aspect, e.g. with the following comment added above
> kvm_gmem_populate_cb:
> 
>   /*
>    * ...
>    * 'src_pages': array of GUP'd struct page pointers corresponding to
>    *              the pages that store the data that is to be copied
>    *              into the HPA corresponding to 'pfn'
>    * 'src_offset': byte offset, relative to the first page in the array
>    *               of pages pointed to by 'src_pages', to begin copying
>    *               the data from.
>    *
>    * NOTE: if the caller of kvm_gmem_populate() enforces that 'src' is
>    * page-aligned, then 'src_offset' will always be zero, and each
>    * destination page will be copied from exactly one src page,
>    * beginning at byte offset 0. In this case, callers can assume
>    * src_offset is 0.
>    */
>   int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>                               struct page **src_pages, loff_t src_offset,
>                               int order, void *opaque);
> 
> could the confusion have been avoided, or is it still unwieldy?
> 
> I don't mind that users like SNP need to deal with the extra bits, but
> I'm hoping that for users like TDX it isn't too kludgy.

FWIW I don't think the TDX code was a problem.  I was trying to review the
SNP code for correctness, and it was confusing enough that I was concerned
the added complexity is not worth the benefit.

I'll reiterate that the in-place conversion _use_ _case_ requires user
space to keep the 'source' (i.e. the page) aligned because it is all
getting converted anyway.  So I'm not seeing a good use case for
supporting this.  But Vishal seemed to think there was one, so...

Given this potential use case, the above comment is clearer.
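
For the record, here is how I eventually convinced myself the two
memcpy() calls line up (a minimal userspace restatement of the quoted
patch's copy arithmetic; the helper name is mine, not from the series):

  #include <string.h>

  #define PAGE_SIZE 4096UL

  /* Assemble one dst page from the tail of src page i plus, when
   * src_offset != 0, the head of src page i + 1. */
  static void copy_one_dst_page(unsigned char *dst,
                                const unsigned char *src_pg,
                                const unsigned char *next_src_pg,
                                unsigned long src_offset)
  {
          /* dst[0, PAGE_SIZE - src_offset) <- src_pg[src_offset, PAGE_SIZE) */
          memcpy(dst, src_pg + src_offset, PAGE_SIZE - src_offset);

          /* dst[PAGE_SIZE - src_offset, PAGE_SIZE) <- next_src_pg[0, src_offset) */
          if (src_offset)
                  memcpy(dst + PAGE_SIZE - src_offset, next_src_pg,
                         src_offset);
  }

With Yan's example of src = 0x7fea82684100 (src_offset = 0x100), dst
page i takes bytes [0x100, 0x1000) of src page i plus bytes [0x0, 0x100)
of src page i + 1, so the source reads stay contiguous across
iterations.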

FWIW, I think this is going to get even more complex if the src/dest page
sizes are mismatched.  But that algorithm can be reviewed at that time,
not now.

> > 
> > This will also cause kvm_gmem_populate() to GUP one more src page than
> > there are dst pages (src_npages = npages + 1).
> 
> That's more of a decision on the part of userspace deciding to use
> non-page-aligned 'src' pointer to begin with.

IIRC this is where I think there might be an issue with the code.  The
code uses PAGE_SIZE for the memcpy() calls.  Is it clear that user space
must have a buffer >= PAGE_SIZE when src_offset != 0?

I did not see that check, and it was not clear to me how that works.
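
To make that question concrete, my assumption (a sketch of what I was
looking for, not code from the series) is that the pinning side has to
cover the extra tail page whenever uaddr is unaligned:

  /* An unaligned uaddr makes the copy loop read into one extra source
   * page, so the pin must include it.  (Real code would also have to
   * unpin a partial GUP result.) */
  src_offset = offset_in_page(params.uaddr);
  src_npages = npages + !!src_offset;
  ret = get_user_pages_fast(params.uaddr & PAGE_MASK, src_npages,
                            0 /* read-only */, src_pages);
  if (ret != src_npages)
          return -EFAULT;

Done that way, a too-short user buffer fails the pin rather than the
memcpy(), but I did not find where that is guaranteed.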

[snip]

> > > > > 
> > > > > This wasn't necessarily chosen in prep for hugepages, but more out of my
> > > > > unease at letting userspace GUP arbitrarily large ranges. PMD_ORDER
> > > > > happens to align with 2MB hugepages while seeming like a reasonable
> > > > > batching value so that's why I chose it.
> > > > >
> > > > > Even with 1GB support, I wasn't really planning to increase it. SNP
> > > > > doesn't really make use of RMP sizes >2MB, and it sounds like TDX
> > > > > handles promotion in a completely different path. So atm I'm leaning
> > > > > toward just letting GMEM_GUP_NPAGES be the cap for the max page size we
> > > > > support in the kvm_gmem_populate() path and not bothering to change it
> > > > > until a solid use-case arises.
> > > > The problem is that with hugetlb-based guest_memfd, the folio itself could
> > > > be 1GB, even though SNP and TDX can force mappings at only 4KB.
> > > 
> > > If TDX wants to offload handling of page-clearing to its per-page
> > > post-populate callback and tie that to its shared/private tracking, that's
> > > perfectly fine by me.
> > > 
> > > *How* TDX tells gmem it wants this different behavior is a topic for a
> > > follow-up patchset; Vishal suggested kernel-internal flags to
> > > kvm_gmem_create(), which seemed reasonable to me. In that case, uptodate
> > Not sure which flag you are referring to with "Vishal suggested kernel-internal
> > flags to kvm_gmem_create()".
> > 
> > However, my point is that when the backend folio is 1GB in size (leading to
> > max_order being PUD_ORDER), even if SNP only maps at 2MB to RMP, it may hit the
> > warning of "!IS_ALIGNED(gfn, 1 << max_order)".
> 
> I think I've had to remove that warning every time I start working on
> some new spin of THP/hugetlbfs-based SNP. I'm not objecting to that. But it's
> obvious there, in those contexts, and I can explain exactly why it's being
> removed.
> 
> It's not obvious in this series, where all we have are hand-wavy thoughts
> about what hugepages will look like. For all we know we might decide that
> kvm_gmem_populate() path should just pre-split hugepages to make all the
> logic easier, or we decide to lock it in at 4K-only and just strip all the
> hugepage stuff out.

Yeah, don't do that.

> I don't really know, and this doesn't seem like the place
> to try to hash all that out when nothing in this series will cause this
> existing WARN_ON to be tripped.

Agreed.


[snip]

> 
> > 
> > > but it makes a lot more sense to make those restrictions and changes in
> > > the context of hugepage support, rather than this series which is trying
> > > very hard to not do hugepage enablement, but simply keep what's partially
> > > there intact while reworking other things that have proven to be
> > > continued impediments to both in-place conversion and hugepage
> > > enablement.
> > Not sure how fixing the warning in this series could impede hugepage enabling :)
> > 
> > But if you prefer, I don't mind keeping it locally for longer.
> 
> It's the whole burden of needing to anticipate hugepage design, while it
> is in a state of potentially massive flux just before LPC, in order to
> make tiny incremental progress toward enabling in-place conversion,
> which is something I think we can get upstream much sooner. Look at your
> changelog for the change above, for instance: it has no relevance in the
> context of this series. What do I put in its place? Bug reports about
> my experimental tree? It's just not the right place to try to justify
> these changes.
> 
> And most of this weirdness stems from the fact that we prematurely added
> partial hugepage enablement to begin with. Let's not repeat these mistakes,
> and address changes in the proper context where we know they make sense.
> 
> I considered stripping out the existing hugepage support as a pre-patch
> to avoid leaving these uncertainties in place while we are reworking
> things, but it felt like needless churn. But that's where I'm coming
> from with this series: let's get in-place conversion landed, get the API
> fleshed out, get it working, and then re-assess hugepages with all these
> common/intersecting bits out of the way. If we can remove some obstacles
> for hugepages as part of that, great, but that is not the main intent
> here.

I'd like to second what Mike is saying here.  The entire discussion about
hugepage support is premature for this series.

Ira

[snip]


Thread overview: 35+ messages
2025-11-13 23:07 [PATCH RFC 0/3] KVM: guest_memfd: Rework preparation/population flows in prep for in-place conversion Michael Roth
2025-11-13 23:07 ` [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking Michael Roth
2025-11-17 23:58   ` Ackerley Tng
2025-11-19  0:18     ` Michael Roth
2025-11-20  9:12   ` Yan Zhao
2025-11-21 12:43     ` Michael Roth
2025-11-25  3:13       ` Yan Zhao
2025-12-01  1:35         ` Vishal Annapurve
2025-12-01  2:51           ` Yan Zhao
2025-12-01 19:33             ` Vishal Annapurve
2025-12-02  9:16               ` Yan Zhao
2025-12-01 23:44         ` Michael Roth
2025-12-02  9:17           ` Yan Zhao
2025-12-03 13:47             ` Michael Roth
2025-12-05  3:54               ` Yan Zhao
2025-11-13 23:07 ` [PATCH 2/3] KVM: TDX: Document alignment requirements for KVM_TDX_INIT_MEM_REGION Michael Roth
2025-11-13 23:07 ` [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory Michael Roth
2025-11-20  9:11   ` Yan Zhao
2025-11-21 13:01     ` Michael Roth
2025-11-24  9:31       ` Yan Zhao
2025-11-24 15:53         ` Ira Weiny
2025-11-25  3:12           ` Yan Zhao
2025-12-01  1:47         ` Vishal Annapurve
2025-12-01 21:03           ` Michael Roth
2025-12-01 22:13         ` Michael Roth
2025-12-03  2:46           ` Yan Zhao
2025-12-03 14:26             ` Michael Roth
2025-12-03 20:59               ` FirstName LastName
2025-12-03 23:12                 ` Michael Roth
2025-12-03 21:01               ` Ira Weiny [this message]
2025-12-03 23:07                 ` Michael Roth
2025-12-05  3:38               ` Yan Zhao
2025-12-01  1:44       ` Vishal Annapurve
2025-12-03 23:48         ` Michael Roth
2025-11-20 19:34   ` Ira Weiny
