From: Yan Zhao <yan.y.zhao@intel.com>
To: Michael Roth <michael.roth@amd.com>
Cc: <kvm@vger.kernel.org>, <linux-coco@lists.linux.dev>,
<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<thomas.lendacky@amd.com>, <pbonzini@redhat.com>,
<seanjc@google.com>, <vbabka@suse.cz>, <ashish.kalra@amd.com>,
<liam.merwick@oracle.com>, <david@redhat.com>,
<vannapurve@google.com>, <ackerleytng@google.com>, <aik@amd.com>,
<ira.weiny@intel.com>
Subject: Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking
Date: Tue, 25 Nov 2025 11:13:25 +0800
Message-ID: <aSUe1UfD3hXg2iMZ@yzhao56-desk.sh.intel.com>
In-Reply-To: <20251121124314.36zlpzhwm5zglih2@amd.com>
On Fri, Nov 21, 2025 at 06:43:14AM -0600, Michael Roth wrote:
> On Thu, Nov 20, 2025 at 05:12:55PM +0800, Yan Zhao wrote:
> > On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote:
> > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > {
> > > pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > > struct folio *folio;
> > > - bool is_prepared = false;
> > > int r = 0;
> > >
> > > CLASS(gmem_get_file, file)(slot);
> > > if (!file)
> > > return -EFAULT;
> > >
> > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
> > > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> > > if (IS_ERR(folio))
> > > return PTR_ERR(folio);
> > >
> > > - if (!is_prepared)
> > > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> > > + if (!folio_test_uptodate(folio)) {
> > > + unsigned long i, nr_pages = folio_nr_pages(folio);
> > > +
> > > + for (i = 0; i < nr_pages; i++)
> > > + clear_highpage(folio_page(folio, i));
> > > + folio_mark_uptodate(folio);
> > Here, the entire folio is cleared only when the folio is not marked uptodate.
> > Please also see my questions at the bottom.
>
> Yes, in this patch at least where I tried to mirror the current logic. I
> would not be surprised if we need to rework things for inplace/hugepage
> support though, but decoupling 'preparation' from the uptodate flag is
> the main goal here.
Could you elaborate a little on why the decoupling is needed if it's not for
hugepage support?
> > > + }
> > > +
> > > + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> > >
> > > folio_unlock(folio);
> > >
> > > @@ -852,7 +843,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > > struct folio *folio;
> > > gfn_t gfn = start_gfn + i;
> > > pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > > - bool is_prepared = false;
> > > kvm_pfn_t pfn;
> > >
> > > if (signal_pending(current)) {
> > > @@ -860,19 +850,12 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > > break;
> > > }
> > >
> > > - folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order);
> > > if (IS_ERR(folio)) {
> > > ret = PTR_ERR(folio);
> > > break;
> > > }
> > >
> > > - if (is_prepared) {
> > > - folio_unlock(folio);
> > > - folio_put(folio);
> > > - ret = -EEXIST;
> > > - break;
> > > - }
> > > -
> > > folio_unlock(folio);
> > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) ||
> > > (npages - i) < (1 << max_order));
> > TDX could hit this warning easily when npages == 1, max_order == 9.
>
> Yes, this will need to change to handle that. I don't think I had to
> change this for previous iterations of SNP hugepage support, but
> there are definitely cases where a sub-2M range might get populated
> even though it's backed by a 2M folio, so I'm not sure why I didn't
> hit it there.
>
> But I'm taking Sean's cue on touching as little of the existing
> hugepage logic as possible in this particular series so we can revisit
> the remaining changes with some better context.
Frankly, I don't understand why patch 1 is required if we only want "moving
GUP out of post_populate()" to work for 4KB folios.
> >
> > > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > > p = src ? src + i * PAGE_SIZE : NULL;
> > > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
> > > if (!ret)
> > > - kvm_gmem_mark_prepared(folio);
> > > + folio_mark_uptodate(folio);
> > As also asked in [1], why is the entire folio marked as uptodate here? Why does
> > kvm_gmem_get_pfn() clear all pages of a huge folio when the folio isn't marked
> > uptodate?
>
> Quoting your example from[1] for more context:
>
> > I also have a question about this patch:
> >
> > Suppose there's a 2MB huge folio A, where
> > A1 and A2 are 4KB pages belonging to folio A.
> >
> > (1) kvm_gmem_populate() invokes __kvm_gmem_get_pfn() and gets folio A.
> > It adds page A1 and invokes folio_mark_uptodate() on folio A.
>
> In SNP hugepage patchset you responded to, it would only mark A1 as
You mean the code in
https://github.com/amdese/linux/commits/snp-inplace-conversion-rfc1 ?
> prepared/cleared. There was 4K-granularity tracking added to handle this.
I can't find the code that marks only A1 as "prepared/cleared".
Instead, I only found that folio_mark_uptodate() is invoked by kvm_gmem_populate()
to mark the entire folio A as uptodate.
However, according to your statement below that the "uptodate flag only tracks
whether a folio has been cleared", I don't follow why and where the entire folio
A would be cleared if kvm_gmem_populate() only adds page A1.
> There was an odd subtlety in that series though: it was defaulting to the
> folio_order() for the prep-tracking/post-populate, but it would then clamp
> it down based on the max order possible according to whether that particular
> order was a homogeneous range of KVM_MEMORY_ATTRIBUTE_PRIVATE. Which is not
> a great way to handle things, and I don't remember if I'd actually intended
> to implement it that way or not... that's probably why I never tripped over
> the WARN_ON() above, now that I think of it.
>
> But neither of these applies to any current plans for hugepage support
> that I'm aware of, so probably not worth working through what that series
> did and look at this from a fresh perspective.
>
> >
> > (2) kvm_gmem_get_pfn() later faults in page A2.
> > As folio A is uptodate, clear_highpage() is not invoked on page A2.
> > kvm_gmem_prepare_folio() is invoked on the whole folio A.
> >
> > (2) could occur at least in TDX when only a part of the 2MB page is added as guest
> > initial memory.
> >
> > My questions:
> > - Would (2) occur on SEV?
> > - If it does, is the lack of clear_highpage() on A2 a problem ?
> > - Is invoking gmem_prepare on page A1 a problem?
>
> Assuming this patch goes upstream in some form, we will now have the
> following major differences versus previous code:
>
> 1) uptodate flag only tracks whether a folio has been cleared
> 2) gmem always calls kvm_arch_gmem_prepare() via kvm_gmem_get_pfn() and
> the architecture can handle its own tracking at whatever granularity
> it likes.
2) looks good to me.
> My hope is that 1) can similarly be done in such a way that gmem does not
> need to track things at sub-hugepage granularity, which would necessitate
> some new data structure/state/flag to track sub-page status.
I actually don't understand what the uptodate flag helps gmem track.
Why can't clear_highpage() be done inside arch-specific code? TDX doesn't need
this clearing after all.
> My understanding based on prior discussion in guest_memfd calls was that
> it would be okay to go ahead and clear the entire folio at initial allocation
> time, and basically never mess with it again. It was also my understanding
That's the part of this patch I don't follow.
I don't see where the entire folio A is cleared if it's only partially mapped by
kvm_gmem_populate(). kvm_gmem_get_pfn() won't clear folio A either, because
kvm_gmem_populate() has set the uptodate flag.
> that for TDX it might even be optimal to completely skip clearing the folio
> if it is getting mapped into SecureEPT as a hugepage since the TDX module
> would handle that, but conversely there might be some need to re-clear after
> a private->shared conversion... I'll try to find that discussion and
> refresh. Vishal, I believe, suggested some flags to provide more control over
> this behavior.
>
> >
> > It's possible (at least for TDX) that a huge folio is only partially populated
> > by kvm_gmem_populate(). Then kvm_gmem_get_pfn() faults in another part of the
> > huge folio. For example, in TDX, GFN 0x81f belongs to the init memory region,
> > while GFN 0x820 is faulted after TD is running. However, these two GFNs can
> > belong to the same folio of order 9.
>
> Would the above scheme of clearing the entire folio up front and not re-clearing
> at fault time work for this case?
This case doesn't affect TDX, because TDX clears private pages internally in
SEAM APIs. So, as long as kvm_gmem_get_pfn() does not invoke clear_highpage()
after making a folio private, it works fine for TDX.
I was just trying to understand why SNP needs the entire folio to be cleared in
kvm_gmem_get_pfn(), while I don't see how the entire folio gets cleared when
it's only partially mapped by kvm_gmem_populate().
Also, I'm wondering if it would be better for SNP to move the clearing of the
folio into something like kvm_arch_gmem_clear(), just like kvm_arch_gmem_prepare(),
which is always invoked by kvm_gmem_get_pfn() so that the architecture can
handle its own tracking at whatever granularity it likes.
> > Note: the current code should not impact TDX. I'm just asking out of curiosity:)
> >
> > [1] https://lore.kernel.org/all/aQ3uj4BZL6uFQzrD@yzhao56-desk.sh.intel.com/
> >
> >