From: Yan Zhao <yan.y.zhao@intel.com>
To: Michael Roth <michael.roth@amd.com>
Cc: <kvm@vger.kernel.org>, <linux-coco@lists.linux.dev>,
<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<thomas.lendacky@amd.com>, <pbonzini@redhat.com>,
<seanjc@google.com>, <vbabka@suse.cz>, <ashish.kalra@amd.com>,
<liam.merwick@oracle.com>, <david@redhat.com>,
<vannapurve@google.com>, <ackerleytng@google.com>, <aik@amd.com>,
<ira.weiny@intel.com>
Subject: Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking
Date: Tue, 25 Nov 2025 11:13:25 +0800
Message-ID: <aSUe1UfD3hXg2iMZ@yzhao56-desk.sh.intel.com>
In-Reply-To: <20251121124314.36zlpzhwm5zglih2@amd.com>
On Fri, Nov 21, 2025 at 06:43:14AM -0600, Michael Roth wrote:
> On Thu, Nov 20, 2025 at 05:12:55PM +0800, Yan Zhao wrote:
> > On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote:
> > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > {
> > > pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > > struct folio *folio;
> > > - bool is_prepared = false;
> > > int r = 0;
> > >
> > > CLASS(gmem_get_file, file)(slot);
> > > if (!file)
> > > return -EFAULT;
> > >
> > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
> > > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> > > if (IS_ERR(folio))
> > > return PTR_ERR(folio);
> > >
> > > - if (!is_prepared)
> > > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> > > + if (!folio_test_uptodate(folio)) {
> > > + unsigned long i, nr_pages = folio_nr_pages(folio);
> > > +
> > > + for (i = 0; i < nr_pages; i++)
> > > + clear_highpage(folio_page(folio, i));
> > > + folio_mark_uptodate(folio);
> > Here, the entire folio is cleared only when the folio is not marked uptodate.
> > Please also see my questions at the bottom.
>
> Yes, in this patch at least where I tried to mirror the current logic. I
> would not be surprised if we need to rework things for inplace/hugepage
> support though, but decoupling 'preparation' from the uptodate flag is
> the main goal here.
Could you elaborate a little on why the decoupling is needed if it's not for
hugepage support?
> > > + }
> > > +
> > > + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> > >
> > > folio_unlock(folio);
> > >
> > > @@ -852,7 +843,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > > struct folio *folio;
> > > gfn_t gfn = start_gfn + i;
> > > pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > > - bool is_prepared = false;
> > > kvm_pfn_t pfn;
> > >
> > > if (signal_pending(current)) {
> > > @@ -860,19 +850,12 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > > break;
> > > }
> > >
> > > - folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order);
> > > if (IS_ERR(folio)) {
> > > ret = PTR_ERR(folio);
> > > break;
> > > }
> > >
> > > - if (is_prepared) {
> > > - folio_unlock(folio);
> > > - folio_put(folio);
> > > - ret = -EEXIST;
> > > - break;
> > > - }
> > > -
> > > folio_unlock(folio);
> > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) ||
> > > (npages - i) < (1 << max_order));
> > TDX could hit this warning easily when npages == 1, max_order == 9.
>
> Yes, this will need to change to handle that. I don't think I had to
> change this for previous iterations of SNP hugepage support, but
> there are definitely cases where a sub-2M range might get populated
> even though it's backed by a 2M folio, so I'm not sure why I didn't
> hit it there.
>
> But I'm taking Sean's cue on touching as little of the existing
> hugepage logic as possible in this particular series so we can revisit
> the remaining changes with some better context.
Frankly, I don't understand why patch 1 is required if we only want "moving
GUP out of post_populate()" to work for 4KB folios.
> >
> > > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > > p = src ? src + i * PAGE_SIZE : NULL;
> > > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
> > > if (!ret)
> > > - kvm_gmem_mark_prepared(folio);
> > > + folio_mark_uptodate(folio);
> > As also asked in [1], why is the entire folio marked as uptodate here? Why does
> > kvm_gmem_get_pfn() clear all pages of a huge folio when the folio isn't marked
> > uptodate?
>
> Quoting your example from[1] for more context:
>
> > I also have a question about this patch:
> >
> > Suppose there's a 2MB huge folio A, where
> > A1 and A2 are 4KB pages belonging to folio A.
> >
> > (1) kvm_gmem_populate() invokes __kvm_gmem_get_pfn() and gets folio A.
> > It adds page A1 and invokes folio_mark_uptodate() on folio A.
>
> In SNP hugepage patchset you responded to, it would only mark A1 as
You mean the code in
https://github.com/amdese/linux/commits/snp-inplace-conversion-rfc1 ?
> prepared/cleared. There was 4K-granularity tracking added to handle this.
I can't find the code that marks only A1 as "prepared/cleared".
Instead, I only found that folio_mark_uptodate() is invoked by kvm_gmem_populate()
to mark the entire folio A as uptodate.
However, according to your statement below that the "uptodate flag only tracks
whether a folio has been cleared", I don't follow why and where the entire folio
A would be cleared if kvm_gmem_populate() only adds page A1.
> There was an odd subtlety in that series though: it was defaulting to the
> folio_order() for the prep-tracking/post-populate, but it would then clamp
> it down based on the max order possible according to whether that particular
> order was a homogeneous range of KVM_MEMORY_ATTRIBUTE_PRIVATE. Which is not
> a great way to handle things, and I don't remember if I'd actually intended
> to implement it that way or not... that's probably why I never tripped over
> the WARN_ON() above, now that I think of it.
>
> But neither of these applies to any current plans for hugepage support
> that I'm aware of, so probably not worth working through what that series
> did and look at this from a fresh perspective.
>
> >
> > (2) kvm_gmem_get_pfn() later faults in page A2.
> > As folio A is uptodate, clear_highpage() is not invoked on page A2.
> > kvm_gmem_prepare_folio() is invoked on the whole folio A.
> >
> > (2) could occur at least in TDX when only a part of the 2MB page is added as guest
> > initial memory.
> >
> > My questions:
> > - Would (2) occur on SEV?
> > - If it does, is the lack of clear_highpage() on A2 a problem ?
> > - Is invoking gmem_prepare on page A1 a problem?
>
> Assuming this patch goes upstream in some form, we will now have the
> following major differences versus previous code:
>
> 1) uptodate flag only tracks whether a folio has been cleared
> 2) gmem always calls kvm_arch_gmem_prepare() via kvm_gmem_get_pfn() and
> the architecture can handle its own tracking at whatever granularity
> it likes.
2) looks good to me.
> My hope is that 1) can similarly be done in such a way that gmem does not
> need to track things at sub-hugepage granularity, which would necessitate
> some new data structure/state/flag to track sub-page status.
I actually don't understand what the uptodate flag helps gmem track.
Why can't clear_highpage() be done inside arch-specific code? TDX doesn't need
this clearing after all.
> My understanding based on prior discussion in guest_memfd calls was that
> it would be okay to go ahead and clear the entire folio at initial allocation
> time, and basically never mess with it again. It was also my understanding
That's the part of this patch I don't follow.
I don't see where the entire folio A is cleared if it's only partially mapped by
kvm_gmem_populate(). kvm_gmem_get_pfn() won't clear folio A either, because
kvm_gmem_populate() has set the uptodate flag.
> that for TDX it might even be optimal to completely skip clearing the folio
> if it is getting mapped into SecureEPT as a hugepage since the TDX module
> would handle that, but conversely there might be some need to re-clear after
> a private->shared conversion... I'll try to find that discussion and
> refresh. Vishal, I believe, suggested some flags to provide more control over
> this behavior.
>
> >
> > It's possible (at least for TDX) that a huge folio is only partially populated
> > by kvm_gmem_populate(). Then kvm_gmem_get_pfn() faults in another part of the
> > huge folio. For example, in TDX, GFN 0x81f belongs to the init memory region,
> > while GFN 0x820 is faulted after TD is running. However, these two GFNs can
> > belong to the same folio of order 9.
>
> Would the above scheme of clearing the entire folio up front and not re-clearing
> at fault time work for this case?
This case doesn't affect TDX, because TDX clears private pages internally in
SEAM APIs. So, as long as kvm_gmem_get_pfn() does not invoke clear_highpage()
after making a folio private, it works fine for TDX.
I was just trying to understand why SNP needs the entire folio to be cleared in
kvm_gmem_get_pfn(), while I don't see how the entire folio gets cleared when
it's only partially mapped by kvm_gmem_populate().
Also, I'm wondering if it would be better for SNP to move the clearing of the
folio into something like kvm_arch_gmem_clear(), just like kvm_arch_gmem_prepare(),
which is always invoked by kvm_gmem_get_pfn() so that the architecture can
handle its own tracking at whatever granularity it likes.
> > Note: the current code should not impact TDX. I'm just asking out of curiosity:)
> >
> > [1] https://lore.kernel.org/all/aQ3uj4BZL6uFQzrD@yzhao56-desk.sh.intel.com/
> >
> >