On Fri, Jul 18, 2025, Xiaoyao Li wrote:
> On 7/18/2025 12:27 AM, Fuad Tabba wrote:
> > From: Ackerley Tng
> >
> > Refactor kvm_max_private_mapping_level() to accept a NULL kvm_page_fault
> > pointer and rename it to kvm_gmem_max_mapping_level().
> >
> > The max_mapping_level x86 operation (previously private_max_mapping_level)
> > is designed to potentially be called without an active page fault, for
> > instance, when kvm_mmu_max_mapping_level() is determining the maximum
> > mapping level for a gfn proactively.
> >
> > Allow NULL fault pointer: Modify kvm_max_private_mapping_level() to
> > safely handle a NULL fault argument. This aligns its interface with the
> > kvm_x86_ops.max_mapping_level operation it wraps, which can also be
> > called with NULL.
>
> are you sure of it?
>
> The patch 09 just added the check of fault->is_private for TDX and SEV.

+1, this isn't quite right.  That's largely my fault (no pun intended)
though, as I suggested the basic gist of the NULL @fault handling, and
it's a mess.  More at the bottom.

> > Rename function to kvm_gmem_max_mapping_level(): This reinforces that
> > the function's scope is for guest_memfd-backed memory, which can be
> > either private or non-private, removing any remaining "private"
> > connotation from its name.
> >
> > Optimize max_level checks: Introduce a check in the caller to skip
> > querying for max_mapping_level if the current max_level is already
> > PG_LEVEL_4K, as no further reduction is possible.
> >
> > Acked-by: David Hildenbrand
> > Suggested-by: Sean Christopherson
> > Signed-off-by: Ackerley Tng
> > Signed-off-by: Fuad Tabba
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 16 +++++++---------
> >  1 file changed, 7 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index bb925994cbc5..6bd28fda0fd3 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4467,17 +4467,13 @@ static inline u8 kvm_max_level_for_order(int order)
> >  	return PG_LEVEL_4K;
> >  }
> >
> > -static u8 kvm_max_private_mapping_level(struct kvm *kvm,
> > -					struct kvm_page_fault *fault,
> > -					int gmem_order)
> > +static u8 kvm_gmem_max_mapping_level(struct kvm *kvm, int order,
> > +				     struct kvm_page_fault *fault)
> >  {
> > -	u8 max_level = fault->max_level;
> >  	u8 req_max_level;
> > +	u8 max_level;
> >
> > -	if (max_level == PG_LEVEL_4K)
> > -		return PG_LEVEL_4K;
> > -
> > -	max_level = min(kvm_max_level_for_order(gmem_order), max_level);
> > +	max_level = kvm_max_level_for_order(order);
> >
> >  	if (max_level == PG_LEVEL_4K)
> >  		return PG_LEVEL_4K;
> >
> > @@ -4513,7 +4509,9 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
> >  	}
> >
> >  	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
> > -	fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault, max_order);
> > +	if (fault->max_level >= PG_LEVEL_4K)
> > +		fault->max_level = kvm_gmem_max_mapping_level(vcpu->kvm,
> > +							      max_order, fault);
>
> I cannot understand why this change is required. In what case will
> fault->max_level < PG_LEVEL_4K?

Yeah, I don't get this code either.

I also don't think KVM should call kvm_gmem_max_mapping_level() *here*.
That's mostly a problem with my suggested NULL @fault handling.
Dealing with kvm_gmem_max_mapping_level() here leads to weirdness, because
kvm_gmem_max_mapping_level() also needs to be invoked for the !fault path,
and then we end up with multiple call sites and the potential for a
redundant call (gmem only, is private).

Looking through the surrounding patches, the ordering of things is also
"off".  "Generalize private_max_mapping_level x86 op to max_mapping_level"
should just rename the helper; reacting to !is_private memory in TDX
belongs in "Consult guest_memfd when computing max_mapping_level", because
that's where KVM plays nice with non-private memory.

But that patch is also doing too much, e.g. shuffling code around and
short-circuiting the non-fault case, which makes it confusing and hard to
review.

Extending gmem hugepage support to shared memory should be "just" this:

@@ -3335,8 +3336,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
-	if (is_private)
-		host_level = kvm_max_private_mapping_level(kvm, fault, slot, gfn);
+	if (is_private || kvm_memslot_is_gmem_only(slot))
+		host_level = kvm_gmem_max_mapping_level(kvm, fault, slot, gfn,
+							is_private);
 	else
 		host_level = host_pfn_mapping_level(kvm, gfn, slot);
 
 	return min(host_level, max_level);

plus the plumbing and the small TDX change.  All the renames and code
shuffling should be done in prep patches.

The attached patches are compile-tested only, but I think they get us
where we want to be, and without my confusing suggestion to try and punt
on private mappings in the hugepage recovery paths.  They should slot in
at the right patch numbers (relative to v15).

Holler if the patches don't work; I'm happy to help sort things out so
that v16 is ready to go.