linux-mm.kvack.org archive mirror
From: Patrick Roy <roypat@amazon.co.uk>
To: Ackerley Tng <ackerleytng@google.com>, Fuad Tabba <tabba@google.com>
Cc: <kvm@vger.kernel.org>, <linux-arm-msm@vger.kernel.org>,
	<linux-mm@kvack.org>, <pbonzini@redhat.com>,
	<chenhuacai@kernel.org>, <mpe@ellerman.id.au>,
	<anup@brainfault.org>, <paul.walmsley@sifive.com>,
	<palmer@dabbelt.com>, <aou@eecs.berkeley.edu>,
	<seanjc@google.com>, <viro@zeniv.linux.org.uk>,
	<brauner@kernel.org>, <willy@infradead.org>,
	<akpm@linux-foundation.org>, <xiaoyao.li@intel.com>,
	<yilun.xu@intel.com>, <chao.p.peng@linux.intel.com>,
	<jarkko@kernel.org>, <amoorthy@google.com>, <dmatlack@google.com>,
	<yu.c.zhang@linux.intel.com>, <isaku.yamahata@intel.com>,
	<mic@digikod.net>, <vbabka@suse.cz>, <vannapurve@google.com>,
	<mail@maciej.szmigiero.name>, <david@redhat.com>,
	<michael.roth@amd.com>, <wei.w.wang@intel.com>,
	<liam.merwick@oracle.com>, <isaku.yamahata@gmail.com>,
	<kirill.shutemov@linux.intel.com>, <suzuki.poulose@arm.com>,
	<steven.price@arm.com>, <quic_eberman@quicinc.com>,
	<quic_mnalajal@quicinc.com>, <quic_tsoni@quicinc.com>,
	<quic_svaddagi@quicinc.com>, <quic_cvanscha@quicinc.com>,
	<quic_pderrin@quicinc.com>, <quic_pheragu@quicinc.com>,
	<catalin.marinas@arm.com>, <james.morse@arm.com>,
	<yuzenghui@huawei.com>, <oliver.upton@linux.dev>,
	<maz@kernel.org>, <will@kernel.org>, <qperret@google.com>,
	<keirf@google.com>, <shuah@kernel.org>, <hch@infradead.org>,
	<jgg@nvidia.com>, <rientjes@google.com>, <jhubbard@nvidia.com>,
	<fvdl@google.com>, <hughd@google.com>, <jthoughton@google.com>
Subject: Re: [PATCH v3 05/11] KVM: guest_memfd: Add guest_memfd support to kvm_(read|/write)_guest_page()
Date: Fri, 18 Oct 2024 07:57:40 +0100	[thread overview]
Message-ID: <8c6b9bfb-292f-45c8-916e-e0aae961ba5c@amazon.co.uk> (raw)
In-Reply-To: <diqzttdaxuol.fsf@ackerleytng-ctop-specialist.c.googlers.com>



On Thu, 2024-10-17 at 22:53 +0100, Ackerley Tng wrote:
> Fuad Tabba <tabba@google.com> writes:
> 
>> Make kvm_(read|/write)_guest_page() capable of accessing guest
>> memory for slots that don't have a userspace address, but only if
>> the memory is mappable, which also indicates that it is
>> accessible by the host.
> 
> Fuad explained to me that this patch removes the need for userspace to
> mmap a guest_memfd fd just to provide userspace_addr when only a limited
> range of shared pages is required, e.g. for kvm_clock.
> 
> Questions to anyone who might be more familiar:
> 
> 1. Should we let userspace save the trouble of providing userspace_addr
>    if only KVM (and not userspace) needs to access the shared pages?
> 2. Other than kvm_{read,write}_guest_page, are there any other parts of
>    KVM that may require updates so that guest_memfd can be used directly
>    from the kernel?
> 
> Patrick, does this help to answer the question of "how does KVM
> internally access guest_memfd for non-CoCo VMs" that you brought up in
> this other thread [*]?

Just patching kvm_{read,write}_guest sadly does not solve all the
problems on x86: there are also guest page table walks (which use
get_user/try_cmpxchg on userspace_addr directly), and kvm-clock, which
uses a special mechanism (gfn_to_pfn_cache) to translate gpa to pfn via
userspace_addr (and uses gup in the process). These aren't impossible to
also teach about gmem (details in my patch series [1], which does all of
this in the context of direct map removal), but with direct map removal,
all of these would additionally need to temporarily restore the direct
map, which is expensive due to TLB flushes. That's the main reason why I
was so excited about the userspace_addr approach: it'd mean no need for
TLB flushes (you can do copy_from_user and friends without direct map
entries), and everything except kvm-clock will "just work" (massaging
kvm-clock to also work in this model is easier than making it work with
on-demand direct map restoration, I think).
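
To illustrate where the cost comes from: restoring a direct map entry
around a single gmem access would look roughly like the below (a sketch
with a made-up helper name, not how [1] actually structures it):

	static int gmem_read_restoring_direct_map(struct page *page, void *data,
						  int offset, int len)
	{
		unsigned long addr = (unsigned long)page_address(page);
		int r;

		/* Re-establish the direct map entry for this page. */
		r = set_direct_map_default_noflush(page);
		if (r)
			return r;

		memcpy(data, (void *)(addr + offset), len);

		/* Tear it down again; the TLB flush is the expensive part. */
		set_direct_map_invalid_noflush(page);
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

		return 0;
	}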

There's also `kvm_{read,write}_guest_cached` (I think async page
faulting uses it), but that's fairly easy to deal with [3].
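
Under the hood that cache just resolves and stores an hva up front; a
rough usage sketch of the existing API:

	struct gfn_to_hva_cache ghc;
	u64 val;
	int r;

	/* Resolve gpa -> hva through the memslot once, reuse it afterwards. */
	r = kvm_gfn_to_hva_cache_init(kvm, &ghc, gpa, sizeof(val));
	if (!r)
		r = kvm_read_guest_cached(kvm, &ghc, &val, sizeof(val));

so it inherits the same userspace_addr assumption and needs the same
kind of gmem special-casing as kvm_{read,write}_guest_page.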

There's another problem for x86 here: during instruction fetch for MMIO
emulation (the CPU doesn't help us here sadly, outside of modern AMD
processors [2]), there is no way to mark pages as shared/mappable ahead
of time. Although I guess this is only a problem for CoCo VMs with
in-place sharing on x86 that don't do TDX-style paravirtual MMIO, which
I am not sure is something that even exists. I know for my non-CoCo use
case I would just unconditionally set all of gmem to faultable but
without direct map entries, but then I'd need separate knobs for
"private" and "faultable".
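
Concretely, for that non-CoCo setup I'd want to be able to ask for
something like the below at guest_memfd creation time (the flag names
here are invented purely to illustrate the two knobs, they're not part
of this series or mine):

	struct kvm_create_guest_memfd gmem = {
		.size  = guest_memory_size,
		/* hypothetical split: everything faultable, but no direct map entries */
		.flags = GUEST_MEMFD_FLAG_INIT_FAULTABLE |
			 GUEST_MEMFD_FLAG_NO_DIRECT_MAP,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);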

[1]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
[2]: https://lore.kernel.org/kvm/ZkI0SCMARCB9bAfc@google.com/
[3]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#ma61de70f906cd6553ffba02563d1e617f7c9246b

> [*] https://lore.kernel.org/all/6bca3ad4-3eca-4a75-a775-5f8b0467d7a3@amazon.co.uk/
> 
>>
>> Signed-off-by: Fuad Tabba <tabba@google.com>
>> ---
>>  virt/kvm/kvm_main.c | 137 ++++++++++++++++++++++++++++++++++++++------
>>  1 file changed, 118 insertions(+), 19 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index aed9cf2f1685..77e6412034b9 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -3399,23 +3399,114 @@ int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
>>       return kvm_gmem_toggle_mappable(kvm, start, end, false);
>>  }
>>
>> +static int __kvm_read_guest_memfd_page(struct kvm *kvm,
>> +                                    struct kvm_memory_slot *slot,
>> +                                    gfn_t gfn, void *data, int offset,
>> +                                    int len)
>> +{
>> +     struct page *page;
>> +     u64 pfn;
>> +     int r;
>> +
>> +     /*
>> +      * Holds the folio lock until after checking whether it can be faulted
>> +      * in, to avoid races with paths that change a folio's mappability.
>> +      */
>> +     r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, NULL);
>> +     if (r)
>> +             return r;
>> +
>> +     page = pfn_to_page(pfn);
>> +
>> +     if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1)) {
>> +             r = -EPERM;
>> +             goto unlock;
>> +     }
>> +     memcpy(data, page_address(page) + offset, len);
>> +unlock:
>> +     if (r)
>> +             put_page(page);
>> +     else
>> +             kvm_release_pfn_clean(pfn);
>> +     unlock_page(page);
>> +
>> +     return r;
>> +}
>> +
>> +static int __kvm_write_guest_memfd_page(struct kvm *kvm,
>> +                                     struct kvm_memory_slot *slot,
>> +                                     gfn_t gfn, const void *data,
>> +                                     int offset, int len)
>> +{
>> +     struct page *page;
>> +     u64 pfn;
>> +     int r;
>> +
>> +     /*
>> +      * Holds the folio lock until after checking whether it can be faulted
>> +      * in, to avoid races with paths that change a folio's mappability.
>> +      */
>> +     r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, NULL);
>> +     if (r)
>> +             return r;
>> +
>> +     page = pfn_to_page(pfn);
>> +
>> +     if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1)) {
>> +             r = -EPERM;
>> +             goto unlock;
>> +     }
>> +     memcpy(page_address(page) + offset, data, len);
>> +unlock:
>> +     if (r)
>> +             put_page(page);
>> +     else
>> +             kvm_release_pfn_dirty(pfn);
>> +     unlock_page(page);
>> +
>> +     return r;
>> +}
>> +#else
>> +static int __kvm_read_guest_memfd_page(struct kvm *kvm,
>> +                                    struct kvm_memory_slot *slot,
>> +                                    gfn_t gfn, void *data, int offset,
>> +                                    int len)
>> +{
>> +     WARN_ON_ONCE(1);
>> +     return -EIO;
>> +}
>> +
>> +static int __kvm_write_guest_memfd_page(struct kvm *kvm,
>> +                                     struct kvm_memory_slot *slot,
>> +                                     gfn_t gfn, const void *data,
>> +                                     int offset, int len)
>> +{
>> +     WARN_ON_ONCE(1);
>> +     return -EIO;
>> +}
>>  #endif /* CONFIG_KVM_GMEM_MAPPABLE */
>>
>>  /* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
>> -static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
>> -                              void *data, int offset, int len)
>> +
>> +static int __kvm_read_guest_page(struct kvm *kvm, struct kvm_memory_slot *slot,
>> +                              gfn_t gfn, void *data, int offset, int len)
>>  {
>> -     int r;
>>       unsigned long addr;
>>
>>       if (WARN_ON_ONCE(offset + len > PAGE_SIZE))
>>               return -EFAULT;
>>
>> +     if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
>> +         kvm_slot_can_be_private(slot) &&
>> +         !slot->userspace_addr) {
>> +             return __kvm_read_guest_memfd_page(kvm, slot, gfn, data,
>> +                                                offset, len);
>> +     }
>> +
>>       addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
>>       if (kvm_is_error_hva(addr))
>>               return -EFAULT;
>> -     r = __copy_from_user(data, (void __user *)addr + offset, len);
>> -     if (r)
>> +     if (__copy_from_user(data, (void __user *)addr + offset, len))
>>               return -EFAULT;
>>       return 0;
>>  }
>> @@ -3425,7 +3516,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
>>  {
>>       struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>>
>> -     return __kvm_read_guest_page(slot, gfn, data, offset, len);
>> +     return __kvm_read_guest_page(kvm, slot, gfn, data, offset, len);
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_read_guest_page);
>>
>> @@ -3434,7 +3525,7 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
>>  {
>>       struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>>
>> -     return __kvm_read_guest_page(slot, gfn, data, offset, len);
>> +     return __kvm_read_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
>>
>> @@ -3511,22 +3602,30 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
>>
>>  /* Copy @len bytes from @data into guest memory at '(@gfn * PAGE_SIZE) + @offset' */
>>  static int __kvm_write_guest_page(struct kvm *kvm,
>> -                               struct kvm_memory_slot *memslot, gfn_t gfn,
>> -                               const void *data, int offset, int len)
>> +                               struct kvm_memory_slot *slot, gfn_t gfn,
>> +                               const void *data, int offset, int len)
>>  {
>> -     int r;
>> -     unsigned long addr;
>> -
>>       if (WARN_ON_ONCE(offset + len > PAGE_SIZE))
>>               return -EFAULT;
>>
>> -     addr = gfn_to_hva_memslot(memslot, gfn);
>> -     if (kvm_is_error_hva(addr))
>> -             return -EFAULT;
>> -     r = __copy_to_user((void __user *)addr + offset, data, len);
>> -     if (r)
>> -             return -EFAULT;
>> -     mark_page_dirty_in_slot(kvm, memslot, gfn);
>> +     if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
>> +         kvm_slot_can_be_private(slot) &&
>> +         !slot->userspace_addr) {
>> +             int r = __kvm_write_guest_memfd_page(kvm, slot, gfn, data,
>> +                                                  offset, len);
>> +
>> +             if (r)
>> +                     return r;
>> +     } else {
>> +             unsigned long addr = gfn_to_hva_memslot(slot, gfn);
>> +
>> +             if (kvm_is_error_hva(addr))
>> +                     return -EFAULT;
>> +             if (__copy_to_user((void __user *)addr + offset, data, len))
>> +                     return -EFAULT;
>> +     }
>> +
>> +     mark_page_dirty_in_slot(kvm, slot, gfn);
>>       return 0;
>>  }


Thread overview: 28+ messages
2024-10-10  8:59 [PATCH v3 00/11] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 01/11] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
2024-10-12  6:12   ` kernel test robot
2024-10-10  8:59 ` [PATCH v3 02/11] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 03/11] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 04/11] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared Fuad Tabba
2024-10-10 10:14   ` Kirill A. Shutemov
2024-10-10 10:23     ` Fuad Tabba
2024-10-10 12:03       ` Jason Gunthorpe
2024-10-10 14:27         ` Fuad Tabba
2024-10-10 12:20       ` Kirill A. Shutemov
2024-10-10 14:28         ` Fuad Tabba
2024-10-10 14:36           ` Kirill A. Shutemov
2024-10-10 14:37           ` Jason Gunthorpe
2024-10-14 16:52   ` Elliot Berman
2024-10-15 10:27     ` Fuad Tabba
2024-10-16 16:53       ` Elliot Berman
2024-10-10  8:59 ` [PATCH v3 05/11] KVM: guest_memfd: Add guest_memfd support to kvm_(read|/write)_guest_page() Fuad Tabba
2024-10-17 21:53   ` Ackerley Tng
2024-10-18  6:57     ` Patrick Roy [this message]
2024-10-10  8:59 ` [PATCH v3 06/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable Fuad Tabba
2024-10-15 10:30   ` Suzuki K Poulose
2024-10-15 10:33     ` Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 07/11] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as mappable Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 08/11] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 09/11] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 10/11] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
2024-10-10  8:59 ` [PATCH v3 11/11] KVM: arm64: Enable guest_memfd private memory when pKVM is enabled Fuad Tabba
