From: Patrick Roy <roypat@amazon.co.uk>
To: Mike Rapoport <rppt@kernel.org>, Sean Christopherson <seanjc@google.com>
Cc: James Gowans <jgowans@amazon.com>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"chao.p.peng@linux.intel.com" <chao.p.peng@linux.intel.com>,
Derek Manwaring <derekmn@amazon.com>,
"pbonzini@redhat.com" <pbonzini@redhat.com>,
David Woodhouse <dwmw@amazon.co.uk>,
Nikita Kalyazin <kalyazin@amazon.co.uk>,
"lstoakes@gmail.com" <lstoakes@gmail.com>,
"Liam.Howlett@oracle.com" <Liam.Howlett@oracle.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"kirill.shutemov@linux.intel.com"
<kirill.shutemov@linux.intel.com>,
"vbabka@suse.cz" <vbabka@suse.cz>,
"mst@redhat.com" <mst@redhat.com>,
"somlo@cmu.edu" <somlo@cmu.edu>, Alexander Graf <graf@amazon.de>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"linux-coco@lists.linux.dev" <linux-coco@lists.linux.dev>
Subject: Re: Unmapping KVM Guest Memory from Host Kernel
Date: Mon, 13 May 2024 11:31:47 +0100
Message-ID: <58f39f23-0314-4e34-a8c7-30c3a1ae4777@amazon.co.uk>
In-Reply-To: <ZexEkGkNe_7UY7w6@kernel.org>
Hi all,
On 3/9/24 11:14, Mike Rapoport wrote:
>>> With this in mind, what’s the best way to solve getting guest RAM out of
>>> the direct map? Is memfd_secret integration with KVM the way to go, or
>>> should we build a solution on top of guest_memfd, for example via some
>>> flag that causes it to leave memory in the host userspace’s page tables,
>>> but removes it from the direct map?
>> memfd_secret obviously gets you a PoC much faster, but in the long term I'm quite
>> sure you'll be fighting memfd_secret all the way. E.g. it's not dumpable, it
>> deliberately allocates at 4KiB granularity (though I suspect the bug you found
>> means that it can be inadvertently mapped with 2MiB hugepages), it has no line
>> of sight to taking userspace out of the equation, etc.
>>
>> With guest_memfd on the other hand, everyone contributing to and maintaining it
>> has goals that are *very* closely aligned with what you want to do.
> I agree with Sean, guest_memfd seems a better interface to use. It's
> integrated by design with KVM and removing guest memory from the direct map
> looks like a natural enhancement to guest_memfd.
>
> Unless I'm missing something, for a fast-and-dirty POC it'll be a oneliner
> that adds set_memory_np() to kvm_gmem_get_folio() and then figuring out
> what to do with virtio :)
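
As a concrete illustration of that PoC idea, a minimal sketch (untested;
set_memory_np() and the folio helpers are real kernel APIs, but the helper
name and the assumption that it can simply be called from
kvm_gmem_get_folio()'s allocation path are ours):

#include <linux/mm.h>
#include <linux/set_memory.h>

/*
 * PoC sketch: mark a freshly allocated guest_memfd folio not-present in
 * the kernel direct map, so that stray kernel dereferences of guest RAM
 * fault instead of silently reading it. Error handling, TLB flushing,
 * and restoring the mapping when the folio is freed are all elided.
 */
static int gmem_remove_from_direct_map(struct folio *folio)
{
	return set_memory_np((unsigned long)folio_address(folio),
			     folio_nr_pages(folio));
}
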
We’ve been playing around with extending guest_memfd to remove guest memory
from the direct map. The direct map removal itself is indeed fairly
straightforward: since guest_memfd cannot be mmap()ed, we don’t need to worry
about folios without direct map entries ending up in places where they would
cause kernel panics.
However, we ran into problems running non-CoCo VMs with guest_memfd as guest
memory, independent of whether direct map entries are present or not. There is
a handful of places where a traditional KVM / userspace setup currently touches
guest memory:
* Loading the guest kernel into guest-owned memory
* Instruction fetch from arbitrary guest addresses and guest page table walks
for MMIO emulation (for example for IOAPIC accesses)
* kvm-clock
* I/O devices
With guest_memfd, if the guest is running from guest-private memory, these need
to be rethought: the memory is now unavailable to userspace, and KVM is not
enlightened about guest_memfd’s existence everywhere (when I was experimenting
with this, it generally read garbage data from the shared VMA, but I think I’ve
since seen some patches floating around that would make it return -EFAULT
instead).
CoCo VMs have various methods for working around these: the guest kernel is
loaded using some “populate on first access” mechanism [1], kvm-clock and I/O
are handled by having the guest mark the relevant address ranges as “shared”
ahead of time [2] and by bounce buffering through swiotlb [4], and Intel TDX
solves the instruction emulation problem for MMIO by injecting a #VE and having
the guest do the emulation itself [3].
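
The “mark as shared ahead of time” part is what kvm-clock already does in CoCo
guests; the following is condensed from the code linked in [2] (the function
name here is ours, and allocation failure handling is dropped):

#include <linux/cc_platform.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/set_memory.h>

/*
 * Condensed from kvmclock_init_mem() in arch/x86/kernel/kvmclock.c:
 * the guest makes its PV clock pages host-visible by marking them
 * decrypted (i.e. shared) before the host ever touches them.
 */
static void *alloc_shared_clock_pages(unsigned int order)
{
	struct page *p = alloc_pages(GFP_KERNEL, order);
	void *mem = page_address(p);

	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
		set_memory_decrypted((unsigned long)mem, 1UL << order);

	return mem;
}
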
For non-CoCo VMs, where memory is not encrypted and the threat model assumes a
trusted host userspace, we would like to avoid changing the VM model so
completely. If we adopted CoCo’s approaches for the places where KVM /
userspace touches guest memory, we would get all of the complexity, yet none of
the encryption. The complexity on the MMIO path seems particularly nasty, and
since x86 does not pre-decode instructions on MMIO exits (which are just
EPT_VIOLATIONs) the way it does for PIO exits, I don’t really see a way around
instruction emulation in the guest_memfd model.
We’ve played around a lot with allowing userspace mappings of guest_memfd, and
then having KVM internally access guest_memfd via userspace page tables (and
came up with multiple hacky ways to boot simple Linux initrds from
guest_memfd), but this is fairly awkward for two reasons:
1. Lots of code paths in KVM now end up accessing guest_memfd, which from my
understanding goes against the guest_memfd goal of making machine checks
caused by incorrect accesses to TDX memory impossible, and
2. We need to somehow get a userspace mapping of guest_memfd into KVM (a hacky
way I could make this work, sketched below, was setting up a
kvm_userspace_memory_region2 with userspace_addr pointing at an mmap() of the
guest_memfd, which actually "works" for everything but kvm-clock, but I also
realized later that this is just memfd_secret with extra steps).
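
To make hack (2) concrete, the userspace side looked roughly like the sketch
below. Here gmem_fd is assumed to come from KVM_CREATE_GUEST_MEMFD, vm_fd and
mem_size are assumed to be set up already, error handling is dropped, and
mmap() of a guest_memfd requires a patched kernel, since upstream guest_memfd
does not support it:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

struct kvm_userspace_memory_region2 region = {
	.slot            = 0,
	.flags           = KVM_MEM_GUEST_MEMFD,
	.guest_phys_addr = 0,
	.memory_size     = mem_size,
	/* The hack: point the "shared" mapping at the guest_memfd itself. */
	.userspace_addr  = (__u64)mmap(NULL, mem_size,
				       PROT_READ | PROT_WRITE,
				       MAP_SHARED, gmem_fd, 0),
	.guest_memfd        = gmem_fd,
	.guest_memfd_offset = 0,
};

ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
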
We also played around with having KVM access guest_memfd through the direct map
(by temporarily reinserting pages into it when needed), but this again means
that lots of KVM code has to learn how to access guest RAM via guest_memfd.
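
The shape of that experiment was roughly the sketch below; the *_noflush
helpers are the same ones memfd_secret uses, while serialization against
concurrent users and batching of TLB flushes are ignored here:

#include <linux/mm.h>
#include <linux/set_memory.h>
#include <linux/string.h>
#include <asm/tlbflush.h>

/*
 * Temporarily restore a page's direct map entry around a kernel read,
 * then invalidate the entry again and flush the stale translation.
 */
static void gmem_read_via_direct_map(struct page *page, void *dst,
				     unsigned int offset, unsigned int len)
{
	unsigned long addr = (unsigned long)page_address(page);

	set_direct_map_default_noflush(page);
	memcpy(dst, (void *)(addr + offset), len);
	set_direct_map_invalid_noflush(page);
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}
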
There are a few other features we need to support, such as serving page faults
using userfaultfd (UFFD), which we are not sure how to realize with
guest_memfd, since UFFD is VMA-based (although some sort of “UFFD-for-FD”
sounds like it would be useful even outside of our guest_memfd use case).
With these challenges in mind, some variant of memfd_secret continues to look
attractive for the non-CoCo case. Perhaps a variant that supports in-kernel
faults and provides some way for gfn_to_pfn_cache users like kvm-clock to
restore the direct map entries.
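
For comparison, getting direct-map-removed memory out of memfd_secret today
takes only a few lines of userspace (no glibc wrapper exists, so the raw
syscall is used; mem_size is assumed to be defined, and the kernel must be
booted with secretmem enabled):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Pages backing this mapping are removed from the direct map on fault. */
int fd = syscall(SYS_memfd_secret, 0);
ftruncate(fd, mem_size);
void *guest_ram = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
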
Sean, you mentioned that you envision guest_memfd also supporting non-CoCo VMs.
Do you have some thoughts about how to make the above cases work in the
guest_memfd context?
Best,
Patrick
[1]: https://lore.kernel.org/kvm/20240404185034.3184582-1-pbonzini@redhat.com/T/#m4cc08ce3142a313d96951c2b1286eb290c7d1dac
[2]: https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/kvmclock.c#L227
[3]: https://www.kernel.org/doc/html/next/x86/tdx.html#mmio-handling
[4]: https://www.kernel.org/doc/html/next/x86/tdx.html#shared-memory-conversions