From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Patrick Roy <roypat@amazon.co.uk>,
	seanjc@google.com, pbonzini@redhat.com,
	akpm@linux-foundation.org, dwmw@amazon.co.uk, rppt@kernel.org,
	david@redhat.com
Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	willy@infradead.org, graf@amazon.com, derekmn@amazon.com,
	kalyazin@amazon.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	dmatlack@google.com, tabba@google.com,
	chao.p.peng@linux.intel.com, xmarcalx@amazon.co.uk
Subject: Re: [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map
Date: Mon, 22 Jul 2024 14:28:00 +0200	[thread overview]
Message-ID: <e12b91ef-ca0c-4b77-840b-dcfb2c76a984@kernel.org> (raw)
In-Reply-To: <20240709132041.3625501-1-roypat@amazon.co.uk>

On 7/9/24 3:20 PM, Patrick Roy wrote:
> Hey all,
> 
> This RFC series is a rough draft adding support for running
> non-confidential compute VMs with their memory backed by guest_memfd,
> based on prior discussions with Sean [1]. Our specific use case for
> this is the ability to unmap
> guest memory from the host kernel's direct map, as a mitigation against
> a large class of speculative execution issues.
> 
> === Implementation ===
> 
> This patch series introduces a new flag to the `KVM_CREATE_GUEST_MEMFD`
> ioctl that makes guest_memfd remove its pages from the direct map when
> they are allocated. When trying to run a guest from such a guest_memfd,
> we face the problem that, without either userspace or kernelspace
> mappings of guest_memfd, KVM cannot access guest memory to, for example,
> do MMIO emulation or access memory used for guest/host communication.
> We have multiple options for
> solving this when running non-CoCo VMs: (1) implement a TDX-light
> solution, where the guest shares memory that KVM needs to access, and
> relies on paravirtual solutions where this is not possible (e.g. MMIO),
> (2) have KVM use userspace mappings of guest_memfd (e.g. a
> memfd_secret-style solution), or (3) dynamically reinsert pages into the
> direct map whenever KVM wants to access them.
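> 
> For illustration, creating such a guest_memfd would look roughly like
> this (sketch only; the flag name below is made up for illustration,
> the actual UAPI is in patch 5):
> 
>   struct kvm_create_guest_memfd gmem = {
>           .size  = guest_size,
>           /* hypothetical flag name, for illustration only */
>           .flags = KVM_GMEM_NO_DIRECT_MAP,
>   };
>   int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);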
> 
> This RFC goes for option (3). Option (1) is a lot of overhead for very
> little gain, since we are not actually constrained by a physical
> inability to access guest memory (e.g. we are not in a TDX context where
> accesses to guest memory cause a #MC). Option (2) has previously been
> rejected [1].

Do the pages have to have the same address when they are temporarily
mapped? Wouldn't it be easier to do something similar to
kmap_local_page(), as used for HIGHMEM? I.e. you get a temporary kernel
mapping to do what's needed, but it doesn't have to alter the shared
direct map.

Maybe that was already discussed somewhere as unsuitable, but I didn't
spot it here.
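
Something like the pattern below is what I mean (untested sketch; note
that on 64-bit, kmap_local_page() today just returns the direct map
address, so this would need a variant that always installs a temporary
mapping):

  void *p = kmap_local_page(page);  /* temporary, CPU-local mapping */
  memcpy(buf, p + offset, len);     /* access the guest page */
  kunmap_local(p);                  /* tear the mapping down again */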

> In this patch series, we make sufficient parts of KVM gmem-aware to be
> able to boot a Linux initrd from private memory on x86. These include
> KVM's MMIO emulation (including guest page table walking) and kvm-clock.
> For VM types which do not allow accessing gmem, we return -EFAULT and
> attempt to prepare a KVM_EXIT_MEMORY_FAULT.
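> 
> Schematically, the fallback looks like this (sketch only; the
> accessibility check helper is a made-up name for illustration):
> 
>   if (!kvm_arch_gmem_accessible(vcpu->kvm)) {
>           kvm_prepare_memory_fault_exit(vcpu, gpa, len, is_write,
>                                         /*is_exec=*/false,
>                                         /*is_private=*/true);
>           return -EFAULT;
>   }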
> 
> Additionally, this patch series adds support for "restricted" userspace
> mappings of guest_memfd, which work similarly to memfd_secret (e.g.
> get_user_pages is disallowed) and allow handling I/O and loading the
> guest kernel in a simple way. Support for this is completely independent
> of the rest of the functionality introduced in this patch series.
> However, it is required to build a minimal hypervisor PoC that actually
> allows booting a VM from a disk.
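> 
> From the VMM's point of view this is just a plain mmap() of the
> guest_memfd fd (sketch; GUP being disallowed is a kernel-side property
> of the resulting VMA):
> 
>   /* Map guest_memfd into the VMM for kernel loading and I/O staging. */
>   void *guest = mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
>                      MAP_SHARED, gmem_fd, 0);
>   if (guest == MAP_FAILED)
>           err(1, "mmap guest_memfd");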
> 
> === Performance ===
> 
> We have run some preliminary performance benchmarks to assess the impact
> of on-the-fly direct map manipulations. We were mainly interested in the
> impact of manipulating the direct map for MMIO emulation on virtio-mmio.
> In particular, we were worried about the impact of the TLB and L1/2/3
> cache flushes that set_memory_[n]p entails.
> 
> In our setup, we have taken a modified Firecracker VMM, spawned a Linux
> guest with 1 vCPU, and used fio to stress a virtio_blk device. We found
> that the cache flushes caused throughput to drop from around 600MB/s to
> ~50MB/s (a ~90% drop) for both reads and writes (on an Intel(R)
> Xeon(R) Platinum 8375C CPU with 64 cores). We then converted our
> prototype to use
> set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p and
> found that without cache flushes the pure impact of the direct map
> manipulation is indistinguishable from noise. This is why we use
> set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p in
> this RFC.
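> 
> For reference, the APIs being compared (single-page direct map
> helpers; sketch, exact signatures abbreviated):
> 
>   /* Flushing variant: changes the direct map and flushes TLBs and
>    * caches; these flushes are what caused the ~90% throughput drop. */
>   set_memory_np(addr, numpages);
> 
>   /* Non-flushing variants used in this RFC: only rewrite the direct
>    * map PTEs and leave any flushing to the caller. */
>   set_direct_map_invalid_noflush(page);
>   set_direct_map_default_noflush(page);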
> 
> Note that in this comparison, both the baseline, as well as the
> guest_memfd-supporting version of Firecracker were made to bounce I/O
> buffers in VMM userspace. As GUP is disabled for the guest_memfd VMAs,
> the virtio stack cannot directly pass guest buffers to read/write
> syscalls.
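> 
> I.e., the I/O path in both setups is roughly (illustrative sketch,
> variable names made up):
> 
>   /* Read into an ordinary bounce buffer first, then copy into the
>    * mmap'd guest_memfd region, since GUP on the gmem VMA would fail. */
>   ssize_t n = pread(disk_fd, bounce, len, disk_offset);
>   if (n > 0)
>           memcpy(guest + gpa_offset, bounce, n);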
> 
> === Security ===
> 
> We want to use unmapping guest memory from the host kernel as a security
> mitigation against transient execution attacks. Temporarily restoring
> direct map entries whenever KVM requires access to guest memory leaves a
> gap in this mitigation. We believe this to be acceptable for the above
> cases, since pages used for paravirtual guest/host communication (e.g.
> kvm-clock) and guest page tables do not contain sensitive data. MMIO
> emulation will only end up reading pages containing privileged
> instructions (e.g. guest kernel code).
> 
> === Summary ===
> 
> Patches 1-4 are about adapting various points inside KVM that access
> guest memory to correctly handle the case where memory happens to be
> guest-private. This means either handling the access as a memory
> error, or simply accessing the memslot's guest_memfd instead of looking
> at the userspace-provided VMA if the VM type allows this kind of
> access. Patches 5-6 add a flag to KVM_CREATE_GUEST_MEMFD that will
> make it remove its pages from the kernel's direct map. Whenever KVM
> wants to access guest-private memory, it will temporarily re-insert the
> relevant pages. Patches 7-8 allow for restricted userspace mappings
> (e.g. get_user_pages paths are disabled like for memfd_secret) of
> guest_memfd, so that userspace has an easy path for loading the guest
> kernel and handling I/O-buffers.
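> 
> The resulting access pattern for guest-private memory is thus roughly
> (sketch only; error handling, refcounting and TLB considerations
> omitted):
> 
>   set_direct_map_default_noflush(page);         /* re-insert page   */
>   memcpy(buf, page_address(page) + off, len);   /* access guest mem */
>   set_direct_map_invalid_noflush(page);         /* remove it again  */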
> 
> === ToDos / Limitations ===
> 
> There are still a few rough edges that need to be addressed before
> dropping the "RFC" tag, e.g.
> 
> * Handle errors of set_direct_map_default_noflush in
>   kvm_gmem_invalidate_folio instead of calling BUG_ON
> * Lift the limitation of "at most one gfn_to_pfn_cache for each
>   gfn/pfn" in e1c61f0a7963 ("kvm: gmem: Temporarily restore direct map
>   entries when needed"). It currently means that guests with more than 1
>   vcpu fail to boot, because multiple vcpus can put their kvm-clock PV
>   structures into the same page (gfn)
> * Write selftests, particularly around hole punching, direct map removal,
>   and mmap.
> 
> Lastly, there's the question of nested virtualization which Sean brought
> up in previous discussions, which runs into similar problems as MMIO. I
> have looked at it very briefly. On Intel, KVM uses various gfn->uhva
> caches, which run into the same problems as the gfn_to_hva_caches dealt
> with in 200834b15dda ("kvm: use slowpath in gfn_to_hva_cache if memory
> is private"). However, previous attempts at just converting this to
> gfn_to_pfn_cache (which would make them work with guest_memfd) proved
> complicated [2]. I suppose initially, we should probably disallow nested
> virtualization in VMs that have their memory removed from the direct
> map.
> 
> Best,
> Patrick
> 
> [1]: https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com/
> [2]: https://lore.kernel.org/kvm/ZBEEQtmtNPaEqU1i@google.com/
> 
> Patrick Roy (8):
>   kvm: Allow reading/writing gmem using kvm_{read,write}_guest
>   kvm: use slowpath in gfn_to_hva_cache if memory is private
>   kvm: pfncache: enlighten about gmem
>   kvm: x86: support walking guest page tables in gmem
>   kvm: gmem: add option to remove guest private memory from direct map
>   kvm: gmem: Temporarily restore direct map entries when needed
>   mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
>   kvm: gmem: Allow restricted userspace mappings
> 
>  arch/x86/kvm/mmu/paging_tmpl.h |  94 +++++++++++++++++++-----
>  include/linux/kvm_host.h       |   5 ++
>  include/linux/kvm_types.h      |   1 +
>  include/linux/secretmem.h      |  13 +++-
>  include/uapi/linux/kvm.h       |   2 +
>  mm/secretmem.c                 |   6 +-
>  virt/kvm/guest_memfd.c         |  83 +++++++++++++++++++--
>  virt/kvm/kvm_main.c            | 112 +++++++++++++++++++++++++++-
>  virt/kvm/pfncache.c            | 130 +++++++++++++++++++++++++++++----
>  9 files changed, 399 insertions(+), 47 deletions(-)
> 
> 
> base-commit: 890a64810d59b1a58ed26efc28cfd821fc068e84


