From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 22 Jul 2024 14:28:00 +0200
Subject: Re: [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map
To: Patrick Roy, seanjc@google.com, pbonzini@redhat.com,
 akpm@linux-foundation.org, dwmw@amazon.co.uk, rppt@kernel.org,
 david@redhat.com
Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
 dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
 willy@infradead.org, graf@amazon.com, derekmn@amazon.com,
 kalyazin@amazon.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, dmatlack@google.com, tabba@google.com,
 chao.p.peng@linux.intel.com, xmarcalx@amazon.co.uk
References: <20240709132041.3625501-1-roypat@amazon.co.uk>
From: "Vlastimil Babka (SUSE)"
In-Reply-To: <20240709132041.3625501-1-roypat@amazon.co.uk>

On 7/9/24 3:20 PM, Patrick Roy wrote:
> Hey all,
>
> This RFC series is a rough draft adding support for running
> non-confidential compute VMs in guest_memfd, based on prior discussions
> with Sean [1]. Our specific use case for this is the ability to unmap
> guest memory from the host kernel's direct map, as a mitigation against
> a large class of speculative execution issues.
>
> === Implementation ===
>
> This patch series introduces a new flag to `KVM_CREATE_GUEST_MEMFD`
> to remove its pages from the direct map when they are allocated. When
> trying to run a guest from such a VM, we now face the problem that
> without either userspace or kernelspace mappings of guest_memfd, KVM
> cannot access guest memory to, for example, do MMIO emulation or access
> memory used for guest/host communication. We have multiple options for
> solving this when running non-CoCo VMs: (1) implement a TDX-light
> solution, where the guest shares memory that KVM needs to access, and
> relies on paravirtual solutions where this is not possible (e.g. MMIO),
> (2) have KVM use userspace mappings of guest_memfd (e.g. a
> memfd_secret-style solution), or (3) dynamically reinsert pages into the
> direct map whenever KVM wants to access them.
>
> This RFC goes for option (3).
> Option (1) is a lot of overhead for very
> little gain, since we are not actually constrained by a physical
> inability to access guest memory (e.g. we are not in a TDX context where
> accesses to guest memory cause a #MC). Option (2) has previously been
> rejected [1].

Do the pages have to have the same address when they are temporarily
mapped? Wouldn't it be easier to do something similar to
kmap_local_page() used for HIGHMEM? I.e. you get a temporary kernel
mapping to do what's needed, but it doesn't have to alter the shared
direct map.

Maybe that was already discussed somewhere as unsuitable, but I didn't
spot it here.

> In this patch series, we make sufficient parts of KVM gmem-aware to be
> able to boot a Linux initrd from private memory on x86. These include
> KVM's MMIO emulation (including guest page table walking) and kvm-clock.
> For VM types which do not allow accessing gmem, we return -EFAULT and
> attempt to prepare a KVM_EXIT_MEMORY_FAULT.
>
> Additionally, this patch series adds support for "restricted" userspace
> mappings of guest_memfd, which work similarly to memfd_secret (e.g.
> disallow get_user_pages), which allows handling I/O and loading the
> guest kernel in a simple way. Support for this is completely independent
> of the rest of the functionality introduced in this patch series.
> However, it is required to build a minimal hypervisor PoC that actually
> allows booting a VM from a disk.
>
> === Performance ===
>
> We have run some preliminary performance benchmarks to assess the impact
> of on-the-fly direct map manipulations. We were mainly interested in the
> impact of manipulating the direct map for MMIO emulation on virtio-mmio.
> Particularly, we were worried about the impact of the TLB and L1/2/3
> Cache flushes that set_memory_[n]p entails.
>
> In our setup, we have taken a modified Firecracker VMM, spawned a Linux
> guest with 1 vCPU, and used fio to stress a virtio_blk device.
> We found that the cache flushes caused throughput to drop from around
> 600MB/s to ~50MB/s (~90%) for both reads and writes (on an Intel(R)
> Xeon(R) Platinum 8375C CPU with 64 cores). We then converted our
> prototype to use set_direct_map_{invalid,default}_noflush instead of
> set_memory_[n]p and found that without cache flushes the pure impact of
> the direct map manipulation is indistinguishable from noise. This is why
> we use set_direct_map_{invalid,default}_noflush instead of
> set_memory_[n]p in this RFC.
>
> Note that in this comparison, both the baseline and the
> guest_memfd-supporting version of Firecracker were made to bounce I/O
> buffers in VMM userspace. As GUP is disabled for the guest_memfd VMAs,
> the virtio stack cannot directly pass guest buffers to read/write
> syscalls.
>
> === Security ===
>
> We want to use unmapping guest memory from the host kernel as a security
> mitigation against transient execution attacks. Temporarily restoring
> direct map entries whenever KVM requires access to guest memory leaves a
> gap in this mitigation. We believe this to be acceptable for the above
> cases, since pages used for paravirtual guest/host communication (e.g.
> kvm-clock) and guest page tables do not contain sensitive data. MMIO
> emulation will only end up reading pages containing privileged
> instructions (e.g. guest kernel code).
>
> === Summary ===
>
> Patches 1-4 are about hot-patching various points inside of KVM that
> access guest memory to correctly handle the case where memory happens to
> be guest-private. This means either handling the access as a memory
> error, or simply accessing the memslot's guest_memfd instead of looking
> at the userspace provided VMA if the VM type allows these kinds of
> accesses. Patches 5-6 add a flag to KVM_CREATE_GUEST_MEMFD that will
> make it remove its pages from the kernel's direct map. Whenever KVM
> wants to access guest-private memory, it will temporarily re-insert the
> relevant pages.
> Patches 7-8 allow for restricted userspace mappings
> (e.g. get_user_pages paths are disabled like for memfd_secret) of
> guest_memfd, so that userspace has an easy path for loading the guest
> kernel and handling I/O-buffers.
>
> === ToDos / Limitations ===
>
> There are still a few rough edges that need to be addressed before
> dropping the "RFC" tag, e.g.
>
> * Handle errors of set_direct_map_default_noflush in
>   kvm_gmem_invalidate_folio instead of calling BUG_ON
> * Lift the limitation of "at most one gfn_to_pfn_cache for each
>   gfn/pfn" in e1c61f0a7963 ("kvm: gmem: Temporarily restore direct map
>   entries when needed"). It currently means that guests with more than 1
>   vcpu fail to boot, because multiple vcpus can put their kvm-clock PV
>   structures into the same page (gfn)
> * Write selftests, particularly around hole punching, direct map removal,
>   and mmap.
>
> Lastly, there's the question of nested virtualization which Sean brought
> up in previous discussions, which runs into similar problems as MMIO. I
> have looked at it very briefly. On Intel, KVM uses various gfn->uhva
> caches, which run into similar problems as the gfn_to_hva_caches dealt
> with in 200834b15dda ("kvm: use slowpath in gfn_to_hva_cache if memory
> is private"). However, previous attempts at just converting this to
> gfn_to_pfn_cache (which would make them work with guest_memfd) proved
> complicated [2]. I suppose initially, we should probably disallow nested
> virtualization in VMs that have their memory removed from the direct
> map.
>
> Best,
> Patrick
>
> [1]: https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com/
> [2]: https://lore.kernel.org/kvm/ZBEEQtmtNPaEqU1i@google.com/
>
> Patrick Roy (8):
>   kvm: Allow reading/writing gmem using kvm_{read,write}_guest
>   kvm: use slowpath in gfn_to_hva_cache if memory is private
>   kvm: pfncache: enlighten about gmem
>   kvm: x86: support walking guest page tables in gmem
>   kvm: gmem: add option to remove guest private memory from direct map
>   kvm: gmem: Temporarily restore direct map entries when needed
>   mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
>   kvm: gmem: Allow restricted userspace mappings
>
>  arch/x86/kvm/mmu/paging_tmpl.h |  94 +++++++++++++++++++-----
>  include/linux/kvm_host.h       |   5 ++
>  include/linux/kvm_types.h      |   1 +
>  include/linux/secretmem.h      |  13 +++-
>  include/uapi/linux/kvm.h       |   2 +
>  mm/secretmem.c                 |   6 +-
>  virt/kvm/guest_memfd.c         |  83 +++++++++++++++++++--
>  virt/kvm/kvm_main.c            | 112 +++++++++++++++++++++++++++-
>  virt/kvm/pfncache.c            | 130 +++++++++++++++++++++++++++++----
>  9 files changed, 399 insertions(+), 47 deletions(-)
>
>
> base-commit: 890a64810d59b1a58ed26efc28cfd821fc068e84