From: Patrick Roy <roypat@amazon.co.uk>
Date: Mon, 13 May 2024 11:31:47 +0100
Subject: Re: Unmapping KVM Guest Memory from Host Kernel
To: Mike Rapoport, Sean Christopherson
CC: James Gowans, "akpm@linux-foundation.org",
    "chao.p.peng@linux.intel.com", Derek Manwaring, "pbonzini@redhat.com",
    David Woodhouse, Nikita Kalyazin, "lstoakes@gmail.com",
    "Liam.Howlett@oracle.com", "linux-mm@kvack.org", "qemu-devel@nongnu.org",
    "kirill.shutemov@linux.intel.com", "vbabka@suse.cz", "mst@redhat.com",
    "somlo@cmu.edu", Alexander Graf, "kvm@vger.kernel.org",
    "linux-coco@lists.linux.dev"
Message-ID: <58f39f23-0314-4e34-a8c7-30c3a1ae4777@amazon.co.uk>

Hi all,

On 3/9/24 11:14, Mike Rapoport wrote:
>>> With this in mind, what’s the best way to solve getting guest RAM out of
>>> the direct map? Is memfd_secret integration with KVM the way to go, or
>>> should we build a solution on top of guest_memfd, for example via some
>>> flag that causes it to leave memory in the host userspace’s page tables,
>>> but removes it from the direct map?
>>
>> memfd_secret obviously gets you a PoC much faster, but in the long term I'm quite
>> sure you'll be fighting memfd_secret all the way. E.g. it's not dumpable, it
>> deliberately allocates at 4KiB granularity (though I suspect the bug you found
>> means that it can be inadvertently mapped with 2MiB hugepages), it has no line
>> of sight to taking userspace out of the equation, etc.
>>
>> With guest_memfd on the other hand, everyone contributing to and maintaining it
>> has goals that are *very* closely aligned with what you want to do.
>
> I agree with Sean, guest_memfd seems a better interface to use.
> It's integrated by design with KVM and removing guest memory from the direct map
> looks like a natural enhancement to guest_memfd.
>
> Unless I'm missing something, for fast-and-dirty POC it'll be a oneliner
> that adds set_memory_np() to kvm_gmem_get_folio() and then figuring out
> what to do with virtio :)

We’ve been playing around with extending guest_memfd to remove guest memory
from the direct map. The removal from the direct map itself is indeed fairly
straightforward; since we cannot map guest_memfd, we don’t need to worry about
folios without direct map entries getting to places where they will cause
kernel panics.

However, we ran into problems running non-CoCo VMs with guest_memfd for guest
memory, regardless of whether direct map entries are available. There’s a
handful of places where a traditional KVM / Userspace setup currently touches
guest memory:

* Loading the Guest Kernel into guest-owned memory
* Instruction fetch from arbitrary guest addresses and guest page table walks
  for MMIO emulation (for example for IOAPIC accesses)
* kvm-clock
* I/O devices

With guest_memfd, if the guest is running from guest-private memory, these need
to be rethought, since now the memory is unavailable to userspace, and KVM is
not enlightened about guest_memfd’s existence everywhere (when I was
experimenting with this, it generally read garbage data from the shared VMA,
but I think I’ve since seen some patches floating around that would make it
return -EFAULT instead).

CoCo VMs have various methods for working around these: You load a guest kernel
using some “populate on first access” mechanism [1], kvm-clock and I/O are
solved by having the guest mark the relevant address ranges as “shared” ahead
of time [2] and bounce buffering via swiotlb [4], and Intel TDX solves the
instruction emulation problem for MMIO by injecting a #VE and having the guest
do the emulation itself [3].

For non-CoCo VMs, where memory is not encrypted, and the threat model assumes a
trusted host userspace, we would like to avoid changing the VM model so
completely. If we adopt CoCo’s approaches for the places where KVM / Userspace
touches guest memory, we would get all the complexity, yet none of the
encryption. Particularly the complexity on the MMIO path seems nasty, but x86
does not pre-decode instructions on MMIO exits (which are just EPT_VIOLATIONs)
like it does for PIO exits, so I also don’t really see a way around it in the
guest_memfd model.

We’ve played around a lot with allowing userspace mappings of guest_memfd, and
then having KVM internally access guest_memfd via userspace page tables (and
came up with multiple hacky ways to boot simple Linux initrds from
guest_memfd), but this is fairly awkward for two reasons:

1. Now lots of codepaths in KVM end up accessing guest_memfd, which from my
   understanding goes against the guest_memfd goal of making machine checks
   because of incorrect accesses to TDX memory impossible, and
2. We need to somehow get a userspace mapping of guest_memfd into KVM (a hacky
   way I could make this work was setting up kvm_userspace_memory_region2 with
   userspace_addr set to an mmap of guest_memfd, which actually "works" for
   everything but kvm-clock, but I also realized later that this is just
   memfd_secret with extra steps).

We also played around with having KVM access guest_memfd through the direct map
(by temporarily reinserting pages into it when needed), but this again means
lots of KVM code learns about how to access guest RAM via guest_memfd.

There are a few other features we need to support, such as serving page faults
using UFFD, which we are not too sure how to realize with guest_memfd since
UFFD is VMA based (although to me some sort of “UFFD-for-FD” sounds like
something that’d be useful even outside of our guest_memfd use case).

With these challenges in mind, some variant of memfd_secret continues to look
attractive for the non-CoCo case.
Perhaps a variant that supports in-kernel faults and provides some way for
gfn_to_pfn_cache users like kvm-clock to restore the direct map entries.

Sean, you mentioned that you envision guest_memfd also supporting non-CoCo VMs.
Do you have some thoughts about how to make the above cases work in the
guest_memfd context?

> --
> Sincerely yours,
> Mike.

Best,
Patrick

[1]: https://lore.kernel.org/kvm/20240404185034.3184582-1-pbonzini@redhat.com/T/#m4cc08ce3142a313d96951c2b1286eb290c7d1dac
[2]: https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/kvmclock.c#L227
[3]: https://www.kernel.org/doc/html/next/x86/tdx.html#mmio-handling
[4]: https://www.kernel.org/doc/html/next/x86/tdx.html#shared-memory-conversions