Date: Mon, 13 May 2024 13:36:29 -0700
References: <58f39f23-0314-4e34-a8c7-30c3a1ae4777@amazon.co.uk>
Subject: Re: Unmapping KVM Guest Memory from Host Kernel
From: Sean Christopherson
To: James Gowans
Cc: "kvm@vger.kernel.org", "linux-coco@lists.linux.dev", Nikita Kalyazin,
 "rppt@kernel.org", "qemu-devel@nongnu.org", Patrick Roy, "somlo@cmu.edu",
 "vbabka@suse.cz", "akpm@linux-foundation.org",
 "kirill.shutemov@linux.intel.com", "Liam.Howlett@oracle.com",
 David Woodhouse, "pbonzini@redhat.com", "linux-mm@kvack.org",
 Alexander Graf, Derek Manwaring, "chao.p.peng@linux.intel.com",
 "lstoakes@gmail.com", "mst@redhat.com"

On Mon, May 13, 2024, James Gowans wrote:
> On Mon, 2024-05-13 at 10:09 -0700, Sean Christopherson wrote:
> > On Mon, May 13, 2024, James Gowans wrote:
> > > On Mon, 2024-05-13 at 08:39 -0700, Sean Christopherson wrote:
> > > > > Sean, you mentioned that you envision guest_memfd also supporting non-CoCo VMs.
> > > > > Do you have some thoughts about how to make the above cases work in the
> > > > > guest_memfd context?
> > > >
> > > > Yes.  The hand-wavy plan is to allow selectively mmap()ing guest_memfd().  There
> > > > is a long thread[*] discussing how exactly we want to do that.  The TL;DR is that
> > > > the basic functionality is also straightforward; the bulk of the discussion is
> > > > around gup(), reclaim, page migration, etc.
> > >
> > > I still need to read this long thread, but just a thought on the word
> > > "restricted" here: for MMIO the instruction can be anywhere and
> > > similarly the load/store MMIO data can be anywhere. Does this mean that
> > > for running unmodified non-CoCo VMs with guest_memfd backend that we'll
> > > always need to have the whole of guest memory mmapped?
> >
> > Not necessarily, e.g. KVM could re-establish the direct map or mremap() on-demand.
> > There are variations on that, e.g. if ASI[*] were to ever make its way upstream,
> > which is a huge if, then we could have guest_memfd mapped into a KVM-only CR3.
>
> Yes, on-demand mapping in of guest RAM pages is definitely an option. It
> sounds quite challenging to need to always go via interfaces which
> demand map/fault memory, and also potentially quite slow needing to
> unmap and flush afterwards.
>
> Not too sure what you have in mind with "guest_memfd mapped into KVM-only
> CR3" - could you expand?

Remove guest_memfd from the kernel's direct map, e.g. so that the kernel at-large
can't touch guest memory, but have a separate set of page tables that have the
direct map, userspace page tables, _and_ kernel mappings for guest_memfd.
On KVM_RUN (or vcpu_load()?), switch to KVM's CR3 so that KVM's map/unmap
operations are always free (literal nops).

That's an imperfect solution as IRQs and NMIs will run kernel code with KVM's
page tables, i.e. guest memory would still be exposed to the host kernel.  And
of course we'd need to get buy-in from multiple architectures and maintainers,
etc.

> > > I guess the idea is that this use case will still be subject to the
> > > normal restriction rules, but for a non-CoCo non-pKVM VM there will be
> > > no restriction in practice, and userspace will need to mmap everything
> > > always?
> > >
> > > It really seems yucky to need to have all of guest RAM mmapped all the
> > > time just for MMIO to work... But I suppose there is no way around that
> > > for Intel x86.
> >
> > It's not just MMIO.  Nested virtualization, and more specifically shadowing nested
> > TDP, is also problematic (probably more so than MMIO).  And there are more cases,
> > i.e. we'll need a generic solution for this.  As above, there are a variety of
> > options; it's largely just a matter of doing the work.  I'm not saying it's a
> > trivial amount of work/effort, but it's far from an unsolvable problem.
>
> I didn't even think of nested virt, but that will absolutely be an even
> bigger problem too. MMIO was just the first roadblock which illustrated
> the problem.
> Overall what I'm trying to figure out is whether there is any sane path
> here other than needing to mmap all guest RAM all the time. Trying to
> get nested virt and MMIO and whatever else needs access to guest RAM
> working by doing just-in-time (aka: on-demand) mappings and unmappings
> of guest RAM sounds like a painful game of whack-a-mole, potentially
> really bad for performance too.

It's a whack-a-mole game that KVM already plays, e.g. for dirty tracking, post-copy
demand paging, etc.  There is still plenty of room for improvement, e.g.
to reduce the number of touchpoints and thus the potential for missed cases.  But
KVM more or less needs to solve this basic problem no matter what, so I don't
think that guest_memfd adds much, if any, burden.

> Do you think we should look at doing this on-demand mapping, or, for
> now, simply require that all guest RAM is mmapped all the time and KVM
> be given a valid virtual addr for the memslots?

I don't think "map everything into userspace" is a viable approach, precisely
because it requires reflecting that back into KVM's memslots, which in turn
means guest_memfd needs to allow gup().  And I don't think we want to allow gup(),
because that opens a rather large can of worms (see the long thread I linked).

Hmm, a slightly crazy idea (ok, maybe wildly crazy) would be to support mapping
all of guest_memfd into kernel address space, but as USER=1 mappings.  I.e. don't
require a carve-out from userspace, but do require CLAC/STAC when accessing guest
memory from the kernel.  I think/hope that would provide the speculative execution
mitigation properties you're looking for?

Userspace would still have access to guest memory, but it would take a truly
malicious userspace for that to matter.  And when CPUs that support LASS come
along, userspace would be completely unable to access guest memory through KVM's
magic mapping.

This too would require a decent amount of buy-in from outside of KVM, e.g. to
carve out the virtual address range in the kernel.  But the performance overhead
would be identical to the status quo.  And there could be advantages to being
able to identify accesses to guest memory based purely on kernel virtual address.
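
To make the USER=1 idea a bit more concrete, here is a toy model of the
architectural access checks involved.  It is only a sketch, not kernel code:
the names (Cpu, data_access_allowed, GUEST_VA) and the address are made up for
illustration, and it assumes a heavily simplified view of x86 (data accesses
only, no speculation, no other permission bits, and only the user-mode side of
LASS).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    """Hypothetical, simplified CPU state; not a real kernel structure."""
    user_mode: bool     # True for CPL3, False for CPL0
    smap: bool = True   # CR4.SMAP enabled
    lass: bool = False  # CR4.LASS enabled (only on CPUs that support it)
    ac: bool = False    # EFLAGS.AC: set by STAC, cleared by CLAC


def is_kernel_half(vaddr: int) -> bool:
    """Bit 63 set means the address is in the kernel half of the address space."""
    return bool((vaddr >> 63) & 1)


def data_access_allowed(cpu: Cpu, vaddr: int, pte_user: bool) -> bool:
    """Return True if this toy model permits a data access to vaddr.

    pte_user models the USER bit of the mapping (USER=1 in the mail above).
    """
    if cpu.user_mode:
        # With LASS, user mode cannot touch kernel-half addresses at all,
        # regardless of what the page tables say.
        if cpu.lass and is_kernel_half(vaddr):
            return False
        # Without LASS, only the USER bit of the mapping matters here.
        return pte_user
    # Supervisor mode: SMAP blocks USER=1 pages unless STAC has set EFLAGS.AC.
    if cpu.smap and pte_user and not cpu.ac:
        return False
    return True


# Hypothetical kernel-half address for the guest_memfd mapping (an assumption).
GUEST_VA = 0xFFFF_C900_0000_0000
```

In this model, a kernel access to GUEST_VA with USER=1 is refused until AC is
set (the STAC/CLAC bracket), a user-mode access is permitted without LASS, and
the same user-mode access is refused once lass=True, which mirrors the three
properties described above.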