Message-ID: <6bca3ad4-3eca-4a75-a775-5f8b0467d7a3@amazon.co.uk>
Date: Thu, 10 Oct 2024 17:21:11 +0100
Subject: Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
To: Sean Christopherson, Ackerley Tng
From: Patrick Roy
Sender: owner-linux-mm@kvack.org

On Tue, 2024-10-08 at 20:56 +0100, Sean Christopherson wrote:
> On Tue, Oct 08, 2024, Ackerley Tng wrote:
>> Patrick Roy writes:
>>> For the "non-CoCo with direct map entries removed" VMs that we at AWS
>>> are going for, we'd like a VM type with host-controlled in-place
>>> conversions which doesn't zero on transitions,
>
> Hmm, your use case shouldn't need conversions _for KVM_, as there's no need for
> KVM to care if userspace or the guest _wants_ a page to be shared vs. private.
> Userspace is fully trusted to manage things; KVM simply reacts to the current
> state of things.
>
> And more importantly, whether or not the direct map is zapped needs to be a
> property of the guest_memfd inode, i.e. can't be associated with a struct kvm.
> I forget who got volunteered to do the work,

I think me? At least we talked about it briefly.

> but we're going to need similar
> functionality for tracking the state of individual pages in a huge folio, as
> folio_mark_uptodate() is too coarse-grained. I.e. at some point, I expect that
> guest_memfd will make it easy-ish to determine whether or not the direct map has
> been obliterated.
>
> The shared vs. private attributes tracking in KVM is still needed (I think), as
> it communicates what userspace _wants_, whereas the guest_memfd machinery will
> track what the state _is_.

If I'm understanding this patch series correctly, the approach taken
here is to force the KVM memory attributes and the internal guest_memfd
state to be in sync, because the VMA from mmap()ing guest_memfd is
reflected back into the userspace_addr of the memslot. So, to me, in
this world, "direct map zapped iff
kvm_has_mem_attributes(KVM_MEMORY_ATTRIBUTES_PRIVATE)", with memory
attribute changes forcing the corresponding gmem state change. That's
why I was talking about conversions above.
I've played around with this locally, and since KVM seems to generally
use copy_from_user and friends to access the userspace_addr VMA (aka
private mem that's reflected back into memslots here), with this things
like MMIO emulation can be oblivious to gmem's existence, since
copy_from_user and co don't require GUP or presence of direct map
entries (well, "oblivious" in the sense that things like kvm_read_guest
currently ignore memory attributes and unconditionally access
userspace_addr, which I suppose is not really wanted for VMs where
userspace_addr and guest_memfd aren't short-circuited like this). The
exception is kvm_clock, where the pv_time page would need to be
explicitly converted to shared to restore the direct map entry, although
I think we could just let userspace deal with making sure this page is
shared (and then, if gmem supports GUP on shared memory, even the
gfn_to_pfn_caches could work without gmem knowledge. Without GUP, we'd
still need a tiny hack in the uhva->pfn translation somewhere to handle
gmem vmas, but iirc you did mention that having kvm-clock be special
might be fine).

I guess it does come down to what you note below, answering the question
of "how does KVM internally access guest_memfd for non-CoCo VMs". Is
there any way we can make uaccesses like above work? I've finally gotten
around to re-running some performance benchmarks of my on-demand
reinsertion patches with all the needed TLB flushes added, and my fio
benchmark on a virtio-blk device suffers a ~50% throughput regression,
which does not necessarily spark joy. And I think James H. mentioned at
LPC that making the userfault stuff work with my patches would be quite
hard. All this in addition to you also not necessarily sounding too keen
on it either :D

>>> so if KVM_X86_SW_PROTECTED_VM ends up zeroing, we'd need to add another new
>>> VM type for that.
>
> Maybe we should sneak in a s/KVM_X86_SW_PROTECTED_VM/KVM_X86_SW_HARDENED_VM rename?
> The original thought behind "software protected VM" was to do a slow build of
> something akin to pKVM, but realistically I don't think that idea is going anywhere.

Ah, admittedly I've thought of KVM_X86_SW_PROTECTED_VM as a bit of a
playground where various configurations other VM types enforce can be
mixed and matched (e.g. zero on conversions yes/no, direct map removal
yes/no), so more of a KVM_X86_GMEM_VM, but am happy to update my
understanding :)

> Alternatively, depending on how KVM accesses guest memory that's been removed from
> the direct map, another solution would be to allow "regular" VMs to bind memslots
> to guest_memfd, i.e. if the non-CoCo use case needs/wants to bind all memory to
> guest_memfd, not just "private" mappings.
>
> That's probably the biggest topic of discussion: how do we want to allow mapping
> guest_memfd into the guest, without direct map entries, but while still allowing
> KVM to access guest memory as needed, e.g. for shadow paging. One approach is
> your RFC, where KVM maps guest_memfd pfns on-demand.
>
> Another (slightly crazy) approach would be use protection keys to provide the
> security properties that you want, while giving KVM (and userspace) a quick-and-easy
> override to access guest memory.
>
>  1. mmap() guest_memfd into userspace with RW protections
>  2. Configure PKRU to make guest_memfd memory inaccessible by default
>  3. Swizzle PKRU on-demand when intentionally accessing guest memory
>
> It's essentially the same idea as SMAP+STAC/CLAC, just applied to guest memory
> instead of to userspace memory.
>
> The benefit of the PKRU approach is that there are no PTE modifications, and thus
> no TLB flushes, and only the CPU that is accessing guest memory gains temporary
> access. The big downside is that it would be limited to modern hardware, but
> that might be acceptable, especially if it simplifies KVM's implementation.

Mh, but we only have 16 protection keys, so we cannot give each VM a
unique one.
And if all guest memory shares the same protection key, then during the
on-demand swizzling the CPU would get access to _all_ guest memory on
the host, which "feels" scary. What do you think, @Derek?

Does ARM have something equivalent, btw?

>>> Somewhat related sidenote: For VMs that allow inplace conversions and do
>>> not zero, we do not need to zap the stage-2 mappings on memory attribute
>>> changes, right?
>
> See above. I don't think conversions by toggling the shared/private flag in
> KVM's memory attributes is the right fit for your use case.
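
To make the shared-key concern about the PKRU approach concrete, below
is a minimal userspace model of the PKRU bit layout (two bits per key,
access-disable and write-disable, 16 keys in total). It is only a sketch
under stated assumptions: GMEM_PKEY, struct gmem_page and the helper
names are hypothetical and exist neither in KVM nor in guest_memfd, and
nothing here touches the real PKRU register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PKEYS 16u  /* x86 protection keys: 4-bit key per PTE */
#define PKRU_AD   0x1u /* access-disable bit for a key */
#define PKRU_WD   0x2u /* write-disable bit for a key */

/* Hypothetical: the single key shared by all guest_memfd mappings. */
#define GMEM_PKEY 1u

/* Hypothetical model of a guest page: owning VM and the key tagging it. */
struct gmem_page {
	unsigned int vm_id;
	unsigned int pkey;
};

/* Return a PKRU value with all access to pkey disabled. */
static uint32_t pkru_deny(uint32_t pkru, unsigned int pkey)
{
	assert(pkey < NUM_PKEYS);
	return pkru | (PKRU_AD << (2 * pkey));
}

/* Return a PKRU value with read and write access to pkey enabled. */
static uint32_t pkru_allow(uint32_t pkru, unsigned int pkey)
{
	assert(pkey < NUM_PKEYS);
	return pkru & ~((PKRU_AD | PKRU_WD) << (2 * pkey));
}

/* Would a read through a mapping tagged with pkey succeed? */
static bool pkru_can_read(uint32_t pkru, unsigned int pkey)
{
	assert(pkey < NUM_PKEYS);
	return !(pkru & (PKRU_AD << (2 * pkey)));
}

/* The check depends only on the page's key, never on the owning VM. */
static bool can_read_page(uint32_t pkru, const struct gmem_page *page)
{
	return pkru_can_read(pkru, page->pkey);
}
```

In this model, calling pkru_allow() for GMEM_PKEY in order to touch one
VM's page makes can_read_page() return true for every other VM's page
carrying the same key, for as long as that CPU keeps the swizzled value.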