Date: Tue, 8 Aug 2023 14:13:26 -0700
References: <20230718234512.1690985-13-seanjc@google.com>
Subject: Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
From: Sean Christopherson
To: Ackerley Tng
Cc: pbonzini@redhat.com, maz@kernel.org, oliver.upton@linux.dev,
 chenhuacai@kernel.org, mpe@ellerman.id.au, anup@brainfault.org,
 paul.walmsley@sifive.com, palmer@dabbelt.com, aou@eecs.berkeley.edu,
 willy@infradead.org, akpm@linux-foundation.org, paul@paul-moore.com,
 jmorris@namei.org, serge@hallyn.com, kvm@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
 linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
 kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org,
 chao.p.peng@linux.intel.com, tabba@google.com, jarkko@kernel.org,
 yu.c.zhang@linux.intel.com, vannapurve@google.com,
 mail@maciej.szmigiero.name, vbabka@suse.cz, david@redhat.com,
 qperret@google.com, michael.roth@amd.com, wei.w.wang@intel.com,
 liam.merwick@oracle.com, isaku.yamahata@gmail.com,
 kirill.shutemov@linux.intel.com

On Mon, Aug 07, 2023, Ackerley Tng wrote:
> I’d like to propose an alternative to the refcounting approach between
> the gmem file and associated kvm, where we think of KVM’s memslots as
> users of the gmem file.
>
> Instead of having the gmem file pin the VM (i.e. take a refcount on
> kvm), we could let memslot take a refcount on the gmem file when the
> memslots are configured.
>
> Here’s a POC patch that flips the refcounting (and modified selftests in
> the next commit):
> https://github.com/googleprodkernel/linux-cc/commit/7f487b029b89b9f3e9b094a721bc0772f3c8c797
>
> One side effect of having the gmem file pin the VM is that now the gmem
> file becomes sort of a false handle on the VM:
>
> + Closing the file destroys the file pointers in the VM and invalidates
>   the pointers

Yeah, this is less than ideal. But, it's also how things operate today. KVM
doesn't hold references to VMAs or files, e.g. if userspace munmap()s memory,
any and all SPTEs pointing at the memory are zapped.
The only difference with gmem is that KVM needs to explicitly invalidate file
pointers, instead of that happening behind the scenes (no more VMAs to find).
Again, I agree the resulting code is more complex than I would prefer, but
from a userspace perspective I don't see this as problematic.

> + Keeping the file open keeps the VM around in the kernel even though
>   the VM fd may already be closed.

That is perfectly ok. There is plenty of prior art, as well as plenty of ways
for userspace to shoot itself in the foot. E.g. open a stats fd for a vCPU
and the VM and all its vCPUs will be kept alive. And conceptually it's sound,
anything created in the scope of a VM _should_ pin the VM.

> I feel that memslots form a natural way of managing usage of the gmem
> file. When a memslot is created, it is using the file; hence we take a
> refcount on the gmem file, and as memslots are removed, we drop
> refcounts on the gmem file.

Yes and no. It's definitely more natural *if* the goal is to allow
guest_memfd memory to exist without being attached to a VM. But I'm not at
all convinced that we want to allow that, or that it has desirable
properties. With TDX and SNP in particular, I'm pretty sure that allowing
memory to outlive the VM is very undesirable (more below).

> The KVM pointer is shared among all the bindings in gmem’s xarray, and we can
> enforce that a gmem file is used only with one VM:
>
> + When binding a memslot to the file, if a kvm pointer exists, it must
>   be the same kvm as the one in this binding
> + When the binding to the last memslot is removed from a file, NULL the
>   kvm pointer.

Nullifying the KVM pointer isn't sufficient, because without additional
actions userspace could extract data from a VM by deleting its memslots and
then binding the guest_memfd to an attacker-controlled VM. Or more likely
with TDX and SNP, induce badness by coercing KVM into mapping memory into a
guest with the wrong ASID/HKID.
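
To make the concern concrete, here is a minimal userspace model of the two
proposed rules. This is only a sketch, not kernel code: struct gmem,
gmem_bind(), gmem_unbind() and rebind_leaks_data() are hypothetical names
invented for illustration, and a plain counter stands in for the xarray of
bindings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct vm {
	int id;
};

struct gmem {
	struct vm *kvm;		/* owner while at least one binding exists */
	int nr_bindings;	/* stands in for the xarray of bindings */
	char data[16];		/* stands in for guest private memory */
};

/* Proposed rule 1: if a kvm pointer exists, it must match the binder. */
static int gmem_bind(struct gmem *g, struct vm *kvm)
{
	if (g->kvm && g->kvm != kvm)
		return -1;
	g->kvm = kvm;
	g->nr_bindings++;
	return 0;
}

/* Proposed rule 2: NULL the kvm pointer when the last binding is removed. */
static void gmem_unbind(struct gmem *g)
{
	assert(g->nr_bindings > 0);
	if (--g->nr_bindings == 0)
		g->kvm = NULL;	/* note: data[] is left untouched */
}

/* Returns 1 if a second VM ends up bound to memory still holding old data. */
int rebind_leaks_data(void)
{
	struct vm victim = { 1 }, attacker = { 2 };
	struct gmem g = { 0 };

	if (gmem_bind(&g, &victim))
		return 0;
	strcpy(g.data, "secret");

	/* While the victim is bound, the attacker is correctly rejected. */
	if (gmem_bind(&g, &attacker) == 0)
		return 0;

	/* Delete the last memslot, then bind from the other VM. */
	gmem_unbind(&g);
	if (gmem_bind(&g, &attacker))
		return 0;

	return strcmp(g.data, "secret") == 0;
}
```

In this model the same-VM check only holds while a binding exists; once the
pointer is NULLed, nothing ties the contents to the original VM.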

I can think of three ways to handle that:

  (a) prevent a different VM from *ever* binding to the gmem instance
  (b) free/zero physical pages when unbinding
  (c) free/zero when binding to a different VM

Option (a) is easy, but that pretty much defeats the purpose of decoupling
guest_memfd from a VM.

Option (b) isn't hard to implement, but it screws up the lifecycle of the
memory, e.g. would require freeing/zeroing memory when a memslot is deleted.
That isn't necessarily a deal-breaker, but it runs counter to how KVM
memslots currently operate. Memslots are basically just weird page tables,
e.g. deleting a memslot doesn't have any impact on the underlying data in
memory. TDX throws a wrench in this as removing a page from the Secure EPT
is effectively destructive to the data (can't be mapped back into the VM
without zeroing the data), but IMO that's an oddity with TDX and not
necessarily something we want to carry over to other VM types.

There would also be performance implications (probably a non-issue in
practice), and weirdness if/when we get to sharing, linking and/or mmap()ing
gmem. E.g. what should happen if the last memslot (binding) is deleted, but
there are outstanding userspace mappings?

Option (c) is better from a lifecycle perspective, but it adds its own flavor
of complexity, e.g. the performant way to reclaim TDX memory requires the
TDMR (effectively the VM pointer), and so a deferred reclaim doesn't really
work for TDX. And I'm pretty sure it *can't* work for SNP, because RMP
entries must not outlive the VM; KVM can't reuse an ASID if there are pages
assigned to that ASID in the RMP, i.e. until all memory belonging to the VM
has been fully freed.

> Could binding gmem files not on creation, but at memslot configuration
> time be sufficient and simpler?

After working through the flows, I think binding on-demand would simplify the
refcounting (stating the obvious), but complicate the lifecycle of the memory
as well as the contract between KVM and userspace, and would break the
separation of concerns between the inode (physical memory / data) and file
(VM's view / mappings).
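
For reference, the lifetime rule the current approach relies on (the gmem
file pins the VM) can be modeled in a few lines of userspace C. This is a
sketch under stated assumptions: vm_get()/vm_put() only loosely mirror
kvm_get_kvm()/kvm_put_kvm(), and struct gmem_file, gmem_create() and
gmem_release() are hypothetical names, not the code in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of "the gmem file pins the VM". */
struct vm {
	int users;	/* reference count */
	bool destroyed;
};

static void vm_get(struct vm *vm)
{
	vm->users++;
}

static void vm_put(struct vm *vm)
{
	assert(vm->users > 0);
	if (--vm->users == 0)
		vm->destroyed = true;
}

struct gmem_file {
	struct vm *kvm;
};

/* Creating the file in the scope of a VM takes a reference on that VM. */
static void gmem_create(struct gmem_file *f, struct vm *vm)
{
	vm_get(vm);
	f->kvm = vm;
}

/* Releasing the file drops the reference taken at creation. */
static void gmem_release(struct gmem_file *f)
{
	vm_put(f->kvm);
	f->kvm = NULL;
}

/*
 * Returns true if the VM survives closing the VM fd while the gmem file is
 * open, and is destroyed once the gmem file is released.
 */
bool vm_outlives_fd_until_gmem_release(void)
{
	struct vm vm = { .users = 1, .destroyed = false }; /* ref from VM fd */
	struct gmem_file f;
	bool alive_after_close;

	gmem_create(&f, &vm);
	vm_put(&vm);			/* userspace closes the VM fd */
	alive_after_close = !vm.destroyed;

	gmem_release(&f);		/* userspace closes the gmem fd */
	return alive_after_close && vm.destroyed;
}
```

This is the same pattern as the stats fd example: the object created in the
scope of the VM keeps the VM alive until that object goes away.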