From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 27 Jul 2023 10:13:07 -0700
Mime-Version: 1.0
References: <20230718234512.1690985-1-seanjc@google.com>
 <20230718234512.1690985-13-seanjc@google.com>
Subject: Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for
 guest-specific backing memory
From: Sean Christopherson
To: Fuad Tabba
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
 Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
 "Matthew Wilcox (Oracle)", Andrew Morton, Paul Moore, James Morris,
 "Serge E. Hallyn", kvm@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
 linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
 kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org,
 Chao Peng, Jarkko Sakkinen, Yu Zhang, Vishal Annapurve, Ackerley Tng,
 Maciej Szmigiero, Vlastimil Babka, David Hildenbrand, Quentin Perret,
 Michael Roth, Wang, Liam Merwick, Isaku Yamahata, "Kirill A . Shutemov"
Content-Type: text/plain; charset="us-ascii"
Sender: owner-linux-mm@kvack.org
Precedence: bulk

On Thu, Jul 27, 2023, Fuad Tabba wrote:
> Hi Sean,
>
> ...
>
> > @@ -5134,6 +5167,16 @@ static long kvm_vm_ioctl(struct file *filp,
> >  	case KVM_GET_STATS_FD:
> >  		r = kvm_vm_ioctl_get_stats_fd(kvm);
> >  		break;
> > +	case KVM_CREATE_GUEST_MEMFD: {
> > +		struct kvm_create_guest_memfd guest_memfd;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&guest_memfd, argp, sizeof(guest_memfd)))
> > +			goto out;
> > +
> > +		r = kvm_gmem_create(kvm, &guest_memfd);
> > +		break;
> > +	}
>
> I'm thinking line of sight here, by having this as a vm ioctl (rather
> than a system iocl), would it complicate making it possible in the
> future to share/donate memory between VMs?

Maybe, but I hope not?

There would still be a primary owner of the memory, i.e. the memory would
still need to be allocated in the context of a specific VM. And the primary
owner should be able to restrict privileges, e.g. allow a different VM to
read but not write memory.

My current thinking is to (a) tie the lifetime of the backing pages to the
inode, i.e.
allow allocations to outlive the original VM, and (b) create a new file each
time memory is shared/donated with a different VM (or other entity in the
kernel).

That should make it fairly straightforward to provide different permissions,
e.g. track them per-file, and I think should also avoid the need to change
the memslot binding logic since each VM would have its own view/bindings.

Copy+pasting a relevant snippet from a lengthier response in a different
thread[*]:

Conceptually, I think KVM should bind to the file. The inode is effectively
the raw underlying physical storage, while the file is the VM's view of that
storage.

Practically, I think that gives us a clean, intuitive way to handle
intra-host migration. Rather than transfer ownership of the file, instantiate
a new file for the target VM, using the gmem inode from the source VM, i.e.
create a hard link. That'd probably require new uAPI, but I don't think that
will be hugely problematic. KVM would need to ensure the new VM's guest_memfd
can't be mapped until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need
to verify the memslots/bindings are identical), but that should be easy
enough to enforce.

That way, a VM, its memslots, and its SPTEs are tied to the file, while
allowing the memory and the *contents* of memory to outlive the VM, i.e. be
effectively transferred to the new target VM. And we'll maintain the
invariant that each guest_memfd is bound 1:1 with a single VM.

As above, that should also help us draw the line between mapping memory into
a VM (file), and freeing/reclaiming the memory (inode).

There will be extra complexity/overhead as we'll have to play nice with the
possibility of multiple files per inode, e.g. to zap mappings across all
files when punching a hole, but the extra complexity is quite small, e.g. we
can use address_space.private_list to keep track of the guest_memfd instances
associated with the inode.
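
A rough userspace model of the inode/file split described above, in case it
helps make the idea concrete. This is not kernel code and not what the series
implements: every struct, field, and function name below is made up for
illustration, the fixed-size arrays stand in for real page and SPTE tracking,
and the singly linked list merely stands in for address_space.private_list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GMEM_NR_PAGES 8		/* arbitrary size for the model */

struct gmem_file;

/* The "inode": raw backing storage, shared by every file that views it. */
struct gmem_inode {
	bool page_present[GMEM_NR_PAGES];
	struct gmem_file *files;	/* hypothetical stand-in for private_list */
};

/* The "file": one VM's view of the inode, with its own bindings and rights. */
struct gmem_file {
	struct gmem_inode *inode;
	struct gmem_file *next;
	int vm_id;			/* each file is bound 1:1 to a single VM */
	bool may_fallocate;		/* hypothetical per-file permission */
	uint64_t base_gfn;		/* gfn that page 0 is bound to in this VM */
	bool mapped[GMEM_NR_PAGES];	/* stand-in for this VM's SPTEs */
};

/* Create a new view of an existing inode and link it into the inode's list. */
static void gmem_file_init(struct gmem_file *f, struct gmem_inode *inode,
			   int vm_id, uint64_t base_gfn, bool may_fallocate)
{
	size_t i;

	assert(f && inode);
	f->inode = inode;
	f->vm_id = vm_id;
	f->base_gfn = base_gfn;
	f->may_fallocate = may_fallocate;
	for (i = 0; i < GMEM_NR_PAGES; i++)
		f->mapped[i] = false;
	f->next = inode->files;
	inode->files = f;
}

static bool gmem_range_ok(size_t start, size_t nr)
{
	return start <= GMEM_NR_PAGES && nr <= GMEM_NR_PAGES - start;
}

/* Allocate backing pages; the permission check is per file. */
static int gmem_allocate(struct gmem_file *f, size_t start, size_t nr)
{
	size_t i;

	if (!f->may_fallocate || !gmem_range_ok(start, nr))
		return -1;
	for (i = start; i < start + nr; i++)
		f->inode->page_present[i] = true;
	return 0;
}

/* Map one page into this file's VM; returns the gfn, or -1 on failure. */
static int64_t gmem_map(struct gmem_file *f, size_t pgoff)
{
	if (pgoff >= GMEM_NR_PAGES || !f->inode->page_present[pgoff])
		return -1;
	f->mapped[pgoff] = true;
	return (int64_t)(f->base_gfn + pgoff);
}

/* Punch a hole: permission is per file, but zapping covers every file. */
static int gmem_punch_hole(struct gmem_file *f, size_t start, size_t nr)
{
	struct gmem_file *it;
	size_t i;

	if (!f->may_fallocate || !gmem_range_ok(start, nr))
		return -1;
	for (i = start; i < start + nr; i++) {
		for (it = f->inode->files; it; it = it->next)
			it->mapped[i] = false;
		f->inode->page_present[i] = false;
	}
	return 0;
}
```

In this model the permission check looks only at the calling file, while the
zap walks every file hanging off the inode, which is the extra bookkeeping
mentioned above.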

Setting aside TDX and SNP for the moment, as it's not clear how they'll
support memory that is "private" but shared between multiple VMs, I think
per-VM files would work well for sharing gmem between two VMs. E.g. it would
allow a given page to be bound to a different gfn for each VM, and would
allow having different permissions for each file (e.g. to allow fallocate()
only from the original owner).

[*] https://lore.kernel.org/all/ZLGiEfJZTyl7M8mS@google.com