From: Fuad Tabba
Date: Mon, 31 Jul 2023 14:46:50 +0100
Subject: Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
To: Sean Christopherson
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen,
 Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
 "Matthew Wilcox (Oracle)", Andrew Morton, Paul Moore, James Morris,
 "Serge E. Hallyn", kvm@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
 linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
 kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org,
 Chao Peng, Jarkko Sakkinen, Yu Zhang, Vishal Annapurve, Ackerley Tng,
 Maciej Szmigiero, Vlastimil Babka, David Hildenbrand, Quentin Perret,
 Michael Roth, Wang, Liam Merwick, Isaku Yamahata, "Kirill A . Shutemov"
References: <20230718234512.1690985-1-seanjc@google.com> <20230718234512.1690985-13-seanjc@google.com>

Hi Sean,

On Thu, Jul 27, 2023 at 6:13 PM Sean Christopherson wrote:
>
> On Thu, Jul 27, 2023, Fuad Tabba wrote:
> > Hi Sean,
> >
> > ...
> >
> > > @@ -5134,6 +5167,16 @@ static long kvm_vm_ioctl(struct file *filp,
> > >         case KVM_GET_STATS_FD:
> > >                 r = kvm_vm_ioctl_get_stats_fd(kvm);
> > >                 break;
> > > +       case KVM_CREATE_GUEST_MEMFD: {
> > > +               struct kvm_create_guest_memfd guest_memfd;
> > > +
> > > +               r = -EFAULT;
> > > +               if (copy_from_user(&guest_memfd, argp, sizeof(guest_memfd)))
> > > +                       goto out;
> > > +
> > > +               r = kvm_gmem_create(kvm, &guest_memfd);
> > > +               break;
> > > +       }
> >
> > I'm thinking line of sight here, by having this as a vm ioctl (rather
> > than a system ioctl), would it complicate making it possible in the
> > future to share/donate memory between VMs?
>
> Maybe, but I hope not?
>
> There would still be a primary owner of the memory, i.e. the memory would still
> need to be allocated in the context of a specific VM. And the primary owner should
> be able to restrict privileges, e.g. allow a different VM to read but not write
> memory.
>
> My current thinking is to (a) tie the lifetime of the backing pages to the inode,
> i.e. allow allocations to outlive the original VM, and (b) create a new file each
> time memory is shared/donated with a different VM (or other entity in the kernel).
>
> That should make it fairly straightforward to provide different permissions, e.g.
> track them per-file, and I think should also avoid the need to change the memslot
> binding logic since each VM would have its own view/bindings.
>
> Copy+pasting a relevant snippet from a lengthier response in a different thread[*]:
>
> Conceptually, I think KVM should bind to the file. The inode is effectively
> the raw underlying physical storage, while the file is the VM's view of that
> storage.

I'm not aware of any implementation of sharing memory between VMs in
KVM before (afaik, since there was no need for one). The following is
me thinking out loud, rather than any strong opinions on my part.

If an allocation can outlive the original VM, then why associate it
with that (or a) VM to begin with? Wouldn't it be more flexible if it
were a system-level construct, which is effectively what it was in
previous iterations of this? This doesn't rule out binding to the
file, and keeping the inode as the underlying physical storage.

The binding of a VM to a guestmem object could happen implicitly with
KVM_SET_USER_MEMORY_REGION2, or we could have a new ioctl specifically
for handling binding.

Cheers,
/fuad

> Practically, I think that gives us a clean, intuitive way to handle intra-host
> migration. Rather than transfer ownership of the file, instantiate a new file
> for the target VM, using the gmem inode from the source VM, i.e. create a hard
> link. That'd probably require new uAPI, but I don't think that will be hugely
> problematic. KVM would need to ensure the new VM's guest_memfd can't be mapped
> until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
> memslots/bindings are identical), but that should be easy enough to enforce.
>
> That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing
> the memory and the *contents* of memory to outlive the VM, i.e. be effectively
> transferred to the new target VM. And we'll maintain the invariant that each
> guest_memfd is bound 1:1 with a single VM.
>
> As above, that should also help us draw the line between mapping memory into a
> VM (file), and freeing/reclaiming the memory (inode).
>
> There will be extra complexity/overhead as we'll have to play nice with the
> possibility of multiple files per inode, e.g. to zap mappings across all files
> when punching a hole, but the extra complexity is quite small, e.g. we can use
> address_space.private_list to keep track of the guest_memfd instances associated
> with the inode.
>
> Setting aside TDX and SNP for the moment, as it's not clear how they'll support
> memory that is "private" but shared between multiple VMs, I think per-VM files
> would work well for sharing gmem between two VMs. E.g. would allow a given page
> to be bound to a different gfn for each VM, would allow having different permissions
> for each file (e.g. to allow fallocate() only from the original owner).
>
> [*] https://lore.kernel.org/all/ZLGiEfJZTyl7M8mS@google.com