From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id ECEC4C4332F for ; Wed, 1 Nov 2023 10:52:11 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 6CD628E0002; Wed, 1 Nov 2023 06:52:11 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 67E9D8D0001; Wed, 1 Nov 2023 06:52:11 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 546248E0002; Wed, 1 Nov 2023 06:52:11 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 413748D0001 for ; Wed, 1 Nov 2023 06:52:11 -0400 (EDT) Received: from smtpin30.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 07D3F80C36 for ; Wed, 1 Nov 2023 10:52:11 +0000 (UTC) X-FDA: 81409070862.30.A77006A Received: from mail-qv1-f47.google.com (mail-qv1-f47.google.com [209.85.219.47]) by imf11.hostedemail.com (Postfix) with ESMTP id 4098040014 for ; Wed, 1 Nov 2023 10:52:09 +0000 (UTC) Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=EUaJqb5m; spf=pass (imf11.hostedemail.com: domain of tabba@google.com designates 209.85.219.47 as permitted sender) smtp.mailfrom=tabba@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1698835929; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=+1C51gwP1j6EX8xWamHNo0Nti+w4//VV3v7LQeJatAc=; b=jfLcVZ7xU9MY6lXI//D/RU0+KnvT30kcYS0e8xbBurF3ECO+sBMYu8XZZiiwIybsHSmUHj tuzHN/x8Nomrdd6UGeXcQMEmnXrsQaZxRIoIg46yWjGBPMU2FoVtBigVA5EfXQsIZ0t0DI JWtI3leE1ZxwAKwRQDRu1t5V+2cpdGg= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1698835929; a=rsa-sha256; cv=none; b=vR1pYQDphUVk+HExoNufIVtxm9/OCoWlhquVqkVhXYRoHU01/IrIdr5kSV/AS1wD1cPYHa Y3/HdoVY4fh/w18xlF5zOZwS7zYYPWWG9VjasBpQBT3Wg2jYhi+6PzLb4KsGncsCRR0+uK l+6PpW+EhcjSl+iQYj6pmdo8vdRu6I0= ARC-Authentication-Results: i=1; imf11.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=EUaJqb5m; spf=pass (imf11.hostedemail.com: domain of tabba@google.com designates 209.85.219.47 as permitted sender) smtp.mailfrom=tabba@google.com; dmarc=pass (policy=reject) header.from=google.com Received: by mail-qv1-f47.google.com with SMTP id 6a1803df08f44-66d09b6d007so44845076d6.1 for ; Wed, 01 Nov 2023 03:52:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1698835928; x=1699440728; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=+1C51gwP1j6EX8xWamHNo0Nti+w4//VV3v7LQeJatAc=; b=EUaJqb5m0rlPSGirV07jqPNaTpPAeTNbwee/BHQKkIP04J4cyKkMZ++5oGh9GX1lGz jG+9MwJ6XrGSZynMvyw+zGjA9TZQim0bkG31/mhjmPxh3BBGPOBcHmqhFtIIdoFcbiqz aV8r+DQ43yD7ny4SQKEgbNABuhDnf351b9TO1jQmAfGbD/knVotXGEKn7IVAkn/TUM7s 6XwkgWD/36ZndV3Vaygt+mzyDG+78tLU2iIrH6r8QvVJ8xY9kbXBxyUyGHXQKGVRvr16 7BTbV+MYZwGrQz+J7o/K3zbtotPKJMY+P6mISIzqVQhdsjaoVMxQ7+k0o3xbzaTWJ6jQ e17g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1698835928; x=1699440728; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=+1C51gwP1j6EX8xWamHNo0Nti+w4//VV3v7LQeJatAc=; b=ugKlpsunPcJP1i6dWwFifoRTBy0/Ir2ZfrOVmY6zqULUjsesDVrZrDu9Y9mC4lO4WL GCMnN32iGI2WHYByZxluFO44/BW4/LnscoQZK8JJO2rxypUxRSLUBOjqHRiDKYtJuW5d DcpVGMrbuouq4D8S0jsACX8YnoA4Qkh4hYr4o4z2JAFgJE6LZPno3p1KCVf8D/zgrTt6 hZik/XE4EJVSjHADtuU2ysB/neu+Yr9EB4ExKDpiFqet6oxvHlxQiPrXLtk6uYWNUO5A Pw3PDdftZMAfgabFpLgFnKqXxQFZgphGAthx+Mu7c/MqxuxfWgDQnrHQEFq3ym7exipO CACw== X-Gm-Message-State: AOJu0Yz7x2fcqPXO8FoWXc9bT2NDvH3vQoZBZtKWdCMfNVojQPj326np 1O4XQDV3pi/kUBvRi4OPtzt6ZyjbM5c99iZ5O16/lQ== X-Google-Smtp-Source: AGHT+IGBD7AeuMGXkTRP2ep4CYUi9124TDOVpS389Mu+hB31XqL4gTKvstmlOGh29as6609/Xh60qBUoSzB28pjDL10= X-Received: by 2002:a0c:d64a:0:b0:675:5111:4cec with SMTP id e10-20020a0cd64a000000b0067551114cecmr114725qvj.58.1698835928102; Wed, 01 Nov 2023 03:52:08 -0700 (PDT) MIME-Version: 1.0 References: <20231027182217.3615211-1-seanjc@google.com> <20231027182217.3615211-17-seanjc@google.com> In-Reply-To: From: Fuad Tabba Date: Wed, 1 Nov 2023 10:51:31 +0000 Message-ID: Subject: Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory To: Sean Christopherson Cc: Paolo Bonzini , Marc Zyngier , Oliver Upton , Huacai Chen , Michael Ellerman , Anup Patel , Paul Walmsley , Palmer Dabbelt , Albert Ou , Alexander Viro , Christian Brauner , "Matthew Wilcox (Oracle)" , Andrew Morton , kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Xiaoyao Li , Xu Yilun , Chao Peng , Jarkko Sakkinen , Anish Moorthy , David Matlack , Yu Zhang , Isaku Yamahata , =?UTF-8?B?TWlja2HDq2wgU2FsYcO8bg==?= , Vlastimil Babka , Vishal Annapurve , Ackerley Tng , Maciej Szmigiero , David Hildenbrand , Quentin Perret , Michael Roth , Wang , Liam Merwick , Isaku Yamahata , "Kirill A . Shutemov" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: 5ijoap8aae8ody9webo8pzt8m4crsyrw X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 4098040014 X-Rspam-User: X-HE-Tag: 1698835929-879995 X-HE-Meta: U2FsdGVkX1/XWGjan9vfLW8Z9AmgfYWmoo+l1/cWwY5ZZUEXsKs9BgtuWkUm7qQAEN8Byg5gDZrgy3FqmW0wQRoMGWZgGC1FgFBrGLAXJGhk8oBkiOv3kKQaH/GouLjLRjCZ5e4z1usNqvnu47vg+sF+/XYgIswsbROWUc/SYy+6THP6vRv/OGSjmKRMps4Tbvn6hvbpAtDrNKYAHTL9B/MyCrIyZ/T5GJPRah4AhH8ONYBh3ObCtKgbFOmERimn79EeC3nwIFUe45/EU1WeM39pfQTIStW5Gf/EwDGYx2qTtUEisiPREQe4LxrKMxW8o7gTN0CrnftYX6kiWjZx9+Oezb82GZ8qpOqpOVm31c2p6X0loSW2SlFonKeLXOPz45tZRxYkbHnjnP306WMoPRXqcD8KyUJSido+Zuuwp+SzHiWqxxxV4ZTJZatuBPxbL72HGiq2I6F4V1Xr+CqptN+IHLJjKMJ0ydB555aoIU0HjRTtho3oUJfZMMscuTWnJyZs/p8+k8SixuvQaWVYhR7FMTnstYZkuOeiNIxq7J7BT7O+JVXa77i9x7B88OqLKjn6wBlEDOvOwiy8mYHOVRHe0vGDllQG+rNF7qyivvusBCuqooFjSwU7+Ae/6sfDVFKveAeVF68D60D7fqn/6pHY2HjdzfYzH1guhdwpXLFHDrJKl2x2u+knQRCCYzeexwgR1O+wz/2kU0geBrWVtW2TYPNs4TO1PwPYqf1E8wiYTREoaZ8tqvWAhqACC32D965ljujQsSJgXpxKGtpctXx3g2b6A7QpLYCG0/qMXCRadAByribeSACYsizMJ0pPSV2HNzICrHZt2DufT8kLGx4y7OYu8ORiPL5hW+BEUc3zNR9pWvUfVkCVlnHblFU5IhAdbBDpr+jr4VquxSBhe/KGxlNUrPR/AXQzta5Rd02phWISOkHA2DTckQkHp3NNuPZZP+XoXKgVUQiR5ir GIgIIzn3 XbdVDPCJKASRErdVMO8zJIwV0P3CHq3YaGfFBodJ0nVo7zh1jFB7awOuIyg9DvfqDPQ4zOk6C9S1HIppBV7ilpI8DwuxOhhIjlaqogKpUe6vN/f1lSgvw7EHYnFpdF42HIqV5IOItEQT332D9dZMwKh/d24zUQ7qMVntIKbX2zhXDpEXLzDBgpTrLdhLafSoHn8hwQQZ7qybEd4FaluCM/69R8koGbVE20HEsihlVeH52ab6rhkDbl+l0QiKnkVHvm+bBqFZeQpeNbylCUmFBVEdpdzlvFFH3gMrRd2p9MHpWFQh/3Jr8W5rqYzVCu1Oc0qKgDp3APd0td1MkGdVPJjz+vkSb36/3GQa7vgh6YmCUYDO+bcKv7FS3jv7rHnXk+RWuJjVob2/lzMwLMfcpMDWNmpHYHoyZJT1R3qjmP/u5rD4= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Hi Sean, On Tue, Oct 31, 2023 at 10:13=E2=80=AFPM Sean Christopherson wrote: > > On Tue, Oct 31, 2023, Fuad Tabba wrote: > > Hi, > > > > On Fri, Oct 27, 2023 at 7:23=E2=80=AFPM Sean Christopherson wrote: > > > > ... > > > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/= api.rst > > > index e2252c748fd6..e82c69d5e755 100644 > > > --- a/Documentation/virt/kvm/api.rst > > > +++ b/Documentation/virt/kvm/api.rst > > > @@ -6079,6 +6079,15 @@ applied. > > > :Parameters: struct kvm_userspace_memory_region2 (in) > > > :Returns: 0 on success, -1 on error > > > > > > +KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_R= EGION that > > > +allows mapping guest_memfd memory into a guest. All fields shared w= ith > > > +KVM_SET_USER_MEMORY_REGION identically. Userspace can set KVM_MEM_P= RIVATE in > > > +flags to have KVM bind the memory region to a given guest_memfd rang= e of > > > +[guest_memfd_offset, guest_memfd_offset + memory_size]. The target = guest_memfd > > > +must point at a file created via KVM_CREATE_GUEST_MEMFD on the curre= nt VM, and > > > +the target range must not be bound to any other memory region. All = standard > > > +bounds checks apply (use common sense). > > > + > > > > Bikeshedding here: Not sure if KVM_MEM_PRIVATE is the best name for > > this. It gets confusing with KVM_MEMORY_ATTRIBUTE_PRIVATE, i.e., that > > a region marked as KVM_MEM_PRIVATE is only potentially private. It did > > confuse the rest of the team when I walked them through a previous > > version of this code once. Would something like KVM_MEM_GUESTMEM make > > more sense? > > Heh, deja vu. We discussed this back in v7[*], and I came to the conclus= ion that > choosing a name that wasn't explicitly tied to private memory wasn't just= ified. > But that was before a KVM-owned guest_memfd was even an idea, and thus be= fore we > had anything close to a real use case. > > Since we now know that at least pKVM will use guest_memfd for shared memo= ry, and > odds are quite good that "regular" VMs will also do the same, i.e. will w= ant > guest_memfd with the concept of private memory, I agree that we should av= oid > PRIVATE. > > Though I vote for KVM_MEM_GUEST_MEMFD (or KVM_MEM_GUEST_MEMFD_VALID or > KVM_MEM_USE_GUEST_MEMFD). I.e. do our best to avoid ambiguity between re= ferring > to "guest memory" at-large and guest_memfd. Yeah I remember that discussion, but I didn't realize how confusing it was until I was presenting my work so far to the rest of the team, and it confused them quite a bit. I'm happy with any of the other alternatives that you suggest, as long as the distinction is clear. > > > -See KVM_SET_USER_MEMORY_REGION. > > > +A KVM_MEM_PRIVATE region _must_ have a valid guest_memfd (private me= mory) and > > > +userspace_addr (shared memory). However, "valid" for userspace_addr= simply > > > +means that the address itself must be a legal userspace address. Th= e backing > > > +mapping for userspace_addr is not required to be valid/populated at = the time of > > > +KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped= /allocated > > > +on-demand. > > > > Regarding requiring that a private region have both a valid > > guest_memfd and a userspace_addr, should this be > > implementation-specific? In pKVM at least, all regions for protected > > VMs are private, and KVM doesn't care about the host userspace address > > for those regions even when part of the memory is shared. > > Hmm, as of this patch, no, because the pKVM usage doesn't exist. E.g. > > . Because this literally documents the current ABI. When Ack. > > > +When mapping a gfn into the guest, KVM selects shared vs. private, i= .e consumes > > > +userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIB= UTE_PRIVATE > > > +state. At VM creation time, all memory is shared, i.e. the PRIVATE = attribute > > > +is '0' for all gfns. Userspace can control whether memory is shared= /private by > > > +toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES = as needed. > > > > In pKVM, guest memory is private by default, and most of it will > > remain so for the lifetime of the VM. Userspace could explicitly mark > > all the guest's memory as private at initialization, but it would save > > a slight amount of work. That said, I understand that it might be > > better to be consistent across implementations. > > Yeah, we discussed this in v12[*]. The default really doesn't matter for= memory > overheads or performances once supports range-based xarray entries, and i= f that > isn't sufficient, KVM can internally invert the polarity of PRIVATE. > > But for the ABI, I think we put a stake in the ground and say that all me= mory is > shared by default. That way CoCo VMs and regular VMs (i.e VMs without th= e concept > of private memory) all have the same ABI. Practically speaking, the cost= to pKVM > (and likely every other CoCo VM type) is a single ioctl() during VM creat= ion to > "convert" all memory to private. > > [*] https://lore.kernel.org/all/ZRw6X2BptZnRPNK7@google.com Ack. ... > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > index 78a0b09ef2a5..5d1a2f1b4e94 100644 > > > --- a/virt/kvm/kvm_main.c > > > +++ b/virt/kvm/kvm_main.c > > > @@ -798,7 +798,7 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm= , gfn_t start, gfn_t end) > > > } > > > } > > > > > > -static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_= range *range) > > > +bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *= range) > > > { > > > kvm_mmu_invalidate_range_add(kvm, range->start, range->end); > > > return kvm_unmap_gfn_range(kvm, range); > > > @@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm= _memory_slot *memslot) > > > /* This does not remove the slot from struct kvm_memslots data struc= tures */ > > > static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot= *slot) > > > { > > > + if (slot->flags & KVM_MEM_PRIVATE) > > > + kvm_gmem_unbind(slot); > > > + > > > > Should this be called after kvm_arch_free_memslot()? Arch-specific ode > > might need some of the data before the unbinding, something I thought > > might be necessary at one point for the pKVM port when deleting a > > memslot, but realized later that kvm_invalidate_memslot() -> > > kvm_arch_guest_memory_reclaimed() was the more logical place for it. > > Also, since that seems to be the pattern for arch-specific handlers in > > KVM. > > Maybe? But only if we can about symmetry between the allocation and free= paths > I really don't think kvm_arch_free_memslot() should be doing anything bey= ond a > "pure" free. E.g. kvm_arch_free_memslot() is also called after moving a = memslot, > which hopefully we never actually have to allow for guest_memfd, but any = code in > kvm_arch_free_memslot() would bring about "what if" questions regarding m= emslot > movement. I.e. the API is intended to be a "free arch metadata associate= d with > the memslot". > > Out of curiosity, what does pKVM need to do at kvm_arch_guest_memory_recl= aimed()? It's about the host reclaiming ownership of guest memory when tearing down a protected guest. In pKVM, we currently teardown the guest and reclaim its memory when kvm_arch_destroy_vm() is called. The problem with guestmem is that kvm_gmem_unbind() could get called before that happens, after which the host might try to access the unbound guest memory. Since the host hasn't reclaimed ownership of the guest memory from hyp, hilarity ensues (it crashes). Initially, I hooked reclaim guest memory to kvm_free_memslot(), but then I needed to move the unbind later in the function. I realized later that kvm_arch_guest_memory_reclaimed() gets called earlier (at the right time), and is more aptly named. Thanks, /fuad