From: Fuad Tabba <tabba@google.com>
Date: Tue, 31 Oct 2023 15:05:45 +0000
Subject: Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
To: Sean Christopherson
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
 Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexander Viro,
 Christian Brauner, "Matthew Wilcox (Oracle)", Andrew Morton,
 kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 kvmarm@lists.linux.dev, linux-mips@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org,
 linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Xiaoyao Li, Xu Yilun,
 Chao Peng, Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang,
 Isaku Yamahata, Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
 Ackerley Tng, Maciej Szmigiero, David Hildenbrand, Quentin Perret,
 Michael Roth, Wang, Liam Merwick, Kirill A. Shutemov
References: <20231027182217.3615211-1-seanjc@google.com> <20231027182217.3615211-17-seanjc@google.com>
In-Reply-To: <20231027182217.3615211-17-seanjc@google.com>

Hi,

On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson wrote:

...
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index e2252c748fd6..e82c69d5e755 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6079,6 +6079,15 @@ applied.
>  :Parameters: struct kvm_userspace_memory_region2 (in)
>  :Returns: 0 on success, -1 on error
>
> +KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
> +allows mapping guest_memfd memory into a guest.  All fields shared with
> +KVM_SET_USER_MEMORY_REGION identically.  Userspace can set KVM_MEM_PRIVATE in
> +flags to have KVM bind the memory region to a given guest_memfd range of
> +[guest_memfd_offset, guest_memfd_offset + memory_size].  The target guest_memfd
> +must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
> +the target range must not be bound to any other memory region.  All standard
> +bounds checks apply (use common sense).
> +

Bikeshedding here: Not sure if KVM_MEM_PRIVATE is the best name for
this. It gets confusing with KVM_MEMORY_ATTRIBUTE_PRIVATE, i.e., a
region marked as KVM_MEM_PRIVATE is only potentially private. It did
confuse the rest of the team when I walked them through a previous
version of this code. Would something like KVM_MEM_GUESTMEM make more
sense?

> ::
>
>    struct kvm_userspace_memory_region2 {
> @@ -6087,9 +6096,24 @@ applied.
>         __u64 guest_phys_addr;
>         __u64 memory_size; /* bytes */
>         __u64 userspace_addr; /* start of the userspace allocated memory */
> +       __u64 guest_memfd_offset;
> +       __u32 guest_memfd;
> +       __u32 pad1;
> +       __u64 pad2[14];
>  };
>
> -See KVM_SET_USER_MEMORY_REGION.
> +A KVM_MEM_PRIVATE region _must_ have a valid guest_memfd (private memory) and
> +userspace_addr (shared memory).  However, "valid" for userspace_addr simply
> +means that the address itself must be a legal userspace address.
> +The backing mapping for userspace_addr is not required to be
> +valid/populated at the time of KVM_SET_USER_MEMORY_REGION2, e.g. shared
> +memory can be lazily mapped/allocated on-demand.

Regarding requiring that a private region have both a valid
guest_memfd and a userspace_addr, should this be
implementation-specific? In pKVM at least, all regions for protected
VMs are private, and KVM doesn't care about the host userspace address
for those regions even when part of the memory is shared.

> +When mapping a gfn into the guest, KVM selects shared vs. private, i.e.
> +consumes userspace_addr vs. guest_memfd, based on the gfn's
> +KVM_MEMORY_ATTRIBUTE_PRIVATE state.  At VM creation time, all memory is
> +shared, i.e. the PRIVATE attribute is '0' for all gfns.  Userspace can
> +control whether memory is shared/private by toggling
> +KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.

In pKVM, guest memory is private by default, and most of it will
remain so for the lifetime of the VM. Userspace could explicitly mark
all the guest's memory as private at initialization, but defaulting to
private would save that small amount of work. That said, I understand
that it might be better to be consistent across implementations.

...

> --- /dev/null
> +++ b/virt/kvm/guest_memfd.c
> @@ -0,0 +1,548 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/backing-dev.h>
> +#include <linux/falloc.h>
> +#include <linux/kvm_host.h>
> +#include <linux/pagemap.h>
> +#include <linux/anon_inodes.h>

nit: should this include be first (to maintain alphabetical ordering
of the includes)?

> +
> +#include "kvm_mm.h"
> +
> +struct kvm_gmem {
> +        struct kvm *kvm;
> +        struct xarray bindings;
> +        struct list_head entry;
> +};
> +
> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> +{
> +        struct folio *folio;
> +
> +        /* TODO: Support huge pages. */
> +        folio = filemap_grab_folio(inode->i_mapping, index);
> +        if (IS_ERR_OR_NULL(folio))
> +                return NULL;
> +
> +        /*
> +         * Use the up-to-date flag to track whether or not the memory has been
> +         * zeroed before being handed off to the guest.  There is no backing
> +         * storage for the memory, so the folio will remain up-to-date until
> +         * it's removed.
> +         *
> +         * TODO: Skip clearing pages when trusted firmware will do it when
> +         * assigning memory to the guest.
> +         */
> +        if (!folio_test_uptodate(folio)) {
> +                unsigned long nr_pages = folio_nr_pages(folio);
> +                unsigned long i;
> +
> +                for (i = 0; i < nr_pages; i++)
> +                        clear_highpage(folio_page(folio, i));
> +
> +                folio_mark_uptodate(folio);
> +        }
> +
> +        /*
> +         * Ignore accessed, referenced, and dirty flags.  The memory is
> +         * unevictable and there is no storage to write back to.
> +         */
> +        return folio;
> +}
> +
> +static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> +                                      pgoff_t end)
> +{
> +        bool flush = false, found_memslot = false;
> +        struct kvm_memory_slot *slot;
> +        struct kvm *kvm = gmem->kvm;
> +        unsigned long index;
> +
> +        xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> +                pgoff_t pgoff = slot->gmem.pgoff;
> +
> +                struct kvm_gfn_range gfn_range = {
> +                        .start = slot->base_gfn + max(pgoff, start) - pgoff,
> +                        .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
> +                        .slot = slot,
> +                        .may_block = true,
> +                };
> +
> +                if (!found_memslot) {
> +                        found_memslot = true;
> +
> +                        KVM_MMU_LOCK(kvm);
> +                        kvm_mmu_invalidate_begin(kvm);
> +                }
> +
> +                flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
> +        }
> +
> +        if (flush)
> +                kvm_flush_remote_tlbs(kvm);
> +
> +        if (found_memslot)
> +                KVM_MMU_UNLOCK(kvm);
> +}
> +
> +static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> +                                    pgoff_t end)
> +{
> +        struct kvm *kvm = gmem->kvm;
> +
> +        if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
> +                KVM_MMU_LOCK(kvm);
> +                kvm_mmu_invalidate_end(kvm);
> +                KVM_MMU_UNLOCK(kvm);
> +        }
> +}
> +
> +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> +{
> +        struct list_head *gmem_list = &inode->i_mapping->private_list;
> +        pgoff_t start = offset >> PAGE_SHIFT;
> +        pgoff_t end = (offset + len) >> PAGE_SHIFT;
> +        struct kvm_gmem *gmem;
> +
> +        /*
> +         * Bindings must stable across invalidation to ensure the start+end

nit: Bindings must _be/stay?_ stable

...

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 78a0b09ef2a5..5d1a2f1b4e94 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -798,7 +798,7 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
>          }
>  }
>
> -static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> +bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>          kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
>          return kvm_unmap_gfn_range(kvm, range);
> @@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
>  /* This does not remove the slot from struct kvm_memslots data structures */
>  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>  {
> +        if (slot->flags & KVM_MEM_PRIVATE)
> +                kvm_gmem_unbind(slot);
> +

Should this be called after kvm_arch_free_memslot()? Arch-specific
code might need some of the data before the unbinding. I thought that
might be necessary at one point for the pKVM port when deleting a
memslot, but later realized that kvm_invalidate_memslot() ->
kvm_arch_guest_memory_reclaimed() was the more logical place for it.
Calling it after also seems to match the pattern for arch-specific
handlers in KVM.

>          kvm_destroy_dirty_bitmap(slot);
>
>          kvm_arch_free_memslot(kvm, slot);

...

Cheers,
/fuad