Date: Thu, 22 May 2025 09:26:38 -0700
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
From: Sean Christopherson
To: Fuad Tabba
Cc: Vishal Annapurve, Ackerley Tng, kvm@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, x86@kernel.org, linux-fsdevel@vger.kernel.org,
 aik@amd.com, ajones@ventanamicro.com, akpm@linux-foundation.org,
 amoorthy@google.com, anthony.yznaga@oracle.com, anup@brainfault.org,
 aou@eecs.berkeley.edu, bfoster@redhat.com, binbin.wu@linux.intel.com,
 brauner@kernel.org, catalin.marinas@arm.com, chao.p.peng@intel.com,
 chenhuacai@kernel.org, dave.hansen@intel.com, david@redhat.com,
 dmatlack@google.com, dwmw@amazon.co.uk, erdemaktas@google.com,
 fan.du@intel.com, fvdl@google.com, graf@amazon.com, haibo1.xu@intel.com,
 hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
 isaku.yamahata@intel.com, jack@suse.cz, james.morse@arm.com,
 jarkko@kernel.org, jgg@ziepe.ca, jgowans@amazon.com, jhubbard@nvidia.com,
 jroedel@suse.de, jthoughton@google.com, jun.miao@intel.com,
 kai.huang@intel.com, keirf@google.com, kent.overstreet@linux.dev,
 kirill.shutemov@intel.com, liam.merwick@oracle.com,
 maciej.wieczor-retman@intel.com, mail@maciej.szmigiero.name, maz@kernel.org,
 mic@digikod.net, michael.roth@amd.com, mpe@ellerman.id.au,
 muchun.song@linux.dev, nikunj@amd.com, nsaenz@amazon.es,
 oliver.upton@linux.dev, palmer@dabbelt.com, pankaj.gupta@amd.com,
 paul.walmsley@sifive.com, pbonzini@redhat.com, pdurrant@amazon.co.uk,
 peterx@redhat.com, pgonda@google.com, pvorel@suse.cz, qperret@google.com,
 quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
 quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
 quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com,
 richard.weiyang@gmail.com, rick.p.edgecombe@intel.com, rientjes@google.com,
 roypat@amazon.co.uk, rppt@kernel.org, shuah@kernel.org, steven.price@arm.com,
 steven.sistare@oracle.com, suzuki.poulose@arm.com, thomas.lendacky@amd.com,
 usama.arif@bytedance.com, vbabka@suse.cz, viro@zeniv.linux.org.uk,
 vkuznets@redhat.com, wei.w.wang@intel.com, will@kernel.org,
 willy@infradead.org, xiaoyao.li@intel.com, yan.y.zhao@intel.com,
 yilun.xu@intel.com, yuzenghui@huawei.com, zhiquan1.li@intel.com

On Thu, May 22, 2025, Fuad Tabba wrote:
> On Thu, 22 May 2025 at 15:52, Sean Christopherson wrote:
> >
> > On Wed, May 21, 2025, Fuad Tabba wrote:
> > > How does the host userspace find that out? If the host userspace is capable
> > > of finding that out, then surely KVM is also capable of finding out the same.
> >
> > Nope, not on x86. Well, not without userspace invoking a new ioctl, which would
> > defeat the purpose of adding these ioctls.
> >
> > KVM is only responsible for emulating/virtualizing the "CPU". The chipset, e.g.
> > the PCI config space, is fully owned by userspace. KVM doesn't even know whether
> > or not PCI exists for the VM. And reboot may be emulated by simply creating a
> > new KVM instance, i.e. even if KVM was somehow aware of the reboot request, the
> > change in state would happen in an entirely new struct kvm.
> >
> > That said, Vishal and Ackerley, this patch is a bit lacking on the documentation
> > front. The changelog asserts that:
> >
> >   A guest_memfd ioctl is used because shareability is a property of the memory,
> >   and this property should be modifiable independently of the attached struct kvm
> >
> > but then follows with a very weak and IMO largely irrelevant justification of:
> >
> >   This allows shareability to be modified even if the memory is not yet bound
> >   using memslots.
> >
> > Allowing userspace to change shareability without memslots is one relatively minor
> > flow in one very specific use case.
> >
> > The real justification for these ioctls is that fundamentally, shareability for
> > in-place conversions is a property of a guest_memfd instance and not a struct kvm
> > instance, and so needs to owned by guest_memfd.
>
> Thanks for the clarification Sean. I have a couple of followup
> questions/comments that you might be able to help with:
>
> From a conceptual point of view, I understand that the in-place conversion is
> a property of guest_memfd. But that doesn't necessarily mean that the
> interface between kvm <-> guest_memfd is a userspace IOCTL.

kvm and guest_memfd aren't the communication endpoints for in-place conversions,
and more importantly, kvm isn't part of the control plane. kvm's primary role
(for guest_memfd with in-place conversions) is to manage the page tables to map
memory into the guest.

kvm *may* also explicitly provide a communication channel between the guest and
host, e.g. when conversions are initiated via hypercalls, but in some cases the
communication channel may be created through pre-existing mechanisms, e.g. a
shared memory buffer or emulated I/O (such as the PCI reset case).

  guest => kvm (dumb pipe) => userspace => guest_memfd => kvm (invalidate)

And in other cases, kvm might not be in that part of the picture at all, e.g. if
the userspace VMM provides an interface to the VM owner (which could also be the
user running the VM) to reset the VM, then the flow would look like:

  userspace => guest_memfd => kvm (invalidate)

A decent comparison is vCPUs. KVM _could_ route all ioctls through the VM, but
that's unpleasant for all parties, as it'd be cumbersome for userspace, and
unnecessarily complex and messy for KVM. Similarly, routing guest_memfd state
changes through KVM_SET_MEMORY_ATTRIBUTES is awkward from both design and
mechanical perspectives.

Even if we disagree on how ugly/pretty routing conversions through kvm would be,
which I'll allow is subjective, the bigger problem is that bouncing through
KVM_SET_MEMORY_ATTRIBUTES would create an unholy mess of an ABI.

Today, KVM_SET_MEMORY_ATTRIBUTES is handled entirely within kvm, and any changes
take effect irrespective of any memslot bindings. And that didn't happen by
chance; preserving and enforcing attribute changes independently of memslots was
a key design requirement, precisely because memslots are ephemeral to a certain
extent.

Adding support for in-place guest_memfd conversion will require new ABI, and so
will be a "breaking" change for KVM_SET_MEMORY_ATTRIBUTES no matter what. E.g.
KVM will need to reject KVM_MEMORY_ATTRIBUTE_PRIVATE for VMs that elect to use
in-place guest_memfd conversions. But very critically, KVM can crisply enumerate
the lack of KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_CAP_MEMORY_ATTRIBUTES, the
behavior will be very straightforward to document (e.g. CAP X is mutually
exclusive with KVM_MEMORY_ATTRIBUTE_PRIVATE), and it will be opt-in, i.e. won't
truly be a breaking change.

If/when we move shareability to guest_memfd, routing state changes through
KVM_SET_MEMORY_ATTRIBUTES will gain a subtle dependency on userspace having to
create memslots in order for state changes to take effect. That wrinkle would be
weird and annoying to document, e.g. "if CAP X is enabled, the ioctl ordering is
A => B => C, otherwise the ordering doesn't matter", and would create many more
conundrums:

  - If a memslot needs to exist in order for KVM_SET_MEMORY_ATTRIBUTES to take
    effect, what should happen if that memslot is deleted?

  - If a memslot isn't found, should KVM_SET_MEMORY_ATTRIBUTES fail and report
    an error, or silently do nothing?

  - If KVM_SET_MEMORY_ATTRIBUTES affects multiple memslots that are bound to
    multiple guest_memfd instances, how does KVM guarantee atomicity? What
    happens if one guest_memfd conversion succeeds, but a later one fails?

> We already communicate directly between the two. Other, even less related
> subsystems within the kernel also interact without going through userspace.
> Why can't we do the same here? I'm not suggesting it not be owned by
> guest_memfd, but that we communicate directly.

I'm not concerned about kvm communicating with guest_memfd, as you note it's all
KVM. As above, my concerns are all about KVM's ABI and who owns/controls what.

> From a performance point of view, I would expect the common case to be that
> when KVM gets an unshare request from the guest, it would be able to unmap
> those pages from the (cooperative) host userspace, and return back to the
> guest.
> In this scenario, the host userspace wouldn't even need to be
> involved.

Hard NAK, at least from an x86 perspective. Userspace is the sole decision maker
with respect to whether memory is in a shared vs. private state, full stop. The
guest can make *requests* to convert memory, but ultimately it's host userspace
that decides whether or not to honor the request.

We've litigated this exact issue multiple times. All state changes must be
controlled by userspace, because userspace is the only entity that can gracefully
handle exceptions and edge cases, and is the only entity with (almost) full
knowledge of the system.

We can discuss this again if necessary, but I'd much prefer to not rehash all of
those conversations.

> Having a userspace IOCTL as part of this makes that trip unnecessarily longer
> for the common case.

I'm very skeptical that an exit to userspace is even going to be measurable in
terms of the cost to convert memory. Conversion is going to require multiple
locks, modifications to multiple sets of page tables with all the associated TLB
maintenance, possibly cache maintenance, and probably a few other things I'm
forgetting. The cost of a few user<=>kernel transitions is likely going to be a
drop in the bucket.

If I'm wrong, and there are flows where the user<=>kernel transitions are the
long pole, then we could certainly explore adding a way for userspace to opt
into a "fast path" conversion. But it would need to be exactly that, an optional
fast path that can fall back to the "slow" userspace-driven conversion as needed.
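
For anyone who finds the ownership split easier to follow as code, here is a toy
userspace model of the flow "guest => kvm (dumb pipe) => userspace =>
guest_memfd => kvm (invalidate)". It is only a sketch under stated assumptions:
`Kvm`, `GuestMemfd`, `Vmm`, the `policy` callback, and the "everything starts
private" default are all hypothetical names and choices made up for
illustration. None of it is KVM or guest_memfd uAPI, and it says nothing about
the ioctl layout in the patch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

SHARED, PRIVATE = "shared", "private"


@dataclass
class Kvm:
    """Hypothetical stand-in for a struct kvm: it only maps memory, so all it
    has to do on a conversion is drop (invalidate) its mappings."""
    invalidated: List[Tuple[int, int]] = field(default_factory=list)

    def invalidate(self, start: int, end: int) -> None:
        self.invalidated.append((start, end))


@dataclass
class GuestMemfd:
    """Hypothetical stand-in for a guest_memfd instance, which owns the
    shareability state independently of any bound Kvm."""
    nr_pages: int
    bound: List[Kvm] = field(default_factory=list)
    state: Dict[int, str] = field(init=False)

    def __post_init__(self) -> None:
        # Assumption for this sketch: every page starts out private.
        self.state = {i: PRIVATE for i in range(self.nr_pages)}

    def convert(self, start: int, end: int, to: str) -> None:
        if not (0 <= start < end <= self.nr_pages):
            raise ValueError("range out of bounds")
        for i in range(start, end):
            self.state[i] = to
        # Notify whichever Kvm instances happen to be bound; none are required.
        for kvm in self.bound:
            kvm.invalidate(start, end)


@dataclass
class Vmm:
    """Hypothetical userspace VMM: the only place a conversion is decided."""
    gmem: GuestMemfd
    policy: Callable[[int, int, str], bool]

    def handle_guest_request(self, start: int, end: int, to: str) -> bool:
        # The guest's request arrives via some channel (hypercall exit, shared
        # buffer, emulated I/O); userspace decides whether to honor it.
        if not self.policy(start, end, to):
            return False
        self.gmem.convert(start, end, to)
        return True
```

Note that `GuestMemfd.convert()` works with an empty `bound` list, which is the
"no memslot needed" property, and that `Kvm` never initiates or vetoes a state
change; it only reacts to the invalidation.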