From: Vishal Annapurve
Date: Tue, 20 May 2025 09:02:50 -0700
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
To: Fuad Tabba
Cc: Ackerley Tng, kvm@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, x86@kernel.org, linux-fsdevel@vger.kernel.org,
 aik@amd.com, ajones@ventanamicro.com, akpm@linux-foundation.org,
 amoorthy@google.com, anthony.yznaga@oracle.com, anup@brainfault.org,
 aou@eecs.berkeley.edu, bfoster@redhat.com, binbin.wu@linux.intel.com,
 brauner@kernel.org, catalin.marinas@arm.com, chao.p.peng@intel.com,
 chenhuacai@kernel.org, dave.hansen@intel.com, david@redhat.com,
 dmatlack@google.com, dwmw@amazon.co.uk, erdemaktas@google.com,
 fan.du@intel.com, fvdl@google.com, graf@amazon.com, haibo1.xu@intel.com,
 hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
 isaku.yamahata@intel.com, jack@suse.cz, james.morse@arm.com,
 jarkko@kernel.org, jgg@ziepe.ca, jgowans@amazon.com, jhubbard@nvidia.com,
 jroedel@suse.de, jthoughton@google.com, jun.miao@intel.com,
 kai.huang@intel.com, keirf@google.com, kent.overstreet@linux.dev,
 kirill.shutemov@intel.com, liam.merwick@oracle.com,
 maciej.wieczor-retman@intel.com, mail@maciej.szmigiero.name, maz@kernel.org,
 mic@digikod.net, michael.roth@amd.com, mpe@ellerman.id.au,
 muchun.song@linux.dev, nikunj@amd.com, nsaenz@amazon.es,
 oliver.upton@linux.dev, palmer@dabbelt.com, pankaj.gupta@amd.com,
 paul.walmsley@sifive.com, pbonzini@redhat.com, pdurrant@amazon.co.uk,
 peterx@redhat.com, pgonda@google.com, pvorel@suse.cz, qperret@google.com,
 quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
 quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
 quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com,
 richard.weiyang@gmail.com, rick.p.edgecombe@intel.com, rientjes@google.com,
 roypat@amazon.co.uk, rppt@kernel.org, seanjc@google.com, shuah@kernel.org,
 steven.price@arm.com, steven.sistare@oracle.com, suzuki.poulose@arm.com,
 thomas.lendacky@amd.com, usama.arif@bytedance.com, vbabka@suse.cz,
 viro@zeniv.linux.org.uk, vkuznets@redhat.com, wei.w.wang@intel.com,
 will@kernel.org, willy@infradead.org, xiaoyao.li@intel.com,
 yan.y.zhao@intel.com, yilun.xu@intel.com, yuzenghui@huawei.com,
 zhiquan1.li@intel.com

On Tue, May 20, 2025 at 7:34 AM Fuad Tabba wrote:
>
> Hi Vishal,
>
> On Tue, 20 May 2025 at 15:11, Vishal Annapurve wrote:
> >
> > On Tue, May 20, 2025 at 6:44 AM Fuad Tabba wrote:
> > >
> > > Hi Vishal,
> > >
> > > On Tue, 20 May 2025 at 14:02, Vishal Annapurve wrote:
> > > >
> > > > On Tue, May 20, 2025 at 2:23 AM Fuad Tabba wrote:
> > > > >
> > > > > Hi Ackerley,
> > > > >
> > > > > On Thu, 15 May 2025 at 00:43, Ackerley Tng wrote:
> > > > > >
> > > > > > The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
> > > > > > KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
> > > > > > and private respectively.
> > > > >
> > > > > I have a high level question about this particular patch and this
> > > > > approach for conversion: why do we need IOCTLs to manage conversion
> > > > > between private and shared?
> > > > >
> > > > > In the presentations I gave at LPC [1, 2], and in my latest patch
> > > > > series that performs in-place conversion [3] and the associated (by
> > > > > now outdated) state diagram [4], I didn't see the need to have a
> > > > > userspace-facing interface to manage that. KVM has all the information
> > > > > it needs to handle conversions, which are triggered by the guest. To
> > > > > me this seems like it adds additional complexity, as well as a user
> > > > > facing interface that we would need to maintain.
> > > > >
> > > > > There are various ways we could handle conversion without explicit
> > > > > interference from userspace. What I had in mind is the following (as
> > > > > an example, details can vary according to VM type). I will use the
> > > > > case of conversion from shared to private because that is the more
> > > > > complicated (interesting) case:
> > > > >
> > > > > - Guest issues a hypercall to request that a shared folio become private.
> > > > >
> > > > > - The hypervisor receives the call, and passes it to KVM.
> > > > >
> > > > > - KVM unmaps the folio from the guest stage-2 (EPT I think in x86
> > > > > parlance), and unmaps it from the host. The host however, could still
> > > > > have references (e.g., GUP).
> > > > >
> > > > > - KVM exits to the host (hypervisor call exit), with the information
> > > > > that the folio has been unshared from it.
> > > > >
> > > > > - A well behaving host would now get rid of all of its references
> > > > > (e.g., release GUPs), perform a VCPU run, and the guest continues
> > > > > running as normal. I expect this to be the common case.
> > > > >
> > > > > But to handle the more interesting situation, let's say that the host
> > > > > doesn't do it immediately, and for some reason it holds on to some
> > > > > references to that folio.
> > > > >
> > > > > - Even if that's the case, the guest can still run *. If the guest
> > > > > tries to access the folio, KVM detects that access when it tries to
> > > > > fault it into the guest, sees that the host still has references to
> > > > > that folio, and exits back to the host with a memory fault exit. At
> > > > > this point, the VCPU that has tried to fault in that particular folio
> > > > > cannot continue running as long as it cannot fault in that folio.
> > > >
> > > > Are you talking about the following scheme?
> > > > 1) guest_memfd checks shareability on each get pfn and if there is a
> > > > mismatch exit to the host.
> > >
> > > I think we are not really on the same page here (no pun intended :) ).
> > > I'll try to answer your questions anyway...
> > >
> > > Which get_pfn? Are you referring to get_pfn when faulting the page
> > > into the guest or into the host?
> >
> > I am referring to guest fault handling in KVM.
> >
> > >
> > > > 2) host user space has to guess whether it's a pending refcount or
> > > > whether it's an actual mismatch.
> > >
> > > No need to guess. VCPU run will let it know exactly why it's exiting.
> > >
> > > > 3) guest_memfd will maintain a third state
> > > > "pending_private_conversion" or equivalent which will transition to
> > > > private upon the last refcount drop of each page.
> > > >
> > > > If conversion is triggered by userspace (in case of pKVM, it will be
> > > > triggered from within the KVM (?)):
> > >
> > > Why would conversion be triggered by userspace? As far as I know, it's
> > > the guest that triggers the conversion.
> > >
> > > > * Conversion will just fail if there are extra refcounts and userspace
> > > > can try to get rid of extra refcounts on the range while it has enough
> > > > context without hitting any ambiguity with memory fault exit.
> > > > * guest_memfd will not have to deal with this extra state from 3 above
> > > > and overall guest_memfd conversion handling becomes relatively
> > > > simpler.
> > >
> > > That's not really related. The extra state isn't necessary any more
> > > once we agreed in the previous discussion that we will retry instead.
> >
> > Who is *we* here? Which entity will retry conversion?
>
> Userspace will re-attempt the VCPU run.

Then KVM will have to keep track of the ranges that need conversion
across exits. I think it's cleaner to let userspace make the decision
and invoke the conversion, without KVM carrying additional state about
the guest's request.

>
> > >
> > > > Note that for x86 CoCo cases, memory conversion is already triggered
> > > > by userspace using KVM ioctl, this series is proposing to use
> > > > guest_memfd ioctl to do the same.
> > >
> > > The reason why for x86 CoCo cases conversion is already triggered by
> > > userspace using KVM ioctl is that it has to, since shared memory and
> > > private memory are two separate pages, and userspace needs to manage
> > > that. Sharing memory in place removes the need for that.
> >
> > Userspace still needs to clean up memory usage before conversion is
> > successful. e.g. remove IOMMU mappings for shared to private
> > conversion. I would think that memory conversion should not succeed
> > before all existing users let go of the guest_memfd pages for the
> > range being converted.
>
> Yes. Userspace will know that it needs to do that on the VCPU exit,
> which informs it of the guest's hypervisor request to unshare (convert
> from shared to private) the page.
>
> > In x86 CoCo usecases, userspace can also decide to not allow
> > conversion for scenarios where ranges are still under active use by
> > the host and guest is erroneously trying to take away memory. Both
> > SNP/TDX spec allow failure of conversion due to in use memory.
>
> How can the guest erroneously try to take away memory? If the guest
> sends a hypervisor request asking for a conversion of memory that
> doesn't belong to it, then I would expect the hypervisor to prevent
> that.

Making a range private effectively prevents the host from accessing
it, so the guest is in effect taking that memory away from the host.

>
> I don't see how having an IOCTL to trigger the conversion is needed to
> allow conversion failure. How is that different from userspace
> ignoring or delaying releasing all references it has for the
> conversion request?
>
> >
> > > This series isn't using the same ioctl, it's introducing new ones to
> > > perform a task that as far as I can tell so far, KVM can handle by
> > > itself.
> >
> > I would like to understand this better. How will KVM handle the
> > conversion process for guest_memfd pages? Can you help walk an example
> > sequence for shared to private conversion specifically around
> > guest_memfd offset states?
>
> To make sure that we are discussing the same scenario: can you do the
> same as well please --- walk me through an example sequence for shared
> to private conversion specifically around guest_memfd offset states
> With the IOCTLs involved?
>
> Here is an example that I have implemented and tested with pKVM. Note
> that there are alternatives, the flow below is architecture or even
> vm-type dependent. None of this code is core KVM code and the
> behaviour could vary.
>
>
> Assuming the folio is shared with the host:
>
> Guest sends unshare hypercall to the hypervisor
> Hypervisor forwards request to KVM (gmem) (having done due diligence)
> KVM (gmem) performs an unmap_folio(), exits to userspace with

For the x86 CoCo VM use cases I was talking about, userspace would
like to avoid unmap_mapping_range() on the range until it is safe to
unshare it.

> KVM_EXIT_UNSHARE and all the information about the folio being
> unshared
>
> Case 1:
> Userspace removes any remaining references (GUPs, IOMMU Mappings etc...)
> Userspace calls vcpu_run(): KVM (gmem) sees that there aren't any
> references, sets state to PRIVATE
>
> Case 2 (alternative 1):
> Userspace doesn't release its references
> Userspace calls vcpu_run(): KVM (gmem) sees that there are still
> references, exits back to userspace with KVM_EXIT_UNSHARE
>
> Case 2 (alternative 2):
> Userspace doesn't release its references
> Userspace calls vcpu_run(): KVM (gmem) sees that there are still
> references, unmaps folio from guest, but allows it to run (until it
> tries to fault in the folio)
> Guest tries to fault in folio that still has reference, KVM does not
> allow that (it sees that the folio is shared, and it doesn't fault in
> shared folios to confidential guests)
> KVM exits back to userspace with KVM_EXIT_UNSHARE
>
> As I mentioned, the alternatives above are _not_ set in core KVM code.
> They can vary by architecture or VM type, depending on the policy,
> support, etc.
>
> Now for your example please on how this would work with IOCTLs :)
>
> Thanks,
> /fuad
>
> > >
> > > > - Allows not having to keep track of separate shared/private range
> > > > information in KVM.
> > >
> > > This patch series is already tracking shared/private range information in KVM.
> > >
> > > > - Simpler handling of the conversion process done per guest_memfd
> > > > rather than for full range.
> > > >     - Userspace can handle the rollback as needed, simplifying error
> > > > handling in guest_memfd.
> > > > - guest_memfd is single source of truth and notifies the users of
> > > > shareability change.
> > > >     - e.g. IOMMU, userspace, KVM MMU all can be registered for
> > > > getting notifications from guest_memfd directly and will get notified
> > > > for invalidation upon shareability attribute updates.
> > >
> > > All of these can still be done without introducing a new ioctl.
> > >
> > > Cheers,
> > > /fuad
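
For readers following the thread, the two shared-to-private flows
discussed above can be compared with a small toy model. This is only a
sketch under stated assumptions, not KVM code: every toy_* name is
hypothetical, TOY_EXIT_UNSHARE merely stands in for the KVM_EXIT_UNSHARE
exit described in the quoted flow, and -EAGAIN is an assumed error code
(the thread only says the conversion "will just fail"). The model covers
Case 1 and Case 2 (alternative 1) of the guest-driven flow, and a
userspace-driven conversion that refuses to proceed while the host
still holds references.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model only: none of these names exist in KVM. */
enum toy_state { TOY_SHARED, TOY_PRIVATE };
enum toy_exit { TOY_EXIT_NONE, TOY_EXIT_UNSHARE };

struct toy_folio {
	enum toy_state state;
	int host_refs;       /* extra host references (GUP pins, IOMMU mappings) */
	int host_mapped;     /* 1 while the host still maps the folio */
	int unshare_pending; /* request state carried across exits (guest-driven flow) */
};

/* Guest-driven flow: the unshare request unmaps the folio from the host
 * right away, records the pending request and exits to userspace. */
enum toy_exit toy_guest_unshare(struct toy_folio *f)
{
	assert(f != NULL);
	f->host_mapped = 0;
	f->unshare_pending = 1;
	return TOY_EXIT_UNSHARE;
}

/* Guest-driven flow: each re-entry re-checks the pending request.
 * With no host references left the folio becomes private (Case 1);
 * otherwise the same exit is reported again (Case 2, alternative 1). */
enum toy_exit toy_vcpu_run(struct toy_folio *f)
{
	assert(f != NULL);
	if (!f->unshare_pending)
		return TOY_EXIT_NONE;
	if (f->host_refs > 0)
		return TOY_EXIT_UNSHARE;
	f->state = TOY_PRIVATE;
	f->unshare_pending = 0;
	return TOY_EXIT_NONE;
}

/* Userspace-driven flow: the conversion fails while references remain,
 * leaving the host mapping untouched and keeping no pending state.
 * The -EAGAIN return value is an assumption made for this sketch. */
int toy_convert_private(struct toy_folio *f)
{
	assert(f != NULL);
	if (f->host_refs > 0)
		return -EAGAIN;
	f->host_mapped = 0;
	f->state = TOY_PRIVATE;
	return 0;
}
```

In this model the guest-driven flow has to remember the request in
unshare_pending between exits and unmaps the host before it knows the
conversion can complete, while the userspace-driven flow keeps no such
state and only unmaps once the conversion can succeed. Which of the two
trade-offs is preferable is exactly what the thread is debating.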