From: Vishal Annapurve
Date: Tue, 20 May 2025 07:11:17 -0700
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
To: Fuad Tabba
Cc: Ackerley Tng, kvm@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, x86@kernel.org, linux-fsdevel@vger.kernel.org,
 aik@amd.com, ajones@ventanamicro.com, akpm@linux-foundation.org,
 amoorthy@google.com, anthony.yznaga@oracle.com, anup@brainfault.org,
 aou@eecs.berkeley.edu, bfoster@redhat.com, binbin.wu@linux.intel.com,
 brauner@kernel.org, catalin.marinas@arm.com, chao.p.peng@intel.com,
 chenhuacai@kernel.org, dave.hansen@intel.com, david@redhat.com,
 dmatlack@google.com, dwmw@amazon.co.uk, erdemaktas@google.com,
 fan.du@intel.com, fvdl@google.com, graf@amazon.com, haibo1.xu@intel.com,
 hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
 isaku.yamahata@intel.com, jack@suse.cz, james.morse@arm.com,
 jarkko@kernel.org, jgg@ziepe.ca, jgowans@amazon.com, jhubbard@nvidia.com,
 jroedel@suse.de, jthoughton@google.com, jun.miao@intel.com,
 kai.huang@intel.com, keirf@google.com, kent.overstreet@linux.dev,
 kirill.shutemov@intel.com, liam.merwick@oracle.com,
 maciej.wieczor-retman@intel.com, mail@maciej.szmigiero.name, maz@kernel.org,
 mic@digikod.net, michael.roth@amd.com, mpe@ellerman.id.au,
 muchun.song@linux.dev, nikunj@amd.com, nsaenz@amazon.es,
 oliver.upton@linux.dev, palmer@dabbelt.com, pankaj.gupta@amd.com,
 paul.walmsley@sifive.com, pbonzini@redhat.com, pdurrant@amazon.co.uk,
 peterx@redhat.com, pgonda@google.com, pvorel@suse.cz, qperret@google.com,
 quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
 quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
 quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com,
 richard.weiyang@gmail.com, rick.p.edgecombe@intel.com, rientjes@google.com,
 roypat@amazon.co.uk, rppt@kernel.org, seanjc@google.com, shuah@kernel.org,
 steven.price@arm.com, steven.sistare@oracle.com, suzuki.poulose@arm.com,
 thomas.lendacky@amd.com, usama.arif@bytedance.com, vbabka@suse.cz,
 viro@zeniv.linux.org.uk, vkuznets@redhat.com, wei.w.wang@intel.com,
 will@kernel.org, willy@infradead.org, xiaoyao.li@intel.com,
 yan.y.zhao@intel.com, yilun.xu@intel.com, yuzenghui@huawei.com,
 zhiquan1.li@intel.com

On Tue, May 20, 2025 at 6:44 AM Fuad Tabba wrote:
>
> Hi Vishal,
>
> On Tue, 20 May 2025 at 14:02, Vishal Annapurve wrote:
> >
> > On Tue, May 20, 2025 at 2:23 AM Fuad Tabba wrote:
> > >
> > > Hi Ackerley,
> > >
> > > On Thu, 15 May 2025 at 00:43, Ackerley Tng wrote:
> > > >
> > > > The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
> > > > KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
> > > > and private respectively.
> > >
> > > I have a high level question about this particular patch and this
> > > approach for conversion: why do we need IOCTLs to manage conversion
> > > between private and shared?
> > >
> > > In the presentations I gave at LPC [1, 2], and in my latest patch
> > > series that performs in-place conversion [3] and the associated (by
> > > now outdated) state diagram [4], I didn't see the need to have a
> > > userspace-facing interface to manage that. KVM has all the information
> > > it needs to handle conversions, which are triggered by the guest. To
> > > me this seems like it adds additional complexity, as well as a user
> > > facing interface that we would need to maintain.
> > >
> > > There are various ways we could handle conversion without explicit
> > > interference from userspace. What I had in mind is the following (as
> > > an example, details can vary according to VM type). I will use the
> > > case of conversion from shared to private because that is the more
> > > complicated (interesting) case:
> > >
> > > - Guest issues a hypercall to request that a shared folio become private.
> > >
> > > - The hypervisor receives the call, and passes it to KVM.
> > >
> > > - KVM unmaps the folio from the guest stage-2 (EPT I think in x86
> > > parlance), and unmaps it from the host. The host however, could still
> > > have references (e.g., GUP).
> > >
> > > - KVM exits to the host (hypervisor call exit), with the information
> > > that the folio has been unshared from it.
> > >
> > > - A well behaving host would now get rid of all of its references
> > > (e.g., release GUPs), perform a VCPU run, and the guest continues
> > > running as normal. I expect this to be the common case.
> > >
> > > But to handle the more interesting situation, let's say that the host
> > > doesn't do it immediately, and for some reason it holds on to some
> > > references to that folio.
> > >
> > > - Even if that's the case, the guest can still run *. If the guest
> > > tries to access the folio, KVM detects that access when it tries to
> > > fault it into the guest, sees that the host still has references to
> > > that folio, and exits back to the host with a memory fault exit. At
> > > this point, the VCPU that has tried to fault in that particular folio
> > > cannot continue running as long as it cannot fault in that folio.
> >
> > Are you talking about the following scheme?
> > 1) guest_memfd checks shareability on each get pfn and if there is a
> > mismatch exit to the host.
>
> I think we are not really on the same page here (no pun intended :) ).
> I'll try to answer your questions anyway...
>
> Which get_pfn? Are you referring to get_pfn when faulting the page
> into the guest or into the host?

I am referring to guest fault handling in KVM.

>
> > 2) host user space has to guess whether it's a pending refcount or
> > whether it's an actual mismatch.
>
> No need to guess. VCPU run will let it know exactly why it's exiting.
>
> > 3) guest_memfd will maintain a third state
> > "pending_private_conversion" or equivalent which will transition to
> > private upon the last refcount drop of each page.
> >
> > If conversion is triggered by userspace (in case of pKVM, it will be
> > triggered from within the KVM (?)):
>
> Why would conversion be triggered by userspace? As far as I know, it's
> the guest that triggers the conversion.
>
> > * Conversion will just fail if there are extra refcounts and userspace
> > can try to get rid of extra refcounts on the range while it has enough
> > context without hitting any ambiguity with memory fault exit.
> > * guest_memfd will not have to deal with this extra state from 3 above
> > and overall guest_memfd conversion handling becomes relatively
> > simpler.
>
> That's not really related. The extra state isn't necessary any more
> once we agreed in the previous discussion that we will retry instead.

Who is *we* here? Which entity will retry conversion?

>
> > Note that for x86 CoCo cases, memory conversion is already triggered
> > by userspace using KVM ioctl, this series is proposing to use
> > guest_memfd ioctl to do the same.
>
> The reason why for x86 CoCo cases conversion is already triggered by
> userspace using KVM ioctl is that it has to, since shared memory and
> private memory are two separate pages, and userspace needs to manage
> that. Sharing memory in place removes the need for that.

Userspace still needs to clean up memory usage before conversion can
succeed, e.g. remove IOMMU mappings for shared to private conversion.
I would think that memory conversion should not succeed before all
existing users let go of the guest_memfd pages for the range being
converted.

In x86 CoCo usecases, userspace can also decide to not allow
conversion for scenarios where ranges are still under active use by
the host and the guest is erroneously trying to take away memory. Both
the SNP and TDX specs allow conversion to fail because memory is in use.

>
> This series isn't using the same ioctl, it's introducing new ones to
> perform a task that as far as I can tell so far, KVM can handle by
> itself.

I would like to understand this better. How will KVM handle the
conversion process for guest_memfd pages? Can you help walk through an
example sequence for shared to private conversion, specifically around
guest_memfd offset states?

>
> > - Allows not having to keep track of separate shared/private range
> > information in KVM.
>
> This patch series is already tracking shared/private range information in KVM.
>
> > - Simpler handling of the conversion process done per guest_memfd
> > rather than for full range.
> > - Userspace can handle the rollback as needed, simplifying error
> > handling in guest_memfd.
> > - guest_memfd is single source of truth and notifies the users of
> > shareability change.
> > - e.g. IOMMU, userspace, KVM MMU all can be registered for
> > getting notifications from guest_memfd directly and will get notified
> > for invalidation upon shareability attribute updates.
>
> All of these can still be done without introducing a new ioctl.
>
> Cheers,
> /fuad
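
To make the userspace-triggered variant a bit more concrete, here is a
rough toy model of the behaviour argued for in this thread: conversion
is all-or-nothing, it fails while any page in the range still has extra
host references, registered users are notified so they can invalidate,
and userspace is the entity that retries. This is only a sketch in
Python, not kernel code and not the interface from this series.
ToyGuestMemfd, convert(), register() and userspace_convert_private()
are hypothetical names made up for illustration, and notifying before
the refcount check is an assumption.

```python
from dataclasses import dataclass, field

SHARED, PRIVATE = "shared", "private"


@dataclass
class ToyGuestMemfd:
    """Toy stand-in for guest_memfd; every name here is hypothetical."""
    nr_pages: int
    state: list = field(default_factory=list)      # shareability per offset
    host_refs: list = field(default_factory=list)  # extra host refs per offset
    listeners: list = field(default_factory=list)  # users to notify

    def __post_init__(self):
        self.state = [SHARED] * self.nr_pages
        self.host_refs = [0] * self.nr_pages

    def register(self, callback):
        """Register a user (think IOMMU, KVM MMU) for conversion notices."""
        self.listeners.append(callback)

    def convert(self, start, nr, new_state):
        """Convert [start, start + nr); return the offsets that blocked it.

        An empty list means success. On failure no state is changed, so
        there is no "pending conversion" state to track.
        """
        pages = range(start, start + nr)
        if new_state == PRIVATE:
            # Assumption: users are asked to invalidate before the check.
            for notify in self.listeners:
                notify(self, start, nr, new_state)
            busy = [off for off in pages if self.host_refs[off] > 0]
            if busy:
                return busy
        for off in pages:
            self.state[off] = new_state
        return []


def userspace_convert_private(gmem, start, nr, release, max_tries=3):
    """Hypothetical VMM-side loop: userspace drops references and retries."""
    for _ in range(max_tries):
        busy = gmem.convert(start, nr, PRIVATE)
        if not busy:
            return True
        for off in busy:
            release(gmem, off)  # e.g. unpin, or remove an IOMMU mapping
    return False
```

With a lingering reference on one offset, the first convert() call
returns that offset and leaves the whole range shared; once the release
callback clears it, the retry succeeds. If userspace never drops the
reference, the conversion simply keeps failing and the range stays
shared, which is the rollback-free error handling described above.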