From: Vishal Annapurve <vannapurve@google.com>
Date: Tue, 20 May 2025 06:02:06 -0700
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
To: Fuad Tabba
Cc: Ackerley Tng, kvm@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, x86@kernel.org, linux-fsdevel@vger.kernel.org,
    aik@amd.com, ajones@ventanamicro.com, akpm@linux-foundation.org,
    amoorthy@google.com, anthony.yznaga@oracle.com, anup@brainfault.org,
    aou@eecs.berkeley.edu, bfoster@redhat.com, binbin.wu@linux.intel.com,
    brauner@kernel.org, catalin.marinas@arm.com, chao.p.peng@intel.com,
    chenhuacai@kernel.org, dave.hansen@intel.com, david@redhat.com,
    dmatlack@google.com, dwmw@amazon.co.uk, erdemaktas@google.com,
    fan.du@intel.com, fvdl@google.com, graf@amazon.com, haibo1.xu@intel.com,
    hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
    isaku.yamahata@intel.com, jack@suse.cz, james.morse@arm.com,
    jarkko@kernel.org, jgg@ziepe.ca, jgowans@amazon.com, jhubbard@nvidia.com,
    jroedel@suse.de, jthoughton@google.com, jun.miao@intel.com,
    kai.huang@intel.com, keirf@google.com, kent.overstreet@linux.dev,
    kirill.shutemov@intel.com, liam.merwick@oracle.com,
    maciej.wieczor-retman@intel.com, mail@maciej.szmigiero.name,
    maz@kernel.org, mic@digikod.net, michael.roth@amd.com, mpe@ellerman.id.au,
    muchun.song@linux.dev, nikunj@amd.com, nsaenz@amazon.es,
    oliver.upton@linux.dev, palmer@dabbelt.com, pankaj.gupta@amd.com,
    paul.walmsley@sifive.com, pbonzini@redhat.com, pdurrant@amazon.co.uk,
    peterx@redhat.com, pgonda@google.com, pvorel@suse.cz, qperret@google.com,
    quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
    quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
    quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com,
    quic_tsoni@quicinc.com, richard.weiyang@gmail.com,
    rick.p.edgecombe@intel.com, rientjes@google.com, roypat@amazon.co.uk,
    rppt@kernel.org, seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
    steven.sistare@oracle.com, suzuki.poulose@arm.com, thomas.lendacky@amd.com,
    usama.arif@bytedance.com, vbabka@suse.cz, viro@zeniv.linux.org.uk,
    vkuznets@redhat.com, wei.w.wang@intel.com, will@kernel.org,
    willy@infradead.org, xiaoyao.li@intel.com, yan.y.zhao@intel.com,
    yilun.xu@intel.com, yuzenghui@huawei.com, zhiquan1.li@intel.com

On Tue, May 20, 2025 at 2:23 AM Fuad Tabba wrote:
>
> Hi Ackerley,
>
> On Thu, 15 May 2025 at 00:43, Ackerley Tng wrote:
> >
> > The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
> > KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
> > and private respectively.
>
> I have a high-level question about this particular patch and this
> approach to conversion: why do we need ioctls to manage conversion
> between private and shared?
>
> In the presentations I gave at LPC [1, 2], and in my latest patch
> series that performs in-place conversion [3] and the associated (by
> now outdated) state diagram [4], I didn't see the need for a
> userspace-facing interface to manage that. KVM has all the information
> it needs to handle conversions, which are triggered by the guest. To
> me this seems to add complexity, as well as a user-facing interface
> that we would need to maintain.
>
> There are various ways we could handle conversion without explicit
> intervention from userspace. What I had in mind is the following (as
> an example; details can vary according to VM type). I will use the
> case of conversion from shared to private, because that is the more
> complicated (interesting) case:
>
> - The guest issues a hypercall to request that a shared folio become
> private.
>
> - The hypervisor receives the call and passes it to KVM.
>
> - KVM unmaps the folio from the guest stage-2 (EPT, I think, in x86
> parlance), and unmaps it from the host. The host, however, could still
> have references (e.g., GUP).
>
> - KVM exits to the host (hypervisor call exit), with the information
> that the folio has been unshared from it.
>
> - A well-behaved host would now get rid of all of its references
> (e.g., release GUPs), perform a vCPU run, and the guest continues
> running as normal. I expect this to be the common case.
>
> But to handle the more interesting situation, let's say that the host
> doesn't do that immediately, and for some reason it holds on to some
> references to that folio.
>
> - Even if that's the case, the guest can still run *. If the guest
> tries to access the folio, KVM detects that access when it tries to
> fault it into the guest, sees that the host still has references to
> that folio, and exits back to the host with a memory fault exit. At
> this point, the vCPU that has tried to fault in that particular folio
> cannot continue running as long as it cannot fault in that folio.

Are you talking about the following scheme?
1) guest_memfd checks shareability on each get_pfn and, if there is a
   mismatch, exits to the host.
2) Host userspace has to guess whether it's a pending refcount or an
   actual mismatch.
3) guest_memfd maintains a third state, "pending_private_conversion" or
   equivalent, which transitions to private upon the last refcount drop
   of each page.

If conversion is instead triggered by userspace (in the case of pKVM,
it would be triggered from within KVM (?)):
* Conversion will simply fail if there are extra refcounts, and
  userspace can try to get rid of the extra refcounts on the range
  while it still has enough context, without hitting any ambiguity
  around memory fault exits.
* guest_memfd will not have to deal with the extra state from 3) above,
  and overall guest_memfd conversion handling becomes relatively
  simpler.

Note that for x86 CoCo cases, memory conversion is already triggered by
userspace using a KVM ioctl; this series proposes using a guest_memfd
ioctl to do the same.
- Allows not having to keep track of separate shared/private range
  information in KVM.
- Simpler handling of the conversion process, done per guest_memfd
  rather than for the full range.
- Userspace can handle rollback as needed, simplifying error handling
  in guest_memfd.
- guest_memfd is the single source of truth and notifies its users of
  shareability changes.
  - e.g., IOMMU, userspace and KVM MMU can all register for
    notifications from guest_memfd directly and will be notified of
    invalidations upon shareability attribute updates.
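
For illustration, here is a rough sketch of what driving such a
conversion could look like from the VMM side. The argument struct, the
EAGAIN convention and the vmm_drop_refs() helper below are assumptions
made up for this example, not the uapi actually defined in this series:

/*
 * Sketch only: argument layout, error codes and helper names are
 * illustrative assumptions, not the uapi proposed in this series.
 */
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>	/* would provide KVM_GMEM_CONVERT_* once merged */

/* Hypothetical ioctl argument: a page-aligned byte range in the gmem file. */
struct gmem_convert_range {
	uint64_t offset;
	uint64_t size;
};

/* VMM-specific hook that drops any pins/mappings it still holds on the range. */
extern void vmm_drop_refs(int gmem_fd, uint64_t offset, uint64_t size);

static int convert_range_to_private(int gmem_fd, uint64_t offset, uint64_t size)
{
	struct gmem_convert_range r = { .offset = offset, .size = size };
	int ret;

	for (;;) {
		ret = ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &r);
		if (ret < 0 && errno == EAGAIN) {
			/*
			 * Conversion refused because references on the range
			 * are still outstanding. Userspace knows exactly what
			 * it pinned, so it can drop those references here and
			 * retry, with no guessing needed on the kernel side.
			 */
			vmm_drop_refs(gmem_fd, offset, size);
			continue;
		}
		return ret;
	}
}

The point is that the retry/rollback decision stays entirely in
userspace, which already knows which of its own references are still
outstanding.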