Date: Thu, 17 Oct 2024 10:58:29 -0400
From: Peter Xu
To: Jason Gunthorpe
Cc: David Hildenbrand, Ackerley Tng, tabba@google.com,
    quic_eberman@quicinc.com, roypat@amazon.co.uk, rientjes@google.com,
    fvdl@google.com, jthoughton@google.com, seanjc@google.com,
    pbonzini@redhat.com, zhiquan1.li@intel.com, fan.du@intel.com,
    jun.miao@intel.com, isaku.yamahata@intel.com, muchun.song@linux.dev,
    erdemaktas@google.com, vannapurve@google.com, qperret@google.com,
    jhubbard@nvidia.com, willy@infradead.org, shuah@kernel.org,
    brauner@kernel.org, bfoster@redhat.com, kent.overstreet@linux.dev,
    pvorel@suse.cz, rppt@kernel.org, richard.weiyang@gmail.com,
    anup@brainfault.org, haibo1.xu@intel.com, ajones@ventanamicro.com,
    vkuznets@redhat.com, maciej.wieczor-retman@intel.com, pgonda@google.com,
    oliver.upton@linux.dev, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    kvm@vger.kernel.org, linux-kselftest@vger.kernel.org
Subject: Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a
    struct kvm_gmem_private
References: <1d243dde-2ddf-4875-890d-e6bb47931e40@redhat.com>
    <20241016225157.GQ3559746@nvidia.com>
    <20241016235424.GU3559746@nvidia.com>
In-Reply-To: <20241016235424.GU3559746@nvidia.com>

On Wed, Oct 16, 2024 at 08:54:24PM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 16, 2024 at 07:49:31PM -0400, Peter Xu wrote:
> > On Wed, Oct 16, 2024 at 07:51:57PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Oct 16, 2024 at 04:16:17PM -0400, Peter Xu wrote:
> > > > Is there chance that when !CoCo will be supported, then external modules
> > > > (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?
> > > >
> > > > To support CoCo, I understand gmem+offset is required all over the places.
> > > > However in a non-CoCo context, I wonder whether the other modules are
> > > > required to stick with gmem+offset, or they can reuse the old VA ways,
> > > > because how it works can fundamentally be the same as before, except that
> > > > the folios now will be managed by gmemfd.
> > > >
> > > My intention with iommufd was to see fd + offest as the "new" way
> > > to refer to all guest memory and discourage people from using VMA
> > > handles.
> > >
> > Does it mean anonymous memory guests will not be supported at all for
> > iommufd?
>
> No, they can use the "old" way with normal VMA's still, or they can
> use an anonymous memfd with the new way..
>
> I just don't expect to have new complex stuff built on the VMA
> interface - I don't expect guestmemfd VMAs to work.

Yes, if guestmemfd is already in use we probably don't need to bother
with the VA interface. It's the same for KVM: guestmemfd already supports
KVM_SET_USER_MEMORY_REGION2, so using fd+offset for this KVM API is not a
problem at all.

My question was more towards whether gmemfd could still be exposed in VA
form to other modules that may not support fd+offset yet. And I assume
that by "VMA" you mean "VA ranges", while a "gmemfd VMA" on its own is
probably OK? That is what this series proposes with the fault handler.

It may not be a problem for many cloud providers, but when QEMU is
involved things are still pretty flexible, and QEMU will need to add
fd+offset support for many of the existing interfaces that are mostly
based on VAs or VA ranges.

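To make the VA vs. fd+offset distinction concrete: for a module that only
takes VA ranges, the relationship to (fd, offset) is just arithmetic on
the mmap()ed region. Below is a minimal userspace sketch under stated
assumptions. struct gmem_mapping and va_to_fd_offset() are hypothetical
names made up for illustration; they are not part of KVM, iommufd or QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one mmap()ed guest memory region. */
struct gmem_mapping {
    int fd;              /* fd backing the region, e.g. a guest_memfd */
    uint64_t fd_offset;  /* offset within the fd where the mapping starts */
    uintptr_t va_start;  /* userspace address returned by mmap() */
    size_t len;          /* length of the mapping in bytes */
};

/*
 * Translate the VA range [va, va + size) into (fd, offset).
 * Returns 0 on success, -1 if the range is not fully inside the mapping.
 */
static int va_to_fd_offset(const struct gmem_mapping *m, uintptr_t va,
                           size_t size, int *fd, uint64_t *offset)
{
    assert(m && fd && offset);

    /* Written to avoid overflow in va + size. */
    if (va < m->va_start || size > m->len ||
        va - m->va_start > m->len - size)
        return -1;

    *fd = m->fd;
    *offset = m->fd_offset + (va - m->va_start);
    return 0;
}
```

A VA-based interface would keep passing va/size around as before, while an
fd+offset interface would use the translated pair; both name the same
folios as long as the mapping record is kept accurate.
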
I believe that includes QEMU itself, aka the userspace hypervisor (which
is about how a user app should access the shared pages that KVM allows to
be faulted), vhost-kernel (more GUP oriented), vhost-user (similar to the
user app side), etc.

I think as long as we can provide gmemfd VMAs like what this series
provides, it sounds possible to reuse the old VA interfaces before the
CoCo interfaces are ready, so that people can already start leveraging
gmemfd backing pages.

The idea is in general nice to me - QEMU used to have a requirement where
we wanted strict vIOMMU semantics between QEMU and another process that
runs the device emulation (aka, vhost-user). We didn't want to map all
guest RAM all the time, because to this day an OVS bug can corrupt QEMU
memory even if a vIOMMU is present (which should be able to prevent this,
but only logically..).

We used to have the idea that we could send one fd to the vhost-user
process and keep control of what is mapped and what can be zapped.

gmemfd is mostly what we were pursuing back then, in that:

  - It allows mmap() of a guest memory region (without yet the capability
    to access all of it... otherwise it could bypass protection, no
    matter whether that is for CoCo or for a vIOMMU as in this case)

  - It allows the main process (in this case, QEMU/KVM or anything/KVM)
    to control how pages are faulted in; gmemfd lazily faults in pages
    only if they're faultable / shared

  - It allows remote tearing down of pages that are no longer faultable /
    shared, which guarantees that the other process cannot access any
    page that it was not authorized to access

I wonder if it's good enough even for CoCo's use case, where if anyone
tries to illegally access some page, it'll simply crash.

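The three properties above can be modelled with a toy, purely userspace
sketch. Everything in it (struct toy_gmem, toy_allow(), toy_fault(),
toy_revoke()) is hypothetical and only illustrates the intended
semantics. It is not the code proposed in this series, and a real fault
handler would more likely deliver SIGBUS than return an error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 8

/* Toy model of one shared region; not kernel code. */
struct toy_gmem {
    bool faultable[TOY_NR_PAGES]; /* set by the owner (e.g. QEMU/KVM) */
    bool mapped[TOY_NR_PAGES];    /* peer currently has the page mapped */
};

/* Owner marks a page as faultable / shared. Nothing is mapped yet. */
static void toy_allow(struct toy_gmem *g, size_t idx)
{
    assert(g);
    if (idx < TOY_NR_PAGES)
        g->faultable[idx] = true;
}

/* Peer touches a page: map it lazily, but only if it is faultable. */
static int toy_fault(struct toy_gmem *g, size_t idx)
{
    assert(g);
    if (idx >= TOY_NR_PAGES || !g->faultable[idx])
        return -EFAULT;
    g->mapped[idx] = true;
    return 0;
}

/* Owner revokes a page: clear faultability and zap the peer's mapping. */
static void toy_revoke(struct toy_gmem *g, size_t idx)
{
    assert(g);
    if (idx >= TOY_NR_PAGES)
        return;
    g->faultable[idx] = false;
    g->mapped[idx] = false;
}
```

The point of the model is that the peer never decides what it can see:
mmap() alone grants nothing, a fault only succeeds on pages the owner
marked, and a revoke takes effect even on pages already faulted in.
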
Besides that, we definitely can also make good use of non-CoCo 1G pages
for either the postcopy solution (which James used to work on for HGM) or
hwpoisoning. At least the latter is, I believe, still a common issue for
all of us: making hwpoison work for hugetlbfs at PAGE_SIZE granularity
[1]. The former will still be required at least for QEMU to leverage the
split-ability of gmemfd huge folios.

Then even if both the KVM ioctls and the iommufd ioctls only support
fd+offset, as long as the shared portion of the gmemfd folios is allowed
to be faultable and GUP-able, gmemfd can start to be considered as a
replacement for hugetlb to overcome those difficulties, even before CoCo
is supported everywhere.

There's also the question of whether all the known modules will
eventually support fd+offset, which I'm not sure about. If some module
won't support it, maybe it can still work with gmemfd via VA ranges, so
that it can still benefit from what gmemfd provides.

So in short, I'm not sure whether the use case can use a combination of
(fd, offset) interfaces for some modules like KVM/iommufd, and VA ranges
like before for others.

Thanks,

[1] https://lore.kernel.org/all/20240924043924.3562257-1-jiaqiyan@google.com/

-- 
Peter Xu