From: Vishal Annapurve <vannapurve@google.com>
Date: Mon, 30 Jun 2025 07:19:47 -0700
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
To: Alexey Kardashevskiy
Cc: Jason Gunthorpe, Fuad Tabba, Ackerley Tng, kvm@vger.kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org,
    linux-fsdevel@vger.kernel.org, ajones@ventanamicro.com,
    akpm@linux-foundation.org, amoorthy@google.com, anthony.yznaga@oracle.com,
    anup@brainfault.org, aou@eecs.berkeley.edu, bfoster@redhat.com,
    binbin.wu@linux.intel.com, brauner@kernel.org, catalin.marinas@arm.com,
    chao.p.peng@intel.com, chenhuacai@kernel.org, dave.hansen@intel.com,
    david@redhat.com, dmatlack@google.com, dwmw@amazon.co.uk,
    erdemaktas@google.com, fan.du@intel.com, fvdl@google.com, graf@amazon.com,
    haibo1.xu@intel.com, hch@infradead.org, hughd@google.com,
    ira.weiny@intel.com, isaku.yamahata@intel.com, jack@suse.cz,
    james.morse@arm.com, jarkko@kernel.org, jgowans@amazon.com,
    jhubbard@nvidia.com, jroedel@suse.de, jthoughton@google.com,
    jun.miao@intel.com, kai.huang@intel.com, keirf@google.com,
    kent.overstreet@linux.dev, kirill.shutemov@intel.com,
    liam.merwick@oracle.com, maciej.wieczor-retman@intel.com,
    mail@maciej.szmigiero.name, maz@kernel.org, mic@digikod.net,
    michael.roth@amd.com, mpe@ellerman.id.au, muchun.song@linux.dev,
    nikunj@amd.com, nsaenz@amazon.es, oliver.upton@linux.dev,
    palmer@dabbelt.com, pankaj.gupta@amd.com, paul.walmsley@sifive.com,
    pbonzini@redhat.com, pdurrant@amazon.co.uk, peterx@redhat.com,
    pgonda@google.com, pvorel@suse.cz, qperret@google.com,
    quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
    quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
    quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com,
    quic_tsoni@quicinc.com, richard.weiyang@gmail.com,
    rick.p.edgecombe@intel.com, rientjes@google.com, roypat@amazon.co.uk,
    rppt@kernel.org, seanjc@google.com, shuah@kernel.org,
    steven.price@arm.com, steven.sistare@oracle.com, suzuki.poulose@arm.com,
    thomas.lendacky@amd.com, usama.arif@bytedance.com, vbabka@suse.cz,
    viro@zeniv.linux.org.uk, vkuznets@redhat.com, wei.w.wang@intel.com,
    will@kernel.org, willy@infradead.org, xiaoyao.li@intel.com,
    yan.y.zhao@intel.com, yilun.xu@intel.com, yuzenghui@huawei.com,
    zhiquan1.li@intel.com
In-Reply-To: <8f04f1df-d68d-4ef8-b176-595bbf00a9d1@amd.com>
References: <9502503f-e0c2-489e-99b0-94146f9b6f85@amd.com>
    <20250624130811.GB72557@ziepe.ca>
    <31beeed3-b1be-439b-8a5b-db8c06dadc30@amd.com>
    <8f04f1df-d68d-4ef8-b176-595bbf00a9d1@amd.com>
On Sun, Jun 29, 2025 at 5:19 PM Alexey Kardashevskiy wrote:
>
> ...
>
> >>> ==========================
> >>>
> >>> For IOMMU, could something like below work?
> >>>
> >>> * A new UAPI to bind IOMMU FDs with guest_memfd ranges
> >>
> >> Done that.
> >>
> >>> * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
> >>> guest_memfd ranges using kvm_gmem_get_pfn()
> >>
> >> This API imho should drop the confusing kvm_ prefix.
> >>
> >>> -> kvm invokes kvm_gmem_is_private() to check for the range
> >>> shareability, IOMMU could use the same or we could add an API in gmem
> >>> that takes in access type and checks the shareability before returning
> >>> the pfn.
> >>
> >> Right now I cutnpasted kvm_gmem_get_folio() (which essentially is
> >> filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to
> >> avoid new links between iommufd.ko and kvm.ko. It is probably
> >> unavoidable though.
> >
> > I don't think that's the way to avoid links between iommufd.ko and
> > kvm.ko. Cleaner way probably is to have gmem logic built-in and allow
> > runtime registration of invalidation callbacks from KVM/IOMMU
> > backends. Need to think about this more.
>
> Yeah, otherwise iommufd.ko will have to install a hook in guest_memfd
> (==kvm.ko) in run time so more beloved symbol_get() :)
>
> >>
> >>> * IOMMU stack exposes an invalidation callback that can be invoked by
> >>> guest_memfd.
> >>>
> >>> Private to Shared conversion via kvm_gmem_convert_range() -
> >>>     1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> >>> on each bound memslot overlapping with the range
> >>>     2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> >>> actually unmaps the KVM SEPT/NPT entries.
> >>>         -> guest_memfd invokes IOMMU invalidation callback to zap
> >>> the secure IOMMU entries.
> >>>     3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> >>> shareability and then splits the folios if needed
> >>>     4) Userspace invokes IOMMU map operation to map the ranges in
> >>> non-secure IOMMU.
> >>>
> >>> Shared to private conversion via kvm_gmem_convert_range() -
> >>>     1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> >>> on each bound memslot overlapping with the range
> >>>     2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> >>> actually unmaps the host mappings, which will unmap the KVM non-secure
> >>> EPT/NPT entries.
> >>>         -> guest_memfd invokes IOMMU invalidation callback to zap the
> >>> non-secure IOMMU entries.
> >>>     3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> >>> shareability and then merges the folios if needed.
> >>>     4) Userspace invokes IOMMU map operation to map the ranges in
> >>> secure IOMMU.
> >>
> >> Alright (although this zap+map is not necessary on the AMD hw).
> >
> > IMO guest_memfd ideally should not directly interact with or cater to
> > arch-specific needs; it should implement a mechanism that works for all
> > archs. KVM/IOMMU implement invalidation callbacks and have all the
> > architecture-specific knowledge to take the right decisions.
>
> Every page conversion will go through:
>
> kvm-amd.ko -1-> guest_memfd (kvm.ko) -2-> iommufd.ko -3-> amd-iommu (built-in).
>
> Which one decides on the IOMMU not needing (un)mapping? It's got to be (1),
> but then it needs to propagate the decision to amd-iommu (and we do not
> have (3) at the moment in that path).

If there is a need, guest_memfd can support two different callbacks:
1) A conversion notifier/callback invoked by guest_memfd during
conversion handling.
2) An invalidation notifier/callback invoked by guest_memfd during
truncation.

IOMMUFD/KVM can handle the conversion callback/notifier as per the needs
of the underlying architecture, e.g. for TDX Connect, do the unmapping;
for SEV Trusted I/O, skip it. The invalidation callback/notifier will
always need to be handled by unmapping page tables.

> Or do we just always do unmap+map (and trigger unwanted huge page
> smashing)? All is doable and neither is particularly horrible; I'm trying
> to see where the consensus is now. Thanks,

I assume that by huge page smashing you mean a huge page NPT mapping
getting split. AFAIR, based on discussion with Michael during the
guest_memfd calls, stage-2 NPT entries need to be of the same granularity
as the RMP table entries for AMD SNP guests, i.e. huge page NPT mappings
need to be smashed on the KVM side during conversion. So today guest_memfd
sends an invalidation notification to KVM for both conversion and
truncation. Doesn't the same constraint of keeping IOMMU page tables at
the same granularity as the RMP tables hold for trusted I/O?