From: Vishal Annapurve <vannapurve@google.com>
Date: Fri, 27 Jun 2025 08:17:34 -0700
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
To: Alexey Kardashevskiy
Cc: Jason Gunthorpe, Fuad Tabba, Ackerley Tng, kvm@vger.kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org,
    linux-fsdevel@vger.kernel.org, ajones@ventanamicro.com,
    akpm@linux-foundation.org, amoorthy@google.com,
    anthony.yznaga@oracle.com, anup@brainfault.org, aou@eecs.berkeley.edu,
    bfoster@redhat.com, binbin.wu@linux.intel.com, brauner@kernel.org,
    catalin.marinas@arm.com, chao.p.peng@intel.com, chenhuacai@kernel.org,
    dave.hansen@intel.com, david@redhat.com, dmatlack@google.com,
    dwmw@amazon.co.uk, erdemaktas@google.com, fan.du@intel.com,
    fvdl@google.com, graf@amazon.com, haibo1.xu@intel.com,
    hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
    isaku.yamahata@intel.com, jack@suse.cz, james.morse@arm.com,
    jarkko@kernel.org, jgowans@amazon.com, jhubbard@nvidia.com,
    jroedel@suse.de, jthoughton@google.com, jun.miao@intel.com,
    kai.huang@intel.com, keirf@google.com, kent.overstreet@linux.dev,
    kirill.shutemov@intel.com, liam.merwick@oracle.com,
    maciej.wieczor-retman@intel.com, mail@maciej.szmigiero.name,
    maz@kernel.org, mic@digikod.net, michael.roth@amd.com,
    mpe@ellerman.id.au, muchun.song@linux.dev, nikunj@amd.com,
    nsaenz@amazon.es, oliver.upton@linux.dev, palmer@dabbelt.com,
    pankaj.gupta@amd.com, paul.walmsley@sifive.com, pbonzini@redhat.com,
    pdurrant@amazon.co.uk, peterx@redhat.com, pgonda@google.com,
    pvorel@suse.cz, qperret@google.com, quic_cvanscha@quicinc.com,
    quic_eberman@quicinc.com, quic_mnalajal@quicinc.com,
    quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
    quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com,
    richard.weiyang@gmail.com, rick.p.edgecombe@intel.com,
    rientjes@google.com, roypat@amazon.co.uk, rppt@kernel.org,
    seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
    steven.sistare@oracle.com, suzuki.poulose@arm.com,
    thomas.lendacky@amd.com, usama.arif@bytedance.com, vbabka@suse.cz,
    viro@zeniv.linux.org.uk, vkuznets@redhat.com, wei.w.wang@intel.com,
    will@kernel.org, willy@infradead.org, xiaoyao.li@intel.com,
    yan.y.zhao@intel.com, yilun.xu@intel.com, yuzenghui@huawei.com,
    zhiquan1.li@intel.com
In-Reply-To: <31beeed3-b1be-439b-8a5b-db8c06dadc30@amd.com>
References: <9502503f-e0c2-489e-99b0-94146f9b6f85@amd.com>
    <20250624130811.GB72557@ziepe.ca>
    <31beeed3-b1be-439b-8a5b-db8c06dadc30@amd.com>
On Thu, Jun 26, 2025 at 9:50 PM Alexey Kardashevskiy wrote:
>
> On 25/6/25 00:10, Vishal Annapurve wrote:
> > On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe wrote:
> >>
> >> On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> >>
> >>> Now, I am rebasing my RFC on top of this patchset and it fails in
> >>> kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> >>> folios in my RFC.
> >>>
> >>> So what is the expected sequence here? The userspace unmaps a DMA
> >>> page and maps it back right away, all from the userspace? The end
> >>> result will be exactly the same, which seems useless. And IOMMU TLB
> >
> > As Jason described, ideally the IOMMU, just like KVM, should:
> > 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> > by the IOMMU stack
> > 2) Directly query pfns from guest_memfd for both shared/private ranges
> > 3) Implement an invalidation callback that guest_memfd can invoke on
> > conversions.

Conversions and truncations both.
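To make the callback idea concrete, here is a minimal sketch of the
shape I have in mind. All the names below (gmem_invalidate_reason,
gmem_invalidate_ops) are made up for illustration; nothing like this
exists in the tree today:

enum gmem_invalidate_reason {
        GMEM_INVALIDATE_CONVERSION,     /* shared <-> private flip */
        GMEM_INVALIDATE_TRUNCATION,     /* range punched or truncated */
};

struct gmem_invalidate_ops {
        /*
         * Zap any translations covering [start, start + nr_pages) of
         * the gmem file and flush TLBs. The backend holds no folio
         * refcounts, so there is nothing to release here.
         */
        void (*invalidate)(void *priv, pgoff_t start,
                           unsigned long nr_pages,
                           enum gmem_invalidate_reason reason);
};

The reason argument would let a backend skip work the hardware already
does, e.g. the RMPUPDATE-driven IOMMU TLB flush you mention below.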
> >
> > Current flow:
> > Private to Shared conversion via kvm_gmem_convert_range() -
> >       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> >          -> KVM has the concept of invalidation_begin() and end(), which
> > effectively ensures that between these function calls, no new EPT/NPT
> > entries can be added for the range.
> >       2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
> > actually unmaps the KVM SEPT/NPT entries.
> >       3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
> > shareability and then splits the folios if needed.
> >
> > Shared to Private conversion via kvm_gmem_convert_range() -
> >       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> >       2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
> > actually unmaps the host mappings; this unmaps the KVM non-secure
> > EPT/NPT entries.
> >       3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
> > shareability and then merges the folios if needed.
> >
> > ============================
> >
> > For the IOMMU, could something like below work?
> >
> > * A new UAPI to bind IOMMU FDs with guest_memfd ranges
>
> Done that.
>
> > * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
> > guest_memfd ranges using kvm_gmem_get_pfn()
>
> This API IMHO should drop the confusing kvm_ prefix.
>
> >      -> KVM invokes kvm_gmem_is_private() to check the range's
> > shareability; the IOMMU could use the same, or we could add an API in
> > gmem that takes in an access type and checks the shareability before
> > returning the pfn.
>
> Right now I cut-and-pasted kvm_gmem_get_folio() (which essentially is
> filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to
> avoid new links between iommufd.ko and kvm.ko. It is probably
> unavoidable though.

I don't think that's the way to avoid links between iommufd.ko and
kvm.ko. A cleaner way is probably to have the gmem logic built in and to
allow runtime registration of invalidation callbacks from the KVM/IOMMU
backends. Need to think about this more, but roughly something like the
sketch below.
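Again purely a sketch with made-up names, reusing the hypothetical
gmem_invalidate_ops from above, and assuming the gmem core is built in
while kvm.ko and iommufd.ko stay modular:

struct gmem_invalidate_cb {
        struct list_head node;
        const struct gmem_invalidate_ops *ops;
        void *priv;
};

/*
 * Called by a backend (KVM or iommufd) when it binds to a gmem file.
 * Neither module needs symbols from the other, only from the built-in
 * gmem core.
 */
int gmem_register_invalidate_cb(struct file *gmem_file,
                                struct gmem_invalidate_cb *cb);
void gmem_unregister_invalidate_cb(struct file *gmem_file,
                                   struct gmem_invalidate_cb *cb);

On conversion or truncation the gmem core would walk the registered
callbacks under its invalidation lock, so no new mappings can race with
the zap before shareability is updated.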
> >
> > * IOMMU stack exposes an invalidation callback that can be invoked by
> > guest_memfd.
> >
> > Private to Shared conversion via kvm_gmem_convert_range() -
> >       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> >       2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
> > actually unmaps the KVM SEPT/NPT entries.
> >          -> guest_memfd invokes the IOMMU invalidation callback to zap
> > the secure IOMMU entries.
> >       3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
> > shareability and then splits the folios if needed.
> >       4) Userspace invokes the IOMMU map operation to map the ranges in
> > the non-secure IOMMU.
> >
> > Shared to Private conversion via kvm_gmem_convert_range() -
> >       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> >       2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
> > actually unmaps the host mappings; this unmaps the KVM non-secure
> > EPT/NPT entries.
> >          -> guest_memfd invokes the IOMMU invalidation callback to zap
> > the non-secure IOMMU entries.
> >       3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
> > shareability and then merges the folios if needed.
> >       4) Userspace invokes the IOMMU map operation to map the ranges in
> > the secure IOMMU.
>
> Alright (although this zap+map is not necessary on the AMD hw).

IMO guest_memfd ideally should not directly interact with or cater to
arch-specific needs; it should implement a mechanism that works for all
archs. KVM/IOMMU implement the invalidation callbacks and have all the
architecture-specific knowledge needed to make the right decisions.

> >
> > There should be a way to block external IOMMU pagetable updates while
> > guest_memfd is performing a conversion, e.g. something like
> > kvm_invalidate_begin()/end().
> >
> >>> is going to be flushed on a page conversion anyway (the RMPUPDATE
> >>> instruction does that). All this is about AMD's x86 though.
> >>
> >> The iommu should not be using the VMA to manage the mapping. It should
> >
> > +1.
>
> Yeah, not doing this already, because I physically cannot map gmemfd's
> memory in the IOMMU via the VMA (which allocates memory via gup(), so
> the wrong memory is mapped in the IOMMU). Thanks,
>
> >> be directly linked to the guestmemfd in some way that does not disturb
> >> its operations. I imagine there would be some kind of invalidation
> >> callback directly to the iommu.
> >>
> >> Presumably that invalidation callback can include a reason for the
> >> invalidation (addr change, shared/private conversion, etc.)
> >>
> >> I'm not sure how we will figure out which case is which, but
> >> guestmemfd should allow the iommu to plug in either invalidation
> >> scheme.
> >>
> >> Probably invalidation should be global to the FD; I imagine that once
> >> invalidation is established the iommu will not be incrementing page
> >> refcounts.
> >
> > +1.
>
> Alright. Thanks for the comments.
>
> >>
> >> Jason
>
> --
> Alexey