From: Fuad Tabba
Date: Thu, 16 Jan 2025 09:19:09 +0000
Subject: Re: [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support
To: Ackerley Tng
Cc: kvm@vger.kernel.org, linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
 pbonzini@redhat.com, chenhuacai@kernel.org, mpe@ellerman.id.au,
 anup@brainfault.org, paul.walmsley@sifive.com, palmer@dabbelt.com,
 aou@eecs.berkeley.edu, seanjc@google.com, viro@zeniv.linux.org.uk,
 brauner@kernel.org, willy@infradead.org, akpm@linux-foundation.org,
 xiaoyao.li@intel.com, yilun.xu@intel.com, chao.p.peng@linux.intel.com,
 jarkko@kernel.org, amoorthy@google.com, dmatlack@google.com,
 yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com, mic@digikod.net,
 vbabka@suse.cz, vannapurve@google.com, mail@maciej.szmigiero.name,
 david@redhat.com, michael.roth@amd.com, wei.w.wang@intel.com,
 liam.merwick@oracle.com, isaku.yamahata@gmail.com,
 kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com,
 steven.price@arm.com, quic_eberman@quicinc.com, quic_mnalajal@quicinc.com,
 quic_tsoni@quicinc.com, quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com,
 quic_pderrin@quicinc.com, quic_pheragu@quicinc.com, catalin.marinas@arm.com,
 james.morse@arm.com, yuzenghui@huawei.com, oliver.upton@linux.dev,
 maz@kernel.org, will@kernel.org, qperret@google.com, keirf@google.com,
 roypat@amazon.co.uk, shuah@kernel.org, hch@infradead.org, jgg@nvidia.com,
 rientjes@google.com, jhubbard@nvidia.com, fvdl@google.com, hughd@google.com,
 jthoughton@google.com

Hi Ackerley,

On Thu, 16 Jan 2025 at 00:35, Ackerley Tng wrote:
>
> Fuad Tabba writes:
>
> > Hi,
> >
> > As mentioned in the guest_memfd sync (2025-01-09), below is the state
> > diagram that uses the new states in this patch series, and how they
> > would interact with sharing/unsharing in pKVM:
> >
> > https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
>
> Thanks Fuad!
>
> I took a look at the state diagram [1] and the branch that this patch is
> on [2], and here's what I understand about the flow:
>
> 1. From state H in the state diagram, the guest can request to unshare a
>    page. When KVM handles this unsharing, KVM marks the folio
>    mappability as NONE (state J).
> 2. The transition from state J to state K or I is independent of KVM -
>    userspace has to do this unmapping
> 3. On the next vcpu_run() from userspace, continuing from userspace's
>    handling of the unshare request, guest_memfd will check and try to
>    register a callback if the folio's mappability is NONE. If the folio
>    is mapped, or if folio is not mapped but refcount is elevated for
>    whatever reason, vcpu_run() fails and exits to userspace. If folio is
>    not mapped and gmem holds the last refcount, set folio mappability to
>    GUEST.
>
> Here's one issue I see based on the above understanding:
>
> Registration of the folio_put() callback only happens if the VMM
> actually tries to do vcpu_run(). For 4K folios I think this is okay
> since the 4K folio can be freed via the transition state K -> state I,
> but for hugetlb folios that have been split for sharing with userspace,
> not getting a folio_put() callback means never putting the hugetlb folio
> together. Hence, relying on vcpu_run() to add the folio_put() callback
> leaves a way that hugetlb pages can be removed from the system.
>
> I think we should try and find a path forward that works for both 4K and
> hugetlb folios.

I agree, this could be an issue, but we could find other ways to
trigger the callback for huge folios. The important thing I was trying
to get to is how to have the callback and be able to register it.

> IIUC page._mapcount and page.page_type works as a union because
> page_type is only set for page types that are never mapped to userspace,
> like PGTY_slab, PGTY_offline, etc.

In the last guest_memfd sync, David Hildenbrand mentioned that this
would be a temporary restriction since the two structures would
eventually be decoupled, work being done by Matthew Wilcox I believe.

> Technically PGTY_guest_memfd is only set once the page can never be
> mapped to userspace, but PGTY_guest_memfd can only be set once mapcount
> reaches 0. Since mapcount is added in the faulting process, could gmem
> perhaps use some kind of .unmap/.unfault callback, so that gmem gets
> notified of all unmaps and will know for sure that the mapcount gets to
> 0?

I'm not sure if there is such a callback. If there were, I'm not sure
what that would buy us really. The main pain point is the refcount
going down to zero. The mapcount part is pretty straightforward and
likely to be only temporary as mentioned, i.e., when it gets
decoupled, we could register the callback earlier and simplify the
transition altogether.

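To make the flow under discussion a bit more concrete, here is a toy
userspace model of steps 1-3 from the summary above. It is only a
sketch, not kernel code: the names ToyFolio, Mappability,
guest_unshare() and try_make_private() are made up for illustration,
HOST_AND_GUEST is an assumed name for the shared state, and a refcount
of 1 is assumed to mean that guest_memfd holds the last reference.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Mappability(Enum):
    # Hypothetical names; only NONE and GUEST appear in the thread.
    HOST_AND_GUEST = auto()  # shared: host and guest may map the folio
    NONE = auto()            # transient: mappable by no one
    GUEST = auto()           # private: only the guest may map the folio


@dataclass
class ToyFolio:
    mappability: Mappability = Mappability.HOST_AND_GUEST
    mapcount: int = 0  # number of host userspace mappings (toy value)
    refcount: int = 1  # assumption: 1 means only guest_memfd holds a ref


def guest_unshare(folio: ToyFolio) -> None:
    """Step 1: the guest asks to unshare, KVM marks the folio NONE."""
    folio.mappability = Mappability.NONE


def try_make_private(folio: ToyFolio) -> bool:
    """Step 3: the check made on the next vcpu_run().

    Returns False where vcpu_run() would fail and exit to userspace.
    """
    if folio.mappability is Mappability.GUEST:
        return True
    if folio.mappability is not Mappability.NONE:
        return False  # no unshare pending, nothing to transition
    if folio.mapcount > 0 or folio.refcount > 1:
        return False  # still mapped, or someone else holds a reference
    folio.mappability = Mappability.GUEST
    return True


# Host still has the page mapped when the guest unshares it.
folio = ToyFolio(mapcount=1, refcount=2)
guest_unshare(folio)
print(try_make_private(folio), folio.mappability.name)

# Step 2: userspace unmaps, which also drops its reference.
folio.mapcount, folio.refcount = 0, 1
print(try_make_private(folio), folio.mappability.name)
```

Nothing in this model moves a folio out of NONE unless
try_make_private() is called, which mirrors the concern raised above
about registration only happening on vcpu_run().
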
> Alternatively, I took a look at the folio_is_zone_device()
> implementation, and page.flags is used to identify the page's type. IIUC
> a ZONE_DEVICE page also falls in the intersection of needing a
> folio_put() callback and can be mapped to userspace. Could we use a
> similar approach, using page.flags to identify a page as a guest_memfd
> page? That way we don't need to know when unmapping happens, and will
> always be able to get a folio_put() callback.

Same as above, with this being temporary, adding a new page flag is
probably not something that the rest of the community would be too
excited about :)

Thanks for your comments!
/fuad

> [1] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
> [2] https://android-kvm.googlesource.com/linux/+/764360863785ba16d974253a572c87abdd9fdf0b%5E%21/#F0
>
> > This patch series doesn't necessarily impose all these transitions,
> > many of them would be a matter of policy. This just happens to be the
> > current way I've done it with pKVM/arm64.
> >
> > Cheers,
> > /fuad
> >
> > On Fri, 13 Dec 2024 at 16:48, Fuad Tabba wrote:
> >>
> >> This series adds restricted mmap() support to guest_memfd, as
> >> well as support for guest_memfd on arm64. It is based on Linux
> >> 6.13-rc2. Please refer to v3 for the context [1].
> >>
> >> Main changes since v3:
> >> - Added a new folio type for guestmem, used to register a
> >>   callback when a folio's reference count reaches 0 (Matthew
> >>   Wilcox, DavidH) [2]
> >> - Introduce new mappability states for folios, where a folio can
> >>   be mappable by the host and the guest, only the guest, or by no
> >>   one (transient state)
> >> - Rebased on Linux 6.13-rc2
> >> - Refactoring and tidying up
> >>
> >> Cheers,
> >> /fuad
> >>
> >> [1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@google.com/
> >> [2] https://lore.kernel.org/all/20241108162040.159038-1-tabba@google.com/
> >>
> >> Ackerley Tng (2):
> >>   KVM: guest_memfd: Make guest mem use guest mem inodes instead of
> >>     anonymous inodes
> >>   KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
> >>
> >> Fuad Tabba (12):
> >>   mm: Consolidate freeing of typed folios on final folio_put()
> >>   KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains
> >>     the folio lock
> >>   KVM: guest_memfd: Folio mappability states and functions that manage
> >>     their transition
> >>   KVM: guest_memfd: Handle final folio_put() of guestmem pages
> >>   KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
> >>   KVM: guest_memfd: Add guest_memfd support to
> >>     kvm_(read|/write)_guest_page()
> >>   KVM: guest_memfd: Add KVM capability to check if guest_memfd is host
> >>     mappable
> >>   KVM: guest_memfd: Add a guest_memfd() flag to initialize it as
> >>     mappable
> >>   KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is
> >>     allowed
> >>   KVM: arm64: Skip VMA checks for slots without userspace address
> >>   KVM: arm64: Handle guest_memfd()-backed guest page faults
> >>   KVM: arm64: Enable guest_memfd private memory when pKVM is enabled
> >>
> >>  Documentation/virt/kvm/api.rst                |   4 +
> >>  arch/arm64/include/asm/kvm_host.h             |   3 +
> >>  arch/arm64/kvm/Kconfig                        |   1 +
> >>  arch/arm64/kvm/mmu.c                          | 119 +++-
> >>  include/linux/kvm_host.h                      |  75 +++
> >>  include/linux/page-flags.h                    |  22 +
> >>  include/uapi/linux/kvm.h                      |   2 +
> >>  include/uapi/linux/magic.h                    |   1 +
> >>  mm/debug.c                                    |   1 +
> >>  mm/swap.c                                     |  28 +-
> >>  tools/testing/selftests/kvm/Makefile          |   1 +
> >>  .../testing/selftests/kvm/guest_memfd_test.c  |  64 +-
> >>  virt/kvm/Kconfig                              |   4 +
> >>  virt/kvm/guest_memfd.c                        | 579 +++++++++++++++++-
> >>  virt/kvm/kvm_main.c                           | 229 ++++++-
> >>  15 files changed, 1074 insertions(+), 59 deletions(-)
> >>
> >>
> >> base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
> >> --
> >> 2.47.1.613.gc27f4b7a9f-goog
> >>