From: Vishal Annapurve <vannapurve@google.com>
Date: Thu, 10 Oct 2024 01:44:01 +0530
Subject: Re: [PATCH RFC v2 2/5] mm: guest_memfd: Allow folios to be accessible to host
In-Reply-To: <20240829-guest-memfd-lib-v2-2-b9afc1ff3656@quicinc.com>
References: <20240829-guest-memfd-lib-v2-0-b9afc1ff3656@quicinc.com>
 <20240829-guest-memfd-lib-v2-2-b9afc1ff3656@quicinc.com>
To: Elliot Berman
Cc: Andrew Morton, Sean Christopherson, Paolo Bonzini, Thomas Gleixner,
 Ingo Molnar, Borislav Petkov, Dave Hansen, Fuad Tabba, David Hildenbrand,
 Patrick Roy, qperret@google.com, Ackerley Tng, Mike Rapoport, x86@kernel.org, "H.
Peter Anvin" , linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org, linux-coco@lists.linux.dev, linux-arm-msm@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Stat-Signature: ozrmu3jrcoy3115mt37apaxesfuzaijx X-Rspamd-Queue-Id: 0458D4001A X-Rspamd-Server: rspam02 X-HE-Tag: 1728504856-575555 X-HE-Meta: U2FsdGVkX19gL0MzE6SOWXBi0MesnNkkrSp7L4SIwDHjJ03N4LzMeSRGRq9m8d4VACD85OJlRQSgKb09UTE267DrWhXntRIWG66CQUyAQOs6u/f393GoTwNHF4MSubfhnVrrnALJXjOT+HVuQ7A2E3AsJibaq3JGUPcjNBUnWN3+CtdkG/HGPIQr9rJSDmj4Pl4YraXmfJJvO5dQ/UcCyuX7z3D7kgoIM+lmx55dypbuQuzfTWQaEJYXqvzWopgL2J5hkUrIHQNpQuO/eO8S/Vgd8E8s9IX9A7WGcZjoLws0Sr4aC2JFd7/cNjAEQdhnKR3GiQ8+iWrlrV9oNrI7/1ObE7iTSfHJRnhAmWb+jTyt7cAdZUPQ9pFlKIKvJzHGL5zjIBS7Cp6vgepjPoRydorSqdebSonbOUHObW4FvdiQUj0zvrwA0ujCCiQo7A/KNnN/INIMS2R7FluYFZcu1GUgJ5ao1LQNwT9ktwwxuiQNK/oqeFQ3D9O/LBLEIZkkFDDmxW8KoNAPuFS2ufQwwaQRExNBgxjRuzRGUU4QtNrPmuklca/ebIdFCgxH83eF3HNZRFpfIimnyXO9jfrAg7En8MbBYRoR4L6JiW+o0Toz+K0F32LU89wexi9M2PCnMAH7Bm5YYJtCmht6WD4y7Hjub3DO9GNbHKYBQs0RU+2fjzz19QslHzYOwhPupd1/9JwStRI078bCq0GgMd1mqW4Y8vnROzyGJbW9gHyYrH+Fk+I9cjnJMMUpzTxU+kTo/SjncTQTHWQ+SjFbVcYSHw61n4d+/zVZxJPYL7LpL9x7uyf1qUDNCIVkaC3P2ATkSzJPfjrH//C2BjbTSeO5GMuWbrUQQ5uz178+FHUecgLku2DY9Y5moMpZ3k6oQqXCoOiapCFc0IGa4r9V43umKtifxSZeJFLqBEjpLFInEZkmWXj0M35VvfwtXcbGdZVfqvj53nr/CvdQszzpF08 qu1p6la6 9/+u+JPdJ2Te4AFdNrVfKLdGODWbRgT9FGg46x+HT/7JvTy121vel9De37EqyT+mbMwHfBpQ8a8YMwUbu85gPkIkM7aqUI0v34QE/8UEUCbwnxt/6VxsNUZCh+RA/WxoDqsK9QoPn9G2thSjcfQez9RxylGcjGi1+6OgVONmd3TYtlm7Mu5WLf0vflK9Dgk263Nal87/gGE9NYfmiR2KBV7/uhqwqZOfhIne/8mwvlzCLdZuRH2c+emRs/Eo5sM7eXq0q9h+ynD49qMJwJLgxkGf1Ib2qwlhsDxDwUNeJqq0ss3zeabpcnMVVdceVqHvKRiXkrztkly+GyPO3Sru8leDcgu1kH59gqVqE12CNUcwCelBo+k//9zBJUYqxYJZKhBOsTYJAnDFekt3RoDMOVxTOakkXPMQ2qPhEHx/qMUO4GVBD2uNnFj585I4iaOujupXMyRPm4SYxQHNkLi3j3IVzPqcw0b6A4fvL+gqZ4LUd0qY9Cy65/f0s6jxKqJ3EwKKDaBx1K84fYRg= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, Aug 30, 2024 at 3:55=E2=80=AFAM Elliot Berman wrote: > > Memory given to a confidential VM sometimes needs to be accessible by > Linux. Before the VM starts, Linux needs to load some payload into > the guest memory. While the VM is running, the guest may make some of > the memory accessible to the host, e.g. to share virtqueue buffers. We > choose less-used terminology here to avoid confusion with other terms > (i.e. private). Memory is considered "accessible" when Linux (the host) > can read/write to that memory. It is considered "inaccessible" when > reads/writes aren't allowed by the hypervisor. > > Careful tracking of when the memory is supposed to be "inaccessible" and > "accessible" is needed because hypervisors will fault Linux if we > incorrectly access memory which shouldn't be accessed. On arm64 systems, > this is a translation fault. On x86 systems, this could be a machine > check. > > After discussion in [1], we are using 3 counters to track the state of a > folio: the general folio_ref_count, a "safe" ref count, and an > "accessible" counter. This is a long due response after discussion at LPC. In order to support hugepages with guest memfd, the current direction is to split hugepages on memory conversion [1]. During LPC session [2], we discussed the need of reconstructing split hugepages before giving back the memory to hugepage allocator. 
With such a callback in place, shared -> private conversion (and similarly
truncation) handling by guest_memfd may look like:
1) Drop all the guest_memfd internal refcounts on the pages backing the
   converted ranges.
2) Tag those pages so that core-mm invokes a callback implemented by
   guest_memfd when the last refcount gets dropped.
(A rough sketch of this flow follows at the end of this reply.)

At this point I feel that the mechanism to achieve step 2 above should be a
small, generic modification to core-mm logic, and should not need to
piggy-back on the ZONE_DEVICE memory handling, which already carries
similar logic [3].

Private memory doesn't need such a special callback, since the desired
policy we discussed at LPC is:
1) guest_memfd owns all long-term refcounts on private memory.
2) Any short-term refcounts handed out outside guest_memfd should be
   protected by folio locks.
3) On truncation/conversion, guest_memfd private memory users will be
   notified to unmap/refresh their mappings.
i.e. after a private -> shared conversion, it is guaranteed that there are
no active users of the guest private memory.

[1] Linux MM Alignment Session:
    https://lore.kernel.org/all/20240712232937.2861788-1-ackerleytng@google.com/
[2] https://lpc.events/event/18/contributions/1764/
[3] https://elixir.bootlin.com/linux/v6.11.2/source/mm/swap.c#L117
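To illustrate steps 1) and 2) above, here is a rough sketch of what the
guest_memfd side of a shared -> private conversion could look like.
gmem_tag_for_callback() and gmem_drop_internal_refs() are made-up helpers
used only for illustration; error handling and the interaction with
invalidation are omitted:

/* Sketch only: hypothetical guest_memfd-side conversion flow. */
static void gmem_convert_range_to_private(struct inode *inode,
					  pgoff_t start, pgoff_t nr)
{
	pgoff_t index;

	for (index = start; index < start + nr; index++) {
		struct folio *folio = filemap_lock_folio(inode->i_mapping, index);

		if (IS_ERR(folio))
			continue;

		/*
		 * Step 2: mark the folio so that core-mm calls back into
		 * guest_memfd when its last refcount is dropped.
		 */
		gmem_tag_for_callback(folio);

		/*
		 * Step 1: drop guest_memfd's own long-term references (e.g.
		 * the filemap reference); any remaining pins (VFIO, GUP
		 * users) now hold the only references, and the callback
		 * fires once they are gone.
		 */
		gmem_drop_internal_refs(folio);

		folio_unlock(folio);
		folio_put(folio);	/* ref taken by filemap_lock_folio() */
	}
}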
>
> Transition between accessible and inaccessible is allowed only when:
> 0. The folio is locked
> 1. The "accessible" counter is at 0.
> 2. The "safe" ref count equals the folio_ref_count.
> 3. The hypervisor allows it.
>
> The accessible counter can be used by Linux to guarantee the page stays
> accessible, without elevating the general refcount. When the accessible
> counter decrements to 0, we attempt to make the page inaccessible. When
> the accessible counters increments to 1, we attempt to make the page
> accessible.
>
> We expect the folio_ref_count to be nearly zero. The "nearly" amount is
> determined by the "safe" ref count value. The safe ref count isn't a
> signal whether the folio is accessible or not, it is only used to
> compare against the folio_ref_count.
>
> The final condition to transition between (in)accessible is whether the
> ->prepare_accessible or ->prepare_inaccessible guest_memfd_operation
> passes. In arm64 pKVM/Gunyah terms, the fallible "prepare_accessible"
> check is needed to ensure that the folio is unlocked by the guest and
> thus accessible to the host.
>
> When grabbing a folio, the client can either request for it to be
> accessible or inaccessible. If the folio already exists, we attempt to
> transition it to the state, if not already in that state. This will
> allow KVM or userspace to access guest_memfd *before* it is made
> inaccessible because KVM and userspace will use
> GUEST_MEMFD_GRAB_ACCESSIBLE.
>
> [1]: https://lore.kernel.org/all/a7c5bfc0-1648-4ae1-ba08-e706596e014b@redhat.com/
>
> Signed-off-by: Elliot Berman
> ---
>  include/linux/guest_memfd.h |  10 ++
>  mm/guest_memfd.c            | 238 ++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 236 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/guest_memfd.h b/include/linux/guest_memfd.h
> index 8785b7d599051..66e5d3ab42613 100644
> --- a/include/linux/guest_memfd.h
> +++ b/include/linux/guest_memfd.h
> @@ -22,17 +22,27 @@ struct guest_memfd_operations {
>          int (*invalidate_begin)(struct inode *inode, pgoff_t offset, unsigned long nr);
>          void (*invalidate_end)(struct inode *inode, pgoff_t offset, unsigned long nr);
>          int (*prepare_inaccessible)(struct inode *inode, struct folio *folio);
> +        int (*prepare_accessible)(struct inode *inode, struct folio *folio);
>          int (*release)(struct inode *inode);
>  };
>
> +enum guest_memfd_grab_flags {
> +        GUEST_MEMFD_GRAB_INACCESSIBLE = (0UL << 0),
> +        GUEST_MEMFD_GRAB_ACCESSIBLE = (1UL << 0),
> +};
> +
>  enum guest_memfd_create_flags {
>          GUEST_MEMFD_FLAG_CLEAR_INACCESSIBLE = (1UL << 0),
>  };
>
>  struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags);
> +void guest_memfd_put_folio(struct folio *folio, unsigned int accessible_refs);
> +void guest_memfd_unsafe_folio(struct folio *folio);
>  struct file *guest_memfd_alloc(const char *name,
>                                 const struct guest_memfd_operations *ops,
>                                 loff_t size, unsigned long flags);
>  bool is_guest_memfd(struct file *file, const struct guest_memfd_operations *ops);
> +int guest_memfd_make_accessible(struct folio *folio);
> +int guest_memfd_make_inaccessible(struct folio *folio);
>
>  #endif
> diff --git a/mm/guest_memfd.c b/mm/guest_memfd.c
> index c6cd01e6064a7..62cb576248a9d 100644
> --- a/mm/guest_memfd.c
> +++ b/mm/guest_memfd.c
> @@ -4,9 +4,33 @@
>   */
>
>  #include
> +#include
>  #include
>  #include
>  #include
> +#include
> +
> +#include "internal.h"
> +
> +static DECLARE_WAIT_QUEUE_HEAD(safe_wait);
> +
> +/**
> + * struct guest_memfd_private - private per-folio data
> + * @accessible: number of kernel users expecting folio to be accessible.
> + *              When zero, the folio converts to being inaccessible.
> + * @safe: number of "safe" references to the folio. Each reference is
> + *        aware that the folio can be made (in)accessible at any time.
> + */
> +struct guest_memfd_private {
> +        atomic_t accessible;
> +        atomic_t safe;
> +};
> +
> +static inline int base_safe_refs(struct folio *folio)
> +{
> +        /* 1 for filemap */
> +        return 1 + folio_nr_pages(folio);
> +}
>
>  /**
>   * guest_memfd_grab_folio() -- grabs a folio from the guest memfd
> @@ -35,21 +59,56 @@
>   */
>  struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags)
>  {
> -        unsigned long gmem_flags = (unsigned long)file->private_data;
> +        const bool accessible = flags & GUEST_MEMFD_GRAB_ACCESSIBLE;
>          struct inode *inode = file_inode(file);
>          struct guest_memfd_operations *ops = inode->i_private;
> +        struct guest_memfd_private *private;
> +        unsigned long gmem_flags;
>          struct folio *folio;
>          int r;
>
>          /* TODO: Support huge pages. */
> -        folio = filemap_grab_folio(inode->i_mapping, index);
> +        folio = __filemap_get_folio(inode->i_mapping, index,
> +                                    FGP_LOCK | FGP_ACCESSED | FGP_CREAT | FGP_STABLE,
> +                                    mapping_gfp_mask(inode->i_mapping));
>          if (IS_ERR(folio))
>                  return folio;
>
> -        if (folio_test_uptodate(folio))
> +        if (folio_test_uptodate(folio)) {
> +                private = folio_get_private(folio);
> +                atomic_inc(&private->safe);
> +                if (accessible)
> +                        r = guest_memfd_make_accessible(folio);
> +                else
> +                        r = guest_memfd_make_inaccessible(folio);
> +
> +                if (r) {
> +                        atomic_dec(&private->safe);
> +                        goto out_err;
> +                }
> +
> +                wake_up_all(&safe_wait);
>                  return folio;
> +        }
>
> -        folio_wait_stable(folio);
> +        private = kmalloc(sizeof(*private), GFP_KERNEL);
> +        if (!private) {
> +                r = -ENOMEM;
> +                goto out_err;
> +        }
> +
> +        folio_attach_private(folio, private);
> +        /*
> +         * 1 for us
> +         * 1 for unmapping from userspace
> +         */
> +        atomic_set(&private->accessible, accessible ? 2 : 0);
> +        /*
> +         * +1 for us
> +         */
> +        atomic_set(&private->safe, 1 + base_safe_refs(folio));
> +
> +        gmem_flags = (unsigned long)inode->i_mapping->i_private_data;
>
>          /*
>           * Use the up-to-date flag to track whether or not the memory has been
> @@ -57,19 +116,26 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
>           * storage for the memory, so the folio will remain up-to-date until
>           * it's removed.
>           */
> -        if (gmem_flags & GUEST_MEMFD_FLAG_CLEAR_INACCESSIBLE) {
> +        if (accessible || (gmem_flags & GUEST_MEMFD_FLAG_CLEAR_INACCESSIBLE)) {
>                  unsigned long nr_pages = folio_nr_pages(folio);
>                  unsigned long i;
>
>                  for (i = 0; i < nr_pages; i++)
>                          clear_highpage(folio_page(folio, i));
> -
>          }
>
> -        if (ops->prepare_inaccessible) {
> -                r = ops->prepare_inaccessible(inode, folio);
> -                if (r < 0)
> -                        goto out_err;
> +        if (accessible) {
> +                if (ops->prepare_accessible) {
> +                        r = ops->prepare_accessible(inode, folio);
> +                        if (r < 0)
> +                                goto out_free;
> +                }
> +        } else {
> +                if (ops->prepare_inaccessible) {
> +                        r = ops->prepare_inaccessible(inode, folio);
> +                        if (r < 0)
> +                                goto out_free;
> +                }
>          }
>
>          folio_mark_uptodate(folio);
> @@ -78,6 +144,8 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
>           * unevictable and there is no storage to write back to.
>           */
>          return folio;
> +out_free:
> +        kfree(private);
>  out_err:
>          folio_unlock(folio);
>          folio_put(folio);
> @@ -85,6 +153,132 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
>  }
>  EXPORT_SYMBOL_GPL(guest_memfd_grab_folio);
>
> +/**
> + * guest_memfd_put_folio() - Drop safe and accessible references to a folio
> + * @folio: the folio to drop references to
> + * @accessible_refs: number of accessible refs to drop, 0 if holding a
> + *                   reference to an inaccessible folio.
> + */
> +void guest_memfd_put_folio(struct folio *folio, unsigned int accessible_refs)
> +{
> +        struct guest_memfd_private *private = folio_get_private(folio);
> +
> +        WARN_ON_ONCE(atomic_sub_return(accessible_refs, &private->accessible) < 0);
> +        atomic_dec(&private->safe);
> +        folio_put(folio);
> +        wake_up_all(&safe_wait);
> +}
> +EXPORT_SYMBOL_GPL(guest_memfd_put_folio);
> +
> +/**
> + * guest_memfd_unsafe_folio() - Demotes the current folio reference to "unsafe"
> + * @folio: the folio to demote
> + *
> + * Decrements the number of safe references to this folio. The folio will not
> + * transition to inaccessible until the folio_ref_count is also decremented.
> + *
> + * This function does not release the folio reference count.
> + */
> +void guest_memfd_unsafe_folio(struct folio *folio)
> +{
> +        struct guest_memfd_private *private = folio_get_private(folio);
> +
> +        atomic_dec(&private->safe);
> +        wake_up_all(&safe_wait);
> +}
> +EXPORT_SYMBOL_GPL(guest_memfd_unsafe_folio);
> +
> +/**
> + * guest_memfd_make_accessible() - Attempt to make the folio accessible to host
> + * @folio: the folio to make accessible
> + *
> + * Makes the given folio accessible to the host. If the folio is currently
> + * inaccessible, attempts to convert it to accessible. Otherwise, returns with
> + * EBUSY.
> + *
> + * This function may sleep.
> + */
> +int guest_memfd_make_accessible(struct folio *folio)
> +{
> +        struct guest_memfd_private *private = folio_get_private(folio);
> +        struct inode *inode = folio_inode(folio);
> +        struct guest_memfd_operations *ops = inode->i_private;
> +        int r;
> +
> +        /*
> +         * If we already know the folio is accessible, then no need to do
> +         * anything else.
> +         */
> +        if (atomic_inc_not_zero(&private->accessible))
> +                return 0;
> +
> +        r = wait_event_timeout(safe_wait,
> +                               folio_ref_count(folio) == atomic_read(&private->safe),
> +                               msecs_to_jiffies(10));
> +        if (!r)
> +                return -EBUSY;
> +
> +        if (ops->prepare_accessible) {
> +                r = ops->prepare_accessible(inode, folio);
> +                if (r)
> +                        return r;
> +        }
> +
> +        atomic_inc(&private->accessible);
> +        return 0;
> +}
> +EXPORT_SYMBOL_GPL(guest_memfd_make_accessible);
> +
> +/**
> + * guest_memfd_make_inaccessible() - Attempt to make the folio inaccessible
> + * @folio: the folio to make inaccessible
> + *
> + * Makes the given folio inaccessible to the host. IF the folio is currently
> + * accessible, attempt so convert it to inaccessible. Otherwise, returns with
> + * EBUSY.
> + *
> + * Conversion to inaccessible is allowed when ->accessible decrements to zero,
> + * the folio safe counter == folio reference counter, the folio is unmapped
> + * from host, and ->prepare_inaccessible returns it's ready to do so.
> + *
> + * This function may sleep.
> + */
> +int guest_memfd_make_inaccessible(struct folio *folio)
> +{
> +        struct guest_memfd_private *private = folio_get_private(folio);
> +        struct inode *inode = folio_inode(folio);
> +        struct guest_memfd_operations *ops = inode->i_private;
> +        int r;
> +
> +        r = atomic_dec_if_positive(&private->accessible);
> +        if (r < 0)
> +                return 0;
> +        else if (r > 0)
> +                return -EBUSY;
> +
> +        unmap_mapping_folio(folio);
> +
> +        r = wait_event_timeout(safe_wait,
> +                               folio_ref_count(folio) == atomic_read(&private->safe),
> +                               msecs_to_jiffies(10));
> +        if (!r) {
> +                r = -EBUSY;
> +                goto err;
> +        }
> +
> +        if (ops->prepare_inaccessible) {
> +                r = ops->prepare_inaccessible(inode, folio);
> +                if (r)
> +                        goto err;
> +        }
> +
> +        return 0;
> +err:
> +        atomic_inc(&private->accessible);
> +        return r;
> +}
> +EXPORT_SYMBOL_GPL(guest_memfd_make_inaccessible);
> +
>  static long gmem_punch_hole(struct file *file, loff_t offset, loff_t len)
>  {
>          struct inode *inode = file_inode(file);
> @@ -229,10 +423,12 @@ static int gmem_error_folio(struct address_space *mapping, struct folio *folio)
>
>  static bool gmem_release_folio(struct folio *folio, gfp_t gfp)
>  {
> +        struct guest_memfd_private *private = folio_get_private(folio);
>          struct inode *inode = folio_inode(folio);
>          struct guest_memfd_operations *ops = inode->i_private;
>          off_t offset = folio->index;
>          size_t nr = folio_nr_pages(folio);
> +        unsigned long val, expected;
>          int ret;
>
>          ret = ops->invalidate_begin(inode, offset, nr);
> @@ -241,14 +437,32 @@ static bool gmem_release_folio(struct folio *folio, gfp_t gfp)
>          if (ops->invalidate_end)
>                  ops->invalidate_end(inode, offset, nr);
>
> +        expected = base_safe_refs(folio);
> +        val = atomic_read(&private->safe);
> +        WARN_ONCE(val != expected, "folio[%x] safe ref: %d != expected %d\n",
> +                  folio_index(folio), val, expected);
> +
> +        folio_detach_private(folio);
> +        kfree(private);
> +
>          return true;
>  }
>
> +static void gmem_invalidate_folio(struct folio *folio, size_t offset, size_t len)
> +{
> +        WARN_ON_ONCE(offset != 0);
> +        WARN_ON_ONCE(len != folio_size(folio));
> +
> +        if (offset == 0 && len == folio_size(folio))
> +                filemap_release_folio(folio, 0);
> +}
> +
>  static const struct address_space_operations gmem_aops = {
>          .dirty_folio = noop_dirty_folio,
>          .migrate_folio = gmem_migrate_folio,
>          .error_remove_folio = gmem_error_folio,
>          .release_folio = gmem_release_folio,
> +        .invalidate_folio = gmem_invalidate_folio,
>  };
>
>  static inline bool guest_memfd_check_ops(const struct guest_memfd_operations *ops)
> @@ -291,8 +505,7 @@ struct file *guest_memfd_alloc(const char *name,
>           * instead of reusing a single inode. Each guest_memfd instance needs
>           * its own inode to track the size, flags, etc.
>           */
> -        file = anon_inode_create_getfile(name, &gmem_fops, (void *)flags,
> -                                         O_RDWR, NULL);
> +        file = anon_inode_create_getfile(name, &gmem_fops, NULL, O_RDWR, NULL);
>          if (IS_ERR(file))
>                  return file;
>
> @@ -303,6 +516,7 @@ struct file *guest_memfd_alloc(const char *name,
>
>          inode->i_private = (void *)ops; /* discards const qualifier */
>          inode->i_mapping->a_ops = &gmem_aops;
> +        inode->i_mapping->i_private_data = (void *)flags;
>          inode->i_mode |= S_IFREG;
>          inode->i_size = size;
>          mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
>
> --
> 2.34.1
>
>