* [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support
@ 2025-01-17 16:29 Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
` (14 more replies)
0 siblings, 15 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
This series adds restricted mmap() support to guest_memfd, as
well as support for guest_memfd on arm64. It is based on Linux
6.13-rc7. Please refer to v3 for the context [1].
Main changes since v4 [2]:
- Fixed handling of guest_memfd()-backed page faults in arm64
- Rebased on Linux 6.13-rc7
Not a change per se, but I am able to trigger/test the callback
on the final __folio_put() using vmsplice to grab a reference
without increasing the mapcount.
The state diagram that uses the new states in this patch series,
and how they would interact with sharing/unsharing in pKVM, can
be found at [3].
Cheers,
/fuad
[1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@google.com/
[2] https://lore.kernel.org/all/20241213164811.2006197-1-tabba@google.com/
[3] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
Ackerley Tng (2):
KVM: guest_memfd: Make guest mem use guest mem inodes instead of
anonymous inodes
KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
Fuad Tabba (13):
mm: Consolidate freeing of typed folios on final folio_put()
KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains
the folio lock
KVM: guest_memfd: Folio mappability states and functions that manage
their transition
KVM: guest_memfd: Handle final folio_put() of guestmem pages
KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
KVM: guest_memfd: Add guest_memfd support to
kvm_(read|write)_guest_page()
KVM: guest_memfd: Add KVM capability to check if guest_memfd is host
mappable
KVM: guest_memfd: Add a guest_memfd() flag to initialize it as
mappable
KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is
allowed
KVM: arm64: Skip VMA checks for slots without userspace address
KVM: arm64: Refactor user_mem_abort() calculation of force_pte
KVM: arm64: Handle guest_memfd()-backed guest page faults
KVM: arm64: Enable guest_memfd private memory when pKVM is enabled
Documentation/virt/kvm/api.rst | 4 +
arch/arm64/include/asm/kvm_host.h | 3 +
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/mmu.c | 98 ++-
include/linux/kvm_host.h | 80 +++
include/linux/page-flags.h | 22 +
include/uapi/linux/kvm.h | 2 +
include/uapi/linux/magic.h | 1 +
mm/debug.c | 1 +
mm/swap.c | 28 +-
tools/testing/selftests/kvm/Makefile | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 64 +-
virt/kvm/Kconfig | 4 +
virt/kvm/guest_memfd.c | 579 +++++++++++++++++-
virt/kvm/kvm_main.c | 234 ++++++-
15 files changed, 1034 insertions(+), 88 deletions(-)
base-commit: 5bc55a333a2f7316b58edc7573e8e893f7acb532
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put()
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
@ 2025-01-17 16:29 ` Fuad Tabba
2025-01-17 22:05 ` Elliot Berman
2025-01-20 10:39 ` David Hildenbrand
2025-01-17 16:29 ` [RFC PATCH v5 02/15] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
` (13 subsequent siblings)
14 siblings, 2 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Some folio types, such as hugetlb, handle freeing their own
folios. Moreover, guest_memfd will need to be notified once a
folio's reference count reaches 0, to facilitate shared-to-private
folio conversion, without the folio actually being freed at that
point.
As a first step towards that, this patch consolidates the freeing
of folios that have a type. The first user is hugetlb folios;
later in this series, guest_memfd becomes the second user.
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
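Not part of the patch, but for reference, the final-put path after this
change looks roughly as follows; the guest_memfd hook mentioned above only
arrives later in this series:

  folio_put(folio)
    -> __folio_put(folio)                    /* last reference dropped */
         -> free_typed_folio(folio)          /* if folio_has_type(folio) */
              -> free_huge_folio(folio)      /* PGTY_hugetlb */
              -> kvm_gmem_handle_folio_put() /* PGTY_guestmem, added later */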
include/linux/page-flags.h | 15 +++++++++++++++
mm/swap.c | 24 +++++++++++++++++++-----
2 files changed, 34 insertions(+), 5 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 691506bdf2c5..6615f2f59144 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -962,6 +962,21 @@ static inline bool page_has_type(const struct page *page)
return page_mapcount_is_type(data_race(page->page_type));
}
+static inline int page_get_type(const struct page *page)
+{
+ return page->page_type >> 24;
+}
+
+static inline bool folio_has_type(const struct folio *folio)
+{
+ return page_has_type(&folio->page);
+}
+
+static inline int folio_get_type(const struct folio *folio)
+{
+ return page_get_type(&folio->page);
+}
+
#define FOLIO_TYPE_OPS(lname, fname) \
static __always_inline bool folio_test_##fname(const struct folio *folio) \
{ \
diff --git a/mm/swap.c b/mm/swap.c
index 10decd9dffa1..6f01b56bce13 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -94,6 +94,20 @@ static void page_cache_release(struct folio *folio)
unlock_page_lruvec_irqrestore(lruvec, flags);
}
+static void free_typed_folio(struct folio *folio)
+{
+ switch (folio_get_type(folio)) {
+ case PGTY_hugetlb:
+ free_huge_folio(folio);
+ return;
+ case PGTY_offline:
+ /* Nothing to do, it's offline. */
+ return;
+ default:
+ WARN_ON_ONCE(1);
+ }
+}
+
void __folio_put(struct folio *folio)
{
if (unlikely(folio_is_zone_device(folio))) {
@@ -101,8 +115,8 @@ void __folio_put(struct folio *folio)
return;
}
- if (folio_test_hugetlb(folio)) {
- free_huge_folio(folio);
+ if (unlikely(folio_has_type(folio))) {
+ free_typed_folio(folio);
return;
}
@@ -934,13 +948,13 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
if (!folio_ref_sub_and_test(folio, nr_refs))
continue;
- /* hugetlb has its own memcg */
- if (folio_test_hugetlb(folio)) {
+ if (unlikely(folio_has_type(folio))) {
+ /* typed folios have their own memcg, if any */
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- free_huge_folio(folio);
+ free_typed_folio(folio);
continue;
}
folio_unqueue_deferred_split(folio);
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 02/15] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
@ 2025-01-17 16:29 ` Fuad Tabba
2025-01-24 4:25 ` Gavin Shan
2025-01-17 16:29 ` [RFC PATCH v5 03/15] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
` (12 subsequent siblings)
14 siblings, 1 reply; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
From: Ackerley Tng <ackerleytng@google.com>
Using guest mem inodes allows us to store metadata for the backing
memory on the inode. Metadata will be added in a later patch to
support HugeTLB pages.
Metadata about backing memory should not be stored on the file, since
the file represents a guest_memfd's binding with a struct kvm, and
metadata about backing memory is not specific to any one binding
or struct kvm.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
include/uapi/linux/magic.h | 1 +
virt/kvm/guest_memfd.c | 119 ++++++++++++++++++++++++++++++-------
2 files changed, 100 insertions(+), 20 deletions(-)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..169dba2a6920 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -103,5 +103,6 @@
#define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
#define PID_FS_MAGIC 0x50494446 /* "PIDF" */
+#define GUEST_MEMORY_MAGIC 0x474d454d /* "GMEM" */
#endif /* __LINUX_MAGIC_H__ */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 47a9f68f7b24..198554b1f0b5 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1,12 +1,17 @@
// SPDX-License-Identifier: GPL-2.0
+#include <linux/fs.h>
+#include <linux/mount.h>
#include <linux/backing-dev.h>
#include <linux/falloc.h>
#include <linux/kvm_host.h>
+#include <linux/pseudo_fs.h>
#include <linux/pagemap.h>
#include <linux/anon_inodes.h>
#include "kvm_mm.h"
+static struct vfsmount *kvm_gmem_mnt;
+
struct kvm_gmem {
struct kvm *kvm;
struct xarray bindings;
@@ -307,6 +312,38 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
return gfn - slot->base_gfn + slot->gmem.pgoff;
}
+static const struct super_operations kvm_gmem_super_operations = {
+ .statfs = simple_statfs,
+};
+
+static int kvm_gmem_init_fs_context(struct fs_context *fc)
+{
+ struct pseudo_fs_context *ctx;
+
+ if (!init_pseudo(fc, GUEST_MEMORY_MAGIC))
+ return -ENOMEM;
+
+ ctx = fc->fs_private;
+ ctx->ops = &kvm_gmem_super_operations;
+
+ return 0;
+}
+
+static struct file_system_type kvm_gmem_fs = {
+ .name = "kvm_guest_memory",
+ .init_fs_context = kvm_gmem_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
+static void kvm_gmem_init_mount(void)
+{
+ kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
+ BUG_ON(IS_ERR(kvm_gmem_mnt));
+
+ /* For giggles. Userspace can never map this anyways. */
+ kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
+}
+
static struct file_operations kvm_gmem_fops = {
.open = generic_file_open,
.release = kvm_gmem_release,
@@ -316,6 +353,8 @@ static struct file_operations kvm_gmem_fops = {
void kvm_gmem_init(struct module *module)
{
kvm_gmem_fops.owner = module;
+
+ kvm_gmem_init_mount();
}
static int kvm_gmem_migrate_folio(struct address_space *mapping,
@@ -397,11 +436,67 @@ static const struct inode_operations kvm_gmem_iops = {
.setattr = kvm_gmem_setattr,
};
+static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
+ loff_t size, u64 flags)
+{
+ const struct qstr qname = QSTR_INIT(name, strlen(name));
+ struct inode *inode;
+ int err;
+
+ inode = alloc_anon_inode(kvm_gmem_mnt->mnt_sb);
+ if (IS_ERR(inode))
+ return inode;
+
+ err = security_inode_init_security_anon(inode, &qname, NULL);
+ if (err) {
+ iput(inode);
+ return ERR_PTR(err);
+ }
+
+ inode->i_private = (void *)(unsigned long)flags;
+ inode->i_op = &kvm_gmem_iops;
+ inode->i_mapping->a_ops = &kvm_gmem_aops;
+ inode->i_mode |= S_IFREG;
+ inode->i_size = size;
+ mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ mapping_set_inaccessible(inode->i_mapping);
+ /* Unmovable mappings are supposed to be marked unevictable as well. */
+ WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+ return inode;
+}
+
+static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
+ u64 flags)
+{
+ static const char *name = "[kvm-gmem]";
+ struct inode *inode;
+ struct file *file;
+
+ if (kvm_gmem_fops.owner && !try_module_get(kvm_gmem_fops.owner))
+ return ERR_PTR(-ENOENT);
+
+ inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+
+ file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
+ &kvm_gmem_fops);
+ if (IS_ERR(file)) {
+ iput(inode);
+ return file;
+ }
+
+ file->f_mapping = inode->i_mapping;
+ file->f_flags |= O_LARGEFILE;
+ file->private_data = priv;
+
+ return file;
+}
+
static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
{
- const char *anon_name = "[kvm-gmem]";
struct kvm_gmem *gmem;
- struct inode *inode;
struct file *file;
int fd, err;
@@ -415,32 +510,16 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
goto err_fd;
}
- file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
- O_RDWR, NULL);
+ file = kvm_gmem_inode_create_getfile(gmem, size, flags);
if (IS_ERR(file)) {
err = PTR_ERR(file);
goto err_gmem;
}
- file->f_flags |= O_LARGEFILE;
-
- inode = file->f_inode;
- WARN_ON(file->f_mapping != inode->i_mapping);
-
- inode->i_private = (void *)(unsigned long)flags;
- inode->i_op = &kvm_gmem_iops;
- inode->i_mapping->a_ops = &kvm_gmem_aops;
- inode->i_mode |= S_IFREG;
- inode->i_size = size;
- mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
- mapping_set_inaccessible(inode->i_mapping);
- /* Unmovable mappings are supposed to be marked unevictable as well. */
- WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
-
kvm_get_kvm(kvm);
gmem->kvm = kvm;
xa_init(&gmem->bindings);
- list_add(&gmem->entry, &inode->i_mapping->i_private_list);
+ list_add(&gmem->entry, &file_inode(file)->i_mapping->i_private_list);
fd_install(fd, file);
return fd;
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 03/15] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 02/15] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
@ 2025-01-17 16:29 ` Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 04/15] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private Fuad Tabba
` (11 subsequent siblings)
14 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Create a new variant of kvm_gmem_get_pfn(), which retains the
folio lock if it returns successfully. This is needed in
subsequent patches in order to protect against races when
checking whether a folio can be mapped by the host.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
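For illustration, a condensed sketch of how a later patch in this series
(the kvm_(read|write)_guest_page() change) consumes the locked variant;
note that kvm_gmem_is_mappable() is only introduced by a subsequent patch:

        struct page *page;
        kvm_pfn_t pfn;
        int r;

        r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, &page, NULL);
        if (r)
                return r;

        /*
         * The folio lock is still held here, so this check cannot race with
         * paths that change mappability under the same lock.
         */
        if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1))
                r = -EPERM;
        /* else: safe to access page_address(page) while the lock is held */

        unlock_page(page);
        if (r)
                put_page(page);
        else
                kvm_release_page_clean(page);

        return r;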
include/linux/kvm_host.h | 11 +++++++++++
virt/kvm/guest_memfd.c | 27 ++++++++++++++++++++-------
2 files changed, 31 insertions(+), 7 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 401439bb21e3..cda3ed4c3c27 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2500,6 +2500,9 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
int *max_order);
+int kvm_gmem_get_pfn_locked(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+ int *max_order);
#else
static inline int kvm_gmem_get_pfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
@@ -2509,6 +2512,14 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
KVM_BUG_ON(1, kvm);
return -EIO;
}
+static inline int kvm_gmem_get_pfn_locked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn,
+ struct page **page, int *max_order)
+{
+ KVM_BUG_ON(1, kvm);
+ return -EIO;
+}
#endif /* CONFIG_KVM_PRIVATE_MEM */
#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 198554b1f0b5..6453658d2650 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -672,9 +672,9 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file,
return folio;
}
-int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
- gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
- int *max_order)
+int kvm_gmem_get_pfn_locked(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+ int *max_order)
{
pgoff_t index = kvm_gmem_get_index(slot, gfn);
struct file *file = kvm_gmem_get_file(slot);
@@ -694,17 +694,30 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
if (!is_prepared)
r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
- folio_unlock(folio);
-
- if (!r)
+ if (!r) {
*page = folio_file_page(folio, index);
- else
+ } else {
+ folio_unlock(folio);
folio_put(folio);
+ }
out:
fput(file);
return r;
}
+EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn_locked);
+
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+ int *max_order)
+{
+ int r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, pfn, page, max_order);
+
+ if (!r)
+ unlock_page(*page);
+
+ return r;
+}
EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 04/15] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (2 preceding siblings ...)
2025-01-17 16:29 ` [RFC PATCH v5 03/15] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
@ 2025-01-17 16:29 ` Fuad Tabba
2025-01-24 5:31 ` Gavin Shan
2025-01-17 16:29 ` [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition Fuad Tabba
` (10 subsequent siblings)
14 siblings, 1 reply; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
From: Ackerley Tng <ackerleytng@google.com>
Track within the inode whether guest_memfd memory can be mapped,
since this is a property of the guest_memfd's memory contents.
The guest_memfd PRIVATE memory attribute is not used for two
reasons. First, it reflects the userspace expectation for that
memory location, and can therefore be toggled by userspace.
Second, although each guest_memfd file has a 1:1 binding with a
KVM instance, the plan is to allow multiple files per inode, e.g.
to allow intra-host migration to a new KVM instance without
destroying the guest_memfd.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
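As a small illustration (not part of the patch), code in guest_memfd.c that
holds any guest_memfd file reaches the same inode-level state, regardless
of which file/binding the reference came through; 'file' and 'index' below
are assumed to be in scope:

        /* Every file backed by the same inode observes the same state. */
        struct kvm_gmem_inode_private *private =
                kvm_gmem_private(file_inode(file));

        /* This patch only tracks per-offset mappability (when enabled). */
        void *entry = xa_load(&private->mappable_offsets, index);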
virt/kvm/guest_memfd.c | 56 ++++++++++++++++++++++++++++++++++++++----
1 file changed, 51 insertions(+), 5 deletions(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 6453658d2650..0a7b6cf8bd8f 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -18,6 +18,17 @@ struct kvm_gmem {
struct list_head entry;
};
+struct kvm_gmem_inode_private {
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+ struct xarray mappable_offsets;
+#endif
+};
+
+static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
+{
+ return inode->i_mapping->i_private_data;
+}
+
/**
* folio_file_pfn - like folio_file_page, but return a pfn.
* @folio: The folio which contains this index.
@@ -312,8 +323,28 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
return gfn - slot->base_gfn + slot->gmem.pgoff;
}
+static void kvm_gmem_evict_inode(struct inode *inode)
+{
+ struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
+
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+ /*
+ * .evict_inode can be called before private data is set up if there are
+ * issues during inode creation.
+ */
+ if (private)
+ xa_destroy(&private->mappable_offsets);
+#endif
+
+ truncate_inode_pages_final(inode->i_mapping);
+
+ kfree(private);
+ clear_inode(inode);
+}
+
static const struct super_operations kvm_gmem_super_operations = {
- .statfs = simple_statfs,
+ .statfs = simple_statfs,
+ .evict_inode = kvm_gmem_evict_inode,
};
static int kvm_gmem_init_fs_context(struct fs_context *fc)
@@ -440,6 +471,7 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
loff_t size, u64 flags)
{
const struct qstr qname = QSTR_INIT(name, strlen(name));
+ struct kvm_gmem_inode_private *private;
struct inode *inode;
int err;
@@ -448,10 +480,19 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
return inode;
err = security_inode_init_security_anon(inode, &qname, NULL);
- if (err) {
- iput(inode);
- return ERR_PTR(err);
- }
+ if (err)
+ goto out;
+
+ err = -ENOMEM;
+ private = kzalloc(sizeof(*private), GFP_KERNEL);
+ if (!private)
+ goto out;
+
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+ xa_init(&private->mappable_offsets);
+#endif
+
+ inode->i_mapping->i_private_data = private;
inode->i_private = (void *)(unsigned long)flags;
inode->i_op = &kvm_gmem_iops;
@@ -464,6 +505,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
return inode;
+
+out:
+ iput(inode);
+
+ return ERR_PTR(err);
}
static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (3 preceding siblings ...)
2025-01-17 16:29 ` [RFC PATCH v5 04/15] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private Fuad Tabba
@ 2025-01-17 16:29 ` Fuad Tabba
2025-01-20 10:30 ` Kirill A. Shutemov
2025-02-19 23:33 ` Ackerley Tng
2025-01-17 16:29 ` [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages Fuad Tabba
` (9 subsequent siblings)
14 siblings, 2 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
To allow restricted mapping of guest_memfd folios by the host,
guest_memfd needs to track whether they can be mapped and by whom,
since mapping will only be allowed under conditions where it is
safe to access these folios. These conditions hold when the folios
are explicitly shared with the host, or when they have not yet
been exposed to the guest (e.g., at initialization).
This patch introduces states that determine whether the host and
the guest can fault in the folios, as well as the functions that
manage the transitions between those states.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
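For reference, a sketch of the intended transitions between these states,
assuming the share/unshare callers that arrive later in this series and in
the pKVM code referenced in the cover letter; the NONE_MAPPABLE to
GUEST_MAPPABLE transition is wired up by the folio_put() callback added in
a later patch:

  ALL_MAPPABLE   --(guest unshares, no host refs)-->  GUEST_MAPPABLE
  ALL_MAPPABLE   --(guest unshares, host refs)----->  NONE_MAPPABLE
  NONE_MAPPABLE  --(last host ref dropped)--------->  GUEST_MAPPABLE
  GUEST_MAPPABLE --(guest shares with the host)---->  ALL_MAPPABLE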
include/linux/kvm_host.h | 53 ++++++++++++++
virt/kvm/guest_memfd.c | 153 +++++++++++++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 92 +++++++++++++++++++++++
3 files changed, 298 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cda3ed4c3c27..84aa7908a5dd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2564,4 +2564,57 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
struct kvm_pre_fault_memory *range);
#endif
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end);
+int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end);
+int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end);
+int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot, gfn_t start,
+ gfn_t end);
+int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start,
+ gfn_t end);
+bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
+bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
+#else
+static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end)
+{
+ WARN_ON_ONCE(1);
+ return false;
+}
+static inline int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+}
+static inline int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start,
+ gfn_t end)
+{
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+}
+static inline int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot,
+ gfn_t start, gfn_t end)
+{
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+}
+static inline int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot,
+ gfn_t start, gfn_t end)
+{
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+}
+static inline bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot,
+ gfn_t gfn)
+{
+ WARN_ON_ONCE(1);
+ return false;
+}
+static inline bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot,
+ gfn_t gfn)
+{
+ WARN_ON_ONCE(1);
+ return false;
+}
+#endif /* CONFIG_KVM_GMEM_MAPPABLE */
+
#endif
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 0a7b6cf8bd8f..d1c192927cf7 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -375,6 +375,159 @@ static void kvm_gmem_init_mount(void)
kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
}
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+/*
+ * An enum of the valid states that describe who can map a folio.
+ * Bit 0: if set guest cannot map the page
+ * Bit 1: if set host cannot map the page
+ */
+enum folio_mappability {
+ KVM_GMEM_ALL_MAPPABLE = 0b00, /* Mappable by host and guest. */
+ KVM_GMEM_GUEST_MAPPABLE = 0b10, /* Mappable only by guest. */
+ KVM_GMEM_NONE_MAPPABLE = 0b11, /* Not mappable, transient state. */
+};
+
+/*
+ * Marks the range [start, end) as mappable by both the host and the guest.
+ * Usually called when guest shares memory with the host.
+ */
+static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+ struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+ void *xval = xa_mk_value(KVM_GMEM_ALL_MAPPABLE);
+ pgoff_t i;
+ int r = 0;
+
+ filemap_invalidate_lock(inode->i_mapping);
+ for (i = start; i < end; i++) {
+ r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
+ if (r)
+ break;
+ }
+ filemap_invalidate_unlock(inode->i_mapping);
+
+ return r;
+}
+
+/*
+ * Marks the range [start, end) as not mappable by the host. If the host doesn't
+ * have any references to a particular folio, then that folio is marked as
+ * mappable by the guest.
+ *
+ * However, if the host still has references to the folio, then the folio is
+ * marked as not mappable by anyone. Marking it as not mappable allows all
+ * references from the host to be drained, and ensures that the hypervisor does
+ * not transition the folio to private, since the host still might access it.
+ *
+ * Usually called when guest unshares memory with the host.
+ */
+static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+ struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+ void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
+ void *xval_none = xa_mk_value(KVM_GMEM_NONE_MAPPABLE);
+ pgoff_t i;
+ int r = 0;
+
+ filemap_invalidate_lock(inode->i_mapping);
+ for (i = start; i < end; i++) {
+ struct folio *folio;
+ int refcount = 0;
+
+ folio = filemap_lock_folio(inode->i_mapping, i);
+ if (!IS_ERR(folio)) {
+ refcount = folio_ref_count(folio);
+ } else {
+ r = PTR_ERR(folio);
+ if (WARN_ON_ONCE(r != -ENOENT))
+ break;
+
+ folio = NULL;
+ }
+
+ /* +1 references are expected because of filemap_lock_folio(). */
+ if (folio && refcount > folio_nr_pages(folio) + 1) {
+ /*
+ * Outstanding references, the folio cannot be faulted
+ * in by anyone until they're dropped.
+ */
+ r = xa_err(xa_store(mappable_offsets, i, xval_none, GFP_KERNEL));
+ } else {
+ /*
+ * No outstanding references. Transition the folio to
+ * guest mappable immediately.
+ */
+ r = xa_err(xa_store(mappable_offsets, i, xval_guest, GFP_KERNEL));
+ }
+
+ if (folio) {
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+
+ if (WARN_ON_ONCE(r))
+ break;
+ }
+ filemap_invalidate_unlock(inode->i_mapping);
+
+ return r;
+}
+
+static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
+{
+ struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+ unsigned long r;
+
+ r = xa_to_value(xa_load(mappable_offsets, pgoff));
+
+ return (r == KVM_GMEM_ALL_MAPPABLE);
+}
+
+static bool gmem_is_guest_mappable(struct inode *inode, pgoff_t pgoff)
+{
+ struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+ unsigned long r;
+
+ r = xa_to_value(xa_load(mappable_offsets, pgoff));
+
+ return (r == KVM_GMEM_ALL_MAPPABLE || r == KVM_GMEM_GUEST_MAPPABLE);
+}
+
+int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
+{
+ struct inode *inode = file_inode(slot->gmem.file);
+ pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
+ pgoff_t end_off = start_off + end - start;
+
+ return gmem_set_mappable(inode, start_off, end_off);
+}
+
+int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
+{
+ struct inode *inode = file_inode(slot->gmem.file);
+ pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
+ pgoff_t end_off = start_off + end - start;
+
+ return gmem_clear_mappable(inode, start_off, end_off);
+}
+
+bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ struct inode *inode = file_inode(slot->gmem.file);
+ unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
+
+ return gmem_is_mappable(inode, pgoff);
+}
+
+bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ struct inode *inode = file_inode(slot->gmem.file);
+ unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
+
+ return gmem_is_guest_mappable(inode, pgoff);
+}
+#endif /* CONFIG_KVM_GMEM_MAPPABLE */
+
static struct file_operations kvm_gmem_fops = {
.open = generic_file_open,
.release = kvm_gmem_release,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de2c11dae231..fffff01cebe7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3094,6 +3094,98 @@ static int next_segment(unsigned long len, int offset)
return len;
}
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ struct kvm_memslot_iter iter;
+ bool r = true;
+
+ mutex_lock(&kvm->slots_lock);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
+ struct kvm_memory_slot *memslot = iter.slot;
+ gfn_t gfn_start, gfn_end, i;
+
+ if (!kvm_slot_can_be_private(memslot))
+ continue;
+
+ gfn_start = max(start, memslot->base_gfn);
+ gfn_end = min(end, memslot->base_gfn + memslot->npages);
+ if (WARN_ON_ONCE(gfn_start >= gfn_end))
+ continue;
+
+ for (i = gfn_start; i < gfn_end; i++) {
+ r = kvm_slot_gmem_is_mappable(memslot, i);
+ if (r)
+ goto out;
+ }
+ }
+out:
+ mutex_unlock(&kvm->slots_lock);
+
+ return r;
+}
+
+int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ struct kvm_memslot_iter iter;
+ int r = 0;
+
+ mutex_lock(&kvm->slots_lock);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
+ struct kvm_memory_slot *memslot = iter.slot;
+ gfn_t gfn_start, gfn_end;
+
+ if (!kvm_slot_can_be_private(memslot))
+ continue;
+
+ gfn_start = max(start, memslot->base_gfn);
+ gfn_end = min(end, memslot->base_gfn + memslot->npages);
+ if (WARN_ON_ONCE(gfn_start >= gfn_end))
+ continue;
+
+ r = kvm_slot_gmem_set_mappable(memslot, gfn_start, gfn_end);
+ if (WARN_ON_ONCE(r))
+ break;
+ }
+
+ mutex_unlock(&kvm->slots_lock);
+
+ return r;
+}
+
+int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ struct kvm_memslot_iter iter;
+ int r = 0;
+
+ mutex_lock(&kvm->slots_lock);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
+ struct kvm_memory_slot *memslot = iter.slot;
+ gfn_t gfn_start, gfn_end;
+
+ if (!kvm_slot_can_be_private(memslot))
+ continue;
+
+ gfn_start = max(start, memslot->base_gfn);
+ gfn_end = min(end, memslot->base_gfn + memslot->npages);
+ if (WARN_ON_ONCE(gfn_start >= gfn_end))
+ continue;
+
+ r = kvm_slot_gmem_clear_mappable(memslot, gfn_start, gfn_end);
+ if (WARN_ON_ONCE(r))
+ break;
+ }
+
+ mutex_unlock(&kvm->slots_lock);
+
+ return r;
+}
+
+#endif /* CONFIG_KVM_GMEM_MAPPABLE */
+
/* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
void *data, int offset, int len)
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (4 preceding siblings ...)
2025-01-17 16:29 ` [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition Fuad Tabba
@ 2025-01-17 16:29 ` Fuad Tabba
2025-01-20 11:37 ` Vlastimil Babka
` (4 more replies)
2025-01-17 16:29 ` [RFC PATCH v5 07/15] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared Fuad Tabba
` (8 subsequent siblings)
14 siblings, 5 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Before transitioning a guest_memfd folio to unshared, thereby
disallowing access by the host and allowing the hypervisor to
transition its view of the guest page to private, we need to be
sure that the host doesn't have any references to the folio.
This patch introduces a new page type for guest_memfd folios, and
uses it to register a callback that informs the guest_memfd
subsystem when the last reference to the folio is dropped, i.e.,
when the host no longer has any references.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
The function kvm_slot_gmem_register_callback() isn't used in this
series. It will be used later in code that performs unsharing of
memory. I have tested it with pKVM, based on downstream code [*].
It's included in this RFC since it demonstrates the plan to
handle unsharing of private folios.
[*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v5-pkvm
---
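To make the intended flow concrete, below is a rough, hypothetical sketch
of how an unshare path could combine the new hooks; the actual pKVM caller
lives in the branch above, so the exact shape may differ:

static int gmem_unshare_gfn(struct kvm_memory_slot *slot, gfn_t gfn)
{
        int r;

        /* Stop the host from faulting the page in from now on. */
        r = kvm_slot_gmem_clear_mappable(slot, gfn, gfn + 1);
        if (r)
                return r;

        /*
         * If the host still holds references, register the folio_put()
         * callback; -EAGAIN means the transition to guest-mappable will
         * complete once the last host reference is dropped.
         */
        r = kvm_slot_gmem_register_callback(slot, gfn);

        return r == -EAGAIN ? 0 : r;
}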
include/linux/kvm_host.h | 11 +++
include/linux/page-flags.h | 7 ++
mm/debug.c | 1 +
mm/swap.c | 4 +
virt/kvm/guest_memfd.c | 145 +++++++++++++++++++++++++++++++++++++
5 files changed, 168 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 84aa7908a5dd..63e6d6dd98b3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2574,6 +2574,8 @@ int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start,
gfn_t end);
bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
+int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn);
+void kvm_gmem_handle_folio_put(struct folio *folio);
#else
static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end)
{
@@ -2615,6 +2617,15 @@ static inline bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot,
WARN_ON_ONCE(1);
return false;
}
+static inline int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+}
+static inline void kvm_gmem_handle_folio_put(struct folio *folio)
+{
+ WARN_ON_ONCE(1);
+}
#endif /* CONFIG_KVM_GMEM_MAPPABLE */
#endif
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6615f2f59144..bab3cac1f93b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -942,6 +942,7 @@ enum pagetype {
PGTY_slab = 0xf5,
PGTY_zsmalloc = 0xf6,
PGTY_unaccepted = 0xf7,
+ PGTY_guestmem = 0xf8,
PGTY_mapcount_underflow = 0xff
};
@@ -1091,6 +1092,12 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb)
FOLIO_TEST_FLAG_FALSE(hugetlb)
#endif
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+FOLIO_TYPE_OPS(guestmem, guestmem)
+#else
+FOLIO_TEST_FLAG_FALSE(guestmem)
+#endif
+
PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
/*
diff --git a/mm/debug.c b/mm/debug.c
index 95b6ab809c0e..db93be385ed9 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -56,6 +56,7 @@ static const char *page_type_names[] = {
DEF_PAGETYPE_NAME(table),
DEF_PAGETYPE_NAME(buddy),
DEF_PAGETYPE_NAME(unaccepted),
+ DEF_PAGETYPE_NAME(guestmem),
};
static const char *page_type_name(unsigned int page_type)
diff --git a/mm/swap.c b/mm/swap.c
index 6f01b56bce13..15220eaabc86 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,7 @@
#include <linux/page_idle.h>
#include <linux/local_lock.h>
#include <linux/buffer_head.h>
+#include <linux/kvm_host.h>
#include "internal.h"
@@ -103,6 +104,9 @@ static void free_typed_folio(struct folio *folio)
case PGTY_offline:
/* Nothing to do, it's offline. */
return;
+ case PGTY_guestmem:
+ kvm_gmem_handle_folio_put(folio);
+ return;
default:
WARN_ON_ONCE(1);
}
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index d1c192927cf7..722afd9f8742 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -387,6 +387,28 @@ enum folio_mappability {
KVM_GMEM_NONE_MAPPABLE = 0b11, /* Not mappable, transient state. */
};
+/*
+ * Unregisters the __folio_put() callback from the folio.
+ *
+ * Restores a folio's refcount after all pending references have been released,
+ * and removes the folio type, thereby removing the callback. Now the folio can
+ * be freed normally once all actual references have been dropped.
+ *
+ * Must be called with the filemap (inode->i_mapping) invalidate_lock held.
+ * Must also have exclusive access to the folio: folio must be either locked, or
+ * gmem holds the only reference.
+ */
+static void __kvm_gmem_restore_pending_folio(struct folio *folio)
+{
+ if (WARN_ON_ONCE(folio_mapped(folio) || !folio_test_guestmem(folio)))
+ return;
+
+ WARN_ON_ONCE(!folio_test_locked(folio) && folio_ref_count(folio) > 1);
+
+ __folio_clear_guestmem(folio);
+ folio_ref_add(folio, folio_nr_pages(folio));
+}
+
/*
* Marks the range [start, end) as mappable by both the host and the guest.
* Usually called when guest shares memory with the host.
@@ -400,7 +422,31 @@ static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
filemap_invalidate_lock(inode->i_mapping);
for (i = start; i < end; i++) {
+ struct folio *folio = NULL;
+
+ /*
+ * If the folio is NONE_MAPPABLE, it indicates that it is
+ * transitioning to private (GUEST_MAPPABLE). Transition it to
+ * shared (ALL_MAPPABLE) immediately, and remove the callback.
+ */
+ if (xa_to_value(xa_load(mappable_offsets, i)) == KVM_GMEM_NONE_MAPPABLE) {
+ folio = filemap_lock_folio(inode->i_mapping, i);
+ if (WARN_ON_ONCE(IS_ERR(folio))) {
+ r = PTR_ERR(folio);
+ break;
+ }
+
+ if (folio_test_guestmem(folio))
+ __kvm_gmem_restore_pending_folio(folio);
+ }
+
r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
+
+ if (folio) {
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+
if (r)
break;
}
@@ -473,6 +519,105 @@ static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
return r;
}
+/*
+ * Registers a callback to __folio_put(), so that gmem knows that the host does
+ * not have any references to the folio. It does that by setting the folio type
+ * to guestmem.
+ *
+ * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
+ * has references, and the callback has been registered.
+ *
+ * Must be called with the following locks held:
+ * - filemap (inode->i_mapping) invalidate_lock
+ * - folio lock
+ */
+static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
+{
+ struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+ void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
+ int refcount;
+
+ rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
+ WARN_ON_ONCE(!folio_test_locked(folio));
+
+ if (folio_mapped(folio) || folio_test_guestmem(folio))
+ return -EAGAIN;
+
+ /* Register a callback first. */
+ __folio_set_guestmem(folio);
+
+ /*
+ * Check for references after setting the type to guestmem, to guard
+ * against potential races with the refcount being decremented later.
+ *
+ * At least one reference is expected because the folio is locked.
+ */
+
+ refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
+ if (refcount == 1) {
+ int r;
+
+ /* refcount isn't elevated, it's now faultable by the guest. */
+ r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
+ if (!r)
+ __kvm_gmem_restore_pending_folio(folio);
+
+ return r;
+ }
+
+ return -EAGAIN;
+}
+
+int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
+ struct inode *inode = file_inode(slot->gmem.file);
+ struct folio *folio;
+ int r;
+
+ filemap_invalidate_lock(inode->i_mapping);
+
+ folio = filemap_lock_folio(inode->i_mapping, pgoff);
+ if (WARN_ON_ONCE(IS_ERR(folio))) {
+ r = PTR_ERR(folio);
+ goto out;
+ }
+
+ r = __gmem_register_callback(folio, inode, pgoff);
+
+ folio_unlock(folio);
+ folio_put(folio);
+out:
+ filemap_invalidate_unlock(inode->i_mapping);
+
+ return r;
+}
+
+/*
+ * Callback function for __folio_put(), i.e., called when all references by the
+ * host to the folio have been dropped. This allows gmem to transition the state
+ * of the folio to mappable by the guest, and allows the hypervisor to continue
+ * transitioning its state to private, since the host cannot attempt to access
+ * it anymore.
+ */
+void kvm_gmem_handle_folio_put(struct folio *folio)
+{
+ struct xarray *mappable_offsets;
+ struct inode *inode;
+ pgoff_t index;
+ void *xval;
+
+ inode = folio->mapping->host;
+ index = folio->index;
+ mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+ xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
+
+ filemap_invalidate_lock(inode->i_mapping);
+ __kvm_gmem_restore_pending_folio(folio);
+ WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
+ filemap_invalidate_unlock(inode->i_mapping);
+}
+
static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
{
struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 07/15] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (5 preceding siblings ...)
2025-01-17 16:29 ` [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages Fuad Tabba
@ 2025-01-17 16:29 ` Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 08/15] KVM: guest_memfd: Add guest_memfd support to kvm_(read|write)_guest_page() Fuad Tabba
` (7 subsequent siblings)
14 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Add support for mmap() and fault() for guest_memfd in the host.
The ability to fault in a guest page is contingent on that page
being shared with the host.
The guest_memfd PRIVATE memory attribute is not used for two
reasons. First, it reflects the userspace expectation for that
memory location, and can therefore be toggled by userspace.
Second, although each guest_memfd file has a 1:1 binding with a
KVM instance, the plan is to allow multiple files per inode, e.g.
to allow intra-host migration to a new KVM instance without
destroying the guest_memfd.
The mapping is restricted to memory explicitly shared with the
host. KVM checks, via the folio's refcount, that the host doesn't
have any mappings of private memory. To avoid races between paths
that check mappability and paths that check whether the host has
any mappings (via the refcount), the folio lock is held while
either check is performed.
This new feature is gated by a new configuration option,
CONFIG_KVM_GMEM_MAPPABLE.
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
The functions kvm_gmem_is_mappable(), kvm_gmem_set_mappable(), and
kvm_gmem_clear_mappable() are not used in this patch series.
They are intended to be used in future patches [*], which check
and toggle mappability when the guest shares/unshares pages with
the host.
[*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v5-pkvm
---
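From the userspace side, a minimal sketch of what this enables, assuming
CONFIG_KVM_GMEM_MAPPABLE is selected and the memory is in the shared state
(the initial state set up by __kvm_gmem_create() in this patch); vm_fd and
guest_mem_size are placeholders:

        struct kvm_create_guest_memfd gmem = {
                .size = guest_mem_size,
        };
        int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

        /* kvm_gmem_mmap() below only accepts shared mappings. */
        char *mem = mmap(NULL, guest_mem_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, gmem_fd, 0);

        /* Faulting succeeds only while the page is mappable by the host. */
        mem[0] = 1;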
virt/kvm/Kconfig | 4 ++
virt/kvm/guest_memfd.c | 87 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 91 insertions(+)
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..59400fd8f539 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
config HAVE_KVM_ARCH_GMEM_INVALIDATE
bool
depends on KVM_PRIVATE_MEM
+
+config KVM_GMEM_MAPPABLE
+ select KVM_PRIVATE_MEM
+ bool
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 722afd9f8742..159ffa17f562 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -671,9 +671,88 @@ bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot, gfn_t gfn)
return gmem_is_guest_mappable(inode, pgoff);
}
+
+static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ struct folio *folio;
+ vm_fault_t ret = VM_FAULT_LOCKED;
+
+ filemap_invalidate_lock_shared(inode->i_mapping);
+
+ folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+ if (IS_ERR(folio)) {
+ ret = VM_FAULT_SIGBUS;
+ goto out_filemap;
+ }
+
+ if (folio_test_hwpoison(folio)) {
+ ret = VM_FAULT_HWPOISON;
+ goto out_folio;
+ }
+
+ if (!gmem_is_mappable(inode, vmf->pgoff)) {
+ ret = VM_FAULT_SIGBUS;
+ goto out_folio;
+ }
+
+ if (WARN_ON_ONCE(folio_test_guestmem(folio))) {
+ ret = VM_FAULT_SIGBUS;
+ goto out_folio;
+ }
+
+ if (!folio_test_uptodate(folio)) {
+ unsigned long nr_pages = folio_nr_pages(folio);
+ unsigned long i;
+
+ for (i = 0; i < nr_pages; i++)
+ clear_highpage(folio_page(folio, i));
+
+ folio_mark_uptodate(folio);
+ }
+
+ vmf->page = folio_file_page(folio, vmf->pgoff);
+
+out_folio:
+ if (ret != VM_FAULT_LOCKED) {
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+
+out_filemap:
+ filemap_invalidate_unlock_shared(inode->i_mapping);
+
+ return ret;
+}
+
+static const struct vm_operations_struct kvm_gmem_vm_ops = {
+ .fault = kvm_gmem_fault,
+};
+
+static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
+ (VM_SHARED | VM_MAYSHARE)) {
+ return -EINVAL;
+ }
+
+ file_accessed(file);
+ vm_flags_set(vma, VM_DONTDUMP);
+ vma->vm_ops = &kvm_gmem_vm_ops;
+
+ return 0;
+}
+#else
+static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+}
+#define kvm_gmem_mmap NULL
#endif /* CONFIG_KVM_GMEM_MAPPABLE */
static struct file_operations kvm_gmem_fops = {
+ .mmap = kvm_gmem_mmap,
.open = generic_file_open,
.release = kvm_gmem_release,
.fallocate = kvm_gmem_fallocate,
@@ -860,6 +939,14 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
goto err_gmem;
}
+ if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE)) {
+ err = gmem_set_mappable(file_inode(file), 0, size >> PAGE_SHIFT);
+ if (err) {
+ fput(file);
+ goto err_gmem;
+ }
+ }
+
kvm_get_kvm(kvm);
gmem->kvm = kvm;
xa_init(&gmem->bindings);
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 08/15] KVM: guest_memfd: Add guest_memfd support to kvm_(read|write)_guest_page()
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (6 preceding siblings ...)
2025-01-17 16:29 ` [RFC PATCH v5 07/15] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared Fuad Tabba
@ 2025-01-17 16:29 ` Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 09/15] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable Fuad Tabba
` (6 subsequent siblings)
14 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Make kvm_(read|write)_guest_page() capable of accessing guest
memory for slots that don't have a userspace address, but only if
the memory is mappable, which also indicates that it is
accessible by the host.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
virt/kvm/kvm_main.c | 133 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 114 insertions(+), 19 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fffff01cebe7..53692feb6213 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3184,23 +3184,110 @@ int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
return r;
}
+static int __kvm_read_guest_memfd_page(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, void *data, int offset,
+ int len)
+{
+ struct page *page;
+ u64 pfn;
+ int r;
+
+ /*
+ * Holds the folio lock until after checking whether it can be faulted
+ * in, to avoid races with paths that change a folio's mappability.
+ */
+ r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, &page, NULL);
+ if (r)
+ return r;
+
+ if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1)) {
+ r = -EPERM;
+ goto unlock;
+ }
+ memcpy(data, page_address(page) + offset, len);
+unlock:
+ unlock_page(page);
+ if (r)
+ put_page(page);
+ else
+ kvm_release_page_clean(page);
+
+ return r;
+}
+
+static int __kvm_write_guest_memfd_page(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, const void *data,
+ int offset, int len)
+{
+ struct page *page;
+ u64 pfn;
+ int r;
+
+ /*
+ * Holds the folio lock until after checking whether it can be faulted
+ * in, to avoid races with paths that change a folio's mappability.
+ */
+ r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, &page, NULL);
+ if (r)
+ return r;
+
+ if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1)) {
+ r = -EPERM;
+ goto unlock;
+ }
+ memcpy(page_address(page) + offset, data, len);
+unlock:
+ unlock_page(page);
+ if (r)
+ put_page(page);
+ else
+ kvm_release_page_dirty(page);
+
+ return r;
+}
+#else
+static int __kvm_read_guest_memfd_page(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, void *data, int offset,
+ int len)
+{
+ WARN_ON_ONCE(1);
+ return -EIO;
+}
+
+static int __kvm_write_guest_memfd_page(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, const void *data,
+ int offset, int len)
+{
+ WARN_ON_ONCE(1);
+ return -EIO;
+}
#endif /* CONFIG_KVM_GMEM_MAPPABLE */
/* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
-static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
- void *data, int offset, int len)
+
+static int __kvm_read_guest_page(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, void *data, int offset, int len)
{
- int r;
unsigned long addr;
if (WARN_ON_ONCE(offset + len > PAGE_SIZE))
return -EFAULT;
+ if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
+ kvm_slot_can_be_private(slot) &&
+ !slot->userspace_addr) {
+ return __kvm_read_guest_memfd_page(kvm, slot, gfn, data,
+ offset, len);
+ }
+
addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
if (kvm_is_error_hva(addr))
return -EFAULT;
- r = __copy_from_user(data, (void __user *)addr + offset, len);
- if (r)
+ if (__copy_from_user(data, (void __user *)addr + offset, len))
return -EFAULT;
return 0;
}
@@ -3210,7 +3297,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
{
struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
- return __kvm_read_guest_page(slot, gfn, data, offset, len);
+ return __kvm_read_guest_page(kvm, slot, gfn, data, offset, len);
}
EXPORT_SYMBOL_GPL(kvm_read_guest_page);
@@ -3219,7 +3306,7 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
{
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
- return __kvm_read_guest_page(slot, gfn, data, offset, len);
+ return __kvm_read_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
@@ -3296,22 +3383,30 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
/* Copy @len bytes from @data into guest memory at '(@gfn * PAGE_SIZE) + @offset' */
static int __kvm_write_guest_page(struct kvm *kvm,
- struct kvm_memory_slot *memslot, gfn_t gfn,
- const void *data, int offset, int len)
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ const void *data, int offset, int len)
{
- int r;
- unsigned long addr;
-
if (WARN_ON_ONCE(offset + len > PAGE_SIZE))
return -EFAULT;
- addr = gfn_to_hva_memslot(memslot, gfn);
- if (kvm_is_error_hva(addr))
- return -EFAULT;
- r = __copy_to_user((void __user *)addr + offset, data, len);
- if (r)
- return -EFAULT;
- mark_page_dirty_in_slot(kvm, memslot, gfn);
+ if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
+ kvm_slot_can_be_private(slot) &&
+ !slot->userspace_addr) {
+ int r = __kvm_write_guest_memfd_page(kvm, slot, gfn, data,
+ offset, len);
+
+ if (r)
+ return r;
+ } else {
+ unsigned long addr = gfn_to_hva_memslot(slot, gfn);
+
+ if (kvm_is_error_hva(addr))
+ return -EFAULT;
+ if (__copy_to_user((void __user *)addr + offset, data, len))
+ return -EFAULT;
+ }
+
+ mark_page_dirty_in_slot(kvm, slot, gfn);
return 0;
}
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 09/15] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (7 preceding siblings ...)
2025-01-17 16:29 ` [RFC PATCH v5 08/15] KVM: guest_memfd: Add guest_memfd support to kvm_(read|/write)_guest_page() Fuad Tabba
@ 2025-01-17 16:29 ` Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 10/15] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as mappable Fuad Tabba
` (5 subsequent siblings)
14 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Add the KVM capability KVM_CAP_GUEST_MEMFD_MAPPABLE, which is
true if mapping guest memory is supported by the host.
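For illustration, a minimal userspace sketch (untested; kvm_fd is assumed
to be an open /dev/kvm file descriptor) of how the capability could be
queried:

  #include <stdbool.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static bool host_can_map_guest_memfd(int kvm_fd)
  {
          /* KVM_CHECK_EXTENSION returns a positive value if supported. */
          return ioctl(kvm_fd, KVM_CHECK_EXTENSION,
                       KVM_CAP_GUEST_MEMFD_MAPPABLE) > 0;
  }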
Signed-off-by: Fuad Tabba <tabba@google.com>
---
include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 4 ++++
2 files changed, 5 insertions(+)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 502ea63b5d2e..021f8ef9979b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -933,6 +933,7 @@ struct kvm_enable_cap {
#define KVM_CAP_PRE_FAULT_MEMORY 236
#define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
#define KVM_CAP_X86_GUEST_MODE 238
+#define KVM_CAP_GUEST_MEMFD_MAPPABLE 239
struct kvm_irq_routing_irqchip {
__u32 irqchip;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 53692feb6213..0d1c2e95e771 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4979,6 +4979,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
#ifdef CONFIG_KVM_PRIVATE_MEM
case KVM_CAP_GUEST_MEMFD:
return !kvm || kvm_arch_has_private_mem(kvm);
+#endif
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+ case KVM_CAP_GUEST_MEMFD_MAPPABLE:
+ return !kvm || kvm_arch_has_private_mem(kvm);
#endif
default:
break;
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 10/15] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as mappable
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (8 preceding siblings ...)
2025-01-17 16:29 ` [RFC PATCH v5 09/15] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable Fuad Tabba
@ 2025-01-17 16:29 ` Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 11/15] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed Fuad Tabba
` (4 subsequent siblings)
14 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Not all use cases require guest_memfd() to be mappable by the
host when first created. Add a new flag,
GUEST_MEMFD_FLAG_INIT_MAPPABLE, which when set on
KVM_CREATE_GUEST_MEMFD initializes the memory as mappable by the
host. Otherwise, memory is private until shared by the guest with
the host.
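For illustration, a minimal userspace sketch (untested; vm_fd and size
are assumed to exist, with size page-aligned) of creating a guest_memfd
that starts out mappable by the host:

  struct kvm_create_guest_memfd args = {
          .size  = size,
          .flags = GUEST_MEMFD_FLAG_INIT_MAPPABLE,
  };
  int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);

Without the flag, the memory starts out private and cannot be mmap()ed
by the host until the guest shares it.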
Signed-off-by: Fuad Tabba <tabba@google.com>
---
Documentation/virt/kvm/api.rst | 4 ++++
include/uapi/linux/kvm.h | 1 +
tools/testing/selftests/kvm/guest_memfd_test.c | 7 +++++--
virt/kvm/guest_memfd.c | 6 +++++-
4 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f15b61317aad..2a8571b1629f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6383,6 +6383,10 @@ most one mapping per page, i.e. binding multiple memory regions to a single
guest_memfd range is not allowed (any number of memory regions can be bound to
a single guest_memfd file, but the bound ranges must not overlap).
+If the capability KVM_CAP_GUEST_MEMFD_MAPPABLE is supported, then the flags
+field supports GUEST_MEMFD_FLAG_INIT_MAPPABLE, which initializes the memory
+as mappable by the host.
+
See KVM_SET_USER_MEMORY_REGION2 for additional details.
4.143 KVM_PRE_FAULT_MEMORY
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 021f8ef9979b..b34aed04ffa5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1566,6 +1566,7 @@ struct kvm_memory_attributes {
#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
#define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
+#define GUEST_MEMFD_FLAG_INIT_MAPPABLE (1UL << 0)
struct kvm_create_guest_memfd {
__u64 size;
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index ce687f8d248f..04b4111b7190 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -123,7 +123,7 @@ static void test_invalid_punch_hole(int fd, size_t page_size, size_t total_size)
static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
{
size_t page_size = getpagesize();
- uint64_t flag;
+ uint64_t flag = BIT(0);
size_t size;
int fd;
@@ -134,7 +134,10 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
size);
}
- for (flag = BIT(0); flag; flag <<= 1) {
+ if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_MAPPABLE))
+ flag = GUEST_MEMFD_FLAG_INIT_MAPPABLE << 1;
+
+ for (; flag; flag <<= 1) {
fd = __vm_create_guest_memfd(vm, page_size, flag);
TEST_ASSERT(fd == -1 && errno == EINVAL,
"guest_memfd() with flag '0x%lx' should fail with EINVAL",
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 159ffa17f562..932c23f6b2e5 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -939,7 +939,8 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
goto err_gmem;
}
- if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE)) {
+ if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
+ (flags & GUEST_MEMFD_FLAG_INIT_MAPPABLE)) {
err = gmem_set_mappable(file_inode(file), 0, size >> PAGE_SHIFT);
if (err) {
fput(file);
@@ -968,6 +969,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
u64 flags = args->flags;
u64 valid_flags = 0;
+ if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE))
+ valid_flags |= GUEST_MEMFD_FLAG_INIT_MAPPABLE;
+
if (flags & ~valid_flags)
return -EINVAL;
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 11/15] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (9 preceding siblings ...)
2025-01-17 16:29 ` [RFC PATCH v5 10/15] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as mappable Fuad Tabba
@ 2025-01-17 16:29 ` Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 12/15] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
` (3 subsequent siblings)
14 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Expand the guest_memfd selftests to test mapping guest memory when
the capability is supported, and to check that memory is not mappable
when the capability isn't supported.
Also, build the guest_memfd selftest for aarch64.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 57 +++++++++++++++++--
2 files changed, 53 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 41593d2e7de9..c998eb3c3b77 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -174,6 +174,7 @@ TEST_GEN_PROGS_aarch64 += coalesced_io_test
TEST_GEN_PROGS_aarch64 += demand_paging_test
TEST_GEN_PROGS_aarch64 += dirty_log_test
TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
+TEST_GEN_PROGS_aarch64 += guest_memfd_test
TEST_GEN_PROGS_aarch64 += guest_print_test
TEST_GEN_PROGS_aarch64 += get-reg-list
TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 04b4111b7190..12b5777c2eb5 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -34,12 +34,55 @@ static void test_file_read_write(int fd)
"pwrite on a guest_mem fd should fail");
}
-static void test_mmap(int fd, size_t page_size)
+static void test_mmap_allowed(int fd, size_t total_size)
{
+ size_t page_size = getpagesize();
+ char *mem;
+ int ret;
+ int i;
+
+ mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem != MAP_FAILED, "mmaping() guest memory should pass.");
+
+ memset(mem, 0xaa, total_size);
+ for (i = 0; i < total_size; i++)
+ TEST_ASSERT_EQ(mem[i], 0xaa);
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0,
+ page_size);
+ TEST_ASSERT(!ret, "fallocate the first page should succeed");
+
+ for (i = 0; i < page_size; i++)
+ TEST_ASSERT_EQ(mem[i], 0x00);
+ for (; i < total_size; i++)
+ TEST_ASSERT_EQ(mem[i], 0xaa);
+
+ memset(mem, 0xaa, total_size);
+ for (i = 0; i < total_size; i++)
+ TEST_ASSERT_EQ(mem[i], 0xaa);
+
+ ret = munmap(mem, total_size);
+ TEST_ASSERT(!ret, "munmap should succeed");
+}
+
+static void test_mmap_denied(int fd, size_t total_size)
+{
+ size_t page_size = getpagesize();
char *mem;
mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
TEST_ASSERT_EQ(mem, MAP_FAILED);
+
+ mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT_EQ(mem, MAP_FAILED);
+}
+
+static void test_mmap(int fd, size_t total_size)
+{
+ if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_MAPPABLE))
+ test_mmap_allowed(fd, total_size);
+ else
+ test_mmap_denied(fd, total_size);
}
static void test_file_size(int fd, size_t page_size, size_t total_size)
@@ -175,13 +218,17 @@ static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
int main(int argc, char *argv[])
{
- size_t page_size;
+ uint64_t flags = 0;
+ struct kvm_vm *vm;
size_t total_size;
+ size_t page_size;
int fd;
- struct kvm_vm *vm;
TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
+ if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_MAPPABLE))
+ flags |= GUEST_MEMFD_FLAG_INIT_MAPPABLE;
+
page_size = getpagesize();
total_size = page_size * 4;
@@ -190,10 +237,10 @@ int main(int argc, char *argv[])
test_create_guest_memfd_invalid(vm);
test_create_guest_memfd_multiple(vm);
- fd = vm_create_guest_memfd(vm, total_size, 0);
+ fd = vm_create_guest_memfd(vm, total_size, flags);
test_file_read_write(fd);
- test_mmap(fd, page_size);
+ test_mmap(fd, total_size);
test_file_size(fd, page_size, total_size);
test_fallocate(fd, page_size, total_size);
test_invalid_punch_hole(fd, page_size, total_size);
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 12/15] KVM: arm64: Skip VMA checks for slots without userspace address
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (10 preceding siblings ...)
2025-01-17 16:29 ` [RFC PATCH v5 11/15] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed Fuad Tabba
@ 2025-01-17 16:29 ` Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 13/15] KVM: arm64: Refactor user_mem_abort() calculation of force_pte Fuad Tabba
` (2 subsequent siblings)
14 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Memory slots backed by guest memory might be created with no
intention of being mapped by the host. These are recognized by
not having a userspace address in the memory slot.
VMA checks are neither possible nor necessary for this kind of
slot, so skip them.
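For illustration, a sketch (untested; vm_fd, gpa, size and gmem_fd are
assumed to exist) of how userspace might register such a slot, leaving
userspace_addr at zero because the host never intends to map it:

  struct kvm_userspace_memory_region2 region = {
          .slot               = 0,
          .flags              = KVM_MEM_GUEST_MEMFD,
          .guest_phys_addr    = gpa,
          .memory_size        = size,
          .userspace_addr     = 0,        /* no host mapping intended */
          .guest_memfd        = gmem_fd,
          .guest_memfd_offset = 0,
  };
  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);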
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c9d46ad57e52..342a9bd3848f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -988,6 +988,10 @@ static void stage2_unmap_memslot(struct kvm *kvm,
phys_addr_t size = PAGE_SIZE * memslot->npages;
hva_t reg_end = hva + size;
+ /* Host will not map this private memory without a userspace address. */
+ if (kvm_slot_can_be_private(memslot) && !hva)
+ return;
+
/*
* A memory region could potentially cover multiple VMAs, and any holes
* between them, so iterate over all of them to find out if we should
@@ -2133,6 +2137,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
hva = new->userspace_addr;
reg_end = hva + (new->npages << PAGE_SHIFT);
+ /* Host will not map this private memory without a userspace address. */
+ if ((kvm_slot_can_be_private(new)) && !hva)
+ return 0;
+
mmap_read_lock(current->mm);
/*
* A memory region could potentially cover multiple VMAs, and any holes
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 13/15] KVM: arm64: Refactor user_mem_abort() calculation of force_pte
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (11 preceding siblings ...)
2025-01-17 16:29 ` [RFC PATCH v5 12/15] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
@ 2025-01-17 16:29 ` Fuad Tabba
2025-01-17 16:30 ` [RFC PATCH v5 14/15] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
2025-01-17 16:30 ` [RFC PATCH v5 15/15] KVM: arm64: Enable guest_memfd private memory when pKVM is enabled Fuad Tabba
14 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:29 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
To simplify the code and to make the assumptions clearer,
refactor user_mem_abort() by immediately setting force_pte to
true if logging_active is true. Also, add a check to enforce the
assumption that logging_active is never true for a VM_PFNMAP
memslot.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 342a9bd3848f..9b1921c1a1a0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1440,7 +1440,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
bool fault_is_perm)
{
int ret = 0;
- bool write_fault, writable, force_pte = false;
+ bool write_fault, writable;
bool exec_fault, mte_allowed;
bool device = false, vfio_allow_any_uc = false;
unsigned long mmu_seq;
@@ -1452,6 +1452,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
gfn_t gfn;
kvm_pfn_t pfn;
bool logging_active = memslot_is_logging(memslot);
+ bool force_pte = logging_active;
long vma_pagesize, fault_granule;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
struct kvm_pgtable *pgt;
@@ -1497,12 +1498,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* logging_active is guaranteed to never be true for VM_PFNMAP
* memslots.
*/
- if (logging_active) {
- force_pte = true;
+ if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
+ return -EFAULT;
+
+ if (force_pte)
vma_shift = PAGE_SHIFT;
- } else {
+ else
vma_shift = get_vma_page_shift(vma, hva);
- }
switch (vma_shift) {
#ifndef __PAGETABLE_PMD_FOLDED
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 14/15] KVM: arm64: Handle guest_memfd()-backed guest page faults
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (12 preceding siblings ...)
2025-01-17 16:29 ` [RFC PATCH v5 13/15] KVM: arm64: Refactor user_mem_abort() calculation of force_pte Fuad Tabba
@ 2025-01-17 16:30 ` Fuad Tabba
2025-01-17 16:30 ` [RFC PATCH v5 15/15] KVM: arm64: Enable guest_memfd private memory when pKVM is enabled Fuad Tabba
14 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:30 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Add arm64 support for resolving guest page faults on
guest_memfd()-backed memslots. This support is not contingent on
pKVM or other confidential computing support, and works in both
VHE and nVHE modes.
Without confidential computing, this support is useful for
testing and debugging. In the future, it might also be useful
should a user want to use guest_memfd() for all code, whether
it's for a protected guest or not.
For now, the fault granule is restricted to PAGE_SIZE.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 86 ++++++++++++++++++++++++++++------------
include/linux/kvm_host.h | 5 +++
virt/kvm/kvm_main.c | 5 ---
3 files changed, 66 insertions(+), 30 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 9b1921c1a1a0..adf23618e2a0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1434,6 +1434,39 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
return vma->vm_flags & VM_MTE_ALLOWED;
}
+static kvm_pfn_t faultin_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, bool write_fault, bool *writable,
+ struct page **page, bool is_private)
+{
+ kvm_pfn_t pfn;
+ int ret;
+
+ if (!is_private)
+ return __kvm_faultin_pfn(slot, gfn, write_fault ? FOLL_WRITE : 0, writable, page);
+
+ *writable = false;
+
+ if (WARN_ON_ONCE(write_fault && memslot_is_readonly(slot)))
+ return KVM_PFN_ERR_NOSLOT_MASK;
+
+ ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, page, NULL);
+ if (!ret) {
+ *writable = write_fault;
+ return pfn;
+ }
+
+ if (ret == -EHWPOISON)
+ return KVM_PFN_ERR_HWPOISON;
+
+ return KVM_PFN_ERR_NOSLOT_MASK;
+}
+
+static bool is_private_mem(struct kvm *kvm, struct kvm_memory_slot *memslot, phys_addr_t ipa)
+{
+ return kvm_arch_has_private_mem(kvm) && kvm_slot_can_be_private(memslot) &&
+ (kvm_mem_is_private(kvm, ipa >> PAGE_SHIFT) || !memslot->userspace_addr);
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_s2_trans *nested,
struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1441,24 +1474,25 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
{
int ret = 0;
bool write_fault, writable;
- bool exec_fault, mte_allowed;
+ bool exec_fault, mte_allowed = false;
bool device = false, vfio_allow_any_uc = false;
unsigned long mmu_seq;
phys_addr_t ipa = fault_ipa;
struct kvm *kvm = vcpu->kvm;
struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
- struct vm_area_struct *vma;
+ struct vm_area_struct *vma = NULL;
short vma_shift;
gfn_t gfn;
kvm_pfn_t pfn;
bool logging_active = memslot_is_logging(memslot);
- bool force_pte = logging_active;
- long vma_pagesize, fault_granule;
+ bool is_private = is_private_mem(kvm, memslot, fault_ipa);
+ bool force_pte = logging_active || is_private;
+ long vma_pagesize, fault_granule = PAGE_SIZE;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
struct kvm_pgtable *pgt;
struct page *page;
- if (fault_is_perm)
+ if (fault_is_perm && !is_private)
fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
write_fault = kvm_is_write_fault(vcpu);
exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
@@ -1482,24 +1516,30 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
return ret;
}
+ mmap_read_lock(current->mm);
+
/*
* Let's check if we will get back a huge page backed by hugetlbfs, or
* get block mapping for device MMIO region.
*/
- mmap_read_lock(current->mm);
- vma = vma_lookup(current->mm, hva);
- if (unlikely(!vma)) {
- kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
- mmap_read_unlock(current->mm);
- return -EFAULT;
- }
+ if (!is_private) {
+ vma = vma_lookup(current->mm, hva);
+ if (unlikely(!vma)) {
+ kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
+ mmap_read_unlock(current->mm);
+ return -EFAULT;
+ }
- /*
- * logging_active is guaranteed to never be true for VM_PFNMAP
- * memslots.
- */
- if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
- return -EFAULT;
+ /*
+ * logging_active is guaranteed to never be true for VM_PFNMAP
+ * memslots.
+ */
+ if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
+ return -EFAULT;
+
+ vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
+ mte_allowed = kvm_vma_mte_allowed(vma);
+ }
if (force_pte)
vma_shift = PAGE_SHIFT;
@@ -1570,17 +1610,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
}
gfn = ipa >> PAGE_SHIFT;
- mte_allowed = kvm_vma_mte_allowed(vma);
-
- vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
/* Don't use the VMA after the unlock -- it may have vanished */
vma = NULL;
/*
* Read mmu_invalidate_seq so that KVM can detect if the results of
- * vma_lookup() or __kvm_faultin_pfn() become stale prior to
- * acquiring kvm->mmu_lock.
+ * vma_lookup() or faultin_pfn() become stale prior to acquiring
+ * kvm->mmu_lock.
*
* Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
* with the smp_wmb() in kvm_mmu_invalidate_end().
@@ -1588,8 +1625,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
mmu_seq = vcpu->kvm->mmu_invalidate_seq;
mmap_read_unlock(current->mm);
- pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
- &writable, &page);
+ pfn = faultin_pfn(kvm, memslot, gfn, write_fault, &writable, &page, is_private);
if (pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(hva, vma_shift);
return 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 63e6d6dd98b3..76ebd496feda 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1853,6 +1853,11 @@ static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
return gfn_to_memslot(kvm, gfn)->id;
}
+static inline bool memslot_is_readonly(const struct kvm_memory_slot *slot)
+{
+ return slot->flags & KVM_MEM_READONLY;
+}
+
static inline gfn_t
hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
{
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0d1c2e95e771..1fdfa8c89c04 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2622,11 +2622,6 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
return size;
}
-static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
-{
- return slot->flags & KVM_MEM_READONLY;
-}
-
static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
gfn_t *nr_pages, bool write)
{
--
2.48.0.rc2.279.g1de40edade-goog
* [RFC PATCH v5 15/15] KVM: arm64: Enable guest_memfd private memory when pKVM is enabled
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (13 preceding siblings ...)
2025-01-17 16:30 ` [RFC PATCH v5 14/15] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
@ 2025-01-17 16:30 ` Fuad Tabba
14 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-17 16:30 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Implement kvm_arch_has_private_mem() in arm64 when pKVM is
enabled, and make it dependent on the configuration option.
Also, now that the infrastructure is in place for arm64 to
support guest private memory, enable it in the arm64 kernel
configuration.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/include/asm/kvm_host.h | 3 +++
arch/arm64/kvm/Kconfig | 1 +
2 files changed, 4 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index e18e9244d17a..8dfae9183651 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1529,4 +1529,7 @@ void kvm_set_vm_id_reg(struct kvm *kvm, u32 reg, u64 val);
#define kvm_has_s1poe(k) \
(kvm_has_feat((k), ID_AA64MMFR3_EL1, S1POE, IMP))
+#define kvm_arch_has_private_mem(kvm) \
+ (IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) && is_protected_kvm_enabled())
+
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index ead632ad01b4..fe3451f244b5 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -38,6 +38,7 @@ menuconfig KVM
select HAVE_KVM_VCPU_RUN_PID_CHANGE
select SCHED_INFO
select GUEST_PERF_EVENTS if PERF_EVENTS
+ select KVM_GMEM_MAPPABLE
help
Support hosting virtualized guest machines.
--
2.48.0.rc2.279.g1de40edade-goog
* Re: [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put()
2025-01-17 16:29 ` [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
@ 2025-01-17 22:05 ` Elliot Berman
2025-01-19 14:39 ` Fuad Tabba
2025-01-20 10:39 ` David Hildenbrand
2025-01-20 10:39 ` David Hildenbrand
1 sibling, 2 replies; 60+ messages in thread
From: Elliot Berman @ 2025-01-17 22:05 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
jthoughton
On Fri, Jan 17, 2025 at 04:29:47PM +0000, Fuad Tabba wrote:
> Some folio types, such as hugetlb, handle freeing their own
> folios. Moreover, guest_memfd will require being notified once a
> folio's reference count reaches 0 to facilitate shared to private
> folio conversion, without the folio actually being freed at that
> point.
>
> As a first step towards that, this patch consolidates freeing
> folios that have a type. The first user is hugetlb folios. Later
> in this patch series, guest_memfd will become the second user of
> this.
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
> include/linux/page-flags.h | 15 +++++++++++++++
> mm/swap.c | 24 +++++++++++++++++++-----
> 2 files changed, 34 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 691506bdf2c5..6615f2f59144 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -962,6 +962,21 @@ static inline bool page_has_type(const struct page *page)
> return page_mapcount_is_type(data_race(page->page_type));
> }
>
> +static inline int page_get_type(const struct page *page)
> +{
> + return page->page_type >> 24;
> +}
> +
> +static inline bool folio_has_type(const struct folio *folio)
> +{
> + return page_has_type(&folio->page);
> +}
> +
> +static inline int folio_get_type(const struct folio *folio)
> +{
> + return page_get_type(&folio->page);
> +}
> +
> #define FOLIO_TYPE_OPS(lname, fname) \
> static __always_inline bool folio_test_##fname(const struct folio *folio) \
> { \
> diff --git a/mm/swap.c b/mm/swap.c
> index 10decd9dffa1..6f01b56bce13 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -94,6 +94,20 @@ static void page_cache_release(struct folio *folio)
> unlock_page_lruvec_irqrestore(lruvec, flags);
> }
>
> +static void free_typed_folio(struct folio *folio)
> +{
> + switch (folio_get_type(folio)) {
I think you need:
+#if IS_ENABLED(CONFIG_HUGETLBFS)
> + case PGTY_hugetlb:
> + free_huge_folio(folio);
> + return;
+#endif
I think this worked before because folio_test_hugetlb was defined by:
FOLIO_TEST_FLAG_FALSE(hugetlb)
and evidently compiler optimizes out the free_huge_folio(folio) before
linking.
You'll probably want to do the same for the PGTY_guestmem in the later
patch!
> + case PGTY_offline:
> + /* Nothing to do, it's offline. */
> + return;
> + default:
> + WARN_ON_ONCE(1);
> + }
> +}
> +
> void __folio_put(struct folio *folio)
> {
> if (unlikely(folio_is_zone_device(folio))) {
> @@ -101,8 +115,8 @@ void __folio_put(struct folio *folio)
> return;
> }
>
> - if (folio_test_hugetlb(folio)) {
> - free_huge_folio(folio);
> + if (unlikely(folio_has_type(folio))) {
> + free_typed_folio(folio);
> return;
> }
>
> @@ -934,13 +948,13 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
> if (!folio_ref_sub_and_test(folio, nr_refs))
> continue;
>
> - /* hugetlb has its own memcg */
> - if (folio_test_hugetlb(folio)) {
> + if (unlikely(folio_has_type(folio))) {
> + /* typed folios have their own memcg, if any */
> if (lruvec) {
> unlock_page_lruvec_irqrestore(lruvec, flags);
> lruvec = NULL;
> }
> - free_huge_folio(folio);
> + free_typed_folio(folio);
> continue;
> }
> folio_unqueue_deferred_split(folio);
> --
> 2.48.0.rc2.279.g1de40edade-goog
>
* Re: [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put()
2025-01-17 22:05 ` Elliot Berman
@ 2025-01-19 14:39 ` Fuad Tabba
2025-01-20 10:39 ` David Hildenbrand
1 sibling, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-19 14:39 UTC (permalink / raw)
To: Elliot Berman
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
jthoughton
On Fri, 17 Jan 2025 at 22:05, Elliot Berman
<elliot.berman@oss.qualcomm.com> wrote:
>
> On Fri, Jan 17, 2025 at 04:29:47PM +0000, Fuad Tabba wrote:
> > Some folio types, such as hugetlb, handle freeing their own
> > folios. Moreover, guest_memfd will require being notified once a
> > folio's reference count reaches 0 to facilitate shared to private
> > folio conversion, without the folio actually being freed at that
> > point.
> >
> > As a first step towards that, this patch consolidates freeing
> > folios that have a type. The first user is hugetlb folios. Later
> > in this patch series, guest_memfd will become the second user of
> > this.
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> > include/linux/page-flags.h | 15 +++++++++++++++
> > mm/swap.c | 24 +++++++++++++++++++-----
> > 2 files changed, 34 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 691506bdf2c5..6615f2f59144 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -962,6 +962,21 @@ static inline bool page_has_type(const struct page *page)
> > return page_mapcount_is_type(data_race(page->page_type));
> > }
> >
> > +static inline int page_get_type(const struct page *page)
> > +{
> > + return page->page_type >> 24;
> > +}
> > +
> > +static inline bool folio_has_type(const struct folio *folio)
> > +{
> > + return page_has_type(&folio->page);
> > +}
> > +
> > +static inline int folio_get_type(const struct folio *folio)
> > +{
> > + return page_get_type(&folio->page);
> > +}
> > +
> > #define FOLIO_TYPE_OPS(lname, fname) \
> > static __always_inline bool folio_test_##fname(const struct folio *folio) \
> > { \
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 10decd9dffa1..6f01b56bce13 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -94,6 +94,20 @@ static void page_cache_release(struct folio *folio)
> > unlock_page_lruvec_irqrestore(lruvec, flags);
> > }
> >
> > +static void free_typed_folio(struct folio *folio)
> > +{
> > + switch (folio_get_type(folio)) {
>
> I think you need:
>
> +#if IS_ENABLED(CONFIG_HUGETLBFS)
> > + case PGTY_hugetlb:
> > + free_huge_folio(folio);
> > + return;
> +#endif
>
> I think this worked before because folio_test_hugetlb was defined by:
> FOLIO_TEST_FLAG_FALSE(hugetlb)
> and evidently compiler optimizes out the free_huge_folio(folio) before
> linking.
>
> You'll probably want to do the same for the PGTY_guestmem in the later
> patch!
Thanks Elliot. This will keep the kernel test robot happy when I respin.
Cheers,
/fuad
>
> > + case PGTY_offline:
> > + /* Nothing to do, it's offline. */
> > + return;
> > + default:
> > + WARN_ON_ONCE(1);
> > + }
> > +}
> > +
> > void __folio_put(struct folio *folio)
> > {
> > if (unlikely(folio_is_zone_device(folio))) {
> > @@ -101,8 +115,8 @@ void __folio_put(struct folio *folio)
> > return;
> > }
> >
> > - if (folio_test_hugetlb(folio)) {
> > - free_huge_folio(folio);
> > + if (unlikely(folio_has_type(folio))) {
> > + free_typed_folio(folio);
> > return;
> > }
> >
> > @@ -934,13 +948,13 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
> > if (!folio_ref_sub_and_test(folio, nr_refs))
> > continue;
> >
> > - /* hugetlb has its own memcg */
> > - if (folio_test_hugetlb(folio)) {
> > + if (unlikely(folio_has_type(folio))) {
> > + /* typed folios have their own memcg, if any */
> > if (lruvec) {
> > unlock_page_lruvec_irqrestore(lruvec, flags);
> > lruvec = NULL;
> > }
> > - free_huge_folio(folio);
> > + free_typed_folio(folio);
> > continue;
> > }
> > folio_unqueue_deferred_split(folio);
> > --
> > 2.48.0.rc2.279.g1de40edade-goog
> >
* Re: [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition
2025-01-17 16:29 ` [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition Fuad Tabba
@ 2025-01-20 10:30 ` Kirill A. Shutemov
2025-01-20 10:40 ` Fuad Tabba
2025-02-19 23:33 ` Ackerley Tng
1 sibling, 1 reply; 60+ messages in thread
From: Kirill A. Shutemov @ 2025-01-20 10:30 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
jthoughton
On Fri, Jan 17, 2025 at 04:29:51PM +0000, Fuad Tabba wrote:
> +/*
> + * Marks the range [start, end) as not mappable by the host. If the host doesn't
> + * have any references to a particular folio, then that folio is marked as
> + * mappable by the guest.
> + *
> + * However, if the host still has references to the folio, then the folio is
> + * marked as not mappable by anyone. Marking it as not mappable allows it to
> + * drain all references from the host, and to ensure that the hypervisor does
> + * not transition the folio to private, since the host still might access it.
> + *
> + * Usually called when guest unshares memory with the host.
> + */
> +static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> + void *xval_none = xa_mk_value(KVM_GMEM_NONE_MAPPABLE);
> + pgoff_t i;
> + int r = 0;
> +
> + filemap_invalidate_lock(inode->i_mapping);
> + for (i = start; i < end; i++) {
> + struct folio *folio;
> + int refcount = 0;
> +
> + folio = filemap_lock_folio(inode->i_mapping, i);
> + if (!IS_ERR(folio)) {
> + refcount = folio_ref_count(folio);
> + } else {
> + r = PTR_ERR(folio);
> + if (WARN_ON_ONCE(r != -ENOENT))
> + break;
> +
> + folio = NULL;
> + }
> +
> + /* +1 references are expected because of filemap_lock_folio(). */
> + if (folio && refcount > folio_nr_pages(folio) + 1) {
Looks racy.
What prevents anybody from obtaining a reference just after the check?
Lock on folio doesn't stop random filemap_get_entry() from elevating the
refcount.
folio_ref_freeze() might be required.
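For example, a rough sketch of the idea (untested), reusing the variables
from the patch and the same expected reference count:

	if (!folio || folio_ref_freeze(folio, folio_nr_pages(folio) + 1)) {
		/* No outstanding references: guest mappable immediately. */
		r = xa_err(xa_store(mappable_offsets, i, xval_guest, GFP_KERNEL));
		if (folio)
			folio_ref_unfreeze(folio, folio_nr_pages(folio) + 1);
	} else {
		/* Outstanding references: nobody can fault it in until they drop. */
		r = xa_err(xa_store(mappable_offsets, i, xval_none, GFP_KERNEL));
	}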
> + /*
> + * Outstanding references, the folio cannot be faulted
> + * in by anyone until they're dropped.
> + */
> + r = xa_err(xa_store(mappable_offsets, i, xval_none, GFP_KERNEL));
> + } else {
> + /*
> + * No outstanding references. Transition the folio to
> + * guest mappable immediately.
> + */
> + r = xa_err(xa_store(mappable_offsets, i, xval_guest, GFP_KERNEL));
> + }
> +
> + if (folio) {
> + folio_unlock(folio);
> + folio_put(folio);
> + }
> +
> + if (WARN_ON_ONCE(r))
> + break;
> + }
> + filemap_invalidate_unlock(inode->i_mapping);
> +
> + return r;
> +}
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put()
2025-01-17 16:29 ` [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
2025-01-17 22:05 ` Elliot Berman
@ 2025-01-20 10:39 ` David Hildenbrand
2025-01-20 10:43 ` Fuad Tabba
2025-01-20 10:43 ` Vlastimil Babka
1 sibling, 2 replies; 60+ messages in thread
From: David Hildenbrand @ 2025-01-20 10:39 UTC (permalink / raw)
To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On 17.01.25 17:29, Fuad Tabba wrote:
> Some folio types, such as hugetlb, handle freeing their own
> folios. Moreover, guest_memfd will require being notified once a
> folio's reference count reaches 0 to facilitate shared to private
> folio conversion, without the folio actually being freed at that
> point.
>
> As a first step towards that, this patch consolidates freeing
> folios that have a type. The first user is hugetlb folios. Later
> in this patch series, guest_memfd will become the second user of
> this.
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
> include/linux/page-flags.h | 15 +++++++++++++++
> mm/swap.c | 24 +++++++++++++++++++-----
> 2 files changed, 34 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 691506bdf2c5..6615f2f59144 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -962,6 +962,21 @@ static inline bool page_has_type(const struct page *page)
> return page_mapcount_is_type(data_race(page->page_type));
> }
>
> +static inline int page_get_type(const struct page *page)
> +{
> + return page->page_type >> 24;
> +}
> +
> +static inline bool folio_has_type(const struct folio *folio)
> +{
> + return page_has_type(&folio->page);
> +}
> +
> +static inline int folio_get_type(const struct folio *folio)
> +{
> + return page_get_type(&folio->page);
> +}
> +
> #define FOLIO_TYPE_OPS(lname, fname) \
> static __always_inline bool folio_test_##fname(const struct folio *folio) \
> { \
> diff --git a/mm/swap.c b/mm/swap.c
> index 10decd9dffa1..6f01b56bce13 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -94,6 +94,20 @@ static void page_cache_release(struct folio *folio)
> unlock_page_lruvec_irqrestore(lruvec, flags);
> }
>
> +static void free_typed_folio(struct folio *folio)
> +{
> + switch (folio_get_type(folio)) {
> + case PGTY_hugetlb:
> + free_huge_folio(folio);
> + return;
> + case PGTY_offline:
> + /* Nothing to do, it's offline. */
> + return;
Please drop the PGTY_offline part for now, it was rather to highlight
what could be done.
But the real goal will be to not make offline pages
use the refcount at all (frozen).
If we really want the temporary PGTY_offline change, it should be
introduced separately.
Apart from that LGTM!
--
Cheers,
David / dhildenb
* Re: [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put()
2025-01-17 22:05 ` Elliot Berman
2025-01-19 14:39 ` Fuad Tabba
@ 2025-01-20 10:39 ` David Hildenbrand
2025-01-20 10:50 ` Fuad Tabba
1 sibling, 1 reply; 60+ messages in thread
From: David Hildenbrand @ 2025-01-20 10:39 UTC (permalink / raw)
To: Elliot Berman, Fuad Tabba
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
mail, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On 17.01.25 23:05, Elliot Berman wrote:
> On Fri, Jan 17, 2025 at 04:29:47PM +0000, Fuad Tabba wrote:
>> Some folio types, such as hugetlb, handle freeing their own
>> folios. Moreover, guest_memfd will require being notified once a
>> folio's reference count reaches 0 to facilitate shared to private
>> folio conversion, without the folio actually being freed at that
>> point.
>>
>> As a first step towards that, this patch consolidates freeing
>> folios that have a type. The first user is hugetlb folios. Later
>> in this patch series, guest_memfd will become the second user of
>> this.
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Fuad Tabba <tabba@google.com>
>> ---
>> include/linux/page-flags.h | 15 +++++++++++++++
>> mm/swap.c | 24 +++++++++++++++++++-----
>> 2 files changed, 34 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 691506bdf2c5..6615f2f59144 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -962,6 +962,21 @@ static inline bool page_has_type(const struct page *page)
>> return page_mapcount_is_type(data_race(page->page_type));
>> }
>>
>> +static inline int page_get_type(const struct page *page)
>> +{
>> + return page->page_type >> 24;
>> +}
>> +
>> +static inline bool folio_has_type(const struct folio *folio)
>> +{
>> + return page_has_type(&folio->page);
>> +}
>> +
>> +static inline int folio_get_type(const struct folio *folio)
>> +{
>> + return page_get_type(&folio->page);
>> +}
>> +
>> #define FOLIO_TYPE_OPS(lname, fname) \
>> static __always_inline bool folio_test_##fname(const struct folio *folio) \
>> { \
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 10decd9dffa1..6f01b56bce13 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -94,6 +94,20 @@ static void page_cache_release(struct folio *folio)
>> unlock_page_lruvec_irqrestore(lruvec, flags);
>> }
>>
>> +static void free_typed_folio(struct folio *folio)
>> +{
>> + switch (folio_get_type(folio)) {
>
> I think you need:
>
> +#if IS_ENABLED(CONFIG_HUGETLBFS)
>> + case PGTY_hugetlb:
>> + free_huge_folio(folio);
>> + return;
> +#endif
>
> I think this worked before because folio_test_hugetlb was defined by:
> FOLIO_TEST_FLAG_FALSE(hugetlb)
> and evidently compiler optimizes out the free_huge_folio(folio) before
> linking.
Likely, we should be using

case PGTY_hugetlb:
	if (IS_ENABLED(CONFIG_HUGETLBFS))
		free_huge_folio(folio);
	return;

if possible (I assume so).
--
Cheers,
David / dhildenb
* Re: [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition
2025-01-20 10:30 ` Kirill A. Shutemov
@ 2025-01-20 10:40 ` Fuad Tabba
2025-02-06 3:14 ` Ackerley Tng
0 siblings, 1 reply; 60+ messages in thread
From: Fuad Tabba @ 2025-01-20 10:40 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
jthoughton
On Mon, 20 Jan 2025 at 10:30, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Fri, Jan 17, 2025 at 04:29:51PM +0000, Fuad Tabba wrote:
> > +/*
> > + * Marks the range [start, end) as not mappable by the host. If the host doesn't
> > + * have any references to a particular folio, then that folio is marked as
> > + * mappable by the guest.
> > + *
> > + * However, if the host still has references to the folio, then the folio is
> > > + * marked as not mappable by anyone. Marking it as not mappable allows it to
> > + * drain all references from the host, and to ensure that the hypervisor does
> > + * not transition the folio to private, since the host still might access it.
> > + *
> > + * Usually called when guest unshares memory with the host.
> > + */
> > +static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> > +{
> > + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> > + void *xval_none = xa_mk_value(KVM_GMEM_NONE_MAPPABLE);
> > + pgoff_t i;
> > + int r = 0;
> > +
> > + filemap_invalidate_lock(inode->i_mapping);
> > + for (i = start; i < end; i++) {
> > + struct folio *folio;
> > + int refcount = 0;
> > +
> > + folio = filemap_lock_folio(inode->i_mapping, i);
> > + if (!IS_ERR(folio)) {
> > + refcount = folio_ref_count(folio);
> > + } else {
> > + r = PTR_ERR(folio);
> > + if (WARN_ON_ONCE(r != -ENOENT))
> > + break;
> > +
> > + folio = NULL;
> > + }
> > +
> > + /* +1 references are expected because of filemap_lock_folio(). */
> > + if (folio && refcount > folio_nr_pages(folio) + 1) {
>
> Looks racy.
>
> What prevent anybody from obtaining a reference just after check?
>
> Lock on folio doesn't stop random filemap_get_entry() from elevating the
> refcount.
>
> folio_ref_freeze() might be required.
I thought the folio lock would be sufficient, but you're right,
nothing prevents getting a reference after the check. I'll use a
folio_ref_freeze() when I respin.
Thanks,
/fuad
> > + /*
> > + * Outstanding references, the folio cannot be faulted
> > + * in by anyone until they're dropped.
> > + */
> > + r = xa_err(xa_store(mappable_offsets, i, xval_none, GFP_KERNEL));
> > + } else {
> > + /*
> > + * No outstanding references. Transition the folio to
> > + * guest mappable immediately.
> > + */
> > + r = xa_err(xa_store(mappable_offsets, i, xval_guest, GFP_KERNEL));
> > + }
> > +
> > + if (folio) {
> > + folio_unlock(folio);
> > + folio_put(folio);
> > + }
> > +
> > + if (WARN_ON_ONCE(r))
> > + break;
> > + }
> > + filemap_invalidate_unlock(inode->i_mapping);
> > +
> > + return r;
> > +}
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
* Re: [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put()
2025-01-20 10:39 ` David Hildenbrand
@ 2025-01-20 10:43 ` Fuad Tabba
2025-01-20 10:43 ` Vlastimil Babka
1 sibling, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-20 10:43 UTC (permalink / raw)
To: David Hildenbrand
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
mail, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On Mon, 20 Jan 2025 at 10:39, David Hildenbrand <david@redhat.com> wrote:
>
> On 17.01.25 17:29, Fuad Tabba wrote:
> > Some folio types, such as hugetlb, handle freeing their own
> > folios. Moreover, guest_memfd will require being notified once a
> > folio's reference count reaches 0 to facilitate shared to private
> > folio conversion, without the folio actually being freed at that
> > point.
> >
> > As a first step towards that, this patch consolidates freeing
> > folios that have a type. The first user is hugetlb folios. Later
> > in this patch series, guest_memfd will become the second user of
> > this.
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> > include/linux/page-flags.h | 15 +++++++++++++++
> > mm/swap.c | 24 +++++++++++++++++++-----
> > 2 files changed, 34 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 691506bdf2c5..6615f2f59144 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -962,6 +962,21 @@ static inline bool page_has_type(const struct page *page)
> > return page_mapcount_is_type(data_race(page->page_type));
> > }
> >
> > +static inline int page_get_type(const struct page *page)
> > +{
> > + return page->page_type >> 24;
> > +}
> > +
> > +static inline bool folio_has_type(const struct folio *folio)
> > +{
> > + return page_has_type(&folio->page);
> > +}
> > +
> > +static inline int folio_get_type(const struct folio *folio)
> > +{
> > + return page_get_type(&folio->page);
> > +}
> > +
> > #define FOLIO_TYPE_OPS(lname, fname) \
> > static __always_inline bool folio_test_##fname(const struct folio *folio) \
> > { \
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 10decd9dffa1..6f01b56bce13 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -94,6 +94,20 @@ static void page_cache_release(struct folio *folio)
> > unlock_page_lruvec_irqrestore(lruvec, flags);
> > }
> >
> > +static void free_typed_folio(struct folio *folio)
> > +{
> > + switch (folio_get_type(folio)) {
> > + case PGTY_hugetlb:
> > + free_huge_folio(folio);
> > + return;
> > + case PGTY_offline:
> > + /* Nothing to do, it's offline. */
> > + return;
>
> Please drop the PGTY_offline part for now, it was rather to highlight
> what could be done.
Will do.
Thanks,
/fuad
>
> But the real goal will be to not make offline pages
> use the refcount at all (frozen).
>
> If we really want the temporary PGTY_offline change, it should be
> introduced separately.
>
> Apart from that LGTM!
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put()
2025-01-20 10:39 ` David Hildenbrand
2025-01-20 10:43 ` Fuad Tabba
@ 2025-01-20 10:43 ` Vlastimil Babka
2025-01-20 11:12 ` Vlastimil Babka
2025-01-20 11:28 ` David Hildenbrand
1 sibling, 2 replies; 60+ messages in thread
From: Vlastimil Babka @ 2025-01-20 10:43 UTC (permalink / raw)
To: David Hildenbrand, Fuad Tabba, kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vannapurve, ackerleytng, mail, michael.roth,
wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
quic_pheragu, catalin.marinas, james.morse, yuzenghui,
oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
rientjes, jhubbard, fvdl, hughd, jthoughton
On 1/20/25 11:39, David Hildenbrand wrote:
> On 17.01.25 17:29, Fuad Tabba wrote:
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 10decd9dffa1..6f01b56bce13 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -94,6 +94,20 @@ static void page_cache_release(struct folio *folio)
>> unlock_page_lruvec_irqrestore(lruvec, flags);
>> }
>>
>> +static void free_typed_folio(struct folio *folio)
>> +{
>> + switch (folio_get_type(folio)) {
>> + case PGTY_hugetlb:
>> + free_huge_folio(folio);
>> + return;
>> + case PGTY_offline:
>> + /* Nothing to do, it's offline. */
>> + return;
>
> Please drop the PGTY_offline part for now, it was rather to highlight
> what could be done.
>
> But the real goal will be to not make offline pages
> use the refcount at all (frozen).
>
> If we really want the temporary PGTY_offline change, it should be
> introduced separately.
>
> Apart from that LGTM!
I guess you mean the WARN_ON_ONCE(1) should be dropped from the default:
handler as well, right? IIUC offline pages are not yet frozen, so there
would be warnings otherwise. And I haven't checked if the other types are
frozen (I know slab is, very recently :)
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put()
2025-01-20 10:39 ` David Hildenbrand
@ 2025-01-20 10:50 ` Fuad Tabba
0 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-20 10:50 UTC (permalink / raw)
To: David Hildenbrand
Cc: Elliot Berman, kvm, linux-arm-msm, linux-mm, pbonzini,
chenhuacai, mpe, anup, paul.walmsley, palmer, aou, seanjc, viro,
brauner, willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
steven.price, quic_eberman, quic_mnalajal, quic_tsoni,
quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
qperret, keirf, roypat, shuah, hch, jgg, rientjes, jhubbard,
fvdl, hughd, jthoughton
On Mon, 20 Jan 2025 at 10:39, David Hildenbrand <david@redhat.com> wrote:
>
> On 17.01.25 23:05, Elliot Berman wrote:
> > On Fri, Jan 17, 2025 at 04:29:47PM +0000, Fuad Tabba wrote:
> >> Some folio types, such as hugetlb, handle freeing their own
> >> folios. Moreover, guest_memfd will require being notified once a
> >> folio's reference count reaches 0 to facilitate shared to private
> >> folio conversion, without the folio actually being freed at that
> >> point.
> >>
> >> As a first step towards that, this patch consolidates freeing
> >> folios that have a type. The first user is hugetlb folios. Later
> >> in this patch series, guest_memfd will become the second user of
> >> this.
> >>
> >> Suggested-by: David Hildenbrand <david@redhat.com>
> >> Signed-off-by: Fuad Tabba <tabba@google.com>
> >> ---
> >> include/linux/page-flags.h | 15 +++++++++++++++
> >> mm/swap.c | 24 +++++++++++++++++++-----
> >> 2 files changed, 34 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >> index 691506bdf2c5..6615f2f59144 100644
> >> --- a/include/linux/page-flags.h
> >> +++ b/include/linux/page-flags.h
> >> @@ -962,6 +962,21 @@ static inline bool page_has_type(const struct page *page)
> >> return page_mapcount_is_type(data_race(page->page_type));
> >> }
> >>
> >> +static inline int page_get_type(const struct page *page)
> >> +{
> >> + return page->page_type >> 24;
> >> +}
> >> +
> >> +static inline bool folio_has_type(const struct folio *folio)
> >> +{
> >> + return page_has_type(&folio->page);
> >> +}
> >> +
> >> +static inline int folio_get_type(const struct folio *folio)
> >> +{
> >> + return page_get_type(&folio->page);
> >> +}
> >> +
> >> #define FOLIO_TYPE_OPS(lname, fname) \
> >> static __always_inline bool folio_test_##fname(const struct folio *folio) \
> >> { \
> >> diff --git a/mm/swap.c b/mm/swap.c
> >> index 10decd9dffa1..6f01b56bce13 100644
> >> --- a/mm/swap.c
> >> +++ b/mm/swap.c
> >> @@ -94,6 +94,20 @@ static void page_cache_release(struct folio *folio)
> >> unlock_page_lruvec_irqrestore(lruvec, flags);
> >> }
> >>
> >> +static void free_typed_folio(struct folio *folio)
> >> +{
> >> + switch (folio_get_type(folio)) {
> >
> > I think you need:
> >
> > +#if IS_ENABLED(CONFIG_HUGETLBFS)
> >> + case PGTY_hugetlb:
> >> + free_huge_folio(folio);
> >> + return;
> > +#endif
> >
> > I think this worked before because folio_test_hugetlb was defined by:
> > FOLIO_TEST_FLAG_FALSE(hugetlb)
> > and evidently the compiler optimizes out the free_huge_folio(folio) call
> > before linking.
>
> Likely, we should be using
>
> case PGTY_hugetlb:
> if (IS_ENABLED(CONFIG_HUGETLBFS))
>         free_huge_folio(folio);
> return;
>
> if possible (I assume so).
Yes it does. I'll fix it.
Cheers,
/fuad
> --
> Cheers,
>
> David / dhildenb
>
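For reference, a minimal sketch of what the switch could look like with the
IS_ENABLED() suggestion folded in and the PGTY_offline case dropped (this
assumes free_huge_folio() stays visible to the compiler regardless of the
config, so the dead call can be optimized out):

static void free_typed_folio(struct folio *folio)
{
        switch (folio_get_type(folio)) {
        case PGTY_hugetlb:
                /* Dead code, dropped at compile time, when hugetlb is off. */
                if (IS_ENABLED(CONFIG_HUGETLBFS))
                        free_huge_folio(folio);
                return;
        default:
                WARN_ON_ONCE(1);
        }
}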
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put()
2025-01-20 10:43 ` Vlastimil Babka
@ 2025-01-20 11:12 ` Vlastimil Babka
2025-01-20 11:28 ` David Hildenbrand
1 sibling, 0 replies; 60+ messages in thread
From: Vlastimil Babka @ 2025-01-20 11:12 UTC (permalink / raw)
To: David Hildenbrand, Fuad Tabba, kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vannapurve, ackerleytng, mail, michael.roth,
wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
quic_pheragu, catalin.marinas, james.morse, yuzenghui,
oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
rientjes, jhubbard, fvdl, hughd, jthoughton
On 1/20/25 11:43, Vlastimil Babka wrote:
> On 1/20/25 11:39, David Hildenbrand wrote:
>> On 17.01.25 17:29, Fuad Tabba wrote:
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index 10decd9dffa1..6f01b56bce13 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -94,6 +94,20 @@ static void page_cache_release(struct folio *folio)
>>> unlock_page_lruvec_irqrestore(lruvec, flags);
>>> }
>>>
>>> +static void free_typed_folio(struct folio *folio)
>>> +{
>>> + switch (folio_get_type(folio)) {
>>> + case PGTY_hugetlb:
>>> + free_huge_folio(folio);
>>> + return;
>>> + case PGTY_offline:
>>> + /* Nothing to do, it's offline. */
>>> + return;
>>
>> Please drop the PGTY_offline part for now, it was rather to highlight
>> what could be done.
>>
>> But the real goal will be to not make offline pages
>> use the refcount at all (frozen).
>>
>> If we really want the temporary PGTY_offline change, it should be
>> introduced separately.
>>
>> Apart from that LGTM!
>
> I guess you mean the WARN_ON_ONCE(1) should be dropped from the default:
> handler as well, right? IIUC offline pages are not yet frozen, so there
> would be warnings otherwise. And I haven't checked if the other types are
> frozen (I know slab is, very recently :)
Oh, and free_typed_folio() would also have to return bool, so that if it
returns false, the normal freeing proceeds?
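A rough sketch of that bool variant (hypothetical; the exact call site in
__folio_put() is assumed here):

static bool free_typed_folio(struct folio *folio)
{
        switch (folio_get_type(folio)) {
        case PGTY_hugetlb:
                free_huge_folio(folio);
                return true;
        default:
                /* Unknown or not-yet-handled type: fall back to normal freeing. */
                return false;
        }
}

        /* Call site: */
        if (unlikely(folio_has_type(folio)) && free_typed_folio(folio))
                return;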
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put()
2025-01-20 10:43 ` Vlastimil Babka
2025-01-20 11:12 ` Vlastimil Babka
@ 2025-01-20 11:28 ` David Hildenbrand
1 sibling, 0 replies; 60+ messages in thread
From: David Hildenbrand @ 2025-01-20 11:28 UTC (permalink / raw)
To: Vlastimil Babka, Fuad Tabba, kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vannapurve, ackerleytng, mail, michael.roth,
wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
quic_pheragu, catalin.marinas, james.morse, yuzenghui,
oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
rientjes, jhubbard, fvdl, hughd, jthoughton
On 20.01.25 11:43, Vlastimil Babka wrote:
> On 1/20/25 11:39, David Hildenbrand wrote:
>> On 17.01.25 17:29, Fuad Tabba wrote:
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index 10decd9dffa1..6f01b56bce13 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -94,6 +94,20 @@ static void page_cache_release(struct folio *folio)
>>> unlock_page_lruvec_irqrestore(lruvec, flags);
>>> }
>>>
>>> +static void free_typed_folio(struct folio *folio)
>>> +{
>>> + switch (folio_get_type(folio)) {
>>> + case PGTY_hugetlb:
>>> + free_huge_folio(folio);
>>> + return;
>>> + case PGTY_offline:
>>> + /* Nothing to do, it's offline. */
>>> + return;
>>
>> Please drop the PGTY_offline part for now, it was rather to highlight
>> what could be done.
>>
>> But the real goal will be to not make offline pages
>> use the refcount at all (frozen).
>>
>> If we really want the temporary PGTY_offline change, it should be
>> introduced separately.
>>
>> Apart from that LGTM!
>
> I guess you mean the WARN_ON_ONCE(1) should be dropped from the default:
> handler as well, right? IIUC offline pages are not yet frozen, so there
> would be warnings otherwise.
If we get offline pages here, it is unexpected and wrong. All users
clear PG_offline before handing them back to the buddy.
There is one nasty race in virtio-mem code for handling memory offlining
with PG_offline pages, which we haven't seen so far in practice. See
virtio_mem_fake_offline_going_offline()->page_ref_dec_and_test() for the
nasty details. It would only trigger with some weird speculative references.
The proper fix will be to leave the refcount frozen for them, so
speculative refcount users will just fail.
If we want to tackle that race before the refcount freezing is in place, we
should do it in a separate patch (and not buried in this series).
> And I haven't checked if the other types are
> frozen (I know slab is, very recently :)
I think if we had that, we would end up triggering the
free_pages_prepare()->free_page_is_bad() check, because page->_mapcount ==
folio->_mapcount would not be -1 for typed folios.
Right now we don't expect folios of specific types to ever get freed (in
the future, these won't be folios anymore at all -- only guest_memfd and
hugetlb would be folios, which need special care to be handed back to
their pool).
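To illustrate the frozen-refcount idea with the usual helpers (a sketch
only; do_offline_transition() is a made-up placeholder):

        /* Owner side: freezing succeeds only while we hold the sole reference. */
        if (folio_ref_freeze(folio, 1)) {
                /* Refcount is now 0; speculative users cannot grab the folio. */
                do_offline_transition(folio);
                folio_ref_unfreeze(folio, 1);
        }

        /* Speculative side (e.g. a pfn scanner) simply fails to get a reference: */
        if (!folio_try_get(folio))
                return;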
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-17 16:29 ` [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages Fuad Tabba
@ 2025-01-20 11:37 ` Vlastimil Babka
2025-01-20 12:14 ` Fuad Tabba
2025-01-22 22:16 ` Ackerley Tng
` (3 subsequent siblings)
4 siblings, 1 reply; 60+ messages in thread
From: Vlastimil Babka @ 2025-01-20 11:37 UTC (permalink / raw)
To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vannapurve, ackerleytng, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On 1/17/25 17:29, Fuad Tabba wrote:
> Before transitioning a guest_memfd folio to unshared, thereby
> disallowing access by the host and allowing the hypervisor to
> transition its view of the guest page as private, we need to be
> sure that the host doesn't have any references to the folio.
>
> This patch introduces a new type for guest_memfd folios, and uses
> that to register a callback that informs the guest_memfd
> subsystem when the last reference is dropped, therefore knowing
> that the host doesn't have any remaining references.
>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
> The function kvm_slot_gmem_register_callback() isn't used in this
> series. It will be used later in code that performs unsharing of
> memory. I have tested it with pKVM, based on downstream code [*].
> It's included in this RFC since it demonstrates the plan to
> handle unsharing of private folios.
>
> [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v5-pkvm
<snip>
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -387,6 +387,28 @@ enum folio_mappability {
> KVM_GMEM_NONE_MAPPABLE = 0b11, /* Not mappable, transient state. */
> };
>
> +/*
> + * Unregisters the __folio_put() callback from the folio.
> + *
> + * Restores a folio's refcount after all pending references have been released,
> + * and removes the folio type, thereby removing the callback. Now the folio can
> > + * be freed normally once all actual references have been dropped.
> + *
> + * Must be called with the filemap (inode->i_mapping) invalidate_lock held.
> + * Must also have exclusive access to the folio: folio must be either locked, or
> + * gmem holds the only reference.
> + */
> +static void __kvm_gmem_restore_pending_folio(struct folio *folio)
> +{
> + if (WARN_ON_ONCE(folio_mapped(folio) || !folio_test_guestmem(folio)))
> + return;
> +
> + WARN_ON_ONCE(!folio_test_locked(folio) && folio_ref_count(folio) > 1);
Similar to Kirill's objection on the other patch, I think there might be a
speculative refcount increase (i.e. from a pfn scanner) as long as we have
refcount over 1. Probably not a problem here if we want to restore refcount
anyway? But the warning would be spurious.
> +
> + __folio_clear_guestmem(folio);
> + folio_ref_add(folio, folio_nr_pages(folio));
> +}
> +
> /*
> * Marks the range [start, end) as mappable by both the host and the guest.
> * Usually called when guest shares memory with the host.
> @@ -400,7 +422,31 @@ static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
>
> filemap_invalidate_lock(inode->i_mapping);
> for (i = start; i < end; i++) {
> + struct folio *folio = NULL;
> +
> + /*
> + * If the folio is NONE_MAPPABLE, it indicates that it is
> + * transitioning to private (GUEST_MAPPABLE). Transition it to
> + * shared (ALL_MAPPABLE) immediately, and remove the callback.
> + */
> + if (xa_to_value(xa_load(mappable_offsets, i)) == KVM_GMEM_NONE_MAPPABLE) {
> + folio = filemap_lock_folio(inode->i_mapping, i);
> + if (WARN_ON_ONCE(IS_ERR(folio))) {
> + r = PTR_ERR(folio);
> + break;
> + }
> +
> + if (folio_test_guestmem(folio))
> + __kvm_gmem_restore_pending_folio(folio);
> + }
> +
> r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
> +
> + if (folio) {
> + folio_unlock(folio);
> + folio_put(folio);
> + }
> +
> if (r)
> break;
> }
> @@ -473,6 +519,105 @@ static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> return r;
> }
>
> +/*
> + * Registers a callback to __folio_put(), so that gmem knows that the host does
> + * not have any references to the folio. It does that by setting the folio type
> + * to guestmem.
> + *
> + * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
> + * has references, and the callback has been registered.
Note this comment.
> + *
> + * Must be called with the following locks held:
> + * - filemap (inode->i_mapping) invalidate_lock
> + * - folio lock
> + */
> +static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
> +{
> + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> + int refcount;
> +
> + rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
> + WARN_ON_ONCE(!folio_test_locked(folio));
> +
> + if (folio_mapped(folio) || folio_test_guestmem(folio))
> + return -EAGAIN;
But here we return -EAGAIN and no callback was registered?
> +
> + /* Register a callback first. */
> + __folio_set_guestmem(folio);
> +
> + /*
> + * Check for references after setting the type to guestmem, to guard
> + * against potential races with the refcount being decremented later.
> + *
> + * At least one reference is expected because the folio is locked.
> + */
> +
> + refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
> + if (refcount == 1) {
> + int r;
> +
> + /* refcount isn't elevated, it's now faultable by the guest. */
Again this seems racy, somebody could have just speculatively increased it.
Maybe we need to freeze here as well?
> + r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
> + if (!r)
> + __kvm_gmem_restore_pending_folio(folio);
> +
> + return r;
> + }
> +
> + return -EAGAIN;
> +}
> +
> +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
> +{
> + unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
> + struct inode *inode = file_inode(slot->gmem.file);
> + struct folio *folio;
> + int r;
> +
> + filemap_invalidate_lock(inode->i_mapping);
> +
> + folio = filemap_lock_folio(inode->i_mapping, pgoff);
> + if (WARN_ON_ONCE(IS_ERR(folio))) {
> + r = PTR_ERR(folio);
> + goto out;
> + }
> +
> + r = __gmem_register_callback(folio, inode, pgoff);
> +
> + folio_unlock(folio);
> + folio_put(folio);
> +out:
> + filemap_invalidate_unlock(inode->i_mapping);
> +
> + return r;
> +}
> +
> +/*
> + * Callback function for __folio_put(), i.e., called when all references by the
> + * host to the folio have been dropped. This allows gmem to transition the state
> + * of the folio to mappable by the guest, and allows the hypervisor to continue
> + * transitioning its state to private, since the host cannot attempt to access
> + * it anymore.
> + */
> +void kvm_gmem_handle_folio_put(struct folio *folio)
> +{
> + struct xarray *mappable_offsets;
> + struct inode *inode;
> + pgoff_t index;
> + void *xval;
> +
> + inode = folio->mapping->host;
> + index = folio->index;
> + mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> + xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> +
> + filemap_invalidate_lock(inode->i_mapping);
> + __kvm_gmem_restore_pending_folio(folio);
> + WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
> + filemap_invalidate_unlock(inode->i_mapping);
> +}
> +
> static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
> {
> struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-20 11:37 ` Vlastimil Babka
@ 2025-01-20 12:14 ` Fuad Tabba
2025-01-22 22:24 ` Ackerley Tng
0 siblings, 1 reply; 60+ messages in thread
From: Fuad Tabba @ 2025-01-20 12:14 UTC (permalink / raw)
To: Vlastimil Babka
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Hi Vlastimil,
On Mon, 20 Jan 2025 at 11:37, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/17/25 17:29, Fuad Tabba wrote:
> > Before transitioning a guest_memfd folio to unshared, thereby
> > disallowing access by the host and allowing the hypervisor to
> > transition its view of the guest page as private, we need to be
> > sure that the host doesn't have any references to the folio.
> >
> > This patch introduces a new type for guest_memfd folios, and uses
> > that to register a callback that informs the guest_memfd
> > subsystem when the last reference is dropped, therefore knowing
> > that the host doesn't have any remaining references.
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> > The function kvm_slot_gmem_register_callback() isn't used in this
> > series. It will be used later in code that performs unsharing of
> > memory. I have tested it with pKVM, based on downstream code [*].
> > It's included in this RFC since it demonstrates the plan to
> > handle unsharing of private folios.
> >
> > [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v5-pkvm
>
> <snip>
>
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -387,6 +387,28 @@ enum folio_mappability {
> > KVM_GMEM_NONE_MAPPABLE = 0b11, /* Not mappable, transient state. */
> > };
> >
> > +/*
> > + * Unregisters the __folio_put() callback from the folio.
> > + *
> > + * Restores a folio's refcount after all pending references have been released,
> > + * and removes the folio type, thereby removing the callback. Now the folio can
> > + * be freed normally once all actual references have been dropped.
> > + *
> > + * Must be called with the filemap (inode->i_mapping) invalidate_lock held.
> > + * Must also have exclusive access to the folio: folio must be either locked, or
> > + * gmem holds the only reference.
> > + */
> > +static void __kvm_gmem_restore_pending_folio(struct folio *folio)
> > +{
> > + if (WARN_ON_ONCE(folio_mapped(folio) || !folio_test_guestmem(folio)))
> > + return;
> > +
> > + WARN_ON_ONCE(!folio_test_locked(folio) && folio_ref_count(folio) > 1);
>
> Similar to Kirill's objection on the other patch, I think there might be a
> speculative refcount increase (i.e. from a pfn scanner) as long as we have
> refcount over 1. Probably not a problem here if we want to restore refcount
> anyway? But the warning would be spurious.
>
> > +
> > + __folio_clear_guestmem(folio);
> > + folio_ref_add(folio, folio_nr_pages(folio));
> > +}
> > +
> > /*
> > * Marks the range [start, end) as mappable by both the host and the guest.
> > * Usually called when guest shares memory with the host.
> > @@ -400,7 +422,31 @@ static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> >
> > filemap_invalidate_lock(inode->i_mapping);
> > for (i = start; i < end; i++) {
> > + struct folio *folio = NULL;
> > +
> > + /*
> > + * If the folio is NONE_MAPPABLE, it indicates that it is
> > + * transitioning to private (GUEST_MAPPABLE). Transition it to
> > + * shared (ALL_MAPPABLE) immediately, and remove the callback.
> > + */
> > + if (xa_to_value(xa_load(mappable_offsets, i)) == KVM_GMEM_NONE_MAPPABLE) {
> > + folio = filemap_lock_folio(inode->i_mapping, i);
> > + if (WARN_ON_ONCE(IS_ERR(folio))) {
> > + r = PTR_ERR(folio);
> > + break;
> > + }
> > +
> > + if (folio_test_guestmem(folio))
> > + __kvm_gmem_restore_pending_folio(folio);
> > + }
> > +
> > r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
> > +
> > + if (folio) {
> > + folio_unlock(folio);
> > + folio_put(folio);
> > + }
> > +
> > if (r)
> > break;
> > }
> > @@ -473,6 +519,105 @@ static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> > return r;
> > }
> >
> > +/*
> > + * Registers a callback to __folio_put(), so that gmem knows that the host does
> > + * not have any references to the folio. It does that by setting the folio type
> > + * to guestmem.
> > + *
> > + * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
> > + * has references, and the callback has been registered.
>
> Note this comment.
>
> > + *
> > + * Must be called with the following locks held:
> > + * - filemap (inode->i_mapping) invalidate_lock
> > + * - folio lock
> > + */
> > +static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
> > +{
> > + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> > + int refcount;
> > +
> > + rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
> > + WARN_ON_ONCE(!folio_test_locked(folio));
> > +
> > + if (folio_mapped(folio) || folio_test_guestmem(folio))
> > + return -EAGAIN;
>
> But here we return -EAGAIN and no callback was registered?
This is intentional. If the folio is still mapped (i.e., its mapcount
is elevated), then we cannot register the callback yet, so the
host/vmm needs to unmap first, then try again. That said, I see the
problem with the comment above, and I will clarify this.
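For instance, the clarified comment could read something like (a sketch
only, not the final wording):

/*
 * Returns 0 if the host has no references and the folio is immediately
 * faultable by the guest, or -EAGAIN if either the folio is still mapped
 * (no callback registered; the caller must unmap and retry) or the
 * refcount is elevated (the callback has been registered and will fire on
 * the final folio_put()).
 */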
> > +
> > + /* Register a callback first. */
> > + __folio_set_guestmem(folio);
> > +
> > + /*
> > + * Check for references after setting the type to guestmem, to guard
> > + * against potential races with the refcount being decremented later.
> > + *
> > + * At least one reference is expected because the folio is locked.
> > + */
> > +
> > + refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
> > + if (refcount == 1) {
> > + int r;
> > +
> > + /* refcount isn't elevated, it's now faultable by the guest. */
>
> Again this seems racy, somebody could have just speculatively increased it.
> Maybe we need to freeze here as well?
A speculative increase here is OK, I think (famous last words). The
callback was registered before the check; therefore, such an increase
would trigger the callback once that reference is dropped.
Thanks,
/fuad
> > + r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
> > + if (!r)
> > + __kvm_gmem_restore_pending_folio(folio);
> > +
> > + return r;
> > + }
> > +
> > + return -EAGAIN;
> > +}
> > +
> > +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
> > +{
> > + unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
> > + struct inode *inode = file_inode(slot->gmem.file);
> > + struct folio *folio;
> > + int r;
> > +
> > + filemap_invalidate_lock(inode->i_mapping);
> > +
> > + folio = filemap_lock_folio(inode->i_mapping, pgoff);
> > + if (WARN_ON_ONCE(IS_ERR(folio))) {
> > + r = PTR_ERR(folio);
> > + goto out;
> > + }
> > +
> > + r = __gmem_register_callback(folio, inode, pgoff);
> > +
> > + folio_unlock(folio);
> > + folio_put(folio);
> > +out:
> > + filemap_invalidate_unlock(inode->i_mapping);
> > +
> > + return r;
> > +}
> > +
> > +/*
> > + * Callback function for __folio_put(), i.e., called when all references by the
> > + * host to the folio have been dropped. This allows gmem to transition the state
> > + * of the folio to mappable by the guest, and allows the hypervisor to continue
> > + * transitioning its state to private, since the host cannot attempt to access
> > + * it anymore.
> > + */
> > +void kvm_gmem_handle_folio_put(struct folio *folio)
> > +{
> > + struct xarray *mappable_offsets;
> > + struct inode *inode;
> > + pgoff_t index;
> > + void *xval;
> > +
> > + inode = folio->mapping->host;
> > + index = folio->index;
> > + mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > + xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> > +
> > + filemap_invalidate_lock(inode->i_mapping);
> > + __kvm_gmem_restore_pending_folio(folio);
> > + WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
> > + filemap_invalidate_unlock(inode->i_mapping);
> > +}
> > +
> > static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
> > {
> > struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-17 16:29 ` [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages Fuad Tabba
2025-01-20 11:37 ` Vlastimil Babka
@ 2025-01-22 22:16 ` Ackerley Tng
2025-01-23 9:50 ` Fuad Tabba
2025-02-05 0:42 ` Vishal Annapurve
` (2 subsequent siblings)
4 siblings, 1 reply; 60+ messages in thread
From: Ackerley Tng @ 2025-01-22 22:16 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Fuad Tabba <tabba@google.com> writes:
Hey Fuad, I'm still working on verifying all this, but here is one issue
for now. I think it can be fixed by checking whether folio->mapping is
NULL. If it is NULL, then the folio has been disassociated from the
inode, and during the dissociation (removal from the filemap), the
mappability can also either
1. Be unset so that the default mappability can be set up based on
GUEST_MEMFD_FLAG_INIT_MAPPABLE, or
2. Be directly restored based on GUEST_MEMFD_FLAG_INIT_MAPPABLE
> <snip>
>
> +
> +/*
> + * Callback function for __folio_put(), i.e., called when all references by the
> + * host to the folio have been dropped. This allows gmem to transition the state
> + * of the folio to mappable by the guest, and allows the hypervisor to continue
> + * transitioning its state to private, since the host cannot attempt to access
> + * it anymore.
> + */
> +void kvm_gmem_handle_folio_put(struct folio *folio)
> +{
> + struct xarray *mappable_offsets;
> + struct inode *inode;
> + pgoff_t index;
> + void *xval;
> +
> + inode = folio->mapping->host;
IIUC this will be a NULL pointer dereference if the folio had been
removed from the filemap, either through truncation or if the
guest_memfd file got closed.
> + index = folio->index;
And if removed from the filemap folio->index is probably invalid.
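A rough sketch of the kind of guard being suggested here (hypothetical; how
the mappability default gets restored at removal time is a separate
question):

void kvm_gmem_handle_folio_put(struct folio *folio)
{
        struct xarray *mappable_offsets;
        struct inode *inode;
        pgoff_t index;
        void *xval;

        /*
         * If the folio was truncated or the guest_memfd file was closed, the
         * folio has already been removed from the filemap: folio->mapping is
         * NULL, folio->index is stale, and the mappability state is assumed
         * to have been handled at removal time.
         */
        if (!folio->mapping)
                return;

        inode = folio->mapping->host;
        index = folio->index;
        mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
        xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);

        filemap_invalidate_lock(inode->i_mapping);
        __kvm_gmem_restore_pending_folio(folio);
        WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
        filemap_invalidate_unlock(inode->i_mapping);
}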
> + mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> + xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> +
> + filemap_invalidate_lock(inode->i_mapping);
> + __kvm_gmem_restore_pending_folio(folio);
> + WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
> + filemap_invalidate_unlock(inode->i_mapping);
> +}
> +
> static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
> {
> struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-20 12:14 ` Fuad Tabba
@ 2025-01-22 22:24 ` Ackerley Tng
2025-01-23 11:00 ` Fuad Tabba
2025-01-30 14:23 ` Fuad Tabba
0 siblings, 2 replies; 60+ messages in thread
From: Ackerley Tng @ 2025-01-22 22:24 UTC (permalink / raw)
To: Fuad Tabba
Cc: vbabka, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe,
anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vannapurve, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Fuad Tabba <tabba@google.com> writes:
>> > <snip>
>> >
>> > +/*
>> > + * Registers a callback to __folio_put(), so that gmem knows that the host does
>> > + * not have any references to the folio. It does that by setting the folio type
>> > + * to guestmem.
>> > + *
>> > + * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
>> > + * has references, and the callback has been registered.
>>
>> Note this comment.
>>
>> > + *
>> > + * Must be called with the following locks held:
>> > + * - filemap (inode->i_mapping) invalidate_lock
>> > + * - folio lock
>> > + */
>> > +static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
>> > +{
>> > + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>> > + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
>> > + int refcount;
>> > +
>> > + rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
>> > + WARN_ON_ONCE(!folio_test_locked(folio));
>> > +
>> > + if (folio_mapped(folio) || folio_test_guestmem(folio))
>> > + return -EAGAIN;
>>
>> But here we return -EAGAIN and no callback was registered?
>
> This is intentional. If the folio is still mapped (i.e., its mapcount
> is elevated), then we cannot register the callback yet, so the
> host/vmm needs to unmap first, then try again. That said, I see the
> problem with the comment above, and I will clarify this.
>
>> > +
>> > + /* Register a callback first. */
>> > + __folio_set_guestmem(folio);
>> > +
>> > + /*
>> > + * Check for references after setting the type to guestmem, to guard
>> > + * against potential races with the refcount being decremented later.
>> > + *
>> > + * At least one reference is expected because the folio is locked.
>> > + */
>> > +
>> > + refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
>> > + if (refcount == 1) {
>> > + int r;
>> > +
>> > + /* refcount isn't elevated, it's now faultable by the guest. */
>>
>> Again this seems racy, somebody could have just speculatively increased it.
>> Maybe we need to freeze here as well?
>
> A speculative increase here is ok I think (famous last words). The
> callback was registered before the check, therefore, such an increase
> would trigger the callback.
>
> Thanks,
> /fuad
>
>
I checked the callback (kvm_gmem_handle_folio_put()) and agree with you
that the mappability reset to KVM_GMEM_GUEST_MAPPABLE is handled
correctly (since kvm_gmem_handle_folio_put() doesn't assume anything
about the mappability state at callback-time).
However, what if the new speculative reference writes to the page and
guest goes on to fault/use the page?
>> > + r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
>> > + if (!r)
>> > + __kvm_gmem_restore_pending_folio(folio);
>> > +
>> > + return r;
>> > + }
>> > +
>> > + return -EAGAIN;
>> > +}
>> > +
>> > +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
>> > +{
>> > + unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
>> > + struct inode *inode = file_inode(slot->gmem.file);
>> > + struct folio *folio;
>> > + int r;
>> > +
>> > + filemap_invalidate_lock(inode->i_mapping);
>> > +
>> > + folio = filemap_lock_folio(inode->i_mapping, pgoff);
>> > + if (WARN_ON_ONCE(IS_ERR(folio))) {
>> > + r = PTR_ERR(folio);
>> > + goto out;
>> > + }
>> > +
>> > + r = __gmem_register_callback(folio, inode, pgoff);
>> > +
>> > + folio_unlock(folio);
>> > + folio_put(folio);
>> > +out:
>> > + filemap_invalidate_unlock(inode->i_mapping);
>> > +
>> > + return r;
>> > +}
>> > +
>> > +/*
>> > + * Callback function for __folio_put(), i.e., called when all references by the
>> > + * host to the folio have been dropped. This allows gmem to transition the state
>> > + * of the folio to mappable by the guest, and allows the hypervisor to continue
>> > + * transitioning its state to private, since the host cannot attempt to access
>> > + * it anymore.
>> > + */
>> > +void kvm_gmem_handle_folio_put(struct folio *folio)
>> > +{
>> > + struct xarray *mappable_offsets;
>> > + struct inode *inode;
>> > + pgoff_t index;
>> > + void *xval;
>> > +
>> > + inode = folio->mapping->host;
>> > + index = folio->index;
>> > + mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>> > + xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
>> > +
>> > + filemap_invalidate_lock(inode->i_mapping);
>> > + __kvm_gmem_restore_pending_folio(folio);
>> > + WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
>> > + filemap_invalidate_unlock(inode->i_mapping);
>> > +}
>> > +
>> > static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
>> > {
>> > struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>>
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-22 22:16 ` Ackerley Tng
@ 2025-01-23 9:50 ` Fuad Tabba
2025-02-05 1:28 ` Vishal Annapurve
0 siblings, 1 reply; 60+ messages in thread
From: Fuad Tabba @ 2025-01-23 9:50 UTC (permalink / raw)
To: Ackerley Tng
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On Wed, 22 Jan 2025 at 22:16, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> Hey Fuad, I'm still working on verifying all this but for now this is
> one issue. I think this can be fixed by checking if the folio->mapping
> is NULL. If it's NULL, then the folio has been disassociated from the
> inode, and during the dissociation (removal from filemap), the
> mappability can also either
>
> 1. Be unset so that the default mappability can be set up based on
> GUEST_MEMFD_FLAG_INIT_MAPPABLE, or
> 2. Be directly restored based on GUEST_MEMFD_FLAG_INIT_MAPPABLE
Thanks for pointing this out. I hadn't considered this case. I'll fix
in the respin.
> > <snip>
> >
> > +
> > +/*
> > + * Callback function for __folio_put(), i.e., called when all references by the
> > + * host to the folio have been dropped. This allows gmem to transition the state
> > + * of the folio to mappable by the guest, and allows the hypervisor to continue
> > + * transitioning its state to private, since the host cannot attempt to access
> > + * it anymore.
> > + */
> > +void kvm_gmem_handle_folio_put(struct folio *folio)
> > +{
> > + struct xarray *mappable_offsets;
> > + struct inode *inode;
> > + pgoff_t index;
> > + void *xval;
> > +
> > + inode = folio->mapping->host;
>
> IIUC this will be a NULL pointer dereference if the folio had been
> removed from the filemap, either through truncation or if the
> guest_memfd file got closed.
Ack.
> > + index = folio->index;
>
> And if removed from the filemap folio->index is probably invalid.
Ack and thanks again,
/fuad
> > + mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > + xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> > +
> > + filemap_invalidate_lock(inode->i_mapping);
> > + __kvm_gmem_restore_pending_folio(folio);
> > + WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
> > + filemap_invalidate_unlock(inode->i_mapping);
> > +}
> > +
> > static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
> > {
> > struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-22 22:24 ` Ackerley Tng
@ 2025-01-23 11:00 ` Fuad Tabba
2025-02-06 3:18 ` Ackerley Tng
2025-02-06 3:28 ` Ackerley Tng
2025-01-30 14:23 ` Fuad Tabba
1 sibling, 2 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-23 11:00 UTC (permalink / raw)
To: Ackerley Tng
Cc: vbabka, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe,
anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vannapurve, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On Wed, 22 Jan 2025 at 22:24, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> >> > <snip>
> >> >
> >> > +/*
> >> > + * Registers a callback to __folio_put(), so that gmem knows that the host does
> >> > + * not have any references to the folio. It does that by setting the folio type
> >> > + * to guestmem.
> >> > + *
> >> > + * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
> >> > + * has references, and the callback has been registered.
> >>
> >> Note this comment.
> >>
> >> > + *
> >> > + * Must be called with the following locks held:
> >> > + * - filemap (inode->i_mapping) invalidate_lock
> >> > + * - folio lock
> >> > + */
> >> > +static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
> >> > +{
> >> > + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >> > + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> >> > + int refcount;
> >> > +
> >> > + rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
> >> > + WARN_ON_ONCE(!folio_test_locked(folio));
> >> > +
> >> > + if (folio_mapped(folio) || folio_test_guestmem(folio))
> >> > + return -EAGAIN;
> >>
> >> But here we return -EAGAIN and no callback was registered?
> >
> > This is intentional. If the folio is still mapped (i.e., its mapcount
> > is elevated), then we cannot register the callback yet, so the
> > host/vmm needs to unmap first, then try again. That said, I see the
> > problem with the comment above, and I will clarify this.
> >
> >> > +
> >> > + /* Register a callback first. */
> >> > + __folio_set_guestmem(folio);
> >> > +
> >> > + /*
> >> > + * Check for references after setting the type to guestmem, to guard
> >> > + * against potential races with the refcount being decremented later.
> >> > + *
> >> > + * At least one reference is expected because the folio is locked.
> >> > + */
> >> > +
> >> > + refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
> >> > + if (refcount == 1) {
> >> > + int r;
> >> > +
> >> > + /* refcount isn't elevated, it's now faultable by the guest. */
> >>
> >> Again this seems racy, somebody could have just speculatively increased it.
> >> Maybe we need to freeze here as well?
> >
> > A speculative increase here is ok I think (famous last words). The
> > callback was registered before the check, therefore, such an increase
> > would trigger the callback.
> >
> > Thanks,
> > /fuad
> >
> >
>
> I checked the callback (kvm_gmem_handle_folio_put()) and agree with you
> that the mappability reset to KVM_GMEM_GUEST_MAPPABLE is handled
> correctly (since kvm_gmem_handle_folio_put() doesn't assume anything
> about the mappability state at callback-time).
>
> However, what if the new speculative reference writes to the page and
> guest goes on to fault/use the page?
I don't think that's a problem. At this point the page is in a
transient state, but still shared from the guest's point of view.
Moreover, no one can fault-in the page at the host at this point (we
check in kvm_gmem_fault()).
Let's have a look at the code:
+static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
+{
+ struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+ void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
+ int refcount;
At this point the guest still perceives the page as shared, and the state
of the page is KVM_GMEM_NONE_MAPPABLE (a transient state). This means that
kvm_gmem_fault() no longer faults the page in at the host.
+ rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
+ WARN_ON_ONCE(!folio_test_locked(folio));
+
+ if (folio_mapped(folio) || folio_test_guestmem(folio))
+ return -EAGAIN;
+
+ /* Register a callback first. */
+ __folio_set_guestmem(folio);
This (in addition to the NONE_MAPPABLE state) also ensures that
kvm_gmem_fault() no longer faults the page in at the host.
+ /*
+ * Check for references after setting the type to guestmem, to guard
+ * against potential races with the refcount being decremented later.
+ *
+ * At least one reference is expected because the folio is locked.
+ */
+
+ refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
+ if (refcount == 1) {
+ int r;
At this point we know that guest_memfd has the only real reference.
Speculative references AFAIK do not access the page itself.
+
+ /* refcount isn't elevated, it's now faultable by the guest. */
+ r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
Now it's safe, so let the guest know that it can map the page.
+ if (!r)
+ __kvm_gmem_restore_pending_folio(folio);
+
+ return r;
+ }
+
+ return -EAGAIN;
+}
Does this make sense, or did I miss something?
Thanks!
/fuad
> >> > + r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
> >> > + if (!r)
> >> > + __kvm_gmem_restore_pending_folio(folio);
> >> > +
> >> > + return r;
> >> > + }
> >> > +
> >> > + return -EAGAIN;
> >> > +}
> >> > +
> >> > +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
> >> > +{
> >> > + unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
> >> > + struct inode *inode = file_inode(slot->gmem.file);
> >> > + struct folio *folio;
> >> > + int r;
> >> > +
> >> > + filemap_invalidate_lock(inode->i_mapping);
> >> > +
> >> > + folio = filemap_lock_folio(inode->i_mapping, pgoff);
> >> > + if (WARN_ON_ONCE(IS_ERR(folio))) {
> >> > + r = PTR_ERR(folio);
> >> > + goto out;
> >> > + }
> >> > +
> >> > + r = __gmem_register_callback(folio, inode, pgoff);
> >> > +
> >> > + folio_unlock(folio);
> >> > + folio_put(folio);
> >> > +out:
> >> > + filemap_invalidate_unlock(inode->i_mapping);
> >> > +
> >> > + return r;
> >> > +}
> >> > +
> >> > +/*
> >> > + * Callback function for __folio_put(), i.e., called when all references by the
> >> > + * host to the folio have been dropped. This allows gmem to transition the state
> >> > + * of the folio to mappable by the guest, and allows the hypervisor to continue
> >> > + * transitioning its state to private, since the host cannot attempt to access
> >> > + * it anymore.
> >> > + */
> >> > +void kvm_gmem_handle_folio_put(struct folio *folio)
> >> > +{
> >> > + struct xarray *mappable_offsets;
> >> > + struct inode *inode;
> >> > + pgoff_t index;
> >> > + void *xval;
> >> > +
> >> > + inode = folio->mapping->host;
> >> > + index = folio->index;
> >> > + mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >> > + xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> >> > +
> >> > + filemap_invalidate_lock(inode->i_mapping);
> >> > + __kvm_gmem_restore_pending_folio(folio);
> >> > + WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
> >> > + filemap_invalidate_unlock(inode->i_mapping);
> >> > +}
> >> > +
> >> > static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
> >> > {
> >> > struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >>
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 02/15] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
2025-01-17 16:29 ` [RFC PATCH v5 02/15] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
@ 2025-01-24 4:25 ` Gavin Shan
2025-01-29 10:12 ` Fuad Tabba
2025-02-11 15:58 ` Ackerley Tng
0 siblings, 2 replies; 60+ messages in thread
From: Gavin Shan @ 2025-01-24 4:25 UTC (permalink / raw)
To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Hi Fuad,
On 1/18/25 2:29 AM, Fuad Tabba wrote:
> From: Ackerley Tng <ackerleytng@google.com>
>
> Using guest mem inodes allows us to store metadata for the backing
> memory on the inode. Metadata will be added in a later patch to
> support HugeTLB pages.
>
> Metadata about backing memory should not be stored on the file, since
> the file represents a guest_memfd's binding with a struct kvm, and
> metadata about backing memory is not unique to a specific binding and
> struct kvm.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
> include/uapi/linux/magic.h | 1 +
> virt/kvm/guest_memfd.c | 119 ++++++++++++++++++++++++++++++-------
> 2 files changed, 100 insertions(+), 20 deletions(-)
>
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index bb575f3ab45e..169dba2a6920 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -103,5 +103,6 @@
> #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
> #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
> #define PID_FS_MAGIC 0x50494446 /* "PIDF" */
> +#define GUEST_MEMORY_MAGIC 0x474d454d /* "GMEM" */
>
> #endif /* __LINUX_MAGIC_H__ */
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 47a9f68f7b24..198554b1f0b5 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1,12 +1,17 @@
> // SPDX-License-Identifier: GPL-2.0
> +#include <linux/fs.h>
> +#include <linux/mount.h>
This can be dropped since "linux/mount.h" is already included by "linux/fs.h".
> #include <linux/backing-dev.h>
> #include <linux/falloc.h>
> #include <linux/kvm_host.h>
> +#include <linux/pseudo_fs.h>
> #include <linux/pagemap.h>
> #include <linux/anon_inodes.h>
>
> #include "kvm_mm.h"
>
> +static struct vfsmount *kvm_gmem_mnt;
> +
> struct kvm_gmem {
> struct kvm *kvm;
> struct xarray bindings;
> @@ -307,6 +312,38 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
> return gfn - slot->base_gfn + slot->gmem.pgoff;
> }
>
> +static const struct super_operations kvm_gmem_super_operations = {
> + .statfs = simple_statfs,
> +};
> +
> +static int kvm_gmem_init_fs_context(struct fs_context *fc)
> +{
> + struct pseudo_fs_context *ctx;
> +
> + if (!init_pseudo(fc, GUEST_MEMORY_MAGIC))
> + return -ENOMEM;
> +
> + ctx = fc->fs_private;
> + ctx->ops = &kvm_gmem_super_operations;
> +
> + return 0;
> +}
> +
> +static struct file_system_type kvm_gmem_fs = {
> + .name = "kvm_guest_memory",
> + .init_fs_context = kvm_gmem_init_fs_context,
> + .kill_sb = kill_anon_super,
> +};
> +
> +static void kvm_gmem_init_mount(void)
> +{
> + kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
> + BUG_ON(IS_ERR(kvm_gmem_mnt));
> +
> + /* For giggles. Userspace can never map this anyways. */
> + kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
> +}
> +
> static struct file_operations kvm_gmem_fops = {
> .open = generic_file_open,
> .release = kvm_gmem_release,
> @@ -316,6 +353,8 @@ static struct file_operations kvm_gmem_fops = {
> void kvm_gmem_init(struct module *module)
> {
> kvm_gmem_fops.owner = module;
> +
> + kvm_gmem_init_mount();
> }
>
> static int kvm_gmem_migrate_folio(struct address_space *mapping,
> @@ -397,11 +436,67 @@ static const struct inode_operations kvm_gmem_iops = {
> .setattr = kvm_gmem_setattr,
> };
>
> +static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> + loff_t size, u64 flags)
> +{
> + const struct qstr qname = QSTR_INIT(name, strlen(name));
> + struct inode *inode;
> + int err;
> +
> + inode = alloc_anon_inode(kvm_gmem_mnt->mnt_sb);
> + if (IS_ERR(inode))
> + return inode;
> +
> + err = security_inode_init_security_anon(inode, &qname, NULL);
> + if (err) {
> + iput(inode);
> + return ERR_PTR(err);
> + }
> +
> + inode->i_private = (void *)(unsigned long)flags;
> + inode->i_op = &kvm_gmem_iops;
> + inode->i_mapping->a_ops = &kvm_gmem_aops;
> + inode->i_mode |= S_IFREG;
> + inode->i_size = size;
> + mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> + mapping_set_inaccessible(inode->i_mapping);
> + /* Unmovable mappings are supposed to be marked unevictable as well. */
> + WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> +
> + return inode;
> +}
> +
> +static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
> + u64 flags)
> +{
> + static const char *name = "[kvm-gmem]";
> + struct inode *inode;
> + struct file *file;
> +
> + if (kvm_gmem_fops.owner && !try_module_get(kvm_gmem_fops.owner))
> + return ERR_PTR(-ENOENT);
> +
The check on 'kvm_gmem_fops.owner' can be removed, since try_module_get()
and module_put() handle a NULL parameter gracefully, even when
CONFIG_MODULE_UNLOAD == N.
A module_put(kvm_gmem_fops.owner) is needed in the error paths of this
function; otherwise, the reference count of the owner (module) becomes
imbalanced on any error. (A rough sketch combining these points is at the
end of this mail.)
> + inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
> + if (IS_ERR(inode))
> + return ERR_CAST(inode);
> +
ERR_CAST() may be dropped since there is nothing to be cast or converted?
> + file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
> + &kvm_gmem_fops);
> + if (IS_ERR(file)) {
> + iput(inode);
> + return file;
> + }
> +
> + file->f_mapping = inode->i_mapping;
> + file->f_flags |= O_LARGEFILE;
> + file->private_data = priv;
> +
'file->f_mapping = inode->i_mapping' may be dropped since it's already correctly
set by alloc_file_pseudo().
alloc_file_pseudo
alloc_path_pseudo
alloc_file
alloc_empty_file
file_init_path // Set by this function
> + return file;
> +}
> +
> static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> {
> - const char *anon_name = "[kvm-gmem]";
> struct kvm_gmem *gmem;
> - struct inode *inode;
> struct file *file;
> int fd, err;
>
> @@ -415,32 +510,16 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> goto err_fd;
> }
>
> - file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
> - O_RDWR, NULL);
> + file = kvm_gmem_inode_create_getfile(gmem, size, flags);
> if (IS_ERR(file)) {
> err = PTR_ERR(file);
> goto err_gmem;
> }
>
> - file->f_flags |= O_LARGEFILE;
> -
> - inode = file->f_inode;
> - WARN_ON(file->f_mapping != inode->i_mapping);
> -
> - inode->i_private = (void *)(unsigned long)flags;
> - inode->i_op = &kvm_gmem_iops;
> - inode->i_mapping->a_ops = &kvm_gmem_aops;
> - inode->i_mode |= S_IFREG;
> - inode->i_size = size;
> - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> - mapping_set_inaccessible(inode->i_mapping);
> - /* Unmovable mappings are supposed to be marked unevictable as well. */
> - WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> -
> kvm_get_kvm(kvm);
> gmem->kvm = kvm;
> xa_init(&gmem->bindings);
> - list_add(&gmem->entry, &inode->i_mapping->i_private_list);
> + list_add(&gmem->entry, &file_inode(file)->i_mapping->i_private_list);
>
> fd_install(fd, file);
> return fd;
Thanks,
Gavin
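Putting the comments above together, a rough sketch of how
kvm_gmem_inode_create_getfile() could keep the module refcount balanced (a
sketch only, not the actual respin):

static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
                                                  u64 flags)
{
        static const char *name = "[kvm-gmem]";
        struct inode *inode;
        struct file *file;
        int err;

        /* A NULL owner is handled gracefully by try_module_get()/module_put(). */
        if (!try_module_get(kvm_gmem_fops.owner))
                return ERR_PTR(-ENOENT);

        inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
        if (IS_ERR(inode)) {
                err = PTR_ERR(inode);
                goto err_put_module;
        }

        file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
                                 &kvm_gmem_fops);
        if (IS_ERR(file)) {
                err = PTR_ERR(file);
                iput(inode);
                goto err_put_module;
        }

        /* f_mapping is already set up by alloc_file_pseudo(). */
        file->f_flags |= O_LARGEFILE;
        file->private_data = priv;

        return file;

err_put_module:
        module_put(kvm_gmem_fops.owner);
        return ERR_PTR(err);
}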
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 04/15] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
2025-01-17 16:29 ` [RFC PATCH v5 04/15] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private Fuad Tabba
@ 2025-01-24 5:31 ` Gavin Shan
2025-01-29 10:15 ` Fuad Tabba
0 siblings, 1 reply; 60+ messages in thread
From: Gavin Shan @ 2025-01-24 5:31 UTC (permalink / raw)
To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Hi Fuad,
On 1/18/25 2:29 AM, Fuad Tabba wrote:
> From: Ackerley Tng <ackerleytng@google.com>
>
> Track whether guest_memfd memory can be mapped within the inode,
> since it is a property of the guest_memfd's memory contents.
>
> The guest_memfd PRIVATE memory attribute is not used for two
> reasons. First because it reflects the userspace expectation for
> that memory location, and therefore can be toggled by userspace.
> The second is, although each guest_memfd file has a 1:1 binding
> with a KVM instance, the plan is to allow multiple files per
> inode, e.g. to allow intra-host migration to a new KVM instance,
> without destroying guest_memfd.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
> virt/kvm/guest_memfd.c | 56 ++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 51 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 6453658d2650..0a7b6cf8bd8f 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -18,6 +18,17 @@ struct kvm_gmem {
> struct list_head entry;
> };
>
> +struct kvm_gmem_inode_private {
> +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> + struct xarray mappable_offsets;
> +#endif
> +};
> +
> +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
> +{
> + return inode->i_mapping->i_private_data;
> +}
> +
> /**
> * folio_file_pfn - like folio_file_page, but return a pfn.
> * @folio: The folio which contains this index.
> @@ -312,8 +323,28 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
> return gfn - slot->base_gfn + slot->gmem.pgoff;
> }
>
> +static void kvm_gmem_evict_inode(struct inode *inode)
> +{
> + struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> +
> +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> + /*
> + * .evict_inode can be called before private data is set up if there are
> + * issues during inode creation.
> + */
> + if (private)
> + xa_destroy(&private->mappable_offsets);
> +#endif
> +
> + truncate_inode_pages_final(inode->i_mapping);
> +
> + kfree(private);
> + clear_inode(inode);
> +}
> +
> static const struct super_operations kvm_gmem_super_operations = {
> - .statfs = simple_statfs,
> + .statfs = simple_statfs,
> + .evict_inode = kvm_gmem_evict_inode,
> };
>
As I understand it, ->destroy_inode() may be a more suitable place to release the
xarray. ->evict_inode() usually detaches the inode from existing structures to take
it offline, while ->destroy_inode() is where the associated resources (memory) are
actually released.

Another benefit of ->destroy_inode() is that we no longer have to worry about
truncate_inode_pages_final() and clear_inode().
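
For example (a rough, untested sketch; kvm_gmem_destroy_inode() is just an
illustrative name):

static void kvm_gmem_destroy_inode(struct inode *inode)
{
	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);

#ifdef CONFIG_KVM_GMEM_MAPPABLE
	/* private may still be NULL if inode creation failed early. */
	if (private)
		xa_destroy(&private->mappable_offsets);
#endif
	kfree(private);
}

static const struct super_operations kvm_gmem_super_operations = {
	.statfs		= simple_statfs,
	.destroy_inode	= kvm_gmem_destroy_inode,
	/* keep the default RCU-delayed freeing of the inode itself */
	.free_inode	= free_inode_nonrcu,
};

With that, the custom ->evict_inode() could go away entirely, since the
generic eviction path already calls truncate_inode_pages_final() and
clear_inode() when no ->evict_inode() is provided.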
> static int kvm_gmem_init_fs_context(struct fs_context *fc)
> @@ -440,6 +471,7 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> loff_t size, u64 flags)
> {
> const struct qstr qname = QSTR_INIT(name, strlen(name));
> + struct kvm_gmem_inode_private *private;
> struct inode *inode;
> int err;
>
> @@ -448,10 +480,19 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> return inode;
>
> err = security_inode_init_security_anon(inode, &qname, NULL);
> - if (err) {
> - iput(inode);
> - return ERR_PTR(err);
> - }
> + if (err)
> + goto out;
> +
> + err = -ENOMEM;
> + private = kzalloc(sizeof(*private), GFP_KERNEL);
> + if (!private)
> + goto out;
> +
> +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> + xa_init(&private->mappable_offsets);
> +#endif
> +
> + inode->i_mapping->i_private_data = private;
>
The whole block of code needs to be guarded by CONFIG_KVM_GMEM_MAPPABLE because
kzalloc(sizeof(...)) is translated to kzalloc(0) when CONFIG_KVM_GMEM_MAPPABLE
is disabled, and kzalloc() will always fail. This would lead to an unusable
guest_memfd when CONFIG_KVM_GMEM_MAPPABLE is disabled.
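
i.e. something like (a minimal sketch of the suggestion; the declaration of
'private' would need a matching guard or __maybe_unused):

#ifdef CONFIG_KVM_GMEM_MAPPABLE
	err = -ENOMEM;
	private = kzalloc(sizeof(*private), GFP_KERNEL);
	if (!private)
		goto out;

	xa_init(&private->mappable_offsets);
	inode->i_mapping->i_private_data = private;
#endif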
> inode->i_private = (void *)(unsigned long)flags;
> inode->i_op = &kvm_gmem_iops;
> @@ -464,6 +505,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>
> return inode;
> +
> +out:
> + iput(inode);
> +
> + return ERR_PTR(err);
> }
>
> static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
Thanks,
Gavin
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 02/15] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
2025-01-24 4:25 ` Gavin Shan
@ 2025-01-29 10:12 ` Fuad Tabba
2025-02-11 15:58 ` Ackerley Tng
1 sibling, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-29 10:12 UTC (permalink / raw)
To: Gavin Shan
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
jthoughton
Hi Gavin,
On Fri, 24 Jan 2025 at 04:26, Gavin Shan <gshan@redhat.com> wrote:
>
> Hi Fuad,
>
> On 1/18/25 2:29 AM, Fuad Tabba wrote:
> > From: Ackerley Tng <ackerleytng@google.com>
> >
> > Using guest mem inodes allows us to store metadata for the backing
> > memory on the inode. Metadata will be added in a later patch to
> > support HugeTLB pages.
> >
> > Metadata about backing memory should not be stored on the file, since
> > the file represents a guest_memfd's binding with a struct kvm, and
> > metadata about backing memory is not unique to a specific binding and
> > struct kvm.
> >
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> > include/uapi/linux/magic.h | 1 +
> > virt/kvm/guest_memfd.c | 119 ++++++++++++++++++++++++++++++-------
> > 2 files changed, 100 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index bb575f3ab45e..169dba2a6920 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -103,5 +103,6 @@
> > #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
> > #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
> > #define PID_FS_MAGIC 0x50494446 /* "PIDF" */
> > +#define GUEST_MEMORY_MAGIC 0x474d454d /* "GMEM" */
> >
> > #endif /* __LINUX_MAGIC_H__ */
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 47a9f68f7b24..198554b1f0b5 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -1,12 +1,17 @@
> > // SPDX-License-Identifier: GPL-2.0
> > +#include <linux/fs.h>
> > +#include <linux/mount.h>
>
> This can be dropped since "linux/mount.h" has been included to "linux/fs.h".
>
> > #include <linux/backing-dev.h>
> > #include <linux/falloc.h>
> > #include <linux/kvm_host.h>
> > +#include <linux/pseudo_fs.h>
> > #include <linux/pagemap.h>
> > #include <linux/anon_inodes.h>
> >
> > #include "kvm_mm.h"
> >
> > +static struct vfsmount *kvm_gmem_mnt;
> > +
> > struct kvm_gmem {
> > struct kvm *kvm;
> > struct xarray bindings;
> > @@ -307,6 +312,38 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
> > return gfn - slot->base_gfn + slot->gmem.pgoff;
> > }
> >
> > +static const struct super_operations kvm_gmem_super_operations = {
> > + .statfs = simple_statfs,
> > +};
> > +
> > +static int kvm_gmem_init_fs_context(struct fs_context *fc)
> > +{
> > + struct pseudo_fs_context *ctx;
> > +
> > + if (!init_pseudo(fc, GUEST_MEMORY_MAGIC))
> > + return -ENOMEM;
> > +
> > + ctx = fc->fs_private;
> > + ctx->ops = &kvm_gmem_super_operations;
> > +
> > + return 0;
> > +}
> > +
> > +static struct file_system_type kvm_gmem_fs = {
> > + .name = "kvm_guest_memory",
> > + .init_fs_context = kvm_gmem_init_fs_context,
> > + .kill_sb = kill_anon_super,
> > +};
> > +
> > +static void kvm_gmem_init_mount(void)
> > +{
> > + kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
> > + BUG_ON(IS_ERR(kvm_gmem_mnt));
> > +
> > + /* For giggles. Userspace can never map this anyways. */
> > + kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
> > +}
> > +
> > static struct file_operations kvm_gmem_fops = {
> > .open = generic_file_open,
> > .release = kvm_gmem_release,
> > @@ -316,6 +353,8 @@ static struct file_operations kvm_gmem_fops = {
> > void kvm_gmem_init(struct module *module)
> > {
> > kvm_gmem_fops.owner = module;
> > +
> > + kvm_gmem_init_mount();
> > }
> >
> > static int kvm_gmem_migrate_folio(struct address_space *mapping,
> > @@ -397,11 +436,67 @@ static const struct inode_operations kvm_gmem_iops = {
> > .setattr = kvm_gmem_setattr,
> > };
> >
> > +static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> > + loff_t size, u64 flags)
> > +{
> > + const struct qstr qname = QSTR_INIT(name, strlen(name));
> > + struct inode *inode;
> > + int err;
> > +
> > + inode = alloc_anon_inode(kvm_gmem_mnt->mnt_sb);
> > + if (IS_ERR(inode))
> > + return inode;
> > +
> > + err = security_inode_init_security_anon(inode, &qname, NULL);
> > + if (err) {
> > + iput(inode);
> > + return ERR_PTR(err);
> > + }
> > +
> > + inode->i_private = (void *)(unsigned long)flags;
> > + inode->i_op = &kvm_gmem_iops;
> > + inode->i_mapping->a_ops = &kvm_gmem_aops;
> > + inode->i_mode |= S_IFREG;
> > + inode->i_size = size;
> > + mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> > + mapping_set_inaccessible(inode->i_mapping);
> > + /* Unmovable mappings are supposed to be marked unevictable as well. */
> > + WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> > +
> > + return inode;
> > +}
> > +
> > +static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
> > + u64 flags)
> > +{
> > + static const char *name = "[kvm-gmem]";
> > + struct inode *inode;
> > + struct file *file;
> > +
> > + if (kvm_gmem_fops.owner && !try_module_get(kvm_gmem_fops.owner))
> > + return ERR_PTR(-ENOENT);
> > +
>
> The check of 'kvm_gmem_fops.owner' can be removed, since try_module_get()
> and module_put() handle a NULL parameter gracefully, even when CONFIG_MODULE_UNLOAD == N.
>
> A module_put(kvm_gmem_fops.owner) is needed on the error paths in this
> function. Otherwise, the reference count of the owner (module) becomes
> imbalanced whenever an error occurs.
>
>
> > + inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
> > + if (IS_ERR(inode))
> > + return ERR_CAST(inode);
> > +
>
> ERR_CAST() may be dropped, since there is nothing to be cast or converted?
>
> > + file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
> > + &kvm_gmem_fops);
> > + if (IS_ERR(file)) {
> > + iput(inode);
> > + return file;
> > + }
> > +
> > + file->f_mapping = inode->i_mapping;
> > + file->f_flags |= O_LARGEFILE;
> > + file->private_data = priv;
> > +
>
> 'file->f_mapping = inode->i_mapping' may be dropped since it's already correctly
> set by alloc_file_pseudo().
>
> alloc_file_pseudo
> alloc_path_pseudo
> alloc_file
> alloc_empty_file
> file_init_path // Set by this function
Thanks for the fixes. Will include them when we respin.
Cheers,
/fuad
> > + return file;
> > +}
> > +
> > static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> > {
> > - const char *anon_name = "[kvm-gmem]";
> > struct kvm_gmem *gmem;
> > - struct inode *inode;
> > struct file *file;
> > int fd, err;
> >
> > @@ -415,32 +510,16 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> > goto err_fd;
> > }
> >
> > - file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
> > - O_RDWR, NULL);
> > + file = kvm_gmem_inode_create_getfile(gmem, size, flags);
> > if (IS_ERR(file)) {
> > err = PTR_ERR(file);
> > goto err_gmem;
> > }
> >
> > - file->f_flags |= O_LARGEFILE;
> > -
> > - inode = file->f_inode;
> > - WARN_ON(file->f_mapping != inode->i_mapping);
> > -
> > - inode->i_private = (void *)(unsigned long)flags;
> > - inode->i_op = &kvm_gmem_iops;
> > - inode->i_mapping->a_ops = &kvm_gmem_aops;
> > - inode->i_mode |= S_IFREG;
> > - inode->i_size = size;
> > - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> > - mapping_set_inaccessible(inode->i_mapping);
> > - /* Unmovable mappings are supposed to be marked unevictable as well. */
> > - WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> > -
> > kvm_get_kvm(kvm);
> > gmem->kvm = kvm;
> > xa_init(&gmem->bindings);
> > - list_add(&gmem->entry, &inode->i_mapping->i_private_list);
> > + list_add(&gmem->entry, &file_inode(file)->i_mapping->i_private_list);
> >
> > fd_install(fd, file);
> > return fd;
>
> Thanks,
> Gavin
>
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 04/15] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
2025-01-24 5:31 ` Gavin Shan
@ 2025-01-29 10:15 ` Fuad Tabba
2025-02-26 22:29 ` Ackerley Tng
0 siblings, 1 reply; 60+ messages in thread
From: Fuad Tabba @ 2025-01-29 10:15 UTC (permalink / raw)
To: Gavin Shan
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
jthoughton
Hi Gavin,
On Fri, 24 Jan 2025 at 05:32, Gavin Shan <gshan@redhat.com> wrote:
>
> Hi Fuad,
>
> On 1/18/25 2:29 AM, Fuad Tabba wrote:
> > From: Ackerley Tng <ackerleytng@google.com>
> >
> > Track whether guest_memfd memory can be mapped within the inode,
> > since it is a property of the guest_memfd's memory contents.
> >
> > The guest_memfd PRIVATE memory attribute is not used for two
> > reasons. First because it reflects the userspace expectation for
> > that memory location, and therefore can be toggled by userspace.
> > The second is, although each guest_memfd file has a 1:1 binding
> > with a KVM instance, the plan is to allow multiple files per
> > inode, e.g. to allow intra-host migration to a new KVM instance,
> > without destroying guest_memfd.
> >
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> > Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> > Co-developed-by: Fuad Tabba <tabba@google.com>
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> > virt/kvm/guest_memfd.c | 56 ++++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 51 insertions(+), 5 deletions(-)
> >
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 6453658d2650..0a7b6cf8bd8f 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -18,6 +18,17 @@ struct kvm_gmem {
> > struct list_head entry;
> > };
> >
> > +struct kvm_gmem_inode_private {
> > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> > + struct xarray mappable_offsets;
> > +#endif
> > +};
> > +
> > +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
> > +{
> > + return inode->i_mapping->i_private_data;
> > +}
> > +
> > /**
> > * folio_file_pfn - like folio_file_page, but return a pfn.
> > * @folio: The folio which contains this index.
> > @@ -312,8 +323,28 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
> > return gfn - slot->base_gfn + slot->gmem.pgoff;
> > }
> >
> > +static void kvm_gmem_evict_inode(struct inode *inode)
> > +{
> > + struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> > +
> > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> > + /*
> > + * .evict_inode can be called before private data is set up if there are
> > + * issues during inode creation.
> > + */
> > + if (private)
> > + xa_destroy(&private->mappable_offsets);
> > +#endif
> > +
> > + truncate_inode_pages_final(inode->i_mapping);
> > +
> > + kfree(private);
> > + clear_inode(inode);
> > +}
> > +
> > static const struct super_operations kvm_gmem_super_operations = {
> > - .statfs = simple_statfs,
> > + .statfs = simple_statfs,
> > + .evict_inode = kvm_gmem_evict_inode,
> > };
> >
>
> As I understand it, ->destroy_inode() may be a more suitable place to release the
> xarray. ->evict_inode() usually detaches the inode from existing structures to take
> it offline, while ->destroy_inode() is where the associated resources (memory) are
> actually released.
>
> Another benefit of ->destroy_inode() is that we no longer have to worry about
> truncate_inode_pages_final() and clear_inode().
I see. I'll give this a try.
>
> > static int kvm_gmem_init_fs_context(struct fs_context *fc)
> > @@ -440,6 +471,7 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> > loff_t size, u64 flags)
> > {
> > const struct qstr qname = QSTR_INIT(name, strlen(name));
> > + struct kvm_gmem_inode_private *private;
> > struct inode *inode;
> > int err;
> >
> > @@ -448,10 +480,19 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> > return inode;
> >
> > err = security_inode_init_security_anon(inode, &qname, NULL);
> > - if (err) {
> > - iput(inode);
> > - return ERR_PTR(err);
> > - }
> > + if (err)
> > + goto out;
> > +
> > + err = -ENOMEM;
> > + private = kzalloc(sizeof(*private), GFP_KERNEL);
> > + if (!private)
> > + goto out;
> > +
> > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> > + xa_init(&private->mappable_offsets);
> > +#endif
> > +
> > + inode->i_mapping->i_private_data = private;
> >
>
> The whole block of code needs to be guarded by CONFIG_KVM_GMEM_MAPPABLE because
> kzalloc(sizeof(...)) is translated to kzalloc(0) when CONFIG_KVM_GMEM_MAPPABLE
> is disabled, and kzalloc() will always fail. This would lead to an unusable
> guest_memfd when CONFIG_KVM_GMEM_MAPPABLE is disabled.
Good point, thanks for pointing this out.
Cheers,
/fuad
> > inode->i_private = (void *)(unsigned long)flags;
> > inode->i_op = &kvm_gmem_iops;
> > @@ -464,6 +505,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> > WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> >
> > return inode;
> > +
> > +out:
> > + iput(inode);
> > +
> > + return ERR_PTR(err);
> > }
> >
> > static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
>
> Thanks,
> Gavin
>
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-22 22:24 ` Ackerley Tng
2025-01-23 11:00 ` Fuad Tabba
@ 2025-01-30 14:23 ` Fuad Tabba
1 sibling, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-01-30 14:23 UTC (permalink / raw)
To: Ackerley Tng
Cc: vbabka, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe,
anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vannapurve, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Hi Ackerley,
On Wed, 22 Jan 2025 at 22:24, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> >> > <snip>
> >> >
> >> > +/*
> >> > + * Registers a callback to __folio_put(), so that gmem knows that the host does
> >> > + * not have any references to the folio. It does that by setting the folio type
> >> > + * to guestmem.
> >> > + *
> >> > + * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
> >> > + * has references, and the callback has been registered.
> >>
> >> Note this comment.
> >>
> >> > + *
> >> > + * Must be called with the following locks held:
> >> > + * - filemap (inode->i_mapping) invalidate_lock
> >> > + * - folio lock
> >> > + */
> >> > +static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
> >> > +{
> >> > + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >> > + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> >> > + int refcount;
> >> > +
> >> > + rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
> >> > + WARN_ON_ONCE(!folio_test_locked(folio));
> >> > +
> >> > + if (folio_mapped(folio) || folio_test_guestmem(folio))
> >> > + return -EAGAIN;
> >>
> >> But here we return -EAGAIN and no callback was registered?
> >
> > This is intentional. If the folio is still mapped (i.e., its mapcount
> > is elevated), then we cannot register the callback yet, so the
> > host/vmm needs to unmap first, then try again. That said, I see the
> > problem with the comment above, and I will clarify this.
> >
> >> > +
> >> > + /* Register a callback first. */
> >> > + __folio_set_guestmem(folio);
> >> > +
> >> > + /*
> >> > + * Check for references after setting the type to guestmem, to guard
> >> > + * against potential races with the refcount being decremented later.
> >> > + *
> >> > + * At least one reference is expected because the folio is locked.
> >> > + */
> >> > +
> >> > + refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
> >> > + if (refcount == 1) {
> >> > + int r;
> >> > +
> >> > + /* refcount isn't elevated, it's now faultable by the guest. */
> >>
> >> Again this seems racy, somebody could have just speculatively increased it.
> >> Maybe we need to freeze here as well?
> >
> > A speculative increase here is ok I think (famous last words). The
> > callback was registered before the check, therefore, such an increase
> > would trigger the callback.
> >
> > Thanks,
> > /fuad
> >
> >
>
> I checked the callback (kvm_gmem_handle_folio_put()) and agree with you
> that the mappability reset to KVM_GMEM_GUEST_MAPPABLE is handled
> correctly (since kvm_gmem_handle_folio_put() doesn't assume anything
> about the mappability state at callback-time).
>
> However, what if the new speculative reference writes to the page and
> guest goes on to fault/use the page?
In my last email I explained why I thought the code was fine as it is.
Now that I'm updating the patch series with all the comments, I
realized that even if I were right (which I am starting to doubt),
freezing the refcount makes the code easier to reason about. So I'm
going with ref_freeze here as well when I respin.
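
For reference, the general shape of the folio_ref_freeze() pattern I have in
mind (illustrative only, not the actual respin; the expected count and the
error handling will depend on the surrounding code, and the function name is
made up):

static bool gmem_transition_if_unreferenced(struct folio *folio,
					    int expected_refs)
{
	/*
	 * Only do the transition if the refcount can be frozen at exactly
	 * the references we know about; this closes the window for a
	 * speculative reference between the check and the update.
	 */
	if (!folio_ref_freeze(folio, expected_refs))
		return false;

	/* ... perform the mappability / page_type transition here ... */

	folio_ref_unfreeze(folio, expected_refs);
	return true;
}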
Thanks again,
/fuad
> >> > + r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
> >> > + if (!r)
> >> > + __kvm_gmem_restore_pending_folio(folio);
> >> > +
> >> > + return r;
> >> > + }
> >> > +
> >> > + return -EAGAIN;
> >> > +}
> >> > +
> >> > +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
> >> > +{
> >> > + unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
> >> > + struct inode *inode = file_inode(slot->gmem.file);
> >> > + struct folio *folio;
> >> > + int r;
> >> > +
> >> > + filemap_invalidate_lock(inode->i_mapping);
> >> > +
> >> > + folio = filemap_lock_folio(inode->i_mapping, pgoff);
> >> > + if (WARN_ON_ONCE(IS_ERR(folio))) {
> >> > + r = PTR_ERR(folio);
> >> > + goto out;
> >> > + }
> >> > +
> >> > + r = __gmem_register_callback(folio, inode, pgoff);
> >> > +
> >> > + folio_unlock(folio);
> >> > + folio_put(folio);
> >> > +out:
> >> > + filemap_invalidate_unlock(inode->i_mapping);
> >> > +
> >> > + return r;
> >> > +}
> >> > +
> >> > +/*
> >> > + * Callback function for __folio_put(), i.e., called when all references by the
> >> > + * host to the folio have been dropped. This allows gmem to transition the state
> >> > + * of the folio to mappable by the guest, and allows the hypervisor to continue
> >> > + * transitioning its state to private, since the host cannot attempt to access
> >> > + * it anymore.
> >> > + */
> >> > +void kvm_gmem_handle_folio_put(struct folio *folio)
> >> > +{
> >> > + struct xarray *mappable_offsets;
> >> > + struct inode *inode;
> >> > + pgoff_t index;
> >> > + void *xval;
> >> > +
> >> > + inode = folio->mapping->host;
> >> > + index = folio->index;
> >> > + mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >> > + xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> >> > +
> >> > + filemap_invalidate_lock(inode->i_mapping);
> >> > + __kvm_gmem_restore_pending_folio(folio);
> >> > + WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
> >> > + filemap_invalidate_unlock(inode->i_mapping);
> >> > +}
> >> > +
> >> > static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
> >> > {
> >> > struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >>
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-17 16:29 ` [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages Fuad Tabba
2025-01-20 11:37 ` Vlastimil Babka
2025-01-22 22:16 ` Ackerley Tng
@ 2025-02-05 0:42 ` Vishal Annapurve
2025-02-05 10:06 ` Fuad Tabba
2025-02-05 0:51 ` Vishal Annapurve
2025-02-06 3:37 ` Ackerley Tng
4 siblings, 1 reply; 60+ messages in thread
From: Vishal Annapurve @ 2025-02-05 0:42 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On Fri, Jan 17, 2025 at 8:30 AM Fuad Tabba <tabba@google.com> wrote:
>
> Before transitioning a guest_memfd folio to unshared, thereby
> disallowing access by the host and allowing the hypervisor to
> transition its view of the guest page as private, we need to be
> sure that the host doesn't have any references to the folio.
>
> This patch introduces a new type for guest_memfd folios, and uses
> that to register a callback that informs the guest_memfd
> subsystem when the last reference is dropped, therefore knowing
> that the host doesn't have any remaining references.
>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
> The function kvm_slot_gmem_register_callback() isn't used in this
> series. It will be used later in code that performs unsharing of
> memory. I have tested it with pKVM, based on downstream code [*].
> It's included in this RFC since it demonstrates the plan to
> handle unsharing of private folios.
>
> [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v5-pkvm
Should the invocation of kvm_slot_gmem_register_callback() happen in
the same critical block as setting the guest memfd range mappability
to NONE, otherwise conversion/truncation could race with registration
of callback?
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-17 16:29 ` [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages Fuad Tabba
` (2 preceding siblings ...)
2025-02-05 0:42 ` Vishal Annapurve
@ 2025-02-05 0:51 ` Vishal Annapurve
2025-02-05 10:07 ` Fuad Tabba
2025-02-06 3:37 ` Ackerley Tng
4 siblings, 1 reply; 60+ messages in thread
From: Vishal Annapurve @ 2025-02-05 0:51 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On Fri, Jan 17, 2025 at 8:30 AM Fuad Tabba <tabba@google.com> wrote:
>
<snip>
>
> static const char *page_type_name(unsigned int page_type)
> diff --git a/mm/swap.c b/mm/swap.c
> index 6f01b56bce13..15220eaabc86 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -37,6 +37,7 @@
> #include <linux/page_idle.h>
> #include <linux/local_lock.h>
> #include <linux/buffer_head.h>
> +#include <linux/kvm_host.h>
>
> #include "internal.h"
>
> @@ -103,6 +104,9 @@ static void free_typed_folio(struct folio *folio)
> case PGTY_offline:
> /* Nothing to do, it's offline. */
> return;
> + case PGTY_guestmem:
> + kvm_gmem_handle_folio_put(folio);
> + return;
Unless it's been discussed before, kvm_gmem_handle_folio_put() needs to be
implemented outside KVM code, which could be unloaded at runtime.
Eliott's plan [1] to implement a guest_memfd library can handle this
scenario in future.
[1] https://patches.linaro.org/project/linux-arm-msm/patch/20240829-guest-memfd-lib-v2-1-b9afc1ff3656@quicinc.com/
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-23 9:50 ` Fuad Tabba
@ 2025-02-05 1:28 ` Vishal Annapurve
2025-02-05 4:31 ` Ackerley Tng
0 siblings, 1 reply; 60+ messages in thread
From: Vishal Annapurve @ 2025-02-05 1:28 UTC (permalink / raw)
To: Fuad Tabba
Cc: Ackerley Tng, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai,
mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On Thu, Jan 23, 2025 at 1:51 AM Fuad Tabba <tabba@google.com> wrote:
>
> On Wed, 22 Jan 2025 at 22:16, Ackerley Tng <ackerleytng@google.com> wrote:
> >
> > Fuad Tabba <tabba@google.com> writes:
> >
> > Hey Fuad, I'm still working on verifying all this but for now this is
> > one issue. I think this can be fixed by checking if the folio->mapping
> > is NULL. If it's NULL, then the folio has been disassociated from the
> > inode, and during the dissociation (removal from filemap), the
> > mappability can also either
> >
> > 1. Be unset so that the default mappability can be set up based on
> > GUEST_MEMFD_FLAG_INIT_MAPPABLE, or
> > 2. Be directly restored based on GUEST_MEMFD_FLAG_INIT_MAPPABLE
>
> Thanks for pointing this out. I hadn't considered this case. I'll fix
> in the respin.
>
Can the below scenario cause trouble?
1) Userspace converts a certain range of guest memfd as shared and
grabs some refcounts on shared memory pages through existing kernel
exposed mechanisms.
2) Userspace converts the same range to private which would cause the
corresponding mappability attributes to be *MAPPABILITY_NONE.
3) Userspace truncates the range which will remove the page from pagecache.
4) Userspace does the fallocate again, leading to a new page getting
allocated without freeing the older page which is still refcounted
(step 1).
Effectively this could allow userspace to keep allocating multiple
pages for the same guest_memfd range.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-02-05 1:28 ` Vishal Annapurve
@ 2025-02-05 4:31 ` Ackerley Tng
2025-02-05 5:58 ` Vishal Annapurve
0 siblings, 1 reply; 60+ messages in thread
From: Ackerley Tng @ 2025-02-05 4:31 UTC (permalink / raw)
To: Vishal Annapurve
Cc: tabba, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe,
anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Vishal Annapurve <vannapurve@google.com> writes:
> On Thu, Jan 23, 2025 at 1:51 AM Fuad Tabba <tabba@google.com> wrote:
>>
>> On Wed, 22 Jan 2025 at 22:16, Ackerley Tng <ackerleytng@google.com> wrote:
>> >
>> > Fuad Tabba <tabba@google.com> writes:
>> >
>> > Hey Fuad, I'm still working on verifying all this but for now this is
>> > one issue. I think this can be fixed by checking if the folio->mapping
>> > is NULL. If it's NULL, then the folio has been disassociated from the
>> > inode, and during the dissociation (removal from filemap), the
>> > mappability can also either
>> >
>> > 1. Be unset so that the default mappability can be set up based on
>> > GUEST_MEMFD_FLAG_INIT_MAPPABLE, or
>> > 2. Be directly restored based on GUEST_MEMFD_FLAG_INIT_MAPPABLE
>>
>> Thanks for pointing this out. I hadn't considered this case. I'll fix
>> in the respin.
>>
>
> Can the below scenario cause trouble?
> 1) Userspace converts a certain range of guest memfd as shared and
> grabs some refcounts on shared memory pages through existing kernel
> exposed mechanisms.
> 2) Userspace converts the same range to private which would cause the
> corresponding mappability attributes to be *MAPPABILITY_NONE.
> 3) Userspace truncates the range which will remove the page from pagecache.
> 4) Userspace does the fallocate again, leading to a new page getting
> allocated without freeing the older page which is still refcounted
> (step 1).
>
> Effectively this could allow userspace to keep allocating multiple
> pages for the same guest_memfd range.
I'm still verifying this but for now here's the flow Vishal described in
greater detail:
+ guest_memfd starts without GUEST_MEMFD_FLAG_INIT_MAPPABLE
+ All new pages will start with mappability = GUEST
+ guest uses a page
+ Get new page
+ Add page to filemap
+ guest converts page to shared
+ Mappability is now ALL
+ host uses page
+ host takes transient refcounts on page
+ Refcount on the page is now (a) filemap's refcount (b) vma's refcount
(c) transient refcount
+ guest converts page to private
+ Page is unmapped
+ Refcount on the page is now (a) filemap's refcount (b) transient
refcount
+ Since refcount is elevated, the mappabilities are left as NONE
+ Filemap's refcounts are removed from the page
+ Refcount on the page is now (a) transient refcount
+ host punches hole to deallocate page
+ Since mappability was NONE, restore filemap's refcount
+ Refcount on the page is now (a) transient refcount (b) filemap's
refcount
+ Mappabilities are reset to GUEST for truncated range
+ Folio is removed from filemap
+ Refcount on the page is now (a) transient refcount
+ Callback remains registered so that when the transient refcounts are
dropped, cleanup can happen - this is where merging will happen
with 1G page support
+ host fallocate()s in the same address range
+ will get a new page
Though the host does manage to get a new page while the old one stays
around, I think this is working as intended, since the transient
refcounts are truly holding the old folio around. When the transient
refcounts go away, the old folio will still get cleaned up (with 1G page
support: merged and returned) as expected. The new page will also be
freed at some point later.
If the userspace program decides to keep taking transient refcounts to hold
pages around, then the userspace program is truly leaking memory and it
shouldn't be guest_memfd's bug.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-02-05 4:31 ` Ackerley Tng
@ 2025-02-05 5:58 ` Vishal Annapurve
0 siblings, 0 replies; 60+ messages in thread
From: Vishal Annapurve @ 2025-02-05 5:58 UTC (permalink / raw)
To: Ackerley Tng
Cc: tabba, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe,
anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On Tue, Feb 4, 2025 at 8:31 PM Ackerley Tng <ackerleytng@google.com> wrote:
>
> Vishal Annapurve <vannapurve@google.com> writes:
>
> > On Thu, Jan 23, 2025 at 1:51 AM Fuad Tabba <tabba@google.com> wrote:
> >>
> >> On Wed, 22 Jan 2025 at 22:16, Ackerley Tng <ackerleytng@google.com> wrote:
> >> >
> >> > Fuad Tabba <tabba@google.com> writes:
> >> >
> >> > Hey Fuad, I'm still working on verifying all this but for now this is
> >> > one issue. I think this can be fixed by checking if the folio->mapping
> >> > is NULL. If it's NULL, then the folio has been disassociated from the
> >> > inode, and during the dissociation (removal from filemap), the
> >> > mappability can also either
> >> >
> >> > 1. Be unset so that the default mappability can be set up based on
> >> > GUEST_MEMFD_FLAG_INIT_MAPPABLE, or
> >> > 2. Be directly restored based on GUEST_MEMFD_FLAG_INIT_MAPPABLE
> >>
> >> Thanks for pointing this out. I hadn't considered this case. I'll fix
> >> in the respin.
> >>
> >
> > Can the below scenario cause trouble?
> > 1) Userspace converts a certain range of guest memfd as shared and
> > grabs some refcounts on shared memory pages through existing kernel
> > exposed mechanisms.
> > 2) Userspace converts the same range to private which would cause the
> > corresponding mappability attributes to be *MAPPABILITY_NONE.
> > 3) Userspace truncates the range which will remove the page from pagecache.
> > 4) Userspace does the fallocate again, leading to a new page getting
> > allocated without freeing the older page which is still refcounted
> > (step 1).
> >
> > Effectively this could allow userspace to keep allocating multiple
> > pages for the same guest_memfd range.
>
> I'm still verifying this but for now here's the flow Vishal described in
> greater detail:
>
> + guest_memfd starts without GUEST_MEMFD_FLAG_INIT_MAPPABLE
> + All new pages will start with mappability = GUEST
> + guest uses a page
> + Get new page
> + Add page to filemap
> + guest converts page to shared
> + Mappability is now ALL
> + host uses page
> + host takes transient refcounts on page
> + Refcount on the page is now (a) filemap's refcount (b) vma's refcount
> (c) transient refcount
> + guest converts page to private
> + Page is unmapped
> + Refcount on the page is now (a) filemap's refcount (b) transient
> refcount
> + Since refcount is elevated, the mappabilities are left as NONE
> + Filemap's refcounts are removed from the page
> + Refcount on the page is now (a) transient refcount
> + host punches hole to deallocate page
> + Since mappability was NONE, restore filemap's refcount
> + Refcount on the page is now (a) transient refcount (b) filemap's
> refcount
> + Mappabilities are reset to GUEST for truncated range
> + Folio is removed from filemap
> + Refcount on the page is now (a) transient refcount
> + Callback remains registered so that when the transient refcounts are
> dropped, cleanup can happen - this is where merging will happen
> with 1G page support
> + host fallocate()s in the same address range
> + will get a new page
>
> Though the host does manage to get a new page while the old one stays
> around, I think this is working as intended, since the transient
> refcounts are truly holding the old folio around. When the transient
> refcounts go away, the old folio will still get cleaned up (with 1G page
> support: merged and returned) as expected. The new page will also be
> freed at some point later.
>
> If the userspace program decides to keep taking transient refcounts to hold
> pages around, then the userspace program is truly leaking memory and it
> shouldn't be guest_memfd's bug.
I wouldn't call such references transient. But a similar scenario is
applicable for shmem files so it makes sense to call out this behavior
as WAI.
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-02-05 0:42 ` Vishal Annapurve
@ 2025-02-05 10:06 ` Fuad Tabba
2025-02-05 17:39 ` Vishal Annapurve
0 siblings, 1 reply; 60+ messages in thread
From: Fuad Tabba @ 2025-02-05 10:06 UTC (permalink / raw)
To: Vishal Annapurve
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Hi Vishal,
On Wed, 5 Feb 2025 at 00:42, Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Fri, Jan 17, 2025 at 8:30 AM Fuad Tabba <tabba@google.com> wrote:
> >
> > Before transitioning a guest_memfd folio to unshared, thereby
> > disallowing access by the host and allowing the hypervisor to
> > transition its view of the guest page as private, we need to be
> > sure that the host doesn't have any references to the folio.
> >
> > This patch introduces a new type for guest_memfd folios, and uses
> > that to register a callback that informs the guest_memfd
> > subsystem when the last reference is dropped, therefore knowing
> > that the host doesn't have any remaining references.
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> > The function kvm_slot_gmem_register_callback() isn't used in this
> > series. It will be used later in code that performs unsharing of
> > memory. I have tested it with pKVM, based on downstream code [*].
> > It's included in this RFC since it demonstrates the plan to
> > handle unsharing of private folios.
> >
> > [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v5-pkvm
>
> Should the invocation of kvm_slot_gmem_register_callback() happen in
> the same critical block as setting the guest memfd range mappability
> to NONE, otherwise conversion/truncation could race with registration
> of callback?
I don't think it needs to, at least not as far as potential races are
concerned. First, because kvm_slot_gmem_register_callback() grabs the
mapping's invalidate_lock as well as the folio lock, and
gmem_clear_mappable() grabs the mapping lock and the folio lock if a
folio has been allocated before.
Second, __gmem_register_callback() checks before returning whether all
references have been dropped, and adjusts the mappability/shareability
if needed.
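
Roughly, both paths take the locks in the same order (a sketch based on the
code earlier in the thread, not new logic):

	filemap_invalidate_lock(inode->i_mapping);

	folio = filemap_lock_folio(inode->i_mapping, index);
	if (!IS_ERR(folio)) {
		/* ... inspect refcounts / update mappability ... */
		folio_unlock(folio);
		folio_put(folio);
	}

	filemap_invalidate_unlock(inode->i_mapping);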
Cheers,
/fuad
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-02-05 0:51 ` Vishal Annapurve
@ 2025-02-05 10:07 ` Fuad Tabba
0 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-02-05 10:07 UTC (permalink / raw)
To: Vishal Annapurve
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Hi Vishal,
On Wed, 5 Feb 2025 at 00:51, Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Fri, Jan 17, 2025 at 8:30 AM Fuad Tabba <tabba@google.com> wrote:
> >
> <snip>
> >
> > static const char *page_type_name(unsigned int page_type)
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 6f01b56bce13..15220eaabc86 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -37,6 +37,7 @@
> > #include <linux/page_idle.h>
> > #include <linux/local_lock.h>
> > #include <linux/buffer_head.h>
> > +#include <linux/kvm_host.h>
> >
> > #include "internal.h"
> >
> > @@ -103,6 +104,9 @@ static void free_typed_folio(struct folio *folio)
> > case PGTY_offline:
> > /* Nothing to do, it's offline. */
> > return;
> > + case PGTY_guestmem:
> > + kvm_gmem_handle_folio_put(folio);
> > + return;
>
> Unless it's been discussed before, kvm_gmem_handle_folio_put() needs to be
> implemented outside KVM code, which could be unloaded at runtime.
> Eliott's plan [1] to implement a guest_memfd library can handle this
> scenario in future.
>
> [1] https://patches.linaro.org/project/linux-arm-msm/patch/20240829-guest-memfd-lib-v2-1-b9afc1ff3656@quicinc.com/
Yes, not just that, but there's a lot of KVM code in guest_memfd in general.
Cheers,
/fuad
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-02-05 10:06 ` Fuad Tabba
@ 2025-02-05 17:39 ` Vishal Annapurve
2025-02-05 17:42 ` Vishal Annapurve
0 siblings, 1 reply; 60+ messages in thread
From: Vishal Annapurve @ 2025-02-05 17:39 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On Wed, Feb 5, 2025 at 2:07 AM Fuad Tabba <tabba@google.com> wrote:
>
> Hi Vishal,
>
> On Wed, 5 Feb 2025 at 00:42, Vishal Annapurve <vannapurve@google.com> wrote:
> >
> > On Fri, Jan 17, 2025 at 8:30 AM Fuad Tabba <tabba@google.com> wrote:
> > >
> > > Before transitioning a guest_memfd folio to unshared, thereby
> > > disallowing access by the host and allowing the hypervisor to
> > > transition its view of the guest page as private, we need to be
> > > sure that the host doesn't have any references to the folio.
> > >
> > > This patch introduces a new type for guest_memfd folios, and uses
> > > that to register a callback that informs the guest_memfd
> > > subsystem when the last reference is dropped, therefore knowing
> > > that the host doesn't have any remaining references.
> > >
> > > Signed-off-by: Fuad Tabba <tabba@google.com>
> > > ---
> > > The function kvm_slot_gmem_register_callback() isn't used in this
> > > series. It will be used later in code that performs unsharing of
> > > memory. I have tested it with pKVM, based on downstream code [*].
> > > It's included in this RFC since it demonstrates the plan to
> > > handle unsharing of private folios.
> > >
> > > [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v5-pkvm
> >
> > Should the invocation of kvm_slot_gmem_register_callback() happen in
> > the same critical block as setting the guest memfd range mappability
> > to NONE, otherwise conversion/truncation could race with registration
> > of callback?
>
> I don't think it needs to, at least not as far as potential races are
> concerned. First, because kvm_slot_gmem_register_callback() grabs the
> mapping's invalidate_lock as well as the folio lock, and
> gmem_clear_mappable() grabs the mapping lock and the folio lock if a
> folio has been allocated before.
I was hinting towards such a scenario:

Core1:
  Shared to private conversion -> Results in mappability attributes
  being set to NONE
  ...

Core2:
  Trigger private to shared conversion/truncation for overlapping ranges
  ...

Core1:
  kvm_slot_gmem_register_callback() on the guest_memfd ranges converted
  above (This will end up registering a callback for guest_memfd ranges
  which possibly don't carry *_MAPPABILITY_NONE)
>
> Second, __gmem_register_callback() checks before returning whether all
> references have been dropped, and adjusts the mappability/shareability
> if needed.
>
> Cheers,
> /fuad
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-02-05 17:39 ` Vishal Annapurve
@ 2025-02-05 17:42 ` Vishal Annapurve
2025-02-07 10:46 ` Ackerley Tng
0 siblings, 1 reply; 60+ messages in thread
From: Vishal Annapurve @ 2025-02-05 17:42 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On Wed, Feb 5, 2025 at 9:39 AM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Wed, Feb 5, 2025 at 2:07 AM Fuad Tabba <tabba@google.com> wrote:
> >
> > Hi Vishal,
> >
> > On Wed, 5 Feb 2025 at 00:42, Vishal Annapurve <vannapurve@google.com> wrote:
> > >
> > > On Fri, Jan 17, 2025 at 8:30 AM Fuad Tabba <tabba@google.com> wrote:
> > > >
> > > > Before transitioning a guest_memfd folio to unshared, thereby
> > > > disallowing access by the host and allowing the hypervisor to
> > > > transition its view of the guest page as private, we need to be
> > > > sure that the host doesn't have any references to the folio.
> > > >
> > > > This patch introduces a new type for guest_memfd folios, and uses
> > > > that to register a callback that informs the guest_memfd
> > > > subsystem when the last reference is dropped, therefore knowing
> > > > that the host doesn't have any remaining references.
> > > >
> > > > Signed-off-by: Fuad Tabba <tabba@google.com>
> > > > ---
> > > > The function kvm_slot_gmem_register_callback() isn't used in this
> > > > series. It will be used later in code that performs unsharing of
> > > > memory. I have tested it with pKVM, based on downstream code [*].
> > > > It's included in this RFC since it demonstrates the plan to
> > > > handle unsharing of private folios.
> > > >
> > > > [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v5-pkvm
> > >
> > > Should the invocation of kvm_slot_gmem_register_callback() happen in
> > > the same critical block as setting the guest memfd range mappability
> > > to NONE, otherwise conversion/truncation could race with registration
> > > of callback?
> >
> > I don't think it needs to, at least not as far as potential races are
> > concerned. First, because kvm_slot_gmem_register_callback() grabs the
> > mapping's invalidate_lock as well as the folio lock, and
> > gmem_clear_mappable() grabs the mapping lock and the folio lock if a
> > folio has been allocated before.
>
> I was hinting towards such a scenario:
> Core1
> Shared to private conversion
> -> Results in mappability attributes
> being set to NONE
> ...
> Trigger private to shared conversion/truncation for
> ...
> overlapping ranges
> ...
> kvm_slot_gmem_register_callback() on
> the guest_memfd ranges converted
> above (This will end up registering callback
> for guest_memfd ranges which possibly don't
> carry *_MAPPABILITY_NONE)
>
Sorry for the format mess above.
I was hinting towards such a scenario:
Core1-
Shared to private conversion -> Results in mappability attributes
being set to NONE
...
Core2
Trigger private to shared conversion/truncation for overlapping ranges
...
Core1
kvm_slot_gmem_register_callback() on the guest_memfd ranges converted
above (This will end up registering callback for guest_memfd ranges
which possibly don't carry *_MAPPABILITY_NONE)
> >
> > Second, __gmem_register_callback() checks before returning whether all
> > references have been dropped, and adjusts the mappability/shareability
> > if needed.
> >
> > Cheers,
> > /fuad
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition
2025-01-20 10:40 ` Fuad Tabba
@ 2025-02-06 3:14 ` Ackerley Tng
2025-02-06 9:45 ` Fuad Tabba
0 siblings, 1 reply; 60+ messages in thread
From: Ackerley Tng @ 2025-02-06 3:14 UTC (permalink / raw)
To: Fuad Tabba
Cc: kirill, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe,
anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
jthoughton
Fuad Tabba <tabba@google.com> writes:
> On Mon, 20 Jan 2025 at 10:30, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>
>> On Fri, Jan 17, 2025 at 04:29:51PM +0000, Fuad Tabba wrote:
>> > +/*
>> > + * Marks the range [start, end) as not mappable by the host. If the host doesn't
>> > + * have any references to a particular folio, then that folio is marked as
>> > + * mappable by the guest.
>> > + *
>> > + * However, if the host still has references to the folio, then the folio is
>> > + * marked and not mappable by anyone. Marking it is not mappable allows it to
>> > + * drain all references from the host, and to ensure that the hypervisor does
>> > + * not transition the folio to private, since the host still might access it.
>> > + *
>> > + * Usually called when guest unshares memory with the host.
>> > + */
>> > +static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
>> > +{
>> > + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>> > + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
>> > + void *xval_none = xa_mk_value(KVM_GMEM_NONE_MAPPABLE);
>> > + pgoff_t i;
>> > + int r = 0;
>> > +
>> > + filemap_invalidate_lock(inode->i_mapping);
>> > + for (i = start; i < end; i++) {
>> > + struct folio *folio;
>> > + int refcount = 0;
>> > +
>> > + folio = filemap_lock_folio(inode->i_mapping, i);
>> > + if (!IS_ERR(folio)) {
>> > + refcount = folio_ref_count(folio);
>> > + } else {
>> > + r = PTR_ERR(folio);
>> > + if (WARN_ON_ONCE(r != -ENOENT))
>> > + break;
>> > +
>> > + folio = NULL;
>> > + }
>> > +
>> > + /* +1 references are expected because of filemap_lock_folio(). */
>> > + if (folio && refcount > folio_nr_pages(folio) + 1) {
>>
>> Looks racy.
>>
>> What prevent anybody from obtaining a reference just after check?
>>
>> Lock on folio doesn't stop random filemap_get_entry() from elevating the
>> refcount.
>>
>> folio_ref_freeze() might be required.
>
> I thought the folio lock would be sufficient, but you're right,
> nothing prevents getting a reference after the check. I'll use a
> folio_ref_freeze() when I respin.
>
> Thanks,
> /fuad
>
Is it correct to say that the only non-racy check for refcounts is a
check for refcount == 0?
What do you think of this instead: If there exists a folio, don't check
the refcount, just set mappability to NONE and register the callback
(the folio should already have been unmapped, which leaves
folio->page_type available for use), and then drop the filemap's
refcounts. When the filemap's refcounts are dropped, in most cases (no
transient refcounts) the callback will be hit and the callback can set
mappability to GUEST.
If there are transient refcounts, the folio will just be waiting
for the refcounts to drop to 0, and that's when the callback will be hit
and the mappability can be transitioned to GUEST.
If there isn't a folio, then guest_memfd was requested to set
mappability ahead of any folio allocation, and in that case
transitioning to GUEST immediately is correct.
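Roughly what I have in mind, as an untested sketch built on the helpers in
this patch (the refcount check is simply gone; the rest reuses the existing
names from this series, and it assumes the folio_put() callback no longer
needs the invalidate_lock held below):

static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
{
	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
	void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
	void *xval_none = xa_mk_value(KVM_GMEM_NONE_MAPPABLE);
	pgoff_t i;
	int r = 0;

	filemap_invalidate_lock(inode->i_mapping);
	for (i = start; i < end; i++) {
		struct folio *folio;

		folio = filemap_lock_folio(inode->i_mapping, i);
		if (IS_ERR(folio)) {
			/* No folio allocated yet: GUEST is immediately correct. */
			r = xa_err(xa_store(mappable_offsets, i, xval_guest, GFP_KERNEL));
			if (WARN_ON_ONCE(r))
				break;
			continue;
		}

		/*
		 * A folio exists (and is assumed to be unmapped already):
		 * don't look at the refcount. Mark the offset NONE, register
		 * the folio_put() callback, and drop the filemap's
		 * references; kvm_gmem_handle_folio_put() transitions the
		 * offset to GUEST once the last reference is dropped.
		 */
		r = xa_err(xa_store(mappable_offsets, i, xval_none, GFP_KERNEL));
		if (!r && !folio_test_guestmem(folio)) {
			__folio_set_guestmem(folio);
			folio_ref_sub(folio, folio_nr_pages(folio));
		}

		/*
		 * Note: assumes the callback doesn't retake the
		 * invalidate_lock (e.g. it uses a separate mappability
		 * lock), otherwise dropping the last reference here would
		 * deadlock.
		 */
		folio_unlock(folio);
		folio_put(folio);

		if (WARN_ON_ONCE(r))
			break;
	}
	filemap_invalidate_unlock(inode->i_mapping);

	return r;
}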
>> > + /*
>> > + * Outstanding references, the folio cannot be faulted
>> > + * in by anyone until they're dropped.
>> > + */
>> > + r = xa_err(xa_store(mappable_offsets, i, xval_none, GFP_KERNEL));
>> > + } else {
>> > + /*
>> > + * No outstanding references. Transition the folio to
>> > + * guest mappable immediately.
>> > + */
>> > + r = xa_err(xa_store(mappable_offsets, i, xval_guest, GFP_KERNEL));
>> > + }
>> > +
>> > + if (folio) {
>> > + folio_unlock(folio);
>> > + folio_put(folio);
>> > + }
>> > +
>> > + if (WARN_ON_ONCE(r))
>> > + break;
>> > + }
>> > + filemap_invalidate_unlock(inode->i_mapping);
>> > +
>> > + return r;
>> > +}
>>
>> --
>> Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-23 11:00 ` Fuad Tabba
@ 2025-02-06 3:18 ` Ackerley Tng
2025-02-06 3:28 ` Ackerley Tng
1 sibling, 0 replies; 60+ messages in thread
From: Ackerley Tng @ 2025-02-06 3:18 UTC (permalink / raw)
To: Fuad Tabba
Cc: vbabka, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe,
anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vannapurve, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-23 11:00 ` Fuad Tabba
2025-02-06 3:18 ` Ackerley Tng
@ 2025-02-06 3:28 ` Ackerley Tng
2025-02-06 9:47 ` Fuad Tabba
1 sibling, 1 reply; 60+ messages in thread
From: Ackerley Tng @ 2025-02-06 3:28 UTC (permalink / raw)
To: Fuad Tabba
Cc: vbabka, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe,
anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vannapurve, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Fuad Tabba <tabba@google.com> writes:
> On Wed, 22 Jan 2025 at 22:24, Ackerley Tng <ackerleytng@google.com> wrote:
>>
>> Fuad Tabba <tabba@google.com> writes:
>>
>> >> > <snip>
>> >> >
>> >> > +/*
>> >> > + * Registers a callback to __folio_put(), so that gmem knows that the host does
>> >> > + * not have any references to the folio. It does that by setting the folio type
>> >> > + * to guestmem.
>> >> > + *
>> >> > + * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
>> >> > + * has references, and the callback has been registered.
>> >>
>> >> Note this comment.
>> >>
>> >> > + *
>> >> > + * Must be called with the following locks held:
>> >> > + * - filemap (inode->i_mapping) invalidate_lock
>> >> > + * - folio lock
>> >> > + */
>> >> > +static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
>> >> > +{
>> >> > + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>> >> > + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
>> >> > + int refcount;
>> >> > +
>> >> > + rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
>> >> > + WARN_ON_ONCE(!folio_test_locked(folio));
>> >> > +
>> >> > + if (folio_mapped(folio) || folio_test_guestmem(folio))
>> >> > + return -EAGAIN;
>> >>
>> >> But here we return -EAGAIN and no callback was registered?
>> >
>> > This is intentional. If the folio is still mapped (i.e., its mapcount
>> > is elevated), then we cannot register the callback yet, so the
>> > host/vmm needs to unmap first, then try again. That said, I see the
>> > problem with the comment above, and I will clarify this.
>> >
>> >> > +
>> >> > + /* Register a callback first. */
>> >> > + __folio_set_guestmem(folio);
>> >> > +
>> >> > + /*
>> >> > + * Check for references after setting the type to guestmem, to guard
>> >> > + * against potential races with the refcount being decremented later.
>> >> > + *
>> >> > + * At least one reference is expected because the folio is locked.
>> >> > + */
>> >> > +
>> >> > + refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
>> >> > + if (refcount == 1) {
>> >> > + int r;
>> >> > +
>> >> > + /* refcount isn't elevated, it's now faultable by the guest. */
>> >>
>> >> Again this seems racy, somebody could have just speculatively increased it.
>> >> Maybe we need to freeze here as well?
>> >
>> > A speculative increase here is ok I think (famous last words). The
>> > callback was registered before the check, therefore, such an increase
>> > would trigger the callback.
>> >
>> > Thanks,
>> > /fuad
>> >
>> >
>>
>> I checked the callback (kvm_gmem_handle_folio_put()) and agree with you
>> that the mappability reset to KVM_GMEM_GUEST_MAPPABLE is handled
>> correctly (since kvm_gmem_handle_folio_put() doesn't assume anything
>> about the mappability state at callback-time).
>>
>> However, what if the new speculative reference writes to the page and
>> guest goes on to fault/use the page?
>
> I don't think that's a problem. At this point the page is in a
> transient state, but still shared from the guest's point of view.
> Moreover, no one can fault-in the page at the host at this point (we
> check in kvm_gmem_fault()).
>
> Let's have a look at the code:
>
> +static int __gmem_register_callback(struct folio *folio, struct inode
> *inode, pgoff_t idx)
> +{
> + struct xarray *mappable_offsets =
> &kvm_gmem_private(inode)->mappable_offsets;
> + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> + int refcount;
>
> At this point the guest still perceives the page as shared; the state
> of the page is KVM_GMEM_NONE_MAPPABLE (transient state). This means
> that kvm_gmem_fault() doesn't fault-in the page at the host anymore.
>
> + rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
> + WARN_ON_ONCE(!folio_test_locked(folio));
> +
> + if (folio_mapped(folio) || folio_test_guestmem(folio))
> + return -EAGAIN;
> +
> + /* Register a callback first. */
> + __folio_set_guestmem(folio);
>
> This (in addition to the NONE_MAPPABLE state) also ensures
> that kvm_gmem_fault() doesn't fault-in the page at the host anymore.
>
> + /*
> + * Check for references after setting the type to guestmem, to guard
> + * against potential races with the refcount being decremented later.
> + *
> + * At least one reference is expected because the folio is locked.
> + */
> +
> + refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
> + if (refcount == 1) {
> + int r;
>
> At this point we know that guest_memfd has the only real reference.
> Speculative references AFAIK do not access the page itself.
> +
> + /* refcount isn't elevated, it's now faultable by the guest. */
> + r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets,
> idx, xval_guest, GFP_KERNEL)));
>
> Now it's safe so let the guest know that it can map the page.
>
> + if (!r)
> + __kvm_gmem_restore_pending_folio(folio);
> +
> + return r;
> + }
> +
> + return -EAGAIN;
> +}
>
> Does this make sense, or did I miss something?
Thanks for explaining! I don't know enough to confirm/deny this but I agree
that if speculative references don't access the page itself, this works.
What if over here, we just drop the refcount, and let setting mappability to
GUEST happen in the folio_put() callback?
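I.e., something like this (sketch only, reusing the names from this patch;
note that returning 0 would no longer mean the folio is already GUEST
mappable, and that it assumes the folio_put() callback doesn't need the
invalidate_lock the caller holds when the final reference is dropped):

static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
{
	rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
	WARN_ON_ONCE(!folio_test_locked(folio));

	if (folio_mapped(folio) || folio_test_guestmem(folio))
		return -EAGAIN;

	/* Register the callback and drop the filemap's references. */
	__folio_set_guestmem(folio);
	folio_ref_sub(folio, folio_nr_pages(folio));

	/*
	 * No refcount check: whenever the last reference is dropped,
	 * kvm_gmem_handle_folio_put() transitions the mappability to GUEST.
	 * (idx is unused here; the xarray is only updated by the callback.)
	 */
	return 0;
}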
>
> Thanks!
> /fuad
>
>> >> > + r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
>> >> > + if (!r)
>> >> > + __kvm_gmem_restore_pending_folio(folio);
>> >> > +
>> >> > + return r;
>> >> > + }
>> >> > +
>> >> > + return -EAGAIN;
>> >> > +}
>> >> > +
>> >> > +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
>> >> > +{
>> >> > + unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
>> >> > + struct inode *inode = file_inode(slot->gmem.file);
>> >> > + struct folio *folio;
>> >> > + int r;
>> >> > +
>> >> > + filemap_invalidate_lock(inode->i_mapping);
>> >> > +
>> >> > + folio = filemap_lock_folio(inode->i_mapping, pgoff);
>> >> > + if (WARN_ON_ONCE(IS_ERR(folio))) {
>> >> > + r = PTR_ERR(folio);
>> >> > + goto out;
>> >> > + }
>> >> > +
>> >> > + r = __gmem_register_callback(folio, inode, pgoff);
>> >> > +
>> >> > + folio_unlock(folio);
>> >> > + folio_put(folio);
>> >> > +out:
>> >> > + filemap_invalidate_unlock(inode->i_mapping);
>> >> > +
>> >> > + return r;
>> >> > +}
>> >> > +
>> >> > +/*
>> >> > + * Callback function for __folio_put(), i.e., called when all references by the
>> >> > + * host to the folio have been dropped. This allows gmem to transition the state
>> >> > + * of the folio to mappable by the guest, and allows the hypervisor to continue
>> >> > + * transitioning its state to private, since the host cannot attempt to access
>> >> > + * it anymore.
>> >> > + */
>> >> > +void kvm_gmem_handle_folio_put(struct folio *folio)
>> >> > +{
>> >> > + struct xarray *mappable_offsets;
>> >> > + struct inode *inode;
>> >> > + pgoff_t index;
>> >> > + void *xval;
>> >> > +
>> >> > + inode = folio->mapping->host;
>> >> > + index = folio->index;
>> >> > + mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>> >> > + xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
>> >> > +
>> >> > + filemap_invalidate_lock(inode->i_mapping);
>> >> > + __kvm_gmem_restore_pending_folio(folio);
>> >> > + WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
>> >> > + filemap_invalidate_unlock(inode->i_mapping);
>> >> > +}
>> >> > +
>> >> > static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
>> >> > {
>> >> > struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
>> >>
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-01-17 16:29 ` [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages Fuad Tabba
` (3 preceding siblings ...)
2025-02-05 0:51 ` Vishal Annapurve
@ 2025-02-06 3:37 ` Ackerley Tng
2025-02-06 9:49 ` Fuad Tabba
4 siblings, 1 reply; 60+ messages in thread
From: Ackerley Tng @ 2025-02-06 3:37 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Fuad Tabba <tabba@google.com> writes:
> Before transitioning a guest_memfd folio to unshared, thereby
> disallowing access by the host and allowing the hypervisor to
> transition its view of the guest page as private, we need to be
> sure that the host doesn't have any references to the folio.
>
> This patch introduces a new type for guest_memfd folios, and uses
> that to register a callback that informs the guest_memfd
> subsystem when the last reference is dropped, therefore knowing
> that the host doesn't have any remaining references.
>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
> The function kvm_slot_gmem_register_callback() isn't used in this
> series. It will be used later in code that performs unsharing of
> memory. I have tested it with pKVM, based on downstream code [*].
> It's included in this RFC since it demonstrates the plan to
> handle unsharing of private folios.
>
> [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v5-pkvm
> ---
> include/linux/kvm_host.h | 11 +++
> include/linux/page-flags.h | 7 ++
> mm/debug.c | 1 +
> mm/swap.c | 4 +
> virt/kvm/guest_memfd.c | 145 +++++++++++++++++++++++++++++++++++++
> 5 files changed, 168 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 84aa7908a5dd..63e6d6dd98b3 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2574,6 +2574,8 @@ int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start,
> gfn_t end);
> bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
> bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
> +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn);
> +void kvm_gmem_handle_folio_put(struct folio *folio);
> #else
> static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end)
> {
> @@ -2615,6 +2617,15 @@ static inline bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot,
> WARN_ON_ONCE(1);
> return false;
> }
> +static inline int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
> +{
> + WARN_ON_ONCE(1);
> + return -EINVAL;
> +}
> +static inline void kvm_gmem_handle_folio_put(struct folio *folio)
> +{
> + WARN_ON_ONCE(1);
> +}
> #endif /* CONFIG_KVM_GMEM_MAPPABLE */
>
> #endif
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 6615f2f59144..bab3cac1f93b 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -942,6 +942,7 @@ enum pagetype {
> PGTY_slab = 0xf5,
> PGTY_zsmalloc = 0xf6,
> PGTY_unaccepted = 0xf7,
> + PGTY_guestmem = 0xf8,
>
> PGTY_mapcount_underflow = 0xff
> };
> @@ -1091,6 +1092,12 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb)
> FOLIO_TEST_FLAG_FALSE(hugetlb)
> #endif
>
> +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> +FOLIO_TYPE_OPS(guestmem, guestmem)
> +#else
> +FOLIO_TEST_FLAG_FALSE(guestmem)
> +#endif
> +
> PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
>
> /*
> diff --git a/mm/debug.c b/mm/debug.c
> index 95b6ab809c0e..db93be385ed9 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -56,6 +56,7 @@ static const char *page_type_names[] = {
> DEF_PAGETYPE_NAME(table),
> DEF_PAGETYPE_NAME(buddy),
> DEF_PAGETYPE_NAME(unaccepted),
> + DEF_PAGETYPE_NAME(guestmem),
> };
>
> static const char *page_type_name(unsigned int page_type)
> diff --git a/mm/swap.c b/mm/swap.c
> index 6f01b56bce13..15220eaabc86 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -37,6 +37,7 @@
> #include <linux/page_idle.h>
> #include <linux/local_lock.h>
> #include <linux/buffer_head.h>
> +#include <linux/kvm_host.h>
>
> #include "internal.h"
>
> @@ -103,6 +104,9 @@ static void free_typed_folio(struct folio *folio)
> case PGTY_offline:
> /* Nothing to do, it's offline. */
> return;
> + case PGTY_guestmem:
> + kvm_gmem_handle_folio_put(folio);
> + return;
> default:
> WARN_ON_ONCE(1);
> }
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d1c192927cf7..722afd9f8742 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -387,6 +387,28 @@ enum folio_mappability {
> KVM_GMEM_NONE_MAPPABLE = 0b11, /* Not mappable, transient state. */
> };
>
> +/*
> + * Unregisters the __folio_put() callback from the folio.
> + *
> + * Restores a folio's refcount after all pending references have been released,
> + * and removes the folio type, thereby removing the callback. Now the folio can
> + * be freed normally once all actual references have been dropped.
> + *
> + * Must be called with the filemap (inode->i_mapping) invalidate_lock held.
> + * Must also have exclusive access to the folio: folio must be either locked, or
> + * gmem holds the only reference.
> + */
> +static void __kvm_gmem_restore_pending_folio(struct folio *folio)
> +{
> + if (WARN_ON_ONCE(folio_mapped(folio) || !folio_test_guestmem(folio)))
> + return;
> +
> + WARN_ON_ONCE(!folio_test_locked(folio) && folio_ref_count(folio) > 1);
> +
> + __folio_clear_guestmem(folio);
> + folio_ref_add(folio, folio_nr_pages(folio));
> +}
> +
> /*
> * Marks the range [start, end) as mappable by both the host and the guest.
> * Usually called when guest shares memory with the host.
> @@ -400,7 +422,31 @@ static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
>
> filemap_invalidate_lock(inode->i_mapping);
> for (i = start; i < end; i++) {
> + struct folio *folio = NULL;
> +
> + /*
> + * If the folio is NONE_MAPPABLE, it indicates that it is
> + * transitioning to private (GUEST_MAPPABLE). Transition it to
> + * shared (ALL_MAPPABLE) immediately, and remove the callback.
> + */
> + if (xa_to_value(xa_load(mappable_offsets, i)) == KVM_GMEM_NONE_MAPPABLE) {
> + folio = filemap_lock_folio(inode->i_mapping, i);
> + if (WARN_ON_ONCE(IS_ERR(folio))) {
> + r = PTR_ERR(folio);
> + break;
> + }
> +
> + if (folio_test_guestmem(folio))
> + __kvm_gmem_restore_pending_folio(folio);
> + }
> +
> r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
> +
> + if (folio) {
> + folio_unlock(folio);
> + folio_put(folio);
> + }
> +
> if (r)
> break;
> }
> @@ -473,6 +519,105 @@ static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> return r;
> }
>
I think one of these functions to restore mappability needs to be called
to restore the refcounts on truncation. Without doing this, the
refcounts on the folios at truncation time would only be the
transient/speculative ones, and truncation would try to drop the filemap
refcounts that were already taken off when the folio_put() callback was
set up.
Should mappability be restored according to
GUEST_MEMFD_FLAG_INIT_MAPPABLE? Or should mappability of NONE be
restored to GUEST and mappability of ALL left as ALL?
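As a rough sketch of what I mean (untested, hypothetical helper name;
assumes it is called from the truncation path, which already holds the
invalidate_lock, before the folios are removed from the filemap):

static void gmem_restore_range_for_truncation(struct inode *inode,
					      pgoff_t start, pgoff_t end)
{
	pgoff_t i;

	rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);

	for (i = start; i < end; i++) {
		struct folio *folio = filemap_lock_folio(inode->i_mapping, i);

		if (IS_ERR(folio))
			continue;

		/*
		 * Clear PGTY_guestmem and give the filemap's references back,
		 * so that truncation only drops refcounts it actually owns.
		 */
		if (folio_test_guestmem(folio))
			__kvm_gmem_restore_pending_folio(folio);

		/*
		 * Whether the mappability should also be reset here (NONE
		 * back to GUEST, or per GUEST_MEMFD_FLAG_INIT_MAPPABLE) is
		 * the open question above.
		 */

		folio_unlock(folio);
		folio_put(folio);
	}
}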
> +/*
> + * Registers a callback to __folio_put(), so that gmem knows that the host does
> + * not have any references to the folio. It does that by setting the folio type
> + * to guestmem.
> + *
> + * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
> + * has references, and the callback has been registered.
> + *
> + * Must be called with the following locks held:
> + * - filemap (inode->i_mapping) invalidate_lock
> + * - folio lock
> + */
> +static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
> +{
> + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> + int refcount;
> +
> + rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
> + WARN_ON_ONCE(!folio_test_locked(folio));
> +
> + if (folio_mapped(folio) || folio_test_guestmem(folio))
> + return -EAGAIN;
> +
> + /* Register a callback first. */
> + __folio_set_guestmem(folio);
> +
> + /*
> + * Check for references after setting the type to guestmem, to guard
> + * against potential races with the refcount being decremented later.
> + *
> + * At least one reference is expected because the folio is locked.
> + */
> +
> + refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
> + if (refcount == 1) {
> + int r;
> +
> + /* refcount isn't elevated, it's now faultable by the guest. */
> + r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
> + if (!r)
> + __kvm_gmem_restore_pending_folio(folio);
> +
> + return r;
> + }
> +
> + return -EAGAIN;
> +}
> +
> +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
> +{
> + unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
> + struct inode *inode = file_inode(slot->gmem.file);
> + struct folio *folio;
> + int r;
> +
> + filemap_invalidate_lock(inode->i_mapping);
> +
> + folio = filemap_lock_folio(inode->i_mapping, pgoff);
> + if (WARN_ON_ONCE(IS_ERR(folio))) {
> + r = PTR_ERR(folio);
> + goto out;
> + }
> +
> + r = __gmem_register_callback(folio, inode, pgoff);
> +
> + folio_unlock(folio);
> + folio_put(folio);
> +out:
> + filemap_invalidate_unlock(inode->i_mapping);
> +
> + return r;
> +}
> +
> +/*
> + * Callback function for __folio_put(), i.e., called when all references by the
> + * host to the folio have been dropped. This allows gmem to transition the state
> + * of the folio to mappable by the guest, and allows the hypervisor to continue
> + * transitioning its state to private, since the host cannot attempt to access
> + * it anymore.
> + */
> +void kvm_gmem_handle_folio_put(struct folio *folio)
> +{
> + struct xarray *mappable_offsets;
> + struct inode *inode;
> + pgoff_t index;
> + void *xval;
> +
> + inode = folio->mapping->host;
> + index = folio->index;
> + mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> + xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> +
> + filemap_invalidate_lock(inode->i_mapping);
> + __kvm_gmem_restore_pending_folio(folio);
> + WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
> + filemap_invalidate_unlock(inode->i_mapping);
> +}
> +
> static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
> {
> struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition
2025-02-06 3:14 ` Ackerley Tng
@ 2025-02-06 9:45 ` Fuad Tabba
0 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-02-06 9:45 UTC (permalink / raw)
To: Ackerley Tng
Cc: kirill, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe,
anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
jthoughton
Hi Ackerley,
On Thu, 6 Feb 2025 at 03:14, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> > On Mon, 20 Jan 2025 at 10:30, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >>
> >> On Fri, Jan 17, 2025 at 04:29:51PM +0000, Fuad Tabba wrote:
> >> > +/*
> >> > + * Marks the range [start, end) as not mappable by the host. If the host doesn't
> >> > + * have any references to a particular folio, then that folio is marked as
> >> > + * mappable by the guest.
> >> > + *
> >> > + * However, if the host still has references to the folio, then the folio is
> >> > + * marked as not mappable by anyone. Marking it as not mappable allows it to
> >> > + * drain all references from the host, and to ensure that the hypervisor does
> >> > + * not transition the folio to private, since the host still might access it.
> >> > + *
> >> > + * Usually called when guest unshares memory with the host.
> >> > + */
> >> > +static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> >> > +{
> >> > + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >> > + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> >> > + void *xval_none = xa_mk_value(KVM_GMEM_NONE_MAPPABLE);
> >> > + pgoff_t i;
> >> > + int r = 0;
> >> > +
> >> > + filemap_invalidate_lock(inode->i_mapping);
> >> > + for (i = start; i < end; i++) {
> >> > + struct folio *folio;
> >> > + int refcount = 0;
> >> > +
> >> > + folio = filemap_lock_folio(inode->i_mapping, i);
> >> > + if (!IS_ERR(folio)) {
> >> > + refcount = folio_ref_count(folio);
> >> > + } else {
> >> > + r = PTR_ERR(folio);
> >> > + if (WARN_ON_ONCE(r != -ENOENT))
> >> > + break;
> >> > +
> >> > + folio = NULL;
> >> > + }
> >> > +
> >> > + /* +1 references are expected because of filemap_lock_folio(). */
> >> > + if (folio && refcount > folio_nr_pages(folio) + 1) {
> >>
> >> Looks racy.
> >>
> >> What prevent anybody from obtaining a reference just after check?
> >>
> >> Lock on folio doesn't stop random filemap_get_entry() from elevating the
> >> refcount.
> >>
> >> folio_ref_freeze() might be required.
> >
> > I thought the folio lock would be sufficient, but you're right,
> > nothing prevents getting a reference after the check. I'll use a
> > folio_ref_freeze() when I respin.
> >
> > Thanks,
> > /fuad
> >
>
> Is it correct to say that the only non-racy check for refcounts is a
> check for refcount == 0?
>
> What do you think of this instead: If there exists a folio, don't check
> the refcount, just set mappability to NONE and register the callback
> (the folio should already have been unmapped, which leaves
> folio->page_type available for use), and then drop the filemap's
> refcounts. When the filemap's refcounts are dropped, in most cases (no
> transient refcounts) the callback will be hit and the callback can set
> mappability to GUEST.
>
> If there are transient refcounts, the folio will just be waiting
> for the refcounts to drop to 0, and that's when the callback will be hit
> and the mappability can be transitioned to GUEST.
>
> If there isn't a folio, then guest_memfd was requested to set
> mappability ahead of any folio allocation, and in that case
> transitioning to GUEST immediately is correct.
This seems to me to add additional complexity to the common case that
isn't needed for correctness, and would make things more difficult to
reason about. If we know that there aren't any mappings at the host
(mapcount == 0), and we know that the refcount wasn't elevated beyond the
expected references once we had taken the folio lock, even if the refcount
gets (transiently) elevated, we know that no one at the host is
accessing the folio itself.
Keep in mind that the common case (in a well behaved system) is that
neither the mapcount nor the refcount is elevated, and, both for
performance and for understanding, I think that's what we should be
targeting. Unless of course I'm wrong, and there's a correctness issue
here.
Cheers,
/fuad
> >> > + /*
> >> > + * Outstanding references, the folio cannot be faulted
> >> > + * in by anyone until they're dropped.
> >> > + */
> >> > + r = xa_err(xa_store(mappable_offsets, i, xval_none, GFP_KERNEL));
> >> > + } else {
> >> > + /*
> >> > + * No outstanding references. Transition the folio to
> >> > + * guest mappable immediately.
> >> > + */
> >> > + r = xa_err(xa_store(mappable_offsets, i, xval_guest, GFP_KERNEL));
> >> > + }
> >> > +
> >> > + if (folio) {
> >> > + folio_unlock(folio);
> >> > + folio_put(folio);
> >> > + }
> >> > +
> >> > + if (WARN_ON_ONCE(r))
> >> > + break;
> >> > + }
> >> > + filemap_invalidate_unlock(inode->i_mapping);
> >> > +
> >> > + return r;
> >> > +}
> >>
> >> --
> >> Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-02-06 3:28 ` Ackerley Tng
@ 2025-02-06 9:47 ` Fuad Tabba
0 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-02-06 9:47 UTC (permalink / raw)
To: Ackerley Tng
Cc: vbabka, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe,
anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vannapurve, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On Thu, 6 Feb 2025 at 03:28, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> > On Wed, 22 Jan 2025 at 22:24, Ackerley Tng <ackerleytng@google.com> wrote:
> >>
> >> Fuad Tabba <tabba@google.com> writes:
> >>
> >> >> > <snip>
> >> >> >
> >> >> > +/*
> >> >> > + * Registers a callback to __folio_put(), so that gmem knows that the host does
> >> >> > + * not have any references to the folio. It does that by setting the folio type
> >> >> > + * to guestmem.
> >> >> > + *
> >> >> > + * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
> >> >> > + * has references, and the callback has been registered.
> >> >>
> >> >> Note this comment.
> >> >>
> >> >> > + *
> >> >> > + * Must be called with the following locks held:
> >> >> > + * - filemap (inode->i_mapping) invalidate_lock
> >> >> > + * - folio lock
> >> >> > + */
> >> >> > +static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
> >> >> > +{
> >> >> > + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >> >> > + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> >> >> > + int refcount;
> >> >> > +
> >> >> > + rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
> >> >> > + WARN_ON_ONCE(!folio_test_locked(folio));
> >> >> > +
> >> >> > + if (folio_mapped(folio) || folio_test_guestmem(folio))
> >> >> > + return -EAGAIN;
> >> >>
> >> >> But here we return -EAGAIN and no callback was registered?
> >> >
> >> > This is intentional. If the folio is still mapped (i.e., its mapcount
> >> > is elevated), then we cannot register the callback yet, so the
> >> > host/vmm needs to unmap first, then try again. That said, I see the
> >> > problem with the comment above, and I will clarify this.
> >> >
> >> >> > +
> >> >> > + /* Register a callback first. */
> >> >> > + __folio_set_guestmem(folio);
> >> >> > +
> >> >> > + /*
> >> >> > + * Check for references after setting the type to guestmem, to guard
> >> >> > + * against potential races with the refcount being decremented later.
> >> >> > + *
> >> >> > + * At least one reference is expected because the folio is locked.
> >> >> > + */
> >> >> > +
> >> >> > + refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
> >> >> > + if (refcount == 1) {
> >> >> > + int r;
> >> >> > +
> >> >> > + /* refcount isn't elevated, it's now faultable by the guest. */
> >> >>
> >> >> Again this seems racy, somebody could have just speculatively increased it.
> >> >> Maybe we need to freeze here as well?
> >> >
> >> > A speculative increase here is ok I think (famous last words). The
> >> > callback was registered before the check, therefore, such an increase
> >> > would trigger the callback.
> >> >
> >> > Thanks,
> >> > /fuad
> >> >
> >> >
> >>
> >> I checked the callback (kvm_gmem_handle_folio_put()) and agree with you
> >> that the mappability reset to KVM_GMEM_GUEST_MAPPABLE is handled
> >> correctly (since kvm_gmem_handle_folio_put() doesn't assume anything
> >> about the mappability state at callback-time).
> >>
> >> However, what if the new speculative reference writes to the page and
> >> guest goes on to fault/use the page?
> >
> > I don't think that's a problem. At this point the page is in a
> > transient state, but still shared from the guest's point of view.
> > Moreover, no one can fault-in the page at the host at this point (we
> > check in kvm_gmem_fault()).
> >
> > Let's have a look at the code:
> >
> > +static int __gmem_register_callback(struct folio *folio, struct inode
> > *inode, pgoff_t idx)
> > +{
> > + struct xarray *mappable_offsets =
> > &kvm_gmem_private(inode)->mappable_offsets;
> > + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> > + int refcount;
> >
> > At this point the guest still perceives the page as shared; the state
> > of the page is KVM_GMEM_NONE_MAPPABLE (transient state). This means
> > that kvm_gmem_fault() doesn't fault-in the page at the host anymore.
> >
> > + rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
> > + WARN_ON_ONCE(!folio_test_locked(folio));
> > +
> > + if (folio_mapped(folio) || folio_test_guestmem(folio))
> > + return -EAGAIN;
> > +
> > + /* Register a callback first. */
> > + __folio_set_guestmem(folio);
> >
> > This (in addition to the NONE_MAPPABLE state) also ensures
> > that kvm_gmem_fault() doesn't fault-in the page at the host anymore.
> >
> > + /*
> > + * Check for references after setting the type to guestmem, to guard
> > + * against potential races with the refcount being decremented later.
> > + *
> > + * At least one reference is expected because the folio is locked.
> > + */
> > +
> > + refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
> > + if (refcount == 1) {
> > + int r;
> >
> > At this point we know that guest_memfd has the only real reference.
> > Speculative references AFAIK do not access the page itself.
> > +
> > + /* refcount isn't elevated, it's now faultable by the guest. */
> > + r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets,
> > idx, xval_guest, GFP_KERNEL)));
> >
> > Now it's safe so let the guest know that it can map the page.
> >
> > + if (!r)
> > + __kvm_gmem_restore_pending_folio(folio);
> > +
> > + return r;
> > + }
> > +
> > + return -EAGAIN;
> > +}
> >
> > Does this make sense, or did I miss something?
>
> Thanks for explaining! I don't know enough to confirm/deny this but I agree
> that if speculative references don't access the page itself, this works.
>
> What if over here, we just drop the refcount, and let setting mappability to
> GUEST happen in the folio_put() callback?
Similar to what I mentioned in the other thread, the common case
should be that the mapcount and refcount are not elevated, therefore,
I think it's better not to go through the callback route unless it's
necessary for correctness.
Cheers,
/fuad
> >
> > Thanks!
> > /fuad
> >
> >> >> > + r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
> >> >> > + if (!r)
> >> >> > + __kvm_gmem_restore_pending_folio(folio);
> >> >> > +
> >> >> > + return r;
> >> >> > + }
> >> >> > +
> >> >> > + return -EAGAIN;
> >> >> > +}
> >> >> > +
> >> >> > +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
> >> >> > +{
> >> >> > + unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
> >> >> > + struct inode *inode = file_inode(slot->gmem.file);
> >> >> > + struct folio *folio;
> >> >> > + int r;
> >> >> > +
> >> >> > + filemap_invalidate_lock(inode->i_mapping);
> >> >> > +
> >> >> > + folio = filemap_lock_folio(inode->i_mapping, pgoff);
> >> >> > + if (WARN_ON_ONCE(IS_ERR(folio))) {
> >> >> > + r = PTR_ERR(folio);
> >> >> > + goto out;
> >> >> > + }
> >> >> > +
> >> >> > + r = __gmem_register_callback(folio, inode, pgoff);
> >> >> > +
> >> >> > + folio_unlock(folio);
> >> >> > + folio_put(folio);
> >> >> > +out:
> >> >> > + filemap_invalidate_unlock(inode->i_mapping);
> >> >> > +
> >> >> > + return r;
> >> >> > +}
> >> >> > +
> >> >> > +/*
> >> >> > + * Callback function for __folio_put(), i.e., called when all references by the
> >> >> > + * host to the folio have been dropped. This allows gmem to transition the state
> >> >> > + * of the folio to mappable by the guest, and allows the hypervisor to continue
> >> >> > + * transitioning its state to private, since the host cannot attempt to access
> >> >> > + * it anymore.
> >> >> > + */
> >> >> > +void kvm_gmem_handle_folio_put(struct folio *folio)
> >> >> > +{
> >> >> > + struct xarray *mappable_offsets;
> >> >> > + struct inode *inode;
> >> >> > + pgoff_t index;
> >> >> > + void *xval;
> >> >> > +
> >> >> > + inode = folio->mapping->host;
> >> >> > + index = folio->index;
> >> >> > + mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >> >> > + xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> >> >> > +
> >> >> > + filemap_invalidate_lock(inode->i_mapping);
> >> >> > + __kvm_gmem_restore_pending_folio(folio);
> >> >> > + WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
> >> >> > + filemap_invalidate_unlock(inode->i_mapping);
> >> >> > +}
> >> >> > +
> >> >> > static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
> >> >> > {
> >> >> > struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> >> >>
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-02-06 3:37 ` Ackerley Tng
@ 2025-02-06 9:49 ` Fuad Tabba
0 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-02-06 9:49 UTC (permalink / raw)
To: Ackerley Tng
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On Thu, 6 Feb 2025 at 03:37, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> > Before transitioning a guest_memfd folio to unshared, thereby
> > disallowing access by the host and allowing the hypervisor to
> > transition its view of the guest page as private, we need to be
> > sure that the host doesn't have any references to the folio.
> >
> > This patch introduces a new type for guest_memfd folios, and uses
> > that to register a callback that informs the guest_memfd
> > subsystem when the last reference is dropped, therefore knowing
> > that the host doesn't have any remaining references.
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> > The function kvm_slot_gmem_register_callback() isn't used in this
> > series. It will be used later in code that performs unsharing of
> > memory. I have tested it with pKVM, based on downstream code [*].
> > It's included in this RFC since it demonstrates the plan to
> > handle unsharing of private folios.
> >
> > [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v5-pkvm
> > ---
> > include/linux/kvm_host.h | 11 +++
> > include/linux/page-flags.h | 7 ++
> > mm/debug.c | 1 +
> > mm/swap.c | 4 +
> > virt/kvm/guest_memfd.c | 145 +++++++++++++++++++++++++++++++++++++
> > 5 files changed, 168 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 84aa7908a5dd..63e6d6dd98b3 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2574,6 +2574,8 @@ int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start,
> > gfn_t end);
> > bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
> > bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
> > +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn);
> > +void kvm_gmem_handle_folio_put(struct folio *folio);
> > #else
> > static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end)
> > {
> > @@ -2615,6 +2617,15 @@ static inline bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot,
> > WARN_ON_ONCE(1);
> > return false;
> > }
> > +static inline int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
> > +{
> > + WARN_ON_ONCE(1);
> > + return -EINVAL;
> > +}
> > +static inline void kvm_gmem_handle_folio_put(struct folio *folio)
> > +{
> > + WARN_ON_ONCE(1);
> > +}
> > #endif /* CONFIG_KVM_GMEM_MAPPABLE */
> >
> > #endif
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 6615f2f59144..bab3cac1f93b 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -942,6 +942,7 @@ enum pagetype {
> > PGTY_slab = 0xf5,
> > PGTY_zsmalloc = 0xf6,
> > PGTY_unaccepted = 0xf7,
> > + PGTY_guestmem = 0xf8,
> >
> > PGTY_mapcount_underflow = 0xff
> > };
> > @@ -1091,6 +1092,12 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb)
> > FOLIO_TEST_FLAG_FALSE(hugetlb)
> > #endif
> >
> > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
> > +FOLIO_TYPE_OPS(guestmem, guestmem)
> > +#else
> > +FOLIO_TEST_FLAG_FALSE(guestmem)
> > +#endif
> > +
> > PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
> >
> > /*
> > diff --git a/mm/debug.c b/mm/debug.c
> > index 95b6ab809c0e..db93be385ed9 100644
> > --- a/mm/debug.c
> > +++ b/mm/debug.c
> > @@ -56,6 +56,7 @@ static const char *page_type_names[] = {
> > DEF_PAGETYPE_NAME(table),
> > DEF_PAGETYPE_NAME(buddy),
> > DEF_PAGETYPE_NAME(unaccepted),
> > + DEF_PAGETYPE_NAME(guestmem),
> > };
> >
> > static const char *page_type_name(unsigned int page_type)
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 6f01b56bce13..15220eaabc86 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -37,6 +37,7 @@
> > #include <linux/page_idle.h>
> > #include <linux/local_lock.h>
> > #include <linux/buffer_head.h>
> > +#include <linux/kvm_host.h>
> >
> > #include "internal.h"
> >
> > @@ -103,6 +104,9 @@ static void free_typed_folio(struct folio *folio)
> > case PGTY_offline:
> > /* Nothing to do, it's offline. */
> > return;
> > + case PGTY_guestmem:
> > + kvm_gmem_handle_folio_put(folio);
> > + return;
> > default:
> > WARN_ON_ONCE(1);
> > }
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index d1c192927cf7..722afd9f8742 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -387,6 +387,28 @@ enum folio_mappability {
> > KVM_GMEM_NONE_MAPPABLE = 0b11, /* Not mappable, transient state. */
> > };
> >
> > +/*
> > + * Unregisters the __folio_put() callback from the folio.
> > + *
> > + * Restores a folio's refcount after all pending references have been released,
> > + * and removes the folio type, thereby removing the callback. Now the folio can
> > + * be freed normally once all actual references have been dropped.
> > + *
> > + * Must be called with the filemap (inode->i_mapping) invalidate_lock held.
> > + * Must also have exclusive access to the folio: folio must be either locked, or
> > + * gmem holds the only reference.
> > + */
> > +static void __kvm_gmem_restore_pending_folio(struct folio *folio)
> > +{
> > + if (WARN_ON_ONCE(folio_mapped(folio) || !folio_test_guestmem(folio)))
> > + return;
> > +
> > + WARN_ON_ONCE(!folio_test_locked(folio) && folio_ref_count(folio) > 1);
> > +
> > + __folio_clear_guestmem(folio);
> > + folio_ref_add(folio, folio_nr_pages(folio));
> > +}
> > +
> > /*
> > * Marks the range [start, end) as mappable by both the host and the guest.
> > * Usually called when guest shares memory with the host.
> > @@ -400,7 +422,31 @@ static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> >
> > filemap_invalidate_lock(inode->i_mapping);
> > for (i = start; i < end; i++) {
> > + struct folio *folio = NULL;
> > +
> > + /*
> > + * If the folio is NONE_MAPPABLE, it indicates that it is
> > + * transitioning to private (GUEST_MAPPABLE). Transition it to
> > + * shared (ALL_MAPPABLE) immediately, and remove the callback.
> > + */
> > + if (xa_to_value(xa_load(mappable_offsets, i)) == KVM_GMEM_NONE_MAPPABLE) {
> > + folio = filemap_lock_folio(inode->i_mapping, i);
> > + if (WARN_ON_ONCE(IS_ERR(folio))) {
> > + r = PTR_ERR(folio);
> > + break;
> > + }
> > +
> > + if (folio_test_guestmem(folio))
> > + __kvm_gmem_restore_pending_folio(folio);
> > + }
> > +
> > r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
> > +
> > + if (folio) {
> > + folio_unlock(folio);
> > + folio_put(folio);
> > + }
> > +
> > if (r)
> > break;
> > }
> > @@ -473,6 +519,105 @@ static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> > return r;
> > }
> >
>
> I think one of these functions to restore mappability needs to be called
> to restore the refcounts on truncation. Without doing this, the
> refcounts on the folios at truncation time would only be the
> transient/speculative ones, and truncation would try to drop the filemap
> refcounts that were already taken off when the folio_put() callback was
> set up.
Good point.
> Should mappability be restored according to
> GUEST_MEMFD_FLAG_INIT_MAPPABLE? Or should mappability of NONE be
> restored to GUEST and mappability of ALL left as ALL?
Not sure I follow :)
Thanks,
/fuad
> > +/*
> > + * Registers a callback to __folio_put(), so that gmem knows that the host does
> > + * not have any references to the folio. It does that by setting the folio type
> > + * to guestmem.
> > + *
> > + * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
> > + * has references, and the callback has been registered.
> > + *
> > + * Must be called with the following locks held:
> > + * - filemap (inode->i_mapping) invalidate_lock
> > + * - folio lock
> > + */
> > +static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
> > +{
> > + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> > + int refcount;
> > +
> > + rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
> > + WARN_ON_ONCE(!folio_test_locked(folio));
> > +
> > + if (folio_mapped(folio) || folio_test_guestmem(folio))
> > + return -EAGAIN;
> > +
> > + /* Register a callback first. */
> > + __folio_set_guestmem(folio);
> > +
> > + /*
> > + * Check for references after setting the type to guestmem, to guard
> > + * against potential races with the refcount being decremented later.
> > + *
> > + * At least one reference is expected because the folio is locked.
> > + */
> > +
> > + refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
> > + if (refcount == 1) {
> > + int r;
> > +
> > + /* refcount isn't elevated, it's now faultable by the guest. */
> > + r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
> > + if (!r)
> > + __kvm_gmem_restore_pending_folio(folio);
> > +
> > + return r;
> > + }
> > +
> > + return -EAGAIN;
> > +}
> > +
> > +int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
> > +{
> > + unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
> > + struct inode *inode = file_inode(slot->gmem.file);
> > + struct folio *folio;
> > + int r;
> > +
> > + filemap_invalidate_lock(inode->i_mapping);
> > +
> > + folio = filemap_lock_folio(inode->i_mapping, pgoff);
> > + if (WARN_ON_ONCE(IS_ERR(folio))) {
> > + r = PTR_ERR(folio);
> > + goto out;
> > + }
> > +
> > + r = __gmem_register_callback(folio, inode, pgoff);
> > +
> > + folio_unlock(folio);
> > + folio_put(folio);
> > +out:
> > + filemap_invalidate_unlock(inode->i_mapping);
> > +
> > + return r;
> > +}
> > +
> > +/*
> > + * Callback function for __folio_put(), i.e., called when all references by the
> > + * host to the folio have been dropped. This allows gmem to transition the state
> > + * of the folio to mappable by the guest, and allows the hypervisor to continue
> > + * transitioning its state to private, since the host cannot attempt to access
> > + * it anymore.
> > + */
> > +void kvm_gmem_handle_folio_put(struct folio *folio)
> > +{
> > + struct xarray *mappable_offsets;
> > + struct inode *inode;
> > + pgoff_t index;
> > + void *xval;
> > +
> > + inode = folio->mapping->host;
> > + index = folio->index;
> > + mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > + xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> > +
> > + filemap_invalidate_lock(inode->i_mapping);
> > + __kvm_gmem_restore_pending_folio(folio);
> > + WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
> > + filemap_invalidate_unlock(inode->i_mapping);
> > +}
> > +
> > static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
> > {
> > struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-02-05 17:42 ` Vishal Annapurve
@ 2025-02-07 10:46 ` Ackerley Tng
2025-02-10 16:04 ` Fuad Tabba
0 siblings, 1 reply; 60+ messages in thread
From: Ackerley Tng @ 2025-02-07 10:46 UTC (permalink / raw)
To: Vishal Annapurve
Cc: tabba, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe,
anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Vishal Annapurve <vannapurve@google.com> writes:
> On Wed, Feb 5, 2025 at 9:39 AM Vishal Annapurve <vannapurve@google.com> wrote:
>>
>> On Wed, Feb 5, 2025 at 2:07 AM Fuad Tabba <tabba@google.com> wrote:
>> >
>> > Hi Vishal,
>> >
>> > On Wed, 5 Feb 2025 at 00:42, Vishal Annapurve <vannapurve@google.com> wrote:
>> > >
>> > > On Fri, Jan 17, 2025 at 8:30 AM Fuad Tabba <tabba@google.com> wrote:
>> > > >
>> > > > Before transitioning a guest_memfd folio to unshared, thereby
>> > > > disallowing access by the host and allowing the hypervisor to
>> > > > transition its view of the guest page as private, we need to be
>> > > > sure that the host doesn't have any references to the folio.
>> > > >
>> > > > This patch introduces a new type for guest_memfd folios, and uses
>> > > > that to register a callback that informs the guest_memfd
>> > > > subsystem when the last reference is dropped, therefore knowing
>> > > > that the host doesn't have any remaining references.
>> > > >
>> > > > Signed-off-by: Fuad Tabba <tabba@google.com>
>> > > > ---
>> > > > The function kvm_slot_gmem_register_callback() isn't used in this
>> > > > series. It will be used later in code that performs unsharing of
>> > > > memory. I have tested it with pKVM, based on downstream code [*].
>> > > > It's included in this RFC since it demonstrates the plan to
>> > > > handle unsharing of private folios.
>> > > >
>> > > > [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v5-pkvm
>> > >
>> > > Should the invocation of kvm_slot_gmem_register_callback() happen in
>> > > the same critical block as setting the guest memfd range mappability
>> > > to NONE, otherwise conversion/truncation could race with registration
>> > > of callback?
>> >
> >> > I don't think it needs to, at least not as far as potential races are
>> > concerned. First because kvm_slot_gmem_register_callback() grabs the
>> > mapping's invalidate_lock as well as the folio lock, and
>> > gmem_clear_mappable() grabs the mapping lock and the folio lock if a
>> > folio has been allocated before.
>>
>> I was hinting towards such a scenario:
>> Core1
>> Shared to private conversion
>> -> Results in mappability attributes
>> being set to NONE
>> ...
>> Trigger private to shared conversion/truncation for
>> ...
>> overlapping ranges
>> ...
>> kvm_slot_gmem_register_callback() on
>> the guest_memfd ranges converted
>> above (This will end up registering callback
>> for guest_memfd ranges which possibly don't
>> carry *_MAPPABILITY_NONE)
>>
>
> Sorry for the format mess above.
>
> I was hinting towards such a scenario:
> Core1-
> Shared to private conversion -> Results in mappability attributes
> being set to NONE
> ...
> Core2
> Trigger private to shared conversion/truncation for overlapping ranges
> ...
> Core1
> kvm_slot_gmem_register_callback() on the guest_memfd ranges converted
> above (This will end up registering callback for guest_memfd ranges
> which possibly don't carry *_MAPPABILITY_NONE)
>
In my model (I'm working through internal processes to open source this)
I set up the folio_put() callback to be registered on truncation
regardless of mappability state.
The folio_put() callback has multiple purposes, see slide 5 of this deck
[1]:
1. Transitioning mappability from NONE to GUEST
2. Merging the folio if it is ready for merging
3. Keeping the subfolio around (even if refcount == 0) until the folio is
ready to be merged or returned to hugetlb
So it is okay and in fact better to have the callback registered:
1. Folios with mappability == NONE can be transitioned to GUEST
2. Folios with mappability == GUEST/ALL can be merged if the other subfolios
are ready for merging
3. And no matter the mappability, if subfolios are not yet merged, they
have to be kept around even with refcount 0 until they are merged.
The model doesn't model locking so I'll have to code it up for real to
verify this, but for now I think we should take a mappability lock
during mappability read/write, and do any necessary callback
(un)registration while holding the lock. There's no concern of nested
locking here since callback registration will purely (un)set
PGTY_guestmem and does not add/drop refcounts.
With the callback registration locked with mappability updates, the
refcounting and folio_put() callback should keep guest_memfd in a
consistent state.
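To make that concrete, a minimal sketch of the write side under those
assumptions (the mappability_lock field is hypothetical and doesn't exist in
this series, where the filemap invalidate_lock plays that role; registration
here only flips the page type, never touches refcounts, and assumes the
folio has already been unmapped):

static int gmem_update_mappability(struct inode *inode, pgoff_t index,
				   enum folio_mappability state)
{
	struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
	struct folio *folio;
	int r;

	/* May sleep, so look the folio up before taking the spinlock. */
	folio = filemap_lock_folio(inode->i_mapping, index);
	if (IS_ERR(folio))
		folio = NULL;	/* No folio yet: only the xarray needs updating. */

	spin_lock(&kvm_gmem_private(inode)->mappability_lock);	/* hypothetical field */

	r = xa_err(xa_store(mappable_offsets, index, xa_mk_value(state), GFP_ATOMIC));
	if (!r && folio) {
		/* (Un)register the callback purely by flipping the page type. */
		if (state == KVM_GMEM_NONE_MAPPABLE && !folio_test_guestmem(folio))
			__folio_set_guestmem(folio);
		else if (state != KVM_GMEM_NONE_MAPPABLE && folio_test_guestmem(folio))
			__folio_clear_guestmem(folio);
	}

	spin_unlock(&kvm_gmem_private(inode)->mappability_lock);

	if (folio) {
		folio_unlock(folio);
		/* If this drops the last reference, the callback takes the lock itself. */
		folio_put(folio);
	}

	return r;
}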
>> >
>> > Second, __gmem_register_callback() checks before returning whether all
>> > references have been dropped, and adjusts the mappability/shareability
>> > if needed.
>> >
>> > Cheers,
>> > /fuad
[1] https://lpc.events/event/18/contributions/1764/attachments/1409/3704/guest-memfd-1g-page-support-2025-02-06.pdf
^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2025-02-07 10:46 ` Ackerley Tng
@ 2025-02-10 16:04 ` Fuad Tabba
0 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-02-10 16:04 UTC (permalink / raw)
To: Ackerley Tng
Cc: Vishal Annapurve, kvm, linux-arm-msm, linux-mm, pbonzini,
chenhuacai, mpe, anup, paul.walmsley, palmer, aou, seanjc, viro,
brauner, willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko,
amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
jthoughton
Hi Ackerley,
On Fri, 7 Feb 2025 at 10:46, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Vishal Annapurve <vannapurve@google.com> writes:
>
> > On Wed, Feb 5, 2025 at 9:39 AM Vishal Annapurve <vannapurve@google.com> wrote:
> >>
> >> On Wed, Feb 5, 2025 at 2:07 AM Fuad Tabba <tabba@google.com> wrote:
> >> >
> >> > Hi Vishal,
> >> >
> >> > On Wed, 5 Feb 2025 at 00:42, Vishal Annapurve <vannapurve@google.com> wrote:
> >> > >
> >> > > On Fri, Jan 17, 2025 at 8:30 AM Fuad Tabba <tabba@google.com> wrote:
> >> > > >
> >> > > > Before transitioning a guest_memfd folio to unshared, thereby
> >> > > > disallowing access by the host and allowing the hypervisor to
> >> > > > transition its view of the guest page as private, we need to be
> >> > > > sure that the host doesn't have any references to the folio.
> >> > > >
> >> > > > This patch introduces a new type for guest_memfd folios, and uses
> >> > > > that to register a callback that informs the guest_memfd
> >> > > > subsystem when the last reference is dropped, therefore knowing
> >> > > > that the host doesn't have any remaining references.
> >> > > >
> >> > > > Signed-off-by: Fuad Tabba <tabba@google.com>
> >> > > > ---
> >> > > > The function kvm_slot_gmem_register_callback() isn't used in this
> >> > > > series. It will be used later in code that performs unsharing of
> >> > > > memory. I have tested it with pKVM, based on downstream code [*].
> >> > > > It's included in this RFC since it demonstrates the plan to
> >> > > > handle unsharing of private folios.
> >> > > >
> >> > > > [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v5-pkvm
> >> > >
> >> > > Should the invocation of kvm_slot_gmem_register_callback() happen in
> >> > > the same critical block as setting the guest memfd range mappability
> >> > > to NONE, otherwise conversion/truncation could race with registration
> >> > > of callback?
> >> >
> >> > I don't think it needs to, at least not as far as potential races are
> >> > concerned. First, because kvm_slot_gmem_register_callback() grabs the
> >> > mapping's invalidate_lock as well as the folio lock, and
> >> > gmem_clear_mappable() grabs the mapping lock and the folio lock if a
> >> > folio has been allocated before.
> >>
> >> I was hinting towards such a scenario:
> >> Core1
> >> Shared to private conversion
> >> -> Results in mappability attributes
> >> being set to NONE
> >> ...
> >> Trigger private to shared conversion/truncation for
> >> ...
> >> overlapping ranges
> >> ...
> >> kvm_slot_gmem_register_callback() on
> >> the guest_memfd ranges converted
> >> above (This will end up registering callback
> >> for guest_memfd ranges which possibly don't
> >> carry *_MAPPABILITY_NONE)
> >>
> >
> > Sorry for the format mess above.
> >
> > I was hinting towards such a scenario:
> > Core1-
> > Shared to private conversion -> Results in mappability attributes
> > being set to NONE
> > ...
> > Core2
> > Trigger private to shared conversion/truncation for overlapping ranges
> > ...
> > Core1
> > kvm_slot_gmem_register_callback() on the guest_memfd ranges converted
> > above (This will end up registering callback for guest_memfd ranges
> > which possibly don't carry *_MAPPABILITY_NONE)
> >
>
> In my model (I'm working through internal processes to open source this)
> > I set up the folio_put() callback to be registered on truncation
> regardless of mappability state.
>
> The folio_put() callback has multiple purposes, see slide 5 of this deck
> [1]:
>
> 1. Transitioning mappability from NONE to GUEST
> 2. Merging the folio if it is ready for merging
> > 3. Keeping the subfolio around (even if refcount == 0) until the folio is
> > ready for merging or is returned to hugetlb
>
> So it is okay and in fact better to have the callback registered:
>
> 1. Folios with mappability == NONE can be transitioned to GUEST
> 2. Folios with mappability == GUEST/ALL can be merged if the other subfolios
> are ready for merging
> 3. And no matter the mappability, if subfolios are not yet merged, they
> have to be kept around even with refcount 0 until they are merged.
>
> > The model doesn't cover locking, so I'll have to code it up for real to
> verify this, but for now I think we should take a mappability lock
> during mappability read/write, and do any necessary callback
> (un)registration while holding the lock. There's no concern of nested
> locking here since callback registration will purely (un)set
> PGTY_guest_memfd and does not add/drop refcounts.
>
> With the callback registration locked with mappability updates, the
> refcounting and folio_put() callback should keep guest_memfd in a
> consistent state.
So if I understand you correctly, we'll need to always register the
callback for large folios, right? If that's the case, we could expand
the check that decides whether to register the callback, and ensure it's
always registered for large folios, since, like I said, for small folios
the common case is that it would just be additional overhead. Right?
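Roughly, the check I have in mind would look something like this (just a
sketch; gmem_mappability_is_none() and the callback-registration helper are
placeholder names, not functions from this series):

static bool gmem_should_register_callback(struct inode *inode, pgoff_t index,
                                          struct folio *folio)
{
        /*
         * Always register the folio_put() callback for large folios, since
         * they need it for merging regardless of mappability; for small
         * folios, register it only when the range is transitioning to NONE
         * and we need to wait for host references to drain.
         */
        return folio_test_large(folio) ||
               gmem_mappability_is_none(inode, index);  /* placeholder */
}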
Cheers,
/fuad
> >> >
> >> > Second, __gmem_register_callback() checks before returning whether all
> >> > references have been dropped, and adjusts the mappability/shareability
> >> > if needed.
> >> >
> >> > Cheers,
> >> > /fuad
>
> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3704/guest-memfd-1g-page-support-2025-02-06.pdf
* Re: [RFC PATCH v5 02/15] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
2025-01-24 4:25 ` Gavin Shan
2025-01-29 10:12 ` Fuad Tabba
@ 2025-02-11 15:58 ` Ackerley Tng
1 sibling, 0 replies; 60+ messages in thread
From: Ackerley Tng @ 2025-02-11 15:58 UTC (permalink / raw)
To: Gavin Shan
Cc: tabba, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe,
anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
jthoughton
Thanks for reviewing, Gavin! I'll also adopt these when I respin.
Gavin Shan <gshan@redhat.com> writes:
> Hi Fuad,
>
> On 1/18/25 2:29 AM, Fuad Tabba wrote:
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> Using guest mem inodes allows us to store metadata for the backing
>> memory on the inode. Metadata will be added in a later patch to
>> support HugeTLB pages.
>>
>> Metadata about backing memory should not be stored on the file, since
>> the file represents a guest_memfd's binding with a struct kvm, and
>> metadata about backing memory is not unique to a specific binding and
>> struct kvm.
>>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> Signed-off-by: Fuad Tabba <tabba@google.com>
>> ---
>> include/uapi/linux/magic.h | 1 +
>> virt/kvm/guest_memfd.c | 119 ++++++++++++++++++++++++++++++-------
>> 2 files changed, 100 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
>> index bb575f3ab45e..169dba2a6920 100644
>> --- a/include/uapi/linux/magic.h
>> +++ b/include/uapi/linux/magic.h
>> @@ -103,5 +103,6 @@
>> #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
>> #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
>> #define PID_FS_MAGIC 0x50494446 /* "PIDF" */
>> +#define GUEST_MEMORY_MAGIC 0x474d454d /* "GMEM" */
>>
>> #endif /* __LINUX_MAGIC_H__ */
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 47a9f68f7b24..198554b1f0b5 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -1,12 +1,17 @@
>> // SPDX-License-Identifier: GPL-2.0
>> +#include <linux/fs.h>
>> +#include <linux/mount.h>
>
> This can be dropped since "linux/mount.h" is already included by "linux/fs.h".
>
>> #include <linux/backing-dev.h>
>> #include <linux/falloc.h>
>> #include <linux/kvm_host.h>
>> +#include <linux/pseudo_fs.h>
>> #include <linux/pagemap.h>
>> #include <linux/anon_inodes.h>
>>
>> #include "kvm_mm.h"
>>
>> +static struct vfsmount *kvm_gmem_mnt;
>> +
>> struct kvm_gmem {
>> struct kvm *kvm;
>> struct xarray bindings;
>> @@ -307,6 +312,38 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
>> return gfn - slot->base_gfn + slot->gmem.pgoff;
>> }
>>
>> +static const struct super_operations kvm_gmem_super_operations = {
>> + .statfs = simple_statfs,
>> +};
>> +
>> +static int kvm_gmem_init_fs_context(struct fs_context *fc)
>> +{
>> + struct pseudo_fs_context *ctx;
>> +
>> + if (!init_pseudo(fc, GUEST_MEMORY_MAGIC))
>> + return -ENOMEM;
>> +
>> + ctx = fc->fs_private;
>> + ctx->ops = &kvm_gmem_super_operations;
>> +
>> + return 0;
>> +}
>> +
>> +static struct file_system_type kvm_gmem_fs = {
>> + .name = "kvm_guest_memory",
>> + .init_fs_context = kvm_gmem_init_fs_context,
>> + .kill_sb = kill_anon_super,
>> +};
>> +
>> +static void kvm_gmem_init_mount(void)
>> +{
>> + kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
>> + BUG_ON(IS_ERR(kvm_gmem_mnt));
>> +
>> + /* For giggles. Userspace can never map this anyways. */
>> + kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
>> +}
>> +
>> static struct file_operations kvm_gmem_fops = {
>> .open = generic_file_open,
>> .release = kvm_gmem_release,
>> @@ -316,6 +353,8 @@ static struct file_operations kvm_gmem_fops = {
>> void kvm_gmem_init(struct module *module)
>> {
>> kvm_gmem_fops.owner = module;
>> +
>> + kvm_gmem_init_mount();
>> }
>>
>> static int kvm_gmem_migrate_folio(struct address_space *mapping,
>> @@ -397,11 +436,67 @@ static const struct inode_operations kvm_gmem_iops = {
>> .setattr = kvm_gmem_setattr,
>> };
>>
>> +static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>> + loff_t size, u64 flags)
>> +{
>> + const struct qstr qname = QSTR_INIT(name, strlen(name));
>> + struct inode *inode;
>> + int err;
>> +
>> + inode = alloc_anon_inode(kvm_gmem_mnt->mnt_sb);
>> + if (IS_ERR(inode))
>> + return inode;
>> +
>> + err = security_inode_init_security_anon(inode, &qname, NULL);
>> + if (err) {
>> + iput(inode);
>> + return ERR_PTR(err);
>> + }
>> +
>> + inode->i_private = (void *)(unsigned long)flags;
>> + inode->i_op = &kvm_gmem_iops;
>> + inode->i_mapping->a_ops = &kvm_gmem_aops;
>> + inode->i_mode |= S_IFREG;
>> + inode->i_size = size;
>> + mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
>> + mapping_set_inaccessible(inode->i_mapping);
>> + /* Unmovable mappings are supposed to be marked unevictable as well. */
>> + WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>> +
>> + return inode;
>> +}
>> +
>> +static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
>> + u64 flags)
>> +{
>> + static const char *name = "[kvm-gmem]";
>> + struct inode *inode;
>> + struct file *file;
>> +
>> + if (kvm_gmem_fops.owner && !try_module_get(kvm_gmem_fops.owner))
>> + return ERR_PTR(-ENOENT);
>> +
>
> The validation on 'kvm_gmem_fops.owner' can be removed, since try_module_get()
> and module_put() handle a NULL parameter gracefully, even when CONFIG_MODULE_UNLOAD=n.
>
> A module_put(kvm_gmem_fops.owner) is needed in the various error paths of
> this function. Otherwise, the reference count of the owner (module) will
> become imbalanced on any error.
>
Thanks for catching this! Will add module_put() for error paths.
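For reference, this is roughly the shape I have in mind for the error
paths (a sketch only, not the respin itself; it also drops the NULL check
on .owner per your comment above):

static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
                                                  u64 flags)
{
        static const char *name = "[kvm-gmem]";
        struct inode *inode;
        struct file *file;
        int err;

        /* try_module_get(NULL) is fine, so no need to check .owner first. */
        if (!try_module_get(kvm_gmem_fops.owner))
                return ERR_PTR(-ENOENT);

        inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
        if (IS_ERR(inode)) {
                err = PTR_ERR(inode);
                goto err_put_module;
        }

        file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
                                 &kvm_gmem_fops);
        if (IS_ERR(file)) {
                err = PTR_ERR(file);
                goto err_put_inode;
        }

        file->f_mapping = inode->i_mapping;
        file->f_flags |= O_LARGEFILE;
        file->private_data = priv;

        return file;

err_put_inode:
        iput(inode);
err_put_module:
        module_put(kvm_gmem_fops.owner);
        return ERR_PTR(err);
}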
>
>> + inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
>> + if (IS_ERR(inode))
>> + return ERR_CAST(inode);
>> +
>
> ERR_CAST may be dropped since there is nothing to be cast or converted?
>
The cast is necessary: it converts the error pointer from a struct
inode * to the struct file * that this function returns.
>> + file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
>> + &kvm_gmem_fops);
>> + if (IS_ERR(file)) {
>> + iput(inode);
>> + return file;
>> + }
>> +
>> + file->f_mapping = inode->i_mapping;
>> + file->f_flags |= O_LARGEFILE;
>> + file->private_data = priv;
>> +
>
> 'file->f_mapping = inode->i_mapping' may be dropped since it's already correctly
> set by alloc_file_pseudo().
>
> alloc_file_pseudo
> alloc_path_pseudo
> alloc_file
> alloc_empty_file
> file_init_path // Set by this function
>
Thanks!
>
>> + return file;
>> +}
>> +
>> static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>> {
>> - const char *anon_name = "[kvm-gmem]";
>> struct kvm_gmem *gmem;
>> - struct inode *inode;
>> struct file *file;
>> int fd, err;
>>
>> @@ -415,32 +510,16 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>> goto err_fd;
>> }
>>
>> - file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
>> - O_RDWR, NULL);
>> + file = kvm_gmem_inode_create_getfile(gmem, size, flags);
>> if (IS_ERR(file)) {
>> err = PTR_ERR(file);
>> goto err_gmem;
>> }
>>
>> - file->f_flags |= O_LARGEFILE;
>> -
>> - inode = file->f_inode;
>> - WARN_ON(file->f_mapping != inode->i_mapping);
>> -
>> - inode->i_private = (void *)(unsigned long)flags;
>> - inode->i_op = &kvm_gmem_iops;
>> - inode->i_mapping->a_ops = &kvm_gmem_aops;
>> - inode->i_mode |= S_IFREG;
>> - inode->i_size = size;
>> - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
>> - mapping_set_inaccessible(inode->i_mapping);
>> - /* Unmovable mappings are supposed to be marked unevictable as well. */
>> - WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>> -
>> kvm_get_kvm(kvm);
>> gmem->kvm = kvm;
>> xa_init(&gmem->bindings);
>> - list_add(&gmem->entry, &inode->i_mapping->i_private_list);
>> + list_add(&gmem->entry, &file_inode(file)->i_mapping->i_private_list);
>>
>> fd_install(fd, file);
>> return fd;
>
> Thanks,
> Gavin
* Re: [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition
2025-01-17 16:29 ` [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition Fuad Tabba
2025-01-20 10:30 ` Kirill A. Shutemov
@ 2025-02-19 23:33 ` Ackerley Tng
2025-02-20 9:26 ` Fuad Tabba
1 sibling, 1 reply; 60+ messages in thread
From: Ackerley Tng @ 2025-02-19 23:33 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Fuad Tabba <tabba@google.com> writes:
This question should not block merging of this series since performance
can be improved in a separate series:
> <snip>
>
> +
> +/*
> + * Marks the range [start, end) as mappable by both the host and the guest.
> + * Usually called when guest shares memory with the host.
> + */
> +static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> + void *xval = xa_mk_value(KVM_GMEM_ALL_MAPPABLE);
> + pgoff_t i;
> + int r = 0;
> +
> + filemap_invalidate_lock(inode->i_mapping);
> + for (i = start; i < end; i++) {
Were any alternative data structures considered, or does anyone have
suggestions for alternatives? Doing xa_store() in a loop here will take
a long time for large ranges.
I looked into the following:
Option 1: (preferred) Maple trees
Maple tree has a nice API, though it would be better if it could merge
adjacent ranges that have the same value.
I will have to dig into performance, but I'm assuming that even large
ranges are stored in a few nodes so this would be faster than iterating
over indices in an xarray.
void explore_maple_tree(void)
{
        DEFINE_MTREE(mt);

        mt_init_flags(&mt, MT_FLAGS_LOCK_EXTERN | MT_FLAGS_USE_RCU);

        mtree_store_range(&mt, 0, 16, xa_mk_value(0x20), GFP_KERNEL);
        mtree_store_range(&mt, 8, 24, xa_mk_value(0x32), GFP_KERNEL);
        mtree_store_range(&mt, 5, 10, xa_mk_value(0x32), GFP_KERNEL);

        {
                void *entry;
                MA_STATE(mas, &mt, 0, 0);

                mas_for_each(&mas, entry, ULONG_MAX) {
                        pr_err("[%ld, %ld]: 0x%lx\n", mas.index, mas.last, xa_to_value(entry));
                }
        }

        mtree_destroy(&mt);
}
stdout:
[0, 4]: 0x20
[5, 10]: 0x32
[11, 24]: 0x32
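As a rough illustration (assuming mappable_offsets were converted from an
xarray to a maple tree, which is not part of this series), gmem_set_mappable()
could then collapse into a single range store:

static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
{
        /* Assumes kvm_gmem_private(inode)->mappable_offsets is a maple tree. */
        struct maple_tree *mt = &kvm_gmem_private(inode)->mappable_offsets;
        void *xval = xa_mk_value(KVM_GMEM_ALL_MAPPABLE);
        int r;

        filemap_invalidate_lock(inode->i_mapping);
        /* One range store over [start, end) instead of one xa_store() per index. */
        r = mtree_store_range(mt, start, end - 1, xval, GFP_KERNEL);
        filemap_invalidate_unlock(inode->i_mapping);

        return r;
}

Adjacent ranges with the same value would still show up as separate entries
(as in the output above), but the number of stores no longer scales with the
number of pages in the range.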
Option 2: Multi-index xarray
The API is more complex than the maple tree's, and IIUC multi-index
xarrays are not generalizable to arbitrary ranges, so the range can't be,
for example, 8 1G pages plus one 4K page. The size of the range has to be
a power of 2 greater than 4K.
Using multi-index xarrays would mean computing order to store
multi-index entries. This can be computed from the size of the range to
be added, but is an additional source of errors.
Option 3: Interval tree, which is built on top of red-black trees
The API is set up at a lower level: a macro is used to define interval
trees, the user has to deal with nodes in the tree directly, and functions
to override sub-ranges within larger ranges have to be defined separately.
> + r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
> + if (r)
> + break;
> + }
> + filemap_invalidate_unlock(inode->i_mapping);
> +
> + return r;
> +}
> +
> +/*
> + * Marks the range [start, end) as not mappable by the host. If the host doesn't
> + * have any references to a particular folio, then that folio is marked as
> + * mappable by the guest.
> + *
> + * However, if the host still has references to the folio, then the folio is
> + * marked as not mappable by anyone. Marking it as not mappable allows it to
> + * drain all references from the host, and to ensure that the hypervisor does
> + * not transition the folio to private, since the host still might access it.
> + *
> + * Usually called when guest unshares memory with the host.
> + */
> +static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> + void *xval_none = xa_mk_value(KVM_GMEM_NONE_MAPPABLE);
> + pgoff_t i;
> + int r = 0;
> +
> + filemap_invalidate_lock(inode->i_mapping);
> + for (i = start; i < end; i++) {
> + struct folio *folio;
> + int refcount = 0;
> +
> + folio = filemap_lock_folio(inode->i_mapping, i);
> + if (!IS_ERR(folio)) {
> + refcount = folio_ref_count(folio);
> + } else {
> + r = PTR_ERR(folio);
> + if (WARN_ON_ONCE(r != -ENOENT))
> + break;
> +
> + folio = NULL;
> + }
> +
> + /* +1 references are expected because of filemap_lock_folio(). */
> + if (folio && refcount > folio_nr_pages(folio) + 1) {
> + /*
> + * Outstanding references, the folio cannot be faulted
> + * in by anyone until they're dropped.
> + */
> + r = xa_err(xa_store(mappable_offsets, i, xval_none, GFP_KERNEL));
> + } else {
> + /*
> + * No outstanding references. Transition the folio to
> + * guest mappable immediately.
> + */
> + r = xa_err(xa_store(mappable_offsets, i, xval_guest, GFP_KERNEL));
> + }
> +
> + if (folio) {
> + folio_unlock(folio);
> + folio_put(folio);
> + }
> +
> + if (WARN_ON_ONCE(r))
> + break;
> + }
> + filemap_invalidate_unlock(inode->i_mapping);
> +
> + return r;
> +}
> +
>
> <snip>
* Re: [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition
2025-02-19 23:33 ` Ackerley Tng
@ 2025-02-20 9:26 ` Fuad Tabba
0 siblings, 0 replies; 60+ messages in thread
From: Fuad Tabba @ 2025-02-20 9:26 UTC (permalink / raw)
To: Ackerley Tng
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Hi Ackerley,
On Wed, 19 Feb 2025 at 23:33, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> This question should not block merging of this series since performance
> can be improved in a separate series:
>
> > <snip>
> >
> > +
> > +/*
> > + * Marks the range [start, end) as mappable by both the host and the guest.
> > + * Usually called when guest shares memory with the host.
> > + */
> > +static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> > +{
> > + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > + void *xval = xa_mk_value(KVM_GMEM_ALL_MAPPABLE);
> > + pgoff_t i;
> > + int r = 0;
> > +
> > + filemap_invalidate_lock(inode->i_mapping);
> > + for (i = start; i < end; i++) {
>
> Were any alternative data structures considered, or does anyone have
> suggestions for alternatives? Doing xa_store() in a loop here will take
> a long time for large ranges.
>
> I looked into the following:
>
> Option 1: (preferred) Maple trees
>
> Maple tree has a nice API, though it would be better if it can combine
> ranges that have the same value.
>
> I will have to dig into performance, but I'm assuming that even large
> ranges are stored in a few nodes so this would be faster than iterating
> over indices in an xarray.
>
> void explore_maple_tree(void)
> {
> DEFINE_MTREE(mt);
>
> mt_init_flags(&mt, MT_FLAGS_LOCK_EXTERN | MT_FLAGS_USE_RCU);
>
> mtree_store_range(&mt, 0, 16, xa_mk_value(0x20), GFP_KERNEL);
> mtree_store_range(&mt, 8, 24, xa_mk_value(0x32), GFP_KERNEL);
> mtree_store_range(&mt, 5, 10, xa_mk_value(0x32), GFP_KERNEL);
>
> {
> void *entry;
> MA_STATE(mas, &mt, 0, 0);
>
> mas_for_each(&mas, entry, ULONG_MAX) {
> pr_err("[%ld, %ld]: 0x%lx\n", mas.index, mas.last, xa_to_value(entry));
> }
> }
>
> mtree_destroy(&mt);
> }
>
> stdout:
>
> [0, 4]: 0x20
> [5, 10]: 0x32
> [11, 24]: 0x32
>
> Option 2: Multi-index xarray
>
> The API is more complex than maple tree's, and IIUC multi-index xarrays
> are not generalizable to any range, so the range can't be 8 1G pages + 1
> 4K page for example. The size of the range has to be a power of 2 that
> is greater than 4K.
>
> Using multi-index xarrays would mean computing order to store
> multi-index entries. This can be computed from the size of the range to
> be added, but is an additional source of errors.
>
> Option 3: Interval tree, which is built on top of red-black trees
>
> The API is set up at a lower level. A macro is used to define interval
> trees, the user has to deal with nodes in the tree directly and
> separately define functions to override sub-ranges in larger ranges.
I didn't consider any other data structures, mainly out of laziness :)
What I mean by that is, an xarray is what is already used in guest_memfd
for tracking other gfn-related items, even though many have talked about
replacing it with something else in the future.
I agree with you that it's not the ideal data structure, but also, like
you said, this isn't part of the interface, so it would be easy to
replace later. As you mention, one of the challenges is figuring out the
performance impact in practice; once the interface is more or less
settled, some benchmarking would be useful to guide us here.
Thanks!
/fuad
> > + r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
> > + if (r)
> > + break;
> > + }
> > + filemap_invalidate_unlock(inode->i_mapping);
> > +
> > + return r;
> > +}
> > +
> > +/*
> > + * Marks the range [start, end) as not mappable by the host. If the host doesn't
> > + * have any references to a particular folio, then that folio is marked as
> > + * mappable by the guest.
> > + *
> > + * However, if the host still has references to the folio, then the folio is
> > + * marked as not mappable by anyone. Marking it as not mappable allows it to
> > + * drain all references from the host, and to ensure that the hypervisor does
> > + * not transition the folio to private, since the host still might access it.
> > + *
> > + * Usually called when guest unshares memory with the host.
> > + */
> > +static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
> > +{
> > + struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
> > + void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
> > + void *xval_none = xa_mk_value(KVM_GMEM_NONE_MAPPABLE);
> > + pgoff_t i;
> > + int r = 0;
> > +
> > + filemap_invalidate_lock(inode->i_mapping);
> > + for (i = start; i < end; i++) {
> > + struct folio *folio;
> > + int refcount = 0;
> > +
> > + folio = filemap_lock_folio(inode->i_mapping, i);
> > + if (!IS_ERR(folio)) {
> > + refcount = folio_ref_count(folio);
> > + } else {
> > + r = PTR_ERR(folio);
> > + if (WARN_ON_ONCE(r != -ENOENT))
> > + break;
> > +
> > + folio = NULL;
> > + }
> > +
> > + /* +1 references are expected because of filemap_lock_folio(). */
> > + if (folio && refcount > folio_nr_pages(folio) + 1) {
> > + /*
> > + * Outstanding references, the folio cannot be faulted
> > + * in by anyone until they're dropped.
> > + */
> > + r = xa_err(xa_store(mappable_offsets, i, xval_none, GFP_KERNEL));
> > + } else {
> > + /*
> > + * No outstanding references. Transition the folio to
> > + * guest mappable immediately.
> > + */
> > + r = xa_err(xa_store(mappable_offsets, i, xval_guest, GFP_KERNEL));
> > + }
> > +
> > + if (folio) {
> > + folio_unlock(folio);
> > + folio_put(folio);
> > + }
> > +
> > + if (WARN_ON_ONCE(r))
> > + break;
> > + }
> > + filemap_invalidate_unlock(inode->i_mapping);
> > +
> > + return r;
> > +}
> > +
> >
> > <snip>
* Re: [RFC PATCH v5 04/15] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
2025-01-29 10:15 ` Fuad Tabba
@ 2025-02-26 22:29 ` Ackerley Tng
0 siblings, 0 replies; 60+ messages in thread
From: Ackerley Tng @ 2025-02-26 22:29 UTC (permalink / raw)
To: Fuad Tabba
Cc: gshan, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe,
anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
jthoughton
Fuad Tabba <tabba@google.com> writes:
> Hi Gavin,
>
> On Fri, 24 Jan 2025 at 05:32, Gavin Shan <gshan@redhat.com> wrote:
>>
>> Hi Fuad,
>>
>> On 1/18/25 2:29 AM, Fuad Tabba wrote:
>> > From: Ackerley Tng <ackerleytng@google.com>
>> >
>> > Track whether guest_memfd memory can be mapped within the inode,
>> > since it is a property of the guest_memfd's memory contents.
>> >
>> > The guest_memfd PRIVATE memory attribute is not used for two
>> > reasons. First because it reflects the userspace expectation for
>> > that memory location, and therefore can be toggled by userspace.
>> > The second is, although each guest_memfd file has a 1:1 binding
>> > with a KVM instance, the plan is to allow multiple files per
>> > inode, e.g. to allow intra-host migration to a new KVM instance,
>> > without destroying guest_memfd.
>> >
>> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> > Co-developed-by: Vishal Annapurve <vannapurve@google.com>
>> > Signed-off-by: Vishal Annapurve <vannapurve@google.com>
>> > Co-developed-by: Fuad Tabba <tabba@google.com>
>> > Signed-off-by: Fuad Tabba <tabba@google.com>
>> > ---
>> > virt/kvm/guest_memfd.c | 56 ++++++++++++++++++++++++++++++++++++++----
>> > 1 file changed, 51 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> > index 6453658d2650..0a7b6cf8bd8f 100644
>> > --- a/virt/kvm/guest_memfd.c
>> > +++ b/virt/kvm/guest_memfd.c
>> > @@ -18,6 +18,17 @@ struct kvm_gmem {
>> > struct list_head entry;
>> > };
>> >
>> > +struct kvm_gmem_inode_private {
>> > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
>> > + struct xarray mappable_offsets;
>> > +#endif
>> > +};
>> > +
>> > +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
>> > +{
>> > + return inode->i_mapping->i_private_data;
>> > +}
>> > +
>> > /**
>> > * folio_file_pfn - like folio_file_page, but return a pfn.
>> > * @folio: The folio which contains this index.
>> > @@ -312,8 +323,28 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
>> > return gfn - slot->base_gfn + slot->gmem.pgoff;
>> > }
>> >
>> > +static void kvm_gmem_evict_inode(struct inode *inode)
>> > +{
>> > + struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
>> > +
>> > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
>> > + /*
>> > + * .evict_inode can be called before private data is set up if there are
>> > + * issues during inode creation.
>> > + */
>> > + if (private)
>> > + xa_destroy(&private->mappable_offsets);
>> > +#endif
>> > +
>> > + truncate_inode_pages_final(inode->i_mapping);
>> > +
>> > + kfree(private);
>> > + clear_inode(inode);
>> > +}
>> > +
>> > static const struct super_operations kvm_gmem_super_operations = {
>> > - .statfs = simple_statfs,
>> > + .statfs = simple_statfs,
>> > + .evict_inode = kvm_gmem_evict_inode,
>> > };
>> >
>>
>> As I understand it, ->destroy_inode() may be a more suitable place to release
>> the xarray. ->evict_inode() usually detaches the inode from the existing
>> structures to take it offline, while ->destroy_inode() is the place where the
>> associated resources (memory) are actually released.
>>
>> Another benefit of ->destroy_inode() is that we don't need to worry about
>> truncate_inode_pages_final() and clear_inode().
>
> I see. I'll give this a try.
>
While working on 1G page support (old revision at [1]), I was looking at
this.
Using .destroy_inode to clean up private->mappable_offsets should work
fine, and I agree this should be refactored to use .destroy_inode
instead. Thanks for pointing this out!
FWIW, for 1G page support the truncation process has to be overridden
too, so the .evict_inode override will have to come back.
>>
>> > static int kvm_gmem_init_fs_context(struct fs_context *fc)
>> > @@ -440,6 +471,7 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>> > loff_t size, u64 flags)
>> > {
>> > const struct qstr qname = QSTR_INIT(name, strlen(name));
>> > + struct kvm_gmem_inode_private *private;
>> > struct inode *inode;
>> > int err;
>> >
>> > @@ -448,10 +480,19 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>> > return inode;
>> >
>> > err = security_inode_init_security_anon(inode, &qname, NULL);
>> > - if (err) {
>> > - iput(inode);
>> > - return ERR_PTR(err);
>> > - }
>> > + if (err)
>> > + goto out;
>> > +
>> > + err = -ENOMEM;
>> > + private = kzalloc(sizeof(*private), GFP_KERNEL);
>> > + if (!private)
>> > + goto out;
>> > +
>> > +#ifdef CONFIG_KVM_GMEM_MAPPABLE
>> > + xa_init(&private->mappable_offsets);
>> > +#endif
>> > +
>> > + inode->i_mapping->i_private_data = private;
>> >
>>
>> The whole block of code needs to be guarded by CONFIG_KVM_GMEM_MAPPABLE because
>> kzalloc(sizeof(...)) is translated to kzalloc(0) when CONFIG_KVM_GMEM_MAPPABLE
>> is disabled, and kzalloc() will always fail. It will lead to unusable guest-memfd
>> if CONFIG_KVM_GMEM_MAPPABLE is disabled.
>
> Good point, thanks for pointing this out.
>
> Cheers,
> /fuad
>
>> > inode->i_private = (void *)(unsigned long)flags;
>> > inode->i_op = &kvm_gmem_iops;
>> > @@ -464,6 +505,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>> > WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>> >
>> > return inode;
>> > +
>> > +out:
>> > + iput(inode);
>> > +
>> > + return ERR_PTR(err);
>> > }
>> >
>> > static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
>>
>> Thanks,
>> Gavin
>>
[1] https://lore.kernel.org/all/cover.1726009989.git.ackerleytng@google.com/T/
Thread overview: 60+ messages (newest: 2025-02-26 22:29 UTC)
2025-01-17 16:29 [RFC PATCH v5 00/15] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 01/15] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
2025-01-17 22:05 ` Elliot Berman
2025-01-19 14:39 ` Fuad Tabba
2025-01-20 10:39 ` David Hildenbrand
2025-01-20 10:50 ` Fuad Tabba
2025-01-20 10:39 ` David Hildenbrand
2025-01-20 10:43 ` Fuad Tabba
2025-01-20 10:43 ` Vlastimil Babka
2025-01-20 11:12 ` Vlastimil Babka
2025-01-20 11:28 ` David Hildenbrand
2025-01-17 16:29 ` [RFC PATCH v5 02/15] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
2025-01-24 4:25 ` Gavin Shan
2025-01-29 10:12 ` Fuad Tabba
2025-02-11 15:58 ` Ackerley Tng
2025-01-17 16:29 ` [RFC PATCH v5 03/15] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 04/15] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private Fuad Tabba
2025-01-24 5:31 ` Gavin Shan
2025-01-29 10:15 ` Fuad Tabba
2025-02-26 22:29 ` Ackerley Tng
2025-01-17 16:29 ` [RFC PATCH v5 05/15] KVM: guest_memfd: Folio mappability states and functions that manage their transition Fuad Tabba
2025-01-20 10:30 ` Kirill A. Shutemov
2025-01-20 10:40 ` Fuad Tabba
2025-02-06 3:14 ` Ackerley Tng
2025-02-06 9:45 ` Fuad Tabba
2025-02-19 23:33 ` Ackerley Tng
2025-02-20 9:26 ` Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 06/15] KVM: guest_memfd: Handle final folio_put() of guestmem pages Fuad Tabba
2025-01-20 11:37 ` Vlastimil Babka
2025-01-20 12:14 ` Fuad Tabba
2025-01-22 22:24 ` Ackerley Tng
2025-01-23 11:00 ` Fuad Tabba
2025-02-06 3:18 ` Ackerley Tng
2025-02-06 3:28 ` Ackerley Tng
2025-02-06 9:47 ` Fuad Tabba
2025-01-30 14:23 ` Fuad Tabba
2025-01-22 22:16 ` Ackerley Tng
2025-01-23 9:50 ` Fuad Tabba
2025-02-05 1:28 ` Vishal Annapurve
2025-02-05 4:31 ` Ackerley Tng
2025-02-05 5:58 ` Vishal Annapurve
2025-02-05 0:42 ` Vishal Annapurve
2025-02-05 10:06 ` Fuad Tabba
2025-02-05 17:39 ` Vishal Annapurve
2025-02-05 17:42 ` Vishal Annapurve
2025-02-07 10:46 ` Ackerley Tng
2025-02-10 16:04 ` Fuad Tabba
2025-02-05 0:51 ` Vishal Annapurve
2025-02-05 10:07 ` Fuad Tabba
2025-02-06 3:37 ` Ackerley Tng
2025-02-06 9:49 ` Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 07/15] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 08/15] KVM: guest_memfd: Add guest_memfd support to kvm_(read|/write)_guest_page() Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 09/15] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 10/15] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as mappable Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 11/15] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 12/15] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
2025-01-17 16:29 ` [RFC PATCH v5 13/15] KVM: arm64: Refactor user_mem_abort() calculation of force_pte Fuad Tabba
2025-01-17 16:30 ` [RFC PATCH v5 14/15] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
2025-01-17 16:30 ` [RFC PATCH v5 15/15] KVM: arm64: Enable guest_memfd private memory when pKVM is enabled Fuad Tabba