linux-mm.kvack.org archive mirror
* [PATCH v3 00/11] KVM: Mapping guest_memfd backed memory at the host for software protected VMs
@ 2025-02-11 12:11 Fuad Tabba
  2025-02-11 12:11 ` [PATCH v3 01/11] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
                   ` (10 more replies)
  0 siblings, 11 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 12:11 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

Changes since v2 [1]:
- Added more documentation
- Hook the folio_put callback as a stub with a warning
- Tidying up and refactoring
- Rebased on Linux 6.14-rc2

The purpose of this series is to serve as a base for _restricted_
mmap() support for guest_memfd backed memory at the host [2]. It
allows experimenting with what that support would look like in
the safe environment of the software VM types, which are meant
for testing and development.

For more background and how to test this series, please refer to v2 [1].
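
As a rough sketch of what the series enables (illustrative only;
it assumes the x86 KVM_X86_SW_PROTECTED_VM type and the
KVM_CAP_GMEM_SHARED_MEM capability introduced in patch 4):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/kvm.h>

	int main(void)
	{
		int kvm_fd = open("/dev/kvm", O_RDWR);
		/* On arm64, KVM_VM_TYPE_ARM_SW_PROTECTED (patch 9) instead. */
		int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_SW_PROTECTED_VM);
		struct kvm_create_guest_memfd gmem = { .size = 0x4000 };
		int fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

		/* With this series, mmap() of guest_memfd succeeds for VM
		 * types that support shared memory. */
		void *mem = mmap(NULL, gmem.size, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);
		return mem == MAP_FAILED;
	}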

Cheers,
/fuad

[1] https://lore.kernel.org/all/20250129172320.950523-1-tabba@google.com/
[2] https://lore.kernel.org/all/20250117163001.2326672-1-tabba@google.com/

Fuad Tabba (11):
  mm: Consolidate freeing of typed folios on final folio_put()
  KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  KVM: guest_memfd: Allow host to map guest_memfd() pages
  KVM: guest_memfd: Add KVM capability to check if guest_memfd is shared
  KVM: guest_memfd: Handle in-place shared memory as guest_memfd backed
    memory
  KVM: x86: Mark KVM_X86_SW_PROTECTED_VM as supporting guest_memfd
    shared memory
  KVM: arm64: Refactor user_mem_abort() calculation of force_pte
  KVM: arm64: Handle guest_memfd()-backed guest page faults
  KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type
  KVM: arm64: Enable mapping guest_memfd in arm64
  KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is
    allowed

 Documentation/virt/kvm/api.rst                |   5 +
 arch/arm64/include/asm/kvm_host.h             |  10 ++
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/arm.c                          |   5 +
 arch/arm64/kvm/mmu.c                          |  91 ++++++++++------
 arch/x86/include/asm/kvm_host.h               |   5 +
 arch/x86/kvm/Kconfig                          |   3 +-
 include/linux/kvm_host.h                      |  28 ++++-
 include/linux/page-flags.h                    |  32 ++++++
 include/uapi/linux/kvm.h                      |   7 ++
 mm/debug.c                                    |   1 +
 mm/swap.c                                     |  32 +++++-
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  |  75 +++++++++++--
 tools/testing/selftests/kvm/lib/kvm_util.c    |   3 +-
 virt/kvm/Kconfig                              |   4 +
 virt/kvm/guest_memfd.c                        | 100 ++++++++++++++++++
 virt/kvm/kvm_main.c                           |   9 +-
 18 files changed, 360 insertions(+), 52 deletions(-)


base-commit: a64dcfb451e254085a7daee5fe51bf22959d52d3
-- 
2.48.1.502.g6dc24dfdaf-goog



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 01/11] mm: Consolidate freeing of typed folios on final folio_put()
  2025-02-11 12:11 [PATCH v3 00/11] KVM: Mapping guest_memfd backed memory at the host for software protected VMs Fuad Tabba
@ 2025-02-11 12:11 ` Fuad Tabba
  2025-02-17  9:33   ` Vlastimil Babka
  2025-02-20 11:17   ` David Hildenbrand
  2025-02-11 12:11 ` [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages Fuad Tabba
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 12:11 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

Some folio types, such as hugetlb, handle freeing their own
folios. Moreover, guest_memfd will require being notified once a
folio's reference count reaches 0 to facilitate shared-to-private
folio conversion, without the folio actually being freed at that
point.

As a first step towards that, this patch consolidates the freeing
of folios that have a type. The first user is hugetlb folios.
Later in this series, guest_memfd will become the second user.

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/page-flags.h | 15 +++++++++++++++
 mm/swap.c                  | 23 ++++++++++++++++++-----
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 36d283552f80..6dc2494bd002 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -953,6 +953,21 @@ static inline bool page_has_type(const struct page *page)
 	return page_mapcount_is_type(data_race(page->page_type));
 }
 
+static inline int page_get_type(const struct page *page)
+{
+	return page->page_type >> 24;
+}
+
+static inline bool folio_has_type(const struct folio *folio)
+{
+	return page_has_type(&folio->page);
+}
+
+static inline int folio_get_type(const struct folio *folio)
+{
+	return page_get_type(&folio->page);
+}
+
 #define FOLIO_TYPE_OPS(lname, fname)					\
 static __always_inline bool folio_test_##fname(const struct folio *folio) \
 {									\
diff --git a/mm/swap.c b/mm/swap.c
index fc8281ef4241..47bc1bb919cc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -94,6 +94,19 @@ static void page_cache_release(struct folio *folio)
 		unlock_page_lruvec_irqrestore(lruvec, flags);
 }
 
+static void free_typed_folio(struct folio *folio)
+{
+	switch (folio_get_type(folio)) {
+#ifdef CONFIG_HUGETLBFS
+	case PGTY_hugetlb:
+		free_huge_folio(folio);
+		return;
+#endif
+	default:
+		WARN_ON_ONCE(1);
+	}
+}
+
 void __folio_put(struct folio *folio)
 {
 	if (unlikely(folio_is_zone_device(folio))) {
@@ -101,8 +114,8 @@ void __folio_put(struct folio *folio)
 		return;
 	}
 
-	if (folio_test_hugetlb(folio)) {
-		free_huge_folio(folio);
+	if (unlikely(folio_has_type(folio))) {
+		free_typed_folio(folio);
 		return;
 	}
 
@@ -966,13 +979,13 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 		if (!folio_ref_sub_and_test(folio, nr_refs))
 			continue;
 
-		/* hugetlb has its own memcg */
-		if (folio_test_hugetlb(folio)) {
+		if (unlikely(folio_has_type(folio))) {
+			/* typed folios have their own memcg, if any */
 			if (lruvec) {
 				unlock_page_lruvec_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
-			free_huge_folio(folio);
+			free_typed_folio(folio);
 			continue;
 		}
 		folio_unqueue_deferred_split(folio);
-- 
2.48.1.502.g6dc24dfdaf-goog



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-11 12:11 [PATCH v3 00/11] KVM: Mapping guest_memfd backed memory at the host for software protected VMs Fuad Tabba
  2025-02-11 12:11 ` [PATCH v3 01/11] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
@ 2025-02-11 12:11 ` Fuad Tabba
  2025-02-12 18:19   ` Peter Xu
                     ` (3 more replies)
  2025-02-11 12:11 ` [PATCH v3 03/11] KVM: guest_memfd: Allow host to map guest_memfd() pages Fuad Tabba
                   ` (8 subsequent siblings)
  10 siblings, 4 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 12:11 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

Before transitioning a guest_memfd folio to unshared, thereby
disallowing access by the host and allowing the hypervisor to
transition its view of the guest page to private, we need to be
sure that the host doesn't have any references to the folio.

This patch introduces a new type for guest_memfd folios, which
isn't activated in this series but is here as a placeholder and
to facilitate the code in the next patch. It will be used in the
future to register a callback that informs the guest_memfd
subsystem when the last reference is dropped, so that it knows
the host has no remaining references.
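
With this and the previous patch, the final-put path for a folio
of type guestmem becomes (as wired up below; the handler itself
is still a placeholder):

	folio_put(folio)
	  __folio_put(folio)		/* refcount reached zero */
	    free_typed_folio(folio)	/* folio_has_type() is true */
	      kvm_gmem_handle_folio_put(folio)	/* PGTY_guestmem */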

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h   |  9 +++++++++
 include/linux/page-flags.h | 17 +++++++++++++++++
 mm/debug.c                 |  1 +
 mm/swap.c                  |  9 +++++++++
 virt/kvm/guest_memfd.c     |  7 +++++++
 5 files changed, 43 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f34f4cfaa513..8b5f28f6efff 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2571,4 +2571,13 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range);
 #endif
 
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+void kvm_gmem_handle_folio_put(struct folio *folio);
+#else
+static inline void kvm_gmem_handle_folio_put(struct folio *folio)
+{
+	WARN_ON_ONCE(1);
+}
+#endif
+
 #endif
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6dc2494bd002..734afda268ab 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -933,6 +933,17 @@ enum pagetype {
 	PGTY_slab	= 0xf5,
 	PGTY_zsmalloc	= 0xf6,
 	PGTY_unaccepted	= 0xf7,
+	/*
+	 * guestmem folios are used to back VM memory as managed by guest_memfd.
+	 * Once the last reference is put, instead of freeing these folios back
+	 * to the page allocator, they are returned to guest_memfd.
+	 *
+	 * For now, guestmem will only be set on these folios as long as they
+	 * cannot be mapped to user space ("private state"), with the plan of
+	 * always setting that type once typed folios can be mapped to user
+	 * space cleanly.
+	 */
+	PGTY_guestmem	= 0xf8,
 
 	PGTY_mapcount_underflow = 0xff
 };
@@ -1082,6 +1093,12 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb)
 FOLIO_TEST_FLAG_FALSE(hugetlb)
 #endif
 
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+FOLIO_TYPE_OPS(guestmem, guestmem)
+#else
+FOLIO_TEST_FLAG_FALSE(guestmem)
+#endif
+
 PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
 
 /*
diff --git a/mm/debug.c b/mm/debug.c
index 8d2acf432385..08bc42c6cba8 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -56,6 +56,7 @@ static const char *page_type_names[] = {
 	DEF_PAGETYPE_NAME(table),
 	DEF_PAGETYPE_NAME(buddy),
 	DEF_PAGETYPE_NAME(unaccepted),
+	DEF_PAGETYPE_NAME(guestmem),
 };
 
 static const char *page_type_name(unsigned int page_type)
diff --git a/mm/swap.c b/mm/swap.c
index 47bc1bb919cc..241880a46358 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -38,6 +38,10 @@
 #include <linux/local_lock.h>
 #include <linux/buffer_head.h>
 
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+#include <linux/kvm_host.h>
+#endif
+
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -101,6 +105,11 @@ static void free_typed_folio(struct folio *folio)
 	case PGTY_hugetlb:
 		free_huge_folio(folio);
 		return;
+#endif
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+	case PGTY_guestmem:
+		kvm_gmem_handle_folio_put(folio);
+		return;
 #endif
 	default:
 		WARN_ON_ONCE(1);
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index b2aa6bf24d3a..c6f6792bec2a 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -312,6 +312,13 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
 	return gfn - slot->base_gfn + slot->gmem.pgoff;
 }
 
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+void kvm_gmem_handle_folio_put(struct folio *folio)
+{
+	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
+}
+#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
+
 static struct file_operations kvm_gmem_fops = {
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
-- 
2.48.1.502.g6dc24dfdaf-goog



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 03/11] KVM: guest_memfd: Allow host to map guest_memfd() pages
  2025-02-11 12:11 [PATCH v3 00/11] KVM: Mapping guest_memfd backed memory at the host for software protected VMs Fuad Tabba
  2025-02-11 12:11 ` [PATCH v3 01/11] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
  2025-02-11 12:11 ` [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages Fuad Tabba
@ 2025-02-11 12:11 ` Fuad Tabba
  2025-02-12  5:07   ` Ackerley Tng
  2025-02-12 21:23   ` Peter Xu
  2025-02-11 12:11 ` [PATCH v3 04/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is shared Fuad Tabba
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 12:11 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

Add support for mmap() and fault() of guest_memfd backed memory
in the host, for VMs that support in-place conversion between
shared and private memory. To that end, this patch adds the
ability to check whether the VM type has that support, and only
allows mapping its memory if it does.

Additionally, this behavior is gated with a new configuration
option, CONFIG_KVM_GMEM_SHARED_MEM.

Signed-off-by: Fuad Tabba <tabba@google.com>

---

This patch series will allow shared memory support for software
VMs on x86. It will also introduce a similar VM type for arm64
and allow shared memory support for that. In the future, pKVM
will also support shared memory.
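
As a usage sketch of the mmap() semantics added here (assuming fd
is a guest_memfd for a VM whose type supports shared memory):

	/* Succeeds: guest_memfd mappings must be shared. */
	mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* Fails with EINVAL: VM_SHARED | VM_MAYSHARE is required. */
	mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

	/* For VM types without shared memory support, mmap() fails
	 * with ENODEV. */
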
---
 include/linux/kvm_host.h | 11 +++++
 virt/kvm/Kconfig         |  4 ++
 virt/kvm/guest_memfd.c   | 93 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 108 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8b5f28f6efff..438aa3df3175 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -728,6 +728,17 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 }
 #endif
 
+/*
+ * Arch code must define kvm_arch_gmem_supports_shared_mem if support for
+ * private memory is enabled and it supports in-place shared/private conversion.
+ */
+#if !defined(kvm_arch_gmem_supports_shared_mem) && !IS_ENABLED(CONFIG_KVM_PRIVATE_MEM)
+static inline bool kvm_arch_gmem_supports_shared_mem(struct kvm *kvm)
+{
+	return false;
+}
+#endif
+
 #ifndef kvm_arch_has_readonly_mem
 static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
 {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..4e759e8020c5 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
 config HAVE_KVM_ARCH_GMEM_INVALIDATE
        bool
        depends on KVM_PRIVATE_MEM
+
+config KVM_GMEM_SHARED_MEM
+       select KVM_PRIVATE_MEM
+       bool
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index c6f6792bec2a..85467a3ef8ea 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -317,9 +317,102 @@ void kvm_gmem_handle_folio_put(struct folio *folio)
 {
 	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
 }
+
+static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
+{
+	struct kvm_gmem *gmem = file->private_data;
+
+	/* For now, VMs that support shared memory share all their memory. */
+	return kvm_arch_gmem_supports_shared_mem(gmem->kvm);
+}
+
+static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct folio *folio;
+	vm_fault_t ret = VM_FAULT_LOCKED;
+
+	filemap_invalidate_lock_shared(inode->i_mapping);
+
+	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+	if (IS_ERR(folio)) {
+		ret = VM_FAULT_SIGBUS;
+		goto out_filemap;
+	}
+
+	if (folio_test_hwpoison(folio)) {
+		ret = VM_FAULT_HWPOISON;
+		goto out_folio;
+	}
+
+	/* Must be called with folio lock held, i.e., after kvm_gmem_get_folio() */
+	if (!kvm_gmem_offset_is_shared(vmf->vma->vm_file, vmf->pgoff)) {
+		ret = VM_FAULT_SIGBUS;
+		goto out_folio;
+	}
+
+	/*
+	 * Only private folios are marked as "guestmem" so far, and we never
+	 * expect private folios at this point.
+	 */
+	if (WARN_ON_ONCE(folio_test_guestmem(folio)))  {
+		ret = VM_FAULT_SIGBUS;
+		goto out_folio;
+	}
+
+	/* No support for huge pages. */
+	if (WARN_ON_ONCE(folio_test_large(folio))) {
+		ret = VM_FAULT_SIGBUS;
+		goto out_folio;
+	}
+
+	if (!folio_test_uptodate(folio)) {
+		clear_highpage(folio_page(folio, 0));
+		kvm_gmem_mark_prepared(folio);
+	}
+
+	vmf->page = folio_file_page(folio, vmf->pgoff);
+
+out_folio:
+	if (ret != VM_FAULT_LOCKED) {
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+
+out_filemap:
+	filemap_invalidate_unlock_shared(inode->i_mapping);
+
+	return ret;
+}
+
+static const struct vm_operations_struct kvm_gmem_vm_ops = {
+	.fault = kvm_gmem_fault,
+};
+
+static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct kvm_gmem *gmem = file->private_data;
+
+	if (!kvm_arch_gmem_supports_shared_mem(gmem->kvm))
+		return -ENODEV;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
+	    (VM_SHARED | VM_MAYSHARE)) {
+		return -EINVAL;
+	}
+
+	file_accessed(file);
+	vm_flags_set(vma, VM_DONTDUMP);
+	vma->vm_ops = &kvm_gmem_vm_ops;
+
+	return 0;
+}
+#else
+#define kvm_gmem_mmap NULL
 #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
 
 static struct file_operations kvm_gmem_fops = {
+	.mmap		= kvm_gmem_mmap,
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
 	.fallocate	= kvm_gmem_fallocate,
-- 
2.48.1.502.g6dc24dfdaf-goog



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 04/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is shared
  2025-02-11 12:11 [PATCH v3 00/11] KVM: Mapping guest_memfd backed memory at the host for software protected VMs Fuad Tabba
                   ` (2 preceding siblings ...)
  2025-02-11 12:11 ` [PATCH v3 03/11] KVM: guest_memfd: Allow host to map guest_memfd() pages Fuad Tabba
@ 2025-02-11 12:11 ` Fuad Tabba
  2025-02-20 11:37   ` David Hildenbrand
  2025-02-11 12:11 ` [PATCH v3 05/11] KVM: guest_memfd: Handle in-place shared memory as guest_memfd backed memory Fuad Tabba
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 12:11 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

Add the KVM capability KVM_CAP_GMEM_SHARED_MEM. Queried on a VM,
it indicates that the VM supports shared memory in guest_memfd;
queried on the system, it indicates that the host can create VMs
that support shared memory. Supporting shared memory implies that
memory can be mapped when shared with the host.
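
A userspace check might look like (sketch; mirrors how the
capability is wired up below):

	/* On the system fd: can this host create such VMs at all? */
	ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GMEM_SHARED_MEM);

	/* On a VM fd: does this particular VM support it? */
	ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GMEM_SHARED_MEM);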

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/uapi/linux/kvm.h | 1 +
 virt/kvm/kvm_main.c      | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 45e6d8fca9b9..117937a895da 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -929,6 +929,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_PRE_FAULT_MEMORY 236
 #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
 #define KVM_CAP_X86_GUEST_MODE 238
+#define KVM_CAP_GMEM_SHARED_MEM 239
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ba0327e2d0d3..38f0f402ea46 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4830,6 +4830,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #ifdef CONFIG_KVM_PRIVATE_MEM
 	case KVM_CAP_GUEST_MEMFD:
 		return !kvm || kvm_arch_has_private_mem(kvm);
+#endif
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+	case KVM_CAP_GMEM_SHARED_MEM:
+		return !kvm || kvm_arch_gmem_supports_shared_mem(kvm);
 #endif
 	default:
 		break;
-- 
2.48.1.502.g6dc24dfdaf-goog



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 05/11] KVM: guest_memfd: Handle in-place shared memory as guest_memfd backed memory
  2025-02-11 12:11 [PATCH v3 00/11] KVM: Mapping guest_memfd backed memory at the host for software protected VMs Fuad Tabba
                   ` (3 preceding siblings ...)
  2025-02-11 12:11 ` [PATCH v3 04/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is shared Fuad Tabba
@ 2025-02-11 12:11 ` Fuad Tabba
  2025-02-12  0:15   ` Ackerley Tng
  2025-02-11 12:11 ` [PATCH v3 06/11] KVM: x86: Mark KVM_X86_SW_PROTECTED_VM as supporting guest_memfd shared memory Fuad Tabba
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 12:11 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

For VMs that allow sharing guest_memfd backed memory in-place,
handle that memory the same as "private" guest_memfd memory. This
means that faulting that memory in the host or in the guest will
go through the guest_memfd subsystem.

Note that the word "private" in the name of the function
kvm_mem_is_private() doesn't necessarily indicate that the memory
isn't shared; the name is a relic of the history and evolution of
guest_memfd and the various names it has received. In effect,
this function is used to multiplex between the path of a normal
page fault and the path of a guest_memfd backed page fault.
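
Concretely (an illustrative sketch of the two resulting paths):

	/* VM type supports shared gmem and the slot can be backed
	 * by guest_memfd: */
	kvm_mem_is_private(kvm, gfn) == true
		/* fault resolved via kvm_gmem_get_pfn() */

	/* Otherwise: */
	kvm_mem_is_private(kvm, gfn) == false
		/* fault resolved via the normal GUP path */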

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 438aa3df3175..39fd6e35c723 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2521,7 +2521,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 #else
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
-	return false;
+	return kvm_arch_gmem_supports_shared_mem(kvm) &&
+	       kvm_slot_can_be_private(gfn_to_memslot(kvm, gfn));
 }
 #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 
-- 
2.48.1.502.g6dc24dfdaf-goog



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 06/11] KVM: x86: Mark KVM_X86_SW_PROTECTED_VM as supporting guest_memfd shared memory
  2025-02-11 12:11 [PATCH v3 00/11] KVM: Mapping guest_memfd backed memory at the host for software protected VMs Fuad Tabba
                   ` (4 preceding siblings ...)
  2025-02-11 12:11 ` [PATCH v3 05/11] KVM: guest_memfd: Handle in-place shared memory as guest_memfd backed memory Fuad Tabba
@ 2025-02-11 12:11 ` Fuad Tabba
  2025-02-11 12:11 ` [PATCH v3 07/11] KVM: arm64: Refactor user_mem_abort() calculation of force_pte Fuad Tabba
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 12:11 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

The KVM_X86_SW_PROTECTED_VM type is meant for experimentation and
does not have any underlying support for protected guests. This
makes it a good candidate for testing the mapping of shared
memory. Therefore, when the kconfig option is enabled, mark
KVM_X86_SW_PROTECTED_VM as supporting shared memory.

This means that this memory is considered by guest_memfd to be
shared with the host, with the possibility of in-place conversion
between shared and private. This allows the host to map and fault
in guest_memfd memory belonging to this VM type.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/x86/include/asm/kvm_host.h | 5 +++++
 arch/x86/kvm/Kconfig            | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b15cde0a9b5c..1fb6cacbbeef 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2244,8 +2244,13 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 
 #ifdef CONFIG_KVM_PRIVATE_MEM
 #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
+
+#define kvm_arch_gmem_supports_shared_mem(kvm)         \
+	(IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM) &&      \
+	 ((kvm)->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
 #else
 #define kvm_arch_has_private_mem(kvm) false
+#define kvm_arch_gmem_supports_shared_mem(kvm) false
 #endif
 
 #define kvm_arch_has_readonly_mem(kvm) (!(kvm)->arch.has_protected_state)
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ea2c4f21c1ca..22d1bcdaad58 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -45,7 +45,8 @@ config KVM_X86
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select KVM_GENERIC_HARDWARE_ENABLING
 	select KVM_GENERIC_PRE_FAULT_MEMORY
-	select KVM_GENERIC_PRIVATE_MEM if KVM_SW_PROTECTED_VM
+	select KVM_PRIVATE_MEM if KVM_SW_PROTECTED_VM
+	select KVM_GMEM_SHARED_MEM if KVM_SW_PROTECTED_VM
 	select KVM_WERROR if WERROR
 
 config KVM
-- 
2.48.1.502.g6dc24dfdaf-goog



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 07/11] KVM: arm64: Refactor user_mem_abort() calculation of force_pte
  2025-02-11 12:11 [PATCH v3 00/11] KVM: Mapping guest_memfd backed memory at the host for software protected VMs Fuad Tabba
                   ` (5 preceding siblings ...)
  2025-02-11 12:11 ` [PATCH v3 06/11] KVM: x86: Mark KVM_X86_SW_PROTECTED_VM as supporting guest_memfd shared memory Fuad Tabba
@ 2025-02-11 12:11 ` Fuad Tabba
  2025-02-11 12:11 ` [PATCH v3 08/11] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 12:11 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

To simplify the code and to make the assumptions clearer,
refactor user_mem_abort() by immediately setting force_pte to
true if the conditions are met. Also, add a check enforcing the
assumption that logging_active is never true for VM_PFNMAP
memslots.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/mmu.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1f55b0c7b11d..b6c0acb2311c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1460,7 +1460,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  bool fault_is_perm)
 {
 	int ret = 0;
-	bool write_fault, writable, force_pte = false;
+	bool write_fault, writable;
 	bool exec_fault, mte_allowed;
 	bool device = false, vfio_allow_any_uc = false;
 	unsigned long mmu_seq;
@@ -1472,6 +1472,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	gfn_t gfn;
 	kvm_pfn_t pfn;
 	bool logging_active = memslot_is_logging(memslot);
+	bool force_pte = logging_active || is_protected_kvm_enabled();
 	long vma_pagesize, fault_granule;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
@@ -1525,12 +1526,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * logging_active is guaranteed to never be true for VM_PFNMAP
 	 * memslots.
 	 */
-	if (logging_active || is_protected_kvm_enabled()) {
-		force_pte = true;
+	if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
+		return -EFAULT;
+
+	if (force_pte)
 		vma_shift = PAGE_SHIFT;
-	} else {
+	else
 		vma_shift = get_vma_page_shift(vma, hva);
-	}
 
 	switch (vma_shift) {
 #ifndef __PAGETABLE_PMD_FOLDED
-- 
2.48.1.502.g6dc24dfdaf-goog



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 08/11] KVM: arm64: Handle guest_memfd()-backed guest page faults
  2025-02-11 12:11 [PATCH v3 00/11] KVM: Mapping guest_memfd backed memory at the host for software protected VMs Fuad Tabba
                   ` (6 preceding siblings ...)
  2025-02-11 12:11 ` [PATCH v3 07/11] KVM: arm64: Refactor user_mem_abort() calculation of force_pte Fuad Tabba
@ 2025-02-11 12:11 ` Fuad Tabba
  2025-02-11 15:57   ` Quentin Perret
  2025-02-11 12:11 ` [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type Fuad Tabba
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 12:11 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

Add arm64 support for handling guest page faults on guest_memfd
backed memslots.

For now, the fault granule is restricted to PAGE_SIZE.
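
For gfns that kvm_mem_is_private() reports as guest_memfd backed,
the resulting fault path is (as wired up below):

	user_mem_abort()
	  faultin_pfn(..., is_private = true)
	    kvm_gmem_get_pfn()	/* no VMA lookup, no GUP */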

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/mmu.c     | 84 ++++++++++++++++++++++++++--------------
 include/linux/kvm_host.h |  5 +++
 virt/kvm/kvm_main.c      |  5 ---
 3 files changed, 61 insertions(+), 33 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index b6c0acb2311c..305060518766 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1454,6 +1454,33 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
+static kvm_pfn_t faultin_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+			     gfn_t gfn, bool write_fault, bool *writable,
+			     struct page **page, bool is_private)
+{
+	kvm_pfn_t pfn;
+	int ret;
+
+	if (!is_private)
+		return __kvm_faultin_pfn(slot, gfn, write_fault ? FOLL_WRITE : 0, writable, page);
+
+	*writable = false;
+
+	if (WARN_ON_ONCE(write_fault && memslot_is_readonly(slot)))
+		return KVM_PFN_ERR_NOSLOT_MASK;
+
+	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, page, NULL);
+	if (!ret) {
+		*writable = write_fault;
+		return pfn;
+	}
+
+	if (ret == -EHWPOISON)
+		return KVM_PFN_ERR_HWPOISON;
+
+	return KVM_PFN_ERR_NOSLOT_MASK;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_s2_trans *nested,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1461,25 +1488,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 {
 	int ret = 0;
 	bool write_fault, writable;
-	bool exec_fault, mte_allowed;
+	bool exec_fault, mte_allowed = false;
 	bool device = false, vfio_allow_any_uc = false;
 	unsigned long mmu_seq;
 	phys_addr_t ipa = fault_ipa;
 	struct kvm *kvm = vcpu->kvm;
-	struct vm_area_struct *vma;
+	struct vm_area_struct *vma = NULL;
 	short vma_shift;
 	void *memcache;
-	gfn_t gfn;
+	gfn_t gfn = ipa >> PAGE_SHIFT;
 	kvm_pfn_t pfn;
 	bool logging_active = memslot_is_logging(memslot);
-	bool force_pte = logging_active || is_protected_kvm_enabled();
-	long vma_pagesize, fault_granule;
+	bool is_private = kvm_mem_is_private(kvm, gfn);
+	bool force_pte = logging_active || is_private || is_protected_kvm_enabled();
+	long vma_pagesize, fault_granule = PAGE_SIZE;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
 	struct page *page;
 	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED;
 
-	if (fault_is_perm)
+	if (fault_is_perm && !is_private)
 		fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
 	write_fault = kvm_is_write_fault(vcpu);
 	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
@@ -1510,24 +1538,30 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			return ret;
 	}
 
+	mmap_read_lock(current->mm);
+
 	/*
 	 * Let's check if we will get back a huge page backed by hugetlbfs, or
 	 * get block mapping for device MMIO region.
 	 */
-	mmap_read_lock(current->mm);
-	vma = vma_lookup(current->mm, hva);
-	if (unlikely(!vma)) {
-		kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
-		mmap_read_unlock(current->mm);
-		return -EFAULT;
-	}
+	if (!is_private) {
+		vma = vma_lookup(current->mm, hva);
+		if (unlikely(!vma)) {
+			kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
+			mmap_read_unlock(current->mm);
+			return -EFAULT;
+		}
 
-	/*
-	 * logging_active is guaranteed to never be true for VM_PFNMAP
-	 * memslots.
-	 */
-	if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
-		return -EFAULT;
+		/*
+		 * logging_active is guaranteed to never be true for VM_PFNMAP
+		 * memslots.
+		 */
+		if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
+			return -EFAULT;
+
+		vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
+		mte_allowed = kvm_vma_mte_allowed(vma);
+	}
 
 	if (force_pte)
 		vma_shift = PAGE_SHIFT;
@@ -1597,18 +1631,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		ipa &= ~(vma_pagesize - 1);
 	}
 
-	gfn = ipa >> PAGE_SHIFT;
-	mte_allowed = kvm_vma_mte_allowed(vma);
-
-	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
-
 	/* Don't use the VMA after the unlock -- it may have vanished */
 	vma = NULL;
 
 	/*
 	 * Read mmu_invalidate_seq so that KVM can detect if the results of
-	 * vma_lookup() or __kvm_faultin_pfn() become stale prior to
-	 * acquiring kvm->mmu_lock.
+	 * vma_lookup() or faultin_pfn() become stale prior to acquiring
+	 * kvm->mmu_lock.
 	 *
 	 * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
 	 * with the smp_wmb() in kvm_mmu_invalidate_end().
@@ -1616,8 +1645,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
 	mmap_read_unlock(current->mm);
 
-	pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
-				&writable, &page);
+	pfn = faultin_pfn(kvm, memslot, gfn, write_fault, &writable, &page, is_private);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 39fd6e35c723..415c6274aede 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1882,6 +1882,11 @@ static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
 	return gfn_to_memslot(kvm, gfn)->id;
 }
 
+static inline bool memslot_is_readonly(const struct kvm_memory_slot *slot)
+{
+	return slot->flags & KVM_MEM_READONLY;
+}
+
 static inline gfn_t
 hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 38f0f402ea46..3e40acb9f5c0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2624,11 +2624,6 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
 	return size;
 }
 
-static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
-{
-	return slot->flags & KVM_MEM_READONLY;
-}
-
 static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
 				       gfn_t *nr_pages, bool write)
 {
-- 
2.48.1.502.g6dc24dfdaf-goog



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type
  2025-02-11 12:11 [PATCH v3 00/11] KVM: Mapping guest_memfd backed memory at the host for software protected VMs Fuad Tabba
                   ` (7 preceding siblings ...)
  2025-02-11 12:11 ` [PATCH v3 08/11] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
@ 2025-02-11 12:11 ` Fuad Tabba
  2025-02-11 16:12   ` Quentin Perret
  2025-02-11 12:11 ` [PATCH v3 10/11] KVM: arm64: Enable mapping guest_memfd in arm64 Fuad Tabba
  2025-02-11 12:11 ` [PATCH v3 11/11] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed Fuad Tabba
  10 siblings, 1 reply; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 12:11 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

Introduce a new virtual machine type,
KVM_VM_TYPE_ARM_SW_PROTECTED, to serve as a development and
testing vehicle for Confidential Computing (CoCo) VMs, similar to
the x86 KVM_X86_SW_PROTECTED_VM type.

Initially, this is used to test guest_memfd without needing any
underlying protection.

Similar to the x86 type, this is currently only for development
and testing. Do not use KVM_VM_TYPE_ARM_SW_PROTECTED for "real"
VMs, and especially not in production. The behavior and effective
ABI for software-protected VMs are unstable.
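
For illustration, userspace would create such a VM by OR-ing the
new type bit into the machine type, optionally combined with an
IPA size as before (sketch):

	vm_fd = ioctl(kvm_fd, KVM_CREATE_VM,
		      KVM_VM_TYPE_ARM_SW_PROTECTED |
		      KVM_VM_TYPE_ARM_IPA_SIZE(40));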

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 Documentation/virt/kvm/api.rst    |  5 +++++
 arch/arm64/include/asm/kvm_host.h | 10 ++++++++++
 arch/arm64/kvm/arm.c              |  5 +++++
 arch/arm64/kvm/mmu.c              |  3 ---
 include/uapi/linux/kvm.h          |  6 ++++++
 5 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 2b52eb77e29c..0fccee4feee7 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -214,6 +214,11 @@ exposed by the guest CPUs in ID_AA64MMFR0_EL1[PARange]. It only affects
 size of the address translated by the stage2 level (guest physical to
 host physical address translations).
 
+KVM_VM_TYPE_ARM_SW_PROTECTED is currently only for development and testing of
+confidential VMs without having underlying support. Do not use
+KVM_VM_TYPE_ARM_SW_PROTECTED for "real" VMs, and especially not in production.
+The behavior and effective ABI for software-protected VMs are unstable.
+
 
 4.3 KVM_GET_MSR_INDEX_LIST, KVM_GET_MSR_FEATURE_INDEX_LIST
 ----------------------------------------------------------
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 7cfa024de4e3..a4276d56f54d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -383,6 +383,8 @@ struct kvm_arch {
 	 * the associated pKVM instance in the hypervisor.
 	 */
 	struct kvm_protected_vm pkvm;
+
+	unsigned long vm_type;
 };
 
 struct kvm_vcpu_fault_info {
@@ -1555,4 +1557,12 @@ void kvm_set_vm_id_reg(struct kvm *kvm, u32 reg, u64 val);
 #define kvm_has_s1poe(k)				\
 	(kvm_has_feat((k), ID_AA64MMFR3_EL1, S1POE, IMP))
 
+#define kvm_arch_has_private_mem(kvm)			\
+	(IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&		\
+	 ((kvm)->arch.vm_type & KVM_VM_TYPE_ARM_SW_PROTECTED))
+
+#define kvm_arch_gmem_supports_shared_mem(kvm)		\
+	(IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM) &&	\
+	 ((kvm)->arch.vm_type & KVM_VM_TYPE_ARM_SW_PROTECTED))
+
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 071a7d75be68..a2066db52ada 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -146,6 +146,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 {
 	int ret;
 
+	if (type & ~KVM_VM_TYPE_MASK)
+		return -EINVAL;
+
 	mutex_init(&kvm->arch.config_lock);
 
 #ifdef CONFIG_LOCKDEP
@@ -187,6 +190,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 	bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES);
 
+	kvm->arch.vm_type = type;
+
 	return 0;
 
 err_free_cpumask:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 305060518766..b89649d31127 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -882,9 +882,6 @@ static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
 	u64 mmfr0, mmfr1;
 	u32 phys_shift;
 
-	if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
-		return -EINVAL;
-
 	phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
 	if (is_protected_kvm_enabled()) {
 		phys_shift = kvm_ipa_limit;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 117937a895da..f155d3781e08 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -652,6 +652,12 @@ struct kvm_enable_cap {
 #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK	0xffULL
 #define KVM_VM_TYPE_ARM_IPA_SIZE(x)		\
 	((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
+
+#define KVM_VM_TYPE_ARM_SW_PROTECTED	(1UL << 9)
+
+#define KVM_VM_TYPE_MASK	(KVM_VM_TYPE_ARM_IPA_SIZE_MASK | \
+				 KVM_VM_TYPE_ARM_SW_PROTECTED)
+
 /*
  * ioctls for /dev/kvm fds:
  */
-- 
2.48.1.502.g6dc24dfdaf-goog



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 10/11] KVM: arm64: Enable mapping guest_memfd in arm64
  2025-02-11 12:11 [PATCH v3 00/11] KVM: Mapping guest_memfd backed memory at the host for software protected VMs Fuad Tabba
                   ` (8 preceding siblings ...)
  2025-02-11 12:11 ` [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type Fuad Tabba
@ 2025-02-11 12:11 ` Fuad Tabba
  2025-02-11 12:11 ` [PATCH v3 11/11] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed Fuad Tabba
  10 siblings, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 12:11 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

Enable mapping guest_memfd on arm64. This only applies to VMs
with the type KVM_VM_TYPE_ARM_SW_PROTECTED.

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index ead632ad01b4..4830d8805bed 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -38,6 +38,7 @@ menuconfig KVM
 	select HAVE_KVM_VCPU_RUN_PID_CHANGE
 	select SCHED_INFO
 	select GUEST_PERF_EVENTS if PERF_EVENTS
+	select KVM_GMEM_SHARED_MEM
 	help
 	  Support hosting virtualized guest machines.
 
-- 
2.48.1.502.g6dc24dfdaf-goog



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3 11/11] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed
  2025-02-11 12:11 [PATCH v3 00/11] KVM: Mapping guest_memfd backed memory at the host for software protected VMs Fuad Tabba
                   ` (9 preceding siblings ...)
  2025-02-11 12:11 ` [PATCH v3 10/11] KVM: arm64: Enable mapping guest_memfd in arm64 Fuad Tabba
@ 2025-02-11 12:11 ` Fuad Tabba
  10 siblings, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 12:11 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

Expand the guest_memfd selftests to test mapping guest memory
for VM types that support it.

Also, build the guest_memfd selftest for aarch64.
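
To build and run it (standard kselftest flow; paths assume an
in-tree build):

	make -C tools/testing/selftests/kvm
	tools/testing/selftests/kvm/guest_memfd_test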

Signed-off-by: Fuad Tabba <tabba@google.com>
---
 tools/testing/selftests/kvm/Makefile.kvm      |  1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 75 +++++++++++++++++--
 tools/testing/selftests/kvm/lib/kvm_util.c    |  3 +-
 3 files changed, 71 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 4277b983cace..c9a3f30e28dd 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -160,6 +160,7 @@ TEST_GEN_PROGS_arm64 += coalesced_io_test
 TEST_GEN_PROGS_arm64 += demand_paging_test
 TEST_GEN_PROGS_arm64 += dirty_log_test
 TEST_GEN_PROGS_arm64 += dirty_log_perf_test
+TEST_GEN_PROGS_arm64 += guest_memfd_test
 TEST_GEN_PROGS_arm64 += guest_print_test
 TEST_GEN_PROGS_arm64 += get-reg-list
 TEST_GEN_PROGS_arm64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index ce687f8d248f..f1e89f72b89f 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -34,12 +34,48 @@ static void test_file_read_write(int fd)
 		    "pwrite on a guest_mem fd should fail");
 }
 
-static void test_mmap(int fd, size_t page_size)
+static void test_mmap_allowed(int fd, size_t total_size)
 {
+	size_t page_size = getpagesize();
+	const char val = 0xaa;
+	char *mem;
+	int ret;
+	int i;
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap() of guest memory should pass.");
+
+	memset(mem, val, total_size);
+	for (i = 0; i < total_size; i++)
+		TEST_ASSERT_EQ(mem[i], val);
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0,
+			page_size);
+	TEST_ASSERT(!ret, "fallocate the first page should succeed");
+
+	for (i = 0; i < page_size; i++)
+		TEST_ASSERT_EQ(mem[i], 0x00);
+	for (; i < total_size; i++)
+		TEST_ASSERT_EQ(mem[i], val);
+
+	memset(mem, val, total_size);
+	for (i = 0; i < total_size; i++)
+		TEST_ASSERT_EQ(mem[i], val);
+
+	ret = munmap(mem, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+}
+
+static void test_mmap_denied(int fd, size_t total_size)
+{
+	size_t page_size = getpagesize();
 	char *mem;
 
 	mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
 	TEST_ASSERT_EQ(mem, MAP_FAILED);
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT_EQ(mem, MAP_FAILED);
 }
 
 static void test_file_size(int fd, size_t page_size, size_t total_size)
@@ -170,19 +206,30 @@ static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
 	close(fd1);
 }
 
-int main(int argc, char *argv[])
+unsigned long get_shared_type(void)
 {
-	size_t page_size;
+#ifdef __x86_64__
+	return KVM_X86_SW_PROTECTED_VM;
+#endif
+#ifdef __aarch64__
+	return KVM_VM_TYPE_ARM_SW_PROTECTED;
+#endif
+	return 0;
+}
+
+void test_vm_type(unsigned long type, bool is_shared)
+{
+	struct kvm_vm *vm;
 	size_t total_size;
+	size_t page_size;
 	int fd;
-	struct kvm_vm *vm;
 
 	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
 
 	page_size = getpagesize();
 	total_size = page_size * 4;
 
-	vm = vm_create_barebones();
+	vm = vm_create_barebones_type(type);
 
 	test_create_guest_memfd_invalid(vm);
 	test_create_guest_memfd_multiple(vm);
@@ -190,10 +237,26 @@ int main(int argc, char *argv[])
 	fd = vm_create_guest_memfd(vm, total_size, 0);
 
 	test_file_read_write(fd);
-	test_mmap(fd, page_size);
+
+	if (is_shared)
+		test_mmap_allowed(fd, total_size);
+	else
+		test_mmap_denied(fd, total_size);
+
 	test_file_size(fd, page_size, total_size);
 	test_fallocate(fd, page_size, total_size);
 	test_invalid_punch_hole(fd, page_size, total_size);
 
 	close(fd);
+	kvm_vm_release(vm);
+}
+
+int main(int argc, char *argv[])
+{
+	test_vm_type(VM_TYPE_DEFAULT, false);
+
+	if (kvm_has_cap(KVM_CAP_GMEM_SHARED_MEM))
+		test_vm_type(get_shared_type(), true);
+
+	return 0;
 }
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 33fefeb3ca44..17aed505746b 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -347,9 +347,8 @@ struct kvm_vm *____vm_create(struct vm_shape shape)
 	}
 
 #ifdef __aarch64__
-	TEST_ASSERT(!vm->type, "ARM doesn't support test-provided types");
 	if (vm->pa_bits != 40)
-		vm->type = KVM_VM_TYPE_ARM_IPA_SIZE(vm->pa_bits);
+		vm->type |= KVM_VM_TYPE_ARM_IPA_SIZE(vm->pa_bits);
 #endif
 
 	vm_open(vm);
-- 
2.48.1.502.g6dc24dfdaf-goog



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 08/11] KVM: arm64: Handle guest_memfd()-backed guest page faults
  2025-02-11 12:11 ` [PATCH v3 08/11] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
@ 2025-02-11 15:57   ` Quentin Perret
  2025-02-11 16:13     ` Fuad Tabba
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2025-02-11 15:57 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hey Fuad,

On Tuesday 11 Feb 2025 at 12:11:24 (+0000), Fuad Tabba wrote:
> Add arm64 support for handling guest page faults on guest_memfd
> backed memslots.
> 
> For now, the fault granule is restricted to PAGE_SIZE.
> 
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  arch/arm64/kvm/mmu.c     | 84 ++++++++++++++++++++++++++--------------
>  include/linux/kvm_host.h |  5 +++
>  virt/kvm/kvm_main.c      |  5 ---
>  3 files changed, 61 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index b6c0acb2311c..305060518766 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1454,6 +1454,33 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_MTE_ALLOWED;
>  }
>  
> +static kvm_pfn_t faultin_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> +			     gfn_t gfn, bool write_fault, bool *writable,
> +			     struct page **page, bool is_private)
> +{
> +	kvm_pfn_t pfn;
> +	int ret;
> +
> +	if (!is_private)
> +		return __kvm_faultin_pfn(slot, gfn, write_fault ? FOLL_WRITE : 0, writable, page);
> +
> +	*writable = false;
> +
> +	if (WARN_ON_ONCE(write_fault && memslot_is_readonly(slot)))
> +		return KVM_PFN_ERR_NOSLOT_MASK;

I believe this check is superfluous; we should decide to report an MMIO
exit to userspace for write faults to RO memslots and not get anywhere
near user_mem_abort(). And a nit: the error code should probably be
KVM_PFN_ERR_RO_FAULT or something instead?

> +
> +	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, page, NULL);
> +	if (!ret) {
> +		*writable = write_fault;

In normal KVM, if we're not dirty logging we'll actively map the page as
writable if both the memslot and the userspace mappings are writable.
With gmem, the latter doesn't make much sense, but essentially the
underlying page should really be writable (e.g. no CoW getting in the
way and such?). If so, then perhaps make this

		*writable = !memslot_is_readonly(slot);

Wdyt?

> +		return pfn;
> +	}
> +
> +	if (ret == -EHWPOISON)
> +		return KVM_PFN_ERR_HWPOISON;
> +
> +	return KVM_PFN_ERR_NOSLOT_MASK;
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_s2_trans *nested,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
> @@ -1461,25 +1488,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  {
>  	int ret = 0;
>  	bool write_fault, writable;
> -	bool exec_fault, mte_allowed;
> +	bool exec_fault, mte_allowed = false;
>  	bool device = false, vfio_allow_any_uc = false;
>  	unsigned long mmu_seq;
>  	phys_addr_t ipa = fault_ipa;
>  	struct kvm *kvm = vcpu->kvm;
> -	struct vm_area_struct *vma;
> +	struct vm_area_struct *vma = NULL;
>  	short vma_shift;
>  	void *memcache;
> -	gfn_t gfn;
> +	gfn_t gfn = ipa >> PAGE_SHIFT;
>  	kvm_pfn_t pfn;
>  	bool logging_active = memslot_is_logging(memslot);
> -	bool force_pte = logging_active || is_protected_kvm_enabled();
> -	long vma_pagesize, fault_granule;
> +	bool is_private = kvm_mem_is_private(kvm, gfn);

Just trying to understand the locking rule for the xarray behind this.
Is it kvm->srcu that protects it for reads here? Something else?
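
That is, I'd expect the usual pattern somewhere up the call chain (the
vcpu run loop, IIUC):

	idx = srcu_read_lock(&kvm->srcu);
	/* ... abort handling, including this attributes lookup ... */
	srcu_read_unlock(&kvm->srcu, idx);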

> +	bool force_pte = logging_active || is_private || is_protected_kvm_enabled();
> +	long vma_pagesize, fault_granule = PAGE_SIZE;
>  	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>  	struct kvm_pgtable *pgt;
>  	struct page *page;
>  	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED;
>  
> -	if (fault_is_perm)
> +	if (fault_is_perm && !is_private)

Nit: not strictly necessary I think.

>  		fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
>  	write_fault = kvm_is_write_fault(vcpu);
>  	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> @@ -1510,24 +1538,30 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			return ret;
>  	}
>  
> +	mmap_read_lock(current->mm);
> +
>  	/*
>  	 * Let's check if we will get back a huge page backed by hugetlbfs, or
>  	 * get block mapping for device MMIO region.
>  	 */
> -	mmap_read_lock(current->mm);
> -	vma = vma_lookup(current->mm, hva);
> -	if (unlikely(!vma)) {
> -		kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> -		mmap_read_unlock(current->mm);
> -		return -EFAULT;
> -	}
> +	if (!is_private) {
> +		vma = vma_lookup(current->mm, hva);
> +		if (unlikely(!vma)) {
> +			kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> +			mmap_read_unlock(current->mm);
> +			return -EFAULT;
> +		}
>  
> -	/*
> -	 * logging_active is guaranteed to never be true for VM_PFNMAP
> -	 * memslots.
> -	 */
> -	if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
> -		return -EFAULT;
> +		/*
> +		 * logging_active is guaranteed to never be true for VM_PFNMAP
> +		 * memslots.
> +		 */
> +		if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
> +			return -EFAULT;
> +
> +		vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> +		mte_allowed = kvm_vma_mte_allowed(vma);
> +	}
>  
>  	if (force_pte)
>  		vma_shift = PAGE_SHIFT;
> @@ -1597,18 +1631,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		ipa &= ~(vma_pagesize - 1);
>  	}
>  
> -	gfn = ipa >> PAGE_SHIFT;
> -	mte_allowed = kvm_vma_mte_allowed(vma);
> -
> -	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> -
>  	/* Don't use the VMA after the unlock -- it may have vanished */
>  	vma = NULL;
>  
>  	/*
>  	 * Read mmu_invalidate_seq so that KVM can detect if the results of
> -	 * vma_lookup() or __kvm_faultin_pfn() become stale prior to
> -	 * acquiring kvm->mmu_lock.
> +	 * vma_lookup() or faultin_pfn() become stale prior to acquiring
> +	 * kvm->mmu_lock.
>  	 *
>  	 * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
>  	 * with the smp_wmb() in kvm_mmu_invalidate_end().
> @@ -1616,8 +1645,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
>  	mmap_read_unlock(current->mm);
>  
> -	pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
> -				&writable, &page);
> +	pfn = faultin_pfn(kvm, memslot, gfn, write_fault, &writable, &page, is_private);
>  	if (pfn == KVM_PFN_ERR_HWPOISON) {
>  		kvm_send_hwpoison_signal(hva, vma_shift);
>  		return 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 39fd6e35c723..415c6274aede 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1882,6 +1882,11 @@ static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
>  	return gfn_to_memslot(kvm, gfn)->id;
>  }
>  
> +static inline bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> +{
> +	return slot->flags & KVM_MEM_READONLY;
> +}
> +
>  static inline gfn_t
>  hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
>  {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 38f0f402ea46..3e40acb9f5c0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2624,11 +2624,6 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
>  	return size;
>  }
>  
> -static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> -{
> -	return slot->flags & KVM_MEM_READONLY;
> -}
> -
>  static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
>  				       gfn_t *nr_pages, bool write)
>  {
> -- 
> 2.48.1.502.g6dc24dfdaf-goog
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type
  2025-02-11 12:11 ` [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type Fuad Tabba
@ 2025-02-11 16:12   ` Quentin Perret
  2025-02-11 16:17     ` Fuad Tabba
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2025-02-11 16:12 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Fuad,

On Tuesday 11 Feb 2025 at 12:11:25 (+0000), Fuad Tabba wrote:
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 117937a895da..f155d3781e08 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -652,6 +652,12 @@ struct kvm_enable_cap {
>  #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK	0xffULL
>  #define KVM_VM_TYPE_ARM_IPA_SIZE(x)		\
>  	((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
> +
> +#define KVM_VM_TYPE_ARM_SW_PROTECTED	(1UL << 9)

FWIW, the downstream Android code has used bit 31 since forever
for that.

Although I very much believe that upstream should not care about the
downstream mess in general, in this particular instance bit 9 really
isn't superior in any way, and there's a bunch of existing userspace
code that uses bit 31 today as we speak. It is very much Android's
problem to update these userspace programs if we do go with bit 9
upstream, but I don't really see how that would benefit upstream
either.

So, given that there is no maintenance cost for upstream to use bit 31
instead of 9, I'd vote for using bit 31 and ease the landing with
existing userspace code, unless folks are really opinionated with this
stuff :)
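
For concreteness, the existing userspace code amounts to something like
this (sketch; the downstream macro name is approximate):

	#define KVM_VM_TYPE_ARM_PROTECTED	(1UL << 31)

	vm_fd = ioctl(kvm_fd, KVM_CREATE_VM,
		      KVM_VM_TYPE_ARM_IPA_SIZE(ipa_bits) |
		      KVM_VM_TYPE_ARM_PROTECTED);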

Thanks,
Quentin

> +#define KVM_VM_TYPE_MASK	(KVM_VM_TYPE_ARM_IPA_SIZE_MASK | \
> +				 KVM_VM_TYPE_ARM_SW_PROTECTED)
> +
>  /*
>   * ioctls for /dev/kvm fds:
>   */
> -- 
> 2.48.1.502.g6dc24dfdaf-goog
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 08/11] KVM: arm64: Handle guest_memfd()-backed guest page faults
  2025-02-11 15:57   ` Quentin Perret
@ 2025-02-11 16:13     ` Fuad Tabba
  2025-02-11 16:25       ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 16:13 UTC (permalink / raw)
  To: Quentin Perret
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Quentin,

On Tue, 11 Feb 2025 at 15:57, Quentin Perret <qperret@google.com> wrote:
>
> Hey Fuad,
>
> On Tuesday 11 Feb 2025 at 12:11:24 (+0000), Fuad Tabba wrote:
> > Add arm64 support for handling guest page faults on guest_memfd
> > backed memslots.
> >
> > For now, the fault granule is restricted to PAGE_SIZE.
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> >  arch/arm64/kvm/mmu.c     | 84 ++++++++++++++++++++++++++--------------
> >  include/linux/kvm_host.h |  5 +++
> >  virt/kvm/kvm_main.c      |  5 ---
> >  3 files changed, 61 insertions(+), 33 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index b6c0acb2311c..305060518766 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1454,6 +1454,33 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
> >       return vma->vm_flags & VM_MTE_ALLOWED;
> >  }
> >
> > +static kvm_pfn_t faultin_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +                          gfn_t gfn, bool write_fault, bool *writable,
> > +                          struct page **page, bool is_private)
> > +{
> > +     kvm_pfn_t pfn;
> > +     int ret;
> > +
> > +     if (!is_private)
> > +             return __kvm_faultin_pfn(slot, gfn, write_fault ? FOLL_WRITE : 0, writable, page);
> > +
> > +     *writable = false;
> > +
> > +     if (WARN_ON_ONCE(write_fault && memslot_is_readonly(slot)))
> > +             return KVM_PFN_ERR_NOSLOT_MASK;
>
> I believe this check is superfluous; we should decide to report an MMIO
> exit to userspace for write faults to RO memslots and not get anywhere
> near user_mem_abort(). And nit, but the error code should probably be
> KVM_PFN_ERR_RO_FAULT or something instead?

I tried to replicate the behavior of __kvm_faultin_pfn() here (but got
the wrong error!). I think you're right though that in the arm64 case,
this check isn't needed. Should I fix the return error and keep the
warning though?

> > +
> > +     ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, page, NULL);
> > +     if (!ret) {
> > +             *writable = write_fault;
>
> In normal KVM, if we're not dirty logging we'll actively map the page as
> writable if both the memslot and the userspace mappings are writable.
> With gmem, the latter doesn't make much sense, but essentially the
> underlying page should really be writable (e.g. no CoW getting in the
> way and such?). If so, then perhaps make this
>
>                 *writable = !memslot_is_readonly(slot);
>
> Wdyt?

Ack.

> > +             return pfn;
> > +     }
> > +
> > +     if (ret == -EHWPOISON)
> > +             return KVM_PFN_ERR_HWPOISON;
> > +
> > +     return KVM_PFN_ERR_NOSLOT_MASK;
> > +}
> > +
> >  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                         struct kvm_s2_trans *nested,
> >                         struct kvm_memory_slot *memslot, unsigned long hva,
> > @@ -1461,25 +1488,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  {
> >       int ret = 0;
> >       bool write_fault, writable;
> > -     bool exec_fault, mte_allowed;
> > +     bool exec_fault, mte_allowed = false;
> >       bool device = false, vfio_allow_any_uc = false;
> >       unsigned long mmu_seq;
> >       phys_addr_t ipa = fault_ipa;
> >       struct kvm *kvm = vcpu->kvm;
> > -     struct vm_area_struct *vma;
> > +     struct vm_area_struct *vma = NULL;
> >       short vma_shift;
> >       void *memcache;
> > -     gfn_t gfn;
> > +     gfn_t gfn = ipa >> PAGE_SHIFT;
> >       kvm_pfn_t pfn;
> >       bool logging_active = memslot_is_logging(memslot);
> > -     bool force_pte = logging_active || is_protected_kvm_enabled();
> > -     long vma_pagesize, fault_granule;
> > +     bool is_private = kvm_mem_is_private(kvm, gfn);
>
> Just trying to understand the locking rule for the xarray behind this.
> Is it kvm->srcu that protects it for reads here? Something else?

I'm not sure I follow. Which xarray are you referring to?

>
> > +     bool force_pte = logging_active || is_private || is_protected_kvm_enabled();
> > +     long vma_pagesize, fault_granule = PAGE_SIZE;
> >       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> >       struct kvm_pgtable *pgt;
> >       struct page *page;
> >       enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED;
> >
> > -     if (fault_is_perm)
> > +     if (fault_is_perm && !is_private)
>
> Nit: not strictly necessary I think.

You're right.

Thanks,
/fuad

> >               fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
> >       write_fault = kvm_is_write_fault(vcpu);
> >       exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > @@ -1510,24 +1538,30 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                       return ret;
> >       }
> >
> > +     mmap_read_lock(current->mm);
> > +
> >       /*
> >        * Let's check if we will get back a huge page backed by hugetlbfs, or
> >        * get block mapping for device MMIO region.
> >        */
> > -     mmap_read_lock(current->mm);
> > -     vma = vma_lookup(current->mm, hva);
> > -     if (unlikely(!vma)) {
> > -             kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> > -             mmap_read_unlock(current->mm);
> > -             return -EFAULT;
> > -     }
> > +     if (!is_private) {
> > +             vma = vma_lookup(current->mm, hva);
> > +             if (unlikely(!vma)) {
> > +                     kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> > +                     mmap_read_unlock(current->mm);
> > +                     return -EFAULT;
> > +             }
> >
> > -     /*
> > -      * logging_active is guaranteed to never be true for VM_PFNMAP
> > -      * memslots.
> > -      */
> > -     if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
> > -             return -EFAULT;
> > +             /*
> > +              * logging_active is guaranteed to never be true for VM_PFNMAP
> > +              * memslots.
> > +              */
> > +             if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
> > +                     return -EFAULT;
> > +
> > +             vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> > +             mte_allowed = kvm_vma_mte_allowed(vma);
> > +     }
> >
> >       if (force_pte)
> >               vma_shift = PAGE_SHIFT;
> > @@ -1597,18 +1631,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >               ipa &= ~(vma_pagesize - 1);
> >       }
> >
> > -     gfn = ipa >> PAGE_SHIFT;
> > -     mte_allowed = kvm_vma_mte_allowed(vma);
> > -
> > -     vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> > -
> >       /* Don't use the VMA after the unlock -- it may have vanished */
> >       vma = NULL;
> >
> >       /*
> >        * Read mmu_invalidate_seq so that KVM can detect if the results of
> > -      * vma_lookup() or __kvm_faultin_pfn() become stale prior to
> > -      * acquiring kvm->mmu_lock.
> > +      * vma_lookup() or faultin_pfn() become stale prior to acquiring
> > +      * kvm->mmu_lock.
> >        *
> >        * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
> >        * with the smp_wmb() in kvm_mmu_invalidate_end().
> > @@ -1616,8 +1645,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >       mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> >       mmap_read_unlock(current->mm);
> >
> > -     pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
> > -                             &writable, &page);
> > +     pfn = faultin_pfn(kvm, memslot, gfn, write_fault, &writable, &page, is_private);
> >       if (pfn == KVM_PFN_ERR_HWPOISON) {
> >               kvm_send_hwpoison_signal(hva, vma_shift);
> >               return 0;
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 39fd6e35c723..415c6274aede 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1882,6 +1882,11 @@ static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
> >       return gfn_to_memslot(kvm, gfn)->id;
> >  }
> >
> > +static inline bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> > +{
> > +     return slot->flags & KVM_MEM_READONLY;
> > +}
> > +
> >  static inline gfn_t
> >  hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
> >  {
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 38f0f402ea46..3e40acb9f5c0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2624,11 +2624,6 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
> >       return size;
> >  }
> >
> > -static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> > -{
> > -     return slot->flags & KVM_MEM_READONLY;
> > -}
> > -
> >  static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
> >                                      gfn_t *nr_pages, bool write)
> >  {
> > --
> > 2.48.1.502.g6dc24dfdaf-goog
> >


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type
  2025-02-11 16:12   ` Quentin Perret
@ 2025-02-11 16:17     ` Fuad Tabba
  2025-02-11 16:29       ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 16:17 UTC (permalink / raw)
  To: Quentin Perret
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Quentin,

On Tue, 11 Feb 2025 at 16:12, Quentin Perret <qperret@google.com> wrote:
>
> Hi Fuad,
>
> On Tuesday 11 Feb 2025 at 12:11:25 (+0000), Fuad Tabba wrote:
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 117937a895da..f155d3781e08 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -652,6 +652,12 @@ struct kvm_enable_cap {
> >  #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK        0xffULL
> >  #define KVM_VM_TYPE_ARM_IPA_SIZE(x)          \
> >       ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
> > +
> > +#define KVM_VM_TYPE_ARM_SW_PROTECTED (1UL << 9)
>
> FWIW, the downstream Android code has used bit 31 since forever
> for that.
>
> Although I very much believe that upstream should not care about the
> downstream mess in general, in this particular instance bit 9 really
> isn't superior in any way, and there's a bunch of existing userspace
> code that uses bit 31 today as we speak. It is very much Android's
> problem to update these userspace programs if we do go with bit 9
> upstream, but I don't really see how that would benefit upstream
> either.
>
> So, given that there is no maintenance cost for upstream to use bit 31
> instead of 9, I'd vote for using bit 31 and ease the landing with
> existing userspace code, unless folks are really opinionated with this
> stuff :)

My thinking is that this bit does _not_ mean pKVM. It means an
experimental software VM that is similar to the x86
KVM_X86_SW_PROTECTED_VM. Hence why I didn't choose bit 31.

From Documentation/virt/kvm/api.rst (for x86):

'''
Note, KVM_X86_SW_PROTECTED_VM is currently only for development and testing.
Do not use KVM_X86_SW_PROTECTED_VM for "real" VMs, and especially not in
production.  The behavior and effective ABI for software-protected VMs is
unstable.
'''

which is similar to the documentation I added here.

Cheers,
/fuad



> Thanks,
> Quentin
>
> > +#define KVM_VM_TYPE_MASK     (KVM_VM_TYPE_ARM_IPA_SIZE_MASK | \
> > +                              KVM_VM_TYPE_ARM_SW_PROTECTED)
> > +
> >  /*
> >   * ioctls for /dev/kvm fds:
> >   */
> > --
> > 2.48.1.502.g6dc24dfdaf-goog
> >


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 08/11] KVM: arm64: Handle guest_memfd()-backed guest page faults
  2025-02-11 16:13     ` Fuad Tabba
@ 2025-02-11 16:25       ` Quentin Perret
  2025-02-11 16:34         ` Fuad Tabba
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2025-02-11 16:25 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On Tuesday 11 Feb 2025 at 16:13:27 (+0000), Fuad Tabba wrote:
> Hi Quentin,
> 
> On Tue, 11 Feb 2025 at 15:57, Quentin Perret <qperret@google.com> wrote:
> >
> > Hey Fuad,
> >
> > On Tuesday 11 Feb 2025 at 12:11:24 (+0000), Fuad Tabba wrote:
> > > Add arm64 support for handling guest page faults on guest_memfd
> > > backed memslots.
> > >
> > > For now, the fault granule is restricted to PAGE_SIZE.
> > >
> > > Signed-off-by: Fuad Tabba <tabba@google.com>
> > > ---
> > >  arch/arm64/kvm/mmu.c     | 84 ++++++++++++++++++++++++++--------------
> > >  include/linux/kvm_host.h |  5 +++
> > >  virt/kvm/kvm_main.c      |  5 ---
> > >  3 files changed, 61 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index b6c0acb2311c..305060518766 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -1454,6 +1454,33 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
> > >       return vma->vm_flags & VM_MTE_ALLOWED;
> > >  }
> > >
> > > +static kvm_pfn_t faultin_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +                          gfn_t gfn, bool write_fault, bool *writable,
> > > +                          struct page **page, bool is_private)
> > > +{
> > > +     kvm_pfn_t pfn;
> > > +     int ret;
> > > +
> > > +     if (!is_private)
> > > +             return __kvm_faultin_pfn(slot, gfn, write_fault ? FOLL_WRITE : 0, writable, page);
> > > +
> > > +     *writable = false;
> > > +
> > > +     if (WARN_ON_ONCE(write_fault && memslot_is_readonly(slot)))
> > > +             return KVM_PFN_ERR_NOSLOT_MASK;
> >
> > I believe this check is superfluous; we should decide to report an MMIO
> > exit to userspace for write faults to RO memslots and not get anywhere
> > near user_mem_abort(). And nit, but the error code should probably be
> > KVM_PFN_ERR_RO_FAULT or something instead?
> 
> I tried to replicate the behavior of __kvm_faultin_pfn() here (but got
> the wrong error!). I think you're right though that in the arm64 case,
> this check isn't needed. Should I fix the return error and keep the
> warning though?

__kvm_faultin_pfn() will apparently just set *writable to false if it
finds an RO memslot, rather than return an error. So I'd vote for
dropping that check so we align with that behaviour.

> > > +
> > > +     ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, page, NULL);
> > > +     if (!ret) {
> > > +             *writable = write_fault;
> >
> > In normal KVM, if we're not dirty logging we'll actively map the page as
> > writable if both the memslot and the userspace mappings are writable.
> > With gmem, the latter doesn't make much sense, but essentially the
> > underlying page should really be writable (e.g. no CoW getting in the
> > way and such?). If so, then perhaps make this
> >
> >                 *writable = !memslot_is_readonly(slot);
> >
> > Wdyt?
> 
> Ack.
> 
> > > +             return pfn;
> > > +     }
> > > +
> > > +     if (ret == -EHWPOISON)
> > > +             return KVM_PFN_ERR_HWPOISON;
> > > +
> > > +     return KVM_PFN_ERR_NOSLOT_MASK;
> > > +}
> > > +
> > >  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >                         struct kvm_s2_trans *nested,
> > >                         struct kvm_memory_slot *memslot, unsigned long hva,
> > > @@ -1461,25 +1488,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >  {
> > >       int ret = 0;
> > >       bool write_fault, writable;
> > > -     bool exec_fault, mte_allowed;
> > > +     bool exec_fault, mte_allowed = false;
> > >       bool device = false, vfio_allow_any_uc = false;
> > >       unsigned long mmu_seq;
> > >       phys_addr_t ipa = fault_ipa;
> > >       struct kvm *kvm = vcpu->kvm;
> > > -     struct vm_area_struct *vma;
> > > +     struct vm_area_struct *vma = NULL;
> > >       short vma_shift;
> > >       void *memcache;
> > > -     gfn_t gfn;
> > > +     gfn_t gfn = ipa >> PAGE_SHIFT;
> > >       kvm_pfn_t pfn;
> > >       bool logging_active = memslot_is_logging(memslot);
> > > -     bool force_pte = logging_active || is_protected_kvm_enabled();
> > > -     long vma_pagesize, fault_granule;
> > > +     bool is_private = kvm_mem_is_private(kvm, gfn);
> >
> > Just trying to understand the locking rule for the xarray behind this.
> > Is it kvm->srcu that protects it for reads here? Something else?
> 
> I'm not sure I follow. Which xarray are you referring to?

Sorry, yes, that wasn't clear. I meant that kvm_mem_is_private() calls
kvm_get_memory_attributes() which indexes kvm->mem_attr_array. The
comment in struct kvm indicates that this xarray is protected by RCU for
readers, so I was just checking if we were relying on
kvm_handle_guest_abort() to take srcu_read_lock(&kvm->srcu) for us, or
if there was something else more subtle here.
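
For reference, my reading of the generic helper (roughly, paraphrasing
include/linux/kvm_host.h):

	static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
	{
		return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
		       kvm_get_memory_attributes(kvm, gfn) &
		       KVM_MEMORY_ATTRIBUTE_PRIVATE;
	}

hence the question about what holds kvm->srcu across the lookup.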

Cheers,
Quentin

> >
> > > +     bool force_pte = logging_active || is_private || is_protected_kvm_enabled();
> > > +     long vma_pagesize, fault_granule = PAGE_SIZE;
> > >       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > >       struct kvm_pgtable *pgt;
> > >       struct page *page;
> > >       enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED;
> > >
> > > -     if (fault_is_perm)
> > > +     if (fault_is_perm && !is_private)
> >
> > Nit: not strictly necessary I think.
> 
> You're right.
> 
> Thanks,
> /fuad
> 
> > >               fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
> > >       write_fault = kvm_is_write_fault(vcpu);
> > >       exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > > @@ -1510,24 +1538,30 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >                       return ret;
> > >       }
> > >
> > > +     mmap_read_lock(current->mm);
> > > +
> > >       /*
> > >        * Let's check if we will get back a huge page backed by hugetlbfs, or
> > >        * get block mapping for device MMIO region.
> > >        */
> > > -     mmap_read_lock(current->mm);
> > > -     vma = vma_lookup(current->mm, hva);
> > > -     if (unlikely(!vma)) {
> > > -             kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> > > -             mmap_read_unlock(current->mm);
> > > -             return -EFAULT;
> > > -     }
> > > +     if (!is_private) {
> > > +             vma = vma_lookup(current->mm, hva);
> > > +             if (unlikely(!vma)) {
> > > +                     kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> > > +                     mmap_read_unlock(current->mm);
> > > +                     return -EFAULT;
> > > +             }
> > >
> > > -     /*
> > > -      * logging_active is guaranteed to never be true for VM_PFNMAP
> > > -      * memslots.
> > > -      */
> > > -     if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
> > > -             return -EFAULT;
> > > +             /*
> > > +              * logging_active is guaranteed to never be true for VM_PFNMAP
> > > +              * memslots.
> > > +              */
> > > +             if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
> > > +                     return -EFAULT;
> > > +
> > > +             vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> > > +             mte_allowed = kvm_vma_mte_allowed(vma);
> > > +     }
> > >
> > >       if (force_pte)
> > >               vma_shift = PAGE_SHIFT;
> > > @@ -1597,18 +1631,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >               ipa &= ~(vma_pagesize - 1);
> > >       }
> > >
> > > -     gfn = ipa >> PAGE_SHIFT;
> > > -     mte_allowed = kvm_vma_mte_allowed(vma);
> > > -
> > > -     vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> > > -
> > >       /* Don't use the VMA after the unlock -- it may have vanished */
> > >       vma = NULL;
> > >
> > >       /*
> > >        * Read mmu_invalidate_seq so that KVM can detect if the results of
> > > -      * vma_lookup() or __kvm_faultin_pfn() become stale prior to
> > > -      * acquiring kvm->mmu_lock.
> > > +      * vma_lookup() or faultin_pfn() become stale prior to acquiring
> > > +      * kvm->mmu_lock.
> > >        *
> > >        * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
> > >        * with the smp_wmb() in kvm_mmu_invalidate_end().
> > > @@ -1616,8 +1645,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >       mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> > >       mmap_read_unlock(current->mm);
> > >
> > > -     pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
> > > -                             &writable, &page);
> > > +     pfn = faultin_pfn(kvm, memslot, gfn, write_fault, &writable, &page, is_private);
> > >       if (pfn == KVM_PFN_ERR_HWPOISON) {
> > >               kvm_send_hwpoison_signal(hva, vma_shift);
> > >               return 0;
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 39fd6e35c723..415c6274aede 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -1882,6 +1882,11 @@ static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
> > >       return gfn_to_memslot(kvm, gfn)->id;
> > >  }
> > >
> > > +static inline bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> > > +{
> > > +     return slot->flags & KVM_MEM_READONLY;
> > > +}
> > > +
> > >  static inline gfn_t
> > >  hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
> > >  {
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 38f0f402ea46..3e40acb9f5c0 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -2624,11 +2624,6 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
> > >       return size;
> > >  }
> > >
> > > -static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> > > -{
> > > -     return slot->flags & KVM_MEM_READONLY;
> > > -}
> > > -
> > >  static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
> > >                                      gfn_t *nr_pages, bool write)
> > >  {
> > > --
> > > 2.48.1.502.g6dc24dfdaf-goog
> > >


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type
  2025-02-11 16:17     ` Fuad Tabba
@ 2025-02-11 16:29       ` Quentin Perret
  2025-02-11 16:32         ` Patrick Roy
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2025-02-11 16:29 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On Tuesday 11 Feb 2025 at 16:17:25 (+0000), Fuad Tabba wrote:
> Hi Quentin,
> 
> On Tue, 11 Feb 2025 at 16:12, Quentin Perret <qperret@google.com> wrote:
> >
> > Hi Fuad,
> >
> > On Tuesday 11 Feb 2025 at 12:11:25 (+0000), Fuad Tabba wrote:
> > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > index 117937a895da..f155d3781e08 100644
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -652,6 +652,12 @@ struct kvm_enable_cap {
> > >  #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK        0xffULL
> > >  #define KVM_VM_TYPE_ARM_IPA_SIZE(x)          \
> > >       ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
> > > +
> > > +#define KVM_VM_TYPE_ARM_SW_PROTECTED (1UL << 9)
> >
> > FWIW, the downstream Android code has used bit 31 since forever
> > for that.
> >
> > Although I very much believe that upstream should not care about the
> > downstream mess in general, in this particular instance bit 9 really
> > isn't superior in any way, and there's a bunch of existing userspace
> > code that uses bit 31 today as we speak. It is very much Android's
> > problem to update these userspace programs if we do go with bit 9
> > upstream, but I don't really see how that would benefit upstream
> > either.
> >
> > So, given that there is no maintenance cost for upstream to use bit 31
> > instead of 9, I'd vote for using bit 31 and ease the landing with
> > existing userspace code, unless folks are really opinionated with this
> > stuff :)
> 
> My thinking is that this bit does _not_ mean pKVM. It means an
> experimental software VM that is similar to the x86
> KVM_X86_SW_PROTECTED_VM. Hence why I didn't choose bit 31.
> 
> From Documentation/virt/kvm/api.rst (for x86):
> 
> '''
> Note, KVM_X86_SW_PROTECTED_VM is currently only for development and testing.
> Do not use KVM_X86_SW_PROTECTED_VM for "real" VMs, and especially not in
> production.  The behavior and effective ABI for software-protected VMs is
> unstable.
> '''
> 
> which is similar to the documentation I added here.

Aha, I see, but are we going to allocate _another_ bit for protected VMs
proper once they're supported? Or just update the doc for the existing
bit? If the latter, then I guess this discussion can still happen :)
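
I.e., do we end up with something like the below, the second define
being purely hypothetical at this point:

	#define KVM_VM_TYPE_ARM_SW_PROTECTED	(1UL << 9)
	#define KVM_VM_TYPE_ARM_HW_PROTECTED	(1UL << 31)	/* hypothetical */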

Thanks,
Quentin


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type
  2025-02-11 16:29       ` Quentin Perret
@ 2025-02-11 16:32         ` Patrick Roy
  2025-02-11 17:09           ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Patrick Roy @ 2025-02-11 16:32 UTC (permalink / raw)
  To: Quentin Perret, Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Quentin,

On Tue, 2025-02-11 at 16:29 +0000, Quentin Perret wrote:
> On Tuesday 11 Feb 2025 at 16:17:25 (+0000), Fuad Tabba wrote:
>> Hi Quentin,
>>
>> On Tue, 11 Feb 2025 at 16:12, Quentin Perret <qperret@google.com> wrote:
>>>
>>> Hi Fuad,
>>>
>>> On Tuesday 11 Feb 2025 at 12:11:25 (+0000), Fuad Tabba wrote:
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index 117937a895da..f155d3781e08 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -652,6 +652,12 @@ struct kvm_enable_cap {
>>>>  #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK        0xffULL
>>>>  #define KVM_VM_TYPE_ARM_IPA_SIZE(x)          \
>>>>       ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
>>>> +
>>>> +#define KVM_VM_TYPE_ARM_SW_PROTECTED (1UL << 9)
>>>
>>> FWIW, the downstream Android code has used bit 31 since forever
>>> for that.
>>>
>>> Although I very much believe that upstream should not care about the
>>> downstream mess in general, in this particular instance bit 9 really
>>> isn't superior in any way, and there's a bunch of existing userspace
>>> code that uses bit 31 today as we speak. It is very much Android's
>>> problem to update these userspace programs if we do go with bit 9
>>> upstream, but I don't really see how that would benefit upstream
>>> either.
>>>
>>> So, given that there is no maintenance cost for upstream to use bit 31
>>> instead of 9, I'd vote for using bit 31 and ease the landing with
>>> existing userspace code, unless folks are really opinionated with this
>>> stuff :)
>>
>> My thinking is that this bit does _not_ mean pKVM. It means an
>> experimental software VM that is similar to the x86
>> KVM_X86_SW_PROTECTED_VM. Hence why I didn't choose bit 31.
>>
>> From Documentation/virt/kvm/api.rst (for x86):
>>
>> '''
>> Note, KVM_X86_SW_PROTECTED_VM is currently only for development and testing.
>> Do not use KVM_X86_SW_PROTECTED_VM for "real" VMs, and especially not in
>> production.  The behavior and effective ABI for software-protected VMs is
>> unstable.
>> '''
>>
>> which is similar to the documentation I added here.
> 
> Aha, I see, but are we going to allocate _another_ bit for protected VMs
> proper once they're supported? Or just update the doc for the existing
> bit? If the latter, then I guess this discussion can still happen :)

I was hoping that SW_PROTECTED_VM would be the VM type that something
like Firecracker could use, i.e. an interface to guest_memfd specifically
_without_ pKVM, as Fuad was saying.

> Thanks,
> Quentin

Best, 
Patrick


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 08/11] KVM: arm64: Handle guest_memfd()-backed guest page faults
  2025-02-11 16:25       ` Quentin Perret
@ 2025-02-11 16:34         ` Fuad Tabba
  2025-02-11 16:57           ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 16:34 UTC (permalink / raw)
  To: Quentin Perret
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Quentin,

On Tue, 11 Feb 2025 at 16:26, Quentin Perret <qperret@google.com> wrote:
>
> On Tuesday 11 Feb 2025 at 16:13:27 (+0000), Fuad Tabba wrote:
> > Hi Quentin,
> >
> > On Tue, 11 Feb 2025 at 15:57, Quentin Perret <qperret@google.com> wrote:
> > >
> > > Hey Fuad,
> > >
> > > On Tuesday 11 Feb 2025 at 12:11:24 (+0000), Fuad Tabba wrote:
> > > > Add arm64 support for handling guest page faults on guest_memfd
> > > > backed memslots.
> > > >
> > > > For now, the fault granule is restricted to PAGE_SIZE.
> > > >
> > > > Signed-off-by: Fuad Tabba <tabba@google.com>
> > > > ---
> > > >  arch/arm64/kvm/mmu.c     | 84 ++++++++++++++++++++++++++--------------
> > > >  include/linux/kvm_host.h |  5 +++
> > > >  virt/kvm/kvm_main.c      |  5 ---
> > > >  3 files changed, 61 insertions(+), 33 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index b6c0acb2311c..305060518766 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -1454,6 +1454,33 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
> > > >       return vma->vm_flags & VM_MTE_ALLOWED;
> > > >  }
> > > >
> > > > +static kvm_pfn_t faultin_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > > +                          gfn_t gfn, bool write_fault, bool *writable,
> > > > +                          struct page **page, bool is_private)
> > > > +{
> > > > +     kvm_pfn_t pfn;
> > > > +     int ret;
> > > > +
> > > > +     if (!is_private)
> > > > +             return __kvm_faultin_pfn(slot, gfn, write_fault ? FOLL_WRITE : 0, writable, page);
> > > > +
> > > > +     *writable = false;
> > > > +
> > > > +     if (WARN_ON_ONCE(write_fault && memslot_is_readonly(slot)))
> > > > +             return KVM_PFN_ERR_NOSLOT_MASK;
> > >
> > > I believe this check is superfluous; we should decide to report an MMIO
> > > exit to userspace for write faults to RO memslots and not get anywhere
> > > near user_mem_abort(). And nit, but the error code should probably be
> > > KVM_PFN_ERR_RO_FAULT or something instead?
> >
> > I tried to replicate the behavior of __kvm_faultin_pfn() here (but got
> > the wrong error!). I think you're right though that in the arm64 case,
> > this check isn't needed. Should I fix the return error and keep the
> > warning though?
>
> __kvm_faultin_pfn() will apparently just set *writable to false if it
> finds an RO memslot, rather than return an error. So I'd vote for
> dropping that check so we align with that behaviour.

Ack.

> > > > +
> > > > +     ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, page, NULL);
> > > > +     if (!ret) {
> > > > +             *writable = write_fault;
> > >
> > > In normal KVM, if we're not dirty logging we'll actively map the page as
> > > writable if both the memslot and the userspace mappings are writable.
> > > With gmem, the latter doesn't make much sense, but essentially the
> > > underlying page should really be writable (e.g. no CoW getting in the
> > > way and such?). If so, then perhaps make this
> > >
> > >                 *writable = !memslot_is_readonly(slot);
> > >
> > > Wdyt?
> >
> > Ack.
> >
> > > > +             return pfn;
> > > > +     }
> > > > +
> > > > +     if (ret == -EHWPOISON)
> > > > +             return KVM_PFN_ERR_HWPOISON;
> > > > +
> > > > +     return KVM_PFN_ERR_NOSLOT_MASK;
> > > > +}
> > > > +
> > > >  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > >                         struct kvm_s2_trans *nested,
> > > >                         struct kvm_memory_slot *memslot, unsigned long hva,
> > > > @@ -1461,25 +1488,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > >  {
> > > >       int ret = 0;
> > > >       bool write_fault, writable;
> > > > -     bool exec_fault, mte_allowed;
> > > > +     bool exec_fault, mte_allowed = false;
> > > >       bool device = false, vfio_allow_any_uc = false;
> > > >       unsigned long mmu_seq;
> > > >       phys_addr_t ipa = fault_ipa;
> > > >       struct kvm *kvm = vcpu->kvm;
> > > > -     struct vm_area_struct *vma;
> > > > +     struct vm_area_struct *vma = NULL;
> > > >       short vma_shift;
> > > >       void *memcache;
> > > > -     gfn_t gfn;
> > > > +     gfn_t gfn = ipa >> PAGE_SHIFT;
> > > >       kvm_pfn_t pfn;
> > > >       bool logging_active = memslot_is_logging(memslot);
> > > > -     bool force_pte = logging_active || is_protected_kvm_enabled();
> > > > -     long vma_pagesize, fault_granule;
> > > > +     bool is_private = kvm_mem_is_private(kvm, gfn);
> > >
> > > Just trying to understand the locking rule for the xarray behind this.
> > > Is it kvm->srcu that protects it for reads here? Something else?
> >
> > I'm not sure I follow. Which xarray are you referring to?
>
> Sorry, yes, that wasn't clear. I meant that kvm_mem_is_private() calls
> kvm_get_memory_attributes() which indexes kvm->mem_attr_array. The
> comment in struct kvm indicates that this xarray is protected by RCU for
> readers, so I was just checking if we were relying on
> kvm_handle_guest_abort() to take srcu_read_lock(&kvm->srcu) for us, or
> if there was something else more subtle here.

I was kind of afraid that people would be confused by this, and I
commented on it in the commit message of the earlier patch:
https://lore.kernel.org/all/20250211121128.703390-6-tabba@google.com/

> Note that the word "private" in the name of the function
> kvm_mem_is_private() doesn't necessarily indicate that the memory
> isn't shared, but is due to the history and evolution of
> guest_memfd and the various names it has received. In effect,
> this function is used to multiplex between the path of a normal
> page fault and the path of a guest_memfd backed page fault.

kvm_mem_is_private() is a property of the memslot itself. No xarrays
harmed in the process :)
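
In this series it effectively reduces to a check on the memslot, along
these lines (simplified sketch, not the literal patch; see the lore
link above for the real thing):

	static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
	{
		struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);

		return kvm_slot_can_be_private(slot);
	}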

Cheers,
/fuad

> Cheers,
> Quentin
>
> > >
> > > > +     bool force_pte = logging_active || is_private || is_protected_kvm_enabled();
> > > > +     long vma_pagesize, fault_granule = PAGE_SIZE;
> > > >       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > > >       struct kvm_pgtable *pgt;
> > > >       struct page *page;
> > > >       enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED;
> > > >
> > > > -     if (fault_is_perm)
> > > > +     if (fault_is_perm && !is_private)
> > >
> > > Nit: not strictly necessary I think.
> >
> > You're right.
> >
> > Thanks,
> > /fuad
> >
> > > >               fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
> > > >       write_fault = kvm_is_write_fault(vcpu);
> > > >       exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > > > @@ -1510,24 +1538,30 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > >                       return ret;
> > > >       }
> > > >
> > > > +     mmap_read_lock(current->mm);
> > > > +
> > > >       /*
> > > >        * Let's check if we will get back a huge page backed by hugetlbfs, or
> > > >        * get block mapping for device MMIO region.
> > > >        */
> > > > -     mmap_read_lock(current->mm);
> > > > -     vma = vma_lookup(current->mm, hva);
> > > > -     if (unlikely(!vma)) {
> > > > -             kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> > > > -             mmap_read_unlock(current->mm);
> > > > -             return -EFAULT;
> > > > -     }
> > > > +     if (!is_private) {
> > > > +             vma = vma_lookup(current->mm, hva);
> > > > +             if (unlikely(!vma)) {
> > > > +                     kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
> > > > +                     mmap_read_unlock(current->mm);
> > > > +                     return -EFAULT;
> > > > +             }
> > > >
> > > > -     /*
> > > > -      * logging_active is guaranteed to never be true for VM_PFNMAP
> > > > -      * memslots.
> > > > -      */
> > > > -     if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
> > > > -             return -EFAULT;
> > > > +             /*
> > > > +              * logging_active is guaranteed to never be true for VM_PFNMAP
> > > > +              * memslots.
> > > > +              */
> > > > +             if (WARN_ON_ONCE(logging_active && (vma->vm_flags & VM_PFNMAP)))
> > > > +                     return -EFAULT;
> > > > +
> > > > +             vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> > > > +             mte_allowed = kvm_vma_mte_allowed(vma);
> > > > +     }
> > > >
> > > >       if (force_pte)
> > > >               vma_shift = PAGE_SHIFT;
> > > > @@ -1597,18 +1631,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > >               ipa &= ~(vma_pagesize - 1);
> > > >       }
> > > >
> > > > -     gfn = ipa >> PAGE_SHIFT;
> > > > -     mte_allowed = kvm_vma_mte_allowed(vma);
> > > > -
> > > > -     vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
> > > > -
> > > >       /* Don't use the VMA after the unlock -- it may have vanished */
> > > >       vma = NULL;
> > > >
> > > >       /*
> > > >        * Read mmu_invalidate_seq so that KVM can detect if the results of
> > > > -      * vma_lookup() or __kvm_faultin_pfn() become stale prior to
> > > > -      * acquiring kvm->mmu_lock.
> > > > +      * vma_lookup() or faultin_pfn() become stale prior to acquiring
> > > > +      * kvm->mmu_lock.
> > > >        *
> > > >        * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
> > > >        * with the smp_wmb() in kvm_mmu_invalidate_end().
> > > > @@ -1616,8 +1645,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > >       mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> > > >       mmap_read_unlock(current->mm);
> > > >
> > > > -     pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
> > > > -                             &writable, &page);
> > > > +     pfn = faultin_pfn(kvm, memslot, gfn, write_fault, &writable, &page, is_private);
> > > >       if (pfn == KVM_PFN_ERR_HWPOISON) {
> > > >               kvm_send_hwpoison_signal(hva, vma_shift);
> > > >               return 0;
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index 39fd6e35c723..415c6274aede 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -1882,6 +1882,11 @@ static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
> > > >       return gfn_to_memslot(kvm, gfn)->id;
> > > >  }
> > > >
> > > > +static inline bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> > > > +{
> > > > +     return slot->flags & KVM_MEM_READONLY;
> > > > +}
> > > > +
> > > >  static inline gfn_t
> > > >  hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
> > > >  {
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index 38f0f402ea46..3e40acb9f5c0 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -2624,11 +2624,6 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
> > > >       return size;
> > > >  }
> > > >
> > > > -static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> > > > -{
> > > > -     return slot->flags & KVM_MEM_READONLY;
> > > > -}
> > > > -
> > > >  static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
> > > >                                      gfn_t *nr_pages, bool write)
> > > >  {
> > > > --
> > > > 2.48.1.502.g6dc24dfdaf-goog
> > > >


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 08/11] KVM: arm64: Handle guest_memfd()-backed guest page faults
  2025-02-11 16:34         ` Fuad Tabba
@ 2025-02-11 16:57           ` Quentin Perret
  2025-02-11 17:04             ` Fuad Tabba
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2025-02-11 16:57 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On Tuesday 11 Feb 2025 at 16:34:02 (+0000), Fuad Tabba wrote:
> > Sorry, yes, that wasn't clear. I meant that kvm_mem_is_private() calls
> > kvm_get_memory_attributes() which indexes kvm->mem_attr_array. The
> > comment in struct kvm indicates that this xarray is protected by RCU for
> > readers, so I was just checking if we were relying on
> > kvm_handle_guest_abort() to take srcu_read_lock(&kvm->srcu) for us, or
> > if there was something else more subtle here.
> 
> I was kind of afraid that people would be confused by this, and I
> commented on it in the commit message of the earlier patch:
> https://lore.kernel.org/all/20250211121128.703390-6-tabba@google.com/
> 
> > Note that the word "private" in the name of the function
> > kvm_mem_is_private() doesn't necessarily indicate that the memory
> > isn't shared, but is due to the history and evolution of
> > guest_memfd and the various names it has received. In effect,
> > this function is used to multiplex between the path of a normal
> > page fault and the path of a guest_memfd backed page fault.
> 
> kvm_mem_is_private() is a property of the memslot itself. No xarrays
> harmed in the process :)

Ah, I see, but could someone enable CONFIG_GENERIC_PRIVATE_MEM and
related and get confused? Should KVM_GENERIC_MEMORY_ATTRIBUTES=n
depend on !ARM64? Or is it KVM_GMEM_SHARED_MEM that needs to depend on
the generic implementation being off?

Thanks,
Quentin


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 08/11] KVM: arm64: Handle guest_memfd()-backed guest page faults
  2025-02-11 16:57           ` Quentin Perret
@ 2025-02-11 17:04             ` Fuad Tabba
  2025-02-11 17:19               ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Fuad Tabba @ 2025-02-11 17:04 UTC (permalink / raw)
  To: Quentin Perret
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Quentin,

On Tue, 11 Feb 2025 at 16:57, Quentin Perret <qperret@google.com> wrote:
>
> On Tuesday 11 Feb 2025 at 16:34:02 (+0000), Fuad Tabba wrote:
> > > Sorry, yes, that wasn't clear. I meant that kvm_mem_is_private() calls
> > > kvm_get_memory_attributes() which indexes kvm->mem_attr_array. The
> > > comment in struct kvm indicates that this xarray is protected by RCU for
> > > readers, so I was just checking if we were relying on
> > > kvm_handle_guest_abort() to take srcu_read_lock(&kvm->srcu) for us, or
> > > if there was something else more subtle here.
> >
> > I was kind of afraid that people would be confused by this, and I
> > commented on it in the commit message of the earlier patch:
> > https://lore.kernel.org/all/20250211121128.703390-6-tabba@google.com/
> >
> > > Note that the word "private" in the name of the function
> > > kvm_mem_is_private() doesn't necessarily indicate that the memory
> > > isn't shared, but is due to the history and evolution of
> > > guest_memfd and the various names it has received. In effect,
> > > this function is used to multiplex between the path of a normal
> > > page fault and the path of a guest_memfd backed page fault.
> >
> > kvm_mem_is_private() is a property of the memslot itself. No xarrays
> > harmed in the process :)
>
> Ah, I see, but could someone enable CONFIG_GENERIC_PRIVATE_MEM and
> related and get confused? Should KVM_GENERIC_MEMORY_ATTRIBUTES=n
> depend on !ARM64? Or is it KVM_GMEM_SHARED_MEM that needs to depend on
> the generic implementation being off?

VMs that have sharing in place don't need
KVM_GENERIC_MEMORY_ATTRIBUTES, since that presents the userspace
view/desire of the state of the folio. It's not necessarily an arm64
thing; CCA, for example, would need it, since it behaves like TDX.

I guess that KVM_GMEM_SHARED_MEM and KVM_GENERIC_MEMORY_ATTRIBUTES are
mutually exclusive; I cannot think of how the two could be useful
together. We could have a check to ensure that both are not enabled at
the same time. The behavior in this patch series is that
KVM_GMEM_SHARED_MEM selects GENERIC_PRIVATE_MEM.
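
To illustrate the kind of check I mean (a build-time assertion sketch;
in practice this would rather be expressed as a Kconfig dependency):

	#if defined(CONFIG_KVM_GMEM_SHARED_MEM) && \
	    defined(CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES)
	#error "KVM_GMEM_SHARED_MEM and KVM_GENERIC_MEMORY_ATTRIBUTES are exclusive"
	#endif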

Also, to help reduce the confusion above, I could rename the variable
is_private in user_mem_abort() to is_guestmem. WDYT?
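
i.e. the declaration would simply become:

	bool is_guestmem = kvm_mem_is_private(kvm, gfn);	/* rename only */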

Cheers,
/fuad


> Thanks,
> Quentin


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type
  2025-02-11 16:32         ` Patrick Roy
@ 2025-02-11 17:09           ` Quentin Perret
  2025-02-14 11:13             ` Quentin Perret
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2025-02-11 17:09 UTC (permalink / raw)
  To: Patrick Roy
  Cc: Fuad Tabba, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
	willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Patrick,

On Tuesday 11 Feb 2025 at 16:32:31 (+0000), Patrick Roy wrote:
> I was hoping that SW_PROTECTED_VM will be the VM type that something
> like Firecracker could use, e.g. an interface to guest_memfd specifically
> _without_ pKVM, as Fuad was saying.

I had, probably incorrectly, assumed that we'd eventually want to allow
gmem for all VMs, including traditional KVM VMs that don't have anything
special. Perhaps the gmem support could be exposed via a KVM_CAP in this
case?

Anyway, no objection to the proposed approach in this patch assuming we
will eventually have HW_PROTECTED_VM for pKVM VMs, and that _that_ can be
bit 31 :).

Thanks,
Quentin


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 08/11] KVM: arm64: Handle guest_memfd()-backed guest page faults
  2025-02-11 17:04             ` Fuad Tabba
@ 2025-02-11 17:19               ` Quentin Perret
  0 siblings, 0 replies; 54+ messages in thread
From: Quentin Perret @ 2025-02-11 17:19 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On Tuesday 11 Feb 2025 at 17:04:05 (+0000), Fuad Tabba wrote:
> Hi Quentin,
> 
> On Tue, 11 Feb 2025 at 16:57, Quentin Perret <qperret@google.com> wrote:
> >
> > On Tuesday 11 Feb 2025 at 16:34:02 (+0000), Fuad Tabba wrote:
> > > > Sorry, yes, that wasn't clear. I meant that kvm_mem_is_private() calls
> > > > kvm_get_memory_attributes() which indexes kvm->mem_attr_array. The
> > > > comment in struct kvm indicates that this xarray is protected by RCU for
> > > > readers, so I was just checking if we were relying on
> > > > kvm_handle_guest_abort() to take srcu_read_lock(&kvm->srcu) for us, or
> > > > if there was something else more subtle here.
> > >
> > > I was kind of afraid that people would be confused by this, and I
> > > commented on it in the commit message of the earlier patch:
> > > https://lore.kernel.org/all/20250211121128.703390-6-tabba@google.com/
> > >
> > > > Note that the word "private" in the name of the function
> > > > kvm_mem_is_private() doesn't necessarily indicate that the memory
> > > > isn't shared, but is due to the history and evolution of
> > > > guest_memfd and the various names it has received. In effect,
> > > > this function is used to multiplex between the path of a normal
> > > > page fault and the path of a guest_memfd backed page fault.
> > >
> > > kvm_mem_is_private() is a property of the memslot itself. No xarrays
> > > harmed in the process :)
> >
> > Ah, I see, but could someone enable CONFIG_GENERIC_PRIVATE_MEM and
> > related and get confused? Should KVM_GENERIC_MEMORY_ATTRIBUTES=n
> > depend on !ARM64? Or is it KVM_GMEM_SHARED_MEM that needs to depend on
> > the generic implementation being off?
> 
> VMs that have sharing in place don't need
> KVM_GENERIC_MEMORY_ATTRIBUTES, since that represents the userspace
> view/desire of the state of the folio. It's not necessarily an arm64
> thing; CCA, for example, would need it, since it behaves like TDX.
> 
> I guess that KVM_GMEM_SHARED_MEM and KVM_GENERIC_MEMORY_ATTRIBUTES are
> mutually exclusive. I cannot think how the two could be used or useful
> together. We could have a check to ensure that both are not enabled at
> the same time.

Right, that should be a matter of adding

	depends on !KVM_GENERIC_MEMORY_ATTRIBUTES

to the KVM_GMEM_SHARED_MEM Kconfig entry then, I think.
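
I.e., the entry added in this series would become something like this
(just a sketch):

	config KVM_GMEM_SHARED_MEM
	       bool
	       select KVM_PRIVATE_MEM
	       depends on !KVM_GENERIC_MEMORY_ATTRIBUTES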

> The behavior in this patch series is that
> KVM_GMEM_SHARED_MEM selects GENERIC_PRIVATE_MEM.

You meant s/GENERIC_PRIVATE_MEM/KVM_PRIVATE_MEM right?

> Also, to help reduce the confusion above, I could rename the variable
> is_private in user_mem_abort() to is_guestmem. WDYT?

I actually don't mind the variable name in that it is consistent with the
rest of the code, but I do positively hate how the definition of
'private' in this code doesn't match my intuition :-) 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 05/11] KVM: guest_memfd: Handle in-place shared memory as guest_memfd backed memory
  2025-02-11 12:11 ` [PATCH v3 05/11] KVM: guest_memfd: Handle in-place shared memory as guest_memfd backed memory Fuad Tabba
@ 2025-02-12  0:15   ` Ackerley Tng
  2025-02-12  9:23     ` Fuad Tabba
  0 siblings, 1 reply; 54+ messages in thread
From: Ackerley Tng @ 2025-02-12  0:15 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

Fuad Tabba <tabba@google.com> writes:

> For VMs that allow sharing guest_memfd backed memory in-place,
> handle that memory the same as "private" guest_memfd memory. This
> means that faulting that memory in the host or in the guest will
> go through the guest_memfd subsystem.
>
> Note that the word "private" in the name of the function
> kvm_mem_is_private() doesn't necessarily indicate that the memory
> isn't shared, but is due to the history and evolution of
> guest_memfd and the various names it has received. In effect,
> this function is used to multiplex between the path of a normal
> page fault and the path of a guest_memfd backed page fault.
>

Thanks for this summary! It has always been confusing and this really
helps.

Is there any chance we could rename the functions in KVM, or maybe add a
comment at the function definitions? The name of the userspace flag will
have to remain, of course.
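
E.g., something along these lines at the definition (just a sketch,
based on the explanation in the commit message):

	/*
	 * Note: despite the name, this returns true for guest_memfd backed
	 * memory in general, shared or private; it multiplexes between the
	 * normal page fault path and the guest_memfd one.
	 */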

> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  include/linux/kvm_host.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 438aa3df3175..39fd6e35c723 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2521,7 +2521,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  #else
>  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  {
> -	return false;
> +	return kvm_arch_gmem_supports_shared_mem(kvm) &&
> +	       kvm_slot_can_be_private(gfn_to_memslot(kvm, gfn));
>  }
>  #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 03/11] KVM: guest_memfd: Allow host to map guest_memfd() pages
  2025-02-11 12:11 ` [PATCH v3 03/11] KVM: guest_memfd: Allow host to map guest_memfd() pages Fuad Tabba
@ 2025-02-12  5:07   ` Ackerley Tng
  2025-02-12  9:21     ` Fuad Tabba
  2025-02-12 21:23   ` Peter Xu
  1 sibling, 1 reply; 54+ messages in thread
From: Ackerley Tng @ 2025-02-12  5:07 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
	tabba

Fuad Tabba <tabba@google.com> writes:

> Add support for mmap() and fault() for guest_memfd backed memory
> in the host for VMs that support in-place conversion between
> shared and private (shared memory). To that end, this patch adds
> the ability to check whether the VM type has that support, and
> only allows mapping its memory if that's the case.
>
> Additionally, this behavior is gated with a new configuration
> option, CONFIG_KVM_GMEM_SHARED_MEM.
>
> Signed-off-by: Fuad Tabba <tabba@google.com>
>
> ---
>
> This patch series will allow shared memory support for software
> VMs in x86. It will also introduce a similar VM type for arm64
> and allow shared memory support for that. In the future, pKVM
> will also support shared memory.

Thanks, I agree that introducing mmap this way could help get it
merged earlier, independently of conversion support, to enable testing.

I'll adopt this patch in the next revision of 1G page support for
guest_memfd.

> ---
>  include/linux/kvm_host.h | 11 +++++
>  virt/kvm/Kconfig         |  4 ++
>  virt/kvm/guest_memfd.c   | 93 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 108 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8b5f28f6efff..438aa3df3175 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -728,6 +728,17 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>  }
>  #endif
>  
> +/*
> + * Arch code must define kvm_arch_gmem_supports_shared_mem if support for
> + * private memory is enabled and it supports in-place shared/private conversion.
> + */
> +#if !defined(kvm_arch_gmem_supports_shared_mem) && !IS_ENABLED(CONFIG_KVM_PRIVATE_MEM)
> +static inline bool kvm_arch_gmem_supports_shared_mem(struct kvm *kvm)
> +{
> +	return false;
> +}
> +#endif

Perhaps this could be declared in the #ifdef CONFIG_KVM_PRIVATE_MEM
block?

Could this be defined as a __weak symbol for architectures to override?
Or perhaps that can be done once guest_memfd gets refactored separately
since now the entire guest_memfd.c isn't even compiled if
CONFIG_KVM_PRIVATE_MEM is not set.

> +
>  #ifndef kvm_arch_has_readonly_mem
>  static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
>  {
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 54e959e7d68f..4e759e8020c5 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
>  config HAVE_KVM_ARCH_GMEM_INVALIDATE
>         bool
>         depends on KVM_PRIVATE_MEM
> +
> +config KVM_GMEM_SHARED_MEM
> +       select KVM_PRIVATE_MEM
> +       bool
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index c6f6792bec2a..85467a3ef8ea 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -317,9 +317,102 @@ void kvm_gmem_handle_folio_put(struct folio *folio)
>  {
>  	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
>  }
> +
> +static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
> +{
> +	struct kvm_gmem *gmem = file->private_data;
> +
> +	/* For now, VMs that support shared memory share all their memory. */
> +	return kvm_arch_gmem_supports_shared_mem(gmem->kvm);
> +}
> +
> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> +{
> +	struct inode *inode = file_inode(vmf->vma->vm_file);
> +	struct folio *folio;
> +	vm_fault_t ret = VM_FAULT_LOCKED;
> +
> +	filemap_invalidate_lock_shared(inode->i_mapping);
> +
> +	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> +	if (IS_ERR(folio)) {
> +		ret = VM_FAULT_SIGBUS;

Will it always be a SIGBUS if there is some error getting a folio?

> +		goto out_filemap;
> +	}
> +
> +	if (folio_test_hwpoison(folio)) {
> +		ret = VM_FAULT_HWPOISON;
> +		goto out_folio;
> +	}
> +
> +	/* Must be called with folio lock held, i.e., after kvm_gmem_get_folio() */
> +	if (!kvm_gmem_offset_is_shared(vmf->vma->vm_file, vmf->pgoff)) {
> +		ret = VM_FAULT_SIGBUS;
> +		goto out_folio;
> +	}
> +
> +	/*
> +	 * Only private folios are marked as "guestmem" so far, and we never
> +	 * expect private folios at this point.
> +	 */

Proposal - rephrase this comment as: before typed folios can be mapped,
PGTY_guestmem is only tagged on folios so that guest_memfd will receive
the kvm_gmem_handle_folio_put() callback. The tag is definitely not
expected when a folio is about to be faulted in.
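
I.e., as a code comment:

	/*
	 * Before typed folios can be mapped, PGTY_guestmem is only set on
	 * folios so that guest_memfd receives the kvm_gmem_handle_folio_put()
	 * callback; the tag is definitely not expected when a folio is about
	 * to be faulted in.
	 */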

I propose the above because I think technically when mappability is NONE
the folio isn't private? Not sure if others see this differently.

> +	if (WARN_ON_ONCE(folio_test_guestmem(folio)))  {
> +		ret = VM_FAULT_SIGBUS;
> +		goto out_folio;
> +	}
> +
> +	/* No support for huge pages. */
> +	if (WARN_ON_ONCE(folio_test_large(folio))) {
> +		ret = VM_FAULT_SIGBUS;
> +		goto out_folio;
> +	}
> +
> +	if (!folio_test_uptodate(folio)) {
> +		clear_highpage(folio_page(folio, 0));
> +		kvm_gmem_mark_prepared(folio);
> +	}
> +
> +	vmf->page = folio_file_page(folio, vmf->pgoff);
> +
> +out_folio:
> +	if (ret != VM_FAULT_LOCKED) {
> +		folio_unlock(folio);
> +		folio_put(folio);
> +	}
> +
> +out_filemap:
> +	filemap_invalidate_unlock_shared(inode->i_mapping);
> +
> +	return ret;
> +}
> +
> +static const struct vm_operations_struct kvm_gmem_vm_ops = {
> +	.fault = kvm_gmem_fault,
> +};
> +
> +static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct kvm_gmem *gmem = file->private_data;
> +
> +	if (!kvm_arch_gmem_supports_shared_mem(gmem->kvm))
> +		return -ENODEV;
> +
> +	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
> +	    (VM_SHARED | VM_MAYSHARE)) {
> +		return -EINVAL;
> +	}
> +
> +	file_accessed(file);
> +	vm_flags_set(vma, VM_DONTDUMP);
> +	vma->vm_ops = &kvm_gmem_vm_ops;
> +
> +	return 0;
> +}
> +#else
> +#define kvm_gmem_mmap NULL
>  #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
>  
>  static struct file_operations kvm_gmem_fops = {
> +	.mmap		= kvm_gmem_mmap,

I think it's better to surround this with #ifdef
CONFIG_KVM_GMEM_SHARED_MEM so that when more code gets inserted between
the struct declaration and the definition of kvm_gmem_mmap() it is more
obvious that .mmap is only overridden when CONFIG_KVM_GMEM_SHARED_MEM is
set.
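
Something like this sketch (trimmed) is what I have in mind:

	static struct file_operations kvm_gmem_fops = {
	#ifdef CONFIG_KVM_GMEM_SHARED_MEM
		.mmap		= kvm_gmem_mmap,
	#endif
		.open		= generic_file_open,
		.release	= kvm_gmem_release,
		.fallocate	= kvm_gmem_fallocate,
	};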

>  	.open		= generic_file_open,
>  	.release	= kvm_gmem_release,
>  	.fallocate	= kvm_gmem_fallocate,


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 03/11] KVM: guest_memfd: Allow host to map guest_memfd() pages
  2025-02-12  5:07   ` Ackerley Tng
@ 2025-02-12  9:21     ` Fuad Tabba
  0 siblings, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-12  9:21 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Ackerley,

On Wed, 12 Feb 2025 at 05:07, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> > Add support for mmap() and fault() for guest_memfd backed memory
> > in the host for VMs that support in-place conversion between
> > shared and private (shared memory). To that end, this patch adds
> > the ability to check whether the VM type has that support, and
> > only allows mapping its memory if that's the case.
> >
> > Additionally, this behavior is gated with a new configuration
> > option, CONFIG_KVM_GMEM_SHARED_MEM.
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> >
> > ---
> >
> > This patch series will allow shared memory support for software
> > VMs in x86. It will also introduce a similar VM type for arm64
> > and allow shared memory support for that. In the future, pKVM
> > will also support shared memory.
>
> Thanks, I agree that introducing mmap this way could help in having it
> merged earlier, independently of conversion support, to support testing.
>
> I'll adopt this patch in the next revision of 1G page support for
> guest_memfd.
>
> > ---
> >  include/linux/kvm_host.h | 11 +++++
> >  virt/kvm/Kconfig         |  4 ++
> >  virt/kvm/guest_memfd.c   | 93 ++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 108 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 8b5f28f6efff..438aa3df3175 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -728,6 +728,17 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
> >  }
> >  #endif
> >
> > +/*
> > + * Arch code must define kvm_arch_gmem_supports_shared_mem if support for
> > + * private memory is enabled and it supports in-place shared/private conversion.
> > + */
> > +#if !defined(kvm_arch_gmem_supports_shared_mem) && !IS_ENABLED(CONFIG_KVM_PRIVATE_MEM)
> > +static inline bool kvm_arch_gmem_supports_shared_mem(struct kvm *kvm)
> > +{
> > +     return false;
> > +}
> > +#endif
>
> Perhaps this could be declared in the #ifdef CONFIG_KVM_PRIVATE_MEM
> block?
>
> Could this be defined as a __weak symbol for architectures to override?
> Or perhaps that can be done once guest_memfd gets refactored separately
> since now the entire guest_memfd.c isn't even compiled if
> CONFIG_KVM_PRIVATE_MEM is not set.

I don't have a strong opinion about this. I did it this way because
that's what the existing code for kvm_arch_has_private_mem() is like.
It seemed logical to follow that pattern.
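
For reference, the existing pattern in include/linux/kvm_host.h that I
followed looks roughly like this:

	#if !defined(kvm_arch_has_private_mem) && !IS_ENABLED(CONFIG_KVM_PRIVATE_MEM)
	static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
	{
		return false;
	}
	#endif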

> > +
> >  #ifndef kvm_arch_has_readonly_mem
> >  static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
> >  {
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index 54e959e7d68f..4e759e8020c5 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
> >  config HAVE_KVM_ARCH_GMEM_INVALIDATE
> >         bool
> >         depends on KVM_PRIVATE_MEM
> > +
> > +config KVM_GMEM_SHARED_MEM
> > +       select KVM_PRIVATE_MEM
> > +       bool
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index c6f6792bec2a..85467a3ef8ea 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -317,9 +317,102 @@ void kvm_gmem_handle_folio_put(struct folio *folio)
> >  {
> >       WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
> >  }
> > +
> > +static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
> > +{
> > +     struct kvm_gmem *gmem = file->private_data;
> > +
> > +     /* For now, VMs that support shared memory share all their memory. */
> > +     return kvm_arch_gmem_supports_shared_mem(gmem->kvm);
> > +}
> > +
> > +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> > +{
> > +     struct inode *inode = file_inode(vmf->vma->vm_file);
> > +     struct folio *folio;
> > +     vm_fault_t ret = VM_FAULT_LOCKED;
> > +
> > +     filemap_invalidate_lock_shared(inode->i_mapping);
> > +
> > +     folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> > +     if (IS_ERR(folio)) {
> > +             ret = VM_FAULT_SIGBUS;
>
> Will it always be a SIGBUS if there is some error getting a folio?

I could try to expand it, and map more of the VM_FAULT_* codes to the
potential return values from __filemap_get_folio(), i.e.,

-EAGAIN -> VM_FAULT_RETRY
-ENOMEM -> VM_FAULT_OOM

but some don't really map 1:1 if you read the description. For example,
is it correct to map

-ENOENT -> VM_FAULT_NOPAGE /* fault installed the pte, not return page */

That said, it might be useful to map at least EAGAIN and ENOMEM.
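
Something like the following, say (just a sketch; the helper name is
made up):

	static vm_fault_t kvm_gmem_error_to_fault(int err)
	{
		switch (err) {
		case -EAGAIN:
			return VM_FAULT_RETRY;
		case -ENOMEM:
			return VM_FAULT_OOM;
		default:
			return VM_FAULT_SIGBUS;
		}
	}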

> > +             goto out_filemap;
> > +     }
> > +
> > +     if (folio_test_hwpoison(folio)) {
> > +             ret = VM_FAULT_HWPOISON;
> > +             goto out_folio;
> > +     }
> > +
> > +     /* Must be called with folio lock held, i.e., after kvm_gmem_get_folio() */
> > +     if (!kvm_gmem_offset_is_shared(vmf->vma->vm_file, vmf->pgoff)) {
> > +             ret = VM_FAULT_SIGBUS;
> > +             goto out_folio;
> > +     }
> > +
> > +     /*
> > +      * Only private folios are marked as "guestmem" so far, and we never
> > +      * expect private folios at this point.
> > +      */
>
> Proposal - rephrase this comment as: before typed folios can be mapped,
> PGTY_guestmem is only tagged on folios so that guest_memfd will receive
> the kvm_gmem_handle_folio_put() callback. The tag is definitely not
> expected when a folio is about to be faulted in.
>
> I propose the above because I think technically when mappability is NONE
> the folio isn't private? Not sure if others see this differently.

A folio whose mappability is NONE is private as far as the host is
concerned. That said, I can rephrase this for clarity.


> > +     if (WARN_ON_ONCE(folio_test_guestmem(folio)))  {
> > +             ret = VM_FAULT_SIGBUS;
> > +             goto out_folio;
> > +     }
> > +
> > +     /* No support for huge pages. */
> > +     if (WARN_ON_ONCE(folio_test_large(folio))) {
> > +             ret = VM_FAULT_SIGBUS;
> > +             goto out_folio;
> > +     }
> > +
> > +     if (!folio_test_uptodate(folio)) {
> > +             clear_highpage(folio_page(folio, 0));
> > +             kvm_gmem_mark_prepared(folio);
> > +     }
> > +
> > +     vmf->page = folio_file_page(folio, vmf->pgoff);
> > +
> > +out_folio:
> > +     if (ret != VM_FAULT_LOCKED) {
> > +             folio_unlock(folio);
> > +             folio_put(folio);
> > +     }
> > +
> > +out_filemap:
> > +     filemap_invalidate_unlock_shared(inode->i_mapping);
> > +
> > +     return ret;
> > +}
> > +
> > +static const struct vm_operations_struct kvm_gmem_vm_ops = {
> > +     .fault = kvm_gmem_fault,
> > +};
> > +
> > +static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +     struct kvm_gmem *gmem = file->private_data;
> > +
> > +     if (!kvm_arch_gmem_supports_shared_mem(gmem->kvm))
> > +             return -ENODEV;
> > +
> > +     if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
> > +         (VM_SHARED | VM_MAYSHARE)) {
> > +             return -EINVAL;
> > +     }
> > +
> > +     file_accessed(file);
> > +     vm_flags_set(vma, VM_DONTDUMP);
> > +     vma->vm_ops = &kvm_gmem_vm_ops;
> > +
> > +     return 0;
> > +}
> > +#else
> > +#define kvm_gmem_mmap NULL
> >  #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> >
> >  static struct file_operations kvm_gmem_fops = {
> > +     .mmap           = kvm_gmem_mmap,
>
> I think it's better to surround this with #ifdef
> CONFIG_KVM_GMEM_SHARED_MEM so that when more code gets inserted between
> the struct declaration and the definition of kvm_gmem_mmap() it is more
> obvious that .mmap is only overridden when CONFIG_KVM_GMEM_SHARED_MEM is
> set.

This is a question of style, but I prefer this because it avoids
having more ifdefs. I think that I've seen this pattern elsewhere in
the kernel. That said, if others disagree, I'm happy to change this.

Thank you,
/fuad
>
> >       .open           = generic_file_open,
> >       .release        = kvm_gmem_release,
> >       .fallocate      = kvm_gmem_fallocate,


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 05/11] KVM: guest_memfd: Handle in-place shared memory as guest_memfd backed memory
  2025-02-12  0:15   ` Ackerley Tng
@ 2025-02-12  9:23     ` Fuad Tabba
  0 siblings, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-12  9:23 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Ackerley,

On Wed, 12 Feb 2025 at 00:15, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> > For VMs that allow sharing guest_memfd backed memory in-place,
> > handle that memory the same as "private" guest_memfd memory. This
> > means that faulting that memory in the host or in the guest will
> > go through the guest_memfd subsystem.
> >
> > Note that the word "private" in the name of the function
> > kvm_mem_is_private() doesn't necessarily indicate that the memory
> > isn't shared, but is due to the history and evolution of
> > guest_memfd and the various names it has received. In effect,
> > this function is used to multiplex between the path of a normal
> > page fault and the path of a guest_memfd backed page fault.
> >
>
> Thanks for this summary! It has always been confusing and this really
> helps.
>
> Is there any chance we could rename the functions in KVM, or maybe add a
> comment at the function definitions? The name of the userspace flag will
> have to remain, of course.

Actually, I was thinking of doing that in V4: rename, or at least add
an alias, as a separate patch, to see what the community thinks,
since even with this comment it is still confusing (as evidenced by
Quentin's comment on the later patch).

Cheers,
/fuad


> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> >  include/linux/kvm_host.h | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 438aa3df3175..39fd6e35c723 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2521,7 +2521,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> >  #else
> >  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> >  {
> > -     return false;
> > +     return kvm_arch_gmem_supports_shared_mem(kvm) &&
> > +            kvm_slot_can_be_private(gfn_to_memslot(kvm, gfn));
> >  }
> >  #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-11 12:11 ` [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages Fuad Tabba
@ 2025-02-12 18:19   ` Peter Xu
  2025-02-13  8:29     ` Fuad Tabba
  2025-02-17  9:49   ` Vlastimil Babka
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 54+ messages in thread
From: Peter Xu @ 2025-02-12 18:19 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton

On Tue, Feb 11, 2025 at 12:11:18PM +0000, Fuad Tabba wrote:
> Before transitioning a guest_memfd folio to unshared, thereby
> disallowing access by the host and allowing the hypervisor to
> transition its view of the guest page as private, we need to be
> sure that the host doesn't have any references to the folio.
> 
> This patch introduces a new type for guest_memfd folios, which
> isn't activated in this series but is here as a placeholder and
> to facilitate the code in the next patch. This will be used in
> the future to register a callback that informs the guest_memfd
> subsystem when the last reference is dropped, therefore knowing
> that the host doesn't have any remaining references.
> 
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  include/linux/kvm_host.h   |  9 +++++++++
>  include/linux/page-flags.h | 17 +++++++++++++++++
>  mm/debug.c                 |  1 +
>  mm/swap.c                  |  9 +++++++++
>  virt/kvm/guest_memfd.c     |  7 +++++++
>  5 files changed, 43 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f34f4cfaa513..8b5f28f6efff 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2571,4 +2571,13 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  				    struct kvm_pre_fault_memory *range);
>  #endif
>  
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +void kvm_gmem_handle_folio_put(struct folio *folio);
> +#else
> +static inline void kvm_gmem_handle_folio_put(struct folio *folio)
> +{
> +	WARN_ON_ONCE(1);
> +}
> +#endif
> +
>  #endif
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 6dc2494bd002..734afda268ab 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -933,6 +933,17 @@ enum pagetype {
>  	PGTY_slab	= 0xf5,
>  	PGTY_zsmalloc	= 0xf6,
>  	PGTY_unaccepted	= 0xf7,
> +	/*
> +	 * guestmem folios are used to back VM memory as managed by guest_memfd.
> +	 * Once the last reference is put, instead of freeing these folios back
> +	 * to the page allocator, they are returned to guest_memfd.
> +	 *
> +	 * For now, guestmem will only be set on these folios as long as they
> +	 * cannot be mapped to user space ("private state"), with the plan of
> +	 * always setting that type once typed folios can be mapped to user
> +	 * space cleanly.

Does it imply that gmem folios can be mapped to userspace at some point?
It'll be great if you can share more about it, since as of now it looks
like anything that has a page type cannot use the per-page mapcount.

When looking at this, I also found that __folio_rmap_sanity_checks() has
some folio_test_hugetlb() tests; I'm not sure whether they're likely to be
changed too, e.g. to cover all pages that have a type, so as to cover gmem.

For the longer term, it'll definitely be nice if gmem folios can be
mapcounted just like normal file folios.  That would enable gmem as a
backing store just like a normal memfd, with gmem managing the folios.

> +	 */
> +	PGTY_guestmem	= 0xf8,
>  
>  	PGTY_mapcount_underflow = 0xff
>  };
> @@ -1082,6 +1093,12 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb)
>  FOLIO_TEST_FLAG_FALSE(hugetlb)
>  #endif
>  
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM

This seems to only be defined in follow-up patches, so it may need
some adjustments.

> +FOLIO_TYPE_OPS(guestmem, guestmem)
> +#else
> +FOLIO_TEST_FLAG_FALSE(guestmem)
> +#endif
> +
>  PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
>  
>  /*
> diff --git a/mm/debug.c b/mm/debug.c
> index 8d2acf432385..08bc42c6cba8 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -56,6 +56,7 @@ static const char *page_type_names[] = {
>  	DEF_PAGETYPE_NAME(table),
>  	DEF_PAGETYPE_NAME(buddy),
>  	DEF_PAGETYPE_NAME(unaccepted),
> +	DEF_PAGETYPE_NAME(guestmem),
>  };
>  
>  static const char *page_type_name(unsigned int page_type)
> diff --git a/mm/swap.c b/mm/swap.c
> index 47bc1bb919cc..241880a46358 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -38,6 +38,10 @@
>  #include <linux/local_lock.h>
>  #include <linux/buffer_head.h>
>  
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +#include <linux/kvm_host.h>
> +#endif
> +
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -101,6 +105,11 @@ static void free_typed_folio(struct folio *folio)
>  	case PGTY_hugetlb:
>  		free_huge_folio(folio);
>  		return;
> +#endif
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +	case PGTY_guestmem:
> +		kvm_gmem_handle_folio_put(folio);
> +		return;
>  #endif
>  	default:
>  		WARN_ON_ONCE(1);
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index b2aa6bf24d3a..c6f6792bec2a 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -312,6 +312,13 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
>  	return gfn - slot->base_gfn + slot->gmem.pgoff;
>  }
>  
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +void kvm_gmem_handle_folio_put(struct folio *folio)
> +{
> +	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
> +}
> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> +
>  static struct file_operations kvm_gmem_fops = {
>  	.open		= generic_file_open,
>  	.release	= kvm_gmem_release,
> -- 
> 2.48.1.502.g6dc24dfdaf-goog
> 
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 03/11] KVM: guest_memfd: Allow host to map guest_memfd() pages
  2025-02-11 12:11 ` [PATCH v3 03/11] KVM: guest_memfd: Allow host to map guest_memfd() pages Fuad Tabba
  2025-02-12  5:07   ` Ackerley Tng
@ 2025-02-12 21:23   ` Peter Xu
  2025-02-13  8:24     ` Fuad Tabba
  1 sibling, 1 reply; 54+ messages in thread
From: Peter Xu @ 2025-02-12 21:23 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton

On Tue, Feb 11, 2025 at 12:11:19PM +0000, Fuad Tabba wrote:
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 54e959e7d68f..4e759e8020c5 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
>  config HAVE_KVM_ARCH_GMEM_INVALIDATE
>         bool
>         depends on KVM_PRIVATE_MEM
> +
> +config KVM_GMEM_SHARED_MEM
> +       select KVM_PRIVATE_MEM
> +       bool

No strong opinion here, but this might not be straightforward enough for
any reader to know why a shared mem option will select a private mem one.

I wonder whether it would be clearer to have a config for gmem alone, and
select that option no matter how gmem would be consumed.  Then the two
options above could select it.
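
E.g., something like this (KVM_GMEM is just a placeholder name):

	config KVM_GMEM
	       bool

	config KVM_PRIVATE_MEM
	       bool
	       select KVM_GMEM

	config KVM_GMEM_SHARED_MEM
	       bool
	       select KVM_GMEM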

I'm not sure whether there's too much guest-memfd stuff hard-coded to
PRIVATE_MEM; that's actually what I hit myself both in qemu & kvm when I
wanted to try guest-memfd on QEMU as purely shared (aka no conversions, no
duplicated backends, but in-place).  So this is pretty much an open
question.

The other thing is, currently guest-memfd binding only allows 1:1 binding
to kvm memslots for a specific offset range of gmem, rather than being able
to be mapped in multiple memslots:

kvm_gmem_bind():
	if (!xa_empty(&gmem->bindings) &&
	    xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
		filemap_invalidate_unlock(inode->i_mapping);
		goto err;
	}

I haven't dug further yet, but I feel like this won't trivially work with
things like SMRAM when in-place, which can map the same portion of a gmem
range more than once.  I wonder if this is a hard limit for guest-memfd,
and whether you hit anything similar when working on this series.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 03/11] KVM: guest_memfd: Allow host to map guest_memfd() pages
  2025-02-12 21:23   ` Peter Xu
@ 2025-02-13  8:24     ` Fuad Tabba
  0 siblings, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-13  8:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton

Hi Peter,

On Wed, 12 Feb 2025 at 21:24, Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Feb 11, 2025 at 12:11:19PM +0000, Fuad Tabba wrote:
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index 54e959e7d68f..4e759e8020c5 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
> >  config HAVE_KVM_ARCH_GMEM_INVALIDATE
> >         bool
> >         depends on KVM_PRIVATE_MEM
> > +
> > +config KVM_GMEM_SHARED_MEM
> > +       select KVM_PRIVATE_MEM
> > +       bool
>
> No strong opinion here, but this might not be straightforward enough for
> any reader to know why a shared mem option will select a private mem one.
>
> I wonder whether it would be clearer to have a config for gmem alone, and
> select that option no matter how gmem would be consumed.  Then the two
> options above could select it.
>
> I'm not sure whether there's too much guest-memfd stuff hard-coded to
> PRIVATE_MEM; that's actually what I hit myself both in qemu & kvm when I
> wanted to try guest-memfd on QEMU as purely shared (aka no conversions, no
> duplicated backends, but in-place).  So this is pretty much an open
> question.

Yes, the whole thing with guest_memfd initially being called private
mem has left a few things like this, e.g., config options and function
names. It has caused (and will probably continue to cause) confusion.

In order not to mix bikeshedding over names with the patch series
adding mmap support (i.e., this one), I am planning on sending a
separate patch series to handle the naming issue.

> The other thing is, currently guest-memfd binding only allows 1:1 binding
> to kvm memslots for a specific offset range of gmem, rather than being able
> to be mapped in multiple memslots:
>
> kvm_gmem_bind():
>         if (!xa_empty(&gmem->bindings) &&
>             xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
>                 filemap_invalidate_unlock(inode->i_mapping);
>                 goto err;
>         }
>
> I haven't dug further yet, but I feel like this won't trivially work with
> things like SMRAM when in-place, which can map the same portion of a gmem
> range more than once.  I wonder if this is a hard limit for guest-memfd,
> and whether you hit anything similar when working on this series.

I haven't thought about this much, but it could be something to tackle later on.

Thank you,
/fuad

> Thanks,
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-12 18:19   ` Peter Xu
@ 2025-02-13  8:29     ` Fuad Tabba
  0 siblings, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-13  8:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton

Hi Peter,

On Wed, 12 Feb 2025 at 18:19, Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Feb 11, 2025 at 12:11:18PM +0000, Fuad Tabba wrote:
> > Before transitioning a guest_memfd folio to unshared, thereby
> > disallowing access by the host and allowing the hypervisor to
> > transition its view of the guest page as private, we need to be
> > sure that the host doesn't have any references to the folio.
> >
> > This patch introduces a new type for guest_memfd folios, which
> > isn't activated in this series but is here as a placeholder and
> > to facilitate the code in the next patch. This will be used in
> > the future to register a callback that informs the guest_memfd
> > subsystem when the last reference is dropped, therefore knowing
> > that the host doesn't have any remaining references.
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> >  include/linux/kvm_host.h   |  9 +++++++++
> >  include/linux/page-flags.h | 17 +++++++++++++++++
> >  mm/debug.c                 |  1 +
> >  mm/swap.c                  |  9 +++++++++
> >  virt/kvm/guest_memfd.c     |  7 +++++++
> >  5 files changed, 43 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index f34f4cfaa513..8b5f28f6efff 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2571,4 +2571,13 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >                                   struct kvm_pre_fault_memory *range);
> >  #endif
> >
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +void kvm_gmem_handle_folio_put(struct folio *folio);
> > +#else
> > +static inline void kvm_gmem_handle_folio_put(struct folio *folio)
> > +{
> > +     WARN_ON_ONCE(1);
> > +}
> > +#endif
> > +
> >  #endif
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 6dc2494bd002..734afda268ab 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -933,6 +933,17 @@ enum pagetype {
> >       PGTY_slab       = 0xf5,
> >       PGTY_zsmalloc   = 0xf6,
> >       PGTY_unaccepted = 0xf7,
> > +     /*
> > +      * guestmem folios are used to back VM memory as managed by guest_memfd.
> > +      * Once the last reference is put, instead of freeing these folios back
> > +      * to the page allocator, they are returned to guest_memfd.
> > +      *
> > +      * For now, guestmem will only be set on these folios as long as they
> > +      * cannot be mapped to user space ("private state"), with the plan of
> > +      * always setting that type once typed folios can be mapped to user
> > +      * space cleanly.
>
> Does it imply that gmem folios can be mapped to userspace at some point?
> It'll be great if you can share more about it, since as of now it looks
> like anything that has a page type cannot use the per-page mapcount.

This is the goal of this series. By the end of this series, you can
map gmem folios, as long as they belong to a VM type that allows it.
My other series, which will be rebased on this one, adds the
distinction of memory shared with the host vs memory private to the
guest:

https://lore.kernel.org/all/20250117163001.2326672-1-tabba@google.com/

That series deals with the mapcount issue by only applying the type
once the mapcount is 0. We talked about this in the guest_memfd mm
sync, where David Hildenbrand mentioned ongoing work to remove the
overlaying of the type with the mapcount. That should solve the
problem completely.
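
Roughly, the idea is something like this (a sketch, using the folio
type ops this series adds):

	/* Sketch: only mark the folio as guestmem once the host has no mappings. */
	if (!folio_mapped(folio) && !folio_test_guestmem(folio))
		__folio_set_guestmem(folio);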


> When looking at this, I also found that __folio_rmap_sanity_checks() has
> some folio_test_hugetlb() tests; I'm not sure whether they're likely to be
> changed too, e.g. to cover all pages that have a type, so as to cover gmem.
>
> For the longer term, it'll definitely be nice if gmem folios can be
> mapcounted just like normal file folios.  That would enable gmem as a
> backing store just like a normal memfd, with gmem managing the folios.

That's the plan, I agree.

> > +      */
> > +     PGTY_guestmem   = 0xf8,
> >
> >       PGTY_mapcount_underflow = 0xff
> >  };
> > @@ -1082,6 +1093,12 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb)
> >  FOLIO_TEST_FLAG_FALSE(hugetlb)
> >  #endif
> >
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>
> This seems to only be defined in follow-up patches, so it may need
> some adjustments.

It's a configuration option. If you like, I could bring forward the
patch that adds it to the kconfig file.

Thank you,
/fuad

> > +FOLIO_TYPE_OPS(guestmem, guestmem)
> > +#else
> > +FOLIO_TEST_FLAG_FALSE(guestmem)
> > +#endif
> > +
> >  PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
> >
> >  /*
> > diff --git a/mm/debug.c b/mm/debug.c
> > index 8d2acf432385..08bc42c6cba8 100644
> > --- a/mm/debug.c
> > +++ b/mm/debug.c
> > @@ -56,6 +56,7 @@ static const char *page_type_names[] = {
> >       DEF_PAGETYPE_NAME(table),
> >       DEF_PAGETYPE_NAME(buddy),
> >       DEF_PAGETYPE_NAME(unaccepted),
> > +     DEF_PAGETYPE_NAME(guestmem),
> >  };
> >
> >  static const char *page_type_name(unsigned int page_type)
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 47bc1bb919cc..241880a46358 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -38,6 +38,10 @@
> >  #include <linux/local_lock.h>
> >  #include <linux/buffer_head.h>
> >
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +#include <linux/kvm_host.h>
> > +#endif
> > +
> >  #include "internal.h"
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -101,6 +105,11 @@ static void free_typed_folio(struct folio *folio)
> >       case PGTY_hugetlb:
> >               free_huge_folio(folio);
> >               return;
> > +#endif
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +     case PGTY_guestmem:
> > +             kvm_gmem_handle_folio_put(folio);
> > +             return;
> >  #endif
> >       default:
> >               WARN_ON_ONCE(1);
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index b2aa6bf24d3a..c6f6792bec2a 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -312,6 +312,13 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
> >       return gfn - slot->base_gfn + slot->gmem.pgoff;
> >  }
> >
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +void kvm_gmem_handle_folio_put(struct folio *folio)
> > +{
> > +     WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
> > +}
> > +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> > +
> >  static struct file_operations kvm_gmem_fops = {
> >       .open           = generic_file_open,
> >       .release        = kvm_gmem_release,
> > --
> > 2.48.1.502.g6dc24dfdaf-goog
> >
> >
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type
  2025-02-11 17:09           ` Quentin Perret
@ 2025-02-14 11:13             ` Quentin Perret
  2025-02-14 11:33               ` Fuad Tabba
  0 siblings, 1 reply; 54+ messages in thread
From: Quentin Perret @ 2025-02-14 11:13 UTC (permalink / raw)
  To: Patrick Roy
  Cc: Fuad Tabba, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
	willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On Tuesday 11 Feb 2025 at 17:09:20 (+0000), Quentin Perret wrote:
> Hi Patrick,
> 
> On Tuesday 11 Feb 2025 at 16:32:31 (+0000), Patrick Roy wrote:
> > I was hoping that SW_PROTECTED_VM will be the VM type that something
> > like Firecracker could use, e.g. an interface to guest_memfd specifically
> > _without_ pKVM, as Fuad was saying.
> 
> I had, probably incorrectly, assumed that we'd eventually want to allow
> gmem for all VMs, including traditional KVM VMs that don't have anything
> special. Perhaps the gmem support could be exposed via a KVM_CAP in this
> case?
> 
> Anyway, no objection to the proposed approach in this patch assuming we
> will eventually have HW_PROTECTED_VM for pKVM VMs, and that _that_ can be
> bit 31 :).

Thinking about this a bit deeper, I am still wondering what this new
SW_PROTECTED VM type is buying us. Given that SW_PROTECTED VMs accept
both guest-memfd backed memslots and traditional HVA-backed memslots,
couldn't we just make normal KVM guests accept guest-memfd memslots and
get the same thing? Is there any reason not to do that instead? Even
though SW_PROTECTED VMs are documented as 'unstable', the reality is
that this is UAPI and you can bet it will end up being relied upon, so I
would prefer to have a solid reason for introducing this new VM type.

Cheers,
Quentin


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type
  2025-02-14 11:13             ` Quentin Perret
@ 2025-02-14 11:33               ` Fuad Tabba
  2025-02-14 12:37                 ` Patrick Roy
  0 siblings, 1 reply; 54+ messages in thread
From: Fuad Tabba @ 2025-02-14 11:33 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Patrick Roy, kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai,
	mpe, anup, paul.walmsley, palmer, aou, seanjc, viro, brauner,
	willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Quentin,

On Fri, 14 Feb 2025 at 11:13, Quentin Perret <qperret@google.com> wrote:
>
> On Tuesday 11 Feb 2025 at 17:09:20 (+0000), Quentin Perret wrote:
> > Hi Patrick,
> >
> > On Tuesday 11 Feb 2025 at 16:32:31 (+0000), Patrick Roy wrote:
> > > I was hoping that SW_PROTECTED_VM will be the VM type that something
> > > like Firecracker could use, e.g. an interface to guest_memfd specifically
> > > _without_ pKVM, as Fuad was saying.
> >
> > I had, probably incorrectly, assumed that we'd eventually want to allow
> > gmem for all VMs, including traditional KVM VMs that don't have anything
> > special. Perhaps the gmem support could be exposed via a KVM_CAP in this
> > case?
> >
> > Anyway, no objection to the proposed approach in this patch assuming we
> > will eventually have HW_PROTECTED_VM for pKVM VMs, and that _that_ can be
> > bit 31 :).
>
> Thinking about this a bit deeper, I am still wondering what this new
> SW_PROTECTED VM type is buying us. Given that SW_PROTECTED VMs accept
> both guest-memfd backed memslots and traditional HVA-backed memslots,
> couldn't we just make normal KVM guests accept guest-memfd memslots and
> get the same thing? Is there any reason not to do that instead? Even
> though SW_PROTECTED VMs are documented as 'unstable', the reality is
> that this is UAPI and you can bet it will end up being relied upon, so I
> would prefer to have a solid reason for introducing this new VM type.

The more I think about it, I agree with you. I think that reasonable
behavior (for kvm/arm64) would be to allow using guest_memfd with all
VM types. If the VM type is a non-protected type, then its memory is
considered shared by default and is mappable, as long as the kconfig
option is enabled. If the VM is protected, then the memory is not shared
by default.
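
For kvm/arm64 that could boil down to something like this (a sketch,
not tested):

	/* Sketch: non-protected VM types default to shared, mappable memory. */
	static inline bool kvm_arch_gmem_supports_shared_mem(struct kvm *kvm)
	{
		return IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM) &&
		       !kvm_vm_is_protected(kvm);
	}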

What do you think Patrick? Do you need an explicit VM type?

Cheers,
/fuad

> Cheers,
> Quentin


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type
  2025-02-14 11:33               ` Fuad Tabba
@ 2025-02-14 12:37                 ` Patrick Roy
  2025-02-14 13:11                   ` Fuad Tabba
  0 siblings, 1 reply; 54+ messages in thread
From: Patrick Roy @ 2025-02-14 12:37 UTC (permalink / raw)
  To: Fuad Tabba, Quentin Perret, seanjc
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton



On Fri, 2025-02-14 at 11:33 +0000, Fuad Tabba wrote:
> Hi Quentin,
> 
> On Fri, 14 Feb 2025 at 11:13, Quentin Perret <qperret@google.com> wrote:
>>
>> On Tuesday 11 Feb 2025 at 17:09:20 (+0000), Quentin Perret wrote:
>>> Hi Patrick,
>>>
>>> On Tuesday 11 Feb 2025 at 16:32:31 (+0000), Patrick Roy wrote:
>>>> I was hoping that SW_PROTECTED_VM will be the VM type that something
>>>> like Firecracker could use, e.g. an interface to guest_memfd specifically
>>>> _without_ pKVM, as Fuad was saying.
>>>
>>> I had, probably incorrectly, assumed that we'd eventually want to allow
>>> gmem for all VMs, including traditional KVM VMs that don't have anything
>>> special. Perhaps the gmem support could be exposed via a KVM_CAP in this
>>> case?
>>>
>>> Anyway, no objection to the proposed approach in this patch assuming we
>>> will eventually have HW_PROTECTED_VM for pKVM VMs, and that _that_ can be
>>> bit 31 :).
>>
>> Thinking about this a bit deeper, I am still wondering what this new
>> SW_PROTECTED VM type is buying us. Given that SW_PROTECTED VMs accept
>> both guest-memfd backed memslots and traditional HVA-backed memslots,
>> couldn't we just make normal KVM guests accept guest-memfd memslots and
>> get the same thing? Is there any reason not to do that instead? Even
>> though SW_PROTECTED VMs are documented as 'unstable', the reality is
>> that this is UAPI and you can bet it will end up being relied upon, so I
>> would prefer to have a solid reason for introducing this new VM type.
> 
> The more I think about it, I agree with you. I think that reasonable
> behavior (for kvm/arm64) would be to allow using guest_memfd with all
> VM types. If the VM type is a non-protected type, then its memory is
> considered shared by default and is mappable --- as long as the
> kconfig option is enabled. If the VM is protected, then the memory is not
> shared by default.
> 
> What do you think Patrick? Do you need an explicit VM type?

Mhh, no, if "normal" VMs support guest_memfd, then that works too. I
suggested the VM type because that's how x86 works
(KVM_X86_SW_PROTECTED_VM), but never actually stopped to think about
whether it makes sense for ARM. Maybe Sean knows something we're missing?

I wonder whether having the "default sharedness" depend on the VM type
works out though - whether a range of gmem is shared or private is a
property of the guest_memfd instance, not the VM it's attached to, so I
guess the default behavior needs to be based solely on the guest_memfd
as well (and then if someone tries to attach a gmem to a VM whose desire
of protection doesn't match the guest_memfd's configuration, that
operation would fail)?
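
Something like the following, perhaps (GMEM_F_SHARED and the arch hook
are made-up names, just to illustrate the idea):

static int gmem_check_vm_compat(struct kvm *kvm, u64 gmem_flags)
{
        bool vm_wants_private = kvm_arch_vm_is_protected(kvm); /* assumed */
        bool gmem_is_shared = gmem_flags & GMEM_F_SHARED;      /* assumed */

        /*
         * Fail the bind when the VM's desire for protection doesn't
         * match how the guest_memfd instance was configured.
         */
        if (vm_wants_private == gmem_is_shared)
                return -EINVAL;

        return 0;
}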

Tangentially related, does KVM_GMEM_SHARED to you mean "guest_memfd also
supports shared sections", or "guest_memfd does not support private
memory anymore"? (the difference being that, in the former case,
KVM_GMEM_SHARED would later get the ability to convert ranges to
private, and the EOPNOTSUPP is just a transient state until conversion
support is merged) - doesn't matter for my use case, but I got curious as some other
threads implied the second option to me and I ended up wondering why.

Best,
Patrick

> Cheers,
> /fuad
> 
>> Cheers,
>> Quentin


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type
  2025-02-14 12:37                 ` Patrick Roy
@ 2025-02-14 13:11                   ` Fuad Tabba
  2025-02-14 13:18                     ` Patrick Roy
  0 siblings, 1 reply; 54+ messages in thread
From: Fuad Tabba @ 2025-02-14 13:11 UTC (permalink / raw)
  To: Patrick Roy
  Cc: Quentin Perret, seanjc, kvm, linux-arm-msm, linux-mm, pbonzini,
	chenhuacai, mpe, anup, paul.walmsley, palmer, aou, viro, brauner,
	willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Patrick,

On Fri, 14 Feb 2025 at 12:37, Patrick Roy <roypat@amazon.co.uk> wrote:
>
>
>
> On Fri, 2025-02-14 at 11:33 +0000, Fuad Tabba wrote:
> > Hi Quentin,
> >
> > On Fri, 14 Feb 2025 at 11:13, Quentin Perret <qperret@google.com> wrote:
> >>
> >> On Tuesday 11 Feb 2025 at 17:09:20 (+0000), Quentin Perret wrote:
> >>> Hi Patrick,
> >>>
> >>> On Tuesday 11 Feb 2025 at 16:32:31 (+0000), Patrick Roy wrote:
> >>>> I was hoping that SW_PROTECTED_VM will be the VM type that something
> >>>> like Firecracker could use, e.g. an interface to guest_memfd specifically
> >>>> _without_ pKVM, as Fuad was saying.
> >>>
> >>> I had, probably incorrectly, assumed that we'd eventually want to allow
> >>> gmem for all VMs, including traditional KVM VMs that don't have anything
> >>> special. Perhaps the gmem support could be exposed via a KVM_CAP in this
> >>> case?
> >>>
> >>> Anyway, no objection to the proposed approach in this patch assuming we
> >>> will eventually have HW_PROTECTED_VM for pKVM VMs, and that _that_ can be
> >>> bit 31 :).
> >>
> >> Thinking about this a bit deeper, I am still wondering what this new
> >> SW_PROTECTED VM type is buying us? Given that SW_PROTECTED VMs accept
> >> both guest-memfd backed memslots and traditional HVA-backed memslots, we
> >> could just make normal KVM guests accept guest-memfd memslots and get
> >> the same thing? Is there any reason not to do that instead? Even though
> >> SW_PROTECTED VMs are documented as 'unstable', the reality is this is
> >> UAPI and you can bet it will end up being relied upon, so I would prefer
> >> to have a solid reason for introducing this new VM type.
> >
> > The more I think about it, I agree with you. I think that reasonable
> > behavior (for kvm/arm64) would be to allow using guest_memfd with all
> > VM types. If the VM type is a non-protected type, then its memory is
> > considered shared by default and is mappable --- as long as the
> > kconfig option is enabled. If the VM is protected, then the memory is not
> > shared by default.
> >
> > What do you think Patrick? Do you need an explicit VM type?
>
> Mhh, no, if "normal" VMs support guest_memfd, then that works too. I
> suggested the VM type because that's how x86 works
> (KVM_X86_SW_PROTECTED_VM), but never actually stopped to think about
> whether it makes sense for ARM. Maybe Sean knows something we're missing?
>
> I wonder whether having the "default sharedness" depend on the VM type
> works out though - whether a range of gmem is shared or private is a
> property of the guest_memfd instance, not the VM it's attached to, so I
> guess the default behavior needs to be based solely on the guest_memfd
> as well (and then if someone tries to attach a gmem to a VM whose desire
> of protection doesn't match the guest_memfd's configuration, that
> operation would fail)?

Each guest_memfd is associated with a KVM instance. Although it could
migrate, it would be weird for a guest_memfd instance to migrate
between different types of VM, or at least, migrate between VMs that
have different confidentiality requirements.


> Tangentially related, does KVM_GMEM_SHARED to you mean "guest_memfd also
> supports shared sections", or "guest_memfd does not support private
> memory anymore"? (the difference being that, in the former case,
> KVM_GMEM_SHARED would later get the ability to convert ranges to
> private, and the EOPNOTSUPP is just a transient state until conversion
> support is merged) - doesn't matter for my use case, but I got curious as some other
> threads implied the second option to me and I ended up wondering why.

My thinking (and implementation in the other patch series) is that
KVM_GMEM_SHARED (back then called KVM_GMEM_MAPPABLE) allows sharing in
place/mapping, without adding restrictions.
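
From userspace, that would look roughly like this (using the
KVM_GMEM_SHARED name from this thread; the final UAPI name was still
in flux at the time, so treat this as a sketch):

struct kvm_create_guest_memfd gmem = {
        .size  = guest_mem_size,
        .flags = KVM_GMEM_SHARED,       /* allow in-place sharing/mmap */
};
int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

/* With the flag set, the host can map the memory directly. */
void *hva = mmap(NULL, guest_mem_size, PROT_READ | PROT_WRITE,
                 MAP_SHARED, gmem_fd, 0);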

Cheers,
/fuad

> Best,
> Patrick
>
> > Cheers,
> > /fuad
> >
> >> Cheers,
> >> Quentin


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type
  2025-02-14 13:11                   ` Fuad Tabba
@ 2025-02-14 13:18                     ` Patrick Roy
  2025-02-14 15:12                       ` Sean Christopherson
  0 siblings, 1 reply; 54+ messages in thread
From: Patrick Roy @ 2025-02-14 13:18 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Quentin Perret, seanjc, kvm, linux-arm-msm, linux-mm, pbonzini,
	chenhuacai, mpe, anup, paul.walmsley, palmer, aou, viro, brauner,
	willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve,
	ackerleytng, mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, maz, will, keirf, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton



On Fri, 2025-02-14 at 13:11 +0000, Fuad Tabba wrote:
> Hi Patrick,
> 
> On Fri, 14 Feb 2025 at 12:37, Patrick Roy <roypat@amazon.co.uk> wrote:
>>
>>
>>
>> On Fri, 2025-02-14 at 11:33 +0000, Fuad Tabba wrote:
>>> Hi Quentin,
>>>
>>> On Fri, 14 Feb 2025 at 11:13, Quentin Perret <qperret@google.com> wrote:
>>>>
>>>> On Tuesday 11 Feb 2025 at 17:09:20 (+0000), Quentin Perret wrote:
>>>>> Hi Patrick,
>>>>>
>>>>> On Tuesday 11 Feb 2025 at 16:32:31 (+0000), Patrick Roy wrote:
>>>>>> I was hoping that SW_PROTECTED_VM will be the VM type that something
>>>>>> like Firecracker could use, e.g. an interface to guest_memfd specifically
>>>>>> _without_ pKVM, as Fuad was saying.
>>>>>
>>>>> I had, probably incorrectly, assumed that we'd eventually want to allow
>>>>> gmem for all VMs, including traditional KVM VMs that don't have anything
>>>>> special. Perhaps the gmem support could be exposed via a KVM_CAP in this
>>>>> case?
>>>>>
>>>>> Anyway, no objection to the proposed approach in this patch assuming we
>>>>> will eventually have HW_PROTECTED_VM for pKVM VMs, and that _that_ can be
>>>>> bit 31 :).
>>>>
>>>> Thinking about this a bit deeper, I am still wondering what this new
>>>> SW_PROTECTED VM type is buying us? Given that SW_PROTECTED VMs accept
>>>> both guest-memfd backed memslots and traditional HVA-backed memslots, we
>>>> could just make normal KVM guests accept guest-memfd memslots and get
>>>> the same thing? Is there any reason not to do that instead? Even though
>>>> SW_PROTECTED VMs are documented as 'unstable', the reality is this is
>>>> UAPI and you can bet it will end up being relied upon, so I would prefer
>>>> to have a solid reason for introducing this new VM type.
>>>
>>> The more I think about it, I agree with you. I think that reasonable
>>> behavior (for kvm/arm64) would be to allow using guest_memfd with all
>>> VM types. If the VM type is a non-protected type, then its memory is
>>> considered shared by default and is mappable --- as long as the
>>> kconfig option is enabled. If the VM is protected, then the memory is not
>>> shared by default.
>>>
>>> What do you think Patrick? Do you need an explicit VM type?
>>
>> Mhh, no, if "normal" VMs support guest_memfd, then that works too. I
>> suggested the VM type because that's how x86 works
>> (KVM_X86_SW_PROTECTED_VM), but never actually stopped to think about
>> whether it makes sense for ARM. Maybe Sean knows something we're missing?
>>
>> I wonder whether having the "default sharedness" depend on the VM type
>> works out though - whether a range of gmem is shared or private is a
>> property of the guest_memfd instance, not the VM it's attached to, so I
>> guess the default behavior needs to be based solely on the guest_memfd
>> as well (and then if someone tries to attach a gmem to a VM whose desire
>> of protection doesn't match the guest_memfd's configuration, that
>> operation would fail)?
> 
> Each guest_memfd is associated with a KVM instance. Although it could
> migrate, it would be weird for a guest_memfd instance to migrate
> between different types of VM, or at least, migrate between VMs that
> have different confidentiality requirements.

Ahh, right, I keep forgetting that CREATE_GUEST_MEMFD() is a vm ioctl. My
bad, sorry!

>> Tangentially related, does KVM_GMEM_SHARED to you mean "guest_memfd also
>> supports shared sections", or "guest_memfd does not support private
>> memory anymore"? (the difference being that, in the former case,
>> KVM_GMEM_SHARED would later get the ability to convert ranges to
>> private, and the EOPNOTSUPP is just a transient state until conversion
>> support is merged) - doesn't matter for my use case, but I got curious as some other
>> threads implied the second option to me and I ended up wondering why.
> 
> My thinking (and implementation in the other patch series) is that
> KVM_GMEM_SHARED (back then called KVM_GMEM_MAPPABLE) allows sharing in
> place/mapping, without adding restrictions.

That makes sense to me, thanks for the explanation!

> Cheers,
> /fuad
> 
>> Best,
>> Patrick
>>
>>> Cheers,
>>> /fuad
>>>
>>>> Cheers,
>>>> Quentin



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type
  2025-02-14 13:18                     ` Patrick Roy
@ 2025-02-14 15:12                       ` Sean Christopherson
  0 siblings, 0 replies; 54+ messages in thread
From: Sean Christopherson @ 2025-02-14 15:12 UTC (permalink / raw)
  To: Patrick Roy
  Cc: Fuad Tabba, Quentin Perret, kvm, linux-arm-msm, linux-mm,
	pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	viro, brauner, willy, akpm, xiaoyao.li, yilun.xu, chao.p.peng,
	jarkko, amoorthy, dmatlack, yu.c.zhang, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, keirf, shuah, hch, jgg, rientjes,
	jhubbard, fvdl, hughd, jthoughton

On Fri, Feb 14, 2025, Patrick Roy wrote:
> On Fri, 2025-02-14 at 13:11 +0000, Fuad Tabba wrote:
> > On Fri, 14 Feb 2025 at 12:37, Patrick Roy <roypat@amazon.co.uk> wrote:
> >> On Fri, 2025-02-14 at 11:33 +0000, Fuad Tabba wrote:
> >>> Hi Quentin,
> >>>
> >>> On Fri, 14 Feb 2025 at 11:13, Quentin Perret <qperret@google.com> wrote:
> >>>>
> >>>> On Tuesday 11 Feb 2025 at 17:09:20 (+0000), Quentin Perret wrote:
> >>>>> Hi Patrick,
> >>>>>
> >>>>> On Tuesday 11 Feb 2025 at 16:32:31 (+0000), Patrick Roy wrote:
> >>>>>> I was hoping that SW_PROTECTED_VM will be the VM type that something
> >>>>>> like Firecracker could use, e.g. an interface to guest_memfd specifically
> >>>>>> _without_ pKVM, as Fuad was saying.
> >>>>>
> >>>>> I had, probably incorrectly, assumed that we'd eventually want to allow
> >>>>> gmem for all VMs, including traditional KVM VMs that don't have anything
> >>>>> special. Perhaps the gmem support could be exposed via a KVM_CAP in this
> >>>>> case?
> >>>>>
> >>>>> Anyway, no objection to the proposed approach in this patch assuming we
> >>>>> will eventually have HW_PROTECTED_VM for pKVM VMs, and that _that_ can be
> >>>>> bit 31 :).
> >>>>
> >>>> Thinking about this a bit deeper, I am still wondering what this new
> >>>> SW_PROTECTED VM type is buying us? Given that SW_PROTECTED VMs accept
> >>>> both guest-memfd backed memslots and traditional HVA-backed memslots, we
> >>>> could just make normal KVM guests accept guest-memfd memslots and get
> >>>> the same thing? Is there any reason not to do that instead?

Once guest_memfd can be mmap()'d, no.  KVM_X86_SW_PROTECTED_VM was added for
testing and development of guest_memfd largely because KVM can't support a "real"
VM if KVM can't read/write guest memory through its normal mechanisms.  The gap
is most apparent on x86, but it holds true for arm64 as well.
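
For reference, KVM's generic accessors assume a host userspace mapping
exists; simplified from virt/kvm/kvm_main.c (the real helper goes
through gfn_to_hva_memslot_prot()):

static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
                                 void *data, int offset, int len)
{
        unsigned long addr = gfn_to_hva_memslot(slot, gfn);

        if (kvm_is_error_hva(addr))
                return -EFAULT;
        /* No HVA, e.g. guest_memfd without mmap() support, means -EFAULT. */
        if (__copy_from_user(data, (void __user *)addr + offset, len))
                return -EFAULT;
        return 0;
}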

> >>>> Even though SW_PROTECTED VMs are documented as 'unstable', the reality
> >>>> is this is UAPI and you can bet it will end up being relied upon, so I
> >>>> would prefer to have a solid reason for introducing this new VM type.
> >>>
> >>> The more I think about it, I agree with you. I think that reasonable
> >>> behavior (for kvm/arm64) would be to allow using guest_memfd with all
> >>> VM types. If the VM type is a non-protected type, then its memory is
> >>> considered shared by default and is mappable --- as long as the
> >>> kconfig option is enabled. If the VM is protected, then the memory is not
> >>> shared by default.

This aligns with what I see happening for x86, except that for non-protected VMs
there will be no shared vs. private, because such VMs won't have a concept of
private memory.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 01/11] mm: Consolidate freeing of typed folios on final folio_put()
  2025-02-11 12:11 ` [PATCH v3 01/11] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
@ 2025-02-17  9:33   ` Vlastimil Babka
  2025-02-20 11:17   ` David Hildenbrand
  1 sibling, 0 replies; 54+ messages in thread
From: Vlastimil Babka @ 2025-02-17  9:33 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On 2/11/25 13:11, Fuad Tabba wrote:
> Some folio types, such as hugetlb, handle freeing their own
> folios. Moreover, guest_memfd will require being notified once a
> folio's reference count reaches 0 to facilitate shared to private
> folio conversion, without the folio actually being freed at that
> point.
> 
> As a first step towards that, this patch consolidates freeing
> folios that have a type. The first user is hugetlb folios. Later
> in this patch series, guest_memfd will become the second user of
> this.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-11 12:11 ` [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages Fuad Tabba
  2025-02-12 18:19   ` Peter Xu
@ 2025-02-17  9:49   ` Vlastimil Babka
  2025-02-17 10:12     ` Fuad Tabba
  2025-02-20 11:19   ` David Hildenbrand
  2025-02-20 11:25   ` David Hildenbrand
  3 siblings, 1 reply; 54+ messages in thread
From: Vlastimil Babka @ 2025-02-17  9:49 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vannapurve, ackerleytng, mail, david,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On 2/11/25 13:11, Fuad Tabba wrote:
> Before transitioning a guest_memfd folio to unshared, thereby
> disallowing access by the host and allowing the hypervisor to
> transition its view of the guest page as private, we need to be
> sure that the host doesn't have any references to the folio.
> 
> This patch introduces a new type for guest_memfd folios, which
> isn't activated in this series but is here as a placeholder and
> to facilitate the code in the next patch. This will be used in

It's not clear to me how the code in the next patch is facilitated as it
doesn't use any of this?

> the future to register a callback that informs the guest_memfd
> subsystem when the last reference is dropped, therefore knowing
> that the host doesn't have any remaining references.
> 
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  include/linux/kvm_host.h   |  9 +++++++++
>  include/linux/page-flags.h | 17 +++++++++++++++++
>  mm/debug.c                 |  1 +
>  mm/swap.c                  |  9 +++++++++
>  virt/kvm/guest_memfd.c     |  7 +++++++
>  5 files changed, 43 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f34f4cfaa513..8b5f28f6efff 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2571,4 +2571,13 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  				    struct kvm_pre_fault_memory *range);
>  #endif
>  
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +void kvm_gmem_handle_folio_put(struct folio *folio);
> +#else
> +static inline void kvm_gmem_handle_folio_put(struct folio *folio)
> +{
> +	WARN_ON_ONCE(1);
> +}

Since the caller is guarded by CONFIG_KVM_GMEM_SHARED_MEM, do we need the
CONFIG_KVM_GMEM_SHARED_MEM=n variant at all?

> +#endif
> +
>  #endif
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 6dc2494bd002..734afda268ab 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -933,6 +933,17 @@ enum pagetype {
>  	PGTY_slab	= 0xf5,
>  	PGTY_zsmalloc	= 0xf6,
>  	PGTY_unaccepted	= 0xf7,
> +	/*
> +	 * guestmem folios are used to back VM memory as managed by guest_memfd.
> +	 * Once the last reference is put, instead of freeing these folios back
> +	 * to the page allocator, they are returned to guest_memfd.
> +	 *
> +	 * For now, guestmem will only be set on these folios as long as they
> +	 * cannot be mapped to user space ("private state"), with the plan of

Might be a bit misleading as I don't think it's set yet as of this series.
But I guess we can keep it to avoid another update later.

> +	 * always setting that type once typed folios can be mapped to user
> +	 * space cleanly.
> +	 */
> +	PGTY_guestmem	= 0xf8,
>  
>  	PGTY_mapcount_underflow = 0xff
>  };
> @@ -1082,6 +1093,12 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb)
>  FOLIO_TEST_FLAG_FALSE(hugetlb)
>  #endif
>  
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +FOLIO_TYPE_OPS(guestmem, guestmem)
> +#else
> +FOLIO_TEST_FLAG_FALSE(guestmem)
> +#endif
> +
>  PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
>  
>  /*
> diff --git a/mm/debug.c b/mm/debug.c
> index 8d2acf432385..08bc42c6cba8 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -56,6 +56,7 @@ static const char *page_type_names[] = {
>  	DEF_PAGETYPE_NAME(table),
>  	DEF_PAGETYPE_NAME(buddy),
>  	DEF_PAGETYPE_NAME(unaccepted),
> +	DEF_PAGETYPE_NAME(guestmem),
>  };
>  
>  static const char *page_type_name(unsigned int page_type)
> diff --git a/mm/swap.c b/mm/swap.c
> index 47bc1bb919cc..241880a46358 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -38,6 +38,10 @@
>  #include <linux/local_lock.h>
>  #include <linux/buffer_head.h>
>  
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +#include <linux/kvm_host.h>
> +#endif

Do we need to guard the include?

> +
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -101,6 +105,11 @@ static void free_typed_folio(struct folio *folio)
>  	case PGTY_hugetlb:
>  		free_huge_folio(folio);
>  		return;
> +#endif
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +	case PGTY_guestmem:
> +		kvm_gmem_handle_folio_put(folio);
> +		return;
>  #endif
>  	default:
>  		WARN_ON_ONCE(1);
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index b2aa6bf24d3a..c6f6792bec2a 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -312,6 +312,13 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
>  	return gfn - slot->base_gfn + slot->gmem.pgoff;
>  }
>  
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +void kvm_gmem_handle_folio_put(struct folio *folio)
> +{
> +	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
> +}
> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> +
>  static struct file_operations kvm_gmem_fops = {
>  	.open		= generic_file_open,
>  	.release	= kvm_gmem_release,



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-17  9:49   ` Vlastimil Babka
@ 2025-02-17 10:12     ` Fuad Tabba
  2025-02-17 11:21       ` Vlastimil Babka
  2025-02-20 11:22       ` David Hildenbrand
  0 siblings, 2 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-17 10:12 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi Vlastimil,

On Mon, 17 Feb 2025 at 09:49, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/11/25 13:11, Fuad Tabba wrote:
> > Before transitioning a guest_memfd folio to unshared, thereby
> > disallowing access by the host and allowing the hypervisor to
> > transition its view of the guest page as private, we need to be
> > sure that the host doesn't have any references to the folio.
> >
> > This patch introduces a new type for guest_memfd folios, which
> > isn't activated in this series but is here as a placeholder and
> > to facilitate the code in the next patch. This will be used in
>
> It's not clear to me how the code in the next patch is facilitated as it
> doesn't use any of this?

I'm sorry about that, I'm missing the word "series". i.e.,

> > This patch introduces a new type for guest_memfd folios, which
> > isn't activated in this series but is here as a placeholder and
> > to facilitate the code in the next patch *series*.

I'm referring to this series:
https://lore.kernel.org/all/20250117163001.2326672-1-tabba@google.com/

> > the future to register a callback that informs the guest_memfd
> > subsystem when the last reference is dropped, therefore knowing
> > that the host doesn't have any remaining references.
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> >  include/linux/kvm_host.h   |  9 +++++++++
> >  include/linux/page-flags.h | 17 +++++++++++++++++
> >  mm/debug.c                 |  1 +
> >  mm/swap.c                  |  9 +++++++++
> >  virt/kvm/guest_memfd.c     |  7 +++++++
> >  5 files changed, 43 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index f34f4cfaa513..8b5f28f6efff 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2571,4 +2571,13 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >                                   struct kvm_pre_fault_memory *range);
> >  #endif
> >
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +void kvm_gmem_handle_folio_put(struct folio *folio);
> > +#else
> > +static inline void kvm_gmem_handle_folio_put(struct folio *folio)
> > +{
> > +     WARN_ON_ONCE(1);
> > +}
>
> Since the caller is guarded by CONFIG_KVM_GMEM_SHARED_MEM, do we need the
> CONFIG_KVM_GMEM_SHARED_MEM=n variant at all?

No. I'll remove it.
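
With the stub gone, the kvm_host.h hunk shrinks to just the
declaration:

#ifdef CONFIG_KVM_GMEM_SHARED_MEM
void kvm_gmem_handle_folio_put(struct folio *folio);
#endif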

> > +#endif
> > +
> >  #endif
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 6dc2494bd002..734afda268ab 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -933,6 +933,17 @@ enum pagetype {
> >       PGTY_slab       = 0xf5,
> >       PGTY_zsmalloc   = 0xf6,
> >       PGTY_unaccepted = 0xf7,
> > +     /*
> > +      * guestmem folios are used to back VM memory as managed by guest_memfd.
> > +      * Once the last reference is put, instead of freeing these folios back
> > +      * to the page allocator, they are returned to guest_memfd.
> > +      *
> > +      * For now, guestmem will only be set on these folios as long as they
> > +      * cannot be mapped to user space ("private state"), with the plan of
>
> Might be a bit misleading as I don't think it's set yet as of this series.
> But I guess we can keep it to avoid another update later.

You're right, it's not in this series. But as you said, the idea is to
have the least amount of churn in the core mm code.

> > +      * always setting that type once typed folios can be mapped to user
> > +      * space cleanly.
> > +      */
> > +     PGTY_guestmem   = 0xf8,
> >
> >       PGTY_mapcount_underflow = 0xff
> >  };
> > @@ -1082,6 +1093,12 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb)
> >  FOLIO_TEST_FLAG_FALSE(hugetlb)
> >  #endif
> >
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +FOLIO_TYPE_OPS(guestmem, guestmem)
> > +#else
> > +FOLIO_TEST_FLAG_FALSE(guestmem)
> > +#endif
> > +
> >  PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
> >
> >  /*
> > diff --git a/mm/debug.c b/mm/debug.c
> > index 8d2acf432385..08bc42c6cba8 100644
> > --- a/mm/debug.c
> > +++ b/mm/debug.c
> > @@ -56,6 +56,7 @@ static const char *page_type_names[] = {
> >       DEF_PAGETYPE_NAME(table),
> >       DEF_PAGETYPE_NAME(buddy),
> >       DEF_PAGETYPE_NAME(unaccepted),
> > +     DEF_PAGETYPE_NAME(guestmem),
> >  };
> >
> >  static const char *page_type_name(unsigned int page_type)
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 47bc1bb919cc..241880a46358 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -38,6 +38,10 @@
> >  #include <linux/local_lock.h>
> >  #include <linux/buffer_head.h>
> >
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +#include <linux/kvm_host.h>
> > +#endif
>
> Do we need to guard the include?

Yes, otherwise allnoconfig complains due to many of the x86 things it
drags along if included but KVM isn't configured. I could put it in a
different header that doesn't have this problem, but I couldn't think
of a better header for this.

Thank you,
/fuad

> > +
> >  #include "internal.h"
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -101,6 +105,11 @@ static void free_typed_folio(struct folio *folio)
> >       case PGTY_hugetlb:
> >               free_huge_folio(folio);
> >               return;
> > +#endif
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +     case PGTY_guestmem:
> > +             kvm_gmem_handle_folio_put(folio);
> > +             return;
> >  #endif
> >       default:
> >               WARN_ON_ONCE(1);
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index b2aa6bf24d3a..c6f6792bec2a 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -312,6 +312,13 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
> >       return gfn - slot->base_gfn + slot->gmem.pgoff;
> >  }
> >
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +void kvm_gmem_handle_folio_put(struct folio *folio)
> > +{
> > +     WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
> > +}
> > +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> > +
> >  static struct file_operations kvm_gmem_fops = {
> >       .open           = generic_file_open,
> >       .release        = kvm_gmem_release,
>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-17 10:12     ` Fuad Tabba
@ 2025-02-17 11:21       ` Vlastimil Babka
  2025-02-17 11:21         ` Fuad Tabba
  2025-02-20 11:22       ` David Hildenbrand
  1 sibling, 1 reply; 54+ messages in thread
From: Vlastimil Babka @ 2025-02-17 11:21 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On 2/17/25 11:12, Fuad Tabba wrote:
> Hi Vlastimil,
> 
> On Mon, 17 Feb 2025 at 09:49, Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 2/11/25 13:11, Fuad Tabba wrote:
>> > Before transitioning a guest_memfd folio to unshared, thereby
>> > disallowing access by the host and allowing the hypervisor to
>> > transition its view of the guest page as private, we need to be
>> > sure that the host doesn't have any references to the folio.
>> >
>> > This patch introduces a new type for guest_memfd folios, which
>> > isn't activated in this series but is here as a placeholder and
>> > to facilitate the code in the next patch. This will be used in
>>
>> It's not clear to me how the code in the next patch is facilitated as it
>> doesn't use any of this?
> 
> I'm sorry about that, I'm missing the word "series". i.e.,
> 
>> > This patch introduces a new type for guest_memfd folios, which
>> > isn't activated in this series but is here as a placeholder and
>> > to facilitate the code in the next patch *series*.
> 
> I'm referring to this series:
> https://lore.kernel.org/all/20250117163001.2326672-1-tabba@google.com/
> 
>> > the future to register a callback that informs the guest_memfd
>> > subsystem when the last reference is dropped, therefore knowing
>> > that the host doesn't have any remaining references.
>> >
>> > Signed-off-by: Fuad Tabba <tabba@google.com>
>> > ---
>> >  include/linux/kvm_host.h   |  9 +++++++++
>> >  include/linux/page-flags.h | 17 +++++++++++++++++
>> >  mm/debug.c                 |  1 +
>> >  mm/swap.c                  |  9 +++++++++
>> >  virt/kvm/guest_memfd.c     |  7 +++++++
>> >  5 files changed, 43 insertions(+)
>> >
>> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> > index f34f4cfaa513..8b5f28f6efff 100644
>> > --- a/include/linux/kvm_host.h
>> > +++ b/include/linux/kvm_host.h
>> > @@ -2571,4 +2571,13 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>> >                                   struct kvm_pre_fault_memory *range);
>> >  #endif
>> >
>> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> > +void kvm_gmem_handle_folio_put(struct folio *folio);
>> > +#else
>> > +static inline void kvm_gmem_handle_folio_put(struct folio *folio)
>> > +{
>> > +     WARN_ON_ONCE(1);
>> > +}
>>
>> Since the caller is guarded by CONFIG_KVM_GMEM_SHARED_MEM, do we need the
>> CONFIG_KVM_GMEM_SHARED_MEM=n variant at all?
> 
> No. I'll remove it.
> 
>> > +#endif
>> > +
>> >  #endif
>> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> > index 6dc2494bd002..734afda268ab 100644
>> > --- a/include/linux/page-flags.h
>> > +++ b/include/linux/page-flags.h
>> > @@ -933,6 +933,17 @@ enum pagetype {
>> >       PGTY_slab       = 0xf5,
>> >       PGTY_zsmalloc   = 0xf6,
>> >       PGTY_unaccepted = 0xf7,
>> > +     /*
>> > +      * guestmem folios are used to back VM memory as managed by guest_memfd.
>> > +      * Once the last reference is put, instead of freeing these folios back
>> > +      * to the page allocator, they are returned to guest_memfd.
>> > +      *
>> > +      * For now, guestmem will only be set on these folios as long as they
>> > +      * cannot be mapped to user space ("private state"), with the plan of
>>
>> Might be a bit misleading as I don't think it's set yet as of this series.
>> But I guess we can keep it to avoid another update later.
> 
> You're right, it's not in this series. But as you said, the idea is to
> have the least amount of churn in the core mm code.
> 
>> > +      * always setting that type once typed folios can be mapped to user
>> > +      * space cleanly.
>> > +      */
>> > +     PGTY_guestmem   = 0xf8,
>> >
>> >       PGTY_mapcount_underflow = 0xff
>> >  };
>> > @@ -1082,6 +1093,12 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb)
>> >  FOLIO_TEST_FLAG_FALSE(hugetlb)
>> >  #endif
>> >
>> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> > +FOLIO_TYPE_OPS(guestmem, guestmem)
>> > +#else
>> > +FOLIO_TEST_FLAG_FALSE(guestmem)
>> > +#endif
>> > +
>> >  PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
>> >
>> >  /*
>> > diff --git a/mm/debug.c b/mm/debug.c
>> > index 8d2acf432385..08bc42c6cba8 100644
>> > --- a/mm/debug.c
>> > +++ b/mm/debug.c
>> > @@ -56,6 +56,7 @@ static const char *page_type_names[] = {
>> >       DEF_PAGETYPE_NAME(table),
>> >       DEF_PAGETYPE_NAME(buddy),
>> >       DEF_PAGETYPE_NAME(unaccepted),
>> > +     DEF_PAGETYPE_NAME(guestmem),
>> >  };
>> >
>> >  static const char *page_type_name(unsigned int page_type)
>> > diff --git a/mm/swap.c b/mm/swap.c
>> > index 47bc1bb919cc..241880a46358 100644
>> > --- a/mm/swap.c
>> > +++ b/mm/swap.c
>> > @@ -38,6 +38,10 @@
>> >  #include <linux/local_lock.h>
>> >  #include <linux/buffer_head.h>
>> >
>> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> > +#include <linux/kvm_host.h>
>> > +#endif
>>
>> Do we need to guard the include?
> 
> Yes, otherwise allnoconfig complains due to many of the x86 things it
> drags along if included but KVM isn't configured. I could put it in a
> different header that doesn't have this problem, but I couldn't think
> of a better header for this.

Ok, you can add
Acked-by: Vlastimil Babka <vbabka@suse.cz>




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-17 11:21       ` Vlastimil Babka
@ 2025-02-17 11:21         ` Fuad Tabba
  0 siblings, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-17 11:21 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vannapurve, ackerleytng, mail,
	david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On Mon, 17 Feb 2025 at 11:21, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/17/25 11:12, Fuad Tabba wrote:
> > Hi Vlastimil,
> >
> > On Mon, 17 Feb 2025 at 09:49, Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 2/11/25 13:11, Fuad Tabba wrote:
> >> > Before transitioning a guest_memfd folio to unshared, thereby
> >> > disallowing access by the host and allowing the hypervisor to
> >> > transition its view of the guest page as private, we need to be
> >> > sure that the host doesn't have any references to the folio.
> >> >
> >> > This patch introduces a new type for guest_memfd folios, which
> >> > isn't activated in this series but is here as a placeholder and
> >> > to facilitate the code in the next patch. This will be used in
> >>
> >> It's not clear to me how the code in the next patch is facilitated as it
> >> doesn't use any of this?
> >
> > I'm sorry about that, I'm missing the word "series". i.e.,
> >
> >> > This patch introduces a new type for guest_memfd folios, which
> >> > isn't activated in this series but is here as a placeholder and
> >> > to facilitate the code in the next patch *series*.
> >
> > I'm referring to this series:
> > https://lore.kernel.org/all/20250117163001.2326672-1-tabba@google.com/
> >
> >> > the future to register a callback that informs the guest_memfd
> >> > subsystem when the last reference is dropped, therefore knowing
> >> > that the host doesn't have any remaining references.
> >> >
> >> > Signed-off-by: Fuad Tabba <tabba@google.com>
> >> > ---
> >> >  include/linux/kvm_host.h   |  9 +++++++++
> >> >  include/linux/page-flags.h | 17 +++++++++++++++++
> >> >  mm/debug.c                 |  1 +
> >> >  mm/swap.c                  |  9 +++++++++
> >> >  virt/kvm/guest_memfd.c     |  7 +++++++
> >> >  5 files changed, 43 insertions(+)
> >> >
> >> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >> > index f34f4cfaa513..8b5f28f6efff 100644
> >> > --- a/include/linux/kvm_host.h
> >> > +++ b/include/linux/kvm_host.h
> >> > @@ -2571,4 +2571,13 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >> >                                   struct kvm_pre_fault_memory *range);
> >> >  #endif
> >> >
> >> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> >> > +void kvm_gmem_handle_folio_put(struct folio *folio);
> >> > +#else
> >> > +static inline void kvm_gmem_handle_folio_put(struct folio *folio)
> >> > +{
> >> > +     WARN_ON_ONCE(1);
> >> > +}
> >>
> >> Since the caller is guarded by CONFIG_KVM_GMEM_SHARED_MEM, do we need the
> >> CONFIG_KVM_GMEM_SHARED_MEM=n variant at all?
> >
> > No. I'll remove it.
> >
> >> > +#endif
> >> > +
> >> >  #endif
> >> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >> > index 6dc2494bd002..734afda268ab 100644
> >> > --- a/include/linux/page-flags.h
> >> > +++ b/include/linux/page-flags.h
> >> > @@ -933,6 +933,17 @@ enum pagetype {
> >> >       PGTY_slab       = 0xf5,
> >> >       PGTY_zsmalloc   = 0xf6,
> >> >       PGTY_unaccepted = 0xf7,
> >> > +     /*
> >> > +      * guestmem folios are used to back VM memory as managed by guest_memfd.
> >> > +      * Once the last reference is put, instead of freeing these folios back
> >> > +      * to the page allocator, they are returned to guest_memfd.
> >> > +      *
> >> > +      * For now, guestmem will only be set on these folios as long as they
> >> > +      * cannot be mapped to user space ("private state"), with the plan of
> >>
> >> Might be a bit misleading as I don't think it's set yet as of this series.
> >> But I guess we can keep it to avoid another update later.
> >
> > You're right, it's not in this series. But as you said, the idea is to
> > have the least amount of churn in the core mm code.
> >
> >> > +      * always setting that type once typed folios can be mapped to user
> >> > +      * space cleanly.
> >> > +      */
> >> > +     PGTY_guestmem   = 0xf8,
> >> >
> >> >       PGTY_mapcount_underflow = 0xff
> >> >  };
> >> > @@ -1082,6 +1093,12 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb)
> >> >  FOLIO_TEST_FLAG_FALSE(hugetlb)
> >> >  #endif
> >> >
> >> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> >> > +FOLIO_TYPE_OPS(guestmem, guestmem)
> >> > +#else
> >> > +FOLIO_TEST_FLAG_FALSE(guestmem)
> >> > +#endif
> >> > +
> >> >  PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
> >> >
> >> >  /*
> >> > diff --git a/mm/debug.c b/mm/debug.c
> >> > index 8d2acf432385..08bc42c6cba8 100644
> >> > --- a/mm/debug.c
> >> > +++ b/mm/debug.c
> >> > @@ -56,6 +56,7 @@ static const char *page_type_names[] = {
> >> >       DEF_PAGETYPE_NAME(table),
> >> >       DEF_PAGETYPE_NAME(buddy),
> >> >       DEF_PAGETYPE_NAME(unaccepted),
> >> > +     DEF_PAGETYPE_NAME(guestmem),
> >> >  };
> >> >
> >> >  static const char *page_type_name(unsigned int page_type)
> >> > diff --git a/mm/swap.c b/mm/swap.c
> >> > index 47bc1bb919cc..241880a46358 100644
> >> > --- a/mm/swap.c
> >> > +++ b/mm/swap.c
> >> > @@ -38,6 +38,10 @@
> >> >  #include <linux/local_lock.h>
> >> >  #include <linux/buffer_head.h>
> >> >
> >> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> >> > +#include <linux/kvm_host.h>
> >> > +#endif
> >>
> >> Do we need to guard the include?
> >
> > Yes, otherwise allnoconfig complains due to many of the x86 things it
> > drags along if included but KVM isn't configured. I could put it in a
> > different header that doesn't have this problem, but I couldn't think
> > of a better header for this.
>
> Ok, you can add
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thank you!
/fuad


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 01/11] mm: Consolidate freeing of typed folios on final folio_put()
  2025-02-11 12:11 ` [PATCH v3 01/11] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
  2025-02-17  9:33   ` Vlastimil Babka
@ 2025-02-20 11:17   ` David Hildenbrand
  1 sibling, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2025-02-20 11:17 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On 11.02.25 13:11, Fuad Tabba wrote:
> Some folio types, such as hugetlb, handle freeing their own
> folios. Moreover, guest_memfd will require being notified once a
> folio's reference count reaches 0 to facilitate shared to private
> folio conversion, without the folio actually being freed at that
> point.
> 
> As a first step towards that, this patch consolidates freeing
> folios that have a type. The first user is hugetlb folios. Later
> in this patch series, guest_memfd will become the second user of
> this.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-11 12:11 ` [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages Fuad Tabba
  2025-02-12 18:19   ` Peter Xu
  2025-02-17  9:49   ` Vlastimil Babka
@ 2025-02-20 11:19   ` David Hildenbrand
  2025-02-20 11:25   ` David Hildenbrand
  3 siblings, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2025-02-20 11:19 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On 11.02.25 13:11, Fuad Tabba wrote:
> Before transitioning a guest_memfd folio to unshared, thereby
> disallowing access by the host and allowing the hypervisor to
> transition its view of the guest page as private, we need to be
> sure that the host doesn't have any references to the folio.
> 
> This patch introduces a new type for guest_memfd folios, which
> isn't activated in this series but is here as a placeholder and
> to facilitate the code in the next patch. This will be used in
> the future to register a callback that informs the guest_memfd
> subsystem when the last reference is dropped, therefore knowing
> that the host doesn't have any remaining references.
> 
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>   include/linux/kvm_host.h   |  9 +++++++++
>   include/linux/page-flags.h | 17 +++++++++++++++++
>   mm/debug.c                 |  1 +
>   mm/swap.c                  |  9 +++++++++
>   virt/kvm/guest_memfd.c     |  7 +++++++
>   5 files changed, 43 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f34f4cfaa513..8b5f28f6efff 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2571,4 +2571,13 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>   				    struct kvm_pre_fault_memory *range);
>   #endif
>   
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +void kvm_gmem_handle_folio_put(struct folio *folio);
> +#else
> +static inline void kvm_gmem_handle_folio_put(struct folio *folio)
> +{
> +	WARN_ON_ONCE(1);
> +}
> +#endif
> +
>   #endif
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 6dc2494bd002..734afda268ab 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -933,6 +933,17 @@ enum pagetype {
>   	PGTY_slab	= 0xf5,
>   	PGTY_zsmalloc	= 0xf6,
>   	PGTY_unaccepted	= 0xf7,
> +	/*
> +	 * guestmem folios are used to back VM memory as managed by guest_memfd.
> +	 * Once the last reference is put, instead of freeing these folios back
> +	 * to the page allocator, they are returned to guest_memfd.
> +	 *
> +	 * For now, guestmem will only be set on these folios as long as they
> +	 * cannot be mapped to user space ("private state"), with the plan of
> +	 * always setting that type once typed folios can be mapped to user
> +	 * space cleanly.
> +	 */

I would move that comment above the CONFIG_KVM_GMEM_SHARED_MEM below. 
Similar to how we do it for (some of) the others.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-17 10:12     ` Fuad Tabba
  2025-02-17 11:21       ` Vlastimil Babka
@ 2025-02-20 11:22       ` David Hildenbrand
  1 sibling, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2025-02-20 11:22 UTC (permalink / raw)
  To: Fuad Tabba, Vlastimil Babka
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

>>> +#endif
>>> +
>>>   #endif
>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>> index 6dc2494bd002..734afda268ab 100644
>>> --- a/include/linux/page-flags.h
>>> +++ b/include/linux/page-flags.h
>>> @@ -933,6 +933,17 @@ enum pagetype {
>>>        PGTY_slab       = 0xf5,
>>>        PGTY_zsmalloc   = 0xf6,
>>>        PGTY_unaccepted = 0xf7,
>>> +     /*
>>> +      * guestmem folios are used to back VM memory as managed by guest_memfd.
>>> +      * Once the last reference is put, instead of freeing these folios back
>>> +      * to the page allocator, they are returned to guest_memfd.
>>> +      *
>>> +      * For now, guestmem will only be set on these folios as long as they
>>> +      * cannot be mapped to user space ("private state"), with the plan of
>>
>> Might be a bit misleading as I don't think it's set yet as of this series.
>> But I guess we can keep it to avoid another update later.
> 
> You're right, it's not in this series. But as you said, the idea is to
> have the least amount of churn in the core mm code.

I proposed adding the latter, to make it clearer that the purpose of 
these things is to be mapped to user space long-term (-> be a proper 
folio, like hugetlb folios will have to be), compared to long-term not 
being a folio (slab, offline, unaccepted, ...).

Might become relevant for people working on the memdesc work, to know 
what to do here.

(or at least have less questions :) )

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-11 12:11 ` [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages Fuad Tabba
                     ` (2 preceding siblings ...)
  2025-02-20 11:19   ` David Hildenbrand
@ 2025-02-20 11:25   ` David Hildenbrand
  2025-02-20 11:28     ` Vlastimil Babka
  2025-02-20 11:38     ` Fuad Tabba
  3 siblings, 2 replies; 54+ messages in thread
From: David Hildenbrand @ 2025-02-20 11:25 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On 11.02.25 13:11, Fuad Tabba wrote:
> Before transitioning a guest_memfd folio to unshared, thereby
> disallowing access by the host and allowing the hypervisor to
> transition its view of the guest page as private, we need to be
> sure that the host doesn't have any references to the folio.
> 
> This patch introduces a new type for guest_memfd folios, which
> isn't activated in this series but is here as a placeholder and
> to facilitate the code in the next patch. This will be used in
> the future to register a callback that informs the guest_memfd
> subsystem when the last reference is dropped, therefore knowing
> that the host doesn't have any remaining references.
> 
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---

[...]

>   static const char *page_type_name(unsigned int page_type)
> diff --git a/mm/swap.c b/mm/swap.c
> index 47bc1bb919cc..241880a46358 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -38,6 +38,10 @@
>   #include <linux/local_lock.h>
>   #include <linux/buffer_head.h>
>   
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +#include <linux/kvm_host.h>
> +#endif
> +
>   #include "internal.h"
>   
>   #define CREATE_TRACE_POINTS
> @@ -101,6 +105,11 @@ static void free_typed_folio(struct folio *folio)
>   	case PGTY_hugetlb:
>   		free_huge_folio(folio);
>   		return;
> +#endif
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +	case PGTY_guestmem:
> +		kvm_gmem_handle_folio_put(folio);
> +		return;

Hm, if KVM is built as a module, will that work? Or would we need the 
core-mm guest_memfd shim that would always be compiled into the core and 
decouple KVM from guest_memfd ("library")?
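
For illustration, one shape that decoupling could take is a built-in
stub that a modular KVM registers a handler with (none of these
symbols exist today, this is just a sketch):

/* e.g. mm/guestmem.c, always built-in: */
static void (*gmem_folio_put_cb)(struct folio *folio);

int guestmem_register_folio_put_cb(void (*cb)(struct folio *folio))
{
        /* Sketch: a single registered handler, set once by KVM. */
        if (cmpxchg(&gmem_folio_put_cb, NULL, cb))
                return -EBUSY;
        return 0;
}
EXPORT_SYMBOL_GPL(guestmem_register_folio_put_cb);

void guestmem_handle_folio_put(struct folio *folio)
{
        /* Reachable from free_typed_folio() even when KVM is a module. */
        if (gmem_folio_put_cb)
                gmem_folio_put_cb(folio);
}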

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-20 11:25   ` David Hildenbrand
@ 2025-02-20 11:28     ` Vlastimil Babka
  2025-02-20 11:32       ` David Hildenbrand
  2025-02-20 11:38     ` Fuad Tabba
  1 sibling, 1 reply; 54+ messages in thread
From: Vlastimil Babka @ 2025-02-20 11:28 UTC (permalink / raw)
  To: David Hildenbrand, Fuad Tabba, kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vannapurve, ackerleytng, mail, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton

On 2/20/25 12:25, David Hildenbrand wrote:
> On 11.02.25 13:11, Fuad Tabba wrote:
>> Before transitioning a guest_memfd folio to unshared, thereby
>> disallowing access by the host and allowing the hypervisor to
>> transition its view of the guest page as private, we need to be
>> sure that the host doesn't have any references to the folio.
>> 
>> This patch introduces a new type for guest_memfd folios, which
>> isn't activated in this series but is here as a placeholder and
>> to facilitate the code in the next patch. This will be used in
>> the future to register a callback that informs the guest_memfd
>> subsystem when the last reference is dropped, therefore knowing
>> that the host doesn't have any remaining references.
>> 
>> Signed-off-by: Fuad Tabba <tabba@google.com>
>> ---
> 
> [...]
> 
>>   static const char *page_type_name(unsigned int page_type)
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 47bc1bb919cc..241880a46358 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -38,6 +38,10 @@
>>   #include <linux/local_lock.h>
>>   #include <linux/buffer_head.h>
>>   
>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> +#include <linux/kvm_host.h>
>> +#endif
>> +
>>   #include "internal.h"
>>   
>>   #define CREATE_TRACE_POINTS
>> @@ -101,6 +105,11 @@ static void free_typed_folio(struct folio *folio)
>>   	case PGTY_hugetlb:
>>   		free_huge_folio(folio);
>>   		return;
>> +#endif
>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> +	case PGTY_guestmem:
>> +		kvm_gmem_handle_folio_put(folio);
>> +		return;
> 
> Hm, if KVM is built as a module, will that work? Or would we need the 

Good catch, I guess not?

> core-mm guest_memfd shim that would always be compiled into the core and 
> decouple KVM from guest_memfd ("library")?

That could also help avoid exporting the mpol symbols in the NUMA mempolicy
series?




* Re: [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-20 11:28     ` Vlastimil Babka
@ 2025-02-20 11:32       ` David Hildenbrand
  0 siblings, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2025-02-20 11:32 UTC (permalink / raw)
  To: Vlastimil Babka, Fuad Tabba, kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vannapurve, ackerleytng, mail, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton

On 20.02.25 12:28, Vlastimil Babka wrote:
> On 2/20/25 12:25, David Hildenbrand wrote:
>> On 11.02.25 13:11, Fuad Tabba wrote:
>>> Before transitioning a guest_memfd folio to unshared, thereby
>>> disallowing access by the host and allowing the hypervisor to
>>> transition its view of the guest page to private, we need to be
>>> sure that the host doesn't have any references to the folio.
>>>
>>> This patch introduces a new type for guest_memfd folios, which
>>> isn't activated in this series but is here as a placeholder and
>>> to facilitate the code in the next patch. This will be used in
>>> the future to register a callback that informs the guest_memfd
>>> subsystem when the last reference is dropped, so that it knows
>>> the host doesn't have any remaining references.
>>>
>>> Signed-off-by: Fuad Tabba <tabba@google.com>
>>> ---
>>
>> [...]
>>
>>>    static const char *page_type_name(unsigned int page_type)
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index 47bc1bb919cc..241880a46358 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -38,6 +38,10 @@
>>>    #include <linux/local_lock.h>
>>>    #include <linux/buffer_head.h>
>>>    
>>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>>> +#include <linux/kvm_host.h>
>>> +#endif
>>> +
>>>    #include "internal.h"
>>>    
>>>    #define CREATE_TRACE_POINTS
>>> @@ -101,6 +105,11 @@ static void free_typed_folio(struct folio *folio)
>>>    	case PGTY_hugetlb:
>>>    		free_huge_folio(folio);
>>>    		return;
>>> +#endif
>>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>>> +	case PGTY_guestmem:
>>> +		kvm_gmem_handle_folio_put(folio);
>>> +		return;
>>
>> Hm, if KVM is built as a module, will that work? Or would we need the
> 
> Good catch, I guess not?
> 
>> core-mm guest_memfd shim that would always be compiled into the core and
>> decouple KVM from guest_memfd ("library")?
> 
> That could also help avoid exporting the mpol symbols in the NUMA mempolicy
> series?

Yes [1]! :)

[1] 
https://lore.kernel.org/linux-mm/9392618e-32de-4a86-9e1e-bcfeefe39181@redhat.com/

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 04/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is shared
  2025-02-11 12:11 ` [PATCH v3 04/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is shared Fuad Tabba
@ 2025-02-20 11:37   ` David Hildenbrand
  2025-02-20 11:39     ` David Hildenbrand
  0 siblings, 1 reply; 54+ messages in thread
From: David Hildenbrand @ 2025-02-20 11:37 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On 11.02.25 13:11, Fuad Tabba wrote:
> Add the KVM capability KVM_CAP_GMEM_SHARED_MEM, which indicates
> that the VM supports shared memory in guest_memfd, or that the
> host can create VMs that support shared memory. Supporting shared
> memory implies that memory can be mapped when shared with the
> host.

Was there a good reason to not squash this into the next patch?

-- 
Cheers,

David / dhildenb
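
For context, capabilities like this are probed from userspace with the
KVM_CHECK_EXTENSION ioctl. A sketch, assuming the
KVM_CAP_GMEM_SHARED_MEM define from this series' uapi changes is
present in <linux/kvm.h>:

  /* Hypothetical userspace probe for the new capability. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  int main(void)
  {
  	int kvm_fd = open("/dev/kvm", O_RDWR);

  	if (kvm_fd < 0)
  		return 1;
  	/* A return value > 0 means the capability is supported; the
  	 * same ioctl on a VM fd would report per-VM support. */
  	printf("KVM_CAP_GMEM_SHARED_MEM: %s\n",
  	       ioctl(kvm_fd, KVM_CHECK_EXTENSION,
  		     KVM_CAP_GMEM_SHARED_MEM) > 0 ? "yes" : "no");
  	return 0;
  }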




* Re: [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-20 11:25   ` David Hildenbrand
  2025-02-20 11:28     ` Vlastimil Babka
@ 2025-02-20 11:38     ` Fuad Tabba
  2025-02-20 11:44       ` David Hildenbrand
  1 sibling, 1 reply; 54+ messages in thread
From: Fuad Tabba @ 2025-02-20 11:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

Hi David,

On Thu, 20 Feb 2025 at 11:25, David Hildenbrand <david@redhat.com> wrote:
>
> On 11.02.25 13:11, Fuad Tabba wrote:
> > Before transitioning a guest_memfd folio to unshared, thereby
> > disallowing access by the host and allowing the hypervisor to
> > transition its view of the guest page to private, we need to be
> > sure that the host doesn't have any references to the folio.
> >
> > This patch introduces a new type for guest_memfd folios, which
> > isn't activated in this series but is here as a placeholder and
> > to facilitate the code in the next patch. This will be used in
> > the future to register a callback that informs the guest_memfd
> > subsystem when the last reference is dropped, so that it knows
> > the host doesn't have any remaining references.
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
>
> [...]
>
> >   static const char *page_type_name(unsigned int page_type)
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 47bc1bb919cc..241880a46358 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -38,6 +38,10 @@
> >   #include <linux/local_lock.h>
> >   #include <linux/buffer_head.h>
> >
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +#include <linux/kvm_host.h>
> > +#endif
> > +
> >   #include "internal.h"
> >
> >   #define CREATE_TRACE_POINTS
> > @@ -101,6 +105,11 @@ static void free_typed_folio(struct folio *folio)
> >       case PGTY_hugetlb:
> >               free_huge_folio(folio);
> >               return;
> > +#endif
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +     case PGTY_guestmem:
> > +             kvm_gmem_handle_folio_put(folio);
> > +             return;
>
> Hm, if KVM is built as a module, will that work? Or would we need the
> core-mm guest_memfd shim that would always be compiled into the core and
> decouple KVM from guest_memfd ("library")?

I'd assumed that it would work, since the module would have been
loaded before this could trigger, but I guess I was wrong.

Can you point me to an example of a similar shim in the core code, for
me to add on the respin?

Thanks,
/fuad


> --
> Cheers,
>
> David / dhildenb
>



* Re: [PATCH v3 04/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is shared
  2025-02-20 11:37   ` David Hildenbrand
@ 2025-02-20 11:39     ` David Hildenbrand
  2025-02-20 11:39       ` Fuad Tabba
  0 siblings, 1 reply; 54+ messages in thread
From: David Hildenbrand @ 2025-02-20 11:39 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
	isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
	michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On 20.02.25 12:37, David Hildenbrand wrote:
> On 11.02.25 13:11, Fuad Tabba wrote:
>> Add the KVM capability KVM_CAP_GMEM_SHARED_MEM, which indicates
>> that the VM supports shared memory in guest_memfd, or that the
>> host can create VMs that support shared memory. Supporting shared
>> memory implies that memory can be mapped when shared with the
>> host.
> 
> Was there a good reason to not squash this into the next patch?

Sorry, I was confused; I meant the previous commit, where we essentially 
add the mmap option in the first place.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 04/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is shared
  2025-02-20 11:39     ` David Hildenbrand
@ 2025-02-20 11:39       ` Fuad Tabba
  0 siblings, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2025-02-20 11:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

On Thu, 20 Feb 2025 at 11:39, David Hildenbrand <david@redhat.com> wrote:
>
> On 20.02.25 12:37, David Hildenbrand wrote:
> > On 11.02.25 13:11, Fuad Tabba wrote:
> >> Add the KVM capability KVM_CAP_GMEM_SHARED_MEM, which indicates
> >> that the VM supports shared memory in guest_memfd, or that the
> >> host can create VMs that support shared memory. Supporting shared
> >> memory implies that memory can be mapped when shared with the
> >> host.
> >
> > Was there a good reason to not squash this into the next patch?
>
> Sorry, I was confused; I meant the previous commit, where we essentially
> add the mmap option in the first place.

Will do.

Thanks,
/fuad

> --
> Cheers,
>
> David / dhildenb
>



* Re: [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages
  2025-02-20 11:38     ` Fuad Tabba
@ 2025-02-20 11:44       ` David Hildenbrand
  0 siblings, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2025-02-20 11:44 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
	paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
	xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
	yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
	shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton

>>>    static const char *page_type_name(unsigned int page_type)
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index 47bc1bb919cc..241880a46358 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -38,6 +38,10 @@
>>>    #include <linux/local_lock.h>
>>>    #include <linux/buffer_head.h>
>>>
>>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>>> +#include <linux/kvm_host.h>
>>> +#endif
>>> +
>>>    #include "internal.h"
>>>
>>>    #define CREATE_TRACE_POINTS
>>> @@ -101,6 +105,11 @@ static void free_typed_folio(struct folio *folio)
>>>        case PGTY_hugetlb:
>>>                free_huge_folio(folio);
>>>                return;
>>> +#endif
>>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>>> +     case PGTY_guestmem:
>>> +             kvm_gmem_handle_folio_put(folio);
>>> +             return;
>>
>> Hm, if KVM is built as a module, will that work? Or would we need the
>> core-mm guest_memfd shim that would always be compiled into the core and
>> decouple KVM from guest_memfd ("library")?
> 
> I'd assumed that it would work, since the module would have been
> loaded before this could trigger, but I guess I was wrong.
> 
> Can you point me to an example of a similar shim in the core code, for
> me to add on the respin?

As part of this series, you could make the function with the WARN an 
inline function, and tackle the shim separately, once that function 
actually has to do something.

-- 
Cheers,

David / dhildenb
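
The interim stub suggested here might look like the sketch below; the
exact placement and message are assumptions, not taken from the series:

  /* Hypothetical inline stub, e.g. in include/linux/kvm_host.h. Being
   * inline, it leaves the always-built-in mm/swap.c with no dependency
   * on a kvm.ko symbol; a real handler (or a core-mm shim) would
   * replace it once the callback has actual work to do. */
  #ifdef CONFIG_KVM_GMEM_SHARED_MEM
  static inline void kvm_gmem_handle_folio_put(struct folio *folio)
  {
  	WARN_ONCE(1, "A placeholder that should not be triggered yet\n");
  }
  #endif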





Thread overview: 54+ messages
2025-02-11 12:11 [PATCH v3 00/11] KVM: Mapping guest_memfd backed memory at the host for software protected VMs Fuad Tabba
2025-02-11 12:11 ` [PATCH v3 01/11] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
2025-02-17  9:33   ` Vlastimil Babka
2025-02-20 11:17   ` David Hildenbrand
2025-02-11 12:11 ` [PATCH v3 02/11] KVM: guest_memfd: Handle final folio_put() of guest_memfd pages Fuad Tabba
2025-02-12 18:19   ` Peter Xu
2025-02-13  8:29     ` Fuad Tabba
2025-02-17  9:49   ` Vlastimil Babka
2025-02-17 10:12     ` Fuad Tabba
2025-02-17 11:21       ` Vlastimil Babka
2025-02-17 11:21         ` Fuad Tabba
2025-02-20 11:22       ` David Hildenbrand
2025-02-20 11:19   ` David Hildenbrand
2025-02-20 11:25   ` David Hildenbrand
2025-02-20 11:28     ` Vlastimil Babka
2025-02-20 11:32       ` David Hildenbrand
2025-02-20 11:38     ` Fuad Tabba
2025-02-20 11:44       ` David Hildenbrand
2025-02-11 12:11 ` [PATCH v3 03/11] KVM: guest_memfd: Allow host to map guest_memfd() pages Fuad Tabba
2025-02-12  5:07   ` Ackerley Tng
2025-02-12  9:21     ` Fuad Tabba
2025-02-12 21:23   ` Peter Xu
2025-02-13  8:24     ` Fuad Tabba
2025-02-11 12:11 ` [PATCH v3 04/11] KVM: guest_memfd: Add KVM capability to check if guest_memfd is shared Fuad Tabba
2025-02-20 11:37   ` David Hildenbrand
2025-02-20 11:39     ` David Hildenbrand
2025-02-20 11:39       ` Fuad Tabba
2025-02-11 12:11 ` [PATCH v3 05/11] KVM: guest_memfd: Handle in-place shared memory as guest_memfd backed memory Fuad Tabba
2025-02-12  0:15   ` Ackerley Tng
2025-02-12  9:23     ` Fuad Tabba
2025-02-11 12:11 ` [PATCH v3 06/11] KVM: x86: Mark KVM_X86_SW_PROTECTED_VM as supporting guest_memfd shared memory Fuad Tabba
2025-02-11 12:11 ` [PATCH v3 07/11] KVM: arm64: Refactor user_mem_abort() calculation of force_pte Fuad Tabba
2025-02-11 12:11 ` [PATCH v3 08/11] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
2025-02-11 15:57   ` Quentin Perret
2025-02-11 16:13     ` Fuad Tabba
2025-02-11 16:25       ` Quentin Perret
2025-02-11 16:34         ` Fuad Tabba
2025-02-11 16:57           ` Quentin Perret
2025-02-11 17:04             ` Fuad Tabba
2025-02-11 17:19               ` Quentin Perret
2025-02-11 12:11 ` [PATCH v3 09/11] KVM: arm64: Introduce KVM_VM_TYPE_ARM_SW_PROTECTED machine type Fuad Tabba
2025-02-11 16:12   ` Quentin Perret
2025-02-11 16:17     ` Fuad Tabba
2025-02-11 16:29       ` Quentin Perret
2025-02-11 16:32         ` Patrick Roy
2025-02-11 17:09           ` Quentin Perret
2025-02-14 11:13             ` Quentin Perret
2025-02-14 11:33               ` Fuad Tabba
2025-02-14 12:37                 ` Patrick Roy
2025-02-14 13:11                   ` Fuad Tabba
2025-02-14 13:18                     ` Patrick Roy
2025-02-14 15:12                       ` Sean Christopherson
2025-02-11 12:11 ` [PATCH v3 10/11] KVM: arm64: Enable mapping guest_memfd in arm64 Fuad Tabba
2025-02-11 12:11 ` [PATCH v3 11/11] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed Fuad Tabba
