* [RFC PATCH v2 00/10] Unmapping guest_memfd from Direct Map
@ 2024-09-10 16:30 Patrick Roy
  2024-09-10 16:30 ` [RFC PATCH v2 01/10] kvm: gmem: Add option to remove gmem from direct map Patrick Roy
                   ` (9 more replies)
  0 siblings, 10 replies; 13+ messages in thread
From: Patrick Roy @ 2024-09-10 16:30 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	rostedt, mhiramat, mathieu.desnoyers, kvm, linux-kernel,
	linux-trace-kernel, quic_eberman, dwmw, david, tabba, rppt,
	linux-mm, dmatlack
  Cc: Patrick Roy, graf, jgowans, derekmn, kalyazin, xmarcalx

Hey all,

This is an overhaul of my RFC [1] for removing guest_memfd folios from
the direct map. The goal is to back non-confidential VMs using
guest_memfd and protect their memory from a large class of speculative
execution issues [2, Table 1]. This RFC series is also the basis of my
LPC submission [3].

=== Changes to v1 ===

- Drop patches related to userspace mappings to only focus on direct map
  removal.
- Use a refcount to track temporary direct map reinsertions
  (Paolo/Elliot)
- Implement invalidation of gfn_to_pfn_caches holding gmem pfns (David
  W.)
- Do not assume folios have only a single page (Mike R.)

=== Implementation ===

This patch series extends guest_memfd to run "non-confidential" VMs that
still have their guest memory removed from the host's direct map.
"Non-confidential" here means that we wish to treat the VM pretty much
the same as a VM with traditional, non-guest_memfd memslots: KVM should
be able to access non-sensitive parts of guest memory, such as page
tables and the instructions that need to be fetched for MMIO emulation,
or things like the kvm-clock page, without requiring guest
collaboration.

This patch series thus does two things: First, it introduces a new
`KVM_GMEM_NO_DIRECT_MAP` flag for the `KVM_CREATE_GUEST_MEMFD` ioctl that
causes guest_memfd to remove its folios from the direct map immediately
after allocation. Second, it teaches key parts of KVM how to access
guest_memfd memory (if the vm type allows it) via temporary restoration
of direct map entries. The parts of KVM enlightened in this way are:

- kvm_{read,write}_guest and friends (needed for instruction fetch
  during MMIO emulation)
- paging64_walk_addr_generic (needed to resolve GPAs during MMIO
  emulation)
- pfncache.c (needed for kvm-clock)

These are the minimal set needed to boot a non-confidential initrd from
guest_memfd (provided one finds a way to load such a thing into
guest_memfd; my testing was done with an additional commit on top of
this series that allows unconditional userspace mappings of
guest_memfd).
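
For illustration, below is a rough sketch of how userspace might set
this up. It is not taken from this series or its selftests; the helper
name and size are placeholders, error handling is omitted, and it
assumes the guest_memfd uAPI that already exists upstream plus the new
flag from patch 1:

#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int create_no_direct_map_vm(size_t mem_size)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);
	int vm_fd  = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_SW_PROTECTED_VM);

	struct kvm_create_guest_memfd gmem = {
		.size  = mem_size,
		.flags = KVM_GMEM_NO_DIRECT_MAP,    /* new flag from patch 1 */
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	struct kvm_userspace_memory_region2 region = {
		.slot               = 0,
		.flags              = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr    = 0,
		.memory_size        = mem_size,
		.guest_memfd        = gmem_fd,
		.guest_memfd_offset = 0,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);

	return vm_fd;
}

A KVM_SET_MEMORY_ATTRIBUTES call marking the gfns as private is still
needed before KVM backs them from the guest_memfd, and a userspace_addr
would be needed for any ranges that are later flipped to shared.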

Instruction fetch for MMIO emulation is special here in the sense that
it cannot be solved by having the guest explicitly share memory ahead of
time, since such conversions in guest_memfd are destructive, and the
guest cannot know which instructions will trigger MMIO ahead of time
(TDX for example has a special paravirtual solution for this case). It
is thus the original motivation for the approach in this patch series.

In terms of the proposed framework for allowing both "shared" and
"private" memory inside guest_memfd (with in-place conversions
supported) [4], memory with its direct map entries removed can be
considered "private", while gmem memory with direct map entries can be
considered "shared" (I'm afraid this patch series also hasn't found
better names than the horribly overloaded "shared" and "private" for
this). 

Implementing support for accessing guest_memfd in kvm_{read,write}_guest
is fairly straightforward, as the operation is a simple
"remap->access->unmap" sequence that can be done entirely while
holding the folio lock. However, "long-term" accesses such as
gfn_to_pfn_caches, which reinsert parts of gmem into the direct map for
long periods of time, proved to be tricky to implement, due to the need
to respond to gmem invalidation events (to, for example, avoid modifying
direct map entries after a folio has been fallocated away and freed).
This part is why this series is still an RFC: my confidence in
getting those patches right is fairly low. For what's implemented here,
an alternative would be to just have the guest share the page tables and
kvm-clock pages ahead of time (while keeping the changes to
kvm_{read,write}_guest to handle instruction fetches for MMIO
emulation), however I'm not sure this would work for potential future
use cases such as nested virtualization (where the L1 guest cannot know
ahead of time where the L2 will place its page tables, and thus cannot
mark them as shared).
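
To make the "remap->access->unmap" sequence concrete: this is
essentially what patch 4 does for reads (simplified here, with a made-up
function name; the real code additionally prepares a memory fault exit
if KVM is not allowed to access gmem for the given vm type):

static int gmem_read(struct kvm *kvm, struct kvm_memory_slot *slot,
		     gfn_t gfn, void *data, int offset, int len)
{
	struct folio *folio;
	kvm_pfn_t pfn;
	int r;

	/* restores the direct map entry if needed, returns with the folio locked */
	r = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, NULL,
			     KVM_GMEM_GET_PFN_SHARED | KVM_GMEM_GET_PFN_LOCKED);
	if (r < 0)
		return r;

	folio = pfn_folio(pfn);
	memcpy(data, folio_address(folio) + offset, len);

	/* drops the "shared" state again (and with it the direct map entry) */
	r = kvm_gmem_put_shared_pfn(pfn);
	folio_unlock(folio);
	folio_put(folio);
	return r;
}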

=== Security ===

We want to use unmapping guest memory from the host kernel as a security
mitigation against transient execution attacks. Temporarily restoring
direct map entries whenever KVM requires access to guest memory leaves a
gap in this mitigation. We believe this to be acceptable for the above
cases, since pages used for paravirtual guest/host communication (e.g.
kvm-clock) and guest page tables do not contain sensitive data. MMIO
emulation will only end up reading pages containing privileged
instructions (e.g. guest kernel code).

=== Summary ===

Patches 1-3 add the KVM_GMEM_NO_DIRECT_MAP flag and provide the basic
functions needed to remap guest_memfd folios on demand. Patch 4 deals
with kvm_{read,write}_guest. Patches 5 and 6 add the "sharing refcount"
framework for supporting long-term direct map restoration of gmem
folios. Patches 7-9 integrate guest_memfd with the pfncache machinery.
Patch 10 teaches the guest page table walker about guest_memfd.

The first few patches are very similar to parts of Elliot's "mm:
Introduce guest_memfd library" RFC series [5]. This series focuses on
the non-confidential usecase, while Elliot's series focuses more on
userspace mappings for confidential VMs. We've had some discussions on
how to marry these two in [6].

=== ToDos / Limitations ===

The main question I am looking for feedback on is whether I am on the
right path with the "sharing refcount" idea for long-term, KVM-initiated
sharing of gmem folios at all, or whether the last few patches look so
horrendous that a completely different solution is needed. Other than
that, the patches are of course still missing selftests.

Best, 
Patrick

[1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#mf6eb2d36bab802da411505f46ba154885cb207e6
[2]: https://download.vusec.net/papers/quarantine_raid23.pdf
[3]: https://lpc.events/event/18/contributions/1763/
[4]: https://lore.kernel.org/linux-mm/BN9PR11MB5276D7FAC258CFC02F75D0648CB32@BN9PR11MB5276.namprd11.prod.outlook.com/T/#mc944a6fdcd20a35f654c2be99f9c91a117c1bed4
[5]: https://lore.kernel.org/kvm/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@quicinc.com/T/#mbcf942dcccc3726921743251d07b1a3a7e711d3f
[6]: https://lore.kernel.org/kvm/20240805-guest-memfd-lib-v1-0-e5a29a4ff5d7@quicinc.com/T/#m785c2c1731be216fd0f6aa4c22d8b4aab146f3c1

Patrick Roy (10):
  kvm: gmem: Add option to remove gmem from direct map
  kvm: gmem: Add KVM_GMEM_GET_PFN_SHARED
  kvm: gmem: Add KVM_GMEM_GET_PFN_LOCKED
  kvm: Allow reading/writing gmem using kvm_{read,write}_guest
  kvm: gmem: Refcount internal accesses to gmem
  kvm: gmem: add tracepoints for gmem share/unshare
  kvm: pfncache: invalidate when memory attributes change
  kvm: pfncache: Support caching gmem pfns
  kvm: pfncache: hook up to gmem invalidation
  kvm: x86: support walking guest page tables in gmem

 arch/x86/kvm/mmu/mmu.c         |   2 +-
 arch/x86/kvm/mmu/paging_tmpl.h |  95 ++++++++++++---
 include/linux/kvm_host.h       |  17 ++-
 include/linux/kvm_types.h      |   2 +
 include/trace/events/kvm.h     |  43 +++++++
 include/uapi/linux/kvm.h       |   2 +
 virt/kvm/guest_memfd.c         | 216 ++++++++++++++++++++++++++++++---
 virt/kvm/kvm_main.c            |  91 ++++++++++++++
 virt/kvm/kvm_mm.h              |  12 ++
 virt/kvm/pfncache.c            | 144 ++++++++++++++++++++--
 10 files changed, 579 insertions(+), 45 deletions(-)


base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
-- 
2.46.0




* [RFC PATCH v2 01/10] kvm: gmem: Add option to remove gmem from direct map
  2024-09-10 16:30 [RFC PATCH v2 00/10] Unmapping guest_memfd from Direct Map Patrick Roy
@ 2024-09-10 16:30 ` Patrick Roy
  2024-09-18  5:48   ` Mike Rapoport
  2024-09-10 16:30 ` [RFC PATCH v2 02/10] kvm: gmem: Add KVM_GMEM_GET_PFN_SHARED Patrick Roy
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 13+ messages in thread
From: Patrick Roy @ 2024-09-10 16:30 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	rostedt, mhiramat, mathieu.desnoyers, kvm, linux-kernel,
	linux-trace-kernel, quic_eberman, dwmw, david, tabba, rppt,
	linux-mm, dmatlack
  Cc: Patrick Roy, graf, jgowans, derekmn, kalyazin, xmarcalx

Add a flag to the KVM_CREATE_GUEST_MEMFD ioctl that causes gmem pfns
to be removed from the host kernel's direct map. Memory is removed
immediately after allocation and preparation of gmem folios (after
preparation, as the prepare callback might expect the direct map entry
to be present). Direct map entries are restored before
kvm_arch_gmem_invalidate is called (as ->invalidate_folio is called
before ->free_folio), for the same reason.

Use the PG_private flag to indicate that a folio is part of gmem with
direct map removal enabled. While in this patch PG_private effectively
means "folio not in the direct map", this will no longer be true in
follow-up patches: gmem folios might get temporarily reinserted into the
direct map, but the PG_private flag needs to remain set, as the folios
will have private data that needs to be freed independently of direct
map status. This is why kvm_gmem_folio_clear_private does not call
folio_clear_private.

kvm_gmem_{set,clear}_folio_private must be called with the folio lock
held.

To ensure that failures in kvm_gmem_{clear,set}_private do not cause
system instability due to leaving holes in the direct map, try to always
restore direct map entries on failure. Pages for which restoration of
direct map entries fails are marked as HWPOISON, to prevent the
kernel from ever touching them again.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 include/uapi/linux/kvm.h |  2 +
 virt/kvm/guest_memfd.c   | 96 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 91 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc0551453..81b0f4a236b8c 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1564,6 +1564,8 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };
 
+#define KVM_GMEM_NO_DIRECT_MAP			(1ULL << 0)
+
 #define KVM_PRE_FAULT_MEMORY	_IOWR(KVMIO, 0xd5, struct kvm_pre_fault_memory)
 
 struct kvm_pre_fault_memory {
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 1c509c3512614..2ed27992206f3 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#include <linux/set_memory.h>
 
 #include "kvm_mm.h"
 
@@ -49,8 +50,69 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
 	return 0;
 }
 
+static bool kvm_gmem_test_no_direct_map(struct inode *inode)
+{
+	return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) == KVM_GMEM_NO_DIRECT_MAP;
+}
+
+static int kvm_gmem_folio_set_private(struct folio *folio)
+{
+	unsigned long start, npages, i;
+	int r;
+
+	start = (unsigned long) folio_address(folio);
+	npages = folio_nr_pages(folio);
+
+	for (i = 0; i < npages; ++i) {
+		r = set_direct_map_invalid_noflush(folio_page(folio, i));
+		if (r)
+			goto out_remap;
+	}
+	flush_tlb_kernel_range(start, start + folio_size(folio));
+	folio_set_private(folio);
+	return 0;
+out_remap:
+	for (; i > 0; i--) {
+		struct page *page = folio_page(folio, i - 1);
+
+		if (WARN_ON_ONCE(set_direct_map_default_noflush(page))) {
+			/*
+			 * Random holes in the direct map are bad, let's mark
+			 * these pages as corrupted memory so that the kernel
+			 * avoids ever touching them again.
+			 */
+			folio_set_hwpoison(folio);
+			r = -EHWPOISON;
+		}
+	}
+	return r;
+}
+
+static int kvm_gmem_folio_clear_private(struct folio *folio)
+{
+	unsigned long npages, i;
+	int r = 0;
+
+	npages = folio_nr_pages(folio);
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page = folio_page(folio, i);
+
+		if (WARN_ON_ONCE(set_direct_map_default_noflush(page))) {
+			folio_set_hwpoison(folio);
+			r = -EHWPOISON;
+		}
+	}
+	/*
+	 * no TLB flush here: pages without direct map entries should
+	 * never be in the TLB in the first place.
+	 */
+	return r;
+}
+
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
 {
+	int r;
 	struct folio *folio;
 
 	/* TODO: Support huge pages. */
@@ -78,19 +140,31 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
 	}
 
 	if (prepare) {
-		int r =	kvm_gmem_prepare_folio(inode, index, folio);
-		if (r < 0) {
-			folio_unlock(folio);
-			folio_put(folio);
-			return ERR_PTR(r);
-		}
+		r = kvm_gmem_prepare_folio(inode, index, folio);
+		if (r < 0)
+			goto out_err;
 	}
 
+	if (!kvm_gmem_test_no_direct_map(inode))
+		goto out;
+
+	if (!folio_test_private(folio)) {
+		r = kvm_gmem_folio_set_private(folio);
+		if (r)
+			goto out_err;
+	}
+
+out:
 	/*
 	 * Ignore accessed, referenced, and dirty flags.  The memory is
 	 * unevictable and there is no storage to write back to.
 	 */
 	return folio;
+
+out_err:
+	folio_unlock(folio);
+	folio_put(folio);
+	return ERR_PTR(r);
 }
 
 static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
@@ -343,6 +417,13 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 	return MF_DELAYED;
 }
 
+static void kvm_gmem_invalidate_folio(struct folio *folio, size_t start, size_t end)
+{
+	if (start == 0 && end == folio_size(folio)) {
+		kvm_gmem_folio_clear_private(folio);
+	}
+}
+
 #ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
 static void kvm_gmem_free_folio(struct folio *folio)
 {
@@ -358,6 +439,7 @@ static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio = noop_dirty_folio,
 	.migrate_folio	= kvm_gmem_migrate_folio,
 	.error_remove_folio = kvm_gmem_error_folio,
+	.invalidate_folio = kvm_gmem_invalidate_folio,
 #ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
 	.free_folio = kvm_gmem_free_folio,
 #endif
@@ -442,7 +524,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
 	loff_t size = args->size;
 	u64 flags = args->flags;
-	u64 valid_flags = 0;
+	u64 valid_flags = KVM_GMEM_NO_DIRECT_MAP;
 
 	if (flags & ~valid_flags)
 		return -EINVAL;

base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
-- 
2.46.0




* [RFC PATCH v2 02/10] kvm: gmem: Add KVM_GMEM_GET_PFN_SHARED
  2024-09-10 16:30 [RFC PATCH v2 00/10] Unmapping guest_memfd from Direct Map Patrick Roy
  2024-09-10 16:30 ` [RFC PATCH v2 01/10] kvm: gmem: Add option to remove gmem from direct map Patrick Roy
@ 2024-09-10 16:30 ` Patrick Roy
  2024-09-10 16:30 ` [RFC PATCH v2 03/10] kvm: gmem: Add KVM_GMEM_GET_PFN_LOCKED Patrick Roy
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Patrick Roy @ 2024-09-10 16:30 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	rostedt, mhiramat, mathieu.desnoyers, kvm, linux-kernel,
	linux-trace-kernel, quic_eberman, dwmw, david, tabba, rppt,
	linux-mm, dmatlack
  Cc: Patrick Roy, graf, jgowans, derekmn, kalyazin, xmarcalx

If `KVM_GMEM_NO_DIRECT_MAP` is set, all gmem folios are removed from the
direct map immediately after allocation. Add a flag to
kvm_gmem_get_folio to override this behavior, and expose it via
`kvm_gmem_get_pfn`. Only allow this flag to be set if KVM can actually
access gmem (currently only if the vm type is KVM_X86_SW_PROTECTED_VM).

KVM_GMEM_GET_PFN_SHARED defers the direct map removal for newly
allocated folios until kvm_gmem_put_shared_pfn is called. For existing
folios, the direct map entry is temporarily restored until
kvm_gmem_put_shared_pfn is called.

The folio lock must be held the entire time the folio is present in the
direct map, to prevent races with concurrent calls to
kvm_gmem_folio_set_private that might remove direct map entries while
the folios are being accessed by KVM. As this is currently not possible
(kvm_gmem_get_pfn always unlocks the folio), the next patch will
introduce a KVM_GMEM_GET_PFN_LOCKED flag.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 arch/x86/kvm/mmu/mmu.c   |  2 +-
 include/linux/kvm_host.h | 12 +++++++++--
 virt/kvm/guest_memfd.c   | 46 +++++++++++++++++++++++++++++++---------
 3 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 901be9e420a4c..cb2f111f2cce0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4349,7 +4349,7 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
 	}
 
 	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
-			     &max_order);
+			     &max_order, 0);
 	if (r) {
 		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
 		return r;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 689e8be873a75..8a2975674de4b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2432,17 +2432,25 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 }
 #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 
+#define KVM_GMEM_GET_PFN_SHARED         BIT(0)
+#define KVM_GMEM_GET_PFN_PREPARE        BIT(31)  /* internal */
+
 #ifdef CONFIG_KVM_PRIVATE_MEM
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
-		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order, unsigned long flags);
+int kvm_gmem_put_shared_pfn(kvm_pfn_t pfn);
 #else
 static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
-				   kvm_pfn_t *pfn, int *max_order)
+				   kvm_pfn_t *pfn, int *max_order, int flags)
 {
 	KVM_BUG_ON(1, kvm);
 	return -EIO;
 }
+static inline int kvm_gmem_put_shared_pfn(kvm_pfn_t pfn)
+{
+	return -EIO;
+}
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
 #ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 2ed27992206f3..492b04f4e5c18 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -55,6 +55,11 @@ static bool kvm_gmem_test_no_direct_map(struct inode *inode)
 	return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) == KVM_GMEM_NO_DIRECT_MAP;
 }
 
+static bool kvm_gmem_test_accessible(struct kvm *kvm)
+{
+	return kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM;
+}
+
 static int kvm_gmem_folio_set_private(struct folio *folio)
 {
 	unsigned long start, npages, i;
@@ -110,10 +115,11 @@ static int kvm_gmem_folio_clear_private(struct folio *folio)
 	return r;
 }
 
-static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
+static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, unsigned long flags)
 {
 	int r;
 	struct folio *folio;
+	bool share = flags & KVM_GMEM_GET_PFN_SHARED;
 
 	/* TODO: Support huge pages. */
 	folio = filemap_grab_folio(inode->i_mapping, index);
@@ -139,7 +145,7 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
 		folio_mark_uptodate(folio);
 	}
 
-	if (prepare) {
+	if (flags & KVM_GMEM_GET_PFN_PREPARE) {
 		r = kvm_gmem_prepare_folio(inode, index, folio);
 		if (r < 0)
 			goto out_err;
@@ -148,12 +154,15 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
 	if (!kvm_gmem_test_no_direct_map(inode))
 		goto out;
 
-	if (!folio_test_private(folio)) {
+	if (folio_test_private(folio) && share) {
+		r = kvm_gmem_folio_clear_private(folio);
+	} else if (!folio_test_private(folio) && !share) {
 		r = kvm_gmem_folio_set_private(folio);
-		if (r)
-			goto out_err;
 	}
 
+	if (r)
+		goto out_err;
+
 out:
 	/*
 	 * Ignore accessed, referenced, and dirty flags.  The memory is
@@ -264,7 +273,7 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
 			break;
 		}
 
-		folio = kvm_gmem_get_folio(inode, index, true);
+		folio = kvm_gmem_get_folio(inode, index, KVM_GMEM_GET_PFN_PREPARE);
 		if (IS_ERR(folio)) {
 			r = PTR_ERR(folio);
 			break;
@@ -624,7 +633,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
 }
 
 static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
-		       gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prepare)
+		       gfn_t gfn, kvm_pfn_t *pfn, int *max_order, unsigned long flags)
 {
 	pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
 	struct kvm_gmem *gmem = file->private_data;
@@ -643,7 +652,7 @@ static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
 		return -EIO;
 	}
 
-	folio = kvm_gmem_get_folio(file_inode(file), index, prepare);
+	folio = kvm_gmem_get_folio(file_inode(file), index, flags);
 	if (IS_ERR(folio))
 		return PTR_ERR(folio);
 
@@ -667,20 +676,37 @@ static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
 }
 
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
-		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order, unsigned long flags)
 {
 	struct file *file = kvm_gmem_get_file(slot);
 	int r;
+	int valid_flags = KVM_GMEM_GET_PFN_SHARED;
+
+	if ((flags & valid_flags) != flags)
+		return -EINVAL;
+
+	if ((flags & KVM_GMEM_GET_PFN_SHARED) && !kvm_gmem_test_accessible(kvm))
+		return -EPERM;
 
 	if (!file)
 		return -EFAULT;
 
-	r = __kvm_gmem_get_pfn(file, slot, gfn, pfn, max_order, true);
+	r = __kvm_gmem_get_pfn(file, slot, gfn, pfn, max_order, flags | KVM_GMEM_GET_PFN_PREPARE);
 	fput(file);
 	return r;
 }
 EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
 
+int kvm_gmem_put_shared_pfn(kvm_pfn_t pfn) {
+	struct folio *folio = pfn_folio(pfn);
+
+	if (!kvm_gmem_test_no_direct_map(folio_inode(folio)))
+		return 0;
+
+	return kvm_gmem_folio_set_private(folio);
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_put_shared_pfn);
+
 long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
 		       kvm_gmem_populate_cb post_populate, void *opaque)
 {
-- 
2.46.0




* [RFC PATCH v2 03/10] kvm: gmem: Add KVM_GMEM_GET_PFN_LOCKED
  2024-09-10 16:30 [RFC PATCH v2 00/10] Unmapping guest_memfd from Direct Map Patrick Roy
  2024-09-10 16:30 ` [RFC PATCH v2 01/10] kvm: gmem: Add option to remove gmem from direct map Patrick Roy
  2024-09-10 16:30 ` [RFC PATCH v2 02/10] kvm: gmem: Add KVM_GMEM_GET_PFN_SHARED Patrick Roy
@ 2024-09-10 16:30 ` Patrick Roy
  2024-09-10 16:30 ` [RFC PATCH v2 04/10] kvm: Allow reading/writing gmem using kvm_{read,write}_guest Patrick Roy
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Patrick Roy @ 2024-09-10 16:30 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	rostedt, mhiramat, mathieu.desnoyers, kvm, linux-kernel,
	linux-trace-kernel, quic_eberman, dwmw, david, tabba, rppt,
	linux-mm, dmatlack
  Cc: Patrick Roy, graf, jgowans, derekmn, kalyazin, xmarcalx

Allow kvm_gmem_get_pfn to return with the folio lock held by adding a
KVM_GMEM_GET_PFN_LOCKED option to `flags`.

When accessing the content of gmem folios, the lock must be held until
kvm_gmem_put_pfn, to avoid concurrent direct map modifications of the
same folio causing use-after-free-like problems. However,
kvm_gmem_get_pfn so far unconditionally drops the folio lock, making it
currently impossible to use the KVM_GMEM_GET_PFN_SHARED flag safely.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/guest_memfd.c   | 5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8a2975674de4b..cd28eb34aaeb1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2433,6 +2433,7 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 
 #define KVM_GMEM_GET_PFN_SHARED         BIT(0)
+#define KVM_GMEM_GET_PFN_LOCKED         BIT(1)
 #define KVM_GMEM_GET_PFN_PREPARE        BIT(31)  /* internal */
 
 #ifdef CONFIG_KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 492b04f4e5c18..f637abc6045ba 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -670,7 +670,8 @@ static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
 
 	r = 0;
 
-	folio_unlock(folio);
+	if (!(flags & KVM_GMEM_GET_PFN_LOCKED))
+		folio_unlock(folio);
 
 	return r;
 }
@@ -680,7 +681,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 {
 	struct file *file = kvm_gmem_get_file(slot);
 	int r;
-	int valid_flags = KVM_GMEM_GET_PFN_SHARED;
+	int valid_flags = KVM_GMEM_GET_PFN_SHARED | KVM_GMEM_GET_PFN_LOCKED;
 
 	if ((flags & valid_flags) != flags)
 		return -EINVAL;
-- 
2.46.0




* [RFC PATCH v2 04/10] kvm: Allow reading/writing gmem using kvm_{read,write}_guest
  2024-09-10 16:30 [RFC PATCH v2 00/10] Unmapping guest_memfd from Direct Map Patrick Roy
                   ` (2 preceding siblings ...)
  2024-09-10 16:30 ` [RFC PATCH v2 03/10] kvm: gmem: Add KVM_GMEM_GET_PFN_LOCKED Patrick Roy
@ 2024-09-10 16:30 ` Patrick Roy
  2024-09-10 16:30 ` [RFC PATCH v2 05/10] kvm: gmem: Refcount internal accesses to gmem Patrick Roy
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Patrick Roy @ 2024-09-10 16:30 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	rostedt, mhiramat, mathieu.desnoyers, kvm, linux-kernel,
	linux-trace-kernel, quic_eberman, dwmw, david, tabba, rppt,
	linux-mm, dmatlack
  Cc: Patrick Roy, graf, jgowans, derekmn, kalyazin, xmarcalx

If KVM can access guest_memfd memory (or at least convert it into a
state in which KVM can access it) without causing a host-kernel panic
(e.g. currently only if the vm type is KVM_X86_SW_PROTECTED_VM), allow
`kvm_{read,write}_guest` to access gfns that are backed by gmem.  If KVM
cannot access guest_memfd memory (say, because it is running a TDX VM),
prepare a KVM_EXIT_MEMORY_FAULT (if possible) and return -EFAULT.

KVM can only prepare the memory fault exit inside the
`kvm_vcpu_{read,write}_guest` variant, as it needs a vcpu reference to
assign the exit reason to.

KVM accesses to gmem are done via the direct map (as no userspace
mappings exist, and even if they existed, they wouldn't be reflected
into the memslots). If `KVM_GMEM_NO_DIRECT_MAP` is set, then temporarily
reinsert the accessed folio into the direct map. Hold the folio lock for
the entire duration of the access to prevent concurrent direct map
modifications from taking place (as these might remove the direct map
entry while kvm_{read,write}_guest is using it, which would result in a
panic).

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 virt/kvm/kvm_main.c | 83 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc0..13347fb03d4a9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3286,11 +3286,51 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
 	return 0;
 }
 
+static int __kvm_read_guest_private_page(struct kvm *kvm,
+					 struct kvm_memory_slot *memslot, gfn_t gfn,
+					 void *data, int offset, int len)
+{
+	kvm_pfn_t pfn;
+	int r;
+	struct folio *folio;
+
+	r = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, NULL,
+			     KVM_GMEM_GET_PFN_SHARED | KVM_GMEM_GET_PFN_LOCKED);
+
+	if (r < 0)
+		return r;
+
+	folio = pfn_folio(pfn);
+	memcpy(data, folio_address(folio) + offset, len);
+	r =  kvm_gmem_put_shared_pfn(pfn);
+	folio_unlock(folio);
+	folio_put(folio);
+	return r;
+}
+
+static int __kvm_vcpu_read_guest_private_page(struct kvm_vcpu *vcpu,
+					       struct kvm_memory_slot *memslot, gfn_t gfn,
+					       void *data, int offset, int len)
+{
+	int r = __kvm_read_guest_private_page(vcpu->kvm, memslot, gfn, data, offset, len);
+
+	/* kvm not allowed to access gmem */
+	if (r == -EPERM) {
+		kvm_prepare_memory_fault_exit(vcpu, gfn + offset, len, false,
+					      false, true);
+		return -EFAULT;
+	}
+
+	return r;
+}
+
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 			int len)
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
+	if (kvm_mem_is_private(kvm, gfn))
+		return __kvm_read_guest_private_page(kvm, slot, gfn, data, offset, len);
 	return __kvm_read_guest_page(slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_page);
@@ -3300,6 +3340,8 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
+	if (kvm_mem_is_private(vcpu->kvm, gfn))
+		return __kvm_vcpu_read_guest_private_page(vcpu, slot, gfn, data, offset, len);
 	return __kvm_read_guest_page(slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
@@ -3390,11 +3432,50 @@ static int __kvm_write_guest_page(struct kvm *kvm,
 	return 0;
 }
 
+static int __kvm_write_guest_private_page(struct kvm *kvm,
+					  struct kvm_memory_slot *memslot, gfn_t gfn,
+					  const void *data, int offset, int len)
+{
+	kvm_pfn_t pfn;
+	int r;
+	struct folio *folio;
+
+	r = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, NULL,
+			     KVM_GMEM_GET_PFN_SHARED | KVM_GMEM_GET_PFN_LOCKED);
+
+	if (r < 0)
+		return r;
+
+	folio = pfn_folio(pfn);
+	memcpy(folio_address(folio) + offset, data, len);
+	r =  kvm_gmem_put_shared_pfn(pfn);
+	folio_unlock(folio);
+	folio_put(folio);
+	return r;
+}
+
+static int __kvm_vcpu_write_guest_private_page(struct kvm_vcpu *vcpu,
+					       struct kvm_memory_slot *memslot, gfn_t gfn,
+					       const void *data, int offset, int len)
+{
+	int r = __kvm_write_guest_private_page(vcpu->kvm, memslot, gfn, data, offset, len);
+
+	if (r == -EPERM) {
+		kvm_prepare_memory_fault_exit(vcpu, gfn + offset, len, true,
+					      false, true);
+		return -EFAULT;
+	}
+
+	return r;
+}
+
 int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
 			 const void *data, int offset, int len)
 {
 	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
 
+	if (kvm_mem_is_private(kvm, gfn))
+		return __kvm_write_guest_private_page(kvm, slot, gfn, data, offset, len);
 	return __kvm_write_guest_page(kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_page);
@@ -3404,6 +3485,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
+	if (kvm_mem_is_private(vcpu->kvm, gfn))
+		return __kvm_vcpu_write_guest_private_page(vcpu, slot, gfn, data, offset, len);
 	return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
-- 
2.46.0




* [RFC PATCH v2 05/10] kvm: gmem: Refcount internal accesses to gmem
  2024-09-10 16:30 [RFC PATCH v2 00/10] Unmapping guest_memfd from Direct Map Patrick Roy
                   ` (3 preceding siblings ...)
  2024-09-10 16:30 ` [RFC PATCH v2 04/10] kvm: Allow reading/writing gmem using kvm_{read,write}_guest Patrick Roy
@ 2024-09-10 16:30 ` Patrick Roy
  2024-09-10 16:30 ` [RFC PATCH v2 06/10] kvm: gmem: add tracepoints for gmem share/unshare Patrick Roy
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Patrick Roy @ 2024-09-10 16:30 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	rostedt, mhiramat, mathieu.desnoyers, kvm, linux-kernel,
	linux-trace-kernel, quic_eberman, dwmw, david, tabba, rppt,
	linux-mm, dmatlack
  Cc: Patrick Roy, graf, jgowans, derekmn, kalyazin, xmarcalx

Currently, if KVM_GMEM_NO_DIRECT_MAP is set and KVM wants to
internally access a gmem folio, KVM needs to reinsert the folio into the
direct map, and hold the folio lock until KVM is done using the folio
(and the folio is removed from the direct map again).

This means that long-term reinsertion into the direct map and
concurrent accesses to the same gmem folio are currently impossible.
Both are needed, however, for the data structures of paravirtual
devices, such as kvm-clock, which are shared between guest and host via
guest memory pages (and multiple vCPUs can put their kvm-clock data into
the same guest page).

Thus, introduce the concept of a "sharing refcount", which gets
incremented on every call to kvm_gmem_get_pfn with
KVM_GMEM_GET_PFN_SHARED set. Direct map manipulations are only done when
the first refcount is grabbed (direct map entries are restored), or when
the last reference goes away (direct map entries are removed). While
holding a sharing reference, the folio lock may be dropped, as the
refcounting ensures that the direct map entry will not be removed as
long as at least one reference is held. However, whoever is holding a
reference will need to listen and respond to gmem invalidation events
(such as the page being in the process of being fallocated away).

Since refcount_t does not play nicely with references dropping to 0 and
later being raised again (it will WARN), we use a refcount of 1 to mean
"no sharing references held anywhere, folio not in direct map".

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 virt/kvm/guest_memfd.c | 61 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 58 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index f637abc6045ba..6772253497e4d 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -60,10 +60,37 @@ static bool kvm_gmem_test_accessible(struct kvm *kvm)
 	return kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM;
 }
 
+static int kvm_gmem_init_sharing_count(struct folio *folio)
+{
+	refcount_t *sharing_count = kmalloc(sizeof(*sharing_count), GFP_KERNEL);
+
+	if (!sharing_count)
+		return -ENOMEM;
+
+	/*
+	 * we need to use sharing_count == 1 to mean "no sharing", because
+	 * dropping a refcount_t to 0 and later incrementing it again would
+	 * result in a WARN.
+	 */
+	refcount_set(sharing_count, 1);
+	folio_change_private(folio, (void *)sharing_count);
+
+	return 0;
+}
+
 static int kvm_gmem_folio_set_private(struct folio *folio)
 {
 	unsigned long start, npages, i;
 	int r;
+	unsigned int sharing_refcount = refcount_read(folio_get_private(folio));
+
+	/*
+	 * We must only remove direct map entries after the last internal
+	 * reference has gone away, e.g. after the refcount dropped back
+	 * to 1.
+	 */
+	WARN_ONCE(sharing_refcount != 1, "%d unexpected sharing_refcounts pfn=%lx",
+		  sharing_refcount - 1, folio_pfn(folio));
 
 	start = (unsigned long) folio_address(folio);
 	npages = folio_nr_pages(folio);
@@ -97,6 +124,15 @@ static int kvm_gmem_folio_clear_private(struct folio *folio)
 {
 	unsigned long npages, i;
 	int r = 0;
+	unsigned int sharing_refcount = refcount_read(folio_get_private(folio));
+
+	/*
+	 * We must restore direct map entries on acquiring the first "sharing
+	 * reference". The refcount is lifted _after_ the call to
+	 * kvm_gmem_folio_clear_private, so it will still be 1 here.
+	 */
+	WARN_ONCE(sharing_refcount != 1, "%d unexpected sharing_refcounts pfn=%lx",
+		  sharing_refcount - 1, folio_pfn(folio));
 
 	npages = folio_nr_pages(folio);
 
@@ -156,13 +192,21 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, unsi
 
 	if (folio_test_private(folio) && share) {
 		r = kvm_gmem_folio_clear_private(folio);
-	} else if (!folio_test_private(folio) && !share) {
-		r = kvm_gmem_folio_set_private(folio);
+	} else if (!folio_test_private(folio)) {
+		r = kvm_gmem_init_sharing_count(folio);
+		if (r)
+			goto out_err;
+
+		if (!share)
+			r = kvm_gmem_folio_set_private(folio);
 	}
 
 	if (r)
 		goto out_err;
 
+	if (share)
+		refcount_inc(folio_get_private(folio));
+
 out:
 	/*
 	 * Ignore accessed, referenced, and dirty flags.  The memory is
@@ -429,7 +473,10 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 static void kvm_gmem_invalidate_folio(struct folio *folio, size_t start, size_t end)
 {
 	if (start == 0 && end == folio_size(folio)) {
+		refcount_t *sharing_count = folio_get_private(folio);
+
 		kvm_gmem_folio_clear_private(folio);
+		kfree(sharing_count);
 	}
 }
 
@@ -699,12 +746,20 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
 
 int kvm_gmem_put_shared_pfn(kvm_pfn_t pfn) {
+	int r = 0;
 	struct folio *folio = pfn_folio(pfn);
+	refcount_t *sharing_count;
 
 	if (!kvm_gmem_test_no_direct_map(folio_inode(folio)))
 		return 0;
 
-	return kvm_gmem_folio_set_private(folio);
+	sharing_count = folio_get_private(folio);
+	refcount_dec(sharing_count);
+
+	if (refcount_read(sharing_count) == 1)
+		r = kvm_gmem_folio_set_private(folio);
+
+	return r;
 }
 EXPORT_SYMBOL_GPL(kvm_gmem_put_shared_pfn);
 
-- 
2.46.0




* [RFC PATCH v2 06/10] kvm: gmem: add tracepoints for gmem share/unshare
  2024-09-10 16:30 [RFC PATCH v2 00/10] Unmapping guest_memfd from Direct Map Patrick Roy
                   ` (4 preceding siblings ...)
  2024-09-10 16:30 ` [RFC PATCH v2 05/10] kvm: gmem: Refcount internal accesses to gmem Patrick Roy
@ 2024-09-10 16:30 ` Patrick Roy
  2024-10-04 22:50   ` Steven Rostedt
  2024-09-10 16:30 ` [RFC PATCH v2 07/10] kvm: pfncache: invalidate when memory attributes change Patrick Roy
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 13+ messages in thread
From: Patrick Roy @ 2024-09-10 16:30 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	rostedt, mhiramat, mathieu.desnoyers, kvm, linux-kernel,
	linux-trace-kernel, quic_eberman, dwmw, david, tabba, rppt,
	linux-mm, dmatlack
  Cc: Patrick Roy, graf, jgowans, derekmn, kalyazin, xmarcalx

Add tracepoints for calls to kvm_gmem_get_folio that cause the returned
folio to be considered "shared" (e.g. accessible by host KVM), and a
tracepoint for when KVM is done accessing a gmem pfn
(kvm_gmem_put_shared_pfn).

The above operations can cause folios to be inserted into/removed from
the direct map. We want to be able to make sure that only those gmem folios
that we expect KVM to access are ever reinserted into the direct map,
and that all folios that are temporarily reinserted are also removed
again at a later point. Processing ftrace output is one way to verify
this.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 include/trace/events/kvm.h | 43 ++++++++++++++++++++++++++++++++++++++
 virt/kvm/guest_memfd.c     |  7 ++++++-
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 74e40d5d4af42..4a40fd4c22f91 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -489,6 +489,49 @@ TRACE_EVENT(kvm_test_age_hva,
 	TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
 );
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+TRACE_EVENT(kvm_gmem_share,
+	TP_PROTO(struct folio *folio, pgoff_t index),
+	TP_ARGS(folio, index),
+
+	TP_STRUCT__entry(
+		__field(unsigned int, sharing_count)
+		__field(kvm_pfn_t, pfn)
+		__field(pgoff_t, index)
+		__field(unsigned long,  npages)
+	),
+
+	TP_fast_assign(
+		__entry->sharing_count = refcount_read(folio_get_private(folio));
+		__entry->pfn = folio_pfn(folio);
+		__entry->index = index;
+		__entry->npages = folio_nr_pages(folio);
+	),
+
+	TP_printk("pfn=0x%llx index=%lu pages=%lu (refcount now %d)",
+	          __entry->pfn, __entry->index, __entry->npages, __entry->sharing_count - 1)
+);
+
+TRACE_EVENT(kvm_gmem_unshare,
+	TP_PROTO(kvm_pfn_t pfn),
+	TP_ARGS(pfn),
+
+	TP_STRUCT__entry(
+		__field(unsigned int, sharing_count)
+		__field(kvm_pfn_t, pfn)
+	),
+
+	TP_fast_assign(
+		__entry->sharing_count = refcount_read(folio_get_private(pfn_folio(pfn)));
+		__entry->pfn = pfn;
+	),
+
+	TP_printk("pfn=0x%llx (refcount now %d)",
+	          __entry->pfn, __entry->sharing_count - 1)
+)
+
+#endif
+
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 6772253497e4d..742eba36d2371 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -7,6 +7,7 @@
 #include <linux/set_memory.h>
 
 #include "kvm_mm.h"
+#include "trace/events/kvm.h"
 
 struct kvm_gmem {
 	struct kvm *kvm;
@@ -204,8 +205,10 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, unsi
 	if (r)
 		goto out_err;
 
-	if (share)
+	if (share) {
 		refcount_inc(folio_get_private(folio));
+		trace_kvm_gmem_share(folio, index);
+	}
 
 out:
 	/*
@@ -759,6 +762,8 @@ int kvm_gmem_put_shared_pfn(kvm_pfn_t pfn) {
 	if (refcount_read(sharing_count) == 1)
 		r = kvm_gmem_folio_set_private(folio);
 
+	trace_kvm_gmem_unshare(pfn);
+
 	return r;
 }
 EXPORT_SYMBOL_GPL(kvm_gmem_put_shared_pfn);
-- 
2.46.0




* [RFC PATCH v2 07/10] kvm: pfncache: invalidate when memory attributes change
  2024-09-10 16:30 [RFC PATCH v2 00/10] Unmapping guest_memfd from Direct Map Patrick Roy
                   ` (5 preceding siblings ...)
  2024-09-10 16:30 ` [RFC PATCH v2 06/10] kvm: gmem: add tracepoints for gmem share/unshare Patrick Roy
@ 2024-09-10 16:30 ` Patrick Roy
  2024-09-10 16:30 ` [RFC PATCH v2 08/10] kvm: pfncache: Support caching gmem pfns Patrick Roy
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Patrick Roy @ 2024-09-10 16:30 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	rostedt, mhiramat, mathieu.desnoyers, kvm, linux-kernel,
	linux-trace-kernel, quic_eberman, dwmw, david, tabba, rppt,
	linux-mm, dmatlack
  Cc: Patrick Roy, graf, jgowans, derekmn, kalyazin, xmarcalx

Invalidate a gfn_to_pfn_cache when the memory attributes of the gfn it
contains change.

Since gfn_to_pfn_caches are not hooked up to KVM's MMU notifiers, but
rather have to be invalidated right _before_ KVM's MMU notifiers are
triggered, adopt the approach used by
kvm_mmu_notifier_invalidate_range_start for invalidating gpcs inside
kvm_vm_set_mem_attributes.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      |  5 +++++
 virt/kvm/kvm_mm.h        | 10 +++++++++
 virt/kvm/pfncache.c      | 45 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 61 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cd28eb34aaeb1..7d36164a2cee5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -840,6 +840,7 @@ struct kvm {
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 	/* Protected by slots_locks (for writes) and RCU (for reads) */
 	struct xarray mem_attr_array;
+	bool attribute_change_in_progress;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
 };
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 13347fb03d4a9..183f7ce57a428 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2533,6 +2533,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 
 	mutex_lock(&kvm->slots_lock);
 
+
 	/* Nothing to do if the entire range as the desired attributes. */
 	if (kvm_range_has_memory_attributes(kvm, start, end, attributes))
 		goto out_unlock;
@@ -2547,6 +2548,9 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 			goto out_unlock;
 	}
 
+	kvm->attribute_change_in_progress = true;
+	gfn_to_pfn_cache_invalidate_gfns_start(kvm, start, end);
+
 	kvm_handle_gfn_range(kvm, &pre_set_range);
 
 	for (i = start; i < end; i++) {
@@ -2558,6 +2562,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	kvm_handle_gfn_range(kvm, &post_set_range);
 
 out_unlock:
+	kvm->attribute_change_in_progress = false;
 	mutex_unlock(&kvm->slots_lock);
 
 	return r;
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 715f19669d01f..5a53d888e4b18 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -27,12 +27,22 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
 void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
 				       unsigned long start,
 				       unsigned long end);
+
+void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm,
+					    gfn_t start,
+					    gfn_t end);
 #else
 static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
 						     unsigned long start,
 						     unsigned long end)
 {
 }
+
+static inline void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm,
+							  gfn_t start,
+							  gfn_t end)
+{
+}
 #endif /* HAVE_KVM_PFNCACHE */
 
 #ifdef CONFIG_KVM_PRIVATE_MEM
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index f0039efb9e1e3..6de934a8a153f 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -57,6 +57,43 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
 	spin_unlock(&kvm->gpc_lock);
 }
 
+/*
+ * Identical to `gfn_to_pfn_cache_invalidate_start`, except based on gfns
+ * instead of uhvas.
+ */
+void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	struct gfn_to_pfn_cache *gpc;
+
+	spin_lock(&kvm->gpc_lock);
+	list_for_each_entry(gpc, &kvm->gpc_list, list) {
+		read_lock_irq(&gpc->lock);
+
+		/*
+		 * uhva based gpcs must not be used with gmem enabled memslots
+		 */
+		if (kvm_is_error_gpa(gpc->gpa)) {
+			read_unlock_irq(&gpc->lock);
+			continue;
+		}
+
+		if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
+		    gpa_to_gfn(gpc->gpa) >= start && gpa_to_gfn(gpc->gpa) < end) {
+			read_unlock_irq(&gpc->lock);
+
+			write_lock_irq(&gpc->lock);
+			if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
+			    gpa_to_gfn(gpc->gpa) >= start && gpa_to_gfn(gpc->gpa) < end)
+				gpc->valid = false;
+			write_unlock_irq(&gpc->lock);
+			continue;
+		}
+
+		read_unlock_irq(&gpc->lock);
+	}
+	spin_unlock(&kvm->gpc_lock);
+}
+
 static bool kvm_gpc_is_valid_len(gpa_t gpa, unsigned long uhva,
 				 unsigned long len)
 {
@@ -141,6 +178,14 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s
 	if (kvm->mn_active_invalidate_count)
 		return true;
 
+	/*
+	 * Similarly to the above, attribute_change_in_progress is set
+	 * before gfn_to_pfn_cache_invalidate_start is called in
+	 * kvm_vm_set_mem_attributes, and isn't cleared until after
+	 * mmu_invalidate_seq is updated.
+	 */
+	if (kvm->attribute_change_in_progress)
+		return true;
 	/*
 	 * Ensure mn_active_invalidate_count is read before
 	 * mmu_invalidate_seq.  This pairs with the smp_wmb() in
-- 
2.46.0




* [RFC PATCH v2 08/10] kvm: pfncache: Support caching gmem pfns
  2024-09-10 16:30 [RFC PATCH v2 00/10] Unmapping guest_memfd from Direct Map Patrick Roy
                   ` (6 preceding siblings ...)
  2024-09-10 16:30 ` [RFC PATCH v2 07/10] kvm: pfncache: invalidate when memory attributes change Patrick Roy
@ 2024-09-10 16:30 ` Patrick Roy
  2024-09-10 16:30 ` [RFC PATCH v2 09/10] kvm: pfncache: hook up to gmem invalidation Patrick Roy
  2024-09-10 16:30 ` [RFC PATCH v2 10/10] kvm: x86: support walking guest page tables in gmem Patrick Roy
  9 siblings, 0 replies; 13+ messages in thread
From: Patrick Roy @ 2024-09-10 16:30 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	rostedt, mhiramat, mathieu.desnoyers, kvm, linux-kernel,
	linux-trace-kernel, quic_eberman, dwmw, david, tabba, rppt,
	linux-mm, dmatlack
  Cc: Patrick Roy, graf, jgowans, derekmn, kalyazin, xmarcalx

Inside the `hva_to_pfn_retry` loop, for gpa based gpcs, check whether
the gpa has KVM_MEMORY_ATTRIBUTE_PRIVATE set, and if so, use
`kvm_gmem_get_pfn` with `KVM_GMEM_GET_PFN_SHARED` to resolve the pfn.
Ignore uhva based gpcs for now, as they are only used with Xen, and we
don't have guest_memfd there (yet). Gmem pfns that are cached by a gpc
have their sharing refcount elevated until the gpc gets invalidated (or
rather: until it gets refreshed after invalidation) or deactivated.

Since during the refresh loop the memory attributes could change between
private and shared, store a uhva anyway, even if it will not be used in
the translation in the end.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 include/linux/kvm_types.h |  1 +
 virt/kvm/pfncache.c       | 63 ++++++++++++++++++++++++++++++++++-----
 2 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 827ecc0b7e10a..8903b8f46cf6c 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -70,6 +70,7 @@ struct gfn_to_pfn_cache {
 	kvm_pfn_t pfn;
 	bool active;
 	bool valid;
+	bool private;
 };
 
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 6de934a8a153f..a4f935e80f545 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -16,6 +16,7 @@
 #include <linux/highmem.h>
 #include <linux/module.h>
 #include <linux/errno.h>
+#include <linux/pagemap.h>
 
 #include "kvm_mm.h"
 
@@ -145,13 +146,20 @@ static void *gpc_map(kvm_pfn_t pfn)
 #endif
 }
 
-static void gpc_unmap(kvm_pfn_t pfn, void *khva)
+static void gpc_unmap(kvm_pfn_t pfn, void *khva, bool private)
 {
 	/* Unmap the old pfn/page if it was mapped before. */
 	if (is_error_noslot_pfn(pfn) || !khva)
 		return;
 
 	if (pfn_valid(pfn)) {
+		if (private) {
+			struct folio *folio = pfn_folio(pfn);
+
+			folio_lock(folio);
+			kvm_gmem_put_shared_pfn(pfn);
+			folio_unlock(folio);
+		}
 		kunmap(pfn_to_page(pfn));
 		return;
 	}
@@ -203,6 +211,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 	void *old_khva = (void *)PAGE_ALIGN_DOWN((uintptr_t)gpc->khva);
 	kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
 	void *new_khva = NULL;
+	bool private = gpc->private;
 	unsigned long mmu_seq;
 
 	lockdep_assert_held(&gpc->refresh_lock);
@@ -235,17 +244,43 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 			 * the existing mapping and didn't create a new one.
 			 */
 			if (new_khva != old_khva)
-				gpc_unmap(new_pfn, new_khva);
+				gpc_unmap(new_pfn, new_khva, private);
 
 			kvm_release_pfn_clean(new_pfn);
 
 			cond_resched();
 		}
 
-		/* We always request a writeable mapping */
-		new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
-		if (is_error_noslot_pfn(new_pfn))
-			goto out_error;
+		/*
+		 * If we do not have a GPA, we cannot immediately determine
+		 * whether the area of guest memory gpc->uhva pointed to
+		 * is currently set to shared. So assume that uhva-based gpcs
+		 * never have their underlying guest memory switched to
+		 * private (which we can do as uhva-based gpcs are only used
+		 * with Xen, and guest_memfd is not supported there).
+		 */
+		if (gpc->gpa != INVALID_GPA) {
+			/*
+			 * mmu_notifier events can be due to shared/private conversions,
+			 * thus recheck this every iteration.
+			 */
+			private = kvm_mem_is_private(gpc->kvm, gpa_to_gfn(gpc->gpa));
+		} else {
+			private = false;
+		}
+
+		if (private) {
+			int r = kvm_gmem_get_pfn(gpc->kvm, gpc->memslot, gpa_to_gfn(gpc->gpa),
+						 &new_pfn, NULL, KVM_GMEM_GET_PFN_SHARED);
+			if (r)
+				goto out_error;
+		} else {
+			/* We always request a writeable mapping */
+			new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL,
+					     true, NULL);
+			if (is_error_noslot_pfn(new_pfn))
+				goto out_error;
+		}
 
 		/*
 		 * Obtain a new kernel mapping if KVM itself will access the
@@ -274,6 +309,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 	gpc->valid = true;
 	gpc->pfn = new_pfn;
 	gpc->khva = new_khva + offset_in_page(gpc->uhva);
+	gpc->private = private;
 
 	/*
 	 * Put the reference to the _new_ pfn.  The pfn is now tracked by the
@@ -298,6 +334,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 	kvm_pfn_t old_pfn;
 	bool hva_change = false;
 	void *old_khva;
+	bool old_private;
 	int ret;
 
 	/* Either gpa or uhva must be valid, but not both */
@@ -316,6 +353,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 	old_pfn = gpc->pfn;
 	old_khva = (void *)PAGE_ALIGN_DOWN((uintptr_t)gpc->khva);
 	old_uhva = PAGE_ALIGN_DOWN(gpc->uhva);
+	old_private = gpc->private;
 
 	if (kvm_is_error_gpa(gpa)) {
 		page_offset = offset_in_page(uhva);
@@ -338,6 +376,11 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 			gpc->gpa = gpa;
 			gpc->generation = slots->generation;
 			gpc->memslot = __gfn_to_memslot(slots, gfn);
+			/*
+			 * compute the uhva even for private memory, in case an
+			 * invalidation event flips memory from private to
+			 * shared while in hva_to_pfn_retry
+			 */
 			gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn);
 
 			if (kvm_is_error_hva(gpc->uhva)) {
@@ -395,7 +438,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 	write_unlock_irq(&gpc->lock);
 
 	if (unmap_old)
-		gpc_unmap(old_pfn, old_khva);
+		gpc_unmap(old_pfn, old_khva, old_private);
 
 	return ret;
 }
@@ -486,6 +529,7 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
 	struct kvm *kvm = gpc->kvm;
 	kvm_pfn_t old_pfn;
 	void *old_khva;
+	bool old_private;
 
 	guard(mutex)(&gpc->refresh_lock);
 
@@ -508,6 +552,9 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
 		old_khva = gpc->khva - offset_in_page(gpc->khva);
 		gpc->khva = NULL;
 
+		old_private = gpc->private;
+		gpc->private = false;
+
 		old_pfn = gpc->pfn;
 		gpc->pfn = KVM_PFN_ERR_FAULT;
 		write_unlock_irq(&gpc->lock);
@@ -516,6 +563,6 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
 		list_del(&gpc->list);
 		spin_unlock(&kvm->gpc_lock);
 
-		gpc_unmap(old_pfn, old_khva);
+		gpc_unmap(old_pfn, old_khva, old_private);
 	}
 }
-- 
2.46.0




* [RFC PATCH v2 09/10] kvm: pfncache: hook up to gmem invalidation
  2024-09-10 16:30 [RFC PATCH v2 00/10] Unmapping guest_memfd from Direct Map Patrick Roy
                   ` (7 preceding siblings ...)
  2024-09-10 16:30 ` [RFC PATCH v2 08/10] kvm: pfncache: Support caching gmem pfns Patrick Roy
@ 2024-09-10 16:30 ` Patrick Roy
  2024-09-10 16:30 ` [RFC PATCH v2 10/10] kvm: x86: support walking guest page tables in gmem Patrick Roy
  9 siblings, 0 replies; 13+ messages in thread
From: Patrick Roy @ 2024-09-10 16:30 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	rostedt, mhiramat, mathieu.desnoyers, kvm, linux-kernel,
	linux-trace-kernel, quic_eberman, dwmw, david, tabba, rppt,
	linux-mm, dmatlack
  Cc: Patrick Roy, graf, jgowans, derekmn, kalyazin, xmarcalx

Invalidate gfn_to_pfn_caches that hold gmem pfns whenever gmem
invalidations occur (fallocate(FALLOC_FL_PUNCH_HOLE),
error_remove_folio).

gmem invalidations are difficult to handle for gpcs. The unmap path for
gmem pfns in gpc tries to decrement the sharing refcount, and
potentially modifies the direct map. However, these are not operations
we can do after the gmem folio that used to sit in the pfn has been
freed (after we drop gpc->lock in
gfn_to_pfn_cache_invalidate_gfns_start we are racing against the freeing
of the folio, and we cannot do direct map manipulations before dropping
the lock). Thus, in these cases (punch hole and error_remove_folio), we
must "leak" the sharing reference. This is fine because either the
folio has already been freed, or it is about to be freed by
->invalidate_folio, which only reinserts it into the direct map, so if
the folio is already in the direct map, no harm is done. Instead of
unmapping, we simply store a flag that tells the gpc to skip unmapping
these pfns when the time comes to refresh the cache.

A slightly different case is when just the memory attributes of a gfn
range change. If a range is flipped from private to shared, the gmem pfn
is still there; it is simply no longer mapped into the guest. In this
scenario we must unmap, to decrement the sharing count and reinsert the
pfn into the direct map. Otherwise, if for example the gpc gets
deactivated while the gfn is set to shared, and the gfn is later flipped
back to private, something else might use the pfn while it is still
present in the direct map (which violates the security goal of direct
map removal).
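
In pseudocode, the behaviour described in the previous two paragraphs is
roughly the following (a sketch of the prose only, not of the actual
hunks below; invalidated_because_range_went_away and the two
*_on_next_refresh helpers are made-up names used purely for
illustration):

	if (!gpc->private)
		return;		/* nothing gmem-specific to decide */

	if (invalidated_because_range_went_away) {
		/* punch hole / error_remove_folio: leak the sharing ref */
		skip_unmap_on_next_refresh(gpc);
	} else {
		/*
		 * attribute flip private -> shared: drop the sharing ref and
		 * restore the direct map entry when the cache is unmapped
		 */
		unmap_on_next_refresh(gpc);
	}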

However, there is one edge case to deal with: a gpc can get invalidated
by a memory attribute change (i.e. gpc->needs_unmap = true), then
refreshed, and after the refresh loop has exited and gpc->lock has been
dropped, but before we get to gpc_unmap, the gmem folio that occupies
the invalidated pfn of the cache can be fallocated away. needs_unmap is
then true, but we are once again racing against the freeing of the
folio. For this case, take a reference to the folio before dropping
gpc->lock, and only drop that reference after gpc_unmap has returned, so
the folio cannot be freed in between.
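
The resulting ordering in the refresh path is sketched below, with the
reasons spelled out as comments (same names as in the hunk further down,
repeated here only to make the purpose of the folio reference explicit):

	/* still under gpc->lock */
	if (unmap_old) {
		/*
		 * Pin the folio: a concurrent punch hole may otherwise free
		 * it as soon as the lock is dropped.
		 */
		folio_get(pfn_folio(old_pfn));
	}
	write_unlock_irq(&gpc->lock);

	if (unmap_old) {
		/* may drop the sharing ref and modify the direct map */
		gpc_unmap(old_pfn, old_khva, old_private);
		folio_put(pfn_folio(old_pfn));
	}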

For similar reasons, gfn_to_pfn_cache_invalidate_gfns_start must not
skip already-invalidated caches: a cache that was invalidated due to a
memory attribute change will have needs_unmap=true, and if a
fallocate(FALLOC_FL_PUNCH_HOLE) then hits the same range, it needs to be
updated to needs_unmap=false, even though the cache is already
invalidated.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 include/linux/kvm_host.h  |  3 +++
 include/linux/kvm_types.h |  1 +
 virt/kvm/guest_memfd.c    | 19 +++++++++++++++-
 virt/kvm/kvm_main.c       |  5 ++++-
 virt/kvm/kvm_mm.h         |  6 +++--
 virt/kvm/pfncache.c       | 46 +++++++++++++++++++++++++++++++++------
 6 files changed, 69 insertions(+), 11 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7d36164a2cee5..62e45a4ab810e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -843,6 +843,9 @@ struct kvm {
 	bool attribute_change_in_progress;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	atomic_t gmem_active_invalidate_count;
+#endif
 };
 
 #define kvm_err(fmt, ...) \
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 8903b8f46cf6c..a2df9623b17ce 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -71,6 +71,7 @@ struct gfn_to_pfn_cache {
 	bool active;
 	bool valid;
 	bool private;
+	bool needs_unmap;
 };
 
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 742eba36d2371..ac502f9b220c3 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -231,6 +231,15 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
 	struct kvm *kvm = gmem->kvm;
 	unsigned long index;
 
+	atomic_inc(&kvm->gmem_active_invalidate_count);
+
+	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+		pgoff_t pgoff = slot->gmem.pgoff;
+
+		gfn_to_pfn_cache_invalidate_gfns_start(kvm, slot->base_gfn + start - pgoff,
+						       slot->base_gfn + end - pgoff, true);
+	}
+
 	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
 		pgoff_t pgoff = slot->gmem.pgoff;
 
@@ -268,6 +277,8 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
 		kvm_mmu_invalidate_end(kvm);
 		KVM_MMU_UNLOCK(kvm);
 	}
+
+	atomic_dec(&kvm->gmem_active_invalidate_count);
 }
 
 static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
@@ -478,7 +489,13 @@ static void kvm_gmem_invalidate_folio(struct folio *folio, size_t start, size_t
 	if (start == 0 && end == folio_size(folio)) {
 		refcount_t *sharing_count = folio_get_private(folio);
 
-		kvm_gmem_folio_clear_private(folio);
+		/*
+		 * gfn_to_pfn_caches do not decrement the refcount if they
+		 * get invalidated due to the gmem pfn going away (fallocate,
+		 * or error_remove_folio)
+		 */
+		if (refcount_read(sharing_count) == 1)
+			kvm_gmem_folio_clear_private(folio);
 		kfree(sharing_count);
 	}
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 183f7ce57a428..6d0818c723d73 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1161,6 +1161,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 	xa_init(&kvm->mem_attr_array);
 #endif
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	atomic_set(&kvm->gmem_active_invalidate_count, 0);
+#endif
 
 	INIT_LIST_HEAD(&kvm->gpc_list);
 	spin_lock_init(&kvm->gpc_lock);
@@ -2549,7 +2552,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	}
 
 	kvm->attribute_change_in_progress = true;
-	gfn_to_pfn_cache_invalidate_gfns_start(kvm, start, end);
+	gfn_to_pfn_cache_invalidate_gfns_start(kvm, start, end, false);
 
 	kvm_handle_gfn_range(kvm, &pre_set_range);
 
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 5a53d888e4b18..f4d0ced4a8f57 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -30,7 +30,8 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
 
 void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm,
 					    gfn_t start,
-					    gfn_t end);
+					    gfn_t end,
+					    bool needs_unmap);
 #else
 static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
 						     unsigned long start,
@@ -40,7 +41,8 @@ static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
 
 static inline void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm,
 							  gfn_t start,
-							  gfn_t end)
+							  gfn_t end,
+							  bool needs_unmap)
 {
 }
 #endif /* HAVE_KVM_PFNCACHE */
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index a4f935e80f545..828ba8ad8f20d 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -61,8 +61,15 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
 /*
  * Identical to `gfn_to_pfn_cache_invalidate_start`, except based on gfns
  * instead of uhvas.
+ *
+ * needs_unmap indicates whether this invalidation is because a gmem range went
+ * away (fallocate(FALLOC_FL_PUNCH_HOLE), error_remove_folio), in which case
+ * we must not call kvm_gmem_put_shared_pfn for it, or because of a memory
+ * attribute change, in which case the gmem pfn still exists, but simply
+ * is no longer mapped into the guest.
  */
-void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm, gfn_t start, gfn_t end)
+void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm, gfn_t start, gfn_t end,
+					    bool needs_unmap)
 {
 	struct gfn_to_pfn_cache *gpc;
 
@@ -78,14 +85,16 @@ void gfn_to_pfn_cache_invalidate_gfns_start(struct kvm *kvm, gfn_t start, gfn_t
 			continue;
 		}
 
-		if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
+		if (!is_error_noslot_pfn(gpc->pfn) &&
 		    gpa_to_gfn(gpc->gpa) >= start && gpa_to_gfn(gpc->gpa) < end) {
 			read_unlock_irq(&gpc->lock);
 
 			write_lock_irq(&gpc->lock);
-			if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
-			    gpa_to_gfn(gpc->gpa) >= start && gpa_to_gfn(gpc->gpa) < end)
+			if (!is_error_noslot_pfn(gpc->pfn) &&
+			    gpa_to_gfn(gpc->gpa) >= start && gpa_to_gfn(gpc->gpa) < end) {
 				gpc->valid = false;
+				gpc->needs_unmap = needs_unmap && gpc->private;
+			}
 			write_unlock_irq(&gpc->lock);
 			continue;
 		}
@@ -194,6 +203,9 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s
 	 */
 	if (kvm->attribute_change_in_progress)
 		return true;
+
+	if (atomic_read_acquire(&kvm->gmem_active_invalidate_count))
+		return true;
 	/*
 	 * Ensure mn_active_invalidate_count is read before
 	 * mmu_invalidate_seq.  This pairs with the smp_wmb() in
@@ -425,20 +437,28 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 	 * Some/all of the uhva, gpa, and memslot generation info may still be
 	 * valid, leave it as is.
 	 */
+	unmap_old = gpc->needs_unmap;
 	if (ret) {
 		gpc->valid = false;
 		gpc->pfn = KVM_PFN_ERR_FAULT;
 		gpc->khva = NULL;
+		gpc->needs_unmap = false;
+	} else {
+		gpc->needs_unmap = true;
 	}
 
 	/* Detect a pfn change before dropping the lock! */
-	unmap_old = (old_pfn != gpc->pfn);
+	unmap_old &= (old_pfn != gpc->pfn);
 
 out_unlock:
+	if (unmap_old)
+		folio_get(pfn_folio(old_pfn));
 	write_unlock_irq(&gpc->lock);
 
-	if (unmap_old)
+	if (unmap_old) {
 		gpc_unmap(old_pfn, old_khva, old_private);
+		folio_put(pfn_folio(old_pfn));
+	}
 
 	return ret;
 }
@@ -530,6 +550,7 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
 	kvm_pfn_t old_pfn;
 	void *old_khva;
 	bool old_private;
+	bool old_needs_unmap;
 
 	guard(mutex)(&gpc->refresh_lock);
 
@@ -555,14 +576,25 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
 		old_private = gpc->private;
 		gpc->private = false;
 
+		old_needs_unmap = gpc->needs_unmap;
+		gpc->needs_unmap = false;
+
 		old_pfn = gpc->pfn;
 		gpc->pfn = KVM_PFN_ERR_FAULT;
+
+		if (old_needs_unmap && old_private)
+			folio_get(pfn_folio(old_pfn));
+
 		write_unlock_irq(&gpc->lock);
 
 		spin_lock(&kvm->gpc_lock);
 		list_del(&gpc->list);
 		spin_unlock(&kvm->gpc_lock);
 
-		gpc_unmap(old_pfn, old_khva, old_private);
+		if (old_needs_unmap) {
+			gpc_unmap(old_pfn, old_khva, old_private);
+			if (old_private)
+				folio_put(pfn_folio(old_pfn));
+		}
 	}
 }
-- 
2.46.0



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH v2 10/10] kvm: x86: support walking guest page tables in gmem
  2024-09-10 16:30 [RFC PATCH v2 00/10] Unmapping guest_memfd from Direct Map Patrick Roy
                   ` (8 preceding siblings ...)
  2024-09-10 16:30 ` [RFC PATCH v2 09/10] kvm: pfncache: hook up to gmem invalidation Patrick Roy
@ 2024-09-10 16:30 ` Patrick Roy
  9 siblings, 0 replies; 13+ messages in thread
From: Patrick Roy @ 2024-09-10 16:30 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	rostedt, mhiramat, mathieu.desnoyers, kvm, linux-kernel,
	linux-trace-kernel, quic_eberman, dwmw, david, tabba, rppt,
	linux-mm, dmatlack
  Cc: Patrick Roy, graf, jgowans, derekmn, kalyazin, xmarcalx

Update the logic in paging_tmpl.h to work with guest-private memory. If
KVM cannot access gmem and the guest's page tables live in gfns marked
as private, then error out.

Let the guest page table walker access gmem by making it use
gfn_to_pfn_caches, which are already gmem-aware and also handle
on-demand mapping of gmem if KVM_GMEM_NO_DIRECT_MAP is set. We reuse the
gfn_to_pfn_cache here to avoid implementing yet another remapping
solution to support the cmpxchg used to set the "accessed" bit on guest
PTEs. The only case that now needs special handling is page tables in
read-only memslots, as gfn_to_pfn_caches cannot be used for read-only
memory. In this case, use kvm_vcpu_read_guest (which is also
gmem-aware), since there is no need to cache the gfn->pfn translation
here: the walker never does a cmpxchg on such a PTE, because it does not
set the accessed bit for read-only PTEs.

gfn_to_pfn_caches are hooked up to the MMU notifiers, meaning that if
something about guest memory changes between the page table walk and
setting the dirty bits (for example a concurrent fallocate on gmem), the
gfn_to_pfn_caches will have been invalidated and the entire page table
walk is retried.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 arch/x86/kvm/mmu/paging_tmpl.h | 95 ++++++++++++++++++++++++++++------
 1 file changed, 78 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 69941cebb3a87..d96fa423bed05 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -84,7 +84,7 @@ struct guest_walker {
 	pt_element_t ptes[PT_MAX_FULL_LEVELS];
 	pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
 	gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
-	pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
+	struct gfn_to_pfn_cache ptep_caches[PT_MAX_FULL_LEVELS];
 	bool pte_writable[PT_MAX_FULL_LEVELS];
 	unsigned int pt_access[PT_MAX_FULL_LEVELS];
 	unsigned int pte_access;
@@ -201,7 +201,7 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 {
 	unsigned level, index;
 	pt_element_t pte, orig_pte;
-	pt_element_t __user *ptep_user;
+	struct gfn_to_pfn_cache *pte_cache;
 	gfn_t table_gfn;
 	int ret;
 
@@ -210,10 +210,12 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 		return 0;
 
 	for (level = walker->max_level; level >= walker->level; --level) {
+		unsigned long flags;
+
 		pte = orig_pte = walker->ptes[level - 1];
 		table_gfn = walker->table_gfn[level - 1];
-		ptep_user = walker->ptep_user[level - 1];
-		index = offset_in_page(ptep_user) / sizeof(pt_element_t);
+		pte_cache = &walker->ptep_caches[level - 1];
+		index = offset_in_page(pte_cache->khva) / sizeof(pt_element_t);
 		if (!(pte & PT_GUEST_ACCESSED_MASK)) {
 			trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte));
 			pte |= PT_GUEST_ACCESSED_MASK;
@@ -246,11 +248,26 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 		if (unlikely(!walker->pte_writable[level - 1]))
 			continue;
 
-		ret = __try_cmpxchg_user(ptep_user, &orig_pte, pte, fault);
+		read_lock_irqsave(&pte_cache->lock, flags);
+		if (!kvm_gpc_check(pte_cache, sizeof(pte))) {
+			read_unlock_irqrestore(&pte_cache->lock, flags);
+			/*
+			 * If the gpc got invalidated, then the page table
+			 * it contained probably changed, so we probably need
+			 * to redo the entire walk.
+			 */
+			return 1;
+		}
+		ret = __try_cmpxchg((pt_element_t *)pte_cache->khva, &orig_pte, pte, sizeof(pte));
+
+		if (!ret)
+			kvm_gpc_mark_dirty_in_slot(pte_cache);
+
+		read_unlock_irqrestore(&pte_cache->lock, flags);
+
 		if (ret)
 			return ret;
 
-		kvm_vcpu_mark_page_dirty(vcpu, table_gfn);
 		walker->ptes[level - 1] = pte;
 	}
 	return 0;
@@ -296,6 +313,13 @@ static inline bool FNAME(is_last_gpte)(struct kvm_mmu *mmu,
 
 	return gpte & PT_PAGE_SIZE_MASK;
 }
+
+static void FNAME(walk_deactivate_gpcs)(struct guest_walker *walker) {
+	for (unsigned int level = 0; level < PT_MAX_FULL_LEVELS; ++level)
+		if (walker->ptep_caches[level].active)
+			kvm_gpc_deactivate(&walker->ptep_caches[level]);
+}
+
 /*
  * Fetch a guest pte for a guest virtual address, or for an L2's GPA.
  */
@@ -305,7 +329,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 {
 	int ret;
 	pt_element_t pte;
-	pt_element_t __user *ptep_user;
 	gfn_t table_gfn;
 	u64 pt_access, pte_access;
 	unsigned index, accessed_dirty, pte_pkey;
@@ -320,8 +343,17 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	u16 errcode = 0;
 	gpa_t real_gpa;
 	gfn_t gfn;
+	struct gfn_to_pfn_cache *pte_cache;
 
 	trace_kvm_mmu_pagetable_walk(addr, access);
+
+	for (unsigned int level = 0; level < PT_MAX_FULL_LEVELS; ++level) {
+		pte_cache = &walker->ptep_caches[level];
+
+		memset(pte_cache, 0, sizeof(*pte_cache));
+		kvm_gpc_init(pte_cache, vcpu->kvm);
+	}
+
 retry_walk:
 	walker->level = mmu->cpu_role.base.level;
 	pte           = kvm_mmu_get_guest_pgd(vcpu, mmu);
@@ -362,11 +394,13 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 
 	do {
 		struct kvm_memory_slot *slot;
-		unsigned long host_addr;
+		unsigned long flags;
 
 		pt_access = pte_access;
 		--walker->level;
 
+		pte_cache = &walker->ptep_caches[walker->level - 1];
+
 		index = PT_INDEX(addr, walker->level);
 		table_gfn = gpte_to_gfn(pte);
 		offset    = index * sizeof(pt_element_t);
@@ -396,15 +430,36 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 		if (!kvm_is_visible_memslot(slot))
 			goto error;
 
-		host_addr = gfn_to_hva_memslot_prot(slot, gpa_to_gfn(real_gpa),
-					    &walker->pte_writable[walker->level - 1]);
-		if (unlikely(kvm_is_error_hva(host_addr)))
-			goto error;
+		/*
+		 * gfn_to_pfn_cache expects the memory to be writable. However,
+		 * if the memory is not writable, we do not need caching in the
+		 * first place, as we only need it to later potentially write
+		 * the access bit (which we cannot do anyway if the memory is
+		 * readonly).
+		 */
+		if (slot->flags & KVM_MEM_READONLY) {
+			if (kvm_vcpu_read_guest(vcpu, real_gpa + offset, &pte, sizeof(pte)))
+				goto error;
+		} else {
+			if (kvm_gpc_activate(pte_cache, real_gpa + offset,
+					     sizeof(pte)))
+				goto error;
 
-		ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
-		if (unlikely(__get_user(pte, ptep_user)))
-			goto error;
-		walker->ptep_user[walker->level - 1] = ptep_user;
+			read_lock_irqsave(&pte_cache->lock, flags);
+			while (!kvm_gpc_check(pte_cache, sizeof(pte))) {
+				read_unlock_irqrestore(&pte_cache->lock, flags);
+
+				if (kvm_gpc_refresh(pte_cache, sizeof(pte)))
+					goto error;
+
+				read_lock_irqsave(&pte_cache->lock, flags);
+			}
+
+			pte = *(pt_element_t *)pte_cache->khva;
+			read_unlock_irqrestore(&pte_cache->lock, flags);
+
+			walker->pte_writable[walker->level - 1] = true;
+		}
 
 		trace_kvm_mmu_paging_element(pte, walker->level);
 
@@ -467,13 +522,19 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 							addr, write_fault);
 		if (unlikely(ret < 0))
 			goto error;
-		else if (ret)
+		else if (ret) {
+			FNAME(walk_deactivate_gpcs)(walker);
 			goto retry_walk;
+		}
 	}
 
+	FNAME(walk_deactivate_gpcs)(walker);
+
 	return 1;
 
 error:
+	FNAME(walk_deactivate_gpcs)(walker);
+
 	errcode |= write_fault | user_fault;
 	if (fetch_fault && (is_efer_nx(mmu) || is_cr4_smep(mmu)))
 		errcode |= PFERR_FETCH_MASK;
-- 
2.46.0



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 01/10] kvm: gmem: Add option to remove gmem from direct map
  2024-09-10 16:30 ` [RFC PATCH v2 01/10] kvm: gmem: Add option to remove gmem from direct map Patrick Roy
@ 2024-09-18  5:48   ` Mike Rapoport
  0 siblings, 0 replies; 13+ messages in thread
From: Mike Rapoport @ 2024-09-18  5:48 UTC (permalink / raw)
  To: Patrick Roy
  Cc: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	rostedt, mhiramat, mathieu.desnoyers, kvm, linux-kernel,
	linux-trace-kernel, quic_eberman, dwmw, david, tabba, linux-mm,
	dmatlack, graf, jgowans, derekmn, kalyazin, xmarcalx

On Tue, Sep 10, 2024 at 05:30:27PM +0100, Patrick Roy wrote:
> Add a flag to the KVM_CREATE_GUEST_MEMFD ioctl that causes gmem pfns
> to be removed from the host kernel's direct map. Memory is removed
> immediately after allocation and preparation of gmem folios (after
> preparation, as the prepare callback might expect the direct map entry
> to be present). Direct map entries are restored before
> kvm_arch_gmem_invalidate is called (as ->invalidate_folio is called
> before ->free_folio), for the same reason.
> 
> Use the PG_private flag to indicate that a folio is part of gmem with
> direct map removal enabled. While in this patch, PG_private does have a
> meaning of "folio not in direct map", this will no longer be true in
> follow up patches. Gmem folios might get temporarily reinserted into the
> direct map, but the PG_private flag needs to remain set, as the folios
> will have private data that needs to be freed independently of direct
> map status. This is why kvm_gmem_folio_clear_private does not call
> folio_clear_private.
> 
> kvm_gmem_{set,clear}_folio_private must be called with the folio lock
> held.
> 
> To ensure that failures in kvm_gmem_{clear,set}_private do not cause
> system instability due to leaving holes in the direct map, try to always
> restore direct map entries on failure. Pages for which restoration of
> direct map entries fails are marked as HWPOISON, to prevent the
> kernel from ever touching them again.
> 
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> ---
>  include/uapi/linux/kvm.h |  2 +
>  virt/kvm/guest_memfd.c   | 96 +++++++++++++++++++++++++++++++++++++---
>  2 files changed, 91 insertions(+), 7 deletions(-)
> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 637efc0551453..81b0f4a236b8c 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1564,6 +1564,8 @@ struct kvm_create_guest_memfd {
>  	__u64 reserved[6];
>  };
>  
> +#define KVM_GMEM_NO_DIRECT_MAP			(1ULL << 0)
> +
>  #define KVM_PRE_FAULT_MEMORY	_IOWR(KVMIO, 0xd5, struct kvm_pre_fault_memory)
>  
>  struct kvm_pre_fault_memory {
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 1c509c3512614..2ed27992206f3 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -4,6 +4,7 @@
>  #include <linux/kvm_host.h>
>  #include <linux/pagemap.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/set_memory.h>
>  
>  #include "kvm_mm.h"
>  
> @@ -49,8 +50,69 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
>  	return 0;
>  }
>  
> +static bool kvm_gmem_test_no_direct_map(struct inode *inode)
> +{
> +	return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) == KVM_GMEM_NO_DIRECT_MAP;
> +}
> +
> +static int kvm_gmem_folio_set_private(struct folio *folio)
> +{
> +	unsigned long start, npages, i;
> +	int r;
> +
> +	start = (unsigned long) folio_address(folio);
> +	npages = folio_nr_pages(folio);
> +
> +	for (i = 0; i < npages; ++i) {
> +		r = set_direct_map_invalid_noflush(folio_page(folio, i));
> +		if (r)
> +			goto out_remap;
> +	}

It feels like we need a new helper that takes care of contiguous pages.
arm64 already has set_memory_valid(), so it may be something like

	int set_direct_map_valid_noflush(struct page *p, unsigned nr, bool valid);
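
With that, the per-page loop above and the unwind in out_remap below
could collapse to something like this (untested, and assuming the helper
itself restores any already-changed pages when it fails partway):

	r = set_direct_map_valid_noflush(folio_page(folio, 0),
					 folio_nr_pages(folio), false);
	if (r)
		return r;

	flush_tlb_kernel_range(start, start + folio_size(folio));
	folio_set_private(folio);
	return 0;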

> +	flush_tlb_kernel_range(start, start + folio_size(folio));
> +	folio_set_private(folio);
> +	return 0;
> +out_remap:
> +	for (; i > 0; i--) {
> +		struct page *page = folio_page(folio, i - 1);
> +
> +		if (WARN_ON_ONCE(set_direct_map_default_noflush(page))) {
> +			/*
> +			 * Random holes in the direct map are bad, let's mark
> +			 * these pages as corrupted memory so that the kernel
> +			 * avoids ever touching them again.
> +			 */
> +			folio_set_hwpoison(folio);
> +			r = -EHWPOISON;
> +		}
> +	}
> +	return r;
> +}
> +

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 06/10] kvm: gmem: add tracepoints for gmem share/unshare
  2024-09-10 16:30 ` [RFC PATCH v2 06/10] kvm: gmem: add tracepoints for gmem share/unshare Patrick Roy
@ 2024-10-04 22:50   ` Steven Rostedt
  0 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2024-10-04 22:50 UTC (permalink / raw)
  To: Patrick Roy
  Cc: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	mhiramat, mathieu.desnoyers, kvm, linux-kernel,
	linux-trace-kernel, quic_eberman, dwmw, david, tabba, rppt,
	linux-mm, dmatlack, graf, jgowans, derekmn, kalyazin, xmarcalx

On Tue, 10 Sep 2024 17:30:32 +0100
Patrick Roy <roypat@amazon.co.uk> wrote:

> Add tracepoints for calls to kvm_gmem_get_folio that cause the returned
> folio to be considered "shared" (i.e. accessible by host KVM), and a
> tracepoint for when KVM is done accessing a gmem pfn
> (kvm_gmem_put_shared_pfn).
> 
> The above operations can cause folios to be inserted into/removed from
> the direct map. We want to be able to make sure that only those gmem folios
> that we expect KVM to access are ever reinserted into the direct map,
> and that all folios that are temporarily reinserted are also removed
> again at a later point. Processing ftrace output is one way to verify
> this.
> 
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> ---
>  include/trace/events/kvm.h | 43 ++++++++++++++++++++++++++++++++++++++
>  virt/kvm/guest_memfd.c     |  7 ++++++-
>  2 files changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
> index 74e40d5d4af42..4a40fd4c22f91 100644
> --- a/include/trace/events/kvm.h
> +++ b/include/trace/events/kvm.h
> @@ -489,6 +489,49 @@ TRACE_EVENT(kvm_test_age_hva,
>  	TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
>  );
>  
> +#ifdef CONFIG_KVM_PRIVATE_MEM
> +TRACE_EVENT(kvm_gmem_share,
> +	TP_PROTO(struct folio *folio, pgoff_t index),
> +	TP_ARGS(folio, index),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned int, sharing_count)
> +		__field(kvm_pfn_t, pfn)
> +		__field(pgoff_t, index)
> +		__field(unsigned long,  npages)

Looking at the TP_printk() format below, the pfn is 8 bytes and
sharing_count is 4. This will likely create a hole between the two fields
for alignment reasons. Should put the sharing_count at the end.
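
I.e., something like this (untested, just reordering the fields that are
already there):

	TP_STRUCT__entry(
		__field(kvm_pfn_t, pfn)
		__field(pgoff_t, index)
		__field(unsigned long, npages)
		__field(unsigned int, sharing_count)
	),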

> +	),
> +
> +	TP_fast_assign(
> +		__entry->sharing_count = refcount_read(folio_get_private(folio));
> +		__entry->pfn = folio_pfn(folio);
> +		__entry->index = index;
> +		__entry->npages = folio_nr_pages(folio);
> +	),
> +
> +	TP_printk("pfn=0x%llx index=%lu pages=%lu (refcount now %d)",
> +	          __entry->pfn, __entry->index, __entry->npages, __entry->sharing_count - 1)
> +);
> +
> +TRACE_EVENT(kvm_gmem_unshare,
> +	TP_PROTO(kvm_pfn_t pfn),
> +	TP_ARGS(pfn),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned int, sharing_count)
> +		__field(kvm_pfn_t, pfn)

Same here. It should swap the two fields. Note, if you already added this,
it will not break backward compatibility swapping them, as tooling should
use the format files that state where these fields are located in the raw
data.

-- Steve


> +	),
> +
> +	TP_fast_assign(
> +		__entry->sharing_count = refcount_read(folio_get_private(pfn_folio(pfn)));
> +		__entry->pfn = pfn;
> +	),
> +
> +	TP_printk("pfn=0x%llx (refcount now %d)",
> +	          __entry->pfn, __entry->sharing_count - 1)
> +)
> +
> +#endif
> +
>  #endif /* _TRACE_KVM_MAIN_H */
>  
>  /* This part must be outside protection */
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 6772253497e4d..742eba36d2371 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -7,6 +7,7 @@
>  #include <linux/set_memory.h>
>  
>  #include "kvm_mm.h"
> +#include "trace/events/kvm.h"
>  
>  struct kvm_gmem {
>  	struct kvm *kvm;
> @@ -204,8 +205,10 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, unsi
>  	if (r)
>  		goto out_err;
>  
> -	if (share)
> +	if (share) {
>  		refcount_inc(folio_get_private(folio));
> +		trace_kvm_gmem_share(folio, index);
> +	}
>  
>  out:
>  	/*
> @@ -759,6 +762,8 @@ int kvm_gmem_put_shared_pfn(kvm_pfn_t pfn) {
>  	if (refcount_read(sharing_count) == 1)
>  		r = kvm_gmem_folio_set_private(folio);
>  
> +	trace_kvm_gmem_unshare(pfn);
> +
>  	return r;
>  }
>  EXPORT_SYMBOL_GPL(kvm_gmem_put_shared_pfn);



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-10-04 22:49 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-10 16:30 [RFC PATCH v2 00/10] Unmapping guest_memfd from Direct Map Patrick Roy
2024-09-10 16:30 ` [RFC PATCH v2 01/10] kvm: gmem: Add option to remove gmem from direct map Patrick Roy
2024-09-18  5:48   ` Mike Rapoport
2024-09-10 16:30 ` [RFC PATCH v2 02/10] kvm: gmem: Add KVM_GMEM_GET_PFN_SHARED Patrick Roy
2024-09-10 16:30 ` [RFC PATCH v2 03/10] kvm: gmem: Add KVM_GMEM_GET_PFN_LOCKED Patrick Roy
2024-09-10 16:30 ` [RFC PATCH v2 04/10] kvm: Allow reading/writing gmem using kvm_{read,write}_guest Patrick Roy
2024-09-10 16:30 ` [RFC PATCH v2 05/10] kvm: gmem: Refcount internal accesses to gmem Patrick Roy
2024-09-10 16:30 ` [RFC PATCH v2 06/10] kvm: gmem: add tracepoints for gmem share/unshare Patrick Roy
2024-10-04 22:50   ` Steven Rostedt
2024-09-10 16:30 ` [RFC PATCH v2 07/10] kvm: pfncache: invalidate when memory attributes change Patrick Roy
2024-09-10 16:30 ` [RFC PATCH v2 08/10] kvm: pfncache: Support caching gmem pfns Patrick Roy
2024-09-10 16:30 ` [RFC PATCH v2 09/10] kvm: pfncache: hook up to gmem invalidation Patrick Roy
2024-09-10 16:30 ` [RFC PATCH v2 10/10] kvm: x86: support walking guest page tables in gmem Patrick Roy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox