* [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map
@ 2024-07-09 13:20 Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 1/8] kvm: Allow reading/writing gmem using kvm_{read,write}_guest Patrick Roy
` (9 more replies)
0 siblings, 10 replies; 34+ messages in thread
From: Patrick Roy @ 2024-07-09 13:20 UTC (permalink / raw)
To: seanjc, pbonzini, akpm, dwmw, rppt, david
Cc: Patrick Roy, tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf,
derekmn, kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
Hey all,
This RFC series is a rough draft adding support for running
non-confidential compute VMs in guest_memfd, based on prior discussions
with Sean [1]. Our specific use case for this is the ability to unmap
guest memory from the host kernel's direct map, as a mitigation against
a large class of speculative execution issues.
=== Implementation ===
This patch series introduces a new flag for the `KVM_CREATE_GUEST_MEMFD`
ioctl that causes guest_memfd pages to be removed from the direct map when
they are allocated. When trying to run a guest backed by such a
guest_memfd, we now face the problem that without either userspace or
kernelspace mappings of guest_memfd, KVM cannot access guest memory to,
for example, do MMIO emulation or access memory used for guest/host
communication. We have multiple options for solving this when running
non-CoCo VMs: (1) implement a TDX-light
solving this when running non-CoCo VMs: (1) implement a TDX-light
solution, where the guest shares memory that KVM needs to access, and
relies on paravirtual solutions where this is not possible (e.g. MMIO),
(2) have KVM use userspace mappings of guest_memfd (e.g. a
memfd_secret-style solution), or (3) dynamically reinsert pages into the
direct map whenever KVM wants to access them.
This RFC goes for option (3). Option (1) is a lot of overhead for very
little gain, since we are not actually constrained by a physical
inability to access guest memory (e.g. we are not in a TDX context where
accesses to guest memory cause a #MC). Option (2) has previously been
rejected [1].
In this patch series, we make sufficient parts of KVM gmem-aware to be
able to boot a Linux initrd from private memory on x86. These include
KVM's MMIO emulation (including guest page table walking) and kvm-clock.
For VM types which do not allow accessing gmem, we return -EFAULT and,
where possible, prepare a KVM_EXIT_MEMORY_FAULT.
Additionally, this patch series adds support for "restricted" userspace
mappings of guest_memfd, which work similarly to memfd_secret (e.g.
get_user_pages is disallowed) and allow handling I/O and loading the
guest kernel in a simple way. Support for this is completely independent
of the rest of the functionality introduced in this patch series.
However, it is required to build a minimal hypervisor PoC that actually
allows booting a VM from a disk.
=== Performance ===
We have run some preliminary performance benchmarks to assess the impact
of on-the-fly direct map manipulations. We were mainly interested in the
impact of manipulating the direct map for MMIO emulation on virtio-mmio.
In particular, we were worried about the impact of the TLB and L1/2/3
cache flushes that set_memory_[n]p entails.
In our setup, we have taken a modified Firecracker VMM, spawned a Linux
guest with 1 vCPU, and used fio to stress a virtio_blk device. We found
that the cache flushes caused throughput to drop from around 600MB/s to
~50MB/s (a ~90% drop) for both reads and writes (on an Intel(R) Xeon(R) Platinum
8375C CPU with 64 cores). We then converted our prototype to use
set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p and
found that without cache flushes the pure impact of the direct map
manipulation is indistinguishable from noise. This is why we use
set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p in
this RFC.
Note that in this comparison, both the baseline and the
guest_memfd-supporting versions of Firecracker were made to bounce I/O
buffers in VMM userspace. As GUP is disabled for the guest_memfd VMAs,
the virtio stack cannot directly pass guest buffers to read/write
syscalls.
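For illustration, the bouncing looks roughly like the sketch below (the
function and variable names are made up; guest_mem stands for the
restricted userspace mapping of the guest_memfd from patch 8):

  #include <stdint.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Read one block from the disk image into guest memory via a bounce
   * buffer. The guest_memfd mapping cannot be handed to GUP-based I/O
   * paths directly, so the disk read targets regular VMM memory first. */
  static int bounce_block_read(int disk_fd, off_t disk_off,
                               uint8_t *guest_mem, size_t guest_off,
                               size_t len)
  {
      uint8_t bounce[4096];

      if (len > sizeof(bounce))
          return -1;
      if (pread(disk_fd, bounce, len, disk_off) != (ssize_t)len)
          return -1;
      memcpy(guest_mem + guest_off, bounce, len);
      return 0;
  }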
=== Security ===
We want to use unmapping guest memory from the host kernel as a security
mitigation against transient execution attacks. Temporarily restoring
direct map entries whenever KVM requires access to guest memory leaves a
gap in this mitigation. We believe this to be acceptable for the above
cases, since pages used for paravirtual guest/host communication (e.g.
kvm-clock) and guest page tables do not contain sensitive data. MMIO
emulation will only end up reading pages containing privileged
instructions (e.g. guest kernel code).
=== Summary ===
Patches 1-4 adapt the various places inside of KVM that
access guest memory to correctly handle the case where memory happens to
be guest-private. This means either handling the access as a memory
error, or simply accessing the memslot's guest_memfd instead of looking
at the userspace-provided VMA, if the VM type allows this kind of
access. Patches 5-6 add a flag to KVM_CREATE_GUEST_MEMFD that will
make it remove its pages from the kernel's direct map. Whenever KVM
wants to access guest-private memory, it will temporarily re-insert the
relevant pages. Patches 7-8 allow for restricted userspace mappings
(e.g. get_user_pages paths are disabled like for memfd_secret) of
guest_memfd, so that userspace has an easy path for loading the guest
kernel and handling I/O-buffers.
=== ToDos / Limitations ===
There are still a few rough edges that need to be addressed before
dropping the "RFC" tag, e.g.
* Handle errors of set_direct_map_default_noflush in
kvm_gmem_invalidate_folio instead of calling BUG_ON
* Lift the limitation of "at most one gfn_to_pfn_cache for each
gfn/pfn" in e1c61f0a7963 ("kvm: gmem: Temporarily restore direct map
entries when needed"). It currently means that guests with more than 1
vcpu fail to boot, because multiple vcpus can put their kvm-clock PV
structures into the same page (gfn)
* Write selftests, particularly around hole punching, direct map removal,
and mmap.
Lastly, there's the question of nested virtualization which Sean brought
up in previous discussions, which runs into similar problems as MMIO. I
have looked at it very briefly. On Intel, KVM uses various gfn->uhva
caches, which run into similar problems as the gfn_to_hva_caches dealt
with in 200834b15dda ("kvm: use slowpath in gfn_to_hva_cache if memory
is private"). However, previous attempts at just converting this to
gfn_to_pfn_cache (which would make them work with guest_memfd) proved
complicated [2]. I suppose initially, we should probably disallow nested
virtualization in VMs that have their memory removed from the direct
map.
Best,
Patrick
[1]: https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com/
[2]: https://lore.kernel.org/kvm/ZBEEQtmtNPaEqU1i@google.com/
Patrick Roy (8):
kvm: Allow reading/writing gmem using kvm_{read,write}_guest
kvm: use slowpath in gfn_to_hva_cache if memory is private
kvm: pfncache: enlighten about gmem
kvm: x86: support walking guest page tables in gmem
kvm: gmem: add option to remove guest private memory from direct map
kvm: gmem: Temporarily restore direct map entries when needed
mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
kvm: gmem: Allow restricted userspace mappings
arch/x86/kvm/mmu/paging_tmpl.h | 94 +++++++++++++++++++-----
include/linux/kvm_host.h | 5 ++
include/linux/kvm_types.h | 1 +
include/linux/secretmem.h | 13 +++-
include/uapi/linux/kvm.h | 2 +
mm/secretmem.c | 6 +-
virt/kvm/guest_memfd.c | 83 +++++++++++++++++++--
virt/kvm/kvm_main.c | 112 +++++++++++++++++++++++++++-
virt/kvm/pfncache.c | 130 +++++++++++++++++++++++++++++----
9 files changed, 399 insertions(+), 47 deletions(-)
base-commit: 890a64810d59b1a58ed26efc28cfd821fc068e84
--
2.45.2
* [RFC PATCH 1/8] kvm: Allow reading/writing gmem using kvm_{read,write}_guest
2024-07-09 13:20 [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
@ 2024-07-09 13:20 ` Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 2/8] kvm: use slowpath in gfn_to_hva_cache if memory is private Patrick Roy
` (8 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Patrick Roy @ 2024-07-09 13:20 UTC (permalink / raw)
To: seanjc, pbonzini, akpm, dwmw, rppt, david
Cc: Patrick Roy, tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf,
derekmn, kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
If KVM can access guest-private memory without causing a host-kernel
panic (e.g. currently only if the VM type is KVM_X86_SW_PROTECTED_VM), allow
`kvm_{read,write}_guest` to access gfns that are set to "private". If
KVM cannot access guest-private memory (say, because it is running a TDX
VM), prepare a KVM_EXIT_MEMORY_FAULT (if possible) and return -EFAULT.
KVM can only prepare the memory fault exit inside the
`kvm_vcpu_{read,write}_guest` variant, as it needs a vcpu reference to
assign the exit reason to.
KVM accesses guest-private memory via kernel virtual addresses/the
direct map. In the special case of guest_memfd, it does not have to
worry about gfn->pfn mappings being invalidated, since guest_memfd pages
are immovable.
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
include/linux/kvm_host.h | 5 +++
virt/kvm/kvm_main.c | 85 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 90 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2a6679b46427..8f980aafd5ca 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2407,6 +2407,11 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
}
+static inline bool kvm_can_access_gmem(struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM;
+}
+
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
{
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8c7cbc9ec9ee..b3b3de70a4df 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3286,11 +3286,51 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
return 0;
}
+static int __kvm_read_guest_private_page(struct kvm *kvm,
+ struct kvm_memory_slot *memslot, gfn_t gfn,
+ void *data, int offset, int len)
+{
+ kvm_pfn_t pfn;
+ int r;
+ struct page *page;
+ void *kaddr;
+
+ if (!kvm_can_access_gmem(kvm))
+ return -EFAULT;
+
+ r = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, NULL);
+
+ if (r < 0)
+ return -EFAULT;
+
+ page = pfn_to_page(pfn);
+ lock_page(page);
+ kaddr = page_address(page) + offset;
+ memcpy(data, kaddr, len);
+ unlock_page(page);
+ put_page(page);
+ return 0;
+}
+
+static int __kvm_vcpu_read_guest_private_page(struct kvm_vcpu *vcpu,
+ struct kvm_memory_slot *memslot, gfn_t gfn,
+ void *data, int offset, int len)
+{
+ if (!kvm_can_access_gmem(vcpu->kvm)) {
+ kvm_prepare_memory_fault_exit(vcpu, gfn_to_gpa(gfn) + offset, len, false,
+ false, true);
+ return -EFAULT;
+ }
+ return __kvm_read_guest_private_page(vcpu->kvm, memslot, gfn, data, offset, len);
+}
+
int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
int len)
{
struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
+ if (kvm_mem_is_private(kvm, gfn))
+ return __kvm_read_guest_private_page(kvm, slot, gfn, data, offset, len);
return __kvm_read_guest_page(slot, gfn, data, offset, len);
}
EXPORT_SYMBOL_GPL(kvm_read_guest_page);
@@ -3300,6 +3340,8 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
{
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+ if (kvm_mem_is_private(vcpu->kvm, gfn))
+ return __kvm_vcpu_read_guest_private_page(vcpu, slot, gfn, data, offset, len);
return __kvm_read_guest_page(slot, gfn, data, offset, len);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
@@ -3390,11 +3432,52 @@ static int __kvm_write_guest_page(struct kvm *kvm,
return 0;
}
+static int __kvm_write_guest_private_page(struct kvm *kvm,
+ struct kvm_memory_slot *memslot, gfn_t gfn,
+ const void *data, int offset, int len)
+{
+ kvm_pfn_t pfn;
+ int r;
+ struct page *page;
+ void *kaddr;
+
+ if (!kvm_can_access_gmem(kvm))
+ return -EFAULT;
+
+ r = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, NULL);
+
+ if (r < 0)
+ return -EFAULT;
+
+ page = pfn_to_page(pfn);
+ lock_page(page);
+ kaddr = page_address(page) + offset;
+ memcpy(kaddr, data, len);
+ unlock_page(page);
+ put_page(page);
+
+ return 0;
+}
+
+static int __kvm_vcpu_write_guest_private_page(struct kvm_vcpu *vcpu,
+ struct kvm_memory_slot *memslot, gfn_t gfn,
+ const void *data, int offset, int len)
+{
+ if (!kvm_can_access_gmem(vcpu->kvm)) {
+ kvm_prepare_memory_fault_exit(vcpu, gfn_to_gpa(gfn) + offset, len, true,
+ false, true);
+ return -EFAULT;
+ }
+ return __kvm_write_guest_private_page(vcpu->kvm, memslot, gfn, data, offset, len);
+}
+
int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn,
const void *data, int offset, int len)
{
struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
+ if (kvm_mem_is_private(kvm, gfn))
+ return __kvm_write_guest_private_page(kvm, slot, gfn, data, offset, len);
return __kvm_write_guest_page(kvm, slot, gfn, data, offset, len);
}
EXPORT_SYMBOL_GPL(kvm_write_guest_page);
@@ -3404,6 +3487,8 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
{
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+ if (kvm_mem_is_private(vcpu->kvm, gfn))
+ return __kvm_vcpu_write_guest_private_page(vcpu, slot, gfn, data, offset, len);
return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
base-commit: 771df9ffadb8204e61d3e98f36c5067102aab78f
--
2.45.2
* [RFC PATCH 2/8] kvm: use slowpath in gfn_to_hva_cache if memory is private
2024-07-09 13:20 [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 1/8] kvm: Allow reading/writing gmem using kvm_{read,write}_guest Patrick Roy
@ 2024-07-09 13:20 ` Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 3/8] kvm: pfncache: enlighten about gmem Patrick Roy
` (7 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Patrick Roy @ 2024-07-09 13:20 UTC (permalink / raw)
To: seanjc, pbonzini, akpm, dwmw, rppt, david
Cc: Patrick Roy, tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf,
derekmn, kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
Currently, KVM uses gfn_to_hva_caches to cache gfn->memslot->userspace
host virtual address (uhva) translations. If a gfn is backed by
guest_memfd however, there is no uhva-equivalent item we could possibly
cache, since accesses go through a file descriptor instead of a VMA.
Thus, we effectively disable gfn_to_hva_caches in the case where gfns
are gmem-backed, and instead do a gfn->pfn translation on the fly by
calling `kvm_{read,write}_guest` inside `kvm_{read,write}_guest_cached`.
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
virt/kvm/kvm_main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b3b3de70a4df..4357f7cdf040 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3603,7 +3603,7 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
if (kvm_is_error_hva(ghc->hva))
return -EFAULT;
- if (unlikely(!ghc->memslot))
+ if (unlikely(!ghc->memslot || kvm_mem_is_private(kvm, gpa_to_gfn(gpa))))
return kvm_write_guest(kvm, gpa, data, len);
r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
@@ -3641,7 +3641,7 @@ int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
if (kvm_is_error_hva(ghc->hva))
return -EFAULT;
- if (unlikely(!ghc->memslot))
+ if (unlikely(!ghc->memslot || kvm_mem_is_private(kvm, gpa_to_gfn(gpa))))
return kvm_read_guest(kvm, gpa, data, len);
r = __copy_from_user(data, (void __user *)ghc->hva + offset, len);
--
2.45.2
* [RFC PATCH 3/8] kvm: pfncache: enlighten about gmem
2024-07-09 13:20 [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 1/8] kvm: Allow reading/writing gmem using kvm_{read,write}_guest Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 2/8] kvm: use slowpath in gfn_to_hva_cache if memory is private Patrick Roy
@ 2024-07-09 13:20 ` Patrick Roy
2024-07-09 14:36 ` David Woodhouse
2024-07-09 13:20 ` [RFC PATCH 4/8] kvm: x86: support walking guest page tables in gmem Patrick Roy
` (6 subsequent siblings)
9 siblings, 1 reply; 34+ messages in thread
From: Patrick Roy @ 2024-07-09 13:20 UTC (permalink / raw)
To: seanjc, pbonzini, akpm, dwmw, rppt, david
Cc: Patrick Roy, tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf,
derekmn, kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
KVM uses gfn_to_pfn_caches to cache translations from gfn all the way to
the pfn (for example, kvm-clock caches the page used for guest/host
communication this way). Unlike the gfn_to_hva_cache,
where no equivalent caching semantics were possible for gmem-backed gfns
(see also 200834b15dda ("kvm: use slowpath in gfn_to_hva_cache if memory
is private")), here it
is possible to simply cache the pfn returned by `kvm_gmem_get_pfn`.
Additionally, gfn_to_pfn_caches now invalidate whenever a cached gfn's
attributes are flipped from shared to private (or vice-versa).
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
include/linux/kvm_types.h | 1 +
virt/kvm/pfncache.c | 41 +++++++++++++++++++++++++++++++++------
2 files changed, 36 insertions(+), 6 deletions(-)
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 827ecc0b7e10..8f85f01f6bb0 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -70,6 +70,7 @@ struct gfn_to_pfn_cache {
kvm_pfn_t pfn;
bool active;
bool valid;
+ bool is_private;
};
#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index f0039efb9e1e..6430e0a49558 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -90,6 +90,9 @@ bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len)
if (!kvm_gpc_is_valid_len(gpc->gpa, gpc->uhva, len))
return false;
+ if (gpc->is_private != kvm_mem_is_private(gpc->kvm, gpa_to_gfn(gpc->gpa)))
+ return false;
+
if (!gpc->valid)
return false;
@@ -159,6 +162,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
void *new_khva = NULL;
unsigned long mmu_seq;
+ gfn_t gfn;
lockdep_assert_held(&gpc->refresh_lock);
@@ -173,6 +177,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
do {
mmu_seq = gpc->kvm->mmu_invalidate_seq;
+ gfn = gpa_to_gfn(gpc->gpa);
smp_rmb();
write_unlock_irq(&gpc->lock);
@@ -197,10 +202,19 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
cond_resched();
}
- /* We always request a writeable mapping */
- new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
- if (is_error_noslot_pfn(new_pfn))
- goto out_error;
+ if (gpc->is_private) {
+ int r = kvm_gmem_get_pfn(gpc->kvm, gfn_to_memslot(gpc->kvm, gfn), gfn,
+ &new_pfn, NULL);
+
+ if (r)
+ goto out_error;
+ } else {
+ /* We always request a writeable mapping */
+ new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL,
+ true, NULL);
+ if (is_error_noslot_pfn(new_pfn))
+ goto out_error;
+ }
/*
* Obtain a new kernel mapping if KVM itself will access the
@@ -252,6 +266,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
unsigned long old_uhva;
kvm_pfn_t old_pfn;
bool hva_change = false;
+ bool old_private;
void *old_khva;
int ret;
@@ -271,8 +286,21 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
old_pfn = gpc->pfn;
old_khva = (void *)PAGE_ALIGN_DOWN((uintptr_t)gpc->khva);
old_uhva = PAGE_ALIGN_DOWN(gpc->uhva);
+ old_private = gpc->is_private;
+
+ gpc->is_private = kvm_mem_is_private(gpc->kvm, gpa_to_gfn(gpa));
+
+ if (gpc->is_private && !kvm_can_access_gmem(gpc->kvm)) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
if (kvm_is_error_gpa(gpa)) {
+ if (WARN_ON_ONCE(gpc->is_private)) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
page_offset = offset_in_page(uhva);
gpc->gpa = INVALID_GPA;
@@ -316,9 +344,10 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
/*
* If the userspace HVA changed or the PFN was already invalid,
- * drop the lock and do the HVA to PFN lookup again.
+ * drop the lock and do the HVA to PFN lookup again. Also
+ * recompute the pfn if the gfn changed from shared to private (or vice-versa).
*/
- if (!gpc->valid || hva_change) {
+ if (!gpc->valid || hva_change || gpc->is_private != old_private) {
ret = hva_to_pfn_retry(gpc);
} else {
/*
--
2.45.2
* [RFC PATCH 4/8] kvm: x86: support walking guest page tables in gmem
2024-07-09 13:20 [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
` (2 preceding siblings ...)
2024-07-09 13:20 ` [RFC PATCH 3/8] kvm: pfncache: enlighten about gmem Patrick Roy
@ 2024-07-09 13:20 ` Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 5/8] kvm: gmem: add option to remove guest private memory from direct map Patrick Roy
` (5 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Patrick Roy @ 2024-07-09 13:20 UTC (permalink / raw)
To: seanjc, pbonzini, akpm, dwmw, rppt, david
Cc: Patrick Roy, tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf,
derekmn, kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
Update the logic in paging_tmpl.h to work with guest-private memory. If
KVM cannot access gmem and the guest's page tables are in gfns marked as
private, then error out.
Let the guest page table walker access gmem by making it use
gfn_to_pfn_caches, which are already gmem aware, and will later also
handle on-demand mapping of gmem once it supports being removed from the
direct map. We re-use the gfn_to_pfn_cache here to avoid implementing
yet another remapping solution to support the cmpxchg used to set the
"accessed" bit on guest PTEs. The only case that now needs some special
handling is page tables in read-only memslots, as gfn_to_pfn_caches
cannot be used for read-only memory. In this case, use
kvm_vcpu_read_guest (which is also gmem-aware), as there is no need to
cache the gfn->pfn translation here (there is no need to do a
cmpxchg on the PTE, since the walker does not set the accessed bit for
read-only PTEs).
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
arch/x86/kvm/mmu/paging_tmpl.h | 94 ++++++++++++++++++++++++++++------
1 file changed, 77 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 69941cebb3a8..ddf3b4bd479e 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -84,7 +84,7 @@ struct guest_walker {
pt_element_t ptes[PT_MAX_FULL_LEVELS];
pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
- pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
+ struct gfn_to_pfn_cache ptep_caches[PT_MAX_FULL_LEVELS];
bool pte_writable[PT_MAX_FULL_LEVELS];
unsigned int pt_access[PT_MAX_FULL_LEVELS];
unsigned int pte_access;
@@ -201,7 +201,7 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
{
unsigned level, index;
pt_element_t pte, orig_pte;
- pt_element_t __user *ptep_user;
+ struct gfn_to_pfn_cache *pte_cache;
gfn_t table_gfn;
int ret;
@@ -210,10 +210,12 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
return 0;
for (level = walker->max_level; level >= walker->level; --level) {
+ unsigned long flags;
+
pte = orig_pte = walker->ptes[level - 1];
table_gfn = walker->table_gfn[level - 1];
- ptep_user = walker->ptep_user[level - 1];
- index = offset_in_page(ptep_user) / sizeof(pt_element_t);
+ pte_cache = &walker->ptep_caches[level - 1];
+ index = offset_in_page(pte_cache->khva) / sizeof(pt_element_t);
if (!(pte & PT_GUEST_ACCESSED_MASK)) {
trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte));
pte |= PT_GUEST_ACCESSED_MASK;
@@ -246,11 +248,26 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
if (unlikely(!walker->pte_writable[level - 1]))
continue;
- ret = __try_cmpxchg_user(ptep_user, &orig_pte, pte, fault);
+ read_lock_irqsave(&pte_cache->lock, flags);
+ while (!kvm_gpc_check(pte_cache, sizeof(pte))) {
+ read_unlock_irqrestore(&pte_cache->lock, flags);
+
+ ret = kvm_gpc_refresh(pte_cache, sizeof(pte));
+ if (ret)
+ return ret;
+
+ read_lock_irqsave(&pte_cache->lock, flags);
+ }
+ ret = __try_cmpxchg((pt_element_t *)pte_cache->khva, &orig_pte, pte, sizeof(pte));
+
+ if (!ret)
+ kvm_gpc_mark_dirty_in_slot(pte_cache);
+
+ read_unlock_irqrestore(&pte_cache->lock, flags);
+
if (ret)
return ret;
- kvm_vcpu_mark_page_dirty(vcpu, table_gfn);
walker->ptes[level - 1] = pte;
}
return 0;
@@ -296,6 +313,12 @@ static inline bool FNAME(is_last_gpte)(struct kvm_mmu *mmu,
return gpte & PT_PAGE_SIZE_MASK;
}
+
+static void FNAME(walk_deactivate_gpcs)(struct guest_walker *walker) {
+ for (unsigned int level = 0; level < PT_MAX_FULL_LEVELS; ++level)
+ kvm_gpc_deactivate(&walker->ptep_caches[level]);
+}
+
/*
* Fetch a guest pte for a guest virtual address, or for an L2's GPA.
*/
@@ -305,7 +328,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
{
int ret;
pt_element_t pte;
- pt_element_t __user *ptep_user;
gfn_t table_gfn;
u64 pt_access, pte_access;
unsigned index, accessed_dirty, pte_pkey;
@@ -320,8 +342,17 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
u16 errcode = 0;
gpa_t real_gpa;
gfn_t gfn;
+ struct gfn_to_pfn_cache *pte_cache;
trace_kvm_mmu_pagetable_walk(addr, access);
+
+ for (unsigned int level = 0; level < PT_MAX_FULL_LEVELS; ++level) {
+ pte_cache = &walker->ptep_caches[level];
+
+ memset(pte_cache, 0, sizeof(*pte_cache));
+ kvm_gpc_init(pte_cache, vcpu->kvm);
+ }
+
retry_walk:
walker->level = mmu->cpu_role.base.level;
pte = kvm_mmu_get_guest_pgd(vcpu, mmu);
@@ -362,11 +393,13 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
do {
struct kvm_memory_slot *slot;
- unsigned long host_addr;
+ unsigned long flags;
pt_access = pte_access;
--walker->level;
+ pte_cache = &walker->ptep_caches[walker->level - 1];
+
index = PT_INDEX(addr, walker->level);
table_gfn = gpte_to_gfn(pte);
offset = index * sizeof(pt_element_t);
@@ -396,15 +429,36 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
if (!kvm_is_visible_memslot(slot))
goto error;
- host_addr = gfn_to_hva_memslot_prot(slot, gpa_to_gfn(real_gpa),
- &walker->pte_writable[walker->level - 1]);
- if (unlikely(kvm_is_error_hva(host_addr)))
- goto error;
+ /*
+ * gfn_to_pfn_cache expects the memory to be writable. However,
+ * if the memory is not writable, we do not need caching in the
+ * first place, as we only need it to later potentially write
+ * the access bit (which we cannot do anyway if the memory is
+ * readonly).
+ */
+ if (slot->flags & KVM_MEM_READONLY) {
+ if (kvm_vcpu_read_guest(vcpu, real_gpa + offset, &pte, sizeof(pte)))
+ goto error;
+ } else {
+ if (kvm_gpc_activate(pte_cache, real_gpa + offset,
+ sizeof(pte)))
+ goto error;
- ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
- if (unlikely(__get_user(pte, ptep_user)))
- goto error;
- walker->ptep_user[walker->level - 1] = ptep_user;
+ read_lock_irqsave(&pte_cache->lock, flags);
+ while (!kvm_gpc_check(pte_cache, sizeof(pte))) {
+ read_unlock_irqrestore(&pte_cache->lock, flags);
+
+ if (kvm_gpc_refresh(pte_cache, sizeof(pte)))
+ goto error;
+
+ read_lock_irqsave(&pte_cache->lock, flags);
+ }
+
+ pte = *(pt_element_t *)pte_cache->khva;
+ read_unlock_irqrestore(&pte_cache->lock, flags);
+
+ walker->pte_writable[walker->level - 1] = true;
+ }
trace_kvm_mmu_paging_element(pte, walker->level);
@@ -467,13 +521,19 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
addr, write_fault);
if (unlikely(ret < 0))
goto error;
- else if (ret)
+ else if (ret) {
+ FNAME(walk_deactivate_gpcs)(walker);
goto retry_walk;
+ }
}
+ FNAME(walk_deactivate_gpcs)(walker);
+
return 1;
error:
+ FNAME(walk_deactivate_gpcs)(walker);
+
errcode |= write_fault | user_fault;
if (fetch_fault && (is_efer_nx(mmu) || is_cr4_smep(mmu)))
errcode |= PFERR_FETCH_MASK;
--
2.45.2
* [RFC PATCH 5/8] kvm: gmem: add option to remove guest private memory from direct map
2024-07-09 13:20 [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
` (3 preceding siblings ...)
2024-07-09 13:20 ` [RFC PATCH 4/8] kvm: x86: support walking guest page tables in gmem Patrick Roy
@ 2024-07-09 13:20 ` Patrick Roy
2024-07-10 7:31 ` Mike Rapoport
2024-07-09 13:20 ` [RFC PATCH 6/8] kvm: gmem: Temporarily restore direct map entries when needed Patrick Roy
` (4 subsequent siblings)
9 siblings, 1 reply; 34+ messages in thread
From: Patrick Roy @ 2024-07-09 13:20 UTC (permalink / raw)
To: seanjc, pbonzini, akpm, dwmw, rppt, david
Cc: Patrick Roy, tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf,
derekmn, kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
While guest_memfd is not available to be mapped by userspace, it is
still accessible through the kernel's direct map. This means that in
scenarios where guest-private memory is not hardware protected, it can
be speculatively read and its contents potentially leaked through
hardware side-channels. Removing guest-private memory from the direct
map thus mitigates a large class of speculative execution issues
[1, Table 1].
This patch adds a flag to the `KVM_CREATE_GUEST_MEMFD` ioctl which, if
set, removes the pages backing guest-private memory from the direct map.
Should
`CONFIG_HAVE_KVM_GMEM_{INVALIDATE, PREPARE}` be set, pages are removed
after preparation and before invalidation, so that the
prepare/invalidate routines do not have to worry about potentially
absent direct map entries.
Direct map removal does not reuse the `KVM_GMEM_PREPARE` machinery, since
`prepare` can be called multiple times, and it is the responsibility of
the preparation routine to not "prepare" the same folio twice [2]. Thus,
we instead explicitly check whether `filemap_grab_folio` allocated a new
folio, and remove the returned folio from the direct map only if this was
the case.
The patch uses release_folio instead of free_folio to reinsert pages
back into the direct map as by the time free_folio is called,
folio->mapping can already be NULL. This means that a call to
folio_inode inside free_folio might dereference a NULL pointer, leaving no
way to access the inode which stores the flags that allow determining
whether the page was removed from the direct map in the first place.
Lastly, the patch uses set_direct_map_{invalid,default}_noflush instead
of `set_memory_[n]p` to avoid expensive flushes of TLBs and the L*-cache
hierarchy. This is especially important once KVM restores direct map
entries on-demand in later patches, where simple FIO benchmarks of a
virtio-blk device have shown that TLB flushes on an Intel(R) Xeon(R)
Platinum 8375C CPU @ 2.90GHz resulted in an 80% degradation in throughput
compared to a non-flushing solution.
Not flushing the TLB means that until TLB entries for temporarily
restored direct map entries get naturally evicted, they can be used
during speculative execution, and effectively "unhide" the memory for
longer than intended. We consider this acceptable, as the only pages
that are temporarily reinserted into the direct map like this will
either hold PV data structures (kvm-clock, asyncpf, etc), or pages
containing privileged instructions inside the guest kernel image (in the
MMIO emulation case).
[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
include/uapi/linux/kvm.h | 2 ++
virt/kvm/guest_memfd.c | 52 ++++++++++++++++++++++++++++++++++------
2 files changed, 47 insertions(+), 7 deletions(-)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e065d9fe7ab2..409116aa23c9 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1563,4 +1563,6 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};
+#define KVM_GMEM_NO_DIRECT_MAP (1ULL << 0)
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 9148b9679bb1..dc9b0c2d0b0e 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
#include <linux/kvm_host.h>
#include <linux/pagemap.h>
#include <linux/anon_inodes.h>
+#include <linux/set_memory.h>
#include "kvm_mm.h"
@@ -49,9 +50,16 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
return 0;
}
+static bool kvm_gmem_not_present(struct inode *inode)
+{
+ return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) != 0;
+}
+
static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
{
struct folio *folio;
+ bool zap_direct_map = false;
+ int r;
/* TODO: Support huge pages. */
folio = filemap_grab_folio(inode->i_mapping, index);
@@ -74,16 +82,30 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
for (i = 0; i < nr_pages; i++)
clear_highpage(folio_page(folio, i));
+ // We need to clear the folio before calling kvm_gmem_prepare_folio,
+ // but can only remove it from the direct map _after_ preparation is done.
+ zap_direct_map = kvm_gmem_not_present(inode);
+
folio_mark_uptodate(folio);
}
if (prepare) {
- int r = kvm_gmem_prepare_folio(inode, index, folio);
- if (r < 0) {
- folio_unlock(folio);
- folio_put(folio);
- return ERR_PTR(r);
- }
+ r = kvm_gmem_prepare_folio(inode, index, folio);
+ if (r < 0)
+ goto out_err;
+ }
+
+ if (zap_direct_map) {
+ r = set_direct_map_invalid_noflush(&folio->page);
+ if (r < 0)
+ goto out_err;
+
+ // We use the private flag to track whether the folio has been removed
+ // from the direct map. This is because inside of ->free_folio,
+ // we do not have access to the address_space anymore, meaning we
+ // cannot check folio_inode(folio)->i_private to determine whether
+ // KVM_GMEM_NO_DIRECT_MAP was set.
+ folio_set_private(folio);
}
/*
@@ -91,6 +113,10 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
* unevictable and there is no storage to write back to.
*/
return folio;
+out_err:
+ folio_unlock(folio);
+ folio_put(folio);
+ return ERR_PTR(r);
}
static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
@@ -354,10 +380,22 @@ static void kvm_gmem_free_folio(struct folio *folio)
}
#endif
+static void kvm_gmem_invalidate_folio(struct folio *folio, size_t start, size_t end)
+{
+ if (start == 0 && end == PAGE_SIZE) {
+ // We only get here if PG_private is set, which only happens if kvm_gmem_not_present
+ // returned true in kvm_gmem_get_folio. Thus no need to do that check again.
+ BUG_ON(set_direct_map_default_noflush(&folio->page));
+
+ folio_clear_private(folio);
+ }
+}
+
static const struct address_space_operations kvm_gmem_aops = {
.dirty_folio = noop_dirty_folio,
.migrate_folio = kvm_gmem_migrate_folio,
.error_remove_folio = kvm_gmem_error_folio,
+ .invalidate_folio = kvm_gmem_invalidate_folio,
#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
.free_folio = kvm_gmem_free_folio,
#endif
@@ -443,7 +481,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
{
loff_t size = args->size;
u64 flags = args->flags;
- u64 valid_flags = 0;
+ u64 valid_flags = KVM_GMEM_NO_DIRECT_MAP;
if (flags & ~valid_flags)
return -EINVAL;
--
2.45.2
* [RFC PATCH 6/8] kvm: gmem: Temporarily restore direct map entries when needed
2024-07-09 13:20 [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
` (4 preceding siblings ...)
2024-07-09 13:20 ` [RFC PATCH 5/8] kvm: gmem: add option to remove guest private memory from direct map Patrick Roy
@ 2024-07-09 13:20 ` Patrick Roy
2024-07-11 6:25 ` Paolo Bonzini
2024-07-09 13:20 ` [RFC PATCH 7/8] mm: secretmem: use AS_INACCESSIBLE to prohibit GUP Patrick Roy
` (3 subsequent siblings)
9 siblings, 1 reply; 34+ messages in thread
From: Patrick Roy @ 2024-07-09 13:20 UTC (permalink / raw)
To: seanjc, pbonzini, akpm, dwmw, rppt, david
Cc: Patrick Roy, tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf,
derekmn, kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
If KVM_GMEM_NO_DIRECT_MAP is set, and KVM tries to internally access
guest-private memory inside kvm_{read,write}_guest, or via a
gfn_to_pfn_cache, temporarily restore the direct map entry.
To avoid race conditions between two threads restoring or zapping direct
map entries for the same page and potentially interfering with each
other (e.g. unfortunate interleavings of map->read->unmap in the form of
map(A)->map(B)->read(A)->unmap(A)->read(B) [BOOM]), the following
invariant is upheld in this patch:
- Only a single gfn_to_pfn_cache can exist for any given pfn, and
- All non-gfn_to_pfn_cache code paths that temporarily restore direct
map entries complete the entire map->access->unmap critical section
while holding the folio lock.
To remember whether a given folio currently has a direct map entry, use
the PG_private flag. If this flag is set, then the folio is removed from
the direct map, otherwise it is present in the direct map.
Modifications of this flag, together with the corresponding direct map
manipulations, must happen while holding the folio's lock.
A gfn_to_pfn_cache cannot hold the folio lock for its entire lifetime,
so it operates as follows: In gpc_map, under folio lock, restore the
direct map entry and set PG_private to 0. In gpc_unmap, zap the direct
map entry again and set PG_private back to 1.
If inside gpc_map the cache finds a folio that has PG_private set to 0,
it knows that another gfn_to_pfn_cache is currently active for the given
pfn (as this is the only scenario in which PG_private can be 0 without
the folio lock being held), and so it returns -EINVAL.
The only other interesting scenario is then if kvm_{read,write}_guest is
called for a gfn whose translation is currently cached inside a
gfn_to_pfn_cache. In this case, kvm_{read,write}_guest notices that
PG_private is 0 and skips all direct map manipulations. Since it is
holding the folio lock, it can be sure that gpc_unmap cannot
concurrently zap the direct map entries while kvm_{read,write}_guest
still needs them.
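Condensed to a sketch, the critical section that the
kvm_{read,write}_guest paths uphold (see the diff below for the real
code) is:

  folio_lock(folio);
  if (folio_test_private(folio))
      /* Currently zapped; restore the direct map entry while locked. */
      set_direct_map_default_noflush(&folio->page);
  memcpy(...);                  /* access through the direct map */
  if (folio_test_private(folio))
      /* Zap the entry again before dropping the lock. */
      set_direct_map_invalid_noflush(&folio->page);
  folio_unlock(folio);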
Note that this implementation is slightly too restrictive, as sometimes
multiple gfn_to_pfn_caches need to be active for the same gfn (for
example, each vCPU has its own kvm-clock structure, which they all try
to put into the same gfn).
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
virt/kvm/kvm_main.c | 59 +++++++++++++++++++++---------
virt/kvm/pfncache.c | 89 +++++++++++++++++++++++++++++++++++++++++----
2 files changed, 123 insertions(+), 25 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4357f7cdf040..f968f1f3d7f7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -52,6 +52,7 @@
#include <asm/processor.h>
#include <asm/ioctl.h>
+#include <asm/set_memory.h>
#include <linux/uaccess.h>
#include "coalesced_mmio.h"
@@ -3291,8 +3292,8 @@ static int __kvm_read_guest_private_page(struct kvm *kvm,
void *data, int offset, int len)
{
kvm_pfn_t pfn;
- int r;
- struct page *page;
+ int r = 0;
+ struct folio *folio;
void *kaddr;
if (!kvm_can_access_gmem(kvm))
@@ -3303,13 +3304,24 @@ static int __kvm_read_guest_private_page(struct kvm *kvm,
if (r < 0)
return -EFAULT;
- page = pfn_to_page(pfn);
- lock_page(page);
- kaddr = page_address(page) + offset;
- memcpy(data, kaddr, len);
- unlock_page(page);
- put_page(page);
- return 0;
+ folio = pfn_folio(pfn);
+ folio_lock(folio);
+ kaddr = folio_address(folio);
+ if (folio_test_private(folio)) {
+ r = set_direct_map_default_noflush(&folio->page);
+ if (r)
+ goto out_unlock;
+ }
+ memcpy(data, kaddr + offset, len);
+ if (folio_test_private(folio)) {
+ r = set_direct_map_invalid_noflush(&folio->page);
+ if (r)
+ goto out_unlock;
+ }
+out_unlock:
+ folio_unlock(folio);
+ folio_put(folio);
+ return r;
}
static int __kvm_vcpu_read_guest_private_page(struct kvm_vcpu *vcpu,
@@ -3437,8 +3449,8 @@ static int __kvm_write_guest_private_page(struct kvm *kvm,
const void *data, int offset, int len)
{
kvm_pfn_t pfn;
- int r;
- struct page *page;
+ int r = 0;
+ struct folio *folio;
void *kaddr;
if (!kvm_can_access_gmem(kvm))
@@ -3449,14 +3461,25 @@ static int __kvm_write_guest_private_page(struct kvm *kvm,
if (r < 0)
return -EFAULT;
- page = pfn_to_page(pfn);
- lock_page(page);
- kaddr = page_address(page) + offset;
- memcpy(kaddr, data, len);
- unlock_page(page);
- put_page(page);
+ folio = pfn_folio(pfn);
+ folio_lock(folio);
+ kaddr = folio_address(folio);
+ if (folio_test_private(folio)) {
+ r = set_direct_map_default_noflush(&folio->page);
+ if (r)
+ goto out_unlock;
+ }
+ memcpy(kaddr + offset, data, len);
+ if (folio_test_private(folio)) {
+ r = set_direct_map_invalid_noflush(&folio->page);
+ if (r)
+ goto out_unlock;
+ }
- return 0;
+out_unlock:
+ folio_unlock(folio);
+ folio_put(folio);
+ return r;
}
static int __kvm_vcpu_write_guest_private_page(struct kvm_vcpu *vcpu,
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 6430e0a49558..95d2d5cdaa12 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -16,6 +16,9 @@
#include <linux/highmem.h>
#include <linux/module.h>
#include <linux/errno.h>
+#include <linux/pagemap.h>
+
+#include <asm/set_memory.h>
#include "kvm_mm.h"
@@ -99,10 +102,68 @@ bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len)
return true;
}
-static void *gpc_map(kvm_pfn_t pfn)
+static int gpc_map_gmem(kvm_pfn_t pfn)
{
- if (pfn_valid(pfn))
+ int r = 0;
+ struct folio *folio = pfn_folio(pfn);
+ struct inode *inode = folio_inode(folio);
+
+ if (((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) == 0)
+ goto out;
+
+ /* We need to avoid race conditions where set_memory_np is called for
+ * pages that other parts of KVM still try to access. We use the
+ * PG_private bit for this. If it is set, then the page is removed from
+ * the direct map. If it is cleared, the page is present in the direct
+ * map. All changes to this bit, and all modifications of the direct
+ * map entries for the page happen under the page lock. The _only_
+ * place where a page will be in the direct map while the page lock is
+ * _not_ held is if it is inside a gpc. All other parts of KVM that
+ * temporarily re-insert gmem pages into the direct map (currently only
+ * guest_{read,write}_page) take the page lock before the direct map
+ * entry is restored, and hold it until it is zapped again. This means
+ * - If we reach gpc_map while, say, guest_read_page is operating on
+ * this page, we block on acquiring the page lock until
+ * guest_read_page is done.
+ * - If we reach gpc_map while another gpc is already caching this
+ * page, the page is present in the direct map and the PG_private
+ * flag is cleared. In this case, we return -EINVAL below to avoid
+ * two gpcs caching the same page (since we do not ref-count
+ * insertions back into the direct map, when the first cache gets
+ * invalidated it would "break" the second cache that assumes the
+ * page is present in the direct map until the second cache itself
+ * gets invalidated).
+ * - Lastly, if guest_read_page is called for a page inside of a gpc,
+ * it will see that the PG_private flag is cleared, and thus assume
+ * it is present in the direct map (and leave the direct map entry
+ * untouched). Since it will be holding the page lock, it cannot race
+ * with gpc_unmap.
+ */
+ folio_lock(folio);
+ if (folio_test_private(folio)) {
+ r = set_direct_map_default_noflush(&folio->page);
+ if (r)
+ goto out_unlock;
+
+ folio_clear_private(folio);
+ } else {
+ r = -EINVAL;
+ }
+out_unlock:
+ folio_unlock(folio);
+out:
+ return r;
+}
+
+static void *gpc_map(kvm_pfn_t pfn, bool private)
+{
+ if (pfn_valid(pfn)) {
+ if (private) {
+ if (gpc_map_gmem(pfn))
+ return NULL;
+ }
return kmap(pfn_to_page(pfn));
+ }
#ifdef CONFIG_HAS_IOMEM
return memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
@@ -111,13 +172,27 @@ static void *gpc_map(kvm_pfn_t pfn)
#endif
}
-static void gpc_unmap(kvm_pfn_t pfn, void *khva)
+static void gpc_unmap(kvm_pfn_t pfn, void *khva, bool private)
{
/* Unmap the old pfn/page if it was mapped before. */
if (is_error_noslot_pfn(pfn) || !khva)
return;
if (pfn_valid(pfn)) {
+ if (private) {
+ struct folio *folio = pfn_folio(pfn);
+ struct inode *inode = folio_inode(folio);
+
+ if ((unsigned long)inode->i_private &
+ KVM_GMEM_NO_DIRECT_MAP) {
+ folio_lock(folio);
+ BUG_ON(folio_test_private(folio));
+ BUG_ON(set_direct_map_invalid_noflush(
+ &folio->page));
+ folio_set_private(folio);
+ folio_unlock(folio);
+ }
+ }
kunmap(pfn_to_page(pfn));
return;
}
@@ -195,7 +270,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
* the existing mapping and didn't create a new one.
*/
if (new_khva != old_khva)
- gpc_unmap(new_pfn, new_khva);
+ gpc_unmap(new_pfn, new_khva, gpc->is_private);
kvm_release_pfn_clean(new_pfn);
@@ -224,7 +299,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
if (new_pfn == gpc->pfn)
new_khva = old_khva;
else
- new_khva = gpc_map(new_pfn);
+ new_khva = gpc_map(new_pfn, gpc->is_private);
if (!new_khva) {
kvm_release_pfn_clean(new_pfn);
@@ -379,7 +454,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
write_unlock_irq(&gpc->lock);
if (unmap_old)
- gpc_unmap(old_pfn, old_khva);
+ gpc_unmap(old_pfn, old_khva, old_private);
return ret;
}
@@ -500,6 +575,6 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
list_del(&gpc->list);
spin_unlock(&kvm->gpc_lock);
- gpc_unmap(old_pfn, old_khva);
+ gpc_unmap(old_pfn, old_khva, gpc->is_private);
}
}
--
2.45.2
* [RFC PATCH 7/8] mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
2024-07-09 13:20 [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
` (5 preceding siblings ...)
2024-07-09 13:20 ` [RFC PATCH 6/8] kvm: gmem: Temporarily restore direct map entries when needed Patrick Roy
@ 2024-07-09 13:20 ` Patrick Roy
2024-07-09 21:09 ` David Hildenbrand
2024-07-09 13:20 ` [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings Patrick Roy
` (2 subsequent siblings)
9 siblings, 1 reply; 34+ messages in thread
From: Patrick Roy @ 2024-07-09 13:20 UTC (permalink / raw)
To: seanjc, pbonzini, akpm, dwmw, rppt, david
Cc: Patrick Roy, tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf,
derekmn, kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
Inside of vma_is_secretmem and secretmem_mapping, instead of checking
whether a vm_area_struct/address_space has the secretmem ops structure
attached to it, check whether the address_space has the AS_INACCESSIBLE
bit set. Then set the AS_INACCESSIBLE flag for secretmem's
address_space.
This means that get_user_pages and friends are disabled for all
address_spaces that set AS_INACCESSIBLE. The AS_INACCESSIBLE flag was
introduced in commit c72ceafbd12c ("mm: Introduce AS_INACCESSIBLE for
encrypted/confidential memory") specifically for guest_memfd to indicate
that no reads and writes should ever be done to guest_memfd
address_spaces. Disallowing GUP seems like a reasonable semantic
extension, and means that potential future mmaps of guest_memfd cannot
be GUP'd.
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
include/linux/secretmem.h | 13 +++++++++++--
mm/secretmem.c | 6 +-----
2 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index e918f96881f5..886c8f7eb63e 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -8,10 +8,19 @@ extern const struct address_space_operations secretmem_aops;
static inline bool secretmem_mapping(struct address_space *mapping)
{
- return mapping->a_ops == &secretmem_aops;
+ return mapping->flags & AS_INACCESSIBLE;
+}
+
+static inline bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+ struct file *file = vma->vm_file;
+
+ if (!file)
+ return false;
+
+ return secretmem_mapping(file->f_inode->i_mapping);
}
-bool vma_is_secretmem(struct vm_area_struct *vma);
bool secretmem_active(void);
#else
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 3afb5ad701e1..fd03a84a1cb5 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -136,11 +136,6 @@ static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
return 0;
}
-bool vma_is_secretmem(struct vm_area_struct *vma)
-{
- return vma->vm_ops == &secretmem_vm_ops;
-}
-
static const struct file_operations secretmem_fops = {
.release = secretmem_release,
.mmap = secretmem_mmap,
@@ -218,6 +213,7 @@ static struct file *secretmem_file_create(unsigned long flags)
inode->i_op = &secretmem_iops;
inode->i_mapping->a_ops = &secretmem_aops;
+ inode->i_mapping->flags |= AS_INACCESSIBLE;
/* pretend we are a normal file with zero size */
inode->i_mode |= S_IFREG;
--
2.45.2
* [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings
2024-07-09 13:20 [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
` (6 preceding siblings ...)
2024-07-09 13:20 ` [RFC PATCH 7/8] mm: secretmem: use AS_INACCESSIBLE to prohibit GUP Patrick Roy
@ 2024-07-09 13:20 ` Patrick Roy
2024-07-09 14:48 ` Fuad Tabba
2024-07-22 12:28 ` [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Vlastimil Babka (SUSE)
2024-07-26 16:44 ` Yosry Ahmed
9 siblings, 1 reply; 34+ messages in thread
From: Patrick Roy @ 2024-07-09 13:20 UTC (permalink / raw)
To: seanjc, pbonzini, akpm, dwmw, rppt, david
Cc: Patrick Roy, tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf,
derekmn, kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
Allow mapping guest_memfd into userspace. Since AS_INACCESSIBLE is set
on the underlying address_space struct, no GUP of guest_memfd will be
possible.
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
virt/kvm/guest_memfd.c | 31 ++++++++++++++++++++++++++++++-
1 file changed, 30 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index dc9b0c2d0b0e..101ec2b248bf 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -319,7 +319,37 @@ static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
return get_file_active(&slot->gmem.file);
}
+static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
+{
+ struct folio *folio;
+
+ folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff, true);
+
+ if (!folio)
+ return VM_FAULT_SIGBUS;
+
+ vmf->page = folio_file_page(folio, vmf->pgoff);
+
+ return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct kvm_gmem_vm_ops = {
+ .fault = kvm_gmem_fault
+};
+
+static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+ return -EINVAL;
+
+ vm_flags_set(vma, VM_DONTDUMP);
+ vma->vm_ops = &kvm_gmem_vm_ops;
+
+ return 0;
+}
+
static struct file_operations kvm_gmem_fops = {
+ .mmap = kvm_gmem_mmap,
.open = generic_file_open,
.release = kvm_gmem_release,
.fallocate = kvm_gmem_fallocate,
@@ -594,7 +624,6 @@ static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
return -EFAULT;
}
- gmem = file->private_data;
if (xa_load(&gmem->bindings, index) != slot) {
WARN_ON_ONCE(xa_load(&gmem->bindings, index));
return -EIO;
--
2.45.2
* Re: [RFC PATCH 3/8] kvm: pfncache: enlighten about gmem
2024-07-09 13:20 ` [RFC PATCH 3/8] kvm: pfncache: enlighten about gmem Patrick Roy
@ 2024-07-09 14:36 ` David Woodhouse
2024-07-10 9:49 ` Patrick Roy
0 siblings, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2024-07-09 14:36 UTC (permalink / raw)
To: Patrick Roy, seanjc, pbonzini, akpm, rppt, david
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf, derekmn,
kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
On Tue, 2024-07-09 at 14:20 +0100, Patrick Roy wrote:
> KVM uses gfn_to_pfn_caches to cache translations from gfn all the way to
> the pfn (for example, kvm-clock caches the page used for guest/host
> communication this way). Unlike the gfn_to_hva_cache,
> where no equivalent caching semantics were possible for gmem-backed gfns
> (see also 200834b15dda ("kvm: use slowpath in gfn_to_hva_cache if memory
> is private")), here it
> is possible to simply cache the pfn returned by `kvm_gmem_get_pfn`.
>
> Additionally, gfn_to_pfn_caches now invalidate whenever a cached gfn's
> attributes are flipped from shared to private (or vice-versa).
>
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
I can't see how this is safe from race conditions.
When the GPC is invalidated from gfn_to_pfn_cache_invalidate_start()
its *write* lock is taken and gpc->valid is set to false.
In parallel, any code using the GPC to access guest memory will take
the *read* lock, call kvm_gpc_check(), and then go ahead and use the
pointer to its heart's content until eventually dropping the read lock.
Since invalidation takes the write lock, it has to wait until the GPC
is no longer in active use, and the pointer cannot be being
dereferenced.
How does this work for the kvm_mem_is_private() check? You've added a
check in kvm_gpc_check(), but what if the pfn is made private
immediately *after* that check? Unless the code path which makes the
pfn private also takes the write lock, how is it safe?
* Re: [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings
2024-07-09 13:20 ` [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings Patrick Roy
@ 2024-07-09 14:48 ` Fuad Tabba
2024-07-09 21:13 ` David Hildenbrand
0 siblings, 1 reply; 34+ messages in thread
From: Fuad Tabba @ 2024-07-09 14:48 UTC (permalink / raw)
To: Patrick Roy
Cc: seanjc, pbonzini, akpm, dwmw, rppt, david, tglx, mingo, bp,
dave.hansen, x86, hpa, willy, graf, derekmn, kalyazin, kvm,
linux-kernel, linux-mm, dmatlack, chao.p.peng, xmarcalx
Hi Patrick,
On Tue, Jul 9, 2024 at 2:21 PM Patrick Roy <roypat@amazon.co.uk> wrote:
>
> Allow mapping guest_memfd into userspace. Since AS_INACCESSIBLE is set
> on the underlying address_space struct, no GUP of guest_memfd will be
> possible.
This patch allows mapping guest_memfd() unconditionally. Even if it's
not guppable, there are other reasons why you wouldn't want to allow
this. Maybe a config flag to gate it? e.g.,
https://lore.kernel.org/all/20240222161047.402609-4-tabba@google.com/
>
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> ---
> virt/kvm/guest_memfd.c | 31 ++++++++++++++++++++++++++++++-
> 1 file changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index dc9b0c2d0b0e..101ec2b248bf 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -319,7 +319,37 @@ static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
> return get_file_active(&slot->gmem.file);
> }
>
> +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> +{
> + struct folio *folio;
> +
> + folio = kvm_gmem_get_folio(file_inode(vmf->vma->vm_file), vmf->pgoff, true);
> +
> + if (!folio)
> + return VM_FAULT_SIGBUS;
> +
> + vmf->page = folio_file_page(folio, vmf->pgoff);
> +
> + return VM_FAULT_LOCKED;
> +}
> +
> +static const struct vm_operations_struct kvm_gmem_vm_ops = {
> + .fault = kvm_gmem_fault
> +};
> +
> +static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> + return -EINVAL;
> +
> + vm_flags_set(vma, VM_DONTDUMP);
> + vma->vm_ops = &kvm_gmem_vm_ops;
> +
> + return 0;
> +}
> +
> static struct file_operations kvm_gmem_fops = {
> + .mmap = kvm_gmem_mmap,
> .open = generic_file_open,
> .release = kvm_gmem_release,
> .fallocate = kvm_gmem_fallocate,
> @@ -594,7 +624,6 @@ static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
> return -EFAULT;
> }
>
> - gmem = file->private_data;
Is this intentional?
Cheers,
/fuad
> if (xa_load(&gmem->bindings, index) != slot) {
> WARN_ON_ONCE(xa_load(&gmem->bindings, index));
> return -EIO;
> --
> 2.45.2
>
* Re: [RFC PATCH 7/8] mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
2024-07-09 13:20 ` [RFC PATCH 7/8] mm: secretmem: use AS_INACCESSIBLE to prohibit GUP Patrick Roy
@ 2024-07-09 21:09 ` David Hildenbrand
2024-07-10 7:32 ` Mike Rapoport
0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2024-07-09 21:09 UTC (permalink / raw)
To: Patrick Roy, seanjc, pbonzini, akpm, dwmw, rppt
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf, derekmn,
kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
On 09.07.24 15:20, Patrick Roy wrote:
> Inside of vma_is_secretmem and secretmem_mapping, instead of checking
> whether a vm_area_struct/address_space has the secretmem ops structure
> attached to it, check whether the address_space has the AS_INACCESSIBLE
> bit set. Then set the AS_INACCESSIBLE flag for secretmem's
> address_space.
>
> This means that get_user_pages and friends are disabled for all
> address_spaces that set AS_INACCESSIBLE. The AS_INACCESSIBLE flag was
> introduced in commit c72ceafbd12c ("mm: Introduce AS_INACCESSIBLE for
> encrypted/confidential memory") specifically for guest_memfd to indicate
> that no reads and writes should ever be done to guest_memfd
> address_spaces. Disallowing gup seems like a reasonable semantic
> extension, and means that potential future mmaps of guest_memfd cannot
> be GUP'd.
>
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> ---
> include/linux/secretmem.h | 13 +++++++++++--
> mm/secretmem.c | 6 +-----
> 2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
> index e918f96881f5..886c8f7eb63e 100644
> --- a/include/linux/secretmem.h
> +++ b/include/linux/secretmem.h
> @@ -8,10 +8,19 @@ extern const struct address_space_operations secretmem_aops;
>
> static inline bool secretmem_mapping(struct address_space *mapping)
> {
> - return mapping->a_ops == &secretmem_aops;
> + return mapping->flags & AS_INACCESSIBLE;
> +}
> +
> +static inline bool vma_is_secretmem(struct vm_area_struct *vma)
> +{
> + struct file *file = vma->vm_file;
> +
> + if (!file)
> + return false;
> +
> + return secretmem_mapping(file->f_inode->i_mapping);
> }
That sounds wrong. You should leave *secretmem alone and instead have
something like inaccessible_mapping that is used where appropriate.
vma_is_secretmem() should not suddenly succeed on something that is not
mm/secretmem.c
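Something like the following, perhaps (a sketch of the suggested split;
the inaccessible_mapping name is illustrative):

	/* Generic helper, usable by guest_memfd and others. */
	static inline bool inaccessible_mapping(struct address_space *mapping)
	{
		return test_bit(AS_INACCESSIBLE, &mapping->flags);
	}

	/* secretmem keeps its precise identity check. */
	static inline bool secretmem_mapping(struct address_space *mapping)
	{
		return mapping->a_ops == &secretmem_aops;
	}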
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings
2024-07-09 14:48 ` Fuad Tabba
@ 2024-07-09 21:13 ` David Hildenbrand
2024-07-10 9:51 ` Patrick Roy
0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2024-07-09 21:13 UTC (permalink / raw)
To: Fuad Tabba, Patrick Roy
Cc: seanjc, pbonzini, akpm, dwmw, rppt, tglx, mingo, bp, dave.hansen,
x86, hpa, willy, graf, derekmn, kalyazin, kvm, linux-kernel,
linux-mm, dmatlack, chao.p.peng, xmarcalx
On 09.07.24 16:48, Fuad Tabba wrote:
> Hi Patrick,
>
> On Tue, Jul 9, 2024 at 2:21 PM Patrick Roy <roypat@amazon.co.uk> wrote:
>>
>> Allow mapping guest_memfd into userspace. Since AS_INACCESSIBLE is set
>> on the underlying address_space struct, no GUP of guest_memfd will be
>> possible.
>
> This patch allows mapping guest_memfd() unconditionally. Even if it's
> not guppable, there are other reasons why you wouldn't want to allow
> this. Maybe a config flag to gate it? e.g.,
As discussed with Jason, maybe not the direction we want to take with
guest_memfd.
If it's private memory, it shall not be mapped. Also not via magic
config options.
We'll likely discuss some of that in the MM meeting tomorrow, I guess
(having both shared and private memory in guest_memfd).
Note that just from staring at this commit, I don't understand the
motivation *why* we would want to do that.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 5/8] kvm: gmem: add option to remove guest private memory from direct map
2024-07-09 13:20 ` [RFC PATCH 5/8] kvm: gmem: add option to remove guest private memory from direct map Patrick Roy
@ 2024-07-10 7:31 ` Mike Rapoport
2024-07-10 9:50 ` Patrick Roy
0 siblings, 1 reply; 34+ messages in thread
From: Mike Rapoport @ 2024-07-10 7:31 UTC (permalink / raw)
To: Patrick Roy
Cc: seanjc, pbonzini, akpm, dwmw, david, tglx, mingo, bp,
dave.hansen, x86, hpa, willy, graf, derekmn, kalyazin, kvm,
linux-kernel, linux-mm, dmatlack, tabba, chao.p.peng, xmarcalx
On Tue, Jul 09, 2024 at 02:20:33PM +0100, Patrick Roy wrote:
> While guest_memfd is not available to be mapped by userspace, it is
> still accessible through the kernel's direct map. This means that in
> scenarios where guest-private memory is not hardware protected, it can
> be speculatively read and its contents potentially leaked through
> hardware side-channels. Removing guest-private memory from the direct
> map thus mitigates a large class of speculative execution issues
> [1, Table 1].
>
> This patch adds a flag to the `KVM_CREATE_GUEST_MEMFD` ioctl which, if set, removes the
> struct pages backing guest-private memory from the direct map. Should
> `CONFIG_HAVE_KVM_GMEM_{INVALIDATE, PREPARE}` be set, pages are removed
> after preparation and before invalidation, so that the
> prepare/invalidate routines do not have to worry about potentially
> absent direct map entries.
>
> Direct map removal does not reuse the `KVM_GMEM_PREPARE` machinery, since `prepare` can be
> called multiple times, and it is the responsibility of the preparation
> routine to not "prepare" the same folio twice [2]. Thus, we instead
> explicitly check if `filemap_grab_folio` allocated a new folio, and
> remove the returned folio from the direct map only if this was the case.
>
> The patch uses release_folio instead of free_folio to reinsert pages
> back into the direct map as by the time free_folio is called,
> folio->mapping can already be NULL. This means that a call to
> folio_inode inside free_folio might dereference a NULL pointer, leaving no
> way to access the inode which stores the flags that allow determining
> whether the page was removed from the direct map in the first place.
>
> Lastly, the patch uses set_direct_map_{invalid,default}_noflush instead
> of `set_memory_[n]p` to avoid expensive flushes of TLBs and the L*-cache
> hierarchy. This is especially important once KVM restores direct map
> entries on-demand in later patches, where simple FIO benchmarks of a
> virtio-blk device have shown that TLB flushes on a Intel(R) Xeon(R)
> Platinum 8375C CPU @ 2.90GHz resulted in 80% degradation in throughput
> compared to a non-flushing solution.
>
> Not flushing the TLB means that until TLB entries for temporarily
> restored direct map entries get naturally evicted, they can be used
> during speculative execution, and effectively "unhide" the memory for
> longer than intended. We consider this acceptable, as the only pages
> that are temporarily reinserted into the direct map like this will
> either hold PV data structures (kvm-clock, asyncpf, etc), or pages
> containing privileged instructions inside the guest kernel image (in the
> MMIO emulation case).
>
> [1]: https://download.vusec.net/papers/quarantine_raid23.pdf
>
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> ---
> include/uapi/linux/kvm.h | 2 ++
> virt/kvm/guest_memfd.c | 52 ++++++++++++++++++++++++++++++++++------
> 2 files changed, 47 insertions(+), 7 deletions(-)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e065d9fe7ab2..409116aa23c9 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1563,4 +1563,6 @@ struct kvm_create_guest_memfd {
> __u64 reserved[6];
> };
>
> +#define KVM_GMEM_NO_DIRECT_MAP (1ULL << 0)
> +
> #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 9148b9679bb1..dc9b0c2d0b0e 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -4,6 +4,7 @@
> #include <linux/kvm_host.h>
> #include <linux/pagemap.h>
> #include <linux/anon_inodes.h>
> +#include <linux/set_memory.h>
>
> #include "kvm_mm.h"
>
> @@ -49,9 +50,16 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
> return 0;
> }
>
> +static bool kvm_gmem_not_present(struct inode *inode)
> +{
> + return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) != 0;
> +}
> +
> static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
> {
> struct folio *folio;
> + bool zap_direct_map = false;
> + int r;
>
> /* TODO: Support huge pages. */
> folio = filemap_grab_folio(inode->i_mapping, index);
> @@ -74,16 +82,30 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
> for (i = 0; i < nr_pages; i++)
> clear_highpage(folio_page(folio, i));
>
> + // We need to clear the folio before calling kvm_gmem_prepare_folio,
> + // but can only remove it from the direct map _after_ preparation is done.
No C++ comments please
> + zap_direct_map = kvm_gmem_not_present(inode);
> +
> folio_mark_uptodate(folio);
> }
>
> if (prepare) {
> - int r = kvm_gmem_prepare_folio(inode, index, folio);
> - if (r < 0) {
> - folio_unlock(folio);
> - folio_put(folio);
> - return ERR_PTR(r);
> - }
> + r = kvm_gmem_prepare_folio(inode, index, folio);
> + if (r < 0)
> + goto out_err;
> + }
> +
> + if (zap_direct_map) {
> + r = set_direct_map_invalid_noflush(&folio->page);
It's not future proof to presume that folio is a single page here.
You should loop over folio pages and add a TLB flush after the loop.
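Something like this, for instance (a sketch of the suggested loop,
reusing the r variable from the patch above):

	unsigned long addr = (unsigned long)folio_address(folio);
	long i;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		r = set_direct_map_invalid_noflush(folio_page(folio, i));
		if (r < 0)
			goto out_err;
	}
	flush_tlb_kernel_range(addr, addr + folio_size(folio));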
> + if (r < 0)
> + goto out_err;
> +
> + // We use the private flag to track whether the folio has been removed
> + // from the direct map. This is because inside of ->free_folio,
> + // we do not have access to the address_space anymore, meaning we
> + // cannot check folio_inode(folio)->i_private to determine whether
> + // KVM_GMEM_NO_DIRECT_MAP was set.
> + folio_set_private(folio);
> }
>
> /*
> @@ -91,6 +113,10 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
> * unevictable and there is no storage to write back to.
> */
> return folio;
> +out_err:
> + folio_unlock(folio);
> + folio_put(folio);
> + return ERR_PTR(r);
> }
>
> static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> @@ -354,10 +380,22 @@ static void kvm_gmem_free_folio(struct folio *folio)
> }
> #endif
>
> +static void kvm_gmem_invalidate_folio(struct folio *folio, size_t start, size_t end)
> +{
> + if (start == 0 && end == PAGE_SIZE) {
> + // We only get here if PG_private is set, which only happens if kvm_gmem_not_present
> + // returned true in kvm_gmem_get_folio. Thus no need to do that check again.
> + BUG_ON(set_direct_map_default_noflush(&folio->page));
Ditto.
> +
> + folio_clear_private(folio);
> + }
> +}
> +
> static const struct address_space_operations kvm_gmem_aops = {
> .dirty_folio = noop_dirty_folio,
> .migrate_folio = kvm_gmem_migrate_folio,
> .error_remove_folio = kvm_gmem_error_folio,
> + .invalidate_folio = kvm_gmem_invalidate_folio,
> #ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> .free_folio = kvm_gmem_free_folio,
> #endif
> @@ -443,7 +481,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> {
> loff_t size = args->size;
> u64 flags = args->flags;
> - u64 valid_flags = 0;
> + u64 valid_flags = KVM_GMEM_NO_DIRECT_MAP;
>
> if (flags & ~valid_flags)
> return -EINVAL;
> --
> 2.45.2
>
--
Sincerely yours,
Mike.
* Re: [RFC PATCH 7/8] mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
2024-07-09 21:09 ` David Hildenbrand
@ 2024-07-10 7:32 ` Mike Rapoport
2024-07-10 9:50 ` Patrick Roy
0 siblings, 1 reply; 34+ messages in thread
From: Mike Rapoport @ 2024-07-10 7:32 UTC (permalink / raw)
To: David Hildenbrand
Cc: Patrick Roy, seanjc, pbonzini, akpm, dwmw, tglx, mingo, bp,
dave.hansen, x86, hpa, willy, graf, derekmn, kalyazin, kvm,
linux-kernel, linux-mm, dmatlack, tabba, chao.p.peng, xmarcalx
On Tue, Jul 09, 2024 at 11:09:29PM +0200, David Hildenbrand wrote:
> On 09.07.24 15:20, Patrick Roy wrote:
> > Inside of vma_is_secretmem and secretmem_mapping, instead of checking
> > whether a vm_area_struct/address_space has the secretmem ops structure
> > attached to it, check whether the address_space has the AS_INACCESSIBLE
> > bit set. Then set the AS_INACCESSIBLE flag for secretmem's
> > address_space.
> >
> > This means that get_user_pages and friends are disabled for all
> > address_spaces that set AS_INACCESSIBLE. The AS_INACCESSIBLE flag was
> > introduced in commit c72ceafbd12c ("mm: Introduce AS_INACCESSIBLE for
> > encrypted/confidential memory") specifically for guest_memfd to indicate
> > that no reads and writes should ever be done to guest_memfd
> > address_spaces. Disallowing gup seems like a reasonable semantic
> > extension, and means that potential future mmaps of guest_memfd cannot
> > be GUP'd.
> >
> > Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> > ---
> > include/linux/secretmem.h | 13 +++++++++++--
> > mm/secretmem.c | 6 +-----
> > 2 files changed, 12 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
> > index e918f96881f5..886c8f7eb63e 100644
> > --- a/include/linux/secretmem.h
> > +++ b/include/linux/secretmem.h
> > @@ -8,10 +8,19 @@ extern const struct address_space_operations secretmem_aops;
> > static inline bool secretmem_mapping(struct address_space *mapping)
> > {
> > - return mapping->a_ops == &secretmem_aops;
> > + return mapping->flags & AS_INACCESSIBLE;
> > +}
> > +
> > +static inline bool vma_is_secretmem(struct vm_area_struct *vma)
> > +{
> > + struct file *file = vma->vm_file;
> > +
> > + if (!file)
> > + return false;
> > +
> > + return secretmem_mapping(file->f_inode->i_mapping);
> > }
>
> That sounds wrong. You should leave *secretmem alone and instead have
> something like inaccessible_mapping that is used where appropriate.
>
> vma_is_secretmem() should not suddenly succeed on something that is not
> mm/secretmem.c
I'm with David here.
> --
> Cheers,
>
> David / dhildenb
>
--
Sincerely yours,
Mike.
* Re: [RFC PATCH 3/8] kvm: pfncache: enlighten about gmem
2024-07-09 14:36 ` David Woodhouse
@ 2024-07-10 9:49 ` Patrick Roy
2024-07-10 10:20 ` David Woodhouse
0 siblings, 1 reply; 34+ messages in thread
From: Patrick Roy @ 2024-07-10 9:49 UTC (permalink / raw)
To: David Woodhouse, seanjc, pbonzini, akpm, rppt, david
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf, derekmn,
kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx, James Gowans
On 7/9/24 15:36, David Woodhouse wrote:
> On Tue, 2024-07-09 at 14:20 +0100, Patrick Roy wrote:
>> KVM uses gfn_to_pfn_caches to cache translations from gfn all the way to
>> the pfn (for example, kvm-clock caches the page storing the page used
>> for guest/host communication this way). Unlike the gfn_to_hva_cache,
>> where no equivalent caching semantics were possible to gmem-backed gfns
>> (see also 858e8068a750 ("kvm: pfncache: enlighten about gmem")), here it
>> is possible to simply cache the pfn returned by `kvm_gmem_get_pfn`.
>>
>> Additionally, gfn_to_pfn_caches now invalidate whenever a cached gfn's
>> attributes are flipped from shared to private (or vice-versa).
>>
>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>
> I can't see how this is safe from race conditions.
>
> When the GPC is invalidated from gfn_to_pfn_cache_invalidate_start()
> its *write* lock is taken and gpc->valid is set to false.
>
> In parallel, any code using the GPC to access guest memory will take
> the *read* lock, call kvm_gpc_check(), and then go ahead and use the
> pointer to its heart's content until eventually dropping the read lock.
>
> Since invalidation takes the write lock, it has to wait until the GPC
> is no longer in active use and the pointer is no longer being
> dereferenced.
>
> How does this work for the kvm_mem_is_private() check? You've added a
> check in kvm_gpc_check(), but what if the pfn is made private
> immediately *after* that check? Unless the code path which makes the
> pfn private also takes the write lock, how is it safe?
Ah, you're right - I did in fact overlook this. I do think that it works
out though: kvm_vm_set_mem_attributes, which is used for flipping
between shared/private, registers the range which had its attributes
changed for invalidation, and thus gfn_to_pfn_cache_invalidate_start
should get called for it (although I have to admit I do not immediately
see what the exact callstack for this looks like, so maybe I am
misunderstanding something about invalidation here?).
* Re: [RFC PATCH 5/8] kvm: gmem: add option to remove guest private memory from direct map
2024-07-10 7:31 ` Mike Rapoport
@ 2024-07-10 9:50 ` Patrick Roy
0 siblings, 0 replies; 34+ messages in thread
From: Patrick Roy @ 2024-07-10 9:50 UTC (permalink / raw)
To: Mike Rapoport
Cc: seanjc, pbonzini, akpm, dwmw, david, tglx, mingo, bp,
dave.hansen, x86, hpa, willy, graf, derekmn, kalyazin, kvm,
linux-kernel, linux-mm, dmatlack, tabba, chao.p.peng, xmarcalx,
James Gowans
On 7/10/24 08:31, Mike Rapoport wrote:
> On Tue, Jul 09, 2024 at 02:20:33PM +0100, Patrick Roy wrote:
>> While guest_memfd is not available to be mapped by userspace, it is
>> still accessible through the kernel's direct map. This means that in
>> scenarios where guest-private memory is not hardware protected, it can
>> be speculatively read and its contents potentially leaked through
>> hardware side-channels. Removing guest-private memory from the direct
>> map thus mitigates a large class of speculative execution issues
>> [1, Table 1].
>>
>> This patch adds a flag to the `KVM_CREATE_GUEST_MEMFD` ioctl which, if set, removes the
>> struct pages backing guest-private memory from the direct map. Should
>> `CONFIG_HAVE_KVM_GMEM_{INVALIDATE, PREPARE}` be set, pages are removed
>> after preparation and before invalidation, so that the
>> prepare/invalidate routines do not have to worry about potentially
>> absent direct map entries.
>>
>> Direct map removal does not reuse the `KVM_GMEM_PREPARE` machinery, since `prepare` can be
>> called multiple times, and it is the responsibility of the preparation
>> routine to not "prepare" the same folio twice [2]. Thus, we instead
>> explicitly check if `filemap_grab_folio` allocated a new folio, and
>> remove the returned folio from the direct map only if this was the case.
>>
>> The patch uses release_folio instead of free_folio to reinsert pages
>> back into the direct map as by the time free_folio is called,
>> folio->mapping can already be NULL. This means that a call to
>> folio_inode inside free_folio might dereference a NULL pointer, leaving no
>> way to access the inode which stores the flags that allow determining
>> whether the page was removed from the direct map in the first place.
>>
>> Lastly, the patch uses set_direct_map_{invalid,default}_noflush instead
>> of `set_memory_[n]p` to avoid expensive flushes of TLBs and the L*-cache
>> hierarchy. This is especially important once KVM restores direct map
>> entries on-demand in later patches, where simple FIO benchmarks of a
>> virtio-blk device have shown that TLB flushes on a Intel(R) Xeon(R)
>> Platinum 8375C CPU @ 2.90GHz resulted in 80% degradation in throughput
>> compared to a non-flushing solution.
>>
>> Not flushing the TLB means that until TLB entries for temporarily
>> restored direct map entries get naturally evicted, they can be used
>> during speculative execution, and effectively "unhide" the memory for
>> longer than intended. We consider this acceptable, as the only pages
>> that are temporarily reinserted into the direct map like this will
>> either hold PV data structures (kvm-clock, asyncpf, etc), or pages
>> containing privileged instructions inside the guest kernel image (in the
>> MMIO emulation case).
>>
>> [1]: https://download.vusec.net/papers/quarantine_raid23.pdf
>>
>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>> ---
>> include/uapi/linux/kvm.h | 2 ++
>> virt/kvm/guest_memfd.c | 52 ++++++++++++++++++++++++++++++++++------
>> 2 files changed, 47 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index e065d9fe7ab2..409116aa23c9 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1563,4 +1563,6 @@ struct kvm_create_guest_memfd {
>> __u64 reserved[6];
>> };
>>
>> +#define KVM_GMEM_NO_DIRECT_MAP (1ULL << 0)
>> +
>> #endif /* __LINUX_KVM_H */
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 9148b9679bb1..dc9b0c2d0b0e 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -4,6 +4,7 @@
>> #include <linux/kvm_host.h>
>> #include <linux/pagemap.h>
>> #include <linux/anon_inodes.h>
>> +#include <linux/set_memory.h>
>>
>> #include "kvm_mm.h"
>>
>> @@ -49,9 +50,16 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
>> return 0;
>> }
>>
>> +static bool kvm_gmem_not_present(struct inode *inode)
>> +{
>> + return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) != 0;
>> +}
>> +
>> static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
>> {
>> struct folio *folio;
>> + bool zap_direct_map = false;
>> + int r;
>>
>> /* TODO: Support huge pages. */
>> folio = filemap_grab_folio(inode->i_mapping, index);
>> @@ -74,16 +82,30 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
>> for (i = 0; i < nr_pages; i++)
>> clear_highpage(folio_page(folio, i));
>>
>> + // We need to clear the folio before calling kvm_gmem_prepare_folio,
>> + // but can only remove it from the direct map _after_ preparation is done.
>
> No C++ comments please
>
Ack, sorry, will avoid in the future!
>> + zap_direct_map = kvm_gmem_not_present(inode);
>> +
>> folio_mark_uptodate(folio);
>> }
>>
>> if (prepare) {
>> - int r = kvm_gmem_prepare_folio(inode, index, folio);
>> - if (r < 0) {
>> - folio_unlock(folio);
>> - folio_put(folio);
>> - return ERR_PTR(r);
>> - }
>> + r = kvm_gmem_prepare_folio(inode, index, folio);
>> + if (r < 0)
>> + goto out_err;
>> + }
>> +
>> + if (zap_direct_map) {
>> + r = set_direct_map_invalid_noflush(&folio->page);
>
> It's not future proof to presume that folio is a single page here.
> You should loop over folio pages and add a TLB flush after the loop.
>
Right, will do the folio iteration thing (and same for all other places
I call the direct_map* functions in this RFC)!
I'll also have a look at the TLB flush. I specifically avoided using
set_memory_np here because the flushes it did (TLB + L1/2/3) had
significant performance impact (see cover letter). I'll have to rerun my
benchmark with set_direct_map_invalid_noflush + flush_tlb_kernel_range
instead, but if the result is similar, and we really need the flush here
for correctness, I might have to go back to the drawing board about this
whole on-demand mapping approach :(
>> + if (r < 0)
>> + goto out_err;
>> +
>> + // We use the private flag to track whether the folio has been removed
>> + // from the direct map. This is because inside of ->free_folio,
>> + // we do not have access to the address_space anymore, meaning we
>> + // cannot check folio_inode(folio)->i_private to determine whether
>> + // KVM_GMEM_NO_DIRECT_MAP was set.
>> + folio_set_private(folio);
>> }
>>
>> /*
>> @@ -91,6 +113,10 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool
>> * unevictable and there is no storage to write back to.
>> */
>> return folio;
>> +out_err:
>> + folio_unlock(folio);
>> + folio_put(folio);
>> + return ERR_PTR(r);
>> }
>>
>> static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
>> @@ -354,10 +380,22 @@ static void kvm_gmem_free_folio(struct folio *folio)
>> }
>> #endif
>>
>> +static void kvm_gmem_invalidate_folio(struct folio *folio, size_t start, size_t end)
>> +{
>> + if (start == 0 && end == PAGE_SIZE) {
>> + // We only get here if PG_private is set, which only happens if kvm_gmem_not_present
>> + // returned true in kvm_gmem_get_folio. Thus no need to do that check again.
>> + BUG_ON(set_direct_map_default_noflush(&folio->page));
>
> Ditto.
>
>> +
>> + folio_clear_private(folio);
>> + }
>> +}
>> +
>> static const struct address_space_operations kvm_gmem_aops = {
>> .dirty_folio = noop_dirty_folio,
>> .migrate_folio = kvm_gmem_migrate_folio,
>> .error_remove_folio = kvm_gmem_error_folio,
>> + .invalidate_folio = kvm_gmem_invalidate_folio,
>> #ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
>> .free_folio = kvm_gmem_free_folio,
>> #endif
>> @@ -443,7 +481,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>> {
>> loff_t size = args->size;
>> u64 flags = args->flags;
>> - u64 valid_flags = 0;
>> + u64 valid_flags = KVM_GMEM_NO_DIRECT_MAP;
>>
>> if (flags & ~valid_flags)
>> return -EINVAL;
>> --
>> 2.45.2
>>
>
> --
> Sincerely yours,
> Mike.
* Re: [RFC PATCH 7/8] mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
2024-07-10 7:32 ` Mike Rapoport
@ 2024-07-10 9:50 ` Patrick Roy
2024-07-10 21:14 ` David Hildenbrand
0 siblings, 1 reply; 34+ messages in thread
From: Patrick Roy @ 2024-07-10 9:50 UTC (permalink / raw)
To: Mike Rapoport, David Hildenbrand
Cc: seanjc, pbonzini, akpm, dwmw, tglx, mingo, bp, dave.hansen, x86,
hpa, willy, graf, derekmn, kalyazin, kvm, linux-kernel, linux-mm,
dmatlack, tabba, chao.p.peng, xmarcalx, James Gowans
On 7/10/24 08:32, Mike Rapoport wrote:
> On Tue, Jul 09, 2024 at 11:09:29PM +0200, David Hildenbrand wrote:
>> On 09.07.24 15:20, Patrick Roy wrote:
>>> Inside of vma_is_secretmem and secretmem_mapping, instead of checking
>>> whether a vm_area_struct/address_space has the secretmem ops structure
>>> attached to it, check whether the address_space has the AS_INACCESSIBLE
>>> bit set. Then set the AS_INACCESSIBLE flag for secretmem's
>>> address_space.
>>>
>>> This means that get_user_pages and friends are disabled for all
>>> address_spaces that set AS_INACCESSIBLE. The AS_INACCESSIBLE flag was
>>> introduced in commit c72ceafbd12c ("mm: Introduce AS_INACCESSIBLE for
>>> encrypted/confidential memory") specifically for guest_memfd to indicate
>>> that no reads and writes should ever be done to guest_memfd
>>> address_spaces. Disallowing gup seems like a reasonable semantic
>>> extension, and means that potential future mmaps of guest_memfd cannot
>>> be GUP'd.
>>>
>>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>>> ---
>>> include/linux/secretmem.h | 13 +++++++++++--
>>> mm/secretmem.c | 6 +-----
>>> 2 files changed, 12 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
>>> index e918f96881f5..886c8f7eb63e 100644
>>> --- a/include/linux/secretmem.h
>>> +++ b/include/linux/secretmem.h
>>> @@ -8,10 +8,19 @@ extern const struct address_space_operations secretmem_aops;
>>> static inline bool secretmem_mapping(struct address_space *mapping)
>>> {
>>> - return mapping->a_ops == &secretmem_aops;
>>> + return mapping->flags & AS_INACCESSIBLE;
>>> +}
>>> +
>>> +static inline bool vma_is_secretmem(struct vm_area_struct *vma)
>>> +{
>>> + struct file *file = vma->vm_file;
>>> +
>>> + if (!file)
>>> + return false;
>>> +
>>> + return secretmem_mapping(file->f_inode->i_mapping);
>>> }
>>
>> That sounds wrong. You should leave *secretmem alone and instead have
>> something like inaccessible_mapping that is used where appropriate.
>>
>> vma_is_secretmem() should not suddenly succeed on something that is not
>> mm/secretmem.c
>
> I'm with David here.
>
Right, that makes sense. My thinking here was that if memfd_secret and
potential mappings of guest_memfd have the same behavior wrt GUP, then
it makes sense to just have them rely on the same checks. But I guess I
didn't follow that thought to its logical conclusion of renaming the
"secretmem" checks into "inaccessible" checks and moving them out of
secretmem.h.
Or do you mean to just leave secretmem untouched and add separate
"inaccessible" checks? But then we'd have two different ways of
disabling GUP for specific VMAs that both rely on checks in exactly the
same places :/
>> --
>> Cheers,
>>
>> David / dhildenb
>>
>
> --
> Sincerely yours,
> Mike.
* Re: [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings
2024-07-09 21:13 ` David Hildenbrand
@ 2024-07-10 9:51 ` Patrick Roy
2024-07-10 21:12 ` David Hildenbrand
0 siblings, 1 reply; 34+ messages in thread
From: Patrick Roy @ 2024-07-10 9:51 UTC (permalink / raw)
To: David Hildenbrand, Fuad Tabba
Cc: seanjc, pbonzini, akpm, dwmw, rppt, tglx, mingo, bp, dave.hansen,
x86, hpa, willy, graf, derekmn, kalyazin, kvm, linux-kernel,
linux-mm, dmatlack, chao.p.peng, xmarcalx, James Gowans
On 7/9/24 22:13, David Hildenbrand wrote:
> On 09.07.24 16:48, Fuad Tabba wrote:
>> Hi Patrick,
>>
>> On Tue, Jul 9, 2024 at 2:21 PM Patrick Roy <roypat@amazon.co.uk> wrote:
>>>
>>> Allow mapping guest_memfd into userspace. Since AS_INACCESSIBLE is set
>>> on the underlying address_space struct, no GUP of guest_memfd will be
>>> possible.
>>
>> This patch allows mapping guest_memfd() unconditionally. Even if it's
>> not guppable, there are other reasons why you wouldn't want to allow
>> this. Maybe a config flag to gate it? e.g.,
>
>
> As discussed with Jason, maybe not the direction we want to take with
> guest_memfd.
> If it's private memory, it shall not be mapped. Also not via magic
> config options.
>
> We'll likely discuss some of that in the MM meeting tomorrow, I guess
> (having both shared and private memory in guest_memfd).
Oh, nice. I'm assuming you mean this meeting:
https://lore.kernel.org/linux-mm/197a2f19-c71c-fbde-a62a-213dede1f4fd@google.com/T/?
Would it be okay if I also attend? I see it also mentions huge pages,
which is another thing we are interested in, actually :)
> Note that just from staring at this commit, I don't understand the
> motivation *why* we would want to do that.
Fair - I admittedly didn't get into that as much as I probably should
have. In our usecase, we do not have anything that pKVM would (I think)
call "guest-private" memory. I think our memory can be better described
as guest-owned, but always shared with the VMM (e.g. userspace), but
ideally never shared with the host kernel. This model lets us do a lot
of simplifying assumptions: Things like I/O can be handled in userspace
without the guest explicitly sharing I/O buffers (which is not exactly
what we would want long-term anyway, as sharing in the guest_memfd
context means sharing with the host kernel), we can easily do VM
snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.
> --
> Cheers,
>
> David / dhildenb
>
* Re: [RFC PATCH 3/8] kvm: pfncache: enlighten about gmem
2024-07-10 9:49 ` Patrick Roy
@ 2024-07-10 10:20 ` David Woodhouse
2024-07-10 10:46 ` Patrick Roy
0 siblings, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2024-07-10 10:20 UTC (permalink / raw)
To: Patrick Roy, seanjc, pbonzini, akpm, rppt, david
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf, derekmn,
kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx, James Gowans
On Wed, 2024-07-10 at 10:49 +0100, Patrick Roy wrote:
> On 7/9/24 15:36, David Woodhouse wrote:
I did? It isn't September yet, surely?
> > On Tue, 2024-07-09 at 14:20 +0100, Patrick Roy wrote:
> > > KVM uses gfn_to_pfn_caches to cache translations from gfn all the way to
> > > the pfn (for example, kvm-clock caches the page storing the page used
> > > for guest/host communication this way). Unlike the gfn_to_hva_cache,
> > > where no equivalent caching semantics were possible to gmem-backed gfns
> > > (see also 858e8068a750 ("kvm: pfncache: enlighten about gmem")), here it
> > > is possible to simply cache the pfn returned by `kvm_gmem_get_pfn`.
> > >
> > > Additionally, gfn_to_pfn_caches now invalidate whenever a cached gfn's
> > > attributes are flipped from shared to private (or vice-versa).
> > >
> > > Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> >
> > I can't see how this is safe from race conditions.
> >
> > When the GPC is invalidated from gfn_to_pfn_cache_invalidate_start()
> > its *write* lock is taken and gpc->valid is set to false.
> >
> > In parallel, any code using the GPC to access guest memory will take
> > the *read* lock, call kvm_gpc_check(), and then go ahead and use the
> > pointer to its heart's content until eventually dropping the read lock.
> >
> > > Since invalidation takes the write lock, it has to wait until the GPC
> > > is no longer in active use and the pointer is no longer being
> > > dereferenced.
> > >
> > > How does this work for the kvm_mem_is_private() check? You've added a
> > check in kvm_gpc_check(), but what if the pfn is made private
> > immediately *after* that check? Unless the code path which makes the
> > pfn private also takes the write lock, how is it safe?
>
> Ah, you're right - I did in fact overlook this. I do think that it works
> out though: kvm_vm_set_mem_attributes, which is used for flipping
> between shared/private, registers the range which had its attributes
> changed for invalidation, and thus gfn_to_pfn_cache_invalidate_start
> should get called for it (although I have to admit I do not immediately
> see what the exact callstack for this looks like, so maybe I am
> misunderstanding something about invalidation here?).
In that case, wouldn't that mean the explicit checks on gpc->is_private
matching kvm_mem_is_private() would be redundant and you can remove
them because you can trust that gpc->valid would be cleared?
* Re: [RFC PATCH 3/8] kvm: pfncache: enlighten about gmem
2024-07-10 10:20 ` David Woodhouse
@ 2024-07-10 10:46 ` Patrick Roy
2024-07-10 10:50 ` David Woodhouse
0 siblings, 1 reply; 34+ messages in thread
From: Patrick Roy @ 2024-07-10 10:46 UTC (permalink / raw)
To: David Woodhouse, seanjc, pbonzini, akpm, rppt, david
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf, derekmn,
kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx, James Gowans
On Wed, 2024-07-10 at 11:20 +0100, David Woodhouse wrote:
> On Wed, 2024-07-10 at 10:49 +0100, Patrick Roy wrote:
>> On 7/9/24 15:36, David Woodhouse wrote:
>
> I did? It isn't September yet, surely?
Argh, thanks for letting me know, I think I've whacked some sense into
my mail client now :)
>>> On Tue, 2024-07-09 at 14:20 +0100, Patrick Roy wrote:
>>>> KVM uses gfn_to_pfn_caches to cache translations from gfn all the way to
>>>> the pfn (for example, kvm-clock caches the page storing the page used
>>>> for guest/host communication this way). Unlike the gfn_to_hva_cache,
>>>> where no equivalent caching semantics were possible to gmem-backed gfns
>>>> (see also 858e8068a750 ("kvm: pfncache: enlighten about gmem")), here it
>>>> is possible to simply cache the pfn returned by `kvm_gmem_get_pfn`.
>>>>
>>>> Additionally, gfn_to_pfn_caches now invalidate whenever a cached gfn's
>>>> attributes are flipped from shared to private (or vice-versa).
>>>>
>>>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>>>
>>> I can't see how this is safe from race conditions.
>>>
>>> When the GPC is invalidated from gfn_to_pfn_cache_invalidate_start()
>>> its *write* lock is taken and gpc->valid is set to false.
>>>
>>> In parallel, any code using the GPC to access guest memory will take
>>> the *read* lock, call kvm_gpc_check(), and then go ahead and use the
>>> pointer to its heart's content until eventually dropping the read lock.
>>>
>>> Since invalidation takes the write lock, it has to wait until the GPC
>>> is no longer in active use and the pointer is no longer being
>>> dereferenced.
>>>
>>> How does this work for the kvm_mem_is_private() check? You've added a
>>> check in kvm_gpc_check(), but what if the pfn is made private
>>> immediately *after* that check? Unless the code path which makes the
>>> pfn private also takes the write lock, how is it safe?
>>
>> Ah, you're right - I did in fact overlook this. I do think that it works
>> out though: kvm_vm_set_mem_attributes, which is used for flipping
>> between shared/private, registers the range which had its attributes
>> changed for invalidation, and thus gfn_to_pfn_cache_invalidate_start
>> should get called for it (although I have to admit I do not immediately
>> see what the exact callstack for this looks like, so maybe I am
>> misunderstanding something about invalidation here?).
>
> In that case, wouldn't that mean the explicit checks on gpc->is_private
> matching kvm_mem_is_private() would be redundant and you can remove
> them because you can trust that gpc->valid would be cleared?
>
Right, yes, it would indeed mean that. I'll double-check my assumption
about the whole invalidation thing and adjust the code for the next
iteration!
* Re: [RFC PATCH 3/8] kvm: pfncache: enlighten about gmem
2024-07-10 10:46 ` Patrick Roy
@ 2024-07-10 10:50 ` David Woodhouse
0 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2024-07-10 10:50 UTC (permalink / raw)
To: Patrick Roy, seanjc, pbonzini, akpm, rppt, david
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf, derekmn,
kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx, James Gowans
On Wed, 2024-07-10 at 11:46 +0100, Patrick Roy wrote:
> On Wed, 2024-07-10 at 11:20 +0100, David Woodhouse wrote:
> > In that case, wouldn't that mean the explicit checks on gpc->is_private
> > matching kvm_mem_is_private() would be redundant and you can remove
> > them because you can trust that gpc->valid would be cleared?
> >
>
> Right, yes, it would indeed mean that. I'll double-check my assumption
> about the whole invalidation thing and adjust the code for the next
> iteration!
I was going to suggest that you take the check you'd added to
kvm_gpc_check() and move it down below the ->valid check, and turn it
into a BUG_ON() to check that assertion.
You *might* get false positives with that though, if the result of
kvm_mem_is_private() becomes true before the flush actually *happens*?
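Concretely, the suggestion amounts to something like this (a sketch;
gpc->is_private is the field this RFC adds to the cache):

	/* In kvm_gpc_check(), below the existing ->valid check. */
	if (!gpc->valid)
		return false;

	/*
	 * If invalidation reliably fires on attribute changes, a stale
	 * shared/private state should be impossible to observe here.
	 */
	BUG_ON(gpc->is_private !=
	       kvm_mem_is_private(gpc->kvm, gpa_to_gfn(gpc->gpa)));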
* Re: [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings
2024-07-10 9:51 ` Patrick Roy
@ 2024-07-10 21:12 ` David Hildenbrand
2024-07-10 21:53 ` Sean Christopherson
2024-07-12 15:59 ` Patrick Roy
0 siblings, 2 replies; 34+ messages in thread
From: David Hildenbrand @ 2024-07-10 21:12 UTC (permalink / raw)
To: Patrick Roy, Fuad Tabba
Cc: seanjc, pbonzini, akpm, dwmw, rppt, tglx, mingo, bp, dave.hansen,
x86, hpa, willy, graf, derekmn, kalyazin, kvm, linux-kernel,
linux-mm, dmatlack, chao.p.peng, xmarcalx, James Gowans
On 10.07.24 11:51, Patrick Roy wrote:
>
>
> On 7/9/24 22:13, David Hildenbrand wrote:
>> On 09.07.24 16:48, Fuad Tabba wrote:
>>> Hi Patrick,
>>>
>>> On Tue, Jul 9, 2024 at 2:21 PM Patrick Roy <roypat@amazon.co.uk> wrote:
>>>>
>>>> Allow mapping guest_memfd into userspace. Since AS_INACCESSIBLE is set
>>>> on the underlying address_space struct, no GUP of guest_memfd will be
>>>> possible.
>>>
>>> This patch allows mapping guest_memfd() unconditionally. Even if it's
>>> not guppable, there are other reasons why you wouldn't want to allow
>>> this. Maybe a config flag to gate it? e.g.,
>>
>>
>> As discussed with Jason, maybe not the direction we want to take with
>> guest_memfd.
>> If it's private memory, it shall not be mapped. Also not via magic
>> config options.
>>
>> We'll likely discuss some of that in the MM meeting tomorrow, I guess
>> (having both shared and private memory in guest_memfd).
>
> Oh, nice. I'm assuming you mean this meeting:
> https://lore.kernel.org/linux-mm/197a2f19-c71c-fbde-a62a-213dede1f4fd@google.com/T/?
> Would it be okay if I also attend? I see it also mentions huge pages,
> which is another thing we are interested in, actually :)
Hi,
sorry for the late reply. Yes, you could have joined .... too late.
There will be a summary posted soon. So far the agreement is that we're
planning on allowing shared memory as part of guest_memfd, and will allow
that to get mapped and pinned. Private memory is not going to get mapped
and pinned.
If we have to disallow pinning of shared memory on top for some use
cases (i.e., no directmap), I assume that could be added.
>
>> Note that just from staring at this commit, I don't understand the
>> motivation *why* we would want to do that.
>
> Fair - I admittedly didn't get into that as much as I probably should
> have. In our usecase, we do not have anything that pKVM would (I think)
> call "guest-private" memory. I think our memory can be better described
> as guest-owned, but always shared with the VMM (e.g. userspace), but
> ideally never shared with the host kernel. This model lets us do a lot
> of simplifying assumptions: Things like I/O can be handled in userspace
> without the guest explicitly sharing I/O buffers (which is not exactly
> what we would want long-term anyway, as sharing in the guest_memfd
> context means sharing with the host kernel), we can easily do VM
> snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.
Okay, so essentially you would want to use guest_memfd to only contain
shared memory and disallow any pinning, like for secretmem.
If so, I wonder if it wouldn't be better to simply add KVM support to
consume *real* secretmem memory? IIRC so far there was only demand to
probably remove the directmap of private memory in guest_memfd, not of
shared memory.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 7/8] mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
2024-07-10 9:50 ` Patrick Roy
@ 2024-07-10 21:14 ` David Hildenbrand
0 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2024-07-10 21:14 UTC (permalink / raw)
To: Patrick Roy, Mike Rapoport
Cc: seanjc, pbonzini, akpm, dwmw, tglx, mingo, bp, dave.hansen, x86,
hpa, willy, graf, derekmn, kalyazin, kvm, linux-kernel, linux-mm,
dmatlack, tabba, chao.p.peng, xmarcalx, James Gowans
On 10.07.24 11:50, Patrick Roy wrote:
>
>
> On 7/10/24 08:32, Mike Rapoport wrote:
>> On Tue, Jul 09, 2024 at 11:09:29PM +0200, David Hildenbrand wrote:
>>> On 09.07.24 15:20, Patrick Roy wrote:
>>>> Inside of vma_is_secretmem and secretmem_mapping, instead of checking
>>>> whether a vm_area_struct/address_space has the secretmem ops structure
>>>> attached to it, check whether the address_space has the AS_INACCESSIBLE
>>>> bit set. Then set the AS_INACCESSIBLE flag for secretmem's
>>>> address_space.
>>>>
>>>> This means that get_user_pages and friends are disabled for all
>>>> address_spaces that set AS_INACCESSIBLE. The AS_INACCESSIBLE flag was
>>>> introduced in commit c72ceafbd12c ("mm: Introduce AS_INACCESSIBLE for
>>>> encrypted/confidential memory") specifically for guest_memfd to indicate
>>>> that no reads and writes should ever be done to guest_memfd
>>>> address_spaces. Disallowing gup seems like a reasonable semantic
>>>> extension, and means that potential future mmaps of guest_memfd cannot
>>>> be GUP'd.
>>>>
>>>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>>>> ---
>>>> include/linux/secretmem.h | 13 +++++++++++--
>>>> mm/secretmem.c | 6 +-----
>>>> 2 files changed, 12 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
>>>> index e918f96881f5..886c8f7eb63e 100644
>>>> --- a/include/linux/secretmem.h
>>>> +++ b/include/linux/secretmem.h
>>>> @@ -8,10 +8,19 @@ extern const struct address_space_operations secretmem_aops;
>>>> static inline bool secretmem_mapping(struct address_space *mapping)
>>>> {
>>>> - return mapping->a_ops == &secretmem_aops;
>>>> + return mapping->flags & AS_INACCESSIBLE;
>>>> +}
>>>> +
>>>> +static inline bool vma_is_secretmem(struct vm_area_struct *vma)
>>>> +{
>>>> + struct file *file = vma->vm_file;
>>>> +
>>>> + if (!file)
>>>> + return false;
>>>> +
>>>> + return secretmem_mapping(file->f_inode->i_mapping);
>>>> }
>>>
>>> That sounds wrong. You should leave *secretmem alone and instead have
>>> something like inaccessible_mapping that is used where appropriate.
>>>
>>> vma_is_secretmem() should not suddenly succeed on something that is not
>>> mm/secretmem.c
>>
>> I'm with David here.
>>
>
> Right, that makes sense. My thinking here was that if memfd_secret and
> potential mappings of guest_memfd have the same behavior wrt GUP, then
> it makes sense to just have them rely on the same checks. But I guess I
> didn't follow that thought to its logical conclusion of renaming the
> "secretmem" checks into "inaccessible" checks and moving them out of
> secretmem.h.
>
> Or do you mean to just leave secretmem untouched and add separate
> "inaccessible" checks? But then we'd have two different ways of
> disabling GUP for specific VMAs that both rely on checks in exactly the
> same places :/
You can just replace vma_is_secretmem() in the relevant places with
checks for inaccessible address spaces. No need for the additional
vma_is_secretmem() check then.
BUT, as raised in my other reply, I wonder if adding support for
secretmem in KVM (I assume) would be simpler+cleaner.
>
>>> --
>>> Cheers,
>>>
>>> David / dhildenb
>>>
>>
>> --
>> Sincerely yours,
>> Mike.
>
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings
2024-07-10 21:12 ` David Hildenbrand
@ 2024-07-10 21:53 ` Sean Christopherson
2024-07-10 21:56 ` David Hildenbrand
2024-07-12 15:59 ` Patrick Roy
1 sibling, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2024-07-10 21:53 UTC (permalink / raw)
To: David Hildenbrand
Cc: Patrick Roy, Fuad Tabba, pbonzini, akpm, dwmw, rppt, tglx, mingo,
bp, dave.hansen, x86, hpa, willy, graf, derekmn, kalyazin, kvm,
linux-kernel, linux-mm, dmatlack, chao.p.peng, xmarcalx,
James Gowans
On Wed, Jul 10, 2024, David Hildenbrand wrote:
> On 10.07.24 11:51, Patrick Roy wrote:
> > On 7/9/24 22:13, David Hildenbrand wrote:
> > > Note that just from staring at this commit, I don't understand the
> > > motivation *why* we would want to do that.
> >
> > Fair - I admittedly didn't get into that as much as I probably should
> > have. In our usecase, we do not have anything that pKVM would (I think)
> > call "guest-private" memory. I think our memory can be better described
> > as guest-owned, but always shared with the VMM (e.g. userspace), but
> > ideally never shared with the host kernel. This model lets us do a lot
> > of simplifying assumptions: Things like I/O can be handled in userspace
> > without the guest explicitly sharing I/O buffers (which is not exactly
> > what we would want long-term anyway, as sharing in the guest_memfd
> > context means sharing with the host kernel), we can easily do VM
> > snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.
>
> Okay, so essentially you would want to use guest_memfd to only contain shared
> memory and disallow any pinning, like for secretmem.
>
> If so, I wonder if it wouldn't be better to simply add KVM support to
> consume *real* secretmem memory? IIRC so far there was only demand to
> probably remove the directmap of private memory in guest_memfd, not of
> shared memory.
It's also desirable to remove shared memory from the directmap, e.g. to prevent
using the directmap in a cross-VM attack.
I don't think we want to allow KVM to consume secretmem. That would require
letting KVM gup() secretmem, which AIUI defeats the entire purpose of secretmem,
and I don't think KVM should be special.
* Re: [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings
2024-07-10 21:53 ` Sean Christopherson
@ 2024-07-10 21:56 ` David Hildenbrand
0 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2024-07-10 21:56 UTC (permalink / raw)
To: Sean Christopherson
Cc: Patrick Roy, Fuad Tabba, pbonzini, akpm, dwmw, rppt, tglx, mingo,
bp, dave.hansen, x86, hpa, willy, graf, derekmn, kalyazin, kvm,
linux-kernel, linux-mm, dmatlack, chao.p.peng, xmarcalx,
James Gowans
On 10.07.24 23:53, Sean Christopherson wrote:
> On Wed, Jul 10, 2024, David Hildenbrand wrote:
>> On 10.07.24 11:51, Patrick Roy wrote:
>>> On 7/9/24 22:13, David Hildenbrand wrote:
>>>> Note that just from staring at this commit, I don't understand the
>>>> motivation *why* we would want to do that.
>>>
>>> Fair - I admittedly didn't get into that as much as I probably should
>>> have. In our usecase, we do not have anything that pKVM would (I think)
>>> call "guest-private" memory. I think our memory can be better described
>>> as guest-owned, but always shared with the VMM (e.g. userspace), but
>>> ideally never shared with the host kernel. This model lets us do a lot
>>> of simplifying assumptions: Things like I/O can be handled in userspace
>>> without the guest explicitly sharing I/O buffers (which is not exactly
>>> what we would want long-term anyway, as sharing in the guest_memfd
>>> context means sharing with the host kernel), we can easily do VM
>>> snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.
>>
>> Okay, so essentially you would want to use guest_memfd to only contain shared
>> memory and disallow any pinning, like for secretmem.
>>
>> If so, I wonder if it wouldn't be better to simply add KVM support to
>> consume *real* secretmem memory? IIRC so far there was only demand to
>> probably remove the directmap of private memory in guest_memfd, not of
>> shared memory.
>
> It's also desirable to remove shared memory from the directmap, e.g. to prevent
> using the directmap in a cross-VM attack.
>
> I don't think we want to allow KVM to consume secretmem. That would require
> letting KVM gup() secretmem, which AIUI defeats the entire purpose of secretmem,
> and I don't think KVM should be special.
I would mean consuming secretmem via the fd, *not* via page tables / gup.
But if we also want to have the option of directmap modifications for
shared memory in guest_memfd, we could make that indeed a guest_memfd
feature.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 6/8] kvm: gmem: Temporarily restore direct map entries when needed
2024-07-09 13:20 ` [RFC PATCH 6/8] kvm: gmem: Temporarily restore direct map entries when needed Patrick Roy
@ 2024-07-11 6:25 ` Paolo Bonzini
0 siblings, 0 replies; 34+ messages in thread
From: Paolo Bonzini @ 2024-07-11 6:25 UTC (permalink / raw)
To: Patrick Roy, seanjc, akpm, dwmw, rppt, david
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf, derekmn,
kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
On 7/9/24 15:20, Patrick Roy wrote:
> If KVM_GMEM_NO_DIRECT_MAP is set, and KVM tries to internally access
> guest-private memory inside kvm_{read,write}_guest, or via a
> gfn_to_pfn_cache, temporarily restore the direct map entry.
>
> To avoid race conditions between two threads restoring or zapping direct
> map entries for the same page and potentially interfering with each
> other (e.g. unfortunate interleavings of map->read->unmap in the form of
> map(A)->map(B)->read(A)->unmap(A)->read(B) [BOOM]), the following
> invariant is upheld in this patch:
>
> - Only a single gfn_to_pfn_cache can exist for any given pfn, and
I think this is not ensured. You can however use
set_page_private()/page_private() to count the number of references.
Paolo
> - All non-gfn_to_pfn_cache code paths that temporarily restore direct
> map entries complete the entire map->access->unmap critical section
> while holding the folio lock.
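A sketch of what Paolo suggests, with hypothetical helper names,
assuming the folio lock serializes all callers of these helpers:

	/* Nest map/unmap requests by counting them in the page-private field. */
	static void kvm_gmem_direct_map_get(struct page *page)
	{
		unsigned long refs = page_private(page);

		if (refs == 0)
			WARN_ON_ONCE(set_direct_map_default_noflush(page));
		set_page_private(page, refs + 1);
	}

	static void kvm_gmem_direct_map_put(struct page *page)
	{
		unsigned long refs = page_private(page);

		if (refs == 1)
			WARN_ON_ONCE(set_direct_map_invalid_noflush(page));
		set_page_private(page, refs - 1);
	}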
* Re: [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings
2024-07-10 21:12 ` David Hildenbrand
2024-07-10 21:53 ` Sean Christopherson
@ 2024-07-12 15:59 ` Patrick Roy
2024-07-30 10:15 ` David Hildenbrand
1 sibling, 1 reply; 34+ messages in thread
From: Patrick Roy @ 2024-07-12 15:59 UTC (permalink / raw)
To: David Hildenbrand, seanjc, Fuad Tabba
Cc: pbonzini, akpm, dwmw, rppt, tglx, mingo, bp, dave.hansen, x86,
hpa, willy, graf, derekmn, kalyazin, kvm, linux-kernel, linux-mm,
dmatlack, chao.p.peng, xmarcalx, James Gowans
Hey,
On Wed, 2024-07-10 at 22:12 +0100, David Hildenbrand wrote:
> On 10.07.24 11:51, Patrick Roy wrote:
>>
>>
>> On 7/9/24 22:13, David Hildenbrand wrote:
>>> On 09.07.24 16:48, Fuad Tabba wrote:
>>>> Hi Patrick,
>>>>
>>>> On Tue, Jul 9, 2024 at 2:21 PM Patrick Roy <roypat@amazon.co.uk> wrote:
>>>>>
>>>>> Allow mapping guest_memfd into userspace. Since AS_INACCESSIBLE is set
>>>>> on the underlying address_space struct, no GUP of guest_memfd will be
>>>>> possible.
>>>>
>>>> This patch allows mapping guest_memfd() unconditionally. Even if it's
>>>> not guppable, there are other reasons why you wouldn't want to allow
>>>> this. Maybe a config flag to gate it? e.g.,
>>>
>>>
>>> As discussed with Jason, maybe not the direction we want to take with
>>> guest_memfd.
>>> If it's private memory, it shall not be mapped. Also not via magic
>>> config options.
>>>
>>> We'll likely discuss some of that in the MM meeting tomorrow, I guess
>>> (having both shared and private memory in guest_memfd).
>>
>> Oh, nice. I'm assuming you mean this meeting:
>> https://lore.kernel.org/linux-mm/197a2f19-c71c-fbde-a62a-213dede1f4fd@google.com/T/?
>> Would it be okay if I also attend? I see it also mentions huge pages,
>> which is another thing we are interested in, actually :)
>
> Hi,
>
> sorry for the late reply. Yes, you could have joined .... too late.
No worries, I did end up joining to listen in to y'all's discussion
anyway :)
> There will be a summary posted soon. So far the agreement is that we're
> planning on allowing shared memory as part of guest_memfd, and will allow
> that to get mapped and pinned. Private memory is not going to get mapped
> and pinned.
>
> If we have to disallow pinning of shared memory on top for some use
> cases (i.e., no directmap), I assume that could be added.
>
>>
>>> Note that just from staring at this commit, I don't understand the
>>> motivation *why* we would want to do that.
>>
>> Fair - I admittedly didn't get into that as much as I probably should
>> have. In our usecase, we do not have anything that pKVM would (I think)
>> call "guest-private" memory. I think our memory can be better described
>> as guest-owned, but always shared with the VMM (e.g. userspace), but
>> ideally never shared with the host kernel. This model lets us do a lot
>> of simplifying assumptions: Things like I/O can be handled in userspace
>> without the guest explicitly sharing I/O buffers (which is not exactly
>> what we would want long-term anyway, as sharing in the guest_memfd
>> context means sharing with the host kernel), we can easily do VM
>> snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.
>
> Okay, so essentially you would want to use guest_memfd to only contain
> shared memory and disallow any pinning, like for secretmem.
Yeah, this is pretty much what I thought we wanted before listening in
on Wednesday.
I've actually been thinking about this some more since then though. With
hugepages, if the VM is backed by, say, 2M pages, our on-demand direct
map insertion approach runs into the same problem that CoCo VMs have
when they're backed by hugepages: How to deal with the guest only
sharing a 4K range in a hugepage? If we want to restore the direct map
for e.g. the page containing kvm-clock data, then we can't simply go
ahead and restore the direct map for the entire 2M page, because there
very well might be stuff in the other 511 small guest pages that we
really do not want in the direct map. And we can't even take the
approach of letting the guest deal with the problem, because here
"sharing" is driven by the host, not the guest, so the guest cannot
possibly know that it maybe should avoid putting stuff it doesn't want
shared into those remaining 511 pages! To me that sounds a lot like the
whole "breaking down huge folios to allow GUP to only some parts of it"
thing mentioned on Wednesday.
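To make the granularity issue concrete: set_direct_map_default_noflush(),
the primitive this RFC uses, operates on individual struct pages, so a
(made-up) helper for restoring a single 4K subpage could look like the
sketch below. The direct map side is the comparatively easy part - the
problem described above is deciding *which* subpages may be restored at
all:

static int gmem_restore_subpage(struct folio *folio, unsigned long idx)
{
        struct page *page;

        if (idx >= folio_nr_pages(folio))
                return -EINVAL;

        /* Re-insert only this 4K page into the direct map; the
         * remaining subpages of the folio stay unmapped. */
        page = folio_page(folio, idx);
        return set_direct_map_default_noflush(page);
}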
Now, if we instead treat "guest memory without direct map entries" as
"private", and "guest memory with direct map entries" as "shared", then
the above will be solved by whatever mechanism allows gupping/mapping of
only the "shared" parts of huge folios, IIUC. The fact that GUP is then
also allowed for the "shared" parts is not actually a problem for us -
we went down the route of disabling GUP altogether here because based on
[1] it sounded like GUP for anything gmem related would never happen.
But after something is re-inserted into the direct map, we don't very
much care if it can be GUP-ed or not. In fact, allowing GUP for the
shared parts probably makes some things easier for us, as we can then do
I/O without bounce buffers by just in-place converting I/O-buffers to
shared, and then treating that shared slice of guest_memfd the same way
we treat traditional guest memory today. In a very far-off future, we'd
like to be able to do I/O without ever reinserting pages into the direct
map, but I don't think adopting this private/shared model for gmem would
block us from doing that?
Although all of this does hinge on us being able to do the in-place
shared/private conversion without any guest involvement. Do you envision
that to be possible?
> If so, I wonder if it wouldn't be better to simply add KVM support to
> consume *real* secretmem memory? IIRC so far there was only demand to
> probably remove the directmap of private memory in guest_memfd, not of
> shared memory.
> --
> Cheers,
>
> David / dhildenb
Best,
Patrick
[1]: https://lore.kernel.org/all/ZdfoR3nCEP3HTtm1@casper.infradead.org/
* Re: [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map
2024-07-09 13:20 [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
` (7 preceding siblings ...)
2024-07-09 13:20 ` [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings Patrick Roy
@ 2024-07-22 12:28 ` Vlastimil Babka (SUSE)
2024-07-26 6:55 ` Patrick Roy
2024-07-26 16:44 ` Yosry Ahmed
9 siblings, 1 reply; 34+ messages in thread
From: Vlastimil Babka (SUSE) @ 2024-07-22 12:28 UTC (permalink / raw)
To: Patrick Roy, seanjc, pbonzini, akpm, dwmw, rppt, david
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf, derekmn,
kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
On 7/9/24 3:20 PM, Patrick Roy wrote:
> Hey all,
>
> This RFC series is a rough draft adding support for running
> non-confidential compute VMs in guest_memfd, based on prior discussions
> with Sean [1]. Our specific usecase for this is the ability to unmap
> guest memory from the host kernel's direct map, as a mitigation against
> a large class of speculative execution issues.
>
> === Implementation ===
>
> This patch series introduces a new flag for `KVM_CREATE_GUEST_MEMFD`
> that removes the guest_memfd's pages from the direct map when they are
> allocated. When trying to run a guest from such a guest_memfd, we now
> face the problem that without either userspace or kernelspace mappings
> of guest_memfd, KVM cannot access guest memory to, for example, do MMIO
> emulation or access memory used for guest/host communication. We have
> multiple options for
> solving this when running non-CoCo VMs: (1) implement a TDX-light
> solution, where the guest shares memory that KVM needs to access, and
> relies on paravirtual solutions where this is not possible (e.g. MMIO),
> (2) have KVM use userspace mappings of guest_memfd (e.g. a
> memfd_secret-style solution), or (3) dynamically reinsert pages into the
> direct map whenever KVM wants to access them.
>
> This RFC goes for option (3). Option (1) is a lot of overhead for very
> little gain, since we are not actually constrained by a physical
> inability to access guest memory (e.g. we are not in a TDX context where
> accesses to guest memory cause a #MC). Option (2) has previously been
> rejected [1].
Do the pages have to have the same address when they are temporarily mapped?
Wouldn't it be easier to do something similar to kmap_local_page() as used
for HIGHMEM? I.e. you get a temporary kernel mapping to do what's needed,
but it doesn't have to alter the shared directmap.
Maybe that was already discussed somewhere as unsuitable, but I didn't spot
it here.
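Something like the following pattern (untested sketch; note that on
64-bit !CONFIG_HIGHMEM kernels kmap_local_page() just returns the
direct-map address, so this would only help on HIGHMEM configs or with a
new primitive that always builds a temporary mapping):

#include <linux/highmem.h>

/* Copy out of a guest page through a temporary, CPU-local mapping,
 * without touching the shared direct map (assumes offset + len stays
 * within the page). */
static void gmem_read_via_kmap(struct page *page, void *dst,
                               unsigned int offset, unsigned int len)
{
        void *vaddr = kmap_local_page(page);

        memcpy(dst, vaddr + offset, len);
        kunmap_local(vaddr);
}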
> In this patch series, we make sufficient parts of KVM gmem-aware to be
> able to boot a Linux initrd from private memory on x86. These include
> KVM's MMIO emulation (including guest page table walking) and kvm-clock.
> For VM types which do not allow accessing gmem, we return -EFAULT and
> attempt to prepare a KVM_EXIT_MEMORY_FAULT.
>
> Additionally, this patch series adds support for "restricted" userspace
> mappings of guest_memfd, which work similarly to memfd_secret (e.g.
> get_user_pages is disallowed) and allow handling I/O and loading the
> guest kernel in a simple way. Support for this is completely independent
> of the rest of the functionality introduced in this patch series.
> However, it is required to build a minimal hypervisor PoC that actually
> allows booting a VM from a disk.
>
> === Performance ===
>
> We have run some preliminary performance benchmarks to assess the impact
> of on-the-fly direct map manipulations. We were mainly interested in the
> impact of manipulating the direct map for MMIO emulation on virtio-mmio.
> Particularly, we were worried about the impact of the TLB and L1/2/3
> cache flushes that set_memory_[n]p entails.
>
> In our setup, we have taken a modified Firecracker VMM, spawned a Linux
> guest with 1 vCPU, and used fio to stress a virtio_blk device. We found
> that the cache flushes caused throughput to drop from around 600MB/s to
> ~50MB/s (~90%) for both reads and writes (on an Intel(R) Xeon(R) Platinum
> 8375C CPU with 64 cores). We then converted our prototype to use
> set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p and
> found that without cache flushes the pure impact of the direct map
> manipulation is indistinguishable from noise. This is why we use
> set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p in
> this RFC.
>
> Note that in this comparison, both the baseline and the
> guest_memfd-supporting version of Firecracker were made to bounce I/O
> buffers in VMM userspace. As GUP is disabled for the guest_memfd VMAs,
> the virtio stack cannot directly pass guest buffers to read/write
> syscalls.
>
> === Security ===
>
> We want to use unmapping guest memory from the host kernel as a security
> mitigation against transient execution attacks. Temporarily restoring
> direct map entries whenever KVM requires access to guest memory leaves a
> gap in this mitigation. We believe this to be acceptable for the above
> cases, since pages used for paravirtual guest/host communication (e.g.
> kvm-clock) and guest page tables do not contain sensitive data. MMIO
> emulation will only end up reading pages containing privileged
> instructions (e.g. guest kernel code).
>
> === Summary ===
>
> Patches 1-4 are about hot-patching various points inside of KVM that
> access guest memory to correctly handle the case where memory happens to
> be guest-private. This means either handling the access as a memory
> error, or simply accessing the memslot's guest_memfd instead of looking
> at the userspace-provided VMA if the VM type allows these kinds of
> accesses. Patches 5-6 add a flag to KVM_CREATE_GUEST_MEMFD that will
> make it remove its pages from the kernel's direct map. Whenever KVM
> wants to access guest-private memory, it will temporarily re-insert the
> relevant pages. Patches 7-8 allow for restricted userspace mappings
> (e.g. get_user_pages paths are disabled like for memfd_secret) of
> guest_memfd, so that userspace has an easy path for loading the guest
> kernel and handling I/O-buffers.
>
> === ToDos / Limitations ===
>
> There are still a few rough edges that need to be addressed before
> dropping the "RFC" tag, e.g.
>
> * Handle errors of set_direct_map_default_noflush in
> kvm_gmem_invalidate_folio instead of calling BUG_ON
> * Lift the limitation of "at most one gfn_to_pfn_cache for each
> gfn/pfn" in e1c61f0a7963 ("kvm: gmem: Temporarily restore direct map
> entries when needed"). It currently means that guests with more than 1
> vcpu fail to boot, because multiple vcpus can put their kvm-clock PV
> structures into the same page (gfn)
> * Write selftests, particularly around hole punching, direct map removal,
> and mmap.
>
> Lastly, there's the question of nested virtualization which Sean brought
> up in previous discussions, and which runs into similar problems to MMIO.
> I have looked at it very briefly. On Intel, KVM uses various gfn->uhva
> caches, which run into similar problems to the gfn_to_hva_caches dealt
> with in 200834b15dda ("kvm: use slowpath in gfn_to_hva_cache if memory
> is private"). However, previous attempts at just converting this to
> gfn_to_pfn_cache (which would make them work with guest_memfd) proved
> complicated [2]. I suppose initially, we should probably disallow nested
> virtualization in VMs that have their memory removed from the direct
> map.
>
> Best,
> Patrick
>
> [1]: https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com/
> [2]: https://lore.kernel.org/kvm/ZBEEQtmtNPaEqU1i@google.com/
>
> Patrick Roy (8):
> kvm: Allow reading/writing gmem using kvm_{read,write}_guest
> kvm: use slowpath in gfn_to_hva_cache if memory is private
> kvm: pfncache: enlighten about gmem
> kvm: x86: support walking guest page tables in gmem
> kvm: gmem: add option to remove guest private memory from direct map
> kvm: gmem: Temporarily restore direct map entries when needed
> mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
> kvm: gmem: Allow restricted userspace mappings
>
> arch/x86/kvm/mmu/paging_tmpl.h | 94 +++++++++++++++++++-----
> include/linux/kvm_host.h | 5 ++
> include/linux/kvm_types.h | 1 +
> include/linux/secretmem.h | 13 +++-
> include/uapi/linux/kvm.h | 2 +
> mm/secretmem.c | 6 +-
> virt/kvm/guest_memfd.c | 83 +++++++++++++++++++--
> virt/kvm/kvm_main.c | 112 +++++++++++++++++++++++++++-
> virt/kvm/pfncache.c | 130 +++++++++++++++++++++++++++++----
> 9 files changed, 399 insertions(+), 47 deletions(-)
>
>
> base-commit: 890a64810d59b1a58ed26efc28cfd821fc068e84
* Re: [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map
2024-07-22 12:28 ` [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Vlastimil Babka (SUSE)
@ 2024-07-26 6:55 ` Patrick Roy
2024-07-30 10:17 ` David Hildenbrand
0 siblings, 1 reply; 34+ messages in thread
From: Patrick Roy @ 2024-07-26 6:55 UTC (permalink / raw)
To: Vlastimil Babka (SUSE), seanjc, pbonzini, akpm, dwmw, rppt, david
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf, derekmn,
kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
On Mon, 2024-07-22 at 13:28 +0100, "Vlastimil Babka (SUSE)" wrote:
>> === Implementation ===
>>
>> This patch series introduces a new flag for `KVM_CREATE_GUEST_MEMFD`
>> that removes the guest_memfd's pages from the direct map when they are
>> allocated. When trying to run a guest from such a guest_memfd, we now
>> face the problem that without either userspace or kernelspace mappings
>> of guest_memfd, KVM cannot access guest memory to, for example, do MMIO
>> emulation or access memory used for guest/host communication. We have
>> multiple options for
>> solving this when running non-CoCo VMs: (1) implement a TDX-light
>> solution, where the guest shares memory that KVM needs to access, and
>> relies on paravirtual solutions where this is not possible (e.g. MMIO),
>> (2) have KVM use userspace mappings of guest_memfd (e.g. a
>> memfd_secret-style solution), or (3) dynamically reinsert pages into the
>> direct map whenever KVM wants to access them.
>>
>> This RFC goes for option (3). Option (1) is a lot of overhead for very
>> little gain, since we are not actually constrained by a physical
>> inability to access guest memory (e.g. we are not in a TDX context where
>> accesses to guest memory cause a #MC). Option (2) has previously been
>> rejected [1].
>
> Do the pages have to have the same address when they are temporarily mapped?
> Wouldn't it be easier to do something similar to kmap_local_page() as used
> for HIGHMEM? I.e. you get a temporary kernel mapping to do what's needed,
> but it doesn't have to alter the shared directmap.
>
> Maybe that was already discussed somewhere as unsuitable, but I didn't spot
> it here.
For what I had prototyped here, there's no requirement to have the pages
mapped at the same address (I remember briefly looking at memremap to
achieve the temporary mappings, but since that doesn't work for normal
memory, I gave up on that path). However, I think guest_memfd is moving
in a direction where ranges marked as "in-place shared" (e.g. those
that are temporarily reinserted into the direct map in this RFC) should
be able to be GUP'd [1]. I think for that the direct map entries would
need to be present, right?
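For reference, the access pattern this RFC implements is roughly the
following - a simplified sketch, not the literal patch code, with error
handling and the pfncache integration omitted (and assuming len <=
PAGE_SIZE):

#include <linux/set_memory.h>

static int gmem_read_restoring_direct_map(struct page *page, void *dst,
                                          size_t len)
{
        unsigned long addr = (unsigned long)page_address(page);
        int r;

        /* Temporarily re-insert the page into the direct map... */
        r = set_direct_map_default_noflush(page);
        if (r)
                return r;

        memcpy(dst, page_address(page), len);

        /* ...and take it out again. The TLB may still cache the
         * translation we just used, so flush it, otherwise the page
         * would stay reachable through the direct map. */
        r = set_direct_map_invalid_noflush(page);
        flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
        return r;
}

So accesses do happen at the (fixed) direct map address, but nothing in
the approach relies on that.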
>> In this patch series, we make sufficient parts of KVM gmem-aware to be
>> able to boot a Linux initrd from private memory on x86. These include
>> KVM's MMIO emulation (including guest page table walking) and kvm-clock.
>> For VM types which do not allow accessing gmem, we return -EFAULT and
>> attempt to prepare a KVM_EXIT_MEMORY_FAULT.
>>
>> Additionally, this patch series adds support for "restricted" userspace
>> mappings of guest_memfd, which work similarly to memfd_secret (e.g.
>> get_user_pages is disallowed) and allow handling I/O and loading the
>> guest kernel in a simple way. Support for this is completely independent
>> of the rest of the functionality introduced in this patch series.
>> However, it is required to build a minimal hypervisor PoC that actually
>> allows booting a VM from a disk.
[1]: https://lore.kernel.org/kvm/489d1494-626c-40d9-89ec-4afc4cd0624b@redhat.com/T/#mc944a6fdcd20a35f654c2be99f9c91a117c1bed4
* Re: [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map
2024-07-09 13:20 [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
` (8 preceding siblings ...)
2024-07-22 12:28 ` [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Vlastimil Babka (SUSE)
@ 2024-07-26 16:44 ` Yosry Ahmed
9 siblings, 0 replies; 34+ messages in thread
From: Yosry Ahmed @ 2024-07-26 16:44 UTC (permalink / raw)
To: Patrick Roy
Cc: seanjc, pbonzini, akpm, dwmw, rppt, david, tglx, mingo, bp,
dave.hansen, x86, hpa, willy, graf, derekmn, kalyazin, kvm,
linux-kernel, linux-mm, dmatlack, tabba, chao.p.peng, xmarcalx
On Tue, Jul 9, 2024 at 6:21 AM Patrick Roy <roypat@amazon.co.uk> wrote:
>
> Hey all,
>
> This RFC series is a rough draft adding support for running
> non-confidential compute VMs in guest_memfd, based on prior discussions
> with Sean [1]. Our specific usecase for this is the ability to unmap
> guest memory from the host kernel's direct map, as a mitigation against
> a large class of speculative execution issues.
Not to sound like a salesman, but did you happen to come across the RFC for ASI?
https://lore.kernel.org/lkml/20240712-asi-rfc-24-v1-0-144b319a40d8@google.com/
The current implementation considers userspace allocations as
sensitive, so when a VM is running with ASI, the memory of other VMs
is unmapped from the direct map (i.e. in the restricted address
space). It also incorporates a mechanism to map this memory on-demand
when needed (i.e. switch to the unrestricted address space), and to run
mitigations at that point to make sure it isn't exploited.
In theory, it should be a more generic approach because it should
apply to VMs that do not use guest_memfd as well, and it should be
extensible to protect other parts of memory (e.g. sensitive kernel
allocations).
I understand that unmapping guest_memfd memory from the direct map in
general could still be favorable, and for other reasons beyond
mitigating speculative execution attacks. Just thought you may be
interested in looking at ASI.
* Re: [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings
2024-07-12 15:59 ` Patrick Roy
@ 2024-07-30 10:15 ` David Hildenbrand
2024-08-01 10:30 ` Patrick Roy
0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2024-07-30 10:15 UTC (permalink / raw)
To: Patrick Roy, seanjc, Fuad Tabba
Cc: pbonzini, akpm, dwmw, rppt, tglx, mingo, bp, dave.hansen, x86,
hpa, willy, graf, derekmn, kalyazin, kvm, linux-kernel, linux-mm,
dmatlack, chao.p.peng, xmarcalx, James Gowans
>> Hi,
>>
>> sorry for the late reply. Yes, you could have joined .... too late.
>
> No worries, I did end up joining to listen in to y'all's discussion
> anyway :)
Sorry for the late reply :(
>
>> There will be a summary posted soon. So far the agreement is that we're
>> planning on allowing shared memory as part of guest_memfd, and will allow
>> that to get mapped and pinned. Private memory is not going to get mapped
>> and pinned.
>>
>> If we have to disallow pinning of shared memory on top for some use
>> cases (i.e., no directmap), I assume that could be added.
>>
>>>
>>>> Note that just from staring at this commit, I don't understand the
>>>> motivation *why* we would want to do that.
>>>
>>> Fair - I admittedly didn't get into that as much as I probably should
>>> have. In our usecase, we do not have anything that pKVM would (I think)
>>> call "guest-private" memory. I think our memory can be better described
>>> as guest-owned, but always shared with the VMM (e.g. userspace), but
>>> ideally never shared with the host kernel. This model lets us do a lot
>>> of simplifying assumptions: Things like I/O can be handled in userspace
>>> without the guest explicitly sharing I/O buffers (which is not exactly
>>> what we would want long-term anyway, as sharing in the guest_memfd
>>> context means sharing with the host kernel), we can easily do VM
>>> snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.
>>
>> Okay, so essentially you would want to use guest_memfd to only contain
>> shared memory and disallow any pinning, like for secretmem.
>
> Yeah, this is pretty much what I thought we wanted before listening in
> on Wednesday.
>
>> I've actually been thinking about this some more since then, though. With
> hugepages, if the VM is backed by, say, 2M pages, our on-demand direct
> map insertion approach runs into the same problem that CoCo VMs have
> when they're backed by hugepages: How to deal with the guest only
> sharing a 4K range in a hugepage? If we want to restore the direct map
> for e.g. the page containing kvm-clock data, then we can't simply go
> ahead and restore the direct map for the entire 2M page, because there
> very well might be stuff in the other 511 small guest pages that we
> really do not want in the direct map. And we can't even take the
Right, you'd only want to restore the direct map for a fragment. Or
dynamically map that fragment using kmap where required (as raised by
Vlastimil).
> approach of letting the guest deal with the problem, because here
> "sharing" is driven by the host, not the guest, so the guest cannot
> possibly know that it maybe should avoid putting stuff it doesn't want
> shared into those remaining 511 pages! To me that sounds a lot like the
> whole "breaking down huge folios to allow GUP to only some parts of it"
> thing mentioned on Wednesday.
Yes. While it would be one logical huge page, it would be exposed to the
remainder of the kernel as 512 individual pages.
>
> Now, if we instead treat "guest memory without direct map entries" as
> "private", and "guest memory with direct map entries" as "shared", then
> the above will be solved by whatever mechanism allows gupping/mapping of
> only the "shared" parts of huge folios, IIUC. The fact that GUP is then
> also allowed for the "shared" parts is not actually a problem for us -
> we went down the route of disabling GUP altogether here because based on
> [1] it sounded like GUP for anything gmem related would never happen.
Right. Might there also be a case for removing the directmap for shared
memory or is that not really a requirement so far?
> But after something is re-inserted into the direct map, we don't very
> much care if it can be GUP-ed or not. In fact, allowing GUP for the
> shared parts probably makes some things easier for us, as we can then do
> I/O without bounce buffers by just in-place converting I/O-buffers to
> shared, and then treating that shared slice of guest_memfd the same way
> we treat traditional guest memory today.
Yes.
> In a very far-off future, we'd
> like to be able to do I/O without ever reinserting pages into the direct
> map, but I don't think adopting this private/shared model for gmem would
> block us from doing that?
How would that I/O get triggered? GUP would require the directmap.
>
> Although all of this does hinge on us being able to do the in-place
> shared/private conversion without any guest involvement. Do you envision
> that to be possible?
Who would trigger the conversion and how? I don't see a reason why --
for your use case -- user space shouldn't be able to trigger conversion
private <-> shared. At least nothing fundamental comes to mind that
would prohibit that.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map
2024-07-26 6:55 ` Patrick Roy
@ 2024-07-30 10:17 ` David Hildenbrand
0 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2024-07-30 10:17 UTC (permalink / raw)
To: Patrick Roy, Vlastimil Babka (SUSE), seanjc, pbonzini, akpm, dwmw, rppt
Cc: tglx, mingo, bp, dave.hansen, x86, hpa, willy, graf, derekmn,
kalyazin, kvm, linux-kernel, linux-mm, dmatlack, tabba,
chao.p.peng, xmarcalx
On 26.07.24 08:55, Patrick Roy wrote:
>
>
> On Mon, 2024-07-22 at 13:28 +0100, "Vlastimil Babka (SUSE)" wrote:
>>> === Implementation ===
>>>
>>> This patch series introduces a new flag for `KVM_CREATE_GUEST_MEMFD`
>>> that removes the guest_memfd's pages from the direct map when they are
>>> allocated. When trying to run a guest from such a guest_memfd, we now
>>> face the problem that without either userspace or kernelspace mappings
>>> of guest_memfd, KVM cannot access guest memory to, for example, do MMIO
>>> emulation or access memory used for guest/host communication. We have
>>> multiple options for
>>> solving this when running non-CoCo VMs: (1) implement a TDX-light
>>> solution, where the guest shares memory that KVM needs to access, and
>>> relies on paravirtual solutions where this is not possible (e.g. MMIO),
>>> (2) have KVM use userspace mappings of guest_memfd (e.g. a
>>> memfd_secret-style solution), or (3) dynamically reinsert pages into the
>>> direct map whenever KVM wants to access them.
>>>
>>> This RFC goes for option (3). Option (1) is a lot of overhead for very
>>> little gain, since we are not actually constrained by a physical
>>> inability to access guest memory (e.g. we are not in a TDX context where
>>> accesses to guest memory cause a #MC). Option (2) has previously been
>>> rejected [1].
>>
>> Do the pages have to have the same address when they are temporarily mapped?
>> Wouldn't it be easier to do something similar to kmap_local_page() as used
>> for HIGHMEM? I.e. you get a temporary kernel mapping to do what's needed,
>> but it doesn't have to alter the shared directmap.
>>
>> Maybe that was already discussed somewhere as unsuitable, but I didn't spot
>> it here.
>
> For what I had prototyped here, there's no requirement to have the pages
> mapped at the same address (I remember briefly looking at memremap to
> achieve the temporary mappings, but since that doesn't work for normal
> memory, I gave up on that path). However, I think guest_memfd is moving
> in a direction where ranges marked as "in-place shared" (e.g. those
> that are temporarily reinserted into the direct map in this RFC) should
> be able to be GUP'd [1]. I think for that the direct map entries would
> need to be present, right?
Yes, we'd allow GUP. Of course, one could think of a secretmem-like
extension that would allow shared memory to get mapped into user page
tables but would disallow any GUP on (shared) guest_memfd memory.
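Hypothetical sketch of what gating GUP could look like, reusing the
AS_INACCESSIBLE flag that patch 7 of this series applies to secretmem
(the helper name is made up):

static inline bool gup_allowed_for_mapping(struct address_space *mapping)
{
        /* Refuse GUP for any mapping marked inaccessible, the way this
         * series does it for secretmem. */
        return !mapping || !test_bit(AS_INACCESSIBLE, &mapping->flags);
}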
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings
2024-07-30 10:15 ` David Hildenbrand
@ 2024-08-01 10:30 ` Patrick Roy
0 siblings, 0 replies; 34+ messages in thread
From: Patrick Roy @ 2024-08-01 10:30 UTC (permalink / raw)
To: David Hildenbrand, seanjc, Fuad Tabba
Cc: pbonzini, akpm, dwmw, rppt, tglx, mingo, bp, dave.hansen, x86,
hpa, willy, graf, derekmn, kalyazin, kvm, linux-kernel, linux-mm,
dmatlack, chao.p.peng, xmarcalx, James Gowans
On Tue, 2024-07-30 at 11:15 +0100, David Hildenbrand wrote:
>>> Hi,
>>>
>>> sorry for the late reply. Yes, you could have joined .... too late.
>>
>> No worries, I did end up joining to listen in to y'all's discussion
>> anyway :)
>
> Sorry for the late reply :(
No worries :)
>>
>>> There will be a summary posted soon. So far the agreement is that we're
>>> planning on allowing shared memory as part of guest_memfd, and will allow
>>> that to get mapped and pinned. Private memory is not going to get mapped
>>> and pinned.
>>>
>>> If we have to disallow pinning of shared memory on top for some use
>>> cases (i.e., no directmap), I assume that could be added.
>>>
>>>>
>>>>> Note that just from staring at this commit, I don't understand the
>>>>> motivation *why* we would want to do that.
>>>>
>>>> Fair - I admittedly didn't get into that as much as I probably should
>>>> have. In our usecase, we do not have anything that pKVM would (I think)
>>>> call "guest-private" memory. I think our memory can be better described
>>>> as guest-owned, but always shared with the VMM (e.g. userspace), but
>>>> ideally never shared with the host kernel. This model lets us do a lot
>>>> of simplifying assumptions: Things like I/O can be handled in userspace
>>>> without the guest explicitly sharing I/O buffers (which is not exactly
>>>> what we would want long-term anyway, as sharing in the guest_memfd
>>>> context means sharing with the host kernel), we can easily do VM
>>>> snapshotting without needing things like TDX's TDH.EXPORT.MEM APIs, etc.
>>>
>>> Okay, so essentially you would want to use guest_memfd to only contain
>>> shared memory and disallow any pinning, like for secretmem.
>>
>> Yeah, this is pretty much what I thought we wanted before listening in
>> on Wednesday.
>>
>> I've actually been thinking about this some more since then, though. With
>> hugepages, if the VM is backed by, say, 2M pages, our on-demand direct
>> map insertion approach runs into the same problem that CoCo VMs have
>> when they're backed by hugepages: How to deal with the guest only
>> sharing a 4K range in a hugepage? If we want to restore the direct map
>> for e.g. the page containing kvm-clock data, then we can't simply go
>> ahead and restore the direct map for the entire 2M page, because there
>> very well might be stuff in the other 511 small guest pages that we
>> really do not want in the direct map. And we can't even take the
>
> Right, you'd only want to restore the direct map for a fragment. Or
> dynamically map that fragment using kmap where required (as raised by
> Vlastimil).
Can the kmap approach work if the memory is supposed to be GUP-able?
>> approach of letting the guest deal with the problem, because here
>> "sharing" is driven by the host, not the guest, so the guest cannot
>> possibly know that it maybe should avoid putting stuff it doesn't want
>> shared into those remaining 511 pages! To me that sounds a lot like the
>> whole "breaking down huge folios to allow GUP to only some parts of it"
>> thing mentioned on Wednesday.
>
> Yes. While it would be one logical huge page, it would be exposed to the
> remainder of the kernel as 512 individual pages.
>
>>
>> Now, if we instead treat "guest memory without direct map entries" as
>> "private", and "guest memory with direct map entries" as "shared", then
>> the above will be solved by whatever mechanism allows gupping/mapping of
>> only the "shared" parts of huge folios, IIUC. The fact that GUP is then
>> also allowed for the "shared" parts is not actually a problem for us -
>> we went down the route of disabling GUP altogether here because based on
>> [1] it sounded like GUP for anything gmem related would never happen.
>
> Right. Might there also be a case for removing the directmap for shared
> memory or is that not really a requirement so far?
No, not really - we would only mark as "shared" memory that _needs_ to
be in the direct map for functional reasons (e.g. MMIO instruction
emulation, etc.).
>> But after something is re-inserted into the direct map, we don't very
>> much care if it can be GUP-ed or not. In fact, allowing GUP for the
>> shared parts probably makes some things easier for us, as we can then do
>> I/O without bounce buffers by just in-place converting I/O-buffers to
>> shared, and then treating that shared slice of guest_memfd the same way
>> we treat traditional guest memory today.
>
> Yes.
>
>> In a very far-off future, we'd
>> like to be able to do I/O without ever reinserting pages into the direct
>> map, but I don't think adopting this private/shared model for gmem would
>> block us from doing that?
>
> How would that I/O get triggered? GUP would require the directmap.
I was hoping that this "phyr" thing Matthew has been talking about [1]
would somehow allow doing I/O without direct map entries/GUP, but maybe
I am misunderstanding something.
>>
>> Although all of this does hinge on us being able to do the in-place
>> shared/private conversion without any guest involvement. Do you envision
>> that to be possible?
>
> Who would trigger the conversion and how? I don't see a reason why --
> for your use case -- user space shouldn't be able to trigger conversion
> private <-> shared. At least nothing fundamental comes to mind that
> would prohibit that.
Either KVM itself would trigger the conversions whenever it wants to
access gmem (e.g. each place in this series that currently calls
set_direct_map_{invalid,default} would instead do a shared/private
conversion), or userspace would do it via some syscall/ioctl - see the
hypothetical sketch below. The one place I can think of right now is
I/O, where the VMM receives a virtio buffer from the guest and converts
it from private to shared in-place (although I guess 2 syscalls for each
I/O operation aren't great perf-wise, so maybe swiotlb still wins out
here?).
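Purely hypothetical sketch of the userspace-triggered variant - no such
ioctl exists today; the name, number and layout are made up for
illustration only:

/* Userspace asks KVM to convert a guest_memfd range in-place. */
struct kvm_gmem_convert {
        __u64 offset;           /* byte offset into the guest_memfd */
        __u64 len;              /* length of the range */
        __u64 to_shared;        /* 1: private -> shared, 0: reverse */
};

#define KVM_GMEM_CONVERT _IOW(KVMIO, 0xff, struct kvm_gmem_convert)

The VMM would then convert a virtio buffer to shared before touching it
and back to private afterwards - the two syscalls per I/O operation
mentioned above.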
I actually see that Fuad just posted an RFC series that implements the
basic shared/private handling [2], so I will probably also comment on
this over there after I've had a closer look :)
> --
> Cheers,
>
> David / dhildenb
Best,
Patrick
[1]: https://lore.kernel.org/netdev/Yd0IeK5s%2FE0fuWqn@casper.infradead.org/T/
[2]: https://lore.kernel.org/kvm/20240801090117.3841080-1-tabba@google.com/T/#t
end of thread
Thread overview: 34+ messages
2024-07-09 13:20 [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 1/8] kvm: Allow reading/writing gmem using kvm_{read,write}_guest Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 2/8] kvm: use slowpath in gfn_to_hva_cache if memory is private Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 3/8] kvm: pfncache: enlighten about gmem Patrick Roy
2024-07-09 14:36 ` David Woodhouse
2024-07-10 9:49 ` Patrick Roy
2024-07-10 10:20 ` David Woodhouse
2024-07-10 10:46 ` Patrick Roy
2024-07-10 10:50 ` David Woodhouse
2024-07-09 13:20 ` [RFC PATCH 4/8] kvm: x86: support walking guest page tables in gmem Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 5/8] kvm: gmem: add option to remove guest private memory from direct map Patrick Roy
2024-07-10 7:31 ` Mike Rapoport
2024-07-10 9:50 ` Patrick Roy
2024-07-09 13:20 ` [RFC PATCH 6/8] kvm: gmem: Temporarily restore direct map entries when needed Patrick Roy
2024-07-11 6:25 ` Paolo Bonzini
2024-07-09 13:20 ` [RFC PATCH 7/8] mm: secretmem: use AS_INACCESSIBLE to prohibit GUP Patrick Roy
2024-07-09 21:09 ` David Hildenbrand
2024-07-10 7:32 ` Mike Rapoport
2024-07-10 9:50 ` Patrick Roy
2024-07-10 21:14 ` David Hildenbrand
2024-07-09 13:20 ` [RFC PATCH 8/8] kvm: gmem: Allow restricted userspace mappings Patrick Roy
2024-07-09 14:48 ` Fuad Tabba
2024-07-09 21:13 ` David Hildenbrand
2024-07-10 9:51 ` Patrick Roy
2024-07-10 21:12 ` David Hildenbrand
2024-07-10 21:53 ` Sean Christopherson
2024-07-10 21:56 ` David Hildenbrand
2024-07-12 15:59 ` Patrick Roy
2024-07-30 10:15 ` David Hildenbrand
2024-08-01 10:30 ` Patrick Roy
2024-07-22 12:28 ` [RFC PATCH 0/8] Unmapping guest_memfd from Direct Map Vlastimil Babka (SUSE)
2024-07-26 6:55 ` Patrick Roy
2024-07-30 10:17 ` David Hildenbrand
2024-07-26 16:44 ` Yosry Ahmed