* [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support
@ 2024-12-13 16:47 Fuad Tabba
2024-12-13 16:47 ` [RFC PATCH v4 01/14] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
` (15 more replies)
0 siblings, 16 replies; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:47 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
This series adds restricted mmap() support to guest_memfd, as
well as support for guest_memfd on arm64. It is based on Linux
6.13-rc2. Please refer to v3 for the context [1].
Main changes since v3:
- Added a new folio type for guestmem, used to register a
callback when a folio's reference count reaches 0 (Matthew
Wilcox, DavidH) [2]
- Introduced new mappability states for folios, where a folio can
be mappable by both the host and the guest, only the guest, or by
no one (a transient state)
- Rebased on Linux 6.13-rc2
- Refactoring and tidying up
Cheers,
/fuad
[1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@google.com/
[2] https://lore.kernel.org/all/20241108162040.159038-1-tabba@google.com/
Ackerley Tng (2):
KVM: guest_memfd: Make guest mem use guest mem inodes instead of
anonymous inodes
KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
Fuad Tabba (12):
mm: Consolidate freeing of typed folios on final folio_put()
KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains
the folio lock
KVM: guest_memfd: Folio mappability states and functions that manage
their transition
KVM: guest_memfd: Handle final folio_put() of guestmem pages
KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
KVM: guest_memfd: Add guest_memfd support to
kvm_(read|write)_guest_page()
KVM: guest_memfd: Add KVM capability to check if guest_memfd is host
mappable
KVM: guest_memfd: Add a guest_memfd() flag to initialize it as
mappable
KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is
allowed
KVM: arm64: Skip VMA checks for slots without userspace address
KVM: arm64: Handle guest_memfd()-backed guest page faults
KVM: arm64: Enable guest_memfd private memory when pKVM is enabled
Documentation/virt/kvm/api.rst | 4 +
arch/arm64/include/asm/kvm_host.h | 3 +
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/mmu.c | 119 +++-
include/linux/kvm_host.h | 75 +++
include/linux/page-flags.h | 22 +
include/uapi/linux/kvm.h | 2 +
include/uapi/linux/magic.h | 1 +
mm/debug.c | 1 +
mm/swap.c | 28 +-
tools/testing/selftests/kvm/Makefile | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 64 +-
virt/kvm/Kconfig | 4 +
virt/kvm/guest_memfd.c | 579 +++++++++++++++++-
virt/kvm/kvm_main.c | 229 ++++++-
15 files changed, 1074 insertions(+), 59 deletions(-)
base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
--
2.47.1.613.gc27f4b7a9f-goog
* [RFC PATCH v4 01/14] mm: Consolidate freeing of typed folios on final folio_put()
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
@ 2024-12-13 16:47 ` Fuad Tabba
2024-12-13 16:47 ` [RFC PATCH v4 02/14] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
` (14 subsequent siblings)
15 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:47 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Some folio types, such as hugetlb, handle freeing their own
folios. Moreover, guest_memfd will need to be notified once a
folio's reference count reaches 0, to facilitate shared-to-private
folio conversion without the folio actually being freed at that
point.
As a first step towards that, this patch consolidates freeing
folios that have a type. The first user is hugetlb folios. Later
in this patch series, guest_memfd will become the second user of
this.
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
include/linux/page-flags.h | 15 +++++++++++++++
mm/swap.c | 24 +++++++++++++++++++-----
2 files changed, 34 insertions(+), 5 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index cf46ac720802..aca57802d7c7 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -970,6 +970,21 @@ static inline bool page_has_type(const struct page *page)
return page_mapcount_is_type(data_race(page->page_type));
}
+static inline int page_get_type(const struct page *page)
+{
+ return page->page_type >> 24;
+}
+
+static inline bool folio_has_type(const struct folio *folio)
+{
+ return page_has_type(&folio->page);
+}
+
+static inline int folio_get_type(const struct folio *folio)
+{
+ return page_get_type(&folio->page);
+}
+
#define FOLIO_TYPE_OPS(lname, fname) \
static __always_inline bool folio_test_##fname(const struct folio *folio) \
{ \
diff --git a/mm/swap.c b/mm/swap.c
index 10decd9dffa1..6f01b56bce13 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -94,6 +94,20 @@ static void page_cache_release(struct folio *folio)
unlock_page_lruvec_irqrestore(lruvec, flags);
}
+static void free_typed_folio(struct folio *folio)
+{
+ switch (folio_get_type(folio)) {
+ case PGTY_hugetlb:
+ free_huge_folio(folio);
+ return;
+ case PGTY_offline:
+ /* Nothing to do, it's offline. */
+ return;
+ default:
+ WARN_ON_ONCE(1);
+ }
+}
+
void __folio_put(struct folio *folio)
{
if (unlikely(folio_is_zone_device(folio))) {
@@ -101,8 +115,8 @@ void __folio_put(struct folio *folio)
return;
}
- if (folio_test_hugetlb(folio)) {
- free_huge_folio(folio);
+ if (unlikely(folio_has_type(folio))) {
+ free_typed_folio(folio);
return;
}
@@ -934,13 +948,13 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
if (!folio_ref_sub_and_test(folio, nr_refs))
continue;
- /* hugetlb has its own memcg */
- if (folio_test_hugetlb(folio)) {
+ if (unlikely(folio_has_type(folio))) {
+ /* typed folios have their own memcg, if any */
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- free_huge_folio(folio);
+ free_typed_folio(folio);
continue;
}
folio_unqueue_deferred_split(folio);
--
2.47.1.613.gc27f4b7a9f-goog
* [RFC PATCH v4 02/14] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
2024-12-13 16:47 ` [RFC PATCH v4 01/14] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
@ 2024-12-13 16:47 ` Fuad Tabba
2024-12-13 16:47 ` [RFC PATCH v4 03/14] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
` (13 subsequent siblings)
15 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:47 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
From: Ackerley Tng <ackerleytng@google.com>
Using guest mem inodes allows us to store metadata for the backing
memory on the inode. Metadata will be added in a later patch to
support HugeTLB pages.
Metadata about backing memory should not be stored on the file, since
the file represents a guest_memfd's binding with a struct kvm, and
metadata about backing memory is not unique to a specific binding and
struct kvm.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
include/uapi/linux/magic.h | 1 +
virt/kvm/guest_memfd.c | 119 ++++++++++++++++++++++++++++++-------
2 files changed, 100 insertions(+), 20 deletions(-)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..169dba2a6920 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -103,5 +103,6 @@
#define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
#define PID_FS_MAGIC 0x50494446 /* "PIDF" */
+#define GUEST_MEMORY_MAGIC 0x474d454d /* "GMEM" */
#endif /* __LINUX_MAGIC_H__ */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 47a9f68f7b24..198554b1f0b5 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1,12 +1,17 @@
// SPDX-License-Identifier: GPL-2.0
+#include <linux/fs.h>
+#include <linux/mount.h>
#include <linux/backing-dev.h>
#include <linux/falloc.h>
#include <linux/kvm_host.h>
+#include <linux/pseudo_fs.h>
#include <linux/pagemap.h>
#include <linux/anon_inodes.h>
#include "kvm_mm.h"
+static struct vfsmount *kvm_gmem_mnt;
+
struct kvm_gmem {
struct kvm *kvm;
struct xarray bindings;
@@ -307,6 +312,38 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
return gfn - slot->base_gfn + slot->gmem.pgoff;
}
+static const struct super_operations kvm_gmem_super_operations = {
+ .statfs = simple_statfs,
+};
+
+static int kvm_gmem_init_fs_context(struct fs_context *fc)
+{
+ struct pseudo_fs_context *ctx;
+
+ if (!init_pseudo(fc, GUEST_MEMORY_MAGIC))
+ return -ENOMEM;
+
+ ctx = fc->fs_private;
+ ctx->ops = &kvm_gmem_super_operations;
+
+ return 0;
+}
+
+static struct file_system_type kvm_gmem_fs = {
+ .name = "kvm_guest_memory",
+ .init_fs_context = kvm_gmem_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
+static void kvm_gmem_init_mount(void)
+{
+ kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
+ BUG_ON(IS_ERR(kvm_gmem_mnt));
+
+ /* For giggles. Userspace can never map this anyways. */
+ kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
+}
+
static struct file_operations kvm_gmem_fops = {
.open = generic_file_open,
.release = kvm_gmem_release,
@@ -316,6 +353,8 @@ static struct file_operations kvm_gmem_fops = {
void kvm_gmem_init(struct module *module)
{
kvm_gmem_fops.owner = module;
+
+ kvm_gmem_init_mount();
}
static int kvm_gmem_migrate_folio(struct address_space *mapping,
@@ -397,11 +436,67 @@ static const struct inode_operations kvm_gmem_iops = {
.setattr = kvm_gmem_setattr,
};
+static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
+ loff_t size, u64 flags)
+{
+ const struct qstr qname = QSTR_INIT(name, strlen(name));
+ struct inode *inode;
+ int err;
+
+ inode = alloc_anon_inode(kvm_gmem_mnt->mnt_sb);
+ if (IS_ERR(inode))
+ return inode;
+
+ err = security_inode_init_security_anon(inode, &qname, NULL);
+ if (err) {
+ iput(inode);
+ return ERR_PTR(err);
+ }
+
+ inode->i_private = (void *)(unsigned long)flags;
+ inode->i_op = &kvm_gmem_iops;
+ inode->i_mapping->a_ops = &kvm_gmem_aops;
+ inode->i_mode |= S_IFREG;
+ inode->i_size = size;
+ mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ mapping_set_inaccessible(inode->i_mapping);
+ /* Unmovable mappings are supposed to be marked unevictable as well. */
+ WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+ return inode;
+}
+
+static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
+ u64 flags)
+{
+ static const char *name = "[kvm-gmem]";
+ struct inode *inode;
+ struct file *file;
+
+ if (kvm_gmem_fops.owner && !try_module_get(kvm_gmem_fops.owner))
+ return ERR_PTR(-ENOENT);
+
+ inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+
+ file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
+ &kvm_gmem_fops);
+ if (IS_ERR(file)) {
+ iput(inode);
+ return file;
+ }
+
+ file->f_mapping = inode->i_mapping;
+ file->f_flags |= O_LARGEFILE;
+ file->private_data = priv;
+
+ return file;
+}
+
static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
{
- const char *anon_name = "[kvm-gmem]";
struct kvm_gmem *gmem;
- struct inode *inode;
struct file *file;
int fd, err;
@@ -415,32 +510,16 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
goto err_fd;
}
- file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
- O_RDWR, NULL);
+ file = kvm_gmem_inode_create_getfile(gmem, size, flags);
if (IS_ERR(file)) {
err = PTR_ERR(file);
goto err_gmem;
}
- file->f_flags |= O_LARGEFILE;
-
- inode = file->f_inode;
- WARN_ON(file->f_mapping != inode->i_mapping);
-
- inode->i_private = (void *)(unsigned long)flags;
- inode->i_op = &kvm_gmem_iops;
- inode->i_mapping->a_ops = &kvm_gmem_aops;
- inode->i_mode |= S_IFREG;
- inode->i_size = size;
- mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
- mapping_set_inaccessible(inode->i_mapping);
- /* Unmovable mappings are supposed to be marked unevictable as well. */
- WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
-
kvm_get_kvm(kvm);
gmem->kvm = kvm;
xa_init(&gmem->bindings);
- list_add(&gmem->entry, &inode->i_mapping->i_private_list);
+ list_add(&gmem->entry, &file_inode(file)->i_mapping->i_private_list);
fd_install(fd, file);
return fd;
--
2.47.1.613.gc27f4b7a9f-goog
* [RFC PATCH v4 03/14] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
2024-12-13 16:47 ` [RFC PATCH v4 01/14] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
2024-12-13 16:47 ` [RFC PATCH v4 02/14] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
@ 2024-12-13 16:47 ` Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 04/14] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private Fuad Tabba
` (12 subsequent siblings)
15 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:47 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Create a new variant of kvm_gmem_get_pfn(), which retains the
folio lock if it returns successfully. This is needed in
subsequent patches in order to protect against races when
checking whether a folio can be mapped by the host.
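As an illustration, here is a minimal sketch of the intended caller
pattern (the surrounding kvm, slot, gfn and data/offset/len context is
assumed; kvm_gmem_is_mappable() is only introduced later in this
series, and a later patch uses this exact pattern):

	struct page *page;
	kvm_pfn_t pfn;
	int r;

	/* The folio is returned locked, so its mappability cannot change. */
	r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, &page, NULL);
	if (r)
		return r;

	if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1)) {
		r = -EPERM;
		goto unlock;
	}
	memcpy(data, page_address(page) + offset, len);
unlock:
	unlock_page(page);
	if (r)
		put_page(page);
	else
		kvm_release_page_clean(page);

	return r;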
Signed-off-by: Fuad Tabba <tabba@google.com>
---
include/linux/kvm_host.h | 11 +++++++++++
virt/kvm/guest_memfd.c | 27 ++++++++++++++++++++-------
2 files changed, 31 insertions(+), 7 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 401439bb21e3..cda3ed4c3c27 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2500,6 +2500,9 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
int *max_order);
+int kvm_gmem_get_pfn_locked(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+ int *max_order);
#else
static inline int kvm_gmem_get_pfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
@@ -2509,6 +2512,14 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
KVM_BUG_ON(1, kvm);
return -EIO;
}
+static inline int kvm_gmem_get_pfn_locked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn,
+ struct page **page, int *max_order)
+{
+ KVM_BUG_ON(1, kvm);
+ return -EIO;
+}
#endif /* CONFIG_KVM_PRIVATE_MEM */
#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 198554b1f0b5..6453658d2650 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -672,9 +672,9 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file,
return folio;
}
-int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
- gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
- int *max_order)
+int kvm_gmem_get_pfn_locked(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+ int *max_order)
{
pgoff_t index = kvm_gmem_get_index(slot, gfn);
struct file *file = kvm_gmem_get_file(slot);
@@ -694,17 +694,30 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
if (!is_prepared)
r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
- folio_unlock(folio);
-
- if (!r)
+ if (!r) {
*page = folio_file_page(folio, index);
- else
+ } else {
+ folio_unlock(folio);
folio_put(folio);
+ }
out:
fput(file);
return r;
}
+EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn_locked);
+
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+ int *max_order)
+{
+ int r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, pfn, page, max_order);
+
+ if (!r)
+ unlock_page(*page);
+
+ return r;
+}
EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
--
2.47.1.613.gc27f4b7a9f-goog
* [RFC PATCH v4 04/14] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (2 preceding siblings ...)
2024-12-13 16:47 ` [RFC PATCH v4 03/14] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
@ 2024-12-13 16:48 ` Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 05/14] KVM: guest_memfd: Folio mappability states and functions that manage their transition Fuad Tabba
` (11 subsequent siblings)
15 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:48 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
From: Ackerley Tng <ackerleytng@google.com>
Track whether guest_memfd memory can be mapped within the inode,
since it is a property of the guest_memfd's memory contents.
The guest_memfd PRIVATE memory attribute is not used for two
reasons. First, because it reflects the userspace expectation for
that memory location, and can therefore be toggled by userspace.
Second, although each guest_memfd file has a 1:1 binding with a
KVM instance, the plan is to allow multiple files per inode, e.g.
to allow intra-host migration to a new KVM instance without
destroying guest_memfd.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
virt/kvm/guest_memfd.c | 56 ++++++++++++++++++++++++++++++++++++++----
1 file changed, 51 insertions(+), 5 deletions(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 6453658d2650..0a7b6cf8bd8f 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -18,6 +18,17 @@ struct kvm_gmem {
struct list_head entry;
};
+struct kvm_gmem_inode_private {
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+ struct xarray mappable_offsets;
+#endif
+};
+
+static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
+{
+ return inode->i_mapping->i_private_data;
+}
+
/**
* folio_file_pfn - like folio_file_page, but return a pfn.
* @folio: The folio which contains this index.
@@ -312,8 +323,28 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
return gfn - slot->base_gfn + slot->gmem.pgoff;
}
+static void kvm_gmem_evict_inode(struct inode *inode)
+{
+ struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
+
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+ /*
+ * .evict_inode can be called before private data is set up if there are
+ * issues during inode creation.
+ */
+ if (private)
+ xa_destroy(&private->mappable_offsets);
+#endif
+
+ truncate_inode_pages_final(inode->i_mapping);
+
+ kfree(private);
+ clear_inode(inode);
+}
+
static const struct super_operations kvm_gmem_super_operations = {
- .statfs = simple_statfs,
+ .statfs = simple_statfs,
+ .evict_inode = kvm_gmem_evict_inode,
};
static int kvm_gmem_init_fs_context(struct fs_context *fc)
@@ -440,6 +471,7 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
loff_t size, u64 flags)
{
const struct qstr qname = QSTR_INIT(name, strlen(name));
+ struct kvm_gmem_inode_private *private;
struct inode *inode;
int err;
@@ -448,10 +480,19 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
return inode;
err = security_inode_init_security_anon(inode, &qname, NULL);
- if (err) {
- iput(inode);
- return ERR_PTR(err);
- }
+ if (err)
+ goto out;
+
+ err = -ENOMEM;
+ private = kzalloc(sizeof(*private), GFP_KERNEL);
+ if (!private)
+ goto out;
+
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+ xa_init(&private->mappable_offsets);
+#endif
+
+ inode->i_mapping->i_private_data = private;
inode->i_private = (void *)(unsigned long)flags;
inode->i_op = &kvm_gmem_iops;
@@ -464,6 +505,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
return inode;
+
+out:
+ iput(inode);
+
+ return ERR_PTR(err);
}
static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
--
2.47.1.613.gc27f4b7a9f-goog
* [RFC PATCH v4 05/14] KVM: guest_memfd: Folio mappability states and functions that manage their transition
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (3 preceding siblings ...)
2024-12-13 16:48 ` [RFC PATCH v4 04/14] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private Fuad Tabba
@ 2024-12-13 16:48 ` Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 06/14] KVM: guest_memfd: Handle final folio_put() of guestmem pages Fuad Tabba
` (10 subsequent siblings)
15 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:48 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
To allow restricted mapping of guest_memfd folios by the host,
guest_memfd needs to track whether they can be mapped, and by
whom, since mapping will only be allowed under conditions where
it is safe to access these folios. These conditions depend on the
folios being explicitly shared with the host, or not yet exposed
to the guest (e.g., at initialization).
This patch introduces states that determine whether the host and
the guest can fault in the folios, as well as the functions that
manage the transitions between those states.
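As an illustration, here is a minimal sketch (hypothetical callers, not
part of this patch) of how guest share/unshare events would drive these
transitions using the new helpers:

	/* Share: GUEST_MAPPABLE (or NONE_MAPPABLE) -> ALL_MAPPABLE. */
	static int gmem_handle_share_example(struct kvm *kvm, gfn_t gfn)
	{
		return kvm_gmem_set_mappable(kvm, gfn, gfn + 1);
	}

	/*
	 * Unshare: ALL_MAPPABLE -> GUEST_MAPPABLE if the host holds no
	 * extra references to the folio, otherwise -> NONE_MAPPABLE until
	 * the remaining host references are dropped (handled later in
	 * this series).
	 */
	static int gmem_handle_unshare_example(struct kvm *kvm, gfn_t gfn)
	{
		return kvm_gmem_clear_mappable(kvm, gfn, gfn + 1);
	}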
Signed-off-by: Fuad Tabba <tabba@google.com>
---
include/linux/kvm_host.h | 53 ++++++++++++++
virt/kvm/guest_memfd.c | 153 +++++++++++++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 92 +++++++++++++++++++++++
3 files changed, 298 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cda3ed4c3c27..84aa7908a5dd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2564,4 +2564,57 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
struct kvm_pre_fault_memory *range);
#endif
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end);
+int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end);
+int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end);
+int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot, gfn_t start,
+ gfn_t end);
+int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start,
+ gfn_t end);
+bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
+bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
+#else
+static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end)
+{
+ WARN_ON_ONCE(1);
+ return false;
+}
+static inline int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+}
+static inline int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start,
+ gfn_t end)
+{
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+}
+static inline int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot,
+ gfn_t start, gfn_t end)
+{
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+}
+static inline int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot,
+ gfn_t start, gfn_t end)
+{
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+}
+static inline bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot,
+ gfn_t gfn)
+{
+ WARN_ON_ONCE(1);
+ return false;
+}
+static inline bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot,
+ gfn_t gfn)
+{
+ WARN_ON_ONCE(1);
+ return false;
+}
+#endif /* CONFIG_KVM_GMEM_MAPPABLE */
+
#endif
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 0a7b6cf8bd8f..d1c192927cf7 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -375,6 +375,159 @@ static void kvm_gmem_init_mount(void)
kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
}
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+/*
+ * An enum of the valid states that describe who can map a folio.
+ * Bit 0: if set guest cannot map the page
+ * Bit 1: if set host cannot map the page
+ */
+enum folio_mappability {
+ KVM_GMEM_ALL_MAPPABLE = 0b00, /* Mappable by host and guest. */
+ KVM_GMEM_GUEST_MAPPABLE = 0b10, /* Mappable only by guest. */
+ KVM_GMEM_NONE_MAPPABLE = 0b11, /* Not mappable, transient state. */
+};
+
+/*
+ * Marks the range [start, end) as mappable by both the host and the guest.
+ * Usually called when guest shares memory with the host.
+ */
+static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+ struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+ void *xval = xa_mk_value(KVM_GMEM_ALL_MAPPABLE);
+ pgoff_t i;
+ int r = 0;
+
+ filemap_invalidate_lock(inode->i_mapping);
+ for (i = start; i < end; i++) {
+ r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
+ if (r)
+ break;
+ }
+ filemap_invalidate_unlock(inode->i_mapping);
+
+ return r;
+}
+
+/*
+ * Marks the range [start, end) as not mappable by the host. If the host doesn't
+ * have any references to a particular folio, then that folio is marked as
+ * mappable by the guest.
+ *
+ * However, if the host still has references to the folio, then the folio is
+ * marked as not mappable by anyone. Marking it as not mappable allows the
+ * remaining host references to drain, and ensures that the hypervisor does
+ * not transition the folio to private while the host might still access it.
+ *
+ * Usually called when guest unshares memory with the host.
+ */
+static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+ struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+ void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
+ void *xval_none = xa_mk_value(KVM_GMEM_NONE_MAPPABLE);
+ pgoff_t i;
+ int r = 0;
+
+ filemap_invalidate_lock(inode->i_mapping);
+ for (i = start; i < end; i++) {
+ struct folio *folio;
+ int refcount = 0;
+
+ folio = filemap_lock_folio(inode->i_mapping, i);
+ if (!IS_ERR(folio)) {
+ refcount = folio_ref_count(folio);
+ } else {
+ r = PTR_ERR(folio);
+ if (WARN_ON_ONCE(r != -ENOENT))
+ break;
+
+ folio = NULL;
+ }
+
+ /* +1 references are expected because of filemap_lock_folio(). */
+ if (folio && refcount > folio_nr_pages(folio) + 1) {
+ /*
+ * Outstanding references, the folio cannot be faulted
+ * in by anyone until they're dropped.
+ */
+ r = xa_err(xa_store(mappable_offsets, i, xval_none, GFP_KERNEL));
+ } else {
+ /*
+ * No outstanding references. Transition the folio to
+ * guest mappable immediately.
+ */
+ r = xa_err(xa_store(mappable_offsets, i, xval_guest, GFP_KERNEL));
+ }
+
+ if (folio) {
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+
+ if (WARN_ON_ONCE(r))
+ break;
+ }
+ filemap_invalidate_unlock(inode->i_mapping);
+
+ return r;
+}
+
+static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
+{
+ struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+ unsigned long r;
+
+ r = xa_to_value(xa_load(mappable_offsets, pgoff));
+
+ return (r == KVM_GMEM_ALL_MAPPABLE);
+}
+
+static bool gmem_is_guest_mappable(struct inode *inode, pgoff_t pgoff)
+{
+ struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+ unsigned long r;
+
+ r = xa_to_value(xa_load(mappable_offsets, pgoff));
+
+ return (r == KVM_GMEM_ALL_MAPPABLE || r == KVM_GMEM_GUEST_MAPPABLE);
+}
+
+int kvm_slot_gmem_set_mappable(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
+{
+ struct inode *inode = file_inode(slot->gmem.file);
+ pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
+ pgoff_t end_off = start_off + end - start;
+
+ return gmem_set_mappable(inode, start_off, end_off);
+}
+
+int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start, gfn_t end)
+{
+ struct inode *inode = file_inode(slot->gmem.file);
+ pgoff_t start_off = slot->gmem.pgoff + start - slot->base_gfn;
+ pgoff_t end_off = start_off + end - start;
+
+ return gmem_clear_mappable(inode, start_off, end_off);
+}
+
+bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ struct inode *inode = file_inode(slot->gmem.file);
+ unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
+
+ return gmem_is_mappable(inode, pgoff);
+}
+
+bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ struct inode *inode = file_inode(slot->gmem.file);
+ unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
+
+ return gmem_is_guest_mappable(inode, pgoff);
+}
+#endif /* CONFIG_KVM_GMEM_MAPPABLE */
+
static struct file_operations kvm_gmem_fops = {
.open = generic_file_open,
.release = kvm_gmem_release,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de2c11dae231..fffff01cebe7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3094,6 +3094,98 @@ static int next_segment(unsigned long len, int offset)
return len;
}
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ struct kvm_memslot_iter iter;
+ bool r = true;
+
+ mutex_lock(&kvm->slots_lock);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
+ struct kvm_memory_slot *memslot = iter.slot;
+ gfn_t gfn_start, gfn_end, i;
+
+ if (!kvm_slot_can_be_private(memslot))
+ continue;
+
+ gfn_start = max(start, memslot->base_gfn);
+ gfn_end = min(end, memslot->base_gfn + memslot->npages);
+ if (WARN_ON_ONCE(gfn_start >= gfn_end))
+ continue;
+
+ for (i = gfn_start; i < gfn_end; i++) {
+ r = kvm_slot_gmem_is_mappable(memslot, i);
+ if (r)
+ goto out;
+ }
+ }
+out:
+ mutex_unlock(&kvm->slots_lock);
+
+ return r;
+}
+
+int kvm_gmem_set_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ struct kvm_memslot_iter iter;
+ int r = 0;
+
+ mutex_lock(&kvm->slots_lock);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
+ struct kvm_memory_slot *memslot = iter.slot;
+ gfn_t gfn_start, gfn_end;
+
+ if (!kvm_slot_can_be_private(memslot))
+ continue;
+
+ gfn_start = max(start, memslot->base_gfn);
+ gfn_end = min(end, memslot->base_gfn + memslot->npages);
+ if (WARN_ON_ONCE(gfn_start >= gfn_end))
+ continue;
+
+ r = kvm_slot_gmem_set_mappable(memslot, gfn_start, gfn_end);
+ if (WARN_ON_ONCE(r))
+ break;
+ }
+
+ mutex_unlock(&kvm->slots_lock);
+
+ return r;
+}
+
+int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ struct kvm_memslot_iter iter;
+ int r = 0;
+
+ mutex_lock(&kvm->slots_lock);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, kvm_memslots(kvm), start, end) {
+ struct kvm_memory_slot *memslot = iter.slot;
+ gfn_t gfn_start, gfn_end;
+
+ if (!kvm_slot_can_be_private(memslot))
+ continue;
+
+ gfn_start = max(start, memslot->base_gfn);
+ gfn_end = min(end, memslot->base_gfn + memslot->npages);
+ if (WARN_ON_ONCE(gfn_start >= gfn_end))
+ continue;
+
+ r = kvm_slot_gmem_clear_mappable(memslot, gfn_start, gfn_end);
+ if (WARN_ON_ONCE(r))
+ break;
+ }
+
+ mutex_unlock(&kvm->slots_lock);
+
+ return r;
+}
+
+#endif /* CONFIG_KVM_GMEM_MAPPABLE */
+
/* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
void *data, int offset, int len)
--
2.47.1.613.gc27f4b7a9f-goog
* [RFC PATCH v4 06/14] KVM: guest_memfd: Handle final folio_put() of guestmem pages
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (4 preceding siblings ...)
2024-12-13 16:48 ` [RFC PATCH v4 05/14] KVM: guest_memfd: Folio mappability states and functions that manage their transition Fuad Tabba
@ 2024-12-13 16:48 ` Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 07/14] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared Fuad Tabba
` (9 subsequent siblings)
15 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:48 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Before transitioning a guest_memfd folio to unshared, thereby
disallowing access by the host and allowing the hypervisor to
transition its view of the guest page to private, we need to be
sure that the host doesn't have any references to the folio.
This patch introduces a new type for guest_memfd folios, and uses
it to register a callback that informs the guest_memfd subsystem
when the last reference to a folio is dropped, and therefore that
the host has no remaining references.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
The function kvm_slot_gmem_register_callback() isn't used in this
series. It will be used later in code that performs unsharing of
memory. I have tested it with pKVM, based on downstream code [*].
It's included in this RFC since it demonstrates the plan to
handle unsharing of private folios.
[*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v4-pkvm
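For reference, a hypothetical unshare path (not part of this series,
the function name is illustrative only) showing how the callback is
meant to be used together with the mappability helpers introduced in
the previous patches:

	static int gmem_unshare_gfn_example(struct kvm_memory_slot *slot,
					    gfn_t gfn)
	{
		int r;

		/* Stop the host from faulting the page in from now on. */
		r = kvm_slot_gmem_clear_mappable(slot, gfn, gfn + 1);
		if (r)
			return r;

		/*
		 * 0: the host already had no references and the folio is
		 * now guest mappable.  -EAGAIN: host references remain;
		 * kvm_gmem_handle_folio_put() completes the transition
		 * once the last one is dropped, so the caller waits or
		 * retries before making the page private.
		 */
		return kvm_slot_gmem_register_callback(slot, gfn);
	}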
---
include/linux/kvm_host.h | 11 +++
include/linux/page-flags.h | 7 ++
mm/debug.c | 1 +
mm/swap.c | 4 +
virt/kvm/guest_memfd.c | 145 +++++++++++++++++++++++++++++++++++++
5 files changed, 168 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 84aa7908a5dd..7ada5f78ded4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2574,6 +2574,8 @@ int kvm_slot_gmem_clear_mappable(struct kvm_memory_slot *slot, gfn_t start,
gfn_t end);
bool kvm_slot_gmem_is_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot, gfn_t gfn);
+int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn);
+void kvm_gmem_handle_folio_put(struct folio *folio);
#else
static inline bool kvm_gmem_is_mappable(struct kvm *kvm, gfn_t gfn, gfn_t end)
{
@@ -2615,6 +2617,15 @@ static inline bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot,
WARN_ON_ONCE(1);
return false;
}
+static inline int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+}
+static inline void kvm_gmem_handle_folio_put(struct folio *folio)
+{
+ WARN_ON_ONCE(1);
+}
#endif /* CONFIG_KVM_GMEM_MAPPABLE */
#endif
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index aca57802d7c7..b0e8e43de77c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -950,6 +950,7 @@ enum pagetype {
PGTY_slab = 0xf5,
PGTY_zsmalloc = 0xf6,
PGTY_unaccepted = 0xf7,
+ PGTY_guestmem = 0xf8,
PGTY_mapcount_underflow = 0xff
};
@@ -1099,6 +1100,12 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb)
FOLIO_TEST_FLAG_FALSE(hugetlb)
#endif
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+FOLIO_TYPE_OPS(guestmem, guestmem)
+#else
+FOLIO_TEST_FLAG_FALSE(guestmem)
+#endif
+
PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
/*
diff --git a/mm/debug.c b/mm/debug.c
index 95b6ab809c0e..db93be385ed9 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -56,6 +56,7 @@ static const char *page_type_names[] = {
DEF_PAGETYPE_NAME(table),
DEF_PAGETYPE_NAME(buddy),
DEF_PAGETYPE_NAME(unaccepted),
+ DEF_PAGETYPE_NAME(guestmem),
};
static const char *page_type_name(unsigned int page_type)
diff --git a/mm/swap.c b/mm/swap.c
index 6f01b56bce13..15220eaabc86 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,7 @@
#include <linux/page_idle.h>
#include <linux/local_lock.h>
#include <linux/buffer_head.h>
+#include <linux/kvm_host.h>
#include "internal.h"
@@ -103,6 +104,9 @@ static void free_typed_folio(struct folio *folio)
case PGTY_offline:
/* Nothing to do, it's offline. */
return;
+ case PGTY_guestmem:
+ kvm_gmem_handle_folio_put(folio);
+ return;
default:
WARN_ON_ONCE(1);
}
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index d1c192927cf7..5ecaa5dfcd00 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -387,6 +387,28 @@ enum folio_mappability {
KVM_GMEM_NONE_MAPPABLE = 0b11, /* Not mappable, transient state. */
};
+/*
+ * Unregisters the __folio_put() callback from the folio.
+ *
+ * Restores a folio's refcount after all pending references have been released,
+ * and removes the folio type, thereby removing the callback. Now the folio can
+ * be freed normally once all actual references have been dropped.
+ *
+ * Must be called with the filemap (inode->i_mapping) invalidate_lock held.
+ * Must also have exclusive access to the folio: folio must be either locked, or
+ * gmem holds the only reference.
+ */
+static void __kvm_gmem_restore_pending_folio(struct folio *folio)
+{
+ if (WARN_ON_ONCE(folio_mapped(folio) || !folio_test_guestmem(folio)))
+ return;
+
+ WARN_ON_ONCE(!folio_test_locked(folio) || folio_ref_count(folio) > 1);
+
+ __folio_clear_guestmem(folio);
+ folio_ref_add(folio, folio_nr_pages(folio));
+}
+
/*
* Marks the range [start, end) as mappable by both the host and the guest.
* Usually called when guest shares memory with the host.
@@ -400,7 +422,31 @@ static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
filemap_invalidate_lock(inode->i_mapping);
for (i = start; i < end; i++) {
+ struct folio *folio = NULL;
+
+ /*
+ * If the folio is NONE_MAPPABLE, it indicates that it is
+ * transitioning to private (GUEST_MAPPABLE). Transition it to
+ * shared (ALL_MAPPABLE) immediately, and remove the callback.
+ */
+ if (xa_to_value(xa_load(mappable_offsets, i)) == KVM_GMEM_NONE_MAPPABLE) {
+ folio = filemap_lock_folio(inode->i_mapping, i);
+ if (WARN_ON_ONCE(IS_ERR(folio))) {
+ r = PTR_ERR(folio);
+ break;
+ }
+
+ if (folio_test_guestmem(folio))
+ __kvm_gmem_restore_pending_folio(folio);
+ }
+
r = xa_err(xa_store(mappable_offsets, i, xval, GFP_KERNEL));
+
+ if (folio) {
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+
if (r)
break;
}
@@ -473,6 +519,105 @@ static int gmem_clear_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
return r;
}
+/*
+ * Registers a callback to __folio_put(), so that gmem knows that the host does
+ * not have any references to the folio. It does that by setting the folio type
+ * to guestmem.
+ *
+ * Returns 0 if the host doesn't have any references, or -EAGAIN if the host
+ * still has references, in which case the callback has been registered.
+ *
+ * Must be called with the following locks held:
+ * - filemap (inode->i_mapping) invalidate_lock
+ * - folio lock
+ */
+static int __gmem_register_callback(struct folio *folio, struct inode *inode, pgoff_t idx)
+{
+ struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+ void *xval_guest = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
+ int refcount;
+
+ rwsem_assert_held_write_nolockdep(&inode->i_mapping->invalidate_lock);
+ WARN_ON_ONCE(!folio_test_locked(folio));
+
+ if (folio_mapped(folio) || folio_test_guestmem(folio))
+ return -EAGAIN;
+
+ /* Register a callback first. */
+ __folio_set_guestmem(folio);
+
+ /*
+ * Check for references after setting the type to guestmem, to guard
+ * against potential races with the refcount being decremented later.
+ *
+ * At least one reference is expected because the folio is locked.
+ */
+
+ refcount = folio_ref_sub_return(folio, folio_nr_pages(folio));
+ if (refcount == 1) {
+ int r;
+
+ /* refcount isn't elevated, it's now faultable by the guest. */
+ r = WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, idx, xval_guest, GFP_KERNEL)));
+ if (!r)
+ __kvm_gmem_restore_pending_folio(folio);
+
+ return r;
+ }
+
+ return -EAGAIN;
+}
+
+int kvm_slot_gmem_register_callback(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ unsigned long pgoff = slot->gmem.pgoff + gfn - slot->base_gfn;
+ struct inode *inode = file_inode(slot->gmem.file);
+ struct folio *folio;
+ int r;
+
+ filemap_invalidate_lock(inode->i_mapping);
+
+ folio = filemap_lock_folio(inode->i_mapping, pgoff);
+ if (WARN_ON_ONCE(IS_ERR(folio))) {
+ r = PTR_ERR(folio);
+ goto out;
+ }
+
+ r = __gmem_register_callback(folio, inode, pgoff);
+
+ folio_unlock(folio);
+ folio_put(folio);
+out:
+ filemap_invalidate_unlock(inode->i_mapping);
+
+ return r;
+}
+
+/*
+ * Callback function for __folio_put(), i.e., called when all references by the
+ * host to the folio have been dropped. This allows gmem to transition the state
+ * of the folio to mappable by the guest, and allows the hypervisor to continue
+ * transitioning its state to private, since the host cannot attempt to access
+ * it anymore.
+ */
+void kvm_gmem_handle_folio_put(struct folio *folio)
+{
+ struct xarray *mappable_offsets;
+ struct inode *inode;
+ pgoff_t index;
+ void *xval;
+
+ inode = folio->mapping->host;
+ index = folio->index;
+ mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
+ xval = xa_mk_value(KVM_GMEM_GUEST_MAPPABLE);
+
+ filemap_invalidate_lock(inode->i_mapping);
+ __kvm_gmem_restore_pending_folio(folio);
+ WARN_ON_ONCE(xa_err(xa_store(mappable_offsets, index, xval, GFP_KERNEL)));
+ filemap_invalidate_unlock(inode->i_mapping);
+}
+
static bool gmem_is_mappable(struct inode *inode, pgoff_t pgoff)
{
struct xarray *mappable_offsets = &kvm_gmem_private(inode)->mappable_offsets;
--
2.47.1.613.gc27f4b7a9f-goog
* [RFC PATCH v4 07/14] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (5 preceding siblings ...)
2024-12-13 16:48 ` [RFC PATCH v4 06/14] KVM: guest_memfd: Handle final folio_put() of guestmem pages Fuad Tabba
@ 2024-12-13 16:48 ` Fuad Tabba
2024-12-27 4:21 ` Alexey Kardashevskiy
2024-12-13 16:48 ` [RFC PATCH v4 08/14] KVM: guest_memfd: Add guest_memfd support to kvm_(read|write)_guest_page() Fuad Tabba
` (8 subsequent siblings)
15 siblings, 1 reply; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:48 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Add support for mmap() and fault() for guest_memfd in the host.
The ability to fault in a guest page is contingent on that page
being shared with the host.
The guest_memfd PRIVATE memory attribute is not used for two
reasons. First, because it reflects the userspace expectation for
that memory location, and can therefore be toggled by userspace.
Second, although each guest_memfd file has a 1:1 binding with a
KVM instance, the plan is to allow multiple files per inode, e.g.
to allow intra-host migration to a new KVM instance without
destroying guest_memfd.
The mapping is restricted to only memory explicitly shared with
the host. KVM checks that the host doesn't have any mappings for
private memory via the folio's refcount. To avoid races between
paths that check mappability and paths that check whether the
host has any mappings (via the refcount), the folio lock is held
while either check is being performed.
This new feature is gated with a new configuration option,
CONFIG_KVM_GMEM_MAPPABLE.
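As an illustration, a minimal userspace sketch (assuming a kernel built
with CONFIG_KVM_GMEM_MAPPABLE; with this patch all offsets of a newly
created guest_memfd start out host-mappable, so no new creation flag is
needed yet). Faulting the mapping only succeeds while the pages are
shared with the host; otherwise the fault handler returns SIGBUS:

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/kvm.h>

	/* Create a guest_memfd on @vm_fd and map it into the host. */
	static void *map_gmem_example(int vm_fd, uint64_t size)
	{
		struct kvm_create_guest_memfd gmem = {
			.size = size,
		};
		int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

		if (gmem_fd < 0)
			return MAP_FAILED;

		/* kvm_gmem_mmap() only accepts shared mappings. */
		return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			    gmem_fd, 0);
	}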
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
The functions kvm_gmem_is_mappable(), kvm_gmem_set_mappable(), and
kvm_gmem_clear_mappable() are not used in this patch series. They
are intended to be used in future patches [*], which check and
toggle mappability when the guest shares/unshares pages with the
host.
[*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v4-pkvm
---
virt/kvm/Kconfig | 4 ++
virt/kvm/guest_memfd.c | 87 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 91 insertions(+)
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..59400fd8f539 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
config HAVE_KVM_ARCH_GMEM_INVALIDATE
bool
depends on KVM_PRIVATE_MEM
+
+config KVM_GMEM_MAPPABLE
+ select KVM_PRIVATE_MEM
+ bool
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 5ecaa5dfcd00..3d3645924db9 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -671,9 +671,88 @@ bool kvm_slot_gmem_is_guest_mappable(struct kvm_memory_slot *slot, gfn_t gfn)
return gmem_is_guest_mappable(inode, pgoff);
}
+
+static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ struct folio *folio;
+ vm_fault_t ret = VM_FAULT_LOCKED;
+
+ filemap_invalidate_lock_shared(inode->i_mapping);
+
+ folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+ if (IS_ERR(folio)) {
+ ret = VM_FAULT_SIGBUS;
+ goto out_filemap;
+ }
+
+ if (folio_test_hwpoison(folio)) {
+ ret = VM_FAULT_HWPOISON;
+ goto out_folio;
+ }
+
+ if (!gmem_is_mappable(inode, vmf->pgoff)) {
+ ret = VM_FAULT_SIGBUS;
+ goto out_folio;
+ }
+
+ if (WARN_ON_ONCE(folio_test_guestmem(folio))) {
+ ret = VM_FAULT_SIGBUS;
+ goto out_folio;
+ }
+
+ if (!folio_test_uptodate(folio)) {
+ unsigned long nr_pages = folio_nr_pages(folio);
+ unsigned long i;
+
+ for (i = 0; i < nr_pages; i++)
+ clear_highpage(folio_page(folio, i));
+
+ folio_mark_uptodate(folio);
+ }
+
+ vmf->page = folio_file_page(folio, vmf->pgoff);
+
+out_folio:
+ if (ret != VM_FAULT_LOCKED) {
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+
+out_filemap:
+ filemap_invalidate_unlock_shared(inode->i_mapping);
+
+ return ret;
+}
+
+static const struct vm_operations_struct kvm_gmem_vm_ops = {
+ .fault = kvm_gmem_fault,
+};
+
+static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
+ (VM_SHARED | VM_MAYSHARE)) {
+ return -EINVAL;
+ }
+
+ file_accessed(file);
+ vm_flags_set(vma, VM_DONTDUMP);
+ vma->vm_ops = &kvm_gmem_vm_ops;
+
+ return 0;
+}
+#else
+static int gmem_set_mappable(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+}
+#define kvm_gmem_mmap NULL
#endif /* CONFIG_KVM_GMEM_MAPPABLE */
static struct file_operations kvm_gmem_fops = {
+ .mmap = kvm_gmem_mmap,
.open = generic_file_open,
.release = kvm_gmem_release,
.fallocate = kvm_gmem_fallocate,
@@ -860,6 +939,14 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
goto err_gmem;
}
+ if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE)) {
+ err = gmem_set_mappable(file_inode(file), 0, size >> PAGE_SHIFT);
+ if (err) {
+ fput(file);
+ goto err_gmem;
+ }
+ }
+
kvm_get_kvm(kvm);
gmem->kvm = kvm;
xa_init(&gmem->bindings);
--
2.47.1.613.gc27f4b7a9f-goog
* [RFC PATCH v4 08/14] KVM: guest_memfd: Add guest_memfd support to kvm_(read|write)_guest_page()
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (6 preceding siblings ...)
2024-12-13 16:48 ` [RFC PATCH v4 07/14] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared Fuad Tabba
@ 2024-12-13 16:48 ` Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 09/14] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable Fuad Tabba
` (7 subsequent siblings)
15 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:48 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Make kvm_(read|write)_guest_page() capable of accessing guest
memory for slots that don't have a userspace address, but only if
the memory is mappable, which also indicates that it is
accessible by the host.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
virt/kvm/kvm_main.c | 133 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 114 insertions(+), 19 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fffff01cebe7..53692feb6213 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3184,23 +3184,110 @@ int kvm_gmem_clear_mappable(struct kvm *kvm, gfn_t start, gfn_t end)
return r;
}
+static int __kvm_read_guest_memfd_page(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, void *data, int offset,
+ int len)
+{
+ struct page *page;
+ u64 pfn;
+ int r;
+
+ /*
+ * Holds the folio lock until after checking whether it can be faulted
+ * in, to avoid races with paths that change a folio's mappability.
+ */
+ r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, &page, NULL);
+ if (r)
+ return r;
+
+ if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1)) {
+ r = -EPERM;
+ goto unlock;
+ }
+ memcpy(data, page_address(page) + offset, len);
+unlock:
+ unlock_page(page);
+ if (r)
+ put_page(page);
+ else
+ kvm_release_page_clean(page);
+
+ return r;
+}
+
+static int __kvm_write_guest_memfd_page(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, const void *data,
+ int offset, int len)
+{
+ struct page *page;
+ u64 pfn;
+ int r;
+
+ /*
+ * Holds the folio lock until after checking whether it can be faulted
+ * in, to avoid races with paths that change a folio's mappability.
+ */
+ r = kvm_gmem_get_pfn_locked(kvm, slot, gfn, &pfn, &page, NULL);
+ if (r)
+ return r;
+
+ if (!kvm_gmem_is_mappable(kvm, gfn, gfn + 1)) {
+ r = -EPERM;
+ goto unlock;
+ }
+ memcpy(page_address(page) + offset, data, len);
+unlock:
+ unlock_page(page);
+ if (r)
+ put_page(page);
+ else
+ kvm_release_page_dirty(page);
+
+ return r;
+}
+#else
+static int __kvm_read_guest_memfd_page(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, void *data, int offset,
+ int len)
+{
+ WARN_ON_ONCE(1);
+ return -EIO;
+}
+
+static int __kvm_write_guest_memfd_page(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, const void *data,
+ int offset, int len)
+{
+ WARN_ON_ONCE(1);
+ return -EIO;
+}
#endif /* CONFIG_KVM_GMEM_MAPPABLE */
/* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
-static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
- void *data, int offset, int len)
+
+static int __kvm_read_guest_page(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, void *data, int offset, int len)
{
- int r;
unsigned long addr;
if (WARN_ON_ONCE(offset + len > PAGE_SIZE))
return -EFAULT;
+ if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
+ kvm_slot_can_be_private(slot) &&
+ !slot->userspace_addr) {
+ return __kvm_read_guest_memfd_page(kvm, slot, gfn, data,
+ offset, len);
+ }
+
addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
if (kvm_is_error_hva(addr))
return -EFAULT;
- r = __copy_from_user(data, (void __user *)addr + offset, len);
- if (r)
+ if (__copy_from_user(data, (void __user *)addr + offset, len))
return -EFAULT;
return 0;
}
@@ -3210,7 +3297,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
{
struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
- return __kvm_read_guest_page(slot, gfn, data, offset, len);
+ return __kvm_read_guest_page(kvm, slot, gfn, data, offset, len);
}
EXPORT_SYMBOL_GPL(kvm_read_guest_page);
@@ -3219,7 +3306,7 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
{
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
- return __kvm_read_guest_page(slot, gfn, data, offset, len);
+ return __kvm_read_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
@@ -3296,22 +3383,30 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
/* Copy @len bytes from @data into guest memory at '(@gfn * PAGE_SIZE) + @offset' */
static int __kvm_write_guest_page(struct kvm *kvm,
- struct kvm_memory_slot *memslot, gfn_t gfn,
- const void *data, int offset, int len)
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ const void *data, int offset, int len)
{
- int r;
- unsigned long addr;
-
if (WARN_ON_ONCE(offset + len > PAGE_SIZE))
return -EFAULT;
- addr = gfn_to_hva_memslot(memslot, gfn);
- if (kvm_is_error_hva(addr))
- return -EFAULT;
- r = __copy_to_user((void __user *)addr + offset, data, len);
- if (r)
- return -EFAULT;
- mark_page_dirty_in_slot(kvm, memslot, gfn);
+ if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
+ kvm_slot_can_be_private(slot) &&
+ !slot->userspace_addr) {
+ int r = __kvm_write_guest_memfd_page(kvm, slot, gfn, data,
+ offset, len);
+
+ if (r)
+ return r;
+ } else {
+ unsigned long addr = gfn_to_hva_memslot(slot, gfn);
+
+ if (kvm_is_error_hva(addr))
+ return -EFAULT;
+ if (__copy_to_user((void __user *)addr + offset, data, len))
+ return -EFAULT;
+ }
+
+ mark_page_dirty_in_slot(kvm, slot, gfn);
return 0;
}
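
As an illustration of the new error semantics (this snippet is not part of the
patch; the helper name and its caller are made up for the example), an
in-kernel user of the existing kvm_read_guest_page() would now see -EPERM when
the target gfn is backed by guest_memfd but not mappable by the host:

/*
 * Illustrative only: reacts to the -EPERM now returned when a
 * guest_memfd-backed gfn is private to the guest. The caller, buffer
 * and gpa are hypothetical.
 */
static int demo_read_guest_blob(struct kvm *kvm, gpa_t gpa, void *buf, int len)
{
	int r = kvm_read_guest_page(kvm, gpa_to_gfn(gpa), buf,
				    offset_in_page(gpa), len);

	if (r == -EPERM)	/* page is private; the host must not read it */
		pr_debug("gpa %#llx is not host-mappable\n", gpa);

	return r;
}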
--
2.47.1.613.gc27f4b7a9f-goog

* [RFC PATCH v4 09/14] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (7 preceding siblings ...)
2024-12-13 16:48 ` [RFC PATCH v4 08/14] KVM: guest_memfd: Add guest_memfd support to kvm_(read|/write)_guest_page() Fuad Tabba
@ 2024-12-13 16:48 ` Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 10/14] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as mappable Fuad Tabba
` (6 subsequent siblings)
15 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:48 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Add the KVM capability KVM_CAP_GUEST_MEMFD_MAPPABLE, which is
true if mapping guest memory is supported by the host.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 4 ++++
2 files changed, 5 insertions(+)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 502ea63b5d2e..021f8ef9979b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -933,6 +933,7 @@ struct kvm_enable_cap {
#define KVM_CAP_PRE_FAULT_MEMORY 236
#define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
#define KVM_CAP_X86_GUEST_MODE 238
+#define KVM_CAP_GUEST_MEMFD_MAPPABLE 239
struct kvm_irq_routing_irqchip {
__u32 irqchip;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 53692feb6213..0d1c2e95e771 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4979,6 +4979,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
#ifdef CONFIG_KVM_PRIVATE_MEM
case KVM_CAP_GUEST_MEMFD:
return !kvm || kvm_arch_has_private_mem(kvm);
+#endif
+#ifdef CONFIG_KVM_GMEM_MAPPABLE
+ case KVM_CAP_GUEST_MEMFD_MAPPABLE:
+ return !kvm || kvm_arch_has_private_mem(kvm);
#endif
default:
break;
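
For reference, a minimal userspace sketch of probing the capability (assuming
headers that carry the new define; error handling trimmed). Querying /dev/kvm
reports whether the kernel supports it at all, while querying a VM fd also
reflects whether that VM type has private memory:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Illustrative probe: non-zero if guest_memfd can be mapped by the host. */
static int gmem_mappable_supported(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);
	int ret;

	if (kvm_fd < 0)
		return 0;

	ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GUEST_MEMFD_MAPPABLE);
	close(kvm_fd);

	return ret > 0;
}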
--
2.47.1.613.gc27f4b7a9f-goog
* [RFC PATCH v4 10/14] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as mappable
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (8 preceding siblings ...)
2024-12-13 16:48 ` [RFC PATCH v4 09/14] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable Fuad Tabba
@ 2024-12-13 16:48 ` Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 11/14] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed Fuad Tabba
` (5 subsequent siblings)
15 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:48 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Not all use cases require guest_memfd() to be mappable by the
host when first created. Add a new flag,
GUEST_MEMFD_FLAG_INIT_MAPPABLE, which when set on
KVM_CREATE_GUEST_MEMFD initializes the memory as mappable by the
host. Otherwise, memory is private until shared by the guest with
the host.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
Documentation/virt/kvm/api.rst | 4 ++++
include/uapi/linux/kvm.h | 1 +
tools/testing/selftests/kvm/guest_memfd_test.c | 7 +++++--
virt/kvm/guest_memfd.c | 6 +++++-
4 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 454c2aaa155e..60b65d9b8077 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6380,6 +6380,10 @@ most one mapping per page, i.e. binding multiple memory regions to a single
guest_memfd range is not allowed (any number of memory regions can be bound to
a single guest_memfd file, but the bound ranges must not overlap).
+If the capability KVM_CAP_GUEST_MEMFD_MAPPABLE is supported, then the flags
+field supports GUEST_MEMFD_FLAG_INIT_MAPPABLE, which initializes the memory
+as mappable by the host.
+
See KVM_SET_USER_MEMORY_REGION2 for additional details.
4.143 KVM_PRE_FAULT_MEMORY
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 021f8ef9979b..b34aed04ffa5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1566,6 +1566,7 @@ struct kvm_memory_attributes {
#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
#define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
+#define GUEST_MEMFD_FLAG_INIT_MAPPABLE (1UL << 0)
struct kvm_create_guest_memfd {
__u64 size;
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index ce687f8d248f..04b4111b7190 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -123,7 +123,7 @@ static void test_invalid_punch_hole(int fd, size_t page_size, size_t total_size)
static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
{
size_t page_size = getpagesize();
- uint64_t flag;
+ uint64_t flag = BIT(0);
size_t size;
int fd;
@@ -134,7 +134,10 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
size);
}
- for (flag = BIT(0); flag; flag <<= 1) {
+ if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_MAPPABLE))
+ flag = GUEST_MEMFD_FLAG_INIT_MAPPABLE << 1;
+
+ for (; flag; flag <<= 1) {
fd = __vm_create_guest_memfd(vm, page_size, flag);
TEST_ASSERT(fd == -1 && errno == EINVAL,
"guest_memfd() with flag '0x%lx' should fail with EINVAL",
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 3d3645924db9..f33a577295b3 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -939,7 +939,8 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
goto err_gmem;
}
- if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE)) {
+ if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE) &&
+ (flags & GUEST_MEMFD_FLAG_INIT_MAPPABLE)) {
err = gmem_set_mappable(file_inode(file), 0, size >> PAGE_SHIFT);
if (err) {
fput(file);
@@ -968,6 +969,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
u64 flags = args->flags;
u64 valid_flags = 0;
+ if (IS_ENABLED(CONFIG_KVM_GMEM_MAPPABLE))
+ valid_flags |= GUEST_MEMFD_FLAG_INIT_MAPPABLE;
+
if (flags & ~valid_flags)
return -EINVAL;
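
As a usage sketch (not part of this patch; vm_fd is assumed to be an existing
VM file descriptor for which kvm_arch_has_private_mem() is true), creating a
guest_memfd that starts out host-mappable looks roughly like:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Illustrative: returns the new guest_memfd on success, negative on error. */
static int create_mappable_gmem(int vm_fd, __u64 size)
{
	struct kvm_create_guest_memfd gmem = {
		.size  = size,
		/* Without this flag, the memory starts out private to the guest. */
		.flags = GUEST_MEMFD_FLAG_INIT_MAPPABLE,
	};

	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
}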
--
2.47.1.613.gc27f4b7a9f-goog
* [RFC PATCH v4 11/14] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (9 preceding siblings ...)
2024-12-13 16:48 ` [RFC PATCH v4 10/14] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as mappable Fuad Tabba
@ 2024-12-13 16:48 ` Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 12/14] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
` (4 subsequent siblings)
15 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:48 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Expand the guest_memfd selftests to test mapping guest memory when the
capability is supported, and to check that memory is still not mappable
when the capability isn't supported.
Also, build the guest_memfd selftest for aarch64.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 57 +++++++++++++++++--
2 files changed, 53 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 41593d2e7de9..c998eb3c3b77 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -174,6 +174,7 @@ TEST_GEN_PROGS_aarch64 += coalesced_io_test
TEST_GEN_PROGS_aarch64 += demand_paging_test
TEST_GEN_PROGS_aarch64 += dirty_log_test
TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
+TEST_GEN_PROGS_aarch64 += guest_memfd_test
TEST_GEN_PROGS_aarch64 += guest_print_test
TEST_GEN_PROGS_aarch64 += get-reg-list
TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 04b4111b7190..12b5777c2eb5 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -34,12 +34,55 @@ static void test_file_read_write(int fd)
"pwrite on a guest_mem fd should fail");
}
-static void test_mmap(int fd, size_t page_size)
+static void test_mmap_allowed(int fd, size_t total_size)
{
+ size_t page_size = getpagesize();
+ char *mem;
+ int ret;
+ int i;
+
+ mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem != MAP_FAILED, "mmap() of guest memory should pass.");
+
+ memset(mem, 0xaa, total_size);
+ for (i = 0; i < total_size; i++)
+ TEST_ASSERT_EQ(mem[i], 0xaa);
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0,
+ page_size);
+ TEST_ASSERT(!ret, "fallocate the first page should succeed");
+
+ for (i = 0; i < page_size; i++)
+ TEST_ASSERT_EQ(mem[i], 0x00);
+ for (; i < total_size; i++)
+ TEST_ASSERT_EQ(mem[i], 0xaa);
+
+ memset(mem, 0xaa, total_size);
+ for (i = 0; i < total_size; i++)
+ TEST_ASSERT_EQ(mem[i], 0xaa);
+
+ ret = munmap(mem, total_size);
+ TEST_ASSERT(!ret, "munmap should succeed");
+}
+
+static void test_mmap_denied(int fd, size_t total_size)
+{
+ size_t page_size = getpagesize();
char *mem;
mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
TEST_ASSERT_EQ(mem, MAP_FAILED);
+
+ mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT_EQ(mem, MAP_FAILED);
+}
+
+static void test_mmap(int fd, size_t total_size)
+{
+ if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_MAPPABLE))
+ test_mmap_allowed(fd, total_size);
+ else
+ test_mmap_denied(fd, total_size);
}
static void test_file_size(int fd, size_t page_size, size_t total_size)
@@ -175,13 +218,17 @@ static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
int main(int argc, char *argv[])
{
- size_t page_size;
+ uint64_t flags = 0;
+ struct kvm_vm *vm;
size_t total_size;
+ size_t page_size;
int fd;
- struct kvm_vm *vm;
TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
+ if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_MAPPABLE))
+ flags |= GUEST_MEMFD_FLAG_INIT_MAPPABLE;
+
page_size = getpagesize();
total_size = page_size * 4;
@@ -190,10 +237,10 @@ int main(int argc, char *argv[])
test_create_guest_memfd_invalid(vm);
test_create_guest_memfd_multiple(vm);
- fd = vm_create_guest_memfd(vm, total_size, 0);
+ fd = vm_create_guest_memfd(vm, total_size, flags);
test_file_read_write(fd);
- test_mmap(fd, page_size);
+ test_mmap(fd, total_size);
test_file_size(fd, page_size, total_size);
test_fallocate(fd, page_size, total_size);
test_invalid_punch_hole(fd, page_size, total_size);
--
2.47.1.613.gc27f4b7a9f-goog
* [RFC PATCH v4 12/14] KVM: arm64: Skip VMA checks for slots without userspace address
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (10 preceding siblings ...)
2024-12-13 16:48 ` [RFC PATCH v4 11/14] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed Fuad Tabba
@ 2024-12-13 16:48 ` Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 13/14] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
` (3 subsequent siblings)
15 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:48 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Memory slots backed by guest memory might be created with no
intention of being mapped by the host. These are recognized by
not having a userspace address in the memory slot.
VMA checks are neither possible nor necessary for this kind of
slot, so skip them.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c9d46ad57e52..342a9bd3848f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -988,6 +988,10 @@ static void stage2_unmap_memslot(struct kvm *kvm,
phys_addr_t size = PAGE_SIZE * memslot->npages;
hva_t reg_end = hva + size;
+ /* Host will not map this private memory without a userspace address. */
+ if (kvm_slot_can_be_private(memslot) && !hva)
+ return;
+
/*
* A memory region could potentially cover multiple VMAs, and any holes
* between them, so iterate over all of them to find out if we should
@@ -2133,6 +2137,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
hva = new->userspace_addr;
reg_end = hva + (new->npages << PAGE_SHIFT);
+ /* Host will not map this private memory without a userspace address. */
+ if (kvm_slot_can_be_private(new) && !hva)
+ return 0;
+
mmap_read_lock(current->mm);
/*
* A memory region could potentially cover multiple VMAs, and any holes
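
For context, such a slot would be created from userspace roughly as below (a
hedged sketch, not part of this patch; the slot number and addresses are
illustrative). With userspace_addr left at zero there is no VMA to check:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Illustrative: bind a guest_memfd-only slot with no host mapping. */
static int bind_private_only_slot(int vm_fd, int gmem_fd, __u64 gpa, __u64 size)
{
	struct kvm_userspace_memory_region2 region = {
		.slot            = 0,
		.flags           = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = 0,	/* no userspace address: VMA checks are skipped */
		.guest_memfd     = gmem_fd,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}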
--
2.47.1.613.gc27f4b7a9f-goog
* [RFC PATCH v4 13/14] KVM: arm64: Handle guest_memfd()-backed guest page faults
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (11 preceding siblings ...)
2024-12-13 16:48 ` [RFC PATCH v4 12/14] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
@ 2024-12-13 16:48 ` Fuad Tabba
2025-01-16 14:48 ` Patrick Roy
2024-12-13 16:48 ` [RFC PATCH v4 14/14] KVM: arm64: Enable guest_memfd private memory when pKVM is enabled Fuad Tabba
` (2 subsequent siblings)
15 siblings, 1 reply; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:48 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Add arm64 support for resolving guest page faults on
guest_memfd() backed memslots. This support is not contingent on
pKVM, or other confidential computing support, and works in both
VHE and nVHE modes.
Without confidential computing, this support is useful for
testing and debugging. In the future, it might also be useful
should a user want to use guest_memfd() for all code, whether
it's for a protected guest or not.
For now, the fault granule is restricted to PAGE_SIZE.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/mmu.c | 111 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 109 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 342a9bd3848f..1c4b3871967c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1434,6 +1434,107 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
return vma->vm_flags & VM_MTE_ALLOWED;
}
+static int guest_memfd_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+ struct kvm_memory_slot *memslot, bool fault_is_perm)
+{
+ struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
+ bool exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
+ bool logging_active = memslot_is_logging(memslot);
+ struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
+ enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
+ bool write_fault = kvm_is_write_fault(vcpu);
+ struct mm_struct *mm = current->mm;
+ gfn_t gfn = gpa_to_gfn(fault_ipa);
+ struct kvm *kvm = vcpu->kvm;
+ struct page *page;
+ kvm_pfn_t pfn;
+ int ret;
+
+ /* For now, guest_memfd() only supports PAGE_SIZE granules. */
+ if (WARN_ON_ONCE(fault_is_perm &&
+ kvm_vcpu_trap_get_perm_fault_granule(vcpu) != PAGE_SIZE)) {
+ return -EFAULT;
+ }
+
+ VM_BUG_ON(write_fault && exec_fault);
+
+ if (fault_is_perm && !write_fault && !exec_fault) {
+ kvm_err("Unexpected L2 read permission error\n");
+ return -EFAULT;
+ }
+
+ /*
+ * Permission faults just need to update the existing leaf entry,
+ * and so normally don't require allocations from the memcache. The
+ * only exception to this is when dirty logging is enabled at runtime
+ * and a write fault needs to collapse a block entry into a table.
+ */
+ if (!fault_is_perm || (logging_active && write_fault)) {
+ ret = kvm_mmu_topup_memory_cache(memcache,
+ kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu));
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Holds the folio lock until mapped in the guest and its refcount is
+ * stable, to avoid races with paths that check if the folio is mapped
+ * by the host.
+ */
+ ret = kvm_gmem_get_pfn_locked(kvm, memslot, gfn, &pfn, &page, NULL);
+ if (ret)
+ return ret;
+
+ if (!kvm_slot_gmem_is_guest_mappable(memslot, gfn)) {
+ ret = -EAGAIN;
+ goto unlock_page;
+ }
+
+ /*
+ * Once it's faulted in, a guest_memfd() page will stay in memory.
+ * Therefore, count it as locked.
+ */
+ if (!fault_is_perm) {
+ ret = account_locked_vm(mm, 1, true);
+ if (ret)
+ goto unlock_page;
+ }
+
+ read_lock(&kvm->mmu_lock);
+ if (write_fault)
+ prot |= KVM_PGTABLE_PROT_W;
+
+ if (exec_fault)
+ prot |= KVM_PGTABLE_PROT_X;
+
+ if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC))
+ prot |= KVM_PGTABLE_PROT_X;
+
+ /*
+ * Under the premise of getting a FSC_PERM fault, we just need to relax
+ * permissions.
+ */
+ if (fault_is_perm)
+ ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
+ else
+ ret = kvm_pgtable_stage2_map(pgt, fault_ipa, PAGE_SIZE,
+ __pfn_to_phys(pfn), prot,
+ memcache,
+ KVM_PGTABLE_WALK_HANDLE_FAULT |
+ KVM_PGTABLE_WALK_SHARED);
+
+ kvm_release_faultin_page(kvm, page, !!ret, write_fault);
+ read_unlock(&kvm->mmu_lock);
+
+ if (ret && !fault_is_perm)
+ account_locked_vm(mm, 1, false);
+unlock_page:
+ unlock_page(page);
+ put_page(page);
+
+ return ret != -EAGAIN ? ret : 0;
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_s2_trans *nested,
struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1900,8 +2001,14 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
goto out_unlock;
}
- ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
- esr_fsc_is_permission_fault(esr));
+ if (kvm_slot_can_be_private(memslot)) {
+ ret = guest_memfd_abort(vcpu, fault_ipa, memslot,
+ esr_fsc_is_permission_fault(esr));
+ } else {
+ ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
+ esr_fsc_is_permission_fault(esr));
+ }
+
if (ret == 0)
ret = 1;
out:
--
2.47.1.613.gc27f4b7a9f-goog
* [RFC PATCH v4 14/14] KVM: arm64: Enable guest_memfd private memory when pKVM is enabled
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (12 preceding siblings ...)
2024-12-13 16:48 ` [RFC PATCH v4 13/14] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
@ 2024-12-13 16:48 ` Fuad Tabba
2025-01-09 16:34 ` [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
2025-01-16 14:48 ` Patrick Roy
15 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2024-12-13 16:48 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
tabba
Implement kvm_arch_has_private_mem() in arm64 when pKVM is
enabled, and make it dependent on the configuration option.
Also, now that the infrastructure is in place for arm64 to
support guest private memory, enable it in the arm64 kernel
configuration.
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/include/asm/kvm_host.h | 3 +++
arch/arm64/kvm/Kconfig | 1 +
2 files changed, 4 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index e18e9244d17a..8dfae9183651 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1529,4 +1529,7 @@ void kvm_set_vm_id_reg(struct kvm *kvm, u32 reg, u64 val);
#define kvm_has_s1poe(k) \
(kvm_has_feat((k), ID_AA64MMFR3_EL1, S1POE, IMP))
+#define kvm_arch_has_private_mem(kvm) \
+ (IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) && is_protected_kvm_enabled())
+
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index ead632ad01b4..fe3451f244b5 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -38,6 +38,7 @@ menuconfig KVM
select HAVE_KVM_VCPU_RUN_PID_CHANGE
select SCHED_INFO
select GUEST_PERF_EVENTS if PERF_EVENTS
+ select KVM_GMEM_MAPPABLE
help
Support hosting virtualized guest machines.
--
2.47.1.613.gc27f4b7a9f-goog
* Re: [RFC PATCH v4 07/14] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
2024-12-13 16:48 ` [RFC PATCH v4 07/14] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared Fuad Tabba
@ 2024-12-27 4:21 ` Alexey Kardashevskiy
2025-01-09 10:17 ` Fuad Tabba
0 siblings, 1 reply; 26+ messages in thread
From: Alexey Kardashevskiy @ 2024-12-27 4:21 UTC (permalink / raw)
To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On 14/12/24 03:48, Fuad Tabba wrote:
> Add support for mmap() and fault() for guest_memfd in the host.
> The ability to fault in a guest page is contingent on that page
> being shared with the host.
>
> The guest_memfd PRIVATE memory attribute is not used for two
> reasons. First because it reflects the userspace expectation for
> that memory location, and therefore can be toggled by userspace.
> The second is, although each guest_memfd file has a 1:1 binding
> with a KVM instance, the plan is to allow multiple files per
> inode, e.g. to allow intra-host migration to a new KVM instance,
> without destroying guest_memfd.
>
> The mapping is restricted to only memory explicitly shared with
> the host. KVM checks that the host doesn't have any mappings for
> private memory via the folio's refcount. To avoid races between
> paths that check mappability and paths that check whether the
> host has any mappings (via the refcount), the folio lock is held
> while either check is being performed.
>
> This new feature is gated with a new configuration option,
> CONFIG_KVM_GMEM_MAPPABLE.
>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Elliot Berman <quic_eberman@quicinc.com>
> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
>
> ---
> The functions kvm_gmem_is_mapped(), kvm_gmem_set_mappable(), and
> int kvm_gmem_clear_mappable() are not used in this patch series.
> They are intended to be used in future patches [*], which check
> and toggle mapability when the guest shares/unshares pages with
> the host.
>
> [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v4-pkvm
This one requires access, can you please push it somewhere public? I am
interested in in-place shared<->private memory conversion and I wonder
if kvm_gmem_set_mappable() is that guy. Thanks,
--
Alexey
* Re: [RFC PATCH v4 07/14] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
2024-12-27 4:21 ` Alexey Kardashevskiy
@ 2025-01-09 10:17 ` Fuad Tabba
0 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2025-01-09 10:17 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
jthoughton
Hi Alexey,
On Fri, 27 Dec 2024 at 04:21, Alexey Kardashevskiy <aik@amd.com> wrote:
>
> On 14/12/24 03:48, Fuad Tabba wrote:
> > Add support for mmap() and fault() for guest_memfd in the host.
> > The ability to fault in a guest page is contingent on that page
> > being shared with the host.
> >
> > The guest_memfd PRIVATE memory attribute is not used for two
> > reasons. First because it reflects the userspace expectation for
> > that memory location, and therefore can be toggled by userspace.
> > The second is, although each guest_memfd file has a 1:1 binding
> > with a KVM instance, the plan is to allow multiple files per
> > inode, e.g. to allow intra-host migration to a new KVM instance,
> > without destroying guest_memfd.
> >
> > The mapping is restricted to only memory explicitly shared with
> > the host. KVM checks that the host doesn't have any mappings for
> > private memory via the folio's refcount. To avoid races between
> > paths that check mappability and paths that check whether the
> > host has any mappings (via the refcount), the folio lock is held
> > while either check is being performed.
> >
> > This new feature is gated with a new configuration option,
> > CONFIG_KVM_GMEM_MAPPABLE.
> >
> > Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > Co-developed-by: Elliot Berman <quic_eberman@quicinc.com>
> > Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> >
> > ---
> > The functions kvm_gmem_is_mapped(), kvm_gmem_set_mappable(), and
> > int kvm_gmem_clear_mappable() are not used in this patch series.
> > They are intended to be used in future patches [*], which check
> > and toggle mapability when the guest shares/unshares pages with
> > the host.
> >
> > [*] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v4-pkvm
>
> This one requires access, can you please push it somewhere public? I am
> interested in in-place shared<->private memory conversion and I wonder
> if kvm_gmem_set_mappable() is that guy. Thanks,
Sorry for the late reply, I was away, and sorry for the broken link,
I'd forgotten to push. Could you try now?
https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.13-v4-pkvm
Thanks,
/fuad
> --
> Alexey
>
* Re: [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (13 preceding siblings ...)
2024-12-13 16:48 ` [RFC PATCH v4 14/14] KVM: arm64: Enable guest_memfd private memory when pKVM is enabled Fuad Tabba
@ 2025-01-09 16:34 ` Fuad Tabba
2025-01-16 0:35 ` Ackerley Tng
2025-01-16 14:48 ` Patrick Roy
15 siblings, 1 reply; 26+ messages in thread
From: Fuad Tabba @ 2025-01-09 16:34 UTC (permalink / raw)
To: kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Hi,
As mentioned in the guest_memfd sync (2025-01-09), below is the state
diagram showing the new states in this patch series and how they
would interact with sharing/unsharing in pKVM:
https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
This patch series doesn't necessarily impose all these transitions;
many of them would be a matter of policy. This just happens to be the
current way I've done it with pKVM/arm64.
Cheers,
/fuad
On Fri, 13 Dec 2024 at 16:48, Fuad Tabba <tabba@google.com> wrote:
>
> This series adds restricted mmap() support to guest_memfd, as
> well as support for guest_memfd on arm64. It is based on Linux
> 6.13-rc2. Please refer to v3 for the context [1].
>
> Main changes since v3:
> - Added a new folio type for guestmem, used to register a
> callback when a folio's reference count reaches 0 (Matthew
> Wilcox, DavidH) [2]
> - Introduce new mappability states for folios, where a folio can
> be mappable by the host and the guest, only the guest, or by no
> one (transient state)
> - Rebased on Linux 6.13-rc2
> - Refactoring and tidying up
>
> Cheers,
> /fuad
>
> [1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@google.com/
> [2] https://lore.kernel.org/all/20241108162040.159038-1-tabba@google.com/
>
> Ackerley Tng (2):
> KVM: guest_memfd: Make guest mem use guest mem inodes instead of
> anonymous inodes
> KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
>
> Fuad Tabba (12):
> mm: Consolidate freeing of typed folios on final folio_put()
> KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains
> the folio lock
> KVM: guest_memfd: Folio mappability states and functions that manage
> their transition
> KVM: guest_memfd: Handle final folio_put() of guestmem pages
> KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
> KVM: guest_memfd: Add guest_memfd support to
> kvm_(read|/write)_guest_page()
> KVM: guest_memfd: Add KVM capability to check if guest_memfd is host
> mappable
> KVM: guest_memfd: Add a guest_memfd() flag to initialize it as
> mappable
> KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is
> allowed
> KVM: arm64: Skip VMA checks for slots without userspace address
> KVM: arm64: Handle guest_memfd()-backed guest page faults
> KVM: arm64: Enable guest_memfd private memory when pKVM is enabled
>
> Documentation/virt/kvm/api.rst | 4 +
> arch/arm64/include/asm/kvm_host.h | 3 +
> arch/arm64/kvm/Kconfig | 1 +
> arch/arm64/kvm/mmu.c | 119 +++-
> include/linux/kvm_host.h | 75 +++
> include/linux/page-flags.h | 22 +
> include/uapi/linux/kvm.h | 2 +
> include/uapi/linux/magic.h | 1 +
> mm/debug.c | 1 +
> mm/swap.c | 28 +-
> tools/testing/selftests/kvm/Makefile | 1 +
> .../testing/selftests/kvm/guest_memfd_test.c | 64 +-
> virt/kvm/Kconfig | 4 +
> virt/kvm/guest_memfd.c | 579 +++++++++++++++++-
> virt/kvm/kvm_main.c | 229 ++++++-
> 15 files changed, 1074 insertions(+), 59 deletions(-)
>
>
> base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
* Re: [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support
2025-01-09 16:34 ` [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
@ 2025-01-16 0:35 ` Ackerley Tng
2025-01-16 9:19 ` Fuad Tabba
0 siblings, 1 reply; 26+ messages in thread
From: Ackerley Tng @ 2025-01-16 0:35 UTC (permalink / raw)
To: Fuad Tabba
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Fuad Tabba <tabba@google.com> writes:
> Hi,
>
> As mentioned in the guest_memfd sync (2025-01-09), below is the state
> diagram that uses the new states in this patch series, and how they
> would interact with sharing/unsharing in pKVM:
>
> https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
Thanks Fuad!
I took a look at the state diagram [1] and the branch that this patch is
on [2], and here's what I understand about the flow:
1. From state H in the state diagram, the guest can request to unshare a
page. When KVM handles this unsharing, KVM marks the folio
mappability as NONE (state J).
2. The transition from state J to state K or I is independent of KVM -
userspace has to do this unmapping
3. On the next vcpu_run() from userspace, continuing from userspace's
handling of the unshare request, guest_memfd will check and try to
register a callback if the folio's mappability is NONE. If the folio
is mapped, or if folio is not mapped but refcount is elevated for
whatever reason, vcpu_run() fails and exits to userspace. If folio is
not mapped and gmem holds the last refcount, set folio mappability to
GUEST.
Here's one issue I see based on the above understanding:
Registration of the folio_put() callback only happens if the VMM
actually tries to do vcpu_run(). For 4K folios I think this is okay
since the 4K folio can be freed via the transition state K -> state I,
but for hugetlb folios that have been split for sharing with userspace,
not getting a folio_put() callback means never putting the hugetlb folio
together. Hence, relying on vcpu_run() to add the folio_put() callback
leaves a way that hugetlb pages can be removed from the system.
I think we should try and find a path forward that works for both 4K and
hugetlb folios.
IIUC page._mapcount and page.page_type work as a union because
page_type is only set for page types that are never mapped to userspace,
like PGTY_slab, PGTY_offline, etc.
Technically PGTY_guest_memfd is only set once the page can no longer be
mapped to userspace, which means PGTY_guest_memfd can only be set once the
mapcount reaches 0. Since the mapcount is raised in the faulting process, could gmem
perhaps use some kind of .unmap/.unfault callback, so that gmem gets
notified of all unmaps and will know for sure that the mapcount gets to
0?
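
To make the conflict concrete, here is an abridged sketch of the relevant part
of struct page (not the exact upstream definition, field comments mine):

/* Abridged sketch: _mapcount and page_type overlay the same 4 bytes. */
struct page_word_sketch {
	union {
		atomic_t _mapcount;	/* valid while the page is mapped to userspace */
		unsigned int page_type;	/* PGTY_* values, only usable while unmapped */
	};
};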
Alternatively, I took a look at the folio_is_zone_device()
implementation, and page.flags is used to identify the page's type. IIUC
a ZONE_DEVICE page also falls in the intersection of needing a
folio_put() callback and being mappable to userspace. Could we use a
similar approach, using page.flags to identify a page as a guest_memfd
page? That way we don't need to know when unmapping happens, and will
always be able to get a folio_put() callback.
[1] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
[2] https://android-kvm.googlesource.com/linux/+/764360863785ba16d974253a572c87abdd9fdf0b%5E%21/#F0
> This patch series doesn't necessarily impose all these transitions,
> many of them would be a matter of policy. This just happens to be the
> current way I've done it with pKVM/arm64.
>
> Cheers,
> /fuad
>
> On Fri, 13 Dec 2024 at 16:48, Fuad Tabba <tabba@google.com> wrote:
>>
>> This series adds restricted mmap() support to guest_memfd, as
>> well as support for guest_memfd on arm64. It is based on Linux
>> 6.13-rc2. Please refer to v3 for the context [1].
>>
>> Main changes since v3:
>> - Added a new folio type for guestmem, used to register a
>> callback when a folio's reference count reaches 0 (Matthew
>> Wilcox, DavidH) [2]
>> - Introduce new mappability states for folios, where a folio can
>> be mappable by the host and the guest, only the guest, or by no
>> one (transient state)
>> - Rebased on Linux 6.13-rc2
>> - Refactoring and tidying up
>>
>> Cheers,
>> /fuad
>>
>> [1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@google.com/
>> [2] https://lore.kernel.org/all/20241108162040.159038-1-tabba@google.com/
>>
>> Ackerley Tng (2):
>> KVM: guest_memfd: Make guest mem use guest mem inodes instead of
>> anonymous inodes
>> KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
>>
>> Fuad Tabba (12):
>> mm: Consolidate freeing of typed folios on final folio_put()
>> KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains
>> the folio lock
>> KVM: guest_memfd: Folio mappability states and functions that manage
>> their transition
>> KVM: guest_memfd: Handle final folio_put() of guestmem pages
>> KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
>> KVM: guest_memfd: Add guest_memfd support to
>> kvm_(read|/write)_guest_page()
>> KVM: guest_memfd: Add KVM capability to check if guest_memfd is host
>> mappable
>> KVM: guest_memfd: Add a guest_memfd() flag to initialize it as
>> mappable
>> KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is
>> allowed
>> KVM: arm64: Skip VMA checks for slots without userspace address
>> KVM: arm64: Handle guest_memfd()-backed guest page faults
>> KVM: arm64: Enable guest_memfd private memory when pKVM is enabled
>>
>> Documentation/virt/kvm/api.rst | 4 +
>> arch/arm64/include/asm/kvm_host.h | 3 +
>> arch/arm64/kvm/Kconfig | 1 +
>> arch/arm64/kvm/mmu.c | 119 +++-
>> include/linux/kvm_host.h | 75 +++
>> include/linux/page-flags.h | 22 +
>> include/uapi/linux/kvm.h | 2 +
>> include/uapi/linux/magic.h | 1 +
>> mm/debug.c | 1 +
>> mm/swap.c | 28 +-
>> tools/testing/selftests/kvm/Makefile | 1 +
>> .../testing/selftests/kvm/guest_memfd_test.c | 64 +-
>> virt/kvm/Kconfig | 4 +
>> virt/kvm/guest_memfd.c | 579 +++++++++++++++++-
>> virt/kvm/kvm_main.c | 229 ++++++-
>> 15 files changed, 1074 insertions(+), 59 deletions(-)
>>
>>
>> base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
>> --
>> 2.47.1.613.gc27f4b7a9f-goog
>>
* Re: [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support
2025-01-16 0:35 ` Ackerley Tng
@ 2025-01-16 9:19 ` Fuad Tabba
2025-01-20 9:26 ` Vlastimil Babka
0 siblings, 1 reply; 26+ messages in thread
From: Fuad Tabba @ 2025-01-16 9:19 UTC (permalink / raw)
To: Ackerley Tng
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
Hi Ackerley,
On Thu, 16 Jan 2025 at 00:35, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> > Hi,
> >
> > As mentioned in the guest_memfd sync (2025-01-09), below is the state
> > diagram that uses the new states in this patch series, and how they
> > would interact with sharing/unsharing in pKVM:
> >
> > https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
>
> Thanks Fuad!
>
> I took a look at the state diagram [1] and the branch that this patch is
> on [2], and here's what I understand about the flow:
>
> 1. From state H in the state diagram, the guest can request to unshare a
> page. When KVM handles this unsharing, KVM marks the folio
> mappability as NONE (state J).
> 2. The transition from state J to state K or I is independent of KVM -
> userspace has to do this unmapping
> 3. On the next vcpu_run() from userspace, continuing from userspace's
> handling of the unshare request, guest_memfd will check and try to
> register a callback if the folio's mappability is NONE. If the folio
> is mapped, or if folio is not mapped but refcount is elevated for
> whatever reason, vcpu_run() fails and exits to userspace. If folio is
> not mapped and gmem holds the last refcount, set folio mappability to
> GUEST.
>
> Here's one issue I see based on the above understanding:
>
> Registration of the folio_put() callback only happens if the VMM
> actually tries to do vcpu_run(). For 4K folios I think this is okay
> since the 4K folio can be freed via the transition state K -> state I,
> but for hugetlb folios that have been split for sharing with userspace,
> not getting a folio_put() callback means never putting the hugetlb folio
> together. Hence, relying on vcpu_run() to add the folio_put() callback
> leaves a way that hugetlb pages can be removed from the system.
>
> I think we should try and find a path forward that works for both 4K and
> hugetlb folios.
I agree, this could be an issue, but we could find other ways to
trigger the callback for huge folios. The important thing I was trying
to get to is how to have the callback and be able to register it.
> IIUC page._mapcount and page.page_type works as a union because
> page_type is only set for page types that are never mapped to userspace,
> like PGTY_slab, PGTY_offline, etc.
In the last guest_memfd sync, David Hildenbrand mentioned that that
would be a temporary restriction since the two structures would
eventually be decoupled, work being done by Matthew Wilcox I believe.
> Technically PGTY_guest_memfd is only set once the page can never be
> mapped to userspace, but PGTY_guest_memfd can only be set once mapcount
> reaches 0. Since mapcount is added in the faulting process, could gmem
> perhaps use some kind of .unmap/.unfault callback, so that gmem gets
> notified of all unmaps and will know for sure that the mapcount gets to
> 0?
I'm not sure if there is such a callback. If there were, I'm not sure
what that would buy us really. The main pain point is the refcount
going down to zero. The mapcount part is pretty straightforward and
likely to be only temporary as mentioned, i.e., when it gets decoupled,
we could register the callback earlier and simplify the transition
altogether.
> Alternatively, I took a look at the folio_is_zone_device()
> implementation, and page.flags is used to identify the page's type. IIUC
> a ZONE_DEVICE page also falls in the intersection of needing a
> folio_put() callback and can be mapped to userspace. Could we use a
> similar approach, using page.flags to identify a page as a guest_memfd
> page? That way we don't need to know when unmapping happens, and will
> always be able to get a folio_put() callback.
Same as above, with this being temporary, adding a new page flag might
not be something that the rest of the community would be too excited
about :)
Thanks for your comments!
/fuad
> [1] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
> [2] https://android-kvm.googlesource.com/linux/+/764360863785ba16d974253a572c87abdd9fdf0b%5E%21/#F0
>
> > This patch series doesn't necessarily impose all these transitions,
> > many of them would be a matter of policy. This just happens to be the
> > current way I've done it with pKVM/arm64.
> >
> > Cheers,
> > /fuad
> >
> > On Fri, 13 Dec 2024 at 16:48, Fuad Tabba <tabba@google.com> wrote:
> >>
> >> This series adds restricted mmap() support to guest_memfd, as
> >> well as support for guest_memfd on arm64. It is based on Linux
> >> 6.13-rc2. Please refer to v3 for the context [1].
> >>
> >> Main changes since v3:
> >> - Added a new folio type for guestmem, used to register a
> >> callback when a folio's reference count reaches 0 (Matthew
> >> Wilcox, DavidH) [2]
> >> - Introduce new mappability states for folios, where a folio can
> >> be mappable by the host and the guest, only the guest, or by no
> >> one (transient state)
> >> - Rebased on Linux 6.13-rc2
> >> - Refactoring and tidying up
> >>
> >> Cheers,
> >> /fuad
> >>
> >> [1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@google.com/
> >> [2] https://lore.kernel.org/all/20241108162040.159038-1-tabba@google.com/
> >>
> >> Ackerley Tng (2):
> >> KVM: guest_memfd: Make guest mem use guest mem inodes instead of
> >> anonymous inodes
> >> KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
> >>
> >> Fuad Tabba (12):
> >> mm: Consolidate freeing of typed folios on final folio_put()
> >> KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains
> >> the folio lock
> >> KVM: guest_memfd: Folio mappability states and functions that manage
> >> their transition
> >> KVM: guest_memfd: Handle final folio_put() of guestmem pages
> >> KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
> >> KVM: guest_memfd: Add guest_memfd support to
> >> kvm_(read|/write)_guest_page()
> >> KVM: guest_memfd: Add KVM capability to check if guest_memfd is host
> >> mappable
> >> KVM: guest_memfd: Add a guest_memfd() flag to initialize it as
> >> mappable
> >> KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is
> >> allowed
> >> KVM: arm64: Skip VMA checks for slots without userspace address
> >> KVM: arm64: Handle guest_memfd()-backed guest page faults
> >> KVM: arm64: Enable guest_memfd private memory when pKVM is enabled
> >>
> >> Documentation/virt/kvm/api.rst | 4 +
> >> arch/arm64/include/asm/kvm_host.h | 3 +
> >> arch/arm64/kvm/Kconfig | 1 +
> >> arch/arm64/kvm/mmu.c | 119 +++-
> >> include/linux/kvm_host.h | 75 +++
> >> include/linux/page-flags.h | 22 +
> >> include/uapi/linux/kvm.h | 2 +
> >> include/uapi/linux/magic.h | 1 +
> >> mm/debug.c | 1 +
> >> mm/swap.c | 28 +-
> >> tools/testing/selftests/kvm/Makefile | 1 +
> >> .../testing/selftests/kvm/guest_memfd_test.c | 64 +-
> >> virt/kvm/Kconfig | 4 +
> >> virt/kvm/guest_memfd.c | 579 +++++++++++++++++-
> >> virt/kvm/kvm_main.c | 229 ++++++-
> >> 15 files changed, 1074 insertions(+), 59 deletions(-)
> >>
> >>
> >> base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
> >> --
> >> 2.47.1.613.gc27f4b7a9f-goog
> >>
* Re: [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
` (14 preceding siblings ...)
2025-01-09 16:34 ` [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
@ 2025-01-16 14:48 ` Patrick Roy
2025-01-16 15:02 ` Fuad Tabba
15 siblings, 1 reply; 26+ messages in thread
From: Patrick Roy @ 2025-01-16 14:48 UTC (permalink / raw)
To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, shuah, hch,
jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, Kalyazin,
Nikita, Manwaring, Derek, Cali, Marco, James Gowans
Hi Fuad!
I finally got around to giving this patch series a spin for my non-CoCo
use case. I used the diff below to expose the functionality outside of pKVM
(based on Steven P.'s ARM CCA patch for custom VM types on ARM [2]).
There are two small things that were broken for me (will post as responses
to individual patches), but after fixing those, I was able to boot some
guests using a modified Firecracker [1].
Just wondering, are you still looking into posting a separate series
with just the MMU changes (e.g. something to have a bare-bones
KVM_SW_PROTECTED_VM on ARM, like we do for x86), like you mentioned in
the guest_memfd call before Christmas? We're pretty keen to
get our hands something like that for our non-CoCo VMs (and ofc, am
happy to help with any work required to get there :)
Best,
Patrick
[1]: https://github.com/roypat/firecracker/tree/secret-freedom-mmap
[2]: https://lore.kernel.org/kvm/20241004152804.72508-12-steven.price@arm.com/
---
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 8dfae9183651..0b8dfb855e51 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -380,6 +380,8 @@ struct kvm_arch {
* the associated pKVM instance in the hypervisor.
*/
struct kvm_protected_vm pkvm;
+
+ unsigned long type;
};
struct kvm_vcpu_fault_info {
@@ -1529,7 +1531,11 @@ void kvm_set_vm_id_reg(struct kvm *kvm, u32 reg, u64 val);
#define kvm_has_s1poe(k) \
(kvm_has_feat((k), ID_AA64MMFR3_EL1, S1POE, IMP))
-#define kvm_arch_has_private_mem(kvm) \
- (IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) && is_protected_kvm_enabled())
+#ifdef CONFIG_KVM_PRIVATE_MEM
+#define kvm_arch_has_private_mem(kvm) \
+ ((kvm)->arch.type == KVM_VM_TYPE_ARM_SW_PROTECTED || is_protected_kvm_enabled())
+#else
+#define kvm_arch_has_private_mem(kvm) false
+#endif
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index fe3451f244b5..2da26aa3b0b5 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -38,6 +38,7 @@ menuconfig KVM
select HAVE_KVM_VCPU_RUN_PID_CHANGE
select SCHED_INFO
select GUEST_PERF_EVENTS if PERF_EVENTS
+ select KVM_GENERIC_PRIVATE_MEM if KVM_SW_PROTECTED_VM
select KVM_GMEM_MAPPABLE
help
Support hosting virtualized guest machines.
@@ -84,4 +85,10 @@ config PTDUMP_STAGE2_DEBUGFS
If in doubt, say N.
+config KVM_SW_PROTECTED_VM
+ bool "Enable support for KVM software-protected VMs"
+ depends on EXPERT
+ depends on KVM && ARM64
+ select KVM_GENERIC_PRIVATE_MEM
+
endif # VIRTUALIZATION
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a102c3aebdbc..35683868c0e4 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -181,6 +181,19 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
mutex_unlock(&kvm->lock);
#endif
+ if (type & ~(KVM_VM_TYPE_ARM_MASK | KVM_VM_TYPE_ARM_IPA_SIZE_MASK))
+ return -EINVAL;
+
+ switch (type & KVM_VM_TYPE_ARM_MASK) {
+ case KVM_VM_TYPE_ARM_NORMAL:
+ case KVM_VM_TYPE_ARM_SW_PROTECTED:
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ kvm->arch.type = type & KVM_VM_TYPE_ARM_MASK;
+
kvm_init_nested(kvm);
ret = kvm_share_hyp(kvm, kvm + 1);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1c4b3871967c..9dbb472eb96a 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -869,9 +869,6 @@ static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
u64 mmfr0, mmfr1;
u32 phys_shift;
- if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
- return -EINVAL;
-
phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
if (is_protected_kvm_enabled()) {
phys_shift = kvm_ipa_limit;
@@ -2373,3 +2370,31 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
}
+
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range)
+{
+ /*
+ * Zap SPTEs even if the slot can't be mapped PRIVATE. KVM only
+ * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
+ * can simply ignore such slots. But if userspace is making memory
+ * PRIVATE, then KVM must prevent the guest from accessing the memory
+ * as shared. And if userspace is making memory SHARED and this point
+ * is reached, then at least one page within the range was previously
+ * PRIVATE, i.e. the slot's possible hugepage ranges are changing.
+ * Zapping SPTEs in this case ensures KVM will reassess whether or not
+ * a hugepage can be used for affected ranges.
+ */
+ if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
+ return false;
+
+ return kvm_unmap_gfn_range(kvm, range);
+}
+
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range)
+{
+ return false;
+}
+#endif
\ No newline at end of file
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b34aed04ffa5..214f6b5da43f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -653,6 +653,13 @@ struct kvm_enable_cap {
* PA size shift (i.e, log2(PA_Size)). For backward compatibility,
* value 0 implies the default IPA size, 40bits.
*/
+#define KVM_VM_TYPE_ARM_SHIFT 8
+#define KVM_VM_TYPE_ARM_MASK (0xfULL << KVM_VM_TYPE_ARM_SHIFT)
+#define KVM_VM_TYPE_ARM(_type) \
+ (((_type) << KVM_VM_TYPE_ARM_SHIFT) & KVM_VM_TYPE_ARM_MASK)
+#define KVM_VM_TYPE_ARM_NORMAL KVM_VM_TYPE_ARM(0)
+#define KVM_VM_TYPE_ARM_SW_PROTECTED KVM_VM_TYPE_ARM(1)
+
#define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL
#define KVM_VM_TYPE_ARM_IPA_SIZE(x) \
((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
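As a rough illustration of how userspace would select the new VM type (using
the uapi macro added above together with the existing IPA-size field; a
sketch only, error handling omitted), somewhere in the VMM's init path:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Create a software-protected VM with the default 40-bit IPA space. */
    int kvm_fd = open("/dev/kvm", O_RDWR);
    unsigned long vm_type = KVM_VM_TYPE_ARM_SW_PROTECTED |
                            KVM_VM_TYPE_ARM_IPA_SIZE(40);
    int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, vm_type);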
On Fri, 2024-12-13 at 16:47 +0000, Fuad Tabba wrote:
> This series adds restricted mmap() support to guest_memfd, as
> well as support for guest_memfd on arm64. It is based on Linux
> 6.13-rc2. Please refer to v3 for the context [1].
>
> Main changes since v3:
> - Added a new folio type for guestmem, used to register a
> callback when a folio's reference count reaches 0 (Matthew
> Wilcox, DavidH) [2]
> - Introduce new mappability states for folios, where a folio can
> be mappable by the host and the guest, only the guest, or by no
> one (transient state)
> - Rebased on Linux 6.13-rc2
> - Refactoring and tidying up
>
> Cheers,
> /fuad
>
> [1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@google.com/
> [2] https://lore.kernel.org/all/20241108162040.159038-1-tabba@google.com/
>
> Ackerley Tng (2):
> KVM: guest_memfd: Make guest mem use guest mem inodes instead of
> anonymous inodes
> KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
>
> Fuad Tabba (12):
> mm: Consolidate freeing of typed folios on final folio_put()
> KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains
> the folio lock
> KVM: guest_memfd: Folio mappability states and functions that manage
> their transition
> KVM: guest_memfd: Handle final folio_put() of guestmem pages
> KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
> KVM: guest_memfd: Add guest_memfd support to
> kvm_(read|/write)_guest_page()
> KVM: guest_memfd: Add KVM capability to check if guest_memfd is host
> mappable
> KVM: guest_memfd: Add a guest_memfd() flag to initialize it as
> mappable
> KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is
> allowed
> KVM: arm64: Skip VMA checks for slots without userspace address
> KVM: arm64: Handle guest_memfd()-backed guest page faults
> KVM: arm64: Enable guest_memfd private memory when pKVM is enabled
>
> Documentation/virt/kvm/api.rst | 4 +
> arch/arm64/include/asm/kvm_host.h | 3 +
> arch/arm64/kvm/Kconfig | 1 +
> arch/arm64/kvm/mmu.c | 119 +++-
> include/linux/kvm_host.h | 75 +++
> include/linux/page-flags.h | 22 +
> include/uapi/linux/kvm.h | 2 +
> include/uapi/linux/magic.h | 1 +
> mm/debug.c | 1 +
> mm/swap.c | 28 +-
> tools/testing/selftests/kvm/Makefile | 1 +
> .../testing/selftests/kvm/guest_memfd_test.c | 64 +-
> virt/kvm/Kconfig | 4 +
> virt/kvm/guest_memfd.c | 579 +++++++++++++++++-
> virt/kvm/kvm_main.c | 229 ++++++-
> 15 files changed, 1074 insertions(+), 59 deletions(-)
>
>
> base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
> --
> 2.47.1.613.gc27f4b7a9f-goog
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH v4 13/14] KVM: arm64: Handle guest_memfd()-backed guest page faults
2024-12-13 16:48 ` [RFC PATCH v4 13/14] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
@ 2025-01-16 14:48 ` Patrick Roy
2025-01-16 15:16 ` Fuad Tabba
0 siblings, 1 reply; 26+ messages in thread
From: Patrick Roy @ 2025-01-16 14:48 UTC (permalink / raw)
To: Fuad Tabba, kvm, linux-arm-msm, linux-mm
Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
chao.p.peng, jarkko, amoorthy, dmatlack, yu.c.zhang,
isaku.yamahata, mic, vbabka, vannapurve, ackerleytng, mail,
david, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, shuah, hch,
jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, Kalyazin,
Nikita, Manwaring, Derek, Cali, Marco, James Gowans
On Fri, 2024-12-13 at 16:48 +0000, Fuad Tabba wrote:
> Add arm64 support for resolving guest page faults on
> guest_memfd() backed memslots. This support is not contingent on
> pKVM, or other confidential computing support, and works in both
> VHE and nVHE modes.
>
> Without confidential computing, this support is useful for
> testing and debugging. In the future, it might also be useful
> should a user want to use guest_memfd() for all code, whether
> it's for a protected guest or not.
>
> For now, the fault granule is restricted to PAGE_SIZE.
>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
> arch/arm64/kvm/mmu.c | 111 ++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 109 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 342a9bd3848f..1c4b3871967c 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1434,6 +1434,107 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
> return vma->vm_flags & VM_MTE_ALLOWED;
> }
>
> +static int guest_memfd_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> + struct kvm_memory_slot *memslot, bool fault_is_perm)
> +{
> + struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
> + bool exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> + bool logging_active = memslot_is_logging(memslot);
> + struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> + bool write_fault = kvm_is_write_fault(vcpu);
> + struct mm_struct *mm = current->mm;
> + gfn_t gfn = gpa_to_gfn(fault_ipa);
> + struct kvm *kvm = vcpu->kvm;
> + struct page *page;
> + kvm_pfn_t pfn;
> + int ret;
> +
> + /* For now, guest_memfd() only supports PAGE_SIZE granules. */
> + if (WARN_ON_ONCE(fault_is_perm &&
> + kvm_vcpu_trap_get_perm_fault_granule(vcpu) != PAGE_SIZE)) {
> + return -EFAULT;
> + }
> +
> + VM_BUG_ON(write_fault && exec_fault);
> +
> + if (fault_is_perm && !write_fault && !exec_fault) {
> + kvm_err("Unexpected L2 read permission error\n");
> + return -EFAULT;
> + }
> +
> + /*
> + * Permission faults just need to update the existing leaf entry,
> + * and so normally don't require allocations from the memcache. The
> + * only exception to this is when dirty logging is enabled at runtime
> + * and a write fault needs to collapse a block entry into a table.
> + */
> + if (!fault_is_perm || (logging_active && write_fault)) {
> + ret = kvm_mmu_topup_memory_cache(memcache,
> + kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu));
> + if (ret)
> + return ret;
> + }
> +
> + /*
> + * Holds the folio lock until mapped in the guest and its refcount is
> + * stable, to avoid races with paths that check if the folio is mapped
> + * by the host.
> + */
> + ret = kvm_gmem_get_pfn_locked(kvm, memslot, gfn, &pfn, &page, NULL);
> + if (ret)
> + return ret;
> +
> + if (!kvm_slot_gmem_is_guest_mappable(memslot, gfn)) {
> + ret = -EAGAIN;
> + goto unlock_page;
> + }
> +
> + /*
> + * Once it's faulted in, a guest_memfd() page will stay in memory.
> + * Therefore, count it as locked.
> + */
> + if (!fault_is_perm) {
> + ret = account_locked_vm(mm, 1, true);
> + if (ret)
> + goto unlock_page;
> + }
> +
> + read_lock(&kvm->mmu_lock);
> + if (write_fault)
> + prot |= KVM_PGTABLE_PROT_W;
> +
> + if (exec_fault)
> + prot |= KVM_PGTABLE_PROT_X;
> +
> + if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC))
> + prot |= KVM_PGTABLE_PROT_X;
> +
> + /*
> + * Under the premise of getting a FSC_PERM fault, we just need to relax
> + * permissions.
> + */
> + if (fault_is_perm)
> + ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
> + else
> + ret = kvm_pgtable_stage2_map(pgt, fault_ipa, PAGE_SIZE,
> + __pfn_to_phys(pfn), prot,
> + memcache,
> + KVM_PGTABLE_WALK_HANDLE_FAULT |
> + KVM_PGTABLE_WALK_SHARED);
> +
> + kvm_release_faultin_page(kvm, page, !!ret, write_fault);
> + read_unlock(&kvm->mmu_lock);
> +
> + if (ret && !fault_is_perm)
> + account_locked_vm(mm, 1, false);
> +unlock_page:
> + unlock_page(page);
> + put_page(page);
There's a double-free of `page` here, as kvm_release_faultin_page
already calls put_page. I fixed it up locally with
+ unlock_page(page);
kvm_release_faultin_page(kvm, page, !!ret, write_fault);
read_unlock(&kvm->mmu_lock);
if (ret && !fault_is_perm)
account_locked_vm(mm, 1, false);
+ goto out;
+
unlock_page:
unlock_page(page);
put_page(page);
-
+out:
return ret != -EAGAIN ? ret : 0;
}
which I'm admittedly not sure is correct either because now the locks
don't get released in reverse order of acquisition, but with this I
was able to boot simple VMs.
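For readability, the tail of guest_memfd_abort() with that fixup applied
reads roughly as follows (just the combined result of the diff above, not
the actual respin):

    unlock_page(page);
    kvm_release_faultin_page(kvm, page, !!ret, write_fault);
    read_unlock(&kvm->mmu_lock);

    if (ret && !fault_is_perm)
        account_locked_vm(mm, 1, false);
    goto out;

unlock_page:
    unlock_page(page);
    put_page(page);
out:
    return ret != -EAGAIN ? ret : 0;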
> +
> + return ret != -EAGAIN ? ret : 0;
> +}
> +
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_s2_trans *nested,
> struct kvm_memory_slot *memslot, unsigned long hva,
> @@ -1900,8 +2001,14 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> goto out_unlock;
> }
>
> - ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
> - esr_fsc_is_permission_fault(esr));
> + if (kvm_slot_can_be_private(memslot)) {
For my setup, I needed
if (kvm_mem_is_private(vcpu->kvm, gfn))
here instead, because I am making use of KVM_GENERIC_MEMORY_ATTRIBUTES,
and had a memslot with the `KVM_MEM_GUEST_MEMFD` flag set, but whose
gfn range wasn't actually set to KVM_MEMORY_ATTRIBUTE_PRIVATE.
If I'm reading patch 12 correctly, your memslots always set only one of
userspace_addr or guest_memfd, and the stage 2 table setup simply checks
which one is the case to decide what to fault in, so maybe to support
both cases, this check should be
if (kvm_mem_is_private(vcpu->kvm, gfn) || (kvm_slot_can_be_private(memslot) && !memslot->userspace_addr))
?
[1]: https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/
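With that, the hunk below would become something like (illustrative sketch
only, using the helpers as named in this series):

    if (kvm_mem_is_private(vcpu->kvm, gfn) ||
        (kvm_slot_can_be_private(memslot) && !memslot->userspace_addr)) {
        ret = guest_memfd_abort(vcpu, fault_ipa, memslot,
                                esr_fsc_is_permission_fault(esr));
    } else {
        ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
                             esr_fsc_is_permission_fault(esr));
    }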
> + ret = guest_memfd_abort(vcpu, fault_ipa, memslot,
> + esr_fsc_is_permission_fault(esr));
> + } else {
> + ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
> + esr_fsc_is_permission_fault(esr));
> + }
> +
> if (ret == 0)
> ret = 1;
> out:
> --
> 2.47.1.613.gc27f4b7a9f-goog
Best,
Patrick
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support
2025-01-16 14:48 ` Patrick Roy
@ 2025-01-16 15:02 ` Fuad Tabba
0 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2025-01-16 15:02 UTC (permalink / raw)
To: Patrick Roy
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
Kalyazin, Nikita, Manwaring, Derek, Cali, Marco, James Gowans
Hi Patrick,
On Thu, 16 Jan 2025 at 14:48, Patrick Roy <roypat@amazon.co.uk> wrote:
>
> Hi Fuad!
>
> I finally got around to giving this patch series a spin for my non-CoCo
> usecase. I used the below diff to expose the functionality outside of pKVM
> (Based on Steven P.'s ARM CCA patch for custom VM types on ARM [2]).
> There's two small things that were broken for me (will post as responses
> to individual patches), but after fixing those, I was able to boot some
> guests using a modified Firecracker [1].
That's great, thanks for that, and for your comments on the patches.
> Just wondering, are you still looking into posting a separate series
> with just the MMU changes (e.g. something to have a bare-bones
> KVM_SW_PROTECTED_VM on ARM, like we do for x86), like you mentioned in
> the guest_memfd call before Christmas? We're pretty keen to
> get our hands on something like that for our non-CoCo VMs (and ofc, am
> happy to help with any work required to get there :)
Yes I am. I'm almost done with it now. That said, I need to make it
work with attributes as well (as you mention in your comments on the
other patch). I should send it out next week, before the biweekly
meeting in case we need to discuss it.
Cheers,
/fuad
> Best,
> Patrick
>
> [1]: https://github.com/roypat/firecracker/tree/secret-freedom-mmap
> [2]: https://lore.kernel.org/kvm/20241004152804.72508-12-steven.price@arm.com/
>
> ---
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 8dfae9183651..0b8dfb855e51 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -380,6 +380,8 @@ struct kvm_arch {
> * the associated pKVM instance in the hypervisor.
> */
> struct kvm_protected_vm pkvm;
> +
> + unsigned long type;
> };
>
> struct kvm_vcpu_fault_info {
> @@ -1529,7 +1531,11 @@ void kvm_set_vm_id_reg(struct kvm *kvm, u32 reg, u64 val);
> #define kvm_has_s1poe(k) \
> (kvm_has_feat((k), ID_AA64MMFR3_EL1, S1POE, IMP))
>
> -#define kvm_arch_has_private_mem(kvm) \
> - (IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) && is_protected_kvm_enabled())
> +#ifdef CONFIG_KVM_PRIVATE_MEM
> +#define kvm_arch_has_private_mem(kvm) \
> + ((kvm)->arch.type == KVM_VM_TYPE_ARM_SW_PROTECTED || is_protected_kvm_enabled())
> +#else
> +#define kvm_arch_has_private_mem(kvm) false
> +#endif
>
> #endif /* __ARM64_KVM_HOST_H__ */
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index fe3451f244b5..2da26aa3b0b5 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -38,6 +38,7 @@ menuconfig KVM
> select HAVE_KVM_VCPU_RUN_PID_CHANGE
> select SCHED_INFO
> select GUEST_PERF_EVENTS if PERF_EVENTS
> + select KVM_GENERIC_PRIVATE_MEM if KVM_SW_PROTECTED_VM
> select KVM_GMEM_MAPPABLE
> help
> Support hosting virtualized guest machines.
> @@ -84,4 +85,10 @@ config PTDUMP_STAGE2_DEBUGFS
>
> If in doubt, say N.
>
> +config KVM_SW_PROTECTED_VM
> + bool "Enable support for KVM software-protected VMs"
> + depends on EXPERT
> + depends on KVM && ARM64
> + select KVM_GENERIC_PRIVATE_MEM
> +
> endif # VIRTUALIZATION
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index a102c3aebdbc..35683868c0e4 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -181,6 +181,19 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> mutex_unlock(&kvm->lock);
> #endif
>
> + if (type & ~(KVM_VM_TYPE_ARM_MASK | KVM_VM_TYPE_ARM_IPA_SIZE_MASK))
> + return -EINVAL;
> +
> + switch (type & KVM_VM_TYPE_ARM_MASK) {
> + case KVM_VM_TYPE_ARM_NORMAL:
> + case KVM_VM_TYPE_ARM_SW_PROTECTED:
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + kvm->arch.type = type & KVM_VM_TYPE_ARM_MASK;
> +
> kvm_init_nested(kvm);
>
> ret = kvm_share_hyp(kvm, kvm + 1);
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 1c4b3871967c..9dbb472eb96a 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -869,9 +869,6 @@ static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
> u64 mmfr0, mmfr1;
> u32 phys_shift;
>
> - if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
> - return -EINVAL;
> -
> phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
> if (is_protected_kvm_enabled()) {
> phys_shift = kvm_ipa_limit;
> @@ -2373,3 +2370,31 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
>
> trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
> }
> +
> +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> + struct kvm_gfn_range *range)
> +{
> + /*
> + * Zap SPTEs even if the slot can't be mapped PRIVATE. KVM only
> + * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
> + * can simply ignore such slots. But if userspace is making memory
> + * PRIVATE, then KVM must prevent the guest from accessing the memory
> + * as shared. And if userspace is making memory SHARED and this point
> + * is reached, then at least one page within the range was previously
> + * PRIVATE, i.e. the slot's possible hugepage ranges are changing.
> + * Zapping SPTEs in this case ensures KVM will reassess whether or not
> + * a hugepage can be used for affected ranges.
> + */
> + if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
> + return false;
> +
> + return kvm_unmap_gfn_range(kvm, range);
> +}
> +
> +bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> + struct kvm_gfn_range *range)
> +{
> + return false;
> +}
> +#endif
> \ No newline at end of file
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index b34aed04ffa5..214f6b5da43f 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -653,6 +653,13 @@ struct kvm_enable_cap {
> * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
> * value 0 implies the default IPA size, 40bits.
> */
> +#define KVM_VM_TYPE_ARM_SHIFT 8
> +#define KVM_VM_TYPE_ARM_MASK (0xfULL << KVM_VM_TYPE_ARM_SHIFT)
> +#define KVM_VM_TYPE_ARM(_type) \
> + (((_type) << KVM_VM_TYPE_ARM_SHIFT) & KVM_VM_TYPE_ARM_MASK)
> +#define KVM_VM_TYPE_ARM_NORMAL KVM_VM_TYPE_ARM(0)
> +#define KVM_VM_TYPE_ARM_SW_PROTECTED KVM_VM_TYPE_ARM(1)
> +
> #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL
> #define KVM_VM_TYPE_ARM_IPA_SIZE(x) \
> ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
>
>
> On Fri, 2024-12-13 at 16:47 +0000, Fuad Tabba wrote:
> > This series adds restricted mmap() support to guest_memfd, as
> > well as support for guest_memfd on arm64. It is based on Linux
> > 6.13-rc2. Please refer to v3 for the context [1].
> >
> > Main changes since v3:
> > - Added a new folio type for guestmem, used to register a
> > callback when a folio's reference count reaches 0 (Matthew
> > Wilcox, DavidH) [2]
> > - Introduce new mappability states for folios, where a folio can
> > be mappable by the host and the guest, only the guest, or by no
> > one (transient state)
> > - Rebased on Linux 6.13-rc2
> > - Refactoring and tidying up
> >
> > Cheers,
> > /fuad
> >
> > [1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@google.com/
> > [2] https://lore.kernel.org/all/20241108162040.159038-1-tabba@google.com/
> >
> > Ackerley Tng (2):
> > KVM: guest_memfd: Make guest mem use guest mem inodes instead of
> > anonymous inodes
> > KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
> >
> > Fuad Tabba (12):
> > mm: Consolidate freeing of typed folios on final folio_put()
> > KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains
> > the folio lock
> > KVM: guest_memfd: Folio mappability states and functions that manage
> > their transition
> > KVM: guest_memfd: Handle final folio_put() of guestmem pages
> > KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
> > KVM: guest_memfd: Add guest_memfd support to
> > kvm_(read|/write)_guest_page()
> > KVM: guest_memfd: Add KVM capability to check if guest_memfd is host
> > mappable
> > KVM: guest_memfd: Add a guest_memfd() flag to initialize it as
> > mappable
> > KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is
> > allowed
> > KVM: arm64: Skip VMA checks for slots without userspace address
> > KVM: arm64: Handle guest_memfd()-backed guest page faults
> > KVM: arm64: Enable guest_memfd private memory when pKVM is enabled
> >
> > Documentation/virt/kvm/api.rst | 4 +
> > arch/arm64/include/asm/kvm_host.h | 3 +
> > arch/arm64/kvm/Kconfig | 1 +
> > arch/arm64/kvm/mmu.c | 119 +++-
> > include/linux/kvm_host.h | 75 +++
> > include/linux/page-flags.h | 22 +
> > include/uapi/linux/kvm.h | 2 +
> > include/uapi/linux/magic.h | 1 +
> > mm/debug.c | 1 +
> > mm/swap.c | 28 +-
> > tools/testing/selftests/kvm/Makefile | 1 +
> > .../testing/selftests/kvm/guest_memfd_test.c | 64 +-
> > virt/kvm/Kconfig | 4 +
> > virt/kvm/guest_memfd.c | 579 +++++++++++++++++-
> > virt/kvm/kvm_main.c | 229 ++++++-
> > 15 files changed, 1074 insertions(+), 59 deletions(-)
> >
> >
> > base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
> > --
> > 2.47.1.613.gc27f4b7a9f-goog
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH v4 13/14] KVM: arm64: Handle guest_memfd()-backed guest page faults
2025-01-16 14:48 ` Patrick Roy
@ 2025-01-16 15:16 ` Fuad Tabba
0 siblings, 0 replies; 26+ messages in thread
From: Fuad Tabba @ 2025-01-16 15:16 UTC (permalink / raw)
To: Patrick Roy
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
mail, david, michael.roth, wei.w.wang, liam.merwick,
isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
james.morse, yuzenghui, oliver.upton, maz, will, qperret, keirf,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton,
Kalyazin, Nikita, Manwaring, Derek, Cali, Marco, James Gowans
Hi Patrick,
On Thu, 16 Jan 2025 at 14:48, Patrick Roy <roypat@amazon.co.uk> wrote:
>
> On Fri, 2024-12-13 at 16:48 +0000, Fuad Tabba wrote:
> > Add arm64 support for resolving guest page faults on
> > guest_memfd() backed memslots. This support is not contingent on
> > pKVM, or other confidential computing support, and works in both
> > VHE and nVHE modes.
> >
> > Without confidential computing, this support is useful for
> > testing and debugging. In the future, it might also be useful
> > should a user want to use guest_memfd() for all code, whether
> > it's for a protected guest or not.
> >
> > For now, the fault granule is restricted to PAGE_SIZE.
> >
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> > arch/arm64/kvm/mmu.c | 111 ++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 109 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 342a9bd3848f..1c4b3871967c 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1434,6 +1434,107 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
> > return vma->vm_flags & VM_MTE_ALLOWED;
> > }
> >
> > +static int guest_memfd_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > + struct kvm_memory_slot *memslot, bool fault_is_perm)
> > +{
> > + struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
> > + bool exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > + bool logging_active = memslot_is_logging(memslot);
> > + struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > + bool write_fault = kvm_is_write_fault(vcpu);
> > + struct mm_struct *mm = current->mm;
> > + gfn_t gfn = gpa_to_gfn(fault_ipa);
> > + struct kvm *kvm = vcpu->kvm;
> > + struct page *page;
> > + kvm_pfn_t pfn;
> > + int ret;
> > +
> > + /* For now, guest_memfd() only supports PAGE_SIZE granules. */
> > + if (WARN_ON_ONCE(fault_is_perm &&
> > + kvm_vcpu_trap_get_perm_fault_granule(vcpu) != PAGE_SIZE)) {
> > + return -EFAULT;
> > + }
> > +
> > + VM_BUG_ON(write_fault && exec_fault);
> > +
> > + if (fault_is_perm && !write_fault && !exec_fault) {
> > + kvm_err("Unexpected L2 read permission error\n");
> > + return -EFAULT;
> > + }
> > +
> > + /*
> > + * Permission faults just need to update the existing leaf entry,
> > + * and so normally don't require allocations from the memcache. The
> > + * only exception to this is when dirty logging is enabled at runtime
> > + * and a write fault needs to collapse a block entry into a table.
> > + */
> > + if (!fault_is_perm || (logging_active && write_fault)) {
> > + ret = kvm_mmu_topup_memory_cache(memcache,
> > + kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu));
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + /*
> > + * Holds the folio lock until mapped in the guest and its refcount is
> > + * stable, to avoid races with paths that check if the folio is mapped
> > + * by the host.
> > + */
> > + ret = kvm_gmem_get_pfn_locked(kvm, memslot, gfn, &pfn, &page, NULL);
> > + if (ret)
> > + return ret;
> > +
> > + if (!kvm_slot_gmem_is_guest_mappable(memslot, gfn)) {
> > + ret = -EAGAIN;
> > + goto unlock_page;
> > + }
> > +
> > + /*
> > + * Once it's faulted in, a guest_memfd() page will stay in memory.
> > + * Therefore, count it as locked.
> > + */
> > + if (!fault_is_perm) {
> > + ret = account_locked_vm(mm, 1, true);
> > + if (ret)
> > + goto unlock_page;
> > + }
> > +
> > + read_lock(&kvm->mmu_lock);
> > + if (write_fault)
> > + prot |= KVM_PGTABLE_PROT_W;
> > +
> > + if (exec_fault)
> > + prot |= KVM_PGTABLE_PROT_X;
> > +
> > + if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC))
> > + prot |= KVM_PGTABLE_PROT_X;
> > +
> > + /*
> > + * Under the premise of getting a FSC_PERM fault, we just need to relax
> > + * permissions.
> > + */
> > + if (fault_is_perm)
> > + ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
> > + else
> > + ret = kvm_pgtable_stage2_map(pgt, fault_ipa, PAGE_SIZE,
> > + __pfn_to_phys(pfn), prot,
> > + memcache,
> > + KVM_PGTABLE_WALK_HANDLE_FAULT |
> > + KVM_PGTABLE_WALK_SHARED);
> > +
> > + kvm_release_faultin_page(kvm, page, !!ret, write_fault);
> > + read_unlock(&kvm->mmu_lock);
> > +
> > + if (ret && !fault_is_perm)
> > + account_locked_vm(mm, 1, false);
> > +unlock_page:
> > + unlock_page(page);
> > + put_page(page);
>
> There's a double-free of `page` here, as kvm_release_faultin_page
> already calls put_page. I fixed it up locally with
>
> + unlock_page(page);
> kvm_release_faultin_page(kvm, page, !!ret, write_fault);
> read_unlock(&kvm->mmu_lock);
>
> if (ret && !fault_is_perm)
> account_locked_vm(mm, 1, false);
> + goto out;
> +
> unlock_page:
> unlock_page(page);
> put_page(page);
> -
> +out:
> return ret != -EAGAIN ? ret : 0;
> }
>
> which I'm admittedly not sure is correct either because now the locks
> don't get released in reverse order of acquisition, but with this I
> was able to boot simple VMs.
Thanks for that. You're right, I broke this code right before sending
out the series while fixing a merge conflict. I have prepared a new
patch series (rebased on Linux 6.13-rc7), with this redone to be part
of user_mem_abort(), as opposed to being in its own function. That makes
the code cleaner and more maintainable.
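Purely as an illustration of the direction (the respin itself isn't in this
thread, so treat this as a hypothetical sketch), the idea is an early branch
in user_mem_abort() that takes the page from guest_memfd instead of the
usual GUP path, and then shares the existing stage-2 mapping code:

    /* Hypothetical sketch only -- not the actual respin. */
    if (kvm_slot_can_be_private(memslot)) {
        /* Folio comes from guest_memfd(), held locked until mapped. */
        ret = kvm_gmem_get_pfn_locked(kvm, memslot, gfn, &pfn, &page, NULL);
        if (ret)
            return ret;
    } else {
        /* Existing path: resolve the userspace hva as before. */
        ...
    }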
> > +
> > + return ret != -EAGAIN ? ret : 0;
> > +}
> > +
> > static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > struct kvm_s2_trans *nested,
> > struct kvm_memory_slot *memslot, unsigned long hva,
> > @@ -1900,8 +2001,14 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > goto out_unlock;
> > }
> >
> > - ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
> > - esr_fsc_is_permission_fault(esr));
> > + if (kvm_slot_can_be_private(memslot)) {
>
> For my setup, I needed
>
> if (kvm_mem_is_private(vcpu->kvm, gfn))
>
> here instead, because I am making use of KVM_GENERIC_MEMORY_ATTRIBUTES,
> and had a memslot with the `KVM_MEM_GUEST_MEMFD` flag set, but whose
> gfn range wasn't actually set to KVM_MEMORY_ATTRIBUTE_PRIVATE.
>
> If I'm reading patch 12 correctly, your memslots always set only one of
> userspace_addr or guest_memfd, and the stage 2 table setup simply checks
> which one is the case to decide what to fault in, so maybe to support
> both cases, this check should be
>
> if (kvm_mem_is_private(vcpu->kvm, gfn) || (kvm_slot_can_be_private(memslot) && !memslot->userspace_addr))
>
> ?
I've actually missed supporting both cases, and I think your
suggestion is the right way to do it. I'll fix it in the respin.
Cheers,
/fuad
>
> [1]: https://lore.kernel.org/all/20240801090117.3841080-1-tabba@google.com/
>
> > + ret = guest_memfd_abort(vcpu, fault_ipa, memslot,
> > + esr_fsc_is_permission_fault(esr));
> > + } else {
> > + ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
> > + esr_fsc_is_permission_fault(esr));
> > + }
> > +
> > if (ret == 0)
> > ret = 1;
> > out:
> > --
> > 2.47.1.613.gc27f4b7a9f-goog
>
> Best,
> Patrick
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support
2025-01-16 9:19 ` Fuad Tabba
@ 2025-01-20 9:26 ` Vlastimil Babka
2025-01-20 9:36 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2025-01-20 9:26 UTC (permalink / raw)
To: Fuad Tabba, Ackerley Tng
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vannapurve, mail, david,
michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat,
shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton
On 1/16/25 10:19, Fuad Tabba wrote:
> Hi Ackerley,
>
> On Thu, 16 Jan 2025 at 00:35, Ackerley Tng <ackerleytng@google.com> wrote:
>>
>> Registration of the folio_put() callback only happens if the VMM
>> actually tries to do vcpu_run(). For 4K folios I think this is okay
>> since the 4K folio can be freed via the transition state K -> state I,
>> but for hugetlb folios that have been split for sharing with userspace,
>> not getting a folio_put() callback means never putting the hugetlb folio
>> together. Hence, relying on vcpu_run() to add the folio_put() callback
>> leaves a way that hugetlb pages can be removed from the system.
>>
>> I think we should try and find a path forward that works for both 4K and
>> hugetlb folios.
>
> I agree, this could be an issue, but we could find other ways to
> trigger the callback for huge folios. The important thing I was trying
> to get to is how to have the callback and be able to register it.
>
>> IIUC page._mapcount and page.page_type works as a union because
>> page_type is only set for page types that are never mapped to userspace,
>> like PGTY_slab, PGTY_offline, etc.
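For context, the union being discussed is the 4-byte field in struct page,
roughly (simplified from current mm_types.h):

    struct page {
        ...
        union {
            /* set for typed pages that are never mapped to userspace */
            unsigned int page_type;
            /* otherwise: how many times the page is mapped to userspace */
            atomic_t _mapcount;
        };
        ...
    };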
>
> In the last guest_memfd sync, David Hildenbrand mentioned that that
> would be a temporary restriction since the two structures would
> eventually be decoupled, work being done by Matthew Wilcox I believe.
Note the "temporary" might be few years still, it's a long-term project.
>> Technically PGTY_guest_memfd is only set once the page can never be
>> mapped to userspace, but PGTY_guest_memfd can only be set once mapcount
>> reaches 0. Since mapcount is added in the faulting process, could gmem
>> perhaps use some kind of .unmap/.unfault callback, so that gmem gets
>> notified of all unmaps and will know for sure that the mapcount gets to
>> 0?
>
> I'm not sure if there is such a callback. If there were, I'm not sure
> what that would buy us really. The main pain point is the refcount
> going down to zero. The mapcount part is pretty straightforward and
> likely to be only temporary as mentioned, i.e., when it gets decoupled,
> we could register the callback earlier and simplify the transition
> altogether.
>
>> Alternatively, I took a look at the folio_is_zone_device()
>> implementation, and page.flags is used to identify the page's type. IIUC
>> a ZONE_DEVICE page also falls in the intersection of needing a
>> folio_put() callback and can be mapped to userspace. Could we use a
>> similar approach, using page.flags to identify a page as a guest_memfd
>> page? That way we don't need to know when unmapping happens, and will
>> always be able to get a folio_put() callback.
>
> Same as above, with this being temporary, adding a new page flag might
> not be something that the rest of the community might be too excited
> about :)
Yeah, adding a page flag is very difficult these days. Also while it's
technically true that being a ZONE_DEVICE page is recorded in the page flags
field, it's not really a separate flag - just that some bits of the flags
field encode the zonenum. But zonenum is a number, not independent flags -
i.e. there can be up to 8 zones, using up to 3 flag bits. And page's zonenum
also has to match the zone's pfn range, so we couldn't just change the zone
of a page to some hypothetical ZONE_MEMFD when it becomes used for memfd.
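Concretely, that zone lookup is just a small field extraction from
page->flags, roughly (simplified from current kernels):

    static inline enum zone_type page_zonenum(const struct page *page)
    {
        return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
    }

    static inline bool is_zone_device_page(const struct page *page)
    {
        return page_zonenum(page) == ZONE_DEVICE;
    }

so the "flag" is really an index that has to agree with the zone the pfn
actually sits in, which is what makes a hypothetical ZONE_MEMFD awkward.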
> Thanks for your comments!
> /fuad
>
>> [1] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf
>> [2] https://android-kvm.googlesource.com/linux/+/764360863785ba16d974253a572c87abdd9fdf0b%5E%21/#F0
>>
>> > This patch series doesn't necessarily impose all these transitions,
>> > many of them would be a matter of policy. This just happens to be the
>> > current way I've done it with pKVM/arm64.
>> >
>> > Cheers,
>> > /fuad
>> >
>> > On Fri, 13 Dec 2024 at 16:48, Fuad Tabba <tabba@google.com> wrote:
>> >>
>> >> This series adds restricted mmap() support to guest_memfd, as
>> >> well as support for guest_memfd on arm64. It is based on Linux
>> >> 6.13-rc2. Please refer to v3 for the context [1].
>> >>
>> >> Main changes since v3:
>> >> - Added a new folio type for guestmem, used to register a
>> >> callback when a folio's reference count reaches 0 (Matthew
>> >> Wilcox, DavidH) [2]
>> >> - Introduce new mappability states for folios, where a folio can
>> >> be mappable by the host and the guest, only the guest, or by no
>> >> one (transient state)
>> >> - Rebased on Linux 6.13-rc2
>> >> - Refactoring and tidying up
>> >>
>> >> Cheers,
>> >> /fuad
>> >>
>> >> [1] https://lore.kernel.org/all/20241010085930.1546800-1-tabba@google.com/
>> >> [2] https://lore.kernel.org/all/20241108162040.159038-1-tabba@google.com/
>> >>
>> >> Ackerley Tng (2):
>> >> KVM: guest_memfd: Make guest mem use guest mem inodes instead of
>> >> anonymous inodes
>> >> KVM: guest_memfd: Track mappability within a struct kvm_gmem_private
>> >>
>> >> Fuad Tabba (12):
>> >> mm: Consolidate freeing of typed folios on final folio_put()
>> >> KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains
>> >> the folio lock
>> >> KVM: guest_memfd: Folio mappability states and functions that manage
>> >> their transition
>> >> KVM: guest_memfd: Handle final folio_put() of guestmem pages
>> >> KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared
>> >> KVM: guest_memfd: Add guest_memfd support to
>> >> kvm_(read|/write)_guest_page()
>> >> KVM: guest_memfd: Add KVM capability to check if guest_memfd is host
>> >> mappable
>> >> KVM: guest_memfd: Add a guest_memfd() flag to initialize it as
>> >> mappable
>> >> KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is
>> >> allowed
>> >> KVM: arm64: Skip VMA checks for slots without userspace address
>> >> KVM: arm64: Handle guest_memfd()-backed guest page faults
>> >> KVM: arm64: Enable guest_memfd private memory when pKVM is enabled
>> >>
>> >> Documentation/virt/kvm/api.rst | 4 +
>> >> arch/arm64/include/asm/kvm_host.h | 3 +
>> >> arch/arm64/kvm/Kconfig | 1 +
>> >> arch/arm64/kvm/mmu.c | 119 +++-
>> >> include/linux/kvm_host.h | 75 +++
>> >> include/linux/page-flags.h | 22 +
>> >> include/uapi/linux/kvm.h | 2 +
>> >> include/uapi/linux/magic.h | 1 +
>> >> mm/debug.c | 1 +
>> >> mm/swap.c | 28 +-
>> >> tools/testing/selftests/kvm/Makefile | 1 +
>> >> .../testing/selftests/kvm/guest_memfd_test.c | 64 +-
>> >> virt/kvm/Kconfig | 4 +
>> >> virt/kvm/guest_memfd.c | 579 +++++++++++++++++-
>> >> virt/kvm/kvm_main.c | 229 ++++++-
>> >> 15 files changed, 1074 insertions(+), 59 deletions(-)
>> >>
>> >>
>> >> base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
>> >> --
>> >> 2.47.1.613.gc27f4b7a9f-goog
>> >>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support
2025-01-20 9:26 ` Vlastimil Babka
@ 2025-01-20 9:36 ` David Hildenbrand
0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-01-20 9:36 UTC (permalink / raw)
To: Vlastimil Babka, Fuad Tabba, Ackerley Tng
Cc: kvm, linux-arm-msm, linux-mm, pbonzini, chenhuacai, mpe, anup,
paul.walmsley, palmer, aou, seanjc, viro, brauner, willy, akpm,
xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack,
yu.c.zhang, isaku.yamahata, mic, vannapurve, mail, michael.roth,
wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
quic_pheragu, catalin.marinas, james.morse, yuzenghui,
oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
rientjes, jhubbard, fvdl, hughd, jthoughton
On 20.01.25 10:26, Vlastimil Babka wrote:
> On 1/16/25 10:19, Fuad Tabba wrote:
>> Hi Ackerley,
>>
>> On Thu, 16 Jan 2025 at 00:35, Ackerley Tng <ackerleytng@google.com> wrote:
>>>
>>> Registration of the folio_put() callback only happens if the VMM
>>> actually tries to do vcpu_run(). For 4K folios I think this is okay
>>> since the 4K folio can be freed via the transition state K -> state I,
>>> but for hugetlb folios that have been split for sharing with userspace,
>>> not getting a folio_put() callback means never putting the hugetlb folio
>>> together. Hence, relying on vcpu_run() to add the folio_put() callback
>>> leaves a way that hugetlb pages can be removed from the system.
>>>
>>> I think we should try and find a path forward that works for both 4K and
>>> hugetlb folios.
>>
>> I agree, this could be an issue, but we could find other ways to
>> trigger the callback for huge folios. The important thing I was trying
>> to get to is how to have the callback and be able to register it.
>>
>>> IIUC page._mapcount and page.page_type works as a union because
>>> page_type is only set for page types that are never mapped to userspace,
>>> like PGTY_slab, PGTY_offline, etc.
>>
>> In the last guest_memfd sync, David Hildenbrand mentioned that that
>> would be a temporary restriction since the two structures would
>> eventually be decoupled, work being done by Matthew Wilcox I believe.
>
> Note the "temporary" might be few years still, it's a long-term project.
Right, nobody knows how long it will actually take. Willy thinks the
part that would be required here might be feasible in the nearer future:
"'d like to lay out some goals for the coming year. I think we can
accomplish a big goal this year" [1]
[1] https://lore.kernel.org/all/Z37pxbkHPbLYnDKn@casper.infradead.org/T/#u
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2025-01-20 9:36 UTC | newest]
Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-13 16:47 [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
2024-12-13 16:47 ` [RFC PATCH v4 01/14] mm: Consolidate freeing of typed folios on final folio_put() Fuad Tabba
2024-12-13 16:47 ` [RFC PATCH v4 02/14] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Fuad Tabba
2024-12-13 16:47 ` [RFC PATCH v4 03/14] KVM: guest_memfd: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 04/14] KVM: guest_memfd: Track mappability within a struct kvm_gmem_private Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 05/14] KVM: guest_memfd: Folio mappability states and functions that manage their transition Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 06/14] KVM: guest_memfd: Handle final folio_put() of guestmem pages Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 07/14] KVM: guest_memfd: Allow host to mmap guest_memfd() pages when shared Fuad Tabba
2024-12-27 4:21 ` Alexey Kardashevskiy
2025-01-09 10:17 ` Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 08/14] KVM: guest_memfd: Add guest_memfd support to kvm_(read|/write)_guest_page() Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 09/14] KVM: guest_memfd: Add KVM capability to check if guest_memfd is host mappable Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 10/14] KVM: guest_memfd: Add a guest_memfd() flag to initialize it as mappable Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 11/14] KVM: guest_memfd: selftests: guest_memfd mmap() test when mapping is allowed Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 12/14] KVM: arm64: Skip VMA checks for slots without userspace address Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 13/14] KVM: arm64: Handle guest_memfd()-backed guest page faults Fuad Tabba
2025-01-16 14:48 ` Patrick Roy
2025-01-16 15:16 ` Fuad Tabba
2024-12-13 16:48 ` [RFC PATCH v4 14/14] KVM: arm64: Enable guest_memfd private memory when pKVM is enabled Fuad Tabba
2025-01-09 16:34 ` [RFC PATCH v4 00/14] KVM: Restricted mapping of guest_memfd at the host and arm64 support Fuad Tabba
2025-01-16 0:35 ` Ackerley Tng
2025-01-16 9:19 ` Fuad Tabba
2025-01-20 9:26 ` Vlastimil Babka
2025-01-20 9:36 ` David Hildenbrand
2025-01-16 14:48 ` Patrick Roy
2025-01-16 15:02 ` Fuad Tabba