Date: Tue, 14 May 2024 18:06:18 +0200
From: Greg Kroah-Hartman
To: Yuanchu Xie
Cc: Wei Liu, Rob Bradford, Theodore Ts'o, Pasha Tatashin, Jonathan Corbet,
 Thomas Zimmermann, Dan Williams, Tom Lendacky, Kuppuswamy Sathyanarayanan,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 virtualization@lists.linux.dev, dev@lists.cloudhypervisor.org
Subject: Re: [RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties
Message-ID: <2024051414-untie-deviant-ed35@gregkh>
References: <20240514020301.1835794-1-yuanchu@google.com>
In-Reply-To: <20240514020301.1835794-1-yuanchu@google.com>

On Mon, May 13, 2024 at 07:03:00PM -0700, Yuanchu Xie wrote:
> Memctl provides a way for the guest to
> control its physical memory
> properties, and enables optimizations and security features. For
> example, the guest can provide information to the host where parts of a
> hugepage may be unbacked, or sensitive data may not be swapped out, etc.
>
> Memctl allows guests to manipulate its gPTE entries in the SLAT, and
> also some other properties of the memory map the back's host memory.
> This is achieved by using the KVM_CAP_SYNC_MMU capability. When this
> capability is available, the changes in the backing of the memory region
> on the host are automatically reflected into the guest. For example, an
> mmap() or madvise() that affects the region will be made visible
> immediately.
>
> There are two components of the implementation: the guest Linux driver
> and Virtual Machine Monitor (VMM) device. A guest-allocated shared
> buffer is negotiated per-cpu through a few PCI MMIO registers, the VMM
> device assigns a unique command for each per-cpu buffer. The guest
> writes its memctl request in the per-cpu buffer, then writes the
> corresponding command into the command register, calling into the VMM
> device to perform the memctl request.
>
> The synchronous per-cpu shared buffer approach avoids the kick and busy
> waiting that the guest would have to do with virtio virtqueue transport.
>
> We provide both kernel and userspace APIs
> Kernel API
> long memctl_vmm_call(__u64 func_code, __u64 addr, __u64 length, __u64 arg,
>                      struct memctl_buf *buf);
>
> Kernel drivers can take advantage of the memctl calls to provide
> paravirtualization of kernel stacks or page zeroing.
>
> User API
> From the userland, the memctl guest driver is controlled via ioctl(2)
> call. It requires CAP_SYS_ADMIN.
>
> ioctl(fd, MEMCTL_IOCTL, union memctl_vmm *memctl_vmm);
>
> Guest userland applications can tag VMAs and guest hugepages, or advise
> the host on how to handle sensitive guest pages.
>
> Supported function codes and their use cases:
> MEMCTL_FREE/REMOVE/DONTNEED/PAGEOUT.
> For the guest. One can reduce the
> struct page and page table lookup overhead by using hugepages backed by
> smaller pages on the host. These memctl commands can allow for partial
> freeing of private guest hugepages to save memory. They also allow
> kernel memory, such as kernel stacks and task_structs to be
> paravirtualized.
>
> MEMCTL_UNMERGEABLE is useful for security, when the VM does not want to
> share its backing pages.
> The same with MADV_DONTDUMP, so sensitive pages are not included in a
> dump.
> MLOCK/UNLOCK can advise the host that sensitive information is not
> swapped out on the host.
>
> MEMCTL_MPROTECT_NONE/R/W/RW. For guest stacks backed by hugepages, stack
> guard pages can be handled in the host and memory can be saved in the
> hugepage.
>
> MEMCTL_SET_VMA_ANON_NAME is useful for observability and debugging how
> guest memory is being mapped on the host.
>
> Sample program making use of MEMCTL_SET_VMA_ANON_NAME and
> MEMCTL_DONTNEED:
> https://github.com/Dummyc0m/memctl-set-anon-vma-name/tree/main
> https://github.com/Dummyc0m/memctl-set-anon-vma-name/tree/dontneed
>
> The VMM implementation is being proposed for Cloud Hypervisor:
> https://github.com/Dummyc0m/cloud-hypervisor/
>
> Cloud Hypervisor issue:
> https://github.com/cloud-hypervisor/cloud-hypervisor/issues/6318
>
> Signed-off-by: Yuanchu Xie
> ---
>  .../userspace-api/ioctl/ioctl-number.rst |   2 +
>  drivers/virt/Kconfig                     |   2 +
>  drivers/virt/Makefile                    |   1 +
>  drivers/virt/memctl/Kconfig              |  10 +
>  drivers/virt/memctl/Makefile             |   2 +
>  drivers/virt/memctl/memctl.c             | 425 ++++++++++++++++++
>  include/linux/memctl.h                   |  27 ++
>  include/uapi/linux/memctl.h              |  81 ++++

You are mixing your PCI driver in with the memctl core code, is that
intentional?  Will there never be another PCI device for this type of
interface other than this one PCI device?

And if so, why export anything, why isn't this all in one body of code?

>  8 files changed, 550 insertions(+)
>  create mode 100644 drivers/virt/memctl/Kconfig
>  create mode 100644 drivers/virt/memctl/Makefile
>  create mode 100644 drivers/virt/memctl/memctl.c
>  create mode 100644 include/linux/memctl.h
>  create mode 100644 include/uapi/linux/memctl.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 457e16f06e04..789d1251c0be 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -368,6 +368,8 @@ Code  Seq#    Include File                  Comments
>  0xCD  01     linux/reiserfs_fs.h
>  0xCE  01-02  uapi/linux/cxl_mem.h          Compute Express Link Memory Devices
>  0xCF  02     fs/smb/client/cifs_ioctl.h
> +0xDA  00     linux/memctl.h                Memctl Device
> +
>  0xDB  00-0F  drivers/char/mwave/mwavepub.h
>  0xDD  00-3F  ZFCP device driver            see drivers/s390/scsi/
>
> diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
> index 40129b6f0eca..419496558cfc 100644
> --- a/drivers/virt/Kconfig
> +++ b/drivers/virt/Kconfig
> @@ -50,4 +50,6 @@ source "drivers/virt/acrn/Kconfig"
>
>  source "drivers/virt/coco/Kconfig"
>
> +source "drivers/virt/memctl/Kconfig"
> +
>  endif
> diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
> index f29901bd7820..68e152e7cef1 100644
> --- a/drivers/virt/Makefile
> +++ b/drivers/virt/Makefile
> @@ -10,3 +10,4 @@ obj-y += vboxguest/
>  obj-$(CONFIG_NITRO_ENCLAVES) += nitro_enclaves/
>  obj-$(CONFIG_ACRN_HSM) += acrn/
>  obj-y += coco/
> +obj-$(CONFIG_MEMCTL) += memctl/
> diff --git a/drivers/virt/memctl/Kconfig b/drivers/virt/memctl/Kconfig
> new file mode 100644
> index 000000000000..981ed9b76f97
> --- /dev/null
> +++ b/drivers/virt/memctl/Kconfig
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0
> +config MEMCTL
> +	tristate "memctl Guest Service Module"
> +	depends on KVM_GUEST && 64BIT
> +	help
> +	  memctl is a guest kernel module that allows to communicate
> +	  with hypervisor / VMM and control the guest memory backing.
> +
> +	  To compile as a module, choose M, the module will be called
> +	  memctl. If unsure, say N.

Pretty generic name for a hardware-specific driver :(

> diff --git a/drivers/virt/memctl/Makefile b/drivers/virt/memctl/Makefile
> new file mode 100644
> index 000000000000..410829a3c297
> --- /dev/null
> +++ b/drivers/virt/memctl/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0
> +obj-$(CONFIG_MEMCTL) := memctl.o
> diff --git a/drivers/virt/memctl/memctl.c b/drivers/virt/memctl/memctl.c
> new file mode 100644
> index 000000000000..661a552f98d8
> --- /dev/null
> +++ b/drivers/virt/memctl/memctl.c
> @@ -0,0 +1,425 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Control guest memory mappings
> + *
> + * Author: Yuanchu Xie
> + * Author: Pasha Tatashin
> + */
> +#define pr_fmt(fmt) "memctl %s: " fmt, __func__

You have real devices here, use dev_*() calls instead of pr_*() please,
which means you can remove this define.

> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +#define PCI_VENDOR_ID_GOOGLE 0x1ae0
> +#define PCI_DEVICE_ID_GOOGLE_MEMCTL 0x0087
> +
> +#define MEMCTL_VERSION "0.01"

Versions mean nothing once code is merged, please remove this.

> +
> +enum memctl_transport_command {
> +	MEMCTL_TRANSPORT_RESET = 0x060FE6D2,
> +	MEMCTL_TRANSPORT_REGISTER = 0x0E359539,
> +	MEMCTL_TRANSPORT_READY = 0x0CA8D227,
> +	MEMCTL_TRANSPORT_DISCONNECT = 0x030F5DA0,
> +	MEMCTL_TRANSPORT_ACK = 0x03CF5196,
> +	MEMCTL_TRANSPORT_ERROR = 0x01FBA249,

What are these magic values?  What endian are they, native?  Hardware?
Something else?

> +};
> +
> +struct memctl_transport {
> +	union {
> +		struct {
> +			u64 buf_phys_addr;
> +		} reg;
> +		struct {
> +			u32 command;
> +			u32 _padding;
> +		} resp;

Endianness of all of this as it goes to the hardware, right?

> +	};
> +	u32 command;
> +};
> +
> +struct memctl_percpu_channel {
> +	struct memctl_buf buf;
> +	u64 buf_phys_addr;
> +	u32 command;
> +};
> +
> +struct memctl {
> +	void __iomem *base_addr;
> +	/* cache the info call */
> +	struct memctl_vmm_info memctl_vmm_info;
> +	struct memctl_percpu_channel __percpu *pcpu_channels;
> +};
> +
> +static DEFINE_RWLOCK(memctl_lock);
> +static struct memctl *memctl __read_mostly;
> +
> +static void memctl_write_command(void __iomem *base_addr, u32 command)
> +{
> +	iowrite32(command,
> +		  base_addr + offsetof(struct memctl_transport, command));

Yup, you write this to hardware, please use proper structures and types
for that, otherwise you will have problems in the near future.

thanks,

greg k-h
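
To make the endianness questions above concrete, here is a rough userspace
sketch of a transport layout with an explicit byte order. It is only an
illustration, not the patch's code: it assumes the device expects
little-endian fields, which the patch does not state, and the names
wire_transport, put_le32(), get_le32(), put_le64() and get_le64() are made
up for this example. In the kernel itself the usual tools for this are the
__le32/__le64 annotated types together with cpu_to_le32()/le32_to_cpu().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical wire layout (an assumption, not taken from the patch):
 * every field is stored as little-endian bytes, so the layout is the
 * same no matter what byte order the CPU uses.
 */
struct wire_transport {
	uint8_t buf_phys_addr[8];	/* little-endian 64-bit address */
	uint8_t command[4];		/* little-endian 32-bit command */
};

/* Byte arrays have alignment 1, so the compiler inserts no padding. */
_Static_assert(sizeof(struct wire_transport) == 12, "unexpected padding");
_Static_assert(offsetof(struct wire_transport, command) == 8, "unexpected offset");

/* Store a 32-bit value as little-endian bytes, least significant first. */
static inline void put_le32(uint8_t *dst, uint32_t v)
{
	for (int i = 0; i < 4; i++)
		dst[i] = (uint8_t)(v >> (8 * i));
}

/* Load a 32-bit value from little-endian bytes. */
static inline uint32_t get_le32(const uint8_t *src)
{
	uint32_t v = 0;

	for (int i = 0; i < 4; i++)
		v |= (uint32_t)src[i] << (8 * i);
	return v;
}

/* Store a 64-bit value as little-endian bytes, least significant first. */
static inline void put_le64(uint8_t *dst, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		dst[i] = (uint8_t)(v >> (8 * i));
}

/* Load a 64-bit value from little-endian bytes. */
static inline uint64_t get_le64(const uint8_t *src)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v |= (uint64_t)src[i] << (8 * i);
	return v;
}
```

With helpers like these, a value such as MEMCTL_TRANSPORT_RESET
(0x060FE6D2) always lands in memory as the bytes D2 E6 0F 06, on
little-endian and big-endian guests alike, which is the property the
review is asking the driver to spell out.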