Date: Mon, 5 Aug 2024 12:20:51 +0200
From: Christian Brauner
To: James Gowans
Cc: linux-kernel@vger.kernel.org, Sean Christopherson, Paolo Bonzini,
	Alexander Viro, Steve Sistare, Jan Kara, Anthony Yznaga, Mike Rapoport,
	Andrew Morton, linux-mm@kvack.org, Jason Gunthorpe,
	linux-fsdevel@vger.kernel.org, Usama Arif, kvm@vger.kernel.org,
	Alexander Graf, David Woodhouse, Paul Durrant, Nicolas Saenz Julienne
Subject: Re: [PATCH 01/10] guestmemfs: Introduce filesystem skeleton
Message-ID: <20240805-tukan-emblem-f84119475e49@brauner>
References: <20240805093245.889357-1-jgowans@amazon.com> <20240805093245.889357-2-jgowans@amazon.com>
In-Reply-To: <20240805093245.889357-2-jgowans@amazon.com>

(I'm just going to point at a few things but it's by no means a
comprehensive review.)

On Mon, Aug 05, 2024 at 11:32:36AM GMT, James Gowans wrote:
> Add an in-memory filesystem: guestmemfs. Memory is donated to guestmemfs
> by carving it out of the normal System RAM range with the memmap= cmdline
> parameter and then giving that same physical range to guestmemfs with the
> guestmemfs= cmdline parameter.
>
> A new filesystem is added; so far it doesn't do much except persist a
> super block at the start of the donated memory and allows itself to be
> mounted.
>
> A hook to x86 mm init is added to reserve the memory really early on via
> memblock allocator. There is probably a better arch-independent place to
> do this...
>
> Signed-off-by: James Gowans
> ---
>  arch/x86/mm/init_64.c      |   2 +
>  fs/Kconfig                 |   1 +
>  fs/Makefile                |   1 +
>  fs/guestmemfs/Kconfig      |  11 ++++
>  fs/guestmemfs/Makefile     |   6 ++
>  fs/guestmemfs/guestmemfs.c | 116 +++++++++++++++++++++++++++++++++++++
>  fs/guestmemfs/guestmemfs.h |   9 +++
>  include/linux/guestmemfs.h |  16 +++++
>  8 files changed, 162 insertions(+)
>  create mode 100644 fs/guestmemfs/Kconfig
>  create mode 100644 fs/guestmemfs/Makefile
>  create mode 100644 fs/guestmemfs/guestmemfs.c
>  create mode 100644 fs/guestmemfs/guestmemfs.h
>  create mode 100644 include/linux/guestmemfs.h
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 8932ba8f5cdd..39fcf017c90c 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -18,6 +18,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -1331,6 +1332,7 @@ static void __init preallocate_vmalloc_pages(void)
>
>  void __init mem_init(void)
>  {
> +	guestmemfs_reserve_mem();
>  	pci_iommu_alloc();
>
>  	/* clear_bss() already clear the empty_zero_page */
> diff --git a/fs/Kconfig b/fs/Kconfig
> index a46b0cbc4d8f..727359901da8 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -321,6 +321,7 @@ source "fs/befs/Kconfig"
>  source "fs/bfs/Kconfig"
>  source "fs/efs/Kconfig"
>  source "fs/jffs2/Kconfig"
> +source "fs/guestmemfs/Kconfig"
>  # UBIFS File system configuration
>  source "fs/ubifs/Kconfig"
>  source "fs/cramfs/Kconfig"
> diff --git a/fs/Makefile b/fs/Makefile
> index 6ecc9b0a53f2..044524b17d63 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -129,3 +129,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/
>  obj-$(CONFIG_EROFS_FS) += erofs/
>  obj-$(CONFIG_VBOXSF_FS) += vboxsf/
>  obj-$(CONFIG_ZONEFS_FS) += zonefs/
> +obj-$(CONFIG_GUESTMEMFS_FS) += guestmemfs/
> diff --git a/fs/guestmemfs/Kconfig b/fs/guestmemfs/Kconfig
> new file mode 100644
> index 000000000000..d87fca4822cb
> --- /dev/null
> +++ b/fs/guestmemfs/Kconfig
> @@ -0,0 +1,11 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +
> +config GUESTMEMFS_FS
> +	bool "Persistent Guest memory filesystem (guestmemfs)"
> +	help
> +	  An in-memory filesystem on top of reserved memory specified via
> +	  guestmemfs= cmdline argument. Used for storing kernel state and
> +	  userspace memory which is preserved across kexec to support
> +	  live update of a hypervisor when running guest virtual machines.
> +	  Select this if you need the ability to persist memory for guest VMs
> +	  across kexec to do live update.
> diff --git a/fs/guestmemfs/Makefile b/fs/guestmemfs/Makefile
> new file mode 100644
> index 000000000000..6dc820a9d4fe
> --- /dev/null
> +++ b/fs/guestmemfs/Makefile
> @@ -0,0 +1,6 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +# Makefile for persistent kernel filesystem
> +#
> +
> +obj-y += guestmemfs.o
> diff --git a/fs/guestmemfs/guestmemfs.c b/fs/guestmemfs/guestmemfs.c
> new file mode 100644
> index 000000000000..3aaada1b8df6
> --- /dev/null
> +++ b/fs/guestmemfs/guestmemfs.c
> @@ -0,0 +1,116 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include "guestmemfs.h"
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +static phys_addr_t guestmemfs_base, guestmemfs_size;
> +struct guestmemfs_sb *psb;
> +
> +static int statfs(struct dentry *root, struct kstatfs *buf)
> +{
> +	simple_statfs(root, buf);
> +	buf->f_bsize = PMD_SIZE;
> +	buf->f_blocks = guestmemfs_size / PMD_SIZE;
> +	buf->f_bfree = buf->f_bavail = buf->f_blocks;
> +	return 0;
> +}
> +
> +static const struct super_operations guestmemfs_super_ops = {
> +	.statfs = statfs,

(Please make it a habit to name these things with a consistent prefix.
It doesn't matter whether it's wubalubadubdub_statfs() or
guestmemfs_statfs() as far as I'm concerned, just something that is
grep-able and local to your fs.)

> +};
> +
> +static int guestmemfs_fill_super(struct super_block *sb, struct fs_context *fc)
> +{
> +	struct inode *inode;
> +	struct dentry *dentry;
> +
> +	psb = kzalloc(sizeof(*psb), GFP_KERNEL);
> +	/*
> +	 * Keep a reference to the persistent super block in the
> +	 * ephemeral super block.
> +	 */
> +	sb->s_fs_info = psb;
> +	sb->s_op = &guestmemfs_super_ops;
> +
> +	inode = new_inode(sb);
> +	if (!inode)
> +		return -ENOMEM;
> +
> +	inode->i_ino = 1;
> +	inode->i_mode = S_IFDIR;
> +	inode->i_op = &simple_dir_inode_operations;
> +	inode->i_fop = &simple_dir_operations;
> +	simple_inode_init_ts(inode);
> +	/* directory inodes start off with i_nlink == 2 (for "." entry) */
> +	inc_nlink(inode);
> +
> +	dentry = d_make_root(inode);
> +	if (!dentry)
> +		return -ENOMEM;
> +	sb->s_root = dentry;
> +
> +	return 0;
> +}
> +
> +static int guestmemfs_get_tree(struct fs_context *fc)
> +{
> +	return get_tree_nodev(fc, guestmemfs_fill_super);

That makes the filesystem multi-instance, so

	mount -t guestmemfs guestmemfs /mnt
	mount -t guestmemfs guestmemfs /opt

would mount two separate instances of guestmemfs. That is intentional,
right, given that multiple instances draw memory from the same reserved
memblock?

> +}
> +
> +static const struct fs_context_operations guestmemfs_context_ops = {
> +	.get_tree = guestmemfs_get_tree,
> +};
> +
> +static int guestmemfs_init_fs_context(struct fs_context *const fc)
> +{
> +	fc->ops = &guestmemfs_context_ops;
> +	return 0;
> +}
> +
> +static struct file_system_type guestmemfs_fs_type = {
> +	.owner = THIS_MODULE,
> +	.name = "guestmemfs",
> +	.init_fs_context = guestmemfs_init_fs_context,
> +	.kill_sb = kill_litter_super,
> +	.fs_flags = FS_USERNS_MOUNT,

This makes the filesystem mountable by unprivileged containers and
therefore by unprivileged users. IIUC, you need a mechanism to prevent a
container from just taking over the whole reserved memory block. AFAICT
memblock isn't accounted for in cgroups at all, so it'd be good to know
how that would be done. And that should be explained somewhere in the
documentation patch, please.
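
To make the concern concrete, here is a userspace toy model (not kernel
code, untested against the series, all names made up for illustration).
It assumes what the patch suggests: every mounted instance draws from the
one global reservation and nothing limits how much a single instance may
claim.

```c
/*
 * Userspace toy model, NOT kernel code. All names are hypothetical.
 * One global pool stands in for the reserved memblock; each
 * toy_instance stands in for a separate guestmemfs superblock.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_BLOCKS 8	/* e.g. 16 MiB worth of 2 MiB blocks */

/* Shared by every instance, like guestmemfs_base/guestmemfs_size. */
static size_t pool_used;

struct toy_instance {
	size_t blocks_held;	/* blocks this mount has claimed */
};

/* Claim n blocks from the shared pool; there is no per-instance limit. */
static bool toy_alloc(struct toy_instance *inst, size_t n)
{
	assert(inst != NULL);
	if (n > POOL_BLOCKS - pool_used)
		return false;
	pool_used += n;
	inst->blocks_held += n;
	return true;
}
```

With two instances a and b, toy_alloc(&a, POOL_BLOCKS) succeeds and a
following toy_alloc(&b, 1) fails. In this model the first mounter can
starve everyone else, which is the situation the documentation should
say how to avoid.
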
> +};
> +
> +static int __init guestmemfs_init(void)
> +{
> +	int ret;
> +
> +	ret = register_filesystem(&guestmemfs_fs_type);
> +	return ret;
> +}
> +
> +/**
> + * Format: guestmemfs=:
> + * Just like: memmap=nn[KMG]!ss[KMG]
> + */
> +static int __init parse_guestmemfs_extents(char *p)
> +{
> +	guestmemfs_size = memparse(p, &p);
> +	return 0;
> +}
> +
> +early_param("guestmemfs", parse_guestmemfs_extents);
> +
> +void __init guestmemfs_reserve_mem(void)
> +{
> +	guestmemfs_base = memblock_phys_alloc(guestmemfs_size, 4 << 10);
> +	if (guestmemfs_base) {
> +		memblock_reserved_mark_noinit(guestmemfs_base, guestmemfs_size);
> +		memblock_mark_nomap(guestmemfs_base, guestmemfs_size);
> +	} else {
> +		pr_warn("Failed to alloc %llu bytes for guestmemfs\n", guestmemfs_size);
> +	}
> +}
> +
> +MODULE_ALIAS_FS("guestmemfs");
> +module_init(guestmemfs_init);
> diff --git a/fs/guestmemfs/guestmemfs.h b/fs/guestmemfs/guestmemfs.h
> new file mode 100644
> index 000000000000..37d8cf630e0a
> --- /dev/null
> +++ b/fs/guestmemfs/guestmemfs.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#define pr_fmt(fmt) "guestmemfs: " KBUILD_MODNAME ": " fmt
> +
> +#include
> +
> +struct guestmemfs_sb {
> +	/* Will be populated soon... */
> +};
> diff --git a/include/linux/guestmemfs.h b/include/linux/guestmemfs.h
> new file mode 100644
> index 000000000000..60e769c8e533
> --- /dev/null
> +++ b/include/linux/guestmemfs.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: MIT */
> +
> +#ifndef _LINUX_GUESTMEMFS_H
> +#define _LINUX_GUESTMEMFS_H
> +
> +/*
> + * Carves out chunks of memory from memblocks for guestmemfs.
> + * Must be called in early boot before memblocks are freed.
> + */
> +# ifdef CONFIG_GUESTMEMFS_FS
> +void guestmemfs_reserve_mem(void);
> +#else
> +void guestmemfs_reserve_mem(void) { }
> +#endif
> +
> +#endif
> --
> 2.34.1
>
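
The size handling quoted above (memparse() on the guestmemfs= value, then
f_blocks = guestmemfs_size / PMD_SIZE in statfs) can be sanity-checked in
userspace. The sketch below is not kernel code: toy_memparse() is a
hypothetical stand-in that mimics only the K/M/G suffixes, and
TOY_PMD_SIZE assumes 2 MiB PMDs as on x86-64 with 4 KiB pages.

```c
/*
 * Userspace sketch, NOT kernel code. toy_memparse() is a hypothetical
 * stand-in that mimics only the K/M/G suffixes of the kernel's memparse().
 */
#include <assert.h>
#include <stdlib.h>

/* Assumption: 2 MiB PMDs (x86-64 with 4 KiB pages). */
#define TOY_PMD_SIZE (2ULL << 20)

/* Parse "nn[KMG]" into a byte count; base 0 also accepts 0x... input. */
static unsigned long long toy_memparse(const char *s)
{
	char *end;
	unsigned long long v;

	assert(s != NULL);
	v = strtoull(s, &end, 0);
	switch (*end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		break;
	default:
		break;
	}
	return v;
}

/* Mirrors buf->f_blocks = guestmemfs_size / PMD_SIZE from the patch. */
static unsigned long long toy_f_blocks(const char *arg)
{
	return toy_memparse(arg) / TOY_PMD_SIZE;
}
```

Under these assumptions guestmemfs=1G reports 512 blocks, while 3M
reports a single block because the division truncates. A size that is not
a multiple of PMD_SIZE therefore leaves a remainder that statfs never
shows.
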