From: Vishal Annapurve <vannapurve@google.com>
Date: Thu, 17 Oct 2024 10:23:38 +0530
Subject: Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
To: James Gowans <jgowans@amazon.com>
Cc: linux-kernel@vger.kernel.org, Sean Christopherson, Paolo Bonzini,
 Alexander Viro, Steve Sistare, Christian Brauner, Jan Kara,
 Anthony Yznaga, Mike Rapoport, Andrew Morton, linux-mm@kvack.org,
 Jason Gunthorpe, linux-fsdevel@vger.kernel.org, Usama Arif,
 kvm@vger.kernel.org, Alexander Graf, David Woodhouse, Paul Durrant,
 Nicolas Saenz Julienne
In-Reply-To: <20240805093245.889357-1-jgowans@amazon.com>
References: <20240805093245.889357-1-jgowans@amazon.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Mon, Aug 5, 2024 at 3:03 PM James Gowans wrote:
>
> In this patch series a new in-memory filesystem designed specifically
> for live update is implemented. Live update is a mechanism to support
> updating a hypervisor in a way that has limited impact on running
> virtual machines. This is done by pausing/serialising running VMs,
> kexec-ing into a new kernel, starting new VMM processes and then
> deserialising/resuming the VMs so that they continue running from where
> they were. To support this, guest memory needs to be preserved.
>
> Guestmemfs implements preservation across kexec by carving out a large
> contiguous block of host system RAM early in boot which is then used as
> the data for the guestmemfs files. As well as preserving that large
> block of data memory across kexec, the filesystem metadata is preserved
> via the Kexec Hand Over (KHO) framework (still under review):
> https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/
>
> Filesystem metadata is structured to make preservation across kexec
> easy: inodes are one large contiguous array, and each inode has a
> "mappings" block which defines which block from the filesystem data
> memory corresponds to which offset in the file.
>
> There are additional constraints/requirements which guestmemfs aims to
> meet:
>
> 1. Secret hiding: all filesystem data is removed from the kernel direct
> map so it is immune from speculative access. read()/write() are not
> supported; the only way to get at the data is via mmap.
>
> 2. Struct page overhead elimination: the memory is not managed by the
> buddy allocator and hence has no struct pages.
>
> 3. PMD and PUD level allocations for TLB performance: guestmemfs
> allocates PMD-sized pages to back files which improves TLB perf (caveat
> below!). PUD-size allocations are a next step.
>
> 4. Device assignment: being able to use guestmemfs memory for
> VFIO/iommufd mappings, and allow those mappings to survive and continue
> to be used across kexec.
>
>
> Next steps
> ==========
>
> The idea is that this patch series implements a minimal filesystem to
> provide the foundations for in-memory files that persist across kexec.
> Once this foundation is in place it will be extended:
>
> 1. Improve the filesystem to be more comprehensive - currently it's just
> functional enough to demonstrate the main objective of reserved memory
> and persistence via KHO.
>
> 2. Build support for iommufd IOAS and HWPT persistence, and integrate
> that with guestmemfs. The idea is that if VMs have DMA devices assigned
> to them, DMA should continue running across kexec. A future patch series
> will add support for this in iommufd and connect iommufd to guestmemfs
> so that guestmemfs files can remain mapped into the IOMMU during kexec.
>
> 3. Support a guest_memfd interface to files so that they can be used for
> confidential computing without needing to mmap into userspace.
I am guessing this goal predates our discussion of the need to support
mmap on guest_memfd for confidential computing use cases in order to
support hugepages [1]. That series [1] as of today tries to leverage
hugetlb allocator functionality to allocate huge pages, which seems to
be along the lines of what you are aiming for. There are also
discussions about supporting NUMA mempolicy [2] for guest_memfd. In
order to use guest_memfd to back non-confidential VMs with hugepages,
core-mm will need to support PMD/PUD-level mappings in the future.
David H's suggestion from the other thread to extend guest_memfd to
support guest memory persistence over kexec, instead of introducing
guestmemfs as a parallel subsystem, seems appealing to me.

[1] https://lore.kernel.org/kvm/cover.1726009989.git.ackerleytng@google.com/T/
[2] https://lore.kernel.org/kvm/47476c27-897c-4487-bcd2-7ef6ec089dd1@amd.com/T/

>
> 4. Gigantic PUD level mappings for even better TLB perf.
>
> Caveats
> =======
>
> There are a few issues with the current implementation which should be
> solved either in this patch series or soon in follow-on work:
>
> 1. Although PMD-size allocations are done, PTE-level page tables are
> still created. This is because guestmemfs uses remap_pfn_range() to set
> up userspace pgtables. Currently remap_pfn_range() only creates
> PTE-level mappings. I suggest enhancing remap_pfn_range() to support
> creating higher level mappings where possible, by adding pmd_special
> and pud_special flags.
>
> 2. NUMA support is currently non-existent. To make this more generally
> useful it's necessary to have NUMA-awareness. One thought on how to do
> this is to be able to specify multiple allocations with NUMA affinity
> on the kernel cmdline and have multiple mount points, one per NUMA node.
> Currently, for simplicity, only a single contiguous filesystem data
> allocation and a single mount point are supported.
>
> 3. MCEs are currently not handled - we need to add functionality for
> this to be able to track block ownership and deliver an MCE correctly.
>
> 4. Looking for reviews from filesystem experts to see if the necessary
> callbacks, refcounting, locking, etc., are done correctly.
>
> Open questions
> ==============
>
> It is not too clear if or how guestmemfs should use DAX as a source of
> memory. Seeing as guestmemfs has an in-memory design, it seems that it
> is not necessary to use DAX as a source of memory, but I am keen for
> guidance/input on whether DAX should be used here.
>
> The filesystem data memory is removed from the direct map for secret
> hiding, but it is still necessary to mmap it to be accessible to KVM.
> For improving secret hiding even more, a guest_memfd-style interface
> could be used to remove the need to mmap. That introduces a new problem
> of the memory being completely inaccessible to KVM for things like MMIO
> instruction emulation. How can this be handled?
>
> Related Work
> ============
>
> There are similarities to a few attempts at solving aspects of this
> problem previously.
>
> The original was probably PKRAM from Oracle; a tmpfs filesystem with
> persistence:
> https://lore.kernel.org/kexec/1682554137-13938-1-git-send-email-anthony.yznaga@oracle.com/
> guestmemfs will additionally provide secret hiding, PMD/PUD allocations
> and a path to DMA persistence and NUMA support.
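A side note on caveat 1 above, mostly to make the remap_pfn_range()
limitation concrete: below is a minimal, purely illustrative sketch of
what an mmap callback along those lines might look like. The
guestmemfs_offset_to_pfn() helper is hypothetical (standing in for the
per-inode "mappings" lookup), the mapped range is assumed to be
physically contiguous for brevity, and none of the names are taken from
the series.

#include <linux/fs.h>
#include <linux/mm.h>

/* Hypothetical: resolve a file offset to the PFN of its backing block. */
static unsigned long guestmemfs_offset_to_pfn(struct inode *inode, loff_t off);

static int guestmemfs_mmap_sketch(struct file *file, struct vm_area_struct *vma)
{
	struct inode *inode = file_inode(file);
	unsigned long size = vma->vm_end - vma->vm_start;
	loff_t off = (loff_t)vma->vm_pgoff << PAGE_SHIFT;
	unsigned long pfn;

	if (off + size > i_size_read(inode))
		return -EINVAL;

	pfn = guestmemfs_offset_to_pfn(inode, off);

	/*
	 * remap_pfn_range() only installs PTE-level mappings today, which
	 * is exactly the limitation described in caveat 1; a PMD/PUD-aware
	 * variant would be needed to keep the 2M backing blocks mapped as
	 * huge mappings in userspace page tables.
	 */
	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}

A real implementation would of course walk the mappings block and call
remap_pfn_range() once per contiguous extent rather than once per VMA.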
>
> Dmemfs from Tencent aimed to remove the need for struct page overhead:
> https://lore.kernel.org/kvm/cover.1602093760.git.yuleixzhang@tencent.com/
> Guestmemfs provides this benefit too, along with persistence across
> kexec and secret hiding.
>
> Pkernfs attempted to solve guest memory persistence and IOMMU
> persistence all in one:
> https://lore.kernel.org/all/20240205120203.60312-1-jgowans@amazon.com/
> Guestmemfs is a re-work of that to only persist guest RAM in the
> filesystem, and to use KHO for filesystem metadata. IOMMU persistence
> will be implemented independently with persistent iommufd domains via
> KHO.
>
> Testing
> =======
>
> The testing for this can be seen in the Documentation file in this patch
> series. Essentially it is using a guestmemfs file for a QEMU VM's RAM,
> doing a kexec, restoring the QEMU VM and confirming that the VM picked
> up from where it left off.
>
> James Gowans (10):
>   guestmemfs: Introduce filesystem skeleton
>   guestmemfs: add inode store, files and dirs
>   guestmemfs: add persistent data block allocator
>   guestmemfs: support file truncation
>   guestmemfs: add file mmap callback
>   kexec/kho: Add addr flag to not initialise memory
>   guestmemfs: Persist filesystem metadata via KHO
>   guestmemfs: Block modifications when serialised
>   guestmemfs: Add documentation and usage instructions
>   MAINTAINERS: Add maintainers for guestmemfs
>
>  Documentation/filesystems/guestmemfs.rst |  87 +++++++
>  MAINTAINERS                              |   8 +
>  arch/x86/mm/init_64.c                    |   2 +
>  fs/Kconfig                               |   1 +
>  fs/Makefile                              |   1 +
>  fs/guestmemfs/Kconfig                    |  11 +
>  fs/guestmemfs/Makefile                   |   8 +
>  fs/guestmemfs/allocator.c                |  40 +++
>  fs/guestmemfs/dir.c                      |  43 ++++
>  fs/guestmemfs/file.c                     | 106 ++++++++
>  fs/guestmemfs/guestmemfs.c               | 160 ++++++++++++
>  fs/guestmemfs/guestmemfs.h               |  60 +++++
>  fs/guestmemfs/inode.c                    | 189 ++++++++++++++
>  fs/guestmemfs/serialise.c                | 302 +++++++++++++++++++++++
>  include/linux/guestmemfs.h               |  16 ++
>  include/uapi/linux/kexec.h               |   6 +
>  kernel/kexec_kho_in.c                    |  12 +-
>  kernel/kexec_kho_out.c                   |   4 +
>  18 files changed, 1055 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/filesystems/guestmemfs.rst
>  create mode 100644 fs/guestmemfs/Kconfig
>  create mode 100644 fs/guestmemfs/Makefile
>  create mode 100644 fs/guestmemfs/allocator.c
>  create mode 100644 fs/guestmemfs/dir.c
>  create mode 100644 fs/guestmemfs/file.c
>  create mode 100644 fs/guestmemfs/guestmemfs.c
>  create mode 100644 fs/guestmemfs/guestmemfs.h
>  create mode 100644 fs/guestmemfs/inode.c
>  create mode 100644 fs/guestmemfs/serialise.c
>  create mode 100644 include/linux/guestmemfs.h
>
> --
> 2.34.1
>
>
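One more illustration, this time of the metadata side: the cover letter
describes inodes as one large contiguous array, with a per-inode
"mappings" block translating file offsets to data blocks. A rough sketch
of how such a layout could look is below; all struct names, field names
and sizes are assumptions made for illustration, not the actual format
used by the series.

#include <linux/types.h>

#define GMFS_NAME_MAX		255	/* illustrative */
#define GMFS_BLOCKS_PER_FILE	512	/* illustrative fixed-size mapping table */

/* One slot in the flat, physically contiguous inode array. */
struct gmfs_inode {
	char	name[GMFS_NAME_MAX + 1];
	u64	size;		/* file size in bytes */
	u32	mappings;	/* index of this file's mapping table */
	u32	flags;
};

/* Per-file table: which carved-out data block backs each file block. */
struct gmfs_mappings {
	u64	block[GMFS_BLOCKS_PER_FILE];	/* data block index, or ~0ULL for a hole */
};

Keeping both arrays inside the carved-out contiguous region means that
persisting the filesystem across kexec reduces to recording their
physical ranges (plus a small superblock-like header) via KHO and
re-adopting them after the new kernel boots, which matches the
"metadata structured to make preservation easy" idea in the cover
letter.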