linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Gowans, James" <jgowans@amazon.com>
To: "jack@suse.cz" <jack@suse.cz>,
	"muchun.song@linux.dev" <muchun.song@linux.dev>
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"rppt@kernel.org" <rppt@kernel.org>,
	"brauner@kernel.org" <brauner@kernel.org>, "Graf (AWS),
	Alexander" <graf@amazon.de>,
	"anthony.yznaga@oracle.com" <anthony.yznaga@oracle.com>,
	"steven.sistare@oracle.com" <steven.sistare@oracle.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Durrant, Paul" <pdurrant@amazon.co.uk>,
	"seanjc@google.com" <seanjc@google.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"Woodhouse, David" <dwmw@amazon.co.uk>,
	"Saenz Julienne, Nicolas" <nsaenz@amazon.es>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"nh-open-source@amazon.com" <nh-open-source@amazon.com>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"jgg@ziepe.ca" <jgg@ziepe.ca>
Subject: Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem
Date: Tue, 6 Aug 2024 08:12:59 +0000	[thread overview]
Message-ID: <9802ddc299c72b189487fd56668de65a84f7d94b.camel@amazon.com> (raw)
In-Reply-To: <20240805200151.oja474ju4i32y5bj@quack3>

On Mon, 2024-08-05 at 22:01 +0200, Jan Kara wrote:
> 
> On Mon 05-08-24 11:32:35, James Gowans wrote:
> > In this patch series a new in-memory filesystem designed specifically
> > for live update is implemented. Live update is a mechanism to support
> > updating a hypervisor in a way that has limited impact to running
> > virtual machines. This is done by pausing/serialising running VMs,
> > kexec-ing into a new kernel, starting new VMM processes and then
> > deserialising/resuming the VMs so that they continue running from where
> > they were. To support this, guest memory needs to be preserved.
> > 
> > Guestmemfs implements preservation acrosss kexec by carving out a large
> > contiguous block of host system RAM early in boot which is then used as
> > the data for the guestmemfs files. As well as preserving that large
> > block of data memory across kexec, the filesystem metadata is preserved
> > via the Kexec Hand Over (KHO) framework (still under review):
> > https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/
> > 
> > Filesystem metadata is structured to make preservation across kexec
> > easy: inodes are one large contiguous array, and each inode has a
> > "mappings" block which defines which block from the filesystem data
> > memory corresponds to which offset in the file.
> > 
> > There are additional constraints/requirements which guestmemfs aims to
> > meet:
> > 
> > 1. Secret hiding: all filesystem data is removed from the kernel direct
> > map so immune from speculative access. read()/write() are not supported;
> > the only way to get at the data is via mmap.
> > 
> > 2. Struct page overhead elimination: the memory is not managed by the
> > buddy allocator and hence has no struct pages.
> > 
> > 3. PMD and PUD level allocations for TLB performance: guestmemfs
> > allocates PMD-sized pages to back files which improves TLB perf (caveat
> > below!). PUD size allocations are a next step.
> > 
> > 4. Device assignment: being able to use guestmemfs memory for
> > VFIO/iommufd mappings, and allow those mappings to survive and continue
> > to be used across kexec.
> 
> To me the basic functionality resembles a lot hugetlbfs. Now I know very
> little details about hugetlbfs so I've added relevant folks to CC. Have you
> considered to extend hugetlbfs with the functionality you need (such as
> preservation across kexec) instead of implementing completely new filesystem?

Oof, I forgot to mention hugetlbfs in the cover letter - thanks for
raising this! Indeed, there are similarities: in-memory fs, with
huge/gigantic allocations.

We did consider extending hugetlbfs to support persistence, but there
are differences in requirements which we're not sure would be practical
or desirable to add to hugetlbfs.

1. Secret hiding: with guestmemfs all of the memory is out of the kernel
direct map as an additional defence mechanism. This means no
read()/write() syscalls to guestmemfs files, and no IO to it. The only
way to access it is to mmap the file.

2. No struct page overhead: the intended use case is for systems whose
sole job is to be a hypervisor, typically for large (multi-GiB) VMs, so
the majority of system RAM would be donated to this fs. We definitely
don't want 4 KiB struct pages here as it would be a significant
overhead. That's why guestmemfs carves the memory out in early boot and
sets memblock flags to avoid struct page allocation. I don't know if
hugetlbfs does anything fancy to avoid allocating PTE-level struct pages
for its memory?

3. guest_memfd interface: For confidential computing use-cases we need
to provide a guest_memfd style interface so that these FDs can be used
as a guest_memfd file in KVM memslots. Would there be interest in
extending hugetlbfs to also support a guest_memfd style interface?

4. Metadata designed for persistence: guestmemfs will need to keep
simple internal metadata data structures (limited allocations, limited
fragmentation) so that pages can easily and efficiently be marked as
persistent via KHO. Something like slab allocations would probably be a
no-go as then we'd need to persist and reconstruct the slab allocator. I
don't know how hugetlbfs structures its fs metadata but I'm guessing it
uses the slab and does lots of small allocations so trying to retrofit
persistence via KHO to it may be challenging.

5. Integration with persistent IOMMU mappings: to keep DMA running
across kexec, iommufd needs to know that the backing memory for an IOAS
is persistent too. The idea is to do some DMA pinning of persistent
files, which would require iommufd/guestmemfs integration - would we
want to add this to hugetlbfs?

6. Virtualisation-specific APIs: starting to get a bit esoteric here,
but use-cases like being able to carve out specific chunks of memory
from a running VM and turn it into memory for another side car VM, or
doing post-copy LM via DMA by mapping memory into the IOMMU but taking
page faults on the CPU. This may require virtualisation-specific ioctls
on the files which wouldn't be generally applicable to hugetlbfs.

7. NUMA control: a requirement is to always have correct NUMA affinity.
While currently not implemented the idea is to extend the guestmemfs
allocation to support specifying allocation sizes from each NUMA node at
early boot, and then having multiple mount points, one per NUMA node (or
something like that...). Unclear if this is something hugetlbfs would
want.

There are probably more potential issues, but those are the ones that
come to mind... That being said, if hugetlbfs maintainers are interested
in going in this direction then we can definitely look at enhancing
hugetlbfs.

I think there are two types of problems: "Would hugetlbfs want this
functionality?" - that's the majority. An a few are "This would be hard
with hugetlbfs!" - persistence probably falls into this category.

Looking forward to input from maintainers. :-)

JG

  parent reply	other threads:[~2024-08-06  8:13 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-05  9:32 James Gowans
2024-08-05  9:32 ` [PATCH 01/10] guestmemfs: Introduce filesystem skeleton James Gowans
2024-08-05 10:20   ` Christian Brauner
2024-08-05  9:32 ` [PATCH 02/10] guestmemfs: add inode store, files and dirs James Gowans
2024-08-05  9:32 ` [PATCH 03/10] guestmemfs: add persistent data block allocator James Gowans
2024-08-05  9:32 ` [PATCH 04/10] guestmemfs: support file truncation James Gowans
2024-08-05  9:32 ` [PATCH 05/10] guestmemfs: add file mmap callback James Gowans
2024-10-29 23:05   ` Elliot Berman
2024-10-30 22:18     ` Frank van der Linden
2024-11-01 12:55       ` Gowans, James
2024-10-31 15:30     ` Gowans, James
2024-10-31 16:06       ` Jason Gunthorpe
2024-11-01 13:01         ` Gowans, James
2024-11-01 13:42           ` Jason Gunthorpe
2024-11-02  8:24             ` Gowans, James
2024-11-04 11:11               ` Mike Rapoport
2024-11-04 14:39               ` Jason Gunthorpe
2024-11-04 10:49             ` Mike Rapoport
2024-08-05  9:32 ` [PATCH 06/10] kexec/kho: Add addr flag to not initialise memory James Gowans
2024-08-05  9:32 ` [PATCH 07/10] guestmemfs: Persist filesystem metadata via KHO James Gowans
2024-08-05  9:32 ` [PATCH 08/10] guestmemfs: Block modifications when serialised James Gowans
2024-08-05  9:32 ` [PATCH 09/10] guestmemfs: Add documentation and usage instructions James Gowans
2024-08-05  9:32 ` [PATCH 10/10] MAINTAINERS: Add maintainers for guestmemfs James Gowans
2024-08-05 14:32 ` [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem Theodore Ts'o
2024-08-05 14:41   ` Paolo Bonzini
2024-08-05 19:47     ` Gowans, James
2024-08-05 19:53   ` Gowans, James
2024-08-05 20:01 ` Jan Kara
2024-08-05 23:29   ` Jason Gunthorpe
2024-08-06  8:26     ` Gowans, James
2024-08-06  8:12   ` Gowans, James [this message]
2024-08-06 13:43     ` David Hildenbrand
2024-10-17  4:53 ` Vishal Annapurve
2024-11-01 12:53   ` Gowans, James

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9802ddc299c72b189487fd56668de65a84f7d94b.camel@amazon.com \
    --to=jgowans@amazon.com \
    --cc=akpm@linux-foundation.org \
    --cc=anthony.yznaga@oracle.com \
    --cc=brauner@kernel.org \
    --cc=dwmw@amazon.co.uk \
    --cc=graf@amazon.de \
    --cc=jack@suse.cz \
    --cc=jgg@ziepe.ca \
    --cc=kvm@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=muchun.song@linux.dev \
    --cc=nh-open-source@amazon.com \
    --cc=nsaenz@amazon.es \
    --cc=pbonzini@redhat.com \
    --cc=pdurrant@amazon.co.uk \
    --cc=rppt@kernel.org \
    --cc=seanjc@google.com \
    --cc=steven.sistare@oracle.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox