* Pmemfs/guestmemfs discussion recap and open questions
From: David Rientjes @ 2024-10-17 4:42 UTC
To: James Gowans, Dave Hansen, David Hildenbrand, Matthew Wilcox,
Mike Rapoport, Pasha Tatashin, Peter Xu, Alexander Graf,
Ashish Kalra, Tom Lendacky, David Woodhouse, Anthony Yznaga,
Jason Gunthorpe, Andrew Morton, Frank van der Linden,
Vipin Sharma, David Matlack, Steve Rutherford, Erdem Aktas,
Alper Gun, Vishal Annapurve, Ackerley Tng, Sagi Shahar
Cc: linux-mm, kexec
Hi all,
We had a very interesting discussion today led by James Gowans in the
Linux MM Alignment Session, thank you James! And thanks to everybody who
attended and provided great questions, suggestions, and feedback.
Guestmemfs[*] is proposed to provide an in-memory persistent filesystem
primarily aimed at Kexec Hand-Over (KHO) use cases: 1GB allocations, no
struct pages, unmapped from the kernel direct map. The memory for this
filesystem is set aside by the memblock allocator as defined by the
kernel command line (like guestmemfs=900G on a 1TB system).
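For those unfamiliar with the mechanics, a rough sketch of what that
boot-time carve-out could look like (purely illustrative; the parameter
handling and variable names are made up and the actual series may differ):

  #include <linux/init.h>
  #include <linux/kernel.h>
  #include <linux/memblock.h>
  #include <linux/sizes.h>

  /* Illustrative only: reserve the guestmemfs pool from memblock at boot. */
  static phys_addr_t guestmemfs_size;   /* from guestmemfs= on the cmdline */
  static phys_addr_t guestmemfs_base;

  static int __init parse_guestmemfs(char *p)
  {
          guestmemfs_size = memparse(p, &p);
          return 0;
  }
  early_param("guestmemfs", parse_guestmemfs);

  void __init guestmemfs_reserve(void)
  {
          if (!guestmemfs_size)
                  return;

          /* 1GB-aligned, taken before the page allocator ever sees it */
          guestmemfs_base = memblock_phys_alloc(guestmemfs_size, SZ_1G);
          if (!guestmemfs_base) {
                  pr_warn("guestmemfs: failed to reserve %pa bytes\n",
                          &guestmemfs_size);
                  return;
          }
          /* one way to keep the range out of the kernel direct map */
          memblock_mark_nomap(guestmemfs_base, guestmemfs_size);
  }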
----->o-----
Feedback from David Hildenbrand was that we may want to leverage HVO
to get struct page savings and the alignment was to define this as
part of the filesystem configuration: do you want all struct pages to
be gone and memory unmapped from the kernel direct map, or in the
kernel direct map with tail pages freed for I/O? You get to choose!
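Concretely, that could end up being nothing more than a per-mount mode,
along the lines of (names invented for illustration):

  /*
   * Illustrative only: the "struct pages or not, direct map or not" choice
   * expressed as filesystem configuration rather than hard-coded policy.
   */
  enum guestmemfs_mode {
          GUESTMEMFS_NOMAP,       /* no memmap, out of the kernel direct map */
          GUESTMEMFS_HVO,         /* struct pages kept but HVO-optimized, and
                                   * left in the direct map so I/O still works */
  };

  struct guestmemfs_sb_info {
          enum guestmemfs_mode mode;      /* chosen per mount */
          phys_addr_t base;
          phys_addr_t size;
  };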
----->o-----
It was noted that the premise for guestmemfs sounded very similar to
guest_memfd: a filesystem that would index non-anonymous guest_memfds;
indeed, this is not dissimilar to a persistent guest_memfd. After kexec,
the new kernel would need to present the fds to userspace so they can
be used once again, so a filesystem abstraction may make sense. We
may also want to use uid and gid permissions.
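From userspace that might look as simple as reopening a path after the
kexec, with ordinary file permissions doing the access control
(illustrative only; the mount point and naming are made up):

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void)
  {
          struct stat st;
          /* hypothetical path: one file per guest, surviving the kexec */
          int fd = open("/mnt/guestmemfs/vm-1234", O_RDWR);

          if (fd < 0 || fstat(fd, &st) < 0) {
                  perror("guestmemfs");
                  return 1;
          }
          printf("recovered %lld bytes of guest memory\n",
                 (long long)st.st_size);
          /* the fd would then be handed back to KVM, much as a guest_memfd
           * is today, rather than mmap()ed directly */
          close(fd);
          return 0;
  }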
It's highly desirable for guestmemfs and guest_memfd to share the same
infrastructure and source code, like the struct page optimizations and
unmapping from the kernel direct map. We'd want to avoid duplicating
this, but it's still an open question how it would all be glued together.
David Hildenbrand brought up the idea of a persistent filesystem that
even databases could use that may not be guest_memfd. Persistent
filesystems do exist, but lack the 1GB memory allocation requirement; if
we were to support databases or other workloads that want to persist
memory across kexec, this instead would become a new optimized filesystem
for generic use cases that require persistence. Mike Rapoport noted that
tying the ability to persist memory across kexec to only guests would
preclude this without major changes.
Frank van der Linden noted the abstraction between guest_memfd and
guestmemfs doesn't mesh very well and we may want to do this at the
allocator level instead: basically a factory that gives you exactly what
you want -- memory unmapped from the kernel direct map, with HVO instead,
etc.
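Something along these lines, perhaps (all names invented, just to make
the "factory" idea concrete):

  #include <linux/types.h>

  /* Illustrative sketch of an allocator-level "factory". */
  #define PMEM_POOL_PERSISTENT    (1 << 0)  /* preserved across kexec (KHO) */
  #define PMEM_POOL_NO_DIRECTMAP  (1 << 1)  /* unmapped from the direct map */
  #define PMEM_POOL_NO_MEMMAP     (1 << 2)  /* no struct pages at all */
  #define PMEM_POOL_HVO           (1 << 3)  /* struct pages kept, HVO applied */

  struct pmem_pool;

  struct pmem_pool *pmem_pool_create(int nid, phys_addr_t size,
                                     phys_addr_t chunk_size,
                                     unsigned long flags);
  phys_addr_t pmem_pool_alloc(struct pmem_pool *pool);    /* one chunk */
  void pmem_pool_free(struct pmem_pool *pool, phys_addr_t chunk);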
Jason Gunthorpe noted there's a desire to add iommufd connections to
guest_memfd and that would have to be duplicated for guestmemfs. KVM has
special connections to it, ioctls, etc. So likely a whole new API
surface is coming around guest_memfd that guestmemfs will want to re-use.
To support this, it was also noted that guest_memfd is largely used for
confidential computing and pKVM today, and confidential computing is a
requirement for cloud providers: they need to expose a guest_memfd-style
interface for such VMs as well.
Jason suggested that when you create a file on the filesystem, you tell
it exactly what you want: unmapped memory, guest_memfd semantics, or just
a plain file. James expanded on this by brainstorming an API for such
use cases and backed by this new kind of allocator to provide exactly
what you need.
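As a straw man for that (flags and ioctl invented), the per-file request
could be as simple as:

  #include <linux/types.h>

  /* Straw-man per-file creation parameters: the caller states what it wants. */
  #define GMEMFS_F_UNMAPPED       (1 << 0)  /* out of the kernel direct map */
  #define GMEMFS_F_GUEST_MEMFD    (1 << 1)  /* guest_memfd semantics / KVM use */
  #define GMEMFS_F_PLAIN          (1 << 2)  /* ordinary mappable file */

  struct gmemfs_create {
          __u64 size;     /* multiple of the pool chunk size, e.g. 1GB */
          __u32 flags;    /* GMEMFS_F_* */
          __u32 node;     /* NUMA node to draw the memory from */
  };

  /* e.g. issued as an ioctl on a newly created empty file (number made up):
   *      ioctl(fd, GMEMFS_IOC_CREATE, &req);
   */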
----->o-----
James also noted some users are interested in smaller regions of memory
that aren't preallocated, like tmpfs, so there is interest in a "persistent
tmpfs," including dynamic sizing. This may be tricky because tmpfs uses the
page cache. In this case, preallocation would not be needed. Mike
Rapoport noted that keeping the memory mapped in the kernel direct map is
likewise not required for persistence (even if you want to do I/O).
The tricky part of this is to determine what should and should not be
solved with the same solution. Is it acceptable to have something like
guestmemfs which is very specific to cloud providers running VMs in most
of their host memory?
Matthew Wilcox noted there are perhaps ways to support persistence in
tmpfs for this other use case, such as with swap. James noted this could
be used for things like the systemd information that people have brought up
for containerization, and indicated we should ensure KHO can mark tmpfs
pages as persistent. We'd need to follow up with Alex.
----->o-----
Pasha Tatashin asked about NUMA support with the current guestmemfs
proposal. James noted this would be an essential requirement. When
specifying the kernel command line with guestmemfs=, we could specify
the lengths required from each NUMA node. This would result in per-node
mount points.
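For illustration, the syntax could be something like
guestmemfs=512G@0,256G@1 (completely made up), parsed roughly as:

  /* Illustrative parsing of a hypothetical guestmemfs=<size>@<node>,... option */
  static phys_addr_t guestmemfs_node_size[MAX_NUMNODES] __initdata;

  static int __init parse_guestmemfs_nodes(char *p)
  {
          while (p && *p) {
                  phys_addr_t size = memparse(p, &p);
                  unsigned int nid = 0;

                  if (*p == '@')
                          nid = simple_strtoul(p + 1, &p, 10);
                  if (nid < MAX_NUMNODES)
                          guestmemfs_node_size[nid] = size;
                  if (*p == ',')
                          p++;
          }
          return 0;
  }
  early_param("guestmemfs", parse_guestmemfs_nodes);

  /* each node's carve-out could then come from memblock_phys_alloc_try_nid() */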
----->o-----
Peter Xu asked if IOMMU page tables could be stored on the guestmemfs
themselves to preserve across kexec. James noted previous solutions for
this existed, but were tricky because of filesystem ordering at boot.
This led to the conclusion that if we want persistent devices, then we
need persistent memory as well; only files from guestmemfs that are known
to be persistent can be mapped into a persistent VMA domain. In the case
of IOMMU page tables, the IOMMU driver needs to tell KHO that they must be
persisted.
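A very rough sketch of what the driver side of that might look like
(everything below is hypothetical; KHO's actual interface is still being
worked out, and both function names are placeholders):

  /* Hypothetical: an IOMMU driver allocating its page tables from persistent
   * memory and registering them with KHO so they survive the kexec. */
  struct pgtable_region {
          phys_addr_t base;
          size_t size;
  };

  static int iommu_persist_pgtables(struct pgtable_region *r)
  {
          int err;

          /* both calls are placeholders for whatever KHO ends up exposing */
          err = kho_preserve_range(r->base, r->size);
          if (err)
                  return err;
          return kho_add_metadata("iommu-pgtables", r, sizeof(*r));
  }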
----->o-----
My takeaway: based on the feedback that was provided in the discussion:
- we need an allocator abstraction for persistent memory that can return
memory with various characteristics: 1GB or not, kernel direct map or
not, HVO or not, etc.
- built on top of that, we need the ability to carve out very large
ranges of memory (cloud provider use case) with NUMA awareness on the
kernel command line
- we also need the ability to dynamically resize this or
provide hints at allocation time that memory must be persisted across
kexec to support the non-cloud provider use case
- we need a filesystem abstraction that maps memory of the type that is
requested, including guest_memfd, and then deal with all the fun of
multitenancy since it would be drawing from a finite per-NUMA node
pool of persistent memory
- absolutely critical to this discussion is defining what is the core
infrastructure that is required for a generally acceptable solution
and then what builds off of that to be more special cased (like the
cloud provider use case or persistent tmpfs use case)
We're looking to continue that discussion here and then come together
again in a few weeks.
Thanks!
[*] https://lore.kernel.org/kvm/20240805093245.889357-1-jgowans@amazon.com/
* Re: Pmemfs/guestmemfs discussion recap and open questions
From: David Rientjes @ 2024-10-26 6:07 UTC
To: James Gowans, Dave Hansen, David Hildenbrand, Matthew Wilcox,
Mike Rapoport, Pasha Tatashin, Peter Xu, Alexander Graf,
Ashish Kalra, Tom Lendacky, David Woodhouse, Anthony Yznaga,
Jason Gunthorpe, Andrew Morton, Frank van der Linden,
Vipin Sharma, David Matlack, Steve Rutherford, Erdem Aktas,
Alper Gun, Vishal Annapurve, Ackerley Tng, Sagi Shahar
Cc: linux-mm, kexec
On Wed, 16 Oct 2024, David Rientjes wrote:
> ----->o-----
> My takeaway: based on the feedback that was provided in the discussion:
>
> - we need an allocator abstraction for persistent memory that can return
> memory with various characteristics: 1GB or not, kernel direct map or
> not, HVO or not, etc.
>
> - built on top of that, we need the ability to carve out very large
> ranges of memory (cloud provider use case) with NUMA awareness on the
> kernel command line
>
Following up on this, I think this physical memory allocator could also be
used as a backend for hugetlb. Hopefully this would be an allocator that is
generally useful for multiple purposes, something
like a mm/phys_alloc.c.
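To make the hugetlb angle concrete (purely a sketch; the pool interface is
hypothetical): the gigantic-page path could pull folios from such a pool
much like it can pull them from CMA today:

  #include <linux/mm.h>

  struct pmem_pool;
  /* hypothetical: would play the role cma_alloc() plays for hugetlb_cma */
  struct page *pmem_pool_alloc_pages(struct pmem_pool *pool, unsigned long nr);

  /* Sketch: a generic physical pool as an alternative gigantic-page source. */
  static struct folio *alloc_gigantic_folio_from_pool(struct pmem_pool *pool,
                                                      unsigned int order)
  {
          struct page *page = pmem_pool_alloc_pages(pool, 1UL << order);

          return page ? page_folio(page) : NULL;
  }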
Frank van der Linden may also have thoughts on the above?
> - we also need the ability to be able to dynamically resize this or
> provide hints at allocation time that memory must be persisted across
> kexec to support the non-cloud provider use case
>
> - we need a filesystem abstraction that map memory of the type that is
> requested, including guest_memfd and then deal with all the fun of
> multitenancy since it would be drawing from a finite per-NUMA node
> pool of persistent memory
>
> - absolutely critical to this discussion is defining what is the core
> infrastructure that is required for a generally acceptable solution
> and then what builds off of that to be more special cased (like the
> cloud provider use case or persistent tmpfs use case)
>
> We're looking to continue that discussion here and then come together
> again in a few weeks.
>
We'll be looking to schedule some more time to talk about this topic in
the Wednesday, November 13 instance of the Linux MM Alignment Session.
After that, I think it would be quite useful to break out the set of
people that are interested in persisting guest memory across kexec and KHO
into a separate series to accelerate discussion and next steps. Getting
the requirements and design locked down is critical, so I'm happy to
facilitate that to any extent possible and welcome everybody interested in
discussing it.
James, for the guestmemfs discussions, would this work for you?
Alexander, same question for you regarding the KHO work?
It's a global community, so the timing won't work for everybody. We'd
plan on sending out summaries of these discussions, such as in this email,
to solicit feedback and ideas from everybody.
If you're not on the To: or Cc: list already, please email me separately if
you're interested in participating and then we can find a regular time.
This is exciting!
* Re: Pmemfs/guestmemfs discussion recap and open questions
From: Vishal Annapurve @ 2024-10-29 15:32 UTC
To: David Rientjes
Cc: James Gowans, Dave Hansen, David Hildenbrand, Matthew Wilcox,
Mike Rapoport, Pasha Tatashin, Peter Xu, Alexander Graf,
Ashish Kalra, Tom Lendacky, David Woodhouse, Anthony Yznaga,
Jason Gunthorpe, Andrew Morton, Frank van der Linden,
Vipin Sharma, David Matlack, Steve Rutherford, Erdem Aktas,
Alper Gun, Ackerley Tng, Sagi Shahar, linux-mm, kexec
On Sat, Oct 26, 2024 at 11:37 AM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 16 Oct 2024, David Rientjes wrote:
>
> > ----->o-----
> > My takeaway: based on the feedback that was provided in the discussion:
> >
> > - we need an allocator abstraction for persistent memory that can return
> > memory with various characteristics: 1GB or not, kernel direct map or
> > not, HVO or not, etc.
> >
> > - built on top of that, we need the ability to carve out very large
> > ranges of memory (cloud provider use case) with NUMA awareness on the
> > kernel command line
> >
>
> Following up on this, I think this physical memory allocator would also be
> possible to use as a backend for hugetlb. Hopefully this would be an
> allocator that would be generally useful for multiple purposes, something
> like a mm/phys_alloc.c.
>
> Frank van der Linden may also have thoughts on the above?
>
> > - we also need the ability to be able to dynamically resize this or
> > provide hints at allocation time that memory must be persisted across
> > kexec to support the non-cloud provider use case
> >
> > - we need a filesystem abstraction that map memory of the type that is
> > requested, including guest_memfd and then deal with all the fun of
> > multitenancy since it would be drawing from a finite per-NUMA node
> > pool of persistent memory
> >
> > - absolutely critical to this discussion is defining what is the core
> > infrastructure that is required for a generally acceptable solution
> > and then what builds off of that to be more special cased (like the
> > cloud provider use case or persistent tmpfs use case)
> >
> > We're looking to continue that discussion here and then come together
> > again in a few weeks.
> >
>
> We'll be looking to schedule some more time to talk about this topic in
> the Wednesday, November 13 instance of the Linux MM Alignment Session.
>
> After that, I think it would be quite useful to break out the set of
> people that are interested in persisting guest memory across kexec and KHO
> into a separate series to accelerate discussion and next steps. Getting
> the requirements and design locked down is critical, so I'm happy to
> facilitate that to any extent possible and welcome everybody interested in
> discussing it.
I think there is a nice overlap between the requirements for guest memory
persistence and for guest_memfd 1G page support for
confidential/non-confidential VMs. Memory persistence of guest_memfd-backed
CoCo VMs using KHO will be a critical use case for us at Google as well, so
I am interested in further discussion here.
Regards,
Vishal
>
> James, for the guestmemfs discussions, would this work for you?
>
> Alexander, same question for you regarding the KHO work?
>
> It's a global community, so the timing won't work for everybody. We'd
> plan on sending out summaries of these discussions, such as in this email,
> to solicit feedback and ideas from everybody.
>
> If you're not on the To: or Cc: list already, please email me separately if
> you're interested in participating and then we can find a regular time.
>
> This is exciting!
* Re: Pmemfs/guestmemfs discussion recap and open questions
From: Mike Rapoport @ 2024-10-29 16:35 UTC
To: David Rientjes
Cc: James Gowans, Dave Hansen, David Hildenbrand, Matthew Wilcox,
Mike Rapoport, Pasha Tatashin, Peter Xu, Alexander Graf,
Ashish Kalra, Tom Lendacky, David Woodhouse, Anthony Yznaga,
Jason Gunthorpe, Andrew Morton, Frank van der Linden,
Vipin Sharma, David Matlack, Steve Rutherford, Erdem Aktas,
Alper Gun, Vishal Annapurve, Ackerley Tng, Sagi Shahar, linux-mm,
kexec
Hi David,
On Fri, Oct 25, 2024 at 11:07:27PM -0700, David Rientjes wrote:
> On Wed, 16 Oct 2024, David Rientjes wrote:
>
> > ----->o-----
> > My takeaway: based on the feedback that was provided in the discussion:
> >
> > - we need an allocator abstraction for persistent memory that can return
> > memory with various characteristics: 1GB or not, kernel direct map or
> > not, HVO or not, etc.
> >
> > - built on top of that, we need the ability to carve out very large
> > ranges of memory (cloud provider use case) with NUMA awareness on the
> > kernel command line
> >
>
> Following up on this, I think this physical memory allocator would also be
> possible to use as a backend for hugetlb. Hopefully this would be an
> allocator that would be generally useful for multiple purposes, something
> like a mm/phys_alloc.c.
Can you elaborate on this? mm/page_alloc.c already allocates physical
memory :)
Or do you mean an allocator that will deal with memory carved out from
what the page allocator manages?
> Frank van der Linden may also have thoughts on the above?
>
> > - we also need the ability to be able to dynamically resize this or
> > provide hints at allocation time that memory must be persisted across
> > kexec to support the non-cloud provider use case
> >
> > - we need a filesystem abstraction that map memory of the type that is
> > requested, including guest_memfd and then deal with all the fun of
> > multitenancy since it would be drawing from a finite per-NUMA node
> > pool of persistent memory
> >
> > - absolutely critical to this discussion is defining what is the core
> > infrastructure that is required for a generally acceptable solution
> > and then what builds off of that to be more special cased (like the
> > cloud provider use case or persistent tmpfs use case)
> >
> > We're looking to continue that discussion here and then come together
> > again in a few weeks.
> >
>
> We'll be looking to schedule some more time to talk about this topic in
> the Wednesday, November 13 instance of the Linux MM Alignment Session.
>
> After that, I think it would be quite useful to break out the set of
> people that are interested in persisting guest memory across kexec and KHO
> into a separate series to accelerate discussion and next steps. Getting
> the requirements and design locked down is critical, so I'm happy to
> facilitate that to any extent possible and welcome everybody interested in
> discussing it.
>
> James, for the guestmemfs discussions, would this work for you?
>
> Alexander, same question for you regarding the KHO work?
>
> It's a global community, so the timing won't work for everybody. We'd
> plan on sending out summaries of these discussions, such as in this email,
> to solicit feedback and ideas from everybody.
>
> If you're not on the To: or Cc: list already, please email me separately if
> you're interested in participating and then we can find a regular time.
>
> This is exciting!
>
--
Sincerely yours,
Mike.
* Re: Pmemfs/guestmemfs discussion recap and open questions
From: Frank van der Linden @ 2024-10-30 22:43 UTC
To: Mike Rapoport
Cc: David Rientjes, James Gowans, Dave Hansen, David Hildenbrand,
Matthew Wilcox, Mike Rapoport, Pasha Tatashin, Peter Xu,
Alexander Graf, Ashish Kalra, Tom Lendacky, David Woodhouse,
Anthony Yznaga, Jason Gunthorpe, Andrew Morton, Vipin Sharma,
David Matlack, Steve Rutherford, Erdem Aktas, Alper Gun,
Vishal Annapurve, Ackerley Tng, Sagi Shahar, linux-mm, kexec
On Tue, Oct 29, 2024 at 9:39 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi David,
>
> On Fri, Oct 25, 2024 at 11:07:27PM -0700, David Rientjes wrote:
> > On Wed, 16 Oct 2024, David Rientjes wrote:
> >
> > > ----->o-----
> > > My takeaway: based on the feedback that was provided in the discussion:
> > >
> > > - we need an allocator abstraction for persistent memory that can return
> > > memory with various characteristics: 1GB or not, kernel direct map or
> > > not, HVO or not, etc.
> > >
> > > - built on top of that, we need the ability to carve out very large
> > > ranges of memory (cloud provider use case) with NUMA awareness on the
> > > kernel command line
> > >
> >
> > Following up on this, I think this physical memory allocator would also be
> > possible to use as a backend for hugetlb. Hopefully this would be an
> > allocator that would be generally useful for multiple purposes, something
> > like a mm/phys_alloc.c.
>
> Can you elaborate on this? mm/page_alloc.c already allocates physical
> memory :)
>
> Or you mean an allocator that will deal with memory carved out from what page
> allocator manages?
>
> > Frank van der Linden may also have thoughts on the above?
Yeah 'physical allocator' is a bit of a misnomer. You're right, an
allocator that deals with memory not under page allocator control is a
better description.
To elaborate a bit: there are various scenarios where allocating
contiguous stretches of physical memory is useful: HugeTLB, VM guest
memory, or cases where you are presented with an external range of
VM_PFNMAP memory and need to manage it in a simple way and hand it out
for guest memory support (see NVIDIA's github for nvgrace-egm). However, all of
these cases may come with slightly different requirements: is the
memory purely external? Does it have struct pages? If so, is it in the
direct map? Is the memmap for the memory optimized (HVO-style)? Does
it need to be persistent? When does it need to be zeroed out?
So that's why it seems like a good idea to come up with a slightly
more generalized version of a pool allocator - something that manages,
usually larger, chunks of physically contiguous memory. A pool is
initialized with certain properties (persistence, etc). It has methods
to grow and shrink the pool if needed. It's in no way meant to be
anywhere near as sophisticated as the page allocator; that would not
be useful (and would be pointless code duplication). A simple fixed-size
chunk pool will satisfy a lot of these cases.
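Roughly, the shape I have in mind (names are placeholders):

  /* Placeholder sketch of the pool interface described above. */
  struct phys_pool_props {
          bool persistent;        /* preserved across kexec */
          bool has_memmap;        /* struct pages present? */
          bool in_directmap;      /* mapped in the kernel direct map? */
          bool hvo;               /* memmap optimized HVO-style */
          int  zero_mode;         /* when chunks get zeroed: alloc/free/never */
  };

  struct phys_pool *phys_pool_create(const struct phys_pool_props *props,
                                     unsigned long chunk_size, int nid);
  int phys_pool_grow(struct phys_pool *pool, unsigned long nr_chunks);
  int phys_pool_shrink(struct phys_pool *pool, unsigned long nr_chunks);
  phys_addr_t phys_pool_get(struct phys_pool *pool);
  void phys_pool_put(struct phys_pool *pool, phys_addr_t chunk);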
A number of the building blocks are already there: there's CMA,
there's ZONE_DEVICE which has tools to manipulate some of these
properties (by going through a hotremove / hotplug cycle). I created a
simple prototype that essentially uses CMA as a pool provider, and
uses some ZONE_DEVICE tools to initialize memory however you want it
when it's added to the pool. I also added some new init code to
avoid things like unneeded memmap allocation at boot for hugetlbfs
pages. I put hugetlbfs on top of it - but in a restricted way for
prototyping purposes (no reservations, no demotion).
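In case it helps picture it, the CMA-as-provider part boils down to
something like this (simplified; error handling and the ZONE_DEVICE
re-initialization steps are omitted, and phys_pool_add_chunk() is a
made-up helper):

  #include <linux/cma.h>

  struct phys_pool;
  int phys_pool_add_chunk(struct phys_pool *pool, phys_addr_t base,
                          phys_addr_t size);    /* made up */

  /* Simplified: grow the pool by pulling a physically contiguous chunk out
   * of a CMA area set up at boot, then (not shown) rework its memmap /
   * direct-map state to match the pool's properties. */
  static int phys_pool_grow_from_cma(struct phys_pool *pool, struct cma *cma,
                                     unsigned long chunk_pages)
  {
          struct page *page = cma_alloc(cma, chunk_pages,
                                        get_order(SZ_1G), false);

          if (!page)
                  return -ENOMEM;
          return phys_pool_add_chunk(pool, page_to_phys(page),
                                     (phys_addr_t)chunk_pages << PAGE_SHIFT);
  }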
Anyway, this is the basic idea.
- Frank