* [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Mike Rapoport @ 2025-01-20 7:54 UTC
To: lsf-pc, Alexander Graf, Gowans, James
Cc: linux-mm, David Rientjes, Pasha Tatashin
Hi,

I'd like to discuss memory persistence across kexec.

There is currently ongoing work on Kexec HandOver (KHO) [1] that allows
serialization and deserialization of kernel data, as well as preserving
arbitrary memory ranges across kexec.

In addition, KHO keeps physically contiguous memory regions that are
guaranteed not to contain any memory that KHO would preserve, but that
can still be used by the system. The kexec'ed kernel bootstraps itself
using those regions and marks all handed-over memory as in use. KHO
users can then recover their state from the preserved data. This
includes memory reservations, which the user can either claim or
discard.

KHO can be used as the base layer for implementing a persistence-aware
memory allocator and a persistent in-memory filesystem.
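
To make the intended usage concrete, here is a minimal sketch of how a
KHO user could preserve a buffer before kexec and claim it back
afterwards. This is an illustration only: kho_preserve_mem() and
kho_retrieve_mem() are made-up helper names standing in for whatever
interface KHO ends up exposing.

/*
 * Illustrative sketch only; the kho_* helpers below are hypothetical,
 * not the actual KHO interface.
 */
static void *my_state;          /* data that should survive kexec */
static size_t my_state_len;

/* Before kexec: record the physical range so the next kernel keeps it. */
static int my_serialize(void)
{
        return kho_preserve_mem("my-subsystem", virt_to_phys(my_state),
                                my_state_len);
}

/* After kexec: claim the reservation back, or fall back to a cold start. */
static int __init my_restore(void)
{
        phys_addr_t phys;

        if (kho_retrieve_mem("my-subsystem", &phys, &my_state_len))
                return -ENOENT;         /* nothing was handed over */
        my_state = phys_to_virt(phys);
        return 0;
}
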
Aside from a status update on KHO progress, there are a few topics that
I would like to discuss:
* Is it feasible and desirable to enable KHO support in tmpfs and hugetlbfs?
* Or is it better to implement yet another in-memory filesystem dedicated
  to persistence?
* What is the best way to ensure that the memory we want to persist is not
  scattered all over the place?

[1] https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/
--
Sincerely yours,
Mike.
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Jason Gunthorpe @ 2025-01-20 14:14 UTC
To: Mike Rapoport
Cc: lsf-pc, Alexander Graf, Gowans, James, linux-mm, David Rientjes,
Pasha Tatashin
On Mon, Jan 20, 2025 at 09:54:15AM +0200, Mike Rapoport wrote:
> Hi,
>
> I'd like to discuss memory persistence across kexec.
>
> There is currently ongoing work on Kexec HandOver (KHO) [1] that allows
> serialization and deserialization of kernel data, as well as preserving
> arbitrary memory ranges across kexec.
>
> In addition, KHO keeps physically contiguous memory regions that are
> guaranteed not to contain any memory that KHO would preserve, but that
> can still be used by the system. The kexec'ed kernel bootstraps itself
> using those regions and marks all handed-over memory as in use. KHO
> users can then recover their state from the preserved data. This
> includes memory reservations, which the user can either claim or
> discard.
>
> KHO can be used as the base layer for implementing a persistence-aware
> memory allocator and a persistent in-memory filesystem.
>
> Aside from a status update on KHO progress, there are a few topics that
> I would like to discuss:
> * Is it feasible and desirable to enable KHO support in tmpfs and hugetlbfs?
> * Or is it better to implement yet another in-memory filesystem dedicated
>   to persistence?
> * What is the best way to ensure that the memory we want to persist is not
>   scattered all over the place?

There is a lot of talk about taking *drivers* and having them survive
kexec, meaning the driver has to put a lot of its state into KHO and
then get it back out again.

I've been hoping for a model where a driver can be told to "go to KHO"
and the KHO code can be largely contained in the driver and relegated
to recording the driver state. This implies the state may be
fragmented all over memory.

The other direction is that the driver has to start up in some special
KHO mode and KHO becomes invasive on all driver paths to use special
KHO allocations. This seems like a PITA.

You can see this difference just in the discussion around the iommu
serialization, where one idea was to have KHO be an integral (and
invasive!) part of the page table operations from time zero vs. some
later serialization at kexec time.

Regardless, I'm interested in this discussion to bring some
concreteness to how drivers work.

Jason
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: David Rientjes @ 2025-01-20 19:42 UTC
To: Jason Gunthorpe
Cc: Mike Rapoport, lsf-pc, Alexander Graf, Gowans, James, linux-mm,
Pasha Tatashin
On Mon, 20 Jan 2025, Jason Gunthorpe wrote:
> On Mon, Jan 20, 2025 at 09:54:15AM +0200, Mike Rapoport wrote:
> > Hi,
> >
> > I'd like to discuss memory persistence across kexec.
> >
> > There is currently ongoing work on Kexec HandOver (KHO) [1] that allows
> > serialization and deserialization of kernel data, as well as preserving
> > arbitrary memory ranges across kexec.
> >
> > In addition, KHO keeps physically contiguous memory regions that are
> > guaranteed not to contain any memory that KHO would preserve, but that
> > can still be used by the system. The kexec'ed kernel bootstraps itself
> > using those regions and marks all handed-over memory as in use. KHO
> > users can then recover their state from the preserved data. This
> > includes memory reservations, which the user can either claim or
> > discard.
> >
> > KHO can be used as the base layer for implementing a persistence-aware
> > memory allocator and a persistent in-memory filesystem.
> >
> > Aside from a status update on KHO progress, there are a few topics that
> > I would like to discuss:
> > * Is it feasible and desirable to enable KHO support in tmpfs and hugetlbfs?

This is a very timely discussion given the last Linux MM Alignment
Session on the topic, since some use cases, at least for tmpfs, have
emerged. Not necessarily a requirement, but more a matter of
convenience.

> > * Or is it better to implement yet another in-memory filesystem dedicated
> >   to persistence?
> > * What is the best way to ensure that the memory we want to persist is not
> >   scattered all over the place?
>
> There is a lot of talk about taking *drivers* and having them survive
> kexec, meaning the driver has to put a lot of its state into KHO and
> then get it back out again.
>
> I've been hoping for a model where a driver can be told to "go to KHO"
> and the KHO code can be largely contained in the driver and relegated
> to recording the driver state. This implies the state may be
> fragmented all over memory.
>
This sounds fantastic if it is doable!

> The other direction is that the driver has to start up in some special
> KHO mode and KHO becomes invasive on all driver paths to use special
> KHO allocations. This seems like a PITA.
>
> You can see this difference just in the discussion around the iommu
> serialization, where one idea was to have KHO be an integral (and
> invasive!) part of the page table operations from time zero vs. some
> later serialization at kexec time.
>
> Regardless, I'm interested in this discussion to bring some
> concreteness to how drivers work.
>
+1, I'm also interested in this discussion.

As previously mentioned[1], we'll also start a biweekly meeting on
hypervisor live update to accelerate progress. The first instance of
that meeting will be next week, Monday, January 27 at 8am PST (UTC-8).
Calendar invites will go out later today for everybody on that email
thread; if anybody else is interested in attending on a regular basis,
please email me. Hopefully this can be leveraged as well to build up to
LSF/MM/BPF.

[1]
https://lore.kernel.org/kexec/2908e4ab-abc4-ddd0-b191-fe820856cfb4@google.com/T/#u
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Pasha Tatashin @ 2025-01-22 23:30 UTC
To: David Rientjes
Cc: Jason Gunthorpe, Mike Rapoport, lsf-pc, Alexander Graf, Gowans,
James, linux-mm
On Mon, Jan 20, 2025 at 2:42 PM David Rientjes <rientjes@google.com> wrote:
>
> [...]
>
> As previously mentioned[1], we'll also start a biweekly meeting on
> hypervisor live update to accelerate progress. The first instance of
> that meeting will be next week, Monday, January 27 at 8am PST (UTC-8).
> Calendar invites will go out later today for everybody on that email
> thread; if anybody else is interested in attending on a regular basis,
> please email me. Hopefully this can be leveraged as well to build up to
> LSF/MM/BPF.
>
> [1]
> https://lore.kernel.org/kexec/2908e4ab-abc4-ddd0-b191-fe820856cfb4@google.com/T/#u
+1
Hi Mike,

I'm very interested in this topic and can contribute both by presenting
and by implementing changes upstream. We're planning on using KHO in our
kernel at Google, but there are some limitations for our use case that
I believe can be addressed.

Limitations:

1. Serialization callbacks are called by KHO serially during the
activation phase. In most cases different device drivers are
independent, so the serialization could be parallelized.

2. Once the serialization callbacks are done, the device tree data
cannot be altered and drivers cannot add more data into the device
tree (beyond limited modifications where a driver remembers the exact
node that was created and changes some of its properties).
This is bad because we have use cases where we need to save buffered
data (not memory locations) into the device tree at some late stage
before jumping to the new kernel.

3. KHO requires device state to be serialized before
kexec_file_load()/kexec_load(), which means that the load becomes part of
the VM blackout window. If KHO is used for hypervisor live update
scenarios, this is a very bad limitation.

4. KHO activation should not really be needed. There should be two
phases: the old KHO tree passed from the old kernel and, once it is
fully consumed, a new KHO tree that can be updated at any time by
devices and that is going to be passed to the next kernel during the
next reboot (kexec, or firmware that is aware of KHO...). Instead of
activation there should be a user-driven phase shift from the old tree
to the new tree; once that is done, drivers can start serializing at will.

I believe all these limitations exist because KHO uses a device tree in
FDT format during serialization. Instead, the KHO device tree in the
new kernel should be kept in a more relaxed format, such as a hash
table, until it is converted into FDT very late in the kernel shutdown
path, right before jumping to the next kernel.

As David mentioned, there is going to be a hypervisor live update
bi-weekly meeting where we can discuss this.

Pasha
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Mike Rapoport @ 2025-01-24 11:30 UTC
To: Jason Gunthorpe
Cc: lsf-pc, Alexander Graf, Gowans, James, linux-mm, David Rientjes,
Pasha Tatashin
Hi Jason,
On Mon, Jan 20, 2025 at 10:14:27AM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 20, 2025 at 09:54:15AM +0200, Mike Rapoport wrote:
> > [...]
>
> There is a lot of talk about taking *drivers* and having them survive
> kexec, meaning the driver has to put a lot of its state into KHO and
> then get it back out again.
>
> I've been hoping for a model where a driver can be told to "go to KHO"
> and the KHO code can be largely contained in the driver and relegated
> to recording the driver state. This implies the state may be
> fragmented all over memory.

I'm not sure I follow what you mean by "go to KHO" here.

I believe that the ftrace example in Alex's v3 of KHO
(https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com)
has enough meat to demonstrate the basic model.

The driver has to pass the state it wishes to preserve, and then during
initialization after kexec the driver can restore its state from the
preserved one.

> The other direction is that the driver has to start up in some special
> KHO mode and KHO becomes invasive on all driver paths to use special
> KHO allocations. This seems like a PITA.
>
> You can see this difference just in the discussion around the iommu
> serialization, where one idea was to have KHO be an integral (and
> invasive!) part of the page table operations from time zero vs. some
> later serialization at kexec time.

I didn't follow that discussion closely, but there still should be a
step where the iommu driver tries to deserialize the data and uses it
if deserialization succeeds.

My understanding is that a major part of the complexity in iommu is the
userspace-facing bits that need to be somehow connected to the restored
in-kernel structures after kexec.

> Regardless, I'm interested in this discussion to bring some
> concreteness to how drivers work.
>
> Jason
--
Sincerely yours,
Mike.
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Jason Gunthorpe @ 2025-01-24 14:56 UTC
To: Mike Rapoport
Cc: lsf-pc, Alexander Graf, Gowans, James, linux-mm, David Rientjes,
Pasha Tatashin
On Fri, Jan 24, 2025 at 01:30:52PM +0200, Mike Rapoport wrote:
> Hi Jason,
>
> On Mon, Jan 20, 2025 at 10:14:27AM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 20, 2025 at 09:54:15AM +0200, Mike Rapoport wrote:
> > > [...]
> >
> > There is a lot of talk about taking *drivers* and having them survive
> > kexec, meaning the driver has to put a lot of its state into KHO and
> > then get it back out again.
> >
> > I've been hoping for a model where a driver can be told to "go to KHO"
> > and the KHO code can be largely contained in the driver and relegated
> > to recording the driver state. This implies the state may be
> > fragmented all over memory.
>
> I'm not sure I follow what you mean by "go to KHO" here.

Drawing on our now extensive experience with PCI device live
migration, I imagine a state progression approximately like:

RUNNING - minimal or no KHO involvement
PREPARE - KHO stuff starts to get ready, preallocations, loading
          successor kernels, etc. No VM degradation
PRE-STOP - KHO gets serious, stuff starts to become unavailable,
           userspace needs to shut things down and get ready. Some
           level of VM degradation - i.e. changing IOMMU translations
           may block the VM until CONCLUDE.
STOP - Now you've done it. KHO state is finalized - VMs stop running
KEXEC - Weee - VMs not running
RESUME - Get booted up, get ready to start up the VMs - VMs still stopped
POST-RESUME - Start unpacking more stuff from KHO, userspace starts
              bringing back other stuff it may have shut down. Some
              level of VM degradation
CONCLUDE - Discard all the remaining KHO stuff. No VM degradation
RUNNING - minimal or no KHO involvement

Each of these states should inform drivers etc. when we reach them, and
the KHO state that will survive the kexec evolves and extends as it
progresses.

So "go to KHO" would refer to a driver that is using PREPARE and
PRE-STOP to start moving its functionality from normal memory to
KHO-preserved memory, possibly with some functional degradation.
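
To sketch the shape this could take (purely illustrative - none of
these identifiers exist in the kernel today):

enum kho_phase {
        KHO_RUNNING,            /* minimal or no KHO involvement */
        KHO_PREPARE,            /* preallocate, load successor kernel */
        KHO_PRE_STOP,           /* serialize, some VM degradation */
        KHO_STOP,               /* state finalized, VMs stopped */
        KHO_KEXEC,              /* jump to the successor kernel */
        KHO_RESUME,             /* successor booted, VMs still stopped */
        KHO_POST_RESUME,        /* unpack, bring services back */
        KHO_CONCLUDE,           /* discard leftover KHO state */
};

struct kho_participant {
        /*
         * Called at every transition; a driver that "goes to KHO"
         * would copy state into preserved memory at KHO_PRE_STOP /
         * KHO_STOP and copy it back out at KHO_RESUME /
         * KHO_POST_RESUME.
         */
        int (*phase_change)(struct kho_participant *p,
                            enum kho_phase phase);
};
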
> I believe that the ftrace example in Alex's v3 of KHO
> (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com)
> has enough meat to demonstrate the basic model.

ftrace is just too simple to capture the full complexity of what a
real HW device would need. We've now spent time thinking about what it
would take to make a complex NIC survive kexec, and I suggest the above
model for how to approach it.

> > The other direction is that the driver has to start up in some special
> > KHO mode and KHO becomes invasive on all driver paths to use special
> > KHO allocations. This seems like a PITA.
> >
> > You can see this difference just in the discussion around the iommu
> > serialization, where one idea was to have KHO be an integral (and
> > invasive!) part of the page table operations from time zero vs. some
> > later serialization at kexec time.
>
> I didn't follow that discussion closely, but there still should be a
> step where the iommu driver tries to deserialize the data and uses it
> if deserialization succeeds.

There were two options: one is that the iommu always lives in KHO, the
other is that the iommu moves (i.e. goes to KHO) into KHO.

For instance, assuming the latter, as you progress through the above
state list:

RUNNING - IOMMU page tables are in normal memory and normal IOMMU code
          is used to manipulate them
PREPARE - We allocate an approximate amount of KHO memory needed to hold
          the page tables
PRE-STOP - The page tables are copied into the KHO memory and frozen
           to be unchanging
STOP - The IOMMU driver records in KHO which devices have KHO page
       tables
RESUME - The IOMMU driver recovers the KHO page tables and hitlessly
         sets up the new HW lookup tables to use them
POST-RESUME - The page tables are copied out of the KHO memory and
              back to normal memory where the normal IOMMU algorithms
              can run on them
CONCLUDE - All the KHO memory is freed

With the first option, by comparison, we'd somehow teach the IOMMU code
to always use KHO for allocations, and KHO would somehow be compatible
with, and preserve, the IOMMU's use of struct page metadata. That
avoids the serializing copy, but you have to make invasive KHO changes
to the existing IOMMU page table code.

vs. serializing, which could be isolated to a KHO module that doesn't
bother anyone else.
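
As a sketch of how small that isolated module could be, the core of the
serializing copy might look like the following. struct kho_iopt_page
and kho_copy_iopt() are invented names, and the fixup of inter-level
pointers so they reference the copies is left out:

struct kho_iopt_page {
        u64 orig_phys;          /* where the table page lived */
        u8  data[PAGE_SIZE];    /* verbatim page table contents */
};

/*
 * Copy a frozen set of IO page table pages into preallocated KHO
 * memory, recording each page's original physical address so the
 * inter-level pointers can later be rewritten to reference the copies.
 */
static int kho_copy_iopt(const phys_addr_t *pages, size_t nr_pages,
                         struct kho_iopt_page *out)
{
        size_t i;

        for (i = 0; i < nr_pages; i++) {
                out[i].orig_phys = pages[i];
                memcpy(out[i].data, phys_to_virt(pages[i]), PAGE_SIZE);
        }
        return 0;
}
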
[Also, I would prefer to see KHO updates to the page table code after
consolidating the iommu page table code in one place. I could use some
help on that project too :)
https://patch.msgid.link/r/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com
]

> My understanding is that a major part of the complexity in iommu is the
> userspace-facing bits that need to be somehow connected to the restored
> in-kernel structures after kexec.

Yes, certainly this is hard too. I have yet to see a complete
functional proposal for this.

I have been feeling that KHO should have a way to preserve a driver
file descriptor. Not a full descriptor, but something stripped back
and simplified. Getting a descriptor through KHO, vs. /dev/XXX, would
trigger special stuff like not FLRing VFIO PCI devices, not wrecking
the IOMMU translation, and so on.

For instance, for iommufd we may move the tables into KHO, destroy all
other iommufd objects, then transfer the stripped-down iommufd FD to
KHO. On resume the VMM would recover the KHO iommufd FD and rebuild
the lost objects, then destroy the special KHO page table.

The really tricky thing is that there is *a lot* of state in these FDs;
some we can imagine retaining, the rest will have to be rebuilt.

There are also a lot of kernel actions that don't happen at FD open
time. Some kind of philosophy is needed here - what happens if the
kernel skips steps to preserve KHO, but userspace doesn't follow the
KHO flow? I.e. userspace opens /dev/vfio instead of the KHO version?
The /dev/vfio is pretty wrecked because of what KHO did. Does the
kernel have to fix it? Should the kernel forbid it? What happens if
you KHO and KHO again without userspace fixing everything? So many
questions :\

Jason
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Andrey Ryabinin @ 2025-01-24 18:23 UTC
To: Mike Rapoport
Cc: lsf-pc, Alexander Graf, Gowans, James, linux-mm, David Rientjes,
Pasha Tatashin, Jason Gunthorpe
On Mon, Jan 20, 2025 at 8:54 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi,
>
> I'd like to discuss memory persistence across kexec.
>

Hi, I'm very interested in this topic as well; I'd like to join the
club :)

> There is currently ongoing work on Kexec HandOver (KHO) [1] that allows
> serialization and deserialization of kernel data, as well as preserving
> arbitrary memory ranges across kexec.
>

To be able to perform a live update of a hypervisor kernel with running
VMs that use VFIO devices, we would need to [de]serialize a lot of
different and complex states (PCI, IOMMU, VFIO, ...).

When I looked at KHO, I found that the process of describing data using
KHO is complicated and requires writing a lot of code that intrudes
deeply into subsystem code. So I think this might be a blocker for
applying KHO to VFIO device state, which is more complicated than the
ftrace buffers.

To address this particular issue I came up with the proof of concept
that I sent a few months ago:
https://lkml.kernel.org/r/20241002160722.20025-1-arbn@yandex-team.com

The idea behind it was inspired by QEMU's VMSTATE mechanism, which
solves a similar problem - describing and migrating device states
across different instances of QEMU. As an example, I chose to preserve
ftrace buffers as well, so it is easier to compare with the KHO
approach.
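
For readers unfamiliar with QEMU's VMSTATE: the device state is
described declaratively, and generic code walks the descriptors to save
and load it, including version handling. In QEMU that looks roughly
like the example below (MyDevState and its fields are invented for
illustration; the VMSTATE_* macros are QEMU's real ones). The PoC
transfers the same idea to kernel structures:

typedef struct MyDevState {
    uint32_t head;
    uint32_t tail;
    uint64_t ring_phys;
} MyDevState;

static const VMStateDescription vmstate_my_dev = {
    .name = "my_dev",
    .version_id = 1,                    /* bumped on layout changes */
    .minimum_version_id = 1,
    .fields = (const VMStateField[]) {
        VMSTATE_UINT32(head, MyDevState),
        VMSTATE_UINT32(tail, MyDevState),
        VMSTATE_UINT64(ring_phys, MyDevState),
        VMSTATE_END_OF_LIST()
    }
};
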
> [...]
>
> Aside from a status update on KHO progress, there are a few topics that
> I would like to discuss:
> * Is it feasible and desirable to enable KHO support in tmpfs and hugetlbfs?
> * Or is it better to implement yet another in-memory filesystem dedicated
>   to persistence?

We would definitely need a framework to [de]serialize data. With that
we should be able to preserve tmpfs/hugetlbfs (and it probably will be
easier than preserving some device state).

So yet another in-memory filesystem should come only as a solution to
some concrete problem, for example:
- serialization of tmpfs/hugetlbfs requires an unreasonable amount of
  memory (or time to process);
- the implementation ends up too complicated and fragile, so it's just
  better to have a separate dedicated fs;
- whatever else comes up...
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Zhu Yanjun @ 2025-01-24 21:03 UTC
To: David Rientjes, Jason Gunthorpe
Cc: Mike Rapoport, lsf-pc, Alexander Graf, Gowans, James, linux-mm,
Pasha Tatashin
On 2025/1/20 20:42, David Rientjes wrote:
> [...]
>
> +1, I'm also interested in this discussion.

+1. I hope I can join the meeting to listen to this presentation.

Zhu Yanjun

>
> As previously mentioned[1], we'll also start a biweekly meeting on
> hypervisor live update to accelerate progress. The first instance of
> that meeting will be next week, Monday, January 27 at 8am PST (UTC-8).
> Calendar invites will go out later today for everybody on that email
> thread; if anybody else is interested in attending on a regular basis,
> please email me. Hopefully this can be leveraged as well to build up to
> LSF/MM/BPF.
>
> [1]
> https://lore.kernel.org/kexec/2908e4ab-abc4-ddd0-b191-fe820856cfb4@google.com/T/#u
>
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Mike Rapoport @ 2025-01-25 9:53 UTC
To: Pasha Tatashin
Cc: David Rientjes, Jason Gunthorpe, lsf-pc, Alexander Graf, Gowans,
James, linux-mm
Hi Pasha,
On Wed, Jan 22, 2025 at 06:30:22PM -0500, Pasha Tatashin wrote:
> > > On Mon, Jan 20, 2025 at 09:54:15AM +0200, Mike Rapoport wrote:
> > > > Hi,
> > > >
> > > > I'd like to discuss memory persistence across kexec.
> > > >
>
> Hi Mike,
>
> I'm very interested in this topic and can contribute both by presenting
> and by implementing changes upstream. We're planning on using KHO in our
> kernel at Google, but there are some limitations for our use case that
> I believe can be addressed.
>
> Limitations:
>
> 1. Serialization callbacks are called by KHO serially during the
> activation phase. In most cases different device drivers are
> independent, so the serialization could be parallelized.
>
> 2. Once the serialization callbacks are done, the device tree data
> cannot be altered and drivers cannot add more data into the device
> tree (beyond limited modifications where a driver remembers the exact
> node that was created and changes some of its properties).
> This is bad because we have use cases where we need to save buffered
> data (not memory locations) into the device tree at some late stage
> before jumping to the new kernel.

The device tree data cannot be altered because at kexec load time it is
appended to the kexec image, and that image cannot be altered without a
new kexec load.

> 3. KHO requires device state to be serialized before
> kexec_file_load()/kexec_load(), which means that the load becomes part of
> the VM blackout window. If KHO is used for hypervisor live update
> scenarios, this is a very bad limitation.

KHO data has to be a part of the kexec image, and the way kexec works
now there is no way to add anything to the kexec image after kexec load.
To be able to serialize the state closer to the kexec reboot we'd need
to change the way kexec images are created, regardless of what data
format we use to pass the data between kernels.

> 4. KHO activation should not really be needed. There should be two
> phases: the old KHO tree passed from the old kernel and, once it is
> fully consumed, a new KHO tree that can be updated at any time by
> devices and that is going to be passed to the next kernel during the
> next reboot (kexec, or firmware that is aware of KHO...). Instead of
> activation there should be a user-driven phase shift from the old tree
> to the new tree; once that is done, drivers can start serializing at will.

If I understand you correctly, it's up to the driver to decide when to
update the data that should be passed to the new kernel?

Again, for now it's a kexec limitation that the kexec image cannot be
altered between load and exec. Still, it's not clear to me how drivers
could decide when they need to do the updates.

> As David mentioned, there is going to be a hypervisor live update
> bi-weekly meeting where we can discuss this.

Sure :)
> Pasha
--
Sincerely yours,
Mike.
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Pasha Tatashin @ 2025-01-25 15:19 UTC
To: Mike Rapoport
Cc: David Rientjes, Jason Gunthorpe, lsf-pc, Alexander Graf, Gowans,
James, linux-mm
On Sat, Jan 25, 2025 at 4:53 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi Pasha,
>
> On Wed, Jan 22, 2025 at 06:30:22PM -0500, Pasha Tatashin wrote:
> > > > On Mon, Jan 20, 2025 at 09:54:15AM +0200, Mike Rapoport wrote:
> > > > > Hi,
> > > > >
> > > > > I'd like to discuss memory persistence across kexec.
> > > > >
> >
> > [...]
> >
> > 2. Once the serialization callbacks are done, the device tree data
> > cannot be altered and drivers cannot add more data into the device
> > tree (beyond limited modifications where a driver remembers the exact
> > node that was created and changes some of its properties).
> > This is bad because we have use cases where we need to save buffered
> > data (not memory locations) into the device tree at some late stage
> > before jumping to the new kernel.
>
> The device tree data cannot be altered because at kexec load time it is
> appended to the kexec image, and that image cannot be altered without a
> new kexec load.

Right, this is how it is implemented now.

One way to solve that is pre-reserving space for the KHO tree -
ideally a reasonable amount, perhaps 32-64 MB - and allocating it at
kexec load time. During shutdown, we would use this pre-allocated
space to convert the sparse KHO tree to FDT format. Performing kexec
load during the blackout period violates the hypervisor's live update
time requirements, and it also prevents breaking serialization into
phases: i.e. pre-blackout, during blackout, during shutdown, etc.
Furthermore, for performance reasons serialization must be
parallelizable for live updates, which the FDT format does not
support. Since we can specify the KHO scratch space, which is the
maximum amount of memory needed for the next kernel, we can similarly
specify a maximum KHO tree size.
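
For reference, the late conversion could lean on libfdt's
sequential-write interface, which builds a blob into a caller-supplied,
fixed-size buffer and fails with -FDT_ERR_NOSPACE when the budget is
exceeded. A rough sketch - the fdt_* calls are real libfdt, while the
surrounding function and the tree walk are assumptions:

static int kho_flatten_late(void *buf, int size)
{
        int err;

        err = fdt_create(buf, size);
        if (err)
                return err;
        err = fdt_finish_reservemap(buf);
        if (err)
                return err;
        err = fdt_begin_node(buf, "");          /* root node */
        if (err)
                return err;

        /*
         * ... walk the relaxed-format tree here, emitting one
         * fdt_begin_node()/fdt_property()/fdt_end_node() group per
         * entry; -FDT_ERR_NOSPACE means the pre-reserved budget was
         * overrun and the live update must be aborted ...
         */

        err = fdt_end_node(buf);
        if (err)
                return err;
        return fdt_finish(buf);                 /* seal the blob */
}
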
> > 3. KHO requires device state to be serialized before
> > kexec_file_load()/kexec_load(), which means that the load becomes part of
> > the VM blackout window. If KHO is used for hypervisor live update
> > scenarios, this is a very bad limitation.
>
> KHO data has to be a part of the kexec image, and the way kexec works
> now there is no way to add anything to the kexec image after kexec load.
> To be able to serialize the state closer to the kexec reboot we'd need
> to change the way kexec images are created, regardless of what data
> format we use to pass the data between kernels.
>
> > 4. KHO activation should not really be needed. There should be two
> > phases: the old KHO tree passed from the old kernel and, once it is
> > fully consumed, a new KHO tree that can be updated at any time by
> > devices and that is going to be passed to the next kernel during the
> > next reboot (kexec, or firmware that is aware of KHO...). Instead of
> > activation there should be a user-driven phase shift from the old tree
> > to the new tree; once that is done, drivers can start serializing at will.
>
> If I understand you correctly, it's up to the driver to decide when to
> update the data that should be passed to the new kernel?

That is correct. I am planning to propose a dev->{driver,
bus}->liveupdate(dev, liveupdate_phase) callback, where drivers can
preserve state into KHO during the different phases of the live update
cycle: before blackout, during blackout, and during shutdown. When
implemented, on a live-update reboot this callback will be invoked
instead of the shutdown() callback.
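
A rough sketch of what that could look like - every name below is part
of the not-yet-posted proposal, so treat all of it as an assumption:

enum liveupdate_phase {
        LIVEUPDATE_PRE_BLACKOUT,        /* VMs still running */
        LIVEUPDATE_BLACKOUT,            /* VMs paused */
        LIVEUPDATE_SHUTDOWN,            /* runs instead of ->shutdown() */
};

/*
 * Proposed member of struct device_driver / struct bus_type:
 *      int (*liveupdate)(struct device *dev, enum liveupdate_phase phase);
 *
 * A driver could then stage its KHO writes across the phases:
 */
static int mydrv_liveupdate(struct device *dev, enum liveupdate_phase phase)
{
        switch (phase) {
        case LIVEUPDATE_PRE_BLACKOUT:
                return mydrv_save_bulk_state(dev);  /* slow, VMs running */
        case LIVEUPDATE_BLACKOUT:
                return mydrv_save_hot_state(dev);   /* ring heads, etc. */
        case LIVEUPDATE_SHUTDOWN:
                return mydrv_quiesce_for_handover(dev);
        }
        return -EINVAL;
}
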
> Again, for now it's a kexec limitation that the kexec image cannot be
> altered between load and exec. Still, it's not clear to me how drivers
> could decide when they need to do the updates.

I will send the API proposal to the mailing list in a couple of weeks,
and we can also discuss it at one of David's meetings.

Pasha
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Jason Gunthorpe @ 2025-01-26 20:04 UTC
To: Pasha Tatashin
Cc: Mike Rapoport, David Rientjes, lsf-pc, Alexander Graf, Gowans,
James, linux-mm
On Sat, Jan 25, 2025 at 10:19:51AM -0500, Pasha Tatashin wrote:
> One way to solve that is pre-reserving space for the KHO tree -
> ideally a reasonable amount, perhaps 32-64 MB - and allocating it at
> kexec load time.

Why is there any weird limit? We are preserving hundreds of GB of pages
backing the VMs and more; there is endless memory being preserved across.

So why are we trying to shoehorn a bunch of KHO stuff into the DT?
Shouldn't the DT just have a small KHO info entry pointing to the real
KHO memory in normal pages?

Even if you want to re-use DT as some kind of serializing scheme in
drivers, the DT framework can let each driver build its own tree,
serialize it to its own memory, and then just link a pointer to that
tree.

Also, I'm not sure forcing the use of DT as a serializing scheme is a
great idea. It is complicated and doesn't do that much to solve the
complex versioning problem drivers face here.

Jason
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Pasha Tatashin @ 2025-01-26 20:41 UTC
To: Jason Gunthorpe
Cc: Mike Rapoport, David Rientjes, lsf-pc, Alexander Graf, Gowans,
James, linux-mm
On Sun, Jan 26, 2025 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sat, Jan 25, 2025 at 10:19:51AM -0500, Pasha Tatashin wrote:
>
> > One way to solve that is pre-reserving space for the KHO tree -
> > ideally a reasonable amount, perhaps 32-64 MB - and allocating it at
> > kexec load time.
>
> Why is there any weird limit?

Setting a limit for KHO trees is similar to the limit we set for the
scratch area; we can overrun both. It is just one simple way to ensure
serialization is possible after kexec load, but there are obviously
other ways to solve this problem.

> We are preserving hundreds of GB of pages
> backing the VMs and more; there is endless memory being preserved across.

There are other ways to do that, but even with this limit, I do not
see this as an issue. The gigabytes of pages backing VMs would not be
scattered as individual 4K pages; that's simply inefficient. The
number of physical ranges is going to be small. If the preserved data
is so large that it cannot fit into a reasonably sized tree, then I
claim that the data should not be saved directly in the tree. Instead,
it should have its own metadata that is pointed to from the tree.

Alternatively, we could allow allocating the FDT tree during kernel
shutdown. At that time there should be plenty of free memory, as we
have already finished with userland. However, we have to be careful to
allocate from memory that does not overlap the area where the kernel
segments and initramfs are going to be relocated.

> So why are we trying to shoehorn a bunch of KHO stuff into the DT?
> Shouldn't the DT just have a small KHO info entry pointing to the real
> KHO memory in normal pages?

Yes, for entities like file systems, there absolutely should be a
small KHO info entry pointing to metadata pages that preserve the
normal pages. However, for devices that are kept alive, most of the
data should be saved directly in the tree, unless there is a large,
sparse soft state that must be carried for some reason (i.e. network
flows or something similar).

> Even if you want to re-use DT as some kind of serializing scheme in
> drivers, the DT framework can let each driver build its own tree,
> serialize it to its own memory, and then just link a pointer to that
> tree.
>
> Also, I'm not sure forcing the use of DT as a serializing scheme is a
> great idea. It is complicated and doesn't do that much to solve the
> complex versioning problem drivers face here.

The primary goal of the KHO device tree is to standardize the
live-update metadata that drivers preserve to maintain device
functionality across reboots. We will document this using the YAML
binding format, similar to our current approach for cold boot and
getting the device tree from firmware. Otherwise, we could just use
other methods such as PKRAM, where there is no inherent standardization
involved, but which allow serializing devices during absolutely any
phase of the reboot.

Pasha
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Alexander Graf @ 2025-01-27 0:21 UTC
To: Pasha Tatashin, Jason Gunthorpe
Cc: Mike Rapoport, David Rientjes, lsf-pc, Gowans, James, linux-mm
On 26.01.25 12:41, Pasha Tatashin wrote:
> On Sun, Jan 26, 2025 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> On Sat, Jan 25, 2025 at 10:19:51AM -0500, Pasha Tatashin wrote:
>>
>>> One way to solve that is pre-reserving space for the KHO tree -
>>> ideally a reasonable amount, perhaps 32-64 MB - and allocating it at
>>> kexec load time.
>> Why is there any weird limit?
> Setting a limit for KHO trees is similar to the limit we set for the
> scratch area; we can overrun both. It is just one simple way to ensure
> serialization is possible after kexec load, but there are obviously
> other ways to solve this problem.

The problem is not only with allocation. Kexec has two schemes: user
space based and kernel based file loading. In the latter, we can do
whatever we like. In the former, the flow expects that user space has
ultimate control over placement of the future data blobs and their
contents.

I like the flexibility this allows. It means that user space can
inject its own KHO data, for example, if it wants to. Or modify it.
This will come in very handy for debugging and testing later.

>> We are preserving hundreds of GB of pages
>> backing the VMs and more; there is endless memory being preserved across.
> There are other ways to do that, but even with this limit, I do not
> see this as an issue. The gigabytes of pages backing VMs would not be
> scattered as individual 4K pages; that's simply inefficient. The
> number of physical ranges is going to be small. If the preserved data
> is so large that it cannot fit into a reasonably sized tree, then I
> claim that the data should not be saved directly in the tree. Instead,
> it should have its own metadata that is pointed to from the tree.

Correct :). The way I think of the KHO DT is as a uniform way to
implement setup_data across kexec that is identical across all
architectures, enforces review and structure to ensure we keep
compatibility, and generalizes memory reservation.

The alternative we have today is hacks like IMA:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/uapi/asm/setup_data.h#n73

> Alternatively, we could allow allocating the FDT tree during kernel
> shutdown. At that time there should be plenty of free memory, as we
> have already finished with userland. However, we have to be careful to
> allocate from memory that does not overlap the area where the kernel
> segments and initramfs are going to be relocated.

Yes, this is easier said than done. In the user-space-driven kexec
path, user space is in control of memory locations. At least after the
first kexec iteration, these locations will overlap with the existing
Linux runtime environment, because both lie in the scratch region.
Only the purgatory moves everything to where it should be.

Maybe we could create a special kexec memory type that means "KHO DT"?

Alex
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Jason Gunthorpe @ 2025-01-27 13:05 UTC
To: Pasha Tatashin
Cc: Mike Rapoport, David Rientjes, lsf-pc, Alexander Graf, Gowans,
James, linux-mm
On Sun, Jan 26, 2025 at 03:41:11PM -0500, Pasha Tatashin wrote:
> number of physical ranges is going to be small. If the preserved data
> is so large that it cannot fit into a reasonably sized tree, then I
> claim that the data should not be saved directly in the tree. Instead,
> it should have its own metadata that is pointed to from the tree.
It sounds like if a driver needs more than a few hundred bytes it
should go this other way.
> Yes, for entities like file systems, there absolutely should be a
> small KHO info entry pointing to metadata pages that preserve the
> normal pages. However, for devices that are kept alive, most of the
> data should be saved directly in the tree,

This doesn't seem feasible for the NIC we are looking at. There will
be A LOT of data, and it doesn't make a lot of sense to significantly
involve the boot DT in this. I think the same will be true for iommu
as well.

I think you guys are leaning too much into simpler SW-based things
like ftrace as samples.

> > Also, I'm not sure forcing the use of DT as a serializing scheme is a
> > great idea. It is complicated and doesn't do that much to solve the
> > complex versioning problem drivers face here.
>
> The primary goal of the KHO device tree is to standardize the
> live-update metadata that drivers preserve to maintain device
> functionality across reboots.

Honestly, I think this will not be welcomed, or workable.

DT does not fully preserve compatibility; it is designed for a world
where, if you don't read the values serialized, there is no harm. If
you want to use DT you need to make it simple for drivers to address
this. But that does not describe KHO: if the predecessor kernel
serialized something and the successor doesn't understand it, that is
fatal.

I also think YAML and more formality is *way* too much process!

One of my big bug-a-boos about KHO is that it must *NOT* create a
downside for the majority of kernel users that don't/can't use it.
Meaning we don't mangle the normal driver paths, we don't impose
difficult ABI requirements on the driver design, and more.

> getting the device tree from firmware. Otherwise, we could just use
> other methods such as PKRAM, where there is no inherent standardization
> involved, but which allow serializing devices during absolutely any
> phase of the reboot.

This sounds like a better idea, at least as a starting point. Maybe
down the road some more formalism can be agreed on, but until we have
experience with how much pain KHO is going to cause, I think we should
go slow in upstream.

Jason
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Jason Gunthorpe @ 2025-01-27 13:15 UTC
To: Alexander Graf
Cc: Pasha Tatashin, Mike Rapoport, David Rientjes, lsf-pc, Gowans,
James, linux-mm
On Sun, Jan 26, 2025 at 04:21:05PM -0800, Alexander Graf wrote:
> Yes, this is easier said than done. In the user-space-driven kexec
> path, user space is in control of memory locations. At least after the
> first kexec iteration, these locations will overlap with the existing
> Linux runtime environment, because both lie in the scratch region.
> Only the purgatory moves everything to where it should be.

This just doesn't seem ideal to me. It makes sense for old-fashioned
kexec, but if you are committed to KHO, start earlier.

I would imagine a system that wants to do KHO having A/B chunks of
memory that are used to boot up the kernel, where the running kernel
keeps the successor kernel's chunk entirely as ZONE_MOVABLE.

When kexec time comes, the running kernel evacuates the successor
chunk, and the new kernel gets one of two reliable linear mappings to
work with. No complex purgatory, no copying; it's simple.

The next kernel then makes the prior kernel's chunk ZONE_MOVABLE and
the cycle repeats.

Why make it so complicated by using overlapping memory???

Jason
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Alexander Graf @ 2025-01-27 16:12 UTC
To: Jason Gunthorpe
Cc: Pasha Tatashin, Mike Rapoport, David Rientjes, lsf-pc, Gowans,
James, linux-mm
Hey Jason,
On 27.01.25 05:15, Jason Gunthorpe wrote:
> On Sun, Jan 26, 2025 at 04:21:05PM -0800, Alexander Graf wrote:
>
>> Yes, this is easier said than done. In the user-space-driven kexec
>> path, user space is in control of memory locations. At least after the
>> first kexec iteration, these locations will overlap with the existing
>> Linux runtime environment, because both lie in the scratch region.
>> Only the purgatory moves everything to where it should be.
> This just doesn't seem ideal to me. It makes sense for old-fashioned
> kexec, but if you are committed to KHO, start earlier.
>
> I would imagine a system that wants to do KHO having A/B chunks of
> memory that are used to boot up the kernel, where the running kernel
> keeps the successor kernel's chunk entirely as ZONE_MOVABLE.
>
> When kexec time comes, the running kernel evacuates the successor
> chunk, and the new kernel gets one of two reliable linear mappings to
> work with. No complex purgatory, no copying; it's simple.
>
> The next kernel then makes the prior kernel's chunk ZONE_MOVABLE and
> the cycle repeats.
>
> Why make it so complicated by using overlapping memory???

I agree with the simplifications you're proposing; not using the
purgatory would be a great property to have.

The reason why KHO doesn't do it yet is that I wanted to keep it simple
from the other end. The big problem with going A/B is that, if done the
simple way, you only map B as MOVABLE while running in A. That means A
could accidentally allocate persistent memory from A's memory region.
When A then switches to B, B can no longer make all of A MOVABLE.

So we need to ensure that *both* regions are MOVABLE, and that the
system is always fully aware of both.

Alex
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
From: Jason Gunthorpe @ 2025-01-28 14:04 UTC
To: Alexander Graf
Cc: Pasha Tatashin, Mike Rapoport, David Rientjes, lsf-pc, Gowans,
James, linux-mm
On Mon, Jan 27, 2025 at 08:12:37AM -0800, Alexander Graf wrote:
> I agree with the simplifications you're proposing; not using the
> purgatory would be a great property to have.
>
> The reason why KHO doesn't do it yet is that I wanted to keep it simple
> from the other end. The big problem with going A/B is that, if done the
> simple way, you only map B as MOVABLE while running in A. That means A
> could accidentally allocate persistent memory from A's memory region.
> When A then switches to B, B can no longer make all of A MOVABLE.

But you have this basic problem no matter what. kexec requires a
pretty big region of linear memory to boot a kernel into. Even with
purgatory and copying, you still have to ensure a free linear space
that has no KHO pages in it.

This seems impossible to really guarantee unless you have a special
KHO allocator that happens to guarantee available linear memory, or
you are doing tricks like we are discussing to use the normal allocator
to keep allocations out of some linear memory.

> So we need to ensure that *both* regions are MOVABLE, and that the
> system is always fully aware of both.

I imagined the kernel would boot with only the A or B area of memory
available during early boot, and then in later boot phases it would
set up the additional memory that has a mix of KHO and free pages.

This feels easier to do once the allocators are all fully started up -
i.e. you can deal with KHO pages by just allocating them. [*]

IOW, each A/B area should be large enough to complete a lot of boot and
would end up naturally containing GFP_KERNEL allocations during this
process, as it is the only memory available.

If you have a special KHO allocator (GFP_KHO?) then it can simply be
aware of this and avoid allocating from the A/B zone.

However, it would be much nicer to avoid having to mark possible KHO
allocations in code at the allocation point; this would be nicer:

p = alloc_pages(GFP_KERNEL)
// time passes
to_kho(p)

So I agree there is an appeal to somehow using the existing allocators
to stop handing out unmovable pages from the A/B region after some
point, so that no to_kho() will ever get a page that is in A/B.

Can you take a ZONE_NORMAL, use it for booting, and then switch it to
ZONE_MOVABLE, keeping all the unmovable memory? Something else?

* - For drivers I'm imagining that we can do:

p = alloc_pages(GFP_KERNEL|GFP_KHO|GFP_COMP, order);
to_kho(p);
// kexec
from_kho(p);
folio_put(p)
Meaning KHO has to preserve the folio, keep the KVA the same,
manage the refcount, and restore the GFP_COMP.
I think if you have this as the basic primitive you can build
everything else on top of it.
Jason