* [LSF/MM/BPF TOPIC] memory persistence over kexec
@ 2025-01-20  7:54 Mike Rapoport
  2025-01-20 14:14 ` Jason Gunthorpe
  2025-01-24 18:23 ` Andrey Ryabinin
  0 siblings, 2 replies; 17+ messages in thread

From: Mike Rapoport
Date: 2025-01-20 7:54 UTC
To: lsf-pc, Alexander Graf, Gowans, James
Cc: linux-mm, David Rientjes, Pasha Tatashin

Hi,

I'd like to discuss memory persistence across kexec.

Currently there is ongoing work on Kexec HandOver (KHO) [1] that allows
serialization and deserialization of kernel data, as well as preserving
arbitrary memory ranges across kexec.

In addition, KHO keeps physically contiguous memory regions that are
guaranteed not to contain any memory that KHO would preserve, but that
can still be used by the system. The kexec'ed kernel bootstraps itself
using those regions and marks all handed-over memory as in use. KHO
users can then recover their state from the preserved data. This
includes memory reservations, which the user can either discard or
claim.

KHO can be used as the base layer for the implementation of a
persistence-aware memory allocator and a persistent in-memory
filesystem.

Aside from a status update on KHO progress, there are a few topics I
would like to discuss:
* Is it feasible and desirable to enable KHO support in tmpfs and
  hugetlbfs?
* Or is it better to implement yet another in-memory filesystem
  dedicated to persistence?
* What is the best way to ensure that the memory we want to persist is
  not scattered all over the place?

[1] https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/

--
Sincerely yours,
Mike.
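As a rough illustration of the model described above - serialize state
before kexec, recover it after - a minimal sketch of what a KHO user
could look like. Every identifier below (the KHO_EVENT_* constants and
the foo_* helpers) is a hypothetical placeholder, not the API of the
posted patch series:

    /*
     * Hedged sketch of the KHO usage model; all names here are
     * hypothetical placeholders, not the API of the KHO patches.
     */
    #include <linux/notifier.h>
    #include <linux/errno.h>

    /* hypothetical events a KHO core could send to its users */
    #define KHO_EVENT_SERIALIZE 1   /* record state before kexec */
    #define KHO_EVENT_ABORT     2   /* kexec aborted, drop state  */

    int foo_recover_from_kho(void);  /* hypothetical helper */
    int foo_cold_init(void);         /* hypothetical helper */

    static int foo_kho_notify(struct notifier_block *nb,
                              unsigned long action, void *data)
    {
        switch (action) {
        case KHO_EVENT_SERIALIZE:
            /* describe our state and mark its pages as preserved */
            return NOTIFY_OK;
        case KHO_EVENT_ABORT:
            /* undo the preservation, return to normal operation */
            return NOTIFY_OK;
        }
        return NOTIFY_DONE;
    }

    /* registered with the KHO core via some register-notifier call */
    static struct notifier_block foo_kho_nb = {
        .notifier_call = foo_kho_notify,
    };

    static int __init foo_init(void)
    {
        /* after kexec: look for state the old kernel handed over,
         * fall back to a cold start when nothing was preserved */
        if (!foo_recover_from_kho())
            return 0;
        return foo_cold_init();
    }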
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-20  7:54 [LSF/MM/BPF TOPIC] memory persistence over kexec Mike Rapoport
@ 2025-01-20 14:14 ` Jason Gunthorpe
  2025-01-20 19:42   ` David Rientjes
  2025-01-24 11:30   ` Mike Rapoport
  1 sibling, 2 replies; 17+ messages in thread

From: Jason Gunthorpe
To: Mike Rapoport
Cc: lsf-pc, Alexander Graf, Gowans, James, linux-mm, David Rientjes, Pasha Tatashin

On Mon, Jan 20, 2025 at 09:54:15AM +0200, Mike Rapoport wrote:

> Aside from a status update on KHO progress, there are a few topics I
> would like to discuss:
> * Is it feasible and desirable to enable KHO support in tmpfs and
>   hugetlbfs?
> * Or is it better to implement yet another in-memory filesystem
>   dedicated to persistence?
> * What is the best way to ensure that the memory we want to persist
>   is not scattered all over the place?

There is a lot of talk about taking *drivers* and having them survive
kexec, meaning the driver has to put a lot of its state into KHO and
then get it back out again.

I've been hoping for a model where a driver can be told to "go to KHO"
and the KHO code can be largely contained in the driver and relegated
to recording the driver state. This implies the state may be
fragmented all over memory.

The other direction is that the driver has to start up in some special
KHO mode and KHO becomes invasive on all driver paths to use special
KHO allocations. This seems like a PITA.

You can see this difference just in the discussion around the iommu
serialization, where one idea was to have KHO be an integral (and
invasive!) part of the page table operations from time zero vs. some
later serialization at kexec time.

Regardless, I'm interested in this discussion to bring some
concreteness to how drivers work.

Jason
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-20 14:14 ` Jason Gunthorpe
@ 2025-01-20 19:42   ` David Rientjes
  2025-01-22 23:30     ` Pasha Tatashin
  2025-01-24 21:03     ` Zhu Yanjun
  1 sibling, 2 replies; 17+ messages in thread

From: David Rientjes
To: Jason Gunthorpe
Cc: Mike Rapoport, lsf-pc, Alexander Graf, Gowans, James, linux-mm, Pasha Tatashin

On Mon, 20 Jan 2025, Jason Gunthorpe wrote:

> On Mon, Jan 20, 2025 at 09:54:15AM +0200, Mike Rapoport wrote:
> > Aside from a status update on KHO progress, there are a few topics I
> > would like to discuss:
> > * Is it feasible and desirable to enable KHO support in tmpfs and
> >   hugetlbfs?

This is a very timely discussion: since the last Linux MM Alignment
Session on the topic, some use cases, at least for tmpfs, have emerged.
Not necessarily a requirement, but more out of convenience.

> > * Or is it better to implement yet another in-memory filesystem
> >   dedicated to persistence?
> > * What is the best way to ensure that the memory we want to persist
> >   is not scattered all over the place?
>
> There is a lot of talk about taking *drivers* and having them survive
> kexec, meaning the driver has to put a lot of its state into KHO and
> then get it back out again.
>
> I've been hoping for a model where a driver can be told to "go to
> KHO" and the KHO code can be largely contained in the driver and
> relegated to recording the driver state. This implies the state may
> be fragmented all over memory.

This sounds fantastic if it is doable!

> The other direction is that the driver has to start up in some
> special KHO mode and KHO becomes invasive on all driver paths to use
> special KHO allocations. This seems like a PITA.
>
> You can see this difference just in the discussion around the iommu
> serialization, where one idea was to have KHO be an integral (and
> invasive!) part of the page table operations from time zero vs. some
> later serialization at kexec time.
>
> Regardless, I'm interested in this discussion to bring some
> concreteness to how drivers work.

+1, I'm also interested in this discussion.

As previously mentioned [1], we'll also start a biweekly meeting on
hypervisor live update to accelerate progress. The first instance of
that meeting will be next week, Monday, January 27 at 8am PST (UTC-8).
Calendar invites will go out later today for everybody on that email
thread; if anybody else is interested in attending on a regular basis,
please email me. Hoping this can be leveraged as well to build up to
LSF/MM/BPF.
[1] https://lore.kernel.org/kexec/2908e4ab-abc4-ddd0-b191-fe820856cfb4@google.com/T/#u
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-20 19:42   ` David Rientjes
@ 2025-01-22 23:30     ` Pasha Tatashin
  2025-01-25  9:53       ` Mike Rapoport
  1 sibling, 1 reply; 17+ messages in thread

From: Pasha Tatashin
To: David Rientjes
Cc: Jason Gunthorpe, Mike Rapoport, lsf-pc, Alexander Graf, Gowans, James, linux-mm

On Mon, Jan 20, 2025 at 2:42 PM David Rientjes <rientjes@google.com> wrote:
>
> As previously mentioned [1], we'll also start a biweekly meeting on
> hypervisor live update to accelerate progress. The first instance of
> that meeting will be next week, Monday, January 27 at 8am PST (UTC-8).
> Calendar invites will go out later today for everybody on that email
> thread; if anybody else is interested in attending on a regular
> basis, please email me. Hoping this can be leveraged as well to build
> up to LSF/MM/BPF.
>
> [1] https://lore.kernel.org/kexec/2908e4ab-abc4-ddd0-b191-fe820856cfb4@google.com/T/#u

+1

Hi Mike,

I'm very interested in this topic and can contribute both to presenting
and to implementing changes upstream. We're planning on using KHO in
our kernel at Google, but there are some limitations for our use case
that I believe can be addressed.

Limitations:

1. Serialization callbacks are called by KHO serially during the
activation phase. In most cases different device drivers are
independent, so serialization could be parallelized.

2. Once the serialization callbacks are done, the device tree data
cannot be altered and drivers cannot add more data into the device tree
(except for limited modification, where a driver can remember the exact
node it created and modify some properties, but that is too limited).
This is bad because we have use cases where we need to save buffered
data (not memory locations) into the device tree at some late stage
before jumping to the new kernel.

3. KHO requires devices to be serialized before
kexec_file_load()/kexec_load(), which means that the load becomes part
of the VM blackout window. If KHO is used for hypervisor live update
scenarios, this is a very bad limitation.

4. KHO activation should not really be needed. There should be two
phases: the old KHO tree passed from the old kernel and, once it is
fully consumed, a new KHO tree that can be updated at any time by
drivers and that is going to be passed to the next kernel during the
next reboot (kexec, or firmware that is aware of KHO). Instead of
activation there should be a user-driven phase shift from the old tree
to the new tree; once that is done, drivers can start serializing at
will.

I believe all these limitations exist because KHO uses device trees in
FDT format during serialization. Instead, the KHO device tree in the
new kernel should be kept in a more relaxed format, such as a hash
table, until it is converted into FDT very late in the kernel shutdown
path, right before jumping to the next kernel.

As David mentioned, there is going to be a hypervisor live update
bi-weekly meeting where we can discuss this.

Pasha
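To make the "relaxed format until late" idea in limitation 4 concrete,
a minimal sketch of what it argues for. The kho_* structures and
helpers are invented here purely for illustration; the fdt_* calls are
libfdt's real sequential-write API:

    /*
     * Sketch only: a relaxed in-memory KHO tree that drivers may
     * append to at any time, flattened into FDT once, late on the
     * shutdown path. The kho_* names are hypothetical.
     */
    #include <linux/libfdt.h>
    #include <linux/list.h>
    #include <linux/mutex.h>
    #include <linux/slab.h>

    struct kho_prop {
        struct list_head list;
        const char *name;
        const void *val;
        int len;
    };

    struct kho_node {
        struct list_head list;
        struct list_head props;
        const char *name;
    };

    static LIST_HEAD(kho_nodes);
    static DEFINE_MUTEX(kho_lock);

    /* callable from driver code at any point before shutdown */
    static struct kho_node *kho_add_node(const char *name)
    {
        struct kho_node *n = kzalloc(sizeof(*n), GFP_KERNEL);

        if (!n)
            return NULL;
        n->name = name;
        INIT_LIST_HEAD(&n->props);
        mutex_lock(&kho_lock);
        list_add_tail(&n->list, &kho_nodes);
        mutex_unlock(&kho_lock);
        return n;
    }

    static int kho_add_prop(struct kho_node *n, const char *name,
                            const void *val, int len)
    {
        struct kho_prop *p = kzalloc(sizeof(*p), GFP_KERNEL);

        if (!p)
            return -ENOMEM;
        p->name = name;
        p->val = val;
        p->len = len;
        list_add_tail(&p->list, &n->props);
        return 0;
    }

    /* called once, late on the kexec shutdown path */
    static int kho_flatten(void *buf, int size)
    {
        struct kho_node *n;
        struct kho_prop *p;
        int err;

        err = fdt_create(buf, size);
        if (!err)
            err = fdt_finish_reservemap(buf);
        if (!err)
            err = fdt_begin_node(buf, "");
        list_for_each_entry(n, &kho_nodes, list) {
            err = err ?: fdt_begin_node(buf, n->name);
            list_for_each_entry(p, &n->props, list)
                err = err ?: fdt_property(buf, p->name, p->val,
                                          p->len);
            err = err ?: fdt_end_node(buf);
        }
        err = err ?: fdt_end_node(buf);
        return err ?: fdt_finish(buf);
    }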
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-22 23:30     ` Pasha Tatashin
@ 2025-01-25  9:53       ` Mike Rapoport
  2025-01-25 15:19         ` Pasha Tatashin
  0 siblings, 1 reply; 17+ messages in thread

From: Mike Rapoport
To: Pasha Tatashin
Cc: David Rientjes, Jason Gunthorpe, lsf-pc, Alexander Graf, Gowans, James, linux-mm

Hi Pasha,

On Wed, Jan 22, 2025 at 06:30:22PM -0500, Pasha Tatashin wrote:

> 2. Once the serialization callbacks are done, the device tree data
> cannot be altered and drivers cannot add more data into the device
> tree [...]
> This is bad because we have use cases where we need to save buffered
> data (not memory locations) into the device tree at some late stage
> before jumping to the new kernel.

The device tree data cannot be altered because at kexec load time it is
appended to the kexec image, and that image cannot be altered without a
new kexec load.

> 3. KHO requires devices to be serialized before
> kexec_file_load()/kexec_load(), which means that the load becomes
> part of the VM blackout window. If KHO is used for hypervisor live
> update scenarios, this is a very bad limitation.

KHO data has to be a part of the kexec image, and the way kexec works
now there is no way to add anything to the kexec image after kexec
load.
To be able to serialize the state closer to the kexec reboot we'd need
to change the way kexec images are created, regardless of what data
format we use to pass the data between kernels.

> 4. KHO activation should not really be needed. [...] Instead of
> activation there should be a user-driven phase shift from the old
> tree to the new tree; once that is done, drivers can start
> serializing at will.

If I understand you correctly, it's up to the driver to decide when to
update the data that should be passed to the new kernel?
Again, for now it's a kexec limitation that the kexec image cannot be
altered between load and exec.
Still, it's not clear to me how drivers could decide when they need to
do the updates.

> As David mentioned, there is going to be a hypervisor live update
> bi-weekly meeting where we can discuss this.

Sure :)

> Pasha

--
Sincerely yours,
Mike.
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-25  9:53       ` Mike Rapoport
@ 2025-01-25 15:19         ` Pasha Tatashin
  2025-01-26 20:04           ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread

From: Pasha Tatashin
To: Mike Rapoport
Cc: David Rientjes, Jason Gunthorpe, lsf-pc, Alexander Graf, Gowans, James, linux-mm

On Sat, Jan 25, 2025 at 4:53 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> The device tree data cannot be altered because at kexec load time it
> is appended to the kexec image, and that image cannot be altered
> without a new kexec load.

Right, this is how it is implemented now.

One way to solve that is pre-reserving space for the KHO tree at kexec
load time - a reasonable amount, perhaps 32-64 MB. During shutdown, we
would use this pre-allocated space to convert the sparse KHO tree to
FDT format.

Performing the kexec load during a blackout period violates the
hypervisor's live update time requirements, and it also prevents
breaking serialization into phases: i.e. pre-blackout, during blackout,
during shutdown, etc. Furthermore, for performance reasons
serialization must be parallelizable for live updates, which the FDT
format does not support.

Since we can specify the KHO scratch space, which is the maximum amount
of memory needed for the next kernel, we can similarly specify the
maximum KHO tree size.

> > 3. KHO requires devices to be serialized before
> > kexec_file_load()/kexec_load(), which means that the load becomes
> > part of the VM blackout window. If KHO is used for hypervisor live
> > update scenarios, this is a very bad limitation.
>
> KHO data has to be a part of the kexec image, and the way kexec works
> now there is no way to add anything to the kexec image after kexec
> load.
> To be able to serialize the state closer to the kexec reboot we'd
> need to change the way kexec images are created, regardless of what
> data format we use to pass the data between kernels.
> > 4. KHO activation should not really be needed. There should be two
> > phases: the old KHO tree passed from the old kernel and, once it is
> > fully consumed, a new KHO tree that can be updated at any time by
> > drivers and that is going to be passed to the next kernel during
> > the next reboot (kexec, or firmware that is aware of KHO). Instead
> > of activation there should be a user-driven phase shift from the
> > old tree to the new tree; once that is done, drivers can start
> > serializing at will.
>
> If I understand you correctly, it's up to the driver to decide when
> to update the data that should be passed to the new kernel?

That is correct. I am planning to propose a
dev->{driver,bus}->liveupdate(dev, liveupdate_phase) callback, where
drivers can preserve state into KHO during the different phases of the
live update cycle: before blackout, during blackout, and during
shutdown. When implemented, and when we perform a live-update reboot,
this callback will be invoked instead of the shutdown() callback.

> Again, for now it's a kexec limitation that the kexec image cannot be
> altered between load and exec.
> Still, it's not clear to me how drivers could decide when they need
> to do the updates.

I will send the API proposal to the mailing list in a couple of weeks,
and we can also discuss it at one of David's meetings.

Pasha
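A sketch of how the callback described above might be rendered. The
phase names, the ->liveupdate member, and the foo_* helpers are
hypothetical stand-ins for whatever the actual proposal posts:

    /*
     * Hedged sketch of the proposed per-device live-update callback;
     * the enum values and the ->liveupdate member are hypothetical.
     */
    #include <linux/device.h>

    enum liveupdate_phase {
        LIVEUPDATE_PRE_BLACKOUT, /* VMs running, no degradation      */
        LIVEUPDATE_BLACKOUT,     /* VMs stopped, serialize live state */
        LIVEUPDATE_SHUTDOWN,     /* final fixups before the jump     */
    };

    /*
     * Imagined addition to struct device_driver (and struct bus_type):
     *
     *     int (*liveupdate)(struct device *dev,
     *                       enum liveupdate_phase phase);
     *
     * On a live-update reboot the driver core would invoke it per
     * phase instead of calling ->shutdown(). A driver might implement:
     */
    int foo_kho_prealloc(struct device *dev);      /* hypothetical */
    int foo_kho_serialize(struct device *dev);     /* hypothetical */
    int foo_kho_flush_buffers(struct device *dev); /* hypothetical */

    static int foo_liveupdate(struct device *dev,
                              enum liveupdate_phase phase)
    {
        switch (phase) {
        case LIVEUPDATE_PRE_BLACKOUT:
            /* reserve KHO space while VMs still run */
            return foo_kho_prealloc(dev);
        case LIVEUPDATE_BLACKOUT:
            /* record the now-quiescent device state */
            return foo_kho_serialize(dev);
        case LIVEUPDATE_SHUTDOWN:
            /* last chance to save buffered data */
            return foo_kho_flush_buffers(dev);
        }
        return 0;
    }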
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-25 15:19         ` Pasha Tatashin
@ 2025-01-26 20:04           ` Jason Gunthorpe
  2025-01-26 20:41             ` Pasha Tatashin
  0 siblings, 1 reply; 17+ messages in thread

From: Jason Gunthorpe
To: Pasha Tatashin
Cc: Mike Rapoport, David Rientjes, lsf-pc, Alexander Graf, Gowans, James, linux-mm

On Sat, Jan 25, 2025 at 10:19:51AM -0500, Pasha Tatashin wrote:

> One way to solve that is pre-reserving space for the KHO tree at
> kexec load time - a reasonable amount, perhaps 32-64 MB.

Why is there any weird limit? We are preserving hundreds of GB of pages
backing the VMs and more. There is endless memory being preserved
across the kexec.

So why are we trying to shoehorn a bunch of KHO stuff into the DT?
Shouldn't the DT just have a small KHO info entry pointing to the real
KHO memory in normal pages?

Even if you want to re-use DT as some kind of serializing scheme in
drivers, the DT framework can let each driver build its own tree,
serialize it to its own memory, and then just link a pointer to that
tree.

Also, I'm not sure forcing the use of DT as a serializing scheme is a
great idea. It is complicated and doesn't do that much to solve the
complex versioning problem drivers face here.

Jason
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-26 20:04           ` Jason Gunthorpe
@ 2025-01-26 20:41             ` Pasha Tatashin
  2025-01-27  0:21               ` Alexander Graf
  2025-01-27 13:05               ` Jason Gunthorpe
  1 sibling, 2 replies; 17+ messages in thread

From: Pasha Tatashin
To: Jason Gunthorpe
Cc: Mike Rapoport, David Rientjes, lsf-pc, Alexander Graf, Gowans, James, linux-mm

On Sun, Jan 26, 2025 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sat, Jan 25, 2025 at 10:19:51AM -0500, Pasha Tatashin wrote:
>
> > One way to solve that is pre-reserving space for the KHO tree at
> > kexec load time - a reasonable amount, perhaps 32-64 MB.
>
> Why is there any weird limit?

Setting a limit for the KHO tree is similar to the limit we set for the
scratch area; we can overrun both. It is just one simple way to ensure
serialization is possible after kexec load, but there are obviously
other ways to solve this problem.

> We are preserving hundreds of GB of pages backing the VMs and more.
> There is endless memory being preserved across the kexec.

There are other ways to do that, but even with this limit, I do not see
this as an issue. The gigabytes of pages backing VMs would not be
scattered as individual 4K pages; that's simply inefficient. The number
of physical ranges is going to be small. If the preserved data is so
large that it cannot fit into a reasonably sized tree, then I claim
that the data should not be saved directly in the tree. Instead, it
should have its own metadata that is pointed to from the tree.

Alternatively, we could allow allocating the FDT tree at kernel
shutdown time. At that point there should be plenty of free memory, as
we have already finished with userland. However, we have to be careful
to allocate from memory that does not overlap the area where the
kernel segments and initramfs are going to be relocated.

> So why are we trying to shoehorn a bunch of KHO stuff into the DT?
> Shouldn't the DT just have a small KHO info entry pointing to the
> real KHO memory in normal pages?

Yes, for entities like file systems there absolutely should be a small
KHO info entry pointing to the metadata pages that preserve the normal
pages. However, for devices that are kept alive, most of the data
should be saved directly in the tree, unless there is a large sparse
soft state that must be carried for some reason (i.e. network flows or
something similar).

> Even if you want to re-use DT as some kind of serializing scheme in
> drivers, the DT framework can let each driver build its own tree,
> serialize it to its own memory, and then just link a pointer to that
> tree.
>
> Also, I'm not sure forcing the use of DT as a serializing scheme is a
> great idea. It is complicated and doesn't do that much to solve the
> complex versioning problem drivers face here.

The primary goal of the KHO device tree is to standardize the
live-update metadata that drivers preserve to maintain device
functionality across reboots. We will document this using the YAML
binding format, similar to our current approach for cold boot, where
the device tree comes from firmware. Otherwise, we could just use other
methods such as PKRAM, where there is no inherent standardization
involved but devices can be serialized during absolutely any phase of
the reboot.

Pasha
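For illustration, the "small KHO info entry pointing to metadata pages"
pattern that both sides seem to agree on for bulk state might look
roughly like the sketch below. The fdt_* calls are libfdt's real API;
the node layout, descriptor, and names are invented:

    /*
     * Sketch of the "small pointer in the tree" pattern: bulk state
     * lives in driver-owned preserved pages; the KHO tree records
     * only where to find it. Node/property names are made up.
     */
    #include <linux/libfdt.h>
    #include <linux/types.h>
    #include <asm/byteorder.h>

    struct foo_kho_desc {
        __be64 phys;    /* first page of the serialized state */
        __be64 size;    /* bytes of serialized state          */
    };

    static int foo_link_state(void *fdt, phys_addr_t phys, u64 size)
    {
        struct foo_kho_desc desc = {
            .phys = cpu_to_be64(phys),
            .size = cpu_to_be64(size),
        };
        int err;

        err = fdt_begin_node(fdt, "foo-driver");
        if (!err)
            err = fdt_property_string(fdt, "compatible",
                                      "foo,kho-v1");
        if (!err)
            err = fdt_property(fdt, "state", &desc, sizeof(desc));
        return err ?: fdt_end_node(fdt);
    }

The tree entry stays a few dozen bytes no matter how large the
serialized state grows, which is what keeps the FDT size bounded.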
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-26 20:41             ` Pasha Tatashin
@ 2025-01-27  0:21               ` Alexander Graf
  2025-01-27 13:15                 ` Jason Gunthorpe
  1 sibling, 1 reply; 17+ messages in thread

From: Alexander Graf
To: Pasha Tatashin, Jason Gunthorpe
Cc: Mike Rapoport, David Rientjes, lsf-pc, Gowans, James, linux-mm

On 26.01.25 12:41, Pasha Tatashin wrote:
> On Sun, Jan 26, 2025 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> Why is there any weird limit?
> Setting a limit for the KHO tree is similar to the limit we set for
> the scratch area; we can overrun both. It is just one simple way to
> ensure serialization is possible after kexec load, but there are
> obviously other ways to solve this problem.

The problem is not only with allocation. Kexec has two schemes: user
space based loading and kernel based file loading. In the latter, we
can do whatever we like. In the former, the flow expects that user
space has ultimate control over the placement of the future data blobs
and their contents.

I like the flexibility this allows for. It means that user space can
inject its own KHO data, for example, if it wants to. Or modify it. It
will come in very handy for debugging and testing later.

>> We are preserving hundreds of GB of pages backing the VMs and more.
>> There is endless memory being preserved across the kexec.
> There are other ways to do that, but even with this limit, I do not
> see this as an issue. [...] If the preserved data is so large that it
> cannot fit into a reasonably sized tree, then I claim that the data
> should not be saved directly in the tree. Instead, it should have its
> own metadata that is pointed to from the tree.

Correct :). The way I think of the KHO DT is as a uniform way to
implement setup_data across kexec that is identical across all
architectures, enforces review and structure to ensure we keep
compatibility, and generalizes memory reservation. The alternative we
have today is hacks like IMA:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/uapi/asm/setup_data.h#n73

> Alternatively, we could allow allocating the FDT tree at kernel
> shutdown time. At that point there should be plenty of free memory,
> as we have already finished with userland. However, we have to be
> careful to allocate from memory that does not overlap the area where
> the kernel segments and initramfs are going to be relocated.

Yes, this is easier said than done. In the user space driven kexec
path, user space is in control of memory locations. At least after the
first kexec iteration, these locations will overlap with the existing
Linux runtime environment, because both lie in the scratch region.
Only the purgatory moves everything to where it should be.

Maybe we could create a special kexec memory type that means "KHO DT"?

Alex
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-27  0:21               ` Alexander Graf
@ 2025-01-27 13:15                 ` Jason Gunthorpe
  2025-01-27 16:12                   ` Alexander Graf
  0 siblings, 1 reply; 17+ messages in thread

From: Jason Gunthorpe
To: Alexander Graf
Cc: Pasha Tatashin, Mike Rapoport, David Rientjes, lsf-pc, Gowans, James, linux-mm

On Sun, Jan 26, 2025 at 04:21:05PM -0800, Alexander Graf wrote:

> Yes, this is easier said than done. In the user space driven kexec
> path, user space is in control of memory locations. At least after
> the first kexec iteration, these locations will overlap with the
> existing Linux runtime environment, because both lie in the scratch
> region. Only the purgatory moves everything to where it should be.

This just doesn't seem ideal to me. It makes sense for old-fashioned
kexec, but if you are committed to KHO, start earlier.

I would imagine a system that wants to do KHO to have A/B chunks of
memory that are used to boot up the kernel, and the running kernel
keeps the successor kernel's chunk entirely as ZONE_MOVABLE.

When kexec time comes, the running kernel evacuates the successor
chunk, and the new kernel gets one of two reliable linear mappings to
work with. No complex purgatory, no copying; it's simple.

The next kernel then makes the prior kernel's chunk ZONE_MOVABLE and
the cycle repeats.

Why make it so complicated by using overlapping memory???

Jason
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-27 13:15                 ` Jason Gunthorpe
@ 2025-01-27 16:12                   ` Alexander Graf
  2025-01-28 14:04                     ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread

From: Alexander Graf
To: Jason Gunthorpe
Cc: Pasha Tatashin, Mike Rapoport, David Rientjes, lsf-pc, Gowans, James, linux-mm

Hey Jason,

On 27.01.25 05:15, Jason Gunthorpe wrote:
> I would imagine a system that wants to do KHO to have A/B chunks of
> memory that are used to boot up the kernel, and the running kernel
> keeps the successor kernel's chunk entirely as ZONE_MOVABLE.
>
> When kexec time comes, the running kernel evacuates the successor
> chunk, and the new kernel gets one of two reliable linear mappings to
> work with. No complex purgatory, no copying; it's simple.
>
> The next kernel then makes the prior kernel's chunk ZONE_MOVABLE and
> the cycle repeats.
>
> Why make it so complicated by using overlapping memory???

I agree with the simplifications you're proposing; not using the
purgatory would be a great property to have.

The reason KHO doesn't do it yet is that I wanted to keep it simple
from the other end. The big problem with going A/B is that if done the
simple way, you only map B as MOVABLE while running in A. That means A
could accidentally allocate persistent memory from A's memory region.
When A then switches to B, B can no longer make all of A MOVABLE.

So we need to ensure that *both* regions are MOVABLE and that the
system is always fully aware of both.

Alex
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-27 16:12                   ` Alexander Graf
@ 2025-01-28 14:04                     ` Jason Gunthorpe
  0 siblings, 0 replies; 17+ messages in thread

From: Jason Gunthorpe
To: Alexander Graf
Cc: Pasha Tatashin, Mike Rapoport, David Rientjes, lsf-pc, Gowans, James, linux-mm

On Mon, Jan 27, 2025 at 08:12:37AM -0800, Alexander Graf wrote:

> I agree with the simplifications you're proposing; not using the
> purgatory would be a great property to have.
>
> The reason KHO doesn't do it yet is that I wanted to keep it simple
> from the other end. The big problem with going A/B is that if done
> the simple way, you only map B as MOVABLE while running in A. That
> means A could accidentally allocate persistent memory from A's memory
> region. When A then switches to B, B can no longer make all of A
> MOVABLE.

But you have this basic problem no matter what? kexec requires a
pretty big region of linear memory to boot a kernel into. Even with
purgatory and copying you still have to ensure a free linear space
that has no KHO pages in it.

This seems impossible to really guarantee unless you have a special
KHO allocator that happens to guarantee available linear memory, or
are doing tricks like we are discussing to use the normal allocator to
keep allocations out of some linear memory.

> So we need to ensure that *both* regions are MOVABLE and that the
> system is always fully aware of both.

I imagined the kernel would boot with only the A or B area of memory
available during early boot, and then in later boot phases it would
set up the additional memory that has a mix of KHO and free pages.
This feels easier to do once the allocators are all fully started up -
i.e. you can deal with KHO pages by just allocating them. [*]

IOW, each A/B area should be large enough to complete a lot of boot
and would end up naturally containing GFP_KERNEL allocations during
this process, as it is the only memory available.

If you have a special KHO allocator (GFP_KHO?) then it can simply be
aware of this and avoid allocating from the A/B zone. However, it
would be much nicer to avoid having to mark possible KHO allocations
in code at the allocation point; this would be nicer:

    p = alloc_pages(GFP_KERNEL)
    // time passes
    to_kho(p)

So I agree there is an appeal to somehow using the existing allocators
to stop taking unmovable pages from the A/B region after some point,
so that no to_kho() will ever get a page that is in A/B.

Can you take a ZONE_NORMAL, use it for booting, and then switch it to
ZONE_MOVABLE, keeping all the unmovable memory? Something else?

* - For drivers I'm imagining that we can do:

    p = alloc_pages(GFP_KERNEL | GFP_KHO | GFP_COMP, order);
    to_kho(p);
    // kexec
    from_kho(p);
    folio_put(p);

Meaning KHO has to preserve the folio, keep the KVA the same, manage
the refcount, and restore the GFP_COMP. I think if you have this as
the basic primitive you can build everything else on top of it.

Jason
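Taking the footnote's primitive at face value, a driver-side sketch of
the resulting lifecycle. to_kho(), from_kho(), GFP_KHO, and the handle
helper are the hypothetical names from the discussion above; none of
them exists today:

    /*
     * Driver-side view of the primitive sketched in the footnote;
     * to_kho(), from_kho() and the KHO handle are hypothetical.
     */
    #include <linux/gfp.h>
    #include <linux/mm.h>

    #define RING_ORDER 4           /* example: a 64K descriptor ring */

    int to_kho(struct page *p);                  /* hypothetical */
    struct page *from_kho(unsigned long handle); /* hypothetical */
    unsigned long ring_kho_handle(void);         /* hypothetical */

    static struct page *ring;

    static int ring_alloc(void)
    {
        /* ordinary allocation - nothing KHO-specific at alloc time */
        ring = alloc_pages(GFP_KERNEL | __GFP_COMP, RING_ORDER);
        return ring ? 0 : -ENOMEM;
    }

    static int ring_prepare_kexec(void)
    {
        /* late decision: this allocation must survive the kexec */
        return to_kho(ring);
    }

    static int ring_recover(void)
    {
        /* successor kernel: same KVA, same refcount, same order */
        ring = from_kho(ring_kho_handle());
        return ring ? 0 : -ENOENT;
    }

Note that nothing at the allocation site knows about KHO; that is the
property the message above argues for.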
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-26 20:41             ` Pasha Tatashin
@ 2025-01-27 13:05               ` Jason Gunthorpe
  1 sibling, 0 replies; 17+ messages in thread

From: Jason Gunthorpe
To: Pasha Tatashin
Cc: Mike Rapoport, David Rientjes, lsf-pc, Alexander Graf, Gowans, James, linux-mm

On Sun, Jan 26, 2025 at 03:41:11PM -0500, Pasha Tatashin wrote:

> number of physical ranges is going to be small. If the preserved data
> is so large that it cannot fit into a reasonably sized tree, then I
> claim that the data should not be saved directly in the tree.
> Instead, it should have its own metadata that is pointed to from the
> tree.

It sounds like if a driver needs more than a few hundred bytes it
should go this other way.

> Yes, for entities like file systems there absolutely should be a
> small KHO info entry pointing to the metadata pages that preserve the
> normal pages. However, for devices that are kept alive, most of the
> data should be saved directly in the tree,

This doesn't seem feasible for the NIC we are looking at. There will
be a LOT of data; it doesn't make a lot of sense to significantly
involve the boot DT in this. I think the same will be true for the
iommu as well.

I think you guys are leaning too much on simple SW-based things like
ftrace as examples.

> > Also, I'm not sure forcing the use of DT as a serializing scheme is
> > a great idea. It is complicated and doesn't do that much to solve
> > the complex versioning problem drivers face here.
>
> The primary goal of the KHO device tree is to standardize the
> live-update metadata that drivers preserve to maintain device
> functionality across reboots.

Honestly, I think this will not be welcomed, or workable. DT does not
fully preserve compatibility; it is designed for a world where, if you
don't read the serialized values, no harm is done. If you want to use
DT you need to make it simple for drivers to address this. But that
does not describe KHO: if the predecessor kernel serialized something
and the successor doesn't understand it, that is fatal.

I also think YAML and more formality is *way* too much process!

One of my big bug-a-boos about KHO is that it must *NOT* create a
downside for the majority of kernel users that don't/can't use it.
Meaning we don't mangle the normal driver paths, we don't impose
difficult ABI requirements on the driver design, and more.

> where the device tree comes from firmware. Otherwise, we could just
> use other methods such as PKRAM, where there is no inherent
> standardization involved but devices can be serialized during
> absolutely any phase of the reboot.

This sounds like a better idea, at least as a starting point. Maybe
down the road some more formalism can be agreed on, but until we have
experience with how much pain KHO is going to cause, I think we should
go slow in upstream.

Jason
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-20 19:42   ` David Rientjes
@ 2025-01-24 21:03     ` Zhu Yanjun
  0 siblings, 0 replies; 17+ messages in thread

From: Zhu Yanjun
To: David Rientjes, Jason Gunthorpe
Cc: Mike Rapoport, lsf-pc, Alexander Graf, Gowans, James, linux-mm, Pasha Tatashin

On 2025/1/20 20:42, David Rientjes wrote:

> +1, I'm also interested in this discussion.

+1. I hope I can join the meeting to listen to this presentation.

Zhu Yanjun

> As previously mentioned [1], we'll also start a biweekly meeting on
> hypervisor live update to accelerate progress. The first instance of
> that meeting will be next week, Monday, January 27 at 8am PST
> (UTC-8). Calendar invites will go out later today for everybody on
> that email thread; if anybody else is interested in attending on a
> regular basis, please email me.
> Hoping this can be leveraged as well to build up to LSF/MM/BPF.
>
> [1] https://lore.kernel.org/kexec/2908e4ab-abc4-ddd0-b191-fe820856cfb4@google.com/T/#u
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-20 14:14 ` Jason Gunthorpe
@ 2025-01-24 11:30   ` Mike Rapoport
  2025-01-24 14:56     ` Jason Gunthorpe
  1 sibling, 1 reply; 17+ messages in thread

From: Mike Rapoport
To: Jason Gunthorpe
Cc: lsf-pc, Alexander Graf, Gowans, James, linux-mm, David Rientjes, Pasha Tatashin

Hi Jason,

On Mon, Jan 20, 2025 at 10:14:27AM -0400, Jason Gunthorpe wrote:

> There is a lot of talk about taking *drivers* and having them survive
> kexec, meaning the driver has to put a lot of its state into KHO and
> then get it back out again.
>
> I've been hoping for a model where a driver can be told to "go to
> KHO" and the KHO code can be largely contained in the driver and
> relegated to recording the driver state. This implies the state may
> be fragmented all over memory.

I'm not sure I follow what you mean by "go to KHO" here.

I believe that the ftrace example in Alex's v3 of KHO
(https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com)
has enough meat to demonstrate the basic model. The driver passes the
state it wishes to preserve, and then, during initialization after
kexec, the driver can restore its state from the preserved one.

> The other direction is that the driver has to start up in some
> special KHO mode and KHO becomes invasive on all driver paths to use
> special KHO allocations. This seems like a PITA.
>
> You can see this difference just in the discussion around the iommu
> serialization, where one idea was to have KHO be an integral (and
> invasive!) part of the page table operations from time zero vs. some
> later serialization at kexec time.

I didn't follow that discussion closely, but there should still be a
step where the iommu driver tries to deserialize the data and uses it
if deserialization succeeds.

My understanding is that a major part of the complexity in the iommu
is the userspace-facing bits that need to be somehow connected to the
restored in-kernel structures after kexec.

> Regardless, I'm interested in this discussion to bring some
> concreteness to how drivers work.
>
> Jason

--
Sincerely yours,
Mike.
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-24 11:30   ` Mike Rapoport
@ 2025-01-24 14:56     ` Jason Gunthorpe
  0 siblings, 0 replies; 17+ messages in thread

From: Jason Gunthorpe
To: Mike Rapoport
Cc: lsf-pc, Alexander Graf, Gowans, James, linux-mm, David Rientjes, Pasha Tatashin

On Fri, Jan 24, 2025 at 01:30:52PM +0200, Mike Rapoport wrote:

> I'm not sure I follow what you mean by "go to KHO" here.

Drawing on our now extensive experience with PCI device live
migration, I imagine a state progression approximately like:

RUNNING     - minimal or no KHO involvement
PREPARE     - KHO stuff starts to get ready: preallocations, loading
              successor kernels, etc. No VM degradation.
PRE-STOP    - KHO gets serious, stuff starts to become unavailable,
              userspace needs to shut things down and get ready. Some
              level of VM degradation - i.e. changing IOMMU
              translations may block the VM until CONCLUDE.
STOP        - Now you've done it. KHO state is finalized - VMs stop
              running.
KEXEC       - Weee - VMs not running.
RESUME      - Get booted up, get ready to start up the VMs - VMs still
              stopped.
POST-RESUME - Start unpacking more stuff from KHO, userspace starts
              bringing back other stuff it may have shut down. Some
              level of VM degradation.
CONCLUDE    - Discard all the remaining KHO stuff. No VM degradation.
RUNNING     - minimal or no KHO involvement

Each of these states should inform drivers/etc. when we reach them,
and the KHO state that will survive the kexec evolves and extends as
it progresses.
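Rendered as code, the progression above might look like the sketch
below; none of these names exists in the kernel, it is just the state
list restated:

    /* Purely illustrative restatement of the state list above;
     * no such enum or API exists in the kernel today. */
    #include <linux/notifier.h>

    enum kho_stage {
        KHO_RUNNING,     /* minimal or no KHO involvement         */
        KHO_PREPARE,     /* preallocate, load successor kernel    */
        KHO_PRE_STOP,    /* start serializing; some degradation   */
        KHO_STOP,        /* KHO state finalized; VMs stopped      */
        KHO_KEXEC,       /* jump to the successor kernel          */
        KHO_RESUME,      /* successor recovers KHO state          */
        KHO_POST_RESUME, /* unpack; userspace restores the rest   */
        KHO_CONCLUDE,    /* discard remaining KHO state           */
    };

    /* each transition would notify interested drivers, e.g.: */
    int kho_set_stage(enum kho_stage stage);          /* hypothetical */
    int kho_register_stage_notifier(struct notifier_block *nb);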
So "go to KHO" would refer to a driver that is using PREPARE and PRE-STOP to start moving its functionality from normal memory to KHO preserved memory, possibly with some functional degradation. > I believe that ftrace example in Alex's v3 of KHO > (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com) > has enough meat to demonstrate the basic model. ftrace is just too simple to capture the full complexity of what a real HW device would need. We've now spent time thinking about what it would take to make a complex NIC survive kexec and I suggest the above model for how to approach it. > > The other direction is that the driver has to start up in some special > > KHO mode and KHO becomes invasive on all driver paths to use special > > KHO allocations. This seems like a PITA. > > > > You can see this difference just in the discussion around the iommu > > serialization where one idea was to have KHO be an integral (and > > invasive!) part of the page table operations from time zero vs some > > later serialization at kexec time. > > I didn't follow that discussion closely, but there still should be a step > when iommu driver would try to deserialize the data and use it if > deserialization succeeds. There were two options, one is that the iommu always lives in KHO, the other is that the iommu moves (ie go to KHO) into KHO. For instance asumming the latter, as you progress through the above state list: RUNNING - IOMMU page tables are in normal memory and normal IOMMU code is used to manipulate them PREPARE - We allocate an approximate amount of KHO memory needed to hold the page tables PRE-STOP - The page tables are copied into the KHO memory and frozen to be unchanging STOP - The IOMMU driver records to KHO which devices have KHO page tables RESUME - The IOMMU driver recovers the KHO page tables and hitlessly sets up the new HW lookup tables to use them POST-RESUME - The page tables are copied out of the KHO memory and back to normal memory where normal IOMMU algorithms can run them CONCLUDE - All the KHO memory is freed Compared to the first option, we'd somehow teach the IOMMU code to always use KHO for allocations, and KHO is somehow compatible and preserving the IOMMU's use of struct page metadata. Avoids the serializing copy, but you have to make invasive KHO changes to the existing IOMMU page table code. vs serialize which could be isolated to a KHO module that doesn't bother anyone else. [Also, I would prefer to see KHO updates to page table code after consolidating the iommu page table code in one place. Could use some help on that project too :) https://patch.msgid.link/r/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com ] > My understanding it that a major part of the complexity in iommu is the > userspace facing bits that need to be somehow connected to the restored in > kernel structures after kexec. Yes certainly this is hard too. I have yet to see a complete functional proposal for this. I have been feeling that KHO should have a way to preserve a driver file descriptor. Not a full descriptor, but something stripped back and simplified. Getting a descriptor through KHO, vs /dev/XXX would trigger special stuff like not FLRing VFIO PCI devices, not wrecking the IOMMU translation and so on. For instance for iommufd we may move the tables into KHO, destory all other iommufd objects, then transfer the stripped down iommufd FD to KHO. On resume the VMM would recover the KHO iommufd FD and rebuild the lost objects, then destroy the special KHO page table. 
The really tricky thing is that there is *a lot* of state in these
FDs; some we can imagine retaining, the rest will have to be rebuilt.
There are also a lot of kernel actions that don't happen at FD open
time.

Some kind of philosophy is needed here - what happens if the kernel
skips steps to preserve KHO, but userspace doesn't follow the KHO
flow? I.e. userspace opens /dev/vfio instead of the KHO version? The
/dev/vfio is pretty wrecked because of what KHO did. Does the kernel
have to fix it? Should the kernel forbid it? What happens if we KHO
and KHO again without userspace fixing everything?

So many questions :\

Jason
* Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
  2025-01-20  7:54 [LSF/MM/BPF TOPIC] memory persistence over kexec Mike Rapoport
@ 2025-01-24 18:23 ` Andrey Ryabinin
  1 sibling, 0 replies; 17+ messages in thread

From: Andrey Ryabinin
To: Mike Rapoport
Cc: lsf-pc, Alexander Graf, Gowans, James, linux-mm, David Rientjes, Pasha Tatashin, Jason Gunthorpe

On Mon, Jan 20, 2025 at 8:54 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi,
>
> I'd like to discuss memory persistence across kexec.

Hi,

I'm very interested in this topic as well; I'd like to join the club :)

> Currently there is ongoing work on Kexec HandOver (KHO) [1] that
> allows serialization and deserialization of kernel data, as well as
> preserving arbitrary memory ranges across kexec.

To be able to perform a live update of a hypervisor kernel with
running VMs that use VFIO devices, we would need to [de]serialize a
lot of different and complex state (PCI, IOMMU, VFIO, ...).

When I looked at KHO, I found that the process of describing data
using KHO is complicated and requires writing a lot of code that has
to be embedded deeply in subsystem code. So I think this might be a
blocker for applying KHO to VFIO device state, which is more
complicated than the ftrace buffers.

To address this particular issue I came up with the proof of concept
that I sent a few months ago:

https://lkml.kernel.org/r/20241002160722.20025-1-arbn@yandex-team.com

The idea behind it was inspired by QEMU's VMSTATE mechanism, which
solves a similar problem - describing and migrating device state
across different instances of QEMU. As an example, I chose to preserve
ftrace buffers as well, so it's easier to compare with the KHO
approach.

> Aside from a status update on KHO progress, there are a few topics I
> would like to discuss:
> * Is it feasible and desirable to enable KHO support in tmpfs and
>   hugetlbfs?
> * Or is it better to implement yet another in-memory filesystem
>   dedicated to persistence?

We would definitely need a framework to [de]serialize data. With that,
we should be able to preserve tmpfs/hugetlbfs (and it will probably be
easier than preserving some device state).

So yet another in-memory filesystem should come only as a solution to
some concrete problem, for example:
- serialization of tmpfs/hugetlbfs requires an unreasonable amount of
  memory (or time to process)
- the implementation ends up too complicated and fragile, so it's just
  better to have a separate dedicated fs
- whatever else comes up...
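As a rough illustration of the VMSTATE-style idea described above - a
declarative field table instead of hand-written serialization code - a
sketch with made-up names; the actual PoC linked above defines its own
macros:

    /* Hypothetical VMSTATE-like descriptors for kernel state; names
     * are illustrative and not taken from the PoC patch series. */
    #include <linux/types.h>
    #include <linux/stddef.h>

    struct state_field {
        const char *name;
        size_t offset;
        size_t size;
    };

    #define STATE_FIELD(type, member) {                   \
        .name   = #member,                                \
        .offset = offsetof(type, member),                 \
        .size   = sizeof_field(type, member),             \
    }

    struct trace_buf_state {
        u64 head;
        u64 tail;
        u64 pages_phys;
    };

    /* one declarative table replaces open-coded [de]serialization */
    static const struct state_field trace_buf_fields[] = {
        STATE_FIELD(struct trace_buf_state, head),
        STATE_FIELD(struct trace_buf_state, tail),
        STATE_FIELD(struct trace_buf_state, pages_phys),
    };

    /* a generic engine walks the same table in both directions */
    int state_save(const void *obj, const struct state_field *f,
                   size_t nr_fields, void *out);
    int state_load(void *obj, const struct state_field *f,
                   size_t nr_fields, const void *in);

The appeal of this shape is that the subsystem only declares what its
state looks like; the save/load engine, and any versioning policy, can
live in one place outside the subsystem.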