linux-mm.kvack.org archive mirror
* [Hypervisor Live Update] Notes from March 10, 2025
@ 2025-03-17  3:52 David Rientjes
  2025-03-17 17:22 ` Jason Gunthorpe
  0 siblings, 1 reply; 5+ messages in thread
From: David Rientjes @ 2025-03-17  3:52 UTC (permalink / raw)
  To: Alexander Graf, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Frank van der Linden, James Gowans, Jason Gunthorpe,
	Junaid Shahid, Matthew Wilcox, Mike Rapoport, Pankaj Gupta,
	Pasha Tatashin, Pratyush Yadav, Vipin Sharma, Vishal Annapurve,
	Woodhouse, David
  Cc: linux-mm, kexec

Hi everybody,

Here are the notes from the last Hypervisor Live Update call, held on
Monday, March 10.  Thanks to everybody who was involved!

These notes are intended to bring people up to speed who could not attend 
the call as well as keep the conversation going in between meetings.

----->o-----
Mike discussed taking the xarray-based proposal Jason made in response
to v4 and extending it a bit for memory reservations; it appeared to be
working correctly.  He is hoping to have the first implementation ready
this week.

Mike noted that the next KHO series was being prepared to be sent out
before LSF/MM/BPF, including device tree.

----->o-----
Mike noted that Pratyush found that the KHO scratch area does not work
well with swiotlb[1].  The scratch area is reserved before swiotlb is
initialized, and because swiotlb is still allocated from memblock, the
second kernel does not have enough low memory for it; the current
scratch areas are allocated in higher memory.  Mike posted a patch
series that splits meminit for all architectures, which should make
this easier to fix.  This will affect any driver that requires memory
in the first 4GB.

Alexander suggested allocating a scratch region in the low memory area.
Pratyush agreed this would work as a solution, although he wondered if
it would be possible to move swiotlb allocations to after the buddy
allocator is up.  Heuristics were discussed for determining how much
memory should be reserved in low memory for this.

Mike noted that for successive kexecs, there will be multiple scratch
area regions for each NUMA node.  For low memory, this would be sized
suitably for allocations that must originate below 4GB for DMA.  Mike
said a solution would still need to be developed for overlap with
preserved scratch memory, and Pratyush noted that this should be made
explicit by denying those reservations.

Pasha asked how drivers would know if reservations would be denied in
the first 4GB of memory.  Mike said an error code would be returned.
Pasha was specifically concerned about devices that want to preserve
memory because they know DMA will be ongoing during the reboot.  This
became a more general question: which devices should we support for
KHO and which should we not (what is considered too legacy)?  In the
meantime, Pratyush suggested explicit checks for this.
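
The explicit check Pratyush suggested might look roughly like the
following sketch.  The function name, error codes, and the fixed 4GB
boundary here are illustrative assumptions, not the actual KHO API:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define LOW_MEM_LIMIT (4ull << 30)  /* the sub-4GB DMA region discussed above */

/* Hypothetical entry point: deny preservation requests that fall in the
 * low memory kept free for scratch/swiotlb, returning an error code so
 * the driver knows its range cannot be kept across kexec. */
int kho_reserve_range(uint64_t phys, uint64_t len)
{
    if (phys + len < phys)
        return -EOVERFLOW;      /* range wraps around */
    if (phys < LOW_MEM_LIMIT)
        return -EINVAL;         /* overlaps reserved low memory */
    return 0;
}
```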

----->o-----
We shifted to talking about Pratyush's patch series supporting fdbox
for memfd[2].  Reaction to this was mixed: some feedback focused on the
use of miscdevice, and there were security concerns.  Pratyush noted
that there was no intent to propose this as a generic concept outside
KHO.

Pratyush noted there was currently no way to preserve folio orders in
KHO, and that there was also a need to preserve page flags.  He said it
would be possible to move away from miscdevice, perhaps toward VFS, but
he would need to look into this more.

Pasha asked about how the page flags were preserved.  Pratyush said there
was another property that would store them currently.

Pasha asked how cgroups would be handled, but there was no current
support for that.  Pratyush said the current RFC focused on anon memfd
and has not yet looked at hugetlb.  Pasha emphasized the importance of
focusing on one type of memory to start.

Pratyush noted in chat: "With FDBox work, I also realized that you can't
use FDT code from modules. Should not really be a problem since we can
export those symbols I suppose, but it doesn't work _currently_ at
least".

----->o-----
We discussed Andrey's recently sent patch series for KSTATE[3], now in
v2, which is closer to a formal submission than an RFC.  He noted his
concern with KHO was how hard it is to write serialization code.  His
goal was to give drivers the ability to migrate structs across kexec
in a way that could be more elegant (see struct kstate_description),
which he suggested would be more maintainable.  The approach had
previously been used for live migration in QEMU.

Andrey noted that each description would have a version field that
enables defining the minimal supported version for each driver.  He
made the connection between this and version handling in QEMU.  Pasha
asked how this solves the problem of memory becoming fragmented enough
that the next kernel cannot boot; Andrey noted the kexec would fail.
Andrey suggested allocating a big contiguous area so that the source
and destination ranges would be the same.

Mike noted that kstate_description definitions and the way drivers
declare their state to preserve are independent from scratch memory
reservation.  Andrey noted this wasn't a replacement for KHO but rather
could be built on top of KHO.

Mike suggested on top of KHO we have FDT, then what Pasha is proposing
for dynamic tree on top of that, then perhaps kstate on top of that.  He
would need to look more into kstate.

----->o-----
Mike asked if kstate descriptions depend on how the state is preserved
on the backend; an earlier version had a migration stream.  Andrey
suggested using FDT underneath, but said there is no strong dependency.

Pasha asked what architectures were supported today for kstate, Andrey
said x86.  Pasha suggested that anything that lands upstream should
likely support both x86 and ARM.

Chris Li asked about kstate descriptions and what happens when a struct
adds or removes a member.  Andrey said that to add a new member, you
bump the version number.  He showed an example from QEMU[4] that could
be used as a reference for this.  You could also add a new kstate
description with a new id; on downgrade it simply wouldn't be used,
preserving backward compatibility.
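
The version-gated compatibility Andrey described might look like this
userspace model; kstate is still under review, so the struct layout and
field names (version, min_version) are assumptions for illustration,
not the actual series:

```c
#include <assert.h>

/* Model of a kstate-style description: the saving kernel records the
 * version it serialized with, and the restoring kernel refuses blobs
 * older than the minimum it still understands (or newer than itself). */
struct kstate_desc {
    const char *name;         /* id used to match descriptions across kexec */
    unsigned int version;     /* bumped when a member is added or removed */
    unsigned int min_version; /* oldest serialized version we can restore */
};

/* Returns 1 if a blob serialized at saved_version can be restored by a
 * kernel whose current description is d. */
int kstate_compatible(const struct kstate_desc *d, unsigned int saved_version)
{
    return saved_version >= d->min_version && saved_version <= d->version;
}
```

On downgrade, a description with an unknown id would simply have no
match and be skipped, which is the backward-compatibility behavior
described above.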

Alexander suggested starting with the FDT logic because it already
exists, then serializing and de-serializing binary data using a UAPI.
Then we should discuss deprecating FDT if and when we have something
better; that won't be problematic unless we gain hundreds of users.  He
emphasized we should focus on how to easily and quickly preserve memory
across kexec, calling back into drivers to store their state at the
right time, and so on.  The data format used for serialization is a
tiny detail in comparison.  Pasha fully agreed with this.

----->o-----
Next meeting is PREEMPTED for LSF/MM/BPF 2025 in Montreal.  So the next 
meeting will be on Monday, April 7 at 8am PDT (UTC-7).  I'll send a 
reminder on this mailing list.

Topics I think we should cover in the next meeting:

 - debrief discussions at LSF/MM/BPF 2025
 - update on Mike's patch series for memory reservation
 - update on Pratyush's progress for allocating swiotlb in low memory
   regions and any additional support required based on device
   requirements (who needs this scratch support?)
 - discuss whether the fdbox support would obsolete the need for
   guestmemfs in the long term
 - alignment on memblock as the first use case for KHO to justify
   upstreaming, including ftrace use cases
 - discuss the Live Update Orchestrator (LUO) based on RFC patches sent
   by Pasha before then that help to define the state machine
 - discuss how KSTATE plays into KHO upstreaming and complementary or
   overlapping goals
 - decoupling 1GB pages for hugetlb, guest_memfd, and memfds and how fds
   can be added to an fdbox
 - iommufd patch series (as well as qemu) from James
 - establishing an API for callbacks into drivers to serialize state
   during brownout
 - topics proposed by Pasha: reducing blackout window, relaxed
   serialization, and KHO activation requirements
 - implications of preserving vIOMMU state
 - testing methodology for these components, including selftests

Please let me know if you'd like to propose additional topics for
discussion, thank you!

[1] https://lore.kernel.org/all/mafs0cyf4ii4k.fsf@kernel.org
[2] https://lore.kernel.org/lkml/20250307005830.65293-5-ptyadav@amazon.de/T/
[3] https://lore.kernel.org/linux-mm/20250310120318.2124-6-arbn@yandex-team.com/T/
[4] https://www.qemu.org/docs/master/devel/migration/main.html#vmstate


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Hypervisor Live Update] Notes from March 10, 2025
  2025-03-17  3:52 [Hypervisor Live Update] Notes from March 10, 2025 David Rientjes
@ 2025-03-17 17:22 ` Jason Gunthorpe
  2025-03-20  5:37   ` Pratyush Yadav
  0 siblings, 1 reply; 5+ messages in thread
From: Jason Gunthorpe @ 2025-03-17 17:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alexander Graf, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Frank van der Linden, James Gowans, Junaid Shahid,
	Matthew Wilcox, Mike Rapoport, Pankaj Gupta, Pasha Tatashin,
	Pratyush Yadav, Vipin Sharma, Vishal Annapurve, Woodhouse, David,
	linux-mm, kexec

On Sun, Mar 16, 2025 at 08:52:43PM -0700, David Rientjes wrote:

> Pasha asked how drivers would know if reservations would be denied in the
> first 4GB of memory.  Mike said an error code would be returned.  Pasha
> was specific about devices that wanted to preserve the memory because
> they knew DMA would be on-going during the reboot.  This became a more
> general question: what devices should we support for KHO and what should
> we not (what is considered too legacy?).  In the meantime, Pratyush
> suggested explicit checks for this.

IMHO "supports DMA to all memory" and "doesn't require swiotlb" seem
like reasonable starting points for device capability.

If you can't start swiotlb later, after the buddy allocator, maybe
just make it a Kconfig conflict with KHO?

You could probably also preserve the swiotlb memory across the kexec
to bootstrap it, but why? It should never be used on modern HW..

> Pratyush noted there was no way to preserve folio orders in KHO and he
> also noted there was a need for page flags.

I think the xarray idea will preserve folio orders, that was a big
point of it.
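
A rough userspace model of how order can travel with a preserved entry
(the real series' encoding is not shown here and may differ; only the
idea that the folio order is packed alongside the pfn reflects the
discussion):

```c
#include <assert.h>
#include <stdint.h>

/* Toy encoding: starting pfn in the high bits, folio order in the low
 * six bits, so the next kernel can rebuild folios at the right size. */
#define ORDER_BITS 6
#define ORDER_MASK ((1ull << ORDER_BITS) - 1)

uint64_t encode_entry(uint64_t pfn, unsigned int order)
{
    return (pfn << ORDER_BITS) | (order & ORDER_MASK);
}

uint64_t entry_pfn(uint64_t e)       { return e >> ORDER_BITS; }
unsigned int entry_order(uint64_t e) { return (unsigned int)(e & ORDER_MASK); }
```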

Not clear why we'd need to preserve page flags.  The same page flags
may not even exist in the new kernel.  The new kernel should set the
page flags correctly based on what it is doing.  Shouldn't, say, memfd
know exactly what its page flags should be in the new kernel when
adopting the memory?

> Pasha asked how cgroups would be handled, but there was no current
> support for that.  Pratyush said the current RFC focused on anon memfd
> and has not yet looked at hugetlb.  Pasha emphasized the importance of
> focusing on one type of memory to start.

I'd say userspace should deal with this.  It should de-serialize the FD
within the context of the cgroup it wants to charge that FD to, and
the de-serializing process should charge that cgroup's accounting with
whatever is restored inside the FD.

Is that possible?

> Andrey had recently sent another patch series for KSTATE[3] that was
> discussed, now in v2, which was closer to being a formal submission
> rather than an RFC.  He noted his concern with KHO was how hard it was to
> write serialization code.  His goal was to give drivers the ability to
> migrate structs across kexec which could be more elegant (see the
> struct kstate_description).  

I think we will have structs; I think most things will be structs.  KHO
just gives a small FDT area to keep track of the top-level struct
pointers in a more understandable and auditable way.

> Mike suggested on top of KHO we have FDT, then what Pasha is proposing
> for dynamic tree on top of that, then perhaps kstate on top of that.  He
> would need to look more into kstate.

This makes sense to me. I think the FDT should be the first layer, and
then we can go item by item and decide the best serialization to use,
always starting from the FDT.

At worst the FDT is just an expensive way to store a pointer to a
struct.

If we determine at the end that the FDT is predominantly used for very
simple struct things then maybe it gets replaced..

Regards,
Jason



* Re: [Hypervisor Live Update] Notes from March 10, 2025
  2025-03-17 17:22 ` Jason Gunthorpe
@ 2025-03-20  5:37   ` Pratyush Yadav
  2025-03-20 12:23     ` Jason Gunthorpe
  0 siblings, 1 reply; 5+ messages in thread
From: Pratyush Yadav @ 2025-03-20  5:37 UTC (permalink / raw)
  To: Jason Gunthorpe, David Rientjes
  Cc: Alexander Graf, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Frank van der Linden, James Gowans, Junaid Shahid,
	Matthew Wilcox, Mike Rapoport, Pankaj Gupta, Pasha Tatashin,
	Vipin Sharma, Vishal Annapurve, Woodhouse, David, linux-mm,
	kexec


Writing this from my phone so apologies in advance if it messes up formatting somewhere.

On Mon, Mar 17, 2025, at 9:22 PM, Jason Gunthorpe wrote:
> On Sun, Mar 16, 2025 at 08:52:43PM -0700, David Rientjes wrote:
[...]
>> Pratyush noted there was no way to preserve folio orders in KHO and he
>> also noted there was a need for page flags.
>
> I think the xarray idea will preserve folio orders, that was a big
> point of it.
>
> Not clear why we'd need to preserve page flags. The same page flags
> may not even exist in the new kernel? New kernel should set the page
> flags correctly based on what it is doing. Shouldn't, say, memfd know
> exactly what its page flags should be in the new kernel when adopting
> the memory?

I didn't mean the exact flags value, but the ability to have per-folio
flags.  The exact bits and their meaning would of course need to be
part of the ABI.  Shmem uses the dirty and uptodate flags to track some
state on the folios, and the flags can affect its behavior (lazily
zeroing out falloc-ed pages, for example).  I am assuming other FD
types or drivers might also want to store per-folio information.
Having the KHO core provide this facility can avoid duplicating the
logic in each subsystem.

That said, I don't think this is a blocking feature that needs to be
present from the get-go.  I would be happy if it is, since that would
make the shmem flag tracking easy, but for now I can have a separate
property to track this.

>
>> Pasha asked how cgroups would be handled, but there was no current
>> support for that.  Pratyush said the current RFC focused on anon memfd
>> and has not yet looked at hugetlb.  Pasha emphasized the importance of
>> focusing on one type of memory to start.
>
> I'd say userspace should deal with this. It should de-serialize the FD
> within the context of the cgroup it wants to charge that FD to, and
> the de-serializing process should charge that cgroup's accounting with
> whatever is restored inside the FD.
>
> Is that possible?

For FDBox, it is certainly possible.  In the current patch version,
deserialization happens on boot so it can't be done, but in later
versions I want to give userspace control over when to deserialize.  So
whichever context triggers it gets charged.
[...]

-- 
Regards,
Pratyush Yadav



* Re: [Hypervisor Live Update] Notes from March 10, 2025
  2025-03-20  5:37   ` Pratyush Yadav
@ 2025-03-20 12:23     ` Jason Gunthorpe
  2025-03-26 16:18       ` Pratyush Yadav
  0 siblings, 1 reply; 5+ messages in thread
From: Jason Gunthorpe @ 2025-03-20 12:23 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: David Rientjes, Alexander Graf, Anthony Yznaga, Dave Hansen,
	David Hildenbrand, Frank van der Linden, James Gowans,
	Junaid Shahid, Matthew Wilcox, Mike Rapoport, Pankaj Gupta,
	Pasha Tatashin, Vipin Sharma, Vishal Annapurve, Woodhouse, David,
	linux-mm, kexec


> I didn't mean the exact flags value, but the ability to have
> per-folio flags. The exact bits and their meaning would of course
> need to be part of the ABI. Shmem uses the dirty and uptodate flags
> to track some state on the folios, and the flags can affect its
> behavior (lazily zeroing out falloc-ed pages for example). I am
> assuming other FD types or drivers might also want to store
> per-folio information. Having KHO core provide this facility can
> avoid duplicating the logic in each subsystem.

For something simple like shmem I'd probably just suggest a sidecar
bitmap array or something?

The trouble with trying to feed flags through the xarray thing is that
the memory holding that pfn data across the kexec is not itself
preserved memory so it is all blown away once the allocator starts.

Any data that needs to be preserved further has to be copied into the
frozen struct page, which is pretty limiting in terms of what you
could preserve.  A few bits could maybe work out, but not a lot of data.
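
A sidecar bitmap along those lines could be as small as two bits per
folio in a separately preserved allocation.  This is only a userspace
model of the idea; the flag choice (dirty, uptodate) follows the shmem
discussion above, but the layout is an assumption:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_FOLIO 2   /* dirty + uptodate, per the shmem discussion */
#define FLAG_DIRTY     0
#define FLAG_UPTODATE  1

/* One byte holds flags for four folios; a real implementation would
 * size this region up front and preserve it through KHO alongside the
 * memfd contents. */
void folio_flag_set(uint8_t *map, size_t idx, int flag)
{
    size_t bit = idx * BITS_PER_FOLIO + flag;
    map[bit / 8] |= (uint8_t)(1u << (bit % 8));
}

int folio_flag_test(const uint8_t *map, size_t idx, int flag)
{
    size_t bit = idx * BITS_PER_FOLIO + flag;
    return (map[bit / 8] >> (bit % 8)) & 1;
}
```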

> For FDBox, it is certainly possible. In the current patch version,
> deserialization happens on boot so it can't be done, but in later
> versions I want to give userspace control on when to deserialize. So
> whichever context triggers that gets charged. 

Yeah, I think allowing userspace to sequence the deserialize is
important.

Jason



* Re: [Hypervisor Live Update] Notes from March 10, 2025
  2025-03-20 12:23     ` Jason Gunthorpe
@ 2025-03-26 16:18       ` Pratyush Yadav
  0 siblings, 0 replies; 5+ messages in thread
From: Pratyush Yadav @ 2025-03-26 16:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, David Rientjes, Alexander Graf, Anthony Yznaga,
	Dave Hansen, David Hildenbrand, Frank van der Linden,
	James Gowans, Junaid Shahid, Matthew Wilcox, Mike Rapoport,
	Pankaj Gupta, Pasha Tatashin, Vipin Sharma, Vishal Annapurve,
	Woodhouse, David, linux-mm, kexec

On Thu, Mar 20 2025, Jason Gunthorpe wrote:

>> I didn't mean the exact flags value, but the ability to have
>> per-folio flags. The exact bits and their meaning would of course
>> need to be part of the ABI. Shmem uses the dirty and uptodate flags
>> to track some state on the folios, and the flags can affect its
>> behavior (lazily zeroing out falloc-ed pages for example). I am
>> assuming other FD types or drivers might also want to store
>> per-folio information. Having KHO core provide this facility can
>> avoid duplicating the logic in each subsystem.
>
> For something simple like shmem I'd probably just suggest a side car bitmap
> array or something?
>
> The trouble with trying to feed flags through the xarray thing is that
> the memory holding that pfn data across the kexec is not itself
> preserved memory so it is all blown away once the allocator starts.
>
> Any data that needs to be preserved further has to be copied into the
> frozen struct page, which is pretty limiting in terms of what you
> could preserve. A few bits could maybe work out but not a lot of data.

Right, that makes sense. I can live with a sidecar bitmap then.

[...]

-- 
Regards,
Pratyush Yadav


