From: Andrey Ryabinin <ryabinin.a.a@gmail.com>
To: Chris Li <chrisl@kernel.org>
Cc: linux-kernel@vger.kernel.org, Alexander Graf <graf@amazon.com>,
James Gowans <jgowans@amazon.com>,
Mike Rapoport <rppt@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>,
Eric Biederman <ebiederm@xmission.com>,
kexec@lists.infradead.org, Pratyush Yadav <ptyadav@amazon.de>,
Jason Gunthorpe <jgg@nvidia.com>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
David Rientjes <rientjes@google.com>
Subject: Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec
Date: Mon, 5 May 2025 16:35:57 +0200 [thread overview]
Message-ID: <b9a80cff-8315-4be9-9a01-3c618b00d3e9@gmail.com> (raw)
In-Reply-To: <CAF8kJuMmiBEoaAL=XYcj6Y1qfAVd=Q_s9iT0wi70LJYn6ht42w@mail.gmail.com>
On 4/29/25 1:01 AM, Chris Li wrote:
> Hi Andrey,
>
> I am working on the PCI portion of the live update and looking at
> using KSTATE as an alternative to the FDT. Here are some high level
> feedbacks.
>
Hi, thanks a lot.
> On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn@yandex-team.com> wrote:
>>
>> Main changes from v1 [1]:
>> - Get rid of abusing crashkernel and implent proper way to pass memory to new kernel
>> - Lots of misc cleanups/refactorings.
>>
>> kstate (kernel state) is a mechanism to describe internal some part of the
>> kernel state, save it into the memory and restore the state after kexec
>> in the new kernel.
>>
>> The end goal here and the main use case for this is to be able to
>> update host kernel under VMs with VFIO pass-through devices running
>> on that host. Since we are pretty far from that end goal yet, this
>> only establishes some basic infrastructure to describe and migrate complex
>> in-kernel states.
>>
>> The idea behind KSTATE resembles QEMU's migration framework [1], which
>> solves quite similar problem - migrate state of VM/emulated devices
>> across different versions of QEMU.
>>
>> This is an altenative to Kexec Hand Over (KHO [3]).
>>
>> So, why not KHO?
>>
>
> KHO does more than just serializing/unserializing. It also has scratch
> areas etc to allow safely performing early allocation without stepping
> on the preserved memory. I see KSTATE as an alternative to libFDT as
> ways of serializing the preserved memory. Not a replacement for KHO.
>
> With that, it would be great to see a KSTATE build on top of the
> current version of KHO. The V6 version of KHO uses a recursive FDT
> object. I see recursive FDT can map to the C struct description
> similar to the KSTATE field description nicely. However, that will
> require KSTATE to make some considerable changes to embrace the KHO
> v6. For example, the KSTATE uses one contiguous stream buffer and KHO
> V6 uses many recursive physical address object pointers for different
> objects. Maybe a KSTATE V3?
>
Yep, I'll take a look into combinig KSTATE with KHO.
....
>
>> a_kho_restore()
>> {
>> ...
>> a.i = fdt_getprop(fdt, offset, "i", &len);
>> if (!a.i || len != sizeof(a.i))
>> goto err
>> *a.p_ulong = fdt_getprop....
>> }
>>
>> Each driver/subsystem has to solve this problem in their own way.
>> Also if we use fdt properties for individual fields, that might be wastefull
>> in terms of used memory, as these properties use strings as keys.
>
> Right, I need to write a lot of boilerplate code to do the per
> property save/restore. I am not worried too much about memory usage. A
> lot of string keys are not much longer than 8 bytes. The memory saving
> convert to binary index is not huge. I actually would suggest adding
> the string version of the field name to the description table, so that
> we can dump the state in KSTATE just like the YAML FDT output for
> debugging purposes. It is a very useful feature of FDT to dump the
> current saving state into a human readable form. KSTATE can have the
> same feature added.
>
kstate_field already have string with name of the field:
#define KSTATE_BASE_TYPE(_f, _state, _type) { \
.name = (__stringify(_f)), \
Currently it's not used in code, but it's there for debug purposes
>> While with KSTATE solves the same problem in more elegant way, with this:
>> struct kstate_description a_state = {
>> .name = "a_struct",
>> .version_id = 1,
>> .id = KSTATE_TEST_ID,
>> .state_list = LIST_HEAD_INIT(test_state.state_list),
>> .fields = (const struct kstate_field[]) {
>> KSTATE_BASE_TYPE(i, struct a, int),
>> KSTATE_BASE_TYPE(s, struct a, char [10]),
>> KSTATE_POINTER(p_ulong, struct a),
>> KSTATE_PAGE(page, struct a),
>> KSTATE_END_OF_LIST()
>> },
>> };
>>
>>
>> {
>> static unsigned long ulong
>> static struct a a_data = { .p_ulong = &ulong };
>>
>> kstate_register(&test_state, &a_data);
>> }
>>
>> The driver needs only to have a proper 'kstate_description' and call kstate_register()
>> to save/restore a_data.
>> Basically 'struct kstate_description' provides instructions how to save/restore 'struct a'.
>> And kstate_register() does all this save/restore stuff under the hood.
>
> It seems the KSTATE uses one contiguous stream and the object has to
> be loaded in the order it was saved. For the PCI code, the PCI device
> scanning and probing might cause the device load out of the order of
> saving. (The PCI probing is actually the reverse order of saving).
> This kstate_register() might pose restrictions on the restore order.
> PCI will need to look up and find the device state based on the PCI
> device ID. Other subsystems will likely have the requirement to look
> up their own saved state as well.
> I also see KSTATE can be extended to support that.
>
Absolutely agreed. I think we need to decouple restore and register, ie remove
restore_misgrate_state() from kstate_register(). Add instance_id argument to kstate_register(),
so the PCI code could do:
kstate_register(&pci_state, pdev, PCI_DEVID(pdev->bus->number, pdev->devfn));
And on probing stage (probably in pci_device_add()) call
kstate_restore(&pci_state, dev, PCI_DEVID(bus->number, dev->devfn))
which would locate state for the device if any and restore it.
>> - Another bonus point - kstate can preserve migratable memory, which is required
>> to preserve guest memory
>>
>>
>> So now to the part how this works.
>>
>> State of kernel data (usually it's some struct) is described by the
>> 'struct kstate_description' containing the array of individual
>> fields descpriptions - 'struct kstate_field'. Each field
>> has set of bits in ->flags which instructs how to save/restore
>> a certain field of the struct. E.g.:
>> - KS_BASE_TYPE flag tells that field can be just copied by value,
>>
>> - KS_POINTER means that the struct member is a pointer to the actual
>> data, so it needs to be dereference before saving/restoring data
>> to/from kstate data steam.
>>
>> - KS_STRUCT - contains another struct, field->ksd must point to
>> another 'struct kstate_dscription'
>
> The field can't have both bits set for KS_BASE_TYPE and KS_STRUCT
> type, right? Some of these flag combinations do not make sense. This
> part might need more careful planning to keep it simple. Maybe some of
> the flags bits should be enum.
>
Yes, this needs more thought. Mutually exclusive flags could be moved in separate enum field.
Some may be not needed at all. e.g. instead of KS_STRUCT we could just check if (field->ksd != NULL)
>>
>> - KS_CUSTOM - Some non-trivial field that requires custom kstate_field->save()
>> ->restore() callbacks to save/restore data.
>>
>> - KS_ARRAY_OF_POINTER - array of pointers, the size of array determined by the
>> field->count() callback
>> - KS_ADDRESS - field is a pointer to either vmemmap area (struct page) or
>> linear address. Store offset
>
> I think we want to describe different stream types.
> For example the most simple stream container is just a contiguous
> buffer with start address and size.
> The more complex one might be and size then an array of page pointers,
> all those pointers add to up the new buffer which describe an saved
> KSTATE that is larger than 4K and spread into an array of pages. Those
> pages don't need to be contiguous. Such a page array buffer stores the
> KSTATE entry described by a separate description table.
>
Agreed, I had similar thoughts. But this complicates code, so I started with
something simple.
>>
>> - KS_END - special flag indicating the end of migration stream data.
>>
>> kstate_register() call accepts kstate_description along with an instance
>> of an object and registers it in the global 'states' list.
>>
>> During kexec reboot phase we go through the list of 'kstate_description's
>> and each instance of kstate_description forms the 'struct kstate_entry'
>> which save into the kstate's data stream.
>>
>> The 'kstate_entry' contains information like ID of kstate_description, version
>> of it, size of migration data and the data itself. The ->data is formed in
>> accordance to the kstate_field's of the corresponding kstate_description.
>
> The version for the kstate_description might not be enough. The
> version works if there is a linear history. Here we are likely to have
> different vendors add their own extension to the device state saving.
I think vendors can just declare separate kstate_description with different ID.
The only problem with this, is that kstate_description.id is integer, so it would be
a problem to allocate those without conflicts.
Perhaps change the id to string? So vendors can just add vendor prefix to ID.
> I suggest instead we save the old kernel's kstate_description table
> (once per description table as a recursive object) alongside the
> object physical address as well. The new kernel has their new version
> of the description table. It can compare between the old and new
> description tables and find out what fields need to be upgraded or
> downgraded. The new kernel will use the old kstate_description table
> to decode the previous kernel's saved object.
Hmm.. I'm not sure, there is a lot to think about.
This might make changes in kstate_description painful,
e.g. if I want to rearrange some ->flags for whatever reason.
So how to deal with changes in kstate_description itself?
How do we save links to methods in kstate_field (->restore/->save/->count),
and what if we'll need to change function prototypes of these methods ?
> I think that way it is> more flexible to support adding one or more features and not tight to
> which version has what feature. It can also make sure the new kernel
> can always dump the old KSTATE into YAML.
>
> That way we might be able to simplify the subsection and the
> depreciation flags. The new kernel doesn't need to carry the history
> of changes made to the old description table.
>
>> After the reboot, when the kstate_register() called it parses migration
>> stream, finds the appropriate 'kstate_entry' and restores the contents of
>> the object in accordance with kstate_description and ->fields.
>
> Again this restoring can happen in a different order when the PCI
> device scanning and probing order. The restoration might not happen in
> one single call chain. Material for V3?
Agreed.
> I am happy to work with you to get KSTATE working with the existing KHO effort.
>
Thanks for useful feedback, appreciated.
next prev parent reply other threads:[~2025-05-05 14:37 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-10 12:03 Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 1/7] kstate: Add kstate - a mechanism to describe and migrate " Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 2/7] kstate, kexec, x86: transfer kstate data " Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 3/7] kexec: exclude control pages from the destination addresses Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 4/7] kexec, kstate: delay loading of kexec segments Andrey Ryabinin
2025-03-11 11:31 ` kernel test robot
2025-03-11 12:25 ` kernel test robot
2025-03-10 12:03 ` [PATCH v2 5/7] x86, kstate: Add the ability to preserve memory pages across kexec Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 6/7] kexec, kstate: save kstate data before kexec'ing Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 7/7] kstate, test: add test module for testing kstate subsystem Andrey Ryabinin
2025-03-11 2:27 ` [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec Cong Wang
2025-03-11 12:19 ` Andrey Ryabinin
2025-04-28 23:01 ` Chris Li
2025-04-28 23:01 ` Chris Li
2025-05-05 14:35 ` Andrey Ryabinin [this message]
2025-05-07 6:11 ` Chris Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=b9a80cff-8315-4be9-9a01-3c618b00d3e9@gmail.com \
--to=ryabinin.a.a@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=bp@alien8.de \
--cc=chrisl@kernel.org \
--cc=dave.hansen@linux.intel.com \
--cc=ebiederm@xmission.com \
--cc=graf@amazon.com \
--cc=hpa@zytor.com \
--cc=jgg@nvidia.com \
--cc=jgowans@amazon.com \
--cc=kexec@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mingo@redhat.com \
--cc=pasha.tatashin@soleen.com \
--cc=ptyadav@amazon.de \
--cc=rientjes@google.com \
--cc=rppt@kernel.org \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox