Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Andrey Ryabinin <ryabinin.a.a@gmail.com>
To: Chris Li <chrisl@kernel.org>
Cc: linux-kernel@vger.kernel.org, Alexander Graf <graf@amazon.com>,
	James Gowans <jgowans@amazon.com>,
	Mike Rapoport <rppt@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>,
	Eric Biederman <ebiederm@xmission.com>,
	kexec@lists.infradead.org, Pratyush Yadav <ptyadav@amazon.de>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	David Rientjes <rientjes@google.com>
Subject: Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec
Date: Mon, 5 May 2025 16:35:57 +0200	[thread overview]
Message-ID: <b9a80cff-8315-4be9-9a01-3c618b00d3e9@gmail.com> (raw)
In-Reply-To: <CAF8kJuMmiBEoaAL=XYcj6Y1qfAVd=Q_s9iT0wi70LJYn6ht42w@mail.gmail.com>



On 4/29/25 1:01 AM, Chris Li wrote:
> Hi Andrey,
> 
> I am working on the PCI portion of the live update and looking at
> using KSTATE as an alternative to the FDT. Here are some high level
> feedbacks.
> 

Hi, thanks  a lot.

> On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn@yandex-team.com> wrote:
>>
>>  Main changes from v1 [1]:
>>   - Get rid of abusing crashkernel and implent proper way to pass memory to new kernel
>>   - Lots of misc cleanups/refactorings.
>>
>> kstate (kernel state) is a mechanism to describe internal some part of the
>> kernel state, save it into the memory and restore the state after kexec
>> in the new kernel.
>>
>> The end goal here and the main use case for this is to be able to
>> update host kernel under VMs with VFIO pass-through devices running
>> on that host. Since we are pretty far from that end goal yet, this
>> only establishes some basic infrastructure to describe and migrate complex
>> in-kernel states.
>>
>> The idea behind KSTATE resembles QEMU's migration framework [1], which
>> solves quite similar problem - migrate state of VM/emulated devices
>> across different versions of QEMU.
>>
>> This is an altenative to Kexec Hand Over (KHO [3]).
>>
>> So, why not KHO?
>>
> 
> KHO does more than just serializing/unserializing. It also has scratch
> areas etc to allow safely performing early allocation without stepping
> on the preserved memory. I see KSTATE as an alternative to libFDT as
> ways of serializing the preserved memory. Not a replacement for KHO.
> 
> With that, it would be great to see a KSTATE build on top of the
> current version of KHO. The V6 version of KHO uses a recursive FDT
> object. I see recursive FDT can map to the C struct description
> similar to the KSTATE field description nicely. However, that will
> require KSTATE to make some considerable changes to embrace the KHO
> v6. For example, the KSTATE uses one contiguous stream buffer and KHO
> V6 uses many recursive physical address object pointers for different
> objects.  Maybe a KSTATE V3?
> 

Yep, I'll take a look into combinig KSTATE with KHO.

....

> 
>>      a_kho_restore()
>>      {
>>              ...
>>              a.i = fdt_getprop(fdt, offset, "i", &len);
>>              if (!a.i || len != sizeof(a.i))
>>                 goto err
>>              *a.p_ulong = fdt_getprop....
>>      }
>>
>>     Each driver/subsystem has to solve this problem in their own way.
>>     Also if we use fdt properties for individual fields, that might be wastefull
>>     in terms of used memory, as these properties use strings as keys.
> 
> Right, I need to write a lot of boilerplate code to do the per
> property save/restore. I am not worried too much about memory usage. A
> lot of string keys are not much longer than 8 bytes. The memory saving
> convert to binary index is not huge. I actually would suggest adding
> the string version of the field name to the description table, so that
> we can dump the state in KSTATE just like the YAML FDT output for
> debugging purposes. It is a very useful feature of FDT to dump the
> current saving state into a human readable form. KSTATE can have the
> same feature added.
> 

kstate_field already have string with name of the field:

#define KSTATE_BASE_TYPE(_f, _state, _type) {		\
	.name = (__stringify(_f)),			\

Currently it's not used in code, but it's there for debug purposes 


>>    While with KSTATE solves the same problem in more elegant way, with this:
>>         struct kstate_description a_state = {
>>                 .name = "a_struct",
>>                 .version_id = 1,
>>                 .id = KSTATE_TEST_ID,
>>                 .state_list = LIST_HEAD_INIT(test_state.state_list),
>>                 .fields = (const struct kstate_field[]) {
>>                         KSTATE_BASE_TYPE(i, struct a, int),
>>                         KSTATE_BASE_TYPE(s, struct a, char [10]),
>>                         KSTATE_POINTER(p_ulong, struct a),
>>                         KSTATE_PAGE(page, struct a),
>>                         KSTATE_END_OF_LIST()
>>                 },
>>         };
>>
>>
>>         {
>>                 static unsigned long ulong
>>                 static struct a a_data = { .p_ulong = &ulong };
>>
>>                 kstate_register(&test_state, &a_data);
>>         }
>>
>>        The driver needs only to have a proper 'kstate_description' and call kstate_register()
>>        to save/restore a_data.
>>        Basically 'struct kstate_description' provides instructions how to save/restore 'struct a'.
>>        And kstate_register() does all this save/restore stuff under the hood.
> 
> It seems the KSTATE uses one contiguous stream and the object has to
> be loaded in the order it was saved. For the PCI code, the PCI device
> scanning and probing might cause the device load out of the order of
> saving. (The PCI probing is actually the reverse order of saving).
> This kstate_register() might pose restrictions on the restore order.
> PCI will need to look up and find the device state based on the PCI
> device ID. Other subsystems will likely have the requirement to look
> up their own saved state as well.
> I also see KSTATE can be extended to support that.
> 

Absolutely agreed. I think we need to decouple restore and register, ie remove
 restore_misgrate_state() from kstate_register(). Add instance_id argument to kstate_register(),
so the PCI code could do:
            kstate_register(&pci_state, pdev, PCI_DEVID(pdev->bus->number, pdev->devfn));

And on probing stage (probably in pci_device_add()) call
        kstate_restore(&pci_state, dev, PCI_DEVID(bus->number, dev->devfn))

which would locate state for the device if any and restore it.


>>  - Another bonus point - kstate can preserve migratable memory, which is required
>>     to preserve guest memory
>>
>>
>> So now to the part how this works.
>>
>> State of kernel data (usually it's some struct) is described by the
>> 'struct kstate_description' containing the array of individual
>> fields descpriptions - 'struct kstate_field'. Each field
>> has set of bits in ->flags which instructs how to save/restore
>> a certain field of the struct. E.g.:
>>   - KS_BASE_TYPE flag tells that field can be just copied by value,
>>
>>   - KS_POINTER means that the struct member is a pointer to the actual
>>      data, so it needs to be dereference before saving/restoring data
>>      to/from kstate data steam.
>>
>>   - KS_STRUCT - contains another struct,  field->ksd must point to
>>       another 'struct kstate_dscription'
> 
> The field can't have both bits set for KS_BASE_TYPE and KS_STRUCT
> type, right? Some of these flag combinations do not make sense. This
> part might need more careful planning to keep it simple. Maybe some of
> the flags bits should be enum.
> 

Yes, this needs more thought. Mutually exclusive flags could be moved in separate enum field.
 Some may be not needed at all. e.g. instead of KS_STRUCT we could just check if (field->ksd != NULL)


>>
>>   - KS_CUSTOM - Some non-trivial field that requires custom kstate_field->save()
>>                ->restore() callbacks to save/restore data.
>>
>>   - KS_ARRAY_OF_POINTER - array of pointers, the size of array determined by the
>>                          field->count() callback
>>   - KS_ADDRESS - field is a pointer to either vmemmap area (struct page) or
>>      linear address. Store offset
> 
> I think we want to describe different stream types.
> For example the most simple stream container is just a contiguous
> buffer with start address and size.
> The more complex one might be and size then an array of page pointers,
> all those pointers add to up the new buffer which describe an saved
> KSTATE that is larger than 4K and spread into an array of pages. Those
> pages don't need to be contiguous. Such a page array buffer stores the
> KSTATE entry described by a separate description table.
> 

Agreed, I had similar thoughts. But this complicates code, so I started with
something simple.

>>
>>   - KS_END - special flag indicating the end of migration stream data.
>>
>> kstate_register() call accepts kstate_description along with an instance
>> of an object and registers it in the global 'states' list.
>>
>> During kexec reboot phase we go through the list of 'kstate_description's
>> and each instance of kstate_description forms the 'struct kstate_entry'
>> which save into the kstate's data stream.
>>
>> The 'kstate_entry' contains information like ID of kstate_description, version
>> of it, size of migration data and the data itself. The ->data is formed in
>> accordance to the kstate_field's of the corresponding kstate_description.
> 
> The version for the kstate_description might not be enough. The
> version works if there is a linear history. Here we are likely to have
> different vendors add their own extension to the device state saving.

I think vendors can just declare separate kstate_description with different ID.
The only problem with this, is that kstate_description.id is integer, so it would be
a problem to allocate those without conflicts.
Perhaps change the id to string? So vendors can just add vendor prefix to ID.


> I suggest instead we save the old kernel's kstate_description table
> (once per description table as a recursive object)  alongside the
> object physical address as well. The new kernel has their new version
> of the description table. It can compare between the old and new
> description tables and find out what fields need to be upgraded or
> downgraded. The new kernel will use the old kstate_description table
> to decode the previous kernel's saved object.

Hmm.. I'm not sure, there is a lot to think about.
 This might make changes in kstate_description painful,
e.g. if I want to rearrange some ->flags for whatever reason.
So how to deal with changes in kstate_description itself?

How do we save links to methods in kstate_field (->restore/->save/->count),
and what if we'll need to change function prototypes of these methods ?


> I think that way it is> more flexible to support adding one or more features and not tight to
> which version has what feature. It can also make sure the new kernel
> can always dump the old KSTATE into YAML.
> 
> That way we might be able to simplify the subsection and the
> depreciation flags. The new kernel doesn't need to carry the history
> of changes made to the old description table.
> 
>> After the reboot, when the kstate_register() called it parses migration
>> stream, finds the appropriate 'kstate_entry' and restores the contents of
>> the object in accordance with kstate_description and ->fields.
> 
> Again this restoring can happen in a different order when the PCI
> device scanning and probing order. The restoration might not happen in
> one single call chain. Material for V3?

Agreed.

> I am happy to work with you to get KSTATE working with the existing KHO effort.
> 
Thanks for useful feedback, appreciated.

next prev parent reply	other threads:[~2025-05-05 14:37 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-10 12:03 Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 1/7] kstate: Add kstate - a mechanism to describe and migrate " Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 2/7] kstate, kexec, x86: transfer kstate data " Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 3/7] kexec: exclude control pages from the destination addresses Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 4/7] kexec, kstate: delay loading of kexec segments Andrey Ryabinin
2025-03-11 11:31   ` kernel test robot
2025-03-11 12:25   ` kernel test robot
2025-03-10 12:03 ` [PATCH v2 5/7] x86, kstate: Add the ability to preserve memory pages across kexec Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 6/7] kexec, kstate: save kstate data before kexec'ing Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 7/7] kstate, test: add test module for testing kstate subsystem Andrey Ryabinin
2025-03-11  2:27 ` [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec Cong Wang
2025-03-11 12:19   ` Andrey Ryabinin
2025-04-28 23:01     ` Chris Li
2025-04-28 23:01 ` Chris Li
2025-05-05 14:35   ` Andrey Ryabinin [this message]
2025-05-07  6:11     ` Chris Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b9a80cff-8315-4be9-9a01-3c618b00d3e9@gmail.com \
    --to=ryabinin.a.a@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=bp@alien8.de \
    --cc=chrisl@kernel.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=ebiederm@xmission.com \
    --cc=graf@amazon.com \
    --cc=hpa@zytor.com \
    --cc=jgg@nvidia.com \
    --cc=jgowans@amazon.com \
    --cc=kexec@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@redhat.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=ptyadav@amazon.de \
    --cc=rientjes@google.com \
    --cc=rppt@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox