* [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec
From: Andrey Ryabinin @ 2025-03-10 12:03 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Graf, James Gowans, Mike Rapoport, Andrew Morton,
linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H . Peter Anvin, Eric Biederman, kexec,
Pratyush Yadav, Jason Gunthorpe, Pasha Tatashin, David Rientjes,
Andrey Ryabinin
Main changes from v1 [1]:
- Get rid of abusing crashkernel and implement a proper way to pass memory to the new kernel
- Lots of misc cleanups/refactorings.
kstate (kernel state) is a mechanism to describe some part of the internal
kernel state, save it into memory and restore the state after kexec
in the new kernel.
The end goal here and the main use case for this is to be able to
update host kernel under VMs with VFIO pass-through devices running
on that host. Since we are pretty far from that end goal yet, this
only establishes some basic infrastructure to describe and migrate complex
in-kernel states.
The idea behind KSTATE resembles QEMU's migration framework [2], which
solves a quite similar problem - migrating the state of VM/emulated devices
across different versions of QEMU.
This is an alternative to Kexec Hand Over (KHO [3]).
So, why not KHO?
- The main reason is that KHO doesn't provide a simple and convenient internal
API for drivers/subsystems to preserve internal data.
E.g. let's say we have some variable of type 'struct a'
that needs to be preserved:
struct a {
int i;
unsigned long *p_ulong;
char s[10];
struct page *page;
};
The KHO way requires the driver/subsystem to carry a bunch of code
dealing with FDT stuff, something like
a_kho_write()
{
...
fdt_property(fdt, "i", &a.i, sizeof(a.i));
fdt_property(fdt, "ulong", a.p_ulong, sizeof(*a.p_ulong));
fdt_property(fdt, "s", &a.s, sizeof(a.s));
if (err)
...
}
a_kho_restore()
{
...
a.i = fdt_getprop(fdt, offset, "i", &len);
if (!a.i || len != sizeof(a.i))
goto err
*a.p_ulong = fdt_getprop....
}
Each driver/subsystem has to solve this problem in their own way.
Also, if we use fdt properties for individual fields, that might be wasteful
in terms of used memory, as these properties use strings as keys.
KSTATE solves the same problem in a more elegant way, with this:
struct kstate_description a_state = {
.name = "a_struct",
.version_id = 1,
.id = KSTATE_TEST_ID,
.state_list = LIST_HEAD_INIT(a_state.state_list),
.fields = (const struct kstate_field[]) {
KSTATE_BASE_TYPE(i, struct a, int),
KSTATE_BASE_TYPE(s, struct a, char [10]),
KSTATE_POINTER(p_ulong, struct a),
KSTATE_PAGE(page, struct a),
KSTATE_END_OF_LIST()
},
};
{
static unsigned long ulong;
static struct a a_data = { .p_ulong = &ulong };
kstate_register(&a_state, &a_data);
}
The driver only needs a proper 'kstate_description' and a call to kstate_register()
to save/restore a_data.
Basically 'struct kstate_description' provides instructions on how to save/restore 'struct a',
and kstate_register() does all this save/restore stuff under the hood.
- Another bonus point - kstate can preserve migratable memory, which is required
to preserve guest memory
Now, on to how this works.
State of kernel data (usually it's some struct) is described by the
'struct kstate_description' containing the array of individual
field descriptions - 'struct kstate_field'. Each field
has a set of bits in ->flags which instruct how to save/restore
a certain field of the struct. E.g.:
- KS_BASE_TYPE flag tells that the field can be just copied by value,
- KS_POINTER means that the struct member is a pointer to the actual
data, so it needs to be dereferenced before saving/restoring data
to/from the kstate data stream.
- KS_STRUCT - the field contains another struct, field->ksd must point to
another 'struct kstate_description'
- KS_CUSTOM - some non-trivial field that requires custom kstate_field->save()
->restore() callbacks to save/restore data.
- KS_ARRAY_OF_POINTER - array of pointers, the size of the array is determined
by the field->count() callback
- KS_ADDRESS - the field is a pointer to either the vmemmap area (struct page)
or a linear address. The offset from the start of the area is stored instead
of the pointer itself.
- KS_END - special flag indicating the end of the fields list.
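The flag-driven walk described above can be modelled in plain userspace C.
This is only a sketch under stated assumptions, not the kernel code: the
demo_* names are made up for illustration, only the by-value and pointer
cases are covered, and the destination buffer is a fixed-size array rather
than the growing kstate stream.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-ins for two of the kstate flags. */
enum demo_flags {
	DEMO_BASE_TYPE = 1 << 0,	/* copy the member by value */
	DEMO_POINTER   = 1 << 1,	/* member is a pointer: copy the pointee */
	DEMO_END       = 1 << 30,	/* terminates the field table */
};

struct demo_field {
	size_t offset;		/* offsetof() the member inside the object */
	size_t size;		/* number of bytes to copy */
	unsigned int flags;
};

struct demo_a {
	int i;
	unsigned long *p_ulong;
	char s[10];
};

static const struct demo_field demo_a_fields[] = {
	{ offsetof(struct demo_a, i), sizeof(int), DEMO_BASE_TYPE },
	{ offsetof(struct demo_a, p_ulong), sizeof(unsigned long), DEMO_POINTER },
	{ offsetof(struct demo_a, s), sizeof(char[10]), DEMO_BASE_TYPE },
	{ 0, 0, DEMO_END },
};

/* Resolve where the field's data lives: in the object or behind a pointer. */
static void *demo_field_data(void *obj, const struct demo_field *f)
{
	char *member = (char *)obj + f->offset;
	void *target;

	if (!(f->flags & DEMO_POINTER))
		return member;
	memcpy(&target, member, sizeof(target));	/* load the stored pointer */
	return target;
}

/* Walk the table and append each field to buf; returns bytes written. */
static size_t demo_save(const struct demo_field *f, void *obj,
			unsigned char *buf, size_t cap)
{
	size_t pos = 0;

	for (; f->flags != DEMO_END; f++) {
		assert(pos + f->size <= cap);
		memcpy(buf + pos, demo_field_data(obj, f), f->size);
		pos += f->size;
	}
	return pos;
}

/* Walk the same table and copy each field back; returns bytes consumed. */
static size_t demo_restore(const struct demo_field *f, void *obj,
			   const unsigned char *buf)
{
	size_t pos = 0;

	for (; f->flags != DEMO_END; f++) {
		memcpy(demo_field_data(obj, f), buf + pos, f->size);
		pos += f->size;
	}
	return pos;
}
```

Note that for the pointer case the restoring side must already have a valid
pointer in the object, since the data is copied to wherever it points.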
kstate_register() call accepts kstate_description along with an instance
of an object and registers it in the global 'states' list.
During the kexec reboot phase we go through the list of 'kstate_description's
and each instance of kstate_description forms a 'struct kstate_entry'
which is saved into the kstate's data stream.
The 'kstate_entry' contains information like the ID of the kstate_description,
its version, the size of the migration data and the data itself. The ->data is
formed in accordance with the kstate_field's of the corresponding
kstate_description.
After the reboot, when kstate_register() is called, it parses the migration
stream, finds the appropriate 'kstate_entry' and restores the contents of
the object in accordance with kstate_description and ->fields.
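To make the stream layout concrete, here is a rough userspace sketch of
header-plus-payload entries ended by a terminator ID, together with a lookup
by (state id, instance id). The demo_* names and fixed-width header fields
are assumptions for illustration. Unlike the lookup in the series, this
sketch stops at the first match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_LAST_ID (-1)	/* terminator, in the spirit of KSTATE_LAST_ID */

/* Hypothetical userspace mirror of the per-object stream header. */
struct demo_entry {
	int32_t state_id;
	int32_t version_id;
	int32_t instance_id;
	int32_t size;		/* payload bytes that follow the header */
};

/* Append one entry (header + payload) at offset pos; returns new offset. */
static size_t demo_put_entry(unsigned char *buf, size_t cap, size_t pos,
			     int32_t state_id, int32_t instance_id,
			     const void *data, int32_t size)
{
	struct demo_entry e = { state_id, 1, instance_id, size };

	assert(size >= 0 && pos + sizeof(e) + (size_t)size <= cap);
	memcpy(buf + pos, &e, sizeof(e));
	memcpy(buf + pos + sizeof(e), data, (size_t)size);
	return pos + sizeof(e) + (size_t)size;
}

/* Write the terminating header whose state_id is DEMO_LAST_ID. */
static size_t demo_put_end(unsigned char *buf, size_t cap, size_t pos)
{
	struct demo_entry e = { DEMO_LAST_ID, 0, 0, 0 };

	assert(pos + sizeof(e) <= cap);
	memcpy(buf + pos, &e, sizeof(e));
	return pos + sizeof(e);
}

/*
 * Scan for (state_id, instance_id). Returns the payload offset and stores
 * the payload size in *size, or returns -1 if the stream has no such entry.
 */
static long demo_find_entry(const unsigned char *buf, int32_t state_id,
			    int32_t instance_id, int32_t *size)
{
	size_t pos = 0;

	for (;;) {
		struct demo_entry e;

		memcpy(&e, buf + pos, sizeof(e));	/* avoids unaligned reads */
		if (e.state_id == DEMO_LAST_ID)
			return -1;
		if (e.state_id == state_id && e.instance_id == instance_id) {
			*size = e.size;
			return (long)(pos + sizeof(e));
		}
		pos += sizeof(e) + (size_t)e.size;	/* skip header + payload */
	}
}
```

Because every header records its payload size, a reader can skip entries it
does not recognise without understanding their contents.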
[1] https://lkml.kernel.org/r/20241002160722.20025-1-arbn@yandex-team.com
[2] https://www.qemu.org/docs/master/devel/migration/main.html#vmstate
[3] https://lkml.kernel.org/r/20250206132754.2596694-1-rppt@kernel.org
Andrey Ryabinin (7):
kstate: Add kstate - a mechanism to describe and migrate kernel state
across kexec
kstate, kexec, x86: transfer kstate data across kexec
kexec: exclude control pages from the destination addresses
kexec, kstate: delay loading of kexec segments
x86, kstate: Add the ability to preserve memory pages across kexec.
kexec, kstate: save kstate data before kexec'ing
kstate, test: add test module for testing kstate subsystem.
arch/x86/Kconfig | 1 +
arch/x86/kernel/kexec-bzimage64.c | 4 +
arch/x86/kernel/setup.c | 2 +
include/linux/kexec.h | 3 +
include/linux/kstate.h | 216 ++++++++++++++
kernel/Kconfig.kexec | 13 +
kernel/Makefile | 1 +
kernel/kexec_core.c | 30 ++
kernel/kexec_file.c | 159 +++++++----
kernel/kexec_internal.h | 9 +
kernel/kstate.c | 458 ++++++++++++++++++++++++++++++
lib/Makefile | 2 +
lib/test_kstate.c | 86 ++++++
13 files changed, 925 insertions(+), 59 deletions(-)
create mode 100644 include/linux/kstate.h
create mode 100644 kernel/kstate.c
create mode 100644 lib/test_kstate.c
--
2.45.3
* [PATCH v2 1/7] kstate: Add kstate - a mechanism to describe and migrate kernel state across kexec
From: Andrey Ryabinin @ 2025-03-10 12:03 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Graf, James Gowans, Mike Rapoport, Andrew Morton,
linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H . Peter Anvin, Eric Biederman, kexec,
Pratyush Yadav, Jason Gunthorpe, Pasha Tatashin, David Rientjes,
Andrey Ryabinin
KSTATE (kernel state) is a mechanism to describe internal kernel state,
save it into memory and restore the state after kexec in the new kernel.
The end goal here and the main use case for this is to be able to
update host kernel under VMs with VFIO pass-through devices running
on that host.
The idea behind KSTATE resembles QEMU's migration framework [1], which
solves a quite similar problem - migrating the state of VM/emulated devices
across different versions of QEMU.
This and following patches try to establish some basic infrastructure
to describe and migrate in-kernel data structures.
State of kernel data (usually it's some struct) is described by the
'struct kstate_description' containing the array of individual
field descriptions - 'struct kstate_field'. Each field
has a set of bits in ->flags which instruct how to save/restore
a certain field of the struct. E.g. (see kstate.h for the full list):
KS_BASE_TYPE flag tells that the field can be just copied by value,
KS_POINTER means that the struct member is a pointer to the actual
data, so it needs to be dereferenced before saving/restoring data
to/from the kstate data stream.
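kstate.h in this patch also defines KS_ADDRESS, which stores a pointer as an
offset from the start of the vmemmap or linear mapping, so the value stays
meaningful when the new kernel places that region at a different base. A
minimal userspace sketch of the arithmetic, with hypothetical demo_* helpers
and made-up base addresses:

```c
#include <assert.h>
#include <stdint.h>

/* Save side: turn an address into an offset from the old region base. */
static uintptr_t demo_addr_to_offset(uintptr_t addr, uintptr_t old_base)
{
	assert(addr >= old_base);
	return addr - old_base;
}

/* Restore side: rebuild the address against the new region base. */
static uintptr_t demo_offset_to_addr(uintptr_t offset, uintptr_t new_base)
{
	return new_base + offset;
}
```

Only the offset goes into the stream, so nothing in the saved data depends
on where the old kernel happened to map the region.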
kstate_register() call accepts kstate_description along with an instance
of an object and registers it in the global 'states' list.
During the kexec reboot phase we go through the list of 'kstate_description's
and each instance of kstate_description forms a 'struct kstate_entry'
which is saved into the kstate's data stream.
The 'kstate_entry' contains information like the ID of the kstate_description,
its version, the size of the migration data and the data itself. The ->data is
formed in accordance with the kstate_field's of the corresponding
kstate_description.
After the reboot, when kstate_register() is called, it parses the migration
stream, finds the appropriate 'kstate_entry' and restores the contents of
the object in accordance with kstate_description and ->fields.
[1] https://www.qemu.org/docs/master/devel/migration/main.html#vmstate
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
---
include/linux/kstate.h | 178 ++++++++++++++++++++++++++
kernel/Kconfig.kexec | 13 ++
kernel/Makefile | 1 +
kernel/kstate.c | 282 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 474 insertions(+)
create mode 100644 include/linux/kstate.h
create mode 100644 kernel/kstate.c
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
new file mode 100644
index 000000000000..4fc01e535bc0
--- /dev/null
+++ b/include/linux/kstate.h
@@ -0,0 +1,178 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _KSTATE_H
+#define _KSTATE_H
+
+#include <linux/atomic.h>
+#include <linux/build_bug.h>
+#include <linux/list.h>
+#include <linux/stringify.h>
+
+struct kstate_description;
+struct kstate_stream;
+struct kimage;
+
+enum kstate_flags {
+
+ /*
+ * The struct member at 'obj + kstate_field.offset' is some basic
+ * type, just copy it by value. The size is kstate_field->size.
+ */
+
+ KS_BASE_TYPE = (1 << 0),
+
+ /*
+ * The struct member at 'obj + kstate_field.offset' is a pointer
+ * to the actual data (e.g. struct a { int *b; }).
+ * save_kstate() will dereference the pointer to get the actual data
+ * and store it to the stream. restore_kstate() will copy the data from
+ * the stream to wherever the pointer points to.
+ */
+ KS_POINTER = (1 << 1),
+
+ /*
+ * The struct member at 'obj + kstate_field.offset' is another struct.
+ * kstate_field->ksd points to 'kstate_description' of that struct.
+ */
+ KS_STRUCT = (1 << 2),
+
+ /*
+ * Some non-trivial field that requires custom kstate_field->save()
+ * ->restore() callbacks to save/restore data.
+ */
+ KS_CUSTOM = (1 << 3),
+
+ /*
+ * The field is a array of kstate_field->count() pointers
+ * (e.g. struct a { uint8_t *b[]; }). Dereference each array entry
+ * before store/restore data.
+ */
+ KS_ARRAY_OF_POINTER = (1 << 4),
+
+ /*
+ * The field is a pointer to vmemmap or linear memory (determined by
+ * kstate_field->addr_type). This is used for pointers to persistent
+ * pages/data. Store offset from the start of the area instead of
+ * pointer itself, so we could defeat KASLR on restore phase (by adding
+ * new kernel's corresponding offset).
+ */
+ KS_ADDRESS = (1 << 5),
+
+ /* Marks the end of fields list */
+ KS_END = (1UL << 31),
+};
+
+enum kstate_addr_type {
+ KS_VMEMMAP_ADDR,
+ KS_LINEAR_ADDR,
+};
+
+struct kstate_stream {
+ void *start;
+ void *pos;
+ size_t size;
+};
+
+struct kstate_field {
+ const char *name;
+ size_t offset;
+ size_t size;
+ enum kstate_flags flags;
+ const struct kstate_description *ksd;
+ enum kstate_addr_type addr_type;
+ int version_id;
+ int (*restore)(struct kstate_stream *stream, void *obj,
+ const struct kstate_field *field);
+ int (*save)(struct kstate_stream *stream, void *obj,
+ const struct kstate_field *field);
+ int (*count)(void);
+};
+
+enum kstate_ids {
+ KSTATE_LAST_ID = -1,
+};
+
+struct kstate_description {
+ const char *name;
+ enum kstate_ids id;
+ atomic_t instance_id;
+ int version_id;
+ struct list_head state_list;
+
+ const struct kstate_field *fields;
+};
+
+struct state_entry {
+ u64 id;
+ struct list_head list;
+ struct kstate_description *kstd;
+ void *obj;
+};
+
+extern int kstate_save_data(struct kstate_stream *stream, void *val, size_t size);
+
+static inline bool kstate_get_byte(struct kstate_stream *stream)
+{
+ bool ret = *(u8 *)stream->pos;
+ stream->pos++;
+ return ret;
+}
+
+static inline unsigned long kstate_get_ulong(struct kstate_stream *stream)
+{
+ unsigned long ret = *(unsigned long *)stream->pos;
+ stream->pos += sizeof(unsigned long);
+ return ret;
+}
+
+#ifdef CONFIG_KSTATE
+
+int kstate_save_state(void);
+void free_kstate_stream(void);
+
+int kstate_register(struct kstate_description *state, void *obj);
+
+struct kstate_entry;
+int save_kstate(struct kstate_stream *stream, int id,
+ const struct kstate_description *kstate,
+ void *obj);
+void restore_kstate(struct kstate_stream *stream, int id,
+ const struct kstate_description *kstate, void *obj);
+
+#else
+
+#define kstate_register(state, obj)
+
+static inline int kstate_save_state(void) { return 0; }
+static inline void free_kstate_stream(void) { }
+
+#endif
+
+
+#define KSTATE_BASE_TYPE(_f, _state, _type) { \
+ .name = (__stringify(_f)), \
+ .size = sizeof(_type) + BUILD_BUG_ON_ZERO( \
+ !__same_type(typeof_member(_state, _f), _type)),\
+ .flags = KS_BASE_TYPE, \
+ .offset = offsetof(_state, _f), \
+}
+
+#define KSTATE_POINTER(_f, _state) { \
+ .name = (__stringify(_f)), \
+ .size = sizeof(*(((_state *)0)->_f)), \
+ .flags = KS_POINTER, \
+ .offset = offsetof(_state, _f), \
+ }
+
+#define KSTATE_ADDRESS(_f, _state, _addr_type) { \
+ .name = (__stringify(_f)), \
+ .size = sizeof(*(((_state *)0)->_f)), \
+ .addr_type = (_addr_type), \
+ .flags = KS_ADDRESS, \
+ .offset = offsetof(_state, _f), \
+ }
+
+#define KSTATE_END_OF_LIST() { \
+ .flags = KS_END,\
+ }
+
+#endif
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index 4d111f871951..480dc156b08b 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -151,4 +151,17 @@ config CRASH_MAX_MEMORY_RANGES
the computation behind the value provided through the
/sys/kernel/crash_elfcorehdr_size attribute.
+config ARCH_HAS_KSTATE
+ bool
+
+config KSTATE
+ bool "Migrate internal kernel state across kexec"
+ default n
+ depends on ARCH_HAS_KSTATE
+ depends on KEXEC_FILE
+ help
+ KSTATE (kernel state) is a mechanism to describe internal kernel
+ state, save it into the memory and restore the state after kexec
+ in new kernel.
+
endmenu
diff --git a/kernel/Makefile b/kernel/Makefile
index 87866b037fbe..6bdf947fc84f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -75,6 +75,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_core.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
+obj-$(CONFIG_KSTATE) += kstate.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup/
diff --git a/kernel/kstate.c b/kernel/kstate.c
new file mode 100644
index 000000000000..a73a9a42e55b
--- /dev/null
+++ b/kernel/kstate.c
@@ -0,0 +1,282 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/ctype.h>
+#include <linux/kexec.h>
+#include <linux/kstate.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/vmalloc.h>
+
+static LIST_HEAD(states);
+
+struct kstate_entry {
+ int state_id;
+ int version_id;
+ int instance_id;
+ int size;
+ DECLARE_FLEX_ARRAY(u8, data);
+};
+
+struct kstate_stream kstate_stream;
+
+static unsigned long get_addr_offset(const struct kstate_field *field)
+{
+ switch (field->addr_type) {
+ case KS_VMEMMAP_ADDR:
+ return VMEMMAP_START;
+ case KS_LINEAR_ADDR:
+ return PAGE_OFFSET;
+ default:
+ WARN_ON(1);
+ }
+ return 0;
+}
+
+static int alloc_space(struct kstate_stream *stream, size_t size)
+{
+ void *new_start;
+ size_t new_size;
+ size_t cur_size = stream->pos - stream->start;
+
+ size = size + 4; /* Always alloc extra for KSTATE_LAST_ID */
+ if (cur_size + size < stream->size)
+ return 0;
+
+ new_size = PAGE_ALIGN(cur_size + size);
+
+ new_start = vrealloc(stream->start, new_size, GFP_KERNEL);
+ if (!new_start)
+ return -ENOMEM;
+
+ stream->start = new_start;
+ stream->size = new_size;
+ stream->pos = stream->start + cur_size;
+ return 0;
+}
+
+int kstate_save_data(struct kstate_stream *stream, void *val, size_t size)
+{
+ int ret;
+
+ ret = alloc_space(stream, size);
+ if (ret)
+ return ret;
+ memcpy(stream->pos, val, size);
+ stream->pos += size;
+ return 0;
+}
+
+int save_kstate(struct kstate_stream *stream, int id,
+ const struct kstate_description *kstate,
+ void *obj)
+{
+ const struct kstate_field *field = kstate->fields;
+ struct kstate_entry *ke;
+ unsigned long ke_off;
+ int ret = 0;
+
+ ret = alloc_space(stream, sizeof(*ke));
+ if (ret)
+ goto err;
+
+ ke_off = stream->pos - stream->start;
+ ke = stream->pos;
+ stream->pos += sizeof(*ke);
+
+ ke->state_id = kstate->id;
+ ke->version_id = kstate->version_id;
+ ke->instance_id = id;
+
+ while (field->flags != KS_END) {
+ void *first, *cur;
+ int n_elems = 1;
+ int size, i;
+
+ first = obj + field->offset;
+
+ if (field->flags & KS_POINTER)
+ first = *(void **)(obj + field->offset);
+ if (field->count)
+ n_elems = field->count();
+ size = field->size;
+ for (i = 0; i < n_elems; i++) {
+ cur = first + i * size;
+
+ if (field->flags & KS_ARRAY_OF_POINTER)
+ cur = *(void **)cur;
+
+ if (field->flags & KS_STRUCT) {
+ ret = save_kstate(stream, 0, field->ksd, cur);
+ if (ret)
+ goto err;
+ } else if (field->flags & KS_CUSTOM) {
+ if (field->save) {
+ ret = field->save(stream, cur, field);
+ if (ret)
+ goto err;
+ }
+ } else if (field->flags & (KS_BASE_TYPE|KS_POINTER)) {
+ ret = kstate_save_data(stream, cur, size);
+ if (ret)
+ goto err;
+ } else if (field->flags & KS_ADDRESS) {
+ void *addr_offset = *(void **)cur
+ - get_addr_offset(field);
+ ret = kstate_save_data(stream, &addr_offset,
+ sizeof(addr_offset));
+ if (ret)
+ goto err;
+ } else
+ WARN_ON_ONCE(1);
+ }
+ field++;
+
+ }
+
+ ke = stream->start + ke_off;
+ ke->size = (stream->pos - stream->start) - (ke_off + sizeof(*ke));
+err:
+ if (ret)
+ pr_err("kstate: save of state %s failed\n", kstate->name);
+
+ return ret;
+}
+
+static int alloc_kstate_stream(void)
+{
+ size_t size = PAGE_SIZE;
+ void *buf;
+
+ buf = vzalloc(size);
+ if (!buf)
+ return -ENOMEM;
+
+ kstate_stream.size = size;
+ kstate_stream.start = kstate_stream.pos = buf;
+ return 0;
+}
+
+void free_kstate_stream(void)
+{
+ vfree(kstate_stream.start);
+ kstate_stream.start = NULL;
+ kstate_stream.size = 0;
+}
+
+int kstate_save_state(void)
+{
+ struct state_entry *se;
+ struct kstate_entry *ke;
+ int ret;
+
+ ret = alloc_kstate_stream();
+ if (ret)
+ return ret;
+
+ list_for_each_entry(se, &states, list) {
+ ret = save_kstate(&kstate_stream, se->id, se->kstd, se->obj);
+ if (ret)
+ return ret;
+ }
+ ke = kstate_stream.pos;
+ ke->state_id = KSTATE_LAST_ID;
+ return 0;
+}
+
+void restore_kstate(struct kstate_stream *stream, int id,
+ const struct kstate_description *kstate, void *obj)
+{
+ const struct kstate_field *field = kstate->fields;
+ struct kstate_entry *ke = stream->pos;
+ stream->pos = ke->data;
+
+ WARN_ONCE(ke->version_id != kstate->version_id, "version mismatch %d %d\n",
+ ke->version_id, kstate->version_id);
+
+ WARN_ONCE(ke->instance_id != id, "instance id mismatch %d %d\n",
+ ke->instance_id, id);
+
+ while (field->flags != KS_END) {
+ void *first, *cur;
+ int n_elems = 1;
+ int size, i;
+
+ first = obj + field->offset;
+ if (field->flags & KS_POINTER)
+ first = *(void **)(obj + field->offset);
+ if (field->count)
+ n_elems = field->count();
+ size = field->size;
+ for (i = 0; i < n_elems; i++) {
+ cur = first + i * size;
+
+ if (field->flags & KS_ARRAY_OF_POINTER)
+ cur = *(void **)cur;
+
+ if (field->flags & KS_STRUCT)
+ restore_kstate(stream, 0, field->ksd, cur);
+ else if (field->flags & KS_CUSTOM) {
+ if (field->restore)
+ field->restore(stream, cur, field);
+ } else if (field->flags & (KS_BASE_TYPE | KS_POINTER)) {
+ memcpy(cur, stream->pos, size);
+ stream->pos += size;
+ } else if (field->flags & KS_ADDRESS) {
+ *(void **)cur = (*(void **)stream->pos) +
+ get_addr_offset(field);
+ stream->pos += sizeof(void *);
+ } else
+ WARN_ON_ONCE(1);
+
+ }
+ field++;
+ }
+}
+
+static void restore_migrate_state(unsigned long kstate_data,
+ struct state_entry *se)
+{
+ struct kstate_stream stream;
+ struct kstate_entry *ke;
+
+ if (kstate_data == -1)
+ return;
+
+ ke = (struct kstate_entry *)phys_to_virt(kstate_data);
+ if (WARN_ON_ONCE(ke->state_id == 0))
+ return;
+
+ stream.start = stream.pos = ke;
+ while (ke->state_id != KSTATE_LAST_ID) {
+ if (ke->state_id != se->kstd->id ||
+ ke->instance_id != se->id) {
+ ke = (struct kstate_entry *)(ke->data + ke->size);
+ continue;
+ }
+ stream.pos = ke;
+ restore_kstate(&stream, se->id, se->kstd, se->obj);
+ ke = (struct kstate_entry *)(ke->data + ke->size);
+ }
+}
+
+static void __kstate_register(struct kstate_description *state, void *obj,
+ struct state_entry *se)
+{
+ se->kstd = state;
+ se->id = atomic_inc_return(&state->instance_id);
+ se->obj = obj;
+ list_add(&se->list, &states);
+ restore_migrate_state(0 /*migrate_stream_addr*/, se);
+}
+
+int kstate_register(struct kstate_description *state, void *obj)
+{
+ struct state_entry *se;
+
+ se = kmalloc(sizeof(*se), GFP_KERNEL);
+ if (!se)
+ return -ENOMEM;
+
+ __kstate_register(state, obj, se);
+ return 0;
+}
+
--
2.45.3
* [PATCH v2 2/7] kstate, kexec, x86: transfer kstate data across kexec
From: Andrey Ryabinin @ 2025-03-10 12:03 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Graf, James Gowans, Mike Rapoport, Andrew Morton,
linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H . Peter Anvin, Eric Biederman, kexec,
Pratyush Yadav, Jason Gunthorpe, Pasha Tatashin, David Rientjes,
Andrey Ryabinin
Add kstate data to kexec segments so it gets copied to the new kernel.
Use cmdline to inform the next kernel about the kstate data location and size.
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
---
I've used cmdline as it's the simplest way to transfer the address
to the new kernel. Perhaps passing it via dtb would be a more elegant
solution, but I don't have a strong opinion here.
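For illustration, the "kstate_stream=<addr>@<size>" value can be parsed in
userspace roughly as below. This is a sketch, not the patch's
setup_kstate(): it uses strtoull() instead of the kernel's memparse(), the
demo_parse_kstate_arg() name is made up, and it is stricter in that it
rejects a value without the '@<size>' part.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse "<addr>@<size>", e.g. "0x1f000@8192". Base 0 lets strtoull accept
 * both the hex address and the decimal size. Returns 0 or -EINVAL.
 */
static int demo_parse_kstate_arg(const char *arg, unsigned long long *addr,
				 unsigned long long *size)
{
	char *end;

	assert(addr && size);
	if (!arg)
		return -EINVAL;

	*addr = strtoull(arg, &end, 0);
	if (end == arg || *end != '@')
		return -EINVAL;		/* no number, or missing '@' */

	arg = end + 1;
	*size = strtoull(arg, &end, 0);
	if (end == arg)
		return -EINVAL;		/* no size after '@' */
	return 0;
}
```

Parsing stops at the first character that is not part of the number, so the
trailing space the producer appends after the size is harmless.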
---
arch/x86/Kconfig | 1 +
arch/x86/kernel/kexec-bzimage64.c | 4 +++
arch/x86/kernel/setup.c | 2 ++
include/linux/kexec.h | 2 ++
include/linux/kstate.h | 5 ++++
kernel/kexec_file.c | 5 ++++
kernel/kstate.c | 49 ++++++++++++++++++++++++++++++-
7 files changed, 67 insertions(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0e27ebd7e36a..7358d9e15957 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -90,6 +90,7 @@ config X86
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOV if X86_64
select ARCH_HAS_KERNEL_FPU_SUPPORT
+ select ARCH_HAS_KSTATE if X86_64
select ARCH_HAS_MEM_ENCRYPT
select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 68530fad05f7..d3c98c8bda29 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -15,6 +15,7 @@
#include <linux/slab.h>
#include <linux/kexec.h>
#include <linux/kernel.h>
+#include <linux/kstate.h>
#include <linux/mm.h>
#include <linux/efi.h>
#include <linux/random.h>
@@ -77,6 +78,9 @@ static int setup_cmdline(struct kimage *image, struct boot_params *params,
len = sprintf(cmdline_ptr,
"elfcorehdr=0x%lx ", image->elf_load_addr);
}
+ if (IS_ENABLED(CONFIG_KSTATE))
+ len = sprintf(cmdline_ptr, "kstate_stream=0x0%lx@%ld ",
+ image->kstate_stream_addr, image->kstate_size);
memcpy(cmdline_ptr + len, cmdline, cmdline_len);
cmdline_len += len;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index cebee310e200..b32c141ffcdd 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -15,6 +15,7 @@
#include <linux/init_ohci1394_dma.h>
#include <linux/initrd.h>
#include <linux/iscsi_ibft.h>
+#include <linux/kstate.h>
#include <linux/memblock.h>
#include <linux/panic_notifier.h>
#include <linux/pci.h>
@@ -992,6 +993,7 @@ void __init setup_arch(char **cmdline_p)
memblock_set_current_limit(ISA_END_ADDRESS);
e820__memblock_setup();
+ kstate_init();
/*
* Needs to run after memblock setup because it needs the physical
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index f0e9f8eda7a3..bd82f04888a1 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -299,6 +299,8 @@ struct kimage {
unsigned long start;
struct page *control_code_page;
struct page *swap_page;
+ unsigned long kstate_stream_addr;
+ size_t kstate_size;
void *vmcoreinfo_data_copy; /* locates in the crash memory */
unsigned long nr_segments;
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index 4fc01e535bc0..ae583d090111 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -126,6 +126,8 @@ static inline unsigned long kstate_get_ulong(struct kstate_stream *stream)
#ifdef CONFIG_KSTATE
+void kstate_init(void);
+
int kstate_save_state(void);
void free_kstate_stream(void);
@@ -137,14 +139,17 @@ int save_kstate(struct kstate_stream *stream, int id,
void *obj);
void restore_kstate(struct kstate_stream *stream, int id,
const struct kstate_description *kstate, void *obj);
+int kstate_load_migrate_buf(struct kimage *image);
#else
+static inline void kstate_init(void) { }
#define kstate_register(state, obj)
static inline int kstate_save_state(void) { return 0; }
static inline void free_kstate_stream(void) { }
+static inline int kstate_load_migrate_buf(struct kimage *image) { return 0; }
#endif
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 3eedb8c226ad..a024ff379133 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -14,6 +14,7 @@
#include <linux/file.h>
#include <linux/slab.h>
#include <linux/kexec.h>
+#include <linux/kstate.h>
#include <linux/memblock.h>
#include <linux/mutex.h>
#include <linux/list.h>
@@ -253,6 +254,10 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
/* IMA needs to pass the measurement list to the next kernel. */
ima_add_kexec_buffer(image);
+ ret = kstate_load_migrate_buf(image);
+ if (ret)
+ goto out;
+
/* Call image load handler */
ldata = kexec_image_load_default(image);
diff --git a/kernel/kstate.c b/kernel/kstate.c
index a73a9a42e55b..d35996287b76 100644
--- a/kernel/kstate.c
+++ b/kernel/kstate.c
@@ -2,6 +2,7 @@
#include <linux/ctype.h>
#include <linux/kexec.h>
#include <linux/kstate.h>
+#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/vmalloc.h>
@@ -182,6 +183,31 @@ int kstate_save_state(void)
return 0;
}
+int kstate_load_migrate_buf(struct kimage *image)
+{
+ int ret;
+ struct kexec_buf kbuf = { .image = image, .buf_min = 0,
+ .buf_max = ULONG_MAX, .top_down = true };
+
+ kbuf.bufsz = kstate_stream.size;
+ kbuf.buffer = kstate_stream.start;
+
+ kbuf.memsz = kstate_stream.size;
+
+ kbuf.buf_align = PAGE_SIZE;
+ kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
+ ret = kexec_add_buffer(&kbuf);
+ if (ret)
+ return ret;
+ image->kstate_stream_addr = kbuf.mem;
+ image->kstate_size = kstate_stream.size;
+
+ pr_info("kstate: Loaded mig_stream at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ kbuf.mem, kbuf.bufsz, kbuf.memsz);
+
+ return ret;
+}
+
void restore_kstate(struct kstate_stream *stream, int id,
const struct kstate_description *kstate, void *obj)
{
@@ -258,6 +284,9 @@ static void restore_migrate_state(unsigned long kstate_data,
}
}
+static unsigned long kstate_stream_addr = -1;
+static unsigned long kstate_size;
+
static void __kstate_register(struct kstate_description *state, void *obj,
struct state_entry *se)
{
@@ -265,7 +294,7 @@ static void __kstate_register(struct kstate_description *state, void *obj,
se->id = atomic_inc_return(&state->instance_id);
se->obj = obj;
list_add(&se->list, &states);
- restore_migrate_state(0 /*migrate_stream_addr*/, se);
+ restore_migrate_state(kstate_stream_addr, se);
}
int kstate_register(struct kstate_description *state, void *obj)
@@ -280,3 +309,21 @@ int kstate_register(struct kstate_description *state, void *obj)
return 0;
}
+static int __init setup_kstate(char *arg)
+{
+ char *end;
+
+ if (!arg)
+ return -EINVAL;
+ kstate_stream_addr = memparse(arg, &end);
+ if (*end == '@')
+ kstate_size = memparse(end + 1, &end);
+
+ return end > arg ? 0 : -EINVAL;
+}
+early_param("kstate_stream", setup_kstate);
+
+void __init kstate_init(void)
+{
+ memblock_reserve(kstate_stream_addr, kstate_size);
+}
--
2.45.3
* [PATCH v2 3/7] kexec: exclude control pages from the destination addresses
From: Andrey Ryabinin @ 2025-03-10 12:03 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Graf, James Gowans, Mike Rapoport, Andrew Morton,
linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H . Peter Anvin, Eric Biederman, kexec,
Pratyush Yadav, Jason Gunthorpe, Pasha Tatashin, David Rientjes,
Andrey Ryabinin
Kexec relies on control pages being allocated after all destination ranges
have been chosen. To be able to preserve memory across kexec we need
to be able to pick destination ranges after the control pages are
allocated. Add a check for control pages to the locate_mem_hole() callbacks
so they exclude control pages, hence we can allocate them in any order.
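The check boils down to an inclusive-range overlap test against each control
page block. A small userspace sketch of that test, with hypothetical demo_*
helpers and the page size passed in explicitly:

```c
#include <assert.h>
#include <stdbool.h>

/* Both ranges are inclusive: [a_start, a_end] and [b_start, b_end]. */
static bool demo_ranges_overlap(unsigned long a_start, unsigned long a_end,
				unsigned long b_start, unsigned long b_end)
{
	assert(a_start <= a_end && b_start <= b_end);
	return a_end >= b_start && a_start <= b_end;
}

/* Inclusive last byte of a block of (1 << order) pages starting at start. */
static unsigned long demo_block_end(unsigned long start, unsigned int order,
				    unsigned long page_size)
{
	return start + page_size * (1UL << order) - 1;
}
```

Since both ends are inclusive, a candidate range whose last byte equals the
first byte of a control page block counts as a conflict.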
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
---
kernel/kexec_core.c | 18 ++++++++++++++++++
kernel/kexec_file.c | 18 ++++--------------
kernel/kexec_internal.h | 3 +++
3 files changed, 25 insertions(+), 14 deletions(-)
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index c0bdc1686154..647ab5705c37 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -264,6 +264,24 @@ int kimage_is_destination_range(struct kimage *image,
return 0;
}
+int kimage_is_control_page(struct kimage *image,
+ unsigned long start,
+ unsigned long end)
+{
+
+ struct page *page;
+
+ list_for_each_entry(page, &image->control_pages, lru) {
+ unsigned long pstart, pend;
+ pstart = page_to_boot_pfn(page) << PAGE_SHIFT;
+ pend = pstart + PAGE_SIZE * (1 << page_private(page)) - 1;
+ if ((end >= pstart) && (start <= pend))
+ return 1;
+ }
+
+ return 0;
+}
+
static struct page *kimage_alloc_pages(gfp_t gfp_mask, unsigned int order)
{
struct page *pages;
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index a024ff379133..8ecd34071bfa 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -464,7 +464,8 @@ static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
* Make sure this does not conflict with any of existing
* segments
*/
- if (kimage_is_destination_range(image, temp_start, temp_end)) {
+ if (kimage_is_destination_range(image, temp_start, temp_end) ||
+ kimage_is_control_page(image, temp_start, temp_end)) {
temp_start = temp_start - PAGE_SIZE;
continue;
}
@@ -498,7 +499,8 @@ static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
* Make sure this does not conflict with any of existing
* segments
*/
- if (kimage_is_destination_range(image, temp_start, temp_end)) {
+ if (kimage_is_destination_range(image, temp_start, temp_end) ||
+ kimage_is_control_page(image, temp_start, temp_end)) {
temp_start = temp_start + PAGE_SIZE;
continue;
}
@@ -671,18 +673,6 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
if (kbuf->image->nr_segments >= KEXEC_SEGMENT_MAX)
return -EINVAL;
- /*
- * Make sure we are not trying to add buffer after allocating
- * control pages. All segments need to be placed first before
- * any control pages are allocated. As control page allocation
- * logic goes through list of segments to make sure there are
- * no destination overlaps.
- */
- if (!list_empty(&kbuf->image->control_pages)) {
- WARN_ON(1);
- return -EINVAL;
- }
-
/* Ensure minimum alignment needed for segments. */
kbuf->memsz = ALIGN(kbuf->memsz, PAGE_SIZE);
kbuf->buf_align = max(kbuf->buf_align, PAGE_SIZE);
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index d35d9792402d..12e655a70e25 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -14,6 +14,9 @@ int kimage_load_segment(struct kimage *image, struct kexec_segment *segment);
void kimage_terminate(struct kimage *image);
int kimage_is_destination_range(struct kimage *image,
unsigned long start, unsigned long end);
+int kimage_is_control_page(struct kimage *image,
+ unsigned long start,
+ unsigned long end);
/*
* Whatever is used to serialize accesses to the kexec_crash_image needs to be
--
2.45.3
* [PATCH v2 4/7] kexec, kstate: delay loading of kexec segments
2025-03-10 12:03 [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec Andrey Ryabinin
` (2 preceding siblings ...)
2025-03-10 12:03 ` [PATCH v2 3/7] kexec: exclude control pages from the destination addresses Andrey Ryabinin
@ 2025-03-10 12:03 ` Andrey Ryabinin
2025-03-11 11:31 ` kernel test robot
2025-03-11 12:25 ` kernel test robot
2025-03-10 12:03 ` [PATCH v2 5/7] x86, kstate: Add the ability to preserve memory pages across kexec Andrey Ryabinin
` (4 subsequent siblings)
8 siblings, 2 replies; 16+ messages in thread
From: Andrey Ryabinin @ 2025-03-10 12:03 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Graf, James Gowans, Mike Rapoport, Andrew Morton,
linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H . Peter Anvin, Eric Biederman, kexec,
Pratyush Yadav, Jason Gunthorpe, Pasha Tatashin, David Rientjes,
Andrey Ryabinin
KSTATE's purpose is to preserve some memory across kexec. To make this
happen, kexec needs to choose destination ranges after KSTATE has
registered its memory, so that these ranges don't collide with
KSTATE-preserved memory.
Kexec chooses destination ranges at the kexec load stage, which might
happen long before the actual reboot into the new kernel. This means that
KSTATE must know all preserved memory before kexec_file_load(), unless
we delay loading of kexec segments/destination addresses until later,
at the point of reboot into the new kernel. So let's do that.
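The ordering problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions: memory is eight pages, an image has a single segment, and every name (`toy_image`, `toy_load()`, `toy_preserve()`, `toy_reboot()`, `toy_collides()`) is hypothetical and does not exist in the kernel. It shows why a destination picked at load time can collide with a page registered later, while a destination picked at reboot time cannot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8	/* toy physical memory: pages 0..7 */

struct toy_image {
	bool late_load;		/* defer placement until reboot */
	int dest_page;		/* chosen destination, -1 if not placed yet */
};

static bool preserved[NPAGES];	/* pages some subsystem asked to keep */

/* Pick the lowest page that is not preserved; -1 if none is free. */
static int place_segment(void)
{
	for (int i = 0; i < NPAGES; i++)
		if (!preserved[i])
			return i;
	return -1;
}

/* Load stage: place now unless the image asks for late loading. */
static void toy_load(struct toy_image *img, bool late_load)
{
	img->late_load = late_load;
	img->dest_page = late_load ? -1 : place_segment();
}

/* A subsystem registers a page, possibly long after the load stage. */
static void toy_preserve(int page)
{
	assert(page >= 0 && page < NPAGES);
	preserved[page] = true;
}

/* Reboot stage: late images get their destination only now. */
static int toy_reboot(struct toy_image *img)
{
	if (img->late_load)
		img->dest_page = place_segment();
	return img->dest_page;
}

/* True if the chosen destination would overwrite preserved memory. */
static bool toy_collides(const struct toy_image *img)
{
	return img->dest_page >= 0 && preserved[img->dest_page];
}
```

In this model an image loaded early keeps page 0 even after page 0 is preserved, whereas a late-loaded image is placed on page 1.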
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
---
include/linux/kexec.h | 1 +
kernel/kexec_core.c | 6 ++
kernel/kexec_file.c | 144 ++++++++++++++++++++++++++--------------
kernel/kexec_internal.h | 6 ++
4 files changed, 108 insertions(+), 49 deletions(-)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index bd82f04888a1..539aaacfd3fd 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -377,6 +377,7 @@ extern void machine_kexec(struct kimage *image);
extern int machine_kexec_prepare(struct kimage *image);
extern void machine_kexec_cleanup(struct kimage *image);
extern int kernel_kexec(void);
+extern int kexec_file_load_segments(struct kimage *image);
extern struct page *kimage_alloc_control_pages(struct kimage *image,
unsigned int order);
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 647ab5705c37..7c79addeb93b 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1017,6 +1017,12 @@ int kernel_kexec(void)
goto Unlock;
}
+ if (kexec_late_load(kexec_image)) {
+ error = kexec_file_load_segments(kexec_image);
+ if (error)
+ goto Unlock;
+ }
+
#ifdef CONFIG_KEXEC_JUMP
if (kexec_image->preserve_context) {
/*
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 8ecd34071bfa..634e2ed4cc4c 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -187,6 +187,34 @@ kimage_validate_signature(struct kimage *image)
}
#endif
+static int kimage_add_buffers(struct kimage *image)
+{
+ void *ldata;
+ int ret = 0;
+
+ /* IMA needs to pass the measurement list to the next kernel. */
+ ima_add_kexec_buffer(image);
+
+ ret = kstate_load_migrate_buf(image);
+ if (ret)
+ goto out;
+
+ /* Call image load handler */
+ ldata = kexec_image_load_default(image);
+
+ if (IS_ERR(ldata)) {
+ ret = PTR_ERR(ldata);
+ goto out;
+ }
+
+ image->image_loader_data = ldata;
+out:
+ /* In case of error, free up all allocated memory in this function */
+ if (ret)
+ kimage_file_post_load_cleanup(image);
+ return ret;
+
+}
/*
* In file mode list of segments is prepared by kernel. Copy relevant
* data from user space, do error checking, prepare segment list
@@ -197,7 +225,6 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
unsigned long cmdline_len, unsigned flags)
{
ssize_t ret;
- void *ldata;
ret = kernel_read_file_from_fd(kernel_fd, 0, &image->kernel_buf,
KEXEC_FILE_SIZE_MAX, NULL,
@@ -251,22 +278,6 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
image->cmdline_buf_len - 1);
}
- /* IMA needs to pass the measurement list to the next kernel. */
- ima_add_kexec_buffer(image);
-
- ret = kstate_load_migrate_buf(image);
- if (ret)
- goto out;
-
- /* Call image load handler */
- ldata = kexec_image_load_default(image);
-
- if (IS_ERR(ldata)) {
- ret = PTR_ERR(ldata);
- goto out;
- }
-
- image->image_loader_data = ldata;
out:
/* In case of error, free up all allocated memory in this function */
if (ret)
@@ -303,10 +314,6 @@ kimage_file_alloc_init(struct kimage **rimage, int kernel_fd,
if (ret)
goto out_free_image;
- ret = sanity_check_segment_list(image);
- if (ret)
- goto out_free_post_load_bufs;
-
ret = -ENOMEM;
image->control_code_page = kimage_alloc_control_pages(image,
get_order(KEXEC_CONTROL_PAGE_SIZE));
@@ -334,6 +341,70 @@ kimage_file_alloc_init(struct kimage **rimage, int kernel_fd,
return ret;
}
+static int kimage_post_load(struct kimage *image)
+{
+ int ret, i;
+
+ ret = kexec_calculate_store_digests(image);
+ if (ret)
+ goto out;
+
+ kexec_dprintk("nr_segments = %lu\n", image->nr_segments);
+ for (i = 0; i < image->nr_segments; i++) {
+ struct kexec_segment *ksegment;
+
+ ksegment = &image->segment[i];
+ kexec_dprintk("segment[%d]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n",
+ i, ksegment->buf, ksegment->bufsz, ksegment->mem,
+ ksegment->memsz);
+
+ ret = kimage_load_segment(image, &image->segment[i]);
+ if (ret)
+ goto out;
+ }
+
+ kimage_terminate(image);
+
+ ret = machine_kexec_post_load(image);
+ if (ret)
+ goto out;
+
+ kexec_dprintk("kexec_file_load: type:%u, start:0x%lx head:0x%lx\n",
+ image->type, image->start, image->head);
+out:
+ return ret;
+}
+
+int kexec_file_load_segments(struct kimage *image)
+{
+ int ret;
+
+ ret = kimage_add_buffers(image);
+ if (ret) {
+ pr_err("failed to add kimage buffers %d\n", ret);
+ goto out;
+ }
+
+ ret = sanity_check_segment_list(image);
+ if (ret) {
+ pr_err("sanity check failed %d\n", ret);
+ goto out;
+ }
+
+ ret = kimage_post_load(image);
+ if (ret)
+ pr_err("kimage post load failed %d\n", ret);
+
+out:
+ /*
+ * Free up any temporary buffers allocated which are not needed
+ * after image has been loaded
+ */
+ kimage_file_post_load_cleanup(image);
+
+ return ret;
+}
+
SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
unsigned long, cmdline_len, const char __user *, cmdline_ptr,
unsigned long, flags)
@@ -341,7 +412,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
int image_type = (flags & KEXEC_FILE_ON_CRASH) ?
KEXEC_TYPE_CRASH : KEXEC_TYPE_DEFAULT;
struct kimage **dest_image, *image;
- int ret = 0, i;
+ int ret = 0;
/* We only trust the superuser with rebooting the system. */
if (!kexec_load_permitted(image_type))
@@ -398,37 +469,12 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
if (ret)
goto out;
- ret = kexec_calculate_store_digests(image);
- if (ret)
- goto out;
-
- kexec_dprintk("nr_segments = %lu\n", image->nr_segments);
- for (i = 0; i < image->nr_segments; i++) {
- struct kexec_segment *ksegment;
-
- ksegment = &image->segment[i];
- kexec_dprintk("segment[%d]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n",
- i, ksegment->buf, ksegment->bufsz, ksegment->mem,
- ksegment->memsz);
-
- ret = kimage_load_segment(image, &image->segment[i]);
+ if (!kexec_late_load(image)) {
+ ret = kexec_file_load_segments(image);
if (ret)
goto out;
}
- kimage_terminate(image);
-
- ret = machine_kexec_post_load(image);
- if (ret)
- goto out;
-
- kexec_dprintk("kexec_file_load: type:%u, start:0x%lx head:0x%lx flags:0x%lx\n",
- image->type, image->start, image->head, flags);
- /*
- * Free up any temporary buffers allocated which are not needed
- * after image has been loaded
- */
- kimage_file_post_load_cleanup(image);
exchange:
image = xchg(dest_image, image);
out:
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 12e655a70e25..690b1c21b642 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -34,6 +34,12 @@ static inline void kexec_unlock(void)
atomic_set_release(&__kexec_lock, 0);
}
+static inline bool kexec_late_load(struct kimage *image)
+{
+ return IS_ENABLED(CONFIG_KSTATE) && image->file_mode &&
+ (image->type == KEXEC_TYPE_DEFAULT);
+}
+
#ifdef CONFIG_KEXEC_FILE
#include <linux/purgatory.h>
void kimage_file_post_load_cleanup(struct kimage *image);
--
2.45.3
* [PATCH v2 5/7] x86, kstate: Add the ability to preserve memory pages across kexec.
2025-03-10 12:03 [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec Andrey Ryabinin
` (3 preceding siblings ...)
2025-03-10 12:03 ` [PATCH v2 4/7] kexec, kstate: delay loading of kexec segments Andrey Ryabinin
@ 2025-03-10 12:03 ` Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 6/7] kexec, kstate: save kstate data before kexec'ing Andrey Ryabinin
` (3 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Andrey Ryabinin @ 2025-03-10 12:03 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Graf, James Gowans, Mike Rapoport, Andrew Morton,
linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H . Peter Anvin, Eric Biederman, kexec,
Pratyush Yadav, Jason Gunthorpe, Pasha Tatashin, David Rientjes,
Andrey Ryabinin
This adds the ability to specify pages of memory that kstate needs to
preserve across kexec.
kstate_register_page() stores the struct page in a special list of
'struct kpage_state' entries. At the kexec reboot stage this list is
iterated and the order and physical address of each page are saved into
kstate's data stream. After kexec, the new kernel reads them back from
the stream and marks the memory as reserved to keep it intact.
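The save/restore round trip described above can be sketched in userspace. This is an illustration only, not the kstate wire format: `struct toy_stream`, `struct toy_page` and the helper names are hypothetical, values are copied in host byte order, and the record count is written ahead of the (order, physical address) records. A real restore path would reserve each range instead of copying it into an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct toy_stream {
	uint8_t buf[256];
	size_t wpos;	/* next byte to write */
	size_t rpos;	/* next byte to read */
};

struct toy_page {
	uint8_t order;		/* the range covers 2^order pages */
	unsigned long phys;	/* physical address of the first page */
};

/* Append len bytes; return -1 if the stream is full. */
static int stream_put(struct toy_stream *s, const void *data, size_t len)
{
	assert(s != NULL && data != NULL);
	if (len > sizeof(s->buf) - s->wpos)
		return -1;
	memcpy(s->buf + s->wpos, data, len);
	s->wpos += len;
	return 0;
}

/* Consume len bytes; return -1 if fewer than len bytes remain. */
static int stream_get(struct toy_stream *s, void *data, size_t len)
{
	assert(s != NULL && data != NULL);
	if (len > s->wpos - s->rpos)
		return -1;
	memcpy(data, s->buf + s->rpos, len);
	s->rpos += len;
	return 0;
}

/* Old-kernel side: the count, then one (order, phys) record per page. */
static int save_pages(struct toy_stream *s, const struct toy_page *p,
		      unsigned int n)
{
	if (stream_put(s, &n, sizeof(n)))
		return -1;
	for (unsigned int i = 0; i < n; i++) {
		if (stream_put(s, &p[i].order, sizeof(p[i].order)) ||
		    stream_put(s, &p[i].phys, sizeof(p[i].phys)))
			return -1;
	}
	return 0;
}

/* New-kernel side: read the records back; return the count or -1. */
static int restore_pages(struct toy_stream *s, struct toy_page *out,
			 unsigned int max)
{
	unsigned int n;

	if (stream_get(s, &n, sizeof(n)) || n > max)
		return -1;
	for (unsigned int i = 0; i < n; i++) {
		if (stream_get(s, &out[i].order, sizeof(out[i].order)) ||
		    stream_get(s, &out[i].phys, sizeof(out[i].phys)))
			return -1;
	}
	return (int)n;
}
```

Saving the count separately from the records mirrors how the description below lists nr_pages as a base-type field ahead of the custom "pages" field.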
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
---
include/linux/kstate.h | 30 ++++++++++
kernel/kexec_core.c | 3 +-
kernel/kstate.c | 124 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 156 insertions(+), 1 deletion(-)
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index ae583d090111..36cfefd87572 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -88,6 +88,8 @@ struct kstate_field {
};
enum kstate_ids {
+ KSTATE_RSVD_MEM_ID = 1,
+ KSTATE_STRUCT_PAGE_ID,
KSTATE_LAST_ID = -1,
};
@@ -124,6 +126,8 @@ static inline unsigned long kstate_get_ulong(struct kstate_stream *stream)
return ret;
}
+extern struct kstate_description page_state;
+
#ifdef CONFIG_KSTATE
void kstate_init(void);
@@ -141,6 +145,12 @@ void restore_kstate(struct kstate_stream *stream, int id,
const struct kstate_description *kstate, void *obj);
int kstate_load_migrate_buf(struct kimage *image);
+int kstate_page_save(struct kstate_stream *stream, void *obj,
+ const struct kstate_field *field);
+int kstate_register_page(struct page *page, int order);
+
+bool kstate_range_is_preserved(unsigned long start, unsigned long end);
+
#else
static inline void kstate_init(void) { }
@@ -150,6 +160,11 @@ static inline int kstate_save_state(void) { return 0; }
static inline void free_kstate_stream(void) { }
static inline int kstate_load_migrate_buf(struct kimage *image) { return 0; }
+
+static inline bool kstate_range_is_preserved(unsigned long start,
+ unsigned long end)
+{ return 0; }
+
#endif
@@ -176,6 +191,21 @@ static inline int kstate_load_migrate_buf(struct kimage *image) { return 0; }
.offset = offsetof(_state, _f), \
}
+#define KSTATE_PAGE(_f, _state) \
+ { \
+ .name = "page", \
+ .flags = KS_CUSTOM, \
+ .offset = offsetof(_state, _f), \
+ .save = kstate_page_save, \
+ }, \
+ KSTATE_ADDRESS(_f, _state, KS_VMEMMAP_ADDR), \
+ { \
+ .name = "struct_page", \
+ .flags = KS_STRUCT | KS_POINTER, \
+ .offset = offsetof(_state, _f), \
+ .ksd = &page_state, \
+ }
+
#define KSTATE_END_OF_LIST() { \
.flags = KS_END,\
}
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 7c79addeb93b..5d001b7a9e44 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -13,6 +13,7 @@
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/kexec.h>
+#include <linux/kstate.h>
#include <linux/mutex.h>
#include <linux/list.h>
#include <linux/highmem.h>
@@ -261,7 +262,7 @@ int kimage_is_destination_range(struct kimage *image,
return 1;
}
- return 0;
+ return kstate_range_is_preserved(start, end);
}
int kimage_is_control_page(struct kimage *image,
diff --git a/kernel/kstate.c b/kernel/kstate.c
index d35996287b76..68a1272abceb 100644
--- a/kernel/kstate.c
+++ b/kernel/kstate.c
@@ -309,6 +309,13 @@ int kstate_register(struct kstate_description *state, void *obj)
return 0;
}
+int kstate_page_save(struct kstate_stream *stream, void *obj,
+ const struct kstate_field *field)
+{
+ kstate_register_page(*(struct page **)obj, 0);
+ return 0;
+}
+
static int __init setup_kstate(char *arg)
{
char *end;
@@ -323,7 +330,124 @@ static int __init setup_kstate(char *arg)
}
early_param("kstate_stream", setup_kstate);
+/*
+ * TODO: probably should use folio instead/in addition,
+ * also will need to think/decide what fields
+ * to preserve or not
+ */
+struct kstate_description page_state = {
+ .name = "struct_page",
+ .id = KSTATE_STRUCT_PAGE_ID,
+ .state_list = LIST_HEAD_INIT(page_state.state_list),
+ .fields = (const struct kstate_field[]) {
+ KSTATE_BASE_TYPE(_mapcount, struct page, atomic_t),
+ KSTATE_BASE_TYPE(_refcount, struct page, atomic_t),
+ KSTATE_END_OF_LIST()
+ },
+};
+
+struct state_entry preserved_se;
+
+struct preserved_pages {
+ unsigned int nr_pages;
+ struct list_head list;
+};
+struct kpage_state {
+ struct list_head list;
+ u8 order;
+ struct page *page;
+};
+
+struct preserved_pages preserved_pages = {
+ .list = LIST_HEAD_INIT(preserved_pages.list)
+};
+
+int kstate_register_page(struct page *page, int order)
+{
+ struct kpage_state *state;
+
+ state = kmalloc(sizeof(*state), GFP_KERNEL);
+ if (!state)
+ return -ENOMEM;
+
+ state->page = page;
+ state->order = order;
+ list_add(&state->list, &preserved_pages.list);
+ preserved_pages.nr_pages++;
+ return 0;
+}
+
+static int kstate_pages_save(struct kstate_stream *stream, void *obj,
+ const struct kstate_field *field)
+{
+ struct kpage_state *p_state;
+ int ret;
+
+ list_for_each_entry(p_state, &preserved_pages.list, list) {
+ unsigned long paddr = page_to_phys(p_state->page);
+
+ ret = kstate_save_data(stream, &p_state->order,
+ sizeof(p_state->order));
+ if (ret)
+ return ret;
+ ret = kstate_save_data(stream, &paddr, sizeof(paddr));
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
+bool kstate_range_is_preserved(unsigned long start, unsigned long end)
+{
+ struct kpage_state *p_state;
+
+ list_for_each_entry(p_state, &preserved_pages.list, list) {
+ unsigned long pstart, pend;
+ pstart = page_to_boot_pfn(p_state->page) << PAGE_SHIFT;
+ pend = pstart + (PAGE_SIZE << p_state->order) - 1;
+ if ((end >= pstart) && (start <= pend))
+ return 1;
+ }
+ return 0;
+}
+
+static int __init kstate_pages_restore(struct kstate_stream *stream, void *obj,
+ const struct kstate_field *field)
+{
+ struct preserved_pages *preserved_pages = obj;
+ int nr_pages, i;
+
+ nr_pages = preserved_pages->nr_pages;
+ for (i = 0; i < nr_pages; i++) {
+ int order = kstate_get_byte(stream);
+ unsigned long phys = kstate_get_ulong(stream);
+
+ memblock_reserve(phys, PAGE_SIZE << order);
+ }
+ return 0;
+}
+
+struct kstate_description kstate_preserved_mem = {
+ .name = "preserved_range",
+ .id = KSTATE_RSVD_MEM_ID,
+ .state_list = LIST_HEAD_INIT(kstate_preserved_mem.state_list),
+ .fields = (const struct kstate_field[]) {
+ KSTATE_BASE_TYPE(nr_pages, struct preserved_pages, unsigned int),
+ {
+ .name = "pages",
+ .flags = KS_CUSTOM,
+ .size = sizeof(struct preserved_pages),
+ .save = kstate_pages_save,
+ .restore = kstate_pages_restore,
+ },
+
+ KSTATE_END_OF_LIST()
+ },
+};
+
void __init kstate_init(void)
{
memblock_reserve(kstate_stream_addr, kstate_size);
+ __kstate_register(&kstate_preserved_mem, &preserved_pages,
+ &preserved_se);
}
--
2.45.3
* [PATCH v2 6/7] kexec, kstate: save kstate data before kexec'ing
2025-03-10 12:03 [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec Andrey Ryabinin
` (4 preceding siblings ...)
2025-03-10 12:03 ` [PATCH v2 5/7] x86, kstate: Add the ability to preserve memory pages across kexec Andrey Ryabinin
@ 2025-03-10 12:03 ` Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 7/7] kstate, test: add test module for testing kstate subsystem Andrey Ryabinin
` (2 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Andrey Ryabinin @ 2025-03-10 12:03 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Graf, James Gowans, Mike Rapoport, Andrew Morton,
linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H . Peter Anvin, Eric Biederman, kexec,
Pratyush Yadav, Jason Gunthorpe, Pasha Tatashin, David Rientjes,
Andrey Ryabinin
Call kstate_save_state() to serialize all the required data
into the kstate data stream.
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
---
kernel/kexec_core.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 5d001b7a9e44..7dcdaee14bfa 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1017,11 +1017,14 @@ int kernel_kexec(void)
error = -EINVAL;
goto Unlock;
}
+ error = kstate_save_state();
+ if (error)
+ goto Unlock;
if (kexec_late_load(kexec_image)) {
error = kexec_file_load_segments(kexec_image);
if (error)
- goto Unlock;
+ goto Free_kstate;
}
#ifdef CONFIG_KEXEC_JUMP
@@ -1104,6 +1107,8 @@ int kernel_kexec(void)
}
#endif
+ Free_kstate:
+ free_kstate_stream();
Unlock:
kexec_unlock();
return error;
--
2.45.3
* [PATCH v2 7/7] kstate, test: add test module for testing kstate subsystem.
2025-03-10 12:03 [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec Andrey Ryabinin
` (5 preceding siblings ...)
2025-03-10 12:03 ` [PATCH v2 6/7] kexec, kstate: save kstate data before kexec'ing Andrey Ryabinin
@ 2025-03-10 12:03 ` Andrey Ryabinin
2025-03-11 2:27 ` [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec Cong Wang
2025-04-28 23:01 ` Chris Li
8 siblings, 0 replies; 16+ messages in thread
From: Andrey Ryabinin @ 2025-03-10 12:03 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Graf, James Gowans, Mike Rapoport, Andrew Morton,
linux-mm, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H . Peter Anvin, Eric Biederman, kexec,
Pratyush Yadav, Jason Gunthorpe, Pasha Tatashin, David Rientjes,
Andrey Ryabinin
This is a simple test and playground useful for kstate subsystem development.
It contains a structure with different kinds of data, which is migrated
across kexec to the new kernel using kstate.
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
---
include/linux/kstate.h | 3 ++
kernel/kstate.c | 5 +++
lib/Makefile | 2 +
lib/test_kstate.c | 86 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 96 insertions(+)
create mode 100644 lib/test_kstate.c
diff --git a/include/linux/kstate.h b/include/linux/kstate.h
index 36cfefd87572..0bde76aa4d8f 100644
--- a/include/linux/kstate.h
+++ b/include/linux/kstate.h
@@ -90,6 +90,7 @@ struct kstate_field {
enum kstate_ids {
KSTATE_RSVD_MEM_ID = 1,
KSTATE_STRUCT_PAGE_ID,
+ KSTATE_TEST_ID,
KSTATE_LAST_ID = -1,
};
@@ -132,6 +133,8 @@ extern struct kstate_description page_state;
void kstate_init(void);
+bool is_kstate_kernel(void);
+
int kstate_save_state(void);
void free_kstate_stream(void);
diff --git a/kernel/kstate.c b/kernel/kstate.c
index 68a1272abceb..3d9b786da72a 100644
--- a/kernel/kstate.c
+++ b/kernel/kstate.c
@@ -287,6 +287,11 @@ static void restore_migrate_state(unsigned long kstate_data,
static unsigned long kstate_stream_addr = -1;
static unsigned long kstate_size;
+bool is_kstate_kernel(void)
+{
+ return kstate_stream_addr != -1;
+}
+
static void __kstate_register(struct kstate_description *state, void *obj,
struct state_entry *se)
{
diff --git a/lib/Makefile b/lib/Makefile
index d5cfc7afbbb8..1395b852b58d 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -356,6 +356,8 @@ obj-$(CONFIG_PARMAN) += parman.o
obj-y += group_cpus.o
+obj-$(CONFIG_KSTATE) += test_kstate.o
+
# GCC library routines
obj-$(CONFIG_GENERIC_LIB_ASHLDI3) += ashldi3.o
obj-$(CONFIG_GENERIC_LIB_ASHRDI3) += ashrdi3.o
diff --git a/lib/test_kstate.c b/lib/test_kstate.c
new file mode 100644
index 000000000000..1d9feb017415
--- /dev/null
+++ b/lib/test_kstate.c
@@ -0,0 +1,86 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/io.h>
+#include <linux/kstate.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+
+static unsigned long ulong_val;
+struct kstate_test_data {
+ int i;
+ unsigned long *p_ulong;
+ char s[10];
+ struct page *page;
+};
+
+struct kstate_description test_state = {
+ .name = "test",
+ .version_id = 1,
+ .id = KSTATE_TEST_ID,
+ .state_list = LIST_HEAD_INIT(test_state.state_list),
+ .fields = (const struct kstate_field[]) {
+ KSTATE_BASE_TYPE(i, struct kstate_test_data, int),
+ KSTATE_BASE_TYPE(s, struct kstate_test_data, char [10]),
+ KSTATE_POINTER(p_ulong, struct kstate_test_data),
+ KSTATE_PAGE(page, struct kstate_test_data),
+ KSTATE_END_OF_LIST()
+ },
+};
+
+static struct kstate_test_data test_data;
+
+static int init_test_data(void)
+{
+ struct page *page;
+ int i;
+
+ test_data.i = 10;
+ ulong_val = 20;
+ memcpy(test_data.s, "abcdefghk", sizeof(test_data.s));
+ page = alloc_page(GFP_KERNEL);
+ if (!page)
+ return -ENOMEM;
+
+ for (i = 0; i < PAGE_SIZE/4; i += 4)
+ *((u32 *)page_address(page) + i) = 0xdeadbeef;
+ test_data.page = page;
+ return 0;
+}
+
+static void validate_test_data(void)
+{
+ int i;
+
+ if (WARN_ON(test_data.i != 10))
+ return;
+ if (WARN_ON(*test_data.p_ulong != 20))
+ return;
+ if (WARN_ON(strcmp(test_data.s, "abcdefghk") != 0))
+ return;
+
+ for (i = 0; i < PAGE_SIZE/4; i += 4) {
+ u32 val = *((u32 *)page_address(test_data.page) + i);
+
+ WARN_ON(val != 0xdeadbeef);
+ }
+}
+
+static int __init test_kstate_init(void)
+{
+ int ret = 0;
+
+ test_data.p_ulong = &ulong_val;
+
+ if (!is_kstate_kernel()) {
+ ret = init_test_data();
+ if (ret)
+ goto out;
+ }
+
+ kstate_register(&test_state, &test_data);
+
+ validate_test_data();
+
+out:
+ return ret;
+}
+__initcall(test_kstate_init);
--
2.45.3
* Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec
2025-03-10 12:03 [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec Andrey Ryabinin
` (6 preceding siblings ...)
2025-03-10 12:03 ` [PATCH v2 7/7] kstate, test: add test module for testing kstate subsystem Andrey Ryabinin
@ 2025-03-11 2:27 ` Cong Wang
2025-03-11 12:19 ` Andrey Ryabinin
2025-04-28 23:01 ` Chris Li
8 siblings, 1 reply; 16+ messages in thread
From: Cong Wang @ 2025-03-11 2:27 UTC (permalink / raw)
To: Andrey Ryabinin
Cc: linux-kernel, Alexander Graf, James Gowans, Mike Rapoport,
Andrew Morton, linux-mm, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
Eric Biederman, kexec, Pratyush Yadav, Jason Gunthorpe,
Pasha Tatashin, David Rientjes
Hi Andrey,
On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn@yandex-team.com> wrote:
> Each driver/subsystem has to solve this problem in their own way.
> Also if we use fdt properties for individual fields, that might be wasteful
> in terms of used memory, as these properties use strings as keys.
>
> KSTATE, in contrast, solves the same problem in a more elegant way, with this:
> struct kstate_description a_state = {
> .name = "a_struct",
> .version_id = 1,
> .id = KSTATE_TEST_ID,
> .state_list = LIST_HEAD_INIT(test_state.state_list),
> .fields = (const struct kstate_field[]) {
> KSTATE_BASE_TYPE(i, struct a, int),
> KSTATE_BASE_TYPE(s, struct a, char [10]),
> KSTATE_POINTER(p_ulong, struct a),
> KSTATE_PAGE(page, struct a),
> KSTATE_END_OF_LIST()
> },
> };
Hmm, this still requires manual effort to implement, so potentially
a lot of work given how many drivers we have in-tree.
And those KSTATE_* macros look very similar to BTF:
https://docs.kernel.org/bpf/btf.html
So, is there any possibility of reusing BTF here? Note that BTF is
automatically generated by pahole; no manual effort is required.
Regards,
Cong Wang
* Re: [PATCH v2 4/7] kexec, kstate: delay loading of kexec segments
2025-03-10 12:03 ` [PATCH v2 4/7] kexec, kstate: delay loading of kexec segments Andrey Ryabinin
@ 2025-03-11 11:31 ` kernel test robot
2025-03-11 12:25 ` kernel test robot
1 sibling, 0 replies; 16+ messages in thread
From: kernel test robot @ 2025-03-11 11:31 UTC (permalink / raw)
To: Andrey Ryabinin, linux-kernel
Cc: llvm, oe-kbuild-all, Alexander Graf, James Gowans, Mike Rapoport,
Andrew Morton, Linux Memory Management List, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
Eric Biederman, kexec, Pratyush Yadav, Jason Gunthorpe,
Pasha Tatashin, David Rientjes, Andrey Ryabinin
Hi Andrey,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/x86/core]
[also build test ERROR on akpm-mm/mm-nonmm-unstable akpm-mm/mm-everything linus/master v6.14-rc6 next-20250307]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Andrey-Ryabinin/kstate-Add-kstate-a-mechanism-to-describe-and-migrate-kernel-state-across-kexec/20250310-200803
base: tip/x86/core
patch link: https://lore.kernel.org/r/20250310120318.2124-5-arbn%40yandex-team.com
patch subject: [PATCH v2 4/7] kexec, kstate: delay loading of kexec segments
config: i386-buildonly-randconfig-003-20250311 (https://download.01.org/0day-ci/archive/20250311/202503111944.pfx9UdvP-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250311/202503111944.pfx9UdvP-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503111944.pfx9UdvP-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from kernel/vmcore_info.c:24:
>> kernel/kexec_internal.h:39:43: error: incomplete definition of type 'struct kimage'
39 | return IS_ENABLED(CONFIG_KSTATE) && image->file_mode &&
| ~~~~~^
kernel/kexec_internal.h:9:8: note: forward declaration of 'struct kimage'
9 | struct kimage *do_kimage_alloc_init(void);
| ^
kernel/kexec_internal.h:40:9: error: incomplete definition of type 'struct kimage'
40 | (image->type == KEXEC_TYPE_DEFAULT);
| ~~~~~^
kernel/kexec_internal.h:9:8: note: forward declaration of 'struct kimage'
9 | struct kimage *do_kimage_alloc_init(void);
| ^
>> kernel/kexec_internal.h:40:19: error: use of undeclared identifier 'KEXEC_TYPE_DEFAULT'
40 | (image->type == KEXEC_TYPE_DEFAULT);
| ^
3 errors generated.
vim +39 kernel/kexec_internal.h
36
37 static inline bool kexec_late_load(struct kimage *image)
38 {
> 39 return IS_ENABLED(CONFIG_KSTATE) && image->file_mode &&
> 40 (image->type == KEXEC_TYPE_DEFAULT);
41 }
42
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec
2025-03-11 2:27 ` [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec Cong Wang
@ 2025-03-11 12:19 ` Andrey Ryabinin
2025-04-28 23:01 ` Chris Li
0 siblings, 1 reply; 16+ messages in thread
From: Andrey Ryabinin @ 2025-03-11 12:19 UTC (permalink / raw)
To: Cong Wang
Cc: Andrey Ryabinin, linux-kernel, Alexander Graf, James Gowans,
Mike Rapoport, Andrew Morton, linux-mm, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
Eric Biederman, kexec, Pratyush Yadav, Jason Gunthorpe,
Pasha Tatashin, David Rientjes
On Tue, Mar 11, 2025 at 3:28 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Hi Andrey,
>
> On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn@yandex-team.com> wrote:
> > Each driver/subsystem has to solve this problem in their own way.
> > Also if we use fdt properties for individual fields, that might be wasteful
> > in terms of used memory, as these properties use strings as keys.
> >
> > KSTATE, in contrast, solves the same problem in a more elegant way, with this:
> > struct kstate_description a_state = {
> > .name = "a_struct",
> > .version_id = 1,
> > .id = KSTATE_TEST_ID,
> > .state_list = LIST_HEAD_INIT(test_state.state_list),
> > .fields = (const struct kstate_field[]) {
> > KSTATE_BASE_TYPE(i, struct a, int),
> > KSTATE_BASE_TYPE(s, struct a, char [10]),
> > KSTATE_POINTER(p_ulong, struct a),
> > KSTATE_PAGE(page, struct a),
> > KSTATE_END_OF_LIST()
> > },
> > };
>
> Hmm, this still requires manual efforts to implement this, so potentially
> a lot of work given how many drivers we have in-tree.
>
We are not going to make every possible driver able to persist its state.
I think the main target is the VFIO driver, which also implies PCI/IOMMU.
Besides, we'll need to persist only some fields of a struct, not the
entire thing.
There is no way to automate such decisions, so there will be some
manual effort anyway.
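The point about persisting only selected fields can be sketched with a small field table. This is a userspace illustration, not the kstate API: `struct toy_field`, `TOY_FIELD()`, `struct toy_dev`, `toy_save()` and `toy_restore()` are all hypothetical names. The table is the manual part; the copy loops are generic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry per field that should survive; everything else is skipped. */
struct toy_field {
	size_t offset;
	size_t size;
};

#define TOY_FIELD(type, member) \
	{ offsetof(type, member), sizeof(((type *)0)->member) }

struct toy_dev {
	int irq;		/* persisted */
	void *scratch;		/* not persisted: meaningless after kexec */
	char name[8];		/* persisted */
};

static const struct toy_field toy_dev_fields[] = {
	TOY_FIELD(struct toy_dev, irq),
	TOY_FIELD(struct toy_dev, name),
};

#define TOY_NFIELDS (sizeof(toy_dev_fields) / sizeof(toy_dev_fields[0]))

/* Copy only the listed fields; return bytes written, 0 if buf is too small. */
static size_t toy_save(const struct toy_dev *dev, unsigned char *buf,
		       size_t len)
{
	size_t pos = 0;

	for (size_t i = 0; i < TOY_NFIELDS; i++) {
		const struct toy_field *f = &toy_dev_fields[i];

		if (f->size > len - pos)
			return 0;
		memcpy(buf + pos, (const unsigned char *)dev + f->offset,
		       f->size);
		pos += f->size;
	}
	return pos;
}

/* Write the listed fields back; unlisted fields are left untouched. */
static void toy_restore(struct toy_dev *dev, const unsigned char *buf,
			size_t len)
{
	size_t pos = 0;

	for (size_t i = 0; i < TOY_NFIELDS; i++) {
		const struct toy_field *f = &toy_dev_fields[i];

		assert(f->size <= len - pos);
		memcpy((unsigned char *)dev + f->offset, buf + pos, f->size);
		pos += f->size;
	}
}
```

Deciding that `scratch` stays out of the table is exactly the kind of choice that cannot be derived from type information alone.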
> And those KSTATE_* stuffs look a lot similar to BTF:
> https://docs.kernel.org/bpf/btf.html
>
> So, any possibility to reuse BTF here?
Perhaps, but I don't see it right away. I'll think about it.
> Note, BTF is automatically generated by pahole, no manual effort is required.
Nothing will save us from the manual effort of deciding which parts of
the data we want to save, so there has to be some way to mark that data.
Also, the same C type may represent different kinds of data, e.g.
we may have an address of some persistent data (in the linear mapping)
stored as an 'unsigned long address'.
Because of KASLR we can't copy 'address' by value; we'll need to save
it as an offset from PAGE_OFFSET
and add the new kernel's PAGE_OFFSET on restore.
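The rebasing described above amounts to two one-line conversions. A minimal sketch, assuming 64-bit addresses; `addr_to_offset()` and `offset_to_addr()` are hypothetical helpers, and the base values used when trying it out are made up rather than taken from any particular kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Save side: turn a linear-map address into an offset from the old base. */
static uint64_t addr_to_offset(uint64_t addr, uint64_t old_page_offset)
{
	assert(addr >= old_page_offset);
	return addr - old_page_offset;
}

/* Restore side: rebase the offset on the new kernel's linear-map base. */
static uint64_t offset_to_addr(uint64_t offset, uint64_t new_page_offset)
{
	return new_page_offset + offset;
}
```

Copying the raw value instead would leave a pointer into the old kernel's layout, which is why the type 'unsigned long' alone is not enough to tell how a field must be saved.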
* Re: [PATCH v2 4/7] kexec, kstate: delay loading of kexec segments
2025-03-10 12:03 ` [PATCH v2 4/7] kexec, kstate: delay loading of kexec segments Andrey Ryabinin
2025-03-11 11:31 ` kernel test robot
@ 2025-03-11 12:25 ` kernel test robot
1 sibling, 0 replies; 16+ messages in thread
From: kernel test robot @ 2025-03-11 12:25 UTC (permalink / raw)
To: Andrey Ryabinin, linux-kernel
Cc: llvm, oe-kbuild-all, Alexander Graf, James Gowans, Mike Rapoport,
Andrew Morton, Linux Memory Management List, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
Eric Biederman, kexec, Pratyush Yadav, Jason Gunthorpe,
Pasha Tatashin, David Rientjes, Andrey Ryabinin
Hi Andrey,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/x86/core]
[also build test ERROR on akpm-mm/mm-nonmm-unstable akpm-mm/mm-everything linus/master v6.14-rc6 next-20250307]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Andrey-Ryabinin/kstate-Add-kstate-a-mechanism-to-describe-and-migrate-kernel-state-across-kexec/20250310-200803
base: tip/x86/core
patch link: https://lore.kernel.org/r/20250310120318.2124-5-arbn%40yandex-team.com
patch subject: [PATCH v2 4/7] kexec, kstate: delay loading of kexec segments
config: riscv-randconfig-002-20250311 (https://download.01.org/0day-ci/archive/20250311/202503112016.VZt1HD9v-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250311/202503112016.VZt1HD9v-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503112016.VZt1HD9v-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from kernel/vmcore_info.c:24:
kernel/kexec_internal.h: In function 'kexec_late_load':
>> kernel/kexec_internal.h:39:50: error: invalid use of undefined type 'struct kimage'
39 | return IS_ENABLED(CONFIG_KSTATE) && image->file_mode &&
| ^~
kernel/kexec_internal.h:40:23: error: invalid use of undefined type 'struct kimage'
40 | (image->type == KEXEC_TYPE_DEFAULT);
| ^~
>> kernel/kexec_internal.h:40:33: error: 'KEXEC_TYPE_DEFAULT' undeclared (first use in this function); did you mean 'KEXEC_ARCH_DEFAULT'?
40 | (image->type == KEXEC_TYPE_DEFAULT);
| ^~~~~~~~~~~~~~~~~~
| KEXEC_ARCH_DEFAULT
kernel/kexec_internal.h:40:33: note: each undeclared identifier is reported only once for each function it appears in
vim +39 kernel/kexec_internal.h
36
37 static inline bool kexec_late_load(struct kimage *image)
38 {
> 39 return IS_ENABLED(CONFIG_KSTATE) && image->file_mode &&
> 40 (image->type == KEXEC_TYPE_DEFAULT);
41 }
42
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec
2025-03-10 12:03 [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec Andrey Ryabinin
` (7 preceding siblings ...)
2025-03-11 2:27 ` [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec Cong Wang
@ 2025-04-28 23:01 ` Chris Li
2025-05-05 14:35 ` Andrey Ryabinin
8 siblings, 1 reply; 16+ messages in thread
From: Chris Li @ 2025-04-28 23:01 UTC (permalink / raw)
To: Andrey Ryabinin
Cc: linux-kernel, Alexander Graf, James Gowans, Mike Rapoport,
Andrew Morton, linux-mm, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
Eric Biederman, kexec, Pratyush Yadav, Jason Gunthorpe,
Pasha Tatashin, David Rientjes
Hi Andrey,
I am working on the PCI portion of the live update and looking at
using KSTATE as an alternative to the FDT. Here is some high-level
feedback.
On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn@yandex-team.com> wrote:
>
> Main changes from v1 [1]:
> - Get rid of abusing crashkernel and implent proper way to pass memory to new kernel
> - Lots of misc cleanups/refactorings.
>
> kstate (kernel state) is a mechanism to describe internal some part of the
> kernel state, save it into the memory and restore the state after kexec
> in the new kernel.
>
> The end goal here and the main use case for this is to be able to
> update host kernel under VMs with VFIO pass-through devices running
> on that host. Since we are pretty far from that end goal yet, this
> only establishes some basic infrastructure to describe and migrate complex
> in-kernel states.
>
> The idea behind KSTATE resembles QEMU's migration framework [1], which
> solves quite similar problem - migrate state of VM/emulated devices
> across different versions of QEMU.
>
> This is an altenative to Kexec Hand Over (KHO [3]).
>
> So, why not KHO?
>
KHO does more than just serializing/deserializing. It also has scratch
areas etc. to allow safely performing early allocations without stepping
on the preserved memory. I see KSTATE as an alternative to libFDT as a
way of serializing the preserved memory, not as a replacement for KHO.
With that, it would be great to see KSTATE built on top of the
current version of KHO. The V6 version of KHO uses a recursive FDT
object. I see that a recursive FDT can map nicely to a C struct description
similar to the KSTATE field description. However, that will
require KSTATE to make some considerable changes to embrace KHO
V6. For example, KSTATE uses one contiguous stream buffer and KHO
V6 uses many recursive physical address object pointers for different
objects. Maybe a KSTATE V3?
> - The main reason is KHO doesn't provide simple and convenient internal
> API for the drivers/subsystems to preserve internal data.
> E.g. lets consider we have some variable of type 'struct a'
> that needs to be preserved:
> struct a {
> int i;
> unsigned long *p_ulong;
> char s[10];
> struct page *page;
> };
>
> The KHO-way requires driver/subsystem to have a bunch of code
> dealing with FDT stuff, something like
>
> a_kho_write()
> {
> ...
> fdt_property(fdt, "i", &a.i, sizeof(a.i));
> fdt_property(fdt, "ulong", a.p_ulong, sizeof(*a.p_ulong));
> fdt_property(fdt, "s", &a.s, sizeof(a.s));
> if (err)
> ...
> }
I can add more pain points of using FDT as the data format for
saving and restoring state. It is not easy to determine up front how much
memory the FDT serialization is going to use. We want to do all the memory
allocation in the KHO PREPARE phase, so that after the KHO PREPARE
phase nothing in KHO can fail because memory cannot be allocated.
The current KHO V6 does not handle the case where the recursive FDT
grows beyond a 4K page. That is a feature gap: the PCI subsystem
will likely save state for a list of PCI devices, and the FDT can
well grow beyond 4K.
FDT also does not save the type of the object buffer, only the size.
There is an implicit contract about what the object points to. The
KSTATE description table can be extended to be more expressive than
FDT, e.g. to cover optional min/max allowed values.
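To illustrate both points, here is a minimal user-space sketch (not code from
the series) in which a description table yields the payload size before
anything is written and carries optional min/max bounds. The names
ks_field_sketch, ks_payload_size and ks_validate are made up for this
example, and only fixed-size, copied-by-value fields are covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct demo { int i; char s[10]; };

/* Hypothetical field descriptor; not the kstate_field from the series. */
struct ks_field_sketch {
	const char *name;
	size_t offset;
	size_t size;
	bool has_range;		/* only used for int-sized fields here */
	int min, max;
};

static const struct ks_field_sketch demo_fields[] = {
	{ "i", offsetof(struct demo, i), sizeof(int), true, 0, 100 },
	{ "s", offsetof(struct demo, s), sizeof(((struct demo *)0)->s),
	  false, 0, 0 },
};
#define DEMO_NFIELDS (sizeof(demo_fields) / sizeof(demo_fields[0]))

/* Total payload size, known before any data is written. */
static size_t ks_payload_size(const struct ks_field_sketch *f, size_t n)
{
	size_t total = 0;

	for (size_t k = 0; k < n; k++)
		total += f[k].size;
	return total;
}

/* Check every range-annotated int field of obj against its bounds. */
static bool ks_validate(const struct ks_field_sketch *f, size_t n,
			const void *obj)
{
	for (size_t k = 0; k < n; k++) {
		int v;

		if (!f[k].has_range)
			continue;
		assert(f[k].size == sizeof(int));
		memcpy(&v, (const char *)obj + f[k].offset, sizeof(v));
		if (v < f[k].min || v > f[k].max)
			return false;
	}
	return true;
}
```

Fields whose size depends on a runtime count would still need that count
evaluated during the prepare phase; the sketch does not attempt that.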
> a_kho_restore()
> {
> ...
> a.i = fdt_getprop(fdt, offset, "i", &len);
> if (!a.i || len != sizeof(a.i))
> goto err
> *a.p_ulong = fdt_getprop....
> }
>
> Each driver/subsystem has to solve this problem in their own way.
> Also if we use fdt properties for individual fields, that might be wastefull
> in terms of used memory, as these properties use strings as keys.
Right, I need to write a lot of boilerplate code to do the per-property
save/restore. I am not too worried about memory usage. A
lot of string keys are not much longer than 8 bytes, so the memory saved
by converting to a binary index is not huge. I would actually suggest adding
the string version of the field name to the description table, so that
we can dump the state in KSTATE just like the YAML FDT output for
debugging purposes. Dumping the currently saved state in a human-readable
form is a very useful feature of FDT. KSTATE can have the
same feature added.
>
> While with KSTATE solves the same problem in more elegant way, with this:
> struct kstate_description a_state = {
> .name = "a_struct",
> .version_id = 1,
> .id = KSTATE_TEST_ID,
> .state_list = LIST_HEAD_INIT(test_state.state_list),
> .fields = (const struct kstate_field[]) {
> KSTATE_BASE_TYPE(i, struct a, int),
> KSTATE_BASE_TYPE(s, struct a, char [10]),
> KSTATE_POINTER(p_ulong, struct a),
> KSTATE_PAGE(page, struct a),
> KSTATE_END_OF_LIST()
> },
> };
>
>
> {
> static unsigned long ulong
> static struct a a_data = { .p_ulong = &ulong };
>
> kstate_register(&test_state, &a_data);
> }
>
> The driver needs only to have a proper 'kstate_description' and call kstate_register()
> to save/restore a_data.
> Basically 'struct kstate_description' provides instructions how to save/restore 'struct a'.
> And kstate_register() does all this save/restore stuff under the hood.
It seems KSTATE uses one contiguous stream and the objects have to
be loaded in the order they were saved. For the PCI code, PCI device
scanning and probing might cause devices to be loaded out of the order of
saving. (PCI probing actually happens in the reverse order of saving.)
This kstate_register() might pose restrictions on the restore order.
PCI will need to look up and find the device state based on the PCI
device ID. Other subsystems will likely need to look
up their own saved state as well.
I can also see KSTATE being extended to support that.
> - Another bonus point - kstate can preserve migratable memory, which is required
> to preserve guest memory
>
>
> So now to the part how this works.
>
> State of kernel data (usually it's some struct) is described by the
> 'struct kstate_description' containing the array of individual
> fields descpriptions - 'struct kstate_field'. Each field
> has set of bits in ->flags which instructs how to save/restore
> a certain field of the struct. E.g.:
> - KS_BASE_TYPE flag tells that field can be just copied by value,
>
> - KS_POINTER means that the struct member is a pointer to the actual
> data, so it needs to be dereference before saving/restoring data
> to/from kstate data steam.
>
> - KS_STRUCT - contains another struct, field->ksd must point to
> another 'struct kstate_dscription'
A field can't have both the KS_BASE_TYPE and KS_STRUCT bits set,
right? Some of these flag combinations do not make sense. This
part might need more careful planning to keep it simple. Maybe some of
the flag bits should be an enum.
>
> - KS_CUSTOM - Some non-trivial field that requires custom kstate_field->save()
> ->restore() callbacks to save/restore data.
>
> - KS_ARRAY_OF_POINTER - array of pointers, the size of array determined by the
> field->count() callback
> - KS_ADDRESS - field is a pointer to either vmemmap area (struct page) or
> linear address. Store offset
I think we want to describe different stream types.
For example, the simplest stream container is just a contiguous
buffer with a start address and a size.
A more complex one might be a size followed by an array of page pointers;
together those pages make up a buffer that describes a saved
KSTATE that is larger than 4K and spread over an array of pages. Those
pages don't need to be contiguous. Such a page array buffer stores the
KSTATE entry described by a separate description table.
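A rough user-space sketch of such a page array container is below. It is an
assumption about how this could look, not an existing KSTATE or KHO
structure: ks_page_array, ks_pa_copy and KS_PAGE_SIZE are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KS_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/* Hypothetical container: total size plus an array of page pointers. */
struct ks_page_array {
	size_t size;		/* bytes of saved state */
	size_t nr_pages;
	void **pages;		/* pages need not be contiguous */
};

/* Copy len bytes at logical offset pos into or out of the page array. */
static void ks_pa_copy(struct ks_page_array *pa, size_t pos,
		       void *buf, size_t len, int to_pages)
{
	char *p = buf;

	assert(pos + len <= pa->nr_pages * (size_t)KS_PAGE_SIZE);
	while (len) {
		size_t idx = pos / KS_PAGE_SIZE;	/* which page */
		size_t off = pos % KS_PAGE_SIZE;	/* offset inside it */
		size_t chunk = KS_PAGE_SIZE - off;

		if (chunk > len)
			chunk = len;
		if (to_pages)
			memcpy((char *)pa->pages[idx] + off, p, chunk);
		else
			memcpy(p, (char *)pa->pages[idx] + off, chunk);
		p += chunk;
		pos += chunk;
		len -= chunk;
	}
}
```

The save and restore paths see one logical byte stream, while the backing
pages can be allocated one at a time.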
>
> - KS_END - special flag indicating the end of migration stream data.
>
> kstate_register() call accepts kstate_description along with an instance
> of an object and registers it in the global 'states' list.
>
> During kexec reboot phase we go through the list of 'kstate_description's
> and each instance of kstate_description forms the 'struct kstate_entry'
> which save into the kstate's data stream.
>
> The 'kstate_entry' contains information like ID of kstate_description, version
> of it, size of migration data and the data itself. The ->data is formed in
> accordance to the kstate_field's of the corresponding kstate_description.
The version for the kstate_description might not be enough. A
version works if there is a linear history. Here we are likely to have
different vendors adding their own extensions to the device state saving.
I suggest that we instead save the old kernel's kstate_description table
(once per description table, as a recursive object) alongside the
object's physical address. The new kernel has its own new version
of the description table. It can compare the old and new
description tables and find out which fields need to be upgraded or
downgraded. The new kernel will use the old kstate_description table
to decode the previous kernel's saved object. I think that way it is
more flexible to support adding one or more features, and not tied to
which version has what feature. It can also make sure the new kernel
can always dump the old KSTATE into YAML.
That way we might be able to simplify the subsection and the
deprecation flags. The new kernel doesn't need to carry the history
of changes made to the old description table.
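As a rough illustration, decoding with a saved table could look like the
user-space sketch below. Everything in it is hypothetical (ks_saved_field,
ks_live_field, ks_decode), it matches fields by name only, and it handles
just plain copied-by-value fields; a real saved table would also have to
store the names inline rather than as pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical field record saved by the old kernel with the data. */
struct ks_saved_field {
	const char *name;
	size_t size;
};

/* Hypothetical field record in the new kernel's table. */
struct ks_live_field {
	const char *name;
	size_t offset;
	size_t size;
};

/*
 * Decode a payload laid out per the old table into obj using the new table.
 * Fields missing from the new table are skipped; fields missing from the
 * old table keep whatever obj already holds. Returns the number of fields
 * restored.
 */
static int ks_decode(const struct ks_saved_field *old_f, size_t n_old,
		     const struct ks_live_field *new_f, size_t n_new,
		     const void *payload, void *obj)
{
	const char *src = payload;
	int restored = 0;

	for (size_t i = 0; i < n_old; i++) {
		for (size_t j = 0; j < n_new; j++) {
			if (strcmp(old_f[i].name, new_f[j].name) != 0)
				continue;
			assert(old_f[i].size == new_f[j].size);
			memcpy((char *)obj + new_f[j].offset, src,
			       old_f[i].size);
			restored++;
			break;
		}
		src += old_f[i].size;	/* advance even when skipped */
	}
	return restored;
}
```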
> After the reboot, when the kstate_register() called it parses migration
> stream, finds the appropriate 'kstate_entry' and restores the contents of
> the object in accordance with kstate_description and ->fields.
Again, this restore can happen in a different order, following the PCI
device scanning and probing order. The restoration might not happen in
one single call chain. Material for V3?
I am happy to work with you to get KSTATE working with the existing KHO effort.
Chris
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec
2025-03-11 12:19 ` Andrey Ryabinin
@ 2025-04-28 23:01 ` Chris Li
0 siblings, 0 replies; 16+ messages in thread
From: Chris Li @ 2025-04-28 23:01 UTC (permalink / raw)
To: Andrey Ryabinin
Cc: Cong Wang, Andrey Ryabinin, linux-kernel, Alexander Graf,
James Gowans, Mike Rapoport, Andrew Morton, linux-mm,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H . Peter Anvin, Eric Biederman, kexec, Pratyush Yadav,
Jason Gunthorpe, Pasha Tatashin, David Rientjes
On Tue, Mar 11, 2025 at 5:19 AM Andrey Ryabinin <ryabinin.a.a@gmail.com> wrote:
> > Hmm, this still requires manual efforts to implement this, so potentially
> > a lot of work given how many drivers we have in-tree.
> >
>
> We are not going to have every possible driver to be able to persist its state.
> I think the main target is VFIO driver which also implies PCI/IOMMU.
>
> Besides, we'll need to persist only some fields of the struct, not the
> entire thing.
> There is no way to automate such decisions, so there will be some
> manual effort anyway.
>
>
> > And those KSTATE_* stuffs look a lot similar to BTF:
> > https://docs.kernel.org/bpf/btf.html
> >
> > So, any possibility to reuse BTF here?
>
> Perhaps, but I don't see it right away. I'll think about it.
There is some possibility to use tools to lighten the repetitive portion
of the work.
For example, use the sparse checker to examine the struct fields.
>
> > Note, BTF is automatically generated by pahole, no manual effort is required.
>
> Nothing will save us from manual efforts of what parts of data we want to save,
> so there has to be some way to mark that data.
> Also same C types may represent different kind of data, e.g.
> we may have an address to some persistent data (in linear mapping)
> stored as an 'unsigned long address'.
> Because of KASLR we can't copy 'address' by value, we'll need to save
> it as an offset from PAGE_OFFSET
> and add PAGE_OFFSET of the new kernel on restore.
Agree, there will be cases requiring manual intervention. It is
unlikely to fully automate this process.
Chris
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec
2025-04-28 23:01 ` Chris Li
@ 2025-05-05 14:35 ` Andrey Ryabinin
2025-05-07 6:11 ` Chris Li
0 siblings, 1 reply; 16+ messages in thread
From: Andrey Ryabinin @ 2025-05-05 14:35 UTC (permalink / raw)
To: Chris Li
Cc: linux-kernel, Alexander Graf, James Gowans, Mike Rapoport,
Andrew Morton, linux-mm, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
Eric Biederman, kexec, Pratyush Yadav, Jason Gunthorpe,
Pasha Tatashin, David Rientjes
On 4/29/25 1:01 AM, Chris Li wrote:
> Hi Andrey,
>
> I am working on the PCI portion of the live update and looking at
> using KSTATE as an alternative to the FDT. Here are some high level
> feedbacks.
>
Hi, thanks a lot.
> On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn@yandex-team.com> wrote:
>>
>> Main changes from v1 [1]:
>> - Get rid of abusing crashkernel and implent proper way to pass memory to new kernel
>> - Lots of misc cleanups/refactorings.
>>
>> kstate (kernel state) is a mechanism to describe internal some part of the
>> kernel state, save it into the memory and restore the state after kexec
>> in the new kernel.
>>
>> The end goal here and the main use case for this is to be able to
>> update host kernel under VMs with VFIO pass-through devices running
>> on that host. Since we are pretty far from that end goal yet, this
>> only establishes some basic infrastructure to describe and migrate complex
>> in-kernel states.
>>
>> The idea behind KSTATE resembles QEMU's migration framework [1], which
>> solves quite similar problem - migrate state of VM/emulated devices
>> across different versions of QEMU.
>>
>> This is an altenative to Kexec Hand Over (KHO [3]).
>>
>> So, why not KHO?
>>
>
> KHO does more than just serializing/unserializing. It also has scratch
> areas etc to allow safely performing early allocation without stepping
> on the preserved memory. I see KSTATE as an alternative to libFDT as
> ways of serializing the preserved memory. Not a replacement for KHO.
>
> With that, it would be great to see a KSTATE build on top of the
> current version of KHO. The V6 version of KHO uses a recursive FDT
> object. I see recursive FDT can map to the C struct description
> similar to the KSTATE field description nicely. However, that will
> require KSTATE to make some considerable changes to embrace the KHO
> v6. For example, the KSTATE uses one contiguous stream buffer and KHO
> V6 uses many recursive physical address object pointers for different
> objects. Maybe a KSTATE V3?
>
Yep, I'll take a look into combining KSTATE with KHO.
....
>
>> a_kho_restore()
>> {
>> ...
>> a.i = fdt_getprop(fdt, offset, "i", &len);
>> if (!a.i || len != sizeof(a.i))
>> goto err
>> *a.p_ulong = fdt_getprop....
>> }
>>
>> Each driver/subsystem has to solve this problem in their own way.
>> Also if we use fdt properties for individual fields, that might be wastefull
>> in terms of used memory, as these properties use strings as keys.
>
> Right, I need to write a lot of boilerplate code to do the per
> property save/restore. I am not worried too much about memory usage. A
> lot of string keys are not much longer than 8 bytes. The memory saving
> convert to binary index is not huge. I actually would suggest adding
> the string version of the field name to the description table, so that
> we can dump the state in KSTATE just like the YAML FDT output for
> debugging purposes. It is a very useful feature of FDT to dump the
> current saving state into a human readable form. KSTATE can have the
> same feature added.
>
kstate_field already has a string with the name of the field:
#define KSTATE_BASE_TYPE(_f, _state, _type) { \
.name = (__stringify(_f)), \
Currently it's not used in the code, but it's there for debug purposes.
>> While with KSTATE solves the same problem in more elegant way, with this:
>> struct kstate_description a_state = {
>> .name = "a_struct",
>> .version_id = 1,
>> .id = KSTATE_TEST_ID,
>> .state_list = LIST_HEAD_INIT(test_state.state_list),
>> .fields = (const struct kstate_field[]) {
>> KSTATE_BASE_TYPE(i, struct a, int),
>> KSTATE_BASE_TYPE(s, struct a, char [10]),
>> KSTATE_POINTER(p_ulong, struct a),
>> KSTATE_PAGE(page, struct a),
>> KSTATE_END_OF_LIST()
>> },
>> };
>>
>>
>> {
>> static unsigned long ulong
>> static struct a a_data = { .p_ulong = &ulong };
>>
>> kstate_register(&test_state, &a_data);
>> }
>>
>> The driver needs only to have a proper 'kstate_description' and call kstate_register()
>> to save/restore a_data.
>> Basically 'struct kstate_description' provides instructions how to save/restore 'struct a'.
>> And kstate_register() does all this save/restore stuff under the hood.
>
> It seems the KSTATE uses one contiguous stream and the object has to
> be loaded in the order it was saved. For the PCI code, the PCI device
> scanning and probing might cause the device load out of the order of
> saving. (The PCI probing is actually the reverse order of saving).
> This kstate_register() might pose restrictions on the restore order.
> PCI will need to look up and find the device state based on the PCI
> device ID. Other subsystems will likely have the requirement to look
> up their own saved state as well.
> I also see KSTATE can be extended to support that.
>
Absolutely agreed. I think we need to decouple restore and register, i.e. remove
restore_misgrate_state() from kstate_register(), and add an instance_id argument
to kstate_register(), so the PCI code could do:
kstate_register(&pci_state, pdev, PCI_DEVID(pdev->bus->number, pdev->devfn));
And at the probing stage (probably in pci_device_add()) call
kstate_restore(&pci_state, dev, PCI_DEVID(bus->number, dev->devfn))
which would locate the state for the device, if any, and restore it.
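The lookup half of that idea can be sketched in user-space C as follows. This
is only an illustration under assumptions: ks_entry_sketch and ks_find_entry
are hypothetical names, the real kstate_entry layout may differ, and the PCI
domain is ignored. ks_pci_devid() packs bus and devfn the same way the
kernel's PCI_DEVID() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical saved-entry header; the real kstate_entry may differ. */
struct ks_entry_sketch {
	uint32_t desc_id;	/* which description produced the entry */
	uint32_t instance_id;	/* e.g. a PCI devid for per-device state */
	const void *data;
	size_t size;
};

/* Same packing as PCI_DEVID(): bus in bits 15:8, devfn in bits 7:0. */
static uint32_t ks_pci_devid(uint8_t bus, uint8_t devfn)
{
	return ((uint32_t)bus << 8) | devfn;
}

/* Look up an entry by (desc_id, instance_id), independent of save order. */
static const struct ks_entry_sketch *
ks_find_entry(const struct ks_entry_sketch *e, size_t n,
	      uint32_t desc_id, uint32_t instance_id)
{
	assert(e || n == 0);
	for (size_t i = 0; i < n; i++)
		if (e[i].desc_id == desc_id && e[i].instance_id == instance_id)
			return &e[i];
	return NULL;	/* no saved state: treat the device as fresh */
}
```

Because the key is carried in each entry, the probe order after kexec no
longer has to match the save order.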
>> - Another bonus point - kstate can preserve migratable memory, which is required
>> to preserve guest memory
>>
>>
>> So now to the part how this works.
>>
>> State of kernel data (usually it's some struct) is described by the
>> 'struct kstate_description' containing the array of individual
>> fields descpriptions - 'struct kstate_field'. Each field
>> has set of bits in ->flags which instructs how to save/restore
>> a certain field of the struct. E.g.:
>> - KS_BASE_TYPE flag tells that field can be just copied by value,
>>
>> - KS_POINTER means that the struct member is a pointer to the actual
>> data, so it needs to be dereference before saving/restoring data
>> to/from kstate data steam.
>>
>> - KS_STRUCT - contains another struct, field->ksd must point to
>> another 'struct kstate_dscription'
>
> The field can't have both bits set for KS_BASE_TYPE and KS_STRUCT
> type, right? Some of these flag combinations do not make sense. This
> part might need more careful planning to keep it simple. Maybe some of
> the flags bits should be enum.
>
Yes, this needs more thought. Mutually exclusive flags could be moved into a separate enum field.
Some may not be needed at all, e.g. instead of KS_STRUCT we could just check if (field->ksd != NULL).
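One possible shape for that split, as a user-space sketch with made-up names
(ks_kind_sketch, KS_MOD_*, ks_field_valid), not a proposal for the final
layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical split: exactly one kind per field... */
enum ks_kind_sketch {
	KS_KIND_BASE,		/* copy by value */
	KS_KIND_STRUCT,		/* nested description in ->ksd */
	KS_KIND_CUSTOM,		/* ->save()/->restore() callbacks */
};

/* ...plus independent modifier bits that may be combined. */
#define KS_MOD_POINTER	(1u << 0)	/* dereference before copying */
#define KS_MOD_ADDRESS	(1u << 1)	/* store as offset, not raw value */

struct ks_desc_sketch;	/* opaque nested description */

struct ks_field_kind_sketch {
	enum ks_kind_sketch kind;
	unsigned int mods;
	const struct ks_desc_sketch *ksd;
};

/* Reject combinations that a flat bitmask could not rule out. */
static bool ks_field_valid(const struct ks_field_kind_sketch *f)
{
	assert(f);
	if ((f->kind == KS_KIND_STRUCT) != (f->ksd != NULL))
		return false;	/* ksd set if and only if nested struct */
	if (f->mods & ~(KS_MOD_POINTER | KS_MOD_ADDRESS))
		return false;	/* unknown modifier bit */
	return true;
}
```

With the kind as an enum, a field cannot be both a base type and a struct,
and the remaining bits only modify how the value is fetched or stored.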
>>
>> - KS_CUSTOM - Some non-trivial field that requires custom kstate_field->save()
>> ->restore() callbacks to save/restore data.
>>
>> - KS_ARRAY_OF_POINTER - array of pointers, the size of array determined by the
>> field->count() callback
>> - KS_ADDRESS - field is a pointer to either vmemmap area (struct page) or
>> linear address. Store offset
>
> I think we want to describe different stream types.
> For example the most simple stream container is just a contiguous
> buffer with start address and size.
> The more complex one might be and size then an array of page pointers,
> all those pointers add to up the new buffer which describe an saved
> KSTATE that is larger than 4K and spread into an array of pages. Those
> pages don't need to be contiguous. Such a page array buffer stores the
> KSTATE entry described by a separate description table.
>
Agreed, I had similar thoughts. But this complicates the code, so I started with
something simple.
>>
>> - KS_END - special flag indicating the end of migration stream data.
>>
>> kstate_register() call accepts kstate_description along with an instance
>> of an object and registers it in the global 'states' list.
>>
>> During kexec reboot phase we go through the list of 'kstate_description's
>> and each instance of kstate_description forms the 'struct kstate_entry'
>> which save into the kstate's data stream.
>>
>> The 'kstate_entry' contains information like ID of kstate_description, version
>> of it, size of migration data and the data itself. The ->data is formed in
>> accordance to the kstate_field's of the corresponding kstate_description.
>
> The version for the kstate_description might not be enough. The
> version works if there is a linear history. Here we are likely to have
> different vendors add their own extension to the device state saving.
I think vendors can just declare a separate kstate_description with a different ID.
The only problem with this is that kstate_description.id is an integer, so it would be
hard to allocate IDs without conflicts.
Perhaps change the id to a string? Then vendors can just add a vendor prefix to the ID.
> I suggest instead we save the old kernel's kstate_description table
> (once per description table as a recursive object) alongside the
> object physical address as well. The new kernel has their new version
> of the description table. It can compare between the old and new
> description tables and find out what fields need to be upgraded or
> downgraded. The new kernel will use the old kstate_description table
> to decode the previous kernel's saved object.
Hmm... I'm not sure, there is a lot to think about.
This might make changes in kstate_description painful,
e.g. if I want to rearrange some ->flags for whatever reason.
So how do we deal with changes in kstate_description itself?
How do we save links to the methods in kstate_field (->restore/->save/->count),
and what if we need to change the function prototypes of these methods?
> I think that way it is
> more flexible to support adding one or more features and not tight to
> which version has what feature. It can also make sure the new kernel
> can always dump the old KSTATE into YAML.
>
> That way we might be able to simplify the subsection and the
> depreciation flags. The new kernel doesn't need to carry the history
> of changes made to the old description table.
>
>> After the reboot, when the kstate_register() called it parses migration
>> stream, finds the appropriate 'kstate_entry' and restores the contents of
>> the object in accordance with kstate_description and ->fields.
>
> Again this restoring can happen in a different order when the PCI
> device scanning and probing order. The restoration might not happen in
> one single call chain. Material for V3?
Agreed.
> I am happy to work with you to get KSTATE working with the existing KHO effort.
>
Thanks for useful feedback, appreciated.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec
2025-05-05 14:35 ` Andrey Ryabinin
@ 2025-05-07 6:11 ` Chris Li
0 siblings, 0 replies; 16+ messages in thread
From: Chris Li @ 2025-05-07 6:11 UTC (permalink / raw)
To: Andrey Ryabinin
Cc: linux-kernel, Alexander Graf, James Gowans, Mike Rapoport,
Andrew Morton, linux-mm, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
Eric Biederman, kexec, Pratyush Yadav, Jason Gunthorpe,
Pasha Tatashin, David Rientjes
On Mon, May 5, 2025 at 7:37 AM Andrey Ryabinin <ryabinin.a.a@gmail.com> wrote:
> >
> > With that, it would be great to see a KSTATE build on top of the
> > current version of KHO. The V6 version of KHO uses a recursive FDT
> > object. I see recursive FDT can map to the C struct description
> > similar to the KSTATE field description nicely. However, that will
> > require KSTATE to make some considerable changes to embrace the KHO
> > v6. For example, the KSTATE uses one contiguous stream buffer and KHO
> > V6 uses many recursive physical address object pointers for different
> > objects. Maybe a KSTATE V3?
> >
>
> Yep, I'll take a look into combinig KSTATE with KHO.
Wonderful. KHO uses kho_preserve_folio() to mark a folio to be preserved.
After kexec, it uses kho_restore_folio/page() to restore a folio.
There is also kho_preserve_phys(); you should only use it for memory
that does not have a struct page, e.g. CMA.
You can take a look at the current version of LUO and replace the libFDT
usage with KSTATE; that would be a good starting point.
> > debugging purposes. It is a very useful feature of FDT to dump the
> > current saving state into a human readable form. KSTATE can have the
> > same feature added.
> >
>
> kstate_field already have string with name of the field:
>
> #define KSTATE_BASE_TYPE(_f, _state, _type) { \
> .name = (__stringify(_f)), \
>
> Currently it's not used in code, but it's there for debug purposes
That is good to know.
We need to have something like the FDT debugfs node for kstate.
>
>
> >> While with KSTATE solves the same problem in more elegant way, with this:
> >> struct kstate_description a_state = {
> >> .name = "a_struct",
> >> .version_id = 1,
> >> .id = KSTATE_TEST_ID,
> >> .state_list = LIST_HEAD_INIT(test_state.state_list),
> >> .fields = (const struct kstate_field[]) {
> >> KSTATE_BASE_TYPE(i, struct a, int),
> >> KSTATE_BASE_TYPE(s, struct a, char [10]),
> >> KSTATE_POINTER(p_ulong, struct a),
> >> KSTATE_PAGE(page, struct a),
> >> KSTATE_END_OF_LIST()
> >> },
> >> };
> >>
> >>
> >> {
> >> static unsigned long ulong
> >> static struct a a_data = { .p_ulong = &ulong };
> >>
> >> kstate_register(&test_state, &a_data);
> >> }
> >>
> >> The driver needs only to have a proper 'kstate_description' and call kstate_register()
> >> to save/restore a_data.
> >> Basically 'struct kstate_description' provides instructions how to save/restore 'struct a'.
> >> And kstate_register() does all this save/restore stuff under the hood.
> >
> > It seems the KSTATE uses one contiguous stream and the object has to
> > be loaded in the order it was saved. For the PCI code, the PCI device
> > scanning and probing might cause the device load out of the order of
> > saving. (The PCI probing is actually the reverse order of saving).
> > This kstate_register() might pose restrictions on the restore order.
> > PCI will need to look up and find the device state based on the PCI
> > device ID. Other subsystems will likely have the requirement to look
> > up their own saved state as well.
> > I also see KSTATE can be extended to support that.
> >
>
> Absolutely agreed. I think we need to decouple restore and register, ie remove
> restore_misgrate_state() from kstate_register(). Add instance_id argument to kstate_register(),
> so the PCI code could do:
> kstate_register(&pci_state, pdev, PCI_DEVID(pdev->bus->number, pdev->devfn));
The domain number is needed along with PCI_DEVID as well.
I expect to call something like kstate_save() in the LUO prepare callback.
LUO has a few stages. "Prepare" is where you save most of the state while the
VM is still running.
"Reboot" is where the VM is already paused; it is the last chance to save
anything before kexec.
> And on probing stage (probably in pci_device_add()) call
> kstate_restore(&pci_state, dev, PCI_DEVID(bus->number, dev->devfn))
It needs to happen before that: pci_setup_device() already needs
to know whether the device is keepalive or not. If the device is keepalive,
the PCI core will need to re-create the device state from an already
running device rather than initialize the device afresh. It needs to
avoid performing PCI config space writes to keepalive PCI devices.
>
> which would locate state for the device if any and restore it.
>
>
> >> - KS_CUSTOM - Some non-trivial field that requires custom kstate_field->save()
> >> ->restore() callbacks to save/restore data.
> >>
> >> - KS_ARRAY_OF_POINTER - array of pointers, the size of array determined by the
> >> field->count() callback
> >> - KS_ADDRESS - field is a pointer to either vmemmap area (struct page) or
> >> linear address. Store offset
> >
> > I think we want to describe different stream types.
> > For example the most simple stream container is just a contiguous
> > buffer with start address and size.
> > A more complex one might be a size followed by an array of page pointers.
> > Together those pages form a buffer that describes a saved
> > KSTATE that is larger than 4K and spread across an array of pages. Those
> > pages don't need to be contiguous. Such a page array buffer stores the
> > KSTATE entry described by a separate description table.
>
> Agreed, I had similar thoughts. But this complicates code, so I started with
> something simple.
>
Yes, we can start simple. Right now KSTATE only follows pointers in
the kernel C structs, not in the saved state objects. We will need
to support pointer following in the saved state objects as well. That
is something very different from typical message serialization. In
the kernel it is easier to spread the state buffer into recursive C
structs that are stored in different pages. The kstate stream will be more
or less like a tree of objects.
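A minimal sketch of the two container types discussed above, assuming a 4K page size. All names here (stream_desc, stream_read, SKETCH_PAGE_SIZE) are made up for illustration and are not from the patches; it only shows that a reader can treat a contiguous buffer and a non-contiguous page array the same way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

enum stream_kind { STREAM_CONTIG, STREAM_PAGE_ARRAY };

/* Hypothetical stream descriptor covering both container types. */
struct stream_desc {
	enum stream_kind kind;
	size_t size;				/* total payload bytes */
	union {
		const void *start;		/* STREAM_CONTIG: one buffer */
		const void *const *pages;	/* STREAM_PAGE_ARRAY: page pointers */
	} u;
};

/* Copy up to out_len payload bytes into out; returns the bytes copied. */
static size_t stream_read(const struct stream_desc *d, void *out, size_t out_len)
{
	size_t left, done = 0;

	assert(d && out);
	left = d->size < out_len ? d->size : out_len;

	if (d->kind == STREAM_CONTIG) {
		memcpy(out, d->u.start, left);
		return left;
	}

	/* Gather page by page; the pages need not be contiguous. */
	for (size_t i = 0; done < left; i++) {
		size_t chunk = left - done;

		if (chunk > SKETCH_PAGE_SIZE)
			chunk = SKETCH_PAGE_SIZE;
		memcpy((char *)out + done, d->u.pages[i], chunk);
		done += chunk;
	}
	return done;
}
```

A tree of objects would then be descriptors like this pointing at payloads that themselves contain further descriptors.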
> >>
> >> The 'kstate_entry' contains information like ID of kstate_description, version
> >> of it, size of migration data and the data itself. The ->data is formed in
> >> accordance to the kstate_field's of the corresponding kstate_description.
> >
> > The version for the kstate_description might not be enough. The
> > version works if there is a linear history. Here we are likely to have
> > different vendors add their own extension to the device state saving.
>
> I think vendors can just declare separate kstate_description with different ID.
> The only problem with this, is that kstate_description.id is integer, so it would be
> a problem to allocate those without conflicts.
> Perhaps change the id to string? So vendors can just add vendor prefix to ID.
In my mind, the ID is a number that is unique within a struct type. Once
a number is assigned to a member field, it cannot be reused for any
other field in that struct; it stays with that field
for life. If the field gets deleted, its number still cannot
be reused by another member of that struct. Each field will also
have a string name for debugging purposes.
Different structs can use the same ID number, but it will have a different
meaning. Just like (struct a*)->foo has a different meaning than
(struct b*)->foo, they are in different name spaces.
If a vendor wants to make sure it never conflicts with upstream, it
should allocate the IDs in a vendor-specific struct. That way the
vendor gets its own struct name space. The struct has a
text name, so it will not conflict with other vendors. The field ID will
remain a number.
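As a rough illustration of that numbering scheme: the struct and field names below (pci_dev, acme_pci_ext, fw_cookie and so on) are invented for the example, and a NULL name is just one possible way to mark a retired ID.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct field_desc {
	unsigned int id;	/* never reused within the owning struct */
	const char *name;	/* debug only; NULL marks a retired ID */
};

struct struct_desc {
	const char *name;	/* text name: the name space for field IDs */
	const struct field_desc *fields;
	size_t nr_fields;
};

static const struct field_desc pci_fields[] = {
	{ 1, "command" },
	{ 2, NULL },		/* field deleted; ID 2 stays reserved */
	{ 3, "msi_state" },
};

static const struct field_desc acme_fields[] = {
	{ 1, "fw_cookie" },	/* ID 1 again, but in the vendor's name space */
};

static const struct struct_desc descs[] = {
	{ "pci_dev", pci_fields, 3 },
	{ "acme_pci_ext", acme_fields, 1 },
};

/* Resolve (struct name, field ID); NULL for unknown or retired IDs. */
static const char *field_name(const char *struct_name, unsigned int id)
{
	assert(struct_name != NULL);

	for (size_t i = 0; i < sizeof(descs) / sizeof(descs[0]); i++) {
		if (strcmp(descs[i].name, struct_name) != 0)
			continue;
		for (size_t j = 0; j < descs[i].nr_fields; j++)
			if (descs[i].fields[j].id == id)
				return descs[i].fields[j].name;
	}
	return NULL;
}
```

Field ID 1 resolves differently in the two structs, and the retired ID 2 stays occupied so a later field cannot take it.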
> > I suggest instead we save the old kernel's kstate_description table
> > (once per description table as a recursive object) alongside the
> > object physical address as well. The new kernel has their new version
> > of the description table. It can compare between the old and new
> > description tables and find out what fields need to be upgraded or
> > downgraded. The new kernel will use the old kstate_description table
> > to decode the previous kernel's saved object.
>
> Hmm.. I'm not sure, there is a lot to think about.
> This might make changes in kstate_description painful,
Let me clarify. I don't mean to save the V2.1 kstate_description as it
is. The current kstate_description has a type system and a run time
portion as well. The run time portion is not saved. Only the type
system and possibly the value description (min, max, enum, etc.) are saved.
> e.g. if I want to rearrange some ->flags for whatever reason.
If possible, it would be best not to save those flags. However, if a flag
is used to describe how the wire format object is laid out, it will
have to be saved and becomes part of the ABI as well. There is no way to get
around that even without saving the descriptor table.
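To sketch how a saved type-system-only table could drive decoding: struct wire_field and migrate_object() below are hypothetical, and matching on ID plus size is a simplification of whatever upgrade rules we end up with.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Saved part of a description: layout information only, no callbacks. */
struct wire_field {
	uint32_t id;		/* stable field ID */
	uint32_t offset;	/* offset of the field inside the object */
	uint32_t size;		/* size of the field in bytes */
};

/*
 * Decode old_obj with the old kernel's table and place each field at the
 * offset the new kernel's table gives for the same ID. Fields present only
 * in the old table are dropped; fields present only in the new table keep
 * whatever value new_obj already holds. Returns the number of fields copied.
 */
static size_t migrate_object(const struct wire_field *old_tbl, size_t n_old,
			     const void *old_obj,
			     const struct wire_field *new_tbl, size_t n_new,
			     void *new_obj)
{
	size_t copied = 0;

	assert(old_obj && new_obj);

	for (size_t i = 0; i < n_old; i++) {
		for (size_t j = 0; j < n_new; j++) {
			if (old_tbl[i].id != new_tbl[j].id ||
			    old_tbl[i].size != new_tbl[j].size)
				continue;
			memcpy((char *)new_obj + new_tbl[j].offset,
			       (const char *)old_obj + old_tbl[i].offset,
			       old_tbl[i].size);
			copied++;
			break;
		}
	}
	return copied;
}
```

Because the old table travels with the data, the new kernel does not have to remember how older versions laid the object out.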
> So how to deal with changes in kstate_description itself?
The members of the kstate_description itself will have a description
table to describe them as well. There will be a minimal set of
kstate_description features to describe other capabilities.
The kstate_description can have a capability array (inspired by the PCI
capabilities) declaring what it supports.
For example, the basic version only supports two buffer container types:
1) pointer with size. 2) <array counter n> + linear array layout in
memory of n elements.
The new kernel adds a container type 3) singly linked list of pointer
arrays. It points to a page containing {<next list page pfn> + <array
counter n> + 510<page pfn array>}. The ptr list array can describe a
kvmalloc buffer without recursively allocating page pointer arrays. The
new kernel can detect that the old kernel does not have this capability.
When rolling back to the old kernel, it does not use the ptr list array to
write out saved state objects.
The description table will have a version number as well, as a last
defense. If we have to introduce a change that breaks the description
table compatibility, we can bump that version. That is the only place
that needs a version number. Other capabilities should always be described
using the capability feature set.
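A small sketch of container type 3 and the capability check, assuming 4K pages and 64-bit pfns. The capability bit names and helper functions are invented for illustration. The 510 comes out of the arithmetic: 8 bytes for the next pfn, 8 bytes for the counter, and 510 * 8 bytes of pfns fill exactly one page.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u
#define PTR_LIST_SLOTS   510u

/* Hypothetical capability bits, one per buffer container type. */
enum container_cap {
	CAP_PTR_SIZE     = 1u << 0,	/* 1) pointer with size */
	CAP_LINEAR_ARRAY = 1u << 1,	/* 2) counter + linear array */
	CAP_PTR_LIST     = 1u << 2,	/* 3) linked list of pfn arrays */
};

/* One page of the ptr list: {next pfn, counter, 510 pfns}. */
struct ptr_list_page {
	uint64_t next_pfn;		/* next list page, 0 terminates */
	uint64_t count;			/* valid entries in pfn[] */
	uint64_t pfn[PTR_LIST_SLOTS];
};

static_assert(sizeof(struct ptr_list_page) == SKETCH_PAGE_SIZE,
	      "a ptr list page must fill exactly one page");

/* The writer may only use container types the reader also understands. */
static uint32_t usable_caps(uint32_t writer_caps, uint32_t reader_caps)
{
	return writer_caps & reader_caps;
}

/* Number of list pages needed to describe nr_data_pages data pages. */
static uint64_t ptr_list_pages_needed(uint64_t nr_data_pages)
{
	return (nr_data_pages + PTR_LIST_SLOTS - 1) / PTR_LIST_SLOTS;
}
```

When rolling back, the new kernel would mask its own capabilities with the old kernel's set and fall back to container types 1 and 2.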
>
> How do we save links to methods in kstate_field (->restore/->save/->count),
> and what if we'll need to change function prototypes of these methods ?
Those I consider as run time behavior. I hope we don't need to save
the function pointers.
Chris
> > I think that way it is more flexible to support adding one or more features and is not tied to
> > which version has what feature. It can also make sure the new kernel
> > can always dump the old KSTATE into YAML.
> >
> > That way we might be able to simplify the subsection and the
> > deprecation flags. The new kernel doesn't need to carry the history
> > of changes made to the old description table.
> >
> >> After the reboot, when the kstate_register() called it parses migration
> >> stream, finds the appropriate 'kstate_entry' and restores the contents of
> >> the object in accordance with kstate_description and ->fields.
> >
> > Again, this restoring can happen in a different order than the saving,
> > given the PCI device scanning and probing order. The restoration might not happen in
> > one single call chain. Material for V3?
>
> Agreed.
>
> > I am happy to work with you to get KSTATE working with the existing KHO effort.
> >
> Thanks for useful feedback, appreciated.
Thread overview: 16+ messages
2025-03-10 12:03 [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 1/7] kstate: Add kstate - a mechanism to describe and migrate " Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 2/7] kstate, kexec, x86: transfer kstate data " Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 3/7] kexec: exclude control pages from the destination addresses Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 4/7] kexec, kstate: delay loading of kexec segments Andrey Ryabinin
2025-03-11 11:31 ` kernel test robot
2025-03-11 12:25 ` kernel test robot
2025-03-10 12:03 ` [PATCH v2 5/7] x86, kstate: Add the ability to preserve memory pages across kexec Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 6/7] kexec, kstate: save kstate data before kexec'ing Andrey Ryabinin
2025-03-10 12:03 ` [PATCH v2 7/7] kstate, test: add test module for testing kstate subsystem Andrey Ryabinin
2025-03-11 2:27 ` [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec Cong Wang
2025-03-11 12:19 ` Andrey Ryabinin
2025-04-28 23:01 ` Chris Li
2025-04-28 23:01 ` Chris Li
2025-05-05 14:35 ` Andrey Ryabinin
2025-05-07 6:11 ` Chris Li