* [RFC v1 0/3] Live Update Orchestrator
@ 2025-03-20  2:40 Pasha Tatashin
  2025-03-20  2:40 ` [RFC v1 1/3] luo: " Pasha Tatashin
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Pasha Tatashin @ 2025-03-20  2:40 UTC (permalink / raw)
  To: changyuanl, graf, pasha.tatashin, rppt, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song,
	zhangguopeng, linux, linux-kernel, linux-doc, linux-mm, gregkh,
	tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, jgowans, jgg

From: Pasha Tatashin <tatashin@google.com>

This series applies on top of the kho v5 patch series:
https://lore.kernel.org/all/20250320015551.2157511-1-changyuanl@google.com

The git branch for this series:
https://github.com/googleprodkernel/linux-liveupdate/commits/luo/rfc-v1

What is Live Update?
Live Update is a specialized reboot process where selected devices are
kept operational across a kernel transition. For these devices, DMA and
interrupt activity may continue uninterrupted during the kernel reboot.

Please find attached a series of three patches introducing the Live
Update Orchestrator (LUO), a new kernel subsystem designed to
facilitate live kernel updates with minimal downtime. The primary
use case is in cloud environments, allowing hypervisor updates without
fully disrupting running virtual machines by keeping selected devices
alive across the reboot boundary. This series also introduces a device
layer infrastructure (dev_liveupdate) to be used with LUO.

The core of LUO is a state machine that tracks the progress of a live
update, along with a callback API that allows other kernel subsystems
to participate in the process. Example subsystems that can hook into LUO
include: kvm, iommu, interrupts, the Device Layer (through the
dev_liveupdate infrastructure introduced in patch 2), and mm.

LUO uses KHO to transfer memory state from Old Kernel to the New Kernel.

LUO can be controlled through a sysfs interface. It provides the following
files under: `/sys/kernel/liveupdate/{state, prepare, finish}`

The `state` file can contain the following values:

normal
The system is operating normally, and no live update is in progress.
This is the initial state.

prepared
The system has begun preparing for a live update. This state is reached
after subsystems have successfully responded to the `LIVEUPDATE_PREPARE`
callback. It indicates that initial preparation is done, but it does not
necessarily mean all state has been serialized; subsystems can save more
state during the subsequent `LIVEUPDATE_REBOOT` callback.

updated
The new kernel has successfully taken over, and any suspended operations
are resumed. However, the system has not yet fully transitioned back to
a normal operational state; this happens after the `LIVEUPDATE_FINISH`
callback is invoked.

Writing '1' to the `prepare` file triggers a transition from normal
to prepared (if possible), which involves invoking the
`LIVEUPDATE_PREPARE` notifiers. Similarly, writing '1' to the `finish`
file attempts a transition from updated to normal via the
`LIVEUPDATE_FINISH` notifiers.

The state machine ensures that operations are performed in the correct
sequence, provides a mechanism to track and recover from potential
failures, and allows selecting the devices and subsystems that should
participate in the live update sequence.
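
As an illustration of the transitions described above, here is a minimal
userspace sketch of the state machine. It is not the kernel implementation:
the `luo_model_*` names, the `-EINVAL` error code, and the explicit reboot
step are assumptions made for this example. The cancel path (prepared back
to normal) follows the ABI description in patch 1.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model of the three LUO states. */
enum luo_model_state {
	LUO_MODEL_NORMAL,
	LUO_MODEL_PREPARED,
	LUO_MODEL_UPDATED,
};

/* Strings as they would be read from the `state` file. */
const char *luo_model_state_name(enum luo_model_state s)
{
	static const char *const names[] = { "normal", "prepared", "updated" };

	return names[s];
}

/* Models writing '1' to `prepare`: normal -> prepared. */
int luo_model_prepare(enum luo_model_state *s)
{
	assert(s != NULL);
	if (*s != LUO_MODEL_NORMAL)
		return -EINVAL;	/* error code is an assumption of this sketch */
	*s = LUO_MODEL_PREPARED;
	return 0;
}

/* Models cancelling a preparation: prepared -> normal. */
int luo_model_cancel(enum luo_model_state *s)
{
	assert(s != NULL);
	if (*s != LUO_MODEL_PREPARED)
		return -EINVAL;
	*s = LUO_MODEL_NORMAL;
	return 0;
}

/* Models the live update reboot: the new kernel comes up in updated. */
int luo_model_reboot(enum luo_model_state *s)
{
	assert(s != NULL);
	if (*s != LUO_MODEL_PREPARED)
		return -EINVAL;
	*s = LUO_MODEL_UPDATED;
	return 0;
}

/* Models writing '1' to `finish`: updated -> normal. */
int luo_model_finish(enum luo_model_state *s)
{
	assert(s != NULL);
	if (*s != LUO_MODEL_UPDATED)
		return -EINVAL;
	*s = LUO_MODEL_NORMAL;
	return 0;
}
```

In this model every operation is valid from exactly one state, which is
what lets the orchestrator reject out-of-order requests.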

==============
dev_liveupdate
==============

To allow device drivers and bus drivers to participate, the second patch
introduces the `dev_liveupdate` infrastructure. This provides a
`liveupdate()` callback in `struct device_driver` and `struct bus_type`,
which receives the LUO state machine events.

The `dev_liveupdate` component also adds a "liveupdate" sysfs directory
under each device (e.g., `/sys/devices/.../device/liveupdate/`). This
directory contains the following attributes:

`requested`
A read-write attribute allowing userspace to control whether a device
should participate in the live update sequence. Writing `1` requests the
device and its ancestors (that support live update) be preserved.
Writing `0` requests the device be excluded. This attribute can only be
modified when LUO is in the `normal` state.

`preserved`
A read-only attribute indicating whether the device's state was
preserved during the `prepare` and `reboot` stages.

`reclaimed`
A read-only attribute indicating whether the device was successfully
re-attached and resumed operation in the new kernel after an update.
For example, a VM to which this device was passed through has been resumed.

By default, devices do not participate in the live update. Userspace can
explicitly request participation by writing '1' to the `requested` file.
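
A userspace agent could drive these attributes with a small helper such as
the sketch below. The helper names (`luo_attr_write`, `luo_attr_read`) are
hypothetical, and the attribute path is taken as a parameter because the
full device path is system-specific.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Write a short string to an attribute file; returns 0 or a negative errno. */
int luo_attr_write(const char *path, const char *value)
{
	FILE *f;
	int err = 0;

	assert(path != NULL && value != NULL);
	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fputs(value, f) == EOF)
		err = -EIO;
	/* Buffered data is flushed on close, so check fclose() as well. */
	if (fclose(f) == EOF && !err)
		err = -errno;
	return err;
}

/* Read the first line of an attribute file into buf, without the newline. */
int luo_attr_read(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(path != NULL && buf != NULL && len > 0);
	f = fopen(path, "r");
	if (!f)
		return -errno;
	if (!fgets(buf, (int)len, f))
		buf[0] = '\0';	/* empty file: report an empty string */
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

A caller would pass the path of a device's `liveupdate/requested` attribute
and the value "1" to request preservation, then read `preserved` and
`reclaimed` back with `luo_attr_read()`.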

TODO:
- Expand, improve, clean-up documentation
- Embed a flow chart via Graphviz
- Add selftests for LUO and dev_liveupdate
- Add debug interface to allow LUO to perform LIVEUPDATE_REBOOT via sysfs
  to help developers of subsystems and device drivers.
- dev_liveupdate should add KHO node names to dev / drivers/ bus, and also
  dev->lu should contain a link to a KHO node for this device that is allocated
  and freed through dev_liveupdate
- dev_liveupdate should also participate during boot to track the reclaimed
  devices

Pasha Tatashin (3):
  luo: Live Update Orchestrator
  luo: dev_liveupdate: Add device live update infrastructure
  luo: x86: Enable live update support

 .../ABI/testing/sysfs-kernel-liveupdate       |  51 ++
 Documentation/admin-guide/index.rst           |   1 +
 Documentation/admin-guide/liveupdate.rst      |  23 +
 Documentation/driver-api/index.rst            |   1 +
 Documentation/driver-api/liveupdate.rst       |  23 +
 MAINTAINERS                                   |  13 +
 arch/x86/Kconfig                              |   1 +
 drivers/base/Makefile                         |   1 +
 drivers/base/core.c                           |  25 +-
 drivers/base/dev_liveupdate.c                 | 816 ++++++++++++++++++
 include/linux/dev_liveupdate.h                | 109 +++
 include/linux/device.h                        |   6 +
 include/linux/device/bus.h                    |   4 +
 include/linux/device/driver.h                 |   4 +
 include/linux/liveupdate.h                    | 238 +++++
 init/Kconfig                                  |   2 +
 kernel/Kconfig.liveupdate                     |  19 +
 kernel/Makefile                               |   1 +
 kernel/liveupdate.c                           | 749 ++++++++++++++++
 kernel/reboot.c                               |   4 +
 20 files changed, 2083 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-liveupdate
 create mode 100644 Documentation/admin-guide/liveupdate.rst
 create mode 100644 Documentation/driver-api/liveupdate.rst
 create mode 100644 drivers/base/dev_liveupdate.c
 create mode 100644 include/linux/dev_liveupdate.h
 create mode 100644 include/linux/liveupdate.h
 create mode 100644 kernel/Kconfig.liveupdate
 create mode 100644 kernel/liveupdate.c

-- 
2.49.0.395.g12beb8f557-goog




* [RFC v1 1/3] luo: Live Update Orchestrator
  2025-03-20  2:40 [RFC v1 0/3] Live Update Orchestrator Pasha Tatashin
@ 2025-03-20  2:40 ` Pasha Tatashin
  2025-03-20 13:39   ` Andy Shevchenko
  2025-03-20 14:43   ` Jason Gunthorpe
  2025-03-20  2:40 ` [RFC v1 2/3] luo: dev_liveupdate: Add device live update infrastructure Pasha Tatashin
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 21+ messages in thread
From: Pasha Tatashin @ 2025-03-20  2:40 UTC (permalink / raw)
  To: changyuanl, graf, pasha.tatashin, rppt, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song,
	zhangguopeng, linux, linux-kernel, linux-doc, linux-mm, gregkh,
	tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, jgowans, jgg

Introduces the Live Update Orchestrator (LUO), a new kernel subsystem
designed to facilitate live updates. Live update is a method to reboot
the kernel while attempting to keep selected devices alive across the
reboot boundary, minimizing downtime.

The primary use case is cloud environments, allowing hypervisor updates
without fully disrupting running virtual machines. VMs can be suspended
while the hypervisor kernel reboots, and devices attached to these VMs
are kept operational by the LUO.

Features introduced:

- Core orchestration logic for managing the live update process.
- A state machine (NORMAL, PREPARED, UPDATED, *_FAILED) to track
  the progress of live updates.
- Notifier chains for subsystems (device layer, interrupts, KVM, IOMMU,
  etc.) to register callbacks for different live update events:
    - LIVEUPDATE_PREPARE: Prepare for reboot (before blackout).
    - LIVEUPDATE_REBOOT: Final serialization before kexec (blackout).
    - LIVEUPDATE_FINISH: Cleanup after update (after blackout).
    - LIVEUPDATE_CANCEL: Rollback actions on failure or user request.
- A sysfs interface (/sys/kernel/liveupdate/) for user-space control:
    - `prepare`: Initiate preparation (write 1) or reset (write 0).
    - `finish`: Finalize update in new kernel (write 1).
    - `cancel`: Abort ongoing preparation or reboot (write 1).
    - `reset`: Force state back to normal (write 1).
    - `state`: Read-only view of the current LUO state.
    - `enabled`: Read-only view of whether live update is enabled.
- Integration with KHO to pass orchestrator state to the new kernel.
- Version checking during startup of the new kernel to ensure
  compatibility with the previous kernel's live update state.

This infrastructure allows various kernel subsystems to coordinate and
participate in the live update process, serializing and restoring device
state across a kernel reboot.
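
The rollback behaviour of the PREPARE and REBOOT events can be sketched in
userspace as follows. This is a simplified model, not the kernel notifier
API: the `model_*` names, the plain callback array, and the sample
subsystems are assumptions made for illustration. PREPARE failures cancel
only the callbacks that already ran, while REBOOT failures cancel everyone.

```c
#include <assert.h>
#include <stddef.h>

enum model_event { MODEL_PREPARE, MODEL_REBOOT, MODEL_FINISH, MODEL_CANCEL };

typedef int (*model_cb)(enum model_event ev);

/*
 * PREPARE with rollback: stop at the first failure and send CANCEL to the
 * callbacks that already saw PREPARE. The failing callback is not cancelled.
 */
int model_notify_prepare(const model_cb *cbs, size_t n)
{
	assert(cbs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int err = cbs[i](MODEL_PREPARE);

		if (err) {
			for (size_t j = 0; j < i; j++)
				cbs[j](MODEL_CANCEL);
			return err;
		}
	}
	return 0;
}

/*
 * REBOOT: stop at the first failure, then send CANCEL to every callback so
 * that already prepared subsystems also return to normal.
 */
int model_notify_reboot(const model_cb *cbs, size_t n)
{
	assert(cbs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int err = cbs[i](MODEL_REBOOT);

		if (err) {
			for (size_t j = 0; j < n; j++)
				cbs[j](MODEL_CANCEL);
			return err;
		}
	}
	return 0;
}

/* Sample subsystems for illustration; they only count the events they see. */
int model_seen[3][4];

int model_sub0(enum model_event ev)
{
	model_seen[0][ev]++;
	return 0;
}

int model_sub1(enum model_event ev)
{
	model_seen[1][ev]++;
	return ev == MODEL_PREPARE ? -5 : 0;	/* fails PREPARE only */
}

int model_sub2(enum model_event ev)
{
	model_seen[2][ev]++;
	return 0;
}
```

With the chain { model_sub0, model_sub1, model_sub2 }, a PREPARE run stops
at model_sub1, cancels model_sub0, and never reaches model_sub2.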

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 .../ABI/testing/sysfs-kernel-liveupdate       |  51 ++
 Documentation/admin-guide/index.rst           |   1 +
 Documentation/admin-guide/liveupdate.rst      |  23 +
 MAINTAINERS                                   |  10 +
 include/linux/liveupdate.h                    | 238 ++++++
 init/Kconfig                                  |   2 +
 kernel/Kconfig.liveupdate                     |  19 +
 kernel/Makefile                               |   1 +
 kernel/liveupdate.c                           | 749 ++++++++++++++++++
 kernel/reboot.c                               |   4 +
 10 files changed, 1098 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-liveupdate
 create mode 100644 Documentation/admin-guide/liveupdate.rst
 create mode 100644 include/linux/liveupdate.h
 create mode 100644 kernel/Kconfig.liveupdate
 create mode 100644 kernel/liveupdate.c

diff --git a/Documentation/ABI/testing/sysfs-kernel-liveupdate b/Documentation/ABI/testing/sysfs-kernel-liveupdate
new file mode 100644
index 000000000000..92f4f745163f
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-liveupdate
@@ -0,0 +1,51 @@
+What:		/sys/kernel/liveupdate/
+Date:		March 2025
+KernelVersion:	6.14.0
+Contact:	pasha.tatashin@soleen.com
+Description:	Interface to control and query the live update orchestrator.
+		Live update is a feature that allows rebooting the kernel
+		without resetting selected devices. This is needed, for
+		example, to do a quick hypervisor update without terminating
+		virtual machines.
+
+What:		/sys/kernel/liveupdate/state
+Date:		March 2025
+KernelVersion:	6.14.0
+Contact:	pasha.tatashin@soleen.com
+Description:	Read only file that contains the current live update state.
+
+		The state can be one of the following:
+
+		normal: no live update in progress.
+		prepared: live update is prepared for reboot.
+		updated: rebooted to a new kernel, live update can be finished
+		by echoing 1 into finish file.
+
+What:		/sys/kernel/liveupdate/prepare
+Date:		March 2025
+KernelVersion:	6.14.0
+Contact:	pasha.tatashin@soleen.com
+Description:	Write-only file that notifies the devices about an upcoming
+		live update reboot or cancels it.
+		Writing '1' to this file changes the live update state from
+		"normal" to "prepared".
+		Internally, all drivers that implement liveupdate callback are
+		notified by calling this function with LIVEUPDATE_PREPARE
+		parameter. If any liveupdate() callback fails, the state is not
+		changed, and all already notified subsystems are notified via
+		liveupdate(LIVEUPDATE_CANCEL) prior to returning to userspace.
+		Writing '0' to this file changes the live update state from
+		"prepared" back to "normal" state by notifying all registered
+		subsystems via liveupdate(LIVEUPDATE_CANCEL) callback.
+
+What:		/sys/kernel/liveupdate/finish
+Date:		March 2025
+KernelVersion:	6.14.0
+Contact:	pasha.tatashin@soleen.com
+Description:	Write-only file that notifies the devices that live update
+		has been completed.
+		Writing '1' to this file changes the live update state from
+		"updated" to "normal" state.
+		Internally, all drivers that implement liveupdate callback are
+		notified by calling this function with LIVEUPDATE_FINISH
+		parameter.
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index c8af32a8f800..049f18034e10 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -95,6 +95,7 @@ likely to be of interest on almost any system.
    cgroup-v2
    cgroup-v1/index
    cpu-load
+   liveupdate
    mm/index
    module-signing
    namespaces/index
diff --git a/Documentation/admin-guide/liveupdate.rst b/Documentation/admin-guide/liveupdate.rst
new file mode 100644
index 000000000000..f66e4e13f50b
--- /dev/null
+++ b/Documentation/admin-guide/liveupdate.rst
@@ -0,0 +1,23 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========
+Live Update
+===========
+:Author: Pasha Tatashin <pasha.tatashin@soleen.com>
+
+Live Update Orchestrator (LUO)
+==============================
+.. kernel-doc:: kernel/liveupdate.c
+   :doc: Live Update Orchestrator (LUO)
+
+Public API
+==========
+.. kernel-doc:: include/linux/liveupdate.h
+
+.. kernel-doc:: kernel/liveupdate.c
+   :export:
+
+Internal API
+============
+.. kernel-doc:: kernel/liveupdate.c
+   :internal:
diff --git a/MAINTAINERS b/MAINTAINERS
index d0df0b380e34..32257bde9647 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13481,6 +13481,16 @@ F:	kernel/module/livepatch.c
 F:	samples/livepatch/
 F:	tools/testing/selftests/livepatch/
 
+LIVE UPDATE
+M:	Pasha Tatashin <pasha.tatashin@soleen.com>
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+F:	Documentation/ABI/testing/sysfs-kernel-liveupdate
+F:	Documentation/admin-guide/liveupdate.rst
+F:	include/linux/liveupdate.h
+F:	kernel/Kconfig.liveupdate
+F:	kernel/liveupdate.c
+
 LLC (802.2)
 L:	netdev@vger.kernel.org
 S:	Odd fixes
diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
new file mode 100644
index 000000000000..66c4e9d28a4a
--- /dev/null
+++ b/include/linux/liveupdate.h
@@ -0,0 +1,238 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+#ifndef _LINUX_LIVEUPDATE_H
+#define _LINUX_LIVEUPDATE_H
+
+#include <linux/compiler.h>
+#include <linux/notifier.h>
+
+/**
+ * enum liveupdate_event - Events that trigger live update callbacks.
+ * @LIVEUPDATE_PREPARE: Sent when the live update process is initiated via
+ *                      sysfs by writing '1' into
+ *                      ``/sys/kernel/liveupdate/prepare``. This happens
+ *                      *before* the blackout window. Subsystems should prepare
+ *                      for an upcoming reboot by serializing their states.
+ *                      However, it must be considered that user applications,
+ *                      e.g. virtual machines are still running during this
+ *                      phase.
+ * @LIVEUPDATE_REBOOT:  Sent from the reboot() syscall, when the old kernel is
+ *                      on its way out. This is the final opportunity for
+ *                      subsystems to save any state that must persist across
+ *                      the reboot. Callbacks for this event are part of the
+ *                      blackout window and must be fast.
+ * @LIVEUPDATE_FINISH:  Sent in the newly booted kernel after a successful live
+ *                      update and *after* the blackout window. This event is
+ *                      initiated by writing '1' into
+ *                      ``/sys/kernel/liveupdate/finish``. Subsystems should
+ *                      perform any final cleanup during this phase. This phase
+ *                      also provides an opportunity to clean up devices that
+ *                      were preserved but never explicitly reclaimed during the
+ *                      live update process. State restoration should have
+ *                      already occurred before this event. Callbacks for this
+ *                      event must not fail. The completion of this call
+ *                      transitions the machine from ``updated`` to ``normal``
+ *                      state.
+ * @LIVEUPDATE_CANCEL:  Sent if the LIVEUPDATE_PREPARE or LIVEUPDATE_REBOOT
+ *                      stage fails. Subsystems should revert any actions taken
+ *                      during the corresponding prepare phase. Callbacks for
+ *                      this event must not fail.
+ *
+ * These events represent the different stages and actions within the live
+ * update process that subsystems (like device drivers and bus drivers)
+ * need to be aware of to correctly serialize and restore their state.
+ *
+ */
+enum liveupdate_event {
+	LIVEUPDATE_PREPARE,
+	LIVEUPDATE_REBOOT,
+	LIVEUPDATE_FINISH,
+	LIVEUPDATE_CANCEL,
+};
+
+/**
+ * enum liveupdate_state - Defines the possible states of the live update
+ * orchestrator.
+ * @LIVEUPDATE_STATE_NORMAL:         Default state, no live update in progress.
+ * @LIVEUPDATE_STATE_PREPARED:       Live update is prepared for reboot; the
+ *                                   LIVEUPDATE_PREPARE callbacks have completed
+ *                                   successfully.
+ *                                   Devices might operate in a limited state;
+ *                                   for example, participating devices might
+ *                                   not be allowed to unbind, and setting up
+ *                                   new DMA mappings might be disabled in this
+ *                                   state.
+ * @LIVEUPDATE_STATE_UPDATED:        The system has rebooted into a new kernel
+ *                                   via live update and is now running the new
+ *                                   kernel, awaiting the finish stage.
+ *
+ * These states track the progress and outcome of a live update operation.
+ */
+enum liveupdate_state  {
+	LIVEUPDATE_STATE_NORMAL,
+	LIVEUPDATE_STATE_PREPARED,
+	LIVEUPDATE_STATE_UPDATED,
+};
+
+/**
+ * enum liveupdate_cb_priority - Priority levels for live update notifiers.
+ * @LIVEUPDATE_CB_PRIO_BEFORE_DEVICES: Callbacks with this priority will be
+ *                                     executed before the device layer
+ *                                     callbacks.
+ * @LIVEUPDATE_CB_PRIO_WITH_DEVICES:   Callbacks with this priority will be
+ *                                     executed at the same time as the device
+ *                                     layer callbacks.
+ * @LIVEUPDATE_CB_PRIO_AFTER_DEVICES:  Callbacks with this priority will be
+ *                                     executed after the device layer
+ *                                     callbacks.
+ *
+ * This enum defines the priority levels for notifier callbacks registered with
+ * the live update orchestrator. It allows subsystems to control the order in
+ * which their callbacks are executed relative to other subsystems during the
+ * live update process.
+ */
+enum liveupdate_cb_priority {
+	LIVEUPDATE_CB_PRIO_BEFORE_DEVICES,
+	LIVEUPDATE_CB_PRIO_WITH_DEVICES,
+	LIVEUPDATE_CB_PRIO_AFTER_DEVICES,
+};
+
+#ifdef CONFIG_LIVEUPDATE
+
+/* Called during reboot to notify subsystems to complete serialization */
+int liveupdate_reboot(void);
+
+/*
+ * Return true if machine is in updated state (i.e. live update boot in
+ * progress)
+ */
+bool liveupdate_state_updated(void);
+
+/*
+ * Return true if machine is in normal state (i.e. no live update in progress).
+ */
+bool liveupdate_state_normal(void);
+
+/* Protect live update state with a rwsem, take it as a reader */
+int liveupdate_read_state_enter_killable(void);
+void liveupdate_read_state_enter(void);
+void liveupdate_read_state_exit(void);
+
+/* Return true if live update orchestrator is enabled */
+bool liveupdate_enabled(void);
+
+int liveupdate_register_notifier(struct notifier_block *nb);
+int liveupdate_unregister_notifier(struct notifier_block *nb);
+
+/**
+ * LIVEUPDATE_DECLARE_NOTIFIER - Declare a live update notifier with default
+ * structure.
+ * @_name: A base name used to generate the names of the notifier block
+ * (e.g., ``_name##_liveupdate_notifier_block``) and the callback function
+ * (e.g., ``_name##_liveupdate``).
+ * @_priority: The priority of the notifier, specified using the
+ * ``enum liveupdate_cb_priority`` values
+ * (e.g., ``LIVEUPDATE_CB_PRIO_BEFORE_DEVICES``).
+ *
+ * This macro declares a static struct notifier_block and a corresponding
+ * notifier callback function for use with the live update orchestrator.
+ * It simplifies the process by automatically handling the dispatching of
+ * live update events to separate handler functions for prepare, reboot,
+ * finish, and cancel.
+ *
+ * This macro expects the following functions to be defined:
+ *
+ * ``_name##_liveupdate_prepare()``:  Called on LIVEUPDATE_PREPARE.
+ * ``_name##_liveupdate_reboot()``:   Called on LIVEUPDATE_REBOOT.
+ * ``_name##_liveupdate_finish()``:   Called on LIVEUPDATE_FINISH.
+ * ``_name##_liveupdate_cancel()``:   Called on LIVEUPDATE_CANCEL.
+ *
+ * The generated callback function handles the switch statement for the
+ * different live update events and calls the appropriate handler function.
+ * It also includes warnings if the finish or cancel handlers return an error.
+ *
+ * For example, a declaration can look like this:
+ *
+ * ``static int foo_liveupdate_prepare(void) { ... }``
+ *
+ * ``static int foo_liveupdate_reboot(void) { ... }``
+ *
+ * ``static int foo_liveupdate_finish(void) { ... }``
+ *
+ * ``static int foo_liveupdate_cancel(void) { ... }``
+ *
+ * ``LIVEUPDATE_DECLARE_NOTIFIER(foo, LIVEUPDATE_CB_PRIO_WITH_DEVICES);``
+ *
+ */
+#define LIVEUPDATE_DECLARE_NOTIFIER(_name, _priority)			\
+static int _name##_liveupdate(struct notifier_block *nb,		\
+			      unsigned long action,			\
+			      void *data)				\
+{									\
+	enum liveupdate_event event = (enum liveupdate_event)action;	\
+	int err = 0;							\
+	int rv;								\
+									\
+	switch (event) {						\
+	case LIVEUPDATE_PREPARE:					\
+		err = _name##_liveupdate_prepare();			\
+		break;							\
+	case LIVEUPDATE_REBOOT:						\
+		err = _name##_liveupdate_reboot();			\
+		break;							\
+	case LIVEUPDATE_FINISH:						\
+		rv = _name##_liveupdate_finish();			\
+		WARN_ONCE(rv, "finish failed[%d]\n", rv);		\
+		break;							\
+	case LIVEUPDATE_CANCEL:						\
+		rv = _name##_liveupdate_cancel();			\
+		WARN_ONCE(rv, "cancel failed[%d]\n", rv);		\
+		break;							\
+	default:							\
+		WARN_ONCE(1, "unexpected event[%d]\n", event);		\
+		return NOTIFY_DONE;					\
+	}								\
+									\
+	return notifier_from_errno(err);				\
+}									\
+									\
+static struct notifier_block _name##_liveupdate_notifier_block = {	\
+	.notifier_call = _name##_liveupdate,				\
+	.priority = _priority,						\
+}
+
+/**
+ * LIVEUPDATE_REGISTER_NOTIFIER - Register a live update notifier declared with
+ * the macro.
+ * @_name: The base name used when declaring the notifier with
+ * ``LIVEUPDATE_DECLARE_NOTIFIER``.
+ *
+ * This macro simplifies the registration of a notifier block that was
+ * declared using the LIVEUPDATE_DECLARE_NOTIFIER macro.
+ */
+#define LIVEUPDATE_REGISTER_NOTIFIER(_name)				\
+	liveupdate_register_notifier(&_name##_liveupdate_notifier_block)
+
+#else /* CONFIG_LIVEUPDATE */
+
+static inline int liveupdate_reboot(void)
+{
+	return 0;
+}
+
+static inline int liveupdate_register_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
+
+static inline int liveupdate_unregister_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
+
+#endif /* CONFIG_LIVEUPDATE */
+#endif /* _LINUX_LIVEUPDATE_H */
diff --git a/init/Kconfig b/init/Kconfig
index 324c2886b2ea..9800b8301fa2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2079,3 +2079,5 @@ config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 # <asm/syscall_wrapper.h>.
 config ARCH_HAS_SYSCALL_WRAPPER
 	def_bool n
+
+source "kernel/Kconfig.liveupdate"
diff --git a/kernel/Kconfig.liveupdate b/kernel/Kconfig.liveupdate
new file mode 100644
index 000000000000..8468591fac4a
--- /dev/null
+++ b/kernel/Kconfig.liveupdate
@@ -0,0 +1,19 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Live update configs
+#
+
+config ARCH_SUPPORTS_LIVEUPDATE
+	bool
+
+config LIVEUPDATE
+	bool "Enable kernel live update"
+	depends on ARCH_SUPPORTS_LIVEUPDATE
+	depends on KEXEC_HANDOVER
+	help
+	  Enables support for Live Update, a feature that keeps
+	  selected devices alive across the transition from the old
+	  kernel to the new kernel. Live Update is designed to minimize
+	  downtime during kernel updates.
+
+	  If unsure, say N.
diff --git a/kernel/Makefile b/kernel/Makefile
index cef5377c25cd..18c65f71ddb5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
 obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
 obj-$(CONFIG_LATENCYTOP) += latencytop.o
+obj-$(CONFIG_LIVEUPDATE) += liveupdate.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace/
 obj-$(CONFIG_TRACING) += trace/
 obj-$(CONFIG_TRACE_CLOCK) += trace/
diff --git a/kernel/liveupdate.c b/kernel/liveupdate.c
new file mode 100644
index 000000000000..64b5d4d4b6c4
--- /dev/null
+++ b/kernel/liveupdate.c
@@ -0,0 +1,749 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+/**
+ * DOC: Live Update Orchestrator (LUO)
+ *
+ * Live Update is a specialized reboot process where selected devices are
+ * kept operational across a kernel transition. For these devices, DMA and
+ * interrupt activity may continue uninterrupted during the kernel reboot.
+ *
+ * The primary use case is in cloud environments, allowing hypervisor updates
+ * without disrupting running virtual machines. During a live update, VMs can be
+ * suspended (with their state preserved in memory), while the hypervisor kernel
+ * reboots. Devices attached to these VMs (e.g., NICs, block devices) are kept
+ * operational by the LUO during the hypervisor reboot, allowing the VMs to be
+ * quickly resumed on the new kernel.
+ *
+ * Various kernel subsystems register with the Live Update Orchestrator to
+ * participate in the live update process. These subsystems are notified at
+ * different stages of the live update sequence, allowing them to serialize
+ * device state before the reboot and restore it afterwards. Examples include
+ * the device layer, interrupt controllers, KVM, IOMMU, and specific device
+ * drivers.
+ *
+ * The core of LUO is a state machine that tracks the progress of a live update,
+ * along with a callback API that allows other kernel subsystems to participate
+ * in the process. Example subsystems that can hook into LUO include: kvm,
+ * iommu, interrupts, Documentation/driver-api/liveupdate.rst, participating
+ * filesystems, and mm.
+ *
+ * LUO uses KHO to transfer memory state from Old Kernel to the New Kernel.
+ *
+ * LUO can be controlled through a sysfs interface. It provides the following
+ * files under: ``/sys/kernel/liveupdate/{state, prepare, finish}``
+ *
+ * The ``state`` file can contain the following values:
+ *
+ * ``normal``
+ *   The system is operating normally, and no live update is in progress.
+ *   This is the initial state.
+ * ``prepared``
+ *   The system has begun preparing for a live update. This state is reached
+ *   after subsystems have successfully responded to the ``LIVEUPDATE_PREPARE``
+ *   callback. It indicates that initial preparation is done, but it does not
+ *   necessarily mean all state has been serialized; subsystems can save more
+ *   state during the subsequent ``LIVEUPDATE_REBOOT`` callback.
+ * ``updated``
+ *   The new kernel has successfully taken over, and any suspended operations
+ *   are resumed. However, the system has not yet fully transitioned back to
+ *   a normal operational state; this happens after the ``LIVEUPDATE_FINISH``
+ *   callback is invoked.
+ *
+ * The state machine ensures that operations are performed in the correct
+ * sequence, provides a mechanism to track and recover from potential
+ * failures, and allows selecting the devices and subsystems that should
+ * participate in the live update sequence.
+ *
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel.h>
+#include <linux/sysfs.h>
+#include <linux/string.h>
+#include <linux/rwsem.h>
+#include <linux/err.h>
+#include <linux/liveupdate.h>
+#include <linux/cpufreq.h>
+#include <linux/kexec_handover.h>
+
+#define LUO_KHO_NODE_NAME		"liveupdate_orchestrator"
+#define LUO_KHO_VERSION_PROP_NAME	"version"
+#define LUO_VERSION_MAJOR		1
+#define LUO_VERSION_MINOR		0
+
+/* 'version' property */
+struct luo_kho_version_prop {
+	u32 major;
+	u32 minor;
+};
+
+static const struct luo_kho_version_prop luo_version = {
+	.major = LUO_VERSION_MAJOR,
+	.minor = LUO_VERSION_MINOR,
+};
+
+static struct kho_node luo_node = KHO_NODE_INIT;
+static enum liveupdate_state luo_state;
+static DECLARE_RWSEM(luo_state_rwsem);
+static BLOCKING_NOTIFIER_HEAD(luo_notify_list);
+
+static const char *const luo_event_str[] = {
+	"PREPARE",
+	"REBOOT",
+	"FINISH",
+	"CANCEL",
+};
+
+static const char *const luo_state_str[] = {
+	"normal",
+	"prepared",
+	"updated",
+};
+
+static bool luo_enabled;
+static bool luo_sysfs_initialized;
+
+static int __init early_liveupdate_param(char *buf)
+{
+	return kstrtobool(buf, &luo_enabled);
+}
+
+early_param("liveupdate", early_liveupdate_param);
+
+/* Return true if the current state is equal to the provided state */
+#define IS_STATE(state) (READ_ONCE(luo_state) == (state))
+
+/* Get the current state as a string */
+#define LUO_STATE_STR luo_state_str[READ_ONCE(luo_state)]
+
+static void __luo_set_state(enum liveupdate_state state)
+{
+	WRITE_ONCE(luo_state, state);
+	if (luo_sysfs_initialized)
+		sysfs_notify(kernel_kobj, NULL, "state");
+}
+
+static inline void luo_set_state(enum liveupdate_state state)
+{
+	pr_info("Switched from [%s] to [%s] state\n",
+		LUO_STATE_STR, luo_state_str[state]);
+	__luo_set_state(state);
+}
+
+/* Show the current live update state */
+static ssize_t state_show(struct kobject *kobj,
+			  struct kobj_attribute *attr,
+			  char *buf)
+{
+	return sysfs_emit(buf, "%s\n", LUO_STATE_STR);
+}
+
+/**
+ * luo_notify - Call registered notifiers for a live update event.
+ * @event: The live update event to notify subsystems about.
+ *
+ * This function notifies registered subsystems about the specified event.
+ *
+ * For ``LIVEUPDATE_PREPARE`` event, it uses
+ * ``blocking_notifier_call_chain_robust()`` to ensure that if a notifier
+ * callback fails, a corresponding ``LIVEUPDATE_CANCEL`` notification is sent
+ * to already-notified subsystems, allowing for a rollback.
+ *
+ * For ``LIVEUPDATE_REBOOT`` event, it uses ``blocking_notifier_call_chain()``
+ * and, if it returns a failure, cancels the operation by calling
+ * ``luo_notify(LIVEUPDATE_CANCEL)`` so that every subsystem transitions
+ * back to the ``normal`` state.
+ *
+ * For ``LIVEUPDATE_FINISH`` and ``LIVEUPDATE_CANCEL`` events, it uses the
+ * standard ``blocking_notifier_call_chain()``. These events are expected not to
+ * fail, and a warning is printed if they do.
+ *
+ * @return 0 on success, or the negative error code returned by the failing
+ * notifier callback (for ``LIVEUPDATE_PREPARE`` and ``LIVEUPDATE_REBOOT``), or
+ * 0 for ``LIVEUPDATE_FINISH`` and ``LIVEUPDATE_CANCEL`` even if a warning was
+ * printed due to a callback failure.
+ */
+static int luo_notify(enum liveupdate_event event)
+{
+	int ret;
+
+	if (event == LIVEUPDATE_PREPARE) {
+		ret = blocking_notifier_call_chain_robust(&luo_notify_list,
+							  LIVEUPDATE_PREPARE,
+							  LIVEUPDATE_CANCEL,
+							  NULL);
+	} else if (event == LIVEUPDATE_REBOOT) {
+		ret = blocking_notifier_call_chain(&luo_notify_list,
+						   LIVEUPDATE_REBOOT,
+						   NULL);
+		/*
+		 * For LIVEUPDATE_REBOOT do CANCEL for everyone, so even
+		 * prepared subsystems return back to the normal state
+		 */
+		if (notifier_to_errno(ret))
+			luo_notify(LIVEUPDATE_CANCEL);
+	} else {
+		ret = blocking_notifier_call_chain(&luo_notify_list,
+						   event,
+						   NULL);
+		/* Cancel and finish must not fail, warn and return success */
+		WARN_ONCE(notifier_to_errno(ret), "Callback failed event: %s [%d]\n",
+			  luo_event_str[event], notifier_to_errno(ret));
+		ret = 0;
+	}
+
+	return notifier_to_errno(ret);
+}
+
+/**
+ * luo_prepare - Initiate the live update preparation phase.
+ *
+ * This function is called to begin the live update process. It attempts to
+ * transition the luo to the ``LIVEUPDATE_STATE_PREPARED`` state.
+ *
+ * It first acquires the write lock for the orchestrator state. Then, it checks
+ * if the current state is ``LIVEUPDATE_STATE_NORMAL``. If not, it returns an
+ * error. If the state is normal, it triggers the ``LIVEUPDATE_PREPARE``
+ * notifier chain.
+ *
+ * If the notifier chain completes successfully, the orchestrator state is set
+ * to ``LIVEUPDATE_STATE_PREPARED``. If any notifier callback fails a
+ * ``LIVEUPDATE_CANCEL`` notification is sent to roll back any actions.
+ *
+ * @return 0 on success, ``-EAGAIN`` if the state change was cancelled by the
+ * user while waiting for the lock, ``-EINVAL`` if the orchestrator is not in
+ * the normal state, or a negative error code returned by the notifier chain.
+ */
+static int luo_prepare(void)
+{
+	int ret;
+
+	if (down_write_killable(&luo_state_rwsem)) {
+		pr_warn("%s, change state canceled by user\n", __func__);
+		return -EAGAIN;
+	}
+
+	if (!IS_STATE(LIVEUPDATE_STATE_NORMAL)) {
+		pr_warn("Can't switch to [%s] from [%s] state\n",
+			luo_state_str[LIVEUPDATE_STATE_PREPARED],
+			LUO_STATE_STR);
+		up_write(&luo_state_rwsem);
+
+		return -EINVAL;
+	}
+
+	ret = luo_notify(LIVEUPDATE_PREPARE);
+	if (!ret)
+		luo_set_state(LIVEUPDATE_STATE_PREPARED);
+
+	up_write(&luo_state_rwsem);
+
+	return ret;
+}
+
+/**
+ * luo_finish - Finalize the live update process in the new kernel.
+ *
+ * This function is called after a successful live update reboot into a new
+ * kernel, once the new kernel is ready to transition to the normal operational
+ * state. It signals the completion of the live update sequence to subsystems.
+ *
+ * It first attempts to acquire the write lock for the orchestrator state.
+ *
+ * Then, it checks if the system is in the ``LIVEUPDATE_STATE_UPDATED`` state.
+ * If not, it logs a warning and returns ``-EINVAL``.
+ *
+ * If the state is correct, it triggers the ``LIVEUPDATE_FINISH`` notifier
+ * chain. Note that the return value of the notifier is intentionally ignored as
+ * finish callbacks must not fail. Finally, the orchestrator state is
+ * transitioned back to ``LIVEUPDATE_STATE_NORMAL``, indicating the end of the
+ * live update process.
+ *
+ * @return 0 on success, ``-EAGAIN`` if the state change was cancelled by the
+ * user while waiting for the lock, or ``-EINVAL`` if the orchestrator is not in
+ * the updated state.
+ */
+static int luo_finish(void)
+{
+	if (down_write_killable(&luo_state_rwsem)) {
+		pr_warn("%s, change state canceled by user\n", __func__);
+		return -EAGAIN;
+	}
+
+	if (!IS_STATE(LIVEUPDATE_STATE_UPDATED)) {
+		pr_warn("Can't switch to [%s] from [%s] state\n",
+			luo_state_str[LIVEUPDATE_STATE_NORMAL],
+			LUO_STATE_STR);
+		up_write(&luo_state_rwsem);
+
+		return -EINVAL;
+	}
+
+	(void)luo_notify(LIVEUPDATE_FINISH);
+	luo_set_state(LIVEUPDATE_STATE_NORMAL);
+
+	up_write(&luo_state_rwsem);
+
+	return 0;
+}
+
+/**
+ * luo_cancel - Cancel the ongoing live update preparation or reboot states.
+ *
+ * This function is called to abort a live update that is currently in the
+ * ``LIVEUPDATE_STATE_PREPARED`` state. It can be triggered either
+ * programmatically or via the sysfs interface.
+ *
+ * If the state is correct, it triggers the ``LIVEUPDATE_CANCEL`` notifier chain
+ * to allow subsystems to undo any actions performed during the prepare or
+ * reboot phase. Finally, the orchestrator state is transitioned back to
+ * ``LIVEUPDATE_STATE_NORMAL``.
+ *
+ * @return 0 on success, ``-EAGAIN`` if interrupted while waiting for the lock,
+ * or ``-EINVAL`` if the orchestrator is not in the prepared state.
+ */
+static int luo_cancel(void)
+{
+	if (down_write_killable(&luo_state_rwsem)) {
+		pr_warn("%s, change state canceled by user\n", __func__);
+		return -EAGAIN;
+	}
+
+	if (!IS_STATE(LIVEUPDATE_STATE_PREPARED)) {
+		pr_warn("Can't switch to [%s] from [%s] state\n",
+			luo_state_str[LIVEUPDATE_STATE_NORMAL],
+			LUO_STATE_STR);
+		up_write(&luo_state_rwsem);
+
+		return -EINVAL;
+	}
+
+	(void)luo_notify(LIVEUPDATE_CANCEL);
+	luo_set_state(LIVEUPDATE_STATE_NORMAL);
+
+	up_write(&luo_state_rwsem);
+
+	return 0;
+}
+
+/**
+ * prepare_store - store method for starting live update prepare state or go
+ * back to normal from a prepared state.
+ * @kobj: The kobject associated with luo.
+ * @attr: The sysfs attribute
+ * @buf: The buffer containing the value written by the user.
+ * @count: The number of bytes written.
+ *
+ * This function is the store method for the 'prepare' file under the
+ * 'liveupdate' sysfs directory.
+ *
+ * Writing "1" to this attribute will trigger the luo_prepare() function,
+ * attempting to start the live update preparation phase.
+ *
+ * Writing "0" to this attribute will trigger the luo_cancel() function,
+ * attempting to return the orchestrator to the normal state.
+ *
+ * @return The number of bytes processed on success, or a negative error code
+ * if the input is invalid or if the underlying functions fail.
+ */
+static ssize_t prepare_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf,
+			     size_t count)
+{
+	ssize_t ret;
+	long val;
+
+	if (kstrtol(buf, 0, &val) < 0)
+		return -EINVAL;
+
+	if (val != 1 && val != 0)
+		return -EINVAL;
+
+	if (val)
+		ret = luo_prepare();
+	else
+		ret = luo_cancel();
+
+	if (!ret)
+		ret = count;
+
+	return ret;
+}
+
+/**
+ * finish_store - store method for finalizing a live update.
+ * @kobj: The kobject associated with the luo.
+ * @attr: The sysfs attribute
+ * @buf: The buffer containing the value written by the user.
+ * @count: The number of bytes written.
+ *
+ * This function is the store method for the ``finish`` file under the
+ * ``liveupdate`` sysfs directory.
+ *
+ * Writing "1" to this attribute will trigger the luo_finish() function,
+ * attempting to finalize the live update process in the new kernel and
+ * transition to the normal state.
+ *
+ * @return The number of bytes processed on success, or a negative error code
+ * if the input is invalid or if luo_finish() fails.
+ */
+static ssize_t finish_store(struct kobject *kobj,
+			    struct kobj_attribute *attr,
+			    const char *buf,
+			    size_t count)
+{
+	ssize_t ret;
+	long val;
+
+	if (kstrtol(buf, 0, &val) < 0)
+		return -EINVAL;
+
+	if (val != 1)
+		return -EINVAL;
+
+	ret = luo_finish();
+	if (!ret)
+		ret = count;
+
+	return ret;
+}
+
+static struct kobj_attribute state_attribute = __ATTR_RO(state);
+static struct kobj_attribute prepare_attribute = __ATTR_WO(prepare);
+static struct kobj_attribute finish_attribute = __ATTR_WO(finish);
+
+static struct attribute *luo_attrs[] = {
+	&state_attribute.attr,
+	&prepare_attribute.attr,
+	&finish_attribute.attr,
+	NULL,
+};
+
+static struct attribute_group luo_attr_group = {
+	.attrs = luo_attrs,
+	.name = "liveupdate",
+};
+
+/**
+ * luo_init - Initialize the Live Update Orchestrator sysfs interface.
+ *
+ * This function is called during kernel initialization (``subsys_initcall``).
+ * It is responsible for creating the sysfs interface that allows user-space
+ * to interact with the Live Update Orchestrator.
+ *
+ * If the "liveupdate" feature is enabled (``luo_enabled`` is set and
+ * kho_is_enabled() returns true), this function creates a sysfs directory
+ * named ``liveupdate`` under ``/sys/kernel/``.
+ *
+ * It then creates the following sysfs attribute files within the
+ * ``/sys/kernel/liveupdate/`` directory:
+ *
+ * - ``prepare``: Writing '1' initiates preparation, '0' cancels.
+ * - ``finish``:  Writing '1' finalizes the update in the new kernel.
+ * - ``state``:   Read-only file displaying the current orchestrator state.
+ *
+ * @return 0 on success, or a negative error code if sysfs directory or
+ * attribute creation fails.
+ */
+static int __init luo_init(void)
+{
+	int ret;
+
+	if (!luo_enabled || !kho_is_enabled()) {
+		pr_info("disabled by user\n");
+		luo_enabled = false;
+
+		return 0;
+	}
+
+	ret = sysfs_create_group(kernel_kobj, &luo_attr_group);
+	if (ret) {
+		pr_err("Failed to create group\n");
+	} else {
+		luo_sysfs_initialized = true;
+		pr_info("Initialized\n");
+	}
+	return ret;
+}
+subsys_initcall(luo_init);
+
+/**
+ * luo_startup - Initialize the Live Update Orchestrator on live update boot.
+ *
+ * This function is called during the kernel's early initialization phase
+ * (early_initcall). Its primary role is to detect if the system is booting
+ * as part of a live update sequence by checking for the presence of a
+ * luo node in the kho tree.
+ *
+ * If a kho node named ``liveupdate_orchestrator`` is found, the function
+ * extracts the version information from the previous kernel. It then performs
+ * the following checks to ensure a safe continuation of the live update:
+ *
+ * 1. Verifies the size of the version property.
+ * 2. Compares the major version and checks if the minor version of the
+ *    previous orchestrator is compatible with the current one. If a mismatch
+ *    is detected, the system panics to prevent potential memory corruption.
+ * 3. Checks if the ``liveupdate`` kernel command-line parameter has enabled
+ *    the feature. If the kho node exists but the feature is disabled, the
+ *    system panics.
+ *
+ * If all checks pass, the orchestrator state is set to
+ * ``LIVEUPDATE_STATE_UPDATED``.
+ *
+ * @return 0 always.
+ */
+static int __init luo_startup(void)
+{
+	enum liveupdate_state state = LIVEUPDATE_STATE_NORMAL;
+	const struct luo_kho_version_prop *p;
+	struct kho_in_node luo_node;
+	int len;
+
+	if (kho_get_node(NULL, LUO_KHO_NODE_NAME, &luo_node) < 0)
+		goto no_liveupdate;
+
+	p = kho_get_prop(&luo_node, LUO_KHO_VERSION_PROP_NAME, &len);
+	if (len != sizeof(struct luo_kho_version_prop)) {
+		panic("Unexpected version property size, expected[%zu] found[%d]\n",
+		      sizeof(struct luo_kho_version_prop), len);
+	}
+
+	/*
+	 * Panic if feature is disabled or version mismatch, we do not want
+	 * memory corruptions due to DMA or interrupt tables activity.
+	 */
+	if (p->major != LUO_VERSION_MAJOR ||
+	    p->minor > LUO_VERSION_MINOR) {
+		pr_err("prev orchestrator version (%u.%u)\n",
+		       p->major, p->minor);
+		pr_err("new orchestrator version (%d.%d)\n",
+		       LUO_VERSION_MAJOR, LUO_VERSION_MINOR);
+		panic("Orchestrator version mismatch\n");
+	}
+
+	if (!luo_enabled)
+		panic("Live update node found, but feature is disabled\n");
+
+	state = LIVEUPDATE_STATE_UPDATED;
+	pr_info("live update boot\n");
+
+no_liveupdate:
+	__luo_set_state(state);
+
+	return 0;
+}
+early_initcall(luo_startup);
+
+/* Public Functions */
+
+/**
+ * liveupdate_reboot - Notify subsystems to perform final serialization for live
+ * update.
+ *
+ * This function is called directly from the reboot() syscall path when a live
+ * update is prepared (i.e., the system is rebooting into a new kernel while
+ * preserving devices). It is part of the "blackout" window where the old kernel
+ * is transitioning to the new one.
+ *
+ * During this phase, the function iterates through the list of subsystems
+ * participating in the live update and invokes their registered
+ * ``LIVEUPDATE_REBOOT`` callbacks. These callbacks *must* complete as quickly
+ * as possible, because they perform the final serialization of
+ * device/subsystem state necessary to survive the imminent kernel transition.
+ * Any delay here directly extends the blackout window.
+ *
+ * If any callback fails, the live update process is aborted, and a
+ * ``LIVEUPDATE_CANCEL`` notification is sent to all subsystems, both those
+ * that were already notified and those that were not, to bring the machine
+ * back to the ``LIVEUPDATE_STATE_NORMAL`` state.
+ *
+ * On success, the function adds a node to the KHO tree to indicate to the next
+ * kernel that a live update is in progress.
+ *
+ * @return 0 on success, or a negative error code if a callback fails or if
+ * adding the KHO node fails.
+ */
+int liveupdate_reboot(void)
+{
+	int ret;
+
+	if (!IS_STATE(LIVEUPDATE_STATE_PREPARED))
+		return 0;
+
+	if (down_write_killable(&luo_state_rwsem)) {
+		pr_warn("%s, change state canceled by user\n", __func__);
+		return -EAGAIN;
+	}
+
+	ret = luo_notify(LIVEUPDATE_REBOOT);
+	if (ret < 0) {
+		luo_set_state(LIVEUPDATE_STATE_NORMAL);
+	} else {
+		/* Add live update orchestrator node to KHO tree */
+		ret = kho_add_node(NULL, LUO_KHO_NODE_NAME, &luo_node);
+		if (!ret) {
+			ret = kho_add_prop(&luo_node, LUO_KHO_VERSION_PROP_NAME,
+					   &luo_version, sizeof(luo_version));
+		}
+
+		if (ret) {
+			(void)luo_notify(LIVEUPDATE_CANCEL);
+			luo_set_state(LIVEUPDATE_STATE_NORMAL);
+		}
+	}
+
+	up_write(&luo_state_rwsem);
+
+	if (ret)
+		pr_warn("%s failed: %d\n", __func__, ret);
+
+	return ret;
+}
+
+/**
+ * liveupdate_state_updated - Check if the system is in the live update
+ * 'updated' state.
+ *
+ * This function checks if the live update orchestrator is in the
+ * ``LIVEUPDATE_STATE_UPDATED`` state. This state indicates that the system has
+ * successfully rebooted into a new kernel as part of a live update, and the
+ * preserved devices are expected to be in the process of being reclaimed.
+ *
+ * This is typically used by subsystems during early boot of the new kernel
+ * to determine if they need to attempt to restore state from a previous
+ * live update.
+ *
+ * @return true if the system is in the ``LIVEUPDATE_STATE_UPDATED`` state,
+ * false otherwise.
+ */
+bool liveupdate_state_updated(void)
+{
+	return IS_STATE(LIVEUPDATE_STATE_UPDATED);
+}
+EXPORT_SYMBOL_GPL(liveupdate_state_updated);
+
+/**
+ * liveupdate_state_normal - Check if the system is in the live update 'normal'
+ * state.
+ *
+ * This function checks if the live update orchestrator is in the
+ * ``LIVEUPDATE_STATE_NORMAL`` state. This state indicates that no live update
+ * is in progress. It represents the default operational state of the system.
+ *
+ * This can be used to gate actions that should only be performed when no
+ * live update activity is occurring.
+ *
+ * @return true if the system is in the ``LIVEUPDATE_STATE_NORMAL`` state,
+ * false otherwise.
+ */
+bool liveupdate_state_normal(void)
+{
+	return IS_STATE(LIVEUPDATE_STATE_NORMAL);
+}
+EXPORT_SYMBOL_GPL(liveupdate_state_normal);
+
+/**
+ * liveupdate_register_notifier - Register a notifier for live update events.
+ *
+ * This function registers a notifier block to receive callbacks for various
+ * stages of the live update process. Notifiers are called when the live
+ * update state changes, allowing subsystems to participate in the
+ * serialization and restoration of state.
+ *
+ * @nb: Pointer to the notifier block to register.
+ *
+ * @return 0 on success, or a negative error code on failure (e.g., if
+ * the notifier block is already registered).
+ */
+int liveupdate_register_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&luo_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(liveupdate_register_notifier);
+
+/**
+ * liveupdate_unregister_notifier - Unregister a live update event notifier.
+ *
+ * This function unregisters a previously registered notifier block from
+ * receiving further callbacks for live update events.
+ *
+ * @nb: Pointer to the notifier block to unregister.
+ *
+ * @return 0 on success, or a negative error code if the notifier block
+ * was not found.
+ */
+int liveupdate_unregister_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&luo_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(liveupdate_unregister_notifier);
+
+/**
+ * liveupdate_enabled - Check if the live update feature is enabled.
+ *
+ * This function returns the state of the live update feature flag, which
+ * can be controlled via the ``liveupdate`` kernel command-line parameter.
+ *
+ * @return true if live update is enabled, false otherwise.
+ */
+bool liveupdate_enabled(void)
+{
+	return luo_enabled;
+}
+EXPORT_SYMBOL_GPL(liveupdate_enabled);
+
+/**
+ * liveupdate_read_state_enter_killable - Acquire the live update state read
+ * lock (killable).
+ *
+ * This function attempts to acquire the read lock protecting the live update
+ * orchestrator state. It allows multiple readers but excludes writers. The
+ * wait can be interrupted only by a fatal signal.
+ *
+ * Subsystems should acquire this lock if they need to read the live update
+ * state and potentially perform actions based on it.
+ *
+ * Callers *must* call liveupdate_read_state_exit() to release the lock.
+ *
+ * @return 0 on success, or ``-EINTR`` if interrupted by a fatal signal.
+ */
+int liveupdate_read_state_enter_killable(void)
+{
+	return down_read_killable(&luo_state_rwsem);
+}
+EXPORT_SYMBOL_GPL(liveupdate_read_state_enter_killable);
+
+/**
+ * liveupdate_read_state_enter - Acquire the live update state read lock.
+ *
+ * The same as liveupdate_read_state_enter_killable(), but not interruptible.
+ */
+void liveupdate_read_state_enter(void)
+{
+	down_read(&luo_state_rwsem);
+}
+EXPORT_SYMBOL_GPL(liveupdate_read_state_enter);
+
+/**
+ * liveupdate_read_state_exit - Release the live update state read lock.
+ *
+ * This function releases the read lock protecting the live update
+ * orchestrator state. It must be called after a successful call to
+ * liveupdate_read_state_enter_killable() or liveupdate_read_state_enter().
+ */
+void liveupdate_read_state_exit(void)
+{
+	up_read(&luo_state_rwsem);
+}
+EXPORT_SYMBOL_GPL(liveupdate_read_state_exit);
diff --git a/kernel/reboot.c b/kernel/reboot.c
index b5a8569e5d81..d57413cdc9b9 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -18,6 +18,7 @@
 #include <linux/syscalls.h>
 #include <linux/syscore_ops.h>
 #include <linux/uaccess.h>
+#include <linux/liveupdate.h>
 
 /*
  * this indicates whether you can reboot with ctrl-alt-del: the default is yes
@@ -791,6 +792,9 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
 
 #ifdef CONFIG_KEXEC_CORE
 	case LINUX_REBOOT_CMD_KEXEC:
+		ret = liveupdate_reboot();
+		if (ret)
+			break;
 		ret = kernel_kexec();
 		break;
 #endif
-- 
2.49.0.395.g12beb8f557-goog
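
The state transitions driven by luo_prepare(), luo_cancel(),
liveupdate_reboot() and luo_finish() in the patch above can be modelled in
userspace roughly as follows. This is only a sketch: every model_* name is
hypothetical, model_kexec() collapses liveupdate_reboot() and the next
kernel's luo_startup() into one step, and locking and notifier chains are
left out entirely.

```c
#include <assert.h>
#include <errno.h>

/* Userspace model only; these names are hypothetical, not kernel symbols. */
enum model_state {
	MODEL_NORMAL,
	MODEL_PREPARED,
	MODEL_UPDATED,
	MODEL_NR_STATES,
};

/* Same strings as luo_state_str[] in the patch. */
static const char *const model_state_str[] = {
	"normal",
	"prepared",
	"updated",
};

static enum model_state model_state = MODEL_NORMAL;

/* Set to a negative errno to simulate a failing PREPARE callback. */
static int model_prepare_error;

static const char *model_state_name(void)
{
	assert(model_state < MODEL_NR_STATES);
	return model_state_str[model_state];
}

/* Models luo_prepare(): normal -> prepared. */
static int model_prepare(void)
{
	if (model_state != MODEL_NORMAL)
		return -EINVAL;
	if (model_prepare_error)
		return model_prepare_error;	/* rolled back, stays normal */
	model_state = MODEL_PREPARED;
	return 0;
}

/* Models luo_cancel(): prepared -> normal. */
static int model_cancel(void)
{
	if (model_state != MODEL_PREPARED)
		return -EINVAL;
	model_state = MODEL_NORMAL;
	return 0;
}

/*
 * Models a kexec: only a prepared system comes up in the updated state,
 * any other kexec boots the next kernel in the normal state.
 */
static void model_kexec(void)
{
	if (model_state == MODEL_PREPARED)
		model_state = MODEL_UPDATED;
	else
		model_state = MODEL_NORMAL;
}

/* Models luo_finish(): updated -> normal. */
static int model_finish(void)
{
	if (model_state != MODEL_UPDATED)
		return -EINVAL;
	model_state = MODEL_NORMAL;
	return 0;
}
```

In this model a full cycle is model_prepare(), model_kexec(), model_finish();
calling model_cancel() or model_finish() from the wrong state returns
-EINVAL, matching the state checks in the patch.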




* [RFC v1 2/3] luo: dev_liveupdate: Add device live update infrastructure
  2025-03-20  2:40 [RFC v1 0/3] Live Update Orchestrator Pasha Tatashin
  2025-03-20  2:40 ` [RFC v1 1/3] luo: " Pasha Tatashin
@ 2025-03-20  2:40 ` Pasha Tatashin
  2025-03-20 13:34   ` Greg KH
  2025-03-20  2:40 ` [RFC v1 3/3] luo: x86: Enable live update support Pasha Tatashin
  2025-03-20 13:35 ` [RFC v1 0/3] Live Update Orchestrator Greg KH
  3 siblings, 1 reply; 21+ messages in thread
From: Pasha Tatashin @ 2025-03-20  2:40 UTC (permalink / raw)

Introduce a new subsystem within the driver core to enable keeping
devices alive during kernel live update. This infrastructure is
designed to be registered with and driven by a separate Live Update
Orchestrator, allowing the LUO's state machine to manage the save and
restore process of device state during a kernel transition.

The goal is to allow drivers and buses to participate in a coordinated
save and restore process orchestrated by a live update mechanism. By
saving device state before the kernel switch and restoring it
immediately after, the device can appear to remain continuously
operational from the perspective of the system and userspace.

Components introduced:

- `struct dev_liveupdate`: Embedded in `struct device` to track the
  device's participation and state during a live update, including
  request status, preservation status, and dependency depth.

- `liveupdate()` callback: Added to `struct bus_type` and
  `struct device_driver`. This callback receives an enum
  `liveupdate_event` to manage device state at different stages of the
  live update process:
    - LIVEUPDATE_PREPARE: Save device state before the kernel switch.
    - LIVEUPDATE_REBOOT: Final actions just before the kernel jump.
    - LIVEUPDATE_FINISH: Clean-up after live update.
    - LIVEUPDATE_CANCEL: Clean up any saved state if the update is
      aborted.

- Sysfs attribute "liveupdate/requested": Added under each device
  directory, allowing the user to request that a specific device
  participate in live update, i.e. that its state be preserved
  during the update.
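
As an illustration of the four events listed above, a driver-side handler
might be shaped roughly like this. It is a userspace sketch under loud
assumptions: struct model_dev, model_dev_liveupdate() and the MODEL_* event
values are hypothetical stand-ins, and the real callback signature is
whatever include/linux/dev_liveupdate.h defines, which is not reproduced
here.

```c
#include <errno.h>
#include <stdbool.h>

/* Event names follow the commit message; the values are arbitrary. */
enum model_event {
	MODEL_PREPARE,
	MODEL_REBOOT,
	MODEL_FINISH,
	MODEL_CANCEL,
};

/* Hypothetical per-device bookkeeping for this sketch. */
struct model_dev {
	int hw_reg;		/* pretend hardware register */
	int saved_reg;		/* copy preserved for the next kernel */
	bool state_saved;	/* set by PREPARE, cleared by FINISH/CANCEL */
	bool frozen;		/* set once the final REBOOT step has run */
};

/* Hypothetical handler: one function that switches on the event. */
static int model_dev_liveupdate(struct model_dev *dev, enum model_event event)
{
	switch (event) {
	case MODEL_PREPARE:
		/* Save state before the kernel switch. */
		dev->saved_reg = dev->hw_reg;
		dev->state_saved = true;
		return 0;
	case MODEL_REBOOT:
		/* Final, fast step just before the jump. */
		if (!dev->state_saved)
			return -EINVAL;	/* REBOOT without PREPARE */
		dev->saved_reg = dev->hw_reg;
		dev->frozen = true;
		return 0;
	case MODEL_FINISH:
	case MODEL_CANCEL:
		/* Drop any saved state; neither event is expected to fail. */
		dev->state_saved = false;
		dev->frozen = false;
		return 0;
	}
	return -EINVAL;
}
```

The point of the split is that the expensive work happens in PREPARE, while
REBOOT only refreshes what may have changed since, which keeps the blackout
window short.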

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 Documentation/driver-api/index.rst      |   1 +
 Documentation/driver-api/liveupdate.rst |  23 +
 MAINTAINERS                             |   3 +
 drivers/base/Makefile                   |   1 +
 drivers/base/core.c                     |  25 +-
 drivers/base/dev_liveupdate.c           | 816 ++++++++++++++++++++++++
 include/linux/dev_liveupdate.h          | 109 ++++
 include/linux/device.h                  |   6 +
 include/linux/device/bus.h              |   4 +
 include/linux/device/driver.h           |   4 +
 10 files changed, 984 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/driver-api/liveupdate.rst
 create mode 100644 drivers/base/dev_liveupdate.c
 create mode 100644 include/linux/dev_liveupdate.h

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index 16e2c4ec3c01..70df19321f58 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -25,6 +25,7 @@ of interest to most developers working on device drivers.
    infrastructure
    ioctl
    pm/index
+   liveupdate
 
 Useful support libraries
 ========================
diff --git a/Documentation/driver-api/liveupdate.rst b/Documentation/driver-api/liveupdate.rst
new file mode 100644
index 000000000000..3afa6173a536
--- /dev/null
+++ b/Documentation/driver-api/liveupdate.rst
@@ -0,0 +1,23 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+Device Live Update
+==================
+:Author: Pasha Tatashin <pasha.tatashin@soleen.com>
+
+dev_liveupdate
+==============
+.. kernel-doc:: drivers/base/dev_liveupdate.c
+   :doc: Device Live Update
+
+Public API
+==========
+.. kernel-doc:: include/linux/dev_liveupdate.h
+
+.. kernel-doc:: drivers/base/dev_liveupdate.c
+   :export:
+
+Internal API
+============
+.. kernel-doc:: drivers/base/dev_liveupdate.c
+   :internal:
diff --git a/MAINTAINERS b/MAINTAINERS
index 32257bde9647..81f8c2881e60 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13487,6 +13487,9 @@ L:	linux-kernel@vger.kernel.org
 S:	Maintained
 F:	Documentation/ABI/testing/sysfs-kernel-liveupdate
 F:	Documentation/admin-guide/liveupdate.rst
+F:	Documentation/driver-api/liveupdate.rst
+F:	drivers/base/dev_liveupdate.c
+F:	include/linux/dev_liveupdate.h
 F:	include/linux/liveupdate.h
 F:	kernel/Kconfig.liveupdate
 F:	kernel/liveupdate.c
diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 8074a10183dc..58939921e5e1 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_GENERIC_MSI_IRQ) += platform-msi.o
 obj-$(CONFIG_GENERIC_ARCH_TOPOLOGY) += arch_topology.o
 obj-$(CONFIG_GENERIC_ARCH_NUMA) += arch_numa.o
 obj-$(CONFIG_ACPI) += physical_location.o
+obj-$(CONFIG_LIVEUPDATE) += dev_liveupdate.o
 
 obj-y			+= test/
 
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 2fde698430df..21b5dfa0f70c 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -3151,6 +3151,7 @@ void device_initialize(struct device *dev)
 	dev->dma_coherent = dma_default_coherent;
 #endif
 	swiotlb_dev_init(dev);
+	dev_liveupdate_init(dev);
 }
 EXPORT_SYMBOL_GPL(device_initialize);
 
@@ -3627,6 +3628,7 @@ int device_add(struct device *dev)
 	if (error)
 		goto DPMError;
 	device_pm_add(dev);
+	dev_liveupdate_add_device(dev);
 
 	if (MAJOR(dev->devt)) {
 		error = device_create_file(dev, &dev_attr_dev);
@@ -4740,6 +4742,10 @@ int device_change_owner(struct device *dev, kuid_t kuid, kgid_t kgid)
 	if (error)
 		goto out;
 
+	error = dev_liveupdate_sysfs_change_owner(dev, kuid, kgid);
+	if (error)
+		goto out;
+
 	/*
 	 * Change the owner of the symlink located in the class directory of
 	 * the device class associated with @dev which points to the actual
@@ -4810,14 +4816,17 @@ void device_shutdown(void)
 				dev_info(dev, "shutdown_pre\n");
 			dev->class->shutdown_pre(dev);
 		}
-		if (dev->bus && dev->bus->shutdown) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown\n");
-			dev->bus->shutdown(dev);
-		} else if (dev->driver && dev->driver->shutdown) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown\n");
-			dev->driver->shutdown(dev);
+
+		if (!dev_liveupdate_preserved(dev)) {
+			if (dev->bus && dev->bus->shutdown) {
+				if (initcall_debug)
+					dev_info(dev, "shutdown\n");
+				dev->bus->shutdown(dev);
+			} else if (dev->driver && dev->driver->shutdown) {
+				if (initcall_debug)
+					dev_info(dev, "shutdown\n");
+				dev->driver->shutdown(dev);
+			}
 		}
 
 		device_unlock(dev);
diff --git a/drivers/base/dev_liveupdate.c b/drivers/base/dev_liveupdate.c
new file mode 100644
index 000000000000..7e961d2cd3b1
--- /dev/null
+++ b/drivers/base/dev_liveupdate.c
@@ -0,0 +1,816 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+/**
+ * DOC: Device Live Update
+ *
+ * Provides infrastructure for preserving device state across a system update.
+ *
+ * This subsystem allows drivers and buses to save and restore device state,
+ * enabling a seamless transition during a live update.
+ *
+ * The core idea is to identify a set of devices whose state needs to be
+ * preserved. For each such device, the associated driver and bus can implement
+ * callbacks to save the device's state before the update and restore it
+ * afterwards.
+ *
+ * Userspace can interact with this subsystem via sysfs attributes exposed
+ * under each device directory (e.g., ``/sys/devices/.../liveupdate/``).
+ * This directory contains the following attributes:
+ *
+ * ``requested``
+ *   A read-write attribute allowing userspace to control whether a device
+ *   should participate in the live update sequence. Writing "1" requests the
+ *   device and its ancestors (that support live update) be preserved.
+ *   Writing "0" requests the device be excluded. This attribute can only be
+ *   modified when LUO is in the ``normal`` state.
+ * ``preserved``
+ *   A read-only attribute indicating whether the device's state was
+ *   preserved during the ``prepare`` and ``reboot`` stages.
+ * ``reclaimed``
+ *   A read-only attribute indicating whether the device was successfully
+ *   re-attached and resumed operation in the new kernel after an update.
+ *   For example, a VM to which this device was passed through has been resumed.
+ *
+ * By default, devices do not participate in the live update. Userspace can
+ * explicitly request participation by writing "1" to the ``requested`` file.
+ *
+ * The live update process typically involves the following stages,
+ * reflected in the ``liveupdate_event`` enum:
+ *
+ * ``LIVEUPDATE_PREPARE``
+ *   Prepare devices for the upcoming state transition. Drivers and buses should
+ *   save the necessary device state. This happens before the blackout window.
+ * ``LIVEUPDATE_REBOOT``
+ *   A final notification before the system jumps to the new kernel. Called
+ *   during the blackout window from the reboot() syscall.
+ * ``LIVEUPDATE_FINISH``
+ *   The system has completed a transition. Drivers and buses should have
+ *   already restored the previously saved state. Clean up, reset unreclaimed
+ *   devices.
+ * ``LIVEUPDATE_CANCEL``
+ *   Cancel the live update process. Drivers and buses should clean up any saved
+ *   state if necessary.
+ *
+ * Documentation/admin-guide/liveupdate.rst contains more details.
+ *
+ * The global state of the live update subsystem can be accessed and
+ * controlled via a separate sysfs interface (e.g., ``/sys/kernel/liveupdate/``)
+ * provided by the Live Update Orchestrator.
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/dev_liveupdate.h>
+#include <linux/list_sort.h>
+#include <linux/kobject.h>
+#include <linux/device.h>
+#include <linux/pci.h>
+#include "base.h"
+
+static const char liveupdate_group_name[] = "liveupdate";
+
+/**
+ * is_liveupdate_possible() - Check if a device can participate in live update
+ * @dev: The device to check.
+ *
+ * This function verifies if the given device and all its ancestors (up to
+ * the root device or until a missing callback is found) are capable of
+ * participating in a live update.
+ *
+ * It checks for the presence of the ``liveupdate`` callback in the device's
+ * driver and bus, and performs the same check for all parent devices. If any
+ * device in the hierarchy (including the device itself)
+ * lacks a ``liveupdate`` callback in either its driver or bus, the function
+ * returns false.
+ *
+ * Return: True if the device and all its relevant ancestors have the
+ * liveupdate callback, false otherwise.
+ */
+static bool is_liveupdate_possible(struct device *dev)
+{
+	struct device *parent_dev;
+	bool is_possible = true;
+
+	dev = get_device(dev);
+	for (;;) {
+		if (dev->driver) {
+			is_possible = !!dev->driver->liveupdate;
+			if (!is_possible) {
+				dev_warn(dev, "driver[%s] no liveupdate callback\n",
+					 dev->driver->name);
+				break;
+			}
+		}
+
+		if (dev->bus) {
+			is_possible = !!dev->bus->liveupdate;
+			if (!is_possible) {
+				dev_warn(dev, "bus[%s] no liveupdate callback\n",
+					 dev->bus->name);
+				break;
+			}
+		}
+
+		if (!dev->parent)
+			break;
+
+		parent_dev = get_device(dev->parent);
+		put_device(dev);
+		dev = parent_dev;
+	}
+	put_device(dev);
+
+	return is_possible;
+}
+
+/*
+ * dev->{driver, bus}->liveupdate->{prepare, reboot} callback
+ * Warn if liveupdate not present, this is an internal error, and should never
+ * be the case.
+ * return callback result, or 0 if callback is not implemented.
+ */
+#define DEV_LIVEUPDATE_RET_CALLBACK(_dev, _drv_or_bus, _func) ({	\
+	int rv = 0;							\
+									\
+	if ((_dev)->_drv_or_bus &&					\
+	    !WARN_ON(!(_dev)->_drv_or_bus->liveupdate) &&		\
+	    (_dev)->_drv_or_bus->liveupdate->_func) {			\
+		rv = (_dev)->_drv_or_bus->liveupdate->_func(_dev);	\
+	}								\
+	rv;								\
+})
+
+/*
+ * A void variant of the previous macro
+ * dev->{driver, bus}->liveupdate->{cancel, finish} callback
+ * Warn if liveupdate not present, this is an internal error, and should never
+ * be the case.
+ */
+#define DEV_LIVEUPDATE_CALLBACK(_dev, _drv_or_bus, _func) do {		\
+	if ((_dev)->_drv_or_bus &&					\
+	    !WARN_ON(!(_dev)->_drv_or_bus->liveupdate) &&		\
+	    (_dev)->_drv_or_bus->liveupdate->_func) {			\
+		(_dev)->_drv_or_bus->liveupdate->_func(_dev);		\
+	}								\
+} while (0)
+
+static ssize_t preserved_show(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", dev_liveupdate_preserved(dev));
+}
+static DEVICE_ATTR_RO(preserved);
+
+static ssize_t reclaimed_show(struct device *dev,
+			      struct device_attribute *attr,
+			      char *buf)
+{
+	return sysfs_emit(buf, "%d\n", dev_liveupdate_reclaimed(dev));
+}
+static DEVICE_ATTR_RO(reclaimed);
+
+static ssize_t requested_show(struct device *dev, struct device_attribute *attr,
+			      char *buf)
+{
+	return sysfs_emit(buf, "%d\n", dev_liveupdate_requested(dev));
+}
+
+/**
+ * requested_store() - Store function for the ``requested`` sysfs attribute
+ * @dev: The device associated with the attribute.
+ * @attr: The device attribute structure.
+ * @buf: The buffer containing the value written by the user.
+ * @count: The number of bytes written.
+ *
+ * Allows userspace to request that a device be included in or excluded from
+ * the live update process. Writing "1" requests the device to be preserved
+ * during live update, and writing "0" requests it to be excluded.
+ *
+ * This function checks if the live update system is in the 'normal' state
+ * before allowing changes. It also verifies that the device supports
+ * live update before setting the requested state.
+ *
+ * Return: The number of bytes written on success, ``-EINVAL`` if the input is
+ * invalid or if the live update system is not in the 'normal' state, or
+ * ``-EAGAIN`` if the operation was interrupted.
+ */
+static ssize_t requested_store(struct device *dev, struct device_attribute *attr,
+			       const char *buf, size_t count)
+{
+	long val;
+
+	if (kstrtol(buf, 0, &val) < 0)
+		return -EINVAL;
+
+	if (val != 1 && val != 0)
+		return -EINVAL;
+
+	/* if state does not change, ignore */
+	if (dev_liveupdate_requested(dev) == !!val)
+		return count;
+
+	if (liveupdate_read_state_enter_killable()) {
+		dev_warn(dev, "Changing requested state canceled by user\n");
+		return -EAGAIN;
+	}
+
+	if (!liveupdate_state_normal()) {
+		dev_warn(dev, "Participation can be requested only in [normal] state\n");
+		liveupdate_read_state_exit();
+		return -EINVAL;
+	}
+
+	if (!val) {
+		dev_liveupdate_set_requested(dev, false);
+		list_del_init(&dev->lu.liveupdate_entry);
+		liveupdate_read_state_exit();
+		return count;
+	}
+
+	if (!is_liveupdate_possible(dev)) {
+		liveupdate_read_state_exit();
+		return -EINVAL;
+	}
+
+	dev_liveupdate_set_requested(dev, true);
+	liveupdate_read_state_exit();
+
+	return count;
+}
+static DEVICE_ATTR_RW(requested);
+
+static struct attribute *liveupdate_attrs[] = {
+	&dev_attr_preserved.attr,
+	&dev_attr_reclaimed.attr,
+	&dev_attr_requested.attr,
+	NULL,
+};
+
+static const struct attribute_group liveupdate_attr_group = {
+	.name	= liveupdate_group_name,
+	.attrs	= liveupdate_attrs,
+};
+
+static int dev_liveupdate_sysfs_add(struct device *dev)
+{
+	int rv;
+
+	rv = sysfs_create_group(&dev->kobj, &liveupdate_attr_group);
+
+	return rv;
+}
+
+static int dev_liveupdate_get_depth(struct device *current_dev)
+{
+	struct device *dev;
+	int depth = 0;
+
+	for (dev = current_dev; dev; dev = dev->parent)
+		depth++;
+
+	return depth;
+}
+
+/**
+ * LIST_HEAD(dev_liveupdate_preserve_list) - List of devices to preserve during
+ * live update
+ * @dev_liveupdate_preserve_list: This section is about this list.
+ *
+ * This list holds devices that need to have their state preserved across a
+ * live update. It is populated during the ``LIVEUPDATE_PREPARE`` stage by
+ * dev_liveupdate_build_preserve_list() with devices explicitly requested
+ * for live update and their ancestors. The list is sorted by device depth
+ * to ensure correct processing order: children before parents.
+ *
+ * Functions like __dev_liveupdate_reboot_prepare() iterate through this list
+ * to notify drivers and buses about the upcoming update or reboot.
+ * __dev_liveupdate_cancel() uses this list to perform cancellation.
+ * The list is cleared by dev_liveupdate_destroy_preserve_list() when it's
+ * no longer needed.
+ *
+ * The list is protected by ``luo_state_rwsem`` as it is used only during
+ * prepare and reboot callbacks when this lock is taken as writer.
+ */
+static LIST_HEAD(dev_liveupdate_preserve_list);
+
+/**
+ * __find_ancestors_and_depth() - Add a device and its ancestors to the preserve
+ * list
+ * @current_dev: The device to start with.
+ *
+ * This function adds the @current_dev and all its ancestors to the
+ * dev_liveupdate_preserve_list. It also calculates and sets the
+ * liveupdate_depth for each device added, relative to the @current_dev.
+ *
+ * The function iterates from @current_dev up to the root device. For each
+ * device in the path, if it's not already in the preserve list (checked via
+ * the liveupdate_depth field), it's added to the list, its depth is set,
+ * and a reference is taken using get_device() (unless it's the initial
+ * @current_dev, which already has a reference).
+ *
+ * The list to which the devices are added (dev_liveupdate_preserve_list) is
+ * expected to be sorted later.
+ */
+static void __find_ancestors_and_depth(struct device *current_dev)
+{
+	struct device *dev;
+	int depth = 0;
+
+	/*
+	 * If depth is set, it means this device was already included as an
+	 * ancestor of another requested device.
+	 */
+	if (current_dev->lu.liveupdate_depth)
+		return;
+
+	depth = dev_liveupdate_get_depth(current_dev);
+
+	for (dev = current_dev; dev; dev = dev->parent) {
+		/*
+		 * This ancestor, and all above are already in the
+		 * dev_liveupdate_preserve_list
+		 */
+		if (dev->lu.liveupdate_depth)
+			break;
+
+		if (dev != current_dev)
+			get_device(dev);
+
+		/* Ancestor might be in the request_list */
+		list_del_init(&dev->lu.liveupdate_entry);
+		dev->lu.liveupdate_depth = depth;
+		list_add_tail(&dev->lu.liveupdate_entry,
+			      &dev_liveupdate_preserve_list);
+		depth--;
+	}
+}
+
+static int dev_depth_cmp(void *priv,
+			 const struct list_head *head_a,
+			 const struct list_head *head_b)
+{
+	struct device *dev_a, *dev_b;
+
+	dev_a = container_of(head_a, struct device, lu.liveupdate_entry);
+	dev_b = container_of(head_b, struct device, lu.liveupdate_entry);
+
+	if (dev_a->lu.liveupdate_depth > dev_b->lu.liveupdate_depth)
+		return -1;
+
+	if (dev_a->lu.liveupdate_depth < dev_b->lu.liveupdate_depth)
+		return 1;
+
+	return 0;
+}
+
+/**
+ * dev_liveupdate_build_preserve_list() - Build a list of devices to preserve
+ *
+ * This function constructs a list ``dev_liveupdate_preserve_list`` of devices
+ * that require state preservation during a live update.
+ *
+ * It first iterates through all devices and identifies those for which a live
+ * update has been explicitly requested using dev_liveupdate_requested().
+ * These devices are added to a temporary list.
+ *
+ * Then, for each device in the temporary list, the function calls
+ * __find_ancestors_and_depth() to add the device and all its ancestors to the
+ * global ``dev_liveupdate_preserve_list`` and calculate their respective
+ * depths.
+ *
+ * Finally, the ``dev_liveupdate_preserve_list`` is sorted by device depth using
+ * dev_depth_cmp() to ensure a correct preservation order (e.g., children before
+ * parents). A reference count is maintained for each device added to the
+ * preserve list using get_device().
+ */
+static void dev_liveupdate_build_preserve_list(void)
+{
+	LIST_HEAD(request_list);
+	struct device *dev;
+
+	spin_lock(&devices_kset->list_lock);
+	list_for_each_entry(dev, &devices_kset->list, kobj.entry) {
+		get_device(dev);
+		spin_unlock(&devices_kset->list_lock);
+		if (dev_liveupdate_requested(dev)) {
+			list_add_tail(&dev->lu.liveupdate_entry,
+				      &request_list);
+		} else {
+			put_device(dev);
+		}
+		spin_lock(&devices_kset->list_lock);
+	}
+	spin_unlock(&devices_kset->list_lock);
+
+	while (!list_empty(&request_list)) {
+		dev = list_first_entry(&request_list,
+				       struct device,
+				       lu.liveupdate_entry);
+		list_del_init(&dev->lu.liveupdate_entry);
+		__find_ancestors_and_depth(dev);
+	}
+
+	list_sort(NULL, &dev_liveupdate_preserve_list, dev_depth_cmp);
+}
+
+/**
+ * dev_liveupdate_destroy_preserve_list() - Destroy the live update preserve
+ * list
+ *
+ * This function iterates through the ``dev_liveupdate_preserve_list``, which
+ * contains devices ordered by depth, and performs cleanup for each device.
+ * For each device in the list, it:
+ *
+ * 1. Removes the device from the list and reinitializes its list head.
+ * 2. Resets the liveupdate_depth field to 0.
+ * 3. Calls put_device() to decrement the device's reference count.
+ *
+ * This function is typically called after the preserve list is no longer
+ * needed, such as after the reboot phase of a live update or during
+ * cancellation.
+ */
+static void dev_liveupdate_destroy_preserve_list(void)
+{
+	struct device *dev;
+
+	while (!list_empty(&dev_liveupdate_preserve_list)) {
+		dev = list_first_entry(&dev_liveupdate_preserve_list,
+				       struct device,
+				       lu.liveupdate_entry);
+		list_del_init(&dev->lu.liveupdate_entry);
+		dev->lu.liveupdate_depth = 0;
+		put_device(dev);
+	}
+}
+
+/**
+ * __dev_liveupdate_cancel() - Cancel live update for devices
+ * @dev: The device from which to start the cancellation (or NULL to cancel
+ * all).
+ *
+ * This function cancels the ongoing live update process for devices starting
+ * from the position just before the given @dev in the
+ * ``dev_liveupdate_preserve_list`` and proceeding backwards to the beginning of
+ * the list. If @dev is ``NULL``, the cancellation is performed for all devices
+ * in the list.
+ *
+ * It iterates through the relevant devices in reverse order, calling the
+ * ``LIVEUPDATE_CANCEL`` handler for each device's bus and driver (if
+ * available). After processing the devices, it clears the liveupdate_preserved
+ * flag for each device and finally destroys the
+ * ``dev_liveupdate_preserve_list``.
+ */
+static void __dev_liveupdate_cancel(struct device *dev)
+{
+	dev = list_prepare_entry(dev, &dev_liveupdate_preserve_list,
+				 lu.liveupdate_entry);
+
+	list_for_each_entry_continue_reverse(dev, &dev_liveupdate_preserve_list,
+					     lu.liveupdate_entry) {
+		DEV_LIVEUPDATE_CALLBACK(dev, bus, cancel);
+		DEV_LIVEUPDATE_CALLBACK(dev, driver, cancel);
+
+		dev->lu.liveupdate_preserved = false;
+	}
+
+	dev_liveupdate_destroy_preserve_list();
+}
+
+/**
+ * __dev_liveupdate_reboot_prepare() - Notify drivers and buses of a
+ * prepare/reboot event
+ * @event: The live update event, either ``LIVEUPDATE_PREPARE`` or
+ * ``LIVEUPDATE_REBOOT``.
+ *
+ * This function iterates through the list of devices to be preserved
+ * (``dev_liveupdate_preserve_list``) and calls the liveupdate() callback for
+ * the driver and bus of each device with the specified event.
+ *
+ * If a driver or bus callback returns an error, a warning is logged,
+ * and the function attempts to cancel the live update for the remaining devices
+ * using __dev_liveupdate_cancel().
+ *
+ * Upon successful completion for a device, the ``liveupdate_preserved`` flag
+ * for that device is set to true.
+ *
+ * Return: 0 on success, or the error code from the failing driver/bus
+ * liveupdate->{prepare, reboot} callback.
+ */
+static int __dev_liveupdate_reboot_prepare(enum liveupdate_event event)
+{
+	struct device *dev;
+	int rv;
+
+	rv = 0;
+	list_for_each_entry(dev, &dev_liveupdate_preserve_list,
+			    lu.liveupdate_entry) {
+		if (event == LIVEUPDATE_PREPARE)
+			rv = DEV_LIVEUPDATE_RET_CALLBACK(dev, driver, prepare);
+		else
+			rv = DEV_LIVEUPDATE_RET_CALLBACK(dev, driver, reboot);
+
+		if (rv) {
+			dev_warn(dev, "driver live update failed\n");
+			goto err_cancel;
+		}
+
+		if (event == LIVEUPDATE_PREPARE)
+			rv = DEV_LIVEUPDATE_RET_CALLBACK(dev, bus, prepare);
+		else
+			rv = DEV_LIVEUPDATE_RET_CALLBACK(dev, bus, reboot);
+
+		if (rv) {
+			dev_warn(dev, "bus live update failed\n");
+			goto err_cancel_bus;
+		}
+
+		dev->lu.liveupdate_preserved = true;
+	}
+
+	return 0;
+
+err_cancel_bus:
+	DEV_LIVEUPDATE_CALLBACK(dev, driver, cancel);
+
+err_cancel:
+	__dev_liveupdate_cancel(dev);
+
+	return rv;
+}
+
+/**
+ * device_liveupdate_prepare() - Prepare devices for a live update
+ *
+ * This function is called as part of the ``LIVEUPDATE_PREPARE`` stage.
+ * It first calls dev_liveupdate_build_preserve_list() to construct a list
+ * of devices that need their state preserved during the update.
+ * Then, it calls the internal function __dev_liveupdate_reboot_prepare()
+ * with the ``LIVEUPDATE_PREPARE`` event to notify drivers and buses to prepare
+ * for the upcoming update.
+ *
+ * Return: The return value from __dev_liveupdate_reboot_prepare().
+ */
+static int device_liveupdate_prepare(void)
+{
+	dev_liveupdate_build_preserve_list();
+
+	return __dev_liveupdate_reboot_prepare(LIVEUPDATE_PREPARE);
+}
+
+/**
+ * device_liveupdate_reboot() - Prepare devices for the reboot stage of a live
+ * update
+ *
+ * This function is called as part of the ``LIVEUPDATE_REBOOT`` stage, from
+ * reboot() syscall. It calls the internal function
+ * __dev_liveupdate_reboot_prepare() with the LIVEUPDATE_REBOOT event to notify
+ * drivers and buses to perform any actions needed before the reboot.  If the
+ * reboot preparation is successful (returns 0), it then calls
+ * dev_liveupdate_destroy_preserve_list() to free the list of devices that was
+ * built during the prepare stage.
+ *
+ * Return: The return value from __dev_liveupdate_reboot_prepare().
+ */
+static int device_liveupdate_reboot(void)
+{
+	int rv;
+
+	rv = __dev_liveupdate_reboot_prepare(LIVEUPDATE_REBOOT);
+	if (!rv)
+		dev_liveupdate_destroy_preserve_list();
+
+	return rv;
+}
+
+/**
+ * device_liveupdate_finish() - Finalize the device live update process
+ *
+ * This function is called as part of the ``LIVEUPDATE_FINISH`` stage. It
+ * iterates through all registered devices, identifies the devices that were
+ * preserved during the prepare phase, and sorts them by depth.
+ *
+ * After sorting, the function iterates through the list. For each device, it
+ * logs a warning if the device was not reclaimed and calls the
+ * ``{driver, bus}->liveupdate->finish()`` handler of the device's driver and
+ * bus. Finally, it resets the live update related fields in the device's
+ * ``dev_liveupdate`` structure, effectively removing it from live update
+ * tracking.
+ *
+ * Note: this function must not fail.
+ *
+ * Return: Always returns 0.
+ */
+static int device_liveupdate_finish(void)
+{
+	LIST_HEAD(preserved_list);
+	struct device *dev;
+
+	spin_lock(&devices_kset->list_lock);
+	list_for_each_entry(dev, &devices_kset->list, kobj.entry) {
+		get_device(dev);
+		spin_unlock(&devices_kset->list_lock);
+		if (!dev_liveupdate_preserved(dev)) {
+			put_device(dev);
+			spin_lock(&devices_kset->list_lock);
+			continue;
+		}
+
+		list_add_tail(&dev->lu.liveupdate_entry, &preserved_list);
+		dev->lu.liveupdate_depth = dev_liveupdate_get_depth(dev);
+		spin_lock(&devices_kset->list_lock);
+	}
+	spin_unlock(&devices_kset->list_lock);
+
+	list_sort(NULL, &preserved_list, dev_depth_cmp);
+
+	while (!list_empty(&preserved_list)) {
+		dev = list_first_entry(&preserved_list, struct device,
+				       lu.liveupdate_entry);
+
+		if (!dev_liveupdate_reclaimed(dev))
+			dev_warn(dev, "Device was not reclaimed during live update\n");
+
+		DEV_LIVEUPDATE_CALLBACK(dev, driver, finish);
+		DEV_LIVEUPDATE_CALLBACK(dev, bus, finish);
+
+		/* Reset live update fields to their default values */
+		list_del_init(&dev->lu.liveupdate_entry);
+		dev->lu.liveupdate_reclaimed = false;
+		dev->lu.liveupdate_preserved = false;
+		dev->lu.liveupdate_depth = 0;
+		put_device(dev);
+	}
+
+	return 0;
+}
+
+/**
+ * device_liveupdate_cancel() - Cancel the ongoing device live update process
+ *
+ * This function is called as part of the ``LIVEUPDATE_CANCEL`` stage. It
+ * initiates the cancellation of the live update process by calling the
+ * internal function __dev_liveupdate_cancel() with a NULL argument,
+ * indicating a global cancellation.
+ *
+ * Note: this function must not fail.
+ *
+ * Return: Always returns 0.
+ */
+static int device_liveupdate_cancel(void)
+{
+	__dev_liveupdate_cancel(NULL);
+
+	return 0;
+}
+
+LIVEUPDATE_DECLARE_NOTIFIER(device, LIVEUPDATE_CB_PRIO_WITH_DEVICES);
+
+/**
+ * dev_liveupdate_startup() - Register device live update notifier
+ *
+ * This function is called during the late initialization phase of the kernel.
+ * It registers a notifier for the device subsystem with the Live Update
+ * Orchestrator.
+ *
+ * If registration fails, a warning message is printed to the kernel log.
+ *
+ * Return: Always 0; a registration failure is only logged.
+ */
+static int __init dev_liveupdate_startup(void)
+{
+	int rv;
+
+	rv = LIVEUPDATE_REGISTER_NOTIFIER(device);
+	if (rv) {
+		pr_warn("Failed to register devices with live update orchestrator [%d]\n",
+			rv);
+	}
+
+	return 0;
+}
+late_initcall(dev_liveupdate_startup);
+
+/* Public Interfaces */
+
+/**
+ * dev_liveupdate_init() - Initialize the dev_liveupdate structure
+ * @dev: The device whose embedded dev_liveupdate structure is initialized.
+ *
+ * This function initializes the fields of the dev_liveupdate structure
+ * to their default states. The list head is initialized, and the
+ * boolean flags are cleared. The depth is initialized to 0.
+ */
+void dev_liveupdate_init(struct device *dev)
+{
+	INIT_LIST_HEAD(&dev->lu.liveupdate_entry);
+	dev->lu.liveupdate_requested = false;
+	dev->lu.liveupdate_preserved = false;
+	dev->lu.liveupdate_reclaimed = false;
+	dev->lu.liveupdate_depth = 0;
+}
+EXPORT_SYMBOL_GPL(dev_liveupdate_init);
+
+/**
+ * dev_liveupdate_add_device() - Add live update sysfs interface to a new device
+ * @dev: The device to add to the live update system.
+ *
+ * This function checks if live update functionality is enabled. If it is,
+ * it attempts to add the live update sysfs interface for the given device.
+ * If the sysfs group creation fails, a warning message is logged.
+ */
+void dev_liveupdate_add_device(struct device *dev)
+{
+	if (!liveupdate_enabled())
+		return;
+
+	if (dev_liveupdate_sysfs_add(dev))
+		dev_warn(dev, "Failed to create liveupdate sysfs group\n");
+}
+EXPORT_SYMBOL_GPL(dev_liveupdate_add_device);
+
+/**
+ * dev_liveupdate_sysfs_change_owner() - Change the owner of the liveupdate
+ * sysfs group
+ * @dev: The device whose liveupdate sysfs group owner is to be changed.
+ * @kuid: The user ID for the new owner.
+ * @kgid: The group ID for the new owner.
+ *
+ * This function changes the ownership of the sysfs attribute group associated
+ * with the live update interface for the given device. It uses the
+ * sysfs_group_change_owner() function to update the owner to the specified
+ * user ID (@kuid) and group ID (@kgid).
+ *
+ * Return: 0 on success, or a negative error code returned by
+ * sysfs_group_change_owner().
+ */
+int dev_liveupdate_sysfs_change_owner(struct device *dev,
+				      kuid_t kuid,
+				      kgid_t kgid)
+{
+	return sysfs_group_change_owner(&dev->kobj, &liveupdate_attr_group,
+					kuid, kgid);
+}
+EXPORT_SYMBOL_GPL(dev_liveupdate_sysfs_change_owner);
+
+/**
+ * dev_liveupdate_preserved() - Check if a device's live update state is
+ * preserved
+ * @dev: The device to check.
+ *
+ * Returns: true if the device's live update state has been preserved,
+ * false otherwise.
+ */
+bool dev_liveupdate_preserved(struct device *dev)
+{
+	return dev->lu.liveupdate_preserved;
+}
+EXPORT_SYMBOL_GPL(dev_liveupdate_preserved);
+
+/**
+ * dev_liveupdate_reclaimed() - Check if a device was reclaimed after live
+ * update
+ * @dev: The device to check.
+ *
+ * Returns: true if the device has been reclaimed, false otherwise.
+ */
+bool dev_liveupdate_reclaimed(struct device *dev)
+{
+	return dev->lu.liveupdate_reclaimed;
+}
+EXPORT_SYMBOL_GPL(dev_liveupdate_reclaimed);
+
+/**
+ * dev_liveupdate_requested() - Check if a live update has been requested for
+ * the device
+ * @dev: The device to check.
+ *
+ * Returns: true if a live update has been requested for the device (i.e.
+ * device and its ancestors are going to participate in live update), false
+ * otherwise.
+ */
+bool dev_liveupdate_requested(struct device *dev)
+{
+	return dev->lu.liveupdate_requested;
+}
+EXPORT_SYMBOL_GPL(dev_liveupdate_requested);
+
+/**
+ * dev_liveupdate_set_requested() - Set the live update requested state for a
+ * device
+ * @dev: The device to modify.
+ * @val: The boolean value to set the requested state to (true or false).
+ *
+ * Sets the ``liveupdate_requested`` flag for the given device to the
+ * specified value.
+ */
+void dev_liveupdate_set_requested(struct device *dev, bool val)
+{
+	dev->lu.liveupdate_requested = val;
+}
+EXPORT_SYMBOL_GPL(dev_liveupdate_set_requested);
diff --git a/include/linux/dev_liveupdate.h b/include/linux/dev_liveupdate.h
new file mode 100644
index 000000000000..caf38e16ba91
--- /dev/null
+++ b/include/linux/dev_liveupdate.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+#ifndef _LINUX_DEV_LIVEUPDATE_H
+#define _LINUX_DEV_LIVEUPDATE_H
+
+#include <linux/liveupdate.h>
+
+#ifdef CONFIG_LIVEUPDATE
+
+/**
+ * struct dev_liveupdate - Device state for live update operations
+ * @liveupdate_entry:     List head for linking the device into live update
+ *                        related lists (e.g., a list of devices participating
+ *                        in a live update sequence).
+ * @liveupdate_requested: Set if a live update has been requested for this
+ *                        device (i.e. device will participate in live update).
+ * @liveupdate_preserved: Set if the device's state has been successfully
+ *                        preserved during a live update prepare phase.
+ * @liveupdate_reclaimed: Set if resources or state associated with a
+ *                        previous live update attempt have been reclaimed.
+ *                        Device has been re-attached to previous work and
+ *                        resumed its operation.
+ * @liveupdate_depth:     The hierarchical depth of the device, used for
+ *                        ordering live update operations. Lower values
+ *                        indicate devices closer to the root.
+ *
+ * This structure holds the state information required for performing
+ * live update operations on a device. It is embedded within a struct device.
+ */
+struct dev_liveupdate {
+	struct list_head liveupdate_entry;
+	bool liveupdate_requested:1;
+	bool liveupdate_preserved:1;
+	bool liveupdate_reclaimed:1;
+	int liveupdate_depth:28;
+};
+
+/**
+ * struct dev_liveupdate_cbs - Live Update callback functions
+ * @prepare:     Prepare device for the upcoming state transition. Driver and
+ *               bus should save the necessary device state. Happens before
+ *               the blackout.
+ * @reboot:      A final notification before the system jumps to the new kernel.
+ *               Called during blackout from reboot() syscall.
+ * @finish:      The system has completed a transition. Drivers and buses should
+ *               have already restored the previously saved device state.
+ *               Clean up any saved state or reset unreclaimed devices.
+ * @cancel:      Cancel the live update process. Driver should clean
+ *               up any saved state if necessary.
+ *
+ * This structure is used by drivers and buses to hold the callback from LUO.
+ */
+struct dev_liveupdate_cbs {
+	int (*prepare)(struct device *dev);
+	int (*reboot)(struct device *dev);
+	void (*finish)(struct device *dev);
+	void (*cancel)(struct device *dev);
+};
+
+void dev_liveupdate_init(struct device *dev);
+void dev_liveupdate_add_device(struct device *dev);
+int dev_liveupdate_sysfs_change_owner(struct device *dev,
+				      kuid_t kuid,
+				      kgid_t kgid);
+
+bool dev_liveupdate_preserved(struct device *dev);
+bool dev_liveupdate_reclaimed(struct device *dev);
+bool dev_liveupdate_requested(struct device *dev);
+void dev_liveupdate_set_requested(struct device *dev, bool val);
+
+#else /* CONFIG_LIVEUPDATE */
+
+static inline void dev_liveupdate_init(struct device *dev) { }
+static inline void dev_liveupdate_add_device(struct device *dev) { }
+
+static inline int dev_liveupdate_sysfs_change_owner(struct device *dev,
+						    kuid_t kuid,
+						    kgid_t kgid)
+{
+	return 0;
+}
+
+static inline bool dev_liveupdate_preserved(struct device *dev)
+{
+	return false;
+}
+
+static inline bool dev_liveupdate_reclaimed(struct device *dev)
+{
+	return false;
+}
+
+static inline bool dev_liveupdate_requested(struct device *dev)
+{
+	return false;
+}
+
+static inline void dev_liveupdate_set_requested(struct device *dev, bool val)
+{
+}
+
+static inline void dev_liveupdate_set_reclaimed(struct device *dev) { }
+
+#endif /* CONFIG_LIVEUPDATE */
+#endif /* _LINUX_DEV_LIVEUPDATE_H */
diff --git a/include/linux/device.h b/include/linux/device.h
index 80a5b3268986..0b8cdc10e002 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -21,6 +21,7 @@
 #include <linux/lockdep.h>
 #include <linux/compiler.h>
 #include <linux/types.h>
+#include <linux/dev_liveupdate.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
 #include <linux/atomic.h>
@@ -663,6 +664,7 @@ struct device_physical_location {
  * @pm_domain:	Provide callbacks that are executed during system suspend,
  * 		hibernation, system resume and during runtime PM transitions
  * 		along with subsystem-level and driver-level callbacks.
+ * @lu:		Live update state.
  * @em_pd:	device's energy model performance domain
  * @pins:	For device pin management.
  *		See Documentation/driver-api/pin-control.rst for details.
@@ -758,6 +760,10 @@ struct device {
 	struct dev_pm_info	power;
 	struct dev_pm_domain	*pm_domain;
 
+#ifdef CONFIG_LIVEUPDATE
+	struct dev_liveupdate	lu;
+#endif
+
 #ifdef CONFIG_ENERGY_MODEL
 	struct em_perf_domain	*em_pd;
 #endif
diff --git a/include/linux/device/bus.h b/include/linux/device/bus.h
index f5a56efd2bd6..d05f12187d34 100644
--- a/include/linux/device/bus.h
+++ b/include/linux/device/bus.h
@@ -17,6 +17,7 @@
 #include <linux/kobject.h>
 #include <linux/klist.h>
 #include <linux/pm.h>
+#include <linux/dev_liveupdate.h>
 
 struct device_driver;
 struct fwnode_handle;
@@ -63,6 +64,8 @@ struct fwnode_handle;
  *			this bus.
  * @pm:		Power management operations of this bus, callback the specific
  *		device driver's pm-ops.
+ * @liveupdate:	Live update callbacks, notify bus of the live update state, and
+ *		allow preserving the device across reboot.
  * @need_parent_lock:	When probing or removing a device on this bus, the
  *			device core should lock the device's parent.
  *
@@ -103,6 +106,7 @@ struct bus_type {
 	void (*dma_cleanup)(struct device *dev);
 
 	const struct dev_pm_ops *pm;
+	const struct dev_liveupdate_cbs *liveupdate;
 
 	bool need_parent_lock;
 };
diff --git a/include/linux/device/driver.h b/include/linux/device/driver.h
index cd8e0f0a634b..01ade77061fc 100644
--- a/include/linux/device/driver.h
+++ b/include/linux/device/driver.h
@@ -19,6 +19,7 @@
 #include <linux/pm.h>
 #include <linux/device/bus.h>
 #include <linux/module.h>
+#include <linux/dev_liveupdate.h>
 
 /**
  * enum probe_type - device driver probe type to try
@@ -80,6 +81,8 @@ enum probe_type {
  *		it is bound to the driver.
  * @pm:		Power management operations of the device which matched
  *		this driver.
+ * @liveupdate:	Live update callbacks, notify device of the live
+ *		update state, and allow preserving the device across reboot.
  * @coredump:	Called when sysfs entry is written to. The device driver
  *		is expected to call the dev_coredump API resulting in a
  *		uevent.
@@ -116,6 +119,7 @@ struct device_driver {
 	const struct attribute_group **dev_groups;
 
 	const struct dev_pm_ops *pm;
+	const struct dev_liveupdate_cbs *liveupdate;
 	void (*coredump) (struct device *dev);
 
 	struct driver_private *p;
-- 
2.49.0.395.g12beb8f557-goog
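
For readers following the patch above, here is a rough userspace sketch of how a driver might fill in struct dev_liveupdate_cbs and how the core skips callbacks that are left NULL. It is not kernel code: struct device and struct device_driver are cut-down stand-ins, and the "foo" driver, its state_saved field, and the drv_*() helpers are hypothetical names invented for illustration. Only the layout of struct dev_liveupdate_cbs and the "return 0 if the callback is not implemented" rule are taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

struct device;

/* Same four callbacks as struct dev_liveupdate_cbs in the patch. */
struct dev_liveupdate_cbs {
	int (*prepare)(struct device *dev);
	int (*reboot)(struct device *dev);
	void (*finish)(struct device *dev);
	void (*cancel)(struct device *dev);
};

/* Cut-down stand-ins for the kernel structures (assumption). */
struct device_driver {
	const char *name;
	const struct dev_liveupdate_cbs *liveupdate;
};

struct device {
	struct device_driver *driver;
	bool state_saved;	/* hypothetical driver-private state */
};

/* Hypothetical "foo" driver: save state on prepare, drop it on cancel. */
static int foo_lu_prepare(struct device *dev)
{
	dev->state_saved = true;
	return 0;
}

static void foo_lu_cancel(struct device *dev)
{
	dev->state_saved = false;
}

/* reboot and finish stay NULL; the helpers below skip them. */
static const struct dev_liveupdate_cbs foo_lu_cbs = {
	.prepare = foo_lu_prepare,
	.cancel  = foo_lu_cancel,
};

static struct device_driver foo_driver = {
	.name       = "foo",
	.liveupdate = &foo_lu_cbs,
};

/* Like DEV_LIVEUPDATE_RET_CALLBACK(dev, driver, prepare), minus WARN_ON. */
static int drv_prepare(struct device *dev)
{
	if (dev->driver && dev->driver->liveupdate &&
	    dev->driver->liveupdate->prepare)
		return dev->driver->liveupdate->prepare(dev);
	return 0;
}

/* Like DEV_LIVEUPDATE_RET_CALLBACK(dev, driver, reboot), minus WARN_ON. */
static int drv_reboot(struct device *dev)
{
	if (dev->driver && dev->driver->liveupdate &&
	    dev->driver->liveupdate->reboot)
		return dev->driver->liveupdate->reboot(dev);
	return 0;
}

/* Like DEV_LIVEUPDATE_CALLBACK(dev, driver, cancel), minus WARN_ON. */
static void drv_cancel(struct device *dev)
{
	if (dev->driver && dev->driver->liveupdate &&
	    dev->driver->liveupdate->cancel)
		dev->driver->liveupdate->cancel(dev);
}
```

Binding a struct device to foo_driver and calling drv_prepare() sets state_saved, drv_reboot() is a no-op that returns 0 because the callback is absent, and drv_cancel() clears the saved state again. In the patch itself the same pattern applies to bus_type->liveupdate.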



^ permalink raw reply	[flat|nested] 21+ messages in thread
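
The ordering and rollback behaviour described in the kernel-doc of patch 2/3 (children before parents, and on failure a reverse walk over the devices that were already handled) can be modelled in plain C roughly as follows. This is a simplified sketch under stated assumptions, not the kernel implementation: struct node, its fields, and prepare_all() are hypothetical, an array with qsort() replaces the kernel list and list_sort(), and a single fail_prepare flag stands in for the separate driver and bus callbacks.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a device on the preserve list. */
struct node {
	const char *name;
	int depth;		/* 1 = root, larger = deeper in the tree */
	int fail_prepare;	/* nonzero: simulate a failing prepare */
	int preserved;
	int cancelled;
};

/* Deeper nodes sort first, the same ordering dev_depth_cmp() produces. */
static int depth_cmp(const void *a, const void *b)
{
	const struct node *na = *(const struct node * const *)a;
	const struct node *nb = *(const struct node * const *)b;

	if (na->depth > nb->depth)
		return -1;
	if (na->depth < nb->depth)
		return 1;
	return 0;
}

/*
 * Prepare every node, children first. If one fails, cancel the nodes
 * prepared before it, most recent first, and report the failure.
 * The failing node itself is not cancelled here.
 * Returns 0 on success, -1 on failure.
 */
static int prepare_all(struct node **list, size_t n)
{
	size_t i;

	qsort(list, n, sizeof(*list), depth_cmp);

	for (i = 0; i < n; i++) {
		if (list[i]->fail_prepare)
			goto err_cancel;
		list[i]->preserved = 1;
	}

	return 0;

err_cancel:
	while (i-- > 0) {
		list[i]->cancelled = 1;
		list[i]->preserved = 0;
	}

	return -1;
}
```

One difference worth keeping in mind: the kernel's list_sort() is stable, while qsort() gives no ordering guarantee among nodes of equal depth. For this model that does not matter, since only the relative order of different depths is relied upon.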

* [RFC v1 3/3] luo: x86: Enable live update support
  2025-03-20  2:40 [RFC v1 0/3] Live Update Orchestrator Pasha Tatashin
  2025-03-20  2:40 ` [RFC v1 1/3] luo: " Pasha Tatashin
  2025-03-20  2:40 ` [RFC v1 2/3] luo: dev_liveupdate: Add device live update infrastructure Pasha Tatashin
@ 2025-03-20  2:40 ` Pasha Tatashin
  2025-03-20 13:35 ` [RFC v1 0/3] Live Update Orchestrator Greg KH
  3 siblings, 0 replies; 21+ messages in thread
From: Pasha Tatashin @ 2025-03-20  2:40 UTC (permalink / raw)
  To: changyuanl, graf, pasha.tatashin, rppt, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song,
	zhangguopeng, linux, linux-kernel, linux-doc, linux-mm, gregkh,
	tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, jgowans, jgg

Enable the Live Update Orchestrator for the x86 architecture.

It does so by selecting ARCH_SUPPORTS_LIVEUPDATE when KEXEC_HANDOVER is
available, signaling to the LUO core that the architecture provides the
necessary Kexec Handover functionality required for live updates.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index acd180e3002f..a7497cc84fbb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -29,6 +29,7 @@ config X86_64
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_PER_VMA_LOCK
 	select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
+	select ARCH_SUPPORTS_LIVEUPDATE if KEXEC_HANDOVER
 	select HAVE_ARCH_SOFT_DIRTY
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE
-- 
2.49.0.395.g12beb8f557-goog



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC v1 2/3] luo: dev_liveupdate: Add device live update infrastructure
  2025-03-20  2:40 ` [RFC v1 2/3] luo: dev_liveupdate: Add device live update infrastructure Pasha Tatashin
@ 2025-03-20 13:34   ` Greg KH
  2025-03-20 18:03     ` Pasha Tatashin
  0 siblings, 1 reply; 21+ messages in thread
From: Greg KH @ 2025-03-20 13:34 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, tglx, mingo, bp, dave.hansen, x86, hpa,
	rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, jgowans, jgg

On Thu, Mar 20, 2025 at 02:40:10AM +0000, Pasha Tatashin wrote:
> Introduce a new subsystem within the driver core to enable keeping
> devices alive during kernel live update. This infrastructure is
> designed to be registered with and driven by a separate Live Update
> Orchestrator, allowing the LUO's state machine to manage the save and
> restore process of device state during a kernel transition.
> 
> The goal is to allow drivers and buses to participate in a coordinated
> save and restore process orchestrated by a live update mechanism. By
> saving device state before the kernel switch and restoring it
> immediately after, the device can appear to remain continuously
> operational from the perspective of the system and userspace.
> 
> components introduced:
> 
> - `struct dev_liveupdate`: Embedded in `struct device` to track the
>   device's participation and state during a live update, including
>   request status, preservation status, and dependency depth.
> 
> - `liveupdate()` callback: Added to `struct bus_type` and
>   `struct device_driver`. This callback receives an enum
>   `liveupdate_event` to manage device state at different stages of the
>   live update process:
>     - LIVEUPDATE_PREPARE: Save device state before the kernel switch.
>     - LIVEUPDATE_REBOOT: Final actions just before the kernel jump.
>     - LIVEUPDATE_FINISH: Clean-up after live update.
>     - LIVEUPDATE_CANCEL: Clean up any saved state if the update is
>       aborted.
> 
> - Sysfs attribute "liveupdate/requested": Added under each device
>   directory, allowing the user to request that a specific device
>   participate in live update, i.e. that its state be preserved
>   during the update.

As you can imagine, I have "thoughts" about all of this being added to
the driver core.  But, before I go off on that, I want to see some real,
actual, working, patches for at least 3 bus subsystems that correctly
implement this before I even consider reviewing this.

Show us real users please, otherwise any attempt at reviewing this is
going to just be a waste of our time as I have doubts that this actually
even works :)

Also, as you are adding a new user/kernel api, please also point at the
userspace tools that are written to handle all of this.  As you are
going to be handling potentially tens of thousands of devices from
userspace this way, in a single system, real code is needed to even
consider that this is an acceptable solution.

thanks,

greg k-h



* Re: [RFC v1 0/3] Live Update Orchestrator
  2025-03-20  2:40 [RFC v1 0/3] Live Update Orchestrator Pasha Tatashin
                   ` (2 preceding siblings ...)
  2025-03-20  2:40 ` [RFC v1 3/3] luo: x86: Enable live update support Pasha Tatashin
@ 2025-03-20 13:35 ` Greg KH
  2025-03-20 15:34   ` Pasha Tatashin
  3 siblings, 1 reply; 21+ messages in thread
From: Greg KH @ 2025-03-20 13:35 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, tglx, mingo, bp, dave.hansen, x86, hpa,
	rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, jgowans, jgg

On Thu, Mar 20, 2025 at 02:40:08AM +0000, Pasha Tatashin wrote:
> From: Pasha Tatashin <tatashin@google.com>

Note, this does not match the author and signed-off-by on the actual
patches themselves.  Please use your google.com email address to
send/review/work on this.

thanks,

greg k-h



* Re: [RFC v1 1/3] luo: Live Update Orchestrator
  2025-03-20  2:40 ` [RFC v1 1/3] luo: " Pasha Tatashin
@ 2025-03-20 13:39   ` Andy Shevchenko
  2025-03-20 16:35     ` Pasha Tatashin
  2025-03-20 14:43   ` Jason Gunthorpe
  1 sibling, 1 reply; 21+ messages in thread
From: Andy Shevchenko @ 2025-03-20 13:39 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes,
	jgowans, jgg

On Thu, Mar 20, 2025 at 02:40:09AM +0000, Pasha Tatashin wrote:
> Introduces the Live Update Orchestrator (LUO), a new kernel subsystem
> designed to facilitate live updates. Live update is a method to reboot
> the kernel while attempting to keep selected devices alive across the
> reboot boundary, minimizing downtime.
> 
> The primary use case is cloud environments, allowing hypervisor updates
> without fully disrupting running virtual machines. VMs can be suspended
> while the hypervisor kernel reboots, and devices attached to these VM
> are kept operational by the LUO.
> 
> Features introduced:
> 
> - Core orchestration logic for managing the live update process.
> - A state machine (NORMAL, PREPARED, UPDATED, *_FAILED) to track
>   the progress of live updates.
> - Notifier chains for subsystems (device layer, interrupts, KVM, IOMMU,
>   etc.) to register callbacks for different live update events:
>     - LIVEUPDATE_PREPARE: Prepare for reboot (before blackout).
>     - LIVEUPDATE_REBOOT: Final serialization before kexec (blackout).
>     - LIVEUPDATE_FINISH: Cleanup after update (after blackout).
>     - LIVEUPDATE_CANCEL: Rollback actions on failure or user request.
> - A sysfs interface (/sys/kernel/liveupdate/) for user-space control:
>     - `prepare`: Initiate preparation (write 1) or reset (write 0).
>     - `finish`: Finalize update in new kernel (write 1).
>     - `cancel`: Abort ongoing preparation or reboot (write 1).
>     - `reset`: Force state back to normal (write 1).
>     - `state`: Read-only view of the current LUO state.
>     - `enabled`: Read-only view of whether live update is enabled.
> - Integration with KHO to pass orchestrator state to the new kernel.
> - Version checking during startup of the new kernel to ensure
>   compatibility with the previous kernel's live update state.
> 
> This infrastructure allows various kernel subsystems to coordinate and
> participate in the live update process, serializing and restoring device
> state across a kernel reboot.

...

> +Date:		March 2025
> +KernelVersion:	6.14.0

This is way too optimistic; it won't even make v6.15.
The date can be chosen to match either v6.16-rc1 or the v6.16 release,
in accordance with the prediction tool.

...

> +#ifndef _LINUX_LIVEUPDATE_H
> +#define _LINUX_LIVEUPDATE_H

> +#include <linux/compiler.h>
> +#include <linux/notifier.h>

This is a semi-random list of inclusions. Try to follow the IWYU principle.
See below.

...

> +bool liveupdate_state_updated(void);

Where is bool defined?

...

> +/**
> + * LIVEUPDATE_DECLARE_NOTIFIER - Declare a live update notifier with default
> + * structure.
> + * @_name: A base name used to generate the names of the notifier block
> + * (e.g., ``_name##_liveupdate_notifier_block``) and the callback function
> + * (e.g., ``_name##_liveupdate``).
> + * @_priority: The priority of the notifier, specified using the
> + * ``enum liveupdate_cb_priority`` values
> + * (e.g., ``LIVEUPDATE_CB_PRIO_BEFORE_DEVICES``).
> + *
> + * This macro declares a static struct notifier_block and a corresponding
> + * notifier callback function for use with the live update orchestrator.
> + * It simplifies the process by automatically handling the dispatching of
> + * live update events to separate handler functions for prepare, reboot,
> + * finish, and cancel.
> + *
> + * This macro expects the following functions to be defined:
> + *
> + * ``_name##_liveupdate_prepare()``:  Called on LIVEUPDATE_PREPARE.
> + * ``_name##_liveupdate_reboot()``:   Called on LIVEUPDATE_REBOOT.
> + * ``_name##_liveupdate_finish()``:   Called on LIVEUPDATE_FINISH.
> + * ``_name##_liveupdate_cancel()``:   Called on LIVEUPDATE_CANCEL.
> + *
> + * The generated callback function handles the switch statement for the
> + * different live update events and calls the appropriate handler function.
> + * It also includes warnings if the finish or cancel handlers return an error.
> + *
> + * For example, a declaration can look like this:
> + *
> + * ``static int foo_liveupdate_prepare(void) { ... }``
> + *
> + * ``static int foo_liveupdate_reboot(void) { ... }``
> + *
> + * ``static int foo_liveupdate_finish(void) { ... }``
> + *
> + * ``static int foo_liveupdate_cancel(void) { ... }``
> + *
> + * ``LIVEUPDATE_DECLARE_NOTIFIER(foo, LIVEUPDATE_CB_PRIO_WITH_DEVICES);``
> + *

Hmm... Have you run the kernel-doc validator? The Return section is missing
and it will warn about that.

> + */
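
To make the dispatch described in the quoted comment concrete, here is a minimal userspace sketch of the same token-pasting technique. Everything in it is an assumption for illustration: the DEMO_* names, the demo_event enum and the foo_demo_* handlers are hypothetical and are not the macro from the patch, which generates a notifier callback and uses WARN_ONCE() rather than fprintf().

```c
#include <stdio.h>

/* Hypothetical event names; the patch uses LIVEUPDATE_* values instead. */
enum demo_event { DEMO_PREPARE, DEMO_REBOOT, DEMO_FINISH, DEMO_CANCEL };

/*
 * Generate <name>_demo_dispatch(), which routes each event to the
 * matching <name>_demo_<event>() handler via token pasting. Failures
 * of finish and cancel are reported, mirroring the quoted description.
 */
#define DEMO_DECLARE_DISPATCH(_name)					\
static int _name##_demo_dispatch(enum demo_event ev)			\
{									\
	int rv;								\
									\
	switch (ev) {							\
	case DEMO_PREPARE:						\
		return _name##_demo_prepare();				\
	case DEMO_REBOOT:						\
		return _name##_demo_reboot();				\
	case DEMO_FINISH:						\
		rv = _name##_demo_finish();				\
		if (rv)							\
			fprintf(stderr, "finish failed[%d]\n", rv);	\
		return rv;						\
	case DEMO_CANCEL:						\
		rv = _name##_demo_cancel();				\
		if (rv)							\
			fprintf(stderr, "cancel failed[%d]\n", rv);	\
		return rv;						\
	}								\
	return -1;	/* unknown event */				\
}

/* Handlers the macro expects to exist before it is expanded. */
static int foo_demo_prepare(void) { return 0; }
static int foo_demo_reboot(void)  { return 0; }
static int foo_demo_finish(void)  { return 0; }
static int foo_demo_cancel(void)  { return -5; }

DEMO_DECLARE_DISPATCH(foo)
```

A caller would then invoke foo_demo_dispatch(DEMO_PREPARE) and so on; the handlers must be declared before the macro is expanded, which is the same ordering constraint the quoted kernel-doc example implies.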

...

> +		WARN_ONCE(rv, "cancel failed[%d]\n", rv);		\

+ bug.h

...

> + #undef pr_fmt
> + #define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt

Leftover from the development?

> +#undef pr_fmt

Not needed as long as pr_fmt() is at the top of the file.

> +#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt

...

> +#include <linux/kernel.h>

What for? Can you follow IWYU, please? Here again is a semi-random list of
inclusions.

> +#include <linux/sysfs.h>
> +#include <linux/string.h>
> +#include <linux/rwsem.h>
> +#include <linux/err.h>
> +#include <linux/liveupdate.h>
> +#include <linux/cpufreq.h>
> +#include <linux/kexec_handover.h>

Can you keep them ordered? That will be easier to read and maintain.

...

> +static int __init early_liveupdate_param(char *buf)
> +{
> +	return kstrtobool(buf, &luo_enabled);
> +}

> +

Redundant blank line.

> +early_param("liveupdate", early_liveupdate_param);

...

> +/* Show the current live update state */
> +static ssize_t state_show(struct kobject *kobj,
> +			  struct kobj_attribute *attr,

There is still enough room on the previous line, even with the strict
80-column limit.

> +			  char *buf)
> +{
> +	return sysfs_emit(buf, "%s\n", LUO_STATE_STR);
> +}

...

> +		ret = blocking_notifier_call_chain(&luo_notify_list,
> +						   event,
> +						   NULL);

There is room on the previous lines. Ditto for the similar cases all over
the code.

...

> +{
> +	int ret;
> +
> +	if (down_write_killable(&luo_state_rwsem)) {
> +		pr_warn(" %s, change state canceled by user\n", __func__);

Why is __func__ so important in _this_ message? And why the leading space?
Ditto for the similar cases.

> +		return -EAGAIN;
> +	}
> +
> +	if (!IS_STATE(LIVEUPDATE_STATE_NORMAL)) {
> +		pr_warn("Can't switch to [%s] from [%s] state\n",
> +			luo_state_str[LIVEUPDATE_STATE_PREPARED],
> +			LUO_STATE_STR);
> +		up_write(&luo_state_rwsem);
> +
> +		return -EINVAL;
> +	}
> +
> +	ret = luo_notify(LIVEUPDATE_PREPARE);
> +	if (!ret)
> +		luo_set_state(LIVEUPDATE_STATE_PREPARED);
> +
> +	up_write(&luo_state_rwsem);
> +
> +	return ret;
> +}

...

> +static ssize_t prepare_store(struct kobject *kobj,
> +			     struct kobj_attribute *attr,
> +			     const char *buf,
> +			     size_t count)
> +{
> +	ssize_t ret;
> +	long val;

> +	if (kstrtol(buf, 0, &val) < 0)
> +		return -EINVAL;

Shadowed error code.


> +	if (val != 1 && val != 0)
> +		return -EINVAL;

What's wrong with using kstrtobool() from the beginning?

> +
> +	if (val)
> +		ret = luo_prepare();
> +	else
> +		ret = luo_cancel();

> +	if (!ret)
> +		ret = count;
> +
> +	return ret;

Can we go with the usual pattern "check for errors first"?

	if (ret)
		return ret;

	...

> +}

...

> +static ssize_t finish_store(struct kobject *kobj,
> +			    struct kobj_attribute *attr,
> +			    const char *buf,
> +			    size_t count)

Same comments as per above.

...

> +static struct attribute *luo_attrs[] = {
> +	&state_attribute.attr,
> +	&prepare_attribute.attr,
> +	&finish_attribute.attr,

> +	NULL,

No comma for the terminator entry.

> +};

...

> +static int __init luo_init(void)
> +{
> +	int ret;
> +
> +	if (!luo_enabled || !kho_is_enabled()) {
> +		pr_info("disabled by user\n");
> +		luo_enabled = false;
> +
> +		return 0;
> +	}

Can be written like

	if (!kho_is_enabled())
		luo_enabled = false;
	if (!luo_enabled) {
		pr_info("disabled by user\n");
		return 0;
	}

> +	ret = sysfs_create_group(kernel_kobj, &luo_attr_group);
> +	if (ret)
> +		pr_err("Failed to create group\n");
> +
> +	luo_sysfs_initialized = true;
> +	pr_info("Initialized\n");
> +
> +	return ret;

Something is odd here between (non-)checking for errors and printed messages.

> +}

...

> +EXPORT_SYMBOL_GPL(liveupdate_state_normal);

No namespace?

...

> --- a/kernel/reboot.c
> +++ b/kernel/reboot.c
> @@ -18,6 +18,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/syscore_ops.h>
>  #include <linux/uaccess.h>
> +#include <linux/liveupdate.h>

Can you preserve order (with given context at least)?

-- 
With Best Regards,
Andy Shevchenko





* Re: [RFC v1 1/3] luo: Live Update Orchestrator
  2025-03-20  2:40 ` [RFC v1 1/3] luo: " Pasha Tatashin
  2025-03-20 13:39   ` Andy Shevchenko
@ 2025-03-20 14:43   ` Jason Gunthorpe
  2025-03-20 19:00     ` Pasha Tatashin
  1 sibling, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2025-03-20 14:43 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, jgowans

On Thu, Mar 20, 2025 at 02:40:09AM +0000, Pasha Tatashin wrote:
> Introduces the Live Update Orchestrator (LUO), a new kernel subsystem
> designed to facilitate live updates. Live update is a method to reboot
> the kernel while attempting to keep selected devices alive across the
> reboot boundary, minimizing downtime.
> 
> The primary use case is cloud environments, allowing hypervisor updates
> without fully disrupting running virtual machines. VMs can be suspended
> while the hypervisor kernel reboots, and devices attached to these VM
> are kept operational by the LUO.
> 
> Features introduced:
> 
> - Core orchestration logic for managing the live update process.
> - A state machine (NORMAL, PREPARED, UPDATED, *_FAILED) to track
>   the progress of live updates.
> - Notifier chains for subsystems (device layer, interrupts, KVM, IOMMU,
>   etc.) to register callbacks for different live update events:
>     - LIVEUPDATE_PREPARE: Prepare for reboot (before blackout).
>     - LIVEUPDATE_REBOOT: Final serialization before kexec (blackout).
>     - LIVEUPDATE_FINISH: Cleanup after update (after blackout).
>     - LIVEUPDATE_CANCEL: Rollback actions on failure or user request.

I still don't think notifier chains are the right way to go about a lot
of this; most of it should be driven off of the file descriptors and
fdbox, not through notification.

At the very least we should not be adding notifier chains without a
clear user of them, and I'm not convinced that the iommu driver or
vfio are those users at the moment.

I feel more like the iommu can be brought into the serialization
indirectly by putting an iommufd into a fdbox.

> - A sysfs interface (/sys/kernel/liveupdate/) for user-space control:
>     - `prepare`: Initiate preparation (write 1) or reset (write 0).
>     - `finish`: Finalize update in new kernel (write 1).
>     - `cancel`: Abort ongoing preparation or reboot (write 1).
>     - `reset`: Force state back to normal (write 1).
>     - `state`: Read-only view of the current LUO state.
>     - `enabled`: Read-only view of whether live update is enabled.

I also think we should give up on the sysfs. If fdbox is going forward
in a char dev direction then I think we should have two char devs
/dev/kho/serialize and /dev/kho/deserialize and run the whole thing
through that. The concepts shown in the fdbox patches should be merged
into the kho/serialize char dev as just a general architecture of open
the char dev, put stuff into it, then finalize and do the kexec.

It gives you more options to avoid things like notifiers and a very
clear "session" linked to a FD lifetime that encloses the
serialization effort. I think that will make error case cleanup easier
and the whole thing more maintainable. IMHO sysfs is not a great API
choice for something so complicated.

Also agree with Greg, I think this needs more thoughtful patch staging
with actual complete solutions. I suggest focusing on a progression of
demonstrable kexec preservation:
 - A simple KVM and the VM's backing memory in a memfd is preserved
 - A simple vfio-noiommu doing DMA to a preserved memfd, including not
   resetting the device (but with no iommu driver)
 - iommufd

This all builds on each other and introduces API along with concrete
and meaningful use cases.

I see a lot of confusion in the various review comments on the KHO work
that I think misunderstands the scope of what would be brought into
this. It is not hundreds of FDs or hundreds of devices, but a very
narrow and selective set that can work like this. Showing each
step along the way would help narrow the thinking.

Jason



* Re: [RFC v1 0/3] Live Update Orchestrator
  2025-03-20 13:35 ` [RFC v1 0/3] Live Update Orchestrator Greg KH
@ 2025-03-20 15:34   ` Pasha Tatashin
  0 siblings, 0 replies; 21+ messages in thread
From: Pasha Tatashin @ 2025-03-20 15:34 UTC (permalink / raw)
  To: Greg KH
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, tglx, mingo, bp, dave.hansen, x86, hpa,
	rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, jgowans, jgg

> > From: Pasha Tatashin <tatashin@google.com>
>
> Note, this does not match the author and signed-off-by on the actual
> patches themselves.  Please use your google.com email address to
> send/review/work on this.

This was accidental; I meant to use pasha.tatashin@soleen.com here, which
is the e-mail address I use for upstream work.



* Re: [RFC v1 1/3] luo: Live Update Orchestrator
  2025-03-20 13:39   ` Andy Shevchenko
@ 2025-03-20 16:35     ` Pasha Tatashin
  2025-03-20 17:50       ` Andy Shevchenko
  0 siblings, 1 reply; 21+ messages in thread
From: Pasha Tatashin @ 2025-03-20 16:35 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes,
	jgowans, jgg

On Thu, Mar 20, 2025 at 9:40 AM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Thu, Mar 20, 2025 at 02:40:09AM +0000, Pasha Tatashin wrote:
> > Introduces the Live Update Orchestrator (LUO), a new kernel subsystem
> > designed to facilitate live updates. Live update is a method to reboot
> > the kernel while attempting to keep selected devices alive across the
> > reboot boundary, minimizing downtime.
> >
> > The primary use case is cloud environments, allowing hypervisor updates
> > without fully disrupting running virtual machines. VMs can be suspended
> > while the hypervisor kernel reboots, and devices attached to these VM
> > are kept operational by the LUO.
> >
> > Features introduced:
> >
> > - Core orchestration logic for managing the live update process.
> > - A state machine (NORMAL, PREPARED, UPDATED, *_FAILED) to track
> >   the progress of live updates.
> > - Notifier chains for subsystems (device layer, interrupts, KVM, IOMMU,
> >   etc.) to register callbacks for different live update events:
> >     - LIVEUPDATE_PREPARE: Prepare for reboot (before blackout).
> >     - LIVEUPDATE_REBOOT: Final serialization before kexec (blackout).
> >     - LIVEUPDATE_FINISH: Cleanup after update (after blackout).
> >     - LIVEUPDATE_CANCEL: Rollback actions on failure or user request.
> > - A sysfs interface (/sys/kernel/liveupdate/) for user-space control:
> >     - `prepare`: Initiate preparation (write 1) or reset (write 0).
> >     - `finish`: Finalize update in new kernel (write 1).
> >     - `cancel`: Abort ongoing preparation or reboot (write 1).
> >     - `reset`: Force state back to normal (write 1).
> >     - `state`: Read-only view of the current LUO state.
> >     - `enabled`: Read-only view of whether live update is enabled.
> > - Integration with KHO to pass orchestrator state to the new kernel.
> > - Version checking during startup of the new kernel to ensure
> >   compatibility with the previous kernel's live update state.
> >
> > This infrastructure allows various kernel subsystems to coordinate and
> > participate in the live update process, serializing and restoring device
> > state across a kernel reboot.
>
> ...
>
> > +Date:                March 2025
> > +KernelVersion:       6.14.0
>
> This is way too optimistic; it won't even make v6.15.
> The date can be chosen to match either v6.16-rc1 or the v6.16 release,
> in accordance with the prediction tool.

This is an early RFC and is not intended to be applied. I meant to
replace it with the appropriate version once it becomes a candidate to
land.

>
> ...
>
> > +#ifndef _LINUX_LIVEUPDATE_H
> > +#define _LINUX_LIVEUPDATE_H
>
> > +#include <linux/compiler.h>
> > +#include <linux/notifier.h>
>
> This is a semi-random list of inclusions. Try to follow the IWYU principle.
> See below.

I will remove <linux/compiler.h>

>
> ...
>
> > +bool liveupdate_state_updated(void);
>
> Where is bool defined?

in kernel/liveupdate.c

>
> ...
>
> > +/**
> > + * LIVEUPDATE_DECLARE_NOTIFIER - Declare a live update notifier with default
> > + * structure.
> > + * @_name: A base name used to generate the names of the notifier block
> > + * (e.g., ``_name##_liveupdate_notifier_block``) and the callback function
> > + * (e.g., ``_name##_liveupdate``).
> > + * @_priority: The priority of the notifier, specified using the
> > + * ``enum liveupdate_cb_priority`` values
> > + * (e.g., ``LIVEUPDATE_CB_PRIO_BEFORE_DEVICES``).
> > + *
> > + * This macro declares a static struct notifier_block and a corresponding
> > + * notifier callback function for use with the live update orchestrator.
> > + * It simplifies the process by automatically handling the dispatching of
> > + * live update events to separate handler functions for prepare, reboot,
> > + * finish, and cancel.
> > + *
> > + * This macro expects the following functions to be defined:
> > + *
> > + * ``_name##_liveupdate_prepare()``:  Called on LIVEUPDATE_PREPARE.
> > + * ``_name##_liveupdate_reboot()``:   Called on LIVEUPDATE_REBOOT.
> > + * ``_name##_liveupdate_finish()``:   Called on LIVEUPDATE_FINISH.
> > + * ``_name##_liveupdate_cancel()``:   Called on LIVEUPDATE_CANCEL.
> > + *
> > + * The generated callback function handles the switch statement for the
> > + * different live update events and calls the appropriate handler function.
> > + * It also includes warnings if the finish or cancel handlers return an error.
> > + *
> > + * For example, a declaration can look like this:
> > + *
> > + * ``static int foo_liveupdate_prepare(void) { ... }``
> > + *
> > + * ``static int foo_liveupdate_reboot(void) { ... }``
> > + *
> > + * ``static int foo_liveupdate_finish(void) { ... }``
> > + *
> > + * ``static int foo_liveupdate_cancel(void) { ... }``
> > + *
> > + * ``LIVEUPDATE_DECLARE_NOTIFIER(foo, LIVEUPDATE_CB_PRIO_WITH_DEVICES);``
> > + *
>
> Hmm... Have you run the kernel-doc validator? The Return section is missing
> and it will warn about that.

I have, and there are no warnings. There is no return in this macro.

>
> > + */
>
> ...
>
> > +             WARN_ONCE(rv, "cancel failed[%d]\n", rv);               \
>
> + bug.h

I will include bug.h

>
> ...
>
> > + #undef pr_fmt
> > + #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> Leftover from the development?

Ah, yes, it is duplicated.

>
> > +#undef pr_fmt
>
> Not needed as long as pr_fmt() is at the top of the file.

Thanks, I will remove it.

>
> > +#define pr_fmt(fmt)  KBUILD_MODNAME ": " fmt
>
> ...
>
> > +#include <linux/kernel.h>
>
> What for? Can you follow IWYU, please? Here again is a semi-random list of
> inclusions.

I will remove kernel.h and cpufreq.h and add
#include <linux/kobject.h>

> > +#include <linux/sysfs.h>
> > +#include <linux/string.h>
> > +#include <linux/rwsem.h>
> > +#include <linux/err.h>
> > +#include <linux/liveupdate.h>
> > +#include <linux/cpufreq.h>
> > +#include <linux/kexec_handover.h>
>
> Can you keep them ordered? That will be easier to read and maintain.

Sure, I will order them alphabetically.

>
> ...
>
> > +static int __init early_liveupdate_param(char *buf)
> > +{
> > +     return kstrtobool(buf, &luo_enabled);
> > +}
>
> > +
>
> Redundant blank line.

OK

> > +early_param("liveupdate", early_liveupdate_param);
>
> ...
>
> > +/* Show the current live update state */
> > +static ssize_t state_show(struct kobject *kobj,
> > +                       struct kobj_attribute *attr,
>
> There is still enough room on the previous line, even with the strict
> 80-column limit.

OK

>
> > +                       char *buf)
> > +{
> > +     return sysfs_emit(buf, "%s\n", LUO_STATE_STR);
> > +}
>
> ...
>
> > +             ret = blocking_notifier_call_chain(&luo_notify_list,
> > +                                                event,
> > +                                                NULL);
>
> There is room on the previous lines. Ditto for the similar cases all over
> the code.

OK

>
> ...
>
> > +{
> > +     int ret;
> > +
> > +     if (down_write_killable(&luo_state_rwsem)) {
> > +             pr_warn(" %s, change state canceled by user\n", __func__);
>
> Why is __func__ so important in _this_ message? And why the leading space?
> Ditto for the similar cases.

Removed __func__, and reworded the messages to be more descriptive in each case.

>
> > +             return -EAGAIN;
> > +     }
> > +
> > +     if (!IS_STATE(LIVEUPDATE_STATE_NORMAL)) {
> > +             pr_warn("Can't switch to [%s] from [%s] state\n",
> > +                     luo_state_str[LIVEUPDATE_STATE_PREPARED],
> > +                     LUO_STATE_STR);
> > +             up_write(&luo_state_rwsem);
> > +
> > +             return -EINVAL;
> > +     }
> > +
> > +     ret = luo_notify(LIVEUPDATE_PREPARE);
> > +     if (!ret)
> > +             luo_set_state(LIVEUPDATE_STATE_PREPARED);
> > +
> > +     up_write(&luo_state_rwsem);
> > +
> > +     return ret;
> > +}
>
> ...
>
> > +static ssize_t prepare_store(struct kobject *kobj,
> > +                          struct kobj_attribute *attr,
> > +                          const char *buf,
> > +                          size_t count)
> > +{
> > +     ssize_t ret;
> > +     long val;
>
> > +     if (kstrtol(buf, 0, &val) < 0)
> > +             return -EINVAL;
>
> Shadowed error code.

In this case it is appropriate: we do not care why kstrtol() could not
convert the input into an integer; all we care about is that the
input was invalid, and that is what we return back to the user.

>
>
> > +     if (val != 1 && val != 0)
> > +             return -EINVAL;
>
> What's wrong with using kstrtobool() from the beginning?

It makes the input less well defined: here we only allow '1' or '0',
whereas kstrtobool() accepts many more spellings.
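
For illustration, the strict check being defended here can be sketched in userspace C. This is only an analogue under stated assumptions: demo_parse_strict_01() is a hypothetical helper built on strtol(), not the kernel's kstrtol(), and it accepts only the values 0 and 1, optionally followed by the single trailing newline that a shell "echo" would add. kstrtobool(), by contrast, also accepts spellings such as "y", "n", "on" and "off".

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of the strict check in prepare_store():
 * accept only the integer values 0 and 1. Base 0 is used as in the
 * patch, so "0x1" and "01" also parse as 1.
 */
static int demo_parse_strict_01(const char *buf, long *val)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(buf, &end, 0);
	if (errno || end == buf)
		return -EINVAL;		/* overflow or no digits at all */

	if (*end == '\n')		/* allow exactly one trailing newline */
		end++;
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage such as "1foo" */

	if (v != 0 && v != 1)
		return -EINVAL;		/* a valid number, but out of range */

	*val = v;
	return 0;
}
```

With this helper, "1\n" and "0" are accepted, while "y", "on", "2" and "1foo" are all rejected, which is the narrower contract the reply argues for.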

>
> > +
> > +     if (val)
> > +             ret = luo_prepare();
> > +     else
> > +             ret = luo_cancel();
>
> > +     if (!ret)
> > +             ret = count;
> > +
> > +     return ret;
>
> Can we go with the usual pattern "check for errors first"?
>
>         if (ret)
>                 return ret;

Sure.

>
>         ...
>
> > +}
>
> ...
>
> > +static ssize_t finish_store(struct kobject *kobj,
> > +                         struct kobj_attribute *attr,
> > +                         const char *buf,
> > +                         size_t count)
>
> Same comments as per above.

OK

>
> ...
>
> > +static struct attribute *luo_attrs[] = {
> > +     &state_attribute.attr,
> > +     &prepare_attribute.attr,
> > +     &finish_attribute.attr,
>
> > +     NULL,
>
> No comma for the terminator entry.

Sure.

> > +};
>
> ...
>
> > +static int __init luo_init(void)
> > +{
> > +     int ret;
> > +
> > +     if (!luo_enabled || !kho_is_enabled()) {
> > +             pr_info("disabled by user\n");
> > +             luo_enabled = false;
> > +
> > +             return 0;
> > +     }
>
> Can be written like
>
>         if (!kho_is_enabled())
>                 luo_enabled = false;
>         if (!luo_enabled) {
>                 pr_info("disabled by user\n");
>                 return 0;
>         }

Sure

>
> > +     ret = sysfs_create_group(kernel_kobj, &luo_attr_group);
> > +     if (ret)
> > +             pr_err("Failed to create group\n");
> > +
> > +     luo_sysfs_initialized = true;
> > +     pr_info("Initialized\n");
> > +
> > +     return ret;
>
> Something is odd here between (non-)checking for errors and printed messages.

Thank you for pointing this out; it is a bug, now fixed.
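
The reply does not show the fix, so the following is only one plausible shape of it, sketched in userspace C: return as soon as group creation fails, and set the "initialized" flag and print the success message only on the success path. The names demo_create_group(), demo_init() and demo_sysfs_initialized are hypothetical stand-ins for sysfs_create_group(), luo_init() and luo_sysfs_initialized.

```c
#include <stdbool.h>
#include <stdio.h>

static bool demo_sysfs_initialized;

/* Stand-in for sysfs_create_group(): returns 0 or a negative errno. */
static int demo_create_group(int simulated_result)
{
	return simulated_result;
}

/* Check for errors first, so a failure never reports "Initialized". */
static int demo_init(int simulated_result)
{
	int ret;

	ret = demo_create_group(simulated_result);
	if (ret) {
		fprintf(stderr, "Failed to create group: %d\n", ret);
		return ret;
	}

	demo_sysfs_initialized = true;
	puts("Initialized");

	return 0;
}
```

In the original hunk the flag was set and "Initialized" was printed even when the group could not be created; the early return removes that inconsistency.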

>
> > +}
>
> ...
>
> > +EXPORT_SYMBOL_GPL(liveupdate_state_normal);
>
> No namespace?

Namespace is 'liveupdate_', all public interfaces have this prefix,
private functions are prefixed with luo_ where it makes sense.

>
> ...
>
> > --- a/kernel/reboot.c
> > +++ b/kernel/reboot.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/syscalls.h>
> >  #include <linux/syscore_ops.h>
> >  #include <linux/uaccess.h>
> > +#include <linux/liveupdate.h>
>
> Can you preserve order (with given context at least)?

Yes.

>
> --
> With Best Regards,
> Andy Shevchenko

Thank you for your review.
Pasha



* Re: [RFC v1 1/3] luo: Live Update Orchestrator
  2025-03-20 16:35     ` Pasha Tatashin
@ 2025-03-20 17:50       ` Andy Shevchenko
  2025-03-20 18:30         ` Pasha Tatashin
  0 siblings, 1 reply; 21+ messages in thread
From: Andy Shevchenko @ 2025-03-20 17:50 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes,
	jgowans, jgg

On Thu, Mar 20, 2025 at 12:35:20PM -0400, Pasha Tatashin wrote:
> On Thu, Mar 20, 2025 at 9:40 AM Andy Shevchenko
> <andriy.shevchenko@linux.intel.com> wrote:
> > On Thu, Mar 20, 2025 at 02:40:09AM +0000, Pasha Tatashin wrote:

...

> > > +#ifndef _LINUX_LIVEUPDATE_H
> > > +#define _LINUX_LIVEUPDATE_H
> >
> > > +#include <linux/compiler.h>
> > > +#include <linux/notifier.h>
> >
> > This is a semi-random list of inclusions. Try to follow IWYU principle.
> > See below.
> 
> I will remove <linux/compiler.h>

But you need to add something more...

...

> > > +bool liveupdate_state_updated(void);
> >
> > Where bool is defined?
> 
> in kernel/liveupdate.c

Nope, I meant where the type is defined. It is IIRC in types.h which needs to
be included.

...

> > > +     if (kstrtol(buf, 0, &val) < 0)
> > > +             return -EINVAL;
> >
> > Shadowed error code.
> 
> In this case it is appropriate: we do not care why kstrtol() could not
> convert the input into an appropriate integer; all we care about is that
> the input was invalid, and that is what we return to the user.

The kstrtox() may give different error codes. User may want to know more about
what's wrong with the input. Shadowed error codes are discouraged and should be
explained.
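
A minimal userspace sketch of the difference being discussed. parse_long() is a hypothetical approximation of kstrtol() built on strtol(); it only mirrors the documented return values (0 on success, -ERANGE on overflow, -EINVAL on a parsing error) and is not the kernel implementation. The two store_*() helpers are likewise illustrative names.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Approximation of kstrtol(): 0 on success, -ERANGE on overflow,
 * -EINVAL on a parsing error. A single trailing newline is accepted.
 */
static int parse_long(const char *s, unsigned int base, long *res)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(s, &end, (int)base);
	if (end == s)
		return -EINVAL;		/* no digits at all */
	if (errno == ERANGE)
		return -ERANGE;		/* value does not fit in a long */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */

	*res = val;
	return 0;
}

/* Shadowing variant: every failure collapses into -EINVAL. */
static int store_shadowed(const char *buf, long *val)
{
	if (parse_long(buf, 0, val) < 0)
		return -EINVAL;
	return 0;
}

/* Propagating variant: the caller sees the original reason. */
static int store_propagated(const char *buf, long *val)
{
	int ret = parse_long(buf, 0, val);

	if (ret)
		return ret;
	return 0;
}
```

For an overflowing input the shadowing variant reports -EINVAL while the propagating one reports -ERANGE, which is the extra information the review comment refers to.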

> > > +     if (val != 1 && val != 0)
> > > +             return -EINVAL;
> >
> > What's wrong with using kstrtobool() from the beginning?
> 
> It makes the input less defined, here we only allow '1' or '0',
> kstrtobool() allows almost anything.

But kstrtobool() is the interface for boolean input. You may document only
0 and 1 and don't tell people to use anything else. ABI documentation should
be clear, that's it.

...

> > > +EXPORT_SYMBOL_GPL(liveupdate_state_normal);
> >
> > No namespace?
> 
> Namespace is 'liveupdate_', all public interfaces have this prefix,
> private functions are prefixed with luo_ where it makes sense.

No, I'm talking about export namespace. Why does the entire kernel need these APIs?

-- 
With Best Regards,
Andy Shevchenko





* Re: [RFC v1 2/3] luo: dev_liveupdate: Add device live update infrastructure
  2025-03-20 13:34   ` Greg KH
@ 2025-03-20 18:03     ` Pasha Tatashin
  2025-03-20 20:51       ` Greg KH
  0 siblings, 1 reply; 21+ messages in thread
From: Pasha Tatashin @ 2025-03-20 18:03 UTC (permalink / raw)
  To: Greg KH
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, tglx, mingo, bp, dave.hansen, x86, hpa,
	rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, jgowans, jgg

On Thu, Mar 20, 2025 at 9:36 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Mar 20, 2025 at 02:40:10AM +0000, Pasha Tatashin wrote:
> > Introduce a new subsystem within the driver core to enable keeping
> > devices alive during kernel live update. This infrastructure is
> > designed to be registered with and driven by a separate Live Update
> > Orchestrator, allowing the LUO's state machine to manage the save and
> > restore process of device state during a kernel transition.
> >
> > The goal is to allow drivers and buses to participate in a coordinated
> > save and restore process orchestrated by a live update mechanism. By
> > saving device state before the kernel switch and restoring it
> > immediately after, the device can appear to remain continuously
> > operational from the perspective of the system and userspace.
> >
> > components introduced:
> >
> > - `struct dev_liveupdate`: Embedded in `struct device` to track the
> >   device's participation and state during a live update, including
> >   request status, preservation status, and dependency depth.
> >
> > - `liveupdate()` callback: Added to `struct bus_type` and
> >   `struct device_driver`. This callback receives an enum
> >   `liveupdate_event` to manage device state at different stages of the
> >   live update process:
> >     - LIVEUPDATE_PREPARE: Save device state before the kernel switch.
> >     - LIVEUPDATE_REBOOT: Final actions just before the kernel jump.
> >     - LIVEUPDATE_FINISH: Clean-up after live update.
> >     - LIVEUPDATE_CANCEL: Clean up any saved state if the update is
> >       aborted.
> >
> > - Sysfs attribute "liveupdate/requested": Added under each device
> >   directory, allowing the user to request that a specific device
> >   participate in live update, i.e. that its state be preserved
> >   during the update.
>
> As you can imagine, I have "thoughts" about all of this being added to
> the driver core.  But, before I go off on that, I want to see some real,
> actual, working, patches for at least 3 bus subsystems that correctly
> implement this before I even consider reviewing this.
>
> Show us real users please, otherwise any attempt at reviewing this is
> going to just be a waste of our time as I have doubts that this actually
> even works :)
>
> Also, as you are adding a new user/kernel api, please also point at the
> userspace tools that are written to handle all of this.  As you are
> going to be handling potentially tens of thousands of devices from
> userspace this way, in a single system, real code is needed to even
> consider that this is an acceptable solution.

Hi Greg,

Thanks for the feedback on this RFC. I understand your hesitation
about adding this to the driver core without seeing concrete
implementations. The primary goal of posting this RFC now is to get
early feedback on the overall state machine and rules concept. We have
a bi-weekly meeting [1] where the "Live Update Orchestrator" is
scheduled for presentation. I wanted to give people a chance to look
at the framework ahead of those discussions.

Regarding your request for real, working patches, we are actively
working on that. Our current efforts are focused on adding LUO-based
live update support for these subsystems: KVM, Interrupts, IOMMU, and
Devices.

Within the devices subsystem, we are targeting generic PCI, VFIO, and
a few other device types (real and emulated) to demonstrate the
implementation.
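
For orientation, a minimal userspace sketch of what a per-driver liveupdate() callback might look like, based only on the four events listed in the quoted commit message. struct demo_device, its fields, and demo_liveupdate() are hypothetical; the real struct dev_liveupdate and callback signature are defined by the patch, not by this sketch.

```c
#include <stdbool.h>

/* Event names are taken from the quoted commit message. */
enum liveupdate_event {
	LIVEUPDATE_PREPARE,
	LIVEUPDATE_REBOOT,
	LIVEUPDATE_FINISH,
	LIVEUPDATE_CANCEL,
};

/* Hypothetical, heavily simplified stand-in for per-device state. */
struct demo_device {
	bool requested;	/* user wrote 1 to liveupdate/requested */
	bool preserved;	/* state has been saved for the next kernel */
	bool quiesced;	/* final pre-kexec step has run */
};

/* Returns 0 on success or a negative value on failure. */
static int demo_liveupdate(struct demo_device *dev, enum liveupdate_event ev)
{
	switch (ev) {
	case LIVEUPDATE_PREPARE:
		if (!dev->requested)
			return 0;	/* device does not participate */
		dev->preserved = true;	/* save state before the switch */
		return 0;
	case LIVEUPDATE_REBOOT:
		if (dev->preserved)
			dev->quiesced = true;	/* last step before the jump */
		return 0;
	case LIVEUPDATE_FINISH:
	case LIVEUPDATE_CANCEL:
		/* Both paths drop whatever was saved. */
		dev->preserved = false;
		dev->quiesced = false;
		return 0;
	}

	return -1;	/* unknown event */
}
```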

I absolutely agree that demonstrating a real use case is important.
However, this is a complicated project that involves changes in many
parts of the kernel, and we can't deliver everything in one large
patchset; it has to be divided and addressed incrementally.

So far, we have the following pieces of the Live Update puzzle: KHO
(for preserving kernel memory), LUO (for driving the live update
process), Dev_Liveupdate (for managing device participation in
live update), IOMMU preservation [2], and guest memory [3]. We are
planning to add support for interrupts, PCIe, VFIO, some drivers, and
other components.

On the user side, we are planning to propose the necessary changes to
VMMs such as CloudHypervisor and QEMU.

Thanks,
Pasha

[1] https://lore.kernel.org/all/a350f3e5-e764-4ba6-f871-da7252f314da@google.com
[2] https://lpc.events/event/18/contributions/1686
[3] https://lore.kernel.org/all/20240805093245.889357-1-jgowans@amazon.com



* Re: [RFC v1 1/3] luo: Live Update Orchestrator
  2025-03-20 17:50       ` Andy Shevchenko
@ 2025-03-20 18:30         ` Pasha Tatashin
  2025-03-21 13:19           ` Andy Shevchenko
  0 siblings, 1 reply; 21+ messages in thread
From: Pasha Tatashin @ 2025-03-20 18:30 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes,
	jgowans, jgg

On Thu, Mar 20, 2025 at 1:50 PM Andy Shevchenko
<andriy.shevchenko@linux.intel.com> wrote:
>
> On Thu, Mar 20, 2025 at 12:35:20PM -0400, Pasha Tatashin wrote:
> > On Thu, Mar 20, 2025 at 9:40 AM Andy Shevchenko
> > <andriy.shevchenko@linux.intel.com> wrote:
> > > On Thu, Mar 20, 2025 at 02:40:09AM +0000, Pasha Tatashin wrote:
>
> ...
>
> > > > +#ifndef _LINUX_LIVEUPDATE_H
> > > > +#define _LINUX_LIVEUPDATE_H
> > >
> > > > +#include <linux/compiler.h>
> > > > +#include <linux/notifier.h>
> > >
> > > This is a semi-random list of inclusions. Try to follow IWYU principle.
> > > See below.
> >
> > I will remove <linux/compiler.h>
>
> But you need to add something more...

...

>
> ...
>
> > > > +bool liveupdate_state_updated(void);
> > >
> > > Where bool is defined?
> >
> > in kernel/liveupdate.c
>
> Nope, I meant where the type is defined. It is IIRC in types.h which needs to
> be included.

Ah, I see what you mean, sure I will include types.h.

>
> ...
>
> > > > +     if (kstrtol(buf, 0, &val) < 0)
> > > > +             return -EINVAL;
> > >
> > > Shadowed error code.
> >
> > In this case it is appropriate: we do not care why kstrtol() could not
> > convert the input into an appropriate integer; all we care about is that
> > the input was invalid, and that is what we return to the user.
>
> The kstrtox() may give different error codes. User may want to know more about
> what's wrong with the input. Shadowed error codes are discouraged and should be
> explained.
>

...

> > > > +     if (val != 1 && val != 0)
> > > > +             return -EINVAL;
> > >
> > > What's wrong with using kstrtobool() from the beginning?
> >
> > It makes the input less defined, here we only allow '1' or '0',
> > kstrtobool() allows almost anything.
>
> But kstrtobool() is the interface for boolean input. You may document only
> 0 and 1 and don't tell people to use anything else. ABI documentation should
> be clear, that's it.

Sure, I will use kstrtobool().
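
To make "allows almost anything" concrete, here is a userspace approximation of the inputs kstrtobool() accepts, as far as I understand its documented behaviour: only the first character is examined (or the first two for "on"/"off"), and anything else yields -EINVAL. parse_bool() is a hypothetical name and this is a sketch, not the kernel source.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Approximation of kstrtobool(): 0 on success, -EINVAL otherwise. */
static int parse_bool(const char *s, bool *res)
{
	if (!s)
		return -EINVAL;

	switch (s[0]) {
	case 'y': case 'Y': case 't': case 'T': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case 'f': case 'F': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		/* "on" and "off" need the second character to decide. */
		switch (s[1]) {
		case 'n': case 'N':
			*res = true;
			return 0;
		case 'f': case 'F':
			*res = false;
			return 0;
		}
		break;
	}

	return -EINVAL;
}
```

Because only the leading characters matter, a trailing newline from `echo 1 > prepare` needs no special handling, and the ABI documentation can still restrict itself to 0 and 1.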

>
> ...
>
> > > > +EXPORT_SYMBOL_GPL(liveupdate_state_normal);
> > >
> > > No namespace?
> >
> > Namespace is 'liveupdate_', all public interfaces have this prefix,
> > private functions are prefixed with luo_ where it makes sense.
>
> No, I'm talking about export namespace. Why does the entire kernel need these APIs?

These functions are intended for use by drivers and other subsystems
participating in the live update. They allow these components to
determine, during boot, whether to restore their state from the
serialized state, or, during runtime, whether a live update is in the
prepared state, causing different behavior compared to normal mode
(e.g., prohibiting DMA mappings modifications, binding/unbinding,
etc.).

Pasha

> --
> With Best Regards,
> Andy Shevchenko



* Re: [RFC v1 1/3] luo: Live Update Orchestrator
  2025-03-20 14:43   ` Jason Gunthorpe
@ 2025-03-20 19:00     ` Pasha Tatashin
  2025-03-20 19:26       ` Jason Gunthorpe
  0 siblings, 1 reply; 21+ messages in thread
From: Pasha Tatashin @ 2025-03-20 19:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, jgowans

Hi Jason,

Thank you for your feedback.

> > Features introduced:
> >
> > - Core orchestration logic for managing the live update process.
> > - A state machine (NORMAL, PREPARED, UPDATED, *_FAILED) to track
> >   the progress of live updates.
> > - Notifier chains for subsystems (device layer, interrupts, KVM, IOMMU,
> >   etc.) to register callbacks for different live update events:
> >     - LIVEUPDATE_PREPARE: Prepare for reboot (before blackout).
> >     - LIVEUPDATE_REBOOT: Final serialization before kexec (blackout).
> >     - LIVEUPDATE_FINISH: Cleanup after update (after blackout).
> >     - LIVEUPDATE_CANCEL: Rollback actions on failure or user request.
>
> I still don't think notifier chains are the right way to go about a lot
> of this, most of it should be driven off of the file descriptors and
> fdbox, not through notification.
>
> At the very least we should not be adding notifier chains without a
> clear user of them, and I'm not convinced that the iommu driver or
> vfio are those users at the moment.
>
> I feel more like the iommu can be brought into the serialization
> indirectly by putting an iommufd into a fdbox.

We have identified the subsystems that need to participate in Live
Update: KVM, IOMMU, Devices, and Interrupts. We are planning to
present how each of them will integrate with the LUO.
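
As background for the notifier discussion, a minimal userspace sketch of the pattern in question: subscribers register a callback, PREPARE is delivered in registration order, and if one subscriber fails the ones that already succeeded receive CANCEL in reverse order. All lu_* and demo_* names are hypothetical, and the rollback policy is an assumption for illustration, not a description of what the LUO patch does.

```c
#include <stddef.h>

enum liveupdate_event {
	LIVEUPDATE_PREPARE,
	LIVEUPDATE_REBOOT,
	LIVEUPDATE_FINISH,
	LIVEUPDATE_CANCEL,
};

typedef int (*lu_callback)(enum liveupdate_event ev, void *data);

struct lu_subscriber {
	lu_callback cb;
	void *data;
};

#define LU_MAX_SUBSCRIBERS 8

static struct lu_subscriber lu_chain[LU_MAX_SUBSCRIBERS];
static size_t lu_count;

/* Append a subscriber; returns 0 on success, -1 if the chain is full. */
static int lu_register(lu_callback cb, void *data)
{
	if (lu_count == LU_MAX_SUBSCRIBERS)
		return -1;
	lu_chain[lu_count].cb = cb;
	lu_chain[lu_count].data = data;
	lu_count++;
	return 0;
}

/*
 * Deliver PREPARE in registration order. On the first failure, send
 * CANCEL to the subscribers that already succeeded, newest first.
 */
static int lu_prepare_all(void)
{
	size_t i;
	int ret;

	for (i = 0; i < lu_count; i++) {
		ret = lu_chain[i].cb(LIVEUPDATE_PREPARE, lu_chain[i].data);
		if (ret)
			goto rollback;
	}
	return 0;

rollback:
	while (i-- > 0)
		lu_chain[i].cb(LIVEUPDATE_CANCEL, lu_chain[i].data);
	return ret;
}

/* Example subscriber: data points to an int tracking "prepared" state. */
static int demo_ok(enum liveupdate_event ev, void *data)
{
	int *prepared = data;

	if (ev == LIVEUPDATE_PREPARE)
		*prepared = 1;
	else if (ev == LIVEUPDATE_CANCEL)
		*prepared = 0;
	return 0;
}

/* Example subscriber that always refuses to prepare. */
static int demo_fail(enum liveupdate_event ev, void *data)
{
	(void)data;
	return ev == LIVEUPDATE_PREPARE ? -1 : 0;
}
```

Note that every registered subscriber is called unconditionally here, which is the property questioned in the review: nothing in the chain itself knows which resources userspace actually wants preserved.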

> > - A sysfs interface (/sys/kernel/liveupdate/) for user-space control:
> >     - `prepare`: Initiate preparation (write 1) or reset (write 0).
> >     - `finish`: Finalize update in new kernel (write 1).
> >     - `cancel`: Abort ongoing preparation or reboot (write 1).
> >     - `reset`: Force state back to normal (write 1).
> >     - `state`: Read-only view of the current LUO state.
> >     - `enabled`: Read-only view of whether live update is enabled.

I forgot to update the commit message: there are no `enabled`, `reset`,
or `cancel` files. We only have three files in LUO: `state`,
`prepare`, and `finish`.
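
A minimal userspace sketch of how three such files could drive the state machine. The state names come from the commit message; the transition rules, the -EINVAL return for a rejected write, and the luo_reboot() step standing in for the kexec are assumptions made for illustration, and the *_FAILED states are left out for brevity.

```c
#include <errno.h>

enum luo_state {
	LUO_NORMAL,
	LUO_PREPARED,
	LUO_UPDATED,
};

static enum luo_state luo_cur_state = LUO_NORMAL;

/* Write to `prepare`: 1 starts preparation, 0 cancels it. */
static int luo_write_prepare(int val)
{
	if (val == 1 && luo_cur_state == LUO_NORMAL) {
		luo_cur_state = LUO_PREPARED;
		return 0;
	}
	if (val == 0 && luo_cur_state == LUO_PREPARED) {
		luo_cur_state = LUO_NORMAL;
		return 0;
	}
	return -EINVAL;
}

/* Stands in for the kexec: the next kernel comes up in UPDATED. */
static int luo_reboot(void)
{
	if (luo_cur_state != LUO_PREPARED)
		return -EINVAL;
	luo_cur_state = LUO_UPDATED;
	return 0;
}

/* Write to `finish`: only 1 is accepted, and only after an update. */
static int luo_write_finish(int val)
{
	if (val != 1 || luo_cur_state != LUO_UPDATED)
		return -EINVAL;
	luo_cur_state = LUO_NORMAL;
	return 0;
}

/* Read of `state`: a human-readable name for the current state. */
static const char *luo_read_state(void)
{
	switch (luo_cur_state) {
	case LUO_NORMAL:
		return "normal";
	case LUO_PREPARED:
		return "prepared";
	case LUO_UPDATED:
		return "updated";
	}
	return "unknown";
}
```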

>
> I also think we should give up on the sysfs. If fdbox is going forward
> in a char dev direction then I think we should have two char devs
> /dev/kho/serialize and /dev/kho/deserialize and run the whole thing

KHO is a mechanism to preserve kernel memory across reboots. It can be
used independently of live update, for example, to preserve kexec
reboot telemetry, traces, and for other purposes. The LUO utilizes KHO
for memory preservation but also orchestrates specifically a live
update process, provides a generic way for subsystems and devices to
participate, handles error recovery, unclaimed devices, and other live
update-specific steps.

That said, I can transition the LUO interface from sysfs to a character device.

> through that. The concepts shown in the fdbox patches should be merged
> into the kho/serialize char dev as just a general architecture of open
> the char dev, put stuff into it, then finalize and do the kexec.

Some participating subsystems, such as interrupts, do not have a way
to export a file descriptor. It is unclear why we would require this
for kernel-internal state that needs to be preserved for live update,
which should instead register with LUO internally.

> It gives you more options to avoid things like notifiers and a very
> clear "session" linked to a FD lifetime that encloses the
> serialization effort. I think that will make error case cleanup easier
> and the whole thing more maintainable. IMHO sysfs is not a great API
> choice for something so complicated.

IMO, the current API and state machine are quite simple (I plan to
present and go through them at one of the Hypervisor Live Update
meetings). However, I am open to changing to a different API, and we
can expose it through a character device.

> Also agree with Greg, I think this needs more thoughtful patch staging
> with actual complete solutions. I think focusing on a progression of
> demonstrable kexec preservation:
>  - A simple KVM and the VM's backing memory in a memfd is preserved
>  - A simple vfio-noiommu doing DMA to a preserved memfd, including not
>    resetting the device (but with no iommu driver)
>  - iommufd

We are working on this. However, each component builds upon the
previous one, so it makes sense to discuss the lower layers early to
get early feedback.

Pasha



* Re: [RFC v1 1/3] luo: Live Update Orchestrator
  2025-03-20 19:00     ` Pasha Tatashin
@ 2025-03-20 19:26       ` Jason Gunthorpe
  2025-03-27 19:29         ` Pasha Tatashin
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2025-03-20 19:26 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, jgowans

On Thu, Mar 20, 2025 at 03:00:31PM -0400, Pasha Tatashin wrote:

> > I also think we should give up on the sysfs. If fdbox is going forward
> > in a char dev direction then I think we should have two char devs
> > /dev/kho/serialize and /dev/kho/deserialize and run the whole thing
> 
> KHO is a mechanism to preserve kernel memory across reboots. It can be
> used independently of live update, for example, to preserve kexec
> reboot telemetry, traces, and for other purposes. The LUO utilizes KHO
> for memory preservation but also orchestrates specifically a live
> update process, provides a generic way for subsystems and devices to
> participate, handles error recovery, unclaimed devices, and other live
> update-specific steps.
> 
> That said, I can transition the LUO interface from sysfs to a character device.

Sure, I mean pick whatever name makes sense for this whole bundle..

> > through that. The concepts shown in the fdbox patches should be merged
> > into the kho/serialize char dev as just a general architecture of open
> > the char dev, put stuff into it, then finalize and do the kexec.
> 
> Some participating subsystems, such as interrupts, do not have a way
> to export a file descriptor. 

Interrupts that need to be preserved are owned by VFIO. Why do we need
to preserve interrupts? I thought the model was to halt all interrupts
and then re-inject a spurious one?

> It is unclear why we would require this
> for kernel-internal state that needs to be preserved for live update,
> which should instead register with LUO internally.

Because there is almost no kernel state which is machine global and
unconditionally should be included. E.g., interrupts for devices that
are not doing preservation should not be serialized. Only userspace
knows what should be preserved, so you always need a mechanism to tell
the kernel.
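
A minimal userspace sketch of the "session tied to an FD lifetime" idea: userspace opens a session, puts the descriptors it wants preserved into it, and finalizes. If the session is closed without being finalized, everything put into it is dropped, which is what makes error cleanup simple. The lu_session_* names and the fixed-size array are hypothetical; no such kernel API is implied.

```c
#include <stdbool.h>
#include <stddef.h>

#define LU_SESSION_MAX 16

struct lu_session {
	int fds[LU_SESSION_MAX];	/* descriptors userspace asked to preserve */
	size_t count;
	bool finalized;
};

/* Models opening the char dev: start with an empty session. */
static void lu_session_open(struct lu_session *s)
{
	s->count = 0;
	s->finalized = false;
}

/* Add one descriptor; rejected once the session has been finalized. */
static int lu_session_put(struct lu_session *s, int fd)
{
	if (s->finalized || s->count == LU_SESSION_MAX || fd < 0)
		return -1;
	s->fds[s->count++] = fd;
	return 0;
}

/* Commit the session; after this its contents survive close. */
static int lu_session_finalize(struct lu_session *s)
{
	if (s->finalized)
		return -1;
	s->finalized = true;
	return 0;
}

/* Returns how many entries remain preserved after the close. */
static size_t lu_session_close(struct lu_session *s)
{
	if (!s->finalized)
		s->count = 0;	/* abandoned session: nothing is preserved */
	return s->count;
}
```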

> IMO, the current API and state machine are quite simple (I plan to
> present and go through them at one of the Hypervisor Live Update
> meetings). However, I am open to changing to a different API, and we
> can expose it through a character device.

Everything seems simple before you actually try to use it :)

> > Also agree with Greg, I think this needs more thoughtful patch staging
> > with actual complete solutions. I think focusing on a progression of
> > demonstrable kexec preservation:
> >  - A simple KVM and the VM's backing memory in a memfd is preserved
> >  - A simple vfio-noiommu doing DMA to a preserved memfd, including not
> >    resetting the device (but with no iommu driver)
> >  - iommufd
> 
> We are working on this. However, each component builds upon the
> previous one, so it makes sense to discuss the lower layers early to
> get early feedback.

I think part of the problem is there are lots of people working on
pieces as though they are separate components, and I'm not sure this
is entirely wise, or the components are actually separate.  I see
fdbox and this luo patch series as effectively being the same
component, just different aspects of it.

I'm not entirely sure that Mike's work is actually really
separate. Yes you might use it with a crash kernel too, that mechanism
is going to trigger for a crash kernel scenario without something
triggering the serialization steps. It kind of makes sense to me that
the same uapi could both set up the crash scenario and choose what gets
passed to the crash kernel, and also support kexec.

Jason



* Re: [RFC v1 2/3] luo: dev_liveupdate: Add device live update infrastructure
  2025-03-20 18:03     ` Pasha Tatashin
@ 2025-03-20 20:51       ` Greg KH
  2025-03-21  9:41         ` Bartosz Golaszewski
  0 siblings, 1 reply; 21+ messages in thread
From: Greg KH @ 2025-03-20 20:51 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, tglx, mingo, bp, dave.hansen, x86, hpa,
	rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, jgowans, jgg

On Thu, Mar 20, 2025 at 02:03:10PM -0400, Pasha Tatashin wrote:
> I absolutely agree that demonstrating a real use case is important.
> However, this is a complicated project that involves changes in many
> parts of the kernel, and we can't deliver everything in one large
> patchset; it has to be divided and addressed incrementally.

Ok, but then don't expect us to be able to actually review it in any
sane way, sorry.

Think about it from our side, what would you want to see if you had to
review this type of thing?  Remember, some of us get hundreds of things
to review a week, we don't have context for each and every new feature,
and your project does not have the "trust" in that we would even
consider taking anything without any real user both in the kernel and in
public userspace code.

Breaking up changes in a way that is acceptable upstream is a tough
task, usually harder than the original engineering effort to create the
feature in the first place.  But in the end, the result is a better
solution as it will evolve and change along the way based on reviews and
requirements from the different subsystems.

good luck!

greg k-h



* Re: [RFC v1 2/3] luo: dev_liveupdate: Add device live update infrastructure
  2025-03-20 20:51       ` Greg KH
@ 2025-03-21  9:41         ` Bartosz Golaszewski
  0 siblings, 0 replies; 21+ messages in thread
From: Bartosz Golaszewski @ 2025-03-21  9:41 UTC (permalink / raw)
  To: Greg KH
  Cc: Pasha Tatashin, changyuanl, graf, rppt, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song,
	zhangguopeng, linux, linux-kernel, linux-doc, linux-mm, tglx,
	mingo, bp, dave.hansen, x86, hpa, rafael, dakr, cw00.choi,
	myungjoo.ham, yesanishhere, Jonathan.Cameron, quic_zijuhu,
	aleksander.lobakin, ira.weiny, andriy.shevchenko, leon, lukas,
	bhelgaas, wagi, djeffery, stuart.w.hayes, jgowans, jgg

On Thu, 20 Mar 2025 at 21:52, Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Mar 20, 2025 at 02:03:10PM -0400, Pasha Tatashin wrote:
> > I absolutely agree that demonstrating a real use case is important.
> > However, this is a complicated project that involves changes in many
> > parts of the kernel, and we can't deliver everything in one large
> > patchset; it has to be divided and addressed incrementally.
>
> Ok, but then don't expect us to be able to actually review it in any
> sane way, sorry.
>
> Think about it from our side, what would you want to see if you had to
> review this type of thing?  Remember, some of us get hundreds of things
> to review a week, we don't have context for each and every new feature,
> and your project does not have the "trust" in that we would even
> consider taking anything without any real user both in the kernel and in
> public userspace code.
>
> Breaking up changes in a way that is acceptable upstream is a tough
> task, usually harder than the original engineering effort to create the
> feature in the first place.  But in the end, the result is a better
> solution as it will evolve and change along the way based on reviews and
> requirements from the different subsystems.
>

If I may suggest something: typically you'd want to post the whole
working PoC on some public git tree for reference and to show the big
picture and then start sending out individual bits and pieces for
upstream review. This is how many big features have been done in the
past. As the reviewed code changes, you adjust the rest of it that
wasn't posted yet.

Bartosz



* Re: [RFC v1 1/3] luo: Live Update Orchestrator
  2025-03-20 18:30         ` Pasha Tatashin
@ 2025-03-21 13:19           ` Andy Shevchenko
  0 siblings, 0 replies; 21+ messages in thread
From: Andy Shevchenko @ 2025-03-21 13:19 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes,
	jgowans, jgg

On Thu, Mar 20, 2025 at 02:30:25PM -0400, Pasha Tatashin wrote:
> On Thu, Mar 20, 2025 at 1:50 PM Andy Shevchenko
> <andriy.shevchenko@linux.intel.com> wrote:
> > On Thu, Mar 20, 2025 at 12:35:20PM -0400, Pasha Tatashin wrote:
> > > On Thu, Mar 20, 2025 at 9:40 AM Andy Shevchenko
> > > <andriy.shevchenko@linux.intel.com> wrote:
> > > > On Thu, Mar 20, 2025 at 02:40:09AM +0000, Pasha Tatashin wrote:

...

> > > > > +EXPORT_SYMBOL_GPL(liveupdate_state_normal);
> > > >
> > > > No namespace?
> > >
> > > Namespace is 'liveupdate_', all public interfaces have this prefix,
> > > private functions are prefixed with luo_ where it makes sense.
> >
> > No, I'm talking about export namespace. Why does the entire kernel need these APIs?
> 
> These functions are intended for use by drivers and other subsystems
> participating in the live update.

Sure. Why can't they import API namespace when needed?
Btw, is this feature switchable? Then why would the rest of the kernel
need to see these APIs or load them?
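
A toy userspace model of the export-namespace idea raised here. In the kernel this is expressed with the EXPORT_SYMBOL_NS_GPL() and MODULE_IMPORT_NS() macros; the table and lookup below only imitate the rule that a namespaced symbol is usable solely by code that imported its namespace. The namespace name "LIVEUPDATE", the demo_* identifiers, and the single-namespace importer are hypothetical simplifications.

```c
#include <stddef.h>
#include <string.h>

struct demo_export {
	const char *name;
	const char *ns;	/* NULL means a global, un-namespaced export */
};

static const struct demo_export demo_exports[] = {
	{ "printk", NULL },
	{ "liveupdate_state_normal", "LIVEUPDATE" },
};

/*
 * Returns 1 if a module that imported `imported_ns` (NULL for none)
 * may use `name`, 0 otherwise.
 */
static int demo_may_use(const char *name, const char *imported_ns)
{
	size_t i;

	for (i = 0; i < sizeof(demo_exports) / sizeof(demo_exports[0]); i++) {
		const struct demo_export *e = &demo_exports[i];

		if (strcmp(e->name, name) != 0)
			continue;
		if (!e->ns)
			return 1;	/* global export: visible to everyone */
		return imported_ns && strcmp(e->ns, imported_ns) == 0;
	}

	return 0;	/* not exported at all */
}
```

Under this model, only participants that opt in by importing the namespace see the liveupdate_* symbols, while the rest of the kernel is unaffected.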

> They allow these components to
> determine, during boot, whether to restore their state from the
> serialized state, or, during runtime, whether a live update is in the
> prepared state, causing different behavior compared to normal mode
> (e.g., prohibiting DMA mappings modifications, binding/unbinding,
> etc.).

-- 
With Best Regards,
Andy Shevchenko





* Re: [RFC v1 1/3] luo: Live Update Orchestrator
  2025-03-20 19:26       ` Jason Gunthorpe
@ 2025-03-27 19:29         ` Pasha Tatashin
  2025-03-31 16:37           ` Jason Gunthorpe
  2025-04-25 17:21           ` Lukas Wunner
  0 siblings, 2 replies; 21+ messages in thread
From: Pasha Tatashin @ 2025-03-27 19:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, jgowans, Pratyush Yadav

On Thu, Mar 20, 2025 at 3:26 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Mar 20, 2025 at 03:00:31PM -0400, Pasha Tatashin wrote:
>
> > > I also think we should give up on the sysfs. If fdbox is going forward
> > > in a char dev direction then I think we should have two char devs
> > > /dev/kho/serialize and /dev/kho/deserialize and run the whole thing
> >
> > KHO is a mechanism to preserve kernel memory across reboots. It can be
> > used independently of live update, for example, to preserve kexec
> > reboot telemetry, traces, and for other purposes. The LUO utilizes KHO
> > for memory preservation but also orchestrates specifically a live
> > update process, provides a generic way for subsystems and devices to
> > participate, handles error recovery, unclaimed devices, and other live
> > update-specific steps.
> >
> > That said, I can transition the LUO interface from sysfs to a character device.
>
> Sure, I mean pick whatever name makes sense for this whole bundle..
>
> > > through that. The concepts shown in the fdbox patches should be merged
> > > into the kho/serialize char dev as just a general architecture of open
> > > the char dev, put stuff into it, then finalize and do the kexec.
> >
> > Some participating subsystems, such as interrupts, do not have a way
> > to export a file descriptor.
>
> Interrupts that need to be preserved are owned by VFIO. Why do we need
> to preserve interrupts? I thought the model was to halt all interrupts
> and then re-inject a spurious one?
>
> > It is unclear why we would require this
> > for kernel-internal state that needs to be preserved for live update,
> > which should instead register with LUO internally.
>
> Because there is almost no kernel state which is machine global and
> unconditionally should be included. E.g., interrupts for devices that
> are not doing preservation should not be serialized. Only userspace
> knows what should be preserved, so you always need a mechanism to tell
> the kernel.
>
> > IMO, the current API and state machine are quite simple (I plan to
> > present and go through them at one of the Hypervisor Live Update
> > meetings). However, I am open to changing to a different API, and we
> > can expose it through a character device.
>
> Everything seems simple before you actually try to use it :)
>
> > > Also agree with Greg, I think this needs more thoughtful patch staging
> > > with actual complete solutions. I think focusing on a progression of
> > > demonstrable kexec preservation:
> > >  - A simple KVM and the VM's backing memory in a memfd is preserved
> > >  - A simple vfio-noiommu doing DMA to a preserved memfd, including not
> > >    resetting the device (but with no iommu driver)
> > >  - iommufd
> >
> > We are working on this. However, each component builds upon the
> > previous one, so it makes sense to discuss the lower layers early to
> > get early feedback.
>

Hi Jason,

Thanks for your thoughts. I agree with your observation that
components are being worked on separately when they may be
intrinsically linked, especially given that kvm/vfio/iommu all have FD
counterparts to the global state or device state.

> I think part of the problem is there are lots of people working on
> pieces as though they are separate components, and I'm not sure this
> is entirely wise, or the components are actually separate.  I see
> fdbox and this luo patch series as effectively being the same
> component, just different aspects of it.

You've articulated precisely the point we discussed at LSF/MM. Based
on that conversation, the next proposal will focus on unifying FDBox
and the Live Update Orchestrator into a single, cohesive component.

Here’s a summary of the planned approach:

1. Unified Location: LUO will be moved under misc/liveupdate/ to house
the consolidated functionality.
2.  User Interfaces:  A primary character device (/dev/liveupdate)
utilizing an ioctl interface for control operations. (An initial draft
of this interface is available here:
https://raw.githubusercontent.com/soleen/linux/refs/heads/luo/rfc-v2.1/include/uapi/linux/liveupdate.h)
An optional sysfs interface will allow userspace applications to
monitor LUO's state and react appropriately, e.g., allowing systemd
to load different services during different live update states.
3. Dependency Management: The viability of preserving a specific
resource (file, device) will be checked when it initially requests
participation.
However, the actual dependencies will only be pulled and the final
ordered list assembled during the prepare phase. This avoids the churn
of repeatedly adding/removing dependencies as individual components
register.
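
A minimal user-space sketch of the two-step dependency handling
described in item 3, under stated assumptions: the names lu_resource,
lu_session, lu_register() and lu_prepare() are hypothetical and do not
come from the patches, each resource has at most one dependency, and
"ordered" is taken to mean that a dependency is listed before the
resource that needs it. Registration only checks viability; the
dependency walk is deferred to prepare.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define LU_MAX 16

enum lu_visit { LU_NEW, LU_VISITING, LU_DONE };

/* Hypothetical resource record, for illustration only. */
struct lu_resource {
    const char *name;
    int preservable;            /* can this resource type be preserved? */
    struct lu_resource *dep;    /* at most one dependency, for brevity */
    enum lu_visit visit;
};

struct lu_session {
    struct lu_resource *requested[LU_MAX];
    size_t nr_requested;
    struct lu_resource *ordered[LU_MAX];
    size_t nr_ordered;
};

/* Registration: viability check only, no dependency walk. */
int lu_register(struct lu_session *s, struct lu_resource *r)
{
    if (!r->preservable)
        return -EOPNOTSUPP;
    if (s->nr_requested == LU_MAX)
        return -ENOSPC;
    s->requested[s->nr_requested++] = r;
    return 0;
}

/* Place r on the ordered list after its dependency. */
static int lu_pull(struct lu_session *s, struct lu_resource *r)
{
    int ret;

    if (r->visit == LU_DONE)
        return 0;
    if (r->visit == LU_VISITING)
        return -ELOOP;          /* dependency cycle */
    if (!r->preservable)
        return -EOPNOTSUPP;
    r->visit = LU_VISITING;
    if (r->dep) {
        ret = lu_pull(s, r->dep);
        if (ret)
            return ret;
    }
    if (s->nr_ordered == LU_MAX)
        return -ENOSPC;
    s->ordered[s->nr_ordered++] = r;
    r->visit = LU_DONE;
    return 0;
}

/* Prepare phase: pull dependencies and assemble the final order once. */
int lu_prepare(struct lu_session *s)
{
    for (size_t i = 0; i < s->nr_requested; i++) {
        int ret = lu_pull(s, s->requested[i]);

        if (ret)
            return ret;
    }
    return 0;
}
```

In this sketch, registering a vfio fd that depends on a memfd leaves
the ordered list empty until lu_prepare() runs, at which point the
memfd is listed first even though it was never registered explicitly.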

To manage the preservation logic, we'll use specific handles
categorized into three types: fd, device, and global. Each handle type
will define callbacks for the different phases of the live update
process. For instance, a file-system-related handle might look
something like this:

struct liveupdate_fs_handle {
    struct list_head liveupdate_entry;
    /* Callback during prepare phase */
    int (*prepare)(struct file *filp, void *preserve_page, ...);
    /* Callback during reboot phase */
    int (*reboot)(struct file *filp, void *preserve_page, ...);
    /* Callback after successful update to do state clean-up */
    void (*finish)(struct file *filp, void *preserve_page, ...);
    /* Callback if prepare/reboot is cancelled */
    void (*cancel)(struct file *filp, void *preserve_page, ...);
};
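
A rough sketch of how a core loop might drive the prepare and cancel
callbacks, assuming that a failed prepare rolls back the handles that
were already prepared, in reverse order. Everything here is
hypothetical: struct lu_handle is a simplified user-space stand-in
(a void * replaces struct file *), and the demo_* callbacks exist only
to make the behaviour observable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for the handle shown above. */
struct lu_handle {
    int  (*prepare)(void *obj, void *preserve_page);
    void (*cancel)(void *obj, void *preserve_page);
    void *obj;
    void *preserve_page;
};

/*
 * Call prepare on every handle in order. If one fails, cancel the
 * handles that were already prepared, last to first, and return the
 * error. The handle whose prepare failed is not cancelled.
 */
int lu_prepare_all(struct lu_handle *h, size_t n)
{
    size_t i;
    int ret = 0;

    for (i = 0; i < n; i++) {
        ret = h[i].prepare(h[i].obj, h[i].preserve_page);
        if (ret)
            break;
    }
    if (!ret)
        return 0;

    while (i--)
        h[i].cancel(h[i].obj, h[i].preserve_page);
    return ret;
}

/* Demo callbacks: obj points at an int that counts "prepared" state. */
int demo_prepare(void *obj, void *preserve_page)
{
    (void)preserve_page;
    ++*(int *)obj;
    return 0;
}

int demo_prepare_fail(void *obj, void *preserve_page)
{
    (void)obj;
    (void)preserve_page;
    return -EBUSY;
}

void demo_cancel(void *obj, void *preserve_page)
{
    (void)preserve_page;
    --*(int *)obj;
}
```

With two handles using demo_prepare followed by one using
demo_prepare_fail, lu_prepare_all() returns -EBUSY and both counters
are back at zero afterwards.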

The overall preservation sequence involves processing these handles in
a specific order:

1. Preserved File Descriptors (e.g., memfd, kvmfd, iommufd, vfiofd)
2. Preserved Devices (ordered appropriately, leaves-to-root)
3. Global State Components
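
As an illustration of that sequence only, the sketch below orders a
mixed set of handles by type: file descriptors, then devices, then
global state. The enum, struct and function names are hypothetical.
Within a type it simply keeps registration order; the leaves-to-root
ordering of devices is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Processing order follows the enum values. */
enum lu_type { LU_FD, LU_DEVICE, LU_GLOBAL, LU_NR_TYPES };

struct lu_item {
    enum lu_type type;
    const char *name;
};

/*
 * Fill out[] with pointers to in[] in processing order and return the
 * number of entries written. out[] must have room for n pointers.
 */
size_t lu_order(const struct lu_item *in, size_t n,
                const struct lu_item **out)
{
    size_t k = 0;

    for (int t = LU_FD; t < LU_NR_TYPES; t++)
        for (size_t i = 0; i < n; i++)
            if (in[i].type == (enum lu_type)t)
                out[k++] = &in[i];
    return k;
}
```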

Let me know if this direction aligns with your expectations.

Pasha



* Re: [RFC v1 1/3] luo: Live Update Orchestrator
  2025-03-27 19:29         ` Pasha Tatashin
@ 2025-03-31 16:37           ` Jason Gunthorpe
  2025-04-25 17:21           ` Lukas Wunner
  1 sibling, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2025-03-31 16:37 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: changyuanl, graf, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, jgowans, Pratyush Yadav

On Thu, Mar 27, 2025 at 03:29:18PM -0400, Pasha Tatashin wrote:
> Here’s a summary of the planned approach:
> 
> 1. Unified Location: LUO will be moved under misc/liveupdate/ to house
> the consolidated functionality.

It makes sense to me, and I prefer all this live update stuff be as
isolated and "side car" as possible to keep the normal kernel flow
simple..

> 2.  User Interfaces:  A primary character device (/dev/liveupdate)
> utilizing an ioctl interface for control operations. (An initial draft
> of this interface is available here:
> https://raw.githubusercontent.com/soleen/linux/refs/heads/luo/rfc-v2.1/include/uapi/linux/liveupdate.h)

That looks like a pretty comprehensive view

I'd probably nitpick some things but nothing fundamental..

You *may* want to look at drivers/fwctl/main.c around fwctl_fops_ioctl
for some thoughts on how to structure an ioctl implementation to be
safely extensible. You can even just copy that stuff, I copied it
already from iommufd..
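
For readers unfamiliar with the pattern, here is a user-space sketch
of one common way to make an ioctl argument extensible: the struct
carries its own size, older callers may pass a shorter struct, and
newer callers may pass a longer one as long as the bytes the receiver
does not understand are zero. This is an illustration written for this
thread, not code copied from drivers/fwctl/main.c, and struct
lu_cmd_v2 is hypothetical rather than taken from the draft
liveupdate.h. The kernel helper copy_struct_from_user() has similar
semantics.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command layout, for illustration only. */
struct lu_cmd_v2 {
    uint32_t size;      /* sizeof the struct as the caller knows it */
    uint32_t flags;
    uint64_t token;     /* field added in a later revision */
};

/* The first revision ended just before 'token'. */
#define LU_CMD_MIN_SIZE offsetof(struct lu_cmd_v2, token)

/*
 * Copy a caller-supplied command of user_size bytes into dst.
 * Shorter (older) structs are zero-extended. Longer (newer) structs
 * are accepted only if all bytes beyond sizeof(*dst) are zero.
 */
int lu_copy_cmd(struct lu_cmd_v2 *dst, const void *src, size_t user_size)
{
    const unsigned char *p = src;
    size_t n = user_size < sizeof(*dst) ? user_size : sizeof(*dst);

    if (user_size < LU_CMD_MIN_SIZE)
        return -EINVAL;
    for (size_t i = sizeof(*dst); i < user_size; i++)
        if (p[i])
            return -E2BIG;
    memset(dst, 0, sizeof(*dst));
    memcpy(dst, src, n);
    return 0;
}
```

The point of the design is that new fields can be appended without a
new ioctl number: old binaries keep working because missing fields
read as zero, and new binaries fail cleanly on an old kernel if they
actually use a field it does not know.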

It's a little confusing how you imagine using UNPRESERVE_XX,
EVENT_CANCEL and close() as various error handling strategies,
especially depending on how we are able to "freeze" a file descriptor.

> An optional sysfs interface will allow userspace applications to
> monitor the LUO's state and react appropriately. e.g. allows SystemD
> to load different services during different live update states.

Makes sense, systemd works a lot better with a sysfs file for knowing
whether the boot is a kexec live update boot or not.

Though I don't know why you'd keep /sys/kernel/liveupdate/prepare and
others? It seems really weird that something would be able to safely
sequence the update but not have access to the FD?

> 3. Dependency Management: The viability of preserving a specific
> resource (file, device) will be checked when it initially requests
> participation.
> However, the actual dependencies will only be pulled and the final
> ordered list assembled during the prepare phase. This avoids the churn
> of repeatedly adding/removing dependencies as individual components
> register.

Maybe, will have to see how the code works out in practice with real
implementations. I did not imagine having a full "unprepare" idea
since that significantly complicates everything. close() would just
nuke everything.

> struct liveupdate_fs_handle {
>     struct list_head liveupdate_entry;

Don't mix data and const function pointers..

>     int (*prepare)(struct file *filp, void *preserve_page, ...); // Callback during prepare phase
>     int (*reboot)(struct file *filp, void *preserve_page,...);  // Callback during reboot phase
>     void (*finish)(struct file *filp, void *preserve_page,...); // Callback after successful update to do state clean-up
>     void (*cancel)(struct file *filp, void *preserve_page,...); // Callback if prepare/reboot is cancelled
> };

But it makes sense overall.
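
One common reading of "don't mix data and const function pointers" is
to split the handle into an immutable ops table, shared by every
instance of a file type, and a small mutable per-registration struct
that points at it. The sketch below shows that split in user-space C;
all names are hypothetical, struct lu_file stands in for struct file,
and a plain next pointer stands in for the kernel's struct list_head.

```c
#include <assert.h>
#include <stddef.h>

struct lu_file;     /* opaque stand-in for struct file */

/* Immutable: one instance per file type, can be placed in rodata. */
struct lu_file_ops {
    int  (*prepare)(struct lu_file *f, void *preserve_page);
    int  (*reboot)(struct lu_file *f, void *preserve_page);
    void (*finish)(struct lu_file *f, void *preserve_page);
    void (*cancel)(struct lu_file *f, void *preserve_page);
};

/* Mutable: per-registration bookkeeping that points at shared ops. */
struct lu_file_handle {
    struct lu_file_handle *next;    /* list linkage */
    const struct lu_file_ops *ops;
    struct lu_file *file;
    void *preserve_page;
};

/* Dispatch through the ops table; a missing callback is a no-op. */
int lu_handle_prepare(struct lu_file_handle *h)
{
    if (!h->ops->prepare)
        return 0;
    return h->ops->prepare(h->file, h->preserve_page);
}

/* Demo ops for a hypothetical memfd-like file type. */
int demo_memfd_prepare(struct lu_file *f, void *preserve_page)
{
    (void)f;
    (void)preserve_page;
    return 0;
}

const struct lu_file_ops demo_memfd_ops = {
    .prepare = demo_memfd_prepare,
};
```

Keeping the function pointers in a const table means they cannot be
overwritten at run time through a writable handle, and each handle
costs one pointer instead of four.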

> Preserved File Descriptors (e.g., memfd, kvmfd, iommufd, vfiofd)
> Preserved Devices (ordered appropriately, leaves-to-root)

I think because of the cyclic ordering between kvm/iommu/vfio it may
become a bit complicated. You will want LIVEUPDATE_IOCTL_FD_PRESERVE
to not check dependencies but leave some kind of placeholder so the
cycles can be broken.
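
A sketch of the placeholder idea, under assumptions of my own: the
preserve step only records a token and the token of a peer it will
need, without checking that the peer exists yet, and a later resolve
step (for example at prepare time) checks that every referenced peer
was eventually preserved. The table layout and function names are
hypothetical, and each entry names at most one peer for brevity.
Because resolution only tests membership, a kvm -> iommufd -> vfio ->
kvm cycle is accepted regardless of the order in which the three are
preserved.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define LU_MAX_FDS 8

struct lu_fd_entry {
    uint64_t token;     /* caller-chosen identifier, nonzero */
    uint64_t needs;     /* token of a required peer, 0 = none */
    int resolved;       /* 0 while the entry is still a placeholder */
};

struct lu_fd_table {
    struct lu_fd_entry e[LU_MAX_FDS];
    size_t n;
};

/* Preserve: record a placeholder, do not look at dependencies. */
int lu_fd_preserve(struct lu_fd_table *t, uint64_t token, uint64_t needs)
{
    if (t->n == LU_MAX_FDS)
        return -ENOSPC;
    t->e[t->n++] = (struct lu_fd_entry){ .token = token, .needs = needs };
    return 0;
}

/* Resolve: every referenced peer must have been preserved by now. */
int lu_fd_resolve(struct lu_fd_table *t)
{
    for (size_t i = 0; i < t->n; i++) {
        int found = (t->e[i].needs == 0);

        for (size_t j = 0; j < t->n && !found; j++)
            if (t->e[j].token == t->e[i].needs)
                found = 1;
        if (!found)
            return -ENOENT;
        t->e[i].resolved = 1;
    }
    return 0;
}
```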

> Global State Components

You may need a LIVEUPDATE_IOCTL_GLOBAL_PRESERVE as well to select
these?

Jason



* Re: [RFC v1 1/3] luo: Live Update Orchestrator
  2025-03-27 19:29         ` Pasha Tatashin
  2025-03-31 16:37           ` Jason Gunthorpe
@ 2025-04-25 17:21           ` Lukas Wunner
  1 sibling, 0 replies; 21+ messages in thread
From: Lukas Wunner @ 2025-04-25 17:21 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Jason Gunthorpe, changyuanl, graf, rppt, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song,
	zhangguopeng, linux, linux-kernel, linux-doc, linux-mm, gregkh,
	tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, bhelgaas, wagi, djeffery,
	stuart.w.hayes, jgowans, Pratyush Yadav

On Thu, Mar 27, 2025 at 03:29:18PM -0400, Pasha Tatashin wrote:
> 3. Dependency Management: The viability of preserving a specific
> resource (file, device) will be checked when it initially requests
> participation.
> However, the actual dependencies will only be pulled and the final
> ordered list assembled during the prepare phase. This avoids the churn
> of repeatedly adding/removing dependencies as individual components
> register.
[...]
> The overall preservation sequence involve processing these handles in
> a specific order:
> 
> Preserved File Descriptors (e.g., memfd, kvmfd, iommufd, vfiofd)
> Preserved Devices (ordered appropriately, leaves-to-root)
                                            ^^^^^^^^^^^^^^

Device dependencies are no longer strictly hierarchical since the
introduction of device links.  However devices_kset->list already
has the correct order, so if you follow that you should be fine.
Just be careful not to assume it's always in hierarchical order.
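
To make the pitfall concrete, here is a small user-space model. It
assumes a list ordered the way the paragraph above describes
devices_kset->list, that is, parents and suppliers appear before their
children and consumers, so that walking it from the tail visits
dependents first. The struct and function names are hypothetical. The
example has a "gpu" directly under the root that consumes a
"regulator" sitting deeper in the tree, which an ordering based on
tree depth alone gets wrong.

```c
#include <assert.h>
#include <stddef.h>

struct lu_dev {
    const char *name;
    const struct lu_dev *parent;    /* tree hierarchy */
    const struct lu_dev *supplier;  /* device link, may cross the tree */
};

/* Visit order for preservation: walk the list from tail to head. */
void lu_walk_reverse(const struct lu_dev *const *list, size_t n,
                     const struct lu_dev **out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = list[n - 1 - i];
}

/* Return 1 if every device comes before its parent and its supplier. */
int lu_order_ok(const struct lu_dev *const *order, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < i; j++)
            if (order[j] == order[i]->parent ||
                order[j] == order[i]->supplier)
                return 0;
    return 1;
}
```

Given the list root, i2c, regulator, gpu, the reverse walk yields gpu,
regulator, i2c, root and passes the check. A deepest-first order such
as regulator, i2c, gpu, root fails it, because the regulator would be
handled before the gpu that still depends on it.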

Thanks,

Lukas



Thread overview: 21+ messages
2025-03-20  2:40 [RFC v1 0/3] Live Update Orchestrator Pasha Tatashin
2025-03-20  2:40 ` [RFC v1 1/3] luo: " Pasha Tatashin
2025-03-20 13:39   ` Andy Shevchenko
2025-03-20 16:35     ` Pasha Tatashin
2025-03-20 17:50       ` Andy Shevchenko
2025-03-20 18:30         ` Pasha Tatashin
2025-03-21 13:19           ` Andy Shevchenko
2025-03-20 14:43   ` Jason Gunthorpe
2025-03-20 19:00     ` Pasha Tatashin
2025-03-20 19:26       ` Jason Gunthorpe
2025-03-27 19:29         ` Pasha Tatashin
2025-03-31 16:37           ` Jason Gunthorpe
2025-04-25 17:21           ` Lukas Wunner
2025-03-20  2:40 ` [RFC v1 2/3] luo: dev_liveupdate: Add device live update infrastructure Pasha Tatashin
2025-03-20 13:34   ` Greg KH
2025-03-20 18:03     ` Pasha Tatashin
2025-03-20 20:51       ` Greg KH
2025-03-21  9:41         ` Bartosz Golaszewski
2025-03-20  2:40 ` [RFC v1 3/3] luo: x86: Enable live update support Pasha Tatashin
2025-03-20 13:35 ` [RFC v1 0/3] Live Update Orchestrator Greg KH
2025-03-20 15:34   ` Pasha Tatashin
