linux-mm.kvack.org archive mirror
* [RFC Patch 0/7] kernel: Introduce multikernel architecture support
@ 2025-09-18 22:25 Cong Wang
  2025-09-18 22:26 ` [RFC Patch 1/7] kexec: Introduce multikernel support via kexec Cong Wang
                   ` (14 more replies)
  0 siblings, 15 replies; 44+ messages in thread
From: Cong Wang @ 2025-09-18 22:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: pasha.tatashin, Cong Wang, Cong Wang, Andrew Morton, Baoquan He,
	Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec, linux-mm

This patch series introduces multikernel architecture support, enabling
multiple independent kernel instances to coexist and communicate on a
single physical machine. Each kernel instance can run on dedicated CPU
cores while sharing the underlying hardware resources.

The multikernel architecture provides several key benefits:
- Improved fault isolation between different workloads
- Enhanced security through kernel-level separation
- Better resource utilization than traditional VMs (KVM, Xen, etc.)
- Potential zero-downtime kernel updates with KHO (Kexec HandOver)

Architecture Overview:
The implementation leverages kexec infrastructure to load and manage
multiple kernel images, with each kernel instance assigned to specific
CPU cores. Inter-kernel communication is facilitated through a dedicated
IPI framework that allows kernels to coordinate and share information
when necessary.

Key Components:
1. Enhanced kexec subsystem with dynamic kimage tracking
2. Generic IPI communication framework for inter-kernel messaging
3. Architecture-specific CPU bootstrap mechanisms (only x86 so far)
4. Proc interface for monitoring loaded kernel instances

Patch Summary:

Patch 1/7: Introduces basic multikernel support via kexec, allowing
           multiple kernel images to be loaded simultaneously.

Patch 2/7: Adds x86-specific SMP INIT trampoline for bootstrapping
           CPUs with different kernel instances.

Patch 3/7: Introduces dedicated MULTIKERNEL_VECTOR for x86 inter-kernel
           communication.

Patch 4/7: Implements generic multikernel IPI communication framework
           for cross-kernel messaging and coordination.

Patch 5/7: Adds arch_cpu_physical_id() function to obtain physical CPU
           identifiers for proper CPU management.

Patch 6/7: Replaces static kimage globals with dynamic linked list
           infrastructure to support multiple kernel images.

Patch 7/7: Adds /proc/multikernel interface for monitoring and debugging
           loaded kernel instances.

The implementation maintains full backward compatibility with existing
kexec functionality while adding the new multikernel capabilities.

IMPORTANT NOTES:

1) This is a Request for Comments (RFC) submission. While the core
   architecture is functional, there are numerous implementation details
   that need improvement. The primary goal is to gather feedback on the
   high-level design and overall approach rather than focus on specific
   coding details at this stage.

2) This patch series represents only the foundational framework for
   multikernel support. It establishes the basic infrastructure and
   communication mechanisms. We welcome the community to build upon
   this foundation and develop their own solutions based on this
   framework.

3) Testing has been limited to the author's development machine using
   hard-coded boot parameters and specific hardware configurations.
   Community testing across different hardware platforms, configurations,
   and use cases would be greatly appreciated to identify potential
   issues and improve robustness. Obviously, don't use this code beyond
   testing.

This work enables new use cases such as running real-time kernels
alongside general-purpose kernels, isolating security-critical
applications, and providing dedicated kernel instances for specific
workloads.

Signed-off-by: Cong Wang <cwang@multikernel.io>

---

Cong Wang (7):
  kexec: Introduce multikernel support via kexec
  x86: Introduce SMP INIT trampoline for multikernel CPU bootstrap
  x86: Introduce MULTIKERNEL_VECTOR for inter-kernel communication
  kernel: Introduce generic multikernel IPI communication framework
  x86: Introduce arch_cpu_physical_id() to obtain physical CPU ID
  kexec: Implement dynamic kimage tracking
  kexec: Add /proc/multikernel interface for kimage tracking

 arch/powerpc/kexec/crash.c          |   8 +-
 arch/x86/include/asm/idtentry.h     |   1 +
 arch/x86/include/asm/irq_vectors.h  |   1 +
 arch/x86/include/asm/smp.h          |   7 +
 arch/x86/kernel/Makefile            |   1 +
 arch/x86/kernel/crash.c             |   4 +-
 arch/x86/kernel/head64.c            |   5 +
 arch/x86/kernel/idt.c               |   1 +
 arch/x86/kernel/setup.c             |   3 +
 arch/x86/kernel/smp.c               |  15 ++
 arch/x86/kernel/smpboot.c           | 161 +++++++++++++
 arch/x86/kernel/trampoline_64_bsp.S | 288 ++++++++++++++++++++++
 arch/x86/kernel/vmlinux.lds.S       |   6 +
 include/linux/kexec.h               |  22 +-
 include/linux/multikernel.h         |  81 +++++++
 include/uapi/linux/kexec.h          |   1 +
 include/uapi/linux/reboot.h         |   2 +-
 init/main.c                         |   2 +
 kernel/Makefile                     |   2 +-
 kernel/kexec.c                      | 103 +++++++-
 kernel/kexec_core.c                 | 359 ++++++++++++++++++++++++++++
 kernel/kexec_file.c                 |  33 ++-
 kernel/multikernel.c                | 314 ++++++++++++++++++++++++
 kernel/reboot.c                     |  10 +
 24 files changed, 1411 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/kernel/trampoline_64_bsp.S
 create mode 100644 include/linux/multikernel.h
 create mode 100644 kernel/multikernel.c

-- 
2.34.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC Patch 1/7] kexec: Introduce multikernel support via kexec
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
@ 2025-09-18 22:26 ` Cong Wang
  2025-09-18 22:26 ` [RFC Patch 2/7] x86: Introduce SMP INIT trampoline for multikernel CPU bootstrap Cong Wang
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Cong Wang @ 2025-09-18 22:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: pasha.tatashin, Cong Wang, Andrew Morton, Baoquan He,
	Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec, linux-mm

From: Cong Wang <cwang@multikernel.io>

This patch extends the kexec subsystem to support multikernel
functionality, allowing different kernel instances to be loaded and
executed on specific CPUs. The implementation introduces:

- New KEXEC_TYPE_MULTIKERNEL type and KEXEC_MULTIKERNEL flag

- multikernel_kick_ap() function for CPU-specific kernel booting

- LINUX_REBOOT_CMD_MULTIKERNEL reboot command with CPU parameter

- Specialized segment loading for multikernel images using memremap

- Integration with the existing kexec infrastructure, bypassing the
  standard machine_kexec_prepare() to avoid resetting the running system

The multikernel_kexec() function validates CPU availability and uses
the existing kexec image start address to boot the target CPU with
a different kernel instance. This enables heterogeneous computing
scenarios where different CPUs can run specialized kernel variants.
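
For illustration, here is a minimal userspace sketch of how this interface
might be driven. It is not part of the patch: the entry address, the empty
segment, and the target CPU number are placeholders, and the image is assumed
to already be staged in memory reserved away from the running kernel (for
example via a memmap= reservation on the first kernel's command line).

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/kexec.h>
#include <linux/reboot.h>

/* Values introduced by this patch, for use with unpatched uapi headers. */
#ifndef KEXEC_MULTIKERNEL
#define KEXEC_MULTIKERNEL		0x00000010
#endif
#ifndef LINUX_REBOOT_CMD_MULTIKERNEL
#define LINUX_REBOOT_CMD_MULTIKERNEL	0x4D4B4C49
#endif

/* Mirrors the structure copied from userspace in kernel/reboot.c. */
struct multikernel_boot_args {
	int cpu;
};

int main(void)
{
	/* Placeholders: a real caller must describe an image that is already
	 * staged in a reserved physical region; building the segments is not
	 * shown here. */
	unsigned long entry = 0x100000000UL;
	struct kexec_segment segments[1] = { { 0 } };
	struct multikernel_boot_args args = { .cpu = 3 };	/* an offline CPU */

	if (syscall(SYS_kexec_load, entry, 1UL, segments,
		    KEXEC_ARCH_DEFAULT | KEXEC_MULTIKERNEL))
		perror("kexec_load");

	if (syscall(SYS_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2,
		    LINUX_REBOOT_CMD_MULTIKERNEL, &args))
		perror("reboot");

	return 0;
}

Note that, unlike LINUX_REBOOT_CMD_KEXEC, the reboot call above only kicks the
target CPU into the already-loaded image and does not tear down the running
kernel.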

Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 arch/x86/include/asm/smp.h  |   1 +
 arch/x86/kernel/smpboot.c   | 104 +++++++++++++++++++++++++++
 include/linux/kexec.h       |   6 +-
 include/uapi/linux/kexec.h  |   1 +
 include/uapi/linux/reboot.h |   2 +-
 kernel/kexec.c              |  41 ++++++++++-
 kernel/kexec_core.c         | 135 ++++++++++++++++++++++++++++++++++++
 kernel/reboot.c             |  10 +++
 8 files changed, 294 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 22bfebe6776d..1a59fd0de759 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -107,6 +107,7 @@ void native_smp_prepare_cpus(unsigned int max_cpus);
 void native_smp_cpus_done(unsigned int max_cpus);
 int common_cpu_up(unsigned int cpunum, struct task_struct *tidle);
 int native_kick_ap(unsigned int cpu, struct task_struct *tidle);
+int multikernel_kick_ap(unsigned int cpu, unsigned long kernel_start_address);
 int native_cpu_disable(void);
 void __noreturn hlt_play_dead(void);
 void native_play_dead(void);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 33e166f6ab12..c2844a493ebf 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -833,6 +833,72 @@ int common_cpu_up(unsigned int cpu, struct task_struct *idle)
 	return 0;
 }
 
+/* Caller must hold cpus_read_lock(). */
+static int do_multikernel_boot_cpu(u32 apicid, int cpu, unsigned long kernel_start_address)
+{
+	unsigned long start_ip = real_mode_header->trampoline_start;
+	int ret;
+
+	pr_info("do_multikernel_boot_cpu(apicid=%u, cpu=%u, kernel_start_address=%lx)\n", apicid, cpu, kernel_start_address);
+#ifdef CONFIG_X86_64
+	/* If 64-bit wakeup method exists, use the 64-bit mode trampoline IP */
+	if (apic->wakeup_secondary_cpu_64)
+		start_ip = real_mode_header->trampoline_start64;
+#endif
+	//initial_code = (unsigned long)start_secondary;
+	initial_code = (unsigned long)kernel_start_address;
+
+	if (IS_ENABLED(CONFIG_X86_32)) {
+		early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
+		//initial_stack  = idle->thread.sp;
+	} else if (!(smpboot_control & STARTUP_PARALLEL_MASK)) {
+		smpboot_control = cpu;
+	}
+
+	/* Skip init_espfix_ap(cpu); */
+
+	/* Skip announce_cpu(cpu, apicid); */
+
+	/*
+	 * This grunge runs the startup process for
+	 * the targeted processor.
+	 */
+	if (x86_platform.legacy.warm_reset) {
+
+		pr_debug("Setting warm reset code and vector.\n");
+
+		smpboot_setup_warm_reset_vector(start_ip);
+		/*
+		 * Be paranoid about clearing APIC errors.
+		*/
+		if (APIC_INTEGRATED(boot_cpu_apic_version)) {
+			apic_write(APIC_ESR, 0);
+			apic_read(APIC_ESR);
+		}
+	}
+
+	smp_mb();
+
+	/*
+	 * Wake up a CPU in different cases:
+	 * - Use a method from the APIC driver if one is defined, with wakeup
+	 *   straight to 64-bit mode preferred over wakeup to RM.
+	 * Otherwise,
+	 * - Use an INIT boot APIC message
+	 */
+	if (apic->wakeup_secondary_cpu_64)
+		ret = apic->wakeup_secondary_cpu_64(apicid, start_ip, cpu);
+	else if (apic->wakeup_secondary_cpu)
+		ret = apic->wakeup_secondary_cpu(apicid, start_ip, cpu);
+	else
+		ret = wakeup_secondary_cpu_via_init(apicid, start_ip, cpu);
+
+	pr_info("do_multikernel_boot_cpu end\n");
+	/* If the wakeup mechanism failed, cleanup the warm reset vector */
+	if (ret)
+		arch_cpuhp_cleanup_kick_cpu(cpu);
+	return ret;
+}
 /*
  * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
  * (ie clustered apic addressing mode), this is a LOGICAL apic ID.
@@ -905,6 +971,44 @@ static int do_boot_cpu(u32 apicid, unsigned int cpu, struct task_struct *idle)
 	return ret;
 }
 
+/* Caller must hold cpus_read_lock(). */
+int multikernel_kick_ap(unsigned int cpu, unsigned long kernel_start_address)
+{
+	u32 apicid = apic->cpu_present_to_apicid(cpu);
+	int err;
+
+	lockdep_assert_irqs_enabled();
+
+	pr_info("++++++++++++++++++++=_---CPU UP  %u\n", cpu);
+
+	if (apicid == BAD_APICID || !apic_id_valid(apicid)) {
+		pr_err("CPU %u has invalid APIC ID %x. Aborting bringup\n", cpu, apicid);
+		return -EINVAL;
+	}
+
+	if (!test_bit(apicid, phys_cpu_present_map)) {
+		pr_err("CPU %u APIC ID %x is not present. Aborting bringup\n", cpu, apicid);
+		return -EINVAL;
+	}
+
+	/*
+	 * Save current MTRR state in case it was changed since early boot
+	 * (e.g. by the ACPI SMI) to initialize new CPUs with MTRRs in sync:
+	 */
+	mtrr_save_state();
+
+	/* the FPU context is blank, nobody can own it */
+	per_cpu(fpu_fpregs_owner_ctx, cpu) = NULL;
+	/* skip common_cpu_up(cpu, tidle); */
+
+	err = do_multikernel_boot_cpu(apicid, cpu, kernel_start_address);
+	if (err)
+		pr_err("do_multikernel_boot_cpu failed(%d) to wakeup CPU#%u\n", err, cpu);
+
+	return err;
+}
+
+
 int native_kick_ap(unsigned int cpu, struct task_struct *tidle)
 {
 	u32 apicid = apic->cpu_present_to_apicid(cpu);
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 39fe3e6cd282..a3ae3e561109 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -358,9 +358,10 @@ struct kimage {
 	unsigned long control_page;
 
 	/* Flags to indicate special processing */
-	unsigned int type : 1;
+	unsigned int type : 2;
 #define KEXEC_TYPE_DEFAULT 0
 #define KEXEC_TYPE_CRASH   1
+#define KEXEC_TYPE_MULTIKERNEL 2
 	unsigned int preserve_context : 1;
 	/* If set, we are using file mode kexec syscall */
 	unsigned int file_mode:1;
@@ -434,6 +435,7 @@ extern void machine_kexec(struct kimage *image);
 extern int machine_kexec_prepare(struct kimage *image);
 extern void machine_kexec_cleanup(struct kimage *image);
 extern int kernel_kexec(void);
+extern int multikernel_kexec(int cpu);
 extern struct page *kimage_alloc_control_pages(struct kimage *image,
 						unsigned int order);
 
@@ -455,7 +457,7 @@ bool kexec_load_permitted(int kexec_image_type);
 #define KEXEC_FLAGS    (KEXEC_ON_CRASH | KEXEC_UPDATE_ELFCOREHDR | KEXEC_CRASH_HOTPLUG_SUPPORT)
 #else
 #define KEXEC_FLAGS    (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT | KEXEC_UPDATE_ELFCOREHDR | \
-			KEXEC_CRASH_HOTPLUG_SUPPORT)
+			KEXEC_CRASH_HOTPLUG_SUPPORT | KEXEC_MULTIKERNEL)
 #endif
 
 /* List of defined/legal kexec file flags */
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index 8958ebfcff94..4ed8660ef95e 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
@@ -14,6 +14,7 @@
 #define KEXEC_PRESERVE_CONTEXT	0x00000002
 #define KEXEC_UPDATE_ELFCOREHDR	0x00000004
 #define KEXEC_CRASH_HOTPLUG_SUPPORT 0x00000008
+#define KEXEC_MULTIKERNEL	0x00000010
 #define KEXEC_ARCH_MASK		0xffff0000
 
 /*
diff --git a/include/uapi/linux/reboot.h b/include/uapi/linux/reboot.h
index 58e64398efc5..aac2f2f94a98 100644
--- a/include/uapi/linux/reboot.h
+++ b/include/uapi/linux/reboot.h
@@ -34,7 +34,7 @@
 #define	LINUX_REBOOT_CMD_RESTART2	0xA1B2C3D4
 #define	LINUX_REBOOT_CMD_SW_SUSPEND	0xD000FCE2
 #define	LINUX_REBOOT_CMD_KEXEC		0x45584543
-
+#define	LINUX_REBOOT_CMD_MULTIKERNEL	0x4D4B4C49
 
 
 #endif /* _UAPI_LINUX_REBOOT_H */
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 28008e3d462e..49e62f804674 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -16,6 +16,7 @@
 #include <linux/syscalls.h>
 #include <linux/vmalloc.h>
 #include <linux/slab.h>
+#include <linux/memblock.h>
 
 #include "kexec_internal.h"
 
@@ -27,6 +28,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
 	int ret;
 	struct kimage *image;
 	bool kexec_on_panic = flags & KEXEC_ON_CRASH;
+	bool multikernel_load = flags & KEXEC_MULTIKERNEL;
 
 #ifdef CONFIG_CRASH_DUMP
 	if (kexec_on_panic) {
@@ -37,6 +39,30 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
 	}
 #endif
 
+#if 0
+	if (multikernel_load) {
+		// Check if entry is in a reserved memory region
+		bool in_reserved_region = false;
+		phys_addr_t start, end;
+		u64 i;
+
+		for_each_reserved_mem_range(i, &start, &end) {
+			if (entry >= start && entry < end) {
+				in_reserved_region = true;
+				break;
+			}
+		}
+
+		if (!in_reserved_region) {
+			pr_err("Entry point 0x%lx is not in a reserved memory region\n", entry);
+			return -EADDRNOTAVAIL; // Return an error if not in a reserved region
+		}
+
+		pr_info("multikernel load: got to multikernel_load syscall, entry 0x%lx, nr_segments %lu, flags 0x%lx\n",
+			entry, nr_segments, flags);
+	}
+#endif
+
 	/* Allocate and initialize a controlling structure */
 	image = do_kimage_alloc_init();
 	if (!image)
@@ -54,10 +80,16 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
 	}
 #endif
 
+	if (multikernel_load) {
+		image->type = KEXEC_TYPE_MULTIKERNEL;
+	}
+
 	ret = sanity_check_segment_list(image);
 	if (ret)
 		goto out_free_image;
 
+	if (multikernel_load)
+		goto done;
 	/*
 	 * Find a location for the control code buffer, and add it
 	 * the vector of segments so that it's pages will also be
@@ -79,6 +111,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
 		}
 	}
 
+done:
 	*rimage = image;
 	return 0;
 out_free_control_pages:
@@ -139,9 +172,11 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
 		image->hotplug_support = 1;
 #endif
 
-	ret = machine_kexec_prepare(image);
-	if (ret)
-		goto out;
+	if (!(flags & KEXEC_MULTIKERNEL)) {
+		ret = machine_kexec_prepare(image);
+		if (ret)
+			goto out;
+	}
 
 	/*
 	 * Some architecture(like S390) may touch the crash memory before
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 31203f0bacaf..35a66c8dd78b 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -41,6 +41,7 @@
 #include <linux/objtool.h>
 #include <linux/kmsg_dump.h>
 #include <linux/dma-map-ops.h>
+#include <linux/memblock.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -211,6 +212,32 @@ int sanity_check_segment_list(struct kimage *image)
 	}
 #endif
 
+#if 0
+	if (image->type == KEXEC_TYPE_MULTIKERNEL) {
+		for (i = 0; i < nr_segments; i++) {
+			unsigned long mstart, mend;
+			phys_addr_t start, end;
+			bool in_reserved_region = false;
+			u64 i;
+
+			mstart = image->segment[i].mem;
+			mend = mstart + image->segment[i].memsz - 1;
+			for_each_reserved_mem_range(i, &start, &end) {
+				if (mstart >= start && mend <= end) {
+					in_reserved_region = true;
+					break;
+				}
+			}
+
+			if (!in_reserved_region) {
+				pr_err("Segment 0x%lx-0x%lx is not in a reserved memory region\n",
+					mstart, mend);
+				return -EADDRNOTAVAIL;
+			}
+		}
+	}
+#endif
+
 	/*
 	 * The destination addresses are searched from system RAM rather than
 	 * being allocated from the buddy allocator, so they are not guaranteed
@@ -943,6 +970,84 @@ static int kimage_load_crash_segment(struct kimage *image, int idx)
 }
 #endif
 
+static int kimage_load_multikernel_segment(struct kimage *image, int idx)
+{
+	/* For multikernel we simply copy the data from
+	 * user space to its destination.
+	 * We do things a page at a time for the sake of memremap.
+	 */
+	struct kexec_segment *segment = &image->segment[idx];
+	unsigned long maddr;
+	size_t ubytes, mbytes;
+	int result;
+	unsigned char __user *buf = NULL;
+	unsigned char *kbuf = NULL;
+
+	result = 0;
+	if (image->file_mode)
+		kbuf = segment->kbuf;
+	else
+		buf = segment->buf;
+	ubytes = segment->bufsz;
+	mbytes = segment->memsz;
+	maddr = segment->mem;
+	pr_info("Loading multikernel segment: mem=0x%lx, memsz=0x%zu, buf=0x%px, bufsz=0x%zu\n",
+		maddr, mbytes, buf, ubytes);
+	while (mbytes) {
+		char *ptr;
+		size_t uchunk, mchunk;
+		unsigned long page_addr = maddr & PAGE_MASK;
+		unsigned long page_offset = maddr & ~PAGE_MASK;
+
+		/* Use memremap to map the physical address */
+		ptr = memremap(page_addr, PAGE_SIZE, MEMREMAP_WB);
+		if (!ptr) {
+			pr_err("Failed to memremap memory at 0x%lx\n", page_addr);
+			result = -ENOMEM;
+			goto out;
+		}
+
+		/* Adjust pointer to the offset within the page */
+		ptr += page_offset;
+
+		/* Calculate chunk sizes */
+		mchunk = min_t(size_t, mbytes, PAGE_SIZE - page_offset);
+		uchunk = min(ubytes, mchunk);
+
+		/* Zero the trailing part of the page if needed */
+		if (mchunk > uchunk) {
+			/* Zero the trailing part of the page */
+			memset(ptr + uchunk, 0, mchunk - uchunk);
+		}
+
+		if (uchunk) {
+			/* For file based kexec, source pages are in kernel memory */
+			if (image->file_mode)
+				memcpy(ptr, kbuf, uchunk);
+			else
+				result = copy_from_user(ptr, buf, uchunk);
+			ubytes -= uchunk;
+			if (image->file_mode)
+				kbuf += uchunk;
+			else
+				buf += uchunk;
+		}
+
+		/* Clean up */
+		memunmap(ptr - page_offset);
+		if (result) {
+			result = -EFAULT;
+			goto out;
+		}
+		maddr  += mchunk;
+		mbytes -= mchunk;
+
+		cond_resched();
+	}
+out:
+	return result;
+}
+
 int kimage_load_segment(struct kimage *image, int idx)
 {
 	int result = -ENOMEM;
@@ -956,6 +1061,9 @@ int kimage_load_segment(struct kimage *image, int idx)
 		result = kimage_load_crash_segment(image, idx);
 		break;
 #endif
+	case KEXEC_TYPE_MULTIKERNEL:
+		result = kimage_load_multikernel_segment(image, idx);
+		break;
 	}
 
 	return result;
@@ -1230,3 +1338,30 @@ int kernel_kexec(void)
 	kexec_unlock();
 	return error;
 }
+
+int multikernel_kexec(int cpu)
+{
+	int rc;
+
+	pr_info("multikernel kexec: cpu %d\n", cpu);
+
+	if (cpu_online(cpu)) {
+		pr_err("The CPU is currently running with this kernel instance.");
+		return -EBUSY;
+	}
+
+	if (!kexec_trylock())
+		return -EBUSY;
+	if (!kexec_image) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	cpus_read_lock();
+	rc = multikernel_kick_ap(cpu, kexec_image->start);
+	cpus_read_unlock();
+
+unlock:
+	kexec_unlock();
+	return rc;
+}
diff --git a/kernel/reboot.c b/kernel/reboot.c
index ec087827c85c..f3ac703c4695 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -717,6 +717,10 @@ EXPORT_SYMBOL_GPL(kernel_power_off);
 
 DEFINE_MUTEX(system_transition_mutex);
 
+struct multikernel_boot_args {
+	int cpu;
+};
+
 /*
  * Reboot system call: for obvious reasons only root may call it,
  * and even root needs to set up some magic numbers in the registers
@@ -729,6 +733,7 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
 		void __user *, arg)
 {
 	struct pid_namespace *pid_ns = task_active_pid_ns(current);
+	struct multikernel_boot_args boot_args;
 	char buffer[256];
 	int ret = 0;
 
@@ -799,6 +804,11 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
 	case LINUX_REBOOT_CMD_KEXEC:
 		ret = kernel_kexec();
 		break;
+	case LINUX_REBOOT_CMD_MULTIKERNEL:
+		if (copy_from_user(&boot_args, arg, sizeof(boot_args)))
+			return -EFAULT;
+		ret = multikernel_kexec(boot_args.cpu);
+		break;
 #endif
 
 #ifdef CONFIG_HIBERNATION
-- 
2.34.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC Patch 2/7] x86: Introduce SMP INIT trampoline for multikernel CPU bootstrap
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
  2025-09-18 22:26 ` [RFC Patch 1/7] kexec: Introduce multikernel support via kexec Cong Wang
@ 2025-09-18 22:26 ` Cong Wang
  2025-09-18 22:26 ` [RFC Patch 3/7] x86: Introduce MULTIKERNEL_VECTOR for inter-kernel communication Cong Wang
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Cong Wang @ 2025-09-18 22:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: pasha.tatashin, Cong Wang, Andrew Morton, Baoquan He,
	Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec, linux-mm

From: Cong Wang <cwang@multikernel.io>

This patch introduces a dedicated trampoline mechanism for booting
secondary CPUs with different kernel instances in multikernel mode.
The implementation provides:

- New trampoline_64_bsp.S assembly code for real-mode to long-mode
  transition when launching kernels on secondary CPUs
- Trampoline memory allocation and setup in low memory (<1MB) for
  real-mode execution compatibility
- Page table construction for identity mapping during CPU bootstrap
- Integration with existing multikernel kexec infrastructure

The trampoline handles the complete CPU initialization sequence from
16-bit real mode through 32-bit protected mode to 64-bit long mode,
setting up appropriate GDT, page tables, and control registers before
jumping to the target kernel entry point without resetting the whole
system or the running kernel.

Note: This implementation uses legacy assembly-based trampoline code
and should be migrated to the C-based x86 trampoline code in a future revision.

Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 arch/x86/kernel/Makefile            |   1 +
 arch/x86/kernel/head64.c            |   5 +
 arch/x86/kernel/setup.c             |   3 +
 arch/x86/kernel/smpboot.c           |  87 +++++++--
 arch/x86/kernel/trampoline_64_bsp.S | 288 ++++++++++++++++++++++++++++
 arch/x86/kernel/vmlinux.lds.S       |   6 +
 6 files changed, 375 insertions(+), 15 deletions(-)
 create mode 100644 arch/x86/kernel/trampoline_64_bsp.S

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 0d2a6d953be9..ac89d82bf25b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -50,6 +50,7 @@ CFLAGS_irq.o := -I $(src)/../include/asm/trace
 
 obj-y			+= head_$(BITS).o
 obj-y			+= head$(BITS).o
+obj-y			+= trampoline_64_bsp.o
 obj-y			+= ebda.o
 obj-y			+= platform-quirks.o
 obj-y			+= process_$(BITS).o signal.o signal_$(BITS).o
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 533fcf5636fc..4097101011d2 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -216,6 +216,9 @@ static void __init copy_bootdata(char *real_mode_data)
 	sme_unmap_bootdata(real_mode_data);
 }
 
+unsigned long orig_boot_params;
+EXPORT_SYMBOL(orig_boot_params);
+
 asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode_data)
 {
 	/*
@@ -285,6 +288,8 @@ asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode
 	/* set init_top_pgt kernel high mapping*/
 	init_top_pgt[511] = early_top_pgt[511];
 
+	orig_boot_params = (unsigned long) real_mode_data;
+
 	x86_64_start_reservations(real_mode_data);
 }
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 1b2edd07a3e1..8342c4e46bad 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -877,6 +877,8 @@ static void __init x86_report_nx(void)
  * Note: On x86_64, fixmaps are ready for use even before this is called.
  */
 
+extern void __init setup_trampolines_bsp(void);
+
 void __init setup_arch(char **cmdline_p)
 {
 #ifdef CONFIG_X86_32
@@ -1103,6 +1105,7 @@ void __init setup_arch(char **cmdline_p)
 			(max_pfn_mapped<<PAGE_SHIFT) - 1);
 #endif
 
+	setup_trampolines_bsp();
 	/*
 	 * Find free memory for the real mode trampoline and place it there. If
 	 * there is not enough free memory under 1M, on EFI-enabled systems
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index c2844a493ebf..df0b94612238 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -833,27 +833,48 @@ int common_cpu_up(unsigned int cpu, struct task_struct *idle)
 	return 0;
 }
 
+extern void __init setup_trampolines_bsp(void);
+extern unsigned long orig_boot_params;
+
+unsigned char *x86_trampoline_bsp_base;
+
+extern const unsigned char trampoline_data_bsp[];
+extern const unsigned char trampoline_status_bsp[];
+extern const unsigned char x86_trampoline_bsp_start [];
+extern const unsigned char x86_trampoline_bsp_end   [];
+extern unsigned long kernel_phys_addr;
+extern unsigned long boot_params_phys_addr;
+
+#define TRAMPOLINE_SYM_BSP(x)                                           \
+        ((void *)(x86_trampoline_bsp_base +                                     \
+                  ((const unsigned char *)(x) - trampoline_data_bsp)))
+
+/* Address of the SMP trampoline */
+static inline unsigned long trampoline_bsp_address(void)
+{
+        return virt_to_phys(TRAMPOLINE_SYM_BSP(trampoline_data_bsp));
+}
+
 // must be locked by cpus_read_lock()
 static int do_multikernel_boot_cpu(u32 apicid, int cpu, unsigned long kernel_start_address)
 {
-	unsigned long start_ip = real_mode_header->trampoline_start;
+	unsigned long start_ip;
 	int ret;
 
-	pr_info("do_multikernel_boot_cpu(apicid=%u, cpu=%u, kernel_start_address=%lx)\n", apicid, cpu, kernel_start_address);
-#ifdef CONFIG_X86_64
-	/* If 64-bit wakeup method exists, use the 64-bit mode trampoline IP */
-	if (apic->wakeup_secondary_cpu_64)
-		start_ip = real_mode_header->trampoline_start64;
-#endif
-	//initial_code = (unsigned long)start_secondary;
-	initial_code = (unsigned long)kernel_start_address;
+	/* Multikernel -- set the physical address where the kernel has been
+	 * copied. Note that this must be written to the location where the
+	 * trampoline was copied, not to the location within the original
+	 * kernel image itself. */
+        unsigned long *kernel_virt_addr = TRAMPOLINE_SYM_BSP(&kernel_phys_addr);
+        unsigned long *boot_params_virt_addr = TRAMPOLINE_SYM_BSP(&boot_params_phys_addr);
 
-	if (IS_ENABLED(CONFIG_X86_32)) {
-		early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
-		//initial_stack  = idle->thread.sp;
-	} else if (!(smpboot_control & STARTUP_PARALLEL_MASK)) {
-		smpboot_control = cpu;
-	}
+        *kernel_virt_addr = kernel_start_address;
+        *boot_params_virt_addr = orig_boot_params;
+
+        /* start_ip had better be page-aligned! */
+        start_ip = trampoline_bsp_address();
+
+	pr_info("do_multikernel_boot_cpu(apicid=%u, cpu=%u, kernel_start_address=%lx)\n", apicid, cpu, kernel_start_address);
 
 	/* Skip init_espfix_ap(cpu); */
 
@@ -897,6 +918,9 @@ static int do_multikernel_boot_cpu(u32 apicid, int cpu, unsigned long kernel_sta
 	/* If the wakeup mechanism failed, cleanup the warm reset vector */
 	if (ret)
 		arch_cpuhp_cleanup_kick_cpu(cpu);
+
+        /* mark "stuck" area as not stuck */
+        *(volatile u32 *)TRAMPOLINE_SYM_BSP(trampoline_status_bsp) = 0;
 	return ret;
 }
 /*
@@ -1008,6 +1032,39 @@ int multikernel_kick_ap(unsigned int cpu, unsigned long kernel_start_address)
 	return err;
 }
 
+void __init setup_trampolines_bsp(void)
+{
+        phys_addr_t mem;
+        size_t size = PAGE_ALIGN(x86_trampoline_bsp_end - x86_trampoline_bsp_start);
+
+        /* Has to be in very low memory so we can execute real-mode AP code. */
+        mem = memblock_phys_alloc_range(size, PAGE_SIZE, 0, 1<<20);
+        if (!mem)
+                panic("Cannot allocate trampoline\n");
+
+        x86_trampoline_bsp_base = __va(mem);
+        memblock_reserve(mem, size);
+
+        printk(KERN_DEBUG "Base memory trampoline BSP at [%p] %llx size %zu\n",
+               x86_trampoline_bsp_base, (unsigned long long)mem, size);
+
+        //if (!mklinux_boot) {
+                memcpy(x86_trampoline_bsp_base, trampoline_data_bsp, size);
+
+        //} else {
+        //        printk("Multikernel boot: BSP trampoline will NOT be copied\n");
+        //}
+}
+
+static int __init configure_trampolines_bsp(void)
+{
+        size_t size = PAGE_ALIGN(x86_trampoline_bsp_end - x86_trampoline_bsp_start);
+
+        set_memory_x((unsigned long)x86_trampoline_bsp_base, size >> PAGE_SHIFT);
+        return 0;
+}
+
+arch_initcall(configure_trampolines_bsp);
 
 int native_kick_ap(unsigned int cpu, struct task_struct *tidle)
 {
diff --git a/arch/x86/kernel/trampoline_64_bsp.S b/arch/x86/kernel/trampoline_64_bsp.S
new file mode 100644
index 000000000000..0bd2a971a973
--- /dev/null
+++ b/arch/x86/kernel/trampoline_64_bsp.S
@@ -0,0 +1,288 @@
+/*
+ *
+ *	Derived from Setup.S by Linus Torvalds, then derived from Popcorn Linux
+ *
+ *	4 Jan 1997 Michael Chastain: changed to gnu as.
+ *	15 Sept 2005 Eric Biederman: 64bit PIC support
+ *
+ *	Entry: CS:IP point to the start of our code, we are 
+ *	in real mode with no stack, but the rest of the 
+ *	trampoline page to make our stack and everything else
+ *	is a mystery.
+ *
+ *	On entry to trampoline_data, the processor is in real mode
+ *	with 16-bit addressing and 16-bit data.  CS has some value
+ *	and IP is zero.  Thus, data addresses need to be absolute
+ *	(no relocation) and are taken with regard to r_base.
+ *
+ *	With the addition of trampoline_level4_pgt this code can
+ *	now enter a 64bit kernel that lives at arbitrary 64bit
+ *	physical addresses.
+ *
+ *	If you work on this file, check the object module with objdump
+ *	--full-contents --reloc to make sure there are no relocation
+ *	entries.
+ */
+
+#include <linux/linkage.h>
+#include <linux/init.h>
+#include <asm/pgtable_types.h>
+#include <asm/page_types.h>
+#include <asm/msr.h>
+#include <asm/segment.h>
+#include <asm/processor-flags.h>
+
+	.section ".x86_trampoline_bsp","a"
+	.balign PAGE_SIZE
+	.code16
+
+SYM_CODE_START(trampoline_data_bsp)
+bsp_base = .
+	cli			# We should be safe anyway
+	wbinvd
+	mov	%cs, %ax	# Code and data in the same place
+	mov	%ax, %ds
+	mov	%ax, %es
+	mov	%ax, %ss
+
+
+	movl	$0xA5A5A5A5, trampoline_status_bsp - bsp_base
+				# write marker so the master knows we're running
+
+					# Setup stack
+	movw	$(trampoline_stack_bsp_end - bsp_base), %sp
+
+	# call	verify_cpu		# Verify the cpu supports long mode
+	# testl   %eax, %eax		# Check for return code
+	# jnz	no_longmode_bsp
+
+	mov	%cs, %ax
+	movzx	%ax, %esi		# Find the 32bit trampoline location
+	shll	$4, %esi
+
+					# Fixup the absolute vectors
+	leal	(startup_32_bsp - bsp_base)(%esi), %eax
+	movl	%eax, startup_32_vector_bsp - bsp_base
+	leal	(startup_64_bsp - bsp_base)(%esi), %eax
+	movl	%eax, startup_64_vector_bsp - bsp_base
+	leal	(tgdt_bsp - bsp_base)(%esi), %eax
+	movl	%eax, (tgdt_bsp + 2 - bsp_base)
+
+	/*
+	 * GDT tables in non default location kernel can be beyond 16MB and
+	 * lgdt will not be able to load the address as in real mode default
+	 * operand size is 16bit. Use lgdtl instead to force operand size
+	 * to 32 bit.
+	 */
+
+	lidtl	tidt_bsp - bsp_base	# load idt with 0, 0
+	lgdtl	tgdt_bsp - bsp_base	# load gdt with whatever is appropriate
+
+	mov	$X86_CR0_PE, %ax	# protected mode (PE) bit
+	lmsw	%ax			# into protected mode
+
+	# flush prefetch and jump to startup_32
+	ljmpl	*(startup_32_vector_bsp - bsp_base)
+SYM_CODE_END(trampoline_data_bsp)
+
+	.code32
+	.balign 4
+startup_32_bsp:
+
+	cli
+        movl    $(__KERNEL_DS), %eax
+        movl    %eax, %ds
+        movl    %eax, %es
+        movl    %eax, %ss
+
+	/* Load new GDT with the 64bit segments using 32bit descriptor.
+	 * The new GDT labels the entire address space as 64-bit, so we
+	 * can switch into long mode later. */
+        leal    (gdt_bsp_64 - bsp_base)(%esi), %eax
+        movl    %eax, (gdt_bsp_64 - bsp_base + 2)(%esi)
+        lgdt    (gdt_bsp_64 - bsp_base)(%esi)
+
+	/* Enable PAE mode.  Note that this does not actually take effect
+	 * until paging is enabled */
+	movl	%cr4, %eax
+        orl     $(X86_CR4_PAE), %eax
+        movl    %eax, %cr4
+
+        /* Initialize Page tables to 0 */
+	leal    (pgtable_bsp - bsp_base)(%esi), %edi
+	xorl    %eax, %eax
+        movl    $((4096*6)/4), %ecx
+        rep     stosl
+
+        /* Build Level 4 */
+        leal    (pgtable_bsp - bsp_base)(%esi), %edi
+        leal    0x1007 (%edi), %eax
+        movl    %eax, 0(%edi)
+
+        /* Build Level 3 */
+        leal    (pgtable_bsp - bsp_base + 0x1000)(%esi), %edi
+        leal    0x1007(%edi), %eax
+        movl    $4, %ecx
+1:      movl    %eax, 0x00(%edi)
+        addl    $0x00001000, %eax
+        addl    $8, %edi
+        decl    %ecx
+        jnz     1b
+
+        /* Build Level 2 */
+        leal    (pgtable_bsp - bsp_base + 0x2000)(%esi), %edi
+        movl    $0x00000183, %eax
+        movl    $2048, %ecx
+1:      movl    %eax, 0(%edi)
+        addl    $0x00200000, %eax
+        addl    $8, %edi
+        decl    %ecx
+        jnz     1b
+
+        /* Enable the boot page tables */
+        leal    (pgtable_bsp - bsp_base)(%esi), %eax
+        movl    %eax, %cr3
+
+        /* Enable Long mode in EFER (Extended Feature Enable Register) */
+        movl    $MSR_EFER, %ecx
+        rdmsr
+        btsl    $_EFER_LME, %eax
+        wrmsr
+
+        /*
+         * Setup for the jump to 64bit mode
+         *
+         * When the jump is performed we will be in long mode but
+         * in 32bit compatibility mode with EFER.LME = 1, CS.L = 0, CS.D = 1
+         * (and in turn EFER.LMA = 1).  To jump into 64bit mode we use
+         * the new gdt/idt that has __KERNEL_CS with CS.L = 1.
+         * We place all of the values on our mini stack so lret can be
+         * used to perform that far jump.
+         */
+        pushl   $__KERNEL_CS
+        leal    (startup_64_bsp - bsp_base)(%esi), %eax
+        pushl   %eax
+
+	/* Enter paged protected Mode, activating Long Mode */
+        movl    $(X86_CR0_PG | X86_CR0_PE), %eax /* Enable Paging and Protected mode */
+        movl    %eax, %cr0
+
+	/* Jump from 32bit compatibility mode into 64bit mode. */
+        lret
+
+	.code64
+	.balign 4
+startup_64_bsp:
+
+	/* Get physical address of boot_params structure */
+	movq    (boot_params_phys_addr - bsp_base)(%rsi), %r15
+
+	/* Load kernel address into register */
+	movq    (kernel_phys_addr - bsp_base)(%rsi), %r14
+
+	/* Check whether the kernel is in the 4 GB we mapped already,
+	 * and if not, add an additional mapping */
+	movq	$0xffffffff00000000, %r8
+	testq	%r8, %r14
+	je	2f
+
+	/* If we got here, we need to identity-map an additional 1 GB */
+	
+	/* Mask off to figure out what our directory pointer should be */
+	movq	%r14, %r13
+	movq	$0xffffffffc0000000, %r12
+	andq	%r12, %r13
+
+	/* Set our PDPTE */
+	movq	%r13, %r11
+	shrq	$(30-3), %r11
+	leaq    (pgtable_bsp - bsp_base + 0x1000)(%rsi), %rdi
+	addq	%r11, %rdi
+	leaq	(pgtable_extra_bsp - bsp_base + 0x7)(%rsi), %rax
+	movq	%rax, 0(%rdi)
+
+	/* Populate the page directory */
+	leaq    (pgtable_extra_bsp - bsp_base)(%rsi), %rdi
+	movq    $0x00000183, %rax
+	addq	%r13, %rax
+	movq    $512, %rcx
+1:      movq    %rax, 0(%rdi)
+	addq    $0x00200000, %rax
+	addq    $8, %rdi
+	decq    %rcx
+	jnz     1b
+
+	/* Set esi to point to the boot_params structure */
+2:	movq	%r15, %rsi
+	jmp	*%r14
+
+	.align 8
+SYM_DATA(boot_params_phys_addr, .quad  0)
+
+	.align 8
+SYM_DATA(kernel_phys_addr, .quad  0)
+
+	.code16
+	.balign 4
+	# Careful these need to be in the same 64K segment as the above;
+tidt_bsp:
+	.word	0			# idt limit = 0
+	.word	0, 0			# idt base = 0L
+
+	# Duplicate the global descriptor table
+	# so the kernel can live anywhere
+	.balign 4
+tgdt_bsp:
+	.short	tgdt_bsp_end - tgdt_bsp		# gdt limit
+	.long	tgdt_bsp - bsp_base
+	.short 0
+	.quad	0x00cf9b000000ffff	# __KERNEL32_CS
+	.quad	0x00af9b000000ffff	# __KERNEL_CS
+	.quad	0x00cf93000000ffff	# __KERNEL_DS
+tgdt_bsp_end:
+
+	.code64
+	.balign 4
+gdt_bsp_64:
+        .word   gdt_bsp_64_end - gdt_bsp_64
+        .long   gdt_bsp_64 - bsp_base
+        .word   0
+        .quad   0x0000000000000000      /* NULL descriptor */
+        .quad   0x00af9a000000ffff      /* __KERNEL_CS */
+        .quad   0x00cf92000000ffff      /* __KERNEL_DS */
+        .quad   0x0080890000000000      /* TS descriptor */
+        .quad   0x0000000000000000      /* TS continued */
+gdt_bsp_64_end:
+
+	.code16
+	.balign 4
+startup_32_vector_bsp:
+	.long	startup_32_bsp - bsp_base
+	.word	__KERNEL32_CS, 0
+
+	.balign 4
+startup_64_vector_bsp:
+	.long	startup_64_bsp - bsp_base
+	.word	__KERNEL_CS, 0
+
+	.balign 4
+SYM_DATA(trampoline_status_bsp, .long	0)
+
+	.balign 4
+SYM_DATA(trampoline_location, .quad   0)
+
+trampoline_stack_bsp:
+	.fill 512,8,0
+trampoline_stack_bsp_end:
+
+SYM_DATA(trampoline_bsp_end)
+
+/*
+ * Space for page tables (not in .bss so not zeroed)
+ */
+        .balign 4096
+pgtable_bsp:
+        .fill 6*4096, 1, 0
+pgtable_extra_bsp:
+	.fill 1*4096, 1, 0
+
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 4fa0be732af1..86f4fd37dc18 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -231,6 +231,12 @@ SECTIONS
 
 	INIT_DATA_SECTION(16)
 
+         .x86_trampoline_bsp : AT(ADDR(.x86_trampoline_bsp) - LOAD_OFFSET) {
+                x86_trampoline_bsp_start = .;
+                *(.x86_trampoline_bsp)
+                x86_trampoline_bsp_end = .;
+        }
+
 	.x86_cpu_dev.init : AT(ADDR(.x86_cpu_dev.init) - LOAD_OFFSET) {
 		__x86_cpu_dev_start = .;
 		*(.x86_cpu_dev.init)
-- 
2.34.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC Patch 3/7] x86: Introduce MULTIKERNEL_VECTOR for inter-kernel communication
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
  2025-09-18 22:26 ` [RFC Patch 1/7] kexec: Introduce multikernel support via kexec Cong Wang
  2025-09-18 22:26 ` [RFC Patch 2/7] x86: Introduce SMP INIT trampoline for multikernel CPU bootstrap Cong Wang
@ 2025-09-18 22:26 ` Cong Wang
  2025-09-18 22:26 ` [RFC Patch 4/7] kernel: Introduce generic multikernel IPI communication framework Cong Wang
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Cong Wang @ 2025-09-18 22:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: pasha.tatashin, Cong Wang, Andrew Morton, Baoquan He,
	Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec, linux-mm

From: Cong Wang <cwang@multikernel.io>

This patch adds a dedicated IPI vector (0xea) for multikernel
communication, enabling different kernel instances running on
separate CPUs to send interrupts to each other.

The implementation includes:

- MULTIKERNEL_VECTOR definition at interrupt vector 0xea
- IDT entry declaration and registration for sysvec_multikernel
- Interrupt handler sysvec_multikernel() with proper APIC EOI
  and IRQ statistics tracking
- Placeholder generic_multikernel_interrupt() function for
  extensible multikernel interrupt handling

This vector provides the foundational interrupt mechanism required
for implementing inter-kernel communication protocols in multikernel
environments, where heterogeneous kernel instances coordinate while
maintaining CPU-level isolation.

Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 arch/x86/include/asm/idtentry.h    |  1 +
 arch/x86/include/asm/irq_vectors.h |  1 +
 arch/x86/kernel/idt.c              |  1 +
 arch/x86/kernel/smp.c              | 12 ++++++++++++
 4 files changed, 15 insertions(+)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index a4ec27c67988..219ee36def33 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -708,6 +708,7 @@ DECLARE_IDTENTRY(RESCHEDULE_VECTOR,			sysvec_reschedule_ipi);
 DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR,			sysvec_reboot);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR,	sysvec_call_function_single);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_VECTOR,		sysvec_call_function);
+DECLARE_IDTENTRY_SYSVEC(MULTIKERNEL_VECTOR,			sysvec_multikernel);
 #else
 # define fred_sysvec_reschedule_ipi			NULL
 # define fred_sysvec_reboot				NULL
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 47051871b436..478e2e2d188a 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -102,6 +102,7 @@
  * the host kernel.
  */
 #define POSTED_MSI_NOTIFICATION_VECTOR	0xeb
+#define MULTIKERNEL_VECTOR		0xea
 
 #define NR_VECTORS			 256
 
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index f445bec516a0..063b330d9fbf 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -135,6 +135,7 @@ static const __initconst struct idt_data apic_idts[] = {
 	INTG(RESCHEDULE_VECTOR,			asm_sysvec_reschedule_ipi),
 	INTG(CALL_FUNCTION_VECTOR,		asm_sysvec_call_function),
 	INTG(CALL_FUNCTION_SINGLE_VECTOR,	asm_sysvec_call_function_single),
+	INTG(MULTIKERNEL_VECTOR,		asm_sysvec_multikernel),
 	INTG(REBOOT_VECTOR,			asm_sysvec_reboot),
 #endif
 
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index b014e6d229f9..028cc423a772 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -272,6 +272,18 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_call_function_single)
 	trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
 }
 
+static void generic_multikernel_interrupt(void)
+{
+	pr_info("Multikernel interrupt\n");
+}
+
+DEFINE_IDTENTRY_SYSVEC(sysvec_multikernel)
+{
+	apic_eoi();
+	inc_irq_stat(irq_call_count);
+	generic_multikernel_interrupt();
+}
+
 static int __init nonmi_ipi_setup(char *str)
 {
 	smp_no_nmi_ipi = true;
-- 
2.34.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC Patch 4/7] kernel: Introduce generic multikernel IPI communication framework
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
                   ` (2 preceding siblings ...)
  2025-09-18 22:26 ` [RFC Patch 3/7] x86: Introduce MULTIKERNEL_VECTOR for inter-kernel communication Cong Wang
@ 2025-09-18 22:26 ` Cong Wang
  2025-09-18 22:26 ` [RFC Patch 5/7] x86: Introduce arch_cpu_physical_id() to obtain physical CPU ID Cong Wang
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Cong Wang @ 2025-09-18 22:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: pasha.tatashin, Cong Wang, Andrew Morton, Baoquan He,
	Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec, linux-mm

From: Cong Wang <cwang@multikernel.io>

This patch implements a comprehensive IPI-based communication system
for multikernel environments, enabling data exchange between different
kernel instances running on separate CPUs.

Key features include:

- Generic IPI handler registration and callback mechanism allowing
  modules to register for multikernel communication events
- Shared memory infrastructure using either boot parameter-specified
  or dynamically allocated physical memory regions
- Per-CPU data buffers in shared memory for efficient IPI payload
  transfers of up to 256 bytes per message
- IRQ work integration for safe callback execution in interrupt context
- PFN-based flexible shared memory APIs for page-level data sharing
- Resource tracking integration for /proc/iomem visibility

The implementation provides multikernel_send_ipi_data() for sending
typed data to target CPUs and multikernel_register_handler() for
receiving notifications. Shared memory is established during early
boot and mapped using memremap() for cache-coherent access.

This infrastructure enables heterogeneous computing scenarios where
multikernel instances can coordinate and share data while maintaining
isolation on their respective CPU cores.

Note that, as a proof of concept, only the x86 part has been implemented so far.
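
For illustration, a rough sketch of how a kernel-side user might consume this
framework is included below. It is not part of the patch: the message type
value, the "hello" payload, and the target CPU are made-up placeholders; only
the multikernel_*() calls and struct mk_ipi_data come from this series.

#include <linux/module.h>
#include <linux/multikernel.h>

#define MK_EXAMPLE_TYPE	0x1001	/* arbitrary type id for this example */

static struct mk_ipi_handler *example_handler;

/* Invoked from irq_work context after the multikernel IPI is received. */
static void example_ipi_cb(struct mk_ipi_data *data, void *ctx)
{
	if (data->type != MK_EXAMPLE_TYPE)
		return;
	pr_info("mk example: %zu bytes from CPU %d\n",
		data->data_size, data->sender_cpu);
}

static int __init mk_example_init(void)
{
	char msg[] = "hello";

	example_handler = multikernel_register_handler(example_ipi_cb, NULL);
	if (!example_handler)
		return -ENOMEM;

	/* CPU 1 is a placeholder; it is assumed to run the peer kernel. */
	return multikernel_send_ipi_data(1, msg, sizeof(msg), MK_EXAMPLE_TYPE);
}

static void __exit mk_example_exit(void)
{
	multikernel_unregister_handler(example_handler);
}

module_init(mk_example_init);
module_exit(mk_example_exit);
MODULE_LICENSE("GPL");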

Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 arch/x86/kernel/smp.c       |   5 +-
 include/linux/multikernel.h |  81 ++++++++++
 init/main.c                 |   2 +
 kernel/Makefile             |   2 +-
 kernel/multikernel.c        | 313 ++++++++++++++++++++++++++++++++++++
 5 files changed, 398 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/multikernel.h
 create mode 100644 kernel/multikernel.c

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 028cc423a772..3ee515e32383 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -272,10 +272,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_call_function_single)
 	trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
 }
 
-static void generic_multikernel_interrupt(void)
-{
-	pr_info("Multikernel interrupt\n");
-}
+void generic_multikernel_interrupt(void);
 
 DEFINE_IDTENTRY_SYSVEC(sysvec_multikernel)
 {
diff --git a/include/linux/multikernel.h b/include/linux/multikernel.h
new file mode 100644
index 000000000000..12ed5e03f92e
--- /dev/null
+++ b/include/linux/multikernel.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2025 Multikernel Technologies, Inc. All rights reserved
+ */
+#ifndef _LINUX_MULTIKERNEL_H
+#define _LINUX_MULTIKERNEL_H
+
+#include <linux/types.h>
+#include <linux/irq_work.h>
+
+/**
+ * Multikernel IPI interface
+ *
+ * This header provides declarations for the multikernel IPI interface,
+ * allowing modules to register callbacks for IPI events and pass data
+ * between CPUs.
+ */
+
+/* Maximum data size that can be transferred via IPI */
+#define MK_MAX_DATA_SIZE 256
+
+/* Data structure for passing parameters via IPI */
+struct mk_ipi_data {
+	int sender_cpu;          /* Which CPU sent this IPI */
+	unsigned int type;      /* User-defined type identifier */
+	size_t data_size;        /* Size of the data */
+	char buffer[MK_MAX_DATA_SIZE]; /* Actual data buffer */
+};
+
+/* Function pointer type for IPI callbacks */
+typedef void (*mk_ipi_callback_t)(struct mk_ipi_data *data, void *ctx);
+
+struct mk_ipi_handler {
+	mk_ipi_callback_t callback;
+	void *context;
+	struct mk_ipi_handler *next;
+	struct mk_ipi_data *saved_data;
+	struct irq_work work;
+};
+
+/**
+ * multikernel_register_handler - Register a callback for multikernel IPI
+ * @callback: Function to call when IPI is received
+ * @ctx: Context pointer passed to the callback
+ *
+ * Returns pointer to handler on success, NULL on failure
+ */
+struct mk_ipi_handler *multikernel_register_handler(mk_ipi_callback_t callback, void *ctx);
+
+/**
+ * multikernel_unregister_handler - Unregister a multikernel IPI callback
+ * @handler: Handler pointer returned from multikernel_register_handler
+ */
+void multikernel_unregister_handler(struct mk_ipi_handler *handler);
+
+/**
+ * multikernel_send_ipi_data - Send data to another CPU via IPI
+ * @cpu: Target CPU
+ * @data: Pointer to data to send
+ * @data_size: Size of data
+ * @type: User-defined type identifier
+ *
+ * This function copies the data to per-CPU storage and sends an IPI
+ * to the target CPU.
+ *
+ * Returns 0 on success, negative error code on failure
+ */
+int multikernel_send_ipi_data(int cpu, void *data, size_t data_size, unsigned long type);
+
+void generic_multikernel_interrupt(void);
+
+int __init multikernel_init(void);
+
+/* Flexible shared memory APIs (PFN-based) */
+int mk_send_pfn(int target_cpu, unsigned long pfn);
+int mk_receive_pfn(struct mk_ipi_data *data, unsigned long *out_pfn);
+void *mk_receive_map_page(struct mk_ipi_data *data);
+
+#define mk_receive_unmap_page(p) memunmap(p)
+
+#endif /* _LINUX_MULTIKERNEL_H */
diff --git a/init/main.c b/init/main.c
index 5753e9539ae6..46a199bcb389 100644
--- a/init/main.c
+++ b/init/main.c
@@ -103,6 +103,7 @@
 #include <linux/randomize_kstack.h>
 #include <linux/pidfs.h>
 #include <linux/ptdump.h>
+#include <linux/multikernel.h>
 #include <net/net_namespace.h>
 
 #include <asm/io.h>
@@ -955,6 +956,7 @@ void start_kernel(void)
 	vfs_caches_init_early();
 	sort_main_extable();
 	trap_init();
+	multikernel_init();
 	mm_core_init();
 	maple_tree_init();
 	poking_init();
diff --git a/kernel/Makefile b/kernel/Makefile
index c60623448235..e5216610a4e7 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y     = fork.o exec_domain.o panic.o \
 	    extable.o params.o \
 	    kthread.o sys_ni.o nsproxy.o \
 	    notifier.o ksysfs.o cred.o reboot.o \
-	    async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
+	    async.o range.o smpboot.o ucount.o regset.o ksyms_common.o multikernel.o
 
 obj-$(CONFIG_MULTIUSER) += groups.o
 obj-$(CONFIG_VHOST_TASK) += vhost_task.o
diff --git a/kernel/multikernel.c b/kernel/multikernel.c
new file mode 100644
index 000000000000..74e2f84b7914
--- /dev/null
+++ b/kernel/multikernel.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2025 Multikernel Technologies, Inc. All rights reserved
+ */
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/smp.h>
+#include <linux/percpu.h>
+#include <linux/spinlock.h>
+#include <linux/multikernel.h>
+#include <linux/io.h>
+#include <linux/ioport.h>
+#include <asm/apic.h>
+#include <linux/memblock.h>
+
+/* Memory parameters for shared region */
+#define MK_IPI_DATA_SIZE  (sizeof(struct mk_ipi_data) * NR_CPUS)
+#define MK_MEM_BASE_SIZE  (sizeof(struct mk_shared_data))
+#define MK_MEM_SIZE       (MK_MEM_BASE_SIZE + PAGE_SIZE)
+
+/* Boot parameter for physical address */
+static unsigned long mk_phys_addr_param;
+
+/* Parse multikernel physical address from kernel command line */
+static int __init multikernel_phys_addr_setup(char *str)
+{
+	return kstrtoul(str, 0, &mk_phys_addr_param);
+}
+early_param("mk_shared_memory", multikernel_phys_addr_setup);
+
+/* Allocated/assigned physical address for shared memory */
+static phys_addr_t mk_phys_addr_base;
+
+/* Resource structure for tracking the memory in /proc/iomem */
+static struct resource mk_mem_res __ro_after_init = {
+	.name = "Multikernel Shared Memory",
+	.flags = IORESOURCE_MEM | IORESOURCE_BUSY,
+};
+
+/* Shared memory structures */
+struct mk_shared_data {
+	struct mk_ipi_data cpu_data[NR_CPUS];  /* Data area for each CPU */
+};
+
+/* Pointer to the shared memory area (remapped virtual address) */
+static struct mk_shared_data *mk_shared_mem;
+
+/* Callback management */
+static struct mk_ipi_handler *mk_handlers;
+static raw_spinlock_t mk_handlers_lock = __RAW_SPIN_LOCK_UNLOCKED(mk_handlers_lock);
+
+static void handler_work(struct irq_work *work)
+{
+	struct mk_ipi_handler *handler = container_of(work, struct mk_ipi_handler, work);
+	if (handler->callback)
+		handler->callback(handler->saved_data, handler->context);
+}
+
+/**
+ * multikernel_register_handler - Register a callback for multikernel IPI
+ * @callback: Function to call when IPI is received
+ * @ctx: Context pointer passed to the callback
+ *
+ * Returns pointer to handler on success, NULL on failure
+ */
+struct mk_ipi_handler *multikernel_register_handler(mk_ipi_callback_t callback, void *ctx)
+{
+	struct mk_ipi_handler *handler;
+	unsigned long flags;
+
+	if (!callback)
+		return NULL;
+
+	handler = kzalloc(sizeof(*handler), GFP_KERNEL);
+	if (!handler)
+		return NULL;
+
+	handler->callback = callback;
+	handler->context = ctx;
+
+	init_irq_work(&handler->work, handler_work);
+
+	raw_spin_lock_irqsave(&mk_handlers_lock, flags);
+	handler->next = mk_handlers;
+	mk_handlers = handler;
+	raw_spin_unlock_irqrestore(&mk_handlers_lock, flags);
+
+	return handler;
+}
+EXPORT_SYMBOL(multikernel_register_handler);
+
+/**
+ * multikernel_unregister_handler - Unregister a multikernel IPI callback
+ * @handler: Handler pointer returned from multikernel_register_handler
+ */
+void multikernel_unregister_handler(struct mk_ipi_handler *handler)
+{
+	struct mk_ipi_handler **pp, *p;
+	unsigned long flags;
+
+	if (!handler)
+		return;
+
+	raw_spin_lock_irqsave(&mk_handlers_lock, flags);
+	pp = &mk_handlers;
+	while ((p = *pp) != NULL) {
+		if (p == handler) {
+			*pp = p->next;
+			break;
+		}
+		pp = &p->next;
+	}
+	raw_spin_unlock_irqrestore(&mk_handlers_lock, flags);
+
+	/* Wait for pending work to complete */
+	irq_work_sync(&handler->work);
+	kfree(p);
+}
+EXPORT_SYMBOL(multikernel_unregister_handler);
+
+/**
+ * multikernel_send_ipi_data - Send data to another CPU via IPI
+ * @cpu: Target CPU
+ * @data: Pointer to data to send
+ * @data_size: Size of data
+ * @type: User-defined type identifier
+ *
+ * This function copies the data to per-CPU storage and sends an IPI
+ * to the target CPU.
+ *
+ * Returns 0 on success, negative error code on failure
+ */
+int multikernel_send_ipi_data(int cpu, void *data, size_t data_size, unsigned long type)
+{
+	struct mk_ipi_data *target;
+
+	if (cpu < 0 || cpu >= nr_cpu_ids)
+		return -EINVAL;
+
+	if (data_size > MK_MAX_DATA_SIZE)
+		return -EINVAL;  /* Data too large for buffer */
+
+	/* Ensure shared memory is initialized */
+	if (!mk_shared_mem)
+		return -ENOMEM;
+
+	/* Get target CPU's data area from shared memory */
+	target = &mk_shared_mem->cpu_data[cpu];
+
+	/* Set header information */
+	target->data_size = data_size;
+	target->sender_cpu = smp_processor_id();
+	target->type = type;
+
+	/* Copy the actual data into the buffer */
+	if (data && data_size > 0)
+		memcpy(target->buffer, data, data_size);
+
+	/* Send IPI to target CPU */
+	__apic_send_IPI(cpu, MULTIKERNEL_VECTOR);
+
+	return 0;
+}
+EXPORT_SYMBOL(multikernel_send_ipi_data);
+
+/**
+ * multikernel_interrupt_handler - Handle the multikernel IPI
+ *
+ * This function is called when a multikernel IPI is received.
+ * It invokes all registered callbacks with the per-CPU data.
+ */
+static void multikernel_interrupt_handler(void)
+{
+	struct mk_ipi_data *data;
+	struct mk_ipi_handler *handler;
+	int current_cpu = smp_processor_id();
+
+	/* Ensure shared memory is initialized */
+	if (!mk_shared_mem) {
+		pr_err("Multikernel IPI received but shared memory not initialized\n");
+		return;
+	}
+
+	/* Get this CPU's data area from shared memory */
+	data = &mk_shared_mem->cpu_data[current_cpu];
+
+	pr_debug("Multikernel IPI received on CPU %d from CPU %d, type=%u\n",
+		 current_cpu, data->sender_cpu, data->type);
+
+	raw_spin_lock(&mk_handlers_lock);
+	for (handler = mk_handlers; handler; handler = handler->next) {
+		handler->saved_data = data;
+		irq_work_queue(&handler->work);
+	}
+	raw_spin_unlock(&mk_handlers_lock);
+}
+
+/**
+ * Generic multikernel interrupt handler - called by the IPI vector
+ *
+ * This is the function that gets called by the IPI vector handler.
+ */
+void generic_multikernel_interrupt(void)
+{
+	multikernel_interrupt_handler();
+}
+
+/**
+ * setup_shared_memory - Initialize shared memory for inter-kernel communication
+ *
+ * Maps a fixed physical memory region for sharing IPI data between kernels
+ * Returns 0 on success, negative error code on failure
+ */
+static int __init setup_shared_memory(void)
+{
+	/* Check if a fixed physical address was provided via parameter */
+	if (mk_phys_addr_param) {
+		/* Use the provided physical address */
+		mk_phys_addr_base = (phys_addr_t)mk_phys_addr_param;
+		pr_info("Using specified physical address 0x%llx for multikernel shared memory\n",
+		       (unsigned long long)mk_phys_addr_base);
+	} else {
+		/* Dynamically allocate contiguous physical memory using memblock */
+		mk_phys_addr_base = memblock_phys_alloc(MK_MEM_SIZE, PAGE_SIZE);
+		if (!mk_phys_addr_base) {
+			pr_err("Failed to allocate physical memory for multikernel IPI data\n");
+			return -ENOMEM;
+		}
+	}
+
+	/* Map the physical memory region to virtual address space */
+	mk_shared_mem = memremap(mk_phys_addr_base, MK_MEM_SIZE, MEMREMAP_WB);
+	if (!mk_shared_mem) {
+		pr_err("Failed to map shared memory at 0x%llx for multikernel IPI data\n",
+		       (unsigned long long)mk_phys_addr_base);
+
+		/* Only free the memory if we allocated it dynamically */
+		if (!mk_phys_addr_param)
+			memblock_phys_free(mk_phys_addr_base, MK_MEM_SIZE);
+		return -ENOMEM;
+	}
+
+	/* Initialize the memory to zero */
+	memset(mk_shared_mem, 0, sizeof(struct mk_shared_data));
+
+	pr_info("Allocated and mapped multikernel shared memory: phys=0x%llx, virt=%px, size=%lu bytes\n",
+		(unsigned long long)mk_phys_addr_base, mk_shared_mem, MK_MEM_SIZE);
+
+	return 0;
+}
+
+int __init multikernel_init(void)
+{
+	int ret;
+
+	ret = setup_shared_memory();
+	if (ret < 0)
+		return ret;
+
+	pr_info("Multikernel IPI support initialized\n");
+	return 0;
+}
+
+static int __init init_shared_memory(void)
+{
+	/* Set up resource structure for /proc/iomem visibility */
+	mk_mem_res.start = mk_phys_addr_base;
+	mk_mem_res.end = mk_phys_addr_base + MK_MEM_SIZE - 1;
+
+	/* Register the resource in the global resource tree */
+	if (insert_resource(&iomem_resource, &mk_mem_res)) {
+		pr_warn("Could not register multikernel shared memory region in resource tracking\n");
+		/* Not fatal: the system just loses the /proc/iomem entry */
+		return -1;
+	}
+
+	pr_info("Registered multikernel shared memory in resource tree: 0x%llx-0x%llx\n",
+		(unsigned long long)mk_mem_res.start, (unsigned long long)mk_mem_res.end);
+	return 0;
+}
+core_initcall(init_shared_memory);
+
+/* ---- Flexible shared memory APIs (PFN-based) ---- */
+#define MK_PFN_IPI_TYPE 0x80000001U
+
+/* Send a PFN to another kernel via mk_ipi_data */
+int mk_send_pfn(int target_cpu, unsigned long pfn)
+{
+	return multikernel_send_ipi_data(target_cpu, &pfn, sizeof(pfn), MK_PFN_IPI_TYPE);
+}
+
+/* Receive a PFN from mk_ipi_data. Caller must check type. */
+int mk_receive_pfn(struct mk_ipi_data *data, unsigned long *out_pfn)
+{
+	if (!data || !out_pfn)
+		return -EINVAL;
+	if (data->type != MK_PFN_IPI_TYPE || data->data_size != sizeof(unsigned long))
+		return -EINVAL;
+	*out_pfn = *(unsigned long *)data->buffer;
+	return 0;
+}
+
+void *mk_receive_map_page(struct mk_ipi_data *data)
+{
+	unsigned long pfn;
+	int ret;
+
+	ret = mk_receive_pfn(data, &pfn);
+	if (ret < 0)
+		return NULL;
+	return memremap(pfn << PAGE_SHIFT, PAGE_SIZE, MEMREMAP_WB);
+}
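
For illustration, a minimal usage sketch of the PFN-based API above
(my_pfn_callback, buf and the target CPU are hypothetical, the callback
signature is assumed for brevity, and error handling is trimmed):

	/* Receive path: invoked with this CPU's mk_ipi_data */
	static void my_pfn_callback(struct mk_ipi_data *data)
	{
		/* mk_receive_map_page() validates type/size and maps the page */
		void *page = mk_receive_map_page(data);

		if (!page)
			return;
		pr_info("first byte from peer: %02x\n", *(u8 *)page);
		memunmap(page);
	}

	/* Send path: share the page backing 'buf' with physical CPU 4 */
	mk_send_pfn(4, page_to_pfn(virt_to_page(buf)));
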
-- 
2.34.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC Patch 5/7] x86: Introduce arch_cpu_physical_id() to obtain physical CPU ID
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
                   ` (3 preceding siblings ...)
  2025-09-18 22:26 ` [RFC Patch 4/7] kernel: Introduce generic multikernel IPI communication framework Cong Wang
@ 2025-09-18 22:26 ` Cong Wang
  2025-09-18 22:26 ` [RFC Patch 6/7] kexec: Implement dynamic kimage tracking Cong Wang
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Cong Wang @ 2025-09-18 22:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: pasha.tatashin, Cong Wang, Andrew Morton, Baoquan He,
	Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec, linux-mm

From: Cong Wang <cwang@multikernel.io>

The traditional smp_processor_id() is a software-defined CPU ID which
is only unique within the same kernel. With the multikernel architecture,
we run multiple Linux kernels on different CPUs, hence the host kernel
needs a globally unique CPU ID to manage them. The physical CPU ID is a
natural fit for this purpose.

This API will be used to globally distinguish CPUs among different
multikernel instances.
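
For illustration (x86 only; the APIC IDs below are hypothetical): the
physical ID is the APIC ID, so two kernel instances that each ask about
"their" logical CPU 0 get different, globally unique answers:

	/* kernel A, running on APIC IDs 0-3 */
	arch_cpu_physical_id(0);	/* -> 0 */

	/* kernel B, spawned on APIC IDs 4-7 */
	arch_cpu_physical_id(0);	/* -> 4 */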

Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 arch/x86/include/asm/smp.h | 6 ++++++
 arch/x86/kernel/smp.c      | 6 ++++++
 kernel/multikernel.c       | 9 +++++----
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 1a59fd0de759..378be65ceafa 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -40,6 +40,7 @@ struct smp_ops {
 
 	void (*send_call_func_ipi)(const struct cpumask *mask);
 	void (*send_call_func_single_ipi)(int cpu);
+	int (*cpu_physical_id)(int cpu);
 };
 
 /* Globals due to paravirt */
@@ -100,6 +101,11 @@ static inline void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 	smp_ops.send_call_func_ipi(mask);
 }
 
+static inline int arch_cpu_physical_id(int cpu)
+{
+	return smp_ops.cpu_physical_id(cpu);
+}
+
 void cpu_disable_common(void);
 void native_smp_prepare_boot_cpu(void);
 void smp_prepare_cpus_common(void);
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 3ee515e32383..face9f80e05c 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -289,6 +289,11 @@ static int __init nonmi_ipi_setup(char *str)
 
 __setup("nonmi_ipi", nonmi_ipi_setup);
 
+static int native_cpu_physical_id(int cpu)
+{
+	return cpu_physical_id(cpu);
+}
+
 struct smp_ops smp_ops = {
 	.smp_prepare_boot_cpu	= native_smp_prepare_boot_cpu,
 	.smp_prepare_cpus	= native_smp_prepare_cpus,
@@ -306,6 +311,7 @@ struct smp_ops smp_ops = {
 
 	.send_call_func_ipi	= native_send_call_func_ipi,
 	.send_call_func_single_ipi = native_send_call_func_single_ipi,
+	.cpu_physical_id	= native_cpu_physical_id,
 };
 EXPORT_SYMBOL_GPL(smp_ops);
 
diff --git a/kernel/multikernel.c b/kernel/multikernel.c
index 74e2f84b7914..7f6f90485876 100644
--- a/kernel/multikernel.c
+++ b/kernel/multikernel.c
@@ -150,7 +150,7 @@ int multikernel_send_ipi_data(int cpu, void *data, size_t data_size, unsigned lo
 
 	/* Set header information */
 	target->data_size = data_size;
-	target->sender_cpu = smp_processor_id();
+	target->sender_cpu = arch_cpu_physical_id(smp_processor_id());
 	target->type = type;
 
 	/* Copy the actual data into the buffer */
@@ -175,6 +175,7 @@ static void multikernel_interrupt_handler(void)
 	struct mk_ipi_data *data;
 	struct mk_ipi_handler *handler;
 	int current_cpu = smp_processor_id();
+	int current_physical_id = arch_cpu_physical_id(current_cpu);
 
 	/* Ensure shared memory is initialized */
 	if (!mk_shared_mem) {
@@ -183,10 +184,10 @@ static void multikernel_interrupt_handler(void)
 	}
 
 	/* Get this CPU's data area from shared memory */
-	data = &mk_shared_mem->cpu_data[current_cpu];
+	data = &mk_shared_mem->cpu_data[current_physical_id];
 
-	pr_debug("Multikernel IPI received on CPU %d from CPU %d, type=%u\n",
-		 current_cpu, data->sender_cpu, data->type);
+	pr_info("Multikernel IPI received on CPU %d (physical id %d) from CPU %d type=%u\n",
+		 current_cpu, current_physical_id, data->sender_cpu, data->type);
 
 	raw_spin_lock(&mk_handlers_lock);
 	for (handler = mk_handlers; handler; handler = handler->next) {
-- 
2.34.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC Patch 6/7] kexec: Implement dynamic kimage tracking
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
                   ` (4 preceding siblings ...)
  2025-09-18 22:26 ` [RFC Patch 5/7] x86: Introduce arch_cpu_physical_id() to obtain physical CPU ID Cong Wang
@ 2025-09-18 22:26 ` Cong Wang
  2025-09-18 22:26 ` [RFC Patch 7/7] kexec: Add /proc/multikernel interface for " Cong Wang
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Cong Wang @ 2025-09-18 22:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: pasha.tatashin, Cong Wang, Andrew Morton, Baoquan He,
	Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec, linux-mm

From: Cong Wang <cwang@multikernel.io>

Replace static kexec_image and kexec_crash_image globals with a dynamic
linked list infrastructure to support multiple kernel images. This change
enables multikernel functionality while maintaining backward compatibility.

Key changes:
- Add list_head member to kimage structure for chaining
- Implement thread-safe linked list management with global mutex
- Update kexec load/unload logic to use list-based APIs for multikernel
- Add helper functions for finding and managing multiple kimages
- Preserve existing kexec_image/kexec_crash_image pointers for compatibility
- Update architecture-specific crash handling to use new APIs

The multikernel case now properly uses list-based management instead of
overwriting compatibility pointers, allowing multiple multikernel images
to coexist in the system.
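
For illustration, the intended usage pattern of the new helpers (a sketch
mirroring what the updated load/unload paths below do):

	/* a default or crash image still updates the legacy globals */
	kimage_add_to_list(image);
	kimage_update_compat_pointers(image, KEXEC_TYPE_DEFAULT);

	/* multikernel images live only on the list and are looked up there */
	mk_image = kimage_find_by_type(KEXEC_TYPE_MULTIKERNEL);
	mk_image = kimage_find_multikernel_by_entry(entry);

	/* unload */
	kimage_remove_from_list(mk_image);
	kimage_free(mk_image);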

Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 arch/powerpc/kexec/crash.c |   8 +-
 arch/x86/kernel/crash.c    |   4 +-
 include/linux/kexec.h      |  16 ++++
 kernel/kexec.c             |  62 +++++++++++++-
 kernel/kexec_core.c        | 165 ++++++++++++++++++++++++++++++++++++-
 kernel/kexec_file.c        |  33 +++++++-
 6 files changed, 274 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
index a325c1c02f96..af190fad4f22 100644
--- a/arch/powerpc/kexec/crash.c
+++ b/arch/powerpc/kexec/crash.c
@@ -477,13 +477,13 @@ static void update_crash_elfcorehdr(struct kimage *image, struct memory_notify *
 	ptr = __va(mem);
 	if (ptr) {
 		/* Temporarily invalidate the crash image while it is replaced */
-		xchg(&kexec_crash_image, NULL);
+		kimage_update_compat_pointers(NULL, KEXEC_TYPE_CRASH);
 
 		/* Replace the old elfcorehdr with newly prepared elfcorehdr */
 		memcpy((void *)ptr, elfbuf, elfsz);
 
 		/* The crash image is now valid once again */
-		xchg(&kexec_crash_image, image);
+		kimage_update_compat_pointers(image, KEXEC_TYPE_CRASH);
 	}
 out:
 	kvfree(cmem);
@@ -537,14 +537,14 @@ static void update_crash_fdt(struct kimage *image)
 	fdt = __va((void *)image->segment[fdt_index].mem);
 
 	/* Temporarily invalidate the crash image while it is replaced */
-	xchg(&kexec_crash_image, NULL);
+	kimage_update_compat_pointers(NULL, KEXEC_TYPE_CRASH);
 
 	/* update FDT to reflect changes in CPU resources */
 	if (update_cpus_node(fdt))
 		pr_err("Failed to update crash FDT");
 
 	/* The crash image is now valid once again */
-	xchg(&kexec_crash_image, image);
+	kimage_update_compat_pointers(image, KEXEC_TYPE_CRASH);
 }
 
 int arch_crash_hotplug_support(struct kimage *image, unsigned long kexec_flags)
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index c6b12bed173d..fc561d5e058e 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -546,9 +546,9 @@ void arch_crash_handle_hotplug_event(struct kimage *image, void *arg)
 	 * Temporarily invalidate the crash image while the
 	 * elfcorehdr is updated.
 	 */
-	xchg(&kexec_crash_image, NULL);
+	kimage_update_compat_pointers(NULL, KEXEC_TYPE_CRASH);
 	memcpy_flushcache(old_elfcorehdr, elfbuf, elfsz);
-	xchg(&kexec_crash_image, image);
+	kimage_update_compat_pointers(image, KEXEC_TYPE_CRASH);
 	kunmap_local(old_elfcorehdr);
 	pr_debug("updated elfcorehdr\n");
 
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index a3ae3e561109..3bcbbacc0108 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -428,6 +428,9 @@ struct kimage {
 	/* dm crypt keys buffer */
 	unsigned long dm_crypt_keys_addr;
 	unsigned long dm_crypt_keys_sz;
+
+	/* For multikernel support: linked list node */
+	struct list_head list;
 };
 
 /* kexec interface functions */
@@ -531,6 +534,19 @@ extern bool kexec_file_dbg_print;
 
 extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size);
 extern void kimage_unmap_segment(void *buffer);
+
+/* Multikernel support functions */
+extern struct kimage *kimage_find_by_type(int type);
+extern void kimage_add_to_list(struct kimage *image);
+extern void kimage_remove_from_list(struct kimage *image);
+extern void kimage_update_compat_pointers(struct kimage *new_image, int type);
+extern int kimage_get_all_by_type(int type, struct kimage **images, int max_count);
+extern void kimage_list_lock(void);
+extern void kimage_list_unlock(void);
+extern struct kimage *kimage_find_multikernel_by_entry(unsigned long entry);
+extern struct kimage *kimage_get_multikernel_by_index(int index);
+extern int multikernel_kexec_by_entry(int cpu, unsigned long entry);
+extern void kimage_list_multikernel_images(void);
 #else /* !CONFIG_KEXEC_CORE */
 struct pt_regs;
 struct task_struct;
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 49e62f804674..3d37925ee15a 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -147,7 +147,31 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
 
 	if (nr_segments == 0) {
 		/* Uninstall image */
-		kimage_free(xchg(dest_image, NULL));
+		if (flags & KEXEC_ON_CRASH) {
+			struct kimage *old_image = xchg(&kexec_crash_image, NULL);
+			if (old_image) {
+				kimage_remove_from_list(old_image);
+				kimage_free(old_image);
+			}
+		} else if (flags & KEXEC_MULTIKERNEL) {
+			/* For multikernel unload, we need to specify which image to remove */
+			/* For now, remove all multikernel images - this could be enhanced */
+			struct kimage *images[10];
+			int count, i;
+
+			count = kimage_get_all_by_type(KEXEC_TYPE_MULTIKERNEL, images, 10);
+			for (i = 0; i < count; i++) {
+				kimage_remove_from_list(images[i]);
+				kimage_free(images[i]);
+			}
+			pr_info("Unloaded %d multikernel images\n", count);
+		} else {
+			struct kimage *old_image = xchg(&kexec_image, NULL);
+			if (old_image) {
+				kimage_remove_from_list(old_image);
+				kimage_free(old_image);
+			}
+		}
 		ret = 0;
 		goto out_unlock;
 	}
@@ -157,7 +181,11 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
 		 * crashes.  Free any current crash dump kernel before
 		 * we corrupt it.
 		 */
-		kimage_free(xchg(&kexec_crash_image, NULL));
+		struct kimage *old_crash_image = xchg(&kexec_crash_image, NULL);
+		if (old_crash_image) {
+			kimage_remove_from_list(old_crash_image);
+			kimage_free(old_crash_image);
+		}
 	}
 
 	ret = kimage_alloc_init(&image, entry, nr_segments, segments, flags);
@@ -199,7 +227,35 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
 		goto out;
 
 	/* Install the new kernel and uninstall the old */
-	image = xchg(dest_image, image);
+	if (flags & KEXEC_ON_CRASH) {
+		struct kimage *old_image = xchg(&kexec_crash_image, image);
+		if (old_image) {
+			kimage_remove_from_list(old_image);
+			kimage_free(old_image);
+		}
+		if (image) {
+			kimage_add_to_list(image);
+			kimage_update_compat_pointers(image, KEXEC_TYPE_CRASH);
+		}
+		image = NULL; /* Don't free the new image */
+	} else if (flags & KEXEC_MULTIKERNEL) {
+		if (image) {
+			kimage_add_to_list(image);
+			pr_info("Added multikernel image to list (entry: 0x%lx)\n", image->start);
+		}
+		image = NULL; /* Don't free the new image */
+	} else {
+		struct kimage *old_image = xchg(&kexec_image, image);
+		if (old_image) {
+			kimage_remove_from_list(old_image);
+			kimage_free(old_image);
+		}
+		if (image) {
+			kimage_add_to_list(image);
+			kimage_update_compat_pointers(image, KEXEC_TYPE_DEFAULT);
+		}
+		image = NULL; /* Don't free the new image */
+	}
 
 out:
 #ifdef CONFIG_CRASH_DUMP
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 35a66c8dd78b..4e489a7031e6 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -56,6 +56,10 @@ bool kexec_in_progress = false;
 
 bool kexec_file_dbg_print;
 
+/* Linked list of dynamically allocated kimages */
+static LIST_HEAD(kexec_image_list);
+static DEFINE_MUTEX(kexec_image_mutex);
+
 /*
  * When kexec transitions to the new kernel there is a one-to-one
  * mapping between physical and virtual addresses.  On processors
@@ -275,6 +279,9 @@ struct kimage *do_kimage_alloc_init(void)
 	/* Initialize the list of unusable pages */
 	INIT_LIST_HEAD(&image->unusable_pages);
 
+	/* Initialize the list node for multikernel support */
+	INIT_LIST_HEAD(&image->list);
+
 #ifdef CONFIG_CRASH_HOTPLUG
 	image->hp_action = KEXEC_CRASH_HP_NONE;
 	image->elfcorehdr_index = -1;
@@ -607,6 +614,13 @@ void kimage_free(struct kimage *image)
 	if (!image)
 		return;
 
+	/* Remove from linked list and update compatibility pointers */
+	kimage_remove_from_list(image);
+	if (image == kexec_image)
+		kimage_update_compat_pointers(NULL, KEXEC_TYPE_DEFAULT);
+	else if (image == kexec_crash_image)
+		kimage_update_compat_pointers(NULL, KEXEC_TYPE_CRASH);
+
 #ifdef CONFIG_CRASH_DUMP
 	if (image->vmcoreinfo_data_copy) {
 		crash_update_vmcoreinfo_safecopy(NULL);
@@ -1123,6 +1137,72 @@ void kimage_unmap_segment(void *segment_buffer)
 	vunmap(segment_buffer);
 }
 
+void kimage_add_to_list(struct kimage *image)
+{
+	mutex_lock(&kexec_image_mutex);
+	list_add_tail(&image->list, &kexec_image_list);
+	mutex_unlock(&kexec_image_mutex);
+}
+
+void kimage_remove_from_list(struct kimage *image)
+{
+	mutex_lock(&kexec_image_mutex);
+	if (!list_empty(&image->list))
+		list_del_init(&image->list);
+	mutex_unlock(&kexec_image_mutex);
+}
+
+struct kimage *kimage_find_by_type(int type)
+{
+	struct kimage *image;
+
+	mutex_lock(&kexec_image_mutex);
+	list_for_each_entry(image, &kexec_image_list, list) {
+		if (image->type == type) {
+			mutex_unlock(&kexec_image_mutex);
+			return image;
+		}
+	}
+	mutex_unlock(&kexec_image_mutex);
+	return NULL;
+}
+
+void kimage_update_compat_pointers(struct kimage *new_image, int type)
+{
+	mutex_lock(&kexec_image_mutex);
+	if (type == KEXEC_TYPE_CRASH) {
+		kexec_crash_image = new_image;
+	} else if (type == KEXEC_TYPE_DEFAULT) {
+		kexec_image = new_image;
+	}
+	mutex_unlock(&kexec_image_mutex);
+}
+
+int kimage_get_all_by_type(int type, struct kimage **images, int max_count)
+{
+	struct kimage *image;
+	int count = 0;
+
+	mutex_lock(&kexec_image_mutex);
+	list_for_each_entry(image, &kexec_image_list, list) {
+		if (image->type == type && count < max_count) {
+			images[count++] = image;
+		}
+	}
+	mutex_unlock(&kexec_image_mutex);
+	return count;
+}
+
+void kimage_list_lock(void)
+{
+	mutex_lock(&kexec_image_mutex);
+}
+
+void kimage_list_unlock(void)
+{
+	mutex_unlock(&kexec_image_mutex);
+}
+
 struct kexec_load_limit {
 	/* Mutex protects the limit count. */
 	struct mutex mutex;
@@ -1139,6 +1219,7 @@ static struct kexec_load_limit load_limit_panic = {
 	.limit = -1,
 };
 
+/* Compatibility: maintain pointers to current default and crash images */
 struct kimage *kexec_image;
 struct kimage *kexec_crash_image;
 static int kexec_load_disabled;
@@ -1339,8 +1420,49 @@ int kernel_kexec(void)
 	return error;
 }
 
+/*
+ * Find a multikernel image by entry point
+ */
+struct kimage *kimage_find_multikernel_by_entry(unsigned long entry)
+{
+	struct kimage *image;
+
+	kimage_list_lock();
+	list_for_each_entry(image, &kexec_image_list, list) {
+		if (image->type == KEXEC_TYPE_MULTIKERNEL && image->start == entry) {
+			kimage_list_unlock();
+			return image;
+		}
+	}
+	kimage_list_unlock();
+	return NULL;
+}
+
+/*
+ * Get multikernel image by index (0-based)
+ */
+struct kimage *kimage_get_multikernel_by_index(int index)
+{
+	struct kimage *image;
+	int count = 0;
+
+	kimage_list_lock();
+	list_for_each_entry(image, &kexec_image_list, list) {
+		if (image->type == KEXEC_TYPE_MULTIKERNEL) {
+			if (count == index) {
+				kimage_list_unlock();
+				return image;
+			}
+			count++;
+		}
+	}
+	kimage_list_unlock();
+	return NULL;
+}
+
 int multikernel_kexec(int cpu)
 {
+	struct kimage *mk_image;
 	int rc;
 
 	pr_info("multikernel kexec: cpu %d\n", cpu);
@@ -1352,13 +1474,52 @@ int multikernel_kexec(int cpu)
 
 	if (!kexec_trylock())
 		return -EBUSY;
-	if (!kexec_image) {
+
+	mk_image = kimage_find_by_type(KEXEC_TYPE_MULTIKERNEL);
+	if (!mk_image) {
+		pr_err("No multikernel image loaded\n");
 		rc = -EINVAL;
 		goto unlock;
 	}
 
+	pr_info("Found multikernel image with entry point: 0x%lx\n", mk_image->start);
+
+	cpus_read_lock();
+	rc = multikernel_kick_ap(cpu, mk_image->start);
+	cpus_read_unlock();
+
+unlock:
+	kexec_unlock();
+	return rc;
+}
+
+int multikernel_kexec_by_entry(int cpu, unsigned long entry)
+{
+	struct kimage *mk_image;
+	int rc;
+
+	pr_info("multikernel kexec: cpu %d, entry 0x%lx\n", cpu, entry);
+
+	if (cpu_online(cpu)) {
+		pr_err("The CPU is currently running with this kernel instance.\n");
+		return -EBUSY;
+	}
+
+	if (!kexec_trylock())
+		return -EBUSY;
+
+	/* Find the specific multikernel image by entry point */
+	mk_image = kimage_find_multikernel_by_entry(entry);
+	if (!mk_image) {
+		pr_err("No multikernel image found with entry point 0x%lx\n", entry);
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	pr_info("Using multikernel image with entry point: 0x%lx\n", mk_image->start);
+
 	cpus_read_lock();
-	rc = multikernel_kick_ap(cpu, kexec_image->start);
+	rc = multikernel_kick_ap(cpu, mk_image->start);
 	cpus_read_unlock();
 
 unlock:
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 91d46502a817..d4b8831eb59c 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -399,8 +399,13 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 	 * same memory where old crash kernel might be loaded. Free any
 	 * current crash dump kernel before we corrupt it.
 	 */
-	if (flags & KEXEC_FILE_ON_CRASH)
-		kimage_free(xchg(&kexec_crash_image, NULL));
+	if (flags & KEXEC_FILE_ON_CRASH) {
+		struct kimage *old_crash_image = xchg(&kexec_crash_image, NULL);
+		if (old_crash_image) {
+			kimage_remove_from_list(old_crash_image);
+			kimage_free(old_crash_image);
+		}
+	}
 
 	ret = kimage_file_alloc_init(&image, kernel_fd, initrd_fd, cmdline_ptr,
 				     cmdline_len, flags);
@@ -456,7 +461,29 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
 	 */
 	kimage_file_post_load_cleanup(image);
 exchange:
-	image = xchg(dest_image, image);
+	if (image_type == KEXEC_TYPE_CRASH) {
+		struct kimage *old_image = xchg(&kexec_crash_image, image);
+		if (old_image) {
+			kimage_remove_from_list(old_image);
+			kimage_free(old_image);
+		}
+		if (image) {
+			kimage_add_to_list(image);
+			kimage_update_compat_pointers(image, KEXEC_TYPE_CRASH);
+		}
+		image = NULL; /* Don't free the new image */
+	} else {
+		struct kimage *old_image = xchg(&kexec_image, image);
+		if (old_image) {
+			kimage_remove_from_list(old_image);
+			kimage_free(old_image);
+		}
+		if (image) {
+			kimage_add_to_list(image);
+			kimage_update_compat_pointers(image, KEXEC_TYPE_DEFAULT);
+		}
+		image = NULL; /* Don't free the new image */
+	}
 out:
 #ifdef CONFIG_CRASH_DUMP
 	if ((flags & KEXEC_FILE_ON_CRASH) && kexec_crash_image)
-- 
2.34.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC Patch 7/7] kexec: Add /proc/multikernel interface for kimage tracking
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
                   ` (5 preceding siblings ...)
  2025-09-18 22:26 ` [RFC Patch 6/7] kexec: Implement dynamic kimage tracking Cong Wang
@ 2025-09-18 22:26 ` Cong Wang
  2025-09-19 10:10 ` [syzbot ci] Re: kernel: Introduce multikernel architecture support syzbot ci
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: Cong Wang @ 2025-09-18 22:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: pasha.tatashin, Cong Wang, Andrew Morton, Baoquan He,
	Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec, linux-mm

From: Cong Wang <cwang@multikernel.io>

Add a dedicated /proc/multikernel file to provide read-only access to
all loaded kernel images in the system.

The interface displays kernel images in a tabular format showing:
- Type: kexec type (default, crash, multikernel)
- Start Address: entry point in hexadecimal format
- Segments: number of memory segments

A lot more information needs to be added here, for example a unique kernel
ID allocated for each kimage. For now, let's focus on the design first.

This interface is particularly useful for debugging multikernel setups,
system monitoring, and verifying that kernel images are loaded correctly.
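
With one default kexec image and one multikernel image loaded, the output
might look like this (hypothetical addresses and segment counts):

	$ cat /proc/multikernel
	Type        Start Address   Segments
	----------  --------------  --------
	default     0x000001000000         5
	multikernel  0x000100000000         4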

Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 kernel/kexec_core.c | 63 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 4e489a7031e6..8306c10fc337 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -13,6 +13,8 @@
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/kexec.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
 #include <linux/mutex.h>
 #include <linux/list.h>
 #include <linux/highmem.h>
@@ -1224,6 +1226,52 @@ struct kimage *kexec_image;
 struct kimage *kexec_crash_image;
 static int kexec_load_disabled;
 
+/*
+ * Proc interface for /proc/multikernel
+ */
+static int multikernel_proc_show(struct seq_file *m, void *v)
+{
+	struct kimage *image;
+	const char *type_names[] = {
+		[KEXEC_TYPE_DEFAULT] = "default",
+		[KEXEC_TYPE_CRASH] = "crash",
+		[KEXEC_TYPE_MULTIKERNEL] = "multikernel"
+	};
+
+	seq_printf(m, "Type        Start Address   Segments\n");
+	seq_printf(m, "----------  --------------  --------\n");
+
+	kimage_list_lock();
+	if (list_empty(&kexec_image_list)) {
+		seq_printf(m, "No kimages loaded\n");
+	} else {
+		list_for_each_entry(image, &kexec_image_list, list) {
+			const char *type_name = "unknown";
+
+			if (image->type < ARRAY_SIZE(type_names) && type_names[image->type])
+				type_name = type_names[image->type];
+
+			seq_printf(m, "%-10s  0x%012lx  %8lu\n",
+				   type_name, image->start, image->nr_segments);
+		}
+	}
+	kimage_list_unlock();
+
+	return 0;
+}
+
+static int multikernel_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, multikernel_proc_show, NULL);
+}
+
+static const struct proc_ops multikernel_proc_ops = {
+	.proc_open	= multikernel_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+};
+
 #ifdef CONFIG_SYSCTL
 static int kexec_limit_handler(const struct ctl_table *table, int write,
 			       void *buffer, size_t *lenp, loff_t *ppos)
@@ -1295,6 +1343,21 @@ static int __init kexec_core_sysctl_init(void)
 late_initcall(kexec_core_sysctl_init);
 #endif
 
+static int __init multikernel_proc_init(void)
+{
+	struct proc_dir_entry *entry;
+
+	entry = proc_create("multikernel", 0444, NULL, &multikernel_proc_ops);
+	if (!entry) {
+		pr_err("Failed to create /proc/multikernel\n");
+		return -ENOMEM;
+	}
+
+	pr_debug("Created /proc/multikernel interface\n");
+	return 0;
+}
+late_initcall(multikernel_proc_init);
+
 bool kexec_load_permitted(int kexec_image_type)
 {
 	struct kexec_load_limit *limit;
-- 
2.34.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [syzbot ci] Re: kernel: Introduce multikernel architecture support
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
                   ` (6 preceding siblings ...)
  2025-09-18 22:26 ` [RFC Patch 7/7] kexec: Add /proc/multikernel interface for " Cong Wang
@ 2025-09-19 10:10 ` syzbot ci
  2025-09-19 13:14 ` [RFC Patch 0/7] " Pasha Tatashin
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 44+ messages in thread
From: syzbot ci @ 2025-09-19 10:10 UTC (permalink / raw)
  To: akpm, bhe, changyuanl, cwang, graf, kexec, linux-kernel,
	linux-mm, pasha.tatashin, rppt, xiyou.wangcong
  Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v1] kernel: Introduce multikernel architecture support
https://lore.kernel.org/all/20250918222607.186488-1-xiyou.wangcong@gmail.com
* [RFC Patch 1/7] kexec: Introduce multikernel support via kexec
* [RFC Patch 2/7] x86: Introduce SMP INIT trampoline for multikernel CPU bootstrap
* [RFC Patch 3/7] x86: Introduce MULTIKERNEL_VECTOR for inter-kernel communication
* [RFC Patch 4/7] kernel: Introduce generic multikernel IPI communication framework
* [RFC Patch 5/7] x86: Introduce arch_cpu_physical_id() to obtain physical CPU ID
* [RFC Patch 6/7] kexec: Implement dynamic kimage tracking
* [RFC Patch 7/7] kexec: Add /proc/multikernel interface for kimage tracking

and found the following issue:
WARNING in note_page

Full report is available here:
https://ci.syzbot.org/series/9ca759c7-776a-4d45-a2f9-5e6ca245e989

***

WARNING in note_page

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      f83ec76bf285bea5727f478a68b894f5543ca76e
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/645f543b-aa06-4977-81f2-45a0b44a2133/config

Key type fscrypt-provisioning registered
kAFS: Red Hat AFS client v0.1 registering.
Btrfs loaded, assert=on, ref-verify=on, zoned=yes, fsverity=yes
Key type big_key registered
Key type encrypted registered
AppArmor: AppArmor sha256 policy hashing enabled
ima: No TPM chip found, activating TPM-bypass!
Loading compiled-in module X.509 certificates
Loaded X.509 cert 'Build time autogenerated kernel key: 75e3f237904f24df4a2b6e4eae1a8f34effb6643'
ima: Allocated hash algorithm: sha256
ima: No architecture policies found
evm: Initialising EVM extended attributes:
evm: security.selinux (disabled)
evm: security.SMACK64 (disabled)
evm: security.SMACK64EXEC (disabled)
evm: security.SMACK64TRANSMUTE (disabled)
evm: security.SMACK64MMAP (disabled)
evm: security.apparmor
evm: security.ima
evm: security.capability
evm: HMAC attrs: 0x1
PM:   Magic number: 9:738:504
usb usb41-port1: hash matches
usb usb40-port2: hash matches
netconsole: network logging started
gtp: GTP module loaded (pdp ctx size 128 bytes)
rdma_rxe: loaded
cfg80211: Loading compiled-in X.509 certificates for regulatory database
Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
Loaded X.509 cert 'wens: 61c038651aabdcf94bd0ac7ff06c7248db18c600'
clk: Disabling unused clocks
ALSA device list:
  #0: Dummy 1
  #1: Loopback 1
  #2: Virtual MIDI Card 1
check access for rdinit=/init failed: -2, ignoring
md: Waiting for all devices to be available before autodetect
md: If you don't use raid, use raid=noautodetect
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
EXT4-fs (sda1): mounted filesystem b4773fba-1738-4da0-8a90-0fe043d0a496 ro with ordered data mode. Quota mode: none.
VFS: Mounted root (ext4 filesystem) readonly on device 8:1.
devtmpfs: mounted
Freeing unused kernel image (initmem) memory: 26168K
Write protecting the kernel read-only data: 210944k
Freeing unused kernel image (text/rodata gap) memory: 104K
Freeing unused kernel image (rodata/data gap) memory: 300K
------------[ cut here ]------------
x86/mm: Found insecure W+X mapping at address 0xffff888000096000
WARNING: CPU: 1 PID: 1 at arch/x86/mm/dump_pagetables.c:248 note_page+0x12a5/0x14a0
Modules linked in:
CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:note_page+0x12a5/0x14a0
Code: d5 49 00 c6 05 bd 7d 17 0e 01 90 43 80 3c 2e 00 74 08 4c 89 ff e8 9b 47 ad 00 49 8b 37 48 c7 c7 40 98 88 8b e8 1c 53 0d 00 90 <0f> 0b 90 90 49 bd 00 00 00 00 00 fc ff df e9 5a f1 ff ff 44 89 f9
RSP: 0000:ffffc90000047678 EFLAGS: 00010246
RAX: 0829288c8ce4fa00 RBX: ffffc90000047cf0 RCX: ffff88801c2d0000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
RBP: 0000000000000001 R08: 0000000000000003 R09: 0000000000000004
R10: dffffc0000000000 R11: fffffbfff1bfa274 R12: ffffffffffffffff
R13: dffffc0000000000 R14: 1ffff92000008fb2 R15: ffffc90000047d90
FS:  0000000000000000(0000) GS:ffff8881a3c09000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000df36000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 ptdump_pte_entry+0xc6/0xe0
 walk_pte_range_inner+0x1ba/0x380
 walk_pgd_range+0x1467/0x1d40
 walk_page_range_debug+0x312/0x3d0
 ptdump_walk_pgd+0x126/0x320
 ptdump_walk_pgd_level_core+0x260/0x3e0
 kernel_init+0x53/0x1d0
 ret_from_fork+0x439/0x7d0
 ret_from_fork_asm+0x1a/0x30
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
                   ` (7 preceding siblings ...)
  2025-09-19 10:10 ` [syzbot ci] Re: kernel: Introduce multikernel architecture support syzbot ci
@ 2025-09-19 13:14 ` Pasha Tatashin
  2025-09-20 21:13   ` Cong Wang
  2025-09-19 21:26 ` Stefan Hajnoczi
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Pasha Tatashin @ 2025-09-19 13:14 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, Cong Wang, Andrew Morton, Baoquan He,
	Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec, linux-mm

On Thu, Sep 18, 2025 at 6:26 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> This patch series introduces multikernel architecture support, enabling
> multiple independent kernel instances to coexist and communicate on a
> single physical machine. Each kernel instance can run on dedicated CPU
> cores while sharing the underlying hardware resources.
>
> The multikernel architecture provides several key benefits:
> - Improved fault isolation between different workloads
> - Enhanced security through kernel-level separation
> - Better resource utilization than traditional VM (KVM, Xen etc.)
> - Potential zero-down kernel update with KHO (Kernel Hand Over)

Hi Cong,

Thank you for submitting this; it is an exciting series.

I experimented with this approach about five years ago for a Live
Update scenario. It required surprisingly little work to get two OSes
to boot simultaneously on the same x86 hardware. The procedure I
followed looked like this:

1. Create an immutable kernel image bundle: kernel + initramfs.
2. The first kernel is booted with memmap parameters, setting aside
the first 1G for its own operation, the second 1G for the next kernel
(reserved), and the rest as PMEM for the VMs.
3. In the first kernel, we offline one CPU and kexec the second kernel
with parameters that specify to use only the offlined CPU as the boot
CPU and to keep the other CPUs offline (i.e., smp_init does not start
other CPUs). The memmap parameters specify the first 1G as reserved, the
2nd 1G for its own operation, and the rest as PMEM (sketched below).
4. Passing the VMs worked by suspending them in the old kernel.
5. The other CPUs are onlined in the new kernel (thus killing the old kernel).
6. The VMs are resumed in the new kernel.
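
To make steps 2-3 concrete, the kernel command lines might look roughly
like this (hypothetical layout on a machine with 64G of RAM; the memmap
syntax may need escaping in the bootloader, e.g. "\$" under GRUB):

	# first kernel: use the first 1G, reserve the second 1G, rest is PMEM
	memmap=1G$1G memmap=62G!2G

	# second (kexec'ed) kernel: first 1G stays reserved, own the second 1G,
	# rest is PMEM; keep all but the boot CPU offline
	memmap=1G$0 memmap=62G!2G maxcpus=1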

While this approach made it easy to get to an experimental PoC, it has
some fundamental problems that I am not sure can be solved in the long
run, such as handling global machine states like interrupts. I think
the Orphaned VM approach (i.e., keeping VCPUs running through the Live
Update procedure) is more reliable and likely to succeed for
zero-downtime kernel updates.

Pasha



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
                   ` (8 preceding siblings ...)
  2025-09-19 13:14 ` [RFC Patch 0/7] " Pasha Tatashin
@ 2025-09-19 21:26 ` Stefan Hajnoczi
  2025-09-20 21:40   ` Cong Wang
  2025-09-21  1:47 ` Hillf Danton
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Stefan Hajnoczi @ 2025-09-19 21:26 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm


On Thu, Sep 18, 2025 at 03:25:59PM -0700, Cong Wang wrote:
> This patch series introduces multikernel architecture support, enabling
> multiple independent kernel instances to coexist and communicate on a
> single physical machine. Each kernel instance can run on dedicated CPU
> cores while sharing the underlying hardware resources.
> 
> The multikernel architecture provides several key benefits:
> - Improved fault isolation between different workloads
> - Enhanced security through kernel-level separation

What level of isolation does this patch series provide? What stops
kernel A from accessing kernel B's memory pages, sending interrupts to
its CPUs, etc?

> - Better resource utilization than traditional VM (KVM, Xen etc.)
> - Potential zero-down kernel update with KHO (Kernel Hand Over)
> 
> Architecture Overview:
> The implementation leverages kexec infrastructure to load and manage
> multiple kernel images, with each kernel instance assigned to specific
> CPU cores. Inter-kernel communication is facilitated through a dedicated
> IPI framework that allows kernels to coordinate and share information
> when necessary.
> 
> Key Components:
> 1. Enhanced kexec subsystem with dynamic kimage tracking
> 2. Generic IPI communication framework for inter-kernel messaging
> 3. Architecture-specific CPU bootstrap mechanisms (only x86 so far)
> 4. Proc interface for monitoring loaded kernel instances
> 
> Patch Summary:
> 
> Patch 1/7: Introduces basic multikernel support via kexec, allowing
>            multiple kernel images to be loaded simultaneously.
> 
> Patch 2/7: Adds x86-specific SMP INIT trampoline for bootstrapping
>            CPUs with different kernel instances.
> 
> Patch 3/7: Introduces dedicated MULTIKERNEL_VECTOR for x86 inter-kernel
>            communication.
> 
> Patch 4/7: Implements generic multikernel IPI communication framework
>            for cross-kernel messaging and coordination.
> 
> Patch 5/7: Adds arch_cpu_physical_id() function to obtain physical CPU
>            identifiers for proper CPU management.
> 
> Patch 6/7: Replaces static kimage globals with dynamic linked list
>            infrastructure to support multiple kernel images.
> 
> Patch 7/7: Adds /proc/multikernel interface for monitoring and debugging
>            loaded kernel instances.
> 
> The implementation maintains full backward compatibility with existing
> kexec functionality while adding the new multikernel capabilities.
> 
> IMPORTANT NOTES:
> 
> 1) This is a Request for Comments (RFC) submission. While the core
>    architecture is functional, there are numerous implementation details
>    that need improvement. The primary goal is to gather feedback on the
>    high-level design and overall approach rather than focus on specific
>    coding details at this stage.
> 
> 2) This patch series represents only the foundational framework for
>    multikernel support. It establishes the basic infrastructure and
>    communication mechanisms. We welcome the community to build upon
>    this foundation and develop their own solutions based on this
>    framework.
> 
> 3) Testing has been limited to the author's development machine using
>    hard-coded boot parameters and specific hardware configurations.
>    Community testing across different hardware platforms, configurations,
>    and use cases would be greatly appreciated to identify potential
>    issues and improve robustness. Obviously, don't use this code beyond
>    testing.
> 
> This work enables new use cases such as running real-time kernels
> alongside general-purpose kernels, isolating security-critical
> applications, and providing dedicated kernel instances for specific
> workloads etc..

This reminds me of Jailhouse, a partitioning hypervisor for Linux.
Jailhouse uses virtualization and other techniques to isolate CPUs,
allowing real-time workloads to run alongside Linux:
https://github.com/siemens/jailhouse

It would be interesting to hear your thoughts about where you want to go
with this series and how it compares with a partitioning hypervisor like
Jailhouse.

Thanks,
Stefan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-19 13:14 ` [RFC Patch 0/7] " Pasha Tatashin
@ 2025-09-20 21:13   ` Cong Wang
  0 siblings, 0 replies; 44+ messages in thread
From: Cong Wang @ 2025-09-20 21:13 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, Cong Wang, Andrew Morton, Baoquan He,
	Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec, linux-mm

On Fri, Sep 19, 2025 at 6:14 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Thu, Sep 18, 2025 at 6:26 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > This patch series introduces multikernel architecture support, enabling
> > multiple independent kernel instances to coexist and communicate on a
> > single physical machine. Each kernel instance can run on dedicated CPU
> > cores while sharing the underlying hardware resources.
> >
> > The multikernel architecture provides several key benefits:
> > - Improved fault isolation between different workloads
> > - Enhanced security through kernel-level separation
> > - Better resource utilization than traditional VM (KVM, Xen etc.)
> > - Potential zero-down kernel update with KHO (Kernel Hand Over)
>
> Hi Cong,
>
> Thank you for submitting this; it is an exciting series.

Thanks for your feedback, Pasha.

>
> I experimented with this approach about five years ago for a Live
> Update scenario. It required surprisingly little work to get two OSes
> to boot simultaneously on the same x86 hardware. The procedure I

Yes, I totally agree.

> followed looked like this:
>
> 1. Create an immutable kernel image bundle: kernel + initramfs.
> 2. The first kernel is booted with memmap parameters, setting aside
> the first 1G for its own operation, the second 1G for the next kernel
> (reserved), and the rest as PMEM for the VMs.
> 3. In the first kernel, we offline one CPU and kexec the second kernel
> with parameters that specify to use only the offlined CPU as the boot
> CPU and to keep the other CPUs offline (i.e., smp_init does not start
> other CPUs). The memmap specify the first 1G reserved, and the 2nd 1G
> for its own operations, and the rest  is PMEM.
> 4. Passing the VMs worked by suspending them in the old kernel.
> 5. The other CPUs are onlined in the new kernel (thus killing the old kernel).
> 6. The VMs are resumed in the new kernel.

Exactly.

>
> While this approach was easy to get to the experimental PoC, it has
> some fundamental problems that I am not sure can be solved in the long
> run, such as handling global machine states like interrupts. I think
> the Orphaned VM approach (i.e., keeping VCPUs running through the Live
> Update procedure) is more reliable and likely to succeed for
> zero-downtime kernel updates.

Indeed, migrating hardware resources gracefully is challenging for both
VMs and multikernel, especially without interrupting the applications.
I am imagining that KHO could establish a kind of protocol between two
kernels to migrate resources. The device-tree-inspired abstraction looks
neat to me; it is pretty much like protobuf, but in kernel space.

Although I believe multikernel helps, there are still tons of details that
need to be considered. Therefore, I hope my proposal inspires people to
think deeper and discuss together, and hopefully come up with better ideas.

Thanks for sharing your thoughts.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-19 21:26 ` Stefan Hajnoczi
@ 2025-09-20 21:40   ` Cong Wang
  2025-09-22 14:28     ` Stefan Hajnoczi
  0 siblings, 1 reply; 44+ messages in thread
From: Cong Wang @ 2025-09-20 21:40 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm

On Fri, Sep 19, 2025 at 2:27 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Sep 18, 2025 at 03:25:59PM -0700, Cong Wang wrote:
> > This patch series introduces multikernel architecture support, enabling
> > multiple independent kernel instances to coexist and communicate on a
> > single physical machine. Each kernel instance can run on dedicated CPU
> > cores while sharing the underlying hardware resources.
> >
> > The multikernel architecture provides several key benefits:
> > - Improved fault isolation between different workloads
> > - Enhanced security through kernel-level separation
>
> What level of isolation does this patch series provide? What stops
> kernel A from accessing kernel B's memory pages, sending interrupts to
> its CPUs, etc?

It is kernel-enforced isolation, so the trust model here is still based
on the kernel. Hence, a malicious kernel would be able to disrupt the
others, as you described. With memory encryption and IPI filtering, I
think that is solvable.

>
> > - Better resource utilization than traditional VM (KVM, Xen etc.)
> > - Potential zero-down kernel update with KHO (Kernel Hand Over)
> >
> > Architecture Overview:
> > The implementation leverages kexec infrastructure to load and manage
> > multiple kernel images, with each kernel instance assigned to specific
> > CPU cores. Inter-kernel communication is facilitated through a dedicated
> > IPI framework that allows kernels to coordinate and share information
> > when necessary.
> >
> > Key Components:
> > 1. Enhanced kexec subsystem with dynamic kimage tracking
> > 2. Generic IPI communication framework for inter-kernel messaging
> > 3. Architecture-specific CPU bootstrap mechanisms (only x86 so far)
> > 4. Proc interface for monitoring loaded kernel instances
> >
> > Patch Summary:
> >
> > Patch 1/7: Introduces basic multikernel support via kexec, allowing
> >            multiple kernel images to be loaded simultaneously.
> >
> > Patch 2/7: Adds x86-specific SMP INIT trampoline for bootstrapping
> >            CPUs with different kernel instances.
> >
> > Patch 3/7: Introduces dedicated MULTIKERNEL_VECTOR for x86 inter-kernel
> >            communication.
> >
> > Patch 4/7: Implements generic multikernel IPI communication framework
> >            for cross-kernel messaging and coordination.
> >
> > Patch 5/7: Adds arch_cpu_physical_id() function to obtain physical CPU
> >            identifiers for proper CPU management.
> >
> > Patch 6/7: Replaces static kimage globals with dynamic linked list
> >            infrastructure to support multiple kernel images.
> >
> > Patch 7/7: Adds /proc/multikernel interface for monitoring and debugging
> >            loaded kernel instances.
> >
> > The implementation maintains full backward compatibility with existing
> > kexec functionality while adding the new multikernel capabilities.
> >
> > IMPORTANT NOTES:
> >
> > 1) This is a Request for Comments (RFC) submission. While the core
> >    architecture is functional, there are numerous implementation details
> >    that need improvement. The primary goal is to gather feedback on the
> >    high-level design and overall approach rather than focus on specific
> >    coding details at this stage.
> >
> > 2) This patch series represents only the foundational framework for
> >    multikernel support. It establishes the basic infrastructure and
> >    communication mechanisms. We welcome the community to build upon
> >    this foundation and develop their own solutions based on this
> >    framework.
> >
> > 3) Testing has been limited to the author's development machine using
> >    hard-coded boot parameters and specific hardware configurations.
> >    Community testing across different hardware platforms, configurations,
> >    and use cases would be greatly appreciated to identify potential
> >    issues and improve robustness. Obviously, don't use this code beyond
> >    testing.
> >
> > This work enables new use cases such as running real-time kernels
> > alongside general-purpose kernels, isolating security-critical
> > applications, and providing dedicated kernel instances for specific
> > workloads etc..
>
> This reminds me of Jailhouse, a partitioning hypervisor for Linux.
> Jailhouse uses virtualization and other techniques to isolate CPUs,
> allowing real-time workloads to run alongside Linux:
> https://github.com/siemens/jailhouse
>
> It would be interesting to hear your thoughts about where you want to go
> with this series and how it compares with a partitioning hypervisor like
> Jailhouse.

Good question. A few people pointed me to Jailhouse before. If I understand
correctly, it is still based on hardware virtualization like the IOMMU and
VMX. The goal of multikernel is to completely avoid hardware virtualization
and to work without a hypervisor. Of course, this also depends on how we
define "hypervisor" here: if it is a user-space one like Qemu, that is
exactly what multikernel tries to avoid; if it is just a "supervisor" in
the broad sense, one still exists in the kernel (unlike Qemu).

This is why I tend to use "host kernel" and "spawned kernel" to distinguish
them, instead of "hypervisor" and "guest", which easily confuse people with
virtualization.

Speaking of virtualization, there are other technologies like
DirectVisor or de-virtualization. In my humble opinion, they are going the
wrong way, since apparently virt + de-virt = no virt. Why even bother with virt? ;-p

I hope this answers your questions,

Regards,
Cong


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
                   ` (9 preceding siblings ...)
  2025-09-19 21:26 ` Stefan Hajnoczi
@ 2025-09-21  1:47 ` Hillf Danton
  2025-09-22 21:55   ` Cong Wang
  2025-09-21  5:54 ` Jan Engelhardt
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Hillf Danton @ 2025-09-21  1:47 UTC (permalink / raw)
  To: Cong Wang; +Cc: linux-kernel, linux-mm

On Thu, 18 Sep 2025 15:25:59 -0700 Cong Wang wrote:
> This patch series introduces multikernel architecture support, enabling
> multiple independent kernel instances to coexist and communicate on a
> single physical machine. Each kernel instance can run on dedicated CPU
> cores while sharing the underlying hardware resources.
> 
> The multikernel architecture provides several key benefits:
> - Improved fault isolation between different workloads
> - Enhanced security through kernel-level separation
> - Better resource utilization than traditional VM (KVM, Xen etc.)
> - Potential zero-down kernel update with KHO (Kernel Hand Over)
>
Could you illustrate a couple of use cases to help understand your idea?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
                   ` (10 preceding siblings ...)
  2025-09-21  1:47 ` Hillf Danton
@ 2025-09-21  5:54 ` Jan Engelhardt
  2025-09-21  6:24   ` Mike Rapoport
  2025-09-24 17:51 ` Christoph Lameter (Ampere)
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 44+ messages in thread
From: Jan Engelhardt @ 2025-09-21  5:54 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm


On Friday 2025-09-19 00:25, Cong Wang wrote:

>This patch series introduces multikernel architecture support, enabling
>multiple independent kernel instances to coexist and communicate on a
>single physical machine.
>
>Each kernel instance can run on dedicated CPU
>cores while sharing the underlying hardware resources.

I initially read it in such a way that that kernels run without
supervisor, and thus necessarily cooperatively, on a system.

But then I looked at
<https://multikernel.io/assets/images/comparison-architecture-diagrams.svg>,
saw that there is a kernel on top of a kernel, to which my reactive
thought was: "well, that has been done before", e.g. User Mode Linux.
While UML does not technically talk to hardware directly, continuing
the thought "what's stopping a willing developer from giving /dev/mem
to the subordinate kernel".

On second thought, a hypervisor is just some kind of "miniature
kernel" too (if generalizing very hard).


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-21  5:54 ` Jan Engelhardt
@ 2025-09-21  6:24   ` Mike Rapoport
  0 siblings, 0 replies; 44+ messages in thread
From: Mike Rapoport @ 2025-09-21  6:24 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Cong Wang, linux-kernel, pasha.tatashin, Cong Wang,
	Andrew Morton, Baoquan He, Alexander Graf, Changyuan Lyu, kexec,
	linux-mm

On Sun, Sep 21, 2025 at 07:54:31AM +0200, Jan Engelhardt wrote:
> 
> On Friday 2025-09-19 00:25, Cong Wang wrote:
> 
> >This patch series introduces multikernel architecture support, enabling
> >multiple independent kernel instances to coexist and communicate on a
> >single physical machine.
> >
> >Each kernel instance can run on dedicated CPU
> >cores while sharing the underlying hardware resources.
> 
> I initially read it in such a way that that kernels run without
> supervisor, and thus necessarily cooperatively, on a system.
> 
> But then I looked at
> <https://multikernel.io/assets/images/comparison-architecture-diagrams.svg>,

The diagram also oversimplifies containers, with system containers the "OS"
runs inside the container and only the kernel is shared.

It's interesting to hear how the multikernel approach compare to system
containers as well.

> saw that there is a kernel on top of a kernel, to which my reactive
> thought was: "well, that has been done before", e.g. User Mode Linux.
> While UML does not technically talk to hardware directly, continuing
> the thought "what's stopping a willing developer from giving /dev/mem
> to the subordinate kernel".
> 
> On second thought, a hypervisor is just some kind of "miniature
> kernel" too (if generalizing very hard).
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-20 21:40   ` Cong Wang
@ 2025-09-22 14:28     ` Stefan Hajnoczi
  2025-09-22 22:41       ` Cong Wang
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Hajnoczi @ 2025-09-22 14:28 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 2029 bytes --]

On Sat, Sep 20, 2025 at 02:40:18PM -0700, Cong Wang wrote:
> On Fri, Sep 19, 2025 at 2:27 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Sep 18, 2025 at 03:25:59PM -0700, Cong Wang wrote:
> > > This patch series introduces multikernel architecture support, enabling
> > > multiple independent kernel instances to coexist and communicate on a
> > > single physical machine. Each kernel instance can run on dedicated CPU
> > > cores while sharing the underlying hardware resources.
> > >
> > > The multikernel architecture provides several key benefits:
> > > - Improved fault isolation between different workloads
> > > - Enhanced security through kernel-level separation
> >
> > What level of isolation does this patch series provide? What stops
> > kernel A from accessing kernel B's memory pages, sending interrupts to
> > its CPUs, etc?
> 
> It is kernel-enforced isolation, therefore, the trust model here is still
> based on kernel. Hence, a malicious kernel would be able to disrupt,
> as you described. With memory encryption and IPI filtering, I think
> that is solvable.

I think solving this is key to the architecture, at least if fault
isolation and security are goals. A cooperative architecture where
nothing prevents kernels from interfering with each other simply doesn't
offer fault isolation or security.

On CPU architectures that offer additional privilege modes it may be
possible to run a supervisor on every CPU to restrict access to
resources in the spawned kernel. Kernels would need to be modified to
call into the supervisor instead of accessing certain resources
directly.

IOMMU and interrupt remapping control would need to be performed by the
supervisor to prevent spawned kernels from affecting each other.

This seems to be the price of fault isolation and security. It ends up
looking similar to a hypervisor, but maybe it wouldn't need to use
virtualization extensions, depending on the capabilities of the CPU
architecture.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-21  1:47 ` Hillf Danton
@ 2025-09-22 21:55   ` Cong Wang
  2025-09-24  1:12     ` Hillf Danton
  0 siblings, 1 reply; 44+ messages in thread
From: Cong Wang @ 2025-09-22 21:55 UTC (permalink / raw)
  To: Hillf Danton; +Cc: linux-kernel, linux-mm, multikernel

On Sat, Sep 20, 2025 at 6:47 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Thu, 18 Sep 2025 15:25:59 -0700 Cong Wang wrote:
> > This patch series introduces multikernel architecture support, enabling
> > multiple independent kernel instances to coexist and communicate on a
> > single physical machine. Each kernel instance can run on dedicated CPU
> > cores while sharing the underlying hardware resources.
> >
> > The multikernel architecture provides several key benefits:
> > - Improved fault isolation between different workloads
> > - Enhanced security through kernel-level separation
> > - Better resource utilization than traditional VM (KVM, Xen etc.)
> > - Potential zero-down kernel update with KHO (Kernel Hand Over)
> >
> Could you illustrate a couple of use cases to help understand your idea?

Sure, below are a few use cases I have in mind:

1) With sufficient hardware resources: each kernel gets isolated resources
with real bare-metal performance. This applies to all VM/container use cases
today, just with better performance: no virtualization, no noisy neighbor.

More importantly, they can co-exist. In theory, you can run a multikernel with
a VM inside and a container inside the VM.

2) Active-backup kernel for mission-critical tasks: after the primary kernel
crashes, a backup kernel running in parallel immediately takes over without
interrupting the user-space task.

Dual-kernel systems are very common in automotive today.

3) Getting rid of the OS to reduce the attack surface. We could pack everything
properly into an initramfs and run it directly without needing a full OS.
This is similar to what unikernels or micro-VMs do today.

4) Machine learning in the kernel. Machine learning is very workload-specific;
for instance, mixing real-time and non-RT scheduling can be challenging for
ML tuning of the CPU scheduler, which is essentially a multi-goal learning
problem.

5) Per-application specialized kernel: for example, running an RT kernel
and a non-RT kernel in parallel. Memory footprint can also be reduced, e.g.
by avoiding 5-level page tables when they are not needed.

I hope this helps.

Regards,
Cong


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-22 14:28     ` Stefan Hajnoczi
@ 2025-09-22 22:41       ` Cong Wang
  2025-09-23 17:05         ` Stefan Hajnoczi
  0 siblings, 1 reply; 44+ messages in thread
From: Cong Wang @ 2025-09-22 22:41 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm, multikernel

On Mon, Sep 22, 2025 at 7:28 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Sat, Sep 20, 2025 at 02:40:18PM -0700, Cong Wang wrote:
> > On Fri, Sep 19, 2025 at 2:27 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Sep 18, 2025 at 03:25:59PM -0700, Cong Wang wrote:
> > > > This patch series introduces multikernel architecture support, enabling
> > > > multiple independent kernel instances to coexist and communicate on a
> > > > single physical machine. Each kernel instance can run on dedicated CPU
> > > > cores while sharing the underlying hardware resources.
> > > >
> > > > The multikernel architecture provides several key benefits:
> > > > - Improved fault isolation between different workloads
> > > > - Enhanced security through kernel-level separation
> > >
> > > What level of isolation does this patch series provide? What stops
> > > kernel A from accessing kernel B's memory pages, sending interrupts to
> > > its CPUs, etc?
> >
> > It is kernel-enforced isolation, therefore, the trust model here is still
> > based on kernel. Hence, a malicious kernel would be able to disrupt,
> > as you described. With memory encryption and IPI filtering, I think
> > that is solvable.
>
> I think solving this is key to the architecture, at least if fault
> isolation and security are goals. A cooperative architecture where
> nothing prevents kernels from interfering with each other simply doesn't
> offer fault isolation or security.

Kernels and kernel modules can be signed today, and kexec also supports
signature verification via kexec_file_load(). That at least mitigates
untrusted kernels, although a signed kernel can still be exploited via a 0-day.
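
For reference, the signed-load path is just the kexec_file_load() syscall;
a minimal user-space sketch (paths, cmdline and error handling are
placeholders, not something from this series):

#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	const char cmdline[] = "console=ttyS0";
	int kernel_fd = open("/boot/vmlinuz", O_RDONLY);
	int initrd_fd = open("/boot/initrd.img", O_RDONLY);

	if (kernel_fd < 0 || initrd_fd < 0) {
		perror("open");
		return 1;
	}

	/* The kernel verifies the image signature before accepting it,
	 * when signature verification is configured/enforced.
	 */
	if (syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
		    sizeof(cmdline), cmdline, 0UL) < 0) {
		perror("kexec_file_load");
		return 1;
	}
	return 0;
}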

>
> On CPU architectures that offer additional privilege modes it may be
> possible to run a supervisor on every CPU to restrict access to
> resources in the spawned kernel. Kernels would need to be modified to
> call into the supervisor instead of accessing certain resources
> directly.
>
> IOMMU and interrupt remapping control would need to be performed by the
> supervisor to prevent spawned kernels from affecting each other.

That's right, security vs performance. A lot of the time we have to balance
between these two. This is why Kata Containers today runs a container
inside a VM.

This largely depends on what users are willing to compromise on; there is
no single right answer here.

For example, in a fully-controlled private cloud, security exploits are
probably not even a concern. Sacrificing performance for a non-concern
is not reasonable.

>
> This seems to be the price of fault isolation and security. It ends up
> looking similar to a hypervisor, but maybe it wouldn't need to use
> virtualization extensions, depending on the capabilities of the CPU
> architecture.

Two more points:

1) Security lockdown. Security lockdown transforms multikernel from
"0-day means total compromise" to "0-day means single workload
compromise with rapid recovery." This is still a significant improvement
over containers where a single kernel 0-day compromises everything
simultaneously.

2) Rapid kernel updates: A more practical way to eliminate 0-day
exploits is to update the kernel more frequently; today the major blocker
is the downtime required by a kernel reboot, which is what multikernel
aims to resolve.

I hope this helps.

Regards,
Cong Wang


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-22 22:41       ` Cong Wang
@ 2025-09-23 17:05         ` Stefan Hajnoczi
  2025-09-24 11:38           ` David Hildenbrand
  2025-09-24 17:18           ` Cong Wang
  0 siblings, 2 replies; 44+ messages in thread
From: Stefan Hajnoczi @ 2025-09-23 17:05 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm, multikernel

[-- Attachment #1: Type: text/plain, Size: 4135 bytes --]

On Mon, Sep 22, 2025 at 03:41:18PM -0700, Cong Wang wrote:
> On Mon, Sep 22, 2025 at 7:28 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Sat, Sep 20, 2025 at 02:40:18PM -0700, Cong Wang wrote:
> > > On Fri, Sep 19, 2025 at 2:27 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Thu, Sep 18, 2025 at 03:25:59PM -0700, Cong Wang wrote:
> > > > > This patch series introduces multikernel architecture support, enabling
> > > > > multiple independent kernel instances to coexist and communicate on a
> > > > > single physical machine. Each kernel instance can run on dedicated CPU
> > > > > cores while sharing the underlying hardware resources.
> > > > >
> > > > > The multikernel architecture provides several key benefits:
> > > > > - Improved fault isolation between different workloads
> > > > > - Enhanced security through kernel-level separation
> > > >
> > > > What level of isolation does this patch series provide? What stops
> > > > kernel A from accessing kernel B's memory pages, sending interrupts to
> > > > its CPUs, etc?
> > >
> > > It is kernel-enforced isolation, therefore, the trust model here is still
> > > based on kernel. Hence, a malicious kernel would be able to disrupt,
> > > as you described. With memory encryption and IPI filtering, I think
> > > that is solvable.
> >
> > I think solving this is key to the architecture, at least if fault
> > isolation and security are goals. A cooperative architecture where
> > nothing prevents kernels from interfering with each other simply doesn't
> > offer fault isolation or security.
> 
> Kernel and kernel modules can be signed today, kexec also supports
> kernel signing via kexec_file_load(). It migrates at least untrusted
> kernels, although kernels can be still exploited via 0-day.

Kernel signing also doesn't protect against bugs in one kernel
interfering with another kernel.

> >
> > On CPU architectures that offer additional privilege modes it may be
> > possible to run a supervisor on every CPU to restrict access to
> > resources in the spawned kernel. Kernels would need to be modified to
> > call into the supervisor instead of accessing certain resources
> > directly.
> >
> > IOMMU and interrupt remapping control would need to be performed by the
> > supervisor to prevent spawned kernels from affecting each other.
> 
> That's right, security vs performance. A lot of times we have to balance
> between these two. This is why Kata Container today runs a container
> inside a VM.
> 
> This largely depends on what users could compromise, there is no single
> right answer here.
> 
> For example, in a fully-controlled private cloud, security exploits are
> probably not even a concern. Sacrificing performance for a non-concern
> is not reasonable.
> 
> >
> > This seems to be the price of fault isolation and security. It ends up
> > looking similar to a hypervisor, but maybe it wouldn't need to use
> > virtualization extensions, depending on the capabilities of the CPU
> > architecture.
> 
> Two more points:
> 
> 1) Security lockdown. Security lockdown transforms multikernel from
> "0-day means total compromise" to "0-day means single workload
> compromise with rapid recovery." This is still a significant improvement
> over containers where a single kernel 0-day compromises everything
> simultaneously.

I don't follow. My understanding is that multikernel currently does not
prevent spawned kernels from affecting each other, so a kernel 0-day in
multikernel still compromises everything?

> 
> 2) Rapid kernel updates: A more practical way to eliminate 0-day
> exploits is to update kernel more frequently, today the major blocker
> is the downtime required by kernel reboot, which is what multikernel
> aims to resolve.

If kernel upgrades are the main use case for multikernel, then I guess
isolation is not necessary. Two kernels would only run side-by-side for
a limited period of time and they would have access to the same
workloads.

Stefan

> 
> I hope this helps.
> 
> Regards,
> Cong Wang
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-22 21:55   ` Cong Wang
@ 2025-09-24  1:12     ` Hillf Danton
  2025-09-24 17:30       ` Cong Wang
  0 siblings, 1 reply; 44+ messages in thread
From: Hillf Danton @ 2025-09-24  1:12 UTC (permalink / raw)
  To: Cong Wang; +Cc: linux-kernel, linux-mm, multikernel

On Mon, 22 Sep 2025 14:55:41 -0700 Cong Wang wrote:
> On Sat, Sep 20, 2025 at 6:47 PM Hillf Danton <hdanton@sina.com> wrote:
> > On Thu, 18 Sep 2025 15:25:59 -0700 Cong Wang wrote:
> > > This patch series introduces multikernel architecture support, enabling
> > > multiple independent kernel instances to coexist and communicate on a
> > > single physical machine. Each kernel instance can run on dedicated CPU
> > > cores while sharing the underlying hardware resources.
> > >
> > > The multikernel architecture provides several key benefits:
> > > - Improved fault isolation between different workloads
> > > - Enhanced security through kernel-level separation
> > > - Better resource utilization than traditional VM (KVM, Xen etc.)
> > > - Potential zero-down kernel update with KHO (Kernel Hand Over)
> > >
> > Could you illustrate a couple of use cases to help understand your idea?
> 
> Sure, below are a few use cases on my mind:
> 
> 1) With sufficient hardware resources: each kernel gets isolated resources
> with real bare metal performance. This applies to all VM/container use cases
> today, just with pure better performance: no virtualization, no noisy neighbor.
> 
> More importantly, they can co-exist. In theory, you can run a multiernel with
> a VM inside and with a container inside the VM.
> 
If the 6.17 eevdf perfs better than the 6.15 one could, their co-exist wastes
bare metal cpu cycles.

> 2) Active-backup kernel for mission-critical tasks: after the primary kernel
> crashes, a backup kernel in parallel immediately takes over without interrupting
> the user-space task.
> 
> Dual-kernel systems are very common for automotives today.
> 
If 6.17 is more stable than 6.14, running the latter sounds like square skull
in the product environment.

> 3) Getting rid of the OS to reduce the attack surface. We could pack everything
> properly in an initramfs and run it directly without bothering a full
> OS. This is similar to what unikernels or macro VM's do today.
> 
Duno

> 4) Machine learning in the kernel. Machine learning is too specific to
> workloads, for instance, mixing real-time scheduling and non-RT can be challenging for
> ML to tune the CPU scheduler, which is an essential multi-goal learning.
> 
No room for CUDA in kernel I think in 2025.

> 5) Per-application specialized kernel: For example, running a RT kernel
> and non-RT kernel in parallel. Memory footprint can also be reduced by
> reducing the 5-level paging tables when necessary.

If RT makes your product earn more money in fewer weeks, why is eevdf
another option, given RT means no schedule at the first place?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-23 17:05         ` Stefan Hajnoczi
@ 2025-09-24 11:38           ` David Hildenbrand
  2025-09-24 12:51             ` Stefan Hajnoczi
  2025-09-24 17:18           ` Cong Wang
  1 sibling, 1 reply; 44+ messages in thread
From: David Hildenbrand @ 2025-09-24 11:38 UTC (permalink / raw)
  To: Stefan Hajnoczi, Cong Wang
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm, multikernel

>>
>> Two more points:
>>
>> 1) Security lockdown. Security lockdown transforms multikernel from
>> "0-day means total compromise" to "0-day means single workload
>> compromise with rapid recovery." This is still a significant improvement
>> over containers where a single kernel 0-day compromises everything
>> simultaneously.
> 
> I don't follow. My understanding is that multikernel currently does not
> prevent spawned kernels from affecting each other, so a kernel 0-day in
> multikernel still compromises everything?

I would assume that if there is no isolation enforced by the hardware
(e.g., virtualization, including partitioning hypervisors like
jailhouse, pkvm etc) nothing would stop kernel A from accessing memory
assigned to kernel B.

And of course, memory is just one of the resources that would not be
properly isolated.

I am not sure that encrypting memory per kernel would really prevent
other kernels from still damaging such kernels.

Also, what stops a kernel from just rebooting the whole machine? Happy to
learn how that will be handled such that there is proper isolation.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-24 11:38           ` David Hildenbrand
@ 2025-09-24 12:51             ` Stefan Hajnoczi
  2025-09-24 18:28               ` Cong Wang
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Hajnoczi @ 2025-09-24 12:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Cong Wang, linux-kernel, pasha.tatashin, Cong Wang,
	Andrew Morton, Baoquan He, Alexander Graf, Mike Rapoport,
	Changyuan Lyu, kexec, linux-mm, multikernel

[-- Attachment #1: Type: text/plain, Size: 1656 bytes --]

On Wed, Sep 24, 2025 at 01:38:31PM +0200, David Hildenbrand wrote:
> > > 
> > > Two more points:
> > > 
> > > 1) Security lockdown. Security lockdown transforms multikernel from
> > > "0-day means total compromise" to "0-day means single workload
> > > compromise with rapid recovery." This is still a significant improvement
> > > over containers where a single kernel 0-day compromises everything
> > > simultaneously.
> > 
> > I don't follow. My understanding is that multikernel currently does not
> > prevent spawned kernels from affecting each other, so a kernel 0-day in
> > multikernel still compromises everything?
> 
> I would assume that if there is no enforced isolation by the hardware (e.g.,
> virtualization, including partitioning hypervisors like jailhouse, pkvm etc)
> nothing would stop a kernel A to access memory assigned to kernel B.
> 
> And of course, memory is just one of the resources that would not be
> properly isolated.
> 
> Not sure if encrypting memory per kernel would really allow to not let other
> kernels still damage such kernels.
> 
> Also, what stops a kernel to just reboot the whole machine? Happy to learn
> how that will be handled such that there is proper isolation.

The reason I've been asking about the fault isolation and security
statements in the cover letter is because it's unclear:
1. What is implemented today in multikernel.
2. What is on the roadmap for multikernel.
3. What is out of scope for multikernel.

Cong: Can you clarify this? If the answer is that fault isolation and
security are out of scope, then this discussion can be skipped.

Thanks,
Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-23 17:05         ` Stefan Hajnoczi
  2025-09-24 11:38           ` David Hildenbrand
@ 2025-09-24 17:18           ` Cong Wang
  1 sibling, 0 replies; 44+ messages in thread
From: Cong Wang @ 2025-09-24 17:18 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm, multikernel

On Tue, Sep 23, 2025 at 10:05 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Mon, Sep 22, 2025 at 03:41:18PM -0700, Cong Wang wrote:
> > On Mon, Sep 22, 2025 at 7:28 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Sat, Sep 20, 2025 at 02:40:18PM -0700, Cong Wang wrote:
> > > > On Fri, Sep 19, 2025 at 2:27 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Thu, Sep 18, 2025 at 03:25:59PM -0700, Cong Wang wrote:
> > > > > > This patch series introduces multikernel architecture support, enabling
> > > > > > multiple independent kernel instances to coexist and communicate on a
> > > > > > single physical machine. Each kernel instance can run on dedicated CPU
> > > > > > cores while sharing the underlying hardware resources.
> > > > > >
> > > > > > The multikernel architecture provides several key benefits:
> > > > > > - Improved fault isolation between different workloads
> > > > > > - Enhanced security through kernel-level separation
> > > > >
> > > > > What level of isolation does this patch series provide? What stops
> > > > > kernel A from accessing kernel B's memory pages, sending interrupts to
> > > > > its CPUs, etc?
> > > >
> > > > It is kernel-enforced isolation, therefore, the trust model here is still
> > > > based on kernel. Hence, a malicious kernel would be able to disrupt,
> > > > as you described. With memory encryption and IPI filtering, I think
> > > > that is solvable.
> > >
> > > I think solving this is key to the architecture, at least if fault
> > > isolation and security are goals. A cooperative architecture where
> > > nothing prevents kernels from interfering with each other simply doesn't
> > > offer fault isolation or security.
> >
> > Kernel and kernel modules can be signed today, kexec also supports
> > kernel signing via kexec_file_load(). It migrates at least untrusted
> > kernels, although kernels can be still exploited via 0-day.
>
> Kernel signing also doesn't protect against bugs in one kernel
> interfering with another kernel.

This is also true; this is why memory encryption and authentication
could help. Hardware vendors can catch up with software, which
is how virtualization evolved (e.g. vDPA didn't exist when KVM was
invented).

>
> > >
> > > On CPU architectures that offer additional privilege modes it may be
> > > possible to run a supervisor on every CPU to restrict access to
> > > resources in the spawned kernel. Kernels would need to be modified to
> > > call into the supervisor instead of accessing certain resources
> > > directly.
> > >
> > > IOMMU and interrupt remapping control would need to be performed by the
> > > supervisor to prevent spawned kernels from affecting each other.
> >
> > That's right, security vs performance. A lot of times we have to balance
> > between these two. This is why Kata Container today runs a container
> > inside a VM.
> >
> > This largely depends on what users could compromise, there is no single
> > right answer here.
> >
> > For example, in a fully-controlled private cloud, security exploits are
> > probably not even a concern. Sacrificing performance for a non-concern
> > is not reasonable.
> >
> > >
> > > This seems to be the price of fault isolation and security. It ends up
> > > looking similar to a hypervisor, but maybe it wouldn't need to use
> > > virtualization extensions, depending on the capabilities of the CPU
> > > architecture.
> >
> > Two more points:
> >
> > 1) Security lockdown. Security lockdown transforms multikernel from
> > "0-day means total compromise" to "0-day means single workload
> > compromise with rapid recovery." This is still a significant improvement
> > over containers where a single kernel 0-day compromises everything
> > simultaneously.
>
> I don't follow. My understanding is that multikernel currently does not
> prevent spawned kernels from affecting each other, so a kernel 0-day in
> multikernel still compromises everything?

Linux kernel lockdown does reduce the blast radius of a 0-day exploit,
but it doesn’t eliminate it. I hope this is clearer.

>
> >
> > 2) Rapid kernel updates: A more practical way to eliminate 0-day
> > exploits is to update kernel more frequently, today the major blocker
> > is the downtime required by kernel reboot, which is what multikernel
> > aims to resolve.
>
> If kernel upgrades are the main use case for multikernel, then I guess
> isolation is not necessary. Two kernels would only run side-by-side for
> a limited period of time and they would have access to the same
> workloads.

Zero-downtime upgrade is probably the last thing we could achieve
with multikernel, as true zero downtime requires significant effort
on kernel-to-kernel coordination, so we would essentially need to
establish a protocol (via KHO, I hope) here.

On the other hand, isolation is relatively easy and more useful.
I understand you don't like kernel-enforced isolation; however, we need to
recognize the success of containers today, regardless of whether we like
it or not.

By the way, although this is just a theory, I hope multikernel does not
prevent users from using virtualization inside it, just as a VM does not
prevent running containers inside it. The choice should always be on the
users' side, not ours.

I hope this helps.

Regards,
Cong Wang


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-24  1:12     ` Hillf Danton
@ 2025-09-24 17:30       ` Cong Wang
  2025-09-24 22:42         ` Hillf Danton
  0 siblings, 1 reply; 44+ messages in thread
From: Cong Wang @ 2025-09-24 17:30 UTC (permalink / raw)
  To: Hillf Danton; +Cc: linux-kernel, linux-mm, multikernel

On Tue, Sep 23, 2025 at 6:12 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Mon, 22 Sep 2025 14:55:41 -0700 Cong Wang wrote:
> > On Sat, Sep 20, 2025 at 6:47 PM Hillf Danton <hdanton@sina.com> wrote:
> > > On Thu, 18 Sep 2025 15:25:59 -0700 Cong Wang wrote:
> > > > This patch series introduces multikernel architecture support, enabling
> > > > multiple independent kernel instances to coexist and communicate on a
> > > > single physical machine. Each kernel instance can run on dedicated CPU
> > > > cores while sharing the underlying hardware resources.
> > > >
> > > > The multikernel architecture provides several key benefits:
> > > > - Improved fault isolation between different workloads
> > > > - Enhanced security through kernel-level separation
> > > > - Better resource utilization than traditional VM (KVM, Xen etc.)
> > > > - Potential zero-down kernel update with KHO (Kernel Hand Over)
> > > >
> > > Could you illustrate a couple of use cases to help understand your idea?
> >
> > Sure, below are a few use cases on my mind:
> >
> > 1) With sufficient hardware resources: each kernel gets isolated resources
> > with real bare metal performance. This applies to all VM/container use cases
> > today, just with pure better performance: no virtualization, no noisy neighbor.
> >
> > More importantly, they can co-exist. In theory, you can run a multiernel with
> > a VM inside and with a container inside the VM.
> >
> If the 6.17 eevdf perfs better than the 6.15 one could, their co-exist wastes
> bare metal cpu cycles.

I think we should never eliminate the ability not to use multikernel; users
should have a choice. Apologies if I didn't make this clear.

And even if you only want one kernel, you might still want
zero-downtime upgrades via multikernel. ;-)

>
> > 2) Active-backup kernel for mission-critical tasks: after the primary kernel
> > crashes, a backup kernel in parallel immediately takes over without interrupting
> > the user-space task.
> >
> > Dual-kernel systems are very common for automotives today.
> >
> If 6.17 is more stable than 6.14, running the latter sounds like square skull
> in the product environment.

I don't think anyone here wants to take away your freedom to do so.
You also have the choice of not using multikernel or kexec, or even
building with CONFIG_KEXEC=n. :)

On the other hand, let's also respect the fact that many automotive systems
today use dual-kernel designs (one for interaction, one for autonomous
driving).

>
> > 3) Getting rid of the OS to reduce the attack surface. We could pack everything
> > properly in an initramfs and run it directly without bothering a full
> > OS. This is similar to what unikernels or macro VM's do today.
> >
> Duno

Same here: the choice is always on the table, and it must be.

>
> > 4) Machine learning in the kernel. Machine learning is too specific to
> > workloads, for instance, mixing real-time scheduling and non-RT can be challenging for
> > ML to tune the CPU scheduler, which is an essential multi-goal learning.
> >
> No room for CUDA in kernel I think in 2025.

Maybe yes. LAKE is a framework for using GPU-accelerated ML
in the kernel:
https://utns.cs.utexas.edu/assets/papers/lake_camera_ready.pdf

If you are interested in this area, there are tons of existing papers.

>
> > 5) Per-application specialized kernel: For example, running a RT kernel
> > and non-RT kernel in parallel. Memory footprint can also be reduced by
> > reducing the 5-level paging tables when necessary.
>
> If RT makes your product earn more money in fewer weeks, why is eevdf
> another option, given RT means no schedule at the first place?

I wish there were a single perfect solution for everyone; unfortunately,
the reality seems to be the opposite.

Regards,
Cong Wang


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
                   ` (11 preceding siblings ...)
  2025-09-21  5:54 ` Jan Engelhardt
@ 2025-09-24 17:51 ` Christoph Lameter (Ampere)
  2025-09-24 18:39   ` Cong Wang
  2025-09-25 15:47 ` Jiaxun Yang
  2025-09-26  9:01 ` Jarkko Sakkinen
  14 siblings, 1 reply; 44+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-09-24 17:51 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm

You don't really need any kernel support to run multiple kernels on an SMP
system. Multiple kernels can operate without a hypervisor or anything
complicated like that.

The firmware can prep the kernel boot so that the kernels operate
on distinct address spaces, processors and I/O resources. You then
have the challenge of communicating between the kernels, which led to
various forms of "ethernet" drivers communicating via shared memory.

This has been done for decades. I ran stuff like this on
SGI hardware and Sparc/Neon back in 2003. ARM should also do this without
too much fuss.

There is even a device driver to share memory between those kernels called
XPMEM (which was developed on SGI machines decades ago for exactly this
purpose). It's ancient.

There is nothing new under the sun as already expressed over 2000 years
ago in Ecclesiastes 1:9. ;-)

The improvement here is that the burden of implementation moves away from
firmware, and we can then have a memory-based "ethernet" shared-memory
implementation independent of the architecture.
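
Concretely, such a memory-based "ethernet" boils down to a descriptor ring
sitting in a region both kernels can map; purely illustrative (this is not
the XPMEM interface nor code from this series):

#include <linux/types.h>

/* Illustrative only: a ring placed in a physically shared region and
 * polled by the inter-kernel "ethernet" driver on each side.
 */
struct ik_ring {
	u32 head;		/* advanced by the producing kernel */
	u32 tail;		/* advanced by the consuming kernel */
	u32 frame_size;		/* fixed slot size */
	u32 nr_frames;		/* number of slots in frames[] */
	u8  frames[];		/* the shared frame slots themselves */
};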

The reason this was done by SGI was to avoid scaling issues. Machines with
high numbers of cores can slow down because of serialization overhead in
the kernel.

The segmentation of the cores into various kernel images reduces the
scaling problem (but then leads to a communication bottleneck).


The rationale for Sparc/Neon was hardware limitations that resulted
in broken cache coherency between cores from multiple sockets. A kernel
expects coherent memory, and thus one kernel for each coherency domain was
a solution. ;-)

AFAICT various contemporary Android deployments already do the multiple-kernel
approach in one way or another, for security purposes and for
specialized controllers. However, those multi-kernel approaches often
depend on specialized and dedicated hardware. It may be difficult to
support them with a generic approach developed here.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-24 12:51             ` Stefan Hajnoczi
@ 2025-09-24 18:28               ` Cong Wang
  2025-09-24 19:03                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 44+ messages in thread
From: Cong Wang @ 2025-09-24 18:28 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: David Hildenbrand, linux-kernel, pasha.tatashin, Cong Wang,
	Andrew Morton, Baoquan He, Alexander Graf, Mike Rapoport,
	Changyuan Lyu, kexec, linux-mm, multikernel

On Wed, Sep 24, 2025 at 5:51 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Wed, Sep 24, 2025 at 01:38:31PM +0200, David Hildenbrand wrote:
> > > >
> > > > Two more points:
> > > >
> > > > 1) Security lockdown. Security lockdown transforms multikernel from
> > > > "0-day means total compromise" to "0-day means single workload
> > > > compromise with rapid recovery." This is still a significant improvement
> > > > over containers where a single kernel 0-day compromises everything
> > > > simultaneously.
> > >
> > > I don't follow. My understanding is that multikernel currently does not
> > > prevent spawned kernels from affecting each other, so a kernel 0-day in
> > > multikernel still compromises everything?
> >
> > I would assume that if there is no enforced isolation by the hardware (e.g.,
> > virtualization, including partitioning hypervisors like jailhouse, pkvm etc)
> > nothing would stop a kernel A to access memory assigned to kernel B.
> >
> > And of course, memory is just one of the resources that would not be
> > properly isolated.
> >
> > Not sure if encrypting memory per kernel would really allow to not let other
> > kernels still damage such kernels.
> >
> > Also, what stops a kernel to just reboot the whole machine? Happy to learn
> > how that will be handled such that there is proper isolation.
>
> The reason I've been asking about the fault isolation and security
> statements in the cover letter is because it's unclear:
> 1. What is implemented today in multikernel.
> 2. What is on the roadmap for multikernel.
> 3. What is out of scope for multikernel.
>
> Cong: Can you clarify this? If the answer is that fault isolation and
> security are out of scope, then this discussion can be skipped.

It is my pleasure. Email is too narrow a medium for this, therefore I wrote a
complete document for you:
https://docs.google.com/document/d/1yneO6O6C_z0Lh3A2QyT8XsH7ZrQ7-naGQT-rpdjWa_g/edit?usp=sharing

I hope it answers all of the above questions and provides a clear
big picture. If not, please let me know.

(If you need edit permission for the above document, please just
request, I will approve.)

Regards,
Cong Wang


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-24 17:51 ` Christoph Lameter (Ampere)
@ 2025-09-24 18:39   ` Cong Wang
  2025-09-26  9:50     ` Jarkko Sakkinen
  0 siblings, 1 reply; 44+ messages in thread
From: Cong Wang @ 2025-09-24 18:39 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm

On Wed, Sep 24, 2025 at 10:51 AM Christoph Lameter (Ampere)
<cl@gentwo.org> wrote:
> AFAICT various contemporary Android deployments do the multiple kernel
> approach in one way or another already for security purposes and for
> specialized controllers. However, the multi kernel approaches are often
> depending on specialized and dedicated hardware. It may be difficult to
> support with a generic approach developed here.

You are right, the multikernel concept is indeed pretty old; the Barrelfish
OS appeared around 2009, and Jailhouse was released 12 years ago.
There are tons of papers in this area too.

Dual-kernel systems, whether based on virtualization or firmware, are indeed
common at least in automotive today. This is solid justification of their
usefulness and real-world practice.

As you stated, this should not depend on any firmware or specialized
hardware, hence I am making this effort here. Let's join efforts, instead
of inventing things in isolation. This is why I not only opened the source
code but also opened the roadmap and invite the whole community to
collaborate.

Regards,
Cong Wang


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-24 18:28               ` Cong Wang
@ 2025-09-24 19:03                 ` Stefan Hajnoczi
  2025-09-27 19:42                   ` Cong Wang
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Hajnoczi @ 2025-09-24 19:03 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Hildenbrand, linux-kernel, pasha.tatashin, Cong Wang,
	Andrew Morton, Baoquan He, Alexander Graf, Mike Rapoport,
	Changyuan Lyu, kexec, linux-mm, multikernel

[-- Attachment #1: Type: text/plain, Size: 2931 bytes --]

On Wed, Sep 24, 2025 at 11:28:04AM -0700, Cong Wang wrote:
> On Wed, Sep 24, 2025 at 5:51 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Wed, Sep 24, 2025 at 01:38:31PM +0200, David Hildenbrand wrote:
> > > > >
> > > > > Two more points:
> > > > >
> > > > > 1) Security lockdown. Security lockdown transforms multikernel from
> > > > > "0-day means total compromise" to "0-day means single workload
> > > > > compromise with rapid recovery." This is still a significant improvement
> > > > > over containers where a single kernel 0-day compromises everything
> > > > > simultaneously.
> > > >
> > > > I don't follow. My understanding is that multikernel currently does not
> > > > prevent spawned kernels from affecting each other, so a kernel 0-day in
> > > > multikernel still compromises everything?
> > >
> > > I would assume that if there is no enforced isolation by the hardware (e.g.,
> > > virtualization, including partitioning hypervisors like jailhouse, pkvm etc)
> > > nothing would stop a kernel A to access memory assigned to kernel B.
> > >
> > > And of course, memory is just one of the resources that would not be
> > > properly isolated.
> > >
> > > Not sure if encrypting memory per kernel would really allow to not let other
> > > kernels still damage such kernels.
> > >
> > > Also, what stops a kernel to just reboot the whole machine? Happy to learn
> > > how that will be handled such that there is proper isolation.
> >
> > The reason I've been asking about the fault isolation and security
> > statements in the cover letter is because it's unclear:
> > 1. What is implemented today in multikernel.
> > 2. What is on the roadmap for multikernel.
> > 3. What is out of scope for multikernel.
> >
> > Cong: Can you clarify this? If the answer is that fault isolation and
> > security are out of scope, then this discussion can be skipped.
> 
> It is my pleasure. The email is too narrow, therefore I wrote a
> complete document for you:
> https://docs.google.com/document/d/1yneO6O6C_z0Lh3A2QyT8XsH7ZrQ7-naGQT-rpdjWa_g/edit?usp=sharing
> 
> I hope it answers all of the above questions and provides a clear
> big picture. If not, please let me know.
> 
> (If you need edit permission for the above document, please just
> request, I will approve.)

Thanks, that gives a nice overview!

The I/O Resource Allocation part will be interesting. Restructuring existing
device drivers to allow spawned kernels to use specific hardware queues
could be a lot of work and very device-specific. I guess a small set of
devices can be supported initially and it can then grow over time.

This also reminds me of VFIO/mdev devices, which would be another
solution to the same problem, but equally device-specific and also a lot
of work to implement the devices that spawned kernels see.

Anyway, I look forward to seeing how this develops.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-24 17:30       ` Cong Wang
@ 2025-09-24 22:42         ` Hillf Danton
  0 siblings, 0 replies; 44+ messages in thread
From: Hillf Danton @ 2025-09-24 22:42 UTC (permalink / raw)
  To: Cong Wang; +Cc: linux-kernel, linux-mm, multikernel

On Wed, 24 Sep 2025 10:30:28 -0700 Cong Wang wrote:
>On Tue, Sep 23, 2025 at 6:12 PM Hillf Danton <hdanton@sina.com> wrote:
>> On Mon, 22 Sep 2025 14:55:41 -0700 Cong Wang wrote:
>> > On Sat, Sep 20, 2025 at 6:47 PM Hillf Danton <hdanton@sina.com> wrote:
>> > > On Thu, 18 Sep 2025 15:25:59 -0700 Cong Wang wrote:
>> > > > This patch series introduces multikernel architecture support, enabling
>> > > > multiple independent kernel instances to coexist and communicate on a
>> > > > single physical machine. Each kernel instance can run on dedicated CPU
>> > > > cores while sharing the underlying hardware resources.
>> > > >
>> > > > The multikernel architecture provides several key benefits:
>> > > > - Improved fault isolation between different workloads
>> > > > - Enhanced security through kernel-level separation
>> > > > - Better resource utilization than traditional VM (KVM, Xen etc.)
>> > > > - Potential zero-down kernel update with KHO (Kernel Hand Over)
>> > > >
>> > > Could you illustrate a couple of use cases to help understand your idea?
>> >
>> > Sure, below are a few use cases on my mind:
>> >
>> > 1) With sufficient hardware resources: each kernel gets isolated resources
>> > with real bare metal performance. This applies to all VM/container use cases
>> > today, just with pure better performance: no virtualization, no noisy neighbor.
>> >
>> > More importantly, they can co-exist. In theory, you can run a multiernel with
>> > a VM inside and with a container inside the VM.
>> >
>> If the 6.17 eevdf perfs better than the 6.15 one could, their co-exist wastes
>> bare metal cpu cycles.
>
> I think we should never eliminate the ability of not using multikernel, users
> should have a choice. Apologize if I didn't make this clear.
> 
If multikernel is one of the features the Thompson and Ritchie Unix offered,
all is fine, simply because the Linux kernel is never the pill expected
to cure all pains, particularly in user space.

> And even if you only want one kernel, you might still want to use
> zero-downtime upgrade via multikernel. ;-)
> 
FYI, what I see in Shenzhen in 2025 in the car-cockpit product environment WRT
multikernel is: a hypervisor like QNX supports multiple virtual machines,
including Android, !Android, Linux and !Linux, RT and !RT.

Hillf


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
                   ` (12 preceding siblings ...)
  2025-09-24 17:51 ` Christoph Lameter (Ampere)
@ 2025-09-25 15:47 ` Jiaxun Yang
  2025-09-27 20:06   ` Cong Wang
  2025-09-26  9:01 ` Jarkko Sakkinen
  14 siblings, 1 reply; 44+ messages in thread
From: Jiaxun Yang @ 2025-09-25 15:47 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm



> On Sep 19, 2025, at 06:25, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> 
> This patch series introduces multikernel architecture support, enabling
> multiple independent kernel instances to coexist and communicate on a
> single physical machine. Each kernel instance can run on dedicated CPU
> cores while sharing the underlying hardware resources.

Hi Cong,

Sorry for chiming in here, and thanks for bringing the replicated-kernel idea
back to life.

I have some experience with the original Popcorn Linux [1] [2], which seems to
be the root of most of the code in this series; please see my comments below.

> 
> The multikernel architecture provides several key benefits:
> - Improved fault isolation between different workloads
> - Enhanced security through kernel-level separation

I'd agree with Stefan's comments [3]: an "isolation" solution is critical for
the adoption of a multikernel OS, given that multi-tenant systems are almost
everywhere.

Also, allowing another kernel to inject IPIs without any restriction poses a
DoS attack risk.

> - Better resource utilization than traditional VM (KVM, Xen etc.)
> - Potential zero-down kernel update with KHO (Kernel Hand Over)
> 
> Architecture Overview:
> The implementation leverages kexec infrastructure to load and manage
> multiple kernel images, with each kernel instance assigned to specific
> CPU cores. Inter-kernel communication is facilitated through a dedicated
> IPI framework that allows kernels to coordinate and share information
> when necessary.
> 
> Key Components:
> 1. Enhanced kexec subsystem with dynamic kimage tracking
> 2. Generic IPI communication framework for inter-kernel messaging

I actually have concerns over inter-kernel communication. The original Popcorn
IPI protocol, which seems to be inherited here, was designed as a prototype,
without much consideration of the ecosystem. It would be nice if we could reuse
existing infrastructure for inter-kernel communication.

I would suggest looking into OpenAMP [4] and the remoteproc subsystem in the
kernel. They already have mature solutions for communication between different
kernels over coherent memory and mailboxes (rpmsg [5] and co.). They also define
ELF extensions to pass side-band information for other kernel images.
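
For reference, the Linux side of such a channel is just a small rpmsg
client; a rough sketch of the existing API (the channel and driver names
are made up, this is not code from this series):

#include <linux/module.h>
#include <linux/rpmsg.h>

static int mk_demo_cb(struct rpmsg_device *rpdev, void *data, int len,
		      void *priv, u32 src)
{
	dev_info(&rpdev->dev, "received %d bytes from 0x%x\n", len, src);
	return 0;
}

static int mk_demo_probe(struct rpmsg_device *rpdev)
{
	static const char msg[] = "hello from the other kernel";

	/* Send a greeting over the channel's default endpoint. */
	return rpmsg_send(rpdev->ept, (void *)msg, sizeof(msg));
}

static const struct rpmsg_device_id mk_demo_id_table[] = {
	{ .name = "multikernel-demo" },
	{ },
};
MODULE_DEVICE_TABLE(rpmsg, mk_demo_id_table);

static struct rpmsg_driver mk_demo_driver = {
	.drv.name	= "mk_demo",
	.id_table	= mk_demo_id_table,
	.probe		= mk_demo_probe,
	.callback	= mk_demo_cb,
};
module_rpmsg_driver(mk_demo_driver);

MODULE_LICENSE("GPL");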

Linaro folks are also working on a new VirtIO transport called virtio-msg [6],
[7], which is designed with the Linux-to-Linux hardware-partitioning scenario
in mind.

> 3. Architecture-specific CPU bootstrap mechanisms (only x86 so far)
> 4. Proc interface for monitoring loaded kernel instances
> 
> 

[…]

Thanks
- Jiaxun

[1]: https://www.kernel.org/doc/ols/2014/ols2014-barbalace.pdf
[2]: https://sourceforge.net/projects/popcornlinux/
[3]: https://lore.kernel.org/all/CAM_iQpXnHr7WC6VN3WB-+=CZGF5pyfo9y9D4MCc_Wwgp29hBrw@mail.gmail.com/
[4]: https://www.openampproject.org/
[5]: https://docs.kernel.org/staging/rpmsg.html
[6]: https://linaro.atlassian.net/wiki/spaces/HVAC/overview
[7]: https://lwn.net/Articles/1031928/

> 
> 
> --
> 2.34.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
                   ` (13 preceding siblings ...)
  2025-09-25 15:47 ` Jiaxun Yang
@ 2025-09-26  9:01 ` Jarkko Sakkinen
  2025-09-27 20:27   ` Cong Wang
  14 siblings, 1 reply; 44+ messages in thread
From: Jarkko Sakkinen @ 2025-09-26  9:01 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm

On Thu, Sep 18, 2025 at 03:25:59PM -0700, Cong Wang wrote:
> This patch series introduces multikernel architecture support, enabling
> multiple independent kernel instances to coexist and communicate on a
> single physical machine. Each kernel instance can run on dedicated CPU
> cores while sharing the underlying hardware resources.
> 
> The multikernel architecture provides several key benefits:
> - Improved fault isolation between different workloads
> - Enhanced security through kernel-level separation
> - Better resource utilization than traditional VM (KVM, Xen etc.)
> - Potential zero-down kernel update with KHO (Kernel Hand Over)

This list reads like asking an AI to list benefits; the whole cover
letter has that type of feel.

I'd probably work on benchmarks and other types of tests that can
deliver comparative figures, and show data comparing workloads under
KVM, namespaces/cgroups and this, reflecting these qualities.

E.g. consider "Enhanced security through kernel-level separation".
That has been a pre-existing feature probably since the dawn of time. Any new
layer obviously makes for a more complex version of "kernel-level separation".
You'd have to prove that this even more complex version is more secure than
the pre-existing one.

kexec, its various corner cases, and how this patch set addresses
them are the part where I'm most lost.

If I look at the one multikernel distro that I know (I don't know any
others, tbh), it's really VT-d and that type of hardware
enforcement that makes Qubes shine:

https://www.qubes-os.org/

That said, I did not look at how/if this is using CPU virtualization
features as part of the solution, so correct me if I'm wrong.

I'm not entirely sure whether this is aimed to be an alternative to
namespaces/cgroups or to VMs, but something more in the direction of Solaris
Zones would imho be a better alternative, at least for containers, because
it saves the overhead of an extra kernel. There's also a patch set
for this:

https://lwn.net/Articles/780364/?ref=alian.info

The VM barrier combined with the IOMMU is pretty strong and hardware
enforced, and with a polished configuration it can be fairly
performant (e.g. via page cache bypass and stuff like that),
so really the overhead that this is fighting against is
context-switch overhead.

In terms of security, I don't believe this has any realistic chance of
winning over VMs and the IOMMU...

BR, Jarkko


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-24 18:39   ` Cong Wang
@ 2025-09-26  9:50     ` Jarkko Sakkinen
  2025-09-27 20:43       ` Cong Wang
  0 siblings, 1 reply; 44+ messages in thread
From: Jarkko Sakkinen @ 2025-09-26  9:50 UTC (permalink / raw)
  To: Cong Wang
  Cc: Christoph Lameter (Ampere),
	linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm

On Wed, Sep 24, 2025 at 11:39:44AM -0700, Cong Wang wrote:
> On Wed, Sep 24, 2025 at 10:51 AM Christoph Lameter (Ampere)
> <cl@gentwo.org> wrote:
> > AFAICT various contemporary Android deployments do the multiple kernel
> > approach in one way or another already for security purposes and for
> > specialized controllers. However, the multi kernel approaches are often
> > depending on specialized and dedicated hardware. It may be difficult to
> > support with a generic approach developed here.
> 
> You are right, the multikernel concept is indeed pretty old, the BarrelFish
> OS was invented in around 2009. Jailhouse was released 12 years ago.
> There are tons of papers in this area too.

Jailhouse is quite nice actually. Perhaps you should pick that up
instead, and start refining and improving it? I'd be interested to
test refined Jailhouse patches. It's also easy to build test images
with the feature using both BuildRoot and Yocto.

It would take me about half a day to create a build target for it.

> Dual-kernel systems, whether using virtualization or firmware, are indeed
> common at least for automotives today. This is a solid justification of its
> usefulness and real-world practice.

OK, so neither virtualization nor firmware is well defined here.
Firmware, e.g., can mean anything from a pre-bootloader to a full
operating system, depending on context or who you ask.

It's also pretty hard to project why VMs are bad for cars, and
despite lacking experience with building operating systems for
cars, I'd like to believe that the hardware enforcement that VT-x
and VT-d type of technologies bring is actually great for cars.

At seemingly every other infosec con someone is hacking a car, and
I've even seen people who've participated in hackathons run by car
manufacturers. That industry is improving gradually, and the
challenge would be to create hard evidence that this brings
better isolation than VM-based solutions.


> 
> As you stated, it should not depend on any firmware or specialized
> hardware, hence I am making this effort here. Let's join the effort, instead
> of inventing things in isolation. This is why I not only open the source code
> but also open the roadmap and invite the whole communication for
> collaboration.

I'm not sure what "specialized hardware" means here, but hardware
features used by e.g. KVM are not in the category of "specialized",
unless you are referring specifically to SNP and TDX?

> 
> Regards,
> Cong Wang

BR, Jarkko


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-24 19:03                 ` Stefan Hajnoczi
@ 2025-09-27 19:42                   ` Cong Wang
  2025-09-29 15:11                     ` Stefan Hajnoczi
  0 siblings, 1 reply; 44+ messages in thread
From: Cong Wang @ 2025-09-27 19:42 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: David Hildenbrand, linux-kernel, pasha.tatashin, Cong Wang,
	Andrew Morton, Baoquan He, Alexander Graf, Mike Rapoport,
	Changyuan Lyu, kexec, linux-mm, multikernel

On Wed, Sep 24, 2025 at 12:03 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> Thanks, that gives a nice overview!
>
> I/O Resource Allocation part will be interesting. Restructuring existing
> device drivers to allow spawned kernels to use specific hardware queues
> could be a lot of work and very device-specific. I guess a small set of
> devices can be supported initially and then it can grow over time.

My idea is to leverage existing technologies like XDP, which
offers huge benefits here:

1) It is based on shared memory (although it is virtual)

2) Its APIs are user-space APIs, which is even stronger for
kernel-to-kernel sharing; this possibly avoids reinventing
another protocol.

3) It provides eBPF.

4) The spawned kernel does not require any hardware knowledge,
just pure XDP-ringbuffer-based software logic.

But it also has limitations:

1) xdp_md is too specific to networking; extending it to storage
could be very challenging. But we could introduce an "SDP" for
storage that just mimics XDP.

2) Regardless, we need a doorbell anyway. An IPI is handy, but
I hope we could have an even lighter one, or more ideally,
redirect the hardware queue IRQ to each target CPU. (A rough
sketch of what such a ring plus doorbell could look like is below.)
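
To make the idea concrete, here is a minimal, userspace-compilable
sketch of the kind of shared ring plus doorbell I have in mind. It is
purely illustrative and not part of the series: mk_ring,
mk_ring_send() and mk_ring_doorbell() are made-up names standing in
for whatever primitives the multikernel framework ends up exposing.

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define MK_RING_SLOTS 256                     /* power of two */
#define MK_SLOT_SIZE  2048

/* Lives in a memory region mapped by both kernel instances. */
struct mk_ring {
        _Atomic uint32_t prod;                /* written by producer kernel */
        _Atomic uint32_t cons;                /* written by consumer kernel */
        uint8_t slot[MK_RING_SLOTS][MK_SLOT_SIZE];
};

/* Hypothetical doorbell: send the multikernel IPI to the target CPU. */
extern void mk_ring_doorbell(int target_cpu);

/* Producer side: copy one frame into the ring and ring the doorbell. */
static int mk_ring_send(struct mk_ring *r, int target_cpu,
                        const void *buf, uint32_t len)
{
        uint32_t prod = atomic_load_explicit(&r->prod, memory_order_relaxed);
        uint32_t cons = atomic_load_explicit(&r->cons, memory_order_acquire);

        if (len > MK_SLOT_SIZE || prod - cons == MK_RING_SLOTS)
                return -1;                    /* frame too big or ring full */

        memcpy(r->slot[prod & (MK_RING_SLOTS - 1)], buf, len);
        /* Publish the data before the new producer index becomes visible. */
        atomic_store_explicit(&r->prod, prod + 1, memory_order_release);
        mk_ring_doorbell(target_cpu);
        return 0;
}

The consumer side would, after receiving the doorbell IPI, compare
cons against prod, process the slots, and advance cons with a release
store, mirroring the producer.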

>
> This also reminds me of VFIO/mdev devices, which would be another
> solution to the same problem, but equally device-specific and also a lot
> of work to implement the devices that spawned kernels see.

Right.

I prototyped VFIO on my side with AI, but failed due to its complex
PCI interface. And the spawned kernel still requires hardware
knowledge to interpret PCI BARs etc.

Regards,
Cong Wang


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-25 15:47 ` Jiaxun Yang
@ 2025-09-27 20:06   ` Cong Wang
  0 siblings, 0 replies; 44+ messages in thread
From: Cong Wang @ 2025-09-27 20:06 UTC (permalink / raw)
  To: Jiaxun Yang
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm, multikernel

Hi Jiaxun,

On Thu, Sep 25, 2025 at 8:48 AM Jiaxun Yang <jiaxun.yang@flygoat.com> wrote:
>
>
>
> > 2025年9月19日 06:25,Cong Wang <xiyou.wangcong@gmail.com> 写道:
> >
> > This patch series introduces multikernel architecture support, enabling
> > multiple independent kernel instances to coexist and communicate on a
> > single physical machine. Each kernel instance can run on dedicated CPU
> > cores while sharing the underlying hardware resources.
>
> Hi Cong,
>
> Sorry for chime in here, and thanks for brining replicated-kernel back to the life.

I have to clarify: in my design, the kernel is not replicated. It is
the opposite: I intend to have diversified kernels, highly customized
for each application.

>
> I have some experience on original Popcorn Linux [1] [2], which seems to be the
> root of most code in this series, please see my comments below.
>
> >
> > The multikernel architecture provides several key benefits:
> > - Improved fault isolation between different workloads
> > - Enhanced security through kernel-level separation
>
> I’d agree with Stefen’s comments [3], an "isolation” solution is critical for adaptation
> of multikernel OS, given that multi-tenant system is almost everywhere.
>
> Also allowing other kernel to inject IPI without any restriction can impose DOS attack
> risk.

This is true. Like I mentioned, this is also a good opportunity to
invite hardware (CPU) vendors to catch up with software; for example,
they could provide hardware filtering for IPIs via an MSR.

If we look at how virtualization evolved, hardware follows software:
VMCS came after Xen and KVM, vDPA came after virtio.

>
> > - Better resource utilization than traditional VM (KVM, Xen etc.)
> > - Potential zero-down kernel update with KHO (Kernel Hand Over)
> >
> > Architecture Overview:
> > The implementation leverages kexec infrastructure to load and manage
> > multiple kernel images, with each kernel instance assigned to specific
> > CPU cores. Inter-kernel communication is facilitated through a dedicated
> > IPI framework that allows kernels to coordinate and share information
> > when necessary.
> >
> > Key Components:
> > 1. Enhanced kexec subsystem with dynamic kimage tracking
> > 2. Generic IPI communication framework for inter-kernel messaging
>
> I actually have concerns over inter-kernel communication. The origin Popcorn
> IPI protocol, which seems to be inherited here, was designed as a prototype,
> without much consideration on the ecosystem. It would be nice if we can reused
> existing infra design for inter kernel communication.

Popcorn does the opposite: it still stays with a single image, which
is essentially against isolation. In fact, I also read its latest
paper this year, and I don't see any essential change in this overall
direction:
https://www.ssrg.ece.vt.edu/papers/asplos25.pdf

This is why Popcorn is fundamentally not suitable for isolation.
Please don't get me wrong: I am not questioning its usefulness; they
are simply two opposite directions. I wish people the best of luck
with the heterogeneous-ISA design, and I hope major CPU vendors will
catch up with you too. :)

>
> I would suggest look into OpenAMP [4] and remoteproc subsystem in kernel. They
> already have mature solutions on communication between different kernels over coherent
> memory and mailboxes (rpmsg [5] co). They also defined ELF extensions to pass side band
> information for other kernel images.

Thanks for the pointers. Jim Huang also shared his ideas on
remoteproc at LinuxCon this year. After evaluation, I found
remoteproc may not be as good as IPIs: remoteproc is designed for
heterogeneous systems with different architectures, which adds
unnecessary abstraction layers here.

>
> Linaro folks are also working on a new VirtIO transport called virtio-msg [6], [7], which is designed
> with Linux-Linux hardware partitioning scenario in mind.

I think there is still a fundamental difference between static
partitioning and elastic resource allocation.

Static partitioning can be achieved as a default case of dynamic allocation
when resources remain unchanged, but the reverse is not possible.

Hope this makes sense to you.

Regards,
Cong Wang


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-26  9:01 ` Jarkko Sakkinen
@ 2025-09-27 20:27   ` Cong Wang
  2025-09-27 20:39     ` Pasha Tatashin
  2025-09-28 14:08     ` Jarkko Sakkinen
  0 siblings, 2 replies; 44+ messages in thread
From: Cong Wang @ 2025-09-27 20:27 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm, multikernel

On Fri, Sep 26, 2025 at 2:01 AM Jarkko Sakkinen <jarkko@kernel.org> wrote:
>
> On Thu, Sep 18, 2025 at 03:25:59PM -0700, Cong Wang wrote:
> > This patch series introduces multikernel architecture support, enabling
> > multiple independent kernel instances to coexist and communicate on a
> > single physical machine. Each kernel instance can run on dedicated CPU
> > cores while sharing the underlying hardware resources.
> >
> > The multikernel architecture provides several key benefits:
> > - Improved fault isolation between different workloads
> > - Enhanced security through kernel-level separation
> > - Better resource utilization than traditional VM (KVM, Xen etc.)
> > - Potential zero-down kernel update with KHO (Kernel Hand Over)
>
> This list is like asking AI to list benefits, or like the whole cover
> letter has that type of feel.

Sorry for giving you that feeling. Please let me know how I can
improve it for you.

>
> I'd probably work on benchmarks and other types of tests that can
> deliver comparative figures, and show data that addresses workloads
> with KVM, namespaces/cgroups and this, reflecting these qualities.

Sure, I think performance comes after usability, not vice versa.


>
> E.g. consider "Enhanced security through kernel-level separation".
> It's a pre-existing feature probably since dawn of time. Any new layer
> makes obviously more complex version "kernel-level separation". You'd
> had to prove that this even more complex version is more secure than
> pre-existing science.

Apologies for this. Would you mind explaining why this is more
complex than the KVM/QEMU/vhost/virtio/vDPA stack?

>
> kexec and its various corner cases and how this patch set addresses
> them is the part where I'm most lost.

Sorry for that. I will post YouTube videos explaining kexec in
detail; please follow our YouTube channel if you are interested. (I
don't want to post a link here in case people think I am promoting my
own interests; please email me privately.)

>
> If I look at one of multikernel distros (I don't know any other
> tbh) that I know it's really VT-d and that type of hardware
> enforcement that make Qubes shine:
>
> https://www.qubes-os.org/
>
> That said, I did not look how/if this is using CPU virtualization
> features as part of the solution, so correct me if I'm wrong.

Qubes OS is based on Xen:
https://en.wikipedia.org/wiki/Qubes_OS

>
> I'm not entirely sure whether this is aimed to be alternative to
> namespaces/cgroups or vms but more in the direction of Solaris Zones
> would be imho better alternative at least for containers because
> it saves the overhead of an extra kernel. There's also a patch set
> for this:
>
> https://lwn.net/Articles/780364/?ref=alian.info

Solaris Zones also share a single kernel. Or maybe you meant Kernel
Zones? Isn't that a justification for our multikernel approach for
Linux? :-)

BTW, it is less flexible since it completely isolates kernels
without inter-kernel communication. With our design, you can
still choose not to use inter-kernel IPIs, which turns dynamic
into static.

>
> VM barrier combined with IOMMU is pretty strong and hardware
> enforced, and with polished configuration it can be fairly
> performant (e.g. via page cache bypass and stuff like that)
> so really the overhead that this is fighting against is
> context switch overhead.
>
> In security I don't believe this has any realistic chances to
> win over VMs and IOMMU...

I appreciate you sharing your opinions. I hope my information
helps.

Regards,
Cong Wang


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-27 20:27   ` Cong Wang
@ 2025-09-27 20:39     ` Pasha Tatashin
  2025-09-28 14:08     ` Jarkko Sakkinen
  1 sibling, 0 replies; 44+ messages in thread
From: Pasha Tatashin @ 2025-09-27 20:39 UTC (permalink / raw)
  To: Cong Wang
  Cc: Jarkko Sakkinen, linux-kernel, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm, multikernel

On Sat, Sep 27, 2025 at 4:27 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Fri, Sep 26, 2025 at 2:01 AM Jarkko Sakkinen <jarkko@kernel.org> wrote:
> >
> > On Thu, Sep 18, 2025 at 03:25:59PM -0700, Cong Wang wrote:
> > > This patch series introduces multikernel architecture support, enabling
> > > multiple independent kernel instances to coexist and communicate on a
> > > single physical machine. Each kernel instance can run on dedicated CPU
> > > cores while sharing the underlying hardware resources.
> > >
> > > The multikernel architecture provides several key benefits:
> > > - Improved fault isolation between different workloads
> > > - Enhanced security through kernel-level separation
> > > - Better resource utilization than traditional VM (KVM, Xen etc.)
> > > - Potential zero-down kernel update with KHO (Kernel Hand Over)
> >
> > This list is like asking AI to list benefits, or like the whole cover
> > letter has that type of feel.
>
> Sorry for giving you that feeling. Please let me know how I can
> improve it for you.
>
> >
> > I'd probably work on benchmarks and other types of tests that can
> > deliver comparative figures, and show data that addresses workloads
> > with KVM, namespaces/cgroups and this, reflecting these qualities.
>
> Sure, I think performance comes after usability, not vice versa.
>
>
> >
> > E.g. consider "Enhanced security through kernel-level separation".
> > It's a pre-existing feature probably since dawn of time. Any new layer
> > makes obviously more complex version "kernel-level separation". You'd
> > had to prove that this even more complex version is more secure than
> > pre-existing science.
>
> Apologize for this. Do you mind explaining why this is more complex
> than the KVM/Qemu/vhost/virtio/VDPA stack?
>
> >
> > kexec and its various corner cases and how this patch set addresses
> > them is the part where I'm most lost.
>
> Sorry for that. I will post Youtube videos to explain kexec in detail,
> please follow our Youtube channel if you are interested. (I don't
> want to post a link here in case people think I am promoting my
> own interest, please email me privately.)
>
> >
> > If I look at one of multikernel distros (I don't know any other
> > tbh) that I know it's really VT-d and that type of hardware
> > enforcement that make Qubes shine:
> >
> > https://www.qubes-os.org/
> >
> > That said, I did not look how/if this is using CPU virtualization
> > features as part of the solution, so correct me if I'm wrong.
>
> Qubes OS is based on Xen:
> https://en.wikipedia.org/wiki/Qubes_OS
>
> >
> > I'm not entirely sure whether this is aimed to be alternative to
> > namespaces/cgroups or vms but more in the direction of Solaris Zones
> > would be imho better alternative at least for containers because
> > it saves the overhead of an extra kernel. There's also a patch set
> > for this:
> >
> > https://lwn.net/Articles/780364/?ref=alian.info
>
> Solaris Zones also share a single kernel. Or maybe I guess
> you meant Kernel Zones? Isn't it a justification for our multikernel
> approach for Linux? :-)

Solaris kernel zones use the sun4v hypervisor to enforce isolation.
There is no such thing on x86 and arm.

Pasha


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-26  9:50     ` Jarkko Sakkinen
@ 2025-09-27 20:43       ` Cong Wang
  2025-09-28 14:22         ` Jarkko Sakkinen
  0 siblings, 1 reply; 44+ messages in thread
From: Cong Wang @ 2025-09-27 20:43 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Christoph Lameter (Ampere),
	linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm, multikernel

On Fri, Sep 26, 2025 at 2:50 AM Jarkko Sakkinen <jarkko@kernel.org> wrote:
>
> On Wed, Sep 24, 2025 at 11:39:44AM -0700, Cong Wang wrote:
> > On Wed, Sep 24, 2025 at 10:51 AM Christoph Lameter (Ampere)
> > <cl@gentwo.org> wrote:
> > > AFAICT various contemporary Android deployments do the multiple kernel
> > > approach in one way or another already for security purposes and for
> > > specialized controllers. However, the multi kernel approaches are often
> > > depending on specialized and dedicated hardware. It may be difficult to
> > > support with a generic approach developed here.
> >
> > You are right, the multikernel concept is indeed pretty old, the BarrelFish
> > OS was invented in around 2009. Jailhouse was released 12 years ago.
> > There are tons of papers in this area too.
>
> Jailhouse is quite nice actually. Perhaps you should pick that up
> instead, and start refining and improving it? I'd be interested to test
> refined jailhouse patches. It's also easy build test images having the
> feature both with BuildRoot and Yocto.

Static partitioning is not a bad choice, except that it is less
flexible. We can't get dynamic resource allocation with just static
partitioning, but we can easily get static partitioning with dynamic
allocation; in fact, it should be the default case.

In my own opinion, the reason why containers today are more popular
than VMs is not just performance; it is elasticity too. Static
partitioning is essentially against elasticity.

More fundamentally, it is based on VMCS, which essentially requires
a hypervisor:
https://github.com/siemens/jailhouse/blob/master/hypervisor/control.c

>
> It would take me like half'ish day to create build target for it.
>
> > Dual-kernel systems, whether using virtualization or firmware, are indeed
> > common at least for automotives today. This is a solid justification of its
> > usefulness and real-world practice.
>
> OK so neither virtualization nor firmware are well defined here.
> Firmware e.g. can mean anything fro pre-bootloader to full operating
> system depending on context or who you ask.
>
> It's also pretty hard to project why VMs are bad for cars, and
> despite lacking experience with building operating systems for
> cars, I'd like to believe that the hardware enforcement that VT-x
> and VT-d type of technologies bring is actually great for cars.
>
> It's like every other infosec con where someone is hacking a car,
> and I seen even people who've participated to hackatons by car
> manufacturers. That industry is improving gradually and the
> challenge would be to create hard evidence that this brings
> better isolation than VM based solutions..

In case it is still not clear: no one wants to stop you from using a
VM. In fact, at least in theory, you could run a VM inside a
multikernel instance, just like today we can still run a container in
a VM (Kata Containers).

Your choice is always on the table.

I hope this helps.

Regards,
Cong Wang


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-27 20:27   ` Cong Wang
  2025-09-27 20:39     ` Pasha Tatashin
@ 2025-09-28 14:08     ` Jarkko Sakkinen
  1 sibling, 0 replies; 44+ messages in thread
From: Jarkko Sakkinen @ 2025-09-28 14:08 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm, multikernel

On Sat, Sep 27, 2025 at 01:27:04PM -0700, Cong Wang wrote:
> On Fri, Sep 26, 2025 at 2:01 AM Jarkko Sakkinen <jarkko@kernel.org> wrote:
> >
> > On Thu, Sep 18, 2025 at 03:25:59PM -0700, Cong Wang wrote:
> > > This patch series introduces multikernel architecture support, enabling
> > > multiple independent kernel instances to coexist and communicate on a
> > > single physical machine. Each kernel instance can run on dedicated CPU
> > > cores while sharing the underlying hardware resources.
> > >
> > > The multikernel architecture provides several key benefits:
> > > - Improved fault isolation between different workloads
> > > - Enhanced security through kernel-level separation
> > > - Better resource utilization than traditional VM (KVM, Xen etc.)
> > > - Potential zero-down kernel update with KHO (Kernel Hand Over)
> >
> > This list is like asking AI to list benefits, or like the whole cover
> > letter has that type of feel.
> 
> Sorry for giving you that feeling. Please let me know how I can
> improve it for you.

There is no evidence of any of these benefits. That's the central
issue. You pretty much must give quantitative proof for each of these
claims, or the benefit is imaginary.

> 
> >
> > I'd probably work on benchmarks and other types of tests that can
> > deliver comparative figures, and show data that addresses workloads
> > with KVM, namespaces/cgroups and this, reflecting these qualities.
> 
> Sure, I think performance comes after usability, not vice versa.
> 
> 
> >
> > E.g. consider "Enhanced security through kernel-level separation".
> > It's a pre-existing feature probably since dawn of time. Any new layer
> > makes obviously more complex version "kernel-level separation". You'd
> > had to prove that this even more complex version is more secure than
> > pre-existing science.
> 
> Apologize for this. Do you mind explaining why this is more complex
> than the KVM/Qemu/vhost/virtio/VDPA stack?

KVM does not complicate kernel-level separation or per-kernel-instance
access control at all. A guest, at the end of the day, is just a fancy
executable.

This feature, on the other hand, intervenes in various fragile code
paths.

> 
> >
> > kexec and its various corner cases and how this patch set addresses
> > them is the part where I'm most lost.
> 
> Sorry for that. I will post Youtube videos to explain kexec in detail,
> please follow our Youtube channel if you are interested. (I don't
> want to post a link here in case people think I am promoting my
> own interest, please email me privately.)

Here I have to say that posting a YouTube link to LKML in your own
interest is not unacceptable as far as I'm concerned :-)

That said, I don't promise that I will watch any of the YouTube
videos posted either here or privately. All the quantitative
proof should be embedded in the patches.

> 
> >
> > If I look at one of multikernel distros (I don't know any other
> > tbh) that I know it's really VT-d and that type of hardware
> > enforcement that make Qubes shine:
> >
> > https://www.qubes-os.org/
> >
> > That said, I did not look how/if this is using CPU virtualization
> > features as part of the solution, so correct me if I'm wrong.
> 
> Qubes OS is based on Xen:
> https://en.wikipedia.org/wiki/Qubes_OS


Yes, and it works great, and has much stronger security metrics than
this could ever reach, and that is a quantitative fact, thanks to
great technologies such as VT-d :-)

This is why I'm repeating the requirement for quantitative proof. We
already have great solutions for most of what this can do, so
building evidence of usefulness is the huge stretch this patch set
would need to make.

Nothing personal, but with what are currently basically just claims,
I don't believe in this. That said, by saying this I don't mean I've
picked my side for good. If there is enough evidence, I'm always
ready to turn my opinion 180 degrees.

> 
> >
> > I'm not entirely sure whether this is aimed to be alternative to
> > namespaces/cgroups or vms but more in the direction of Solaris Zones
> > would be imho better alternative at least for containers because
> > it saves the overhead of an extra kernel. There's also a patch set
> > for this:
> >
> > https://lwn.net/Articles/780364/?ref=alian.info
> 
> Solaris Zones also share a single kernel. Or maybe I guess
> you meant Kernel Zones? Isn't it a justification for our multikernel
> approach for Linux? :-)
> 
> BTW, it is less flexible since it completely isolates kernels
> without inter-kernel communication. With our design, you can
> still choose not to use inter-kernel IPI's, which turns dynamic
> into static.
> 
> >
> > VM barrier combined with IOMMU is pretty strong and hardware
> > enforced, and with polished configuration it can be fairly
> > performant (e.g. via page cache bypass and stuff like that)
> > so really the overhead that this is fighting against is
> > context switch overhead.
> >
> > In security I don't believe this has any realistic chances to
> > win over VMs and IOMMU...
> 
> I appreciate you sharing your opinions. I hope my information
> helps.

I'd put a strong focus on getting figures alongside the claims :-)

> 
> Regards,
> Cong Wang

BR, Jarkko


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-27 20:43       ` Cong Wang
@ 2025-09-28 14:22         ` Jarkko Sakkinen
  2025-09-28 14:36           ` Jarkko Sakkinen
  0 siblings, 1 reply; 44+ messages in thread
From: Jarkko Sakkinen @ 2025-09-28 14:22 UTC (permalink / raw)
  To: Cong Wang
  Cc: Christoph Lameter (Ampere),
	linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm, multikernel

On Sat, Sep 27, 2025 at 01:43:23PM -0700, Cong Wang wrote:
> On Fri, Sep 26, 2025 at 2:50 AM Jarkko Sakkinen <jarkko@kernel.org> wrote:
> >
> > On Wed, Sep 24, 2025 at 11:39:44AM -0700, Cong Wang wrote:
> > > On Wed, Sep 24, 2025 at 10:51 AM Christoph Lameter (Ampere)
> > > <cl@gentwo.org> wrote:
> > > > AFAICT various contemporary Android deployments do the multiple kernel
> > > > approach in one way or another already for security purposes and for
> > > > specialized controllers. However, the multi kernel approaches are often
> > > > depending on specialized and dedicated hardware. It may be difficult to
> > > > support with a generic approach developed here.
> > >
> > > You are right, the multikernel concept is indeed pretty old, the BarrelFish
> > > OS was invented in around 2009. Jailhouse was released 12 years ago.
> > > There are tons of papers in this area too.
> >
> > Jailhouse is quite nice actually. Perhaps you should pick that up
> > instead, and start refining and improving it? I'd be interested to test
> > refined jailhouse patches. It's also easy build test images having the
> > feature both with BuildRoot and Yocto.
> 
> Static partitioning is not a bad choice, except it is less flexible. We can't
> get dynamic resource allocation with just static partitioning, but we can
> easily get static partitioning with dynamic allocation, in fact, it should be
> the default case.
> 
> In my own opinion, the reason why containers today are more popular
> than VM's is not just performance, it is elasticity too. Static partitioning
> is essentially against elasticity.

How do you make a popularity comparison between VMs and containers,
and what does the word "popularity" mean in this context? The whole
world basically runs on guest VMs (just go check AWS, Azure, Oracle
Cloud and whatnot).

The problem with that argument is that there is no problem.

BR, Jarkko


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-28 14:22         ` Jarkko Sakkinen
@ 2025-09-28 14:36           ` Jarkko Sakkinen
  2025-09-28 14:41             ` Jarkko Sakkinen
  0 siblings, 1 reply; 44+ messages in thread
From: Jarkko Sakkinen @ 2025-09-28 14:36 UTC (permalink / raw)
  To: Cong Wang
  Cc: Christoph Lameter (Ampere),
	linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm, multikernel

On Sun, Sep 28, 2025 at 05:22:43PM +0300, Jarkko Sakkinen wrote:
> On Sat, Sep 27, 2025 at 01:43:23PM -0700, Cong Wang wrote:
> > On Fri, Sep 26, 2025 at 2:50 AM Jarkko Sakkinen <jarkko@kernel.org> wrote:
> > >
> > > On Wed, Sep 24, 2025 at 11:39:44AM -0700, Cong Wang wrote:
> > > > On Wed, Sep 24, 2025 at 10:51 AM Christoph Lameter (Ampere)
> > > > <cl@gentwo.org> wrote:
> > > > > AFAICT various contemporary Android deployments do the multiple kernel
> > > > > approach in one way or another already for security purposes and for
> > > > > specialized controllers. However, the multi kernel approaches are often
> > > > > depending on specialized and dedicated hardware. It may be difficult to
> > > > > support with a generic approach developed here.
> > > >
> > > > You are right, the multikernel concept is indeed pretty old, the BarrelFish
> > > > OS was invented in around 2009. Jailhouse was released 12 years ago.
> > > > There are tons of papers in this area too.
> > >
> > > Jailhouse is quite nice actually. Perhaps you should pick that up
> > > instead, and start refining and improving it? I'd be interested to test
> > > refined jailhouse patches. It's also easy build test images having the
> > > feature both with BuildRoot and Yocto.
> > 
> > Static partitioning is not a bad choice, except it is less flexible. We can't
> > get dynamic resource allocation with just static partitioning, but we can
> > easily get static partitioning with dynamic allocation, in fact, it should be
> > the default case.
> > 
> > In my own opinion, the reason why containers today are more popular
> > than VM's is not just performance, it is elasticity too. Static partitioning
> > is essentially against elasticity.
> 
> How do you make a popularity comparison between VMs and containers, and
> what does the word "popularity" means in the context? The whole world
> runs basically runs with guest VMs (just go to check AWS, Azure, Oracle
> Cloud and what not).
> 
> The problem in that argument is that there is no problem.

If I were working on such a feature, I would probably package it for
e.g. BuildRoot with a BR2_EXTERNAL-type Git tree, and create a user
space that can run some tests and benchmarks that actually highlight
the benefits.

Then, I would replace the existing cover letter with something with a
clear problem statement and motivation instead of whitepaper-like
claims.

We can argue to eternity about the qualitative aspects of any
feature, but it is quantitative proof that actually drives things
forward.

BR, Jarkko


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-28 14:36           ` Jarkko Sakkinen
@ 2025-09-28 14:41             ` Jarkko Sakkinen
  0 siblings, 0 replies; 44+ messages in thread
From: Jarkko Sakkinen @ 2025-09-28 14:41 UTC (permalink / raw)
  To: Cong Wang
  Cc: Christoph Lameter (Ampere),
	linux-kernel, pasha.tatashin, Cong Wang, Andrew Morton,
	Baoquan He, Alexander Graf, Mike Rapoport, Changyuan Lyu, kexec,
	linux-mm, multikernel

On Sun, Sep 28, 2025 at 05:36:32PM +0300, Jarkko Sakkinen wrote:
> On Sun, Sep 28, 2025 at 05:22:43PM +0300, Jarkko Sakkinen wrote:
> > On Sat, Sep 27, 2025 at 01:43:23PM -0700, Cong Wang wrote:
> > > On Fri, Sep 26, 2025 at 2:50 AM Jarkko Sakkinen <jarkko@kernel.org> wrote:
> > > >
> > > > On Wed, Sep 24, 2025 at 11:39:44AM -0700, Cong Wang wrote:
> > > > > On Wed, Sep 24, 2025 at 10:51 AM Christoph Lameter (Ampere)
> > > > > <cl@gentwo.org> wrote:
> > > > > > AFAICT various contemporary Android deployments do the multiple kernel
> > > > > > approach in one way or another already for security purposes and for
> > > > > > specialized controllers. However, the multi kernel approaches are often
> > > > > > depending on specialized and dedicated hardware. It may be difficult to
> > > > > > support with a generic approach developed here.
> > > > >
> > > > > You are right, the multikernel concept is indeed pretty old, the BarrelFish
> > > > > OS was invented in around 2009. Jailhouse was released 12 years ago.
> > > > > There are tons of papers in this area too.
> > > >
> > > > Jailhouse is quite nice actually. Perhaps you should pick that up
> > > > instead, and start refining and improving it? I'd be interested to test
> > > > refined jailhouse patches. It's also easy build test images having the
> > > > feature both with BuildRoot and Yocto.
> > > 
> > > Static partitioning is not a bad choice, except it is less flexible. We can't
> > > get dynamic resource allocation with just static partitioning, but we can
> > > easily get static partitioning with dynamic allocation, in fact, it should be
> > > the default case.
> > > 
> > > In my own opinion, the reason why containers today are more popular
> > > than VM's is not just performance, it is elasticity too. Static partitioning
> > > is essentially against elasticity.
> > 
> > How do you make a popularity comparison between VMs and containers, and
> > what does the word "popularity" means in the context? The whole world
> > runs basically runs with guest VMs (just go to check AWS, Azure, Oracle
> > Cloud and what not).
> > 
> > The problem in that argument is that there is no problem.
> 
> If I was working on such a feature I would probably package it for e.g,
> BuildRoot with BR2_EXTERNAL type of Git and create a user space that
> can run some test and benchmarks that actually highlight the benefits.
> 
> Then, I would trash the existing cover letter with something with clear
> problem statement and motivation instead of whitepaper alike claims.
> 
> We can argue to the eterenity with qualitative aspects of any feature
> but it is the quantitative proof that actually drives things forward.

When modifying kexec, I'd also carefully check that more complex use
cases, such as IMA, remain compatible. I don't know if there is an
issue with secure boot, but I'd make sure that there is no friction
with it either.

There are also shared security-related hardware resources such as the
TPM; in this context two instances would end up sharing it for e.g.
measurements, and that type of cross-communication could have
unpredictable consequences (it would need to be checked).

BR, Jarkko


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-27 19:42                   ` Cong Wang
@ 2025-09-29 15:11                     ` Stefan Hajnoczi
  2025-10-02  4:17                       ` Cong Wang
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Hajnoczi @ 2025-09-29 15:11 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Hildenbrand, linux-kernel, pasha.tatashin, Cong Wang,
	Andrew Morton, Baoquan He, Alexander Graf, Mike Rapoport,
	Changyuan Lyu, kexec, linux-mm, multikernel, jasowang


On Sat, Sep 27, 2025 at 12:42:23PM -0700, Cong Wang wrote:
> On Wed, Sep 24, 2025 at 12:03 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > Thanks, that gives a nice overview!
> >
> > I/O Resource Allocation part will be interesting. Restructuring existing
> > device drivers to allow spawned kernels to use specific hardware queues
> > could be a lot of work and very device-specific. I guess a small set of
> > devices can be supported initially and then it can grow over time.
> 
> My idea is to leverage existing technologies like XDP, which
> offers huge benefits here:
> 
> 1) It is based on shared memory (although it is virtual)
> 
> 2) Its API's are user-space API's, which is even stronger for
> kernel-to-kernel sharing, this possibly avoids re-inventing
> another protocol.
> 
> 3) It provides eBPF.
> 
> 4) The spawned kernel does not require any hardware knowledge,
> just pure XDP-ringbuffer-based software logic.
> 
> But it also has limitations:
> 
> 1) xdp_md is too specific for networking, extending it to storage
> could be very challenging. But we could introduce a SDP for
> storage to just mimic XDP.
> 
> 2) Regardless, we need a doorbell anyway. IPI is handy, but
> I hope we could have an even lighter one. Or more ideally,
> redirecting the hardware queue IRQ into each target CPU.

I see. I was thinking that spawned kernels would talk directly to the
hardware. Your idea of using a software interface is less invasive but
has an overhead similar to paravirtualized devices.

A software approach that supports a wider range of devices is
virtio_vdpa (drivers/vdpa/). The current virtio_vdpa implementation
assumes that the device is located in the same kernel. A
kernel-to-kernel bridge would be needed so that the spawned kernel
forwards the vDPA operations to the other kernel. The other kernel
provides the virtio-net, virtio-blk, etc device functionality by passing
requests to a netdev, blkdev, etc.
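
To sketch the bridge idea in code: everything below is invented for
illustration (struct mk_vdpa_req, mk_chan_send() and the mk_bridge_*
helpers are not real APIs; the real thing would plug into the
vdpa_config_ops callbacks in include/linux/vdpa.h). The spawned
kernel's ops would simply marshal each operation into a small request
and send it over the inter-kernel channel:

#include <stdint.h>

enum mk_vdpa_op {
        MK_VDPA_SET_VQ_ADDRESS,
        MK_VDPA_SET_VQ_NUM,
        MK_VDPA_KICK_VQ,
};

struct mk_vdpa_req {
        uint32_t op;          /* enum mk_vdpa_op */
        uint16_t vq;          /* virtqueue index */
        uint64_t desc_area;   /* addresses of the virtqueue rings */
        uint64_t driver_area;
        uint64_t device_area;
        uint32_t num;         /* queue size for SET_VQ_NUM */
};

/* Hypothetical transport into the kernel that owns the real device. */
extern int mk_chan_send(const struct mk_vdpa_req *req);

/* Spawned-kernel side: forward a doorbell kick instead of touching HW. */
static int mk_bridge_kick_vq(uint16_t vq)
{
        struct mk_vdpa_req req = {
                .op = MK_VDPA_KICK_VQ,
                .vq = vq,
        };

        return mk_chan_send(&req);
}

The kernel that owns the device would decode each request and call
into its real virtio/vDPA backend on behalf of the spawned kernel.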

There are in-kernel simulator devices for virtio-net and virtio-blk in
drivers/vdpa/vdpa_sim/ which can be used as a starting point. These
devices are just for testing and would need to be fleshed out to become
useful for real workloads.

I have CCed Jason Wang, who maintains vDPA, in case you want to discuss
it more.

> 
> >
> > This also reminds me of VFIO/mdev devices, which would be another
> > solution to the same problem, but equally device-specific and also a lot
> > of work to implement the devices that spawned kernels see.
> 
> Right.
> 
> I prototyped VFIO on my side with AI, but failed with its complex PCI
> interface. And the spawn kernel still requires hardware knowledge
> to interpret PCI BAR etc..

Yeah, it's complex and invasive. :/

Stefan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC Patch 0/7] kernel: Introduce multikernel architecture support
  2025-09-29 15:11                     ` Stefan Hajnoczi
@ 2025-10-02  4:17                       ` Cong Wang
  0 siblings, 0 replies; 44+ messages in thread
From: Cong Wang @ 2025-10-02  4:17 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: David Hildenbrand, linux-kernel, pasha.tatashin, Cong Wang,
	Andrew Morton, Baoquan He, Alexander Graf, Mike Rapoport,
	Changyuan Lyu, kexec, linux-mm, multikernel, jasowang

On Mon, Sep 29, 2025 at 8:12 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Sat, Sep 27, 2025 at 12:42:23PM -0700, Cong Wang wrote:
> > On Wed, Sep 24, 2025 at 12:03 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > Thanks, that gives a nice overview!
> > >
> > > I/O Resource Allocation part will be interesting. Restructuring existing
> > > device drivers to allow spawned kernels to use specific hardware queues
> > > could be a lot of work and very device-specific. I guess a small set of
> > > devices can be supported initially and then it can grow over time.
> >
> > My idea is to leverage existing technologies like XDP, which
> > offers huge benefits here:
> >
> > 1) It is based on shared memory (although it is virtual)
> >
> > 2) Its API's are user-space API's, which is even stronger for
> > kernel-to-kernel sharing, this possibly avoids re-inventing
> > another protocol.
> >
> > 3) It provides eBPF.
> >
> > 4) The spawned kernel does not require any hardware knowledge,
> > just pure XDP-ringbuffer-based software logic.
> >
> > But it also has limitations:
> >
> > 1) xdp_md is too specific for networking, extending it to storage
> > could be very challenging. But we could introduce a SDP for
> > storage to just mimic XDP.
> >
> > 2) Regardless, we need a doorbell anyway. IPI is handy, but
> > I hope we could have an even lighter one. Or more ideally,
> > redirecting the hardware queue IRQ into each target CPU.
>
> I see. I was thinking that spawned kernels would talk directly to the
> hardware. Your idea of using a software interface is less invasive but
> has an overhead similar to paravirtualized devices.

When we have sufficient hardware resources, or prefer to use SR-IOV,
the multikernel could indeed access the hardware directly. Queues are
an alternative choice for elasticity.

>
> A software approach that supports a wider range of devices is
> virtio_vdpa (drivers/vdpa/). The current virtio_vdpa implementation
> assumes that the device is located in the same kernel. A
> kernel-to-kernel bridge would be needed so that the spawned kernel
> forwards the vDPA operations to the other kernel. The other kernel
> provides the virtio-net, virtio-blk, etc device functionality by passing
> requests to a netdev, blkdev, etc.

I think that is the major blocker. vDPA looks more complex than
queue-based solutions (including the Soft Functions provided by mlx),
from my naive understanding, but I will take a deeper look at vDPA.

>
> There are in-kernel simulator devices for virtio-net and virtio-blk in
> drivers/vdpa/vdpa_sim/ which can be used as a starting point. These
> devices are just for testing and would need to be fleshed out to become
> useful for real workloads.
>
> I have CCed Jason Wang, who maintains vDPA, in case you want to discuss
> it more.

Appreciate it.

Regards,
Cong Wang


^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2025-10-02  4:17 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-18 22:25 [RFC Patch 0/7] kernel: Introduce multikernel architecture support Cong Wang
2025-09-18 22:26 ` [RFC Patch 1/7] kexec: Introduce multikernel support via kexec Cong Wang
2025-09-18 22:26 ` [RFC Patch 2/7] x86: Introduce SMP INIT trampoline for multikernel CPU bootstrap Cong Wang
2025-09-18 22:26 ` [RFC Patch 3/7] x86: Introduce MULTIKERNEL_VECTOR for inter-kernel communication Cong Wang
2025-09-18 22:26 ` [RFC Patch 4/7] kernel: Introduce generic multikernel IPI communication framework Cong Wang
2025-09-18 22:26 ` [RFC Patch 5/7] x86: Introduce arch_cpu_physical_id() to obtain physical CPU ID Cong Wang
2025-09-18 22:26 ` [RFC Patch 6/7] kexec: Implement dynamic kimage tracking Cong Wang
2025-09-18 22:26 ` [RFC Patch 7/7] kexec: Add /proc/multikernel interface for " Cong Wang
2025-09-19 10:10 ` [syzbot ci] Re: kernel: Introduce multikernel architecture support syzbot ci
2025-09-19 13:14 ` [RFC Patch 0/7] " Pasha Tatashin
2025-09-20 21:13   ` Cong Wang
2025-09-19 21:26 ` Stefan Hajnoczi
2025-09-20 21:40   ` Cong Wang
2025-09-22 14:28     ` Stefan Hajnoczi
2025-09-22 22:41       ` Cong Wang
2025-09-23 17:05         ` Stefan Hajnoczi
2025-09-24 11:38           ` David Hildenbrand
2025-09-24 12:51             ` Stefan Hajnoczi
2025-09-24 18:28               ` Cong Wang
2025-09-24 19:03                 ` Stefan Hajnoczi
2025-09-27 19:42                   ` Cong Wang
2025-09-29 15:11                     ` Stefan Hajnoczi
2025-10-02  4:17                       ` Cong Wang
2025-09-24 17:18           ` Cong Wang
2025-09-21  1:47 ` Hillf Danton
2025-09-22 21:55   ` Cong Wang
2025-09-24  1:12     ` Hillf Danton
2025-09-24 17:30       ` Cong Wang
2025-09-24 22:42         ` Hillf Danton
2025-09-21  5:54 ` Jan Engelhardt
2025-09-21  6:24   ` Mike Rapoport
2025-09-24 17:51 ` Christoph Lameter (Ampere)
2025-09-24 18:39   ` Cong Wang
2025-09-26  9:50     ` Jarkko Sakkinen
2025-09-27 20:43       ` Cong Wang
2025-09-28 14:22         ` Jarkko Sakkinen
2025-09-28 14:36           ` Jarkko Sakkinen
2025-09-28 14:41             ` Jarkko Sakkinen
2025-09-25 15:47 ` Jiaxun Yang
2025-09-27 20:06   ` Cong Wang
2025-09-26  9:01 ` Jarkko Sakkinen
2025-09-27 20:27   ` Cong Wang
2025-09-27 20:39     ` Pasha Tatashin
2025-09-28 14:08     ` Jarkko Sakkinen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox