* [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
@ 2025-02-06 13:27 Mike Rapoport
2025-02-06 13:27 ` [PATCH v4 01/14] mm/mm_init: rename init_reserved_page to init_deferred_page Mike Rapoport
` (19 more replies)
0 siblings, 20 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Graf, Andrew Morton, Andy Lutomirski, Anthony Yznaga,
Arnd Bergmann, Ashish Kalra, Benjamin Herrenschmidt,
Borislav Petkov, Catalin Marinas, Dave Hansen, David Woodhouse,
Eric Biederman, Ingo Molnar, James Gowans, Jonathan Corbet,
Krzysztof Kozlowski, Mark Rutland, Mike Rapoport, Paolo Bonzini,
Pasha Tatashin, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Hi,
This is the next version of Alex's "kexec: Allow preservation of ftrace buffers"
series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com);
to make things simpler, instead of ftrace we decided to preserve
"reserve_mem" regions.
The patches are also available in git:
https://git.kernel.org/rppt/h/kho/v4
Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.
However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See "pkernfs: Persisting guest memory
and kernel/device state safely across kexec" Linux Plumbers
Conference 2023 presentation for details:
https://lpc.events/event/17/contributions/1485/
To start us on the journey to support all the use cases above, this patch
series implements basic infrastructure to allow handover of kernel state
across kexec (Kexec HandOver, aka KHO). As a really simple example target, we use
memblock's reserve_mem.
With this patch set applied, memory that was reserved using "reserve_mem"
command line options remains intact after kexec and it is guaranteed to
reside at the same physical address.
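For reference, a combined command line for such a setup might look like the
following (the reserve_mem values and the "mydata" label are illustrative;
the parameter takes size:align:name triples):

```
kho=on reserve_mem=16M:4K:mydata
```

After a KHO kexec, the region labeled "mydata" is found at the same physical
address in the new kernel.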
== Alternatives ==
There are alternative approaches to (parts of) the problems above:
* Memory Pools [1] - preallocated persistent memory region + allocator
* PRMEM [2] - resizable persistent memory regions with fixed metadata
pointer on the kernel command line + allocator
* Pkernfs [3] - preallocated file system for in-kernel data with fixed
address location on the kernel command line
* PKRAM [4] - handover of user space pages using a fixed metadata page
specified via command line
All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command
line to pass data (including memory reservations) between kexec'ing
kernels.
KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of,
for example, IOMMU page tables. But IMHO they would all be users of KHO, with
KHO providing the foundational primitive to pass metadata and bulk memory
reservations as well as provide easy versioning for data.
== Overview ==
We introduce a metadata file that the kernels pass between each other. How
they pass it is architecture specific. The file's format is a Flattened
Device Tree (fdt) which has a generator and parser already included in
Linux. When the root user enables KHO through /sys/kernel/kho/active, the
kernel invokes callbacks to every driver that supports KHO to serialize
its state. When the actual kexec happens, the fdt is part of the image
set that we boot into. In addition, we keep "scratch regions" available
for kexec: physically contiguous memory regions that are guaranteed to
not contain any memory that KHO would preserve. The new kernel bootstraps
itself using the scratch regions and sets all handed-over memory as in use.
When drivers that support KHO initialize, they introspect the fdt and
recover their state from it. This includes memory reservations, where the
driver can either discard or claim reservations.
== Limitations ==
Currently KHO is only implemented for file-based kexec. The kernel
interfaces in the patch set are already in place to support user space
kexec as well, but it is not yet implemented in kexec-tools.
== How to Use ==
To use the code, please boot the kernel with the "kho=on" command line
parameter.
KHO will automatically create scratch regions. If you want to set the
scratch size explicitly you can use "kho_scratch=" command line parameter.
For instance, "kho_scratch=512M,256M" will create a global scratch area of
512MiB and per-node scratch areas of 256MiB.
Make sure to have a reserved memory range requested with the "reserve_mem"
command line option. Then, before you invoke a file-based "kexec -l", activate
KHO:
# echo 1 > /sys/kernel/kho/active
# kexec -l Image --initrd=initrd -s
# kexec -e
The new kernel will boot up and contain the previous kernel's reserve_mem
contents at the same physical address as the first kernel.
== Changelog ==
v3 -> v4:
- Major rework of scratch management. Rather than forcing scratch memory
allocations only very early in boot, we now rely on scratch for all
memblock allocations.
- Use a simple example use case (reserve_mem instead of ftrace)
- merge all KHO functionality into a single kernel/kexec_handover.c file
- rename CONFIG_KEXEC_KHO to CONFIG_KEXEC_HANDOVER
v1 -> v2:
- Removed: tracing: Introduce names for ring buffers
- Removed: tracing: Introduce names for events
- New: kexec: Add config option for KHO
- New: kexec: Add documentation for KHO
- New: tracing: Initialize fields before registering
- New: devicetree: Add bindings for ftrace KHO
- test bot warning fixes
- Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
- s/kho_reserve_mem/kho_reserve_previous_mem/g
- s/kho_reserve/kho_reserve_scratch/g
- Remove / reduce ifdefs
- Select crc32
- Leave anything that requires a name in trace.c to keep buffers
unnamed entities
- Put events as array into a property, use fingerprint instead of
names to identify them
- Reduce footprint without CONFIG_FTRACE_KHO
- s/kho_reserve_mem/kho_reserve_previous_mem/g
- make kho_get_fdt() const
- Add stubs for return_mem and claim_mem
- make kho_get_fdt() const
- Get events as array from a property, use fingerprint instead of
names to identify events
- Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
- s/kho_reserve_mem/kho_reserve_previous_mem/g
- s/kho_reserve/kho_reserve_scratch/g
- Leave the node generation code that needs to know the name in
trace.c so that ring buffers can stay anonymous
- s/kho_reserve/kho_reserve_scratch/g
- Move kho enums out of ifdef
- Move from names to fdt offsets. That way, trace.c can find the trace
array offset and then the ring buffer code only needs to read out
its per-CPU data. That way it can stay oblivious to its name.
- Make kho_get_fdt() const
v2 -> v3:
- Fix make dt_binding_check
- Add descriptions for each object
- s/trace_flags/trace-flags/
- s/global_trace/global-trace/
- Make all additionalProperties false
- Change subject to reflect subsystem (dt-bindings)
- Fix indentation
- Remove superfluous examples
- Convert to 64bit syntax
- Move to kho directory
- s/"global_trace"/"global-trace"/
- s/"global_trace"/"global-trace"/
- s/"trace_flags"/"trace-flags"/
- Fix wording
- Add Documentation to MAINTAINERS file
- Remove kho reference on read error
- Move handover_dt unmap up
- s/reserve_scratch_mem/mark_phys_as_cma/
- Remove ifdeffery
- Remove superfluous comment
Alexander Graf (9):
memblock: Add support for scratch memory
kexec: Add Kexec HandOver (KHO) generation helpers
kexec: Add KHO parsing support
kexec: Add KHO support to kexec file loads
kexec: Add config option for KHO
kexec: Add documentation for KHO
arm64: Add KHO support
x86: Add KHO support
memblock: Add KHO support for reserve_mem
Mike Rapoport (Microsoft) (5):
mm/mm_init: rename init_reserved_page to init_deferred_page
memblock: add MEMBLOCK_RSRV_KERN flag
memblock: introduce memmap_init_kho_scratch()
x86/setup: use memblock_reserve_kern for memory used by kernel
Documentation: KHO: Add memblock bindings
Documentation/ABI/testing/sysfs-firmware-kho | 9 +
Documentation/ABI/testing/sysfs-kernel-kho | 53 ++
.../admin-guide/kernel-parameters.txt | 24 +
.../kho/bindings/memblock/reserve_mem.yaml | 41 +
.../bindings/memblock/reserve_mem_map.yaml | 42 +
Documentation/kho/concepts.rst | 80 ++
Documentation/kho/index.rst | 19 +
Documentation/kho/usage.rst | 60 ++
Documentation/subsystem-apis.rst | 1 +
MAINTAINERS | 3 +
arch/arm64/Kconfig | 3 +
arch/x86/Kconfig | 3 +
arch/x86/boot/compressed/kaslr.c | 52 +-
arch/x86/include/asm/setup.h | 4 +
arch/x86/include/uapi/asm/setup_data.h | 13 +-
arch/x86/kernel/e820.c | 18 +
arch/x86/kernel/kexec-bzimage64.c | 36 +
arch/x86/kernel/setup.c | 39 +-
arch/x86/realmode/init.c | 2 +
drivers/of/fdt.c | 36 +
drivers/of/kexec.c | 42 +
include/linux/cma.h | 2 +
include/linux/kexec.h | 37 +
include/linux/kexec_handover.h | 10 +
include/linux/memblock.h | 38 +-
kernel/Kconfig.kexec | 13 +
kernel/Makefile | 1 +
kernel/kexec_file.c | 19 +
kernel/kexec_handover.c | 808 ++++++++++++++++++
kernel/kexec_internal.h | 16 +
mm/Kconfig | 4 +
mm/internal.h | 5 +-
mm/memblock.c | 247 +++++-
mm/mm_init.c | 19 +-
34 files changed, 1775 insertions(+), 24 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-firmware-kho
create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
create mode 100644 Documentation/kho/bindings/memblock/reserve_mem.yaml
create mode 100644 Documentation/kho/bindings/memblock/reserve_mem_map.yaml
create mode 100644 Documentation/kho/concepts.rst
create mode 100644 Documentation/kho/index.rst
create mode 100644 Documentation/kho/usage.rst
create mode 100644 include/linux/kexec_handover.h
create mode 100644 kernel/kexec_handover.c
base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
--
2.47.2
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v4 01/14] mm/mm_init: rename init_reserved_page to init_deferred_page
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
@ 2025-02-06 13:27 ` Mike Rapoport
2025-02-18 14:59 ` Wei Yang
2025-02-06 13:27 ` [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag Mike Rapoport
` (18 subsequent siblings)
19 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
To: linux-kernel
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the init_reserved_page()
function performs initialization of a struct page that would normally have
been deferred.
Rename it to init_deferred_page() to better reflect what the function does.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
mm/mm_init.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 2630cc30147e..c4b425125bad 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -705,7 +705,7 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
return false;
}
-static void __meminit init_reserved_page(unsigned long pfn, int nid)
+static void __meminit init_deferred_page(unsigned long pfn, int nid)
{
pg_data_t *pgdat;
int zid;
@@ -739,7 +739,7 @@ static inline bool defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
return false;
}
-static inline void init_reserved_page(unsigned long pfn, int nid)
+static inline void init_deferred_page(unsigned long pfn, int nid)
{
}
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
@@ -760,7 +760,7 @@ void __meminit reserve_bootmem_region(phys_addr_t start,
if (pfn_valid(start_pfn)) {
struct page *page = pfn_to_page(start_pfn);
- init_reserved_page(start_pfn, nid);
+ init_deferred_page(start_pfn, nid);
/*
* no need for atomic set_bit because the struct
--
2.47.2
* [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
2025-02-06 13:27 ` [PATCH v4 01/14] mm/mm_init: rename init_reserved_page to init_deferred_page Mike Rapoport
@ 2025-02-06 13:27 ` Mike Rapoport
2025-02-18 15:50 ` Wei Yang
2025-02-26 1:53 ` Changyuan Lyu
2025-02-06 13:27 ` [PATCH v4 03/14] memblock: Add support for scratch memory Mike Rapoport
` (17 subsequent siblings)
19 siblings, 2 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
To: linux-kernel
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Add the MEMBLOCK_RSRV_KERN flag to denote areas that were reserved for
kernel use, either directly with memblock_reserve_kern() or via memblock
allocations.
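The per-node accounting this flag enables can be modeled with a small sketch
(illustrative Python mirroring the memblock_reserved_kern_size() logic in the
diff below; the region tuples are a stand-in for struct memblock_region):

```python
# Flag value taken from the diff; NUMA_NO_NODE stands in for the
# !numa_valid_node(nid) case, which counts regions on all nodes.
MEMBLOCK_RSRV_KERN = 0x20
NUMA_NO_NODE = -1

def reserved_kern_size(regions, nid):
    """Sum the sizes of kernel-reserved regions on node nid (or all nodes)."""
    total = 0
    for node, size, flags in regions:
        if (nid == node or nid == NUMA_NO_NODE) and (flags & MEMBLOCK_RSRV_KERN):
            total += size
    return total

regions = [
    (0, 0x1000, MEMBLOCK_RSRV_KERN),
    (1, 0x2000, MEMBLOCK_RSRV_KERN),
    (0, 0x4000, 0),  # plain memblock_reserve(), not counted
]
```

Only regions carrying MEMBLOCK_RSRV_KERN contribute; firmware and other
non-kernel reservations keep a zero flag and are skipped.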
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/memblock.h | 16 +++++++++++++++-
mm/memblock.c | 32 ++++++++++++++++++++++++--------
2 files changed, 39 insertions(+), 9 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index e79eb6ac516f..65e274550f5d 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -50,6 +50,7 @@ enum memblock_flags {
MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
+ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
};
/**
@@ -116,7 +117,19 @@ int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid,
int memblock_add(phys_addr_t base, phys_addr_t size);
int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_phys_free(phys_addr_t base, phys_addr_t size);
-int memblock_reserve(phys_addr_t base, phys_addr_t size);
+int __memblock_reserve(phys_addr_t base, phys_addr_t size, int nid,
+ enum memblock_flags flags);
+
+static __always_inline int memblock_reserve(phys_addr_t base, phys_addr_t size)
+{
+ return __memblock_reserve(base, size, NUMA_NO_NODE, 0);
+}
+
+static __always_inline int memblock_reserve_kern(phys_addr_t base, phys_addr_t size)
+{
+ return __memblock_reserve(base, size, NUMA_NO_NODE, MEMBLOCK_RSRV_KERN);
+}
+
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
int memblock_physmem_add(phys_addr_t base, phys_addr_t size);
#endif
@@ -477,6 +490,7 @@ static inline __init_memblock bool memblock_bottom_up(void)
phys_addr_t memblock_phys_mem_size(void);
phys_addr_t memblock_reserved_size(void);
+phys_addr_t memblock_reserved_kern_size(int nid);
unsigned long memblock_estimated_nr_free_pages(void);
phys_addr_t memblock_start_of_DRAM(void);
phys_addr_t memblock_end_of_DRAM(void);
diff --git a/mm/memblock.c b/mm/memblock.c
index 95af35fd1389..4c33baf4d97c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -491,7 +491,7 @@ static int __init_memblock memblock_double_array(struct memblock_type *type,
* needn't do it
*/
if (!use_slab)
- BUG_ON(memblock_reserve(addr, new_alloc_size));
+ BUG_ON(memblock_reserve_kern(addr, new_alloc_size));
/* Update slab flag */
*in_slab = use_slab;
@@ -641,7 +641,7 @@ static int __init_memblock memblock_add_range(struct memblock_type *type,
#ifdef CONFIG_NUMA
WARN_ON(nid != memblock_get_region_node(rgn));
#endif
- WARN_ON(flags != rgn->flags);
+ WARN_ON(flags != MEMBLOCK_NONE && flags != rgn->flags);
nr_new++;
if (insert) {
if (start_rgn == -1)
@@ -901,14 +901,15 @@ int __init_memblock memblock_phys_free(phys_addr_t base, phys_addr_t size)
return memblock_remove_range(&memblock.reserved, base, size);
}
-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+int __init_memblock __memblock_reserve(phys_addr_t base, phys_addr_t size,
+ int nid, enum memblock_flags flags)
{
phys_addr_t end = base + size - 1;
- memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
- &base, &end, (void *)_RET_IP_);
+ memblock_dbg("%s: [%pa-%pa] nid=%d flags=%x %pS\n", __func__,
+ &base, &end, nid, flags, (void *)_RET_IP_);
- return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
+ return memblock_add_range(&memblock.reserved, base, size, nid, flags);
}
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
@@ -1459,14 +1460,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
again:
found = memblock_find_in_range_node(size, align, start, end, nid,
flags);
- if (found && !memblock_reserve(found, size))
+ if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN))
goto done;
if (numa_valid_node(nid) && !exact_nid) {
found = memblock_find_in_range_node(size, align, start,
end, NUMA_NO_NODE,
flags);
- if (found && !memblock_reserve(found, size))
+ if (found && !memblock_reserve_kern(found, size))
goto done;
}
@@ -1751,6 +1752,20 @@ phys_addr_t __init_memblock memblock_reserved_size(void)
return memblock.reserved.total_size;
}
+phys_addr_t __init_memblock memblock_reserved_kern_size(int nid)
+{
+ struct memblock_region *r;
+ phys_addr_t total = 0;
+
+ for_each_reserved_mem_region(r) {
+ if (nid == memblock_get_region_node(r) || !numa_valid_node(nid))
+ if (r->flags & MEMBLOCK_RSRV_KERN)
+ total += r->size;
+ }
+
+ return total;
+}
+
/**
* memblock_estimated_nr_free_pages - return estimated number of free pages
* from memblock point of view
@@ -2397,6 +2412,7 @@ static const char * const flagname[] = {
[ilog2(MEMBLOCK_NOMAP)] = "NOMAP",
[ilog2(MEMBLOCK_DRIVER_MANAGED)] = "DRV_MNG",
[ilog2(MEMBLOCK_RSRV_NOINIT)] = "RSV_NIT",
+ [ilog2(MEMBLOCK_RSRV_KERN)] = "RSV_KERN",
};
static int memblock_debug_show(struct seq_file *m, void *private)
--
2.47.2
* [PATCH v4 03/14] memblock: Add support for scratch memory
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
2025-02-06 13:27 ` [PATCH v4 01/14] mm/mm_init: rename init_reserved_page to init_deferred_page Mike Rapoport
2025-02-06 13:27 ` [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag Mike Rapoport
@ 2025-02-06 13:27 ` Mike Rapoport
2025-02-24 2:50 ` Wei Yang
2025-02-06 13:27 ` [PATCH v4 04/14] memblock: introduce memmap_init_kho_scratch() Mike Rapoport
` (16 subsequent siblings)
19 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
To: linux-kernel
From: Alexander Graf <graf@amazon.com>
With KHO (Kexec HandOver), we need a way to ensure that the new kernel
does not allocate memory on top of any memory regions that the previous
kernel was handing over. But to know where those are, we need to include
them in the memblock.reserved array, which may not be big enough to hold
all ranges that need to be persisted across kexec. To resize the array,
we need to allocate memory. That brings us into a catch-22 situation.
The solution is to limit memblock allocations to the scratch regions:
regions that are safe to operate in when there is memory that should remain
intact across kexec.
KHO provides several "scratch regions" as part of its metadata. These
scratch regions are contiguous memory blocks that are known not to contain
any memory that should be persisted across kexec. These regions should be
large enough to accommodate all memblock allocations done by the kexec'ed
kernel.
We introduce a new memblock_set_scratch_only() function that allows KHO to
indicate that any memblock allocation must happen from the scratch regions.
Later, we may want to perform another KHO kexec. For that, we reuse the
same scratch regions. To ensure that no data that will eventually be handed
over gets allocated inside a scratch region, we flip the semantics of the
scratch regions with memblock_clear_scratch_only(): after that call, no
allocations may happen from scratch memblock regions. We will lift that
restriction in the next patch.
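The allocation restriction can be modeled with a small sketch (illustrative
Python mirroring the new should_skip_region() check; the flag values are
taken from the diff):

```python
MEMBLOCK_NONE = 0x0
MEMBLOCK_KHO_SCRATCH = 0x40

def should_skip_region(region_flags: int, requested_flags: int) -> bool:
    """Skip non-scratch regions when the caller asks for scratch-only."""
    if (requested_flags & MEMBLOCK_KHO_SCRATCH) and \
       not (region_flags & MEMBLOCK_KHO_SCRATCH):
        return True
    return False

def choose_flags(kho_scratch_only: bool) -> int:
    # memblock_set_kho_scratch_only() flips this switch during early boot.
    return MEMBLOCK_KHO_SCRATCH if kho_scratch_only else MEMBLOCK_NONE

# After memblock_set_kho_scratch_only(), only scratch regions are eligible.
flags = choose_flags(kho_scratch_only=True)
```

Once memblock_clear_kho_scratch_only() is called, the requested flags drop
back to MEMBLOCK_NONE and every region becomes eligible again.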
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/memblock.h | 20 +++++++++++++
mm/Kconfig | 4 +++
mm/memblock.c | 61 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 85 insertions(+)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 65e274550f5d..14e4c6b73e2c 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,11 @@ extern unsigned long long max_possible_pfn;
* kernel resource tree.
* @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
* not initialized (only for reserved regions).
+ * @MEMBLOCK_KHO_SCRATCH: memory region that kexec can pass to the next
+ * kernel in handover mode. During early boot, we do not know about all
+ * memory reservations yet, so we get scratch memory from the previous
+ * kernel that we know is good to use. It is the only memory that
+ * allocations may happen from in this phase.
*/
enum memblock_flags {
MEMBLOCK_NONE = 0x0, /* No special request */
@@ -51,6 +56,7 @@ enum memblock_flags {
MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
+ MEMBLOCK_KHO_SCRATCH = 0x40, /* scratch memory for kexec handover */
};
/**
@@ -145,6 +151,8 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
int memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size);
+int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size);
+int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size);
void memblock_free_all(void);
void memblock_free(void *ptr, size_t size);
@@ -289,6 +297,11 @@ static inline bool memblock_is_driver_managed(struct memblock_region *m)
return m->flags & MEMBLOCK_DRIVER_MANAGED;
}
+static inline bool memblock_is_kho_scratch(struct memblock_region *m)
+{
+ return m->flags & MEMBLOCK_KHO_SCRATCH;
+}
+
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long *end_pfn);
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
@@ -617,5 +630,12 @@ static inline void early_memtest(phys_addr_t start, phys_addr_t end) { }
static inline void memtest_report_meminfo(struct seq_file *m) { }
#endif
+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
+void memblock_set_kho_scratch_only(void);
+void memblock_clear_kho_scratch_only(void);
+#else
+static inline void memblock_set_kho_scratch_only(void) { }
+static inline void memblock_clear_kho_scratch_only(void) { }
+#endif
#endif /* _LINUX_MEMBLOCK_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b501db06417..550bbafe5c0b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -506,6 +506,10 @@ config HAVE_GUP_FAST
depends on MMU
bool
+# Enable memblock support for scratch memory which is needed for kexec handover
+config MEMBLOCK_KHO_SCRATCH
+ bool
+
# Don't discard allocated memory used to track "memory" and "reserved" memblocks
# after early boot, so it can still be used to test for validity of memory.
# Also, memblocks are updated with memory hot(un)plug.
diff --git a/mm/memblock.c b/mm/memblock.c
index 4c33baf4d97c..3d68b1fc2bd2 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -106,6 +106,13 @@ unsigned long min_low_pfn;
unsigned long max_pfn;
unsigned long long max_possible_pfn;
+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
+/* When set to true, only allocate from MEMBLOCK_KHO_SCRATCH ranges */
+static bool kho_scratch_only;
+#else
+#define kho_scratch_only false
+#endif
+
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_MEMORY_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_RESERVED_REGIONS] __initdata_memblock;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
@@ -165,6 +172,10 @@ bool __init_memblock memblock_has_mirror(void)
static enum memblock_flags __init_memblock choose_memblock_flags(void)
{
+ /* skip non-scratch memory for kho early boot allocations */
+ if (kho_scratch_only)
+ return MEMBLOCK_KHO_SCRATCH;
+
return system_has_some_mirror ? MEMBLOCK_MIRROR : MEMBLOCK_NONE;
}
@@ -924,6 +935,18 @@ int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size)
}
#endif
+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
+__init_memblock void memblock_set_kho_scratch_only(void)
+{
+ kho_scratch_only = true;
+}
+
+__init_memblock void memblock_clear_kho_scratch_only(void)
+{
+ kho_scratch_only = false;
+}
+#endif
+
/**
* memblock_setclr_flag - set or clear flag for a memory region
* @type: memblock type to set/clear flag for
@@ -1049,6 +1072,36 @@ int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t
MEMBLOCK_RSRV_NOINIT);
}
+/**
+ * memblock_mark_kho_scratch - Mark a memory region as MEMBLOCK_KHO_SCRATCH.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Only memory regions marked with %MEMBLOCK_KHO_SCRATCH will be considered
+ * for allocations during early boot with kexec handover.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(&memblock.memory, base, size, 1,
+ MEMBLOCK_KHO_SCRATCH);
+}
+
+/**
+ * memblock_clear_kho_scratch - Clear MEMBLOCK_KHO_SCRATCH flag for a
+ * specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(&memblock.memory, base, size, 0,
+ MEMBLOCK_KHO_SCRATCH);
+}
+
static bool should_skip_region(struct memblock_type *type,
struct memblock_region *m,
int nid, int flags)
@@ -1080,6 +1133,13 @@ static bool should_skip_region(struct memblock_type *type,
if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m))
return true;
+ /*
+ * In early alloc during kexec handover, we can only consider
+ * MEMBLOCK_KHO_SCRATCH regions for the allocations
+ */
+ if ((flags & MEMBLOCK_KHO_SCRATCH) && !memblock_is_kho_scratch(m))
+ return true;
+
return false;
}
@@ -2413,6 +2473,7 @@ static const char * const flagname[] = {
[ilog2(MEMBLOCK_DRIVER_MANAGED)] = "DRV_MNG",
[ilog2(MEMBLOCK_RSRV_NOINIT)] = "RSV_NIT",
[ilog2(MEMBLOCK_RSRV_KERN)] = "RSV_KERN",
+ [ilog2(MEMBLOCK_KHO_SCRATCH)] = "KHO_SCRATCH",
};
static int memblock_debug_show(struct seq_file *m, void *private)
--
2.47.2
* [PATCH v4 04/14] memblock: introduce memmap_init_kho_scratch()
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
` (2 preceding siblings ...)
2025-02-06 13:27 ` [PATCH v4 03/14] memblock: Add support for scratch memory Mike Rapoport
@ 2025-02-06 13:27 ` Mike Rapoport
2025-02-24 3:02 ` Wei Yang
2025-02-06 13:27 ` [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers Mike Rapoport
` (15 subsequent siblings)
19 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
To: linux-kernel
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
With deferred initialization of struct page, it will be necessary to
initialize the memory map for KHO scratch regions early.
Add a memmap_init_kho_scratch() helper that will allow such initialization
in upcoming patches.
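The helper in the diff walks pages with PFN_UP()/PFN_DOWN() so that only
pages fully contained in a scratch range are initialized. The rounding can
be sketched as follows (illustrative Python, assuming a 4 KiB page size;
PFN_UP()/PFN_DOWN() mirror the kernel macros):

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT

def PFN_UP(addr):
    # First whole page frame at or above addr (round up).
    return (addr + PAGE_SIZE - 1) >> PAGE_SHIFT

def PFN_DOWN(addr):
    # Page frame containing addr, truncated (round down).
    return addr >> PAGE_SHIFT

# A byte range [0x1800, 0x5000) fully covers page frames 2, 3 and 4 only;
# the partial page at the start of the range is excluded.
start, end = 0x1800, 0x5000
pfns = list(range(PFN_UP(start), PFN_DOWN(end)))
```

This matches the loop in memmap_init_kho_scratch_pages(), which iterates
`for (pfn = PFN_UP(start); pfn < PFN_DOWN(end); pfn++)`.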
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/memblock.h | 2 ++
mm/internal.h | 2 ++
mm/memblock.c | 22 ++++++++++++++++++++++
mm/mm_init.c | 11 ++++++++---
4 files changed, 34 insertions(+), 3 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 14e4c6b73e2c..20887e199cdb 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -633,9 +633,11 @@ static inline void memtest_report_meminfo(struct seq_file *m) { }
#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
void memblock_set_kho_scratch_only(void);
void memblock_clear_kho_scratch_only(void);
+void memmap_init_kho_scratch_pages(void);
#else
static inline void memblock_set_kho_scratch_only(void) { }
static inline void memblock_clear_kho_scratch_only(void) { }
+static inline void memmap_init_kho_scratch_pages(void) {}
#endif
#endif /* _LINUX_MEMBLOCK_H */
diff --git a/mm/internal.h b/mm/internal.h
index 109ef30fee11..986ad9c2a8b2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1053,6 +1053,8 @@ DECLARE_STATIC_KEY_TRUE(deferred_pages);
bool __init deferred_grow_zone(struct zone *zone, unsigned int order);
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+void init_deferred_page(unsigned long pfn, int nid);
+
enum mminit_level {
MMINIT_WARNING,
MMINIT_VERIFY,
diff --git a/mm/memblock.c b/mm/memblock.c
index 3d68b1fc2bd2..54bd95745381 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -945,6 +945,28 @@ __init_memblock void memblock_clear_kho_scratch_only(void)
{
kho_scratch_only = false;
}
+
+void __init_memblock memmap_init_kho_scratch_pages(void)
+{
+ phys_addr_t start, end;
+ unsigned long pfn;
+ int nid;
+ u64 i;
+
+ if (!IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT))
+ return;
+
+ /*
+ * Initialize struct pages for free scratch memory.
+ * The struct pages for reserved scratch memory will be set up in
+ * reserve_bootmem_region()
+ */
+ __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
+ MEMBLOCK_KHO_SCRATCH, &start, &end, &nid) {
+ for (pfn = PFN_UP(start); pfn < PFN_DOWN(end); pfn++)
+ init_deferred_page(pfn, nid);
+ }
+}
#endif
/**
diff --git a/mm/mm_init.c b/mm/mm_init.c
index c4b425125bad..04441c258b05 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -705,7 +705,7 @@ defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
return false;
}
-static void __meminit init_deferred_page(unsigned long pfn, int nid)
+static void __meminit __init_deferred_page(unsigned long pfn, int nid)
{
pg_data_t *pgdat;
int zid;
@@ -739,11 +739,16 @@ static inline bool defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
return false;
}
-static inline void init_deferred_page(unsigned long pfn, int nid)
+static inline void __init_deferred_page(unsigned long pfn, int nid)
{
}
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+void __meminit init_deferred_page(unsigned long pfn, int nid)
+{
+ __init_deferred_page(pfn, nid);
+}
+
/*
* Initialised pages do not have PageReserved set. This function is
* called for each range allocated by the bootmem allocator and
@@ -760,7 +765,7 @@ void __meminit reserve_bootmem_region(phys_addr_t start,
if (pfn_valid(start_pfn)) {
struct page *page = pfn_to_page(start_pfn);
- init_deferred_page(start_pfn, nid);
+ __init_deferred_page(start_pfn, nid);
/*
* no need for atomic set_bit because the struct
--
2.47.2
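[Editor's note] The memmap_init_kho_scratch_pages() loop in this patch rounds inward with PFN_UP(start)/PFN_DOWN(end), so only page frames fully contained in a scratch range get their struct pages initialized; partial pages at the edges are left to reserve_bootmem_region(). A minimal userspace sketch of that inward rounding, assuming a 4 KiB page size; `pfn_up`/`pfn_down` are stand-ins for the kernel's PFN_UP/PFN_DOWN macros:

```c
#include <stdint.h>
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Stand-ins for the kernel's PFN_UP()/PFN_DOWN() macros. */
static inline uint64_t pfn_up(uint64_t addr)   { return (addr + PAGE_SIZE - 1) >> PAGE_SHIFT; }
static inline uint64_t pfn_down(uint64_t addr) { return addr >> PAGE_SHIFT; }

/*
 * Count the page frames lying fully inside [start, end), mirroring the
 * inward rounding of:
 *	for (pfn = PFN_UP(start); pfn < PFN_DOWN(end); pfn++)
 */
static uint64_t pages_fully_inside(uint64_t start, uint64_t end)
{
	uint64_t first = pfn_up(start), last = pfn_down(end);

	return last > first ? last - first : 0;
}
```

Note that kho_claim_mem() later in the series rounds the opposite way (PFN_DOWN(start)/PFN_UP(end)) because it must cover every page touching the preserved range.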
* [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
` (3 preceding siblings ...)
2025-02-06 13:27 ` [PATCH v4 04/14] memblock: introduce memmap_init_kho_scratch() Mike Rapoport
@ 2025-02-06 13:27 ` Mike Rapoport
2025-02-10 20:22 ` Jason Gunthorpe
2025-02-12 12:29 ` Thomas Weißschuh
2025-02-06 13:27 ` [PATCH v4 06/14] kexec: Add KHO parsing support Mike Rapoport
` (14 subsequent siblings)
19 siblings, 2 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
To: linux-kernel
From: Alexander Graf <graf@amazon.com>
Add the core infrastructure to generate Kexec HandOver
metadata. Kexec HandOver is a mechanism that allows Linux to preserve
state - arbitrary properties as well as memory locations - across kexec.
It does so using 2 concepts:
1) Device Tree - Every KHO kexec carries a KHO specific flattened
device tree blob that describes the state of the system. Device
drivers can register to KHO to serialize their state before kexec.
2) Scratch Regions - CMA regions that we allocate in the first kernel.
CMA gives us the guarantee that no handover pages land in those
regions, because handover pages must be at a static physical memory
location. We use these regions as the place to load future kexec
images so that they won't collide with any handover data.
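[Editor's note] The collision argument in concept 2 can be illustrated with a minimal userspace sketch; `struct range` and the helper names are hypothetical stand-ins for struct kho_mem and the placement logic, not kernel API:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical stand-in for the kernel's struct kho_mem. */
struct range { uint64_t addr, size; };

static bool ranges_overlap(struct range a, struct range b)
{
	return a.addr < b.addr + b.size && b.addr < a.addr + a.size;
}

/*
 * A kexec image placed entirely inside a scratch region cannot collide
 * with handover data, because scratch regions never contain preserved
 * pages. This check rejects any placement that touches a handover range.
 */
static bool placement_is_safe(struct range image,
			      const struct range *handover, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (ranges_overlap(image, handover[i]))
			return false;
	return true;
}
```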
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
Documentation/ABI/testing/sysfs-kernel-kho | 53 +++
.../admin-guide/kernel-parameters.txt | 24 +
MAINTAINERS | 1 +
include/linux/cma.h | 2 +
include/linux/kexec.h | 18 +
include/linux/kexec_handover.h | 10 +
kernel/Makefile | 1 +
kernel/kexec_handover.c | 450 ++++++++++++++++++
mm/internal.h | 3 -
mm/mm_init.c | 8 +
10 files changed, 567 insertions(+), 3 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
create mode 100644 include/linux/kexec_handover.h
create mode 100644 kernel/kexec_handover.c
diff --git a/Documentation/ABI/testing/sysfs-kernel-kho b/Documentation/ABI/testing/sysfs-kernel-kho
new file mode 100644
index 000000000000..f13b252bc303
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-kho
@@ -0,0 +1,53 @@
+What: /sys/kernel/kho/active
+Date: December 2023
+Contact: Alexander Graf <graf@amazon.com>
+Description:
+ Kexec HandOver (KHO) allows Linux to transition the state of
+ compatible drivers into the next kexec'ed kernel. To do so,
+ device drivers will serialize their current state into a DT.
+ While the state is serialized, they must not modify any
+ state that was serialized, such as memory allocations
+ that were handed over.
+
+ When this file contains "1", the system is in the transition
+ state. When it contains "0", it is not. To switch between the
+ two states, echo the respective number into this file.
+
+What: /sys/kernel/kho/dt_max
+Date: December 2023
+Contact: Alexander Graf <graf@amazon.com>
+Description:
+ KHO needs to allocate a buffer for the DT that gets
+ generated before it knows the final size. By default, it
+ will allocate 10 MiB for it. You can write to this file
+ to modify the size of that allocation.
+
+What: /sys/kernel/kho/dt
+Date: December 2023
+Contact: Alexander Graf <graf@amazon.com>
+Description:
+ When KHO is active, the kernel exposes the generated DT that
+ carries its current KHO state in this file. Kexec user space
+ tooling can use this as input file for the KHO payload image.
+
+What: /sys/kernel/kho/scratch_len
+Date: December 2023
+Contact: Alexander Graf <graf@amazon.com>
+Description:
+ To support continuous KHO kexecs, we need to reserve
+ physically contiguous memory regions that will always stay
+ available for future kexec allocations. This file describes
+ the length of these memory regions. Kexec user space tooling
+ can use this to determine where it should place its payload
+ images.
+
+What: /sys/kernel/kho/scratch_phys
+Date: December 2023
+Contact: Alexander Graf <graf@amazon.com>
+Description:
+ To support continuous KHO kexecs, we need to reserve
+ physically contiguous memory regions that will always stay
+ available for future kexec allocations. This file describes
+ the physical location of these memory regions. Kexec user space
+ tooling can use this to determine where it should place its
+ payload images.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb8752b42ec8..ed656e2fb05e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2698,6 +2698,30 @@
kgdbwait [KGDB,EARLY] Stop kernel execution and enter the
kernel debugger at the earliest opportunity.
+ kho= [KEXEC,EARLY]
+ Format: { "0" | "1" | "off" | "on" | "y" | "n" }
+ Enables or disables Kexec HandOver.
+ "0" | "off" | "n" - kexec handover is disabled
+ "1" | "on" | "y" - kexec handover is enabled
+
+ kho_scratch= [KEXEC,EARLY]
+ Format: nn[KMG],mm[KMG] | nn%
+ Defines the size of the KHO scratch region. The KHO
+ scratch regions are physically contiguous memory
+ ranges that can only be used for non-kernel
+ allocations. That way, even when memory is heavily
+ fragmented with handed-over memory, the kexec'ed
+ kernel will always have enough contiguous ranges to
+ bootstrap itself.
+
+ It is possible to specify the exact amount of
+ memory in the form of "nn[KMG],mm[KMG]" where the
+ first parameter defines the size of a global
+ scratch area and the second parameter defines the
+ size of additional per-node scratch areas.
+ The form "nn%" defines a scale factor (in percent) of
+ the memory that the kernel reserved during boot.
+
kmac= [MIPS] Korina ethernet MAC address.
Configure the RouterBoard 532 series on-chip
Ethernet adapter MAC address.
diff --git a/MAINTAINERS b/MAINTAINERS
index 896a307fa065..8327795e8899 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12826,6 +12826,7 @@ M: Eric Biederman <ebiederm@xmission.com>
L: kexec@lists.infradead.org
S: Maintained
W: http://kernel.org/pub/linux/utils/kernel/kexec/
+F: Documentation/ABI/testing/sysfs-kernel-kho
F: include/linux/kexec.h
F: include/uapi/linux/kexec.h
F: kernel/kexec*
diff --git a/include/linux/cma.h b/include/linux/cma.h
index d15b64f51336..828a3c17504b 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -56,6 +56,8 @@ extern void cma_reserve_pages_on_error(struct cma *cma);
#ifdef CONFIG_CMA
struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp);
bool cma_free_folio(struct cma *cma, const struct folio *folio);
+/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
+void init_cma_reserved_pageblock(struct page *page);
#else
static inline struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp)
{
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index f0e9f8eda7a3..ef5c90abafd1 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -483,6 +483,24 @@ void set_kexec_sig_enforced(void);
static inline void set_kexec_sig_enforced(void) {}
#endif
+/* KHO Notifier index */
+enum kho_event {
+ KEXEC_KHO_DUMP = 0,
+ KEXEC_KHO_ABORT = 1,
+};
+
+struct notifier_block;
+
+#ifdef CONFIG_KEXEC_HANDOVER
+int register_kho_notifier(struct notifier_block *nb);
+int unregister_kho_notifier(struct notifier_block *nb);
+void kho_memory_init(void);
+#else
+static inline int register_kho_notifier(struct notifier_block *nb) { return 0; }
+static inline int unregister_kho_notifier(struct notifier_block *nb) { return 0; }
+static inline void kho_memory_init(void) {}
+#endif /* CONFIG_KEXEC_HANDOVER */
+
#endif /* !defined(__ASSEBMLY__) */
#endif /* LINUX_KEXEC_H */
diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
new file mode 100644
index 000000000000..c4b0aab823dc
--- /dev/null
+++ b/include/linux/kexec_handover.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_KEXEC_HANDOVER_H
+#define LINUX_KEXEC_HANDOVER_H
+
+struct kho_mem {
+ phys_addr_t addr;
+ phys_addr_t size;
+};
+
+#endif /* LINUX_KEXEC_HANDOVER_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 87866b037fbe..cef5377c25cd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -75,6 +75,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_core.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
+obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup/
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
new file mode 100644
index 000000000000..eccfe3a25798
--- /dev/null
+++ b/kernel/kexec_handover.c
@@ -0,0 +1,450 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kexec_handover.c - kexec handover metadata processing
+ * Copyright (C) 2023 Alexander Graf <graf@amazon.com>
+ * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport <rppt@kernel.org>
+ */
+
+#define pr_fmt(fmt) "KHO: " fmt
+
+#include <linux/cma.h>
+#include <linux/kexec.h>
+#include <linux/sysfs.h>
+#include <linux/libfdt.h>
+#include <linux/memblock.h>
+#include <linux/notifier.h>
+#include <linux/kexec_handover.h>
+#include <linux/page-isolation.h>
+
+static bool kho_enable __ro_after_init;
+
+static int __init kho_parse_enable(char *p)
+{
+ return kstrtobool(p, &kho_enable);
+}
+early_param("kho", kho_parse_enable);
+
+/*
+ * With KHO enabled, memory can become fragmented because KHO regions may
+ * be anywhere in physical address space. The scratch regions give us
+ * safe zones that will never contain KHO allocations. We can safely
+ * load new kexec images into them, and later use the scratch areas
+ * for early allocations that happen before the page allocator is
+ * initialized.
+ */
+static struct kho_mem *kho_scratch;
+static unsigned int kho_scratch_cnt;
+
+struct kho_out {
+ struct blocking_notifier_head chain_head;
+ struct kobject *kobj;
+ struct mutex lock;
+ void *dt;
+ u64 dt_len;
+ u64 dt_max;
+ bool active;
+};
+
+static struct kho_out kho_out = {
+ .chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
+ .lock = __MUTEX_INITIALIZER(kho_out.lock),
+ .dt_max = 10 * SZ_1M,
+};
+
+int register_kho_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&kho_out.chain_head, nb);
+}
+EXPORT_SYMBOL_GPL(register_kho_notifier);
+
+int unregister_kho_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_unregister(&kho_out.chain_head, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_kho_notifier);
+
+static ssize_t dt_read(struct file *file, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t pos, size_t count)
+{
+ mutex_lock(&kho_out.lock);
+ memcpy(buf, attr->private + pos, count);
+ mutex_unlock(&kho_out.lock);
+
+ return count;
+}
+
+struct bin_attribute bin_attr_dt_kern = __BIN_ATTR(dt, 0400, dt_read, NULL, 0);
+
+static int kho_expose_dt(void *fdt)
+{
+ long fdt_len = fdt_totalsize(fdt);
+ int err;
+
+ kho_out.dt = fdt;
+ kho_out.dt_len = fdt_len;
+
+ bin_attr_dt_kern.size = fdt_totalsize(fdt);
+ bin_attr_dt_kern.private = fdt;
+ err = sysfs_create_bin_file(kho_out.kobj, &bin_attr_dt_kern);
+
+ return err;
+}
+
+static void kho_abort(void)
+{
+ if (!kho_out.active)
+ return;
+
+ sysfs_remove_bin_file(kho_out.kobj, &bin_attr_dt_kern);
+
+ kvfree(kho_out.dt);
+ kho_out.dt = NULL;
+ kho_out.dt_len = 0;
+
+ blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_ABORT, NULL);
+
+ kho_out.active = false;
+}
+
+static int kho_serialize(void)
+{
+ void *fdt = NULL;
+ int err = -ENOMEM;
+
+ fdt = kvmalloc(kho_out.dt_max, GFP_KERNEL);
+ if (!fdt)
+ goto out;
+
+ if (fdt_create(fdt, kho_out.dt_max)) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ err = fdt_finish_reservemap(fdt);
+ if (err)
+ goto out;
+
+ err = fdt_begin_node(fdt, "");
+ if (err)
+ goto out;
+
+ err = fdt_property_string(fdt, "compatible", "kho-v1");
+ if (err)
+ goto out;
+
+ /* Loop through all kho dump functions */
+ err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_DUMP, fdt);
+ err = notifier_to_errno(err);
+ if (err)
+ goto out;
+
+ /* Close / */
+ err = fdt_end_node(fdt);
+ if (err)
+ goto out;
+
+ err = fdt_finish(fdt);
+ if (err)
+ goto out;
+
+ if (WARN_ON(fdt_check_header(fdt))) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ err = kho_expose_dt(fdt);
+
+out:
+ if (err) {
+		pr_err("failed to serialize state: %d\n", err);
+ kho_abort();
+ }
+ return err;
+}
+
+/* Handling for /sys/kernel/kho */
+
+#define KHO_ATTR_RO(_name) \
+ static struct kobj_attribute _name##_attr = __ATTR_RO_MODE(_name, 0400)
+#define KHO_ATTR_RW(_name) \
+ static struct kobj_attribute _name##_attr = __ATTR_RW_MODE(_name, 0600)
+
+static ssize_t active_store(struct kobject *dev, struct kobj_attribute *attr,
+ const char *buf, size_t size)
+{
+ ssize_t retsize = size;
+ bool val = false;
+ int ret;
+
+ if (kstrtobool(buf, &val) < 0)
+ return -EINVAL;
+
+ if (!kho_enable)
+ return -EOPNOTSUPP;
+ if (!kho_scratch_cnt)
+ return -ENOMEM;
+
+ mutex_lock(&kho_out.lock);
+ if (val != kho_out.active) {
+ if (val) {
+ ret = kho_serialize();
+ if (ret) {
+ retsize = -EINVAL;
+ goto out;
+ }
+ kho_out.active = true;
+ } else {
+ kho_abort();
+ }
+ }
+
+out:
+ mutex_unlock(&kho_out.lock);
+ return retsize;
+}
+
+static ssize_t active_show(struct kobject *dev, struct kobj_attribute *attr,
+ char *buf)
+{
+ ssize_t ret;
+
+ mutex_lock(&kho_out.lock);
+ ret = sysfs_emit(buf, "%d\n", kho_out.active);
+ mutex_unlock(&kho_out.lock);
+
+ return ret;
+}
+KHO_ATTR_RW(active);
+
+static ssize_t dt_max_store(struct kobject *dev, struct kobj_attribute *attr,
+ const char *buf, size_t size)
+{
+ u64 val;
+
+ if (kstrtoull(buf, 0, &val))
+ return -EINVAL;
+
+ /* FDT already exists, it's too late to change dt_max */
+ if (kho_out.dt_len)
+ return -EBUSY;
+
+ kho_out.dt_max = val;
+
+ return size;
+}
+
+static ssize_t dt_max_show(struct kobject *dev, struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "0x%llx\n", kho_out.dt_max);
+}
+KHO_ATTR_RW(dt_max);
+
+static ssize_t scratch_len_show(struct kobject *dev, struct kobj_attribute *attr,
+ char *buf)
+{
+ ssize_t count = 0;
+
+ for (int i = 0; i < kho_scratch_cnt; i++)
+ count += sysfs_emit_at(buf, count, "0x%llx\n", kho_scratch[i].size);
+
+ return count;
+}
+KHO_ATTR_RO(scratch_len);
+
+static ssize_t scratch_phys_show(struct kobject *dev, struct kobj_attribute *attr,
+ char *buf)
+{
+ ssize_t count = 0;
+
+ for (int i = 0; i < kho_scratch_cnt; i++)
+ count += sysfs_emit_at(buf, count, "0x%llx\n", kho_scratch[i].addr);
+
+ return count;
+}
+KHO_ATTR_RO(scratch_phys);
+
+static const struct attribute *kho_out_attrs[] = {
+ &active_attr.attr,
+ &dt_max_attr.attr,
+ &scratch_phys_attr.attr,
+ &scratch_len_attr.attr,
+ NULL,
+};
+
+static __init int kho_out_sysfs_init(void)
+{
+ int err;
+
+ kho_out.kobj = kobject_create_and_add("kho", kernel_kobj);
+ if (!kho_out.kobj)
+ return -ENOMEM;
+
+ err = sysfs_create_files(kho_out.kobj, kho_out_attrs);
+ if (err)
+ goto err_put_kobj;
+
+ return 0;
+
+err_put_kobj:
+ kobject_put(kho_out.kobj);
+ return err;
+}
+
+static __init int kho_init(void)
+{
+ int err;
+
+ if (!kho_enable)
+ return -EINVAL;
+
+ err = kho_out_sysfs_init();
+ if (err)
+ return err;
+
+ for (int i = 0; i < kho_scratch_cnt; i++) {
+ unsigned long base_pfn = PHYS_PFN(kho_scratch[i].addr);
+ unsigned long count = kho_scratch[i].size >> PAGE_SHIFT;
+ unsigned long pfn;
+
+ for (pfn = base_pfn; pfn < base_pfn + count;
+ pfn += pageblock_nr_pages)
+ init_cma_reserved_pageblock(pfn_to_page(pfn));
+ }
+
+ return 0;
+}
+late_initcall(kho_init);
+
+/*
+ * The scratch areas are scaled by default as percent of memory allocated from
+ * memblock. A user can override the scale with command line parameter:
+ *
+ * kho_scratch=N%
+ *
+ * It is also possible to explicitly define size for a global and per-node
+ * scratch areas:
+ *
+ * kho_scratch=n[KMG],m[KMG]
+ *
+ * The explicit size definition takes precedence over scale definition.
+ */
+static unsigned int scratch_scale __initdata = 200;
+static phys_addr_t scratch_size_global __initdata;
+static phys_addr_t scratch_size_pernode __initdata;
+
+static int __init kho_parse_scratch_size(char *p)
+{
+ unsigned long size, size_pernode;
+ char *endptr, *oldp = p;
+
+ if (!p)
+ return -EINVAL;
+
+ size = simple_strtoul(p, &endptr, 0);
+ if (*endptr == '%') {
+ scratch_scale = size;
+ pr_notice("scratch scale is %d percent\n", scratch_scale);
+ } else {
+ size = memparse(p, &p);
+ if (!size || p == oldp)
+ return -EINVAL;
+
+ if (*p != ',')
+ return -EINVAL;
+
+ size_pernode = memparse(p + 1, &p);
+ if (!size_pernode)
+ return -EINVAL;
+
+ scratch_size_global = size;
+ scratch_size_pernode = size_pernode;
+ scratch_scale = 0;
+
+ pr_notice("scratch areas: global: %lluMB pernode: %lldMB\n",
+ (u64)(scratch_size_global >> 20),
+ (u64)(scratch_size_pernode >> 20));
+ }
+
+ return 0;
+}
+early_param("kho_scratch", kho_parse_scratch_size);
+
+static phys_addr_t __init scratch_size(int nid)
+{
+ phys_addr_t size;
+
+ if (scratch_scale) {
+ size = memblock_reserved_kern_size(nid) * scratch_scale / 100;
+ } else {
+ if (numa_valid_node(nid))
+ size = scratch_size_pernode;
+ else
+ size = scratch_size_global;
+ }
+
+ return round_up(size, CMA_MIN_ALIGNMENT_BYTES);
+}
+
+/**
+ * kho_reserve_scratch - Reserve a contiguous chunk of memory for kexec
+ *
+ * With KHO we can preserve arbitrary pages in the system. To ensure we still
+ * have a large contiguous region of memory when we search the physical address
+ * space for target memory, let's make sure we always have a large CMA region
+ * active. This CMA region will only be used for movable pages, which are not a
+ * problem for us during KHO because we can just move them somewhere else.
+ */
+static void kho_reserve_scratch(void)
+{
+ phys_addr_t addr, size;
+ int nid, i = 1;
+
+ if (!kho_enable)
+ return;
+
+ /* FIXME: deal with node hot-plug/remove */
+ kho_scratch_cnt = num_online_nodes() + 1;
+ size = kho_scratch_cnt * sizeof(*kho_scratch);
+ kho_scratch = memblock_alloc(size, PAGE_SIZE);
+ if (!kho_scratch)
+ goto err_disable_kho;
+
+ /* reserve large contiguous area for allocations without nid */
+ size = scratch_size(NUMA_NO_NODE);
+ addr = memblock_phys_alloc(size, CMA_MIN_ALIGNMENT_BYTES);
+ if (!addr)
+ goto err_free_scratch_desc;
+
+ kho_scratch[0].addr = addr;
+ kho_scratch[0].size = size;
+
+ for_each_online_node(nid) {
+ size = scratch_size(nid);
+ addr = memblock_alloc_range_nid(size, CMA_MIN_ALIGNMENT_BYTES,
+ 0, MEMBLOCK_ALLOC_ACCESSIBLE,
+ nid, true);
+ if (!addr)
+ goto err_free_scratch_areas;
+
+ kho_scratch[i].addr = addr;
+ kho_scratch[i].size = size;
+ i++;
+ }
+
+ return;
+
+err_free_scratch_areas:
+ for (i--; i >= 0; i--)
+ memblock_phys_free(kho_scratch[i].addr, kho_scratch[i].size);
+err_free_scratch_desc:
+ memblock_free(kho_scratch, kho_scratch_cnt * sizeof(*kho_scratch));
+err_disable_kho:
+ kho_enable = false;
+}
+
+void __init kho_memory_init(void)
+{
+ kho_reserve_scratch();
+}
diff --git a/mm/internal.h b/mm/internal.h
index 986ad9c2a8b2..fdd379fddf6d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -841,9 +841,6 @@ int
isolate_migratepages_range(struct compact_control *cc,
unsigned long low_pfn, unsigned long end_pfn);
-/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
-void init_cma_reserved_pageblock(struct page *page);
-
#endif /* CONFIG_COMPACTION || CONFIG_CMA */
int find_suitable_fallback(struct free_area *area, unsigned int order,
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 04441c258b05..60f08930e434 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -30,6 +30,7 @@
#include <linux/crash_dump.h>
#include <linux/execmem.h>
#include <linux/vmstat.h>
+#include <linux/kexec.h>
#include "internal.h"
#include "slab.h"
#include "shuffle.h"
@@ -2661,6 +2662,13 @@ void __init mm_core_init(void)
report_meminit();
kmsan_init_shadow();
stack_depot_early_init();
+
+ /*
+ * KHO memory setup must happen while memblock is still active, but
+ * as close as possible to buddy initialization
+ */
+ kho_memory_init();
+
mem_init();
kmem_cache_init();
/*
--
2.47.2
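[Editor's note] The kho_scratch= grammar handled by kho_parse_scratch_size() in this patch can be exercised in userspace with a small sketch. `memparse_sketch` and `parse_kho_scratch` are hypothetical names emulating the kernel's memparse() and early-param handling; only the K/M/G suffixes are modeled:

```c
#include <stdint.h>
#include <stdlib.h>
#include <assert.h>

/* Userspace stand-in for the kernel's memparse(): number + optional K/M/G. */
static uint64_t memparse_sketch(const char *s, char **retp)
{
	char *end;
	uint64_t v = strtoull(s, &end, 0);

	switch (*end) {
	case 'K': case 'k': v <<= 10; end++; break;
	case 'M': case 'm': v <<= 20; end++; break;
	case 'G': case 'g': v <<= 30; end++; break;
	}
	if (retp)
		*retp = end;
	return v;
}

/*
 * Parse "nn%" into *scale, or "nn[KMG],mm[KMG]" into *global and *pernode
 * (with *scale reset to 0, since explicit sizes take precedence).
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_kho_scratch(const char *p, unsigned int *scale,
			     uint64_t *global, uint64_t *pernode)
{
	char *end;
	uint64_t size = strtoull(p, &end, 0);

	if (*end == '%') {
		*scale = (unsigned int)size;
		return 0;
	}

	size = memparse_sketch(p, &end);
	if (!size || end == p || *end != ',')
		return -1;

	*pernode = memparse_sketch(end + 1, &end);
	if (!*pernode)
		return -1;

	*global = size;
	*scale = 0;
	return 0;
}
```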
* [PATCH v4 06/14] kexec: Add KHO parsing support
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
` (4 preceding siblings ...)
2025-02-06 13:27 ` [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers Mike Rapoport
@ 2025-02-06 13:27 ` Mike Rapoport
2025-02-10 20:50 ` Jason Gunthorpe
2025-03-10 16:20 ` Pratyush Yadav
2025-02-06 13:27 ` [PATCH v4 07/14] kexec: Add KHO support to kexec file loads Mike Rapoport
` (13 subsequent siblings)
19 siblings, 2 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
To: linux-kernel
From: Alexander Graf <graf@amazon.com>
When we have a KHO kexec, we get a device tree and scratch region to
populate the state of the system. Provide helper functions that allow
architecture code to easily handle memory reservations based on them and
give device drivers visibility into the KHO DT and memory reservations
so they can recover their own state.
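[Editor's note] The reservation walk this patch performs in kho_init_reserved_pages() boils down to iterating fixed-size {addr, size} records packed into an FDT "mem" property. A userspace sketch of that iteration, assuming host-endian records for simplicity (real FDT property data is big-endian, which is omitted here); `mem_rec` and `reserve_mem_property` are hypothetical names:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Hypothetical mirror of struct kho_mem as packed into the "mem" property. */
struct mem_rec { uint64_t addr, size; };

/*
 * Walk a property buffer of back-to-back mem_rec entries and sum the bytes
 * that would be handed to memblock_reserve(). The property length must be
 * a whole multiple of the record size, as the kernel side checks.
 */
static int reserve_mem_property(const void *prop, size_t len, uint64_t *total)
{
	size_t n = len / sizeof(struct mem_rec);
	struct mem_rec rec;

	if (len % sizeof(struct mem_rec))
		return -1;		/* malformed property */

	*total = 0;
	for (size_t i = 0; i < n; i++) {
		memcpy(&rec, (const char *)prop + i * sizeof(rec), sizeof(rec));
		*total += rec.size;	/* memblock_reserve(rec.addr, rec.size) here */
	}
	return 0;
}
```

Note the loop index advances one record at a time; indexing the record array by a byte offset would stride past the buffer.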
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
Documentation/ABI/testing/sysfs-firmware-kho | 9 +
MAINTAINERS | 1 +
include/linux/kexec.h | 12 +
kernel/kexec_handover.c | 268 ++++++++++++++++++-
mm/memblock.c | 1 +
5 files changed, 290 insertions(+), 1 deletion(-)
create mode 100644 Documentation/ABI/testing/sysfs-firmware-kho
diff --git a/Documentation/ABI/testing/sysfs-firmware-kho b/Documentation/ABI/testing/sysfs-firmware-kho
new file mode 100644
index 000000000000..e4ed2cb7c810
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-firmware-kho
@@ -0,0 +1,9 @@
+What: /sys/firmware/kho/dt
+Date: December 2023
+Contact: Alexander Graf <graf@amazon.com>
+Description:
+ When the kernel was booted with Kexec HandOver (KHO),
+ the device tree that carries metadata about the previous
+ kernel's state is in this file. This file may disappear
+ once all of its consumers have finished interpreting
+ their metadata.
diff --git a/MAINTAINERS b/MAINTAINERS
index 8327795e8899..e1e01b2a3727 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12826,6 +12826,7 @@ M: Eric Biederman <ebiederm@xmission.com>
L: kexec@lists.infradead.org
S: Maintained
W: http://kernel.org/pub/linux/utils/kernel/kexec/
+F: Documentation/ABI/testing/sysfs-firmware-kho
F: Documentation/ABI/testing/sysfs-kernel-kho
F: include/linux/kexec.h
F: include/uapi/linux/kexec.h
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index ef5c90abafd1..4fdf5ee27144 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -490,12 +490,24 @@ enum kho_event {
};
struct notifier_block;
+struct kho_mem;
#ifdef CONFIG_KEXEC_HANDOVER
+void kho_populate(phys_addr_t dt_phys, phys_addr_t scratch_phys,
+ u64 scratch_len);
+const void *kho_get_fdt(void);
+void kho_return_mem(const struct kho_mem *mem);
+void *kho_claim_mem(const struct kho_mem *mem);
int register_kho_notifier(struct notifier_block *nb);
int unregister_kho_notifier(struct notifier_block *nb);
void kho_memory_init(void);
#else
+static inline void kho_populate(phys_addr_t dt_phys, phys_addr_t scratch_phys,
+ u64 scratch_len) {}
+static inline void *kho_get_fdt(void) { return NULL; }
+static inline void kho_return_mem(const struct kho_mem *mem) { }
+static inline void *kho_claim_mem(const struct kho_mem *mem) { return NULL; }
+
static inline int register_kho_notifier(struct notifier_block *nb) { return 0; }
static inline int unregister_kho_notifier(struct notifier_block *nb) { return 0; }
static inline void kho_memory_init(void) {}
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index eccfe3a25798..3b360e3a6057 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -51,6 +51,15 @@ static struct kho_out kho_out = {
.dt_max = 10 * SZ_1M,
};
+struct kho_in {
+ struct kobject *kobj;
+ phys_addr_t kho_scratch_phys;
+ phys_addr_t handover_phys;
+ u32 handover_len;
+};
+
+static struct kho_in kho_in;
+
int register_kho_notifier(struct notifier_block *nb)
{
return blocking_notifier_chain_register(&kho_out.chain_head, nb);
@@ -63,6 +72,89 @@ int unregister_kho_notifier(struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(unregister_kho_notifier);
+const void *kho_get_fdt(void)
+{
+ if (!kho_in.handover_phys)
+ return NULL;
+
+ return __va(kho_in.handover_phys);
+}
+EXPORT_SYMBOL_GPL(kho_get_fdt);
+
+static void kho_return_pfn(ulong pfn)
+{
+ struct page *page = pfn_to_online_page(pfn);
+
+ if (WARN_ON(!page))
+ return;
+ __free_page(page);
+}
+
+/**
+ * kho_return_mem - Notify the kernel that initially reserved memory is no
+ * longer needed.
+ * @mem: memory range that was preserved during kexec handover
+ *
+ * When the last consumer of a page returns their memory, kho returns the page
+ * to the buddy allocator as free page.
+ */
+void kho_return_mem(const struct kho_mem *mem)
+{
+ unsigned long start_pfn, end_pfn, pfn;
+
+ start_pfn = PFN_DOWN(mem->addr);
+ end_pfn = PFN_UP(mem->addr + mem->size);
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ kho_return_pfn(pfn);
+}
+EXPORT_SYMBOL_GPL(kho_return_mem);
+
+static int kho_claim_pfn(ulong pfn)
+{
+ struct page *page = pfn_to_online_page(pfn);
+
+ if (!page)
+ return -ENOMEM;
+
+	/* like free_reserved_page(), except the page is not actually freed */
+ ClearPageReserved(page);
+ init_page_count(page);
+ adjust_managed_page_count(page, 1);
+
+ return 0;
+}
+
+/**
+ * kho_claim_mem - Notify the kernel that a handed over memory range is now
+ * in use
+ * @mem: memory range that was preserved during kexec handover
+ *
+ * A kernel subsystem preserved that range during handover and it is going
+ * to reuse this range after kexec. The pages in the range are treated as
+ * allocated, but not %PG_reserved.
+ *
+ * Return: virtual address of the preserved memory range
+ */
+void *kho_claim_mem(const struct kho_mem *mem)
+{
+ unsigned long start_pfn, end_pfn, pfn;
+ void *va = __va(mem->addr);
+
+ start_pfn = PFN_DOWN(mem->addr);
+ end_pfn = PFN_UP(mem->addr + mem->size);
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ int err = kho_claim_pfn(pfn);
+
+ if (err)
+ return NULL;
+ }
+
+ return va;
+}
+EXPORT_SYMBOL_GPL(kho_claim_mem);
+
static ssize_t dt_read(struct file *file, struct kobject *kobj,
struct bin_attribute *attr, char *buf,
loff_t pos, size_t count)
@@ -273,6 +365,30 @@ static const struct attribute *kho_out_attrs[] = {
NULL,
};
+/* Handling for /sys/firmware/kho */
+static BIN_ATTR_SIMPLE_RO(dt_fw);
+
+static __init int kho_in_sysfs_init(const void *fdt)
+{
+ int err;
+
+ kho_in.kobj = kobject_create_and_add("kho", firmware_kobj);
+ if (!kho_in.kobj)
+ return -ENOMEM;
+
+ bin_attr_dt_fw.size = fdt_totalsize(fdt);
+ bin_attr_dt_fw.private = (void *)fdt;
+ err = sysfs_create_bin_file(kho_in.kobj, &bin_attr_dt_fw);
+ if (err)
+ goto err_put_kobj;
+
+ return 0;
+
+err_put_kobj:
+ kobject_put(kho_in.kobj);
+ return err;
+}
+
static __init int kho_out_sysfs_init(void)
{
int err;
@@ -294,6 +410,7 @@ static __init int kho_out_sysfs_init(void)
static __init int kho_init(void)
{
+ const void *fdt = kho_get_fdt();
int err;
if (!kho_enable)
@@ -303,6 +420,21 @@ static __init int kho_init(void)
if (err)
return err;
+ if (fdt) {
+ err = kho_in_sysfs_init(fdt);
+ /*
+ * Failure to create /sys/firmware/kho/dt does not prevent
+ * reviving state from KHO and setting up KHO for the next
+ * kexec.
+ */
+ if (err)
+ pr_err("failed exposing handover FDT in sysfs\n");
+
+ kho_scratch = __va(kho_in.kho_scratch_phys);
+
+ return 0;
+ }
+
for (int i = 0; i < kho_scratch_cnt; i++) {
unsigned long base_pfn = PHYS_PFN(kho_scratch[i].addr);
unsigned long count = kho_scratch[i].size >> PAGE_SHIFT;
@@ -444,7 +576,141 @@ static void kho_reserve_scratch(void)
kho_enable = false;
}
+/*
+ * Scan the DT for any memory ranges and make sure they are reserved in
+ * memblock, otherwise they will end up in a weird state on free lists.
+ */
+static void kho_init_reserved_pages(void)
+{
+ const void *fdt = kho_get_fdt();
+ int offset = 0, depth = 0, initial_depth = 0, len;
+
+ if (!fdt)
+ return;
+
+	/* Walk all nodes and reserve every range listed in a "mem" property */
+ for (offset = 0;
+ offset >= 0 && depth >= initial_depth;
+ offset = fdt_next_node(fdt, offset, &depth)) {
+ const struct kho_mem *mems;
+ u32 i;
+
+ mems = fdt_getprop(fdt, offset, "mem", &len);
+ if (!mems || len & (sizeof(*mems) - 1))
+ continue;
+
+		for (i = 0; i < len / sizeof(*mems); i++) {
+			const struct kho_mem *mem = &mems[i];
+
+ memblock_reserve(mem->addr, mem->size);
+ }
+ }
+}
+
+static void __init kho_release_scratch(void)
+{
+ phys_addr_t start, end;
+ u64 i;
+
+ memmap_init_kho_scratch_pages();
+
+ /*
+ * Mark scratch mem as CMA before we return it. That way we
+ * ensure that no kernel allocations happen on it. That means
+ * we can reuse it as scratch memory again later.
+ */
+ __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
+ MEMBLOCK_KHO_SCRATCH, &start, &end, NULL) {
+ ulong start_pfn = pageblock_start_pfn(PFN_DOWN(start));
+ ulong end_pfn = pageblock_align(PFN_UP(end));
+ ulong pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages)
+ set_pageblock_migratetype(pfn_to_page(pfn), MIGRATE_CMA);
+ }
+}
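kho_release_scratch() widens each scratch range outward to whole pageblocks before flipping their migratetype to CMA. A userspace C sketch of that alignment arithmetic, assuming 4 KiB pages and 2 MiB pageblocks (both are configuration-dependent in the real kernel):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT		12	/* assumed 4 KiB base pages */
#define PAGEBLOCK_NR_PAGES	512	/* assumed 2 MiB pageblocks (order 9) */

/*
 * Round a [start, end) physical range outward to pageblock boundaries,
 * modeling pageblock_start_pfn(PFN_DOWN(start)) and
 * pageblock_align(PFN_UP(end)) from the code above.
 */
static void pageblock_bounds(uint64_t start, uint64_t end,
			     uint64_t *start_pfn, uint64_t *end_pfn)
{
	uint64_t first = start >> PAGE_SHIFT;	/* PFN_DOWN: round down */
	uint64_t last = (end + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT; /* PFN_UP */

	*start_pfn = first & ~(uint64_t)(PAGEBLOCK_NR_PAGES - 1);
	*end_pfn = (last + PAGEBLOCK_NR_PAGES - 1) &
		   ~(uint64_t)(PAGEBLOCK_NR_PAGES - 1);
}
```

Rounding outward guarantees that set_pageblock_migratetype() is never applied to a partially covered pageblock.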
+
void __init kho_memory_init(void)
{
- kho_reserve_scratch();
+ if (!kho_get_fdt()) {
+ kho_reserve_scratch();
+ } else {
+ kho_init_reserved_pages();
+ kho_release_scratch();
+ }
+}
+
+void __init kho_populate(phys_addr_t handover_dt_phys, phys_addr_t scratch_phys,
+ u64 scratch_len)
+{
+ void *handover_dt;
+ struct kho_mem *scratch;
+
+ /* Determine the real size of the DT */
+ handover_dt = early_memremap(handover_dt_phys, sizeof(struct fdt_header));
+ if (!handover_dt) {
+ pr_warn("setup: failed to memremap kexec FDT (0x%llx)\n", handover_dt_phys);
+ return;
+ }
+
+ if (fdt_check_header(handover_dt)) {
+ pr_warn("setup: kexec handover FDT is invalid (0x%llx)\n", handover_dt_phys);
+ early_memunmap(handover_dt, sizeof(struct fdt_header));
+ return;
+ }
+
+ kho_in.handover_len = fdt_totalsize(handover_dt);
+ kho_in.handover_phys = handover_dt_phys;
+
+ early_memunmap(handover_dt, sizeof(struct fdt_header));
+
+ /* Reserve the DT so we can still access it in late boot */
+ memblock_reserve(kho_in.handover_phys, kho_in.handover_len);
+
+ kho_in.kho_scratch_phys = scratch_phys;
+ kho_scratch_cnt = scratch_len / sizeof(*kho_scratch);
+ scratch = early_memremap(scratch_phys, scratch_len);
+ if (!scratch) {
+ pr_warn("setup: failed to memremap kexec scratch (0x%llx)\n", scratch_phys);
+ return;
+ }
+
+ /*
+ * The previous kernel passed us safe, contiguous blocks of memory for
+ * early boot purposes, e.g. so that we can resize the memblock array
+ * as needed.
+ */
+ for (int i = 0; i < kho_scratch_cnt; i++) {
+ struct kho_mem *area = &scratch[i];
+ u64 size = area->size;
+
+ memblock_add(area->addr, size);
+
+ if (WARN_ON(memblock_mark_kho_scratch(area->addr, size))) {
+ pr_err("Kexec failed to mark the scratch region. Disabling KHO revival.\n");
+ kho_in.handover_len = 0;
+ kho_in.handover_phys = 0;
+ scratch = NULL;
+ break;
+ }
+ pr_debug("Marked 0x%pa+0x%pa as scratch\n", &area->addr, &size);
+ }
+
+ early_memunmap(scratch, scratch_len);
+
+ if (!scratch)
+ return;
+
+ memblock_reserve(scratch_phys, scratch_len);
+
+ /*
+ * Now that we have a viable region of scratch memory, tell the
+ * memblock allocator to use only it for any allocations. That way
+ * we ensure that nothing scribbles over in-use data while we
+ * initialize the page tables, which we will need in order to ingest
+ * all memory reservations from the previous kernel.
+ */
+ memblock_set_kho_scratch_only();
+
+ pr_info("setup: Found kexec handover data. Will skip init for some devices\n");
}
diff --git a/mm/memblock.c b/mm/memblock.c
index 54bd95745381..84df96efca62 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2366,6 +2366,7 @@ void __init memblock_free_all(void)
free_unused_memmap();
reset_all_zones_managed_pages();
+ memblock_clear_kho_scratch_only();
pages = free_low_memory_core_early();
totalram_pages_add(pages);
}
--
2.47.2
* [PATCH v4 07/14] kexec: Add KHO support to kexec file loads
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
From: Alexander Graf <graf@amazon.com>
Kexec has 2 modes: A user space driven mode and a kernel driven mode.
For the kernel driven mode, kernel code determines the physical
addresses of all target buffers that the payload gets copied into.
With KHO, we can only safely copy payloads into the "scratch area".
Teach the kexec file loader about it, so it only allocates for that
area. In addition, enlighten it with support to ask the KHO subsystem
for its respective payloads to copy into target memory. Also teach the
KHO subsystem how to fill the images for file loads.
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/kexec.h | 7 ++++
kernel/kexec_file.c | 19 +++++++++
kernel/kexec_handover.c | 92 +++++++++++++++++++++++++++++++++++++++++
kernel/kexec_internal.h | 16 +++++++
4 files changed, 134 insertions(+)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 4fdf5ee27144..c5e851717089 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -364,6 +364,13 @@ struct kimage {
size_t ima_buffer_size;
#endif
+#ifdef CONFIG_KEXEC_HANDOVER
+ struct {
+ struct kexec_buf dt;
+ struct kexec_buf scratch;
+ } kho;
+#endif
+
/* Core ELF header buffer */
void *elf_headers;
unsigned long elf_headers_sz;
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 3eedb8c226ad..d28d23bc1cf4 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -113,6 +113,12 @@ void kimage_file_post_load_cleanup(struct kimage *image)
image->ima_buffer = NULL;
#endif /* CONFIG_IMA_KEXEC */
+#ifdef CONFIG_KEXEC_HANDOVER
+ kvfree(image->kho.dt.buffer);
+ image->kho.dt = (struct kexec_buf) {};
+ image->kho.scratch = (struct kexec_buf) {};
+#endif
+
/* See if architecture has anything to cleanup post load */
arch_kimage_file_post_load_cleanup(image);
@@ -253,6 +259,11 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
/* IMA needs to pass the measurement list to the next kernel. */
ima_add_kexec_buffer(image);
+ /* If KHO is active, add its images to the list */
+ ret = kho_fill_kimage(image);
+ if (ret)
+ goto out;
+
/* Call image load handler */
ldata = kexec_image_load_default(image);
@@ -636,6 +647,14 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN)
return 0;
+ /*
+ * If KHO is active, only use KHO scratch memory. All other memory
+ * could potentially be handed over.
+ */
+ ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback);
+ if (ret <= 0)
+ return ret;
+
if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
ret = kexec_walk_resources(kbuf, locate_mem_hole_callback);
else
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 3b360e3a6057..c26753d613cb 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -16,6 +16,8 @@
#include <linux/kexec_handover.h>
#include <linux/page-isolation.h>
+#include "kexec_internal.h"
+
static bool kho_enable __ro_after_init;
static int __init kho_parse_enable(char *p)
@@ -155,6 +157,96 @@ void *kho_claim_mem(const struct kho_mem *mem)
}
EXPORT_SYMBOL_GPL(kho_claim_mem);
+int kho_fill_kimage(struct kimage *image)
+{
+ ssize_t scratch_size;
+ int err = 0;
+ void *dt;
+
+ mutex_lock(&kho_out.lock);
+
+ if (!kho_out.active)
+ goto out;
+
+ /*
+ * Create a kexec copy of the DT here. We need it because the
+ * lifetime of kho.dt may differ from that of the kimage.
+ */
+ dt = kvmemdup(kho_out.dt, kho_out.dt_len, GFP_KERNEL);
+ if (!dt) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ /* Allocate target memory for kho dt */
+ image->kho.dt = (struct kexec_buf) {
+ .image = image,
+ .buffer = dt,
+ .bufsz = kho_out.dt_len,
+ .mem = KEXEC_BUF_MEM_UNKNOWN,
+ .memsz = kho_out.dt_len,
+ .buf_align = SZ_64K, /* Makes it easier to map */
+ .buf_max = ULONG_MAX,
+ .top_down = true,
+ };
+ err = kexec_add_buffer(&image->kho.dt);
+ if (err) {
+ pr_err("failed to add KHO device tree to kimage\n");
+ goto out;
+ }
+
+ scratch_size = sizeof(*kho_scratch) * kho_scratch_cnt;
+ image->kho.scratch = (struct kexec_buf) {
+ .image = image,
+ .buffer = kho_scratch,
+ .bufsz = scratch_size,
+ .mem = KEXEC_BUF_MEM_UNKNOWN,
+ .memsz = scratch_size,
+ .buf_align = SZ_64K, /* Makes it easier to map */
+ .buf_max = ULONG_MAX,
+ .top_down = true,
+ };
+ err = kexec_add_buffer(&image->kho.scratch);
+
+out:
+ mutex_unlock(&kho_out.lock);
+ return err;
+}
+
+static int kho_walk_scratch(struct kexec_buf *kbuf,
+ int (*func)(struct resource *, void *))
+{
+ int ret = 0;
+ int i;
+
+ for (i = 0; i < kho_scratch_cnt; i++) {
+ struct resource res = {
+ .start = kho_scratch[i].addr,
+ .end = kho_scratch[i].addr + kho_scratch[i].size - 1,
+ };
+
+ /* Try to fit the kimage into our KHO scratch region */
+ ret = func(&res, kbuf);
+ if (ret)
+ break;
+ }
+
+ return ret;
+}
+
+int kho_locate_mem_hole(struct kexec_buf *kbuf,
+ int (*func)(struct resource *, void *))
+{
+ int ret;
+
+ if (!kho_out.active || kbuf->image->type == KEXEC_TYPE_CRASH)
+ return 1;
+
+ ret = kho_walk_scratch(kbuf, func);
+
+ return ret == 1 ? 0 : -EADDRNOTAVAIL;
+}
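kho_walk_scratch() uses kexec's callback convention: the callback returns 0 to keep looking, 1 when it found a fit, and a negative errno on failure, which kho_locate_mem_hole() then translates into 0 or -EADDRNOTAVAIL. A userspace C sketch of that contract (hypothetical names; the kernel's callback is locate_mem_hole_callback operating on struct resource):

```c
#include <assert.h>
#include <stdint.h>

struct range { uint64_t start, end; };

/*
 * Walk candidate regions with a kexec-style callback: 0 means "keep
 * looking", 1 means "found a fit", negative means error. Stop at the
 * first nonzero return, exactly like kho_walk_scratch() above.
 */
static int walk_regions(const struct range *r, int n,
			int (*fn)(const struct range *, void *), void *arg)
{
	int ret = 0;

	for (int i = 0; i < n; i++) {
		ret = fn(&r[i], arg);
		if (ret)
			break;
	}
	return ret;
}

/* Example callback: does the region hold at least *(uint64_t *)arg bytes? */
static int fits(const struct range *r, void *arg)
{
	return (r->end - r->start) >= *(uint64_t *)arg ? 1 : 0;
}
```

With this convention, a caller that receives 1 knows a suitable hole was found and anything else means placement in scratch failed.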
+
static ssize_t dt_read(struct file *file, struct kobject *kobj,
struct bin_attribute *attr, char *buf,
loff_t pos, size_t count)
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index d35d9792402d..c535dbd3b5bd 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -39,4 +39,20 @@ extern size_t kexec_purgatory_size;
#else /* CONFIG_KEXEC_FILE */
static inline void kimage_file_post_load_cleanup(struct kimage *image) { }
#endif /* CONFIG_KEXEC_FILE */
+
+struct kexec_buf;
+
+#ifdef CONFIG_KEXEC_HANDOVER
+int kho_locate_mem_hole(struct kexec_buf *kbuf,
+ int (*func)(struct resource *, void *));
+int kho_fill_kimage(struct kimage *image);
+#else
+static inline int kho_locate_mem_hole(struct kexec_buf *kbuf,
+ int (*func)(struct resource *, void *))
+{
+ return 0;
+}
+
+static inline int kho_fill_kimage(struct kimage *image) { return 0; }
+#endif
#endif /* LINUX_KEXEC_INTERNAL_H */
--
2.47.2
* [PATCH v4 08/14] kexec: Add config option for KHO
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
From: Alexander Graf <graf@amazon.com>
We have all generic code in place now to support Kexec with KHO. This
patch adds a config option that depends on architecture support to
enable KHO support.
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
kernel/Kconfig.kexec | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index 4d111f871951..332824d8d6dc 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -95,6 +95,19 @@ config KEXEC_JUMP
Jump between original kernel and kexeced kernel and invoke
code in physical address mode via KEXEC
+config KEXEC_HANDOVER
+ bool "kexec handover"
+ depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
+ select MEMBLOCK_KHO_SCRATCH
+ select KEXEC_FILE
+ select LIBFDT
+ select CMA
+ help
+ Allow kexec to hand over state across kernels by generating and
+ passing additional metadata to the target kernel. This is useful
+ to keep data or state alive across the kexec. For this to work,
+ both source and target kernels need to have this option enabled.
+
config CRASH_DUMP
bool "kernel crash dumps"
default ARCH_DEFAULT_CRASH_DUMP
--
2.47.2
* [PATCH v4 09/14] kexec: Add documentation for KHO
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
From: Alexander Graf <graf@amazon.com>
With KHO in place, let's add documentation that describes what it is and
how to use it.
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
Documentation/kho/concepts.rst | 80 ++++++++++++++++++++++++++++++++
Documentation/kho/index.rst | 19 ++++++++
Documentation/kho/usage.rst | 60 ++++++++++++++++++++++++
Documentation/subsystem-apis.rst | 1 +
MAINTAINERS | 1 +
5 files changed, 161 insertions(+)
create mode 100644 Documentation/kho/concepts.rst
create mode 100644 Documentation/kho/index.rst
create mode 100644 Documentation/kho/usage.rst
diff --git a/Documentation/kho/concepts.rst b/Documentation/kho/concepts.rst
new file mode 100644
index 000000000000..232bddacc0ef
--- /dev/null
+++ b/Documentation/kho/concepts.rst
@@ -0,0 +1,80 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+=======================
+Kexec Handover Concepts
+=======================
+
+Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
+arbitrary properties as well as memory locations - across kexec.
+
+It introduces multiple concepts:
+
+KHO Device Tree
+---------------
+
+Every KHO kexec carries a KHO specific flattened device tree blob that
+describes the state of the system. Device drivers can register to KHO to
+serialize their state before kexec. After KHO, device drivers can read
+the device tree and extract previous state.
+
+KHO only uses the fdt container format and libfdt library, but does not
+adhere to the same property semantics that normal device trees do: Properties
+are passed in native endianness and standardized properties like ``regs`` and
+``ranges`` do not exist, hence there are no ``#...-cells`` properties.
+
+KHO introduces a new concept to its device tree: ``mem`` properties. A
+``mem`` property can be inside any subnode in the device tree. When present,
+it contains an array of physical memory ranges that the new kernel must mark
+as reserved on boot. It is recommended, but not required, to make these ranges
+as physically contiguous as possible to reduce the number of array elements ::
+
+ struct kho_mem {
+ __u64 addr;
+ __u64 size;
+ };
+
+After boot, drivers can call the kho subsystem to transfer ownership of memory
+that was reserved via a ``mem`` property to themselves to continue using memory
+from the previous execution.
+
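The ``mem`` property is simply this struct packed back to back in native endianness. A userspace C sketch of how a producer might build such an array, coalescing physically contiguous ranges as recommended above (hypothetical helper, not the in-kernel API; field naming follows the in-kernel struct):

```c
#include <assert.h>
#include <stdint.h>

struct kho_mem {
	uint64_t addr;
	uint64_t size;
};

/*
 * Append a physical range to a "mem" property array, merging it into
 * the previous entry when the two ranges are contiguous, which keeps
 * the number of array elements small. Returns the new entry count.
 */
static int kho_mem_append(struct kho_mem *mems, int count,
			  uint64_t addr, uint64_t size)
{
	if (count > 0 && mems[count - 1].addr + mems[count - 1].size == addr) {
		mems[count - 1].size += size;	/* contiguous: grow last entry */
		return count;
	}
	mems[count].addr = addr;
	mems[count].size = size;
	return count + 1;
}
```

The resulting `count * sizeof(struct kho_mem)` bytes would be what a driver stores as the property payload.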
+The KHO device tree follows the in-Linux schema requirements. Any element in
+the device tree is documented via device tree schema yamls that explain what
+data gets transferred.
+
+Scratch Regions
+---------------
+
+To kexec into a new kernel, we need a physically contiguous memory range that
+contains no handed over memory. Kexec then places the target kernel and initrd
+into that region. The new kernel exclusively uses this region for memory
+allocations during boot, up to the initialization of the page allocator.
+
+We guarantee that we always have such regions through the scratch regions: On
+first boot KHO allocates several physically contiguous memory regions. Since
+after kexec these regions will be used for early memory allocations, there is
+a scratch region per NUMA node plus a scratch region to satisfy allocation
+requests that do not require a particular NUMA node assignment.
+
+By default, the size of the scratch regions is calculated based on the amount
+of memory allocated during boot. The ``kho_scratch`` kernel command line
+option may be used to explicitly define the size of the scratch regions.
+
+The scratch regions are declared as CMA when the page allocator is initialized
+so that their memory can be used during system lifetime. CMA gives us the
+guarantee that no handover pages land in that region, because handover pages
+must be at a static physical memory location and CMA enforces that only
+movable pages can be located inside.
+
+After KHO kexec, we ignore the ``kho_scratch`` kernel command line option and
+instead reuse the exact same region that was originally allocated. This allows
+us to recursively execute any number of KHO kexecs. Because we used this region
+for boot memory allocations and as target memory for kexec blobs, some parts
+of that memory region may be reserved. These reservations are irrelevant for
+the next KHO, because kexec can overwrite even the original kernel.
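The ``kho_scratch`` option takes sizes in the usual kernel ``<number>[KMG]`` notation, e.g. ``kho_scratch=512M,512M``. A userspace C sketch of parsing that format (hypothetical stand-ins for the kernel's memparse()-based early_param handling):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse one "<number>[KMG]" token, a simplified model of memparse(). */
static uint64_t parse_size(const char *s, const char **endp)
{
	char *end;
	uint64_t val = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': val <<= 30; end++; break;
	case 'M': val <<= 20; end++; break;
	case 'K': val <<= 10; end++; break;
	}
	if (endp)
		*endp = end;
	return val;
}

/* kho_scratch=<global>,<per-node>: two sizes separated by a comma. */
static int parse_kho_scratch(const char *arg, uint64_t *global,
			     uint64_t *per_node)
{
	const char *p;

	*global = parse_size(arg, &p);
	if (*p != ',')
		return -1;	/* malformed: both sizes are required */
	*per_node = parse_size(p + 1, NULL);
	return 0;
}
```

This is only a model of the accepted syntax; the kernel's actual handler also validates and reserves the regions.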
+
+KHO active phase
+----------------
+
+To enable the user space based kexec file loader, the kernel needs to be able to
+provide the device tree that describes the previous kernel's state before
+performing the actual kexec. The process of generating that device tree is
+called serialization. When the device tree is generated, some properties
+of the system may become immutable because they are already written down
+in the device tree. That state is called the KHO active phase.
diff --git a/Documentation/kho/index.rst b/Documentation/kho/index.rst
new file mode 100644
index 000000000000..5e7eeeca8520
--- /dev/null
+++ b/Documentation/kho/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+========================
+Kexec Handover Subsystem
+========================
+
+.. toctree::
+ :maxdepth: 1
+
+ concepts
+ usage
+
+.. only:: subproject and html
+
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/kho/usage.rst b/Documentation/kho/usage.rst
new file mode 100644
index 000000000000..e7300fbb309c
--- /dev/null
+++ b/Documentation/kho/usage.rst
@@ -0,0 +1,60 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+====================
+Kexec Handover Usage
+====================
+
+Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
+arbitrary properties as well as memory locations - across kexec.
+
+This document expects that you are familiar with the base KHO concepts
+described in :ref:`Documentation/kho/concepts.rst <concepts>`. If you have
+not read them yet, please do so now.
+
+Prerequisites
+-------------
+
+KHO is available when the ``CONFIG_KEXEC_HANDOVER`` config option is set to y
+at compile time. Every KHO producer may have its own config option that you
+need to enable if you would like to preserve their respective state across
+kexec.
+
+To use KHO, please boot the kernel with the ``kho=on`` command line
+parameter. You may use the ``kho_scratch`` parameter to define the size of
+the scratch regions. For example, ``kho_scratch=512M,512M`` will reserve
+512 MiB for the global scratch region and 512 MiB for each per-NUMA-node
+scratch region on boot.
+
+Perform a KHO kexec
+-------------------
+
+Before you can perform a KHO kexec, you need to move the system into the
+:ref:`Documentation/kho/concepts.rst <KHO active phase>` ::
+
+ $ echo 1 > /sys/kernel/kho/active
+
+After this command, the KHO device tree is available in ``/sys/kernel/kho/dt``.
+
+Next, load the target payload and kexec into it. It is important that you
+use the ``-s`` parameter to use the in-kernel kexec file loader, as user
+space kexec tooling currently has no support for KHO with the user space
+based file loader ::
+
+ # kexec -l Image --initrd=initrd -s
+ # kexec -e
+
+The new kernel will boot up and contain some of the previous kernel's state.
+
+For example, if you used ``reserve_mem`` command line parameter to create
+an early memory reservation, the new kernel will have that memory at the
+same physical address as the old kernel.
+
+Abort a KHO kexec
+-----------------
+
+You can move the system out of KHO active phase again by calling ::
+
+ $ echo 0 > /sys/kernel/kho/active
+
+After this command, the KHO device tree is no longer available in
+``/sys/kernel/kho/dt``.
diff --git a/Documentation/subsystem-apis.rst b/Documentation/subsystem-apis.rst
index b52ad5b969d4..5fc69d6ff9f0 100644
--- a/Documentation/subsystem-apis.rst
+++ b/Documentation/subsystem-apis.rst
@@ -90,3 +90,4 @@ Other subsystems
peci/index
wmi/index
tee/index
+ kho/index
diff --git a/MAINTAINERS b/MAINTAINERS
index e1e01b2a3727..82c2ef421c00 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12828,6 +12828,7 @@ S: Maintained
W: http://kernel.org/pub/linux/utils/kernel/kexec/
F: Documentation/ABI/testing/sysfs-firmware-kho
F: Documentation/ABI/testing/sysfs-kernel-kho
+F: Documentation/kho/
F: include/linux/kexec.h
F: include/uapi/linux/kexec.h
F: kernel/kexec*
--
2.47.2
* [PATCH v4 10/14] arm64: Add KHO support
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
From: Alexander Graf <graf@amazon.com>
We now have all bits in place to support KHO kexecs. This patch adds
awareness of KHO in the kexec file as well as boot path for arm64 and
adds the respective kconfig option to the architecture so that it can
use KHO successfully.
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/arm64/Kconfig | 3 +++
drivers/of/fdt.c | 36 ++++++++++++++++++++++++++++++++++++
drivers/of/kexec.c | 42 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 81 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fcdd0ed3eca8..5d9f07cea258 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1590,6 +1590,9 @@ config ARCH_SUPPORTS_KEXEC_IMAGE_VERIFY_SIG
config ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
def_bool y
+config ARCH_SUPPORTS_KEXEC_HANDOVER
+ def_bool y
+
config ARCH_SUPPORTS_CRASH_DUMP
def_bool y
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index aedd0e2dcd89..3178bf9c6bd2 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -875,6 +875,39 @@ void __init early_init_dt_check_for_usable_mem_range(void)
memblock_add(rgn[i].base, rgn[i].size);
}
+/**
+ * early_init_dt_check_kho - Decode info required for kexec handover from DT
+ */
+static void __init early_init_dt_check_kho(void)
+{
+ unsigned long node = chosen_node_offset;
+ u64 kho_start, scratch_start, scratch_size;
+ const __be32 *p;
+ int l;
+
+ if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER) || (long)node < 0)
+ return;
+
+ p = of_get_flat_dt_prop(node, "linux,kho-dt", &l);
+ if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+ return;
+
+ kho_start = dt_mem_next_cell(dt_root_addr_cells, &p);
+
+ p = of_get_flat_dt_prop(node, "linux,kho-scratch", &l);
+ if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+ return;
+
+ scratch_start = dt_mem_next_cell(dt_root_addr_cells, &p);
+ scratch_size = dt_mem_next_cell(dt_root_addr_cells, &p);
+
+ p = of_get_flat_dt_prop(node, "linux,kho-mem", &l);
+ if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+ return;
+
+ kho_populate(kho_start, scratch_start, scratch_size);
+}
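early_init_dt_check_kho() decodes addresses from the flattened device tree, where values are stored as big-endian 32-bit cells and the number of cells per address comes from the root ``#address-cells``/``#size-cells``. A userspace C sketch of what dt_mem_next_cell() does (a model, not the kernel function):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine `cells` big-endian 32-bit FDT cells into one 64-bit value,
 * advancing the cursor past the consumed cells, the way
 * dt_mem_next_cell() does in the code above.
 */
static uint64_t dt_next_cell(int cells, const uint8_t **p)
{
	uint64_t v = 0;

	while (cells--) {
		/* Each cell is stored big-endian in the property blob. */
		v = (v << 32) |
		    ((uint64_t)(*p)[0] << 24 | (uint64_t)(*p)[1] << 16 |
		     (uint64_t)(*p)[2] << 8 | (uint64_t)(*p)[3]);
		*p += 4;
	}
	return v;
}
```

With ``#address-cells = 2`` this consumes eight bytes per address, which is why the code checks the property length against `(dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32)` first.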
+
#ifdef CONFIG_SERIAL_EARLYCON
int __init early_init_dt_scan_chosen_stdout(void)
@@ -1169,6 +1202,9 @@ void __init early_init_dt_scan_nodes(void)
/* Handle linux,usable-memory-range property */
early_init_dt_check_for_usable_mem_range();
+
+ /* Handle kexec handover */
+ early_init_dt_check_kho();
}
bool __init early_init_dt_scan(void *dt_virt, phys_addr_t dt_phys)
diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index 5b924597a4de..f6cf0bc13246 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -264,6 +264,43 @@ static inline int setup_ima_buffer(const struct kimage *image, void *fdt,
}
#endif /* CONFIG_IMA_KEXEC */
+static int kho_add_chosen(const struct kimage *image, void *fdt, int chosen_node)
+{
+ void *dt = NULL;
+ phys_addr_t dt_mem = 0;
+ phys_addr_t dt_len = 0;
+ phys_addr_t scratch_mem = 0;
+ phys_addr_t scratch_len = 0;
+ int ret = 0;
+
+#ifdef CONFIG_KEXEC_HANDOVER
+ dt = image->kho.dt.buffer;
+ dt_mem = image->kho.dt.mem;
+ dt_len = image->kho.dt.bufsz;
+
+ scratch_mem = image->kho.scratch.mem;
+ scratch_len = image->kho.scratch.bufsz;
+#endif
+
+ if (!dt)
+ goto out;
+
+ pr_debug("Adding kho metadata to DT");
+
+ ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-dt",
+ dt_mem, dt_len);
+ if (ret)
+ goto out;
+
+ ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-scratch",
+ scratch_mem, scratch_len);
+ if (ret)
+ goto out;
+
+out:
+ return ret;
+}
+
/*
* of_kexec_alloc_and_setup_fdt - Alloc and setup a new Flattened Device Tree
*
@@ -414,6 +451,11 @@ void *of_kexec_alloc_and_setup_fdt(const struct kimage *image,
#endif
}
+ /* Add kho metadata if this is a KHO image */
+ ret = kho_add_chosen(image, fdt, chosen_node);
+ if (ret)
+ goto out;
+
/* add bootargs */
if (cmdline) {
ret = fdt_setprop_string(fdt, chosen_node, "bootargs", cmdline);
--
2.47.2
* [PATCH v4 11/14] x86/setup: use memblock_reserve_kern for memory used by kernel
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
memblock_reserve() does not distinguish memory used by firmware from memory
used by kernel.
The distinction is nice to have for accounting of early memory allocations
and reservations, but it is essential for kexec handover (kho) to know how
much memory kernel consumes during boot.
Use memblock_reserve_kern() to reserve kernel memory, such as kernel image,
initrd and setup data.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/x86/kernel/setup.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index cebee310e200..c80c124af332 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -220,8 +220,8 @@ static void __init cleanup_highmap(void)
static void __init reserve_brk(void)
{
if (_brk_end > _brk_start)
- memblock_reserve(__pa_symbol(_brk_start),
- _brk_end - _brk_start);
+ memblock_reserve_kern(__pa_symbol(_brk_start),
+ _brk_end - _brk_start);
/* Mark brk area as locked down and no longer taking any
new allocations */
@@ -294,7 +294,7 @@ static void __init early_reserve_initrd(void)
!ramdisk_image || !ramdisk_size)
return; /* No initrd provided by bootloader */
- memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);
+ memblock_reserve_kern(ramdisk_image, ramdisk_end - ramdisk_image);
}
static void __init reserve_initrd(void)
@@ -347,7 +347,7 @@ static void __init add_early_ima_buffer(u64 phys_addr)
}
if (data->size) {
- memblock_reserve(data->addr, data->size);
+ memblock_reserve_kern(data->addr, data->size);
ima_kexec_buffer_phys = data->addr;
ima_kexec_buffer_size = data->size;
}
@@ -447,7 +447,7 @@ static void __init memblock_x86_reserve_range_setup_data(void)
len = sizeof(*data);
pa_next = data->next;
- memblock_reserve(pa_data, sizeof(*data) + data->len);
+ memblock_reserve_kern(pa_data, sizeof(*data) + data->len);
if (data->type == SETUP_INDIRECT) {
len += data->len;
@@ -461,7 +461,7 @@ static void __init memblock_x86_reserve_range_setup_data(void)
indirect = (struct setup_indirect *)data->data;
if (indirect->type != SETUP_INDIRECT)
- memblock_reserve(indirect->addr, indirect->len);
+ memblock_reserve_kern(indirect->addr, indirect->len);
}
pa_data = pa_next;
@@ -649,7 +649,7 @@ static void __init early_reserve_memory(void)
* __end_of_kernel_reserve symbol must be explicitly reserved with a
* separate memblock_reserve() or they will be discarded.
*/
- memblock_reserve(__pa_symbol(_text),
+ memblock_reserve_kern(__pa_symbol(_text),
(unsigned long)__end_of_kernel_reserve - (unsigned long)_text);
/*
--
2.47.2
* [PATCH v4 12/14] x86: Add KHO support
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
From: Alexander Graf <graf@amazon.com>
We now have all bits in place to support KHO kexecs. This patch adds
KHO awareness to the kexec file load and boot paths for x86 and adds
the respective kconfig option to the architecture so that it can use
KHO successfully.
In addition, it enlightens the decompression code with KHO so that its
KASLR location finder only considers memory regions that are not already
occupied by KHO memory.
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/x86/Kconfig | 3 ++
arch/x86/boot/compressed/kaslr.c | 52 +++++++++++++++++++++++++-
arch/x86/include/asm/setup.h | 4 ++
arch/x86/include/uapi/asm/setup_data.h | 13 ++++++-
arch/x86/kernel/e820.c | 18 +++++++++
arch/x86/kernel/kexec-bzimage64.c | 36 ++++++++++++++++++
arch/x86/kernel/setup.c | 25 +++++++++++++
arch/x86/realmode/init.c | 2 +
8 files changed, 151 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 87198d957e2f..3a2d7b381704 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2090,6 +2090,9 @@ config ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG
config ARCH_SUPPORTS_KEXEC_JUMP
def_bool y
+config ARCH_SUPPORTS_KEXEC_HANDOVER
+ def_bool y
+
config ARCH_SUPPORTS_CRASH_DUMP
def_bool X86_64 || (X86_32 && HIGHMEM)
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index f03d59ea6e40..c932a30deb20 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -760,6 +760,55 @@ static void process_e820_entries(unsigned long minimum,
}
}
+/*
+ * If KHO is active, only process its scratch areas to ensure we are not
+ * stepping onto preserved memory.
+ */
+#ifdef CONFIG_KEXEC_HANDOVER
+static bool process_kho_entries(unsigned long minimum, unsigned long image_size)
+{
+ struct kho_mem *kho_scratch;
+ struct setup_data *ptr;
+ int i, nr_areas = 0;
+
+ ptr = (struct setup_data *)(unsigned long)boot_params_ptr->hdr.setup_data;
+ while (ptr) {
+ if (ptr->type == SETUP_KEXEC_KHO) {
+ struct kho_data *kho = (struct kho_data *)ptr->data;
+
+ kho_scratch = (void *)kho->scratch_addr;
+ nr_areas = kho->scratch_size / sizeof(*kho_scratch);
+
+ break;
+ }
+
+ ptr = (struct setup_data *)(unsigned long)ptr->next;
+ }
+
+ if (!nr_areas)
+ return false;
+
+ for (i = 0; i < nr_areas; i++) {
+ struct kho_mem *area = &kho_scratch[i];
+ struct mem_vector region = {
+ .start = area->addr,
+ .size = area->size,
+ };
+
+ if (process_mem_region(&region, minimum, image_size))
+ break;
+ }
+
+ return true;
+}
+#else
+static inline bool process_kho_entries(unsigned long minimum,
+ unsigned long image_size)
+{
+ return false;
+}
+#endif
+
static unsigned long find_random_phys_addr(unsigned long minimum,
unsigned long image_size)
{
@@ -775,7 +824,8 @@ static unsigned long find_random_phys_addr(unsigned long minimum,
return 0;
}
- if (!process_efi_entries(minimum, image_size))
+ if (!process_kho_entries(minimum, image_size) &&
+ !process_efi_entries(minimum, image_size))
process_e820_entries(minimum, image_size);
phys_addr = slots_fetch_random();
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 85f4fde3515c..70e045321d4b 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -66,6 +66,10 @@ extern void x86_ce4100_early_setup(void);
static inline void x86_ce4100_early_setup(void) { }
#endif
+#ifdef CONFIG_KEXEC_HANDOVER
+#include <linux/kexec_handover.h>
+#endif
+
#ifndef _SETUP
#include <asm/espfix.h>
diff --git a/arch/x86/include/uapi/asm/setup_data.h b/arch/x86/include/uapi/asm/setup_data.h
index b111b0c18544..c258c37768ee 100644
--- a/arch/x86/include/uapi/asm/setup_data.h
+++ b/arch/x86/include/uapi/asm/setup_data.h
@@ -13,7 +13,8 @@
#define SETUP_CC_BLOB 7
#define SETUP_IMA 8
#define SETUP_RNG_SEED 9
-#define SETUP_ENUM_MAX SETUP_RNG_SEED
+#define SETUP_KEXEC_KHO 10
+#define SETUP_ENUM_MAX SETUP_KEXEC_KHO
#define SETUP_INDIRECT (1<<31)
#define SETUP_TYPE_MAX (SETUP_ENUM_MAX | SETUP_INDIRECT)
@@ -78,6 +79,16 @@ struct ima_setup_data {
__u64 size;
} __attribute__((packed));
+/*
+ * Locations of kexec handover metadata
+ */
+struct kho_data {
+ __u64 dt_addr;
+ __u64 dt_size;
+ __u64 scratch_addr;
+ __u64 scratch_size;
+} __attribute__((packed));
+
#endif /* __ASSEMBLY__ */
#endif /* _UAPI_ASM_X86_SETUP_DATA_H */
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 82b96ed9890a..0b81cd70b02a 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1329,6 +1329,24 @@ void __init e820__memblock_setup(void)
memblock_add(entry->addr, entry->size);
}
+ /*
+ * At this point with KHO we only allocate from scratch memory.
+ * At the same time, we configure memblock to only allow
+ * allocations from memory below ISA_END_ADDRESS which is not
+ * a natural scratch region, because Linux ignores memory below
+ * ISA_END_ADDRESS at runtime. Besides very few (if any) early
+ * allocations, we must allocate the real-mode trampoline below
+ * ISA_END_ADDRESS.
+ *
+ * To make sure that we can actually perform allocations during
+ * this phase, let's mark memory below ISA_END_ADDRESS as scratch
+ * so we can allocate from there in a scratch-only world.
+ *
+ * After the real-mode trampoline is allocated, we clear the scratch
+ * marking from the memory below ISA_END_ADDRESS.
+ */
+ memblock_mark_kho_scratch(0, ISA_END_ADDRESS);
+
/* Throw away partial pages: */
memblock_trim_memory(PAGE_SIZE);
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 68530fad05f7..15fc3c1a92e8 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -233,6 +233,31 @@ setup_ima_state(const struct kimage *image, struct boot_params *params,
#endif /* CONFIG_IMA_KEXEC */
}
+static void setup_kho(const struct kimage *image, struct boot_params *params,
+ unsigned long params_load_addr,
+ unsigned int setup_data_offset)
+{
+#ifdef CONFIG_KEXEC_HANDOVER
+ struct setup_data *sd = (void *)params + setup_data_offset;
+ struct kho_data *kho = (void *)sd + sizeof(*sd);
+
+ sd->type = SETUP_KEXEC_KHO;
+ sd->len = sizeof(struct kho_data);
+
+ /* Only add if we have all KHO images in place */
+ if (!image->kho.dt.buffer || !image->kho.scratch.buffer)
+ return;
+
+ /* Add setup data */
+ kho->dt_addr = image->kho.dt.mem;
+ kho->dt_size = image->kho.dt.bufsz;
+ kho->scratch_addr = image->kho.scratch.mem;
+ kho->scratch_size = image->kho.scratch.bufsz;
+ sd->next = params->hdr.setup_data;
+ params->hdr.setup_data = params_load_addr + setup_data_offset;
+#endif /* CONFIG_KEXEC_HANDOVER */
+}
+
static int
setup_boot_parameters(struct kimage *image, struct boot_params *params,
unsigned long params_load_addr,
@@ -312,6 +337,13 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
sizeof(struct ima_setup_data);
}
+ if (IS_ENABLED(CONFIG_KEXEC_HANDOVER)) {
+ /* Setup space to store preservation metadata */
+ setup_kho(image, params, params_load_addr, setup_data_offset);
+ setup_data_offset += sizeof(struct setup_data) +
+ sizeof(struct kho_data);
+ }
+
/* Setup RNG seed */
setup_rng_seed(params, params_load_addr, setup_data_offset);
@@ -479,6 +511,10 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
kbuf.bufsz += sizeof(struct setup_data) +
sizeof(struct ima_setup_data);
+ if (IS_ENABLED(CONFIG_KEXEC_HANDOVER))
+ kbuf.bufsz += sizeof(struct setup_data) +
+ sizeof(struct kho_data);
+
params = kzalloc(kbuf.bufsz, GFP_KERNEL);
if (!params)
return ERR_PTR(-ENOMEM);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index c80c124af332..e0a89f4bb46f 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -385,6 +385,28 @@ int __init ima_get_kexec_buffer(void **addr, size_t *size)
}
#endif
+static void __init add_kho(u64 phys_addr, u32 data_len)
+{
+#ifdef CONFIG_KEXEC_HANDOVER
+ struct kho_data *kho;
+ u64 addr = phys_addr + sizeof(struct setup_data);
+ u64 size = data_len - sizeof(struct setup_data);
+
+ kho = early_memremap(addr, size);
+ if (!kho) {
+ pr_warn("setup: failed to memremap kho data (0x%llx, 0x%llx)\n",
+ addr, size);
+ return;
+ }
+
+ kho_populate(kho->dt_addr, kho->scratch_addr, kho->scratch_size);
+
+ early_memunmap(kho, size);
+#else
+ pr_warn("Passed KHO data, but CONFIG_KEXEC_HANDOVER not set. Ignoring.\n");
+#endif
+}
+
static void __init parse_setup_data(void)
{
struct setup_data *data;
@@ -413,6 +435,9 @@ static void __init parse_setup_data(void)
case SETUP_IMA:
add_early_ima_buffer(pa_data);
break;
+ case SETUP_KEXEC_KHO:
+ add_kho(pa_data, data_len);
+ break;
case SETUP_RNG_SEED:
data = early_memremap(pa_data, data_len);
add_bootloader_randomness(data->data, data->len);
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index f9bc444a3064..9b9f4534086d 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -65,6 +65,8 @@ void __init reserve_real_mode(void)
* setup_arch().
*/
memblock_reserve(0, SZ_1M);
+
+ memblock_clear_kho_scratch(0, SZ_1M);
}
static void __init sme_sev_setup_real_mode(struct trampoline_header *th)
--
2.47.2
* [PATCH v4 13/14] memblock: Add KHO support for reserve_mem
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
` (11 preceding siblings ...)
2025-02-06 13:27 ` [PATCH v4 12/14] x86: Add KHO support Mike Rapoport
@ 2025-02-06 13:27 ` Mike Rapoport
2025-02-10 16:03 ` Rob Herring
2025-02-17 4:04 ` Wei Yang
2025-02-06 13:27 ` [PATCH v4 14/14] Documentation: KHO: Add memblock bindings Mike Rapoport
` (6 subsequent siblings)
19 siblings, 2 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Graf, Andrew Morton, Andy Lutomirski, Anthony Yznaga,
Arnd Bergmann, Ashish Kalra, Benjamin Herrenschmidt,
Borislav Petkov, Catalin Marinas, Dave Hansen, David Woodhouse,
Eric Biederman, Ingo Molnar, James Gowans, Jonathan Corbet,
Krzysztof Kozlowski, Mark Rutland, Mike Rapoport, Paolo Bonzini,
Pasha Tatashin, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
From: Alexander Graf <graf@amazon.com>
Linux has recently gained support for "reserve_mem": A mechanism to
allocate a region of memory early enough in boot that we can cross our
fingers and hope it stays at the same location during most boots, so we
can, for example, store ftrace buffers in it.
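For context, reservations of this kind are created with the existing
command-line option this patch extends, parsed in the
reserve_mem=nn:align:name format. A hypothetical boot parameter creating
a 2 MiB, 4 KiB-aligned region named "trace" would look like:

```
reserve_mem=2M:4096:trace
```

With KHO enabled, a region created this way is serialized on kexec and,
if found in the handover FDT, revived at the same physical address by
the next kernel.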
Thanks to KASLR, we can never be really sure that "reserve_mem"
allocations are static across kexec. Let's teach it KHO awareness so
that it serializes its reservations on kexec exit and deserializes them
again on boot, preserving the exact same mapping across kexec.
This is an example user for KHO in the KHO patch set to ensure we have
at least one (not very controversial) user in the tree before extending
KHO's use to more subsystems.
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
mm/memblock.c | 131 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 131 insertions(+)
diff --git a/mm/memblock.c b/mm/memblock.c
index 84df96efca62..fdb08b60efc1 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -16,6 +16,9 @@
#include <linux/kmemleak.h>
#include <linux/seq_file.h>
#include <linux/memblock.h>
+#include <linux/kexec_handover.h>
+#include <linux/kexec.h>
+#include <linux/libfdt.h>
#include <asm/sections.h>
#include <linux/io.h>
@@ -2423,6 +2426,70 @@ int reserve_mem_find_by_name(const char *name, phys_addr_t *start, phys_addr_t *
}
EXPORT_SYMBOL_GPL(reserve_mem_find_by_name);
+static bool __init reserve_mem_kho_revive(const char *name, phys_addr_t size,
+ phys_addr_t align)
+{
+ const void *fdt = kho_get_fdt();
+ const char *path = "/reserve_mem";
+ int node, child, err;
+
+ if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER))
+ return false;
+
+ if (!fdt)
+ return false;
+
+ node = fdt_path_offset(fdt, path);
+ if (node < 0)
+ return false;
+
+ err = fdt_node_check_compatible(fdt, node, "reserve_mem-v1");
+ if (err) {
+ pr_warn("Node '%s' has unknown compatible\n", path);
+ return false;
+ }
+
+ fdt_for_each_subnode(child, fdt, node) {
+ const struct kho_mem *mem;
+ const char *child_name;
+ int len;
+
+ /* Search for old kernel's reserved_mem with the same name */
+ child_name = fdt_get_name(fdt, child, NULL);
+ if (strcmp(name, child_name))
+ continue;
+
+ err = fdt_node_check_compatible(fdt, child, "reserve_mem_map-v1");
+ if (err) {
+ pr_warn("Node '%s/%s' has unknown compatible\n", path, name);
+ continue;
+ }
+
+ mem = fdt_getprop(fdt, child, "mem", &len);
+ if (!mem || len != sizeof(*mem))
+ continue;
+
+ if (mem->addr & (align - 1)) {
+ pr_warn("KHO reserved_mem '%s' has wrong alignment (0x%lx, 0x%lx)\n",
+ name, (long)align, (long)mem->addr);
+ continue;
+ }
+
+ if (mem->size != size) {
+ pr_warn("KHO reserved_mem '%s' has wrong size (0x%lx != 0x%lx)\n",
+ name, (long)mem->size, (long)size);
+ continue;
+ }
+
+ reserved_mem_add(mem->addr, mem->size, name);
+ pr_info("Revived memory reservation '%s' from KHO\n", name);
+
+ return true;
+ }
+
+ return false;
+}
+
/*
* Parse reserve_mem=nn:align:name
*/
@@ -2478,6 +2545,11 @@ static int __init reserve_mem(char *p)
if (reserve_mem_find_by_name(name, &start, &tmp))
return -EBUSY;
+ /* Pick previous allocations up from KHO if available */
+ if (reserve_mem_kho_revive(name, size, align))
+ return 1;
+
+ /* TODO: Allocation must be outside of scratch region */
start = memblock_phys_alloc(size, align);
if (!start)
return -ENOMEM;
@@ -2488,6 +2560,65 @@ static int __init reserve_mem(char *p)
}
__setup("reserve_mem=", reserve_mem);
+static int reserve_mem_kho_write_map(void *fdt, struct reserve_mem_table *map)
+{
+ int err = 0;
+ const char compatible[] = "reserve_mem_map-v1";
+ struct kho_mem mem = {
+ .addr = map->start,
+ .size = map->size,
+ };
+
+ err |= fdt_begin_node(fdt, map->name);
+ err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+ err |= fdt_property(fdt, "mem", &mem, sizeof(mem));
+ err |= fdt_end_node(fdt);
+
+ return err;
+}
+
+static int reserve_mem_kho_notifier(struct notifier_block *self,
+ unsigned long cmd, void *v)
+{
+ const char compatible[] = "reserve_mem-v1";
+ void *fdt = v;
+ int err = 0;
+ int i;
+
+ switch (cmd) {
+ case KEXEC_KHO_ABORT:
+ return NOTIFY_DONE;
+ case KEXEC_KHO_DUMP:
+ /* Handled below */
+ break;
+ default:
+ return NOTIFY_BAD;
+ }
+
+ if (!reserved_mem_count)
+ return NOTIFY_DONE;
+
+ err |= fdt_begin_node(fdt, "reserve_mem");
+ err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+ for (i = 0; i < reserved_mem_count; i++)
+ err |= reserve_mem_kho_write_map(fdt, &reserved_mem_table[i]);
+ err |= fdt_end_node(fdt);
+
+ return err ? NOTIFY_BAD : NOTIFY_DONE;
+}
+
+static struct notifier_block reserve_mem_kho_nb = {
+ .notifier_call = reserve_mem_kho_notifier,
+};
+
+static int __init reserve_mem_init(void)
+{
+ register_kho_notifier(&reserve_mem_kho_nb);
+
+ return 0;
+}
+core_initcall(reserve_mem_init);
+
#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_ARCH_KEEP_MEMBLOCK)
static const char * const flagname[] = {
[ilog2(MEMBLOCK_HOTPLUG)] = "HOTPLUG",
--
2.47.2
* [PATCH v4 14/14] Documentation: KHO: Add memblock bindings
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
` (12 preceding siblings ...)
2025-02-06 13:27 ` [PATCH v4 13/14] memblock: Add KHO support for reserve_mem Mike Rapoport
@ 2025-02-06 13:27 ` Mike Rapoport
2025-02-09 10:29 ` Krzysztof Kozlowski
2025-02-07 0:29 ` [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Andrew Morton
` (5 subsequent siblings)
19 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-06 13:27 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Graf, Andrew Morton, Andy Lutomirski, Anthony Yznaga,
Arnd Bergmann, Ashish Kalra, Benjamin Herrenschmidt,
Borislav Petkov, Catalin Marinas, Dave Hansen, David Woodhouse,
Eric Biederman, Ingo Molnar, James Gowans, Jonathan Corbet,
Krzysztof Kozlowski, Mark Rutland, Mike Rapoport, Paolo Bonzini,
Pasha Tatashin, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
We introduced KHO into Linux: A framework that allows Linux to pass
metadata and memory across kexec from Linux to Linux. KHO reuses fdt
as its file format and shares many of the same properties as
firmware-to-Linux boot formats: It needs a stable, documented ABI that
allows for forward and backward compatibility as well as versioning.
As the first user of KHO, we introduced memblock support that preserves
the contents of memory ranges reserved with the reserve_mem command line
option across kexec, so you can use the post-kexec kernel to read traces
from the pre-kexec kernel.
This patch adds memblock schemas, similar to the "device" device tree
ones, to a new kho bindings directory. This allows us to require
contributors to document the data that moves across KHO kexecs and to
catch breaking changes during review.
Co-developed-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
.../kho/bindings/memblock/reserve_mem.yaml | 41 ++++++++++++++++++
.../bindings/memblock/reserve_mem_map.yaml | 42 +++++++++++++++++++
2 files changed, 83 insertions(+)
create mode 100644 Documentation/kho/bindings/memblock/reserve_mem.yaml
create mode 100644 Documentation/kho/bindings/memblock/reserve_mem_map.yaml
diff --git a/Documentation/kho/bindings/memblock/reserve_mem.yaml b/Documentation/kho/bindings/memblock/reserve_mem.yaml
new file mode 100644
index 000000000000..7b01791b10b3
--- /dev/null
+++ b/Documentation/kho/bindings/memblock/reserve_mem.yaml
@@ -0,0 +1,41 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/memblock/reserve_mem.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Memblock reserved memory
+
+maintainers:
+ - Mike Rapoport <rppt@kernel.org>
+
+description: |
+ Memblock can serialize its current memory reservations created with
+ reserve_mem command line option across kexec through KHO.
+ The post-KHO kernel can then consume these reservations and they are
+ guaranteed to have the same physical address.
+
+properties:
+ compatible:
+ enum:
+ - reserve_mem-v1
+
+patternProperties:
+ "^[0-9a-f_]+$":
+ $ref: reserve_mem_map.yaml#
+ description: reserved memory regions
+
+required:
+ - compatible
+
+additionalProperties: false
+
+examples:
+ - |
+ reserve_mem {
+ compatible = "reserve_mem-v1";
+ r1 {
+ compatible = "reserve_mem_map-v1";
+ mem = <0xc07c 0x2000000 0x01 0x00>;
+ };
+ };
diff --git a/Documentation/kho/bindings/memblock/reserve_mem_map.yaml b/Documentation/kho/bindings/memblock/reserve_mem_map.yaml
new file mode 100644
index 000000000000..09001c5f2124
--- /dev/null
+++ b/Documentation/kho/bindings/memblock/reserve_mem_map.yaml
@@ -0,0 +1,42 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/memblock/reserve_mem_map.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Memblock reserved memory regions
+
+maintainers:
+ - Mike Rapoport <rppt@kernel.org>
+
+description: |
+ Memblock can serialize its current memory reservations created with
+ reserve_mem command line option across kexec through KHO.
+ This object describes each such region.
+
+properties:
+ compatible:
+ enum:
+ - reserve_mem_map-v1
+
+ mem:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ description: |
+ Array of { u64 phys_addr, u64 len } elements that describe a list of
+ memory ranges.
+
+required:
+ - compatible
+ - mem
+
+additionalProperties: false
+
+examples:
+ - |
+ reserve_mem {
+ compatible = "reserve_mem-v1";
+ r1 {
+ compatible = "reserve_mem_map-v1";
+ mem = <0xc07c 0x2000000 0x01 0x00>;
+ };
+ };
--
2.47.2
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
` (13 preceding siblings ...)
2025-02-06 13:27 ` [PATCH v4 14/14] Documentation: KHO: Add memblock bindings Mike Rapoport
@ 2025-02-07 0:29 ` Andrew Morton
2025-02-07 1:28 ` Pasha Tatashin
` (2 more replies)
2025-02-07 4:50 ` Andrew Morton
` (4 subsequent siblings)
19 siblings, 3 replies; 97+ messages in thread
From: Andrew Morton @ 2025-02-07 0:29 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andy Lutomirski, Anthony Yznaga,
Arnd Bergmann, Ashish Kalra, Benjamin Herrenschmidt,
Borislav Petkov, Catalin Marinas, Dave Hansen, David Woodhouse,
Eric Biederman, Ingo Molnar, James Gowans, Jonathan Corbet,
Krzysztof Kozlowski, Mark Rutland, Paolo Bonzini, Pasha Tatashin,
H. Peter Anvin, Peter Zijlstra, Pratyush Yadav, Rob Herring,
Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Thu, 6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> just to make things simpler instead of ftrace we decided to preserve
> "reserve_mem" regions.
>
> The patches are also available in git:
> https://git.kernel.org/rppt/h/kho/v4
>
>
> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
I tossed this into mm.git for some testing and exposure.
What merge path are you anticipating?
Review activity seems pretty thin thus far?
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-07 0:29 ` [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Andrew Morton
@ 2025-02-07 1:28 ` Pasha Tatashin
2025-02-08 1:38 ` Baoquan He
2025-02-07 8:06 ` Mike Rapoport
2025-02-09 10:33 ` Krzysztof Kozlowski
2 siblings, 1 reply; 97+ messages in thread
From: Pasha Tatashin @ 2025-02-07 1:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Mike Rapoport, linux-kernel, Alexander Graf, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86, changyuanl
On Thu, Feb 6, 2025 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
>
> > This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> > just to make things simpler instead of ftrace we decided to preserve
> > "reserve_mem" regions.
> >
> > The patches are also available in git:
> > https://git.kernel.org/rppt/h/kho/v4
> >
> >
> > Kexec today considers itself purely a boot loader: When we enter the new
> > kernel, any state the previous kernel left behind is irrelevant and the
> > new kernel reinitializes the system.
>
> I tossed this into mm.git for some testing and exposure.
>
> What merge path are you anticipating?
>
> Review activity seems pretty thin thus far?
KHO is going to be discussed at the upcoming lsfmm, we are also
planning to send v5 of this patch series (discussed with Mike
Rapoport) in a couple of weeks. It will include enhancements needed
for the hypervisor live update scenario:
1. Allow nodes to be added to the KHO tree at any time
2. Remove "activate" (I will also send a live update framework that
provides the activate functionality).
3. Allow serialization during shutdown.
4. Decouple KHO from kexec_file_load(), as kexec_file_load() should
not be used during live update blackout time.
5. Enable multithreaded serialization by using hash-table as an
intermediate step before conversion to FDT.
Pasha
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
` (14 preceding siblings ...)
2025-02-07 0:29 ` [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Andrew Morton
@ 2025-02-07 4:50 ` Andrew Morton
2025-02-07 8:01 ` Mike Rapoport
2025-02-08 23:39 ` Cong Wang
` (3 subsequent siblings)
19 siblings, 1 reply; 97+ messages in thread
From: Andrew Morton @ 2025-02-07 4:50 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andy Lutomirski, Anthony Yznaga,
Arnd Bergmann, Ashish Kalra, Benjamin Herrenschmidt,
Borislav Petkov, Catalin Marinas, Dave Hansen, David Woodhouse,
Eric Biederman, Ingo Molnar, James Gowans, Jonathan Corbet,
Krzysztof Kozlowski, Mark Rutland, Paolo Bonzini, Pasha Tatashin,
H. Peter Anvin, Peter Zijlstra, Pratyush Yadav, Rob Herring,
Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
My x86_64 allmodconfig sayeth:
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0xca (section: .text) -> memblock_alloc_try_nid (section: .init.text)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0xf5 (section: .text) -> scratch_scale (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x100 (section: .text) -> scratch_size_global (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x11d (section: .text) -> scratch_size_global (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x129 (section: .text) -> scratch_size_pernode (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x14e (section: .text) -> memblock_phys_alloc_range (section: .init.text)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x261 (section: .text) -> scratch_size_pernode (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x26d (section: .text) -> scratch_size_pernode (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x29b (section: .text) -> memblock_alloc_range_nid (section: .init.text)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x334 (section: .text) -> scratch_scale (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x33f (section: .text) -> scratch_size_global (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x363 (section: .text) -> scratch_scale (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x371 (section: .text) -> scratch_size_global (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x3a1 (section: .text) -> scratch_scale (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x3af (section: .text) -> scratch_size_global (section: .init.data)
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-07 4:50 ` Andrew Morton
@ 2025-02-07 8:01 ` Mike Rapoport
0 siblings, 0 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-07 8:01 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Alexander Graf, Andy Lutomirski, Anthony Yznaga,
Arnd Bergmann, Ashish Kalra, Benjamin Herrenschmidt,
Borislav Petkov, Catalin Marinas, Dave Hansen, David Woodhouse,
Eric Biederman, Ingo Molnar, James Gowans, Jonathan Corbet,
Krzysztof Kozlowski, Mark Rutland, Paolo Bonzini, Pasha Tatashin,
H. Peter Anvin, Peter Zijlstra, Pratyush Yadav, Rob Herring,
Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Thu, Feb 06, 2025 at 08:50:30PM -0800, Andrew Morton wrote:
> My x86_64 allmodconfig sayeth:
>
> WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0xca (section: .text) -> memblock_alloc_try_nid (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0xf5 (section: .text) -> scratch_scale (section: .init.data)
This should fix it:
From 176767698d4ac5b7cddffe16677b60cb18dce786 Mon Sep 17 00:00:00 2001
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Date: Fri, 7 Feb 2025 09:57:09 +0200
Subject: [PATCH] kho: make kho_reserve_scratch and kho_init_reserved_pages
__init
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
kernel/kexec_handover.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index c21ea2a09d47..e0b92011afe2 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -620,7 +620,7 @@ static phys_addr_t __init scratch_size(int nid)
* active. This CMA region will only be used for movable pages which are not a
* problem for us during KHO because we can just move them somewhere else.
*/
-static void kho_reserve_scratch(void)
+static void __init kho_reserve_scratch(void)
{
phys_addr_t addr, size;
int nid, i = 1;
@@ -672,7 +672,7 @@ static void kho_reserve_scratch(void)
* Scan the DT for any memory ranges and make sure they are reserved in
* memblock, otherwise they will end up in a weird state on free lists.
*/
-static void kho_init_reserved_pages(void)
+static void __init kho_init_reserved_pages(void)
{
const void *fdt = kho_get_fdt();
int offset = 0, depth = 0, initial_depth = 0, len;
--
2.47.2
--
Sincerely yours,
Mike.
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-07 0:29 ` [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Andrew Morton
2025-02-07 1:28 ` Pasha Tatashin
@ 2025-02-07 8:06 ` Mike Rapoport
2025-02-09 10:33 ` Krzysztof Kozlowski
2 siblings, 0 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-07 8:06 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Alexander Graf, Andy Lutomirski, Anthony Yznaga,
Arnd Bergmann, Ashish Kalra, Benjamin Herrenschmidt,
Borislav Petkov, Catalin Marinas, Dave Hansen, David Woodhouse,
Eric Biederman, Ingo Molnar, James Gowans, Jonathan Corbet,
Krzysztof Kozlowski, Mark Rutland, Paolo Bonzini, Pasha Tatashin,
H. Peter Anvin, Peter Zijlstra, Pratyush Yadav, Rob Herring,
Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Thu, Feb 06, 2025 at 04:29:39PM -0800, Andrew Morton wrote:
> On Thu, 6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
>
> > This is the next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com);
> > to make things simpler, instead of ftrace we decided to preserve
> > "reserve_mem" regions.
> >
> > The patches are also available in git:
> > https://git.kernel.org/rppt/h/kho/v4
> >
> >
> > Kexec today considers itself purely a boot loader: When we enter the new
> > kernel, any state the previous kernel left behind is irrelevant and the
> > new kernel reinitializes the system.
>
> I tossed this into mm.git for some testing and exposure.
>
> What merge path are you anticipating?
I think your tree is the most appropriate, but let's wait for Acks from x86
and arm64 people ;-)
> Review activity seems pretty thin thus far?
Yeah :(
Maybe with Pasha's version on top of that we'll have more people reviewing.
And here is another fixup, for a sparse warning that kbuild reported:
From e1e34b96b96b89a01ee31be223c8dfc2ce1c4cbe Mon Sep 17 00:00:00 2001
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Date: Fri, 7 Feb 2025 09:55:03 +0200
Subject: [PATCH] kho: make bin_attr_dt_kern static
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
kernel/kexec_handover.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index c26753d613cb..c21ea2a09d47 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -258,7 +258,7 @@ static ssize_t dt_read(struct file *file, struct kobject *kobj,
return count;
}
-struct bin_attribute bin_attr_dt_kern = __BIN_ATTR(dt, 0400, dt_read, NULL, 0);
+static struct bin_attribute bin_attr_dt_kern = __BIN_ATTR(dt, 0400, dt_read, NULL, 0);
static int kho_expose_dt(void *fdt)
{
--
2.47.2
--
Sincerely yours,
Mike.
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-07 1:28 ` Pasha Tatashin
@ 2025-02-08 1:38 ` Baoquan He
2025-02-08 8:41 ` Mike Rapoport
2025-02-09 0:23 ` Pasha Tatashin
0 siblings, 2 replies; 97+ messages in thread
From: Baoquan He @ 2025-02-08 1:38 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Andrew Morton, Mike Rapoport, linux-kernel, Alexander Graf,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86, changyuanl
On 02/06/25 at 08:28pm, Pasha Tatashin wrote:
> On Thu, Feb 6, 2025 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Thu, 6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> >
> > > [...]
> >
> > I tossed this into mm.git for some testing and exposure.
> >
> > What merge path are you anticipating?
> >
> > Review activity seems pretty thin thus far?
>
> KHO is going to be discussed at the upcoming lsfmm, we are also
> planning to send v5 of this patch series (discussed with Mike
> Rapoport) in a couple of weeks. It will include enhancements needed
> for the hypervisor live update scenario:
So is this v4 still an RFC if v5 will be sent as planned? Should we hold
off reviewing until v5? Or is this series the infrastructure, with v5
adding the details you listed below? I am a little confused.
>
> 1. Allow nodes to be added to the KHO tree at any time
> 2. Remove "activate" (I will also send a live update framework that
> provides the activate functionality).
> 3. Allow serialization during shutdown.
> 4. Decouple KHO from kexec_file_load(), as kexec_file_load() should
> not be used during live update blackout time.
> 5. Enable multithreaded serialization by using hash-table as an
> intermediate step before conversion to FDT.
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-08 1:38 ` Baoquan He
@ 2025-02-08 8:41 ` Mike Rapoport
2025-02-08 11:13 ` Baoquan He
2025-02-09 0:23 ` Pasha Tatashin
1 sibling, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-08 8:41 UTC (permalink / raw)
To: Baoquan He
Cc: Pasha Tatashin, Andrew Morton, linux-kernel, Alexander Graf,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86, changyuanl
Hi Baoquan,
On Sat, Feb 08, 2025 at 09:38:27AM +0800, Baoquan He wrote:
> On 02/06/25 at 08:28pm, Pasha Tatashin wrote:
> > On Thu, Feb 6, 2025 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Thu, 6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > > [...]
> > >
> > > I tossed this into mm.git for some testing and exposure.
> > >
> > > What merge path are you anticipating?
> > >
> > > Review activity seems pretty thin thus far?
> >
> > KHO is going to be discussed at the upcoming lsfmm, we are also
> > planning to send v5 of this patch series (discussed with Mike
> > Rapoport) in a couple of weeks. It will include enhancements needed
> > for the hypervisor live update scenario:
>
> So is this v4 still an RFC if v5 will be sent as planned? Should we hold
> off reviewing until v5? Or is this series the infrastructure, with v5
> adding the details you listed below? I am a little confused.
v4 adds the very basic support for kexec handover in the simplest form we
could think of. There were discussions at the Linux MM Alignment and
Hypervisor Live Update meetings, where people agreed on an MVP for KHO that
v4 essentially implements.
v5 will add more details on top of v4, and I'm not sure there is consensus
about some of them among the people involved in KHO.
> > 1. Allow nodes to be added to the KHO tree at any time
> > 2. Remove "activate" (I will also send a live update framework that
> > provides the activate functionality).
> > 3. Allow serialization during shutdown.
> > 4. Decouple KHO from kexec_file_load(), as kexec_file_load() should
> > not be used during live update blackout time.
> > 5. Enable multithreaded serialization by using hash-table as an
> > intermediate step before conversion to FDT.
>
--
Sincerely yours,
Mike.
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-08 8:41 ` Mike Rapoport
@ 2025-02-08 11:13 ` Baoquan He
0 siblings, 0 replies; 97+ messages in thread
From: Baoquan He @ 2025-02-08 11:13 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pasha Tatashin, Andrew Morton, linux-kernel, Alexander Graf,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86, changyuanl
On 02/08/25 at 10:41am, Mike Rapoport wrote:
> Hi Baoquan,
>
> On Sat, Feb 08, 2025 at 09:38:27AM +0800, Baoquan He wrote:
> > On 02/06/25 at 08:28pm, Pasha Tatashin wrote:
> > > On Thu, Feb 6, 2025 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > On Thu, 6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > > [...]
> > > >
> > > > I tossed this into mm.git for some testing and exposure.
> > > >
> > > > What merge path are you anticipating?
> > > >
> > > > Review activity seems pretty thin thus far?
> > >
> > > KHO is going to be discussed at the upcoming lsfmm, we are also
> > > planning to send v5 of this patch series (discussed with Mike
> > > Rapoport) in a couple of weeks. It will include enhancements needed
> > > for the hypervisor live update scenario:
> >
> > So is this v4 still an RFC if v5 will be sent as planned? Should we hold
> > off reviewing until v5? Or is this series the infrastructure, with v5
> > adding the details you listed below? I am a little confused.
>
> v4 adds the very basic support for kexec handover in the simplest form we
> could think of. There were discussions at the Linux MM Alignment and
> Hypervisor Live Update meetings, where people agreed on an MVP for KHO
> that v4 essentially implements.
>
> v5 will add more details on top of v4, and I'm not sure there is consensus
> about some of them among the people involved in KHO.
Thanks for the information.
Then I will apply v4 and study the infrastructure and mechanism first.
What sounds more sensible to me is that v4 gets reviewed, updated and
merged, and then another patchset is posted to add the details, once you
have reached consensus on the infrastructure part. With that, posting and
reviewing will be much easier; unless, of course, you are still discussing
the infrastructure part itself.
>
> > > 1. Allow nodes to be added to the KHO tree at any time
> > > 2. Remove "activate" (I will also send a live update framework that
> > > provides the activate functionality).
> > > 3. Allow serialization during shutdown.
> > > 4. Decouple KHO from kexec_file_load(), as kexec_file_load() should
> > > not be used during live update blackout time.
> > > 5. Enable multithreaded serialization by using hash-table as an
> > > intermediate step before conversion to FDT.
> >
>
> --
> Sincerely yours,
> Mike.
>
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
` (15 preceding siblings ...)
2025-02-07 4:50 ` Andrew Morton
@ 2025-02-08 23:39 ` Cong Wang
2025-02-09 0:13 ` Pasha Tatashin
2025-02-09 0:51 ` Cong Wang
` (2 subsequent siblings)
19 siblings, 1 reply; 97+ messages in thread
From: Cong Wang @ 2025-02-08 23:39 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
Hi Mike,
On Thu, Feb 6, 2025 at 5:28 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Hi,
>
> [...]
>
>
> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
>
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See "pkernfs: Persisting guest memory
> and kernel/device state safely across kexec" Linux Plumbers
> Conference 2023 presentation for details:
>
> https://lpc.events/event/17/contributions/1485/
>
> To start us on the journey to support all the use cases above, this patch
> implements basic infrastructure to allow hand over of kernel state across
> kexec (Kexec HandOver, aka KHO). As a really simple example target, we use
> memblock's reserve_mem.
> With this patch set applied, memory that was reserved using "reserve_mem"
> command line options remains intact after kexec and it is guaranteed to
> reside at the same physical address.
Nice work!
One concern is that reserving memory with memblock, the way crashkernel=
does, is not flexible. I worked on kdump years ago, and one of the biggest
pains of kdump was deciding how much memory to reserve with crashkernel=.
It is still a pain today.
If we reserve more, we waste memory in the 1st kernel. If we reserve less,
we induce more OOM in the 2nd kernel.
I'd suggest considering CMA, where the "reserved" memory is still reusable
for other purposes: pages are simply migrated out of the reserved region on
demand, that is, when loading a kexec kernel. Of course, we need to make
sure those pages are not reused by what you want to preserve here, e.g.,
the IOMMU. So you might need additional work to make it work, but I still
believe this is the right direction.
Just my two cents.
Thanks!
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-08 23:39 ` Cong Wang
@ 2025-02-09 0:13 ` Pasha Tatashin
2025-02-09 1:00 ` Cong Wang
0 siblings, 1 reply; 97+ messages in thread
From: Pasha Tatashin @ 2025-02-09 0:13 UTC (permalink / raw)
To: Cong Wang
Cc: Mike Rapoport, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Sat, Feb 8, 2025 at 6:39 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Hi Mike,
>
> On Thu, Feb 6, 2025 at 5:28 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > [...]
>
> Nice work!
>
> One concern is that reserving memory with memblock, the way crashkernel=
> does, is not flexible. I worked on kdump years ago, and one of the biggest
> pains of kdump was deciding how much memory to reserve with crashkernel=.
> It is still a pain today.
>
> If we reserve more, we waste memory in the 1st kernel. If we reserve less,
> we induce more OOM in the 2nd kernel.
>
> I'd suggest considering CMA, where the "reserved" memory is still reusable
> for other purposes: pages are simply migrated out of the reserved region on
> demand, that is, when loading a kexec kernel. Of course, we need to make
> sure those pages are not reused by what you want to preserve here, e.g.,
> the IOMMU. So you might need additional work to make it work, but I still
> believe this is the right direction.
This is exactly what scratch memory is used for. Unlike crashkernel=, the
entire scratch area is available to user applications as CMA, since we know
that no kernel-reserved memory will come from that area. This doesn't work
for crashkernel=, because in some cases the user pages might also need to
be preserved in the crash dump. However, if user pages are going to be
discarded from the crash dump (as is done 99% of the time), then it is
better to make crashkernel= use CMA or ZONE_MOVABLE as well, consuming only
the memory actually occupied by the crash kernel and wasting no memory at
all. We have an internal patch at Google that does this, and I think it
would be a good improvement for the upstream kernel to carry as well.
Pasha
>
> Just my two cents.
>
> Thanks!
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-08 1:38 ` Baoquan He
2025-02-08 8:41 ` Mike Rapoport
@ 2025-02-09 0:23 ` Pasha Tatashin
2025-02-09 3:07 ` Baoquan He
1 sibling, 1 reply; 97+ messages in thread
From: Pasha Tatashin @ 2025-02-09 0:23 UTC (permalink / raw)
To: Baoquan He
Cc: Andrew Morton, Mike Rapoport, linux-kernel, Alexander Graf,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86, changyuanl
On Fri, Feb 7, 2025 at 8:38 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 02/06/25 at 08:28pm, Pasha Tatashin wrote:
> > On Thu, Feb 6, 2025 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Thu, 6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > > [...]
> > >
> > > I tossed this into mm.git for some testing and exposure.
> > >
> > > What merge path are you anticipating?
> > >
> > > Review activity seems pretty thin thus far?
> >
> > KHO is going to be discussed at the upcoming lsfmm, we are also
> > planning to send v5 of this patch series (discussed with Mike
> > Rapoport) in a couple of weeks. It will include enhancements needed
> > for the hypervisor live update scenario:
>
> So is this v4 still an RFC if v5 will be sent as planned? Should we hold
> off reviewing until v5? Or is this series the infrastructure, with v5
> adding the details you listed below? I am a little confused.
We will modify the existing patches and send them as v5 because some
interfaces are going to change*.
Otherwise, v5 will make KHO a lot more flexible, as it will allow using the
tree at any time while the system is running instead of only once during
the activation phase.
* Changing the interfaces is optional; the decision whether to change them
will be discussed at the Hypervisor Live Update meeting on Feb 10th:
https://lore.kernel.org/all/26a4b7ca-93a6-30e2-923b-f551ced03d62@google.com/
>
> >
> > 1. Allow nodes to be added to the KHO tree at any time
> > 2. Remove "activate" (I will also send a live update framework that
> > provides the activate functionality).
> > 3. Allow serialization during shutdown.
> > 4. Decouple KHO from kexec_file_load(), as kexec_file_load() should
> > not be used during live update blackout time.
> > 5. Enable multithreaded serialization by using hash-table as an
> > intermediate step before conversion to FDT.
>
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
` (16 preceding siblings ...)
2025-02-08 23:39 ` Cong Wang
@ 2025-02-09 0:51 ` Cong Wang
2025-02-17 3:19 ` RuiRui Yang
2025-02-26 20:08 ` Pratyush Yadav
19 siblings, 0 replies; 97+ messages in thread
From: Cong Wang @ 2025-02-09 0:51 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
Hi Mike,
On Thu, Feb 6, 2025 at 5:28 AM Mike Rapoport <rppt@kernel.org> wrote:
> We introduce a metadata file that the kernels pass between each other. How
> they pass it is architecture specific. The file's format is a Flattened
> Device Tree (fdt) which has a generator and parser already included in
> Linux. When the root user enables KHO through /sys/kernel/kho/active, the
> kernel invokes callbacks to every driver that supports KHO to serialize
> its state. When the actual kexec happens, the fdt is part of the image
> set that we boot into. In addition, we keep a "scratch regions" available
> for kexec: A physically contiguous memory regions that is guaranteed to
> not have any memory that KHO would preserve. The new kernel bootstraps
> itself using the scratch regions and sets all handed over memory as in use.
> When drivers initialize that support KHO, they introspect the fdt and
> recover their state from it. This includes memory reservations, where the
> driver can either discard or claim reservations.
I have gone through your entire patchset. If you could provide an example
of a specific driver that supports KHO, it would help a lot for people to
understand it and, more importantly, help driver developers adopt it.
Even a simulated driver, e.g. netdevsim, would be greatly helpful.
Thanks.
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-09 0:13 ` Pasha Tatashin
@ 2025-02-09 1:00 ` Cong Wang
0 siblings, 0 replies; 97+ messages in thread
From: Cong Wang @ 2025-02-09 1:00 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Mike Rapoport, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Sat, Feb 8, 2025 at 4:14 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Sat, Feb 8, 2025 at 6:39 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > Hi Mike,
> >
> > On Thu, Feb 6, 2025 at 5:28 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > [...]
> >
> > Nice work!
> >
> > One concern is that reserving memory with memblock, the way crashkernel=
> > does, is not flexible. I worked on kdump years ago, and one of the biggest
> > pains of kdump was deciding how much memory to reserve with crashkernel=.
> > It is still a pain today.
> >
> > If we reserve more, we waste memory in the 1st kernel. If we reserve less,
> > we induce more OOM in the 2nd kernel.
> >
> > I'd suggest considering CMA, where the "reserved" memory is still reusable
> > for other purposes: pages are simply migrated out of the reserved region on
> > demand, that is, when loading a kexec kernel. Of course, we need to make
> > sure those pages are not reused by what you want to preserve here, e.g.,
> > the IOMMU. So you might need additional work to make it work, but I still
> > believe this is the right direction.
>
> This is exactly what scratch memory is used for. Unlike crashkernel=, the
> entire scratch area is available to user applications as CMA, since we know
> that no kernel-reserved memory will come from that area. This doesn't work
> for crashkernel=, because in some cases the user pages might also need to
> be preserved in the crash dump. However, if user pages are going to be
> discarded from the crash dump (as is done 99% of the time), then it is
> better to make crashkernel= use CMA or ZONE_MOVABLE as well, consuming only
> the memory actually occupied by the crash kernel and wasting no memory at
> all. We have an internal patch at Google that does this, and I think it
> would be a good improvement for the upstream kernel to carry as well.
Good to know CMA is already used; I could not tell that from the cover
letter.
The case where user-space pages need to be preserved is scenarios like
RDMA, which pins user-space pages for DMA transfers. Since the goal here is
also to preserve hardware state like RDMA's, I guess the same concern
remains.
Thanks!
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-09 0:23 ` Pasha Tatashin
@ 2025-02-09 3:07 ` Baoquan He
0 siblings, 0 replies; 97+ messages in thread
From: Baoquan He @ 2025-02-09 3:07 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Andrew Morton, Mike Rapoport, linux-kernel, Alexander Graf,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86, changyuanl
On 02/08/25 at 07:23pm, Pasha Tatashin wrote:
> On Fri, Feb 7, 2025 at 8:38 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 02/06/25 at 08:28pm, Pasha Tatashin wrote:
> > > On Thu, Feb 6, 2025 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > On Thu, 6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > > This is the next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > > > > series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> > > > > where, to make things simpler, instead of ftrace we decided to preserve
> > > > > "reserve_mem" regions.
> > > > >
> > > > > The patches are also available in git:
> > > > > https://git.kernel.org/rppt/h/kho/v4
> > > > >
> > > > >
> > > > > Kexec today considers itself purely a boot loader: When we enter the new
> > > > > kernel, any state the previous kernel left behind is irrelevant and the
> > > > > new kernel reinitializes the system.
> > > >
> > > > I tossed this into mm.git for some testing and exposure.
> > > >
> > > > What merge path are you anticipating?
> > > >
> > > > Review activity seems pretty thin thus far?
> > >
> > > KHO is going to be discussed at the upcoming LSF/MM. We are also
> > > planning to send v5 of this patch series (discussed with Mike
> > > Rapoport) in a couple of weeks. It will include enhancements needed
> > > for the hypervisor live update scenario:
> >
> > So is this v4 still an RFC if v5 will be sent as planned? Should we hold
> > off reviewing until v5? Or is this series infrastructure building, and v5
> > will add more details as you listed below? I am a little confused.
>
> We will modify the existing patches and send as v5 because some
> interfaces are going to be changed*.
>
> Otherwise, v5 will make KHO a lot more flexible, as it will allow using
> the tree at any time while the system is running instead of only
> once during the activation phase.
>
> * Changing interfaces is optional, but the decision whether to change them
> will be discussed at Hypervisor Live Update on Feb 10th:
> https://lore.kernel.org/all/26a4b7ca-93a6-30e2-923b-f551ced03d62@google.com/
Ah, this is what I wanted to know about the difference between v4 and v5.
Thanks for the information; I am looking forward to seeing the v5 update.
>
> >
> > >
> > > 1. Allow nodes to be added to the KHO tree at any time
> > > 2. Remove "activate" (I will also send a live update framework that
> > > provides the activate functionality).
> > > 3. Allow serialization during shutdown.
> > > 4. Decouple KHO from kexec_file_load(), as kexec_file_load() should
> > > not be used during live update blackout time.
> > > 5. Enable multithreaded serialization by using hash-table as an
> > > intermediate step before conversion to FDT.
> >
>
* Re: [PATCH v4 14/14] Documentation: KHO: Add memblock bindings
2025-02-06 13:27 ` [PATCH v4 14/14] Documentation: KHO: Add memblock bindings Mike Rapoport
@ 2025-02-09 10:29 ` Krzysztof Kozlowski
2025-02-09 15:10 ` Mike Rapoport
0 siblings, 1 reply; 97+ messages in thread
From: Krzysztof Kozlowski @ 2025-02-09 10:29 UTC (permalink / raw)
To: Mike Rapoport, linux-kernel
Cc: Alexander Graf, Andrew Morton, Andy Lutomirski, Anthony Yznaga,
Arnd Bergmann, Ashish Kalra, Benjamin Herrenschmidt,
Borislav Petkov, Catalin Marinas, Dave Hansen, David Woodhouse,
Eric Biederman, Ingo Molnar, James Gowans, Jonathan Corbet,
Mark Rutland, Paolo Bonzini, Pasha Tatashin, H. Peter Anvin,
Peter Zijlstra, Pratyush Yadav, Rob Herring, Rob Herring,
Saravana Kannan, Stanislav Kinsburskii, Steven Rostedt,
Thomas Gleixner, Tom Lendacky, Usama Arif, Will Deacon,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On 06/02/2025 14:27, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> We introduced KHO into Linux: A framework that allows Linux to pass
> metadata and memory across kexec from Linux to Linux. KHO reuses FDT
> as its file format and shares many of the same properties as firmware-to-
> Linux boot formats: It needs a stable, documented ABI that allows for
> forward and backward compatibility as well as versioning.
Please use subject prefixes matching the subsystem. You can get them for
example with `git log --oneline -- DIRECTORY_OR_FILE` on the directory
your patch is touching. For bindings, the preferred subjects are
explained here:
https://www.kernel.org/doc/html/latest/devicetree/bindings/submitting-patches.html#i-for-patch-submitters
>
> As the first user of KHO, we extended memblock, which can now preserve
> the contents of memory ranges reserved with the reserve_mem command line
> option across kexec, so you can use the post-kexec kernel to read traces
> from the pre-kexec kernel.
>
> This patch adds memblock schemas similar to "device" device tree ones to
> a new kho bindings directory. This allows us to force contributors to
> document the data that moves across KHO kexecs and catch breaking changes
> during review.
>
> Co-developed-by: Alexander Graf <graf@amazon.com>
> Signed-off-by: Alexander Graf <graf@amazon.com>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> .../kho/bindings/memblock/reserve_mem.yaml | 41 ++++++++++++++++++
> .../bindings/memblock/reserve_mem_map.yaml | 42 +++++++++++++++++++
> 2 files changed, 83 insertions(+)
> create mode 100644 Documentation/kho/bindings/memblock/reserve_mem.yaml
> create mode 100644 Documentation/kho/bindings/memblock/reserve_mem_map.yaml
>
> diff --git a/Documentation/kho/bindings/memblock/reserve_mem.yaml b/Documentation/kho/bindings/memblock/reserve_mem.yaml
> new file mode 100644
> index 000000000000..7b01791b10b3
> --- /dev/null
> +++ b/Documentation/kho/bindings/memblock/reserve_mem.yaml
> @@ -0,0 +1,41 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/memblock/reserve_mem.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Memblock reserved memory
> +
> +maintainers:
> + - Mike Rapoport <rppt@kernel.org>
> +
> +description: |
> + Memblock can serialize its current memory reservations created with
> + reserve_mem command line option across kexec through KHO.
> + The post-KHO kernel can then consume these reservations and they are
> + guaranteed to have the same physical address.
> +
> +properties:
> + compatible:
> + enum:
> + - reserve_mem-v1
NAK, underscores are not allowed. Please follow carefully DTS coding style.
> +
> +patternProperties:
> + "$[0-9a-f_]+^":
No underscores.
> + $ref: reserve_mem_map.yaml#
> + description: reserved memory regions
> +
> +required:
> + - compatible
> +
> +additionalProperties: false
> +
> +examples:
> + - |
> + reserve_mem {
Again, do not introduce own coding style.
I don't understand why you need this in the first place. There is
already a reserved-memory block.
> + compatible = "reserve_mem-v1";
> + r1 {
> + compatible = "reserve_mem_map-v1";
> + mem = <0xc07c 0x2000000 0x01 0x00>;
> + };
> + };
> diff --git a/Documentation/kho/bindings/memblock/reserve_mem_map.yaml b/Documentation/kho/bindings/memblock/reserve_mem_map.yaml
> new file mode 100644
> index 000000000000..09001c5f2124
> --- /dev/null
> +++ b/Documentation/kho/bindings/memblock/reserve_mem_map.yaml
> @@ -0,0 +1,42 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/memblock/reserve_mem_map.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Memblock reserved memory regions
> +
> +maintainers:
> + - Mike Rapoport <rppt@kernel.org>
> +
> +description: |
> + Memblock can serialize its current memory reservations created with
> + reserve_mem command line option across kexec through KHO.
> + This object describes each such region.
> +
> +properties:
> + compatible:
> + enum:
> + - reserve_mem_map-v1
Explain why you cannot use existing reserved memory bindings.
Best regards,
Krzysztof
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-07 0:29 ` [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Andrew Morton
2025-02-07 1:28 ` Pasha Tatashin
2025-02-07 8:06 ` Mike Rapoport
@ 2025-02-09 10:33 ` Krzysztof Kozlowski
2 siblings, 0 replies; 97+ messages in thread
From: Krzysztof Kozlowski @ 2025-02-09 10:33 UTC (permalink / raw)
To: Andrew Morton, Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andy Lutomirski, Anthony Yznaga,
Arnd Bergmann, Ashish Kalra, Benjamin Herrenschmidt,
Borislav Petkov, Catalin Marinas, Dave Hansen, David Woodhouse,
Eric Biederman, Ingo Molnar, James Gowans, Jonathan Corbet,
Mark Rutland, Paolo Bonzini, Pasha Tatashin, H. Peter Anvin,
Peter Zijlstra, Pratyush Yadav, Rob Herring, Rob Herring,
Saravana Kannan, Stanislav Kinsburskii, Steven Rostedt,
Thomas Gleixner, Tom Lendacky, Usama Arif, Will Deacon,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On 07/02/2025 01:29, Andrew Morton wrote:
> On Thu, 6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
>
>> This is the next version of Alex's "kexec: Allow preservation of ftrace buffers"
>> series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
>> where, to make things simpler, instead of ftrace we decided to preserve
>> "reserve_mem" regions.
>>
>> The patches are also available in git:
>> https://git.kernel.org/rppt/h/kho/v4
>>
>>
>> Kexec today considers itself purely a boot loader: When we enter the new
>> kernel, any state the previous kernel left behind is irrelevant and the
>> new kernel reinitializes the system.
>
> I tossed this into mm.git for some testing and exposure.
>
> What merge path are you anticipating?
>
> Review activity seems pretty thin thus far?
At least for the DT ABI, because:
1. For some reason this escaped Patchwork. Maybe it was blocked by spam
filters, maybe the Cc list is too big. No clue.
2. At the same time, the fallback to Patchwork was defeated by
Cc-ing the wrong address and not using the expected (see git log) subject prefixes.
Best regards,
Krzysztof
* Re: [PATCH v4 10/14] arm64: Add KHO support
2025-02-06 13:27 ` [PATCH v4 10/14] arm64: Add KHO support Mike Rapoport
@ 2025-02-09 10:38 ` Krzysztof Kozlowski
0 siblings, 0 replies; 97+ messages in thread
From: Krzysztof Kozlowski @ 2025-02-09 10:38 UTC (permalink / raw)
To: Mike Rapoport, linux-kernel
Cc: Alexander Graf, Andrew Morton, Andy Lutomirski, Anthony Yznaga,
Arnd Bergmann, Ashish Kalra, Benjamin Herrenschmidt,
Borislav Petkov, Catalin Marinas, Dave Hansen, David Woodhouse,
Eric Biederman, Ingo Molnar, James Gowans, Jonathan Corbet,
Mark Rutland, Paolo Bonzini, Pasha Tatashin, H. Peter Anvin,
Peter Zijlstra, Pratyush Yadav, Rob Herring, Rob Herring,
Saravana Kannan, Stanislav Kinsburskii, Steven Rostedt,
Thomas Gleixner, Tom Lendacky, Usama Arif, Will Deacon,
devicetree, kexec, linux-arm-kernel, linux-doc, linux-mm, x86
On 06/02/2025 14:27, Mike Rapoport wrote:
> From: Alexander Graf <graf@amazon.com>
>
> We now have all bits in place to support KHO kexecs. This patch adds
Please do not use "This commit/patch/change", but imperative mood. See
longer explanation here:
https://elixir.bootlin.com/linux/v5.17.1/source/Documentation/process/submitting-patches.rst#L95
> awareness of KHO in the kexec file as well as boot path for arm64 and
> adds the respective kconfig option to the architecture so that it can
> use KHO successfully.
>
> Signed-off-by: Alexander Graf <graf@amazon.com>
> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
...
> +#ifdef CONFIG_KEXEC_HANDOVER
> + dt = image->kho.dt.buffer;
> + dt_mem = image->kho.dt.mem;
> + dt_len = image->kho.dt.bufsz;
> +
> + scratch_mem = image->kho.scratch.mem;
> + scratch_len = image->kho.scratch.bufsz;
> +#endif
> +
> + if (!dt)
> + goto out;
> +
> + pr_debug("Adding kho metadata to DT");
> +
> + ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-dt",
Where is the ABI documentation for this?
> + dt_mem, dt_len);
> + if (ret)
> + goto out;
> +
> + ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-scratch",
Same question.
> + scratch_mem, scratch_len);
> + if (ret)
> + goto out;
> +
Best regards,
Krzysztof
* Re: [PATCH v4 14/14] Documentation: KHO: Add memblock bindings
2025-02-09 10:29 ` Krzysztof Kozlowski
@ 2025-02-09 15:10 ` Mike Rapoport
2025-02-09 15:23 ` Krzysztof Kozlowski
0 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-09 15:10 UTC (permalink / raw)
To: Krzysztof Kozlowski
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Mark Rutland, Paolo Bonzini,
Pasha Tatashin, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Sun, Feb 09, 2025 at 11:29:41AM +0100, Krzysztof Kozlowski wrote:
> On 06/02/2025 14:27, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > We introduced KHO into Linux: A framework that allows Linux to pass
> > metadata and memory across kexec from Linux to Linux. KHO reuses fdt
> > as file format and shares a lot of the same properties of firmware-to-
> > Linux boot formats: It needs a stable, documented ABI that allows for
> > forward and backward compatibility as well as versioning.
>
> Please use subject prefixes matching the subsystem. You can get them for
> example with `git log --oneline -- DIRECTORY_OR_FILE` on the directory
> your patch is touching. For bindings, the preferred subjects are
> explained here:
> https://www.kernel.org/doc/html/latest/devicetree/bindings/submitting-patches.html#i-for-patch-submitters
These are not devicetree bindings for communicating data from firmware to
the kernel. These bindings are specific to KHO, which is perfectly
reflected by the subject.
Just a brief reminder from v2 discussion:
(https://lore.kernel.org/linux-mm/20231222193607.15474-1-graf@amazon.com/)
"For quick reference: KHO is a new mechanism this patch set introduces
which allows Linux to pass arbitrary memory and metadata between kernels
on kexec. I'm reusing FDTs to implement the hand over protocol, as
Linux-to-Linux boot communication holds very similar properties to
firmware-to-Linux boot communication. So this binding is not about
hardware; it's about preserving Linux subsystem state across kexec.
For more details, please refer to the KHO documentation which is part of
patch 7 of this patch set:
https://lore.kernel.org/lkml/20231222195144.24532-2-graf@amazon.com/"
and
"This is our own data structure for KHO that just happens to again
contain a DT structure. The reason is simple: I want a unified,
versioned, introspectable data format that is cross platform so you
don't need to touch every architecture specific boot passing logic every
time you want to add a tiny piece of data."
> > As first user of KHO, we introduced memblock which can now preserve
> > memory ranges reserved with reserve_mem command line options contents
> > across kexec, so you can use the post-kexec kernel to read traces from
> > the pre-kexec kernel.
> >
> > This patch adds memblock schemas similar to "device" device tree ones to
> > a new kho bindings directory. This allows us to force contributors to
> > document the data that moves across KHO kexecs and catch breaking change
> > during review.
> >
> > Co-developed-by: Alexander Graf <graf@amazon.com>
> > Signed-off-by: Alexander Graf <graf@amazon.com>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> > .../kho/bindings/memblock/reserve_mem.yaml | 41 ++++++++++++++++++
> > .../bindings/memblock/reserve_mem_map.yaml | 42 +++++++++++++++++++
> > 2 files changed, 83 insertions(+)
> > create mode 100644 Documentation/kho/bindings/memblock/reserve_mem.yaml
> > create mode 100644 Documentation/kho/bindings/memblock/reserve_mem_map.yaml
> >
> > diff --git a/Documentation/kho/bindings/memblock/reserve_mem.yaml b/Documentation/kho/bindings/memblock/reserve_mem.yaml
> > new file mode 100644
> > index 000000000000..7b01791b10b3
> > --- /dev/null
> > +++ b/Documentation/kho/bindings/memblock/reserve_mem.yaml
> > @@ -0,0 +1,41 @@
> > +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> > +%YAML 1.2
> > +---
> > +$id: http://devicetree.org/schemas/memblock/reserve_mem.yaml#
> > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > +
> > +title: Memblock reserved memory
> > +
> > +maintainers:
> > + - Mike Rapoport <rppt@kernel.org>
> > +
> > +description: |
> > + Memblock can serialize its current memory reservations created with
> > + reserve_mem command line option across kexec through KHO.
> > + The post-KHO kernel can then consume these reservations and they are
> > + guaranteed to have the same physical address.
> > +
> > +examples:
> > + - |
> > + reserve_mem {
>
> Again, do not introduce own coding style.
>
> I don't understand why you need this in the first place. There is
> already a reserved-memory block.
Because these regions are not "... designed for the special usage by
various device drivers" and should not be exclude by the operating system
from normal usage.
> Best regards,
> Krzysztof
--
Sincerely yours,
Mike.
* Re: [PATCH v4 14/14] Documentation: KHO: Add memblock bindings
2025-02-09 15:10 ` Mike Rapoport
@ 2025-02-09 15:23 ` Krzysztof Kozlowski
2025-02-09 20:41 ` Mike Rapoport
0 siblings, 1 reply; 97+ messages in thread
From: Krzysztof Kozlowski @ 2025-02-09 15:23 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Mark Rutland, Paolo Bonzini,
Pasha Tatashin, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On 09/02/2025 16:10, Mike Rapoport wrote:
> On Sun, Feb 09, 2025 at 11:29:41AM +0100, Krzysztof Kozlowski wrote:
>> On 06/02/2025 14:27, Mike Rapoport wrote:
>>> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>>>
>>> We introduced KHO into Linux: A framework that allows Linux to pass
>>> metadata and memory across kexec from Linux to Linux. KHO reuses fdt
>>> as file format and shares a lot of the same properties of firmware-to-
>>> Linux boot formats: It needs a stable, documented ABI that allows for
>>> forward and backward compatibility as well as versioning.
>>
>> Please use subject prefixes matching the subsystem. You can get them for
>> example with `git log --oneline -- DIRECTORY_OR_FILE` on the directory
>> your patch is touching. For bindings, the preferred subjects are
>> explained here:
>> https://www.kernel.org/doc/html/latest/devicetree/bindings/submitting-patches.html#i-for-patch-submitters
>
> These are not devicetree binding for communicating data from firmware to
> the kernel. These bindings are specific to KHO which is perfectly
> reflected by the subject.
No, it is not. None of the bindings use the above subject prefix.
>
> Just a brief reminder from v2 discussion:
> (https://lore.kernel.org/linux-mm/20231222193607.15474-1-graf@amazon.com/)
>
> "For quick reference: KHO is a new mechanism this patch set introduces
> which allows Linux to pass arbitrary memory and metadata between kernels
> on kexec. I'm reusing FDTs to implement the hand over protocol, as
> Linux-to-Linux boot communication holds very similar properties to
> firmware-to-Linux boot communication. So this binding is not about
> hardware; it's about preserving Linux subsystem state across kexec.
It does not matter. You added a file to the ABI documentation, so you must
follow the ABI documentation rules. One rule is a proper subject prefix.
>
> For more details, please refer to the KHO documentation which is part of
> patch 7 of this patch set:
> https://lore.kernel.org/lkml/20231222195144.24532-2-graf@amazon.com/"
I fail to see how this is related to the incorrect subject prefix, as I
pointed out.
>
> and
>
> "This is our own data structure for KHO that just happens to again
> contain a DT structure. The reason is simple: I want a unified,
> versioned, introspectable data format that is cross platform so you
> don't need to touch every architecture specific boot passing logic every
> time you want to add a tiny piece of data."
>
>>> As first user of KHO, we introduced memblock which can now preserve
>>> memory ranges reserved with reserve_mem command line options contents
>>> across kexec, so you can use the post-kexec kernel to read traces from
>>> the pre-kexec kernel.
>>>
>>> This patch adds memblock schemas similar to "device" device tree ones to
>>> a new kho bindings directory. This allows us to force contributors to
>>> document the data that moves across KHO kexecs and catch breaking change
>>> during review.
>>>
>>> Co-developed-by: Alexander Graf <graf@amazon.com>
>>> Signed-off-by: Alexander Graf <graf@amazon.com>
>>> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>>> ---
>>> .../kho/bindings/memblock/reserve_mem.yaml | 41 ++++++++++++++++++
>>> .../bindings/memblock/reserve_mem_map.yaml | 42 +++++++++++++++++++
>>> 2 files changed, 83 insertions(+)
>>> create mode 100644 Documentation/kho/bindings/memblock/reserve_mem.yaml
>>> create mode 100644 Documentation/kho/bindings/memblock/reserve_mem_map.yaml
>>>
>>> diff --git a/Documentation/kho/bindings/memblock/reserve_mem.yaml b/Documentation/kho/bindings/memblock/reserve_mem.yaml
>>> new file mode 100644
>>> index 000000000000..7b01791b10b3
>>> --- /dev/null
>>> +++ b/Documentation/kho/bindings/memblock/reserve_mem.yaml
>>> @@ -0,0 +1,41 @@
>>> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
>>> +%YAML 1.2
>>> +---
>>> +$id: http://devicetree.org/schemas/memblock/reserve_mem.yaml#
>>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
>>> +
>>> +title: Memblock reserved memory
>>> +
>>> +maintainers:
>>> + - Mike Rapoport <rppt@kernel.org>
>>> +
>>> +description: |
>>> + Memblock can serialize its current memory reservations created with
>>> + reserve_mem command line option across kexec through KHO.
>>> + The post-KHO kernel can then consume these reservations and they are
>>> + guaranteed to have the same physical address.
>>> +
>>> +examples:
>>> + - |
>>> + reserve_mem {
>>
>> Again, do not introduce own coding style.
>>
>> I don't understand why you need this in the first place. There is
>> already a reserved-memory block.
>
> Because these regions are not "... designed for the special usage by
> various device drivers"
So you now use a very similar name, differing by a few letters, just to
note that you do not fit into existing formats. If this does not fit
existing usage, then use a different name.
> and should not be excluded by the operating system
> from normal usage.
Then it does not look like reserved memory.
>
>> Best regards,
>> Krzysztof
>
Best regards,
Krzysztof
* Re: [PATCH v4 14/14] Documentation: KHO: Add memblock bindings
2025-02-09 15:23 ` Krzysztof Kozlowski
@ 2025-02-09 20:41 ` Mike Rapoport
2025-02-09 20:49 ` Krzysztof Kozlowski
0 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-09 20:41 UTC (permalink / raw)
To: Krzysztof Kozlowski
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Mark Rutland, Paolo Bonzini,
Pasha Tatashin, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Sun, Feb 09, 2025 at 04:23:09PM +0100, Krzysztof Kozlowski wrote:
> On 09/02/2025 16:10, Mike Rapoport wrote:
> > On Sun, Feb 09, 2025 at 11:29:41AM +0100, Krzysztof Kozlowski wrote:
> >> On 06/02/2025 14:27, Mike Rapoport wrote:
> >>> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >>>
> >>> We introduced KHO into Linux: A framework that allows Linux to pass
> >>> metadata and memory across kexec from Linux to Linux. KHO reuses fdt
> >>> as file format and shares a lot of the same properties of firmware-to-
> >>> Linux boot formats: It needs a stable, documented ABI that allows for
> >>> forward and backward compatibility as well as versioning.
> >>
> >> Please use subject prefixes matching the subsystem. You can get them for
> >> example with `git log --oneline -- DIRECTORY_OR_FILE` on the directory
> >> your patch is touching. For bindings, the preferred subjects are
> >> explained here:
> >> https://www.kernel.org/doc/html/latest/devicetree/bindings/submitting-patches.html#i-for-patch-submitters
> >
> > These are not devicetree binding for communicating data from firmware to
> > the kernel. These bindings are specific to KHO which is perfectly
> > reflected by the subject.
>
> No, it is not. None of the bindings use above subject prefix.
>
> >
> > Just a brief reminder from v2 discussion:
> > (https://lore.kernel.org/linux-mm/20231222193607.15474-1-graf@amazon.com/)
> >
> > "For quick reference: KHO is a new mechanism this patch set introduces
> > which allows Linux to pass arbitrary memory and metadata between kernels
> > on kexec. I'm reusing FDTs to implement the hand over protocol, as
> > Linux-to-Linux boot communication holds very similar properties to
> > firmware-to-Linux boot communication. So this binding is not about
> > hardware; it's about preserving Linux subsystem state across kexec.
>
> does not matter. You added file to ABI documentation so you must follow
> that ABI documentation rules. One rule is proper subject prefix.
No, it does not. It's a different ABI.
FDT is a _data structure_ that provides a cross-platform, unified,
versioned, introspectable data format.
Documentation/devicetree/bindings standardizes its use for describing
hardware, but KHO uses the FDT _data structure_ to describe the state of the
kernel components that will be reused by the kexec'ed kernel.
KHO is a different namespace from Open Firmware Device Tree, with different
requirements and different stakeholders. Putting descriptions of KHO data
formats in Documentation/kho rather than in
Documentation/devicetree/bindings was not done to evade review of Open
Firmware Device Tree maintainers, but rather to emphasize that KHO FDT _is
not_ Open Firmware Device Tree.
> Best regards,
> Krzysztof
--
Sincerely yours,
Mike.
* Re: [PATCH v4 14/14] Documentation: KHO: Add memblock bindings
2025-02-09 20:41 ` Mike Rapoport
@ 2025-02-09 20:49 ` Krzysztof Kozlowski
2025-02-09 20:50 ` Krzysztof Kozlowski
0 siblings, 1 reply; 97+ messages in thread
From: Krzysztof Kozlowski @ 2025-02-09 20:49 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Mark Rutland, Paolo Bonzini,
Pasha Tatashin, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On 09/02/2025 21:41, Mike Rapoport wrote:
> On Sun, Feb 09, 2025 at 04:23:09PM +0100, Krzysztof Kozlowski wrote:
>> On 09/02/2025 16:10, Mike Rapoport wrote:
>>> On Sun, Feb 09, 2025 at 11:29:41AM +0100, Krzysztof Kozlowski wrote:
>>>> On 06/02/2025 14:27, Mike Rapoport wrote:
>>>>> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>>>>>
>>>>> We introduced KHO into Linux: A framework that allows Linux to pass
>>>>> metadata and memory across kexec from Linux to Linux. KHO reuses fdt
>>>>> as file format and shares a lot of the same properties of firmware-to-
>>>>> Linux boot formats: It needs a stable, documented ABI that allows for
>>>>> forward and backward compatibility as well as versioning.
>>>>
>>>> Please use subject prefixes matching the subsystem. You can get them for
>>>> example with `git log --oneline -- DIRECTORY_OR_FILE` on the directory
>>>> your patch is touching. For bindings, the preferred subjects are
>>>> explained here:
>>>> https://www.kernel.org/doc/html/latest/devicetree/bindings/submitting-patches.html#i-for-patch-submitters
>>>
>>> These are not devicetree binding for communicating data from firmware to
>>> the kernel. These bindings are specific to KHO which is perfectly
>>> reflected by the subject.
>>
>> No, it is not. None of the bindings use above subject prefix.
>>
>>>
>>> Just a brief reminder from v2 discussion:
>>> (https://lore.kernel.org/linux-mm/20231222193607.15474-1-graf@amazon.com/)
>>>
>>> "For quick reference: KHO is a new mechanism this patch set introduces
>>> which allows Linux to pass arbitrary memory and metadata between kernels
>>> on kexec. I'm reusing FDTs to implement the hand over protocol, as
>>> Linux-to-Linux boot communication holds very similar properties to
>>> firmware-to-Linux boot communication. So this binding is not about
>>> hardware; it's about preserving Linux subsystem state across kexec.
>>
>> does not matter. You added file to ABI documentation so you must follow
>> that ABI documentation rules. One rule is proper subject prefix.
>
> No, it does not. It's a different ABI.
>
> FDT is a _data structure_ that provides cross platform unified, versioned,
> introspectable data format.
>
> Documentation/devicetree/bindings standardizes its use for describing
> hardware, but KHO uses FDT _data structure_ to describe state of the kernel
> components that will be reused by the kexec'ed kernel.
>
> KHO is a different namespace from Open Firmware Device Tree, with different
> requirements and different stakeholders. Putting descriptions of KHO data
> formats in Documentation/kho rather than in
> Documentation/devicetree/bindings was not done to evade review of Open
> Firmware Device Tree maintainers, but rather to emphasize that KHO FDT _is
> not_ Open Firmware Device Tree.
Ah, neat, that would almost solve the problem but you wrote:
+$id: http://devicetree.org/schemas/memblock/reserve_mem.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
so no, it does not work like that. You use the devicetree namespace here
and ignore its rules.
You cannot pretend this is not devicetree if you put it into devicetree
schemas.
Best regards,
Krzysztof
* Re: [PATCH v4 14/14] Documentation: KHO: Add memblock bindings
2025-02-09 20:49 ` Krzysztof Kozlowski
@ 2025-02-09 20:50 ` Krzysztof Kozlowski
2025-02-10 19:15 ` Jason Gunthorpe
0 siblings, 1 reply; 97+ messages in thread
From: Krzysztof Kozlowski @ 2025-02-09 20:50 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Mark Rutland, Paolo Bonzini,
Pasha Tatashin, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On 09/02/2025 21:49, Krzysztof Kozlowski wrote:
>>>>
>>>> These are not devicetree binding for communicating data from firmware to
>>>> the kernel. These bindings are specific to KHO which is perfectly
>>>> reflected by the subject.
>>>
>>> No, it is not. None of the bindings use above subject prefix.
>>>
>>>>
>>>> Just a brief reminder from v2 discussion:
>>>> (https://lore.kernel.org/linux-mm/20231222193607.15474-1-graf@amazon.com/)
>>>>
>>>> "For quick reference: KHO is a new mechanism this patch set introduces
>>>> which allows Linux to pass arbitrary memory and metadata between kernels
>>>> on kexec. I'm reusing FDTs to implement the hand over protocol, as
>>>> Linux-to-Linux boot communication holds very similar properties to
>>>> firmware-to-Linux boot communication. So this binding is not about
>>>> hardware; it's about preserving Linux subsystem state across kexec.
>>>
>>> does not matter. You added file to ABI documentation so you must follow
>>> that ABI documentation rules. One rule is proper subject prefix.
>>
>> No, it does not. It's a different ABI.
>>
>> FDT is a _data structure_ that provides cross platform unified, versioned,
>> introspectable data format.
>>
>> Documentation/devicetree/bindings standardizes its use for describing
>> hardware, but KHO uses FDT _data structure_ to describe state of the kernel
>> components that will be reused by the kexec'ed kernel.
>>
>> KHO is a different namespace from Open Firmware Device Tree, with different
>> requirements and different stakeholders. Putting descriptions of KHO data
>> formats in Documentation/kho rather than in
>> Documentation/devicetree/bindings was not done to evade review of Open
>> Firmware Device Tree maintainers, but rather to emphasize that KHO FDT _is
>> not_ Open Firmware Device Tree.
>
>
> Ah, neat, that would almost solve the problem but you wrote:
>
> +$id: http://devicetree.org/schemas/memblock/reserve_mem.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
>
> so no, this does not work like that. You use devicetree here namespace
> and ignore its rules.
... and that obviously is barely parseable, so maybe one more try:
"You use here devicetree namespace but ignore its rules."
>
> You cannot pretend this is not devicetree if you put it into devicetree
> schemas.
Best regards,
Krzysztof
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 13/14] memblock: Add KHO support for reserve_mem
2025-02-06 13:27 ` [PATCH v4 13/14] memblock: Add KHO support for reserve_mem Mike Rapoport
@ 2025-02-10 16:03 ` Rob Herring
2025-02-12 16:30 ` Mike Rapoport
2025-02-17 4:04 ` Wei Yang
1 sibling, 1 reply; 97+ messages in thread
From: Rob Herring @ 2025-02-10 16:03 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Thu, Feb 6, 2025 at 7:30 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Alexander Graf <graf@amazon.com>
>
> Linux has recently gained support for "reserve_mem": A mechanism to
> allocate a region of memory early enough in boot that we can cross our
> fingers and hope it stays at the same location during most boots, so we
> can store for example ftrace buffers into it.
>
> Thanks to KASLR, we can never be really sure that "reserve_mem"
> allocations are static across kexec. Let's teach it KHO awareness so
> that it serializes its reservations on kexec exit and deserializes them
> again on boot, preserving the exact same mapping across kexec.
>
> This is an example user for KHO in the KHO patch set to ensure we have
> at least one (not very controversial) user in the tree before extending
> KHO's use to more subsystems.
>
> Signed-off-by: Alexander Graf <graf@amazon.com>
> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> mm/memblock.c | 131 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 131 insertions(+)
>
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 84df96efca62..fdb08b60efc1 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -16,6 +16,9 @@
> #include <linux/kmemleak.h>
> #include <linux/seq_file.h>
> #include <linux/memblock.h>
> +#include <linux/kexec_handover.h>
> +#include <linux/kexec.h>
> +#include <linux/libfdt.h>
>
> #include <asm/sections.h>
> #include <linux/io.h>
> @@ -2423,6 +2426,70 @@ int reserve_mem_find_by_name(const char *name, phys_addr_t *start, phys_addr_t *
> }
> EXPORT_SYMBOL_GPL(reserve_mem_find_by_name);
>
> +static bool __init reserve_mem_kho_revive(const char *name, phys_addr_t size,
> + phys_addr_t align)
> +{
> + const void *fdt = kho_get_fdt();
> + const char *path = "/reserve_mem";
> + int node, child, err;
> +
> + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER))
> + return false;
> +
> + if (!fdt)
> + return false;
> +
> + node = fdt_path_offset(fdt, "/reserve_mem");
> + if (node < 0)
> + return false;
> +
> + err = fdt_node_check_compatible(fdt, node, "reserve_mem-v1");
> + if (err) {
> + pr_warn("Node '%s' has unknown compatible", path);
> + return false;
> + }
> +
> + fdt_for_each_subnode(child, fdt, node) {
> + const struct kho_mem *mem;
> + const char *child_name;
> + int len;
> +
> + /* Search for old kernel's reserved_mem with the same name */
> + child_name = fdt_get_name(fdt, child, NULL);
> + if (strcmp(name, child_name))
> + continue;
> +
> + err = fdt_node_check_compatible(fdt, child, "reserve_mem_map-v1");
It really seems you all are trying to have things both ways. It's not
Devicetree, just the FDT file format, but then here you use
"compatible" which *is* Devicetree. At best, it's all just confusing
for folks. At worst, you're just picking and choosing what you want to
use.
I'm not saying don't use "compatible" just for the sake of looking
less like DT, but perhaps your versioning should be done differently.
You are reading the 'mem' property straight into a struct. Maybe the
struct should have a version. Or the size of the struct is the version
much like the userspace ABI is handled for structs.
> + if (err) {
> + pr_warn("Node '%s/%s' has unknown compatible", path, name);
> + continue;
> + }
> +
> + mem = fdt_getprop(fdt, child, "mem", &len);
> + if (!mem || len != sizeof(*mem))
> + continue;
> +
> + if (mem->addr & (align - 1)) {
It's stated somewhere in this that the FDT data is LE, but here you
are assuming the FDT is the same endianness as the CPU, not that it's
LE. Arm64 can do BE. PowerPC does both. I'm not sure if kexec from one
endianness to another is possible. I would guess in theory it is and
in practice it's broken already (because kexec is always an
afterthought). Either you need to guarantee that native endianness
will never be an issue for any arch or you need to make the endianness
fixed.
Rob
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 14/14] Documentation: KHO: Add memblock bindings
2025-02-09 20:50 ` Krzysztof Kozlowski
@ 2025-02-10 19:15 ` Jason Gunthorpe
2025-02-10 19:27 ` Krzysztof Kozlowski
0 siblings, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-02-10 19:15 UTC (permalink / raw)
To: Krzysztof Kozlowski
Cc: Mike Rapoport, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Mark Rutland, Paolo Bonzini,
Pasha Tatashin, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Sun, Feb 09, 2025 at 09:50:37PM +0100, Krzysztof Kozlowski wrote:
> > Ah, neat, that would almost solve the problem but you wrote:
> >
> > +$id: http://devicetree.org/schemas/memblock/reserve_mem.yaml#
> > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> >
> > so no, this does not work like that. You use devicetree here namespace
> > and ignore its rules.
>
> ... and that obviously is barely parseable, so maybe one more try:
> "You use here devicetree namespace but ignore its rules."
It makes sense to me; there should be zero cross-over of the two
specs, and KHO should be completely self-defined and standalone.
There is some documentation missing, I think. This yaml describes one
node type, but the entire overall structure of the fdt does not seem
to have documentation?
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 09/14] kexec: Add documentation for KHO
2025-02-06 13:27 ` [PATCH v4 09/14] kexec: Add documentation " Mike Rapoport
@ 2025-02-10 19:26 ` Jason Gunthorpe
0 siblings, 0 replies; 97+ messages in thread
From: Jason Gunthorpe @ 2025-02-10 19:26 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Feb 06, 2025 at 03:27:49PM +0200, Mike Rapoport wrote:
> +KHO introduces a new concept to its device tree: ``mem`` properties. A
> +``mem`` property can be inside any subnode in the device tree.
I do not think this is a good idea.
It should be core infrastructure, totally unrelated to any per-device
fdt nodes, to carry the memory map.
IOW a full DT that looks something more like:
/dts-v1/;
/ {
compatible = "linux-kho,v1";
allocated-memory {
<>
};
ftracebuffer {
compatible = "linux-kho,ftracem,v1";
ftrace-buffer-phys = <..>;
ftrace-buffer-len = <..>;
..etc..
};
};
Where allocated_memory will remove all memory from the buddy allocator
very early on in an efficient way. That process should not be walking
the fdt to find mem nodes.
> +After boot, drivers can call the kho subsystem to transfer ownership of memory
> +that was reserved via a ``mem`` property to themselves to continue using memory
> +from the previous execution.
And this transfer should be done by the phys address that the node
itself describes.
Ie if ftrace has a single high order folio to store its ftrace buffer
then I would expect code like:
allocate ftrace:
buffer = folio_alloc(..);
activate callback:
kho_preserve_folio(buffer)
fdt...("ftrace-buffer-phys", virt_to_phys(buffer))
restore callback:
buffer_phys = fdt..("ftrace-buffer-phys")
buffer = kho_restore_folio(buffer_phys)
[..]
destroy ftrace:
folio_put(buffer);
And kho will take care to restore the struct folio, put back the
order, etc, etc.
Similar for slab.
I think this sort of memory-based operation should be the very basic
core building primitive here.
So the allocated-memory node should preserve information about KHO'd
folios, their order and so on.
It doesn't matter what part of the FDT owns those folios, all the core
kernel should do is keep track of them and at some point check that
all preserved folios have been claimed.
> +We guarantee that we always have such regions through the scratch regions: On
> +first boot KHO allocates several physically contiguous memory regions. Since
> +after kexec these regions will be used by early memory allocations, there is a
> +scratch region per NUMA node plus a scratch region to satisfy allocation
> +requests that do not require particular NUMA node assignment.
This plan sounds great, way better than the pmem approaches/etc.
> +To enable user space based kexec file loader, the kernel needs to be able to
> +provide the device tree that describes the previous kernel's state before
> +performing the actual kexec. The process of generating that device tree is
> +called serialization. When the device tree is generated, some properties
> +of the system may become immutable because they are already written down
> +in the device tree. That state is called the KHO active phase.
This should have a whole state diagram as we've talked a few
times. There is a lot more to worry about here than just 'activate'.
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 14/14] Documentation: KHO: Add memblock bindings
2025-02-10 19:15 ` Jason Gunthorpe
@ 2025-02-10 19:27 ` Krzysztof Kozlowski
2025-02-10 20:20 ` Jason Gunthorpe
0 siblings, 1 reply; 97+ messages in thread
From: Krzysztof Kozlowski @ 2025-02-10 19:27 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mike Rapoport, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Mark Rutland, Paolo Bonzini,
Pasha Tatashin, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On 10/02/2025 20:15, Jason Gunthorpe wrote:
> On Sun, Feb 09, 2025 at 09:50:37PM +0100, Krzysztof Kozlowski wrote:
>>> Ah, neat, that would almost solve the problem but you wrote:
>>>
>>> +$id: http://devicetree.org/schemas/memblock/reserve_mem.yaml#
>>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
>>>
>>> so no, this does not work like that. You use devicetree here namespace
>>> and ignore its rules.
>>
>> ... and that obviously is barely parseable, so maybe one more try:
>> "You use here devicetree namespace but ignore its rules."
>
> It makes sense to me, there should be zero cross-over of the two
> specs, KHO should be completely self defined and stand alone.
>
> There is some documentation missing, I think. This yaml describes one
> node type, but the entire overall structure of the fdt does not seem
> to have documentation?
A lot of ABI is missing there and undocumented like: node name (which
for every standard DT would be a NAK), a few properties. This binding
describes only a subset while skipping all the rest and effectively
introducing implied/undocumented ABI.
Best regards,
Krzysztof
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 14/14] Documentation: KHO: Add memblock bindings
2025-02-10 19:27 ` Krzysztof Kozlowski
@ 2025-02-10 20:20 ` Jason Gunthorpe
2025-02-12 16:00 ` Mike Rapoport
0 siblings, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-02-10 20:20 UTC (permalink / raw)
To: Krzysztof Kozlowski
Cc: Mike Rapoport, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Mark Rutland, Paolo Bonzini,
Pasha Tatashin, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Mon, Feb 10, 2025 at 08:27:34PM +0100, Krzysztof Kozlowski wrote:
> On 10/02/2025 20:15, Jason Gunthorpe wrote:
> > On Sun, Feb 09, 2025 at 09:50:37PM +0100, Krzysztof Kozlowski wrote:
> >>> Ah, neat, that would almost solve the problem but you wrote:
> >>>
> >>> +$id: http://devicetree.org/schemas/memblock/reserve_mem.yaml#
> >>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> >>>
> >>> so no, this does not work like that. You use devicetree here namespace
> >>> and ignore its rules.
> >>
> >> ... and that obviously is barely parseable, so maybe one more try:
> >> "You use here devicetree namespace but ignore its rules."
> >
> > It makes sense to me, there should be zero cross-over of the two
> > specs, KHO should be completely self defined and stand alone.
> >
> > There is some documentation missing, I think. This yaml describes one
> > node type, but the entire overall structure of the fdt does not seem
> > to have documentation?
>
> A lot of ABI is missing there and undocumented like: node name (which
> for every standard DT would be a NAK), a few properties. This binding
> describes only a subset while skipping all the rest and effectively
> introducing implied/undocumented ABI.
I agree, I think it can be easily addressed: the docs should have a
sample of the overall DT from the root node and yaml for each of the
blocks, laying out the purpose very much like the open DT spec.
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
2025-02-06 13:27 ` [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers Mike Rapoport
@ 2025-02-10 20:22 ` Jason Gunthorpe
2025-02-10 20:58 ` Pasha Tatashin
2025-02-12 12:29 ` Thomas Weißschuh
1 sibling, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-02-10 20:22 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Feb 06, 2025 at 03:27:45PM +0200, Mike Rapoport wrote:
> diff --git a/Documentation/ABI/testing/sysfs-kernel-kho b/Documentation/ABI/testing/sysfs-kernel-kho
> new file mode 100644
> index 000000000000..f13b252bc303
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-kho
> @@ -0,0 +1,53 @@
> +What: /sys/kernel/kho/active
> +Date: December 2023
> +Contact: Alexander Graf <graf@amazon.com>
> +Description:
> + Kexec HandOver (KHO) allows Linux to transition the state of
> + compatible drivers into the next kexec'ed kernel. To do so,
> + device drivers will serialize their current state into a DT.
> + While the state is serialized, they are unable to perform
> + any modifications to state that was serialized, such as
> + handed over memory allocations.
> +
> + When this file contains "1", the system is in the transition
> + state. When contains "0", it is not. To switch between the
> + two states, echo the respective number into this file.
I don't think this is a great interface for the actual state machine..
> +What: /sys/kernel/kho/dt_max
> +Date: December 2023
> +Contact: Alexander Graf <graf@amazon.com>
> +Description:
> + KHO needs to allocate a buffer for the DT that gets
> + generated before it knows the final size. By default, it
> + will allocate 10 MiB for it. You can write to this file
> + to modify the size of that allocation.
Seems gross, why can't it use a non-contiguous page list to generate
the FDT? :\
See below for a suggestion..
> +static int kho_serialize(void)
> +{
> + void *fdt = NULL;
> + int err = -ENOMEM;
> +
> + fdt = kvmalloc(kho_out.dt_max, GFP_KERNEL);
> + if (!fdt)
> + goto out;
> +
> + if (fdt_create(fdt, kho_out.dt_max)) {
> + err = -EINVAL;
> + goto out;
> + }
> +
> + err = fdt_finish_reservemap(fdt);
> + if (err)
> + goto out;
> +
> + err = fdt_begin_node(fdt, "");
> + if (err)
> + goto out;
> +
> + err = fdt_property_string(fdt, "compatible", "kho-v1");
> + if (err)
> + goto out;
> +
> + /* Loop through all kho dump functions */
> + err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_DUMP, fdt);
> + err = notifier_to_errno(err);
I don't see this really working long term. I think we'd like each
component to be able to serialize at its own pace under userspace
control.
This design requires that the whole thing be wrapped in a notifier
callback just so we can make use of the fdt APIs.
It seems like a poor fit to me.
IMHO if you want to keep using FDT I suggest that each serializing
component (ie driver, ftrace whatever) allocate its own FDT fragment
from scratch and the main KHO one just link to the memory that holds
those fragments.
Ie the driver experience would be more like
kho = kho_start_storage("my_compatible_string,v1", some_kind_of_instance_key);
fdt...(kho->fdt..)
kho_finish_storage(kho);
Where this ends up creating a stand alone FDT fragment:
/dts-v1/;
/ {
compatible = "linux-kho,my_compatible_string,v1";
instance = some_kind_of_instance_key;
key-value-1 = <..>;
key-value-2 = <..>;
};
And then kho_finish_storage() would remember the phys/length until the
kexec fdt is produced as the very last step.
This way we could do things like fdbox an iommufd and create the above
FDT fragment completely seperately from any notifier chain and,
crucially, disconnected from the fdt_create() for the kexec payload.
Further, if you split things like this (it will waste some small
amount of memory) you can probably get to a point where no single FDT
is more than 4k. That looks like it would simplify/robustify a lot of
stuff?
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 06/14] kexec: Add KHO parsing support
2025-02-06 13:27 ` [PATCH v4 06/14] kexec: Add KHO parsing support Mike Rapoport
@ 2025-02-10 20:50 ` Jason Gunthorpe
2025-03-10 16:20 ` Pratyush Yadav
1 sibling, 0 replies; 97+ messages in thread
From: Jason Gunthorpe @ 2025-02-10 20:50 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Feb 06, 2025 at 03:27:46PM +0200, Mike Rapoport wrote:
> +/**
> + * kho_claim_mem - Notify the kernel that a handed over memory range is now
> + * in use
> + * @mem: memory range that was preserved during kexec handover
> + *
> + * A kernel subsystem preserved that range during handover and it is going
> + * to reuse this range after kexec. The pages in the range are treated as
> + * allocated, but not %PG_reserved.
> + *
> + * Return: virtual address of the preserved memory range
> + */
> +void *kho_claim_mem(const struct kho_mem *mem)
> +{
> + unsigned long start_pfn, end_pfn, pfn;
> + void *va = __va(mem->addr);
> +
> + start_pfn = PFN_DOWN(mem->addr);
> + end_pfn = PFN_UP(mem->addr + mem->size);
> +
> + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> + int err = kho_claim_pfn(pfn);
> +
> + if (err)
> + return NULL;
> + }
> +
> + return va;
> +}
> +EXPORT_SYMBOL_GPL(kho_claim_mem);
I think this is not the sort of interface toward drivers we should be
going for, I think we should be round tripping folios at their
allocated order, and when restored the folio should be freed with
folio_put(), just like in the normal way. Here you are breaking down
high order folios and undoing the GFP_COMP, it is not desirable for
drivers..
Eventually with some kind of support for conserving the memdesc
struct/page metadata if a driver is using it.
Following that basic primitive you'd want to have the same idea to
preserve kmalloc() memory.
And like I said elsewhere, the drivers should be working on naked
phys_addr_t's stored in their own structs in their own way, not
special kho_mem things.. It won't scale like this if the driver needs
to pass thousands of pages.
Also, how is the driver supposed to figure out what the structure is
inside the kho_mem anyhow? I would expect the FDT key/value store to
have a key/phys_addr_t structure outlining the various driver data
structures.
IMHO this links to my first comment on how the FDT represents the
preserved memory; it seems that the FDT format cannot effectively
preserve folios..
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
2025-02-10 20:22 ` Jason Gunthorpe
@ 2025-02-10 20:58 ` Pasha Tatashin
2025-02-11 12:49 ` Jason Gunthorpe
0 siblings, 1 reply; 97+ messages in thread
From: Pasha Tatashin @ 2025-02-10 20:58 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mike Rapoport, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Mon, Feb 10, 2025 at 3:22 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Feb 06, 2025 at 03:27:45PM +0200, Mike Rapoport wrote:
> > diff --git a/Documentation/ABI/testing/sysfs-kernel-kho b/Documentation/ABI/testing/sysfs-kernel-kho
> > new file mode 100644
> > index 000000000000..f13b252bc303
> > --- /dev/null
> > +++ b/Documentation/ABI/testing/sysfs-kernel-kho
> > @@ -0,0 +1,53 @@
> > +What: /sys/kernel/kho/active
> > +Date: December 2023
> > +Contact: Alexander Graf <graf@amazon.com>
> > +Description:
> > + Kexec HandOver (KHO) allows Linux to transition the state of
> > + compatible drivers into the next kexec'ed kernel. To do so,
> > + device drivers will serialize their current state into a DT.
> > + While the state is serialized, they are unable to perform
> > + any modifications to state that was serialized, such as
> > + handed over memory allocations.
> > +
> > + When this file contains "1", the system is in the transition
> > + state. When contains "0", it is not. To switch between the
> > + two states, echo the respective number into this file.
>
> I don't think this is a great interface for the actual state machine..
In our next proposal we are going to remove this "activate" phase.
>
> > +What: /sys/kernel/kho/dt_max
> > +Date: December 2023
> > +Contact: Alexander Graf <graf@amazon.com>
> > +Description:
> > + KHO needs to allocate a buffer for the DT that gets
> > + generated before it knows the final size. By default, it
> > + will allocate 10 MiB for it. You can write to this file
> > + to modify the size of that allocation.
>
> Seems gross, why can't it use a non-contiguous page list to generate
> the FDT? :\
We will consider some of these ideas in the future version. I like the
idea of using preserved memory to carry a sparse KHO tree, i.e. an FDT over
sparse memory; maybe use the anchor page to describe how it should be
vmapped into a virtually contiguous tree in the next kernel?
>
> See below for a suggestion..
>
> > +static int kho_serialize(void)
> > +{
> > + void *fdt = NULL;
> > + int err = -ENOMEM;
> > +
> > + fdt = kvmalloc(kho_out.dt_max, GFP_KERNEL);
> > + if (!fdt)
> > + goto out;
> > +
> > + if (fdt_create(fdt, kho_out.dt_max)) {
> > + err = -EINVAL;
> > + goto out;
> > + }
> > +
> > + err = fdt_finish_reservemap(fdt);
> > + if (err)
> > + goto out;
> > +
> > + err = fdt_begin_node(fdt, "");
> > + if (err)
> > + goto out;
> > +
> > + err = fdt_property_string(fdt, "compatible", "kho-v1");
> > + if (err)
> > + goto out;
> > +
> > + /* Loop through all kho dump functions */
> > + err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_DUMP, fdt);
> > + err = notifier_to_errno(err);
>
> I don't see this really working long term. I think we'd like each
> component to be able to serialize at its own pace under userspace
> control.
>
> This design requires that the whole thing be wrapped in a notifier
> callback just so we can make use of the fdt APIs.
>
> It seems like a poor fit to me.
>
> IMHO if you want to keep using FDT I suggest that each serializing
> component (ie driver, ftrace whatever) allocate its own FDT fragment
> from scratch and the main KHO one just link to the memory that holds
> those fragments.
>
> Ie the driver experience would be more like
>
> kho = kho_start_storage("my_compatible_string,v1", some_kind_of_instance_key);
>
> fdt...(kho->fdt..)
>
> kho_finish_storage(kho);
>
> Where this ends up creating a stand alone FDT fragment:
>
> /dts-v1/;
> / {
> compatible = "linux-kho,my_compatible_string,v1";
> instance = some_kind_of_instance_key;
> key-value-1 = <..>;
> key-value-2 = <..>;
> };
>
> And then kho_finish_storage() would remember the phys/length until the
> kexec fdt is produced as the very last step.
>
> This way we could do things like fdbox an iommufd and create the above
> FDT fragment completely seperately from any notifier chain and,
> crucially, disconnected from the fdt_create() for the kexec payload.
>
> Further, if you split things like this (it will waste some small
> amount of memory) you can probably get to a point where no single FDT
> is more than 4k. That looks like it would simplify/robustify a lot of
> stuff?
>
> Jason
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
2025-02-10 20:58 ` Pasha Tatashin
@ 2025-02-11 12:49 ` Jason Gunthorpe
2025-02-11 16:14 ` Pasha Tatashin
0 siblings, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-02-11 12:49 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Mike Rapoport, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Mon, Feb 10, 2025 at 03:58:00PM -0500, Pasha Tatashin wrote:
> >
> > > +What: /sys/kernel/kho/dt_max
> > > +Date: December 2023
> > > +Contact: Alexander Graf <graf@amazon.com>
> > > +Description:
> > > + KHO needs to allocate a buffer for the DT that gets
> > > + generated before it knows the final size. By default, it
> > > + will allocate 10 MiB for it. You can write to this file
> > > + to modify the size of that allocation.
> >
> > Seems gross, why can't it use a non-contiguous page list to generate
> > the FDT? :\
>
> We will consider some of these ideas in the future version. I like the
> idea of using preserved memory to carry sparse KHO tree: i.e FDT over
> sparse memory, maybe use the anchor page to describe how it should be
> vmapped into a virtually contiguous tree in the next kernel?
Yeah, but this is now permanent uAPI that has to be kept forever. I
think you should not add this when there are enough ideas on how to
completely avoid it.
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
2025-02-11 12:49 ` Jason Gunthorpe
@ 2025-02-11 16:14 ` Pasha Tatashin
2025-02-11 16:37 ` Jason Gunthorpe
0 siblings, 1 reply; 97+ messages in thread
From: Pasha Tatashin @ 2025-02-11 16:14 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mike Rapoport, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Tue, Feb 11, 2025 at 7:49 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Feb 10, 2025 at 03:58:00PM -0500, Pasha Tatashin wrote:
> > >
> > > > +What: /sys/kernel/kho/dt_max
> > > > +Date: December 2023
> > > > +Contact: Alexander Graf <graf@amazon.com>
> > > > +Description:
> > > > + KHO needs to allocate a buffer for the DT that gets
> > > > + generated before it knows the final size. By default, it
> > > > + will allocate 10 MiB for it. You can write to this file
> > > > + to modify the size of that allocation.
> > >
> > > Seems gross, why can't it use a non-contiguous page list to generate
> > > the FDT? :\
> >
> > We will consider some of these ideas in the future version. I like the
> > idea of using preserved memory to carry sparse KHO tree: i.e FDT over
> > sparse memory, maybe use the anchor page to describe how it should be
> > vmapped into a virtually contiguous tree in the next kernel?
>
> Yeah, but this is now permanent uAPI that has to be kept forever. I
Agree; what I meant by "the future patch version" is before this gets
merged. I should have been clearer.
> think you should not add this when there are enough ideas on how to
> completely avoid it.
Thinking about it some more, I'm actually leaning towards keeping
things as they are, instead of going with a sparse FDT. With a sparse
KHO tree, we'd be trying to fix something that should be handled
higher up. All userspace-preservable memory (like emulated pmem with
devdax/fsdax, and also pstore for logging) can already survive cold
reboots with modified firmware; Google and Microsoft do this.
Similarly, the firmware could give the kernel the KHO tree (generated
by firmware or by the previous kernel) to keep stuff like telemetry,
oops messages, time stamps, etc. KHO should not be seen primarily as
a mechanism to carry device serialization data; it should be a
standard and simple way to pass kernel data between reboots. More
complex state can be built on top of it: guestmemfs, for example,
could preserve terabytes of data while having only one node in the
KHO tree.
>
> Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
2025-02-11 16:14 ` Pasha Tatashin
@ 2025-02-11 16:37 ` Jason Gunthorpe
2025-02-12 15:23 ` Jason Gunthorpe
0 siblings, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-02-11 16:37 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Mike Rapoport, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Tue, Feb 11, 2025 at 11:14:06AM -0500, Pasha Tatashin wrote:
> > think you should not add this when there are enough ideas on how to
> > completely avoid it.
>
> Thinking about it some more, I'm actually leaning towards keeping
> things as they are, instead of going with a sparse FDT.
What is a sparse FDT? My suggestion that each driver make its own FDT?
The reason for that was sequencing: we need a more flexible way to
manage all this serialization than just a notifier chain. The existing
FDT construction process is too restrictive to accommodate this, IMHO.
That it also resolves the weird dt_max stuff above is a nice side
effect.
> With a sparse KHO-tree, we'd be kinda trying to fix something that
> should be handled higher up. All userspace preservable memory (like
> emulated pmem with devdax/fsdax and also pstore for logging) can
> already survive cold reboots with modified firmware; Google and
> Microsoft do this.
I was hoping the VM memory wouldn't be in DAX. If you want some DAX
stuff to interact with FW, OK, but I think the design here should be
driving toward preserving memfd/guestmemfd/hugetlbfs FDs directly
and eliminating DAX-backed VMs. We won't get to CC guestmemfd with
DAX.
fdbox of a guestmemfd, for instance.
To do that you need to preserve folios as the basic primitive.
> Similarly, the firmware could give the kernel the KHO-tree (generated
> by firmware or from the previous kernel) to keep stuff like telemetry,
> oops messages, time stamps etc.
It feels like a mistake to commingle things like this. KHO is complex
enough; it should stay focused on its thing.
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
2025-02-06 13:27 ` [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers Mike Rapoport
2025-02-10 20:22 ` Jason Gunthorpe
@ 2025-02-12 12:29 ` Thomas Weißschuh
1 sibling, 0 replies; 97+ messages in thread
From: Thomas Weißschuh @ 2025-02-12 12:29 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On 2025-02-06 15:27:45+0200, Mike Rapoport wrote:
> From: Alexander Graf <graf@amazon.com>
>
> This patch adds the core infrastructure to generate Kexec HandOver
> metadata. Kexec HandOver is a mechanism that allows Linux to preserve
> state - arbitrary properties as well as memory locations - across kexec.
>
> It does so using 2 concepts:
>
> 1) Device Tree - Every KHO kexec carries a KHO specific flattened
> device tree blob that describes the state of the system. Device
> drivers can register to KHO to serialize their state before kexec.
>
> 2) Scratch Regions - CMA regions that we allocate in the first kernel.
> CMA gives us the guarantee that no handover pages land in those
> regions, because handover pages must be at a static physical memory
> location. We use these regions as the place to load future kexec
> images so that they won't collide with any handover data.
>
> Signed-off-by: Alexander Graf <graf@amazon.com>
> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> Documentation/ABI/testing/sysfs-kernel-kho | 53 +++
> .../admin-guide/kernel-parameters.txt | 24 +
> MAINTAINERS | 1 +
> include/linux/cma.h | 2 +
> include/linux/kexec.h | 18 +
> include/linux/kexec_handover.h | 10 +
> kernel/Makefile | 1 +
> kernel/kexec_handover.c | 450 ++++++++++++++++++
> mm/internal.h | 3 -
> mm/mm_init.c | 8 +
> 10 files changed, 567 insertions(+), 3 deletions(-)
> create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
> create mode 100644 include/linux/kexec_handover.h
> create mode 100644 kernel/kexec_handover.c
<snip>
> --- /dev/null
> +++ b/include/linux/kexec_handover.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef LINUX_KEXEC_HANDOVER_H
> +#define LINUX_KEXEC_HANDOVER_H
#include <linux/types.h>
> +
> +struct kho_mem {
> + phys_addr_t addr;
> + phys_addr_t size;
> +};
> +
> +#endif /* LINUX_KEXEC_HANDOVER_H */
<snip>
> +static ssize_t dt_read(struct file *file, struct kobject *kobj,
> + struct bin_attribute *attr, char *buf,
Please make the bin_attribute argument const. Currently both work, but
the non-const variant will go away.
This way I can test my stuff on linux-next.
> + loff_t pos, size_t count)
> +{
> + mutex_lock(&kho_out.lock);
> + memcpy(buf, attr->private + pos, count);
> + mutex_unlock(&kho_out.lock);
> +
> + return count;
> +}
> +
> +struct bin_attribute bin_attr_dt_kern = __BIN_ATTR(dt, 0400, dt_read, NULL, 0);
The new __BIN_ATTR_ADMIN_RO() could make this slightly shorter.
<snip>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
2025-02-11 16:37 ` Jason Gunthorpe
@ 2025-02-12 15:23 ` Jason Gunthorpe
2025-02-12 16:39 ` Mike Rapoport
0 siblings, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-02-12 15:23 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Mike Rapoport, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Tue, Feb 11, 2025 at 12:37:20PM -0400, Jason Gunthorpe wrote:
> To do that you need to preserve folios as the basic primitive.
I made a small sketch of what I suggest.
I imagine the FDT schema for this would look something like this:
/dts-v1/;
/ {
compatible = "linux-kho,v1";
phys-addr-size = 64;
void-p-size = 64;
preserved-folio-map = <phys_addr>;
// The per "driver" storage
instance@1 {..};
instance@2 {..};
};
I think this is a lot better than what is in this series. It uses much
less memory when there are a lot of allocations, it supports folios of
any order, it is efficient for 1G guestmemfd folios, and it only needs
a few bytes in the FDT. It could also preserve and restore the high
order folio struct page folding (HVO).
The use cases I'm imagining for drivers would be pushing gigabytes of
memory into this preservation mechanism. It needs to be scalable!
This also illustrates my point that I don't think FDT is a good
representation to use exclusively. This in-memory structure is much
better and faster than trying to represent the same information
embedded directly into the FDT. I imagine this to be the general
pattern that drivers will want to use. A few bytes in the FDT pointing
at a scalable in-memory structure for the bulk of the data.
/*
* Keep track of folio memory that is to be preserved across KHO.
*
* This is designed with the idea that the system will have a lot of memory,
* e.g. 1TB, and the majority of it will be ~1G folios assigned to a hugetlb/etc
* being used to back guest memory. This would leave a smaller amount of memory,
* e.g. 16G, reserved for the hypervisor to use. The pages to preserve across KHO
* would be randomly distributed over the hypervisor memory. The hypervisor
* memory is not required to be contiguous.
*
* This approach is fully incremental: as the serialization progresses, folios
* can continue to be aggregated to the tracker. The final step, immediately prior
* to kexec would serialize the xarray information into a linked list for the
* successor kernel to parse.
*
* The serializing side uses two levels of xarrays to manage chunks of per-order
* 512 byte bitmaps. For instance the entire 1G order of a 1TB system would fit
* inside a single 512 byte bitmap. For order 0 allocations each bitmap will
* cover 16M of address space. Thus, for 16G of hypervisor memory at most 512K
* of bitmap memory will be needed for order 0.
*/
struct kho_mem_track
{
/* Points to kho_mem_phys, each order gets its own bitmap tree */
struct xarray orders;
};
struct kho_mem_phys
{
/*
* Points to kho_mem_phys_bits, a sparse bitmap array. Each bit is sized
* to order.
*/
struct xarray phys_bits;
};
#define PRESERVE_BITS (512 * 8)
struct kho_mem_phys_bits
{
DECLARE_BITMAP(preserve, PRESERVE_BITS);
};
static void *
xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t elmsz)
{
void *elm;
void *res;
elm = xa_load(xa, index);
if (elm)
return elm;
elm = kzalloc(elmsz, GFP_KERNEL);
if (!elm)
return ERR_PTR(-ENOMEM);
res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
if (xa_is_err(res)) {
kfree(elm);
return ERR_PTR(xa_err(res));
}
if (res != NULL) {
kfree(elm);
return res;
}
return elm;
}
/*
* Record that the entire folio under virt is preserved across KHO. virt must
* have come from alloc_pages/folio_alloc or similar and point to the first page
* of the folio. The order will be preserved as well.
*/
int kho_preserve_folio(struct kho_mem_track *tracker, void *virt)
{
struct folio *folio = virt_to_folio(virt);
unsigned int order = folio_order(folio);
phys_addr_t phys = virt_to_phys(virt);
struct kho_mem_phys_bits *bits;
struct kho_mem_phys *physxa;
might_sleep();
physxa = xa_load_or_alloc(&tracker->orders, order, sizeof(*physxa));
if (IS_ERR(physxa))
return PTR_ERR(physxa);
phys >>= PAGE_SHIFT + order;
static_assert(sizeof(phys_addr_t) <= sizeof(unsigned long));
bits = xa_load_or_alloc(&physxa->phys_bits, phys / PRESERVE_BITS,
sizeof(*bits));
if (IS_ERR(bits))
return PTR_ERR(bits);
set_bit(phys % PRESERVE_BITS, bits->preserve);
return 0;
}
#define KHOSER_PTR(type) union {phys_addr_t phys; type ptr;}
#define KHOSER_STORE_PTR(dest, val) \
({ \
(dest).phys = virt_to_phys(val); \
typecheck(typeof((dest).ptr), val); \
})
#define KHOSER_LOAD_PTR(src) ((typeof((src).ptr))(phys_to_virt((src).phys)))
struct khoser_mem_bitmap_ptr {
phys_addr_t phys_start;
KHOSER_PTR(struct kho_mem_phys_bits *) bitmap;
};
struct khoser_mem_chunk {
unsigned int order;
unsigned int num_elms;
KHOSER_PTR(struct khoser_mem_chunk *) next;
struct khoser_mem_bitmap_ptr
bitmaps[(PAGE_SIZE - 16) / sizeof(struct khoser_mem_bitmap_ptr)];
};
static_assert(sizeof(struct khoser_mem_chunk) == PAGE_SIZE);
static int new_chunk(struct khoser_mem_chunk **cur_chunk)
{
struct khoser_mem_chunk *chunk;
chunk = kzalloc(sizeof(*chunk), GFP_KERNEL);
if (!chunk)
return -ENOMEM;
if (*cur_chunk)
KHOSER_STORE_PTR((*cur_chunk)->next, chunk);
*cur_chunk = chunk;
return 0;
}
/*
* Record all the bitmaps in a linked list of pages for the next kernel to
* process. Each chunk holds bitmaps of the same order and each block of bitmaps
* starts at a given physical address. This allows the bitmaps to be sparse. The
* xarray is used to store them in a tree while building up the data structure,
* but the KHO successor kernel only needs to process them once in order.
*
* All of this memory is normal kmalloc() memory and is not marked for
* preservation. The successor kernel will remain isolated to the scratch space
* until it completes processing this list. Once processed all the memory
* storing these ranges will be marked as free.
*/
int kho_serialize(struct kho_mem_track *tracker, phys_addr_t *fdt_value)
{
struct khoser_mem_chunk *first_chunk = NULL;
struct khoser_mem_chunk *chunk = NULL;
struct kho_mem_phys *physxa;
unsigned long order;
int ret;
xa_for_each(&tracker->orders, order, physxa) {
struct kho_mem_phys_bits *bits;
unsigned long phys;
ret = new_chunk(&chunk);
if (ret)
goto err_free;
if (!first_chunk)
first_chunk = chunk;
chunk->order = order;
xa_for_each(&physxa->phys_bits, phys, bits) {
struct khoser_mem_bitmap_ptr *elm;
if (chunk->num_elms == ARRAY_SIZE(chunk->bitmaps)) {
ret = new_chunk(&chunk);
if (ret)
goto err_free;
}
elm = &chunk->bitmaps[chunk->num_elms];
chunk->num_elms++;
elm->phys_start = (phys * PRESERVE_BITS) << (order + PAGE_SHIFT);
KHOSER_STORE_PTR(elm->bitmap, bits);
}
}
*fdt_value = virt_to_phys(first_chunk);
return 0;
err_free:
chunk = first_chunk;
while (chunk) {
struct khoser_mem_chunk *tmp = chunk;
chunk = KHOSER_LOAD_PTR(chunk->next);
kfree(tmp);
}
return ret;
}
static void preserve_bitmap(unsigned int order,
struct khoser_mem_bitmap_ptr *elm)
{
struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap);
unsigned int bit;
for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) {
phys_addr_t phys = elm->phys_start +
((phys_addr_t)bit << (order + PAGE_SHIFT));
// Do the struct page stuff..
}
}
void kho_deserialize(phys_addr_t fdt_value)
{
struct khoser_mem_chunk *chunk = phys_to_virt(fdt_value);
while (chunk) {
unsigned int i;
for (i = 0; i != chunk->num_elms; i++)
preserve_bitmap(chunk->order, &chunk->bitmaps[i]);
chunk = KHOSER_LOAD_PTR(chunk->next);
}
}
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 14/14] Documentation: KHO: Add memblock bindings
2025-02-10 20:20 ` Jason Gunthorpe
@ 2025-02-12 16:00 ` Mike Rapoport
0 siblings, 0 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-12 16:00 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Krzysztof Kozlowski, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Mark Rutland, Paolo Bonzini,
Pasha Tatashin, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Mon, Feb 10, 2025 at 04:20:40PM -0400, Jason Gunthorpe wrote:
> On Mon, Feb 10, 2025 at 08:27:34PM +0100, Krzysztof Kozlowski wrote:
> > On 10/02/2025 20:15, Jason Gunthorpe wrote:
> > > On Sun, Feb 09, 2025 at 09:50:37PM +0100, Krzysztof Kozlowski wrote:
> > >>> Ah, neat, that would almost solve the problem but you wrote:
> > >>>
> > >>> +$id: http://devicetree.org/schemas/memblock/reserve_mem.yaml#
> > >>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > >>>
> > >>> so no, this does not work like that. You use devicetree here namespace
> > >>> and ignore its rules.
> > >>
> > >> ... and that obviously is barely parseable, so maybe one more try:
> > >> "You use here devicetree namespace but ignore its rules."
> > >
> > > It makes sense to me, there should be zero cross-over of the two
> > > specs, KHO should be completely self defined and stand alone.
> > >
> > > There is some documentation missing, I think. This yaml describes one
> > > node type, but the entire overall structure of the fdt does not seem
> > > to have documentation?
> >
> > A lot of ABI is missing there and undocumented like: node name (which
> > for every standard DT would be a NAK), few properties. This binding
> > describes only subset while skipping all the rest and effectively
> > introducing implied/undocumented ABI.
>
> I agree, I think it can be easily addressed - the docs should have a
> sample of the overall DT from the root node and yaml for each of the
> blocks, laying out the purpose very much like the open DT spec.
I'll update the docs with more details about overall structure and will
make it clear that it's a different namespace.
> Jason
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 13/14] memblock: Add KHO support for reserve_mem
2025-02-10 16:03 ` Rob Herring
@ 2025-02-12 16:30 ` Mike Rapoport
0 siblings, 0 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-12 16:30 UTC (permalink / raw)
To: Rob Herring
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Mon, Feb 10, 2025 at 10:03:58AM -0600, Rob Herring wrote:
> On Thu, Feb 6, 2025 at 7:30 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > From: Alexander Graf <graf@amazon.com>
> >
> > Linux has recently gained support for "reserve_mem": A mechanism to
> > allocate a region of memory early enough in boot that we can cross our
> > fingers and hope it stays at the same location during most boots, so we
> > can store for example ftrace buffers into it.
> >
> > Thanks to KASLR, we can never be really sure that "reserve_mem"
> > allocations are static across kexec. Let's teach it KHO awareness so
> > that it serializes its reservations on kexec exit and deserializes them
> > again on boot, preserving the exact same mapping across kexec.
> >
> > This is an example user for KHO in the KHO patch set to ensure we have
> > at least one (not very controversial) user in the tree before extending
> > KHO's use to more subsystems.
> >
> > Signed-off-by: Alexander Graf <graf@amazon.com>
> > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> > mm/memblock.c | 131 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 131 insertions(+)
> >
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 84df96efca62..fdb08b60efc1 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -16,6 +16,9 @@
> > #include <linux/kmemleak.h>
> > #include <linux/seq_file.h>
> > #include <linux/memblock.h>
> > +#include <linux/kexec_handover.h>
> > +#include <linux/kexec.h>
> > +#include <linux/libfdt.h>
> >
> > #include <asm/sections.h>
> > #include <linux/io.h>
> > @@ -2423,6 +2426,70 @@ int reserve_mem_find_by_name(const char *name, phys_addr_t *start, phys_addr_t *
> > }
> > EXPORT_SYMBOL_GPL(reserve_mem_find_by_name);
> >
> > +static bool __init reserve_mem_kho_revive(const char *name, phys_addr_t size,
> > + phys_addr_t align)
> > +{
> > + const void *fdt = kho_get_fdt();
> > + const char *path = "/reserve_mem";
> > + int node, child, err;
> > +
> > + if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER))
> > + return false;
> > +
> > + if (!fdt)
> > + return false;
> > +
> > + node = fdt_path_offset(fdt, "/reserve_mem");
> > + if (node < 0)
> > + return false;
> > +
> > + err = fdt_node_check_compatible(fdt, node, "reserve_mem-v1");
> > + if (err) {
> > + pr_warn("Node '%s' has unknown compatible", path);
> > + return false;
> > + }
> > +
> > + fdt_for_each_subnode(child, fdt, node) {
> > + const struct kho_mem *mem;
> > + const char *child_name;
> > + int len;
> > +
> > + /* Search for old kernel's reserved_mem with the same name */
> > + child_name = fdt_get_name(fdt, child, NULL);
> > + if (strcmp(name, child_name))
> > + continue;
> > +
> > + err = fdt_node_check_compatible(fdt, child, "reserve_mem_map-v1");
>
> It really seems you all are trying to have things both ways. It's not
> Devicetree, just the FDT file format, but then here you use
> "compatible" which *is* Devicetree. At best, it's all just confusing
> for folks. At worst, you're just picking and choosing what you want to
> use.
>
> I'm not saying don't use "compatible" just for the sake of looking
> less like DT, but perhaps your versioning should be done differently.
> You are reading the 'mem' property straight into a struct. Maybe the
> struct should have a version. Or the size of the struct is the version
> much like the userspace ABI is handled for structs.
The idea is to have a high-level compatibility notion at the node level
and up, rather than verifying it for each and every struct like uABI does.
For that, "compatible" seems just a perfect fit.
> > + if (err) {
> > + pr_warn("Node '%s/%s' has unknown compatible", path, name);
> > + continue;
> > + }
> > +
> > + mem = fdt_getprop(fdt, child, "mem", &len);
> > + if (!mem || len != sizeof(*mem))
> > + continue;
> > +
> > + if (mem->addr & (align - 1)) {
>
> It's stated somewhere in this that the FDT data is LE, but here you
> are assuming the FDT is the same endianness as the CPU not that it's
> LE. Arm64 can do BE. PowerPC does both. I'm not sure if kexec from one
> endianness to another is possible. I would guess in theory it is and
> in practice it's broken already (because kexec is always an
> afterthought). Either you need to guarantee that native endianness
> will never be an issue for any arch or you need to make the endianness
> fixed.
I believe Alex mentioned little endian in the sense of native endianness
for practical purposes :)
arm64 does seem to support kexec from one endianness to another in
certain circumstances, but I believe we can limit KHO to working only
when both kernels have the same endianness.
> Rob
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
2025-02-12 15:23 ` Jason Gunthorpe
@ 2025-02-12 16:39 ` Mike Rapoport
2025-02-12 17:43 ` Jason Gunthorpe
0 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-12 16:39 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pasha Tatashin, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
Hi Jason,
On Wed, Feb 12, 2025 at 11:23:36AM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 11, 2025 at 12:37:20PM -0400, Jason Gunthorpe wrote:
>
> > To do that you need to preserve folios as the basic primitive.
>
> I made a small sketch of what I suggest.
>
> I imagine the FDT schema for this would look something like this:
>
> /dts-v1/;
> / {
> compatible = "linux-kho,v1";
> phys-addr-size = 64;
> void-p-size = 64;
> preserved-folio-map = <phys_addr>;
>
> // The per "driver" storage
> instance@1 {..};
> instance@2 {..};
> };
>
> I think this is alot better than what is in this series. It uses much
> less memory when there are alot of allocation, it supports any order
> folios, it is efficient for 1G guestmemfd folios, and it only needs a
> few bytes in the FDT. It could preserve and restore the high order
> folio struct page folding (HVO).
>
> The use cases I'm imagining for drivers would be pushing gigabytes of
> memory into this preservation mechanism. It needs to be scalable!
>
> This also illustrates my point that I don't think FDT is a good
> representation to use exclusively. This in-memory structure is much
> better and faster than trying to represent the same information
> embedded directly into the FDT. I imagine this to be the general
> pattern that drivers will want to use. A few bytes in the FDT pointing
> at a scalable in-memory structure for the bulk of the data.
As I've mentioned off-list earlier, KHO in its current form is the lowest
level of abstraction for state preservation and is by no means intended
to provide complex drivers with all the tools necessary.
Its sole purpose is to allow preserving simple properties and to ensure
that memory ranges KHO clients need to preserve won't be overwritten.
What you propose is a great optimization of the memory preservation
mechanism, and an additional, very useful abstraction layer on top of
"basic KHO"!
But I think it will be easier to start with something *very simple* and
probably suboptimal and then extend it, rather than to try to build a
complex, comprehensive solution from day one.
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
2025-02-12 16:39 ` Mike Rapoport
@ 2025-02-12 17:43 ` Jason Gunthorpe
2025-02-23 18:51 ` Mike Rapoport
0 siblings, 1 reply; 97+ messages in thread
From: Jason Gunthorpe @ 2025-02-12 17:43 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pasha Tatashin, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Wed, Feb 12, 2025 at 06:39:06PM +0200, Mike Rapoport wrote:
> As I've mentioned off-list earlier, KHO in its current form is the lowest
> level of abstraction for state preservation and it is by no means is
> intended to provide complex drivers with all the tools necessary.
My point is that I think it is the wrong level of abstraction and the
wrong FDT schema. It does not and cannot solve the problems we know we
will have, so why invest anything into that schema?
I think the scratch system is great, and an amazing improvement over
past versions. Upgrade the memory preservation to match and it will be
really good.
> What you propose is a great optimization for memory preservation mechanism,
> and additional and very useful abstraction layer on top of "basic KHO"!
I do not see this as a layer on top, I see it as fundamentally
replacing the memory preservation mechanism with something more
scalable.
> But I think it will be easier to start with something *very simple* and
> probably suboptimal and then extend it rather than to try to build complex
> comprehensive solution from day one.
But why? Just do it right from the start? I spent maybe an hour
sketching that; the existing preservation code is also very simple, so
why not just fix it right now?
Jason
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
` (17 preceding siblings ...)
2025-02-09 0:51 ` Cong Wang
@ 2025-02-17 3:19 ` RuiRui Yang
2025-02-19 7:32 ` Mike Rapoport
2025-02-26 20:08 ` Pratyush Yadav
19 siblings, 1 reply; 97+ messages in thread
From: RuiRui Yang @ 2025-02-17 3:19 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, 6 Feb 2025 at 21:34, Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Hi,
>
> This is the next version of Alex's "kexec: Allow preservation of ftrace buffers"
> series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com);
> to make things simpler, instead of ftrace we decided to preserve
> "reserve_mem" regions.
>
> The patches are also available in git:
> https://git.kernel.org/rppt/h/kho/v4
>
>
> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
>
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See "pkernfs: Persisting guest memory
> and kernel/device state safely across kexec" Linux Plumbers
> Conference 2023 presentation for details:
>
> https://lpc.events/event/17/contributions/1485/
>
> To start us on the journey to support all the use cases above, this patch
> implements basic infrastructure to allow hand over of kernel state across
> kexec (Kexec HandOver, aka KHO). As a really simple example target, we use
> memblock's reserve_mem.
> With this patch set applied, memory that was reserved using "reserve_mem"
> command line options remains intact after kexec and it is guaranteed to
> reside at the same physical address.
>
> == Alternatives ==
>
> There are alternative approaches to (parts of) the problems above:
>
> * Memory Pools [1] - preallocated persistent memory region + allocator
> * PRMEM [2] - resizable persistent memory regions with fixed metadata
> pointer on the kernel command line + allocator
> * Pkernfs [3] - preallocated file system for in-kernel data with fixed
> address location on the kernel command line
> * PKRAM [4] - handover of user space pages using a fixed metadata page
> specified via command line
>
> All of the approaches above fundamentally have the same problem: They
> require the administrator to explicitly carve out a physical memory
> location because they have no mechanism outside of the kernel command
> line to pass data (including memory reservations) between kexec'ing
> kernels.
>
> KHO provides that base foundation. We will determine later whether we
> still need any of the approaches above for fast bulk memory handover of,
> for example, IOMMU page tables. But IMHO they would all be users of KHO,
> with KHO providing the foundational primitive to pass metadata and bulk
> memory reservations as well as easy versioning for data.
>
> == Overview ==
>
> We introduce a metadata file that the kernels pass between each other. How
> they pass it is architecture specific. The file's format is a Flattened
> Device Tree (fdt) which has a generator and parser already included in
> Linux. When the root user enables KHO through /sys/kernel/kho/active, the
> kernel invokes callbacks to every driver that supports KHO to serialize
> its state. When the actual kexec happens, the fdt is part of the image
> set that we boot into. In addition, we keep "scratch regions" available
> for kexec: physically contiguous memory regions that are guaranteed not
> to contain any memory that KHO would preserve. The new kernel bootstraps
> itself using the scratch regions and marks all handed-over memory as in use.
> When drivers that support KHO initialize, they introspect the fdt and
> recover their state from it. This includes memory reservations, which the
> driver can either discard or claim.
>
> == Limitations ==
>
> Currently KHO is only implemented for file based kexec. The kernel
> interfaces in the patch set are already in place to support user space
> kexec as well, but it is not yet implemented in kexec-tools.
>
On which architectures exactly does KHO work? Device Tree
should be OK on arm*, x86 and power*, but what about s390?
Thanks
Dae
* Re: [PATCH v4 13/14] memblock: Add KHO support for reserve_mem
2025-02-06 13:27 ` [PATCH v4 13/14] memblock: Add KHO support for reserve_mem Mike Rapoport
2025-02-10 16:03 ` Rob Herring
@ 2025-02-17 4:04 ` Wei Yang
2025-02-19 7:25 ` Mike Rapoport
1 sibling, 1 reply; 97+ messages in thread
From: Wei Yang @ 2025-02-17 4:04 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Feb 06, 2025 at 03:27:53PM +0200, Mike Rapoport wrote:
>From: Alexander Graf <graf@amazon.com>
>
>Linux has recently gained support for "reserve_mem": A mechanism to
>allocate a region of memory early enough in boot that we can cross our
>fingers and hope it stays at the same location during most boots, so we
>can store for example ftrace buffers into it.
>
>Thanks to KASLR, we can never be really sure that "reserve_mem"
>allocations are static across kexec. Let's teach it KHO awareness so
>that it serializes its reservations on kexec exit and deserializes them
>again on boot, preserving the exact same mapping across kexec.
>
>This is an example user for KHO in the KHO patch set to ensure we have
>at least one (not very controversial) user in the tree before extending
>KHO's use to more subsystems.
>
>Signed-off-by: Alexander Graf <graf@amazon.com>
>Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>---
> mm/memblock.c | 131 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 131 insertions(+)
>
>diff --git a/mm/memblock.c b/mm/memblock.c
>index 84df96efca62..fdb08b60efc1 100644
>--- a/mm/memblock.c
>+++ b/mm/memblock.c
>@@ -16,6 +16,9 @@
> #include <linux/kmemleak.h>
> #include <linux/seq_file.h>
> #include <linux/memblock.h>
>+#include <linux/kexec_handover.h>
Looks like this one breaks the memblock test in tools/testing/memblock.
memblock.c:19:10: fatal error: linux/kexec_handover.h: No such file or directory
19 | #include <linux/kexec_handover.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~
>+#include <linux/kexec.h>
>+#include <linux/libfdt.h>
>
--
Wei Yang
Help you, Help me
* Re: [PATCH v4 01/14] mm/mm_init: rename init_reserved_page to init_deferred_page
2025-02-06 13:27 ` [PATCH v4 01/14] mm/mm_init: rename init_reserved_page to init_deferred_page Mike Rapoport
@ 2025-02-18 14:59 ` Wei Yang
2025-02-19 7:13 ` Mike Rapoport
0 siblings, 1 reply; 97+ messages in thread
From: Wei Yang @ 2025-02-18 14:59 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Feb 06, 2025 at 03:27:41PM +0200, Mike Rapoport wrote:
>From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
>When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, init_reserved_page()
>function performs initialization of a struct page that would have been
>deferred normally.
>
>Rename it to init_deferred_page() to better reflect what the function does.
Would it be confused with deferred_init_pages()?
And it still calls __init_reserved_page_zone(), even though we __SetPageReserved()
after it. The current logic does not look clear to me.
--
Wei Yang
Help you, Help me
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-02-06 13:27 ` [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag Mike Rapoport
@ 2025-02-18 15:50 ` Wei Yang
2025-02-19 7:24 ` Mike Rapoport
2025-02-26 1:53 ` Changyuan Lyu
1 sibling, 1 reply; 97+ messages in thread
From: Wei Yang @ 2025-02-18 15:50 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Feb 06, 2025 at 03:27:42PM +0200, Mike Rapoport wrote:
>From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
>to denote areas that were reserved for kernel use either directly with
>memblock_reserve_kern() or via memblock allocations.
>
>Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>---
> include/linux/memblock.h | 16 +++++++++++++++-
> mm/memblock.c | 32 ++++++++++++++++++++++++--------
> 2 files changed, 39 insertions(+), 9 deletions(-)
>
>diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>index e79eb6ac516f..65e274550f5d 100644
>--- a/include/linux/memblock.h
>+++ b/include/linux/memblock.h
>@@ -50,6 +50,7 @@ enum memblock_flags {
> MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
> MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
> MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
>+ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
Above memblock_flags there are comments explaining those flags.
It seems one is missing for MEMBLOCK_RSRV_KERN.
> };
>
> /**
>@@ -116,7 +117,19 @@ int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid,
> int memblock_add(phys_addr_t base, phys_addr_t size);
> int memblock_remove(phys_addr_t base, phys_addr_t size);
> int memblock_phys_free(phys_addr_t base, phys_addr_t size);
>-int memblock_reserve(phys_addr_t base, phys_addr_t size);
>+int __memblock_reserve(phys_addr_t base, phys_addr_t size, int nid,
>+ enum memblock_flags flags);
>+
>+static __always_inline int memblock_reserve(phys_addr_t base, phys_addr_t size)
>+{
>+ return __memblock_reserve(base, size, NUMA_NO_NODE, 0);
^ MEMBLOCK_NONE ?
>+}
>+
>+static __always_inline int memblock_reserve_kern(phys_addr_t base, phys_addr_t size)
>+{
>+ return __memblock_reserve(base, size, NUMA_NO_NODE, MEMBLOCK_RSRV_KERN);
>+}
>+
> #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
> int memblock_physmem_add(phys_addr_t base, phys_addr_t size);
> #endif
>@@ -477,6 +490,7 @@ static inline __init_memblock bool memblock_bottom_up(void)
>
> phys_addr_t memblock_phys_mem_size(void);
> phys_addr_t memblock_reserved_size(void);
>+phys_addr_t memblock_reserved_kern_size(int nid);
> unsigned long memblock_estimated_nr_free_pages(void);
> phys_addr_t memblock_start_of_DRAM(void);
> phys_addr_t memblock_end_of_DRAM(void);
>diff --git a/mm/memblock.c b/mm/memblock.c
>index 95af35fd1389..4c33baf4d97c 100644
>--- a/mm/memblock.c
>+++ b/mm/memblock.c
>@@ -491,7 +491,7 @@ static int __init_memblock memblock_double_array(struct memblock_type *type,
> * needn't do it
> */
> if (!use_slab)
>- BUG_ON(memblock_reserve(addr, new_alloc_size));
>+ BUG_ON(memblock_reserve_kern(addr, new_alloc_size));
>
> /* Update slab flag */
> *in_slab = use_slab;
>@@ -641,7 +641,7 @@ static int __init_memblock memblock_add_range(struct memblock_type *type,
> #ifdef CONFIG_NUMA
> WARN_ON(nid != memblock_get_region_node(rgn));
> #endif
>- WARN_ON(flags != rgn->flags);
>+ WARN_ON(flags != MEMBLOCK_NONE && flags != rgn->flags);
> nr_new++;
> if (insert) {
> if (start_rgn == -1)
>@@ -901,14 +901,15 @@ int __init_memblock memblock_phys_free(phys_addr_t base, phys_addr_t size)
> return memblock_remove_range(&memblock.reserved, base, size);
> }
>
>-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
>+int __init_memblock __memblock_reserve(phys_addr_t base, phys_addr_t size,
>+ int nid, enum memblock_flags flags)
> {
> phys_addr_t end = base + size - 1;
>
>- memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
>- &base, &end, (void *)_RET_IP_);
>+ memblock_dbg("%s: [%pa-%pa] nid=%d flags=%x %pS\n", __func__,
>+ &base, &end, nid, flags, (void *)_RET_IP_);
>
>- return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
>+ return memblock_add_range(&memblock.reserved, base, size, nid, flags);
> }
>
> #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
>@@ -1459,14 +1460,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> again:
> found = memblock_find_in_range_node(size, align, start, end, nid,
> flags);
>- if (found && !memblock_reserve(found, size))
>+ if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN))
Maybe we could use memblock_reserve_kern() directly. If my understanding is
correct, the reserved region's nid is not used.
BTW, one question here: how do we handle concurrent memblock allocation? If two
threads find the same available range and do the reservation, it seems to be a
problem to me. Or did I miss something?
> goto done;
>
> if (numa_valid_node(nid) && !exact_nid) {
> found = memblock_find_in_range_node(size, align, start,
> end, NUMA_NO_NODE,
> flags);
>- if (found && !memblock_reserve(found, size))
>+ if (found && !memblock_reserve_kern(found, size))
> goto done;
> }
>
>@@ -1751,6 +1752,20 @@ phys_addr_t __init_memblock memblock_reserved_size(void)
> return memblock.reserved.total_size;
> }
>
>+phys_addr_t __init_memblock memblock_reserved_kern_size(int nid)
>+{
>+ struct memblock_region *r;
>+ phys_addr_t total = 0;
>+
>+ for_each_reserved_mem_region(r) {
>+ if (nid == memblock_get_region_node(r) || !numa_valid_node(nid))
>+ if (r->flags & MEMBLOCK_RSRV_KERN)
>+ total += r->size;
>+ }
>+
>+ return total;
>+}
>+
> /**
> * memblock_estimated_nr_free_pages - return estimated number of free pages
> * from memblock point of view
>@@ -2397,6 +2412,7 @@ static const char * const flagname[] = {
> [ilog2(MEMBLOCK_NOMAP)] = "NOMAP",
> [ilog2(MEMBLOCK_DRIVER_MANAGED)] = "DRV_MNG",
> [ilog2(MEMBLOCK_RSRV_NOINIT)] = "RSV_NIT",
>+ [ilog2(MEMBLOCK_RSRV_KERN)] = "RSV_KERN",
> };
>
> static int memblock_debug_show(struct seq_file *m, void *private)
>--
>2.47.2
>
--
Wei Yang
Help you, Help me
* Re: [PATCH v4 01/14] mm/mm_init: rename init_reserved_page to init_deferred_page
2025-02-18 14:59 ` Wei Yang
@ 2025-02-19 7:13 ` Mike Rapoport
2025-02-20 8:36 ` Wei Yang
0 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-19 7:13 UTC (permalink / raw)
To: Wei Yang
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
Hi,
On Tue, Feb 18, 2025 at 02:59:04PM +0000, Wei Yang wrote:
> On Thu, Feb 06, 2025 at 03:27:41PM +0200, Mike Rapoport wrote:
> >From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> >When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, init_reserved_page()
> >function performs initialization of a struct page that would have been
> >deferred normally.
> >
> >Rename it to init_deferred_page() to better reflect what the function does.
>
> Would it be confused with deferred_init_pages()?
Why? It initializes a single page, deferred_init_pages() initializes many.
> And it still calls __init_reserved_page_zone(), even though we __SetPageReserved()
> after it. The current logic does not look clear to me.
There's no __init_reserved_page_zone(). Currently init_reserved_page()
detects the zone of the page and calls __init_single_page(), so essentially
it initializes one struct page.
And we __SetPageReserved() in reserve_bootmem_region() after the call to
init_reserved_page() because pages there are indeed reserved.
> --
> Wei Yang
> Help you, Help me
--
Sincerely yours,
Mike.
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-02-18 15:50 ` Wei Yang
@ 2025-02-19 7:24 ` Mike Rapoport
2025-02-23 0:22 ` Wei Yang
2025-02-24 1:31 ` Wei Yang
0 siblings, 2 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-19 7:24 UTC (permalink / raw)
To: Wei Yang
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
Hi,
On Tue, Feb 18, 2025 at 03:50:04PM +0000, Wei Yang wrote:
> On Thu, Feb 06, 2025 at 03:27:42PM +0200, Mike Rapoport wrote:
> >From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> >to denote areas that were reserved for kernel use either directly with
> >memblock_reserve_kern() or via memblock allocations.
> >
> >Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> >---
> > include/linux/memblock.h | 16 +++++++++++++++-
> > mm/memblock.c | 32 ++++++++++++++++++++++++--------
> > 2 files changed, 39 insertions(+), 9 deletions(-)
> >
> >diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> >index e79eb6ac516f..65e274550f5d 100644
> >--- a/include/linux/memblock.h
> >+++ b/include/linux/memblock.h
> >@@ -50,6 +50,7 @@ enum memblock_flags {
> > MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
> > MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
> > MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
> >+ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
>
> Above memblock_flags there are comments explaining those flags.
>
> It seems one is missing for MEMBLOCK_RSRV_KERN.
Right, thanks!
> >
> > #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
> >@@ -1459,14 +1460,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> > again:
> > found = memblock_find_in_range_node(size, align, start, end, nid,
> > flags);
> >- if (found && !memblock_reserve(found, size))
> >+ if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN))
>
> Maybe we could use memblock_reserve_kern() directly. If my understanding is
> correct, the reserved region's nid is not used.
We use the nid of reserved regions in reserve_bootmem_region() (commit
61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")), but KHO needs to
know the distribution of reserved memory among the nodes before
memmap_init_reserved_pages().
> BTW, one question here: how do we handle concurrent memblock allocation? If two
> threads find the same available range and do the reservation, it seems to be a
> problem to me. Or did I miss something?
memblock allocations end before smp_init(), so there is no possible concurrency.
> --
> Wei Yang
> Help you, Help me
--
Sincerely yours,
Mike.
* Re: [PATCH v4 13/14] memblock: Add KHO support for reserve_mem
2025-02-17 4:04 ` Wei Yang
@ 2025-02-19 7:25 ` Mike Rapoport
0 siblings, 0 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-19 7:25 UTC (permalink / raw)
To: Wei Yang
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Mon, Feb 17, 2025 at 04:04:48AM +0000, Wei Yang wrote:
> On Thu, Feb 06, 2025 at 03:27:53PM +0200, Mike Rapoport wrote:
> >From: Alexander Graf <graf@amazon.com>
> >
> >Linux has recently gained support for "reserve_mem": A mechanism to
> >allocate a region of memory early enough in boot that we can cross our
> >fingers and hope it stays at the same location during most boots, so we
> >can store for example ftrace buffers into it.
> >
> >Thanks to KASLR, we can never be really sure that "reserve_mem"
> >allocations are static across kexec. Let's teach it KHO awareness so
> >that it serializes its reservations on kexec exit and deserializes them
> >again on boot, preserving the exact same mapping across kexec.
> >
> >This is an example user for KHO in the KHO patch set to ensure we have
> >at least one (not very controversial) user in the tree before extending
> >KHO's use to more subsystems.
> >
> >Signed-off-by: Alexander Graf <graf@amazon.com>
> >Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> >Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> >---
> > mm/memblock.c | 131 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 131 insertions(+)
> >
> >diff --git a/mm/memblock.c b/mm/memblock.c
> >index 84df96efca62..fdb08b60efc1 100644
> >--- a/mm/memblock.c
> >+++ b/mm/memblock.c
> >@@ -16,6 +16,9 @@
> > #include <linux/kmemleak.h>
> > #include <linux/seq_file.h>
> > #include <linux/memblock.h>
> >+#include <linux/kexec_handover.h>
>
> Looks like this one breaks the memblock test in tools/testing/memblock.
>
> memblock.c:19:10: fatal error: linux/kexec_handover.h: No such file or directory
> 19 | #include <linux/kexec_handover.h>
> | ^~~~~~~~~~~~~~~~~~~~~~~~
Thanks, will fix.
> >+#include <linux/kexec.h>
> >+#include <linux/libfdt.h>
> >
>
> --
> Wei Yang
> Help you, Help me
--
Sincerely yours,
Mike.
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-17 3:19 ` RuiRui Yang
@ 2025-02-19 7:32 ` Mike Rapoport
2025-02-19 12:49 ` Dave Young
0 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-19 7:32 UTC (permalink / raw)
To: RuiRui Yang
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Mon, Feb 17, 2025 at 11:19:45AM +0800, RuiRui Yang wrote:
> On Thu, 6 Feb 2025 at 21:34, Mike Rapoport <rppt@kernel.org> wrote:
> > == Limitations ==
> >
> > Currently KHO is only implemented for file based kexec. The kernel
> > interfaces in the patch set are already in place to support user space
> > kexec as well, but it is not yet implemented in kexec-tools.
> >
>
> On which architectures exactly does KHO work? Device Tree
> should be OK on arm*, x86 and power*, but what about s390?
KHO does not use device tree as the boot protocol, it uses FDT as a data
structure and adds architecture specific bits to the boot structures to
point to that data, very similar to how IMA_KEXEC works.
Currently KHO is implemented on arm64 and x86, but there is no fundamental
reason why it wouldn't work on any architecture that supports kexec.
> Thanks
> Dae
>
--
Sincerely yours,
Mike.
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-19 7:32 ` Mike Rapoport
@ 2025-02-19 12:49 ` Dave Young
2025-02-19 13:54 ` Alexander Graf
0 siblings, 1 reply; 97+ messages in thread
From: Dave Young @ 2025-02-19 12:49 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Wed, 19 Feb 2025 at 15:32, Mike Rapoport <rppt@kernel.org> wrote:
>
> On Mon, Feb 17, 2025 at 11:19:45AM +0800, RuiRui Yang wrote:
> > On Thu, 6 Feb 2025 at 21:34, Mike Rapoport <rppt@kernel.org> wrote:
> > > == Limitations ==
> > >
> > > Currently KHO is only implemented for file based kexec. The kernel
> > > interfaces in the patch set are already in place to support user space
> > > kexec as well, but it is not yet implemented in kexec-tools.
> > >
> >
> > On which architectures exactly does KHO work? Device Tree
> > should be OK on arm*, x86 and power*, but what about s390?
>
> KHO does not use device tree as the boot protocol, it uses FDT as a data
> structure and adds architecture specific bits to the boot structures to
> point to that data, very similar to how IMA_KEXEC works.
>
> Currently KHO is implemented on arm64 and x86, but there is no fundamental
> reason why it wouldn't work on any architecture that supports kexec.
Well, the problem is whether there is a way to add a dtb in the early
boot path; for x86 it is added via setup_data. If there is no such
way, I'm not sure it is doable, especially for passing some info for
early boot use. Then KHO will only cover limited use cases.
>
> > Thanks
> > Dae
> >
>
> --
> Sincerely yours,
> Mike.
>
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-19 12:49 ` Dave Young
@ 2025-02-19 13:54 ` Alexander Graf
2025-02-20 1:49 ` Dave Young
0 siblings, 1 reply; 97+ messages in thread
From: Alexander Graf @ 2025-02-19 13:54 UTC (permalink / raw)
To: Dave Young, Mike Rapoport
Cc: linux-kernel, Andrew Morton, Andy Lutomirski, Anthony Yznaga,
Arnd Bergmann, Ashish Kalra, Benjamin Herrenschmidt,
Borislav Petkov, Catalin Marinas, Dave Hansen, David Woodhouse,
Eric Biederman, Ingo Molnar, James Gowans, Jonathan Corbet,
Krzysztof Kozlowski, Mark Rutland, Paolo Bonzini, Pasha Tatashin,
H. Peter Anvin, Peter Zijlstra, Pratyush Yadav, Rob Herring,
Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On 19.02.25 13:49, Dave Young wrote:
> On Wed, 19 Feb 2025 at 15:32, Mike Rapoport <rppt@kernel.org> wrote:
>> On Mon, Feb 17, 2025 at 11:19:45AM +0800, RuiRui Yang wrote:
>>> On Thu, 6 Feb 2025 at 21:34, Mike Rapoport <rppt@kernel.org> wrote:
>>>> == Limitations ==
>>>>
>>>> Currently KHO is only implemented for file based kexec. The kernel
>>>> interfaces in the patch set are already in place to support user space
>>>> kexec as well, but it is not yet implemented in kexec-tools.
>>>>
>>> On which architectures exactly does KHO work? Device Tree
>>> should be OK on arm*, x86 and power*, but what about s390?
>> KHO does not use device tree as the boot protocol, it uses FDT as a data
>> structure and adds architecture specific bits to the boot structures to
>> point to that data, very similar to how IMA_KEXEC works.
>>
>> Currently KHO is implemented on arm64 and x86, but there is no fundamental
>> reason why it wouldn't work on any architecture that supports kexec.
> Well, the problem is whether there is a way to add a dtb in the early
> boot path; for x86 it is added via setup_data. If there is no such
> way, I'm not sure it is doable, especially for passing some info for
> early boot use. Then KHO will only cover limited use cases.
Every architecture has a platform specific way of passing data into the
kernel so it can find its command line and initrd. S390x for example has
struct parmarea. To enable s390x, you would remove some of its padding
and replace it with a KHO base addr + size, so that the new kernel can
find the KHO state tree.
Alex
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-19 13:54 ` Alexander Graf
@ 2025-02-20 1:49 ` Dave Young
2025-02-20 16:43 ` Alexander Gordeev
0 siblings, 1 reply; 97+ messages in thread
From: Dave Young @ 2025-02-20 1:49 UTC (permalink / raw)
To: Alexander Graf
Cc: Mike Rapoport, linux-kernel, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Philipp Rudo,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
Christian Borntraeger, Sven Schnelle, linux-s390
On Wed, 19 Feb 2025 at 21:55, Alexander Graf <graf@amazon.com> wrote:
>
>
> On 19.02.25 13:49, Dave Young wrote:
> > On Wed, 19 Feb 2025 at 15:32, Mike Rapoport <rppt@kernel.org> wrote:
> >> On Mon, Feb 17, 2025 at 11:19:45AM +0800, RuiRui Yang wrote:
> >>> On Thu, 6 Feb 2025 at 21:34, Mike Rapoport <rppt@kernel.org> wrote:
> >>>> == Limitations ==
> >>>>
> >>>> Currently KHO is only implemented for file based kexec. The kernel
> >>>> interfaces in the patch set are already in place to support user space
> > >>>> kexec as well, but it is not yet implemented in kexec-tools.
> >>>>
> > >>> On which architectures exactly does KHO work? Device Tree
> > >>> should be OK on arm*, x86 and power*, but what about s390?
> >> KHO does not use device tree as the boot protocol, it uses FDT as a data
> >> structure and adds architecture specific bits to the boot structures to
> >> point to that data, very similar to how IMA_KEXEC works.
> >>
> >> Currently KHO is implemented on arm64 and x86, but there is no fundamental
> >> reason why it wouldn't work on any architecture that supports kexec.
> > Well, the problem is whether there is a way to add a dtb in the early
> > boot path; for x86 it is added via setup_data. If there is no such
> > way, I'm not sure it is doable, especially for passing some info for
> > early boot use. Then KHO will only cover limited use cases.
>
>
> Every architecture has a platform specific way of passing data into the
> kernel so it can find its command line and initrd. S390x for example has
> struct parmarea. To enable s390x, you would remove some of its padding
> and replace it with a KHO base addr + size, so that the new kernel can
> find the KHO state tree.
OK, thanks for the info. I cc'ed the s390 people; maybe they can provide input.
Other than the arch concern, I'm not so excited about KHO,
because kexec reboot has a fundamental problem that prevents us
(the Red Hat kexec/kdump team) from fully supporting it in the RHEL
distribution: stability. Drivers usually either do not implement the
device shutdown method or do not test it well. From time to time we
see weird bugs, such as malfunctioning devices or memory corruption
caused by ongoing DMA, etc. Also, there is no way for the time being
to make some graphics/drm drivers work reliably after a kexec reboot;
they might happen to work by luck, but that is not stable.
So I personally think that addressing the above concern is more
important than introducing more features that rely on kexec reboot.
>
>
> Alex
>
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 01/14] mm/mm_init: rename init_reserved_page to init_deferred_page
2025-02-19 7:13 ` Mike Rapoport
@ 2025-02-20 8:36 ` Wei Yang
2025-02-20 14:54 ` Mike Rapoport
2025-02-25 7:40 ` Mike Rapoport
0 siblings, 2 replies; 97+ messages in thread
From: Wei Yang @ 2025-02-20 8:36 UTC (permalink / raw)
To: Mike Rapoport
Cc: Wei Yang, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Wed, Feb 19, 2025 at 09:13:22AM +0200, Mike Rapoport wrote:
>Hi,
>
>On Tue, Feb 18, 2025 at 02:59:04PM +0000, Wei Yang wrote:
>> On Thu, Feb 06, 2025 at 03:27:41PM +0200, Mike Rapoport wrote:
>> >From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>> >
>> >When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, init_reserved_page()
>> >function performs initialization of a struct page that would have been
>> >deferred normally.
>> >
>> >Rename it to init_deferred_page() to better reflect what the function does.
>>
>> Would it be confused with deferred_init_pages()?
>
>Why? It initializes a single page, deferred_init_pages() initializes many.
>
See below.
>> And it still calls __init_reserved_page_zone(), even though we
>> __SetPageReserved() after it. The current logic does not look clear.
>
>There's no __init_reserved_page_zone(). Currently init_reserved_page()
>detects the zone of the page and calls __init_single_page(), so essentially
>it initializes one struct page.
>
>And we __SetPageReserved() in reserve_bootmem_region() after call to
>init_reserved_page() because pages there are indeed reserved.
>
Hmm... I am not sure we are looking at the same code. I took a look at the
current mm-unstable; this patch set is not included there. So I am looking
at the previous version, whose last commit is:
8bf30f9d23eb 2025-02-06 Documentation: KHO: add memblock bindings
Here is what I see for init_deferred_page()'s definition:
init_deferred_page()
__init_deferred_page()
__init_reserved_page_zone() <--- I do see this function, is it removed?
__init_single_page()
What I want to say is that __init_deferred_page() calls
__init_reserved_page_zone(). This seems to imply a deferred page is always
a reserved page, but we know it is not: deferred_init_pages() initializes
pages that are not reserved ones. Or do we want to keep this context in
__init_deferred_page()?
>--
>Sincerely yours,
>Mike.
--
Wei Yang
Help you, Help me
* Re: [PATCH v4 01/14] mm/mm_init: rename init_reserved_page to init_deferred_page
2025-02-20 8:36 ` Wei Yang
@ 2025-02-20 14:54 ` Mike Rapoport
2025-02-25 7:40 ` Mike Rapoport
1 sibling, 0 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-20 14:54 UTC (permalink / raw)
To: Wei Yang
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Feb 20, 2025 at 08:36:01AM +0000, Wei Yang wrote:
> On Wed, Feb 19, 2025 at 09:13:22AM +0200, Mike Rapoport wrote:
> >Hi,
> >
> >On Tue, Feb 18, 2025 at 02:59:04PM +0000, Wei Yang wrote:
> >> On Thu, Feb 06, 2025 at 03:27:41PM +0200, Mike Rapoport wrote:
> >> >From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >> >
> >> >When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, init_reserved_page()
> >> >function performs initialization of a struct page that would have been
> >> >deferred normally.
> >> >
> >> >Rename it to init_deferred_page() to better reflect what the function does.
> >>
> >> Would it be confused with deferred_init_pages()?
> >
> >Why? It initializes a single page, deferred_init_pages() initializes many.
> >
>
> See below.
>
> >> And it still calls __init_reserved_page_zone(), even we __SetPageReserved()
> >> after it. Current logic looks not clear.
> >
> >There's no __init_reserved_page_zone(). Currently init_reserved_page()
> >detects the zone of the page and calls __init_single_page(), so essentially
> >it initializes one struct page.
> >
> >And we __SetPageReserved() in reserve_bootmem_region() after call to
> >init_reseved_page() because pages there are indeed reserved.
> >
>
> Hmm... I am not sure we are looking at the same code. I take a look at current
> mm-unstable, this patch set is not included.
I was looking at Linus' tree, it was not there yet :)
> So I am looking at previous version with this last commit:
>
> 8bf30f9d23eb 2025-02-06 Documentation: KHO: add memblock bindings
>
> Here is what I see for init_deferred_page()'s definition:
>
> init_deferred_page()
> __init_deferred_page()
> __init_reserved_page_zone() <--- I do see this function, it is removed?
> __init_single_page()
>
> What I want to say is __init_deferred_page() calls
> __init_reserved_page_zone(). This sounds imply a deferred page is always
> reserved page. But we know it is not. deferred_init_pages() initialize the
> pages are not reserved one. Or we want to have this context in
> __init_deferred_page()?
If the commit that introduced __init_reserved_page_zone goes in before KHO,
I'll just rename both functions; there is nothing about reserved pages
there.
> >--
> >Sincerely yours,
> >Mike.
>
> --
> Wei Yang
> Help you, Help me
--
Sincerely yours,
Mike.
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-20 1:49 ` Dave Young
@ 2025-02-20 16:43 ` Alexander Gordeev
2025-02-23 17:54 ` Mike Rapoport
0 siblings, 1 reply; 97+ messages in thread
From: Alexander Gordeev @ 2025-02-20 16:43 UTC (permalink / raw)
To: Dave Young
Cc: Alexander Graf, Mike Rapoport, linux-kernel, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Philipp Rudo,
Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
Sven Schnelle, linux-s390
On Thu, Feb 20, 2025 at 09:49:52AM +0800, Dave Young wrote:
> On Wed, 19 Feb 2025 at 21:55, Alexander Graf <graf@amazon.com> wrote:
> > >>> What architecture exactly does this KHO work fine? Device Tree
> > >>> should be ok on arm*, x86 and power*, but how about s390?
> > >> KHO does not use device tree as the boot protocol, it uses FDT as a data
> > >> structure and adds architecture specific bits to the boot structures to
> > >> point to that data, very similar to how IMA_KEXEC works.
> > >>
> > >> Currently KHO is implemented on arm64 and x86, but there is no fundamental
> > >> reason why it wouldn't work on any architecture that supports kexec.
> > > Well, the problem is whether there is a way to add dtb in the early
> > > boot path, for X86 it is added via setup_data, if there is no such
> > > way I'm not sure if it is doable especially for passing some info for
> > > early boot use. Then the KHO will be only for limited use cases.
> >
> >
> > Every architecture has a platform specific way of passing data into the
> > kernel so it can find its command line and initrd. S390x for example has
> > struct parmarea. To enable s390x, you would remove some of its padding
> > and replace it with a KHO base addr + size, so that the new kernel can
> > find the KHO state tree.
>
> Ok, thanks for the info, I cced s390 people maybe they can provide inputs.
If I understand correctly, the parmarea would be used for passing the
FDT address - which appears to be fine. However, s390 does not implement
early_memremap()/early_memunmap(), which KHO needs.
Thanks, Dave!
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-02-19 7:24 ` Mike Rapoport
@ 2025-02-23 0:22 ` Wei Yang
2025-03-10 9:51 ` Wei Yang
2025-02-24 1:31 ` Wei Yang
1 sibling, 1 reply; 97+ messages in thread
From: Wei Yang @ 2025-02-23 0:22 UTC (permalink / raw)
To: Mike Rapoport
Cc: Wei Yang, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Wed, Feb 19, 2025 at 09:24:31AM +0200, Mike Rapoport wrote:
>Hi,
>
>On Tue, Feb 18, 2025 at 03:50:04PM +0000, Wei Yang wrote:
>> On Thu, Feb 06, 2025 at 03:27:42PM +0200, Mike Rapoport wrote:
>> >From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>> >
>> >to denote areas that were reserved for kernel use either directly with
>> >memblock_reserve_kern() or via memblock allocations.
>> >
>> >Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>> >---
>> > include/linux/memblock.h | 16 +++++++++++++++-
>> > mm/memblock.c | 32 ++++++++++++++++++++++++--------
>> > 2 files changed, 39 insertions(+), 9 deletions(-)
>> >
>> >diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>> >index e79eb6ac516f..65e274550f5d 100644
>> >--- a/include/linux/memblock.h
>> >+++ b/include/linux/memblock.h
>> >@@ -50,6 +50,7 @@ enum memblock_flags {
>> > MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
>> > MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
>> > MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
>> >+ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
>>
>> Above memblock_flags, there are comments on explaining those flags.
>>
>> Seems we miss it for MEMBLOCK_RSRV_KERN.
>
>Right, thanks!
>
>> >
>> > #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
>> >@@ -1459,14 +1460,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>> > again:
>> > found = memblock_find_in_range_node(size, align, start, end, nid,
>> > flags);
>> >- if (found && !memblock_reserve(found, size))
>> >+ if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN))
>>
>> Maybe we could use memblock_reserve_kern() directly. If my understanding is
>> correct, the reserved region's nid is not used.
>
>We use nid of reserved regions in reserve_bootmem_region() (commit
>61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")) but KHO needs to
>know the distribution of reserved memory among the nodes before
>memmap_init_reserved_pages().
>
I took another look at this commit. There may be a corner case which would
leave a reserved region with no nid set.
memmap_init_reserved_pages()
for_each_mem_region() {
...
memblock_set_node(start, end, &memblock.reserved, nid);
}
We leverage the iteration here to set the nid of all regions in memblock.reserved.
But memblock_set_node() may call memblock_double_array() to expand the array,
which may grab a range before the current start. So we would fail to set the
correct nid on that newly reserved region.
I have managed to create such a case in the memblock test suite. It happens
when there are 126 memblock.reserved regions and the last region spans the
last two nodes.
One way to fix this is to compare type->max in memblock_set_node(), then check
the return value in memmap_init_reserved_pages(): if we find the size
changed, repeat the iteration.
But this is a very marginal case, so I am not sure it is worth fixing.
--
Wei Yang
Help you, Help me
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-20 16:43 ` Alexander Gordeev
@ 2025-02-23 17:54 ` Mike Rapoport
0 siblings, 0 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-23 17:54 UTC (permalink / raw)
To: Alexander Gordeev
Cc: Dave Young, Alexander Graf, linux-kernel, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86, Philipp Rudo,
Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
Sven Schnelle, linux-s390
On Thu, Feb 20, 2025 at 05:43:48PM +0100, Alexander Gordeev wrote:
> On Thu, Feb 20, 2025 at 09:49:52AM +0800, Dave Young wrote:
> > On Wed, 19 Feb 2025 at 21:55, Alexander Graf <graf@amazon.com> wrote:
> > > >>> What architecture exactly does this KHO work fine? Device Tree
> > > >>> should be ok on arm*, x86 and power*, but how about s390?
> > > >> KHO does not use device tree as the boot protocol, it uses FDT as a data
> > > >> structure and adds architecture specific bits to the boot structures to
> > > >> point to that data, very similar to how IMA_KEXEC works.
> > > >>
> > > >> Currently KHO is implemented on arm64 and x86, but there is no fundamental
> > > >> reason why it wouldn't work on any architecture that supports kexec.
> > > > Well, the problem is whether there is a way to add dtb in the early
> > > > boot path, for X86 it is added via setup_data, if there is no such
> > > > way I'm not sure if it is doable especially for passing some info for
> > > > early boot use. Then the KHO will be only for limited use cases.
> > >
> > >
> > > Every architecture has a platform specific way of passing data into the
> > > kernel so it can find its command line and initrd. S390x for example has
> > > struct parmarea. To enable s390x, you would remove some of its padding
> > > and replace it with a KHO base addr + size, so that the new kernel can
> > > find the KHO state tree.
> >
> > Ok, thanks for the info, I cced s390 people maybe they can provide inputs.
>
> If I understand correctly, the parmarea would be used for passing the
> FDT address - which appears to be fine. However, s390 does not implement
> early_memremap()/early_memunmap(), which KHO needs.
KHO uses early_memremap()/early_memunmap() because it parses FDT before
phys_to_virt() is available on arm64 and x86. AFAIU on s390 phys_to_virt()
can be used at setup_arch() time, so it shouldn't be a problem to add
appropriate wrappers.
--
Sincerely yours,
Mike.
* Re: [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
2025-02-12 17:43 ` Jason Gunthorpe
@ 2025-02-23 18:51 ` Mike Rapoport
2025-02-24 14:28 ` Jason Gunthorpe
0 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-23 18:51 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pasha Tatashin, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Wed, Feb 12, 2025 at 01:43:03PM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 12, 2025 at 06:39:06PM +0200, Mike Rapoport wrote:
>
> > As I've mentioned off-list earlier, KHO in its current form is the lowest
> > level of abstraction for state preservation and it is by no means is
> > intended to provide complex drivers with all the tools necessary.
>
> My point, is I think it is the wrong level of abstraction and the
> wrong FDT schema. It does not and cannot solve the problems we know we
> will have, so why invest anything into that schema?
Preserving a lot of random pages spread all over the place will be a
problem no matter what. With kho_preserve_folio() the users will still need
to save the physical address of that folio somewhere, be it the FDT or some
binary structure that the FDT will point to. So instead of "mem" properties
we'll either have an "addresses" property or a pointer to yet another page
that should be preserved and, by the way, "mem" may come in handy in this case :)
I don't see how the "mem" property contradicts future extensions and for
simple use cases it is already enough. The simple reserve_mem use case in
this patchset indeed does not represent the complexity of a driver, but
it's still useful, at least for the ftrace folks. And reserve_mem is just
fine with "mem" property.
> I think the scratch system is great, and an amazing improvement over
> past version. Upgrade the memory preservation to match and it will be
> really good.
>
> > What you propose is a great optimization for memory preservation mechanism,
> > and additional and very useful abstraction layer on top of "basic KHO"!
>
> I do not see this as a layer on top, I see it as fundamentally
> replacing the memory preservation mechanism with something more
> scalable.
There are two parts to the memory preservation: making sure the preserved
pages don't make it to the free lists and then restoring struct
page/folio/memdesc so that the pages will look the same way as when they
were allocated.
For the first part we must memblock_reserve(addr, size) for every preserved
range before memblock releases memory to the buddy.
I did an experiment: I preserved 1GiB of random order-0 pages and measured
the time required to reserve everything in memblock.
The kho_deserialize() you suggested slightly outperformed the
kho_init_reserved_pages() that parsed a single "mem" property containing
an array of <addr, size> pairs. For a more random distribution of orders and
a deeper FDT the difference would of course be higher, but still both
options performed poorly relative to a maple tree serialized similarly to
your tracker xarray.
For the restoration of the struct folio for multiorder folios the tracker
xarray is a really great fit, but again, it does not contradict having
"mem" properties. And the restoration of struct folio does not have to
happen very early, so we'd probably want to run it in parallel, somewhat
like deferred initialization of struct page.
> > But I think it will be easier to start with something *very simple* and
> > probably suboptimal and then extend it rather than to try to build complex
> > comprehensive solution from day one.
>
> But why? Just do it right from the start? I spent like a hour
> sketching that, the existing preservation code is also very simple,
> why not just fix it right now?
As I see it, we can have both: the "mem" property for simple use cases, or
as a partial solution for complex use cases, and the tracker you proposed
for preserving the order of the folios.
And as another optimization we may want a maple tree for coalescing as much
as possible to reduce the number of memblock_reserve() calls.
> Jason
--
Sincerely yours,
Mike.
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-02-19 7:24 ` Mike Rapoport
2025-02-23 0:22 ` Wei Yang
@ 2025-02-24 1:31 ` Wei Yang
2025-02-25 7:46 ` Mike Rapoport
1 sibling, 1 reply; 97+ messages in thread
From: Wei Yang @ 2025-02-24 1:31 UTC (permalink / raw)
To: Mike Rapoport
Cc: Wei Yang, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Wed, Feb 19, 2025 at 09:24:31AM +0200, Mike Rapoport wrote:
>Hi,
>
>On Tue, Feb 18, 2025 at 03:50:04PM +0000, Wei Yang wrote:
>> On Thu, Feb 06, 2025 at 03:27:42PM +0200, Mike Rapoport wrote:
>> >From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>> >
>> >to denote areas that were reserved for kernel use either directly with
>> >memblock_reserve_kern() or via memblock allocations.
>> >
>> >Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>> >---
>> > include/linux/memblock.h | 16 +++++++++++++++-
>> > mm/memblock.c | 32 ++++++++++++++++++++++++--------
>> > 2 files changed, 39 insertions(+), 9 deletions(-)
>> >
>> >diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>> >index e79eb6ac516f..65e274550f5d 100644
>> >--- a/include/linux/memblock.h
>> >+++ b/include/linux/memblock.h
>> >@@ -50,6 +50,7 @@ enum memblock_flags {
>> > MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
>> > MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
>> > MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
>> >+ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
>>
>> Above memblock_flags, there are comments on explaining those flags.
>>
>> Seems we miss it for MEMBLOCK_RSRV_KERN.
>
>Right, thanks!
>
>> >
>> > #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
>> >@@ -1459,14 +1460,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>> > again:
>> > found = memblock_find_in_range_node(size, align, start, end, nid,
>> > flags);
>> >- if (found && !memblock_reserve(found, size))
>> >+ if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN))
>>
>> Maybe we could use memblock_reserve_kern() directly. If my understanding is
>> correct, the reserved region's nid is not used.
>
>We use nid of reserved regions in reserve_bootmem_region() (commit
>61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")) but KHO needs to
>know the distribution of reserved memory among the nodes before
>memmap_init_reserved_pages().
>
>> BTW, one question here. How we handle concurrent memblock allocation? If two
>> threads find the same available range and do the reservation, it seems to be a
>> problem to me. Or I missed something?
>
>memblock allocations end before smp_init(), there is no possible concurrency.
>
Thanks, I still have one question here.
Below is a simplified call flow.
mm_core_init()
mem_init()
memblock_free_all()
free_low_memory_core_early()
memmap_init_reserved_pages()
memblock_set_node(..., memblock.reserved, ) --- (1)
__free_memory_core()
kmem_cache_init()
slab_state = UP; --- (2)
And memblock_alloc_range_nid() is not supposed to be called after
slab_is_available(). Even if someone does call it, it will get memory from
the slab instead of a reserved region in memblock.
From the above call flow and background, there are three cases when
memblock_alloc_range_nid() would be called:
* If it is called before (1), memblock.reserved's nid would be adjusted correctly.
* If it is called after (2), we don't touch memblock.reserved.
* If it happens between (1) and (2), it looks like it would break the
  consistency of the nid information in memblock.reserved, because when we
  use memblock_reserve_kern(), NUMA_NO_NODE would be stored in the region.
So my question is: if the third case happens, would it introduce a bug? And
if it cannot happen, it seems we don't need to specify the nid here?
--
Wei Yang
Help you, Help me
* Re: [PATCH v4 03/14] memblock: Add support for scratch memory
2025-02-06 13:27 ` [PATCH v4 03/14] memblock: Add support for scratch memory Mike Rapoport
@ 2025-02-24 2:50 ` Wei Yang
2025-02-25 7:47 ` Mike Rapoport
0 siblings, 1 reply; 97+ messages in thread
From: Wei Yang @ 2025-02-24 2:50 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Feb 06, 2025 at 03:27:43PM +0200, Mike Rapoport wrote:
>From: Alexander Graf <graf@amazon.com>
>
>With KHO (Kexec HandOver), we need a way to ensure that the new kernel
>does not allocate memory on top of any memory regions that the previous
>kernel was handing over. But to know where those are, we need to include
>them in the memblock.reserved array which may not be big enough to hold
>all ranges that need to be persisted across kexec. To resize the array,
>we need to allocate memory. That brings us into a catch 22 situation.
>
>The solution to that is to limit memblock allocations to the scratch regions:
>safe regions to operate in the case when there is memory that should remain
>intact across kexec.
>
>KHO provides several "scratch regions" as part of its metadata. These
>scratch regions are contiguous memory blocks that are known not to contain any
>memory that should be persisted across kexec. These regions should be large
>enough to accommodate all memblock allocations done by the kexeced kernel.
>
>We introduce a new memblock_set_scratch_only() function that allows KHO to
memblock_set_kho_scratch_only?
>indicate that any memblock allocation must happen from the scratch regions.
>
>Later, we may want to perform another KHO kexec. For that, we reuse the
>same scratch regions. To ensure that no eventually handed over data gets
>allocated inside a scratch region, we flip the semantics of the scratch
>region with memblock_clear_scratch_only(): After that call, no allocations
memblock_clear_kho_scratch_only?
>may happen from scratch memblock regions. We will lift that restriction
>in the next patch.
>
--
Wei Yang
Help you, Help me
* Re: [PATCH v4 04/14] memblock: introduce memmap_init_kho_scratch()
2025-02-06 13:27 ` [PATCH v4 04/14] memblock: introduce memmap_init_kho_scratch() Mike Rapoport
@ 2025-02-24 3:02 ` Wei Yang
0 siblings, 0 replies; 97+ messages in thread
From: Wei Yang @ 2025-02-24 3:02 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Feb 06, 2025 at 03:27:44PM +0200, Mike Rapoport wrote:
>From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
>With deferred initialization of struct page it will be necessary to
>initialize memory map for KHO scratch regions early.
>
>Add memmap_init_kho_scratch() method that will allow such initialization
memmap_init_kho_scratch_pages ?
>in upcoming patches.
>
--
Wei Yang
Help you, Help me
* Re: [PATCH v4 12/14] x86: Add KHO support
2025-02-06 13:27 ` [PATCH v4 12/14] x86: Add KHO support Mike Rapoport
@ 2025-02-24 7:13 ` Wei Yang
2025-02-24 14:36 ` Mike Rapoport
0 siblings, 1 reply; 97+ messages in thread
From: Wei Yang @ 2025-02-24 7:13 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Feb 06, 2025 at 03:27:52PM +0200, Mike Rapoport wrote:
>From: Alexander Graf <graf@amazon.com>
[...]
>diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>index 82b96ed9890a..0b81cd70b02a 100644
>--- a/arch/x86/kernel/e820.c
>+++ b/arch/x86/kernel/e820.c
>@@ -1329,6 +1329,24 @@ void __init e820__memblock_setup(void)
> memblock_add(entry->addr, entry->size);
> }
>
>+ /*
>+ * At this point with KHO we only allocate from scratch memory.
>+ * At the same time, we configure memblock to only allow
>+ * allocations from memory below ISA_END_ADDRESS which is not
>+ * a natural scratch region, because Linux ignores memory below
>+ * ISA_END_ADDRESS at runtime. Beside very few (if any) early
>+ * allocations, we must allocate real-mode trapoline below
>+ * ISA_END_ADDRESS.
>+ *
>+ * To make sure that we can actually perform allocations during
>+ * this phase, let's mark memory below ISA_END_ADDRESS as scratch
>+ * so we can allocate from there in a scratch-only world.
>+ *
>+ * After real mode trampoline is allocated, we clear scratch
>+ * marking from the memory below ISA_END_ADDRESS
>+ */
>+ memblock_mark_kho_scratch(0, ISA_END_ADDRESS);
>+
At the beginning of e820__memblock_setup() we call memblock_allow_resize(),
which means that while adding memory regions it could double the array. And
the memory used for the doubled array would come from a region that was just
added.
But with KHO, I am afraid that would fail?
> /* Throw away partial pages: */
> memblock_trim_memory(PAGE_SIZE);
>
--
Wei Yang
Help you, Help me
* Re: [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
2025-02-23 18:51 ` Mike Rapoport
@ 2025-02-24 14:28 ` Jason Gunthorpe
0 siblings, 0 replies; 97+ messages in thread
From: Jason Gunthorpe @ 2025-02-24 14:28 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pasha Tatashin, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, H. Peter Anvin, Peter Zijlstra, Pratyush Yadav,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Sun, Feb 23, 2025 at 08:51:27PM +0200, Mike Rapoport wrote:
> On Wed, Feb 12, 2025 at 01:43:03PM -0400, Jason Gunthorpe wrote:
> > On Wed, Feb 12, 2025 at 06:39:06PM +0200, Mike Rapoport wrote:
> >
> > > As I've mentioned off-list earlier, KHO in its current form is the lowest
> > > level of abstraction for state preservation and it is by no means
> > > intended to provide complex drivers with all the tools necessary.
> >
> > My point is, I think it is the wrong level of abstraction and the
> > wrong FDT schema. It does not and cannot solve the problems we know we
> > will have, so why invest anything into that schema?
>
> Preserving a lot of random pages spread all over the place will be a
> problem no matter what. With kho_preserve_folio() the users will still need
> to save the physical address of that folio somewhere,
Yes of course. However the schema of each node now gets a choice for
how it does that, i.e. the IOMMU is probably going to just store the top
pointer of a page table and rely on the internal table pointers to
store the physical addresses.
My point is that the FDT "mem" property should not be *mandatory* in the schema
because it is inherently unscalable and not what we want.
> structure that FDT will point to. So either instead of "mem" properties we'll
> have "addresses" property or a pointer to yet another page that should be
> preserved and, by the way, "mem" may come handy in this case :)
I think the preservation of the memory should be completely
independent of the FDT scheme of the nodes. If ftrace wants a "mem"
then sure, but the core preservation code should not parse it.
Nodes should be free to select whatever serialization scheme they
want.
Memory preservation should be a separate self-contained node with its
own schema version. They should not be mixed together.
There should be a single API toward the drivers; they should not get
"automatic" preservation because they put magic stuff in the FDT.
> I don't see how the "mem" property contradicts future extensions and for
> simple use cases it is already enough.
You'd just have to throw out all this code parsing mem to build the
memblock.
It also makes the weird preallocation of the FDT and its related
sysfs probably unnecessary as it seems largely driven by this
unbounded mem attribute problem.
> I did an experiment and preserved 1GiB of random order-0 pages and measured
> time required to reserve everything in memblock.
> kho_deserialize() you suggested slightly outperformed
> kho_init_reserved_pages() that parsed a single "mem" property containing
> an array of <addr, size> pairs.
It has to be considered end-to-end: there is more cost to build up the
FDT array, and to copy it around as well. 16 GiB of random order-0
pages is 64 MB of FDT space when represented as 16-byte addr/len
pairs. That's a lot of memory to be allocating, zeroing and copying
around three (or four/five?) times.
So if the bitmap parsing is already slightly faster I expect the whole
end-to-end solution will be notably faster.
> For more random distribution of orders and more deep FDT the
difference of course would be higher, but still both options sucked
> relatively to a maple tree serialized similarly to your tracker
> xarray.
I didn't like a maple tree like thing because the worst case memory
requirements become much higher - and it is more expensive to build it
on the serializing side (you have to run maple tree algorithms per-4k,
and then copy it out of the maple tree to a representation). Maybe it
is better, but I'd defer to real data on real systems before deciding.
With the numbers I was working with there are 512k of bitmaps worst
case for 16G of memory. If you imagine encoding ranges in, say, 8
bytes per range (52 bits of phys_addr_t, 12 bits of length) then you
get about 65k of ranges in the same 512k. That is only enough to store
a random distribution of 256MB of 4k pages.
Still, I'd like to see the memory preservation have its own
independent scheme, so if there is a better approach it can be
upgraded as a self-contained project. It should have no effect on the
schema of the other nodes, or API toward the drivers.
> > But why? Just do it right from the start? I spent like an hour
> > sketching that, the existing preservation code is also very simple,
> > why not just fix it right now?
>
> As I see it, we can have both. "mem" property for simple use cases, or as a
> partial solution for complex use cases and tracker you proposed for
> preserving the order of the folios.
I don't think that is a good idea; it is unnecessarily complicated in
two ways. Memory preservation should be integral to the system and
be done in one way that works well for all cases. We definitely don't
want two APIs toward drivers for this.
If we have the bitmap then all drivers should be updated to use
it. The core code parsing of the mem schema should be removed.
> And as another optimization we may want a maple tree for coalescing as much
> as possible to reduce amount of memblock_reserve() calls.
Is the bitmap scanning really such a high cost? It can coalesce the
set-bit ranges with ffs/ffz if you want to run a
memblock_reserve() sort of thing.
However, I was not imagining using something as inefficient as
memblock_reserve() in the long run. It doesn't make sense to take a
bitmap and then convert it into ranges, parse the ranges to build up
the free list, then throw away the ranges.
Instead the bitmaps should be consulted as the free list is being
built up immediately after allocating the struct pages. No ranges
ever.
I didn't try to show this because it is definitely complicated, but
the serialize side has everything indexed in xarrays so it can generate
a linear sorted list of 'de-serializing' instructions that are slices
of bitmaps of different orders. The code that builds the free list
would simply walk that linear list of instructions and not add memory
with set bits to the free list. A simple O(1) de-serializing approach,
with some cost on the serializing side.
I think going through memblock_reserve() is a good starting point, but
there is certainly a lot of room for improving away from using ranges.
Jason
* Re: [PATCH v4 12/14] x86: Add KHO support
2025-02-24 7:13 ` Wei Yang
@ 2025-02-24 14:36 ` Mike Rapoport
2025-02-25 0:00 ` Wei Yang
0 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-24 14:36 UTC (permalink / raw)
To: Wei Yang
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Mon, Feb 24, 2025 at 07:13:55AM +0000, Wei Yang wrote:
> On Thu, Feb 06, 2025 at 03:27:52PM +0200, Mike Rapoport wrote:
> >From: Alexander Graf <graf@amazon.com>
> [...]
> >diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> >index 82b96ed9890a..0b81cd70b02a 100644
> >--- a/arch/x86/kernel/e820.c
> >+++ b/arch/x86/kernel/e820.c
> >@@ -1329,6 +1329,24 @@ void __init e820__memblock_setup(void)
> > memblock_add(entry->addr, entry->size);
> > }
> >
> >+ /*
> >+ * At this point with KHO we only allocate from scratch memory.
> >+ * At the same time, we configure memblock to only allow
> >+ * allocations from memory below ISA_END_ADDRESS which is not
> >+ * a natural scratch region, because Linux ignores memory below
> >+ * ISA_END_ADDRESS at runtime. Besides very few (if any) early
> >+ * allocations, we must allocate the real-mode trampoline below
> >+ * ISA_END_ADDRESS.
> >+ *
> >+ * To make sure that we can actually perform allocations during
> >+ * this phase, let's mark memory below ISA_END_ADDRESS as scratch
> >+ * so we can allocate from there in a scratch-only world.
> >+ *
> >+ * After real mode trampoline is allocated, we clear scratch
> >+ * marking from the memory below ISA_END_ADDRESS
> >+ */
> >+ memblock_mark_kho_scratch(0, ISA_END_ADDRESS);
> >+
>
> At the beginning of e820__memblock_setup() we call memblock_allow_resize(),
> which means that while memory regions are being added, memblock may double
> its region array, and the memory used for that comes from a region that was
> just added.
There are large KHO scratch areas that will be used for most allocations.
Marking the memory below ISA_END_ADDRESS as KHO scratch is required to
satisfy allocations that explicitly limit the allocation to ISA_END_ADDRESS,
e.g. the real-mode trampoline.
> But with KHO, I am afraid that allocation would fail?
>
> > /* Throw away partial pages: */
> > memblock_trim_memory(PAGE_SIZE);
> >
>
> --
> Wei Yang
> Help you, Help me
--
Sincerely yours,
Mike.
* Re: [PATCH v4 12/14] x86: Add KHO support
2025-02-24 14:36 ` Mike Rapoport
@ 2025-02-25 0:00 ` Wei Yang
0 siblings, 0 replies; 97+ messages in thread
From: Wei Yang @ 2025-02-25 0:00 UTC (permalink / raw)
To: Mike Rapoport
Cc: Wei Yang, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Mon, Feb 24, 2025 at 04:36:38PM +0200, Mike Rapoport wrote:
>On Mon, Feb 24, 2025 at 07:13:55AM +0000, Wei Yang wrote:
>> On Thu, Feb 06, 2025 at 03:27:52PM +0200, Mike Rapoport wrote:
>> >From: Alexander Graf <graf@amazon.com>
>> [...]
>> >diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>> >index 82b96ed9890a..0b81cd70b02a 100644
>> >--- a/arch/x86/kernel/e820.c
>> >+++ b/arch/x86/kernel/e820.c
>> >@@ -1329,6 +1329,24 @@ void __init e820__memblock_setup(void)
>> > memblock_add(entry->addr, entry->size);
>> > }
>> >
>> >+ /*
>> >+ * At this point with KHO we only allocate from scratch memory.
>> >+ * At the same time, we configure memblock to only allow
>> >+ * allocations from memory below ISA_END_ADDRESS which is not
>> >+ * a natural scratch region, because Linux ignores memory below
>> >+ * ISA_END_ADDRESS at runtime. Besides very few (if any) early
>> >+ * allocations, we must allocate the real-mode trampoline below
>> >+ * ISA_END_ADDRESS.
>> >+ *
>> >+ * To make sure that we can actually perform allocations during
>> >+ * this phase, let's mark memory below ISA_END_ADDRESS as scratch
>> >+ * so we can allocate from there in a scratch-only world.
>> >+ *
>> >+ * After real mode trampoline is allocated, we clear scratch
>> >+ * marking from the memory below ISA_END_ADDRESS
>> >+ */
>> >+ memblock_mark_kho_scratch(0, ISA_END_ADDRESS);
>> >+
>>
>> At the beginning of e820__memblock_setup() we call memblock_allow_resize(),
>> which means during adding memory region it could double the array. And the
>> memory used here is from some region just added.
>
>There are large KHO scratch areas that will be used for most allocations.
>Marking the memory below ISA_END_ADDRESS as KHO scratch is required to
>satisfy allocations that explicitly limit the allocation to ISA_END_ADDRESS,
>e.g. the real-mode trampoline.
>
Thanks, I see your point. We would add the memory regions during kho_populate()
and mark them scratch.
>> But with KHO, I am afraid it would fail?
>>
>> > /* Throw away partial pages: */
>> > memblock_trim_memory(PAGE_SIZE);
>> >
>>
>> --
>> Wei Yang
>> Help you, Help me
>
>--
>Sincerely yours,
>Mike.
--
Wei Yang
Help you, Help me
* Re: [PATCH v4 01/14] mm/mm_init: rename init_reserved_page to init_deferred_page
2025-02-20 8:36 ` Wei Yang
2025-02-20 14:54 ` Mike Rapoport
@ 2025-02-25 7:40 ` Mike Rapoport
1 sibling, 0 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-25 7:40 UTC (permalink / raw)
To: Wei Yang
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Thu, Feb 20, 2025 at 08:36:01AM +0000, Wei Yang wrote:
> On Wed, Feb 19, 2025 at 09:13:22AM +0200, Mike Rapoport wrote:
> >Hi,
> >
> >On Tue, Feb 18, 2025 at 02:59:04PM +0000, Wei Yang wrote:
> >> On Thu, Feb 06, 2025 at 03:27:41PM +0200, Mike Rapoport wrote:
> >> >From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >> >
> >> >When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, init_reserved_page()
> >> >function performs initialization of a struct page that would have been
> >> >deferred normally.
> >> >
> >> >Rename it to init_deferred_page() to better reflect what the function does.
> >>
> >> Would it be confused with deferred_init_pages()?
> >
> >Why? It initializes a single page, deferred_init_pages() initializes many.
> >
>
> See below.
>
> >> And it still calls __init_reserved_page_zone(), even we __SetPageReserved()
> >> after it. Current logic looks not clear.
> >
> >There's no __init_reserved_page_zone(). Currently init_reserved_page()
> >detects the zone of the page and calls __init_single_page(), so essentially
> >it initializes one struct page.
> >
> >And we __SetPageReserved() in reserve_bootmem_region() after the call to
> >init_reserved_page() because pages there are indeed reserved.
> >
>
> Hmm... I am not sure we are looking at the same code.
By "currently" I meant the Linus tree.
> I take a look at current
> mm-unstable, this patch set is not included. So I am looking at previous
> version with this last commit:
>
> 8bf30f9d23eb 2025-02-06 Documentation: KHO: add memblock bindings
>
> Here is what I see for init_deferred_page()'s definition:
>
> init_deferred_page()
> __init_deferred_page()
> __init_reserved_page_zone() <--- I do see this function, it is removed?
> __init_single_page()
>
> What I want to say is that __init_deferred_page() calls
> __init_reserved_page_zone(). This seems to imply a deferred page is always
> a reserved page. But we know it is not: deferred_init_pages() initializes
> pages that are not reserved. Or do we want to keep this context in
> __init_deferred_page()?
In mm-unstable the code is slightly reorganized, but in the end it still
initializes a deferred page. It just happens to be called only for reserved
pages, but the initialization itself does not presume that the page is reserved.
So both functions are misnamed here, __init_reserved_page_zone() and
init_reserved_page().
> --
> Wei Yang
> Help you, Help me
--
Sincerely yours,
Mike.
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-02-24 1:31 ` Wei Yang
@ 2025-02-25 7:46 ` Mike Rapoport
2025-02-26 2:09 ` Wei Yang
0 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-25 7:46 UTC (permalink / raw)
To: Wei Yang
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Mon, Feb 24, 2025 at 01:31:31AM +0000, Wei Yang wrote:
> On Wed, Feb 19, 2025 at 09:24:31AM +0200, Mike Rapoport wrote:
> >Hi,
> >
> >On Tue, Feb 18, 2025 at 03:50:04PM +0000, Wei Yang wrote:
> >> On Thu, Feb 06, 2025 at 03:27:42PM +0200, Mike Rapoport wrote:
> >> >From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >> >
> >> >to denote areas that were reserved for kernel use either directly with
> >> >memblock_reserve_kern() or via memblock allocations.
> >> >
> >> >Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> >> >---
> >> > include/linux/memblock.h | 16 +++++++++++++++-
> >> > mm/memblock.c | 32 ++++++++++++++++++++++++--------
> >> > 2 files changed, 39 insertions(+), 9 deletions(-)
> >> >
> >> >diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> >> >index e79eb6ac516f..65e274550f5d 100644
> >> >--- a/include/linux/memblock.h
> >> >+++ b/include/linux/memblock.h
> >> >@@ -50,6 +50,7 @@ enum memblock_flags {
> >> > MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
> >> > MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
> >> > MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
> >> >+ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
> >>
> >> Above memblock_flags, there are comments on explaining those flags.
> >>
> >> Seems we miss it for MEMBLOCK_RSRV_KERN.
> >
> >Right, thanks!
> >
> >> >
> >> > #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
> >> >@@ -1459,14 +1460,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> >> > again:
> >> > found = memblock_find_in_range_node(size, align, start, end, nid,
> >> > flags);
> >> >- if (found && !memblock_reserve(found, size))
> >> >+ if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN))
> >>
> >> Maybe we could use memblock_reserve_kern() directly. If my understanding is
> >> correct, the reserved region's nid is not used.
> >
> >We use nid of reserved regions in reserve_bootmem_region() (commit
> >61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")) but KHO needs to
> >know the distribution of reserved memory among the nodes before
> >memmap_init_reserved_pages().
> >
> >> BTW, one question here. How do we handle concurrent memblock allocation? If two
> >> threads find the same available range and do the reservation, it seems to be a
> >> problem to me. Or I missed something?
> >
> >memblock allocations end before smp_init(), there is no possible concurrency.
> >
>
> Thanks, I still have one question here.
>
> Below is a simplified call flow.
>
> mm_core_init()
> mem_init()
> memblock_free_all()
> free_low_memory_core_early()
> memmap_init_reserved_pages()
> memblock_set_node(..., memblock.reserved, ) --- (1)
> __free_memory_core()
> kmem_cache_init()
> slab_state = UP; --- (2)
>
> And memblock_alloc_range_nid() is not supposed to be called after
> slab_is_available(). Even if someone does, it will get memory from slab
> instead of a reserved region in memblock.
>
> From the above call flow and background, there are three cases when
> memblock_alloc_range_nid() would be called:
>
> * If it is called before (1), memblock.reserved's nid would be adjusted correctly.
> * If it is called after (2), we don't touch memblock.reserved.
> * If it happens between (1) and (2), it looks like it would break the
> consistency of nid information in memblock.reserved, because when we use
> memblock_reserve_kern(), NUMA_NO_NODE would be stored in the region.
>
> So my question is if the third case happens, would it introduce a bug? If it
> won't happen, seems we don't need to specify the nid here?
We don't really care about proper assignment of nodes between (1) and (2)
on the one hand, and on the other hand the third case does not happen: nothing
should call memblock_alloc() after memblock_free_all().
But it's easy to make the window between (1) and (2) disappear by replacing
checks for slab_is_available() in memblock with a variable local to
memblock.
> --
> Wei Yang
> Help you, Help me
--
Sincerely yours,
Mike.
* Re: [PATCH v4 03/14] memblock: Add support for scratch memory
2025-02-24 2:50 ` Wei Yang
@ 2025-02-25 7:47 ` Mike Rapoport
0 siblings, 0 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-02-25 7:47 UTC (permalink / raw)
To: Wei Yang
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Mon, Feb 24, 2025 at 02:50:34AM +0000, Wei Yang wrote:
> On Thu, Feb 06, 2025 at 03:27:43PM +0200, Mike Rapoport wrote:
> >From: Alexander Graf <graf@amazon.com>
> >
> >With KHO (Kexec HandOver), we need a way to ensure that the new kernel
> >does not allocate memory on top of any memory regions that the previous
> >kernel was handing over. But to know where those are, we need to include
> >them in the memblock.reserved array which may not be big enough to hold
> >all ranges that need to be persisted across kexec. To resize the array,
> >we need to allocate memory. That brings us into a catch 22 situation.
> >
> >The solution to that is limit memblock allocations to the scratch regions:
> >safe regions to operate in the case when there is memory that should remain
> >intact across kexec.
> >
> >KHO provides several "scratch regions" as part of its metadata. These
> >scratch regions are contiguous memory blocks that are known not to contain any
> >memory that should be persisted across kexec. These regions should be large
> >enough to accommodate all memblock allocations done by the kexeced kernel.
> >
> >We introduce a new memblock_set_scratch_only() function that allows KHO to
>
> memblock_set_kho_scratch_only?
>
> >indicate that any memblock allocation must happen from the scratch regions.
> >
> >Later, we may want to perform another KHO kexec. For that, we reuse the
> >same scratch regions. To ensure that no eventually handed over data gets
> >allocated inside a scratch region, we flip the semantics of the scratch
> >region with memblock_clear_scratch_only(): After that call, no allocations
>
> memblock_clear_kho_scratch_only?
Right, I missed those in the commit message.
> >may happen from scratch memblock regions. We will lift that restriction
> >in the next patch.
> >
>
> --
> Wei Yang
> Help you, Help me
--
Sincerely yours,
Mike.
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-02-06 13:27 ` [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag Mike Rapoport
2025-02-18 15:50 ` Wei Yang
@ 2025-02-26 1:53 ` Changyuan Lyu
2025-03-13 15:41 ` Mike Rapoport
1 sibling, 1 reply; 97+ messages in thread
From: Changyuan Lyu @ 2025-02-26 1:53 UTC (permalink / raw)
To: Mike Rapoport
Cc: Changyuan Lyu, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
Hi Mike,
On Thu, 6 Feb 2025 15:27:42 +0200, Mike Rapoport <rppt@kernel.org> wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> to denote areas that were reserved for kernel use either directly with
> memblock_reserve_kern() or via memblock allocations.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> include/linux/memblock.h | 16 +++++++++++++++-
> mm/memblock.c | 32 ++++++++++++++++++++++++--------
> 2 files changed, 39 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index e79eb6ac516f..65e274550f5d 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> ......
> @@ -116,7 +117,19 @@ int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid,
> int memblock_add(phys_addr_t base, phys_addr_t size);
> int memblock_remove(phys_addr_t base, phys_addr_t size);
> int memblock_phys_free(phys_addr_t base, phys_addr_t size);
> -int memblock_reserve(phys_addr_t base, phys_addr_t size);
> +int __memblock_reserve(phys_addr_t base, phys_addr_t size, int nid,
> + enum memblock_flags flags);
> +
> +static __always_inline int memblock_reserve(phys_addr_t base, phys_addr_t size)
> +{
> + return __memblock_reserve(base, size, NUMA_NO_NODE, 0);
Without this patch `memblock_reserve` eventually calls `memblock_add_range`
with `MAX_NUMNODES`, but with this patch, `memblock_reserve` calls
`memblock_add_range` with `NUMA_NO_NODE`. Is it intended or an
accidental typo? Thanks!
> ......
>
> -int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
> +int __init_memblock __memblock_reserve(phys_addr_t base, phys_addr_t size,
> + int nid, enum memblock_flags flags)
> {
> phys_addr_t end = base + size - 1;
>
> - memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
> - &base, &end, (void *)_RET_IP_);
> + memblock_dbg("%s: [%pa-%pa] nid=%d flags=%x %pS\n", __func__,
> + &base, &end, nid, flags, (void *)_RET_IP_);
>
> - return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
Originally `memblock_reserve` calls `memblock_add_range` with `MAX_NUMNODES`.
See my comments above.
Best,
Changyuan
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-02-25 7:46 ` Mike Rapoport
@ 2025-02-26 2:09 ` Wei Yang
2025-03-10 7:56 ` Wei Yang
0 siblings, 1 reply; 97+ messages in thread
From: Wei Yang @ 2025-02-26 2:09 UTC (permalink / raw)
To: Mike Rapoport
Cc: Wei Yang, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Tue, Feb 25, 2025 at 09:46:28AM +0200, Mike Rapoport wrote:
>On Mon, Feb 24, 2025 at 01:31:31AM +0000, Wei Yang wrote:
>> On Wed, Feb 19, 2025 at 09:24:31AM +0200, Mike Rapoport wrote:
>> >Hi,
>> >
>> >On Tue, Feb 18, 2025 at 03:50:04PM +0000, Wei Yang wrote:
>> >> On Thu, Feb 06, 2025 at 03:27:42PM +0200, Mike Rapoport wrote:
>> >> >From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>> >> >
>> >> >to denote areas that were reserved for kernel use either directly with
>> >> >memblock_reserve_kern() or via memblock allocations.
>> >> >
>> >> >Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>> >> >---
>> >> > include/linux/memblock.h | 16 +++++++++++++++-
>> >> > mm/memblock.c | 32 ++++++++++++++++++++++++--------
>> >> > 2 files changed, 39 insertions(+), 9 deletions(-)
>> >> >
>> >> >diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>> >> >index e79eb6ac516f..65e274550f5d 100644
>> >> >--- a/include/linux/memblock.h
>> >> >+++ b/include/linux/memblock.h
>> >> >@@ -50,6 +50,7 @@ enum memblock_flags {
>> >> > MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
>> >> > MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
>> >> > MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
>> >> >+ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
>> >>
>> >> Above memblock_flags, there are comments on explaining those flags.
>> >>
>> >> Seems we miss it for MEMBLOCK_RSRV_KERN.
>> >
>> >Right, thanks!
>> >
>> >> >
>> >> > #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
>> >> >@@ -1459,14 +1460,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>> >> > again:
>> >> > found = memblock_find_in_range_node(size, align, start, end, nid,
>> >> > flags);
>> >> >- if (found && !memblock_reserve(found, size))
>> >> >+ if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN))
>> >>
>> >> Maybe we could use memblock_reserve_kern() directly. If my understanding is
>> >> correct, the reserved region's nid is not used.
>> >
>> >We use nid of reserved regions in reserve_bootmem_region() (commit
>> >61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")) but KHO needs to
>> >know the distribution of reserved memory among the nodes before
>> >memmap_init_reserved_pages().
>> >
>> >> BTW, one question here. How we handle concurrent memblock allocation? If two
>> >> threads find the same available range and do the reservation, it seems to be a
>> >> problem to me. Or I missed something?
>> >
>> >memblock allocations end before smp_init(), there is no possible concurrency.
>> >
>>
>> Thanks, I still have one question here.
>>
>> Below is a simplified call flow.
>>
>> mm_core_init()
>> mem_init()
>> memblock_free_all()
>> free_low_memory_core_early()
>> memmap_init_reserved_pages()
>> memblock_set_node(..., memblock.reserved, ) --- (1)
>> __free_memory_core()
>> kmem_cache_init()
>> slab_state = UP; --- (2)
>>
>> And memblock_alloc_range_nid() is not supposed to be called after
>> slab_is_available(). Even if someone does, it will get memory from slab
>> instead of a reserved region in memblock.
>>
>> From the above call flow and background, there are three cases when
>> memblock_alloc_range_nid() would be called:
>>
>> * If it is called before (1), memblock.reserved's nid would be adjusted correctly.
>> * If it is called after (2), we don't touch memblock.reserved.
>> * If it happens between (1) and (2), it looks would break the consistency of
>> nid information in memblock.reserved. Because when we use
>> memblock_reserve_kern(), NUMA_NO_NODE would be stored in region.
>>
>> So my question is if the third case happens, would it introduce a bug? If it
>> won't happen, seems we don't need to specify the nid here?
>
>We don't really care about proper assignment of nodes between (1) and (2)
>from one side and the third case does not happen on the other side. Nothing
>should call memblock_alloc() after memblock_free_all().
>
My point is that if no one calls memblock_alloc() after memblock_free_all(),
which sets the nid in memblock.reserved properly, it seems unnecessary to do
__memblock_reserve() with the exact nid during memblock_alloc(),
as you did with __memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN) in
this patch.
>But it's easy to make the window between (1) and (2) disappear by replacing
>checks for slab_is_available() in memblock with a variable local to
>memblock.
>
>> --
>> Wei Yang
>> Help you, Help me
>
>--
>Sincerely yours,
>Mike.
--
Wei Yang
Help you, Help me
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
` (18 preceding siblings ...)
2025-02-17 3:19 ` RuiRui Yang
@ 2025-02-26 20:08 ` Pratyush Yadav
2025-02-28 20:20 ` Mike Rapoport
19 siblings, 1 reply; 97+ messages in thread
From: Pratyush Yadav @ 2025-02-26 20:08 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
Hi Mike,
On Thu, Feb 06 2025, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Hi,
>
> This is the next version of Alex's "kexec: Allow preservation of ftrace
> buffers" series
> (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com). To make
> things simpler, instead of ftrace we decided to preserve "reserve_mem"
> regions.
[...]
I applied the patches on top of v6.14-rc1 and tried them out on an x86
qemu machine. When I do a plain KHO activate and kexec, I get the below
errors on boot. This causes networking to fail on the VM. The errors are
consistent and happen every kexec-reboot, though fairly late in boot
after systemd tries to bring up network. The same setup has worked fine
with Alex's v3 of KHO patches.
Do you see anything obvious that might cause this? I can try to debug
this tomorrow, but if it rings any loud bells it would be nice to know.
[ 1.665225] ------------[ cut here ]------------
[ 1.665606] e1000 0000:00:03.0: DMA addr 0x0000000107978040+1522 overflow (mask ffffffff, bus limit 0).
[ 1.666364] WARNING: CPU: 6 PID: 2033 at kernel/dma/direct.h:103 dma_direct_map_page+0x271/0x280
[ 1.667074] Modules linked in:
[ 1.667335] CPU: 6 UID: 980 PID: 2033 Comm: systemd-network Not tainted 6.14.0-rc1+ #70
[ 1.668004] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 1.668760] RIP: 0010:dma_direct_map_page+0x271/0x280
[ 1.669166] Code: 04 48 8b 5d 00 48 89 ef e8 7c 5f 7a 00 4d 89 e9 4d 89 e0 48 89 da 48 89 e1 41 56 48 89 c6 48 c7 c7 58 a9 4f 82 e8 3f 09 f4 ff <0f> 0b 58 eb 88 e8 05 b7 b9 00 0f 1f 44 00 00 90 90 90 90 90 90 90
[ 1.670624] RSP: 0018:ffffc90002baf628 EFLAGS: 00010282
[ 1.671035] RAX: 0000000000000000 RBX: ffff888100dd2e00 RCX: 0000000000000027
[ 1.671607] RDX: ffff88842d920a88 RSI: 0000000000000001 RDI: ffff88842d920a80
[ 1.672177] RBP: ffff8881015530c0 R08: 0000000000000000 R09: ffffc90002baf4b0
[ 1.672757] R10: ffffffff82deeec8 R11: 0000000000000003 R12: 00000000000005f2
[ 1.673320] R13: 00000000ffffffff R14: 0000000000000000 R15: 0000000000000040
[ 1.673882] FS: 00007f15c3edb880(0000) GS:ffff88842d900000(0000) knlGS:0000000000000000
[ 1.674516] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.674977] CR2: 00007f15c430cb50 CR3: 0000000102d28000 CR4: 00000000000006f0
[ 1.675554] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1.676117] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1.676698] Call Trace:
[ 1.676914] <TASK>
[ 1.677097] ? show_trace_log_lvl+0x1a7/0x2f0
[ 1.677458] ? show_trace_log_lvl+0x1a7/0x2f0
[ 1.677811] ? dma_map_page_attrs+0x6e/0x1f0
[ 1.678158] ? dma_direct_map_page+0x271/0x280
[ 1.678510] ? __warn.cold+0x93/0xf1
[ 1.678802] ? dma_direct_map_page+0x271/0x280
[ 1.679158] ? report_bug+0xff/0x140
[ 1.679461] ? handle_bug+0x53/0x90
[ 1.679768] ? exc_invalid_op+0x17/0x70
[ 1.680077] ? asm_exc_invalid_op+0x1a/0x20
[ 1.680428] ? dma_direct_map_page+0x271/0x280
[ 1.680789] ? dma_direct_map_page+0x271/0x280
[ 1.681149] dma_map_page_attrs+0x6e/0x1f0
[ 1.681483] e1000_alloc_rx_buffers+0x140/0x340
[ 1.681852] e1000_configure+0xf9/0x110
[ 1.682163] e1000_open+0xc5/0x200
[ 1.682441] __dev_open+0xff/0x1d0
[ 1.682728] __dev_change_flags+0x1f8/0x240
[ 1.683066] ? __nla_put+0x10/0x30
[ 1.683345] dev_change_flags+0x26/0x70
[ 1.683668] do_setlink.isra.0+0x2ca/0xbe0
[ 1.684002] ? cred_has_capability.isra.0+0x6a/0x110
[ 1.684411] ? blk_mq_get_tags+0x33/0x70
[ 1.684731] ? virtqueue_add_split+0xa4/0x6b0
[ 1.685086] ? security_capable+0x70/0xc0
[ 1.685409] rtnl_setlink+0x184/0x220
[ 1.685712] ? cred_has_capability.isra.0+0x6a/0x110
[ 1.686110] ? xa_load+0x7a/0xb0
[ 1.686374] ? __pfx_rtnl_setlink+0x10/0x10
[ 1.686710] rtnetlink_rcv_msg+0x354/0x3f0
[ 1.687039] ? folio_wait_bit_common+0x28b/0x300
[ 1.687407] ? ___pte_offset_map+0x1b/0x140
[ 1.687757] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 1.688123] netlink_rcv_skb+0x53/0x100
[ 1.688451] netlink_unicast+0x245/0x390
[ 1.688772] netlink_sendmsg+0x21b/0x470
[ 1.689085] ? __check_object_size.part.0+0x39/0xd0
[ 1.689481] __sys_sendto+0x1d4/0x1e0
[ 1.689788] __x64_sys_sendto+0x24/0x30
[ 1.690108] do_syscall_64+0x4b/0x110
[ 1.690414] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1.690820] RIP: 0033:0x7f15c40fc897
[ 1.691114] Code: c7 c0 ff ff ff ff eb be 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d d5 37 0d 00 00 41 89 ca 74 10 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 69 c3 55 48 89 e5 53 48 83 ec 38 44 89 4d d0
[ 1.692589] RSP: 002b:00007ffc3c3080b8 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
[ 1.693191] RAX: ffffffffffffffda RBX: 000055eda6a76a70 RCX: 00007f15c40fc897
[ 1.693752] RDX: 0000000000000020 RSI: 000055eda6a93260 RDI: 0000000000000003
[ 1.694308] RBP: 00007ffc3c308150 R08: 00007ffc3c3080c0 R09: 0000000000000080
[ 1.694867] R10: 0000000000000000 R11: 0000000000000202 R12: 000055eda6a93bc0
[ 1.695423] R13: 000055eda6a8c748 R14: 0000000000000000 R15: 000055eda6a8c700
[ 1.695998] </TASK>
[ 1.696183] ---[ end trace 0000000000000000 ]---
[ 1.707952] e1000: ens3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[...]
[ 3.071312] e1000 0000:00:03.0: TX DMA map failed
[ 3.263302] e1000 0000:00:03.0: TX DMA map failed
[ 3.388180] e1000 0000:00:03.0: TX DMA map failed
[many more times]
Qemu version:
QEMU emulator version 9.2.0
Copyright (c) 2003-2024 Fabrice Bellard and the QEMU Project developers
Command to run qemu:
qemu-system-x86_64 \
-m 16G \
-smp 10 \
-kernel arch/x86/boot/bzImage \
-append "console=ttyS0 root=/dev/vda rw earlyprintk=serial net.ifnames=0 kho=1 kho_scratch=200M,200M nokaslr" \
-drive file=/local/home/ptyadav/qemu/drive.ext4,format=raw,if=virtio \
-nic user,hostfwd=tcp::10022-:22 \
-gdb tcp::1234 \
-nographic
Steps used to kexec:
$ echo 1 > /sys/kernel/kho/active
$ kexec -l bzImage -s --initrd /boot/initramfs-linux.img --reuse-cmdline
DT after KHO activate:
/dts-v1/;
/ {
compatible = "kho-v1";
};
--
Regards,
Pratyush Yadav
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-26 20:08 ` Pratyush Yadav
@ 2025-02-28 20:20 ` Mike Rapoport
2025-02-28 23:04 ` Pratyush Yadav
0 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-02-28 20:20 UTC (permalink / raw)
To: Pratyush Yadav
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
Hi Pratyush,
On Wed, Feb 26, 2025 at 08:08:27PM +0000, Pratyush Yadav wrote:
> Hi Mike,
>
> On Thu, Feb 06 2025, Mike Rapoport wrote:
>
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > Hi,
> >
> > This is the next version of Alex's "kexec: Allow preservation of ftrace
> > buffers" series
> > (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com). To
> > make things simpler, instead of ftrace we decided to preserve
> > "reserve_mem" regions.
> [...]
>
> I applied the patches on top of v6.14-rc1 and tried them out on an x86
> qemu machine. When I do a plain KHO activate and kexec, I get the below
> errors on boot. This causes networking to fail on the VM. The errors are
> consistent and happen every kexec-reboot, though fairly late in boot
> after systemd tries to bring up network. The same setup has worked fine
> with Alex's v3 of KHO patches.
>
> Do you see anything obvious that might cause this? I can try to debug
> this tomorrow, but if it rings any loud bells it would be nice to know.
Thanks for the report!
It didn't ring any bells, but I've since found the issue and a
fast-and-dirty fix.
The scratch areas are allocated from high addresses and there is no scratch
memory to satisfy memblock_alloc_low() in swiotlb, so the second kernel
produces a couple of
software IO TLB: swiotlb_memblock_alloc: Failed to allocate 67108864 bytes for tlb structure
and without those buffers e1000 can't dma :(
A quick fix would be to add another scratch area in the lower memory
(below). I'll work on a better fix.
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index c26753d613cb..37bb54cdb130 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -623,13 +623,13 @@ static phys_addr_t __init scratch_size(int nid)
static void kho_reserve_scratch(void)
{
phys_addr_t addr, size;
- int nid, i = 1;
+ int nid, i = 2;
if (!kho_enable)
return;
/* FIXME: deal with node hot-plug/remove */
- kho_scratch_cnt = num_online_nodes() + 1;
+ kho_scratch_cnt = num_online_nodes() + 2;
size = kho_scratch_cnt * sizeof(*kho_scratch);
kho_scratch = memblock_alloc(size, PAGE_SIZE);
if (!kho_scratch)
@@ -644,6 +644,15 @@ static void kho_reserve_scratch(void)
kho_scratch[0].addr = addr;
kho_scratch[0].size = size;
+ addr = memblock_phys_alloc_range(size, CMA_MIN_ALIGNMENT_BYTES,
+ MEMBLOCK_LOW_LIMIT,
+ ARCH_LOW_ADDRESS_LIMIT);
+ if (!addr)
+ goto err_free_scratch_areas;
+
+ kho_scratch[1].addr = addr;
+ kho_scratch[1].size = size;
+
for_each_online_node(nid) {
size = scratch_size(nid);
addr = memblock_alloc_range_nid(size, CMA_MIN_ALIGNMENT_BYTES,
> --
> Regards,
> Pratyush Yadav
>
--
Sincerely yours,
Mike.
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-28 20:20 ` Mike Rapoport
@ 2025-02-28 23:04 ` Pratyush Yadav
2025-03-02 9:52 ` Mike Rapoport
0 siblings, 1 reply; 97+ messages in thread
From: Pratyush Yadav @ 2025-02-28 23:04 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pratyush Yadav, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Fri, Feb 28 2025, Mike Rapoport wrote:
> Hi Pratyush,
>
> On Wed, Feb 26, 2025 at 08:08:27PM +0000, Pratyush Yadav wrote:
>> Hi Mike,
>>
>> On Thu, Feb 06 2025, Mike Rapoport wrote:
>>
>> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>> >
>> > Hi,
>> >
>> > This is the next version of Alex's "kexec: Allow preservation of ftrace
>> > buffers" series
>> > (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com). To
>> > make things simpler, instead of ftrace we decided to preserve
>> > "reserve_mem" regions.
>> [...]
>>
>> I applied the patches on top of v6.14-rc1 and tried them out on an x86
>> qemu machine. When I do a plain KHO activate and kexec, I get the below
>> errors on boot. This causes networking to fail on the VM. The errors are
>> consistent and happen every kexec-reboot, though fairly late in boot
>> after systemd tries to bring up network. The same setup has worked fine
>> with Alex's v3 of KHO patches.
>>
>> Do you see anything obvious that might cause this? I can try to debug
>> this tomorrow, but if it rings any loud bells it would be nice to know.
>
> Thanks for the report!
> It didn't ring any bells, but I've since found the issue and a
> fast-and-dirty fix.
>
> The scratch areas are allocated from high addresses and there is no scratch
> memory to satisfy memblock_alloc_low() in swiotlb, so the second kernel
> produces a couple of
>
> software IO TLB: swiotlb_memblock_alloc: Failed to allocate 67108864 bytes for tlb structure
I also did some digging today and ended up finding the same thing out
but it seems you got there before me :-)
>
> and without those buffers e1000 can't dma :(
>
> A quick fix would be to add another scratch area in the lower memory
> (below). I'll work on a better fix.
I have already written a less-quick fix (patch pasted below) so I
suppose we can use that to review the idea instead. It adds a dedicated
scratch area for lowmem, similar to your patch, and adds some tracking
to calculate the size.
I am not sure if the size estimation is completely right though, since
it is possible that allocations that don't _need_ to be in lowmem end up
being there, causing the scratch area to be too big (or perhaps even
causing allocation failures if the scale is big enough). Maybe we would
be better off tracking lowmem allocation requests separately?
----- 8< -----
From d60aeb2c4a1c0eea05e1a13b48b268d6192a615e Mon Sep 17 00:00:00 2001
From: Pratyush Yadav <ptyadav@amazon.de>
Date: Fri, 28 Feb 2025 22:36:06 +0000
Subject: [PATCH] KHO: always have a lowmem scratch region
During initialization, some callers need to allocate low memory from
memblock. One such caller is swiotlb_memblock_alloc() on x86. The global
and per-node scratch regions are allocated without any constraints on
the range. This can lead to having no scratch region in lowmem. If that
happens, the lowmem allocations can fail, leading to failures during
boot or later down the line.
Always ensure there is some scratch memory available in low memory by
having a separate scratch area for it, along with the global and
per-node ones, and allow specifying its size via the command line.
To more accurately guess suitable scratch sizes, add
memblock_reserved_kern_lowmem_size() and
memblock_reserved_kern_highmem_size() which calculate how much memory
was allocated in low and high memory, along with some helper functions
to calculate scratch sizes.
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
---
.../admin-guide/kernel-parameters.txt | 9 +-
Documentation/kho/usage.rst | 10 +--
include/linux/memblock.h | 2 +
kernel/kexec_handover.c | 83 ++++++++++++++-----
mm/memblock.c | 28 +++++++
5 files changed, 103 insertions(+), 29 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ed656e2fb05ef..7c5afd45ad9dc 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2705,7 +2705,7 @@
"1" | "on" | "y" - kexec handover is enabled
kho_scratch= [KEXEC,EARLY]
- Format: nn[KMG],mm[KMG] | nn%
+ Format: ll[KMG],nn[KMG],mm[KMG] | nn%
Defines the size of the KHO scratch region. The KHO
scratch regions are physically contiguous memory
ranges that can only be used for non-kernel
@@ -2715,9 +2715,10 @@
bootstrap itself.
It is possible to specify the exact amount of
- memory in the form of "nn[KMG],mm[KMG]" where the
- first parameter defines the size of a global
- scratch area and the second parameter defines the
+ memory in the form of "ll[KMG],nn[KMG],mm[KMG]" where the
+ first parameter defines the size of a low memory scratch
+ area, the second parameter defines the size of a global
+ scratch area and the third parameter defines the
size of additional per-node scratch areas.
The form "nn%" defines scale factor (in percents)
of memory that was used during boot.
diff --git a/Documentation/kho/usage.rst b/Documentation/kho/usage.rst
index e7300fbb309c1..6a6011809795d 100644
--- a/Documentation/kho/usage.rst
+++ b/Documentation/kho/usage.rst
@@ -19,11 +19,11 @@ at compile time. Every KHO producer may have its own config option that you
need to enable if you would like to preserve their respective state across
kexec.
-To use KHO, please boot the kernel with the ``kho=on`` command line
-parameter. You may use ``kho_scratch`` parameter to define size of the
-scratch regions. For example ``kho_scratch=512M,512M`` will reserve a 512
-MiB for a global scratch region and 512 MiB per NUMA node scratch regions
-on boot.
+To use KHO, please boot the kernel with the ``kho=on`` command line parameter.
+You may use ``kho_scratch`` parameter to define size of the scratch regions. For
+example ``kho_scratch=128M,512M,512M`` will reserve a 128 MiB low memory scratch
+region, a 512 MiB global scratch region and 512 MiB per NUMA node scratch
+regions on boot.
Perform a KHO kexec
-------------------
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 20887e199cdbd..9f5c5aec4b1d4 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -504,6 +504,8 @@ static inline __init_memblock bool memblock_bottom_up(void)
phys_addr_t memblock_phys_mem_size(void);
phys_addr_t memblock_reserved_size(void);
phys_addr_t memblock_reserved_kern_size(int nid);
+phys_addr_t memblock_reserved_kern_lowmem_size(void);
+phys_addr_t memblock_reserved_kern_highmem_size(void);
unsigned long memblock_estimated_nr_free_pages(void);
phys_addr_t memblock_start_of_DRAM(void);
phys_addr_t memblock_end_of_DRAM(void);
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index c26753d613cbc..29eeed09ceb31 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -547,20 +547,21 @@ late_initcall(kho_init);
*
* kho_scratch=N%
*
- * It is also possible to explicitly define size for a global and per-node
- * scratch areas:
+ * It is also possible to explicitly define size for a lowmem, a global and
+ * per-node scratch areas:
*
- * kho_scratch=n[KMG],m[KMG]
+ * kho_scratch=l[KMG],n[KMG],m[KMG]
*
* The explicit size definition takes precedence over scale definition.
*/
static unsigned int scratch_scale __initdata = 200;
static phys_addr_t scratch_size_global __initdata;
static phys_addr_t scratch_size_pernode __initdata;
+static phys_addr_t scratch_size_lowmem __initdata;
static int __init kho_parse_scratch_size(char *p)
{
- unsigned long size, size_pernode;
+ unsigned long size, size_pernode, size_global;
char *endptr, *oldp = p;
if (!p)
@@ -578,15 +579,25 @@ static int __init kho_parse_scratch_size(char *p)
if (*p != ',')
return -EINVAL;
+ oldp = p;
+ size_global = memparse(p + 1, &p);
+ if (!size_global || p == oldp)
+ return -EINVAL;
+
+ if (*p != ',')
+ return -EINVAL;
+
size_pernode = memparse(p + 1, &p);
if (!size_pernode)
return -EINVAL;
- scratch_size_global = size;
+ scratch_size_lowmem = size;
+ scratch_size_global = size_global;
scratch_size_pernode = size_pernode;
scratch_scale = 0;
- pr_notice("scratch areas: global: %lluMB pernode: %lldMB\n",
+ pr_notice("scratch areas: lowmem: %lluMB global: %lluMB pernode: %lldMB\n",
+ (u64)(scratch_size_lowmem >> 20),
(u64)(scratch_size_global >> 20),
(u64)(scratch_size_pernode >> 20));
}
@@ -595,18 +606,38 @@ static int __init kho_parse_scratch_size(char *p)
}
early_param("kho_scratch", kho_parse_scratch_size);
-static phys_addr_t __init scratch_size(int nid)
+static phys_addr_t __init scratch_size_low(void)
+{
+ phys_addr_t size;
+
+ if (scratch_scale)
+ size = memblock_reserved_kern_lowmem_size() * scratch_scale / 100;
+ else
+ size = scratch_size_lowmem;
+
+ return round_up(size, CMA_MIN_ALIGNMENT_BYTES);
+}
+
+static phys_addr_t __init scratch_size_high(void)
+{
+ phys_addr_t size;
+
+ if (scratch_scale)
+ size = memblock_reserved_kern_highmem_size() * scratch_scale / 100;
+ else
+ size = scratch_size_global;
+
+ return round_up(size, CMA_MIN_ALIGNMENT_BYTES);
+}
+
+static phys_addr_t __init scratch_size_node(int nid)
{
phys_addr_t size;
- if (scratch_scale) {
+ if (scratch_scale)
size = memblock_reserved_kern_size(nid) * scratch_scale / 100;
- } else {
- if (numa_valid_node(nid))
- size = scratch_size_pernode;
- else
- size = scratch_size_global;
- }
+ else
+ size = scratch_size_pernode;
return round_up(size, CMA_MIN_ALIGNMENT_BYTES);
}
@@ -623,29 +654,41 @@ static phys_addr_t __init scratch_size(int nid)
static void kho_reserve_scratch(void)
{
phys_addr_t addr, size;
- int nid, i = 1;
+ int nid, i = 0;
if (!kho_enable)
return;
/* FIXME: deal with node hot-plug/remove */
- kho_scratch_cnt = num_online_nodes() + 1;
+ kho_scratch_cnt = num_online_nodes() + 2;
size = kho_scratch_cnt * sizeof(*kho_scratch);
kho_scratch = memblock_alloc(size, PAGE_SIZE);
if (!kho_scratch)
goto err_disable_kho;
- /* reserve large contiguous area for allocations without nid */
- size = scratch_size(NUMA_NO_NODE);
- addr = memblock_phys_alloc(size, CMA_MIN_ALIGNMENT_BYTES);
+ /* reserve area for lowmem allocations. */
+ size = scratch_size_low();
+ addr = memblock_phys_alloc_range(size, CMA_MIN_ALIGNMENT_BYTES, 0,
+ ARCH_LOW_ADDRESS_LIMIT);
if (!addr)
goto err_free_scratch_desc;
kho_scratch[0].addr = addr;
kho_scratch[0].size = size;
+ i++;
+
+ /* reserve large contiguous area for allocations without nid */
+ size = scratch_size_high();
+ addr = memblock_phys_alloc(size, CMA_MIN_ALIGNMENT_BYTES);
+ if (!addr)
+ goto err_free_scratch_areas;
+
+ kho_scratch[1].addr = addr;
+ kho_scratch[1].size = size;
+ i++;
for_each_online_node(nid) {
- size = scratch_size(nid);
+ size = scratch_size_node(nid);
addr = memblock_alloc_range_nid(size, CMA_MIN_ALIGNMENT_BYTES,
0, MEMBLOCK_ALLOC_ACCESSIBLE,
nid, true);
diff --git a/mm/memblock.c b/mm/memblock.c
index fdb08b60efc17..da7abf5e5e504 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1837,6 +1837,34 @@ phys_addr_t __init_memblock memblock_reserved_size(void)
return memblock.reserved.total_size;
}
+phys_addr_t __init_memblock memblock_reserved_kern_lowmem_size(void)
+{
+ struct memblock_region *r;
+ phys_addr_t total = 0;
+
+ for_each_reserved_mem_region(r) {
+ if ((r->flags & MEMBLOCK_RSRV_KERN) &&
+ (r->base + r->size <= ARCH_LOW_ADDRESS_LIMIT))
+ total += r->size;
+ }
+
+ return total;
+}
+
+phys_addr_t __init_memblock memblock_reserved_kern_highmem_size(void)
+{
+ struct memblock_region *r;
+ phys_addr_t total = 0;
+
+ for_each_reserved_mem_region(r) {
+ if ((r->flags & MEMBLOCK_RSRV_KERN) &&
+ (r->base + r->size > ARCH_LOW_ADDRESS_LIMIT))
+ total += r->size;
+ }
+
+ return total;
+}
+
phys_addr_t __init_memblock memblock_reserved_kern_size(int nid)
{
struct memblock_region *r;
--
Regards,
Pratyush Yadav
* Re: [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO)
2025-02-28 23:04 ` Pratyush Yadav
@ 2025-03-02 9:52 ` Mike Rapoport
0 siblings, 0 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-03-02 9:52 UTC (permalink / raw)
To: Pratyush Yadav
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
On Fri, Feb 28, 2025 at 11:04:48PM +0000, Pratyush Yadav wrote:
> On Fri, Feb 28 2025, Mike Rapoport wrote:
> > On Wed, Feb 26, 2025 at 08:08:27PM +0000, Pratyush Yadav wrote:
> >> On Thu, Feb 06 2025, Mike Rapoport wrote:
> >>
> >> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >> >
> >> > Hi,
> >> >
> >> > This is the next version of Alex's "kexec: Allow preservation of
> >> > ftrace buffers" series
> >> > (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com). To
> >> > make things simpler, instead of ftrace we decided to preserve
> >> > "reserve_mem" regions.
> >> [...]
> >>
> >> I applied the patches on top of v6.14-rc1 and tried them out on an x86
> >> qemu machine. When I do a plain KHO activate and kexec, I get the below
> >> errors on boot. This causes networking to fail on the VM. The errors are
> >> consistent and happen every kexec-reboot, though fairly late in boot
> >> after systemd tries to bring up network. The same setup has worked fine
> >> with Alex's v3 of KHO patches.
> >>
> >> Do you see anything obvious that might cause this? I can try to debug
> >> this tomorrow, but if it rings any loud bells it would be nice to know.
> >
> > Thanks for the report!
> > It didn't ring any bells, but I've since found the issue and a
> > fast-and-dirty fix.
> >
> > The scratch areas are allocated from high addresses and there is no
> > scratch memory to satisfy memblock_alloc_low() in swiotlb, so the second
> > kernel produces a couple of
> >
> > software IO TLB: swiotlb_memblock_alloc: Failed to allocate 67108864 bytes for tlb structure
>
> I also did some digging today and ended up finding the same thing out
> but it seems you got there before me :-)
>
> >
> > and without those buffers e1000 can't dma :(
> >
> > A quick fix would be to add another scratch area in the lower memory
> > (below). I'll work on a better fix.
>
> I have already written a less-quick fix (patch pasted below) so I
> suppose we can use that to review the idea instead. It adds a dedicated
> scratch area for lowmem, similar to your patch, and adds some tracking
> to calculate the size.
>
> I am not sure if the size estimation is completely right though, since
> it is possible that allocations that don't _need_ to be in lowmem end up
> being there, causing the scratch area to be too big (or perhaps even
> causing allocation failures if the scale is big enough). Maybe we would
> be better off tracking lowmem allocation requests separately?
Normally memblock allocates top-down, so if there is enough memory,
allocations that do not limit memory ranges won't end up in lowmem.
And on systems with smaller memory (e.g. 4G or less) we'd have scratch
areas in lowmem anyway.
Yet, there is an issue with swiotlb_init() and estimation of the size for
lowmem scratch area we need to address. swiotlb_init() is called from
mem_init() and the memory it allocates won't be accounted by the time of
kho_memory_init().
I've been meaning to pull out common parts of mem_init() from arch-specific
code into mm_init() for a while now; maybe the time has come :)
> ----- 8< -----
> From d60aeb2c4a1c0eea05e1a13b48b268d6192a615e Mon Sep 17 00:00:00 2001
> From: Pratyush Yadav <ptyadav@amazon.de>
> Date: Fri, 28 Feb 2025 22:36:06 +0000
> Subject: [PATCH] KHO: always have a lowmem scratch region
>
> During initialization, some callers need to allocate low memory from
> memblock. One such caller is swiotlb_memblock_alloc() on x86. The global
> and per-node scratch regions are allocated without any constraints on
> the range. This can lead to having no scratch region in lowmem. If that
> happens, the lowmem allocations can fail, leading to failures during
> boot or later down the line.
>
> Always ensure there is some scratch memory available in low memory by
> having a separate scratch area for it, along with the global and
> per-node ones, and allow specifying its size via the command line.
>
> To more accurately guess suitable scratch sizes, add
> memblock_reserved_kern_lowmem_size() and
> memblock_reserved_kern_highmem_size() which calculate how much memory
> was allocated in low and high memory, along with some helper functions
> to calculate scratch sizes.
>
> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
> ---
> .../admin-guide/kernel-parameters.txt | 9 +-
> Documentation/kho/usage.rst | 10 +--
> include/linux/memblock.h | 2 +
> kernel/kexec_handover.c | 83 ++++++++++++++-----
> mm/memblock.c | 28 +++++++
> 5 files changed, 103 insertions(+), 29 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index ed656e2fb05ef..7c5afd45ad9dc 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2705,7 +2705,7 @@
> "1" | "on" | "y" - kexec handover is enabled
>
> kho_scratch= [KEXEC,EARLY]
> - Format: nn[KMG],mm[KMG] | nn%
> + Format: ll[KMG],nn[KMG],mm[KMG] | nn%
> Defines the size of the KHO scratch region. The KHO
> scratch regions are physically contiguous memory
> ranges that can only be used for non-kernel
> @@ -2715,9 +2715,10 @@
> bootstrap itself.
>
> It is possible to specify the exact amount of
> - memory in the form of "nn[KMG],mm[KMG]" where the
> - first parameter defines the size of a global
> - scratch area and the second parameter defines the
> + memory in the form of "ll[KMG],nn[KMG],mm[KMG]" where the
> + first parameter defines the size of a low memory scratch
> + area, the second parameter defines the size of a global
> + scratch area and the third parameter defines the
> size of additional per-node scratch areas.
> The form "nn%" defines scale factor (in percents)
> of memory that was used during boot.
> diff --git a/Documentation/kho/usage.rst b/Documentation/kho/usage.rst
> index e7300fbb309c1..6a6011809795d 100644
> --- a/Documentation/kho/usage.rst
> +++ b/Documentation/kho/usage.rst
> @@ -19,11 +19,11 @@ at compile time. Every KHO producer may have its own config option that you
> need to enable if you would like to preserve their respective state across
> kexec.
>
> -To use KHO, please boot the kernel with the ``kho=on`` command line
> -parameter. You may use ``kho_scratch`` parameter to define size of the
> -scratch regions. For example ``kho_scratch=512M,512M`` will reserve a 512
> -MiB for a global scratch region and 512 MiB per NUMA node scratch regions
> -on boot.
> +To use KHO, please boot the kernel with the ``kho=on`` command line parameter.
> +You may use ``kho_scratch`` parameter to define size of the scratch regions. For
> +example ``kho_scratch=128M,512M,512M`` will reserve a 128 MiB low memory scratch
> +region, a 512 MiB global scratch region and 512 MiB per NUMA node scratch
> +regions on boot.
>
> Perform a KHO kexec
> -------------------
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 20887e199cdbd..9f5c5aec4b1d4 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -504,6 +504,8 @@ static inline __init_memblock bool memblock_bottom_up(void)
> phys_addr_t memblock_phys_mem_size(void);
> phys_addr_t memblock_reserved_size(void);
> phys_addr_t memblock_reserved_kern_size(int nid);
> +phys_addr_t memblock_reserved_kern_lowmem_size(void);
> +phys_addr_t memblock_reserved_kern_highmem_size(void);
> unsigned long memblock_estimated_nr_free_pages(void);
> phys_addr_t memblock_start_of_DRAM(void);
> phys_addr_t memblock_end_of_DRAM(void);
> diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
> index c26753d613cbc..29eeed09ceb31 100644
> --- a/kernel/kexec_handover.c
> +++ b/kernel/kexec_handover.c
> @@ -547,20 +547,21 @@ late_initcall(kho_init);
> *
> * kho_scratch=N%
> *
> - * It is also possible to explicitly define size for a global and per-node
> - * scratch areas:
> + * It is also possible to explicitly define size for a lowmem, a global and
> + * per-node scratch areas:
> *
> - * kho_scratch=n[KMG],m[KMG]
> + * kho_scratch=l[KMG],n[KMG],m[KMG]
> *
> * The explicit size definition takes precedence over scale definition.
> */
> static unsigned int scratch_scale __initdata = 200;
> static phys_addr_t scratch_size_global __initdata;
> static phys_addr_t scratch_size_pernode __initdata;
> +static phys_addr_t scratch_size_lowmem __initdata;
>
> static int __init kho_parse_scratch_size(char *p)
> {
> - unsigned long size, size_pernode;
> + unsigned long size, size_pernode, size_global;
> char *endptr, *oldp = p;
>
> if (!p)
> @@ -578,15 +579,25 @@ static int __init kho_parse_scratch_size(char *p)
> if (*p != ',')
> return -EINVAL;
>
> + oldp = p;
> + size_global = memparse(p + 1, &p);
> + if (!size_global || p == oldp)
> + return -EINVAL;
> +
> + if (*p != ',')
> + return -EINVAL;
> +
> size_pernode = memparse(p + 1, &p);
> if (!size_pernode)
> return -EINVAL;
>
> - scratch_size_global = size;
> + scratch_size_lowmem = size;
> + scratch_size_global = size_global;
> scratch_size_pernode = size_pernode;
> scratch_scale = 0;
>
> - pr_notice("scratch areas: global: %lluMB pernode: %lldMB\n",
> + pr_notice("scratch areas: lowmem: %lluMB global: %lluMB pernode: %lldMB\n",
> + (u64)(scratch_size_lowmem >> 20),
> (u64)(scratch_size_global >> 20),
> (u64)(scratch_size_pernode >> 20));
> }
> @@ -595,18 +606,38 @@ static int __init kho_parse_scratch_size(char *p)
> }
> early_param("kho_scratch", kho_parse_scratch_size);
>
> -static phys_addr_t __init scratch_size(int nid)
> +static phys_addr_t __init scratch_size_low(void)
> +{
> + phys_addr_t size;
> +
> + if (scratch_scale)
> + size = memblock_reserved_kern_lowmem_size() * scratch_scale / 100;
> + else
> + size = scratch_size_lowmem;
> +
> + return round_up(size, CMA_MIN_ALIGNMENT_BYTES);
> +}
> +
> +static phys_addr_t __init scratch_size_high(void)
> +{
> + phys_addr_t size;
> +
> + if (scratch_scale)
> + size = memblock_reserved_kern_highmem_size() * scratch_scale / 100;
> + else
> + size = scratch_size_global;
> +
> + return round_up(size, CMA_MIN_ALIGNMENT_BYTES);
> +}
> +
> +static phys_addr_t __init scratch_size_node(int nid)
> {
> phys_addr_t size;
>
> - if (scratch_scale) {
> + if (scratch_scale)
> size = memblock_reserved_kern_size(nid) * scratch_scale / 100;
> - } else {
> - if (numa_valid_node(nid))
> - size = scratch_size_pernode;
> - else
> - size = scratch_size_global;
> - }
> + else
> + size = scratch_size_pernode;
>
> return round_up(size, CMA_MIN_ALIGNMENT_BYTES);
> }
> @@ -623,29 +654,41 @@ static phys_addr_t __init scratch_size(int nid)
> static void kho_reserve_scratch(void)
> {
> phys_addr_t addr, size;
> - int nid, i = 1;
> + int nid, i = 0;
>
> if (!kho_enable)
> return;
>
> /* FIXME: deal with node hot-plug/remove */
> - kho_scratch_cnt = num_online_nodes() + 1;
> + kho_scratch_cnt = num_online_nodes() + 2;
> size = kho_scratch_cnt * sizeof(*kho_scratch);
> kho_scratch = memblock_alloc(size, PAGE_SIZE);
> if (!kho_scratch)
> goto err_disable_kho;
>
> - /* reserve large contiguous area for allocations without nid */
> - size = scratch_size(NUMA_NO_NODE);
> - addr = memblock_phys_alloc(size, CMA_MIN_ALIGNMENT_BYTES);
> + /* reserve area for lowmem allocations. */
> + size = scratch_size_low();
> + addr = memblock_phys_alloc_range(size, CMA_MIN_ALIGNMENT_BYTES, 0,
> + ARCH_LOW_ADDRESS_LIMIT);
> if (!addr)
> goto err_free_scratch_desc;
>
> kho_scratch[0].addr = addr;
> kho_scratch[0].size = size;
> + i++;
> +
> + /* reserve large contiguous area for allocations without nid */
> + size = scratch_size_high();
> + addr = memblock_phys_alloc(size, CMA_MIN_ALIGNMENT_BYTES);
> + if (!addr)
> + goto err_free_scratch_areas;
> +
> + kho_scratch[1].addr = addr;
> + kho_scratch[1].size = size;
> + i++;
>
> for_each_online_node(nid) {
> - size = scratch_size(nid);
> + size = scratch_size_node(nid);
> addr = memblock_alloc_range_nid(size, CMA_MIN_ALIGNMENT_BYTES,
> 0, MEMBLOCK_ALLOC_ACCESSIBLE,
> nid, true);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index fdb08b60efc17..da7abf5e5e504 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1837,6 +1837,34 @@ phys_addr_t __init_memblock memblock_reserved_size(void)
> return memblock.reserved.total_size;
> }
>
> +phys_addr_t __init_memblock memblock_reserved_kern_lowmem_size(void)
> +{
> + struct memblock_region *r;
> + phys_addr_t total = 0;
> +
> + for_each_reserved_mem_region(r) {
> + if ((r->flags & MEMBLOCK_RSRV_KERN) &&
> + (r->base + r->size <= ARCH_LOW_ADDRESS_LIMIT))
> + total += r->size;
> + }
> +
> + return total;
> +}
> +
> +phys_addr_t __init_memblock memblock_reserved_kern_highmem_size(void)
> +{
> + struct memblock_region *r;
> + phys_addr_t total = 0;
> +
> + for_each_reserved_mem_region(r) {
> + if ((r->flags & MEMBLOCK_RSRV_KERN) &&
> + (r->base + r->size > ARCH_LOW_ADDRESS_LIMIT))
> + total += r->size;
> + }
> +
> + return total;
> +}
> +
> phys_addr_t __init_memblock memblock_reserved_kern_size(int nid)
> {
> struct memblock_region *r;
> --
> Regards,
> Pratyush Yadav
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-02-26 2:09 ` Wei Yang
@ 2025-03-10 7:56 ` Wei Yang
2025-03-10 8:28 ` Mike Rapoport
0 siblings, 1 reply; 97+ messages in thread
From: Wei Yang @ 2025-03-10 7:56 UTC (permalink / raw)
To: Wei Yang
Cc: Mike Rapoport, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Wed, Feb 26, 2025 at 02:09:15AM +0000, Wei Yang wrote:
>On Tue, Feb 25, 2025 at 09:46:28AM +0200, Mike Rapoport wrote:
>>On Mon, Feb 24, 2025 at 01:31:31AM +0000, Wei Yang wrote:
>>> On Wed, Feb 19, 2025 at 09:24:31AM +0200, Mike Rapoport wrote:
>>> >Hi,
>>> >
>>> >On Tue, Feb 18, 2025 at 03:50:04PM +0000, Wei Yang wrote:
>>> >> On Thu, Feb 06, 2025 at 03:27:42PM +0200, Mike Rapoport wrote:
>>> >> >From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>>> >> >
>>> >> >to denote areas that were reserved for kernel use either directly with
>>> >> >memblock_reserve_kern() or via memblock allocations.
>>> >> >
>>> >> >Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>>> >> >---
>>> >> > include/linux/memblock.h | 16 +++++++++++++++-
>>> >> > mm/memblock.c | 32 ++++++++++++++++++++++++--------
>>> >> > 2 files changed, 39 insertions(+), 9 deletions(-)
>>> >> >
>>> >> >diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>>> >> >index e79eb6ac516f..65e274550f5d 100644
>>> >> >--- a/include/linux/memblock.h
>>> >> >+++ b/include/linux/memblock.h
>>> >> >@@ -50,6 +50,7 @@ enum memblock_flags {
>>> >> > MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
>>> >> > MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
>>> >> > MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
>>> >> >+ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
>>> >>
>>> >> Above memblock_flags, there are comments on explaining those flags.
>>> >>
>>> >> Seems we miss it for MEMBLOCK_RSRV_KERN.
>>> >
>>> >Right, thanks!
>>> >
>>> >> >
>>> >> > #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
>>> >> >@@ -1459,14 +1460,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>>> >> > again:
>>> >> > found = memblock_find_in_range_node(size, align, start, end, nid,
>>> >> > flags);
>>> >> >- if (found && !memblock_reserve(found, size))
>>> >> >+ if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN))
>>> >>
>>> >> Maybe we could use memblock_reserve_kern() directly. If my understanding is
>>> >> correct, the reserved region's nid is not used.
>>> >
>>> >We use nid of reserved regions in reserve_bootmem_region() (commit
>>> >61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")) but KHO needs to
>>> >know the distribution of reserved memory among the nodes before
>>> >memmap_init_reserved_pages().
>>> >
>>> >> BTW, one question here. How do we handle concurrent memblock allocation? If
>>> >> two threads find the same available range and reserve it, that seems like a
>>> >> problem to me. Or have I missed something?
>>> >
>>> >memblock allocations end before smp_init(), so there is no possible concurrency.
>>> >
>>>
>>> Thanks, I still have one question here.
>>>
>>> Below is a simplified call flow.
>>>
>>> mm_core_init()
>>> mem_init()
>>> memblock_free_all()
>>> free_low_memory_core_early()
>>> memmap_init_reserved_pages()
>>> memblock_set_node(..., memblock.reserved, ) --- (1)
>>> __free_memory_core()
>>> kmem_cache_init()
>>> slab_state = UP; --- (2)
>>>
>>> And memblock_alloc_range_nid() is not supposed to be called after
>>> slab_is_available(). Even if someone does, it will get memory from the slab
>>> allocator instead of a reserved region in memblock.
>>>
>>> From the above call flow and background, there are three cases when
>>> memblock_alloc_range_nid() would be called:
>>>
>>> * If it is called before (1), memblock.reserved's nid would be adjusted correctly.
>>> * If it is called after (2), we don't touch memblock.reserved.
>>> * If it happens between (1) and (2), it looks like it would break the
>>> consistency of nid information in memblock.reserved, because when we use
>>> memblock_reserve_kern(), NUMA_NO_NODE would be stored in the region.
>>>
>>> So my question is: if the third case happens, would it introduce a bug? If it
>>> can't happen, it seems we don't need to specify the nid here?
>>
>>We don't really care about proper assignment of nodes between (1) and (2)
>>on the one hand, and on the other hand the third case does not happen. Nothing
>>should call memblock_alloc() after memblock_free_all().
>>
>
>My point is that if no one calls memblock_alloc() after memblock_free_all(),
>which sets nid in memblock.reserved properly, it seems unnecessary to do
>__memblock_reserve() with the exact nid during memblock_alloc()?
>
>As you did __memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN) in this
>patch.
>
Hi, Mike
Do you think my understanding is reasonable?
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-03-10 7:56 ` Wei Yang
@ 2025-03-10 8:28 ` Mike Rapoport
2025-03-10 9:42 ` Wei Yang
0 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-03-10 8:28 UTC (permalink / raw)
To: Wei Yang
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
Hi Wei,
On Mon, Mar 10, 2025 at 07:56:27AM +0000, Wei Yang wrote:
> On Wed, Feb 26, 2025 at 02:09:15AM +0000, Wei Yang wrote:
> >>>
> >>> From the above call flow and background, there are three cases when
> >>> memblock_alloc_range_nid() would be called:
> >>>
> >>> * If it is called before (1), memblock.reserved's nid would be adjusted correctly.
> >>> * If it is called after (2), we don't touch memblock.reserved.
> >>> * If it happens between (1) and (2), it looks like it would break the
> >>> consistency of nid information in memblock.reserved, because when we use
> >>> memblock_reserve_kern(), NUMA_NO_NODE would be stored in the region.
> >>>
> >>> So my question is if the third case happens, would it introduce a bug? If it
> >>> won't happen, seems we don't need to specify the nid here?
> >>
> >>We don't really care about proper assignment of nodes between (1) and (2)
> >>on the one hand, and on the other hand the third case does not happen. Nothing
> >>should call memblock_alloc() after memblock_free_all().
> >>
> >
> >My point is that if no one calls memblock_alloc() after memblock_free_all(),
> >which sets nid in memblock.reserved properly, it seems unnecessary to do
> >__memblock_reserve() with the exact nid during memblock_alloc()?
> >
> >As you did __memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN) in this
> >patch.
> >
>
> Hi, Mike
>
> Do you think my understanding is reasonable?
Without KHO it is indeed not strictly necessary to set nid during memblock_alloc().
But since memblock_alloc_range_nid() takes a nid parameter anyway, and that nid
propagates to memblock_add_range(), I think it's easier and cleaner to pass it
on to __memblock_reserve() there.
And for KHO estimation of scratch size it is important to have nid assigned to
the reserved areas before memblock_free_all(), at least for the allocations
that request particular nid explicitly.
> --
> Wei Yang
> Help you, Help me
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-03-10 8:28 ` Mike Rapoport
@ 2025-03-10 9:42 ` Wei Yang
0 siblings, 0 replies; 97+ messages in thread
From: Wei Yang @ 2025-03-10 9:42 UTC (permalink / raw)
To: Mike Rapoport
Cc: Wei Yang, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Mon, Mar 10, 2025 at 10:28:02AM +0200, Mike Rapoport wrote:
>Hi Wei,
>
>On Mon, Mar 10, 2025 at 07:56:27AM +0000, Wei Yang wrote:
>> On Wed, Feb 26, 2025 at 02:09:15AM +0000, Wei Yang wrote:
>> >>>
>> >>> From the above call flow and background, there are three cases when
>> >>> memblock_alloc_range_nid() would be called:
>> >>>
>> >>> * If it is called before (1), memblock.reserved's nid would be adjusted correctly.
>> >>> * If it is called after (2), we don't touch memblock.reserved.
> >>> * If it happens between (1) and (2), it looks like it would break the
> >>> consistency of nid information in memblock.reserved, because when we use
> >>> memblock_reserve_kern(), NUMA_NO_NODE would be stored in the region.
>> >>>
>> >>> So my question is if the third case happens, would it introduce a bug? If it
>> >>> won't happen, seems we don't need to specify the nid here?
>> >>
> >>We don't really care about proper assignment of nodes between (1) and (2)
> >>on the one hand, and on the other hand the third case does not happen. Nothing
> >>should call memblock_alloc() after memblock_free_all().
>> >>
>> >
> >My point is that if no one calls memblock_alloc() after memblock_free_all(),
> >which sets nid in memblock.reserved properly, it seems unnecessary to do
> >__memblock_reserve() with the exact nid during memblock_alloc()?
>> >
>> >As you did __memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN) in this
>> >patch.
>> >
>>
>> Hi, Mike
>>
>> Do you think my understanding is reasonable?
>
>Without KHO it is indeed not strictly necessary to set nid during memblock_alloc().
>But since memblock_alloc_range_nid() takes a nid parameter anyway, and that nid
>propagates to memblock_add_range(), I think it's easier and cleaner to pass it
>on to __memblock_reserve() there.
>
>And for KHO estimation of scratch size it is important to have nid assigned to
>the reserved areas before memblock_free_all(), at least for the allocations
>that request particular nid explicitly.
Thanks, I see your point.
>
>> --
>> Wei Yang
>> Help you, Help me
>
>--
>Sincerely yours,
>Mike.
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-02-23 0:22 ` Wei Yang
@ 2025-03-10 9:51 ` Wei Yang
2025-03-11 5:27 ` Mike Rapoport
0 siblings, 1 reply; 97+ messages in thread
From: Wei Yang @ 2025-03-10 9:51 UTC (permalink / raw)
To: Wei Yang
Cc: Mike Rapoport, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Sun, Feb 23, 2025 at 12:22:29AM +0000, Wei Yang wrote:
>On Wed, Feb 19, 2025 at 09:24:31AM +0200, Mike Rapoport wrote:
>>Hi,
>>
>>On Tue, Feb 18, 2025 at 03:50:04PM +0000, Wei Yang wrote:
>>> On Thu, Feb 06, 2025 at 03:27:42PM +0200, Mike Rapoport wrote:
>>> >From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>>> >
>>> >to denote areas that were reserved for kernel use either directly with
>>> >memblock_reserve_kern() or via memblock allocations.
>>> >
>>> >Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>>> >---
>>> > include/linux/memblock.h | 16 +++++++++++++++-
>>> > mm/memblock.c | 32 ++++++++++++++++++++++++--------
>>> > 2 files changed, 39 insertions(+), 9 deletions(-)
>>> >
>>> >diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>>> >index e79eb6ac516f..65e274550f5d 100644
>>> >--- a/include/linux/memblock.h
>>> >+++ b/include/linux/memblock.h
>>> >@@ -50,6 +50,7 @@ enum memblock_flags {
>>> > MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
>>> > MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
>>> > MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
>>> >+ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
>>>
>>> Above memblock_flags, there are comments on explaining those flags.
>>>
>>> Seems we miss it for MEMBLOCK_RSRV_KERN.
>>
>>Right, thanks!
>>
>>> >
>>> > #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
>>> >@@ -1459,14 +1460,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>>> > again:
>>> > found = memblock_find_in_range_node(size, align, start, end, nid,
>>> > flags);
>>> >- if (found && !memblock_reserve(found, size))
>>> >+ if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN))
>>>
>>> Maybe we could use memblock_reserve_kern() directly. If my understanding is
>>> correct, the reserved region's nid is not used.
>>
>>We use nid of reserved regions in reserve_bootmem_region() (commit
>>61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")) but KHO needs to
>>know the distribution of reserved memory among the nodes before
>>memmap_init_reserved_pages().
>>
>
>I took another look into this commit. There may be a rare corner case that
>leaves a reserved region with no nid set.
>
>memmap_init_reserved_pages()
> for_each_mem_region() {
> ...
> memblock_set_node(start, end, &memblock.reserved, nid);
> }
>
>We leverage the iteration here to set nid to all regions in memblock.reserved.
>But memblock_set_node() may call memblock_double_array() to expand the array,
>which may grab a range before the current start. So we would miss setting the
>correct nid on the new reserved region.
>
>I have tried to create a case in the memblock test. This would happen when there
>are 126 memblock.reserved regions, and the last region spans the last two
>nodes.
>
>One way to fix this is to compare type->max in memblock_set_node() and check
>the return value in memmap_init_reserved_pages(). If we find the size
>changed, repeat the iteration.
>
>But this is a very minor one, so I'm not sure it is worth fixing.
>
Hi, Mike
I have done a user space test which shows we can indeed end up with a region
that has no nid set.
I am not sure whether you are OK with my approach to fixing it.
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 06/14] kexec: Add KHO parsing support
2025-02-06 13:27 ` [PATCH v4 06/14] kexec: Add KHO parsing support Mike Rapoport
2025-02-10 20:50 ` Jason Gunthorpe
@ 2025-03-10 16:20 ` Pratyush Yadav
2025-03-10 17:08 ` Mike Rapoport
1 sibling, 1 reply; 97+ messages in thread
From: Pratyush Yadav @ 2025-03-10 16:20 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
Hi Mike,
On Thu, Feb 06 2025, Mike Rapoport wrote:
[...]
> @@ -444,7 +576,141 @@ static void kho_reserve_scratch(void)
> kho_enable = false;
> }
>
> +/*
> + * Scan the DT for any memory ranges and make sure they are reserved in
> + * memblock, otherwise they will end up in a weird state on free lists.
> + */
> +static void kho_init_reserved_pages(void)
> +{
> + const void *fdt = kho_get_fdt();
> + int offset = 0, depth = 0, initial_depth = 0, len;
> +
> + if (!fdt)
> + return;
> +
> + /* Go through the mem list and add 1 for each reference */
> + for (offset = 0;
> + offset >= 0 && depth >= initial_depth;
> + offset = fdt_next_node(fdt, offset, &depth)) {
> + const struct kho_mem *mems;
> + u32 i;
> +
> + mems = fdt_getprop(fdt, offset, "mem", &len);
> + if (!mems || len & (sizeof(*mems) - 1))
> + continue;
> +
> + for (i = 0; i < len; i += sizeof(*mems)) {
> + const struct kho_mem *mem = &mems[i];
i goes from 0 to len in steps of 16, but you use it to dereference an
array of type struct kho_mem. So you end up looking at only one of every
16 mems and doing out-of-bounds accesses. I found this when testing the
memfd patches: any time the file was more than 1 page, it started to
crash randomly.
Below patch should fix that:
---- 8< ----
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index c26753d613cbc..40d1d8ac68d44 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -685,13 +685,15 @@ static void kho_init_reserved_pages(void)
offset >= 0 && depth >= initial_depth;
offset = fdt_next_node(fdt, offset, &depth)) {
const struct kho_mem *mems;
- u32 i;
+ u32 i, nr_mems;
mems = fdt_getprop(fdt, offset, "mem", &len);
if (!mems || len & (sizeof(*mems) - 1))
continue;
- for (i = 0; i < len; i += sizeof(*mems)) {
+ nr_mems = len / sizeof(*mems);
+
+ for (i = 0; i < nr_mems; i++) {
const struct kho_mem *mem = &mems[i];
memblock_reserve(mem->addr, mem->size);
---- >8 ----
[...]
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 06/14] kexec: Add KHO parsing support
2025-03-10 16:20 ` Pratyush Yadav
@ 2025-03-10 17:08 ` Mike Rapoport
0 siblings, 0 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-03-10 17:08 UTC (permalink / raw)
To: Pratyush Yadav
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Rob Herring, Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
Hi Pratyush,
On Mon, Mar 10, 2025 at 04:20:01PM +0000, Pratyush Yadav wrote:
> Hi Mike,
>
> On Thu, Feb 06 2025, Mike Rapoport wrote:
> [...]
> > @@ -444,7 +576,141 @@ static void kho_reserve_scratch(void)
> > kho_enable = false;
> > }
> >
> > +/*
> > + * Scan the DT for any memory ranges and make sure they are reserved in
> > + * memblock, otherwise they will end up in a weird state on free lists.
> > + */
> > +static void kho_init_reserved_pages(void)
> > +{
> > + const void *fdt = kho_get_fdt();
> > + int offset = 0, depth = 0, initial_depth = 0, len;
> > +
> > + if (!fdt)
> > + return;
> > +
> > + /* Go through the mem list and add 1 for each reference */
> > + for (offset = 0;
> > + offset >= 0 && depth >= initial_depth;
> > + offset = fdt_next_node(fdt, offset, &depth)) {
> > + const struct kho_mem *mems;
> > + u32 i;
> > +
> > + mems = fdt_getprop(fdt, offset, "mem", &len);
> > + if (!mems || len & (sizeof(*mems) - 1))
> > + continue;
> > +
> > + for (i = 0; i < len; i += sizeof(*mems)) {
> > + const struct kho_mem *mem = &mems[i];
>
> i goes from 0 to len in steps of 16, but you use it to dereference an
> array of type struct kho_mem. So you end up looking at only one of every
> 16 mems and doing out-of-bounds accesses. I found this when testing the
> memfd patches: any time the file was more than 1 page, it started to
> crash randomly.
Thanks! Changyuan already pointed that out privately.
But I'm going to adopt the memory reservation scheme Jason proposed so
this code is going to go away anyway :)
> Below patch should fix that:
>
> ---- 8< ----
> diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
> index c26753d613cbc..40d1d8ac68d44 100644
> --- a/kernel/kexec_handover.c
> +++ b/kernel/kexec_handover.c
> @@ -685,13 +685,15 @@ static void kho_init_reserved_pages(void)
> offset >= 0 && depth >= initial_depth;
> offset = fdt_next_node(fdt, offset, &depth)) {
> const struct kho_mem *mems;
> - u32 i;
> + u32 i, nr_mems;
>
> mems = fdt_getprop(fdt, offset, "mem", &len);
> if (!mems || len & (sizeof(*mems) - 1))
> continue;
>
> - for (i = 0; i < len; i += sizeof(*mems)) {
> + nr_mems = len / sizeof(*mems);
> +
> + for (i = 0; i < nr_mems; i++) {
> const struct kho_mem *mem = &mems[i];
>
> memblock_reserve(mem->addr, mem->size);
> ---- >8 ----
> [...]
>
> --
> Regards,
> Pratyush Yadav
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-03-10 9:51 ` Wei Yang
@ 2025-03-11 5:27 ` Mike Rapoport
2025-03-11 13:41 ` Wei Yang
0 siblings, 1 reply; 97+ messages in thread
From: Mike Rapoport @ 2025-03-11 5:27 UTC (permalink / raw)
To: Wei Yang
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
Hi Wei,
On Mon, Mar 10, 2025 at 09:51:24AM +0000, Wei Yang wrote:
> On Sun, Feb 23, 2025 at 12:22:29AM +0000, Wei Yang wrote:
> >On Wed, Feb 19, 2025 at 09:24:31AM +0200, Mike Rapoport wrote:
> >>Hi,
> >>
> >>On Tue, Feb 18, 2025 at 03:50:04PM +0000, Wei Yang wrote:
> >>> On Thu, Feb 06, 2025 at 03:27:42PM +0200, Mike Rapoport wrote:
> >>> >From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >>> >
> >>> >to denote areas that were reserved for kernel use either directly with
> >>> >memblock_reserve_kern() or via memblock allocations.
> >>> >
> >>> >Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> >>> >---
> >>> > include/linux/memblock.h | 16 +++++++++++++++-
> >>> > mm/memblock.c | 32 ++++++++++++++++++++++++--------
> >>> > 2 files changed, 39 insertions(+), 9 deletions(-)
> >>> >
> >>> >diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> >>> >index e79eb6ac516f..65e274550f5d 100644
> >>> >--- a/include/linux/memblock.h
> >>> >+++ b/include/linux/memblock.h
> >>> >@@ -50,6 +50,7 @@ enum memblock_flags {
> >>> > MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
> >>> > MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
> >>> > MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
> >>> >+ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
> >>>
> >>> Above memblock_flags, there are comments on explaining those flags.
> >>>
> >>> Seems we miss it for MEMBLOCK_RSRV_KERN.
> >>
> >>Right, thanks!
> >>
> >>> >
> >>> > #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
> >>> >@@ -1459,14 +1460,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> >>> > again:
> >>> > found = memblock_find_in_range_node(size, align, start, end, nid,
> >>> > flags);
> >>> >- if (found && !memblock_reserve(found, size))
> >>> >+ if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN))
> >>>
> >>> Maybe we could use memblock_reserve_kern() directly. If my understanding is
> >>> correct, the reserved region's nid is not used.
> >>
> >>We use nid of reserved regions in reserve_bootmem_region() (commit
> >>61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")) but KHO needs to
> >>know the distribution of reserved memory among the nodes before
> >>memmap_init_reserved_pages().
> >>
> >
> >I took another look into this commit. There may be a rare corner case that
> >leaves a reserved region with no nid set.
> >
> >memmap_init_reserved_pages()
> > for_each_mem_region() {
> > ...
> > memblock_set_node(start, end, &memblock.reserved, nid);
> > }
> >
> >We leverage the iteration here to set nid to all regions in memblock.reserved.
> >But memblock_set_node() may call memblock_double_array() to expand the array,
> >which may grab a range before the current start. So we would miss setting the
> >correct nid on the new reserved region.
> >
> >I have tried to create a case in memblock test. This would happen when there
> >are 126 memblock.reserved regions, and the last region spans the last two
> >nodes.
> >
> >One way to fix this is to compare type->max in memblock_set_node() and check
> >the return value in memmap_init_reserved_pages(). If we find the size
> >changed, repeat the iteration.
> >
> >But this is a very minor one, so I'm not sure it is worth fixing.
> >
>
> Hi, Mike
>
> I have done a user space test which shows we can indeed end up with a region
> that has no nid set.
>
> I am not sure whether you are OK with my approach to fixing it.
Wouldn't it be better to check for a change in reserved.max in
memmap_init_reserved_pages()?
> --
> Wei Yang
> Help you, Help me
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-03-11 5:27 ` Mike Rapoport
@ 2025-03-11 13:41 ` Wei Yang
2025-03-12 5:22 ` Mike Rapoport
0 siblings, 1 reply; 97+ messages in thread
From: Wei Yang @ 2025-03-11 13:41 UTC (permalink / raw)
To: Mike Rapoport
Cc: Wei Yang, linux-kernel, Alexander Graf, Andrew Morton,
Andy Lutomirski, Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Tue, Mar 11, 2025 at 07:27:23AM +0200, Mike Rapoport wrote:
>Hi Wei,
>
>On Mon, Mar 10, 2025 at 09:51:24AM +0000, Wei Yang wrote:
>> On Sun, Feb 23, 2025 at 12:22:29AM +0000, Wei Yang wrote:
>> >On Wed, Feb 19, 2025 at 09:24:31AM +0200, Mike Rapoport wrote:
>> >>Hi,
>> >>
>> >>On Tue, Feb 18, 2025 at 03:50:04PM +0000, Wei Yang wrote:
>> >>> On Thu, Feb 06, 2025 at 03:27:42PM +0200, Mike Rapoport wrote:
>> >>> >From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>> >>> >
>> >>> >to denote areas that were reserved for kernel use either directly with
>> >>> >memblock_reserve_kern() or via memblock allocations.
>> >>> >
>> >>> >Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>> >>> >---
>> >>> > include/linux/memblock.h | 16 +++++++++++++++-
>> >>> > mm/memblock.c | 32 ++++++++++++++++++++++++--------
>> >>> > 2 files changed, 39 insertions(+), 9 deletions(-)
>> >>> >
>> >>> >diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>> >>> >index e79eb6ac516f..65e274550f5d 100644
>> >>> >--- a/include/linux/memblock.h
>> >>> >+++ b/include/linux/memblock.h
>> >>> >@@ -50,6 +50,7 @@ enum memblock_flags {
>> >>> > MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
>> >>> > MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
>> >>> > MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
>> >>> >+ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */
>> >>>
>> >>> Above memblock_flags, there are comments explaining those flags.
>> >>>
>> >>> It seems we are missing one for MEMBLOCK_RSRV_KERN.
>> >>
>> >>Right, thanks!
>> >>
>> >>> >
>> >>> > #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
>> >>> >@@ -1459,14 +1460,14 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>> >>> > again:
>> >>> > found = memblock_find_in_range_node(size, align, start, end, nid,
>> >>> > flags);
>> >>> >- if (found && !memblock_reserve(found, size))
>> >>> >+ if (found && !__memblock_reserve(found, size, nid, MEMBLOCK_RSRV_KERN))
>> >>>
>> >>> Maybe we could use memblock_reserve_kern() directly. If my understanding is
>> >>> correct, the reserved region's nid is not used.
>> >>
>> >>We do use the nid of reserved regions, in reserve_bootmem_region() (commit
>> >>61167ad5fecd ("mm: pass nid to reserve_bootmem_region()")), but KHO needs
>> >>to know the distribution of reserved memory among the nodes before
>> >>memmap_init_reserved_pages().
>> >>
>> >
>> >I took another look at this commit. There may be a corner case that
>> >leaves a reserved region with no nid set.
>> >
>> >memmap_init_reserved_pages()
>> > for_each_mem_region() {
>> > ...
>> > memblock_set_node(start, end, &memblock.reserved, nid);
>> > }
>> >
>> >We leverage this iteration to set the nid for all regions in memblock.reserved.
>> >But memblock_set_node() may call memblock_double_array() to expand the array,
>> >which may reserve a range before the current start. So we would miss setting
>> >the correct nid for the newly reserved region.
>> >
>> >I have tried to create a case in the memblock test. This would happen when
>> >there are 126 memblock.reserved regions and the last region spans the last
>> >two nodes.
>> >
>> >One way to fix this is to compare type->max in memblock_set_node() and
>> >return whether it changed. Then check this return value in
>> >memmap_init_reserved_pages(); if the array size changed, repeat the
>> >iteration.
>> >
>> >But this is a very rare corner case, so I am not sure it is worth fixing.
>> >
>>
>> Hi, Mike
>>
>> I have done a user-space test which shows that we may leave a region with
>> no nid set.
>>
>> I am not sure whether you are OK with my approach to fixing it.
>
>Wouldn't it be better to check for a change in reserved.max in
>memmap_init_reserved_pages()?
>
Sounds better.
Previously I thought we needed to hide this detail from callers, but actually
it is already in memblock.c :-)
If you agree, I would like to prepare a fix.
>> --
>> Wei Yang
>> Help you, Help me
>
>--
>Sincerely yours,
>Mike.
--
Wei Yang
Help you, Help me
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-03-11 13:41 ` Wei Yang
@ 2025-03-12 5:22 ` Mike Rapoport
0 siblings, 0 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-03-12 5:22 UTC (permalink / raw)
To: Wei Yang
Cc: linux-kernel, Alexander Graf, Andrew Morton, Andy Lutomirski,
Anthony Yznaga, Arnd Bergmann, Ashish Kalra,
Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
Dave Hansen, David Woodhouse, Eric Biederman, Ingo Molnar,
James Gowans, Jonathan Corbet, Krzysztof Kozlowski, Mark Rutland,
Paolo Bonzini, Pasha Tatashin, H. Peter Anvin, Peter Zijlstra,
Pratyush Yadav, Rob Herring, Rob Herring, Saravana Kannan,
Stanislav Kinsburskii, Steven Rostedt, Thomas Gleixner,
Tom Lendacky, Usama Arif, Will Deacon, devicetree, kexec,
linux-arm-kernel, linux-doc, linux-mm, x86
On Tue, Mar 11, 2025 at 01:41:26PM +0000, Wei Yang wrote:
> On Tue, Mar 11, 2025 at 07:27:23AM +0200, Mike Rapoport wrote:
> >> >>
> >> >
> >> >I took another look at this commit. There may be a corner case that
> >> >leaves a reserved region with no nid set.
> >> >
> >> >memmap_init_reserved_pages()
> >> > for_each_mem_region() {
> >> > ...
> >> > memblock_set_node(start, end, &memblock.reserved, nid);
> >> > }
> >> >
> >> >We leverage this iteration to set the nid for all regions in memblock.reserved.
> >> >But memblock_set_node() may call memblock_double_array() to expand the array,
> >> >which may reserve a range before the current start. So we would miss setting
> >> >the correct nid for the newly reserved region.
> >> >
> >> >I have tried to create a case in the memblock test. This would happen when
> >> >there are 126 memblock.reserved regions and the last region spans the last
> >> >two nodes.
> >> >
> >> >One way to fix this is to compare type->max in memblock_set_node() and
> >> >return whether it changed. Then check this return value in
> >> >memmap_init_reserved_pages(); if the array size changed, repeat the
> >> >iteration.
> >> >
> >> >But this is a very rare corner case, so I am not sure it is worth fixing.
> >> >
> >>
> >> Hi, Mike
> >>
> >> I have done a user-space test which shows that we may leave a region with
> >> no nid set.
> >>
> >> I am not sure whether you are OK with my approach to fixing it.
> >
> >Wouldn't it be better to check for a change in reserved.max in
> >memmap_init_reserved_pages()?
> >
>
> Sounds better.
>
> Previously I thought we needed to hide this detail from callers, but
> actually it is already in memblock.c :-)
>
> If you agree, I would like to prepare a fix.
Sure :)
> >> --
> >> Wei Yang
> >> Help you, Help me
--
Sincerely yours,
Mike.
* Re: [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag
2025-02-26 1:53 ` Changyuan Lyu
@ 2025-03-13 15:41 ` Mike Rapoport
0 siblings, 0 replies; 97+ messages in thread
From: Mike Rapoport @ 2025-03-13 15:41 UTC (permalink / raw)
To: Changyuan Lyu
Cc: Alexander Graf, Andrew Morton, Andy Lutomirski, Anthony Yznaga,
Arnd Bergmann, Ashish Kalra, Benjamin Herrenschmidt,
Borislav Petkov, Catalin Marinas, Dave Hansen, David Woodhouse,
Eric Biederman, Ingo Molnar, James Gowans, Jonathan Corbet,
Krzysztof Kozlowski, Mark Rutland, Paolo Bonzini, Pasha Tatashin,
H. Peter Anvin, Peter Zijlstra, Pratyush Yadav, Rob Herring,
Rob Herring, Saravana Kannan, Stanislav Kinsburskii,
Steven Rostedt, Thomas Gleixner, Tom Lendacky, Usama Arif,
Will Deacon, devicetree, kexec, linux-arm-kernel, linux-doc,
linux-mm, x86
Hi Changyuan,
On Tue, Feb 25, 2025 at 05:53:39PM -0800, Changyuan Lyu wrote:
> Hi Mike,
>
> On Thu, 6 Feb 2025 15:27:42 +0200, Mike Rapoport <rppt@kernel.org> wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > to denote areas that were reserved for kernel use either directly with
> > memblock_reserve_kern() or via memblock allocations.
> >
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> > include/linux/memblock.h | 16 +++++++++++++++-
> > mm/memblock.c | 32 ++++++++++++++++++++++++--------
> > 2 files changed, 39 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > index e79eb6ac516f..65e274550f5d 100644
> > --- a/include/linux/memblock.h
> > +++ b/include/linux/memblock.h
> > ......
> > @@ -116,7 +117,19 @@ int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid,
> > int memblock_add(phys_addr_t base, phys_addr_t size);
> > int memblock_remove(phys_addr_t base, phys_addr_t size);
> > int memblock_phys_free(phys_addr_t base, phys_addr_t size);
> > -int memblock_reserve(phys_addr_t base, phys_addr_t size);
> > +int __memblock_reserve(phys_addr_t base, phys_addr_t size, int nid,
> > + enum memblock_flags flags);
> > +
> > +static __always_inline int memblock_reserve(phys_addr_t base, phys_addr_t size)
> > +{
> > + return __memblock_reserve(base, size, NUMA_NO_NODE, 0);
>
> Without this patch `memblock_reserve` eventually calls `memblock_add_range`
> with `MAX_NUMNODES`, but with this patch, `memblock_reserve` calls
> `memblock_add_range` with `NUMA_NO_NODE`. Is it intended or an
> accidental typo? Thanks!
We were mixing NUMA_NO_NODE and MAX_NUMNODES for memory with an undefined
node id for a while, MAX_NUMNODES being the older and NUMA_NO_NODE the newer
define for the same thing.
To make sure both are treated correctly in memblock, we use
numa_valid_node() to check whether a range has a node id set.
> Best,
> Changyuan
--
Sincerely yours,
Mike.
Thread overview: 97+ messages
2025-02-06 13:27 [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Mike Rapoport
2025-02-06 13:27 ` [PATCH v4 01/14] mm/mm_init: rename init_reserved_page to init_deferred_page Mike Rapoport
2025-02-18 14:59 ` Wei Yang
2025-02-19 7:13 ` Mike Rapoport
2025-02-20 8:36 ` Wei Yang
2025-02-20 14:54 ` Mike Rapoport
2025-02-25 7:40 ` Mike Rapoport
2025-02-06 13:27 ` [PATCH v4 02/14] memblock: add MEMBLOCK_RSRV_KERN flag Mike Rapoport
2025-02-18 15:50 ` Wei Yang
2025-02-19 7:24 ` Mike Rapoport
2025-02-23 0:22 ` Wei Yang
2025-03-10 9:51 ` Wei Yang
2025-03-11 5:27 ` Mike Rapoport
2025-03-11 13:41 ` Wei Yang
2025-03-12 5:22 ` Mike Rapoport
2025-02-24 1:31 ` Wei Yang
2025-02-25 7:46 ` Mike Rapoport
2025-02-26 2:09 ` Wei Yang
2025-03-10 7:56 ` Wei Yang
2025-03-10 8:28 ` Mike Rapoport
2025-03-10 9:42 ` Wei Yang
2025-02-26 1:53 ` Changyuan Lyu
2025-03-13 15:41 ` Mike Rapoport
2025-02-06 13:27 ` [PATCH v4 03/14] memblock: Add support for scratch memory Mike Rapoport
2025-02-24 2:50 ` Wei Yang
2025-02-25 7:47 ` Mike Rapoport
2025-02-06 13:27 ` [PATCH v4 04/14] memblock: introduce memmap_init_kho_scratch() Mike Rapoport
2025-02-24 3:02 ` Wei Yang
2025-02-06 13:27 ` [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers Mike Rapoport
2025-02-10 20:22 ` Jason Gunthorpe
2025-02-10 20:58 ` Pasha Tatashin
2025-02-11 12:49 ` Jason Gunthorpe
2025-02-11 16:14 ` Pasha Tatashin
2025-02-11 16:37 ` Jason Gunthorpe
2025-02-12 15:23 ` Jason Gunthorpe
2025-02-12 16:39 ` Mike Rapoport
2025-02-12 17:43 ` Jason Gunthorpe
2025-02-23 18:51 ` Mike Rapoport
2025-02-24 14:28 ` Jason Gunthorpe
2025-02-12 12:29 ` Thomas Weißschuh
2025-02-06 13:27 ` [PATCH v4 06/14] kexec: Add KHO parsing support Mike Rapoport
2025-02-10 20:50 ` Jason Gunthorpe
2025-03-10 16:20 ` Pratyush Yadav
2025-03-10 17:08 ` Mike Rapoport
2025-02-06 13:27 ` [PATCH v4 07/14] kexec: Add KHO support to kexec file loads Mike Rapoport
2025-02-06 13:27 ` [PATCH v4 08/14] kexec: Add config option for KHO Mike Rapoport
2025-02-06 13:27 ` [PATCH v4 09/14] kexec: Add documentation " Mike Rapoport
2025-02-10 19:26 ` Jason Gunthorpe
2025-02-06 13:27 ` [PATCH v4 10/14] arm64: Add KHO support Mike Rapoport
2025-02-09 10:38 ` Krzysztof Kozlowski
2025-02-06 13:27 ` [PATCH v4 11/14] x86/setup: use memblock_reserve_kern for memory used by kernel Mike Rapoport
2025-02-06 13:27 ` [PATCH v4 12/14] x86: Add KHO support Mike Rapoport
2025-02-24 7:13 ` Wei Yang
2025-02-24 14:36 ` Mike Rapoport
2025-02-25 0:00 ` Wei Yang
2025-02-06 13:27 ` [PATCH v4 13/14] memblock: Add KHO support for reserve_mem Mike Rapoport
2025-02-10 16:03 ` Rob Herring
2025-02-12 16:30 ` Mike Rapoport
2025-02-17 4:04 ` Wei Yang
2025-02-19 7:25 ` Mike Rapoport
2025-02-06 13:27 ` [PATCH v4 14/14] Documentation: KHO: Add memblock bindings Mike Rapoport
2025-02-09 10:29 ` Krzysztof Kozlowski
2025-02-09 15:10 ` Mike Rapoport
2025-02-09 15:23 ` Krzysztof Kozlowski
2025-02-09 20:41 ` Mike Rapoport
2025-02-09 20:49 ` Krzysztof Kozlowski
2025-02-09 20:50 ` Krzysztof Kozlowski
2025-02-10 19:15 ` Jason Gunthorpe
2025-02-10 19:27 ` Krzysztof Kozlowski
2025-02-10 20:20 ` Jason Gunthorpe
2025-02-12 16:00 ` Mike Rapoport
2025-02-07 0:29 ` [PATCH v4 00/14] kexec: introduce Kexec HandOver (KHO) Andrew Morton
2025-02-07 1:28 ` Pasha Tatashin
2025-02-08 1:38 ` Baoquan He
2025-02-08 8:41 ` Mike Rapoport
2025-02-08 11:13 ` Baoquan He
2025-02-09 0:23 ` Pasha Tatashin
2025-02-09 3:07 ` Baoquan He
2025-02-07 8:06 ` Mike Rapoport
2025-02-09 10:33 ` Krzysztof Kozlowski
2025-02-07 4:50 ` Andrew Morton
2025-02-07 8:01 ` Mike Rapoport
2025-02-08 23:39 ` Cong Wang
2025-02-09 0:13 ` Pasha Tatashin
2025-02-09 1:00 ` Cong Wang
2025-02-09 0:51 ` Cong Wang
2025-02-17 3:19 ` RuiRui Yang
2025-02-19 7:32 ` Mike Rapoport
2025-02-19 12:49 ` Dave Young
2025-02-19 13:54 ` Alexander Graf
2025-02-20 1:49 ` Dave Young
2025-02-20 16:43 ` Alexander Gordeev
2025-02-23 17:54 ` Mike Rapoport
2025-02-26 20:08 ` Pratyush Yadav
2025-02-28 20:20 ` Mike Rapoport
2025-02-28 23:04 ` Pratyush Yadav
2025-03-02 9:52 ` Mike Rapoport