* [PATCH 0/2] KHO: Fix metadata allocation in scratch area
@ 2025-10-15  5:31 Pasha Tatashin
  2025-10-15  5:31 ` [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory " Pasha Tatashin
  2025-10-15  5:31 ` [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator Pasha Tatashin
  0 siblings, 2 replies; 24+ messages in thread
From: Pasha Tatashin @ 2025-10-15  5:31 UTC (permalink / raw)
  To: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, pasha.tatashin, pratyush, rdunlap,
	rppt, tj, jasonmiu, dmatlack, skhawaja

This series fixes a memory corruption bug in KHO that occurs when KFENCE
is enabled.

The root cause is that KHO metadata, allocated via kzalloc(), can be
randomly serviced by kfence_alloc(). When a kernel boots via KHO, the
early memblock allocator is restricted to a "scratch area". This forces
the KFENCE pool to be allocated within this scratch area, creating a
conflict. If KHO metadata is subsequently placed in this pool, it gets
corrupted during the next kexec operation.
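
For reference, a heavily simplified sketch of the allocation path that bites
here (not the exact mainline code; kfence_alloc() is the real interception
hook, the surrounding helper and fallback are illustrative placeholders):

	/*
	 * Simplified slab fast path: KFENCE may randomly claim the
	 * allocation and return an object from its dedicated pool. On a
	 * KHO boot that pool sits inside the scratch area, so KHO
	 * metadata stored there is overwritten by the next kexec.
	 */
	static void *slab_alloc_sketch(struct kmem_cache *s, gfp_t flags,
				       size_t size)
	{
		void *obj = kfence_alloc(s, size, flags);

		if (obj)
			return obj;	/* object lives in the KFENCE pool */

		return slub_alloc_fallback(s, flags, size);	/* illustrative */
	}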

The series is structured in two parts:
Patch 1/2 introduces a debug-only feature (CONFIG_KEXEC_HANDOVER_DEBUG)
that adds checks to detect and fail any operation that attempts to place
KHO metadata or preserved memory within the scratch area. This serves as
a validation and diagnostic tool to confirm the problem without
affecting production builds.

Patch 2/2 provides the fix by modifying KHO to allocate its metadata
directly from the buddy allocator instead of SLUB. This bypasses the
KFENCE interception entirely.

Pasha Tatashin (2):
  liveupdate: kho: warn and fail on metadata or preserved memory in
    scratch area
  liveupdate: kho: allocate metadata directly from the buddy allocator

 kernel/liveupdate/Kconfig                   | 15 ++++++
 kernel/liveupdate/kexec_handover.c          | 51 ++++++++++++++++-----
 kernel/liveupdate/kexec_handover_debug.c    | 18 ++++++++
 kernel/liveupdate/kexec_handover_internal.h |  9 ++++
 4 files changed, 81 insertions(+), 12 deletions(-)


base-commit: 0b2f041c47acb45db82b4e847af6e17eb66cd32d
-- 
2.51.0.788.g6d19910ace-goog




* [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory in scratch area
  2025-10-15  5:31 [PATCH 0/2] KHO: Fix metadata allocation in scratch area Pasha Tatashin
@ 2025-10-15  5:31 ` Pasha Tatashin
  2025-10-15  8:21   ` Mike Rapoport
  2025-10-15 12:10   ` Pratyush Yadav
  2025-10-15  5:31 ` [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator Pasha Tatashin
  1 sibling, 2 replies; 24+ messages in thread
From: Pasha Tatashin @ 2025-10-15  5:31 UTC (permalink / raw)
  To: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, pasha.tatashin, pratyush, rdunlap,
	rppt, tj, jasonmiu, dmatlack, skhawaja

It is invalid for KHO metadata or preserved memory regions to be located
within the KHO scratch area, as this area is overwritten when the next
kernel is loaded, and used early in boot by the next kernel. This can
lead to memory corruption.

Adds checks to kho_preserve_* and KHO's internal metadata allocators
(xa_load_or_alloc, new_chunk) to verify that the physical address of the
memory does not overlap with any defined scratch region. If an overlap
is detected, the operation will fail and a WARN_ON is triggered. To
avoid performance overhead in production kernels, these checks are
enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 kernel/liveupdate/Kconfig                   | 15 ++++++++++
 kernel/liveupdate/kexec_handover.c          | 32 ++++++++++++++++++---
 kernel/liveupdate/kexec_handover_debug.c    | 18 ++++++++++++
 kernel/liveupdate/kexec_handover_internal.h |  9 ++++++
 4 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
index 522b9f74d605..d119f4f3f4b1 100644
--- a/kernel/liveupdate/Kconfig
+++ b/kernel/liveupdate/Kconfig
@@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS
 	  Also, enables inspecting the KHO fdt trees with the debugfs binary
 	  blobs.
 
+config KEXEC_HANDOVER_DEBUG
+	bool "Enable Kexec Handover debug checks"
+	depends on KEXEC_HANDOVER_DEBUGFS
+	help
+	  This option enables extra sanity checks for the Kexec Handover
+	  subsystem.
+
+	  These checks verify that neither preserved memory regions nor KHO's
+	  internal metadata are allocated from within a KHO scratch area.
+	  An overlap can lead to memory corruption during a subsequent kexec
+	  operation.
+
+	  If an overlap is detected, the kernel will print a warning and the
+	  offending operation will fail. This should only be enabled for
+	  debugging purposes due to runtime overhead.
 endmenu
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 5da21f1510cc..ef1e6f7a234b 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
 	if (!elm)
 		return ERR_PTR(-ENOMEM);
 
+	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
+		kfree(elm);
+		return ERR_PTR(-EINVAL);
+	}
+
 	res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
 	if (xa_is_err(res))
 		res = ERR_PTR(xa_err(res));
@@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
 
 	chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
 	if (!chunk)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
+
+	if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
+		kfree(chunk);
+		return ERR_PTR(-EINVAL);
+	}
+
 	chunk->hdr.order = order;
 	if (cur_chunk)
 		KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk);
@@ -379,14 +390,17 @@ static int kho_mem_serialize(struct kho_out *kho_out)
 	struct khoser_mem_chunk *chunk = NULL;
 	struct kho_mem_phys *physxa;
 	unsigned long order;
+	int ret = -ENOMEM;
 
 	xa_for_each(&kho_out->track.orders, order, physxa) {
 		struct kho_mem_phys_bits *bits;
 		unsigned long phys;
 
 		chunk = new_chunk(chunk, order);
-		if (!chunk)
+		if (IS_ERR(chunk)) {
+			ret = PTR_ERR(chunk);
 			goto err_free;
+		}
 
 		if (!first_chunk)
 			first_chunk = chunk;
@@ -396,8 +410,10 @@ static int kho_mem_serialize(struct kho_out *kho_out)
 
 			if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) {
 				chunk = new_chunk(chunk, order);
-				if (!chunk)
+				if (IS_ERR(chunk)) {
+					ret = PTR_ERR(chunk);
 					goto err_free;
+				}
 			}
 
 			elm = &chunk->bitmaps[chunk->hdr.num_elms];
@@ -414,7 +430,7 @@ static int kho_mem_serialize(struct kho_out *kho_out)
 
 err_free:
 	kho_mem_ser_free(first_chunk);
-	return -ENOMEM;
+	return ret;
 }
 
 static void __init deserialize_bitmap(unsigned int order,
@@ -737,6 +753,9 @@ int kho_preserve_folio(struct folio *folio)
 	const unsigned int order = folio_order(folio);
 	struct kho_mem_track *track = &kho_out.track;
 
+	if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order)))
+		return -EINVAL;
+
 	return __kho_preserve_order(track, pfn, order);
 }
 EXPORT_SYMBOL_GPL(kho_preserve_folio);
@@ -784,6 +803,11 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages)
 	unsigned long failed_pfn = 0;
 	int err = 0;
 
+	if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT,
+					nr_pages << PAGE_SHIFT))) {
+		return -EINVAL;
+	}
+
 	while (pfn < end_pfn) {
 		const unsigned int order =
 			min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
diff --git a/kernel/liveupdate/kexec_handover_debug.c b/kernel/liveupdate/kexec_handover_debug.c
index eb47f000887d..294d1d290142 100644
--- a/kernel/liveupdate/kexec_handover_debug.c
+++ b/kernel/liveupdate/kexec_handover_debug.c
@@ -214,3 +214,21 @@ __init int kho_debugfs_init(void)
 		return -ENOENT;
 	return 0;
 }
+
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
+bool kho_scratch_overlap(phys_addr_t phys, size_t size)
+{
+	phys_addr_t scratch_start, scratch_end;
+	unsigned int i;
+
+	for (i = 0; i < kho_scratch_cnt; i++) {
+		scratch_start = kho_scratch[i].addr;
+		scratch_end = kho_scratch[i].addr + kho_scratch[i].size - 1;
+
+		if (phys <= scratch_end && (phys + size) > scratch_start)
+			return true;
+	}
+
+	return false;
+}
+#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
diff --git a/kernel/liveupdate/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h
index b3fc1957affa..92798346fa5a 100644
--- a/kernel/liveupdate/kexec_handover_internal.h
+++ b/kernel/liveupdate/kexec_handover_internal.h
@@ -44,4 +44,13 @@ static inline void kho_debugfs_fdt_remove(struct kho_debugfs *dbg,
 					  void *fdt) { }
 #endif /* CONFIG_KEXEC_HANDOVER_DEBUGFS */
 
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
+bool kho_scratch_overlap(phys_addr_t phys, size_t size);
+#else
+static inline bool kho_scratch_overlap(phys_addr_t phys, size_t size)
+{
+	return false;
+}
+#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
+
 #endif /* LINUX_KEXEC_HANDOVER_INTERNAL_H */
-- 
2.51.0.788.g6d19910ace-goog




* [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-15  5:31 [PATCH 0/2] KHO: Fix metadata allocation in scratch area Pasha Tatashin
  2025-10-15  5:31 ` [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory " Pasha Tatashin
@ 2025-10-15  5:31 ` Pasha Tatashin
  2025-10-15  8:37   ` Mike Rapoport
                     ` (2 more replies)
  1 sibling, 3 replies; 24+ messages in thread
From: Pasha Tatashin @ 2025-10-15  5:31 UTC (permalink / raw)
  To: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, pasha.tatashin, pratyush, rdunlap,
	rppt, tj, jasonmiu, dmatlack, skhawaja

KHO allocates metadata for its preserved memory map using the SLUB
allocator via kzalloc(). This metadata is temporary and is used by the
next kernel during early boot to find preserved memory.

A problem arises when KFENCE is enabled. kzalloc() calls can be
randomly intercepted by kfence_alloc(), which services the allocation
from a dedicated KFENCE memory pool. This pool is allocated early in
boot via memblock.

When booting via KHO, the memblock allocator is restricted to a "scratch
area", forcing the KFENCE pool to be allocated within it. This creates a
conflict, as the scratch area is expected to be ephemeral and
overwriteable by a subsequent kexec. If KHO metadata is placed in this
KFENCE pool, it leads to memory corruption when the next kernel is
loaded.

To fix this, modify KHO to allocate its metadata directly from the buddy
allocator instead of SLUB.

As part of this change, the metadata bitmap size is increased from 512
bytes to PAGE_SIZE to align with the page-based allocations from the
buddy system.

Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 kernel/liveupdate/kexec_handover.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index ef1e6f7a234b..519de6d68b27 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -66,10 +66,10 @@ early_param("kho", kho_parse_enable);
  * Keep track of memory that is to be preserved across KHO.
  *
  * The serializing side uses two levels of xarrays to manage chunks of per-order
- * 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
- * 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
- * each bitmap will cover 16M of address space. Thus, for 16G of memory at most
- * 512K of bitmap memory will be needed for order 0.
+ * PAGE_SIZE byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order
+ * of an 8TB system would fit inside a single 4096 byte bitmap. For order 0
+ * allocations each bitmap will cover 128M of address space. Thus, for 16G of
+ * memory at most 512K of bitmap memory will be needed for order 0.
  *
  * This approach is fully incremental, as the serialization progresses folios
  * can continue be aggregated to the tracker. The final step, immediately prior
@@ -77,7 +77,7 @@ early_param("kho", kho_parse_enable);
  * successor kernel to parse.
  */
 
-#define PRESERVE_BITS (512 * 8)
+#define PRESERVE_BITS (PAGE_SIZE * 8)
 
 struct kho_mem_phys_bits {
 	DECLARE_BITMAP(preserve, PRESERVE_BITS);
@@ -131,18 +131,21 @@ static struct kho_out kho_out = {
 
 static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
 {
+	unsigned int order;
 	void *elm, *res;
 
 	elm = xa_load(xa, index);
 	if (elm)
 		return elm;
 
-	elm = kzalloc(sz, GFP_KERNEL);
+	order = get_order(sz);
+	elm = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
 	if (!elm)
 		return ERR_PTR(-ENOMEM);
 
-	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
-		kfree(elm);
+	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm),
+					PAGE_SIZE << order))) {
+		free_pages((unsigned long)elm, order);
 		return ERR_PTR(-EINVAL);
 	}
 
@@ -151,7 +154,7 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
 		res = ERR_PTR(xa_err(res));
 
 	if (res) {
-		kfree(elm);
+		free_pages((unsigned long)elm, order);
 		return res;
 	}
 
@@ -357,7 +360,7 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
 {
 	struct khoser_mem_chunk *chunk;
 
-	chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	chunk = (void *)get_zeroed_page(GFP_KERNEL);
 	if (!chunk)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.51.0.788.g6d19910ace-goog




* Re: [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory in scratch area
  2025-10-15  5:31 ` [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory " Pasha Tatashin
@ 2025-10-15  8:21   ` Mike Rapoport
  2025-10-15 12:36     ` Pasha Tatashin
  2025-10-15 12:10   ` Pratyush Yadav
  1 sibling, 1 reply; 24+ messages in thread
From: Mike Rapoport @ 2025-10-15  8:21 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, pratyush, rdunlap, tj, jasonmiu,
	dmatlack, skhawaja

Hi Pasha,

On Wed, Oct 15, 2025 at 01:31:20AM -0400, Pasha Tatashin wrote:
>
> Subject: [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory in scratch area

Hmm, can't say I like the "liveupdate: kho:" prefix. "KHO:" on its own is
shorter and reflects that this is a KHO change rather than a liveupdate one.

> It is invalid for KHO metadata or preserved memory regions to be located
> within the KHO scratch area, as this area is overwritten when the next
> kernel is loaded, and used early in boot by the next kernel. This can
> lead to memory corruption.
> 
> Adds checks to kho_preserve_* and KHO's internal metadata allocators
> (xa_load_or_alloc, new_chunk) to verify that the physical address of the
> memory does not overlap with any defined scratch region. If an overlap
> is detected, the operation will fail and a WARN_ON is triggered. To
> avoid performance overhead in production kernels, these checks are
> enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  kernel/liveupdate/Kconfig                   | 15 ++++++++++

Feels like kernel/liveupdate/Makefile change is missing

>  kernel/liveupdate/kexec_handover.c          | 32 ++++++++++++++++++---
>  kernel/liveupdate/kexec_handover_debug.c    | 18 ++++++++++++
>  kernel/liveupdate/kexec_handover_internal.h |  9 ++++++
>  4 files changed, 70 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
> index 522b9f74d605..d119f4f3f4b1 100644
> --- a/kernel/liveupdate/Kconfig
> +++ b/kernel/liveupdate/Kconfig
> @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS
>  	  Also, enables inspecting the KHO fdt trees with the debugfs binary
>  	  blobs.
>  
> +config KEXEC_HANDOVER_DEBUG
> +	bool "Enable Kexec Handover debug checks"
> +	depends on KEXEC_HANDOVER_DEBUGFS
> +	help
> +	  This option enables extra sanity checks for the Kexec Handover
> +	  subsystem.
> +
> +	  These checks verify that neither preserved memory regions nor KHO's
> +	  internal metadata are allocated from within a KHO scratch area.
> +	  An overlap can lead to memory corruption during a subsequent kexec
> +	  operation.
> +
> +	  If an overlap is detected, the kernel will print a warning and the
> +	  offending operation will fail. This should only be enabled for
> +	  debugging purposes due to runtime overhead.
>  endmenu
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index 5da21f1510cc..ef1e6f7a234b 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
>  	if (!elm)
>  		return ERR_PTR(-ENOMEM);
>  
> +	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
> +		kfree(elm);

I think __free() cleanup would be better than this.
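
Something along these lines, perhaps (just a rough, untested sketch assuming
the kfree cleanup helper from <linux/cleanup.h>):

	static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
	{
		void *elm __free(kfree) = NULL;
		void *res;

		res = xa_load(xa, index);
		if (res)
			return res;

		elm = kzalloc(sz, GFP_KERNEL);
		if (!elm)
			return ERR_PTR(-ENOMEM);

		if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz)))
			return ERR_PTR(-EINVAL);	/* elm freed automatically */

		res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
		if (xa_is_err(res))
			return ERR_PTR(xa_err(res));
		if (res)
			return res;	/* lost the race, elm freed automatically */

		return no_free_ptr(elm);	/* keep the element we stored */
	}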

> +		return ERR_PTR(-EINVAL);
> +	}
> +
>  	res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
>  	if (xa_is_err(res))
>  		res = ERR_PTR(xa_err(res));
> @@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
>  
>  	chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
>  	if (!chunk)
> -		return NULL;
> +		return ERR_PTR(-ENOMEM);

I don't think it's important to return -errno here; it's not like it's
called from a syscall where we need to set errno for userspace.
BTW, the same applies to xa_load_or_alloc() IMO.

> +
> +	if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
> +		kfree(chunk);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
>  	chunk->hdr.order = order;
>  	if (cur_chunk)
>  		KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk);
> @@ -379,14 +390,17 @@ static int kho_mem_serialize(struct kho_out *kho_out)
>  	struct khoser_mem_chunk *chunk = NULL;
>  	struct kho_mem_phys *physxa;
>  	unsigned long order;
> +	int ret = -ENOMEM;
>  
>  	xa_for_each(&kho_out->track.orders, order, physxa) {
>  		struct kho_mem_phys_bits *bits;
>  		unsigned long phys;
>  
>  		chunk = new_chunk(chunk, order);
> -		if (!chunk)
> +		if (IS_ERR(chunk)) {
> +			ret = PTR_ERR(chunk);

... and indeed, -errno from new_chunk() just makes things more complex :(

>  			goto err_free;
> +		}
>  
>  		if (!first_chunk)
>  			first_chunk = chunk;
> @@ -396,8 +410,10 @@ static int kho_mem_serialize(struct kho_out *kho_out)
>  
>  			if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) {
>  				chunk = new_chunk(chunk, order);
> -				if (!chunk)
> +				if (IS_ERR(chunk)) {
> +					ret = PTR_ERR(chunk);
>  					goto err_free;
> +				}
>  			}
>  
>  			elm = &chunk->bitmaps[chunk->hdr.num_elms];
> @@ -414,7 +430,7 @@ static int kho_mem_serialize(struct kho_out *kho_out)
>  
>  err_free:
>  	kho_mem_ser_free(first_chunk);
> -	return -ENOMEM;
> +	return ret;
>  }
>  
>  static void __init deserialize_bitmap(unsigned int order,
> @@ -737,6 +753,9 @@ int kho_preserve_folio(struct folio *folio)
>  	const unsigned int order = folio_order(folio);
>  	struct kho_mem_track *track = &kho_out.track;
>  
> +	if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order)))
> +		return -EINVAL;
> +
>  	return __kho_preserve_order(track, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(kho_preserve_folio);
> @@ -784,6 +803,11 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages)
>  	unsigned long failed_pfn = 0;
>  	int err = 0;
>  
> +	if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT,
> +					nr_pages << PAGE_SHIFT))) {
> +		return -EINVAL;
> +	}

Can't we check this in __kho_preserve_order() and not duplicate the code?

> +
>  	while (pfn < end_pfn) {
>  		const unsigned int order =
>  			min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));

-- 
Sincerely yours,
Mike.



* Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-15  5:31 ` [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator Pasha Tatashin
@ 2025-10-15  8:37   ` Mike Rapoport
  2025-10-15 12:46     ` Pasha Tatashin
  2025-10-15 13:05   ` Pratyush Yadav
  2025-10-24 13:21   ` Jason Gunthorpe
  2 siblings, 1 reply; 24+ messages in thread
From: Mike Rapoport @ 2025-10-15  8:37 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, pratyush, rdunlap, tj, jasonmiu,
	dmatlack, skhawaja

On Wed, Oct 15, 2025 at 01:31:21AM -0400, Pasha Tatashin wrote:
> KHO allocates metadata for its preserved memory map using the SLUB
> allocator via kzalloc(). This metadata is temporary and is used by the
> next kernel during early boot to find preserved memory.
> 
> A problem arises when KFENCE is enabled. kzalloc() calls can be
> randomly intercepted by kfence_alloc(), which services the allocation
> from a dedicated KFENCE memory pool. This pool is allocated early in
> boot via memblock.
> 
> When booting via KHO, the memblock allocator is restricted to a "scratch
> area", forcing the KFENCE pool to be allocated within it. This creates a
> conflict, as the scratch area is expected to be ephemeral and
> overwriteable by a subsequent kexec. If KHO metadata is placed in this
> KFENCE pool, it leads to memory corruption when the next kernel is
> loaded.
> 
> To fix this, modify KHO to allocate its metadata directly from the buddy
> allocator instead of SLUB.
> 
> As part of this change, the metadata bitmap size is increased from 512
> bytes to PAGE_SIZE to align with the page-based allocations from the
> buddy system.
> 
> Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  kernel/liveupdate/kexec_handover.c | 23 +++++++++++++----------
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index ef1e6f7a234b..519de6d68b27 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -66,10 +66,10 @@ early_param("kho", kho_parse_enable);
>   * Keep track of memory that is to be preserved across KHO.
>   *
>   * The serializing side uses two levels of xarrays to manage chunks of per-order
> - * 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
> - * 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
> - * each bitmap will cover 16M of address space. Thus, for 16G of memory at most
> - * 512K of bitmap memory will be needed for order 0.
> + * PAGE_SIZE byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order
> + * of an 8TB system would fit inside a single 4096 byte bitmap. For order 0
> + * allocations each bitmap will cover 128M of address space. Thus, for 16G of
> + * memory at most 512K of bitmap memory will be needed for order 0.
>   *
>   * This approach is fully incremental, as the serialization progresses folios
>   * can continue be aggregated to the tracker. The final step, immediately prior
> @@ -77,7 +77,7 @@ early_param("kho", kho_parse_enable);
>   * successor kernel to parse.
>   */
>  
> -#define PRESERVE_BITS (512 * 8)
> +#define PRESERVE_BITS (PAGE_SIZE * 8)
>  
>  struct kho_mem_phys_bits {
>  	DECLARE_BITMAP(preserve, PRESERVE_BITS);
> @@ -131,18 +131,21 @@ static struct kho_out kho_out = {
>  
>  static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)

The name 'xa_load_or_alloc' is confusing now that we only use this function
to allocate bitmaps. I think it should be renamed to reflect that, and its
return type should be 'struct kho_mem_phys_bits'. Then it wouldn't need the
sz parameter and the size calculations below become redundant.

>  {
> +	unsigned int order;
>  	void *elm, *res;
>  
>  	elm = xa_load(xa, index);
>  	if (elm)
>  		return elm;
>  
> -	elm = kzalloc(sz, GFP_KERNEL);
> +	order = get_order(sz);
> +	elm = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
>  	if (!elm)
>  		return ERR_PTR(-ENOMEM);
>  
> -	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
> -		kfree(elm);
> +	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm),
> +					PAGE_SIZE << order))) {
> +		free_pages((unsigned long)elm, order);
>  		return ERR_PTR(-EINVAL);
>  	}
>  

-- 
Sincerely yours,
Mike.



* Re: [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory in scratch area
  2025-10-15  5:31 ` [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory " Pasha Tatashin
  2025-10-15  8:21   ` Mike Rapoport
@ 2025-10-15 12:10   ` Pratyush Yadav
  2025-10-15 12:40     ` Pasha Tatashin
  1 sibling, 1 reply; 24+ messages in thread
From: Pratyush Yadav @ 2025-10-15 12:10 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, pratyush, rdunlap, rppt, tj,
	jasonmiu, dmatlack, skhawaja

On Wed, Oct 15 2025, Pasha Tatashin wrote:

> It is invalid for KHO metadata or preserved memory regions to be located
> within the KHO scratch area, as this area is overwritten when the next
> kernel is loaded, and used early in boot by the next kernel. This can
> lead to memory corruption.
>
> Adds checks to kho_preserve_* and KHO's internal metadata allocators
> (xa_load_or_alloc, new_chunk) to verify that the physical address of the
> memory does not overlap with any defined scratch region. If an overlap
> is detected, the operation will fail and a WARN_ON is triggered. To
> avoid performance overhead in production kernels, these checks are
> enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  kernel/liveupdate/Kconfig                   | 15 ++++++++++
>  kernel/liveupdate/kexec_handover.c          | 32 ++++++++++++++++++---
>  kernel/liveupdate/kexec_handover_debug.c    | 18 ++++++++++++
>  kernel/liveupdate/kexec_handover_internal.h |  9 ++++++
>  4 files changed, 70 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
> index 522b9f74d605..d119f4f3f4b1 100644
> --- a/kernel/liveupdate/Kconfig
> +++ b/kernel/liveupdate/Kconfig
> @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS
>  	  Also, enables inspecting the KHO fdt trees with the debugfs binary
>  	  blobs.
>  
> +config KEXEC_HANDOVER_DEBUG
> +	bool "Enable Kexec Handover debug checks"
> +	depends on KEXEC_HANDOVER_DEBUGFS

Why the dependency on debugfs? Why can't the debug checks be enabled
independently?

> +	help
> +	  This option enables extra sanity checks for the Kexec Handover
> +	  subsystem.
> +
> +	  These checks verify that neither preserved memory regions nor KHO's
> +	  internal metadata are allocated from within a KHO scratch area.
> +	  An overlap can lead to memory corruption during a subsequent kexec
> +	  operation.

I don't think the checks that are done should be listed here since as
soon as another check is added this list will become out of date.

> +
> +	  If an overlap is detected, the kernel will print a warning and the
> +	  offending operation will fail. This should only be enabled for

This also describes the behaviour of the checks, which might change
later. Maybe for some checks the operation won't fail? I suppose just
leave it at "the kernel will print a warning"?

> +	  debugging purposes due to runtime overhead.
>  endmenu
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index 5da21f1510cc..ef1e6f7a234b 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
>  	if (!elm)
>  		return ERR_PTR(-ENOMEM);
>  
> +	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
> +		kfree(elm);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
>  	res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
>  	if (xa_is_err(res))
>  		res = ERR_PTR(xa_err(res));
> @@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
>  
>  	chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
>  	if (!chunk)
> -		return NULL;
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
> +		kfree(chunk);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
>  	chunk->hdr.order = order;
>  	if (cur_chunk)
>  		KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk);
> @@ -379,14 +390,17 @@ static int kho_mem_serialize(struct kho_out *kho_out)
>  	struct khoser_mem_chunk *chunk = NULL;
>  	struct kho_mem_phys *physxa;
>  	unsigned long order;
> +	int ret = -ENOMEM;
>  
>  	xa_for_each(&kho_out->track.orders, order, physxa) {
>  		struct kho_mem_phys_bits *bits;
>  		unsigned long phys;
>  
>  		chunk = new_chunk(chunk, order);
> -		if (!chunk)
> +		if (IS_ERR(chunk)) {
> +			ret = PTR_ERR(chunk);
>  			goto err_free;
> +		}
>  
>  		if (!first_chunk)
>  			first_chunk = chunk;
> @@ -396,8 +410,10 @@ static int kho_mem_serialize(struct kho_out *kho_out)
>  
>  			if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) {
>  				chunk = new_chunk(chunk, order);
> -				if (!chunk)
> +				if (IS_ERR(chunk)) {
> +					ret = PTR_ERR(chunk);
>  					goto err_free;
> +				}
>  			}
>  
>  			elm = &chunk->bitmaps[chunk->hdr.num_elms];
> @@ -414,7 +430,7 @@ static int kho_mem_serialize(struct kho_out *kho_out)
>  
>  err_free:
>  	kho_mem_ser_free(first_chunk);
> -	return -ENOMEM;
> +	return ret;
>  }
>  
>  static void __init deserialize_bitmap(unsigned int order,
> @@ -737,6 +753,9 @@ int kho_preserve_folio(struct folio *folio)
>  	const unsigned int order = folio_order(folio);
>  	struct kho_mem_track *track = &kho_out.track;
>  
> +	if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order)))
> +		return -EINVAL;
> +
>  	return __kho_preserve_order(track, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(kho_preserve_folio);
> @@ -784,6 +803,11 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages)
>  	unsigned long failed_pfn = 0;
>  	int err = 0;
>  
> +	if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT,
> +					nr_pages << PAGE_SHIFT))) {
> +		return -EINVAL;
> +	}
> +
>  	while (pfn < end_pfn) {
>  		const unsigned int order =
>  			min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
> diff --git a/kernel/liveupdate/kexec_handover_debug.c b/kernel/liveupdate/kexec_handover_debug.c
> index eb47f000887d..294d1d290142 100644
> --- a/kernel/liveupdate/kexec_handover_debug.c
> +++ b/kernel/liveupdate/kexec_handover_debug.c
> @@ -214,3 +214,21 @@ __init int kho_debugfs_init(void)
>  		return -ENOENT;
>  	return 0;
>  }
> +
> +#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
> +bool kho_scratch_overlap(phys_addr_t phys, size_t size)
> +{
> +	phys_addr_t scratch_start, scratch_end;
> +	unsigned int i;
> +
> +	for (i = 0; i < kho_scratch_cnt; i++) {
> +		scratch_start = kho_scratch[i].addr;
> +		scratch_end = kho_scratch[i].addr + kho_scratch[i].size - 1;

Nit: wouldn't it be a tad bit simpler to do

		scratch_end = kho_scratch[i].addr + kho_scratch[i].size;

> +
> +		if (phys <= scratch_end && (phys + size) > scratch_start)

and here

		if (phys < scratch_end && (phys + size) > scratch_start)

At least I find it slightly easier to understand, though I don't think
it makes too much of a difference so either way is fine.

> +			return true;
> +	}
> +
> +	return false;
> +}
> +#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
> diff --git a/kernel/liveupdate/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h
> index b3fc1957affa..92798346fa5a 100644
> --- a/kernel/liveupdate/kexec_handover_internal.h
> +++ b/kernel/liveupdate/kexec_handover_internal.h
> @@ -44,4 +44,13 @@ static inline void kho_debugfs_fdt_remove(struct kho_debugfs *dbg,
>  					  void *fdt) { }
>  #endif /* CONFIG_KEXEC_HANDOVER_DEBUGFS */
>  
> +#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
> +bool kho_scratch_overlap(phys_addr_t phys, size_t size);
> +#else
> +static inline bool kho_scratch_overlap(phys_addr_t phys, size_t size)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
> +
>  #endif /* LINUX_KEXEC_HANDOVER_INTERNAL_H */

-- 
Regards,
Pratyush Yadav



* Re: [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory in scratch area
  2025-10-15  8:21   ` Mike Rapoport
@ 2025-10-15 12:36     ` Pasha Tatashin
  2025-10-16 17:23       ` Mike Rapoport
  2025-10-18 15:28       ` Pasha Tatashin
  0 siblings, 2 replies; 24+ messages in thread
From: Pasha Tatashin @ 2025-10-15 12:36 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, pratyush, rdunlap, tj, jasonmiu,
	dmatlack, skhawaja

> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  kernel/liveupdate/Kconfig                   | 15 ++++++++++
>
> Feels like kernel/liveupdate/Makefile change is missing

It's not; we already have KEXEC_HANDOVER_DEBUGFS, which pulls in
kexec_handover_debug.c.

That debug file contains KHO debugfs and debug code. The debug code
adds KEXEC_HANDOVER_DEBUGFS as a dependency, which I think is
appropriate for a debug build.

However, I do not like ugly ifdefs in .c files, so perhaps we should have two
files: kexec_handover_debugfs.c for debugfs and kexec_handover_debug.c for
the debug checks? What do you think?

> >  kernel/liveupdate/kexec_handover.c          | 32 ++++++++++++++++++---
> >  kernel/liveupdate/kexec_handover_debug.c    | 18 ++++++++++++
> >  kernel/liveupdate/kexec_handover_internal.h |  9 ++++++
> >  4 files changed, 70 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
> > index 522b9f74d605..d119f4f3f4b1 100644
> > --- a/kernel/liveupdate/Kconfig
> > +++ b/kernel/liveupdate/Kconfig
> > @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS
> >         Also, enables inspecting the KHO fdt trees with the debugfs binary
> >         blobs.
> >
> > +config KEXEC_HANDOVER_DEBUG
> > +     bool "Enable Kexec Handover debug checks"
> > +     depends on KEXEC_HANDOVER_DEBUGFS
> > +     help
> > +       This option enables extra sanity checks for the Kexec Handover
> > +       subsystem.
> > +
> > +       These checks verify that neither preserved memory regions nor KHO's
> > +       internal metadata are allocated from within a KHO scratch area.
> > +       An overlap can lead to memory corruption during a subsequent kexec
> > +       operation.
> > +
> > +       If an overlap is detected, the kernel will print a warning and the
> > +       offending operation will fail. This should only be enabled for
> > +       debugging purposes due to runtime overhead.
> >  endmenu
> > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> > index 5da21f1510cc..ef1e6f7a234b 100644
> > --- a/kernel/liveupdate/kexec_handover.c
> > +++ b/kernel/liveupdate/kexec_handover.c
> > @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
> >       if (!elm)
> >               return ERR_PTR(-ENOMEM);
> >
> > +     if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
> > +             kfree(elm);
>
> I think __free() cleanup would be better than this.

Sorry, I am not sure what you mean. kfree() is already in this
function in case of failure.

>
> > +             return ERR_PTR(-EINVAL);
> > +     }
> > +
> >       res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
> >       if (xa_is_err(res))
> >               res = ERR_PTR(xa_err(res));
> > @@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
> >
> >       chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
> >       if (!chunk)
> > -             return NULL;
> > +             return ERR_PTR(-ENOMEM);
>
> I don't think it's important to return -errno here; it's not like it's
> called from a syscall where we need to set errno for userspace.
> BTW, the same applies to xa_load_or_alloc() IMO.

Hmm, but they are very different errors: with -ENOMEM the KHO user can
try again once more memory is available, while the new -EINVAL return
from this function tells the caller that something in the system is
broken and using KHO is futile until that bug is fixed.

> > +
> > +     if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
> > +             kfree(chunk);
> > +             return ERR_PTR(-EINVAL);
> > +     }
> > +
> >       chunk->hdr.order = order;
> >       if (cur_chunk)
> >               KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk);
> > @@ -379,14 +390,17 @@ static int kho_mem_serialize(struct kho_out *kho_out)
> >       struct khoser_mem_chunk *chunk = NULL;
> >       struct kho_mem_phys *physxa;
> >       unsigned long order;
> > +     int ret = -ENOMEM;
> >
> >       xa_for_each(&kho_out->track.orders, order, physxa) {
> >               struct kho_mem_phys_bits *bits;
> >               unsigned long phys;
> >
> >               chunk = new_chunk(chunk, order);
> > -             if (!chunk)
> > +             if (IS_ERR(chunk)) {
> > +                     ret = PTR_ERR(chunk);
>
> ... and indeed, -errno from new_chunk() just makes things more complex :(
>
> >                       goto err_free;
> > +             }
> >
> >               if (!first_chunk)
> >                       first_chunk = chunk;
> > @@ -396,8 +410,10 @@ static int kho_mem_serialize(struct kho_out *kho_out)
> >
> >                       if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) {
> >                               chunk = new_chunk(chunk, order);
> > -                             if (!chunk)
> > +                             if (IS_ERR(chunk)) {
> > +                                     ret = PTR_ERR(chunk);
> >                                       goto err_free;
> > +                             }
> >                       }
> >
> >                       elm = &chunk->bitmaps[chunk->hdr.num_elms];
> > @@ -414,7 +430,7 @@ static int kho_mem_serialize(struct kho_out *kho_out)
> >
> >  err_free:
> >       kho_mem_ser_free(first_chunk);
> > -     return -ENOMEM;
> > +     return ret;
> >  }
> >
> >  static void __init deserialize_bitmap(unsigned int order,
> > @@ -737,6 +753,9 @@ int kho_preserve_folio(struct folio *folio)
> >       const unsigned int order = folio_order(folio);
> >       struct kho_mem_track *track = &kho_out.track;
> >
> > +     if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order)))
> > +             return -EINVAL;
> > +
> >       return __kho_preserve_order(track, pfn, order);
> >  }
> >  EXPORT_SYMBOL_GPL(kho_preserve_folio);
> > @@ -784,6 +803,11 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages)
> >       unsigned long failed_pfn = 0;
> >       int err = 0;
> >
> > +     if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT,
> > +                                     nr_pages << PAGE_SHIFT))) {
> > +             return -EINVAL;
> > +     }
>
> Can't we check this in __kho_preserve_order() and not duplicate the code?

Yes, that is possible, I will move it in the next version.

> > +
> >       while (pfn < end_pfn) {
> >               const unsigned int order =
> >                       min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
>
> --
> Sincerely yours,
> Mike.



* Re: [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory in scratch area
  2025-10-15 12:10   ` Pratyush Yadav
@ 2025-10-15 12:40     ` Pasha Tatashin
  2025-10-15 13:11       ` Pratyush Yadav
  0 siblings, 1 reply; 24+ messages in thread
From: Pasha Tatashin @ 2025-10-15 12:40 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, rdunlap, rppt, tj, jasonmiu,
	dmatlack, skhawaja

On Wed, Oct 15, 2025 at 8:10 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Wed, Oct 15 2025, Pasha Tatashin wrote:
>
> > It is invalid for KHO metadata or preserved memory regions to be located
> > within the KHO scratch area, as this area is overwritten when the next
> > kernel is loaded, and used early in boot by the next kernel. This can
> > lead to memory corruption.
> >
> > Adds checks to kho_preserve_* and KHO's internal metadata allocators
> > (xa_load_or_alloc, new_chunk) to verify that the physical address of the
> > memory does not overlap with any defined scratch region. If an overlap
> > is detected, the operation will fail and a WARN_ON is triggered. To
> > avoid performance overhead in production kernels, these checks are
> > enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  kernel/liveupdate/Kconfig                   | 15 ++++++++++
> >  kernel/liveupdate/kexec_handover.c          | 32 ++++++++++++++++++---
> >  kernel/liveupdate/kexec_handover_debug.c    | 18 ++++++++++++
> >  kernel/liveupdate/kexec_handover_internal.h |  9 ++++++
> >  4 files changed, 70 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
> > index 522b9f74d605..d119f4f3f4b1 100644
> > --- a/kernel/liveupdate/Kconfig
> > +++ b/kernel/liveupdate/Kconfig
> > @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS
> >         Also, enables inspecting the KHO fdt trees with the debugfs binary
> >         blobs.
> >
> > +config KEXEC_HANDOVER_DEBUG
> > +     bool "Enable Kexec Handover debug checks"
> > +     depends on KEXEC_HANDOVER_DEBUGFS
>
> Why the dependency on debugfs? Why can't the debug checks be enabled
> independently?

Because there is a single kexec_handover_debug.c file, which I thought
would make sense to use for both. But thinking about it now, perhaps we
should split the code between KEXEC_HANDOVER_DEBUGFS and
KEXEC_HANDOVER_DEBUG and add two files,
kexec_handover_debugfs.c and kexec_handover_debug.c; this would avoid
ifdefs in the .c files.

>
> > +     help
> > +       This option enables extra sanity checks for the Kexec Handover
> > +       subsystem.
> > +
> > +       These checks verify that neither preserved memory regions nor KHO's
> > +       internal metadata are allocated from within a KHO scratch area.
> > +       An overlap can lead to memory corruption during a subsequent kexec
> > +       operation.
>
> I don't think the checks that are done should be listed here since as
> soon as another check is added this list will become out of date.

I thought it could be expanded when new features are added, but I can
remove this description.

>
> > +
> > +       If an overlap is detected, the kernel will print a warning and the
> > +       offending operation will fail. This should only be enabled for
>
> This also describes the behaviour of the checks, which might change
> later. Maybe for some checks the operation won't fail? I suppose just
> leave it at "the kernel will print a warning"?

If it changes, the Kconfig text should be updated as well.

>
> > +       debugging purposes due to runtime overhead.
> >  endmenu
> > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> > index 5da21f1510cc..ef1e6f7a234b 100644
> > --- a/kernel/liveupdate/kexec_handover.c
> > +++ b/kernel/liveupdate/kexec_handover.c
> > @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
> >       if (!elm)
> >               return ERR_PTR(-ENOMEM);
> >
> > +     if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
> > +             kfree(elm);
> > +             return ERR_PTR(-EINVAL);
> > +     }
> > +
> >       res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
> >       if (xa_is_err(res))
> >               res = ERR_PTR(xa_err(res));
> > @@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
> >
> >       chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
> >       if (!chunk)
> > -             return NULL;
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
> > +             kfree(chunk);
> > +             return ERR_PTR(-EINVAL);
> > +     }
> > +
> >       chunk->hdr.order = order;
> >       if (cur_chunk)
> >               KHOSER_STORE_PTR(cur_chunk->hdr.next, chunk);
> > @@ -379,14 +390,17 @@ static int kho_mem_serialize(struct kho_out *kho_out)
> >       struct khoser_mem_chunk *chunk = NULL;
> >       struct kho_mem_phys *physxa;
> >       unsigned long order;
> > +     int ret = -ENOMEM;
> >
> >       xa_for_each(&kho_out->track.orders, order, physxa) {
> >               struct kho_mem_phys_bits *bits;
> >               unsigned long phys;
> >
> >               chunk = new_chunk(chunk, order);
> > -             if (!chunk)
> > +             if (IS_ERR(chunk)) {
> > +                     ret = PTR_ERR(chunk);
> >                       goto err_free;
> > +             }
> >
> >               if (!first_chunk)
> >                       first_chunk = chunk;
> > @@ -396,8 +410,10 @@ static int kho_mem_serialize(struct kho_out *kho_out)
> >
> >                       if (chunk->hdr.num_elms == ARRAY_SIZE(chunk->bitmaps)) {
> >                               chunk = new_chunk(chunk, order);
> > -                             if (!chunk)
> > +                             if (IS_ERR(chunk)) {
> > +                                     ret = PTR_ERR(chunk);
> >                                       goto err_free;
> > +                             }
> >                       }
> >
> >                       elm = &chunk->bitmaps[chunk->hdr.num_elms];
> > @@ -414,7 +430,7 @@ static int kho_mem_serialize(struct kho_out *kho_out)
> >
> >  err_free:
> >       kho_mem_ser_free(first_chunk);
> > -     return -ENOMEM;
> > +     return ret;
> >  }
> >
> >  static void __init deserialize_bitmap(unsigned int order,
> > @@ -737,6 +753,9 @@ int kho_preserve_folio(struct folio *folio)
> >       const unsigned int order = folio_order(folio);
> >       struct kho_mem_track *track = &kho_out.track;
> >
> > +     if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order)))
> > +             return -EINVAL;
> > +
> >       return __kho_preserve_order(track, pfn, order);
> >  }
> >  EXPORT_SYMBOL_GPL(kho_preserve_folio);
> > @@ -784,6 +803,11 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages)
> >       unsigned long failed_pfn = 0;
> >       int err = 0;
> >
> > +     if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT,
> > +                                     nr_pages << PAGE_SHIFT))) {
> > +             return -EINVAL;
> > +     }
> > +
> >       while (pfn < end_pfn) {
> >               const unsigned int order =
> >                       min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
> > diff --git a/kernel/liveupdate/kexec_handover_debug.c b/kernel/liveupdate/kexec_handover_debug.c
> > index eb47f000887d..294d1d290142 100644
> > --- a/kernel/liveupdate/kexec_handover_debug.c
> > +++ b/kernel/liveupdate/kexec_handover_debug.c
> > @@ -214,3 +214,21 @@ __init int kho_debugfs_init(void)
> >               return -ENOENT;
> >       return 0;
> >  }
> > +
> > +#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
> > +bool kho_scratch_overlap(phys_addr_t phys, size_t size)
> > +{
> > +     phys_addr_t scratch_start, scratch_end;
> > +     unsigned int i;
> > +
> > +     for (i = 0; i < kho_scratch_cnt; i++) {
> > +             scratch_start = kho_scratch[i].addr;
> > +             scratch_end = kho_scratch[i].addr + kho_scratch[i].size - 1;
>
> Nit: wouldn't it be a tad bit simpler to do
>
>                 scratch_end = kho_scratch[i].addr + kho_scratch[i].size;
>
> > +
> > +             if (phys <= scratch_end && (phys + size) > scratch_start)
>
> and here
>
>                 if (phys < scratch_end && (phys + size) > scratch_start)
>
> At least I find it slightly easier to understand, though I don't think
> it makes too much of a difference so either way is fine.
>
> > +                     return true;
> > +     }
> > +
> > +     return false;
> > +}
> > +#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
> > diff --git a/kernel/liveupdate/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h
> > index b3fc1957affa..92798346fa5a 100644
> > --- a/kernel/liveupdate/kexec_handover_internal.h
> > +++ b/kernel/liveupdate/kexec_handover_internal.h
> > @@ -44,4 +44,13 @@ static inline void kho_debugfs_fdt_remove(struct kho_debugfs *dbg,
> >                                         void *fdt) { }
> >  #endif /* CONFIG_KEXEC_HANDOVER_DEBUGFS */
> >
> > +#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
> > +bool kho_scratch_overlap(phys_addr_t phys, size_t size);
> > +#else
> > +static inline bool kho_scratch_overlap(phys_addr_t phys, size_t size)
> > +{
> > +     return false;
> > +}
> > +#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
> > +
> >  #endif /* LINUX_KEXEC_HANDOVER_INTERNAL_H */
>
> --
> Regards,
> Pratyush Yadav



* Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-15  8:37   ` Mike Rapoport
@ 2025-10-15 12:46     ` Pasha Tatashin
  0 siblings, 0 replies; 24+ messages in thread
From: Pasha Tatashin @ 2025-10-15 12:46 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, pratyush, rdunlap, tj, jasonmiu,
	dmatlack, skhawaja

On Wed, Oct 15, 2025 at 4:37 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Wed, Oct 15, 2025 at 01:31:21AM -0400, Pasha Tatashin wrote:
> > KHO allocates metadata for its preserved memory map using the SLUB
> > allocator via kzalloc(). This metadata is temporary and is used by the
> > next kernel during early boot to find preserved memory.
> >
> > A problem arises when KFENCE is enabled. kzalloc() calls can be
> > randomly intercepted by kfence_alloc(), which services the allocation
> > from a dedicated KFENCE memory pool. This pool is allocated early in
> > boot via memblock.
> >
> > When booting via KHO, the memblock allocator is restricted to a "scratch
> > area", forcing the KFENCE pool to be allocated within it. This creates a
> > conflict, as the scratch area is expected to be ephemeral and
> > overwriteable by a subsequent kexec. If KHO metadata is placed in this
> > KFENCE pool, it leads to memory corruption when the next kernel is
> > loaded.
> >
> > To fix this, modify KHO to allocate its metadata directly from the buddy
> > allocator instead of SLUB.
> >
> > As part of this change, the metadata bitmap size is increased from 512
> > bytes to PAGE_SIZE to align with the page-based allocations from the
> > buddy system.
> >
> > Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  kernel/liveupdate/kexec_handover.c | 23 +++++++++++++----------
> >  1 file changed, 13 insertions(+), 10 deletions(-)
> >
> > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> > index ef1e6f7a234b..519de6d68b27 100644
> > --- a/kernel/liveupdate/kexec_handover.c
> > +++ b/kernel/liveupdate/kexec_handover.c
> > @@ -66,10 +66,10 @@ early_param("kho", kho_parse_enable);
> >   * Keep track of memory that is to be preserved across KHO.
> >   *
> >   * The serializing side uses two levels of xarrays to manage chunks of per-order
> > - * 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
> > - * 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
> > - * each bitmap will cover 16M of address space. Thus, for 16G of memory at most
> > - * 512K of bitmap memory will be needed for order 0.
> > + * PAGE_SIZE byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order
> > + * of an 8TB system would fit inside a single 4096 byte bitmap. For order 0
> > + * allocations each bitmap will cover 128M of address space. Thus, for 16G of
> > + * memory at most 512K of bitmap memory will be needed for order 0.
> >   *
> >   * This approach is fully incremental, as the serialization progresses folios
> >   * can continue be aggregated to the tracker. The final step, immediately prior
> > @@ -77,7 +77,7 @@ early_param("kho", kho_parse_enable);
> >   * successor kernel to parse.
> >   */
> >
> > -#define PRESERVE_BITS (512 * 8)
> > +#define PRESERVE_BITS (PAGE_SIZE * 8)
> >
> >  struct kho_mem_phys_bits {
> >       DECLARE_BITMAP(preserve, PRESERVE_BITS);
> > @@ -131,18 +131,21 @@ static struct kho_out kho_out = {
> >
> >  static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
>
> The name 'xa_load_or_alloc' is confusing now that we only use this function

Indeed, but this is not something that is done by this patch.

> to allocate bitmaps. I think it should be renamed to reflect that, and its
> return type should be 'struct kho_mem_phys_bits'. Then it wouldn't need the
> sz parameter and the size calculations below become redundant.

I am thinking of splitting the increase of the bitmap size to PAGE_SIZE
out of this patch, then renaming this function and removing the size_t
argument in another patch, and finally doing the fix patch that replaces
SLUB with buddy.
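
Roughly what I have in mind for the renamed helper (names are placeholders,
just to illustrate the direction):

	static struct kho_mem_phys_bits *kho_alloc_preserve_bits(struct xarray *xa,
								 unsigned long index)
	{
		struct kho_mem_phys_bits *bits;
		void *res;

		bits = xa_load(xa, index);
		if (bits)
			return bits;

		bits = (void *)get_zeroed_page(GFP_KERNEL);
		if (!bits)
			return ERR_PTR(-ENOMEM);

		if (WARN_ON(kho_scratch_overlap(virt_to_phys(bits), PAGE_SIZE))) {
			free_page((unsigned long)bits);
			return ERR_PTR(-EINVAL);
		}

		res = xa_cmpxchg(xa, index, NULL, bits, GFP_KERNEL);
		if (xa_is_err(res)) {
			free_page((unsigned long)bits);
			return ERR_PTR(xa_err(res));
		}
		if (res) {
			free_page((unsigned long)bits);
			return res;
		}

		return bits;
	}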

>
> >  {
> > +     unsigned int order;
> >       void *elm, *res;
> >
> >       elm = xa_load(xa, index);
> >       if (elm)
> >               return elm;
> >
> > -     elm = kzalloc(sz, GFP_KERNEL);
> > +     order = get_order(sz);
> > +     elm = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> >       if (!elm)
> >               return ERR_PTR(-ENOMEM);
> >
> > -     if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
> > -             kfree(elm);
> > +     if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm),
> > +                                     PAGE_SIZE << order))) {
> > +             free_pages((unsigned long)elm, order);
> >               return ERR_PTR(-EINVAL);
> >       }
> >
>
> --
> Sincerely yours,
> Mike.



* Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-15  5:31 ` [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator Pasha Tatashin
  2025-10-15  8:37   ` Mike Rapoport
@ 2025-10-15 13:05   ` Pratyush Yadav
  2025-10-15 14:19     ` Pasha Tatashin
  2025-10-15 14:22     ` Pasha Tatashin
  2025-10-24 13:21   ` Jason Gunthorpe
  2 siblings, 2 replies; 24+ messages in thread
From: Pratyush Yadav @ 2025-10-15 13:05 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, pratyush, rdunlap, rppt, tj,
	jasonmiu, dmatlack, skhawaja, glider, elver

+Cc Marco, Alexander

On Wed, Oct 15 2025, Pasha Tatashin wrote:

> KHO allocates metadata for its preserved memory map using the SLUB
> allocator via kzalloc(). This metadata is temporary and is used by the
> next kernel during early boot to find preserved memory.
>
> A problem arises when KFENCE is enabled. kzalloc() calls can be
> randomly intercepted by kfence_alloc(), which services the allocation
> from a dedicated KFENCE memory pool. This pool is allocated early in
> boot via memblock.

At some point, we'd probably want to add support for preserving slab
objects using KHO. That wouldn't work if the objects can land in scratch
memory. Right now, the kfence pools are allocated right before KHO goes
out of scratch-only and memblock frees pages to buddy.

	kfence_alloc_pool_and_metadata();
	report_meminit();
	kmsan_init_shadow();
	stack_depot_early_init();
        [...]
	kho_memory_init();

	memblock_free_all();

Can't kfence allocate its pool right after memblock_free_all()? IIUC at
that point, there shouldn't be much fragmentation and allocations should
still be possible.

Another idea could be to disable scratch-only a bit earlier and add an
option in memblock_alloc() to avoid scratch areas?

Anyway, not something we need to solve right now with this series.
Something to figure out eventually.

>
> When booting via KHO, the memblock allocator is restricted to a "scratch
> area", forcing the KFENCE pool to be allocated within it. This creates a
> conflict, as the scratch area is expected to be ephemeral and
> overwriteable by a subsequent kexec. If KHO metadata is placed in this
> KFENCE pool, it leads to memory corruption when the next kernel is
> loaded.
>
> To fix this, modify KHO to allocate its metadata directly from the buddy
> allocator instead of SLUB.
>
> As part of this change, the metadata bitmap size is increased from 512
> bytes to PAGE_SIZE to align with the page-based allocations from the
> buddy system.

The implication of this change is that preservation metadata becomes
less memory-efficient when preserved pages are sparse, mainly because
if only one bit is set in a bitmap, 4k bytes of memory are now used
instead of 512 bytes.

It is hard to say what difference this makes in practice without
sampling real workloads, but perhaps still worth mentioning in the
commit message?

Other than this,

Reviewed-by: Pratyush Yadav <pratyush@kernel.org>

>
> Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  kernel/liveupdate/kexec_handover.c | 23 +++++++++++++----------
>  1 file changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index ef1e6f7a234b..519de6d68b27 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -66,10 +66,10 @@ early_param("kho", kho_parse_enable);
>   * Keep track of memory that is to be preserved across KHO.
>   *
>   * The serializing side uses two levels of xarrays to manage chunks of per-order
> - * 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a
> - * 1TB system would fit inside a single 512 byte bitmap. For order 0 allocations
> - * each bitmap will cover 16M of address space. Thus, for 16G of memory at most
> - * 512K of bitmap memory will be needed for order 0.
> + * PAGE_SIZE byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order
> + * of a 8TB system would fit inside a single 4096 byte bitmap. For order 0
> + * allocations each bitmap will cover 128M of address space. Thus, for 16G of
> + * memory at most 512K of bitmap memory will be needed for order 0.
>   *
>   * This approach is fully incremental, as the serialization progresses folios
>   * can continue be aggregated to the tracker. The final step, immediately prior
> @@ -77,7 +77,7 @@ early_param("kho", kho_parse_enable);
>   * successor kernel to parse.
>   */
>  
> -#define PRESERVE_BITS (512 * 8)
> +#define PRESERVE_BITS (PAGE_SIZE * 8)
>  
>  struct kho_mem_phys_bits {
>  	DECLARE_BITMAP(preserve, PRESERVE_BITS);
> @@ -131,18 +131,21 @@ static struct kho_out kho_out = {
>  
>  static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
>  {
> +	unsigned int order;
>  	void *elm, *res;
>  
>  	elm = xa_load(xa, index);
>  	if (elm)
>  		return elm;
>  
> -	elm = kzalloc(sz, GFP_KERNEL);
> +	order = get_order(sz);
> +	elm = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
>  	if (!elm)
>  		return ERR_PTR(-ENOMEM);
>  
> -	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
> -		kfree(elm);
> +	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm),
> +					PAGE_SIZE << order))) {
> +		free_pages((unsigned long)elm, order);
>  		return ERR_PTR(-EINVAL);
>  	}
>  
> @@ -151,7 +154,7 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
>  		res = ERR_PTR(xa_err(res));
>  
>  	if (res) {
> -		kfree(elm);
> +		free_pages((unsigned long)elm, order);
>  		return res;
>  	}
>  
> @@ -357,7 +360,7 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
>  {
>  	struct khoser_mem_chunk *chunk;
>  
> -	chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	chunk = (void *)get_zeroed_page(GFP_KERNEL);
>  	if (!chunk)
>  		return ERR_PTR(-ENOMEM);

-- 
Regards,
Pratyush Yadav


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory in scratch area
  2025-10-15 12:40     ` Pasha Tatashin
@ 2025-10-15 13:11       ` Pratyush Yadav
  0 siblings, 0 replies; 24+ messages in thread
From: Pratyush Yadav @ 2025-10-15 13:11 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, akpm, brauner, corbet, graf, jgg, linux-kernel,
	linux-kselftest, linux-mm, masahiroy, ojeda, rdunlap, rppt, tj,
	jasonmiu, dmatlack, skhawaja

On Wed, Oct 15 2025, Pasha Tatashin wrote:

> On Wed, Oct 15, 2025 at 8:10 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Wed, Oct 15 2025, Pasha Tatashin wrote:
>>
>> > It is invalid for KHO metadata or preserved memory regions to be located
>> > within the KHO scratch area, as this area is overwritten when the next
>> > kernel is loaded, and used early in boot by the next kernel. This can
>> > lead to memory corruption.
>> >
>> > Adds checks to kho_preserve_* and KHO's internal metadata allocators
>> > (xa_load_or_alloc, new_chunk) to verify that the physical address of the
>> > memory does not overlap with any defined scratch region. If an overlap
>> > is detected, the operation will fail and a WARN_ON is triggered. To
>> > avoid performance overhead in production kernels, these checks are
>> > enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected.
>> >
>> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> > ---
>> >  kernel/liveupdate/Kconfig                   | 15 ++++++++++
>> >  kernel/liveupdate/kexec_handover.c          | 32 ++++++++++++++++++---
>> >  kernel/liveupdate/kexec_handover_debug.c    | 18 ++++++++++++
>> >  kernel/liveupdate/kexec_handover_internal.h |  9 ++++++
>> >  4 files changed, 70 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
>> > index 522b9f74d605..d119f4f3f4b1 100644
>> > --- a/kernel/liveupdate/Kconfig
>> > +++ b/kernel/liveupdate/Kconfig
>> > @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS
>> >         Also, enables inspecting the KHO fdt trees with the debugfs binary
>> >         blobs.
>> >
>> > +config KEXEC_HANDOVER_DEBUG
>> > +     bool "Enable Kexec Handover debug checks"
>> > +     depends on KEXEC_HANDOVER_DEBUGFS
>>
>> Why the dependency on debugfs? Why can't the debug checks be enabled
>> independently?
>
> Because there is one kexec_handover_debug.c file, which I thought would
> make sense to use for both. But thinking about this now, perhaps we
> should split the code between KEXEC_HANDOVER_DEBUGFS and
> KEXEC_HANDOVER_DEBUG and add two files, kexec_handover_debugfs.c and
> kexec_handover_debug.c; this would avoid ifdefs in the .c files.

Sounds good.

>
>>
>> > +     help
>> > +       This option enables extra sanity checks for the Kexec Handover
>> > +       subsystem.
>> > +
>> > +       These checks verify that neither preserved memory regions nor KHO's
>> > +       internal metadata are allocated from within a KHO scratch area.
>> > +       An overlap can lead to memory corruption during a subsequent kexec
>> > +       operation.
>>
>> I don't think the checks that are done should be listed here since as
>> soon as another check is added this list will become out of date.
>
> I thought it could be expanded when new features are added, but I can
> remove this description.

Yes, but it is easy to forget to do so.

>
>>
>> > +
>> > +       If an overlap is detected, the kernel will print a warning and the
>> > +       offending operation will fail. This should only be enabled for
>>
>> This also describes the behaviour of the checks, which might change
>> later. Maybe for some checks the operation won't fail? I suppose just
>> leave it at "the kernel will print a warning"?
>
> If it changes, then the Kconfig text should be updated as well.
>
>>
>> > +       debugging purposes due to runtime overhead.
>> >  endmenu
[...]

-- 
Regards,
Pratyush Yadav


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-15 13:05   ` Pratyush Yadav
@ 2025-10-15 14:19     ` Pasha Tatashin
  2025-10-15 14:36       ` Alexander Potapenko
  2025-10-24 13:25       ` Jason Gunthorpe
  2025-10-15 14:22     ` Pasha Tatashin
  1 sibling, 2 replies; 24+ messages in thread
From: Pasha Tatashin @ 2025-10-15 14:19 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, rdunlap, rppt, tj, jasonmiu,
	dmatlack, skhawaja, glider, elver

On Wed, Oct 15, 2025 at 9:05 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> +Cc Marco, Alexander
>
> On Wed, Oct 15 2025, Pasha Tatashin wrote:
>
> > KHO allocates metadata for its preserved memory map using the SLUB
> > allocator via kzalloc(). This metadata is temporary and is used by the
> > next kernel during early boot to find preserved memory.
> >
> > A problem arises when KFENCE is enabled. kzalloc() calls can be
> > randomly intercepted by kfence_alloc(), which services the allocation
> > from a dedicated KFENCE memory pool. This pool is allocated early in
> > boot via memblock.
>
> At some point, we'd probably want to add support for preserving slab
> objects using KHO. That wouldn't work if the objects can land in scratch
> memory. Right now, the kfence pools are allocated right before KHO goes
> out of scratch-only and memblock frees pages to buddy.

If we do that, most likely we will add a GFP flag that goes with it,
so the slab can use a special pool of pages that are preservable.
Otherwise, we are going to be leaking memory from the old kernel in
the unpreserved parts of the pages. With such a flag, kfence can
simply ignore allocations that carry it.

Pasha


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-15 13:05   ` Pratyush Yadav
  2025-10-15 14:19     ` Pasha Tatashin
@ 2025-10-15 14:22     ` Pasha Tatashin
  1 sibling, 0 replies; 24+ messages in thread
From: Pasha Tatashin @ 2025-10-15 14:22 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, rdunlap, rppt, tj, jasonmiu,
	dmatlack, skhawaja, glider, elver

> > As part of this change, the metadata bitmap size is increased from 512
> > bytes to PAGE_SIZE to align with the page-based allocations from the
> > buddy system.
>
> The implication of this change is that preservation metadata becomes
> less memory-efficient when preserved pages are sparse. Mainly because if
> only one bit is set in the bitmap, now 4k bytes of memory is used
> instead of 512 bytes.
>
> It is hard to say what difference this makes in practice without
> sampling real workloads, but perhaps still worth mentioning in the
> commit message?

Forgot to reply to the other part:

I agree; however, I suspect the impact is going to be minimal, since
it is strange to preserve fragmented state and expect a fast reboot.
Most likely, we are going to be optimizing the preservation pools,
such as using 1G order pages for guest memory.

Also, we are moving toward preserving 4K bitmaps as part of the
Stateless KHO patch series, so I think we will make this change
anyway, as part of this fix or as part of transitioning to radix-tree
stateless KHO.


> Reviewed-by: Pratyush Yadav <pratyush@kernel.org>

Thank you.
Pasha


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-15 14:19     ` Pasha Tatashin
@ 2025-10-15 14:36       ` Alexander Potapenko
  2025-10-24 13:25       ` Jason Gunthorpe
  1 sibling, 0 replies; 24+ messages in thread
From: Alexander Potapenko @ 2025-10-15 14:36 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, akpm, brauner, corbet, graf, jgg, linux-kernel,
	linux-kselftest, linux-mm, masahiroy, ojeda, rdunlap, rppt, tj,
	jasonmiu, dmatlack, skhawaja, elver

On Wed, Oct 15, 2025 at 4:19 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Wed, Oct 15, 2025 at 9:05 AM Pratyush Yadav <pratyush@kernel.org> wrote:
> >
> > +Cc Marco, Alexander
> >
> > On Wed, Oct 15 2025, Pasha Tatashin wrote:
> >
> > > KHO allocates metadata for its preserved memory map using the SLUB
> > > allocator via kzalloc(). This metadata is temporary and is used by the
> > > next kernel during early boot to find preserved memory.
> > >
> > > A problem arises when KFENCE is enabled. kzalloc() calls can be
> > > randomly intercepted by kfence_alloc(), which services the allocation
> > > from a dedicated KFENCE memory pool. This pool is allocated early in
> > > boot via memblock.
> >
> > At some point, we'd probably want to add support for preserving slab
> > objects using KHO. That wouldn't work if the objects can land in scratch
> > memory. Right now, the kfence pools are allocated right before KHO goes
> > out of scratch-only and memblock frees pages to buddy.
>
> If we do that, most likely we will add a GFP flag that goes with it,
> so the slab can use a special pool of pages that are preservable.
> Otherwise, we are going to be leaking memory from the old kernel in
> the unpreserved parts of the pages. If we do that, kfence can ignore
> allocations with that new preservable GFP flag.

I think this is the best way forward.
Changing KFENCE to allocate the pool from buddy will make the
allocation size less flexible (goodbye, CONFIG_KFENCE_NUM_OBJECTS).


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory in scratch area
  2025-10-15 12:36     ` Pasha Tatashin
@ 2025-10-16 17:23       ` Mike Rapoport
  2025-10-18 15:31         ` Pasha Tatashin
  2025-10-18 15:28       ` Pasha Tatashin
  1 sibling, 1 reply; 24+ messages in thread
From: Mike Rapoport @ 2025-10-16 17:23 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, pratyush, rdunlap, tj, jasonmiu,
	dmatlack, skhawaja

On Wed, Oct 15, 2025 at 08:36:25AM -0400, Pasha Tatashin wrote:
> > > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > > ---
> > >  kernel/liveupdate/Kconfig                   | 15 ++++++++++
> >
> > Feels like kernel/liveupdate/Makefile change is missing
> 
> It's not, we already have KEXEC_HANDOVER_DEBUGFS that pulls in
> kexec_handover_debug.c
> 
> That debug file contains KHO debugfs and debug code. The debug code
> adds KEXEC_HANDOVER_DEBUGFS as a dependency, which I think is
> appropriate for a debug build.
> 
> However, I do not like ugly ifdefs in .c, so perhaps, we should have two files:
> kexec_handover_debugfs.c for debugfs and kexec_handover_debug.c ? What
> do you think?
> 
> > >  kernel/liveupdate/kexec_handover.c          | 32 ++++++++++++++++++---
> > >  kernel/liveupdate/kexec_handover_debug.c    | 18 ++++++++++++
> > >  kernel/liveupdate/kexec_handover_internal.h |  9 ++++++
> > >  4 files changed, 70 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
> > > index 522b9f74d605..d119f4f3f4b1 100644
> > > --- a/kernel/liveupdate/Kconfig
> > > +++ b/kernel/liveupdate/Kconfig
> > > @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS
> > >         Also, enables inspecting the KHO fdt trees with the debugfs binary
> > >         blobs.
> > >
> > > +config KEXEC_HANDOVER_DEBUG
> > > +     bool "Enable Kexec Handover debug checks"
> > > +     depends on KEXEC_HANDOVER_DEBUGFS
> > > +     help
> > > +       This option enables extra sanity checks for the Kexec Handover
> > > +       subsystem.
> > > +
> > > +       These checks verify that neither preserved memory regions nor KHO's
> > > +       internal metadata are allocated from within a KHO scratch area.
> > > +       An overlap can lead to memory corruption during a subsequent kexec
> > > +       operation.
> > > +
> > > +       If an overlap is detected, the kernel will print a warning and the
> > > +       offending operation will fail. This should only be enabled for
> > > +       debugging purposes due to runtime overhead.
> > >  endmenu
> > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> > > index 5da21f1510cc..ef1e6f7a234b 100644
> > > --- a/kernel/liveupdate/kexec_handover.c
> > > +++ b/kernel/liveupdate/kexec_handover.c
> > > @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
> > >       if (!elm)
> > >               return ERR_PTR(-ENOMEM);
> > >
> > > +     if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
> > > +             kfree(elm);
> >
> > I think __free() cleanup would be better than this.
> 
> Sorry, not sure what do you mean. kfree() is already is in this
> function in case of failure.

There's a __free(kfree) cleanup helper defined in include/linux/cleanup.h
that ensures resources are not leaked on return from a function.
With kfree we could do something like:

	void *elm __free(kfree) = NULL;

	if (error)
		return ERR_PTR(errno);

	return no_free_ptr(elm);

There's no __free() definition for free_page() though :(

The second best IMHO is to use goto for error handling rather than free()
inside if (error).
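
Applied to xa_load_or_alloc() with the kzalloc() it has in this patch,
that would look roughly like this (a minimal sketch only, assuming
linux/cleanup.h is included; as said, this doesn't carry over to the
free_page() case):

	void *elm __free(kfree) = NULL;
	void *res;

	res = xa_load(xa, index);
	if (res)
		return res;

	elm = kzalloc(sz, GFP_KERNEL);
	if (!elm)
		return ERR_PTR(-ENOMEM);

	if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz)))
		return ERR_PTR(-EINVAL);	/* elm freed on return */

	res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
	if (xa_is_err(res))
		return ERR_PTR(xa_err(res));
	if (res)
		return res;			/* lost the race, elm freed on return */

	return no_free_ptr(elm);		/* transfer ownership, skip kfree */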

> > > +             return ERR_PTR(-EINVAL);
> > > +     }
> > > +
> > >       res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
> > >       if (xa_is_err(res))
> > >               res = ERR_PTR(xa_err(res));
> > > @@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
> > >
> > >       chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > >       if (!chunk)
> > > -             return NULL;
> > > +             return ERR_PTR(-ENOMEM);
> >
> > I don't think it's important to return -errno here, it's not that it's
> > called from a syscall and we need to set errno for the userspace.
> > BTW, the same applies to xa_load_or_alloc() IMO.
> 
> HM, but they are very different errors: ENOMEM, the KHO user can try
> again after more memory is available, but the new -EINVAL return from
> this function tells the caller that there is something broken in the
> system, and using KHO is futile until this bug is fixed.

Do you really see the callers handling this differently? 
And we already have WARN_ON() because something is broken in the system.
 
> > > +
> > > +     if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
> > > +             kfree(chunk);
> > > +             return ERR_PTR(-EINVAL);
> > > +     }
> > > +

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory in scratch area
  2025-10-15 12:36     ` Pasha Tatashin
  2025-10-16 17:23       ` Mike Rapoport
@ 2025-10-18 15:28       ` Pasha Tatashin
  1 sibling, 0 replies; 24+ messages in thread
From: Pasha Tatashin @ 2025-10-18 15:28 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, pratyush, rdunlap, tj, jasonmiu,
	dmatlack, skhawaja

> > Can't we check this in __kho_preserve_order() and not duplicate the code?
>
> Yes, that is possible, I will move it in the next version.

Actually, I decided against this. The check might be expensive
depending on how sparse the scratch area is. Why make it even more
expensive for the kho_preserve_pages() case, which might call
__kho_preserve_order() multiple times?

Pasha


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory in scratch area
  2025-10-16 17:23       ` Mike Rapoport
@ 2025-10-18 15:31         ` Pasha Tatashin
  0 siblings, 0 replies; 24+ messages in thread
From: Pasha Tatashin @ 2025-10-18 15:31 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: akpm, brauner, corbet, graf, jgg, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, pratyush, rdunlap, tj, jasonmiu,
	dmatlack, skhawaja

On Thu, Oct 16, 2025 at 1:23 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Wed, Oct 15, 2025 at 08:36:25AM -0400, Pasha Tatashin wrote:
> > > > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > > > ---
> > > >  kernel/liveupdate/Kconfig                   | 15 ++++++++++
> > >
> > > Feels like kernel/liveupdate/Makefile change is missing
> >
> > It's not, we already have KEXEC_HANDOVER_DEBUGFS that pulls in
> > kexec_handover_debug.c
> >
> > That debug file contains KHO debugfs and debug code. The debug code
> > adds KEXEC_HANDOVER_DEBUGFS as a dependency, which I think is
> > appropriate for a debug build.
> >
> > However, I do not like ugly ifdefs in .c, so perhaps, we should have two files:
> > kexec_handover_debugfs.c for debugfs and kexec_handover_debug.c ? What
> > do you think?
> >
> > > >  kernel/liveupdate/kexec_handover.c          | 32 ++++++++++++++++++---
> > > >  kernel/liveupdate/kexec_handover_debug.c    | 18 ++++++++++++
> > > >  kernel/liveupdate/kexec_handover_internal.h |  9 ++++++
> > > >  4 files changed, 70 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
> > > > index 522b9f74d605..d119f4f3f4b1 100644
> > > > --- a/kernel/liveupdate/Kconfig
> > > > +++ b/kernel/liveupdate/Kconfig
> > > > @@ -27,4 +27,19 @@ config KEXEC_HANDOVER_DEBUGFS
> > > >         Also, enables inspecting the KHO fdt trees with the debugfs binary
> > > >         blobs.
> > > >
> > > > +config KEXEC_HANDOVER_DEBUG
> > > > +     bool "Enable Kexec Handover debug checks"
> > > > +     depends on KEXEC_HANDOVER_DEBUGFS
> > > > +     help
> > > > +       This option enables extra sanity checks for the Kexec Handover
> > > > +       subsystem.
> > > > +
> > > > +       These checks verify that neither preserved memory regions nor KHO's
> > > > +       internal metadata are allocated from within a KHO scratch area.
> > > > +       An overlap can lead to memory corruption during a subsequent kexec
> > > > +       operation.
> > > > +
> > > > +       If an overlap is detected, the kernel will print a warning and the
> > > > +       offending operation will fail. This should only be enabled for
> > > > +       debugging purposes due to runtime overhead.
> > > >  endmenu
> > > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> > > > index 5da21f1510cc..ef1e6f7a234b 100644
> > > > --- a/kernel/liveupdate/kexec_handover.c
> > > > +++ b/kernel/liveupdate/kexec_handover.c
> > > > @@ -141,6 +141,11 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
> > > >       if (!elm)
> > > >               return ERR_PTR(-ENOMEM);
> > > >
> > > > +     if (WARN_ON(kho_scratch_overlap(virt_to_phys(elm), sz))) {
> > > > +             kfree(elm);
> > >
> > > I think __free() cleanup would be better than this.
> >
> > Sorry, not sure what do you mean. kfree() is already is in this
> > function in case of failure.
>
> There's __free(kfree) cleanup function defined in include/linux/cleanup.h
> that ensures that on return from a function resources are not leaked.
> With kfree we could do something like
>
>         void *elm __free(kfree) = NULL;
>
>         if (error)
>                 return ERR_PTR(errno);
>
>         return no_free_ptr(elm);
>
> There's no __free() definition for free_page() though :(
>
> The second best IMHO is to use goto for error handling rather than free()
> inside if (error).

Makes sense, let me try to get it done. I actually like what cleanup.h provides.

>
> > > > +             return ERR_PTR(-EINVAL);
> > > > +     }
> > > > +
> > > >       res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
> > > >       if (xa_is_err(res))
> > > >               res = ERR_PTR(xa_err(res));
> > > > @@ -354,7 +359,13 @@ static struct khoser_mem_chunk *new_chunk(struct khoser_mem_chunk *cur_chunk,
> > > >
> > > >       chunk = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > > >       if (!chunk)
> > > > -             return NULL;
> > > > +             return ERR_PTR(-ENOMEM);
> > >
> > > I don't think it's important to return -errno here, it's not that it's
> > > called from a syscall and we need to set errno for the userspace.
> > > BTW, the same applies to xa_load_or_alloc() IMO.
> >
> > HM, but they are very different errors: ENOMEM, the KHO user can try
> > again after more memory is available, but the new -EINVAL return from
> > this function tells the caller that there is something broken in the
> > system, and using KHO is futile until this bug is fixed.
>
> Do you really see the callers handling this differently?
> And we already have WARN_ON() because something is broken in the system.

Maybe, and also maybe for some self-tests. I think it is more
consistent to return ERR_PTR() here, based on other code in this file
(e.g. xa_load_or_alloc() already uses it). Let's keep it.

>
> > > > +
> > > > +     if (WARN_ON(kho_scratch_overlap(virt_to_phys(chunk), PAGE_SIZE))) {
> > > > +             kfree(chunk);
> > > > +             return ERR_PTR(-EINVAL);
> > > > +     }
> > > > +
>
> --
> Sincerely yours,
> Mike.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-15  5:31 ` [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator Pasha Tatashin
  2025-10-15  8:37   ` Mike Rapoport
  2025-10-15 13:05   ` Pratyush Yadav
@ 2025-10-24 13:21   ` Jason Gunthorpe
  2 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2025-10-24 13:21 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: akpm, brauner, corbet, graf, linux-kernel, linux-kselftest,
	linux-mm, masahiroy, ojeda, pratyush, rdunlap, rppt, tj,
	jasonmiu, dmatlack, skhawaja

On Wed, Oct 15, 2025 at 01:31:21AM -0400, Pasha Tatashin wrote:
> -	elm = kzalloc(sz, GFP_KERNEL);
> +	order = get_order(sz);
> +	elm = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
>  	if (!elm)
>  		return ERR_PTR(-ENOMEM);

Please add a function to allocate preservable non-scratch pages then
:\ I don't think we should be sprinkling __get_free_pages() all over
the place.

You can then add a cleanup.h helper for the new function and so on.

It should allocate frozen pages, accept a size_t and return a void *.

If this issue is only related to kfence then the new function could
have an if IS_ENABLED(KFENCE) and use slab when it is safe.
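
Roughly along these lines, as a sketch only (the name
kho_alloc_preserved() is made up here, the frozen-pages detail is left
out, and a matching free helper would be needed):

	/* Sketch: allocate KHO metadata that cannot land in the scratch area. */
	static void *kho_alloc_preserved(size_t size)
	{
		if (IS_ENABLED(CONFIG_KFENCE))
			return (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
							get_order(size));

		/* No kfence pool in play, so slab is safe to use here. */
		return kzalloc(size, GFP_KERNEL);
	}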

Jason


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-15 14:19     ` Pasha Tatashin
  2025-10-15 14:36       ` Alexander Potapenko
@ 2025-10-24 13:25       ` Jason Gunthorpe
  2025-10-24 13:57         ` Pasha Tatashin
  1 sibling, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2025-10-24 13:25 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, akpm, brauner, corbet, graf, linux-kernel,
	linux-kselftest, linux-mm, masahiroy, ojeda, rdunlap, rppt, tj,
	jasonmiu, dmatlack, skhawaja, glider, elver

On Wed, Oct 15, 2025 at 10:19:08AM -0400, Pasha Tatashin wrote:
> On Wed, Oct 15, 2025 at 9:05 AM Pratyush Yadav <pratyush@kernel.org> wrote:
> >
> > +Cc Marco, Alexander
> >
> > On Wed, Oct 15 2025, Pasha Tatashin wrote:
> >
> > > KHO allocates metadata for its preserved memory map using the SLUB
> > > allocator via kzalloc(). This metadata is temporary and is used by the
> > > next kernel during early boot to find preserved memory.
> > >
> > > A problem arises when KFENCE is enabled. kzalloc() calls can be
> > > randomly intercepted by kfence_alloc(), which services the allocation
> > > from a dedicated KFENCE memory pool. This pool is allocated early in
> > > boot via memblock.
> >
> > At some point, we'd probably want to add support for preserving slab
> > objects using KHO. That wouldn't work if the objects can land in scratch
> > memory. Right now, the kfence pools are allocated right before KHO goes
> > out of scratch-only and memblock frees pages to buddy.
> 
> If we do that, most likely we will add a GFP flag that goes with it,
> so the slab can use a special pool of pages that are preservable.
> Otherwise, we are going to be leaking memory from the old kernel in
> the unpreserved parts of the pages. 

That isn't an issue. If we make slab preservable, then we'd have to
preserve the page and then somehow record what order is stored in that
page and a bitmap of which parts are allocated, to restore the slab
state on recovery.
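
As a purely illustrative sketch (all names hypothetical), the record
per preserved slab page would be something like:

	struct kho_slab_page_record {
		u64	phys;		/* physical address of the slab page */
		u32	object_size;	/* size class stored in that page    */
		u32	nr_objects;
		u64	allocated[1];	/* bitmap of live objects            */
	};

and the successor kernel would put the unset slots straight back on
the freelist when it rebuilds the page.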

So long as the non-preserved memory comes back as freed in the
successor kernel, it doesn't matter what was in it in the preceding
kernel. The new kernel will eventually zero it. So it isn't a 'leak'.

Jason


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-24 13:25       ` Jason Gunthorpe
@ 2025-10-24 13:57         ` Pasha Tatashin
  2025-10-24 14:20           ` Jason Gunthorpe
  0 siblings, 1 reply; 24+ messages in thread
From: Pasha Tatashin @ 2025-10-24 13:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, akpm, brauner, corbet, graf, linux-kernel,
	linux-kselftest, linux-mm, masahiroy, ojeda, rdunlap, rppt, tj,
	jasonmiu, dmatlack, skhawaja, glider, elver

On Fri, Oct 24, 2025 at 9:25 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Oct 15, 2025 at 10:19:08AM -0400, Pasha Tatashin wrote:
> > On Wed, Oct 15, 2025 at 9:05 AM Pratyush Yadav <pratyush@kernel.org> wrote:
> > >
> > > +Cc Marco, Alexander
> > >
> > > On Wed, Oct 15 2025, Pasha Tatashin wrote:
> > >
> > > > KHO allocates metadata for its preserved memory map using the SLUB
> > > > allocator via kzalloc(). This metadata is temporary and is used by the
> > > > next kernel during early boot to find preserved memory.
> > > >
> > > > A problem arises when KFENCE is enabled. kzalloc() calls can be
> > > > randomly intercepted by kfence_alloc(), which services the allocation
> > > > from a dedicated KFENCE memory pool. This pool is allocated early in
> > > > boot via memblock.
> > >
> > > At some point, we'd probably want to add support for preserving slab
> > > objects using KHO. That wouldn't work if the objects can land in scratch
> > > memory. Right now, the kfence pools are allocated right before KHO goes
> > > out of scratch-only and memblock frees pages to buddy.
> >
> > If we do that, most likely we will add a GFP flag that goes with it,
> > so the slab can use a special pool of pages that are preservable.
> > Otherwise, we are going to be leaking memory from the old kernel in
> > the unpreserved parts of the pages.
>
> That isn't an issue. If we make slab preservable then we'd have to
> preserve the page and then somehow record what order is stored in that
> page and a bit map of which parts are allocated to restore the slab
> state on recovery.
>
> So long as the non-preserved memory comes back as freed on the
> sucessor kernel it doesn't matter what was in it in the preceeding
> kernel. The new kernel will eventually zero it. So it isn't a 'leak'.

Hi Jason,

I agree, it's not a "leak" in the traditional sense, as we trust the
successor kernel to manage its own memory.

However, my concern is that without a dedicated GFP flag, this
partial-page preservation model becomes too fragile, inefficient, and
creates a data exposure risk.

You're right the new kernel will eventually zero memory, but KHO
preserves at page granularity. If we preserve a single slab object,
the entire page is handed off. When the new kernel maps that page
(e.g., to userspace) to access the preserved object, it also exposes
the unpreserved portions of that same page. Those portions contain
stale data from the old kernel and won't have been zeroed yet,
creating an easy-to-miss data leak vector. It makes the API very
error-prone.

There's also the inefficiency. The unpreserved parts of that page are
unusable by the new kernel until the preserved object is freed.
Depending on the use case, that object might live for the entire
kernel lifetime, effectively wasting that memory. This waste could
then accumulate with each subsequent live update.

Trying to create a special KHO slab cache isn't a solution either,
since slab caches are often merged.

As I see it, the only robust solution is to use a special GFP flag.
This would force these allocations to come from a dedicated pool of
pages that are fully preserved, with no partial/mixed-use pages and
also retrieved as slabs.

That said, I'm not sure preserving individual slab objects is a high
priority right now. It might be simpler to avoid it altogether.

Pasha


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-24 13:57         ` Pasha Tatashin
@ 2025-10-24 14:20           ` Jason Gunthorpe
  2025-10-24 14:36             ` Pasha Tatashin
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2025-10-24 14:20 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, akpm, brauner, corbet, graf, linux-kernel,
	linux-kselftest, linux-mm, masahiroy, ojeda, rdunlap, rppt, tj,
	jasonmiu, dmatlack, skhawaja, glider, elver

On Fri, Oct 24, 2025 at 09:57:24AM -0400, Pasha Tatashin wrote:

> You're right the new kernel will eventually zero memory, but KHO
> preserves at page granularity. If we preserve a single slab object,
> the entire page is handed off. When the new kernel maps that page
> (e.g., to userspace) to access the preserved object, it also exposes
> the unpreserved portions of that same page. Those portions contain
> stale data from the old kernel and won't have been zeroed yet,
> creating an easy-to-miss data leak vector. 

Do we zero any of the memory on KHO? Honestly, I wouldn't worry about
the point it zeros, slab guarentees it will be zero when it should be
zero.

> There's also the inefficiency. The unpreserved parts of that page are
> unusable by the new kernel until the preserved object is freed.

Thats not how I see slab preservation working. When the slab page
is unpreserved all the free space in that page should be immediately
available to the sucessor kernel.

> As I see it, the only robust solution is to use a special GFP flag.
> This would force these allocations to come from a dedicated pool of
> pages that are fully preserved, with no partial/mixed-use pages and
> also retrieved as slabs.

It is certainly more efficient to preserve fewer slab pages in total
and pooling would help get there.

> That said, I'm not sure preserving individual slab objects is a high
> priority right now. It might be simpler to avoid it altogether.

I think we will need something; a lot of the structs I'm seeing in
other patches are small, and allocating a whole page is pretty
wasteful too.

Jason


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-24 14:20           ` Jason Gunthorpe
@ 2025-10-24 14:36             ` Pasha Tatashin
  2025-10-24 14:55               ` Jason Gunthorpe
  0 siblings, 1 reply; 24+ messages in thread
From: Pasha Tatashin @ 2025-10-24 14:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, akpm, brauner, corbet, graf, linux-kernel,
	linux-kselftest, linux-mm, masahiroy, ojeda, rdunlap, rppt, tj,
	jasonmiu, dmatlack, skhawaja, glider, elver

On Fri, Oct 24, 2025 at 10:20 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Oct 24, 2025 at 09:57:24AM -0400, Pasha Tatashin wrote:
>
> > You're right the new kernel will eventually zero memory, but KHO
> > preserves at page granularity. If we preserve a single slab object,
> > the entire page is handed off. When the new kernel maps that page
> > (e.g., to userspace) to access the preserved object, it also exposes
> > the unpreserved portions of that same page. Those portions contain
> > stale data from the old kernel and won't have been zeroed yet,
> > creating an easy-to-miss data leak vector.
>
> Do we zero any of the memory on KHO? Honestly, I wouldn't worry about
> the point it zeros, slab guarentees it will be zero when it should be
> zero.

We do not zero memory on kexec/KHO/LU; instead, the next kernel zeroes
memory on demand during allocation. My point is that the KHO interface
retrieves a full page in the next kernel, not an individual slab.
Consequently, a caller might retrieve data that was preserved as a
slab in the previous kernel, expose that data to the user, and
unintentionally leak the remaining part of the page as well.

> > There's also the inefficiency. The unpreserved parts of that page are
> > unusable by the new kernel until the preserved object is freed.
>
> Thats not how I see slab preservation working. When the slab page
> is unpreserved all the free space in that page should be immediately
> available to the sucessor kernel.

This ties into the same problem. The scenario I'm worried about is:
1. A caller preserves one small slab object.
2. In the new kernel, the caller retrieves the entire page that
contains this object.
3. The caller uses the data from that slab object without freeing it.

In this case, the rest of the page (all the other slab slots) is
effectively wasted. The page can't be fully returned to the system or
used by the slab allocator until that one preserved object is freed,
which might be never. The free space isn't "immediately available"
because the page is being held by the caller, even though the caller
is using only a single object in that page.

> > As I see it, the only robust solution is to use a special GFP flag.
> > This would force these allocations to come from a dedicated pool of
> > pages that are fully preserved, with no partial/mixed-use pages and
> > also retrieved as slabs.
>
> It is certainly more efficient to preserve fewer slab pages in total
> and pooling would help get there.
>
> > That said, I'm not sure preserving individual slab objects is a high
> > priority right now. It might be simpler to avoid it altogether.
>
> I think we will need something, a lot of the structs I'm seeing in
> other patches are small and allocating a whole page is pretty wasteful
> too.

If we're going to support this, it would have to be specifically
engineered as full slab support for KHO preservation, where the
interface retrieves slab objects directly, not the pages they're on,
and I think would require using a special GFP_PRESERVED flag.
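
Purely as a hypothetical sketch (neither GFP_PRESERVED nor a
kho_preserve_slab() helper exists today), the caller side could then
look like:

	/* Object comes from a dedicated, fully-preservable slab pool. */
	obj = kzalloc(sizeof(*obj), GFP_PRESERVED);

	/* Later, at serialization time: record the object, not its page. */
	err = kho_preserve_slab(obj);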

> Jason


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-24 14:36             ` Pasha Tatashin
@ 2025-10-24 14:55               ` Jason Gunthorpe
  2025-10-24 15:06                 ` Pasha Tatashin
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2025-10-24 14:55 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, akpm, brauner, corbet, graf, linux-kernel,
	linux-kselftest, linux-mm, masahiroy, ojeda, rdunlap, rppt, tj,
	jasonmiu, dmatlack, skhawaja, glider, elver

On Fri, Oct 24, 2025 at 10:36:45AM -0400, Pasha Tatashin wrote:

> We do not zero memory on kexec/KHO/LU; instead, the next kernel zeroes
> memory on demand during allocation. My point is that the KHO interface
> retrieves a full page in the next kernel, not an individual slab.
> Consequently, a caller might retrieve data that was preserved as a
> slab in the previous kernel, expose that data to the user, and
> unintentionally leak the remaining part of the page as well.

I don't think preventing that is part of the kho threat model..

> 
> > > There's also the inefficiency. The unpreserved parts of that page are
> > > unusable by the new kernel until the preserved object is freed.
> >
> > Thats not how I see slab preservation working. When the slab page
> > is unpreserved all the free space in that page should be immediately
> > available to the sucessor kernel.
> 
> This ties into the same problem. The scenario I'm worried about is:
> 1. A caller preserves one small slab object.
> 2. In the new kernel, the caller retrieves the entire page that
> contains this object.
> 3. The caller uses the data from that slab object without freeing it.

4. When slab restores the page it immediately makes all the free slots
  available on its free list.

> > other patches are small and allocating a whole page is pretty wasteful
> > too.
> 
> If we're going to support this, it would have to be specifically
> engineered as full slab support for KHO preservation, where the
> interface retrieves slab objects directly, not the pages they're on,

Yes

> and I think would require using a special GFP_PRESERVED flag.

Maybe so, I was hoping not..

Jason


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator
  2025-10-24 14:55               ` Jason Gunthorpe
@ 2025-10-24 15:06                 ` Pasha Tatashin
  0 siblings, 0 replies; 24+ messages in thread
From: Pasha Tatashin @ 2025-10-24 15:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, akpm, brauner, corbet, graf, linux-kernel,
	linux-kselftest, linux-mm, masahiroy, ojeda, rdunlap, rppt, tj,
	jasonmiu, dmatlack, skhawaja, glider, elver

On Fri, Oct 24, 2025 at 10:55 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Oct 24, 2025 at 10:36:45AM -0400, Pasha Tatashin wrote:
>
> > We do not zero memory on kexec/KHO/LU; instead, the next kernel zeroes
> > memory on demand during allocation. My point is that the KHO interface
> > retrieves a full page in the next kernel, not an individual slab.
> > Consequently, a caller might retrieve data that was preserved as a
> > slab in the previous kernel, expose that data to the user, and
> > unintentionally leak the remaining part of the page as well.
>
> I don't think preventing that is part of the kho threat model..
>
> >
> > > > There's also the inefficiency. The unpreserved parts of that page are
> > > > unusable by the new kernel until the preserved object is freed.
> > >
> > > Thats not how I see slab preservation working. When the slab page
> > > is unpreserved all the free space in that page should be immediately
> > > available to the sucessor kernel.
> >
> > This ties into the same problem. The scenario I'm worried about is:
> > 1. A caller preserves one small slab object.
> > 2. In the new kernel, the caller retrieves the entire page that
> > contains this object.
> > 3. The caller uses the data from that slab object without freeing it.
>
> 4. When slab restores the page it immediately makes all the free slots
>   available on its free list.

Right, we do not have this functionality.

>
> > > other patches are small and allocating a whole page is pretty wasteful
> > > too.
> >
> > If we're going to support this, it would have to be specifically
> > engineered as full slab support for KHO preservation, where the
> > interface retrieves slab objects directly, not the pages they're on,
>
> Yes
>
> > and I think would require using a special GFP_PRESERVED flag.
>
> Maybe so, I was hoping not..
>
> Jason


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2025-10-24 15:06 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-15  5:31 [PATCH 0/2] KHO: Fix metadata allocation in scratch area Pasha Tatashin
2025-10-15  5:31 ` [PATCH 1/2] liveupdate: kho: warn and fail on metadata or preserved memory " Pasha Tatashin
2025-10-15  8:21   ` Mike Rapoport
2025-10-15 12:36     ` Pasha Tatashin
2025-10-16 17:23       ` Mike Rapoport
2025-10-18 15:31         ` Pasha Tatashin
2025-10-18 15:28       ` Pasha Tatashin
2025-10-15 12:10   ` Pratyush Yadav
2025-10-15 12:40     ` Pasha Tatashin
2025-10-15 13:11       ` Pratyush Yadav
2025-10-15  5:31 ` [PATCH 2/2] liveupdate: kho: allocate metadata directly from the buddy allocator Pasha Tatashin
2025-10-15  8:37   ` Mike Rapoport
2025-10-15 12:46     ` Pasha Tatashin
2025-10-15 13:05   ` Pratyush Yadav
2025-10-15 14:19     ` Pasha Tatashin
2025-10-15 14:36       ` Alexander Potapenko
2025-10-24 13:25       ` Jason Gunthorpe
2025-10-24 13:57         ` Pasha Tatashin
2025-10-24 14:20           ` Jason Gunthorpe
2025-10-24 14:36             ` Pasha Tatashin
2025-10-24 14:55               ` Jason Gunthorpe
2025-10-24 15:06                 ` Pasha Tatashin
2025-10-15 14:22     ` Pasha Tatashin
2025-10-24 13:21   ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox