linux-mm.kvack.org archive mirror
* [RFC PATCH 00/10] liveupdate: hugetlb support
@ 2025-12-06 23:02 Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 01/10] kho: drop restriction on maximum page order Pratyush Yadav
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: Pratyush Yadav @ 2025-12-06 23:02 UTC (permalink / raw)
  To: Pasha Tatashin, Mike Rapoport, Pratyush Yadav, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Muchun Song, Oscar Salvador,
	Alexander Graf, David Matlack, David Rientjes, Jason Gunthorpe,
	Samiullah Khawaja, Vipin Sharma, Zhu Yanjun
  Cc: linux-kernel, linux-mm, linux-doc, kexec

This series adds support for live updating hugetlb-backed memfds,
including support for 1G huge pages. This allows live updating VMs that
use hugepages to back their memory.

Please take a look at patch series [0] to learn more about the Live
Update Orchestrator (LUO). It also includes patches for live updating a
shmem-backed memfd. This series is a follow-up to that, adding huge page
support.

You can also read this LWN article [1] to learn more about KHO and the
Live Update Orchestrator, though note that the article is a bit out of
date: LUO has since evolved. For example, subsystems have been replaced
with FLBs, and the state machine has been simplified.

This series is based on top of mm-non-unstable, which includes the LUO
FLB patches [2].

This series uses LUO FLB to track how many pages are preserved for each
hstate, to ensure the live updated kernel does not over-allocate
hugepages.
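
At a high level, the boot-time allocation ends up doing something like the
sketch below. This is only an illustration of the approach; the real changes
are in patches 6 and 7, and the function names follow the series:

    /*
     * Illustrative sketch only: allocate just the hugepages that are not
     * already preserved across the live update for this hstate.
     */
    static void __init hugetlb_hstate_alloc_pages_sketch(struct hstate *h)
    {
            /* Count recorded in the hugetlb FLB by the previous kernel. */
            unsigned long liveupdated = hstate_liveupdate_pages(h);
            unsigned long allocated = 0;

            if (liveupdated < h->max_huge_pages)
                    allocated = hugetlb_pages_alloc_boot(h,
                                    h->max_huge_pages - liveupdated);

            /* Preserved pages join the pool when the memfd is retrieved. */
            hugetlb_hstate_alloc_pages_errcheck(allocated, liveupdated, h);
    }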

Areas for Discussion
====================

Why is this an RFC?
-------------------

While I believe the code is in decent shape, I have only done some basic
testing and have not put it through more intensive testing, including
testing on ARM64. I am also not completely confident in the handling of
reservations and cgroup charging, even though it appears to work on the
surface.

The goal of this is to start discussion on the high-level points so we
can at least agree on the general direction. This also gives people some
time to look at the code before the session discussing it at LPC 2025
[3].

Disabling scratch-only earlier in boot
--------------------------------------

Patch 2 moves KHO memory initialization to earlier in boot. Detailed
discussion on the topic is in patch 2's message.

Allocating gigantic hugepages after paging_init() on x86
--------------------------------------------------------

To allow KHO to work with gigantic hugepages on x86, patch 2 moves
gigantic huge page allocation to after paging_init(). This can have some
impact on the ability to allocate gigantic pages, but I believe the
impact should not be severe. See patch 2 for more detailed discussion
and test results.

Early-boot access to LUO FLB data
---------------------------------

To work with gigantic page allocation, LUO FLB data is needed in early
boot, before LUO is fully initialized. Patches 3 and 4 add support for
fetching LUO FLB data in early boot.

Preserving the entire huge page pool vs only used
-------------------------------------------------

This series makes the design decision to preserve only the count of
preserved huge pages for each hstate, instead of preserving the entire
huge page pool. Both approaches were brought up in the Live Update
meetings. Patch 6 discusses the reasoning in more detail.

[0] https://lore.kernel.org/linux-mm/20251125165850.3389713-1-pasha.tatashin@soleen.com/T/#u
[1] https://lwn.net/Articles/1033364/
[2] https://lore.kernel.org/linux-mm/20251125225006.3722394-1-pasha.tatashin@soleen.com/T/#u
[3] https://lpc.events/event/19/contributions/2044/

Pratyush Yadav (10):
  kho: drop restriction on maximum page order
  kho: disable scratch-only earlier in boot
  liveupdate: do early initialization before hugepages are allocated
  liveupdate: flb: allow getting FLB data in early boot
  mm: hugetlb: export some functions to hugetlb-internal header
  liveupdate: hugetlb subsystem FLB state preservation
  mm: hugetlb: don't allocate pages already in live update
  mm: hugetlb: disable CMA if liveupdate is enabled
  mm: hugetlb: allow freezing the inode
  liveupdate: allow preserving hugetlb-backed memfd

 Documentation/mm/memfd_preservation.rst |   9 +
 MAINTAINERS                             |   2 +
 arch/x86/kernel/setup.c                 |  19 +-
 fs/hugetlbfs/inode.c                    |  14 +-
 include/linux/hugetlb.h                 |   8 +
 include/linux/kho/abi/hugetlb.h         |  98 ++++
 include/linux/liveupdate.h              |  12 +
 kernel/liveupdate/Kconfig               |  15 +
 kernel/liveupdate/kexec_handover.c      |  13 +-
 kernel/liveupdate/luo_core.c            |  30 +-
 kernel/liveupdate/luo_flb.c             |  69 ++-
 kernel/liveupdate/luo_internal.h        |   2 +
 mm/Makefile                             |   1 +
 mm/hugetlb.c                            | 113 ++--
 mm/hugetlb_cma.c                        |   7 +
 mm/hugetlb_internal.h                   |  50 ++
 mm/hugetlb_luo.c                        | 699 ++++++++++++++++++++++++
 mm/memblock.c                           |   1 -
 mm/memfd_luo.c                          |   4 -
 mm/mm_init.c                            |  15 +-
 20 files changed, 1099 insertions(+), 82 deletions(-)
 create mode 100644 include/linux/kho/abi/hugetlb.h
 create mode 100644 mm/hugetlb_internal.h
 create mode 100644 mm/hugetlb_luo.c


base-commit: 55b7d75112c25b3e2a5eadc11244c330a5c00a41
-- 
2.43.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 01/10] kho: drop restriction on maximum page order
  2025-12-06 23:02 [RFC PATCH 00/10] liveupdate: hugetlb support Pratyush Yadav
@ 2025-12-06 23:02 ` Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 02/10] kho: disable scratch-only earlier in boot Pratyush Yadav
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Pratyush Yadav @ 2025-12-06 23:02 UTC (permalink / raw)
  To: Pasha Tatashin, Mike Rapoport, Pratyush Yadav, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Muchun Song, Oscar Salvador,
	Alexander Graf, David Matlack, David Rientjes, Jason Gunthorpe,
	Samiullah Khawaja, Vipin Sharma, Zhu Yanjun
  Cc: linux-kernel, linux-mm, linux-doc, kexec

KHO currently restricts the maximum order of a restored page to the
maximum order supported by the buddy allocator. While this works fine
for much of the data passed across kexec, it is possible to have pages
larger than MAX_PAGE_ORDER.

For one, it is possible to get a larger order when using
kho_preserve_pages() if the number of pages is large enough, since it
tries to combine multiple aligned 0-order preservations into one higher
order preservation.

For another, the upcoming hugepage support can have gigantic hugepages
preserved over KHO.

There is no real reason for this limit. The KHO preservation machinery
can handle any page order. Remove this artificial restriction on max
page order.

Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
---

Notes:
    This patch can be taken independent of hugetlb live update support.

 kernel/liveupdate/kexec_handover.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 9dc51fab604f..9aa128909ecf 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -234,7 +234,7 @@ static struct page *kho_restore_page(phys_addr_t phys, bool is_folio)
 	 * check also implicitly makes sure phys is order-aligned since for
 	 * non-order-aligned phys addresses, magic will never be set.
 	 */
-	if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order > MAX_PAGE_ORDER))
+	if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC))
 		return NULL;
 	nr_pages = (1 << info.order);
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 02/10] kho: disable scratch-only earlier in boot
  2025-12-06 23:02 [RFC PATCH 00/10] liveupdate: hugetlb support Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 01/10] kho: drop restriction on maximum page order Pratyush Yadav
@ 2025-12-06 23:02 ` Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 03/10] liveupdate: do early initialization before hugepages are allocated Pratyush Yadav
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Pratyush Yadav @ 2025-12-06 23:02 UTC (permalink / raw)
  To: Pasha Tatashin, Mike Rapoport, Pratyush Yadav, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Muchun Song, Oscar Salvador,
	Alexander Graf, David Matlack, David Rientjes, Jason Gunthorpe,
	Samiullah Khawaja, Vipin Sharma, Zhu Yanjun
  Cc: linux-kernel, linux-mm, linux-doc, kexec

Background
==========

Scratch areas
-------------

When KHO is used, allocations are only allowed from scratch areas. The
scratch areas are pre-reserved chunks of memory that are known to not
have any preserved memory. They can safely be used until KHO is able to
parse its serialized data to find out which pages are preserved.

The scratch areas are generally sized to ensure enough memory is available
for early boot allocations, but should not be excessively large, so that as
little memory as possible is wasted.
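
For reference, the scratch-only restriction in memblock roughly amounts to
the check sketched below. This is a simplified illustration; the flag and
helper names here are approximate and the real logic lives in mm/memblock.c:

    /* Simplified sketch of how scratch-only mode filters allocations. */
    static bool kho_region_allocatable(struct memblock_region *r)
    {
            /*
             * While scratch-only mode is active, only regions flagged as
             * KHO scratch may satisfy allocations. Any other region could
             * still contain pages preserved by the previous kernel.
             */
            if (kho_scratch_only_active && !(r->flags & MEMBLOCK_KHO_SCRATCH))
                    return false;

            return true;
    }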

Gigantic hugepage allocation
----------------------------

Gigantic hugepages are allocated early in boot before memblock releases
pages to the buddy allocator. This is to ensure enough contiguous chunks
of memory are available to satisfy huge page allocations. On x86 this is
done in setup_arch(). On other architectures, including ARM64 (the only
other arch that supports KHO), this is done in mm_core_init().

Problem
=======

Currently during a KHO boot, scratch-only mode is still active when hugepage
allocations are attempted, on both x86 and ARM64. Since scratch areas are
generally not large enough to accommodate these allocations, they fail and
gigantic hugepages end up unavailable.

Solution
========

Moving KHO memory init
----------------------

Move KHO memory initialization before gigantic hugepage allocation.
Disable scratch-only as soon as the bitmaps are deserialized, since
there is no longer a reason to stay in scratch-only mode. Since on x86
this can get called twice, once from setup_arch() and once from the
generic path in mm_core_init(), add a variable to catch this and skip
double-initialization.

Re-ordering hugepage allocation
-------------------------------

KHO memory initialization uses the struct page to store the order. On
x86, this is not available until paging_init(). If kho_memory_init() is
called before paging_init(), it will cause a page fault when trying to
access the struct pages.

But hugepage allocations are done before paging_init(). Move them to
just after paging_init(), and call kho_memory_init() right before that.
While in theory this increases the chance of hugepage allocations
failing, in practice it will likely not have a huge impact, since
systems usually leave a fair bit of margin for non-hugepage workloads.

Testing results
===============

Normal boot
-----------

On my test system with 7GiB of memory, I tried allocating 6 1G
hugepages. I can get a maximum of 4 1G hugepages both with and without
this patch.

    [    0.039182] HugeTLB: allocating 6 of page size 1.00 GiB failed.  Only allocated 4 hugepages.

KHO boot
--------

Without this patch, I cannot get any hugepages:

    [    0.098201] HugeTLB: allocating 6 of page size 1.00 GiB failed.  Only allocated 0 hugepages.

With this patch, I am again able to get 4:

    [    0.194657] HugeTLB: allocating 6 of page size 1.00 GiB failed.  Only allocated 4 hugepages.

Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
---

Notes:
    Only tested on x86 so far, not yet on ARM64. This patch can also be
    taken independent of the rest of the series. Even with plain KHO and
    live update not enabled at all, gigantic hugepages fail to allocate
    because of scratch-only mode.

 arch/x86/kernel/setup.c            | 12 +++++++-----
 kernel/liveupdate/kexec_handover.c | 11 ++++++++++-
 mm/memblock.c                      |  1 -
 mm/mm_init.c                       |  8 ++------
 4 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 74aa904be6dc..9bf00287c408 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1203,11 +1203,6 @@ void __init setup_arch(char **cmdline_p)
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
-	if (boot_cpu_has(X86_FEATURE_GBPAGES)) {
-		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
-		hugetlb_bootmem_alloc();
-	}
-
 	/*
 	 * Reserve memory for crash kernel after SRAT is parsed so that it
 	 * won't consume hotpluggable memory.
@@ -1219,6 +1214,13 @@ void __init setup_arch(char **cmdline_p)
 
 	x86_init.paging.pagetable_init();
 
+	kho_memory_init();
+
+	if (boot_cpu_has(X86_FEATURE_GBPAGES)) {
+		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+		hugetlb_bootmem_alloc();
+	}
+
 	kasan_init();
 
 	/*
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 9aa128909ecf..4cfd5690f356 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -1432,14 +1432,23 @@ static void __init kho_release_scratch(void)
 	}
 }
 
+static bool kho_memory_initialized;
+
 void __init kho_memory_init(void)
 {
+	if (kho_memory_initialized)
+		return;
+
+	kho_memory_initialized = true;
+
 	if (kho_in.scratch_phys) {
 		kho_scratch = phys_to_virt(kho_in.scratch_phys);
-		kho_release_scratch();
 
 		if (!kho_mem_deserialize(kho_get_fdt()))
 			kho_in.fdt_phys = 0;
+
+		memblock_clear_kho_scratch_only();
+		kho_release_scratch();
 	} else {
 		kho_reserve_scratch();
 	}
diff --git a/mm/memblock.c b/mm/memblock.c
index c7869860e659..a5682dff526d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2342,7 +2342,6 @@ void __init memblock_free_all(void)
 	free_unused_memmap();
 	reset_all_zones_managed_pages();
 
-	memblock_clear_kho_scratch_only();
 	pages = free_low_memory_core_early();
 	totalram_pages_add(pages);
 }
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7712d887b696..93cec06c1c8a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2679,6 +2679,8 @@ void __init __weak mem_init(void)
 void __init mm_core_init(void)
 {
 	arch_mm_preinit();
+
+	kho_memory_init();
 	hugetlb_bootmem_alloc();
 
 	/* Initializations relying on SMP setup */
@@ -2697,12 +2699,6 @@ void __init mm_core_init(void)
 	kmsan_init_shadow();
 	stack_depot_early_init();
 
-	/*
-	 * KHO memory setup must happen while memblock is still active, but
-	 * as close as possible to buddy initialization
-	 */
-	kho_memory_init();
-
 	memblock_free_all();
 	mem_init();
 	kmem_cache_init();
-- 
2.43.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 03/10] liveupdate: do early initialization before hugepages are allocated
  2025-12-06 23:02 [RFC PATCH 00/10] liveupdate: hugetlb support Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 01/10] kho: drop restriction on maximum page order Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 02/10] kho: disable scratch-only earlier in boot Pratyush Yadav
@ 2025-12-06 23:02 ` Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 04/10] liveupdate: flb: allow getting FLB data in early boot Pratyush Yadav
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Pratyush Yadav @ 2025-12-06 23:02 UTC (permalink / raw)
  To: Pasha Tatashin, Mike Rapoport, Pratyush Yadav, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Muchun Song, Oscar Salvador,
	Alexander Graf, David Matlack, David Rientjes, Jason Gunthorpe,
	Samiullah Khawaja, Vipin Sharma, Zhu Yanjun
  Cc: linux-kernel, linux-mm, linux-doc, kexec

To support hugepage preservation using LUO, the hugetlb subsystem needs
to get liveupdate data when it allocates the hugepages to find out how
many pages are coming from live update.

Move early LUO init from early_initcall to mm_core_init(). This is where
gigantic hugepages are allocated on ARM64. On x86, they are allocated in
setup_arch(), so add a call there as well. Keep track of whether the
function was already called to avoid double-init.

liveupdate_early_init() only gets the KHO subtree and validates that the
data is understood. These are read-only operations and do not need much
from the system, so it is safe to call early in boot.

Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
---
 arch/x86/kernel/setup.c          |  7 +++++++
 include/linux/liveupdate.h       |  6 ++++++
 kernel/liveupdate/luo_core.c     | 30 ++++++++++++++++++++++++++----
 kernel/liveupdate/luo_internal.h |  2 ++
 mm/mm_init.c                     |  7 +++++++
 5 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 9bf00287c408..e2ec779afc2c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -26,6 +26,7 @@
 #include <linux/tboot.h>
 #include <linux/usb/xhci-dbgp.h>
 #include <linux/vmalloc.h>
+#include <linux/liveupdate.h>
 
 #include <uapi/linux/mount.h>
 
@@ -1216,6 +1217,12 @@ void __init setup_arch(char **cmdline_p)
 
 	kho_memory_init();
 
+	/*
+	 * Hugepages might be preserved from a liveupdate. Make sure it is
+	 * initialized so hugetlb can query its state.
+	 */
+	liveupdate_early_init();
+
 	if (boot_cpu_has(X86_FEATURE_GBPAGES)) {
 		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
 		hugetlb_bootmem_alloc();
diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
index ed81e7b31a9f..78e8c529e4e7 100644
--- a/include/linux/liveupdate.h
+++ b/include/linux/liveupdate.h
@@ -214,6 +214,8 @@ struct liveupdate_flb {
 
 #ifdef CONFIG_LIVEUPDATE
 
+void __init liveupdate_early_init(void);
+
 /* Return true if live update orchestrator is enabled */
 bool liveupdate_enabled(void);
 
@@ -233,6 +235,10 @@ int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp);
 
 #else /* CONFIG_LIVEUPDATE */
 
+static inline void liveupdate_early_init(void)
+{
+}
+
 static inline bool liveupdate_enabled(void)
 {
 	return false;
diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c
index 7a9ef16b37d8..2c740ecad8e6 100644
--- a/kernel/liveupdate/luo_core.c
+++ b/kernel/liveupdate/luo_core.c
@@ -69,6 +69,13 @@ static struct {
 	u64 liveupdate_num;
 } luo_global;
 
+static bool __luo_early_initialized __initdata;
+
+bool __init luo_early_initialized(void)
+{
+	return __luo_early_initialized;
+}
+
 static int __init early_liveupdate_param(char *buf)
 {
 	return kstrtobool(buf, &luo_global.enabled);
@@ -133,20 +140,35 @@ static int __init luo_early_startup(void)
 	return err;
 }
 
-static int __init liveupdate_early_init(void)
+/*
+ * This should only be called after KHO FDT is known. It gets the LUO subtree
+ * and does initial validation, making early boot read-only access possible.
+ */
+void __init liveupdate_early_init(void)
 {
 	int err;
 
+	/*
+	 * HugeTLB needs LUO to be initialized early in boot, before gigantic
+	 * hugepages are allocated. On x86, that happens in setup_arch(), but on
+	 * ARM64 (and other architectures) that happens in mm_core_init().
+	 *
+	 * Since the code in mm_core_init() is shared between all architectures,
+	 * this can lead to the init being called twice. Skip if initialization
+	 * was already done.
+	 */
+	if (__luo_early_initialized)
+		return;
+
+	__luo_early_initialized = true;
+
 	err = luo_early_startup();
 	if (err) {
 		luo_global.enabled = false;
 		luo_restore_fail("The incoming tree failed to initialize properly [%pe], disabling live update\n",
 				 ERR_PTR(err));
 	}
-
-	return err;
 }
-early_initcall(liveupdate_early_init);
 
 /* Called during boot to create outgoing LUO fdt tree */
 static int __init luo_fdt_setup(void)
diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
index 6115d6a4054d..171c54af7b38 100644
--- a/kernel/liveupdate/luo_internal.h
+++ b/kernel/liveupdate/luo_internal.h
@@ -114,6 +114,8 @@ int __init luo_flb_setup_outgoing(void *fdt);
 int __init luo_flb_setup_incoming(void *fdt);
 void luo_flb_serialize(void);
 
+bool __init luo_early_initialized(void);
+
 #ifdef CONFIG_LIVEUPDATE_TEST
 void liveupdate_test_register(struct liveupdate_file_handler *fh);
 void liveupdate_test_unregister(struct liveupdate_file_handler *fh);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 93cec06c1c8a..9a5b06a93622 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -31,6 +31,7 @@
 #include <linux/execmem.h>
 #include <linux/vmstat.h>
 #include <linux/kexec_handover.h>
+#include <linux/liveupdate.h>
 #include <linux/hugetlb.h>
 #include "internal.h"
 #include "slab.h"
@@ -2681,6 +2682,12 @@ void __init mm_core_init(void)
 	arch_mm_preinit();
 
 	kho_memory_init();
+	/*
+	 * Hugepages might be preserved from a liveupdate. Make sure it is
+	 * initialized so hugetlb can query its state.
+	 */
+	liveupdate_early_init();
+
 	hugetlb_bootmem_alloc();
 
 	/* Initializations relying on SMP setup */
-- 
2.43.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 04/10] liveupdate: flb: allow getting FLB data in early boot
  2025-12-06 23:02 [RFC PATCH 00/10] liveupdate: hugetlb support Pratyush Yadav
                   ` (2 preceding siblings ...)
  2025-12-06 23:02 ` [RFC PATCH 03/10] liveupdate: do early initialization before hugepages are allocated Pratyush Yadav
@ 2025-12-06 23:02 ` Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 05/10] mm: hugetlb: export some functions to hugetlb-internal header Pratyush Yadav
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Pratyush Yadav @ 2025-12-06 23:02 UTC (permalink / raw)
  To: Pasha Tatashin, Mike Rapoport, Pratyush Yadav, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Muchun Song, Oscar Salvador,
	Alexander Graf, David Matlack, David Rientjes, Jason Gunthorpe,
	Samiullah Khawaja, Vipin Sharma, Zhu Yanjun
  Cc: linux-kernel, linux-mm, linux-doc, kexec

To support hugepage preservation using LUO, the hugetlb subsystem needs
to get liveupdate data when it allocates the hugepages to find out how
many pages are coming from live update. This data is preserved via LUO
FLB.

Since gigantic hugepage allocations happen before LUO (and much of the
rest of the system) is initialized, the usual
liveupdate_flb_get_incoming() cannot work.

Add a read-only variant that fetches the FLB data but does not trigger
its retrieve or do any locking or reference counting. It is the caller's
responsibility to make sure that using this data has no side effects on
the proper retrieve call that happens later.

Refactor the logic that finds the right FLB in the serialized data into a
helper that can be used both from luo_flb_retrieve_one() (called from
luo_flb_get_incoming()) and from liveupdate_flb_incoming_early().
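
For illustration, an early-boot user looks roughly like the sketch below. The
FLB and structure names here are hypothetical; the real user is the hugetlb
FLB added later in this series:

    /* Hypothetical FLB and serialized structure, for illustration only. */
    struct my_flb_ser { u64 count; };
    static struct liveupdate_flb my_flb = { .compatible = "my-flb-v1" };

    static unsigned long __init my_early_preserved_count(void)
    {
            u64 data;

            /* Only valid after liveupdate_early_init() has run. */
            if (liveupdate_flb_incoming_early(&my_flb, &data))
                    return 0;       /* no incoming data, nothing preserved */

            /*
             * Read-only peek at the serialized state. The normal retrieve
             * still happens later via liveupdate_flb_get_incoming().
             */
            return ((struct my_flb_ser *)phys_to_virt(data))->count;
    }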

Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
---
 include/linux/liveupdate.h  |  6 ++++
 kernel/liveupdate/luo_flb.c | 69 +++++++++++++++++++++++++++++--------
 2 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
index 78e8c529e4e7..39b429d2c62c 100644
--- a/include/linux/liveupdate.h
+++ b/include/linux/liveupdate.h
@@ -232,6 +232,7 @@ int liveupdate_unregister_flb(struct liveupdate_file_handler *fh,
 
 int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp);
 int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp);
+int liveupdate_flb_incoming_early(struct liveupdate_flb *flb, u64 *datap);
 
 #else /* CONFIG_LIVEUPDATE */
 
@@ -283,5 +284,10 @@ static inline int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb,
 	return -EOPNOTSUPP;
 }
 
+static inline int liveupdate_flb_incoming_early(struct liveupdate_flb *flb, u64 *datap)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif /* CONFIG_LIVEUPDATE */
 #endif /* _LINUX_LIVEUPDATE_H */
diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c
index e80ac5b575ec..fb287734a88e 100644
--- a/kernel/liveupdate/luo_flb.c
+++ b/kernel/liveupdate/luo_flb.c
@@ -145,12 +145,25 @@ static void luo_flb_file_unpreserve_one(struct liveupdate_flb *flb)
 	}
 }
 
+static struct luo_flb_ser *luo_flb_find_ser(struct luo_flb_header *fh,
+					    const char *name)
+{
+	if (!fh->active)
+		return ERR_PTR(-ENODATA);
+
+	for (int i = 0; i < fh->header_ser->count; i++) {
+		if (!strcmp(fh->ser[i].name, name))
+			return &fh->ser[i];
+	}
+
+	return ERR_PTR(-ENOENT);
+}
+
 static int luo_flb_retrieve_one(struct liveupdate_flb *flb)
 {
 	struct luo_flb_private *private = luo_flb_get_private(flb);
-	struct luo_flb_header *fh = &luo_flb_global.incoming;
 	struct liveupdate_flb_op_args args = {0};
-	bool found = false;
+	struct luo_flb_ser *ser;
 	int err;
 
 	guard(mutex)(&private->incoming.lock);
@@ -158,20 +171,12 @@ static int luo_flb_retrieve_one(struct liveupdate_flb *flb)
 	if (private->incoming.obj)
 		return 0;
 
-	if (!fh->active)
-		return -ENODATA;
+	ser = luo_flb_find_ser(&luo_flb_global.incoming, flb->compatible);
+	if (IS_ERR(ser))
+		return PTR_ERR(ser);
 
-	for (int i = 0; i < fh->header_ser->count; i++) {
-		if (!strcmp(fh->ser[i].name, flb->compatible)) {
-			private->incoming.data = fh->ser[i].data;
-			private->incoming.count = fh->ser[i].count;
-			found = true;
-			break;
-		}
-	}
-
-	if (!found)
-		return -ENOENT;
+	private->incoming.data = ser->data;
+	private->incoming.count = ser->count;
 
 	args.flb = flb;
 	args.data = private->incoming.data;
@@ -188,6 +193,40 @@ static int luo_flb_retrieve_one(struct liveupdate_flb *flb)
 	return 0;
 }
 
+/**
+ * liveupdate_flb_incoming_early - Fetch FLB data in early boot.
+ * @flb:   The FLB definition
+ * @datap: Pointer to serialized state handle of the FLB
+ *
+ * This function is intended to be called during early boot, before the
+ * liveupdate subsystem is fully initialized. It must only be called after
+ * liveupdate_early_init().
+ *
+ * Directly returns the u64 handle to the serialized state of the FLB, and does
+ * not trigger its retrieve. A later fetch of the FLB will trigger the retrieve.
+ * Callers must make sure there are no side effects because of this.
+ *
+ * Return: 0 on success, -errno on failure. -ENODATA means no incoming FLB data,
+ * -ENOENT means specific FLB not found in incoming data, and -EOPNOTSUPP when
+ * live update is disabled or early initialization has not finished.
+ */
+int __init liveupdate_flb_incoming_early(struct liveupdate_flb *flb, u64 *datap)
+{
+	struct luo_flb_ser *ser;
+
+	if (!luo_early_initialized()) {
+		pr_warn("LUO FLB retrieved before LUO early init!\n");
+		return -EOPNOTSUPP;
+	}
+
+	ser = luo_flb_find_ser(&luo_flb_global.incoming, flb->compatible);
+	if (IS_ERR(ser))
+		return PTR_ERR(ser);
+
+	*datap = ser->data;
+	return 0;
+}
+
 static void luo_flb_file_finish_one(struct liveupdate_flb *flb)
 {
 	struct luo_flb_private *private = luo_flb_get_private(flb);
-- 
2.43.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 05/10] mm: hugetlb: export some functions to hugetlb-internal header
  2025-12-06 23:02 [RFC PATCH 00/10] liveupdate: hugetlb support Pratyush Yadav
                   ` (3 preceding siblings ...)
  2025-12-06 23:02 ` [RFC PATCH 04/10] liveupdate: flb: allow getting FLB data in early boot Pratyush Yadav
@ 2025-12-06 23:02 ` Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 06/10] liveupdate: hugetlb subsystem FLB state preservation Pratyush Yadav
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Pratyush Yadav @ 2025-12-06 23:02 UTC (permalink / raw)
  To: Pasha Tatashin, Mike Rapoport, Pratyush Yadav, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Muchun Song, Oscar Salvador,
	Alexander Graf, David Matlack, David Rientjes, Jason Gunthorpe,
	Samiullah Khawaja, Vipin Sharma, Zhu Yanjun
  Cc: linux-kernel, linux-mm, linux-doc, kexec

A later commit will add support for live updating a memfd backed by
HugeTLB. It needs access to these internal functions to prepare the
folios and properly queue them to the hstate and the file. Move them out
to a separate hugetlb-internal header.

There does exist include/linux/hugetlb.h, but that contains higher-level
routines, and it prefixes the function names to make it clear they belong
to hugetlb. These are low-level routines that do not need to be exposed
in the public API, and renaming them with a hugetlb prefix would cause a
lot of code churn. So create mm/hugetlb_internal.h to contain these
definitions.

Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
---
 MAINTAINERS           |  1 +
 mm/hugetlb.c          | 33 +++++++++------------------------
 mm/hugetlb_internal.h | 35 +++++++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+), 24 deletions(-)
 create mode 100644 mm/hugetlb_internal.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 2722f98d0ed7..fc23a0381e19 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11540,6 +11540,7 @@ F:	mm/hugetlb.c
 F:	mm/hugetlb_cgroup.c
 F:	mm/hugetlb_cma.c
 F:	mm/hugetlb_cma.h
+F:	mm/hugetlb_internal.h
 F:	mm/hugetlb_vmemmap.c
 F:	mm/hugetlb_vmemmap.h
 F:	tools/testing/selftests/cgroup/test_hugetlb_memcg.c
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0455119716ec..0f818086bf4f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -55,6 +55,8 @@
 #include "hugetlb_cma.h"
 #include <linux/page-isolation.h>
 
+#include "hugetlb_internal.h"
+
 int hugetlb_max_hstate __read_mostly;
 unsigned int default_hstate_idx;
 struct hstate hstates[HUGE_MAX_HSTATE];
@@ -733,9 +735,8 @@ static int allocate_file_region_entries(struct resv_map *resv,
  * fail; region_chg will always allocate at least 1 entry and a region_add for
  * 1 page will only require at most 1 entry.
  */
-static long region_add(struct resv_map *resv, long f, long t,
-		       long in_regions_needed, struct hstate *h,
-		       struct hugetlb_cgroup *h_cg)
+long region_add(struct resv_map *resv, long f, long t, long in_regions_needed,
+		struct hstate *h, struct hugetlb_cgroup *h_cg)
 {
 	long add = 0, actual_regions_needed = 0;
 
@@ -800,8 +801,7 @@ static long region_add(struct resv_map *resv, long f, long t,
  * zero.  -ENOMEM is returned if a new file_region structure or cache entry
  * is needed and can not be allocated.
  */
-static long region_chg(struct resv_map *resv, long f, long t,
-		       long *out_regions_needed)
+long region_chg(struct resv_map *resv, long f, long t, long *out_regions_needed)
 {
 	long chg = 0;
 
@@ -836,8 +836,7 @@ static long region_chg(struct resv_map *resv, long f, long t,
  * routine.  They are kept to make reading the calling code easier as
  * arguments will match the associated region_chg call.
  */
-static void region_abort(struct resv_map *resv, long f, long t,
-			 long regions_needed)
+void region_abort(struct resv_map *resv, long f, long t, long regions_needed)
 {
 	spin_lock(&resv->lock);
 	VM_BUG_ON(!resv->region_cache_count);
@@ -1162,19 +1161,6 @@ void resv_map_release(struct kref *ref)
 	kfree(resv_map);
 }
 
-static inline struct resv_map *inode_resv_map(struct inode *inode)
-{
-	/*
-	 * At inode evict time, i_mapping may not point to the original
-	 * address space within the inode.  This original address space
-	 * contains the pointer to the resv_map.  So, always use the
-	 * address space embedded within the inode.
-	 * The VERY common case is inode->mapping == &inode->i_data but,
-	 * this may not be true for device special inodes.
-	 */
-	return (struct resv_map *)(&inode->i_data)->i_private_data;
-}
-
 static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
 {
 	VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma);
@@ -1887,14 +1873,14 @@ void free_huge_folio(struct folio *folio)
 /*
  * Must be called with the hugetlb lock held
  */
-static void account_new_hugetlb_folio(struct hstate *h, struct folio *folio)
+void account_new_hugetlb_folio(struct hstate *h, struct folio *folio)
 {
 	lockdep_assert_held(&hugetlb_lock);
 	h->nr_huge_pages++;
 	h->nr_huge_pages_node[folio_nid(folio)]++;
 }
 
-static void init_new_hugetlb_folio(struct folio *folio)
+void init_new_hugetlb_folio(struct folio *folio)
 {
 	__folio_set_hugetlb(folio);
 	INIT_LIST_HEAD(&folio->lru);
@@ -2006,8 +1992,7 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
 	return folio;
 }
 
-static void prep_and_add_allocated_folios(struct hstate *h,
-					struct list_head *folio_list)
+void prep_and_add_allocated_folios(struct hstate *h, struct list_head *folio_list)
 {
 	unsigned long flags;
 	struct folio *folio, *tmp_f;
diff --git a/mm/hugetlb_internal.h b/mm/hugetlb_internal.h
new file mode 100644
index 000000000000..edfb4eb75828
--- /dev/null
+++ b/mm/hugetlb_internal.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2025 Pratyush Yadav <pratyush@kernel.org>
+ */
+#ifndef __HUGETLB_INTERNAL_H
+#define __HUGETLB_INTERNAL_H
+
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/hugetlb_cgroup.h>
+#include <linux/list.h>
+
+void init_new_hugetlb_folio(struct folio *folio);
+void account_new_hugetlb_folio(struct hstate *h, struct folio *folio);
+
+long region_chg(struct resv_map *resv, long f, long t, long *out_regions_needed);
+long region_add(struct resv_map *resv, long f, long t, long in_regions_needed,
+		struct hstate *h, struct hugetlb_cgroup *h_cg);
+void region_abort(struct resv_map *resv, long f, long t, long regions_needed);
+void prep_and_add_allocated_folios(struct hstate *h, struct list_head *folio_list);
+
+static inline struct resv_map *inode_resv_map(struct inode *inode)
+{
+	/*
+	 * At inode evict time, i_mapping may not point to the original
+	 * address space within the inode.  This original address space
+	 * contains the pointer to the resv_map.  So, always use the
+	 * address space embedded within the inode.
+	 * The VERY common case is inode->mapping == &inode->i_data but,
+	 * this may not be true for device special inodes.
+	 */
+	return (struct resv_map *)(&inode->i_data)->i_private_data;
+}
+
+#endif /* __HUGETLB_INTERNAL_H */
-- 
2.43.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 06/10] liveupdate: hugetlb subsystem FLB state preservation
  2025-12-06 23:02 [RFC PATCH 00/10] liveupdate: hugetlb support Pratyush Yadav
                   ` (4 preceding siblings ...)
  2025-12-06 23:02 ` [RFC PATCH 05/10] mm: hugetlb: export some functions to hugetlb-internal header Pratyush Yadav
@ 2025-12-06 23:02 ` Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 07/10] mm: hugetlb: don't allocate pages already in live update Pratyush Yadav
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Pratyush Yadav @ 2025-12-06 23:02 UTC (permalink / raw)
  To: Pasha Tatashin, Mike Rapoport, Pratyush Yadav, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Muchun Song, Oscar Salvador,
	Alexander Graf, David Matlack, David Rientjes, Jason Gunthorpe,
	Samiullah Khawaja, Vipin Sharma, Zhu Yanjun
  Cc: linux-kernel, linux-mm, linux-doc, kexec

HugeTLB manages its own pages. It allocates them on boot and uses those
to fulfill hugepage requests.

To support live update for a hugetlb-backed memfd, it is necessary to
track how many pages of each hstate are coming from live update. This is
needed to ensure the boot-time allocations don't over-allocate huge
pages and cause unexpected memory pressure on the rest of the system.

For example, say the system has 100G of memory and uses 90 1G huge
pages, with 10G set aside for other processes. Now say 5 of those pages
are preserved via KHO for live updating a hugetlb-backed memfd.

But during boot, the system will still see that it needs 90 huge pages,
so it will attempt to allocate those. When the file is later retrieved,
those 5 pages also get added to the huge page pool, resulting in 95
total huge pages. This exceeds the original expectation of 90 pages, and
ends up wasting memory.

LUO has file-lifecycle-bound (FLB) data to keep track of the global state
of a subsystem. Use it to track how many huge pages are used up for each
hstate. When a file is preserved, it will increment the counter, and
when it is unpreserved, it will decrement it. During boot-time
allocations, this data can be used to calculate how many hugepages
actually need to be allocated.
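
For illustration, the preserve path of a hugetlb-backed memfd (added later in
this series) would account each preserved folio roughly like the hypothetical
sketch below:

    /* Hypothetical caller, sketch only; the real one comes in a later patch. */
    static int hugetlb_memfd_preserve_folio(struct folio *folio)
    {
            struct hstate *h = folio_hstate(folio);
            int err;

            /* Bump the per-hstate count in the outgoing hugetlb FLB. */
            err = hugetlb_flb_add_folio(h);
            if (err)
                    return err;

            /* The folio itself is preserved via KHO separately (not shown). */
            return 0;
    }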

Design note: another way of doing this would be to preserve the entire
set of hugepages using the FLB, skip boot-time allocation, and restore
them all on FLB retrieve. The main problem with that approach is that it
would need to freeze all hstates after serializing them. This would need
much more invasive changes in hugetlb since there are many ways folios
can be added to or removed from a hstate. Doing it this way is simpler
and less invasive.

Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
---
 Documentation/mm/memfd_preservation.rst |   9 ++
 MAINTAINERS                             |   1 +
 include/linux/kho/abi/hugetlb.h         |  66 +++++++++
 kernel/liveupdate/Kconfig               |  12 ++
 mm/Makefile                             |   1 +
 mm/hugetlb.c                            |   1 +
 mm/hugetlb_internal.h                   |  15 ++
 mm/hugetlb_luo.c                        | 179 ++++++++++++++++++++++++
 8 files changed, 284 insertions(+)
 create mode 100644 include/linux/kho/abi/hugetlb.h
 create mode 100644 mm/hugetlb_luo.c

diff --git a/Documentation/mm/memfd_preservation.rst b/Documentation/mm/memfd_preservation.rst
index 66e0fb6d5ef0..6068dd55f4fb 100644
--- a/Documentation/mm/memfd_preservation.rst
+++ b/Documentation/mm/memfd_preservation.rst
@@ -16,6 +16,15 @@ Memfd Preservation ABI
 .. kernel-doc:: include/linux/kho/abi/memfd.h
    :internal:
 
+HugeTLB-backed memfd Preservation ABI
+=====================================
+
+.. kernel-doc:: include/linux/kho/abi/hugetlb.h
+   :doc: hugetlb-backed memfd live update ABI
+
+.. kernel-doc:: include/linux/kho/abi/hugetlb.h
+   :internal:
+
 See Also
 ========
 
diff --git a/MAINTAINERS b/MAINTAINERS
index fc23a0381e19..55ef24e80ae5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14481,6 +14481,7 @@ F:	include/linux/liveupdate/
 F:	include/uapi/linux/liveupdate.h
 F:	kernel/liveupdate/
 F:	lib/tests/liveupdate.c
+F:	mm/hugetlb_luo.c
 F:	mm/memfd_luo.c
 F:	tools/testing/selftests/liveupdate/
 
diff --git a/include/linux/kho/abi/hugetlb.h b/include/linux/kho/abi/hugetlb.h
new file mode 100644
index 000000000000..55e833569c48
--- /dev/null
+++ b/include/linux/kho/abi/hugetlb.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (C) 2025 Amazon.com Inc. or its affiliates.
+ * Pratyush Yadav <pratyush@kernel.org>
+ */
+
+#ifndef _LINUX_KHO_ABI_HUGETLB_H
+#define _LINUX_KHO_ABI_HUGETLB_H
+
+#include <linux/hugetlb.h>
+
+/**
+ * DOC: hugetlb-backed memfd live update ABI
+ *
+ * This header defines the ABI for preserving the state of the hugetlb subsystem
+ * and a hugetlb-backed memfd across a kexec reboot using LUO.
+ *
+ * This interface is a contract. Any modification to the structure layout
+ * constitutes a breaking change. Such changes require incrementing the version
+ * number in the HUGETLB_FLB_COMPATIBLE or HUGE_MEMFD_COMPATIBLE strings for
+ * hugetlb FLB or hugetlb-backed memfd, respectively.
+ */
+
+/*
+ * Keep the serialized max hstates separate from the kernel's HUGE_MAX_HSTATE to
+ * keep the value stable.
+ *
+ * Currently x86 and arm64 are supported. x86 has HUGE_MAX_HSTATE as 2 and arm64
+ * has 4. Pick 4 as the number to start with.
+ */
+#define HUGETLB_SER_MAX_HSTATES		4
+
+static_assert(HUGETLB_SER_MAX_HSTATES >= HUGE_MAX_HSTATE);
+
+/**
+ * struct hugetlb_hstate_ser - Serialized state of a hstate.
+ * @nr_pages:     Number of preserved pages in the hstate.
+ * @order:        Order of the hstate this struct describes.
+ *
+ * The only state needed for hstates is the number of pages that are preserved
+ * from this hstate. The preserved pages are added to the hstate when the file
+ * is retrieved. This information gets used in early boot to calculate the
+ * remaining pages that must be allocated by the normal path.
+ */
+struct hugetlb_hstate_ser {
+	/* Number of _preserved_ pages in the hstate. */
+	u64 nr_pages;
+	u8 order;
+} __packed;
+
+/**
+ * struct hugetlb_ser - The main serialization structure for HugeTLB FLB.
+ * @hstates:      Array of serialized hstates.
+ * @nr_hstates:   Number of serialized hstates in the array.
+ */
+struct hugetlb_ser {
+	struct hugetlb_hstate_ser hstates[HUGETLB_SER_MAX_HSTATES];
+	u8 nr_hstates;
+} __packed;
+
+static_assert(sizeof(struct hugetlb_ser) <= PAGE_SIZE);
+
+#define HUGETLB_FLB_COMPATIBLE "hugetlb-v1"
+
+#endif /* _LINUX_KHO_ABI_HUGETLB_H */
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
index 9b2515f31afb..86e76aed8a93 100644
--- a/kernel/liveupdate/Kconfig
+++ b/kernel/liveupdate/Kconfig
@@ -72,4 +72,16 @@ config LIVEUPDATE
 
 	  If unsure, say N.
 
+config LIVEUPDATE_HUGETLB
+	bool "Live update support for HugeTLB"
+	depends on LIVEUPDATE && HUGETLBFS
+	help
+
+	  Enable live update support for the HugeTLB subsystem. This allows live
+	  updating memfd backed by huge pages. This can be used by hypervisors that
+	  use hugetlb memfd to back VM memory, or for other user workloads needing
+	  to live update huge pages.
+
+	  If unsure, say N.
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 7738ec416f00..753bc1e3f3fd 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -101,6 +101,7 @@ obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_LIVEUPDATE) += memfd_luo.o
+obj-$(CONFIG_LIVEUPDATE_HUGETLB) += hugetlb_luo.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 ifdef CONFIG_SWAP
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0f818086bf4f..ff90ceacf62c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4702,6 +4702,7 @@ static int __init hugetlb_init(void)
 	hugetlb_sysfs_init();
 	hugetlb_cgroup_file_init();
 	hugetlb_sysctl_init();
+	hugetlb_luo_init();
 
 #ifdef CONFIG_SMP
 	num_fault_mutexes = roundup_pow_of_two(8 * num_possible_cpus());
diff --git a/mm/hugetlb_internal.h b/mm/hugetlb_internal.h
index edfb4eb75828..b7b149c56567 100644
--- a/mm/hugetlb_internal.h
+++ b/mm/hugetlb_internal.h
@@ -9,6 +9,7 @@
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
 #include <linux/list.h>
+#include <linux/liveupdate.h>
 
 void init_new_hugetlb_folio(struct folio *folio);
 void account_new_hugetlb_folio(struct hstate *h, struct folio *folio);
@@ -32,4 +33,18 @@ static inline struct resv_map *inode_resv_map(struct inode *inode)
 	return (struct resv_map *)(&inode->i_data)->i_private_data;
 }
 
+#ifdef CONFIG_LIVEUPDATE_HUGETLB
+void hugetlb_luo_init(void);
+unsigned long hstate_liveupdate_pages(struct hstate *h);
+#else
+static inline void hugetlb_luo_init(void)
+{
+}
+
+static inline unsigned long hstate_liveupdate_pages(struct hstate *h)
+{
+	return 0;
+}
+#endif /* CONFIG_LIVEUPDATE_HUGETLB */
+
 #endif /* __HUGETLB_INTERNAL_H */
diff --git a/mm/hugetlb_luo.c b/mm/hugetlb_luo.c
new file mode 100644
index 000000000000..80e3e015eca5
--- /dev/null
+++ b/mm/hugetlb_luo.c
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2025 Amazon.com Inc. or its affiliates.
+ * Copyright (C) 2025 Pratyush Yadav <pratyush@kernel.org>
+ */
+
+/* The documentation for this is in mm/memfd_luo.c */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/liveupdate.h>
+#include <linux/kexec_handover.h>
+#include <linux/hugetlb.h>
+#include <linux/kho/abi/hugetlb.h>
+#include <linux/spinlock.h>
+
+#include "hugetlb_internal.h"
+
+struct hugetlb_flb_obj {
+	/* Serializes access to ser and its hstates. */
+	spinlock_t lock;
+	struct hugetlb_ser *ser;
+};
+
+static int hugetlb_flb_preserve(struct liveupdate_flb_op_args *args)
+{
+	struct hugetlb_ser *hugetlb_ser;
+	struct hugetlb_flb_obj *obj;
+	u8 nr_hstates = 0;
+	struct hstate *h;
+
+	obj = kmalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return -ENOMEM;
+
+	hugetlb_ser = kho_alloc_preserve(sizeof(*hugetlb_ser));
+	if (!hugetlb_ser) {
+		kfree(obj);
+		return -ENOMEM;
+	}
+
+	spin_lock_init(&obj->lock);
+	obj->ser = hugetlb_ser;
+
+	for_each_hstate(h) {
+		struct hugetlb_hstate_ser *hser = &hugetlb_ser->hstates[nr_hstates];
+
+		hser->nr_pages = 0;
+		hser->order = h->order;
+		nr_hstates++;
+	}
+
+	hugetlb_ser->nr_hstates = nr_hstates;
+
+	args->obj = obj;
+	args->data = virt_to_phys(hugetlb_ser);
+
+	return 0;
+}
+
+static void hugetlb_flb_unpreserve(struct liveupdate_flb_op_args *args)
+{
+	kho_unpreserve_free(phys_to_virt(args->data));
+	kfree(args->obj);
+}
+
+static void hugetlb_flb_finish(struct liveupdate_flb_op_args *args)
+{
+	/* No live state on the retrieve side. */
+}
+
+static int hugetlb_flb_retrieve(struct liveupdate_flb_op_args *args)
+{
+	/*
+	 * The FLB is only needed for boot-time calculation of how many
+	 * hugepages are needed. This is done by early boot handlers already.
+	 * Free the serialized state now.
+	 */
+	kho_restore_free(phys_to_virt(args->data));
+
+	/*
+	 * HACK: But since LUO FLB still needs an obj, use ZERO_SIZE_PTR to
+	 * satisfy it.
+	 */
+	args->obj = ZERO_SIZE_PTR;
+	return 0;
+}
+
+static struct liveupdate_flb_ops hugetlb_luo_flb_ops = {
+	.preserve = hugetlb_flb_preserve,
+	.unpreserve = hugetlb_flb_unpreserve,
+	.finish = hugetlb_flb_finish,
+	.retrieve = hugetlb_flb_retrieve,
+};
+
+static struct liveupdate_flb hugetlb_luo_flb = {
+	.ops = &hugetlb_luo_flb_ops,
+	.compatible = HUGETLB_FLB_COMPATIBLE,
+};
+
+static struct hugetlb_hstate_ser
+*hugetlb_flb_get_hser(struct hugetlb_ser *hugetlb_ser, unsigned int order)
+{
+	for (u8 i = 0; i < hugetlb_ser->nr_hstates; i++) {
+		if (hugetlb_ser->hstates[i].order == order)
+			return &hugetlb_ser->hstates[i];
+	}
+
+	return NULL;
+}
+
+static int hugetlb_flb_add_folio(struct hstate *h)
+{
+	struct hugetlb_ser *hugetlb_ser;
+	struct hugetlb_hstate_ser *hser;
+	struct hugetlb_flb_obj *obj;
+	int err;
+
+	err = liveupdate_flb_get_outgoing(&hugetlb_luo_flb, (void **)&obj);
+	if (err)
+		return err;
+
+	hugetlb_ser = obj->ser;
+
+	guard(spinlock)(&obj->lock);
+	hser = hugetlb_flb_get_hser(hugetlb_ser, h->order);
+	if (!hser)
+		return -ENOENT;
+
+	hser->nr_pages++;
+	return 0;
+}
+
+static int hugetlb_flb_del_folio(struct hstate *h)
+{
+	struct hugetlb_ser *hugetlb_ser;
+	struct hugetlb_hstate_ser *hser;
+	struct hugetlb_flb_obj *obj;
+	int err;
+
+	err = liveupdate_flb_get_outgoing(&hugetlb_luo_flb, (void **)&obj);
+	if (err)
+		return err;
+
+	hugetlb_ser = obj->ser;
+
+	guard(spinlock)(&obj->lock);
+	hser = hugetlb_flb_get_hser(hugetlb_ser, h->order);
+	if (!hser)
+		return -ENOENT;
+
+	hser->nr_pages--;
+	return 0;
+}
+
+unsigned long __init hstate_liveupdate_pages(struct hstate *h)
+{
+	struct hugetlb_hstate_ser *hser;
+	struct hugetlb_ser *hugetlb_ser;
+	u64 data;
+	int err;
+
+	err = liveupdate_flb_incoming_early(&hugetlb_luo_flb, &data);
+	if (err)
+		/* If FLB can't be fetched, assume no pages from liveupdate. */
+		return 0;
+
+	hugetlb_ser = phys_to_virt(data);
+
+	/* NOTE: No need for locking since this is read-only on incoming side. */
+	hser = hugetlb_flb_get_hser(hugetlb_ser, h->order);
+	return hser ? hser->nr_pages : 0;
+}
+
+void __init hugetlb_luo_init(void)
+{
+	if (!liveupdate_enabled())
+		return;
+}
-- 
2.43.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 07/10] mm: hugetlb: don't allocate pages already in live update
  2025-12-06 23:02 [RFC PATCH 00/10] liveupdate: hugetlb support Pratyush Yadav
                   ` (5 preceding siblings ...)
  2025-12-06 23:02 ` [RFC PATCH 06/10] liveupdate: hugetlb subsystem FLB state preservation Pratyush Yadav
@ 2025-12-06 23:02 ` Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 08/10] mm: hugetlb: disable CMA if liveupdate is enabled Pratyush Yadav
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Pratyush Yadav @ 2025-12-06 23:02 UTC (permalink / raw)
  To: Pasha Tatashin, Mike Rapoport, Pratyush Yadav, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Muchun Song, Oscar Salvador,
	Alexander Graf, David Matlack, David Rientjes, Jason Gunthorpe,
	Samiullah Khawaja, Vipin Sharma, Zhu Yanjun
  Cc: linux-kernel, linux-mm, linux-doc, kexec

To support live update for a hugetlb-backed memfd, it is necessary to
track how many pages of each hstate are coming from live update. This is
needed to ensure the boot-time allocations don't over-allocate huge
pages and cause unexpected memory pressure on the rest of the system.

For example, say the system has 100G of memory and uses 90 1G huge
pages, with 10G set aside for other processes. Now say 5 of those pages
are preserved via KHO for live updating a hugetlb-backed memfd.

But during boot, hugetlb will still see that it needs 90 huge pages, so
it will attempt to allocate those. When the file is later retrieved,
those 5 pages also get added to the huge page pool, resulting in 95
total huge pages. This exceeds the original expectation of 90 pages, and
ends up wasting memory.

Check the number of hugepages for the hstate already coming from live
update using hstate_liveupdate_pages(). Subtract that number from
h->max_huge_pages and pass that to the allocation functions. The
allocation functions currently directly use h->max_huge_pages, so update
them to take a parameter for number of pages to allocate instead.

Also update the error and status reporting function to deal with
liveupdated pages, report the right number, and handle errors.

Node-specific allocation is not supported with liveupdate currently.
This is because the liveupdate FLB data does not contain per-node
allocation numbers, so it is not possible to know how many live updated
pages each node has.

Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
---
 mm/hugetlb.c | 79 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 59 insertions(+), 20 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ff90ceacf62c..22af2e56772e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -39,6 +39,7 @@
 #include <linux/memory.h>
 #include <linux/mm_inline.h>
 #include <linux/padata.h>
+#include <linux/liveupdate.h>
 
 #include <asm/page.h>
 #include <asm/pgalloc.h>
@@ -64,6 +65,7 @@ struct hstate hstates[HUGE_MAX_HSTATE];
 __initdata nodemask_t hugetlb_bootmem_nodes;
 __initdata struct list_head huge_boot_pages[MAX_NUMNODES];
 static unsigned long hstate_boot_nrinvalid[HUGE_MAX_HSTATE] __initdata;
+static unsigned long hstate_boot_nrliveupdated[HUGE_MAX_HSTATE] __initdata;
 
 /*
  * Due to ordering constraints across the init code for various
@@ -3484,13 +3486,19 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 	h->max_huge_pages_node[nid] = i;
 }
 
-static bool __init hugetlb_hstate_alloc_pages_specific_nodes(struct hstate *h)
+static bool __init hugetlb_hstate_alloc_pages_specific_nodes(struct hstate *h,
+							     unsigned long liveupdated)
 {
 	int i;
 	bool node_specific_alloc = false;
 
 	for_each_online_node(i) {
 		if (h->max_huge_pages_node[i] > 0) {
+			if (liveupdated) {
+				pr_warn("HugeTLB: node-specific allocation not supported with liveupdate. Defaulting to normal\n");
+				return false;
+			}
+
 			hugetlb_hstate_alloc_pages_onenode(h, i);
 			node_specific_alloc = true;
 		}
@@ -3499,15 +3507,25 @@ static bool __init hugetlb_hstate_alloc_pages_specific_nodes(struct hstate *h)
 	return node_specific_alloc;
 }
 
-static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated, struct hstate *h)
+static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated,
+						       unsigned long liveupdated,
+						       struct hstate *h)
 {
-	if (allocated < h->max_huge_pages) {
-		char buf[32];
+	char buf[32];
 
-		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+	string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+
+	if (liveupdated > h->max_huge_pages) {
+		pr_warn("HugeTLB: got %lu of page size %s from liveupdate, requested pages are %lu\n",
+			liveupdated, buf, h->max_huge_pages);
+		h->max_huge_pages = liveupdated;
+	} else if (liveupdated + allocated < h->max_huge_pages) {
 		pr_warn("HugeTLB: allocating %lu of page size %s failed.  Only allocated %lu hugepages.\n",
-			h->max_huge_pages, buf, allocated);
-		h->max_huge_pages = allocated;
+			h->max_huge_pages - liveupdated, buf, allocated);
+		if (liveupdated)
+			pr_warn("HugeTLB: %lu of page size %s are from liveupdate\n",
+				liveupdated, buf);
+		h->max_huge_pages = allocated + liveupdated;
 	}
 }
 
@@ -3542,11 +3560,12 @@ static void __init hugetlb_pages_alloc_boot_node(unsigned long start, unsigned l
 	prep_and_add_allocated_folios(h, &folio_list);
 }
 
-static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
+static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h,
+							      unsigned long nr)
 {
 	unsigned long i;
 
-	for (i = 0; i < h->max_huge_pages; ++i) {
+	for (i = 0; i < nr; ++i) {
 		if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
 			break;
 		cond_resched();
@@ -3555,7 +3574,8 @@ static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
 	return i;
 }
 
-static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
+static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h,
+						     unsigned long nr)
 {
 	struct padata_mt_job job = {
 		.fn_arg		= h,
@@ -3594,14 +3614,14 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 
 	jiffies_start = jiffies;
 	do {
-		remaining = h->max_huge_pages - h->nr_huge_pages;
+		remaining = nr - h->nr_huge_pages;
 
 		job.start     = h->nr_huge_pages;
 		job.size      = remaining;
 		job.min_chunk = remaining / hugepage_allocation_threads;
 		padata_do_multithreaded(&job);
 
-		if (h->nr_huge_pages == h->max_huge_pages)
+		if (h->nr_huge_pages == nr)
 			break;
 
 		/*
@@ -3612,7 +3632,7 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 			break;
 
 		/* Continue if progress was made in last iteration */
-	} while (remaining != (h->max_huge_pages - h->nr_huge_pages));
+	} while (remaining != (nr - h->nr_huge_pages));
 
 	jiffies_end = jiffies;
 
@@ -3636,7 +3656,7 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
  */
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
-	unsigned long allocated;
+	unsigned long allocated, liveupdated, nr_alloc;
 
 	/*
 	 * Skip gigantic hugepages allocation if early CMA
@@ -3648,20 +3668,31 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 		return;
 	}
 
-	if (!h->max_huge_pages)
+	/*
+	 * Some huge pages might come from live update. They will get added to
+	 * the hstate when liveupdate retrieves its files. To avoid
+	 * over-allocating, subtract the liveupdated pages from the total number
+	 * of pages to allocate.
+	 */
+	liveupdated = hstate_liveupdate_pages(h);
+	hstate_boot_nrliveupdated[hstate_index(h)] = liveupdated;
+	if (liveupdated >= h->max_huge_pages) {
+		hugetlb_hstate_alloc_pages_errcheck(0, liveupdated, h);
 		return;
+	}
+	nr_alloc = h->max_huge_pages - liveupdated;
 
 	/* do node specific alloc */
-	if (hugetlb_hstate_alloc_pages_specific_nodes(h))
+	if (hugetlb_hstate_alloc_pages_specific_nodes(h, liveupdated))
 		return;
 
 	/* below will do all node balanced alloc */
 	if (hstate_is_gigantic(h))
-		allocated = hugetlb_gigantic_pages_alloc_boot(h);
+		allocated = hugetlb_gigantic_pages_alloc_boot(h, nr_alloc);
 	else
-		allocated = hugetlb_pages_alloc_boot(h);
+		allocated = hugetlb_pages_alloc_boot(h, nr_alloc);
 
-	hugetlb_hstate_alloc_pages_errcheck(allocated, h);
+	hugetlb_hstate_alloc_pages_errcheck(allocated, liveupdated, h);
 }
 
 static void __init hugetlb_init_hstates(void)
@@ -3710,14 +3741,22 @@ static void __init report_hugepages(void)
 	unsigned long nrinvalid;
 
 	for_each_hstate(h) {
+		unsigned long liveupdated;
 		char buf[32];
 
 		nrinvalid = hstate_boot_nrinvalid[hstate_index(h)];
 		h->max_huge_pages -= nrinvalid;
 
 		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
-		pr_info("HugeTLB: registered %s page size, pre-allocated %ld pages\n",
+		pr_info("HugeTLB: registered %s page size, pre-allocated %ld pages",
 			buf, h->nr_huge_pages);
+
+		liveupdated = hstate_boot_nrliveupdated[hstate_index(h)];
+		if (liveupdated)
+			pr_info(KERN_CONT ", %ld pages from liveupdate\n", liveupdated);
+		else
+			pr_info(KERN_CONT "\n");
+
 		if (nrinvalid)
 			pr_info("HugeTLB: %s page size: %lu invalid page%s discarded\n",
 					buf, nrinvalid, str_plural(nrinvalid));
-- 
2.43.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 08/10] mm: hugetlb: disable CMA if liveupdate is enabled
  2025-12-06 23:02 [RFC PATCH 00/10] liveupdate: hugetlb support Pratyush Yadav
                   ` (6 preceding siblings ...)
  2025-12-06 23:02 ` [RFC PATCH 07/10] mm: hugetlb: don't allocate pages already in live update Pratyush Yadav
@ 2025-12-06 23:02 ` Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 09/10] mm: hugetlb: allow freezing the inode Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 10/10] liveupdate: allow preserving hugetlb-backed memfd Pratyush Yadav
  9 siblings, 0 replies; 11+ messages in thread
From: Pratyush Yadav @ 2025-12-06 23:02 UTC (permalink / raw)
  To: Pasha Tatashin, Mike Rapoport, Pratyush Yadav, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Muchun Song, Oscar Salvador,
	Alexander Graf, David Matlack, David Rientjes, Jason Gunthorpe,
	Samiullah Khawaja, Vipin Sharma, Zhu Yanjun
  Cc: linux-kernel, linux-mm, linux-doc, kexec

Hugetlb live update support does not yet work with CMA. Print a warning
and disable hugetlb CMA when the config option for live updating
hugetlb is enabled and liveupdate is enabled at runtime.

Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
---
 kernel/liveupdate/Kconfig | 3 +++
 mm/hugetlb_cma.c          | 7 +++++++
 2 files changed, 10 insertions(+)

diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
index 86e76aed8a93..4676fea6d8a6 100644
--- a/kernel/liveupdate/Kconfig
+++ b/kernel/liveupdate/Kconfig
@@ -82,6 +82,9 @@ config LIVEUPDATE_HUGETLB
 	  use hugetlb memfd to back VM memory, or for other user workloads needing
 	  to live update huge pages.
 
+	  Enabling this config disables hugetlb CMA when liveupdate is
+	  enabled at runtime, since CMA is not yet supported with live update.
+
 	  If unsure, say N.
 
 endmenu
diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
index e8e4dc7182d5..fa3bb776c0d2 100644
--- a/mm/hugetlb_cma.c
+++ b/mm/hugetlb_cma.c
@@ -4,6 +4,7 @@
 #include <linux/cma.h>
 #include <linux/compiler.h>
 #include <linux/mm_inline.h>
+#include <linux/liveupdate.h>
 
 #include <asm/page.h>
 #include <asm/setup.h>
@@ -152,6 +153,12 @@ void __init hugetlb_cma_reserve(int order)
 	if (!hugetlb_cma_size)
 		return;
 
+	if (IS_ENABLED(CONFIG_LIVEUPDATE_HUGETLB) && liveupdate_enabled()) {
+		pr_warn("HugeTLB: CMA not supported with live update. Falling back to pre-allocating pages.\n");
+		hugetlb_cma_size = 0;
+		return;
+	}
+
 	hugetlb_bootmem_set_nodes();
 
 	for (nid = 0; nid < MAX_NUMNODES; nid++) {
-- 
2.43.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 09/10] mm: hugetlb: allow freezing the inode
  2025-12-06 23:02 [RFC PATCH 00/10] liveupdate: hugetlb support Pratyush Yadav
                   ` (7 preceding siblings ...)
  2025-12-06 23:02 ` [RFC PATCH 08/10] mm: hugetlb: disable CMA if liveupdate is enabled Pratyush Yadav
@ 2025-12-06 23:02 ` Pratyush Yadav
  2025-12-06 23:02 ` [RFC PATCH 10/10] liveupdate: allow preserving hugetlb-backed memfd Pratyush Yadav
  9 siblings, 0 replies; 11+ messages in thread
From: Pratyush Yadav @ 2025-12-06 23:02 UTC (permalink / raw)
  To: Pasha Tatashin, Mike Rapoport, Pratyush Yadav, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Muchun Song, Oscar Salvador,
	Alexander Graf, David Matlack, David Rientjes, Jason Gunthorpe,
	Samiullah Khawaja, Vipin Sharma, Zhu Yanjun
  Cc: linux-kernel, linux-mm, linux-doc, kexec

To prepare a hugetlb inode for live update, its index -> folio mappings
must be serialized. Once the mappings are serialized, they cannot be
allowed to change, since any change would make the serialized data
inconsistent. This can be done by pinning the folios to avoid
migration, and by making sure no folios can be added to or removed from
the inode.

While mechanisms to pin folios already exist, the only way to stop
folios from being added or removed is the grow and shrink file seals.
But file seals come with their own semantics, one of which is that they
cannot be removed. This does not work for liveupdate, since a live
update can be cancelled or can error out, which requires the seals to
be removed and the file's normal functionality to be restored.

Introduce a frozen flag that indicates this status. It is internal to
hugetlbfs and is not directly exposed to userspace. It functions
similarly to F_SEAL_GROW | F_SEAL_SHRINK, but additionally disallows
hole punching, and can be removed.
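
As a rough usage sketch (the real caller is added later in this series;
hugemfd_serialize() below is a hypothetical stand-in for the
serialization step), a preserve path would freeze the inode around
serialization and thaw it again on cancellation or error:

	inode_lock(inode);
	hugetlb_i_freeze(inode, true);

	err = hugemfd_serialize(inode);	/* hypothetical serialization step */
	if (err) {
		/* cancelled or failed: restore normal file behaviour */
		hugetlb_i_freeze(inode, false);
	}

	inode_unlock(inode);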

Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
---
 fs/hugetlbfs/inode.c    | 14 +++++++++++++-
 include/linux/hugetlb.h |  8 ++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index f42548ee9083..9af0372c7aea 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -673,6 +673,11 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 
 	inode_lock(inode);
 
+	if (info->frozen) {
+		inode_unlock(inode);
+		return -EPERM;
+	}
+
 	/* protected by i_rwsem */
 	if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) {
 		inode_unlock(inode);
@@ -743,6 +748,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 
 	inode_lock(inode);
 
+	if (info->frozen) {
+		error = -EPERM;
+		goto out;
+	}
+
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
 	error = inode_newsize_ok(inode, offset + len);
 	if (error)
@@ -864,7 +874,8 @@ static int hugetlbfs_setattr(struct mnt_idmap *idmap,
 			return -EINVAL;
 		/* protected by i_rwsem */
 		if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
-		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
+		    (newsize > oldsize && (info->seals & F_SEAL_GROW)) ||
+		    ((newsize != oldsize) && info->frozen))
 			return -EPERM;
 		hugetlb_vmtruncate(inode, newsize);
 	}
@@ -933,6 +944,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 		simple_inode_init_ts(inode);
 		inode->i_mapping->i_private_data = resv_map;
 		info->seals = F_SEAL_SEAL;
+		info->frozen = false;
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8e63e46b8e1f..d70a3015c759 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -511,6 +511,7 @@ static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
 struct hugetlbfs_inode_info {
 	struct inode vfs_inode;
 	unsigned int seals;
+	bool frozen;
 };
 
 static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
@@ -531,6 +532,13 @@ static inline struct hstate *hstate_inode(struct inode *i)
 {
 	return HUGETLBFS_SB(i->i_sb)->hstate;
 }
+
+/* Must be called with inode lock taken exclusive. */
+static inline void hugetlb_i_freeze(struct inode *inode, bool freeze)
+{
+	HUGETLBFS_I(inode)->frozen = freeze;
+}
+
 #else /* !CONFIG_HUGETLBFS */
 
 #define is_file_hugepages(file)			false
-- 
2.43.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 10/10] liveupdate: allow preserving hugetlb-backed memfd
  2025-12-06 23:02 [RFC PATCH 00/10] liveupdate: hugetlb support Pratyush Yadav
                   ` (8 preceding siblings ...)
  2025-12-06 23:02 ` [RFC PATCH 09/10] mm: hugetlb: allow freezing the inode Pratyush Yadav
@ 2025-12-06 23:02 ` Pratyush Yadav
  9 siblings, 0 replies; 11+ messages in thread
From: Pratyush Yadav @ 2025-12-06 23:02 UTC (permalink / raw)
  To: Pasha Tatashin, Mike Rapoport, Pratyush Yadav, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Muchun Song, Oscar Salvador,
	Alexander Graf, David Matlack, David Rientjes, Jason Gunthorpe,
	Samiullah Khawaja, Vipin Sharma, Zhu Yanjun
  Cc: linux-kernel, linux-mm, linux-doc, kexec

Hugetlb-backed memfds can be used to improve performance by reducing
TLB pressure and page faults, and to save memory when HVO is in use.
They are also commonly used to back VM memory, which is one of the
primary users of live update.

Add support for preserving a hugetlb-backed memfd across a live update.
The serialized data takes a similar form to that of shmem-backed
memfds. See include/linux/kho/abi/memfd.h for more details. There is an
additional field for the order of the huge pages backing the file,
which identifies its hstate.
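
As a concrete illustration (hypothetical values, not taken from the
patch): a 4 GiB memfd backed by 1 GiB huge pages on x86-64 with 4 KiB
base pages would serialize roughly as:

	struct hugemfd_ser ser = {
		.size      = 4ULL << 30,	/* i_size: 4 GiB */
		.pos       = 0,			/* f_pos at freeze time */
		.nr_folios = 4,			/* four 1 GiB folios */
		.order     = 18,		/* 1 GiB / 4 KiB base pages = 2^18 */
		/* .folios: KHO vmalloc descriptor with 4 struct hugemfd_folio_ser */
	};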

The behaviour of the file is also similar to that of shmem-backed
memfds. The file cannot grow or shrink once preserved, and all its
pages are pinned to avoid migration.

In addition, the preservation logic reports preserved hugepages to the
FLB so the right number of huge pages can be allocated on the next boot.

On file retrieval, the reservations are set up first, the folios are
then prepped and added to the hstate, and finally the folios are added
to the page cache.

Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
---
 include/linux/kho/abi/hugetlb.h |  32 ++
 mm/hugetlb_luo.c                | 520 ++++++++++++++++++++++++++++++++
 mm/memfd_luo.c                  |   4 -
 3 files changed, 552 insertions(+), 4 deletions(-)

diff --git a/include/linux/kho/abi/hugetlb.h b/include/linux/kho/abi/hugetlb.h
index 55e833569c48..dad4358da062 100644
--- a/include/linux/kho/abi/hugetlb.h
+++ b/include/linux/kho/abi/hugetlb.h
@@ -9,6 +9,7 @@
 #define _LINUX_KHO_ABI_HUGETLB_H
 
 #include <linux/hugetlb.h>
+#include <linux/kexec_handover.h>
 
 /**
  * DOC: hugetlb-backed memfd live update ABI
@@ -63,4 +64,35 @@ static_assert(sizeof(struct hugetlb_ser) <= PAGE_SIZE);
 
 #define HUGETLB_FLB_COMPATIBLE "hugetlb-v1"
 
+/**
+ * struct hugemfd_folio_ser - Serialized state of a single folio.
+ * @pfn:          The page frame number of the folio.
+ * @reserved:     Reserved bits. Might be used for flags later.
+ * @index:        The page offset of the folio in the original file.
+ */
+struct hugemfd_folio_ser {
+	u64 pfn:52;
+	u64 reserved:12;
+	u64 index;
+} __packed;
+
+/**
+ * struct hugemfd_ser - Main serialization structure of a HugeTLB-backed memfd.
+ * @pos:          The file's current position (f_pos).
+ * @size:         The total size of the file in bytes (i_size).
+ * @nr_folios:    Number of folios in the folios array.
+ * @folios:       KHO vmalloc descriptor pointing to the array of
+ *                struct hugemfd_folio_ser.
+ * @order:        Order of the hugepages that back this file.
+ */
+struct hugemfd_ser {
+	u64 size;
+	u64 pos;
+	u64 nr_folios;
+	struct kho_vmalloc folios;
+	u8 order;
+} __packed;
+
+#define HUGE_MEMFD_COMPATIBLE "huge-memfd-v1"
+
 #endif /* _LINUX_KHO_ABI_HUGETLB_H */
diff --git a/mm/hugetlb_luo.c b/mm/hugetlb_luo.c
index 80e3e015eca5..6454f8955d18 100644
--- a/mm/hugetlb_luo.c
+++ b/mm/hugetlb_luo.c
@@ -8,13 +8,17 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <linux/file.h>
 #include <linux/liveupdate.h>
 #include <linux/kexec_handover.h>
 #include <linux/hugetlb.h>
+#include <linux/hugetlb_cgroup.h>
+#include <linux/vmalloc.h>
 #include <linux/kho/abi/hugetlb.h>
 #include <linux/spinlock.h>
 
 #include "hugetlb_internal.h"
+#include "hugetlb_vmemmap.h"
 
 struct hugetlb_flb_obj {
 	/* Serializes access to ser and its hstates. */
@@ -22,6 +26,11 @@ struct hugetlb_flb_obj {
 	struct hugetlb_ser *ser;
 };
 
+struct hugemfd_private {
+	struct hugemfd_folio_ser *folios_ser;
+	unsigned long nr_folios;
+};
+
 static int hugetlb_flb_preserve(struct liveupdate_flb_op_args *args)
 {
 	struct hugetlb_ser *hugetlb_ser;
@@ -172,8 +181,519 @@ unsigned long __init hstate_liveupdate_pages(struct hstate *h)
 	return hser ? hser->nr_pages : 0;
 }
 
+static bool hugemfd_can_preserve(struct liveupdate_file_handler *handler,
+				 struct file *file)
+{
+	struct inode *inode = file_inode(file);
+
+	return is_file_hugepages(file) && !inode->i_nlink;
+}
+
+static void hugemfd_unpreserve_folio(struct hstate *h, struct folio *folio)
+{
+	hugetlb_flb_del_folio(h);
+	kho_unpreserve_folio(folio);
+}
+
+static int hugemfd_preserve_folio(struct hstate *h, struct folio *folio,
+				  struct hugemfd_folio_ser *folio_ser)
+{
+	int err;
+
+	err = kho_preserve_folio(folio);
+	if (err)
+		return err;
+
+	err = hugetlb_flb_add_folio(h);
+	if (err)
+		goto err_unpreserve;
+
+	folio_ser->pfn = folio_pfn(folio);
+	folio_ser->index = folio->index;
+	return 0;
+
+err_unpreserve:
+	kho_unpreserve_folio(folio);
+	return err;
+}
+
+static int
+hugemfd_preserve_folios(struct hugemfd_ser *memfd_ser, struct file *file,
+			unsigned long *nr_foliosp,
+			struct hugemfd_folio_ser **out_folios_ser)
+{
+	struct hugemfd_folio_ser *folios_ser;
+	struct inode *inode = file_inode(file);
+	struct hstate *h = hstate_inode(inode);
+	unsigned int max_folios;
+	long i, nr_folios, size;
+	struct folio **folios;
+	pgoff_t offset;
+	int err;
+
+	size = i_size_read(inode);
+
+	if (!size) {
+		*nr_foliosp = 0;
+		*out_folios_ser = NULL;
+		memset(&memfd_ser->folios, 0, sizeof(memfd_ser->folios));
+		return 0;
+	}
+
+	/* Calculate number of folios in the file based on its size. */
+	max_folios = size / huge_page_size(h);
+	folios = kvmalloc_array(max_folios, sizeof(*folios), GFP_KERNEL);
+	if (!folios)
+		return -ENOMEM;
+
+	/*
+	 * Pin the folios so they don't move around behind our back. This also
+	 * ensures none of the folios are in CMA -- which ensures they don't
+	 * fall in KHO scratch memory. It also moves swapped out folios back to
+	 * memory.
+	 *
+	 * A side effect of doing this is that it allocates a folio for all
+	 * indices in the file. This might waste memory on sparse memfds. If
+	 * that is really a problem in the future, we can have a
+	 * memfd_pin_folios() variant that does not allocate a page on empty
+	 * slots.
+	 */
+	nr_folios = memfd_pin_folios(file, 0, size - 1, folios, max_folios, &offset);
+	if (nr_folios < 0) {
+		err = nr_folios;
+		goto err_free_folios;
+	}
+
+	folios_ser = vcalloc(nr_folios, sizeof(*folios_ser));
+	if (!folios_ser) {
+		err = -ENOMEM;
+		goto err_unpin;
+	}
+
+	for (i = 0; i < nr_folios; i++) {
+		err = hugemfd_preserve_folio(h, folios[i], &folios_ser[i]);
+		if (err)
+			goto err_unpreserve;
+	}
+
+	err = kho_preserve_vmalloc(folios_ser, &memfd_ser->folios);
+	if (err)
+		goto err_unpreserve;
+
+	kvfree(folios);
+
+	memfd_ser->nr_folios = nr_folios;
+	*nr_foliosp = nr_folios;
+	*out_folios_ser = folios_ser;
+	return 0;
+
+err_unpreserve:
+	for (i = i - 1; i >= 0; i--)
+		hugemfd_unpreserve_folio(h, folios[i]);
+	vfree(folios_ser);
+err_unpin:
+	unpin_folios(folios, nr_folios);
+err_free_folios:
+	kvfree(folios);
+	return err;
+}
+
+static int hugemfd_preserve(struct liveupdate_file_op_args *args)
+{
+	struct file *file = args->file;
+	struct inode *inode = file_inode(file);
+	struct hstate *h = hstate_inode(inode);
+	struct hugemfd_folio_ser *folios_ser;
+	struct hugemfd_private *private;
+	struct hugemfd_ser *memfd_ser;
+	unsigned long nr_folios;
+	int err;
+
+	private = kmalloc(sizeof(*private), GFP_KERNEL);
+	if (!private)
+		return -ENOMEM;
+
+	memfd_ser = kho_alloc_preserve(sizeof(*memfd_ser));
+	if (!memfd_ser) {
+		err = -ENOMEM;
+		goto err_free_private;
+	}
+
+	inode_lock(inode);
+
+	hugetlb_i_freeze(inode, true);
+
+	memfd_ser->size = i_size_read(inode);
+	memfd_ser->pos = file->f_pos;
+	memfd_ser->order = h->order;
+
+	err = hugemfd_preserve_folios(memfd_ser, file, &nr_folios, &folios_ser);
+	if (err)
+		goto err_unlock;
+
+	inode_unlock(inode);
+
+	private->folios_ser = folios_ser;
+	private->nr_folios = nr_folios;
+	args->private_data = private;
+	args->serialized_data = virt_to_phys(memfd_ser);
+
+	return 0;
+
+err_unlock:
+	hugetlb_i_freeze(inode, false);
+	inode_unlock(inode);
+	kho_unpreserve_free(memfd_ser);
+err_free_private:
+	kfree(private);
+	return err;
+}
+
+static void hugemfd_unpreserve_folios(struct hugemfd_ser *memfd_ser,
+				      struct hugemfd_folio_ser *folios_ser,
+				      unsigned long nr_folios,
+				      struct hstate *h)
+{
+	if (!nr_folios)
+		return;
+
+	kho_unpreserve_vmalloc(&memfd_ser->folios);
+
+	for (long i = 0; i < nr_folios; i++) {
+		struct folio *folio = pfn_folio(folios_ser[i].pfn);
+
+		hugemfd_unpreserve_folio(h, folio);
+		unpin_folio(folio);
+	}
+
+	vfree(folios_ser);
+}
+
+static void hugemfd_unpreserve(struct liveupdate_file_op_args *args)
+{
+	struct hugemfd_ser *memfd_ser = phys_to_virt(args->serialized_data);
+	struct hugemfd_private *private = args->private_data;
+	struct inode *inode = file_inode(args->file);
+	struct hstate *h = hstate_inode(inode);
+
+	inode_lock(inode);
+	hugemfd_unpreserve_folios(memfd_ser, private->folios_ser,
+				  private->nr_folios, h);
+	hugetlb_i_freeze(inode, false);
+	kho_unpreserve_free(memfd_ser);
+	kfree(private);
+	inode_unlock(inode);
+}
+
+static int hugemfd_freeze(struct liveupdate_file_op_args *args)
+{
+	struct hugemfd_ser *memfd_ser = phys_to_virt(args->serialized_data);
+
+	/*
+	 * The pos might have changed since preserve. Everything else stays the
+	 * same.
+	 */
+	memfd_ser->pos = args->file->f_pos;
+	return 0;
+}
+
+static void hugemfd_finish(struct liveupdate_file_op_args *args)
+{
+	struct hugemfd_ser *memfd_ser = phys_to_virt(args->serialized_data);
+	struct hugemfd_folio_ser *folios_ser;
+	LIST_HEAD(folio_list);
+	struct hstate *h;
+
+	if (args->retrieved)
+		return;
+
+	folios_ser = kho_restore_vmalloc(&memfd_ser->folios);
+	if (WARN_ON_ONCE(!folios_ser))
+		return;
+
+	h = size_to_hstate(PAGE_SIZE << memfd_ser->order);
+	if (!h) {
+		pr_warn("no hstate found for order %u\n", memfd_ser->order);
+		goto err_free_all;
+	}
+
+	/* Return the folios back to the hstate. */
+	for (u64 i = 0; i < memfd_ser->nr_folios; i++) {
+		struct folio *folio;
+
+		folio = kho_restore_folio(PFN_PHYS(folios_ser[i].pfn));
+		if (!folio)
+			continue;
+
+		if (!folio_ref_freeze(folio, 1)) {
+			pr_warn("unexpected refcount on PFN 0x%lx\n",
+				folio_pfn(folio));
+			continue;
+		}
+
+		init_new_hugetlb_folio(folio);
+		list_add(&folio->lru, &folio_list);
+	}
+
+	prep_and_add_allocated_folios(h, &folio_list);
+	vfree(folios_ser);
+	return;
+
+err_free_all:
+	for (u64 i = 0; i < memfd_ser->nr_folios; i++) {
+		struct folio *folio;
+
+		folio = kho_restore_folio(PFN_PHYS(folios_ser[i].pfn));
+		if (folio)
+			folio_put(folio);
+	}
+	vfree(folios_ser);
+}
+
+static int hugemfd_setup_rsrv(struct inode *inode)
+{
+	struct hstate *h = hstate_inode(inode);
+	long chg, regions_needed, add = -1;
+	/*
+	 * NOTE: Setting up the reservations for the whole file works right now
+	 * because during preserve all the folios are filled in when pinning.
+	 * Whenever that changes, this needs to be updated as well.
+	 */
+	long from = 0, to = inode->i_size >> huge_page_shift(h);
+	struct resv_map *resv_map;
+	struct hugetlb_cgroup *h_cg = NULL;
+	int err;
+
+	resv_map = inode_resv_map(inode);
+	chg = region_chg(resv_map, from, to, &regions_needed);
+	if (chg < 0)
+		return chg;
+
+	if (hugetlb_cgroup_charge_cgroup_rsvd(hstate_index(h),
+					      chg * pages_per_huge_page(h),
+					      &h_cg) < 0) {
+		err = -ENOMEM;
+		goto err_region_abort;
+	}
+
+	/*
+	 * No need for hugetlb_acct_memory() to update h->resv_huge_pages since
+	 * the reserved pages we added here will get used immediately after in
+	 * hugemfd_retrieve_folios().
+	 *
+	 * No need for subpool reservations as well since the memfds come from
+	 * the internal mounts of hugetlbfs and that doesn't have subpools.
+	 */
+	add = region_add(resv_map, from, to, regions_needed, h, h_cg);
+	if (add < 0) {
+		err = add;
+		goto err_uncharge_cgroup;
+	}
+
+	hugetlb_cgroup_put_rsvd_cgroup(h_cg);
+
+	return 0;
+
+err_uncharge_cgroup:
+	hugetlb_cgroup_uncharge_cgroup_rsvd(hstate_index(h),
+					    chg * pages_per_huge_page(h), h_cg);
+err_region_abort:
+	region_abort(resv_map, from, to, regions_needed);
+	return err;
+}
+
+static struct folio *hugemfd_retrieve_folio(struct hugemfd_folio_ser *folio_ser)
+{
+	struct folio *folio;
+
+	folio = kho_restore_folio(PFN_PHYS(folio_ser->pfn));
+	if (!folio)
+		return NULL;
+
+	init_new_hugetlb_folio(folio);
+	__folio_mark_uptodate(folio);
+	folio_ref_freeze(folio, 1);
+
+	return folio;
+}
+
+static void hugemfd_add_folios(struct hstate *h, struct list_head *folio_list)
+{
+	unsigned long flags;
+	struct folio *folio, *tmp_f;
+
+	/* Send list for bulk vmemmap optimization processing */
+	hugetlb_vmemmap_optimize_folios(h, folio_list);
+
+	spin_lock_irqsave(&hugetlb_lock, flags);
+	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
+		account_new_hugetlb_folio(h, folio);
+		folio_clear_hugetlb_freed(folio);
+		list_move(&folio->lru, &h->hugepage_activelist);
+	}
+	spin_unlock_irqrestore(&hugetlb_lock, flags);
+}
+
+static int hugemfd_retrieve_folios(struct file *file,
+				   struct hugemfd_ser *memfd_ser)
+{
+	struct hugemfd_folio_ser *folios_ser;
+	struct inode *inode = file_inode(file);
+	struct hstate *h = hstate_inode(inode);
+	int err, hidx = hstate_index(h);
+	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+	struct address_space *mapping;
+	struct hugetlb_cgroup *h_cg;
+	struct folio *folio;
+	LIST_HEAD(list);
+	u64 nr_folios;
+
+	if (!memfd_ser->size)
+		return 0;
+
+	folios_ser = kho_restore_vmalloc(&memfd_ser->folios);
+	if (!folios_ser)
+		return -ENOMEM;
+
+	nr_folios = memfd_ser->nr_folios;
+	mapping = inode->i_mapping;
+
+	/* First prepare the folios and add them to the hstate. */
+	for (u64 i = 0; i < nr_folios; i++) {
+		struct hugemfd_folio_ser *folio_ser = &folios_ser[i];
+
+		folio = hugemfd_retrieve_folio(folio_ser);
+		if (!folio) {
+			err = -EINVAL;
+			goto err_free_folios_ser;
+		}
+
+		list_add(&folio->lru, &list);
+	}
+
+	hugemfd_add_folios(h, &list);
+
+	/* Now that all the folios are prepared, add them to the file. */
+	for (u64 i = 0; i < nr_folios; i++) {
+		folio = pfn_folio(folios_ser[i].pfn);
+		folio_ref_unfreeze(folio, 1);
+
+		err = hugetlb_add_to_page_cache(folio, mapping,
+						folios_ser[i].index >> memfd_ser->order);
+		if (err) {
+			pr_err("failed to add to page cache: %pe\n", ERR_PTR(err));
+			goto err_free_folios_ser;
+		}
+
+		spin_lock_irq(&hugetlb_lock);
+		err = hugetlb_cgroup_charge_cgroup(hidx, pages_per_huge_page(h),
+						   &h_cg);
+		if (err) {
+			spin_unlock_irq(&hugetlb_lock);
+			folio_unlock(folio);
+			goto err_free_folios_ser;
+		}
+		hugetlb_cgroup_commit_charge(hidx, pages_per_huge_page(h), h_cg, folio);
+		spin_unlock_irq(&hugetlb_lock);
+
+		err = mem_cgroup_charge_hugetlb(folio, gfp);
+		if (err) {
+			folio_unlock(folio);
+			goto err_free_folios_ser;
+		}
+
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+
+	vfree(folios_ser);
+	return 0;
+
+err_free_folios_ser:
+	/*
+	 * NOTE: The folios of the file might be in use for DMA or other
+	 * things. It is unsafe to free them. Leak them, and let userspace get
+	 * the error code and decide what to do.
+	 */
+	vfree(folios_ser);
+	return err;
+}
+
+/*
+ * NOTE: Leaking the file in the error paths is intentional here. The memory
+ * might be in use by devices, and it is unsafe to release it. Return the error
+ * to userspace and let it decide how to recover, usually by rebooting the
+ * system.
+ */
+static int hugemfd_retrieve(struct liveupdate_file_op_args *args)
+{
+	struct hugemfd_ser *memfd_ser;
+	struct file *file;
+	int err;
+
+	memfd_ser = phys_to_virt(args->serialized_data);
+
+	file = hugetlb_file_setup("", 0, VM_NORESERVE, HUGETLB_ANONHUGE_INODE,
+				  memfd_ser->order + PAGE_SHIFT);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_free_memfd_ser;
+	}
+
+	vfs_setpos(file, memfd_ser->pos, MAX_LFS_FILESIZE);
+	file->f_inode->i_size = memfd_ser->size;
+
+	err = hugemfd_setup_rsrv(file_inode(file));
+	if (err)
+		goto err_free_memfd_ser;
+
+	if (memfd_ser->nr_folios) {
+		err = hugemfd_retrieve_folios(file, memfd_ser);
+		if (err)
+			goto err_free_memfd_ser;
+	}
+
+	args->file = file;
+	kho_restore_free(memfd_ser);
+	return 0;
+
+err_free_memfd_ser:
+	kho_restore_free(memfd_ser);
+	return err;
+}
+
+static const struct liveupdate_file_ops hugemfd_luo_ops = {
+	.can_preserve = hugemfd_can_preserve,
+	.preserve = hugemfd_preserve,
+	.unpreserve = hugemfd_unpreserve,
+	.freeze = hugemfd_freeze,
+	.finish = hugemfd_finish,
+	.retrieve = hugemfd_retrieve,
+	.owner = THIS_MODULE,
+};
+
+static struct liveupdate_file_handler hugemfd_handler = {
+	.ops = &hugemfd_luo_ops,
+	.compatible = HUGE_MEMFD_COMPATIBLE,
+};
+
 void __init hugetlb_luo_init(void)
 {
+	int err;
+
 	if (!liveupdate_enabled())
 		return;
+
+	err = liveupdate_register_file_handler(&hugemfd_handler);
+	if (err) {
+		pr_err("could not register file handler: %pe\n", ERR_PTR(err));
+		return;
+	}
+
+	err = liveupdate_register_flb(&hugemfd_handler, &hugetlb_luo_flb);
+	if (err) {
+		pr_err("could not register hugetlb FLB handler: %pe\n", ERR_PTR(err));
+		liveupdate_unregister_file_handler(&hugemfd_handler);
+		return;
+	}
 }
diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c
index 4f6ba63b4310..de715d67d543 100644
--- a/mm/memfd_luo.c
+++ b/mm/memfd_luo.c
@@ -26,10 +26,6 @@
  *    The LUO API is not stabilized yet, so the preserved properties of a memfd
  *    are also not stable and are subject to backwards incompatible changes.
  *
- * .. note::
- *    Currently a memfd backed by Hugetlb is not supported. Memfds created
- *    with ``MFD_HUGETLB`` will be rejected.
- *
  * Preserved Properties
  * ====================
  *
-- 
2.43.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-12-06 23:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-06 23:02 [RFC PATCH 00/10] liveupdate: hugetlb support Pratyush Yadav
2025-12-06 23:02 ` [RFC PATCH 01/10] kho: drop restriction on maximum page order Pratyush Yadav
2025-12-06 23:02 ` [RFC PATCH 02/10] kho: disable scratch-only earlier in boot Pratyush Yadav
2025-12-06 23:02 ` [RFC PATCH 03/10] liveupdate: do early initialization before hugepages are allocated Pratyush Yadav
2025-12-06 23:02 ` [RFC PATCH 04/10] liveupdate: flb: allow getting FLB data in early boot Pratyush Yadav
2025-12-06 23:02 ` [RFC PATCH 05/10] mm: hugetlb: export some functions to hugetlb-internal header Pratyush Yadav
2025-12-06 23:02 ` [RFC PATCH 06/10] liveupdate: hugetlb subsystem FLB state preservation Pratyush Yadav
2025-12-06 23:02 ` [RFC PATCH 07/10] mm: hugetlb: don't allocate pages already in live update Pratyush Yadav
2025-12-06 23:02 ` [RFC PATCH 08/10] mm: hugetlb: disable CMA if liveupdate is enabled Pratyush Yadav
2025-12-06 23:02 ` [RFC PATCH 09/10] mm: hugetlb: allow freezing the inode Pratyush Yadav
2025-12-06 23:02 ` [RFC PATCH 10/10] liveupdate: allow preserving hugetlb-backed memfd Pratyush Yadav
