* [PATCH] x86/efi: defer freeing of boot services memory
@ 2026-02-23 7:52 Mike Rapoport
2026-02-23 8:08 ` Ard Biesheuvel
0 siblings, 1 reply; 6+ messages in thread
From: Mike Rapoport @ 2026-02-23 7:52 UTC
To: x86, linux-kernel
Cc: Ard Biesheuvel, Benjamin Herrenschmidt, Borislav Petkov,
Dave Hansen, Ilias Apalodimas, Ingo Molnar, Mike Rapoport,
H. Peter Anvin, Thomas Gleixner, linux-efi, linux-mm, stable
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
efi_free_boot_services() frees memory occupied by EFI_BOOT_SERVICES_CODE
and EFI_BOOT_SERVICES_DATA using memblock_free_late().
There are two issues with that: memblock_free_late() should be used for
memory allocated with memblock_alloc(), while memory reserved with
memblock_reserve() should be freed with free_reserved_area().
More acutely, with CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
efi_free_boot_services() is called before deferred initialization of the
memory map is complete.
Benjamin Herrenschmidt reports that this causes a leak of ~140MB of
RAM on EC2 t3a.nano instances which only have 512MB of RAM.
If the freed memory resides in areas whose memory map is still
uninitialized, it won't actually be freed because
memblock_free_late() calls memblock_free_pages() and the latter skips
uninitialized pages.
Using free_reserved_area() at this point is also problematic because
__free_page() accesses the buddy of the freed page and that again might
end up in an uninitialized part of the memory map.
Delaying the entire efi_free_boot_services() could be problematic
because in addition to freeing boot services memory it updates
efi.memmap without any synchronization and that's undesirable late in
boot when there is concurrency.
A more robust approach is to defer only the freeing of the EFI boot
services memory.
Rename efi_free_boot_services() to efi_unmap_boot_services(), make it
collect the ranges that should be freed into an array, and add an
arch_initcall, efi_free_boot_services(), that walks that array and
actually frees the memory using free_reserved_area().
Link: https://lore.kernel.org/all/ec2aaef14783869b3be6e3c253b2dcbf67dbc12a.camel@kernel.crashing.org
Fixes: 916f676f8dc0 ("x86, efi: Retain boot service code until after switching to virtual mode")
Cc: <stable@vger.kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/x86/include/asm/efi.h | 2 +-
arch/x86/platform/efi/efi.c | 2 +-
arch/x86/platform/efi/quirks.c | 55 +++++++++++++++++++++++++++--
drivers/firmware/efi/mokvar-table.c | 2 +-
4 files changed, 55 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index f227a70ac91f..51b4cdbea061 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -138,7 +138,7 @@ extern void __init efi_apply_memmap_quirks(void);
extern int __init efi_reuse_config(u64 tables, int nr_tables);
extern void efi_delete_dummy_variable(void);
extern void efi_crash_gracefully_on_page_fault(unsigned long phys_addr);
-extern void efi_free_boot_services(void);
+extern void efi_unmap_boot_services(void);
void arch_efi_call_virt_setup(void);
void arch_efi_call_virt_teardown(void);
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index d00c6de7f3b7..d84c6020dda1 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -836,7 +836,7 @@ static void __init __efi_enter_virtual_mode(void)
}
efi_check_for_embedded_firmwares();
- efi_free_boot_services();
+ efi_unmap_boot_services();
if (!efi_is_mixed())
efi_native_runtime_setup();
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 553f330198f2..35caa5746115 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -341,7 +341,7 @@ void __init efi_reserve_boot_services(void)
/*
* Because the following memblock_reserve() is paired
- * with memblock_free_late() for this region in
+ * with free_reserved_area() for this region in
* efi_free_boot_services(), we must be extremely
* careful not to reserve, and subsequently free,
* critical regions of memory (like the kernel image) or
@@ -404,17 +404,33 @@ static void __init efi_unmap_pages(efi_memory_desc_t *md)
pr_err("Failed to unmap VA mapping for 0x%llx\n", va);
}
-void __init efi_free_boot_services(void)
+struct efi_freeable_range {
+ u64 start;
+ u64 end;
+};
+
+static struct efi_freeable_range *ranges_to_free;
+
+void __init efi_unmap_boot_services(void)
{
struct efi_memory_map_data data = { 0 };
efi_memory_desc_t *md;
int num_entries = 0;
+ int idx = 0;
+ size_t sz;
void *new, *new_md;
/* Keep all regions for /sys/kernel/debug/efi */
if (efi_enabled(EFI_DBG))
return;
+ sz = sizeof(*ranges_to_free) * (efi.memmap.nr_map + 1);
+ ranges_to_free = kzalloc(sz, GFP_KERNEL);
+ if (!ranges_to_free) {
+ pr_err("Failed to allocate storage for freeable EFI regions\n");
+ return;
+ }
+
for_each_efi_memory_desc(md) {
unsigned long long start = md->phys_addr;
unsigned long long size = md->num_pages << EFI_PAGE_SHIFT;
@@ -471,7 +487,15 @@ void __init efi_free_boot_services(void)
start = SZ_1M;
}
- memblock_free_late(start, size);
+ /*
+ * With CONFIG_DEFERRED_STRUCT_PAGE_INIT parts of the memory
+ * map are still not initialized and we can't reliably free
+ * memory here.
+ * Queue the ranges to free at a later point.
+ */
+ ranges_to_free[idx].start = start;
+ ranges_to_free[idx].end = start + size;
+ idx++;
}
if (!num_entries)
@@ -512,6 +536,31 @@ void __init efi_free_boot_services(void)
}
}
+static int __init efi_free_boot_services(void)
+{
+ struct efi_freeable_range *range = ranges_to_free;
+ unsigned long freed = 0;
+
+ if (!ranges_to_free)
+ return 0;
+
+ while (range->start) {
+ void *start = phys_to_virt(range->start);
+ void *end = phys_to_virt(range->end);
+
+ free_reserved_area(start, end, -1, NULL);
+ freed += (end - start);
+ range++;
+ }
+ kfree(ranges_to_free);
+
+ if (freed)
+ pr_info("Freeing EFI boot services memory: %luK\n", freed / SZ_1K);
+
+ return 0;
+}
+arch_initcall(efi_free_boot_services);
+
/*
* A number of config table entries get remapped to virtual addresses
* after entering EFI virtual mode. However, the kexec kernel requires
diff --git a/drivers/firmware/efi/mokvar-table.c b/drivers/firmware/efi/mokvar-table.c
index 4ff0c2926097..6842aa96d704 100644
--- a/drivers/firmware/efi/mokvar-table.c
+++ b/drivers/firmware/efi/mokvar-table.c
@@ -85,7 +85,7 @@ static struct kobject *mokvar_kobj;
* as an alternative to ordinary EFI variables, due to platform-dependent
* limitations. The memory occupied by this table is marked as reserved.
*
- * This routine must be called before efi_free_boot_services() in order
+ * This routine must be called before efi_unmap_boot_services() in order
* to guarantee that it can mark the table as reserved.
*
* Implicit inputs:
base-commit: 6de23f81a5e08be8fbf5e8d7e9febc72a5b5f27f
--
2.51.0
* Re: [PATCH] x86/efi: defer freeing of boot services memory
2026-02-23 7:52 [PATCH] x86/efi: defer freeing of boot services memory Mike Rapoport
@ 2026-02-23 8:08 ` Ard Biesheuvel
2026-02-23 10:55 ` Mike Rapoport
0 siblings, 1 reply; 6+ messages in thread
From: Ard Biesheuvel @ 2026-02-23 8:08 UTC
To: Mike Rapoport, x86, linux-kernel
Cc: Benjamin Herrenschmidt, Borislav Petkov, Dave Hansen,
Ilias Apalodimas, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
linux-efi, linux-mm, stable

Hi Mike,

On Mon, 23 Feb 2026, at 08:52, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> [...]
>
> A more robust approach is to defer only the freeing of the EFI boot
> services memory.
>
> Rename efi_free_boot_services() to efi_unmap_boot_services(), make it
> collect the ranges that should be freed into an array, and add an
> arch_initcall, efi_free_boot_services(), that walks that array and
> actually frees the memory using free_reserved_area().
>

Instead of creating another table, could we just traverse the EFI memory
map again in the arch_initcall(), and free all boot services code/data
above 1M with EFI_MEMORY_RUNTIME cleared?

> [...]
* Re: [PATCH] x86/efi: defer freeing of boot services memory
2026-02-23 8:08 ` Ard Biesheuvel
@ 2026-02-23 10:55 ` Mike Rapoport
2026-02-23 11:17 ` Ard Biesheuvel
0 siblings, 1 reply; 6+ messages in thread
From: Mike Rapoport @ 2026-02-23 10:55 UTC
To: Ard Biesheuvel
Cc: x86, linux-kernel, Benjamin Herrenschmidt, Borislav Petkov,
Dave Hansen, Ilias Apalodimas, Ingo Molnar, H. Peter Anvin,
Thomas Gleixner, linux-efi, linux-mm, stable

Hi Ard,

On Mon, Feb 23, 2026 at 09:08:29AM +0100, Ard Biesheuvel wrote:
> Hi Mike,
>
> On Mon, 23 Feb 2026, at 08:52, Mike Rapoport wrote:
> > [...]
>
> Instead of creating another table, could we just traverse the EFI memory
> map again in the arch_initcall(), and free all boot services code/data
> above 1M with EFI_MEMORY_RUNTIME cleared ?

Currently efi_free_boot_services() unmaps all boot services code/data with
EFI_MEMORY_RUNTIME cleared and removes them from the efi.memmap.

I wasn't sure it's Ok to only unmap them, but leave in efi.memmap, that's
why I didn't use the existing EFI memory map.

Now thinking about it, if the unmapping can happen later, maybe we'll just
move the entire efi_free_boot_services() to an initcall?

> > [...]

--
Sincerely yours,
Mike.
* Re: [PATCH] x86/efi: defer freeing of boot services memory
2026-02-23 10:55 ` Mike Rapoport
@ 2026-02-23 11:17 ` Ard Biesheuvel
2026-02-23 11:40 ` Mike Rapoport
0 siblings, 1 reply; 6+ messages in thread
From: Ard Biesheuvel @ 2026-02-23 11:17 UTC
To: Mike Rapoport
Cc: x86, linux-kernel, Benjamin Herrenschmidt, Borislav Petkov,
Dave Hansen, Ilias Apalodimas, Ingo Molnar, H. Peter Anvin,
Thomas Gleixner, linux-efi, linux-mm, stable

On Mon, 23 Feb 2026, at 11:55, Mike Rapoport wrote:
> Hi Ard,
>
> On Mon, Feb 23, 2026 at 09:08:29AM +0100, Ard Biesheuvel wrote:
>> [...]
>> Instead of creating another table, could we just traverse the EFI memory
>> map again in the arch_initcall(), and free all boot services code/data
>> above 1M with EFI_MEMORY_RUNTIME cleared ?
>
> Currently efi_free_boot_services() unmaps all boot services code/data with
> EFI_MEMORY_RUNTIME cleared and removes them from the efi.memmap.
>

Ah yes, I failed to spot that those entries are long gone by initcall
time. Other architectures don't touch the EFI memory map at all, but x86
mangles it beyond recognition :-)

> I wasn't sure it's Ok to only unmap them, but leave in efi.memmap, that's
> why I didn't use the existing EFI memory map.
>
> Now thinking about it, if the unmapping can happen later, maybe we'll just
> move the entire efi_free_boot_services() to an initcall?
>

As long as it is pre-SMP, as that code also contains a quirk to allocate
the real mode trampoline if all memory below 1 MB is used for boot
services.

But actually, that should be a separate quirk to begin with, rather than
being integrated into an unrelated function that happens to iterate over
the boot services regions. The only problem, I guess, is that
memblock_reserve()'ing that sub-1MB region in the old location in the
ordinary way would cause it to be freed again in the initcall?

But yes, in general I think it is fine to unmap those regions from the
EFI page tables during an initcall.
* Re: [PATCH] x86/efi: defer freeing of boot services memory
2026-02-23 11:17 ` Ard Biesheuvel
@ 2026-02-23 11:40 ` Mike Rapoport
2026-02-23 12:18 ` Ard Biesheuvel
0 siblings, 1 reply; 6+ messages in thread
From: Mike Rapoport @ 2026-02-23 11:40 UTC
To: Ard Biesheuvel
Cc: x86, linux-kernel, Benjamin Herrenschmidt, Borislav Petkov,
Dave Hansen, Ilias Apalodimas, Ingo Molnar, H. Peter Anvin,
Thomas Gleixner, linux-efi, linux-mm, stable

On Mon, Feb 23, 2026 at 12:17:22PM +0100, Ard Biesheuvel wrote:
>
> On Mon, 23 Feb 2026, at 11:55, Mike Rapoport wrote:
> > [...]
> > Currently efi_free_boot_services() unmaps all boot services code/data with
> > EFI_MEMORY_RUNTIME cleared and removes them from the efi.memmap.
>
> Ah yes, I failed to spot that those entries are long gone by initcall
> time. Other architectures don't touch the EFI memory map at all, but x86
> mangles it beyond recognition :-)

Heh, EFI on x86 does a lot of, hmm, interesting things with memory, like
memremapping kmalloced memory, and it really begs for cleanups :)

> > I wasn't sure it's Ok to only unmap them, but leave in efi.memmap, that's
> > why I didn't use the existing EFI memory map.
> >
> > Now thinking about it, if the unmapping can happen later, maybe we'll just
> > move the entire efi_free_boot_services() to an initcall?
> >
>
> As long as it is pre-SMP, as that code also contains a quirk to allocate
> the real mode trampoline if all memory below 1 MB is used for boot
> services.

initcall is long after SMP. Is the real mode trampoline allocation the
only thing that should happen pre-SMP?

> But actually, that should be a separate quirk to begin with, rather than
> being integrated into an unrelated function that happens to iterate over
> the boot services regions. The only problem, I guess, is that
> memblock_reserve()'ing that sub-1MB region in the old location in the
> ordinary way would cause it to be freed again in the initcall?

Right now we anyway don't free anything below 1M, I don't see why it should
change.

> But yes, in general I think it is fine to unmap those regions from the
> EFI page tables during an initcall.

Thanks for confirming. I'll look into extracting the allocation of the real
mode trampoline to a separate quirk and then making the entire
efi_free_boot_services() an initcall.

--
Sincerely yours,
Mike.
* Re: [PATCH] x86/efi: defer freeing of boot services memory 2026-02-23 11:40 ` Mike Rapoport @ 2026-02-23 12:18 ` Ard Biesheuvel 0 siblings, 0 replies; 6+ messages in thread From: Ard Biesheuvel @ 2026-02-23 12:18 UTC (permalink / raw) To: Mike Rapoport Cc: x86, linux-kernel, Benjamin Herrenschmidt, Borislav Petkov, Dave Hansen, Ilias Apalodimas, Ingo Molnar, H . Peter Anvin, Thomas Gleixner, linux-efi, linux-mm, stable On Mon, 23 Feb 2026, at 12:40, Mike Rapoport wrote: > On Mon, Feb 23, 2026 at 12:17:22PM +0100, Ard Biesheuvel wrote: >> >> On Mon, 23 Feb 2026, at 11:55, Mike Rapoport wrote: >> > Hi Ard, >> > >> > On Mon, Feb 23, 2026 at 09:08:29AM +0100, Ard Biesheuvel wrote: >> >> Hi Mike, >> >> >> >> On Mon, 23 Feb 2026, at 08:52, Mike Rapoport wrote: >> >> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org> >> >> > >> >> > efi_free_boot_services() frees memory occupied by EFI_BOOT_SERVICES_CODE >> >> > and EFI_BOOT_SERVICES_DATA using memblock_free_late(). >> >> > >> >> > There are two issue with that: memblock_free_late() should be used for >> >> > memory allocated with memblock_alloc() while the memory reserved with >> >> > memblock_reserve() should be freed with free_reserved_area(). >> >> > >> >> > More acutely, with CONFIG_DEFERRED_STRUCT_PAGE_INIT=y >> >> > efi_free_boot_services() is called before deferred initialization of the >> >> > memory map is complete. >> >> > >> >> > Benjamin Herrenschmidt reports that this causes a leak of ~140MB of >> >> > RAM on EC2 t3a.nano instances which only have 512MB or RAM. >> >> > >> >> > If the freed memory resides in the areas that memory map for them is >> >> > still uninitialized, they won't be actually freed because >> >> > memblock_free_late() calls memblock_free_pages() and the latter skips >> >> > uninitialized pages. 
>> >> >
>> >> > Using free_reserved_area() at this point is also problematic because
>> >> > __free_page() accesses the buddy of the freed page and that again might
>> >> > end up in uninitialized part of the memory map.
>> >> >
>> >> > Delaying the entire efi_free_boot_services() could be problematic
>> >> > because in addition to freeing boot services memory it updates
>> >> > efi.memmap without any synchronization and that's undesirable late in
>> >> > boot when there is concurrency.
>> >> >
>> >> > More robust approach is to only defer freeing of the EFI boot services
>> >> > memory.
>> >> >
>> >> > Make efi_free_boot_services() collect ranges that should be freed into
>> >> > an array and add an initcall efi_free_boot_services_memory() that walks
>> >> > that array and actually frees the memory using free_reserved_area().
>> >> >
>> >>
>> >> Instead of creating another table, could we just traverse the EFI memory
>> >> map again in the arch_initcall(), and free all boot services code/data
>> >> above 1M with EFI_MEMORY_RUNTIME cleared?
>> >
>> > Currently efi_free_boot_services() unmaps all boot services code/data with
>> > EFI_MEMORY_RUNTIME cleared and removes them from the efi.memmap.
>>
>> Ah yes, I failed to spot that those entries are long gone by initcall
>> time. Other architectures don't touch the EFI memory map at all, but x86
>> mangles it beyond recognition :-)
>
> Heh, EFI on x86 does a lot of, hmm, interesting things with memory, like
> memremaping kmalloced memory and it really begs for cleanups :)
>

Yeah. Sadly, all this has become ABI for kexec, so the EFI memory map abuse
is hard to fix.

>> > I wasn't sure it's Ok to only unmap them, but leave in efi.memmap, that's
>> > why I didn't use the existing EFI memory map.
>> >
>> > Now thinking about it, if the unmapping can happen later, maybe we'll just
>> > move the entire efi_free_boot_services() to an initcall?
>> >
>>
>> As long as it is pre-SMP, as that code also contains a quirk to allocate
>> the real mode trampoline if all memory below 1 MB is used for boot
>> services.
>
> initcall is long after SMP. Is the real mode trampoline allocation the
> only thing that should happen pre-SMP?
>

early_initcall() should be early enough, those run before SMP init.

>> But actually, that should be a separate quirk to begin with, rather than
>> being integrated into an unrelated function that happens to iterate over
>> the boot services regions. The only problem, I guess, is that
>> memblock_reserve()'ing that sub-1MB region in the old location in the
>> ordinary way would cause it to be freed again in the initcall?
>
> Right now we anyway don't free anything below 1M, I don't see why it should
> change.
>
>> But yes, in general I think it is fine to unmap those regions from the
>> EFI page tables during an initcall.
>
> Thanks for confirming. I'll look into extracting the allocation of the real
> mode trampoline to a separate quirk and then making the entire
> efi_free_boot_services() an initcall.
>

Thanks!

end of thread

Thread overview: 6+ messages
2026-02-23  7:52 [PATCH] x86/efi: defer freeing of boot services memory Mike Rapoport
2026-02-23  8:08 ` Ard Biesheuvel
2026-02-23 10:55   ` Mike Rapoport
2026-02-23 11:17     ` Ard Biesheuvel
2026-02-23 11:40       ` Mike Rapoport
2026-02-23 12:18         ` Ard Biesheuvel