* Re: [PATCH] kho: add support for deferred struct page init
@ 2025-12-24 7:34 Fadouse
2025-12-29 21:09 ` Pratyush Yadav
0 siblings, 1 reply; 28+ messages in thread
From: Fadouse @ 2025-12-24 7:34 UTC (permalink / raw)
To: Evangelos Petrongonas, Mike Rapoport
Cc: Pasha Tatashin, Pratyush Yadav, Alexander Graf, Andrew Morton,
Jason Miu, linux-kernel, kexec, linux-mm, nh-open-source
Hi Evangelos, Mike, Pasha, Pratyush,
I independently hit a crash in the LUO/memfd restore path with
CONFIG_DEFERRED_STRUCT_PAGE_INIT=y, on a local build based on dd9b004b7ff3
(x86_64 QEMU, 6.19.0-rc1 timeframe).
In my reproducer, stage1 preserves a memfd via LUO and kexecs into stage2;
stage2 calls LIVEUPDATE_SESSION_FINISH without retrieving files. I observed
a reliable crash in adjust_managed_page_count() from kho_restore_page().
Minimal excerpt:
stage2: start
stage2: retrieved session fd=4
BUG: unable to handle page fault for address: 0000000000001410
RIP: adjust_managed_page_count+0x29/0x40
Call Trace:
kho_restore_page+0x18a/0x1c0
kho_restore_folio+0xe/0x60
memfd_luo_finish+0xe6/0x160
luo_file_finish+0x188/0x240
luo_session_finish+0x2c/0x80
luo_session_ioctl+0xf5/0x170
__x64_sys_ioctl+0x91/0xe0
Applying the patch in <20251216084913.86342-1-epetron@amazon.de> makes the
issue no longer reproduce for me.
I can share full logs and the small two-stage initramfs reproducer if
needed.
Thanks,
YanXin Li
Tested-by: YanXin Li <fadouse@proton.me>
On 12/16/2025 4:49 PM, Evangelos Petrongonas wrote:
> When `CONFIG_DEFERRED_STRUCT_PAGE_INIT` is enabled, struct page
> initialization is deferred to parallel kthreads that run later
> in the boot process.
>
> During KHO restoration, `deserialize_bitmap()` writes metadata for
> each preserved memory region. However, if the struct page has not been
> initialized, this write targets uninitialized memory, potentially
> leading to errors like:
> ```
> BUG: unable to handle page fault for address: ...
> ```
>
> Fix this by introducing `kho_get_preserved_page()`, which ensures
> all struct pages in a preserved region are initialized by calling
> `init_deferred_page()`, which is a no-op when deferred init is disabled
> or when the struct page is already initialized.
>
> Fixes: 8b66ed2c3f42 ("kho: mm: don't allow deferred struct page with KHO")
> Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
> ---
> ### Notes
> @Jason, this patch should act as a temporary fix to make KHO play nice
> with deferred struct page init until you post your ideas about splitting
> "Physical Reservation" from "Metadata Restoration".
>
> ### Testing
> To test the fix, I modified the KHO selftest to allocate more
> memory, and to do so from higher memory, to trigger the incompatibility. The
> branch with those changes can be found at:
> https://git.infradead.org/?p=users/vpetrog/linux.git;a=shortlog;h=refs/heads/kho-deferred-struct-page-init
>
> In future patches, we might want to enhance the selftest to cover
> this case as well. However, properly adapting the test for this
> is much more work than the actual fix, so it can be deferred to a
> follow-up series.
>
> In addition, attempting to run the selftest for arm (without my changes)
> fails with:
> ```
> ERROR:target/arm/internals.h:767:regime_is_user: code should not be reached
> Bail out! ERROR:target/arm/internals.h:767:regime_is_user: code should not be reached
> ./tools/testing/selftests/kho/vmtest.sh: line 113: 61609 Aborted
> ```
> I have not looked into it further, but can do so as part of a
> selftest follow-up.
>
>  kernel/liveupdate/Kconfig          |  2 --
>  kernel/liveupdate/kexec_handover.c | 19 ++++++++++++++++++-
>  2 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
> index d2aeaf13c3ac..9394a608f939 100644
> --- a/kernel/liveupdate/Kconfig
> +++ b/kernel/liveupdate/Kconfig
> @@ -1,12 +1,10 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>
>  menu "Live Update and Kexec HandOver"
> -	depends on !DEFERRED_STRUCT_PAGE_INIT
>
>  config KEXEC_HANDOVER
>  	bool "kexec handover"
>  	depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
> -	depends on !DEFERRED_STRUCT_PAGE_INIT
>  	select MEMBLOCK_KHO_SCRATCH
>  	select KEXEC_FILE
>  	select LIBFDT
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index 9dc51fab604f..78cfe71e6107 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -439,6 +439,23 @@ static int kho_mem_serialize(struct kho_out *kho_out)
>  	return err;
>  }
>
> +/*
> + * With CONFIG_DEFERRED_STRUCT_PAGE_INIT, struct pages in higher memory
> + * regions may not be initialized yet at the time KHO deserializes preserved
> + * memory. This function ensures all struct pages in the region are initialized.
> + */
> +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> +						  unsigned int order)
> +{
> +	unsigned long pfn = PHYS_PFN(phys);
> +	int nid = early_pfn_to_nid(pfn);
> +
> +	for (int i = 0; i < (1 << order); i++)
> +		init_deferred_page(pfn + i, nid);
> +
> +	return pfn_to_page(pfn);
> +}
> +
>  static void __init deserialize_bitmap(unsigned int order,
>  				      struct khoser_mem_bitmap_ptr *elm)
>  {
> @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
>  		int sz = 1 << (order + PAGE_SHIFT);
>  		phys_addr_t phys =
>  			elm->phys_start + (bit << (order + PAGE_SHIFT));
> -		struct page *page = phys_to_page(phys);
> +		struct page *page = kho_get_preserved_page(phys, order);
>  		union kho_page_info info;
>
>  		memblock_reserve(phys, sz);
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-24 7:34 [PATCH] kho: add support for deferred struct page init Fadouse
@ 2025-12-29 21:09 ` Pratyush Yadav
2025-12-30 15:05 ` Pasha Tatashin
0 siblings, 1 reply; 28+ messages in thread
From: Pratyush Yadav @ 2025-12-29 21:09 UTC (permalink / raw)
To: Fadouse
Cc: Evangelos Petrongonas, Mike Rapoport, Pasha Tatashin,
Pratyush Yadav, Alexander Graf, Andrew Morton, Jason Miu,
linux-kernel, kexec, linux-mm, nh-open-source
On Wed, Dec 24 2025, Fadouse wrote:
> Hi Evangelos, Mike, Pasha, Pratyush,
>
> I independently hit a crash in the LUO/memfd restore path with
> CONFIG_DEFERRED_STRUCT_PAGE_INIT=y, on a local build based on dd9b004b7ff3
> (x86_64 QEMU, 6.19.0-rc1 timeframe).
How? config KEXEC_HANDOVER depends on !DEFERRED_STRUCT_PAGE_INIT. So you
shouldn't even be able to enable KHO or LUO with
CONFIG_DEFERRED_STRUCT_PAGE_INIT=y. Are you sure it is enabled?
>
> In my reproducer, stage1 preserves a memfd via LUO and kexecs into stage2;
> stage2 calls LIVEUPDATE_SESSION_FINISH without retrieving files. I observed
> a reliable crash in adjust_managed_page_count() from kho_restore_page().
>
> Minimal excerpt:
>
> stage2: start
> stage2: retrieved session fd=4
> BUG: unable to handle page fault for address: 0000000000001410
> RIP: adjust_managed_page_count+0x29/0x40
> Call Trace:
> kho_restore_page+0x18a/0x1c0
> kho_restore_folio+0xe/0x60
> memfd_luo_finish+0xe6/0x160
> luo_file_finish+0x188/0x240
> luo_session_finish+0x2c/0x80
> luo_session_ioctl+0xf5/0x170
> __x64_sys_ioctl+0x91/0xe0
>
> Applying the patch in <20251216084913.86342-1-epetron@amazon.de> makes the
> issue no longer reproduce for me.
>
> I can share full logs and the small two-stage initramfs reproducer if needed.
>
> Thanks,
> YanXin Li
>
> Tested-by: YanXin Li <fadouse@proton.me>
>
[...]
--
Regards,
Pratyush Yadav
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-29 21:09 ` Pratyush Yadav
@ 2025-12-30 15:05 ` Pasha Tatashin
0 siblings, 0 replies; 28+ messages in thread
From: Pasha Tatashin @ 2025-12-30 15:05 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Fadouse, Evangelos Petrongonas, Mike Rapoport, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Mon, Dec 29, 2025 at 4:09 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Wed, Dec 24 2025, Fadouse wrote:
>
> > Hi Evangelos, Mike, Pasha, Pratyush,
> >
> > I independently hit a crash in the LUO/memfd restore path with
> > CONFIG_DEFERRED_STRUCT_PAGE_INIT=y, on a local build based on dd9b004b7ff3
> > (x86_64 QEMU, 6.19.0-rc1 timeframe).
>
> How? config KEXEC_HANDOVER depends on !DEFERRED_STRUCT_PAGE_INIT. So you
> shouldn't even be able to enable KHO or LUO with
> CONFIG_DEFERRED_STRUCT_PAGE_INIT=y. Are you sure it is enabled?
I think Fadouse reported a bug with this patch applied, not an upstream bug.
Pasha
>
> >
> > In my reproducer, stage1 preserves a memfd via LUO and kexecs into stage2;
> > stage2 calls LIVEUPDATE_SESSION_FINISH without retrieving files. I observed
> > a reliable crash in adjust_managed_page_count() from kho_restore_page().
> >
> > Minimal excerpt:
> >
> > stage2: start
> > stage2: retrieved session fd=4
> > BUG: unable to handle page fault for address: 0000000000001410
> > RIP: adjust_managed_page_count+0x29/0x40
> > Call Trace:
> > kho_restore_page+0x18a/0x1c0
> > kho_restore_folio+0xe/0x60
> > memfd_luo_finish+0xe6/0x160
> > luo_file_finish+0x188/0x240
> > luo_session_finish+0x2c/0x80
> > luo_session_ioctl+0xf5/0x170
> > __x64_sys_ioctl+0x91/0xe0
> >
> > Applying the patch in <20251216084913.86342-1-epetron@amazon.de> makes the
> > issue no longer reproduce for me.
> >
> > I can share full logs and the small two-stage initramfs reproducer if needed.
> >
> > Thanks,
> > YanXin Li
> >
> > Tested-by: YanXin Li <fadouse@proton.me>
> >
> [...]
>
> --
> Regards,
> Pratyush Yadav
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-30 18:21 ` Pasha Tatashin
@ 2025-12-31 9:46 ` Mike Rapoport
0 siblings, 0 replies; 28+ messages in thread
From: Mike Rapoport @ 2025-12-31 9:46 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Pratyush Yadav, Evangelos Petrongonas, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Tue, Dec 30, 2025 at 01:21:31PM -0500, Pasha Tatashin wrote:
> On Tue, Dec 30, 2025 at 12:18 PM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Tue, Dec 30, 2025 at 11:18:12AM -0500, Pasha Tatashin wrote:
> > > On Tue, Dec 30, 2025 at 11:16 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > On Tue, Dec 30, 2025 at 11:05:05AM -0500, Pasha Tatashin wrote:
> > > > > On Mon, Dec 29, 2025 at 4:03 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> > > > > >
> > > > > > The magic is purely sanity checking. It is not used to decide anything
> > > > > > other than to make sure this is actually a KHO page. I don't intend to
> > > > > > change that. My point is, if we make sure the KHO pages are properly
> > > > > > initialized during MM init, then restoring can actually be a very cheap
> > > > > > operation, where you only do the sanity checking. You can even put the
> > > > > > magic check behind CONFIG_KEXEC_HANDOVER_DEBUG if you want, but I think
> > > > > > it is useful enough to keep in production systems too.
> > > > >
> > > > > It is part of a critical hotpath during blackout, so it should really be
> > > > > behind CONFIG_KEXEC_HANDOVER_DEBUG
> > > >
> > > > Do you have the numbers? ;-)
> > >
> > > The fastest reboot we can achieve is ~0.4s on ARM
> >
> > I meant the difference between assigning info.magic and skipping it.
>
> It is proportional to the amount of preserved memory: an extra assignment
> for each page. In our fleet we have observed IOMMU page tables to be
> 20G in size. So, let's just assume it is 20G. That is: 20 * 1024^3 /
Do you see 400ms reboot times on machines that have 20G of IOMMU page
tables? That's impressive considering the overall size of those machines.
> 4096 = 5.24 million pages. If we access "struct page" only for the
> magic purpose, we fetch full 64-byte cacheline, which is 5.24 million
> * 64 bytes = 335 M, that is ~13ms with ~25G/s DRAM; and also each TLB
> miss will add some latency, 5.2M * 10ns = ~50ms. In total we can get
> 15ms ~ 50ms regression compared to 400ms, that is 4-12%. It will be
> less if we also access "struct page" for another reason at the same
> time, but still it adds up.
Your overhead calculations are based on the assumption that we don't
access struct page, but we do. We assign page->private during
deserialization and then initialize struct page during restore.
We get the hit of cache fetches and TLB misses anyway.
It would be interesting to see the difference *measured* on those large
systems.
> > (shutdown+purgatory+boot); let's not add anything that regresses it, as every
> > microsecond counts during blackout.
> >
> > Any added functionality adds cycles, this is inevitable. And neither KHO
> > nor LUO is near completion, so we'll have to add functionality to both
> > of them. And the added functionality should be correct first and foremost.
> > And the magic sanity check seems pretty useful and presumably cheap enough
> > to always keep it unless you see a real slowdown because of it.
>
> The magic check is proportional to the amount of preserved memory. It is
> not required functionality, only sanity checking. I really do not
> see a reason to enable it in production. All other struct page and
> pg_flags related sanity checks are usually enabled with
> CONFIG_DEBUG_VM, so enabling it only with CONFIG_KEXEC_HANDOVER_DEBUG
> is better.
Having sanity checks in production could be useful because some errors
can be hard to reproduce in a controlled environment with a debug kernel.
Just last cycle there was commit 83c8f7b5e194 ("mm/mm_init: Introduce a
boot parameter for check_pages") that allows enabling page sanity checks in
production.
So my take is to keep the magic check at least until KHO/LUO mature.
--
Sincerely yours,
Mike.
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-30 17:18 ` Mike Rapoport
@ 2025-12-30 18:21 ` Pasha Tatashin
2025-12-31 9:46 ` Mike Rapoport
0 siblings, 1 reply; 28+ messages in thread
From: Pasha Tatashin @ 2025-12-30 18:21 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pratyush Yadav, Evangelos Petrongonas, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Tue, Dec 30, 2025 at 12:18 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Dec 30, 2025 at 11:18:12AM -0500, Pasha Tatashin wrote:
> > On Tue, Dec 30, 2025 at 11:16 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > On Tue, Dec 30, 2025 at 11:05:05AM -0500, Pasha Tatashin wrote:
> > > > On Mon, Dec 29, 2025 at 4:03 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> > > > >
> > > > > The magic is purely sanity checking. It is not used to decide anything
> > > > > other than to make sure this is actually a KHO page. I don't intend to
> > > > > change that. My point is, if we make sure the KHO pages are properly
> > > > > initialized during MM init, then restoring can actually be a very cheap
> > > > > operation, where you only do the sanity checking. You can even put the
> > > > > magic check behind CONFIG_KEXEC_HANDOVER_DEBUG if you want, but I think
> > > > > it is useful enough to keep in production systems too.
> > > >
> > > > It is part of a critical hotpath during blackout, so it should really be
> > > > behind CONFIG_KEXEC_HANDOVER_DEBUG
> > >
> > > Do you have the numbers? ;-)
> >
> > The fastest reboot we can achieve is ~0.4s on ARM
>
> I meant the difference between assigning info.magic and skipping it.
It is proportional to the amount of preserved memory: an extra assignment
for each page. In our fleet we have observed IOMMU page tables to be
20G in size. So, let's just assume it is 20G. That is: 20 * 1024^3 /
4096 = 5.24 million pages. If we access "struct page" only for the
magic purpose, we fetch full 64-byte cacheline, which is 5.24 million
* 64 bytes = 335 M, that is ~13ms with ~25G/s DRAM; and also each TLB
miss will add some latency, 5.2M * 10ns = ~50ms. In total we can get
15ms ~ 50ms regression compared to 400ms, that is 4-12%. It will be
less if we also access "struct page" for another reason at the same
time, but still it adds up.
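For reference, the same estimate as a throwaway userspace program (not
kernel code; the ~25G/s DRAM bandwidth and ~10ns TLB-miss cost are the
assumptions above, not measurements):

#include <stdio.h>

int main(void)
{
	double preserved = 20.0 * 1024 * 1024 * 1024;	/* 20G preserved */
	double pages = preserved / 4096;		/* 4K struct pages */
	double traffic = pages * 64;			/* one 64B line each */
	double dram_ms = traffic / 25e9 * 1e3;		/* ~25G/s DRAM */
	double tlb_ms = pages * 10e-9 * 1e3;		/* ~10ns per miss */

	printf("pages %.2fM traffic %.1fMB dram %.0fms tlb %.0fms\n",
	       pages / 1e6, traffic / 1e6, dram_ms, tlb_ms);
	return 0;
}

It prints 5.24M pages, ~335MB of cacheline traffic, ~13ms of DRAM time and
~52ms of TLB-miss time, matching the numbers above.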
>
> (shutdown+purgatory+boot); let's not add anything that regresses it, as every
> microsecond counts during blackout.
>
> Any added functionality adds cycles, this is inevitable. And neither KHO
> nor LUO is near completion, so we'll have to add functionality to both
> of them. And the added functionality should be correct first and foremost.
> And the magic sanity check seems pretty useful and presumably cheap enough
> to always keep it unless you see a real slowdown because of it.
The magic check is proportional to the amount of preserved memory. It is
not required functionality, only sanity checking. I really do not
see a reason to enable it in production. All other struct page and
pg_flags related sanity checks are usually enabled with
CONFIG_DEBUG_VM, so enabling it only with CONFIG_KEXEC_HANDOVER_DEBUG
is better.
Pasha
>
> > Pasha
>
> --
> Sincerely yours,
> Mike.
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-30 16:18 ` Pasha Tatashin
@ 2025-12-30 17:18 ` Mike Rapoport
2025-12-30 18:21 ` Pasha Tatashin
0 siblings, 1 reply; 28+ messages in thread
From: Mike Rapoport @ 2025-12-30 17:18 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Pratyush Yadav, Evangelos Petrongonas, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Tue, Dec 30, 2025 at 11:18:12AM -0500, Pasha Tatashin wrote:
> On Tue, Dec 30, 2025 at 11:16 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Tue, Dec 30, 2025 at 11:05:05AM -0500, Pasha Tatashin wrote:
> > > On Mon, Dec 29, 2025 at 4:03 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> > > >
> > > > The magic is purely sanity checking. It is not used to decide anything
> > > > other than to make sure this is actually a KHO page. I don't intend to
> > > > change that. My point is, if we make sure the KHO pages are properly
> > > > initialized during MM init, then restoring can actually be a very cheap
> > > > operation, where you only do the sanity checking. You can even put the
> > > > magic check behind CONFIG_KEXEC_HANDOVER_DEBUG if you want, but I think
> > > > it is useful enough to keep in production systems too.
> > >
> > > It is part of a critical hotpath during blackout, so it should really be
> > > behind CONFIG_KEXEC_HANDOVER_DEBUG
> >
> > Do you have the numbers? ;-)
>
> The fastest reboot we can achieve is ~0.4s on ARM
I meant the difference between assigning info.magic and skipping it.
> (shutdown+purgatory+boot); let's not add anything that regresses it, as every
> microsecond counts during blackout.
Any added functionality adds cycles, this is inevitable. And neither KHO
nor LUO is near completion, so we'll have to add functionality to both
of them. And the added functionality should be correct first and foremost.
And the magic sanity check seems pretty useful and presumably cheap enough
to always keep it unless you see a real slowdown because of it.
> Pasha
--
Sincerely yours,
Mike.
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-30 16:16 ` Mike Rapoport
@ 2025-12-30 16:18 ` Pasha Tatashin
2025-12-30 17:18 ` Mike Rapoport
0 siblings, 1 reply; 28+ messages in thread
From: Pasha Tatashin @ 2025-12-30 16:18 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pratyush Yadav, Evangelos Petrongonas, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Tue, Dec 30, 2025 at 11:16 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Dec 30, 2025 at 11:05:05AM -0500, Pasha Tatashin wrote:
> > On Mon, Dec 29, 2025 at 4:03 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> > >
> > > The magic is purely sanity checking. It is not used to decide anything
> > > other than to make sure this is actually a KHO page. I don't intend to
> > > change that. My point is, if we make sure the KHO pages are properly
> > > initialized during MM init, then restoring can actually be a very cheap
> > > operation, where you only do the sanity checking. You can even put the
> > > magic check behind CONFIG_KEXEC_HANDOVER_DEBUG if you want, but I think
> > > it is useful enough to keep in production systems too.
> >
> > It is part of a critical hotpath during blackout, so it should really be
> > behind CONFIG_KEXEC_HANDOVER_DEBUG
>
> Do you have the numbers? ;-)
The fastest reboot we can achieve is ~0.4s on ARM
(shutdown+purgatory+boot); let's not add anything that regresses it, as every
microsecond counts during blackout.
Pasha
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-30 16:05 ` Pasha Tatashin
@ 2025-12-30 16:16 ` Mike Rapoport
2025-12-30 16:18 ` Pasha Tatashin
0 siblings, 1 reply; 28+ messages in thread
From: Mike Rapoport @ 2025-12-30 16:16 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Pratyush Yadav, Evangelos Petrongonas, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Tue, Dec 30, 2025 at 11:05:05AM -0500, Pasha Tatashin wrote:
> On Mon, Dec 29, 2025 at 4:03 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> >
> > The magic is purely sanity checking. It is not used to decide anything
> > other than to make sure this is actually a KHO page. I don't intend to
> > change that. My point is, if we make sure the KHO pages are properly
> > initialized during MM init, then restoring can actually be a very cheap
> > operation, where you only do the sanity checking. You can even put the
> > magic check behind CONFIG_KEXEC_HANDOVER_DEBUG if you want, but I think
> > it is useful enough to keep in production systems too.
>
> It is part of a critical hotpath during blackout, so it should really be
> behind CONFIG_KEXEC_HANDOVER_DEBUG
Do you have the numbers? ;-)
--
Sincerely yours,
Mike.
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-29 21:03 ` Pratyush Yadav
2025-12-30 16:05 ` Pasha Tatashin
@ 2025-12-30 16:14 ` Mike Rapoport
1 sibling, 0 replies; 28+ messages in thread
From: Mike Rapoport @ 2025-12-30 16:14 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Pasha Tatashin, Evangelos Petrongonas, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Mon, Dec 29, 2025 at 10:03:29PM +0100, Pratyush Yadav wrote:
> On Tue, Dec 23 2025, Pasha Tatashin wrote:
>
> >> > if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order > MAX_PAGE_ORDER))
> >> > 	return NULL;
> >>
> >> See my patch that drops this restriction:
> >> https://lore.kernel.org/linux-mm/20251206230222.853493-2-pratyush@kernel.org/
> >>
> >> I think it was wrong to add it in the first place.
> >
> > Agree, the restriction can be removed. Indeed, it is wrong as it is
> > not enforced during preservation.
> >
> > However, I think we are going to be in a world of pain if we allow
> > preserving memory from different topologies within the same order. In
> > kho_preserve_pages(), we have to check if the first and last page are
> > from the same nid; if not, reduce the order by 1 and repeat until they
> > are. It is just wrong to intermix different memory into the same
> > order, so in addition to removing that restriction, I think we should
> > implement this enforcement.
>
> Sure, makes sense.
> >
> > Also, perhaps we should pass the NID in the Jason's radix tree
> > together with the order. We could have a single tree that encodes both
> > order and NID information in the top level, or we can have one tree
> > per NID. It does not really matter to me, but that should help us with
> > faster struct page initialization.
To set up page links we need nid and zone. AFAIR we have 7 or 8 upper bits
free in the radix tree, so to support the general case of up to 3 bits for
the zone and up to 10 bits for the node, we'll need to implement two
versions of zone and node detection for a page.
I'd wait with this optimization for a while.
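To spell the bit budget out (a compile-time sketch; the constants are the
ones from the paragraph above, not from the actual tree format):

#define KHO_TREE_FREE_BITS	8	/* "7 or 8 upper bits free" */
#define ZONE_BITS		3	/* up to 8 zones */
#define NODE_BITS		10	/* up to 1024 nodes */

_Static_assert(ZONE_BITS + NODE_BITS > KHO_TREE_FREE_BITS,
	       "zone+node does not fit in the free bits, need a fallback");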
> Can we use NIDs in ABI? Do they stay stable across reboots? I never
> looked at how NIDs actually get assigned.
Node ids are assigned by the firmware, so unless the firmware changes or
memory is hotplugged/hotremoved, they are stable.
And we can't really do hotplug/hotremove with KHO/LUO anyway :)
> Not sure if we should target it for the initial merge of the radix tree,
> but I think this is something we can try to figure out later down the
> line.
>
> >
> >> >> To get the nid, you would need to call early_pfn_to_nid(). This takes a
> >> >> spinlock and searches through all memblock memory regions. I don't think
> >> >> it is too expensive, but it isn't free either. And all this would be
> >> >> done serially. With the zone search, you at least have some room for
> >> >> concurrency.
> >> >>
> >> >> I think either approach only makes a difference when we have a large
> >> >> number of low-order preservations. If we have a handful of high-order
> >> >> preservations, I suppose the overhead of nid search would be negligible.
> >> >
> >> > We should be targeting a situation where the vast majority of the
> >> > preserved memory is HugeTLB, but I am still worried about lower order
> >> > preservation efficiency for IOMMU page tables, etc.
> >>
> >> Yep. Plus we might get VMMs stashing some of their state in a memfd too.
> >
> > Yes, that is true, but hopefully those are tiny compared to everything else.
> >
> >> >> Long term, I think we should hook this into page_alloc_init_late() so
> >> that all the KHO pages also get initialized along with all the other
> >> >> pages. This will result in better integration of KHO with rest of MM
> >> >> init, and also have more consistent page restore performance.
> >> >
> >> > But we keep KHO as reserved memory, and hooking it up into
> >> > page_alloc_init_late() would make it very different, since that memory
> >> > is part of the buddy allocator memory...
> >>
> >> The idea I have is to have a separate call in page_alloc_init_late()
> >> that initializes KHO pages. It would traverse the radix tree (probably in
> >> parallel by distributing the address space across multiple threads?) and
> >> initialize all the pages. Then kho_restore_page() would only have to
> >> double-check the magic and it can directly return the page.
page_alloc_init_late() is probably too late and some subsystems might need
to call kho_restore_*() before it.
> > I kind of do not like relying on magic to decide whether to initialize
> > the struct page. I would prefer to avoid this magic marker altogether:
> > i.e. struct page is either initialized or not, not halfway
> > initialized, etc.
>
> The magic is purely sanity checking. It is not used to decide anything
> other than to make sure this is actually a KHO page. I don't intend to
> change that. My point is, if we make sure the KHO pages are properly
> initialized during MM init, then restoring can actually be a very cheap
> operation, where you only do the sanity checking. You can even put the
> magic check behind CONFIG_KEXEC_HANDOVER_DEBUG if you want, but I think
> it is useful enough to keep in production systems too.
>
> >
> > Magic is not reliable. During machine reset in many firmware
> > implementations, and in every kexec reboot, memory is not zeroed. The
> > kernel usually allocates vmemmap using exactly the same pages, so
> > there is just too high a chance of getting magic values accidentally
> > inherited from the previous boot.
>
> I don't think that can happen. All the pages are zeroed when
> initialized, which will clear the magic. We should only be setting the
> magic on an initialized struct page.
Currently we set the magic on an initialized struct page because we don't
support deferred struct page initialization. If we want to enable it, lots
of struct pages are uninitialized by the time kho_mem_deserialize() runs.
To ensure there are no concerns with the stale data in the memory map we
either need to initialize struct pages in kho_mem_deserialize() before
setting page->private or let memmap_init_reserved_pages() initialize them
(e.g. by splitting memblock_reserve() out of kho_mem_deserialize() and
calling it before memmap_init_reserved_pages()).
It seems that hugetlb support anyway requires moving memblock_reserve()
earlier, so maybe we can do it as part of the deferred initialization work.
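A rough sketch of that split (hypothetical names; phase 1 would run before
memmap_init_reserved_pages(), while the rest of today's deserialization,
setting page->private etc., stays where it is):

static int __init kho_reserve_one(phys_addr_t phys, phys_addr_t size)
{
	return memblock_reserve(phys, size);
}

/* called early, before memmap_init_reserved_pages() */
void __init kho_mem_reserve(void)
{
	kho_for_each_preserved_range(kho_reserve_one);	/* assumed walker */
}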
> --
> Regards,
> Pratyush Yadav
--
Sincerely yours,
Mike.
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-29 21:03 ` Pratyush Yadav
@ 2025-12-30 16:05 ` Pasha Tatashin
2025-12-30 16:16 ` Mike Rapoport
2025-12-30 16:14 ` Mike Rapoport
1 sibling, 1 reply; 28+ messages in thread
From: Pasha Tatashin @ 2025-12-30 16:05 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Mike Rapoport, Evangelos Petrongonas, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Mon, Dec 29, 2025 at 4:03 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Tue, Dec 23 2025, Pasha Tatashin wrote:
>
> >> > if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order > MAX_PAGE_ORDER))
> >> > 	return NULL;
> >>
> >> See my patch that drops this restriction:
> >> https://lore.kernel.org/linux-mm/20251206230222.853493-2-pratyush@kernel.org/
> >>
> >> I think it was wrong to add it in the first place.
> >
> > Agree, the restriction can be removed. Indeed, it is wrong as it is
> > not enforced during preservation.
> >
> > However, I think we are going to be in a world of pain if we allow
> > preserving memory from different topologies within the same order. In
> > kho_preserve_pages(), we have to check if the first and last page are
> > from the same nid; if not, reduce the order by 1 and repeat until they
> > are. It is just wrong to intermix different memory into the same
> > order, so in addition to removing that restriction, I think we should
> > implement this enforcement.
>
> Sure, makes sense.
>
> >
> > Also, perhaps we should pass the NID in Jason's radix tree
> > together with the order. We could have a single tree that encodes both
> > order and NID information in the top level, or we can have one tree
> > per NID. It does not really matter to me, but that should help us with
> > faster struct page initialization.
>
> Can we use NIDs in ABI? Do they stay stable across reboots? I never
> looked at how NIDs actually get assigned.
>
> Not sure if we should target it for the initial merge of the radix tree,
> but I think this is something we can try to figure out later down the
> line.
>
> >
> >> >> To get the nid, you would need to call early_pfn_to_nid(). This takes a
> >> >> spinlock and searches through all memblock memory regions. I don't think
> >> >> it is too expensive, but it isn't free either. And all this would be
> >> >> done serially. With the zone search, you at least have some room for
> >> >> concurrency.
> >> >>
> >> >> I think either approach only makes a difference when we have a large
> >> >> number of low-order preservations. If we have a handful of high-order
> >> >> preservations, I suppose the overhead of nid search would be negligible.
> >> >
> >> > We should be targeting a situation where the vast majority of the
> >> > preserved memory is HugeTLB, but I am still worried about lower order
> >> > preservation efficiency for IOMMU page tables, etc.
> >>
> >> Yep. Plus we might get VMMs stashing some of their state in a memfd too.
> >
> > Yes, that is true, but hopefully those are tiny compared to everything else.
> >
> >> >> Long term, I think we should hook this into page_alloc_init_late() so
> >> that all the KHO pages also get initialized along with all the other
> >> >> pages. This will result in better integration of KHO with rest of MM
> >> >> init, and also have more consistent page restore performance.
> >> >
> >> > But we keep KHO as reserved memory, and hooking it up into
> >> > page_alloc_init_late() would make it very different, since that memory
> >> > is part of the buddy allocator memory...
> >>
> >> The idea I have is to have a separate call in page_alloc_init_late()
> >> that initializes KHO pages. It would traverse the radix tree (probably in
> >> parallel by distributing the address space across multiple threads?) and
> >> initialize all the pages. Then kho_restore_page() would only have to
> >> double-check the magic and it can directly return the page.
> >
> > I kind of do not like relying on magic to decide whether to initialize
> > the struct page. I would prefer to avoid this magic marker altogether:
> > i.e. struct page is either initialized or not, not halfway
> > initialized, etc.
>
> The magic is purely sanity checking. It is not used to decide anything
> other than to make sure this is actually a KHO page. I don't intend to
> change that. My point is, if we make sure the KHO pages are properly
> initialized during MM init, then restoring can actually be a very cheap
> operation, where you only do the sanity checking. You can even put the
> magic check behind CONFIG_KEXEC_HANDOVER_DEBUG if you want, but I think
> it is useful enough to keep in production systems too.
It is part of a critical hotpath during blackout, so it should really be
behind CONFIG_KEXEC_HANDOVER_DEBUG
> > Magic is not reliable. During machine reset in many firmware
> > implementations, and in every kexec reboot, memory is not zeroed. The
> > kernel usually allocates vmemmap using exactly the same pages, so
> > there is just too high a chance of getting magic values accidentally
> > inherited from the previous boot.
>
> I don't think that can happen. All the pages are zeroed when
> initialized, which will clear the magic. We should only be setting the
> magic on an initialized struct page.
This can happen due to bugs when we use a partially initialized
"struct page", something that Mike has been looking into doing: passing
some information in a struct page before it is fully initialized.
> >> Radix tree makes parallelism easier than the linked lists we have now.
> >
> > Agree, radix tree can absolutely help with parallelism.
> >
> >> >> Jason's radix tree patches will make that a bit easier to do I think.
> >> >> The zone search will scale better I reckon.
> >> >
> >> > It could; perhaps early in boot we should reserve the radix tree, and
> >> > use it as a source of truth for look-ups later in boot?
> >>
> >> Yep. I think the radix tree should mark its own pages as preserved too
> >> so they stick around later in boot.
> >
> > Unfortunately, this can only be done in the new kernel, not in the old
> > kernel; otherwise we can end up with a recursive dependency that may
> > never be satisfied.
>
> Right. It shouldn't be too hard to do in the new kernel though. We will
> walk the whole tree anyway.
>
> --
> Regards,
> Pratyush Yadav
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-23 17:37 ` Pasha Tatashin
@ 2025-12-29 21:03 ` Pratyush Yadav
2025-12-30 16:05 ` Pasha Tatashin
2025-12-30 16:14 ` Mike Rapoport
0 siblings, 2 replies; 28+ messages in thread
From: Pratyush Yadav @ 2025-12-29 21:03 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Pratyush Yadav, Mike Rapoport, Evangelos Petrongonas,
Alexander Graf, Andrew Morton, Jason Miu, linux-kernel, kexec,
linux-mm, nh-open-source
On Tue, Dec 23 2025, Pasha Tatashin wrote:
>> > if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order > MAX_PAGE_ORDER))
>> > 	return NULL;
>>
>> See my patch that drops this restriction:
>> https://lore.kernel.org/linux-mm/20251206230222.853493-2-pratyush@kernel.org/
>>
>> I think it was wrong to add it in the first place.
>
> Agree, the restriction can be removed. Indeed, it is wrong as it is
> not enforced during preservation.
>
> However, I think we are going to be in a world of pain if we allow
> preserving memory from different topologies within the same order. In
> kho_preserve_pages(), we have to check if the first and last page are
> from the same nid; if not, reduce the order by 1 and repeat until they
> are. It is just wrong to intermix different memory into the same
> order, so in addition to removing that restriction, I think we should
> implement this enforcement.
Sure, makes sense.
>
> Also, perhaps we should pass the NID in Jason's radix tree
> together with the order. We could have a single tree that encodes both
> order and NID information in the top level, or we can have one tree
> per NID. It does not really matter to me, but that should help us with
> faster struct page initialization.
Can we use NIDs in ABI? Do they stay stable across reboots? I never
looked at how NIDs actually get assigned.
Not sure if we should target it for the initial merge of the radix tree,
but I think this is something we can try to figure out later down the
line.
>
>> >> To get the nid, you would need to call early_pfn_to_nid(). This takes a
>> >> spinlock and searches through all memblock memory regions. I don't think
>> >> it is too expensive, but it isn't free either. And all this would be
>> >> done serially. With the zone search, you at least have some room for
>> >> concurrency.
>> >>
>> >> I think either approach only makes a difference when we have a large
>> >> number of low-order preservations. If we have a handful of high-order
>> >> preservations, I suppose the overhead of nid search would be negligible.
>> >
>> > We should be targeting a situation where the vast majority of the
>> > preserved memory is HugeTLB, but I am still worried about lower order
>> > preservation efficiency for IOMMU page tables, etc.
>>
>> Yep. Plus we might get VMMs stashing some of their state in a memfd too.
>
> Yes, that is true, but hopefully those are tiny compared to everything else.
>
>> >> Long term, I think we should hook this into page_alloc_init_late() so
>> >> that all the KHO pages also get initialized along with all the other
>> >> pages. This will result in better integration of KHO with rest of MM
>> >> init, and also have more consistent page restore performance.
>> >
>> > But we keep KHO as reserved memory, and hooking it up into
>> > page_alloc_init_late() would make it very different, since that memory
>> > is part of the buddy allocator memory...
>>
>> The idea I have is to have a separate call in page_alloc_init_late()
>> that initializes KHO pages. It would traverse the radix tree (probably in
>> parallel by distributing the address space across multiple threads?) and
>> initialize all the pages. Then kho_restore_page() would only have to
>> double-check the magic and it can directly return the page.
>
> I kind of do not like relying on magic to decide whether to initialize
> the struct page. I would prefer to avoid this magic marker altogether:
> i.e. struct page is either initialized or not, not halfway
> initialized, etc.
The magic is purely sanity checking. It is not used to decide anything
other than to make sure this is actually a KHO page. I don't intend to
change that. My point is, if we make sure the KHO pages are properly
initialized during MM init, then restoring can actually be a very cheap
operation, where you only do the sanity checking. You can even put the
magic check behind CONFIG_KEXEC_HANDOVER_DEBUG if you want, but I think
it is useful enough to keep in production systems too.
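For illustration, the gating could look something like this (a sketch, not
from any posted patch; kho_page_sane() is a hypothetical helper, while
union kho_page_info and KHO_PAGE_MAGIC are the existing ones in
kexec_handover.c):

static inline bool kho_page_sane(union kho_page_info info)
{
	/* skip the check entirely on non-debug builds */
	if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER_DEBUG))
		return true;

	return !WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC);
}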
>
> Magic is not reliable. During machine reset in many firmware
> implementations, and in every kexec reboot, memory is not zeroed. The
> kernel usually allocates vmemmap using exactly the same pages, so
> there is just too high a chance of getting magic values accidentally
> inherited from the previous boot.
I don't think that can happen. All the pages are zeroed when
initialized, which will clear the magic. We should only be setting the
magic on an initialized struct page.
>
>> Radix tree makes parallelism easier than the linked lists we have now.
>
> Agree, radix tree can absolutely help with parallelism.
>
>> >> Jason's radix tree patches will make that a bit easier to do I think.
>> >> The zone search will scale better I reckon.
>> >
>> > It could; perhaps early in boot we should reserve the radix tree, and
>> > use it as a source of truth for look-ups later in boot?
>>
>> Yep. I think the radix tree should mark its own pages as preserved too
>> so they stick around later in boot.
>
> Unfortunately, this can only be done in the new kernel, not in the old
> kernel; otherwise we can end up with a recursive dependency that may
> never be satisfied.
Right. It shouldn't be too hard to do in the new kernel though. We will
walk the whole tree anyway.
--
Regards,
Pratyush Yadav
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-22 16:24 ` Pratyush Yadav
@ 2025-12-23 17:37 ` Pasha Tatashin
2025-12-29 21:03 ` Pratyush Yadav
0 siblings, 1 reply; 28+ messages in thread
From: Pasha Tatashin @ 2025-12-23 17:37 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Mike Rapoport, Evangelos Petrongonas, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
> > if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order > MAX_PAGE_ORDER))
> > 	return NULL;
>
> See my patch that drops this restriction:
> https://lore.kernel.org/linux-mm/20251206230222.853493-2-pratyush@kernel.org/
>
> I think it was wrong to add it in the first place.
Agree, the restriction can be removed. Indeed, it is wrong as it is
not enforced during preservation.
However, I think we are going to be in a world of pain if we allow
preserving memory from different topologies within the same order. In
kho_preserve_pages(), we have to check if the first and last page are
from the same nid; if not, reduce the order by 1 and repeat until they
are. It is just wrong to intermix different memory into the same
order, so in addition to removing that restriction, I think we should
implement this enforcement.
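A minimal sketch of that enforcement (hypothetical helper, untested):
shrink the order until the first and last page of the chunk sit on the
same node, and preserve the remainder as a separate, smaller chunk:

static unsigned int kho_clamp_order_to_nid(unsigned long pfn,
					   unsigned int order)
{
	while (order &&
	       page_to_nid(pfn_to_page(pfn)) !=
	       page_to_nid(pfn_to_page(pfn + (1UL << order) - 1)))
		order--;

	return order;
}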
Also, perhaps we should pass the NID in Jason's radix tree
together with the order. We could have a single tree that encodes both
order and NID information in the top level, or we can have one tree
per NID. It does not really matter to me, but that should help us with
faster struct page initialization.
> >> To get the nid, you would need to call early_pfn_to_nid(). This takes a
> >> spinlock and searches through all memblock memory regions. I don't think
> >> it is too expensive, but it isn't free either. And all this would be
> >> done serially. With the zone search, you at least have some room for
> >> concurrency.
> >>
> >> I think either approach only makes a difference when we have a large
> >> number of low-order preservations. If we have a handful of high-order
> >> preservations, I suppose the overhead of nid search would be negligible.
> >
> > We should be targeting a situation where the vast majority of the
> > preserved memory is HugeTLB, but I am still worried about lower order
> > preservation efficiency for IOMMU page tables, etc.
>
> Yep. Plus we might get VMMs stashing some of their state in a memfd too.
Yes, that is true, but hopefully those are tiny compared to everything else.
> >> Long term, I think we should hook this into page_alloc_init_late() so
> >> that all the KHO pages also get initialized along with all the other
> >> pages. This will result in better integration of KHO with rest of MM
> >> init, and also have more consistent page restore performance.
> >
> > But we keep KHO as reserved memory, and hooking it up into
> > page_alloc_init_late() would make it very different, since that memory
> > is part of the buddy allocator memory...
>
> The idea I have is to have a separate call in page_alloc_init_late()
> that initializes KHO pages. It would traverse the radix tree (probably in
> parallel by distributing the address space across multiple threads?) and
> initialize all the pages. Then kho_restore_page() would only have to
> double-check the magic and it can directly return the page.
I kind of do not like relying on magic to decide whether to initialize
the struct page. I would prefer to avoid this magic marker altogether:
i.e. struct page is either initialized or not, not halfway
initialized, etc.
Magic is not reliable. During machine reset in many firmware
implementations, and in every kexec reboot, memory is not zeroed. The
kernel usually allocates vmemmap using exactly the same pages, so
there is just too high a chance of getting magic values accidentally
inherited from the previous boot.
> Radix tree makes parallelism easier than the linked lists we have now.
Agree, radix tree can absolutely help with parallelism.
> >> Jason's radix tree patches will make that a bit easier to do I think.
> >> The zone search will scale better I reckon.
> >
> > It could; perhaps early in boot we should reserve the radix tree, and
> > use it as a source of truth for look-ups later in boot?
>
> Yep. I think the radix tree should mark its own pages as preserved too
> so they stick around later in boot.
Unfortunately, this can only be done in the new kernel, not in the old
kernel; otherwise we can end up with a recursive dependency that may
never be satisfied.
Pasha
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-22 15:55 ` Pasha Tatashin
@ 2025-12-22 16:24 ` Pratyush Yadav
2025-12-23 17:37 ` Pasha Tatashin
0 siblings, 1 reply; 28+ messages in thread
From: Pratyush Yadav @ 2025-12-22 16:24 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Pratyush Yadav, Mike Rapoport, Evangelos Petrongonas,
Alexander Graf, Andrew Morton, Jason Miu, linux-kernel, kexec,
linux-mm, nh-open-source
On Mon, Dec 22 2025, Pasha Tatashin wrote:
>> > NUMA node boundaries are SECTION_SIZE aligned. Since SECTION_SIZE is
>> > larger than a MAX_PAGE_ORDER allocation, it is mathematically impossible
>> > for a single chunk to span multiple nodes.
>>
>> For folios, yes. The whole folio should only be in a single node. But we
>> also have kho_preserve_pages() (formerly kho_preserve_phys()) which can
>> be used to preserve an arbitrary size of memory and _that_ doesn't have
>> to be in the same section. And if the memory is properly aligned, then
>> it will end up being just one higher-order preservation in KHO.
>
> To restore both pages and folios we use kho_restore_page(), which has the
> following:
>
> /*
>  * deserialize_bitmap() only sets the magic on the head page. This magic
>  * check also implicitly makes sure phys is order-aligned since for
>  * non-order-aligned phys addresses, magic will never be set.
>  */
> if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order > MAX_PAGE_ORDER))
> 	return NULL;
See my patch that drops this restriction:
https://lore.kernel.org/linux-mm/20251206230222.853493-2-pratyush@kernel.org/
I think it was wrong to add it in the first place.
>
> My understanding is that the head page can never be more than MAX_PAGE_ORDER,
> hence why I am saying it will be less than SECTION_SIZE. With HugeTLB
> the order can be more than MAX_PAGE_ORDER, but in that case it still
> has to be within a single NID, since a huge page cannot be split
> across multiple nodes.
For a "proper" page/folio, that either comes from the page allocator or
from HugeTLB, you are right. But see again how kho_preserve_pages()
works:
while (pfn < end_pfn) {
	const unsigned int order =
		min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
	err = __kho_preserve_order(track, pfn, order);
[...]
It combines contiguous order-aligned pages into one KHO preservation.
So say I have two nodes, each 64G. If I call kho_preserve_pages() for
62G to 66G, I will get _one_ 4G preservation at 62G. kho_restore_page()
will split it into 0-order pages on restore.
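To make the cross-node case concrete, here is a throwaway userspace demo
of the same order computation, with slightly different illustrative
numbers (preserving 60G..64G in one call, with a node boundary at 62G,
which is section-aligned but not 4G-aligned):

#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
	unsigned long pfn = (60UL << 30) >> PAGE_SHIFT;
	unsigned long end_pfn = (64UL << 30) >> PAGE_SHIFT;

	while (pfn < end_pfn) {
		unsigned int tz = __builtin_ctzl(pfn);
		unsigned int fit = 63 - __builtin_clzl(end_pfn - pfn);
		unsigned int order = tz < fit ? tz : fit;

		/* prints a single order-20 (4G) chunk at 60G */
		printf("pfn %#lx order %u\n", pfn, order);
		pfn += 1UL << order;
	}
	return 0;
}

The one 4G preservation straddles the hypothetical 62G node boundary.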
>
>> >> > This approach seems to give us the best of both worlds: It avoids the
>> >> > memblock dependency during restoration. It keeps the serial work in
>> >> > deserialize_bitmap() to a minimum (O(1)O(1) per region). It allows the
>> >> > heavy lifting of tail page initialization to be done later in the boot
>> >> > process, potentially in parallel, as you suggested.
>> >>
>> >> Here's another idea I have been thinking about, but never dug deep
>> >> enough to figure out if it actually works.
>> >>
>> >> __init_page_from_nid() loops through all the zones for the node to find
>> >> the zone id for the page. We can flip it the other way round and loop
>> >> through all zones (on all nodes) to find out if the PFN spans that zone.
>> >> Once we find the zone, we can directly call __init_single_page() on it.
>> >> If a contiguous chunk of preserved memory lands in one zone, we can
>> >> batch the init to save some time.
>> >>
>> >> Something like the below (completely untested):
>> >>
>> >>
>> >> static void kho_init_page(struct page *page)
>> >> {
>> >> 	unsigned long pfn = page_to_pfn(page);
>> >> 	struct zone *zone;
>> >>
>> >> 	for_each_zone(zone) {
>> >> 		if (zone_spans_pfn(zone, pfn))
>> >> 			break;
>> >> 	}
>> >>
>> >> 	__init_single_page(page, pfn, zone_idx(zone), zone_to_nid(zone));
>> >> }
>> >>
>> >> It doesn't do the batching I mentioned, but I think it at least gets the
>> >> point across. And I think even this simple version would be a good first
>> >> step.
>> >>
>> >> This lets us initialize the page from kho_restore_folio() without having
>> >> to rely on memblock being alive, and saves us from doing work during
>> >> early boot. We should only have a handful of zones and nodes in
>> >> practice, so I think it should perform fairly well too.
>> >>
>> >> We would of course need to see how it performs in practice. If it works,
>> >> I think it would be cleaner and simpler than splitting the
>> >> initialization into two separate parts.
>> >
>> > I think your idea is clever and would work. However, consider the
>> > cache efficiency: in deserialize_bitmap(), we must write to the head
>> > struct page anyway to preserve the order. Since we are already
>> > bringing that 64-byte cacheline in and dirtying it, and since memblock
>> > is available and fast at this stage, it makes sense to fully
>> > initialize the head page right then.
>>
>> You will also bring in the cache line and dirty it during
>> kho_restore_folio() since you need to write the page refcounts. So I
>> don't think the cache efficiency makes any difference between either
>> approach.
>>
>> > If we do that, we get the nid for "free" (cache-wise) and we avoid the
>> > overhead of iterating zones during the restore phase. We can then
>> > simply inherit the nid from the head page when initializing the tail
>> > pages later.
>>
>> To get the nid, you would need to call early_pfn_to_nid(). This takes a
>> spinlock and searches through all memblock memory regions. I don't think
>> it is too expensive, but it isn't free either. And all this would be
>> done serially. With the zone search, you at least have some room for
>> concurrency.
>>
>> I think either approach only makes a difference when we have a large
>> number of low-order preservations. If we have a handful of high-order
>> preservations, I suppose the overhead of nid search would be negligible.
>
> We should be targeting a situation where the vast majority of the
> preserved memory is HugeTLB, but I am still worried about lower order
> preservation efficiency for IOMMU page tables, etc.
Yep. Plus we might get VMMs stashing some of their state in a memfd too.
>
>> Long term, I think we should hook this into page_alloc_init_late() so
>> that all the KHO pages also get initialized along with all the other
>> pages. This will result in better integration of KHO with rest of MM
>> init, and also have more consistent page restore performance.
>
> But we keep KHO as reserved memory, and hooking it up into
> page_alloc_init_late() would make it very different, since that memory
> is part of the buddy allocator memory...
The idea I have is to have a separate call in page_alloc_init_late()
that initializes KHO pages. It would traverse the radix tree (probably in
parallel by distributing the address space across multiple threads?) and
initialize all the pages. Then kho_restore_page() would only have to
double-check the magic and it can directly return the page.
Radix tree makes parallelism easier than the linked lists we have now.
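Very roughly, the shape I have in mind (everything below is a sketch: the
iterator over the radix tree is assumed, the walk is shown single-threaded,
and kho_init_page() is the zone-search sketch from earlier in the thread):

static void __init kho_init_preserved_pages(void)
{
	phys_addr_t phys;
	unsigned int order;

	/* assumed iterator over the preserved ranges in the radix tree */
	kho_radix_for_each_preserved(phys, order) {
		unsigned long base_pfn = PHYS_PFN(phys);

		for (unsigned long i = 0; i < (1UL << order); i++)
			kho_init_page(pfn_to_page(base_pfn + i));
	}
}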
>
>> Jason's radix tree patches will make that a bit easier to do I think.
>> The zone search will scale better I reckon.
>
> It could; perhaps early in boot we should reserve the radix tree, and
> use it as a source of truth for look-ups later in boot?
Yep. I think the radix tree should mark its own pages as preserved too
so they stick around later in boot.
--
Regards,
Pratyush Yadav
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-22 15:33 ` Pratyush Yadav
@ 2025-12-22 15:55 ` Pasha Tatashin
2025-12-22 16:24 ` Pratyush Yadav
0 siblings, 1 reply; 28+ messages in thread
From: Pasha Tatashin @ 2025-12-22 15:55 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Mike Rapoport, Evangelos Petrongonas, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
> > NUMA node boundaries are SECTION_SIZE aligned. Since SECTION_SIZE is
> > larger than a MAX_PAGE_ORDER allocation, it is mathematically impossible
> > for a single chunk to span multiple nodes.
>
> For folios, yes. The whole folio should only be in a single node. But we
> also have kho_preserve_pages() (formerly kho_preserve_phys()) which can
> be used to preserve an arbitrary size of memory and _that_ doesn't have
> to be in the same section. And if the memory is properly aligned, then
> it will end up being just one higher-order preservation in KHO.
To restore both pages and folios we use kho_restore_page(), which has the
following:
/*
 * deserialize_bitmap() only sets the magic on the head page. This magic
 * check also implicitly makes sure phys is order-aligned since for
 * non-order-aligned phys addresses, magic will never be set.
 */
if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order > MAX_PAGE_ORDER))
	return NULL;
My understanding is that the head page can never be more than MAX_PAGE_ORDER,
hence why I am saying it will be less than SECTION_SIZE. With HugeTLB
the order can be more than MAX_PAGE_ORDER, but in that case it still
has to be within a single NID, since a huge page cannot be split
across multiple nodes.
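For illustration, the default x86_64 numbers behind that argument (both
constants are arch- and config-dependent):

#define PAGE_SHIFT		12
#define MAX_PAGE_ORDER		10	/* max buddy block: 2^(12+10) = 4 MiB */
#define SECTION_SIZE_BITS	27	/* section: 2^27 = 128 MiB */

/*
 * A MAX_PAGE_ORDER block (4 MiB) is far smaller than a section (128 MiB),
 * and node boundaries are section-aligned, so an order-aligned buddy-sized
 * chunk can never straddle a node boundary.
 */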
> >> > This approach seems to give us the best of both worlds: It avoids the
> >> > memblock dependency during restoration. It keeps the serial work in
> >> > deserialize_bitmap() to a minimum (O(1) per region). It allows the
> >> > heavy lifting of tail page initialization to be done later in the boot
> >> > process, potentially in parallel, as you suggested.
> >>
> >> Here's another idea I have been thinking about, but never dug deep
> >> enough to figure out if it actually works.
> >>
> >> __init_page_from_nid() loops through all the zones for the node to find
> >> the zone id for the page. We can flip it the other way round and loop
> >> through all zones (on all nodes) to find out if the PFN spans that zone.
> >> Once we find the zone, we can directly call __init_single_page() on it.
> >> If a contiguous chunk of preserved memory lands in one zone, we can
> >> batch the init to save some time.
> >>
> >> Something like the below (completely untested):
> >>
> >>
> >> static void kho_init_page(struct page *page)
> >> {
> >> 	unsigned long pfn = page_to_pfn(page);
> >> 	struct zone *zone;
> >>
> >> 	for_each_zone(zone) {
> >> 		if (zone_spans_pfn(zone, pfn))
> >> 			break;
> >> 	}
> >>
> >> 	__init_single_page(page, pfn, zone_idx(zone), zone_to_nid(zone));
> >> }
> >>
> >> It doesn't do the batching I mentioned, but I think it at least gets the
> >> point across. And I think even this simple version would be a good first
> >> step.
> >>
> >> This lets us initialize the page from kho_restore_folio() without having
> >> to rely on memblock being alive, and saves us from doing work during
> >> early boot. We should only have a handful of zones and nodes in
> >> practice, so I think it should perform fairly well too.
> >>
> >> We would of course need to see how it performs in practice. If it works,
> >> I think it would be cleaner and simpler than splitting the
> >> initialization into two separate parts.
> >
> > I think your idea is clever and would work. However, consider the
> > cache efficiency: in deserialize_bitmap(), we must write to the head
> > struct page anyway to preserve the order. Since we are already
> > bringing that 64-byte cacheline in and dirtying it, and since memblock
> > is available and fast at this stage, it makes sense to fully
> > initialize the head page right then.
>
> You will also bring in the cache line and dirty it during
> kho_restore_folio() since you need to write the page refcounts. So I
> don't think the cache efficiency makes any difference between either
> approach.
>
> > If we do that, we get the nid for "free" (cache-wise) and we avoid the
> > overhead of iterating zones during the restore phase. We can then
> > simply inherit the nid from the head page when initializing the tail
> > pages later.
>
> To get the nid, you would need to call early_pfn_to_nid(). This takes a
> spinlock and searches through all memblock memory regions. I don't think
> it is too expensive, but it isn't free either. And all this would be
> done serially. With the zone search, you at least have some room for
> concurrency.
>
> I think either approach only makes a difference when we have a large
> number of low-order preservations. If we have a handful of high-order
> preservations, I suppose the overhead of nid search would be negligible.
We should be targeting a situation where the vast majority of the
preserved memory is HugeTLB, but I am still worried about lower order
preservation efficiency for IOMMU page tables, etc.
> Long term, I think we should hook this into page_alloc_init_late() so
> that all the KHO pages also get initialized along with all the other
> pages. This will result in better integration of KHO with the rest of MM
> init, and also have more consistent page restore performance.
But we keep KHO memory as reserved memory, and hooking it into
page_alloc_init_late() would make it very different, since the memory that
path initializes becomes part of the buddy allocator...
> Jason's radix tree patches will make that a bit easier to do I think.
> The zone search will scale better I reckon.
It could. Perhaps early in boot we should reserve the radix tree, and
use it as a source of truth for look-ups later in boot?
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-20 14:49 ` Pasha Tatashin
@ 2025-12-22 15:33 ` Pratyush Yadav
2025-12-22 15:55 ` Pasha Tatashin
0 siblings, 1 reply; 28+ messages in thread
From: Pratyush Yadav @ 2025-12-22 15:33 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Pratyush Yadav, Mike Rapoport, Evangelos Petrongonas,
Alexander Graf, Andrew Morton, Jason Miu, linux-kernel, kexec,
linux-mm, nh-open-source
On Sat, Dec 20 2025, Pasha Tatashin wrote:
> On Fri, Dec 19, 2025 at 10:20 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Fri, Dec 19 2025, Pasha Tatashin wrote:
[...]
>> > Let's do the lazy tail initialization that I proposed to you in a
>> > chat: we initialize only the head struct page during
>> > deserialize_bitmap(). Since this happens while memblock is still
>> > active, we can safely use early_pfn_to_nid() to set the nid in the
>> > head page's flags, and also preserve order as we do today.
>> >
>> > Then, we can defer the initialization of all tail pages to
>> > kho_restore_folio(). At that stage, we no longer need memblock or
>> > early_pfn_to_nid(); we can simply inherit the nid from the head page
>> > using page_to_nid(head).
>>
>> Does that assumption always hold? Does every contiguous chunk of memory
>> always have to be in the same node? For folios it would hold, but what
>> about kho_preserve_pages()?
>
> NUMA node boundaries are SECTION_SIZE aligned. Since SECTION_SIZE is
> larger than MAX_PAGE_ORDER it is mathematically impossible for a
> single chunk to span multiple nodes.
For folios, yes. The whole folio should only be in a single node. But we
also have kho_preserve_pages() (formerly kho_preserve_phys()) which can
be used to preserve an arbitrary size of memory and _that_ doesn't have
to be in the same section. And if the memory is properly aligned, then
it will end up being just one higher-order preservation in KHO.
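To illustrate the distinction with (hypothetical) x86-64 numbers, where
sections are 128 MiB and a MAX_PAGE_ORDER chunk is 4 MiB:
```
folio                 -> order <= MAX_PAGE_ORDER (4 MiB); fits in one
                         section, hence one node
kho_preserve_pages()  -> e.g. a properly aligned 256 MiB range; a single
                         order-16 preservation that can cross a 128 MiB
                         section boundary, and with it a node boundary
```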
>
>> > This approach seems to give us the best of both worlds: It avoids the
>> > memblock dependency during restoration. It keeps the serial work in
> >> > deserialize_bitmap() to a minimum (O(1) per region). It allows the
>> > heavy lifting of tail page initialization to be done later in the boot
>> > process, potentially in parallel, as you suggested.
>>
>> Here's another idea I have been thinking about, but never dug deep
>> enough to figure out if it actually works.
>>
>> __init_page_from_nid() loops through all the zones for the node to find
>> the zone id for the page. We can flip it the other way round and loop
>> through all zones (on all nodes) to find out if the PFN spans that zone.
>> Once we find the zone, we can directly call __init_single_page() on it.
>> If a contiguous chunk of preserved memory lands in one zone, we can
>> batch the init to save some time.
>>
>> Something like the below (completely untested):
>>
>>
>> static void kho_init_page(struct page *page)
>> {
>> 	unsigned long pfn = page_to_pfn(page);
>> 	struct zone *zone;
>>
>> 	for_each_zone(zone) {
>> 		if (zone_spans_pfn(zone, pfn))
>> 			break;
>> 	}
>>
>> 	__init_single_page(page, pfn, zone_idx(zone), zone_to_nid(zone));
>> }
>>
>> It doesn't do the batching I mentioned, but I think it at least gets the
>> point across. And I think even this simple version would be a good first
>> step.
>>
>> This lets us initialize the page from kho_restore_folio() without having
>> to rely on memblock being alive, and saves us from doing work during
>> early boot. We should only have a handful of zones and nodes in
>> practice, so I think it should perform fairly well too.
>>
>> We would of course need to see how it performs in practice. If it works,
>> I think it would be cleaner and simpler than splitting the
>> initialization into two separate parts.
>
> I think your idea is clever and would work. However, consider the
> cache efficiency: in deserialize_bitmap(), we must write to the head
> struct page anyway to preserve the order. Since we are already
> bringing that 64-byte cacheline in and dirtying it, and since memblock
> is available and fast at this stage, it makes sense to fully
> initialize the head page right then.
You will also bring in the cache line and dirty it during
kho_restore_folio() since you need to write the page refcounts. So I
don't think the cache efficiency makes any difference between either
approach.
> If we do that, we get the nid for "free" (cache-wise) and we avoid the
> overhead of iterating zones during the restore phase. We can then
> simply inherit the nid from the head page when initializing the tail
> pages later.
To get the nid, you would need to call early_pfn_to_nid(). This takes a
spinlock and searches through all memblock memory regions. I don't think
it is too expensive, but it isn't free either. And all this would be
done serially. With the zone search, you at least have some room for
concurrency.
I think either approach only makes a difference when we have a large
number of low-order preservations. If we have a handful of high-order
preservations, I suppose the overhead of nid search would be negligible.
Long term, I think we should hook this into page_alloc_init_late() so
that all the KHO pages also get initialized along with all the other
pages. This will result in better integration of KHO with the rest of MM
init, and also have more consistent page restore performance.
Jason's radix tree patches will make that a bit easier to do I think.
The zone search will scale better I reckon.
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-20 3:20 ` Pratyush Yadav
@ 2025-12-20 14:49 ` Pasha Tatashin
2025-12-22 15:33 ` Pratyush Yadav
0 siblings, 1 reply; 28+ messages in thread
From: Pasha Tatashin @ 2025-12-20 14:49 UTC (permalink / raw)
To: Pratyush Yadav
Cc: Mike Rapoport, Evangelos Petrongonas, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Fri, Dec 19, 2025 at 10:20 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Fri, Dec 19 2025, Pasha Tatashin wrote:
>
> > On Fri, Dec 19, 2025 at 4:19 AM Mike Rapoport <rppt@kernel.org> wrote:
> >>
> >> On Tue, Dec 16, 2025 at 10:36:01AM -0500, Pasha Tatashin wrote:
> >> > On Tue, Dec 16, 2025 at 10:19 AM Mike Rapoport <rppt@kernel.org> wrote:
> >> > >
> >> > > On Tue, Dec 16, 2025 at 10:05:27AM -0500, Pasha Tatashin wrote:
> >> > > > > > +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> >> > > > > > + unsigned int order)
> >> > > > > > +{
> >> > > > > > + unsigned long pfn = PHYS_PFN(phys);
> >> > > > > > + int nid = early_pfn_to_nid(pfn);
> >> > > > > > +
> >> > > > > > + for (int i = 0; i < (1 << order); i++)
> >> > > > > > + init_deferred_page(pfn + i, nid);
> >> > > > >
> >> > > > > This will skip pages below node->first_deferred_pfn, we need to use
> >> > > > > __init_page_from_nid() here.
> >> > > >
> >> > > > Mike, but those struct pages should be initialized early anyway. If
> >> > > > they are not yet initialized we have a problem, as they are going to
> >> > > > be re-initialized later.
> >> > >
> >> > > Can't say I understand your point. Which pages should be initialized early?
> >> >
> >> > All pages below node->first_deferred_pfn.
> >> >
> >> > > And which pages will be reinitialized?
> >> >
> >> > kho_memory_init() is called after free_area_init() (which calls
> >> > memmap_init_range to initialize low memory struct pages). So, if we
> >> > use __init_page_from_nid() as suggested, we would be blindly running
> >> > __init_single_page() again on those low-memory pages that
> >> > memmap_init_range() already set up. This would cause double
> >> > initialization and corruptions due to losing the order information.
> >> >
> >> > > > > > +
> >> > > > > > + return pfn_to_page(pfn);
> >> > > > > > +}
> >> > > > > > +
> >> > > > > > static void __init deserialize_bitmap(unsigned int order,
> >> > > > > > struct khoser_mem_bitmap_ptr *elm)
> >> > > > > > {
> >> > > > > > @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
> >> > > > > > int sz = 1 << (order + PAGE_SHIFT);
> >> > > > > > phys_addr_t phys =
> >> > > > > > elm->phys_start + (bit << (order + PAGE_SHIFT));
> >> > > > > > - struct page *page = phys_to_page(phys);
> >> > > > > > + struct page *page = kho_get_preserved_page(phys, order);
> >> > > > >
> >> > > > > I think it's better to initialize deferred struct pages later in
> >> > > > > kho_restore_page. deserialize_bitmap() runs before SMP and it already does
> >> > > >
> >> > > > The KHO memory should still be accessible early in boot, right?
> >> > >
> >> > > The memory is accessible. And we anyway should not use struct page for
> >> > > preserved memory before kho_restore_{folio,pages}.
> >> >
> >> > This makes sense, what happens if someone calls kho_restore_folio()
> >> > before deferred pages are initialized?
> >>
> >> That's fine, because this memory is still memblock_reserve()ed and deferred
> >> init skips reserved ranges.
> >> There is a problem however with the calls to kho_restore_{pages,folio}()
> >> after memblock is gone because we can't use early_pfn_to_nid() then.
>
> I suppose we can select CONFIG_ARCH_KEEP_MEMBLOCK with
> CONFIG_KEXEC_HANDOVER. But that comes with its own set of problems like
> wasting memory, especially when there are a lot of scattered preserved
> pages.
>
> I don't think this is a very good idea, just throwing it out there as an
> option.
>
> >
> > I agree with the point regarding memblock and early_pfn_to_nid(), but I
> > don't think we need to rely on early_pfn_to_nid() during the restore
> > phase.
> >
> >> I think we can start with Evangelos' approach that initializes struct pages
> >> at deserialize time and then we'll see how to optimize it.
> >
> > Let's do the lazy tail initialization that I proposed to you in a
> > chat: we initialize only the head struct page during
> > deserialize_bitmap(). Since this happens while memblock is still
> > active, we can safely use early_pfn_to_nid() to set the nid in the
> > head page's flags, and also preserve order as we do today.
> >
> > Then, we can defer the initialization of all tail pages to
> > kho_restore_folio(). At that stage, we no longer need memblock or
> > early_pfn_to_nid(); we can simply inherit the nid from the head page
> > using page_to_nid(head).
>
> Does that assumption always hold? Does every contiguous chunk of memory
> always have to be in the same node? For folios it would hold, but what
> about kho_preserve_pages()?
NUMA node boundaries are SECTION_SIZE aligned. Since SECTION_SIZE is
larger than MAX_PAGE_ORDER it is mathematically impossible for a
single chunk to span multiple nodes.
> > This approach seems to give us the best of both worlds: It avoids the
> > memblock dependency during restoration. It keeps the serial work in
> > deserialize_bitmap() to a minimum (O(1) per region). It allows the
> > heavy lifting of tail page initialization to be done later in the boot
> > process, potentially in parallel, as you suggested.
>
> Here's another idea I have been thinking about, but never dug deep
> enough to figure out if it actually works.
>
> __init_page_from_nid() loops through all the zones for the node to find
> the zone id for the page. We can flip it the other way round and loop
> through all zones (on all nodes) to find out if the PFN spans that zone.
> Once we find the zone, we can directly call __init_single_page() on it.
> If a contiguous chunk of preserved memory lands in one zone, we can
> batch the init to save some time.
>
> Something like the below (completely untested):
>
>
> static void kho_init_page(struct page *page)
> {
> 	unsigned long pfn = page_to_pfn(page);
> 	struct zone *zone;
>
> 	for_each_zone(zone) {
> 		if (zone_spans_pfn(zone, pfn))
> 			break;
> 	}
>
> 	__init_single_page(page, pfn, zone_idx(zone), zone_to_nid(zone));
> }
>
> It doesn't do the batching I mentioned, but I think it at least gets the
> point across. And I think even this simple version would be a good first
> step.
>
> This lets us initialize the page from kho_restore_folio() without having
> to rely on memblock being alive, and saves us from doing work during
> early boot. We should only have a handful of zones and nodes in
> practice, so I think it should perform fairly well too.
>
> We would of course need to see how it performs in practice. If it works,
> I think it would be cleaner and simpler than splitting the
> initialization into two separate parts.
I think your idea is clever and would work. However, consider the
cache efficiency: in deserialize_bitmap(), we must write to the head
struct page anyway to preserve the order. Since we are already
bringing that 64-byte cacheline in and dirtying it, and since memblock
is available and fast at this stage, it makes sense to fully
initialize the head page right then.
If we do that, we get the nid for "free" (cache-wise) and we avoid the
overhead of iterating zones during the restore phase. We can then
simply inherit the nid from the head page when initializing the tail
pages later.
Pasha
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-19 16:28 ` Pasha Tatashin
@ 2025-12-20 3:20 ` Pratyush Yadav
2025-12-20 14:49 ` Pasha Tatashin
0 siblings, 1 reply; 28+ messages in thread
From: Pratyush Yadav @ 2025-12-20 3:20 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Mike Rapoport, Evangelos Petrongonas, Pratyush Yadav,
Alexander Graf, Andrew Morton, Jason Miu, linux-kernel, kexec,
linux-mm, nh-open-source
On Fri, Dec 19 2025, Pasha Tatashin wrote:
> On Fri, Dec 19, 2025 at 4:19 AM Mike Rapoport <rppt@kernel.org> wrote:
>>
>> On Tue, Dec 16, 2025 at 10:36:01AM -0500, Pasha Tatashin wrote:
>> > On Tue, Dec 16, 2025 at 10:19 AM Mike Rapoport <rppt@kernel.org> wrote:
>> > >
>> > > On Tue, Dec 16, 2025 at 10:05:27AM -0500, Pasha Tatashin wrote:
>> > > > > > +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
>> > > > > > + unsigned int order)
>> > > > > > +{
>> > > > > > + unsigned long pfn = PHYS_PFN(phys);
>> > > > > > + int nid = early_pfn_to_nid(pfn);
>> > > > > > +
>> > > > > > + for (int i = 0; i < (1 << order); i++)
>> > > > > > + init_deferred_page(pfn + i, nid);
>> > > > >
>> > > > > This will skip pages below node->first_deferred_pfn, we need to use
>> > > > > __init_page_from_nid() here.
>> > > >
>> > > > Mike, but those struct pages should be initialized early anyway. If
>> > > > they are not yet initialized we have a problem, as they are going to
>> > > > be re-initialized later.
>> > >
>> > > Can't say I understand your point. Which pages should be initialized early?
>> >
>> > All pages below node->first_deferred_pfn.
>> >
>> > > And which pages will be reinitialized?
>> >
>> > kho_memory_init() is called after free_area_init() (which calls
>> > memmap_init_range to initialize low memory struct pages). So, if we
>> > use __init_page_from_nid() as suggested, we would be blindly running
>> > __init_single_page() again on those low-memory pages that
>> > memmap_init_range() already set up. This would cause double
>> > initialization and corruptions due to losing the order information.
>> >
>> > > > > > +
>> > > > > > + return pfn_to_page(pfn);
>> > > > > > +}
>> > > > > > +
>> > > > > > static void __init deserialize_bitmap(unsigned int order,
>> > > > > > struct khoser_mem_bitmap_ptr *elm)
>> > > > > > {
>> > > > > > @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
>> > > > > > int sz = 1 << (order + PAGE_SHIFT);
>> > > > > > phys_addr_t phys =
>> > > > > > elm->phys_start + (bit << (order + PAGE_SHIFT));
>> > > > > > - struct page *page = phys_to_page(phys);
>> > > > > > + struct page *page = kho_get_preserved_page(phys, order);
>> > > > >
>> > > > > I think it's better to initialize deferred struct pages later in
>> > > > > kho_restore_page. deserialize_bitmap() runs before SMP and it already does
>> > > >
>> > > > The KHO memory should still be accessible early in boot, right?
>> > >
>> > > The memory is accessible. And we anyway should not use struct page for
>> > > preserved memory before kho_restore_{folio,pages}.
>> >
>> > This makes sense, what happens if someone calls kho_restore_folio()
>> > before deferred pages are initialized?
>>
>> That's fine, because this memory is still memblock_reserve()ed and deferred
>> init skips reserved ranges.
>> There is a problem however with the calls to kho_restore_{pages,folio}()
>> after memblock is gone because we can't use early_pfn_to_nid() then.
I suppose we can select CONFIG_ARCH_KEEP_MEMBLOCK with
CONFIG_KEXEC_HANDOVER. But that comes with its own set of problems like
wasting memory, especially when there are a lot of scattered preserved
pages.
I don't think this is a very good idea, just throwing it out there as an
option.
>
> I agree with the point regarding memblock and early_pfn_to_nid(), but I
> don't think we need to rely on early_pfn_to_nid() during the restore
> phase.
>
>> I think we can start with Evangelos' approach that initializes struct pages
>> at deserialize time and then we'll see how to optimize it.
>
> Let's do the lazy tail initialization that I proposed to you in a
> chat: we initialize only the head struct page during
> deserialize_bitmap(). Since this happens while memblock is still
> active, we can safely use early_pfn_to_nid() to set the nid in the
> head page's flags, and also preserve order as we do today.
>
> Then, we can defer the initialization of all tail pages to
> kho_restore_folio(). At that stage, we no longer need memblock or
> early_pfn_to_nid(); we can simply inherit the nid from the head page
> using page_to_nid(head).
Does that assumption always hold? Does every contiguous chunk of memory
always have to be in the same node? For folios it would hold, but what
about kho_preserve_pages()?
>
> This approach seems to give us the best of both worlds: It avoids the
> memblock dependency during restoration. It keeps the serial work in
> deserialize_bitmap() to a minimum (O(1) per region). It allows the
> heavy lifting of tail page initialization to be done later in the boot
> process, potentially in parallel, as you suggested.
Here's another idea I have been thinking about, but never dug deep
enough to figure out if it actually works.
__init_page_from_nid() loops through all the zones for the node to find
the zone id for the page. We can flip it the other way round and loop
through all zones (on all nodes) to find out if the PFN spans that zone.
Once we find the zone, we can directly call __init_single_page() on it.
If a contiguous chunk of preserved memory lands in one zone, we can
batch the init to save some time.
Something like the below (completely untested):
static void kho_init_page(struct page *page)
{
	unsigned long pfn = page_to_pfn(page);
	struct zone *zone;

	for_each_zone(zone) {
		if (zone_spans_pfn(zone, pfn))
			break;
	}

	__init_single_page(page, pfn, zone_idx(zone), zone_to_nid(zone));
}
It doesn't do the batching I mentioned, but I think it at least gets the
point across. And I think even this simple version would be a good first
step.
This lets us initialize the page from kho_restore_folio() without having
to rely on memblock being alive, and saves us from doing work during
early boot. We should only have a handful of zones and nodes in
practice, so I think it should perform fairly well too.
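For reference, the batching mentioned above could look roughly like this
(equally untested; kho_init_pages() is a name invented for this sketch,
which, like kho_init_page() above, assumes every pfn passed in is spanned
by some zone):
```
static void kho_init_pages(struct page *page, unsigned long nr_pages)
{
	unsigned long pfn = page_to_pfn(page);
	unsigned long end_pfn = pfn + nr_pages;
	struct zone *zone = NULL;

	for (; pfn < end_pfn; pfn++) {
		/* Re-do the search only when the chunk leaves the zone. */
		if (!zone || !zone_spans_pfn(zone, pfn)) {
			for_each_zone(zone) {
				if (zone_spans_pfn(zone, pfn))
					break;
			}
		}
		__init_single_page(pfn_to_page(pfn), pfn,
				   zone_idx(zone), zone_to_nid(zone));
	}
}
```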
We would of course need to see how it performs in practice. If it works,
I think it would be cleaner and simpler than splitting the
initialization into two separate parts.
Evangelos, would you mind giving it a try to see if the idea works and
how well it performs?
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-16 15:51 ` Pasha Tatashin
@ 2025-12-20 2:27 ` Pratyush Yadav
0 siblings, 0 replies; 28+ messages in thread
From: Pratyush Yadav @ 2025-12-20 2:27 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Mike Rapoport, Evangelos Petrongonas, Pratyush Yadav,
Alexander Graf, Andrew Morton, Jason Miu, linux-kernel, kexec,
linux-mm, nh-open-source
On Tue, Dec 16 2025, Pasha Tatashin wrote:
> On Tue, Dec 16, 2025 at 10:36 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
>>
>> On Tue, Dec 16, 2025 at 10:19 AM Mike Rapoport <rppt@kernel.org> wrote:
>> >
>> > On Tue, Dec 16, 2025 at 10:05:27AM -0500, Pasha Tatashin wrote:
>> > > > > +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
>> > > > > + unsigned int order)
>> > > > > +{
>> > > > > + unsigned long pfn = PHYS_PFN(phys);
>> > > > > + int nid = early_pfn_to_nid(pfn);
>> > > > > +
>> > > > > + for (int i = 0; i < (1 << order); i++)
>> > > > > + init_deferred_page(pfn + i, nid);
>> > > >
>> > > > This will skip pages below node->first_deferred_pfn, we need to use
>> > > > __init_page_from_nid() here.
>> > >
>> > > Mike, but those struct pages should be initialized early anyway. If
>> > > they are not yet initialized we have a problem, as they are going to
>> > > be re-initialized later.
>> >
>> > Can't say I understand your point. Which pages should be initialized early?
>>
>> All pages below node->first_deferred_pfn.
>>
>> > And which pages will be reinitialized?
>>
>> kho_memory_init() is called after free_area_init() (which calls
>> memmap_init_range to initialize low memory struct pages). So, if we
>> use __init_page_from_nid() as suggested, we would be blindly running
>> __init_single_page() again on those low-memory pages that
>> memmap_init_range() already set up. This would cause double
>> initialization and corruptions due to losing the order information.
>>
>> > > > > +
>> > > > > + return pfn_to_page(pfn);
>> > > > > +}
>> > > > > +
>> > > > > static void __init deserialize_bitmap(unsigned int order,
>> > > > > struct khoser_mem_bitmap_ptr *elm)
>> > > > > {
>> > > > > @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
>> > > > > int sz = 1 << (order + PAGE_SHIFT);
>> > > > > phys_addr_t phys =
>> > > > > elm->phys_start + (bit << (order + PAGE_SHIFT));
>> > > > > - struct page *page = phys_to_page(phys);
>> > > > > + struct page *page = kho_get_preserved_page(phys, order);
>> > > >
>> > > > I think it's better to initialize deferred struct pages later in
>> > > > kho_restore_page. deserialize_bitmap() runs before SMP and it already does
>> > >
>> > > The KHO memory should still be accessible early in boot, right?
>> >
>> > The memory is accessible. And we anyway should not use struct page for
>> > preserved memory before kho_restore_{folio,pages}.
>>
>> This makes sense, what happens if someone calls kho_restore_folio()
>> before deferred pages are initialized?
>
> I looked at your repo. I think what you're proposing makes sense, and
> indeed it will provide a performance boost if some of the folios are
> restored in parallel. Just kho_init_deferred_pages() should be using
> init_deferred_page() to avoid re-initializing the lower memory pages.
> Also, I am still wondering how it will work with HVO, but I need to
> take a look at Pratyush's series for that.
The HVO optimization happens when the file is retrieved, after all the
folios are restored. So that is long after deferred page init. For my
series both approaches should work.
--
Regards,
Pratyush Yadav
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-19 9:19 ` Mike Rapoport
@ 2025-12-19 16:28 ` Pasha Tatashin
2025-12-20 3:20 ` Pratyush Yadav
0 siblings, 1 reply; 28+ messages in thread
From: Pasha Tatashin @ 2025-12-19 16:28 UTC (permalink / raw)
To: Mike Rapoport
Cc: Evangelos Petrongonas, Pratyush Yadav, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Fri, Dec 19, 2025 at 4:19 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Dec 16, 2025 at 10:36:01AM -0500, Pasha Tatashin wrote:
> > On Tue, Dec 16, 2025 at 10:19 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > On Tue, Dec 16, 2025 at 10:05:27AM -0500, Pasha Tatashin wrote:
> > > > > > +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> > > > > > + unsigned int order)
> > > > > > +{
> > > > > > + unsigned long pfn = PHYS_PFN(phys);
> > > > > > + int nid = early_pfn_to_nid(pfn);
> > > > > > +
> > > > > > + for (int i = 0; i < (1 << order); i++)
> > > > > > + init_deferred_page(pfn + i, nid);
> > > > >
> > > > > This will skip pages below node->first_deferred_pfn, we need to use
> > > > > __init_page_from_nid() here.
> > > >
> > > > Mike, but those struct pages should be initialized early anyway. If
> > > > they are not yet initialized we have a problem, as they are going to
> > > > be re-initialized later.
> > >
> > > Can't say I understand your point. Which pages should be initialized early?
> >
> > All pages below node->first_deferred_pfn.
> >
> > > And which pages will be reinitialized?
> >
> > kho_memory_init() is called after free_area_init() (which calls
> > memmap_init_range to initialize low memory struct pages). So, if we
> > use __init_page_from_nid() as suggested, we would be blindly running
> > __init_single_page() again on those low-memory pages that
> > memmap_init_range() already set up. This would cause double
> > initialization and corruptions due to losing the order information.
> >
> > > > > > +
> > > > > > + return pfn_to_page(pfn);
> > > > > > +}
> > > > > > +
> > > > > > static void __init deserialize_bitmap(unsigned int order,
> > > > > > struct khoser_mem_bitmap_ptr *elm)
> > > > > > {
> > > > > > @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
> > > > > > int sz = 1 << (order + PAGE_SHIFT);
> > > > > > phys_addr_t phys =
> > > > > > elm->phys_start + (bit << (order + PAGE_SHIFT));
> > > > > > - struct page *page = phys_to_page(phys);
> > > > > > + struct page *page = kho_get_preserved_page(phys, order);
> > > > >
> > > > > I think it's better to initialize deferred struct pages later in
> > > > > kho_restore_page. deserialize_bitmap() runs before SMP and it already does
> > > >
> > > > The KHO memory should still be accessible early in boot, right?
> > >
> > > The memory is accessible. And we anyway should not use struct page for
> > > preserved memory before kho_restore_{folio,pages}.
> >
> > This makes sense, what happens if someone calls kho_restore_folio()
> > before deferred pages are initialized?
>
> That's fine, because this memory is still memblock_reserve()ed and deferred
> init skips reserved ranges.
> There is a problem however with the calls to kho_restore_{pages,folio}()
> after memblock is gone because we can't use early_pfn_to_nid() then.
I agree with the point regarding memblock and early_pfn_to_nid(), but I
don't think we need to rely on early_pfn_to_nid() during the restore
phase.
> I think we can start with Evangelos' approach that initializes struct pages
> at deserialize time and then we'll see how to optimize it.
Let's do the lazy tail initialization that I proposed to you in a
chat: we initialize only the head struct page during
deserialize_bitmap(). Since this happens while memblock is still
active, we can safely use early_pfn_to_nid() to set the nid in the
head page's flags, and also preserve order as we do today.
Then, we can defer the initialization of all tail pages to
kho_restore_folio(). At that stage, we no longer need memblock or
early_pfn_to_nid(); we can simply inherit the nid from the head page
using page_to_nid(head).
This approach seems to give us the best of both worlds: It avoids the
memblock dependency during restoration. It keeps the serial work in
deserialize_bitmap() to a minimum (O(1) per region). It allows the
heavy lifting of tail page initialization to be done later in the boot
process, potentially in parallel, as you suggested.
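A minimal sketch of the restore-side half of this scheme (the helper name is
made up here; it assumes the head page was fully initialized in
deserialize_bitmap() as described above):
```
static void kho_init_tail_pages(struct page *head, unsigned int order)
{
	unsigned long pfn = page_to_pfn(head);
	/* nid and zone were set on the head while memblock was still alive. */
	int nid = page_to_nid(head);
	enum zone_type zone = page_zonenum(head);

	for (unsigned long i = 1; i < (1UL << order); i++)
		__init_single_page(head + i, pfn + i, zone, nid);
}
```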
Thanks,
Pasha
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-16 15:36 ` Pasha Tatashin
2025-12-16 15:51 ` Pasha Tatashin
@ 2025-12-19 9:19 ` Mike Rapoport
2025-12-19 16:28 ` Pasha Tatashin
1 sibling, 1 reply; 28+ messages in thread
From: Mike Rapoport @ 2025-12-19 9:19 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Evangelos Petrongonas, Pratyush Yadav, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Tue, Dec 16, 2025 at 10:36:01AM -0500, Pasha Tatashin wrote:
> On Tue, Dec 16, 2025 at 10:19 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Tue, Dec 16, 2025 at 10:05:27AM -0500, Pasha Tatashin wrote:
> > > > > +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> > > > > + unsigned int order)
> > > > > +{
> > > > > + unsigned long pfn = PHYS_PFN(phys);
> > > > > + int nid = early_pfn_to_nid(pfn);
> > > > > +
> > > > > + for (int i = 0; i < (1 << order); i++)
> > > > > + init_deferred_page(pfn + i, nid);
> > > >
> > > > This will skip pages below node->first_deferred_pfn, we need to use
> > > > __init_page_from_nid() here.
> > >
> > > Mike, but those struct pages should be initialized early anyway. If
> > > they are not yet initialized we have a problem, as they are going to
> > > be re-initialized later.
> >
> > Can't say I understand your point. Which pages should be initialized early?
>
> All pages below node->first_deferred_pfn.
>
> > And which pages will be reinitialized?
>
> kho_memory_init() is called after free_area_init() (which calls
> memmap_init_range to initialize low memory struct pages). So, if we
> use __init_page_from_nid() as suggested, we would be blindly running
> __init_single_page() again on those low-memory pages that
> memmap_init_range() already set up. This would cause double
> initialization and corruptions due to losing the order information.
>
> > > > > +
> > > > > + return pfn_to_page(pfn);
> > > > > +}
> > > > > +
> > > > > static void __init deserialize_bitmap(unsigned int order,
> > > > > struct khoser_mem_bitmap_ptr *elm)
> > > > > {
> > > > > @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
> > > > > int sz = 1 << (order + PAGE_SHIFT);
> > > > > phys_addr_t phys =
> > > > > elm->phys_start + (bit << (order + PAGE_SHIFT));
> > > > > - struct page *page = phys_to_page(phys);
> > > > > + struct page *page = kho_get_preserved_page(phys, order);
> > > >
> > > > I think it's better to initialize deferred struct pages later in
> > > > kho_restore_page. deserialize_bitmap() runs before SMP and it already does
> > >
> > > The KHO memory should still be accessible early in boot, right?
> >
> > The memory is accessible. And we anyway should not use struct page for
> > preserved memory before kho_restore_{folio,pages}.
>
> This makes sense, what happens if someone calls kho_restore_folio()
> before deferred pages are initialized?
That's fine, because this memory is still memblock_reserve()ed and deferred
init skips reserved ranges.
There is a problem however with the calls to kho_restore_{pages,folio}()
after memblock is gone because we can't use early_pfn_to_nid() then.
I think we can start with Evangelos' approach that initializes struct pages
at deserialize time and then we'll see how to optimize it.
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-16 15:36 ` Pasha Tatashin
@ 2025-12-16 15:51 ` Pasha Tatashin
2025-12-20 2:27 ` Pratyush Yadav
2025-12-19 9:19 ` Mike Rapoport
1 sibling, 1 reply; 28+ messages in thread
From: Pasha Tatashin @ 2025-12-16 15:51 UTC (permalink / raw)
To: Mike Rapoport
Cc: Evangelos Petrongonas, Pratyush Yadav, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Tue, Dec 16, 2025 at 10:36 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Tue, Dec 16, 2025 at 10:19 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Tue, Dec 16, 2025 at 10:05:27AM -0500, Pasha Tatashin wrote:
> > > > > +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> > > > > + unsigned int order)
> > > > > +{
> > > > > + unsigned long pfn = PHYS_PFN(phys);
> > > > > + int nid = early_pfn_to_nid(pfn);
> > > > > +
> > > > > + for (int i = 0; i < (1 << order); i++)
> > > > > + init_deferred_page(pfn + i, nid);
> > > >
> > > > This will skip pages below node->first_deferred_pfn, we need to use
> > > > __init_page_from_nid() here.
> > >
> > > Mike, but those struct pages should be initialized early anyway. If
> > > they are not yet initialized we have a problem, as they are going to
> > > be re-initialized later.
> >
> > Can say I understand your point. Which pages should be initialized earlt?
>
> All pages below node->first_deferred_pfn.
>
> > And which pages will be reinitialized?
>
> kho_memory_init() is called after free_area_init() (which calls
> memmap_init_range to initialize low memory struct pages). So, if we
> use __init_page_from_nid() as suggested, we would be blindly running
> __init_single_page() again on those low-memory pages that
> memmap_init_range() already set up. This would cause double
> initialization and corruptions due to losing the order information.
>
> > > > > +
> > > > > + return pfn_to_page(pfn);
> > > > > +}
> > > > > +
> > > > > static void __init deserialize_bitmap(unsigned int order,
> > > > > struct khoser_mem_bitmap_ptr *elm)
> > > > > {
> > > > > @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
> > > > > int sz = 1 << (order + PAGE_SHIFT);
> > > > > phys_addr_t phys =
> > > > > elm->phys_start + (bit << (order + PAGE_SHIFT));
> > > > > - struct page *page = phys_to_page(phys);
> > > > > + struct page *page = kho_get_preserved_page(phys, order);
> > > >
> > > > I think it's better to initialize deferred struct pages later in
> > > > kho_restore_page. deserialize_bitmap() runs before SMP and it already does
> > >
> > > The KHO memory should still be accessible early in boot, right?
> >
> > The memory is accessible. And we anyway should not use struct page for
> > preserved memory before kho_restore_{folio,pages}.
>
> This makes sense, what happens if someone calls kho_restore_folio()
> before deferred pages are initialized?
I looked at your repo. I think what you're proposing makes sense, and
indeed it will provide a performance boost if some of the folios are
restored in parallel. Just kho_init_deferred_pages() should be using
init_deferred_page() to avoid re-initializing the lower memory pages.
Also, I am still wondering how it will work with HVO, but I need to
take a look at Pratyush's series for that.
Thanks,
Pasha
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-16 15:19 ` Mike Rapoport
@ 2025-12-16 15:36 ` Pasha Tatashin
2025-12-16 15:51 ` Pasha Tatashin
2025-12-19 9:19 ` Mike Rapoport
0 siblings, 2 replies; 28+ messages in thread
From: Pasha Tatashin @ 2025-12-16 15:36 UTC (permalink / raw)
To: Mike Rapoport
Cc: Evangelos Petrongonas, Pratyush Yadav, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Tue, Dec 16, 2025 at 10:19 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Dec 16, 2025 at 10:05:27AM -0500, Pasha Tatashin wrote:
> > > > +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> > > > + unsigned int order)
> > > > +{
> > > > + unsigned long pfn = PHYS_PFN(phys);
> > > > + int nid = early_pfn_to_nid(pfn);
> > > > +
> > > > + for (int i = 0; i < (1 << order); i++)
> > > > + init_deferred_page(pfn + i, nid);
> > >
> > > This will skip pages below node->first_deferred_pfn, we need to use
> > > __init_page_from_nid() here.
> >
> > Mike, but those struct pages should be initialized early anyway. If
> > they are not yet initialized we have a problem, as they are going to
> > be re-initialized later.
>
> Can't say I understand your point. Which pages should be initialized early?
All pages below node->first_deferred_pfn.
> And which pages will be reinitialized?
kho_memory_init() is called after free_area_init() (which calls
memmap_init_range to initialize low memory struct pages). So, if we
use __init_page_from_nid() as suggested, we would be blindly running
__init_single_page() again on those low-memory pages that
memmap_init_range() already set up. This would cause double
initialization and corruptions due to losing the order information.
> > > > +
> > > > + return pfn_to_page(pfn);
> > > > +}
> > > > +
> > > > static void __init deserialize_bitmap(unsigned int order,
> > > > struct khoser_mem_bitmap_ptr *elm)
> > > > {
> > > > @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
> > > > int sz = 1 << (order + PAGE_SHIFT);
> > > > phys_addr_t phys =
> > > > elm->phys_start + (bit << (order + PAGE_SHIFT));
> > > > - struct page *page = phys_to_page(phys);
> > > > + struct page *page = kho_get_preserved_page(phys, order);
> > >
> > > I think it's better to initialize deferred struct pages later in
> > > kho_restore_page. deserialize_bitmap() runs before SMP and it already does
> >
> > The KHO memory should still be accessible early in boot, right?
>
> The memory is accessible. And we anyway should not use struct page for
> preserved memory before kho_restore_{folio,pages}.
This makes sense, what happens if someone calls kho_restore_folio()
before deferred pages are initialized?
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-16 15:05 ` Pasha Tatashin
@ 2025-12-16 15:19 ` Mike Rapoport
2025-12-16 15:36 ` Pasha Tatashin
0 siblings, 1 reply; 28+ messages in thread
From: Mike Rapoport @ 2025-12-16 15:19 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Evangelos Petrongonas, Pratyush Yadav, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
On Tue, Dec 16, 2025 at 10:05:27AM -0500, Pasha Tatashin wrote:
> > > +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> > > + unsigned int order)
> > > +{
> > > + unsigned long pfn = PHYS_PFN(phys);
> > > + int nid = early_pfn_to_nid(pfn);
> > > +
> > > + for (int i = 0; i < (1 << order); i++)
> > > + init_deferred_page(pfn + i, nid);
> >
> > This will skip pages below node->first_deferred_pfn, we need to use
> > __init_page_from_nid() here.
>
> Mike, but those struct pages should be initialized early anyway. If
> they are not yet initialized we have a problem, as they are going to
> be re-initialized later.
Can't say I understand your point. Which pages should be initialized early?
And which pages will be reinitialized?
> > > +
> > > + return pfn_to_page(pfn);
> > > +}
> > > +
> > > static void __init deserialize_bitmap(unsigned int order,
> > > struct khoser_mem_bitmap_ptr *elm)
> > > {
> > > @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
> > > int sz = 1 << (order + PAGE_SHIFT);
> > > phys_addr_t phys =
> > > elm->phys_start + (bit << (order + PAGE_SHIFT));
> > > - struct page *page = phys_to_page(phys);
> > > + struct page *page = kho_get_preserved_page(phys, order);
> >
> > I think it's better to initialize deferred struct pages later in
> > kho_restore_page. deserialize_bitmap() runs before SMP and it already does
>
> The KHO memory should still be accessible early in boot, right?
The memory is accessible. And we anyway should not use struct page for
preserved memory before kho_restore_{folio,pages}.
> > heavy lifting of memblock_reserve()s. Delaying struct page initialization
> > until restore makes it at least run in parallel with other initialization
> > tasks.
> >
> > I started to work on this just before plumbers and I have something
> > untested here:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/deferred-page/v0.1
> >
> > > union kho_page_info info;
> > >
> > > memblock_reserve(phys, sz);
> > > --
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-16 11:57 ` Mike Rapoport
2025-12-16 14:26 ` Evangelos Petrongonas
@ 2025-12-16 15:05 ` Pasha Tatashin
2025-12-16 15:19 ` Mike Rapoport
1 sibling, 1 reply; 28+ messages in thread
From: Pasha Tatashin @ 2025-12-16 15:05 UTC (permalink / raw)
To: Mike Rapoport
Cc: Evangelos Petrongonas, Pratyush Yadav, Alexander Graf,
Andrew Morton, Jason Miu, linux-kernel, kexec, linux-mm,
nh-open-source
> > +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> > +						  unsigned int order)
> > +{
> > +	unsigned long pfn = PHYS_PFN(phys);
> > +	int nid = early_pfn_to_nid(pfn);
> > +
> > +	for (int i = 0; i < (1 << order); i++)
> > +		init_deferred_page(pfn + i, nid);
>
> This will skip pages below node->first_deferred_pfn, we need to use
> __init_page_from_nid() here.
Mike, but those struct pages should be initialized early anyway. If
they are not yet initialized we have a problem, as they are going to
be re-initialized later.
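Roughly the semantics being debated, as an illustrative sketch based on the
descriptions in this thread (not the actual mm implementation):
```
/*
 * Illustration only: init_deferred_page() skips pfns that early boot
 * already initialized (below the node's first_deferred_pfn), while
 * __init_page_from_nid() initializes unconditionally.
 */
static void init_deferred_page_sketch(unsigned long pfn, int nid)
{
	if (pfn < NODE_DATA(nid)->first_deferred_pfn)
		return;	/* memmap_init_range() already covered this pfn */

	__init_page_from_nid(pfn, nid);
}
```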
>
> > +
> > +	return pfn_to_page(pfn);
> > +}
> > +
> >  static void __init deserialize_bitmap(unsigned int order,
> >  				      struct khoser_mem_bitmap_ptr *elm)
> >  {
> > @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
> >  		int sz = 1 << (order + PAGE_SHIFT);
> >  		phys_addr_t phys =
> >  			elm->phys_start + (bit << (order + PAGE_SHIFT));
> > -		struct page *page = phys_to_page(phys);
> > +		struct page *page = kho_get_preserved_page(phys, order);
>
> I think it's better to initialize deferred struct pages later in
> kho_restore_page. deserialize_bitmap() runs before SMP and it already does
The KHO memory should still be accessible early in boot, right?
> heavy lifting of memblock_reserve()s. Delaying struct page initialization
> until restore makes it at least run in parallel with other initialization
> tasks.
>
> I started to work on this just before plumbers and I have something
> untested here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/deferred-page/v0.1
>
> >  		union kho_page_info info;
> >
> >  		memblock_reserve(phys, sz);
> > --
> > 2.43.0
>
> --
> Sincerely yours,
> Mike.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-16 11:57 ` Mike Rapoport
@ 2025-12-16 14:26 ` Evangelos Petrongonas
2025-12-16 15:05 ` Pasha Tatashin
1 sibling, 0 replies; 28+ messages in thread
From: Evangelos Petrongonas @ 2025-12-16 14:26 UTC (permalink / raw)
To: Mike Rapoport
Cc: Pasha Tatashin, Pratyush Yadav, Alexander Graf, Andrew Morton,
Jason Miu, linux-kernel, kexec, linux-mm, nh-open-source
On Tue, Dec 16, 2025 at 01:57:19PM +0200 Mike Rapoport wrote:
> Hi Evangelos,
>
> On Tue, Dec 16, 2025 at 08:49:12AM +0000, Evangelos Petrongonas wrote:
> > When `CONFIG_DEFERRED_STRUCT_PAGE_INIT` is enabled, struct page
>
> No need for markup formatting in the changelog.
>
ack
> > initialization is deferred to parallel kthreads that run later
> > in the boot process.
> >
> > During KHO restoration, `deserialize_bitmap()` writes metadata for
> > each preserved memory region. However, if the struct page has not been
> > initialized, this write targets uninitialized memory, potentially
> > leading to errors like:
> > ```
> > BUG: unable to handle page fault for address: ...
> > ```
> >
> > Fix this by introducing `kho_get_preserved_page()`, which ensures
> > all struct pages in a preserved region are initialized by calling
> > `init_deferred_page()` which is a no-op when deferred init is disabled
> > or when the struct page is already initialized.
> >
> > Fixes: 8b66ed2c3f42 ("kho: mm: don't allow deferred struct page with KHO")
> > Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
> > ---
>
> ...
>
> > +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> > +						  unsigned int order)
> > +{
> > +	unsigned long pfn = PHYS_PFN(phys);
> > +	int nid = early_pfn_to_nid(pfn);
> > +
> > +	for (int i = 0; i < (1 << order); i++)
> > +		init_deferred_page(pfn + i, nid);
>
> This will skip pages below node->first_deferred_pfn, we need to use
> __init_page_from_nid() here.
>
Right, __init_page_from_nid() unconditionally initializes the page.
> > +
> > +	return pfn_to_page(pfn);
> > +}
> > +
> >  static void __init deserialize_bitmap(unsigned int order,
> >  				      struct khoser_mem_bitmap_ptr *elm)
> >  {
> > @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
> >  		int sz = 1 << (order + PAGE_SHIFT);
> >  		phys_addr_t phys =
> >  			elm->phys_start + (bit << (order + PAGE_SHIFT));
> > -		struct page *page = phys_to_page(phys);
> > +		struct page *page = kho_get_preserved_page(phys, order);
>
> I think it's better to initialize deferred struct pages later in
> kho_restore_page. deserialize_bitmap() runs before SMP and it already does
> heavy lifting of memblock_reserve()s. Delaying struct page initialization
> until restore makes it at least run in parallel with other initialization
> tasks.
>
> I started to work on this just before plumbers and I have something
> untested here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/deferred-page/v0.1
>
Nice suggestion! I looked at your branch and I agree, your
approach seems better.
I also noticed your debug check:
```
if (IS_ENABLED(CONFIG_KEXEC_HANDOVER_DEBUG))
	WARN_ON(nid != early_pfn_to_nid(pfn + i));
```
This catches, or at least allows for easier debugging of,
another potential (although unlikely (?)) issue that my patch missed:
preserved pages spanning multiple NUMA nodes within a single higher-order
allocation. Nice to have this :)
I am happy to drop my patch in favor of yours. FWIW I have quickly
tested it both using the modified selftest and a custom payload and it
seems to be working fine. Please let me know once you post the patches.
> >  		union kho_page_info info;
> >
> >  		memblock_reserve(phys, sz);
> > --
> > 2.43.0
>
> --
> Sincerely yours,
> Mike.
Kind Regards,
Evangelos
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-16 8:49 Evangelos Petrongonas
2025-12-16 10:53 ` Pasha Tatashin
@ 2025-12-16 11:57 ` Mike Rapoport
2025-12-16 14:26 ` Evangelos Petrongonas
2025-12-16 15:05 ` Pasha Tatashin
1 sibling, 2 replies; 28+ messages in thread
From: Mike Rapoport @ 2025-12-16 11:57 UTC (permalink / raw)
To: Evangelos Petrongonas
Cc: Pasha Tatashin, Pratyush Yadav, Alexander Graf, Andrew Morton,
Jason Miu, linux-kernel, kexec, linux-mm, nh-open-source
Hi Evangelos,
On Tue, Dec 16, 2025 at 08:49:12AM +0000, Evangelos Petrongonas wrote:
> When `CONFIG_DEFERRED_STRUCT_PAGE_INIT` is enabled, struct page
No need for markup formatting in the changelog.
> initialization is deferred to parallel kthreads that run later
> in the boot process.
>
> During KHO restoration, `deserialize_bitmap()` writes metadata for
> each preserved memory region. However, if the struct page has not been
> initialized, this write targets uninitialized memory, potentially
> leading to errors like:
> ```
> BUG: unable to handle page fault for address: ...
> ```
>
> Fix this by introducing `kho_get_preserved_page()`, which ensures
> all struct pages in a preserved region are initialized by calling
> `init_deferred_page()` which is a no-op when deferred init is disabled
> or when the struct page is already initialized.
>
> Fixes: 8b66ed2c3f42 ("kho: mm: don't allow deferred struct page with KHO")
> Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
> ---
...
> +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> +						  unsigned int order)
> +{
> +	unsigned long pfn = PHYS_PFN(phys);
> +	int nid = early_pfn_to_nid(pfn);
> +
> +	for (int i = 0; i < (1 << order); i++)
> +		init_deferred_page(pfn + i, nid);
This will skip pages below node->first_deferred_pfn, we need to use
__init_page_from_nid() here.
> +
> +	return pfn_to_page(pfn);
> +}
> +
>  static void __init deserialize_bitmap(unsigned int order,
>  				      struct khoser_mem_bitmap_ptr *elm)
>  {
> @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
>  		int sz = 1 << (order + PAGE_SHIFT);
>  		phys_addr_t phys =
>  			elm->phys_start + (bit << (order + PAGE_SHIFT));
> -		struct page *page = phys_to_page(phys);
> +		struct page *page = kho_get_preserved_page(phys, order);
I think it's better to initialize deferred struct pages later in
kho_restore_page. deserialize_bitmap() runs before SMP and it already does
heavy lifting of memblock_reserve()s. Delaying struct page initialization
until restore makes it at least run in parallel with other initialization
tasks.
I started to work on this just before plumbers and I have something
untested here:
https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/deferred-page/v0.1
>  		union kho_page_info info;
>
>  		memblock_reserve(phys, sz);
> --
> 2.43.0
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] kho: add support for deferred struct page init
2025-12-16 8:49 Evangelos Petrongonas
@ 2025-12-16 10:53 ` Pasha Tatashin
2025-12-16 11:57 ` Mike Rapoport
1 sibling, 0 replies; 28+ messages in thread
From: Pasha Tatashin @ 2025-12-16 10:53 UTC (permalink / raw)
To: Evangelos Petrongonas
Cc: Mike Rapoport, Pratyush Yadav, Alexander Graf, Andrew Morton,
Jason Miu, linux-kernel, kexec, linux-mm, nh-open-source
On Tue, Dec 16, 2025 at 3:49 AM Evangelos Petrongonas <epetron@amazon.de> wrote:
>
> When `CONFIG_DEFERRED_STRUCT_PAGE_INIT` is enabled, struct page
> initialization is deferred to parallel kthreads that run later
> in the boot process.
>
> During KHO restoration, `deserialize_bitmap()` writes metadata for
> each preserved memory region. However, if the struct page has not been
> initialized, this write targets uninitialized memory, potentially
> leading to errors like:
> ```
> BUG: unable to handle page fault for address: ...
> ```
>
> Fix this by introducing `kho_get_preserved_page()`, which ensures
> all struct pages in a preserved region are initialized by calling
> `init_deferred_page()` which is a no-op when deferred init is disabled
> or when the struct page is already initialized.
>
> Fixes: 8b66ed2c3f42 ("kho: mm: don't allow deferred struct page with KHO")
You are adding a new feature. Backporting this to stable is not needed.
> Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
> ---
> ### Notes
> @Jason, this patch should act as a temporary fix to make KHO play nice
> with deferred struct page init until you post your ideas about splitting
> "Physical Reservation" from "Metadata Restoration".
>
> ### Testing
> In order to test the fix, I modified the KHO selftest, to allocate more
> memory and do so from higher memory to trigger the incompatibility. The
> branch with those changes can be found in:
> https://git.infradead.org/?p=users/vpetrog/linux.git;a=shortlog;h=refs/heads/kho-deferred-struct-page-init
>
> In future patches, we might want to enhance the selftest to cover
> this case as well. However, properly adopting the test for this
> is much more work than the actual fix, therefore it can be deferred to a
> follow-up series.
>
> In addition attempting to run the selftest for arm (without my changes)
> fails with:
> ```
> ERROR:target/arm/internals.h:767:regime_is_user: code should not be reached
> Bail out! ERROR:target/arm/internals.h:767:regime_is_user: code should not be reached
> ./tools/testing/selftests/kho/vmtest.sh: line 113: 61609 Aborted
> ```
> I have not looked it up further, but can also do so as part of a
> selftest follow-up.
>
>  kernel/liveupdate/Kconfig          |  2 --
>  kernel/liveupdate/kexec_handover.c | 19 ++++++++++++++++++-
>  2 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
> index d2aeaf13c3ac..9394a608f939 100644
> --- a/kernel/liveupdate/Kconfig
> +++ b/kernel/liveupdate/Kconfig
> @@ -1,12 +1,10 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>
>  menu "Live Update and Kexec HandOver"
> -	depends on !DEFERRED_STRUCT_PAGE_INIT
>
>  config KEXEC_HANDOVER
>  	bool "kexec handover"
>  	depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
> -	depends on !DEFERRED_STRUCT_PAGE_INIT
>  	select MEMBLOCK_KHO_SCRATCH
>  	select KEXEC_FILE
>  	select LIBFDT
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index 9dc51fab604f..78cfe71e6107 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -439,6 +439,23 @@ static int kho_mem_serialize(struct kho_out *kho_out)
>  	return err;
>  }
>
> +/*
> + * With CONFIG_DEFERRED_STRUCT_PAGE_INIT, struct pages in higher memory
> + * regions may not be initialized yet at the time KHO deserializes preserved
> + * memory. This function ensures all struct pages in the region are initialized.
> + */
> +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> +						  unsigned int order)
> +{
> +	unsigned long pfn = PHYS_PFN(phys);
> +	int nid = early_pfn_to_nid(pfn);
> +
> +	for (int i = 0; i < (1 << order); i++)
> +		init_deferred_page(pfn + i, nid);
> +
> +	return pfn_to_page(pfn);
> +}
> +
>  static void __init deserialize_bitmap(unsigned int order,
>  				      struct khoser_mem_bitmap_ptr *elm)
>  {
> @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
>  		int sz = 1 << (order + PAGE_SHIFT);
>  		phys_addr_t phys =
>  			elm->phys_start + (bit << (order + PAGE_SHIFT));
> -		struct page *page = phys_to_page(phys);
> +		struct page *page = kho_get_preserved_page(phys, order);
>  		union kho_page_info info;
>
>  		memblock_reserve(phys, sz);
In deferred_init_memmap_chunk() we initialize deferred struct pages in
this iterator:
for_each_free_mem_range(i, nid, 0, &start, &end, NULL) {
	init_deferred_page()
}
However, since memblock_reserve() is called, the memory is not going
to be covered by the for_each_free_mem_range() iteration. So I think
the proposed patch should work.
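(A minimal sketch of that interaction, assuming standard memblock
semantics; init_pages_in_range() below is a hypothetical stand-in for
the deferred-init work, not a real kernel function:)
```
/*
 * Sketch only, not actual kernel code.  memblock keeps two region
 * lists, memblock.memory and memblock.reserved; the "free" iterator
 * yields memory minus reserved.
 */
phys_addr_t start, end;
u64 i;

memblock_reserve(phys, sz);	/* what deserialize_bitmap() does */

/*
 * [phys, phys + sz) has dropped out of this walk, so the deferred-init
 * kthreads never touch its struct pages; the loop the patch adds in
 * kho_get_preserved_page() is now the only thing initializing them.
 */
for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL)
	init_pages_in_range(start, end);	/* hypothetical stand-in */
```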
Pratyush, what happens if we deserialize a HugeTLB folio with HVO
applied? Since HVO frees the vmemmap pages that uniquely back the tail
struct pages and leaves them aliased to a shared page, blindly iterating
over the range and initializing every struct page via
init_deferred_page() might corrupt that shared mapping.
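(For readers who have not met HVO: a rough sketch of the concern,
assuming a 2 MiB folio on x86-64 with 64-byte struct pages; the exact
remap scheme lives in mm/hugetlb_vmemmap.c.)
```
/*
 * Rough sketch of the HVO layout, not actual kernel code.
 *
 * A 2 MiB folio spans 512 base pages, so its struct pages occupy
 * 512 * 64 bytes = 32 KiB of vmemmap, i.e. 8 page frames.  HVO frees
 * 7 of those frames and remaps their virtual range, read-only, onto
 * the frame that is kept.  A blind walk such as
 *
 *	for (i = 0; i < (1 << order); i++)
 *		init_deferred_page(pfn + i, nid);
 *
 * therefore writes through aliased tail addresses: it either faults
 * on the read-only mapping or scribbles over the one vmemmap page
 * that every tail struct page now shares.
 */
```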
Pasha
* [PATCH] kho: add support for deferred struct page init
@ 2025-12-16 8:49 Evangelos Petrongonas
2025-12-16 10:53 ` Pasha Tatashin
2025-12-16 11:57 ` Mike Rapoport
0 siblings, 2 replies; 28+ messages in thread
From: Evangelos Petrongonas @ 2025-12-16 8:49 UTC (permalink / raw)
To: Mike Rapoport
Cc: Evangelos Petrongonas, Pasha Tatashin, Pratyush Yadav,
Alexander Graf, Andrew Morton, Jason Miu, linux-kernel, kexec,
linux-mm, nh-open-source
When `CONFIG_DEFERRED_STRUCT_PAGE_INIT` is enabled, struct page
initialization is deferred to parallel kthreads that run later
in the boot process.
During KHO restoration, `deserialize_bitmap()` writes metadata for
each preserved memory region. However, if the struct page has not been
initialized, this write targets uninitialized memory, potentially
leading to errors like:
```
BUG: unable to handle page fault for address: ...
```
Fix this by introducing `kho_get_preserved_page()`, which ensures
all struct pages in a preserved region are initialized by calling
`init_deferred_page()`, which is a no-op when deferred init is disabled
or when the struct page is already initialized.
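(For context, a sketch of the helper's shape; the real definition is in
mm/internal.h and may differ in detail:)
```
/* Sketch of init_deferred_page()'s shape, not the exact mm code. */
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
static inline void init_deferred_page(unsigned long pfn, int nid)
{
	if (early_page_initialised(pfn, nid))
		return;				/* already initialized */
	__init_page_from_nid(pfn, nid);		/* initialize on demand */
}
#else
static inline void init_deferred_page(unsigned long pfn, int nid)
{
	/* every struct page was initialized during early boot */
}
#endif
```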
Fixes: 8b66ed2c3f42 ("kho: mm: don't allow deferred struct page with KHO")
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
---
### Notes
@Jason, this patch should act as a temporary fix to make KHO play nice
with deferred struct page init until you post your ideas about splitting
"Physical Reservation" from "Metadata Restoration".
### Testing
To test the fix, I modified the KHO selftest to allocate more memory,
and to allocate it from higher memory, to trigger the incompatibility. The
branch with those changes can be found in:
https://git.infradead.org/?p=users/vpetrog/linux.git;a=shortlog;h=refs/heads/kho-deferred-struct-page-init
In future patches, we might want to enhance the selftest to cover
this case as well. However, properly adapting the test for this
is much more work than the actual fix, so it can be deferred to a
follow-up series.
In addition, attempting to run the selftest for arm (without my changes)
fails with:
```
ERROR:target/arm/internals.h:767:regime_is_user: code should not be reached
Bail out! ERROR:target/arm/internals.h:767:regime_is_user: code should not be reached
./tools/testing/selftests/kho/vmtest.sh: line 113: 61609 Aborted
```
I have not looked into it further, but can do so as part of a
selftest follow-up.
kernel/liveupdate/Kconfig | 2 --
kernel/liveupdate/kexec_handover.c | 19 ++++++++++++++++++-
2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
index d2aeaf13c3ac..9394a608f939 100644
--- a/kernel/liveupdate/Kconfig
+++ b/kernel/liveupdate/Kconfig
@@ -1,12 +1,10 @@
# SPDX-License-Identifier: GPL-2.0-only
menu "Live Update and Kexec HandOver"
- depends on !DEFERRED_STRUCT_PAGE_INIT
config KEXEC_HANDOVER
bool "kexec handover"
depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
- depends on !DEFERRED_STRUCT_PAGE_INIT
select MEMBLOCK_KHO_SCRATCH
select KEXEC_FILE
select LIBFDT
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 9dc51fab604f..78cfe71e6107 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -439,6 +439,23 @@ static int kho_mem_serialize(struct kho_out *kho_out)
return err;
}
+/*
+ * With CONFIG_DEFERRED_STRUCT_PAGE_INIT, struct pages in higher memory
+ * regions may not be initialized yet at the time KHO deserializes preserved
+ * memory. This function ensures all struct pages in the region are initialized.
+ */
+static struct page *__init kho_get_preserved_page(phys_addr_t phys,
+ unsigned int order)
+{
+ unsigned long pfn = PHYS_PFN(phys);
+ int nid = early_pfn_to_nid(pfn);
+
+ for (int i = 0; i < (1 << order); i++)
+ init_deferred_page(pfn + i, nid);
+
+ return pfn_to_page(pfn);
+}
+
static void __init deserialize_bitmap(unsigned int order,
struct khoser_mem_bitmap_ptr *elm)
{
@@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
int sz = 1 << (order + PAGE_SHIFT);
phys_addr_t phys =
elm->phys_start + (bit << (order + PAGE_SHIFT));
- struct page *page = phys_to_page(phys);
+ struct page *page = kho_get_preserved_page(phys, order);
union kho_page_info info;
memblock_reserve(phys, sz);
--
2.43.0
Thread overview: 28+ messages
2025-12-24 7:34 [PATCH] kho: add support for deferred struct page init Fadouse
2025-12-29 21:09 ` Pratyush Yadav
2025-12-30 15:05 ` Pasha Tatashin
-- strict thread matches above, loose matches on Subject: below --
2025-12-16 8:49 Evangelos Petrongonas
2025-12-16 10:53 ` Pasha Tatashin
2025-12-16 11:57 ` Mike Rapoport
2025-12-16 14:26 ` Evangelos Petrongonas
2025-12-16 15:05 ` Pasha Tatashin
2025-12-16 15:19 ` Mike Rapoport
2025-12-16 15:36 ` Pasha Tatashin
2025-12-16 15:51 ` Pasha Tatashin
2025-12-20 2:27 ` Pratyush Yadav
2025-12-19 9:19 ` Mike Rapoport
2025-12-19 16:28 ` Pasha Tatashin
2025-12-20 3:20 ` Pratyush Yadav
2025-12-20 14:49 ` Pasha Tatashin
2025-12-22 15:33 ` Pratyush Yadav
2025-12-22 15:55 ` Pasha Tatashin
2025-12-22 16:24 ` Pratyush Yadav
2025-12-23 17:37 ` Pasha Tatashin
2025-12-29 21:03 ` Pratyush Yadav
2025-12-30 16:05 ` Pasha Tatashin
2025-12-30 16:16 ` Mike Rapoport
2025-12-30 16:18 ` Pasha Tatashin
2025-12-30 17:18 ` Mike Rapoport
2025-12-30 18:21 ` Pasha Tatashin
2025-12-31 9:46 ` Mike Rapoport
2025-12-30 16:14 ` Mike Rapoport