From: Pasha Tatashin <pasha.tatashin@soleen.com>
To: Pratyush Yadav <pratyush@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>,
Evangelos Petrongonas <epetron@amazon.de>,
Alexander Graf <graf@amazon.com>,
Andrew Morton <akpm@linux-foundation.org>,
Jason Miu <jasonmiu@google.com>,
linux-kernel@vger.kernel.org, kexec@lists.infradead.org,
linux-mm@kvack.org, nh-open-source@amazon.com
Subject: Re: [PATCH] kho: add support for deferred struct page init
Date: Tue, 30 Dec 2025 11:05:05 -0500
Message-ID: <CA+CK2bDDEZ5a+LJNDVMtKFtQ1D9rP6rJqU064eoru=a9eHpaAQ@mail.gmail.com>
In-Reply-To: <864ip99f1a.fsf@kernel.org>
On Mon, Dec 29, 2025 at 4:03 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Tue, Dec 23 2025, Pasha Tatashin wrote:
>
> >> > if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order > MAX_PAGE_ORDER))
> >> > return NULL;
> >>
> >> See my patch that drops this restriction:
> >> https://lore.kernel.org/linux-mm/20251206230222.853493-2-pratyush@kernel.org/
> >>
> >> I think it was wrong to add it in the first place.
> >
> > Agree, the restriction can be removed. Indeed, it is wrong as it is
> > not enforced during preservation.
> >
> > However, I think we are going to be in a world of pain if we allow
> > preserving memory from different topologies within the same order. In
> > kho_preserve_pages(), we have to check if the first and last page are
> > from the same nid; if not, reduce the order by 1 and repeat until they
> > are. It is just wrong to intermix different memory into the same
> > order, so in addition to removing that restriction, I think we should
> > implement this enforcement.
>
> Sure, makes sense.
>
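To make the order-reduction idea concrete, here is a rough user-space
sketch, not the actual kho_preserve_pages() code. sketch_pfn_to_nid()
and NODE_SPAN_PFNS are made-up stand-ins for the real PFN-to-node
lookup, and the sketch ignores the alignment handling the real function
would also need:

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical node layout, for illustration only: every
 * NODE_SPAN_PFNS consecutive PFNs belong to the same node.
 */
#define NODE_SPAN_PFNS 1024UL

/* Stand-in for the kernel's PFN-to-node lookup (an assumption). */
static int sketch_pfn_to_nid(unsigned long pfn)
{
	return (int)(pfn / NODE_SPAN_PFNS);
}

/*
 * Return the largest order <= max_order for which the first and the
 * last page of the block starting at pfn are on the same node.
 * Order 0 is a single page, so it always qualifies.
 */
static unsigned int sketch_same_nid_order(unsigned long pfn,
					  unsigned int max_order)
{
	unsigned int order = max_order;

	/* Keep the shift below well defined. */
	assert(max_order < sizeof(unsigned long) * CHAR_BIT);

	while (order > 0) {
		unsigned long last = pfn + (1UL << order) - 1;

		if (sketch_pfn_to_nid(pfn) == sketch_pfn_to_nid(last))
			break;
		order--;	/* block straddles two nodes, try a smaller one */
	}

	return order;
}
```

With this layout, a block starting at PFN 1016 with max_order 4 would
cross into the next node at PFN 1024, so the loop settles on order 3.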
> >
> > Also, perhaps we should pass the NID in the Jason's radix tree
> > together with the order. We could have a single tree that encodes both
> > order and NID information in the top level, or we can have one tree
> > per NID. It does not really matter to me, but that should help us with
> > faster struct page initialization.
>
> Can we use NIDs in ABI? Do they stay stable across reboots? I never
> looked at how NIDs actually get assigned.
>
> Not sure if we should target it for the initial merge of the radix tree,
> but I think this is something we can try to figure out later down the
> line.
>
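For the single-tree variant, something like the following is what I
have in mind. This is only a sketch: the bit widths and all of the
sketch_* names are invented here and are not taken from Jason's
patches, and it assumes the question of whether NIDs are stable enough
to be part of the ABI gets a positive answer:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field widths, chosen only for this example. */
#define SKETCH_ORDER_BITS	6U
#define SKETCH_NID_BITS		10U
#define SKETCH_ORDER_MASK	((1U << SKETCH_ORDER_BITS) - 1)
#define SKETCH_NID_MASK		((1U << SKETCH_NID_BITS) - 1)

/* Pack nid and order into one top-level index: [ nid | order ]. */
static uint32_t sketch_top_index(unsigned int nid, unsigned int order)
{
	assert(nid <= SKETCH_NID_MASK);
	assert(order <= SKETCH_ORDER_MASK);

	return ((uint32_t)nid << SKETCH_ORDER_BITS) | order;
}

/* Recover the node ID from a top-level index. */
static unsigned int sketch_top_index_nid(uint32_t idx)
{
	return (idx >> SKETCH_ORDER_BITS) & SKETCH_NID_MASK;
}

/* Recover the order from a top-level index. */
static unsigned int sketch_top_index_order(uint32_t idx)
{
	return idx & SKETCH_ORDER_MASK;
}
```

The one-tree-per-NID alternative would simply drop the nid field and
keep an array of roots instead.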
> >
> >> >> To get the nid, you would need to call early_pfn_to_nid(). This takes a
> >> >> spinlock and searches through all memblock memory regions. I don't think
> >> >> it is too expensive, but it isn't free either. And all this would be
> >> >> done serially. With the zone search, you at least have some room for
> >> >> concurrency.
> >> >>
> >> >> I think either approach only makes a difference when we have a large
> >> >> number of low-order preservations. If we have a handful of high-order
> >> >> preservations, I suppose the overhead of nid search would be negligible.
> >> >
> >> > We should be targeting a situation where the vast majority of the
> >> > preserved memory is HugeTLB, but I am still worried about lower order
> >> > preservation efficiency for IOMMU page tables, etc.
> >>
> >> Yep. Plus we might get VMMs stashing some of their state in a memfd too.
> >
> > Yes, that is true, but hopefully those are tiny compared to everything else.
> >
> >> >> Long term, I think we should hook this into page_alloc_init_late() so
> >> >> that all the KHO pages also get initialized along with all the other
> >> >> pages. This will result in better integration of KHO with rest of MM
> >> >> init, and also have more consistent page restore performance.
> >> >
> >> > But we keep KHO as reserved memory, and hooking it up into
> >> > page_alloc_init_late() would make it very different, since that memory
> >> > is part of the buddy allocator memory...
> >>
> >> The idea I have is to have a separate call in page_alloc_init_late()
> >> that initializes KHO pages. It would traverse the radix tree (probably in
> >> parallel by distributing the address space across multiple threads?) and
> >> initialize all the pages. Then kho_restore_page() would only have to
> >> double-check the magic and it can directly return the page.
> >
> > I kind of do not like relying on magic to decide whether to initialize
> > the struct page. I would prefer to avoid this magic marker altogether:
> > i.e. struct page is either initialized or not, not halfway
> > initialized, etc.
>
> The magic is purely sanity checking. It is not used to decide anything
> other than to make sure this is actually a KHO page. I don't intend to
> change that. My point is, if we make sure the KHO pages are properly
> initialized during MM init, then restoring can actually be a very cheap
> operation, where you only do the sanity checking. You can even put the
> magic check behind CONFIG_KEXEC_HANDOVER_DEBUG if you want, but I think
> it is useful enough to keep in production systems too.
It is part of a critical hot path during blackout, so it should really
be behind CONFIG_KEXEC_HANDOVER_DEBUG.
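Roughly, I mean something along these lines. This is a user-space
sketch, not the real kho_restore_page(): struct sketch_page_info,
SKETCH_KHO_PAGE_MAGIC and its value are placeholders made up for the
example:

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder magic value, for illustration only. */
#define SKETCH_KHO_PAGE_MAGIC 0x4b484f50U

/* Hypothetical stand-in for the per-page info stored by KHO. */
struct sketch_page_info {
	uint32_t magic;
	unsigned int order;
};

/*
 * Sanity-check a preserved page. The magic comparison is compiled in
 * only for debug builds, so the restore hot path does no extra work
 * in production configurations.
 */
static bool sketch_page_info_valid(const struct sketch_page_info *info)
{
#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
	return info->magic == SKETCH_KHO_PAGE_MAGIC;
#else
	(void)info;	/* check compiled out */
	return true;
#endif
}
```

Without CONFIG_KEXEC_HANDOVER_DEBUG defined, the helper always returns
true and the compiler can drop the call entirely.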
> > Magic is not reliable. During machine reset in many firmware
> > implementations, and in every kexec reboot, memory is not zeroed. The
> > kernel usually allocates vmemmap using exactly the same pages, so
> > there is just too high a chance of getting magic values accidentally
> > inherited from the previous boot.
>
> I don't think that can happen. All the pages are zeroed when
> initialized, which will clear the magic. We should only be setting the
> magic on an initialized struct page.
This can happen due to bugs when we use a partially initialized
"struct page", something that Mike has been looking to do: that is,
passing some information in a struct page before it is fully
initialized.
> >> Radix tree makes parallelism easier than the linked lists we have now.
> >
> > Agree, radix tree can absolutely help with parallelism.
> >
> >> >> Jason's radix tree patches will make that a bit easier to do I think.
> >> >> The zone search will scale better I reckon.
> >> >
> >> > It could, perhaps early in boot we should reserve the radix tree, and
> >> > use it as a source of truth look-ups later in boot?
> >>
> >> Yep. I think the radix tree should mark its own pages as preserved too
> >> so they stick around later in boot.
> >
> > Unfortunately, this can only be done in the new kernel, not in the old
> > kernel; otherwise we can end up with a recursive dependency that may
> > never be satisfied.
>
> Right. It shouldn't be too hard to do in the new kernel though. We will
> walk the whole tree anyway.
>
> --
> Regards,
> Pratyush Yadav