Re: [PATCH] kho: add support for deferred struct page init

From: Pasha Tatashin <pasha.tatashin@soleen.com>
To: Pratyush Yadav <pratyush@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>,
	Evangelos Petrongonas <epetron@amazon.de>,
	Alexander Graf <graf@amazon.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jason Miu <jasonmiu@google.com>,
	linux-kernel@vger.kernel.org, kexec@lists.infradead.org,
	linux-mm@kvack.org, nh-open-source@amazon.com
Subject: Re: [PATCH] kho: add support for deferred struct page init
Date: Tue, 23 Dec 2025 12:37:34 -0500
Message-ID: <CA+CK2bCjJWZG_rPoPsHWSxirmUCTOuFQzTCss2AKf9UqpThrdw@mail.gmail.com>
In-Reply-To: <863452cwns.fsf@kernel.org>

> > if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order > MAX_PAGE_ORDER))
> > return NULL;
>
> See my patch that drops this restriction:
> https://lore.kernel.org/linux-mm/20251206230222.853493-2-pratyush@kernel.org/
>
> I think it was wrong to add it in the first place.

Agreed, the restriction can be removed. It is indeed wrong, since it is
not enforced during preservation.

However, I think we are going to be in a world of pain if we allow a
single preservation of a given order to span memory from different NUMA
nodes. In kho_preserve_pages(), we have to check whether the first and
last pages are on the same nid; if not, reduce the order by 1 and repeat
until they are. It is just wrong to intermix memory from different nodes
in the same order, so in addition to removing that restriction, I think
we should implement this enforcement.
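
As a rough sketch of that check (plain userspace C, not kernel code;
struct nid_range, range_pfn_to_nid() and same_nid_order() are made-up
names, standing in for a real pfn-to-nid lookup such as
early_pfn_to_nid()), the order reduction could look something like this.
It assumes each node covers a contiguous PFN range, so comparing only
the first and last page is enough:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical PFN -> NUMA node table; not a kernel data structure. */
struct nid_range {
	unsigned long start_pfn;	/* inclusive */
	unsigned long end_pfn;		/* exclusive */
	int nid;
};

/* Linear lookup of the node that owns pfn; returns -1 if none does. */
static int range_pfn_to_nid(const struct nid_range *map, size_t n,
			    unsigned long pfn)
{
	for (size_t i = 0; i < n; i++)
		if (pfn >= map[i].start_pfn && pfn < map[i].end_pfn)
			return map[i].nid;
	return -1;
}

/*
 * Return the largest order <= max_order for which the first and last
 * page of the block starting at pfn are on the same node. The order is
 * reduced by 1 on every mismatch; order 0 always qualifies.
 */
static unsigned int same_nid_order(const struct nid_range *map, size_t n,
				   unsigned long pfn, unsigned int max_order)
{
	unsigned int order = max_order;
	int first_nid = range_pfn_to_nid(map, n, pfn);

	assert(max_order < sizeof(unsigned long) * 8);

	while (order > 0) {
		unsigned long last_pfn = pfn + (1UL << order) - 1;

		if (range_pfn_to_nid(map, n, last_pfn) == first_nid)
			break;
		order--;
	}
	return order;
}
```

With two nodes split at PFN 1024, an order-10 request starting at PFN
512 would be clamped to order 9 here, and the remainder would have to
be preserved as separate blocks on the other node.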

Also, perhaps we should pass the NID in Jason's radix tree
together with the order. We could have a single tree that encodes both
order and NID information in the top level, or we can have one tree
per NID. It does not really matter to me, but that should help us with
faster struct page initialization.
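
To illustrate the single-tree variant, here is a minimal sketch of a
key that carries the NID and the order above the page index. The bit
widths and the sk_* names are assumptions made up for this example;
they are not taken from Jason's patches:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen only for illustration. */
#define SK_INDEX_BITS	48
#define SK_ORDER_BITS	6
#define SK_NID_BITS	10

#define SK_INDEX_MASK	((UINT64_C(1) << SK_INDEX_BITS) - 1)
#define SK_ORDER_MASK	((UINT64_C(1) << SK_ORDER_BITS) - 1)
#define SK_NID_MASK	((UINT64_C(1) << SK_NID_BITS) - 1)

/* Pack nid into the top bits, then order, then the page index. */
static uint64_t sk_key_encode(unsigned int nid, unsigned int order,
			      uint64_t index)
{
	assert(nid <= SK_NID_MASK);
	assert(order <= SK_ORDER_MASK);
	assert(index <= SK_INDEX_MASK);

	return ((uint64_t)nid << (SK_ORDER_BITS + SK_INDEX_BITS)) |
	       ((uint64_t)order << SK_INDEX_BITS) | index;
}

static unsigned int sk_key_nid(uint64_t key)
{
	return (unsigned int)(key >> (SK_ORDER_BITS + SK_INDEX_BITS));
}

static unsigned int sk_key_order(uint64_t key)
{
	return (unsigned int)((key >> SK_INDEX_BITS) & SK_ORDER_MASK);
}

static uint64_t sk_key_index(uint64_t key)
{
	return key & SK_INDEX_MASK;
}
```

Because the NID sits in the top bits, all keys for one node sort
together, so the restore side could read the nid straight from the key
instead of looking it up per page.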

> >> To get the nid, you would need to call early_pfn_to_nid(). This takes a
> >> spinlock and searches through all memblock memory regions. I don't think
> >> it is too expensive, but it isn't free either. And all this would be
> >> done serially. With the zone search, you at least have some room for
> >> concurrency.
> >>
> >> I think either approach only makes a difference when we have a large
> >> number of low-order preservations. If we have a handful of high-order
> >> preservations, I suppose the overhead of nid search would be negligible.
> >
> > We should be targeting a situation where the vast majority of the
> > preserved memory is HugeTLB, but I am still worried about lower order
> > preservation efficiency for IOMMU page tables, etc.
>
> Yep. Plus we might get VMMs stashing some of their state in a memfd too.

Yes, that is true, but hopefully those are tiny compared to everything else.

> >> Long term, I think we should hook this into page_alloc_init_late() so
> >> that all the KHO pages also get initialized along with all the other
> >> pages. This will result in better integration of KHO with rest of MM
> >> init, and also have more consistent page restore performance.
> >
> > But we keep KHO as reserved memory, and hooking it up into
> > page_alloc_init_late() would make it very different, since that memory
> > is part of the buddy allocator memory...
>
> The idea I have is to have a separate call in page_alloc_init_late()
> that initializes KHO pages. It would traverse the radix tree (probably in
> parallel by distributing the address space across multiple threads?) and
> initialize all the pages. Then kho_restore_page() would only have to
> double-check the magic and it can directly return the page.

I kind of do not like relying on magic to decide whether to initialize
the struct page. I would prefer to avoid this magic marker altogether:
i.e. struct page is either initialized or not, not halfway
initialized, etc.

Magic is not reliable. During machine reset in many firmware
implementations, and in every kexec reboot, memory is not zeroed. The
kernel usually allocates vmemmap using exactly the same pages, so
there is just too high a chance of getting magic values accidentally
inherited from the previous boot.

> Radix tree makes parallelism easier than the linked lists we have now.

Agreed, a radix tree can absolutely help with parallelism.

> >> Jason's radix tree patches will make that a bit easier to do I think.
> >> The zone search will scale better I reckon.
> >
> > It could. Perhaps early in boot we should reserve the radix tree, and
> > use it as the source of truth for look-ups later in boot?
>
> Yep. I think the radix tree should mark its own pages as preserved too
> so they stick around later in boot.

Unfortunately, this can only be done in the new kernel, not in the old
kernel; otherwise we can end up with a recursive dependency that may
never be satisfied.

Pasha
