linux-mm.kvack.org archive mirror
From: "Michał Cłapiński" <mclapinski@google.com>
To: Zi Yan <ziy@nvidia.com>
Cc: Evangelos Petrongonas <epetron@amazon.de>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	 Mike Rapoport <rppt@kernel.org>,
	Pratyush Yadav <pratyush@kernel.org>,
	Alexander Graf <graf@amazon.com>,
	 Samiullah Khawaja <skhawaja@google.com>,
	kexec@lists.infradead.org, linux-mm@kvack.org,
	 linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v7 2/3] kho: fix deferred init of kho scratch
Date: Wed, 18 Mar 2026 18:19:35 +0100
Message-ID: <CAAi7L5fvUPUqd3A4m6wubeYh90NH+S2KBRE1R8WKerBYhkU8kg@mail.gmail.com>
In-Reply-To: <0D1F59C7-CA35-49C8-B341-32D8C7F4A345@nvidia.com>

On Wed, Mar 18, 2026 at 6:08 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 18 Mar 2026, at 11:45, Michał Cłapiński wrote:
>
> > On Wed, Mar 18, 2026 at 4:26 PM Zi Yan <ziy@nvidia.com> wrote:
> >>
> >> On 18 Mar 2026, at 11:18, Michał Cłapiński wrote:
> >>
> >>> On Wed, Mar 18, 2026 at 4:10 PM Zi Yan <ziy@nvidia.com> wrote:
> >>>>
> >>>> On 17 Mar 2026, at 10:15, Michal Clapinski wrote:
> >>>>
> >>>>> Currently, if DEFERRED is enabled, kho_release_scratch will initialize
> >>>>> the struct pages and set migratetype of kho scratch. Unless the whole
> >>>>> scratch fits below first_deferred_pfn, some of that will be overwritten
> >>>>> either by deferred_init_pages or memmap_init_reserved_pages.
> >>>>>
> >>>>> To fix it, I modified kho_release_scratch to only set the migratetype
> >>>>> on already initialized pages. Then, modified init_pageblock_migratetype
> >>>>> to set the migratetype to CMA if the page is located inside scratch.
> >>>>>
> >>>>> Signed-off-by: Michal Clapinski <mclapinski@google.com>
> >>>>> ---
> >>>>>  include/linux/memblock.h           |  2 --
> >>>>>  kernel/liveupdate/kexec_handover.c | 10 ++++++----
> >>>>>  mm/memblock.c                      | 22 ----------------------
> >>>>>  mm/page_alloc.c                    |  7 +++++++
> >>>>>  4 files changed, 13 insertions(+), 28 deletions(-)
> >>>>>
> >>>>
> >>>> <snip>
> >>>>
> >>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>> index ee81f5c67c18..5ca078dde61d 100644
> >>>>> --- a/mm/page_alloc.c
> >>>>> +++ b/mm/page_alloc.c
> >>>>> @@ -55,6 +55,7 @@
> >>>>>  #include <linux/cacheinfo.h>
> >>>>>  #include <linux/pgalloc_tag.h>
> >>>>>  #include <linux/mmzone_lock.h>
> >>>>> +#include <linux/kexec_handover.h>
> >>>>>  #include <asm/div64.h>
> >>>>>  #include "internal.h"
> >>>>>  #include "shuffle.h"
> >>>>> @@ -549,6 +550,12 @@ void __meminit init_pageblock_migratetype(struct page *page,
> >>>>>                    migratetype < MIGRATE_PCPTYPES))
> >>>>>               migratetype = MIGRATE_UNMOVABLE;
> >>>>>
> >>>>> +     /*
> >>>>> +      * Mark KHO scratch as CMA so no unmovable allocations are made there.
> >>>>> +      */
> >>>>> +     if (unlikely(kho_scratch_overlap(page_to_phys(page), PAGE_SIZE)))
> >>>>> +             migratetype = MIGRATE_CMA;
> >>>>> +
> >>>>
> >>>> If this is only for deferred init code, why not put it in deferred_free_pages()?
> >>>> Otherwise, all init_pageblock_migratetype() callers need to pay the penalty
> >>>> of traversing kho_scratch array.
> >>>
> >>> Because reserve_bootmem_region() doesn't call deferred_free_pages().
> >>> So I would also have to modify it.
> >>>
> >>> And the early initialization won't pay the penalty of traversing the
> >>> kho_scratch array, since then kho_scratch is NULL.
> >>
> >> How about hugetlb_bootmem_init_migratetype(), init_cma_pageblock(),
> >> init_cma_reserved_pageblock(), __init_page_from_nid(), memmap_init_range(),
> >> __init_zone_device_page()?
> >>
> >> 1. Do any of them cover a PFN range that overlaps with kho scratch?
> >> 2. Is kho_scratch NULL for them?
> >>
> >> 1 tells us whether putting code in init_pageblock_migratetype() could save
> >> the hassle of changing all above locations.
> >> 2 tells us how many callers are affected by traversing kho_scratch.
> >
> > I could try answering those questions, but:
> >
> > 1. I'm new to this and I'm not sure how correct the answers will be.
> >
> > 2. If you're not using CONFIG_KEXEC_HANDOVER, the performance penalty
> > will be zero.
> > If you are using it, currently you have to disable
> > CONFIG_DEFERRED_STRUCT_PAGE_INIT and the performance hit from this is
> > far, far greater. This solution saves 0.5s on my setup (100GB of
> > memory). We can always improve the performance further in the future.
> >
>
> OK, I asked Claude for help and the answer is that not all callers of
> init_pageblock_migratetype() touch kho scratch memory regions. Basically,
> you only need to perform the kho_scratch_overlap() check in
> __init_page_from_nid() to achieve the same end result.
>
>
> The analysis from Claude is further below.
> Based on my understanding:
> 1. memmap_init_range() is done before kho_memory_init(), so it does not need
> the check.
>
> 2. __init_zone_device_page() is not relevant.
>
> 3. init_cma_reserved_pageblock() / init_cma_pageblock() already set
> MIGRATE_CMA.
>
> 4. hugetlb is not used by kho scratch, so also does not need the check.
>
> 5. kho_release_scratch() already takes care of it.
>
> The remaining memblock_free_pages() needs a check, but I am not 100% sure.
>
>
> # kho_scratch_overlap() in init_pageblock_migratetype() — scope analysis
>
> ## Context
>
> Commit a7700b3c6779 ("kho: fix deferred init of kho scratch") added a
> kho_scratch_overlap() call inside init_pageblock_migratetype() in
> mm/page_alloc.c:
>
> ```c
> if (unlikely(kho_scratch_overlap(page_to_phys(page), PAGE_SIZE)))
>     migratetype = MIGRATE_CMA;
> ```
>
> kho_scratch_overlap() does a NULL check followed by a loop over the
> kho_scratch array. For non-KHO boots (kho_scratch == NULL) the cost is
> a single NULL load and branch. For KHO boots the loop runs on every call
> to init_pageblock_migratetype().
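
To make the cost described above concrete, here is a minimal userspace
sketch of a "NULL check, then loop" overlap test. It is not the kernel's
kho_scratch_overlap(); struct scratch_region, scratch_tbl, scratch_cnt
and scratch_overlap() are hypothetical names chosen for illustration, and
the region layout is made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scratch descriptor, not the kernel struct. */
struct scratch_region {
	uint64_t addr;
	uint64_t size;
};

/* A NULL table models a boot without KHO scratch. */
static const struct scratch_region *scratch_tbl;
static size_t scratch_cnt;

/* True if [phys, phys + size) intersects any scratch region. */
static bool scratch_overlap(uint64_t phys, uint64_t size)
{
	if (!scratch_tbl)	/* no scratch: one load and one branch */
		return false;

	for (size_t i = 0; i < scratch_cnt; i++) {
		uint64_t start = scratch_tbl[i].addr;
		uint64_t end = start + scratch_tbl[i].size;

		/* Half-open intervals overlap when each starts before the other ends. */
		if (phys < end && start < phys + size)
			return true;
	}
	return false;
}

/* Small self-check that exercises both the NULL and the loop path. */
void scratch_overlap_demo(void)
{
	static const struct scratch_region regions[] = {
		{ 0x100000, 0x200000 },	/* covers [0x100000, 0x300000) */
	};

	assert(!scratch_overlap(0x100000, 4096));	/* no table yet */

	scratch_tbl = regions;
	scratch_cnt = 1;
	assert(scratch_overlap(0x2ff000, 4096));	/* last page inside */
	assert(!scratch_overlap(0x300000, 4096));	/* first page past the end */
}
```

With a handful of regions the loop is short, but it does run once per
call, which is the overhead being discussed.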
>
> ## Question
>
> Does this add overhead for callers whose memory range cannot overlap
> with scratch? Can the check be moved to the caller side?
>
> ## Call site analysis
>
> init_pageblock_migratetype() has nine call sites. The init call ordering
> relevant to scratch is:
>
> ```
> setup_arch()
>   zone_sizes_init() -> free_area_init() -> memmap_init_range()   [1]
>
> mm_init_free_all() / start_kernel():
>   kho_memory_init() -> kho_release_scratch()                     [2]
>   memblock_free_all()
>     free_low_memory_core_early()
>       memmap_init_reserved_pages()
>         reserve_bootmem_region() -> __init_deferred_page()
>           -> __init_page_from_nid()                              [3]
>   deferred init kthreads -> __init_page_from_nid()               [4]
> ```

I don't understand this. deferred_free_pages() doesn't call
__init_page_from_nid(). So I would clearly need to modify both
deferred_free_pages() and __init_page_from_nid().

>
> ### Per call site
>
> **mm/mm_init.c — __init_page_from_nid() (deferred init)**
>
> Called for every deferred pfn (>= first_deferred_pfn). Scratch pages
> in the deferred range are not touched by kho_release_scratch() (new
> code clips end_pfn to first_deferred_pfn) and not touched by
> memmap_init_range() (stops at first_deferred_pfn). This path sets
> MIGRATE_MOVABLE on deferred scratch pageblocks after
> kho_release_scratch() has already run.
>
> **Needs the fix: yes.**
>
> Both sub-paths that reach this function for deferred scratch pages:
> - deferred init kthreads [4]
> - reserve_bootmem_region() -> __init_deferred_page() [3]
>   (early_page_initialised() returns early for non-deferred pfns, so
>   __init_page_from_nid() is only reached for deferred pfns here too)
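
The clipping of end_pfn to first_deferred_pfn mentioned above can be
sketched as follows. This is a userspace model under stated assumptions,
not the patch itself: PAGEBLOCK_NR_PAGES is fixed at 512,
mark_initialized_scratch() is a hypothetical name, start_pfn is assumed
to be pageblock-aligned, and "marking" just records the PFN in an array.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumed pageblock size in pages; the real value depends on the config. */
#define PAGEBLOCK_NR_PAGES 512UL

/*
 * Record the first PFN of every pageblock in [start_pfn, end_pfn) that
 * lies below first_deferred_pfn. Blocks at or above that limit are left
 * for the deferred path. Returns the number of blocks recorded.
 */
static size_t mark_initialized_scratch(unsigned long start_pfn,
				       unsigned long end_pfn,
				       unsigned long first_deferred_pfn,
				       unsigned long *marked, size_t cap)
{
	size_t n = 0;

	/* Clip the range so only already-initialized pages are touched. */
	if (end_pfn > first_deferred_pfn)
		end_pfn = first_deferred_pfn;

	for (unsigned long pfn = start_pfn; pfn < end_pfn && n < cap;
	     pfn += PAGEBLOCK_NR_PAGES)
		marked[n++] = pfn;

	return n;
}

/* Small self-check: a range that straddles the deferred boundary. */
void clip_demo(void)
{
	unsigned long marked[8];

	/* Only the two blocks below PFN 1024 are marked. */
	assert(mark_initialized_scratch(0, 2048, 1024, marked, 8) == 2);
	assert(marked[0] == 0 && marked[1] == 512);
}
```

A scratch range that sits entirely above first_deferred_pfn yields zero
marked blocks in this model, which is the case where a later init path
has to set the migratetype instead.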
>
> **mm/mm_init.c — memmap_init_range()**
>
> Runs during setup_arch() [1], before kho_memory_init() [2]. Sets
> MIGRATE_MOVABLE on scratch pageblocks, but kho_release_scratch() runs
> afterward and correctly overrides to MIGRATE_CMA for non-deferred
> scratch. For deferred scratch, memmap_init_range() stops at
> first_deferred_pfn and never processes them.
>
> **Needs the fix: no.**
>
> **mm/mm_init.c — __init_zone_device_page()**
>
> ZONE_DEVICE path only. Scratch is normal RAM, not ZONE_DEVICE.
>
> **Needs the fix: no.**
>
> **mm/mm_init.c — memblock_free_pages() (lines ~2012 and ~2023)**
>
> Called by memblock_free_all() for free (non-reserved) memblock regions.
> Scratch is memblock-reserved and released through the CMA path, not
> through memblock_free_all().
>
> **Needs the fix: no.**
>
> **mm/mm_init.c — init_cma_reserved_pageblock() / init_cma_pageblock()**
>
> Both already pass MIGRATE_CMA. The kho_scratch_overlap() check would
> be redundant even if scratch reaches these paths.
>
> **Needs the fix: no (redundant).**
>
> **mm/hugetlb.c — __prep_compound_gigantic_folio()**
>
> Gigantic hugepage setup. Scratch regions are not used for gigantic
> hugepages.
>
> **Needs the fix: no.**
>
> **kernel/liveupdate/kexec_handover.c — kho_release_scratch()**
>
> Already passes MIGRATE_CMA. Additionally, kho_scratch is NULL at the
> point kho_release_scratch() runs (kho_memory_init() sets kho_scratch
> only after kho_release_scratch() returns), so kho_scratch_overlap()
> would return false regardless.
>
> **Needs the fix: no.**
>
> ## Conclusion
>
> The only path that actually requires the MIGRATE_CMA override is
> __init_page_from_nid(). All problematic sub-paths (deferred init
> kthreads and reserve_bootmem_region()) converge there.
>
> The check could be moved to __init_page_from_nid() to keep the
> KHO-specific concern out of the generic init_pageblock_migratetype():
>
> ```c
> /* mm/mm_init.c: __init_page_from_nid() */
> if (pageblock_aligned(pfn)) {
>     enum migratetype mt = MIGRATE_MOVABLE;
>     if (kho_scratch_overlap(PFN_PHYS(pfn), PAGE_SIZE))
>         mt = MIGRATE_CMA;
>     init_pageblock_migratetype(pfn_to_page(pfn), mt, false);
> }
> ```
>
> __init_page_from_nid() is only compiled under CONFIG_DEFERRED_STRUCT_PAGE_INIT,
> which is the only configuration where the bug can occur, so the
> kho_scratch_overlap() call would be naturally gated by that config.
>
>
>
> Best Regards,
> Yan, Zi

