On Tue, 2021-05-11 at 13:05 +0300, Mike Rapoport wrote:
> From: Mike Rapoport
> 
> The struct pages representing a reserved memory region are initialized
> using reserve_bootmem_range() function. This function is called for each
> reserved region just before the memory is freed from memblock to the buddy
> page allocator.
> 
> The struct pages for MEMBLOCK_NOMAP regions are kept with the default
> values set by the memory map initialization which makes it necessary to
> have a special treatment for such pages in pfn_valid() and
> pfn_valid_within().
> 
> Split out initialization of the reserved pages to a function with a
> meaningful name and treat the MEMBLOCK_NOMAP regions the same way as the
> reserved regions and mark struct pages for the NOMAP regions as
> PageReserved.
> 
> Signed-off-by: Mike Rapoport
> Reviewed-by: David Hildenbrand
> Reviewed-by: Anshuman Khandual
> ---
>  include/linux/memblock.h |  4 +++-
>  mm/memblock.c            | 28 ++++++++++++++++++++++++++--
>  2 files changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 5984fff3f175..1b4c97c151ae 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -30,7 +30,9 @@ extern unsigned long long max_possible_pfn;
>   * @MEMBLOCK_NONE: no special request
>   * @MEMBLOCK_HOTPLUG: hotpluggable region
>   * @MEMBLOCK_MIRROR: mirrored region
> - * @MEMBLOCK_NOMAP: don't add to kernel direct mapping
> + * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
> + * reserved in the memory map; refer to memblock_mark_nomap() description
> + * for further details
>   */
>  enum memblock_flags {
>  	MEMBLOCK_NONE = 0x0,	/* No special request */
> diff --git a/mm/memblock.c b/mm/memblock.c
> index afaefa8fc6ab..3abf2c3fea7f 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -906,6 +906,11 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
>   * @base: the base phys addr of the region
>   * @size: the size of the region
>   *
> + * The memory regions marked with %MEMBLOCK_NOMAP will not be added to the
> + * direct mapping of the physical memory. These regions will still be
> + * covered by the memory map. The struct page representing NOMAP memory
> + * frames in the memory map will be PageReserved()
> + *
>   * Return: 0 on success, -errno on failure.
>   */
>  int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size)
> @@ -2002,6 +2007,26 @@ static unsigned long __init __free_memory_core(phys_addr_t start,
>  	return end_pfn - start_pfn;
>  }
>  
> +static void __init memmap_init_reserved_pages(void)
> +{
> +	struct memblock_region *region;
> +	phys_addr_t start, end;
> +	u64 i;
> +
> +	/* initialize struct pages for the reserved regions */
> +	for_each_reserved_mem_range(i, &start, &end)
> +		reserve_bootmem_region(start, end);
> +
> +	/* and also treat struct pages for the NOMAP regions as PageReserved */
> +	for_each_mem_region(region) {
> +		if (memblock_is_nomap(region)) {
> +			start = region->base;
> +			end = start + region->size;
> +			reserve_bootmem_region(start, end);
> +		}
> +	}
> +}
> +

In some cases, that whole call to reserve_bootmem_region() may be a
no-op because pfn_valid() is not true for *any* address in that range.
But reserve_bootmem_region() spends a long time iterating over them
all, and eventually doing nothing:

void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end,
				      int nid)
{
	unsigned long start_pfn = PFN_DOWN(start);
	unsigned long end_pfn = PFN_UP(end);

	for (; start_pfn < end_pfn; start_pfn++) {
		if (pfn_valid(start_pfn)) {
			struct page *page = pfn_to_page(start_pfn);

			init_reserved_page(start_pfn, nid);

			/*
			 * no need for atomic set_bit because the struct
			 * page is not visible yet so nobody should
			 * access it yet.
			 */
			__SetPageReserved(page);
		}
	}
}

On platforms with large NOMAP regions (e.g. regions which are actually
reserved for guest memory, to keep it out of the Linux address map and
allow for kexec-based live update of the hypervisor), this pointless
loop ends up taking a significant amount of time which is visible as
guest steal time during the live update.

Can reserve_bootmem_region() skip the loop *completely* if no PFN in
the range from start to end is valid? Or tweak the loop itself to have
an 'else' case which skips to the next valid PFN? Something like

	for (...) {
		if (pfn_valid(start_pfn)) {
			...
		} else {
			start_pfn = next_valid_pfn(start_pfn);
		}
	}
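
To make the second option concrete, here is a rough, untested sketch.
It assumes SPARSEMEM, where pfn_valid() can mostly only change value
at mem_section granularity, so hitting one invalid PFN lets us jump
straight to the next section boundary. next_valid_pfn() is entirely
hypothetical (there is no such generic helper today), and despite its
name it only returns the next *candidate* PFN, which the loop then
re-checks. Classic SPARSEMEM without VMEMMAP would also have to honor
the sub-section validity that pfn_valid() checks via
pfn_section_valid(), so this is a sketch of the idea, not a patch:

/* Hypothetical helper: first PFN of the next mem_section. */
static unsigned long next_valid_pfn(unsigned long pfn)
{
	return section_nr_to_pfn(pfn_to_section_nr(pfn) + 1);
}

void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end,
				      int nid)
{
	unsigned long start_pfn = PFN_DOWN(start);
	unsigned long end_pfn = PFN_UP(end);

	while (start_pfn < end_pfn) {
		if (pfn_valid(start_pfn)) {
			struct page *page = pfn_to_page(start_pfn);

			init_reserved_page(start_pfn, nid);

			/*
			 * The struct page is not visible yet, so a
			 * non-atomic set is fine.
			 */
			__SetPageReserved(page);
			start_pfn++;
		} else {
			/*
			 * Everything up to the next section boundary
			 * is invalid too; skip it in one step.
			 */
			start_pfn = next_valid_pfn(start_pfn);
		}
	}
}

That leaves the common all-valid case doing exactly what it does
today, but turns the all-invalid case into one pfn_valid() test per
section rather than one per page.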