From: David Hildenbrand <david@redhat.com>
To: Donet Tom <donettom@linux.ibm.com>, Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>, Zi Yan <ziy@nvidia.com>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
rafael@kernel.org, Danilo Krummrich <dakr@kernel.org>,
Ritesh Harjani <ritesh.list@gmail.com>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Alison Schofield <alison.schofield@intel.com>,
Yury Norov <yury.norov@gmail.com>,
Dave Jiang <dave.jiang@intel.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
Date: Mon, 5 May 2025 09:16:48 +0200 [thread overview]
Message-ID: <188fbfba-afb4-4db7-bbba-7689a96be931@redhat.com> (raw)
In-Reply-To: <a1e0cddc-ed38-4f48-b028-f3ab5025c157@linux.ibm.com>
On 04.05.25 18:34, Donet Tom wrote:
>
> On 5/4/25 4:39 PM, Mike Rapoport wrote:
>> On Sat, May 03, 2025 at 11:10:12AM +0530, Donet Tom wrote:
>>> During node device initialization, `memory blocks` are registered under
>>> each NUMA node. The `memory blocks` to be registered are identified using
>>> the node’s start and end PFNs, which are obtained from the node's pg_data.
>>>
>>> However, not all PFNs within this range necessarily belong to the same
>>> node—some may belong to other nodes. Additionally, due to the
>>> discontiguous nature of physical memory, certain sections within a
>>> `memory block` may be absent.
>>>
>>> As a result, `memory blocks` that fall between a node’s start and end
>>> PFNs may span across multiple nodes, and some sections within those blocks
>>> may be missing. `Memory blocks` have a fixed size, which is architecture
>>> dependent.
>>>
>>> Due to these considerations, the memory block registration is currently
>>> performed as follows:
>>>
>>>    for_each_online_node(nid):
>>>        start_pfn = pgdat->node_start_pfn;
>>>        end_pfn = pgdat->node_start_pfn + node_spanned_pages;
>>>        for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn)):
>>>            mem_blk = memory_block_id(pfn_to_section_nr(pfn));
>>>            pfn_mb_start = section_nr_to_pfn(mem_blk->start_section_nr);
>>>            pfn_mb_end = pfn_mb_start + memory_block_pfns - 1;
>>>            for (pfn = pfn_mb_start; pfn <= pfn_mb_end; pfn++):
>>>                if (get_nid_for_pfn(pfn) != nid):
>>>                    continue;
>>>                else:
>>>                    do_register_memory_block_under_node(nid, mem_blk,
>>>                                                        MEMINIT_EARLY);
>>>
>>> Here, we derive the start and end PFNs from the node's pg_data, then
>>> determine the memory blocks that may belong to the node. For each
>>> `memory block` in this range, we inspect all PFNs it contains and check
>>> their associated NUMA node ID. If a PFN within the block matches the
>>> current node, the memory block is registered under that node.
>>>
>>> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
>>> a binary search in the `memblock regions` to determine the NUMA node ID
>>> for a given PFN. If it is not enabled, the node ID is retrieved directly
>>> from the struct page.
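>>>
>>> For illustration, the helper roughly splits like this (a sketch; the
>>> exact code in drivers/base/node.c may differ in detail):
>>>
>>>    static int get_nid_for_pfn(unsigned long pfn)
>>>    {
>>>    #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>>>            /* struct pages may not be initialized yet: binary search in memblock */
>>>            if (system_state < SYSTEM_RUNNING)
>>>                    return early_pfn_to_nid(pfn);
>>>    #endif
>>>            /* struct page is initialized: read the node id from it directly */
>>>            return pfn_to_nid(pfn);
>>>    }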
>>>
>>> On large systems, this process can become time-consuming, especially since
>>> we iterate over each `memory block` and all PFNs within it until a match is
>>> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
>>> overhead of the binary search increases the execution time significantly,
>>> potentially leading to soft lockups during boot.
>>>
>>> In this patch, we instead iterate over the `memblock regions` to identify
>>> the `memory blocks` that belong to the current NUMA node. `memblock regions`
>>> are contiguous memory ranges, each associated with a single NUMA node, and
>>> they do not span multiple nodes.
>>>
>>>    for_each_online_node(nid):
>>>        for_each_memory_region(r):  // r => memblock region
>>>            if (r->nid != nid):
>>>                continue;
>>>            else:
>>>                for_each_memory_block_between(r->base, r->base + r->size - 1):
>>>                    do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
>>>
>>> We iterate over all `memblock regions` and identify those that belong to
>>> the current NUMA node. For each `memblock region` associated with the
>>> current node, we calculate the start and end `memory blocks` based on the
>>> region's start and end PFNs. We then register all `memory blocks` within
>>> that range under the current node.
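>>>
>>> As a hypothetical worked example (the memory block size is architecture
>>> dependent; 256MB is assumed here purely for illustration):
>>>
>>>    region: base = 0x10000000, size = 0x30000000, nid = 2
>>>    start_block_id = phys_to_block_id(0x10000000)                  = 1
>>>    end_block_id   = phys_to_block_id(0x10000000 + 0x30000000 - 1) = 3
>>>    => memory blocks 1, 2 and 3 are registered under node 2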
>>>
>>> Test results on my system with 32TB RAM
>>> =======================================
>>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>>
>>> Without this patch
>>> ------------------
>>> Startup finished in 1min 16.528s (kernel)
>>>
>>> With this patch
>>> ---------------
>>> Startup finished in 17.236s (kernel) - 78% Improvement
>>>
>>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>>
>>> Without this patch
>>> ------------------
>>> Startup finished in 28.320s (kernel)
>>>
>>> With this patch
>>> ---------------
>>> Startup finished in 15.621s (kernel) - 46% Improvement
>>>
>>> Acked-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>
>>> ---
>>> v2 -> v3
>>>
>>> Fixed indentation issues, made `start_block_id` and `end_block_id` constants,
>>> and moved variable declarations to the places where they are needed.
>>>
>>> v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
>>> v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
>>> ---
>>> drivers/base/memory.c | 4 ++--
>>> drivers/base/node.c | 38 ++++++++++++++++++++++++++++++++++++++
>>> include/linux/memory.h | 2 ++
>>> include/linux/node.h | 11 +++++------
>>> 4 files changed, 47 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>>> index 19469e7f88c2..7f1d266ae593 100644
>>> --- a/drivers/base/memory.c
>>> +++ b/drivers/base/memory.c
>>> @@ -60,7 +60,7 @@ static inline unsigned long pfn_to_block_id(unsigned long pfn)
>>> return memory_block_id(pfn_to_section_nr(pfn));
>>> }
>>>
>>> -static inline unsigned long phys_to_block_id(unsigned long phys)
>>> +unsigned long phys_to_block_id(unsigned long phys)
>>> {
>>> return pfn_to_block_id(PFN_DOWN(phys));
>>> }
>>> @@ -632,7 +632,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
>>> *
>>> * Called under device_hotplug_lock.
>>> */
>>> -static struct memory_block *find_memory_block_by_id(unsigned long block_id)
>>> +struct memory_block *find_memory_block_by_id(unsigned long block_id)
>>> {
>>> struct memory_block *mem;
>>>
>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>> index cd13ef287011..0f8a4645b26c 100644
>>> --- a/drivers/base/node.c
>>> +++ b/drivers/base/node.c
>>> @@ -20,6 +20,7 @@
>>> #include <linux/pm_runtime.h>
>>> #include <linux/swap.h>
>>> #include <linux/slab.h>
>>> +#include <linux/memblock.h>
>>>
>>> static const struct bus_type node_subsys = {
>>> .name = "node",
>>> @@ -850,6 +851,43 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>>> kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
>>> }
>>>
>>> +/*
>>> + * register_memory_blocks_under_node_early : Register the memory
>>> + * blocks under the current node.
>>> + * @nid : Current node under registration
>>> + *
>>> + * This function iterates over all memblock regions and identifies the regions
>>> + * that belong to the current node. For each region which belongs to current
>>> + * node, it calculates the start and end memory blocks based on the region's
>>> + * start and end PFNs. It then registers all memory blocks within that range
>>> + * under the current node.
>>> + */
>>> +void register_memory_blocks_under_node_early(int nid)
>>> +{
>>> + struct memblock_region *r;
>>> +
>>> + for_each_mem_region(r) {
>>> + if (r->nid != nid)
>>> + continue;
>>> +
>>> + const unsigned long start_block_id = phys_to_block_id(r->base);
>>> + const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
>>> + unsigned long block_id;
>>> +
>>> + for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
>>> + struct memory_block *mem;
>>> +
>>> + mem = find_memory_block_by_id(block_id);
>>> + if (!mem)
>>> + continue;
>>> +
>>> + do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
>>> + put_device(&mem->dev);
>>> + }
>>> +
>>> + }
>>> +}
>>> +
>>> void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
>>> unsigned long end_pfn,
>>> enum meminit_context context)
>>> diff --git a/include/linux/memory.h b/include/linux/memory.h
>>> index 12daa6ec7d09..cb8579226536 100644
>>> --- a/include/linux/memory.h
>>> +++ b/include/linux/memory.h
>>> @@ -171,6 +171,8 @@ struct memory_group *memory_group_find_by_id(int mgid);
>>> typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
>>> int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
>>> struct memory_group *excluded, void *arg);
>>> +unsigned long phys_to_block_id(unsigned long phys);
>>> +struct memory_block *find_memory_block_by_id(unsigned long block_id);
>>> #define hotplug_memory_notifier(fn, pri) ({ \
>>> static __meminitdata struct notifier_block fn##_mem_nb =\
>>> { .notifier_call = fn, .priority = pri };\
>>> diff --git a/include/linux/node.h b/include/linux/node.h
>>> index 2b7517892230..93beefe8f179 100644
>>> --- a/include/linux/node.h
>>> +++ b/include/linux/node.h
>>> @@ -114,12 +114,16 @@ extern struct node *node_devices[];
>>> void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
>>> unsigned long end_pfn,
>>> enum meminit_context context);
>>> +void register_memory_blocks_under_node_early(int nid);
>>> #else
>>> static inline void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
>>> unsigned long end_pfn,
>>> enum meminit_context context)
>>> {
>>> }
>>> +static inline void register_memory_blocks_under_node_early(int nid)
>>> +{
>>> +}
>>> #endif
>>>
>>> extern void unregister_node(struct node *node);
>>> @@ -134,15 +138,10 @@ static inline int register_one_node(int nid)
>>> int error = 0;
>>>
>>> if (node_online(nid)) {
>>> - struct pglist_data *pgdat = NODE_DATA(nid);
>>> - unsigned long start_pfn = pgdat->node_start_pfn;
>>> - unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
>>> -
>>> error = __register_one_node(nid);
>>> if (error)
>>> return error;
>>> - register_memory_blocks_under_node(nid, start_pfn, end_pfn,
>>> - MEMINIT_EARLY);
>>> + register_memory_blocks_under_node_early(nid);
>> Doesn't that change mean that when register_one_node() is called from
>> memory hotplug, it will always try to iterate memblock regions?
>> This would be a problem on architectures that don't keep memblock around
>> after boot.
>
>
> Hi Mike
>
> Apologies, I wasn’t aware of CONFIG_ARCH_KEEP_MEMBLOCK.
> Thank you for pointing it out.
>
> If this configuration is disabled, the current patchset would not function
> correctly, and node device initialization could fail during the hotplug path.
memory hotplug code never calls register_one_node(), unless I am missing
something.

During add_memory_resource(), we call __try_online_node(nid, false),
meaning we skip register_one_node().

The only caller of __try_online_node(nid, true) is try_online_node(),
called from CPU hotplug code, and I *guess* that is not required.

I think the reason is simple: memory hotplug onlines the node before it
adds any memory. So there is no memory yet, and consequently nothing to
walk / register.

This whole code could use some cleanups. In particular, I am not sure
if __try_online_node() must ever call register_one_node().
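
For reference, a rough sketch of the relevant part of add_memory_resource()
in mm/memory_hotplug.c (simplified from memory; details may vary between
kernel versions):

	/* prepare the node (allocate pgdat if needed), do not online it yet */
	ret = __try_online_node(nid, false);
	if (ret < 0)
		goto error;
	new_node = ret;
	...
	if (new_node) {
		/* online the (still memory-less) node ... */
		node_set_online(nid);
		/* ... and register the node device directly */
		ret = __register_one_node(nid);
		BUG_ON(ret);
	}

So, as far as I can tell, register_one_node() (and with it the new
memblock walk) is never reached from the memory hotplug path.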
--
Cheers,
David / dhildenb