From: Donet Tom <donettom@linux.ibm.com>
To: David Hildenbrand <david@redhat.com>,
Mike Rapoport <rppt@kernel.org>,
Oscar Salvador <osalvador@suse.de>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
rafael@kernel.org, Danilo Krummrich <dakr@kernel.org>
Cc: Ritesh Harjani <ritesh.list@gmail.com>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Alison Schofield <alison.schofield@intel.com>,
Yury Norov <yury.norov@gmail.com>,
Dave Jiang <dave.jiang@intel.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 1/2] driver/base: Optimize memory block registration to reduce boot time
Date: Tue, 29 Apr 2025 19:38:59 +0530
Message-ID: <1f750ad6-b6b1-41de-9cdd-9abe64c14eae@linux.ibm.com>
In-Reply-To: <da9c6b2f-5b4b-444c-a453-cf72272c2fb7@redhat.com>

On 4/29/25 2:49 AM, David Hildenbrand wrote:
> On 28.04.25 19:03, Donet Tom wrote:
>> During node device initialization, `memory blocks` are registered under
>> each NUMA node. The `memory blocks` to be registered are identified using
>> the node’s start and end PFNs, which are obtained from the node's pg_data.
>>
>> However, not all PFNs within this range necessarily belong to the same
>> node; some may belong to other nodes. Additionally, due to the
>> discontiguous nature of physical memory, certain sections within a
>> `memory block` may be absent.
>>
>> As a result, `memory blocks` that fall between a node’s start and end
>> PFNs may span across multiple nodes, and some sections within those blocks
>> may be missing. `Memory blocks` have a fixed size, which is architecture
>> dependent.
>>
>> Due to these considerations, the memory block registration is currently
>> performed as follows:
>>
>> for_each_online_node(nid):
>>     start_pfn = pgdat->node_start_pfn;
>>     end_pfn = pgdat->node_start_pfn + node_spanned_pages;
>>     for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
>>         mem_blk = memory_block_id(pfn_to_section_nr(pfn));
>>         pfn_mb_start = section_nr_to_pfn(mem_blk->start_section_nr)
>>         pfn_mb_end = pfn_mb_start + memory_block_pfns - 1
>>         for (pfn = pfn_mb_start; pfn <= pfn_mb_end; pfn++):
>>             if (get_nid_for_pfn(pfn) != nid):
>>                 continue;
>>             else
>>                 do_register_memory_block_under_node(nid, mem_blk,
>>                                                     MEMINIT_EARLY);
>>
>> Here, we derive the start and end PFNs from the node's pg_data, then
>> determine the memory blocks that may belong to the node. For each
>> `memory block` in this range, we inspect all PFNs it contains and check
>> their associated NUMA node ID. If a PFN within the block matches the
>> current node, the memory block is registered under that node.
>>
>> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
>> a binary search in the `memblock regions` to determine the NUMA node ID
>> for a given PFN. If it is not enabled, the node ID is retrieved directly
>> from the struct page.
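
For readers less familiar with that lookup, here is a rough user-space
sketch of a binary search over sorted, non-overlapping regions. It is
only an illustration, not the kernel code: struct pfn_region and
nid_for_pfn() are hypothetical names, and the half-open PFN ranges are
an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memblock region, expressed in PFNs. */
struct pfn_region {
	uint64_t start_pfn;	/* first PFN in the region */
	uint64_t end_pfn;	/* one past the last PFN (exclusive) */
	int nid;		/* NUMA node that owns the region */
};

/*
 * Binary search over regions sorted by start_pfn.
 * Returns the owning node ID, or -1 if the PFN falls in a hole.
 */
static int nid_for_pfn(const struct pfn_region *regions, size_t n,
		       uint64_t pfn)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (pfn < regions[mid].start_pfn)
			hi = mid;		/* look in the lower half */
		else if (pfn >= regions[mid].end_pfn)
			lo = mid + 1;		/* look in the upper half */
		else
			return regions[mid].nid;
	}

	return -1;
}
```

Each lookup costs O(log n) in the number of regions, and the old
registration path pays that cost per PFN inspected, which is why it
adds up on large systems.
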
>>
>> On large systems, this process can become time-consuming, especially since
>> we iterate over each `memory block` and all PFNs within it until a match is
>> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
>> overhead of the binary search increases the execution time significantly,
>> potentially leading to soft lockups during boot.
>>
>> In this patch, we iterate over the `memblock regions` to identify the
>> `memory blocks` that belong to the current NUMA node. `memblock regions`
>> are contiguous memory ranges, each associated with a single NUMA node, and
>> they do not span across multiple nodes.
>>
>> for_each_online_node(nid):
>>     for_each_memory_region(r): // r => region
>>         if (r->nid != nid):
>>             continue;
>>         else
>>             for_each_memory_block_between(r->base, r->base + r->size - 1):
>>                 do_register_memory_block_under_node(nid, mem_blk,
>>                                                     MEMINIT_EARLY);
>>
>> We iterate over all `memblock regions` and identify those that belong to
>> the current NUMA node. For each `memblock region` associated with the
>> current node, we calculate the start and end `memory blocks` based on the
>> region's start and end PFNs. We then register all `memory blocks` within
>> that range under the current node.
>
> Yes, makes sense.
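
To make the region-based walk concrete, here is a minimal user-space
sketch of the idea. It is an illustration only, not the patch itself:
struct region, SKETCH_BLOCK_SIZE and mark_blocks_for_node() are
hypothetical stand-ins, and the 256 MiB block size is assumed purely
for the example (the real size is architecture dependent).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memblock region. */
struct region {
	uint64_t base;	/* physical start address */
	uint64_t size;	/* length in bytes */
	int nid;	/* NUMA node that owns the region */
};

/* Assumed block size, for illustration only. */
#define SKETCH_BLOCK_SIZE (256ULL << 20)

static uint64_t phys_to_block(uint64_t phys)
{
	return phys / SKETCH_BLOCK_SIZE;
}

/*
 * For every region owned by @nid, mark each block the region touches
 * in @seen. The end block is inclusive, hence the "- 1" on the last
 * byte. Returns the number of blocks newly marked.
 */
static size_t mark_blocks_for_node(const struct region *regions,
				   size_t nr_regions, int nid,
				   unsigned char *seen, size_t nr_blocks)
{
	size_t count = 0;

	for (size_t i = 0; i < nr_regions; i++) {
		const struct region *r = &regions[i];

		if (r->nid != nid || r->size == 0)
			continue;

		const uint64_t start = phys_to_block(r->base);
		const uint64_t end = phys_to_block(r->base + r->size - 1);

		for (uint64_t id = start; id <= end && id < nr_blocks; id++) {
			if (!seen[id]) {
				seen[id] = 1;
				count++;
			}
		}
	}

	return count;
}
```

With regions of 512 MiB on node 0, 128 MiB on node 1 and 384 MiB on
node 0 laid out back to back, the third block is marked for both
nodes, which matches the case of a memory block spanning multiple
nodes described above. No per-PFN lookup is needed.
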
>
>>
>> Test results on my system with 32TB RAM
>> =======================================
>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 1min 16.528s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 17.236s (kernel) - 78% Improvement
>
> Wow!
>
>>
>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 28.320s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 15.621s (kernel) - 46% Improvement
>>
>
> Also very nice!
>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>>
>> v1->v2
>>
>> Reworked the implementation according to suggestions from
>> Mike Rapoport[1]
>>
>> [1] - https://lore.kernel.org/all/Z_j2Gv9n4DOj6LSs@kernel.org/
>>
>> v1 -
>> https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
>> ---
>> drivers/base/memory.c | 4 ++--
>> drivers/base/node.c | 39 +++++++++++++++++++++++++++++++++++++++
>> include/linux/memory.h | 2 ++
>> include/linux/node.h | 11 +++++------
>> 4 files changed, 48 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index 19469e7f88c2..7f1d266ae593 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -60,7 +60,7 @@ static inline unsigned long pfn_to_block_id(unsigned long pfn)
>>  	return memory_block_id(pfn_to_section_nr(pfn));
>>  }
>>  
>> -static inline unsigned long phys_to_block_id(unsigned long phys)
>> +unsigned long phys_to_block_id(unsigned long phys)
>>  {
>>  	return pfn_to_block_id(PFN_DOWN(phys));
>>  }
>> @@ -632,7 +632,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
>>   *
>>   * Called under device_hotplug_lock.
>>   */
>> -static struct memory_block *find_memory_block_by_id(unsigned long block_id)
>> +struct memory_block *find_memory_block_by_id(unsigned long block_id)
>>  {
>>  	struct memory_block *mem;
>>  
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index cd13ef287011..4869333d366d 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -20,6 +20,7 @@
>>  #include <linux/pm_runtime.h>
>>  #include <linux/swap.h>
>>  #include <linux/slab.h>
>> +#include <linux/memblock.h>
>>  
>>  static const struct bus_type node_subsys = {
>>  	.name = "node",
>> @@ -850,6 +851,44 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>>  			  kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
>>  }
>>  
>> +/*
>> + * register_memory_blocks_under_node_early : Register the memory
>> + *		  blocks under the current node.
>> + * @nid : Current node under registration
>> + *
>> + * This function iterates over all memblock regions and identifies the regions
>> + * that belong to the current node. For each region which belongs to current
>> + * node, it calculates the start and end memory blocks based on the region's
>> + * start and end PFNs. It then registers all memory blocks within that range
>> + * under the current node.
>> + *
>> + */
>> +void register_memory_blocks_under_node_early(int nid)
>> +{
>> +	struct memblock_region *r;
>
> You almost achieved a reverse x-mas tree :)
>
>> +	unsigned long start_block_id;
>> +	unsigned long end_block_id;
>> +	struct memory_block *mem;
>> +	unsigned long block_id;
>> +
>> +	for_each_mem_region(r) {
>> +		if (r->nid == nid) {
>
> To reduce indentation
>
>     if (r->nid != nid)
>         continue;

OK.

>
>> +			start_block_id = phys_to_block_id(r->base);
>> +			end_block_id = phys_to_block_id(r->base + r->size - 1);
>
> Probably you could make them const in the for loop
>
>     const unsigned long start_block_id = phys_to_block_id(r->base);
>     const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);

OK, I will add this change.

>
> Okay, so end is inclusive as well.

Yes.

>
>> +
>> +			for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
>> +				mem = find_memory_block_by_id(block_id);
>> +				if (!mem)
>> +					continue;
>> +
>> +				do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
>