From: David Hildenbrand <david@redhat.com>
To: Donet Tom <donettom@linux.ibm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Mike Rapoport <rppt@kernel.org>,
Oscar Salvador <osalvador@suse.de>, Zi Yan <ziy@nvidia.com>
Cc: Ritesh Harjani <ritesh.list@gmail.com>,
rafael@kernel.org, Danilo Krummrich <dakr@kernel.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Alison Schofield <alison.schofield@intel.com>,
Yury Norov <yury.norov@gmail.com>,
Dave Jiang <dave.jiang@intel.com>
Subject: Re: [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time
Date: Fri, 16 May 2025 11:15:29 +0200
Message-ID: <56cb2494-56ba-4895-9dd1-23243c2eecdb@redhat.com>
In-Reply-To: <f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com>
On 16.05.25 10:19, Donet Tom wrote:
> During node device initialization, `memory blocks` are registered under
> each NUMA node. The `memory blocks` to be registered are identified using
> the node’s start and end PFNs, which are obtained from the node's pg_data.
>
> However, not all PFNs within this range necessarily belong to the same
> node—some may belong to other nodes. Additionally, due to the
> discontiguous nature of physical memory, certain sections within a
> `memory block` may be absent.
>
> As a result, `memory blocks` that fall between a node’s start and end
> PFNs may span across multiple nodes, and some sections within those blocks
> may be missing. `Memory blocks` have a fixed size, which is architecture
> dependent.
>
> Due to these considerations, the memory block registration is currently
> performed as follows:
>
> for_each_online_node(nid):
>   start_pfn = pgdat->node_start_pfn;
>   end_pfn = pgdat->node_start_pfn + node_spanned_pages;
>   for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
>     mem_blk = memory_block_id(pfn_to_section_nr(pfn));
>     pfn_mb_start = section_nr_to_pfn(mem_blk->start_section_nr)
>     pfn_mb_end = pfn_mb_start + memory_block_pfns - 1
>     for (pfn = pfn_mb_start; pfn <= pfn_mb_end; pfn++):
>       if (get_nid_for_pfn(pfn) != nid):
>         continue;
>       else
>         do_register_memory_block_under_node(nid, mem_blk,
>                                             MEMINIT_EARLY);
>
> Here, we derive the start and end PFNs from the node's pg_data, then
> determine the memory blocks that may belong to the node. For each
> `memory block` in this range, we inspect all PFNs it contains and check
> their associated NUMA node ID. If a PFN within the block matches the
> current node, the memory block is registered under that node.
>
> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
> a binary search in the `memblock regions` to determine the NUMA node ID
> for a given PFN. If it is not enabled, the node ID is retrieved directly
> from the struct page.
>
> On large systems, this process can become time-consuming, especially since
> we iterate over each `memory block` and all PFNs within it until a match is
> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
> overhead of the binary search increases the execution time significantly,
> potentially leading to soft lockups during boot.
>
> In this patch, we iterate over the `memblock regions` to identify the
> `memory blocks` that belong to the current NUMA node. `memblock regions`
> are contiguous memory ranges, each associated with a single NUMA node, and
> they do not span across multiple nodes.
>
> for_each_online_node(nid):
>   for_each_memory_region(r): // r => region
>     if (r->nid != nid):
>       continue;
>     else
>       for_each_memory_block_between(r->base, r->base + r->size - 1):
>         do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
>
> We iterate over all `memblock regions` and identify those that belong to
> the current NUMA node. For each `memblock region` associated with the
> current node, we calculate the start and end `memory blocks` based on the
> region's start and end PFNs. We then register all `memory blocks` within
> that range under the current node.
>
> Test results on my system with 32TB RAM
> =======================================
> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>
> Without this patch
> ------------------
> Startup finished in 1min 16.528s (kernel)
>
> With this patch
> ---------------
> Startup finished in 17.236s (kernel) - 78% Improvement
>
> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>
> Without this patch
> ------------------
> Startup finished in 28.320s (kernel)
>
> With this patch
> ---------------
> Startup finished in 15.621s (kernel) - 45% Improvement
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>
> ---
> v3 -> v4
>
> Addressed Mike's comment by making node_dev_init() call __register_one_node().
>
> v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
> v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
> v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
> ---
> drivers/base/memory.c | 4 ++--
> drivers/base/node.c | 41 ++++++++++++++++++++++++++++++++++++++++-
> include/linux/memory.h | 2 ++
> include/linux/node.h | 3 +++
> 4 files changed, 47 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 19469e7f88c2..7f1d266ae593 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -60,7 +60,7 @@ static inline unsigned long pfn_to_block_id(unsigned long pfn)
> 	return memory_block_id(pfn_to_section_nr(pfn));
> }
>
> -static inline unsigned long phys_to_block_id(unsigned long phys)
> +unsigned long phys_to_block_id(unsigned long phys)
> {
> 	return pfn_to_block_id(PFN_DOWN(phys));
> }

I was wondering whether we should move all these helpers into a header,
and export sections_per_block instead. Probably doesn't really matter
for your use case.

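To make the header idea concrete, here is a rough userspace sketch, not the kernel's actual layout. The shift values and the value of `sections_per_block` are assumptions picked for illustration (4 KiB pages, 128 MiB sections, 8 sections per block); in the kernel the variable lives in drivers/base/memory.c and would have to be exported for inline helpers like these to sit in a header.

```c
#include <assert.h>

/* Assumed values for illustration only; the kernel derives them per arch. */
#define PAGE_SHIFT		12	/* 4 KiB pages */
#define PFN_SECTION_SHIFT	15	/* 128 MiB sections: 27 - PAGE_SHIFT */
#define PFN_DOWN(x)		((x) >> PAGE_SHIFT)

/* Stand-in for the variable that would have to be exported. */
static unsigned long sections_per_block = 8;	/* gives 1 GiB memory blocks */

static inline unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

static inline unsigned long memory_block_id(unsigned long section_nr)
{
	assert(sections_per_block != 0);
	return section_nr / sections_per_block;
}

static inline unsigned long pfn_to_block_id(unsigned long pfn)
{
	return memory_block_id(pfn_to_section_nr(pfn));
}

static inline unsigned long phys_to_block_id(unsigned long phys)
{
	return pfn_to_block_id(PFN_DOWN(phys));
}
```

With these assumed sizes, every physical address inside the same 1 GiB window maps to the same block id, which is all register_memory_blocks_under_node_early() needs from the helper.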
> @@ -632,7 +632,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
> *
> * Called under device_hotplug_lock.
> */
> -static struct memory_block *find_memory_block_by_id(unsigned long block_id)
> +struct memory_block *find_memory_block_by_id(unsigned long block_id)
> {
> struct memory_block *mem;
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index cd13ef287011..f8cafd8c8fb1 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -20,6 +20,7 @@
> #include <linux/pm_runtime.h>
> #include <linux/swap.h>
> #include <linux/slab.h>
> +#include <linux/memblock.h>
>
> static const struct bus_type node_subsys = {
> .name = "node",
> @@ -850,6 +851,43 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
> kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
> }
>
> +/*
> + * register_memory_blocks_under_node_early : Register the memory
> + * blocks under the current node.
> + * @nid : Current node under registration
> + *
> + * This function iterates over all memblock regions and identifies the regions
> + * that belong to the current node. For each region which belongs to current
> + * node, it calculates the start and end memory blocks based on the region's
> + * start and end PFNs. It then registers all memory blocks within that range
> + * under the current node.
> + */
> +static void register_memory_blocks_under_node_early(int nid)
> +{
> +	struct memblock_region *r;
> +
> +	for_each_mem_region(r) {
> +		if (r->nid != nid)
> +			continue;
> +
> +		const unsigned long start_block_id = phys_to_block_id(r->base);
> +		const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
> +		unsigned long block_id;

This should definitely be above the if().

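A minimal userspace sketch of the layout being asked for, assuming the point is that declarations belong at the top of the loop body with the assignments following the nid check. `struct region`, `BLOCK_SIZE`, `sample_regions` and `count_blocks_for_node()` are hypothetical stand-ins and not kernel APIs; the sketch only counts the block ids the loop would visit instead of registering anything.

```c
#include <assert.h>
#include <stddef.h>

#define GiB		(1ULL << 30)
#define BLOCK_SIZE	GiB	/* assumed memory block size */

/* Hypothetical stand-in for struct memblock_region. */
struct region {
	unsigned long long base;
	unsigned long long size;
	int nid;
};

static unsigned long phys_to_block_id(unsigned long long phys)
{
	return (unsigned long)(phys / BLOCK_SIZE);
}

/* Count the block ids that would be visited for @nid. */
static unsigned long count_blocks_for_node(const struct region *regions,
					   size_t nr, int nid)
{
	unsigned long visited = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		const struct region *r = &regions[i];
		/* Declarations first, ahead of the nid check. */
		unsigned long start_block_id, end_block_id, block_id;

		if (r->nid != nid)
			continue;

		assert(r->size > 0);
		start_block_id = phys_to_block_id(r->base);
		end_block_id = phys_to_block_id(r->base + r->size - 1);

		for (block_id = start_block_id; block_id <= end_block_id; block_id++)
			visited++;
	}

	return visited;
}

/* Example layout: node 0 owns [0, 2G) and [3.5G, 4.5G), node 1 owns [2G, 3G). */
static const struct region sample_regions[] = {
	{ 0,             2 * GiB, 0 },
	{ 2 * GiB,       1 * GiB, 1 },
	{ 3 * GiB + GiB / 2, 1 * GiB, 0 },
};
```

The only change relative to the quoted hunk is where the three variables are declared; the range computation and the loop bounds are the same.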
> +
> +		for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
> +			struct memory_block *mem;
> +
> +			mem = find_memory_block_by_id(block_id);
> +			if (!mem)
> +				continue;
> +
> +			do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
> +			put_device(&mem->dev);
> +		}
> +
> +	}
> +}
> +
> void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
> unsigned long end_pfn,
> enum meminit_context context)
> @@ -974,8 +1012,9 @@ void __init node_dev_init(void)
> 	 * to applicable memory block devices and already created cpu devices.
> 	 */
> 	for_each_online_node(i) {
> -		ret = register_one_node(i);
> +		ret = __register_one_node(i);
> 		if (ret)
> 			panic("%s() failed to add node: %d\n", __func__, ret);
> +		register_memory_blocks_under_node_early(i);
> 	}

In general, LGTM.

BUT :)

I was wondering whether having a register_memory_blocks_early() call
*after* the for_each_online_node(), and walking all memory regions only
once would make a difference.

We'd have to be smart about memory blocks that fall into multiple
regions, but it should be a corner case and doable.

OTOH, we usually don't expect having a lot of regions, so iterating over
them is probably not a big bottleneck? Anyhow, just wanted to raise it.
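
Purely as an illustration of the single-walk idea, and not a proposed implementation: the userspace sketch below walks all regions once and links each covered block to the region's node, skipping a (block, node) pair that an earlier region already produced. Everything here is hypothetical scaffolding: `struct region`, the `linked` table, `link_block_to_node()`, the 1 GiB block size and the sample layout are assumptions, and register_memory_blocks_early() only borrows the name suggested above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GiB		(1ULL << 30)
#define BLOCK_SIZE	GiB	/* assumed memory block size */
#define MAX_BLOCKS	16	/* sketch-only limits */
#define MAX_NODES	4

/* Hypothetical stand-in for struct memblock_region. */
struct region {
	unsigned long long base;
	unsigned long long size;
	int nid;
};

/* linked[b][n] is true once block b has been linked under node n. */
static bool linked[MAX_BLOCKS][MAX_NODES];

/* Returns true if a new link was created, false if it already existed. */
static bool link_block_to_node(unsigned long block_id, int nid)
{
	assert(block_id < MAX_BLOCKS && nid >= 0 && nid < MAX_NODES);
	if (linked[block_id][nid])
		return false;	/* block already seen via an earlier region */
	linked[block_id][nid] = true;
	return true;
}

/* Walk all regions once; return the number of new links created. */
static unsigned int register_memory_blocks_early(const struct region *regions,
						 size_t nr)
{
	unsigned int new_links = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		const struct region *r = &regions[i];
		unsigned long start_block_id, end_block_id, block_id;

		if (!r->size)
			continue;

		start_block_id = (unsigned long)(r->base / BLOCK_SIZE);
		end_block_id = (unsigned long)((r->base + r->size - 1) / BLOCK_SIZE);

		for (block_id = start_block_id; block_id <= end_block_id; block_id++)
			if (link_block_to_node(block_id, r->nid))
				new_links++;
	}

	return new_links;
}

/*
 * Example layout: block 1 is shared by nodes 0 and 1, and block 3 is
 * covered by two separate node-0 regions.
 */
static const struct region sample_regions[] = {
	{ 0,                 GiB + GiB / 2, 0 },
	{ GiB + GiB / 2,     GiB + GiB / 2, 1 },
	{ 3 * GiB,           GiB / 2,       0 },
	{ 3 * GiB + GiB / 2, GiB / 2,       0 },
};
```

In this example the walk creates five links: block 1 ends up under both nodes, while block 3 is linked to node 0 only once even though two regions cover it. Whether a real implementation needs explicit tracking like the table here depends on how the registration helper treats repeated calls for the same block and node.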
--
Cheers,
David / dhildenb