From: David Hildenbrand <david@redhat.com>
To: Donet Tom <donettom@linux.ibm.com>, Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>, Zi Yan <ziy@nvidia.com>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
rafael@kernel.org, Danilo Krummrich <dakr@kernel.org>,
Ritesh Harjani <ritesh.list@gmail.com>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Alison Schofield <alison.schofield@intel.com>,
Yury Norov <yury.norov@gmail.com>,
Dave Jiang <dave.jiang@intel.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 1/3] driver/base: Optimize memory block registration to reduce boot time
Date: Mon, 5 May 2025 09:16:48 +0200 [thread overview]
Message-ID: <188fbfba-afb4-4db7-bbba-7689a96be931@redhat.com> (raw)
In-Reply-To: <a1e0cddc-ed38-4f48-b028-f3ab5025c157@linux.ibm.com>
On 04.05.25 18:34, Donet Tom wrote:
>
> On 5/4/25 4:39 PM, Mike Rapoport wrote:
>> On Sat, May 03, 2025 at 11:10:12AM +0530, Donet Tom wrote:
>>> During node device initialization, `memory blocks` are registered under
>>> each NUMA node. The `memory blocks` to be registered are identified using
>>> the node’s start and end PFNs, which are obtained from the node's pg_data.
>>>
>>> However, not all PFNs within this range necessarily belong to the same
>>> node—some may belong to other nodes. Additionally, due to the
>>> discontiguous nature of physical memory, certain sections within a
>>> `memory block` may be absent.
>>>
>>> As a result, `memory blocks` that fall between a node’s start and end
>>> PFNs may span across multiple nodes, and some sections within those blocks
>>> may be missing. `Memory blocks` have a fixed size, which is architecture
>>> dependent.
>>>
>>> Due to these considerations, the memory block registration is currently
>>> performed as follows:
>>>
>>>    for_each_online_node(nid):
>>>        start_pfn = pgdat->node_start_pfn;
>>>        end_pfn = pgdat->node_start_pfn + node_spanned_pages;
>>>        for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn)):
>>>            mem_blk = memory_block_id(pfn_to_section_nr(pfn));
>>>            pfn_mb_start = section_nr_to_pfn(mem_blk->start_section_nr);
>>>            pfn_mb_end = pfn_mb_start + memory_block_pfns - 1;
>>>            for (pfn = pfn_mb_start; pfn <= pfn_mb_end; pfn++):
>>>                if (get_nid_for_pfn(pfn) != nid):
>>>                    continue;
>>>                else:
>>>                    do_register_memory_block_under_node(nid, mem_blk,
>>>                                                        MEMINIT_EARLY);
>>>
>>> Here, we derive the start and end PFNs from the node's pg_data, then
>>> determine the memory blocks that may belong to the node. For each
>>> `memory block` in this range, we inspect all PFNs it contains and check
>>> their associated NUMA node ID. If a PFN within the block matches the
>>> current node, the memory block is registered under that node.
>>>
>>> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
>>> a binary search in the `memblock regions` to determine the NUMA node ID
>>> for a given PFN. If it is not enabled, the node ID is retrieved directly
>>> from the struct page.
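>>>
>>> For illustration, the helper roughly splits like this (a sketch; the
>>> exact code in drivers/base/node.c may differ in detail):
>>>
>>>    static int get_nid_for_pfn(unsigned long pfn)
>>>    {
>>>    #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>>>            /* struct pages may not be initialized yet: binary search in memblock */
>>>            if (system_state < SYSTEM_RUNNING)
>>>                    return early_pfn_to_nid(pfn);
>>>    #endif
>>>            /* struct page is initialized: read the node id from it directly */
>>>            return pfn_to_nid(pfn);
>>>    }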
>>>
>>> On large systems, this process can become time-consuming, especially since
>>> we iterate over each `memory block` and all PFNs within it until a match is
>>> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
>>> overhead of the binary search increases the execution time significantly,
>>> potentially leading to soft lockups during boot.
>>>
>>> In this patch, we instead iterate over the `memblock regions` to identify
>>> the `memory blocks` that belong to the current NUMA node. `memblock regions`
>>> are contiguous memory ranges, each associated with a single NUMA node, and
>>> they do not span multiple nodes.
>>>
>>>    for_each_online_node(nid):
>>>        for_each_memory_region(r):  // r => memblock region
>>>            if (r->nid != nid):
>>>                continue;
>>>            else:
>>>                for_each_memory_block_between(r->base, r->base + r->size - 1):
>>>                    do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
>>>
>>> We iterate over all `memblock regions` and identify those that belong to
>>> the current NUMA node. For each `memblock region` associated with the
>>> current node, we calculate the start and end `memory blocks` based on the
>>> region's start and end PFNs. We then register all `memory blocks` within
>>> that range under the current node.
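>>>
>>> As a hypothetical worked example (the memory block size is architecture
>>> dependent; 256MB is assumed here purely for illustration):
>>>
>>>    region: base = 0x10000000, size = 0x30000000, nid = 2
>>>    start_block_id = phys_to_block_id(0x10000000)                  = 1
>>>    end_block_id   = phys_to_block_id(0x10000000 + 0x30000000 - 1) = 3
>>>    => memory blocks 1, 2 and 3 are registered under node 2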
>>>
>>> Test results on my system with 32TB RAM
>>> =======================================
>>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>>
>>> Without this patch
>>> ------------------
>>> Startup finished in 1min 16.528s (kernel)
>>>
>>> With this patch
>>> ---------------
>>> Startup finished in 17.236s (kernel) - 78% Improvement
>>>
>>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>>
>>> Without this patch
>>> ------------------
>>> Startup finished in 28.320s (kernel)
>>>
>>> With this patch
>>> ---------------
>>> Startup finished in 15.621s (kernel) - 46% Improvement
>>>
>>> Acked-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>
>>> ---
>>> v2 -> v3
>>>
>>> Fixed indentation issues, made `start_block_id` and `end_block_id` constants,
>>> and moved variable declarations to the places where they are needed.
>>>
>>> v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
>>> v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
>>> ---
>>> drivers/base/memory.c | 4 ++--
>>> drivers/base/node.c | 38 ++++++++++++++++++++++++++++++++++++++
>>> include/linux/memory.h | 2 ++
>>> include/linux/node.h | 11 +++++------
>>> 4 files changed, 47 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>>> index 19469e7f88c2..7f1d266ae593 100644
>>> --- a/drivers/base/memory.c
>>> +++ b/drivers/base/memory.c
>>> @@ -60,7 +60,7 @@ static inline unsigned long pfn_to_block_id(unsigned long pfn)
>>> return memory_block_id(pfn_to_section_nr(pfn));
>>> }
>>>
>>> -static inline unsigned long phys_to_block_id(unsigned long phys)
>>> +unsigned long phys_to_block_id(unsigned long phys)
>>> {
>>> return pfn_to_block_id(PFN_DOWN(phys));
>>> }
>>> @@ -632,7 +632,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
>>> *
>>> * Called under device_hotplug_lock.
>>> */
>>> -static struct memory_block *find_memory_block_by_id(unsigned long block_id)
>>> +struct memory_block *find_memory_block_by_id(unsigned long block_id)
>>> {
>>> struct memory_block *mem;
>>>
>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>> index cd13ef287011..0f8a4645b26c 100644
>>> --- a/drivers/base/node.c
>>> +++ b/drivers/base/node.c
>>> @@ -20,6 +20,7 @@
>>> #include <linux/pm_runtime.h>
>>> #include <linux/swap.h>
>>> #include <linux/slab.h>
>>> +#include <linux/memblock.h>
>>>
>>> static const struct bus_type node_subsys = {
>>> .name = "node",
>>> @@ -850,6 +851,43 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>>> kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
>>> }
>>>
>>> +/*
>>> + * register_memory_blocks_under_node_early : Register the memory
>>> + * blocks under the current node.
>>> + * @nid : Current node under registration
>>> + *
>>> + * This function iterates over all memblock regions and identifies the regions
>>> + * that belong to the current node. For each region which belongs to current
>>> + * node, it calculates the start and end memory blocks based on the region's
>>> + * start and end PFNs. It then registers all memory blocks within that range
>>> + * under the current node.
>>> + */
>>> +void register_memory_blocks_under_node_early(int nid)
>>> +{
>>> + struct memblock_region *r;
>>> +
>>> + for_each_mem_region(r) {
>>> + if (r->nid != nid)
>>> + continue;
>>> +
>>> + const unsigned long start_block_id = phys_to_block_id(r->base);
>>> + const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
>>> + unsigned long block_id;
>>> +
>>> + for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
>>> + struct memory_block *mem;
>>> +
>>> + mem = find_memory_block_by_id(block_id);
>>> + if (!mem)
>>> + continue;
>>> +
>>> + do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
>>> + put_device(&mem->dev);
>>> + }
>>> +
>>> + }
>>> +}
>>> +
>>> void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
>>> unsigned long end_pfn,
>>> enum meminit_context context)
>>> diff --git a/include/linux/memory.h b/include/linux/memory.h
>>> index 12daa6ec7d09..cb8579226536 100644
>>> --- a/include/linux/memory.h
>>> +++ b/include/linux/memory.h
>>> @@ -171,6 +171,8 @@ struct memory_group *memory_group_find_by_id(int mgid);
>>> typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
>>> int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
>>> struct memory_group *excluded, void *arg);
>>> +unsigned long phys_to_block_id(unsigned long phys);
>>> +struct memory_block *find_memory_block_by_id(unsigned long block_id);
>>> #define hotplug_memory_notifier(fn, pri) ({ \
>>> static __meminitdata struct notifier_block fn##_mem_nb =\
>>> { .notifier_call = fn, .priority = pri };\
>>> diff --git a/include/linux/node.h b/include/linux/node.h
>>> index 2b7517892230..93beefe8f179 100644
>>> --- a/include/linux/node.h
>>> +++ b/include/linux/node.h
>>> @@ -114,12 +114,16 @@ extern struct node *node_devices[];
>>> void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
>>> unsigned long end_pfn,
>>> enum meminit_context context);
>>> +void register_memory_blocks_under_node_early(int nid);
>>> #else
>>> static inline void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
>>> unsigned long end_pfn,
>>> enum meminit_context context)
>>> {
>>> }
>>> +static inline void register_memory_blocks_under_node_early(int nid)
>>> +{
>>> +}
>>> #endif
>>>
>>> extern void unregister_node(struct node *node);
>>> @@ -134,15 +138,10 @@ static inline int register_one_node(int nid)
>>> int error = 0;
>>>
>>> if (node_online(nid)) {
>>> - struct pglist_data *pgdat = NODE_DATA(nid);
>>> - unsigned long start_pfn = pgdat->node_start_pfn;
>>> - unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
>>> -
>>> error = __register_one_node(nid);
>>> if (error)
>>> return error;
>>> - register_memory_blocks_under_node(nid, start_pfn, end_pfn,
>>> - MEMINIT_EARLY);
>>> + register_memory_blocks_under_node_early(nid);
>> Doesn't that change mean that when register_one_node() is called from
>> memory hotplug, it will always try to iterate memblock regions?
>> This would be a problem on architectures that don't keep memblock around
>> after boot.
>
>
> Hi Mike
>
> Apologies, I wasn’t aware of CONFIG_ARCH_KEEP_MEMBLOCK.
> Thank you for pointing it out.
>
> If this configuration is disabled, the current patchset would not function
> correctly, and node device initialization could fail during the hotplug path.
memory hotplug code never calls register_one_node(), unless I am missing
something.

During add_memory_resource(), we call __try_online_node(nid, false),
meaning we skip register_one_node().

The only caller of __try_online_node(nid, true) is try_online_node(),
called from CPU hotplug code, and I *guess* that is not required.

I think the reason is simple: memory hotplug onlines the node before it
adds any memory. So there is no memory yet, and consequently nothing to
walk / register.

This whole code could use some cleanups. In particular, I am not sure
if __try_online_node() must ever call register_one_node().
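
For reference, a rough sketch of the relevant part of add_memory_resource()
in mm/memory_hotplug.c (simplified from memory; details may vary between
kernel versions):

	/* prepare the node (allocate pgdat if needed), do not online it yet */
	ret = __try_online_node(nid, false);
	if (ret < 0)
		goto error;
	new_node = ret;
	...
	if (new_node) {
		/* online the (still memory-less) node ... */
		node_set_online(nid);
		/* ... and register the node device directly */
		ret = __register_one_node(nid);
		BUG_ON(ret);
	}

So, as far as I can tell, register_one_node() (and with it the new
memblock walk) is never reached from the memory hotplug path.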
--
Cheers,
David / dhildenb