Re: [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time


From: Donet Tom <donettom@linux.ibm.com>
To: Mike Rapoport <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Oscar Salvador <osalvador@suse.de>, Zi Yan <ziy@nvidia.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Ritesh Harjani <ritesh.list@gmail.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	"Rafael J . Wysocki" <rafael@kernel.org>,
	Danilo Krummrich <dakr@kernel.org>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Alison Schofield <alison.schofield@intel.com>,
	Yury Norov <yury.norov@gmail.com>,
	Dave Jiang <dave.jiang@intel.com>
Subject: Re: [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time
Date: Thu, 22 May 2025 17:29:20 +0530
Message-ID: <0e0f2b4c-01d3-4ae9-b1f2-3490bccc4cfb@linux.ibm.com>
In-Reply-To: <aC7-S0EXnbGP3UNU@kernel.org>


On 5/22/25 4:06 PM, Mike Rapoport wrote:
> On Thu, May 22, 2025 at 04:17:28AM -0500, Donet Tom wrote:
>> During node device initialization, `memory blocks` are registered under
>> each NUMA node. The `memory blocks` to be registered are identified using
>> the node’s start and end PFNs, which are obtained from the node's pg_data.
>>
>> However, not all PFNs within this range necessarily belong to the same
>> node—some may belong to other nodes. Additionally, due to the
>> discontiguous nature of physical memory, certain sections within a
>> `memory block` may be absent.
>>
>> As a result, `memory blocks` that fall between a node’s start and end
>> PFNs may span across multiple nodes, and some sections within those blocks
>> may be missing. `Memory blocks` have a fixed size, which is architecture
>> dependent.
>>
>> Due to these considerations, the memory block registration is currently
>> performed as follows:
>>
>> for_each_online_node(nid):
>>      start_pfn = pgdat->node_start_pfn;
>>      end_pfn = pgdat->node_start_pfn + node_spanned_pages;
>>      for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
>>          mem_blk = memory_block_id(pfn_to_section_nr(pfn));
>>          pfn_mb_start = section_nr_to_pfn(mem_blk->start_section_nr)
>>          pfn_mb_end = pfn_mb_start + memory_block_pfns - 1
>>          for (pfn = pfn_mb_start; pfn <= pfn_mb_end; pfn++):
>>              if (get_nid_for_pfn(pfn) != nid):
>>                  continue;
>>              else
>>                  do_register_memory_block_under_node(nid, mem_blk,
>>                                                          MEMINIT_EARLY);
>>
>> Here, we derive the start and end PFNs from the node's pg_data, then
>> determine the memory blocks that may belong to the node. For each
>> `memory block` in this range, we inspect all PFNs it contains and check
>> their associated NUMA node ID. If a PFN within the block matches the
>> current node, the memory block is registered under that node.
>>
>> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
>> a binary search in the `memblock regions` to determine the NUMA node ID
>> for a given PFN. If it is not enabled, the node ID is retrieved directly
>> from the struct page.
>>
>> On large systems, this process can become time-consuming, especially since
>> we iterate over each `memory block` and all PFNs within it until a match is
>> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
>> overhead of the binary search increases the execution time significantly,
>> potentially leading to soft lockups during boot.
>>
>> In this patch, we iterate over `memblock region` to identify the
>> `memory blocks` that belong to the current NUMA node. `memblock regions`
>> are contiguous memory ranges, each associated with a single NUMA node, and
>> they do not span across multiple nodes.
>>
>> for_each_memory_region(r): // r => region
>>    if (!node_online(r->nid)):
>>      continue;
>>    else
>>      for_each_memory_block_between(r->base, r->base + r->size - 1):
>>        do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);
>>
>> We iterate over all memblock regions, and if the node associated with the
>> region is online, we calculate the start and end memory blocks based on the
>> region's start and end PFNs. We then register all the memory blocks within
>> that range under the region node.
>>
>> Test Results on My system with 32TB RAM
>> =======================================
>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 1min 16.528s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 17.236s (kernel) - 78% Improvement
>>
>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 28.320s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 15.621s (kernel) - 46% Improvement
>>
>> Acked-by: Zi Yan <ziy@nvidia.com>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>
>> ---
>> v4 -> v5
>>
>> 1. Moved all helpers(memory_block_id(), pfn_to_block_id(), and phys_to_block_id())
>>     into memory.h and exported sections_per_block.
>> 2. register_memory_blocks_early() moved out of for_each_online_node().
>>     Now we iterate over all memory regions at once and register the
>>     memory blocks.
>>
>>     Tested corner cases where memory blocks span across multiple memblock regions; it
>>     is working fine.
>>
>>     #cd /sys/devices/system/node/
>>     # find node1/  |grep memory0
>>     node1/memory0
>>     # find node0/  |grep memory0
>>     node0/memory0
>>     # find node2/  |grep memory0
>>     node2/memory0
>>     # cat node0/memory0/valid_zones
>>     none
>>
>> V4 - https://lore.kernel.org/all/f94685be9cdc931a026999d236d7e92de29725c7.1747376551.git.donettom@linux.ibm.com/
>> V3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
>> v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
>> v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
>> ---
>>   drivers/base/memory.c  | 21 ++++----------------
>>   drivers/base/node.c    | 45 ++++++++++++++++++++++++++++++++++++++++--
>>   include/linux/memory.h | 19 +++++++++++++++++-
>>   include/linux/node.h   |  3 +++
>>   4 files changed, 68 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index 19469e7f88c2..39fcc075a36f 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -22,6 +22,7 @@
>>   #include <linux/stat.h>
>>   #include <linux/slab.h>
>>   #include <linux/xarray.h>
>> +#include <linux/export.h>
>>   
>>   #include <linux/atomic.h>
>>   #include <linux/uaccess.h>
>> @@ -48,22 +49,8 @@ int mhp_online_type_from_str(const char *str)
>>   
>>   #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
>>   
>> -static int sections_per_block;
>> -
>> -static inline unsigned long memory_block_id(unsigned long section_nr)
>> -{
>> -	return section_nr / sections_per_block;
>> -}
>> -
>> -static inline unsigned long pfn_to_block_id(unsigned long pfn)
>> -{
>> -	return memory_block_id(pfn_to_section_nr(pfn));
>> -}
>> -
>> -static inline unsigned long phys_to_block_id(unsigned long phys)
>> -{
>> -	return pfn_to_block_id(PFN_DOWN(phys));
>> -}
>> +int sections_per_block;
>> +EXPORT_SYMBOL(sections_per_block);
>>   
>>   static int memory_subsys_online(struct device *dev);
>>   static int memory_subsys_offline(struct device *dev);
>> @@ -632,7 +619,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
>>    *
>>    * Called under device_hotplug_lock.
>>    */
>> -static struct memory_block *find_memory_block_by_id(unsigned long block_id)
>> +struct memory_block *find_memory_block_by_id(unsigned long block_id)
>>   {
>>   	struct memory_block *mem;
>>   
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index cd13ef287011..e8b6f6b9ce51 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -20,6 +20,7 @@
>>   #include <linux/pm_runtime.h>
>>   #include <linux/swap.h>
>>   #include <linux/slab.h>
>> +#include <linux/memblock.h>
>>   
>>   static const struct bus_type node_subsys = {
>>   	.name = "node",
>> @@ -850,6 +851,41 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
>>   			  kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
>>   }
>>   
>> +/*
>> + * register_memory_blocks_under_node_early : Register the memory blocks
>> + *                 under the nodes.
>> + *
>> + * This function iterates over all memblock regions, and if the node associated with
>> + * the region is online, calculates the start and end memory blocks based on the
>> + * region's start and end PFNs. Then, registers all the memory blocks within that
>> + * range under the region node.
>> + */
>> +static void register_memory_blocks_under_node_early(void)
>> +{
>> +	struct memblock_region *r;
>> +
>> +	for_each_mem_region(r) {
>> +		const unsigned long start_block_id = phys_to_block_id(r->base);
>> +		const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
>> +		unsigned long block_id;
>> +
>> +		if (!node_online(r->nid))
> memblock_get_region_node() please, otherwise it won't build for !NUMA.

Thank you Mike

I tested with !CONFIG_NUMA, and the build was successful. This is 
because node.c is not compiled when CONFIG_NUMA is disabled:

obj-$(CONFIG_NUMA) += node.o

But it is better to use memblock_get_region_node(). I'll make the change
and include it in the next revision.

Thanks
Donet

> Otherwise
>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>
>> +			continue;
>> +
>> +		for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
>> +			struct memory_block *mem;
>> +
>> +			mem = find_memory_block_by_id(block_id);
>> +			if (!mem)
>> +				continue;
>> +
>> +			do_register_memory_block_under_node(r->nid, mem, MEMINIT_EARLY);
>> +			put_device(&mem->dev);
>> +		}
>> +
>> +	}
>> +}
>> +
>>   void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
>>   				       unsigned long end_pfn,
>>   				       enum meminit_context context)
>> @@ -971,11 +1007,16 @@ void __init node_dev_init(void)
>>   
>>   	/*
>>   	 * Create all node devices, which will properly link the node
>> -	 * to applicable memory block devices and already created cpu devices.
>> +	 * to already created cpu devices.
>>   	 */
>>   	for_each_online_node(i) {
>> -		ret = register_one_node(i);
>> +		ret =  __register_one_node(i);
>>   		if (ret)
>>   			panic("%s() failed to add node: %d\n", __func__, ret);
>>   	}
>> +
>> +	/*
>> +	 * Link the node to memory block devices
>> +	 */
>> +	register_memory_blocks_under_node_early();
>>   }
>> diff --git a/include/linux/memory.h b/include/linux/memory.h
>> index 12daa6ec7d09..2a61088e17ad 100644
>> --- a/include/linux/memory.h
>> +++ b/include/linux/memory.h
>> @@ -171,12 +171,30 @@ struct memory_group *memory_group_find_by_id(int mgid);
>>   typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
>>   int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
>>   			       struct memory_group *excluded, void *arg);
>> +struct memory_block *find_memory_block_by_id(unsigned long block_id);
>>   #define hotplug_memory_notifier(fn, pri) ({		\
>>   	static __meminitdata struct notifier_block fn##_mem_nb =\
>>   		{ .notifier_call = fn, .priority = pri };\
>>   	register_memory_notifier(&fn##_mem_nb);			\
>>   })
>>   
>> +extern int sections_per_block;
>> +
>> +static inline unsigned long memory_block_id(unsigned long section_nr)
>> +{
>> +	return section_nr / sections_per_block;
>> +}
>> +
>> +static inline unsigned long pfn_to_block_id(unsigned long pfn)
>> +{
>> +	return memory_block_id(pfn_to_section_nr(pfn));
>> +}
>> +
>> +static inline unsigned long phys_to_block_id(unsigned long phys)
>> +{
>> +	return pfn_to_block_id(PFN_DOWN(phys));
>> +}
>> +
>>   #ifdef CONFIG_NUMA
>>   void memory_block_add_nid(struct memory_block *mem, int nid,
>>   			  enum meminit_context context);
>> @@ -188,5 +206,4 @@ void memory_block_add_nid(struct memory_block *mem, int nid,
>>    * can sleep.
>>    */
>>   extern struct mutex text_mutex;
>> -
>>   #endif /* _LINUX_MEMORY_H_ */
>> diff --git a/include/linux/node.h b/include/linux/node.h
>> index 2b7517892230..5c763253c42c 100644
>> --- a/include/linux/node.h
>> +++ b/include/linux/node.h
>> @@ -120,6 +120,9 @@ static inline void register_memory_blocks_under_node(int nid, unsigned long star
>>   						     enum meminit_context context)
>>   {
>>   }
>> +static inline void register_memory_blocks_under_node_early(void)
>> +{
>> +}
>>   #endif
>>   
>>   extern void unregister_node(struct node *node);
>> -- 
>> 2.43.5
>>

Thread overview: 13+ messages
2025-05-22  9:17 Donet Tom
2025-05-22  9:17 ` [PATCH v5 2/4] drivers/base/node: remove register_mem_block_under_node_early() Donet Tom
2025-05-22  9:17 ` [PATCH v5 3/4] drivers/base/node: Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
2025-05-22 10:06   ` Oscar Salvador
2025-05-22 10:31     ` Mike Rapoport
2025-05-22 11:46       ` Donet Tom
2025-05-22 12:09   ` David Hildenbrand
2025-05-22  9:17 ` [PATCH v5 4/4] drivers/base/node : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
2025-05-22 10:36 ` [PATCH v5 1/4] drivers/base/node: Optimize memory block registration to reduce boot time Mike Rapoport
2025-05-22 11:59   ` Donet Tom [this message]
2025-05-22 12:09 ` David Hildenbrand
2025-05-22 12:29   ` Donet Tom
2025-05-23  8:46 ` Oscar Salvador
