* [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time
@ 2025-05-16 8:19 Donet Tom
2025-05-16 8:19 ` [PATCH v4 2/4] driver/base: remove register_mem_block_under_node_early() Donet Tom
` (3 more replies)
0 siblings, 4 replies; 18+ messages in thread
From: Donet Tom @ 2025-05-16 8:19 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan
Cc: Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang, Donet Tom
During node device initialization, `memory blocks` are registered under
each NUMA node. The `memory blocks` to be registered are identified using
the node’s start and end PFNs, which are obtained from the node's pg_data
However, not all PFNs within this range necessarily belong to the same
node—some may belong to other nodes. Additionally, due to the
discontiguous nature of physical memory, certain sections within a
`memory block` may be absent.
As a result, `memory blocks` that fall between a node’s start and end
PFNs may span across multiple nodes, and some sections within those blocks
may be missing. `Memory blocks` have a fixed size, which is architecture
dependent.
Due to these considerations, the memory block registration is currently
performed as follows:
for_each_online_node(nid):
start_pfn = pgdat->node_start_pfn;
end_pfn = pgdat->node_start_pfn + node_spanned_pages;
for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
mem_blk = memory_block_id(pfn_to_section_nr(pfn));
pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
pfn_mb_end = pfn_start + memory_block_pfns - 1
for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
if (get_nid_for_pfn(pfn) != nid):
continue;
else
do_register_memory_block_under_node(nid, mem_blk,
MEMINIT_EARLY);
Here, we derive the start and end PFNs from the node's pg_data, then
determine the memory blocks that may belong to the node. For each
`memory block` in this range, we inspect all PFNs it contains and check
their associated NUMA node ID. If a PFN within the block matches the
current node, the memory block is registered under that node.
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
a binary search in the `memblock regions` to determine the NUMA node ID
for a given PFN. If it is not enabled, the node ID is retrieved directly
from the struct page.
On large systems, this process can become time-consuming, especially since
we iterate over each `memory block` and all PFNs within it until a match is
found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
overhead of the binary search increases the execution time significantly,
potentially leading to soft lockups during boot.
In this patch, we iterate over `memblock region` to identify the
`memory blocks` that belong to the current NUMA node. `memblock regions`
are contiguous memory ranges, each associated with a single NUMA node, and
they do not span across multiple nodes.
for_each_online_node(nid):
for_each_memory_region(r): // r => region
if (r->nid != nid):
continue;
else
for_each_memory_block_between(r->base, r->base + r->size - 1):
do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
We iterate over all `memblock regions` and identify those that belong to
the current NUMA node. For each `memblock region` associated with the
current node, we calculate the start and end `memory blocks` based on the
region's start and end PFNs. We then register all `memory blocks` within
that range under the current node.
Test Results on My system with 32TB RAM
=======================================
1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
Without this patch
------------------
Startup finished in 1min 16.528s (kernel)
With this patch
---------------
Startup finished in 17.236s (kernel) - 78% Improvement
2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
Without this patch
------------------
Startup finished in 28.320s (kernel)
With this patch
---------------
Startup finished in 15.621s (kernel) - 46% Improvement
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v3 -> v4
Addressed Mike's comment by making node_dev_init() call __register_one_node().
V3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
drivers/base/memory.c | 4 ++--
drivers/base/node.c | 41 ++++++++++++++++++++++++++++++++++++++++-
include/linux/memory.h | 2 ++
include/linux/node.h | 3 +++
4 files changed, 47 insertions(+), 3 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 19469e7f88c2..7f1d266ae593 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -60,7 +60,7 @@ static inline unsigned long pfn_to_block_id(unsigned long pfn)
return memory_block_id(pfn_to_section_nr(pfn));
}
-static inline unsigned long phys_to_block_id(unsigned long phys)
+unsigned long phys_to_block_id(unsigned long phys)
{
return pfn_to_block_id(PFN_DOWN(phys));
}
@@ -632,7 +632,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
*
* Called under device_hotplug_lock.
*/
-static struct memory_block *find_memory_block_by_id(unsigned long block_id)
+struct memory_block *find_memory_block_by_id(unsigned long block_id)
{
struct memory_block *mem;
diff --git a/drivers/base/node.c b/drivers/base/node.c
index cd13ef287011..f8cafd8c8fb1 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,7 @@
#include <linux/pm_runtime.h>
#include <linux/swap.h>
#include <linux/slab.h>
+#include <linux/memblock.h>
static const struct bus_type node_subsys = {
.name = "node",
@@ -850,6 +851,43 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
}
+/*
+ * register_memory_blocks_under_node_early : Register the memory
+ * blocks under the current node.
+ * @nid : Current node under registration
+ *
+ * This function iterates over all memblock regions and identifies the regions
+ * that belong to the current node. For each region which belongs to current
+ * node, it calculates the start and end memory blocks based on the region's
+ * start and end PFNs. It then registers all memory blocks within that range
+ * under the current node.
+ */
+static void register_memory_blocks_under_node_early(int nid)
+{
+ struct memblock_region *r;
+
+ for_each_mem_region(r) {
+ if (r->nid != nid)
+ continue;
+
+ const unsigned long start_block_id = phys_to_block_id(r->base);
+ const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
+ unsigned long block_id;
+
+ for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
+ struct memory_block *mem;
+
+ mem = find_memory_block_by_id(block_id);
+ if (!mem)
+ continue;
+
+ do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
+ put_device(&mem->dev);
+ }
+
+ }
+}
+
void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
unsigned long end_pfn,
enum meminit_context context)
@@ -974,8 +1012,9 @@ void __init node_dev_init(void)
* to applicable memory block devices and already created cpu devices.
*/
for_each_online_node(i) {
- ret = register_one_node(i);
+ ret = __register_one_node(i);
if (ret)
panic("%s() failed to add node: %d\n", __func__, ret);
+ register_memory_blocks_under_node_early(i);
}
}
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 12daa6ec7d09..cb8579226536 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -171,6 +171,8 @@ struct memory_group *memory_group_find_by_id(int mgid);
typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
struct memory_group *excluded, void *arg);
+unsigned long phys_to_block_id(unsigned long phys);
+struct memory_block *find_memory_block_by_id(unsigned long block_id);
#define hotplug_memory_notifier(fn, pri) ({ \
static __meminitdata struct notifier_block fn##_mem_nb =\
{ .notifier_call = fn, .priority = pri };\
diff --git a/include/linux/node.h b/include/linux/node.h
index 2b7517892230..806e62638cbe 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -120,6 +120,9 @@ static inline void register_memory_blocks_under_node(int nid, unsigned long star
enum meminit_context context)
{
}
+static inline void register_memory_blocks_under_node_early(int nid)
+{
+}
#endif
extern void unregister_node(struct node *node);
--
2.43.5
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v4 2/4] driver/base: remove register_mem_block_under_node_early()
2025-05-16 8:19 [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time Donet Tom
@ 2025-05-16 8:19 ` Donet Tom
2025-05-16 10:10 ` Mike Rapoport
2025-05-20 10:05 ` Oscar Salvador
2025-05-16 8:19 ` [PATCH v4 3/4] Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
` (2 subsequent siblings)
3 siblings, 2 replies; 18+ messages in thread
From: Donet Tom @ 2025-05-16 8:19 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan
Cc: Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang, Donet Tom
The function register_mem_block_under_node_early() is no longer used,
as register_memory_blocks_under_node_early() now handles memory block
registration during early boot.
Removed register_mem_block_under_node_early() and get_nid_for_pfn(),
the latter was only used by the former.
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v3->v4
Added Acked-by tag
v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
drivers/base/node.c | 58 +--------------------------------------------
1 file changed, 1 insertion(+), 57 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index f8cafd8c8fb1..8a14ebcae3b9 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -748,15 +748,6 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
}
#ifdef CONFIG_MEMORY_HOTPLUG
-static int __ref get_nid_for_pfn(unsigned long pfn)
-{
-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
- if (system_state < SYSTEM_RUNNING)
- return early_pfn_to_nid(pfn);
-#endif
- return pfn_to_nid(pfn);
-}
-
static void do_register_memory_block_under_node(int nid,
struct memory_block *mem_blk,
enum meminit_context context)
@@ -783,46 +774,6 @@ static void do_register_memory_block_under_node(int nid,
ret);
}
-/* register memory section under specified node if it spans that node */
-static int register_mem_block_under_node_early(struct memory_block *mem_blk,
- void *arg)
-{
- unsigned long memory_block_pfns = memory_block_size_bytes() / PAGE_SIZE;
- unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
- unsigned long end_pfn = start_pfn + memory_block_pfns - 1;
- int nid = *(int *)arg;
- unsigned long pfn;
-
- for (pfn = start_pfn; pfn <= end_pfn; pfn++) {
- int page_nid;
-
- /*
- * memory block could have several absent sections from start.
- * skip pfn range from absent section
- */
- if (!pfn_in_present_section(pfn)) {
- pfn = round_down(pfn + PAGES_PER_SECTION,
- PAGES_PER_SECTION) - 1;
- continue;
- }
-
- /*
- * We need to check if page belongs to nid only at the boot
- * case because node's ranges can be interleaved.
- */
- page_nid = get_nid_for_pfn(pfn);
- if (page_nid < 0)
- continue;
- if (page_nid != nid)
- continue;
-
- do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
- return 0;
- }
- /* mem section does not span the specified node */
- return 0;
-}
-
/*
* During hotplug we know that all pages in the memory block belong to the same
* node.
@@ -892,15 +843,8 @@ void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
unsigned long end_pfn,
enum meminit_context context)
{
- walk_memory_blocks_func_t func;
-
- if (context == MEMINIT_HOTPLUG)
- func = register_mem_block_under_node_hotplug;
- else
- func = register_mem_block_under_node_early;
-
walk_memory_blocks(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn),
- (void *)&nid, func);
+ (void *)&nid, register_mem_block_under_node_hotplug);
return;
}
#endif /* CONFIG_MEMORY_HOTPLUG */
--
2.43.5
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v4 3/4] Remove register_memory_blocks_under_node() function call from register_one_node
2025-05-16 8:19 [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time Donet Tom
2025-05-16 8:19 ` [PATCH v4 2/4] driver/base: remove register_mem_block_under_node_early() Donet Tom
@ 2025-05-16 8:19 ` Donet Tom
2025-05-16 9:18 ` David Hildenbrand
` (2 more replies)
2025-05-16 8:19 ` [PATCH v4 4/4] drivers/base : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
2025-05-16 9:15 ` [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time David Hildenbrand
3 siblings, 3 replies; 18+ messages in thread
From: Donet Tom @ 2025-05-16 8:19 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan
Cc: Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang, Donet Tom
register_one_node() is now only called via cpu_up() → __try_online_node()
during CPU hotplug operations to online a node. At this stage, the node has
not yet had any memory added. As a result, there are no memory blocks to
walk or register, so calling register_memory_blocks_under_node() is
unnecessary. Therefore, the call to register_memory_blocks_under_node()
has been removed from register_one_node().
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v3->v4
Addressed Mike's comment by dropping the call to
register_memory_blocks_under_node() from register_one_node()
v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
include/linux/node.h | 6 ------
1 file changed, 6 deletions(-)
diff --git a/include/linux/node.h b/include/linux/node.h
index 806e62638cbe..8b8f96ca5b06 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -137,15 +137,9 @@ static inline int register_one_node(int nid)
int error = 0;
if (node_online(nid)) {
- struct pglist_data *pgdat = NODE_DATA(nid);
- unsigned long start_pfn = pgdat->node_start_pfn;
- unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
-
error = __register_one_node(nid);
if (error)
return error;
- register_memory_blocks_under_node(nid, start_pfn, end_pfn,
- MEMINIT_EARLY);
}
return error;
--
2.43.5
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v4 4/4] drivers/base : Rename register_memory_blocks_under_node() and remove context argument
2025-05-16 8:19 [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time Donet Tom
2025-05-16 8:19 ` [PATCH v4 2/4] driver/base: remove register_mem_block_under_node_early() Donet Tom
2025-05-16 8:19 ` [PATCH v4 3/4] Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
@ 2025-05-16 8:19 ` Donet Tom
2025-05-16 9:18 ` David Hildenbrand
` (2 more replies)
2025-05-16 9:15 ` [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time David Hildenbrand
3 siblings, 3 replies; 18+ messages in thread
From: Donet Tom @ 2025-05-16 8:19 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan
Cc: Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang, Donet Tom
The function register_memory_blocks_under_node() is now only called from
the memory hotplug path, as register_memory_blocks_under_node_early()
handles registration during early boot. Therefore, the context argument
used to differentiate between early boot and hotplug is no longer needed
and was removed.
Since the function is only called from the hotplug path, we renamed
register_memory_blocks_under_node() to
register_memory_blocks_under_node_hotplug()
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
v3->v4
Added Acked-by tag
v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
---
drivers/base/node.c | 5 ++---
include/linux/node.h | 11 +++++------
mm/memory_hotplug.c | 5 ++---
3 files changed, 9 insertions(+), 12 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 8a14ebcae3b9..52b0bf91ccd8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -839,9 +839,8 @@ static void register_memory_blocks_under_node_early(int nid)
}
}
-void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
- unsigned long end_pfn,
- enum meminit_context context)
+void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
+ unsigned long end_pfn)
{
walk_memory_blocks(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn),
(void *)&nid, register_mem_block_under_node_hotplug);
diff --git a/include/linux/node.h b/include/linux/node.h
index 8b8f96ca5b06..d3a31d18625e 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -111,13 +111,12 @@ struct memory_block;
extern struct node *node_devices[];
#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA)
-void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
- unsigned long end_pfn,
- enum meminit_context context);
+void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn,
+ unsigned long end_pfn);
#else
-static inline void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
- unsigned long end_pfn,
- enum meminit_context context)
+static inline void register_memory_blocks_under_node_hotplug(int nid,
+ unsigned long start_pfn,
+ unsigned long end_pfn)
{
}
static inline void register_memory_blocks_under_node_early(int nid)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8305483de38b..f734cc924b51 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1575,9 +1575,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
BUG_ON(ret);
}
- register_memory_blocks_under_node(nid, PFN_DOWN(start),
- PFN_UP(start + size - 1),
- MEMINIT_HOTPLUG);
+ register_memory_blocks_under_node_hotplug(nid, PFN_DOWN(start),
+ PFN_UP(start + size - 1));
/* create new memmap entry */
if (!strcmp(res->name, "System RAM"))
--
2.43.5
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time
2025-05-16 8:19 [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time Donet Tom
` (2 preceding siblings ...)
2025-05-16 8:19 ` [PATCH v4 4/4] drivers/base : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
@ 2025-05-16 9:15 ` David Hildenbrand
2025-05-16 10:09 ` Mike Rapoport
2025-05-16 11:00 ` Donet Tom
3 siblings, 2 replies; 18+ messages in thread
From: David Hildenbrand @ 2025-05-16 9:15 UTC (permalink / raw)
To: Donet Tom, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan
Cc: Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang
On 16.05.25 10:19, Donet Tom wrote:
> During node device initialization, `memory blocks` are registered under
> each NUMA node. The `memory blocks` to be registered are identified using
> the node’s start and end PFNs, which are obtained from the node's pg_data
>
> However, not all PFNs within this range necessarily belong to the same
> node—some may belong to other nodes. Additionally, due to the
> discontiguous nature of physical memory, certain sections within a
> `memory block` may be absent.
>
> As a result, `memory blocks` that fall between a node’s start and end
> PFNs may span across multiple nodes, and some sections within those blocks
> may be missing. `Memory blocks` have a fixed size, which is architecture
> dependent.
>
> Due to these considerations, the memory block registration is currently
> performed as follows:
>
> for_each_online_node(nid):
> start_pfn = pgdat->node_start_pfn;
> end_pfn = pgdat->node_start_pfn + node_spanned_pages;
> for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
> mem_blk = memory_block_id(pfn_to_section_nr(pfn));
> pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
> pfn_mb_end = pfn_start + memory_block_pfns - 1
> for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
> if (get_nid_for_pfn(pfn) != nid):
> continue;
> else
> do_register_memory_block_under_node(nid, mem_blk,
> MEMINIT_EARLY);
>
> Here, we derive the start and end PFNs from the node's pg_data, then
> determine the memory blocks that may belong to the node. For each
> `memory block` in this range, we inspect all PFNs it contains and check
> their associated NUMA node ID. If a PFN within the block matches the
> current node, the memory block is registered under that node.
>
> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
> a binary search in the `memblock regions` to determine the NUMA node ID
> for a given PFN. If it is not enabled, the node ID is retrieved directly
> from the struct page.
>
> On large systems, this process can become time-consuming, especially since
> we iterate over each `memory block` and all PFNs within it until a match is
> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
> overhead of the binary search increases the execution time significantly,
> potentially leading to soft lockups during boot.
>
> In this patch, we iterate over `memblock region` to identify the
> `memory blocks` that belong to the current NUMA node. `memblock regions`
> are contiguous memory ranges, each associated with a single NUMA node, and
> they do not span across multiple nodes.
>
> for_each_online_node(nid):
> for_each_memory_region(r): // r => region
> if (r->nid != nid):
> continue;
> else
> for_each_memory_block_between(r->base, r->base + r->size - 1):
> do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
>
> We iterate over all `memblock regions` and identify those that belong to
> the current NUMA node. For each `memblock region` associated with the
> current node, we calculate the start and end `memory blocks` based on the
> region's start and end PFNs. We then register all `memory blocks` within
> that range under the current node.
>
> Test Results on My system with 32TB RAM
> =======================================
> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>
> Without this patch
> ------------------
> Startup finished in 1min 16.528s (kernel)
>
> With this patch
> ---------------
> Startup finished in 17.236s (kernel) - 78% Improvement
>
> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>
> Without this patch
> ------------------
> Startup finished in 28.320s (kernel)
>
> With this patch
> ---------------
> Startup finished in 15.621s (kernel) - 46% Improvement
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>
> ---
> v3 -> v4
>
> Addressed Mike's comment by making node_dev_init() call __register_one_node().
>
> V3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
> v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
> v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
> ---
> drivers/base/memory.c | 4 ++--
> drivers/base/node.c | 41 ++++++++++++++++++++++++++++++++++++++++-
> include/linux/memory.h | 2 ++
> include/linux/node.h | 3 +++
> 4 files changed, 47 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 19469e7f88c2..7f1d266ae593 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -60,7 +60,7 @@ static inline unsigned long pfn_to_block_id(unsigned long pfn)
> return memory_block_id(pfn_to_section_nr(pfn));
> }
>
> -static inline unsigned long phys_to_block_id(unsigned long phys)
> +unsigned long phys_to_block_id(unsigned long phys)
> {
> return pfn_to_block_id(PFN_DOWN(phys));
> }
I was wondering whether we should move all these helpers into a header,
and export sections_per_block instead. Probably doesn't really matter
for your use case.
> @@ -632,7 +632,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
> *
> * Called under device_hotplug_lock.
> */
> -static struct memory_block *find_memory_block_by_id(unsigned long block_id)
> +struct memory_block *find_memory_block_by_id(unsigned long block_id)
> {
> struct memory_block *mem;
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index cd13ef287011..f8cafd8c8fb1 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -20,6 +20,7 @@
> #include <linux/pm_runtime.h>
> #include <linux/swap.h>
> #include <linux/slab.h>
> +#include <linux/memblock.h>
>
> static const struct bus_type node_subsys = {
> .name = "node",
> @@ -850,6 +851,43 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
> kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
> }
>
> +/*
> + * register_memory_blocks_under_node_early : Register the memory
> + * blocks under the current node.
> + * @nid : Current node under registration
> + *
> + * This function iterates over all memblock regions and identifies the regions
> + * that belong to the current node. For each region which belongs to current
> + * node, it calculates the start and end memory blocks based on the region's
> + * start and end PFNs. It then registers all memory blocks within that range
> + * under the current node.
> + */
> +static void register_memory_blocks_under_node_early(int nid)
> +{
> + struct memblock_region *r;
> +
> + for_each_mem_region(r) {
> + if (r->nid != nid)
> + continue;
> +
> + const unsigned long start_block_id = phys_to_block_id(r->base);
> + const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
> + unsigned long block_id;
This should definitely be above the if().
> +
> + for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
> + struct memory_block *mem;
> +
> + mem = find_memory_block_by_id(block_id);
> + if (!mem)
> + continue;
> +
> + do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
> + put_device(&mem->dev);
> + }
> +
> + }
> +}
> +
> void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
> unsigned long end_pfn,
> enum meminit_context context)
> @@ -974,8 +1012,9 @@ void __init node_dev_init(void)
> * to applicable memory block devices and already created cpu devices.
> */
> for_each_online_node(i) {
> - ret = register_one_node(i);
> + ret = __register_one_node(i);
> if (ret)
> panic("%s() failed to add node: %d\n", __func__, ret);
> + register_memory_blocks_under_node_early(i);
> }
In general, LGTM.
BUT :)
I was wondering whether having a register_memory_blocks_early() call
*after* the for_each_online_node(), and walking all memory regions only
once would make a difference.
We'd have to be smart about memory blocks that fall into multiple
regions, but it should be a corner case and doable.
OTOH, we usually don't expect having a lot of regions, so iterating over
them is probably not a big bottleneck? Anyhow, just wanted to raise it.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 3/4] Remove register_memory_blocks_under_node() function call from register_one_node
2025-05-16 8:19 ` [PATCH v4 3/4] Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
@ 2025-05-16 9:18 ` David Hildenbrand
2025-05-16 10:58 ` Donet Tom
2025-05-16 10:10 ` Mike Rapoport
2025-05-20 10:06 ` Oscar Salvador
2 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-05-16 9:18 UTC (permalink / raw)
To: Donet Tom, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan
Cc: Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang
On 16.05.25 10:19, Donet Tom wrote:
> register_one_node() is now only called via cpu_up() → __try_online_node()
> during CPU hotplug operations to online a node. At this stage, the node has
> not yet had any memory added. As a result, there are no memory blocks to
> walk or register, so calling register_memory_blocks_under_node() is
> unnecessary. Therefore, the call to register_memory_blocks_under_node()
> has been removed from register_one_node().
It might help to throw in some empty lines to make this easier to read.
Patch subject should start with something like
"node:" or "mm:"
>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>
> ---
> v3->v4
>
> Addressed Mike's comment by dropping the call to
> register_memory_blocks_under_node() from register_one_node()
>
> v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
> v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
> v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
> ---
> include/linux/node.h | 6 ------
> 1 file changed, 6 deletions(-)
>
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 806e62638cbe..8b8f96ca5b06 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -137,15 +137,9 @@ static inline int register_one_node(int nid)
> int error = 0;
>
> if (node_online(nid)) {
> - struct pglist_data *pgdat = NODE_DATA(nid);
> - unsigned long start_pfn = pgdat->node_start_pfn;
> - unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
> -
> error = __register_one_node(nid);
> if (error)
> return error;
> - register_memory_blocks_under_node(nid, start_pfn, end_pfn,
> - MEMINIT_EARLY);
> }
Can be further simplified
if (node_online(nid))
error = __register_one_node(nid);
return error;
>
> return error;
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 4/4] drivers/base : Rename register_memory_blocks_under_node() and remove context argument
2025-05-16 8:19 ` [PATCH v4 4/4] drivers/base : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
@ 2025-05-16 9:18 ` David Hildenbrand
2025-05-16 10:11 ` Mike Rapoport
2025-05-20 10:07 ` Oscar Salvador
2 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2025-05-16 9:18 UTC (permalink / raw)
To: Donet Tom, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan
Cc: Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang
On 16.05.25 10:19, Donet Tom wrote:
> The function register_memory_blocks_under_node() is now only called from
> the memory hotplug path, as register_memory_blocks_under_node_early()
> handles registration during early boot. Therefore, the context argument
> used to differentiate between early boot and hotplug is no longer needed
> and was removed.
>
> Since the function is only called from the hotplug path, we renamed
> register_memory_blocks_under_node() to
> register_memory_blocks_under_node_hotplug()
>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>
> ---
>
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time
2025-05-16 9:15 ` [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time David Hildenbrand
@ 2025-05-16 10:09 ` Mike Rapoport
2025-05-16 10:12 ` David Hildenbrand
2025-05-16 11:00 ` Donet Tom
1 sibling, 1 reply; 18+ messages in thread
From: Mike Rapoport @ 2025-05-16 10:09 UTC (permalink / raw)
To: David Hildenbrand
Cc: Donet Tom, Andrew Morton, Oscar Salvador, Zi Yan, Ritesh Harjani,
rafael, Danilo Krummrich, Greg Kroah-Hartman, linux-kernel,
linux-mm, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang
On Fri, May 16, 2025 at 11:15:29AM +0200, David Hildenbrand wrote:
> On 16.05.25 10:19, Donet Tom wrote:
> > During node device initialization, `memory blocks` are registered under
> > each NUMA node. The `memory blocks` to be registered are identified using
> > the node’s start and end PFNs, which are obtained from the node's pg_data
> >
> > However, not all PFNs within this range necessarily belong to the same
> > node—some may belong to other nodes. Additionally, due to the
> > discontiguous nature of physical memory, certain sections within a
> > `memory block` may be absent.
> >
> > As a result, `memory blocks` that fall between a node’s start and end
> > PFNs may span across multiple nodes, and some sections within those blocks
> > may be missing. `Memory blocks` have a fixed size, which is architecture
> > dependent.
> >
> > Due to these considerations, the memory block registration is currently
> > performed as follows:
> >
> > for_each_online_node(nid):
> > start_pfn = pgdat->node_start_pfn;
> > end_pfn = pgdat->node_start_pfn + node_spanned_pages;
> > for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
> > mem_blk = memory_block_id(pfn_to_section_nr(pfn));
> > pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
> > pfn_mb_end = pfn_start + memory_block_pfns - 1
> > for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
> > if (get_nid_for_pfn(pfn) != nid):
> > continue;
> > else
> > do_register_memory_block_under_node(nid, mem_blk,
> > MEMINIT_EARLY);
> >
> > Here, we derive the start and end PFNs from the node's pg_data, then
> > determine the memory blocks that may belong to the node. For each
> > `memory block` in this range, we inspect all PFNs it contains and check
> > their associated NUMA node ID. If a PFN within the block matches the
> > current node, the memory block is registered under that node.
> >
> > If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
> > a binary search in the `memblock regions` to determine the NUMA node ID
> > for a given PFN. If it is not enabled, the node ID is retrieved directly
> > from the struct page.
> >
> > On large systems, this process can become time-consuming, especially since
> > we iterate over each `memory block` and all PFNs within it until a match is
> > found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
> > overhead of the binary search increases the execution time significantly,
> > potentially leading to soft lockups during boot.
> >
> > In this patch, we iterate over `memblock region` to identify the
> > `memory blocks` that belong to the current NUMA node. `memblock regions`
> > are contiguous memory ranges, each associated with a single NUMA node, and
> > they do not span across multiple nodes.
> >
> > for_each_online_node(nid):
> > for_each_memory_region(r): // r => region
> > if (r->nid != nid):
> > continue;
> > else
> > for_each_memory_block_between(r->base, r->base + r->size - 1):
> > do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
> >
> > We iterate over all `memblock regions` and identify those that belong to
> > the current NUMA node. For each `memblock region` associated with the
> > current node, we calculate the start and end `memory blocks` based on the
> > region's start and end PFNs. We then register all `memory blocks` within
> > that range under the current node.
> >
> > Test Results on My system with 32TB RAM
> > =======================================
> > 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
> >
> > Without this patch
> > ------------------
> > Startup finished in 1min 16.528s (kernel)
> >
> > With this patch
> > ---------------
> > Startup finished in 17.236s (kernel) - 78% Improvement
> >
> > 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
> >
> > Without this patch
> > ------------------
> > Startup finished in 28.320s (kernel)
> >
> > With this patch
> > ---------------
> > Startup finished in 15.621s (kernel) - 46% Improvement
> >
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Acked-by: Zi Yan <ziy@nvidia.com>
> > Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> >
> > ---
> > v3 -> v4
> >
> > Addressed Mike's comment by making node_dev_init() call __register_one_node().
> >
> > V3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
> > v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
> > v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
> > ---
> > drivers/base/memory.c | 4 ++--
> > drivers/base/node.c | 41 ++++++++++++++++++++++++++++++++++++++++-
> > include/linux/memory.h | 2 ++
> > include/linux/node.h | 3 +++
> > 4 files changed, 47 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index 19469e7f88c2..7f1d266ae593 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -60,7 +60,7 @@ static inline unsigned long pfn_to_block_id(unsigned long pfn)
> > return memory_block_id(pfn_to_section_nr(pfn));
> > }
> > -static inline unsigned long phys_to_block_id(unsigned long phys)
> > +unsigned long phys_to_block_id(unsigned long phys)
> > {
> > return pfn_to_block_id(PFN_DOWN(phys));
> > }
>
>
> I was wondering whether we should move all these helpers into a header, and
> export sections_per_block instead. Probably doesn't really matter for your
> use case.
>
> > @@ -632,7 +632,7 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
> > *
> > * Called under device_hotplug_lock.
> > */
> > -static struct memory_block *find_memory_block_by_id(unsigned long block_id)
> > +struct memory_block *find_memory_block_by_id(unsigned long block_id)
> > {
> > struct memory_block *mem;
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index cd13ef287011..f8cafd8c8fb1 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -20,6 +20,7 @@
> > #include <linux/pm_runtime.h>
> > #include <linux/swap.h>
> > #include <linux/slab.h>
> > +#include <linux/memblock.h>
> > static const struct bus_type node_subsys = {
> > .name = "node",
> > @@ -850,6 +851,43 @@ void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
> > kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
> > }
> > +/*
> > + * register_memory_blocks_under_node_early : Register the memory
> > + * blocks under the current node.
> > + * @nid : Current node under registration
> > + *
> > + * This function iterates over all memblock regions and identifies the regions
> > + * that belong to the current node. For each region which belongs to current
> > + * node, it calculates the start and end memory blocks based on the region's
> > + * start and end PFNs. It then registers all memory blocks within that range
> > + * under the current node.
> > + */
> > +static void register_memory_blocks_under_node_early(int nid)
> > +{
> > + struct memblock_region *r;
> > +
> > + for_each_mem_region(r) {
> > + if (r->nid != nid)
> > + continue;
> > +
> > + const unsigned long start_block_id = phys_to_block_id(r->base);
> > + const unsigned long end_block_id = phys_to_block_id(r->base + r->size - 1);
> > + unsigned long block_id;
>
> This should definitely be above the if().
>
> > +
> > + for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
> > + struct memory_block *mem;
> > +
> > + mem = find_memory_block_by_id(block_id);
> > + if (!mem)
> > + continue;
> > +
> > + do_register_memory_block_under_node(nid, mem, MEMINIT_EARLY);
> > + put_device(&mem->dev);
> > + }
> > +
> > + }
> > +}
> > +
> > void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
> > unsigned long end_pfn,
> > enum meminit_context context)
> > @@ -974,8 +1012,9 @@ void __init node_dev_init(void)
> > * to applicable memory block devices and already created cpu devices.
> > */
> > for_each_online_node(i) {
> > - ret = register_one_node(i);
> > + ret = __register_one_node(i);
> > if (ret)
> > panic("%s() failed to add node: %d\n", __func__, ret);
> > + register_memory_blocks_under_node_early(i);
> > }
>
> In general, LGTM.
>
>
> BUT :)
>
> I was wondering whether having a register_memory_blocks_early() call *after*
> the for_each_online_node(), and walking all memory regions only once would
> make a difference.
I don't know how many nodes there should be to see measurable performance
difference, but having register_memory_blocks_under_node_early() after
for_each_online_node() is definitely nicer.
There's no real need to run for_each_mem_region() for every online node.
> We'd have to be smart about memory blocks that fall into multiple regions,
> but it should be a corner case and doable.
This is a corner case that should be handled regardless of the loop order.
And I don't think it's handled today at all.
If we have a block that crosses node boundaries, current implementation of
register_mem_block_under_node_early() will register it under the first
node.
> OTOH, we usually don't expect having a lot of regions, so iterating over
> them is probably not a big bottleneck? Anyhow, just wanted to raise it.
There would be at least a region per node and having
for_each_online_node()
for_each_mem_region()
makes the loop O(n²) for no good reason.
> --
> Cheers,
>
> David / dhildenb
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 2/4] driver/base: remove register_mem_block_under_node_early()
2025-05-16 8:19 ` [PATCH v4 2/4] driver/base: remove register_mem_block_under_node_early() Donet Tom
@ 2025-05-16 10:10 ` Mike Rapoport
2025-05-20 10:05 ` Oscar Salvador
1 sibling, 0 replies; 18+ messages in thread
From: Mike Rapoport @ 2025-05-16 10:10 UTC (permalink / raw)
To: Donet Tom
Cc: David Hildenbrand, Andrew Morton, Oscar Salvador, Zi Yan,
Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang
On Fri, May 16, 2025 at 03:19:52AM -0500, Donet Tom wrote:
> The function register_mem_block_under_node_early() is no longer used,
> as register_memory_blocks_under_node_early() now handles memory block
> registration during early boot.
>
> Removed register_mem_block_under_node_early() and get_nid_for_pfn(),
> the latter was only used by the former.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> v3->v4
>
> Added Acked-by tag
>
> v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
> v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
> v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
> ---
> drivers/base/node.c | 58 +--------------------------------------------
> 1 file changed, 1 insertion(+), 57 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index f8cafd8c8fb1..8a14ebcae3b9 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -748,15 +748,6 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
> }
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> -static int __ref get_nid_for_pfn(unsigned long pfn)
> -{
> -#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> - if (system_state < SYSTEM_RUNNING)
> - return early_pfn_to_nid(pfn);
> -#endif
> - return pfn_to_nid(pfn);
> -}
> -
> static void do_register_memory_block_under_node(int nid,
> struct memory_block *mem_blk,
> enum meminit_context context)
> @@ -783,46 +774,6 @@ static void do_register_memory_block_under_node(int nid,
> ret);
> }
>
> -/* register memory section under specified node if it spans that node */
> -static int register_mem_block_under_node_early(struct memory_block *mem_blk,
> - void *arg)
> -{
> - unsigned long memory_block_pfns = memory_block_size_bytes() / PAGE_SIZE;
> - unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
> - unsigned long end_pfn = start_pfn + memory_block_pfns - 1;
> - int nid = *(int *)arg;
> - unsigned long pfn;
> -
> - for (pfn = start_pfn; pfn <= end_pfn; pfn++) {
> - int page_nid;
> -
> - /*
> - * memory block could have several absent sections from start.
> - * skip pfn range from absent section
> - */
> - if (!pfn_in_present_section(pfn)) {
> - pfn = round_down(pfn + PAGES_PER_SECTION,
> - PAGES_PER_SECTION) - 1;
> - continue;
> - }
> -
> - /*
> - * We need to check if page belongs to nid only at the boot
> - * case because node's ranges can be interleaved.
> - */
> - page_nid = get_nid_for_pfn(pfn);
> - if (page_nid < 0)
> - continue;
> - if (page_nid != nid)
> - continue;
> -
> - do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY);
> - return 0;
> - }
> - /* mem section does not span the specified node */
> - return 0;
> -}
> -
> /*
> * During hotplug we know that all pages in the memory block belong to the same
> * node.
> @@ -892,15 +843,8 @@ void register_memory_blocks_under_node(int nid, unsigned long start_pfn,
> unsigned long end_pfn,
> enum meminit_context context)
> {
> - walk_memory_blocks_func_t func;
> -
> - if (context == MEMINIT_HOTPLUG)
> - func = register_mem_block_under_node_hotplug;
> - else
> - func = register_mem_block_under_node_early;
> -
> walk_memory_blocks(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn),
> - (void *)&nid, func);
> + (void *)&nid, register_mem_block_under_node_hotplug);
> return;
> }
> #endif /* CONFIG_MEMORY_HOTPLUG */
> --
> 2.43.5
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 3/4] Remove register_memory_blocks_under_node() function call from register_one_node
2025-05-16 8:19 ` [PATCH v4 3/4] Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
2025-05-16 9:18 ` David Hildenbrand
@ 2025-05-16 10:10 ` Mike Rapoport
2025-05-20 10:06 ` Oscar Salvador
2 siblings, 0 replies; 18+ messages in thread
From: Mike Rapoport @ 2025-05-16 10:10 UTC (permalink / raw)
To: Donet Tom
Cc: David Hildenbrand, Andrew Morton, Oscar Salvador, Zi Yan,
Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang
On Fri, May 16, 2025 at 03:19:53AM -0500, Donet Tom wrote:
> register_one_node() is now only called via cpu_up() → __try_online_node()
> during CPU hotplug operations to online a node. At this stage, the node has
> not yet had any memory added. As a result, there are no memory blocks to
> walk or register, so calling register_memory_blocks_under_node() is
> unnecessary. Therefore, the call to register_memory_blocks_under_node()
> has been removed from register_one_node().
>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> v3->v4
>
> Addressed Mike's comment by dropping the call to
> register_memory_blocks_under_node() from register_one_node()
>
> v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
> v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
> v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
> ---
> include/linux/node.h | 6 ------
> 1 file changed, 6 deletions(-)
>
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 806e62638cbe..8b8f96ca5b06 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -137,15 +137,9 @@ static inline int register_one_node(int nid)
> int error = 0;
>
> if (node_online(nid)) {
> - struct pglist_data *pgdat = NODE_DATA(nid);
> - unsigned long start_pfn = pgdat->node_start_pfn;
> - unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
> -
> error = __register_one_node(nid);
> if (error)
> return error;
> - register_memory_blocks_under_node(nid, start_pfn, end_pfn,
> - MEMINIT_EARLY);
> }
>
> return error;
> --
> 2.43.5
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 4/4] drivers/base : Rename register_memory_blocks_under_node() and remove context argument
2025-05-16 8:19 ` [PATCH v4 4/4] drivers/base : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
2025-05-16 9:18 ` David Hildenbrand
@ 2025-05-16 10:11 ` Mike Rapoport
2025-05-20 10:07 ` Oscar Salvador
2 siblings, 0 replies; 18+ messages in thread
From: Mike Rapoport @ 2025-05-16 10:11 UTC (permalink / raw)
To: Donet Tom
Cc: David Hildenbrand, Andrew Morton, Oscar Salvador, Zi Yan,
Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang
On Fri, May 16, 2025 at 03:19:54AM -0500, Donet Tom wrote:
> The function register_memory_blocks_under_node() is now only called from
> the memory hotplug path, as register_memory_blocks_under_node_early()
> handles registration during early boot. Therefore, the context argument
> used to differentiate between early boot and hotplug is no longer needed
> and was removed.
>
> Since the function is only called from the hotplug path, we renamed
> register_memory_blocks_under_node() to
> register_memory_blocks_under_node_hotplug()
>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>
> v3->v4
>
> Added Acked-by tag
>
> v3 - https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
> v2 - https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
> v1 - https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
> ---
> drivers/base/node.c | 5 ++---
> include/linux/node.h | 11 +++++------
> mm/memory_hotplug.c | 5 ++---
> 3 files changed, 9 insertions(+), 12 deletions(-)
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time
2025-05-16 10:09 ` Mike Rapoport
@ 2025-05-16 10:12 ` David Hildenbrand
2025-05-16 11:00 ` Donet Tom
0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-05-16 10:12 UTC (permalink / raw)
To: Mike Rapoport
Cc: Donet Tom, Andrew Morton, Oscar Salvador, Zi Yan, Ritesh Harjani,
rafael, Danilo Krummrich, Greg Kroah-Hartman, linux-kernel,
linux-mm, Jonathan Cameron, Alison Schofield, Yury Norov,
Dave Jiang
>> We'd have to be smart about memory blocks that fall into multiple regions,
>> but it should be a corner case and doable.
>
> This is a corner case that should be handled regardless of the loop order.
> And I don't think it's handled today at all.
>
> If we have a block that crosses node boundaries, current implementation of
> register_mem_block_under_node_early() will register it under the first
> node.
At least upstream behavior should be that it would be linked under all
nodes. At least that's what I remember :)
>
>> OTOH, we usually don't expect having a lot of regions, so iterating over
>> them is probably not a big bottleneck? Anyhow, just wanted to raise it.
>
> There would be at least a region per node and having
>
> for_each_online_node()
> for_each_mem_region()
>
> makes the loop O(n²) for no good reason.
Yes, that's why I mentioned it. If we have many nodes it might
definitely be relevant.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 3/4] Remove register_memory_blocks_under_node() function call from register_one_node
2025-05-16 9:18 ` David Hildenbrand
@ 2025-05-16 10:58 ` Donet Tom
0 siblings, 0 replies; 18+ messages in thread
From: Donet Tom @ 2025-05-16 10:58 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan
Cc: Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang
On 5/16/25 2:48 PM, David Hildenbrand wrote:
> On 16.05.25 10:19, Donet Tom wrote:
>> register_one_node() is now only called via cpu_up() →
>> __try_online_node()
>> during CPU hotplug operations to online a node. At this stage, the
>> node has
>> not yet had any memory added. As a result, there are no memory blocks to
>> walk or register, so calling register_memory_blocks_under_node() is
>> unnecessary. Therefore, the call to register_memory_blocks_under_node()
>> has been removed from register_one_node().
>
> It might help to throw in some empty lines to make this easier to read.
>
> Patch subject should start with something like
>
> "node:" or "mm:"
Sorry. I missed it. I will add it in the next version.
>
>>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>
>> ---
>> v3->v4
>>
>> Addressed Mike's comment by dropping the call to
>> register_memory_blocks_under_node() from register_one_node()
>>
>> v3 -
>> https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
>> v2 -
>> https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
>> v1 -
>> https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
>> ---
>> include/linux/node.h | 6 ------
>> 1 file changed, 6 deletions(-)
>>
>> diff --git a/include/linux/node.h b/include/linux/node.h
>> index 806e62638cbe..8b8f96ca5b06 100644
>> --- a/include/linux/node.h
>> +++ b/include/linux/node.h
>> @@ -137,15 +137,9 @@ static inline int register_one_node(int nid)
>> int error = 0;
>> if (node_online(nid)) {
>> - struct pglist_data *pgdat = NODE_DATA(nid);
>> - unsigned long start_pfn = pgdat->node_start_pfn;
>> - unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
>> -
>> error = __register_one_node(nid);
>> if (error)
>> return error;
>> - register_memory_blocks_under_node(nid, start_pfn, end_pfn,
>> - MEMINIT_EARLY);
>> }
>
> Can be further simplified
>
> if (node_online(nid))
> error = __register_one_node(nid);
> return error;
Thanks. I will add it and it in the next version.
>
>> return error;
>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time
2025-05-16 9:15 ` [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time David Hildenbrand
2025-05-16 10:09 ` Mike Rapoport
@ 2025-05-16 11:00 ` Donet Tom
1 sibling, 0 replies; 18+ messages in thread
From: Donet Tom @ 2025-05-16 11:00 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Mike Rapoport, Oscar Salvador, Zi Yan
Cc: Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang
On 5/16/25 2:45 PM, David Hildenbrand wrote:
> On 16.05.25 10:19, Donet Tom wrote:
>> During node device initialization, `memory blocks` are registered under
>> each NUMA node. The `memory blocks` to be registered are identified
>> using
>> the node’s start and end PFNs, which are obtained from the node's
>> pg_data
>>
>> However, not all PFNs within this range necessarily belong to the same
>> node—some may belong to other nodes. Additionally, due to the
>> discontiguous nature of physical memory, certain sections within a
>> `memory block` may be absent.
>>
>> As a result, `memory blocks` that fall between a node’s start and end
>> PFNs may span across multiple nodes, and some sections within those
>> blocks
>> may be missing. `Memory blocks` have a fixed size, which is architecture
>> dependent.
>>
>> Due to these considerations, the memory block registration is currently
>> performed as follows:
>>
>> for_each_online_node(nid):
>> start_pfn = pgdat->node_start_pfn;
>> end_pfn = pgdat->node_start_pfn + node_spanned_pages;
>> for_each_memory_block_between(PFN_PHYS(start_pfn),
>> PFN_PHYS(end_pfn))
>> mem_blk = memory_block_id(pfn_to_section_nr(pfn));
>> pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
>> pfn_mb_end = pfn_start + memory_block_pfns - 1
>> for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
>> if (get_nid_for_pfn(pfn) != nid):
>> continue;
>> else
>> do_register_memory_block_under_node(nid, mem_blk,
>> MEMINIT_EARLY);
>>
>> Here, we derive the start and end PFNs from the node's pg_data, then
>> determine the memory blocks that may belong to the node. For each
>> `memory block` in this range, we inspect all PFNs it contains and check
>> their associated NUMA node ID. If a PFN within the block matches the
>> current node, the memory block is registered under that node.
>>
>> If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn()
>> performs
>> a binary search in the `memblock regions` to determine the NUMA node ID
>> for a given PFN. If it is not enabled, the node ID is retrieved directly
>> from the struct page.
>>
>> On large systems, this process can become time-consuming, especially
>> since
>> we iterate over each `memory block` and all PFNs within it until a
>> match is
>> found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional
>> overhead of the binary search increases the execution time
>> significantly,
>> potentially leading to soft lockups during boot.
>>
>> In this patch, we iterate over `memblock region` to identify the
>> `memory blocks` that belong to the current NUMA node. `memblock regions`
>> are contiguous memory ranges, each associated with a single NUMA
>> node, and
>> they do not span across multiple nodes.
>>
>> for_each_online_node(nid):
>> for_each_memory_region(r): // r => region
>> if (r->nid != nid):
>> continue;
>> else
>> for_each_memory_block_between(r->base, r->base + r->size - 1):
>> do_register_memory_block_under_node(nid, mem_blk,
>> MEMINIT_EARLY);
>>
>> We iterate over all `memblock regions` and identify those that belong to
>> the current NUMA node. For each `memblock region` associated with the
>> current node, we calculate the start and end `memory blocks` based on
>> the
>> region's start and end PFNs. We then register all `memory blocks` within
>> that range under the current node.
>>
>> Test Results on My system with 32TB RAM
>> =======================================
>> 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 1min 16.528s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 17.236s (kernel) - 78% Improvement
>>
>> 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
>>
>> Without this patch
>> ------------------
>> Startup finished in 28.320s (kernel)
>>
>> With this patch
>> ---------------
>> Startup finished in 15.621s (kernel) - 46% Improvement
>>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> Acked-by: Zi Yan <ziy@nvidia.com>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>
>> ---
>> v3 -> v4
>>
>> Addressed Mike's comment by making node_dev_init() call
>> __register_one_node().
>>
>> V3 -
>> https://lore.kernel.org/all/b49ed289096643ff5b5fbedcf1d1c1be42845a74.1746250339.git.donettom@linux.ibm.com/
>> v2 -
>> https://lore.kernel.org/all/fbe1e0c7d91bf3fa9a64ff5d84b53ded1d0d5ac7.1745852397.git.donettom@linux.ibm.com/
>> v1 -
>> https://lore.kernel.org/all/50142a29010463f436dc5c4feb540e5de3bb09df.1744175097.git.donettom@linux.ibm.com/
>> ---
>> drivers/base/memory.c | 4 ++--
>> drivers/base/node.c | 41 ++++++++++++++++++++++++++++++++++++++++-
>> include/linux/memory.h | 2 ++
>> include/linux/node.h | 3 +++
>> 4 files changed, 47 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index 19469e7f88c2..7f1d266ae593 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -60,7 +60,7 @@ static inline unsigned long
>> pfn_to_block_id(unsigned long pfn)
>> return memory_block_id(pfn_to_section_nr(pfn));
>> }
>> -static inline unsigned long phys_to_block_id(unsigned long phys)
>> +unsigned long phys_to_block_id(unsigned long phys)
>> {
>> return pfn_to_block_id(PFN_DOWN(phys));
>> }
>
>
> I was wondering whether we should move all these helpers into a
> header, and export sections_per_block instead. Probably doesn't really
> matter for your use case.
So, memory_block_id(), pfn_to_block_id(), and phys_to_block_id() should
be moved to memory.h, right?
I will do it and send the next version.
>
>> @@ -632,7 +632,7 @@ int __weak arch_get_memory_phys_device(unsigned
>> long start_pfn)
>> *
>> * Called under device_hotplug_lock.
>> */
>> -static struct memory_block *find_memory_block_by_id(unsigned long
>> block_id)
>> +struct memory_block *find_memory_block_by_id(unsigned long block_id)
>> {
>> struct memory_block *mem;
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index cd13ef287011..f8cafd8c8fb1 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -20,6 +20,7 @@
>> #include <linux/pm_runtime.h>
>> #include <linux/swap.h>
>> #include <linux/slab.h>
>> +#include <linux/memblock.h>
>> static const struct bus_type node_subsys = {
>> .name = "node",
>> @@ -850,6 +851,43 @@ void unregister_memory_block_under_nodes(struct
>> memory_block *mem_blk)
>> kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
>> }
>> +/*
>> + * register_memory_blocks_under_node_early : Register the memory
>> + * blocks under the current node.
>> + * @nid : Current node under registration
>> + *
>> + * This function iterates over all memblock regions and identifies
>> the regions
>> + * that belong to the current node. For each region which belongs to
>> current
>> + * node, it calculates the start and end memory blocks based on the
>> region's
>> + * start and end PFNs. It then registers all memory blocks within
>> that range
>> + * under the current node.
>> + */
>> +static void register_memory_blocks_under_node_early(int nid)
>> +{
>> + struct memblock_region *r;
>> +
>> + for_each_mem_region(r) {
>> + if (r->nid != nid)
>> + continue;
>> +
>> + const unsigned long start_block_id = phys_to_block_id(r->base);
>> + const unsigned long end_block_id = phys_to_block_id(r->base
>> + r->size - 1);
>> + unsigned long block_id;
>
> This should definitely be above the if().
>
Sure, I will change it.
>
>> +
>> + for (block_id = start_block_id; block_id <= end_block_id;
>> block_id++) {
>> + struct memory_block *mem;
>> +
>> + mem = find_memory_block_by_id(block_id);
>> + if (!mem)
>> + continue;
>> +
>> + do_register_memory_block_under_node(nid, mem,
>> MEMINIT_EARLY);
>> + put_device(&mem->dev);
>> + }
>> +
>> + }
>> +}
>> +
>> void register_memory_blocks_under_node(int nid, unsigned long
>> start_pfn,
>> unsigned long end_pfn,
>> enum meminit_context context)
>> @@ -974,8 +1012,9 @@ void __init node_dev_init(void)
>> * to applicable memory block devices and already created cpu
>> devices.
>> */
>> for_each_online_node(i) {
>> - ret = register_one_node(i);
>> + ret = __register_one_node(i);
>> if (ret)
>> panic("%s() failed to add node: %d\n", __func__, ret);
>> + register_memory_blocks_under_node_early(i);
>> }
>
> In general, LGTM.
>
>
> BUT :)
>
> I was wondering whether having a register_memory_blocks_early() call
> *after* the for_each_online_node(), and walking all memory regions
> only once would make a difference.
>
> We'd have to be smart about memory blocks that fall into multiple
> regions, but it should be a corner case and doable.
>
> OTOH, we usually don't expect having a lot of regions, so iterating
> over them is probably not a big bottleneck? Anyhow, just wanted to
> raise it.
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time
2025-05-16 10:12 ` David Hildenbrand
@ 2025-05-16 11:00 ` Donet Tom
0 siblings, 0 replies; 18+ messages in thread
From: Donet Tom @ 2025-05-16 11:00 UTC (permalink / raw)
To: David Hildenbrand, Mike Rapoport
Cc: Andrew Morton, Oscar Salvador, Zi Yan, Ritesh Harjani, rafael,
Danilo Krummrich, Greg Kroah-Hartman, linux-kernel, linux-mm,
Jonathan Cameron, Alison Schofield, Yury Norov, Dave Jiang
On 5/16/25 3:42 PM, David Hildenbrand wrote:
>>> We'd have to be smart about memory blocks that fall into multiple
>>> regions,
>>> but it should be a corner case and doable.
>>
>> This is a corner case that should be handled regardless of the loop
>> order.
>> And I don't think it's handled today at all.
>>
>> If we have a block that crosses node boundaries, current
>> implementation of
>> register_mem_block_under_node_early() will register it under the first
>> node.
>
> At least upstream behavior should be that it would be linked under all
> nodes. At least that's what I remember :)
>
>>> OTOH, we usually don't expect having a lot of regions, so iterating
>>> over
>>> them is probably not a big bottleneck? Anyhow, just wanted to raise it.
>>
>> There would be at least a region per node and having
>>
>> for_each_online_node()
>> for_each_mem_region()
>>
>> makes the loop O(n²) for no good reason.
>
> Yes, that's why I mentioned it. If we have many nodes it might
> definitely be relevant.
Thanks David and Mike
I will implement this and send the next version.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 2/4] driver/base: remove register_mem_block_under_node_early()
2025-05-16 8:19 ` [PATCH v4 2/4] driver/base: remove register_mem_block_under_node_early() Donet Tom
2025-05-16 10:10 ` Mike Rapoport
@ 2025-05-20 10:05 ` Oscar Salvador
1 sibling, 0 replies; 18+ messages in thread
From: Oscar Salvador @ 2025-05-20 10:05 UTC (permalink / raw)
To: Donet Tom
Cc: David Hildenbrand, Andrew Morton, Mike Rapoport, Zi Yan,
Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang
On Fri, May 16, 2025 at 03:19:52AM -0500, Donet Tom wrote:
> The function register_mem_block_under_node_early() is no longer used,
> as register_memory_blocks_under_node_early() now handles memory block
> registration during early boot.
>
> Removed register_mem_block_under_node_early() and get_nid_for_pfn(),
> the latter was only used by the former.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 3/4] Remove register_memory_blocks_under_node() function call from register_one_node
2025-05-16 8:19 ` [PATCH v4 3/4] Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
2025-05-16 9:18 ` David Hildenbrand
2025-05-16 10:10 ` Mike Rapoport
@ 2025-05-20 10:06 ` Oscar Salvador
2 siblings, 0 replies; 18+ messages in thread
From: Oscar Salvador @ 2025-05-20 10:06 UTC (permalink / raw)
To: Donet Tom
Cc: David Hildenbrand, Andrew Morton, Mike Rapoport, Zi Yan,
Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang
On Fri, May 16, 2025 at 03:19:53AM -0500, Donet Tom wrote:
> register_one_node() is now only called via cpu_up() → __try_online_node()
> during CPU hotplug operations to online a node. At this stage, the node has
> not yet had any memory added. As a result, there are no memory blocks to
> walk or register, so calling register_memory_blocks_under_node() is
> unnecessary. Therefore, the call to register_memory_blocks_under_node()
> has been removed from register_one_node().
>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
With the comments raised by David addressed:
Acked-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 4/4] drivers/base : Rename register_memory_blocks_under_node() and remove context argument
2025-05-16 8:19 ` [PATCH v4 4/4] drivers/base : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
2025-05-16 9:18 ` David Hildenbrand
2025-05-16 10:11 ` Mike Rapoport
@ 2025-05-20 10:07 ` Oscar Salvador
2 siblings, 0 replies; 18+ messages in thread
From: Oscar Salvador @ 2025-05-20 10:07 UTC (permalink / raw)
To: Donet Tom
Cc: David Hildenbrand, Andrew Morton, Mike Rapoport, Zi Yan,
Ritesh Harjani, rafael, Danilo Krummrich, Greg Kroah-Hartman,
linux-kernel, linux-mm, Jonathan Cameron, Alison Schofield,
Yury Norov, Dave Jiang
On Fri, May 16, 2025 at 03:19:54AM -0500, Donet Tom wrote:
> The function register_memory_blocks_under_node() is now only called from
> the memory hotplug path, as register_memory_blocks_under_node_early()
> handles registration during early boot. Therefore, the context argument
> used to differentiate between early boot and hotplug is no longer needed
> and was removed.
>
> Since the function is only called from the hotplug path, we renamed
> register_memory_blocks_under_node() to
> register_memory_blocks_under_node_hotplug()
>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2025-05-20 10:07 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-16 8:19 [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time Donet Tom
2025-05-16 8:19 ` [PATCH v4 2/4] driver/base: remove register_mem_block_under_node_early() Donet Tom
2025-05-16 10:10 ` Mike Rapoport
2025-05-20 10:05 ` Oscar Salvador
2025-05-16 8:19 ` [PATCH v4 3/4] Remove register_memory_blocks_under_node() function call from register_one_node Donet Tom
2025-05-16 9:18 ` David Hildenbrand
2025-05-16 10:58 ` Donet Tom
2025-05-16 10:10 ` Mike Rapoport
2025-05-20 10:06 ` Oscar Salvador
2025-05-16 8:19 ` [PATCH v4 4/4] drivers/base : Rename register_memory_blocks_under_node() and remove context argument Donet Tom
2025-05-16 9:18 ` David Hildenbrand
2025-05-16 10:11 ` Mike Rapoport
2025-05-20 10:07 ` Oscar Salvador
2025-05-16 9:15 ` [PATCH v4 1/4] driver/base: Optimize memory block registration to reduce boot time David Hildenbrand
2025-05-16 10:09 ` Mike Rapoport
2025-05-16 10:12 ` David Hildenbrand
2025-05-16 11:00 ` Donet Tom
2025-05-16 11:00 ` Donet Tom
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox