* [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations
@ 2025-04-02 20:56 Frank van der Linden
2025-04-08 13:54 ` Oscar Salvador
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Frank van der Linden @ 2025-04-02 20:56 UTC (permalink / raw)
To: akpm, muchun.song, linux-mm, linux-kernel
Cc: david, osalvador, luizcap, Frank van der Linden
Hugetlb boot allocation has used online nodes for allocation since
commit de55996d7188 ("mm/hugetlb: use online nodes for bootmem
allocation"). This was needed to be able to do the allocations
earlier in boot, before N_MEMORY was set.
This might lead to a different distribution of gigantic hugepages
across NUMA nodes if there are memoryless nodes in the system.
What happens is that the memoryless nodes are tried, but then
the memblock allocation fails and falls back, which usually means
that the node that has the highest physical address available
will be used (top-down allocation). While this will end up
getting the same number of hugetlb pages, they might not be
distributed the same way. The fallback for each memoryless
node might not end up coming from the same node as the
successful round-robin allocation from N_MEMORY nodes.
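As a hypothetical illustration: with memory on nodes 0-2, node 3
memoryless, and hugepages=4, round-robin over N_MEMORY nodes would
yield 2/1/1 across nodes 0-2, while iterating over online nodes makes
the attempt for node 3 fail and fall back, typically to the
highest-address node, for something like 1/1/2.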
While administrators that rely on having a specific number of
hugepages per node should use the hugepages=N:X syntax, it's
better not to change the old behavior for the plain hugepages=N
case.
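For example (illustrative values, assuming 1G pages), the per-node
form spells the placement out explicitly, while the plain form relies
on the round-robin behavior preserved here:

	default_hugepagesz=1G hugepagesz=1G hugepages=0:2,1:2
	default_hugepagesz=1G hugepagesz=1G hugepages=4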
To do this, construct a nodemask for hugetlb bootmem purposes
only, containing nodes that have memory. Then use that
for round-robin bootmem allocations.
This saves some cycles, and the added advantage here is that
hugetlb_cma can use it too, avoiding the older issue of
pointless attempts to create a CMA area for memoryless nodes
(which will also cause the per-node CMA area size to be too
small).
Fixes: de55996d7188 ("mm/hugetlb: use online nodes for bootmem allocation")
Signed-off-by: Frank van der Linden <fvdl@google.com>
---
include/linux/hugetlb.h | 3 +++
mm/hugetlb.c | 30 ++++++++++++++++++++++++++++--
mm/hugetlb_cma.c | 11 +++++++----
3 files changed, 38 insertions(+), 6 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8f3ac832ee7f..fc9166f7f679 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -14,6 +14,7 @@
#include <linux/pgtable.h>
#include <linux/gfp.h>
#include <linux/userfaultfd_k.h>
+#include <linux/nodemask.h>
struct ctl_table;
struct user_struct;
@@ -176,6 +177,8 @@ extern struct list_head huge_boot_pages[MAX_NUMNODES];
void hugetlb_bootmem_alloc(void);
bool hugetlb_bootmem_allocated(void);
+extern nodemask_t hugetlb_bootmem_nodes;
+void hugetlb_bootmem_set_nodes(void);
/* arch callbacks */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6fccfe6d046c..e69f6f31e082 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -58,6 +58,7 @@ int hugetlb_max_hstate __read_mostly;
unsigned int default_hstate_idx;
struct hstate hstates[HUGE_MAX_HSTATE];
+__initdata nodemask_t hugetlb_bootmem_nodes;
__initdata struct list_head huge_boot_pages[MAX_NUMNODES];
static unsigned long hstate_boot_nrinvalid[HUGE_MAX_HSTATE] __initdata;
@@ -3237,7 +3238,8 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
}
/* allocate from next node when distributing huge pages */
- for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_ONLINE]) {
+ for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node,
+ &hugetlb_bootmem_nodes) {
m = alloc_bootmem(h, node, false);
if (!m)
return 0;
@@ -3701,6 +3703,15 @@ static void __init hugetlb_init_hstates(void)
struct hstate *h, *h2;
for_each_hstate(h) {
+ /*
+ * Always reset to first_memory_node here, even if
+ * next_nid_to_alloc was set before - we can't
+ * reference hugetlb_bootmem_nodes after init, and
+ * first_memory_node is right for all further allocations.
+ */
+ h->next_nid_to_alloc = first_memory_node;
+ h->next_nid_to_free = first_memory_node;
+
/* oversize hugepages were init'ed in early boot */
if (!hstate_is_gigantic(h))
hugetlb_hstate_alloc_pages(h);
@@ -4990,6 +5001,20 @@ static int __init default_hugepagesz_setup(char *s)
}
hugetlb_early_param("default_hugepagesz", default_hugepagesz_setup);
+void __init hugetlb_bootmem_set_nodes(void)
+{
+ int i, nid;
+ unsigned long start_pfn, end_pfn;
+
+ if (!nodes_empty(hugetlb_bootmem_nodes))
+ return;
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+ if (end_pfn > start_pfn)
+ node_set(nid, hugetlb_bootmem_nodes);
+ }
+}
+
static bool __hugetlb_bootmem_allocated __initdata;
bool __init hugetlb_bootmem_allocated(void)
@@ -5005,6 +5030,8 @@ void __init hugetlb_bootmem_alloc(void)
if (__hugetlb_bootmem_allocated)
return;
+ hugetlb_bootmem_set_nodes();
+
for (i = 0; i < MAX_NUMNODES; i++)
INIT_LIST_HEAD(&huge_boot_pages[i]);
@@ -5012,7 +5039,6 @@ void __init hugetlb_bootmem_alloc(void)
for_each_hstate(h) {
h->next_nid_to_alloc = first_online_node;
- h->next_nid_to_free = first_online_node;
if (hstate_is_gigantic(h))
hugetlb_hstate_alloc_pages(h);
diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
index e0f2d5c3a84c..f58ef4969e7a 100644
--- a/mm/hugetlb_cma.c
+++ b/mm/hugetlb_cma.c
@@ -66,7 +66,7 @@ hugetlb_cma_alloc_bootmem(struct hstate *h, int *nid, bool node_exact)
if (node_exact)
return NULL;
- for_each_online_node(node) {
+ for_each_node_mask(node, hugetlb_bootmem_nodes) {
cma = hugetlb_cma[node];
if (!cma || node == *nid)
continue;
@@ -153,11 +153,13 @@ void __init hugetlb_cma_reserve(int order)
if (!hugetlb_cma_size)
return;
+ hugetlb_bootmem_set_nodes();
+
for (nid = 0; nid < MAX_NUMNODES; nid++) {
if (hugetlb_cma_size_in_node[nid] == 0)
continue;
- if (!node_online(nid)) {
+ if (!node_isset(nid, hugetlb_bootmem_nodes)) {
pr_warn("hugetlb_cma: invalid node %d specified\n", nid);
hugetlb_cma_size -= hugetlb_cma_size_in_node[nid];
hugetlb_cma_size_in_node[nid] = 0;
@@ -190,13 +192,14 @@ void __init hugetlb_cma_reserve(int order)
* If 3 GB area is requested on a machine with 4 numa nodes,
* let's allocate 1 GB on first three nodes and ignore the last one.
*/
- per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
+ per_node = DIV_ROUND_UP(hugetlb_cma_size,
+ nodes_weight(hugetlb_bootmem_nodes));
pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
hugetlb_cma_size / SZ_1M, per_node / SZ_1M);
}
reserved = 0;
- for_each_online_node(nid) {
+ for_each_node_mask(nid, hugetlb_bootmem_nodes) {
int res;
char name[CMA_MAX_NAME];
--
2.49.0.504.g3bcea36a83-goog
* Re: [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations
2025-04-02 20:56 [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations Frank van der Linden
@ 2025-04-08 13:54 ` Oscar Salvador
2025-04-08 15:48 ` Frank van der Linden
2025-04-09 7:47 ` Oscar Salvador
2025-04-16 1:07 ` Luiz Capitulino
2 siblings, 1 reply; 8+ messages in thread
From: Oscar Salvador @ 2025-04-08 13:54 UTC (permalink / raw)
To: Frank van der Linden
Cc: akpm, muchun.song, linux-mm, linux-kernel, david, luizcap
On Wed, Apr 02, 2025 at 08:56:13PM +0000, Frank van der Linden wrote:
> Hugetlb boot allocation has used online nodes for allocation since
> commit de55996d7188 ("mm/hugetlb: use online nodes for bootmem
> allocation"). This was needed to be able to do the allocations
> earlier in boot, before N_MEMORY was set.
Hi Frank,
Makes sense.
There is something I do not quite understand, though
> @@ -5012,7 +5039,6 @@ void __init hugetlb_bootmem_alloc(void)
>
> for_each_hstate(h) {
> h->next_nid_to_alloc = first_online_node;
> - h->next_nid_to_free = first_online_node;
Why are you unsetting next_nid_to_free? I guess it is because
we do not use it during boot time and you already set it to
first_memory_node further down the road in hugetlb_init_hstates.
And the reason you are leaving next_nid_to_alloc set is to see if
there is any chance that first_online_node is part of hugetlb_bootmem_nodes?
--
Oscar Salvador
SUSE Labs
* Re: [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations
2025-04-08 13:54 ` Oscar Salvador
@ 2025-04-08 15:48 ` Frank van der Linden
2025-04-09 7:41 ` Oscar Salvador
0 siblings, 1 reply; 8+ messages in thread
From: Frank van der Linden @ 2025-04-08 15:48 UTC (permalink / raw)
To: Oscar Salvador; +Cc: akpm, muchun.song, linux-mm, linux-kernel, david, luizcap
On Tue, Apr 8, 2025 at 6:54 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> Hi Frank,
>
> Makes sense.
Hi Oscar, thanks for looking at the patch.
>
> There is something I do not quite understand, though
>
> > @@ -5012,7 +5039,6 @@ void __init hugetlb_bootmem_alloc(void)
> >
> > for_each_hstate(h) {
> > h->next_nid_to_alloc = first_online_node;
> > - h->next_nid_to_free = first_online_node;
>
> Why are you unsetting next_nid_to_free? I guess it is because
> we do not use it during boot time and you already set it to
> first_memory_node further down the road in hugetlb_init_hstates.
Yes, that's exactly it - it's not used, so there was no need to set
it, and I made sure it's set later.
>
> And the reason you are leaving next_nid_to_alloc set is to see if
> there is any chance that first_online_node is part of hugetlb_bootmem_nodes?
next_nid_to_alloc is used to remember the last node that was allocated
from in __alloc_bootmem_huge_page(), so that the next call will
continue at the node after the one that was successfully allocated
from. The code there looks a bit confusing, since the
for_each_node_mask_to_alloc macro is not really used as a for loop
there, but simply as a way of saying "try this node and remember the
next one".
I've been meaning to clean that code up for several reasons, but
didn't get around to it; it's a separate issue.
- Frank
* Re: [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations
2025-04-08 15:48 ` Frank van der Linden
@ 2025-04-09 7:41 ` Oscar Salvador
0 siblings, 0 replies; 8+ messages in thread
From: Oscar Salvador @ 2025-04-09 7:41 UTC (permalink / raw)
To: Frank van der Linden
Cc: akpm, muchun.song, linux-mm, linux-kernel, david, luizcap
On Tue, Apr 08, 2025 at 08:48:33AM -0700, Frank van der Linden wrote:
> next_nid_to_alloc is used to remember the last node that was allocated
> from in __alloc_bootmem_huge_page(), so that the next call will
> continue at the node after the one that was successfully allocated
> from. The code there looks a bit confusing, since the
> for_each_node_mask_to_alloc macro is not really used as a for loop
> there, but simply as a way of saying "try this node and remember the
> next one".
Yes, this makes sense, thanks for explaining, Frank :-)
--
Oscar Salvador
SUSE Labs
* Re: [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations
2025-04-02 20:56 [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations Frank van der Linden
2025-04-08 13:54 ` Oscar Salvador
@ 2025-04-09 7:47 ` Oscar Salvador
2025-04-16 1:07 ` Luiz Capitulino
2 siblings, 0 replies; 8+ messages in thread
From: Oscar Salvador @ 2025-04-09 7:47 UTC (permalink / raw)
To: Frank van der Linden
Cc: akpm, muchun.song, linux-mm, linux-kernel, david, luizcap
On Wed, Apr 02, 2025 at 08:56:13PM +0000, Frank van der Linden wrote:
> Fixes: de55996d7188 ("mm/hugetlb: use online nodes for bootmem allocation")
> Signed-off-by: Frank van der Linden <fvdl@google.com>
This looks good to me.
Reviewed-by: Oscar Salvador <osalvador@suse.de>
The only thing I was pondering is whether there would be a way
to keep hugetlb_bootmem_set_nodes() confined to hugetlb code
without having to export it to hugetlb_cma.
But then again, you would have to create a function that calls
hugetlb_bootmem_set_nodes() earlier, and that would be churn for
churn's sake.
--
Oscar Salvador
SUSE Labs
* Re: [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations
2025-04-02 20:56 [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations Frank van der Linden
2025-04-08 13:54 ` Oscar Salvador
2025-04-09 7:47 ` Oscar Salvador
@ 2025-04-16 1:07 ` Luiz Capitulino
2025-04-16 16:32 ` Frank van der Linden
2 siblings, 1 reply; 8+ messages in thread
From: Luiz Capitulino @ 2025-04-16 1:07 UTC (permalink / raw)
To: Frank van der Linden, akpm, muchun.song, linux-mm, linux-kernel
Cc: david, osalvador
On 2025-04-02 16:56, Frank van der Linden wrote:
> Hugetlb boot allocation has used online nodes for allocation since
> commit de55996d7188 ("mm/hugetlb: use online nodes for bootmem
> allocation"). This was needed to be able to do the allocations
> earlier in boot, before N_MEMORY was set.
Honest question: I imagine there's a reason why we can't move
x86's hugetlb_cma_reserve() and hugetlb_bootmem_alloc() calls
in setup_arch() to after x86_init.paging.pagetable_init() (which
seems to be where we call zone_sizes_init())? This way we could
go back to using N_MEMORY and avoid this dance.
I'm not familiar with vmemmap, if that's the reason...
- Luiz
* Re: [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations
2025-04-16 1:07 ` Luiz Capitulino
@ 2025-04-16 16:32 ` Frank van der Linden
2025-04-16 17:07 ` Luiz Capitulino
0 siblings, 1 reply; 8+ messages in thread
From: Frank van der Linden @ 2025-04-16 16:32 UTC (permalink / raw)
To: Luiz Capitulino
Cc: akpm, muchun.song, linux-mm, linux-kernel, david, osalvador
On Tue, Apr 15, 2025 at 6:08 PM Luiz Capitulino <luizcap@redhat.com> wrote:
>
> On 2025-04-02 16:56, Frank van der Linden wrote:
> > Hugetlb boot allocation has used online nodes for allocation since
> > commit de55996d7188 ("mm/hugetlb: use online nodes for bootmem
> > allocation"). This was needed to be able to do the allocations
> > earlier in boot, before N_MEMORY was set.
>
> Honest question: I imagine there's a reason why we can't move
> x86's hugetlb_cma_reserve() and hugetlb_bootmem_alloc() calls
> in setup_arch() to after x86_init.paging.pagetable_init() (which
> seems to be where we call zone_sizes_init())? This way we could
> go back to using N_MEMORY and avoid this dance.
>
> I'm not familiar with vmemmap, if that's the reason...
>
Yeah, vmemmap is the reason. pre-HVO (setting up vmemmap HVO-style)
requires the hugetlb bootmem allocations to be done before
sparse_init(), so the ordering you propose wouldn't work.
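Roughly, the constraint is this (a simplified sketch of the x86 call
order as I understand it, not verbatim kernel source):

	/*
	 * setup_arch()
	 *   ...
	 *   hugetlb_bootmem_alloc();           needs only memblock node
	 *                                      info; pre-HVO vmemmap setup
	 *                                      hangs off this, so it must
	 *                                      precede sparse_init()
	 *   ...
	 *   x86_init.paging.pagetable_init();  -> paging_init()
	 *       -> sparse_init();              vmemmap built here
	 *       -> zone_sizes_init();          N_MEMORY usable only after
	 *                                      this
	 */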
I originally looked at explicitly initializing N_MEMORY earlier,
figuring that all that was needed was having memblock node information
available. But there seems to be a history there - N_MEMORY indicates
that buddy allocator memory is available on the node, and several
comments referenced the fact that zone init and rounding may end up
not setting N_MEMORY on NUMA nodes with a tiny amount of memory. There
is also code that sets N_MEMORY temporarily in
find_zone_movable_pfns_for_nodes().
Some of the commits go back a long time, and I can't quite judge
whether the comments still apply without looking at the code more.
So I chickened out and did a hugetlb-only change to fix the hugetlb
issues.
But it does seem like setting N_MEMORY can be cleaned up a bit, it's
definitely something to follow up on.
- Frank
* Re: [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations
2025-04-16 16:32 ` Frank van der Linden
@ 2025-04-16 17:07 ` Luiz Capitulino
0 siblings, 0 replies; 8+ messages in thread
From: Luiz Capitulino @ 2025-04-16 17:07 UTC (permalink / raw)
To: Frank van der Linden
Cc: akpm, muchun.song, linux-mm, linux-kernel, david, osalvador
On 2025-04-16 12:32, Frank van der Linden wrote:
> On Tue, Apr 15, 2025 at 6:08 PM Luiz Capitulino <luizcap@redhat.com> wrote:
>>
>> On 2025-04-02 16:56, Frank van der Linden wrote:
>>> Hugetlb boot allocation has used online nodes for allocation since
>>> commit de55996d7188 ("mm/hugetlb: use online nodes for bootmem
>>> allocation"). This was needed to be able to do the allocations
>>> earlier in boot, before N_MEMORY was set.
>>
>> Honest question: I imagine there's a reason why we can't move
>> x86's hugetlb_cma_reserve() and hugetlb_bootmem_alloc() calls
>> in setup_arch() to after x86_init.paging.pagetable_init() (which
>> seems to be where we call zone_sizes_init())? This way we could
>> go back to using N_MEMORY and avoid this dance.
>>
>> I'm not familiar with vmemmap, if that's the reason...
>>
>
> Yeah, vmemmap is the reason. pre-HVO (setting up vmemmap HVO-style)
> requires the hugetlb bootmem allocations to be done before
> sparse_init(), so the ordering you propose wouldn't work.
>
> I originally looked at explicitly initializing N_MEMORY earlier,
> figuring that all that was needed was having memblock node information
> available. But there seems to be a history there - N_MEMORY indicates
> that buddy allocator memory is available on the node, and several
> comments referenced the fact that zone init and rounding may end up
> not setting N_MEMORY on NUMA nodes with a tiny amount of memory. There
> is also code that sets N_MEMORY temporarily in
> find_zone_movable_pfns_for_nodes().
>
> Some of the commits went back a long time ago, and I can't quite judge
> if the comments still apply without looking at the code more. So, I
> chickened out, and did a hugetlb only change to fix the hugetlb
> issues.
Oh, thanks for the full explanation.
Since the new hugetlb init has to happen before sparse_init(), this
patch is fine by me, and I appreciate your concern about not
changing/regressing the user-visible behavior.
Reviewed-by: Luiz Capitulino <luizcap@redhat.com>
>
> But it does seem like setting N_MEMORY can be cleaned up a bit, it's
> definitely something to follow up on.
>
> - Frank
>