linux-mm.kvack.org archive mirror
* [PATCH 0/2] Add a command line option that enables control of how many threads per NUMA node should be used to allocate huge pages.
@ 2025-02-21 13:49 Thomas Prescher via B4 Relay
  2025-02-21 13:49 ` [PATCH 1/2] mm: hugetlb: add hugetlb_alloc_threads cmdline option Thomas Prescher via B4 Relay
  2025-02-21 13:49 ` [PATCH 2/2] mm: hugetlb: log time needed to allocate hugepages Thomas Prescher via B4 Relay
  0 siblings, 2 replies; 9+ messages in thread
From: Thomas Prescher via B4 Relay @ 2025-02-21 13:49 UTC (permalink / raw)
  To: Jonathan Corbet, Muchun Song, Andrew Morton
  Cc: linux-doc, linux-kernel, linux-mm, Thomas Prescher

Allocating huge pages can take a very long time on servers
with terabytes of memory, even when they are allocated at
boot time, where the allocation already happens in parallel.

The kernel currently uses a hard-coded value of 2 threads per
NUMA node for these allocations. This value might have been good
enough in the past, but it is not sufficient to fully utilize
newer systems.

This patch series makes this value configurable.

We tested this on 2 generations of Xeon CPUs, and the results
show a significant improvement in overall allocation time.

+--------------------+-------+-------+-------+-------+-------+
| threads per node   |   2   |   4   |   8   |   16  |    32 |
+--------------------+-------+-------+-------+-------+-------+
| skylake 4node      |   44s |   22s |   16s |   19s |   20s |
| cascade lake 4node |   39s |   20s |   11s |   10s |    9s |
+--------------------+-------+-------+-------+-------+-------+

On Skylake, we see an improvement of 2.75x when using 8 threads;
on Cascade Lake, we get an even better 4.3x when using
32 threads per node.

This speedup is quite significant, and users of large machines
like these should have the option to make them boot
as fast as possible.

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
---
Thomas Prescher (2):
      mm: hugetlb: add hugetlb_alloc_threads cmdline option
      mm: hugetlb: log time needed to allocate hugepages

 Documentation/admin-guide/kernel-parameters.txt |  7 +++
 Documentation/admin-guide/mm/hugetlbpage.rst    |  9 +++-
 mm/hugetlb.c                                    | 59 ++++++++++++++++++-------
 3 files changed, 58 insertions(+), 17 deletions(-)
---
base-commit: 334426094588f8179fe175a09ecc887ff0c75758
change-id: 20250221-hugepage-parameter-e8542fdfc0ae

Best regards,
-- 
Thomas Prescher <thomas.prescher@cyberus-technology.de>




^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] mm: hugetlb: add hugetlb_alloc_threads cmdline option
  2025-02-21 13:49 [PATCH 0/2] Add a command line option that enables control of how many threads per NUMA node should be used to allocate huge pages Thomas Prescher via B4 Relay
@ 2025-02-21 13:49 ` Thomas Prescher via B4 Relay
  2025-02-21 13:52   ` Matthew Wilcox
  2025-02-24 17:37   ` Frank van der Linden
  2025-02-21 13:49 ` [PATCH 2/2] mm: hugetlb: log time needed to allocate hugepages Thomas Prescher via B4 Relay
  1 sibling, 2 replies; 9+ messages in thread
From: Thomas Prescher via B4 Relay @ 2025-02-21 13:49 UTC (permalink / raw)
  To: Jonathan Corbet, Muchun Song, Andrew Morton
  Cc: linux-doc, linux-kernel, linux-mm, Thomas Prescher

From: Thomas Prescher <thomas.prescher@cyberus-technology.de>

Add a command line option that enables control of how many
threads per NUMA node should be used to allocate huge pages.

Allocating huge pages can take a very long time on servers
with terabytes of memory, even when they are allocated at
boot time, where the allocation already happens in parallel.

The kernel currently uses a hard-coded value of 2 threads per
NUMA node for these allocations.

This patch makes this value configurable.

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
---
 Documentation/admin-guide/kernel-parameters.txt |  7 ++++
 Documentation/admin-guide/mm/hugetlbpage.rst    |  9 ++++-
 mm/hugetlb.c                                    | 50 +++++++++++++++++--------
 3 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb8752b42ec8582b8750d7e014c4d76166fa2fc1..812064542fdb0a5c0ff7587aaaba8da81dc234a9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1882,6 +1882,13 @@
 			Documentation/admin-guide/mm/hugetlbpage.rst.
 			Format: size[KMG]
 
+	hugepage_alloc_threads=
+			[HW] The number of threads per NUMA node that should
+			be used to allocate hugepages during boot.
+			This option can be used to improve system bootup time
+			when allocating a large number of huge pages.
+			The default value is 2 threads per NUMA node.
+
 	hugetlb_cma=	[HW,CMA,EARLY] The size of a CMA area used for allocation
 			of gigantic hugepages. Or using node format, the size
 			of a CMA area per node can be specified.
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index f34a0d798d5b533f30add99a34f66ba4e1c496a3..c88461be0f66887d532ac4ef20e3a61dfd396be7 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -145,7 +145,14 @@ hugepages
 
 	It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
 	If the node number is invalid,  the parameter will be ignored.
-
+hugepage_alloc_threads
+	Specify the number of threads per NUMA node that should be used to
+	allocate hugepages during boot. This parameter can be used to improve
+	system bootup time when allocating a large number of huge pages.
+	The default value is 2 threads per NUMA node. For example, to use
+	8 threads per NUMA node::
+
+		hugepage_alloc_threads=8
 default_hugepagesz
 	Specify the default huge page size.  This parameter can
 	only be specified once on the command line.  default_hugepagesz can
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 163190e89ea16450026496c020b544877db147d1..b7d24c41e0f9d22f5b86c253e29a2eca28460026 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -68,6 +68,7 @@ static unsigned long __initdata default_hstate_max_huge_pages;
 static bool __initdata parsed_valid_hugepagesz = true;
 static bool __initdata parsed_default_hugepagesz;
 static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
+static unsigned long allocation_threads_per_node __initdata = 2;
 
 /*
  * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
@@ -3432,26 +3433,23 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 	job.size	= h->max_huge_pages;
 
 	/*
-	 * job.max_threads is twice the num_node_state(N_MEMORY),
+	 * job.max_threads defaults to twice num_node_state(N_MEMORY).
 	 *
-	 * Tests below indicate that a multiplier of 2 significantly improves
-	 * performance, and although larger values also provide improvements,
-	 * the gains are marginal.
+	 * On large servers with terabytes of memory, huge page allocation
+	 * can consume a considerable amount of time.
 	 *
-	 * Therefore, choosing 2 as the multiplier strikes a good balance between
-	 * enhancing parallel processing capabilities and maintaining efficient
-	 * resource management.
+	 * Tests below show how long it takes to allocate 1 TiB of memory with
+	 * 2 MiB huge pages. Using more threads can significantly improve allocation time.
 	 *
-	 * +------------+-------+-------+-------+-------+-------+
-	 * | multiplier |   1   |   2   |   3   |   4   |   5   |
-	 * +------------+-------+-------+-------+-------+-------+
-	 * | 256G 2node | 358ms | 215ms | 157ms | 134ms | 126ms |
-	 * | 2T   4node | 979ms | 679ms | 543ms | 489ms | 481ms |
-	 * | 50G  2node | 71ms  | 44ms  | 37ms  | 30ms  | 31ms  |
-	 * +------------+-------+-------+-------+-------+-------+
+	 * +--------------------+-------+-------+-------+-------+-------+
+	 * | threads per node   |   2   |   4   |   8   |   16  |    32 |
+	 * +--------------------+-------+-------+-------+-------+-------+
+	 * | skylake 4node      |   44s |   22s |   16s |   19s |   20s |
+	 * | cascade lake 4node |   39s |   20s |   11s |   10s |    9s |
+	 * +--------------------+-------+-------+-------+-------+-------+
 	 */
-	job.max_threads	= num_node_state(N_MEMORY) * 2;
-	job.min_chunk	= h->max_huge_pages / num_node_state(N_MEMORY) / 2;
+	job.max_threads	= num_node_state(N_MEMORY) * allocation_threads_per_node;
+	job.min_chunk	= h->max_huge_pages / num_node_state(N_MEMORY) / allocation_threads_per_node;
 	padata_do_multithreaded(&job);
 
 	return h->nr_huge_pages;
@@ -4764,6 +4762,26 @@ static int __init default_hugepagesz_setup(char *s)
 }
 __setup("default_hugepagesz=", default_hugepagesz_setup);
 
+/*
+ * hugepage_alloc_threads command line parsing. When set, use this specific
+ * number of threads per NUMA node for the boot allocation of hugepages.
+ */
+static int __init hugepage_alloc_threads_setup(char *s)
+{
+	unsigned long threads_per_node;
+
+	if (kstrtoul(s, 0, &threads_per_node) != 0)
+		return 1;
+
+	if (threads_per_node == 0)
+		return 1;
+
+	allocation_threads_per_node = threads_per_node;
+
+	return 1;
+}
+__setup("hugepage_alloc_threads=", hugepage_alloc_threads_setup);
+
 static unsigned int allowed_mems_nr(struct hstate *h)
 {
 	int node;

-- 
2.48.1




^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 2/2] mm: hugetlb: log time needed to allocate hugepages
  2025-02-21 13:49 [PATCH 0/2] Add a command line option that enables control of how many threads per NUMA node should be used to allocate huge pages Thomas Prescher via B4 Relay
  2025-02-21 13:49 ` [PATCH 1/2] mm: hugetlb: add hugetlb_alloc_threads cmdline option Thomas Prescher via B4 Relay
@ 2025-02-21 13:49 ` Thomas Prescher via B4 Relay
  1 sibling, 0 replies; 9+ messages in thread
From: Thomas Prescher via B4 Relay @ 2025-02-21 13:49 UTC (permalink / raw)
  To: Jonathan Corbet, Muchun Song, Andrew Morton
  Cc: linux-doc, linux-kernel, linux-mm, Thomas Prescher

From: Thomas Prescher <thomas.prescher@cyberus-technology.de>

Having this information allows users to easily tune
the hugepage_alloc_threads parameter.

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
---
 mm/hugetlb.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b7d24c41e0f9d22f5b86c253e29a2eca28460026..2aa5724a385494f9d6f1d644a2bfe547591fc96c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3428,6 +3428,9 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 		.numa_aware	= true
 	};
 
+	unsigned long jiffies_start;
+	unsigned long jiffies_end;
+
 	job.thread_fn	= hugetlb_pages_alloc_boot_node;
 	job.start	= 0;
 	job.size	= h->max_huge_pages;
@@ -3450,7 +3453,13 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 	 */
 	job.max_threads	= num_node_state(N_MEMORY) * allocation_threads_per_node;
 	job.min_chunk	= h->max_huge_pages / num_node_state(N_MEMORY) / allocation_threads_per_node;
+
+	jiffies_start = jiffies;
 	padata_do_multithreaded(&job);
+	jiffies_end = jiffies;
+
+	printk(KERN_DEBUG "HugeTLB: allocation took %ums\n",
+	    jiffies_to_msecs(jiffies_end - jiffies_start));
 
 	return h->nr_huge_pages;
 }

-- 
2.48.1





* Re: [PATCH 1/2] mm: hugetlb: add hugetlb_alloc_threads cmdline option
  2025-02-21 13:49 ` [PATCH 1/2] mm: hugetlb: add hugetlb_alloc_threads cmdline option Thomas Prescher via B4 Relay
@ 2025-02-21 13:52   ` Matthew Wilcox
  2025-02-21 14:16     ` Thomas Prescher
  2025-02-24 17:37   ` Frank van der Linden
  1 sibling, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2025-02-21 13:52 UTC (permalink / raw)
  To: thomas.prescher
  Cc: Jonathan Corbet, Muchun Song, Andrew Morton, linux-doc,
	linux-kernel, linux-mm

On Fri, Feb 21, 2025 at 02:49:03PM +0100, Thomas Prescher via B4 Relay wrote:
> Add a command line option that enables control of how many
> threads per NUMA node should be used to allocate huge pages.

I don't think we should add a command line option (ie blame the sysadmin
for getting it wrong).  Instead, we should figure out the right number.
Is it half the number of threads per socket?  A quarter?  90%?  It's
bootup, the threads aren't really doing anything else.  But we
should figure it out, not the sysadmin.



* Re: [PATCH 1/2] mm: hugetlb: add hugetlb_alloc_threads cmdline option
  2025-02-21 13:52   ` Matthew Wilcox
@ 2025-02-21 14:16     ` Thomas Prescher
  2025-02-23  2:46       ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Prescher @ 2025-02-21 14:16 UTC (permalink / raw)
  To: willy; +Cc: linux-mm, corbet, muchun.song, linux-doc, linux-kernel, akpm

On Fri, 2025-02-21 at 13:52 +0000, Matthew Wilcox wrote:
> I don't think we should add a command line option (ie blame the
> sysadmin
> for getting it wrong).  Instead, we should figure out the right
> number.
> Is it half the number of threads per socket?  A quarter?  90%?  It's
> bootup, the threads aren't really doing anything else.  But we
> should figure it out, not the sysadmin.

I don't think we will find a number that delivers the best performance
on every system out there. With the two systems we tested, we already
see some differences.

The Skylake servers have 36 threads per socket and deliver the best
performance when we use 8 threads, which is 22% of them. Using more
threads decreases performance.

On Cascade Lake with 48 threads per socket, we see the best performance
when using 32 threads, which is 66%. Using more threads also decreases
performance here (not included in the table above). The performance
benefits of using more than 8 threads are very marginal, though.

I'm completely open to changing the default to something that makes more
sense. From the experiments we did so far, 25% of the threads per node
delivers reasonably good performance. We could still keep the
parameter for sysadmins who want to micro-optimize the bootup time,
though.


* Re: [PATCH 1/2] mm: hugetlb: add hugetlb_alloc_threads cmdline option
  2025-02-21 14:16     ` Thomas Prescher
@ 2025-02-23  2:46       ` Andrew Morton
  2025-02-24 10:42         ` Thomas Prescher
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2025-02-23  2:46 UTC (permalink / raw)
  To: Thomas Prescher
  Cc: willy, linux-mm, corbet, muchun.song, linux-doc, linux-kernel

On Fri, 21 Feb 2025 14:16:31 +0000 Thomas Prescher <thomas.prescher@cyberus-technology.de> wrote:

> On Fri, 2025-02-21 at 13:52 +0000, Matthew Wilcox wrote:
> > I don't think we should add a command line option (ie blame the
> > sysadmin
> > for getting it wrong).  Instead, we should figure out the right
> > number.
> > Is it half the number of threads per socket?  A quarter?  90%?  It's
> > bootup, the threads aren't really doing anything else.  But we
> > should figure it out, not the sysadmin.
> 
> I don't think we will find a number that delivers the best performance
> on every system out there. With the two systems we tested, we already
> see some differences.
> 
> The Skylake servers have 36 threads per socket and deliver the best
> performance when we use 8 threads which is 22%. Using more threads
> decreases the performance.
> 
> On Cascade Lake with 48 threads per socket, we see the best performance
> when using 32 threads which is 66%. Using more threads also decreases
> the performance here (not included in the table obove). The performance
> benefits of using more than 8 threads are very marginal though.
> 
> I'm completely open to change the default so something that makes more
> sense. From the experiments we did so far, 25% of the threads per node
> deliver a reasonable good performance. We could still keep the
> parameter for sysadmins that want to micro-optimize the bootup time
> though.

I'm all for auto-tuning but yeah, for a boot-time thing like this we
require a boot-time knob.

As is often (always) the case, the sad thing is that about five people
in the world know that this exists.  How can we tell our users that
this new thing is available and possibly useful to them?  We have no
channel.

Perhaps in your [2/2] we could be noisier?  

	HugeTLB: allocation took 4242ms with hugepage_alloc_threads=42

and with a facility level higher than KERN_DEBUG (can/should we use
pr_foo() here, btw?).  That should get people curious and poking around
in the documentation and experimenting.



* Re: [PATCH 1/2] mm: hugetlb: add hugetlb_alloc_threads cmdline option
  2025-02-23  2:46       ` Andrew Morton
@ 2025-02-24 10:42         ` Thomas Prescher
  0 siblings, 0 replies; 9+ messages in thread
From: Thomas Prescher @ 2025-02-24 10:42 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, corbet, willy, muchun.song, linux-kernel, linux-doc

On Sat, 2025-02-22 at 18:46 -0800, Andrew Morton wrote:
> 
> I'm all for auto-tuning but yeah, for a boot-time thing like this we
> require a boot-time knob.
> 
> As is often (always) the case, the sad thing is that about five
> people
> in the world know that this exists.  How can we tell our users that
> this new thing is available and possibly useful to them?  We have no
> channel.
> 
> Perhaps in your [2/2] we could be noisier?  
> 
> 	HugeTLB: allocation took 4242ms with
> hugepage_alloc_threads=42
> 
> and with a facility level higher than KERN_DEBUG (can/should we use
> pr_foo() here, btw?).  That should get people curious and poking
> around
> in the documentation and experimenting.

I completely agree with what you are saying. So for v2, I would:

a) change the default from 2 threads per node to 25% of the total
threads per node to make it faster for everyone

b) make it noisier by using pr_info instead of printk(KERN_DEBUG ...)
and also print the number of threads that are used

Is there anything I am missing here?



* Re: [PATCH 1/2] mm: hugetlb: add hugetlb_alloc_threads cmdline option
  2025-02-21 13:49 ` [PATCH 1/2] mm: hugetlb: add hugetlb_alloc_threads cmdline option Thomas Prescher via B4 Relay
  2025-02-21 13:52   ` Matthew Wilcox
@ 2025-02-24 17:37   ` Frank van der Linden
  2025-02-25 13:01     ` Thomas Prescher
  1 sibling, 1 reply; 9+ messages in thread
From: Frank van der Linden @ 2025-02-24 17:37 UTC (permalink / raw)
  To: thomas.prescher
  Cc: Jonathan Corbet, Muchun Song, Andrew Morton, linux-doc,
	linux-kernel, linux-mm

On Fri, Feb 21, 2025 at 5:49 AM Thomas Prescher via B4 Relay
<devnull+thomas.prescher.cyberus-technology.de@kernel.org> wrote:
>
> From: Thomas Prescher <thomas.prescher@cyberus-technology.de>
>
> Add a command line option that enables control of how many
> threads per NUMA node should be used to allocate huge pages.
>
> Allocating huge pages can take a very long time on servers
> with terabytes of memory even when they are allocated at
> boot time where the allocation happens in parallel.
>
> The kernel currently uses a hard coded value of 2 threads per
> NUMA node for these allocations.
>
> This patch allows to override this value.
>
> Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
> [...]

Maybe mention that this does not apply to 'gigantic' hugepages (i.e.
hugetlb pages of an order > MAX_PAGE_ORDER). Those are allocated
earlier in boot by memblock, in a single-threaded environment.

Not your fault that this distinction between these types of hugetlb
pages isn't clear in the Docs, of course. Only hugetlb_cma mentions
that it is for gigantic pages. But it's probably best to mention that
the threads parameter is for non-gigantic hugetlb pages only.

- Frank



* Re: [PATCH 1/2] mm: hugetlb: add hugetlb_alloc_threads cmdline option
  2025-02-24 17:37   ` Frank van der Linden
@ 2025-02-25 13:01     ` Thomas Prescher
  0 siblings, 0 replies; 9+ messages in thread
From: Thomas Prescher @ 2025-02-25 13:01 UTC (permalink / raw)
  To: fvdl; +Cc: linux-mm, corbet, muchun.song, linux-doc, linux-kernel, akpm

On Mon, 2025-02-24 at 09:37 -0800, Frank van der Linden wrote:
> 
> Maybe mention that this does not apply to 'gigantic' hugepages (e.g.
> hugetlb pages of an order > MAX_PAGE_ORDER). Those are allocated
> earlier in boot by memblock, in a single-threaded environment.
> 
> Not your fault that this distinction between these types of hugetlb
> pages isn't clear in the Docs, of course. Only hugetlb_cma mentions
> that it is for gigantic pages. But it's probably best to mention that
> the threads parameter is for non-gigantic hugetlb pages only.
> 
> - Frank

Very good point. I will update the documentation accordingly.


end of thread, other threads:[~2025-02-25 13:01 UTC | newest]

Thread overview: 9+ messages
2025-02-21 13:49 [PATCH 0/2] Add a command line option that enables control of how many threads per NUMA node should be used to allocate huge pages Thomas Prescher via B4 Relay
2025-02-21 13:49 ` [PATCH 1/2] mm: hugetlb: add hugetlb_alloc_threads cmdline option Thomas Prescher via B4 Relay
2025-02-21 13:52   ` Matthew Wilcox
2025-02-21 14:16     ` Thomas Prescher
2025-02-23  2:46       ` Andrew Morton
2025-02-24 10:42         ` Thomas Prescher
2025-02-24 17:37   ` Frank van der Linden
2025-02-25 13:01     ` Thomas Prescher
2025-02-21 13:49 ` [PATCH 2/2] mm: hugetlb: log time needed to allocate hugepages Thomas Prescher via B4 Relay
