[PATCH v3 0/3] Add a command line option that enables control of how many threads should be used to allocate huge pages.


* [PATCH v3 0/3] Add a command line option that enables control of how many threads should be used to allocate huge pages.
@ 2025-02-27 23:02 Thomas Prescher via B4 Relay
  2025-02-27 23:02 ` [PATCH v3 1/3] mm: hugetlb: improve parallel huge page allocation time Thomas Prescher via B4 Relay
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Thomas Prescher via B4 Relay @ 2025-02-27 23:02 UTC (permalink / raw)
  To: Jonathan Corbet, Muchun Song, Andrew Morton
  Cc: linux-doc, linux-kernel, linux-mm, Thomas Prescher

Allocating huge pages can take a very long time on servers
with terabytes of memory, even when they are allocated at
boot time, where the allocation already happens in parallel.

Before this series, the kernel used a hard coded value of 2 threads per
NUMA node for these allocations. This value might have been good
enough in the past but it is not sufficient to fully utilize
newer systems.

This series changes the default so the kernel uses 25% of the available
hardware threads for these allocations. In addition, users who wish to
micro-optimize the allocation time can override this value via a new
kernel parameter.
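
The selection policy described above can be sketched in userspace C as
follows. This is only an illustration, not the kernel code: the function
name and its two arguments are hypothetical stand-ins for the parsed
parameter value and for num_online_cpus().

```c
/*
 * Userspace sketch of the thread-count policy described above.
 * `requested` stands in for the value of the new kernel parameter
 * (0 means "not set"); `online_cpus` stands in for num_online_cpus().
 * Both names are illustrative and not part of the kernel API.
 */
static unsigned long effective_alloc_threads(unsigned long requested,
					      unsigned long online_cpus)
{
	unsigned long threads;

	if (requested != 0)
		return requested;	/* an explicit override wins */

	threads = online_cpus / 4;	/* default: 25% of hardware threads */
	return threads ? threads : 1;	/* never fall below one thread */
}
```

Under this sketch, a 144-CPU machine would default to 36 threads and a
192-CPU machine to 48, while a machine with fewer than 4 CPUs still gets
one thread.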

We tested this on 2 generations of Xeon CPUs and the results
show a big improvement of the overall allocation time.

+-----------------------+-------+-------+-------+-------+-------+
| threads               |   8   |   16  |   32  |   64  |   128 |
+-----------------------+-------+-------+-------+-------+-------+
| skylake      144 cpus |   44s |   22s |   16s |   19s |   20s |
| cascade lake 192 cpus |   39s |   20s |   11s |   10s |    9s |
+-----------------------+-------+-------+-------+-------+-------+

On skylake, we see an improvement of 2.75x when using 32 threads
instead of 8 (44s vs. 16s); on cascade lake we do even better, at
4.3x, when we use 128 threads (39s vs. 9s).
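
For reference, the speedup figures can be recomputed from the table with
a small userspace C sketch. The struct and function names are
illustrative; the numbers are copied from the table above, and the
8-thread column is assumed to be the baseline.

```c
#include <stddef.h>

/* One measurement: thread count and allocation time in seconds. */
struct sample {
	unsigned int threads;
	double seconds;
};

/* Rows copied from the table above. */
static const struct sample skylake[] = {
	{ 8, 44.0 }, { 16, 22.0 }, { 32, 16.0 }, { 64, 19.0 }, { 128, 20.0 },
};
static const struct sample cascade_lake[] = {
	{ 8, 39.0 }, { 16, 20.0 }, { 32, 11.0 }, { 64, 10.0 }, { 128, 9.0 },
};

/* Index of the fastest sample in a row (n must be at least 1). */
static size_t fastest(const struct sample *s, size_t n)
{
	size_t best = 0;

	for (size_t i = 1; i < n; i++)
		if (s[i].seconds < s[best].seconds)
			best = i;
	return best;
}

/* Speedup of sample i relative to the first (8-thread) sample. */
static double speedup_vs_first(const struct sample *s, size_t i)
{
	return s[0].seconds / s[i].seconds;
}
```

Note that the skylake row gets slower again beyond 32 threads, which is
one reason a tunable is useful in addition to a better default.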

This speedup is quite significant and users of large machines
like these should have the option to make the machines boot
as fast as possible.

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
---
Changes in v3:
- Fix log message (the name of the parameter was incorrect)
- Link to v2: https://lore.kernel.org/r/20250227-hugepage-parameter-v2-0-7db8c6dc0453@cyberus-technology.de

Changes in v2:
- the default thread value has been changed from 2 threads per node to
  25% of the available hardware threads
- the hugepage_alloc_threads parameter now specifies the total number of
  threads being used instead of threads per node
- the kernel now logs the time needed to allocate the huge pages and the
  number of threads being used
- update the documentation so that users are aware that this does not
  apply to gigantic huge pages
- Link to v1: https://lore.kernel.org/r/20250221-hugepage-parameter-v1-0-fa49a77c87c8@cyberus-technology.de

---
Thomas Prescher (3):
      mm: hugetlb: improve parallel huge page allocation time
      mm: hugetlb: add hugepage_alloc_threads cmdline option
      mm: hugetlb: log time needed to allocate hugepages

 Documentation/admin-guide/kernel-parameters.txt |  9 ++++
 Documentation/admin-guide/mm/hugetlbpage.rst    | 10 ++++
 mm/hugetlb.c                                    | 67 +++++++++++++++++++------
 3 files changed, 70 insertions(+), 16 deletions(-)
---
base-commit: dd83757f6e686a2188997cb58b5975f744bb7786
change-id: 20250221-hugepage-parameter-e8542fdfc0ae

Best regards,
-- 
Thomas Prescher <thomas.prescher@cyberus-technology.de>





* [PATCH v3 1/3] mm: hugetlb: improve parallel huge page allocation time
  2025-02-27 23:02 [PATCH v3 0/3] Add a command line option that enables control of how many threads should be used to allocate huge pages Thomas Prescher via B4 Relay
@ 2025-02-27 23:02 ` Thomas Prescher via B4 Relay
  2025-02-27 23:02 ` [PATCH v3 2/3] mm: hugetlb: add hugepage_alloc_threads cmdline option Thomas Prescher via B4 Relay
  2025-02-27 23:02 ` [PATCH v3 3/3] mm: hugetlb: log time needed to allocate hugepages Thomas Prescher via B4 Relay
  2 siblings, 0 replies; 4+ messages in thread
From: Thomas Prescher via B4 Relay @ 2025-02-27 23:02 UTC (permalink / raw)
  To: Jonathan Corbet, Muchun Song, Andrew Morton
  Cc: linux-doc, linux-kernel, linux-mm, Thomas Prescher

From: Thomas Prescher <thomas.prescher@cyberus-technology.de>

Before this patch, the kernel used a hard-coded value of
2 threads per NUMA node for boot-time huge page allocations.

This patch changes this policy and the kernel now uses 25%
of the available hardware threads for the allocations.
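
As a rough illustration of the policy change, the following userspace C
sketch computes the two values the patch sets (max_threads and
min_chunk) under the old and the new policy. The struct and function
names are hypothetical; the arithmetic mirrors the diff below. Callers
are assumed to pass a non-zero node count.

```c
/* Values handed to the parallel job; names mirror the job fields. */
struct alloc_plan {
	unsigned long max_threads;
	unsigned long min_chunk;
};

/* Old policy: two threads per NUMA node with memory (nodes must be > 0). */
static struct alloc_plan plan_old(unsigned long pages, unsigned long nodes)
{
	struct alloc_plan p;

	p.max_threads = nodes * 2;
	p.min_chunk = pages / nodes / 2;
	return p;
}

/* New policy: 25% of the online CPUs, but at least one thread. */
static struct alloc_plan plan_new(unsigned long pages, unsigned long cpus)
{
	struct alloc_plan p;

	p.max_threads = cpus / 4 ? cpus / 4 : 1;
	p.min_chunk = pages / p.max_threads;
	return p;
}
```

For 1 TiB of 2MiB pages (524288 pages), a hypothetical 2-node machine
would get 4 threads under the old policy, while a 144-CPU machine gets
36 threads under the new one.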

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
---
 mm/hugetlb.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 163190e89ea16450026496c020b544877db147d1..e9b1b3e2b9d467f067d54359e1401a03f9926108 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -14,9 +14,11 @@
 #include <linux/pagemap.h>
 #include <linux/mempolicy.h>
 #include <linux/compiler.h>
+#include <linux/cpumask.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
 #include <linux/memblock.h>
+#include <linux/minmax.h>
 #include <linux/sysfs.h>
 #include <linux/slab.h>
 #include <linux/sched/mm.h>
@@ -3427,31 +3429,31 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 		.numa_aware	= true
 	};
 
+	unsigned int num_allocation_threads = max(num_online_cpus() / 4, 1);
+
 	job.thread_fn	= hugetlb_pages_alloc_boot_node;
 	job.start	= 0;
 	job.size	= h->max_huge_pages;
 
 	/*
-	 * job.max_threads is twice the num_node_state(N_MEMORY),
+	 * job.max_threads is 25% of the available cpu threads by default.
 	 *
-	 * Tests below indicate that a multiplier of 2 significantly improves
-	 * performance, and although larger values also provide improvements,
-	 * the gains are marginal.
+	 * On large servers with terabytes of memory, huge page allocation
+	 * can consume a considerable amount of time.
 	 *
-	 * Therefore, choosing 2 as the multiplier strikes a good balance between
-	 * enhancing parallel processing capabilities and maintaining efficient
-	 * resource management.
+	 * Tests below show how long it takes to allocate 1 TiB of memory with
+	 * 2MiB huge pages. Using more threads can significantly improve allocation time.
 	 *
-	 * +------------+-------+-------+-------+-------+-------+
-	 * | multiplier |   1   |   2   |   3   |   4   |   5   |
-	 * +------------+-------+-------+-------+-------+-------+
-	 * | 256G 2node | 358ms | 215ms | 157ms | 134ms | 126ms |
-	 * | 2T   4node | 979ms | 679ms | 543ms | 489ms | 481ms |
-	 * | 50G  2node | 71ms  | 44ms  | 37ms  | 30ms  | 31ms  |
-	 * +------------+-------+-------+-------+-------+-------+
+	 * +-----------------------+-------+-------+-------+-------+-------+
+	 * | threads               |   8   |   16  |   32  |   64  |   128 |
+	 * +-----------------------+-------+-------+-------+-------+-------+
+	 * | skylake      144 cpus |   44s |   22s |   16s |   19s |   20s |
+	 * | cascade lake 192 cpus |   39s |   20s |   11s |   10s |    9s |
+	 * +-----------------------+-------+-------+-------+-------+-------+
 	 */
-	job.max_threads	= num_node_state(N_MEMORY) * 2;
-	job.min_chunk	= h->max_huge_pages / num_node_state(N_MEMORY) / 2;
+
+	job.max_threads	= num_allocation_threads;
+	job.min_chunk	= h->max_huge_pages / num_allocation_threads;
 	padata_do_multithreaded(&job);
 
 	return h->nr_huge_pages;

-- 
2.48.1





* [PATCH v3 2/3] mm: hugetlb: add hugepage_alloc_threads cmdline option
  2025-02-27 23:02 [PATCH v3 0/3] Add a command line option that enables control of how many threads should be used to allocate huge pages Thomas Prescher via B4 Relay
  2025-02-27 23:02 ` [PATCH v3 1/3] mm: hugetlb: improve parallel huge page allocation time Thomas Prescher via B4 Relay
@ 2025-02-27 23:02 ` Thomas Prescher via B4 Relay
  2025-02-27 23:02 ` [PATCH v3 3/3] mm: hugetlb: log time needed to allocate hugepages Thomas Prescher via B4 Relay
  2 siblings, 0 replies; 4+ messages in thread
From: Thomas Prescher via B4 Relay @ 2025-02-27 23:02 UTC (permalink / raw)
  To: Jonathan Corbet, Muchun Song, Andrew Morton
  Cc: linux-doc, linux-kernel, linux-mm, Thomas Prescher

From: Thomas Prescher <thomas.prescher@cyberus-technology.de>

Add a command line option that enables control of how many
threads should be used to allocate huge pages.
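
The validation done by the setup handler in the diff below can be
approximated in userspace C as follows. This is a sketch only: it uses
strtoul() from the C library as a stand-in for the kernel's kstrtoul(),
and the function name and return convention are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Userspace approximation of the parameter parsing: accept a decimal,
 * octal or hex number (base 0), reject garbage, and reject 0 so that
 * the default policy stays in effect.
 * Returns 0 and stores the value on success, -1 otherwise.
 */
static int parse_alloc_threads(const char *s, unsigned long *out)
{
	char *end;
	unsigned long value;

	/* strtoul() would accept whitespace and a sign; require a digit. */
	if (s[0] < '0' || s[0] > '9')
		return -1;

	errno = 0;
	value = strtoul(s, &end, 0);
	if (errno != 0 || *end != '\0')
		return -1;	/* overflow or trailing garbage */

	if (value == 0)
		return -1;	/* 0 is not a valid thread count */

	*out = value;
	return 0;
}
```

In the patch itself an invalid or zero value is ignored silently, so the
25% default applies.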

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
---
 Documentation/admin-guide/kernel-parameters.txt |  9 +++++++
 Documentation/admin-guide/mm/hugetlbpage.rst    | 10 ++++++++
 mm/hugetlb.c                                    | 31 +++++++++++++++++++++----
 3 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb8752b42ec8582b8750d7e014c4d76166fa2fc1..1937ee02c1f883ecd910bab33cdb9194bddbd9b1 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1882,6 +1882,15 @@
 			Documentation/admin-guide/mm/hugetlbpage.rst.
 			Format: size[KMG]
 
+	hugepage_alloc_threads=
+			[HW] The number of threads that should be used to
+			allocate hugepages during boot. This option can be
+			used to improve system bootup time when allocating
+			a large number of huge pages.
+			The default value is 25% of the available hardware threads.
+
+			Note that this parameter only applies to non-gigantic huge pages.
+
 	hugetlb_cma=	[HW,CMA,EARLY] The size of a CMA area used for allocation
 			of gigantic hugepages. Or using node format, the size
 			of a CMA area per node can be specified.
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index f34a0d798d5b533f30add99a34f66ba4e1c496a3..67a941903fd2231e6c082cffb4c9179ee094b208 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -145,7 +145,17 @@ hugepages
 
 	It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
 	If the node number is invalid,  the parameter will be ignored.
+hugepage_alloc_threads
+	Specify the number of threads that should be used to allocate hugepages
+	during boot. This parameter can be used to improve system bootup time
+	when allocating a large number of huge pages.
 
+	The default value is 25% of the available hardware threads.
+	Example to use 8 allocation threads::
+
+		hugepage_alloc_threads=8
+
+	Note that this parameter only applies to non-gigantic huge pages.
 default_hugepagesz
 	Specify the default huge page size.  This parameter can
 	only be specified once on the command line.  default_hugepagesz can
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e9b1b3e2b9d467f067d54359e1401a03f9926108..98dbfa18bee01d01b40cc7c650cd3eca5eae2457 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -70,6 +70,7 @@ static unsigned long __initdata default_hstate_max_huge_pages;
 static bool __initdata parsed_valid_hugepagesz = true;
 static bool __initdata parsed_default_hugepagesz;
 static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
+static unsigned long hugepage_allocation_threads __initdata;
 
 /*
  * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
@@ -3429,8 +3430,6 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 		.numa_aware	= true
 	};
 
-	unsigned int num_allocation_threads = max(num_online_cpus() / 4, 1);
-
 	job.thread_fn	= hugetlb_pages_alloc_boot_node;
 	job.start	= 0;
 	job.size	= h->max_huge_pages;
@@ -3451,9 +3450,13 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 	 * | cascade lake 192 cpus |   39s |   20s |   11s |   10s |    9s |
 	 * +-----------------------+-------+-------+-------+-------+-------+
 	 */
+	if (hugepage_allocation_threads == 0) {
+		hugepage_allocation_threads = num_online_cpus() / 4;
+		hugepage_allocation_threads = max(hugepage_allocation_threads, 1);
+	}
 
-	job.max_threads	= num_allocation_threads;
-	job.min_chunk	= h->max_huge_pages / num_allocation_threads;
+	job.max_threads	= hugepage_allocation_threads;
+	job.min_chunk	= h->max_huge_pages / hugepage_allocation_threads;
 	padata_do_multithreaded(&job);
 
 	return h->nr_huge_pages;
@@ -4766,6 +4769,26 @@ static int __init default_hugepagesz_setup(char *s)
 }
 __setup("default_hugepagesz=", default_hugepagesz_setup);
 
+/* hugepage_alloc_threads command line parsing
+ * When set, use this specific number of threads for the boot
+ * allocation of hugepages.
+ */
+static int __init hugepage_alloc_threads_setup(char *s)
+{
+	unsigned long allocation_threads;
+
+	if (kstrtoul(s, 0, &allocation_threads) != 0)
+		return 1;
+
+	if (allocation_threads == 0)
+		return 1;
+
+	hugepage_allocation_threads = allocation_threads;
+
+	return 1;
+}
+__setup("hugepage_alloc_threads=", hugepage_alloc_threads_setup);
+
 static unsigned int allowed_mems_nr(struct hstate *h)
 {
 	int node;

-- 
2.48.1





* [PATCH v3 3/3] mm: hugetlb: log time needed to allocate hugepages
  2025-02-27 23:02 [PATCH v3 0/3] Add a command line option that enables control of how many threads should be used to allocate huge pages Thomas Prescher via B4 Relay
  2025-02-27 23:02 ` [PATCH v3 1/3] mm: hugetlb: improve parallel huge page allocation time Thomas Prescher via B4 Relay
  2025-02-27 23:02 ` [PATCH v3 2/3] mm: hugetlb: add hugepage_alloc_threads cmdline option Thomas Prescher via B4 Relay
@ 2025-02-27 23:02 ` Thomas Prescher via B4 Relay
  2 siblings, 0 replies; 4+ messages in thread
From: Thomas Prescher via B4 Relay @ 2025-02-27 23:02 UTC (permalink / raw)
  To: Jonathan Corbet, Muchun Song, Andrew Morton
  Cc: linux-doc, linux-kernel, linux-mm, Thomas Prescher

From: Thomas Prescher <thomas.prescher@cyberus-technology.de>

Log the time needed to allocate the huge pages at boot, together
with the number of threads used. Having this information allows
users to easily tune the hugepage_alloc_threads parameter.
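
The measurement pattern used in the diff below (take a timestamp before
and after the parallel job, then log the difference in milliseconds) can
be sketched in userspace C. The kernel code uses jiffies and
jiffies_to_msecs(); this sketch assumes struct timespec timestamps
instead, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Elapsed time between two timestamps, truncated to milliseconds. */
static unsigned int elapsed_ms(const struct timespec *start,
			       const struct timespec *end)
{
	long long ns = (long long)(end->tv_sec - start->tv_sec) * 1000000000LL
		     + (long long)(end->tv_nsec - start->tv_nsec);

	return (unsigned int)(ns / 1000000LL);
}

/* Format a log line shaped like the one the patch adds. */
static int format_alloc_log(char *buf, size_t len, unsigned int ms,
			    unsigned long threads)
{
	return snprintf(buf, len,
			"HugeTLB: allocation took %ums with hugepage_alloc_threads=%lu\n",
			ms, threads);
}
```

Comparing the logged time across boots with different thread counts
shows where the allocation time stops improving on a given machine.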

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
---
 mm/hugetlb.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 98dbfa18bee01d01b40cc7c650cd3eca5eae2457..816e5846222a54255b99515a94e0c1ba9b2b7b27 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3430,6 +3430,9 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 		.numa_aware	= true
 	};
 
+	unsigned long jiffies_start;
+	unsigned long jiffies_end;
+
 	job.thread_fn	= hugetlb_pages_alloc_boot_node;
 	job.start	= 0;
 	job.size	= h->max_huge_pages;
@@ -3457,7 +3460,14 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 
 	job.max_threads	= hugepage_allocation_threads;
 	job.min_chunk	= h->max_huge_pages / hugepage_allocation_threads;
+
+	jiffies_start = jiffies;
 	padata_do_multithreaded(&job);
+	jiffies_end = jiffies;
+
+	pr_info("HugeTLB: allocation took %ums with hugepage_alloc_threads=%lu\n",
+		jiffies_to_msecs(jiffies_end - jiffies_start),
+		hugepage_allocation_threads);
 
 	return h->nr_huge_pages;
 }

-- 
2.48.1



