From: Frank van der Linden <fvdl@google.com>
To: akpm@linux-foundation.org, muchun.song@linux.dev,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: yuzhao@google.com, usamaarif642@gmail.com,
joao.m.martins@oracle.com, roman.gushchin@linux.dev,
Frank van der Linden <fvdl@google.com>
Subject: [PATCH v3 20/28] mm/hugetlb: do pre-HVO for bootmem allocated pages
Date: Thu, 6 Feb 2025 18:51:00 +0000
Message-ID: <20250206185109.1210657-21-fvdl@google.com>
In-Reply-To: <20250206185109.1210657-1-fvdl@google.com>

For large systems, the overhead of vmemmap pages for hugetlb
is substantial: about 1.5% of memory, which comes to roughly
45G for a 3T system. If you want to configure most of that
system for hugetlb (e.g. to use as backing memory for VMs),
there is a chance of running out of memory on boot, even
though you know that the 45G will become available later.
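
As a back-of-the-envelope check of those numbers, here is a
small standalone sketch (not kernel code; it assumes a 64-byte
struct page and 4K base pages, as on x86-64):

#include <stdio.h>

int main(void)
{
	unsigned long long mem = 3ULL << 40;	/* 3T system */
	unsigned long page_size = 4096;		/* base page size */
	unsigned long page_struct = 64;		/* sizeof(struct page) */

	/* one struct page per base page of memory */
	unsigned long long vmemmap = mem / page_size * page_struct;

	/* prints "48G (1.56%)", in line with the figures above */
	printf("%lluG (%.2f%%)\n", vmemmap >> 30,
	       100.0 * page_struct / page_size);
	return 0;
}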

To avoid this scenario, and since it is a waste to first
allocate and then free that 45G during boot, do pre-HVO for
hugetlb bootmem-allocated pages ('gigantic' pages).

Pre-HVO is done by adding functions that are called from
sparse_vmemmap_init_nid_early and sparse_vmemmap_init_nid_late.
The first is called before memmap allocation, so it takes care
of allocating the memmap HVO-style. The second verifies that
all bootmem pages look good; specifically, it checks that they
do not intersect with multiple zones. This can only be done
from the sparse_vmemmap_init_nid_late path, when zones have
been initialized (see the call-order sketch below).
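
A rough sketch of the intended call order during boot (function
names are from this series; the steps in between are summarized,
not exact):

/*
 * hugetlb_bootmem_alloc()              - allocate gigantic pages
 *                                        from memblock, early in boot
 * sparse_vmemmap_init_nid_early(nid)   - before memmap allocation
 *   -> hugetlb_vmemmap_init_early(nid) - populate memmap HVO-style
 * (node and zone initialization)
 * sparse_vmemmap_init_nid_late(nid)    - after zones are initialized
 *   -> hugetlb_vmemmap_init_late(nid)  - validate zones, undo HVO
 *                                        for pages spanning zones
 */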

The hugetlb page size must be aligned to the section size, and
to the size of memory described by the number of page structures
contained in one PMD (since pre-HVO is not prepared to split
PMDs). This should be true for most 'gigantic' pages; it is for
1G pages on x86, where both of these alignment requirements come
to 128M, as illustrated by the sketch below.
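
To show where the 128M figures come from, another standalone
sketch, with assumed x86-64 values (PMD_SIZE = 2M,
sizeof(struct page) = 64, PA_SECTION_SHIFT = 27):

#include <stdio.h>

int main(void)
{
	unsigned long pmd_size = 2UL << 20;	/* PMD_SIZE */
	unsigned long page_struct = 64;		/* sizeof(struct page) */
	unsigned long page_shift = 12;		/* PAGE_SHIFT */

	/* memory described by one PMD's worth of page structs */
	unsigned long pmd_span = (pmd_size / page_struct) << page_shift;
	/* memory described by one memory section */
	unsigned long section = 1UL << 27;	/* PA_SECTION_SHIFT */

	/* both print as 128M, so a 1G page satisfies both */
	printf("%luM %luM\n", pmd_span >> 20, section >> 20);
	return 0;
}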

This will only have an effect if hugetlb_bootmem_alloc was
called early in boot. If it wasn't, this code does nothing,
and HVO for bootmem hugetlb pages works as before.

Signed-off-by: Frank van der Linden <fvdl@google.com>
---
include/linux/hugetlb.h | 2 +
mm/hugetlb.c | 4 +-
mm/hugetlb_vmemmap.c | 143 ++++++++++++++++++++++++++++++++++++++++
mm/hugetlb_vmemmap.h | 6 ++
mm/sparse-vmemmap.c | 4 ++
5 files changed, 157 insertions(+), 2 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 10a7ce2b95e1..2512463bca49 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -687,6 +687,8 @@ struct huge_bootmem_page {
#define HUGE_BOOTMEM_HVO 0x0001
#define HUGE_BOOTMEM_ZONES_VALID 0x0002
+bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
+
int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7e0810c217ce..29b3a6e70a53 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3310,8 +3310,8 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
}
}
-static bool __init hugetlb_bootmem_page_zones_valid(int nid,
- struct huge_bootmem_page *m)
+bool __init hugetlb_bootmem_page_zones_valid(int nid,
+ struct huge_bootmem_page *m)
{
unsigned long start_pfn;
bool valid;
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index be6b33ecbc8e..9a99dfa3c495 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -743,6 +743,149 @@ void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head
__hugetlb_vmemmap_optimize_folios(h, folio_list, true);
}
+#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
+
+/* Return true if a bootmem allocated HugeTLB page should be pre-HVO-ed */
+static bool vmemmap_should_optimize_bootmem_page(struct huge_bootmem_page *m)
+{
+ unsigned long section_size, psize, pmd_vmemmap_size;
+ phys_addr_t paddr;
+
+ if (!READ_ONCE(vmemmap_optimize_enabled))
+ return false;
+
+ if (!hugetlb_vmemmap_optimizable(m->hstate))
+ return false;
+
+ psize = huge_page_size(m->hstate);
+ paddr = virt_to_phys(m);
+
+ /*
+ * Pre-HVO only works if the bootmem huge page
+ * is aligned to the section size.
+ */
+ section_size = (1UL << PA_SECTION_SHIFT);
+ if (!IS_ALIGNED(paddr, section_size) ||
+ !IS_ALIGNED(psize, section_size))
+ return false;
+
+ /*
+ * The pre-HVO code does not deal with splitting PMDS,
+ * so the bootmem page must be aligned to the number
+ * of base pages that can be mapped with one vmemmap PMD.
+ */
+ pmd_vmemmap_size = (PMD_SIZE / (sizeof(struct page))) << PAGE_SHIFT;
+ if (!IS_ALIGNED(paddr, pmd_vmemmap_size) ||
+ !IS_ALIGNED(psize, pmd_vmemmap_size))
+ return false;
+
+ return true;
+}
+
+/*
+ * Initialize memmap section for a gigantic page, HVO-style.
+ */
+void __init hugetlb_vmemmap_init_early(int nid)
+{
+ unsigned long psize, paddr, section_size;
+ unsigned long ns, i, pnum, pfn, nr_pages;
+ unsigned long start, end;
+ struct huge_bootmem_page *m = NULL;
+ void *map;
+
+ /*
+ * Nothing to do if bootmem pages were not allocated
+ * early in boot, or if HVO wasn't enabled in the
+ * first place.
+ */
+ if (!hugetlb_bootmem_allocated())
+ return;
+
+ if (!READ_ONCE(vmemmap_optimize_enabled))
+ return;
+
+ section_size = (1UL << PA_SECTION_SHIFT);
+
+ list_for_each_entry(m, &huge_boot_pages[nid], list) {
+ if (!vmemmap_should_optimize_bootmem_page(m))
+ continue;
+
+ nr_pages = pages_per_huge_page(m->hstate);
+ psize = nr_pages << PAGE_SHIFT;
+ paddr = virt_to_phys(m);
+ pfn = PHYS_PFN(paddr);
+ map = pfn_to_page(pfn);
+ start = (unsigned long)map;
+ end = start + nr_pages * sizeof(struct page);
+
+ if (vmemmap_populate_hvo(start, end, nid,
+ HUGETLB_VMEMMAP_RESERVE_SIZE) < 0)
+ continue;
+
+ memmap_boot_pages_add(HUGETLB_VMEMMAP_RESERVE_SIZE / PAGE_SIZE);
+
+ pnum = pfn_to_section_nr(pfn);
+ ns = psize / section_size;
+
+ for (i = 0; i < ns; i++) {
+ sparse_init_early_section(nid, map, pnum,
+ SECTION_IS_VMEMMAP_PREINIT);
+ map += section_map_size();
+ pnum++;
+ }
+
+ m->flags |= HUGE_BOOTMEM_HVO;
+ }
+}
+
+void __init hugetlb_vmemmap_init_late(int nid)
+{
+ struct huge_bootmem_page *m, *tm;
+ unsigned long phys, nr_pages, start, end;
+ unsigned long pfn, nr_mmap;
+ struct hstate *h;
+ void *map;
+
+ if (!hugetlb_bootmem_allocated())
+ return;
+
+ if (!READ_ONCE(vmemmap_optimize_enabled))
+ return;
+
+ list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
+ if (!(m->flags & HUGE_BOOTMEM_HVO))
+ continue;
+
+ phys = virt_to_phys(m);
+ h = m->hstate;
+ pfn = PHYS_PFN(phys);
+ nr_pages = pages_per_huge_page(h);
+
+ if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
+ /*
+ * Oops, the hugetlb page spans multiple zones.
+ * Remove it from the list, and undo HVO.
+ */
+ list_del(&m->list);
+
+ map = pfn_to_page(pfn);
+
+ start = (unsigned long)map;
+ end = start + nr_pages * sizeof(struct page);
+
+ vmemmap_undo_hvo(start, end, nid,
+ HUGETLB_VMEMMAP_RESERVE_SIZE);
+ nr_mmap = end - start - HUGETLB_VMEMMAP_RESERVE_SIZE;
+ memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
+
+ memblock_phys_free(phys, huge_page_size(h));
+ continue;
+ }
+ m->flags |= HUGE_BOOTMEM_ZONES_VALID;
+ }
+}
+#endif
+
static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
{
.procname = "hugetlb_optimize_vmemmap",
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 926b8b27b5cb..0031e49b12f7 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -9,6 +9,8 @@
#ifndef _LINUX_HUGETLB_VMEMMAP_H
#define _LINUX_HUGETLB_VMEMMAP_H
#include <linux/hugetlb.h>
+#include <linux/io.h>
+#include <linux/memblock.h>
/*
* Reserve one vmemmap page, all vmemmap addresses are mapped to it. See
@@ -25,6 +27,10 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head *folio_list);
+#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
+void hugetlb_vmemmap_init_early(int nid);
+void hugetlb_vmemmap_init_late(int nid);
+#endif
static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 8cc848c4b17c..fd2ab5118e13 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -32,6 +32,8 @@
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
+#include "hugetlb_vmemmap.h"
+
/*
* Flags for vmemmap_populate_range and friends.
*/
@@ -594,6 +596,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
*/
void __init sparse_vmemmap_init_nid_early(int nid)
{
+ hugetlb_vmemmap_init_early(nid);
}
/*
@@ -604,5 +607,6 @@ void __init sparse_vmemmap_init_nid_early(int nid)
*/
void __init sparse_vmemmap_init_nid_late(int nid)
{
+ hugetlb_vmemmap_init_late(nid);
}
#endif
--
2.48.1.502.g6dc24dfdaf-goog