From: Frank van der Linden <fvdl@google.com>
To: akpm@linux-foundation.org, muchun.song@linux.dev,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org
Cc: yuzhao@google.com, usamaarif642@gmail.com,
	joao.m.martins@oracle.com,  roman.gushchin@linux.dev,
	Frank van der Linden <fvdl@google.com>
Subject: [PATCH v3 14/28] mm/hugetlb: check bootmem pages for zone intersections
Date: Thu,  6 Feb 2025 18:50:54 +0000
Message-ID: <20250206185109.1210657-15-fvdl@google.com>
In-Reply-To: <20250206185109.1210657-1-fvdl@google.com>

Bootmem hugetlb pages are allocated using memblock, which isn't
(and mostly can't be) aware of zones.
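
For reference, the bootmem allocation path looks roughly like
this (a simplified sketch of __alloc_bootmem_huge_page());
memblock is only given a size, an alignment and a node, so it
has no way to respect zone boundaries:

	/*
	 * Simplified sketch: memblock returns a physically
	 * contiguous, huge-page-aligned range on @node.
	 * Nothing in this call knows about zones.
	 */
	m = memblock_alloc_try_nid_raw(huge_page_size(h),
				       huge_page_size(h), 0,
				       MEMBLOCK_ALLOC_ACCESSIBLE,
				       node);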

So, they may end up crossing zone boundaries. This would create
confusion: a hugetlb page that is part of multiple zones is bad.
Worse, HVO might then end up stealthily re-assigning pages to a
different zone when a hugetlb page is freed, since the tail page
structures beyond the first vmemmap page would inherit the zone
of the first page structures.
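
A page's zone is not stored separately: it is encoded in bits
of page->flags. With HVO, the tail struct pages beyond the
first vmemmap page are all backed by a shared vmemmap page, so
the zone they report is whatever was stamped into that shared
page. Lightly paraphrased from include/linux/mm.h:

	static inline enum zone_type page_zonenum(const struct page *page)
	{
		return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
	}

	static inline struct zone *page_zone(const struct page *page)
	{
		int nid = page_to_nid(page);

		return &NODE_DATA(nid)->node_zones[page_zonenum(page)];
	}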

While the chance of this happening is low, you can definitely
create a configuration where this happens (especially using
ZONE_MOVABLE). For example, when part of a node's memory is
placed in ZONE_MOVABLE, a gigantic page allocated by memblock
near the zone boundary can straddle ZONE_NORMAL and
ZONE_MOVABLE.

To avoid this issue, check during the gather phase whether a
bootmem hugetlb page intersects with multiple zones, and if it
does, discard it by handing its pages to the page allocator one
by one. Record the number of invalid bootmem pages per hstate
and subtract them from the number of available pages at the end,
making it easier to do these checks in multiple places later on.
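
The check relies on the existing zone_intersects() helper from
include/linux/mmzone.h, shown here (lightly paraphrased) for
reference; the new pfn_range_intersects_zones() below returns
true as soon as a second zone overlaps the PFN range:

	static inline bool zone_intersects(struct zone *zone,
			unsigned long start_pfn, unsigned long nr_pages)
	{
		if (zone_is_empty(zone))
			return false;
		if (start_pfn >= zone_end_pfn(zone) ||
		    start_pfn + nr_pages <= zone->zone_start_pfn)
			return false;

		return true;
	}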

Signed-off-by: Frank van der Linden <fvdl@google.com>
---
 mm/hugetlb.c  | 61 +++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/internal.h |  2 ++
 mm/mm_init.c  | 25 +++++++++++++++++++++
 3 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index de8adfb487f4..789341158ef6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -62,6 +62,7 @@ static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
 static unsigned long hugetlb_cma_size __initdata;
 
 __initdata struct list_head huge_boot_pages[MAX_NUMNODES];
+static unsigned long hstate_boot_nrinvalid[HUGE_MAX_HSTATE] __initdata;
 
 /*
  * Due to ordering constraints across the init code for various
@@ -3308,6 +3309,44 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 	}
 }
 
+static bool __init hugetlb_bootmem_page_zones_valid(int nid,
+						    struct huge_bootmem_page *m)
+{
+	unsigned long start_pfn;
+	bool valid;
+
+	start_pfn = virt_to_phys(m) >> PAGE_SHIFT;
+
+	valid = !pfn_range_intersects_zones(nid, start_pfn,
+			pages_per_huge_page(m->hstate));
+	if (!valid)
+		hstate_boot_nrinvalid[hstate_index(m->hstate)]++;
+
+	return valid;
+}
+
+/*
+ * Free a bootmem page that was found to be invalid (intersecting with
+ * multiple zones).
+ *
+ * Since it intersects with multiple zones, we can't just do a free
+ * operation on all pages at once, but instead have to walk all
+ * pages, freeing them one by one.
+ */
+static void __init hugetlb_bootmem_free_invalid_page(int nid, struct page *page,
+					     struct hstate *h)
+{
+	unsigned long npages = pages_per_huge_page(h);
+	unsigned long pfn;
+
+	while (npages--) {
+		pfn = page_to_pfn(page);
+		__init_reserved_page_zone(pfn, nid);
+		free_reserved_page(page);
+		page++;
+	}
+}
+
 /*
  * Put bootmem huge pages into the standard lists after mem_map is up.
  * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
@@ -3315,14 +3354,25 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 static void __init gather_bootmem_prealloc_node(unsigned long nid)
 {
 	LIST_HEAD(folio_list);
-	struct huge_bootmem_page *m;
+	struct huge_bootmem_page *m, *tm;
 	struct hstate *h = NULL, *prev_h = NULL;
 
-	list_for_each_entry(m, &huge_boot_pages[nid], list) {
+	list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
 		struct page *page = virt_to_page(m);
 		struct folio *folio = (void *)page;
 
 		h = m->hstate;
+		if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
+			/*
+			 * Can't use this page. Initialize the
+			 * page structures if that hasn't already
+			 * been done, and give them to the page
+			 * allocator.
+			 */
+			hugetlb_bootmem_free_invalid_page(nid, page, h);
+			continue;
+		}
+
 		/*
 		 * It is possible to have multiple huge page sizes (hstates)
 		 * in this list.  If so, process each size separately.
@@ -3594,13 +3644,20 @@ static void __init hugetlb_init_hstates(void)
 static void __init report_hugepages(void)
 {
 	struct hstate *h;
+	unsigned long nrinvalid;
 
 	for_each_hstate(h) {
 		char buf[32];
 
+		nrinvalid = hstate_boot_nrinvalid[hstate_index(h)];
+		h->max_huge_pages -= nrinvalid;
+
 		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
 		pr_info("HugeTLB: registered %s page size, pre-allocated %ld pages\n",
 			buf, h->free_huge_pages);
+		if (nrinvalid)
+			pr_info("HugeTLB: %s page size: %lu invalid page%s discarded\n",
+					buf, nrinvalid, nrinvalid > 1 ? "s" : "");
 		pr_info("HugeTLB: %d KiB vmemmap can be freed for a %s page\n",
 			hugetlb_vmemmap_optimizable_size(h) / SZ_1K, buf);
 	}
diff --git a/mm/internal.h b/mm/internal.h
index 57662141930e..63fda9bb9426 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -658,6 +658,8 @@ static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
 }
 
 void set_zone_contiguous(struct zone *zone);
+bool pfn_range_intersects_zones(int nid, unsigned long start_pfn,
+			   unsigned long nr_pages);
 
 static inline void clear_zone_contiguous(struct zone *zone)
 {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 925ed6564572..f7d5b4fe1ae9 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2287,6 +2287,31 @@ void set_zone_contiguous(struct zone *zone)
 	zone->contiguous = true;
 }
 
+/*
+ * Check if a PFN range intersects multiple zones on one or more
+ * NUMA nodes. Specify the @nid argument if it is known that this
+ * PFN range is on one node, NUMA_NO_NODE otherwise.
+ */
+bool pfn_range_intersects_zones(int nid, unsigned long start_pfn,
+			   unsigned long nr_pages)
+{
+	struct zone *zone, *izone = NULL;
+
+	for_each_zone(zone) {
+		if (nid != NUMA_NO_NODE && zone_to_nid(zone) != nid)
+			continue;
+
+		if (zone_intersects(zone, start_pfn, nr_pages)) {
+			if (izone != NULL)
+				return true;
+			izone = zone;
+		}
+
+	}
+
+	return false;
+}
+
 static void __init mem_init_print_info(void);
 void __init page_alloc_init_late(void)
 {
-- 
2.48.1.502.g6dc24dfdaf-goog


