linux-mm.kvack.org archive mirror
From: Wei Yang <richard.weiyang@gmail.com>
To: Gregory Price <gourry@gourry.net>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kernel-team@meta.com, akpm@linux-foundation.org, vbabka@suse.cz,
	surenb@google.com, mhocko@suse.com, jackmanb@google.com,
	hannes@cmpxchg.org, ziy@nvidia.com, richard.weiyang@gmail.com,
	osalvador@suse.de, rientjes@google.com, david@redhat.com,
	joshua.hahnjy@gmail.com, fvdl@google.com
Subject: Re: [PATCH v6] page_alloc: allow migration of smaller hugepages during contig_alloc
Date: Fri, 19 Dec 2025 00:08:00 +0000
Message-ID: <20251219000800.tnpqzvcdyeqcwryt@master>
In-Reply-To: <20251218233804.1395835-1-gourry@gourry.net>

On Thu, Dec 18, 2025 at 06:38:04PM -0500, Gregory Price wrote:
>We presently skip regions with hugepages entirely when trying to do
>contiguous page allocation.  This will cause otherwise-movable
>2MB HugeTLB pages to be considered unmovable, and makes 1GB gigantic
>page allocation less reliable on systems utilizing both.
>
>Commit 4d73ba5fa710 ("mm: page_alloc: skip regions with hugetlbfs pages
>when allocating 1G pages") skipped all HugePage containing regions
>because it can cause significant delays in 1G allocation (as HugeTLB
>migrations may fail for a number of reasons).
>
>Instead, if hugepage migration is enabled, consider regions with
>hugepages smaller than the target contiguous allocation request
>as valid targets for allocation.
>
>We optimize for the existing behavior by searching for non-hugetlb
>regions in a first pass, then retrying the search to include hugetlb
>only on failure.  This allows the existing fast-path to remain the
>default case with a slow-path fallback to increase reliability.
>
>We only fallback to the slow path if a hugetlb region was detected,
>and we do a full re-scan because the zones/blocks may have changed
>during the first pass (and it's not worth further complexity).
>
>isolate_migrate_pages_block() has similar hugetlb filter logic, and
>the hugetlb code does a migratable check in folio_isolate_hugetlb()
>during isolation.  The code servicing the allocation and migration
>already supports this exact use case.
>
>To test, allocate a bunch of 2MB HugeTLB pages (in this case 48GB)
>and then attempt to allocate some 1G HugeTLB pages (in this case 4GB)
>(Scale to your machine's memory capacity).
>
>echo 24576 > .../hugepages-2048kB/nr_hugepages
>echo 4 > .../hugepages-1048576kB/nr_hugepages
>
>Prior to this patch, the 1GB page reservation can fail if no contiguous
>1GB pages remain.  After this patch, the kernel will try to move 2MB
>pages and successfully allocate the 1GB pages (assuming overall
>sufficient memory is available). Also tested this while a program had
>the 2MB reservations mapped, and the 1GB reservation still succeeds.
>
>folio_alloc_gigantic() is the primary user of alloc_contig_pages(),
>other users are debug or init-time allocations and largely unaffected.
>- ppc/memtrace is a debugfs interface
>- x86/tdx memory allocation occurs once on module-init
>- kfence/core happens once on module (late) init
>- THP uses it in debug_vm_pgtable_alloc_huge_page at __init time
>
>Suggested-by: David Hildenbrand <david@redhat.com>
>Link: https://lore.kernel.org/linux-mm/6fe3562d-49b2-4975-aa86-e139c535ad00@redhat.com/
>Signed-off-by: Gregory Price <gourry@gourry.net>
>---
> mm/page_alloc.c | 52 +++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 48 insertions(+), 4 deletions(-)
>
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index 822e05f1a964..adf579a0df3e 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -7083,7 +7083,8 @@ static int __alloc_contig_pages(unsigned long start_pfn,
> }
> 
> static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
>-				   unsigned long nr_pages)
>+				   unsigned long nr_pages, bool skip_hugetlb,
>+				   bool *skipped_hugetlb)
> {
> 	unsigned long i, end_pfn = start_pfn + nr_pages;
> 	struct page *page;
>@@ -7099,8 +7100,35 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
> 		if (PageReserved(page))
> 			return false;
> 
>-		if (PageHuge(page))
>-			return false;
>+		/*
>+		 * Only consider ranges containing hugepages if those pages are
>+		 * smaller than the requested contiguous region.  e.g.:
>+		 *     Move 2MB pages to free up a 1GB range.
>+		 *     Don't move 1GB pages to free up a 2MB range.
>+		 *
>+		 * This makes contiguous allocation more reliable if multiple
>+		 * hugepage sizes are used without causing needless movement.
>+		 */
>+		if (PageHuge(page)) {
>+			unsigned int order;
>+
>+			if (!IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION))
>+				return false;
>+
>+			if (skip_hugetlb) {
>+				*skipped_hugetlb = true;
>+				return false;
>+			}
>+
>+			page = compound_head(page);
>+			order = compound_order(page);

The order is read from the head page.

>+			if ((order >= MAX_FOLIO_ORDER) ||
>+			    (nr_pages <= (1 << order)))
>+				return false;
>+
>+			/* No need to check the pfns for this page */
>+			i += (1 << order) - 1;

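So this advance should be based on the head page rather than the original
page, right? If the original page is a tail page, skipping a full
(1 << order) pfns from it would overshoot the end of the hugepage.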
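
To make the arithmetic concrete, here is a minimal userspace sketch of
the skip distance. pfns_left_in_compound() is a hypothetical helper, not
something in the patch or in the kernel; head_pfn stands for the pfn of
the head page and pfn for the page the loop is currently on. Whether a
tail page can actually be reached here depends on how start_pfn is
aligned, which I have not verified.

```c
#include <stdio.h>

/*
 * Hypothetical helper (illustration only, not kernel code): how many
 * further pfns a scan may skip after landing on `pfn` inside a compound
 * page whose head is at `head_pfn` and whose order is `order`.
 */
static unsigned long pfns_left_in_compound(unsigned long pfn,
					   unsigned long head_pfn,
					   unsigned int order)
{
	unsigned long nr = 1UL << order;	/* pages in the compound page */
	unsigned long offset = pfn - head_pfn;	/* 0 when pfn is the head */

	/* The loop's own i++ covers the current pfn, hence the -1. */
	return nr - offset - 1;
}
```

For an order-9 (2MB) page with its head at pfn 512, landing on the head
gives 511, which matches the patch. Landing on pfn 612 gives 411, while
the unconditional (1 << order) - 1 would still advance by 511.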

>+		}
> 	}
> 	return true;
> }
>@@ -7143,7 +7171,10 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
> 	struct zonelist *zonelist;
> 	struct zone *zone;
> 	struct zoneref *z;
>+	bool skip_hugetlb = true;
>+	bool skipped_hugetlb = false;
> 
>+retry:
> 	zonelist = node_zonelist(nid, gfp_mask);
> 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> 					gfp_zone(gfp_mask), nodemask) {
>@@ -7151,7 +7182,9 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
> 
> 		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
> 		while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
>-			if (pfn_range_valid_contig(zone, pfn, nr_pages)) {
>+			if (pfn_range_valid_contig(zone, pfn, nr_pages,
>+						   skip_hugetlb,
>+						   &skipped_hugetlb)) {
> 				/*
> 				 * We release the zone lock here because
> 				 * alloc_contig_range() will also lock the zone
>@@ -7170,6 +7203,17 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
> 		}
> 		spin_unlock_irqrestore(&zone->lock, flags);
> 	}
>+	/*
>+	 * If we failed, retry the search, but treat regions with HugeTLB pages
>+	 * as valid targets.  This retains fast-allocations on first pass
>+	 * without trying to migrate HugeTLB pages (which may fail). On the
>+	 * second pass, we will try moving HugeTLB pages when those pages are
>+	 * smaller than the requested contiguous region size.
>+	 */
>+	if (skip_hugetlb && skipped_hugetlb) {
>+		skip_hugetlb = false;
>+		goto retry;
>+	}
> 	return NULL;
> }
> #endif /* CONFIG_CONTIG_ALLOC */
>-- 
>2.52.0
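
For reference while reviewing the retry logic, this is how I read the
two-pass control flow, reduced to a userspace sketch. The region_kind
enum, region_usable() and find_region() are made up for illustration;
only the skip_hugetlb/skipped_hugetlb handshake mirrors the patch, and
the size and migration checks are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for what a pfn range may contain. */
enum region_kind { REGION_FREE, REGION_HUGETLB, REGION_RESERVED };

/* Loosely mirrors pfn_range_valid_contig(): record skipped hugetlb ranges. */
static bool region_usable(enum region_kind kind, bool skip_hugetlb,
			  bool *skipped_hugetlb)
{
	if (kind == REGION_RESERVED)
		return false;
	if (kind == REGION_HUGETLB && skip_hugetlb) {
		*skipped_hugetlb = true;
		return false;
	}
	return true;
}

/* Returns the index of the first usable region, or -1 if there is none. */
static int find_region(const enum region_kind *regions, size_t n)
{
	bool skip_hugetlb = true;
	bool skipped_hugetlb = false;
	size_t i;

retry:
	for (i = 0; i < n; i++)
		if (region_usable(regions[i], skip_hugetlb, &skipped_hugetlb))
			return (int)i;

	/* Second pass only if the first pass actually skipped hugetlb. */
	if (skip_hugetlb && skipped_hugetlb) {
		skip_hugetlb = false;
		goto retry;
	}
	return -1;
}
```

With { HUGETLB, FREE } the first pass already returns index 1, so no
hugetlb range is touched. With { RESERVED, HUGETLB } the first pass
fails and the second pass returns index 1.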

-- 
Wei Yang
Help you, Help me