From: Wei Yang <richard.weiyang@gmail.com>
To: Gregory Price <gourry@gourry.net>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
kernel-team@meta.com, akpm@linux-foundation.org, vbabka@suse.cz,
surenb@google.com, mhocko@suse.com, jackmanb@google.com,
hannes@cmpxchg.org, ziy@nvidia.com, richard.weiyang@gmail.com,
osalvador@suse.de, rientjes@google.com, david@redhat.com,
joshua.hahnjy@gmail.com, fvdl@google.com
Subject: Re: [PATCH v6] page_alloc: allow migration of smaller hugepages during contig_alloc
Date: Fri, 19 Dec 2025 00:08:00 +0000
Message-ID: <20251219000800.tnpqzvcdyeqcwryt@master>
In-Reply-To: <20251218233804.1395835-1-gourry@gourry.net>
On Thu, Dec 18, 2025 at 06:38:04PM -0500, Gregory Price wrote:
>We presently skip regions with hugepages entirely when trying to do
>contiguous page allocation. This will cause otherwise-movable
>2MB HugeTLB pages to be considered unmovable, and makes 1GB gigantic
>page allocation less reliable on systems utilizing both.
>
>Commit 4d73ba5fa710 ("mm: page_alloc: skip regions with hugetlbfs pages
>when allocating 1G pages") skipped all HugePage-containing regions
>because migrating them can cause significant delays in 1G allocation
>(as HugeTLB migrations may fail for a number of reasons).
>
>Instead, if hugepage migration is enabled, consider regions with
>hugepages smaller than the target contiguous allocation request
>as valid targets for allocation.
>
>We optimize for the existing behavior by searching for non-hugetlb
>regions in a first pass, then retrying the search to include hugetlb
>only on failure. This allows the existing fast-path to remain the
>default case with a slow-path fallback to increase reliability.
>
>We only fallback to the slow path if a hugetlb region was detected,
>and we do a full re-scan because the zones/blocks may have changed
>during the first pass (and it's not worth further complexity).
>
>isolate_migratepages_block() has similar hugetlb filter logic, and
>the hugetlb code does a migratable check in folio_isolate_hugetlb()
>during isolation. The code servicing the allocation and migration
>already supports this exact use case.
>
>To test, allocate a bunch of 2MB HugeTLB pages (in this case 48GB)
>and then attempt to allocate some 1G HugeTLB pages (in this case 4GB)
>(Scale to your machine's memory capacity).
>
>echo 24576 > .../hugepages-2048kB/nr_hugepages
>echo 4 > .../hugepages-1048576kB/nr_hugepages
>
>Prior to this patch, the 1GB page reservation can fail if no contiguous
>1GB pages remain. After this patch, the kernel will try to move 2MB
>pages and successfully allocate the 1GB pages (assuming overall
>sufficient memory is available). Also tested this while a program had
>the 2MB reservations mapped, and the 1GB reservation still succeeds.
>
>folio_alloc_gigantic() is the primary user of alloc_contig_pages(),
>other users are debug- or init-time allocations and are largely unaffected:
>- ppc/memtrace is a debugfs interface
>- x86/tdx memory allocation occurs once on module-init
>- kfence/core happens once on module (late) init
>- THP uses it in debug_vm_pgtable_alloc_huge_page at __init time
>
>Suggested-by: David Hildenbrand <david@redhat.com>
>Link: https://lore.kernel.org/linux-mm/6fe3562d-49b2-4975-aa86-e139c535ad00@redhat.com/
>Signed-off-by: Gregory Price <gourry@gourry.net>
>---
> mm/page_alloc.c | 52 +++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 48 insertions(+), 4 deletions(-)
>
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index 822e05f1a964..adf579a0df3e 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -7083,7 +7083,8 @@ static int __alloc_contig_pages(unsigned long start_pfn,
> }
>
> static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
>- unsigned long nr_pages)
>+ unsigned long nr_pages, bool skip_hugetlb,
>+ bool *skipped_hugetlb)
> {
> unsigned long i, end_pfn = start_pfn + nr_pages;
> struct page *page;
>@@ -7099,8 +7100,35 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
> if (PageReserved(page))
> return false;
>
>- if (PageHuge(page))
>- return false;
>+ /*
>+ * Only consider ranges containing hugepages if those pages are
>+ * smaller than the requested contiguous region. e.g.:
>+ * Move 2MB pages to free up a 1GB range.
>+ * Don't move 1GB pages to free up a 2MB range.
>+ *
>+ * This makes contiguous allocation more reliable if multiple
>+ * hugepage sizes are used without causing needless movement.
>+ */
>+ if (PageHuge(page)) {
>+ unsigned int order;
>+
>+ if (!IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION))
>+ return false;
>+
>+ if (skip_hugetlb) {
>+ *skipped_hugetlb = true;
>+ return false;
>+ }
>+
>+ page = compound_head(page);
>+ order = compound_order(page);
Here the order is taken from the head page.
>+ if ((order >= MAX_FOLIO_ORDER) ||
>+ (nr_pages <= (1 << order)))
>+ return false;
>+
>+ /* No need to check the pfns for this page */
>+ i += (1 << order) - 1;
So this advance should be based on the head page's pfn instead of the original page's pfn, right?
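E.g., a rough sketch of what I mean, reusing the names from the hunk above
(page_to_pfn() returns the folio's first pfn since page was just set to the
head):

	page = compound_head(page);
	order = compound_order(page);

	if ((order >= MAX_FOLIO_ORDER) ||
	    (nr_pages <= (1 << order)))
		return false;

	/*
	 * Skip to the folio's last pfn, measured from its head pfn
	 * rather than from i; the loop's i++ then lands on the first
	 * pfn after the folio.
	 */
	i = page_to_pfn(page) + (1 << order) - 1;

If i is always the head pfn when we get here the two forms are equivalent,
but computing the skip from the head makes that assumption explicit.
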
>+ }
> }
> return true;
> }
>@@ -7143,7 +7171,10 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
> struct zonelist *zonelist;
> struct zone *zone;
> struct zoneref *z;
>+ bool skip_hugetlb = true;
>+ bool skipped_hugetlb = false;
>
>+retry:
> zonelist = node_zonelist(nid, gfp_mask);
> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> gfp_zone(gfp_mask), nodemask) {
>@@ -7151,7 +7182,9 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
>
> pfn = ALIGN(zone->zone_start_pfn, nr_pages);
> while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
>- if (pfn_range_valid_contig(zone, pfn, nr_pages)) {
>+ if (pfn_range_valid_contig(zone, pfn, nr_pages,
>+ skip_hugetlb,
>+ &skipped_hugetlb)) {
> /*
> * We release the zone lock here because
> * alloc_contig_range() will also lock the zone
>@@ -7170,6 +7203,17 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
> }
> spin_unlock_irqrestore(&zone->lock, flags);
> }
>+ /*
>+ * If we failed, retry the search, but treat regions with HugeTLB pages
>+ * as valid targets. This retains fast-allocations on first pass
>+ * without trying to migrate HugeTLB pages (which may fail). On the
>+ * second pass, we will try moving HugeTLB pages when those pages are
>+ * smaller than the requested contiguous region size.
>+ */
>+ if (skip_hugetlb && skipped_hugetlb) {
>+ skip_hugetlb = false;
>+ goto retry;
>+ }
> return NULL;
> }
> #endif /* CONFIG_CONTIG_ALLOC */
>--
>2.52.0
--
Wei Yang
Help you, Help me