From: Kiryl Shutsemau <kas@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>,
Muchun Song <muchun.song@linux.dev>,
David Hildenbrand <david@redhat.com>,
Matthew Wilcox <willy@infradead.org>,
Usama Arif <usamaarif642@gmail.com>,
Frank van der Linden <fvdl@google.com>
Cc: Oscar Salvador <osalvador@suse.de>,
Mike Rapoport <rppt@kernel.org>, Vlastimil Babka <vbabka@suse.cz>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Zi Yan <ziy@nvidia.com>, Baoquan He <bhe@redhat.com>,
Michal Hocko <mhocko@suse.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Jonathan Corbet <corbet@lwn.net>,
Huacai Chen <chenhuacai@kernel.org>,
WANG Xuerui <kernel@xen0n.name>,
Palmer Dabbelt <palmer@dabbelt.com>,
Paul Walmsley <paul.walmsley@sifive.com>,
Albert Ou <aou@eecs.berkeley.edu>,
Alexandre Ghiti <alex@ghiti.fr>,
kernel-team@meta.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
loongarch@lists.linux.dev, linux-riscv@lists.infradead.org,
Kiryl Shutsemau <kas@kernel.org>
Subject: [PATCHv5 11/17] mm/hugetlb: Remove fake head pages
Date: Wed, 28 Jan 2026 13:54:52 +0000 [thread overview]
Message-ID: <20260128135500.22121-12-kas@kernel.org> (raw)
In-Reply-To: <20260128135500.22121-1-kas@kernel.org>
HugeTLB Vmemmap Optimization (HVO) reduces memory usage by freeing most
vmemmap pages for huge pages and remapping the freed range to a single
page containing the struct page metadata.
With the new mask-based compound_info encoding (for power-of-2 struct
page sizes), all tail pages of the same order are now identical
regardless of which compound page they belong to. This means the tail
pages can be truly shared without fake heads.
Allocate a single page of initialized tail struct pages per NUMA node
per order in the vmemmap_tails[] array in pglist_data. All huge pages of
that order on the node share this tail page, mapped read-only into their
vmemmap. The head page remains unique per huge page.
Redefine MAX_FOLIO_ORDER using ilog2(). The define has to produce a
compile-constant as it is used to specify vmemmap_tail array size.
For some reason, compiler is not able to solve get_order() at
compile-time, but ilog2() works.
Avoid PUD_ORDER to define MAX_FOLIO_ORDER as it adds dependency to
<linux/pgtable.h> which generates hard-to-break include loop.
This eliminates fake heads while maintaining the same memory savings,
and simplifies compound_head() by removing fake head detection.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
include/linux/mmzone.h | 18 +++++++++++++++--
mm/hugetlb_vmemmap.c | 36 ++++++++++++++++++++++++++++++++--
mm/sparse-vmemmap.c | 44 ++++++++++++++++++++++++++++++++++--------
3 files changed, 86 insertions(+), 12 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 192143b5cdc0..698091c74dbb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -81,13 +81,17 @@
* currently expect (see CONFIG_HAVE_GIGANTIC_FOLIOS): with hugetlb, we expect
* no folios larger than 16 GiB on 64bit and 1 GiB on 32bit.
*/
-#define MAX_FOLIO_ORDER get_order(IS_ENABLED(CONFIG_64BIT) ? SZ_16G : SZ_1G)
+#ifdef CONFIG_64BIT
+#define MAX_FOLIO_ORDER (ilog2(SZ_16G) - PAGE_SHIFT)
+#else
+#define MAX_FOLIO_ORDER (ilog2(SZ_1G) - PAGE_SHIFT)
+#endif
#else
/*
* Without hugetlb, gigantic folios that are bigger than a single PUD are
* currently impossible.
*/
-#define MAX_FOLIO_ORDER PUD_ORDER
+#define MAX_FOLIO_ORDER (PUD_SHIFT - PAGE_SHIFT)
#endif
#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
@@ -1402,6 +1406,13 @@ struct memory_failure_stats {
};
#endif
+/*
+ * vmemmap optimization (like HVO) is only possible for page orders that fill
+ * two or more pages with struct pages.
+ */
+#define VMEMMAP_TAIL_MIN_ORDER (ilog2(2 * PAGE_SIZE / sizeof(struct page)))
+#define NR_VMEMMAP_TAILS (MAX_FOLIO_ORDER - VMEMMAP_TAIL_MIN_ORDER + 1)
+
/*
* On NUMA machines, each NUMA node would have a pg_data_t to describe
* it's memory layout. On UMA machines there is a single pglist_data which
@@ -1550,6 +1561,9 @@ typedef struct pglist_data {
#ifdef CONFIG_MEMORY_FAILURE
struct memory_failure_stats mf_stats;
#endif
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+ unsigned long vmemmap_tails[NR_VMEMMAP_TAILS];
+#endif
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index a39a301e08b9..f5f42b92dd7d 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -19,6 +19,7 @@
#include <asm/tlbflush.h>
#include "hugetlb_vmemmap.h"
+#include "internal.h"
/**
* struct vmemmap_remap_walk - walk vmemmap page table
@@ -505,6 +506,34 @@ static bool vmemmap_should_optimize_folio(const struct hstate *h, struct folio *
return true;
}
+static struct page *vmemmap_get_tail(unsigned int order, int node)
+{
+ unsigned long pfn;
+ unsigned int idx;
+ struct page *tail, *p;
+
+ idx = order - VMEMMAP_TAIL_MIN_ORDER;
+ pfn = READ_ONCE(NODE_DATA(node)->vmemmap_tails[idx]);
+ if (pfn)
+ return pfn_to_page(pfn);
+
+ tail = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
+ if (!tail)
+ return NULL;
+
+ p = page_to_virt(tail);
+ for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
+ prep_compound_tail(p + i, NULL, order);
+
+ pfn = PHYS_PFN(virt_to_phys(p));
+ if (cmpxchg(&NODE_DATA(node)->vmemmap_tails[idx], 0, pfn)) {
+ __free_page(tail);
+ pfn = READ_ONCE(NODE_DATA(node)->vmemmap_tails[idx]);
+ }
+
+ return pfn_to_page(pfn);
+}
+
static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
struct folio *folio,
struct list_head *vmemmap_pages,
@@ -520,6 +549,11 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
if (!vmemmap_should_optimize_folio(h, folio))
return ret;
+ nid = folio_nid(folio);
+ vmemmap_tail = vmemmap_get_tail(h->order, nid);
+ if (!vmemmap_tail)
+ return -ENOMEM;
+
static_branch_inc(&hugetlb_optimize_vmemmap_key);
if (flags & VMEMMAP_SYNCHRONIZE_RCU)
@@ -537,7 +571,6 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
*/
folio_set_hugetlb_vmemmap_optimized(folio);
- nid = folio_nid(folio);
vmemmap_head = alloc_pages_node(nid, GFP_KERNEL, 0);
if (!vmemmap_head) {
ret = -ENOMEM;
@@ -548,7 +581,6 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
list_add(&vmemmap_head->lru, vmemmap_pages);
memmap_pages_add(1);
- vmemmap_tail = vmemmap_head;
vmemmap_start = (unsigned long)&folio->page;
vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 37522d6cb398..23abd06f1a4e 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -378,16 +378,45 @@ void vmemmap_wrprotect_hvo(unsigned long addr, unsigned long end,
}
}
-/*
- * Populate vmemmap pages HVO-style. The first page contains the head
- * page and needed tail pages, the other ones are mirrors of the first
- * page.
- */
+static __meminit unsigned long vmemmap_get_tail(unsigned int order, int node)
+{
+ unsigned long pfn;
+ unsigned int idx;
+ struct page *p;
+
+ BUG_ON(order < VMEMMAP_TAIL_MIN_ORDER);
+ BUG_ON(order > MAX_FOLIO_ORDER);
+
+ idx = order - VMEMMAP_TAIL_MIN_ORDER;
+ pfn = NODE_DATA(node)->vmemmap_tails[idx];
+ if (pfn)
+ return pfn;
+
+ p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
+ if (!p)
+ return 0;
+
+ for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
+ prep_compound_tail(p + i, NULL, order);
+
+ pfn = PHYS_PFN(virt_to_phys(p));
+ NODE_DATA(node)->vmemmap_tails[idx] = pfn;
+
+ return pfn;
+}
+
int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
int node, unsigned long headsize)
{
+ unsigned long maddr, len, tail_pfn;
+ unsigned int order;
pte_t *pte;
- unsigned long maddr;
+
+ len = end - addr;
+ order = ilog2(len * sizeof(struct page) / PAGE_SIZE);
+ tail_pfn = vmemmap_get_tail(order, node);
+ if (!tail_pfn)
+ return -ENOMEM;
for (maddr = addr; maddr < addr + headsize; maddr += PAGE_SIZE) {
pte = vmemmap_populate_address(maddr, node, NULL, -1, 0);
@@ -398,8 +427,7 @@ int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
/*
* Reuse the last page struct page mapped above for the rest.
*/
- return vmemmap_populate_range(maddr, end, node, NULL,
- pte_pfn(ptep_get(pte)), 0);
+ return vmemmap_populate_range(maddr, end, node, NULL, tail_pfn, 0);
}
void __weak __meminit vmemmap_set_pmd(pmd_t *pmd, void *p, int node,
--
2.51.2
next prev parent reply other threads:[~2026-01-28 13:56 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-28 13:54 [PATCHv5 00/17] mm: Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 01/17] mm: Move MAX_FOLIO_ORDER definition to mmzone.h Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 02/17] mm: Change the interface of prep_compound_tail() Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 03/17] mm: Rename the 'compound_head' field in the 'struct page' to 'compound_info' Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 04/17] mm: Move set/clear_compound_head() next to compound_head() Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 05/17] riscv/mm: Align vmemmap to maximal folio size Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 06/17] LoongArch/mm: " Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 07/17] mm: Rework compound_head() for power-of-2 sizeof(struct page) Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 08/17] mm: Make page_zonenum() use head page Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 09/17] mm/sparse: Check memmap alignment for compound_info_has_mask() Kiryl Shutsemau
2026-01-29 3:00 ` Muchun Song
2026-01-29 3:10 ` Zi Yan
2026-01-29 3:23 ` Muchun Song
2026-01-29 3:29 ` Zi Yan
2026-01-29 7:03 ` Muchun Song
2026-01-29 17:33 ` Zi Yan
2026-01-28 13:54 ` [PATCHv5 10/17] mm/hugetlb: Refactor code around vmemmap_walk Kiryl Shutsemau
2026-01-29 2:51 ` Muchun Song
2026-01-28 13:54 ` Kiryl Shutsemau [this message]
2026-01-29 6:54 ` [PATCHv5 11/17] mm/hugetlb: Remove fake head pages Muchun Song
2026-01-28 13:54 ` [PATCHv5 12/17] mm: Drop fake head checks Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 13/17] hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 14/17] mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 15/17] mm: Remove the branch from compound_head() Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 16/17] hugetlb: Update vmemmap_dedup.rst Kiryl Shutsemau
2026-01-28 13:54 ` [PATCHv5 17/17] mm/slab: Use compound_head() in page_slab() Kiryl Shutsemau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260128135500.22121-12-kas@kernel.org \
--to=kas@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=alex@ghiti.fr \
--cc=aou@eecs.berkeley.edu \
--cc=bhe@redhat.com \
--cc=chenhuacai@kernel.org \
--cc=corbet@lwn.net \
--cc=david@redhat.com \
--cc=fvdl@google.com \
--cc=hannes@cmpxchg.org \
--cc=kernel-team@meta.com \
--cc=kernel@xen0n.name \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-riscv@lists.infradead.org \
--cc=loongarch@lists.linux.dev \
--cc=lorenzo.stoakes@oracle.com \
--cc=mhocko@suse.com \
--cc=muchun.song@linux.dev \
--cc=osalvador@suse.de \
--cc=palmer@dabbelt.com \
--cc=paul.walmsley@sifive.com \
--cc=rppt@kernel.org \
--cc=usamaarif642@gmail.com \
--cc=vbabka@suse.cz \
--cc=willy@infradead.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox