From: Muchun Song <muchun.song@linux.dev>
To: Kiryl Shutsemau <kas@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>,
Mike Rapoport <rppt@kernel.org>, Vlastimil Babka <vbabka@suse.cz>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Zi Yan <ziy@nvidia.com>, Baoquan He <bhe@redhat.com>,
Michal Hocko <mhocko@suse.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Jonathan Corbet <corbet@lwn.net>,
kernel-team@meta.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Matthew Wilcox <willy@infradead.org>,
Usama Arif <usamaarif642@gmail.com>,
Frank van der Linden <fvdl@google.com>
Subject: Re: [PATCHv4 09/14] mm/hugetlb: Remove fake head pages
Date: Thu, 22 Jan 2026 15:00:03 +0800
Message-ID: <ffe9811b-d7f8-4924-9ad6-96057a16b693@linux.dev>
In-Reply-To: <20260121162253.2216580-10-kas@kernel.org>
On 2026/1/22 00:22, Kiryl Shutsemau wrote:
> HugeTLB Vmemmap Optimization (HVO) reduces memory usage by freeing most
> vmemmap pages for huge pages and remapping the freed range to a single
> page containing the struct page metadata.
>
> With the new mask-based compound_info encoding (for power-of-2 struct
> page sizes), all tail pages of the same order are now identical
> regardless of which compound page they belong to. This means the tail
> pages can be truly shared without fake heads.
>
> Allocate a single page of initialized tail struct pages per NUMA node
> per order in the vmemmap_tails[] array in pglist_data. All huge pages of
> that order on the node share this tail page, mapped read-only into their
> vmemmap. The head page remains unique per huge page.
>
> Redefine MAX_FOLIO_ORDER using ilog2(). The define has to produce a
> compile-constant as it is used to specify vmemmap_tail array size.
> For some reason, compiler is not able to solve get_order() at
> compile-time, but ilog2() works.
>
> Avoid PUD_ORDER to define MAX_FOLIO_ORDER as it adds dependency to
> <linux/pgtable.h> which generates hard-to-break include loop.
>
> This eliminates fake heads while maintaining the same memory savings,
> and simplifies compound_head() by removing fake head detection.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
> include/linux/mmzone.h | 18 ++++++++--
> mm/hugetlb_vmemmap.c | 80 ++++++++++++++++++++++++++++--------------
> mm/sparse-vmemmap.c | 44 ++++++++++++++++++-----
> 3 files changed, 106 insertions(+), 36 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7e4f69b9d760..7e6beeca4d40 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -81,13 +81,17 @@
> * currently expect (see CONFIG_HAVE_GIGANTIC_FOLIOS): with hugetlb, we expect
> * no folios larger than 16 GiB on 64bit and 1 GiB on 32bit.
> */
> -#define MAX_FOLIO_ORDER get_order(IS_ENABLED(CONFIG_64BIT) ? SZ_16G : SZ_1G)
> +#ifdef CONFIG_64BIT
> +#define MAX_FOLIO_ORDER (ilog2(SZ_16G) - PAGE_SHIFT)
> +#else
> +#define MAX_FOLIO_ORDER (ilog2(SZ_1G) - PAGE_SHIFT)
> +#endif
> #else
> /*
> * Without hugetlb, gigantic folios that are bigger than a single PUD are
> * currently impossible.
> */
> -#define MAX_FOLIO_ORDER PUD_ORDER
> +#define MAX_FOLIO_ORDER (PUD_SHIFT - PAGE_SHIFT)
> #endif
>
> #define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
> @@ -1407,6 +1411,13 @@ struct memory_failure_stats {
> };
> #endif
>
> +/*
> + * vmemmap optimization (like HVO) is only possible for page orders that fill
> + * two or more pages with struct pages.
> + */
> +#define VMEMMAP_TAIL_MIN_ORDER (ilog2(2 * PAGE_SIZE / sizeof(struct page)))
> +#define NR_VMEMMAP_TAILS (MAX_FOLIO_ORDER - VMEMMAP_TAIL_MIN_ORDER + 1)
> +
> /*
> * On NUMA machines, each NUMA node would have a pg_data_t to describe
> * it's memory layout. On UMA machines there is a single pglist_data which
> @@ -1555,6 +1566,9 @@ typedef struct pglist_data {
> #ifdef CONFIG_MEMORY_FAILURE
> struct memory_failure_stats mf_stats;
> #endif
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> + unsigned long vmemmap_tails[NR_VMEMMAP_TAILS];
> +#endif
> } pg_data_t;
>
> #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index a51c0e293175..51bb6c73db92 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -18,6 +18,7 @@
> #include <asm/pgalloc.h>
> #include <asm/tlbflush.h>
> #include "hugetlb_vmemmap.h"
> +#include "internal.h"
>
> /**
> * struct vmemmap_remap_walk - walk vmemmap page table
> @@ -231,36 +232,25 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> set_pte_at(&init_mm, addr, pte, entry);
> }
>
> -/*
> - * How many struct page structs need to be reset. When we reuse the head
> - * struct page, the special metadata (e.g. page->flags or page->mapping)
> - * cannot copy to the tail struct page structs. The invalid value will be
> - * checked in the free_tail_page_prepare(). In order to avoid the message
> - * of "corrupted mapping in tail page". We need to reset at least 4 (one
> - * head struct page struct and three tail struct page structs) struct page
> - * structs.
> - */
> -#define NR_RESET_STRUCT_PAGE 4
> -
> -static inline void reset_struct_pages(struct page *start)
> -{
> - struct page *from = start + NR_RESET_STRUCT_PAGE;
> -
> - BUILD_BUG_ON(NR_RESET_STRUCT_PAGE * 2 > PAGE_SIZE / sizeof(struct page));
> - memcpy(start, from, sizeof(*from) * NR_RESET_STRUCT_PAGE);
> -}
> -
> static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> struct vmemmap_remap_walk *walk)
> {
> struct page *page;
> - void *to;
> + struct page *from, *to;
>
> page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> list_del(&page->lru);
> +
> + /*
> + * Initialize all tail pages with the value of the first non-special
> + * tail pages. The first 4 tail pages of the hugetlb folio contain
> + * special metadata.
> + */
> + from = compound_head((struct page *)addr) + 4;
We should eliminate the hard-coded number 4 wherever we can, to avoid
issues like commit 274fe92de2c4. I suggest copying data from the last
struct page instead. Something like:

	from = compound_head((struct page *)addr) + PAGE_SIZE / sizeof(struct page) - 1;
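To illustrate the suggestion above, here is a minimal userspace sketch, not kernel code. It assumes a 4 KiB page and a 64-byte struct page; `struct mock_page`, `MOCK_PAGE_SIZE`, `STRUCTS_PER_PAGE` and `fill_from_last()` are hypothetical names used only for this example. The point is that the source index is derived from the page geometry, so it does not depend on how many leading tail pages carry special metadata.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; the real definitions live in the kernel. */
#define MOCK_PAGE_SIZE 4096UL

struct mock_page {
	uint64_t words[8];	/* assumption: a 64-byte struct page */
};

#define STRUCTS_PER_PAGE (MOCK_PAGE_SIZE / sizeof(struct mock_page))

static_assert(MOCK_PAGE_SIZE % sizeof(struct mock_page) == 0,
	      "struct size must divide the page size");

/*
 * Fill one restored vmemmap page (@to) with copies of the last struct in
 * the first vmemmap page (@first). The last entry is a plain tail page,
 * so no magic count of "special" leading entries is needed.
 */
static void fill_from_last(struct mock_page *to, const struct mock_page *first)
{
	const struct mock_page *from = first + STRUCTS_PER_PAGE - 1;

	for (size_t i = 0; i < STRUCTS_PER_PAGE; i++)
		to[i] = *from;
}
```

If the number of tail pages holding special metadata changes again, the copy source in this sketch stays valid without touching a constant.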
> to = page_to_virt(page);
> - copy_page(to, (void *)walk->vmemmap_start);
> - reset_struct_pages(to);
> + for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++, to++) {
> + *to = *from;
> + }
Per the kernel coding style, "{}" is not needed around a single-statement block.
>
> /*
> * Makes sure that preceding stores to the page contents become visible
> @@ -425,8 +415,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
>
> vmemmap_start = (unsigned long)&folio->page;
> vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
> -
> - vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
> + vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
This two-line change should go into patch 8.
>
> /*
> * The pages which the vmemmap virtual address range [@vmemmap_start,
> @@ -517,6 +506,41 @@ static bool vmemmap_should_optimize_folio(const struct hstate *h, struct folio *
> return true;
> }
>
> +static struct page *vmemmap_get_tail(unsigned int order, int node)
> +{
> + unsigned long pfn;
> + unsigned int idx;
> + struct page *tail, *p;
> +
> + idx = order - VMEMMAP_TAIL_MIN_ORDER;
> + pfn = NODE_DATA(node)->vmemmap_tails[idx];
Use READ_ONCE() to read NODE_DATA(node)->vmemmap_tails[idx].
> + if (pfn)
> + return pfn_to_page(pfn);
> +
> + tail = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
> + if (!tail)
> + return NULL;
> +
> + p = page_to_virt(tail);
> + for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
> + prep_compound_tail(p + i, NULL, order);
> +
> + spin_lock(&hugetlb_lock);
hugetlb_lock is a contended lock, so better not to abuse it.
cmpxchg() is enough in this case.
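A rough userspace sketch of the lock-free pattern meant here, using C11 atomics as a stand-in for the kernel's cmpxchg() and READ_ONCE(). The names `lookup_tail()` and `publish_tail()` are hypothetical, and the slot stands in for one NODE_DATA(node)->vmemmap_tails[idx] entry, where 0 means "not allocated yet".

```c
#include <assert.h>
#include <stdatomic.h>

/* Lockless fast path: read the slot once (the role READ_ONCE() plays). */
static unsigned long lookup_tail(_Atomic(unsigned long) *slot)
{
	return atomic_load_explicit(slot, memory_order_acquire);
}

/*
 * Try to publish @candidate into an empty slot. Returns the value that
 * ended up in the slot. If it differs from @candidate, another caller
 * won the race and the caller should free its own candidate page.
 */
static unsigned long publish_tail(_Atomic(unsigned long) *slot,
				  unsigned long candidate)
{
	unsigned long expected = 0;

	assert(candidate != 0);	/* 0 is reserved for "empty" */

	if (atomic_compare_exchange_strong(slot, &expected, candidate))
		return candidate;

	/* On failure, @expected was updated to the winner's value. */
	return expected;
}
```

In the kernel, cmpxchg() returns the old value directly, so the equivalent check would compare that return value against 0 rather than inspect an updated "expected" variable.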
> + if (!NODE_DATA(node)->vmemmap_tails[idx]) {
> + pfn = PHYS_PFN(virt_to_phys(p));
> + NODE_DATA(node)->vmemmap_tails[idx] = pfn;
> + tail = NULL;
> + } else {
> + pfn = NODE_DATA(node)->vmemmap_tails[idx];
> + }
> + spin_unlock(&hugetlb_lock);
> +
> + if (tail)
> + __free_page(tail);
> +
> + return pfn_to_page(pfn);
> +}
> +
> static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
> struct folio *folio,
> struct list_head *vmemmap_pages,
> @@ -532,6 +556,12 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
> if (!vmemmap_should_optimize_folio(h, folio))
> return ret;
>
> + nid = folio_nid(folio);
> +
Do not add a blank line here.
> + vmemmap_tail = vmemmap_get_tail(h->order, nid);
> + if (!vmemmap_tail)
> + return -ENOMEM;
> +
> static_branch_inc(&hugetlb_optimize_vmemmap_key);
>
> if (flags & VMEMMAP_SYNCHRONIZE_RCU)
> @@ -549,7 +579,6 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
> */
> folio_set_hugetlb_vmemmap_optimized(folio);
>
> - nid = folio_nid(folio);
> vmemmap_head = alloc_pages_node(nid, GFP_KERNEL, 0);
>
> if (!vmemmap_head) {
> @@ -561,7 +590,6 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
> list_add(&vmemmap_head->lru, vmemmap_pages);
> memmap_pages_add(1);
>
> - vmemmap_tail = vmemmap_head;
> vmemmap_start = (unsigned long)&folio->page;
> vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
>
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index dbd8daccade2..94b4e90fa00f 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -378,16 +378,45 @@ void vmemmap_wrprotect_hvo(unsigned long addr, unsigned long end,
> }
> }
>
> -/*
> - * Populate vmemmap pages HVO-style. The first page contains the head
> - * page and needed tail pages, the other ones are mirrors of the first
> - * page.
> - */
> +static __meminit unsigned long vmemmap_get_tail(unsigned int order, int node)
> +{
> + unsigned long pfn;
> + unsigned int idx;
> + struct page *p;
> +
> + BUG_ON(order < VMEMMAP_TAIL_MIN_ORDER);
> + BUG_ON(order > MAX_FOLIO_ORDER);
> +
> + idx = order - VMEMMAP_TAIL_MIN_ORDER;
> + pfn = NODE_DATA(node)->vmemmap_tails[idx];
> + if (pfn)
> + return pfn;
> +
> + p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
> + if (!p)
> + return 0;
> +
> + for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
> + prep_compound_tail(p + i, NULL, order);
> +
> + pfn = PHYS_PFN(virt_to_phys(p));
> + NODE_DATA(node)->vmemmap_tails[idx] = pfn;
> +
> + return pfn;
> +}
> +
> int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
> int node, unsigned long headsize)
> {
> + unsigned long maddr, len, tail_pfn;
> + unsigned int order;
> pte_t *pte;
> - unsigned long maddr;
> +
> + len = end - addr;
> + order = ilog2(len * sizeof(struct page) / PAGE_SIZE);
> + tail_pfn = vmemmap_get_tail(order, node);
> + if (!tail_pfn)
> + return -ENOMEM;
>
> for (maddr = addr; maddr < addr + headsize; maddr += PAGE_SIZE) {
> pte = vmemmap_populate_address(maddr, node, NULL, -1, 0);
> @@ -398,8 +427,7 @@ int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
> /*
> * Reuse the last page struct page mapped above for the rest.
> */
> - return vmemmap_populate_range(maddr, end, node, NULL,
> - pte_pfn(ptep_get(pte)), 0);
> + return vmemmap_populate_range(maddr, end, node, NULL, tail_pfn, 0);
> }
>
> void __weak __meminit vmemmap_set_pmd(pmd_t *pmd, void *p, int node,