From: Muchun Song <songmuchun@bytedance.com>
To: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Muchun Song <muchun.song@linux.dev>,
Oscar Salvador <osalvador@suse.de>,
Michael Ellerman <mpe@ellerman.id.au>,
Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Nicholas Piggin <npiggin@gmail.com>,
Christophe Leroy <chleroy@kernel.org>,
aneesh.kumar@linux.ibm.com, joao.m.martins@oracle.com,
linux-mm@kvack.org, linuxppc-dev@lists.ozlabs.org,
linux-kernel@vger.kernel.org,
Muchun Song <songmuchun@bytedance.com>
Subject: [PATCH 37/49] mm/sparse-vmemmap: unify DAX and HugeTLB vmemmap optimization
Date: Sun, 5 Apr 2026 20:52:28 +0800
Message-ID: <20260405125240.2558577-38-songmuchun@bytedance.com>
In-Reply-To: <20260405125240.2558577-1-songmuchun@bytedance.com>
The ultimate goal of this refactoring series is to unify the vmemmap
optimization logic for both DAX and HugeTLB under a common framework
(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION).
A key breakthrough in this unification is that DAX now requires only 1
vmemmap page to be preserved (the head page), aligning its requirements
exactly with HugeTLB. Previously, DAX optimization relied on a dedicated
upper-level function, vmemmap_populate_compound_pages(), which manually
allocated both the head page and the first tail page before reusing the
shared tail page for the rest.
Because DAX and HugeTLB are now perfectly aligned in their optimization
requirements (1 preserved page + reused shared tail pages), this patch
eliminates the dedicated compound page mapping loop entirely and instead
pushes the optimization decision down to the lowest level,
vmemmap_pte_populate(). All mapping requests now flow through the
standard vmemmap_populate_basepages().
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
arch/powerpc/mm/book3s64/radix_pgtable.c | 13 +-
include/linux/mm.h | 2 +-
mm/mm_init.c | 2 +-
mm/sparse-vmemmap.c | 185 +++++------------------
4 files changed, 40 insertions(+), 162 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 5ce3deb464d5..714d5cdc10ec 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1326,17 +1326,8 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
return -ENOMEM;
vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
- /*
- * Populate the tail pages vmemmap page
- * It can fall in different pmd, hence
- * vmemmap_populate_address()
- */
- pte = radix__vmemmap_populate_address(addr + PAGE_SIZE, node, NULL, NULL);
- if (!pte)
- return -ENOMEM;
-
- addr_pfn += 2;
- next = addr + 2 * PAGE_SIZE;
+ addr_pfn += 1;
+ next = addr + PAGE_SIZE;
continue;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 15841829b7eb..bceef0dc578b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4912,7 +4912,7 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
}
#endif
-#define VMEMMAP_RESERVE_NR 2
+#define VMEMMAP_RESERVE_NR OPTIMIZED_FOLIO_VMEMMAP_PAGES
#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
struct dev_pagemap *pgmap)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 636a0f9644f6..6b23b5f02544 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1066,7 +1066,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
* initialize is a lot smaller that the total amount of struct pages being
* mapped. This is a paired / mild layering violation with explicit knowledge
* of how the sparse_vmemmap internals handle compound pages in the lack
- * of an altmap. See vmemmap_populate_compound_pages().
+ * of an altmap.
*/
static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
struct dev_pagemap *pgmap,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 1867b5dcc73c..fd7b0e1e5aba 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -152,46 +152,40 @@ static pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, in
struct vmem_altmap *altmap,
unsigned long ptpfn)
{
- pte_t *pte = pte_offset_kernel(pmd, addr);
-
- if (pte_none(ptep_get(pte))) {
- pte_t entry;
-
- if (vmemmap_page_optimizable((struct page *)addr) &&
- ptpfn == (unsigned long)-1) {
- struct page *page;
- unsigned long pfn = page_to_pfn((struct page *)addr);
- const struct mem_section *ms = __pfn_to_section(pfn);
-
- page = vmemmap_shared_tail_page(section_order(ms),
- section_to_zone(ms, node));
- if (!page)
- return NULL;
- ptpfn = page_to_pfn(page);
- }
+ pte_t entry, *pte = pte_offset_kernel(pmd, addr);
- if (ptpfn == (unsigned long)-1) {
- void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-
- if (!p)
- return NULL;
- ptpfn = PHYS_PFN(__pa(p));
- } else {
- /*
- * When a PTE/PMD entry is freed from the init_mm
- * there's a free_pages() call to this page allocated
- * above. Thus this get_page() is paired with the
- * put_page_testzero() on the freeing path.
- * This can only called by certain ZONE_DEVICE path,
- * and through vmemmap_populate_compound_pages() when
- * slab is available.
- */
- if (slab_is_available())
- get_page(pfn_to_page(ptpfn));
- }
- entry = pfn_pte(ptpfn, PAGE_KERNEL);
- set_pte_at(&init_mm, addr, pte, entry);
+ if (!pte_none(ptep_get(pte)))
+ return pte;
+
+ /* See layout diagram in Documentation/mm/vmemmap_dedup.rst. */
+ if (vmemmap_page_optimizable((struct page *)addr)) {
+ struct page *page;
+ unsigned long pfn = page_to_pfn((struct page *)addr);
+ const struct mem_section *ms = __pfn_to_section(pfn);
+
+ page = vmemmap_shared_tail_page(section_order(ms),
+ section_to_zone(ms, node));
+ if (!page)
+ return NULL;
+
+ /*
+ * When a PTE entry is freed, a free_pages() call occurs. This
+ * get_page() pairs with put_page_testzero() on the freeing
+ * path. This can only occur when slab is available.
+ */
+ if (slab_is_available())
+ get_page(page);
+ ptpfn = page_to_pfn(page);
+ } else {
+ void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+
+ if (!p)
+ return NULL;
+ ptpfn = PHYS_PFN(__pa(p));
}
+ entry = pfn_pte(ptpfn, PAGE_KERNEL);
+ set_pte_at(&init_mm, addr, pte, entry);
+
return pte;
}
@@ -287,17 +281,15 @@ static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
return pte;
}
-static int __meminit vmemmap_populate_range(unsigned long start,
- unsigned long end, int node,
- struct vmem_altmap *altmap,
- unsigned long ptpfn)
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+ int node, struct vmem_altmap *altmap,
+ struct dev_pagemap *pgmap)
{
unsigned long addr = start;
pte_t *pte;
for (; addr < end; addr += PAGE_SIZE) {
- pte = vmemmap_populate_address(addr, node, altmap,
- ptpfn);
+ pte = vmemmap_populate_address(addr, node, altmap, -1);
if (!pte)
return -ENOMEM;
}
@@ -305,19 +297,6 @@ static int __meminit vmemmap_populate_range(unsigned long start,
return 0;
}
-static int __meminit vmemmap_populate_compound_pages(unsigned long start,
- unsigned long end, int node,
- struct dev_pagemap *pgmap);
-
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
- int node, struct vmem_altmap *altmap,
- struct dev_pagemap *pgmap)
-{
- if (vmemmap_can_optimize(altmap, pgmap))
- return vmemmap_populate_compound_pages(start, end, node, pgmap);
- return vmemmap_populate_range(start, end, node, altmap, -1);
-}
-
/*
* Write protect the mirrored tail page structs for HVO. This will be
* called from the hugetlb code when gathering and initializing the
@@ -397,9 +376,6 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
pud_t *pud;
pmd_t *pmd;
- if (vmemmap_can_optimize(altmap, pgmap))
- return vmemmap_populate_compound_pages(start, end, node, pgmap);
-
for (addr = start; addr < end; addr = next) {
unsigned long pfn = page_to_pfn((struct page *)addr);
const struct mem_section *ms = __pfn_to_section(pfn);
@@ -447,95 +423,6 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
return 0;
}
-/*
- * For compound pages bigger than section size (e.g. x86 1G compound
- * pages with 2M subsection size) fill the rest of sections as tail
- * pages.
- *
- * Note that memremap_pages() resets @nr_range value and will increment
- * it after each range successful onlining. Thus the value or @nr_range
- * at section memmap populate corresponds to the in-progress range
- * being onlined here.
- */
-static bool __meminit reuse_compound_section(unsigned long start_pfn,
- struct dev_pagemap *pgmap)
-{
- unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
- unsigned long offset = start_pfn -
- PHYS_PFN(pgmap->ranges[pgmap->nr_range].start);
-
- return !IS_ALIGNED(offset, nr_pages) && nr_pages > PAGES_PER_SUBSECTION;
-}
-
-static int __meminit vmemmap_populate_compound_pages(unsigned long start,
- unsigned long end, int node,
- struct dev_pagemap *pgmap)
-{
- unsigned long size, addr;
- pte_t *pte;
- int rc;
- unsigned long start_pfn = page_to_pfn((struct page *)start);
- const struct mem_section *ms = __pfn_to_section(start_pfn);
- struct page *tail;
-
- /* This may occur in sub-section scenarios. */
- if (!section_vmemmap_optimizable(ms))
- return vmemmap_populate_range(start, end, node, NULL, -1);
-
- tail = vmemmap_shared_tail_page(section_order(ms),
- section_to_zone(ms, node));
- if (!tail)
- return -ENOMEM;
-
- if (reuse_compound_section(start_pfn, pgmap))
- return vmemmap_populate_range(start, end, node, NULL,
- page_to_pfn(tail));
-
- size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page));
- for (addr = start; addr < end; addr += size) {
- unsigned long next, last = addr + size;
- void *p;
-
- /* Populate the head page vmemmap page */
- pte = vmemmap_populate_address(addr, node, NULL, -1);
- if (!pte)
- return -ENOMEM;
-
- /*
- * Allocate manually since vmemmap_populate_address() will assume DAX
- * only needs 1 vmemmap page to be reserved, however DAX now needs 2
- * vmemmap pages. This is a temporary solution and will be unified
- * with HugeTLB in the future.
- */
- p = vmemmap_alloc_block_buf(PAGE_SIZE, node, NULL);
- if (!p)
- return -ENOMEM;
-
- /* Populate the tail pages vmemmap page */
- next = addr + PAGE_SIZE;
- pte = vmemmap_populate_address(next, node, NULL, PHYS_PFN(__pa(p)));
- /*
- * get_page() is called above. Since we are not actually
- * reusing it, to avoid a memory leak, we call put_page() here.
- */
- put_page(virt_to_page(p));
- if (!pte)
- return -ENOMEM;
-
- /*
- * Reuse the shared vmemmap page for the rest of tail pages
- * See layout diagram in Documentation/mm/vmemmap_dedup.rst
- */
- next += PAGE_SIZE;
- rc = vmemmap_populate_range(next, last, node, NULL,
- page_to_pfn(tail));
- if (rc)
- return -ENOMEM;
- }
-
- return 0;
-}
-
struct page * __meminit __populate_section_memmap(unsigned long pfn,
unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
struct dev_pagemap *pgmap)
--
2.20.1