* [PATCHv2 00/14]
@ 2025-12-18 15:09 Kiryl Shutsemau
  2025-12-18 15:09 ` [PATCHv2 01/14] mm: Move MAX_FOLIO_ORDER definition to mmzone.h Kiryl Shutsemau
                   ` (14 more replies)
  0 siblings, 15 replies; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

This series removes "fake head pages" from the HugeTLB vmemmap
optimization (HVO) by changing how tail pages encode their relationship
to the head page.

It simplifies compound_head() and page_ref_add_unless(), both of which
sit on hot paths.

Background
==========

HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
and remapping the freed virtual addresses to a single physical page.
Previously, all tail page vmemmap entries were remapped to the first
vmemmap page (containing the head struct page), creating "fake heads" -
tail pages that appear to have PG_head set when accessed through the
deduplicated vmemmap.

This required special handling in compound_head() to detect and work
around fake heads, adding complexity and overhead to a very hot path.
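
For scale: in the common x86-64 configuration (4 KiB base pages, 64-byte
struct page), a 2 MiB huge page has 512 struct pages spanning 8 vmemmap
pages, of which HVO keeps one and frees seven; for a 1 GiB huge page it
frees 4095 of 4096.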

New Approach
============

For architectures and configurations where sizeof(struct page) is a
power of 2 (the common case), this series changes how the position of
the head page is encoded in tail pages.

Instead of storing a pointer to the head page, the ->compound_info
field (renamed from ->compound_head) now stores a mask. Applying the
mask to any tail page's virtual address yields the address of the head
page.

The key insight is that all tail pages of the same order now have
identical compound_info values, regardless of which compound page they
belong to. This allows a single page of tail struct pages to be shared
across all huge pages of the same order on a NUMA node.
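
To make this concrete, here is a small standalone userspace sketch of
the idea. The names (struct tpage, tail_mask()) and the toy memmap are
illustrative only; the actual kernel implementation is in patch 06:

/*
 * Toy model of the mask encoding: sizeof(struct tpage) is a power of 2
 * and the fake memmap is aligned to the metadata span of the compound
 * page, so clearing the low bits of any tail's address gives the head.
 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct tpage {				/* stand-in for struct page */
	unsigned long compound_info;	/* bit 0 set: this is a tail */
	unsigned long pad[7];		/* pad to 64 bytes, a power of 2 */
};

static unsigned long tail_mask(unsigned int order)
{
	/* Clear the low 'order + log2(sizeof(struct tpage))' bits. */
	return ~0UL << (order + __builtin_ctzl(sizeof(struct tpage)));
}

int main(void)
{
	unsigned int order = 9;		/* 512 base pages, e.g. 2M on x86-64 */
	size_t nr = 1UL << order;
	struct tpage *map, *tail, *head;

	/* Align the toy memmap so the head lands on a mask boundary. */
	if (posix_memalign((void **)&map, nr * sizeof(*map), nr * sizeof(*map)))
		return 1;

	/* Every tail of this order gets the very same value. */
	for (size_t i = 1; i < nr; i++)
		map[i].compound_info = tail_mask(order) | 1;

	/* Decode: mask the tail's own address, no pointer dereference. */
	tail = &map[123];
	head = (struct tpage *)((unsigned long)tail & tail->compound_info);

	assert(head == &map[0]);
	printf("mask=%#lx, head=%p\n", tail_mask(order), (void *)head);
	return 0;
}

The posix_memalign() call plays the role of the memmap alignment that
patch 02 starts checking for.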

Benefits
========

1. Simplified compound_head(): no fake head detection is needed, and it
   can be implemented in a branchless manner (see the sketch after this
   list).

2. Simplified page_ref_add_unless(): RCU protection removed since there's
   no race with fake head remapping.

3. Cleaner architecture: The shared tail pages are truly read-only and
   contain valid tail page metadata.
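
For reference, here is one way the decode can become branchless once
fake heads are gone. This is only a sketch under the power-of-2 mask
encoding; the helper name is made up and the exact form in patch 13 may
differ:

static __always_inline unsigned long compound_head_sketch(const struct page *page)
{
	unsigned long info = READ_ONCE(page->compound_info);

	/*
	 * For a non-tail page bit 0 is clear, so (info & 1) - 1 is all
	 * ones and the AND below keeps @page unchanged.  For a tail
	 * page the expression reduces to @info, the mask that yields
	 * the address of the head page.
	 */
	return (unsigned long)page & (info | ((info & 1) - 1));
}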

If sizeof(struct page) is not a power of 2, there are no functional
changes. HVO is not supported in that configuration.

I had hoped to see a performance improvement, but my testing thus far
has shown either no change or only a slight improvement within the
noise.

Series Organization
===================

Patches 1-2: Preparation - move MAX_FOLIO_ORDER, add alignment check
Patches 3-5: Refactoring - interface changes, field rename, code movement
Patch 6: Core change - new mask-based compound_head() encoding
Patch 7: Correctness fix - page_zonenum() must use head page
Patch 8: Refactor vmemmap_walk for new design
Patch 9: Eliminate fake heads with shared tail pages
Patches 10-13: Cleanup - remove fake head infrastructure
Patch 14: Documentation update

Changes in v2:
==============

- Handle boot-allocated huge pages correctly. (Frank)

- Changed from per-hstate vmemmap_tail to per-node vmemmap_tails[] array
  in pglist_data. (Muchun)

- Added spin_lock(&hugetlb_lock) protection in vmemmap_get_tail() to fix
  a race condition where two threads could both allocate tail pages.
  The losing thread now properly frees its allocated page (see the
  sketch after this list). (Usama)

- Add warning if memmap is not aligned to MAX_FOLIO_SIZE, which is
  required for the mask approach. (Muchun)

- Make page_zonenum() use head page - correctness fix since shared
  tail pages cannot have valid zone information. (Muchun)

- Added 'const' qualifier to head parameter in set_compound_head() and
  prep_compound_tail(). (Usama)

- Updated commit messages.
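
For reference, a rough sketch of how the per-node tail cache with the
locking described above could look. This is an illustration only, not
the code from patch 09; the GFP flags, array sizing, initialization of
the tail page and error handling are assumptions:

static struct page *vmemmap_get_tail(int nid, unsigned int order)
{
	struct pglist_data *pgdat = NODE_DATA(nid);
	struct page *page, *tail;

	tail = READ_ONCE(pgdat->vmemmap_tails[order]);
	if (tail)
		return tail;

	page = alloc_pages_node(nid, GFP_KERNEL, 0);
	if (!page)
		return NULL;

	/* ... fill the page with identical tail struct pages here ... */

	spin_lock(&hugetlb_lock);
	tail = pgdat->vmemmap_tails[order];
	if (!tail) {
		pgdat->vmemmap_tails[order] = page;
		tail = page;
		page = NULL;
	}
	spin_unlock(&hugetlb_lock);

	/* Lost the race: another thread installed its page first. */
	if (page)
		__free_page(page);

	return tail;
}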

Kiryl Shutsemau (14):
  mm: Move MAX_FOLIO_ORDER definition to mmzone.h
  mm/sparse: Check memmap alignment
  mm: Change the interface of prep_compound_tail()
  mm: Rename the 'compound_head' field in the 'struct page' to
    'compound_info'
  mm: Move set/clear_compound_head() next to compound_head()
  mm: Rework compound_head() for power-of-2 sizeof(struct page)
  mm: Make page_zonenum() use head page
  mm/hugetlb: Refactor code around vmemmap_walk
  mm/hugetlb: Remove fake head pages
  mm: Drop fake head checks
  hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU
  mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key
  mm: Remove the branch from compound_head()
  hugetlb: Update vmemmap_dedup.rst

 .../admin-guide/kdump/vmcoreinfo.rst          |   2 +-
 Documentation/mm/vmemmap_dedup.rst            |  62 ++--
 include/linux/mm.h                            |  31 --
 include/linux/mm_types.h                      |  20 +-
 include/linux/mmzone.h                        |  47 +++
 include/linux/page-flags.h                    | 163 ++++-------
 include/linux/page_ref.h                      |   8 +-
 include/linux/types.h                         |   2 +-
 kernel/vmcore_info.c                          |   2 +-
 mm/hugetlb.c                                  |   8 +-
 mm/hugetlb_vmemmap.c                          | 270 +++++++++---------
 mm/internal.h                                 |  12 +-
 mm/mm_init.c                                  |   2 +-
 mm/page_alloc.c                               |   4 +-
 mm/slab.h                                     |   2 +-
 mm/sparse-vmemmap.c                           |  44 ++-
 mm/sparse.c                                   |   3 +
 mm/util.c                                     |  16 +-
 18 files changed, 345 insertions(+), 353 deletions(-)

-- 
2.51.2




* [PATCHv2 01/14] mm: Move MAX_FOLIO_ORDER definition to mmzone.h
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-18 15:09 ` [PATCHv2 02/14] mm/sparse: Check memmap alignment Kiryl Shutsemau
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

Move MAX_FOLIO_ORDER definition from mm.h to mmzone.h.

This is preparation for adding the vmemmap_tails array to struct
pglist_data, which requires MAX_FOLIO_ORDER to be available in mmzone.h.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 include/linux/mm.h     | 31 -------------------------------
 include/linux/mmzone.h | 31 +++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7c79b3369b82..2c409f583569 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -26,7 +26,6 @@
 #include <linux/page-flags.h>
 #include <linux/page_ref.h>
 #include <linux/overflow.h>
-#include <linux/sizes.h>
 #include <linux/sched.h>
 #include <linux/pgtable.h>
 #include <linux/kasan.h>
@@ -2074,36 +2073,6 @@ static inline unsigned long folio_nr_pages(const struct folio *folio)
 	return folio_large_nr_pages(folio);
 }
 
-#if !defined(CONFIG_HAVE_GIGANTIC_FOLIOS)
-/*
- * We don't expect any folios that exceed buddy sizes (and consequently
- * memory sections).
- */
-#define MAX_FOLIO_ORDER		MAX_PAGE_ORDER
-#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
-/*
- * Only pages within a single memory section are guaranteed to be
- * contiguous. By limiting folios to a single memory section, all folio
- * pages are guaranteed to be contiguous.
- */
-#define MAX_FOLIO_ORDER		PFN_SECTION_SHIFT
-#elif defined(CONFIG_HUGETLB_PAGE)
-/*
- * There is no real limit on the folio size. We limit them to the maximum we
- * currently expect (see CONFIG_HAVE_GIGANTIC_FOLIOS): with hugetlb, we expect
- * no folios larger than 16 GiB on 64bit and 1 GiB on 32bit.
- */
-#define MAX_FOLIO_ORDER		get_order(IS_ENABLED(CONFIG_64BIT) ? SZ_16G : SZ_1G)
-#else
-/*
- * Without hugetlb, gigantic folios that are bigger than a single PUD are
- * currently impossible.
- */
-#define MAX_FOLIO_ORDER		PUD_ORDER
-#endif
-
-#define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)
-
 /*
  * compound_nr() returns the number of pages in this potentially compound
  * page.  compound_nr() can be called on a tail page, and is defined to
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7fb7331c5725..6cfede39570a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -23,6 +23,7 @@
 #include <linux/page-flags.h>
 #include <linux/local_lock.h>
 #include <linux/zswap.h>
+#include <linux/sizes.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -61,6 +62,36 @@
  */
 #define PAGE_ALLOC_COSTLY_ORDER 3
 
+#if !defined(CONFIG_HAVE_GIGANTIC_FOLIOS)
+/*
+ * We don't expect any folios that exceed buddy sizes (and consequently
+ * memory sections).
+ */
+#define MAX_FOLIO_ORDER		MAX_PAGE_ORDER
+#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+/*
+ * Only pages within a single memory section are guaranteed to be
+ * contiguous. By limiting folios to a single memory section, all folio
+ * pages are guaranteed to be contiguous.
+ */
+#define MAX_FOLIO_ORDER		PFN_SECTION_SHIFT
+#elif defined(CONFIG_HUGETLB_PAGE)
+/*
+ * There is no real limit on the folio size. We limit them to the maximum we
+ * currently expect (see CONFIG_HAVE_GIGANTIC_FOLIOS): with hugetlb, we expect
+ * no folios larger than 16 GiB on 64bit and 1 GiB on 32bit.
+ */
+#define MAX_FOLIO_ORDER		get_order(IS_ENABLED(CONFIG_64BIT) ? SZ_16G : SZ_1G)
+#else
+/*
+ * Without hugetlb, gigantic folios that are bigger than a single PUD are
+ * currently impossible.
+ */
+#define MAX_FOLIO_ORDER		PUD_ORDER
+#endif
+
+#define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)
+
 enum migratetype {
 	MIGRATE_UNMOVABLE,
 	MIGRATE_MOVABLE,
-- 
2.51.2




* [PATCHv2 02/14] mm/sparse: Check memmap alignment
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
  2025-12-18 15:09 ` [PATCHv2 01/14] mm: Move MAX_FOLIO_ORDER definition to mmzone.h Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-22  8:34   ` Muchun Song
  2025-12-18 15:09 ` [PATCHv2 03/14] mm: Change the interface of prep_compound_tail() Kiryl Shutsemau
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

The upcoming changes in compound_head() require memmap to be naturally
aligned to the maximum folio size.

Add a warning if it is not.

A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
kernel is still likely to be functional if this strict check fails.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 include/linux/mmzone.h | 1 +
 mm/sparse.c            | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6cfede39570a..9f44dc760cdc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -91,6 +91,7 @@
 #endif
 
 #define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)
+#define MAX_FOLIO_SIZE		(PAGE_SIZE << MAX_FOLIO_ORDER)
 
 enum migratetype {
 	MIGRATE_UNMOVABLE,
diff --git a/mm/sparse.c b/mm/sparse.c
index 17c50a6415c2..c5810ff7c6f7 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -600,6 +600,9 @@ void __init sparse_init(void)
 	BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section)));
 	memblocks_present();
 
+	WARN_ON(!IS_ALIGNED((unsigned long)pfn_to_page(0),
+			    MAX_FOLIO_SIZE / sizeof(struct page)));
+
 	pnum_begin = first_present_section_nr();
 	nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
 
-- 
2.51.2




* [PATCHv2 03/14] mm: Change the interface of prep_compound_tail()
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
  2025-12-18 15:09 ` [PATCHv2 01/14] mm: Move MAX_FOLIO_ORDER definition to mmzone.h Kiryl Shutsemau
  2025-12-18 15:09 ` [PATCHv2 02/14] mm/sparse: Check memmap alignment Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-22  2:55   ` Muchun Song
  2025-12-18 15:09 ` [PATCHv2 04/14] mm: Rename the 'compound_head' field in the 'struct page' to 'compound_info' Kiryl Shutsemau
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

Instead of passing down the head page and tail page index, pass the tail
and head pages directly, as well as the order of the compound page.

This is preparation for changing how the head position is encoded in
the tail page.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 include/linux/page-flags.h |  4 +++-
 mm/hugetlb.c               |  8 +++++---
 mm/internal.h              | 12 ++++++------
 mm/mm_init.c               |  2 +-
 mm/page_alloc.c            |  2 +-
 5 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0091ad1986bf..d4952573a4af 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -865,7 +865,9 @@ static inline bool folio_test_large(const struct folio *folio)
 	return folio_test_head(folio);
 }
 
-static __always_inline void set_compound_head(struct page *page, struct page *head)
+static __always_inline void set_compound_head(struct page *page,
+					      const struct page *head,
+					      unsigned int order)
 {
 	WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0455119716ec..a55d638975bd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3212,6 +3212,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 
 /* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
 static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
+					struct hstate *h,
 					unsigned long start_page_number,
 					unsigned long end_page_number)
 {
@@ -3220,6 +3221,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
 	struct page *page = folio_page(folio, start_page_number);
 	unsigned long head_pfn = folio_pfn(folio);
 	unsigned long pfn, end_pfn = head_pfn + end_page_number;
+	unsigned int order = huge_page_order(h);
 
 	/*
 	 * As we marked all tail pages with memblock_reserved_mark_noinit(),
@@ -3227,7 +3229,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
 	 */
 	for (pfn = head_pfn + start_page_number; pfn < end_pfn; page++, pfn++) {
 		__init_single_page(page, pfn, zone, nid);
-		prep_compound_tail((struct page *)folio, pfn - head_pfn);
+		prep_compound_tail(page, &folio->page, order);
 		set_page_count(page, 0);
 	}
 }
@@ -3247,7 +3249,7 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
 	__folio_set_head(folio);
 	ret = folio_ref_freeze(folio, 1);
 	VM_BUG_ON(!ret);
-	hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages);
+	hugetlb_folio_init_tail_vmemmap(folio, h, 1, nr_pages);
 	prep_compound_head((struct page *)folio, huge_page_order(h));
 }
 
@@ -3304,7 +3306,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 			 * time as this is early in boot and there should
 			 * be no contention.
 			 */
-			hugetlb_folio_init_tail_vmemmap(folio,
+			hugetlb_folio_init_tail_vmemmap(folio, h,
 					HUGETLB_VMEMMAP_RESERVE_PAGES,
 					pages_per_huge_page(h));
 		}
diff --git a/mm/internal.h b/mm/internal.h
index 1561fc2ff5b8..f385370256b9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -810,13 +810,13 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
 		INIT_LIST_HEAD(&folio->_deferred_list);
 }
 
-static inline void prep_compound_tail(struct page *head, int tail_idx)
+static inline void prep_compound_tail(struct page *tail,
+				      const struct page *head,
+				      unsigned int order)
 {
-	struct page *p = head + tail_idx;
-
-	p->mapping = TAIL_MAPPING;
-	set_compound_head(p, head);
-	set_page_private(p, 0);
+	tail->mapping = TAIL_MAPPING;
+	set_compound_head(tail, head, order);
+	set_page_private(tail, 0);
 }
 
 void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7712d887b696..87d1e0277318 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1102,7 +1102,7 @@ static void __ref memmap_init_compound(struct page *head,
 		struct page *page = pfn_to_page(pfn);
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
-		prep_compound_tail(head, pfn - head_pfn);
+		prep_compound_tail(page, head, order);
 		set_page_count(page, 0);
 	}
 	prep_compound_head(head, order);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ed82ee55e66a..fe77c00c99df 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -717,7 +717,7 @@ void prep_compound_page(struct page *page, unsigned int order)
 
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++)
-		prep_compound_tail(page, i);
+		prep_compound_tail(page + i, page, order);
 
 	prep_compound_head(page, order);
 }
-- 
2.51.2




* [PATCHv2 04/14] mm: Rename the 'compound_head' field in the 'struct page' to 'compound_info'
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
                   ` (2 preceding siblings ...)
  2025-12-18 15:09 ` [PATCHv2 03/14] mm: Change the interface of prep_compound_tail() Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-22  3:00   ` Muchun Song
  2025-12-18 15:09 ` [PATCHv2 05/14] mm: Move set/clear_compound_head() next to compound_head() Kiryl Shutsemau
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

The 'compound_head' field in the 'struct page' encodes whether the page
is a tail and where to locate the head page. Bit 0 is set if the page is
a tail, and the remaining bits in the field point to the head page.

As preparation for changing how the field encodes information about the
head page, rename the field to 'compound_info'.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 .../admin-guide/kdump/vmcoreinfo.rst          |  2 +-
 Documentation/mm/vmemmap_dedup.rst            |  6 +++---
 include/linux/mm_types.h                      | 20 +++++++++----------
 include/linux/page-flags.h                    | 18 ++++++++---------
 include/linux/types.h                         |  2 +-
 kernel/vmcore_info.c                          |  2 +-
 mm/page_alloc.c                               |  2 +-
 mm/slab.h                                     |  2 +-
 mm/util.c                                     |  2 +-
 9 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index 404a15f6782c..7663c610fe90 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -141,7 +141,7 @@ nodemask_t
 The size of a nodemask_t type. Used to compute the number of online
 nodes.
 
-(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_head)
+(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_info)
 ----------------------------------------------------------------------------------
 
 User-space tools compute their values based on the offset of these
diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
index b4a55b6569fa..1863d88d2dcb 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -24,7 +24,7 @@ For each base page, there is a corresponding ``struct page``.
 Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to
 contain unique information about a HugeTLB page. ``__NR_USED_SUBPAGE`` provides
 this upper limit. The only 'useful' information in the remaining ``struct page``
-is the compound_head field, and this field is the same for all tail pages.
+is the compound_info field, and this field is the same for all tail pages.
 
 By removing redundant ``struct page`` for HugeTLB pages, memory can be returned
 to the buddy allocator for other uses.
@@ -124,10 +124,10 @@ Here is how things look before optimization::
  |           |
  +-----------+
 
-The value of page->compound_head is the same for all tail pages. The first
+The value of page->compound_info is the same for all tail pages. The first
 page of ``struct page`` (page 0) associated with the HugeTLB page contains the 4
 ``struct page`` necessary to describe the HugeTLB. The only use of the remaining
-pages of ``struct page`` (page 1 to page 7) is to point to page->compound_head.
+pages of ``struct page`` (page 1 to page 7) is to point to page->compound_info.
 Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of ``struct page``
 will be used for each HugeTLB page. This will allow us to free the remaining
 7 pages to the buddy allocator.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 90e5790c318f..a94683272869 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -125,14 +125,14 @@ struct page {
 			atomic_long_t pp_ref_count;
 		};
 		struct {	/* Tail pages of compound page */
-			unsigned long compound_head;	/* Bit zero is set */
+			unsigned long compound_info;	/* Bit zero is set */
 		};
 		struct {	/* ZONE_DEVICE pages */
 			/*
-			 * The first word is used for compound_head or folio
+			 * The first word is used for compound_info or folio
 			 * pgmap
 			 */
-			void *_unused_pgmap_compound_head;
+			void *_unused_pgmap_compound_info;
 			void *zone_device_data;
 			/*
 			 * ZONE_DEVICE private pages are counted as being
@@ -383,7 +383,7 @@ struct folio {
 	/* private: avoid cluttering the output */
 				/* For the Unevictable "LRU list" slot */
 				struct {
-					/* Avoid compound_head */
+					/* Avoid compound_info */
 					void *__filler;
 	/* public: */
 					unsigned int mlock_count;
@@ -484,7 +484,7 @@ struct folio {
 FOLIO_MATCH(flags, flags);
 FOLIO_MATCH(lru, lru);
 FOLIO_MATCH(mapping, mapping);
-FOLIO_MATCH(compound_head, lru);
+FOLIO_MATCH(compound_info, lru);
 FOLIO_MATCH(__folio_index, index);
 FOLIO_MATCH(private, private);
 FOLIO_MATCH(_mapcount, _mapcount);
@@ -503,7 +503,7 @@ FOLIO_MATCH(_last_cpupid, _last_cpupid);
 	static_assert(offsetof(struct folio, fl) ==			\
 			offsetof(struct page, pg) + sizeof(struct page))
 FOLIO_MATCH(flags, _flags_1);
-FOLIO_MATCH(compound_head, _head_1);
+FOLIO_MATCH(compound_info, _head_1);
 FOLIO_MATCH(_mapcount, _mapcount_1);
 FOLIO_MATCH(_refcount, _refcount_1);
 #undef FOLIO_MATCH
@@ -511,13 +511,13 @@ FOLIO_MATCH(_refcount, _refcount_1);
 	static_assert(offsetof(struct folio, fl) ==			\
 			offsetof(struct page, pg) + 2 * sizeof(struct page))
 FOLIO_MATCH(flags, _flags_2);
-FOLIO_MATCH(compound_head, _head_2);
+FOLIO_MATCH(compound_info, _head_2);
 #undef FOLIO_MATCH
 #define FOLIO_MATCH(pg, fl)						\
 	static_assert(offsetof(struct folio, fl) ==			\
 			offsetof(struct page, pg) + 3 * sizeof(struct page))
 FOLIO_MATCH(flags, _flags_3);
-FOLIO_MATCH(compound_head, _head_3);
+FOLIO_MATCH(compound_info, _head_3);
 #undef FOLIO_MATCH
 
 /**
@@ -583,8 +583,8 @@ struct ptdesc {
 #define TABLE_MATCH(pg, pt)						\
 	static_assert(offsetof(struct page, pg) == offsetof(struct ptdesc, pt))
 TABLE_MATCH(flags, pt_flags);
-TABLE_MATCH(compound_head, pt_list);
-TABLE_MATCH(compound_head, _pt_pad_1);
+TABLE_MATCH(compound_info, pt_list);
+TABLE_MATCH(compound_info, _pt_pad_1);
 TABLE_MATCH(mapping, __page_mapping);
 TABLE_MATCH(__folio_index, pt_index);
 TABLE_MATCH(rcu_head, pt_rcu_head);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d4952573a4af..72c933a43b6a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -213,7 +213,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
 	/*
 	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
 	 * struct page. The alignment check aims to avoid access the fields (
-	 * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
+	 * e.g. compound_info) of the @page[1]. It can avoid touch a (possibly)
 	 * cold cacheline in some cases.
 	 */
 	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
@@ -223,7 +223,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
 		 * because the @page is a compound page composed with at least
 		 * two contiguous pages.
 		 */
-		unsigned long head = READ_ONCE(page[1].compound_head);
+		unsigned long head = READ_ONCE(page[1].compound_info);
 
 		if (likely(head & 1))
 			return (const struct page *)(head - 1);
@@ -281,7 +281,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
 
 static __always_inline unsigned long _compound_head(const struct page *page)
 {
-	unsigned long head = READ_ONCE(page->compound_head);
+	unsigned long head = READ_ONCE(page->compound_info);
 
 	if (unlikely(head & 1))
 		return head - 1;
@@ -320,13 +320,13 @@ static __always_inline unsigned long _compound_head(const struct page *page)
 
 static __always_inline int PageTail(const struct page *page)
 {
-	return READ_ONCE(page->compound_head) & 1 || page_is_fake_head(page);
+	return READ_ONCE(page->compound_info) & 1 || page_is_fake_head(page);
 }
 
 static __always_inline int PageCompound(const struct page *page)
 {
 	return test_bit(PG_head, &page->flags.f) ||
-	       READ_ONCE(page->compound_head) & 1;
+	       READ_ONCE(page->compound_info) & 1;
 }
 
 #define	PAGE_POISON_PATTERN	-1l
@@ -348,7 +348,7 @@ static const unsigned long *const_folio_flags(const struct folio *folio,
 {
 	const struct page *page = &folio->page;
 
-	VM_BUG_ON_PGFLAGS(page->compound_head & 1, page);
+	VM_BUG_ON_PGFLAGS(page->compound_info & 1, page);
 	VM_BUG_ON_PGFLAGS(n > 0 && !test_bit(PG_head, &page->flags.f), page);
 	return &page[n].flags.f;
 }
@@ -357,7 +357,7 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n)
 {
 	struct page *page = &folio->page;
 
-	VM_BUG_ON_PGFLAGS(page->compound_head & 1, page);
+	VM_BUG_ON_PGFLAGS(page->compound_info & 1, page);
 	VM_BUG_ON_PGFLAGS(n > 0 && !test_bit(PG_head, &page->flags.f), page);
 	return &page[n].flags.f;
 }
@@ -869,12 +869,12 @@ static __always_inline void set_compound_head(struct page *page,
 					      const struct page *head,
 					      unsigned int order)
 {
-	WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
+	WRITE_ONCE(page->compound_info, (unsigned long)head + 1);
 }
 
 static __always_inline void clear_compound_head(struct page *page)
 {
-	WRITE_ONCE(page->compound_head, 0);
+	WRITE_ONCE(page->compound_info, 0);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/include/linux/types.h b/include/linux/types.h
index 6dfdb8e8e4c3..3a65f0ef4a73 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -234,7 +234,7 @@ struct ustat {
  *
  * This guarantee is important for few reasons:
  *  - future call_rcu_lazy() will make use of lower bits in the pointer;
- *  - the structure shares storage space in struct page with @compound_head,
+ *  - the structure shares storage space in struct page with @compound_info,
  *    which encode PageTail() in bit 0. The guarantee is needed to avoid
  *    false-positive PageTail().
  */
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index e066d31d08f8..782bc2050a40 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -175,7 +175,7 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_OFFSET(page, lru);
 	VMCOREINFO_OFFSET(page, _mapcount);
 	VMCOREINFO_OFFSET(page, private);
-	VMCOREINFO_OFFSET(page, compound_head);
+	VMCOREINFO_OFFSET(page, compound_info);
 	VMCOREINFO_OFFSET(pglist_data, node_zones);
 	VMCOREINFO_OFFSET(pglist_data, nr_zones);
 #ifdef CONFIG_FLATMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fe77c00c99df..cecd6d89ff60 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -704,7 +704,7 @@ static inline bool pcp_allowed_order(unsigned int order)
  * The first PAGE_SIZE page is called the "head page" and have PG_head set.
  *
  * The remaining PAGE_SIZE pages are called "tail pages". PageTail() is encoded
- * in bit 0 of page->compound_head. The rest of bits is pointer to head page.
+ * in bit 0 of page->compound_info. The rest of bits is pointer to head page.
  *
  * The first tail page's ->compound_order holds the order of allocation.
  * This usage means that zero-order pages may not be compound.
diff --git a/mm/slab.h b/mm/slab.h
index 078daecc7cf5..b471877af296 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -104,7 +104,7 @@ struct slab {
 #define SLAB_MATCH(pg, sl)						\
 	static_assert(offsetof(struct page, pg) == offsetof(struct slab, sl))
 SLAB_MATCH(flags, flags);
-SLAB_MATCH(compound_head, slab_cache);	/* Ensure bit 0 is clear */
+SLAB_MATCH(compound_info, slab_cache);	/* Ensure bit 0 is clear */
 SLAB_MATCH(_refcount, __page_refcount);
 #ifdef CONFIG_MEMCG
 SLAB_MATCH(memcg_data, obj_exts);
diff --git a/mm/util.c b/mm/util.c
index 8989d5767528..cbf93cf3223a 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1244,7 +1244,7 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
 again:
 	memset(&ps->folio_snapshot, 0, sizeof(struct folio));
 	memcpy(&ps->page_snapshot, page, sizeof(*page));
-	head = ps->page_snapshot.compound_head;
+	head = ps->page_snapshot.compound_info;
 	if ((head & 1) == 0) {
 		ps->idx = 0;
 		foliop = (struct folio *)&ps->page_snapshot;
-- 
2.51.2




* [PATCHv2 05/14] mm: Move set/clear_compound_head() next to compound_head()
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
                   ` (3 preceding siblings ...)
  2025-12-18 15:09 ` [PATCHv2 04/14] mm: Rename the 'compound_head' field in the 'struct page' to 'compound_info' Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-22  3:06   ` Muchun Song
  2025-12-18 15:09 ` [PATCHv2 06/14] mm: Rework compound_head() for power-of-2 sizeof(struct page) Kiryl Shutsemau
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

Move set_compound_head() and clear_compound_head() to be adjacent to the
compound_head() function in page-flags.h.

These functions encode and decode the same compound_info field, so
keeping them together makes it easier to verify their logic is
consistent, especially when the encoding changes.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 include/linux/page-flags.h | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 72c933a43b6a..0de7db7efb00 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -290,6 +290,18 @@ static __always_inline unsigned long _compound_head(const struct page *page)
 
 #define compound_head(page)	((typeof(page))_compound_head(page))
 
+static __always_inline void set_compound_head(struct page *page,
+					      const struct page *head,
+					      unsigned int order)
+{
+	WRITE_ONCE(page->compound_info, (unsigned long)head + 1);
+}
+
+static __always_inline void clear_compound_head(struct page *page)
+{
+	WRITE_ONCE(page->compound_info, 0);
+}
+
 /**
  * page_folio - Converts from page to folio.
  * @p: The page.
@@ -865,18 +877,6 @@ static inline bool folio_test_large(const struct folio *folio)
 	return folio_test_head(folio);
 }
 
-static __always_inline void set_compound_head(struct page *page,
-					      const struct page *head,
-					      unsigned int order)
-{
-	WRITE_ONCE(page->compound_info, (unsigned long)head + 1);
-}
-
-static __always_inline void clear_compound_head(struct page *page)
-{
-	WRITE_ONCE(page->compound_info, 0);
-}
-
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void ClearPageCompound(struct page *page)
 {
-- 
2.51.2




* [PATCHv2 06/14] mm: Rework compound_head() for power-of-2 sizeof(struct page)
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
                   ` (4 preceding siblings ...)
  2025-12-18 15:09 ` [PATCHv2 05/14] mm: Move set/clear_compound_head() next to compound_head() Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-22  3:20   ` Muchun Song
  2025-12-22  7:57   ` Muchun Song
  2025-12-18 15:09 ` [PATCHv2 07/14] mm: Make page_zonenum() use head page Kiryl Shutsemau
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

For tail pages, the kernel uses the 'compound_info' field to get to the
head page. Bit 0 of the field indicates whether the page is a tail
page and, if it is set, the remaining bits hold a pointer to the head
page.

For configurations where the size of struct page is a power of 2,
change the encoding of compound_info to store a mask that can be
applied to the virtual address of a tail page to get the address of the
head page. This is possible because the head page's struct page is
naturally aligned with respect to the order of the compound page.

The significant consequence of this change is that all tail pages of
the same order now have identical 'compound_info' values, regardless of
which compound page they are associated with. This paves the way for
eliminating fake heads.

HugeTLB Vmemmap Optimization (HVO) creates fake heads, and it is only
applied when sizeof(struct page) is a power of 2. Having identical tail
pages allows the same physical page to be mapped into the vmemmap of
all such huge pages, maintaining the memory savings without fake heads.

If sizeof(struct page) is not a power of 2, there are no functional
changes.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 include/linux/page-flags.h | 62 +++++++++++++++++++++++++++++++++-----
 mm/util.c                  | 16 +++++++---
 2 files changed, 66 insertions(+), 12 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0de7db7efb00..fac5f41b3b27 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -210,6 +210,13 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
 	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
 		return page;
 
+	/*
+	 * Fake heads only exists if size of struct page is power-of-2.
+	 * See hugetlb_vmemmap_optimizable_size().
+	 */
+	if (!is_power_of_2(sizeof(struct page)))
+		return page;
+
 	/*
 	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
 	 * struct page. The alignment check aims to avoid access the fields (
@@ -223,10 +230,14 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
 		 * because the @page is a compound page composed with at least
 		 * two contiguous pages.
 		 */
-		unsigned long head = READ_ONCE(page[1].compound_info);
+		unsigned long info = READ_ONCE(page[1].compound_info);
 
-		if (likely(head & 1))
-			return (const struct page *)(head - 1);
+		/* See set_compound_head() */
+		if (likely(info & 1)) {
+			unsigned long p = (unsigned long)page;
+
+			return (const struct page *)(p & info);
+		}
 	}
 	return page;
 }
@@ -281,11 +292,27 @@ static __always_inline int page_is_fake_head(const struct page *page)
 
 static __always_inline unsigned long _compound_head(const struct page *page)
 {
-	unsigned long head = READ_ONCE(page->compound_info);
+	unsigned long info = READ_ONCE(page->compound_info);
 
-	if (unlikely(head & 1))
-		return head - 1;
-	return (unsigned long)page_fixed_fake_head(page);
+	/* Bit 0 encodes PageTail() */
+	if (!(info & 1))
+		return (unsigned long)page_fixed_fake_head(page);
+
+	/*
+	 * If the size of struct page is not power-of-2, the rest of
+	 * compound_info is the pointer to the head page.
+	 */
+	if (!is_power_of_2(sizeof(struct page)))
+		return info - 1;
+
+	/*
+	 * If the size of struct page is power-of-2 the rest of the info
+	 * encodes the mask that converts the address of the tail page to
+	 * the head page.
+	 *
+	 * No need to clear bit 0 in the mask as 'page' always has it clear.
+	 */
+	return (unsigned long)page & info;
 }
 
 #define compound_head(page)	((typeof(page))_compound_head(page))
@@ -294,7 +321,26 @@ static __always_inline void set_compound_head(struct page *page,
 					      const struct page *head,
 					      unsigned int order)
 {
-	WRITE_ONCE(page->compound_info, (unsigned long)head + 1);
+	unsigned int shift;
+	unsigned long mask;
+
+	if (!is_power_of_2(sizeof(struct page))) {
+		WRITE_ONCE(page->compound_info, (unsigned long)head | 1);
+		return;
+	}
+
+	/*
+	 * If the size of struct page is power-of-2, bits [shift:0] of the
+	 * virtual address of compound head are zero.
+	 *
+	 * Calculate mask that can be applied to the virtual address of
+	 * the tail page to get address of the head page.
+	 */
+	shift = order + order_base_2(sizeof(struct page));
+	mask = GENMASK(BITS_PER_LONG - 1, shift);
+
+	/* Bit 0 encodes PageTail() */
+	WRITE_ONCE(page->compound_info, mask | 1);
 }
 
 static __always_inline void clear_compound_head(struct page *page)
diff --git a/mm/util.c b/mm/util.c
index cbf93cf3223a..3c00f6cec3f0 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1234,7 +1234,7 @@ static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio,
  */
 void snapshot_page(struct page_snapshot *ps, const struct page *page)
 {
-	unsigned long head, nr_pages = 1;
+	unsigned long info, nr_pages = 1;
 	struct folio *foliop;
 	int loops = 5;
 
@@ -1244,8 +1244,8 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
 again:
 	memset(&ps->folio_snapshot, 0, sizeof(struct folio));
 	memcpy(&ps->page_snapshot, page, sizeof(*page));
-	head = ps->page_snapshot.compound_info;
-	if ((head & 1) == 0) {
+	info = ps->page_snapshot.compound_info;
+	if ((info & 1) == 0) {
 		ps->idx = 0;
 		foliop = (struct folio *)&ps->page_snapshot;
 		if (!folio_test_large(foliop)) {
@@ -1256,7 +1256,15 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
 		}
 		foliop = (struct folio *)page;
 	} else {
-		foliop = (struct folio *)(head - 1);
+		/* See compound_head() */
+		if (is_power_of_2(sizeof(struct page))) {
+			unsigned long p = (unsigned long)page;
+
+			foliop = (struct folio *)(p & info);
+		} else {
+			foliop = (struct folio *)(info - 1);
+		}
+
 		ps->idx = folio_page_idx(foliop, page);
 	}
 
-- 
2.51.2




* [PATCHv2 07/14] mm: Make page_zonenum() use head page
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
                   ` (5 preceding siblings ...)
  2025-12-18 15:09 ` [PATCHv2 06/14] mm: Rework compound_head() for power-of-2 sizeof(struct page) Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-18 15:09 ` [PATCHv2 08/14] mm/hugetlb: Refactor code around vmemmap_walk Kiryl Shutsemau
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

With the upcoming changes to HVO, a single page of tail struct pages
will be shared across all huge pages of the same order on a node. Since
huge pages on the same node may belong to different zones, the zone
information stored in shared tail page flags would be incorrect.

Always fetch zone information from the head page, which has unique and
correct zone flags for each compound page.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 include/linux/mmzone.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9f44dc760cdc..7e4f69b9d760 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1224,6 +1224,7 @@ static inline enum zone_type memdesc_zonenum(memdesc_flags_t flags)
 
 static inline enum zone_type page_zonenum(const struct page *page)
 {
+	page = compound_head(page);
 	return memdesc_zonenum(page->flags);
 }
 
-- 
2.51.2




* [PATCHv2 08/14] mm/hugetlb: Refactor code around vmemmap_walk
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
                   ` (6 preceding siblings ...)
  2025-12-18 15:09 ` [PATCHv2 07/14] mm: Make page_zonenum() use head page Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-22  5:54   ` Muchun Song
  2025-12-18 15:09 ` [PATCHv2 09/14] mm/hugetlb: Remove fake head pages Kiryl Shutsemau
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

To prepare for removing fake head pages, rework the code around
vmemmap_walk.

The reuse_page and reuse_addr variables are eliminated. There is no
longer any expectation about where the reuse address sits relative to
the operated range. Instead, the caller provides the head and tail
vmemmap pages, along with the vmemmap_start address where the head page
is located.

Currently, vmemmap_head and vmemmap_tail are set to the same page, but
this will change in the future.

The only functional change is that __hugetlb_vmemmap_optimize_folio()
will abandon optimization if memory allocation fails.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 mm/hugetlb_vmemmap.c | 198 ++++++++++++++++++-------------------------
 1 file changed, 83 insertions(+), 115 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index ba0fb1b6a5a8..d18e7475cf95 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -24,8 +24,9 @@
  *
  * @remap_pte:		called for each lowest-level entry (PTE).
  * @nr_walked:		the number of walked pte.
- * @reuse_page:		the page which is reused for the tail vmemmap pages.
- * @reuse_addr:		the virtual address of the @reuse_page page.
+ * @vmemmap_start:	the start of vmemmap range, where head page is located
+ * @vmemmap_head:	the page to be installed as first in the vmemmap range
+ * @vmemmap_tail:	the page to be installed as non-first in the vmemmap range
  * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
  *			or is mapped from.
  * @flags:		used to modify behavior in vmemmap page table walking
@@ -34,11 +35,14 @@
 struct vmemmap_remap_walk {
 	void			(*remap_pte)(pte_t *pte, unsigned long addr,
 					     struct vmemmap_remap_walk *walk);
+
 	unsigned long		nr_walked;
-	struct page		*reuse_page;
-	unsigned long		reuse_addr;
+	unsigned long		vmemmap_start;
+	struct page		*vmemmap_head;
+	struct page		*vmemmap_tail;
 	struct list_head	*vmemmap_pages;
 
+
 /* Skip the TLB flush when we split the PMD */
 #define VMEMMAP_SPLIT_NO_TLB_FLUSH	BIT(0)
 /* Skip the TLB flush when we remap the PTE */
@@ -140,14 +144,7 @@ static int vmemmap_pte_entry(pte_t *pte, unsigned long addr,
 {
 	struct vmemmap_remap_walk *vmemmap_walk = walk->private;
 
-	/*
-	 * The reuse_page is found 'first' in page table walking before
-	 * starting remapping.
-	 */
-	if (!vmemmap_walk->reuse_page)
-		vmemmap_walk->reuse_page = pte_page(ptep_get(pte));
-	else
-		vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
+	vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
 	vmemmap_walk->nr_walked++;
 
 	return 0;
@@ -207,18 +204,12 @@ static void free_vmemmap_page_list(struct list_head *list)
 static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 			      struct vmemmap_remap_walk *walk)
 {
-	/*
-	 * Remap the tail pages as read-only to catch illegal write operation
-	 * to the tail pages.
-	 */
-	pgprot_t pgprot = PAGE_KERNEL_RO;
 	struct page *page = pte_page(ptep_get(pte));
 	pte_t entry;
 
 	/* Remapping the head page requires r/w */
-	if (unlikely(addr == walk->reuse_addr)) {
-		pgprot = PAGE_KERNEL;
-		list_del(&walk->reuse_page->lru);
+	if (unlikely(addr == walk->vmemmap_start)) {
+		list_del(&walk->vmemmap_head->lru);
 
 		/*
 		 * Makes sure that preceding stores to the page contents from
@@ -226,9 +217,16 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 		 * write.
 		 */
 		smp_wmb();
+
+		entry = mk_pte(walk->vmemmap_head, PAGE_KERNEL);
+	} else {
+		/*
+		 * Remap the tail pages as read-only to catch illegal write
+		 * operation to the tail pages.
+		 */
+		entry = mk_pte(walk->vmemmap_tail, PAGE_KERNEL_RO);
 	}
 
-	entry = mk_pte(walk->reuse_page, pgprot);
 	list_add(&page->lru, walk->vmemmap_pages);
 	set_pte_at(&init_mm, addr, pte, entry);
 }
@@ -255,16 +253,13 @@ static inline void reset_struct_pages(struct page *start)
 static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
 				struct vmemmap_remap_walk *walk)
 {
-	pgprot_t pgprot = PAGE_KERNEL;
 	struct page *page;
 	void *to;
 
-	BUG_ON(pte_page(ptep_get(pte)) != walk->reuse_page);
-
 	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
 	list_del(&page->lru);
 	to = page_to_virt(page);
-	copy_page(to, (void *)walk->reuse_addr);
+	copy_page(to, (void *)walk->vmemmap_start);
 	reset_struct_pages(to);
 
 	/*
@@ -272,7 +267,7 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
 	 * before the set_pte_at() write.
 	 */
 	smp_wmb();
-	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
 }
 
 /**
@@ -282,33 +277,29 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
  *             to remap.
  * @end:       end address of the vmemmap virtual address range that we want to
  *             remap.
- * @reuse:     reuse address.
- *
  * Return: %0 on success, negative error code otherwise.
  */
-static int vmemmap_remap_split(unsigned long start, unsigned long end,
-			       unsigned long reuse)
+static int vmemmap_remap_split(unsigned long start, unsigned long end)
 {
 	struct vmemmap_remap_walk walk = {
 		.remap_pte	= NULL,
+		.vmemmap_start	= start,
 		.flags		= VMEMMAP_SPLIT_NO_TLB_FLUSH,
 	};
 
-	/* See the comment in the vmemmap_remap_free(). */
-	BUG_ON(start - reuse != PAGE_SIZE);
-
-	return vmemmap_remap_range(reuse, end, &walk);
+	return vmemmap_remap_range(start, end, &walk);
 }
 
 /**
  * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
- *			to the page which @reuse is mapped to, then free vmemmap
- *			which the range are mapped to.
+ *			to use @vmemmap_head/tail, then free vmemmap which
+ *			the range are mapped to.
  * @start:	start address of the vmemmap virtual address range that we want
  *		to remap.
  * @end:	end address of the vmemmap virtual address range that we want to
  *		remap.
- * @reuse:	reuse address.
+ * @vmemmap_head: the page to be installed as first in the vmemmap range
+ * @vmemmap_tail: the page to be installed as non-first in the vmemmap range
  * @vmemmap_pages: list to deposit vmemmap pages to be freed.  It is callers
  *		responsibility to free pages.
  * @flags:	modifications to vmemmap_remap_walk flags
@@ -316,69 +307,40 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_free(unsigned long start, unsigned long end,
-			      unsigned long reuse,
+			      struct page *vmemmap_head,
+			      struct page *vmemmap_tail,
 			      struct list_head *vmemmap_pages,
 			      unsigned long flags)
 {
 	int ret;
 	struct vmemmap_remap_walk walk = {
 		.remap_pte	= vmemmap_remap_pte,
-		.reuse_addr	= reuse,
+		.vmemmap_start	= start,
+		.vmemmap_head	= vmemmap_head,
+		.vmemmap_tail	= vmemmap_tail,
 		.vmemmap_pages	= vmemmap_pages,
 		.flags		= flags,
 	};
-	int nid = page_to_nid((struct page *)reuse);
-	gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
+
+	ret = vmemmap_remap_range(start, end, &walk);
+	if (!ret || !walk.nr_walked)
+		return ret;
+
+	end = start + walk.nr_walked * PAGE_SIZE;
 
 	/*
-	 * Allocate a new head vmemmap page to avoid breaking a contiguous
-	 * block of struct page memory when freeing it back to page allocator
-	 * in free_vmemmap_page_list(). This will allow the likely contiguous
-	 * struct page backing memory to be kept contiguous and allowing for
-	 * more allocations of hugepages. Fallback to the currently
-	 * mapped head page in case should it fail to allocate.
+	 * vmemmap_pages contains pages from the previous vmemmap_remap_range()
+	 * call which failed.  These are pages which were removed from
+	 * the vmemmap. They will be restored in the following call.
 	 */
-	walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
-	if (walk.reuse_page) {
-		copy_page(page_to_virt(walk.reuse_page),
-			  (void *)walk.reuse_addr);
-		list_add(&walk.reuse_page->lru, vmemmap_pages);
-		memmap_pages_add(1);
-	}
+	walk = (struct vmemmap_remap_walk) {
+		.remap_pte	= vmemmap_restore_pte,
+		.vmemmap_start	= start,
+		.vmemmap_pages	= vmemmap_pages,
+		.flags		= 0,
+	};
 
-	/*
-	 * In order to make remapping routine most efficient for the huge pages,
-	 * the routine of vmemmap page table walking has the following rules
-	 * (see more details from the vmemmap_pte_range()):
-	 *
-	 * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
-	 *   should be continuous.
-	 * - The @reuse address is part of the range [@reuse, @end) that we are
-	 *   walking which is passed to vmemmap_remap_range().
-	 * - The @reuse address is the first in the complete range.
-	 *
-	 * So we need to make sure that @start and @reuse meet the above rules.
-	 */
-	BUG_ON(start - reuse != PAGE_SIZE);
-
-	ret = vmemmap_remap_range(reuse, end, &walk);
-	if (ret && walk.nr_walked) {
-		end = reuse + walk.nr_walked * PAGE_SIZE;
-		/*
-		 * vmemmap_pages contains pages from the previous
-		 * vmemmap_remap_range call which failed.  These
-		 * are pages which were removed from the vmemmap.
-		 * They will be restored in the following call.
-		 */
-		walk = (struct vmemmap_remap_walk) {
-			.remap_pte	= vmemmap_restore_pte,
-			.reuse_addr	= reuse,
-			.vmemmap_pages	= vmemmap_pages,
-			.flags		= 0,
-		};
-
-		vmemmap_remap_range(reuse, end, &walk);
-	}
+	vmemmap_remap_range(start + PAGE_SIZE, end, &walk);
 
 	return ret;
 }
@@ -415,29 +377,27 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
  *		to remap.
  * @end:	end address of the vmemmap virtual address range that we want to
  *		remap.
- * @reuse:	reuse address.
  * @flags:	modifications to vmemmap_remap_walk flags
  *
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
-			       unsigned long reuse, unsigned long flags)
+			       unsigned long flags)
 {
 	LIST_HEAD(vmemmap_pages);
 	struct vmemmap_remap_walk walk = {
 		.remap_pte	= vmemmap_restore_pte,
-		.reuse_addr	= reuse,
+		.vmemmap_start	= start,
 		.vmemmap_pages	= &vmemmap_pages,
 		.flags		= flags,
 	};
 
-	/* See the comment in the vmemmap_remap_free(). */
-	BUG_ON(start - reuse != PAGE_SIZE);
+	start += HUGETLB_VMEMMAP_RESERVE_SIZE;
 
 	if (alloc_vmemmap_page_list(start, end, &vmemmap_pages))
 		return -ENOMEM;
 
-	return vmemmap_remap_range(reuse, end, &walk);
+	return vmemmap_remap_range(start, end, &walk);
 }
 
 DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
@@ -454,8 +414,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
 					   struct folio *folio, unsigned long flags)
 {
 	int ret;
-	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
-	unsigned long vmemmap_reuse;
+	unsigned long vmemmap_start, vmemmap_end;
 
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
@@ -466,18 +425,16 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
 	if (flags & VMEMMAP_SYNCHRONIZE_RCU)
 		synchronize_rcu();
 
+	vmemmap_start	= (unsigned long)folio;
 	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
-	vmemmap_reuse	= vmemmap_start;
-	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
 
 	/*
 	 * The pages which the vmemmap virtual address range [@vmemmap_start,
-	 * @vmemmap_end) are mapped to are freed to the buddy allocator, and
-	 * the range is mapped to the page which @vmemmap_reuse is mapped to.
+	 * @vmemmap_end) are mapped to are freed to the buddy allocator.
 	 * When a HugeTLB page is freed to the buddy allocator, previously
 	 * discarded vmemmap pages must be allocated and remapping.
 	 */
-	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
+	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, flags);
 	if (!ret) {
 		folio_clear_hugetlb_vmemmap_optimized(folio);
 		static_branch_dec(&hugetlb_optimize_vmemmap_key);
@@ -565,9 +522,9 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 					    struct list_head *vmemmap_pages,
 					    unsigned long flags)
 {
-	int ret = 0;
-	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
-	unsigned long vmemmap_reuse;
+	unsigned long vmemmap_start, vmemmap_end;
+	struct page *vmemmap_head, *vmemmap_tail;
+	int nid, ret = 0;
 
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
@@ -592,18 +549,31 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 	 */
 	folio_set_hugetlb_vmemmap_optimized(folio);
 
+	nid = folio_nid(folio);
+	vmemmap_head = alloc_pages_node(nid, GFP_KERNEL, 0);
+
+	if (!vmemmap_head) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	copy_page(page_to_virt(vmemmap_head), folio);
+	list_add(&vmemmap_head->lru, vmemmap_pages);
+	memmap_pages_add(1);
+
+	vmemmap_tail	= vmemmap_head;
+	vmemmap_start	= (unsigned long)folio;
 	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
-	vmemmap_reuse	= vmemmap_start;
-	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
 
 	/*
-	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
-	 * to the page which @vmemmap_reuse is mapped to.  Add pages previously
-	 * mapping the range to vmemmap_pages list so that they can be freed by
-	 * the caller.
+	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end).
+	 * Add pages previously mapping the range to vmemmap_pages list so that
+	 * they can be freed by the caller.
 	 */
-	ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse,
+	ret = vmemmap_remap_free(vmemmap_start, vmemmap_end,
+				 vmemmap_head, vmemmap_tail,
 				 vmemmap_pages, flags);
+out:
 	if (ret) {
 		static_branch_dec(&hugetlb_optimize_vmemmap_key);
 		folio_clear_hugetlb_vmemmap_optimized(folio);
@@ -632,21 +602,19 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
 
 static int hugetlb_vmemmap_split_folio(const struct hstate *h, struct folio *folio)
 {
-	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
-	unsigned long vmemmap_reuse;
+	unsigned long vmemmap_start, vmemmap_end;
 
 	if (!vmemmap_should_optimize_folio(h, folio))
 		return 0;
 
+	vmemmap_start	= (unsigned long)folio;
 	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
-	vmemmap_reuse	= vmemmap_start;
-	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
 
 	/*
 	 * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
 	 * @vmemmap_end]
 	 */
-	return vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
+	return vmemmap_remap_split(vmemmap_start, vmemmap_end);
 }
 
 static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
-- 
2.51.2



^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCHv2 09/14] mm/hugetlb: Remove fake head pages
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
                   ` (7 preceding siblings ...)
  2025-12-18 15:09 ` [PATCHv2 08/14] mm/hugetlb: Refactor code around vmemmap_walk Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-18 15:09 ` [PATCHv2 10/14] mm: Drop fake head checks Kiryl Shutsemau
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

HugeTLB Vmemmap Optimization (HVO) reduces memory usage by freeing most
vmemmap pages for huge pages and remapping the freed range to a single
page containing the struct page metadata.

With the new mask-based compound_info encoding (for power-of-2 struct
page sizes), the struct page contents of all tail pages of the same
order are now identical, regardless of which compound page they belong
to. This means the tail pages can be truly shared, without fake heads.

Allocate a single page of initialized tail struct pages per NUMA node
per order and record its pfn in the vmemmap_tails[] array in
pglist_data. All huge pages of that order on the node share this tail
page, which is mapped read-only into their vmemmap. The head page
remains unique per huge page.

This eliminates fake heads while maintaining the same memory savings,
and simplifies compound_head() by removing fake head detection.
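
To make the per-node bookkeeping concrete, here is a lookup-only sketch
of what vmemmap_get_tail() below boils down to (the wrapper name is made
up; allocation, locking and error handling are omitted, and the worked
numbers assume 4 KiB base pages with a 64-byte struct page, which the
patch itself does not require):

/*
 * With 4 KiB base pages and a 64-byte struct page,
 * VMEMMAP_TAIL_MIN_ORDER = ilog2(2 * 4096 / 64) = 7, so a 2 MiB huge
 * page (order 9) resolves to vmemmap_tails[2] on its node.
 */
static struct page *shared_tail_lookup(unsigned int order, int nid)
{
	unsigned int idx = order - VMEMMAP_TAIL_MIN_ORDER;

	return pfn_to_page(NODE_DATA(nid)->vmemmap_tails[idx]);
}

Every huge page of that order on the node maps this one page read-only
behind its private head page, so the per-huge-page vmemmap cost stays at
a single page.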

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 include/linux/mmzone.h | 16 ++++++++++++++-
 mm/hugetlb_vmemmap.c   | 44 ++++++++++++++++++++++++++++++++++++++++--
 mm/sparse-vmemmap.c    | 44 ++++++++++++++++++++++++++++++++++--------
 3 files changed, 93 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7e4f69b9d760..f33117618f50 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -81,7 +81,11 @@
  * currently expect (see CONFIG_HAVE_GIGANTIC_FOLIOS): with hugetlb, we expect
  * no folios larger than 16 GiB on 64bit and 1 GiB on 32bit.
  */
-#define MAX_FOLIO_ORDER		get_order(IS_ENABLED(CONFIG_64BIT) ? SZ_16G : SZ_1G)
+#ifdef CONFIG_64BIT
+#define MAX_FOLIO_ORDER		(34 - PAGE_SHIFT)
+#else
+#define MAX_FOLIO_ORDER		(30 - PAGE_SHIFT)
+#endif
 #else
 /*
  * Without hugetlb, gigantic folios that are bigger than a single PUD are
@@ -1407,6 +1411,13 @@ struct memory_failure_stats {
 };
 #endif
 
+/*
+ * vmemmap optimization (like HVO) is only possible for page orders that fill
+ * two or more pages with struct pages.
+ */
+#define VMEMMAP_TAIL_MIN_ORDER (ilog2(2 * PAGE_SIZE / sizeof(struct page)))
+#define NR_VMEMMAP_TAILS (MAX_FOLIO_ORDER - VMEMMAP_TAIL_MIN_ORDER + 1)
+
 /*
  * On NUMA machines, each NUMA node would have a pg_data_t to describe
  * it's memory layout. On UMA machines there is a single pglist_data which
@@ -1555,6 +1566,9 @@ typedef struct pglist_data {
 #ifdef CONFIG_MEMORY_FAILURE
 	struct memory_failure_stats mf_stats;
 #endif
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+	unsigned long vmemmap_tails[NR_VMEMMAP_TAILS];
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index d18e7475cf95..63d79ac80594 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -18,6 +18,7 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include "hugetlb_vmemmap.h"
+#include "internal.h"
 
 /**
  * struct vmemmap_remap_walk - walk vmemmap page table
@@ -517,6 +518,41 @@ static bool vmemmap_should_optimize_folio(const struct hstate *h, struct folio *
 	return true;
 }
 
+static struct page *vmemmap_get_tail(unsigned int order, int node)
+{
+	unsigned long pfn;
+	unsigned int idx;
+	struct page *tail, *p;
+
+	idx = order - VMEMMAP_TAIL_MIN_ORDER;
+	pfn = NODE_DATA(node)->vmemmap_tails[idx];
+	if (pfn)
+		return pfn_to_page(pfn);
+
+	tail = alloc_pages_node(node, GFP_KERNEL, 0);
+	if (!tail)
+		return NULL;
+
+	p = page_to_virt(tail);
+	for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
+		prep_compound_tail(p + i, NULL, order);
+
+	spin_lock(&hugetlb_lock);
+	if (!NODE_DATA(node)->vmemmap_tails[idx]) {
+		pfn = PHYS_PFN(virt_to_phys(p));
+		NODE_DATA(node)->vmemmap_tails[idx] = pfn;
+		tail = NULL;
+	} else {
+		pfn = NODE_DATA(node)->vmemmap_tails[idx];
+	}
+	spin_unlock(&hugetlb_lock);
+
+	if (tail)
+		__free_page(tail);
+
+	return pfn_to_page(pfn);
+}
+
 static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 					    struct folio *folio,
 					    struct list_head *vmemmap_pages,
@@ -532,6 +568,12 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 	if (!vmemmap_should_optimize_folio(h, folio))
 		return ret;
 
+	nid = folio_nid(folio);
+
+	vmemmap_tail = vmemmap_get_tail(h->order, nid);
+	if (!vmemmap_tail)
+		return -ENOMEM;
+
 	static_branch_inc(&hugetlb_optimize_vmemmap_key);
 
 	if (flags & VMEMMAP_SYNCHRONIZE_RCU)
@@ -549,7 +591,6 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 	 */
 	folio_set_hugetlb_vmemmap_optimized(folio);
 
-	nid = folio_nid(folio);
 	vmemmap_head = alloc_pages_node(nid, GFP_KERNEL, 0);
 
 	if (!vmemmap_head) {
@@ -561,7 +602,6 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 	list_add(&vmemmap_head->lru, vmemmap_pages);
 	memmap_pages_add(1);
 
-	vmemmap_tail	= vmemmap_head;
 	vmemmap_start	= (unsigned long)folio;
 	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
 
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index dbd8daccade2..94b4e90fa00f 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -378,16 +378,45 @@ void vmemmap_wrprotect_hvo(unsigned long addr, unsigned long end,
 	}
 }
 
-/*
- * Populate vmemmap pages HVO-style. The first page contains the head
- * page and needed tail pages, the other ones are mirrors of the first
- * page.
- */
+static __meminit unsigned long vmemmap_get_tail(unsigned int order, int node)
+{
+	unsigned long pfn;
+	unsigned int idx;
+	struct page *p;
+
+	BUG_ON(order < VMEMMAP_TAIL_MIN_ORDER);
+	BUG_ON(order > MAX_FOLIO_ORDER);
+
+	idx = order - VMEMMAP_TAIL_MIN_ORDER;
+	pfn = NODE_DATA(node)->vmemmap_tails[idx];
+	if (pfn)
+		return pfn;
+
+	p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
+	if (!p)
+		return 0;
+
+	for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
+		prep_compound_tail(p + i, NULL, order);
+
+	pfn = PHYS_PFN(virt_to_phys(p));
+	NODE_DATA(node)->vmemmap_tails[idx] = pfn;
+
+	return pfn;
+}
+
 int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
 				       int node, unsigned long headsize)
 {
+	unsigned long maddr, len, tail_pfn;
+	unsigned int order;
 	pte_t *pte;
-	unsigned long maddr;
+
+	len = end - addr;
+	order = ilog2(len * sizeof(struct page) / PAGE_SIZE);
+	tail_pfn = vmemmap_get_tail(order, node);
+	if (!tail_pfn)
+		return -ENOMEM;
 
 	for (maddr = addr; maddr < addr + headsize; maddr += PAGE_SIZE) {
 		pte = vmemmap_populate_address(maddr, node, NULL, -1, 0);
@@ -398,8 +427,7 @@ int __meminit vmemmap_populate_hvo(unsigned long addr, unsigned long end,
 	/*
 	 * Reuse the last page struct page mapped above for the rest.
 	 */
-	return vmemmap_populate_range(maddr, end, node, NULL,
-					pte_pfn(ptep_get(pte)), 0);
+	return vmemmap_populate_range(maddr, end, node, NULL, tail_pfn, 0);
 }
 
 void __weak __meminit vmemmap_set_pmd(pmd_t *pmd, void *p, int node,
-- 
2.51.2



^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCHv2 10/14] mm: Drop fake head checks
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
                   ` (8 preceding siblings ...)
  2025-12-18 15:09 ` [PATCHv2 09/14] mm/hugetlb: Remove fake head pages Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-22  5:56   ` Muchun Song
  2025-12-18 15:09 ` [PATCHv2 11/14] hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU Kiryl Shutsemau
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

With fake head pages eliminated in the previous commit, remove the
supporting infrastructure:

  - page_fixed_fake_head(): no longer needed to detect fake heads;
  - page_is_fake_head(): no longer needed;
  - page_count_writable(): no longer needed for RCU protection;
  - rcu_read_lock()/unlock() in page_ref_add_unless(): no longer needed.

This substantially simplifies compound_head() and page_ref_add_unless(),
removing both the extra branches and the RCU overhead from these hot
paths.
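
For context, the hottest caller affected is the speculative try-get
path; a condensed sketch of what it reduces to after this patch (the
function name is made up, but get_page_unless_zero(), for example, is
just page_ref_add_unless(page, 1, 0)):

static inline bool speculative_try_get(struct page *page)
{
	/* no RCU section, no fake head probing -- just the atomic */
	return atomic_add_unless(&page->_refcount, 1, 0);
}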

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 include/linux/page-flags.h | 96 ++------------------------------------
 include/linux/page_ref.h   |  8 +---
 2 files changed, 4 insertions(+), 100 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index fac5f41b3b27..9d89beed9df6 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -198,105 +198,15 @@ enum pageflags {
 
 #ifndef __GENERATING_BOUNDS_H
 
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
 
-/*
- * Return the real head page struct iff the @page is a fake head page, otherwise
- * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
- */
-static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
-{
-	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
-		return page;
-
-	/*
-	 * Fake heads only exists if size of struct page is power-of-2.
-	 * See hugetlb_vmemmap_optimizable_size().
-	 */
-	if (!is_power_of_2(sizeof(struct page)))
-		return page;
-
-	/*
-	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
-	 * struct page. The alignment check aims to avoid access the fields (
-	 * e.g. compound_info) of the @page[1]. It can avoid touch a (possibly)
-	 * cold cacheline in some cases.
-	 */
-	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
-	    test_bit(PG_head, &page->flags.f)) {
-		/*
-		 * We can safely access the field of the @page[1] with PG_head
-		 * because the @page is a compound page composed with at least
-		 * two contiguous pages.
-		 */
-		unsigned long info = READ_ONCE(page[1].compound_info);
-
-		/* See set_compound_head() */
-		if (likely(info & 1)) {
-			unsigned long p = (unsigned long)page;
-
-			return (const struct page *)(p & info);
-		}
-	}
-	return page;
-}
-
-static __always_inline bool page_count_writable(const struct page *page, int u)
-{
-	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
-		return true;
-
-	/*
-	 * The refcount check is ordered before the fake-head check to prevent
-	 * the following race:
-	 *   CPU 1 (HVO)                     CPU 2 (speculative PFN walker)
-	 *
-	 *   page_ref_freeze()
-	 *   synchronize_rcu()
-	 *                                   rcu_read_lock()
-	 *                                   page_is_fake_head() is false
-	 *   vmemmap_remap_pte()
-	 *   XXX: struct page[] becomes r/o
-	 *
-	 *   page_ref_unfreeze()
-	 *                                   page_ref_count() is not zero
-	 *
-	 *                                   atomic_add_unless(&page->_refcount)
-	 *                                   XXX: try to modify r/o struct page[]
-	 *
-	 * The refcount check also prevents modification attempts to other (r/o)
-	 * tail pages that are not fake heads.
-	 */
-	if (atomic_read_acquire(&page->_refcount) == u)
-		return false;
-
-	return page_fixed_fake_head(page) == page;
-}
-#else
-static inline const struct page *page_fixed_fake_head(const struct page *page)
-{
-	return page;
-}
-
-static inline bool page_count_writable(const struct page *page, int u)
-{
-	return true;
-}
-#endif
-
-static __always_inline int page_is_fake_head(const struct page *page)
-{
-	return page_fixed_fake_head(page) != page;
-}
-
 static __always_inline unsigned long _compound_head(const struct page *page)
 {
 	unsigned long info = READ_ONCE(page->compound_info);
 
 	/* Bit 0 encodes PageTail() */
 	if (!(info & 1))
-		return (unsigned long)page_fixed_fake_head(page);
+		return (unsigned long)page;
 
 	/*
 	 * If the size of struct page is not power-of-2, the rest of
@@ -378,7 +288,7 @@ static __always_inline void clear_compound_head(struct page *page)
 
 static __always_inline int PageTail(const struct page *page)
 {
-	return READ_ONCE(page->compound_info) & 1 || page_is_fake_head(page);
+	return READ_ONCE(page->compound_info) & 1;
 }
 
 static __always_inline int PageCompound(const struct page *page)
@@ -905,7 +815,7 @@ static __always_inline bool folio_test_head(const struct folio *folio)
 static __always_inline int PageHead(const struct page *page)
 {
 	PF_POISONED_CHECK(page);
-	return test_bit(PG_head, &page->flags.f) && !page_is_fake_head(page);
+	return test_bit(PG_head, &page->flags.f);
 }
 
 __SETPAGEFLAG(Head, head, PF_ANY)
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 544150d1d5fd..490d0ad6e56d 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -230,13 +230,7 @@ static inline int folio_ref_dec_return(struct folio *folio)
 
 static inline bool page_ref_add_unless(struct page *page, int nr, int u)
 {
-	bool ret = false;
-
-	rcu_read_lock();
-	/* avoid writing to the vmemmap area being remapped */
-	if (page_count_writable(page, u))
-		ret = atomic_add_unless(&page->_refcount, nr, u);
-	rcu_read_unlock();
+	bool ret = atomic_add_unless(&page->_refcount, nr, u);
 
 	if (page_ref_tracepoint_active(page_ref_mod_unless))
 		__page_ref_mod_unless(page, nr, ret);
-- 
2.51.2



^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCHv2 11/14] hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
                   ` (9 preceding siblings ...)
  2025-12-18 15:09 ` [PATCHv2 10/14] mm: Drop fake head checks Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-22  6:00   ` Muchun Song
  2025-12-18 15:09 ` [PATCHv2 12/14] mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key Kiryl Shutsemau
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

The VMEMMAP_SYNCHRONIZE_RCU flag triggered synchronize_rcu() calls to
prevent a race between HVO remapping and page_ref_add_unless(). The
race could occur when a speculative PFN walker tried to modify the
refcount of a tail struct page while its vmemmap entry was being
remapped, read-only, to a fake head.

With fake heads eliminated, page_ref_add_unless() no longer needs RCU
protection.

Remove the flag and synchronize_rcu() calls.
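
For reference, this is the batching the flag enabled; a condensed sketch
of the pre-patch restore loop, adapted from the code removed in the hunk
below (not a verbatim copy):

	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
	struct folio *folio, *t_folio;

	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
		if (!folio_test_hugetlb_vmemmap_optimized(folio))
			continue;
		if (__hugetlb_vmemmap_restore_folio(h, folio, flags))
			break;
		/* only the first folio in the batch paid the grace period */
		flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
	}

After this patch the callers pass VMEMMAP_REMAP_NO_TLB_FLUSH (or 0 for a
single folio) and no grace period is paid at all.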

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 mm/hugetlb_vmemmap.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 63d79ac80594..cc0fcf847810 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -48,8 +48,6 @@ struct vmemmap_remap_walk {
 #define VMEMMAP_SPLIT_NO_TLB_FLUSH	BIT(0)
 /* Skip the TLB flush when we remap the PTE */
 #define VMEMMAP_REMAP_NO_TLB_FLUSH	BIT(1)
-/* synchronize_rcu() to avoid writes from page_ref_add_unless() */
-#define VMEMMAP_SYNCHRONIZE_RCU		BIT(2)
 	unsigned long		flags;
 };
 
@@ -423,9 +421,6 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
 	if (!folio_test_hugetlb_vmemmap_optimized(folio))
 		return 0;
 
-	if (flags & VMEMMAP_SYNCHRONIZE_RCU)
-		synchronize_rcu();
-
 	vmemmap_start	= (unsigned long)folio;
 	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
 
@@ -456,7 +451,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
  */
 int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
 {
-	return __hugetlb_vmemmap_restore_folio(h, folio, VMEMMAP_SYNCHRONIZE_RCU);
+	return __hugetlb_vmemmap_restore_folio(h, folio, 0);
 }
 
 /**
@@ -479,14 +474,11 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 	struct folio *folio, *t_folio;
 	long restored = 0;
 	long ret = 0;
-	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
+	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH;
 
 	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
 		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
 			ret = __hugetlb_vmemmap_restore_folio(h, folio, flags);
-			/* only need to synchronize_rcu() once for each batch */
-			flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
-
 			if (ret)
 				break;
 			restored++;
@@ -576,8 +568,6 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 
 	static_branch_inc(&hugetlb_optimize_vmemmap_key);
 
-	if (flags & VMEMMAP_SYNCHRONIZE_RCU)
-		synchronize_rcu();
 	/*
 	 * Very Subtle
 	 * If VMEMMAP_REMAP_NO_TLB_FLUSH is set, TLB flushing is not performed
@@ -636,7 +626,7 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
 {
 	LIST_HEAD(vmemmap_pages);
 
-	__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, VMEMMAP_SYNCHRONIZE_RCU);
+	__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);
 	free_vmemmap_page_list(&vmemmap_pages);
 }
 
@@ -664,7 +654,7 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
 	struct folio *folio;
 	int nr_to_optimize;
 	LIST_HEAD(vmemmap_pages);
-	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
+	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH;
 
 	nr_to_optimize = 0;
 	list_for_each_entry(folio, folio_list, lru) {
@@ -717,8 +707,6 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
 		int ret;
 
 		ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
-		/* only need to synchronize_rcu() once for each batch */
-		flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
 
 		/*
 		 * Pages to be freed may have been accumulated.  If we
-- 
2.51.2



^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCHv2 12/14] mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
                   ` (10 preceding siblings ...)
  2025-12-18 15:09 ` [PATCHv2 11/14] hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-22  6:03   ` Muchun Song
  2025-12-18 15:09 ` [PATCHv2 13/14] mm: Remove the branch from compound_head() Kiryl Shutsemau
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

The hugetlb_optimize_vmemmap_key static key was used to guard fake head
detection in compound_head() and related functions. It allowed skipping
the fake head checks entirely when HVO was not in use.

With fake heads eliminated and the detection code removed, the static
key serves no purpose. Remove its definition and all increment/decrement
calls.
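
For background, the guard followed the usual static-branch pattern; a
generic sketch of that pattern (only the jump-label API calls are real,
the other names are illustrative):

DEFINE_STATIC_KEY_FALSE(example_key);

static bool guarded_check(const struct page *page)
{
	/* compiles to a patched NOP while the key is 0, a JMP once raised */
	if (!static_branch_unlikely(&example_key))
		return false;

	return expensive_check(page);	/* hypothetical slow path */
}

The key was raised with static_branch_inc() as folios were optimized and
dropped with static_branch_dec() when they were restored; with nothing
left to guard, that counting is pure overhead.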

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 include/linux/page-flags.h |  2 --
 mm/hugetlb_vmemmap.c       | 14 ++------------
 2 files changed, 2 insertions(+), 14 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9d89beed9df6..2255e7e6759c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -198,8 +198,6 @@ enum pageflags {
 
 #ifndef __GENERATING_BOUNDS_H
 
-DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
-
 static __always_inline unsigned long _compound_head(const struct page *page)
 {
 	unsigned long info = READ_ONCE(page->compound_info);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index cc0fcf847810..f68ed7ebf873 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -399,9 +399,6 @@ static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
 	return vmemmap_remap_range(start, end, &walk);
 }
 
-DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
-EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
-
 static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
 static int __init hugetlb_vmemmap_optimize_param(char *buf)
 {
@@ -431,10 +428,8 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
 	 * discarded vmemmap pages must be allocated and remapped.
 	 */
 	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, flags);
-	if (!ret) {
+	if (!ret)
 		folio_clear_hugetlb_vmemmap_optimized(folio);
-		static_branch_dec(&hugetlb_optimize_vmemmap_key);
-	}
 
 	return ret;
 }
@@ -566,8 +561,6 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 	if (!vmemmap_tail)
 		return -ENOMEM;
 
-	static_branch_inc(&hugetlb_optimize_vmemmap_key);
-
 	/*
 	 * Very Subtle
 	 * If VMEMMAP_REMAP_NO_TLB_FLUSH is set, TLB flushing is not performed
@@ -604,10 +597,8 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 				 vmemmap_head, vmemmap_tail,
 				 vmemmap_pages, flags);
 out:
-	if (ret) {
-		static_branch_dec(&hugetlb_optimize_vmemmap_key);
+	if (ret)
 		folio_clear_hugetlb_vmemmap_optimized(folio);
-	}
 
 	return ret;
 }
@@ -673,7 +664,6 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
 			register_page_bootmem_memmap(pfn_to_section_nr(spfn),
 					&folio->page,
 					HUGETLB_VMEMMAP_RESERVE_SIZE);
-			static_branch_inc(&hugetlb_optimize_vmemmap_key);
 			continue;
 		}
 
-- 
2.51.2



^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCHv2 13/14] mm: Remove the branch from compound_head()
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
                   ` (11 preceding siblings ...)
  2025-12-18 15:09 ` [PATCHv2 12/14] mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-22  6:30   ` Muchun Song
  2025-12-18 15:09 ` [PATCHv2 14/14] hugetlb: Update vmemmap_dedup.rst Kiryl Shutsemau
  2025-12-18 22:18 ` [PATCHv2 00/14] Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
  14 siblings, 1 reply; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

The compound_head() function is a hot path. For example, the zap path
calls it for every leaf page table entry.

Rewrite the helper function in a branchless manner to eliminate the risk
of CPU branch misprediction.
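
The selection trick can be checked in isolation; a standalone sketch of
the arithmetic used in the hunk below, for the power-of-2
sizeof(struct page) case (the helper name is made up):

/*
 * info bit 0 set:   tail page, the remaining bits mask the address down
 *                   to the head page.
 * info bit 0 clear: not a tail, the page is its own head.
 */
static unsigned long head_address(unsigned long page_addr, unsigned long info)
{
	unsigned long mask = (info & 1) - 1;	/* tail: 0, non-tail: ~0UL */

	mask |= info;				/* tail: info, non-tail: ~0UL */
	return page_addr & mask;		/* tail: head, non-tail: page_addr */
}

ANDing with an all-ones mask leaves a non-tail page untouched, so no
data-dependent branch is left for the CPU to mispredict.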

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 include/linux/page-flags.h | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 2255e7e6759c..6d5ebd66eda6 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -201,17 +201,15 @@ enum pageflags {
 static __always_inline unsigned long _compound_head(const struct page *page)
 {
 	unsigned long info = READ_ONCE(page->compound_info);
+	unsigned long mask;
+
+	if (!is_power_of_2(sizeof(struct page))) {
+		/* Bit 0 encodes PageTail() */
+		if (info & 1)
+			return info - 1;
 
-	/* Bit 0 encodes PageTail() */
-	if (!(info & 1))
 		return (unsigned long)page;
-
-	/*
-	 * If the size of struct page is not power-of-2, the rest of
-	 * compound_info is the pointer to the head page.
-	 */
-	if (!is_power_of_2(sizeof(struct page)))
-		return info - 1;
+	}
 
 	/*
 	 * If the size of struct page is power-of-2 the rest of the info
@@ -219,8 +217,17 @@ static __always_inline unsigned long _compound_head(const struct page *page)
 	 * the head page.
 	 *
 	 * No need to clear bit 0 in the mask as 'page' always has it clear.
+	 *
+	 * Let's do it in a branchless manner.
 	 */
-	return (unsigned long)page & info;
+
+	/* Non-tail: -1UL, Tail: 0 */
+	mask = (info & 1) - 1;
+
+	/* Non-tail: -1UL, Tail: info */
+	mask |= info;
+
+	return (unsigned long)page & mask;
 }
 
 #define compound_head(page)	((typeof(page))_compound_head(page))
-- 
2.51.2



^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCHv2 14/14] hugetlb: Update vmemmap_dedup.rst
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
                   ` (12 preceding siblings ...)
  2025-12-18 15:09 ` [PATCHv2 13/14] mm: Remove the branch from compound_head() Kiryl Shutsemau
@ 2025-12-18 15:09 ` Kiryl Shutsemau
  2025-12-22  6:20   ` Muchun Song
  2025-12-18 22:18 ` [PATCHv2 00/14] Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
  14 siblings, 1 reply; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 15:09 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Kiryl Shutsemau

Update the documentation regarding vmemmap optimization for hugetlb to
reflect the changes in how the kernel maps the tail pages.

Fake heads no longer exist. Remove their description.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 Documentation/mm/vmemmap_dedup.rst | 60 +++++++++++++-----------------
 1 file changed, 26 insertions(+), 34 deletions(-)

diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
index 1863d88d2dcb..a0c4c79d6922 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -124,33 +124,35 @@ Here is how things look before optimization::
  |           |
  +-----------+
 
-The value of page->compound_info is the same for all tail pages. The first
-page of ``struct page`` (page 0) associated with the HugeTLB page contains the 4
-``struct page`` necessary to describe the HugeTLB. The only use of the remaining
-pages of ``struct page`` (page 1 to page 7) is to point to page->compound_info.
-Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of ``struct page``
-will be used for each HugeTLB page. This will allow us to free the remaining
-7 pages to the buddy allocator.
+The first page of ``struct page`` (page 0) associated with the HugeTLB page
+contains the 4 ``struct page`` structs needed to describe the HugeTLB page.
+The remaining pages (page 1 to page 7) hold only tail ``struct page`` structs.
+
+The optimization is only applied when the size of ``struct page`` is a
+power-of-2. In this case, all tail pages of the same order are identical. See
+``compound_head()``. This allows us to remap the tail pages of the vmemmap to a
+shared, read-only page. The head page is also remapped to a new page. This
+allows the original vmemmap pages to be freed.
 
 Here is how things look after remapping::
 
-    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
- +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
- |           |                     |     0     | -------------> |     0     |
- |           |                     +-----------+                +-----------+
- |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
- |           |                     +-----------+                  | | | | | |
- |           |                     |     2     | -----------------+ | | | | |
- |           |                     +-----------+                    | | | | |
- |           |                     |     3     | -------------------+ | | | |
- |           |                     +-----------+                      | | | |
- |           |                     |     4     | ---------------------+ | | |
- |    PMD    |                     +-----------+                        | | |
- |   level   |                     |     5     | -----------------------+ | |
- |  mapping  |                     +-----------+                          | |
- |           |                     |     6     | -------------------------+ |
- |           |                     +-----------+                            |
- |           |                     |     7     | ---------------------------+
+    HugeTLB                  struct pages(8 pages)                 page frame
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +----------------+
+ |           |                     |     0     | -------------> |       0        |
+ |           |                     +-----------+                +----------------+
+ |           |                     |     1     | ------┐
+ |           |                     +-----------+       |
+ |           |                     |     2     | ------┼        +----------------+
+ |           |                     +-----------+       |        | vmemmap_tail   |
+ |           |                     |     3     | ------┼------> | shared for the |
+ |           |                     +-----------+       |        | node and order |
+ |           |                     |     4     | ------┼        +----------------+
+ |           |                     +-----------+       |
+ |           |                     |     5     | ------┼
+ |    PMD    |                     +-----------+       |
+ |   level   |                     |     6     | ------┼
+ |  mapping  |                     +-----------+       |
+ |           |                     |     7     | ------┘
  |           |                     +-----------+
  |           |
  |           |
@@ -172,16 +174,6 @@ The contiguous bit is used to increase the mapping size at the pmd and pte
 (last) level. So this type of HugeTLB page can be optimized only when its
 size of the ``struct page`` structs is greater than **1** page.
 
-Notice: The head vmemmap page is not freed to the buddy allocator and all
-tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
-more than one ``struct page`` struct with ``PG_head`` (e.g. 8 per 2 MB HugeTLB
-page) associated with each HugeTLB page. The ``compound_head()`` can handle
-this correctly. There is only **one** head ``struct page``, the tail
-``struct page`` with ``PG_head`` are fake head ``struct page``.  We need an
-approach to distinguish between those two different types of ``struct page`` so
-that ``compound_head()`` can return the real head ``struct page`` when the
-parameter is the tail ``struct page`` but with ``PG_head``.
-
 Device DAX
 ==========
 
-- 
2.51.2



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 00/14] Eliminate fake head pages from vmemmap optimization
  2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
                   ` (13 preceding siblings ...)
  2025-12-18 15:09 ` [PATCHv2 14/14] hugetlb: Update vmemmap_dedup.rst Kiryl Shutsemau
@ 2025-12-18 22:18 ` Kiryl Shutsemau
  14 siblings, 0 replies; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-18 22:18 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Usama Arif, Frank van der Linden
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc


Oopsie. Add the Subject.

On Thu, Dec 18, 2025 at 03:09:31PM +0000, Kiryl Shutsemau wrote:
> This series removes "fake head pages" from the HugeTLB vmemmap
> optimization (HVO) by changing how tail pages encode their relationship
> to the head page.
> 
> It simplifies compound_head() and page_ref_add_unless(). Both are in the
> hot path.
> 
> Background
> ==========
> 
> HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
> and remapping the freed virtual addresses to a single physical page.
> Previously, all tail page vmemmap entries were remapped to the first
> vmemmap page (containing the head struct page), creating "fake heads" -
> tail pages that appear to have PG_head set when accessed through the
> deduplicated vmemmap.
> 
> This required special handling in compound_head() to detect and work
> around fake heads, adding complexity and overhead to a very hot path.
> 
> New Approach
> ============
> 
> For architectures/configs where sizeof(struct page) is a power of 2 (the
> common case), this series changes how position of the head page is encoded
> in the tail pages.
> 
> Instead of storing a pointer to the head page, the ->compound_info
> (renamed from ->compound_head) now stores a mask.
> 
> The mask can be applied to any tail page's virtual address to compute
> the head page address. Critically, all tail pages of the same order now
> have identical compound_info values, regardless of which compound page
> they belong to.
> 
> The key insight is that all tail pages of the same order now have
> identical compound_info values, regardless of which compound page they
> belong to. This allows a single page of tail struct pages to be shared
> across all huge pages of the same order on a NUMA node.
> 
> Benefits
> ========
> 
> 1. Simplified compound_head(): No fake head detection needed, can be
>    implemented in a branchless manner.
> 
> 2. Simplified page_ref_add_unless(): RCU protection removed since there's
>    no race with fake head remapping.
> 
> 3. Cleaner architecture: The shared tail pages are truly read-only and
>    contain valid tail page metadata.
> 
> If sizeof(struct page) is not power-of-2, there are no functional changes.
> HVO is not supported in this configuration.
> 
> I had hoped to see performance improvement, but my testing thus far has
> shown either no change or only a slight improvement within the noise.
> 
> Series Organization
> ===================
> 
> Patches 1-2: Preparation - move MAX_FOLIO_ORDER, add alignment check
> Patches 3-5: Refactoring - interface changes, field rename, code movement
> Patch 6: Core change - new mask-based compound_head() encoding
> Patch 7: Correctness fix - page_zonenum() must use head page
> Patch 8: Refactor vmemmap_walk for new design
> Patch 9: Eliminate fake heads with shared tail pages
> Patches 10-13: Cleanup - remove fake head infrastructure
> Patch 14: Documentation update
> 
> Changes in v2:
> ==============
> 
> - Handle boot-allocated huge pages correctly. (Frank)
> 
> - Changed from per-hstate vmemmap_tail to per-node vmemmap_tails[] array
>   in pglist_data. (Muchun)
> 
> - Added spin_lock(&hugetlb_lock) protection in vmemmap_get_tail() to fix
>   a race condition where two threads could both allocate tail pages.
>   The losing thread now properly frees its allocated page. (Usama)
> 
> - Add warning if memmap is not aligned to MAX_FOLIO_SIZE, which is
>   required for the mask approach. (Muchun)
> 
> - Make page_zonenum() use head page - correctness fix since shared
>   tail pages cannot have valid zone information. (Muchun)
> 
> - Added 'const' qualifier to head parameter in set_compound_head() and
>   prep_compound_tail(). (Usama)
> 
> - Updated commit messages.
> 
> Kiryl Shutsemau (14):
>   mm: Move MAX_FOLIO_ORDER definition to mmzone.h
>   mm/sparse: Check memmap alignment
>   mm: Change the interface of prep_compound_tail()
>   mm: Rename the 'compound_head' field in the 'struct page' to
>     'compound_info'
>   mm: Move set/clear_compound_head() next to compound_head()
>   mm: Rework compound_head() for power-of-2 sizeof(struct page)
>   mm: Make page_zonenum() use head page
>   mm/hugetlb: Refactor code around vmemmap_walk
>   mm/hugetlb: Remove fake head pages
>   mm: Drop fake head checks
>   hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU
>   mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key
>   mm: Remove the branch from compound_head()
>   hugetlb: Update vmemmap_dedup.rst
> 
>  .../admin-guide/kdump/vmcoreinfo.rst          |   2 +-
>  Documentation/mm/vmemmap_dedup.rst            |  62 ++--
>  include/linux/mm.h                            |  31 --
>  include/linux/mm_types.h                      |  20 +-
>  include/linux/mmzone.h                        |  47 +++
>  include/linux/page-flags.h                    | 163 ++++-------
>  include/linux/page_ref.h                      |   8 +-
>  include/linux/types.h                         |   2 +-
>  kernel/vmcore_info.c                          |   2 +-
>  mm/hugetlb.c                                  |   8 +-
>  mm/hugetlb_vmemmap.c                          | 270 +++++++++---------
>  mm/internal.h                                 |  12 +-
>  mm/mm_init.c                                  |   2 +-
>  mm/page_alloc.c                               |   4 +-
>  mm/slab.h                                     |   2 +-
>  mm/sparse-vmemmap.c                           |  44 ++-
>  mm/sparse.c                                   |   3 +
>  mm/util.c                                     |  16 +-
>  18 files changed, 345 insertions(+), 353 deletions(-)
> 
> -- 
> 2.51.2
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 03/14] mm: Change the interface of prep_compound_tail()
  2025-12-18 15:09 ` [PATCHv2 03/14] mm: Change the interface of prep_compound_tail() Kiryl Shutsemau
@ 2025-12-22  2:55   ` Muchun Song
  0 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2025-12-22  2:55 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc



> On Dec 18, 2025, at 23:09, Kiryl Shutsemau <kas@kernel.org> wrote:
> 
> Instead of passing down the head page and tail page index, pass the tail
> and head pages directly, as well as the order of the compound page.
> 
> This is a preparation for changing how the head position is encoded in
> the tail page.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

Reviewed-by: Muchun Song <muchun.song@linux.dev>

Thanks.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 04/14] mm: Rename the 'compound_head' field in the 'struct page' to 'compound_info'
  2025-12-18 15:09 ` [PATCHv2 04/14] mm: Rename the 'compound_head' field in the 'struct page' to 'compound_info' Kiryl Shutsemau
@ 2025-12-22  3:00   ` Muchun Song
  0 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2025-12-22  3:00 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc



> On Dec 18, 2025, at 23:09, Kiryl Shutsemau <kas@kernel.org> wrote:
> 
> The 'compound_head' field in the 'struct page' encodes whether the page
> is a tail and where to locate the head page. Bit 0 is set if the page is
> a tail, and the remaining bits in the field point to the head page.
> 
> As preparation for changing how the field encodes information about the
> head page, rename the field to 'compound_info'.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

Reviewed-by: Muchun Song <muchun.song@linux.dev>

Thanks.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 05/14] mm: Move set/clear_compound_head() next to compound_head()
  2025-12-18 15:09 ` [PATCHv2 05/14] mm: Move set/clear_compound_head() next to compound_head() Kiryl Shutsemau
@ 2025-12-22  3:06   ` Muchun Song
  0 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2025-12-22  3:06 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc



> On Dec 18, 2025, at 23:09, Kiryl Shutsemau <kas@kernel.org> wrote:
> 
> Move set_compound_head() and clear_compound_head() to be adjacent to the
> compound_head() function in page-flags.h.
> 
> These functions encode and decode the same compound_info field, so
> keeping them together makes it easier to verify their logic is
> consistent, especially when the encoding changes.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

Reviewed-by: Muchun Song <muchun.song@linux.dev>

Thanks.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 06/14] mm: Rework compound_head() for power-of-2 sizeof(struct page)
  2025-12-18 15:09 ` [PATCHv2 06/14] mm: Rework compound_head() for power-of-2 sizeof(struct page) Kiryl Shutsemau
@ 2025-12-22  3:20   ` Muchun Song
  2025-12-22 14:03     ` Kiryl Shutsemau
  2025-12-22  7:57   ` Muchun Song
  1 sibling, 1 reply; 43+ messages in thread
From: Muchun Song @ 2025-12-22  3:20 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden



On 2025/12/18 23:09, Kiryl Shutsemau wrote:
> For tail pages, the kernel uses the 'compound_info' field to get to the
> head page. The bit 0 of the field indicates whether the page is a
> tail page, and if set, the remaining bits represent a pointer to the
> head page.
>
> For cases when size of struct page is power-of-2, change the encoding of
> compound_info to store a mask that can be applied to the virtual address
> of the tail page in order to access the head page. It is possible
> because struct page of the head page is naturally aligned with regards
> to order of the page.
>
> The significant impact of this modification is that all tail pages of
> the same order will now have identical 'compound_info', regardless of
> the compound page they are associated with. This paves the way for
> eliminating fake heads.
>
> The HugeTLB Vmemmap Optimization (HVO) creates fake heads and it is only
> applied when the sizeof(struct page) is power-of-2. Having identical
> tail pages allows the same page to be mapped into the vmemmap of all
> pages, maintaining memory savings without fake heads.
>
> If sizeof(struct page) is not power-of-2, there is no functional
> changes.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

Reviewed-by: Muchun Song <muchun.song@linux.dev>

One nit bellow.

> ---
>   include/linux/page-flags.h | 62 +++++++++++++++++++++++++++++++++-----
>   mm/util.c                  | 16 +++++++---
>   2 files changed, 66 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 0de7db7efb00..fac5f41b3b27 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -210,6 +210,13 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
>   	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
>   		return page;
>   
> +	/*
> +	 * Fake heads only exists if size of struct page is power-of-2.
> +	 * See hugetlb_vmemmap_optimizable_size().
> +	 */
> +	if (!is_power_of_2(sizeof(struct page)))
> +		return page;
> +
>   	/*
>   	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
>   	 * struct page. The alignment check aims to avoid access the fields (
> @@ -223,10 +230,14 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
>   		 * because the @page is a compound page composed with at least
>   		 * two contiguous pages.
>   		 */
> -		unsigned long head = READ_ONCE(page[1].compound_info);
> +		unsigned long info = READ_ONCE(page[1].compound_info);
>   
> -		if (likely(head & 1))
> -			return (const struct page *)(head - 1);
> +		/* See set_compound_head() */
> +		if (likely(info & 1)) {
> +			unsigned long p = (unsigned long)page;
> +
> +			return (const struct page *)(p & info);
> +		}
>   	}
>   	return page;
>   }
> @@ -281,11 +292,27 @@ static __always_inline int page_is_fake_head(const struct page *page)
>   
>   static __always_inline unsigned long _compound_head(const struct page *page)
>   {
> -	unsigned long head = READ_ONCE(page->compound_info);
> +	unsigned long info = READ_ONCE(page->compound_info);
>   
> -	if (unlikely(head & 1))
> -		return head - 1;
> -	return (unsigned long)page_fixed_fake_head(page);
> +	/* Bit 0 encodes PageTail() */
> +	if (!(info & 1))
> +		return (unsigned long)page_fixed_fake_head(page);
> +
> +	/*
> +	 * If the size of struct page is not power-of-2, the rest of
> +	 * compound_info is the pointer to the head page.
> +	 */
> +	if (!is_power_of_2(sizeof(struct page)))
> +		return info - 1;
> +
> +	/*
> +	 * If the size of struct page is power-of-2 the rest of the info
> +	 * encodes the mask that converts the address of the tail page to
> +	 * the head page.
> +	 *
> +	 * No need to clear bit 0 in the mask as 'page' always has it clear.
> +	 */
> +	return (unsigned long)page & info;
>   }
>   
>   #define compound_head(page)	((typeof(page))_compound_head(page))
> @@ -294,7 +321,26 @@ static __always_inline void set_compound_head(struct page *page,
>   					      const struct page *head,
>   					      unsigned int order)
>   {
> -	WRITE_ONCE(page->compound_info, (unsigned long)head + 1);
> +	unsigned int shift;
> +	unsigned long mask;
> +
> +	if (!is_power_of_2(sizeof(struct page))) {
> +		WRITE_ONCE(page->compound_info, (unsigned long)head | 1);
> +		return;
> +	}
> +
> +	/*
> +	 * If the size of struct page is power-of-2, bits [shift:0] of the
> +	 * virtual address of compound head are zero.
> +	 *
> +	 * Calculate mask that can be applied to the virtual address of
> +	 * the tail page to get address of the head page.
> +	 */
> +	shift = order + order_base_2(sizeof(struct page));

We already have a macro for order_base_2(sizeof(struct page)),
that is STRUCT_PAGE_MAX_SHIFT.
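
With that macro the line above would simply read:

	shift = order + STRUCT_PAGE_MAX_SHIFT;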

Thanks.

> +	mask = GENMASK(BITS_PER_LONG - 1, shift);
> +
> +	/* Bit 0 encodes PageTail() */
> +	WRITE_ONCE(page->compound_info, mask | 1);
>   }
>   
>   static __always_inline void clear_compound_head(struct page *page)
> diff --git a/mm/util.c b/mm/util.c
> index cbf93cf3223a..3c00f6cec3f0 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1234,7 +1234,7 @@ static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio,
>    */
>   void snapshot_page(struct page_snapshot *ps, const struct page *page)
>   {
> -	unsigned long head, nr_pages = 1;
> +	unsigned long info, nr_pages = 1;
>   	struct folio *foliop;
>   	int loops = 5;
>   
> @@ -1244,8 +1244,8 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
>   again:
>   	memset(&ps->folio_snapshot, 0, sizeof(struct folio));
>   	memcpy(&ps->page_snapshot, page, sizeof(*page));
> -	head = ps->page_snapshot.compound_info;
> -	if ((head & 1) == 0) {
> +	info = ps->page_snapshot.compound_info;
> +	if ((info & 1) == 0) {
>   		ps->idx = 0;
>   		foliop = (struct folio *)&ps->page_snapshot;
>   		if (!folio_test_large(foliop)) {
> @@ -1256,7 +1256,15 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
>   		}
>   		foliop = (struct folio *)page;
>   	} else {
> -		foliop = (struct folio *)(head - 1);
> +		/* See compound_head() */
> +		if (is_power_of_2(sizeof(struct page))) {
> +			unsigned long p = (unsigned long)page;
> +
> +			foliop = (struct folio *)(p & info);
> +		} else {
> +			foliop = (struct folio *)(info - 1);
> +		}
> +
>   		ps->idx = folio_page_idx(foliop, page);
>   	}
>   



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 08/14] mm/hugetlb: Refactor code around vmemmap_walk
  2025-12-18 15:09 ` [PATCHv2 08/14] mm/hugetlb: Refactor code around vmemmap_walk Kiryl Shutsemau
@ 2025-12-22  5:54   ` Muchun Song
  2025-12-22 15:00     ` Kiryl Shutsemau
  0 siblings, 1 reply; 43+ messages in thread
From: Muchun Song @ 2025-12-22  5:54 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden



On 2025/12/18 23:09, Kiryl Shutsemau wrote:
> To prepare for removing fake head pages, the vmemmap_walk code is being reworked.
>
> The reuse_page and reuse_addr variables are being eliminated. There will
> no longer be an expectation regarding the reuse address in relation to
> the operated range. Instead, the caller will provide head and tail
> vmemmap pages, along with the vmemmap_start address where the head page
> is located.
>
> Currently, vmemmap_head and vmemmap_tail are set to the same page, but
> this will change in the future.
>
> The only functional change is that __hugetlb_vmemmap_optimize_folio()
> will abandon optimization if memory allocation fails.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
>   mm/hugetlb_vmemmap.c | 198 ++++++++++++++++++-------------------------
>   1 file changed, 83 insertions(+), 115 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index ba0fb1b6a5a8..d18e7475cf95 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -24,8 +24,9 @@
>    *
>    * @remap_pte:		called for each lowest-level entry (PTE).
>    * @nr_walked:		the number of walked pte.
> - * @reuse_page:		the page which is reused for the tail vmemmap pages.
> - * @reuse_addr:		the virtual address of the @reuse_page page.
> + * @vmemmap_start:	the start of vmemmap range, where head page is located
> + * @vmemmap_head:	the page to be installed as first in the vmemmap range
> + * @vmemmap_tail:	the page to be installed as non-first in the vmemmap range
>    * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
>    *			or is mapped from.
>    * @flags:		used to modify behavior in vmemmap page table walking
> @@ -34,11 +35,14 @@
>   struct vmemmap_remap_walk {
>   	void			(*remap_pte)(pte_t *pte, unsigned long addr,
>   					     struct vmemmap_remap_walk *walk);
> +
>   	unsigned long		nr_walked;
> -	struct page		*reuse_page;
> -	unsigned long		reuse_addr;
> +	unsigned long		vmemmap_start;
> +	struct page		*vmemmap_head;
> +	struct page		*vmemmap_tail;
>   	struct list_head	*vmemmap_pages;
>   
> +
>   /* Skip the TLB flush when we split the PMD */
>   #define VMEMMAP_SPLIT_NO_TLB_FLUSH	BIT(0)
>   /* Skip the TLB flush when we remap the PTE */
> @@ -140,14 +144,7 @@ static int vmemmap_pte_entry(pte_t *pte, unsigned long addr,
>   {
>   	struct vmemmap_remap_walk *vmemmap_walk = walk->private;
>   
> -	/*
> -	 * The reuse_page is found 'first' in page table walking before
> -	 * starting remapping.
> -	 */
> -	if (!vmemmap_walk->reuse_page)
> -		vmemmap_walk->reuse_page = pte_page(ptep_get(pte));
> -	else
> -		vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
> +	vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
>   	vmemmap_walk->nr_walked++;
>   
>   	return 0;
> @@ -207,18 +204,12 @@ static void free_vmemmap_page_list(struct list_head *list)
>   static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>   			      struct vmemmap_remap_walk *walk)
>   {
> -	/*
> -	 * Remap the tail pages as read-only to catch illegal write operation
> -	 * to the tail pages.
> -	 */
> -	pgprot_t pgprot = PAGE_KERNEL_RO;
>   	struct page *page = pte_page(ptep_get(pte));
>   	pte_t entry;
>   
>   	/* Remapping the head page requires r/w */
> -	if (unlikely(addr == walk->reuse_addr)) {
> -		pgprot = PAGE_KERNEL;
> -		list_del(&walk->reuse_page->lru);
> +	if (unlikely(addr == walk->vmemmap_start)) {
> +		list_del(&walk->vmemmap_head->lru);
>   
>   		/*
>   		 * Makes sure that preceding stores to the page contents from
> @@ -226,9 +217,16 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>   		 * write.
>   		 */
>   		smp_wmb();
> +
> +		entry = mk_pte(walk->vmemmap_head, PAGE_KERNEL);
> +	} else {
> +		/*
> +		 * Remap the tail pages as read-only to catch illegal write
> +		 * operation to the tail pages.
> +		 */
> +		entry = mk_pte(walk->vmemmap_tail, PAGE_KERNEL_RO);
>   	}
>   
> -	entry = mk_pte(walk->reuse_page, pgprot);
>   	list_add(&page->lru, walk->vmemmap_pages);
>   	set_pte_at(&init_mm, addr, pte, entry);
>   }
> @@ -255,16 +253,13 @@ static inline void reset_struct_pages(struct page *start)
>   static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>   				struct vmemmap_remap_walk *walk)
>   {
> -	pgprot_t pgprot = PAGE_KERNEL;
>   	struct page *page;
>   	void *to;
>   
> -	BUG_ON(pte_page(ptep_get(pte)) != walk->reuse_page);
> -
>   	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
>   	list_del(&page->lru);
>   	to = page_to_virt(page);
> -	copy_page(to, (void *)walk->reuse_addr);
> +	copy_page(to, (void *)walk->vmemmap_start);
>   	reset_struct_pages(to);
>   
>   	/*
> @@ -272,7 +267,7 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>   	 * before the set_pte_at() write.
>   	 */
>   	smp_wmb();
> -	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> +	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
>   }
>   
>   /**
> @@ -282,33 +277,29 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>    *             to remap.
>    * @end:       end address of the vmemmap virtual address range that we want to
>    *             remap.
> - * @reuse:     reuse address.
> - *
>    * Return: %0 on success, negative error code otherwise.
>    */
> -static int vmemmap_remap_split(unsigned long start, unsigned long end,
> -			       unsigned long reuse)
> +static int vmemmap_remap_split(unsigned long start, unsigned long end)
>   {
>   	struct vmemmap_remap_walk walk = {
>   		.remap_pte	= NULL,
> +		.vmemmap_start	= start,
>   		.flags		= VMEMMAP_SPLIT_NO_TLB_FLUSH,
>   	};
>   
> -	/* See the comment in the vmemmap_remap_free(). */
> -	BUG_ON(start - reuse != PAGE_SIZE);
> -
> -	return vmemmap_remap_range(reuse, end, &walk);
> +	return vmemmap_remap_range(start, end, &walk);
>   }
>   
>   /**
>    * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> - *			to the page which @reuse is mapped to, then free vmemmap
> - *			which the range are mapped to.
> + *			to use @vmemmap_head/tail, then free vmemmap which
> + *			the range are mapped to.
>    * @start:	start address of the vmemmap virtual address range that we want
>    *		to remap.
>    * @end:	end address of the vmemmap virtual address range that we want to
>    *		remap.
> - * @reuse:	reuse address.
> + * @vmemmap_head: the page to be installed as first in the vmemmap range
> + * @vmemmap_tail: the page to be installed as non-first in the vmemmap range
>    * @vmemmap_pages: list to deposit vmemmap pages to be freed.  It is callers
>    *		responsibility to free pages.
>    * @flags:	modifications to vmemmap_remap_walk flags
> @@ -316,69 +307,40 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
>    * Return: %0 on success, negative error code otherwise.
>    */
>   static int vmemmap_remap_free(unsigned long start, unsigned long end,
> -			      unsigned long reuse,
> +			      struct page *vmemmap_head,
> +			      struct page *vmemmap_tail,
>   			      struct list_head *vmemmap_pages,
>   			      unsigned long flags)
>   {
>   	int ret;
>   	struct vmemmap_remap_walk walk = {
>   		.remap_pte	= vmemmap_remap_pte,
> -		.reuse_addr	= reuse,
> +		.vmemmap_start	= start,
> +		.vmemmap_head	= vmemmap_head,
> +		.vmemmap_tail	= vmemmap_tail,
>   		.vmemmap_pages	= vmemmap_pages,
>   		.flags		= flags,
>   	};
> -	int nid = page_to_nid((struct page *)reuse);
> -	gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
> +
> +	ret = vmemmap_remap_range(start, end, &walk);
> +	if (!ret || !walk.nr_walked)
> +		return ret;
> +
> +	end = start + walk.nr_walked * PAGE_SIZE;
>   
>   	/*
> -	 * Allocate a new head vmemmap page to avoid breaking a contiguous
> -	 * block of struct page memory when freeing it back to page allocator
> -	 * in free_vmemmap_page_list(). This will allow the likely contiguous
> -	 * struct page backing memory to be kept contiguous and allowing for
> -	 * more allocations of hugepages. Fallback to the currently
> -	 * mapped head page in case should it fail to allocate.
> +	 * vmemmap_pages contains pages from the previous vmemmap_remap_range()
> +	 * call which failed.  These are pages which were removed from
> +	 * the vmemmap. They will be restored in the following call.
>   	 */
> -	walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
> -	if (walk.reuse_page) {
> -		copy_page(page_to_virt(walk.reuse_page),
> -			  (void *)walk.reuse_addr);
> -		list_add(&walk.reuse_page->lru, vmemmap_pages);
> -		memmap_pages_add(1);
> -	}
> +	walk = (struct vmemmap_remap_walk) {
> +		.remap_pte	= vmemmap_restore_pte,
> +		.vmemmap_start	= start,
> +		.vmemmap_pages	= vmemmap_pages,
> +		.flags		= 0,
> +	};
>   
> -	/*
> -	 * In order to make remapping routine most efficient for the huge pages,
> -	 * the routine of vmemmap page table walking has the following rules
> -	 * (see more details from the vmemmap_pte_range()):
> -	 *
> -	 * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
> -	 *   should be continuous.
> -	 * - The @reuse address is part of the range [@reuse, @end) that we are
> -	 *   walking which is passed to vmemmap_remap_range().
> -	 * - The @reuse address is the first in the complete range.
> -	 *
> -	 * So we need to make sure that @start and @reuse meet the above rules.
> -	 */
> -	BUG_ON(start - reuse != PAGE_SIZE);
> -
> -	ret = vmemmap_remap_range(reuse, end, &walk);
> -	if (ret && walk.nr_walked) {
> -		end = reuse + walk.nr_walked * PAGE_SIZE;
> -		/*
> -		 * vmemmap_pages contains pages from the previous
> -		 * vmemmap_remap_range call which failed.  These
> -		 * are pages which were removed from the vmemmap.
> -		 * They will be restored in the following call.
> -		 */
> -		walk = (struct vmemmap_remap_walk) {
> -			.remap_pte	= vmemmap_restore_pte,
> -			.reuse_addr	= reuse,
> -			.vmemmap_pages	= vmemmap_pages,
> -			.flags		= 0,
> -		};
> -
> -		vmemmap_remap_range(reuse, end, &walk);
> -	}
> +	vmemmap_remap_range(start + PAGE_SIZE, end, &walk);

The reason we previously passed the "start" address
was to perform a TLB flush within that address range.
So the start address is still necessary.
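
For illustration, a minimal sketch of one way to keep @start covered by the
flush (this is not from the patch and assumes flush_tlb_kernel_range() is
acceptable in this error path):

	vmemmap_remap_range(start + PAGE_SIZE, end, &walk);
	/* also shoot down the stale mapping of the head page at @start */
	flush_tlb_kernel_range(start, end);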

>   
>   	return ret;
>   }
> @@ -415,29 +377,27 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
>    *		to remap.
>    * @end:	end address of the vmemmap virtual address range that we want to
>    *		remap.
> - * @reuse:	reuse address.
>    * @flags:	modifications to vmemmap_remap_walk flags
>    *
>    * Return: %0 on success, negative error code otherwise.
>    */
>   static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
> -			       unsigned long reuse, unsigned long flags)
> +			       unsigned long flags)
>   {
>   	LIST_HEAD(vmemmap_pages);
>   	struct vmemmap_remap_walk walk = {
>   		.remap_pte	= vmemmap_restore_pte,
> -		.reuse_addr	= reuse,
> +		.vmemmap_start	= start,
>   		.vmemmap_pages	= &vmemmap_pages,
>   		.flags		= flags,
>   	};
>   
> -	/* See the comment in the vmemmap_remap_free(). */
> -	BUG_ON(start - reuse != PAGE_SIZE);
> +	start += HUGETLB_VMEMMAP_RESERVE_SIZE;
>   
>   	if (alloc_vmemmap_page_list(start, end, &vmemmap_pages))
>   		return -ENOMEM;
>   
> -	return vmemmap_remap_range(reuse, end, &walk);
> +	return vmemmap_remap_range(start, end, &walk);
>   }
>   
>   DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
> @@ -454,8 +414,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
>   					   struct folio *folio, unsigned long flags)
>   {
>   	int ret;
> -	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
> -	unsigned long vmemmap_reuse;
> +	unsigned long vmemmap_start, vmemmap_end;
>   
>   	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
>   	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
> @@ -466,18 +425,16 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
>   	if (flags & VMEMMAP_SYNCHRONIZE_RCU)
>   		synchronize_rcu();
>   
> +	vmemmap_start	= (unsigned long)folio;
>   	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
> -	vmemmap_reuse	= vmemmap_start;
> -	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
>   
>   	/*
>   	 * The pages which the vmemmap virtual address range [@vmemmap_start,
> -	 * @vmemmap_end) are mapped to are freed to the buddy allocator, and
> -	 * the range is mapped to the page which @vmemmap_reuse is mapped to.
> +	 * @vmemmap_end) are mapped to are freed to the buddy allocator.
>   	 * When a HugeTLB page is freed to the buddy allocator, previously
>   	 * discarded vmemmap pages must be allocated and remapping.
>   	 */
> -	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
> +	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, flags);
>   	if (!ret) {
>   		folio_clear_hugetlb_vmemmap_optimized(folio);
>   		static_branch_dec(&hugetlb_optimize_vmemmap_key);
> @@ -565,9 +522,9 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
>   					    struct list_head *vmemmap_pages,
>   					    unsigned long flags)
>   {
> -	int ret = 0;
> -	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
> -	unsigned long vmemmap_reuse;
> +	unsigned long vmemmap_start, vmemmap_end;
> +	struct page *vmemmap_head, *vmemmap_tail;
> +	int nid, ret = 0;
>   
>   	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
>   	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
> @@ -592,18 +549,31 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
>   	 */
>   	folio_set_hugetlb_vmemmap_optimized(folio);
>   
> +	nid = folio_nid(folio);
> +	vmemmap_head = alloc_pages_node(nid, GFP_KERNEL, 0);

Why did you choose to change the gfp mask (previously it was
GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN)?

> +
> +	if (!vmemmap_head) {
> +		ret = -ENOMEM;

Why did you choose to change the allocation-failure
behavior? Replacing the head page isn’t mandatory;
it’s only nice-to-have.
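
For illustration, a sketch of keeping the old behaviour (old gfp mask, keep
optimizing when the allocation fails). How the walk would then skip the head
PTE is an assumption on my side, not something the patch implements:

	vmemmap_head = alloc_pages_node(nid, GFP_KERNEL | __GFP_NORETRY |
					__GFP_NOWARN, 0);
	if (vmemmap_head) {
		copy_page(page_to_virt(vmemmap_head), folio);
		list_add(&vmemmap_head->lru, vmemmap_pages);
		memmap_pages_add(1);
	}
	/*
	 * Assumption: a NULL vmemmap_head means "keep the currently mapped
	 * head page"; vmemmap_remap_pte() would have to leave the PTE at
	 * @vmemmap_start alone in that case instead of the caller bailing
	 * out with -ENOMEM.
	 */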

> +		goto out;
> +	}
> +
> +	copy_page(page_to_virt(vmemmap_head), folio);
> +	list_add(&vmemmap_head->lru, vmemmap_pages);
> +	memmap_pages_add(1);
> +
> +	vmemmap_tail	= vmemmap_head;
> +	vmemmap_start	= (unsigned long)folio;
>   	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
> -	vmemmap_reuse	= vmemmap_start;
> -	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
>   
>   	/*
> -	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
> -	 * to the page which @vmemmap_reuse is mapped to.  Add pages previously
> -	 * mapping the range to vmemmap_pages list so that they can be freed by
> -	 * the caller.
> +	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end).
> +	 * Add pages previously mapping the range to vmemmap_pages list so that
> +	 * they can be freed by the caller.
>   	 */
> -	ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse,
> +	ret = vmemmap_remap_free(vmemmap_start, vmemmap_end,
> +				 vmemmap_head, vmemmap_tail,
>   				 vmemmap_pages, flags);
> +out:
>   	if (ret) {
>   		static_branch_dec(&hugetlb_optimize_vmemmap_key);
>   		folio_clear_hugetlb_vmemmap_optimized(folio);
> @@ -632,21 +602,19 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
>   
>   static int hugetlb_vmemmap_split_folio(const struct hstate *h, struct folio *folio)
>   {
> -	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
> -	unsigned long vmemmap_reuse;
> +	unsigned long vmemmap_start, vmemmap_end;
>   
>   	if (!vmemmap_should_optimize_folio(h, folio))
>   		return 0;
>   
> +	vmemmap_start	= (unsigned long)folio;
>   	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
> -	vmemmap_reuse	= vmemmap_start;
> -	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
>   
>   	/*
>   	 * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
>   	 * @vmemmap_end]
>   	 */
> -	return vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
> +	return vmemmap_remap_split(vmemmap_start, vmemmap_end);
>   }
>   
>   static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 10/14] mm: Drop fake head checks
  2025-12-18 15:09 ` [PATCHv2 10/14] mm: Drop fake head checks Kiryl Shutsemau
@ 2025-12-22  5:56   ` Muchun Song
  0 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2025-12-22  5:56 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc



> On Dec 18, 2025, at 23:09, Kiryl Shutsemau <kas@kernel.org> wrote:
> 
> With fake head pages eliminated in the previous commit, remove the
> supporting infrastructure:
> 
>  - page_fixed_fake_head(): no longer needed to detect fake heads;
>  - page_is_fake_head(): no longer needed;
>  - page_count_writable(): no longer needed for RCU protection;
>  - RCU read_lock in page_ref_add_unless(): no longer needed;
> 
> This substantially simplifies compound_head() and page_ref_add_unless(),
> removing both branches and RCU overhead from these hot paths.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

Reviewed-by: Muchun Song <muchun.song@linux.dev>

Thanks.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 11/14] hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU
  2025-12-18 15:09 ` [PATCHv2 11/14] hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU Kiryl Shutsemau
@ 2025-12-22  6:00   ` Muchun Song
  0 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2025-12-22  6:00 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc



> On Dec 18, 2025, at 23:09, Kiryl Shutsemau <kas@kernel.org> wrote:
> 
> The VMEMMAP_SYNCHRONIZE_RCU flag triggered synchronize_rcu() calls to
> prevent a race between HVO remapping and page_ref_add_unless(). The
> race could occur when a speculative PFN walker tried to modify the
> refcount on a struct page that was in the process of being remapped
> to a fake head.
> 
> With fake heads eliminated, page_ref_add_unless() no longer needs RCU
> protection.
> 
> Remove the flag and synchronize_rcu() calls.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

Reviewed-by: Muchun Song <muchun.song@linux.dev>

Thanks.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 12/14] mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key
  2025-12-18 15:09 ` [PATCHv2 12/14] mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key Kiryl Shutsemau
@ 2025-12-22  6:03   ` Muchun Song
  0 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2025-12-22  6:03 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc



> On Dec 18, 2025, at 23:09, Kiryl Shutsemau <kas@kernel.org> wrote:
> 
> The hugetlb_optimize_vmemmap_key static key was used to guard fake head
> detection in compound_head() and related functions. It allowed skipping
> the fake head checks entirely when HVO was not in use.
> 
> With fake heads eliminated and the detection code removed, the static
> key serves no purpose. Remove its definition and all increment/decrement
> calls.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

Reviewed-by: Muchun Song <muchun.song@linux.dev>

Thanks.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 14/14] hugetlb: Update vmemmap_dedup.rst
  2025-12-18 15:09 ` [PATCHv2 14/14] hugetlb: Update vmemmap_dedup.rst Kiryl Shutsemau
@ 2025-12-22  6:20   ` Muchun Song
  0 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2025-12-22  6:20 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden



On 2025/12/18 23:09, Kiryl Shutsemau wrote:
> Update the documentation regarding vmemmap optimization for hugetlb to
> reflect the changes in how the kernel maps the tail pages.
>
> Fake heads no longer exist. Remove their description.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
>   Documentation/mm/vmemmap_dedup.rst | 60 +++++++++++++-----------------
>   1 file changed, 26 insertions(+), 34 deletions(-)
>
> diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
> index 1863d88d2dcb..a0c4c79d6922 100644
> --- a/Documentation/mm/vmemmap_dedup.rst
> +++ b/Documentation/mm/vmemmap_dedup.rst
> @@ -124,33 +124,35 @@ Here is how things look before optimization::
>    |           |
>    +-----------+
>   
> -The value of page->compound_info is the same for all tail pages. The first
> -page of ``struct page`` (page 0) associated with the HugeTLB page contains the 4
> -``struct page`` necessary to describe the HugeTLB. The only use of the remaining
> -pages of ``struct page`` (page 1 to page 7) is to point to page->compound_info.
> -Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of ``struct page``
> -will be used for each HugeTLB page. This will allow us to free the remaining
> -7 pages to the buddy allocator.
> +The first page of ``struct page`` (page 0) associated with the HugeTLB page
> +contains the 4 ``struct page`` necessary to describe the HugeTLB. The remaining
> +pages of ``struct page`` (page 1 to page 7) are tail pages.
> +
> +The optimization is only applied when the size of the struct page is a power-of-2
> +In this case, all tail pages of the same order are identical. See
> +compound_head(). This allows us to remap the tail pages of the vmemmap to a
> +shared, read-only page. The head page is also remapped to a new page. This
> +allows the original vmemmap pages to be freed.

Replacing the head page is nice-to-have, so I think the details of
it should not be mentioned here.

>   
>   Here is how things look after remapping::
>   
> -    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
> - +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> - |           |                     |     0     | -------------> |     0     |
> - |           |                     +-----------+                +-----------+
> - |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
> - |           |                     +-----------+                  | | | | | |
> - |           |                     |     2     | -----------------+ | | | | |
> - |           |                     +-----------+                    | | | | |
> - |           |                     |     3     | -------------------+ | | | |
> - |           |                     +-----------+                      | | | |
> - |           |                     |     4     | ---------------------+ | | |
> - |    PMD    |                     +-----------+                        | | |
> - |   level   |                     |     5     | -----------------------+ | |
> - |  mapping  |                     +-----------+                          | |
> - |           |                     |     6     | -------------------------+ |
> - |           |                     +-----------+                            |
> - |           |                     |     7     | ---------------------------+
> +    HugeTLB                  struct pages(8 pages)                 page frame
> + +-----------+ ---virt_to_page---> +-----------+   mapping to   +----------------+
> + |           |                     |     0     | -------------> |       0        |
> + |           |                     +-----------+                +----------------+
> + |           |                     |     1     | ------┐
> + |           |                     +-----------+       |
> + |           |                     |     2     | ------┼        +----------------+
> + |           |                     +-----------+       |        | vmemmap_tail   |
> + |           |                     |     3     | ------┼------> | shared for the |
> + |           |                     +-----------+       |        | struct hstate  |

I suggest using the following wording (since struct hstate and vmemmap_tail
are somewhat code-level implementation details).

     A single, per-node page frame shared among all hugepages of the same size

Thanks.
> + |           |                     |     4     | ------┼        +----------------+
> + |           |                     +-----------+       |
> + |           |                     |     5     | ------┼
> + |    PMD    |                     +-----------+       |
> + |   level   |                     |     6     | ------┼
> + |  mapping  |                     +-----------+       |
> + |           |                     |     7     | ------┘
>    |           |                     +-----------+
>    |           |
>    |           |
> @@ -172,16 +174,6 @@ The contiguous bit is used to increase the mapping size at the pmd and pte
>   (last) level. So this type of HugeTLB page can be optimized only when its
>   size of the ``struct page`` structs is greater than **1** page.
>   
> -Notice: The head vmemmap page is not freed to the buddy allocator and all
> -tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
> -more than one ``struct page`` struct with ``PG_head`` (e.g. 8 per 2 MB HugeTLB
> -page) associated with each HugeTLB page. The ``compound_head()`` can handle
> -this correctly. There is only **one** head ``struct page``, the tail
> -``struct page`` with ``PG_head`` are fake head ``struct page``.  We need an
> -approach to distinguish between those two different types of ``struct page`` so
> -that ``compound_head()`` can return the real head ``struct page`` when the
> -parameter is the tail ``struct page`` but with ``PG_head``.
> -
>   Device DAX
>   ==========
>   



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 13/14] mm: Remove the branch from compound_head()
  2025-12-18 15:09 ` [PATCHv2 13/14] mm: Remove the branch from compound_head() Kiryl Shutsemau
@ 2025-12-22  6:30   ` Muchun Song
  0 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2025-12-22  6:30 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc



> On Dec 18, 2025, at 23:09, Kiryl Shutsemau <kas@kernel.org> wrote:
> 
> The compound_head() function is a hot path. For example, the zap path
> calls it for every leaf page table entry.
> 
> Rewrite the helper function in a branchless manner to eliminate the risk
> of CPU branch misprediction.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

Reviewed-by: Muchun Song <muchun.song@linux.dev>

Thanks.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 06/14] mm: Rework compound_head() for power-of-2 sizeof(struct page)
  2025-12-18 15:09 ` [PATCHv2 06/14] mm: Rework compound_head() for power-of-2 sizeof(struct page) Kiryl Shutsemau
  2025-12-22  3:20   ` Muchun Song
@ 2025-12-22  7:57   ` Muchun Song
  2025-12-22  9:45     ` Muchun Song
  1 sibling, 1 reply; 43+ messages in thread
From: Muchun Song @ 2025-12-22  7:57 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc



> On Dec 18, 2025, at 23:09, Kiryl Shutsemau <kas@kernel.org> wrote:
> 
> For tail pages, the kernel uses the 'compound_info' field to get to the
> head page. The bit 0 of the field indicates whether the page is a
> tail page, and if set, the remaining bits represent a pointer to the
> head page.
> 
> For cases when size of struct page is power-of-2, change the encoding of
> compound_info to store a mask that can be applied to the virtual address
> of the tail page in order to access the head page. It is possible
> because struct page of the head page is naturally aligned with regards
> to order of the page.
> 
> The significant impact of this modification is that all tail pages of
> the same order will now have identical 'compound_info', regardless of
> the compound page they are associated with. This paves the way for
> eliminating fake heads.
> 
> The HugeTLB Vmemmap Optimization (HVO) creates fake heads and it is only
> applied when the sizeof(struct page) is power-of-2. Having identical
> tail pages allows the same page to be mapped into the vmemmap of all
> pages, maintaining memory savings without fake heads.
> 
> If sizeof(struct page) is not power-of-2, there is no functional
> changes.
> 

Forgot to mention, I believe I stated in the previous version that this
mechanism only applies when CONFIG_SPARSEMEM_VMEMMAP is configured.
Therefore, you need to wrap the entire mechanism within CONFIG_SPARSEMEM_VMEMMAP.
For other configurations, it's difficult to guarantee alignment to a very
large size (for example, in the case of CONFIG_SPARSEMEM && !CONFIG_SPARSEMEM_VMEMMAP,
vmemmap allocation uses kvmalloc, which only guarantees PAGE_SIZE alignment
for the returned address).
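
Something along these lines, perhaps (just a sketch; the helper name is
illustrative and not from the patch):

	/* the mask encoding relies on a naturally aligned vmemmap */
	static inline bool compound_info_uses_mask(void)
	{
		return IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) &&
		       is_power_of_2(sizeof(struct page));
	}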




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 02/14] mm/sparse: Check memmap alignment
  2025-12-18 15:09 ` [PATCHv2 02/14] mm/sparse: Check memmap alignment Kiryl Shutsemau
@ 2025-12-22  8:34   ` Muchun Song
  2025-12-22 14:02     ` Kiryl Shutsemau
  0 siblings, 1 reply; 43+ messages in thread
From: Muchun Song @ 2025-12-22  8:34 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden



On 2025/12/18 23:09, Kiryl Shutsemau wrote:
> The upcoming changes in compound_head() require memmap to be naturally
> aligned to the maximum folio size.
>
> Add a warning if it is not.
>
> A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
> kernel is still likely to be functional if this strict check fails.

Different architectures default to 2 MB alignment (mainly to
enable huge mappings), which only accommodates folios up to
128 MB. Yet 1 GB huge pages are still fairly common, so
validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to
miss the most frequent case.

I’m concerned that this might plant a hidden time bomb: it
could detonate at any moment in later code, silently triggering
memory corruption or similar failures. Therefore, I don’t
think a WARNING is a good choice.
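
For reference, the alignment arithmetic, assuming 4 KiB pages and a 64-byte
struct page:

	/*
	 * needed memmap alignment = folio_size / PAGE_SIZE * sizeof(struct page)
	 *
	 *   2 MiB-aligned memmap    -> naturally aligned up to 128 MiB folios
	 *   1 GiB folio             -> needs a  16 MiB-aligned memmap
	 *   16 GiB (MAX_FOLIO_SIZE) -> needs a 256 MiB-aligned memmap
	 */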

>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
>   include/linux/mmzone.h | 1 +
>   mm/sparse.c            | 3 +++
>   2 files changed, 4 insertions(+)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 6cfede39570a..9f44dc760cdc 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -91,6 +91,7 @@
>   #endif
>   
>   #define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)
> +#define MAX_FOLIO_SIZE		(PAGE_SIZE << MAX_FOLIO_ORDER)
>   
>   enum migratetype {
>   	MIGRATE_UNMOVABLE,
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 17c50a6415c2..c5810ff7c6f7 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -600,6 +600,9 @@ void __init sparse_init(void)
>   	BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section)));
>   	memblocks_present();
>   
> +	WARN_ON(!IS_ALIGNED((unsigned long)pfn_to_page(0),
> +			    MAX_FOLIO_SIZE / sizeof(struct page)));
> +
>   	pnum_begin = first_present_section_nr();
>   	nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
>   



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 06/14] mm: Rework compound_head() for power-of-2 sizeof(struct page)
  2025-12-22  7:57   ` Muchun Song
@ 2025-12-22  9:45     ` Muchun Song
  2025-12-22 14:49       ` Kiryl Shutsemau
  0 siblings, 1 reply; 43+ messages in thread
From: Muchun Song @ 2025-12-22  9:45 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc



> On Dec 22, 2025, at 15:57, Muchun Song <muchun.song@linux.dev> wrote:
> 
> 
> 
>> On Dec 18, 2025, at 23:09, Kiryl Shutsemau <kas@kernel.org> wrote:
>> 
>> For tail pages, the kernel uses the 'compound_info' field to get to the
>> head page. The bit 0 of the field indicates whether the page is a
>> tail page, and if set, the remaining bits represent a pointer to the
>> head page.
>> 
>> For cases when size of struct page is power-of-2, change the encoding of
>> compound_info to store a mask that can be applied to the virtual address
>> of the tail page in order to access the head page. It is possible
>> because struct page of the head page is naturally aligned with regards
>> to order of the page.
>> 
>> The significant impact of this modification is that all tail pages of
>> the same order will now have identical 'compound_info', regardless of
>> the compound page they are associated with. This paves the way for
>> eliminating fake heads.
>> 
>> The HugeTLB Vmemmap Optimization (HVO) creates fake heads and it is only
>> applied when the sizeof(struct page) is power-of-2. Having identical
>> tail pages allows the same page to be mapped into the vmemmap of all
>> pages, maintaining memory savings without fake heads.
>> 
>> If sizeof(struct page) is not power-of-2, there is no functional
>> changes.
>> 
> 
> Forgot to mention, I believe I stated in the previous version that this
> mechanism only applies when CONFIG_SPARSEMEM_VMEMMAP is configured.
> Therefore, you need to wrap the entire mechanism within CONFIG_SPARSEMEM_VMEMMAP.
> For other configurations, it's difficult to guarantee alignment to a very
> large size (for example, in the case of CONFIG_SPARSEMEM && !CONFIG_SPARSEMEM_VMEMMAP,
> vmemmap allocation uses kvmalloc, which only guarantees PAGE_SIZE alignment
> for the returned address).

I found that we can call kvmalloc_node_align inside populate_section_memmap (for
the memory hotplug case), so that we can specify the alignment parameter as the
input size. Then, this mechanism can be applied for CONFIG_SPARSEMEM &&
!CONFIG_SPARSEMEM_VMEMMAP.
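
Roughly like this, I suppose (a sketch only; the kvmalloc_node_align()
argument order is assumed here, and memmap_size/nid are placeholders):

	struct page *memmap;

	/* !CONFIG_SPARSEMEM_VMEMMAP hotplug path: align the memmap to its size */
	memmap = kvmalloc_node_align(memmap_size, memmap_size,
				     GFP_KERNEL | __GFP_ZERO, nid);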

For CONFIG_FLATMEM, we also need a similar approach to specify the correct alignment
in alloc_node_mem_map().

Thanks.




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 02/14] mm/sparse: Check memmap alignment
  2025-12-22  8:34   ` Muchun Song
@ 2025-12-22 14:02     ` Kiryl Shutsemau
  2025-12-22 14:18       ` David Hildenbrand (Red Hat)
  2025-12-22 14:49       ` Muchun Song
  0 siblings, 2 replies; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-22 14:02 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden

On Mon, Dec 22, 2025 at 04:34:40PM +0800, Muchun Song wrote:
> 
> 
> On 2025/12/18 23:09, Kiryl Shutsemau wrote:
> > The upcoming changes in compound_head() require memmap to be naturally
> > aligned to the maximum folio size.
> > 
> > Add a warning if it is not.
> > 
> > A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
> > kernel is still likely to be functional if this strict check fails.
> 
> Different architectures default to 2 MB alignment (mainly to
> enable huge mappings), which only accommodates folios up to
> 128 MB. Yet 1 GB huge pages are still fairly common, so
> validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to
> miss the most frequent case.

I don't follow. The 16 GB check is more strict than anything smaller.
How can it miss the most frequent case?

> I’m concerned that this might plant a hidden time bomb: it
> could detonate at any moment in later code, silently triggering
> memory corruption or similar failures. Therefore, I don’t
> think a WARNING is a good choice.

We can upgrade it BUG_ON(), but I want to understand your logic here
first.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 06/14] mm: Rework compound_head() for power-of-2 sizeof(struct page)
  2025-12-22  3:20   ` Muchun Song
@ 2025-12-22 14:03     ` Kiryl Shutsemau
  2025-12-23  8:37       ` Muchun Song
  0 siblings, 1 reply; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-22 14:03 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden

On Mon, Dec 22, 2025 at 11:20:48AM +0800, Muchun Song wrote:
> 
> 
> On 2025/12/18 23:09, Kiryl Shutsemau wrote:
> > For tail pages, the kernel uses the 'compound_info' field to get to the
> > head page. The bit 0 of the field indicates whether the page is a
> > tail page, and if set, the remaining bits represent a pointer to the
> > head page.
> > 
> > For cases when size of struct page is power-of-2, change the encoding of
> > compound_info to store a mask that can be applied to the virtual address
> > of the tail page in order to access the head page. It is possible
> > because struct page of the head page is naturally aligned with regards
> > to order of the page.
> > 
> > The significant impact of this modification is that all tail pages of
> > the same order will now have identical 'compound_info', regardless of
> > the compound page they are associated with. This paves the way for
> > eliminating fake heads.
> > 
> > The HugeTLB Vmemmap Optimization (HVO) creates fake heads and it is only
> > applied when the sizeof(struct page) is power-of-2. Having identical
> > tail pages allows the same page to be mapped into the vmemmap of all
> > pages, maintaining memory savings without fake heads.
> > 
> > If sizeof(struct page) is not power-of-2, there is no functional
> > changes.
> > 
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> 
> Reviewed-by: Muchun Song <muchun.song@linux.dev>
> 
> One nit bellow.
> 
> > ---
> >   include/linux/page-flags.h | 62 +++++++++++++++++++++++++++++++++-----
> >   mm/util.c                  | 16 +++++++---
> >   2 files changed, 66 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 0de7db7efb00..fac5f41b3b27 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -210,6 +210,13 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
> >   	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
> >   		return page;
> > +	/*
> > +	 * Fake heads only exists if size of struct page is power-of-2.
> > +	 * See hugetlb_vmemmap_optimizable_size().
> > +	 */
> > +	if (!is_power_of_2(sizeof(struct page)))
> > +		return page;
> > +
> >   	/*
> >   	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
> >   	 * struct page. The alignment check aims to avoid access the fields (
> > @@ -223,10 +230,14 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
> >   		 * because the @page is a compound page composed with at least
> >   		 * two contiguous pages.
> >   		 */
> > -		unsigned long head = READ_ONCE(page[1].compound_info);
> > +		unsigned long info = READ_ONCE(page[1].compound_info);
> > -		if (likely(head & 1))
> > -			return (const struct page *)(head - 1);
> > +		/* See set_compound_head() */
> > +		if (likely(info & 1)) {
> > +			unsigned long p = (unsigned long)page;
> > +
> > +			return (const struct page *)(p & info);
> > +		}
> >   	}
> >   	return page;
> >   }
> > @@ -281,11 +292,27 @@ static __always_inline int page_is_fake_head(const struct page *page)
> >   static __always_inline unsigned long _compound_head(const struct page *page)
> >   {
> > -	unsigned long head = READ_ONCE(page->compound_info);
> > +	unsigned long info = READ_ONCE(page->compound_info);
> > -	if (unlikely(head & 1))
> > -		return head - 1;
> > -	return (unsigned long)page_fixed_fake_head(page);
> > +	/* Bit 0 encodes PageTail() */
> > +	if (!(info & 1))
> > +		return (unsigned long)page_fixed_fake_head(page);
> > +
> > +	/*
> > +	 * If the size of struct page is not power-of-2, the rest of
> > +	 * compound_info is the pointer to the head page.
> > +	 */
> > +	if (!is_power_of_2(sizeof(struct page)))
> > +		return info - 1;
> > +
> > +	/*
> > +	 * If the size of struct page is power-of-2 the rest of the info
> > +	 * encodes the mask that converts the address of the tail page to
> > +	 * the head page.
> > +	 *
> > +	 * No need to clear bit 0 in the mask as 'page' always has it clear.
> > +	 */
> > +	return (unsigned long)page & info;
> >   }
> >   #define compound_head(page)	((typeof(page))_compound_head(page))
> > @@ -294,7 +321,26 @@ static __always_inline void set_compound_head(struct page *page,
> >   					      const struct page *head,
> >   					      unsigned int order)
> >   {
> > -	WRITE_ONCE(page->compound_info, (unsigned long)head + 1);
> > +	unsigned int shift;
> > +	unsigned long mask;
> > +
> > +	if (!is_power_of_2(sizeof(struct page))) {
> > +		WRITE_ONCE(page->compound_info, (unsigned long)head | 1);
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * If the size of struct page is power-of-2, bits [shift:0] of the
> > +	 * virtual address of compound head are zero.
> > +	 *
> > +	 * Calculate mask that can be applied to the virtual address of
> > +	 * the tail page to get address of the head page.
> > +	 */
> > +	shift = order + order_base_2(sizeof(struct page));
> 
> We already have a macro for order_base_2(sizeof(struct page)),
> that is STRUCT_PAGE_MAX_SHIFT.

I used it before, but the name is obscure and the open-coded version is
easier to follow in my view.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 02/14] mm/sparse: Check memmap alignment
  2025-12-22 14:02     ` Kiryl Shutsemau
@ 2025-12-22 14:18       ` David Hildenbrand (Red Hat)
  2025-12-22 14:52         ` Kiryl Shutsemau
  2025-12-22 14:55         ` Muchun Song
  2025-12-22 14:49       ` Muchun Song
  1 sibling, 2 replies; 43+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-22 14:18 UTC (permalink / raw)
  To: Kiryl Shutsemau, Muchun Song, Matthew Wilcox
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Andrew Morton, Usama Arif, Frank van der Linden

On 12/22/25 15:02, Kiryl Shutsemau wrote:
> On Mon, Dec 22, 2025 at 04:34:40PM +0800, Muchun Song wrote:
>>
>>
>> On 2025/12/18 23:09, Kiryl Shutsemau wrote:
>>> The upcoming changes in compound_head() require memmap to be naturally
>>> aligned to the maximum folio size.
>>>
>>> Add a warning if it is not.
>>>
>>> A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
>>> kernel is still likely to be functional if this strict check fails.
>>
>> Different architectures default to 2 MB alignment (mainly to
>> enable huge mappings), which only accommodates folios up to
>> 128 MB. Yet 1 GB huge pages are still fairly common, so
>> validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to
>> miss the most frequent case.
> 
> I don't follow. The 16 GB check is more strict than anything smaller.
> How can it miss the most frequent case?
> 
>> I’m concerned that this might plant a hidden time bomb: it
>> could detonate at any moment in later code, silently triggering
>> memory corruption or similar failures. Therefore, I don’t
>> think a WARNING is a good choice.
> 
> We can upgrade it BUG_ON(), but I want to understand your logic here
> first.

Definitely no BUG_ON(). I would assume this is something we would find 
early during testing, so even a VM_WARN_ON_ONCE() should be good enough?

This smells like a possible problem, though, as soon as some 
architecture wants to increase the folio size. What would be the 
expected step to ensure the alignment is done properly?

But OTOH, as I raised Willy's work will make all of that here obsolete 
either way, so maybe not worth worrying about that case too much,

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 02/14] mm/sparse: Check memmap alignment
  2025-12-22 14:02     ` Kiryl Shutsemau
  2025-12-22 14:18       ` David Hildenbrand (Red Hat)
@ 2025-12-22 14:49       ` Muchun Song
  1 sibling, 0 replies; 43+ messages in thread
From: Muchun Song @ 2025-12-22 14:49 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden



> On Dec 22, 2025, at 22:03, Kiryl Shutsemau <kas@kernel.org> wrote:
> On Mon, Dec 22, 2025 at 04:34:40PM +0800, Muchun Song wrote:
>> 
>> 
>> On 2025/12/18 23:09, Kiryl Shutsemau wrote:
>>> The upcoming changes in compound_head() require memmap to be naturally
>>> aligned to the maximum folio size.
>>> Add a warning if it is not.
>>> A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
>>> kernel is still likely to be functional if this strict check fails.
>> 
>> Different architectures default to 2 MB alignment (mainly to
>> enable huge mappings), which only accommodates folios up to
>> 128 MB. Yet 1 GB huge pages are still fairly common, so
>> validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to
>> miss the most frequent case.
> 
> I don't follow. The 16 GB check is more strict than anything smaller.
> How can it miss the most frequent case?

Sorry, I didn’t make myself clear. What I meant
is that if this warning triggers, it implies the
largest-sized folio isn’t properly aligned, and
the 1 GB folios are probably mis-aligned too.
Your commit message says
“MAX_FOLIO_ORDER is very rarely used,” but
I want to stress that 1 GB folios are actually
 common. If they’re also mis-aligned, we’re
quietly planting a land-mine. That’s why I’m
worried a mere warning isn’t enough—it
leaves a latent bug in the system.

If there’s a problem, we should stop right
here—this is the earliest place where it will surface.

As David assumed, if we expect to catch the
problem during testing, then I think VM_BUG_ON
would be more appropriate.
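
i.e. something along the lines of (a sketch, reusing the check from the patch):

	VM_BUG_ON(!IS_ALIGNED((unsigned long)pfn_to_page(0),
			      MAX_FOLIO_SIZE / sizeof(struct page)));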

Thanks.

> 
>> I’m concerned that this might plant a hidden time bomb: it
>> could detonate at any moment in later code, silently triggering
>> memory corruption or similar failures. Therefore, I don’t
>> think a WARNING is a good choice.
> 
> We can upgrade it BUG_ON(), but I want to understand your logic here
> first.
> 
> --
>  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 06/14] mm: Rework compound_head() for power-of-2 sizeof(struct page)
  2025-12-22  9:45     ` Muchun Song
@ 2025-12-22 14:49       ` Kiryl Shutsemau
  0 siblings, 0 replies; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-22 14:49 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc

On Mon, Dec 22, 2025 at 05:45:16PM +0800, Muchun Song wrote:
> 
> 
> > On Dec 22, 2025, at 15:57, Muchun Song <muchun.song@linux.dev> wrote:
> > 
> > 
> > 
> >> On Dec 18, 2025, at 23:09, Kiryl Shutsemau <kas@kernel.org> wrote:
> >> 
> >> For tail pages, the kernel uses the 'compound_info' field to get to the
> >> head page. The bit 0 of the field indicates whether the page is a
> >> tail page, and if set, the remaining bits represent a pointer to the
> >> head page.
> >> 
> >> For cases when size of struct page is power-of-2, change the encoding of
> >> compound_info to store a mask that can be applied to the virtual address
> >> of the tail page in order to access the head page. It is possible
> >> because struct page of the head page is naturally aligned with regards
> >> to order of the page.
> >> 
> >> The significant impact of this modification is that all tail pages of
> >> the same order will now have identical 'compound_info', regardless of
> >> the compound page they are associated with. This paves the way for
> >> eliminating fake heads.
> >> 
> >> The HugeTLB Vmemmap Optimization (HVO) creates fake heads and it is only
> >> applied when the sizeof(struct page) is power-of-2. Having identical
> >> tail pages allows the same page to be mapped into the vmemmap of all
> >> pages, maintaining memory savings without fake heads.
> >> 
> >> If sizeof(struct page) is not power-of-2, there is no functional
> >> changes.
> >> 
> > 
> > Forgot to mention, I believe I stated in the previous version that this
> > mechanism only applies when CONFIG_SPARSEMEM_VMEMMAP is configured.
> > Therefore, you need to wrap the entire mechanism within CONFIG_SPARSEMEM_VMEMMAP.
> > For other configurations, it's difficult to guarantee alignment to a very
> > large size (for example, in the case of CONFIG_SPARSEMEM && !CONFIG_SPARSEMEM_VMEMMAP,
> > vmemmap allocation uses kvmalloc, which only guarantees PAGE_SIZE alignment
> > for the returned address).
> 
> I found that we can call kvmalloc_node_align inside populate_section_memmap (for
> the memory hotplug case), so that we can specify the alignment parameter as the
> input size. Then, this mechanism can be applied for CONFIG_SPARSEMEM &&
> !CONFIG_SPARSEMEM_VMEMMAP.
> 
> For CONFIG_FLATMEM, we also need a similar approach to specify the correct alignment
> in alloc_node_mem_map().

I guess I will need to invest some time to make a test setup with
!VMEMMAP and FLATMEM.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 02/14] mm/sparse: Check memmap alignment
  2025-12-22 14:18       ` David Hildenbrand (Red Hat)
@ 2025-12-22 14:52         ` Kiryl Shutsemau
  2025-12-22 14:59           ` Muchun Song
  2025-12-22 14:55         ` Muchun Song
  1 sibling, 1 reply; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-22 14:52 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), Wilcox
  Cc: Muchun Song, Oscar Salvador, Mike Rapoport, Vlastimil Babka,
	Lorenzo Stoakes, Zi Yan, Baoquan He, Michal Hocko,
	Johannes Weiner, Jonathan Corbet, kernel-team, linux-mm,
	linux-kernel, linux-doc, Andrew Morton, Usama Arif,
	Frank van der Linden

On Mon, Dec 22, 2025 at 03:18:29PM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/22/25 15:02, Kiryl Shutsemau wrote:
> > On Mon, Dec 22, 2025 at 04:34:40PM +0800, Muchun Song wrote:
> > > 
> > > 
> > > On 2025/12/18 23:09, Kiryl Shutsemau wrote:
> > > > The upcoming changes in compound_head() require memmap to be naturally
> > > > aligned to the maximum folio size.
> > > > 
> > > > Add a warning if it is not.
> > > > 
> > > > A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
> > > > kernel is still likely to be functional if this strict check fails.
> > > 
> > > Different architectures default to 2 MB alignment (mainly to
> > > enable huge mappings), which only accommodates folios up to
> > > 128 MB. Yet 1 GB huge pages are still fairly common, so
> > > validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to
> > > miss the most frequent case.
> > 
> > I don't follow. 16 GB check is more strict that anything smaller.
> > How can it miss the most frequent case?
> > 
> > > I’m concerned that this might plant a hidden time bomb: it
> > > could detonate at any moment in later code, silently triggering
> > > memory corruption or similar failures. Therefore, I don’t
> > > think a WARNING is a good choice.
> > 
> > We can upgrade it BUG_ON(), but I want to understand your logic here
> > first.
> 
> Definitely no BUG_ON(). I would assume this is something we would find early
> during testing, so even a VM_WARN_ON_ONCE() should be good enough?
> 
> This smells like a possible problem, though, as soon as some architecture
> wants to increase the folio size. What would be the expected step to ensure
> the alignment is done properly?

It depends on memory model and whether the arch has KASLR for memmap.

> But OTOH, as I raised Willy's work will make all of that here obsolete
> either way, so maybe not worth worrying about that case too much,

Willy, what is timeline here?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 02/14] mm/sparse: Check memmap alignment
  2025-12-22 14:18       ` David Hildenbrand (Red Hat)
  2025-12-22 14:52         ` Kiryl Shutsemau
@ 2025-12-22 14:55         ` Muchun Song
  2025-12-23  9:38           ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 43+ messages in thread
From: Muchun Song @ 2025-12-22 14:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Matthew Wilcox, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc, Andrew Morton, Usama Arif,
	Frank van der Linden



> On Dec 22, 2025, at 22:18, David Hildenbrand (Red Hat) <david@kernel.org> wrote:
> 
> On 12/22/25 15:02, Kiryl Shutsemau wrote:
>>> On Mon, Dec 22, 2025 at 04:34:40PM +0800, Muchun Song wrote:
>>> 
>>> 
>>> On 2025/12/18 23:09, Kiryl Shutsemau wrote:
>>>> The upcoming changes in compound_head() require memmap to be naturally
>>>> aligned to the maximum folio size.
>>>> 
>>>> Add a warning if it is not.
>>>> 
>>>> A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
>>>> kernel is still likely to be functional if this strict check fails.
>>> 
>>> Different architectures default to 2 MB alignment (mainly to
>>> enable huge mappings), which only accommodates folios up to
>>> 128 MB. Yet 1 GB huge pages are still fairly common, so
>>> validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to
>>> miss the most frequent case.
>> I don't follow. The 16 GB check is more strict than anything smaller.
>> How can it miss the most frequent case?
>>> I’m concerned that this might plant a hidden time bomb: it
>>> could detonate at any moment in later code, silently triggering
>>> memory corruption or similar failures. Therefore, I don’t
>>> think a WARNING is a good choice.
>> We can upgrade it BUG_ON(), but I want to understand your logic here
>> first.
> 
> Definitely no BUG_ON(). I would assume this is something we would find early during testing, so even a VM_WARN_ON_ONCE() should be good enough?
> 
> This smells like a possible problem, though, as soon as some architecture wants to increase the folio size. What would be the expected step to ensure the alignment is done properly?
> 
> But OTOH, as I raised Willy's work will make all of that here obsolete either way, so maybe not worth worrying about that case too much,

Hi David,

I hope you're doing well. I must admit I have limited knowledge of Willy's work, and I was wondering if you might be kind enough to share any publicly available links where I could learn more about the future direction of this project. I would be truly grateful for your guidance.
Thank you very much in advance.

Best regards,

> 
> --
> Cheers
> 
> David


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 02/14] mm/sparse: Check memmap alignment
  2025-12-22 14:52         ` Kiryl Shutsemau
@ 2025-12-22 14:59           ` Muchun Song
  0 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2025-12-22 14:59 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: David Hildenbrand, Wilcox, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc, Andrew Morton, Usama Arif,
	Frank van der Linden



> On Dec 22, 2025, at 22:52, Kiryl Shutsemau <kas@kernel.org> wrote:
> 
> On Mon, Dec 22, 2025 at 03:18:29PM +0100, David Hildenbrand (Red Hat) wrote:
>>> On 12/22/25 15:02, Kiryl Shutsemau wrote:
>>> On Mon, Dec 22, 2025 at 04:34:40PM +0800, Muchun Song wrote:
>>>> 
>>>> 
>>>> On 2025/12/18 23:09, Kiryl Shutsemau wrote:
>>>>> The upcoming changes in compound_head() require memmap to be naturally
>>>>> aligned to the maximum folio size.
>>>>> 
>>>>> Add a warning if it is not.
>>>>> 
>>>>> A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
>>>>> kernel is still likely to be functional if this strict check fails.
>>>> 
>>>> Different architectures default to 2 MB alignment (mainly to
>>>> enable huge mappings), which only accommodates folios up to
>>>> 128 MB. Yet 1 GB huge pages are still fairly common, so
>>>> validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to
>>>> miss the most frequent case.
>>> 
>>> I don't follow. The 16 GB check is more strict than anything smaller.
>>> How can it miss the most frequent case?
>>> 
>>>> I’m concerned that this might plant a hidden time bomb: it
>>>> could detonate at any moment in later code, silently triggering
>>>> memory corruption or similar failures. Therefore, I don’t
>>>> think a WARNING is a good choice.
>>> 
>>> We can upgrade it BUG_ON(), but I want to understand your logic here
>>> first.
>> 
>> Definitely no BUG_ON(). I would assume this is something we would find early
>> during testing, so even a VM_WARN_ON_ONCE() should be good enough?
>> 
>> This smells like a possible problem, though, as soon as some architecture
>> wants to increase the folio size. What would be the expected step to ensure
>> the alignment is done properly?
> 
> It depends on memory model and whether the arch has KASLR for memmap.

Yes. Theoretically, the most correct approach is
to ensure that the randomly chosen offset at the
KASLR relocation site meets alignment
requirements, and it likely needs to be adapted
for each architecture—sounds rather tedious.
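
A sketch of the idea (purely illustrative, not actual arch code; "entropy"
is a placeholder for whatever random offset the architecture picks):

	/* keep the randomized vmemmap base aligned for the new check */
	vmemmap_base = round_up(vmemmap_base + entropy,
				MAX_FOLIO_SIZE / sizeof(struct page));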

> 
>> But OTOH, as I raised Willy's work will make all of that here obsolete
>> either way, so maybe not worth worrying about that case too much,
> 
> Willy, what is timeline here?
> 
> --
>  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 08/14] mm/hugetlb: Refactor code around vmemmap_walk
  2025-12-22  5:54   ` Muchun Song
@ 2025-12-22 15:00     ` Kiryl Shutsemau
  2025-12-22 15:11       ` Muchun Song
  0 siblings, 1 reply; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-22 15:00 UTC (permalink / raw)
  To: Muchun Song
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden

On Mon, Dec 22, 2025 at 01:54:47PM +0800, Muchun Song wrote:
> 
> 
> On 2025/12/18 23:09, Kiryl Shutsemau wrote:
> > To prepare for removing fake head pages, the vmemmap_walk code is being reworked.
> > 
> > The reuse_page and reuse_addr variables are being eliminated. There will
> > no longer be an expectation regarding the reuse address in relation to
> > the operated range. Instead, the caller will provide head and tail
> > vmemmap pages, along with the vmemmap_start address where the head page
> > is located.
> > 
> > Currently, vmemmap_head and vmemmap_tail are set to the same page, but
> > this will change in the future.
> > 
> > The only functional change is that __hugetlb_vmemmap_optimize_folio()
> > will abandon optimization if memory allocation fails.
> > 
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > ---
> >   mm/hugetlb_vmemmap.c | 198 ++++++++++++++++++-------------------------
> >   1 file changed, 83 insertions(+), 115 deletions(-)
> > 
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index ba0fb1b6a5a8..d18e7475cf95 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -24,8 +24,9 @@
> >    *
> >    * @remap_pte:		called for each lowest-level entry (PTE).
> >    * @nr_walked:		the number of walked pte.
> > - * @reuse_page:		the page which is reused for the tail vmemmap pages.
> > - * @reuse_addr:		the virtual address of the @reuse_page page.
> > + * @vmemmap_start:	the start of vmemmap range, where head page is located
> > + * @vmemmap_head:	the page to be installed as first in the vmemmap range
> > + * @vmemmap_tail:	the page to be installed as non-first in the vmemmap range
> >    * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
> >    *			or is mapped from.
> >    * @flags:		used to modify behavior in vmemmap page table walking
> > @@ -34,11 +35,14 @@
> >   struct vmemmap_remap_walk {
> >   	void			(*remap_pte)(pte_t *pte, unsigned long addr,
> >   					     struct vmemmap_remap_walk *walk);
> > +
> >   	unsigned long		nr_walked;
> > -	struct page		*reuse_page;
> > -	unsigned long		reuse_addr;
> > +	unsigned long		vmemmap_start;
> > +	struct page		*vmemmap_head;
> > +	struct page		*vmemmap_tail;
> >   	struct list_head	*vmemmap_pages;
> > +
> >   /* Skip the TLB flush when we split the PMD */
> >   #define VMEMMAP_SPLIT_NO_TLB_FLUSH	BIT(0)
> >   /* Skip the TLB flush when we remap the PTE */
> > @@ -140,14 +144,7 @@ static int vmemmap_pte_entry(pte_t *pte, unsigned long addr,
> >   {
> >   	struct vmemmap_remap_walk *vmemmap_walk = walk->private;
> > -	/*
> > -	 * The reuse_page is found 'first' in page table walking before
> > -	 * starting remapping.
> > -	 */
> > -	if (!vmemmap_walk->reuse_page)
> > -		vmemmap_walk->reuse_page = pte_page(ptep_get(pte));
> > -	else
> > -		vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
> > +	vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
> >   	vmemmap_walk->nr_walked++;
> >   	return 0;
> > @@ -207,18 +204,12 @@ static void free_vmemmap_page_list(struct list_head *list)
> >   static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> >   			      struct vmemmap_remap_walk *walk)
> >   {
> > -	/*
> > -	 * Remap the tail pages as read-only to catch illegal write operation
> > -	 * to the tail pages.
> > -	 */
> > -	pgprot_t pgprot = PAGE_KERNEL_RO;
> >   	struct page *page = pte_page(ptep_get(pte));
> >   	pte_t entry;
> >   	/* Remapping the head page requires r/w */
> > -	if (unlikely(addr == walk->reuse_addr)) {
> > -		pgprot = PAGE_KERNEL;
> > -		list_del(&walk->reuse_page->lru);
> > +	if (unlikely(addr == walk->vmemmap_start)) {
> > +		list_del(&walk->vmemmap_head->lru);
> >   		/*
> >   		 * Makes sure that preceding stores to the page contents from
> > @@ -226,9 +217,16 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> >   		 * write.
> >   		 */
> >   		smp_wmb();
> > +
> > +		entry = mk_pte(walk->vmemmap_head, PAGE_KERNEL);
> > +	} else {
> > +		/*
> > +		 * Remap the tail pages as read-only to catch illegal write
> > +		 * operation to the tail pages.
> > +		 */
> > +		entry = mk_pte(walk->vmemmap_tail, PAGE_KERNEL_RO);
> >   	}
> > -	entry = mk_pte(walk->reuse_page, pgprot);
> >   	list_add(&page->lru, walk->vmemmap_pages);
> >   	set_pte_at(&init_mm, addr, pte, entry);
> >   }
> > @@ -255,16 +253,13 @@ static inline void reset_struct_pages(struct page *start)
> >   static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> >   				struct vmemmap_remap_walk *walk)
> >   {
> > -	pgprot_t pgprot = PAGE_KERNEL;
> >   	struct page *page;
> >   	void *to;
> > -	BUG_ON(pte_page(ptep_get(pte)) != walk->reuse_page);
> > -
> >   	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> >   	list_del(&page->lru);
> >   	to = page_to_virt(page);
> > -	copy_page(to, (void *)walk->reuse_addr);
> > +	copy_page(to, (void *)walk->vmemmap_start);
> >   	reset_struct_pages(to);
> >   	/*
> > @@ -272,7 +267,7 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> >   	 * before the set_pte_at() write.
> >   	 */
> >   	smp_wmb();
> > -	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> > +	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
> >   }
> >   /**
> > @@ -282,33 +277,29 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> >    *             to remap.
> >    * @end:       end address of the vmemmap virtual address range that we want to
> >    *             remap.
> > - * @reuse:     reuse address.
> > - *
> >    * Return: %0 on success, negative error code otherwise.
> >    */
> > -static int vmemmap_remap_split(unsigned long start, unsigned long end,
> > -			       unsigned long reuse)
> > +static int vmemmap_remap_split(unsigned long start, unsigned long end)
> >   {
> >   	struct vmemmap_remap_walk walk = {
> >   		.remap_pte	= NULL,
> > +		.vmemmap_start	= start,
> >   		.flags		= VMEMMAP_SPLIT_NO_TLB_FLUSH,
> >   	};
> > -	/* See the comment in the vmemmap_remap_free(). */
> > -	BUG_ON(start - reuse != PAGE_SIZE);
> > -
> > -	return vmemmap_remap_range(reuse, end, &walk);
> > +	return vmemmap_remap_range(start, end, &walk);
> >   }
> >   /**
> >    * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> > - *			to the page which @reuse is mapped to, then free vmemmap
> > - *			which the range are mapped to.
> > + *			to use @vmemmap_head/tail, then free vmemmap which
> > + *			the range are mapped to.
> >    * @start:	start address of the vmemmap virtual address range that we want
> >    *		to remap.
> >    * @end:	end address of the vmemmap virtual address range that we want to
> >    *		remap.
> > - * @reuse:	reuse address.
> > + * @vmemmap_head: the page to be installed as first in the vmemmap range
> > + * @vmemmap_tail: the page to be installed as non-first in the vmemmap range
> >    * @vmemmap_pages: list to deposit vmemmap pages to be freed.  It is callers
> >    *		responsibility to free pages.
> >    * @flags:	modifications to vmemmap_remap_walk flags
> > @@ -316,69 +307,40 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
> >    * Return: %0 on success, negative error code otherwise.
> >    */
> >   static int vmemmap_remap_free(unsigned long start, unsigned long end,
> > -			      unsigned long reuse,
> > +			      struct page *vmemmap_head,
> > +			      struct page *vmemmap_tail,
> >   			      struct list_head *vmemmap_pages,
> >   			      unsigned long flags)
> >   {
> >   	int ret;
> >   	struct vmemmap_remap_walk walk = {
> >   		.remap_pte	= vmemmap_remap_pte,
> > -		.reuse_addr	= reuse,
> > +		.vmemmap_start	= start,
> > +		.vmemmap_head	= vmemmap_head,
> > +		.vmemmap_tail	= vmemmap_tail,
> >   		.vmemmap_pages	= vmemmap_pages,
> >   		.flags		= flags,
> >   	};
> > -	int nid = page_to_nid((struct page *)reuse);
> > -	gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
> > +
> > +	ret = vmemmap_remap_range(start, end, &walk);
> > +	if (!ret || !walk.nr_walked)
> > +		return ret;
> > +
> > +	end = start + walk.nr_walked * PAGE_SIZE;
> >   	/*
> > -	 * Allocate a new head vmemmap page to avoid breaking a contiguous
> > -	 * block of struct page memory when freeing it back to page allocator
> > -	 * in free_vmemmap_page_list(). This will allow the likely contiguous
> > -	 * struct page backing memory to be kept contiguous and allowing for
> > -	 * more allocations of hugepages. Fallback to the currently
> > -	 * mapped head page in case should it fail to allocate.
> > +	 * vmemmap_pages contains pages from the previous vmemmap_remap_range()
> > +	 * call which failed.  These are pages which were removed from
> > +	 * the vmemmap. They will be restored in the following call.
> >   	 */
> > -	walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
> > -	if (walk.reuse_page) {
> > -		copy_page(page_to_virt(walk.reuse_page),
> > -			  (void *)walk.reuse_addr);
> > -		list_add(&walk.reuse_page->lru, vmemmap_pages);
> > -		memmap_pages_add(1);
> > -	}
> > +	walk = (struct vmemmap_remap_walk) {
> > +		.remap_pte	= vmemmap_restore_pte,
> > +		.vmemmap_start	= start,
> > +		.vmemmap_pages	= vmemmap_pages,
> > +		.flags		= 0,
> > +	};
> > -	/*
> > -	 * In order to make remapping routine most efficient for the huge pages,
> > -	 * the routine of vmemmap page table walking has the following rules
> > -	 * (see more details from the vmemmap_pte_range()):
> > -	 *
> > -	 * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
> > -	 *   should be continuous.
> > -	 * - The @reuse address is part of the range [@reuse, @end) that we are
> > -	 *   walking which is passed to vmemmap_remap_range().
> > -	 * - The @reuse address is the first in the complete range.
> > -	 *
> > -	 * So we need to make sure that @start and @reuse meet the above rules.
> > -	 */
> > -	BUG_ON(start - reuse != PAGE_SIZE);
> > -
> > -	ret = vmemmap_remap_range(reuse, end, &walk);
> > -	if (ret && walk.nr_walked) {
> > -		end = reuse + walk.nr_walked * PAGE_SIZE;
> > -		/*
> > -		 * vmemmap_pages contains pages from the previous
> > -		 * vmemmap_remap_range call which failed.  These
> > -		 * are pages which were removed from the vmemmap.
> > -		 * They will be restored in the following call.
> > -		 */
> > -		walk = (struct vmemmap_remap_walk) {
> > -			.remap_pte	= vmemmap_restore_pte,
> > -			.reuse_addr	= reuse,
> > -			.vmemmap_pages	= vmemmap_pages,
> > -			.flags		= 0,
> > -		};
> > -
> > -		vmemmap_remap_range(reuse, end, &walk);
> > -	}
> > +	vmemmap_remap_range(start + PAGE_SIZE, end, &walk);
> 
> The reason we previously passed the "start" address
> was to perform a TLB flush within that address range.
> So the start address is still necessary.

Good catch.
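
(For illustration only: one way to keep the flush covering @start while the
restore loop itself begins at @start + PAGE_SIZE. This is a sketch, assuming
the existing VMEMMAP_REMAP_NO_TLB_FLUSH flag is used to suppress the walk's
internal flush; it is not necessarily the fix that will land in v3.)

	walk = (struct vmemmap_remap_walk) {
		.remap_pte	= vmemmap_restore_pte,
		.vmemmap_start	= start,
		.vmemmap_pages	= vmemmap_pages,
		.flags		= VMEMMAP_REMAP_NO_TLB_FLUSH,
	};

	/* Restore PTEs for the tail range only... */
	vmemmap_remap_range(start + PAGE_SIZE, end, &walk);

	/* ...but flush the whole range, so the translation for @start,
	 * rewritten by the failed first pass, is not left stale. */
	flush_tlb_kernel_range(start, end);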

> >   	return ret;
> >   }
> > @@ -415,29 +377,27 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
> >    *		to remap.
> >    * @end:	end address of the vmemmap virtual address range that we want to
> >    *		remap.
> > - * @reuse:	reuse address.
> >    * @flags:	modifications to vmemmap_remap_walk flags
> >    *
> >    * Return: %0 on success, negative error code otherwise.
> >    */
> >   static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
> > -			       unsigned long reuse, unsigned long flags)
> > +			       unsigned long flags)
> >   {
> >   	LIST_HEAD(vmemmap_pages);
> >   	struct vmemmap_remap_walk walk = {
> >   		.remap_pte	= vmemmap_restore_pte,
> > -		.reuse_addr	= reuse,
> > +		.vmemmap_start	= start,
> >   		.vmemmap_pages	= &vmemmap_pages,
> >   		.flags		= flags,
> >   	};
> > -	/* See the comment in the vmemmap_remap_free(). */
> > -	BUG_ON(start - reuse != PAGE_SIZE);
> > +	start += HUGETLB_VMEMMAP_RESERVE_SIZE;
> >   	if (alloc_vmemmap_page_list(start, end, &vmemmap_pages))
> >   		return -ENOMEM;
> > -	return vmemmap_remap_range(reuse, end, &walk);
> > +	return vmemmap_remap_range(start, end, &walk);
> >   }
> >   DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
> > @@ -454,8 +414,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
> >   					   struct folio *folio, unsigned long flags)
> >   {
> >   	int ret;
> > -	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
> > -	unsigned long vmemmap_reuse;
> > +	unsigned long vmemmap_start, vmemmap_end;
> >   	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
> >   	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
> > @@ -466,18 +425,16 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
> >   	if (flags & VMEMMAP_SYNCHRONIZE_RCU)
> >   		synchronize_rcu();
> > +	vmemmap_start	= (unsigned long)folio;
> >   	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
> > -	vmemmap_reuse	= vmemmap_start;
> > -	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
> >   	/*
> >   	 * The pages which the vmemmap virtual address range [@vmemmap_start,
> > -	 * @vmemmap_end) are mapped to are freed to the buddy allocator, and
> > -	 * the range is mapped to the page which @vmemmap_reuse is mapped to.
> > +	 * @vmemmap_end) are mapped to are freed to the buddy allocator.
> >   	 * When a HugeTLB page is freed to the buddy allocator, previously
> >   	 * discarded vmemmap pages must be allocated and remapping.
> >   	 */
> > -	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
> > +	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, flags);
> >   	if (!ret) {
> >   		folio_clear_hugetlb_vmemmap_optimized(folio);
> >   		static_branch_dec(&hugetlb_optimize_vmemmap_key);
> > @@ -565,9 +522,9 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
> >   					    struct list_head *vmemmap_pages,
> >   					    unsigned long flags)
> >   {
> > -	int ret = 0;
> > -	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
> > -	unsigned long vmemmap_reuse;
> > +	unsigned long vmemmap_start, vmemmap_end;
> > +	struct page *vmemmap_head, *vmemmap_tail;
> > +	int nid, ret = 0;
> >   	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
> >   	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
> > @@ -592,18 +549,31 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
> >   	 */
> >   	folio_set_hugetlb_vmemmap_optimized(folio);
> > +	nid = folio_nid(folio);
> > +	vmemmap_head = alloc_pages_node(nid, GFP_KERNEL, 0);
> 
> Why did you choose to change the gfp mask (previously
> GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN)?

Because I removed the fallback for allocation failure. Trying harder and
warning if the allocation fails is justified without the fallback path.

> > +
> > +	if (!vmemmap_head) {
> > +		ret = -ENOMEM;
> 
> Why did you choose to change the allocation-failure
> behavior? Replacing the head page isn’t mandatory;
> it’s only nice-to-have.

It would require extracting the vmemmap_head page from the page tables,
which I found to be a useless complication that would never get executed
and therefore never tested.

If we fail to allocate a single page here, we are in OOM territory. It
is not the time to play with huge page allocation.

> > +		goto out;
> > +	}
> > +
> > +	copy_page(page_to_virt(vmemmap_head), folio);
> > +	list_add(&vmemmap_head->lru, vmemmap_pages);
> > +	memmap_pages_add(1);
> > +
> > +	vmemmap_tail	= vmemmap_head;
> > +	vmemmap_start	= (unsigned long)folio;
> >   	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
> > -	vmemmap_reuse	= vmemmap_start;
> > -	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
> >   	/*
> > -	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
> > -	 * to the page which @vmemmap_reuse is mapped to.  Add pages previously
> > -	 * mapping the range to vmemmap_pages list so that they can be freed by
> > -	 * the caller.
> > +	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end).
> > +	 * Add pages previously mapping the range to vmemmap_pages list so that
> > +	 * they can be freed by the caller.
> >   	 */
> > -	ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse,
> > +	ret = vmemmap_remap_free(vmemmap_start, vmemmap_end,
> > +				 vmemmap_head, vmemmap_tail,
> >   				 vmemmap_pages, flags);
> > +out:
> >   	if (ret) {
> >   		static_branch_dec(&hugetlb_optimize_vmemmap_key);
> >   		folio_clear_hugetlb_vmemmap_optimized(folio);
> > @@ -632,21 +602,19 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
> >   static int hugetlb_vmemmap_split_folio(const struct hstate *h, struct folio *folio)
> >   {
> > -	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
> > -	unsigned long vmemmap_reuse;
> > +	unsigned long vmemmap_start, vmemmap_end;
> >   	if (!vmemmap_should_optimize_folio(h, folio))
> >   		return 0;
> > +	vmemmap_start	= (unsigned long)folio;
> >   	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
> > -	vmemmap_reuse	= vmemmap_start;
> > -	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
> >   	/*
> >   	 * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
> >   	 * @vmemmap_end]
> >   	 */
> > -	return vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
> > +	return vmemmap_remap_split(vmemmap_start, vmemmap_end);
> >   }
> >   static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 08/14] mm/hugetlb: Refactor code around vmemmap_walk
  2025-12-22 15:00     ` Kiryl Shutsemau
@ 2025-12-22 15:11       ` Muchun Song
  0 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2025-12-22 15:11 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden



> On Dec 22, 2025, at 23:03, Kiryl Shutsemau <kas@kernel.org> wrote:
> 
> On Mon, Dec 22, 2025 at 01:54:47PM +0800, Muchun Song wrote:
>> 
>> 
>>> On 2025/12/18 23:09, Kiryl Shutsemau wrote:
>>> To prepare for removing fake head pages, the vmemmap_walk code is being reworked.
>>> 
>>> The reuse_page and reuse_addr variables are being eliminated. There will
>>> no longer be an expectation regarding the reuse address in relation to
>>> the operated range. Instead, the caller will provide head and tail
>>> vmemmap pages, along with the vmemmap_start address where the head page
>>> is located.
>>> 
>>> Currently, vmemmap_head and vmemmap_tail are set to the same page, but
>>> this will change in the future.
>>> 
>>> The only functional change is that __hugetlb_vmemmap_optimize_folio()
>>> will abandon optimization if memory allocation fails.
>>> 
>>> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
>>> ---
>>>  mm/hugetlb_vmemmap.c | 198 ++++++++++++++++++-------------------------
>>>  1 file changed, 83 insertions(+), 115 deletions(-)
>>> 
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index ba0fb1b6a5a8..d18e7475cf95 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -24,8 +24,9 @@
>>>   *
>>>   * @remap_pte:        called for each lowest-level entry (PTE).
>>>   * @nr_walked:        the number of walked pte.
>>> - * @reuse_page:        the page which is reused for the tail vmemmap pages.
>>> - * @reuse_addr:        the virtual address of the @reuse_page page.
>>> + * @vmemmap_start:    the start of vmemmap range, where head page is located
>>> + * @vmemmap_head:    the page to be installed as first in the vmemmap range
>>> + * @vmemmap_tail:    the page to be installed as non-first in the vmemmap range
>>>   * @vmemmap_pages:    the list head of the vmemmap pages that can be freed
>>>   *            or is mapped from.
>>>   * @flags:        used to modify behavior in vmemmap page table walking
>>> @@ -34,11 +35,14 @@
>>>  struct vmemmap_remap_walk {
>>>      void            (*remap_pte)(pte_t *pte, unsigned long addr,
>>>                           struct vmemmap_remap_walk *walk);
>>> +
>>>      unsigned long        nr_walked;
>>> -    struct page        *reuse_page;
>>> -    unsigned long        reuse_addr;
>>> +    unsigned long        vmemmap_start;
>>> +    struct page        *vmemmap_head;
>>> +    struct page        *vmemmap_tail;
>>>      struct list_head    *vmemmap_pages;
>>> +
>>>  /* Skip the TLB flush when we split the PMD */
>>>  #define VMEMMAP_SPLIT_NO_TLB_FLUSH    BIT(0)
>>>  /* Skip the TLB flush when we remap the PTE */
>>> @@ -140,14 +144,7 @@ static int vmemmap_pte_entry(pte_t *pte, unsigned long addr,
>>>  {
>>>      struct vmemmap_remap_walk *vmemmap_walk = walk->private;
>>> -    /*
>>> -     * The reuse_page is found 'first' in page table walking before
>>> -     * starting remapping.
>>> -     */
>>> -    if (!vmemmap_walk->reuse_page)
>>> -        vmemmap_walk->reuse_page = pte_page(ptep_get(pte));
>>> -    else
>>> -        vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
>>> +    vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
>>>      vmemmap_walk->nr_walked++;
>>>      return 0;
>>> @@ -207,18 +204,12 @@ static void free_vmemmap_page_list(struct list_head *list)
>>>  static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>>                    struct vmemmap_remap_walk *walk)
>>>  {
>>> -    /*
>>> -     * Remap the tail pages as read-only to catch illegal write operation
>>> -     * to the tail pages.
>>> -     */
>>> -    pgprot_t pgprot = PAGE_KERNEL_RO;
>>>      struct page *page = pte_page(ptep_get(pte));
>>>      pte_t entry;
>>>      /* Remapping the head page requires r/w */
>>> -    if (unlikely(addr == walk->reuse_addr)) {
>>> -        pgprot = PAGE_KERNEL;
>>> -        list_del(&walk->reuse_page->lru);
>>> +    if (unlikely(addr == walk->vmemmap_start)) {
>>> +        list_del(&walk->vmemmap_head->lru);
>>>          /*
>>>           * Makes sure that preceding stores to the page contents from
>>> @@ -226,9 +217,16 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>>           * write.
>>>           */
>>>          smp_wmb();
>>> +
>>> +        entry = mk_pte(walk->vmemmap_head, PAGE_KERNEL);
>>> +    } else {
>>> +        /*
>>> +         * Remap the tail pages as read-only to catch illegal write
>>> +         * operation to the tail pages.
>>> +         */
>>> +        entry = mk_pte(walk->vmemmap_tail, PAGE_KERNEL_RO);
>>>      }
>>> -    entry = mk_pte(walk->reuse_page, pgprot);
>>>      list_add(&page->lru, walk->vmemmap_pages);
>>>      set_pte_at(&init_mm, addr, pte, entry);
>>>  }
>>> @@ -255,16 +253,13 @@ static inline void reset_struct_pages(struct page *start)
>>>  static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>>>                  struct vmemmap_remap_walk *walk)
>>>  {
>>> -    pgprot_t pgprot = PAGE_KERNEL;
>>>      struct page *page;
>>>      void *to;
>>> -    BUG_ON(pte_page(ptep_get(pte)) != walk->reuse_page);
>>> -
>>>      page = list_first_entry(walk->vmemmap_pages, struct page, lru);
>>>      list_del(&page->lru);
>>>      to = page_to_virt(page);
>>> -    copy_page(to, (void *)walk->reuse_addr);
>>> +    copy_page(to, (void *)walk->vmemmap_start);
>>>      reset_struct_pages(to);
>>>      /*
>>> @@ -272,7 +267,7 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>>>       * before the set_pte_at() write.
>>>       */
>>>      smp_wmb();
>>> -    set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>>> +    set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
>>>  }
>>>  /**
>>> @@ -282,33 +277,29 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>>>   *             to remap.
>>>   * @end:       end address of the vmemmap virtual address range that we want to
>>>   *             remap.
>>> - * @reuse:     reuse address.
>>> - *
>>>   * Return: %0 on success, negative error code otherwise.
>>>   */
>>> -static int vmemmap_remap_split(unsigned long start, unsigned long end,
>>> -                   unsigned long reuse)
>>> +static int vmemmap_remap_split(unsigned long start, unsigned long end)
>>>  {
>>>      struct vmemmap_remap_walk walk = {
>>>          .remap_pte    = NULL,
>>> +        .vmemmap_start    = start,
>>>          .flags        = VMEMMAP_SPLIT_NO_TLB_FLUSH,
>>>      };
>>> -    /* See the comment in the vmemmap_remap_free(). */
>>> -    BUG_ON(start - reuse != PAGE_SIZE);
>>> -
>>> -    return vmemmap_remap_range(reuse, end, &walk);
>>> +    return vmemmap_remap_range(start, end, &walk);
>>>  }
>>>  /**
>>>   * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
>>> - *            to the page which @reuse is mapped to, then free vmemmap
>>> - *            which the range are mapped to.
>>> + *            to use @vmemmap_head/tail, then free vmemmap which
>>> + *            the range are mapped to.
>>>   * @start:    start address of the vmemmap virtual address range that we want
>>>   *        to remap.
>>>   * @end:    end address of the vmemmap virtual address range that we want to
>>>   *        remap.
>>> - * @reuse:    reuse address.
>>> + * @vmemmap_head: the page to be installed as first in the vmemmap range
>>> + * @vmemmap_tail: the page to be installed as non-first in the vmemmap range
>>>   * @vmemmap_pages: list to deposit vmemmap pages to be freed.  It is callers
>>>   *        responsibility to free pages.
>>>   * @flags:    modifications to vmemmap_remap_walk flags
>>> @@ -316,69 +307,40 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
>>>   * Return: %0 on success, negative error code otherwise.
>>>   */
>>>  static int vmemmap_remap_free(unsigned long start, unsigned long end,
>>> -                  unsigned long reuse,
>>> +                  struct page *vmemmap_head,
>>> +                  struct page *vmemmap_tail,
>>>                    struct list_head *vmemmap_pages,
>>>                    unsigned long flags)
>>>  {
>>>      int ret;
>>>      struct vmemmap_remap_walk walk = {
>>>          .remap_pte    = vmemmap_remap_pte,
>>> -        .reuse_addr    = reuse,
>>> +        .vmemmap_start    = start,
>>> +        .vmemmap_head    = vmemmap_head,
>>> +        .vmemmap_tail    = vmemmap_tail,
>>>          .vmemmap_pages    = vmemmap_pages,
>>>          .flags        = flags,
>>>      };
>>> -    int nid = page_to_nid((struct page *)reuse);
>>> -    gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
>>> +
>>> +    ret = vmemmap_remap_range(start, end, &walk);
>>> +    if (!ret || !walk.nr_walked)
>>> +        return ret;
>>> +
>>> +    end = start + walk.nr_walked * PAGE_SIZE;
>>>      /*
>>> -     * Allocate a new head vmemmap page to avoid breaking a contiguous
>>> -     * block of struct page memory when freeing it back to page allocator
>>> -     * in free_vmemmap_page_list(). This will allow the likely contiguous
>>> -     * struct page backing memory to be kept contiguous and allowing for
>>> -     * more allocations of hugepages. Fallback to the currently
>>> -     * mapped head page in case should it fail to allocate.
>>> +     * vmemmap_pages contains pages from the previous vmemmap_remap_range()
>>> +     * call which failed.  These are pages which were removed from
>>> +     * the vmemmap. They will be restored in the following call.
>>>       */
>>> -    walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
>>> -    if (walk.reuse_page) {
>>> -        copy_page(page_to_virt(walk.reuse_page),
>>> -              (void *)walk.reuse_addr);
>>> -        list_add(&walk.reuse_page->lru, vmemmap_pages);
>>> -        memmap_pages_add(1);
>>> -    }
>>> +    walk = (struct vmemmap_remap_walk) {
>>> +        .remap_pte    = vmemmap_restore_pte,
>>> +        .vmemmap_start    = start,
>>> +        .vmemmap_pages    = vmemmap_pages,
>>> +        .flags        = 0,
>>> +    };
>>> -    /*
>>> -     * In order to make remapping routine most efficient for the huge pages,
>>> -     * the routine of vmemmap page table walking has the following rules
>>> -     * (see more details from the vmemmap_pte_range()):
>>> -     *
>>> -     * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
>>> -     *   should be continuous.
>>> -     * - The @reuse address is part of the range [@reuse, @end) that we are
>>> -     *   walking which is passed to vmemmap_remap_range().
>>> -     * - The @reuse address is the first in the complete range.
>>> -     *
>>> -     * So we need to make sure that @start and @reuse meet the above rules.
>>> -     */
>>> -    BUG_ON(start - reuse != PAGE_SIZE);
>>> -
>>> -    ret = vmemmap_remap_range(reuse, end, &walk);
>>> -    if (ret && walk.nr_walked) {
>>> -        end = reuse + walk.nr_walked * PAGE_SIZE;
>>> -        /*
>>> -         * vmemmap_pages contains pages from the previous
>>> -         * vmemmap_remap_range call which failed.  These
>>> -         * are pages which were removed from the vmemmap.
>>> -         * They will be restored in the following call.
>>> -         */
>>> -        walk = (struct vmemmap_remap_walk) {
>>> -            .remap_pte    = vmemmap_restore_pte,
>>> -            .reuse_addr    = reuse,
>>> -            .vmemmap_pages    = vmemmap_pages,
>>> -            .flags        = 0,
>>> -        };
>>> -
>>> -        vmemmap_remap_range(reuse, end, &walk);
>>> -    }
>>> +    vmemmap_remap_range(start + PAGE_SIZE, end, &walk);
>> 
>> The reason we previously passed the "start" address
>> was to perform a TLB flush within that address range.
>> So the start address is still necessary.
> 
> Good catch.
> 
>>>      return ret;
>>>  }
>>> @@ -415,29 +377,27 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
>>>   *        to remap.
>>>   * @end:    end address of the vmemmap virtual address range that we want to
>>>   *        remap.
>>> - * @reuse:    reuse address.
>>>   * @flags:    modifications to vmemmap_remap_walk flags
>>>   *
>>>   * Return: %0 on success, negative error code otherwise.
>>>   */
>>>  static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
>>> -                   unsigned long reuse, unsigned long flags)
>>> +                   unsigned long flags)
>>>  {
>>>      LIST_HEAD(vmemmap_pages);
>>>      struct vmemmap_remap_walk walk = {
>>>          .remap_pte    = vmemmap_restore_pte,
>>> -        .reuse_addr    = reuse,
>>> +        .vmemmap_start    = start,
>>>          .vmemmap_pages    = &vmemmap_pages,
>>>          .flags        = flags,
>>>      };
>>> -    /* See the comment in the vmemmap_remap_free(). */
>>> -    BUG_ON(start - reuse != PAGE_SIZE);
>>> +    start += HUGETLB_VMEMMAP_RESERVE_SIZE;
>>>      if (alloc_vmemmap_page_list(start, end, &vmemmap_pages))
>>>          return -ENOMEM;
>>> -    return vmemmap_remap_range(reuse, end, &walk);
>>> +    return vmemmap_remap_range(start, end, &walk);
>>>  }
>>>  DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
>>> @@ -454,8 +414,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
>>>                         struct folio *folio, unsigned long flags)
>>>  {
>>>      int ret;
>>> -    unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
>>> -    unsigned long vmemmap_reuse;
>>> +    unsigned long vmemmap_start, vmemmap_end;
>>>      VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
>>>      VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
>>> @@ -466,18 +425,16 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
>>>      if (flags & VMEMMAP_SYNCHRONIZE_RCU)
>>>          synchronize_rcu();
>>> +    vmemmap_start    = (unsigned long)folio;
>>>      vmemmap_end    = vmemmap_start + hugetlb_vmemmap_size(h);
>>> -    vmemmap_reuse    = vmemmap_start;
>>> -    vmemmap_start    += HUGETLB_VMEMMAP_RESERVE_SIZE;
>>>      /*
>>>       * The pages which the vmemmap virtual address range [@vmemmap_start,
>>> -     * @vmemmap_end) are mapped to are freed to the buddy allocator, and
>>> -     * the range is mapped to the page which @vmemmap_reuse is mapped to.
>>> +     * @vmemmap_end) are mapped to are freed to the buddy allocator.
>>>       * When a HugeTLB page is freed to the buddy allocator, previously
>>>       * discarded vmemmap pages must be allocated and remapping.
>>>       */
>>> -    ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
>>> +    ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, flags);
>>>      if (!ret) {
>>>          folio_clear_hugetlb_vmemmap_optimized(folio);
>>>          static_branch_dec(&hugetlb_optimize_vmemmap_key);
>>> @@ -565,9 +522,9 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
>>>                          struct list_head *vmemmap_pages,
>>>                          unsigned long flags)
>>>  {
>>> -    int ret = 0;
>>> -    unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
>>> -    unsigned long vmemmap_reuse;
>>> +    unsigned long vmemmap_start, vmemmap_end;
>>> +    struct page *vmemmap_head, *vmemmap_tail;
>>> +    int nid, ret = 0;
>>>      VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
>>>      VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
>>> @@ -592,18 +549,31 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
>>>       */
>>>      folio_set_hugetlb_vmemmap_optimized(folio);
>>> +    nid = folio_nid(folio);
>>> +    vmemmap_head = alloc_pages_node(nid, GFP_KERNEL, 0);
>> 
>> Why did you choose to change the gfp mask (previously
>> GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN)?
> 
> Because I removed the fallback for allocation failure. Trying harder and
> warning if the allocation fails is justified without the fallback path.
> 
>>> +
>>> +    if (!vmemmap_head) {
>>> +        ret = -ENOMEM;
>> 
>> Why did you choose to change the allocation-failure
>> behavior? Replacing the head page isn’t mandatory;
>> it’s only nice-to-have.
> 
> It would require extracting the vmemmap_head page from the page tables,
> which I found to be a useless complication that would never get executed
> and therefore never tested.
> 
> If we fail to allocate a single page here, we are in OOM territory. It
> is not the time to play with huge page allocation.

Alright, you’ve convinced me.

> 
>>> +        goto out;
>>> +    }
>>> +
>>> +    copy_page(page_to_virt(vmemmap_head), folio);
>>> +    list_add(&vmemmap_head->lru, vmemmap_pages);
>>> +    memmap_pages_add(1);
>>> +
>>> +    vmemmap_tail    = vmemmap_head;
>>> +    vmemmap_start    = (unsigned long)folio;
>>>      vmemmap_end    = vmemmap_start + hugetlb_vmemmap_size(h);
>>> -    vmemmap_reuse    = vmemmap_start;
>>> -    vmemmap_start    += HUGETLB_VMEMMAP_RESERVE_SIZE;
>>>      /*
>>> -     * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
>>> -     * to the page which @vmemmap_reuse is mapped to.  Add pages previously
>>> -     * mapping the range to vmemmap_pages list so that they can be freed by
>>> -     * the caller.
>>> +     * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end).
>>> +     * Add pages previously mapping the range to vmemmap_pages list so that
>>> +     * they can be freed by the caller.
>>>       */
>>> -    ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse,
>>> +    ret = vmemmap_remap_free(vmemmap_start, vmemmap_end,
>>> +                 vmemmap_head, vmemmap_tail,
>>>                   vmemmap_pages, flags);
>>> +out:
>>>      if (ret) {
>>>          static_branch_dec(&hugetlb_optimize_vmemmap_key);
>>>          folio_clear_hugetlb_vmemmap_optimized(folio);
>>> @@ -632,21 +602,19 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
>>>  static int hugetlb_vmemmap_split_folio(const struct hstate *h, struct folio *folio)
>>>  {
>>> -    unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
>>> -    unsigned long vmemmap_reuse;
>>> +    unsigned long vmemmap_start, vmemmap_end;
>>>      if (!vmemmap_should_optimize_folio(h, folio))
>>>          return 0;
>>> +    vmemmap_start    = (unsigned long)folio;
>>>      vmemmap_end    = vmemmap_start + hugetlb_vmemmap_size(h);
>>> -    vmemmap_reuse    = vmemmap_start;
>>> -    vmemmap_start    += HUGETLB_VMEMMAP_RESERVE_SIZE;
>>>      /*
>>>       * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
>>>       * @vmemmap_end]
>>>       */
>>> -    return vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
>>> +    return vmemmap_remap_split(vmemmap_start, vmemmap_end);
>>>  }
>>>  static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
>> 
> 
> --
>  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 06/14] mm: Rework compound_head() for power-of-2 sizeof(struct page)
  2025-12-22 14:03     ` Kiryl Shutsemau
@ 2025-12-23  8:37       ` Muchun Song
  0 siblings, 0 replies; 43+ messages in thread
From: Muchun Song @ 2025-12-23  8:37 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
	Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
	Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Usama Arif,
	Frank van der Linden



> On Dec 22, 2025, at 22:03, Kiryl Shutsemau <kas@kernel.org> wrote:
> 
> On Mon, Dec 22, 2025 at 11:20:48AM +0800, Muchun Song wrote:
>> 
>> 
>> On 2025/12/18 23:09, Kiryl Shutsemau wrote:
>>> For tail pages, the kernel uses the 'compound_info' field to get to the
>>> head page. The bit 0 of the field indicates whether the page is a
>>> tail page, and if set, the remaining bits represent a pointer to the
>>> head page.
>>> 
>>> For cases where the size of struct page is a power of 2, change the
>>> encoding of compound_info to store a mask that can be applied to the
>>> virtual address of the tail page in order to access the head page.
>>> This is possible because the struct page of the head page is naturally
>>> aligned with regard to the order of the page.
>>> 
>>> The significant impact of this modification is that all tail pages of
>>> the same order will now have identical 'compound_info', regardless of
>>> the compound page they are associated with. This paves the way for
>>> eliminating fake heads.
>>> 
>>> The HugeTLB Vmemmap Optimization (HVO) creates fake heads and it is only
>>> applied when the sizeof(struct page) is power-of-2. Having identical
>>> tail pages allows the same page to be mapped into the vmemmap of all
>>> pages, maintaining memory savings without fake heads.
>>> 
>>> If sizeof(struct page) is not power-of-2, there are no functional
>>> changes.
>>> 
>>> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
>> 
>> Reviewed-by: Muchun Song <muchun.song@linux.dev>
>> 
>> One nit bellow.
>> 
>>> ---
>>>  include/linux/page-flags.h | 62 +++++++++++++++++++++++++++++++++-----
>>>  mm/util.c                  | 16 +++++++---
>>>  2 files changed, 66 insertions(+), 12 deletions(-)
>>> 
>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>> index 0de7db7efb00..fac5f41b3b27 100644
>>> --- a/include/linux/page-flags.h
>>> +++ b/include/linux/page-flags.h
>>> @@ -210,6 +210,13 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
>>>   if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
>>>   return page;
>>> + /*
>>> +  * Fake heads only exists if size of struct page is power-of-2.
>>> +  * See hugetlb_vmemmap_optimizable_size().
>>> +  */
>>> + if (!is_power_of_2(sizeof(struct page)))
>>> + return page;
>>> +
>>>   /*
>>>    * Only addresses aligned with PAGE_SIZE of struct page may be fake head
>>>    * struct page. The alignment check aims to avoid access the fields (
>>> @@ -223,10 +230,14 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
>>>    * because the @page is a compound page composed with at least
>>>    * two contiguous pages.
>>>    */
>>> - unsigned long head = READ_ONCE(page[1].compound_info);
>>> + unsigned long info = READ_ONCE(page[1].compound_info);
>>> - if (likely(head & 1))
>>> - return (const struct page *)(head - 1);
>>> + /* See set_compound_head() */
>>> + if (likely(info & 1)) {
>>> + unsigned long p = (unsigned long)page;
>>> +
>>> + return (const struct page *)(p & info);
>>> + }
>>>   }
>>>   return page;
>>>  }
>>> @@ -281,11 +292,27 @@ static __always_inline int page_is_fake_head(const struct page *page)
>>>  static __always_inline unsigned long _compound_head(const struct page *page)
>>>  {
>>> - unsigned long head = READ_ONCE(page->compound_info);
>>> + unsigned long info = READ_ONCE(page->compound_info);
>>> - if (unlikely(head & 1))
>>> - return head - 1;
>>> - return (unsigned long)page_fixed_fake_head(page);
>>> + /* Bit 0 encodes PageTail() */
>>> + if (!(info & 1))
>>> + return (unsigned long)page_fixed_fake_head(page);
>>> +
>>> + /*
>>> +  * If the size of struct page is not power-of-2, the rest of
>>> +  * compound_info is the pointer to the head page.
>>> +  */
>>> + if (!is_power_of_2(sizeof(struct page)))
>>> + return info - 1;
>>> +
>>> + /*
>>> +  * If the size of struct page is power-of-2 the rest of the info
>>> +  * encodes the mask that converts the address of the tail page to
>>> +  * the head page.
>>> +  *
>>> +  * No need to clear bit 0 in the mask as 'page' always has it clear.
>>> +  */
>>> + return (unsigned long)page & info;
>>>  }
>>>  #define compound_head(page) ((typeof(page))_compound_head(page))
>>> @@ -294,7 +321,26 @@ static __always_inline void set_compound_head(struct page *page,
>>>         const struct page *head,
>>>         unsigned int order)
>>>  {
>>> - WRITE_ONCE(page->compound_info, (unsigned long)head + 1);
>>> + unsigned int shift;
>>> + unsigned long mask;
>>> +
>>> + if (!is_power_of_2(sizeof(struct page))) {
>>> + WRITE_ONCE(page->compound_info, (unsigned long)head | 1);
>>> + return;
>>> + }
>>> +
>>> + /*
>>> +  * If the size of struct page is power-of-2, bits [shift:0] of the
>>> +  * virtual address of compound head are zero.
>>> +  *
>>> +  * Calculate mask that can be applied to the virtual address of
>>> +  * the tail page to get address of the head page.
>>> +  */
>>> + shift = order + order_base_2(sizeof(struct page));
>> 
>> We already have a macro for order_base_2(sizeof(struct page)),
>> which is STRUCT_PAGE_MAX_SHIFT.
> 
> I used it before, but the name is obscure and the open-coded version is
> easier to follow in my view.

OK. I'm fine with the open-coded version as well.
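
For reference, a worked example of the encoding being discussed. The exact
mask expression is inferred from the quoted compound_head() hunk (page & info,
with bit 0 doubling as the tail marker), since the assignment line itself is
cut off above; treat it as a sketch, not the patch's literal code:

	shift = order + order_base_2(sizeof(struct page));
	mask  = (~0UL << shift) | 1;	/* assumed form; bit 0 marks a tail */
	WRITE_ONCE(page->compound_info, mask);

	/*
	 * Worked numbers: order = 9, sizeof(struct page) = 64
	 *
	 *   shift = 9 + order_base_2(64) = 9 + 6 = 15
	 *   mask  = 0xffffffffffff8001
	 *
	 * ANDing any order-9 tail's virtual address with this mask yields
	 * the head's address (the memmap alignment check guarantees the
	 * head is naturally aligned, and bit 0 of a struct page address is
	 * always clear), so every order-9 tail stores the same value.
	 */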

> 
> -- 
>  Kiryl Shutsemau / Kirill A. Shutemov




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 02/14] mm/sparse: Check memmap alignment
  2025-12-22 14:55         ` Muchun Song
@ 2025-12-23  9:38           ` David Hildenbrand (Red Hat)
  2025-12-23 11:26             ` Muchun Song
  2025-12-24 14:13             ` Kiryl Shutsemau
  0 siblings, 2 replies; 43+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-23  9:38 UTC (permalink / raw)
  To: Muchun Song, Matthew Wilcox
  Cc: Kiryl Shutsemau, Oscar Salvador, Mike Rapoport, Vlastimil Babka,
	Lorenzo Stoakes, Zi Yan, Baoquan He, Michal Hocko,
	Johannes Weiner, Jonathan Corbet, kernel-team, linux-mm,
	linux-kernel, linux-doc, Andrew Morton, Usama Arif,
	Frank van der Linden

On 12/22/25 15:55, Muchun Song wrote:
> 
> 
>> On Dec 22, 2025, at 22:18, David Hildenbrand (Red Hat) <david@kernel.org> wrote:
>>
>> On 12/22/25 15:02, Kiryl Shutsemau wrote:
>>>> On Mon, Dec 22, 2025 at 04:34:40PM +0800, Muchun Song wrote:
>>>>
>>>>
>>>> On 2025/12/18 23:09, Kiryl Shutsemau wrote:
>>>>> The upcoming changes in compound_head() require memmap to be naturally
>>>>> aligned to the maximum folio size.
>>>>>
>>>>> Add a warning if it is not.
>>>>>
>>>>> A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
>>>>> kernel is still likely to be functional if this strict check fails.
>>>>
>>>> Different architectures default to 2 MB alignment (mainly to
>>>> enable huge mappings), which only accommodates folios up to
>>>> 128 MB. Yet 1 GB huge pages are still fairly common, so
>>>> validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to
>>>> miss the most frequent case.
>>> I don't follow. The 16 GB check is stricter than anything smaller.
>>> How can it miss the most frequent case?
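
(A quick sanity check of the 128 MB figure in the quoted paragraph; the
4 KiB base page and 64-byte struct page used here are assumptions for
illustration, not values taken from the patch:)

#include <stdio.h>

int main(void)
{
	unsigned long base_page = 4096;			/* 4 KiB base page */
	unsigned long struct_page = 64;			/* sizeof(struct page) */
	unsigned long memmap_align = 2UL << 20;		/* 2 MiB memmap alignment */

	/* One naturally aligned 2 MiB chunk of memmap holds 32768 struct
	 * pages, which describe 32768 * 4 KiB = 128 MiB of memory. */
	printf("%lu MiB\n", memmap_align / struct_page * base_page >> 20);
	return 0;
}
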
>>>> I’m concerned that this might plant a hidden time bomb: it
>>>> could detonate at any moment in later code, silently triggering
>>>> memory corruption or similar failures. Therefore, I don’t
>>>> think a WARNING is a good choice.
>>> We can upgrade it to BUG_ON(), but I want to understand your logic here
>>> first.
>>
>> Definitely no BUG_ON(). I would assume this is something we would find early during testing, so even a VM_WARN_ON_ONCE() should be good enough?
>>
>> This smells like a possible problem, though, as soon as some architecture wants to increase the folio size. What would be the expected step to ensure the alignment is done properly?
>>
>> But OTOH, as I raised, Willy's work will make all of this obsolete either way, so maybe it's not worth worrying about that case too much.
> 
> Hi David,
> 

Hi! :)

> I hope you're doing well. I must admit I have limited knowledge of Willy's work, and I was wondering if you might be kind enough to share any publicly available links where I could learn more about the future direction of this project. I would be truly grateful for your guidance.
> Thank you very much in advance.

There is some information to be had at [1], but more at [2]. Take a look 
at [2] in "After those projects are complete - Then we can shrink struct 
page to 32 bytes:"

In essence, all pages (belonging to a memdesc) will have a "memdesc" 
pointer (that replaces the compound_head pointer).

"Then we make page->compound_head point to the dynamically allocated 
memdesc rather than the first page. Then we can transition to the above 
layout. "

The "memdesc" could be a pointer to a "struct folio" that is allocated 
from the slab.

So in the new memdesc world, all pages part of a folio will point at the 
allocated "struct folio", not the head page where "struct folio" 
currently overlays "struct page".

That would mean that the proposal in this patch set will have to be 
reverted again.


At LPC, Willy said that he wants to have something out there in the 
first half of 2026.

[1] https://kernelnewbies.org/MatthewWilcox/Memdescs
[2] https://kernelnewbies.org/MatthewWilcox/Memdescs/Path
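
As a rough illustration of that direction (the field name, tag width, and
helpers below are assumptions based on [2], not code from this series or
from any existing tree):

/*
 * Sketch: each page carries one tagged word instead of compound_head.
 * The low bits say what kind of descriptor the page belongs to; the
 * rest is a pointer to a separately allocated object, e.g. a
 * slab-allocated struct folio.
 */
#define MEMDESC_TYPE_MASK	0xfUL
#define MEMDESC_TYPE_FOLIO	0x1UL	/* made-up tag value */

static inline unsigned long memdesc_type(unsigned long memdesc)
{
	return memdesc & MEMDESC_TYPE_MASK;
}

static inline struct folio *memdesc_folio(unsigned long memdesc)
{
	if (memdesc_type(memdesc) != MEMDESC_TYPE_FOLIO)
		return NULL;
	return (struct folio *)(memdesc & ~MEMDESC_TYPE_MASK);
}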

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 02/14] mm/sparse: Check memmap alignment
  2025-12-23  9:38           ` David Hildenbrand (Red Hat)
@ 2025-12-23 11:26             ` Muchun Song
  2025-12-24 14:13             ` Kiryl Shutsemau
  1 sibling, 0 replies; 43+ messages in thread
From: Muchun Song @ 2025-12-23 11:26 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Matthew Wilcox, Kiryl Shutsemau, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc, Andrew Morton, Usama Arif,
	Frank van der Linden



> On Dec 23, 2025, at 17:38, David Hildenbrand (Red Hat) <david@kernel.org> wrote:
> 
> On 12/22/25 15:55, Muchun Song wrote:
>>> On Dec 22, 2025, at 22:18, David Hildenbrand (Red Hat) <david@kernel.org> wrote:
>>> 
>>> On 12/22/25 15:02, Kiryl Shutsemau wrote:
>>>>> On Mon, Dec 22, 2025 at 04:34:40PM +0800, Muchun Song wrote:
>>>>> 
>>>>> 
>>>>> On 2025/12/18 23:09, Kiryl Shutsemau wrote:
>>>>>> The upcoming changes in compound_head() require memmap to be naturally
>>>>>> aligned to the maximum folio size.
>>>>>> 
>>>>>> Add a warning if it is not.
>>>>>> 
>>>>>> A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
>>>>>> kernel is still likely to be functional if this strict check fails.
>>>>> 
>>>>> Different architectures default to 2 MB alignment (mainly to
>>>>> enable huge mappings), which only accommodates folios up to
>>>>> 128 MB. Yet 1 GB huge pages are still fairly common, so
>>>>> validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to
>>>>> miss the most frequent case.
>>>> I don't follow. The 16 GB check is stricter than anything smaller.
>>>> How can it miss the most frequent case?
>>>>> I’m concerned that this might plant a hidden time bomb: it
>>>>> could detonate at any moment in later code, silently triggering
>>>>> memory corruption or similar failures. Therefore, I don’t
>>>>> think a WARNING is a good choice.
>>>> We can upgrade it to BUG_ON(), but I want to understand your logic here
>>>> first.
>>> 
>>> Definitely no BUG_ON(). I would assume this is something we would find early during testing, so even a VM_WARN_ON_ONCE() should be good enough?
>>> 
>>> This smells like a possible problem, though, as soon as some architecture wants to increase the folio size. What would be the expected step to ensure the alignment is done properly?
>>> 
>>> But OTOH, as I raised, Willy's work will make all of this obsolete either way, so maybe it's not worth worrying about that case too much.
>> Hi David,
> 
> Hi! :)
> 
>> I hope you're doing well. I must admit I have limited knowledge of Willy's work, and I was wondering if you might be kind enough to share any publicly available links where I could learn more about the future direction of this project. I would be truly grateful for your guidance.
>> Thank you very much in advance.
> 
> There is some information to be had at [1], but more at [2]. Take a look at [2] in "After those projects are complete - Then we can shrink struct page to 32 bytes:"
> 
> In essence, all pages (belonging to a memdesc) will have a "memdesc" pointer (that replaces the compound_head pointer).
> 
> "Then we make page->compound_head point to the dynamically allocated memdesc rather than the first page. Then we can transition to the above layout. "
> 
> The "memdesc" could be a pointer to a "struct folio" that is allocated from the slab.
> 
> So in the new memdesc world, all pages part of a folio will point at the allocated "struct folio", not the head page where "struct folio" currently overlays "struct page".
> 
> That would mean that the proposal in this patch set will have to be reverted again.
> 
> 
> At LPC, Willy said that he wants to have something out there in the first half of 2026.
> 
> [1] https://kernelnewbies.org/MatthewWilcox/Memdescs
> [2] https://kernelnewbies.org/MatthewWilcox/Memdescs/Path

Many thanks for taking the time to explain everything in detail and for providing
such valuable information. I plan to invest additional time to fully understand
the details you’ve shared.

Muchun,
Thanks.

> 
> -- 
> Cheers
> 
> David




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHv2 02/14] mm/sparse: Check memmap alignment
  2025-12-23  9:38           ` David Hildenbrand (Red Hat)
  2025-12-23 11:26             ` Muchun Song
@ 2025-12-24 14:13             ` Kiryl Shutsemau
  1 sibling, 0 replies; 43+ messages in thread
From: Kiryl Shutsemau @ 2025-12-24 14:13 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Muchun Song, Matthew Wilcox, Oscar Salvador, Mike Rapoport,
	Vlastimil Babka, Lorenzo Stoakes, Zi Yan, Baoquan He,
	Michal Hocko, Johannes Weiner, Jonathan Corbet, kernel-team,
	linux-mm, linux-kernel, linux-doc, Andrew Morton, Usama Arif,
	Frank van der Linden

On Tue, Dec 23, 2025 at 10:38:26AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/22/25 15:55, Muchun Song wrote:
> > 
> > 
> > > On Dec 22, 2025, at 22:18, David Hildenbrand (Red Hat) <david@kernel.org> wrote:
> > > 
> > > On 12/22/25 15:02, Kiryl Shutsemau wrote:
> > > > > On Mon, Dec 22, 2025 at 04:34:40PM +0800, Muchun Song wrote:
> > > > > 
> > > > > 
> > > > > On 2025/12/18 23:09, Kiryl Shutsemau wrote:
> > > > > > The upcoming changes in compound_head() require memmap to be naturally
> > > > > > aligned to the maximum folio size.
> > > > > > 
> > > > > > Add a warning if it is not.
> > > > > > 
> > > > > > A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
> > > > > > kernel is still likely to be functional if this strict check fails.
> > > > > 
> > > > > Different architectures default to 2 MB alignment (mainly to
> > > > > enable huge mappings), which only accommodates folios up to
> > > > > 128 MB. Yet 1 GB huge pages are still fairly common, so
> > > > > validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to
> > > > > miss the most frequent case.
> > > > I don't follow. The 16 GB check is stricter than anything smaller.
> > > > How can it miss the most frequent case?
> > > > > I’m concerned that this might plant a hidden time bomb: it
> > > > > could detonate at any moment in later code, silently triggering
> > > > > memory corruption or similar failures. Therefore, I don’t
> > > > > think a WARNING is a good choice.
> > > > We can upgrade it to BUG_ON(), but I want to understand your logic here
> > > > first.
> > > 
> > > Definitely no BUG_ON(). I would assume this is something we would find early during testing, so even a VM_WARN_ON_ONCE() should be good enough?
> > > 
> > > This smells like a possible problem, though, as soon as some architecture wants to increase the folio size. What would be the expected step to ensure the alignment is done properly?
> > > 
> > > But OTOH, as I raised, Willy's work will make all of this obsolete either way, so maybe it's not worth worrying about that case too much.
> > 
> > Hi David,
> > 
> 
> Hi! :)
> 
> > I hope you're doing well. I must admit I have limited knowledge of Willy's work, and I was wondering if you might be kind enough to share any publicly available links where I could learn more about the future direction of this project. I would be truly grateful for your guidance.
> > Thank you very much in advance.
> 
> There is some information to be had at [1], but more at [2]. Take a look at
> [2] in "After those projects are complete - Then we can shrink struct page
> to 32 bytes:"
> 
> In essence, all pages (belonging to a memdesc) will have a "memdesc" pointer
> (that replaces the compound_head pointer).
> 
> "Then we make page->compound_head point to the dynamically allocated memdesc
> rather than the first page. Then we can transition to the above layout. "

I am not sure I understand how it is going to work.

The 32-byte layout indicates that flags will stay in the statically
allocated part, but most (all?) flags are in the head page, so we would
need a way to redirect from tail to head in the statically allocated
pages.

> The "memdesc" could be a pointer to a "struct folio" that is allocated from
> the slab.
> 
> So in the new memdesc world, all pages part of a folio will point at the
> allocated "struct folio", not the head page where "struct folio" currently
> overlays "struct page".
> 
> That would mean that the proposal in this patch set will have to be reverted
> again.
> 
> 
> At LPC, Willy said that he wants to have something out there in the first
> half of 2026.

Okay, seems ambitious to me.

Last time I asked, we had no idea how much performance the additional
indirection would cost us. Do we have a clue?

I like the memdesc idea, but the indirection cost has always bothered me.
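
(A toy comparison of the two lookups being weighed here; the names and
layout are invented for the sketch and taken from neither this series nor
the memdesc plans:)

struct toy_desc { unsigned long flags; };

struct toy_page {
	unsigned long	compound_info;	/* mask scheme */
	struct toy_desc	*memdesc;	/* memdesc scheme */
};

/* Mask scheme: the head is computed from the tail's own word, no
 * extra dependent load. */
static inline struct toy_page *toy_head(struct toy_page *tail)
{
	return (struct toy_page *)((unsigned long)tail & tail->compound_info);
}

/* Memdesc scheme: reaching flags needs a dependent load into a
 * separately allocated descriptor. */
static inline unsigned long toy_flags(struct toy_page *page)
{
	return page->memdesc->flags;
}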

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2025-12-24 14:13 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-18 15:09 [PATCHv2 00/14] Kiryl Shutsemau
2025-12-18 15:09 ` [PATCHv2 01/14] mm: Move MAX_FOLIO_ORDER definition to mmzone.h Kiryl Shutsemau
2025-12-18 15:09 ` [PATCHv2 02/14] mm/sparse: Check memmap alignment Kiryl Shutsemau
2025-12-22  8:34   ` Muchun Song
2025-12-22 14:02     ` Kiryl Shutsemau
2025-12-22 14:18       ` David Hildenbrand (Red Hat)
2025-12-22 14:52         ` Kiryl Shutsemau
2025-12-22 14:59           ` Muchun Song
2025-12-22 14:55         ` Muchun Song
2025-12-23  9:38           ` David Hildenbrand (Red Hat)
2025-12-23 11:26             ` Muchun Song
2025-12-24 14:13             ` Kiryl Shutsemau
2025-12-22 14:49       ` Muchun Song
2025-12-18 15:09 ` [PATCHv2 03/14] mm: Change the interface of prep_compound_tail() Kiryl Shutsemau
2025-12-22  2:55   ` Muchun Song
2025-12-18 15:09 ` [PATCHv2 04/14] mm: Rename the 'compound_head' field in the 'struct page' to 'compound_info' Kiryl Shutsemau
2025-12-22  3:00   ` Muchun Song
2025-12-18 15:09 ` [PATCHv2 05/14] mm: Move set/clear_compound_head() next to compound_head() Kiryl Shutsemau
2025-12-22  3:06   ` Muchun Song
2025-12-18 15:09 ` [PATCHv2 06/14] mm: Rework compound_head() for power-of-2 sizeof(struct page) Kiryl Shutsemau
2025-12-22  3:20   ` Muchun Song
2025-12-22 14:03     ` Kiryl Shutsemau
2025-12-23  8:37       ` Muchun Song
2025-12-22  7:57   ` Muchun Song
2025-12-22  9:45     ` Muchun Song
2025-12-22 14:49       ` Kiryl Shutsemau
2025-12-18 15:09 ` [PATCHv2 07/14] mm: Make page_zonenum() use head page Kiryl Shutsemau
2025-12-18 15:09 ` [PATCHv2 08/14] mm/hugetlb: Refactor code around vmemmap_walk Kiryl Shutsemau
2025-12-22  5:54   ` Muchun Song
2025-12-22 15:00     ` Kiryl Shutsemau
2025-12-22 15:11       ` Muchun Song
2025-12-18 15:09 ` [PATCHv2 09/14] mm/hugetlb: Remove fake head pages Kiryl Shutsemau
2025-12-18 15:09 ` [PATCHv2 10/14] mm: Drop fake head checks Kiryl Shutsemau
2025-12-22  5:56   ` Muchun Song
2025-12-18 15:09 ` [PATCHv2 11/14] hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU Kiryl Shutsemau
2025-12-22  6:00   ` Muchun Song
2025-12-18 15:09 ` [PATCHv2 12/14] mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key Kiryl Shutsemau
2025-12-22  6:03   ` Muchun Song
2025-12-18 15:09 ` [PATCHv2 13/14] mm: Remove the branch from compound_head() Kiryl Shutsemau
2025-12-22  6:30   ` Muchun Song
2025-12-18 15:09 ` [PATCHv2 14/14] hugetlb: Update vmemmap_dedup.rst Kiryl Shutsemau
2025-12-22  6:20   ` Muchun Song
2025-12-18 22:18 ` [PATCHv2 00/14] Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox