* [PATCH 01/11] mm: Change the interface of prep_compound_tail()
2025-12-05 19:43 [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
@ 2025-12-05 19:43 ` Kiryl Shutsemau
2025-12-05 21:49 ` Usama Arif
2025-12-05 19:43 ` [PATCH 02/11] mm: Rename the 'compound_head' field in the 'struct page' to 'compound_info' Kiryl Shutsemau
` (10 subsequent siblings)
11 siblings, 1 reply; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 19:43 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc,
Kiryl Shutsemau
Instead of passing down the head page and tail page index, pass the tail
and head pages directly, as well as the order of the compound page.
This is a preparation for changing how the head position is encoded in
the tail page.
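For illustration, a minimal before/after sketch of a call site (mirroring the mm/page_alloc.c hunk below; not part of the patch itself):

	/* Before: pass the head page and the tail's index. */
	prep_compound_tail(page, i);

	/* After: pass the tail and head pages directly, plus the compound order. */
	prep_compound_tail(page + i, page, order);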
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
include/linux/page-flags.h | 4 +++-
mm/hugetlb.c | 8 +++++---
mm/internal.h | 11 +++++------
mm/mm_init.c | 2 +-
mm/page_alloc.c | 2 +-
5 files changed, 15 insertions(+), 12 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0091ad1986bf..2c1153dd7e0e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -865,7 +865,9 @@ static inline bool folio_test_large(const struct folio *folio)
return folio_test_head(folio);
}
-static __always_inline void set_compound_head(struct page *page, struct page *head)
+static __always_inline void set_compound_head(struct page *page,
+ struct page *head,
+ unsigned int order)
{
WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0455119716ec..a55d638975bd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3212,6 +3212,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
+ struct hstate *h,
unsigned long start_page_number,
unsigned long end_page_number)
{
@@ -3220,6 +3221,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
struct page *page = folio_page(folio, start_page_number);
unsigned long head_pfn = folio_pfn(folio);
unsigned long pfn, end_pfn = head_pfn + end_page_number;
+ unsigned int order = huge_page_order(h);
/*
* As we marked all tail pages with memblock_reserved_mark_noinit(),
@@ -3227,7 +3229,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
*/
for (pfn = head_pfn + start_page_number; pfn < end_pfn; page++, pfn++) {
__init_single_page(page, pfn, zone, nid);
- prep_compound_tail((struct page *)folio, pfn - head_pfn);
+ prep_compound_tail(page, &folio->page, order);
set_page_count(page, 0);
}
}
@@ -3247,7 +3249,7 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
__folio_set_head(folio);
ret = folio_ref_freeze(folio, 1);
VM_BUG_ON(!ret);
- hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages);
+ hugetlb_folio_init_tail_vmemmap(folio, h, 1, nr_pages);
prep_compound_head((struct page *)folio, huge_page_order(h));
}
@@ -3304,7 +3306,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
* time as this is early in boot and there should
* be no contention.
*/
- hugetlb_folio_init_tail_vmemmap(folio,
+ hugetlb_folio_init_tail_vmemmap(folio, h,
HUGETLB_VMEMMAP_RESERVE_PAGES,
pages_per_huge_page(h));
}
diff --git a/mm/internal.h b/mm/internal.h
index 1561fc2ff5b8..0355da7cb6df 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -810,13 +810,12 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
INIT_LIST_HEAD(&folio->_deferred_list);
}
-static inline void prep_compound_tail(struct page *head, int tail_idx)
+static inline void prep_compound_tail(struct page *tail,
+ struct page *head, unsigned int order)
{
- struct page *p = head + tail_idx;
-
- p->mapping = TAIL_MAPPING;
- set_compound_head(p, head);
- set_page_private(p, 0);
+ tail->mapping = TAIL_MAPPING;
+ set_compound_head(tail, head, order);
+ set_page_private(tail, 0);
}
void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7712d887b696..87d1e0277318 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1102,7 +1102,7 @@ static void __ref memmap_init_compound(struct page *head,
struct page *page = pfn_to_page(pfn);
__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
- prep_compound_tail(head, pfn - head_pfn);
+ prep_compound_tail(page, head, order);
set_page_count(page, 0);
}
prep_compound_head(head, order);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ed82ee55e66a..fe77c00c99df 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -717,7 +717,7 @@ void prep_compound_page(struct page *page, unsigned int order)
__SetPageHead(page);
for (i = 1; i < nr_pages; i++)
- prep_compound_tail(page, i);
+ prep_compound_tail(page + i, page, order);
prep_compound_head(page, order);
}
--
2.51.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 01/11] mm: Change the interface of prep_compound_tail()
2025-12-05 19:43 ` [PATCH 01/11] mm: Change the interface of prep_compound_tail() Kiryl Shutsemau
@ 2025-12-05 21:49 ` Usama Arif
2025-12-05 22:10 ` Kiryl Shutsemau
0 siblings, 1 reply; 22+ messages in thread
From: Usama Arif @ 2025-12-05 21:49 UTC (permalink / raw)
To: Kiryl Shutsemau, Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
kernel-team, linux-mm, linux-kernel, linux-doc
On 05/12/2025 19:43, Kiryl Shutsemau wrote:
> Instead of passing down the head page and tail page index, pass the tail
> and head pages directly, as well as the order of the compound page.
>
> This is a preparation for changing how the head position is encoded in
> the tail page.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
> include/linux/page-flags.h | 4 +++-
> mm/hugetlb.c | 8 +++++---
> mm/internal.h | 11 +++++------
> mm/mm_init.c | 2 +-
> mm/page_alloc.c | 2 +-
> 5 files changed, 15 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 0091ad1986bf..2c1153dd7e0e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -865,7 +865,9 @@ static inline bool folio_test_large(const struct folio *folio)
> return folio_test_head(folio);
> }
>
> -static __always_inline void set_compound_head(struct page *page, struct page *head)
> +static __always_inline void set_compound_head(struct page *page,
> + struct page *head,
> + unsigned int order)
I can see that order is used later, I think patch 4, but probably this patch might cause a
build warning as order is unused? Might be good to integrate that into the later patch?
Other nit is, do we want const for head here? (Its not there before, but might be good to add).
> {
> WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
> }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0455119716ec..a55d638975bd 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3212,6 +3212,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>
> /* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
> static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
> + struct hstate *h,
> unsigned long start_page_number,
> unsigned long end_page_number)
> {
> @@ -3220,6 +3221,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
> struct page *page = folio_page(folio, start_page_number);
> unsigned long head_pfn = folio_pfn(folio);
> unsigned long pfn, end_pfn = head_pfn + end_page_number;
> + unsigned int order = huge_page_order(h);
>
> /*
> * As we marked all tail pages with memblock_reserved_mark_noinit(),
> @@ -3227,7 +3229,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
> */
> for (pfn = head_pfn + start_page_number; pfn < end_pfn; page++, pfn++) {
> __init_single_page(page, pfn, zone, nid);
> - prep_compound_tail((struct page *)folio, pfn - head_pfn);
> + prep_compound_tail(page, &folio->page, order);
> set_page_count(page, 0);
> }
> }
> @@ -3247,7 +3249,7 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
> __folio_set_head(folio);
> ret = folio_ref_freeze(folio, 1);
> VM_BUG_ON(!ret);
> - hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages);
> + hugetlb_folio_init_tail_vmemmap(folio, h, 1, nr_pages);
> prep_compound_head((struct page *)folio, huge_page_order(h));
> }
>
> @@ -3304,7 +3306,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
> * time as this is early in boot and there should
> * be no contention.
> */
> - hugetlb_folio_init_tail_vmemmap(folio,
> + hugetlb_folio_init_tail_vmemmap(folio, h,
> HUGETLB_VMEMMAP_RESERVE_PAGES,
> pages_per_huge_page(h));
> }
> diff --git a/mm/internal.h b/mm/internal.h
> index 1561fc2ff5b8..0355da7cb6df 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -810,13 +810,12 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
> INIT_LIST_HEAD(&folio->_deferred_list);
> }
>
> -static inline void prep_compound_tail(struct page *head, int tail_idx)
> +static inline void prep_compound_tail(struct page *tail,
> + struct page *head, unsigned int order)
> {
> - struct page *p = head + tail_idx;
> -
> - p->mapping = TAIL_MAPPING;
> - set_compound_head(p, head);
> - set_page_private(p, 0);
> + tail->mapping = TAIL_MAPPING;
> + set_compound_head(tail, head, order);
> + set_page_private(tail, 0);
> }
>
> void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 7712d887b696..87d1e0277318 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1102,7 +1102,7 @@ static void __ref memmap_init_compound(struct page *head,
> struct page *page = pfn_to_page(pfn);
>
> __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
> - prep_compound_tail(head, pfn - head_pfn);
> + prep_compound_tail(page, head, order);
> set_page_count(page, 0);
> }
> prep_compound_head(head, order);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ed82ee55e66a..fe77c00c99df 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -717,7 +717,7 @@ void prep_compound_page(struct page *page, unsigned int order)
>
> __SetPageHead(page);
> for (i = 1; i < nr_pages; i++)
> - prep_compound_tail(page, i);
> + prep_compound_tail(page + i, page, order);
>
> prep_compound_head(page, order);
> }
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 01/11] mm: Change the interface of prep_compound_tail()
2025-12-05 21:49 ` Usama Arif
@ 2025-12-05 22:10 ` Kiryl Shutsemau
2025-12-05 22:15 ` Usama Arif
0 siblings, 1 reply; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 22:10 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, Muchun Song, David Hildenbrand, Oscar Salvador,
Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox,
Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc
On Fri, Dec 05, 2025 at 09:49:36PM +0000, Usama Arif wrote:
>
>
> On 05/12/2025 19:43, Kiryl Shutsemau wrote:
> > Instead of passing down the head page and tail page index, pass the tail
> > and head pages directly, as well as the order of the compound page.
> >
> > This is a preparation for changing how the head position is encoded in
> > the tail page.
> >
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > ---
> > include/linux/page-flags.h | 4 +++-
> > mm/hugetlb.c | 8 +++++---
> > mm/internal.h | 11 +++++------
> > mm/mm_init.c | 2 +-
> > mm/page_alloc.c | 2 +-
> > 5 files changed, 15 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 0091ad1986bf..2c1153dd7e0e 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -865,7 +865,9 @@ static inline bool folio_test_large(const struct folio *folio)
> > return folio_test_head(folio);
> > }
> >
> > -static __always_inline void set_compound_head(struct page *page, struct page *head)
> > +static __always_inline void set_compound_head(struct page *page,
> > + struct page *head,
> > + unsigned int order)
>
> I can see that order is used later, I think patch 4, but probably this patch might cause a
> build warning as order is unused? Might be good to integrate that into the later patch?
Is there warning for unused function parameters?
I think it will blow up whole kernel, no?
> Other nit is, do we want const for head here? (Its not there before, but might be good to add).
Sure, can do.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 01/11] mm: Change the interface of prep_compound_tail()
2025-12-05 22:10 ` Kiryl Shutsemau
@ 2025-12-05 22:15 ` Usama Arif
0 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2025-12-05 22:15 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Andrew Morton, Muchun Song, David Hildenbrand, Oscar Salvador,
Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox,
Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
Jonathan Corbet, kernel-team, linux-mm, linux-kernel, linux-doc
On 05/12/2025 22:10, Kiryl Shutsemau wrote:
> On Fri, Dec 05, 2025 at 09:49:36PM +0000, Usama Arif wrote:
>>
>>
>> On 05/12/2025 19:43, Kiryl Shutsemau wrote:
>>> Instead of passing down the head page and tail page index, pass the tail
>>> and head pages directly, as well as the order of the compound page.
>>>
>>> This is a preparation for changing how the head position is encoded in
>>> the tail page.
>>>
>>> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
>>> ---
>>> include/linux/page-flags.h | 4 +++-
>>> mm/hugetlb.c | 8 +++++---
>>> mm/internal.h | 11 +++++------
>>> mm/mm_init.c | 2 +-
>>> mm/page_alloc.c | 2 +-
>>> 5 files changed, 15 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>> index 0091ad1986bf..2c1153dd7e0e 100644
>>> --- a/include/linux/page-flags.h
>>> +++ b/include/linux/page-flags.h
>>> @@ -865,7 +865,9 @@ static inline bool folio_test_large(const struct folio *folio)
>>> return folio_test_head(folio);
>>> }
>>>
>>> -static __always_inline void set_compound_head(struct page *page, struct page *head)
>>> +static __always_inline void set_compound_head(struct page *page,
>>> + struct page *head,
>>> + unsigned int order)
>>
>> I can see that order is used later, I think patch 4, but probably this patch might cause a
>> build warning as order is unused? Might be good to integrate that into the later patch?
>
> Is there warning for unused function parameters?
Ah, I haven't tried actually building, but I thought unused args would complain. If it doesn't,
should be OK.
>
> I think it will blow up whole kernel, no?
>
>> Other nit is, do we want const for head here? (Its not there before, but might be good to add).
>
> Sure, can do.
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 02/11] mm: Rename the 'compound_head' field in the 'struct page' to 'compound_info'
2025-12-05 19:43 [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
2025-12-05 19:43 ` [PATCH 01/11] mm: Change the interface of prep_compound_tail() Kiryl Shutsemau
@ 2025-12-05 19:43 ` Kiryl Shutsemau
2025-12-05 19:43 ` [PATCH 03/11] mm: Move set/clear_compound_head() to compound_head() Kiryl Shutsemau
` (9 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 19:43 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc,
Kiryl Shutsemau
The 'compound_head' field in the 'struct page' encodes whether the page
is a tail and where to locate the head page. Bit 0 is set if the page is
a tail, and the remaining bits in the field point to the head page.
As preparation for changing how the field encodes information about the
head page, rename the field to 'compound_info'.
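For reference, a sketch of how the field is currently decoded (this patch only renames the field; the logic in _compound_head() is unchanged, the variable name here is just for illustration):

	unsigned long info = READ_ONCE(page->compound_info);

	if (info & 1)					/* bit 0: this is a tail page */
		head = (struct page *)(info - 1);	/* rest: pointer to the head page */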
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
.../admin-guide/kdump/vmcoreinfo.rst | 2 +-
Documentation/mm/vmemmap_dedup.rst | 6 +++---
include/linux/mm_types.h | 20 +++++++++----------
include/linux/page-flags.h | 18 ++++++++---------
include/linux/types.h | 2 +-
kernel/vmcore_info.c | 2 +-
mm/page_alloc.c | 2 +-
mm/slab.h | 2 +-
mm/util.c | 2 +-
9 files changed, 28 insertions(+), 28 deletions(-)
diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index 404a15f6782c..7663c610fe90 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -141,7 +141,7 @@ nodemask_t
The size of a nodemask_t type. Used to compute the number of online
nodes.
-(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_head)
+(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_info)
----------------------------------------------------------------------------------
User-space tools compute their values based on the offset of these
diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
index b4a55b6569fa..1863d88d2dcb 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -24,7 +24,7 @@ For each base page, there is a corresponding ``struct page``.
Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to
contain unique information about a HugeTLB page. ``__NR_USED_SUBPAGE`` provides
this upper limit. The only 'useful' information in the remaining ``struct page``
-is the compound_head field, and this field is the same for all tail pages.
+is the compound_info field, and this field is the same for all tail pages.
By removing redundant ``struct page`` for HugeTLB pages, memory can be returned
to the buddy allocator for other uses.
@@ -124,10 +124,10 @@ Here is how things look before optimization::
| |
+-----------+
-The value of page->compound_head is the same for all tail pages. The first
+The value of page->compound_info is the same for all tail pages. The first
page of ``struct page`` (page 0) associated with the HugeTLB page contains the 4
``struct page`` necessary to describe the HugeTLB. The only use of the remaining
-pages of ``struct page`` (page 1 to page 7) is to point to page->compound_head.
+pages of ``struct page`` (page 1 to page 7) is to point to page->compound_info.
Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of ``struct page``
will be used for each HugeTLB page. This will allow us to free the remaining
7 pages to the buddy allocator.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 90e5790c318f..a94683272869 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -125,14 +125,14 @@ struct page {
atomic_long_t pp_ref_count;
};
struct { /* Tail pages of compound page */
- unsigned long compound_head; /* Bit zero is set */
+ unsigned long compound_info; /* Bit zero is set */
};
struct { /* ZONE_DEVICE pages */
/*
- * The first word is used for compound_head or folio
+ * The first word is used for compound_info or folio
* pgmap
*/
- void *_unused_pgmap_compound_head;
+ void *_unused_pgmap_compound_info;
void *zone_device_data;
/*
* ZONE_DEVICE private pages are counted as being
@@ -383,7 +383,7 @@ struct folio {
/* private: avoid cluttering the output */
/* For the Unevictable "LRU list" slot */
struct {
- /* Avoid compound_head */
+ /* Avoid compound_info */
void *__filler;
/* public: */
unsigned int mlock_count;
@@ -484,7 +484,7 @@ struct folio {
FOLIO_MATCH(flags, flags);
FOLIO_MATCH(lru, lru);
FOLIO_MATCH(mapping, mapping);
-FOLIO_MATCH(compound_head, lru);
+FOLIO_MATCH(compound_info, lru);
FOLIO_MATCH(__folio_index, index);
FOLIO_MATCH(private, private);
FOLIO_MATCH(_mapcount, _mapcount);
@@ -503,7 +503,7 @@ FOLIO_MATCH(_last_cpupid, _last_cpupid);
static_assert(offsetof(struct folio, fl) == \
offsetof(struct page, pg) + sizeof(struct page))
FOLIO_MATCH(flags, _flags_1);
-FOLIO_MATCH(compound_head, _head_1);
+FOLIO_MATCH(compound_info, _head_1);
FOLIO_MATCH(_mapcount, _mapcount_1);
FOLIO_MATCH(_refcount, _refcount_1);
#undef FOLIO_MATCH
@@ -511,13 +511,13 @@ FOLIO_MATCH(_refcount, _refcount_1);
static_assert(offsetof(struct folio, fl) == \
offsetof(struct page, pg) + 2 * sizeof(struct page))
FOLIO_MATCH(flags, _flags_2);
-FOLIO_MATCH(compound_head, _head_2);
+FOLIO_MATCH(compound_info, _head_2);
#undef FOLIO_MATCH
#define FOLIO_MATCH(pg, fl) \
static_assert(offsetof(struct folio, fl) == \
offsetof(struct page, pg) + 3 * sizeof(struct page))
FOLIO_MATCH(flags, _flags_3);
-FOLIO_MATCH(compound_head, _head_3);
+FOLIO_MATCH(compound_info, _head_3);
#undef FOLIO_MATCH
/**
@@ -583,8 +583,8 @@ struct ptdesc {
#define TABLE_MATCH(pg, pt) \
static_assert(offsetof(struct page, pg) == offsetof(struct ptdesc, pt))
TABLE_MATCH(flags, pt_flags);
-TABLE_MATCH(compound_head, pt_list);
-TABLE_MATCH(compound_head, _pt_pad_1);
+TABLE_MATCH(compound_info, pt_list);
+TABLE_MATCH(compound_info, _pt_pad_1);
TABLE_MATCH(mapping, __page_mapping);
TABLE_MATCH(__folio_index, pt_index);
TABLE_MATCH(rcu_head, pt_rcu_head);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 2c1153dd7e0e..446f89c01a4c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -213,7 +213,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
/*
* Only addresses aligned with PAGE_SIZE of struct page may be fake head
* struct page. The alignment check aims to avoid access the fields (
- * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
+ * e.g. compound_info) of the @page[1]. It can avoid touch a (possibly)
* cold cacheline in some cases.
*/
if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
@@ -223,7 +223,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
* because the @page is a compound page composed with at least
* two contiguous pages.
*/
- unsigned long head = READ_ONCE(page[1].compound_head);
+ unsigned long head = READ_ONCE(page[1].compound_info);
if (likely(head & 1))
return (const struct page *)(head - 1);
@@ -281,7 +281,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
static __always_inline unsigned long _compound_head(const struct page *page)
{
- unsigned long head = READ_ONCE(page->compound_head);
+ unsigned long head = READ_ONCE(page->compound_info);
if (unlikely(head & 1))
return head - 1;
@@ -320,13 +320,13 @@ static __always_inline unsigned long _compound_head(const struct page *page)
static __always_inline int PageTail(const struct page *page)
{
- return READ_ONCE(page->compound_head) & 1 || page_is_fake_head(page);
+ return READ_ONCE(page->compound_info) & 1 || page_is_fake_head(page);
}
static __always_inline int PageCompound(const struct page *page)
{
return test_bit(PG_head, &page->flags.f) ||
- READ_ONCE(page->compound_head) & 1;
+ READ_ONCE(page->compound_info) & 1;
}
#define PAGE_POISON_PATTERN -1l
@@ -348,7 +348,7 @@ static const unsigned long *const_folio_flags(const struct folio *folio,
{
const struct page *page = &folio->page;
- VM_BUG_ON_PGFLAGS(page->compound_head & 1, page);
+ VM_BUG_ON_PGFLAGS(page->compound_info & 1, page);
VM_BUG_ON_PGFLAGS(n > 0 && !test_bit(PG_head, &page->flags.f), page);
return &page[n].flags.f;
}
@@ -357,7 +357,7 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n)
{
struct page *page = &folio->page;
- VM_BUG_ON_PGFLAGS(page->compound_head & 1, page);
+ VM_BUG_ON_PGFLAGS(page->compound_info & 1, page);
VM_BUG_ON_PGFLAGS(n > 0 && !test_bit(PG_head, &page->flags.f), page);
return &page[n].flags.f;
}
@@ -869,12 +869,12 @@ static __always_inline void set_compound_head(struct page *page,
struct page *head,
unsigned int order)
{
- WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
+ WRITE_ONCE(page->compound_info, (unsigned long)head + 1);
}
static __always_inline void clear_compound_head(struct page *page)
{
- WRITE_ONCE(page->compound_head, 0);
+ WRITE_ONCE(page->compound_info, 0);
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/include/linux/types.h b/include/linux/types.h
index 6dfdb8e8e4c3..3a65f0ef4a73 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -234,7 +234,7 @@ struct ustat {
*
* This guarantee is important for few reasons:
* - future call_rcu_lazy() will make use of lower bits in the pointer;
- * - the structure shares storage space in struct page with @compound_head,
+ * - the structure shares storage space in struct page with @compound_info,
* which encode PageTail() in bit 0. The guarantee is needed to avoid
* false-positive PageTail().
*/
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index e066d31d08f8..782bc2050a40 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -175,7 +175,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(page, lru);
VMCOREINFO_OFFSET(page, _mapcount);
VMCOREINFO_OFFSET(page, private);
- VMCOREINFO_OFFSET(page, compound_head);
+ VMCOREINFO_OFFSET(page, compound_info);
VMCOREINFO_OFFSET(pglist_data, node_zones);
VMCOREINFO_OFFSET(pglist_data, nr_zones);
#ifdef CONFIG_FLATMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fe77c00c99df..cecd6d89ff60 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -704,7 +704,7 @@ static inline bool pcp_allowed_order(unsigned int order)
* The first PAGE_SIZE page is called the "head page" and have PG_head set.
*
* The remaining PAGE_SIZE pages are called "tail pages". PageTail() is encoded
- * in bit 0 of page->compound_head. The rest of bits is pointer to head page.
+ * in bit 0 of page->compound_info. The rest of bits is pointer to head page.
*
* The first tail page's ->compound_order holds the order of allocation.
* This usage means that zero-order pages may not be compound.
diff --git a/mm/slab.h b/mm/slab.h
index 078daecc7cf5..b471877af296 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -104,7 +104,7 @@ struct slab {
#define SLAB_MATCH(pg, sl) \
static_assert(offsetof(struct page, pg) == offsetof(struct slab, sl))
SLAB_MATCH(flags, flags);
-SLAB_MATCH(compound_head, slab_cache); /* Ensure bit 0 is clear */
+SLAB_MATCH(compound_info, slab_cache); /* Ensure bit 0 is clear */
SLAB_MATCH(_refcount, __page_refcount);
#ifdef CONFIG_MEMCG
SLAB_MATCH(memcg_data, obj_exts);
diff --git a/mm/util.c b/mm/util.c
index 8989d5767528..cbf93cf3223a 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1244,7 +1244,7 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
again:
memset(&ps->folio_snapshot, 0, sizeof(struct folio));
memcpy(&ps->page_snapshot, page, sizeof(*page));
- head = ps->page_snapshot.compound_head;
+ head = ps->page_snapshot.compound_info;
if ((head & 1) == 0) {
ps->idx = 0;
foliop = (struct folio *)&ps->page_snapshot;
--
2.51.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 03/11] mm: Move set/clear_compound_head() to compound_head()
2025-12-05 19:43 [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
2025-12-05 19:43 ` [PATCH 01/11] mm: Change the interface of prep_compound_tail() Kiryl Shutsemau
2025-12-05 19:43 ` [PATCH 02/11] mm: Rename the 'compound_head' field in the 'struct page' to 'compound_info' Kiryl Shutsemau
@ 2025-12-05 19:43 ` Kiryl Shutsemau
2025-12-05 19:43 ` [PATCH 04/11] mm: Rework compound_head() for power-of-2 sizeof(struct page) Kiryl Shutsemau
` (8 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 19:43 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc,
Kiryl Shutsemau
Move set_compound_head() and clear_compound_head() next to
compound_head().
Their logic should match, and keeping them together makes it easier.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
include/linux/page-flags.h | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 446f89c01a4c..11d9499e5ced 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -290,6 +290,18 @@ static __always_inline unsigned long _compound_head(const struct page *page)
#define compound_head(page) ((typeof(page))_compound_head(page))
+static __always_inline void set_compound_head(struct page *page,
+ struct page *head,
+ unsigned int order)
+{
+ WRITE_ONCE(page->compound_info, (unsigned long)head + 1);
+}
+
+static __always_inline void clear_compound_head(struct page *page)
+{
+ WRITE_ONCE(page->compound_info, 0);
+}
+
/**
* page_folio - Converts from page to folio.
* @p: The page.
@@ -865,18 +877,6 @@ static inline bool folio_test_large(const struct folio *folio)
return folio_test_head(folio);
}
-static __always_inline void set_compound_head(struct page *page,
- struct page *head,
- unsigned int order)
-{
- WRITE_ONCE(page->compound_info, (unsigned long)head + 1);
-}
-
-static __always_inline void clear_compound_head(struct page *page)
-{
- WRITE_ONCE(page->compound_info, 0);
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static inline void ClearPageCompound(struct page *page)
{
--
2.51.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 04/11] mm: Rework compound_head() for power-of-2 sizeof(struct page)
2025-12-05 19:43 [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
` (2 preceding siblings ...)
2025-12-05 19:43 ` [PATCH 03/11] mm: Move set/clear_compound_head() to compound_head() Kiryl Shutsemau
@ 2025-12-05 19:43 ` Kiryl Shutsemau
2025-12-06 0:25 ` Usama Arif
2025-12-05 19:43 ` [PATCH 05/11] mm/hugetlb: Refactor code around vmemmap_walk Kiryl Shutsemau
` (7 subsequent siblings)
11 siblings, 1 reply; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 19:43 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc,
Kiryl Shutsemau
For tail pages, the kernel uses the 'compound_info' field to get to the
head page. The bit 0 of the field indicates whether the page is a
tail page, and if set, the remaining bits represent a pointer to the
head page.
For cases when size of struct page is power-of-2, change the encoding of
compound_info to store a mask that can be applied to the virtual address
of the tail page in order to access the head page. It is possible
because sturct page of the head page is naturally aligned with regards
to order of the page.
The significant impact of this modification is that all tail pages of
the same order will now have identical 'compound_info', regardless of
the compound page they are associated with. This paves the way for
eliminating fake heads.
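Roughly, the new encoding and decoding look like this (a sketch condensing the set_compound_head()/compound_head() hunks below, for the power-of-2 case only):

	/* Encode: the head's struct page is naturally aligned to the compound
	 * order, so the low 'order + log2(sizeof(struct page))' bits of its
	 * virtual address are zero. Store a mask; bit 0 still marks a tail. */
	shift = order + order_base_2(sizeof(struct page));
	mask  = GENMASK(BITS_PER_LONG - 1, shift);
	WRITE_ONCE(tail->compound_info, mask | 1);

	/* Decode: apply the mask to the tail's own virtual address. */
	head = (struct page *)((unsigned long)tail & mask);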
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
include/linux/page-flags.h | 61 +++++++++++++++++++++++++++++++++-----
mm/util.c | 15 +++++++---
2 files changed, 64 insertions(+), 12 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 11d9499e5ced..eef02fbbb40f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -210,6 +210,13 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
return page;
+ /*
+ * Fake heads only exists if size of struct page is power-of-2.
+ * See hugetlb_vmemmap_optimizable_size().
+ */
+ if (!is_power_of_2(sizeof(struct page)))
+ return page;
+
/*
* Only addresses aligned with PAGE_SIZE of struct page may be fake head
* struct page. The alignment check aims to avoid access the fields (
@@ -223,10 +230,13 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
* because the @page is a compound page composed with at least
* two contiguous pages.
*/
- unsigned long head = READ_ONCE(page[1].compound_info);
+ unsigned long info = READ_ONCE(page[1].compound_info);
- if (likely(head & 1))
- return (const struct page *)(head - 1);
+ if (likely(info & 1)) {
+ unsigned long p = (unsigned long)page;
+
+ return (const struct page *)(p & info);
+ }
}
return page;
}
@@ -281,11 +291,27 @@ static __always_inline int page_is_fake_head(const struct page *page)
static __always_inline unsigned long _compound_head(const struct page *page)
{
- unsigned long head = READ_ONCE(page->compound_info);
+ unsigned long info = READ_ONCE(page->compound_info);
- if (unlikely(head & 1))
- return head - 1;
- return (unsigned long)page_fixed_fake_head(page);
+ /* Bit 0 encodes PageTail() */
+ if (!(info & 1))
+ return (unsigned long)page_fixed_fake_head(page);
+
+ /*
+ * If the size of struct page is not power-of-2, the rest if
+ * compound_info is the pointer to the head page.
+ */
+ if (!is_power_of_2(sizeof(struct page)))
+ return info - 1;
+
+ /*
+ * If the size of struct page is power-of-2 it is set the rest of
+ * the info encodes the mask that converts the address of the tail
+ * page to the head page.
+ *
+ * No need to clear bit 0 in the mask as 'page' always has it clear.
+ */
+ return (unsigned long)page & info;
}
#define compound_head(page) ((typeof(page))_compound_head(page))
@@ -294,7 +320,26 @@ static __always_inline void set_compound_head(struct page *page,
struct page *head,
unsigned int order)
{
- WRITE_ONCE(page->compound_info, (unsigned long)head + 1);
+ unsigned int shift;
+ unsigned long mask;
+
+ if (!is_power_of_2(sizeof(struct page))) {
+ WRITE_ONCE(page->compound_info, (unsigned long)head | 1);
+ return;
+ }
+
+ /*
+ * If the size of struct page is power-of-2, bits [shift:0] of the
+ * virtual address of compound head are zero.
+ *
+ * Calculate mask that can be applied the virtual address of the
+ * tail page to get address of the head page.
+ */
+ shift = order + order_base_2(sizeof(struct page));
+ mask = GENMASK(BITS_PER_LONG - 1, shift);
+
+ /* Bit 0 encodes PageTail() */
+ WRITE_ONCE(page->compound_info, mask | 1);
}
static __always_inline void clear_compound_head(struct page *page)
diff --git a/mm/util.c b/mm/util.c
index cbf93cf3223a..6723d2bb7f1e 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1234,7 +1234,7 @@ static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio,
*/
void snapshot_page(struct page_snapshot *ps, const struct page *page)
{
- unsigned long head, nr_pages = 1;
+ unsigned long info, nr_pages = 1;
struct folio *foliop;
int loops = 5;
@@ -1244,8 +1244,8 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
again:
memset(&ps->folio_snapshot, 0, sizeof(struct folio));
memcpy(&ps->page_snapshot, page, sizeof(*page));
- head = ps->page_snapshot.compound_info;
- if ((head & 1) == 0) {
+ info = ps->page_snapshot.compound_info;
+ if ((info & 1) == 0) {
ps->idx = 0;
foliop = (struct folio *)&ps->page_snapshot;
if (!folio_test_large(foliop)) {
@@ -1256,7 +1256,14 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
}
foliop = (struct folio *)page;
} else {
- foliop = (struct folio *)(head - 1);
+ unsigned long p = (unsigned long)page;
+
+ /* See compound_head() */
+ if (is_power_of_2(sizeof(struct page)))
+ foliop = (struct folio *)(p & info);
+ else
+ foliop = (struct folio *)(info - 1);
+
ps->idx = folio_page_idx(foliop, page);
}
--
2.51.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 04/11] mm: Rework compound_head() for power-of-2 sizeof(struct page)
2025-12-05 19:43 ` [PATCH 04/11] mm: Rework compound_head() for power-of-2 sizeof(struct page) Kiryl Shutsemau
@ 2025-12-06 0:25 ` Usama Arif
0 siblings, 0 replies; 22+ messages in thread
From: Usama Arif @ 2025-12-06 0:25 UTC (permalink / raw)
To: Kiryl Shutsemau, Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
kernel-team, linux-mm, linux-kernel, linux-doc
On 05/12/2025 19:43, Kiryl Shutsemau wrote:
> For tail pages, the kernel uses the 'compound_info' field to get to the
> head page. The bit 0 of the field indicates whether the page is a
> tail page, and if set, the remaining bits represent a pointer to the
> head page.
>
> For cases when size of struct page is power-of-2, change the encoding of
> compound_info to store a mask that can be applied to the virtual address
> of the tail page in order to access the head page. It is possible
> because sturct page of the head page is naturally aligned with regards
nit: s/sturct/struct/
> to order of the page.
Might be good to state here that no change is expected if the struct page
is not a power of 2.
>
> The significant impact of this modification is that all tail pages of
> the same order will now have identical 'compound_info', regardless of
> the compound page they are associated with. This paves the way for
> eliminating fake heads.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
> include/linux/page-flags.h | 61 +++++++++++++++++++++++++++++++++-----
> mm/util.c | 15 +++++++---
> 2 files changed, 64 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 11d9499e5ced..eef02fbbb40f 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -210,6 +210,13 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
> if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
> return page;
>
> + /*
> + * Fake heads only exists if size of struct page is power-of-2.
> + * See hugetlb_vmemmap_optimizable_size().
> + */
> + if (!is_power_of_2(sizeof(struct page)))
> + return page;
> +
Hmm, my understanding from reviewing the series up to this patch is that everything works
the same as the old code when struct page is not a power of 2. Returning page here means you
don't fix the page head when sizeof(struct page) is not a power of 2?
> /*
> * Only addresses aligned with PAGE_SIZE of struct page may be fake head
> * struct page. The alignment check aims to avoid access the fields (
> @@ -223,10 +230,13 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
> * because the @page is a compound page composed with at least
> * two contiguous pages.
> */
> - unsigned long head = READ_ONCE(page[1].compound_info);
> + unsigned long info = READ_ONCE(page[1].compound_info);
>
> - if (likely(head & 1))
> - return (const struct page *)(head - 1);
> + if (likely(info & 1)) {
> + unsigned long p = (unsigned long)page;
> +
> + return (const struct page *)(p & info);
Would it be worth writing a comment over here, similar to what you have in set_compound_head,
to explain why this works? i.e. compound_info contains the mask derived from the folio order that
can be applied to the virtual address to get the head page.
Also, it takes a few minutes to wrap your head around the fact that this works because the struct
page of the head page is aligned with respect to the order. Might be good to add that as a
comment somewhere? I don't see it documented in this patch; if it's in a future patch, please
ignore this comment.
> + }
> }
> return page;
> }
> @@ -281,11 +291,27 @@ static __always_inline int page_is_fake_head(const struct page *page)
>
> static __always_inline unsigned long _compound_head(const struct page *page)
> {
> - unsigned long head = READ_ONCE(page->compound_info);
> + unsigned long info = READ_ONCE(page->compound_info);
>
> - if (unlikely(head & 1))
> - return head - 1;
> - return (unsigned long)page_fixed_fake_head(page);
> + /* Bit 0 encodes PageTail() */
> + if (!(info & 1))
> + return (unsigned long)page_fixed_fake_head(page);
> +
> + /*
> + * If the size of struct page is not power-of-2, the rest if
nit: s/if/of
> + * compound_info is the pointer to the head page.
> + */
> + if (!is_power_of_2(sizeof(struct page)))
> + return info - 1;
> +
> + /*
> + * If the size of struct page is power-of-2 it is set the rest of
nit: remove "it is set"
> + * the info encodes the mask that converts the address of the tail
> + * page to the head page.
> + *
> + * No need to clear bit 0 in the mask as 'page' always has it clear.
> + */
> + return (unsigned long)page & info;
> }
>
> #define compound_head(page) ((typeof(page))_compound_head(page))
> @@ -294,7 +320,26 @@ static __always_inline void set_compound_head(struct page *page,
> struct page *head,
> unsigned int order)
> {
> - WRITE_ONCE(page->compound_info, (unsigned long)head + 1);
> + unsigned int shift;
> + unsigned long mask;
> +
> + if (!is_power_of_2(sizeof(struct page))) {
> + WRITE_ONCE(page->compound_info, (unsigned long)head | 1);
> + return;
> + }
> +
> + /*
> + * If the size of struct page is power-of-2, bits [shift:0] of the
> + * virtual address of compound head are zero.
> + *
> + * Calculate mask that can be applied the virtual address of the
nit: applied to the ..
> + * tail page to get address of the head page.
> + */
> + shift = order + order_base_2(sizeof(struct page));
> + mask = GENMASK(BITS_PER_LONG - 1, shift);
> +
> + /* Bit 0 encodes PageTail() */
> + WRITE_ONCE(page->compound_info, mask | 1);
> }
>
> static __always_inline void clear_compound_head(struct page *page)
> diff --git a/mm/util.c b/mm/util.c
> index cbf93cf3223a..6723d2bb7f1e 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1234,7 +1234,7 @@ static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio,
> */
> void snapshot_page(struct page_snapshot *ps, const struct page *page)
> {
> - unsigned long head, nr_pages = 1;
> + unsigned long info, nr_pages = 1;
> struct folio *foliop;
> int loops = 5;
>
> @@ -1244,8 +1244,8 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
> again:
> memset(&ps->folio_snapshot, 0, sizeof(struct folio));
> memcpy(&ps->page_snapshot, page, sizeof(*page));
> - head = ps->page_snapshot.compound_info;
> - if ((head & 1) == 0) {
> + info = ps->page_snapshot.compound_info;
> + if ((info & 1) == 0) {
> ps->idx = 0;
> foliop = (struct folio *)&ps->page_snapshot;
> if (!folio_test_large(foliop)) {
> @@ -1256,7 +1256,14 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
> }
> foliop = (struct folio *)page;
> } else {
> - foliop = (struct folio *)(head - 1);
> + unsigned long p = (unsigned long)page;
> +
> + /* See compound_head() */
> + if (is_power_of_2(sizeof(struct page)))
> + foliop = (struct folio *)(p & info);
> + else
> + foliop = (struct folio *)(info - 1);
> +
Would it be better to do the below, as you then don't need to declare p if sizeof(struct page) is not
a power of 2?
	if (!is_power_of_2(sizeof(struct page)))
		foliop = (struct folio *)(info - 1);
	else {
		unsigned long p = (unsigned long)page;

		foliop = (struct folio *)(p & info);
	}
> ps->idx = folio_page_idx(foliop, page);
> }
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 05/11] mm/hugetlb: Refactor code around vmemmap_walk
2025-12-05 19:43 [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
` (3 preceding siblings ...)
2025-12-05 19:43 ` [PATCH 04/11] mm: Rework compound_head() for power-of-2 sizeof(struct page) Kiryl Shutsemau
@ 2025-12-05 19:43 ` Kiryl Shutsemau
2025-12-05 19:43 ` [PATCH 06/11] mm/hugetlb: Remove fake head pages Kiryl Shutsemau
` (6 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 19:43 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc,
Kiryl Shutsemau
To prepare for removing fake head pages, the vmemmap_walk code is being reworked.
The reuse_page and reuse_addr variables are being eliminated. There will
no longer be an expectation regarding the reuse address in relation to
the operated range. Instead, the caller will provide head and tail
vmemmap pages, along with the vmemmap_start address where the head page
is located.
Currently, vmemmap_head and vmemmap_tail are set to the same page, but
this will change in the future.
The only functional change is that __hugetlb_vmemmap_optimize_folio()
will abandon optimization if memory allocation fails.
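The choice between the two pages in the remap callback becomes a simple address check (a sketch of the logic in the reworked vmemmap_remap_pte() below):

	if (unlikely(addr == walk->vmemmap_start))
		/* First page of the range: install the writable head copy. */
		entry = mk_pte(walk->vmemmap_head, PAGE_KERNEL);
	else
		/* All other pages: install the read-only tail page. */
		entry = mk_pte(walk->vmemmap_tail, PAGE_KERNEL_RO);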
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
mm/hugetlb_vmemmap.c | 184 ++++++++++++++++++-------------------------
1 file changed, 77 insertions(+), 107 deletions(-)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index ba0fb1b6a5a8..f5ee499b8563 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -24,8 +24,9 @@
*
* @remap_pte: called for each lowest-level entry (PTE).
* @nr_walked: the number of walked pte.
- * @reuse_page: the page which is reused for the tail vmemmap pages.
- * @reuse_addr: the virtual address of the @reuse_page page.
+ * @vmemmap_start: the start of vmemmap range, where head page is located
+ * @vmemmap_head: the page to be installed as first in the vmemmap range
+ * @vmemmap_tail: the page to be installed as non-first in the vmemmap range
* @vmemmap_pages: the list head of the vmemmap pages that can be freed
* or is mapped from.
* @flags: used to modify behavior in vmemmap page table walking
@@ -34,11 +35,14 @@
struct vmemmap_remap_walk {
void (*remap_pte)(pte_t *pte, unsigned long addr,
struct vmemmap_remap_walk *walk);
+
unsigned long nr_walked;
- struct page *reuse_page;
- unsigned long reuse_addr;
+ unsigned long vmemmap_start;
+ struct page *vmemmap_head;
+ struct page *vmemmap_tail;
struct list_head *vmemmap_pages;
+
/* Skip the TLB flush when we split the PMD */
#define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
/* Skip the TLB flush when we remap the PTE */
@@ -140,14 +144,7 @@ static int vmemmap_pte_entry(pte_t *pte, unsigned long addr,
{
struct vmemmap_remap_walk *vmemmap_walk = walk->private;
- /*
- * The reuse_page is found 'first' in page table walking before
- * starting remapping.
- */
- if (!vmemmap_walk->reuse_page)
- vmemmap_walk->reuse_page = pte_page(ptep_get(pte));
- else
- vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
+ vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
vmemmap_walk->nr_walked++;
return 0;
@@ -207,18 +204,12 @@ static void free_vmemmap_page_list(struct list_head *list)
static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
struct vmemmap_remap_walk *walk)
{
- /*
- * Remap the tail pages as read-only to catch illegal write operation
- * to the tail pages.
- */
- pgprot_t pgprot = PAGE_KERNEL_RO;
struct page *page = pte_page(ptep_get(pte));
pte_t entry;
/* Remapping the head page requires r/w */
- if (unlikely(addr == walk->reuse_addr)) {
- pgprot = PAGE_KERNEL;
- list_del(&walk->reuse_page->lru);
+ if (unlikely(addr == walk->vmemmap_start)) {
+ list_del(&walk->vmemmap_head->lru);
/*
* Makes sure that preceding stores to the page contents from
@@ -226,9 +217,16 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
* write.
*/
smp_wmb();
+
+ entry = mk_pte(walk->vmemmap_head, PAGE_KERNEL);
+ } else {
+ /*
+ * Remap the tail pages as read-only to catch illegal write
+ * operation to the tail pages.
+ */
+ entry = mk_pte(walk->vmemmap_tail, PAGE_KERNEL_RO);
}
- entry = mk_pte(walk->reuse_page, pgprot);
list_add(&page->lru, walk->vmemmap_pages);
set_pte_at(&init_mm, addr, pte, entry);
}
@@ -255,16 +253,13 @@ static inline void reset_struct_pages(struct page *start)
static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
struct vmemmap_remap_walk *walk)
{
- pgprot_t pgprot = PAGE_KERNEL;
struct page *page;
void *to;
- BUG_ON(pte_page(ptep_get(pte)) != walk->reuse_page);
-
page = list_first_entry(walk->vmemmap_pages, struct page, lru);
list_del(&page->lru);
to = page_to_virt(page);
- copy_page(to, (void *)walk->reuse_addr);
+ copy_page(to, (void *)walk->vmemmap_start);
reset_struct_pages(to);
/*
@@ -272,7 +267,7 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
* before the set_pte_at() write.
*/
smp_wmb();
- set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+ set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
}
/**
@@ -282,22 +277,17 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
* to remap.
* @end: end address of the vmemmap virtual address range that we want to
* remap.
- * @reuse: reuse address.
- *
* Return: %0 on success, negative error code otherwise.
*/
-static int vmemmap_remap_split(unsigned long start, unsigned long end,
- unsigned long reuse)
+static int vmemmap_remap_split(unsigned long start, unsigned long end)
{
struct vmemmap_remap_walk walk = {
.remap_pte = NULL,
+ .vmemmap_start = start,
.flags = VMEMMAP_SPLIT_NO_TLB_FLUSH,
};
- /* See the comment in the vmemmap_remap_free(). */
- BUG_ON(start - reuse != PAGE_SIZE);
-
- return vmemmap_remap_range(reuse, end, &walk);
+ return vmemmap_remap_range(start, end, &walk);
}
/**
@@ -308,7 +298,8 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
* to remap.
* @end: end address of the vmemmap virtual address range that we want to
* remap.
- * @reuse: reuse address.
+ * @vmemmap_head: the page to be installed as first in the vmemmap range
+ * @vmemmap_tail: the page to be installed as non-first in the vmemmap range
* @vmemmap_pages: list to deposit vmemmap pages to be freed. It is callers
* responsibility to free pages.
* @flags: modifications to vmemmap_remap_walk flags
@@ -316,69 +307,40 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
* Return: %0 on success, negative error code otherwise.
*/
static int vmemmap_remap_free(unsigned long start, unsigned long end,
- unsigned long reuse,
+ struct page *vmemmap_head,
+ struct page *vmemmap_tail,
struct list_head *vmemmap_pages,
unsigned long flags)
{
int ret;
struct vmemmap_remap_walk walk = {
.remap_pte = vmemmap_remap_pte,
- .reuse_addr = reuse,
+ .vmemmap_start = start,
+ .vmemmap_head = vmemmap_head,
+ .vmemmap_tail = vmemmap_tail,
.vmemmap_pages = vmemmap_pages,
.flags = flags,
};
- int nid = page_to_nid((struct page *)reuse);
- gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
+
+ ret = vmemmap_remap_range(start, end, &walk);
+ if (!ret || !walk.nr_walked)
+ return ret;
+
+ end = start + walk.nr_walked * PAGE_SIZE;
/*
- * Allocate a new head vmemmap page to avoid breaking a contiguous
- * block of struct page memory when freeing it back to page allocator
- * in free_vmemmap_page_list(). This will allow the likely contiguous
- * struct page backing memory to be kept contiguous and allowing for
- * more allocations of hugepages. Fallback to the currently
- * mapped head page in case should it fail to allocate.
+ * vmemmap_pages contains pages from the previous vmemmap_remap_range()
+ * call which failed. These are pages which were removed from
+ * the vmemmap. They will be restored in the following call.
*/
- walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
- if (walk.reuse_page) {
- copy_page(page_to_virt(walk.reuse_page),
- (void *)walk.reuse_addr);
- list_add(&walk.reuse_page->lru, vmemmap_pages);
- memmap_pages_add(1);
- }
+ walk = (struct vmemmap_remap_walk) {
+ .remap_pte = vmemmap_restore_pte,
+ .vmemmap_start = start,
+ .vmemmap_pages = vmemmap_pages,
+ .flags = 0,
+ };
- /*
- * In order to make remapping routine most efficient for the huge pages,
- * the routine of vmemmap page table walking has the following rules
- * (see more details from the vmemmap_pte_range()):
- *
- * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
- * should be continuous.
- * - The @reuse address is part of the range [@reuse, @end) that we are
- * walking which is passed to vmemmap_remap_range().
- * - The @reuse address is the first in the complete range.
- *
- * So we need to make sure that @start and @reuse meet the above rules.
- */
- BUG_ON(start - reuse != PAGE_SIZE);
-
- ret = vmemmap_remap_range(reuse, end, &walk);
- if (ret && walk.nr_walked) {
- end = reuse + walk.nr_walked * PAGE_SIZE;
- /*
- * vmemmap_pages contains pages from the previous
- * vmemmap_remap_range call which failed. These
- * are pages which were removed from the vmemmap.
- * They will be restored in the following call.
- */
- walk = (struct vmemmap_remap_walk) {
- .remap_pte = vmemmap_restore_pte,
- .reuse_addr = reuse,
- .vmemmap_pages = vmemmap_pages,
- .flags = 0,
- };
-
- vmemmap_remap_range(reuse, end, &walk);
- }
+ vmemmap_remap_range(start + PAGE_SIZE, end, &walk);
return ret;
}
@@ -415,29 +377,27 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
* to remap.
* @end: end address of the vmemmap virtual address range that we want to
* remap.
- * @reuse: reuse address.
* @flags: modifications to vmemmap_remap_walk flags
*
* Return: %0 on success, negative error code otherwise.
*/
static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
- unsigned long reuse, unsigned long flags)
+ unsigned long flags)
{
LIST_HEAD(vmemmap_pages);
struct vmemmap_remap_walk walk = {
.remap_pte = vmemmap_restore_pte,
- .reuse_addr = reuse,
+ .vmemmap_start = start,
.vmemmap_pages = &vmemmap_pages,
.flags = flags,
};
- /* See the comment in the vmemmap_remap_free(). */
- BUG_ON(start - reuse != PAGE_SIZE);
+ start += HUGETLB_VMEMMAP_RESERVE_SIZE;
if (alloc_vmemmap_page_list(start, end, &vmemmap_pages))
return -ENOMEM;
- return vmemmap_remap_range(reuse, end, &walk);
+ return vmemmap_remap_range(start, end, &walk);
}
DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
@@ -454,8 +414,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
struct folio *folio, unsigned long flags)
{
int ret;
- unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
- unsigned long vmemmap_reuse;
+ unsigned long vmemmap_start, vmemmap_end;
VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
@@ -466,9 +425,8 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
if (flags & VMEMMAP_SYNCHRONIZE_RCU)
synchronize_rcu();
+ vmemmap_start = (unsigned long)folio;
vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
- vmemmap_reuse = vmemmap_start;
- vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
/*
* The pages which the vmemmap virtual address range [@vmemmap_start,
@@ -477,7 +435,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
* When a HugeTLB page is freed to the buddy allocator, previously
* discarded vmemmap pages must be allocated and remapping.
*/
- ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
+ ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, flags);
if (!ret) {
folio_clear_hugetlb_vmemmap_optimized(folio);
static_branch_dec(&hugetlb_optimize_vmemmap_key);
@@ -565,9 +523,9 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
struct list_head *vmemmap_pages,
unsigned long flags)
{
- int ret = 0;
- unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
- unsigned long vmemmap_reuse;
+ unsigned long vmemmap_start, vmemmap_end;
+ struct page *vmemmap_head, *vmemmap_tail;
+ int nid, ret = 0;
VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
@@ -592,9 +550,21 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
*/
folio_set_hugetlb_vmemmap_optimized(folio);
+ nid = folio_nid(folio);
+ vmemmap_head = alloc_pages_node(nid, GFP_KERNEL, 0);
+
+ if (!vmemmap_head) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ copy_page(page_to_virt(vmemmap_head), folio);
+ list_add(&vmemmap_head->lru, vmemmap_pages);
+ memmap_pages_add(1);
+
+ vmemmap_tail = vmemmap_head;
+ vmemmap_start = (unsigned long)folio;
vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
- vmemmap_reuse = vmemmap_start;
- vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
/*
* Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
@@ -602,8 +572,10 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
* mapping the range to vmemmap_pages list so that they can be freed by
* the caller.
*/
- ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse,
+ ret = vmemmap_remap_free(vmemmap_start, vmemmap_end,
+ vmemmap_head, vmemmap_tail,
vmemmap_pages, flags);
+out:
if (ret) {
static_branch_dec(&hugetlb_optimize_vmemmap_key);
folio_clear_hugetlb_vmemmap_optimized(folio);
@@ -632,21 +604,19 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
static int hugetlb_vmemmap_split_folio(const struct hstate *h, struct folio *folio)
{
- unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
- unsigned long vmemmap_reuse;
+ unsigned long vmemmap_start, vmemmap_end;
if (!vmemmap_should_optimize_folio(h, folio))
return 0;
+ vmemmap_start = (unsigned long)folio;
vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
- vmemmap_reuse = vmemmap_start;
- vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
/*
* Split PMDs on the vmemmap virtual address range [@vmemmap_start,
* @vmemmap_end]
*/
- return vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
+ return vmemmap_remap_split(vmemmap_start, vmemmap_end);
}
static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
--
2.51.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 06/11] mm/hugetlb: Remove fake head pages
2025-12-05 19:43 [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
` (4 preceding siblings ...)
2025-12-05 19:43 ` [PATCH 05/11] mm/hugetlb: Refactor code around vmemmap_walk Kiryl Shutsemau
@ 2025-12-05 19:43 ` Kiryl Shutsemau
2025-12-05 19:43 ` [PATCH 07/11] mm: Drop fake head checks and fix a race condition Kiryl Shutsemau
` (5 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 19:43 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc,
Kiryl Shutsemau
HugeTLB optimizes vmemmap memory usage by freeing all but the first page
of vmemmap memory for the huge page and remapping the rest of the pages
to the first one.
This only occurs if the size of the struct page is a power of 2. In
these instances, the compound head position encoding in the tail pages
ensures that all tail pages of the same order are identical, regardless
of the page to which they belong.
This allows fake head pages to be eliminated without significant memory
overhead: a page filled with tail struct pages is allocated per hstate
and mapped, instead of the page containing the head page, for all
HugeTLB pages of the given hstate.
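To make the "identical tails" point concrete, here is a minimal
userspace sketch (not part of the patch) of the mask-style encoding
described in this series; the 64-byte struct page size, the example
addresses, and the exact bit layout are illustrative assumptions only:

	/*
	 * Illustration only: assumes sizeof(struct page) is a power of 2
	 * (64 bytes here) and that the head's struct pages are naturally
	 * aligned to the compound span.
	 */
	#include <assert.h>

	#define STRUCT_PAGE_SIZE	64UL	/* assumed power-of-2 struct page */

	int main(void)
	{
		unsigned long order = 9;	/* e.g. 2MB huge page */
		unsigned long span = STRUCT_PAGE_SIZE << order;
		/* Bit 0 encodes PageTail(); the rest masks any tail's vmemmap
		 * address down to the head's vmemmap address. */
		unsigned long info = ~(span - 1) | 1UL;
		unsigned long head = 0xffffea0004000000UL; /* span-aligned example */
		unsigned long tail = head + 5 * STRUCT_PAGE_SIZE;

		/* 'info' depends only on the order, not on which huge page the
		 * tail belongs to, so all tails of one order are identical. */
		assert((tail & info) == head);
		return 0;
	}

Because the encoding is the same for every tail of a given order, a
single shared page of such tails can back all huge pages of the hstate,
which is the property hugetlb_vmemmap_tail_alloc() below relies on when
it fills the shared page with prep_compound_tail().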
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
include/linux/hugetlb.h | 3 +++
mm/hugetlb_vmemmap.c | 31 +++++++++++++++++++++++++++----
mm/hugetlb_vmemmap.h | 4 ++--
3 files changed, 32 insertions(+), 6 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8e63e46b8e1f..75dd940fda22 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -676,6 +676,9 @@ struct hstate {
unsigned int free_huge_pages_node[MAX_NUMNODES];
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
char name[HSTATE_NAME_LEN];
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+ struct page *vmemmap_tail;
+#endif
};
struct cma;
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index f5ee499b8563..2543bdbcae20 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -18,6 +18,7 @@
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include "hugetlb_vmemmap.h"
+#include "internal.h"
/**
* struct vmemmap_remap_walk - walk vmemmap page table
@@ -518,7 +519,24 @@ static bool vmemmap_should_optimize_folio(const struct hstate *h, struct folio *
return true;
}
-static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
+static void hugetlb_vmemmap_tail_alloc(struct hstate *h)
+{
+ struct page *p;
+
+ if (h->vmemmap_tail)
+ return;
+
+ h->vmemmap_tail = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!h->vmemmap_tail)
+ return;
+
+ p = page_to_virt(h->vmemmap_tail);
+
+ for (int i = 0; i < PAGE_SIZE / sizeof(struct page); i++)
+ prep_compound_tail(p + i, p, huge_page_order(h));
+}
+
+static int __hugetlb_vmemmap_optimize_folio(struct hstate *h,
struct folio *folio,
struct list_head *vmemmap_pages,
unsigned long flags)
@@ -533,6 +551,11 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
if (!vmemmap_should_optimize_folio(h, folio))
return ret;
+ if (!h->vmemmap_tail)
+ hugetlb_vmemmap_tail_alloc(h);
+ if (!h->vmemmap_tail)
+ return -ENOMEM;
+
static_branch_inc(&hugetlb_optimize_vmemmap_key);
if (flags & VMEMMAP_SYNCHRONIZE_RCU)
@@ -562,7 +585,7 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
list_add(&vmemmap_head->lru, vmemmap_pages);
memmap_pages_add(1);
- vmemmap_tail = vmemmap_head;
+ vmemmap_tail = h->vmemmap_tail;
vmemmap_start = (unsigned long)folio;
vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
@@ -594,7 +617,7 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
* can use folio_test_hugetlb_vmemmap_optimized(@folio) to detect if @folio's
* vmemmap pages have been optimized.
*/
-void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
+void hugetlb_vmemmap_optimize_folio(struct hstate *h, struct folio *folio)
{
LIST_HEAD(vmemmap_pages);
@@ -868,7 +891,7 @@ static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
static int __init hugetlb_vmemmap_init(void)
{
- const struct hstate *h;
+ struct hstate *h;
/* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */
BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 18b490825215..f44e40c44a21 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -24,7 +24,7 @@ int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio);
long hugetlb_vmemmap_restore_folios(const struct hstate *h,
struct list_head *folio_list,
struct list_head *non_hvo_folios);
-void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);
+void hugetlb_vmemmap_optimize_folio(struct hstate *h, struct folio *folio);
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head *folio_list);
#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
@@ -64,7 +64,7 @@ static inline long hugetlb_vmemmap_restore_folios(const struct hstate *h,
return 0;
}
-static inline void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
+static inline void hugetlb_vmemmap_optimize_folio(struct hstate *h, struct folio *folio)
{
}
--
2.51.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 07/11] mm: Drop fake head checks and fix a race condition
2025-12-05 19:43 [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
` (5 preceding siblings ...)
2025-12-05 19:43 ` [PATCH 06/11] mm/hugetlb: Remove fake head pages Kiryl Shutsemau
@ 2025-12-05 19:43 ` Kiryl Shutsemau
2025-12-05 19:43 ` [PATCH 08/11] hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU Kiryl Shutsemau
` (4 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 19:43 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc,
Kiryl Shutsemau
Fake heads are no longer in use, so the checks for them can be removed.
This simplifies compound_head() and page_ref_add_unless() substantially.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
include/linux/page-flags.h | 95 ++------------------------------------
include/linux/page_ref.h | 8 +---
2 files changed, 4 insertions(+), 99 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index eef02fbbb40f..8acb141a127b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -198,104 +198,15 @@ enum pageflags {
#ifndef __GENERATING_BOUNDS_H
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
-/*
- * Return the real head page struct iff the @page is a fake head page, otherwise
- * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
- */
-static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
-{
- if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
- return page;
-
- /*
- * Fake heads only exists if size of struct page is power-of-2.
- * See hugetlb_vmemmap_optimizable_size().
- */
- if (!is_power_of_2(sizeof(struct page)))
- return page;
-
- /*
- * Only addresses aligned with PAGE_SIZE of struct page may be fake head
- * struct page. The alignment check aims to avoid access the fields (
- * e.g. compound_info) of the @page[1]. It can avoid touch a (possibly)
- * cold cacheline in some cases.
- */
- if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
- test_bit(PG_head, &page->flags.f)) {
- /*
- * We can safely access the field of the @page[1] with PG_head
- * because the @page is a compound page composed with at least
- * two contiguous pages.
- */
- unsigned long info = READ_ONCE(page[1].compound_info);
-
- if (likely(info & 1)) {
- unsigned long p = (unsigned long)page;
-
- return (const struct page *)(p & info);
- }
- }
- return page;
-}
-
-static __always_inline bool page_count_writable(const struct page *page, int u)
-{
- if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
- return true;
-
- /*
- * The refcount check is ordered before the fake-head check to prevent
- * the following race:
- * CPU 1 (HVO) CPU 2 (speculative PFN walker)
- *
- * page_ref_freeze()
- * synchronize_rcu()
- * rcu_read_lock()
- * page_is_fake_head() is false
- * vmemmap_remap_pte()
- * XXX: struct page[] becomes r/o
- *
- * page_ref_unfreeze()
- * page_ref_count() is not zero
- *
- * atomic_add_unless(&page->_refcount)
- * XXX: try to modify r/o struct page[]
- *
- * The refcount check also prevents modification attempts to other (r/o)
- * tail pages that are not fake heads.
- */
- if (atomic_read_acquire(&page->_refcount) == u)
- return false;
-
- return page_fixed_fake_head(page) == page;
-}
-#else
-static inline const struct page *page_fixed_fake_head(const struct page *page)
-{
- return page;
-}
-
-static inline bool page_count_writable(const struct page *page, int u)
-{
- return true;
-}
-#endif
-
-static __always_inline int page_is_fake_head(const struct page *page)
-{
- return page_fixed_fake_head(page) != page;
-}
-
static __always_inline unsigned long _compound_head(const struct page *page)
{
unsigned long info = READ_ONCE(page->compound_info);
/* Bit 0 encodes PageTail() */
if (!(info & 1))
- return (unsigned long)page_fixed_fake_head(page);
+ return (unsigned long)page;
/*
* If the size of struct page is not power-of-2, the rest if
@@ -377,7 +288,7 @@ static __always_inline void clear_compound_head(struct page *page)
static __always_inline int PageTail(const struct page *page)
{
- return READ_ONCE(page->compound_info) & 1 || page_is_fake_head(page);
+ return READ_ONCE(page->compound_info) & 1;
}
static __always_inline int PageCompound(const struct page *page)
@@ -904,7 +815,7 @@ static __always_inline bool folio_test_head(const struct folio *folio)
static __always_inline int PageHead(const struct page *page)
{
PF_POISONED_CHECK(page);
- return test_bit(PG_head, &page->flags.f) && !page_is_fake_head(page);
+ return test_bit(PG_head, &page->flags.f);
}
__SETPAGEFLAG(Head, head, PF_ANY)
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 544150d1d5fd..490d0ad6e56d 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -230,13 +230,7 @@ static inline int folio_ref_dec_return(struct folio *folio)
static inline bool page_ref_add_unless(struct page *page, int nr, int u)
{
- bool ret = false;
-
- rcu_read_lock();
- /* avoid writing to the vmemmap area being remapped */
- if (page_count_writable(page, u))
- ret = atomic_add_unless(&page->_refcount, nr, u);
- rcu_read_unlock();
+ bool ret = atomic_add_unless(&page->_refcount, nr, u);
if (page_ref_tracepoint_active(page_ref_mod_unless))
__page_ref_mod_unless(page, nr, ret);
--
2.51.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 08/11] hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU
2025-12-05 19:43 [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
` (6 preceding siblings ...)
2025-12-05 19:43 ` [PATCH 07/11] mm: Drop fake head checks and fix a race condition Kiryl Shutsemau
@ 2025-12-05 19:43 ` Kiryl Shutsemau
2025-12-05 19:43 ` [PATCH 09/11] mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key Kiryl Shutsemau
` (3 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 19:43 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc,
Kiryl Shutsemau
Fake heads no longer exist, so there is no longer any need to
synchronize against concurrent page_ref_add_unless() callers.
Remove the flag and the synchronize_rcu() calls that were gated by it.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
mm/hugetlb_vmemmap.c | 20 ++++----------------
1 file changed, 4 insertions(+), 16 deletions(-)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 2543bdbcae20..0f142e4eafb9 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -48,8 +48,6 @@ struct vmemmap_remap_walk {
#define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
/* Skip the TLB flush when we remap the PTE */
#define VMEMMAP_REMAP_NO_TLB_FLUSH BIT(1)
-/* synchronize_rcu() to avoid writes from page_ref_add_unless() */
-#define VMEMMAP_SYNCHRONIZE_RCU BIT(2)
unsigned long flags;
};
@@ -423,9 +421,6 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
if (!folio_test_hugetlb_vmemmap_optimized(folio))
return 0;
- if (flags & VMEMMAP_SYNCHRONIZE_RCU)
- synchronize_rcu();
-
vmemmap_start = (unsigned long)folio;
vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
@@ -457,7 +452,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
*/
int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
{
- return __hugetlb_vmemmap_restore_folio(h, folio, VMEMMAP_SYNCHRONIZE_RCU);
+ return __hugetlb_vmemmap_restore_folio(h, folio, 0);
}
/**
@@ -480,14 +475,11 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
struct folio *folio, *t_folio;
long restored = 0;
long ret = 0;
- unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
+ unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH;
list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
if (folio_test_hugetlb_vmemmap_optimized(folio)) {
ret = __hugetlb_vmemmap_restore_folio(h, folio, flags);
- /* only need to synchronize_rcu() once for each batch */
- flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
-
if (ret)
break;
restored++;
@@ -558,8 +550,6 @@ static int __hugetlb_vmemmap_optimize_folio(struct hstate *h,
static_branch_inc(&hugetlb_optimize_vmemmap_key);
- if (flags & VMEMMAP_SYNCHRONIZE_RCU)
- synchronize_rcu();
/*
* Very Subtle
* If VMEMMAP_REMAP_NO_TLB_FLUSH is set, TLB flushing is not performed
@@ -621,7 +611,7 @@ void hugetlb_vmemmap_optimize_folio(struct hstate *h, struct folio *folio)
{
LIST_HEAD(vmemmap_pages);
- __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, VMEMMAP_SYNCHRONIZE_RCU);
+ __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);
free_vmemmap_page_list(&vmemmap_pages);
}
@@ -649,7 +639,7 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
struct folio *folio;
int nr_to_optimize;
LIST_HEAD(vmemmap_pages);
- unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
+ unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH;
nr_to_optimize = 0;
list_for_each_entry(folio, folio_list, lru) {
@@ -702,8 +692,6 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
int ret;
ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
- /* only need to synchronize_rcu() once for each batch */
- flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
/*
* Pages to be freed may have been accumulated. If we
--
2.51.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 09/11] mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key
2025-12-05 19:43 [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
` (7 preceding siblings ...)
2025-12-05 19:43 ` [PATCH 08/11] hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU Kiryl Shutsemau
@ 2025-12-05 19:43 ` Kiryl Shutsemau
2025-12-05 19:43 ` [PATCH 10/11] mm: Remove the branch from compound_head() Kiryl Shutsemau
` (2 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 19:43 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc,
Kiryl Shutsemau
The static key is no longer used.
Remove it.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
include/linux/page-flags.h | 2 --
mm/hugetlb_vmemmap.c | 14 ++------------
2 files changed, 2 insertions(+), 14 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 8acb141a127b..02a851ab7f5e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -198,8 +198,6 @@ enum pageflags {
#ifndef __GENERATING_BOUNDS_H
-DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
-
static __always_inline unsigned long _compound_head(const struct page *page)
{
unsigned long info = READ_ONCE(page->compound_info);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 0f142e4eafb9..81f5160ff216 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -399,9 +399,6 @@ static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
return vmemmap_remap_range(start, end, &walk);
}
-DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
-EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
-
static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
static int __init hugetlb_vmemmap_optimize_param(char *buf)
{
@@ -432,10 +429,8 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
* discarded vmemmap pages must be allocated and remapping.
*/
ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, flags);
- if (!ret) {
+ if (!ret)
folio_clear_hugetlb_vmemmap_optimized(folio);
- static_branch_dec(&hugetlb_optimize_vmemmap_key);
- }
return ret;
}
@@ -548,8 +543,6 @@ static int __hugetlb_vmemmap_optimize_folio(struct hstate *h,
if (!h->vmemmap_tail)
return -ENOMEM;
- static_branch_inc(&hugetlb_optimize_vmemmap_key);
-
/*
* Very Subtle
* If VMEMMAP_REMAP_NO_TLB_FLUSH is set, TLB flushing is not performed
@@ -589,10 +582,8 @@ static int __hugetlb_vmemmap_optimize_folio(struct hstate *h,
vmemmap_head, vmemmap_tail,
vmemmap_pages, flags);
out:
- if (ret) {
- static_branch_dec(&hugetlb_optimize_vmemmap_key);
+ if (ret)
folio_clear_hugetlb_vmemmap_optimized(folio);
- }
return ret;
}
@@ -658,7 +649,6 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
register_page_bootmem_memmap(pfn_to_section_nr(spfn),
&folio->page,
HUGETLB_VMEMMAP_RESERVE_SIZE);
- static_branch_inc(&hugetlb_optimize_vmemmap_key);
continue;
}
--
2.51.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 10/11] mm: Remove the branch from compound_head()
2025-12-05 19:43 [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
` (8 preceding siblings ...)
2025-12-05 19:43 ` [PATCH 09/11] mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key Kiryl Shutsemau
@ 2025-12-05 19:43 ` Kiryl Shutsemau
2025-12-05 19:43 ` [PATCH 11/11] hugetlb: Update vmemmap_dedup.rst Kiryl Shutsemau
2025-12-05 20:16 ` [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization David Hildenbrand (Red Hat)
11 siblings, 0 replies; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 19:43 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc,
Kiryl Shutsemau
The compound_head() function is on a hot path. For example, the zap path
calls it for every leaf page table entry.
Rewrite the helper function in a branchless manner to eliminate the risk
of CPU branch misprediction.
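As an illustration (not part of the patch), the branchless computation
can be exercised on its own; the addresses below are hypothetical and
the types are reduced to plain unsigned long:

	/* Standalone illustration of the branchless head lookup. */
	#include <assert.h>

	static unsigned long branchless_head(unsigned long page, unsigned long info)
	{
		unsigned long mask = (info & 1) - 1;	/* non-tail: ~0UL, tail: 0 */

		mask |= info;				/* non-tail: ~0UL, tail: info */
		return page & mask;
	}

	int main(void)
	{
		unsigned long head = 0xffffea0004000000UL;	/* span-aligned example */
		unsigned long span = 64UL << 9;		/* 64-byte struct page, order 9 */
		unsigned long tail_info = ~(span - 1) | 1UL; /* bit 0 encodes PageTail() */

		/* Non-tail: bit 0 of 'info' is clear, the page maps to itself. */
		assert(branchless_head(head, 0) == head);

		/* Tail: the stored mask rounds the address down to the head. */
		assert(branchless_head(head + 3 * 64, tail_info) == head);
		return 0;
	}

In both cases the same short sequence of ALU operations runs, so the CPU
never has to predict a data-dependent branch.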
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
include/linux/page-flags.h | 27 +++++++++++++++++----------
1 file changed, 17 insertions(+), 10 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 02a851ab7f5e..01d9893c4bd8 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -201,17 +201,15 @@ enum pageflags {
static __always_inline unsigned long _compound_head(const struct page *page)
{
unsigned long info = READ_ONCE(page->compound_info);
+ unsigned long mask;
+
+ if (!is_power_of_2(sizeof(struct page))) {
+ /* Bit 0 encodes PageTail() */
+ if (info & 1)
+ return info - 1;
- /* Bit 0 encodes PageTail() */
- if (!(info & 1))
return (unsigned long)page;
-
- /*
- * If the size of struct page is not power-of-2, the rest if
- * compound_info is the pointer to the head page.
- */
- if (!is_power_of_2(sizeof(struct page)))
- return info - 1;
+ }
/*
* If the size of struct page is power-of-2 it is set the rest of
@@ -219,8 +217,17 @@ static __always_inline unsigned long _compound_head(const struct page *page)
* page to the head page.
*
* No need to clear bit 0 in the mask as 'page' always has it clear.
+ *
+ * Let's do it in a branchless manner.
*/
- return (unsigned long)page & info;
+
+ /* Non-tail: -1UL, Tail: 0 */
+ mask = (info & 1) - 1;
+
+ /* Non-tail: -1UL, Tail: info */
+ mask |= info;
+
+ return (unsigned long)page & mask;
}
#define compound_head(page) ((typeof(page))_compound_head(page))
--
2.51.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 11/11] hugetlb: Update vmemmap_dedup.rst
2025-12-05 19:43 [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
` (9 preceding siblings ...)
2025-12-05 19:43 ` [PATCH 10/11] mm: Remove the branch from compound_head() Kiryl Shutsemau
@ 2025-12-05 19:43 ` Kiryl Shutsemau
2025-12-05 20:16 ` [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization David Hildenbrand (Red Hat)
11 siblings, 0 replies; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 19:43 UTC (permalink / raw)
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc,
Kiryl Shutsemau
Update the documentation regarding vmemmap optimization for hugetlb to
reflect the changes in how the kernel maps the tail pages.
Fake heads no longer exist. Remove their description.
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
Documentation/mm/vmemmap_dedup.rst | 60 +++++++++++++-----------------
1 file changed, 26 insertions(+), 34 deletions(-)
diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
index 1863d88d2dcb..a0c4c79d6922 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -124,33 +124,35 @@ Here is how things look before optimization::
| |
+-----------+
-The value of page->compound_info is the same for all tail pages. The first
-page of ``struct page`` (page 0) associated with the HugeTLB page contains the 4
-``struct page`` necessary to describe the HugeTLB. The only use of the remaining
-pages of ``struct page`` (page 1 to page 7) is to point to page->compound_info.
-Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of ``struct page``
-will be used for each HugeTLB page. This will allow us to free the remaining
-7 pages to the buddy allocator.
+The first page of ``struct page`` (page 0) associated with the HugeTLB page
+contains the 4 ``struct page`` necessary to describe the HugeTLB. The remaining
+pages of ``struct page`` (page 1 to page 7) are tail pages.
+
+The optimization is only applied when the size of the struct page is a power-of-2.
+In this case, all tail pages of the same order are identical. See
+compound_head(). This allows us to remap the tail pages of the vmemmap to a
+shared, read-only page. The head page is also remapped to a new page. This
+allows the original vmemmap pages to be freed.
Here is how things look after remapping::
- HugeTLB struct pages(8 pages) page frame(8 pages)
- +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
- | | | 0 | -------------> | 0 |
- | | +-----------+ +-----------+
- | | | 1 | ---------------^ ^ ^ ^ ^ ^ ^
- | | +-----------+ | | | | | |
- | | | 2 | -----------------+ | | | | |
- | | +-----------+ | | | | |
- | | | 3 | -------------------+ | | | |
- | | +-----------+ | | | |
- | | | 4 | ---------------------+ | | |
- | PMD | +-----------+ | | |
- | level | | 5 | -----------------------+ | |
- | mapping | +-----------+ | |
- | | | 6 | -------------------------+ |
- | | +-----------+ |
- | | | 7 | ---------------------------+
+ HugeTLB struct pages(8 pages) page frame
+ +-----------+ ---virt_to_page---> +-----------+ mapping to +----------------+
+ | | | 0 | -------------> | 0 |
+ | | +-----------+ +----------------+
+ | | | 1 | ------┐
+ | | +-----------+ |
+ | | | 2 | ------┼ +----------------+
+ | | +-----------+ | | vmemmap_tail |
+ | | | 3 | ------┼------> | shared for the |
+ | | +-----------+ | | struct hstate |
+ | | | 4 | ------┼ +----------------+
+ | | +-----------+ |
+ | | | 5 | ------┼
+ | PMD | +-----------+ |
+ | level | | 6 | ------┼
+ | mapping | +-----------+ |
+ | | | 7 | ------┘
| | +-----------+
| |
| |
@@ -172,16 +174,6 @@ The contiguous bit is used to increase the mapping size at the pmd and pte
(last) level. So this type of HugeTLB page can be optimized only when its
size of the ``struct page`` structs is greater than **1** page.
-Notice: The head vmemmap page is not freed to the buddy allocator and all
-tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
-more than one ``struct page`` struct with ``PG_head`` (e.g. 8 per 2 MB HugeTLB
-page) associated with each HugeTLB page. The ``compound_head()`` can handle
-this correctly. There is only **one** head ``struct page``, the tail
-``struct page`` with ``PG_head`` are fake head ``struct page``. We need an
-approach to distinguish between those two different types of ``struct page`` so
-that ``compound_head()`` can return the real head ``struct page`` when the
-parameter is the tail ``struct page`` but with ``PG_head``.
-
Device DAX
==========
--
2.51.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization
2025-12-05 19:43 [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization Kiryl Shutsemau
` (10 preceding siblings ...)
2025-12-05 19:43 ` [PATCH 11/11] hugetlb: Update vmemmap_dedup.rst Kiryl Shutsemau
@ 2025-12-05 20:16 ` David Hildenbrand (Red Hat)
2025-12-05 20:33 ` Kiryl Shutsemau
11 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-05 20:16 UTC (permalink / raw)
To: Kiryl Shutsemau, Andrew Morton, Muchun Song, Matthew Wilcox
Cc: Oscar Salvador, Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes,
Zi Yan, Baoquan He, Michal Hocko, Johannes Weiner,
Jonathan Corbet, Usama Arif, kernel-team, linux-mm, linux-kernel,
linux-doc
On 12/5/25 20:43, Kiryl Shutsemau wrote:
> This series removes "fake head pages" from the HugeTLB vmemmap
> optimization (HVO) by changing how tail pages encode their relationship
> to the head page.
>
> It simplifies compound_head() and page_ref_add_unless(). Both are in the
> hot path.
>
> Background
> ==========
>
> HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
> and remapping the freed virtual addresses to a single physical page.
> Previously, all tail page vmemmap entries were remapped to the first
> vmemmap page (containing the head struct page), creating "fake heads" -
> tail pages that appear to have PG_head set when accessed through the
> deduplicated vmemmap.
>
> This required special handling in compound_head() to detect and work
> around fake heads, adding complexity and overhead to a very hot path.
>
> New Approach
> ============
>
> For architectures/configs where sizeof(struct page) is a power of 2 (the
> common case), this series changes how position of the head page is encoded
> in the tail pages.
>
> Instead of storing a pointer to the head page, the ->compound_info
> (renamed from ->compound_head) now stores a mask.
(we're in the merge window)
That doesn't seem to be suitable for the memdesc plans, where we want
all tail pages to directly point at the allocated memdesc (e.g., struct
folio), no?
@Willy what's your take?
--
Cheers
David
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization
2025-12-05 20:16 ` [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization David Hildenbrand (Red Hat)
@ 2025-12-05 20:33 ` Kiryl Shutsemau
2025-12-05 20:44 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 20:33 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Andrew Morton, Muchun Song, Matthew Wilcox, Oscar Salvador,
Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc
On Fri, Dec 05, 2025 at 09:16:08PM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/5/25 20:43, Kiryl Shutsemau wrote:
> > This series removes "fake head pages" from the HugeTLB vmemmap
> > optimization (HVO) by changing how tail pages encode their relationship
> > to the head page.
> >
> > It simplifies compound_head() and page_ref_add_unless(). Both are in the
> > hot path.
> >
> > Background
> > ==========
> >
> > HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
> > and remapping the freed virtual addresses to a single physical page.
> > Previously, all tail page vmemmap entries were remapped to the first
> > vmemmap page (containing the head struct page), creating "fake heads" -
> > tail pages that appear to have PG_head set when accessed through the
> > deduplicated vmemmap.
> >
> > This required special handling in compound_head() to detect and work
> > around fake heads, adding complexity and overhead to a very hot path.
> >
> > New Approach
> > ============
> >
> > For architectures/configs where sizeof(struct page) is a power of 2 (the
> > common case), this series changes how position of the head page is encoded
> > in the tail pages.
> >
> > Instead of storing a pointer to the head page, the ->compound_info
> > (renamed from ->compound_head) now stores a mask.
>
> (we're in the merge window)
>
> That doesn't seem to be suitable for the memdesc plans, where we want all
> tail pages to directly point at the allocated memdesc (e.g., struct folio),
> no?
Sure. My understanding is that it is going to eliminate a need in
compound_head() completely. I don't see the conflict so far.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization
2025-12-05 20:33 ` Kiryl Shutsemau
@ 2025-12-05 20:44 ` David Hildenbrand (Red Hat)
2025-12-05 20:54 ` Kiryl Shutsemau
0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-05 20:44 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Andrew Morton, Muchun Song, Matthew Wilcox, Oscar Salvador,
Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc
On 12/5/25 21:33, Kiryl Shutsemau wrote:
> On Fri, Dec 05, 2025 at 09:16:08PM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/5/25 20:43, Kiryl Shutsemau wrote:
>>> This series removes "fake head pages" from the HugeTLB vmemmap
>>> optimization (HVO) by changing how tail pages encode their relationship
>>> to the head page.
>>>
>>> It simplifies compound_head() and page_ref_add_unless(). Both are in the
>>> hot path.
>>>
>>> Background
>>> ==========
>>>
>>> HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
>>> and remapping the freed virtual addresses to a single physical page.
>>> Previously, all tail page vmemmap entries were remapped to the first
>>> vmemmap page (containing the head struct page), creating "fake heads" -
>>> tail pages that appear to have PG_head set when accessed through the
>>> deduplicated vmemmap.
>>>
>>> This required special handling in compound_head() to detect and work
>>> around fake heads, adding complexity and overhead to a very hot path.
>>>
>>> New Approach
>>> ============
>>>
>>> For architectures/configs where sizeof(struct page) is a power of 2 (the
>>> common case), this series changes how position of the head page is encoded
>>> in the tail pages.
>>>
>>> Instead of storing a pointer to the head page, the ->compound_info
>>> (renamed from ->compound_head) now stores a mask.
>>
>> (we're in the merge window)
>>
>> That doesn't seem to be suitable for the memdesc plans, where we want all
>> tail pages to directly point at the allocated memdesc (e.g., struct folio),
>> no?
>
> Sure. My understanding is that it is going to eliminate a need in
> compound_head() completely. I don't see the conflict so far.
Right. All compound_head pointers will point at the allocated memdesc.
Would we still have to detect fake head pages though (at least for some
transition period)?
I don't recall whether we'll really convert all memdesc users at once,
or if some memdescs will co-exist with ordinary compound pages for a while.
--
Cheers
David
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization
2025-12-05 20:44 ` David Hildenbrand (Red Hat)
@ 2025-12-05 20:54 ` Kiryl Shutsemau
2025-12-05 21:34 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 20:54 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Andrew Morton, Muchun Song, Matthew Wilcox, Oscar Salvador,
Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc
On Fri, Dec 05, 2025 at 09:44:30PM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/5/25 21:33, Kiryl Shutsemau wrote:
> > On Fri, Dec 05, 2025 at 09:16:08PM +0100, David Hildenbrand (Red Hat) wrote:
> > > On 12/5/25 20:43, Kiryl Shutsemau wrote:
> > > > This series removes "fake head pages" from the HugeTLB vmemmap
> > > > optimization (HVO) by changing how tail pages encode their relationship
> > > > to the head page.
> > > >
> > > > It simplifies compound_head() and page_ref_add_unless(). Both are in the
> > > > hot path.
> > > >
> > > > Background
> > > > ==========
> > > >
> > > > HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
> > > > and remapping the freed virtual addresses to a single physical page.
> > > > Previously, all tail page vmemmap entries were remapped to the first
> > > > vmemmap page (containing the head struct page), creating "fake heads" -
> > > > tail pages that appear to have PG_head set when accessed through the
> > > > deduplicated vmemmap.
> > > >
> > > > This required special handling in compound_head() to detect and work
> > > > around fake heads, adding complexity and overhead to a very hot path.
> > > >
> > > > New Approach
> > > > ============
> > > >
> > > > For architectures/configs where sizeof(struct page) is a power of 2 (the
> > > > common case), this series changes how position of the head page is encoded
> > > > in the tail pages.
> > > >
> > > > Instead of storing a pointer to the head page, the ->compound_info
> > > > (renamed from ->compound_head) now stores a mask.
> > >
> > > (we're in the merge window)
> > >
> > > That doesn't seem to be suitable for the memdesc plans, where we want all
> > > tail pages to directly point at the allocated memdesc (e.g., struct folio),
> > > no?
> >
> > Sure. My understanding is that it is going to eliminate a need in
> > compound_head() completely. I don't see the conflict so far.
>
> Right. All compound_head pointers will point at the allocated memdesc.
>
> Would we still have to detect fake head pages though (at least for some
> transition period)?
If we need to detect if the memdesc is tail it should be as trivial as
comparing the given memdesc to the memdesc - 1. If they match, you are
looking at the tail.
But I don't think we would need it.
The memdesc itself doesn't hold anything you want to touch if you don't
hold a reference to the folio. You would need to dereference the memdesc,
and after that you don't care whether the memdesc is a tail.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization
2025-12-05 20:54 ` Kiryl Shutsemau
@ 2025-12-05 21:34 ` David Hildenbrand (Red Hat)
2025-12-05 21:41 ` Kiryl Shutsemau
0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-05 21:34 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Andrew Morton, Muchun Song, Matthew Wilcox, Oscar Salvador,
Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc
On 12/5/25 21:54, Kiryl Shutsemau wrote:
> On Fri, Dec 05, 2025 at 09:44:30PM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/5/25 21:33, Kiryl Shutsemau wrote:
>>> On Fri, Dec 05, 2025 at 09:16:08PM +0100, David Hildenbrand (Red Hat) wrote:
>>>> On 12/5/25 20:43, Kiryl Shutsemau wrote:
>>>>> This series removes "fake head pages" from the HugeTLB vmemmap
>>>>> optimization (HVO) by changing how tail pages encode their relationship
>>>>> to the head page.
>>>>>
>>>>> It simplifies compound_head() and page_ref_add_unless(). Both are in the
>>>>> hot path.
>>>>>
>>>>> Background
>>>>> ==========
>>>>>
>>>>> HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
>>>>> and remapping the freed virtual addresses to a single physical page.
>>>>> Previously, all tail page vmemmap entries were remapped to the first
>>>>> vmemmap page (containing the head struct page), creating "fake heads" -
>>>>> tail pages that appear to have PG_head set when accessed through the
>>>>> deduplicated vmemmap.
>>>>>
>>>>> This required special handling in compound_head() to detect and work
>>>>> around fake heads, adding complexity and overhead to a very hot path.
>>>>>
>>>>> New Approach
>>>>> ============
>>>>>
>>>>> For architectures/configs where sizeof(struct page) is a power of 2 (the
>>>>> common case), this series changes how position of the head page is encoded
>>>>> in the tail pages.
>>>>>
>>>>> Instead of storing a pointer to the head page, the ->compound_info
>>>>> (renamed from ->compound_head) now stores a mask.
>>>>
>>>> (we're in the merge window)
>>>>
>>>> That doesn't seem to be suitable for the memdesc plans, where we want all
>>>> tail pages to directly point at the allocated memdesc (e.g., struct folio),
>>>> no?
>>>
>>> Sure. My understanding is that it is going to eliminate a need in
>>> compound_head() completely. I don't see the conflict so far.
>>
>> Right. All compound_head pointers will point at the allocated memdesc.
>>
>> Would we still have to detect fake head pages though (at least for some
>> transition period)?
>
> If we need to detect if the memdesc is tail it should be as trivial as
> comparing the given memdesc to the memdesc - 1. If they match, you are
> looking at the tail.
How could you assume memdesc - 1 exists without performing other checks?
>
> But I don't think we would need it.
I would guess so.
>
> The memdesc itself doesn't hold anything you want to touch if you don't
> hold a reference to the folio. You would need to dereference the memdesc,
> and after that you don't care whether the memdesc is a tail.
Hopefully.
So the real question is how this would affect the transition period
(some memdescs allocated, others not allocated separately) that Willy
might soon want to start. And the dual mode, where whether "struct
folio" is allocated separately will be a config option.
Let's wait for Willy's reply.
--
Cheers
David
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 00/11] mm/hugetlb: Eliminate fake head pages from vmemmap optimization
2025-12-05 21:34 ` David Hildenbrand (Red Hat)
@ 2025-12-05 21:41 ` Kiryl Shutsemau
0 siblings, 0 replies; 22+ messages in thread
From: Kiryl Shutsemau @ 2025-12-05 21:41 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Andrew Morton, Muchun Song, Matthew Wilcox, Oscar Salvador,
Mike Rapoport, Vlastimil Babka, Lorenzo Stoakes, Zi Yan,
Baoquan He, Michal Hocko, Johannes Weiner, Jonathan Corbet,
Usama Arif, kernel-team, linux-mm, linux-kernel, linux-doc
On Fri, Dec 05, 2025 at 10:34:48PM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/5/25 21:54, Kiryl Shutsemau wrote:
> > On Fri, Dec 05, 2025 at 09:44:30PM +0100, David Hildenbrand (Red Hat) wrote:
> > > On 12/5/25 21:33, Kiryl Shutsemau wrote:
> > > > On Fri, Dec 05, 2025 at 09:16:08PM +0100, David Hildenbrand (Red Hat) wrote:
> > > > > On 12/5/25 20:43, Kiryl Shutsemau wrote:
> > > > > > This series removes "fake head pages" from the HugeTLB vmemmap
> > > > > > optimization (HVO) by changing how tail pages encode their relationship
> > > > > > to the head page.
> > > > > >
> > > > > > It simplifies compound_head() and page_ref_add_unless(). Both are in the
> > > > > > hot path.
> > > > > >
> > > > > > Background
> > > > > > ==========
> > > > > >
> > > > > > HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
> > > > > > and remapping the freed virtual addresses to a single physical page.
> > > > > > Previously, all tail page vmemmap entries were remapped to the first
> > > > > > vmemmap page (containing the head struct page), creating "fake heads" -
> > > > > > tail pages that appear to have PG_head set when accessed through the
> > > > > > deduplicated vmemmap.
> > > > > >
> > > > > > This required special handling in compound_head() to detect and work
> > > > > > around fake heads, adding complexity and overhead to a very hot path.
> > > > > >
> > > > > > New Approach
> > > > > > ============
> > > > > >
> > > > > > For architectures/configs where sizeof(struct page) is a power of 2 (the
> > > > > > common case), this series changes how position of the head page is encoded
> > > > > > in the tail pages.
> > > > > >
> > > > > > Instead of storing a pointer to the head page, the ->compound_info
> > > > > > (renamed from ->compound_head) now stores a mask.
> > > > >
> > > > > (we're in the merge window)
> > > > >
> > > > > That doesn't seem to be suitable for the memdesc plans, where we want all
> > > > > tail pages to directly point at the allocated memdesc (e.g., struct folio),
> > > > > no?
> > > >
> > > > Sure. My understanding is that it is going to eliminate a need in
> > > > compound_head() completely. I don't see the conflict so far.
> > >
> > > Right. All compound_head pointers will point at the allocated memdesc.
> > >
> > > Would we still have to detect fake head pages though (at least for some
> > > transition period)?
> >
> > If we need to detect if the memdesc is tail it should be as trivial as
> > comparing the given memdesc to the memdesc - 1. If they match, you are
> > looking at the tail.
>
> How could you assume memdesc - 1 exists without performing other checks?
Map zero page in front of every discontinuous vmemmap region :P
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 22+ messages in thread