linux-mm.kvack.org archive mirror
From: Zi Yan <ziy@nvidia.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Kiryl Shutsemau <kas@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Muchun Song <muchun.song@linux.dev>,
	David Hildenbrand <david@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Usama Arif <usamaarif642@gmail.com>,
	Frank van der Linden <fvdl@google.com>,
	Oscar Salvador <osalvador@suse.de>,
	Mike Rapoport <rppt@kernel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Baoquan He <bhe@redhat.com>, Michal Hocko <mhocko@suse.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Jonathan Corbet <corbet@lwn.net>,
	kernel-team@meta.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
Subject: Re: [PATCHv4 00/14] mm: Eliminate fake head pages from vmemmap optimization
Date: Wed, 21 Jan 2026 15:31:59 -0500
Message-ID: <E99A40AF-1535-4FC0-BEE5-6F0F5B3FF840@nvidia.com>
In-Reply-To: <bc7b8c62-a8b3-4407-a69f-30b3fd269566@suse.cz>

On 21 Jan 2026, at 13:44, Vlastimil Babka wrote:

> On 1/21/26 17:22, Kiryl Shutsemau wrote:
>> This series removes "fake head pages" from the HugeTLB vmemmap
>> optimization (HVO) by changing how tail pages encode their relationship
>> to the head page.
>>
>> It simplifies compound_head() and page_ref_add_unless(). Both are in the
>> hot path.
>
> We never got a definitive answer in the previous version's discussion on
> whether it's worth doing this now, given the upcoming memdesc work, right?
>
>> Background
>> ==========
>>
>> HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
>> and remapping the freed virtual addresses to a single physical page.
>> Previously, all tail page vmemmap entries were remapped to the first
>> vmemmap page (containing the head struct page), creating "fake heads" -
>> tail pages that appear to have PG_head set when accessed through the
>> deduplicated vmemmap.
>>
>> This required special handling in compound_head() to detect and work
>> around fake heads, adding complexity and overhead to a very hot path.
>
> So a very stupid question, why did we remap everything to the first page,
> and not instead create two pages, where the first one would contain the head
> and the first batch of tails, and the second one would be used for the rest
> of the tails? I'd expect it wouldn't make the memory savings that much
> worse, and eliminate most of the issues?

I think it used two pages before [1]. The benefit of using one page is:

"It further reduces the overhead of struct page by 12.5% for a 2MB HugeTLB
compared to the previous approach, which means 2GB per 1TB HugeTLB (2MB
type)."

[1] https://lore.kernel.org/all/20211101031651.75851-1-songmuchun@bytedance.com/T/#u
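
As a rough check of the figures quoted above, here is a small sketch of
the arithmetic. It assumes 4KB base pages and a 64-byte struct page;
neither value is stated in this thread, and the function names are
hypothetical. Under those assumptions a 2MB HugeTLB page needs 8 vmemmap
pages, so keeping 1 page instead of 2 frees one more page out of 8.

```c
#include <stdio.h>

/* Assumed values, not taken from the thread. */
#define BASE_PAGE_SIZE   4096ULL        /* 4KB base page */
#define STRUCT_PAGE_SIZE 64ULL          /* sizeof(struct page), assumed */
#define HUGE_PAGE_SIZE   (2ULL << 20)   /* 2MB HugeTLB page */

#define KEPT_PAGES_BEFORE 2ULL          /* vmemmap pages kept per huge page, two-page scheme */
#define KEPT_PAGES_AFTER  1ULL          /* vmemmap pages kept per huge page, one-page scheme */

/* Number of vmemmap pages needed to describe one 2MB huge page. */
static unsigned long long vmemmap_pages_per_huge(void)
{
	unsigned long long nr_struct_pages = HUGE_PAGE_SIZE / BASE_PAGE_SIZE;

	return nr_struct_pages * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
}

/* Extra saving, as a percentage of the unoptimized vmemmap. */
static double extra_saving_percent(void)
{
	return 100.0 * (double)(KEPT_PAGES_BEFORE - KEPT_PAGES_AFTER) /
	       (double)vmemmap_pages_per_huge();
}

/* Extra bytes saved when 1TB of memory is used as 2MB huge pages. */
static unsigned long long extra_saving_bytes_per_tb(void)
{
	unsigned long long nr_huge = (1ULL << 40) / HUGE_PAGE_SIZE;

	return nr_huge * (KEPT_PAGES_BEFORE - KEPT_PAGES_AFTER) * BASE_PAGE_SIZE;
}

/* Print the three derived numbers. */
static void print_savings(void)
{
	printf("vmemmap pages per 2MB huge page: %llu\n", vmemmap_pages_per_huge());
	printf("extra saving: %.1f%%\n", extra_saving_percent());
	printf("extra bytes saved per 1TB: %llu\n", extra_saving_bytes_per_tb());
}
```

With these assumed sizes the result is 8 vmemmap pages per huge page, a
12.5% extra saving, and 2GB per 1TB, which matches the quoted text.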

>
>> New Approach
>> ============
>>
>> For architectures/configs where sizeof(struct page) is a power of 2 (the
>> common case), this series changes how the position of the head page is
>> encoded in the tail pages.
>>
>> Instead of storing a pointer to the head page, the ->compound_info
>> (renamed from ->compound_head) now stores a mask.
>>
>> The mask can be applied to any tail page's virtual address to compute
>> the head page address.
>>
>> The key insight is that all tail pages of the same order now have
>> identical compound_info values, regardless of which compound page they
>> belong to. This allows a single page of tail struct pages to be shared
>> across all huge pages of the same order on a NUMA node.
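
To illustrate the mask idea described above, here is a user-space sketch.
It is not the code from the series: struct fake_page, tail_mask(),
fake_compound_head() and mask_demo_ok() are hypothetical names, the
64-byte struct size is an assumption, and details of the real encoding
(how a tail is marked, how the branch is avoided) are not modeled. It
only shows why one mask value per order is enough when the array of
struct pages is suitably aligned, which is the alignment requirement the
v2 changelog mentions.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ORDER    3u                  /* compound page of 8 base pages */
#define DEMO_NR_PAGES (1u << DEMO_ORDER)

/* Hypothetical stand-in for struct page; 64 bytes, a power of 2. */
struct fake_page {
	uintptr_t compound_info;          /* tails: mask; heads: 0 in this sketch */
	unsigned char pad[64 - sizeof(uintptr_t)];
};

/*
 * Mask shared by every tail of a given order. It clears the low address
 * bits that select one struct page within the compound page's block.
 */
static uintptr_t tail_mask(unsigned int order)
{
	uintptr_t span = (uintptr_t)sizeof(struct fake_page) << order;

	return ~(span - 1);
}

/* Head lookup: apply the stored mask to the tail's own address. */
static struct fake_page *fake_compound_head(struct fake_page *page)
{
	uintptr_t info = page->compound_info;

	if (!info)                        /* head page in this sketch */
		return page;
	return (struct fake_page *)((uintptr_t)page & info);
}

/* Build two order-3 compound pages and check every lookup. Returns 1 on success. */
static int mask_demo_ok(void)
{
	/* The array must be aligned to at least one compound page's block. */
	static _Alignas(64 * 2 * DEMO_NR_PAGES)
		struct fake_page memmap[2 * DEMO_NR_PAGES];
	size_t i;

	for (i = 0; i < 2 * DEMO_NR_PAGES; i++)
		memmap[i].compound_info =
			(i % DEMO_NR_PAGES) ? tail_mask(DEMO_ORDER) : 0;

	for (i = 0; i < 2 * DEMO_NR_PAGES; i++) {
		struct fake_page *head = &memmap[i - i % DEMO_NR_PAGES];

		if (fake_compound_head(&memmap[i]) != head)
			return 0;
	}

	/* Tails of different compound pages carry the same value. */
	return memmap[1].compound_info ==
	       memmap[DEMO_NR_PAGES + 1].compound_info;
}
```

Because the tails of both compound pages store the same value, their
contents no longer depend on which huge page they describe. That is the
property that lets one physical page of tails be shared.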
>>
>> Benefits
>> ========
>>
>> 1. Simplified compound_head(): No fake head detection needed, can be
>>    implemented in a branchless manner.
>>
>> 2. Simplified page_ref_add_unless(): RCU protection removed since there's
>>    no race with fake head remapping.
>>
>> 3. Cleaner architecture: The shared tail pages are truly read-only and
>>    contain valid tail page metadata.
>>
>> If sizeof(struct page) is not a power of 2, there are no functional changes.
>> HVO is not supported in this configuration.
>>
>> I had hoped to see performance improvement, but my testing thus far has
>> shown either no change or only a slight improvement within the noise.
>>
>> Series Organization
>> ===================
>>
>> Patch 1: Preparation - move MAX_FOLIO_ORDER to mmzone.h
>> Patches 2-4: Refactoring - interface changes, field rename, code movement
>> Patch 5: Core change - new mask-based compound_head() encoding
>> Patch 6: Correctness fix - page_zonenum() must use head page
>> Patch 7: Add memmap alignment check for compound_info_has_mask()
>> Patch 8: Refactor vmemmap_walk for new design
>> Patch 9: Eliminate fake heads with shared tail pages
>> Patches 10-13: Cleanup - remove fake head infrastructure
>> Patch 14: Documentation update
>>
>> Changes in v4:
>> ==============
>>   - Fix build issues due to linux/mmzone.h <-> linux/pgtable.h
>>     dependency loop by avoiding including linux/pgtable.h into
>>     linux/mmzone.h
>>
>>   - Rework vmemmap_remap_alloc() interface. (Muchun)
>>
>>   - Use &folio->page instead of folio address for optimization
>>     target. (Muchun)
>>
>> Changes in v3:
>> ==============
>>   - Fixed error recovery path in vmemmap_remap_free() to pass correct start
>>     address for TLB flush. (Muchun)
>>
>>   - Wrapped the mask-based compound_info encoding within CONFIG_SPARSEMEM_VMEMMAP
>>     check via compound_info_has_mask(). For other memory models, alignment
>>     guarantees are harder to verify. (Muchun)
>>
>>   - Updated vmemmap_dedup.rst documentation wording: changed "vmemmap_tail
>>     shared for the struct hstate" to "A single, per-node page frame shared
>>     among all hugepages of the same size". (Muchun)
>>
>>   - Fixed build error with MAX_FOLIO_ORDER expanding to undefined PUD_ORDER
>>     in certain configurations. (kernel test robot)
>>
>> Changes in v2:
>> ==============
>>
>> - Handle boot-allocated huge pages correctly. (Frank)
>>
>> - Changed from per-hstate vmemmap_tail to per-node vmemmap_tails[] array
>>   in pglist_data. (Muchun)
>>
>> - Added spin_lock(&hugetlb_lock) protection in vmemmap_get_tail() to fix
>>   a race condition where two threads could both allocate tail pages.
>>   The losing thread now properly frees its allocated page. (Usama)
>>
>> - Add warning if memmap is not aligned to MAX_FOLIO_SIZE, which is
>>   required for the mask approach. (Muchun)
>>
>> - Make page_zonenum() use head page - correctness fix since shared
>>   tail pages cannot have valid zone information. (Muchun)
>>
>> - Added 'const' qualifier to head parameter in set_compound_head() and
>>   prep_compound_tail(). (Usama)
>>
>> - Updated commit messages.
>>
>> Kiryl Shutsemau (14):
>>   mm: Move MAX_FOLIO_ORDER definition to mmzone.h
>>   mm: Change the interface of prep_compound_tail()
>>   mm: Rename the 'compound_head' field in the 'struct page' to
>>     'compound_info'
>>   mm: Move set/clear_compound_head() next to compound_head()
>>   mm: Rework compound_head() for power-of-2 sizeof(struct page)
>>   mm: Make page_zonenum() use head page
>>   mm/sparse: Check memmap alignment for compound_info_has_mask()
>>   mm/hugetlb: Refactor code around vmemmap_walk
>>   mm/hugetlb: Remove fake head pages
>>   mm: Drop fake head checks
>>   hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU
>>   mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key
>>   mm: Remove the branch from compound_head()
>>   hugetlb: Update vmemmap_dedup.rst
>>
>>  .../admin-guide/kdump/vmcoreinfo.rst          |   2 +-
>>  Documentation/mm/vmemmap_dedup.rst            |  62 ++--
>>  include/linux/mm.h                            |  31 --
>>  include/linux/mm_types.h                      |  20 +-
>>  include/linux/mmzone.h                        |  47 +++
>>  include/linux/page-flags.h                    | 167 +++++-----
>>  include/linux/page_ref.h                      |   8 +-
>>  include/linux/types.h                         |   2 +-
>>  kernel/vmcore_info.c                          |   2 +-
>>  mm/hugetlb.c                                  |   8 +-
>>  mm/hugetlb_vmemmap.c                          | 300 ++++++++----------
>>  mm/internal.h                                 |  12 +-
>>  mm/mm_init.c                                  |   2 +-
>>  mm/page_alloc.c                               |   4 +-
>>  mm/slab.h                                     |   2 +-
>>  mm/sparse-vmemmap.c                           |  44 ++-
>>  mm/sparse.c                                   |   5 +
>>  mm/util.c                                     |  16 +-
>>  18 files changed, 369 insertions(+), 365 deletions(-)
>>


Best Regards,
Yan, Zi

