From: Muchun Song <songmuchun@bytedance.com>
To: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Muchun Song <muchun.song@linux.dev>,
Oscar Salvador <osalvador@suse.de>,
Michael Ellerman <mpe@ellerman.id.au>,
Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Nicholas Piggin <npiggin@gmail.com>,
Christophe Leroy <chleroy@kernel.org>,
aneesh.kumar@linux.ibm.com, joao.m.martins@oracle.com,
linux-mm@kvack.org, linuxppc-dev@lists.ozlabs.org,
linux-kernel@vger.kernel.org,
Muchun Song <songmuchun@bytedance.com>
Subject: [PATCH 48/49] Documentation/mm: restructure vmemmap_dedup.rst to reflect generalized HVO
Date: Sun, 5 Apr 2026 20:52:39 +0800
Message-ID: <20260405125240.2558577-49-songmuchun@bytedance.com>
In-Reply-To: <20260405125240.2558577-1-songmuchun@bytedance.com>
The documentation Documentation/mm/vmemmap_dedup.rst is slightly
outdated and poorly structured. It treats HugeTLB pages as its main
context, which fails to clearly communicate the general principles of
Hugepage Vmemmap Optimization (HVO).
To make it more logical and readable, refine the document. Specifically,
introduce the general HVO concepts, principles, and calculations that
are agnostic to any specific subsystem. Remove the outdated,
subsystem-specific content (the explicit HugeTLB and Device DAX
sections) to better reflect the universal applicability of HVO to any
large compound page.
This reorganization makes the documentation much easier to read and
understand, and aligns with the recent renaming and generalization of
the HVO mechanism.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
Documentation/mm/vmemmap_dedup.rst | 218 ++++++-----------------------
1 file changed, 42 insertions(+), 176 deletions(-)
diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
index 44e80bd2e398..a21d84fcbe24 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -1,107 +1,33 @@
-
.. SPDX-License-Identifier: GPL-2.0
-=========================================
-A vmemmap diet for HugeTLB and Device DAX
-=========================================
-
-HugeTLB
-=======
-
-This section is to explain how Hugepage Vmemmap Optimization (HVO) for HugeTLB works.
+===================================================
+Fundamentals of Hugepage Vmemmap Optimization (HVO)
+===================================================
-The ``struct page`` structures are used to describe a physical page frame. By
-default, there is a one-to-one mapping from a page frame to its corresponding
+The ``struct page`` structures are used to describe a physical base page frame.
+By default, there is a one-to-one mapping from a page frame to its corresponding
``struct page``.
-HugeTLB pages consist of multiple base page size pages and is supported by many
-architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more
-details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are
-currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page
-consists of 512 base pages and a 1GB HugeTLB page consists of 262144 base pages.
-For each base page, there is a corresponding ``struct page``.
-
-Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to
-contain unique information about a HugeTLB page. ``__NR_USED_SUBPAGE`` provides
-this upper limit. The only 'useful' information in the remaining ``struct page``
-is the compound_info field, and this field is the same for all tail pages.
-
-By removing redundant ``struct page`` for HugeTLB pages, memory can be returned
-to the buddy allocator for other uses.
-
-Different architectures support different HugeTLB pages. For example, the
-following table is the HugeTLB page size supported by x86 and arm64
-architectures. Because arm64 supports 4k, 16k, and 64k base pages and
-supports contiguous entries, so it supports many kinds of sizes of HugeTLB
-page.
-
-+--------------+-----------+-----------------------------------------------+
-| Architecture | Page Size | HugeTLB Page Size |
-+--------------+-----------+-----------+-----------+-----------+-----------+
-| x86-64 | 4KB | 2MB | 1GB | | |
-+--------------+-----------+-----------+-----------+-----------+-----------+
-| | 4KB | 64KB | 2MB | 32MB | 1GB |
-| +-----------+-----------+-----------+-----------+-----------+
-| arm64 | 16KB | 2MB | 32MB | 1GB | |
-| +-----------+-----------+-----------+-----------+-----------+
-| | 64KB | 2MB | 512MB | 16GB | |
-+--------------+-----------+-----------+-----------+-----------+-----------+
-
-When the system boot up, every HugeTLB page has more than one ``struct page``
-structs which size is (unit: pages)::
-
- struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
-
-Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
-of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
-relationship::
-
- HugeTLB_Size = n * PAGE_SIZE
-
-Then::
+When huge pages (large compound pages) are used, they consist of multiple base
+pages. For each base page, there is a corresponding ``struct page``. However,
+only a few ``struct page`` structures are actually used to hold
+unique information about the huge page.
+The only 'useful' information in the remaining tail ``struct page`` structures
+is the ``->compound_info`` field, used to locate the head page, and this field
+is the same for all tail pages.
- struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
- = n * sizeof(struct page) / PAGE_SIZE
+We can remove redundant ``struct page`` structures for huge pages to save memory.
+This optimization is referred to as Hugepage Vmemmap Optimization (HVO).
-We can use huge mapping at the pud/pmd level for the HugeTLB page.
+The optimization is only applied when the size of ``struct page`` is a
+power of 2. In this case, all tail pages of the same order are identical. See
+``compound_head()``. This allows us to remap the tail pages of the vmemmap to a
+shared page.
-For the HugeTLB page of the pmd level mapping, then::
+Let's take a system with 2 MB huge pages and a base page size of 4 KB as an
+example. Here is how things look before optimization::
- struct_size = n * sizeof(struct page) / PAGE_SIZE
- = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
- = sizeof(struct page) / sizeof(pte_t)
- = 64 / 8
- = 8 (pages)
-
-Where n is how many pte entries which one page can contains. So the value of
-n is (PAGE_SIZE / sizeof(pte_t)).
-
-This optimization only supports 64-bit system, so the value of sizeof(pte_t)
-is 8. And this optimization also applicable only when the size of ``struct page``
-is a power of two. In most cases, the size of ``struct page`` is 64 bytes (e.g.
-x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
-size of ``struct page`` structs of it is 8 page frames which size depends on the
-size of the base page.
-
-For the HugeTLB page of the pud level mapping, then::
-
- struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
- = PAGE_SIZE / 8 * 8 (pages)
- = PAGE_SIZE (pages)
-
-Where the struct_size(pmd) is the size of the ``struct page`` structs of a
-HugeTLB page of the pmd level mapping.
-
-E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
-HugeTLB page consists in 4096.
-
-Next, we take the pmd level mapping of the HugeTLB page as an example to
-show the internal implementation of this optimization. There are 8 pages
-``struct page`` structs associated with a HugeTLB page which is pmd mapped.
-
-Here is how things look before optimization::
-
- HugeTLB struct pages(8 pages) page frame(8 pages)
+ 2MB Hugepage struct pages (8 pages) page frame (8 pages)
+-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
| | | 0 | -------------> | 0 |
| | +-----------+ +-----------+
@@ -112,9 +38,9 @@ Here is how things look before optimization::
| | | 3 | -------------> | 3 |
| | +-----------+ +-----------+
| | | 4 | -------------> | 4 |
- | PMD | +-----------+ +-----------+
- | level | | 5 | -------------> | 5 |
- | mapping | +-----------+ +-----------+
+ | | +-----------+ +-----------+
+ | | | 5 | -------------> | 5 |
+ | | +-----------+ +-----------+
| | | 6 | -------------> | 6 |
| | +-----------+ +-----------+
| | | 7 | -------------> | 7 |
@@ -124,34 +50,27 @@ Here is how things look before optimization::
| |
+-----------+
-The first page of ``struct page`` (page 0) associated with the HugeTLB page
-contains the 4 ``struct page`` necessary to describe the HugeTLB. The remaining
-pages of ``struct page`` (page 1 to page 7) are tail pages.
-
-The optimization is only applied when the size of the struct page is a power
-of 2. In this case, all tail pages of the same order are identical. See
-compound_head(). This allows us to remap the tail pages of the vmemmap to a
-shared, read-only page. The head page is also remapped to a new page. This
-allows the original vmemmap pages to be freed.
+We remap the tail pages (page 1 to page 7) of the vmemmap to a shared, read-only
+page (one per zone).
Here is how things look after remapping::
- HugeTLB struct pages(8 pages) page frame (new)
+ 2MB Hugepage struct pages(8 pages) page frame (1 page)
+-----------+ ---virt_to_page---> +-----------+ mapping to +----------------+
| | | 0 | -------------> | 0 |
| | +-----------+ +----------------+
| | | 1 | ------┐
| | +-----------+ |
- | | | 2 | ------┼ +----------------------------+
+ | | | 2 | ------┼
+ | | +-----------+ |
+ | | | 3 | ------┼ +----------------------------+
| | +-----------+ | | A single, per-zone page |
- | | | 3 | ------┼------> | frame shared among all |
+ | | | 4 | ------┼------> | frame shared among all |
| | +-----------+ | | hugepages of the same size |
- | | | 4 | ------┼ +----------------------------+
+ | | | 5 | ------┼ +----------------------------+
+ | | +-----------+ |
+ | | | 6 | ------┼
| | +-----------+ |
- | | | 5 | ------┼
- | PMD | +-----------+ |
- | level | | 6 | ------┼
- | mapping | +-----------+ |
| | | 7 | ------┘
| | +-----------+
| |
@@ -159,65 +78,12 @@ Here is how things look after remapping::
| |
+-----------+
-When a HugeTLB is freed to the buddy system, we should allocate 7 pages for
-vmemmap pages and restore the previous mapping relationship.
-
-For the HugeTLB page of the pud level mapping. It is similar to the former.
-We also can use this approach to free (PAGE_SIZE - 1) vmemmap pages.
-
-Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
-(e.g. aarch64) provides a contiguous bit in the translation table entries
-that hints to the MMU to indicate that it is one of a contiguous set of
-entries that can be cached in a single TLB entry.
-
-The contiguous bit is used to increase the mapping size at the pmd and pte
-(last) level. So this type of HugeTLB page can be optimized only when its
-size of the ``struct page`` structs is greater than **1** page.
-
-Device DAX
-==========
-
-The device-dax interface uses the same tail deduplication technique explained
-in the previous chapter, except when used with the vmemmap in
-the device (altmap).
-
-The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
-PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
-For powerpc equivalent details see Documentation/arch/powerpc/vmemmap_dedup.rst
-
-The differences with HugeTLB are relatively minor.
-
-It only use 3 ``struct page`` for storing all information as opposed
-to 4 on HugeTLB pages.
-
-There's no remapping of vmemmap given that device-dax memory is not part of
-System RAM ranges initialized at boot. Thus the tail page deduplication
-happens at a later stage when we populate the sections. HugeTLB reuses the
-the head vmemmap page representing, whereas device-dax reuses the tail
-vmemmap page. This results in only half of the savings compared to HugeTLB.
-
-Deduplicated tail pages are not mapped read-only.
+Therefore, HVO can be applied to any hugepage whose corresponding ``struct page``
+structures occupy at least two base pages, freeing the remapped pages to save
+memory. In the example above, the smallest hugepage that HVO can be applied to
+is 512 KB (its order corresponds to ``OPTIMIZABLE_FOLIO_MIN_ORDER``). In other
+words, HVO can be applied to any hugepage with an order greater than or equal
+to ``OPTIMIZABLE_FOLIO_MIN_ORDER``.
-Here's how things look like on device-dax after the sections are populated::
-
- +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
- | | | 0 | -------------> | 0 |
- | | +-----------+ +-----------+
- | | | 1 | -------------> | 1 |
- | | +-----------+ +-----------+
- | | | 2 | ----------------^ ^ ^ ^ ^ ^
- | | +-----------+ | | | | |
- | | | 3 | ------------------+ | | | |
- | | +-----------+ | | | |
- | | | 4 | --------------------+ | | |
- | PMD | +-----------+ | | |
- | level | | 5 | ----------------------+ | |
- | mapping | +-----------+ | |
- | | | 6 | ------------------------+ |
- | | +-----------+ |
- | | | 7 | --------------------------+
- | | +-----------+
- | |
- | |
- | |
- +-----------+
+Meanwhile, each HVO-ed hugepage still has ``OPTIMIZED_FOLIO_VMEMMAP_PAGE_STRUCTS``
+``struct page`` structures available for holding its unique information.
--
2.20.1
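[Not part of the patch: the vmemmap arithmetic that the restructured document
describes can be sketched as a short script, assuming 4 KB base pages and a
64-byte ``struct page`` (typical for x86-64 and arm64); the 512 KB minimum
matches the ``OPTIMIZABLE_FOLIO_MIN_ORDER`` example in the new text.]

```python
# Illustrative sketch of the HVO sizing arithmetic, assuming 4 KB base
# pages and a 64-byte struct page (a power of 2, as HVO requires).

PAGE_SIZE = 4096          # base page size in bytes
STRUCT_PAGE_SIZE = 64     # sizeof(struct page) in bytes

def vmemmap_pages(hugepage_size):
    """Number of base pages holding the struct pages of one hugepage."""
    nr_struct_pages = hugepage_size // PAGE_SIZE
    return nr_struct_pages * STRUCT_PAGE_SIZE // PAGE_SIZE

# A 2 MB hugepage needs 8 vmemmap pages; HVO keeps the head page and
# remaps the 7 tail pages to a shared per-zone page.
print(vmemmap_pages(2 * 1024 * 1024))   # -> 8

# Smallest hugepage whose vmemmap spans at least 2 base pages, so that
# at least one tail page can be shared: 512 KB in this configuration.
size = PAGE_SIZE
while vmemmap_pages(size) < 2:
    size *= 2
print(size // 1024)                     # -> 512 (KB)
```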