From: Usama Arif <usamaarif642@gmail.com>
To: ziy@nvidia.com, Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
lorenzo.stoakes@oracle.com, linux-mm@kvack.org
Cc: hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
baolin.wang@linux.alibaba.com, npache@redhat.com,
Liam.Howlett@oracle.com, ryan.roberts@arm.com, vbabka@suse.cz,
lance.yang@linux.dev, linux-kernel@vger.kernel.org,
kernel-team@meta.com, Usama Arif <usamaarif642@gmail.com>
Subject: [RFC 00/12] mm: PUD (1GB) THP implementation
Date: Sun, 1 Feb 2026 16:50:17 -0800
Message-ID: <20260202005451.774496-1-usamaarif642@gmail.com>

This is an RFC series to implement 1GB PUD-level THPs, allowing
applications to benefit from reduced TLB pressure without requiring
hugetlbfs. The patches are based on top of
f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
Motivation: Why 1GB THP over hugetlbfs?
=======================================
While hugetlbfs provides 1GB huge pages today, it has significant limitations
that make it unsuitable for many workloads:
1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
or at runtime, taking that memory away from the rest of the system. This
requires capacity planning and administrative overhead, and makes workload
orchestration much more complex, especially when colocating with workloads
that don't use hugetlbfs.
2. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
rather than falling back to smaller pages. This makes it fragile under
memory pressure.
3. No Splitting: hugetlbfs pages cannot be split when only partial access
is needed, leading to memory waste and preventing partial reclaim.
4. Separate Memory Accounting: hugetlbfs memory is accounted separately and
cannot easily be shared with regular memory pools.
PUD THP solves these limitations by integrating 1GB pages into the existing
THP infrastructure.
Performance Results
===================
Benchmark results of these patches on an Intel Xeon Platinum 8321HC:
Test: true random memory access [1] over a 4GB memory region with a
pointer-chasing workload (4M random pointer dereferences through memory):
| Metric | PUD THP (1GB) | PMD THP (2MB) | Change |
|-------------------|---------------|---------------|--------------|
| Memory access | 88 ms | 134 ms | 34% faster |
| Page fault time | 898 ms | 331 ms | 2.7x slower |
Page faulting 1G pages is 2.7x slower (allocating 1G pages is hard :)).
For long-running workloads this is a one-off cost, and the 34% improvement
in access latency provides a significant benefit.
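For reference, below is a minimal pointer-chasing sketch in the spirit of the
test in [1]. This is not the gist code; the region size, iteration count and
the MADV_HUGEPAGE hint are assumptions chosen to mirror the numbers above.

/*
 * Illustrative sketch only: fault in a 4GB THP-eligible region, then chase
 * a random cycle of pointers through it so every access is a dependent load.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define REGION	(4UL << 30)			/* 4GB working set */
#define NPTR	(REGION / sizeof(void *))
#define DEREFS	(4UL << 20)			/* 4M random dereferences */

static double now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
	void **p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	double t0, t1, t2, t3;
	void **cur;
	size_t i;

	if (p == MAP_FAILED)
		return 1;
	madvise(p, REGION, MADV_HUGEPAGE);

	t0 = now_ms();
	memset(p, 0, REGION);			/* one-off page fault cost */
	t1 = now_ms();

	/* Build a single random cycle of pointers (Sattolo's algorithm). */
	for (i = 0; i < NPTR; i++)
		p[i] = &p[i];
	for (i = NPTR - 1; i > 0; i--) {
		size_t j = rand() % i;
		void *tmp = p[i];

		p[i] = p[j];
		p[j] = tmp;
	}

	t2 = now_ms();
	cur = (void **)p[0];
	for (i = 0; i < DEREFS; i++)
		cur = (void **)*cur;		/* dependent loads defeat prefetch */
	t3 = now_ms();

	printf("fault %.0f ms, chase %.0f ms (%p)\n", t1 - t0, t3 - t2, (void *)cur);
	return 0;
}
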
ARM with 64K PAGE_SIZE supports 512M PMD THPs. At Meta, we have a CPU-bound
workload running on a large number of ARM servers with 256G of memory. I set
the 512M THP setting to "always" on 100 servers in production (without really
having high expectations :)). The average memory used by the workload
increased from 217G to 233G, and the amount of memory backed by 512M pages
was 68G! dTLB misses went down by 26% and the PID multiplier increased input
by 5.9% (a very significant improvement in workload performance). A
significant number of these THPs were faulted in at application start and
were present across different VMAs. Of course, getting these 512M pages is
easier on ARM due to the bigger PAGE_SIZE and pageblock order.
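For context, enabling the 512M size on those hosts amounts to writing
"always" to the standard per-size mTHP sysfs control. A minimal sketch is
below; the hugepages-524288kB path assumes an arm64 kernel with 64K base
pages, where PMD-sized THP is 512M.

/* Sketch: set the 512M mTHP size to "always" via the per-size control. */
#include <stdio.h>

int main(void)
{
	const char *knob =
		"/sys/kernel/mm/transparent_hugepage/hugepages-524288kB/enabled";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror(knob);
		return 1;
	}
	fputs("always", f);
	fclose(f);
	return 0;
}
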
I am hoping that these patches for 1G THP can be used to provide similar
benefits for x86. I expect workloads to fault them in at start time when there
is plenty of free memory available.
Previous attempt by Zi Yan
==========================
Zi Yan attempted 1G THPs [2] around kernel version 5.11. There have been
significant changes in the kernel since then, including the folio conversion,
the mTHP framework, ptdesc, rmap changes, etc. I found it easier to use the
current PMD code as a reference for making 1G PUD THPs work. I am hoping Zi
can provide guidance on these patches!
Major Design Decisions
======================
1. No shared 1G zero page: The memory cost would be quite significant!
2. Page Table Pre-deposit Strategy
PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
page tables (one for each potential PMD entry after a split).
We allocate a PMD page table and use its pmd_huge_pte list to store
the deposited PTE tables. This ensures split operations don't fail due
to page table allocation failures (at the cost of 2M of page tables per
PUD THP). A rough sketch of this appears after the list.
3. Split to Base Pages
When a PUD THP must be split (COW, partial unmap, mprotect), we split
directly to base pages (262,144 PTEs). The ideal thing would be to split
to 2M pages and then to 4K pages if needed. However, this would require
significant rmap and mapcount tracking changes.
4. COW and fork handling via split
Copy-on-write and fork for a PUD THP trigger a split to base pages, then
use the existing PTE-level COW infrastructure. Getting another 1G region is
hard and could fail, and if only a 4K page is written, copying 1G is a waste.
Probably this should only be done on CoW and not on fork?
5. Migration via split
Split the PUD to PTEs and migrate individual pages. It is going to be
difficult to find 1G of contiguous memory to migrate to. Maybe it's better
to not allow migration of PUDs at all? I am more tempted to not allow
migration, but have kept splitting in this RFC.
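As referenced in point 2 above, the following is a rough sketch of the page
table pre-deposit idea. It is illustrative only, not the patch code: the
helper name is hypothetical, and locking and the error-unwind path
(withdrawing already-deposited tables) are omitted.

/*
 * Sketch of the pre-deposit strategy from point 2: allocate one PMD-level
 * table for the PUD entry and deposit PTRS_PER_PMD PTE tables on its
 * ptdesc's pmd_huge_pte list, reusing the existing PMD-THP deposit helper.
 */
static int pud_thp_deposit_tables(struct mm_struct *mm, unsigned long addr)
{
	pmd_t *pmd_table;
	int i;

	/* One PMD-level table will back the PUD entry after a future split. */
	pmd_table = pmd_alloc_one(mm, addr);
	if (!pmd_table)
		return -ENOMEM;

	for (i = 0; i < PTRS_PER_PMD; i++) {
		pgtable_t pgtable = pte_alloc_one(mm);

		if (!pgtable)
			goto out_free;
		/* Deposit onto the PMD table page's pmd_huge_pte list. */
		pgtable_trans_huge_deposit(mm, pmd_table, pgtable);
	}
	/* The PUD fault path would then stash pmd_table for the split path. */
	return 0;

out_free:
	/* Withdraw and free already-deposited tables here (omitted). */
	pmd_free(mm, pmd_table);
	return -ENOMEM;
}
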
Reviewers guide
===============
Most of the code was written by adapting the existing PMD code. For example,
the PUD page fault path is very similar to the PMD one; the differences are
the lack of a shared zero page and the page table deposit strategy. I think
the easiest way to review this series is to compare it with the PMD code.
Test results
============
1..7
# Starting 7 tests from 1 test cases.
# RUN pud_thp.basic_allocation ...
# pud_thp_test.c:169:basic_allocation:PUD THP allocated (anon_fault_alloc: 0 -> 1)
# OK pud_thp.basic_allocation
ok 1 pud_thp.basic_allocation
# RUN pud_thp.read_write_access ...
# OK pud_thp.read_write_access
ok 2 pud_thp.read_write_access
# RUN pud_thp.fork_cow ...
# pud_thp_test.c:236:fork_cow:Fork COW completed (thp_split_pud: 0 -> 1)
# OK pud_thp.fork_cow
ok 3 pud_thp.fork_cow
# RUN pud_thp.partial_munmap ...
# pud_thp_test.c:267:partial_munmap:Partial munmap completed (thp_split_pud: 1 -> 2)
# OK pud_thp.partial_munmap
ok 4 pud_thp.partial_munmap
# RUN pud_thp.mprotect_split ...
# pud_thp_test.c:293:mprotect_split:mprotect split completed (thp_split_pud: 2 -> 3)
# OK pud_thp.mprotect_split
ok 5 pud_thp.mprotect_split
# RUN pud_thp.reclaim_pageout ...
# pud_thp_test.c:322:reclaim_pageout:Reclaim completed (thp_split_pud: 3 -> 4)
# OK pud_thp.reclaim_pageout
ok 6 pud_thp.reclaim_pageout
# RUN pud_thp.migration_mbind ...
# pud_thp_test.c:356:migration_mbind:Migration completed (thp_split_pud: 4 -> 5)
# OK pud_thp.migration_mbind
ok 7 pud_thp.migration_mbind
# PASSED: 7 / 7 tests passed.
# Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0
[1] https://gist.github.com/uarif1/bf279b2a01a536cda945ff9f40196a26
[2] https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Usama Arif (12):
mm: add PUD THP ptdesc and rmap support
mm/thp: add mTHP stats infrastructure for PUD THP
mm: thp: add PUD THP allocation and fault handling
mm: thp: implement PUD THP split to PTE level
mm: thp: add reclaim and migration support for PUD THP
selftests/mm: add PUD THP basic allocation test
selftests/mm: add PUD THP read/write access test
selftests/mm: add PUD THP fork COW test
selftests/mm: add PUD THP partial munmap test
selftests/mm: add PUD THP mprotect split test
selftests/mm: add PUD THP reclaim test
selftests/mm: add PUD THP migration test
include/linux/huge_mm.h | 60 ++-
include/linux/mm.h | 19 +
include/linux/mm_types.h | 5 +-
include/linux/pgtable.h | 8 +
include/linux/rmap.h | 7 +-
mm/huge_memory.c | 535 +++++++++++++++++++++-
mm/internal.h | 3 +
mm/memory.c | 8 +-
mm/migrate.c | 17 +
mm/page_vma_mapped.c | 35 ++
mm/pgtable-generic.c | 83 ++++
mm/rmap.c | 96 +++-
mm/vmscan.c | 2 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/pud_thp_test.c | 360 +++++++++++++++
15 files changed, 1197 insertions(+), 42 deletions(-)
create mode 100644 tools/testing/selftests/mm/pud_thp_test.c
--
2.47.3