From: Usama Arif <usamaarif642@gmail.com>
To: ziy@nvidia.com, Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
lorenzo.stoakes@oracle.com, linux-mm@kvack.org
Cc: hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
baolin.wang@linux.alibaba.com, npache@redhat.com,
Liam.Howlett@oracle.com, ryan.roberts@arm.com, vbabka@suse.cz,
lance.yang@linux.dev, linux-kernel@vger.kernel.org,
kernel-team@meta.com, Usama Arif <usamaarif642@gmail.com>
Subject: [RFC 00/12] mm: PUD (1GB) THP implementation
Date: Sun, 1 Feb 2026 16:50:17 -0800
Message-ID: <20260202005451.774496-1-usamaarif642@gmail.com>

This is an RFC series to implement 1GB PUD-level THPs, allowing
applications to benefit from reduced TLB pressure without requiring
hugetlbfs. The patches are based on top of
f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).
Motivation: Why 1GB THP over hugetlbfs?
=======================================
While hugetlbfs provides 1GB huge pages today, it has significant limitations
that make it unsuitable for many workloads:
1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
or at runtime, taking that memory away from the rest of the system. This
requires capacity planning and administrative overhead, and makes workload
orchestration much more complex, especially when colocating with workloads
that don't use hugetlbfs.
2. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
rather than falling back to smaller pages. This makes it fragile under
memory pressure.
3. No Splitting: hugetlbfs pages cannot be split when only partial access
is needed, leading to memory waste and preventing partial reclaim.
4. Separate Memory Accounting: hugetlbfs memory is accounted separately and
cannot easily be shared with regular memory pools.
PUD THP solves these limitations by integrating 1GB pages into the existing
THP infrastructure.
Performance Results
===================
Benchmark results of these patches on an Intel Xeon Platinum 8321HC:
Test: true random memory access [1] over a 4GB memory region with a
pointer-chasing workload (4M random pointer dereferences through memory):
| Metric | PUD THP (1GB) | PMD THP (2MB) | Change |
|-------------------|---------------|---------------|--------------|
| Memory access | 88 ms | 134 ms | 34% faster |
| Page fault time | 898 ms | 331 ms | 2.7x slower |
Page faulting 1G pages is 2.7x slower (allocating 1G pages is hard :)).
For long-running workloads this is a one-off cost, and the 34% improvement
in access latency provides a significant benefit.
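For reference, below is a minimal pointer-chasing sketch in the spirit of the
test in [1]. This is not the gist code; the region size, iteration count and
the MADV_HUGEPAGE hint are assumptions chosen to mirror the numbers above.

/*
 * Illustrative sketch only: fault in a 4GB THP-eligible region, then chase
 * a random cycle of pointers through it so every access is a dependent load.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define REGION	(4UL << 30)			/* 4GB working set */
#define NPTR	(REGION / sizeof(void *))
#define DEREFS	(4UL << 20)			/* 4M random dereferences */

static double now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
	void **p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	double t0, t1, t2, t3;
	void **cur;
	size_t i;

	if (p == MAP_FAILED)
		return 1;
	madvise(p, REGION, MADV_HUGEPAGE);

	t0 = now_ms();
	memset(p, 0, REGION);			/* one-off page fault cost */
	t1 = now_ms();

	/* Build a single random cycle of pointers (Sattolo's algorithm). */
	for (i = 0; i < NPTR; i++)
		p[i] = &p[i];
	for (i = NPTR - 1; i > 0; i--) {
		size_t j = rand() % i;
		void *tmp = p[i];

		p[i] = p[j];
		p[j] = tmp;
	}

	t2 = now_ms();
	cur = (void **)p[0];
	for (i = 0; i < DEREFS; i++)
		cur = (void **)*cur;		/* dependent loads defeat prefetch */
	t3 = now_ms();

	printf("fault %.0f ms, chase %.0f ms (%p)\n", t1 - t0, t3 - t2, (void *)cur);
	return 0;
}
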
ARM with 64K PAGE_SIZE supports 512M PMD THPs. At Meta, we have a CPU-bound
workload running on a large number of ARM servers with 256G of memory. I set
the 512M THP setting to "always" on 100 servers in production (without really
having high expectations :)). The average memory used by the workload
increased from 217G to 233G, and the amount of memory backed by 512M pages
was 68G! dTLB misses went down by 26% and the PID multiplier increased input
by 5.9% (a very significant improvement in workload performance). A
significant number of these THPs were faulted in at application start and
were present across different VMAs. Of course, getting these 512M pages is
easier on ARM due to the bigger PAGE_SIZE and pageblock order.
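For context, enabling the 512M size on those hosts amounts to writing
"always" to the standard per-size mTHP sysfs control. A minimal sketch is
below; the hugepages-524288kB path assumes an arm64 kernel with 64K base
pages, where PMD-sized THP is 512M.

/* Sketch: set the 512M mTHP size to "always" via the per-size control. */
#include <stdio.h>

int main(void)
{
	const char *knob =
		"/sys/kernel/mm/transparent_hugepage/hugepages-524288kB/enabled";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror(knob);
		return 1;
	}
	fputs("always", f);
	fclose(f);
	return 0;
}
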
I am hoping that these patches for 1G THP can be used to provide similar
benefits for x86. I expect workloads to fault them in at start time when there
is plenty of free memory available.
Previous attempt by Zi Yan
==========================
Zi Yan attempted 1G THPs [2] around kernel version 5.11. There have been
significant changes in the kernel since then, including the folio conversion,
the mTHP framework, ptdesc, rmap changes, etc. I found it easier to use the
current PMD code as a reference for making 1G PUD THPs work. I am hoping Zi
can provide guidance on these patches!
Major Design Decisions
======================
1. No shared 1G zero page: The memory cost would be quite significant!
2. Page Table Pre-deposit Strategy
PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
page tables (one for each potential PMD entry after a split).
We allocate a PMD page table and use its pmd_huge_pte list to store
the deposited PTE tables. This ensures split operations don't fail due
to page table allocation failures (at the cost of 2M of page tables per
PUD THP). A rough sketch of this appears after the list.
3. Split to Base Pages
When a PUD THP must be split (COW, partial unmap, mprotect), we split
directly to base pages (262,144 PTEs). The ideal thing would be to split
to 2M pages and then to 4K pages if needed. However, this would require
significant rmap and mapcount tracking changes.
4. COW and fork handling via split
Copy-on-write and fork for a PUD THP trigger a split to base pages, then
use the existing PTE-level COW infrastructure. Getting another 1G region is
hard and could fail, and if only a 4K page is written, copying 1G is a waste.
Probably this should only be done on CoW and not on fork?
5. Migration via split
Split the PUD to PTEs and migrate individual pages. It is going to be
difficult to find 1G of contiguous memory to migrate to. Maybe it's better
to not allow migration of PUDs at all? I am more tempted to not allow
migration, but have kept splitting in this RFC.
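As referenced in point 2 above, the following is a rough sketch of the page
table pre-deposit idea. It is illustrative only, not the patch code: the
helper name is hypothetical, and locking and the error-unwind path
(withdrawing already-deposited tables) are omitted.

/*
 * Sketch of the pre-deposit strategy from point 2: allocate one PMD-level
 * table for the PUD entry and deposit PTRS_PER_PMD PTE tables on its
 * ptdesc's pmd_huge_pte list, reusing the existing PMD-THP deposit helper.
 */
static int pud_thp_deposit_tables(struct mm_struct *mm, unsigned long addr)
{
	pmd_t *pmd_table;
	int i;

	/* One PMD-level table will back the PUD entry after a future split. */
	pmd_table = pmd_alloc_one(mm, addr);
	if (!pmd_table)
		return -ENOMEM;

	for (i = 0; i < PTRS_PER_PMD; i++) {
		pgtable_t pgtable = pte_alloc_one(mm);

		if (!pgtable)
			goto out_free;
		/* Deposit onto the PMD table page's pmd_huge_pte list. */
		pgtable_trans_huge_deposit(mm, pmd_table, pgtable);
	}
	/* The PUD fault path would then stash pmd_table for the split path. */
	return 0;

out_free:
	/* Withdraw and free already-deposited tables here (omitted). */
	pmd_free(mm, pmd_table);
	return -ENOMEM;
}
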
Reviewers guide
===============
Most of the code was written by adapting the existing PMD code. For example,
the PUD page fault path is very similar to the PMD one; the differences are
the lack of a shared zero page and the page table deposit strategy. I think
the easiest way to review this series is to compare it with the PMD code.
Test results
============
1..7
# Starting 7 tests from 1 test cases.
# RUN pud_thp.basic_allocation ...
# pud_thp_test.c:169:basic_allocation:PUD THP allocated (anon_fault_alloc: 0 -> 1)
# OK pud_thp.basic_allocation
ok 1 pud_thp.basic_allocation
# RUN pud_thp.read_write_access ...
# OK pud_thp.read_write_access
ok 2 pud_thp.read_write_access
# RUN pud_thp.fork_cow ...
# pud_thp_test.c:236:fork_cow:Fork COW completed (thp_split_pud: 0 -> 1)
# OK pud_thp.fork_cow
ok 3 pud_thp.fork_cow
# RUN pud_thp.partial_munmap ...
# pud_thp_test.c:267:partial_munmap:Partial munmap completed (thp_split_pud: 1 -> 2)
# OK pud_thp.partial_munmap
ok 4 pud_thp.partial_munmap
# RUN pud_thp.mprotect_split ...
# pud_thp_test.c:293:mprotect_split:mprotect split completed (thp_split_pud: 2 -> 3)
# OK pud_thp.mprotect_split
ok 5 pud_thp.mprotect_split
# RUN pud_thp.reclaim_pageout ...
# pud_thp_test.c:322:reclaim_pageout:Reclaim completed (thp_split_pud: 3 -> 4)
# OK pud_thp.reclaim_pageout
ok 6 pud_thp.reclaim_pageout
# RUN pud_thp.migration_mbind ...
# pud_thp_test.c:356:migration_mbind:Migration completed (thp_split_pud: 4 -> 5)
# OK pud_thp.migration_mbind
ok 7 pud_thp.migration_mbind
# PASSED: 7 / 7 tests passed.
# Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0
[1] https://gist.github.com/uarif1/bf279b2a01a536cda945ff9f40196a26
[2] https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Usama Arif (12):
mm: add PUD THP ptdesc and rmap support
mm/thp: add mTHP stats infrastructure for PUD THP
mm: thp: add PUD THP allocation and fault handling
mm: thp: implement PUD THP split to PTE level
mm: thp: add reclaim and migration support for PUD THP
selftests/mm: add PUD THP basic allocation test
selftests/mm: add PUD THP read/write access test
selftests/mm: add PUD THP fork COW test
selftests/mm: add PUD THP partial munmap test
selftests/mm: add PUD THP mprotect split test
selftests/mm: add PUD THP reclaim test
selftests/mm: add PUD THP migration test
include/linux/huge_mm.h | 60 ++-
include/linux/mm.h | 19 +
include/linux/mm_types.h | 5 +-
include/linux/pgtable.h | 8 +
include/linux/rmap.h | 7 +-
mm/huge_memory.c | 535 +++++++++++++++++++++-
mm/internal.h | 3 +
mm/memory.c | 8 +-
mm/migrate.c | 17 +
mm/page_vma_mapped.c | 35 ++
mm/pgtable-generic.c | 83 ++++
mm/rmap.c | 96 +++-
mm/vmscan.c | 2 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/pud_thp_test.c | 360 +++++++++++++++
15 files changed, 1197 insertions(+), 42 deletions(-)
create mode 100644 tools/testing/selftests/mm/pud_thp_test.c
--
2.47.3