From: Usama Arif <usama.arif@linux.dev>
To: David Hildenbrand <david@kernel.org>,
	"willy@infradead.org" <willy@infradead.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Zi Yan <ziy@nvidia.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	lsf-pc@lists.linux-foundation.org,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	"riel@surriel.com" <riel@surriel.com>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Kiryl Shutsemau <kas@kernel.org>, Barry Song <baohua@kernel.org>,
	Dev Jain <dev.jain@arm.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Nico Pache <npache@redhat.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Lance Yang <lance.yang@linux.dev>,
	Frank van der Linden <fvdl@google.com>
Subject: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
Date: Thu, 19 Feb 2026 15:53:35 +0000
Message-ID: <540c5c13-9cfb-44ea-b18f-8e4abff30a01@linux.dev>

When 2MB THPs were introduced, new server hardware shipped with memory in the
scale of low hundreds of gigabytes. Today, modern servers ship with several
terabytes of memory, widely available at all the major hyperscalers (AWS,
Azure, GCP, Meta, Oracle, etc).
While 2MB THPs have mitigated some scalability bottlenecks, they are no longer
"huge" in the context of terabyte-scale memory. There are concrete scalability
walls that large-memory machines hit today: LRU lock contention, zone lock
contention when the PCP cache is missed at allocation, extremely low TLB
coverage, the amount of memory consumed by page tables, and so on.

1G THPs come with their own set of challenges: they are more difficult to
allocate, compaction times are higher, and so on.

Why 1G THP over hugetlbfs?
==========================

As mentioned in the RFC for 1G THPs [1], while hugetlbfs provides 1GB huge pages
today, it has significant limitations that make it unsuitable for many workloads.

The classic hugetlb user is a dedicated machine running a dedicated HPC workload.
This approach just doesn't work when a multitude of general-purpose workloads are
co-located on the same host. Enlightening every one of these workloads to use
hugetlbfs is impractical -- it requires application-level changes, explicit mmap
flags, filesystem mounts, and per-workload capacity planning. Sharing a host
between hugetlbfs consumers and regular workloads is equally difficult because
hugetlb's static reservation model locks memory away from the rest of the
system. In a multi-tenant environment where workloads are constantly being
scheduled, resized, and migrated, this rigidity is a serious operational burden.

Concretely, hugetlbfs has the following limitations:

1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
   or runtime, taking that memory away from the rest of the system. This
   requires capacity planning and administrative overhead, and makes workload
   orchestration much more complex, especially when colocating with workloads
   that don't use hugetlbfs.

2. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
   rather than falling back to smaller pages. This makes it fragile under
   memory pressure.

3. No Splitting: hugetlbfs pages cannot be split when only partial access
   is needed, leading to memory waste and preventing partial reclaim. Being
   able to split would also make recovery from HWPOISON much easier: a 1G THP
   can be split to isolate the poisoned page, which is not possible with
   hugetlb.

4. Memory Accounting: hugetlbfs memory is accounted separately and cannot
   be easily shared with regular memory pools.

PUD THP solves these limitations by integrating 1GB pages into the existing
THP infrastructure.

The RFC [1] cover letter contains performance numbers for 1G THPs on x86 and
512M PMD THPs on arm, which I won't repeat here.
The RFC raised many good questions about how we can approach this and what the
way forward would be. Some of these include:

Page table deposit strategy:
============================

The RFC deposited a PMD page table and 512 PTE page tables, which means ~2MB
of memory would be reserved (and unused) for the lifetime of each 1G THP.
David raised a valid question of whether this is even needed for 2M THP, and I
believe the answer is no. As part of cleaning up the current 2M implementation,
I am currently looking at what the kernel would look like without the page
table deposit for 2M THPs [2] (for everything apart from the PowerPC hash MMU).

For 1G THPs, a similar approach to [2] can be taken. Should the PowerPC hash
MMU, which does require the deposit, simply get no initial 1G THP support?

There will also be a lot of code reuse between PUD and PMD, and similar to the
page table deposit cleanup, it would be good to know what else needs to be
targeted!

Is CMA needed to make this work?
================================

The short answer is no. 1G THPs can be allocated without it. CMA can of course
help a lot, but we don't *need* it. For example, I can run the very simple case
of allocating 1G pages via hugetlb on my server, which runs an upstream kernel
without CMA, and it works. The server has been up for more than 2 weeks (so
memory is fairly fragmented), is running a bunch of stuff in the background,
uses 0 CMA memory, and I was able to allocate 100 x 1G pages on it.
hugetlb uses folio_alloc_gigantic(), which is exactly what the RFC uses:

$ uptime -p
up 2 weeks, 18 hours, 35 minutes
$ cat /proc/meminfo | grep -i cma
CmaTotal:              0 kB
CmaFree:               0 kB
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           1.0Ti        97Gi       297Gi       586Mi       623Gi       913Gi
Swap:          129Gi       659Mi       129Gi
$ echo 100 |   sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
100
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
100
$ ./map_1g_hugepages
Mapping 100 x 1GB huge pages (100 GB total)
Mapped at 0x7f2d80000000
Touched page 0 at 0x7f2d80000000
Touched page 1 at 0x7f2dc0000000
Touched page 2 at 0x7f2e00000000
Touched page 3 at 0x7f2e40000000
..
..
Touched page 98 at 0x7f4600000000
Touched page 99 at 0x7f4640000000
Unmapped successfully
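
For reference, below is a minimal sketch of what a test program like
map_1g_hugepages could look like. The actual program is not included in this
mail, so this is an assumption reconstructed from its output: one MAP_HUGETLB
mapping of 100 x 1GB pages, touching one byte per page.

/*
 * Minimal sketch (assumed, not the actual map_1g_hugepages source) of
 * mapping and touching 100 x 1GB hugetlb pages. Requires the 1GB pool
 * to be populated via nr_hugepages first, as shown above.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

#define NR_PAGES 100UL
#define SZ_1G    (1UL << 30)

int main(void)
{
	size_t len = NR_PAGES * SZ_1G;
	char *p;

	printf("Mapping %lu x 1GB huge pages (%lu GB total)\n",
	       NR_PAGES, len >> 30);

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
		 -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("Mapped at %p\n", p);

	/* Fault in each 1GB page by touching one byte. */
	for (unsigned long i = 0; i < NR_PAGES; i++) {
		p[i * SZ_1G] = 1;
		printf("Touched page %lu at %p\n", i, p + i * SZ_1G);
	}

	if (munmap(p, len)) {
		perror("munmap");
		return 1;
	}
	printf("Unmapped successfully\n");
	return 0;
}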


Ideally, I see 1G THPs being used opportunistically at the start of the
application, or by the allocator (jemalloc/tcmalloc), when there is plenty of
free memory available and a greater chance of getting 1G THPs.

Splitting strategy
==================

When a PUD THP must be broken up -- for COW after fork, partial munmap, mprotect
on a subregion, or reclaim -- it splits directly from PUD to PTE level, converting
1 PUD entry into 262,144 PTE entries. The ideal solution would be to split to
PMDs first, and only the necessary PMDs down to PTEs. This is something that
would hopefully be possible with David's proposal [3].
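
As a concrete illustration of the kind of operation that forces a split, here
is a sketch against today's anonymous THP interface. It assumes, hypothetically,
that the region ends up backed by a single 1G PUD THP; on current kernels the
same pattern splits 2M PMD THPs instead. A 4KB munmap in the middle of the
region cannot be represented at PUD (or PMD) granularity, so the mapping must
be split down.

/*
 * Sketch: an operation that would force a PUD THP split. Assumes a
 * hypothetical kernel where this anonymous region is backed by a 1GB
 * THP; today the same pattern splits 2MB PMD THPs.
 */
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

#define SZ_1G (1UL << 30)
#define SZ_4K (4UL << 10)

int main(void)
{
	/* Over-allocate so a 1GB-aligned region can be carved out. */
	void *raw = mmap(NULL, 2 * SZ_1G, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return 1;

	char *p = (char *)(((unsigned long)raw + SZ_1G - 1) & ~(SZ_1G - 1));

	/* Ask for THP backing and fault the whole region in. */
	madvise(p, SZ_1G, MADV_HUGEPAGE);
	memset(p, 1, SZ_1G);

	/*
	 * Punch a 4KB hole in the middle. With the RFC this splits the
	 * PUD straight into 262,144 PTEs; ideally it would split
	 * PUD -> PMDs, with only the affected PMD going down to PTEs.
	 */
	munmap(p + SZ_1G / 2, SZ_4K);
	return 0;
}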

khugepaged support
==================

I believe the best strategy for 1G THPs would be to follow the same path as mTHP,
i.e. no khugepaged support at the start. I have seen khugepaged working on ARM
with 512M pages and a 64K PAGE_SIZE, so maybe there is a case for it? But I
believe the initial implementation shouldn't have it.
Maybe MADV_COLLAPSE-only support makes more sense?
I would love to hear more thoughts on this.
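
For reference, MADV_COLLAPSE is how userspace already requests a synchronous
collapse to PMD-size THPs today, and a PUD-size variant would presumably look
the same from userspace; how the target size gets selected is exactly the open
question above. A minimal sketch against the existing madvise(2) interface
(nothing 1G-specific exists yet):

/*
 * Sketch: synchronous collapse request via MADV_COLLAPSE. On current
 * kernels this collapses the range into PMD-size THPs; whether the same
 * call could be extended to collapse a fully populated, aligned 1GB
 * range into a PUD THP is an open question.
 */
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

#define SZ_1G (1UL << 30)

int main(void)
{
	void *raw = mmap(NULL, 2 * SZ_1G, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return 1;

	/* Align to 1GB and populate, so the range is collapse-eligible. */
	char *p = (char *)(((unsigned long)raw + SZ_1G - 1) & ~(SZ_1G - 1));
	memset(p, 1, SZ_1G);

	return madvise(p, SZ_1G, MADV_COLLAPSE) ? 1 : 0;
}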

Migration support
=================

It is going to be difficult to find 1GB of contiguous memory to migrate to.
Maybe it's better to not allow migration of PUDs at all?
As Zi rightly mentioned [4], without migration, PUD THP loses its flexibility
and transparency. But with its 1GB size, what exactly would the purpose of
PUD THP migration be? It does not create memory fragmentation, since it is
the largest folio size we have and fully contiguous. NUMA balancing of 1GB
THPs seems like too much work.
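
For context, this is what explicit, userspace-driven migration would look like
if it were allowed, using the existing move_pages(2) syscall on an address
assumed (hypothetically) to be backed by a PUD THP. The userspace side needs
nothing new; the open question is entirely on the kernel side: migrate the
whole 1GB folio (which needs a contiguous 1GB target on the destination node),
split it first, or refuse.

/*
 * Sketch: userspace-driven NUMA migration via move_pages(2). The address
 * is assumed (hypothetically) to sit in a PUD-THP-backed region; whether
 * the kernel would migrate the 1GB folio as a unit, split it, or fail is
 * the open design question.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)	/* from linux/mempolicy.h */
#endif

#define SZ_1G (1UL << 30)

int main(void)
{
	/* A 1GB anonymous region, assumed to be backed by one PUD THP. */
	char *p = mmap(NULL, SZ_1G, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	madvise(p, SZ_1G, MADV_HUGEPAGE);
	memset(p, 1, SZ_1G);

	void *pages[1]  = { p };	/* one entry per page asked about */
	int   nodes[1]  = { 1 };	/* desired destination node */
	int   status[1] = { -1 };

	long ret = syscall(SYS_move_pages, 0 /* self */, 1UL,
			   pages, nodes, status, MPOL_MF_MOVE);
	printf("move_pages: ret=%ld status=%d\n", ret, status[0]);
	return 0;
}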

There are a lot more topics that would need to be discussed. But these are
some of the big ones that came out of the RFC.

[1] https://lore.kernel.org/all/20260202005451.774496-1-usamaarif642@gmail.com/
[2] https://lore.kernel.org/all/20260211125507.4175026-1-usama.arif@linux.dev/
[3] http://lore.kernel.org/all/fe6afcc3-7539-4650-863b-04d971e89cfb@kernel.org/
[4] https://lore.kernel.org/all/3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com/


