* [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
@ 2026-02-19 15:53 Usama Arif
2026-02-19 16:00 ` David Hildenbrand (Arm)
2026-02-19 19:02 ` Rik van Riel
0 siblings, 2 replies; 12+ messages in thread
From: Usama Arif @ 2026-02-19 15:53 UTC (permalink / raw)
To: David Hildenbrand, willy, Lorenzo Stoakes, Zi Yan, Andrew Morton,
lsf-pc, linux-mm
Cc: Johannes Weiner, riel, Shakeel Butt, Kiryl Shutsemau, Barry Song,
Dev Jain, Baolin Wang, Nico Pache, Liam R . Howlett,
Ryan Roberts, Vlastimil Babka, Lance Yang, Frank van der Linden
When 2M THPs were introduced, new server hardware had memory on the scale of
low hundreds of gigabytes. Today, modern server hardware ships with several
terabytes of memory. This is widely available at all hyperscalers (AWS, Azure,
GCP, Meta, Oracle, etc).
While 2MB THPs have mitigated some scalability bottlenecks, they are no longer
"huge" in the context of terabyte-scale memory. There are concrete scalability
walls that large-memory machines hit today: LRU lock contention, zone lock
contention when allocations miss the PCP cache, extremely low TLB coverage, and
the sheer amount of memory consumed by page tables.
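As a rough back-of-the-envelope illustration (assuming 4K base pages, 8-byte
page table entries and, purely for illustration, a data TLB on the order of
2048 entries -- real numbers vary by CPU):

  TLB coverage:        2048 x 4K = 8M     2048 x 2M = 4G     2048 x 1G = 2T
  Mapping 1T with 4K:  256M PTEs  -> ~2G of last-level page tables
  Mapping 1T with 2M:  512K PMDs  -> ~4M of page tables
  Mapping 1T with 1G:  1K PUDs    -> ~8K of page tables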
1G THPs come with their own set of challenges: they are more difficult to
allocate and incur higher compaction times.
Why 1G THP over hugetlbfs?
==========================
As mentioned in the RFC for 1G THPs [1], while hugetlbfs provides 1GB huge pages
today, it has significant limitations that make it unsuitable for many workloads.
The classic hugetlb user is a dedicated machine running a dedicated HPC workload.
This approach just doesn't work when you run a multitude of general-purpose workloads
co-located on the same host. Enlightening every one of these workloads to use
hugetlbfs is impractical -- it requires application-level changes, explicit mmap
flags, filesystem mounts, and per-workload capacity planning. Sharing a host
between hugetlbfs consumers and regular workloads is equally difficult because
hugetlb's static reservation model locks memory away from the rest of the
system. In a multi-tenant environment where workloads are constantly being
scheduled, resized, and migrated, this rigidity is a serious operational burden.
Concretely, hugetlbfs has the following limitations:
1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
or runtime, taking memory away from the rest of the system. This requires
capacity planning and administrative overhead, and makes workload orchestration
much more complex, especially when colocating with workloads that don't use
hugetlbfs.
2. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
rather than falling back to smaller pages. This makes it fragile under
memory pressure.
3. No Splitting: hugetlbfs pages cannot be split when only partial access
is needed, leading to memory waste and preventing partial reclaim. Splitting
would also make recovery from HWPOISON much easier: a 1G THP could be split
so that only the affected subpage is lost, which is not possible with hugetlb.
4. Memory Accounting: hugetlbfs memory is accounted separately and cannot
be easily shared with regular memory pools.
PUD THP solves these limitations by integrating 1GB pages into the existing
THP infrastructure.
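To make the "application-level changes" point concrete, here is a minimal
userspace sketch (not from the RFC; the helper names are made up, but the mmap
flags and the madvise() hint are existing kernel ABI). The hugetlbfs path needs
explicit flags and pre-reserved pages, while the THP path is a plain anonymous
mapping that the kernel can transparently back with, and fall back from, huge
pages:

#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << 26)	/* log2(1G) << MAP_HUGE_SHIFT */
#endif

#define SZ_1G		(1UL << 30)

/* hugetlbfs: explicit flags; fails outright if no reserved 1G page is free */
static void *alloc_hugetlb_1g(void)
{
	return mmap(NULL, SZ_1G, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
		    -1, 0);
}

/* THP: a normal mapping plus a hint; the kernel decides the backing page
 * size and can always fall back to smaller pages under memory pressure */
static void *alloc_thp_1g(void)
{
	void *p = mmap(NULL, SZ_1G, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p != MAP_FAILED)
		madvise(p, SZ_1G, MADV_HUGEPAGE);
	return p;
}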
The RFC [1] cover letter contains performance numbers for 1G THPs on x86 and
512M PMD THPs on arm, which I won't repeat here.
The RFC raised many good questions about how we can approach this and what the
way forward would be. Some of these include:
Page table deposit strategy:
============================
The RFC deposited one PMD page table and 512 PTE page tables, which means ~2MB
of memory would be reserved (and unused) for the lifetime of each 1G THP.
David raised the valid question of whether this is even needed for 2M THPs,
and I believe the answer is no. As part of cleaning up the current 2M
implementation, I am currently looking at what the kernel would look like
without page table deposit for 2M THPs [2] (for everything apart from the
PowerPC hash MMU).
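For reference, this is roughly the pattern that [2] removes for the common
case (simplified from the current 2M anon fault and zap paths in
mm/huge_memory.c; locking and error handling omitted, so not a compilable
unit): a PTE page table is pre-allocated at fault time and stashed so that a
later split can never fail on allocation, and it sits unused otherwise:

	/* at PMD fault time: pre-allocate and stash a PTE page table */
	pgtable_t pgtable = pte_alloc_one(vma->vm_mm);
	...
	pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);

	/* at split/zap time: take it back so the split cannot fail */
	pgtable = pgtable_trans_huge_withdraw(mm, pmd);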
For 1G THPs, a similar approach to [2] can be taken, and probably no initial
support for 1G THPs on the PowerPC hash MMU, which requires the page table
deposit? There will also be a lot of code reuse between the PUD and PMD paths,
and, similar to the page table deposit cleanup, it would be good to know what
else needs to be targeted!
Is CMA needed to make this work?
================================
The short answer is no. 1G THPs can be allocated without it. CMA can help a lot
of course, but we don't *need* it. For example, I can run the very simple case of
trying to get 1G pages in the upstream kernel without CMA on my server via
hugetlb and it works. The server has been up for more than 2 weeks (so pretty
fragmented), is running a bunch of stuff in the background, uses 0 CMA memory,
and I tried to get 100x1G pages on it and it worked.
It uses folio_alloc_gigantic, which is exactly what this RFC uses:
$ uptime -p
up 2 weeks, 18 hours, 35 minutes
$ cat /proc/meminfo | grep -i cma
CmaTotal: 0 kB
CmaFree: 0 kB
$ free -h
              total        used        free      shared  buff/cache   available
Mem:          1.0Ti        97Gi       297Gi       586Mi       623Gi       913Gi
Swap:         129Gi       659Mi       129Gi
$ echo 100 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
100
$ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
100
$ ./map_1g_hugepages
Mapping 100 x 1GB huge pages (100 GB total)
Mapped at 0x7f2d80000000
Touched page 0 at 0x7f2d80000000
Touched page 1 at 0x7f2dc0000000
Touched page 2 at 0x7f2e00000000
Touched page 3 at 0x7f2e40000000
..
..
Touched page 98 at 0x7f4600000000
Touched page 99 at 0x7f4640000000
Unmapped successfully
I see 1G THPs ideally being used opportunistically at the start of the application,
or by the allocator (jemalloc/tcmalloc), when there is plenty of free memory
available and a greater chance of getting 1G THPs.
Splitting strategy
==================
When a PUD THP must be broken -- for COW after fork, partial munmap, mprotect on
a subregion, or reclaim -- it splits directly from PUD to PTE level, converting
1 PUD entry into 262,144 PTE entries. The ideal solution would be to split to
PMDs, and only the necessary PMDs to PTEs. This is something that would hopefully
be possible with David's proposal [3].
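For scale (assuming a 4K base page size and that only one 2M region actually
needs PTE granularity): splitting to PMDs would leave 512 PMD entries, and
splitting just the affected PMD adds another 512 PTEs -- on the order of a
thousand mappings and two page table pages, versus 262,144 PTEs and 513 page
table pages when going straight from PUD to PTE level.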
khugepaged support
==================
I believe the best strategy for 1G THPs would be to follow the same path as mTHPs,
i.e. not having khugepaged support at the start. I have seen khugepaged working on
ARM with 512M pages and 64K PAGE_SIZE, so maybe there is a case for it? But I
believe the initial implementation shouldn't have it.
Maybe MADV_COLLAPSE-only support makes more sense?
I would love to hear more thoughts on this.
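For reference, a MADV_COLLAPSE-only model would look exactly like it does for
2M today from userspace; a minimal sketch (the arena helper and its 1G
alignment are hypothetical, and whether the kernel would collapse all the way
to PUD size here is precisely the open policy question):

#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE	25	/* since v6.1; collapses to PMD-sized THPs today */
#endif

#define SZ_1G		(1UL << 30)

/* Hypothetical: an allocator asks for a synchronous collapse of a hot,
 * 1G-aligned arena it populated earlier. Today this gives 2M THPs; with
 * PUD THP support it could (by policy) go all the way to 1G. */
static int collapse_arena(void *arena_1g_aligned)
{
	return madvise(arena_1g_aligned, SZ_1G, MADV_COLLAPSE);
}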
Migration support
=================
It is going to be difficult to find 1GB of contiguous memory to migrate to.
Maybe it's better to not allow migration of PUDs at all?
As Zi rightly mentioned [4], without migration, PUD THP loses its flexibility
and transparency. But with its 1GB size, what exactly would the purpose of
PUD THP migration be? It does not create memory fragmentation, since it is
the largest folio size we have and is contiguous. NUMA balancing a 1GB THP
seems like too much work.
There are a lot more topics that would need to be discussed. But these are
some of the big ones that came out of the RFC.
[1] https://lore.kernel.org/all/20260202005451.774496-1-usamaarif642@gmail.com/
[2] https://lore.kernel.org/all/20260211125507.4175026-1-usama.arif@linux.dev/
[3] http://lore.kernel.org/all/fe6afcc3-7539-4650-863b-04d971e89cfb@kernel.org/
[4] https://lore.kernel.org/all/3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com/
* Re: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
2026-02-19 15:53 [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages Usama Arif
@ 2026-02-19 16:00 ` David Hildenbrand (Arm)
2026-02-19 16:48 ` Johannes Weiner
2026-02-19 16:49 ` Zi Yan
2026-02-19 19:02 ` Rik van Riel
1 sibling, 2 replies; 12+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-19 16:00 UTC (permalink / raw)
To: Usama Arif, willy, Lorenzo Stoakes, Zi Yan, Andrew Morton,
lsf-pc, linux-mm
Cc: Johannes Weiner, riel, Shakeel Butt, Kiryl Shutsemau, Barry Song,
Dev Jain, Baolin Wang, Nico Pache, Liam R . Howlett,
Ryan Roberts, Vlastimil Babka, Lance Yang, Frank van der Linden
>
> I see 1G THPs ideally being used opportunistically at the start of the application,
> or by the allocator (jemalloc/tcmalloc), when there is plenty of free memory
> available and a greater chance of getting 1G THPs.
>
> Splitting strategy
> ==================
>
> When a PUD THP must be broken -- for COW after fork, partial munmap, mprotect on
> a subregion, or reclaim -- it splits directly from PUD to PTE level, converting
> 1 PUD entry into 262,144 PTE entries. The ideal solution would be to split to
> PMDs, and only the necessary PMDs to PTEs. This is something that would hopefully
> be possible with David's proposal [3].
There once was this proposal where we would, instead of splitting a THP,
migrate all memory away instead. That means, instead of splitting the 1
GiB THP, you would instead return it to the page allocator where
somebody else could use it.
However, we cannot easily do the same when remapping a 1 GiB THP to be
mapped by PMDs etc. I think there are examples where that just doesn't
work or is not desired.
But I considered that in general (avoid folio_split()) an interesting
approach. The remapping part is a bit different though.
--
Cheers,
David
* Re: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
2026-02-19 16:00 ` David Hildenbrand (Arm)
@ 2026-02-19 16:48 ` Johannes Weiner
2026-02-19 16:52 ` Zi Yan
2026-02-19 16:49 ` Zi Yan
1 sibling, 1 reply; 12+ messages in thread
From: Johannes Weiner @ 2026-02-19 16:48 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Usama Arif, willy, Lorenzo Stoakes, Zi Yan, Andrew Morton,
lsf-pc, linux-mm, riel, Shakeel Butt, Kiryl Shutsemau,
Barry Song, Dev Jain, Baolin Wang, Nico Pache, Liam R . Howlett,
Ryan Roberts, Vlastimil Babka, Lance Yang, Frank van der Linden
On Thu, Feb 19, 2026 at 05:00:19PM +0100, David Hildenbrand (Arm) wrote:
>
> >
> > I see 1G THPs ideally being used opportunistically at the start of the application,
> > or by the allocator (jemalloc/tcmalloc), when there is plenty of free memory
> > available and a greater chance of getting 1G THPs.
> >
> > Splitting strategy
> > ==================
> >
> > When a PUD THP must be broken -- for COW after fork, partial munmap, mprotect on
> > a subregion, or reclaim -- it splits directly from PUD to PTE level, converting
> > 1 PUD entry into 262,144 PTE entries. The ideal solution would be to split to
> > PMDs, and only the necessary PMDs to PTEs. This is something that would hopefully
> > be possible with David's proposal [3].
>
> There once was this proposal where we would, instead of splitting a THP,
> migrate all memory away instead. That means, instead of splitting the 1
> GiB THP, you would instead return it to the page allocator where
> somebody else could use it.
With TLB coalescing, there is benefit in preserving contiguity. If you
lop off the last 4k of a 2M-backed range, a split still gives you 511
contiguously mapped pfns that can be coalesced.
It would be unfortunate to lose that for pure virtual memory splits,
while there is no demand or no shortage of huge pages. But it might be
possible to do this lazily, e.g. when somebody has trouble getting a
larger page, scan the deferred split lists for candidates to migrate.
* Re: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
2026-02-19 16:00 ` David Hildenbrand (Arm)
2026-02-19 16:48 ` Johannes Weiner
@ 2026-02-19 16:49 ` Zi Yan
2026-02-19 17:13 ` Matthew Wilcox
1 sibling, 1 reply; 12+ messages in thread
From: Zi Yan @ 2026-02-19 16:49 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Usama Arif, willy, Lorenzo Stoakes, Andrew Morton, lsf-pc,
linux-mm, Johannes Weiner, riel, Shakeel Butt, Kiryl Shutsemau,
Barry Song, Dev Jain, Baolin Wang, Nico Pache, Liam R . Howlett,
Ryan Roberts, Vlastimil Babka, Lance Yang, Frank van der Linden
On 19 Feb 2026, at 11:00, David Hildenbrand (Arm) wrote:
>>
>> I see 1G THPs ideally being used opportunistically at the start of the application,
>> or by the allocator (jemalloc/tcmalloc), when there is plenty of free memory
>> available and a greater chance of getting 1G THPs.
>>
>> Splitting strategy
>> ==================
>>
>> When a PUD THP must be broken -- for COW after fork, partial munmap, mprotect on
>> a subregion, or reclaim -- it splits directly from PUD to PTE level, converting
>> 1 PUD entry into 262,144 PTE entries. The ideal solution would be to split to
>> PMDs, and only the necessary PMDs to PTEs. This is something that would hopefully
>> be possible with David's proposal [3].
With folios > PMD size mapped by PMDs, you can use a non-uniform split to
keep the after-split folios as large as possible.
>
> There once was this proposal where we would, instead of splitting a THP, migrate all memory away instead. That means, instead of splitting the 1 GiB THP, you would instead return it to the page allocator where somebody else could use it.
This sounds more reasonable than splitting 1GB itself.
>
> However, we cannot easily do the same when remapping a 1 GiB THP to be mapped by PMDs etc. I think there are examples where that just doesn't work or is not desired.
>
> But I considered that in general (avoid folio_split()) an interesting approach. The remapping part is a bit different though.
If HW can support multiple TLB entries translating to the same physical frame
and allow translation priority of TLB entries, this remapping would be easy
and we can still keep the 1GB PUD mapping. Basically, we can have 1GB TLB entry
pointing to the 1GB folio and another 4KB TLB entry pointing to the remapped
region and overriding the part in the original 1GB vaddr region.
Without that, SW will need to split the PUD into PMDs and PTEs.
Best Regards,
Yan, Zi
* Re: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
2026-02-19 16:48 ` Johannes Weiner
@ 2026-02-19 16:52 ` Zi Yan
2026-02-19 17:08 ` Johannes Weiner
2026-02-19 17:09 ` David Hildenbrand (Arm)
0 siblings, 2 replies; 12+ messages in thread
From: Zi Yan @ 2026-02-19 16:52 UTC (permalink / raw)
To: Johannes Weiner
Cc: David Hildenbrand (Arm),
Usama Arif, willy, Lorenzo Stoakes, Andrew Morton, lsf-pc,
linux-mm, riel, Shakeel Butt, Kiryl Shutsemau, Barry Song,
Dev Jain, Baolin Wang, Nico Pache, Liam R . Howlett,
Ryan Roberts, Vlastimil Babka, Lance Yang, Frank van der Linden
On 19 Feb 2026, at 11:48, Johannes Weiner wrote:
> On Thu, Feb 19, 2026 at 05:00:19PM +0100, David Hildenbrand (Arm) wrote:
>>
>>>
>>> I see 1G THPs ideally being used opportunistically at the start of the application,
>>> or by the allocator (jemalloc/tcmalloc), when there is plenty of free memory
>>> available and a greater chance of getting 1G THPs.
>>>
>>> Splitting strategy
>>> ==================
>>>
>>> When a PUD THP must be broken -- for COW after fork, partial munmap, mprotect on
>>> a subregion, or reclaim -- it splits directly from PUD to PTE level, converting
>>> 1 PUD entry into 262,144 PTE entries. The ideal solution would be to split to
>>> PMDs, and only the necessary PMDs to PTEs. This is something that would hopefully
>>> be possible with David's proposal [3].
>>
>> There once was this proposal where we would, instead of splitting a THP,
>> migrate all memory away instead. That means, instead of splitting the 1
>> GiB THP, you would instead return it to the page allocator where
>> somebody else could use it.
>
> With TLB coalescing, there is benefit in preserving contiguity. If you
> lop off the last 4k of a 2M-backed range, a split still gives you 511
> contiguously mapped pfns that can be coalesced.
Which CPU are you referring to? AMD’s PTE coalescing works up to 32KB
and ARM’s contig PTE supports larger sizes. BTW, do we have PMD level
ARM contiguous bit support?
>
> It would be unfortunate to lose that for pure virtual memory splits,
> while there is no demand or no shortage of huge pages. But it might be
> possible to do this lazily, e.g. when somebody has trouble getting a
> larger page, scan the deferred split lists for candidates to migrate.
Best Regards,
Yan, Zi
* Re: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
2026-02-19 16:52 ` Zi Yan
@ 2026-02-19 17:08 ` Johannes Weiner
2026-02-19 17:09 ` David Hildenbrand (Arm)
2026-02-19 17:09 ` David Hildenbrand (Arm)
1 sibling, 1 reply; 12+ messages in thread
From: Johannes Weiner @ 2026-02-19 17:08 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand (Arm),
Usama Arif, willy, Lorenzo Stoakes, Andrew Morton, lsf-pc,
linux-mm, riel, Shakeel Butt, Kiryl Shutsemau, Barry Song,
Dev Jain, Baolin Wang, Nico Pache, Liam R . Howlett,
Ryan Roberts, Vlastimil Babka, Lance Yang, Frank van der Linden
On Thu, Feb 19, 2026 at 11:52:57AM -0500, Zi Yan wrote:
> On 19 Feb 2026, at 11:48, Johannes Weiner wrote:
>
> > On Thu, Feb 19, 2026 at 05:00:19PM +0100, David Hildenbrand (Arm) wrote:
> >>
> >>>
> >>> I see 1G THPs ideally being used opportunistically at the start of the application,
> >>> or by the allocator (jemalloc/tcmalloc), when there is plenty of free memory
> >>> available and a greater chance of getting 1G THPs.
> >>>
> >>> Splitting strategy
> >>> ==================
> >>>
> >>> When a PUD THP must be broken -- for COW after fork, partial munmap, mprotect on
> >>> a subregion, or reclaim -- it splits directly from PUD to PTE level, converting
> >>> 1 PUD entry into 262,144 PTE entries. The ideal solution would be to split to
> >>> PMDs, and only the necessary PMDs to PTEs. This is something that would hopefully
> >>> be possible with David's proposal [3].
> >>
> >> There once was this proposal where we would, instead of splitting a THP,
> >> migrate all memory away instead. That means, instead of splitting the 1
> >> GiB THP, you would instead return it to the page allocator where
> >> somebody else could use it.
> >
> > With TLB coalescing, there is benefit in preserving contiguity. If you
> > lop off the last 4k of a 2M-backed range, a split still gives you 511
> > contiguously mapped pfns that can be coalesced.
>
> Which CPU are you referring to? AMD’s PTE coalescing works up to 32KB
> and ARM’s contig PTE supports larger sizes. BTW, do we have PMD level
> ARM contiguous bit support?
I'm not aware of a CPU that will coalesce the 511 entries into a
single one. But *any* coalescing effects will be lost when the range
is scattered into discontiguous 4k pagelets.
* Re: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
2026-02-19 16:52 ` Zi Yan
2026-02-19 17:08 ` Johannes Weiner
@ 2026-02-19 17:09 ` David Hildenbrand (Arm)
1 sibling, 0 replies; 12+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-19 17:09 UTC (permalink / raw)
To: Zi Yan, Johannes Weiner
Cc: Usama Arif, willy, Lorenzo Stoakes, Andrew Morton, lsf-pc,
linux-mm, riel, Shakeel Butt, Kiryl Shutsemau, Barry Song,
Dev Jain, Baolin Wang, Nico Pache, Liam R . Howlett,
Ryan Roberts, Vlastimil Babka, Lance Yang, Frank van der Linden
On 2/19/26 17:52, Zi Yan wrote:
> On 19 Feb 2026, at 11:48, Johannes Weiner wrote:
>
>> On Thu, Feb 19, 2026 at 05:00:19PM +0100, David Hildenbrand (Arm) wrote:
>>>
>>>
>>> There once was this proposal where we would, instead of splitting a THP,
>>> migrate all memory away instead. That means, instead of splitting the 1
>>> GiB THP, you would instead return it to the page allocator where
>>> somebody else could use it.
>>
>> With TLB coalescing, there is benefit in preserving contiguity. If you
>> lop off the last 4k of a 2M-backed range, a split still gives you 511
>> contiguously mapped pfns that can be coalesced.
>
> Which CPU are you referring to? AMD’s PTE coalescing works up to 32KB
> and ARM’s contig PTE supports larger sizes. BTW, do we have PMD level
> ARM contiguous bit support?
Yes. It's used for hugetlb only so far, obviously (no THP > PMD).
--
Cheers,
David
* Re: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
2026-02-19 17:08 ` Johannes Weiner
@ 2026-02-19 17:09 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-19 17:09 UTC (permalink / raw)
To: Johannes Weiner, Zi Yan
Cc: Usama Arif, willy, Lorenzo Stoakes, Andrew Morton, lsf-pc,
linux-mm, riel, Shakeel Butt, Kiryl Shutsemau, Barry Song,
Dev Jain, Baolin Wang, Nico Pache, Liam R . Howlett,
Ryan Roberts, Vlastimil Babka, Lance Yang, Frank van der Linden
On 2/19/26 18:08, Johannes Weiner wrote:
> On Thu, Feb 19, 2026 at 11:52:57AM -0500, Zi Yan wrote:
>> On 19 Feb 2026, at 11:48, Johannes Weiner wrote:
>>
>>>
>>> With TLB coalescing, there is benefit in preserving contiguity. If you
>>> lop off the last 4k of a 2M-backed range, a split still gives you 511
>>> contiguously mapped pfns that can be coalesced.
>>
>> Which CPU are you referring to? AMD’s PTE coalescing works up to 32KB
>> and ARM’s contig PTE supports larger sizes. BTW, do we have PMD level
>> ARM contiguous bit support?
>
> I'm not aware of a CPU that will coalesce the 511 entries into a
> single one. But *any* coalescing effects will be lost when the range
> is scattered into discontiguous 4k pagelets.
You could of course migrate to larger folios, not necessarily 4k.
--
Cheers,
David
* Re: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
2026-02-19 16:49 ` Zi Yan
@ 2026-02-19 17:13 ` Matthew Wilcox
2026-02-19 17:28 ` Zi Yan
0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2026-02-19 17:13 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand (Arm),
Usama Arif, Lorenzo Stoakes, Andrew Morton, lsf-pc, linux-mm,
Johannes Weiner, riel, Shakeel Butt, Kiryl Shutsemau, Barry Song,
Dev Jain, Baolin Wang, Nico Pache, Liam R . Howlett,
Ryan Roberts, Vlastimil Babka, Lance Yang, Frank van der Linden
On Thu, Feb 19, 2026 at 11:49:27AM -0500, Zi Yan wrote:
> If HW can support multiple TLB entries translating to the same physical frame
> and allow translation priority of TLB entries, this remapping would be easy
> and we can still keep the 1GB PUD mapping. Basically, we can have 1GB TLB entry
> pointing to the 1GB folio and another 4KB TLB entry pointing to the remapped
> region and overriding the part in the original 1GB vaddr region.
Uh, do you know any hardware that supports that? Every CPU I'm familiar
with has notes suggesting that trying to do this will cause you to Have
A Very Bad Day.
* Re: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
2026-02-19 17:13 ` Matthew Wilcox
@ 2026-02-19 17:28 ` Zi Yan
0 siblings, 0 replies; 12+ messages in thread
From: Zi Yan @ 2026-02-19 17:28 UTC (permalink / raw)
To: Matthew Wilcox
Cc: David Hildenbrand (Arm),
Usama Arif, Lorenzo Stoakes, Andrew Morton, lsf-pc, linux-mm,
Johannes Weiner, riel, Shakeel Butt, Kiryl Shutsemau, Barry Song,
Dev Jain, Baolin Wang, Nico Pache, Liam R . Howlett,
Ryan Roberts, Vlastimil Babka, Lance Yang, Frank van der Linden
On 19 Feb 2026, at 12:13, Matthew Wilcox wrote:
> On Thu, Feb 19, 2026 at 11:49:27AM -0500, Zi Yan wrote:
>> If HW can support multiple TLB entries translating to the same physical frame
>> and allow translation priority of TLB entries, this remapping would be easy
>> and we can still keep the 1GB PUD mapping. Basically, we can have 1GB TLB entry
>> pointing to the 1GB folio and another 4KB TLB entry pointing to the remapped
>> region and overriding the part in the original 1GB vaddr region.
>
> Uh, do you know any hardware that supports that? Every CPU I'm familiar
> with has notes suggesting that trying to do this will cause you to Have
> A Very Bad Day.
No. I was imagining it. :)
But thinking about it more, that means for every >PTE TLB hit, HW needs to know
whether any sub-range has an additional translation. It is easy if all sub-range
translations are present in the TLB. Otherwise, a per-sub-range bitmap or a
rewalk of each sub-range is needed. Never mind, thank you for waking me up
from my daydream.
Best Regards,
Yan, Zi
* Re: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
2026-02-19 15:53 [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages Usama Arif
2026-02-19 16:00 ` David Hildenbrand (Arm)
@ 2026-02-19 19:02 ` Rik van Riel
2026-02-20 10:00 ` David Hildenbrand (Arm)
1 sibling, 1 reply; 12+ messages in thread
From: Rik van Riel @ 2026-02-19 19:02 UTC (permalink / raw)
To: Usama Arif, David Hildenbrand, willy, Lorenzo Stoakes, Zi Yan,
Andrew Morton, lsf-pc, linux-mm
Cc: Johannes Weiner, Shakeel Butt, Kiryl Shutsemau, Barry Song,
Dev Jain, Baolin Wang, Nico Pache, Liam R . Howlett,
Ryan Roberts, Vlastimil Babka, Lance Yang, Frank van der Linden
On Thu, 2026-02-19 at 15:53 +0000, Usama Arif wrote:
>
> Is CMA needed to make this work?
> ================================
>
> The short answer is no. 1G THPs can be allocated without it. CMA can
> help a lot of course, but we don't *need* it. For example, I can run
> the very simple case of trying to get 1G pages in the upstream kernel
> without CMA on my server via hugetlb and it works. The server has been
> up for more than 2 weeks (so pretty fragmented), is running a bunch of
> stuff in the background, uses 0 CMA memory, and I tried to get 100x1G
> pages on it and it worked.
> It uses folio_alloc_gigantic, which is exactly what this RFC uses:
While I agree with the idea of starting simple, I think
we should ask the question of what we want physical memory
handling to look like if 1GB pages become more common,
and applications start to rely on them to meet their
performance goals.
We have CMA balancing code today. It seems to work, but
it likely is not the long term direction we want to go,
mostly due to the way CMA does allocations.
It seems clear that in order to prevent memory fragmentation,
we need to split up system memory in some way between an area
that is used only for movable allocations, and an area where
any kind of allocation can go.
This would need something similar to CMA balancing to prevent
false OOMs for non-movable allocations.
However, beyond that I really do not have any idea of what
things should look like.
What do we want the kernel to do here?
--
All Rights Reversed.
* Re: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
2026-02-19 19:02 ` Rik van Riel
@ 2026-02-20 10:00 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-20 10:00 UTC (permalink / raw)
To: Rik van Riel, Usama Arif, willy, Lorenzo Stoakes, Zi Yan,
Andrew Morton, lsf-pc, linux-mm
Cc: Johannes Weiner, Shakeel Butt, Kiryl Shutsemau, Barry Song,
Dev Jain, Baolin Wang, Nico Pache, Liam R . Howlett,
Ryan Roberts, Vlastimil Babka, Lance Yang, Frank van der Linden
On 2/19/26 20:02, Rik van Riel wrote:
> On Thu, 2026-02-19 at 15:53 +0000, Usama Arif wrote:
>>
>> Is CMA needed to make this work?
>> ================================
>>
>> The short answer is no. 1G THPs can be allocated without it. CMA can
>> help a lot of course, but we don't *need* it. For example, I can run
>> the very simple case of trying to get 1G pages in the upstream kernel
>> without CMA on my server via hugetlb and it works. The server has been
>> up for more than 2 weeks (so pretty fragmented), is running a bunch of
>> stuff in the background, uses 0 CMA memory, and I tried to get 100x1G
>> pages on it and it worked.
>> It uses folio_alloc_gigantic, which is exactly what this RFC uses:
>
> While I agree with the idea of starting simple, I think
> we should ask the question of what we want physical memory
> handling to look like if 1TB pages become more common,
> and applications start to rely on them to meet their
> performance goals.
>
> We have CMA balancing code today. It seems to work, but
> it likely is not the long term direction we want to go,
> mostly due to the way CMA does allocations.
>
> It seems clear that in order to prevent memory fragmentation,
> we need to split up system memory in some way between an area
> that is used only for movable allocations, and an area where
> any kind of allocation can go.
>
> This would need something similar to CMA balancing to prevent
> false OOMs for non-movable allocations.
>
> However, beyond that I really do not have any idea of what
> things should look like.
>
> What do we want the kernel to do here?
This subtopic is certainly worth a separate session as it's quite
involved, but I assume the right (tm) thing to do will be
(a) Teaching the buddy to manage pages larger than the current maximum
buddy order. There will certainly be some work required to get to
that point (and Zi Yan already did some work). It might also be
fair to say that orders > the current buddy order might behave differently,
at least to some degree (thinking about the relation to zone alignment,
section sizes, etc.).
If we require vmemmap for these larger orders, maybe the buddy order
could more easily exceed the section size; I don't remember all of
the details why that limitation was in place (but one of them was
memmap continuity within a high-order buddy page, which is only
guaranteed within a memory section with CONFIG_SPARSEMEM).
(b) Teaching compaction etc. to *also* compact/group at a larger
granularity (in addition to pageblocks of the current size). When we
discussed that in the past we used the term superblock, which
Zi Yan just brought up again in another thread [1].
There was a proposal a while ago to internally separate zones into
chunks of memory (I think the proposal used DRAM banks, such that you
could more easily power down unused DRAM banks). I'm not saying we
should do that, but maybe something like sub-zones could be something to
explore. Maybe not.
Big, more complex topic :)
[1]
https://lore.kernel.org/r/34730030-48F6-4D0C-91EA-998A5AF93F5F@nvidia.com
--
Cheers,
David