Re: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages

linux-mm.kvack.org archive mirror

From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Matthew Wilcox <willy@infradead.org>, Zi Yan <ziy@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Usama Arif <usama.arif@linux.dev>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	riel@surriel.com, Shakeel Butt <shakeel.butt@linux.dev>,
	Kiryl Shutsemau <kas@kernel.org>, Barry Song <baohua@kernel.org>,
	Dev Jain <dev.jain@arm.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Nico Pache <npache@redhat.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Lance Yang <lance.yang@linux.dev>,
	Frank van der Linden <fvdl@google.com>,
	Oscar Salvador <osalvador@suse.de>
Subject: Re: [LSF/MM/BPF TOPIC] Beyond 2MB: Why Terabyte-Scale Machines Need 1GB Transparent Huge Pages
Date: Thu, 26 Feb 2026 11:05:06 +0100
Message-ID: <99ea2e25-a2fb-469f-9a4e-1ca2cceab77c@kernel.org>
In-Reply-To: <aZ4Lg51BVmGE5MLn@casper.infradead.org>

On 2/24/26 21:35, Matthew Wilcox wrote:
> On Tue, Feb 24, 2026 at 02:08:26PM -0500, Zi Yan wrote:
>> On 24 Feb 2026, at 14:03, Johannes Weiner wrote:
>>>
>>> I know this isn't your intention, but one interesting aspect of
>>> supporting PUD mapped folios natively is that it could open the door
>>> to simplifying hugetlb as well.
>>>
>>> We currently have all kinds of huge_vma checks scattered over the page
>>> table code, and entirely parallel paths for unmapping etc. With native
>>> PUD mappings, this could allow pushing the special casing out of the
>>> virtual memory layer and into where we deal with the page objects.
>>>
>>> You might be able to take it as far as the only thing left of hugetlb
>>> is the reservation pool. Such that a naive application does mmap() as
>>> per usual, and it comes down to a separate allocation policy how the
>>> backing pages are served (buddy, CMA, boot-time reservations, ...)
>>>
>>> Approaching it this way could help separate out the discussion on code
>>> impact and tech debt of PUD mappings, from the allocation technique
>>> question, which in itself is a fairly large topic.
>>
>> I agree with this 100%. Adding 1GB folio support first, we then can think
>> about what other THP features, e.g., split, migration, PMD/PTE mapping, are
>> really needed and add them one by one. It is also going to be a good way
>> of retiring hugetlb special code.
> 
> But this hasn't happened yet for PMD-sized hugetlb, and there's no need
> to wait for PUD-sized THP to start this process.  I don't think that
> introducing PUD-sized THP will actually motivate anyone to do this work.
> 
> I think we have four main things that hugetlb still offers:
> 
>  - Reserved pool (mentioned above) which we don't yet have a THP
>    replacement for
>  - shared page tables.  mshare() is the replacement here, and that
>    project is moving along nicely.
>  - Being able to allocate gigantic folios.  This is also progressing.
>  - Guaranteeing that you don't get a fallback; you either get memory in
>    the size you asked for, or you fail.
> 
> Every time this comes up, I offer the pagewalk code as an egregious
> example of where we force every user to know "oh, hugetlb is special".
> Getting rid of mm_walk_ops->hugetlb_entry() would be a great improvement.

I'm all for having a better way to walk folios, and Oscar did some work
on that in the past that got stalled (also partially my fault for being
too overloaded to provide feedback).

But that would mean replacing mm_walk_ops rather than changing it.

The real challenge is not PMD-mapped or PUD-mapped hugetlb folios.
These are the easy cases.

It's all about the hugetlb corner cases. We can already support
PTE batching to handle cont-pte. Cont-PMD-mapped hugetlb folios would
require similar support (rather easy).
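
For a sense of scale, the mapping sizes behind these cases can be
sketched as below. This is only an illustration under assumptions that
are not stated in this thread: 4 KiB base pages, 512 entries per page
table, and 16 entries per contiguous run (as on arm64 with the 4K
granule). All names are made up for the example.

```python
# Illustrative arithmetic only. Assumed configuration (not from the thread):
# 4 KiB base pages, 512 entries per page-table level, and 16 entries per
# contiguous run (arm64, 4K granule).
PAGE_SHIFT = 12
TABLE_SHIFT = 9          # 512 entries per table
CONT_ENTRIES = 16        # entries covered by one contiguous run

PAGE_SIZE = 1 << PAGE_SHIFT
PMD_SIZE = PAGE_SIZE << TABLE_SHIFT        # one PMD entry maps 2 MiB
PUD_SIZE = PMD_SIZE << TABLE_SHIFT         # one PUD entry maps 1 GiB
CONT_PTE_SIZE = CONT_ENTRIES * PAGE_SIZE   # cont-pte folio: 64 KiB
CONT_PMD_SIZE = CONT_ENTRIES * PMD_SIZE    # cont-PMD folio: 32 MiB


def entries_to_touch(folio_size, entry_size):
    """Number of entries at one page-table level needed to map a folio."""
    return folio_size // entry_size


def human(n):
    """Format a byte count using the largest unit that divides it evenly."""
    for unit, shift in (("GiB", 30), ("MiB", 20), ("KiB", 10)):
        if n >= (1 << shift) and n % (1 << shift) == 0:
            return f"{n >> shift} {unit}"
    return f"{n} B"


if __name__ == "__main__":
    cases = [
        ("PMD-mapped", PMD_SIZE, PMD_SIZE),
        ("PUD-mapped", PUD_SIZE, PUD_SIZE),
        ("cont-pte", CONT_PTE_SIZE, PAGE_SIZE),
        ("cont-PMD", CONT_PMD_SIZE, PMD_SIZE),
    ]
    for name, folio_size, entry_size in cases:
        n = entries_to_touch(folio_size, entry_size)
        print(f"{name}: {human(folio_size)} via {n} entries of {human(entry_size)}")
```

Under these assumptions, the PMD- and PUD-mapped cases touch a single
entry each, while cont-pte and cont-PMD each need a batch of 16 entries
handled together, which is where batching support comes in.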

But the real struggle is "fun" hugetlb things like
* PMD table sharing. I don't want ordinary page table walking code to
  mess with that monstrosity. It can mostly stay in mm/hugetlb.c where
  it can die a slow death.
* Powerpc stuff like a hugetlb folio that spans two full page tables.
  There is no way we are going to add batching support (reading+writing)
  to generic page table walking code just for that purpose.

Instead, I think we should
* Leave all the mm_walk_ops->hugetlb_entry() stuff alone.
* Enlighten shmem to provide 2M + 1G folios from an alternative
  allocator (pool) (like guest_memfd is planning)
* Allow hugetlb to serve as such a pool. Preparations happening in [1].
* Maybe enlighten shmem to be able to reserve folios in hugetlb.

Essentially: rebuild something like hugetlb for shared memory using
shmem, and reuse the existing hugetlb pool. Avoid all the nastiness
(weird architecture tweaks, odd folio sizes we don't want to support,
huge pmd sharing, etc).


[1] https://lore.kernel.org/r/cover.1770854662.git.ackerleytng@google.com

-- 
Cheers,

David
