From: Zi Yan <ziy@nvidia.com>
To: Usama Arif <usamaarif642@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	lorenzo.stoakes@oracle.com, linux-mm@kvack.org,
	hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
	kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
	baolin.wang@linux.alibaba.com, npache@redhat.com,
	Liam.Howlett@oracle.com, ryan.roberts@arm.com, vbabka@suse.cz,
	lance.yang@linux.dev, linux-kernel@vger.kernel.org,
	kernel-team@meta.com
Subject: Re: [RFC 00/12] mm: PUD (1GB) THP implementation
Date: Mon, 02 Feb 2026 11:24:19 -0500
Message-ID: <3561FD10-664D-42AA-8351-DE7D8D49D42E@nvidia.com>
In-Reply-To: <20260202005451.774496-1-usamaarif642@gmail.com>

On 1 Feb 2026, at 19:50, Usama Arif wrote:

> This is an RFC series to implement 1GB PUD-level THPs, allowing
> applications to benefit from reduced TLB pressure without requiring
> hugetlbfs. The patches are based on top of
> f9b74c13b773b7c7e4920d7bc214ea3d5f37b422 from mm-stable (6.19-rc6).

It is nice to see you are working on 1GB THP.

>
> Motivation: Why 1GB THP over hugetlbfs?
> =======================================
>
> While hugetlbfs provides 1GB huge pages today, it has significant limitations
> that make it unsuitable for many workloads:
>
> 1. Static Reservation: hugetlbfs requires pre-allocating huge pages at boot
>    or runtime, taking that memory away from the rest of the system. This
>    requires capacity planning and administrative overhead, and makes workload
>    orchestration much more complex, especially when colocating with workloads
>    that don't use hugetlbfs.

But you are using CMA, the same allocation mechanism as hugetlb_cma. What
is the difference?

>
> 4. No Fallback: If a 1GB huge page cannot be allocated, hugetlbfs fails
>    rather than falling back to smaller pages. This makes it fragile under
>    memory pressure.

True.

>
> 4. No Splitting: hugetlbfs pages cannot be split when only partial access
>    is needed, leading to memory waste and preventing partial reclaim.

Since you have a PUD THP implementation, have you run any workloads on it?
How often do you see a PUD THP split?

Oh, you actually ran 512MB THPs on ARM64 (I saw that below). Do you have
any split stats to show the necessity of THP splitting?
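(For example, the per-size split counters under sysfs, or the thp_split_pmd
numbers from /proc/vmstat on those hosts, would be very helpful here.)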

>
> 5. Memory Accounting: hugetlbfs memory is accounted separately and cannot
>    be easily shared with regular memory pools.

True.

>
> PUD THP solves these limitations by integrating 1GB pages into the existing
> THP infrastructure.

The main advantage of PUD THP over hugetlb is that it can be split and mapped
at sub-folio level. Do you have any data to support the necessity of those?
I wonder if it would be easier to just support 1GB folios in core-mm first;
1GB THP split and sub-folio mapping can be added later. With that, we can
move hugetlb users to 1GB folios.

BTW, without split support, HVO can be applied to a 1GB folio to save memory,
which PUD THP cannot do. That is a disadvantage of PUD THP. Have you taken it
into consideration? Basically, switching from hugetlb to PUD THP, you will
lose memory to vmemmap usage.
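
For concreteness (my own back-of-the-envelope numbers, assuming x86-64 with
4KB base pages and a 64-byte struct page): a 1GB folio needs 262144 struct
pages, i.e. 262144 * 64 bytes = 16MB of vmemmap per folio, about 1.6% of the
memory it maps, and hugetlb with HVO frees almost all of that.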

>
> Performance Results
> ===================
>
> Benchmark results of these patches on Intel Xeon Platinum 8321HC:
>
> Test: True Random Memory Access [1] test of 4GB memory region with pointer
> chasing workload (4M random pointer dereferences through memory):
>
> | Metric            | PUD THP (1GB) | PMD THP (2MB) | Change       |
> |-------------------|---------------|---------------|--------------|
> | Memory access     | 88 ms         | 134 ms        | 34% faster   |
> | Page fault time   | 898 ms        | 331 ms        | 2.7x slower  |
>
> Page faulting 1G pages is 2.7x slower (Allocating 1G pages is hard :)).
> For long-running workloads this will be a one-off cost, and the 34%
> improvement in access latency provides significant benefit.
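>
> For reference, the access pattern in [1] is roughly the following (a simplified
> sketch of the pointer-chasing loop, not the actual test code; the real test
> presumably uses an mmap()ed, THP-eligible region and times the faults
> separately):
>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <time.h>
>
>   #define REGION (4UL << 30)                    /* 4GB region */
>   #define NPTRS  (REGION / sizeof(void *))
>   #define NDEREF (4UL << 20)                    /* 4M dependent dereferences */
>
>   int main(void)
>   {
>           /* malloc() keeps the sketch short; the THP backing is assumed. */
>           void **slots = malloc(REGION);
>           struct timespec t0, t1;
>           size_t i;
>
>           if (!slots)
>                   return 1;
>
>           /* Build a random single-cycle permutation (Sattolo's algorithm) so
>            * every load depends on the previous one and prefetching cannot
>            * help. rand() is good enough for a sketch. */
>           for (i = 0; i < NPTRS; i++)
>                   slots[i] = &slots[i];
>           for (i = NPTRS - 1; i > 0; i--) {
>                   size_t j = rand() % i;
>                   void *tmp = slots[i];
>
>                   slots[i] = slots[j];
>                   slots[j] = tmp;
>           }
>
>           void **p = slots;
>
>           clock_gettime(CLOCK_MONOTONIC, &t0);
>           for (i = 0; i < NDEREF; i++)
>                   p = (void **)*p;              /* serialized, TLB-miss-heavy loads */
>           clock_gettime(CLOCK_MONOTONIC, &t1);
>
>           printf("access: %ld ms (%p)\n",
>                  (long)((t1.tv_sec - t0.tv_sec) * 1000 +
>                         (t1.tv_nsec - t0.tv_nsec) / 1000000), (void *)p);
>           free(slots);
>           return 0;
>   }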
>
> ARM with 64K PAGE_SIZE supports 512M PMD THPs. At Meta, we have a CPU-bound
> workload running on a large number of ARM servers (256G each). I set the 512M
> THP setting to always on 100 servers in production (didn't really have high
> expectations :)). The average memory used for the workload increased from 217G
> to 233G. The amount of memory backed by 512M pages was 68G! The dTLB misses
> went down by 26% and the PID multiplier increased input by 5.9% (this is a
> very significant improvement in workload performance). A significant number of
> these THPs were faulted in at application start and were present across
> different VMAs. Of course, getting these 512M pages is easier on ARM due to
> the bigger PAGE_SIZE and pageblock order.
>
> I am hoping that these patches for 1G THP can be used to provide similar
> benefits for x86. I expect workloads to fault them in at start time when there
> is plenty of free memory available.
>
>
> Previous attempt by Zi Yan
> ==========================
>
> Zi Yan attempted 1G THPs [2] in kernel version 5.11. There have been
> significant changes in the kernel since then, including the folio conversion,
> the mTHP framework, ptdesc, rmap changes, etc. I found it easier to use the
> current PMD code as a reference for making 1G PUD THP work. I am hoping Zi
> can provide guidance on these patches!

I am more than happy to help you. :)

>
> Major Design Decisions
> ======================
>
> 1. No shared 1G zero page: The memory cost would be quite significant!
>
> 2. Page Table Pre-deposit Strategy
>    PMD THP deposits a single PTE page table. PUD THP deposits 512 PTE
>    page tables (one for each potential PMD entry after split).
>    We allocate a PMD page table and use its pmd_huge_pte list to store
>    the deposited PTE tables. This ensures split operations don't fail due
>    to page table allocation failures (at the cost of 2M per PUD THP).
>
> 3. Split to Base Pages
>    When a PUD THP must be split (COW, partial unmap, mprotect), we split
>    directly to base pages (262,144 PTEs). The ideal thing would be to split
>    to 2M pages and then to 4K pages if needed. However, this would require
>    significant rmap and mapcount tracking changes.
>
> 4. COW and fork handling via split
>    Copy-on-write and fork for a PUD THP trigger a split to base pages, then
>    use the existing PTE-level COW infrastructure. Getting another 1G region
>    is hard and could fail, and if only a 4K page is written, copying the
>    whole 1G is a waste. Probably this should only be done on CoW and not on
>    fork?
>
> 5. Migration via split
>    Split the PUD to PTEs and migrate individual pages. It is going to be
>    difficult to find 1G of contiguous memory to migrate to. Maybe it's
>    better to not allow migration of PUDs at all? I am more tempted to not
>    allow migration, but have kept splitting in this RFC.
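
Going back to item 2 for a moment: to check that I am reading the deposit
scheme correctly, I think of it roughly like below. This is a conceptual
sketch only; pud_deposit_pgtables() and pud_huge_pte_deposit() are made-up
names for illustration, not helpers from this series.

  /*
   * Conceptual sketch, not the actual patch code: pre-allocate one PTE
   * table per future PMD entry and park them all on the carrier PMD
   * table, so that a later split never has to allocate memory.
   */
  static int pud_deposit_pgtables(struct mm_struct *mm, struct ptdesc *pmd_table)
  {
          int i;

          for (i = 0; i < PTRS_PER_PMD; i++) {
                  pgtable_t pte = pte_alloc_one(mm);

                  if (!pte)
                          return -ENOMEM; /* caller frees what was already deposited */
                  /* hypothetical helper: chain the PTE table onto the carrier
                   * PMD table's pmd_huge_pte list */
                  pud_huge_pte_deposit(mm, pmd_table, pte);
          }
          return 0;
  }

If that reading is right, the 2M of PTE tables plus the one extra PMD table is
a fixed overhead for every mapped PUD THP, whether or not it ever splits.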

On migration: without it, PUD THP loses its flexibility and transparency. But
given the 1GB size, I also wonder what the purpose of PUD THP migration would
be. It does not create memory fragmentation, since it is the largest and fully
contiguous folio size we have. NUMA balancing a 1GB THP seems like too much
work.

BTW, I posted many questions, but that does not mean I object to the patchset.
I just want to understand your use case better, reduce unnecessary code
changes, and hopefully get it upstreamed this time. :)

Thank you for the work.

>
>
> Reviewers guide
> ===============
>
> Most of the code is written by adapting the existing PMD code. For example,
> the PUD page fault path is very similar to the PMD one; the differences are
> the lack of a shared zero page and the page table deposit strategy. I think
> the easiest way to review this series is to compare it with the PMD code.
>
> Test results
> ============
>
>   1..7
>   # Starting 7 tests from 1 test cases.
>   #  RUN           pud_thp.basic_allocation ...
>   # pud_thp_test.c:169:basic_allocation:PUD THP allocated (anon_fault_alloc: 0 -> 1)
>   #            OK  pud_thp.basic_allocation
>   ok 1 pud_thp.basic_allocation
>   #  RUN           pud_thp.read_write_access ...
>   #            OK  pud_thp.read_write_access
>   ok 2 pud_thp.read_write_access
>   #  RUN           pud_thp.fork_cow ...
>   # pud_thp_test.c:236:fork_cow:Fork COW completed (thp_split_pud: 0 -> 1)
>   #            OK  pud_thp.fork_cow
>   ok 3 pud_thp.fork_cow
>   #  RUN           pud_thp.partial_munmap ...
>   # pud_thp_test.c:267:partial_munmap:Partial munmap completed (thp_split_pud: 1 -> 2)
>   #            OK  pud_thp.partial_munmap
>   ok 4 pud_thp.partial_munmap
>   #  RUN           pud_thp.mprotect_split ...
>   # pud_thp_test.c:293:mprotect_split:mprotect split completed (thp_split_pud: 2 -> 3)
>   #            OK  pud_thp.mprotect_split
>   ok 5 pud_thp.mprotect_split
>   #  RUN           pud_thp.reclaim_pageout ...
>   # pud_thp_test.c:322:reclaim_pageout:Reclaim completed (thp_split_pud: 3 -> 4)
>   #            OK  pud_thp.reclaim_pageout
>   ok 6 pud_thp.reclaim_pageout
>   #  RUN           pud_thp.migration_mbind ...
>   # pud_thp_test.c:356:migration_mbind:Migration completed (thp_split_pud: 4 -> 5)
>   #            OK  pud_thp.migration_mbind
>   ok 7 pud_thp.migration_mbind
>   # PASSED: 7 / 7 tests passed.
>   # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> [1] https://gist.github.com/uarif1/bf279b2a01a536cda945ff9f40196a26
> [2] https://lore.kernel.org/linux-mm/20210224223536.803765-1-zi.yan@sent.com/
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>
> Usama Arif (12):
>   mm: add PUD THP ptdesc and rmap support
>   mm/thp: add mTHP stats infrastructure for PUD THP
>   mm: thp: add PUD THP allocation and fault handling
>   mm: thp: implement PUD THP split to PTE level
>   mm: thp: add reclaim and migration support for PUD THP
>   selftests/mm: add PUD THP basic allocation test
>   selftests/mm: add PUD THP read/write access test
>   selftests/mm: add PUD THP fork COW test
>   selftests/mm: add PUD THP partial munmap test
>   selftests/mm: add PUD THP mprotect split test
>   selftests/mm: add PUD THP reclaim test
>   selftests/mm: add PUD THP migration test
>
>  include/linux/huge_mm.h                   |  60 ++-
>  include/linux/mm.h                        |  19 +
>  include/linux/mm_types.h                  |   5 +-
>  include/linux/pgtable.h                   |   8 +
>  include/linux/rmap.h                      |   7 +-
>  mm/huge_memory.c                          | 535 +++++++++++++++++++++-
>  mm/internal.h                             |   3 +
>  mm/memory.c                               |   8 +-
>  mm/migrate.c                              |  17 +
>  mm/page_vma_mapped.c                      |  35 ++
>  mm/pgtable-generic.c                      |  83 ++++
>  mm/rmap.c                                 |  96 +++-
>  mm/vmscan.c                               |   2 +
>  tools/testing/selftests/mm/Makefile       |   1 +
>  tools/testing/selftests/mm/pud_thp_test.c | 360 +++++++++++++++
>  15 files changed, 1197 insertions(+), 42 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/pud_thp_test.c
>
> -- 
> 2.47.3


Best Regards,
Yan, Zi

