From: Usama Arif <usama.arif@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, lorenzo.stoakes@oracle.com,
willy@infradead.org, linux-mm@kvack.org
Cc: fvdl@google.com, hannes@cmpxchg.org, riel@surriel.com,
shakeel.butt@linux.dev, kas@kernel.org, baohua@kernel.org,
dev.jain@arm.com, baolin.wang@linux.alibaba.com,
npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com,
Vlastimil Babka <vbabka@kernel.org>,
lance.yang@linux.dev, linux-kernel@vger.kernel.org,
kernel-team@meta.com, maddy@linux.ibm.com, mpe@ellerman.id.au,
linuxppc-dev@lists.ozlabs.org, hca@linux.ibm.com,
gor@linux.ibm.com, agordeev@linux.ibm.com,
borntraeger@linux.ibm.com, svens@linux.ibm.com,
linux-s390@vger.kernel.org, Usama Arif <usama.arif@linux.dev>
Subject: [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split
Date: Thu, 26 Feb 2026 03:23:29 -0800
Message-ID: <20260226113233.3987674-1-usama.arif@linux.dev>
When the kernel creates a PMD-level THP mapping for anonymous pages, it
pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
page table sits unused in a deposit list for the lifetime of the THP
mapping, only to be withdrawn when the PMD is split or zapped. Every
anonymous THP therefore wastes 4KB of memory unconditionally. On large
servers where hundreds of gigabytes of memory are mapped as THPs, this
adds up: roughly 200MB wasted per 100GB of THP memory. This memory
could otherwise satisfy other allocations, including the very PTE page
table allocations needed when splits eventually occur.
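
The 200MB figure follows from simple arithmetic. A minimal sketch,
assuming 2MB PMD-sized THPs and 4KB PTE page tables (as on x86-64 with
4KB base pages; other configurations differ). The helper and macro names
are made up for illustration and are not part of the series:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes (x86-64, 4KB base pages); not stated in the series. */
#define ASSUMED_PMD_SIZE     (2ULL << 20)  /* one PMD-mapped THP: 2MB */
#define ASSUMED_PGTABLE_SIZE (4ULL << 10)  /* one deposited PTE table: 4KB */

/* Hypothetical helper: bytes held in deposited PTE page tables. */
static uint64_t deposited_pgtable_bytes(uint64_t thp_bytes)
{
	/* Only whole PMD-sized THPs carry a deposited page table. */
	assert(thp_bytes % ASSUMED_PMD_SIZE == 0);
	return (thp_bytes / ASSUMED_PMD_SIZE) * ASSUMED_PGTABLE_SIZE;
}
```

For 100GB of THP-mapped memory this is 51200 THPs, each holding one 4KB
page table, i.e. 200MB or roughly 0.2% of the mapped size.
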
This series removes the pre-deposit and allocates the PTE page table
lazily — only when a PMD split actually happens. Since a large number
of THPs are never split (they are zapped wholesale when processes exit or
munmap the full range), the allocation is avoided entirely in the common
case.
The pre-deposit pattern exists because split_huge_pmd was designed as an
operation that must never fail: if the kernel decides to split, it needs
a PTE page table, so one is deposited in advance. But "must never fail"
is an unnecessarily strong requirement. A PMD split is typically triggered
by a partial operation on a sub-PMD range — partial munmap, partial
mprotect, partial mremap and so on.
Most of these operations already have well-defined error handling for
allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
fail and propagating the error through these existing paths is the natural
thing to do. Furthermore, split failing requires an order-0 allocation for
a page table to fail, which is extremely unlikely.
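
To illustrate the shape of this change, here is a userspace toy model,
not kernel code. All toy_* names are hypothetical; the allocator is
passed in only so that the failure path can be exercised. It shows a
split that allocates its page table lazily and a partial operation that
forwards -ENOMEM through its normal error return:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a PMD entry; not a kernel type. */
struct toy_pmd {
	bool huge;        /* still mapped as a single PMD-level entry */
	void *pte_table;  /* allocated only once a split happens */
};

/* Lazy variant: the PTE table is allocated at split time and may fail. */
static int toy_split_huge_pmd(struct toy_pmd *pmd, void *(*alloc)(size_t))
{
	assert(pmd != NULL);
	if (!pmd->huge)
		return 0;               /* nothing to split */
	pmd->pte_table = alloc(4096);
	if (!pmd->pte_table)
		return -ENOMEM;         /* PMD left intact for the caller */
	pmd->huge = false;
	return 0;
}

/* A partial operation propagates the failure via its existing error path. */
static int toy_partial_munmap(struct toy_pmd *pmd, void *(*alloc)(size_t))
{
	int err = toy_split_huge_pmd(pmd, alloc);

	if (err)
		return err;
	/* ... operate on the sub-PMD range through the PTE table ... */
	return 0;
}

/* Allocator that always fails, to simulate memory pressure. */
static void *toy_failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

In this model a failed split leaves the mapping huge and untouched, so
the caller can return the error to userspace or retry later.
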
Designing functions like split_huge_pmd as operations that cannot fail
has a subtle but real cost to code quality. It forces a pre-allocation
pattern - every THP creation path must deposit a page table, and every
split or zap path must withdraw one, creating a hidden coupling between
widely separated code paths.
This also serves as a code cleanup. On every architecture except powerpc
with hash MMU, the deposit/withdraw machinery becomes dead code. The
series removes the generic implementations in pgtable-generic.c and the
s390/sparc overrides, replacing them with no-op stubs guarded by
arch_needs_pgtable_deposit(), which evaluates to false at compile time
on all non-powerpc architectures.
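
The stub pattern can be sketched in userspace as follows. This is an
assumption about the general shape only; the cover letter does not show
the actual stubs, and every toy_* name is hypothetical. A constant-false
predicate lets the compiler discard the deposit path entirely:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of a generic default; an arch needing deposits would override it. */
#ifndef toy_arch_needs_pgtable_deposit
#define toy_arch_needs_pgtable_deposit() (false)
#endif

/* Toy stand-in for per-mm accounting; not a kernel type. */
struct toy_mm {
	int nr_deposited;
};

static inline void toy_pgtable_deposit(struct toy_mm *mm)
{
	assert(mm != NULL);
	if (!toy_arch_needs_pgtable_deposit())
		return;  /* constant false: the code below is dead */
	mm->nr_deposited++;
}
```

Call sites stay unconditional, while only an architecture that overrides
the predicate pays for the deposit.
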
The series is structured as follows:

  Patches 1-2:   Error infrastructure - make split functions return int
                 and propagate errors from vma_adjust_trans_huge()
                 through __split_vma, vma_shrink, and commit_merge.

  Patches 3-12:  Handle split failure at every call site - copy_huge_pmd,
                 do_huge_pmd_wp_page, zap_pmd_range, wp_huge_pmd,
                 change_pmd_range (mprotect), follow_pmd_mask (GUP),
                 walk_pmd_range (pagewalk), move_page_tables (mremap),
                 move_pages (userfaultfd), and device migration.
                 The code will become effective in Patch 14 when split
                 functions start returning -ENOMEM.

  Patch 13:      Add __must_check to __split_huge_pmd(), split_huge_pmd()
                 and split_huge_pmd_address() so the compiler warns on
                 unchecked return values.

  Patch 14:      The actual change - allocate PTE page tables lazily at
                 split time instead of pre-depositing at THP creation.
                 This is when split functions will actually start returning
                 -ENOMEM.

  Patch 15:      Remove the now-dead deposit/withdraw code on
                 non-powerpc architectures.

  Patch 16:      Add THP_SPLIT_PMD_FAILED vmstat counter for monitoring
                 split failures.

  Patches 17-21: Selftests covering partial munmap, mprotect, mlock,
                 mremap, and MADV_DONTNEED on THPs to exercise the
                 split paths.

The error handling patches are placed before the lazy allocation patch so
that every call site is already prepared to handle split failures before
the failure mode is introduced. This makes each patch independently safe
to apply and bisect through.
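
The effect of the __must_check annotation can be sketched in userspace.
In the kernel, __must_check expands to the warn_unused_result attribute;
the toy_* names below are hypothetical stand-ins, and the example
assumes GCC or Clang:

```c
#include <errno.h>

/* Userspace stand-in for the kernel's __must_check (GCC/Clang attribute). */
#define toy_must_check __attribute__((__warn_unused_result__))

/* Toy split function: reports failure instead of being infallible. */
static toy_must_check int toy_split(int simulate_failure)
{
	return simulate_failure ? -ENOMEM : 0;
}

static int toy_caller(int simulate_failure)
{
	/*
	 * Writing just "toy_split(simulate_failure);" here would make the
	 * compiler emit a -Wunused-result warning.
	 */
	int err = toy_split(simulate_failure);

	if (err)
		return err;
	return 0;
}
```

With the annotation in place, a call site that still ignores the result
is flagged at build time.
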
The patches were tested with CONFIG_DEBUG_ATOMIC_SLEEP and CONFIG_DEBUG_VM
enabled. The test results are below:
TAP version 13
1..5
# Starting 5 tests from 1 test cases.
# RUN thp_pmd_split.partial_munmap ...
# thp_pmd_split_test.c:60:partial_munmap:thp_split_pmd: 0 -> 1
# thp_pmd_split_test.c:62:partial_munmap:thp_split_pmd_failed: 0 -> 0
# OK thp_pmd_split.partial_munmap
ok 1 thp_pmd_split.partial_munmap
# RUN thp_pmd_split.partial_mprotect ...
# thp_pmd_split_test.c:60:partial_mprotect:thp_split_pmd: 1 -> 2
# thp_pmd_split_test.c:62:partial_mprotect:thp_split_pmd_failed: 0 -> 0
# OK thp_pmd_split.partial_mprotect
ok 2 thp_pmd_split.partial_mprotect
# RUN thp_pmd_split.partial_mlock ...
# thp_pmd_split_test.c:60:partial_mlock:thp_split_pmd: 2 -> 3
# thp_pmd_split_test.c:62:partial_mlock:thp_split_pmd_failed: 0 -> 0
# OK thp_pmd_split.partial_mlock
ok 3 thp_pmd_split.partial_mlock
# RUN thp_pmd_split.partial_mremap ...
# thp_pmd_split_test.c:60:partial_mremap:thp_split_pmd: 3 -> 4
# thp_pmd_split_test.c:62:partial_mremap:thp_split_pmd_failed: 0 -> 0
# OK thp_pmd_split.partial_mremap
ok 4 thp_pmd_split.partial_mremap
# RUN thp_pmd_split.partial_madv_dontneed ...
# thp_pmd_split_test.c:60:partial_madv_dontneed:thp_split_pmd: 4 -> 5
# thp_pmd_split_test.c:62:partial_madv_dontneed:thp_split_pmd_failed: 0 -> 0
# OK thp_pmd_split.partial_madv_dontneed
ok 5 thp_pmd_split.partial_madv_dontneed
# PASSED: 5 / 5 tests passed.
# Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0
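
The "0 -> 1" lines above are before/after samples of vmstat counters. A
minimal sketch of how such a counter could be read, assuming the
"name value" line format of /proc/vmstat; this is not the selftest's
actual code, and the function name is hypothetical:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the value of counter `name` from a /proc/vmstat-style stream
 * (one "name value" pair per line), or -1 if the counter is absent.
 */
static long long read_vmstat_counter(FILE *fp, const char *name)
{
	char key[128];
	long long val;

	assert(fp != NULL && name != NULL);
	rewind(fp);  /* allow repeated lookups on the same stream */
	while (fscanf(fp, "%127s %lld", key, &val) == 2) {
		if (strcmp(key, name) == 0)
			return val;
	}
	return -1;
}
```

A test would sample thp_split_pmd and thp_split_pmd_failed before and
after the partial operation and compare the deltas. A return of -1 lets
the caller tell a missing counter apart from a counter that reads zero.
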
The patches are based on commit 957a3fab8811b455420128ea5f41c51fd23eb6c7
from mm-unstable as of 25 Feb (7.0.0-rc1).
RFC v1 -> v2: https://lore.kernel.org/all/20260211125507.4175026-1-usama.arif@linux.dev/
- Change counter name to THP_SPLIT_PMD_FAILED (David)
- Remove pgtable_trans_huge_{deposit/withdraw} when not needed and
  make them arch specific (David)
- Make split functions return an error code and have callers handle it
  (David and Kiryl)
- Add test cases for splitting
Usama Arif (21):
  mm: thp: make split_huge_pmd functions return int for error
    propagation
  mm: thp: propagate split failure from vma_adjust_trans_huge()
  mm: thp: handle split failure in copy_huge_pmd()
  mm: thp: handle split failure in do_huge_pmd_wp_page()
  mm: thp: handle split failure in zap_pmd_range()
  mm: thp: handle split failure in wp_huge_pmd()
  mm: thp: retry on split failure in change_pmd_range()
  mm: thp: handle split failure in follow_pmd_mask()
  mm: handle walk_page_range() failure from THP split
  mm: thp: handle split failure in mremap move_page_tables()
  mm: thp: handle split failure in userfaultfd move_pages()
  mm: thp: handle split failure in device migration
  mm: huge_mm: Make sure all split_huge_pmd calls are checked
  mm: thp: allocate PTE page tables lazily at split time
  mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed
  mm: thp: add THP_SPLIT_PMD_FAILED counter
  selftests/mm: add THP PMD split test infrastructure
  selftests/mm: add partial_mprotect test for change_pmd_range
  selftests/mm: add partial_mlock test
  selftests/mm: add partial_mremap test for move_page_tables
  selftests/mm: add madv_dontneed_partial test

arch/powerpc/include/asm/book3s/64/pgtable.h | 12 +-
arch/s390/include/asm/pgtable.h | 6 -
arch/s390/mm/pgtable.c | 41 ---
arch/sparc/include/asm/pgtable_64.h | 6 -
arch/sparc/mm/tlb.c | 36 ---
include/linux/huge_mm.h | 51 +--
include/linux/pgtable.h | 16 +-
include/linux/vm_event_item.h | 1 +
mm/debug_vm_pgtable.c | 4 +-
mm/gup.c | 10 +-
mm/huge_memory.c | 208 +++++++++----
mm/khugepaged.c | 7 +-
mm/memory.c | 26 +-
mm/migrate_device.c | 33 +-
mm/mprotect.c | 11 +-
mm/mremap.c | 8 +-
mm/pagewalk.c | 8 +-
mm/pgtable-generic.c | 32 --
mm/rmap.c | 42 ++-
mm/userfaultfd.c | 8 +-
mm/vma.c | 37 ++-
mm/vmstat.c | 1 +
tools/testing/selftests/mm/Makefile | 1 +
.../testing/selftests/mm/thp_pmd_split_test.c | 290 ++++++++++++++++++
tools/testing/vma/include/stubs.h | 9 +-
25 files changed, 645 insertions(+), 259 deletions(-)
create mode 100644 tools/testing/selftests/mm/thp_pmd_split_test.c
--
2.47.3