linux-mm.kvack.org archive mirror
* [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split
@ 2026-02-26 11:23 Usama Arif
  2026-02-26 11:23 ` [RFC v2 01/21] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
                   ` (21 more replies)
  0 siblings, 22 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

When the kernel creates a PMD-level THP mapping for anonymous pages, it
pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
page table sits unused in a deposit list for the lifetime of the THP
mapping, only to be withdrawn when the PMD is split or zapped. Every
anonymous THP therefore wastes 4KB of memory unconditionally. On large
servers where hundreds of gigabytes of memory are mapped as THPs, this
adds up: roughly 200MB wasted per 100GB of THP memory. This memory
could otherwise satisfy other allocations, including the very PTE page
table allocations needed when splits eventually occur.

This series removes the pre-deposit and allocates the PTE page table
lazily — only when a PMD split actually happens. Since a large number
of THPs are never split (they are zapped wholesale when processes exit or
munmap the full range), the allocation is avoided entirely in the common
case.

The pre-deposit pattern exists because split_huge_pmd was designed as an
operation that must never fail: if the kernel decides to split, it needs
a PTE page table, so one is deposited in advance. But "must never fail"
is an unnecessarily strong requirement. A PMD split is typically triggered
by a partial operation on a sub-PMD range — partial munmap, partial
mprotect, partial mremap and so on.
Most of these operations already have well-defined error handling for
allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
fail and propagating the error through these existing paths is the natural
thing to do. Furthermore, split failing requires an order-0 allocation for
a page table to fail, which is extremely unlikely.

Designing functions like split_huge_pmd as operations that cannot fail
has a subtle but real cost to code quality. It forces a pre-allocation
pattern: every THP creation path must deposit a page table, and every
split or zap path must withdraw one, creating a hidden coupling between
widely separated code paths.

This also serves as a code cleanup. On every architecture except powerpc
with hash MMU, the deposit/withdraw machinery becomes dead code. The
series removes the generic implementations in pgtable-generic.c and the
s390/sparc overrides, replacing them with no-op stubs guarded by
arch_needs_pgtable_deposit(), which evaluates to false at compile time
on all non-powerpc architectures.

The series is structured as follows:

Patches 1-2:    Error infrastructure — make split functions return int
                and propagate errors from vma_adjust_trans_huge()
                through __split_vma, vma_shrink, and commit_merge.

Patches 3-12:   Handle split failure at every call site — copy_huge_pmd,
                do_huge_pmd_wp_page, zap_pmd_range, wp_huge_pmd,
                change_pmd_range (mprotect), follow_pmd_mask (GUP),
                walk_pmd_range (pagewalk), move_page_tables (mremap),
                move_pages (userfaultfd), and device migration.
                These changes become effective in patch 14, when split
                functions start returning -ENOMEM.

Patch 13:       Add __must_check to __split_huge_pmd(), split_huge_pmd()
                and split_huge_pmd_address() so the compiler warns on
                unchecked return values.

Patch 14:       The actual change — allocate PTE page tables lazily at
                split time instead of pre-depositing at THP creation.
                This is when split functions will actually start returning
                -ENOMEM.

Patch 15:       Remove the now-dead deposit/withdraw code on
                non-powerpc architectures.

Patch 16:       Add THP_SPLIT_PMD_FAILED vmstat counter for monitoring
                split failures.

Patches 17-21:  Selftests covering partial munmap, mprotect, mlock,
                mremap, and MADV_DONTNEED on THPs to exercise the
                split paths.

The error handling patches are placed before the lazy allocation patch so
that every call site is already prepared to handle split failures before
the failure mode is introduced. This makes each patch independently safe
to apply and bisect through.

The patches were tested with CONFIG_DEBUG_ATOMIC_SLEEP and CONFIG_DEBUG_VM
enabled. The test results are below:

TAP version 13
1..5
# Starting 5 tests from 1 test cases.
#  RUN           thp_pmd_split.partial_munmap ...
# thp_pmd_split_test.c:60:partial_munmap:thp_split_pmd: 0 -> 1
# thp_pmd_split_test.c:62:partial_munmap:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_munmap
ok 1 thp_pmd_split.partial_munmap
#  RUN           thp_pmd_split.partial_mprotect ...
# thp_pmd_split_test.c:60:partial_mprotect:thp_split_pmd: 1 -> 2
# thp_pmd_split_test.c:62:partial_mprotect:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_mprotect
ok 2 thp_pmd_split.partial_mprotect
#  RUN           thp_pmd_split.partial_mlock ...
# thp_pmd_split_test.c:60:partial_mlock:thp_split_pmd: 2 -> 3
# thp_pmd_split_test.c:62:partial_mlock:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_mlock
ok 3 thp_pmd_split.partial_mlock
#  RUN           thp_pmd_split.partial_mremap ...
# thp_pmd_split_test.c:60:partial_mremap:thp_split_pmd: 3 -> 4
# thp_pmd_split_test.c:62:partial_mremap:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_mremap
ok 4 thp_pmd_split.partial_mremap
#  RUN           thp_pmd_split.partial_madv_dontneed ...
# thp_pmd_split_test.c:60:partial_madv_dontneed:thp_split_pmd: 4 -> 5
# thp_pmd_split_test.c:62:partial_madv_dontneed:thp_split_pmd_failed: 0 -> 0
#            OK  thp_pmd_split.partial_madv_dontneed
ok 5 thp_pmd_split.partial_madv_dontneed
# PASSED: 5 / 5 tests passed.
# Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0

The patches are based on commit 957a3fab8811b455420128ea5f41c51fd23eb6c7
of mm-unstable as of 25 Feb (7.0.0-rc1).


RFC v1 -> v2: https://lore.kernel.org/all/20260211125507.4175026-1-usama.arif@linux.dev/
- Change counter name to THP_SPLIT_PMD_FAILED (David)
- remove pgtable_trans_huge_{deposit/withdraw} when not needed and
  make them arch specific (David)
- make split functions return error code and have callers handle them
  (David and Kiryl)
- Add test cases for splitting

Usama Arif (21):
  mm: thp: make split_huge_pmd functions return int for error
    propagation
  mm: thp: propagate split failure from vma_adjust_trans_huge()
  mm: thp: handle split failure in copy_huge_pmd()
  mm: thp: handle split failure in do_huge_pmd_wp_page()
  mm: thp: handle split failure in zap_pmd_range()
  mm: thp: handle split failure in wp_huge_pmd()
  mm: thp: retry on split failure in change_pmd_range()
  mm: thp: handle split failure in follow_pmd_mask()
  mm: handle walk_page_range() failure from THP split
  mm: thp: handle split failure in mremap move_page_tables()
  mm: thp: handle split failure in userfaultfd move_pages()
  mm: thp: handle split failure in device migration
  mm: huge_mm: Make sure all split_huge_pmd calls are checked
  mm: thp: allocate PTE page tables lazily at split time
  mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed
  mm: thp: add THP_SPLIT_PMD_FAILED counter
  selftests/mm: add THP PMD split test infrastructure
  selftests/mm: add partial_mprotect test for change_pmd_range
  selftests/mm: add partial_mlock test
  selftests/mm: add partial_mremap test for move_page_tables
  selftests/mm: add madv_dontneed_partial test

 arch/powerpc/include/asm/book3s/64/pgtable.h  |  12 +-
 arch/s390/include/asm/pgtable.h               |   6 -
 arch/s390/mm/pgtable.c                        |  41 ---
 arch/sparc/include/asm/pgtable_64.h           |   6 -
 arch/sparc/mm/tlb.c                           |  36 ---
 include/linux/huge_mm.h                       |  51 +--
 include/linux/pgtable.h                       |  16 +-
 include/linux/vm_event_item.h                 |   1 +
 mm/debug_vm_pgtable.c                         |   4 +-
 mm/gup.c                                      |  10 +-
 mm/huge_memory.c                              | 208 +++++++++----
 mm/khugepaged.c                               |   7 +-
 mm/memory.c                                   |  26 +-
 mm/migrate_device.c                           |  33 +-
 mm/mprotect.c                                 |  11 +-
 mm/mremap.c                                   |   8 +-
 mm/pagewalk.c                                 |   8 +-
 mm/pgtable-generic.c                          |  32 --
 mm/rmap.c                                     |  42 ++-
 mm/userfaultfd.c                              |   8 +-
 mm/vma.c                                      |  37 ++-
 mm/vmstat.c                                   |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 .../testing/selftests/mm/thp_pmd_split_test.c | 290 ++++++++++++++++++
 tools/testing/vma/include/stubs.h             |   9 +-
 25 files changed, 645 insertions(+), 259 deletions(-)
 create mode 100644 tools/testing/selftests/mm/thp_pmd_split_test.c

-- 
2.47.3



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC v2 01/21] mm: thp: make split_huge_pmd functions return int for error propagation
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 02/21] mm: thp: propagate split failure from vma_adjust_trans_huge() Usama Arif
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Currently a PMD split cannot fail.  Future patches in this series add
lazy PTE page table allocation at THP split time, after which
__split_huge_pmd() calls pte_alloc_one(), which can fail if an order-0
allocation cannot be satisfied.
The split functions currently return void, so callers have no way to
detect this failure.  The PMD would remain huge, but callers would
assume the split succeeded and proceed on that basis; interpreting a
huge PMD entry as a page table pointer could result in a kernel bug.

Change __split_huge_pmd(), split_huge_pmd(), split_huge_pmd_if_needed()
and split_huge_pmd_address() to return 0 on success (-ENOMEM on
allocation failure in later patch).  Convert the split_huge_pmd macro
to a static inline function that propagates the return value. The return
values will be handled by the callers in future commits.

The CONFIG_TRANSPARENT_HUGEPAGE=n stubs are changed to return 0.

No behaviour change is expected with this patch.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h | 34 ++++++++++++++++++----------------
 mm/huge_memory.c        | 16 ++++++++++------
 2 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfdea..e4cbf5afdbe7e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -419,7 +419,7 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped);
 void reparent_deferred_split_queue(struct mem_cgroup *memcg);
 #endif
 
-void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze);
 
 /**
@@ -448,15 +448,15 @@ static inline bool pmd_is_huge(pmd_t pmd)
 	return false;
 }
 
-#define split_huge_pmd(__vma, __pmd, __address)				\
-	do {								\
-		pmd_t *____pmd = (__pmd);				\
-		if (pmd_is_huge(*____pmd))				\
-			__split_huge_pmd(__vma, __pmd, __address,	\
-					 false);			\
-	}  while (0)
+static inline int split_huge_pmd(struct vm_area_struct *vma,
+					     pmd_t *pmd, unsigned long address)
+{
+	if (pmd_is_huge(*pmd))
+		return __split_huge_pmd(vma, pmd, address, false);
+	return 0;
+}
 
-void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
+int split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze);
 
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
@@ -651,13 +651,15 @@ static inline int try_folio_split_to_order(struct folio *folio,
 
 static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
 static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
-#define split_huge_pmd(__vma, __pmd, __address)	\
-	do { } while (0)
-
-static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long address, bool freeze) {}
-static inline void split_huge_pmd_address(struct vm_area_struct *vma,
-		unsigned long address, bool freeze) {}
+static inline int split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+				unsigned long address)
+{
+	return 0;
+}
+static inline int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long address, bool freeze) { return 0; }
+static inline int split_huge_pmd_address(struct vm_area_struct *vma,
+		unsigned long address, bool freeze) { return 0; }
 static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long address, pmd_t *pmd,
 					 bool freeze) {}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8003d3a498220..125ff36f475de 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3273,7 +3273,7 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 		__split_huge_pmd_locked(vma, pmd, address, freeze);
 }
 
-void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze)
 {
 	spinlock_t *ptl;
@@ -3287,20 +3287,22 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	split_huge_pmd_locked(vma, range.start, pmd, freeze);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
+
+	return 0;
 }
 
-void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
+int split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze)
 {
 	pmd_t *pmd = mm_find_pmd(vma->vm_mm, address);
 
 	if (!pmd)
-		return;
+		return 0;
 
-	__split_huge_pmd(vma, pmd, address, freeze);
+	return __split_huge_pmd(vma, pmd, address, freeze);
 }
 
-static inline void split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned long address)
+static inline int split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned long address)
 {
 	/*
 	 * If the new address isn't hpage aligned and it could previously
@@ -3309,7 +3311,9 @@ static inline void split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned
 	if (!IS_ALIGNED(address, HPAGE_PMD_SIZE) &&
 	    range_in_vma(vma, ALIGN_DOWN(address, HPAGE_PMD_SIZE),
 			 ALIGN(address, HPAGE_PMD_SIZE)))
-		split_huge_pmd_address(vma, address, false);
+		return split_huge_pmd_address(vma, address, false);
+
+	return 0;
 }
 
 void vma_adjust_trans_huge(struct vm_area_struct *vma,
-- 
2.47.3




* [RFC v2 02/21] mm: thp: propagate split failure from vma_adjust_trans_huge()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
  2026-02-26 11:23 ` [RFC v2 01/21] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 03/21] mm: thp: handle split failure in copy_huge_pmd() Usama Arif
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

With lazy PTE page table allocation, split_huge_pmd_if_needed() and
thus vma_adjust_trans_huge() can now fail if the order-0 page table
allocation fails during the split.  This failure must be checked to
avoid leaving a huge PMD straddling a VMA boundary.

The vma_adjust_trans_huge() call is moved before vma_prepare() in all
three callers (__split_vma, vma_shrink, commit_merge). Previously it sat
between vma_prepare() and vma_complete(), where there is no mechanism to
abort - once vma_prepare() has been called, we must reach vma_complete().
By moving the call earlier, a split failure can return -ENOMEM cleanly
without needing to undo VMA preparation.

This move is safe because vma_adjust_trans_huge() acquires its own
pmd_lock() internally and does not depend on any locks or state changes
from vma_prepare(). The VMA boundaries are also unchanged at the new
call site, satisfying __split_huge_pmd_locked()'s requirement that the
VMA covers the full PMD range.

All 3 callers (__split_vma, vma_shrink, commit_merge) already return
-ENOMEM if there are allocation failures for other reasons (failure in
vma_iter_prealloc for example), this follows the same pattern.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h           | 13 ++++++-----
 mm/huge_memory.c                  | 21 +++++++++++++-----
 mm/vma.c                          | 37 +++++++++++++++++++++----------
 tools/testing/vma/include/stubs.h |  9 ++++----
 4 files changed, 53 insertions(+), 27 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e4cbf5afdbe7e..207bf7cd95c78 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -484,8 +484,8 @@ int hugepage_madvise(struct vm_area_struct *vma, vm_flags_t *vm_flags,
 		     int advice);
 int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 		     unsigned long end, bool *lock_dropped);
-void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
-			   unsigned long end, struct vm_area_struct *next);
+int vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
+			  unsigned long end, struct vm_area_struct *next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
 spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma);
 
@@ -687,11 +687,12 @@ static inline int madvise_collapse(struct vm_area_struct *vma,
 	return -EINVAL;
 }
 
-static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
-					 unsigned long start,
-					 unsigned long end,
-					 struct vm_area_struct *next)
+static inline int vma_adjust_trans_huge(struct vm_area_struct *vma,
+					unsigned long start,
+					unsigned long end,
+					struct vm_area_struct *next)
 {
+	return 0;
 }
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 125ff36f475de..a979aa5bd2995 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3316,20 +3316,31 @@ static inline int split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned
 	return 0;
 }
 
-void vma_adjust_trans_huge(struct vm_area_struct *vma,
+int vma_adjust_trans_huge(struct vm_area_struct *vma,
 			   unsigned long start,
 			   unsigned long end,
 			   struct vm_area_struct *next)
 {
+	int err;
+
 	/* Check if we need to split start first. */
-	split_huge_pmd_if_needed(vma, start);
+	err = split_huge_pmd_if_needed(vma, start);
+	if (err)
+		return err;
 
 	/* Check if we need to split end next. */
-	split_huge_pmd_if_needed(vma, end);
+	err = split_huge_pmd_if_needed(vma, end);
+	if (err)
+		return err;
 
 	/* If we're incrementing next->vm_start, we might need to split it. */
-	if (next)
-		split_huge_pmd_if_needed(next, end);
+	if (next) {
+		err = split_huge_pmd_if_needed(next, end);
+		if (err)
+			return err;
+	}
+
+	return 0;
 }
 
 static void unmap_folio(struct folio *folio)
diff --git a/mm/vma.c b/mm/vma.c
index be64f781a3aa7..f50b1f291ab7c 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -510,6 +510,15 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 			return err;
 	}
 
+	/*
+	 * Split any THP straddling the split boundary before splitting
+	 * the VMA itself. Do this before vma_prepare() so we can
+	 * cleanly fail without undoing VMA preparation.
+	 */
+	err = vma_adjust_trans_huge(vma, vma->vm_start, addr, NULL);
+	if (err)
+		return err;
+
 	new = vm_area_dup(vma);
 	if (!new)
 		return -ENOMEM;
@@ -547,11 +556,6 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vp.insert = new;
 	vma_prepare(&vp);
 
-	/*
-	 * Get rid of huge pages and shared page tables straddling the split
-	 * boundary.
-	 */
-	vma_adjust_trans_huge(vma, vma->vm_start, addr, NULL);
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_split(vma, addr);
 
@@ -729,6 +733,7 @@ static int commit_merge(struct vma_merge_struct *vmg)
 {
 	struct vm_area_struct *vma;
 	struct vma_prepare vp;
+	int err;
 
 	if (vmg->__adjust_next_start) {
 		/* We manipulate middle and adjust next, which is the target. */
@@ -740,6 +745,16 @@ static int commit_merge(struct vma_merge_struct *vmg)
 		vma_iter_config(vmg->vmi, vmg->start, vmg->end);
 	}
 
+	/*
+	 * THP pages may need to do additional splits if we increase
+	 * middle->vm_start. Do this before vma_prepare() so we can
+	 * cleanly fail without undoing VMA preparation.
+	 */
+	err = vma_adjust_trans_huge(vma, vmg->start, vmg->end,
+				  vmg->__adjust_middle_start ? vmg->middle : NULL);
+	if (err)
+		return err;
+
 	init_multi_vma_prep(&vp, vma, vmg);
 
 	/*
@@ -752,12 +767,6 @@ static int commit_merge(struct vma_merge_struct *vmg)
 		return -ENOMEM;
 
 	vma_prepare(&vp);
-	/*
-	 * THP pages may need to do additional splits if we increase
-	 * middle->vm_start.
-	 */
-	vma_adjust_trans_huge(vma, vmg->start, vmg->end,
-			      vmg->__adjust_middle_start ? vmg->middle : NULL);
 	vma_set_range(vma, vmg->start, vmg->end, vmg->pgoff);
 	vmg_adjust_set_range(vmg);
 	vma_iter_store_overwrite(vmg->vmi, vmg->target);
@@ -1229,9 +1238,14 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	       unsigned long start, unsigned long end, pgoff_t pgoff)
 {
 	struct vma_prepare vp;
+	int err;
 
 	WARN_ON((vma->vm_start != start) && (vma->vm_end != end));
 
+	err = vma_adjust_trans_huge(vma, start, end, NULL);
+	if (err)
+		return err;
+
 	if (vma->vm_start < start)
 		vma_iter_config(vmi, vma->vm_start, start);
 	else
@@ -1244,7 +1258,6 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 
 	init_vma_prep(&vp, vma);
 	vma_prepare(&vp);
-	vma_adjust_trans_huge(vma, start, end, NULL);
 
 	vma_iter_clear(vmi);
 	vma_set_range(vma, start, end, pgoff);
diff --git a/tools/testing/vma/include/stubs.h b/tools/testing/vma/include/stubs.h
index 947a3a0c25665..171986f9c9fcd 100644
--- a/tools/testing/vma/include/stubs.h
+++ b/tools/testing/vma/include/stubs.h
@@ -418,11 +418,12 @@ static inline int vma_dup_policy(struct vm_area_struct *src, struct vm_area_stru
 	return 0;
 }
 
-static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
-					 unsigned long start,
-					 unsigned long end,
-					 struct vm_area_struct *next)
+static inline int vma_adjust_trans_huge(struct vm_area_struct *vma,
+					unsigned long start,
+					unsigned long end,
+					struct vm_area_struct *next)
 {
+	return 0;
 }
 
 static inline void hugetlb_split(struct vm_area_struct *, unsigned long) {}
-- 
2.47.3




* [RFC v2 03/21] mm: thp: handle split failure in copy_huge_pmd()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
  2026-02-26 11:23 ` [RFC v2 01/21] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
  2026-02-26 11:23 ` [RFC v2 02/21] mm: thp: propagate split failure from vma_adjust_trans_huge() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 04/21] mm: thp: handle split failure in do_huge_pmd_wp_page() Usama Arif
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

copy_huge_pmd() splits the source PMD when a folio is pinned and can't
be COW-shared at PMD granularity.  It then returns -EAGAIN so
copy_pmd_range() falls through to copy_pte_range().

If the split fails, the PMD is still huge.  Returning -EAGAIN would cause
copy_pmd_range() to call copy_pte_range(), which would dereference the
huge PMD entry as if it were a pointer to a PTE page table.
Return -ENOMEM on split failure instead (copy_huge_pmd() already does
so when pte_alloc_one() fails).  This causes copy_page_range() to abort
the fork with -ENOMEM, just as copy_pmd_range() is aborted when
pmd_alloc() or copy_pte_range() fails.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/huge_memory.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a979aa5bd2995..d9fb5875fa59e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1929,7 +1929,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
-		__split_huge_pmd(src_vma, src_pmd, addr, false);
+		/*
+		 * If split fails, the PMD is still huge so copy_pte_range
+		 * (via -EAGAIN) would misinterpret it as a page table
+		 * pointer.  Return -ENOMEM directly to copy_pmd_range.
+		 */
+		if (__split_huge_pmd(src_vma, src_pmd, addr, false))
+			return -ENOMEM;
 		return -EAGAIN;
 	}
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-- 
2.47.3




* [RFC v2 04/21] mm: thp: handle split failure in do_huge_pmd_wp_page()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (2 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 03/21] mm: thp: handle split failure in copy_huge_pmd() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 05/21] mm: thp: handle split failure in zap_pmd_range() Usama Arif
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

do_huge_pmd_wp_page() splits the PMD when a COW of the entire huge page
fails (e.g., can't allocate a new folio or the folio is pinned).  It then
returns VM_FAULT_FALLBACK so the fault can be retried at PTE granularity.

If the split fails, the PMD is still huge.  Returning VM_FAULT_FALLBACK
would re-enter the PTE fault path, which expects a PTE page table at the
PMD entry — not a huge PMD.

Return VM_FAULT_OOM on split failure, which signals the fault handler to
invoke the OOM killer or return -ENOMEM to userspace.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/huge_memory.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9fb5875fa59e..e82b8435a0b7f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2153,7 +2153,13 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	folio_unlock(folio);
 	spin_unlock(vmf->ptl);
 fallback:
-	__split_huge_pmd(vma, vmf->pmd, vmf->address, false);
+	/*
+	 * Split failure means the PMD is still huge; returning
+	 * VM_FAULT_FALLBACK would re-enter the PTE path with a
+	 * huge PMD, causing incorrect behavior.
+	 */
+	if (__split_huge_pmd(vma, vmf->pmd, vmf->address, false))
+		return VM_FAULT_OOM;
 	return VM_FAULT_FALLBACK;
 }
 
-- 
2.47.3




* [RFC v2 05/21] mm: thp: handle split failure in zap_pmd_range()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (3 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 04/21] mm: thp: handle split failure in do_huge_pmd_wp_page() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 06/21] mm: thp: handle split failure in wp_huge_pmd() Usama Arif
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

zap_pmd_range() splits a huge PMD when the zap range doesn't cover the
full PMD (partial unmap).  If the split fails, the PMD stays huge.
Falling through to zap_pte_range() would dereference the huge PMD entry
as a PTE page table pointer.

Skip the range covered by the PMD on split failure instead.

The skip is safe across all call paths into zap_pmd_range():

- exit_mmap() and OOM reaper: the zap range covers entire VMAs, so
  every PMD is fully covered (next - addr == HPAGE_PMD_SIZE).  The
  zap_huge_pmd() branch handles these without splitting.  The split
  failure path is unreachable.

- munmap / mmap overlay: vma_adjust_trans_huge() (called from
  __split_vma) splits any PMD straddling the VMA boundary before the
  VMA is split.  If that PMD split fails, __split_vma() returns
  -ENOMEM and the munmap is aborted before reaching zap_pmd_range().
  The split failure path is unreachable.

- MADV_DONTNEED: advisory hint, the kernel is allowed to ignore it.
  The pages remain valid and accessible.  A subsequent access returns
  existing data without faulting.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/memory.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 9385842c35034..7ba1221c63792 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1983,9 +1983,18 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	do {
 		next = pmd_addr_end(addr, end);
 		if (pmd_is_huge(*pmd)) {
-			if (next - addr != HPAGE_PMD_SIZE)
-				__split_huge_pmd(vma, pmd, addr, false);
-			else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
+			if (next - addr != HPAGE_PMD_SIZE) {
+				/*
+				 * If split fails, the PMD stays huge.
+				 * Skip the range to avoid falling through
+				 * to zap_pte_range, which would treat the
+				 * huge PMD entry as a page table pointer.
+				 */
+				if (__split_huge_pmd(vma, pmd, addr, false)) {
+					addr = next;
+					continue;
+				}
+			} else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
 				addr = next;
 				continue;
 			}
-- 
2.47.3



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC v2 06/21] mm: thp: handle split failure in wp_huge_pmd()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (4 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 05/21] mm: thp: handle split failure in zap_pmd_range() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 07/21] mm: thp: retry on split failure in change_pmd_range() Usama Arif
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

wp_huge_pmd() splits the PMD when COW or write-notify must be handled at
PTE level (e.g., shared/file VMAs, userfaultfd).  It then returns
VM_FAULT_FALLBACK so the fault handler retries at PTE granularity.
If the split fails, the PMD is still huge.  The PTE fault path cannot
handle a huge PMD entry.
Return VM_FAULT_OOM on split failure, which signals the fault handler to
invoke the OOM killer or return -ENOMEM to userspace.  This is similar
to what __handle_mm_fault() does when p4d_alloc() or pud_alloc() fails.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/memory.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7ba1221c63792..51d2717e3f1b4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6161,8 +6161,13 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 	}
 
 split:
-	/* COW or write-notify handled on pte level: split pmd. */
-	__split_huge_pmd(vma, vmf->pmd, vmf->address, false);
+	/*
+	 * COW or write-notify handled on pte level: split pmd.
+	 * If split fails, the PMD is still huge so falling back
+	 * to PTE handling would be incorrect.
+	 */
+	if (__split_huge_pmd(vma, vmf->pmd, vmf->address, false))
+		return VM_FAULT_OOM;
 
 	return VM_FAULT_FALLBACK;
 }
-- 
2.47.3




* [RFC v2 07/21] mm: thp: retry on split failure in change_pmd_range()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (5 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 06/21] mm: thp: handle split failure in wp_huge_pmd() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 08/21] mm: thp: handle split failure in follow_pmd_mask() Usama Arif
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

change_pmd_range() splits a huge PMD when mprotect() targets a sub-PMD
range or when VMA flags require per-PTE protection bits that can't be
represented at PMD granularity.

If pte_alloc_one() fails inside __split_huge_pmd(), the huge PMD remains
intact. Without this change, change_pte_range() would return -EAGAIN
because pte_offset_map_lock() returns NULL for a huge PMD, sending the
code back to the 'again' label to retry the split without ever calling
cond_resched().

Now that __split_huge_pmd() returns an error code, handle it explicitly:
yield the CPU with cond_resched() and retry via goto again, giving other
tasks a chance to free memory.

Returning an error all the way to change_protection_range() would not
work: it would leave part of the range with the new protections and the
rest unchanged, with no easy way to roll back the already-modified
entries (and previous splits). __split_huge_pmd() only requires an
order-0 allocation, which is extremely unlikely to fail.
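
The retry-with-yield shape can be sketched in isolation (userspace
model; try_split, fake_cond_resched and fake_change_pmd_range are
hypothetical stand-ins, not the real kernel code):

```c
/* Hypothetical allocator that fails a few times before succeeding. */
static int failures_left = 3;
static int yields;

static int try_split(void)
{
	return failures_left-- > 0 ? -12 /* -ENOMEM */ : 0;
}

static void fake_cond_resched(void)
{
	yields++;	/* in the kernel: let other tasks run, free memory */
}

/*
 * Mirrors the patched change_pmd_range() flow: on split failure, yield
 * and retry from the 'again' label rather than returning an error the
 * caller could not roll back.
 */
static int fake_change_pmd_range(void)
{
again:
	if (try_split()) {
		fake_cond_resched();
		goto again;
	}
	return 0;	/* proceed to change_pte_range() */
}
```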

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/mprotect.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 9681f055b9fca..599d80a7d6969 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -477,7 +477,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		if (pmd_is_huge(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
 			    pgtable_split_needed(vma, cp_flags)) {
-				__split_huge_pmd(vma, pmd, addr, false);
+				ret = __split_huge_pmd(vma, pmd, addr, false);
+				if (ret) {
+					/*
+					 * Yield and retry. Other tasks
+					 * may free memory while we
+					 * reschedule.
+					 */
+					cond_resched();
+					goto again;
+				}
 				/*
 				 * For file-backed, the pmd could have been
 				 * cleared; make sure pmd populated if
-- 
2.47.3





* [RFC v2 08/21] mm: thp: handle split failure in follow_pmd_mask()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (6 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 07/21] mm: thp: retry on split failure in change_pmd_range() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 09/21] mm: handle walk_page_range() failure from THP split Usama Arif
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

follow_pmd_mask() splits a huge PMD when FOLL_SPLIT_PMD is set, so GUP
can pin individual pages at PTE granularity.

If the split fails, the PMD is still huge and follow_page_pte() cannot
process it. Return ERR_PTR(-ENOMEM) on split failure, which causes the
GUP caller to get -ENOMEM. follow_pmd_mask() already returns -ENOMEM
when pte_alloc() fails (the same allocation that makes split_huge_pmd()
fail), so this is a safe change.
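
The error travels through the returned pointer via the kernel's
ERR_PTR() idiom from include/linux/err.h; a userspace re-implementation
for illustration:

```c
#define MAX_ERRNO 4095

/*
 * Re-implementation of the kernel's ERR_PTR()/PTR_ERR()/IS_ERR()
 * helpers: small negative errno values are encoded in the last page of
 * the address space, which no valid kernel pointer ever occupies.
 */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
```

This is why follow_pmd_mask() can hand -ENOMEM back through a
`struct page *` return type without a separate error out-parameter.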

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/gup.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 8e7dc2c6ee738..792b2e7319dd0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -928,8 +928,16 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		return follow_page_pte(vma, address, pmd, flags);
 	}
 	if (pmd_trans_huge(pmdval) && (flags & FOLL_SPLIT_PMD)) {
+		int ret;
+
 		spin_unlock(ptl);
-		split_huge_pmd(vma, pmd, address);
+		/*
+		 * If split fails, the PMD is still huge and
+		 * we cannot proceed to follow_page_pte.
+		 */
+		ret = split_huge_pmd(vma, pmd, address);
+		if (ret)
+			return ERR_PTR(ret);
 		/* If pmd was left empty, stuff a page table in there quickly */
 		return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
 			follow_page_pte(vma, address, pmd, flags);
-- 
2.47.3




* [RFC v2 09/21] mm: handle walk_page_range() failure from THP split
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (7 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 08/21] mm: thp: handle split failure in follow_pmd_mask() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 10/21] mm: thp: handle split failure in mremap move_page_tables() Usama Arif
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

walk_pmd_range() splits a huge PMD when a page table walker with
pte_entry or install_pte callbacks needs PTE-level granularity. If
the split fails due to memory allocation failure in pte_alloc_one(),
walk_pte_range() would encounter a huge PMD instead of a PTE page
table.

Break out of the loop on split failure and return -ENOMEM to the
walker's caller. Callers that reach this path (those with pte_entry
or install_pte set), such as mincore, hmm_range_fault and
queue_pages_range, already handle negative return values from
walk_page_range(). The same approach is taken when __pte_alloc()
fails in walk_pmd_range().

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/pagewalk.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index a94c401ab2cfe..1ee9df7a4461d 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -147,9 +147,11 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 				continue;
 		}
 
-		if (walk->vma)
-			split_huge_pmd(walk->vma, pmd, addr);
-		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
+		if (walk->vma) {
+			err = split_huge_pmd(walk->vma, pmd, addr);
+			if (err)
+				break;
+		} else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
 			continue; /* Nothing to do. */
 
 		err = walk_pte_range(pmd, addr, next, walk);
-- 
2.47.3




* [RFC v2 10/21] mm: thp: handle split failure in mremap move_page_tables()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (8 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 09/21] mm: handle walk_page_range() failure from THP split Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 11/21] mm: thp: handle split failure in userfaultfd move_pages() Usama Arif
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

move_page_tables() splits a huge PMD when the extent is smaller than
HPAGE_PMD_SIZE and the PMD can't be moved at PMD granularity.

If the split fails, the PMD stays huge and move_ptes() can't operate on
individual PTEs.

Break out of the loop on split failure, which causes mremap() to return
however much was moved so far (partial move).  This is consistent with
other allocation failures in the same loop (e.g., alloc_new_pmd(),
pte_alloc()).

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/mremap.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 2be876a70cc0d..d067c9fbf140b 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -855,7 +855,13 @@ unsigned long move_page_tables(struct pagetable_move_control *pmc)
 			if (extent == HPAGE_PMD_SIZE &&
 			    move_pgt_entry(pmc, HPAGE_PMD, old_pmd, new_pmd))
 				continue;
-			split_huge_pmd(pmc->old, old_pmd, pmc->old_addr);
+			/*
+			 * If split fails, the PMD stays huge and move_ptes
+			 * can't operate on it.  Break out so the caller
+			 * can handle the partial move.
+			 */
+			if (split_huge_pmd(pmc->old, old_pmd, pmc->old_addr))
+				break;
 		} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) &&
 			   extent == PMD_SIZE) {
 			/*
-- 
2.47.3




* [RFC v2 11/21] mm: thp: handle split failure in userfaultfd move_pages()
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (9 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 10/21] mm: thp: handle split failure in mremap move_page_tables() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 12/21] mm: thp: handle split failure in device migration Usama Arif
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

The UFFDIO_MOVE ioctl's move_pages() loop splits a huge PMD when the
folio is pinned and can't be moved at PMD granularity.

If the split fails, the PMD stays huge and move_pages_pte() can't
process individual pages. Break out of the loop on split failure
and return -ENOMEM to the caller. This is similar to how other
allocation failures (__pte_alloc, mm_alloc_pmd) are handled in
move_pages().

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/userfaultfd.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e19872e518785..2728102e00c72 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1870,7 +1870,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 				}
 
 				spin_unlock(ptl);
-				split_huge_pmd(src_vma, src_pmd, src_addr);
+				/*
+				 * If split fails, the PMD stays huge and
+				 * move_pages_pte can't process it.
+				 */
+				err = split_huge_pmd(src_vma, src_pmd, src_addr);
+				if (err)
+					break;
 				/* The folio will be split by move_pages_pte() */
 				continue;
 			}
-- 
2.47.3




* [RFC v2 12/21] mm: thp: handle split failure in device migration
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (10 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 11/21] mm: thp: handle split failure in userfaultfd move_pages() Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Device memory migration has two call sites that split huge PMDs:

migrate_vma_split_unmapped_folio():
  Called from migrate_vma_pages() when migrating a PMD-mapped THP to a
  destination that doesn't support compound pages.  It splits the PMD
  then splits the folio via folio_split_unmapped().

  If the PMD split fails, folio_split_unmapped() would operate on an
  unsplit folio with inconsistent page table state.  Propagate -ENOMEM
  to skip this page's migration.  This is safe, as a
  folio_split_unmapped() failure is propagated the same way.

migrate_vma_insert_page():
  Called from migrate_vma_pages() when inserting a page into a VMA
  during migration back from device memory.  If a huge zero PMD exists
  at the target address, it must be split before PTE insertion.

  If the split fails, the subsequent pte_alloc() and set_pte_at() would
  operate on a PMD slot still occupied by the huge zero entry.  Use
  goto abort, consistent with other allocation failures in this function.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/migrate_device.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 78c7acf024615..bc53e06fd9735 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -909,7 +909,13 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
 	int ret = 0;
 
 	folio_get(folio);
-	split_huge_pmd_address(migrate->vma, addr, true);
+	/*
+	 * If PMD split fails, folio_split_unmapped would operate on an
+	 * unsplit folio with inconsistent page table state.
+	 */
+	ret = split_huge_pmd_address(migrate->vma, addr, true);
+	if (ret)
+		return ret;
 	ret = folio_split_unmapped(folio, 0);
 	if (ret)
 		return ret;
@@ -1005,7 +1011,13 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		if (pmd_trans_huge(*pmdp)) {
 			if (!is_huge_zero_pmd(*pmdp))
 				goto abort;
-			split_huge_pmd(vma, pmdp, addr);
+			/*
+			 * If split fails, the huge zero PMD remains and
+			 * pte_alloc/PTE insertion that follows would be
+			 * incorrect.
+			 */
+			if (split_huge_pmd(vma, pmdp, addr))
+				goto abort;
 		} else if (pmd_leaf(*pmdp))
 			goto abort;
 	}
-- 
2.47.3




* [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (11 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 12/21] mm: thp: handle split failure in device migration Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 14/21] mm: thp: allocate PTE page tables lazily at split time Usama Arif
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Mark __split_huge_pmd(), split_huge_pmd() and split_huge_pmd_address()
with __must_check so the compiler warns if any caller ignores the return
value. Not checking return value and operating on the basis that the pmd
is split could result in a kernel bug. The possibility of an order-0
allocation failing for page table allocation is very low, but it should
be handled correctly.
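
On GCC and Clang, __must_check expands to the warn_unused_result
attribute, so an unchecked call produces a compile-time warning; a
minimal sketch (fake_split_huge_pmd and caller are illustrative names):

```c
/* include/linux/compiler_types.h maps __must_check to this attribute. */
#define __must_check __attribute__((__warn_unused_result__))

static __must_check int fake_split_huge_pmd(int have_memory)
{
	return have_memory ? 0 : -12;	/* -ENOMEM */
}

static int caller(int have_memory)
{
	/* Checked: no -Wunused-result warning here. */
	int ret = fake_split_huge_pmd(have_memory);

	return ret;
	/* A bare `fake_split_huge_pmd(1);` statement would warn. */
}
```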

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 207bf7cd95c78..b4c2fd4252097 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -419,7 +419,7 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped);
 void reparent_deferred_split_queue(struct mem_cgroup *memcg);
 #endif
 
-int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+int __must_check __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze);
 
 /**
@@ -448,7 +448,7 @@ static inline bool pmd_is_huge(pmd_t pmd)
 	return false;
 }
 
-static inline int split_huge_pmd(struct vm_area_struct *vma,
+static inline int __must_check split_huge_pmd(struct vm_area_struct *vma,
 					     pmd_t *pmd, unsigned long address)
 {
 	if (pmd_is_huge(*pmd))
@@ -456,7 +456,7 @@ static inline int split_huge_pmd(struct vm_area_struct *vma,
 	return 0;
 }
 
-int split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
+int __must_check split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze);
 
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
-- 
2.47.3




* [RFC v2 14/21] mm: thp: allocate PTE page tables lazily at split time
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (12 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 15/21] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed Usama Arif
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

When the kernel creates a PMD-level THP mapping for anonymous pages,
it pre-allocates a PTE page table and deposits it via
pgtable_trans_huge_deposit(). This deposited table is withdrawn during
PMD split or zap. The rationale was that split must not fail: if the
kernel decides to split a THP, it needs a PTE table to populate.

However, every anon THP wastes 4KB (one page table page) that sits
unused in the deposit list for the lifetime of the mapping. On large
servers that can easily map hundreds of gigabytes of THPs, these tables
cost roughly 200MB per 100GB of THP memory. That memory could serve any
other purpose, including the page table allocations required during
split. The original rationale is also not a real constraint: it is fine
for a split to fail, and if the kernel cannot satisfy an order-0
allocation for the split, it has much bigger problems.

This patch removes the pre-deposit for anonymous pages on architectures
where arch_needs_pgtable_deposit() returns false (every architecture
apart from powerpc with the hash MMU) and allocates the PTE table
lazily, only when a split actually occurs. The split path is modified
to accept a caller-provided page table.

PowerPC exception:

It would have been great to remove the page table deposit code entirely,
making this commit mostly a code cleanup. Unfortunately powerpc's hash
MMU stores hash slot information in the deposited page table, so the
pre-deposit remains necessary there. All deposit/withdraw paths are
guarded by arch_needs_pgtable_deposit(), so powerpc behavior is
unchanged by this patch. On the bright side, arch_needs_pgtable_deposit()
always evaluates to false at compile time on non-powerpc architectures,
so the pre-deposit code is not compiled in there.
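
A sketch of why the guarded code vanishes on non-powerpc builds: with a
constant-false inline predicate, the deposit branch is dead code the
compiler eliminates (fake names, userspace model, not the real kernel
functions):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * On non-powerpc, arch_needs_pgtable_deposit() is a compile-time
 * constant false, so the branch below is dead code and disappears
 * from the generated object entirely.
 */
static inline bool fake_arch_needs_pgtable_deposit(void)
{
	return false;
}

static int deposits;	/* counts pre-deposited page tables */

static void *fake_map_anon_thp(void)
{
	void *pgtable = NULL;

	if (fake_arch_needs_pgtable_deposit()) {
		pgtable = &deposits;	/* stand-in for pte_alloc_one() */
		deposits++;
	}
	/* ... install the huge PMD; no 4KB table was consumed ... */
	return pgtable;
}
```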

Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h |   4 +-
 mm/huge_memory.c        | 144 ++++++++++++++++++++++++++++------------
 mm/khugepaged.c         |   7 +-
 mm/migrate_device.c     |  15 +++--
 mm/rmap.c               |  39 ++++++++++-
 5 files changed, 156 insertions(+), 53 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b4c2fd4252097..ed4c97734b335 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
 }
 
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
-			   pmd_t *pmd, bool freeze);
+			   pmd_t *pmd, bool freeze, pgtable_t pgtable);
 bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
 			   pmd_t *pmdp, struct folio *folio);
 void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
@@ -662,7 +662,7 @@ static inline int split_huge_pmd_address(struct vm_area_struct *vma,
 		unsigned long address, bool freeze) { return 0; }
 static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long address, pmd_t *pmd,
-					 bool freeze) {}
+					 bool freeze, pgtable_t pgtable) {}
 
 static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long addr, pmd_t *pmdp,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e82b8435a0b7f..a10cb136000d1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1325,17 +1325,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
-	pgtable_t pgtable;
+	pgtable_t pgtable = NULL;
 	vm_fault_t ret = 0;
 
 	folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
 	if (unlikely(!folio))
 		return VM_FAULT_FALLBACK;
 
-	pgtable = pte_alloc_one(vma->vm_mm);
-	if (unlikely(!pgtable)) {
-		ret = VM_FAULT_OOM;
-		goto release;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (unlikely(!pgtable)) {
+			ret = VM_FAULT_OOM;
+			goto release;
+		}
 	}
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -1350,14 +1352,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		if (userfaultfd_missing(vma)) {
 			spin_unlock(vmf->ptl);
 			folio_put(folio);
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 			ret = handle_userfault(vmf, VM_UFFD_MISSING);
 			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			return ret;
 		}
-		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+		if (pgtable) {
+			pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
+						   pgtable);
+			mm_inc_nr_ptes(vma->vm_mm);
+		}
 		map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
-		mm_inc_nr_ptes(vma->vm_mm);
 		spin_unlock(vmf->ptl);
 	}
 
@@ -1453,9 +1459,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
 	pmd_t entry;
 	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
 	entry = pmd_mkspecial(entry);
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	if (pgtable) {
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		mm_inc_nr_ptes(mm);
+	}
 	set_pmd_at(mm, haddr, pmd, entry);
-	mm_inc_nr_ptes(mm);
 }
 
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
@@ -1474,16 +1482,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm) &&
 			transparent_hugepage_use_zero_page()) {
-		pgtable_t pgtable;
+		pgtable_t pgtable = NULL;
 		struct folio *zero_folio;
 		vm_fault_t ret;
 
-		pgtable = pte_alloc_one(vma->vm_mm);
-		if (unlikely(!pgtable))
-			return VM_FAULT_OOM;
+		if (arch_needs_pgtable_deposit()) {
+			pgtable = pte_alloc_one(vma->vm_mm);
+			if (unlikely(!pgtable))
+				return VM_FAULT_OOM;
+		}
 		zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
 		if (unlikely(!zero_folio)) {
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 			count_vm_event(THP_FAULT_FALLBACK);
 			return VM_FAULT_FALLBACK;
 		}
@@ -1493,10 +1504,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			ret = check_stable_address_space(vma->vm_mm);
 			if (ret) {
 				spin_unlock(vmf->ptl);
-				pte_free(vma->vm_mm, pgtable);
+				if (pgtable)
+					pte_free(vma->vm_mm, pgtable);
 			} else if (userfaultfd_missing(vma)) {
 				spin_unlock(vmf->ptl);
-				pte_free(vma->vm_mm, pgtable);
+				if (pgtable)
+					pte_free(vma->vm_mm, pgtable);
 				ret = handle_userfault(vmf, VM_UFFD_MISSING);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			} else {
@@ -1507,7 +1520,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			}
 		} else {
 			spin_unlock(vmf->ptl);
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 		}
 		return ret;
 	}
@@ -1839,8 +1853,10 @@ static void copy_huge_non_present_pmd(
 	}
 
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-	mm_inc_nr_ptes(dst_mm);
-	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (pgtable) {
+		mm_inc_nr_ptes(dst_mm);
+		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	}
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_swp_clear_uffd_wp(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1880,9 +1896,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!vma_is_anonymous(dst_vma))
 		return 0;
 
-	pgtable = pte_alloc_one(dst_mm);
-	if (unlikely(!pgtable))
-		goto out;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(dst_mm);
+		if (unlikely(!pgtable))
+			goto out;
+	}
 
 	dst_ptl = pmd_lock(dst_mm, dst_pmd);
 	src_ptl = pmd_lockptr(src_mm, src_pmd);
@@ -1900,7 +1918,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 	if (unlikely(!pmd_trans_huge(pmd))) {
-		pte_free(dst_mm, pgtable);
+		if (pgtable)
+			pte_free(dst_mm, pgtable);
 		goto out_unlock;
 	}
 	/*
@@ -1926,7 +1945,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
 		/* Page maybe pinned: split and retry the fault on PTEs. */
 		folio_put(src_folio);
-		pte_free(dst_mm, pgtable);
+		if (pgtable)
+			pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
 		/*
@@ -1940,8 +1960,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 out_zero_page:
-	mm_inc_nr_ptes(dst_mm);
-	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (pgtable) {
+		mm_inc_nr_ptes(dst_mm);
+		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	}
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_clear_uffd_wp(pmd);
@@ -2379,7 +2401,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
 	} else if (is_huge_zero_pmd(orig_pmd)) {
-		if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
+		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
 	} else {
@@ -2404,7 +2426,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		}
 
 		if (folio_test_anon(folio)) {
-			zap_deposited_table(tlb->mm, pmd);
+			if (arch_needs_pgtable_deposit())
+				zap_deposited_table(tlb->mm, pmd);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 		} else {
 			if (arch_needs_pgtable_deposit())
@@ -2505,7 +2528,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			force_flush = true;
 		VM_BUG_ON(!pmd_none(*new_pmd));
 
-		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
+		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
+		    arch_needs_pgtable_deposit()) {
 			pgtable_t pgtable;
 			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
 			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
@@ -2813,8 +2837,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 	}
 	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
 
-	src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
-	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+	if (arch_needs_pgtable_deposit()) {
+		src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+		pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+	}
 unlock_ptls:
 	double_pt_unlock(src_ptl, dst_ptl);
 	/* unblock rmap walks */
@@ -2956,10 +2982,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
-		unsigned long haddr, pmd_t *pmd)
+		unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	pgtable_t pgtable;
 	pmd_t _pmd, old_pmd;
 	unsigned long addr;
 	pte_t *pte;
@@ -2975,7 +3000,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	 */
 	old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
 
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	} else {
+		VM_BUG_ON(!pgtable);
+		/*
+		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
+		 * being used in mm.
+		 */
+		mm_inc_nr_ptes(mm);
+	}
 	pmd_populate(mm, &_pmd, pgtable);
 
 	pte = pte_offset_map(&_pmd, haddr);
@@ -2997,12 +3031,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 }
 
 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long haddr, bool freeze)
+		unsigned long haddr, bool freeze, pgtable_t pgtable)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct folio *folio;
 	struct page *page;
-	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
 	bool soft_dirty, uffd_wp = false, young = false, write = false;
 	bool anon_exclusive = false, dirty = false;
@@ -3026,6 +3059,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 */
 		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(mm, pmd);
+		if (pgtable)
+			pte_free(mm, pgtable);
 		if (!vma_is_dax(vma) && vma_is_special_huge(vma))
 			return;
 		if (unlikely(pmd_is_migration_entry(old_pmd))) {
@@ -3058,7 +3093,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * small page also write protected so it does not seems useful
 		 * to invalidate secondary mmu at this time.
 		 */
-		return __split_huge_zero_page_pmd(vma, haddr, pmd);
+		return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
 	}
 
 	if (pmd_is_migration_entry(*pmd)) {
@@ -3182,7 +3217,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * Withdraw the table only after we mark the pmd entry invalid.
 	 * This's critical for some architectures (Power).
 	 */
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	} else {
+		VM_BUG_ON(!pgtable);
+		/*
+		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
+		 * being used in mm.
+		 */
+		mm_inc_nr_ptes(mm);
+	}
 	pmd_populate(mm, &_pmd, pgtable);
 
 	pte = pte_offset_map(&_pmd, haddr);
@@ -3278,11 +3322,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 }
 
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
-			   pmd_t *pmd, bool freeze)
+			   pmd_t *pmd, bool freeze, pgtable_t pgtable)
 {
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
 	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
-		__split_huge_pmd_locked(vma, pmd, address, freeze);
+		__split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
+	else if (pgtable)
+		pte_free(vma->vm_mm, pgtable);
 }
 
 int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -3290,13 +3336,24 @@ int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	spinlock_t *ptl;
 	struct mmu_notifier_range range;
+	pgtable_t pgtable = NULL;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
 				address & HPAGE_PMD_MASK,
 				(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
+
+	/* allocate pagetable before acquiring pmd lock */
+	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (!pgtable) {
+			mmu_notifier_invalidate_range_end(&range);
+			return -ENOMEM;
+		}
+	}
+
 	ptl = pmd_lock(vma->vm_mm, pmd);
-	split_huge_pmd_locked(vma, range.start, pmd, freeze);
+	split_huge_pmd_locked(vma, range.start, pmd, freeze, pgtable);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
 
@@ -3432,7 +3489,8 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
 	}
 
 	folio_remove_rmap_pmd(folio, pmd_page(orig_pmd), vma);
-	zap_deposited_table(mm, pmdp);
+	if (arch_needs_pgtable_deposit())
+		zap_deposited_table(mm, pmdp);
 	add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 	if (vma->vm_flags & VM_LOCKED)
 		mlock_drain_local();
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c85d7381adb5f..735d7ee5bbab2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1224,7 +1224,12 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	} else {
+		mm_dec_nr_ptes(mm);
+		pte_free(mm, pgtable);
+	}
 	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
 	spin_unlock(pmd_ptl);
 
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index bc53e06fd9735..1adb5abccfb70 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -823,9 +823,13 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 
 	__folio_mark_uptodate(folio);
 
-	pgtable = pte_alloc_one(vma->vm_mm);
-	if (unlikely(!pgtable))
-		goto abort;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (unlikely(!pgtable))
+			goto abort;
+	} else {
+		pgtable = NULL;
+	}
 
 	if (folio_is_device_private(folio)) {
 		swp_entry_t swp_entry;
@@ -873,10 +877,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 	folio_get(folio);
 
 	if (flush) {
-		pte_free(vma->vm_mm, pgtable);
+		if (pgtable)
+			pte_free(vma->vm_mm, pgtable);
 		flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
 		pmdp_invalidate(vma, addr, pmdp);
-	} else {
+	} else if (pgtable) {
 		pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
 		mm_inc_nr_ptes(vma->vm_mm);
 	}
diff --git a/mm/rmap.c b/mm/rmap.c
index bff8f222004e4..2519d579bc1d8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -76,6 +76,7 @@
 #include <linux/mm_inline.h>
 #include <linux/oom.h>
 
+#include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
 #define CREATE_TRACE_POINTS
@@ -1975,6 +1976,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	unsigned long pfn;
 	unsigned long hsz = 0;
 	int ptes = 0;
+	pgtable_t prealloc_pte = NULL;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -2009,6 +2011,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	}
 	mmu_notifier_invalidate_range_start(&range);
 
+	if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
+	    !arch_needs_pgtable_deposit())
+		prealloc_pte = pte_alloc_one(mm);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		/*
 		 * If the folio is in an mlock()d vma, we must not swap it out.
@@ -2058,12 +2064,21 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			}
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
+				pgtable_t pgtable = prealloc_pte;
+
+				prealloc_pte = NULL;
+				if (!arch_needs_pgtable_deposit() && !pgtable &&
+				    vma_is_anonymous(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
 				/*
 				 * We temporarily have to drop the PTL and
 				 * restart so we can process the PTE-mapped THP.
 				 */
 				split_huge_pmd_locked(vma, pvmw.address,
-						      pvmw.pmd, false);
+						      pvmw.pmd, false, pgtable);
 				flags &= ~TTU_SPLIT_HUGE_PMD;
 				page_vma_mapped_walk_restart(&pvmw);
 				continue;
@@ -2343,6 +2358,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		break;
 	}
 
+	if (prealloc_pte)
+		pte_free(mm, prealloc_pte);
+
 	mmu_notifier_invalidate_range_end(&range);
 
 	return ret;
@@ -2402,6 +2420,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	unsigned long pfn;
 	unsigned long hsz = 0;
+	pgtable_t prealloc_pte = NULL;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -2436,6 +2455,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	}
 	mmu_notifier_invalidate_range_start(&range);
 
+	if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
+	    !arch_needs_pgtable_deposit())
+		prealloc_pte = pte_alloc_one(mm);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		/* PMD-mapped THP migration entry */
 		if (!pvmw.pte) {
@@ -2443,8 +2466,17 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			__maybe_unused pmd_t pmdval;
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
+				pgtable_t pgtable = prealloc_pte;
+
+				prealloc_pte = NULL;
+				if (!arch_needs_pgtable_deposit() && !pgtable &&
+				    vma_is_anonymous(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
 				split_huge_pmd_locked(vma, pvmw.address,
-						      pvmw.pmd, true);
+						      pvmw.pmd, true, pgtable);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
 				break;
@@ -2695,6 +2727,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		folio_put(folio);
 	}
 
+	if (prealloc_pte)
+		pte_free(mm, prealloc_pte);
+
 	mmu_notifier_invalidate_range_end(&range);
 
 	return ret;
-- 
2.47.3



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC v2 15/21] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (13 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 14/21] mm: thp: allocate PTE page tables lazily at split time Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

The previous commit made deposit/withdraw necessary only on
architectures where arch_needs_pgtable_deposit() returns true (currently
only the powerpc hash MMU). The generic implementation in
pgtable-generic.c and the s390/sparc overrides are therefore dead code:
every call site is guarded by arch_needs_pgtable_deposit(), which is
compile-time false on those architectures. Remove them entirely and
replace the extern declarations with static inline no-op stubs for the
default case.

pgtable_trans_huge_{deposit,withdraw}() are renamed to
arch_pgtable_trans_huge_{deposit,withdraw}().

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 12 +++---
 arch/s390/include/asm/pgtable.h              |  6 ---
 arch/s390/mm/pgtable.c                       | 41 --------------------
 arch/sparc/include/asm/pgtable_64.h          |  6 ---
 arch/sparc/mm/tlb.c                          | 36 -----------------
 include/linux/pgtable.h                      | 16 +++++---
 mm/debug_vm_pgtable.c                        |  4 +-
 mm/huge_memory.c                             | 26 ++++++-------
 mm/khugepaged.c                              |  2 +-
 mm/memory.c                                  |  2 +-
 mm/migrate_device.c                          |  2 +-
 mm/pgtable-generic.c                         | 32 ---------------
 12 files changed, 35 insertions(+), 150 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 1a91762b455d9..e0dd2a83b9e05 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1360,18 +1360,18 @@ pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
 				   unsigned long addr,
 				   pud_t *pudp, int full);
 
-#define __HAVE_ARCH_PGTABLE_DEPOSIT
-static inline void pgtable_trans_huge_deposit(struct mm_struct *mm,
-					      pmd_t *pmdp, pgtable_t pgtable)
+#define arch_pgtable_trans_huge_deposit arch_pgtable_trans_huge_deposit
+static inline void arch_pgtable_trans_huge_deposit(struct mm_struct *mm,
+						   pmd_t *pmdp, pgtable_t pgtable)
 {
 	if (radix_enabled())
 		return radix__pgtable_trans_huge_deposit(mm, pmdp, pgtable);
 	return hash__pgtable_trans_huge_deposit(mm, pmdp, pgtable);
 }
 
-#define __HAVE_ARCH_PGTABLE_WITHDRAW
-static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm,
-						    pmd_t *pmdp)
+#define arch_pgtable_trans_huge_withdraw arch_pgtable_trans_huge_withdraw
+static inline pgtable_t arch_pgtable_trans_huge_withdraw(struct mm_struct *mm,
+							 pmd_t *pmdp)
 {
 	if (radix_enabled())
 		return radix__pgtable_trans_huge_withdraw(mm, pmdp);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 1c3c3be93be9c..6bffe88b297b8 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1659,12 +1659,6 @@ pud_t pudp_xchg_direct(struct mm_struct *, unsigned long, pud_t *, pud_t);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
-#define __HAVE_ARCH_PGTABLE_DEPOSIT
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable);
-
-#define __HAVE_ARCH_PGTABLE_WITHDRAW
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 
 #define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
 static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 4acd8b140c4bd..c9a9ab2c7d937 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -312,44 +312,3 @@ pud_t pudp_xchg_direct(struct mm_struct *mm, unsigned long addr,
 	return old;
 }
 EXPORT_SYMBOL(pudp_xchg_direct);
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable)
-{
-	struct list_head *lh = (struct list_head *) pgtable;
-
-	assert_spin_locked(pmd_lockptr(mm, pmdp));
-
-	/* FIFO */
-	if (!pmd_huge_pte(mm, pmdp))
-		INIT_LIST_HEAD(lh);
-	else
-		list_add(lh, (struct list_head *) pmd_huge_pte(mm, pmdp));
-	pmd_huge_pte(mm, pmdp) = pgtable;
-}
-
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
-{
-	struct list_head *lh;
-	pgtable_t pgtable;
-	pte_t *ptep;
-
-	assert_spin_locked(pmd_lockptr(mm, pmdp));
-
-	/* FIFO */
-	pgtable = pmd_huge_pte(mm, pmdp);
-	lh = (struct list_head *) pgtable;
-	if (list_empty(lh))
-		pmd_huge_pte(mm, pmdp) = NULL;
-	else {
-		pmd_huge_pte(mm, pmdp) = (pgtable_t) lh->next;
-		list_del(lh);
-	}
-	ptep = (pte_t *) pgtable;
-	set_pte(ptep, __pte(_PAGE_INVALID));
-	ptep++;
-	set_pte(ptep, __pte(_PAGE_INVALID));
-	return pgtable;
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 74ede706fb325..60861560f8c40 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -987,12 +987,6 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 
-#define __HAVE_ARCH_PGTABLE_DEPOSIT
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable);
-
-#define __HAVE_ARCH_PGTABLE_WITHDRAW
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 #endif
 
 /*
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 6d9dd5eb13287..9049d54e6e2cb 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -275,40 +275,4 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 	return old;
 }
 
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable)
-{
-	struct list_head *lh = (struct list_head *) pgtable;
-
-	assert_spin_locked(&mm->page_table_lock);
-
-	/* FIFO */
-	if (!pmd_huge_pte(mm, pmdp))
-		INIT_LIST_HEAD(lh);
-	else
-		list_add(lh, (struct list_head *) pmd_huge_pte(mm, pmdp));
-	pmd_huge_pte(mm, pmdp) = pgtable;
-}
-
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
-{
-	struct list_head *lh;
-	pgtable_t pgtable;
-
-	assert_spin_locked(&mm->page_table_lock);
-
-	/* FIFO */
-	pgtable = pmd_huge_pte(mm, pmdp);
-	lh = (struct list_head *) pgtable;
-	if (list_empty(lh))
-		pmd_huge_pte(mm, pmdp) = NULL;
-	else {
-		pmd_huge_pte(mm, pmdp) = (pgtable_t) lh->next;
-		list_del(lh);
-	}
-	pte_val(pgtable[0]) = 0;
-	pte_val(pgtable[1]) = 0;
-
-	return pgtable;
-}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 776993d4567b4..6e3b66d17ccf0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1171,13 +1171,19 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
-extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				       pgtable_t pgtable);
+#ifndef arch_pgtable_trans_huge_deposit
+static inline void arch_pgtable_trans_huge_deposit(struct mm_struct *mm,
+						   pmd_t *pmdp, pgtable_t pgtable)
+{
+}
 #endif
 
-#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
-extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
+#ifndef arch_pgtable_trans_huge_withdraw
+static inline pgtable_t arch_pgtable_trans_huge_withdraw(struct mm_struct *mm,
+							 pmd_t *pmdp)
+{
+	return NULL;
+}
 #endif
 
 #ifndef arch_needs_pgtable_deposit
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 83cf07269f134..2f811c5a083ce 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -240,7 +240,7 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args)
 	/* Align the address wrt HPAGE_PMD_SIZE */
 	vaddr &= HPAGE_PMD_MASK;
 
-	pgtable_trans_huge_deposit(args->mm, args->pmdp, args->start_ptep);
+	arch_pgtable_trans_huge_deposit(args->mm, args->pmdp, args->start_ptep);
 
 	pmd = pfn_pmd(args->pmd_pfn, args->page_prot);
 	set_pmd_at(args->mm, vaddr, args->pmdp, pmd);
@@ -276,7 +276,7 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args)
 
 	/*  Clear the pte entries  */
 	pmdp_huge_get_and_clear(args->mm, vaddr, args->pmdp);
-	pgtable_trans_huge_withdraw(args->mm, args->pmdp);
+	arch_pgtable_trans_huge_withdraw(args->mm, args->pmdp);
 }
 
 static void __init pmd_leaf_tests(struct pgtable_debug_args *args)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a10cb136000d1..55b14ba244b1b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1359,7 +1359,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			return ret;
 		}
 		if (pgtable) {
-			pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
+			arch_pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
 						   pgtable);
 			mm_inc_nr_ptes(vma->vm_mm);
 		}
@@ -1460,7 +1460,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
 	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
 	entry = pmd_mkspecial(entry);
 	if (pgtable) {
-		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		mm_inc_nr_ptes(mm);
 	}
 	set_pmd_at(mm, haddr, pmd, entry);
@@ -1593,7 +1593,7 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 	}
 
 	if (pgtable) {
-		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		mm_inc_nr_ptes(mm);
 		pgtable = NULL;
 	}
@@ -1855,7 +1855,7 @@ static void copy_huge_non_present_pmd(
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 	if (pgtable) {
 		mm_inc_nr_ptes(dst_mm);
-		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 	}
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_swp_clear_uffd_wp(pmd);
@@ -1962,7 +1962,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 out_zero_page:
 	if (pgtable) {
 		mm_inc_nr_ptes(dst_mm);
-		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 	}
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	if (!userfaultfd_wp(dst_vma))
@@ -2370,7 +2370,7 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 {
 	pgtable_t pgtable;
 
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pgtable = arch_pgtable_trans_huge_withdraw(mm, pmd);
 	pte_free(mm, pgtable);
 	mm_dec_nr_ptes(mm);
 }
@@ -2389,7 +2389,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	/*
 	 * For architectures like ppc64 we look at deposited pgtable
 	 * when calling pmdp_huge_get_and_clear. So do the
-	 * pgtable_trans_huge_withdraw after finishing pmdp related
+	 * arch_pgtable_trans_huge_withdraw after finishing pmdp related
 	 * operations.
 	 */
 	orig_pmd = pmdp_huge_get_and_clear_full(vma, addr, pmd,
@@ -2531,8 +2531,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
 		    arch_needs_pgtable_deposit()) {
 			pgtable_t pgtable;
-			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
-			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
+			pgtable = arch_pgtable_trans_huge_withdraw(mm, old_pmd);
+			arch_pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
 		}
 		pmd = move_soft_dirty_pmd(pmd);
 		if (vma_has_uffd_without_event_remap(vma))
@@ -2838,8 +2838,8 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
 
 	if (arch_needs_pgtable_deposit()) {
-		src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
-		pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+		src_pgtable = arch_pgtable_trans_huge_withdraw(mm, src_pmd);
+		arch_pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
 	}
 unlock_ptls:
 	double_pt_unlock(src_ptl, dst_ptl);
@@ -3001,7 +3001,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
 
 	if (arch_needs_pgtable_deposit()) {
-		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+		pgtable = arch_pgtable_trans_huge_withdraw(mm, pmd);
 	} else {
 		VM_BUG_ON(!pgtable);
 		/*
@@ -3218,7 +3218,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * This's critical for some architectures (Power).
 	 */
 	if (arch_needs_pgtable_deposit()) {
-		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+		pgtable = arch_pgtable_trans_huge_withdraw(mm, pmd);
 	} else {
 		VM_BUG_ON(!pgtable);
 		/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 735d7ee5bbab2..2b426bcd16977 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1225,7 +1225,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
 	if (arch_needs_pgtable_deposit()) {
-		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		arch_pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	} else {
 		mm_dec_nr_ptes(mm);
 		pte_free(mm, pgtable);
diff --git a/mm/memory.c b/mm/memory.c
index 51d2717e3f1b4..4ec1ae909baf4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5384,7 +5384,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
+	arch_pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
 	/*
 	 * We are going to consume the prealloc table,
 	 * count that as nr_ptes.
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 1adb5abccfb70..be84ace37b88f 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -882,7 +882,7 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 		flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
 		pmdp_invalidate(vma, addr, pmdp);
 	} else if (pgtable) {
-		pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
+		arch_pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
 		mm_inc_nr_ptes(vma->vm_mm);
 	}
 	set_pmd_at(vma->vm_mm, addr, pmdp, entry);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index af7966169d695..d8d5875d66fed 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -162,38 +162,6 @@ pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 #endif
 #endif
 
-#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
-void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-				pgtable_t pgtable)
-{
-	assert_spin_locked(pmd_lockptr(mm, pmdp));
-
-	/* FIFO */
-	if (!pmd_huge_pte(mm, pmdp))
-		INIT_LIST_HEAD(&pgtable->lru);
-	else
-		list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
-	pmd_huge_pte(mm, pmdp) = pgtable;
-}
-#endif
-
-#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
-/* no "address" argument so destroys page coloring of some arch */
-pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
-{
-	pgtable_t pgtable;
-
-	assert_spin_locked(pmd_lockptr(mm, pmdp));
-
-	/* FIFO */
-	pgtable = pmd_huge_pte(mm, pmdp);
-	pmd_huge_pte(mm, pmdp) = list_first_entry_or_null(&pgtable->lru,
-							  struct page, lru);
-	if (pmd_huge_pte(mm, pmdp))
-		list_del(&pgtable->lru);
-	return pgtable;
-}
-#endif
 
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
 pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
-- 
2.47.3




* [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (14 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 15/21] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 14:22   ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 17/21] selftests/mm: add THP PMD split test infrastructure Usama Arif
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add a vmstat counter to track PTE allocation failures during PMD split.
This enables monitoring of split failures due to memory pressure after
the lazy PTE page table allocation change.

The counter is incremented in three places:
- __split_huge_pmd(): Main entry point for splitting a PMD
- try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
- try_to_migrate_one(): When migration needs to split a PMD-mapped THP

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/vm_event_item.h | 1 +
 mm/huge_memory.c              | 1 +
 mm/rmap.c                     | 3 +++
 mm/vmstat.c                   | 1 +
 4 files changed, 6 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a0201..ce696cf7d6321 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -98,6 +98,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_DEFERRED_SPLIT_PAGE,
 		THP_UNDERUSED_SPLIT_PAGE,
 		THP_SPLIT_PMD,
+		THP_SPLIT_PMD_FAILED,
 		THP_SCAN_EXCEED_NONE_PTE,
 		THP_SCAN_EXCEED_SWAP_PTE,
 		THP_SCAN_EXCEED_SHARED_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 55b14ba244b1b..fc0a5e91b4d40 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3347,6 +3347,7 @@ int __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
 		if (!pgtable) {
+			count_vm_event(THP_SPLIT_PMD_FAILED);
 			mmu_notifier_invalidate_range_end(&range);
 			return -ENOMEM;
 		}
diff --git a/mm/rmap.c b/mm/rmap.c
index 2519d579bc1d8..2dae46fff08ae 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2067,8 +2067,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				pgtable_t pgtable = prealloc_pte;
 
 				prealloc_pte = NULL;
+
 				if (!arch_needs_pgtable_deposit() && !pgtable &&
 				    vma_is_anonymous(vma)) {
+					count_vm_event(THP_SPLIT_PMD_FAILED);
 					page_vma_mapped_walk_done(&pvmw);
 					ret = false;
 					break;
@@ -2471,6 +2473,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				prealloc_pte = NULL;
 				if (!arch_needs_pgtable_deposit() && !pgtable &&
 				    vma_is_anonymous(vma)) {
+					count_vm_event(THP_SPLIT_PMD_FAILED);
 					page_vma_mapped_walk_done(&pvmw);
 					ret = false;
 					break;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 667474773dbc7..da276ef0072ed 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1408,6 +1408,7 @@ const char * const vmstat_text[] = {
 	[I(THP_DEFERRED_SPLIT_PAGE)]		= "thp_deferred_split_page",
 	[I(THP_UNDERUSED_SPLIT_PAGE)]		= "thp_underused_split_page",
 	[I(THP_SPLIT_PMD)]			= "thp_split_pmd",
+	[I(THP_SPLIT_PMD_FAILED)]		= "thp_split_pmd_failed",
 	[I(THP_SCAN_EXCEED_NONE_PTE)]		= "thp_scan_exceed_none_pte",
 	[I(THP_SCAN_EXCEED_SWAP_PTE)]		= "thp_scan_exceed_swap_pte",
 	[I(THP_SCAN_EXCEED_SHARED_PTE)]		= "thp_scan_exceed_share_pte",
-- 
2.47.3




* [RFC v2 17/21] selftests/mm: add THP PMD split test infrastructure
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (15 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 18/21] selftests/mm: add partial_mprotect test for change_pmd_range Usama Arif
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test infrastructure for verifying THP PMD split behavior with lazy
PTE allocation. This includes:

- Test fixture with PMD-aligned memory allocation
- Helper functions for reading vmstat counters
- log_and_check_pmd_split() helper for logging counters and checking
  that thp_split_pmd has incremented and thp_split_pmd_failed has not.
- THP allocation helper with verification

Also add a test checking that a partial unmap of a THP splits the PMD.
This exercises the zap_pmd_range() part of the split path.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 tools/testing/selftests/mm/Makefile           |   1 +
 .../testing/selftests/mm/thp_pmd_split_test.c | 149 ++++++++++++++++++
 2 files changed, 150 insertions(+)
 create mode 100644 tools/testing/selftests/mm/thp_pmd_split_test.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 7a5de4e9bf520..e80551e76013a 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -95,6 +95,7 @@ TEST_GEN_FILES += uffd-stress
 TEST_GEN_FILES += uffd-unit-tests
 TEST_GEN_FILES += uffd-wp-mremap
 TEST_GEN_FILES += split_huge_page_test
+TEST_GEN_FILES += thp_pmd_split_test
 TEST_GEN_FILES += ksm_tests
 TEST_GEN_FILES += ksm_functional_tests
 TEST_GEN_FILES += mdwe_test
diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
new file mode 100644
index 0000000000000..0f54ac04760d5
--- /dev/null
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Tests various kernel code paths that handle THP PMD splitting.
+ *
+ * Prerequisites:
+ * - THP enabled (always or madvise mode):
+ *   echo always > /sys/kernel/mm/transparent_hugepage/enabled
+ *   or
+ *   echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <stdint.h>
+
+#include "kselftest_harness.h"
+#include "thp_settings.h"
+#include "vm_util.h"
+
+/* Read vmstat counter */
+static unsigned long read_vmstat(const char *name)
+{
+	FILE *fp;
+	char line[256];
+	unsigned long value = 0;
+
+	fp = fopen("/proc/vmstat", "r");
+	if (!fp)
+		return 0;
+
+	while (fgets(line, sizeof(line), fp)) {
+		if (strncmp(line, name, strlen(name)) == 0 &&
+		    line[strlen(name)] == ' ') {
+			sscanf(line + strlen(name), " %lu", &value);
+			break;
+		}
+	}
+	fclose(fp);
+	return value;
+}
+
+/*
+ * Log the thp_split_pmd and thp_split_pmd_failed vmstat counters,
+ * then check that thp_split_pmd has incremented and that
+ * thp_split_pmd_failed has not.
+ */
+static void log_and_check_pmd_split(struct __test_metadata *const _metadata,
+	unsigned long split_pmd_before, unsigned long split_pmd_failed_before)
+{
+	unsigned long split_pmd_after = read_vmstat("thp_split_pmd");
+	unsigned long split_pmd_failed_after = read_vmstat("thp_split_pmd_failed");
+
+	TH_LOG("thp_split_pmd: %lu -> %lu",
+	       split_pmd_before, split_pmd_after);
+	TH_LOG("thp_split_pmd_failed: %lu -> %lu",
+	       split_pmd_failed_before, split_pmd_failed_after);
+	ASSERT_GT(split_pmd_after, split_pmd_before);
+	ASSERT_EQ(split_pmd_failed_after, split_pmd_failed_before);
+}
+
+/* Allocate a THP at the given aligned address */
+static int allocate_thp(void *aligned, size_t pmdsize)
+{
+	int ret;
+
+	ret = madvise(aligned, pmdsize, MADV_HUGEPAGE);
+	if (ret)
+		return -1;
+
+	/* Touch all pages to allocate the THP */
+	memset(aligned, 0xAA, pmdsize);
+
+	/* Verify we got a THP */
+	if (!check_huge_anon(aligned, 1, pmdsize))
+		return -1;
+
+	return 0;
+}
+
+FIXTURE(thp_pmd_split)
+{
+	void *mem;		/* Base mmap allocation */
+	void *aligned;		/* PMD-aligned pointer within mem */
+	size_t pmdsize;		/* PMD size from sysfs */
+	size_t pagesize;	/* Base page size */
+	size_t mmap_size;	/* Total mmap size for alignment */
+	unsigned long split_pmd_before;
+	unsigned long split_pmd_failed_before;
+};
+
+FIXTURE_SETUP(thp_pmd_split)
+{
+	if (!thp_available())
+		SKIP(return, "THP not available");
+
+	self->pmdsize = read_pmd_pagesize();
+	if (!self->pmdsize)
+		SKIP(return, "Unable to read PMD size");
+
+	self->pagesize = getpagesize();
+	self->mmap_size = 4 * self->pmdsize;
+
+	self->split_pmd_before = read_vmstat("thp_split_pmd");
+	self->split_pmd_failed_before = read_vmstat("thp_split_pmd_failed");
+
+	self->mem = mmap(NULL, self->mmap_size, PROT_READ | PROT_WRITE,
+			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	ASSERT_NE(self->mem, MAP_FAILED);
+
+	/* Align to PMD boundary */
+	self->aligned = (void *)(((unsigned long)self->mem + self->pmdsize - 1) &
+				 ~(self->pmdsize - 1));
+}
+
+FIXTURE_TEARDOWN(thp_pmd_split)
+{
+	if (self->mem && self->mem != MAP_FAILED)
+		munmap(self->mem, self->mmap_size);
+}
+
+/*
+ * Partial munmap on THP (zap_pmd_range)
+ *
+ * Tests that a partial munmap of a THP correctly splits the PMD.
+ * This exercises the zap_pmd_range() path of the split.
+ */
+TEST_F(thp_pmd_split, partial_munmap)
+{
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	ret = munmap((char *)self->aligned + self->pagesize, self->pagesize);
+	ASSERT_EQ(ret, 0);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
+TEST_HARNESS_MAIN
-- 
2.47.3



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC v2 18/21] selftests/mm: add partial_mprotect test for change_pmd_range
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (16 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 17/21] selftests/mm: add THP PMD split test infrastructure Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 19/21] selftests/mm: add partial_mlock test Usama Arif
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test for partial mprotect on THP which exercises change_pmd_range().
This verifies that partial mprotect correctly splits the PMD, applies
protection only to the requested portion, and leaves the rest of the
mapping writable.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 .../testing/selftests/mm/thp_pmd_split_test.c | 31 +++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
index 0f54ac04760d5..4944a5a516da9 100644
--- a/tools/testing/selftests/mm/thp_pmd_split_test.c
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -146,4 +146,35 @@ TEST_F(thp_pmd_split, partial_munmap)
 		self->split_pmd_failed_before);
 }
 
+/*
+ * Partial mprotect on THP (change_pmd_range)
+ *
+ * Tests that partial mprotect of a THP correctly splits the PMD and
+ * applies protection only to the requested portion. This exercises
+ * the mprotect path which now handles split failures.
+ */
+TEST_F(thp_pmd_split, partial_mprotect)
+{
+	volatile unsigned char *ptr = (volatile unsigned char *)self->aligned;
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	/* Partial mprotect - make middle page read-only */
+	ret = mprotect((char *)self->aligned + self->pagesize, self->pagesize, PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Verify we can still write to non-protected pages */
+	ptr[0] = 0xDD;
+	ptr[self->pmdsize - 1] = 0xEE;
+
+	ASSERT_EQ(ptr[0], (unsigned char)0xDD);
+	ASSERT_EQ(ptr[self->pmdsize - 1], (unsigned char)0xEE);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC v2 19/21] selftests/mm: add partial_mlock test
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (17 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 18/21] selftests/mm: add partial_mprotect test for change_pmd_range Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 20/21] selftests/mm: add partial_mremap test for move_page_tables Usama Arif
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test for partial mlock on THP which exercises walk_page_range()
with a subset of the THP. This should trigger a PMD split since
mlock operates at page granularity.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 .../testing/selftests/mm/thp_pmd_split_test.c | 26 +++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
index 4944a5a516da9..3c9f05457efec 100644
--- a/tools/testing/selftests/mm/thp_pmd_split_test.c
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -177,4 +177,30 @@ TEST_F(thp_pmd_split, partial_mprotect)
 		self->split_pmd_failed_before);
 }
 
+/*
+ * Partial mlock triggering split (walk_page_range)
+ *
+ * Tests mlock on a partial THP region which should trigger a PMD split.
+ */
+TEST_F(thp_pmd_split, partial_mlock)
+{
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	/* Partial mlock - should trigger PMD split */
+	ret = mlock((char *)self->aligned + self->pagesize, self->pagesize);
+	if (ret && errno == ENOMEM)
+		SKIP(return, "mlock failed with ENOMEM (resource limit)");
+	ASSERT_EQ(ret, 0);
+
+	/* Cleanup */
+	munlock((char *)self->aligned + self->pagesize, self->pagesize);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC v2 20/21] selftests/mm: add partial_mremap test for move_page_tables
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (18 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 19/21] selftests/mm: add partial_mlock test Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 11:23 ` [RFC v2 21/21] selftests/mm: add madv_dontneed_partial test Usama Arif
  2026-02-26 21:01 ` [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Nico Pache
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test for partial mremap on THP which exercises move_page_tables().
This verifies that partial mremap correctly splits the PMD, moves
only the requested page, and preserves data integrity in both the
moved region and the original mapping.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 .../testing/selftests/mm/thp_pmd_split_test.c | 50 +++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
index 3c9f05457efec..1f29296759a5b 100644
--- a/tools/testing/selftests/mm/thp_pmd_split_test.c
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -203,4 +203,54 @@ TEST_F(thp_pmd_split, partial_mlock)
 		self->split_pmd_failed_before);
 }
 
+/*
+ * Partial mremap (move_page_tables)
+ *
+ * Tests that partial mremap of a THP correctly splits the PMD and
+ * moves only the requested portion. This exercises move_page_tables()
+ * which now handles split failures.
+ */
+TEST_F(thp_pmd_split, partial_mremap)
+{
+	void *new_addr;
+	unsigned long *ptr = (unsigned long *)self->aligned;
+	unsigned long *new_ptr;
+	unsigned long pattern = 0xABCDUL;
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	/* Write pattern to the page we'll move */
+	ptr[self->pagesize / sizeof(unsigned long)] = pattern;
+
+	/* Also write to first and last page to verify they stay intact */
+	ptr[0] = 0x1234UL;
+	ptr[(self->pmdsize - self->pagesize) / sizeof(unsigned long)] = 0x4567UL;
+
+	/* Partial mremap - move one base page from the THP */
+	new_addr = mremap((char *)self->aligned + self->pagesize, self->pagesize,
+			  self->pagesize, MREMAP_MAYMOVE);
+	if (new_addr == MAP_FAILED) {
+		if (errno == ENOMEM)
+			SKIP(return, "mremap failed with ENOMEM");
+		ASSERT_NE(new_addr, MAP_FAILED);
+	}
+
+	/* Verify data was moved correctly */
+	new_ptr = (unsigned long *)new_addr;
+	ASSERT_EQ(new_ptr[0], pattern);
+
+	/* Verify surrounding data is intact */
+	ASSERT_EQ(ptr[0], 0x1234UL);
+	ASSERT_EQ(ptr[(self->pmdsize - self->pagesize) / sizeof(unsigned long)], 0x4567UL);
+
+	/* Cleanup the moved page */
+	munmap(new_addr, self->pagesize);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC v2 21/21] selftests/mm: add madv_dontneed_partial test
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (19 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 20/21] selftests/mm: add partial_mremap test for move_page_tables Usama Arif
@ 2026-02-26 11:23 ` Usama Arif
  2026-02-26 21:01 ` [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Nico Pache
  21 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 11:23 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390, Usama Arif

Add test for partial MADV_DONTNEED on THP. This verifies that
MADV_DONTNEED correctly triggers a PMD split, discards only the
requested page (which becomes zero-filled), and preserves data
in the surrounding pages.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 .../testing/selftests/mm/thp_pmd_split_test.c | 34 +++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/tools/testing/selftests/mm/thp_pmd_split_test.c b/tools/testing/selftests/mm/thp_pmd_split_test.c
index 1f29296759a5b..060ca1e341b75 100644
--- a/tools/testing/selftests/mm/thp_pmd_split_test.c
+++ b/tools/testing/selftests/mm/thp_pmd_split_test.c
@@ -253,4 +253,38 @@ TEST_F(thp_pmd_split, partial_mremap)
 		self->split_pmd_failed_before);
 }
 
+/*
+ * MADV_DONTNEED on THP
+ *
+ * Tests that MADV_DONTNEED on a partial THP correctly handles
+ * the PMD split and discards only the requested pages.
+ */
+TEST_F(thp_pmd_split, partial_madv_dontneed)
+{
+	volatile unsigned char *ptr = (volatile unsigned char *)self->aligned;
+	int ret;
+
+	ret = allocate_thp(self->aligned, self->pmdsize);
+	if (ret)
+		SKIP(return, "Failed to allocate THP");
+
+	/* Write pattern */
+	memset(self->aligned, 0xDD, self->pmdsize);
+
+	/* Partial MADV_DONTNEED - discard middle page */
+	ret = madvise((char *)self->aligned + self->pagesize, self->pagesize, MADV_DONTNEED);
+	ASSERT_EQ(ret, 0);
+
+	/* Verify non-discarded pages still have data */
+	ASSERT_EQ(ptr[0], (unsigned char)0xDD);
+	ASSERT_EQ(ptr[2 * self->pagesize], (unsigned char)0xDD);
+	ASSERT_EQ(ptr[self->pmdsize - 1], (unsigned char)0xDD);
+
+	/* Discarded page should be zero */
+	ASSERT_EQ(ptr[self->pagesize], (unsigned char)0x00);
+
+	log_and_check_pmd_split(_metadata, self->split_pmd_before,
+		self->split_pmd_failed_before);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter
  2026-02-26 11:23 ` [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
@ 2026-02-26 14:22   ` Usama Arif
  0 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-02-26 14:22 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, Vlastimil Babka,
	lance.yang, linux-kernel, kernel-team, maddy, mpe, linuxppc-dev,
	hca, gor, agordeev, borntraeger, svens, linux-s390



On 26/02/2026 11:23, Usama Arif wrote:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2519d579bc1d8..2dae46fff08ae 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2067,8 +2067,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  				pgtable_t pgtable = prealloc_pte;
>  
>  				prealloc_pte = NULL;
> +
>  				if (!arch_needs_pgtable_deposit() && !pgtable &&
>  				    vma_is_anonymous(vma)) {
> +					count_vm_event(THP_SPLIT_PMD_FAILED);
>  					page_vma_mapped_walk_done(&pvmw);
>  					ret = false;
>  					break;
> @@ -2471,6 +2473,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  				prealloc_pte = NULL;
>  				if (!arch_needs_pgtable_deposit() && !pgtable &&
>  				    vma_is_anonymous(vma)) {
> +					count_vm_event(THP_SPLIT_PMD_FAILED);
>  					page_vma_mapped_walk_done(&pvmw);
>  					ret = false;
>  					break;
This will need to be guarded by CONFIG_TRANSPARENT_HUGEPAGE. Will need the below diff in the next series.

diff --git a/mm/rmap.c b/mm/rmap.c
index 2dae46fff08ae..9d74600951cf6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2070,7 +2070,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
                                if (!arch_needs_pgtable_deposit() && !pgtable &&
                                    vma_is_anonymous(vma)) {
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
                                        count_vm_event(THP_SPLIT_PMD_FAILED);
+#endif
                                        page_vma_mapped_walk_done(&pvmw);
                                        ret = false;
                                        break;
@@ -2473,7 +2475,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
                                prealloc_pte = NULL;
                                if (!arch_needs_pgtable_deposit() && !pgtable &&
                                    vma_is_anonymous(vma)) {
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
                                        count_vm_event(THP_SPLIT_PMD_FAILED);
+#endif
                                        page_vma_mapped_walk_done(&pvmw);
                                        ret = false;
                                        break;


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split
  2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
                   ` (20 preceding siblings ...)
  2026-02-26 11:23 ` [RFC v2 21/21] selftests/mm: add madv_dontneed_partial test Usama Arif
@ 2026-02-26 21:01 ` Nico Pache
  21 siblings, 0 replies; 24+ messages in thread
From: Nico Pache @ 2026-02-26 21:01 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm, fvdl,
	hannes, riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang,
	Liam.Howlett, ryan.roberts, Vlastimil Babka, lance.yang,
	linux-kernel, kernel-team, maddy, mpe, linuxppc-dev, hca, gor,
	agordeev, borntraeger, svens, linux-s390

On Thu, Feb 26, 2026 at 4:33 AM Usama Arif <usama.arif@linux.dev> wrote:
>
> When the kernel creates a PMD-level THP mapping for anonymous pages, it
> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
> page table sits unused in a deposit list for the lifetime of the THP
> mapping, only to be withdrawn when the PMD is split or zapped. Every
> anonymous THP therefore wastes 4KB of memory unconditionally. On large
> servers where hundreds of gigabytes of memory are mapped as THPs, this
> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
> could otherwise satisfy other allocations, including the very PTE page
> table allocations needed when splits eventually occur.
>
> This series removes the pre-deposit and allocates the PTE page table
> lazily — only when a PMD split actually happens. Since a large number
> of THPs are never split (they are zapped wholesale when processes exit or
> munmap the full range), the allocation is avoided entirely in the common
> case.
>
> The pre-deposit pattern exists because split_huge_pmd was designed as an
> operation that must never fail: if the kernel decides to split, it needs
> a PTE page table, so one is deposited in advance. But "must never fail"
> is an unnecessarily strong requirement. A PMD split is typically triggered
> by a partial operation on a sub-PMD range — partial munmap, partial
> mprotect, partial mremap and so on.
> Most of these operations already have well-defined error handling for
> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
> fail and propagating the error through these existing paths is the natural
> thing to do. Furthermore, split failing requires an order-0 allocation for
> a page table to fail, which is extremely unlikely.
>
> Designing functions like split_huge_pmd as operations that cannot fail
> has a subtle but real cost to code quality. It forces a pre-allocation
> pattern - every THP creation path must deposit a page table, and every
> split or zap path must withdraw one, creating a hidden coupling between
> widely separated code paths.
>
> This also serves as a code cleanup. On every architecture except powerpc
> with hash MMU, the deposit/withdraw machinery becomes dead code. The
> series removes the generic implementations in pgtable-generic.c and the
> s390/sparc overrides, replacing them with no-op stubs guarded by
> arch_needs_pgtable_deposit(), which evaluates to false at compile time
> on all non-powerpc architectures.

Hi Usama,

Thanks for tackling this, it seems like an interesting problem. I'm
trying to get more into reviewing, so bear with me; I may have some
stupid comments or questions. Where I can really help out is with
testing. I will build this for all RH-supported architectures and run
some automated test suites and performance metrics. I'll report back
if I spot anything.

Cheers!
-- Nico

>
> The series is structured as follows:
>
> Patches 1-2:    Error infrastructure — make split functions return int
>                 and propagate errors from vma_adjust_trans_huge()
>                 through __split_vma, vma_shrink, and commit_merge.
>
> Patches 3-12:   Handle split failure at every call site — copy_huge_pmd,
>                 do_huge_pmd_wp_page, zap_pmd_range, wp_huge_pmd,
>                 change_pmd_range (mprotect), follow_pmd_mask (GUP),
>                 walk_pmd_range (pagewalk), move_page_tables (mremap),
>                 move_pages (userfaultfd), and device migration.
>                 The code will become effective in Patch 14, when split
>                 functions start returning -ENOMEM.
>
> Patch 13:       Add __must_check to __split_huge_pmd(), split_huge_pmd()
>                 and split_huge_pmd_address() so the compiler warns on
>                 unchecked return values.
>
> Patch 14:       The actual change — allocate PTE page tables lazily at
>                 split time instead of pre-depositing at THP creation.
>                 This is when split functions will actually start returning
>                 -ENOMEM.
>
> Patch 15:       Remove the now-dead deposit/withdraw code on
>                 non-powerpc architectures.
>
> Patch 16:       Add THP_SPLIT_PMD_FAILED vmstat counter for monitoring
>                 split failures.
>
> Patches 17-21:  Selftests covering partial munmap, mprotect, mlock,
>                 mremap, and MADV_DONTNEED on THPs to exercise the
>                 split paths.
>
> The error handling patches are placed before the lazy allocation patch so
> that every call site is already prepared to handle split failures before
> the failure mode is introduced. This makes each patch independently safe
> to apply and bisect through.
>
> The patches were tested with CONFIG_DEBUG_ATOMIC_SLEEP and CONFIG_DEBUG_VM
> enabled. The test results are below:
>
> TAP version 13
> 1..5
> # Starting 5 tests from 1 test cases.
> #  RUN           thp_pmd_split.partial_munmap ...
> # thp_pmd_split_test.c:60:partial_munmap:thp_split_pmd: 0 -> 1
> # thp_pmd_split_test.c:62:partial_munmap:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_munmap
> ok 1 thp_pmd_split.partial_munmap
> #  RUN           thp_pmd_split.partial_mprotect ...
> # thp_pmd_split_test.c:60:partial_mprotect:thp_split_pmd: 1 -> 2
> # thp_pmd_split_test.c:62:partial_mprotect:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_mprotect
> ok 2 thp_pmd_split.partial_mprotect
> #  RUN           thp_pmd_split.partial_mlock ...
> # thp_pmd_split_test.c:60:partial_mlock:thp_split_pmd: 2 -> 3
> # thp_pmd_split_test.c:62:partial_mlock:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_mlock
> ok 3 thp_pmd_split.partial_mlock
> #  RUN           thp_pmd_split.partial_mremap ...
> # thp_pmd_split_test.c:60:partial_mremap:thp_split_pmd: 3 -> 4
> # thp_pmd_split_test.c:62:partial_mremap:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_mremap
> ok 4 thp_pmd_split.partial_mremap
> #  RUN           thp_pmd_split.partial_madv_dontneed ...
> # thp_pmd_split_test.c:60:partial_madv_dontneed:thp_split_pmd: 4 -> 5
> # thp_pmd_split_test.c:62:partial_madv_dontneed:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_madv_dontneed
> ok 5 thp_pmd_split.partial_madv_dontneed
> # PASSED: 5 / 5 tests passed.
> # Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> The patches are based off of 957a3fab8811b455420128ea5f41c51fd23eb6c7 from
> mm-unstable as of 25 Feb (7.0.0-rc1).
>
>
> RFC v1 -> v2: https://lore.kernel.org/all/20260211125507.4175026-1-usama.arif@linux.dev/
> - Change counter name to THP_SPLIT_PMD_FAILED (David)
> - remove pgtable_trans_huge_{deposit/withdraw} when not needed and
>   make them arch specific (David)
> - make split functions return error code and have callers handle them
>   (David and Kiryl)
> - Add test cases for splitting
>
> Usama Arif (21):
>   mm: thp: make split_huge_pmd functions return int for error
>     propagation
>   mm: thp: propagate split failure from vma_adjust_trans_huge()
>   mm: thp: handle split failure in copy_huge_pmd()
>   mm: thp: handle split failure in do_huge_pmd_wp_page()
>   mm: thp: handle split failure in zap_pmd_range()
>   mm: thp: handle split failure in wp_huge_pmd()
>   mm: thp: retry on split failure in change_pmd_range()
>   mm: thp: handle split failure in follow_pmd_mask()
>   mm: handle walk_page_range() failure from THP split
>   mm: thp: handle split failure in mremap move_page_tables()
>   mm: thp: handle split failure in userfaultfd move_pages()
>   mm: thp: handle split failure in device migration
>   mm: huge_mm: Make sure all split_huge_pmd calls are checked
>   mm: thp: allocate PTE page tables lazily at split time
>   mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed
>   mm: thp: add THP_SPLIT_PMD_FAILED counter
>   selftests/mm: add THP PMD split test infrastructure
>   selftests/mm: add partial_mprotect test for change_pmd_range
>   selftests/mm: add partial_mlock test
>   selftests/mm: add partial_mremap test for move_page_tables
>   selftests/mm: add madv_dontneed_partial test
>
>  arch/powerpc/include/asm/book3s/64/pgtable.h  |  12 +-
>  arch/s390/include/asm/pgtable.h               |   6 -
>  arch/s390/mm/pgtable.c                        |  41 ---
>  arch/sparc/include/asm/pgtable_64.h           |   6 -
>  arch/sparc/mm/tlb.c                           |  36 ---
>  include/linux/huge_mm.h                       |  51 +--
>  include/linux/pgtable.h                       |  16 +-
>  include/linux/vm_event_item.h                 |   1 +
>  mm/debug_vm_pgtable.c                         |   4 +-
>  mm/gup.c                                      |  10 +-
>  mm/huge_memory.c                              | 208 +++++++++----
>  mm/khugepaged.c                               |   7 +-
>  mm/memory.c                                   |  26 +-
>  mm/migrate_device.c                           |  33 +-
>  mm/mprotect.c                                 |  11 +-
>  mm/mremap.c                                   |   8 +-
>  mm/pagewalk.c                                 |   8 +-
>  mm/pgtable-generic.c                          |  32 --
>  mm/rmap.c                                     |  42 ++-
>  mm/userfaultfd.c                              |   8 +-
>  mm/vma.c                                      |  37 ++-
>  mm/vmstat.c                                   |   1 +
>  tools/testing/selftests/mm/Makefile           |   1 +
>  .../testing/selftests/mm/thp_pmd_split_test.c | 290 ++++++++++++++++++
>  tools/testing/vma/include/stubs.h             |   9 +-
>  25 files changed, 645 insertions(+), 259 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/thp_pmd_split_test.c
>
> --
> 2.47.3
>



^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2026-02-26 21:01 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-26 11:23 [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Usama Arif
2026-02-26 11:23 ` [RFC v2 01/21] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
2026-02-26 11:23 ` [RFC v2 02/21] mm: thp: propagate split failure from vma_adjust_trans_huge() Usama Arif
2026-02-26 11:23 ` [RFC v2 03/21] mm: thp: handle split failure in copy_huge_pmd() Usama Arif
2026-02-26 11:23 ` [RFC v2 04/21] mm: thp: handle split failure in do_huge_pmd_wp_page() Usama Arif
2026-02-26 11:23 ` [RFC v2 05/21] mm: thp: handle split failure in zap_pmd_range() Usama Arif
2026-02-26 11:23 ` [RFC v2 06/21] mm: thp: handle split failure in wp_huge_pmd() Usama Arif
2026-02-26 11:23 ` [RFC v2 07/21] mm: thp: retry on split failure in change_pmd_range() Usama Arif
2026-02-26 11:23 ` [RFC v2 08/21] mm: thp: handle split failure in follow_pmd_mask() Usama Arif
2026-02-26 11:23 ` [RFC v2 09/21] mm: handle walk_page_range() failure from THP split Usama Arif
2026-02-26 11:23 ` [RFC v2 10/21] mm: thp: handle split failure in mremap move_page_tables() Usama Arif
2026-02-26 11:23 ` [RFC v2 11/21] mm: thp: handle split failure in userfaultfd move_pages() Usama Arif
2026-02-26 11:23 ` [RFC v2 12/21] mm: thp: handle split failure in device migration Usama Arif
2026-02-26 11:23 ` [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
2026-02-26 11:23 ` [RFC v2 14/21] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-02-26 11:23 ` [RFC v2 15/21] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed Usama Arif
2026-02-26 11:23 ` [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
2026-02-26 14:22   ` Usama Arif
2026-02-26 11:23 ` [RFC v2 17/21] selftests/mm: add THP PMD split test infrastructure Usama Arif
2026-02-26 11:23 ` [RFC v2 18/21] selftests/mm: add partial_mprotect test for change_pmd_range Usama Arif
2026-02-26 11:23 ` [RFC v2 19/21] selftests/mm: add partial_mlock test Usama Arif
2026-02-26 11:23 ` [RFC v2 20/21] selftests/mm: add partial_mremap test for move_page_tables Usama Arif
2026-02-26 11:23 ` [RFC v2 21/21] selftests/mm: add madv_dontneed_partial test Usama Arif
2026-02-26 21:01 ` [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Nico Pache

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox