* [RFC 0/2] mm: thp: split time allocation of page table for THPs
@ 2026-02-11 12:49 Usama Arif
  2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
  2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
  0 siblings, 2 replies; 20+ messages in thread
From: Usama Arif @ 2026-02-11 12:49 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Usama Arif

This is an RFC to allocate the PTE page table for THPs only at split
time, instead of pre-depositing it, as suggested by David [1].
The core patch is the first one. The second one is not strictly needed;
it just adds the vmstat counter I used to show that split doesn't fail.
It is going to be 0 all the time, and I won't include it in future
revisions.

It would have been ideal to remove all of the pre-deposit code, but that
is not possible due to PowerPC. The rationale and further details,
including why the patch is safe, are covered in the commit message of
the first patch.

[1] https://lore.kernel.org/all/ee5bd77f-87ad-4640-a974-304b488e4c64@kernel.org/
 
Usama Arif (2):
  mm: thp: allocate PTE page tables lazily at split time
  mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter

 include/linux/huge_mm.h       |   4 +-
 include/linux/vm_event_item.h |   1 +
 mm/huge_memory.c              | 145 ++++++++++++++++++++++++----------
 mm/khugepaged.c               |   7 +-
 mm/migrate_device.c           |  15 ++--
 mm/rmap.c                     |  42 +++++++++-
 mm/vmstat.c                   |   1 +
 7 files changed, 162 insertions(+), 53 deletions(-)

-- 
2.47.3



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 12:49 [RFC 0/2] mm: thp: split time allocation of page table for THPs Usama Arif
@ 2026-02-11 12:49 ` Usama Arif
  2026-02-11 13:25   ` David Hildenbrand (Arm)
                     ` (2 more replies)
  2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
  1 sibling, 3 replies; 20+ messages in thread
From: Usama Arif @ 2026-02-11 12:49 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Usama Arif

When the kernel creates a PMD-level THP mapping for anonymous pages,
it pre-allocates a PTE page table and deposits it via
pgtable_trans_huge_deposit(). This deposited table is withdrawn during
PMD split or zap. The rationale was that split must not fail—if the
kernel decides to split a THP, it needs a PTE table to populate.

However, every anon THP wastes 4KB (one page table page) that sits
unused in the deposit list for the lifetime of the mapping. On systems
with many THPs, this adds up to significant memory waste. The original
rationale is also not a real concern: it is ok for split to fail, and if
the kernel can't even satisfy an order-0 allocation for the split, there
are much bigger problems. On large servers you can easily have 100s of
GBs of THPs, and with 2M THPs that is one deposited 4K table per 2M,
i.e. about 200M of page tables per 100G of THPs. This memory could be
used for any other purpose, including allocating the page tables that
are actually required during a split.

This patch removes the pre-deposit for anonymous pages on architectures
where arch_needs_pgtable_deposit() returns false (every arch apart from
powerpc, which only needs the deposit when the radix MMU is not enabled,
i.e. with the hash MMU) and allocates the PTE table lazily, only when a
split actually occurs. The split path is modified to accept a
caller-provided page table.

PowerPC exception:

It would have been great to remove the page table deposit code
completely, in which case this commit would mostly have been a code
cleanup patch. Unfortunately, the PowerPC hash MMU stores hash slot
information in the deposited page table, so pre-deposit remains
necessary there. All deposit/withdraw paths are guarded by
arch_needs_pgtable_deposit(), so PowerPC behavior is unchanged with this
patch. On the plus side, arch_needs_pgtable_deposit() always evaluates
to false at compile time on non-PowerPC architectures, so the
pre-deposit code is not compiled in there.
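
For reference, this works because of how arch_needs_pgtable_deposit() is
defined; roughly (simplified, details may vary between kernel versions):

/*
 * Generic fallback (include/linux/pgtable.h): constant false, so all
 * deposit/withdraw branches are optimized out on these architectures.
 */
#ifndef arch_needs_pgtable_deposit
#define arch_needs_pgtable_deposit() (false)
#endif

/* powerpc book3s64 override: the deposit is only needed for the hash MMU. */
static inline bool arch_needs_pgtable_deposit(void)
{
	if (radix_enabled())
		return false;
	return true;
}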

Why Split Failures Are Safe:

If a system is under memory pressure so severe that even a 4K allocation
for a PTE table fails, there are far greater problems than a THP split
being delayed. The OOM killer will likely intervene before this becomes
an issue.
When pte_alloc_one() fails because it cannot allocate a 4K page, the PMD
split is aborted and the THP remains intact. I could not get split to
fail in testing, as it is very difficult to make an order-0 allocation
fail. Code analysis of what would happen if it does:

- mprotect(): If the split fails in change_pmd_range, it will fall back
to change_pte_range, which will return an error, causing the whole
function to be retried.

- munmap() (partial THP range): zap_pte_range() returns early when
pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--.
For a full THP range, zap_huge_pmd() unmaps the entire PMD without
splitting. (The user-space example after this list exercises the
partial-range zap path.)

- Memory reclaim (try_to_unmap()): Returns false, the folio is rotated
back onto the LRU and retried in the next reclaim cycle.

- Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration
skips this folio, retried later.

- CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried.

- madvise (MADV_COLD/PAGEOUT): split_folio() internally calls
try_to_migrate() with TTU_SPLIT_HUGE_PMD. If the PMD split fails,
try_to_migrate() returns false, split_folio() returns -EAGAIN, and
madvise returns 0 (success), silently skipping the region. This should
be fine: madvise is only advisory and can fail for other reasons as
well.
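
For reference, a simple way to exercise the PMD split path from user
space is a partial-range MADV_DONTNEED on a THP-backed mapping, roughly
as sketched below (this assumes THP is enabled and that the fault
actually produced a PMD mapping; the split shows up as a thp_split_pmd
increment in /proc/vmstat):

#include <string.h>
#include <sys/mman.h>

#define SZ_2M	(2UL << 20)

int main(void)
{
	char *p;

	/* Over-allocate so we can align to a 2M boundary by hand. */
	p = mmap(NULL, 2 * SZ_2M, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	p = (char *)(((unsigned long)p + SZ_2M - 1) & ~(SZ_2M - 1));

	madvise(p, SZ_2M, MADV_HUGEPAGE);
	memset(p, 1, SZ_2M);	/* fault in, hopefully as a PMD-mapped THP */

	/* Partial-range zap: forces a PMD split in zap_pmd_range(). */
	madvise(p, 4096, MADV_DONTNEED);
	return 0;
}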

Suggested-by: David Hildenbrand <david@kernel.org>
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/huge_mm.h |   4 +-
 mm/huge_memory.c        | 144 ++++++++++++++++++++++++++++------------
 mm/khugepaged.c         |   7 +-
 mm/migrate_device.c     |  15 +++--
 mm/rmap.c               |  39 ++++++++++-
 5 files changed, 156 insertions(+), 53 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfdea..b21bb72a298c9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
 }
 
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
-			   pmd_t *pmd, bool freeze);
+			   pmd_t *pmd, bool freeze, pgtable_t pgtable);
 bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
 			   pmd_t *pmdp, struct folio *folio);
 void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
@@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
 		unsigned long address, bool freeze) {}
 static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long address, pmd_t *pmd,
-					 bool freeze) {}
+					 bool freeze, pgtable_t pgtable) {}
 
 static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
 					 unsigned long addr, pmd_t *pmdp,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 44ff8a648afd5..4c9a8d89fc8aa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
-	pgtable_t pgtable;
+	pgtable_t pgtable = NULL;
 	vm_fault_t ret = 0;
 
 	folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
 	if (unlikely(!folio))
 		return VM_FAULT_FALLBACK;
 
-	pgtable = pte_alloc_one(vma->vm_mm);
-	if (unlikely(!pgtable)) {
-		ret = VM_FAULT_OOM;
-		goto release;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (unlikely(!pgtable)) {
+			ret = VM_FAULT_OOM;
+			goto release;
+		}
 	}
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		if (userfaultfd_missing(vma)) {
 			spin_unlock(vmf->ptl);
 			folio_put(folio);
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 			ret = handle_userfault(vmf, VM_UFFD_MISSING);
 			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			return ret;
 		}
-		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+		if (pgtable) {
+			pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
+						   pgtable);
+			mm_inc_nr_ptes(vma->vm_mm);
+		}
 		map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
-		mm_inc_nr_ptes(vma->vm_mm);
 		spin_unlock(vmf->ptl);
 	}
 
@@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
 	pmd_t entry;
 	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
 	entry = pmd_mkspecial(entry);
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	if (pgtable) {
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		mm_inc_nr_ptes(mm);
+	}
 	set_pmd_at(mm, haddr, pmd, entry);
-	mm_inc_nr_ptes(mm);
 }
 
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
@@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm) &&
 			transparent_hugepage_use_zero_page()) {
-		pgtable_t pgtable;
+		pgtable_t pgtable = NULL;
 		struct folio *zero_folio;
 		vm_fault_t ret;
 
-		pgtable = pte_alloc_one(vma->vm_mm);
-		if (unlikely(!pgtable))
-			return VM_FAULT_OOM;
+		if (arch_needs_pgtable_deposit()) {
+			pgtable = pte_alloc_one(vma->vm_mm);
+			if (unlikely(!pgtable))
+				return VM_FAULT_OOM;
+		}
 		zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
 		if (unlikely(!zero_folio)) {
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 			count_vm_event(THP_FAULT_FALLBACK);
 			return VM_FAULT_FALLBACK;
 		}
@@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			ret = check_stable_address_space(vma->vm_mm);
 			if (ret) {
 				spin_unlock(vmf->ptl);
-				pte_free(vma->vm_mm, pgtable);
+				if (pgtable)
+					pte_free(vma->vm_mm, pgtable);
 			} else if (userfaultfd_missing(vma)) {
 				spin_unlock(vmf->ptl);
-				pte_free(vma->vm_mm, pgtable);
+				if (pgtable)
+					pte_free(vma->vm_mm, pgtable);
 				ret = handle_userfault(vmf, VM_UFFD_MISSING);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			} else {
@@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			}
 		} else {
 			spin_unlock(vmf->ptl);
-			pte_free(vma->vm_mm, pgtable);
+			if (pgtable)
+				pte_free(vma->vm_mm, pgtable);
 		}
 		return ret;
 	}
@@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd(
 	}
 
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-	mm_inc_nr_ptes(dst_mm);
-	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (pgtable) {
+		mm_inc_nr_ptes(dst_mm);
+		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	}
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_swp_clear_uffd_wp(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!vma_is_anonymous(dst_vma))
 		return 0;
 
-	pgtable = pte_alloc_one(dst_mm);
-	if (unlikely(!pgtable))
-		goto out;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(dst_mm);
+		if (unlikely(!pgtable))
+			goto out;
+	}
 
 	dst_ptl = pmd_lock(dst_mm, dst_pmd);
 	src_ptl = pmd_lockptr(src_mm, src_pmd);
@@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 	if (unlikely(!pmd_trans_huge(pmd))) {
-		pte_free(dst_mm, pgtable);
+		if (pgtable)
+			pte_free(dst_mm, pgtable);
 		goto out_unlock;
 	}
 	/*
@@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
 		/* Page maybe pinned: split and retry the fault on PTEs. */
 		folio_put(src_folio);
-		pte_free(dst_mm, pgtable);
+		if (pgtable)
+			pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
 		__split_huge_pmd(src_vma, src_pmd, addr, false);
@@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 out_zero_page:
-	mm_inc_nr_ptes(dst_mm);
-	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (pgtable) {
+		mm_inc_nr_ptes(dst_mm);
+		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	}
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_clear_uffd_wp(pmd);
@@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
 	} else if (is_huge_zero_pmd(orig_pmd)) {
-		if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
+		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
 	} else {
@@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		}
 
 		if (folio_test_anon(folio)) {
-			zap_deposited_table(tlb->mm, pmd);
+			if (arch_needs_pgtable_deposit())
+				zap_deposited_table(tlb->mm, pmd);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 		} else {
 			if (arch_needs_pgtable_deposit())
@@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			force_flush = true;
 		VM_BUG_ON(!pmd_none(*new_pmd));
 
-		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
+		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
+		    arch_needs_pgtable_deposit()) {
 			pgtable_t pgtable;
 			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
 			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
@@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 	}
 	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
 
-	src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
-	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+	if (arch_needs_pgtable_deposit()) {
+		src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+		pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+	}
 unlock_ptls:
 	double_pt_unlock(src_ptl, dst_ptl);
 	/* unblock rmap walks */
@@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
-		unsigned long haddr, pmd_t *pmd)
+		unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	pgtable_t pgtable;
 	pmd_t _pmd, old_pmd;
 	unsigned long addr;
 	pte_t *pte;
@@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	 */
 	old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
 
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	} else {
+		VM_BUG_ON(!pgtable);
+		/*
+		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
+		 * being used in mm.
+		 */
+		mm_inc_nr_ptes(mm);
+	}
 	pmd_populate(mm, &_pmd, pgtable);
 
 	pte = pte_offset_map(&_pmd, haddr);
@@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 }
 
 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long haddr, bool freeze)
+		unsigned long haddr, bool freeze, pgtable_t pgtable)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct folio *folio;
 	struct page *page;
-	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
 	bool soft_dirty, uffd_wp = false, young = false, write = false;
 	bool anon_exclusive = false, dirty = false;
@@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 */
 		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(mm, pmd);
+		if (pgtable)
+			pte_free(mm, pgtable);
 		if (!vma_is_dax(vma) && vma_is_special_huge(vma))
 			return;
 		if (unlikely(pmd_is_migration_entry(old_pmd))) {
@@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * small page also write protected so it does not seems useful
 		 * to invalidate secondary mmu at this time.
 		 */
-		return __split_huge_zero_page_pmd(vma, haddr, pmd);
+		return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
 	}
 
 	if (pmd_is_migration_entry(*pmd)) {
@@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * Withdraw the table only after we mark the pmd entry invalid.
 	 * This's critical for some architectures (Power).
 	 */
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	} else {
+		VM_BUG_ON(!pgtable);
+		/*
+		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
+		 * being used in mm.
+		 */
+		mm_inc_nr_ptes(mm);
+	}
 	pmd_populate(mm, &_pmd, pgtable);
 
 	pte = pte_offset_map(&_pmd, haddr);
@@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 }
 
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
-			   pmd_t *pmd, bool freeze)
+			   pmd_t *pmd, bool freeze, pgtable_t pgtable)
 {
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
 	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
-		__split_huge_pmd_locked(vma, pmd, address, freeze);
+		__split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
+	else if (pgtable)
+		pte_free(vma->vm_mm, pgtable);
 }
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	spinlock_t *ptl;
 	struct mmu_notifier_range range;
+	pgtable_t pgtable = NULL;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
 				address & HPAGE_PMD_MASK,
 				(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
+
+	/* allocate pagetable before acquiring pmd lock */
+	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (!pgtable) {
+			mmu_notifier_invalidate_range_end(&range);
+			return;
+		}
+	}
+
 	ptl = pmd_lock(vma->vm_mm, pmd);
-	split_huge_pmd_locked(vma, range.start, pmd, freeze);
+	split_huge_pmd_locked(vma, range.start, pmd, freeze, pgtable);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
 }
@@ -3402,7 +3459,8 @@ static bool __discard_anon_folio_pmd_locked(struct vm_area_struct *vma,
 	}
 
 	folio_remove_rmap_pmd(folio, pmd_page(orig_pmd), vma);
-	zap_deposited_table(mm, pmdp);
+	if (arch_needs_pgtable_deposit())
+		zap_deposited_table(mm, pmdp);
 	add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 	if (vma->vm_flags & VM_LOCKED)
 		mlock_drain_local();
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fa1e57fd2c469..0e976e4c975ef 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1223,7 +1223,12 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	} else {
+		mm_dec_nr_ptes(mm);
+		pte_free(mm, pgtable);
+	}
 	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
 	spin_unlock(pmd_ptl);
 
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 0a8b31939640f..053db74303e36 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -829,9 +829,13 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 
 	__folio_mark_uptodate(folio);
 
-	pgtable = pte_alloc_one(vma->vm_mm);
-	if (unlikely(!pgtable))
-		goto abort;
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (unlikely(!pgtable))
+			goto abort;
+	} else {
+		pgtable = NULL;
+	}
 
 	if (folio_is_device_private(folio)) {
 		swp_entry_t swp_entry;
@@ -879,10 +883,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 	folio_get(folio);
 
 	if (flush) {
-		pte_free(vma->vm_mm, pgtable);
+		if (pgtable)
+			pte_free(vma->vm_mm, pgtable);
 		flush_cache_page(vma, addr, addr + HPAGE_PMD_SIZE);
 		pmdp_invalidate(vma, addr, pmdp);
-	} else {
+	} else if (pgtable) {
 		pgtable_trans_huge_deposit(vma->vm_mm, pmdp, pgtable);
 		mm_inc_nr_ptes(vma->vm_mm);
 	}
diff --git a/mm/rmap.c b/mm/rmap.c
index edf5d32f46042..c6ff23fc12944 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -76,6 +76,7 @@
 #include <linux/mm_inline.h>
 #include <linux/oom.h>
 
+#include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
 #define CREATE_TRACE_POINTS
@@ -1978,6 +1979,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	unsigned long pfn;
 	unsigned long hsz = 0;
 	int ptes = 0;
+	pgtable_t prealloc_pte = NULL;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -2012,6 +2014,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	}
 	mmu_notifier_invalidate_range_start(&range);
 
+	if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
+	    !arch_needs_pgtable_deposit())
+		prealloc_pte = pte_alloc_one(mm);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		/*
 		 * If the folio is in an mlock()d vma, we must not swap it out.
@@ -2061,12 +2067,21 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			}
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
+				pgtable_t pgtable = prealloc_pte;
+
+				prealloc_pte = NULL;
+				if (!arch_needs_pgtable_deposit() && !pgtable &&
+				    vma_is_anonymous(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
 				/*
 				 * We temporarily have to drop the PTL and
 				 * restart so we can process the PTE-mapped THP.
 				 */
 				split_huge_pmd_locked(vma, pvmw.address,
-						      pvmw.pmd, false);
+						      pvmw.pmd, false, pgtable);
 				flags &= ~TTU_SPLIT_HUGE_PMD;
 				page_vma_mapped_walk_restart(&pvmw);
 				continue;
@@ -2346,6 +2361,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		break;
 	}
 
+	if (prealloc_pte)
+		pte_free(mm, prealloc_pte);
+
 	mmu_notifier_invalidate_range_end(&range);
 
 	return ret;
@@ -2405,6 +2423,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	unsigned long pfn;
 	unsigned long hsz = 0;
+	pgtable_t prealloc_pte = NULL;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -2439,6 +2458,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	}
 	mmu_notifier_invalidate_range_start(&range);
 
+	if ((flags & TTU_SPLIT_HUGE_PMD) && vma_is_anonymous(vma) &&
+	    !arch_needs_pgtable_deposit())
+		prealloc_pte = pte_alloc_one(mm);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		/* PMD-mapped THP migration entry */
 		if (!pvmw.pte) {
@@ -2446,8 +2469,17 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			__maybe_unused pmd_t pmdval;
 
 			if (flags & TTU_SPLIT_HUGE_PMD) {
+				pgtable_t pgtable = prealloc_pte;
+
+				prealloc_pte = NULL;
+				if (!arch_needs_pgtable_deposit() && !pgtable &&
+				    vma_is_anonymous(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
 				split_huge_pmd_locked(vma, pvmw.address,
-						      pvmw.pmd, true);
+						      pvmw.pmd, true, pgtable);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
 				break;
@@ -2698,6 +2730,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		folio_put(folio);
 	}
 
+	if (prealloc_pte)
+		pte_free(mm, prealloc_pte);
+
 	mmu_notifier_invalidate_range_end(&range);
 
 	return ret;
-- 
2.47.3



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 12:49 [RFC 0/2] mm: thp: split time allocation of page table for THPs Usama Arif
  2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
@ 2026-02-11 12:49 ` Usama Arif
  2026-02-11 13:27   ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 20+ messages in thread
From: Usama Arif @ 2026-02-11 12:49 UTC (permalink / raw)
  To: Andrew Morton, david, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Usama Arif

Add a vmstat counter to track PTE allocation failures during PMD split.
This enables monitoring of split failures due to memory pressure after
the lazy PTE page table allocation change.

The counter is incremented in three places:
- __split_huge_pmd(): Main entry point for splitting a PMD
- try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
- try_to_migrate_one(): When migration needs to split a PMD-mapped THP

Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 include/linux/vm_event_item.h | 1 +
 mm/huge_memory.c              | 1 +
 mm/rmap.c                     | 3 +++
 mm/vmstat.c                   | 1 +
 4 files changed, 6 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 22a139f82d75f..827c9a8c251de 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_DEFERRED_SPLIT_PAGE,
 		THP_UNDERUSED_SPLIT_PAGE,
 		THP_SPLIT_PMD,
+		THP_SPLIT_PMD_PTE_ALLOC_FAILED,
 		THP_SCAN_EXCEED_NONE_PTE,
 		THP_SCAN_EXCEED_SWAP_PTE,
 		THP_SCAN_EXCEED_SHARED_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4c9a8d89fc8aa..8d7c9f67f8a1d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3332,6 +3332,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
 		if (!pgtable) {
+			count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
 			mmu_notifier_invalidate_range_end(&range);
 			return;
 		}
diff --git a/mm/rmap.c b/mm/rmap.c
index c6ff23fc12944..5c4afedb29d5a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2070,8 +2070,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				pgtable_t pgtable = prealloc_pte;
 
 				prealloc_pte = NULL;
+
 				if (!arch_needs_pgtable_deposit() && !pgtable &&
 				    vma_is_anonymous(vma)) {
+					count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
 					page_vma_mapped_walk_done(&pvmw);
 					ret = false;
 					break;
@@ -2474,6 +2476,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				prealloc_pte = NULL;
 				if (!arch_needs_pgtable_deposit() && !pgtable &&
 				    vma_is_anonymous(vma)) {
+					count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
 					page_vma_mapped_walk_done(&pvmw);
 					ret = false;
 					break;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 99270713e0c13..473edfa624a41 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1408,6 +1408,7 @@ const char * const vmstat_text[] = {
 	[I(THP_DEFERRED_SPLIT_PAGE)]		= "thp_deferred_split_page",
 	[I(THP_UNDERUSED_SPLIT_PAGE)]		= "thp_underused_split_page",
 	[I(THP_SPLIT_PMD)]			= "thp_split_pmd",
+	[I(THP_SPLIT_PMD_PTE_ALLOC_FAILED)]	= "thp_split_pmd_pte_alloc_failed",
 	[I(THP_SCAN_EXCEED_NONE_PTE)]		= "thp_scan_exceed_none_pte",
 	[I(THP_SCAN_EXCEED_SWAP_PTE)]		= "thp_scan_exceed_swap_pte",
 	[I(THP_SCAN_EXCEED_SHARED_PTE)]		= "thp_scan_exceed_share_pte",
-- 
2.47.3



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
@ 2026-02-11 13:25   ` David Hildenbrand (Arm)
  2026-02-11 13:38     ` Usama Arif
  2026-02-12 12:13     ` Ritesh Harjani
  2026-02-11 13:35   ` David Hildenbrand (Arm)
  2026-02-11 19:28   ` Matthew Wilcox
  2 siblings, 2 replies; 20+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:25 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
	Michael Ellerman, linuxppc-dev

CCing ppc folks

On 2/11/26 13:49, Usama Arif wrote:
> When the kernel creates a PMD-level THP mapping for anonymous pages,
> it pre-allocates a PTE page table and deposits it via
> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
> PMD split or zap. The rationale was that split must not fail—if the
> kernel decides to split a THP, it needs a PTE table to populate.
> 
> However, every anon THP wastes 4KB (one page table page) that sits
> unused in the deposit list for the lifetime of the mapping. On systems
> with many THPs, this adds up to significant memory waste. The original
> rationale is also not an issue. It is ok for split to fail, and if the
> kernel can't find an order 0 allocation for split, there are much bigger
> problems. On large servers where you can easily have 100s of GBs of THPs,
> the memory usage for these tables is 200M per 100G. This memory could be
> used for any other usecase, which include allocating the pagetables
> required during split.
> 
> This patch removes the pre-deposit for anonymous pages on architectures
> where arch_needs_pgtable_deposit() returns false (every arch apart from
> powerpc, and only when radix hash tables are not enabled) and allocates
> the PTE table lazily—only when a split actually occurs. The split path
> is modified to accept a caller-provided page table.
> 
> PowerPC exception:
> 
> It would have been great if we can completely remove the pagetable
> deposit code and this commit would mostly have been a code cleanup patch,
> unfortunately PowerPC has hash MMU, it stores hash slot information in
> the deposited page table and pre-deposit is necessary. All deposit/
> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
> behavior is unchanged with this patch. On a better note,
> arch_needs_pgtable_deposit will always evaluate to false at compile time
> on non PowerPC architectures and the pre-deposit code will not be
> compiled in.

Is there a way to remove this? It's always been a confusing hack, now 
it's unpleasant to have around :)

In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 
copied generic pgtable_trans_huge_deposit() hurts my belly.


IIUC, hash is mostly used on legacy power systems, radix on newer ones.

So one obvious solution: remove PMD THP support for hash MMUs along with 
all this hacky deposit code.


the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar 
checks need to be wrapped in a reasonable helper and likely this all 
needs to get cleaned up further.
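
Something along these lines (the name is just a placeholder):

/* Placeholder name: centralize the "allocate at split time?" check. */
static inline bool thp_split_needs_pgtable_alloc(struct vm_area_struct *vma)
{
	return vma_is_anonymous(vma) && !arch_needs_pgtable_deposit();
}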

The implementation of the generic pgtable_trans_huge_deposit() and the
radix handlers etc. must be removed. If any code were to trigger them,
it would be a bug.

If we have to keep this around, pgtable_trans_huge_deposit() should 
likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there 
will not be generic support for it.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
@ 2026-02-11 13:27   ` David Hildenbrand (Arm)
  2026-02-11 13:31     ` Usama Arif
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:27 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team

On 2/11/26 13:49, Usama Arif wrote:
> Add a vmstat counter to track PTE allocation failures during PMD split.
> This enables monitoring of split failures due to memory pressure after
> the lazy PTE page table allocation change.
> 
> The counter is incremented in three places:
> - __split_huge_pmd(): Main entry point for splitting a PMD
> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
> 
> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
> 
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
>   include/linux/vm_event_item.h | 1 +
>   mm/huge_memory.c              | 1 +
>   mm/rmap.c                     | 3 +++
>   mm/vmstat.c                   | 1 +
>   4 files changed, 6 insertions(+)
> 
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 22a139f82d75f..827c9a8c251de 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>   		THP_DEFERRED_SPLIT_PAGE,
>   		THP_UNDERUSED_SPLIT_PAGE,
>   		THP_SPLIT_PMD,
> +		THP_SPLIT_PMD_PTE_ALLOC_FAILED,

Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any 
(future) failures (if any) as well.

It's a shame that we called a remapping a "split" and keep causing 
confusion.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 13:27   ` David Hildenbrand (Arm)
@ 2026-02-11 13:31     ` Usama Arif
  2026-02-11 13:36       ` David Hildenbrand (Arm)
  2026-02-11 13:38       ` David Hildenbrand (Arm)
  0 siblings, 2 replies; 20+ messages in thread
From: Usama Arif @ 2026-02-11 13:31 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team



On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
> On 2/11/26 13:49, Usama Arif wrote:
>> Add a vmstat counter to track PTE allocation failures during PMD split.
>> This enables monitoring of split failures due to memory pressure after
>> the lazy PTE page table allocation change.
>>
>> The counter is incremented in three places:
>> - __split_huge_pmd(): Main entry point for splitting a PMD
>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>
>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>>   include/linux/vm_event_item.h | 1 +
>>   mm/huge_memory.c              | 1 +
>>   mm/rmap.c                     | 3 +++
>>   mm/vmstat.c                   | 1 +
>>   4 files changed, 6 insertions(+)
>>
>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>> index 22a139f82d75f..827c9a8c251de 100644
>> --- a/include/linux/vm_event_item.h
>> +++ b/include/linux/vm_event_item.h
>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>           THP_DEFERRED_SPLIT_PAGE,
>>           THP_UNDERUSED_SPLIT_PAGE,
>>           THP_SPLIT_PMD,
>> +        THP_SPLIT_PMD_PTE_ALLOC_FAILED,
> 
> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
> 

Makes sense. This was just a patch I was using for testing and wanted to share.
It was always 0 as I couldn't get split to fail :) But I can rename it to THP_SPLIT_PMD_FAILED
as suggested, and we can use it for future split failures (hopefully none).

> It's a shame that we called a remapping a "split" and keep causing confusion.
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
  2026-02-11 13:25   ` David Hildenbrand (Arm)
@ 2026-02-11 13:35   ` David Hildenbrand (Arm)
  2026-02-11 13:46     ` Kiryl Shutsemau
  2026-02-11 13:47     ` Usama Arif
  2026-02-11 19:28   ` Matthew Wilcox
  2 siblings, 2 replies; 20+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:35 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team

On 2/11/26 13:49, Usama Arif wrote:
> When the kernel creates a PMD-level THP mapping for anonymous pages,
> it pre-allocates a PTE page table and deposits it via
> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
> PMD split or zap. The rationale was that split must not fail—if the
> kernel decides to split a THP, it needs a PTE table to populate.
> 
> However, every anon THP wastes 4KB (one page table page) that sits
> unused in the deposit list for the lifetime of the mapping. On systems
> with many THPs, this adds up to significant memory waste. The original
> rationale is also not an issue. It is ok for split to fail, and if the
> kernel can't find an order 0 allocation for split, there are much bigger
> problems. On large servers where you can easily have 100s of GBs of THPs,
> the memory usage for these tables is 200M per 100G. This memory could be
> used for any other usecase, which include allocating the pagetables
> required during split.
> 
> This patch removes the pre-deposit for anonymous pages on architectures
> where arch_needs_pgtable_deposit() returns false (every arch apart from
> powerpc, and only when radix hash tables are not enabled) and allocates
> the PTE table lazily—only when a split actually occurs. The split path
> is modified to accept a caller-provided page table.
> 
> PowerPC exception:
> 
> It would have been great if we can completely remove the pagetable
> deposit code and this commit would mostly have been a code cleanup patch,
> unfortunately PowerPC has hash MMU, it stores hash slot information in
> the deposited page table and pre-deposit is necessary. All deposit/
> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
> behavior is unchanged with this patch. On a better note,
> arch_needs_pgtable_deposit will always evaluate to false at compile time
> on non PowerPC architectures and the pre-deposit code will not be
> compiled in.
> 
> Why Split Failures Are Safe:
> 
> If a system is under severe memory pressure that even a 4K allocation
> fails for a PTE table, there are far greater problems than a THP split
> being delayed. The OOM killer will likely intervene before this becomes an
> issue.
> When pte_alloc_one() fails due to not being able to allocate a 4K page,
> the PMD split is aborted and the THP remains intact. I could not get split
> to fail, as its very difficult to make order-0 allocation to fail.
> Code analysis of what would happen if it does:
> 
> - mprotect(): If split fails in change_pmd_range, it will fallback
> to change_pte_range, which will return an error which will cause the
> whole function to be retried again.
> 
> - munmap() (partial THP range): zap_pte_range() returns early when
> pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--.
> For full THP range, zap_huge_pmd() unmaps the entire PMD without
> split.
> 
> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
> LRU, retried in next reclaim cycle.
> 
> - Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration
> skips this folio, retried later.
> 
> - CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried.
> 
> -  madvise (MADV_COLD/PAGEOUT): split_folio() internally calls
> try_to_migrate() with TTU_SPLIT_HUGE_PMD. If PMD split fails,
> try_to_migrate() returns false, split_folio() returns -EAGAIN,
> and madvise returns 0 (success) silently skipping the region. This
> should be fine. madvise is just an advice and can fail for other
> reasons as well.
> 
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
>   include/linux/huge_mm.h |   4 +-
>   mm/huge_memory.c        | 144 ++++++++++++++++++++++++++++------------
>   mm/khugepaged.c         |   7 +-
>   mm/migrate_device.c     |  15 +++--
>   mm/rmap.c               |  39 ++++++++++-
>   5 files changed, 156 insertions(+), 53 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index a4d9f964dfdea..b21bb72a298c9 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
>   }
>   
>   void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> -			   pmd_t *pmd, bool freeze);
> +			   pmd_t *pmd, bool freeze, pgtable_t pgtable);
>   bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
>   			   pmd_t *pmdp, struct folio *folio);
>   void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
> @@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
>   		unsigned long address, bool freeze) {}
>   static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
>   					 unsigned long address, pmd_t *pmd,
> -					 bool freeze) {}
> +					 bool freeze, pgtable_t pgtable) {}
>   
>   static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
>   					 unsigned long addr, pmd_t *pmdp,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 44ff8a648afd5..4c9a8d89fc8aa 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>   	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>   	struct vm_area_struct *vma = vmf->vma;
>   	struct folio *folio;
> -	pgtable_t pgtable;
> +	pgtable_t pgtable = NULL;
>   	vm_fault_t ret = 0;
>   
>   	folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
>   	if (unlikely(!folio))
>   		return VM_FAULT_FALLBACK;
>   
> -	pgtable = pte_alloc_one(vma->vm_mm);
> -	if (unlikely(!pgtable)) {
> -		ret = VM_FAULT_OOM;
> -		goto release;
> +	if (arch_needs_pgtable_deposit()) {
> +		pgtable = pte_alloc_one(vma->vm_mm);
> +		if (unlikely(!pgtable)) {
> +			ret = VM_FAULT_OOM;
> +			goto release;
> +		}
>   	}
>   
>   	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> @@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>   		if (userfaultfd_missing(vma)) {
>   			spin_unlock(vmf->ptl);
>   			folio_put(folio);
> -			pte_free(vma->vm_mm, pgtable);
> +			if (pgtable)
> +				pte_free(vma->vm_mm, pgtable);
>   			ret = handle_userfault(vmf, VM_UFFD_MISSING);
>   			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
>   			return ret;
>   		}
> -		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> +		if (pgtable) {
> +			pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
> +						   pgtable);
> +			mm_inc_nr_ptes(vma->vm_mm);
> +		}
>   		map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
> -		mm_inc_nr_ptes(vma->vm_mm);
>   		spin_unlock(vmf->ptl);
>   	}
>   
> @@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
>   	pmd_t entry;
>   	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
>   	entry = pmd_mkspecial(entry);
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +	if (pgtable) {
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		mm_inc_nr_ptes(mm);
> +	}
>   	set_pmd_at(mm, haddr, pmd, entry);
> -	mm_inc_nr_ptes(mm);
>   }
>   
>   vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> @@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>   	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>   			!mm_forbids_zeropage(vma->vm_mm) &&
>   			transparent_hugepage_use_zero_page()) {
> -		pgtable_t pgtable;
> +		pgtable_t pgtable = NULL;
>   		struct folio *zero_folio;
>   		vm_fault_t ret;
>   
> -		pgtable = pte_alloc_one(vma->vm_mm);
> -		if (unlikely(!pgtable))
> -			return VM_FAULT_OOM;
> +		if (arch_needs_pgtable_deposit()) {
> +			pgtable = pte_alloc_one(vma->vm_mm);
> +			if (unlikely(!pgtable))
> +				return VM_FAULT_OOM;
> +		}
>   		zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
>   		if (unlikely(!zero_folio)) {
> -			pte_free(vma->vm_mm, pgtable);
> +			if (pgtable)
> +				pte_free(vma->vm_mm, pgtable);
>   			count_vm_event(THP_FAULT_FALLBACK);
>   			return VM_FAULT_FALLBACK;
>   		}
> @@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>   			ret = check_stable_address_space(vma->vm_mm);
>   			if (ret) {
>   				spin_unlock(vmf->ptl);
> -				pte_free(vma->vm_mm, pgtable);
> +				if (pgtable)
> +					pte_free(vma->vm_mm, pgtable);
>   			} else if (userfaultfd_missing(vma)) {
>   				spin_unlock(vmf->ptl);
> -				pte_free(vma->vm_mm, pgtable);
> +				if (pgtable)
> +					pte_free(vma->vm_mm, pgtable);
>   				ret = handle_userfault(vmf, VM_UFFD_MISSING);
>   				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
>   			} else {
> @@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>   			}
>   		} else {
>   			spin_unlock(vmf->ptl);
> -			pte_free(vma->vm_mm, pgtable);
> +			if (pgtable)
> +				pte_free(vma->vm_mm, pgtable);
>   		}
>   		return ret;
>   	}
> @@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd(
>   	}
>   
>   	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> -	mm_inc_nr_ptes(dst_mm);
> -	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> +	if (pgtable) {
> +		mm_inc_nr_ptes(dst_mm);
> +		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> +	}
>   	if (!userfaultfd_wp(dst_vma))
>   		pmd = pmd_swp_clear_uffd_wp(pmd);
>   	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> @@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   	if (!vma_is_anonymous(dst_vma))
>   		return 0;
>   
> -	pgtable = pte_alloc_one(dst_mm);
> -	if (unlikely(!pgtable))
> -		goto out;
> +	if (arch_needs_pgtable_deposit()) {
> +		pgtable = pte_alloc_one(dst_mm);
> +		if (unlikely(!pgtable))
> +			goto out;
> +	}
>   
>   	dst_ptl = pmd_lock(dst_mm, dst_pmd);
>   	src_ptl = pmd_lockptr(src_mm, src_pmd);
> @@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   	}
>   
>   	if (unlikely(!pmd_trans_huge(pmd))) {
> -		pte_free(dst_mm, pgtable);
> +		if (pgtable)
> +			pte_free(dst_mm, pgtable);
>   		goto out_unlock;
>   	}
>   	/*
> @@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
>   		/* Page maybe pinned: split and retry the fault on PTEs. */
>   		folio_put(src_folio);
> -		pte_free(dst_mm, pgtable);
> +		if (pgtable)
> +			pte_free(dst_mm, pgtable);
>   		spin_unlock(src_ptl);
>   		spin_unlock(dst_ptl);
>   		__split_huge_pmd(src_vma, src_pmd, addr, false);
> @@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   	}
>   	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>   out_zero_page:
> -	mm_inc_nr_ptes(dst_mm);
> -	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> +	if (pgtable) {
> +		mm_inc_nr_ptes(dst_mm);
> +		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> +	}
>   	pmdp_set_wrprotect(src_mm, addr, src_pmd);
>   	if (!userfaultfd_wp(dst_vma))
>   		pmd = pmd_clear_uffd_wp(pmd);
> @@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   			zap_deposited_table(tlb->mm, pmd);
>   		spin_unlock(ptl);
>   	} else if (is_huge_zero_pmd(orig_pmd)) {
> -		if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
> +		if (arch_needs_pgtable_deposit())
>   			zap_deposited_table(tlb->mm, pmd);
>   		spin_unlock(ptl);
>   	} else {
> @@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   		}
>   
>   		if (folio_test_anon(folio)) {
> -			zap_deposited_table(tlb->mm, pmd);
> +			if (arch_needs_pgtable_deposit())
> +				zap_deposited_table(tlb->mm, pmd);
>   			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
>   		} else {
>   			if (arch_needs_pgtable_deposit())
> @@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>   			force_flush = true;
>   		VM_BUG_ON(!pmd_none(*new_pmd));
>   
> -		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
> +		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
> +		    arch_needs_pgtable_deposit()) {
>   			pgtable_t pgtable;
>   			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
>   			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
> @@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>   	}
>   	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
>   
> -	src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> -	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> +	if (arch_needs_pgtable_deposit()) {
> +		src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> +		pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> +	}
>   unlock_ptls:
>   	double_pt_unlock(src_ptl, dst_ptl);
>   	/* unblock rmap walks */
> @@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>   #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>   
>   static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> -		unsigned long haddr, pmd_t *pmd)
> +		unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
>   {
>   	struct mm_struct *mm = vma->vm_mm;
> -	pgtable_t pgtable;
>   	pmd_t _pmd, old_pmd;
>   	unsigned long addr;
>   	pte_t *pte;
> @@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>   	 */
>   	old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
>   
> -	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> +	if (arch_needs_pgtable_deposit()) {
> +		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> +	} else {
> +		VM_BUG_ON(!pgtable);
> +		/*
> +		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
> +		 * being used in mm.
> +		 */
> +		mm_inc_nr_ptes(mm);
> +	}
>   	pmd_populate(mm, &_pmd, pgtable);
>   
>   	pte = pte_offset_map(&_pmd, haddr);
> @@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>   }
>   
>   static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> -		unsigned long haddr, bool freeze)
> +		unsigned long haddr, bool freeze, pgtable_t pgtable)
>   {
>   	struct mm_struct *mm = vma->vm_mm;
>   	struct folio *folio;
>   	struct page *page;
> -	pgtable_t pgtable;
>   	pmd_t old_pmd, _pmd;
>   	bool soft_dirty, uffd_wp = false, young = false, write = false;
>   	bool anon_exclusive = false, dirty = false;
> @@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   		 */
>   		if (arch_needs_pgtable_deposit())
>   			zap_deposited_table(mm, pmd);
> +		if (pgtable)
> +			pte_free(mm, pgtable);
>   		if (!vma_is_dax(vma) && vma_is_special_huge(vma))
>   			return;
>   		if (unlikely(pmd_is_migration_entry(old_pmd))) {
> @@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   		 * small page also write protected so it does not seems useful
>   		 * to invalidate secondary mmu at this time.
>   		 */
> -		return __split_huge_zero_page_pmd(vma, haddr, pmd);
> +		return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
>   	}
>   
>   	if (pmd_is_migration_entry(*pmd)) {
> @@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   	 * Withdraw the table only after we mark the pmd entry invalid.
>   	 * This's critical for some architectures (Power).
>   	 */
> -	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> +	if (arch_needs_pgtable_deposit()) {
> +		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> +	} else {
> +		VM_BUG_ON(!pgtable);
> +		/*
> +		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
> +		 * being used in mm.
> +		 */
> +		mm_inc_nr_ptes(mm);
> +	}
>   	pmd_populate(mm, &_pmd, pgtable);
>   
>   	pte = pte_offset_map(&_pmd, haddr);
> @@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   }
>   
>   void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> -			   pmd_t *pmd, bool freeze)
> +			   pmd_t *pmd, bool freeze, pgtable_t pgtable)
>   {
>   	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>   	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
> -		__split_huge_pmd_locked(vma, pmd, address, freeze);
> +		__split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
> +	else if (pgtable)
> +		pte_free(vma->vm_mm, pgtable);
>   }
>   
>   void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> @@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>   {
>   	spinlock_t *ptl;
>   	struct mmu_notifier_range range;
> +	pgtable_t pgtable = NULL;
>   
>   	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
>   				address & HPAGE_PMD_MASK,
>   				(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
>   	mmu_notifier_invalidate_range_start(&range);
> +
> +	/* allocate pagetable before acquiring pmd lock */
> +	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
> +		pgtable = pte_alloc_one(vma->vm_mm);
> +		if (!pgtable) {
> +			mmu_notifier_invalidate_range_end(&range);

When I last looked at this, I thought the clean thing to do is to let
__split_huge_pmd() and friends return an error.

Let's take a look at walk_pmd_range() as one example:

if (walk->vma)
	split_huge_pmd(walk->vma, pmd, addr);
else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
	continue;

err = walk_pte_range(pmd, addr, next, walk);

Where walk_pte_range() just does a pte_offset_map_lock.

	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);

But if that fails (because the remapping failed), we will silently skip
this range.

I don't think silently skipping is the right thing to do.

So I would think that all splitting functions have to be taught to
return an error, and callers have to handle it accordingly. Then we can
actually start returning errors.
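
Roughly, just to illustrate the direction (not a proposal for the exact
interface):

if (walk->vma) {
	err = split_huge_pmd(walk->vma, pmd, addr);
	if (err)
		break;
} else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
	continue;

err = walk_pte_range(pmd, addr, next, walk);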

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 13:31     ` Usama Arif
@ 2026-02-11 13:36       ` David Hildenbrand (Arm)
  2026-02-11 13:42         ` Usama Arif
  2026-02-11 13:38       ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 20+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:36 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team

On 2/11/26 14:31, Usama Arif wrote:
> 
> 
> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
>> On 2/11/26 13:49, Usama Arif wrote:
>>> Add a vmstat counter to track PTE allocation failures during PMD split.
>>> This enables monitoring of split failures due to memory pressure after
>>> the lazy PTE page table allocation change.
>>>
>>> The counter is incremented in three places:
>>> - __split_huge_pmd(): Main entry point for splitting a PMD
>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>>
>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>>
>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>> ---
>>>    include/linux/vm_event_item.h | 1 +
>>>    mm/huge_memory.c              | 1 +
>>>    mm/rmap.c                     | 3 +++
>>>    mm/vmstat.c                   | 1 +
>>>    4 files changed, 6 insertions(+)
>>>
>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>> index 22a139f82d75f..827c9a8c251de 100644
>>> --- a/include/linux/vm_event_item.h
>>> +++ b/include/linux/vm_event_item.h
>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>>            THP_DEFERRED_SPLIT_PAGE,
>>>            THP_UNDERUSED_SPLIT_PAGE,
>>>            THP_SPLIT_PMD,
>>> +        THP_SPLIT_PMD_PTE_ALLOC_FAILED,
>>
>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
>>
> 
> Makes sense. This was just a patch I was using for testing and I wanted to share.
> It was always 0 as I couldnt get split to fail :) But I can rename it as THP_SPLIT_PMD_FAILED
> as suggested and we can use for future split failures (hopefully none).

I guess it might be reasonable to have because I am sure it will fail at 
some point and maybe provoke weird issues we didn't think of. In that 
case, having an indication that splitting failed at some point might be 
reasonable.
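
For context, I'd expect the counter to get bumped right at the
pte_alloc_one() failure site from patch 1, something like this (rough
sketch only, the real hunks are in the patch itself):

	pgtable = pte_alloc_one(vma->vm_mm);
	if (!pgtable) {
		count_vm_event(THP_SPLIT_PMD_PTE_ALLOC_FAILED);
		mmu_notifier_invalidate_range_end(&range);
		return;
	}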

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 13:25   ` David Hildenbrand (Arm)
@ 2026-02-11 13:38     ` Usama Arif
  2026-02-12 12:13     ` Ritesh Harjani
  1 sibling, 0 replies; 20+ messages in thread
From: Usama Arif @ 2026-02-11 13:38 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
	Michael Ellerman, linuxppc-dev



On 11/02/2026 13:25, David Hildenbrand (Arm) wrote:
> CCing ppc folks
> 
> On 2/11/26 13:49, Usama Arif wrote:
>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>> it pre-allocates a PTE page table and deposits it via
>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>> PMD split or zap. The rationale was that split must not fail—if the
>> kernel decides to split a THP, it needs a PTE table to populate.
>>
>> However, every anon THP wastes 4KB (one page table page) that sits
>> unused in the deposit list for the lifetime of the mapping. On systems
>> with many THPs, this adds up to significant memory waste. The original
>> rationale is also not an issue. It is ok for split to fail, and if the
>> kernel can't find an order 0 allocation for split, there are much bigger
>> problems. On large servers where you can easily have 100s of GBs of THPs,
>> the memory usage for these tables is 200M per 100G. This memory could be
>> used for any other usecase, which include allocating the pagetables
>> required during split.
>>
>> This patch removes the pre-deposit for anonymous pages on architectures
>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>> powerpc, and only when radix hash tables are not enabled) and allocates
>> the PTE table lazily—only when a split actually occurs. The split path
>> is modified to accept a caller-provided page table.
>>
>> PowerPC exception:
>>
>> It would have been great if we can completely remove the pagetable
>> deposit code and this commit would mostly have been a code cleanup patch,
>> unfortunately PowerPC has hash MMU, it stores hash slot information in
>> the deposited page table and pre-deposit is necessary. All deposit/
>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>> behavior is unchanged with this patch. On a better note,
>> arch_needs_pgtable_deposit will always evaluate to false at compile time
>> on non PowerPC architectures and the pre-deposit code will not be
>> compiled in.
> 
> Is there a way to remove this? It's always been a confusing hack, now it's unpleasant to have around :)


I spent some time researching this (I haven't worked with PowerPC before)
as I really wanted to get rid of all the pre-deposit code. I can't really see a
way without removing PMD THP support. I was going to CC the PowerPC maintainers
but I see that you already did!

> 
> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 copied generic pgtable_trans_huge_deposit() hurts my belly.
> 
> 
> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
> 

Yes that is what I found as well.

> So one obvious solution: remove PMD THP support for hash MMUs along with all this hacky deposit code.
> 

I would be happy with that!

> 
> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar checks need to be wrapped in a reasonable helper and likely this all needs to get cleaned up further.

Ack. The code will definitely look a lot cleaner and won't have much of this if we decide to remove
PMD THP support for hash MMU.

> 
> The implementation if the generic pgtable_trans_huge_deposit and the radix handlers etc must be removed. If any code would trigger them it would be a bug.
> 
> If we have to keep this around, pgtable_trans_huge_deposit() should likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there will not be generic support for it.
> 

Ack.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 13:31     ` Usama Arif
  2026-02-11 13:36       ` David Hildenbrand (Arm)
@ 2026-02-11 13:38       ` David Hildenbrand (Arm)
  2026-02-11 13:43         ` Usama Arif
  1 sibling, 1 reply; 20+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 13:38 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team

On 2/11/26 14:31, Usama Arif wrote:
> 
> 
> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
>> On 2/11/26 13:49, Usama Arif wrote:
>>> Add a vmstat counter to track PTE allocation failures during PMD split.
>>> This enables monitoring of split failures due to memory pressure after
>>> the lazy PTE page table allocation change.
>>>
>>> The counter is incremented in three places:
>>> - __split_huge_pmd(): Main entry point for splitting a PMD
>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>>
>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>>
>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>> ---
>>>    include/linux/vm_event_item.h | 1 +
>>>    mm/huge_memory.c              | 1 +
>>>    mm/rmap.c                     | 3 +++
>>>    mm/vmstat.c                   | 1 +
>>>    4 files changed, 6 insertions(+)
>>>
>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>> index 22a139f82d75f..827c9a8c251de 100644
>>> --- a/include/linux/vm_event_item.h
>>> +++ b/include/linux/vm_event_item.h
>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>>            THP_DEFERRED_SPLIT_PAGE,
>>>            THP_UNDERUSED_SPLIT_PAGE,
>>>            THP_SPLIT_PMD,
>>> +        THP_SPLIT_PMD_PTE_ALLOC_FAILED,
>>
>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
>>
> 
> Makes sense. This was just a patch I was using for testing and I wanted to share.
> It was always 0 as I couldnt get split to fail :) But I can rename it as THP_SPLIT_PMD_FAILED
> as suggested and we can use for future split failures (hopefully none).

Btw, you can use the allocation fault injection framework to find weird 
issues, if you haven't heard of that yet.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 13:36       ` David Hildenbrand (Arm)
@ 2026-02-11 13:42         ` Usama Arif
  0 siblings, 0 replies; 20+ messages in thread
From: Usama Arif @ 2026-02-11 13:42 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team



On 11/02/2026 13:36, David Hildenbrand (Arm) wrote:
> On 2/11/26 14:31, Usama Arif wrote:
>>
>>
>> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
>>> On 2/11/26 13:49, Usama Arif wrote:
>>>> Add a vmstat counter to track PTE allocation failures during PMD split.
>>>> This enables monitoring of split failures due to memory pressure after
>>>> the lazy PTE page table allocation change.
>>>>
>>>> The counter is incremented in three places:
>>>> - __split_huge_pmd(): Main entry point for splitting a PMD
>>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>>>
>>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>>>
>>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>>> ---
>>>>    include/linux/vm_event_item.h | 1 +
>>>>    mm/huge_memory.c              | 1 +
>>>>    mm/rmap.c                     | 3 +++
>>>>    mm/vmstat.c                   | 1 +
>>>>    4 files changed, 6 insertions(+)
>>>>
>>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>>> index 22a139f82d75f..827c9a8c251de 100644
>>>> --- a/include/linux/vm_event_item.h
>>>> +++ b/include/linux/vm_event_item.h
>>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>>>            THP_DEFERRED_SPLIT_PAGE,
>>>>            THP_UNDERUSED_SPLIT_PAGE,
>>>>            THP_SPLIT_PMD,
>>>> +        THP_SPLIT_PMD_PTE_ALLOC_FAILED,
>>>
>>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
>>>
>>
>> Makes sense. This was just a patch I was using for testing and I wanted to share.
>> It was always 0 as I couldnt get split to fail :) But I can rename it as THP_SPLIT_PMD_FAILED
>> as suggested and we can use for future split failures (hopefully none).
> 
> I guess it might be reasonable to have because I am sure it will fail at some point and maybe provoke weird issues we didn't think of. In that case, having an indication that splitting failed at some point might be reasonable.
> 
ack



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter
  2026-02-11 13:38       ` David Hildenbrand (Arm)
@ 2026-02-11 13:43         ` Usama Arif
  0 siblings, 0 replies; 20+ messages in thread
From: Usama Arif @ 2026-02-11 13:43 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team



On 11/02/2026 13:38, David Hildenbrand (Arm) wrote:
> On 2/11/26 14:31, Usama Arif wrote:
>>
>>
>> On 11/02/2026 13:27, David Hildenbrand (Arm) wrote:
>>> On 2/11/26 13:49, Usama Arif wrote:
>>>> Add a vmstat counter to track PTE allocation failures during PMD split.
>>>> This enables monitoring of split failures due to memory pressure after
>>>> the lazy PTE page table allocation change.
>>>>
>>>> The counter is incremented in three places:
>>>> - __split_huge_pmd(): Main entry point for splitting a PMD
>>>> - try_to_unmap_one(): When reclaim needs to split a PMD-mapped THP
>>>> - try_to_migrate_one(): When migration needs to split a PMD-mapped THP
>>>>
>>>> Visible via /proc/vmstat as thp_split_pmd_pte_alloc_failed.
>>>>
>>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>>> ---
>>>>    include/linux/vm_event_item.h | 1 +
>>>>    mm/huge_memory.c              | 1 +
>>>>    mm/rmap.c                     | 3 +++
>>>>    mm/vmstat.c                   | 1 +
>>>>    4 files changed, 6 insertions(+)
>>>>
>>>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>>> index 22a139f82d75f..827c9a8c251de 100644
>>>> --- a/include/linux/vm_event_item.h
>>>> +++ b/include/linux/vm_event_item.h
>>>> @@ -111,6 +111,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>>>            THP_DEFERRED_SPLIT_PAGE,
>>>>            THP_UNDERUSED_SPLIT_PAGE,
>>>>            THP_SPLIT_PMD,
>>>> +        THP_SPLIT_PMD_PTE_ALLOC_FAILED,
>>>
>>> Probably sufficient to call this THP_SPLIT_PMD_FAILED and count any (future) failures (if any) as well.
>>>
>>
>> Makes sense. This was just a patch I was using for testing and I wanted to share.
>> It was always 0 as I couldnt get split to fail :) But I can rename it as THP_SPLIT_PMD_FAILED
>> as suggested and we can use for future split failures (hopefully none).
> 
> Btw, you can use the allocation fault injection framework to find weird issues, if you haven't heard of that yet.
> 

This looks very interesting, thanks! Let me have a look.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 13:35   ` David Hildenbrand (Arm)
@ 2026-02-11 13:46     ` Kiryl Shutsemau
  2026-02-11 13:47     ` Usama Arif
  1 sibling, 0 replies; 20+ messages in thread
From: Kiryl Shutsemau @ 2026-02-11 13:46 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm,
	fvdl, hannes, riel, shakeel.butt, baohua, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, lance.yang,
	linux-kernel, kernel-team

On Wed, Feb 11, 2026 at 02:35:07PM +0100, David Hildenbrand (Arm) wrote:
> On 2/11/26 13:49, Usama Arif wrote:
> > When the kernel creates a PMD-level THP mapping for anonymous pages,
> > it pre-allocates a PTE page table and deposits it via
> > pgtable_trans_huge_deposit(). This deposited table is withdrawn during
> > PMD split or zap. The rationale was that split must not fail—if the
> > kernel decides to split a THP, it needs a PTE table to populate.
> > 
> > However, every anon THP wastes 4KB (one page table page) that sits
> > unused in the deposit list for the lifetime of the mapping. On systems
> > with many THPs, this adds up to significant memory waste. The original
> > rationale is also not an issue. It is ok for split to fail, and if the
> > kernel can't find an order 0 allocation for split, there are much bigger
> > problems. On large servers where you can easily have 100s of GBs of THPs,
> > the memory usage for these tables is 200M per 100G. This memory could be
> > used for any other usecase, which include allocating the pagetables
> > required during split.
> > 
> > This patch removes the pre-deposit for anonymous pages on architectures
> > where arch_needs_pgtable_deposit() returns false (every arch apart from
> > powerpc, and only when radix hash tables are not enabled) and allocates
> > the PTE table lazily—only when a split actually occurs. The split path
> > is modified to accept a caller-provided page table.
> > 
> > PowerPC exception:
> > 
> > It would have been great if we can completely remove the pagetable
> > deposit code and this commit would mostly have been a code cleanup patch,
> > unfortunately PowerPC has hash MMU, it stores hash slot information in
> > the deposited page table and pre-deposit is necessary. All deposit/
> > withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
> > behavior is unchanged with this patch. On a better note,
> > arch_needs_pgtable_deposit will always evaluate to false at compile time
> > on non PowerPC architectures and the pre-deposit code will not be
> > compiled in.
> > 
> > Why Split Failures Are Safe:
> > 
> > If a system is under severe memory pressure that even a 4K allocation
> > fails for a PTE table, there are far greater problems than a THP split
> > being delayed. The OOM killer will likely intervene before this becomes an
> > issue.
> > When pte_alloc_one() fails due to not being able to allocate a 4K page,
> > the PMD split is aborted and the THP remains intact. I could not get split
> > to fail, as its very difficult to make order-0 allocation to fail.
> > Code analysis of what would happen if it does:
> > 
> > - mprotect(): If split fails in change_pmd_range, it will fallback
> > to change_pte_range, which will return an error which will cause the
> > whole function to be retried again.
> > 
> > - munmap() (partial THP range): zap_pte_range() returns early when
> > pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--.
> > For full THP range, zap_huge_pmd() unmaps the entire PMD without
> > split.
> > 
> > - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
> > LRU, retried in next reclaim cycle.
> > 
> > - Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration
> > skips this folio, retried later.
> > 
> > - CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried.
> > 
> > -  madvise (MADV_COLD/PAGEOUT): split_folio() internally calls
> > try_to_migrate() with TTU_SPLIT_HUGE_PMD. If PMD split fails,
> > try_to_migrate() returns false, split_folio() returns -EAGAIN,
> > and madvise returns 0 (success) silently skipping the region. This
> > should be fine. madvise is just an advice and can fail for other
> > reasons as well.
> > 
> > Suggested-by: David Hildenbrand <david@kernel.org>
> > Signed-off-by: Usama Arif <usama.arif@linux.dev>
> > ---
> >   include/linux/huge_mm.h |   4 +-
> >   mm/huge_memory.c        | 144 ++++++++++++++++++++++++++++------------
> >   mm/khugepaged.c         |   7 +-
> >   mm/migrate_device.c     |  15 +++--
> >   mm/rmap.c               |  39 ++++++++++-
> >   5 files changed, 156 insertions(+), 53 deletions(-)
> > 
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index a4d9f964dfdea..b21bb72a298c9 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
> >   }
> >   void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> > -			   pmd_t *pmd, bool freeze);
> > +			   pmd_t *pmd, bool freeze, pgtable_t pgtable);
> >   bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
> >   			   pmd_t *pmdp, struct folio *folio);
> >   void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
> > @@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
> >   		unsigned long address, bool freeze) {}
> >   static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
> >   					 unsigned long address, pmd_t *pmd,
> > -					 bool freeze) {}
> > +					 bool freeze, pgtable_t pgtable) {}
> >   static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
> >   					 unsigned long addr, pmd_t *pmdp,
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 44ff8a648afd5..4c9a8d89fc8aa 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >   	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> >   	struct vm_area_struct *vma = vmf->vma;
> >   	struct folio *folio;
> > -	pgtable_t pgtable;
> > +	pgtable_t pgtable = NULL;
> >   	vm_fault_t ret = 0;
> >   	folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
> >   	if (unlikely(!folio))
> >   		return VM_FAULT_FALLBACK;
> > -	pgtable = pte_alloc_one(vma->vm_mm);
> > -	if (unlikely(!pgtable)) {
> > -		ret = VM_FAULT_OOM;
> > -		goto release;
> > +	if (arch_needs_pgtable_deposit()) {
> > +		pgtable = pte_alloc_one(vma->vm_mm);
> > +		if (unlikely(!pgtable)) {
> > +			ret = VM_FAULT_OOM;
> > +			goto release;
> > +		}
> >   	}
> >   	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > @@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >   		if (userfaultfd_missing(vma)) {
> >   			spin_unlock(vmf->ptl);
> >   			folio_put(folio);
> > -			pte_free(vma->vm_mm, pgtable);
> > +			if (pgtable)
> > +				pte_free(vma->vm_mm, pgtable);
> >   			ret = handle_userfault(vmf, VM_UFFD_MISSING);
> >   			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
> >   			return ret;
> >   		}
> > -		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> > +		if (pgtable) {
> > +			pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
> > +						   pgtable);
> > +			mm_inc_nr_ptes(vma->vm_mm);
> > +		}
> >   		map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
> > -		mm_inc_nr_ptes(vma->vm_mm);
> >   		spin_unlock(vmf->ptl);
> >   	}
> > @@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
> >   	pmd_t entry;
> >   	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
> >   	entry = pmd_mkspecial(entry);
> > -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +	if (pgtable) {
> > +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +		mm_inc_nr_ptes(mm);
> > +	}
> >   	set_pmd_at(mm, haddr, pmd, entry);
> > -	mm_inc_nr_ptes(mm);
> >   }
> >   vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> > @@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >   	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> >   			!mm_forbids_zeropage(vma->vm_mm) &&
> >   			transparent_hugepage_use_zero_page()) {
> > -		pgtable_t pgtable;
> > +		pgtable_t pgtable = NULL;
> >   		struct folio *zero_folio;
> >   		vm_fault_t ret;
> > -		pgtable = pte_alloc_one(vma->vm_mm);
> > -		if (unlikely(!pgtable))
> > -			return VM_FAULT_OOM;
> > +		if (arch_needs_pgtable_deposit()) {
> > +			pgtable = pte_alloc_one(vma->vm_mm);
> > +			if (unlikely(!pgtable))
> > +				return VM_FAULT_OOM;
> > +		}
> >   		zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
> >   		if (unlikely(!zero_folio)) {
> > -			pte_free(vma->vm_mm, pgtable);
> > +			if (pgtable)
> > +				pte_free(vma->vm_mm, pgtable);
> >   			count_vm_event(THP_FAULT_FALLBACK);
> >   			return VM_FAULT_FALLBACK;
> >   		}
> > @@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >   			ret = check_stable_address_space(vma->vm_mm);
> >   			if (ret) {
> >   				spin_unlock(vmf->ptl);
> > -				pte_free(vma->vm_mm, pgtable);
> > +				if (pgtable)
> > +					pte_free(vma->vm_mm, pgtable);
> >   			} else if (userfaultfd_missing(vma)) {
> >   				spin_unlock(vmf->ptl);
> > -				pte_free(vma->vm_mm, pgtable);
> > +				if (pgtable)
> > +					pte_free(vma->vm_mm, pgtable);
> >   				ret = handle_userfault(vmf, VM_UFFD_MISSING);
> >   				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
> >   			} else {
> > @@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >   			}
> >   		} else {
> >   			spin_unlock(vmf->ptl);
> > -			pte_free(vma->vm_mm, pgtable);
> > +			if (pgtable)
> > +				pte_free(vma->vm_mm, pgtable);
> >   		}
> >   		return ret;
> >   	}
> > @@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd(
> >   	}
> >   	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> > -	mm_inc_nr_ptes(dst_mm);
> > -	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> > +	if (pgtable) {
> > +		mm_inc_nr_ptes(dst_mm);
> > +		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> > +	}
> >   	if (!userfaultfd_wp(dst_vma))
> >   		pmd = pmd_swp_clear_uffd_wp(pmd);
> >   	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> > @@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >   	if (!vma_is_anonymous(dst_vma))
> >   		return 0;
> > -	pgtable = pte_alloc_one(dst_mm);
> > -	if (unlikely(!pgtable))
> > -		goto out;
> > +	if (arch_needs_pgtable_deposit()) {
> > +		pgtable = pte_alloc_one(dst_mm);
> > +		if (unlikely(!pgtable))
> > +			goto out;
> > +	}
> >   	dst_ptl = pmd_lock(dst_mm, dst_pmd);
> >   	src_ptl = pmd_lockptr(src_mm, src_pmd);
> > @@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >   	}
> >   	if (unlikely(!pmd_trans_huge(pmd))) {
> > -		pte_free(dst_mm, pgtable);
> > +		if (pgtable)
> > +			pte_free(dst_mm, pgtable);
> >   		goto out_unlock;
> >   	}
> >   	/*
> > @@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >   	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
> >   		/* Page maybe pinned: split and retry the fault on PTEs. */
> >   		folio_put(src_folio);
> > -		pte_free(dst_mm, pgtable);
> > +		if (pgtable)
> > +			pte_free(dst_mm, pgtable);
> >   		spin_unlock(src_ptl);
> >   		spin_unlock(dst_ptl);
> >   		__split_huge_pmd(src_vma, src_pmd, addr, false);
> > @@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >   	}
> >   	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> >   out_zero_page:
> > -	mm_inc_nr_ptes(dst_mm);
> > -	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> > +	if (pgtable) {
> > +		mm_inc_nr_ptes(dst_mm);
> > +		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> > +	}
> >   	pmdp_set_wrprotect(src_mm, addr, src_pmd);
> >   	if (!userfaultfd_wp(dst_vma))
> >   		pmd = pmd_clear_uffd_wp(pmd);
> > @@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >   			zap_deposited_table(tlb->mm, pmd);
> >   		spin_unlock(ptl);
> >   	} else if (is_huge_zero_pmd(orig_pmd)) {
> > -		if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
> > +		if (arch_needs_pgtable_deposit())
> >   			zap_deposited_table(tlb->mm, pmd);
> >   		spin_unlock(ptl);
> >   	} else {
> > @@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >   		}
> >   		if (folio_test_anon(folio)) {
> > -			zap_deposited_table(tlb->mm, pmd);
> > +			if (arch_needs_pgtable_deposit())
> > +				zap_deposited_table(tlb->mm, pmd);
> >   			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> >   		} else {
> >   			if (arch_needs_pgtable_deposit())
> > @@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> >   			force_flush = true;
> >   		VM_BUG_ON(!pmd_none(*new_pmd));
> > -		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
> > +		if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
> > +		    arch_needs_pgtable_deposit()) {
> >   			pgtable_t pgtable;
> >   			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
> >   			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
> > @@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> >   	}
> >   	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
> > -	src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> > -	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> > +	if (arch_needs_pgtable_deposit()) {
> > +		src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> > +		pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> > +	}
> >   unlock_ptls:
> >   	double_pt_unlock(src_ptl, dst_ptl);
> >   	/* unblock rmap walks */
> > @@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >   #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> >   static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> > -		unsigned long haddr, pmd_t *pmd)
> > +		unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
> >   {
> >   	struct mm_struct *mm = vma->vm_mm;
> > -	pgtable_t pgtable;
> >   	pmd_t _pmd, old_pmd;
> >   	unsigned long addr;
> >   	pte_t *pte;
> > @@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> >   	 */
> >   	old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
> > -	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > +	if (arch_needs_pgtable_deposit()) {
> > +		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > +	} else {
> > +		VM_BUG_ON(!pgtable);
> > +		/*
> > +		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
> > +		 * being used in mm.
> > +		 */
> > +		mm_inc_nr_ptes(mm);
> > +	}
> >   	pmd_populate(mm, &_pmd, pgtable);
> >   	pte = pte_offset_map(&_pmd, haddr);
> > @@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> >   }
> >   static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > -		unsigned long haddr, bool freeze)
> > +		unsigned long haddr, bool freeze, pgtable_t pgtable)
> >   {
> >   	struct mm_struct *mm = vma->vm_mm;
> >   	struct folio *folio;
> >   	struct page *page;
> > -	pgtable_t pgtable;
> >   	pmd_t old_pmd, _pmd;
> >   	bool soft_dirty, uffd_wp = false, young = false, write = false;
> >   	bool anon_exclusive = false, dirty = false;
> > @@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >   		 */
> >   		if (arch_needs_pgtable_deposit())
> >   			zap_deposited_table(mm, pmd);
> > +		if (pgtable)
> > +			pte_free(mm, pgtable);
> >   		if (!vma_is_dax(vma) && vma_is_special_huge(vma))
> >   			return;
> >   		if (unlikely(pmd_is_migration_entry(old_pmd))) {
> > @@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >   		 * small page also write protected so it does not seems useful
> >   		 * to invalidate secondary mmu at this time.
> >   		 */
> > -		return __split_huge_zero_page_pmd(vma, haddr, pmd);
> > +		return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
> >   	}
> >   	if (pmd_is_migration_entry(*pmd)) {
> > @@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >   	 * Withdraw the table only after we mark the pmd entry invalid.
> >   	 * This's critical for some architectures (Power).
> >   	 */
> > -	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > +	if (arch_needs_pgtable_deposit()) {
> > +		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > +	} else {
> > +		VM_BUG_ON(!pgtable);
> > +		/*
> > +		 * Account for the freshly allocated (in __split_huge_pmd) pgtable
> > +		 * being used in mm.
> > +		 */
> > +		mm_inc_nr_ptes(mm);
> > +	}
> >   	pmd_populate(mm, &_pmd, pgtable);
> >   	pte = pte_offset_map(&_pmd, haddr);
> > @@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >   }
> >   void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
> > -			   pmd_t *pmd, bool freeze)
> > +			   pmd_t *pmd, bool freeze, pgtable_t pgtable)
> >   {
> >   	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
> >   	if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
> > -		__split_huge_pmd_locked(vma, pmd, address, freeze);
> > +		__split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
> > +	else if (pgtable)
> > +		pte_free(vma->vm_mm, pgtable);
> >   }
> >   void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > @@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >   {
> >   	spinlock_t *ptl;
> >   	struct mmu_notifier_range range;
> > +	pgtable_t pgtable = NULL;
> >   	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> >   				address & HPAGE_PMD_MASK,
> >   				(address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
> >   	mmu_notifier_invalidate_range_start(&range);
> > +
> > +	/* allocate pagetable before acquiring pmd lock */
> > +	if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
> > +		pgtable = pte_alloc_one(vma->vm_mm);
> > +		if (!pgtable) {
> > +			mmu_notifier_invalidate_range_end(&range);
> 
> What I last looked at this, I thought the clean thing to do is to let
> __split_huge_pmd() and friends return an error.
> 
> Let's take a look at walk_pmd_range() as one example:
> 
> if (walk->vma)
> 	split_huge_pmd(walk->vma, pmd, addr);
> else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> 	continue;
> 
> err = walk_pte_range(pmd, addr, next, walk);
> 
> Where walk_pte_range() just does a pte_offset_map_lock.
> 
> 	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> 
> But if that fails (as the remapping failed), we will silently skip this
> range.
> 
> I don't think silently skipping is the right thing to do.
> 
> So I would think that all splitting functions have to be taught to return an
> error and handle it accordingly. Then we can actually start returning
> errors.

Yeah, I am also confused by silent split PMD failure. It has to be
communicated to the caller cleanly.

It is also an opportunity to audit all callers and check if they can
deal with the failure.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 13:35   ` David Hildenbrand (Arm)
  2026-02-11 13:46     ` Kiryl Shutsemau
@ 2026-02-11 13:47     ` Usama Arif
  1 sibling, 0 replies; 20+ messages in thread
From: Usama Arif @ 2026-02-11 13:47 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team



On 11/02/2026 13:35, David Hildenbrand (Arm) wrote:
> On 2/11/26 13:49, Usama Arif wrote:
>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>> it pre-allocates a PTE page table and deposits it via
>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>> PMD split or zap. The rationale was that split must not fail—if the
>> kernel decides to split a THP, it needs a PTE table to populate.
>>
>> However, every anon THP wastes 4KB (one page table page) that sits
>> unused in the deposit list for the lifetime of the mapping. On systems
>> with many THPs, this adds up to significant memory waste. The original
>> rationale is also not an issue. It is ok for split to fail, and if the
>> kernel can't find an order 0 allocation for split, there are much bigger
>> problems. On large servers where you can easily have 100s of GBs of THPs,
>> the memory usage for these tables is 200M per 100G. This memory could be
>> used for any other usecase, which include allocating the pagetables
>> required during split.
>>
>> This patch removes the pre-deposit for anonymous pages on architectures
>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>> powerpc, and only when radix hash tables are not enabled) and allocates
>> the PTE table lazily—only when a split actually occurs. The split path
>> is modified to accept a caller-provided page table.
>>
>> PowerPC exception:
>>
>> It would have been great if we can completely remove the pagetable
>> deposit code and this commit would mostly have been a code cleanup patch,
>> unfortunately PowerPC has hash MMU, it stores hash slot information in
>> the deposited page table and pre-deposit is necessary. All deposit/
>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>> behavior is unchanged with this patch. On a better note,
>> arch_needs_pgtable_deposit will always evaluate to false at compile time
>> on non PowerPC architectures and the pre-deposit code will not be
>> compiled in.
>>
>> Why Split Failures Are Safe:
>>
>> If a system is under severe memory pressure that even a 4K allocation
>> fails for a PTE table, there are far greater problems than a THP split
>> being delayed. The OOM killer will likely intervene before this becomes an
>> issue.
>> When pte_alloc_one() fails due to not being able to allocate a 4K page,
>> the PMD split is aborted and the THP remains intact. I could not get split
>> to fail, as its very difficult to make order-0 allocation to fail.
>> Code analysis of what would happen if it does:
>>
>> - mprotect(): If split fails in change_pmd_range, it will fallback
>> to change_pte_range, which will return an error which will cause the
>> whole function to be retried again.
>>
>> - munmap() (partial THP range): zap_pte_range() returns early when
>> pte_offset_map_lock() fails, causing zap_pmd_range() to retry via pmd--.
>> For full THP range, zap_huge_pmd() unmaps the entire PMD without
>> split.
>>
>> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
>> LRU, retried in next reclaim cycle.
>>
>> - Migration / compaction (try_to_migrate()): Returns -EAGAIN, migration
>> skips this folio, retried later.
>>
>> - CoW fault (wp_huge_pmd()): Returns VM_FAULT_FALLBACK, fault retried.
>>
>> -  madvise (MADV_COLD/PAGEOUT): split_folio() internally calls
>> try_to_migrate() with TTU_SPLIT_HUGE_PMD. If PMD split fails,
>> try_to_migrate() returns false, split_folio() returns -EAGAIN,
>> and madvise returns 0 (success) silently skipping the region. This
>> should be fine. madvise is just an advice and can fail for other
>> reasons as well.
>>
>> Suggested-by: David Hildenbrand <david@kernel.org>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>>   include/linux/huge_mm.h |   4 +-
>>   mm/huge_memory.c        | 144 ++++++++++++++++++++++++++++------------
>>   mm/khugepaged.c         |   7 +-
>>   mm/migrate_device.c     |  15 +++--
>>   mm/rmap.c               |  39 ++++++++++-
>>   5 files changed, 156 insertions(+), 53 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index a4d9f964dfdea..b21bb72a298c9 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -562,7 +562,7 @@ static inline bool thp_migration_supported(void)
>>   }
>>     void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>> -               pmd_t *pmd, bool freeze);
>> +               pmd_t *pmd, bool freeze, pgtable_t pgtable);
>>   bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
>>                  pmd_t *pmdp, struct folio *folio);
>>   void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd,
>> @@ -660,7 +660,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
>>           unsigned long address, bool freeze) {}
>>   static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
>>                        unsigned long address, pmd_t *pmd,
>> -                     bool freeze) {}
>> +                     bool freeze, pgtable_t pgtable) {}
>>     static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
>>                        unsigned long addr, pmd_t *pmdp,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 44ff8a648afd5..4c9a8d89fc8aa 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1322,17 +1322,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>       unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>       struct vm_area_struct *vma = vmf->vma;
>>       struct folio *folio;
>> -    pgtable_t pgtable;
>> +    pgtable_t pgtable = NULL;
>>       vm_fault_t ret = 0;
>>         folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
>>       if (unlikely(!folio))
>>           return VM_FAULT_FALLBACK;
>>   -    pgtable = pte_alloc_one(vma->vm_mm);
>> -    if (unlikely(!pgtable)) {
>> -        ret = VM_FAULT_OOM;
>> -        goto release;
>> +    if (arch_needs_pgtable_deposit()) {
>> +        pgtable = pte_alloc_one(vma->vm_mm);
>> +        if (unlikely(!pgtable)) {
>> +            ret = VM_FAULT_OOM;
>> +            goto release;
>> +        }
>>       }
>>         vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>> @@ -1347,14 +1349,18 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>           if (userfaultfd_missing(vma)) {
>>               spin_unlock(vmf->ptl);
>>               folio_put(folio);
>> -            pte_free(vma->vm_mm, pgtable);
>> +            if (pgtable)
>> +                pte_free(vma->vm_mm, pgtable);
>>               ret = handle_userfault(vmf, VM_UFFD_MISSING);
>>               VM_BUG_ON(ret & VM_FAULT_FALLBACK);
>>               return ret;
>>           }
>> -        pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
>> +        if (pgtable) {
>> +            pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd,
>> +                           pgtable);
>> +            mm_inc_nr_ptes(vma->vm_mm);
>> +        }
>>           map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr);
>> -        mm_inc_nr_ptes(vma->vm_mm);
>>           spin_unlock(vmf->ptl);
>>       }
>>   @@ -1450,9 +1456,11 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
>>       pmd_t entry;
>>       entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
>>       entry = pmd_mkspecial(entry);
>> -    pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +    if (pgtable) {
>> +        pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +        mm_inc_nr_ptes(mm);
>> +    }
>>       set_pmd_at(mm, haddr, pmd, entry);
>> -    mm_inc_nr_ptes(mm);
>>   }
>>     vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>> @@ -1471,16 +1479,19 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>       if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>>               !mm_forbids_zeropage(vma->vm_mm) &&
>>               transparent_hugepage_use_zero_page()) {
>> -        pgtable_t pgtable;
>> +        pgtable_t pgtable = NULL;
>>           struct folio *zero_folio;
>>           vm_fault_t ret;
>>   -        pgtable = pte_alloc_one(vma->vm_mm);
>> -        if (unlikely(!pgtable))
>> -            return VM_FAULT_OOM;
>> +        if (arch_needs_pgtable_deposit()) {
>> +            pgtable = pte_alloc_one(vma->vm_mm);
>> +            if (unlikely(!pgtable))
>> +                return VM_FAULT_OOM;
>> +        }
>>           zero_folio = mm_get_huge_zero_folio(vma->vm_mm);
>>           if (unlikely(!zero_folio)) {
>> -            pte_free(vma->vm_mm, pgtable);
>> +            if (pgtable)
>> +                pte_free(vma->vm_mm, pgtable);
>>               count_vm_event(THP_FAULT_FALLBACK);
>>               return VM_FAULT_FALLBACK;
>>           }
>> @@ -1490,10 +1501,12 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>               ret = check_stable_address_space(vma->vm_mm);
>>               if (ret) {
>>                   spin_unlock(vmf->ptl);
>> -                pte_free(vma->vm_mm, pgtable);
>> +                if (pgtable)
>> +                    pte_free(vma->vm_mm, pgtable);
>>               } else if (userfaultfd_missing(vma)) {
>>                   spin_unlock(vmf->ptl);
>> -                pte_free(vma->vm_mm, pgtable);
>> +                if (pgtable)
>> +                    pte_free(vma->vm_mm, pgtable);
>>                   ret = handle_userfault(vmf, VM_UFFD_MISSING);
>>                   VM_BUG_ON(ret & VM_FAULT_FALLBACK);
>>               } else {
>> @@ -1504,7 +1517,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>               }
>>           } else {
>>               spin_unlock(vmf->ptl);
>> -            pte_free(vma->vm_mm, pgtable);
>> +            if (pgtable)
>> +                pte_free(vma->vm_mm, pgtable);
>>           }
>>           return ret;
>>       }
>> @@ -1836,8 +1850,10 @@ static void copy_huge_non_present_pmd(
>>       }
>>         add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>> -    mm_inc_nr_ptes(dst_mm);
>> -    pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> +    if (pgtable) {
>> +        mm_inc_nr_ptes(dst_mm);
>> +        pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> +    }
>>       if (!userfaultfd_wp(dst_vma))
>>           pmd = pmd_swp_clear_uffd_wp(pmd);
>>       set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>> @@ -1877,9 +1893,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>       if (!vma_is_anonymous(dst_vma))
>>           return 0;
>>   -    pgtable = pte_alloc_one(dst_mm);
>> -    if (unlikely(!pgtable))
>> -        goto out;
>> +    if (arch_needs_pgtable_deposit()) {
>> +        pgtable = pte_alloc_one(dst_mm);
>> +        if (unlikely(!pgtable))
>> +            goto out;
>> +    }
>>         dst_ptl = pmd_lock(dst_mm, dst_pmd);
>>       src_ptl = pmd_lockptr(src_mm, src_pmd);
>> @@ -1897,7 +1915,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>       }
>>         if (unlikely(!pmd_trans_huge(pmd))) {
>> -        pte_free(dst_mm, pgtable);
>> +        if (pgtable)
>> +            pte_free(dst_mm, pgtable);
>>           goto out_unlock;
>>       }
>>       /*
>> @@ -1923,7 +1942,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>       if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
>>           /* Page maybe pinned: split and retry the fault on PTEs. */
>>           folio_put(src_folio);
>> -        pte_free(dst_mm, pgtable);
>> +        if (pgtable)
>> +            pte_free(dst_mm, pgtable);
>>           spin_unlock(src_ptl);
>>           spin_unlock(dst_ptl);
>>           __split_huge_pmd(src_vma, src_pmd, addr, false);
>> @@ -1931,8 +1951,10 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>       }
>>       add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>>   out_zero_page:
>> -    mm_inc_nr_ptes(dst_mm);
>> -    pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> +    if (pgtable) {
>> +        mm_inc_nr_ptes(dst_mm);
>> +        pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
>> +    }
>>       pmdp_set_wrprotect(src_mm, addr, src_pmd);
>>       if (!userfaultfd_wp(dst_vma))
>>           pmd = pmd_clear_uffd_wp(pmd);
>> @@ -2364,7 +2386,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>               zap_deposited_table(tlb->mm, pmd);
>>           spin_unlock(ptl);
>>       } else if (is_huge_zero_pmd(orig_pmd)) {
>> -        if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
>> +        if (arch_needs_pgtable_deposit())
>>               zap_deposited_table(tlb->mm, pmd);
>>           spin_unlock(ptl);
>>       } else {
>> @@ -2389,7 +2411,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>           }
>>             if (folio_test_anon(folio)) {
>> -            zap_deposited_table(tlb->mm, pmd);
>> +            if (arch_needs_pgtable_deposit())
>> +                zap_deposited_table(tlb->mm, pmd);
>>               add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
>>           } else {
>>               if (arch_needs_pgtable_deposit())
>> @@ -2490,7 +2513,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>>               force_flush = true;
>>           VM_BUG_ON(!pmd_none(*new_pmd));
>>   -        if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
>> +        if (pmd_move_must_withdraw(new_ptl, old_ptl, vma) &&
>> +            arch_needs_pgtable_deposit()) {
>>               pgtable_t pgtable;
>>               pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
>>               pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
>> @@ -2798,8 +2822,10 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>>       }
>>       set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
>>   -    src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
>> -    pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
>> +    if (arch_needs_pgtable_deposit()) {
>> +        src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
>> +        pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
>> +    }
>>   unlock_ptls:
>>       double_pt_unlock(src_ptl, dst_ptl);
>>       /* unblock rmap walks */
>> @@ -2941,10 +2967,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>   #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>     static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>> -        unsigned long haddr, pmd_t *pmd)
>> +        unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
>>   {
>>       struct mm_struct *mm = vma->vm_mm;
>> -    pgtable_t pgtable;
>>       pmd_t _pmd, old_pmd;
>>       unsigned long addr;
>>       pte_t *pte;
>> @@ -2960,7 +2985,16 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>        */
>>       old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
>>   -    pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> +    if (arch_needs_pgtable_deposit()) {
>> +        pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> +    } else {
>> +        VM_BUG_ON(!pgtable);
>> +        /*
>> +         * Account for the freshly allocated (in __split_huge_pmd) pgtable
>> +         * being used in mm.
>> +         */
>> +        mm_inc_nr_ptes(mm);
>> +    }
>>       pmd_populate(mm, &_pmd, pgtable);
>>         pte = pte_offset_map(&_pmd, haddr);
>> @@ -2982,12 +3016,11 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>   }
>>     static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> -        unsigned long haddr, bool freeze)
>> +        unsigned long haddr, bool freeze, pgtable_t pgtable)
>>   {
>>       struct mm_struct *mm = vma->vm_mm;
>>       struct folio *folio;
>>       struct page *page;
>> -    pgtable_t pgtable;
>>       pmd_t old_pmd, _pmd;
>>       bool soft_dirty, uffd_wp = false, young = false, write = false;
>>       bool anon_exclusive = false, dirty = false;
>> @@ -3011,6 +3044,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>            */
>>           if (arch_needs_pgtable_deposit())
>>               zap_deposited_table(mm, pmd);
>> +        if (pgtable)
>> +            pte_free(mm, pgtable);
>>           if (!vma_is_dax(vma) && vma_is_special_huge(vma))
>>               return;
>>           if (unlikely(pmd_is_migration_entry(old_pmd))) {
>> @@ -3043,7 +3078,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>            * small page also write protected so it does not seems useful
>>            * to invalidate secondary mmu at this time.
>>            */
>> -        return __split_huge_zero_page_pmd(vma, haddr, pmd);
>> +        return __split_huge_zero_page_pmd(vma, haddr, pmd, pgtable);
>>       }
>>         if (pmd_is_migration_entry(*pmd)) {
>> @@ -3167,7 +3202,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>        * Withdraw the table only after we mark the pmd entry invalid.
>>        * This's critical for some architectures (Power).
>>        */
>> -    pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> +    if (arch_needs_pgtable_deposit()) {
>> +        pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>> +    } else {
>> +        VM_BUG_ON(!pgtable);
>> +        /*
>> +         * Account for the freshly allocated (in __split_huge_pmd) pgtable
>> +         * being used in mm.
>> +         */
>> +        mm_inc_nr_ptes(mm);
>> +    }
>>       pmd_populate(mm, &_pmd, pgtable);
>>         pte = pte_offset_map(&_pmd, haddr);
>> @@ -3263,11 +3307,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>   }
>>     void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>> -               pmd_t *pmd, bool freeze)
>> +               pmd_t *pmd, bool freeze, pgtable_t pgtable)
>>   {
>>       VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>>       if (pmd_trans_huge(*pmd) || pmd_is_valid_softleaf(*pmd))
>> -        __split_huge_pmd_locked(vma, pmd, address, freeze);
>> +        __split_huge_pmd_locked(vma, pmd, address, freeze, pgtable);
>> +    else if (pgtable)
>> +        pte_free(vma->vm_mm, pgtable);
>>   }
>>     void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> @@ -3275,13 +3321,24 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>   {
>>       spinlock_t *ptl;
>>       struct mmu_notifier_range range;
>> +    pgtable_t pgtable = NULL;
>>         mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
>>                   address & HPAGE_PMD_MASK,
>>                   (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
>>       mmu_notifier_invalidate_range_start(&range);
>> +
>> +    /* allocate pagetable before acquiring pmd lock */
>> +    if (vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()) {
>> +        pgtable = pte_alloc_one(vma->vm_mm);
>> +        if (!pgtable) {
>> +            mmu_notifier_invalidate_range_end(&range);
> 
> What I last looked at this, I thought the clean thing to do is to let __split_huge_pmd() and friends return an error.
> 
> Let's take a look at walk_pmd_range() as one example:
> 
> if (walk->vma)
>     split_huge_pmd(walk->vma, pmd, addr);
> else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
>     continue;
> 
> err = walk_pte_range(pmd, addr, next, walk);
> 
> Where walk_pte_range() just does a pte_offset_map_lock.
> 
>     pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> 
> But if that fails (as the remapping failed), we will silently skip this range.
> 
> I don't think silently skipping is the right thing to do.
> 
> So I would think that all splitting functions have to be taught to return an error and handle it accordingly. Then we can actually start returning errors.
> 

Ack. This was one of the cases where we would try again if needed.
I did a manual code analysis, which I included at the end of the commit message,
but agreed, it's best to return an error and handle it accordingly.
I will look into doing this for the next revision.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
  2026-02-11 13:25   ` David Hildenbrand (Arm)
  2026-02-11 13:35   ` David Hildenbrand (Arm)
@ 2026-02-11 19:28   ` Matthew Wilcox
  2026-02-11 19:55     ` David Hildenbrand (Arm)
  2 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2026-02-11 19:28 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, lorenzo.stoakes, linux-mm, fvdl, hannes,
	riel, shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On Wed, Feb 11, 2026 at 04:49:45AM -0800, Usama Arif wrote:
> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
> LRU, retried in next reclaim cycle.

I was advised to ask my stupid question ...

Why do we still try to split the PMD in reclaim?  I understand we're
about to swap the folio out and we'll need to put a swap entry in the page
table so we can find it again.  But can't we now store swap entries at the
PMD level, or are we still forced to store 512 entries at the PTE level?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 19:28   ` Matthew Wilcox
@ 2026-02-11 19:55     ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-11 19:55 UTC (permalink / raw)
  To: Matthew Wilcox, Usama Arif
  Cc: Andrew Morton, lorenzo.stoakes, linux-mm, fvdl, hannes, riel,
	shakeel.butt, kas, baohua, dev.jain, baolin.wang, npache,
	Liam.Howlett, ryan.roberts, vbabka, lance.yang, linux-kernel,
	kernel-team

On 2/11/26 20:28, Matthew Wilcox wrote:
> On Wed, Feb 11, 2026 at 04:49:45AM -0800, Usama Arif wrote:
>> - Memory reclaim (try_to_unmap()): Returns false, folio rotated back
>> LRU, retried in next reclaim cycle.
> 
> I was advised to ask my stupid question ...
> 
> Why do we still try to split the PMD in reclaim?  I understand we're
> about to swap the folio out and we'll need to put a swap entry in the page
> table so we can find it again.  But can't we now store swap entries at the
> PMD level, or are we still forced to store 512 entries at the PTE level?

Yes. We don't support PMD swap entries yet.

I don't know all the historical details. I suspect there are some rough 
edges around swapin (I assume we cannot swap in a 2M THP), and maybe it 
was just easier not to deal with splitting of PMD swap entries (which we 
would similarly have to support).

For sure an interesting project to look into.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-11 13:25   ` David Hildenbrand (Arm)
  2026-02-11 13:38     ` Usama Arif
@ 2026-02-12 12:13     ` Ritesh Harjani
  2026-02-12 15:25       ` Usama Arif
  2026-02-12 15:39       ` David Hildenbrand (Arm)
  1 sibling, 2 replies; 20+ messages in thread
From: Ritesh Harjani @ 2026-02-12 12:13 UTC (permalink / raw)
  To: David Hildenbrand (Arm),
	Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
	Michael Ellerman, linuxppc-dev

"David Hildenbrand (Arm)" <david@kernel.org> writes:

> CCing ppc folks
>

Thanks David!

> On 2/11/26 13:49, Usama Arif wrote:
>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>> it pre-allocates a PTE page table and deposits it via
>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>> PMD split or zap. The rationale was that split must not fail—if the
>> kernel decides to split a THP, it needs a PTE table to populate.
>> 
>> However, every anon THP wastes 4KB (one page table page) that sits
>> unused in the deposit list for the lifetime of the mapping. On systems
>> with many THPs, this adds up to significant memory waste. The original
>> rationale is also not an issue. It is ok for split to fail, and if the
>> kernel can't find an order 0 allocation for split, there are much bigger
>> problems. On large servers where you can easily have 100s of GBs of THPs,
>> the memory usage for these tables is 200M per 100G. This memory could be
>> used for any other usecase, which include allocating the pagetables
>> required during split.
>> 
>> This patch removes the pre-deposit for anonymous pages on architectures
>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>> powerpc, and only when radix hash tables are not enabled) and allocates
>> the PTE table lazily—only when a split actually occurs. The split path
>> is modified to accept a caller-provided page table.
>> 
>> PowerPC exception:
>> 
>> It would have been great if we can completely remove the pagetable
>> deposit code and this commit would mostly have been a code cleanup patch,
>> unfortunately PowerPC has hash MMU, it stores hash slot information in
>> the deposited page table and pre-deposit is necessary. All deposit/
>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>> behavior is unchanged with this patch. On a better note,
>> arch_needs_pgtable_deposit will always evaluate to false at compile time
>> on non PowerPC architectures and the pre-deposit code will not be
>> compiled in.
>
> Is there a way to remove this? It's always been a confusing hack, now 
> it's unpleasant to have around :)
>

The Hash MMU on PowerPC works fundamentally differently from other MMUs
(unlike the Radix MMU on PowerPC). So yes, it requires a few tricks to
fit into Linux's multi-level SW page table model. ;)


> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 
> copied generic pgtable_trans_huge_deposit() hurts my belly.
>

On PowerPC, pgtable_t can be a pte fragment. 

typedef pte_t *pgtable_t;

That means a single page can be shared among multiple PTE page tables, so
we cannot use page->lru, which the generic implementation relies on. I
guess this is why the implementation of radix__pgtable_trans_huge_deposit()
differs slightly.

Doing a grep search, I think that's the same for sparc and s390 as well.
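
For reference, the generic helper chains deposited tables through their
struct page, roughly like this (quoted from memory of
mm/pgtable-generic.c, so details may be slightly off):

/* roughly the generic implementation, reproduced from memory */
void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
				pgtable_t pgtable)
{
	assert_spin_locked(pmd_lockptr(mm, pmdp));

	/* FIFO: deposited tables are chained via the struct page lru field */
	if (!pmd_huge_pte(mm, pmdp))
		INIT_LIST_HEAD(&pgtable->lru);
	else
		list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
	pmd_huge_pte(mm, pmdp) = pgtable;
}

Since a pte fragment is just a pte_t * into a shared page, there is no
per-table struct page whose lru field could be used, hence the separate
implementations.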

>
> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>
> So one obvious solution: remove PMD THP support for hash MMUs along with 
> all this hacky deposit code.
>

Unfortunately, please no. There are real customers using the Hash MMU on
Power9 and even on older generations, and this would mean breaking Hash
PMD THP support for them.


>
> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar 
> checks need to be wrapped in a reasonable helper and likely this all 
> needs to get cleaned up further.
>
> The implementation of the generic pgtable_trans_huge_deposit and the 
> radix handlers etc must be removed. If any code would trigger them it 
> would be a bug.
>

Sure, I think that after this patch series radix__pgtable_trans_huge_deposit()
will mostly be dead code anyway. I will spend some time going through
this series and will also give it a test on powerpc HW (with both Hash
and Radix MMU).

I guess we should also look at removing the pgtable_trans_huge_deposit() and
pgtable_trans_huge_withdraw() implementations from s390 and sparc, since
those too will be dead code after this.


> If we have to keep this around, pgtable_trans_huge_deposit() should 
> likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there 
> will not be generic support for it.
>

Sure. That makes sense, since the PowerPC Hash MMU will still need this.

-ritesh


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-12 12:13     ` Ritesh Harjani
@ 2026-02-12 15:25       ` Usama Arif
  2026-02-12 15:39       ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 20+ messages in thread
From: Usama Arif @ 2026-02-12 15:25 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), David Hildenbrand (Arm),
	Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
	Michael Ellerman, linuxppc-dev



On 12/02/2026 12:13, Ritesh Harjani (IBM) wrote:
> "David Hildenbrand (Arm)" <david@kernel.org> writes:
> 
>> CCing ppc folks
>>
> 
> Thanks David!
> 
>> On 2/11/26 13:49, Usama Arif wrote:
>>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>>> it pre-allocates a PTE page table and deposits it via
>>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>>> PMD split or zap. The rationale was that split must not fail—if the
>>> kernel decides to split a THP, it needs a PTE table to populate.
>>>
>>> However, every anon THP wastes 4KB (one page table page) that sits
>>> unused in the deposit list for the lifetime of the mapping. On systems
>>> with many THPs, this adds up to significant memory waste. The original
>>> rationale is also not an issue. It is ok for split to fail, and if the
>>> kernel can't find an order 0 allocation for split, there are much bigger
>>> problems. On large servers where you can easily have 100s of GBs of THPs,
>>> the memory usage for these tables is 200M per 100G. This memory could be
>>> used for any other usecase, which include allocating the pagetables
>>> required during split.
>>>
>>> This patch removes the pre-deposit for anonymous pages on architectures
>>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>>> powerpc, and only when radix hash tables are not enabled) and allocates
>>> the PTE table lazily—only when a split actually occurs. The split path
>>> is modified to accept a caller-provided page table.
>>>
>>> PowerPC exception:
>>>
>>> It would have been great if we can completely remove the pagetable
>>> deposit code and this commit would mostly have been a code cleanup patch,
>>> unfortunately PowerPC has hash MMU, it stores hash slot information in
>>> the deposited page table and pre-deposit is necessary. All deposit/
>>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>>> behavior is unchanged with this patch. On a better note,
>>> arch_needs_pgtable_deposit will always evaluate to false at compile time
>>> on non PowerPC architectures and the pre-deposit code will not be
>>> compiled in.
>>
>> Is there a way to remove this? It's always been a confusing hack, now 
>> it's unpleasant to have around :)
>>
> 
> Hash MMU on PowerPC works fundamentally different than other MMUs
> (unlike Radix MMU on PowerPC). So yes, it requires few tricks to fit
> into the Linux's multi-level SW page table model. ;) 
> 
> 
>> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1 
>> copied generic pgtable_trans_huge_deposit() hurts my belly.
>>
> 
> On PowerPC, pgtable_t can be a pte fragment. 
> 
> typedef pte_t *pgtable_t;
> 
> That means a single page can be shared among other PTE page tables. So, we
> cannot use page->lru which the generic implementation uses. I guess due
> to this, there is a slight change in implementation of
> radix__pgtable_trans_huge_deposit(). 
> 
> Doing a grep search, I think that's the same for sparc and s390 as well.
> 
>>
>> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>>
>> So one obvious solution: remove PMD THP support for hash MMUs along with 
>> all this hacky deposit code.
>>
> 
> Unfortunately, please no. There are real customers using Hash MMU on
> Power9 and even on older generations and this would mean breaking Hash
> PMD THP support for them. 
> 
> 

Thanks for confirming! I will keep the pagetable deposit for powerpc
in the next revision.
I will rename pgtable_trans_huge_deposit to arch_pgtable_trans_huge_deposit
and move it to arch/powerpc. It will be an empty function for the rest of
the architectures.

>>
>> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar 
>> checks need to be wrapped in a reasonable helper and likely this all 
>> needs to get cleaned up further.
>>
>> The implementation of the generic pgtable_trans_huge_deposit and the
>> radix handlers etc must be removed. If any code would trigger them it 
>> would be a bug.
>>
> 
> Sure, I think after this patch series, the radix__pgtable_trans_huge_deposit() 
> will mostly be a dead code anyways. I will spend some time going
> through this series and will also give it a test on powerpc HW (with
> both Hash and Radix MMU).
> 
> I guess, we should also look at removing pgtable_trans_huge_deposit() and
> pgtable_trans_huge_withdraw() implementations from s390 and sparc, since
> those too will be dead code after this.
> 
> 
>> If we have to keep this around, pgtable_trans_huge_deposit() should 
>> likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there 
>> will not be generic support for it.
>>
> 
> Sure. That make sense since PowerPC Hash MMU will still need this.
> 
> -ritesh



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-12 12:13     ` Ritesh Harjani
  2026-02-12 15:25       ` Usama Arif
@ 2026-02-12 15:39       ` David Hildenbrand (Arm)
  2026-02-12 16:46         ` Ritesh Harjani
  1 sibling, 1 reply; 20+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-12 15:39 UTC (permalink / raw)
  To: Ritesh Harjani (IBM),
	Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
	Michael Ellerman, linuxppc-dev

>>
>> Is there a way to remove this? It's always been a confusing hack, now
>> it's unpleasant to have around :)
>>
> 
> Hash MMU on PowerPC works fundamentally different than other MMUs
> (unlike Radix MMU on PowerPC). So yes, it requires few tricks to fit
> into the Linux's multi-level SW page table model. ;)

:)

> 
>> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1
>> copied generic pgtable_trans_huge_deposit() hurts my belly.
>>
> 
> On PowerPC, pgtable_t can be a pte fragment.
> 
> typedef pte_t *pgtable_t;
> 
> That means a single page can be shared among other PTE page tables. So, we
> cannot use page->lru which the generic implementation uses. I guess due
> to this, there is a slight change in implementation of
> radix__pgtable_trans_huge_deposit().

Ah, I did not spot this difference, but it makes sense. Still ugly, but
it makes sense. Fortunately it would go away with this RFC.

> 
> Doing a grep search, I think that's the same for sparc and s390 as well.

... and I also did not realize that s390x+sparc have separate 
implementations we can now get rid of as well.

> 
>>
>> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>>
>> So one obvious solution: remove PMD THP support for hash MMUs along with
>> all this hacky deposit code.
>>
> 
> Unfortunately, please no. There are real customers using Hash MMU on
> Power9 and even on older generations and this would mean breaking Hash
> PMD THP support for them.
> 

I was expecting this answer :)

> 
>>
>> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar
>> checks need to be wrapped in a reasonable helper and likely this all
>> needs to get cleaned up further.
>>
>> The implementation of the generic pgtable_trans_huge_deposit and the
>> radix handlers etc must be removed. If any code would trigger them it
>> would be a bug.
>>
> 
> Sure, I think after this patch series, the radix__pgtable_trans_huge_deposit()
> will mostly be a dead code anyways. I will spend some time going
> through this series and will also give it a test on powerpc HW (with
> both Hash and Radix MMU).

Thanks! The series will grow quite a bit, I think, so retesting new
revisions will be much appreciated!

> 
> I guess, we should also look at removing pgtable_trans_huge_deposit() and
> pgtable_trans_huge_withdraw() implementations from s390 and sparc, since
> those too will be dead code after this.

Exactly.


-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
  2026-02-12 15:39       ` David Hildenbrand (Arm)
@ 2026-02-12 16:46         ` Ritesh Harjani
  0 siblings, 0 replies; 20+ messages in thread
From: Ritesh Harjani @ 2026-02-12 16:46 UTC (permalink / raw)
  To: David Hildenbrand (Arm),
	Usama Arif, Andrew Morton, lorenzo.stoakes, willy, linux-mm
  Cc: fvdl, hannes, riel, shakeel.butt, kas, baohua, dev.jain,
	baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	lance.yang, linux-kernel, kernel-team, Madhavan Srinivasan,
	Michael Ellerman, linuxppc-dev

"David Hildenbrand (Arm)" <david@kernel.org> writes:

>
> Thanks! The series will grow quite a bit I think, so retesting new 
> revisions will be very appreciated!
>

Definitely. Thanks!

-ritesh


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2026-02-12 16:47 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-11 12:49 [RFC 0/2] mm: thp: split time allocation of page table for THPs Usama Arif
2026-02-11 12:49 ` [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-02-11 13:25   ` David Hildenbrand (Arm)
2026-02-11 13:38     ` Usama Arif
2026-02-12 12:13     ` Ritesh Harjani
2026-02-12 15:25       ` Usama Arif
2026-02-12 15:39       ` David Hildenbrand (Arm)
2026-02-12 16:46         ` Ritesh Harjani
2026-02-11 13:35   ` David Hildenbrand (Arm)
2026-02-11 13:46     ` Kiryl Shutsemau
2026-02-11 13:47     ` Usama Arif
2026-02-11 19:28   ` Matthew Wilcox
2026-02-11 19:55     ` David Hildenbrand (Arm)
2026-02-11 12:49 ` [RFC 2/2] mm: thp: add THP_SPLIT_PMD_PTE_ALLOC_FAILED counter Usama Arif
2026-02-11 13:27   ` David Hildenbrand (Arm)
2026-02-11 13:31     ` Usama Arif
2026-02-11 13:36       ` David Hildenbrand (Arm)
2026-02-11 13:42         ` Usama Arif
2026-02-11 13:38       ` David Hildenbrand (Arm)
2026-02-11 13:43         ` Usama Arif
