From: Hugh Dickins <hughd@google.com>
To: Usama Arif <usama.arif@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, Lorenzo Stoakes <ljs@kernel.org>,
willy@infradead.org, linux-mm@kvack.org, fvdl@google.com,
hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
baolin.wang@linux.alibaba.com, npache@redhat.com,
Liam.Howlett@oracle.com, ryan.roberts@arm.com,
Vlastimil Babka <vbabka@kernel.org>,
lance.yang@linux.dev, linux-kernel@vger.kernel.org,
kernel-team@meta.com, maddy@linux.ibm.com, mpe@ellerman.id.au,
linuxppc-dev@lists.ozlabs.org, hca@linux.ibm.com,
gor@linux.ibm.com, agordeev@linux.ibm.com,
borntraeger@linux.ibm.com, svens@linux.ibm.com,
linux-s390@vger.kernel.org
Subject: Re: [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time
Date: Sun, 5 Apr 2026 16:34:46 -0700 (PDT)
Message-ID: <6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com>
In-Reply-To: <20260327021403.214713-1-usama.arif@linux.dev>
On Thu, 26 Mar 2026, Usama Arif wrote:
> When the kernel creates a PMD-level THP mapping for anonymous pages, it
> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
> page table sits unused in a deposit list for the lifetime of the THP
> mapping, only to be withdrawn when the PMD is split or zapped. Every
> anonymous THP therefore wastes 4KB of memory unconditionally. On large
> servers where hundreds of gigabytes of memory are mapped as THPs, this
> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
> could otherwise satisfy other allocations, including the very PTE page
> table allocations needed when splits eventually occur.
>
> This series removes the pre-deposit and allocates the PTE page table
> lazily — only when a PMD split actually happens. Since a large number
> of THPs are never split (they are zapped wholesale when processes exit or
> munmap the full range), the allocation is avoided entirely in the common
> case.
>
> The pre-deposit pattern exists because split_huge_pmd was designed as an
> operation that must never fail: if the kernel decides to split, it needs
> a PTE page table, so one is deposited in advance. But "must never fail"
> is an unnecessarily strong requirement. A PMD split is typically triggered
> by a partial operation on a sub-PMD range — partial munmap, partial
> mprotect, COW on a pinned folio, GUP with FOLL_SPLIT_PMD, and similar.
> All of these operations already have well-defined error handling for
> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
> fail and propagating the error through these existing paths is the natural
> thing to do. Furthermore, if the system cannot satisfy a single order-0
> allocation for a page table, it is under extreme memory pressure and
> failing the operation is the correct response.
>
> Designing functions like split_huge_pmd as operations that cannot fail
> has a subtle but real cost to code quality. It forces a pre-allocation
> pattern: every THP creation path must deposit a page table, and every
> split or zap path must withdraw one, creating a hidden coupling between
> widely separated code paths.
>
> This also serves as a code cleanup. On every architecture except powerpc
> with hash MMU, the deposit/withdraw machinery becomes dead code. The
> series removes the generic implementations in pgtable-generic.c and the
> s390/sparc overrides, replacing them with no-op stubs guarded by
> arch_needs_pgtable_deposit(), which evaluates to false at compile time
> on all non-powerpc architectures.
I see no mention of the big problem,
which has stopped us all from trying this before.

Reclaim: the split_folio_to_list() in shrink_folio_list().

Imagine a process which has forked a thousand times, containing
anon THPs, which should now be swapped out and reclaimed.
To swap out one of those THPs, it will have to allocate a
thousand page tables, all with PF_MEMALLOC set (to give some
access to reserves, while preventing recursion into reclaim).

Elsewhere, we go to great lengths (e.g. mempools) to give
guaranteed access to the memory needed when freeing memory.
In the case of an anon THP, the guaranteed pool has been the
deposited page table. Now what?

And the worst is that when the 501st attempt to allocate a page
table fails, it has allocated and is using 500 pages from reserve,
without reaching the point of freeing any memory at all.
Maybe watermark boosting (I barely know whereof I speak) can help
a bit nowadays. Has anything else changed to solve the problem?

What would help a lot would be the implementation of swap entries
at the PMD level. Whether that would help enough, I'm sceptical:
I do think it's foolish to depend upon the availability of huge
contiguous swap extents, whatever the recent improvements there;
but it would at least be an arguable justification.

Shared page tables? Generally I run away, but perhaps
manageable in this limited context (a store of not-present
swap entries, to be copied on fault).

Hugh
Thread overview: 36+ messages
2026-03-27 2:08 Usama Arif
2026-03-27 2:08 ` [v3 01/24] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
2026-03-27 2:08 ` [v3 02/24] mm: thp: propagate split failure from vma_adjust_trans_huge() Usama Arif
2026-03-27 2:08 ` [v3 03/24] mm: thp: handle split failure in copy_huge_pmd() Usama Arif
2026-03-27 2:08 ` [v3 04/24] mm: thp: handle split failure in do_huge_pmd_wp_page() Usama Arif
2026-03-27 2:08 ` [v3 05/24] mm: thp: handle split failure in zap_pmd_range() Usama Arif
2026-03-30 14:13 ` Kiryl Shutsemau
2026-03-30 15:09 ` David Hildenbrand (Arm)
2026-03-27 2:08 ` [v3 06/24] mm: thp: handle split failure in wp_huge_pmd() Usama Arif
2026-03-27 2:08 ` [v3 07/24] mm: thp: retry on split failure in change_pmd_range() Usama Arif
2026-03-30 14:27 ` Kiryl Shutsemau
2026-03-27 2:08 ` [v3 08/24] mm: thp: handle split failure in follow_pmd_mask() Usama Arif
2026-03-27 2:08 ` [v3 09/24] mm: handle walk_page_range() failure from THP split Usama Arif
2026-03-27 2:08 ` [v3 10/24] mm: thp: handle split failure in mremap move_page_tables() Usama Arif
2026-03-27 2:08 ` [v3 11/24] mm: thp: handle split failure in userfaultfd move_pages() Usama Arif
2026-03-27 2:08 ` [v3 12/24] mm: thp: handle split failure in device migration Usama Arif
2026-03-27 2:08 ` [v3 13/24] mm: proc: handle split_huge_pmd failure in pagemap_scan Usama Arif
2026-03-27 2:08 ` [v3 14/24] powerpc/mm: handle split_huge_pmd failure in subpage_prot Usama Arif
2026-03-27 2:08 ` [v3 15/24] fs/dax: handle split_huge_pmd failure in dax_iomap_pmd_fault Usama Arif
2026-03-27 2:08 ` [v3 16/24] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
2026-03-30 14:41 ` Kiryl Shutsemau
2026-03-27 2:08 ` [v3 17/24] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-03-27 2:09 ` [v3 18/24] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed Usama Arif
2026-03-27 2:09 ` [v3 19/24] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
2026-03-27 2:09 ` [v3 20/24] selftests/mm: add THP PMD split test infrastructure Usama Arif
2026-03-27 2:09 ` [v3 21/24] selftests/mm: add partial_mprotect test for change_pmd_range Usama Arif
2026-03-27 2:09 ` [v3 22/24] selftests/mm: add partial_mlock test Usama Arif
2026-03-27 2:09 ` [v3 23/24] selftests/mm: add partial_mremap test for move_page_tables Usama Arif
2026-03-27 2:09 ` [v3 24/24] selftests/mm: add madv_dontneed_partial test Usama Arif
2026-03-27 8:51 ` [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time David Hildenbrand (Arm)
2026-03-27 9:25 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:40 ` Usama Arif
2026-03-27 14:34 ` Usama Arif
2026-04-05 23:34 ` Hugh Dickins [this message]
2026-04-08 15:06 ` Usama Arif
2026-04-08 19:49 ` Matthew Wilcox