From: Nico Pache <npache@redhat.com>
To: Usama Arif <usama.arif@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, lorenzo.stoakes@oracle.com,
willy@infradead.org, linux-mm@kvack.org, fvdl@google.com,
hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
ryan.roberts@arm.com, Vlastimil Babka <vbabka@kernel.org>,
lance.yang@linux.dev, linux-kernel@vger.kernel.org,
kernel-team@meta.com, maddy@linux.ibm.com, mpe@ellerman.id.au,
linuxppc-dev@lists.ozlabs.org, hca@linux.ibm.com,
gor@linux.ibm.com, agordeev@linux.ibm.com,
borntraeger@linux.ibm.com, svens@linux.ibm.com,
linux-s390@vger.kernel.org
Subject: Re: [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split
Date: Fri, 27 Feb 2026 17:06:12 -0700 [thread overview]
Message-ID: <CAA1CXcCygvA9uUJjB-+2J00srnHbiNGwbvcbqpRer8Vy8QBxWg@mail.gmail.com> (raw)
In-Reply-To: <1d3a4e8e-9ea0-42e7-b8e7-d92fb27f80f4@linux.dev>
On Fri, Feb 27, 2026 at 4:14 AM Usama Arif <usama.arif@linux.dev> wrote:
>
>
>
> On 26/02/2026 21:01, Nico Pache wrote:
> > On Thu, Feb 26, 2026 at 4:33 AM Usama Arif <usama.arif@linux.dev> wrote:
> >>
> >> When the kernel creates a PMD-level THP mapping for anonymous pages, it
> >> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
> >> page table sits unused in a deposit list for the lifetime of the THP
> >> mapping, only to be withdrawn when the PMD is split or zapped. Every
> >> anonymous THP therefore wastes 4KB of memory unconditionally. On large
> >> servers where hundreds of gigabytes of memory are mapped as THPs, this
> >> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
> >> could otherwise satisfy other allocations, including the very PTE page
> >> table allocations needed when splits eventually occur.
> >>
> >> This series removes the pre-deposit and allocates the PTE page table
> >> lazily — only when a PMD split actually happens. Since a large number
> >> of THPs are never split (they are zapped wholesale when processes exit or
> >> munmap the full range), the allocation is avoided entirely in the common
> >> case.
> >>
> >> The pre-deposit pattern exists because split_huge_pmd was designed as an
> >> operation that must never fail: if the kernel decides to split, it needs
> >> a PTE page table, so one is deposited in advance. But "must never fail"
> >> is an unnecessarily strong requirement. A PMD split is typically triggered
> >> by a partial operation on a sub-PMD range — partial munmap, partial
> >> mprotect, partial mremap and so on.
> >> Most of these operations already have well-defined error handling for
> >> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
> >> fail and propagating the error through these existing paths is the natural
> >> thing to do. Furthermore, split failing requires an order-0 allocation for
> >> a page table to fail, which is extremely unlikely.
> >>
> >> Designing functions like split_huge_pmd as operations that cannot fail
> >> has a subtle but real cost to code quality. It forces a pre-allocation
> >> pattern - every THP creation path must deposit a page table, and every
> >> split or zap path must withdraw one, creating a hidden coupling between
> >> widely separated code paths.
> >>
> >> This also serves as a code cleanup. On every architecture except powerpc
> >> with hash MMU, the deposit/withdraw machinery becomes dead code. The
> >> series removes the generic implementations in pgtable-generic.c and the
> >> s390/sparc overrides, replacing them with no-op stubs guarded by
> >> arch_needs_pgtable_deposit(), which evaluates to false at compile time
> >> on all non-powerpc architectures.
> >
> > Hi Usama,
> >
> > Thanks for tackling this, it seems like an interesting problem. Im
> > trying to get more into reviewing, so bare with me I may have some
> > stupid comments or questions. Where I can really help out is with
> > testing. I will build this for all RH-supported architectures and run
> > some automated test suites and performance metrics. I'll report back
> > if I spot anything.
> >
> > Cheers!
> > -- Nico
> >
>
> Thanks for the build and looking into reviewing this. All comments
> and questions are welcome! I had only tested on x86, and I had a look
> at the link you shared so its great to know that powerPC and s390 are fine.
Good news: as you noted all the builds succeeded, and the sanity tests
dont show any signs of an immediate issue across the architectures.
I'll proceed to debug kernels, and then performance testing. I will
try to start reviewing the actual code changes in depth next week :)
Cheers,
-- Nico
>
prev parent reply other threads:[~2026-02-28 0:06 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-26 11:23 Usama Arif
2026-02-26 11:23 ` [RFC v2 01/21] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
2026-02-26 11:23 ` [RFC v2 02/21] mm: thp: propagate split failure from vma_adjust_trans_huge() Usama Arif
2026-02-26 11:23 ` [RFC v2 03/21] mm: thp: handle split failure in copy_huge_pmd() Usama Arif
2026-02-26 11:23 ` [RFC v2 04/21] mm: thp: handle split failure in do_huge_pmd_wp_page() Usama Arif
2026-02-26 11:23 ` [RFC v2 05/21] mm: thp: handle split failure in zap_pmd_range() Usama Arif
2026-02-26 11:23 ` [RFC v2 06/21] mm: thp: handle split failure in wp_huge_pmd() Usama Arif
2026-02-26 11:23 ` [RFC v2 07/21] mm: thp: retry on split failure in change_pmd_range() Usama Arif
2026-02-26 11:23 ` [RFC v2 08/21] mm: thp: handle split failure in follow_pmd_mask() Usama Arif
2026-02-26 11:23 ` [RFC v2 09/21] mm: handle walk_page_range() failure from THP split Usama Arif
2026-02-26 11:23 ` [RFC v2 10/21] mm: thp: handle split failure in mremap move_page_tables() Usama Arif
2026-02-26 11:23 ` [RFC v2 11/21] mm: thp: handle split failure in userfaultfd move_pages() Usama Arif
2026-02-26 11:23 ` [RFC v2 12/21] mm: thp: handle split failure in device migration Usama Arif
2026-02-26 11:23 ` [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
2026-02-27 12:11 ` Usama Arif
2026-02-26 11:23 ` [RFC v2 14/21] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-02-26 11:23 ` [RFC v2 15/21] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed Usama Arif
2026-02-26 11:23 ` [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
2026-02-26 14:22 ` Usama Arif
2026-02-26 11:23 ` [RFC v2 17/21] selftests/mm: add THP PMD split test infrastructure Usama Arif
2026-02-26 11:23 ` [RFC v2 18/21] selftests/mm: add partial_mprotect test for change_pmd_range Usama Arif
2026-02-26 11:23 ` [RFC v2 19/21] selftests/mm: add partial_mlock test Usama Arif
2026-02-26 11:23 ` [RFC v2 20/21] selftests/mm: add partial_mremap test for move_page_tables Usama Arif
2026-02-26 11:23 ` [RFC v2 21/21] selftests/mm: add madv_dontneed_partial test Usama Arif
2026-02-26 21:01 ` [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split Nico Pache
2026-02-27 11:13 ` Usama Arif
2026-02-28 0:06 ` Nico Pache [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAA1CXcCygvA9uUJjB-+2J00srnHbiNGwbvcbqpRer8Vy8QBxWg@mail.gmail.com \
--to=npache@redhat.com \
--cc=Liam.Howlett@oracle.com \
--cc=agordeev@linux.ibm.com \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=borntraeger@linux.ibm.com \
--cc=david@kernel.org \
--cc=dev.jain@arm.com \
--cc=fvdl@google.com \
--cc=gor@linux.ibm.com \
--cc=hannes@cmpxchg.org \
--cc=hca@linux.ibm.com \
--cc=kas@kernel.org \
--cc=kernel-team@meta.com \
--cc=lance.yang@linux.dev \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-s390@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=maddy@linux.ibm.com \
--cc=mpe@ellerman.id.au \
--cc=riel@surriel.com \
--cc=ryan.roberts@arm.com \
--cc=shakeel.butt@linux.dev \
--cc=svens@linux.ibm.com \
--cc=usama.arif@linux.dev \
--cc=vbabka@kernel.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox