linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Nico Pache <npache@redhat.com>
To: Usama Arif <usama.arif@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	david@kernel.org, lorenzo.stoakes@oracle.com,
	 willy@infradead.org, linux-mm@kvack.org, fvdl@google.com,
	hannes@cmpxchg.org,  riel@surriel.com, shakeel.butt@linux.dev,
	kas@kernel.org, baohua@kernel.org,  dev.jain@arm.com,
	baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
	 ryan.roberts@arm.com, Vlastimil Babka <vbabka@kernel.org>,
	lance.yang@linux.dev,  linux-kernel@vger.kernel.org,
	kernel-team@meta.com, maddy@linux.ibm.com,  mpe@ellerman.id.au,
	linuxppc-dev@lists.ozlabs.org, hca@linux.ibm.com,
	 gor@linux.ibm.com, agordeev@linux.ibm.com,
	borntraeger@linux.ibm.com,  svens@linux.ibm.com,
	linux-s390@vger.kernel.org
Subject: Re: [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split
Date: Thu, 26 Feb 2026 14:01:04 -0700	[thread overview]
Message-ID: <CAA1CXcAYt3OfW_uBTYZgr-dBhg99x=5pUs5uvqtpg+PNJ1KxGQ@mail.gmail.com> (raw)
In-Reply-To: <20260226113233.3987674-1-usama.arif@linux.dev>

On Thu, Feb 26, 2026 at 4:33 AM Usama Arif <usama.arif@linux.dev> wrote:
>
> When the kernel creates a PMD-level THP mapping for anonymous pages, it
> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
> page table sits unused in a deposit list for the lifetime of the THP
> mapping, only to be withdrawn when the PMD is split or zapped. Every
> anonymous THP therefore wastes 4KB of memory unconditionally. On large
> servers where hundreds of gigabytes of memory are mapped as THPs, this
> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
> could otherwise satisfy other allocations, including the very PTE page
> table allocations needed when splits eventually occur.
>
> This series removes the pre-deposit and allocates the PTE page table
> lazily — only when a PMD split actually happens. Since a large number
> of THPs are never split (they are zapped wholesale when processes exit or
> munmap the full range), the allocation is avoided entirely in the common
> case.
>
> The pre-deposit pattern exists because split_huge_pmd was designed as an
> operation that must never fail: if the kernel decides to split, it needs
> a PTE page table, so one is deposited in advance. But "must never fail"
> is an unnecessarily strong requirement. A PMD split is typically triggered
> by a partial operation on a sub-PMD range — partial munmap, partial
> mprotect, partial mremap and so on.
> Most of these operations already have well-defined error handling for
> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
> fail and propagating the error through these existing paths is the natural
> thing to do. Furthermore, split failing requires an order-0 allocation for
> a page table to fail, which is extremely unlikely.
>
> Designing functions like split_huge_pmd as operations that cannot fail
> has a subtle but real cost to code quality. It forces a pre-allocation
> pattern - every THP creation path must deposit a page table, and every
> split or zap path must withdraw one, creating a hidden coupling between
> widely separated code paths.
>
> This also serves as a code cleanup. On every architecture except powerpc
> with hash MMU, the deposit/withdraw machinery becomes dead code. The
> series removes the generic implementations in pgtable-generic.c and the
> s390/sparc overrides, replacing them with no-op stubs guarded by
> arch_needs_pgtable_deposit(), which evaluates to false at compile time
> on all non-powerpc architectures.

Hi Usama,

Thanks for tackling this, it seems like an interesting problem. Im
trying to get more into reviewing, so bare with me I may have some
stupid comments or questions. Where I can really help out is with
testing. I will build this for all RH-supported architectures and run
some automated test suites and performance metrics. I'll report back
if I spot anything.

Cheers!
-- Nico

>
> The series is structured as follows:
>
> Patches 1-2:    Error infrastructure — make split functions return int
>                 and propagate errors from vma_adjust_trans_huge()
>                 through __split_vma, vma_shrink, and commit_merge.
>
> Patches 3-12:   Handle split failure at every call site — copy_huge_pmd,
>                 do_huge_pmd_wp_page, zap_pmd_range, wp_huge_pmd,
>                 change_pmd_range (mprotect), follow_pmd_mask (GUP),
>                 walk_pmd_range (pagewalk), move_page_tables (mremap),
>                 move_pages (userfaultfd), and device migration.
>                 The code will become affective in Patch 14 when split
>                 functions start returning -ENOMEM.
>
> Patch 13:       Add __must_check to __split_huge_pmd(), split_huge_pmd()
>                 and split_huge_pmd_address() so the compiler warns on
>                 unchecked return values.
>
> Patch 14:       The actual change — allocate PTE page tables lazily at
>                 split time instead of pre-depositing at THP creation.
>                 This is when split functions will actually start returning
>                 -ENOMEM.
>
> Patch 15:       Remove the now-dead deposit/withdraw code on
>                 non-powerpc architectures.
>
> Patch 16:       Add THP_SPLIT_PMD_FAILED vmstat counter for monitoring
>                 split failures.
>
> Patches 17-21:  Selftests covering partial munmap, mprotect, mlock,
>                 mremap, and MADV_DONTNEED on THPs to exercise the
>                 split paths.
>
> The error handling patches are placed before the lazy allocation patch so
> that every call site is already prepared to handle split failures before
> the failure mode is introduced. This makes each patch independently safe
> to apply and bisect through.
>
> The patches were tested with CONFIG_DEBUG_ATOMIC_SLEEP and CONFIG_DEBUG_VM
> enabled. The test results are below:
>
> TAP version 13
> 1..5
> # Starting 5 tests from 1 test cases.
> #  RUN           thp_pmd_split.partial_munmap ...
> # thp_pmd_split_test.c:60:partial_munmap:thp_split_pmd: 0 -> 1
> # thp_pmd_split_test.c:62:partial_munmap:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_munmap
> ok 1 thp_pmd_split.partial_munmap
> #  RUN           thp_pmd_split.partial_mprotect ...
> # thp_pmd_split_test.c:60:partial_mprotect:thp_split_pmd: 1 -> 2
> # thp_pmd_split_test.c:62:partial_mprotect:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_mprotect
> ok 2 thp_pmd_split.partial_mprotect
> #  RUN           thp_pmd_split.partial_mlock ...
> # thp_pmd_split_test.c:60:partial_mlock:thp_split_pmd: 2 -> 3
> # thp_pmd_split_test.c:62:partial_mlock:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_mlock
> ok 3 thp_pmd_split.partial_mlock
> #  RUN           thp_pmd_split.partial_mremap ...
> # thp_pmd_split_test.c:60:partial_mremap:thp_split_pmd: 3 -> 4
> # thp_pmd_split_test.c:62:partial_mremap:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_mremap
> ok 4 thp_pmd_split.partial_mremap
> #  RUN           thp_pmd_split.partial_madv_dontneed ...
> # thp_pmd_split_test.c:60:partial_madv_dontneed:thp_split_pmd: 4 -> 5
> # thp_pmd_split_test.c:62:partial_madv_dontneed:thp_split_pmd_failed: 0 -> 0
> #            OK  thp_pmd_split.partial_madv_dontneed
> ok 5 thp_pmd_split.partial_madv_dontneed
> # PASSED: 5 / 5 tests passed.
> # Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> The patches are based off of 957a3fab8811b455420128ea5f41c51fd23eb6c7 from
> mm-unstable as of 25 Feb (7.0.0-rc1).
>
>
> RFC v1 -> v2: https://lore.kernel.org/all/20260211125507.4175026-1-usama.arif@linux.dev/
> - Change counter name to THP_SPLIT_PMD_FAILED (David)
> - remove pgtable_trans_huge_{deposit/withdraw} when not needed and
>   make them arch specific (David)
> - make split functions return error code and have callers handle them
>   (David and Kiryl)
> - Add test cases for splitting
>
> Usama Arif (21):
>   mm: thp: make split_huge_pmd functions return int for error
>     propagation
>   mm: thp: propagate split failure from vma_adjust_trans_huge()
>   mm: thp: handle split failure in copy_huge_pmd()
>   mm: thp: handle split failure in do_huge_pmd_wp_page()
>   mm: thp: handle split failure in zap_pmd_range()
>   mm: thp: handle split failure in wp_huge_pmd()
>   mm: thp: retry on split failure in change_pmd_range()
>   mm: thp: handle split failure in follow_pmd_mask()
>   mm: handle walk_page_range() failure from THP split
>   mm: thp: handle split failure in mremap move_page_tables()
>   mm: thp: handle split failure in userfaultfd move_pages()
>   mm: thp: handle split failure in device migration
>   mm: huge_mm: Make sure all split_huge_pmd calls are checked
>   mm: thp: allocate PTE page tables lazily at split time
>   mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed
>   mm: thp: add THP_SPLIT_PMD_FAILED counter
>   selftests/mm: add THP PMD split test infrastructure
>   selftests/mm: add partial_mprotect test for change_pmd_range
>   selftests/mm: add partial_mlock test
>   selftests/mm: add partial_mremap test for move_page_tables
>   selftests/mm: add madv_dontneed_partial test
>
>  arch/powerpc/include/asm/book3s/64/pgtable.h  |  12 +-
>  arch/s390/include/asm/pgtable.h               |   6 -
>  arch/s390/mm/pgtable.c                        |  41 ---
>  arch/sparc/include/asm/pgtable_64.h           |   6 -
>  arch/sparc/mm/tlb.c                           |  36 ---
>  include/linux/huge_mm.h                       |  51 +--
>  include/linux/pgtable.h                       |  16 +-
>  include/linux/vm_event_item.h                 |   1 +
>  mm/debug_vm_pgtable.c                         |   4 +-
>  mm/gup.c                                      |  10 +-
>  mm/huge_memory.c                              | 208 +++++++++----
>  mm/khugepaged.c                               |   7 +-
>  mm/memory.c                                   |  26 +-
>  mm/migrate_device.c                           |  33 +-
>  mm/mprotect.c                                 |  11 +-
>  mm/mremap.c                                   |   8 +-
>  mm/pagewalk.c                                 |   8 +-
>  mm/pgtable-generic.c                          |  32 --
>  mm/rmap.c                                     |  42 ++-
>  mm/userfaultfd.c                              |   8 +-
>  mm/vma.c                                      |  37 ++-
>  mm/vmstat.c                                   |   1 +
>  tools/testing/selftests/mm/Makefile           |   1 +
>  .../testing/selftests/mm/thp_pmd_split_test.c | 290 ++++++++++++++++++
>  tools/testing/vma/include/stubs.h             |   9 +-
>  25 files changed, 645 insertions(+), 259 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/thp_pmd_split_test.c
>
> --
> 2.47.3
>



      parent reply	other threads:[~2026-02-26 21:01 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-26 11:23 Usama Arif
2026-02-26 11:23 ` [RFC v2 01/21] mm: thp: make split_huge_pmd functions return int for error propagation Usama Arif
2026-02-26 11:23 ` [RFC v2 02/21] mm: thp: propagate split failure from vma_adjust_trans_huge() Usama Arif
2026-02-26 11:23 ` [RFC v2 03/21] mm: thp: handle split failure in copy_huge_pmd() Usama Arif
2026-02-26 11:23 ` [RFC v2 04/21] mm: thp: handle split failure in do_huge_pmd_wp_page() Usama Arif
2026-02-26 11:23 ` [RFC v2 05/21] mm: thp: handle split failure in zap_pmd_range() Usama Arif
2026-02-26 11:23 ` [RFC v2 06/21] mm: thp: handle split failure in wp_huge_pmd() Usama Arif
2026-02-26 11:23 ` [RFC v2 07/21] mm: thp: retry on split failure in change_pmd_range() Usama Arif
2026-02-26 11:23 ` [RFC v2 08/21] mm: thp: handle split failure in follow_pmd_mask() Usama Arif
2026-02-26 11:23 ` [RFC v2 09/21] mm: handle walk_page_range() failure from THP split Usama Arif
2026-02-26 11:23 ` [RFC v2 10/21] mm: thp: handle split failure in mremap move_page_tables() Usama Arif
2026-02-26 11:23 ` [RFC v2 11/21] mm: thp: handle split failure in userfaultfd move_pages() Usama Arif
2026-02-26 11:23 ` [RFC v2 12/21] mm: thp: handle split failure in device migration Usama Arif
2026-02-26 11:23 ` [RFC v2 13/21] mm: huge_mm: Make sure all split_huge_pmd calls are checked Usama Arif
2026-02-26 11:23 ` [RFC v2 14/21] mm: thp: allocate PTE page tables lazily at split time Usama Arif
2026-02-26 11:23 ` [RFC v2 15/21] mm: thp: remove pgtable_trans_huge_{deposit/withdraw} when not needed Usama Arif
2026-02-26 11:23 ` [RFC v2 16/21] mm: thp: add THP_SPLIT_PMD_FAILED counter Usama Arif
2026-02-26 14:22   ` Usama Arif
2026-02-26 11:23 ` [RFC v2 17/21] selftests/mm: add THP PMD split test infrastructure Usama Arif
2026-02-26 11:23 ` [RFC v2 18/21] selftests/mm: add partial_mprotect test for change_pmd_range Usama Arif
2026-02-26 11:23 ` [RFC v2 19/21] selftests/mm: add partial_mlock test Usama Arif
2026-02-26 11:23 ` [RFC v2 20/21] selftests/mm: add partial_mremap test for move_page_tables Usama Arif
2026-02-26 11:23 ` [RFC v2 21/21] selftests/mm: add madv_dontneed_partial test Usama Arif
2026-02-26 21:01 ` Nico Pache [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAA1CXcAYt3OfW_uBTYZgr-dBhg99x=5pUs5uvqtpg+PNJ1KxGQ@mail.gmail.com' \
    --to=npache@redhat.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=agordeev@linux.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=fvdl@google.com \
    --cc=gor@linux.ibm.com \
    --cc=hannes@cmpxchg.org \
    --cc=hca@linux.ibm.com \
    --cc=kas@kernel.org \
    --cc=kernel-team@meta.com \
    --cc=lance.yang@linux.dev \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=maddy@linux.ibm.com \
    --cc=mpe@ellerman.id.au \
    --cc=riel@surriel.com \
    --cc=ryan.roberts@arm.com \
    --cc=shakeel.butt@linux.dev \
    --cc=svens@linux.ibm.com \
    --cc=usama.arif@linux.dev \
    --cc=vbabka@kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox