linux-mm.kvack.org archive mirror
* [PATCH RFC v3 0/4] mm: add huge pfnmap support for remap_pfn_range()
@ 2026-02-28  7:09 Yin Tirui
From: Yin Tirui @ 2026-02-28  7:09 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86, linux-arm-kernel, willy, david,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, luto,
	peterz, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, rppt,
	surenb, mhocko, anshuman.khandual, rmclure, kevin.brodsky,
	apopple, ajd, pasha.tatashin, bhe, thuth, coxu, dan.j.williams,
	yu-cheng.yu, yangyicong, baolu.lu, jgross, conor.dooley,
	Jonathan.Cameron, riel
  Cc: wangkefeng.wang, chenjun102, yintirui

v3:
1. Architectural Type Safety (Matthew Wilcox):
Following the insightful architectural feedback from Matthew Wilcox in v2,
the approach to clearing huge page attributes has been completely redesigned.
Instead of spreading the `pte_clrhuge()` anti-pattern to ARM64 and RISC-V,
this series enforces strict type safety at the lowest level: `pfn_pte()`
must never natively return a PTE with huge page attributes set.

To achieve this without breaking the x86 core MM, the series is structured as:
  - Fix historical type-casting abuses in x86 (vmemmap, vmalloc, CPA) where
    `pfn_pte()` was wrongly used to generate huge PMDs/PUDs.
  - Update `pfn_pte()` on x86 and ARM64 to inherently filter out huge page
    attributes. (RISC-V is already compliant: its leaf PMDs and PTEs
    share the same hardware format, with no dedicated "huge" bit to filter.)
  - Completely eradicate `pte_clrhuge()` from the x86 tree and clean up
    the type-casting mess in `arch/x86/mm/init_64.c`.

2. Page Table Deposit fix during clone() (syzbot):
Previously, `copy_huge_pmd()` was unaware of special PMDs created by pfnmap,
failing to deposit a page table for the child process during `clone()`.
This led to crashes during process teardown or PMD splitting. The logic is now
updated to properly allocate and deposit pgtables for `pmd_special()` entries.

v2: https://lore.kernel.org/linux-mm/20251016112704.179280-1-yintirui@huawei.com/#t
- remove "nohugepfnmap" boot option and "pfnmap_max_page_shift" variable.
- zap_deposited_table for non-special pmd.
- move set_pmd_at() inside pmd_lock.
- prevent PMD mapping creation when pgtable allocation fails.
- defer the refactor of pte_clrhuge() to a separate patch series. For now,
  add a TODO to track this.

v1: https://lore.kernel.org/linux-mm/20250923133104.926672-1-yintirui@huawei.com/

Overview
========
This patch series adds huge page support for remap_pfn_range(),
automatically creating huge mappings when prerequisites are satisfied
(size, alignment, architecture support, etc.) and falling back to
normal page mappings otherwise.

This work builds on Peter Xu's previous efforts on huge pfnmap
support [0].

TODO
====
- Add PUD-level huge page support. Currently, only PMD-level huge
  pages are supported.

Tests Done
==========
- Cross-build tests.
- Core MM Regression Tests
   - Booted x86 kernel with `debug_pagealloc=on` to heavily stress the
     large page splitting logic in direct mapping. No panics observed.
   - Ran `make -C tools/testing/selftests/vm run_tests`. Both THP and
     Hugetlbfs tests passed successfully, proving the `pfn_pte()` changes
     do not interfere with native huge page generation.
- Functional Tests (with a custom device driver & PTDUMP):
   - Verified that `remap_pfn_range()` successfully creates 2MB mappings
     by observing `/sys/kernel/debug/page_tables/current_user`.
   - Triggered PMD splits via 4K-granular `mprotect()` and partial `munmap()`,
     verifying correct fallback to 512 PTEs without corrupting permissions
     or causing kernel crashes.
   - Triggered `fork()`/`clone()` on the mapped regions, validating the
     syzbot fix and ensuring safe pgtable deposit/withdraw lifecycle.
- Performance tests with custom device driver implementing mmap()
  with remap_pfn_range():
    - lat_mem_rd benchmark modified to use mmap(device_fd) instead of
      malloc() shows around 40% improvement in memory access latency with
      huge page support compared to normal page mappings.

      numactl -C 0 lat_mem_rd -t 4096M (stride=64)
      Memory Size (MB)    Without Huge Mapping With Huge Mapping Improvement
      ----------------    -----------------    --------------    -----------
      64.00               148.858 ns           100.780 ns        32.3%
      128.00              164.745 ns           103.537 ns        37.2%
      256.00              169.907 ns           103.179 ns        39.3%
      512.00              171.285 ns           103.072 ns        39.8%
      1024.00             173.054 ns           103.055 ns        40.4%
      2048.00             172.820 ns           103.091 ns        40.3%
      4096.00             172.877 ns           103.115 ns        40.4%

    - Custom memory copy operations on mmap(device_fd) show around 18% performance 
      improvement with huge page support compared to normal page mappings.

      numactl -C 0 memcpy_test (memory copy performance test)
      Memory Size (MB)    Without Huge Mapping With Huge Mapping Improvement
      ----------------    -----------------    --------------    -----------
      1024.00             95.76 ms             77.91 ms          18.6%
      2048.00             190.87 ms            155.64 ms         18.5%
      4096.00             380.84 ms            311.45 ms         18.2%

[0] https://lore.kernel.org/all/20240826204353.2228736-2-peterx@redhat.com/T/#u

Yin Tirui (4):
  x86/mm: Use proper page table helpers for huge page generation
  mm/pgtable: Make pfn_pte() filter out huge page attributes
  x86/mm: Remove pte_clrhuge() and clean up init_64.c
  mm: add PMD-level huge page support for remap_pfn_range()

 arch/arm64/include/asm/pgtable.h |  4 +++-
 arch/x86/include/asm/pgtable.h   |  9 ++++---
 arch/x86/mm/init_64.c            | 10 ++++----
 arch/x86/mm/pat/set_memory.c     |  6 ++++-
 arch/x86/mm/pgtable.c            |  4 ++--
 mm/huge_memory.c                 | 36 ++++++++++++++++++++++++++--
 mm/memory.c                      | 40 ++++++++++++++++++++++++++++++++
 7 files changed, 93 insertions(+), 16 deletions(-)

-- 
2.22.0


