* [PATCH RFC 0/2] mm: add huge pfnmap support for remap_pfn_range()
@ 2025-09-23 13:31 Yin Tirui
2025-09-23 13:31 ` [PATCH RFC 1/2] pgtable: add pte_clrhuge() implementation for arm64 and riscv Yin Tirui
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Yin Tirui @ 2025-09-23 13:31 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ziy, baolin.wang, npache, ryan.roberts, dev.jain, baohua,
catalin.marinas, will, paul.walmsley, palmer, aou, alex,
anshuman.khandual, yangyicong, ardb, willy, apopple,
samuel.holland, luxu.kernel, abrestic, yongxuan.wang, linux-mm,
linux-kernel, linux-arm-kernel, linux-riscv
Cc: wangkefeng.wang, chenjun102, yintirui
Overview
========
This patch series adds huge page support for remap_pfn_range(),
automatically creating huge mappings when prerequisites are satisfied
(size, alignment, architecture support, etc.) and falling back to
normal page mappings otherwise.
This work builds on Peter Xu's previous efforts on huge pfnmap
support [0].
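As an illustration, a driver needs nothing beyond a conventional mmap()
handler to benefit. A minimal, hypothetical sketch (mydev_mmap() and
dev_phys_addr are made-up names; the backing region is assumed to be
physically contiguous and PMD-aligned):

static phys_addr_t dev_phys_addr;	/* set at probe time (hypothetical) */

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	/* remap_pfn_range() uses PMD mappings when size/alignment allow */
	return remap_pfn_range(vma, vma->vm_start,
			       dev_phys_addr >> PAGE_SHIFT,
			       size, vma->vm_page_prot);
}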
TODO
====
- Add PUD-level huge page support. Currently, only PMD-level huge
pages are supported.
- Review the logic shared with vmap_page_range() and extract
reusable common code.
Tests Done
==========
- Cross-build tests.
- Performance tests with custom device driver implementing mmap()
with remap_pfn_range():
- lat_mem_rd benchmark modified to use mmap(device_fd) instead of
malloc() shows around 40% improvement in memory access latency with
huge page support compared to normal page mappings.
numactl -C 0 lat_mem_rd -t 4096M (stride=64)
Memory Size (MB)   Without Huge Mapping   With Huge Mapping   Improvement
----------------   --------------------   -----------------   -----------
           64.00             148.858 ns          100.780 ns         32.3%
          128.00             164.745 ns          103.537 ns         37.2%
          256.00             169.907 ns          103.179 ns         39.3%
          512.00             171.285 ns          103.072 ns         39.8%
         1024.00             173.054 ns          103.055 ns         40.4%
         2048.00             172.820 ns          103.091 ns         40.3%
         4096.00             172.877 ns          103.115 ns         40.4%
- Custom memory copy operations on mmap(device_fd) show around 18% performance
improvement with huge page support compared to normal page mappings.
numactl -C 0 memcpy_test (memory copy performance test)
Memory Size (MB)   Without Huge Mapping   With Huge Mapping   Improvement
----------------   --------------------   -----------------   -----------
         1024.00               95.76 ms            77.91 ms         18.6%
         2048.00              190.87 ms           155.64 ms         18.5%
         4096.00              380.84 ms           311.45 ms         18.2%
[0] https://lore.kernel.org/all/20240826204353.2228736-2-peterx@redhat.com/T/#u
Yin Tirui (2):
pgtable: add pte_clrhuge() implementation for arm64 and riscv
mm: add PMD-level huge page support for remap_pfn_range()
arch/arm64/include/asm/pgtable.h | 8 ++++
arch/riscv/include/asm/pgtable.h | 5 +++
include/linux/pgtable.h | 6 ++-
mm/huge_memory.c | 22 +++++++---
mm/memory.c | 74 ++++++++++++++++++++++++++++----
5 files changed, 98 insertions(+), 17 deletions(-)
--
2.43.0
* [PATCH RFC 1/2] pgtable: add pte_clrhuge() implementation for arm64 and riscv
2025-09-23 13:31 [PATCH RFC 0/2] mm: add huge pfnmap support for remap_pfn_range() Yin Tirui
@ 2025-09-23 13:31 ` Yin Tirui
2025-09-23 13:31 ` [PATCH RFC 2/2] mm: add PMD-level huge page support for remap_pfn_range() Yin Tirui
2025-09-23 22:53 ` [syzbot ci] Re: mm: add huge pfnmap " syzbot ci
2 siblings, 0 replies; 10+ messages in thread
From: Yin Tirui @ 2025-09-23 13:31 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ziy, baolin.wang, npache, ryan.roberts, dev.jain, baohua,
catalin.marinas, will, paul.walmsley, palmer, aou, alex,
anshuman.khandual, yangyicong, ardb, willy, apopple,
samuel.holland, luxu.kernel, abrestic, yongxuan.wang, linux-mm,
linux-kernel, linux-arm-kernel, linux-riscv
Cc: wangkefeng.wang, chenjun102, yintirui
Add pte_clrhuge() helper function for architectures that enable
ARCH_SUPPORTS_HUGE_PFNMAP to clear huge page attributes from PTE
entries.
This function provides the inverse operation of pte_mkhuge() and will
be needed for upcoming huge page splitting, where PTE entries derived
from huge page mappings need to have their huge page attributes cleared.
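For example, the split path added in the next patch derives each PTE from
the old PMD and clears the huge attributes in one step:

	entry = pte_clrhuge(pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd)));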
Signed-off-by: Yin Tirui <yintirui@huawei.com>
---
arch/arm64/include/asm/pgtable.h | 8 ++++++++
arch/riscv/include/asm/pgtable.h | 5 +++++
2 files changed, 13 insertions(+)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index abd2dee416b3..244755bad46f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -366,6 +366,14 @@ static inline pte_t pte_mkinvalid(pte_t pte)
return pte;
}
+static inline pte_t pte_clrhuge(pte_t pte)
+{
+ pteval_t mask = PTE_TYPE_MASK & ~PTE_VALID;
+ pteval_t val = PTE_TYPE_PAGE & ~PTE_VALID;
+
+ return __pte((pte_val(pte) & ~mask) | val);
+}
+
static inline pmd_t pmd_mkcont(pmd_t pmd)
{
return __pmd(pmd_val(pmd) | PMD_SECT_CONT);
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 91697fbf1f90..125b241e6d2c 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -455,6 +455,11 @@ static inline pte_t pte_mkhuge(pte_t pte)
return pte;
}
+static inline pte_t pte_clrhuge(pte_t pte)
+{
+ return pte;
+}
+
#ifdef CONFIG_RISCV_ISA_SVNAPOT
#define pte_leaf_size(pte) (pte_napot(pte) ? \
napot_cont_size(napot_cont_order(pte)) :\
--
2.43.0
* [PATCH RFC 2/2] mm: add PMD-level huge page support for remap_pfn_range()
2025-09-23 13:31 [PATCH RFC 0/2] mm: add huge pfnmap support for remap_pfn_range() Yin Tirui
2025-09-23 13:31 ` [PATCH RFC 1/2] pgtable: add pte_clrhuge() implementation for arm64 and riscv Yin Tirui
@ 2025-09-23 13:31 ` Yin Tirui
2025-09-23 22:39 ` Matthew Wilcox
2025-09-24 9:50 ` David Hildenbrand
2025-09-23 22:53 ` [syzbot ci] Re: mm: add huge pfnmap " syzbot ci
2 siblings, 2 replies; 10+ messages in thread
From: Yin Tirui @ 2025-09-23 13:31 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ziy, baolin.wang, npache, ryan.roberts, dev.jain, baohua,
catalin.marinas, will, paul.walmsley, palmer, aou, alex,
anshuman.khandual, yangyicong, ardb, willy, apopple,
samuel.holland, luxu.kernel, abrestic, yongxuan.wang, linux-mm,
linux-kernel, linux-arm-kernel, linux-riscv
Cc: wangkefeng.wang, chenjun102, yintirui
Add PMD-level huge page support to remap_pfn_range(), automatically
creating huge mappings when prerequisites are satisfied (size, alignment,
architecture support, etc.) and falling back to normal page mappings
otherwise.
Implement special huge PMD splitting by utilizing the pgtable deposit/
withdraw mechanism. When splitting is needed, the deposited pgtable is
withdrawn and populated with individual PTEs created from the original
huge mapping, using pte_clrhuge() to clear huge page attributes.
Update arch_needs_pgtable_deposit() to return true when PMD pfnmap
support is enabled, ensuring proper pgtable management for huge
pfnmap operations.
Introduce pfnmap_max_page_shift parameter to control maximum page
size and "nohugepfnmap" boot option to disable huge pfnmap entirely.
Signed-off-by: Yin Tirui <yintirui@huawei.com>
---
include/linux/pgtable.h | 6 +++-
mm/huge_memory.c | 22 ++++++++----
mm/memory.c | 74 ++++++++++++++++++++++++++++++++++++-----
3 files changed, 85 insertions(+), 17 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 4c035637eeb7..4028318552ca 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1025,7 +1025,11 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
#endif
#ifndef arch_needs_pgtable_deposit
-#define arch_needs_pgtable_deposit() (false)
+#define arch_needs_pgtable_deposit arch_needs_pgtable_deposit
+static inline bool arch_needs_pgtable_deposit(void)
+{
+ return IS_ENABLED(CONFIG_ARCH_SUPPORTS_PMD_PFNMAP);
+}
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c38a95e9f09..9f20adcbbb55 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2857,14 +2857,22 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (!vma_is_anonymous(vma)) {
old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
- /*
- * We are going to unmap this huge page. So
- * just go ahead and zap it
- */
- if (arch_needs_pgtable_deposit())
- zap_deposited_table(mm, pmd);
- if (!vma_is_dax(vma) && vma_is_special_huge(vma))
+ if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
+ pte_t entry;
+
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ if (unlikely(!pgtable))
+ return;
+ pmd_populate(mm, &_pmd, pgtable);
+ pte = pte_offset_map(&_pmd, haddr);
+ entry = pte_clrhuge(pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd)));
+ set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);
+ pte_unmap(pte);
+
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
return;
+ }
if (unlikely(is_pmd_migration_entry(old_pmd))) {
swp_entry_t entry;
diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b71847..c4aaf3bd9cad 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2674,6 +2674,19 @@ vm_fault_t vmf_insert_mixed_mkwrite(struct vm_area_struct *vma,
return __vm_insert_mixed(vma, addr, pfn, true);
}
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+static unsigned int __ro_after_init pfnmap_max_page_shift = BITS_PER_LONG - 1;
+
+static int __init set_nohugepfnmap(char *str)
+{
+ pfnmap_max_page_shift = PAGE_SHIFT;
+ return 0;
+}
+early_param("nohugepfnmap", set_nohugepfnmap);
+#else /* CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP */
+static const unsigned int pfnmap_max_page_shift = PAGE_SHIFT;
+#endif /* CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP */
+
/*
* maps a range of physical memory into the requested pages. the old
* mappings are removed. any references to nonexistent pages results
@@ -2705,9 +2718,47 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
return err;
}
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+static int remap_try_huge_pmd(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, unsigned long end,
+ unsigned long pfn, pgprot_t prot,
+ unsigned int page_shift)
+{
+ pgtable_t pgtable;
+ spinlock_t *ptl;
+
+ if (page_shift < PMD_SHIFT)
+ return 0;
+
+ if ((end - addr) != PMD_SIZE)
+ return 0;
+
+ if (!IS_ALIGNED(addr, PMD_SIZE))
+ return 0;
+
+ if (!IS_ALIGNED(pfn, 1 << (PMD_SHIFT - PAGE_SHIFT)))
+ return 0;
+
+ if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
+ return 0;
+
+ set_pmd_at(mm, addr, pmd, pmd_mkspecial(pmd_mkhuge(pfn_pmd(pfn, prot))));
+
+ pgtable = pte_alloc_one(mm);
+ if (unlikely(!pgtable))
+ return 1;
+ mm_inc_nr_ptes(mm);
+ ptl = pmd_lock(mm, pmd);
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ spin_unlock(ptl);
+
+ return 1;
+}
+#endif
+
static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
unsigned long addr, unsigned long end,
- unsigned long pfn, pgprot_t prot)
+ unsigned long pfn, pgprot_t prot, unsigned int max_page_shift)
{
pmd_t *pmd;
unsigned long next;
@@ -2720,6 +2771,12 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
VM_BUG_ON(pmd_trans_huge(*pmd));
do {
next = pmd_addr_end(addr, end);
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+ if (remap_try_huge_pmd(mm, pmd, addr, next,
+ pfn + (addr >> PAGE_SHIFT), prot, max_page_shift)) {
+ continue;
+ }
+#endif
err = remap_pte_range(mm, pmd, addr, next,
pfn + (addr >> PAGE_SHIFT), prot);
if (err)
@@ -2730,7 +2787,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
unsigned long addr, unsigned long end,
- unsigned long pfn, pgprot_t prot)
+ unsigned long pfn, pgprot_t prot, unsigned int max_page_shift)
{
pud_t *pud;
unsigned long next;
@@ -2743,7 +2800,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
do {
next = pud_addr_end(addr, end);
err = remap_pmd_range(mm, pud, addr, next,
- pfn + (addr >> PAGE_SHIFT), prot);
+ pfn + (addr >> PAGE_SHIFT), prot, max_page_shift);
if (err)
return err;
} while (pud++, addr = next, addr != end);
@@ -2752,7 +2809,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
unsigned long addr, unsigned long end,
- unsigned long pfn, pgprot_t prot)
+ unsigned long pfn, pgprot_t prot, unsigned int max_page_shift)
{
p4d_t *p4d;
unsigned long next;
@@ -2765,7 +2822,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
do {
next = p4d_addr_end(addr, end);
err = remap_pud_range(mm, p4d, addr, next,
- pfn + (addr >> PAGE_SHIFT), prot);
+ pfn + (addr >> PAGE_SHIFT), prot, max_page_shift);
if (err)
return err;
} while (p4d++, addr = next, addr != end);
@@ -2773,7 +2830,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
}
static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long addr,
- unsigned long pfn, unsigned long size, pgprot_t prot)
+ unsigned long pfn, unsigned long size, pgprot_t prot, unsigned int max_page_shift)
{
pgd_t *pgd;
unsigned long next;
@@ -2817,7 +2874,7 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
do {
next = pgd_addr_end(addr, end);
err = remap_p4d_range(mm, pgd, addr, next,
- pfn + (addr >> PAGE_SHIFT), prot);
+ pfn + (addr >> PAGE_SHIFT), prot, max_page_shift);
if (err)
return err;
} while (pgd++, addr = next, addr != end);
@@ -2832,8 +2889,7 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t prot)
{
- int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
-
+ int error = remap_pfn_range_internal(vma, addr, pfn, size, prot, pfnmap_max_page_shift);
if (!error)
return 0;
--
2.43.0
* Re: [PATCH RFC 2/2] mm: add PMD-level huge page support for remap_pfn_range()
2025-09-23 13:31 ` [PATCH RFC 2/2] mm: add PMD-level huge page support for remap_pfn_range() Yin Tirui
@ 2025-09-23 22:39 ` Matthew Wilcox
2025-09-25 2:17 ` Yin Tirui
2025-09-24 9:50 ` David Hildenbrand
1 sibling, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2025-09-23 22:39 UTC (permalink / raw)
To: Yin Tirui
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ziy, baolin.wang, npache, ryan.roberts, dev.jain, baohua,
catalin.marinas, will, paul.walmsley, palmer, aou, alex,
anshuman.khandual, yangyicong, ardb, apopple, samuel.holland,
luxu.kernel, abrestic, yongxuan.wang, linux-mm, linux-kernel,
linux-arm-kernel, linux-riscv, wangkefeng.wang, chenjun102
On Tue, Sep 23, 2025 at 09:31:04PM +0800, Yin Tirui wrote:
> + entry = pte_clrhuge(pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd)));
This doesn't make sense. And I'm not saying you got this wrong; I
suspect in terms of how things work today it's actually necessary.
But the way we handle this stuff is so insane.
pte_clrhuge() should not exist. If we have a PTE, it can't have the
huge bit set, by definition (don't anybody mention hugetlbfs because
that is an entirely separate pile of broken horrors). I understand what
you're trying to do here. You want to construct a PTE that points to
the same address as the first page of the PMD and has the same
permissions. But that *should* be written as:
entry = pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd));
right? Now, pmd_pgprot() might or might not want to return the huge bit
set. I'm not sure. Perhaps you could have a look through and figure it
out. But pfn_pte() should never return a PTE with the huge bit set.
So if it is set in the pgprot on entry, it should filter it out.
There are going to be consequences to this. Maybe there's code
somewhere that relies on pfn_pte() returning a PTE with the huge bit
set. Perhaps it's hugetlbfs.
But we have to start cleaning this garbage up. I did some work with
e3981db444a0 and the commits leading up to that. See
https://lkml.kernel.org/r/20250402181709.2386022-12-willy@infradead.org
I'd like pte_clrhuge() to be deleted from x86, not added to arm and
riscv.
* [syzbot ci] Re: mm: add huge pfnmap support for remap_pfn_range()
2025-09-23 13:31 [PATCH RFC 0/2] mm: add huge pfnmap support for remap_pfn_range() Yin Tirui
2025-09-23 13:31 ` [PATCH RFC 1/2] pgtable: add pte_clrhuge() implementation for arm64 and riscv Yin Tirui
2025-09-23 13:31 ` [PATCH RFC 2/2] mm: add PMD-level huge page support for remap_pfn_range() Yin Tirui
@ 2025-09-23 22:53 ` syzbot ci
2 siblings, 0 replies; 10+ messages in thread
From: syzbot ci @ 2025-09-23 22:53 UTC (permalink / raw)
To: abrestic, akpm, alex, anshuman.khandual, aou, apopple, ardb,
baohua, baolin.wang, catalin.marinas, chenjun102, david,
dev.jain, liam.howlett, linux-arm-kernel, linux-kernel, linux-mm,
linux-riscv, lorenzo.stoakes, luxu.kernel, mhocko, npache,
palmer, paul.walmsley, rppt, ryan.roberts, samuel.holland,
surenb, vbabka, wangkefeng.wang, will, willy, yangyicong,
yintirui, yongxuan.wang, ziy
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v1] mm: add huge pfnmap support for remap_pfn_range()
https://lore.kernel.org/all/20250923133104.926672-1-yintirui@huawei.com
* [PATCH RFC 1/2] pgtable: add pte_clrhuge() implementation for arm64 and riscv
* [PATCH RFC 2/2] mm: add PMD-level huge page support for remap_pfn_range()
and found the following issues:
* BUG: non-zero pgtables_bytes on freeing mm: NUM
* stack segment fault in pgtable_trans_huge_withdraw
Full report is available here:
https://ci.syzbot.org/series/633cbff7-ef54-4f3a-9133-71cc271396ee
***
BUG: non-zero pgtables_bytes on freeing mm: NUM
tree: torvalds
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base: 07e27ad16399afcd693be20211b0dfae63e0615f
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/72b4b6cf-5400-40d6-94b6-1cfc0e85050d/config
C repro: https://ci.syzbot.org/findings/3450ef75-3540-4c00-8b33-5625d4aa40ef/c_repro
syz repro: https://ci.syzbot.org/findings/3450ef75-3540-4c00-8b33-5625d4aa40ef/syz_repro
BUG: non-zero pgtables_bytes on freeing mm: 4096
***
stack segment fault in pgtable_trans_huge_withdraw
tree: torvalds
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base: 07e27ad16399afcd693be20211b0dfae63e0615f
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/72b4b6cf-5400-40d6-94b6-1cfc0e85050d/config
C repro: https://ci.syzbot.org/findings/dcfb72b5-c263-48da-830a-7f51aaa927db/c_repro
syz repro: https://ci.syzbot.org/findings/dcfb72b5-c263-48da-830a-7f51aaa927db/syz_repro
Oops: stack segment: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 6000 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:pgtable_trans_huge_withdraw+0x115/0x310 mm/pgtable-generic.c:188
Code: c3 10 48 89 d8 48 c1 e8 03 42 80 3c 28 00 74 08 48 89 df e8 5d 38 13 00 48 8b 03 48 89 04 24 4c 8d 78 08 4c 89 fd 48 c1 ed 03 <42> 80 7c 2d 00 00 74 08 4c 89 ff e8 3b 38 13 00 49 8b 07 48 8d 48
RSP: 0018:ffffc90002d5f300 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffea0000fb3dd0 RCX: ffff888107769cc0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000001 R08: ffff888022b90843 R09: 1ffff11004572108
R10: dffffc0000000000 R11: ffffed1004572109 R12: ffff88803ecf7000
R13: dffffc0000000000 R14: ffff88803ecf7000 R15: 0000000000000008
FS: 0000555576e7a500(0000) GS:ffff8880b8612000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000107d74000 CR4: 00000000000006f0
Call Trace:
<TASK>
zap_deposited_table mm/huge_memory.c:2177 [inline]
zap_huge_pmd+0xa25/0xf50 mm/huge_memory.c:2205
zap_pmd_range mm/memory.c:1798 [inline]
zap_pud_range mm/memory.c:1847 [inline]
zap_p4d_range mm/memory.c:1868 [inline]
unmap_page_range+0x9fe/0x4370 mm/memory.c:1889
unmap_single_vma mm/memory.c:1932 [inline]
unmap_vmas+0x399/0x580 mm/memory.c:1976
exit_mmap+0x248/0xb50 mm/mmap.c:1280
__mmput+0x118/0x430 kernel/fork.c:1129
copy_process+0x2910/0x3c00 kernel/fork.c:2454
kernel_clone+0x21e/0x840 kernel/fork.c:2605
__do_sys_clone kernel/fork.c:2748 [inline]
__se_sys_clone kernel/fork.c:2732 [inline]
__x64_sys_clone+0x18b/0x1e0 kernel/fork.c:2732
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f96b638ec29
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc07e618c8 EFLAGS: 00000206 ORIG_RAX: 0000000000000038
RAX: ffffffffffffffda RBX: 00007f96b65d5fa0 RCX: 00007f96b638ec29
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000002001000
RBP: 00007f96b6411e41 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
R13: 00007f96b65d5fa0 R14: 00007f96b65d5fa0 R15: 0000000000000006
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:pgtable_trans_huge_withdraw+0x115/0x310 mm/pgtable-generic.c:188
Code: c3 10 48 89 d8 48 c1 e8 03 42 80 3c 28 00 74 08 48 89 df e8 5d 38 13 00 48 8b 03 48 89 04 24 4c 8d 78 08 4c 89 fd 48 c1 ed 03 <42> 80 7c 2d 00 00 74 08 4c 89 ff e8 3b 38 13 00 49 8b 07 48 8d 48
RSP: 0018:ffffc90002d5f300 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffea0000fb3dd0 RCX: ffff888107769cc0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000001 R08: ffff888022b90843 R09: 1ffff11004572108
R10: dffffc0000000000 R11: ffffed1004572109 R12: ffff88803ecf7000
R13: dffffc0000000000 R14: ffff88803ecf7000 R15: 0000000000000008
FS: 0000555576e7a500(0000) GS:ffff8880b8612000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000107d74000 CR4: 00000000000006f0
----------------
Code disassembly (best guess):
0: c3 ret
1: 10 48 89 adc %cl,-0x77(%rax)
4: d8 48 c1 fmuls -0x3f(%rax)
7: e8 03 42 80 3c call 0x3c80420f
c: 28 00 sub %al,(%rax)
e: 74 08 je 0x18
10: 48 89 df mov %rbx,%rdi
13: e8 5d 38 13 00 call 0x133875
18: 48 8b 03 mov (%rbx),%rax
1b: 48 89 04 24 mov %rax,(%rsp)
1f: 4c 8d 78 08 lea 0x8(%rax),%r15
23: 4c 89 fd mov %r15,%rbp
26: 48 c1 ed 03 shr $0x3,%rbp
* 2a: 42 80 7c 2d 00 00 cmpb $0x0,0x0(%rbp,%r13,1) <-- trapping instruction
30: 74 08 je 0x3a
32: 4c 89 ff mov %r15,%rdi
35: e8 3b 38 13 00 call 0x133875
3a: 49 8b 07 mov (%r15),%rax
3d: 48 rex.W
3e: 8d .byte 0x8d
3f: 48 rex.W
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
* Re: [PATCH RFC 2/2] mm: add PMD-level huge page support for remap_pfn_range()
2025-09-23 13:31 ` [PATCH RFC 2/2] mm: add PMD-level huge page support for remap_pfn_range() Yin Tirui
2025-09-23 22:39 ` Matthew Wilcox
@ 2025-09-24 9:50 ` David Hildenbrand
2025-09-25 1:43 ` Yin Tirui
1 sibling, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-09-24 9:50 UTC (permalink / raw)
To: Yin Tirui, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, catalin.marinas, will, paul.walmsley, palmer, aou, alex,
anshuman.khandual, yangyicong, ardb, willy, apopple,
samuel.holland, luxu.kernel, abrestic, yongxuan.wang, linux-mm,
linux-kernel, linux-arm-kernel, linux-riscv
Cc: wangkefeng.wang, chenjun102
On 23.09.25 15:31, Yin Tirui wrote:
> Add PMD-level huge page support to remap_pfn_range(), automatically
> creating huge mappings when prerequisites are satisfied (size, alignment,
> architecture support, etc.) and falling back to normal page mappings
> otherwise.
>
> Implement special huge PMD splitting by utilizing the pgtable deposit/
> withdraw mechanism. When splitting is needed, the deposited pgtable is
> withdrawn and populated with individual PTEs created from the original
> huge mapping, using pte_clrhuge() to clear huge page attributes.
>
> Update arch_needs_pgtable_deposit() to return true when PMD pfnmap
> support is enabled, ensuring proper pgtable management for huge
> pfnmap operations.
>
> Introduce pfnmap_max_page_shift parameter to control maximum page
> size and "nohugepfnmap" boot option to disable huge pfnmap entirely.
Why? If an arch supports it we should just do it. Or what's the reason
behind that?
>
> Signed-off-by: Yin Tirui <yintirui@huawei.com>
> ---
> include/linux/pgtable.h | 6 +++-
> mm/huge_memory.c | 22 ++++++++----
> mm/memory.c | 74 ++++++++++++++++++++++++++++++++++++-----
> 3 files changed, 85 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 4c035637eeb7..4028318552ca 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1025,7 +1025,11 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
> #endif
>
> #ifndef arch_needs_pgtable_deposit
> -#define arch_needs_pgtable_deposit() (false)
> +#define arch_needs_pgtable_deposit arch_needs_pgtable_deposit
> +static inline bool arch_needs_pgtable_deposit(void)
> +{
> + return IS_ENABLED(CONFIG_ARCH_SUPPORTS_PMD_PFNMAP);
> +}
> #endif
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9c38a95e9f09..9f20adcbbb55 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2857,14 +2857,22 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>
> if (!vma_is_anonymous(vma)) {
> old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
> - /*
> - * We are going to unmap this huge page. So
> - * just go ahead and zap it
> - */
> - if (arch_needs_pgtable_deposit())
> - zap_deposited_table(mm, pmd);
Are you sure we can just entirely remove this block for
!vma_is_anonymous(vma)?
--
Cheers
David / dhildenb
* Re: [PATCH RFC 2/2] mm: add PMD-level huge page support for remap_pfn_range()
2025-09-24 9:50 ` David Hildenbrand
@ 2025-09-25 1:43 ` Yin Tirui
2025-09-25 9:38 ` David Hildenbrand
0 siblings, 1 reply; 10+ messages in thread
From: Yin Tirui @ 2025-09-25 1:43 UTC (permalink / raw)
To: David Hildenbrand, akpm, lorenzo.stoakes, Liam.Howlett, vbabka,
rppt, surenb, mhocko, ziy, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, catalin.marinas, will, paul.walmsley, palmer,
aou, alex, anshuman.khandual, yangyicong, ardb, willy, apopple,
samuel.holland, luxu.kernel, abrestic, yongxuan.wang, linux-mm,
linux-kernel, linux-arm-kernel, linux-riscv
Cc: wangkefeng.wang, chenjun102
On 9/24/2025 5:50 PM, David Hildenbrand wrote:
>> Introduce pfnmap_max_page_shift parameter to control maximum page
>> size and "nohugepfnmap" boot option to disable huge pfnmap entirely.
>
> Why? If an arch supports it we should just do it. Or what's the reason
> behind that?
>
There's no specific reason for this - it was just intended to provide an
additional option. I'll remove it in the next version.
...
> Are you sure we can just entirely remove this block for !
> vma_is_anonymous(vma)?
>
Thank you for pointing this out!
There is definitely a problem with removing this block entirely for
non-anonymous VMAs. I've also found some other problems. I'll fix all of
them in the next version.
--
Best regards,
Yin Tirui
* Re: [PATCH RFC 2/2] mm: add PMD-level huge page support for remap_pfn_range()
2025-09-23 22:39 ` Matthew Wilcox
@ 2025-09-25 2:17 ` Yin Tirui
0 siblings, 0 replies; 10+ messages in thread
From: Yin Tirui @ 2025-09-25 2:17 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ziy, baolin.wang, npache, ryan.roberts, dev.jain, baohua,
catalin.marinas, will, paul.walmsley, palmer, aou, alex,
anshuman.khandual, yangyicong, ardb, apopple, samuel.holland,
luxu.kernel, abrestic, yongxuan.wang, linux-mm, linux-kernel,
linux-arm-kernel, linux-riscv, wangkefeng.wang, chenjun102
On 9/24/2025 6:39 AM, Matthew Wilcox wrote:
> On Tue, Sep 23, 2025 at 09:31:04PM +0800, Yin Tirui wrote:
>> + entry = pte_clrhuge(pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd)));
>
> This doesn't make sense. And I'm not saying you got this wrong; I
> suspect in terms of how things work today it's actually necessary.
> But the way we handle this stuff is so insane.
Thank you for pointing this out and the broader context.
>
> pte_clrhuge() should not exist. If we have a PTE, it can't have the
> huge bit set, by definition (don't anybody mention hugetlbfs because
> that is an entirely separate pile of broken horrors). I understand what
> you're trying to do here. You want to construct a PTE that points to
> the same address as the first page of the PMD and has the same
> permissions. But that *should* be written as:
>
> entry = pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd));
>
> right? Now, pmd_pgprot() might or might not want to return the huge bit
> set. I'm not sure. Perhaps you could have a look through and figure it
I've tested this on arm64, and pmd_pgprot() does return the huge bit
set, which is exactly why I added pte_clrhuge().
> out. But pfn_pte() should never return a PTE with the huge bit set.
> So if it is set in the pgprot on entry, it should filter it out.
>
> There are going to be consequences to this. Maybe there's code
> somewhere that relies on pfn_pte() returning a PTE with the huge bit
> set. Perhaps it's hugetlbfs.
I'll try to refactor pfn_pte() and related functions to filter out the
huge bit and test the impact on hugetlbfs.
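Something along these lines on arm64, perhaps (rough, untested sketch that
reuses the same masking as pte_clrhuge() in patch 1):

static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot)
{
	pteval_t mask = PTE_TYPE_MASK & ~PTE_VALID;
	pteval_t type = PTE_TYPE_PAGE & ~PTE_VALID;
	pteval_t val = __phys_to_pte_val((phys_addr_t)pfn << PAGE_SHIFT) |
		       pgprot_val(prot);

	/* always construct a page-type (non-block) entry */
	return __pte((val & ~mask) | type);
}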
>
> But we have to start cleaning this garbage up. I did some work with
> e3981db444a0 and the commits leading up to that. See
> https://lkml.kernel.org/r/20250402181709.2386022-12-willy@infradead.org
>
> I'd like pte_clrhuge() to be deleted from x86, not added to arm and
> riscv.
>
I completely agree with the goal of deleting pte_clrhuge() rather than
expanding it. I'll study your referenced work and align my approach with
your efforts.
Would you recommend I address the pfn_pte() and related function
refactoring as part of this patch series, or should I submit it as a
separate patch series?
--
Best regards,
Yin Tirui
* Re: [PATCH RFC 2/2] mm: add PMD-level huge page support for remap_pfn_range()
2025-09-25 1:43 ` Yin Tirui
@ 2025-09-25 9:38 ` David Hildenbrand
0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand @ 2025-09-25 9:38 UTC (permalink / raw)
To: Yin Tirui, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
baohua, catalin.marinas, will, paul.walmsley, palmer, aou, alex,
anshuman.khandual, yangyicong, ardb, willy, apopple,
samuel.holland, luxu.kernel, abrestic, yongxuan.wang, linux-mm,
linux-kernel, linux-arm-kernel, linux-riscv
Cc: wangkefeng.wang, chenjun102
On 25.09.25 03:43, Yin Tirui wrote:
>
>
> On 9/24/2025 5:50 PM, David Hildenbrand wrote:
>>> Introduce pfnmap_max_page_shift parameter to control maximum page
>>> size and "nohugepfnmap" boot option to disable huge pfnmap entirely.
>>
>> Why? If an arch supports it we should just do it. Or what's the reason
>> behind that?
>>
> There's no specific reason for this - it was just intended to provide an
> additional option. I'll remove it in the next version.
Good, then let's keep it simple :)
--
Cheers
David / dhildenb
* [PATCH RFC 2/2] mm: add PMD-level huge page support for remap_pfn_range()
2025-10-16 11:27 [PATCH RFC v2 0/2] " Yin Tirui
@ 2025-10-16 11:27 ` Yin Tirui
0 siblings, 0 replies; 10+ messages in thread
From: Yin Tirui @ 2025-10-16 11:27 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ziy, baolin.wang, npache, ryan.roberts, dev.jain, baohua,
catalin.marinas, will, paul.walmsley, palmer, aou, alex,
anshuman.khandual, yangyicong, ardb, willy, apopple,
samuel.holland, luxu.kernel, abrestic, yongxuan.wang, linux-mm,
linux-kernel, linux-arm-kernel, linux-riscv
Cc: wangkefeng.wang, chenjun102, yintirui
Add PMD-level huge page support to remap_pfn_range(), automatically
creating huge mappings when prerequisites are satisfied (size, alignment,
architecture support, etc.) and falling back to normal page mappings
otherwise.
Implement special huge PMD splitting by utilizing the pgtable deposit/
withdraw mechanism. When splitting is needed, the deposited pgtable is
withdrawn and populated with individual PTEs created from the original
huge mapping, using pte_clrhuge() to clear huge page attributes.
Update arch_needs_pgtable_deposit() to return true when PMD pfnmap
support is enabled, ensuring proper pgtable management for huge
pfnmap operations.
Signed-off-by: Yin Tirui <yintirui@huawei.com>
---
include/linux/pgtable.h | 6 +++++-
mm/huge_memory.c | 26 +++++++++++++++++++-------
mm/memory.c | 40 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 64 insertions(+), 8 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 25a7257052ff..9ae015cb67a0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1025,7 +1025,11 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
#endif
#ifndef arch_needs_pgtable_deposit
-#define arch_needs_pgtable_deposit() (false)
+#define arch_needs_pgtable_deposit arch_needs_pgtable_deposit
+static inline bool arch_needs_pgtable_deposit(void)
+{
+ return IS_ENABLED(CONFIG_ARCH_SUPPORTS_PMD_PFNMAP);
+}
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c38a95e9f09..b5eecd8fc1bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2857,14 +2857,26 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (!vma_is_anonymous(vma)) {
old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
- /*
- * We are going to unmap this huge page. So
- * just go ahead and zap it
- */
- if (arch_needs_pgtable_deposit())
- zap_deposited_table(mm, pmd);
- if (!vma_is_dax(vma) && vma_is_special_huge(vma))
+ if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
+ pte_t entry;
+
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ if (unlikely(!pgtable))
+ return;
+ pmd_populate(mm, &_pmd, pgtable);
+ pte = pte_offset_map(&_pmd, haddr);
+ entry = pte_clrhuge(pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd)));
+ set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);
+ pte_unmap(pte);
+
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
return;
+ } else if (arch_needs_pgtable_deposit()) {
+ /* Zap for the non-special mappings. */
+ zap_deposited_table(mm, pmd);
+ }
+
if (unlikely(is_pmd_migration_entry(old_pmd))) {
swp_entry_t entry;
diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b71847..4e8f2248a86f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2705,6 +2705,40 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
return err;
}
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+static int remap_try_huge_pmd(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, unsigned long end,
+ unsigned long pfn, pgprot_t prot)
+{
+ pgtable_t pgtable;
+ spinlock_t *ptl;
+
+ if ((end - addr) != PMD_SIZE)
+ return 0;
+
+ if (!IS_ALIGNED(addr, PMD_SIZE))
+ return 0;
+
+ if (!IS_ALIGNED(pfn, HPAGE_PMD_NR))
+ return 0;
+
+ if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
+ return 0;
+
+ pgtable = pte_alloc_one(mm);
+ if (unlikely(!pgtable))
+ return 0;
+
+ mm_inc_nr_ptes(mm);
+ ptl = pmd_lock(mm, pmd);
+ set_pmd_at(mm, addr, pmd, pmd_mkspecial(pmd_mkhuge(pfn_pmd(pfn, prot))));
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ spin_unlock(ptl);
+
+ return 1;
+}
+#endif
+
static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
unsigned long addr, unsigned long end,
unsigned long pfn, pgprot_t prot)
@@ -2720,6 +2754,12 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
VM_BUG_ON(pmd_trans_huge(*pmd));
do {
next = pmd_addr_end(addr, end);
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+ if (remap_try_huge_pmd(mm, pmd, addr, next,
+ pfn + (addr >> PAGE_SHIFT), prot)) {
+ continue;
+ }
+#endif
err = remap_pte_range(mm, pmd, addr, next,
pfn + (addr >> PAGE_SHIFT), prot);
if (err)
--
2.43.0
end of thread (newest message: 2025-10-16 11:33 UTC)
Thread overview: 10+ messages
2025-09-23 13:31 [PATCH RFC 0/2] mm: add huge pfnmap support for remap_pfn_range() Yin Tirui
2025-09-23 13:31 ` [PATCH RFC 1/2] pgtable: add pte_clrhuge() implementation for arm64 and riscv Yin Tirui
2025-09-23 13:31 ` [PATCH RFC 2/2] mm: add PMD-level huge page support for remap_pfn_range() Yin Tirui
2025-09-23 22:39 ` Matthew Wilcox
2025-09-25 2:17 ` Yin Tirui
2025-09-24 9:50 ` David Hildenbrand
2025-09-25 1:43 ` Yin Tirui
2025-09-25 9:38 ` David Hildenbrand
2025-09-23 22:53 ` [syzbot ci] Re: mm: add huge pfnmap " syzbot ci
2025-10-16 11:27 [PATCH RFC v2 0/2] " Yin Tirui
2025-10-16 11:27 ` [PATCH RFC 2/2] mm: add PMD-level huge page " Yin Tirui