* [PATCH mm-unstable v14 00/16] khugepaged: mTHP support
@ 2026-01-22 19:28 Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 01/16] mm: introduce is_pmd_order helper Nico Pache
` (17 more replies)
0 siblings, 18 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang
The following series provides khugepaged with the capability to collapse
anonymous memory regions to mTHPs.
To achieve this, we generalize the khugepaged functions so they no longer
depend on PMD_ORDER. Then, during the PMD scan, we use a bitmap to track
which individual pages are occupied (!none/zero). After the PMD scan is
done, we use the bitmap to find the optimal mTHP sizes for the PMD range.
The restriction on max_ptes_none is removed during the scan to make sure
we account for the whole PMD range in the bitmap. When no mTHP size is
enabled, the legacy behavior of khugepaged is maintained.
We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
(i.e. 511). If any other value is specified, the kernel will emit a warning
and no mTHP collapse will be attempted. If an attempted mTHP collapse
contains swapped-out or shared pages, we don't perform the collapse. It is
now also possible to collapse to mTHPs without requiring the PMD THP size
to be enabled. These limitations are in place to prevent collapse "creep"
behavior: constantly promoting mTHPs to the next available size, which
would occur because a collapse introduces more non-zero pages that would
satisfy the promotion condition on subsequent scans.
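For intuition, here is a rough userspace sketch of the bitmap-driven size
selection described above. It is only an illustrative model under the two
supported max_ptes_none policies; the names (pick_collapse_order(),
occupied[]) and the structure are assumptions for this example, not the
kernel implementation:

#include <stdbool.h>
#include <stdio.h>

#define PTES_PER_PMD	512	/* HPAGE_PMD_NR with 4K base pages */
#define PMD_ORDER	9

/* Models the two supported policies: max_ptes_none of 0 or HPAGE_PMD_NR - 1. */
static int max_none_for(int order, int max_ptes_none)
{
	return max_ptes_none ? (1 << order) - 1 : 0;
}

/*
 * Pick the largest enabled order whose naturally aligned window at @idx
 * stays within the empty-PTE budget for that order.
 */
static int pick_collapse_order(const bool occupied[PTES_PER_PMD], int idx,
			       unsigned int enabled_orders, int max_ptes_none)
{
	for (int order = PMD_ORDER; order >= 2; order--) {
		int nr = 1 << order, none = 0;

		if (!(enabled_orders & (1u << order)) || (idx & (nr - 1)))
			continue;	/* order disabled or window misaligned */
		for (int i = idx; i < idx + nr; i++)
			none += !occupied[i];
		if (none <= max_none_for(order, max_ptes_none))
			return order;
	}
	return -1;	/* no suitable order at this offset */
}

int main(void)
{
	bool occupied[PTES_PER_PMD] = { false };

	for (int i = 0; i < 64; i++)	/* only the first 64 PTEs are populated */
		occupied[i] = true;
	/* order-6 (64-page) mTHP enabled, max_ptes_none == 0: picks order 6 */
	printf("picked order %d\n",
	       pick_collapse_order(occupied, 0, 1u << 6, 0));
	return 0;
}

The real series drives this decision from the bitmap gathered during the
PMD scan and then feeds the chosen ranges into the generalized collapse
path.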
Patch 1: add is_pmd_order helper
Patch 2: Refactor/rename hpage_collapse
Patch 3: Refactoring to combine madvise_collapse and khugepaged
Patch 4-8: Generalize khugepaged functions for arbitrary orders and
introduce some helper functions
Patch 9: skip collapsing mTHP to smaller orders
Patch 10-11: Add per-order mTHP statistics and tracepoints
Patch 12: Introduce collapse_allowable_orders
Patch 13-15: Introduce bitmap and mTHP collapse support, fully enabled
Patch 16: Documentation
---------
Testing
---------
- Built for x86_64, aarch64, ppc64le, and s390x
- ran all arches on test suites provided by the kernel-tests project
- internal testing suites: functional testing and performance testing
- selftests mm
- I created a test script that I used to push khugepaged to its limits
while monitoring a number of stats and tracepoints. The code is
available here[1] (run it in legacy mode for these changes and set the
mTHP sizes to inherit).
In summary, no significant regression was noticed through this test. In
some cases my changes had better collapse latencies and were able to scan
more pages in the same amount of time/work, but for the most part the
results were consistent.
- Redis testing. I tested these changes along with my defer changes
(see the follow-up post [2] for more details). We've decided to get the
mTHP changes merged before attempting the defer series.
- some basic testing on 64k page size.
- lots of general use.
V14 Changes:
- Added review tags
- Refactored is_mthp_order() to is_pmd_order(), used it in more places, and
moved it to the first commit of the series
- Squashed the fixup sent with v13
- Rebased and handled conflicts with the new madvise_collapse writeback retry logic [3]
- Handled a conflict with the khugepaged cleanup series [4]
V13: https://lore.kernel.org/lkml/20251201174627.23295-1-npache@redhat.com/
V12: https://lore.kernel.org/lkml/20251022183717.70829-1-npache@redhat.com/
V11: https://lore.kernel.org/lkml/20250912032810.197475-1-npache@redhat.com/
V10: https://lore.kernel.org/lkml/20250819134205.622806-1-npache@redhat.com/
V9 : https://lore.kernel.org/lkml/20250714003207.113275-1-npache@redhat.com/
V8 : https://lore.kernel.org/lkml/20250702055742.102808-1-npache@redhat.com/
V7 : https://lore.kernel.org/lkml/20250515032226.128900-1-npache@redhat.com/
V6 : https://lore.kernel.org/lkml/20250515030312.125567-1-npache@redhat.com/
V5 : https://lore.kernel.org/lkml/20250428181218.85925-1-npache@redhat.com/
V4 : https://lore.kernel.org/lkml/20250417000238.74567-1-npache@redhat.com/
V3 : https://lore.kernel.org/lkml/20250414220557.35388-1-npache@redhat.com/
V2 : https://lore.kernel.org/lkml/20250211003028.213461-1-npache@redhat.com/
V1 : https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
A big thanks to everyone who has reviewed, tested, and participated in
the development process. It's been a great experience working with all of
you on this endeavour.
[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/
[3] - https://lore.kernel.org/lkml/20260118190939.8986-2-shivankg@amd.com/
[4] - https://lore.kernel.org/lkml/20260118192253.9263-4-shivankg@amd.com/
Baolin Wang (1):
khugepaged: run khugepaged for all orders
Dev Jain (1):
khugepaged: generalize alloc_charge_folio()
Nico Pache (14):
mm: introduce is_pmd_order helper
khugepaged: rename hpage_collapse_* to collapse_*
introduce collapse_single_pmd to unify khugepaged and madvise_collapse
khugepaged: generalize hugepage_vma_revalidate for mTHP support
khugepaged: generalize __collapse_huge_page_* for mTHP support
khugepaged: introduce collapse_max_ptes_none helper function
khugepaged: generalize collapse_huge_page for mTHP collapse
khugepaged: skip collapsing mTHP to smaller orders
khugepaged: add per-order mTHP collapse failure statistics
khugepaged: improve tracepoints for mTHP orders
khugepaged: introduce collapse_allowable_orders helper function
khugepaged: Introduce mTHP collapse support
khugepaged: avoid unnecessary mTHP collapse attempts
Documentation: mm: update the admin guide for mTHP collapse
Documentation/admin-guide/mm/transhuge.rst | 80 ++-
include/linux/huge_mm.h | 10 +
include/trace/events/huge_memory.h | 34 +-
mm/huge_memory.c | 13 +-
mm/khugepaged.c | 695 ++++++++++++++++-----
mm/mempolicy.c | 2 +-
mm/mremap.c | 2 +-
mm/page_alloc.c | 2 +-
8 files changed, 630 insertions(+), 208 deletions(-)
--
2.52.0
* [PATCH mm-unstable v14 01/16] mm: introduce is_pmd_order helper
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 02/16] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
` (16 subsequent siblings)
17 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang
In order to add mTHP support to khugepaged, we will often be checking if a
given order is (or is not) a PMD order. Some places in the kernel already
use this check, so let's create a simple helper function to keep the code
clean and readable.
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
include/linux/huge_mm.h | 5 +++++
mm/huge_memory.c | 2 +-
mm/khugepaged.c | 4 ++--
mm/mempolicy.c | 2 +-
mm/page_alloc.c | 2 +-
5 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfde..bd7f0e1d8094 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -771,6 +771,11 @@ static inline bool pmd_is_huge(pmd_t pmd)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline bool is_pmd_order(unsigned int order)
+{
+ return order == HPAGE_PMD_ORDER;
+}
+
static inline int split_folio_to_list_to_order(struct folio *folio,
struct list_head *list, int new_order)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 44ff8a648afd..5eae85818635 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4097,7 +4097,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
i_mmap_unlock_read(mapping);
out:
xas_destroy(&xas);
- if (old_order == HPAGE_PMD_ORDER)
+ if (is_pmd_order(old_order))
count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
count_mthp_stat(old_order, !ret ? MTHP_STAT_SPLIT : MTHP_STAT_SPLIT_FAILED);
return ret;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fba6aea5bea6..b85d00670d14 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2000,7 +2000,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
* we locked the first folio, then a THP might be there already.
* This will be discovered on the first iteration.
*/
- if (folio_order(folio) == HPAGE_PMD_ORDER &&
+ if (is_pmd_order(folio_order(folio)) &&
folio->index == start) {
/* Maybe PMD-mapped */
result = SCAN_PTE_MAPPED_HUGEPAGE;
@@ -2327,7 +2327,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
continue;
}
- if (folio_order(folio) == HPAGE_PMD_ORDER &&
+ if (is_pmd_order(folio_order(folio)) &&
folio->index == start) {
/* Maybe PMD-mapped */
result = SCAN_PTE_MAPPED_HUGEPAGE;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index dbd48502ac24..3802e52b01fc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2450,7 +2450,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
/* filter "hugepage" allocation, unless from alloc_pages() */
- order == HPAGE_PMD_ORDER && ilx != NO_INTERLEAVE_INDEX) {
+ is_pmd_order(order) && ilx != NO_INTERLEAVE_INDEX) {
/*
* For hugepage allocation and non-interleave policy which
* allows the current node (or other explicitly preferred
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4104973e22f..e8a6d0d27b92 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -719,7 +719,7 @@ static inline bool pcp_allowed_order(unsigned int order)
if (order <= PAGE_ALLOC_COSTLY_ORDER)
return true;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (order == HPAGE_PMD_ORDER)
+ if (is_pmd_order(order))
return true;
#endif
return false;
--
2.52.0
* [PATCH mm-unstable v14 02/16] khugepaged: rename hpage_collapse_* to collapse_*
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 01/16] mm: introduce is_pmd_order helper Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
` (15 subsequent siblings)
17 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang, David Hildenbrand
The hpage_collapse_* functions are used by both madvise_collapse and
khugepaged. Remove the unnecessary hpage prefix to shorten the function
names.
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 70 ++++++++++++++++++++++++-------------------------
mm/mremap.c | 2 +-
2 files changed, 36 insertions(+), 36 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b85d00670d14..fefcbdca4510 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -396,14 +396,14 @@ void __init khugepaged_destroy(void)
kmem_cache_destroy(mm_slot_cache);
}
-static inline int hpage_collapse_test_exit(struct mm_struct *mm)
+static inline int collapse_test_exit(struct mm_struct *mm)
{
return atomic_read(&mm->mm_users) == 0;
}
-static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
+static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
{
- return hpage_collapse_test_exit(mm) ||
+ return collapse_test_exit(mm) ||
mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
}
@@ -437,7 +437,7 @@ void __khugepaged_enter(struct mm_struct *mm)
int wakeup;
/* __khugepaged_exit() must not run from under us */
- VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
+ VM_BUG_ON_MM(collapse_test_exit(mm), mm);
if (unlikely(mm_flags_test_and_set(MMF_VM_HUGEPAGE, mm)))
return;
@@ -491,7 +491,7 @@ void __khugepaged_exit(struct mm_struct *mm)
} else if (slot) {
/*
* This is required to serialize against
- * hpage_collapse_test_exit() (which is guaranteed to run
+ * collapse_test_exit() (which is guaranteed to run
* under mmap sem read mode). Stop here (after we return all
* pagetables will be destroyed) until khugepaged has finished
* working on the pagetables under the mmap_lock.
@@ -580,7 +580,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
folio = page_folio(page);
VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
- /* See hpage_collapse_scan_pmd(). */
+ /* See collapse_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
++shared;
if (cc->is_khugepaged &&
@@ -831,7 +831,7 @@ static struct collapse_control khugepaged_collapse_control = {
.is_khugepaged = true,
};
-static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
+static bool collapse_scan_abort(int nid, struct collapse_control *cc)
{
int i;
@@ -866,7 +866,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
}
#ifdef CONFIG_NUMA
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int collapse_find_target_node(struct collapse_control *cc)
{
int nid, target_node = 0, max_value = 0;
@@ -885,7 +885,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
return target_node;
}
#else
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int collapse_find_target_node(struct collapse_control *cc)
{
return 0;
}
@@ -904,7 +904,7 @@ static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned l
enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
TVA_FORCED_COLLAPSE;
- if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+ if (unlikely(collapse_test_exit_or_disable(mm)))
return SCAN_ANY_PROCESS;
*vmap = vma = find_vma(mm, address);
@@ -975,7 +975,7 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
/*
* Bring missing pages in from swap, to complete THP collapse.
- * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
+ * Only done if khugepaged_scan_pmd believes it is worthwhile.
*
* Called and returns without pte mapped or spinlocks held.
* Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
@@ -1061,7 +1061,7 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
{
gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
GFP_TRANSHUGE);
- int node = hpage_collapse_find_target_node(cc);
+ int node = collapse_find_target_node(cc);
struct folio *folio;
folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
@@ -1239,9 +1239,10 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
return result;
}
-static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
- struct collapse_control *cc)
+static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long start_addr, bool *mmap_locked,
+ struct collapse_control *cc)
{
pmd_t *pmd;
pte_t *pte, *_pte;
@@ -1349,7 +1350,7 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
* hit record.
*/
node = folio_nid(folio);
- if (hpage_collapse_scan_abort(node, cc)) {
+ if (collapse_scan_abort(node, cc)) {
result = SCAN_SCAN_ABORT;
goto out_unmap;
}
@@ -1415,7 +1416,7 @@ static void collect_mm_slot(struct mm_slot *slot)
lockdep_assert_held(&khugepaged_mm_lock);
- if (hpage_collapse_test_exit(mm)) {
+ if (collapse_test_exit(mm)) {
/* free mm_slot */
hash_del(&slot->hash);
list_del(&slot->mm_node);
@@ -1770,7 +1771,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
continue;
- if (hpage_collapse_test_exit(mm))
+ if (collapse_test_exit(mm))
continue;
if (!file_backed_vma_is_retractable(vma))
@@ -2286,8 +2287,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
return result;
}
-static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
- struct file *file, pgoff_t start, struct collapse_control *cc)
+static enum scan_result collapse_scan_file(struct mm_struct *mm, unsigned long addr,
+ struct file *file, pgoff_t start,
+ struct collapse_control *cc)
{
struct folio *folio = NULL;
struct address_space *mapping = file->f_mapping;
@@ -2342,7 +2344,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
}
node = folio_nid(folio);
- if (hpage_collapse_scan_abort(node, cc)) {
+ if (collapse_scan_abort(node, cc)) {
result = SCAN_SCAN_ABORT;
folio_put(folio);
break;
@@ -2392,7 +2394,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
return result;
}
-static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result *result,
+static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *result,
struct collapse_control *cc)
__releases(&khugepaged_mm_lock)
__acquires(&khugepaged_mm_lock)
@@ -2427,7 +2429,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
goto breakouterloop_mmap_lock;
progress++;
- if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+ if (unlikely(collapse_test_exit_or_disable(mm)))
goto breakouterloop;
vma_iter_init(&vmi, mm, khugepaged_scan.address);
@@ -2435,7 +2437,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
unsigned long hstart, hend;
cond_resched();
- if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+ if (unlikely(collapse_test_exit_or_disable(mm))) {
progress++;
break;
}
@@ -2458,7 +2460,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
bool mmap_locked = true;
cond_resched();
- if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+ if (unlikely(collapse_test_exit_or_disable(mm)))
goto breakouterloop;
VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2471,12 +2473,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
mmap_read_unlock(mm);
mmap_locked = false;
- *result = hpage_collapse_scan_file(mm,
+ *result = collapse_scan_file(mm,
khugepaged_scan.address, file, pgoff, cc);
fput(file);
if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
mmap_read_lock(mm);
- if (hpage_collapse_test_exit_or_disable(mm))
+ if (collapse_test_exit_or_disable(mm))
goto breakouterloop;
*result = try_collapse_pte_mapped_thp(mm,
khugepaged_scan.address, false);
@@ -2485,7 +2487,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
mmap_read_unlock(mm);
}
} else {
- *result = hpage_collapse_scan_pmd(mm, vma,
+ *result = collapse_scan_pmd(mm, vma,
khugepaged_scan.address, &mmap_locked, cc);
}
@@ -2518,7 +2520,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
* Release the current mm_slot if this mm is about to die, or
* if we scanned all vmas of this mm.
*/
- if (hpage_collapse_test_exit(mm) || !vma) {
+ if (collapse_test_exit(mm) || !vma) {
/*
* Make sure that if mm_users is reaching zero while
* khugepaged runs here, khugepaged_exit will find
@@ -2569,8 +2571,8 @@ static void khugepaged_do_scan(struct collapse_control *cc)
pass_through_head++;
if (khugepaged_has_work() &&
pass_through_head < 2)
- progress += khugepaged_scan_mm_slot(pages - progress,
- &result, cc);
+ progress += collapse_scan_mm_slot(pages - progress,
+ &result, cc);
else
progress = pages;
spin_unlock(&khugepaged_mm_lock);
@@ -2814,8 +2816,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
mmap_read_unlock(mm);
mmap_locked = false;
*lock_dropped = true;
- result = hpage_collapse_scan_file(mm, addr, file, pgoff,
- cc);
+ result = collapse_scan_file(mm, addr, file, pgoff, cc);
if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
mapping_can_writeback(file->f_mapping)) {
@@ -2829,8 +2830,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
}
fput(file);
} else {
- result = hpage_collapse_scan_pmd(mm, vma, addr,
- &mmap_locked, cc);
+ result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
}
if (!mmap_locked)
*lock_dropped = true;
diff --git a/mm/mremap.c b/mm/mremap.c
index 8391ae17de64..bd24ead6fde4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -244,7 +244,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
goto out;
}
/*
- * Now new_pte is none, so hpage_collapse_scan_file() path can not find
+ * Now new_pte is none, so collapse_scan_file() path can not find
* this by traversing file->f_mapping, so there is no concurrency with
* retract_page_tables(). In addition, we already hold the exclusive
* mmap_lock, so this new_pte page is stable, so there is no need to get
--
2.52.0
* [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 01/16] mm: introduce is_pmd_order helper Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 02/16] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-23 5:07 ` Lance Yang
` (2 more replies)
2026-01-22 19:28 ` [PATCH mm-unstable v14 04/16] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
` (14 subsequent siblings)
17 siblings, 3 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang, David Hildenbrand
The khugepaged daemon and madvise_collapse have two different
implementations that do almost the same thing.
Create collapse_single_pmd to increase code reuse and provide a common
entry point for these two users.
Refactor madvise_collapse and collapse_scan_mm_slot to use the new
collapse_single_pmd function. This introduces a minor behavioral change
that addresses what is most likely an undiscovered bug: the current
implementation of khugepaged tests collapse_test_exit_or_disable before
calling try_collapse_pte_mapped_thp, but we weren't doing this in the
madvise_collapse case. By unifying these two callers, madvise_collapse now
also performs this check. We also modify the return value to be
SCAN_ANY_PROCESS, which properly indicates that this process is no longer
valid to operate on.
We also guard the khugepaged_pages_collapsed variable to ensure it is only
incremented for khugepaged.
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 106 +++++++++++++++++++++++++++---------------------
1 file changed, 60 insertions(+), 46 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fefcbdca4510..59e5a5588d85 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2394,6 +2394,54 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm, unsigned long a
return result;
}
+/*
+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
+ * the results.
+ */
+static enum scan_result collapse_single_pmd(unsigned long addr,
+ struct vm_area_struct *vma, bool *mmap_locked,
+ struct collapse_control *cc)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ enum scan_result result;
+ struct file *file;
+ pgoff_t pgoff;
+
+ if (vma_is_anonymous(vma)) {
+ result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
+ goto end;
+ }
+
+ file = get_file(vma->vm_file);
+ pgoff = linear_page_index(vma, addr);
+
+ mmap_read_unlock(mm);
+ *mmap_locked = false;
+ result = collapse_scan_file(mm, addr, file, pgoff, cc);
+ fput(file);
+
+ if (result != SCAN_PTE_MAPPED_HUGEPAGE)
+ goto end;
+
+ mmap_read_lock(mm);
+ *mmap_locked = true;
+ if (collapse_test_exit_or_disable(mm)) {
+ mmap_read_unlock(mm);
+ *mmap_locked = false;
+ return SCAN_ANY_PROCESS;
+ }
+ result = try_collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged);
+ if (result == SCAN_PMD_MAPPED)
+ result = SCAN_SUCCEED;
+ mmap_read_unlock(mm);
+ *mmap_locked = false;
+
+end:
+ if (cc->is_khugepaged && result == SCAN_SUCCEED)
+ ++khugepaged_pages_collapsed;
+ return result;
+}
+
static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *result,
struct collapse_control *cc)
__releases(&khugepaged_mm_lock)
@@ -2466,34 +2514,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
VM_BUG_ON(khugepaged_scan.address < hstart ||
khugepaged_scan.address + HPAGE_PMD_SIZE >
hend);
- if (!vma_is_anonymous(vma)) {
- struct file *file = get_file(vma->vm_file);
- pgoff_t pgoff = linear_page_index(vma,
- khugepaged_scan.address);
-
- mmap_read_unlock(mm);
- mmap_locked = false;
- *result = collapse_scan_file(mm,
- khugepaged_scan.address, file, pgoff, cc);
- fput(file);
- if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
- mmap_read_lock(mm);
- if (collapse_test_exit_or_disable(mm))
- goto breakouterloop;
- *result = try_collapse_pte_mapped_thp(mm,
- khugepaged_scan.address, false);
- if (*result == SCAN_PMD_MAPPED)
- *result = SCAN_SUCCEED;
- mmap_read_unlock(mm);
- }
- } else {
- *result = collapse_scan_pmd(mm, vma,
- khugepaged_scan.address, &mmap_locked, cc);
- }
-
- if (*result == SCAN_SUCCEED)
- ++khugepaged_pages_collapsed;
+ *result = collapse_single_pmd(khugepaged_scan.address,
+ vma, &mmap_locked, cc);
/* move to next address */
khugepaged_scan.address += HPAGE_PMD_SIZE;
progress += HPAGE_PMD_NR;
@@ -2799,6 +2822,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
cond_resched();
mmap_read_lock(mm);
mmap_locked = true;
+ *lock_dropped = true;
result = hugepage_vma_revalidate(mm, addr, false, &vma,
cc);
if (result != SCAN_SUCCEED) {
@@ -2809,17 +2833,17 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
}
mmap_assert_locked(mm);
- if (!vma_is_anonymous(vma)) {
- struct file *file = get_file(vma->vm_file);
- pgoff_t pgoff = linear_page_index(vma, addr);
- mmap_read_unlock(mm);
- mmap_locked = false;
+ result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
+
+ if (!mmap_locked)
*lock_dropped = true;
- result = collapse_scan_file(mm, addr, file, pgoff, cc);
- if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
- mapping_can_writeback(file->f_mapping)) {
+ if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) {
+ struct file *file = get_file(vma->vm_file);
+ pgoff_t pgoff = linear_page_index(vma, addr);
+
+ if (mapping_can_writeback(file->f_mapping)) {
loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
@@ -2829,26 +2853,16 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
goto retry;
}
fput(file);
- } else {
- result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
}
- if (!mmap_locked)
- *lock_dropped = true;
-handle_result:
switch (result) {
case SCAN_SUCCEED:
case SCAN_PMD_MAPPED:
++thps;
break;
- case SCAN_PTE_MAPPED_HUGEPAGE:
- BUG_ON(mmap_locked);
- mmap_read_lock(mm);
- result = try_collapse_pte_mapped_thp(mm, addr, true);
- mmap_read_unlock(mm);
- goto handle_result;
/* Whitelisted set of results where continuing OK */
case SCAN_NO_PTE_TABLE:
+ case SCAN_PTE_MAPPED_HUGEPAGE:
case SCAN_PTE_NON_PRESENT:
case SCAN_PTE_UFFD_WP:
case SCAN_LACK_REFERENCED_PAGE:
--
2.52.0
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH mm-unstable v14 04/16] khugepaged: generalize hugepage_vma_revalidate for mTHP support
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (2 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 05/16] khugepaged: generalize alloc_charge_folio() Nico Pache
` (13 subsequent siblings)
17 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang, David Hildenbrand
For khugepaged to support different mTHP orders, we must generalize
hugepage_vma_revalidate() to check that the PMD is not shared by another
VMA and that the requested order is enabled.
No functional change in this patch. Also correct a comment about the
functionality of the revalidation.
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 59e5a5588d85..59b6b89394a8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -893,12 +893,13 @@ static int collapse_find_target_node(struct collapse_control *cc)
/*
* If mmap_lock temporarily dropped, revalidate vma
- * before taking mmap_lock.
+ * after taking the mmap_lock again.
* Returns enum scan_result value.
*/
static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
- bool expect_anon, struct vm_area_struct **vmap, struct collapse_control *cc)
+ bool expect_anon, struct vm_area_struct **vmap,
+ struct collapse_control *cc, unsigned int order)
{
struct vm_area_struct *vma;
enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
@@ -911,15 +912,16 @@ static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned l
if (!vma)
return SCAN_VMA_NULL;
+ /* Always check the PMD order to ensure its not shared by another VMA */
if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
return SCAN_ADDRESS_RANGE;
- if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
+ if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, BIT(order)))
return SCAN_VMA_CHECK;
/*
* Anon VMA expected, the address may be unmapped then
* remapped to file after khugepaged reaquired the mmap_lock.
*
- * thp_vma_allowable_order may return true for qualified file
+ * thp_vma_allowable_orders may return true for qualified file
* vmas.
*/
if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)))
@@ -1112,7 +1114,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
goto out_nolock;
mmap_read_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+ HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED) {
mmap_read_unlock(mm);
goto out_nolock;
@@ -1146,7 +1149,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
* mmap_lock.
*/
mmap_write_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+ HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
@@ -2824,7 +2828,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
mmap_locked = true;
*lock_dropped = true;
result = hugepage_vma_revalidate(mm, addr, false, &vma,
- cc);
+ cc, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED) {
last_fail = result;
goto out_nolock;
--
2.52.0
* [PATCH mm-unstable v14 05/16] khugepaged: generalize alloc_charge_folio()
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (3 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 04/16] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 06/16] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
` (12 subsequent siblings)
17 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang, David Hildenbrand
From: Dev Jain <dev.jain@arm.com>
Pass order to alloc_charge_folio() and update mTHP statistics.
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Nico Pache <npache@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
Documentation/admin-guide/mm/transhuge.rst | 8 ++++++++
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 4 ++++
mm/khugepaged.c | 17 +++++++++++------
4 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 5fbc3d89bb07..c51932e6275d 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -639,6 +639,14 @@ anon_fault_fallback_charge
instead falls back to using huge pages with lower orders or
small pages even though the allocation was successful.
+collapse_alloc
+ is incremented every time a huge page is successfully allocated for a
+ khugepaged collapse.
+
+collapse_alloc_failed
+ is incremented every time a huge page allocation fails during a
+ khugepaged collapse.
+
zswpout
is incremented every time a huge page is swapped out to zswap in one
piece without splitting.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index bd7f0e1d8094..9941fc6d7bd8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -128,6 +128,8 @@ enum mthp_stat_item {
MTHP_STAT_ANON_FAULT_ALLOC,
MTHP_STAT_ANON_FAULT_FALLBACK,
MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+ MTHP_STAT_COLLAPSE_ALLOC,
+ MTHP_STAT_COLLAPSE_ALLOC_FAILED,
MTHP_STAT_ZSWPOUT,
MTHP_STAT_SWPIN,
MTHP_STAT_SWPIN_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5eae85818635..00fc92062d70 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -621,6 +621,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
@@ -686,6 +688,8 @@ static struct attribute *any_stats_attrs[] = {
#endif
&split_attr.attr,
&split_failed_attr.attr,
+ &collapse_alloc_attr.attr,
+ &collapse_alloc_failed_attr.attr,
NULL,
};
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 59b6b89394a8..384d30b6bdd3 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1059,21 +1059,26 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
}
static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
- struct collapse_control *cc)
+ struct collapse_control *cc, unsigned int order)
{
gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
GFP_TRANSHUGE);
int node = collapse_find_target_node(cc);
struct folio *folio;
- folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+ folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
if (!folio) {
*foliop = NULL;
- count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+ if (is_pmd_order(order))
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
return SCAN_ALLOC_HUGE_PAGE_FAIL;
}
- count_vm_event(THP_COLLAPSE_ALLOC);
+ if (is_pmd_order(order))
+ count_vm_event(THP_COLLAPSE_ALLOC);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
+
if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
folio_put(folio);
*foliop = NULL;
@@ -1109,7 +1114,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
*/
mmap_read_unlock(mm);
- result = alloc_charge_folio(&folio, mm, cc);
+ result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
goto out_nolock;
@@ -1876,7 +1881,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
- result = alloc_charge_folio(&new_folio, mm, cc);
+ result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
goto out;
--
2.52.0
* [PATCH mm-unstable v14 06/16] khugepaged: generalize __collapse_huge_page_* for mTHP support
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (4 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 05/16] khugepaged: generalize alloc_charge_folio() Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 07/16] khugepaged: introduce collapse_max_ptes_none helper function Nico Pache
` (11 subsequent siblings)
17 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang, David Hildenbrand
Generalize the __collapse_huge_page_* functions over the collapse order
to support future mTHP collapse.
mTHP collapse will not honor the khugepaged_max_ptes_shared or
khugepaged_max_ptes_swap parameters, and will fail if it encounters a
shared or swapped entry.
No functional changes in this patch.
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 73 +++++++++++++++++++++++++++++++------------------
1 file changed, 47 insertions(+), 26 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 384d30b6bdd3..0f68902edd9a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -539,7 +539,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
- struct list_head *compound_pagelist)
+ unsigned int order, struct list_head *compound_pagelist)
{
struct page *page = NULL;
struct folio *folio = NULL;
@@ -547,15 +547,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
pte_t *_pte;
int none_or_zero = 0, shared = 0, referenced = 0;
enum scan_result result = SCAN_FAIL;
+ const unsigned long nr_pages = 1UL << order;
+ int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
- for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+ for (_pte = pte; _pte < pte + nr_pages;
_pte++, addr += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
if (pte_none_or_zero(pteval)) {
++none_or_zero;
if (!userfaultfd_armed(vma) &&
(!cc->is_khugepaged ||
- none_or_zero <= khugepaged_max_ptes_none)) {
+ none_or_zero <= max_ptes_none)) {
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
@@ -583,8 +585,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
/* See collapse_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
++shared;
- if (cc->is_khugepaged &&
- shared > khugepaged_max_ptes_shared) {
+ /*
+ * TODO: Support shared pages without leading to further
+ * mTHP collapses. Currently bringing in new pages via
+ * shared may cause a future higher order collapse on a
+ * rescan of the same range.
+ */
+ if (!is_pmd_order(order) || (cc->is_khugepaged &&
+ shared > khugepaged_max_ptes_shared)) {
result = SCAN_EXCEED_SHARED_PTE;
count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
goto out;
@@ -677,18 +685,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
}
static void __collapse_huge_page_copy_succeeded(pte_t *pte,
- struct vm_area_struct *vma,
- unsigned long address,
- spinlock_t *ptl,
- struct list_head *compound_pagelist)
+ struct vm_area_struct *vma, unsigned long address,
+ spinlock_t *ptl, unsigned int order,
+ struct list_head *compound_pagelist)
{
- unsigned long end = address + HPAGE_PMD_SIZE;
+ unsigned long end = address + (PAGE_SIZE << order);
struct folio *src, *tmp;
pte_t pteval;
pte_t *_pte;
unsigned int nr_ptes;
+ const unsigned long nr_pages = 1UL << order;
- for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
+ for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
address += nr_ptes * PAGE_SIZE) {
nr_ptes = 1;
pteval = ptep_get(_pte);
@@ -741,13 +749,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
}
static void __collapse_huge_page_copy_failed(pte_t *pte,
- pmd_t *pmd,
- pmd_t orig_pmd,
- struct vm_area_struct *vma,
- struct list_head *compound_pagelist)
+ pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
+ unsigned int order, struct list_head *compound_pagelist)
{
spinlock_t *pmd_ptl;
-
+ const unsigned long nr_pages = 1UL << order;
/*
* Re-establish the PMD to point to the original page table
* entry. Restoring PMD needs to be done prior to releasing
@@ -761,7 +767,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
* Release both raw and compound pages isolated
* in __collapse_huge_page_isolate.
*/
- release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+ release_pte_pages(pte, pte + nr_pages, compound_pagelist);
}
/*
@@ -781,16 +787,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
*/
static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
- unsigned long address, spinlock_t *ptl,
+ unsigned long address, spinlock_t *ptl, unsigned int order,
struct list_head *compound_pagelist)
{
unsigned int i;
enum scan_result result = SCAN_SUCCEED;
-
+ const unsigned long nr_pages = 1UL << order;
/*
* Copying pages' contents is subject to memory poison at any iteration.
*/
- for (i = 0; i < HPAGE_PMD_NR; i++) {
+ for (i = 0; i < nr_pages; i++) {
pte_t pteval = ptep_get(pte + i);
struct page *page = folio_page(folio, i);
unsigned long src_addr = address + i * PAGE_SIZE;
@@ -809,10 +815,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
if (likely(result == SCAN_SUCCEED))
__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
- compound_pagelist);
+ order, compound_pagelist);
else
__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
- compound_pagelist);
+ order, compound_pagelist);
return result;
}
@@ -983,12 +989,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
* Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
*/
static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
- int referenced)
+ struct vm_area_struct *vma, unsigned long start_addr,
+ pmd_t *pmd, int referenced, unsigned int order)
{
int swapped_in = 0;
vm_fault_t ret = 0;
- unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
+ unsigned long addr, end = start_addr + (PAGE_SIZE << order);
enum scan_result result;
pte_t *pte = NULL;
spinlock_t *ptl;
@@ -1020,6 +1026,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
pte_present(vmf.orig_pte))
continue;
+ /*
+ * TODO: Support swapin without leading to further mTHP
+ * collapses. Currently bringing in new pages via swapin may
+ * cause a future higher order collapse on a rescan of the same
+ * range.
+ */
+ if (!is_pmd_order(order)) {
+ pte_unmap(pte);
+ mmap_read_unlock(mm);
+ result = SCAN_EXCEED_SWAP_PTE;
+ goto out;
+ }
+
vmf.pte = pte;
vmf.ptl = ptl;
ret = do_swap_page(&vmf);
@@ -1139,7 +1158,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
* that case. Continuing to collapse causes inconsistency.
*/
result = __collapse_huge_page_swapin(mm, vma, address, pmd,
- referenced);
+ referenced, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
goto out_nolock;
}
@@ -1187,6 +1206,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (pte) {
result = __collapse_huge_page_isolate(vma, address, pte, cc,
+ HPAGE_PMD_ORDER,
&compound_pagelist);
spin_unlock(pte_ptl);
} else {
@@ -1217,6 +1237,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
vma, address, pte_ptl,
+ HPAGE_PMD_ORDER,
&compound_pagelist);
pte_unmap(pte);
if (unlikely(result != SCAN_SUCCEED))
--
2.52.0
* [PATCH mm-unstable v14 07/16] khugepaged: introduce collapse_max_ptes_none helper function
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (5 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 06/16] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-02-03 12:08 ` Lorenzo Stoakes
2026-01-22 19:28 ` [PATCH mm-unstable v14 08/16] khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
` (10 subsequent siblings)
17 siblings, 1 reply; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang
The current mechanism for determining mTHP collapse scales the
khugepaged_max_ptes_none value based on the target order. This
introduces an undesirable feedback loop, or "creep", when max_ptes_none
is set to a value greater than HPAGE_PMD_NR / 2.
With this configuration, a successful collapse to order N will populate
enough pages to satisfy the collapse condition on order N+1 on the next
scan. This leads to unnecessary work and memory churn.
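For example, with the previous per-order scaling of the limit
(khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order)) and
max_ptes_none = 384: an order-6 (64-page) region needs only
64 - (384 >> 3) = 16 present pages to collapse. Once it collapses, all 64
pages are present, so the enclosing order-7 (128-page) region has at most
64 none pages against a scaled limit of 384 >> 2 = 96 and qualifies on the
next scan, and so on up to the PMD.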
To fix this issue, introduce a helper function that limits mTHP collapse
support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
This effectively supports two modes:
- max_ptes_none=0: never introduce new none-pages for mTHP collapse.
- max_ptes_none=511 (on 4K page size): always collapse to the highest
available mTHP order.
This removes the possibility of "creep" while not modifying any uAPI
expectations. A warning will be emitted if an unsupported max_ptes_none
value is configured with mTHP enabled.
The limits can be ignored by passing full_scan=true. This is useful for
madvise_collapse (which ignores the limits) and, in the case of
collapse_scan_pmd(), allows the full PMD to be scanned when mTHP collapse
is available.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 42 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0f68902edd9a..9b7e05827749 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -460,6 +460,44 @@ void __khugepaged_enter(struct mm_struct *mm)
wake_up_interruptible(&khugepaged_wait);
}
+/**
+ * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
+ * @order: The folio order being collapsed to
+ * @full_scan: Whether this is a full scan (ignore limits)
+ *
+ * For madvise-triggered collapses (full_scan=true), all limits are bypassed
+ * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
+ *
+ * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
+ * khugepaged_max_ptes_none value.
+ *
+ * For mTHP collapses, we currently only support khugepaged_max_pte_none values
+ * of 0 or (HPAGE_PMD_NR - 1). Any other value will emit a warning and no mTHP
+ * collapse will be attempted
+ *
+ * Return: Maximum number of empty PTEs allowed for the collapse operation
+ */
+static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
+{
+ /* ignore max_ptes_none limits */
+ if (full_scan)
+ return HPAGE_PMD_NR - 1;
+
+ if (is_pmd_order(order))
+ return khugepaged_max_ptes_none;
+
+ /* Zero/non-present collapse disabled. */
+ if (!khugepaged_max_ptes_none)
+ return 0;
+
+ if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
+ return (1 << order) - 1;
+
+ pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %d\n",
+ HPAGE_PMD_NR - 1);
+ return -EINVAL;
+}
+
void khugepaged_enter_vma(struct vm_area_struct *vma,
vm_flags_t vm_flags)
{
@@ -548,7 +586,10 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
int none_or_zero = 0, shared = 0, referenced = 0;
enum scan_result result = SCAN_FAIL;
const unsigned long nr_pages = 1UL << order;
- int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
+ int max_ptes_none = collapse_max_ptes_none(order, !cc->is_khugepaged);
+
+ if (max_ptes_none == -EINVAL)
+ return result;
for (_pte = pte; _pte < pte + nr_pages;
_pte++, addr += PAGE_SIZE) {
--
2.52.0
* [PATCH mm-unstable v14 08/16] khugepaged: generalize collapse_huge_page for mTHP collapse
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (6 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 07/16] khugepaged: introduce collapse_max_ptes_none helper function Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-02-03 13:07 ` Lorenzo Stoakes
2026-01-22 19:28 ` [PATCH mm-unstable v14 09/16] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
` (9 subsequent siblings)
17 siblings, 1 reply; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang
Pass an order and offset to collapse_huge_page to support collapsing anon
memory to arbitrary orders within a PMD. The order indicates what mTHP size
we are attempting to collapse to, and the offset indicates where in the PMD
to start the collapse attempt.
For non-PMD collapse we must leave the anon VMA write locked until after we
collapse the mTHP. In the PMD case all the pages are isolated, but in the
mTHP case this is not true, so we must keep the lock to prevent changes to
the VMA from occurring.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 111 +++++++++++++++++++++++++++++++-----------------
1 file changed, 71 insertions(+), 40 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9b7e05827749..76cb17243793 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1151,44 +1151,54 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
return SCAN_SUCCEED;
}
-static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
- int referenced, int unmapped, struct collapse_control *cc)
+static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
+ int referenced, int unmapped, struct collapse_control *cc,
+ bool *mmap_locked, unsigned int order)
{
LIST_HEAD(compound_pagelist);
pmd_t *pmd, _pmd;
- pte_t *pte;
+ pte_t *pte = NULL;
pgtable_t pgtable;
struct folio *folio;
spinlock_t *pmd_ptl, *pte_ptl;
enum scan_result result = SCAN_FAIL;
struct vm_area_struct *vma;
struct mmu_notifier_range range;
+ bool anon_vma_locked = false;
+ const unsigned long nr_pages = 1UL << order;
+ const unsigned long pmd_address = start_addr & HPAGE_PMD_MASK;
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+ VM_WARN_ON_ONCE(pmd_address & ~HPAGE_PMD_MASK);
/*
* Before allocating the hugepage, release the mmap_lock read lock.
* The allocation can take potentially a long time if it involves
* sync compaction, and we do not need to hold the mmap_lock during
* that. We will recheck the vma after taking it again in write mode.
+ * If collapsing mTHPs we may have already released the read_lock.
*/
- mmap_read_unlock(mm);
+ if (*mmap_locked) {
+ mmap_read_unlock(mm);
+ *mmap_locked = false;
+ }
- result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
+ result = alloc_charge_folio(&folio, mm, cc, order);
if (result != SCAN_SUCCEED)
goto out_nolock;
mmap_read_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
- HPAGE_PMD_ORDER);
+ *mmap_locked = true;
+ result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
if (result != SCAN_SUCCEED) {
mmap_read_unlock(mm);
+ *mmap_locked = false;
goto out_nolock;
}
- result = find_pmd_or_thp_or_none(mm, address, &pmd);
+ result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd);
if (result != SCAN_SUCCEED) {
mmap_read_unlock(mm);
+ *mmap_locked = false;
goto out_nolock;
}
@@ -1198,13 +1208,16 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
* released when it fails. So we jump out_nolock directly in
* that case. Continuing to collapse causes inconsistency.
*/
- result = __collapse_huge_page_swapin(mm, vma, address, pmd,
- referenced, HPAGE_PMD_ORDER);
- if (result != SCAN_SUCCEED)
+ result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
+ referenced, order);
+ if (result != SCAN_SUCCEED) {
+ *mmap_locked = false;
goto out_nolock;
+ }
}
mmap_read_unlock(mm);
+ *mmap_locked = false;
/*
* Prevent all access to pagetables with the exception of
* gup_fast later handled by the ptep_clear_flush and the VM
@@ -1214,20 +1227,20 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
* mmap_lock.
*/
mmap_write_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
- HPAGE_PMD_ORDER);
+ result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
vma_start_write(vma);
- result = check_pmd_still_valid(mm, address, pmd);
+ result = check_pmd_still_valid(mm, pmd_address, pmd);
if (result != SCAN_SUCCEED)
goto out_up_write;
anon_vma_lock_write(vma->anon_vma);
+ anon_vma_locked = true;
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
- address + HPAGE_PMD_SIZE);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
+ start_addr + (PAGE_SIZE << order));
mmu_notifier_invalidate_range_start(&range);
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
@@ -1239,24 +1252,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
* Parallel GUP-fast is fine since GUP-fast will back off when
* it detects PMD is changed.
*/
- _pmd = pmdp_collapse_flush(vma, address, pmd);
+ _pmd = pmdp_collapse_flush(vma, pmd_address, pmd);
spin_unlock(pmd_ptl);
mmu_notifier_invalidate_range_end(&range);
tlb_remove_table_sync_one();
- pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+ pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
if (pte) {
- result = __collapse_huge_page_isolate(vma, address, pte, cc,
- HPAGE_PMD_ORDER,
- &compound_pagelist);
+ result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
+ order, &compound_pagelist);
spin_unlock(pte_ptl);
} else {
result = SCAN_NO_PTE_TABLE;
}
if (unlikely(result != SCAN_SUCCEED)) {
- if (pte)
- pte_unmap(pte);
spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
/*
@@ -1266,21 +1276,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
*/
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
spin_unlock(pmd_ptl);
- anon_vma_unlock_write(vma->anon_vma);
goto out_up_write;
}
/*
- * All pages are isolated and locked so anon_vma rmap
- * can't run anymore.
+ * For PMD collapse all pages are isolated and locked so anon_vma
+ * rmap can't run anymore. For mTHP collapse we must hold the lock
*/
- anon_vma_unlock_write(vma->anon_vma);
+ if (is_pmd_order(order)) {
+ anon_vma_unlock_write(vma->anon_vma);
+ anon_vma_locked = false;
+ }
result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
- vma, address, pte_ptl,
- HPAGE_PMD_ORDER,
- &compound_pagelist);
- pte_unmap(pte);
+ vma, start_addr, pte_ptl,
+ order, &compound_pagelist);
if (unlikely(result != SCAN_SUCCEED))
goto out_up_write;
@@ -1290,20 +1300,42 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
* write.
*/
__folio_mark_uptodate(folio);
- pgtable = pmd_pgtable(_pmd);
+ if (is_pmd_order(order)) { /* PMD collapse */
+ pgtable = pmd_pgtable(_pmd);
- spin_lock(pmd_ptl);
- BUG_ON(!pmd_none(*pmd));
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
- map_anon_folio_pmd_nopf(folio, pmd, vma, address);
+ spin_lock(pmd_ptl);
+ WARN_ON_ONCE(!pmd_none(*pmd));
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_address);
+ } else { /* mTHP collapse */
+ pte_t mthp_pte = mk_pte(folio_page(folio, 0), vma->vm_page_prot);
+
+ mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
+ spin_lock(pmd_ptl);
+ WARN_ON_ONCE(!pmd_none(*pmd));
+ folio_ref_add(folio, nr_pages - 1);
+ folio_add_new_anon_rmap(folio, vma, start_addr, RMAP_EXCLUSIVE);
+ folio_add_lru_vma(folio, vma);
+ set_ptes(vma->vm_mm, start_addr, pte, mthp_pte, nr_pages);
+ update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
+
+ smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
+ pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+ }
spin_unlock(pmd_ptl);
folio = NULL;
result = SCAN_SUCCEED;
out_up_write:
+ if (anon_vma_locked)
+ anon_vma_unlock_write(vma->anon_vma);
+ if (pte)
+ pte_unmap(pte);
mmap_write_unlock(mm);
+ *mmap_locked = false;
out_nolock:
+ WARN_ON_ONCE(*mmap_locked);
if (folio)
folio_put(folio);
trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
@@ -1471,9 +1503,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
pte_unmap_unlock(pte, ptl);
if (result == SCAN_SUCCEED) {
result = collapse_huge_page(mm, start_addr, referenced,
- unmapped, cc);
- /* collapse_huge_page will return with the mmap_lock released */
- *mmap_locked = false;
+ unmapped, cc, mmap_locked,
+ HPAGE_PMD_ORDER);
}
out:
trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
--
2.52.0
* [PATCH mm-unstable v14 09/16] khugepaged: skip collapsing mTHP to smaller orders
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (7 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 08/16] khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 10/16] khugepaged: add per-order mTHP collapse failure statistics Nico Pache
` (8 subsequent siblings)
17 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang, David Hildenbrand
khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in
some pages being unmapped. Skip these cases until we have a way to check
whether it is OK to collapse to a smaller mTHP size (like in the case of a
partially mapped folio).
This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 76cb17243793..8bb70e1401ad 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -639,6 +639,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
goto out;
}
}
+ /*
+ * TODO: In some cases of partially-mapped folios, we'd actually
+ * want to collapse.
+ */
+ if (!is_pmd_order(order) && folio_order(folio) >= order) {
+ result = SCAN_PTE_MAPPED_HUGEPAGE;
+ goto out;
+ }
if (folio_test_large(folio)) {
struct folio *f;
--
2.52.0
* [PATCH mm-unstable v14 10/16] khugepaged: add per-order mTHP collapse failure statistics
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (8 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 09/16] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 11/16] khugepaged: improve tracepoints for mTHP orders Nico Pache
` (7 subsequent siblings)
17 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang
Add three new mTHP statistics to track collapse failures for different
orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
- collapse_exceed_swap_pte: Counts when mTHP collapse fails due to swap
PTEs
- collapse_exceed_none_pte: Counts when mTHP collapse fails due to
exceeding the none PTE threshold for the given order
- collapse_exceed_shared_pte: Counts when mTHP collapse fails due to shared
PTEs
These statistics complement the existing THP_SCAN_EXCEED_* events by
providing per-order granularity for mTHP collapse attempts. The stats are
exposed via sysfs under
`/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
supported hugepage size.
As we currently don't support collapsing mTHPs that contain a swap or
shared entry, these statistics track how often mTHP collapses fail due to
those restrictions.
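As an illustration (the exact set of hugepages-<size>kB directories
depends on the base page size and which mTHP sizes the kernel supports),
on a system with a 64kB mTHP size the new counters would be read from
paths such as:

  /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_exceed_none_pte
  /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_exceed_swap_pte
  /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/collapse_exceed_shared_pte

each holding a monotonically increasing event count for that size.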
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
Documentation/admin-guide/mm/transhuge.rst | 24 ++++++++++++++++++++++
include/linux/huge_mm.h | 3 +++
mm/huge_memory.c | 7 +++++++
mm/khugepaged.c | 16 ++++++++++++---
4 files changed, 47 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index c51932e6275d..eebb1f6bbc6c 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -714,6 +714,30 @@ nr_anon_partially_mapped
an anonymous THP as "partially mapped" and count it here, even though it
is not actually partially mapped anymore.
+collapse_exceed_none_pte
+ The number of collapse attempts that failed due to exceeding the
+ max_ptes_none threshold. For mTHP collapse, currently only max_ptes_none
+ values of 0 and (HPAGE_PMD_NR - 1) are supported. Any other value will
+ emit a warning and no mTHP collapse will be attempted. khugepaged will
+ try to collapse to the largest enabled (m)THP size; if it fails, it will
+ try the next lower enabled mTHP size. This counter records the number of
+ times a collapse attempt was skipped for exceeding the max_ptes_none
+ threshold, and khugepaged will move on to the next available mTHP size.
+
+collapse_exceed_swap_pte
+ The number of anonymous mTHP PTE ranges which were unable to collapse due
+ to containing at least one swap PTE. Currently khugepaged does not
+ support collapsing mTHP regions that contain a swap PTE. This counter can
+ be used to monitor the number of khugepaged mTHP collapses that failed
+ due to the presence of a swap PTE.
+
+collapse_exceed_shared_pte
+ The number of anonymous mTHP PTE ranges which were unable to collapse due
+ to containing at least one shared PTE. Currently khugepaged does not
+ support collapsing mTHP PTE ranges that contain a shared PTE. This
+ counter can be used to monitor the number of khugepaged mTHP collapses
+ that failed due to the presence of a shared PTE.
+
As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9941fc6d7bd8..e8777bb2347d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -144,6 +144,9 @@ enum mthp_stat_item {
MTHP_STAT_SPLIT_DEFERRED,
MTHP_STAT_NR_ANON,
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
+ MTHP_STAT_COLLAPSE_EXCEED_SWAP,
+ MTHP_STAT_COLLAPSE_EXCEED_NONE,
+ MTHP_STAT_COLLAPSE_EXCEED_SHARED,
__MTHP_STAT_COUNT
};
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 00fc92062d70..7c0b072fc6b2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -639,6 +639,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+
static struct attribute *anon_stats_attrs[] = {
&anon_fault_alloc_attr.attr,
@@ -655,6 +659,9 @@ static struct attribute *anon_stats_attrs[] = {
&split_deferred_attr.attr,
&nr_anon_attr.attr,
&nr_anon_partially_mapped_attr.attr,
+ &collapse_exceed_swap_pte_attr.attr,
+ &collapse_exceed_none_pte_attr.attr,
+ &collapse_exceed_shared_pte_attr.attr,
NULL,
};
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8bb70e1401ad..96f1c28646ba 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -602,7 +602,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
- count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+ if (is_pmd_order(order))
+ count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);
goto out;
}
}
@@ -632,10 +634,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
* shared may cause a future higher order collapse on a
* rescan of the same range.
*/
- if (!is_pmd_order(order) || (cc->is_khugepaged &&
- shared > khugepaged_max_ptes_shared)) {
+ if (!is_pmd_order(order)) {
+ result = SCAN_EXCEED_SHARED_PTE;
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+ goto out;
+ }
+
+ if (cc->is_khugepaged &&
+ shared > khugepaged_max_ptes_shared) {
result = SCAN_EXCEED_SHARED_PTE;
count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
goto out;
}
}
@@ -1082,6 +1091,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
* range.
*/
if (!is_pmd_order(order)) {
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
pte_unmap(pte);
mmap_read_unlock(mm);
result = SCAN_EXCEED_SWAP_PTE;
--
2.52.0
* [PATCH mm-unstable v14 11/16] khugepaged: improve tracepoints for mTHP orders
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (9 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 10/16] khugepaged: add per-order mTHP collapse failure statistics Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 12/16] khugepaged: introduce collapse_allowable_orders helper function Nico Pache
` (6 subsequent siblings)
17 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang, David Hildenbrand
Add the order to the mm_collapse_huge_page<_swapin,_isolate> tracepoints to
give better insight into what order is being operated on.
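For illustration only (the pointer and counter values below are made up),
a trace line for a successful order-4 (64kB) collapse would now look
roughly like:

  mm_collapse_huge_page: mm=00000000abcd1234, isolated=1, status=succeeded order=4

whereas before this patch the order was not visible in the event.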
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
include/trace/events/huge_memory.h | 34 +++++++++++++++++++-----------
mm/khugepaged.c | 9 ++++----
2 files changed, 27 insertions(+), 16 deletions(-)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4e41bff31888..942c82f2d0a4 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -88,40 +88,44 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
TRACE_EVENT(mm_collapse_huge_page,
- TP_PROTO(struct mm_struct *mm, int isolated, int status),
+ TP_PROTO(struct mm_struct *mm, int isolated, int status, unsigned int order),
- TP_ARGS(mm, isolated, status),
+ TP_ARGS(mm, isolated, status, order),
TP_STRUCT__entry(
__field(struct mm_struct *, mm)
__field(int, isolated)
__field(int, status)
+ __field(unsigned int, order)
),
TP_fast_assign(
__entry->mm = mm;
__entry->isolated = isolated;
__entry->status = status;
+ __entry->order = order;
),
- TP_printk("mm=%p, isolated=%d, status=%s",
+ TP_printk("mm=%p, isolated=%d, status=%s order=%u",
__entry->mm,
__entry->isolated,
- __print_symbolic(__entry->status, SCAN_STATUS))
+ __print_symbolic(__entry->status, SCAN_STATUS),
+ __entry->order)
);
TRACE_EVENT(mm_collapse_huge_page_isolate,
TP_PROTO(struct folio *folio, int none_or_zero,
- int referenced, int status),
+ int referenced, int status, unsigned int order),
- TP_ARGS(folio, none_or_zero, referenced, status),
+ TP_ARGS(folio, none_or_zero, referenced, status, order),
TP_STRUCT__entry(
__field(unsigned long, pfn)
__field(int, none_or_zero)
__field(int, referenced)
__field(int, status)
+ __field(unsigned int, order)
),
TP_fast_assign(
@@ -129,26 +133,30 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
__entry->none_or_zero = none_or_zero;
__entry->referenced = referenced;
__entry->status = status;
+ __entry->order = order;
),
- TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s",
+ TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s order=%u",
__entry->pfn,
__entry->none_or_zero,
__entry->referenced,
- __print_symbolic(__entry->status, SCAN_STATUS))
+ __print_symbolic(__entry->status, SCAN_STATUS),
+ __entry->order)
);
TRACE_EVENT(mm_collapse_huge_page_swapin,
- TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret),
+ TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
+ unsigned int order),
- TP_ARGS(mm, swapped_in, referenced, ret),
+ TP_ARGS(mm, swapped_in, referenced, ret, order),
TP_STRUCT__entry(
__field(struct mm_struct *, mm)
__field(int, swapped_in)
__field(int, referenced)
__field(int, ret)
+ __field(unsigned int, order)
),
TP_fast_assign(
@@ -156,13 +164,15 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
__entry->swapped_in = swapped_in;
__entry->referenced = referenced;
__entry->ret = ret;
+ __entry->order = order;
),
- TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d",
+ TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d, order=%u",
__entry->mm,
__entry->swapped_in,
__entry->referenced,
- __entry->ret)
+ __entry->ret,
+ __entry->order)
);
TRACE_EVENT(mm_khugepaged_scan_file,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 96f1c28646ba..e33b2594949d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -732,13 +732,13 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
} else {
result = SCAN_SUCCEED;
trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
- referenced, result);
+ referenced, result, order);
return result;
}
out:
release_pte_pages(pte, _pte, compound_pagelist);
trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
- referenced, result);
+ referenced, result, order);
return result;
}
@@ -1132,7 +1132,8 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
result = SCAN_SUCCEED;
out:
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+ trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result,
+ order);
return result;
}
@@ -1356,7 +1357,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
WARN_ON_ONCE(*mmap_locked);
if (folio)
folio_put(folio);
- trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+ trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result, order);
return result;
}
--
2.52.0
* [PATCH mm-unstable v14 12/16] khugepaged: introduce collapse_allowable_orders helper function
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (10 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 11/16] khugepaged: improve tracepoints for mTHP orders Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 13/16] khugepaged: Introduce mTHP collapse support Nico Pache
` (5 subsequent siblings)
17 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang
Add collapse_allowable_orders() to generalize THP order eligibility. The
function determines which THP orders are permitted based on collapse
context (khugepaged vs madv_collapse).
This consolidates collapse configuration logic and provides a clean
interface for future mTHP collapse support where the orders may be
different.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e33b2594949d..11eedd261285 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -498,12 +498,22 @@ static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
return -EINVAL;
}
+/* Check what orders are allowed based on the vma and collapse type */
+static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
+ vm_flags_t vm_flags, bool is_khugepaged)
+{
+ enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
+ unsigned long orders = BIT(HPAGE_PMD_ORDER);
+
+ return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
+}
+
void khugepaged_enter_vma(struct vm_area_struct *vma,
vm_flags_t vm_flags)
{
if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
hugepage_pmd_enabled()) {
- if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
+ if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true))
__khugepaged_enter(vma->vm_mm);
}
}
@@ -2610,7 +2620,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
progress++;
break;
}
- if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
+ if (!collapse_allowable_orders(vma, vma->vm_flags, /*is_khugepaged=*/true)) {
progress++;
continue;
}
@@ -2920,7 +2930,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
BUG_ON(vma->vm_start > start);
BUG_ON(vma->vm_end < end);
- if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
+ if (!collapse_allowable_orders(vma, vma->vm_flags, /*is_khugepaged=*/false))
return -EINVAL;
cc = kmalloc(sizeof(*cc), GFP_KERNEL);
--
2.52.0
* [PATCH mm-unstable v14 13/16] khugepaged: Introduce mTHP collapse support
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (11 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 12/16] khugepaged: introduce collapse_allowable_orders helper function Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 14/16] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
` (4 subsequent siblings)
17 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang
Enable khugepaged to collapse to mTHP orders. This patch implements the
main scanning logic using a bitmap to track occupied pages and a stack
structure that allows us to find optimal collapse sizes.
Prior to this patch, PMD collapse had three main phases: a lightweight
scanning phase (mmap_read_lock) that determines a potential PMD
collapse, an allocation phase (mmap unlocked), and finally a heavier
collapse phase (mmap_write_lock).
To enable mTHP collapse we make the following changes:
During the PMD scan phase, track occupied pages in a bitmap. When mTHP
orders are enabled, we remove the restriction of max_ptes_none during the
scan phase to avoid missing potential mTHP collapse candidates. Once we
have scanned the full PMD range and updated the bitmap to track occupied
pages, we use the bitmap to find the optimal mTHP size.
Implement mthp_collapse() to perform binary recursion on the bitmap
and determine the best eligible order for the collapse. A stack structure
is used instead of traditional recursion to manage the search. The
algorithm recursively splits the bitmap into smaller chunks to find the
highest order mTHPs that satisfy the collapse criteria. We start by
attempting the PMD order, then move on to consecutively lower orders
(mTHP collapse). The stack maintains a pair of variables (offset, order),
indicating the number of PTEs from the start of the PMD, and the order of
the potential collapse candidate.
The algorithm for consuming the bitmap works as such:
1) push (0, HPAGE_PMD_ORDER) onto the stack
2) pop the stack
3) check if the number of set bits in that (offset,order) pair
satisfies the max_ptes_none threshold for that order
4) if yes, attempt collapse
5) if no (or collapse fails), push two new stack items representing
the left and right halves of the current bitmap range, at the
next lower order
6) repeat at step (2) until stack is empty.
Below is a diagram representing the algorithm and stack items:
offset mid_offset
| |
| |
v v
____________________________________
| PTE Page Table |
--------------------------------------
<-------><------->
order-1 order-1
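As a worked example (illustrative; assuming 4K base pages and only the
64kB, order-4, mTHP size enabled): the stack starts with (0, 9). Order 9
is not enabled, so the range is split and (256, 8) then (0, 8) are
pushed; since the stack is LIFO the left half is examined first. This
repeats until (0, 4) is popped, which is an enabled order, so its 16-PTE
window is checked against the scaled max_ptes_none threshold (see the
worked threshold example further below) and, if it passes, a 64kB
collapse is attempted at that offset. Windows whose order is not enabled,
or which fail the threshold or the collapse itself, are split in half and
re-examined at the next lower order, down to KHUGEPAGED_MIN_MTHP_ORDER
(2); the remaining windows of the PMD range are examined the same way.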
We currently only support mTHP collapse for max_ptes_none values of 0
and HPAGE_PMD_NR - 1, resulting in the following behavior:
- max_ptes_none=0: Never introduce new empty pages during collapse
- max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
available mTHP order
Any other max_ptes_none value will emit a warning and skip mTHP collapse
attempts. There should be no behavior change for PMD collapse.
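For example (following the collapse_max_ptes_none() helper introduced
earlier in this series): with max_ptes_none = HPAGE_PMD_NR - 1 (511), the
per-order limit for an order-4 (16-PTE) candidate is scaled to
(1 << 4) - 1 = 15, so a single occupied PTE is enough to attempt the
collapse; with max_ptes_none = 0 the limit is 0 and all 16 PTEs must be
occupied.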
Once we determine which mTHP sizes fit best in that PMD range, a collapse
is attempted. A minimum collapse order of 2 is used as this is the lowest
order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
mTHP collapses reject regions containing swapped out or shared pages.
This is because adding new entries can lead to new none pages, and these
may lead to constant promotion into a higher order (m)THP. A similar
issue can occur with "max_ptes_none > HPAGE_PMD_NR/2": a collapse would
introduce at least 2x the number of present pages, which on a future scan
would satisfy the promotion condition once again. This issue is prevented
via the collapse_max_ptes_none() function, which imposes the
max_ptes_none restrictions above.
Currently mTHP collapse is not supported for madvise_collapse, which will
only attempt PMD collapse.
We can also remove the check for is_khugepaged inside the PMD scan as
the collapse_max_ptes_none() function handles this logic now.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 183 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 175 insertions(+), 8 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 11eedd261285..5947faaba85f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -94,6 +94,32 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
static struct kmem_cache *mm_slot_cache __ro_after_init;
+#define KHUGEPAGED_MIN_MTHP_ORDER 2
+/*
+ * The maximum number of mTHP ranges that can be stored on the stack.
+ * This is calculated based on the number of PTE entries in a PTE page table
+ * and the minimum mTHP order.
+ *
+ * ilog2(MAX_PTRS_PER_PTE) is log2 of the maximum number of PTE entries.
+ * This gives you the PMD_ORDER, and is needed in place of HPAGE_PMD_ORDER due
+ * to restrictions of some architectures (ie ppc64le).
+ *
+ * At most there will be 1 << (PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER) mTHP ranges
+ */
+#define MTHP_STACK_SIZE (1UL << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
+
+/*
+ * Defines a range of PTE entries in a PTE page table which are being
+ * considered for (m)THP collapse.
+ *
+ * @offset: the offset of the first PTE entry in a PMD range.
+ * @order: the order of the PTE entries being considered for collapse.
+ */
+struct mthp_range {
+ u16 offset;
+ u8 order;
+};
+
struct collapse_control {
bool is_khugepaged;
@@ -102,6 +128,11 @@ struct collapse_control {
/* nodemask for allocation fallback */
nodemask_t alloc_nmask;
+
+ /* bitmap used for mTHP collapse */
+ DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
+ DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
+ struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
};
/**
@@ -1371,6 +1402,121 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
return result;
}
+static void mthp_stack_push(struct collapse_control *cc, int *stack_size,
+ u16 offset, u8 order)
+{
+ const int size = *stack_size;
+ struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
+
+ VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
+ stack->order = order;
+ stack->offset = offset;
+ (*stack_size)++;
+}
+
+static struct mthp_range mthp_stack_pop(struct collapse_control *cc, int *stack_size)
+{
+ const int size = *stack_size;
+
+ VM_WARN_ON_ONCE(size <= 0);
+ (*stack_size)--;
+ return cc->mthp_bitmap_stack[size - 1];
+}
+
+static unsigned int mthp_nr_occupied_pte_entries(struct collapse_control *cc,
+ u16 offset, unsigned long nr_pte_entries)
+{
+ bitmap_zero(cc->mthp_bitmap_mask, HPAGE_PMD_NR);
+ bitmap_set(cc->mthp_bitmap_mask, offset, nr_pte_entries);
+ return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, HPAGE_PMD_NR);
+}
+
+/*
+ * mthp_collapse() consumes the bitmap that is generated during
+ * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
+ *
+ * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
+ * A stack structure cc->mthp_bitmap_stack is used to check different regions
+ * of the bitmap for collapse eligibility. The stack maintains a pair of
+ * variables (offset, order), indicating the number of PTEs from the start of
+ * the PMD, and the order of the potential collapse candidate respectively. We
+ * start at the PMD order and check if it is eligible for collapse; if not, we
+ * add two entries to the stack at a lower order to represent the left and right
+ * halves of the PTE page table we are examining.
+ *
+ * offset mid_offset
+ * | |
+ * | |
+ * v v
+ * --------------------------------------
+ * | cc->mthp_bitmap |
+ * --------------------------------------
+ * <-------><------->
+ * order-1 order-1
+ *
+ * For each of these, we determine how many PTE entries are occupied in the
+ * range of PTE entries we propose to collapse, then we compare this to a
+ * threshold number of PTE entries which would need to be occupied for a
+ * collapse to be permitted at that order (accounting for max_ptes_none).
+
+ * If a collapse is permitted, we attempt to collapse the PTE range into a
+ * mTHP.
+ */
+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
+ int referenced, int unmapped, struct collapse_control *cc,
+ bool *mmap_locked, unsigned long enabled_orders)
+{
+ unsigned int max_ptes_none, nr_occupied_ptes;
+ struct mthp_range range;
+ unsigned long collapse_address;
+ int collapsed = 0, stack_size = 0;
+ unsigned long nr_pte_entries;
+ u16 offset;
+ u8 order;
+
+ mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
+
+ while (stack_size > 0) {
+ range = mthp_stack_pop(cc, &stack_size);
+ order = range.order;
+ offset = range.offset;
+ nr_pte_entries = 1UL << order;
+
+ if (!test_bit(order, &enabled_orders))
+ goto next_order;
+
+ max_ptes_none = collapse_max_ptes_none(order, !cc->is_khugepaged);
+
+ if (max_ptes_none == -EINVAL)
+ return collapsed;
+
+ nr_occupied_ptes = mthp_nr_occupied_pte_entries(cc, offset, nr_pte_entries);
+
+ if (nr_occupied_ptes >= nr_pte_entries - max_ptes_none) {
+ int ret;
+
+ collapse_address = address + offset * PAGE_SIZE;
+ ret = collapse_huge_page(mm, collapse_address, referenced,
+ unmapped, cc, mmap_locked,
+ order);
+ if (ret == SCAN_SUCCEED) {
+ collapsed += nr_pte_entries;
+ continue;
+ }
+ }
+
+next_order:
+ if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
+ const u8 next_order = order - 1;
+ const u16 mid_offset = offset + (nr_pte_entries / 2);
+
+ mthp_stack_push(cc, &stack_size, mid_offset, next_order);
+ mthp_stack_push(cc, &stack_size, offset, next_order);
+ }
+ }
+ return collapsed;
+}
+
static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long start_addr, bool *mmap_locked,
@@ -1378,11 +1524,15 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
{
pmd_t *pmd;
pte_t *pte, *_pte;
- int none_or_zero = 0, shared = 0, referenced = 0;
+ int i;
+ int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
enum scan_result result = SCAN_FAIL;
struct page *page = NULL;
+ unsigned int max_ptes_none;
struct folio *folio = NULL;
unsigned long addr;
+ unsigned long enabled_orders;
+ bool full_scan = true;
spinlock_t *ptl;
int node = NUMA_NO_NODE, unmapped = 0;
@@ -1392,22 +1542,34 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
if (result != SCAN_SUCCEED)
goto out;
+ bitmap_zero(cc->mthp_bitmap, HPAGE_PMD_NR);
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
+
+ enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, cc->is_khugepaged);
+
+ /*
+ * If PMD is the only enabled order, enforce max_ptes_none, otherwise
+ * scan all pages to populate the bitmap for mTHP collapse.
+ */
+ if (cc->is_khugepaged && enabled_orders == BIT(HPAGE_PMD_ORDER))
+ full_scan = false;
+ max_ptes_none = collapse_max_ptes_none(HPAGE_PMD_ORDER, full_scan);
+
pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
if (!pte) {
result = SCAN_NO_PTE_TABLE;
goto out;
}
- for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
- _pte++, addr += PAGE_SIZE) {
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ _pte = pte + i;
+ addr = start_addr + i * PAGE_SIZE;
pte_t pteval = ptep_get(_pte);
if (pte_none_or_zero(pteval)) {
++none_or_zero;
if (!userfaultfd_armed(vma) &&
- (!cc->is_khugepaged ||
- none_or_zero <= khugepaged_max_ptes_none)) {
+ none_or_zero <= max_ptes_none) {
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
@@ -1475,6 +1637,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
}
}
+ /* Set bit for occupied pages */
+ bitmap_set(cc->mthp_bitmap, i, 1);
/*
* Record which node the original page is from and save this
* information to cc->node_load[].
@@ -1531,9 +1695,12 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
out_unmap:
pte_unmap_unlock(pte, ptl);
if (result == SCAN_SUCCEED) {
- result = collapse_huge_page(mm, start_addr, referenced,
- unmapped, cc, mmap_locked,
- HPAGE_PMD_ORDER);
+ nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
+ cc, mmap_locked, enabled_orders);
+ if (nr_collapsed > 0)
+ result = SCAN_SUCCEED;
+ else
+ result = SCAN_FAIL;
}
out:
trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
--
2.52.0
* [PATCH mm-unstable v14 14/16] khugepaged: avoid unnecessary mTHP collapse attempts
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (12 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 13/16] khugepaged: Introduce mTHP collapse support Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 15/16] khugepaged: run khugepaged for all orders Nico Pache
` (3 subsequent siblings)
17 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang
There are cases where, if an attempted collapse fails, all subsequent
orders are guaranteed to also fail, and cases where a lower order might
still succeed. For example, if allocating a PMD-sized folio fails, a
smaller allocation may still work, so lower orders remain worth trying;
but if the VMA itself is no longer valid, no order can succeed. Avoid the
pointless collapse attempts by bailing out early.
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5947faaba85f..344f7cec55b4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1499,9 +1499,42 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
ret = collapse_huge_page(mm, collapse_address, referenced,
unmapped, cc, mmap_locked,
order);
- if (ret == SCAN_SUCCEED) {
+
+ switch (ret) {
+ /* Cases where we continue to the next collapse candidate */
+ case SCAN_SUCCEED:
collapsed += nr_pte_entries;
+ fallthrough;
+ case SCAN_PTE_MAPPED_HUGEPAGE:
continue;
+ /* Cases where lower orders might still succeed */
+ case SCAN_LACK_REFERENCED_PAGE:
+ case SCAN_EXCEED_NONE_PTE:
+ case SCAN_EXCEED_SWAP_PTE:
+ case SCAN_EXCEED_SHARED_PTE:
+ case SCAN_PAGE_LOCK:
+ case SCAN_PAGE_COUNT:
+ case SCAN_PAGE_LRU:
+ case SCAN_PAGE_NULL:
+ case SCAN_DEL_PAGE_LRU:
+ case SCAN_PTE_NON_PRESENT:
+ case SCAN_PTE_UFFD_WP:
+ case SCAN_ALLOC_HUGE_PAGE_FAIL:
+ goto next_order;
+ /* Cases where no further collapse is possible */
+ case SCAN_CGROUP_CHARGE_FAIL:
+ case SCAN_COPY_MC:
+ case SCAN_ADDRESS_RANGE:
+ case SCAN_NO_PTE_TABLE:
+ case SCAN_ANY_PROCESS:
+ case SCAN_VMA_NULL:
+ case SCAN_VMA_CHECK:
+ case SCAN_SCAN_ABORT:
+ case SCAN_PAGE_ANON:
+ case SCAN_PMD_MAPPED:
+ case SCAN_FAIL:
+ default:
+ return collapsed;
}
}
--
2.52.0
* [PATCH mm-unstable v14 15/16] khugepaged: run khugepaged for all orders
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (13 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 14/16] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 16/16] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
` (2 subsequent siblings)
17 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang
From: Baolin Wang <baolin.wang@linux.alibaba.com>
If any (m)THP order is enabled we should allow khugepaged to run and
attempt scanning and collapsing mTHPs. In order for khugepaged to operate
when only mTHP sizes are specified in sysfs, we must modify the predicate
function that determines whether it ought to run.
This function is currently called hugepage_pmd_enabled(); this patch
renames it to hugepage_enabled() and updates the logic to determine
whether any valid orders exist that would justify running khugepaged.
We must also update collapse_allowable_orders() to check all orders if
the vma is anonymous and the collapse is initiated by khugepaged.
After this patch khugepaged mTHP collapse is fully enabled.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 30 ++++++++++++++++++------------
1 file changed, 18 insertions(+), 12 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 344f7cec55b4..482c4b27dec0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -438,23 +438,23 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
}
-static bool hugepage_pmd_enabled(void)
+static bool hugepage_enabled(void)
{
/*
* We cover the anon, shmem and the file-backed case here; file-backed
* hugepages, when configured in, are determined by the global control.
- * Anon pmd-sized hugepages are determined by the pmd-size control.
+ * Anon hugepages are determined by its per-size mTHP control.
* Shmem pmd-sized hugepages are also determined by its pmd-size control,
* except when the global shmem_huge is set to SHMEM_HUGE_DENY.
*/
if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
hugepage_global_enabled())
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_always))
+ if (READ_ONCE(huge_anon_orders_always))
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
+ if (READ_ONCE(huge_anon_orders_madvise))
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+ if (READ_ONCE(huge_anon_orders_inherit) &&
hugepage_global_enabled())
return true;
if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
@@ -533,8 +533,14 @@ static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
vm_flags_t vm_flags, bool is_khugepaged)
{
+ unsigned long orders;
enum tva_type tva_flags = is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
- unsigned long orders = BIT(HPAGE_PMD_ORDER);
+
+ /* If khugepaged is scanning an anonymous vma, allow mTHP collapse */
+ if (is_khugepaged && vma_is_anonymous(vma))
+ orders = THP_ORDERS_ALL_ANON;
+ else
+ orders = BIT(HPAGE_PMD_ORDER);
return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
}
@@ -543,7 +549,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
vm_flags_t vm_flags)
{
if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
- hugepage_pmd_enabled()) {
+ hugepage_enabled()) {
if (collapse_allowable_orders(vma, vm_flags, /*is_khugepaged=*/true))
__khugepaged_enter(vma->vm_mm);
}
@@ -2896,7 +2902,7 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
static int khugepaged_has_work(void)
{
- return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
+ return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled();
}
static int khugepaged_wait_event(void)
@@ -2969,7 +2975,7 @@ static void khugepaged_wait_work(void)
return;
}
- if (hugepage_pmd_enabled())
+ if (hugepage_enabled())
wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
}
@@ -3000,7 +3006,7 @@ static void set_recommended_min_free_kbytes(void)
int nr_zones = 0;
unsigned long recommended_min;
- if (!hugepage_pmd_enabled()) {
+ if (!hugepage_enabled()) {
calculate_min_free_kbytes();
goto update_wmarks;
}
@@ -3050,7 +3056,7 @@ int start_stop_khugepaged(void)
int err = 0;
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled()) {
+ if (hugepage_enabled()) {
if (!khugepaged_thread)
khugepaged_thread = kthread_run(khugepaged, NULL,
"khugepaged");
@@ -3076,7 +3082,7 @@ int start_stop_khugepaged(void)
void khugepaged_min_free_kbytes_update(void)
{
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled() && khugepaged_thread)
+ if (hugepage_enabled() && khugepaged_thread)
set_recommended_min_free_kbytes();
mutex_unlock(&khugepaged_mutex);
}
--
2.52.0
* [PATCH mm-unstable v14 16/16] Documentation: mm: update the admin guide for mTHP collapse
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (14 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 15/16] khugepaged: run khugepaged for all orders Nico Pache
@ 2026-01-22 19:28 ` Nico Pache
2026-01-26 11:21 ` [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Lorenzo Stoakes
2026-01-26 11:32 ` Lorenzo Stoakes
17 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-22 19:28 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel
Cc: npache, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, lance.yang, vbabka,
rppt, surenb, mhocko, corbet, rostedt, mhiramat,
mathieu.desnoyers, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, gourry, ying.huang, apopple, jannh, pfalcato,
jackmanb, hannes, willy, peterx, wangkefeng.wang, usamaarif642,
sunnanyong, vishal.moola, thomas.hellstrom, yang, kas, aarcange,
raquini, anshuman.khandual, catalin.marinas, tiwai, will,
dave.hansen, jack, cl, jglisse, zokeefe, rientjes, rdunlap,
hughd, richard.weiyang, Bagas Sanjaya
Now that we can collapse to mTHPs, let's update the admin guide to
reflect these changes and provide proper guidance on how to utilize it.
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
Documentation/admin-guide/mm/transhuge.rst | 48 +++++++++++++---------
1 file changed, 28 insertions(+), 20 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index eebb1f6bbc6c..67836c683e8d 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -63,7 +63,8 @@ often.
THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside task's address space. Unless THP is completely
disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into PMD-sized huge pages.
+collapses sequences of basic pages into huge pages of either PMD size
+or mTHP sizes, if the system is configured to do so.
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
interface and using madvise(2) and prctl(2) system calls.
@@ -219,10 +220,10 @@ this behaviour by writing 0 to shrink_underused, and enable it by writing
echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
-khugepaged will be automatically started when PMD-sized THP is enabled
+khugepaged will be automatically started when any THP size is enabled
(either of the per-size anon control or the top-level control are set
to "always" or "madvise"), and it'll be automatically shutdown when
-PMD-sized THP is disabled (when both the per-size anon control and the
+all THP sizes are disabled (when both the per-size anon control and the
top-level control are "never")
process THP controls
@@ -264,11 +265,6 @@ support the following arguments::
Khugepaged controls
-------------------
-.. note::
- khugepaged currently only searches for opportunities to collapse to
- PMD-sized THP and no attempt is made to collapse to other THP
- sizes.
-
khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
@@ -296,11 +292,11 @@ allocation failure to throttle the next allocation attempt::
The khugepaged progress can be seen in the number of pages collapsed (note
that this counter may not be an exact count of the number of pages
collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
-being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
-one 2M hugepage. Each may happen independently, or together, depending on
-the type of memory and the failures that occur. As such, this value should
-be interpreted roughly as a sign of progress, and counters in /proc/vmstat
-consulted for more accurate accounting)::
+being replaced by a PMD mapping, or (2) physical pages replaced by one
+hugepage of various sizes (PMD-sized or mTHP). Each may happen independently,
+or together, depending on the type of memory and the failures that occur.
+As such, this value should be interpreted roughly as a sign of progress,
+and counters in /proc/vmstat consulted for more accurate accounting)::
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
@@ -308,16 +304,19 @@ for each pass::
/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
-``max_ptes_none`` specifies how many extra small pages (that are
-not already mapped) can be allocated when collapsing a group
-of small pages into one large page::
+``max_ptes_none`` specifies how many empty (none/zero) pages are allowed
+when collapsing a group of small pages into one large page::
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
-A higher value leads to use additional memory for programs.
-A lower value leads to gain less thp performance. Value of
-max_ptes_none can waste cpu time very little, you can
-ignore it.
+For PMD-sized THP collapse, this directly limits the number of empty pages
+allowed in the 2MB region. For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1)
+are supported. Any other value will emit a warning and no mTHP collapse
+will be attempted.
+
+A higher value allows more empty pages, potentially leading to more memory
+usage but better THP performance. A lower value is more conservative and
+may result in fewer THP collapses.
``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::
@@ -337,6 +336,15 @@ that THP is shared. Exceeding the number would block the collapse::
A higher value may increase memory footprint for some workloads.
+.. note::
+ For mTHP collapse, khugepaged does not support collapsing regions that
+ contain shared or swapped out pages, as this could lead to continuous
+ promotion to higher orders. The collapse will fail if any shared or
+ swapped PTEs are encountered during the scan.
+
+ Currently, madvise_collapse only supports collapsing to PMD-sized THPs
+ and does not attempt mTHP collapses.
+
Boot parameters
===============
--
2.52.0
* Re: [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2026-01-22 19:28 ` [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
@ 2026-01-23 5:07 ` Lance Yang
2026-01-23 9:31 ` Baolin Wang
` (2 more replies)
2026-01-28 16:38 ` Nico Pache
2026-02-03 11:35 ` Lorenzo Stoakes
2 siblings, 3 replies; 39+ messages in thread
From: Lance Yang @ 2026-01-23 5:07 UTC (permalink / raw)
To: Nico Pache
Cc: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-trace-kernel, linux-doc, corbet, rostedt, mhiramat,
mathieu.desnoyers, linux-kernel, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, jannh,
pfalcato, jackmanb, hannes, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kas, aarcange, raquini, anshuman.khandual, catalin.marinas,
tiwai, will, dave.hansen, jack, cl, jglisse, zokeefe, rientjes,
rdunlap, hughd, richard.weiyang, David Hildenbrand, linux-mm
On 2026/1/23 03:28, Nico Pache wrote:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the same thing.
>
> Create collapse_single_pmd to increase code reuse and create an entry
> point to these two users.
>
> Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> collapse_single_pmd function. This introduces a minor behavioral change
> that is most likely an undiscovered bug. The current implementation of
> khugepaged tests collapse_test_exit_or_disable before calling
> collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> case. By unifying these two callers madvise_collapse now also performs
> this check. We also modify the return value to be SCAN_ANY_PROCESS which
> properly indicates that this process is no longer valid to operate on.
>
> We also guard the khugepaged_pages_collapsed variable to ensure its only
> incremented for khugepaged.
>
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
I think this patch introduces some functional changes compared to the
previous version[1] ...
Maybe we should drop the r-b tags and let folks take another look?
There might be an issue with the vma access in madvise_collapse(). See
below:
[1]
https://lore.kernel.org/linux-mm/20251201174627.23295-3-npache@redhat.com/
> mm/khugepaged.c | 106 +++++++++++++++++++++++++++---------------------
> 1 file changed, 60 insertions(+), 46 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index fefcbdca4510..59e5a5588d85 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2394,6 +2394,54 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm, unsigned long a
> return result;
> }
>
> +/*
> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> + * the results.
> + */
> +static enum scan_result collapse_single_pmd(unsigned long addr,
> + struct vm_area_struct *vma, bool *mmap_locked,
> + struct collapse_control *cc)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + enum scan_result result;
> + struct file *file;
> + pgoff_t pgoff;
> +
> + if (vma_is_anonymous(vma)) {
> + result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> + goto end;
> + }
> +
> + file = get_file(vma->vm_file);
> + pgoff = linear_page_index(vma, addr);
> +
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> + result = collapse_scan_file(mm, addr, file, pgoff, cc);
> + fput(file);
> +
> + if (result != SCAN_PTE_MAPPED_HUGEPAGE)
> + goto end;
> +
> + mmap_read_lock(mm);
> + *mmap_locked = true;
> + if (collapse_test_exit_or_disable(mm)) {
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> + return SCAN_ANY_PROCESS;
> + }
> + result = try_collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged);
> + if (result == SCAN_PMD_MAPPED)
> + result = SCAN_SUCCEED;
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> +
> +end:
> + if (cc->is_khugepaged && result == SCAN_SUCCEED)
> + ++khugepaged_pages_collapsed;
> + return result;
> +}
> +
> static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *result,
> struct collapse_control *cc)
> __releases(&khugepaged_mm_lock)
> @@ -2466,34 +2514,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
> VM_BUG_ON(khugepaged_scan.address < hstart ||
> khugepaged_scan.address + HPAGE_PMD_SIZE >
> hend);
> - if (!vma_is_anonymous(vma)) {
> - struct file *file = get_file(vma->vm_file);
> - pgoff_t pgoff = linear_page_index(vma,
> - khugepaged_scan.address);
> -
> - mmap_read_unlock(mm);
> - mmap_locked = false;
> - *result = collapse_scan_file(mm,
> - khugepaged_scan.address, file, pgoff, cc);
> - fput(file);
> - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> - mmap_read_lock(mm);
> - if (collapse_test_exit_or_disable(mm))
> - goto breakouterloop;
> - *result = try_collapse_pte_mapped_thp(mm,
> - khugepaged_scan.address, false);
> - if (*result == SCAN_PMD_MAPPED)
> - *result = SCAN_SUCCEED;
> - mmap_read_unlock(mm);
> - }
> - } else {
> - *result = collapse_scan_pmd(mm, vma,
> - khugepaged_scan.address, &mmap_locked, cc);
> - }
> -
> - if (*result == SCAN_SUCCEED)
> - ++khugepaged_pages_collapsed;
>
> + *result = collapse_single_pmd(khugepaged_scan.address,
> + vma, &mmap_locked, cc);
> /* move to next address */
> khugepaged_scan.address += HPAGE_PMD_SIZE;
> progress += HPAGE_PMD_NR;
> @@ -2799,6 +2822,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> cond_resched();
> mmap_read_lock(mm);
> mmap_locked = true;
> + *lock_dropped = true;
> result = hugepage_vma_revalidate(mm, addr, false, &vma,
> cc);
> if (result != SCAN_SUCCEED) {
> @@ -2809,17 +2833,17 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
> }
> mmap_assert_locked(mm);
> - if (!vma_is_anonymous(vma)) {
> - struct file *file = get_file(vma->vm_file);
> - pgoff_t pgoff = linear_page_index(vma, addr);
>
> - mmap_read_unlock(mm);
> - mmap_locked = false;
> + result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
> +
> + if (!mmap_locked)
> *lock_dropped = true;
> - result = collapse_scan_file(mm, addr, file, pgoff, cc);
>
> - if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> - mapping_can_writeback(file->f_mapping)) {
> + if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) {
> + struct file *file = get_file(vma->vm_file);
> + pgoff_t pgoff = linear_page_index(vma, addr);
After collapse_single_pmd() returns, mmap_lock might have been released.
Between that unlock and here, another thread could unmap/remap the VMA,
making the vma pointer stale when we access vma->vm_file?
Would it be safer to get the file reference before calling
collapse_single_pmd()? Or do we need to revalidate the VMA after getting
the lock back?
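Just to make the first option concrete, something along these lines is what
I have in mind - only a rough sketch, not a real patch, and it ignores the
anonymous-VMA case for brevity:

	/*
	 * Pin the file and compute pgoff while the mmap lock still
	 * stabilises the VMA ...
	 */
	struct file *file = get_file(vma->vm_file);
	pgoff_t pgoff = linear_page_index(vma, addr);

	result = collapse_single_pmd(addr, vma, &mmap_locked, cc);

	/*
	 * ... collapse_single_pmd() may have dropped mmap_lock, so 'vma'
	 * must not be touched again; only 'file' and 'pgoff' are used below.
	 */
	if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
	    mapping_can_writeback(file->f_mapping)) {
		loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
		loff_t lend = lstart + HPAGE_PMD_SIZE - 1;

		filemap_write_and_wait_range(file->f_mapping, lstart, lend);
	}
	fput(file);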
Thanks,
Lance
> +
> + if (mapping_can_writeback(file->f_mapping)) {
> loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
>
> @@ -2829,26 +2853,16 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> goto retry;
> }
> fput(file);
> - } else {
> - result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
> }
> - if (!mmap_locked)
> - *lock_dropped = true;
>
> -handle_result:
> switch (result) {
> case SCAN_SUCCEED:
> case SCAN_PMD_MAPPED:
> ++thps;
> break;
> - case SCAN_PTE_MAPPED_HUGEPAGE:
> - BUG_ON(mmap_locked);
> - mmap_read_lock(mm);
> - result = try_collapse_pte_mapped_thp(mm, addr, true);
> - mmap_read_unlock(mm);
> - goto handle_result;
> /* Whitelisted set of results where continuing OK */
> case SCAN_NO_PTE_TABLE:
> + case SCAN_PTE_MAPPED_HUGEPAGE:
> case SCAN_PTE_NON_PRESENT:
> case SCAN_PTE_UFFD_WP:
> case SCAN_LACK_REFERENCED_PAGE:
* Re: [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2026-01-23 5:07 ` Lance Yang
@ 2026-01-23 9:31 ` Baolin Wang
2026-01-26 12:25 ` Lorenzo Stoakes
2026-01-23 23:26 ` Nico Pache
2026-01-26 11:40 ` Lorenzo Stoakes
2 siblings, 1 reply; 39+ messages in thread
From: Baolin Wang @ 2026-01-23 9:31 UTC (permalink / raw)
To: Lance Yang, Nico Pache
Cc: akpm, david, lorenzo.stoakes, ziy, Liam.Howlett, ryan.roberts,
dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-trace-kernel, linux-doc, corbet, rostedt, mhiramat,
mathieu.desnoyers, linux-kernel, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, jannh,
pfalcato, jackmanb, hannes, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kas, aarcange, raquini, anshuman.khandual, catalin.marinas,
tiwai, will, dave.hansen, jack, cl, jglisse, zokeefe, rientjes,
rdunlap, hughd, richard.weiyang, David Hildenbrand, linux-mm
On 1/23/26 1:07 PM, Lance Yang wrote:
>
>
> On 2026/1/23 03:28, Nico Pache wrote:
>> The khugepaged daemon and madvise_collapse have two different
>> implementations that do almost the same thing.
>>
>> Create collapse_single_pmd to increase code reuse and create an entry
>> point to these two users.
>>
>> Refactor madvise_collapse and collapse_scan_mm_slot to use the new
>> collapse_single_pmd function. This introduces a minor behavioral change
>> that is most likely an undiscovered bug. The current implementation of
>> khugepaged tests collapse_test_exit_or_disable before calling
>> collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
>> case. By unifying these two callers madvise_collapse now also performs
>> this check. We also modify the return value to be SCAN_ANY_PROCESS which
>> properly indicates that this process is no longer valid to operate on.
>>
>> We also guard the khugepaged_pages_collapsed variable to ensure its only
>> incremented for khugepaged.
>>
>> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>
> I think this patch introduces some functional changes compared to previous
> version[1] ...
>
> Maybe we should drop the r-b tags and let folks take another look?
>
> There might be an issue with the vma access in madvise_collapse(). See
> below:
>
> [1] https://lore.kernel.org/linux-mm/20251201174627.23295-3-
> npache@redhat.com/
>
>> mm/khugepaged.c | 106 +++++++++++++++++++++++++++---------------------
>> 1 file changed, 60 insertions(+), 46 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index fefcbdca4510..59e5a5588d85 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2394,6 +2394,54 @@ static enum scan_result
>> collapse_scan_file(struct mm_struct *mm, unsigned long a
>> return result;
>> }
>> +/*
>> + * Try to collapse a single PMD starting at a PMD aligned addr, and
>> return
>> + * the results.
>> + */
>> +static enum scan_result collapse_single_pmd(unsigned long addr,
>> + struct vm_area_struct *vma, bool *mmap_locked,
>> + struct collapse_control *cc)
>> +{
>> + struct mm_struct *mm = vma->vm_mm;
>> + enum scan_result result;
>> + struct file *file;
>> + pgoff_t pgoff;
>> +
>> + if (vma_is_anonymous(vma)) {
>> + result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
>> + goto end;
>> + }
>> +
>> + file = get_file(vma->vm_file);
>> + pgoff = linear_page_index(vma, addr);
>> +
>> + mmap_read_unlock(mm);
>> + *mmap_locked = false;
>> + result = collapse_scan_file(mm, addr, file, pgoff, cc);
>> + fput(file);
>> +
>> + if (result != SCAN_PTE_MAPPED_HUGEPAGE)
>> + goto end;
>> +
>> + mmap_read_lock(mm);
>> + *mmap_locked = true;
>> + if (collapse_test_exit_or_disable(mm)) {
>> + mmap_read_unlock(mm);
>> + *mmap_locked = false;
>> + return SCAN_ANY_PROCESS;
>> + }
>> + result = try_collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged);
>> + if (result == SCAN_PMD_MAPPED)
>> + result = SCAN_SUCCEED;
>> + mmap_read_unlock(mm);
>> + *mmap_locked = false;
>> +
>> +end:
>> + if (cc->is_khugepaged && result == SCAN_SUCCEED)
>> + ++khugepaged_pages_collapsed;
>> + return result;
>> +}
>> +
>> static unsigned int collapse_scan_mm_slot(unsigned int pages, enum
>> scan_result *result,
>> struct collapse_control *cc)
>> __releases(&khugepaged_mm_lock)
>> @@ -2466,34 +2514,9 @@ static unsigned int
>> collapse_scan_mm_slot(unsigned int pages, enum scan_result *
>> VM_BUG_ON(khugepaged_scan.address < hstart ||
>> khugepaged_scan.address + HPAGE_PMD_SIZE >
>> hend);
>> - if (!vma_is_anonymous(vma)) {
>> - struct file *file = get_file(vma->vm_file);
>> - pgoff_t pgoff = linear_page_index(vma,
>> - khugepaged_scan.address);
>> -
>> - mmap_read_unlock(mm);
>> - mmap_locked = false;
>> - *result = collapse_scan_file(mm,
>> - khugepaged_scan.address, file, pgoff, cc);
>> - fput(file);
>> - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
>> - mmap_read_lock(mm);
>> - if (collapse_test_exit_or_disable(mm))
>> - goto breakouterloop;
>> - *result = try_collapse_pte_mapped_thp(mm,
>> - khugepaged_scan.address, false);
>> - if (*result == SCAN_PMD_MAPPED)
>> - *result = SCAN_SUCCEED;
>> - mmap_read_unlock(mm);
>> - }
>> - } else {
>> - *result = collapse_scan_pmd(mm, vma,
>> - khugepaged_scan.address, &mmap_locked, cc);
>> - }
>> -
>> - if (*result == SCAN_SUCCEED)
>> - ++khugepaged_pages_collapsed;
>> + *result = collapse_single_pmd(khugepaged_scan.address,
>> + vma, &mmap_locked, cc);
>> /* move to next address */
>> khugepaged_scan.address += HPAGE_PMD_SIZE;
>> progress += HPAGE_PMD_NR;
>> @@ -2799,6 +2822,7 @@ int madvise_collapse(struct vm_area_struct *vma,
>> unsigned long start,
>> cond_resched();
>> mmap_read_lock(mm);
>> mmap_locked = true;
>> + *lock_dropped = true;
>> result = hugepage_vma_revalidate(mm, addr, false, &vma,
>> cc);
>> if (result != SCAN_SUCCEED) {
>> @@ -2809,17 +2833,17 @@ int madvise_collapse(struct vm_area_struct
>> *vma, unsigned long start,
>> hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
>> }
>> mmap_assert_locked(mm);
>> - if (!vma_is_anonymous(vma)) {
>> - struct file *file = get_file(vma->vm_file);
>> - pgoff_t pgoff = linear_page_index(vma, addr);
>> - mmap_read_unlock(mm);
>> - mmap_locked = false;
>> + result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
>> +
>> + if (!mmap_locked)
>> *lock_dropped = true;
>> - result = collapse_scan_file(mm, addr, file, pgoff, cc);
>> - if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !
>> triggered_wb &&
>> - mapping_can_writeback(file->f_mapping)) {
>> + if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) {
>> + struct file *file = get_file(vma->vm_file);
>> + pgoff_t pgoff = linear_page_index(vma, addr);
>
>
> After collapse_single_pmd() returns, mmap_lock might have been released.
> Between
> that unlock and here, another thread could unmap/remap the VMA, making
> the vma
> pointer stale when we access vma->vm_file?
>
> Would it be safer to get the file reference before calling
> collapse_single_pmd()?
> Or we need to revalidate the VMA after getting the lock back?
Good catch. I think we can move the filemap_write_and_wait_range()
related logic into collapse_single_pmd(), after we get a file reference.
* Re: [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2026-01-23 5:07 ` Lance Yang
2026-01-23 9:31 ` Baolin Wang
@ 2026-01-23 23:26 ` Nico Pache
2026-01-24 4:41 ` Lance Yang
2026-01-26 12:25 ` Lorenzo Stoakes
2026-01-26 11:40 ` Lorenzo Stoakes
2 siblings, 2 replies; 39+ messages in thread
From: Nico Pache @ 2026-01-23 23:26 UTC (permalink / raw)
To: Lance Yang, Garg, Shivank
Cc: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-trace-kernel, linux-doc, corbet, rostedt, mhiramat,
mathieu.desnoyers, linux-kernel, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, jannh,
pfalcato, jackmanb, hannes, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kas, aarcange, raquini, anshuman.khandual, catalin.marinas,
tiwai, will, dave.hansen, jack, cl, jglisse, zokeefe, rientjes,
rdunlap, hughd, richard.weiyang, David Hildenbrand, linux-mm
On Thu, Jan 22, 2026 at 10:08 PM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2026/1/23 03:28, Nico Pache wrote:
> > The khugepaged daemon and madvise_collapse have two different
> > implementations that do almost the same thing.
> >
> > Create collapse_single_pmd to increase code reuse and create an entry
> > point to these two users.
> >
> > Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> > collapse_single_pmd function. This introduces a minor behavioral change
> > that is most likely an undiscovered bug. The current implementation of
> > khugepaged tests collapse_test_exit_or_disable before calling
> > collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> > case. By unifying these two callers madvise_collapse now also performs
> > this check. We also modify the return value to be SCAN_ANY_PROCESS which
> > properly indicates that this process is no longer valid to operate on.
> >
> > We also guard the khugepaged_pages_collapsed variable to ensure its only
> > incremented for khugepaged.
> >
> > Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> > Reviewed-by: Lance Yang <lance.yang@linux.dev>
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Reviewed-by: Zi Yan <ziy@nvidia.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
>
> I think this patch introduces some functional changes compared to previous
> version[1] ...
>
> Maybe we should drop the r-b tags and let folks take another look?
>
> There might be an issue with the vma access in madvise_collapse(). See
> below:
>
> [1]
> https://lore.kernel.org/linux-mm/20251201174627.23295-3-npache@redhat.com/
>
> > mm/khugepaged.c | 106 +++++++++++++++++++++++++++---------------------
> > 1 file changed, 60 insertions(+), 46 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index fefcbdca4510..59e5a5588d85 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2394,6 +2394,54 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm, unsigned long a
> > return result;
> > }
> >
> > +/*
> > + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> > + * the results.
> > + */
> > +static enum scan_result collapse_single_pmd(unsigned long addr,
> > + struct vm_area_struct *vma, bool *mmap_locked,
> > + struct collapse_control *cc)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + enum scan_result result;
> > + struct file *file;
> > + pgoff_t pgoff;
> > +
> > + if (vma_is_anonymous(vma)) {
> > + result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> > + goto end;
> > + }
> > +
> > + file = get_file(vma->vm_file);
> > + pgoff = linear_page_index(vma, addr);
> > +
> > + mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > + result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > + fput(file);
> > +
> > + if (result != SCAN_PTE_MAPPED_HUGEPAGE)
> > + goto end;
> > +
> > + mmap_read_lock(mm);
> > + *mmap_locked = true;
> > + if (collapse_test_exit_or_disable(mm)) {
> > + mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > + return SCAN_ANY_PROCESS;
> > + }
> > + result = try_collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged);
> > + if (result == SCAN_PMD_MAPPED)
> > + result = SCAN_SUCCEED;
> > + mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > +
> > +end:
> > + if (cc->is_khugepaged && result == SCAN_SUCCEED)
> > + ++khugepaged_pages_collapsed;
> > + return result;
> > +}
> > +
> > static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *result,
> > struct collapse_control *cc)
> > __releases(&khugepaged_mm_lock)
> > @@ -2466,34 +2514,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
> > VM_BUG_ON(khugepaged_scan.address < hstart ||
> > khugepaged_scan.address + HPAGE_PMD_SIZE >
> > hend);
> > - if (!vma_is_anonymous(vma)) {
> > - struct file *file = get_file(vma->vm_file);
> > - pgoff_t pgoff = linear_page_index(vma,
> > - khugepaged_scan.address);
> > -
> > - mmap_read_unlock(mm);
> > - mmap_locked = false;
> > - *result = collapse_scan_file(mm,
> > - khugepaged_scan.address, file, pgoff, cc);
> > - fput(file);
> > - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > - mmap_read_lock(mm);
> > - if (collapse_test_exit_or_disable(mm))
> > - goto breakouterloop;
> > - *result = try_collapse_pte_mapped_thp(mm,
> > - khugepaged_scan.address, false);
> > - if (*result == SCAN_PMD_MAPPED)
> > - *result = SCAN_SUCCEED;
> > - mmap_read_unlock(mm);
> > - }
> > - } else {
> > - *result = collapse_scan_pmd(mm, vma,
> > - khugepaged_scan.address, &mmap_locked, cc);
> > - }
> > -
> > - if (*result == SCAN_SUCCEED)
> > - ++khugepaged_pages_collapsed;
> >
> > + *result = collapse_single_pmd(khugepaged_scan.address,
> > + vma, &mmap_locked, cc);
> > /* move to next address */
> > khugepaged_scan.address += HPAGE_PMD_SIZE;
> > progress += HPAGE_PMD_NR;
> > @@ -2799,6 +2822,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > cond_resched();
> > mmap_read_lock(mm);
> > mmap_locked = true;
> > + *lock_dropped = true;
> > result = hugepage_vma_revalidate(mm, addr, false, &vma,
> > cc);
> > if (result != SCAN_SUCCEED) {
> > @@ -2809,17 +2833,17 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
> > }
> > mmap_assert_locked(mm);
> > - if (!vma_is_anonymous(vma)) {
> > - struct file *file = get_file(vma->vm_file);
> > - pgoff_t pgoff = linear_page_index(vma, addr);
> >
> > - mmap_read_unlock(mm);
> > - mmap_locked = false;
> > + result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
> > +
> > + if (!mmap_locked)
> > *lock_dropped = true;
> > - result = collapse_scan_file(mm, addr, file, pgoff, cc);
> >
> > - if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> > - mapping_can_writeback(file->f_mapping)) {
> > + if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) {
> > + struct file *file = get_file(vma->vm_file);
> > + pgoff_t pgoff = linear_page_index(vma, addr);
>
>
> After collapse_single_pmd() returns, mmap_lock might have been released.
> Between
> that unlock and here, another thread could unmap/remap the VMA, making
> the vma
> pointer stale when we access vma->vm_file?
+ Shivank, I thought they were on the CC list.
Hey! I thought of this case, but then figured it was no different than
what is currently implemented for the writeback-retry logic, since the
mmap lock is dropped and not revalidated. BUT I failed to consider
that the file reference is held throughout that time.
I thought of moving the functionality into collapse_single_pmd(), but
figured I'd keep it in madvise_collapse() as it's the sole user of
that functionality. Given the potential file ref issue, that may be
the best solution, and I don't think it should be too difficult. I'll
queue that up, and also drop the r-b tags as you suggested.
OK, here's my solution; does this look like the right approach?
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 59e5a5588d85..dda9fdc35767 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2418,6 +2418,14 @@ static enum scan_result collapse_single_pmd(unsigned long addr,
mmap_read_unlock(mm);
*mmap_locked = false;
result = collapse_scan_file(mm, addr, file, pgoff, cc);
+
+ if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
+ mapping_can_writeback(file->f_mapping)) {
+ loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
+ loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
+
+ filemap_write_and_wait_range(file->f_mapping, lstart, lend);
+ }
fput(file);
if (result != SCAN_PTE_MAPPED_HUGEPAGE)
@@ -2840,19 +2848,8 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
*lock_dropped = true;
if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) {
- struct file *file = get_file(vma->vm_file);
- pgoff_t pgoff = linear_page_index(vma, addr);
-
- if (mapping_can_writeback(file->f_mapping)) {
- loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
- loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
-
- filemap_write_and_wait_range(file->f_mapping, lstart, lend);
- triggered_wb = true;
- fput(file);
- goto retry;
- }
- fput(file);
+ triggered_wb = true;
+ goto retry;
}
switch (result) {
-- Nico
>
> Would it be safer to get the file reference before calling
> collapse_single_pmd()?
> Or we need to revalidate the VMA after getting the lock back?
>
>
> Thanks,
> Lance
>
> > +
> > + if (mapping_can_writeback(file->f_mapping)) {
> > loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> > loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> >
> > @@ -2829,26 +2853,16 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > goto retry;
> > }
> > fput(file);
> > - } else {
> > - result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
> > }
> > - if (!mmap_locked)
> > - *lock_dropped = true;
> >
> > -handle_result:
> > switch (result) {
> > case SCAN_SUCCEED:
> > case SCAN_PMD_MAPPED:
> > ++thps;
> > break;
> > - case SCAN_PTE_MAPPED_HUGEPAGE:
> > - BUG_ON(mmap_locked);
> > - mmap_read_lock(mm);
> > - result = try_collapse_pte_mapped_thp(mm, addr, true);
> > - mmap_read_unlock(mm);
> > - goto handle_result;
> > /* Whitelisted set of results where continuing OK */
> > case SCAN_NO_PTE_TABLE:
> > + case SCAN_PTE_MAPPED_HUGEPAGE:
> > case SCAN_PTE_NON_PRESENT:
> > case SCAN_PTE_UFFD_WP:
> > case SCAN_LACK_REFERENCED_PAGE:
>
* Re: [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2026-01-23 23:26 ` Nico Pache
@ 2026-01-24 4:41 ` Lance Yang
2026-01-26 12:25 ` Lorenzo Stoakes
1 sibling, 0 replies; 39+ messages in thread
From: Lance Yang @ 2026-01-24 4:41 UTC (permalink / raw)
To: Nico Pache, Garg, Shivank
Cc: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-trace-kernel, linux-doc, corbet, rostedt, mhiramat,
mathieu.desnoyers, linux-kernel, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, jannh,
pfalcato, jackmanb, hannes, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kas, aarcange, raquini, anshuman.khandual, catalin.marinas,
tiwai, will, dave.hansen, jack, cl, jglisse, zokeefe, rientjes,
rdunlap, hughd, richard.weiyang, David Hildenbrand, linux-mm
On 2026/1/24 07:26, Nico Pache wrote:
> On Thu, Jan 22, 2026 at 10:08 PM Lance Yang <lance.yang@linux.dev> wrote:
>>
>>
>>
>> On 2026/1/23 03:28, Nico Pache wrote:
>>> The khugepaged daemon and madvise_collapse have two different
>>> implementations that do almost the same thing.
>>>
>>> Create collapse_single_pmd to increase code reuse and create an entry
>>> point to these two users.
>>>
>>> Refactor madvise_collapse and collapse_scan_mm_slot to use the new
>>> collapse_single_pmd function. This introduces a minor behavioral change
>>> that is most likely an undiscovered bug. The current implementation of
>>> khugepaged tests collapse_test_exit_or_disable before calling
>>> collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
>>> case. By unifying these two callers madvise_collapse now also performs
>>> this check. We also modify the return value to be SCAN_ANY_PROCESS which
>>> properly indicates that this process is no longer valid to operate on.
>>>
>>> We also guard the khugepaged_pages_collapsed variable to ensure its only
>>> incremented for khugepaged.
>>>
>>> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
>>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>>> Acked-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>
>> I think this patch introduces some functional changes compared to previous
>> version[1] ...
>>
>> Maybe we should drop the r-b tags and let folks take another look?
>>
>> There might be an issue with the vma access in madvise_collapse(). See
>> below:
>>
>> [1]
>> https://lore.kernel.org/linux-mm/20251201174627.23295-3-npache@redhat.com/
>>
>>> mm/khugepaged.c | 106 +++++++++++++++++++++++++++---------------------
>>> 1 file changed, 60 insertions(+), 46 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index fefcbdca4510..59e5a5588d85 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2394,6 +2394,54 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm, unsigned long a
>>> return result;
>>> }
>>>
>>> +/*
>>> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
>>> + * the results.
>>> + */
>>> +static enum scan_result collapse_single_pmd(unsigned long addr,
>>> + struct vm_area_struct *vma, bool *mmap_locked,
>>> + struct collapse_control *cc)
>>> +{
>>> + struct mm_struct *mm = vma->vm_mm;
>>> + enum scan_result result;
>>> + struct file *file;
>>> + pgoff_t pgoff;
>>> +
>>> + if (vma_is_anonymous(vma)) {
>>> + result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
>>> + goto end;
>>> + }
>>> +
>>> + file = get_file(vma->vm_file);
>>> + pgoff = linear_page_index(vma, addr);
>>> +
>>> + mmap_read_unlock(mm);
>>> + *mmap_locked = false;
>>> + result = collapse_scan_file(mm, addr, file, pgoff, cc);
>>> + fput(file);
>>> +
>>> + if (result != SCAN_PTE_MAPPED_HUGEPAGE)
>>> + goto end;
>>> +
>>> + mmap_read_lock(mm);
>>> + *mmap_locked = true;
>>> + if (collapse_test_exit_or_disable(mm)) {
>>> + mmap_read_unlock(mm);
>>> + *mmap_locked = false;
>>> + return SCAN_ANY_PROCESS;
>>> + }
>>> + result = try_collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged);
>>> + if (result == SCAN_PMD_MAPPED)
>>> + result = SCAN_SUCCEED;
>>> + mmap_read_unlock(mm);
>>> + *mmap_locked = false;
>>> +
>>> +end:
>>> + if (cc->is_khugepaged && result == SCAN_SUCCEED)
>>> + ++khugepaged_pages_collapsed;
>>> + return result;
>>> +}
>>> +
>>> static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *result,
>>> struct collapse_control *cc)
>>> __releases(&khugepaged_mm_lock)
>>> @@ -2466,34 +2514,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
>>> VM_BUG_ON(khugepaged_scan.address < hstart ||
>>> khugepaged_scan.address + HPAGE_PMD_SIZE >
>>> hend);
>>> - if (!vma_is_anonymous(vma)) {
>>> - struct file *file = get_file(vma->vm_file);
>>> - pgoff_t pgoff = linear_page_index(vma,
>>> - khugepaged_scan.address);
>>> -
>>> - mmap_read_unlock(mm);
>>> - mmap_locked = false;
>>> - *result = collapse_scan_file(mm,
>>> - khugepaged_scan.address, file, pgoff, cc);
>>> - fput(file);
>>> - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
>>> - mmap_read_lock(mm);
>>> - if (collapse_test_exit_or_disable(mm))
>>> - goto breakouterloop;
>>> - *result = try_collapse_pte_mapped_thp(mm,
>>> - khugepaged_scan.address, false);
>>> - if (*result == SCAN_PMD_MAPPED)
>>> - *result = SCAN_SUCCEED;
>>> - mmap_read_unlock(mm);
>>> - }
>>> - } else {
>>> - *result = collapse_scan_pmd(mm, vma,
>>> - khugepaged_scan.address, &mmap_locked, cc);
>>> - }
>>> -
>>> - if (*result == SCAN_SUCCEED)
>>> - ++khugepaged_pages_collapsed;
>>>
>>> + *result = collapse_single_pmd(khugepaged_scan.address,
>>> + vma, &mmap_locked, cc);
>>> /* move to next address */
>>> khugepaged_scan.address += HPAGE_PMD_SIZE;
>>> progress += HPAGE_PMD_NR;
>>> @@ -2799,6 +2822,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>> cond_resched();
>>> mmap_read_lock(mm);
>>> mmap_locked = true;
>>> + *lock_dropped = true;
>>> result = hugepage_vma_revalidate(mm, addr, false, &vma,
>>> cc);
>>> if (result != SCAN_SUCCEED) {
>>> @@ -2809,17 +2833,17 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>> hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
>>> }
>>> mmap_assert_locked(mm);
>>> - if (!vma_is_anonymous(vma)) {
>>> - struct file *file = get_file(vma->vm_file);
>>> - pgoff_t pgoff = linear_page_index(vma, addr);
>>>
>>> - mmap_read_unlock(mm);
>>> - mmap_locked = false;
>>> + result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
>>> +
>>> + if (!mmap_locked)
>>> *lock_dropped = true;
>>> - result = collapse_scan_file(mm, addr, file, pgoff, cc);
>>>
>>> - if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
>>> - mapping_can_writeback(file->f_mapping)) {
>>> + if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) {
>>> + struct file *file = get_file(vma->vm_file);
>>> + pgoff_t pgoff = linear_page_index(vma, addr);
>>
>>
>> After collapse_single_pmd() returns, mmap_lock might have been released.
>> Between
>> that unlock and here, another thread could unmap/remap the VMA, making
>> the vma
>> pointer stale when we access vma->vm_file?
>
> + Shivank, I thought they were on the CC list.
>
> Hey! I thought of this case, but then figured it was no different than
> what is currently implemented for the writeback-retry logic, since the
> mmap lock is dropped and not revalidated. BUT I failed to consider
> that the file reference is held throughout that time.
>
> I thought of moving the functionality into collapse_single_pmd(), but
> figured I'd keep it in madvise_collapse() as it's the sole user of
> that functionality. Given the potential file ref issue, that may be
> the best solution, and I dont think it should be too difficult. I'll
> queue that up, and also drop the r-b tags as you suggested.
>
> Ok, here's my solution, does this look like the right approach?:
Hey! Thanks for the quick fix!
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 59e5a5588d85..dda9fdc35767 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2418,6 +2418,14 @@ static enum scan_result
> collapse_single_pmd(unsigned long addr,
> mmap_read_unlock(mm);
> *mmap_locked = false;
> result = collapse_scan_file(mm, addr, file, pgoff, cc);
> +
> + if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
> + mapping_can_writeback(file->f_mapping)) {
> + loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> + loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> +
> + filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> + }
> fput(file);
>
> if (result != SCAN_PTE_MAPPED_HUGEPAGE)
> @@ -2840,19 +2848,8 @@ int madvise_collapse(struct vm_area_struct
> *vma, unsigned long start,
> *lock_dropped = true;
>
> if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) {
> - struct file *file = get_file(vma->vm_file);
> - pgoff_t pgoff = linear_page_index(vma, addr);
> -
> - if (mapping_can_writeback(file->f_mapping)) {
> - loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> - loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> -
> -
> filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> - triggered_wb = true;
> - fput(file);
> - goto retry;
> - }
> - fput(file);
> + triggered_wb = true;
> + goto retry;
> }
>
> switch (result) {
>
>
>
> -- Nico
From a quick glance, that looks good to me ;)
Only madvise needs writeback and then retry once, and khugepaged just
skips dirty pages and moves on.
Now, we grab the file reference before dropping mmap_lock, then only
use the file pointer during writeback - no vma access after unlock.
So even if the VMA gets unmapped, we're safe, IIUC.
[...]
* Re: [PATCH mm-unstable v14 00/16] khugepaged: mTHP support
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (15 preceding siblings ...)
2026-01-22 19:28 ` [PATCH mm-unstable v14 16/16] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
@ 2026-01-26 11:21 ` Lorenzo Stoakes
2026-01-26 11:32 ` Lorenzo Stoakes
17 siblings, 0 replies; 39+ messages in thread
From: Lorenzo Stoakes @ 2026-01-26 11:21 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
baohua, lance.yang, vbabka, rppt, surenb, mhocko, corbet,
rostedt, mhiramat, mathieu.desnoyers, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
jannh, pfalcato, jackmanb, hannes, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang
One small point on this - I don't necessarily blame you for wrapping up some
other stuff in review with the rebase, BUT - it makes it difficult for
reviewers when it comes to picking up changes between v13 and v14.
You're going to have issues anyway given the flurry of THP patches we get every
cycle, but part of the review process often is to use git range-diff to check
what _actually_ changed between revisions.
And in this case, I had to resolve a whole bunch of merge conflicts just to get
v13 to a point where it _kind of_ represents what was there before on a common
base.
Obviously I'm not asking you to constantly rebase series :P but I'd say in
future it might be useful to separate out the rebase step from the respin:
when asked for a resend, just do a resend, THEN if the time is right for a
respin, do that separately.
This is really more applicable to larger series like this one, and it's all a
bit fuzzy, but in this case it definitely would have helped!
Thanks, Lorenzo
* Re: [PATCH mm-unstable v14 00/16] khugepaged: mTHP support
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
` (16 preceding siblings ...)
2026-01-26 11:21 ` [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Lorenzo Stoakes
@ 2026-01-26 11:32 ` Lorenzo Stoakes
2026-02-04 21:35 ` Nico Pache
17 siblings, 1 reply; 39+ messages in thread
From: Lorenzo Stoakes @ 2026-01-26 11:32 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
baohua, lance.yang, vbabka, rppt, surenb, mhocko, corbet,
rostedt, mhiramat, mathieu.desnoyers, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
jannh, pfalcato, jackmanb, hannes, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang
On Thu, Jan 22, 2026 at 12:28:25PM -0700, Nico Pache wrote:
> V14 Changes:
> - Added review tags
> - refactored is_mthp_order() to is_pmd_order(), utilized it in more places, and
> moved it to the first commit of the series
> - squashed fixup sent with v13
> - rebased and handled conflicts with new madvise_collapse writeback retry logic [3]
> - handled conflict with khugepaged cleanup series [4]
Hmm, no mention of the change to 3/16, unless it's folded into one of the above?
Very important to make reviewers aware of this stuff.
It's also worth separating out things at a fine-grained level, really
everything. More detail is good. See [0] for example - I practice what I preach
:)
Thanks, Lorenzo
[0]: https://lore.kernel.org/linux-mm/cover.1769198904.git.lorenzo.stoakes@oracle.com/
* Re: [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2026-01-23 5:07 ` Lance Yang
2026-01-23 9:31 ` Baolin Wang
2026-01-23 23:26 ` Nico Pache
@ 2026-01-26 11:40 ` Lorenzo Stoakes
2026-01-26 15:09 ` Andrew Morton
2 siblings, 1 reply; 39+ messages in thread
From: Lorenzo Stoakes @ 2026-01-26 11:40 UTC (permalink / raw)
To: Lance Yang
Cc: Nico Pache, akpm, david, ziy, baolin.wang, Liam.Howlett,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-trace-kernel, linux-doc, corbet, rostedt, mhiramat,
mathieu.desnoyers, linux-kernel, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, jannh,
pfalcato, jackmanb, hannes, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kas, aarcange, raquini, anshuman.khandual, catalin.marinas,
tiwai, will, dave.hansen, jack, cl, jglisse, zokeefe, rientjes,
rdunlap, hughd, richard.weiyang, David Hildenbrand, linux-mm
Andrew - when this goes into mm-new if there isn't a respin between, please
drop all tags except any obviously sent re: the fix-patch.
Thanks!
On Fri, Jan 23, 2026 at 01:07:16PM +0800, Lance Yang wrote:
>
>
> On 2026/1/23 03:28, Nico Pache wrote:
> > The khugepaged daemon and madvise_collapse have two different
> > implementations that do almost the same thing.
> >
> > Create collapse_single_pmd to increase code reuse and create an entry
> > point to these two users.
> >
> > Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> > collapse_single_pmd function. This introduces a minor behavioral change
> > that is most likely an undiscovered bug. The current implementation of
> > khugepaged tests collapse_test_exit_or_disable before calling
> > collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> > case. By unifying these two callers madvise_collapse now also performs
> > this check. We also modify the return value to be SCAN_ANY_PROCESS which
> > properly indicates that this process is no longer valid to operate on.
> >
> > We also guard the khugepaged_pages_collapsed variable to ensure its only
> > incremented for khugepaged.
> >
> > Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> > Reviewed-by: Lance Yang <lance.yang@linux.dev>
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Reviewed-by: Zi Yan <ziy@nvidia.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
>
> I think this patch introduces some functional changes compared to previous
> version[1] ...
>
> Maybe we should drop the r-b tags and let folks take another look?
Yes thanks Lance, absolutely this should happen.
Especially on a small-iteration respin (I really wanted to get to v13 but the
rebase issue killed that).
I know it wasn't intentional, not suggesting that of course :) just obviously as
a process thing - it's _very_ important to make clear what you've changed and
what you haven't. For truly minor changes no need to drop the tags, but often my
workflow is:
- Check which patches I haven't reviewed yet.
- Go review those.
So I might well have missed that.
I often try to do a git range-diff, but in this case I probably wouldn't have,
on the basis of v13 having merge conflicts.
But obviously given the above I went and fixed them up and applied v13 locally
so I could check everything :)
>
> There might be an issue with the vma access in madvise_collapse(). See
> below:
>
> [1]
> https://lore.kernel.org/linux-mm/20251201174627.23295-3-npache@redhat.com/
>
> > mm/khugepaged.c | 106 +++++++++++++++++++++++++++---------------------
> > 1 file changed, 60 insertions(+), 46 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index fefcbdca4510..59e5a5588d85 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2394,6 +2394,54 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm, unsigned long a
> > return result;
> > }
> > +/*
> > + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> > + * the results.
> > + */
> > +static enum scan_result collapse_single_pmd(unsigned long addr,
> > + struct vm_area_struct *vma, bool *mmap_locked,
> > + struct collapse_control *cc)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + enum scan_result result;
> > + struct file *file;
> > + pgoff_t pgoff;
> > +
> > + if (vma_is_anonymous(vma)) {
> > + result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> > + goto end;
> > + }
> > +
> > + file = get_file(vma->vm_file);
> > + pgoff = linear_page_index(vma, addr);
> > +
> > + mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > + result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > + fput(file);
> > +
> > + if (result != SCAN_PTE_MAPPED_HUGEPAGE)
> > + goto end;
> > +
> > + mmap_read_lock(mm);
> > + *mmap_locked = true;
> > + if (collapse_test_exit_or_disable(mm)) {
> > + mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > + return SCAN_ANY_PROCESS;
> > + }
> > + result = try_collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged);
> > + if (result == SCAN_PMD_MAPPED)
> > + result = SCAN_SUCCEED;
> > + mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > +
> > +end:
> > + if (cc->is_khugepaged && result == SCAN_SUCCEED)
> > + ++khugepaged_pages_collapsed;
> > + return result;
> > +}
> > +
> > static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *result,
> > struct collapse_control *cc)
> > __releases(&khugepaged_mm_lock)
> > @@ -2466,34 +2514,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
> > VM_BUG_ON(khugepaged_scan.address < hstart ||
> > khugepaged_scan.address + HPAGE_PMD_SIZE >
> > hend);
> > - if (!vma_is_anonymous(vma)) {
> > - struct file *file = get_file(vma->vm_file);
> > - pgoff_t pgoff = linear_page_index(vma,
> > - khugepaged_scan.address);
> > -
> > - mmap_read_unlock(mm);
> > - mmap_locked = false;
> > - *result = collapse_scan_file(mm,
> > - khugepaged_scan.address, file, pgoff, cc);
> > - fput(file);
> > - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > - mmap_read_lock(mm);
> > - if (collapse_test_exit_or_disable(mm))
> > - goto breakouterloop;
> > - *result = try_collapse_pte_mapped_thp(mm,
> > - khugepaged_scan.address, false);
> > - if (*result == SCAN_PMD_MAPPED)
> > - *result = SCAN_SUCCEED;
> > - mmap_read_unlock(mm);
> > - }
> > - } else {
> > - *result = collapse_scan_pmd(mm, vma,
> > - khugepaged_scan.address, &mmap_locked, cc);
> > - }
> > -
> > - if (*result == SCAN_SUCCEED)
> > - ++khugepaged_pages_collapsed;
> > + *result = collapse_single_pmd(khugepaged_scan.address,
> > + vma, &mmap_locked, cc);
> > /* move to next address */
> > khugepaged_scan.address += HPAGE_PMD_SIZE;
> > progress += HPAGE_PMD_NR;
> > @@ -2799,6 +2822,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > cond_resched();
> > mmap_read_lock(mm);
> > mmap_locked = true;
> > + *lock_dropped = true;
> > result = hugepage_vma_revalidate(mm, addr, false, &vma,
> > cc);
> > if (result != SCAN_SUCCEED) {
> > @@ -2809,17 +2833,17 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
> > }
> > mmap_assert_locked(mm);
> > - if (!vma_is_anonymous(vma)) {
> > - struct file *file = get_file(vma->vm_file);
> > - pgoff_t pgoff = linear_page_index(vma, addr);
> > - mmap_read_unlock(mm);
> > - mmap_locked = false;
> > + result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
> > +
> > + if (!mmap_locked)
> > *lock_dropped = true;
> > - result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > - if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> > - mapping_can_writeback(file->f_mapping)) {
> > + if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) {
> > + struct file *file = get_file(vma->vm_file);
> > + pgoff_t pgoff = linear_page_index(vma, addr);
>
>
> After collapse_single_pmd() returns, mmap_lock might have been released.
> Between
> that unlock and here, another thread could unmap/remap the VMA, making the
> vma
> pointer stale when we access vma->vm_file?
Yeah, yikes.
The locking logic around this code is horrifying... but that's one for a
future series, I guess.
>
> Would it be safer to get the file reference before calling
> collapse_single_pmd()?
> Or we need to revalidate the VMA after getting the lock back?
Also obviously the pgoff.
I know Nico suggested a patch in a response, will check.
Cheers, Lorenzo
* Re: [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2026-01-23 9:31 ` Baolin Wang
@ 2026-01-26 12:25 ` Lorenzo Stoakes
0 siblings, 0 replies; 39+ messages in thread
From: Lorenzo Stoakes @ 2026-01-26 12:25 UTC (permalink / raw)
To: Baolin Wang
Cc: Lance Yang, Nico Pache, akpm, david, ziy, Liam.Howlett,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-trace-kernel, linux-doc, corbet, rostedt, mhiramat,
mathieu.desnoyers, linux-kernel, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, jannh,
pfalcato, jackmanb, hannes, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kas, aarcange, raquini, anshuman.khandual, catalin.marinas,
tiwai, will, dave.hansen, jack, cl, jglisse, zokeefe, rientjes,
rdunlap, hughd, richard.weiyang, David Hildenbrand, linux-mm
On Fri, Jan 23, 2026 at 05:31:17PM +0800, Baolin Wang wrote:
>
>
> On 1/23/26 1:07 PM, Lance Yang wrote:
> >
> >
> > After collapse_single_pmd() returns, mmap_lock might have been released.
> > Between
> > that unlock and here, another thread could unmap/remap the VMA, making
> > the vma
> > pointer stale when we access vma->vm_file?
> >
> > Would it be safer to get the file reference before calling
> > collapse_single_pmd()?
> > Or we need to revalidate the VMA after getting the lock back?
> Good catch. I think we can move the filemap_write_and_wait_range() related
> logic into collapse_single_pmd(), after we get a file reference.
Good suggestion, is what Nico did in the suggested patch :) Agreed better there.
Thanks, Lorenzo
* Re: [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2026-01-23 23:26 ` Nico Pache
2026-01-24 4:41 ` Lance Yang
@ 2026-01-26 12:25 ` Lorenzo Stoakes
1 sibling, 0 replies; 39+ messages in thread
From: Lorenzo Stoakes @ 2026-01-26 12:25 UTC (permalink / raw)
To: Nico Pache
Cc: Lance Yang, Garg, Shivank, akpm, david, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, baohua, vbabka, rppt,
surenb, mhocko, linux-trace-kernel, linux-doc, corbet, rostedt,
mhiramat, mathieu.desnoyers, linux-kernel, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
jannh, pfalcato, jackmanb, hannes, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang, David Hildenbrand, linux-mm
On Fri, Jan 23, 2026 at 04:26:09PM -0700, Nico Pache wrote:
> On Thu, Jan 22, 2026 at 10:08 PM Lance Yang <lance.yang@linux.dev> wrote:
> >
> >
> >
> > On 2026/1/23 03:28, Nico Pache wrote:
> > > The khugepaged daemon and madvise_collapse have two different
> > > implementations that do almost the same thing.
> > >
> > > Create collapse_single_pmd to increase code reuse and create an entry
> > > point to these two users.
> > >
> > > Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> > > collapse_single_pmd function. This introduces a minor behavioral change
> > > that is most likely an undiscovered bug. The current implementation of
> > > khugepaged tests collapse_test_exit_or_disable before calling
> > > collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> > > case. By unifying these two callers madvise_collapse now also performs
> > > this check. We also modify the return value to be SCAN_ANY_PROCESS which
> > > properly indicates that this process is no longer valid to operate on.
> > >
> > > We also guard the khugepaged_pages_collapsed variable to ensure its only
> > > incremented for khugepaged.
> > >
> > > Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> > > Reviewed-by: Lance Yang <lance.yang@linux.dev>
> > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Reviewed-by: Zi Yan <ziy@nvidia.com>
> > > Acked-by: David Hildenbrand <david@redhat.com>
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > ---
> >
> > I think this patch introduces some functional changes compared to previous
> > version[1] ...
> >
> > Maybe we should drop the r-b tags and let folks take another look?
> >
> > There might be an issue with the vma access in madvise_collapse(). See
> > below:
> >
> > [1]
> > https://lore.kernel.org/linux-mm/20251201174627.23295-3-npache@redhat.com/
> >
> > > mm/khugepaged.c | 106 +++++++++++++++++++++++++++---------------------
> > > 1 file changed, 60 insertions(+), 46 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index fefcbdca4510..59e5a5588d85 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -2394,6 +2394,54 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm, unsigned long a
> > > return result;
> > > }
> > >
> > > +/*
> > > + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> > > + * the results.
> > > + */
> > > +static enum scan_result collapse_single_pmd(unsigned long addr,
> > > + struct vm_area_struct *vma, bool *mmap_locked,
> > > + struct collapse_control *cc)
> > > +{
> > > + struct mm_struct *mm = vma->vm_mm;
> > > + enum scan_result result;
> > > + struct file *file;
> > > + pgoff_t pgoff;
> > > +
> > > + if (vma_is_anonymous(vma)) {
> > > + result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> > > + goto end;
> > > + }
> > > +
> > > + file = get_file(vma->vm_file);
> > > + pgoff = linear_page_index(vma, addr);
> > > +
> > > + mmap_read_unlock(mm);
> > > + *mmap_locked = false;
> > > + result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > > + fput(file);
> > > +
> > > + if (result != SCAN_PTE_MAPPED_HUGEPAGE)
> > > + goto end;
> > > +
> > > + mmap_read_lock(mm);
> > > + *mmap_locked = true;
> > > + if (collapse_test_exit_or_disable(mm)) {
> > > + mmap_read_unlock(mm);
> > > + *mmap_locked = false;
> > > + return SCAN_ANY_PROCESS;
> > > + }
> > > + result = try_collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged);
> > > + if (result == SCAN_PMD_MAPPED)
> > > + result = SCAN_SUCCEED;
> > > + mmap_read_unlock(mm);
> > > + *mmap_locked = false;
> > > +
> > > +end:
> > > + if (cc->is_khugepaged && result == SCAN_SUCCEED)
> > > + ++khugepaged_pages_collapsed;
> > > + return result;
> > > +}
> > > +
> > > static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *result,
> > > struct collapse_control *cc)
> > > __releases(&khugepaged_mm_lock)
> > > @@ -2466,34 +2514,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
> > > VM_BUG_ON(khugepaged_scan.address < hstart ||
> > > khugepaged_scan.address + HPAGE_PMD_SIZE >
> > > hend);
> > > - if (!vma_is_anonymous(vma)) {
> > > - struct file *file = get_file(vma->vm_file);
> > > - pgoff_t pgoff = linear_page_index(vma,
> > > - khugepaged_scan.address);
> > > -
> > > - mmap_read_unlock(mm);
> > > - mmap_locked = false;
> > > - *result = collapse_scan_file(mm,
> > > - khugepaged_scan.address, file, pgoff, cc);
> > > - fput(file);
> > > - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> > > - mmap_read_lock(mm);
> > > - if (collapse_test_exit_or_disable(mm))
> > > - goto breakouterloop;
> > > - *result = try_collapse_pte_mapped_thp(mm,
> > > - khugepaged_scan.address, false);
> > > - if (*result == SCAN_PMD_MAPPED)
> > > - *result = SCAN_SUCCEED;
> > > - mmap_read_unlock(mm);
> > > - }
> > > - } else {
> > > - *result = collapse_scan_pmd(mm, vma,
> > > - khugepaged_scan.address, &mmap_locked, cc);
> > > - }
> > > -
> > > - if (*result == SCAN_SUCCEED)
> > > - ++khugepaged_pages_collapsed;
> > >
> > > + *result = collapse_single_pmd(khugepaged_scan.address,
> > > + vma, &mmap_locked, cc);
> > > /* move to next address */
> > > khugepaged_scan.address += HPAGE_PMD_SIZE;
> > > progress += HPAGE_PMD_NR;
> > > @@ -2799,6 +2822,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > > cond_resched();
> > > mmap_read_lock(mm);
> > > mmap_locked = true;
> > > + *lock_dropped = true;
> > > result = hugepage_vma_revalidate(mm, addr, false, &vma,
> > > cc);
> > > if (result != SCAN_SUCCEED) {
> > > @@ -2809,17 +2833,17 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > > hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
> > > }
> > > mmap_assert_locked(mm);
> > > - if (!vma_is_anonymous(vma)) {
> > > - struct file *file = get_file(vma->vm_file);
> > > - pgoff_t pgoff = linear_page_index(vma, addr);
> > >
> > > - mmap_read_unlock(mm);
> > > - mmap_locked = false;
> > > + result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
> > > +
> > > + if (!mmap_locked)
> > > *lock_dropped = true;
> > > - result = collapse_scan_file(mm, addr, file, pgoff, cc);
> > >
> > > - if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> > > - mapping_can_writeback(file->f_mapping)) {
> > > + if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) {
> > > + struct file *file = get_file(vma->vm_file);
> > > + pgoff_t pgoff = linear_page_index(vma, addr);
> >
> >
> > After collapse_single_pmd() returns, mmap_lock might have been released.
> > Between
> > that unlock and here, another thread could unmap/remap the VMA, making
> > the vma
> > pointer stale when we access vma->vm_file?
>
> + Shivank, I thought they were on the CC list.
>
> Hey! I thought of this case, but then figured it was no different than
> what is currently implemented for the writeback-retry logic, since the
> mmap lock is dropped and not revalidated. BUT I failed to consider
> that the file reference is held throughout that time.
You obviously can't manipulate or reference a pointer to a VMA in any way
if it is no longer stabilised; that'd be a potential UAF.
>
> I thought of moving the functionality into collapse_single_pmd(), but
> figured I'd keep it in madvise_collapse() as it's the sole user of
> that functionality. Given the potential file ref issue, that may be
> the best solution, and I dont think it should be too difficult. I'll
> queue that up, and also drop the r-b tags as you suggested.
>
> Ok, here's my solution, does this look like the right approach?:
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 59e5a5588d85..dda9fdc35767 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2418,6 +2418,14 @@ static enum scan_result
> collapse_single_pmd(unsigned long addr,
> mmap_read_unlock(mm);
> *mmap_locked = false;
> result = collapse_scan_file(mm, addr, file, pgoff, cc);
> +
> + if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
> + mapping_can_writeback(file->f_mapping)) {
> + loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> + loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
Nit, but let's const-ify these.
Also, credit to Baolin for having suggested the approach of putting this
here! :)
> +
> + filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> + }
> fput(file);
>
> if (result != SCAN_PTE_MAPPED_HUGEPAGE)
> @@ -2840,19 +2848,8 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> *lock_dropped = true;
>
> if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) {
> - struct file *file = get_file(vma->vm_file);
> - pgoff_t pgoff = linear_page_index(vma, addr);
> -
> - if (mapping_can_writeback(file->f_mapping)) {
> - loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> - loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> -
> - filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> - triggered_wb = true;
> - fput(file);
> - goto retry;
> - }
> - fput(file);
> + triggered_wb = true;
> + goto retry;
OK, this looks correct; I agree with Lance.
Could you send this in reply to the parent, i.e. [0], as a fix-patch and
ask Andrew to apply it?
We can then review it there.
[0]:https://lore.kernel.org/all/20260122192841.128719-4-npache@redhat.com/
> }
>
> switch (result) {
>
>
>
> -- Nico
>
Cheers, Lorenzo
* Re: [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2026-01-26 11:40 ` Lorenzo Stoakes
@ 2026-01-26 15:09 ` Andrew Morton
2026-01-26 15:18 ` Lorenzo Stoakes
0 siblings, 1 reply; 39+ messages in thread
From: Andrew Morton @ 2026-01-26 15:09 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Lance Yang, Nico Pache, david, ziy, baolin.wang, Liam.Howlett,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-trace-kernel, linux-doc, corbet, rostedt, mhiramat,
mathieu.desnoyers, linux-kernel, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, jannh,
pfalcato, jackmanb, hannes, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kas, aarcange, raquini, anshuman.khandual, catalin.marinas,
tiwai, will, dave.hansen, jack, cl, jglisse, zokeefe, rientjes,
rdunlap, hughd, richard.weiyang, David Hildenbrand, linux-mm
On Mon, 26 Jan 2026 11:40:21 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> Andrew - when this goes into mm-new if there isn't a respin between, please
> drop all tags except any obviously sent re: the fix-patch.
>
I've been believing this is next -rc1 material. Was that mistaken?
* Re: [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2026-01-26 15:09 ` Andrew Morton
@ 2026-01-26 15:18 ` Lorenzo Stoakes
0 siblings, 0 replies; 39+ messages in thread
From: Lorenzo Stoakes @ 2026-01-26 15:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Lance Yang, Nico Pache, david, ziy, baolin.wang, Liam.Howlett,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-trace-kernel, linux-doc, corbet, rostedt, mhiramat,
mathieu.desnoyers, linux-kernel, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, gourry, ying.huang, apopple, jannh,
pfalcato, jackmanb, hannes, willy, peterx, wangkefeng.wang,
usamaarif642, sunnanyong, vishal.moola, thomas.hellstrom, yang,
kas, aarcange, raquini, anshuman.khandual, catalin.marinas,
tiwai, will, dave.hansen, jack, cl, jglisse, zokeefe, rientjes,
rdunlap, hughd, richard.weiyang, David Hildenbrand, linux-mm
On Mon, Jan 26, 2026 at 07:09:18AM -0800, Andrew Morton wrote:
> On Mon, 26 Jan 2026 11:40:21 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > Andrew - when this goes into mm-new if there isn't a respin between, please
> > drop all tags except any obviously sent re: the fix-patch.
> >
>
> I've been believing this is next -rc1 material. Was that mistaken?
Yeah, this isn't ready yet, sorry. I did hope we could get this in this cycle but
there's too much to check (esp. given this change, for instance) and we need more
time to stabilise it, so please keep this out of mm-(un)stable for now.
It's a really huge change to THP so we need to take our time with it.
So we're aiming for 6.21-rc1 / 7.1-rc1 or whatever it ends up being called :)
i.e. next+1-rc1.
Cheers, Lorenzo
* Re: [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2026-01-22 19:28 ` [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
2026-01-23 5:07 ` Lance Yang
@ 2026-01-28 16:38 ` Nico Pache
2026-02-03 11:43 ` Lorenzo Stoakes
2026-02-03 11:35 ` Lorenzo Stoakes
2 siblings, 1 reply; 39+ messages in thread
From: Nico Pache @ 2026-01-28 16:38 UTC (permalink / raw)
To: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm
Cc: david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
ryan.roberts, dev.jain, baohua, lance.yang, vbabka, rppt, surenb,
mhocko, corbet, rostedt, mhiramat, mathieu.desnoyers,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, jannh, pfalcato, jackmanb, hannes, willy,
peterx, wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang, David Hildenbrand, shivankg
Hi Andrew,
could you please apply the following fixup to avoid potentially using a stale
VMA in the new writeback-retry logic for madvise collapse.
Thank you!
-- Nico
----8<----
commit a9ac3b1bfa926dd707ac3a785583f8d7a0579578
Author: Nico Pache <npache@redhat.com>
Date: Fri Jan 23 16:32:42 2026 -0700
madvise writeback retry logic fix
Signed-off-by: Nico Pache <npache@redhat.com>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 59e5a5588d85..2b054f7d9753 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2418,6 +2418,14 @@ static enum scan_result collapse_single_pmd(unsigned long addr,
mmap_read_unlock(mm);
*mmap_locked = false;
result = collapse_scan_file(mm, addr, file, pgoff, cc);
+
+ if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
+ mapping_can_writeback(file->f_mapping)) {
+ const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
+ const loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
+
+ filemap_write_and_wait_range(file->f_mapping, lstart, lend);
+ }
fput(file);
if (result != SCAN_PTE_MAPPED_HUGEPAGE)
@@ -2840,19 +2848,8 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
*lock_dropped = true;
if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) {
- struct file *file = get_file(vma->vm_file);
- pgoff_t pgoff = linear_page_index(vma, addr);
-
- if (mapping_can_writeback(file->f_mapping)) {
- loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
- loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
-
- filemap_write_and_wait_range(file->f_mapping, lstart, lend);
- triggered_wb = true;
- fput(file);
- goto retry;
- }
- fput(file);
+ triggered_wb = true;
+ goto retry;
}
switch (result) {
--
2.52.0
On 1/22/26 12:28 PM, Nico Pache wrote:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the same thing.
>
> Create collapse_single_pmd to increase code reuse and create an entry
> point to these two users.
>
> Refactor madvise_collapse and collapse_scan_mm_slot to use the new
> collapse_single_pmd function. This introduces a minor behavioral change
> that most likely fixes an undiscovered bug. The current implementation of
> khugepaged tests collapse_test_exit_or_disable before calling
> collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse
> case. By unifying these two callers madvise_collapse now also performs
> this check. We also modify the return value to be SCAN_ANY_PROCESS which
> properly indicates that this process is no longer valid to operate on.
>
> We also guard the khugepaged_pages_collapsed variable to ensure it's only
> incremented for khugepaged.
>
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 106 +++++++++++++++++++++++++++---------------------
> 1 file changed, 60 insertions(+), 46 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index fefcbdca4510..59e5a5588d85 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2394,6 +2394,54 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm, unsigned long a
> return result;
> }
>
> +/*
> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> + * the results.
> + */
> +static enum scan_result collapse_single_pmd(unsigned long addr,
> + struct vm_area_struct *vma, bool *mmap_locked,
> + struct collapse_control *cc)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + enum scan_result result;
> + struct file *file;
> + pgoff_t pgoff;
> +
> + if (vma_is_anonymous(vma)) {
> + result = collapse_scan_pmd(mm, vma, addr, mmap_locked, cc);
> + goto end;
> + }
> +
> + file = get_file(vma->vm_file);
> + pgoff = linear_page_index(vma, addr);
> +
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> + result = collapse_scan_file(mm, addr, file, pgoff, cc);
> + fput(file);
> +
> + if (result != SCAN_PTE_MAPPED_HUGEPAGE)
> + goto end;
> +
> + mmap_read_lock(mm);
> + *mmap_locked = true;
> + if (collapse_test_exit_or_disable(mm)) {
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> + return SCAN_ANY_PROCESS;
> + }
> + result = try_collapse_pte_mapped_thp(mm, addr, !cc->is_khugepaged);
> + if (result == SCAN_PMD_MAPPED)
> + result = SCAN_SUCCEED;
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> +
> +end:
> + if (cc->is_khugepaged && result == SCAN_SUCCEED)
> + ++khugepaged_pages_collapsed;
> + return result;
> +}
> +
> static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *result,
> struct collapse_control *cc)
> __releases(&khugepaged_mm_lock)
> @@ -2466,34 +2514,9 @@ static unsigned int collapse_scan_mm_slot(unsigned int pages, enum scan_result *
> VM_BUG_ON(khugepaged_scan.address < hstart ||
> khugepaged_scan.address + HPAGE_PMD_SIZE >
> hend);
> - if (!vma_is_anonymous(vma)) {
> - struct file *file = get_file(vma->vm_file);
> - pgoff_t pgoff = linear_page_index(vma,
> - khugepaged_scan.address);
> -
> - mmap_read_unlock(mm);
> - mmap_locked = false;
> - *result = collapse_scan_file(mm,
> - khugepaged_scan.address, file, pgoff, cc);
> - fput(file);
> - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> - mmap_read_lock(mm);
> - if (collapse_test_exit_or_disable(mm))
> - goto breakouterloop;
> - *result = try_collapse_pte_mapped_thp(mm,
> - khugepaged_scan.address, false);
> - if (*result == SCAN_PMD_MAPPED)
> - *result = SCAN_SUCCEED;
> - mmap_read_unlock(mm);
> - }
> - } else {
> - *result = collapse_scan_pmd(mm, vma,
> - khugepaged_scan.address, &mmap_locked, cc);
> - }
> -
> - if (*result == SCAN_SUCCEED)
> - ++khugepaged_pages_collapsed;
>
> + *result = collapse_single_pmd(khugepaged_scan.address,
> + vma, &mmap_locked, cc);
> /* move to next address */
> khugepaged_scan.address += HPAGE_PMD_SIZE;
> progress += HPAGE_PMD_NR;
> @@ -2799,6 +2822,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> cond_resched();
> mmap_read_lock(mm);
> mmap_locked = true;
> + *lock_dropped = true;
> result = hugepage_vma_revalidate(mm, addr, false, &vma,
> cc);
> if (result != SCAN_SUCCEED) {
> @@ -2809,17 +2833,17 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
> }
> mmap_assert_locked(mm);
> - if (!vma_is_anonymous(vma)) {
> - struct file *file = get_file(vma->vm_file);
> - pgoff_t pgoff = linear_page_index(vma, addr);
>
> - mmap_read_unlock(mm);
> - mmap_locked = false;
> + result = collapse_single_pmd(addr, vma, &mmap_locked, cc);
> +
> + if (!mmap_locked)
> *lock_dropped = true;
> - result = collapse_scan_file(mm, addr, file, pgoff, cc);
>
> - if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> - mapping_can_writeback(file->f_mapping)) {
> + if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) {
> + struct file *file = get_file(vma->vm_file);
> + pgoff_t pgoff = linear_page_index(vma, addr);
> +
> + if (mapping_can_writeback(file->f_mapping)) {
> loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
>
> @@ -2829,26 +2853,16 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> goto retry;
> }
> fput(file);
> - } else {
> - result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
> }
> - if (!mmap_locked)
> - *lock_dropped = true;
>
> -handle_result:
> switch (result) {
> case SCAN_SUCCEED:
> case SCAN_PMD_MAPPED:
> ++thps;
> break;
> - case SCAN_PTE_MAPPED_HUGEPAGE:
> - BUG_ON(mmap_locked);
> - mmap_read_lock(mm);
> - result = try_collapse_pte_mapped_thp(mm, addr, true);
> - mmap_read_unlock(mm);
> - goto handle_result;
> /* Whitelisted set of results where continuing OK */
> case SCAN_NO_PTE_TABLE:
> + case SCAN_PTE_MAPPED_HUGEPAGE:
> case SCAN_PTE_NON_PRESENT:
> case SCAN_PTE_UFFD_WP:
> case SCAN_LACK_REFERENCED_PAGE:
* Re: [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2026-01-22 19:28 ` [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
2026-01-23 5:07 ` Lance Yang
2026-01-28 16:38 ` Nico Pache
@ 2026-02-03 11:35 ` Lorenzo Stoakes
2 siblings, 0 replies; 39+ messages in thread
From: Lorenzo Stoakes @ 2026-02-03 11:35 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
baohua, lance.yang, vbabka, rppt, surenb, mhocko, corbet,
rostedt, mhiramat, mathieu.desnoyers, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
jannh, pfalcato, jackmanb, hannes, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang, David Hildenbrand
Another trivial point on this one, you're missing the prefix in the subject
here. In general I think better to say mm/khugepaged: rather than
khugepaged: also.
Cheers, Lorenzo
* Re: [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse
2026-01-28 16:38 ` Nico Pache
@ 2026-02-03 11:43 ` Lorenzo Stoakes
0 siblings, 0 replies; 39+ messages in thread
From: Lorenzo Stoakes @ 2026-02-03 11:43 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
baohua, lance.yang, vbabka, rppt, surenb, mhocko, corbet,
rostedt, mhiramat, mathieu.desnoyers, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
jannh, pfalcato, jackmanb, hannes, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang, David Hildenbrand, shivankg
On Wed, Jan 28, 2026 at 09:38:37AM -0700, Nico Pache wrote:
> Hi Andrew,
>
> could you please apply the following fixup to avoid potentially using a stale
> VMA in the new writeback-retry logic for madvise collapse.
>
> Thank you!
> -- Nico
>
> ----8<----
> commit a9ac3b1bfa926dd707ac3a785583f8d7a0579578
> Author: Nico Pache <npache@redhat.com>
> Date: Fri Jan 23 16:32:42 2026 -0700
>
> madvise writeback retry logic fix
>
> Signed-off-by: Nico Pache <npache@redhat.com>
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 59e5a5588d85..2b054f7d9753 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2418,6 +2418,14 @@ static enum scan_result collapse_single_pmd(unsigned long
> addr,
> mmap_read_unlock(mm);
> *mmap_locked = false;
> result = collapse_scan_file(mm, addr, file, pgoff, cc);
> +
> + if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
> + mapping_can_writeback(file->f_mapping)) {
> + const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> + const loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> +
> + filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> + }
> fput(file);
>
> if (result != SCAN_PTE_MAPPED_HUGEPAGE)
> @@ -2840,19 +2848,8 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned
> long start,
> *lock_dropped = true;
>
> if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb) {
> - struct file *file = get_file(vma->vm_file);
> - pgoff_t pgoff = linear_page_index(vma, addr);
> -
> - if (mapping_can_writeback(file->f_mapping)) {
> - loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> - loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> -
> - filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> - triggered_wb = true;
> - fput(file);
> - goto retry;
> - }
> - fput(file);
> + triggered_wb = true;
> + goto retry;
> }
LGTM, with this in place you can add back my tag to the patch.
It'd be good to reference this in the commit message, but you can do that
on a respin.
So:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
* Re: [PATCH mm-unstable v14 07/16] khugepaged: introduce collapse_max_ptes_none helper function
2026-01-22 19:28 ` [PATCH mm-unstable v14 07/16] khugepaged: introduce collapse_max_ptes_none helper function Nico Pache
@ 2026-02-03 12:08 ` Lorenzo Stoakes
2026-02-04 21:39 ` Nico Pache
2026-02-06 17:44 ` Nico Pache
0 siblings, 2 replies; 39+ messages in thread
From: Lorenzo Stoakes @ 2026-02-03 12:08 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
baohua, lance.yang, vbabka, rppt, surenb, mhocko, corbet,
rostedt, mhiramat, mathieu.desnoyers, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
jannh, pfalcato, jackmanb, hannes, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang
On Thu, Jan 22, 2026 at 12:28:32PM -0700, Nico Pache wrote:
> The current mechanism for determining mTHP collapse scales the
> khugepaged_max_ptes_none value based on the target order. This
> introduces an undesirable feedback loop, or "creep", when max_ptes_none
> is set to a value greater than HPAGE_PMD_NR / 2.
>
> With this configuration, a successful collapse to order N will populate
> enough pages to satisfy the collapse condition on order N+1 on the next
> scan. This leads to unnecessary work and memory churn.
>
> To fix this issue introduce a helper function that will limit mTHP
> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> This effectively supports two modes:
>
> - max_ptes_none=0: never introduce new none-pages for mTHP collapse.
> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
> available mTHP order.
>
> > This removes the possibility of "creep", while not modifying any uAPI
> expectations. A warning will be emitted if any non-supported
> max_ptes_none value is configured with mTHP enabled.
>
> The limits can be ignored by passing full_scan=true, this is useful for
> madvise_collapse (which ignores limits), or in the case of
> collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
> collapse is available.
Thanks, great commit msg!
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
This LGTM in terms of logic, some nits below, with those addressed feel
free to add:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cheers, Lorenzo
> ---
> mm/khugepaged.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 0f68902edd9a..9b7e05827749 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -460,6 +460,44 @@ void __khugepaged_enter(struct mm_struct *mm)
> wake_up_interruptible(&khugepaged_wait);
> }
>
> +/**
> + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
> + * @order: The folio order being collapsed to
> + * @full_scan: Whether this is a full scan (ignore limits)
> + *
> + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
> + * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
> + *
> + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
> + * khugepaged_max_ptes_none value.
> + *
> + * For mTHP collapses, we currently only support khugepaged_max_pte_none values
> + * of 0 or (HPAGE_PMD_NR - 1). Any other value will emit a warning and no mTHP
> + * collapse will be attempted
> + *
> + * Return: Maximum number of empty PTEs allowed for the collapse operation
> + */
> +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> +{
> + /* ignore max_ptes_none limits */
> + if (full_scan)
> + return HPAGE_PMD_NR - 1;
I wonder if, given we are effectively doing:
const unsigned int nr_pages = collapse_max_ptes_none(order, /*full_scan=*/true);
...
foo(nr_pages);
In places where we ignore limits, whether we would be better off putting
HPAGE_PMD_NR - 1 into a define and just using that in these cases, like:
#define COLLAPSE_MAX_PTES_LIM (HPAGE_PMD_NR - 1)
Then instead doing:
foo(COLLAPSE_MAX_PTES_LIM);
?
Seems somewhat silly to pass in a boolean that makes it return a set value in
cases where you know that should be the case at the outset.
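Roughly, something like this (untested, just to sketch the idea - the callers
that ignore limits would then pass COLLAPSE_MAX_PTES_LIM directly and the
bool goes away):

#define COLLAPSE_MAX_PTES_LIM (HPAGE_PMD_NR - 1)

static unsigned int collapse_max_ptes_none(unsigned int order)
{
	if (is_pmd_order(order))
		return khugepaged_max_ptes_none;

	/* Zero/non-present collapse disabled. */
	if (!khugepaged_max_ptes_none)
		return 0;

	if (khugepaged_max_ptes_none == COLLAPSE_MAX_PTES_LIM)
		return (1 << order) - 1;

	/* Caller treats -EINVAL as "don't attempt mTHP collapse". */
	pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
		     COLLAPSE_MAX_PTES_LIM);
	return -EINVAL;
}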
> +
> + if (is_pmd_order(order))
> + return khugepaged_max_ptes_none;
> +
> + /* Zero/non-present collapse disabled. */
> + if (!khugepaged_max_ptes_none)
> + return 0;
> +
> + if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
Having a define for HPAGE_PMD_NR - 1 would also be handy here...
> + return (1 << order) - 1;
> +
> + pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %d\n",
> + HPAGE_PMD_NR - 1);
...and here.
Also a MICRO nit here - the function returns unsigned int and thus we
express PTEs in this unit, so maybe use %u rather than %d?
> + return -EINVAL;
> +}
Logic of this function looks correct though!
> +
> void khugepaged_enter_vma(struct vm_area_struct *vma,
> vm_flags_t vm_flags)
> {
> @@ -548,7 +586,10 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> int none_or_zero = 0, shared = 0, referenced = 0;
> enum scan_result result = SCAN_FAIL;
> const unsigned long nr_pages = 1UL << order;
> - int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> + int max_ptes_none = collapse_max_ptes_none(order, !cc->is_khugepaged);
Yeah, the !cc->is_khugepaged is a bit gross here, so as per the above, maybe do:
int max_ptes_none;
if (cc->is_khugepaged)
max_ptes_none = collapse_max_ptes_none(order);
else /* MADV_COLLAPSE is not limited. */
max_ptes_none = COLLAPSE_MAX_PTES_LIM;
> +
> + if (max_ptes_none == -EINVAL)
> + return result;
>
> for (_pte = pte; _pte < pte + nr_pages;
> _pte++, addr += PAGE_SIZE) {
> --
> 2.52.0
>
* Re: [PATCH mm-unstable v14 08/16] khugepaged: generalize collapse_huge_page for mTHP collapse
2026-01-22 19:28 ` [PATCH mm-unstable v14 08/16] khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
@ 2026-02-03 13:07 ` Lorenzo Stoakes
2026-02-04 22:00 ` Nico Pache
0 siblings, 1 reply; 39+ messages in thread
From: Lorenzo Stoakes @ 2026-02-03 13:07 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
baohua, lance.yang, vbabka, rppt, surenb, mhocko, corbet,
rostedt, mhiramat, mathieu.desnoyers, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
jannh, pfalcato, jackmanb, hannes, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang
On Thu, Jan 22, 2026 at 12:28:33PM -0700, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates where in the PMD to
> start the collapse attempt.
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> changes to the VMA from occurring.
>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 111 +++++++++++++++++++++++++++++++-----------------
> 1 file changed, 71 insertions(+), 40 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9b7e05827749..76cb17243793 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1151,44 +1151,54 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> return SCAN_SUCCEED;
> }
>
> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> - int referenced, int unmapped, struct collapse_control *cc)
> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> + int referenced, int unmapped, struct collapse_control *cc,
> + bool *mmap_locked, unsigned int order)
> {
> LIST_HEAD(compound_pagelist);
> pmd_t *pmd, _pmd;
> - pte_t *pte;
> + pte_t *pte = NULL;
> pgtable_t pgtable;
> struct folio *folio;
> spinlock_t *pmd_ptl, *pte_ptl;
> enum scan_result result = SCAN_FAIL;
> struct vm_area_struct *vma;
> struct mmu_notifier_range range;
> + bool anon_vma_locked = false;
> + const unsigned long nr_pages = 1UL << order;
> + const unsigned long pmd_address = start_addr & HPAGE_PMD_MASK;
>
> - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> + VM_WARN_ON_ONCE(pmd_address & ~HPAGE_PMD_MASK);
>
> /*
> * Before allocating the hugepage, release the mmap_lock read lock.
> * The allocation can take potentially a long time if it involves
> * sync compaction, and we do not need to hold the mmap_lock during
> * that. We will recheck the vma after taking it again in write mode.
> + * If collapsing mTHPs we may have already released the read_lock.
> */
> - mmap_read_unlock(mm);
> + if (*mmap_locked) {
> + mmap_read_unlock(mm);
> + *mmap_locked = false;
> + }
>
> - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> + result = alloc_charge_folio(&folio, mm, cc, order);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
>
> mmap_read_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> - HPAGE_PMD_ORDER);
> + *mmap_locked = true;
> + result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
Why would we use the PMD address here rather than the actual start address?
Also please add /*expect_anon=*/ before the 'true' because it's hard to
understand what that references.
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> + *mmap_locked = false;
> goto out_nolock;
> }
>
> - result = find_pmd_or_thp_or_none(mm, address, &pmd);
> + result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd);
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> + *mmap_locked = false;
> goto out_nolock;
> }
>
> @@ -1198,13 +1208,16 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * released when it fails. So we jump out_nolock directly in
> * that case. Continuing to collapse causes inconsistency.
> */
> - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> - referenced, HPAGE_PMD_ORDER);
> - if (result != SCAN_SUCCEED)
> + result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> + referenced, order);
> + if (result != SCAN_SUCCEED) {
> + *mmap_locked = false;
> goto out_nolock;
> + }
> }
>
> mmap_read_unlock(mm);
> + *mmap_locked = false;
> /*
> * Prevent all access to pagetables with the exception of
> * gup_fast later handled by the ptep_clear_flush and the VM
> @@ -1214,20 +1227,20 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * mmap_lock.
> */
> mmap_write_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> - HPAGE_PMD_ORDER);
> + result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> vma_start_write(vma);
> - result = check_pmd_still_valid(mm, address, pmd);
> + result = check_pmd_still_valid(mm, pmd_address, pmd);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
>
> anon_vma_lock_write(vma->anon_vma);
> + anon_vma_locked = true;
>
> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> - address + HPAGE_PMD_SIZE);
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> + start_addr + (PAGE_SIZE << order));
> mmu_notifier_invalidate_range_start(&range);
>
> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> @@ -1239,24 +1252,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * Parallel GUP-fast is fine since GUP-fast will back off when
> * it detects PMD is changed.
> */
> - _pmd = pmdp_collapse_flush(vma, address, pmd);
> + _pmd = pmdp_collapse_flush(vma, pmd_address, pmd);
> spin_unlock(pmd_ptl);
> mmu_notifier_invalidate_range_end(&range);
> tlb_remove_table_sync_one();
>
> - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> + pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> if (pte) {
> - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> - HPAGE_PMD_ORDER,
> - &compound_pagelist);
> + result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> + order, &compound_pagelist);
> spin_unlock(pte_ptl);
> } else {
> result = SCAN_NO_PTE_TABLE;
> }
>
> if (unlikely(result != SCAN_SUCCEED)) {
> - if (pte)
> - pte_unmap(pte);
> spin_lock(pmd_ptl);
> BUG_ON(!pmd_none(*pmd));
> /*
> @@ -1266,21 +1276,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> */
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> spin_unlock(pmd_ptl);
> - anon_vma_unlock_write(vma->anon_vma);
> goto out_up_write;
> }
>
> /*
> - * All pages are isolated and locked so anon_vma rmap
> - * can't run anymore.
> + * For PMD collapse all pages are isolated and locked so anon_vma
> + * rmap can't run anymore. For mTHP collapse we must hold the lock
> */
> - anon_vma_unlock_write(vma->anon_vma);
> + if (is_pmd_order(order)) {
> + anon_vma_unlock_write(vma->anon_vma);
> + anon_vma_locked = false;
> + }
>
> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> - vma, address, pte_ptl,
> - HPAGE_PMD_ORDER,
> - &compound_pagelist);
> - pte_unmap(pte);
> + vma, start_addr, pte_ptl,
> + order, &compound_pagelist);
> if (unlikely(result != SCAN_SUCCEED))
> goto out_up_write;
>
> @@ -1290,20 +1300,42 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * write.
> */
> __folio_mark_uptodate(folio);
> - pgtable = pmd_pgtable(_pmd);
> + if (is_pmd_order(order)) { /* PMD collapse */
> + pgtable = pmd_pgtable(_pmd);
>
> - spin_lock(pmd_ptl);
> - BUG_ON(!pmd_none(*pmd));
> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> - map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> + spin_lock(pmd_ptl);
> + WARN_ON_ONCE(!pmd_none(*pmd));
> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_address);
> + } else { /* mTHP collapse */
> + pte_t mthp_pte = mk_pte(folio_page(folio, 0), vma->vm_page_prot);
> +
> + mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> + spin_lock(pmd_ptl);
> + WARN_ON_ONCE(!pmd_none(*pmd));
> + folio_ref_add(folio, nr_pages - 1);
> + folio_add_new_anon_rmap(folio, vma, start_addr, RMAP_EXCLUSIVE);
> + folio_add_lru_vma(folio, vma);
> + set_ptes(vma->vm_mm, start_addr, pte, mthp_pte, nr_pages);
> + update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
> +
> + smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
I seriously hate this being open-coded, can we separate it out into another
function?
> + }
> spin_unlock(pmd_ptl);
>
> folio = NULL;
>
> result = SCAN_SUCCEED;
> out_up_write:
> + if (anon_vma_locked)
> + anon_vma_unlock_write(vma->anon_vma);
Thanks it's much better tracking this specifically.
The whole damn thing needs refactoring (by this I mean - khugepaged and really
THP in general to be clear :) but it's not your fault.
Could I ask though whether you might help out with some cleanups after this
lands :)
I feel like we all need to do our bit to pay down some technical debt!
> + if (pte)
> + pte_unmap(pte);
> mmap_write_unlock(mm);
> + *mmap_locked = false;
> out_nolock:
> + WARN_ON_ONCE(*mmap_locked);
> if (folio)
> folio_put(folio);
> trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> @@ -1471,9 +1503,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> pte_unmap_unlock(pte, ptl);
> if (result == SCAN_SUCCEED) {
> result = collapse_huge_page(mm, start_addr, referenced,
> - unmapped, cc);
> - /* collapse_huge_page will return with the mmap_lock released */
> - *mmap_locked = false;
> + unmapped, cc, mmap_locked,
> + HPAGE_PMD_ORDER);
> }
> out:
> trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> --
> 2.52.0
>
Cheers, Lorenzo
* Re: [PATCH mm-unstable v14 00/16] khugepaged: mTHP support
2026-01-26 11:32 ` Lorenzo Stoakes
@ 2026-02-04 21:35 ` Nico Pache
0 siblings, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-02-04 21:35 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
baohua, lance.yang, vbabka, rppt, surenb, mhocko, corbet,
rostedt, mhiramat, mathieu.desnoyers, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
jannh, pfalcato, jackmanb, hannes, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang
On Mon, Jan 26, 2026 at 4:34 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Jan 22, 2026 at 12:28:25PM -0700, Nico Pache wrote:
> > V14 Changes:
> > - Added review tags
> > - refactored is_mthp_order() to is_pmd_order(), utilized it in more places, and
> > moved it to the first commit of the series
> > - squashed fixup sent with v13
> > - rebased and handled conflicts with new madvise_collapse writeback retry logic [3]
> > - handled conflict with khugepaged cleanup series [4]
>
> Hmm no mention of change to 3/16, unless it's folded into one of the above?
It's the line with [3], but yeah, my bad, I'll try to be more detailed
with these change logs in the future. I was particularly lazy on this
one.
Thanks for the reviews :)
-- Nico
>
> Very important to make reviewers aware of this stuff.
>
> It's also worth separating out things at a fine-grained level, really
> everything. More detail is good. See [0] for example - I practice what I preach
> :)
>
> Thanks, Lorenzo
>
> [0]:https://lore.kernel.org/linux-mm/cover.1769198904.git.lorenzo.stoakes@oracle.com/
>
* Re: [PATCH mm-unstable v14 07/16] khugepaged: introduce collapse_max_ptes_none helper function
2026-02-03 12:08 ` Lorenzo Stoakes
@ 2026-02-04 21:39 ` Nico Pache
2026-02-06 17:44 ` Nico Pache
1 sibling, 0 replies; 39+ messages in thread
From: Nico Pache @ 2026-02-04 21:39 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
baohua, lance.yang, vbabka, rppt, surenb, mhocko, corbet,
rostedt, mhiramat, mathieu.desnoyers, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
jannh, pfalcato, jackmanb, hannes, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang
On Tue, Feb 3, 2026 at 5:09 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Jan 22, 2026 at 12:28:32PM -0700, Nico Pache wrote:
> > The current mechanism for determining mTHP collapse scales the
> > khugepaged_max_ptes_none value based on the target order. This
> > introduces an undesirable feedback loop, or "creep", when max_ptes_none
> > is set to a value greater than HPAGE_PMD_NR / 2.
> >
> > With this configuration, a successful collapse to order N will populate
> > enough pages to satisfy the collapse condition on order N+1 on the next
> > scan. This leads to unnecessary work and memory churn.
> >
> > To fix this issue introduce a helper function that will limit mTHP
> > collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> > This effectively supports two modes:
> >
> > - max_ptes_none=0: never introduce new none-pages for mTHP collapse.
> > - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
> > available mTHP order.
> >
> > This removes the possibility of "creep", while not modifying any uAPI
> > expectations. A warning will be emitted if any non-supported
> > max_ptes_none value is configured with mTHP enabled.
> >
> > The limits can be ignored by passing full_scan=true, this is useful for
> > madvise_collapse (which ignores limits), or in the case of
> > collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
> > collapse is available.
>
> Thanks, great commit msg!
>
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> This LGTM in terms of logic, some nits below, with those addressed feel
> free to add:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Thanks :)
>
> Cheers, Lorenzo
>
> > ---
> > mm/khugepaged.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 42 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 0f68902edd9a..9b7e05827749 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -460,6 +460,44 @@ void __khugepaged_enter(struct mm_struct *mm)
> > wake_up_interruptible(&khugepaged_wait);
> > }
> >
> > +/**
> > + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
> > + * @order: The folio order being collapsed to
> > + * @full_scan: Whether this is a full scan (ignore limits)
> > + *
> > + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
> > + * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
> > + *
> > + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
> > + * khugepaged_max_ptes_none value.
> > + *
> > + * For mTHP collapses, we currently only support khugepaged_max_pte_none values
> > + * of 0 or (HPAGE_PMD_NR - 1). Any other value will emit a warning and no mTHP
> > + * collapse will be attempted
> > + *
> > + * Return: Maximum number of empty PTEs allowed for the collapse operation
> > + */
> > +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> > +{
> > + /* ignore max_ptes_none limits */
> > + if (full_scan)
> > + return HPAGE_PMD_NR - 1;
>
> I wonder if, given we are effectively doing:
>
> const unsigned int nr_pages = collapse_max_ptes_none(order, /*full_scan=*/true);
>
> ...
>
> foo(nr_pages);
>
> In places where we ignore limits, whether we would be better off putting
> HPAGE_PMD_NR - 1 into a define and just using that in these cases, like:
>
> #define COLLAPSE_MAX_PTES_LIM (HPAGE_PMD_NR - 1)
>
> Then instead doing:
>
> foo(COLLAPSE_MAX_PTES_LIM);
>
> ?
>
> Seems somewhat silly to pass in a boolean that makes it return a set value in
> cases where you know that should be the case at the outset.
>
> > +
> > + if (is_pmd_order(order))
> > + return khugepaged_max_ptes_none;
> > +
> > + /* Zero/non-present collapse disabled. */
> > + if (!khugepaged_max_ptes_none)
> > + return 0;
> > +
> > + if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>
> Having a define for HPAGE_PMD_NR - 1 would also be handy here...
>
> > + return (1 << order) - 1;
> > +
> > + pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %d\n",
> > + HPAGE_PMD_NR - 1);
>
> ...and here.
>
> Also a MICRO nit here - the function returns unsigned int and thus we
> express PTEs in this unit, so maybe use %u rather than %d?
>
> > + return -EINVAL;
> > +}
>
> Logic of this function looks correct though!
>
> > +
> > void khugepaged_enter_vma(struct vm_area_struct *vma,
> > vm_flags_t vm_flags)
> > {
> > @@ -548,7 +586,10 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > int none_or_zero = 0, shared = 0, referenced = 0;
> > enum scan_result result = SCAN_FAIL;
> > const unsigned long nr_pages = 1UL << order;
> > - int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> > + int max_ptes_none = collapse_max_ptes_none(order, !cc->is_khugepaged);
>
> Yeah, the !cc->is_khugepaged is a bit gross here, so as per the above, maybe do:
Ok sounds good! I'll make the recommended changes.
Thanks!
-- Nico
>
> int max_ptes_none;
>
> if (cc->is_khugepaged)
> max_ptes_none = collapse_max_ptes_none(order);
> else /* MADV_COLLAPSE is not limited. */
> max_ptes_none = COLLAPSE_MAX_PTES_LIM;
>
> > +
> > + if (max_ptes_none == -EINVAL)
> > + return result;
> >
> > for (_pte = pte; _pte < pte + nr_pages;
> > _pte++, addr += PAGE_SIZE) {
> > --
> > 2.52.0
> >
>
* Re: [PATCH mm-unstable v14 08/16] khugepaged: generalize collapse_huge_page for mTHP collapse
2026-02-03 13:07 ` Lorenzo Stoakes
@ 2026-02-04 22:00 ` Nico Pache
2026-02-16 15:20 ` Lorenzo Stoakes
0 siblings, 1 reply; 39+ messages in thread
From: Nico Pache @ 2026-02-04 22:00 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
baohua, lance.yang, vbabka, rppt, surenb, mhocko, corbet,
rostedt, mhiramat, mathieu.desnoyers, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
jannh, pfalcato, jackmanb, hannes, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang
On Tue, Feb 3, 2026 at 6:13 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Jan 22, 2026 at 12:28:33PM -0700, Nico Pache wrote:
> > Pass an order and offset to collapse_huge_page to support collapsing anon
> > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > are attempting to collapse to, and offset indicates where in the PMD to
> > start the collapse attempt.
> >
> > For non-PMD collapse we must leave the anon VMA write locked until after
> > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > the mTHP case this is not true, and we must keep the lock to prevent
> > changes to the VMA from occurring.
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> > mm/khugepaged.c | 111 +++++++++++++++++++++++++++++++-----------------
> > 1 file changed, 71 insertions(+), 40 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 9b7e05827749..76cb17243793 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1151,44 +1151,54 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > return SCAN_SUCCEED;
> > }
> >
> > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > - int referenced, int unmapped, struct collapse_control *cc)
> > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > + int referenced, int unmapped, struct collapse_control *cc,
> > + bool *mmap_locked, unsigned int order)
> > {
> > LIST_HEAD(compound_pagelist);
> > pmd_t *pmd, _pmd;
> > - pte_t *pte;
> > + pte_t *pte = NULL;
> > pgtable_t pgtable;
> > struct folio *folio;
> > spinlock_t *pmd_ptl, *pte_ptl;
> > enum scan_result result = SCAN_FAIL;
> > struct vm_area_struct *vma;
> > struct mmu_notifier_range range;
> > + bool anon_vma_locked = false;
> > + const unsigned long nr_pages = 1UL << order;
> > + const unsigned long pmd_address = start_addr & HPAGE_PMD_MASK;
> >
> > - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > + VM_WARN_ON_ONCE(pmd_address & ~HPAGE_PMD_MASK);
> >
> > /*
> > * Before allocating the hugepage, release the mmap_lock read lock.
> > * The allocation can take potentially a long time if it involves
> > * sync compaction, and we do not need to hold the mmap_lock during
> > * that. We will recheck the vma after taking it again in write mode.
> > + * If collapsing mTHPs we may have already released the read_lock.
> > */
> > - mmap_read_unlock(mm);
> > + if (*mmap_locked) {
> > + mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > + }
> >
> > - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > + result = alloc_charge_folio(&folio, mm, cc, order);
> > if (result != SCAN_SUCCEED)
> > goto out_nolock;
> >
> > mmap_read_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > - HPAGE_PMD_ORDER);
> > + *mmap_locked = true;
> > + result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
>
> Why would we use the PMD address here rather than the actual start address?
The revalidation relies on the pmd_addr not the start_addr. It (only)
uses this to make sure the VMA is still at least PMD sized, and it
uses the order to validate that the target order is allowed. I left a
small comment about this in the revalidate function.
>
> Also please add /*expect_anon=*/ before the 'true' because it's hard to
> understand what that references.
ack
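i.e. the call would then read (same arguments, just annotated):

	result = hugepage_vma_revalidate(mm, pmd_address, /*expect_anon=*/true,
					 &vma, cc, order);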
>
> > if (result != SCAN_SUCCEED) {
> > mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > goto out_nolock;
> > }
> >
> > - result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > + result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd);
> > if (result != SCAN_SUCCEED) {
> > mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > goto out_nolock;
> > }
> >
> > @@ -1198,13 +1208,16 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > * released when it fails. So we jump out_nolock directly in
> > * that case. Continuing to collapse causes inconsistency.
> > */
> > - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > - referenced, HPAGE_PMD_ORDER);
> > - if (result != SCAN_SUCCEED)
> > + result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > + referenced, order);
> > + if (result != SCAN_SUCCEED) {
> > + *mmap_locked = false;
> > goto out_nolock;
> > + }
> > }
> >
> > mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > /*
> > * Prevent all access to pagetables with the exception of
> > * gup_fast later handled by the ptep_clear_flush and the VM
> > @@ -1214,20 +1227,20 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > * mmap_lock.
> > */
> > mmap_write_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > - HPAGE_PMD_ORDER);
> > + result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
> > if (result != SCAN_SUCCEED)
> > goto out_up_write;
> > /* check if the pmd is still valid */
> > vma_start_write(vma);
> > - result = check_pmd_still_valid(mm, address, pmd);
> > + result = check_pmd_still_valid(mm, pmd_address, pmd);
> > if (result != SCAN_SUCCEED)
> > goto out_up_write;
> >
> > anon_vma_lock_write(vma->anon_vma);
> > + anon_vma_locked = true;
> >
> > - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > - address + HPAGE_PMD_SIZE);
> > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > + start_addr + (PAGE_SIZE << order));
> > mmu_notifier_invalidate_range_start(&range);
> >
> > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > @@ -1239,24 +1252,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > * Parallel GUP-fast is fine since GUP-fast will back off when
> > * it detects PMD is changed.
> > */
> > - _pmd = pmdp_collapse_flush(vma, address, pmd);
> > + _pmd = pmdp_collapse_flush(vma, pmd_address, pmd);
> > spin_unlock(pmd_ptl);
> > mmu_notifier_invalidate_range_end(&range);
> > tlb_remove_table_sync_one();
> >
> > - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > + pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > if (pte) {
> > - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > - HPAGE_PMD_ORDER,
> > - &compound_pagelist);
> > + result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > + order, &compound_pagelist);
> > spin_unlock(pte_ptl);
> > } else {
> > result = SCAN_NO_PTE_TABLE;
> > }
> >
> > if (unlikely(result != SCAN_SUCCEED)) {
> > - if (pte)
> > - pte_unmap(pte);
> > spin_lock(pmd_ptl);
> > BUG_ON(!pmd_none(*pmd));
> > /*
> > @@ -1266,21 +1276,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > */
> > pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > spin_unlock(pmd_ptl);
> > - anon_vma_unlock_write(vma->anon_vma);
> > goto out_up_write;
> > }
> >
> > /*
> > - * All pages are isolated and locked so anon_vma rmap
> > - * can't run anymore.
> > + * For PMD collapse all pages are isolated and locked so anon_vma
> > + * rmap can't run anymore. For mTHP collapse we must hold the lock
> > */
> > - anon_vma_unlock_write(vma->anon_vma);
> > + if (is_pmd_order(order)) {
> > + anon_vma_unlock_write(vma->anon_vma);
> > + anon_vma_locked = false;
> > + }
> >
> > result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > - vma, address, pte_ptl,
> > - HPAGE_PMD_ORDER,
> > - &compound_pagelist);
> > - pte_unmap(pte);
> > + vma, start_addr, pte_ptl,
> > + order, &compound_pagelist);
> > if (unlikely(result != SCAN_SUCCEED))
> > goto out_up_write;
> >
> > @@ -1290,20 +1300,42 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > * write.
> > */
> > __folio_mark_uptodate(folio);
> > - pgtable = pmd_pgtable(_pmd);
> > + if (is_pmd_order(order)) { /* PMD collapse */
> > + pgtable = pmd_pgtable(_pmd);
> >
> > - spin_lock(pmd_ptl);
> > - BUG_ON(!pmd_none(*pmd));
> > - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > - map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > + spin_lock(pmd_ptl);
> > + WARN_ON_ONCE(!pmd_none(*pmd));
> > + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_address);
> > + } else { /* mTHP collapse */
> > + pte_t mthp_pte = mk_pte(folio_page(folio, 0), vma->vm_page_prot);
> > +
> > + mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> > + spin_lock(pmd_ptl);
> > + WARN_ON_ONCE(!pmd_none(*pmd));
> > + folio_ref_add(folio, nr_pages - 1);
> > + folio_add_new_anon_rmap(folio, vma, start_addr, RMAP_EXCLUSIVE);
> > + folio_add_lru_vma(folio, vma);
> > + set_ptes(vma->vm_mm, start_addr, pte, mthp_pte, nr_pages);
> > + update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
> > +
> > + smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>
> I seriously hate this being open-coded, can we separate it out into another
> function?
Yeah I think we've discussed this before. I started to generalize
this, and apply it to other parts of the kernel that maintain a
similar pattern, but each potential user of the helper was slightly
different in its approach and I was unable to find a quick solution to
make it apply to all. I think it will require a lot more thought to
cleanly refactor this. I figured I could leave this to the later
cleanup work, or I could just create a static function just for
khugepaged for now?
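Something khugepaged-local would look roughly like this (rough untested
sketch, name illustrative; called with pmd_ptl held, mirroring the
open-coded block above):

static void map_anon_folio_ptes_nopf(struct folio *folio, pte_t *pte,
		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
		unsigned long start_addr, unsigned int order)
{
	const unsigned long nr_pages = 1UL << order;
	pte_t mthp_pte = mk_pte(folio_page(folio, 0), vma->vm_page_prot);

	mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
	folio_ref_add(folio, nr_pages - 1);
	folio_add_new_anon_rmap(folio, vma, start_addr, RMAP_EXCLUSIVE);
	folio_add_lru_vma(folio, vma);
	set_ptes(vma->vm_mm, start_addr, pte, mthp_pte, nr_pages);
	update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);

	/* Make PTEs visible before the PMD. See pmd_install(). */
	smp_wmb();
	pmd_populate(vma->vm_mm, pmd, pmd_pgtable(orig_pmd));
}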
>
> > + }
> > spin_unlock(pmd_ptl);
> >
> > folio = NULL;
> >
> > result = SCAN_SUCCEED;
> > out_up_write:
> > + if (anon_vma_locked)
> > + anon_vma_unlock_write(vma->anon_vma);
>
> Thanks it's much better tracking this specifically.
>
> The whole damn thing needs refactoring (by this I mean - khugepaged and really
> THP in general to be clear :) but it's not your fault.
Yeah it has not been the prettiest code to try and understand/work on!
>
> Could I ask though whether you might help out with some cleanups after this
> lands :)
>
> I feel like we all need to do our bit to pay down some technical debt!
Yes ofc! I had already planned on doing so. I have some in mind, and I
believe others have already tackled some. After this lands, let's
discuss further plans (discussion thread or THP meeting).
Cheers,
-- Nico
>
> > + if (pte)
> > + pte_unmap(pte);
> > mmap_write_unlock(mm);
> > + *mmap_locked = false;
> > out_nolock:
> > + WARN_ON_ONCE(*mmap_locked);
> > if (folio)
> > folio_put(folio);
> > trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> > @@ -1471,9 +1503,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > pte_unmap_unlock(pte, ptl);
> > if (result == SCAN_SUCCEED) {
> > result = collapse_huge_page(mm, start_addr, referenced,
> > - unmapped, cc);
> > - /* collapse_huge_page will return with the mmap_lock released */
> > - *mmap_locked = false;
> > + unmapped, cc, mmap_locked,
> > + HPAGE_PMD_ORDER);
> > }
> > out:
> > trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> > --
> > 2.52.0
> >
>
> Cheers, Lorenzo
>
* Re: [PATCH mm-unstable v14 07/16] khugepaged: introduce collapse_max_ptes_none helper function
2026-02-03 12:08 ` Lorenzo Stoakes
2026-02-04 21:39 ` Nico Pache
@ 2026-02-06 17:44 ` Nico Pache
2026-02-16 15:16 ` Lorenzo Stoakes
1 sibling, 1 reply; 39+ messages in thread
From: Nico Pache @ 2026-02-06 17:44 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
baohua, lance.yang, vbabka, rppt, surenb, mhocko, corbet,
rostedt, mhiramat, mathieu.desnoyers, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
jannh, pfalcato, jackmanb, hannes, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang
On Tue, Feb 3, 2026 at 5:09 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Jan 22, 2026 at 12:28:32PM -0700, Nico Pache wrote:
> > The current mechanism for determining mTHP collapse scales the
> > khugepaged_max_ptes_none value based on the target order. This
> > introduces an undesirable feedback loop, or "creep", when max_ptes_none
> > is set to a value greater than HPAGE_PMD_NR / 2.
> >
> > With this configuration, a successful collapse to order N will populate
> > enough pages to satisfy the collapse condition on order N+1 on the next
> > scan. This leads to unnecessary work and memory churn.
> >
> > To fix this issue introduce a helper function that will limit mTHP
> > collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> > This effectively supports two modes:
> >
> > - max_ptes_none=0: never introduce new none-pages for mTHP collapse.
> > - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
> > available mTHP order.
> >
> > This removes the possibility of "creep", while not modifying any uAPI
> > expectations. A warning will be emitted if any non-supported
> > max_ptes_none value is configured with mTHP enabled.
> >
> > The limits can be ignored by passing full_scan=true; this is useful for
> > madvise_collapse (which ignores limits), and in the case of
> > collapse_scan_pmd() it allows the full PMD to be scanned when mTHP
> > collapse is available.
>
> Thanks, great commit msg!
>
> >
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> This LGTM in terms of logic, some nits below, with those addressed feel
> free to add:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Cheers, Lorenzo
>
> > ---
> > mm/khugepaged.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 42 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 0f68902edd9a..9b7e05827749 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -460,6 +460,44 @@ void __khugepaged_enter(struct mm_struct *mm)
> > wake_up_interruptible(&khugepaged_wait);
> > }
> >
> > +/**
> > + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
> > + * @order: The folio order being collapsed to
> > + * @full_scan: Whether this is a full scan (ignore limits)
> > + *
> > + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
> > + * and up to HPAGE_PMD_NR - 1 empty PTEs are allowed.
> > + *
> > + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
> > + * khugepaged_max_ptes_none value.
> > + *
> > + * For mTHP collapses, we currently only support khugepaged_max_ptes_none values
> > + * of 0 or (HPAGE_PMD_NR - 1). Any other value will emit a warning and no mTHP
> > + * collapse will be attempted.
> > + *
> > + * Return: Maximum number of empty PTEs allowed for the collapse operation
> > + */
> > +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> > +{
> > + /* ignore max_ptes_none limits */
> > + if (full_scan)
> > + return HPAGE_PMD_NR - 1;
>
> I wonder if, given we are effectively doing:
>
> const unsigned int nr_pages = collapse_max_ptes_none(order, /*full_scan=*/true);
>
> ...
>
> foo(nr_pages);
>
> In places where we ignore limits, whether we would be better off putting
> HPAGE_PMD_NR - 1 into a define and just using that in these cases, like:
>
> #define COLLAPSE_MAX_PTES_LIM (HPAGE_PMD_NR - 1)
Would a shorter name be appropriate? COLLAPSE_MAX_PTES_LIM(IT) is
quite long. Can we call it MAX_PTES_LIMIT or KHUGE_MAX_PTES_LIM?
-- Nico
>
> Then instead doing:
>
> foo(COLLAPSE_MAX_PTES_LIM);
>
> ?
>
> Seems somewhat silly to pass in a boolean that makes it return a set value in
> cases where you know that should be the case at the outset.
>
> > +
> > + if (is_pmd_order(order))
> > + return khugepaged_max_ptes_none;
> > +
> > + /* Zero/non-present collapse disabled. */
> > + if (!khugepaged_max_ptes_none)
> > + return 0;
> > +
> > + if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>
> Having a define for HPAGE_PMD_NR - 1 would also be handy here...
>
> > + return (1 << order) - 1;
> > +
> > + pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %d\n",
> > + HPAGE_PMD_NR - 1);
>
> ...and here.
>
> Also a MICRO nit here - the function returns unsigned int and thus we
> express PTEs in this unit, so maybe use %u rather than %d?
>
> > + return -EINVAL;
> > +}
>
> Logic of this function looks correct though!
>
> > +
> > void khugepaged_enter_vma(struct vm_area_struct *vma,
> > vm_flags_t vm_flags)
> > {
> > @@ -548,7 +586,10 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > int none_or_zero = 0, shared = 0, referenced = 0;
> > enum scan_result result = SCAN_FAIL;
> > const unsigned long nr_pages = 1UL << order;
> > - int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> > + int max_ptes_none = collapse_max_ptes_none(order, !cc->is_khugepaged);
>
> Yeah, the !cc->is_khugepaged is a bit gross here, so as per the above, maybe do:
>
> int max_ptes_none;
>
> if (cc->is_khugepaged)
> max_ptes_none = collapse_max_ptes_none(order);
> else /* MADV_COLLAPSE is not limited. */
> max_ptes_none = COLLAPSE_MAX_PTES_LIM;
>
> > +
> > + if (max_ptes_none == -EINVAL)
> > + return result;
> >
> > for (_pte = pte; _pte < pte + nr_pages;
> > _pte++, addr += PAGE_SIZE) {
> > --
> > 2.52.0
> >
>
* Re: [PATCH mm-unstable v14 07/16] khugepaged: introduce collapse_max_ptes_none helper function
2026-02-06 17:44 ` Nico Pache
@ 2026-02-16 15:16 ` Lorenzo Stoakes
0 siblings, 0 replies; 39+ messages in thread
From: Lorenzo Stoakes @ 2026-02-16 15:16 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
baohua, lance.yang, vbabka, rppt, surenb, mhocko, corbet,
rostedt, mhiramat, mathieu.desnoyers, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
jannh, pfalcato, jackmanb, hannes, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang
On Fri, Feb 06, 2026 at 10:44:03AM -0700, Nico Pache wrote:
> On Tue, Feb 3, 2026 at 5:09 AM Lorenzo Stoakes
> > > ---
> > > mm/khugepaged.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
> > > 1 file changed, 42 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 0f68902edd9a..9b7e05827749 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -460,6 +460,44 @@ void __khugepaged_enter(struct mm_struct *mm)
> > > wake_up_interruptible(&khugepaged_wait);
> > > }
> > >
> > > +/**
> > > + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
> > > + * @order: The folio order being collapsed to
> > > + * @full_scan: Whether this is a full scan (ignore limits)
> > > + *
> > > + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
> > > + * and up to HPAGE_PMD_NR - 1 empty PTEs are allowed.
> > > + *
> > > + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
> > > + * khugepaged_max_ptes_none value.
> > > + *
> > > + * For mTHP collapses, we currently only support khugepaged_max_ptes_none values
> > > + * of 0 or (HPAGE_PMD_NR - 1). Any other value will emit a warning and no mTHP
> > > + * collapse will be attempted.
> > > + *
> > > + * Return: Maximum number of empty PTEs allowed for the collapse operation
> > > + */
> > > +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> > > +{
> > > + /* ignore max_ptes_none limits */
> > > + if (full_scan)
> > > + return HPAGE_PMD_NR - 1;
> >
> > I wonder if, given we are effectively doing:
> >
> > const unsigned int nr_pages = collapse_max_ptes_none(order, /*full_scan=*/true);
> >
> > ...
> >
> > foo(nr_pages);
> >
> > In places where we ignore limits, whether we would be better off putting
> > HPAGE_PMD_NR - 1 into a define and just using that in these cases, like:
> >
> > #define COLLAPSE_MAX_PTES_LIM (HPAGE_PMD_NR - 1)
>
> Would a shorter name be appropriate? COLLAPSE_MAX_PTES_LIM(IT) is
> quite long. Can we call it MAX_PTES_LIMIT or KHUGE_MAX_PTES_LIM?
Yeah sure re: shorter/better name :) to be fair my suggestion is pretty
terrible, kinda just getting at the notion of there being _some_ define.
But MAX_PTES_LIMIT or KHUGE_MAX_PTES_LIM I think are unclear.
MAX_COLLAPSE_PTES?
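Roughly, just to sketch the shape (untested, and keeping the slightly ugly
-EINVAL-in-an-unsigned return from your version for brevity):

#define MAX_COLLAPSE_PTES	(HPAGE_PMD_NR - 1)

static unsigned int collapse_max_ptes_none(unsigned int order)
{
	if (is_pmd_order(order))
		return khugepaged_max_ptes_none;

	/* Zero/non-present collapse disabled. */
	if (!khugepaged_max_ptes_none)
		return 0;

	if (khugepaged_max_ptes_none == MAX_COLLAPSE_PTES)
		return (1 << order) - 1;

	pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
		     MAX_COLLAPSE_PTES);
	return -EINVAL;
}

with the callers that ignore limits (MADV_COLLAPSE etc.) just using
MAX_COLLAPSE_PTES directly instead of passing a full_scan bool.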
Cheers, Lorenzo
* Re: [PATCH mm-unstable v14 08/16] khugepaged: generalize collapse_huge_page for mTHP collapse
2026-02-04 22:00 ` Nico Pache
@ 2026-02-16 15:20 ` Lorenzo Stoakes
0 siblings, 0 replies; 39+ messages in thread
From: Lorenzo Stoakes @ 2026-02-16 15:20 UTC (permalink / raw)
To: Nico Pache
Cc: linux-mm, linux-doc, linux-kernel, linux-trace-kernel, akpm,
david, ziy, baolin.wang, Liam.Howlett, ryan.roberts, dev.jain,
baohua, lance.yang, vbabka, rppt, surenb, mhocko, corbet,
rostedt, mhiramat, mathieu.desnoyers, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
jannh, pfalcato, jackmanb, hannes, willy, peterx,
wangkefeng.wang, usamaarif642, sunnanyong, vishal.moola,
thomas.hellstrom, yang, kas, aarcange, raquini,
anshuman.khandual, catalin.marinas, tiwai, will, dave.hansen,
jack, cl, jglisse, zokeefe, rientjes, rdunlap, hughd,
richard.weiyang
On Wed, Feb 04, 2026 at 03:00:57PM -0700, Nico Pache wrote:
> On Tue, Feb 3, 2026 at 6:13 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Thu, Jan 22, 2026 at 12:28:33PM -0700, Nico Pache wrote:
> > > Pass an order and offset to collapse_huge_page to support collapsing anon
> > > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > > are attempting to collapse to, and offset indicates where in the PMD to
> > > start the collapse attempt.
> > >
> > > For non-PMD collapse we must leave the anon VMA write locked until after
> > > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> > > the mTHP case this is not true, and we must keep the lock to prevent
> > > changes to the VMA from occurring.
> > >
> > > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > ---
> > > mm/khugepaged.c | 111 +++++++++++++++++++++++++++++++-----------------
> > > 1 file changed, 71 insertions(+), 40 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 9b7e05827749..76cb17243793 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1151,44 +1151,54 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > > return SCAN_SUCCEED;
> > > }
> > >
> > > -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > > - int referenced, int unmapped, struct collapse_control *cc)
> > > +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> > > + int referenced, int unmapped, struct collapse_control *cc,
> > > + bool *mmap_locked, unsigned int order)
> > > {
> > > LIST_HEAD(compound_pagelist);
> > > pmd_t *pmd, _pmd;
> > > - pte_t *pte;
> > > + pte_t *pte = NULL;
> > > pgtable_t pgtable;
> > > struct folio *folio;
> > > spinlock_t *pmd_ptl, *pte_ptl;
> > > enum scan_result result = SCAN_FAIL;
> > > struct vm_area_struct *vma;
> > > struct mmu_notifier_range range;
> > > + bool anon_vma_locked = false;
> > > + const unsigned long nr_pages = 1UL << order;
> > > + const unsigned long pmd_address = start_addr & HPAGE_PMD_MASK;
> > >
> > > - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > + VM_WARN_ON_ONCE(pmd_address & ~HPAGE_PMD_MASK);
> > >
> > > /*
> > > * Before allocating the hugepage, release the mmap_lock read lock.
> > > * The allocation can take potentially a long time if it involves
> > > * sync compaction, and we do not need to hold the mmap_lock during
> > > * that. We will recheck the vma after taking it again in write mode.
> > > + * If collapsing mTHPs we may have already released the read_lock.
> > > */
> > > - mmap_read_unlock(mm);
> > > + if (*mmap_locked) {
> > > + mmap_read_unlock(mm);
> > > + *mmap_locked = false;
> > > + }
> > >
> > > - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > > + result = alloc_charge_folio(&folio, mm, cc, order);
> > > if (result != SCAN_SUCCEED)
> > > goto out_nolock;
> > >
> > > mmap_read_lock(mm);
> > > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > - HPAGE_PMD_ORDER);
> > > + *mmap_locked = true;
> > > + result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
> >
> > Why would we use the PMD address here rather than the actual start address?
>
> The revalidation relies on the pmd_addr not the start_addr. It (only)
> uses this to make sure the VMA is still at least PMD sized, and it
> uses the order to validate that the target order is allowed. I left a
> small comment about this in the revalidate function.
Yeah having these different addresses is a bit icky, easy to make mistakes here.
Oh how we need to refactor all of these...
>
> >
> > Also please add /*expect_anon=*/ before the 'true' because it's hard to
> > understand what that references.
>
> ack
>
> >
> > > if (result != SCAN_SUCCEED) {
> > > mmap_read_unlock(mm);
> > > + *mmap_locked = false;
> > > goto out_nolock;
> > > }
> > >
> > > - result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > > + result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd);
> > > if (result != SCAN_SUCCEED) {
> > > mmap_read_unlock(mm);
> > > + *mmap_locked = false;
> > > goto out_nolock;
> > > }
> > >
> > > @@ -1198,13 +1208,16 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > * released when it fails. So we jump out_nolock directly in
> > > * that case. Continuing to collapse causes inconsistency.
> > > */
> > > - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > > - referenced, HPAGE_PMD_ORDER);
> > > - if (result != SCAN_SUCCEED)
> > > + result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> > > + referenced, order);
> > > + if (result != SCAN_SUCCEED) {
> > > + *mmap_locked = false;
> > > goto out_nolock;
> > > + }
> > > }
> > >
> > > mmap_read_unlock(mm);
> > > + *mmap_locked = false;
> > > /*
> > > * Prevent all access to pagetables with the exception of
> > > * gup_fast later handled by the ptep_clear_flush and the VM
> > > @@ -1214,20 +1227,20 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > * mmap_lock.
> > > */
> > > mmap_write_lock(mm);
> > > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > > - HPAGE_PMD_ORDER);
> > > + result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
> > > if (result != SCAN_SUCCEED)
> > > goto out_up_write;
> > > /* check if the pmd is still valid */
> > > vma_start_write(vma);
> > > - result = check_pmd_still_valid(mm, address, pmd);
> > > + result = check_pmd_still_valid(mm, pmd_address, pmd);
> > > if (result != SCAN_SUCCEED)
> > > goto out_up_write;
> > >
> > > anon_vma_lock_write(vma->anon_vma);
> > > + anon_vma_locked = true;
> > >
> > > - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > > - address + HPAGE_PMD_SIZE);
> > > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> > > + start_addr + (PAGE_SIZE << order));
> > > mmu_notifier_invalidate_range_start(&range);
> > >
> > > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > > @@ -1239,24 +1252,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > * Parallel GUP-fast is fine since GUP-fast will back off when
> > > * it detects PMD is changed.
> > > */
> > > - _pmd = pmdp_collapse_flush(vma, address, pmd);
> > > + _pmd = pmdp_collapse_flush(vma, pmd_address, pmd);
> > > spin_unlock(pmd_ptl);
> > > mmu_notifier_invalidate_range_end(&range);
> > > tlb_remove_table_sync_one();
> > >
> > > - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > > + pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> > > if (pte) {
> > > - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > > - HPAGE_PMD_ORDER,
> > > - &compound_pagelist);
> > > + result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> > > + order, &compound_pagelist);
> > > spin_unlock(pte_ptl);
> > > } else {
> > > result = SCAN_NO_PTE_TABLE;
> > > }
> > >
> > > if (unlikely(result != SCAN_SUCCEED)) {
> > > - if (pte)
> > > - pte_unmap(pte);
> > > spin_lock(pmd_ptl);
> > > BUG_ON(!pmd_none(*pmd));
> > > /*
> > > @@ -1266,21 +1276,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > */
> > > pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > > spin_unlock(pmd_ptl);
> > > - anon_vma_unlock_write(vma->anon_vma);
> > > goto out_up_write;
> > > }
> > >
> > > /*
> > > - * All pages are isolated and locked so anon_vma rmap
> > > - * can't run anymore.
> > > + * For PMD collapse all pages are isolated and locked so anon_vma
> > > + * rmap can't run anymore. For mTHP collapse we must hold the lock
> > > */
> > > - anon_vma_unlock_write(vma->anon_vma);
> > > + if (is_pmd_order(order)) {
> > > + anon_vma_unlock_write(vma->anon_vma);
> > > + anon_vma_locked = false;
> > > + }
> > >
> > > result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > > - vma, address, pte_ptl,
> > > - HPAGE_PMD_ORDER,
> > > - &compound_pagelist);
> > > - pte_unmap(pte);
> > > + vma, start_addr, pte_ptl,
> > > + order, &compound_pagelist);
> > > if (unlikely(result != SCAN_SUCCEED))
> > > goto out_up_write;
> > >
> > > @@ -1290,20 +1300,42 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> > > * write.
> > > */
> > > __folio_mark_uptodate(folio);
> > > - pgtable = pmd_pgtable(_pmd);
> > > + if (is_pmd_order(order)) { /* PMD collapse */
> > > + pgtable = pmd_pgtable(_pmd);
> > >
> > > - spin_lock(pmd_ptl);
> > > - BUG_ON(!pmd_none(*pmd));
> > > - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > - map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > > + spin_lock(pmd_ptl);
> > > + WARN_ON_ONCE(!pmd_none(*pmd));
> > > + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > > + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_address);
> > > + } else { /* mTHP collapse */
> > > + pte_t mthp_pte = mk_pte(folio_page(folio, 0), vma->vm_page_prot);
> > > +
> > > + mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> > > + spin_lock(pmd_ptl);
> > > + WARN_ON_ONCE(!pmd_none(*pmd));
> > > + folio_ref_add(folio, nr_pages - 1);
> > > + folio_add_new_anon_rmap(folio, vma, start_addr, RMAP_EXCLUSIVE);
> > > + folio_add_lru_vma(folio, vma);
> > > + set_ptes(vma->vm_mm, start_addr, pte, mthp_pte, nr_pages);
> > > + update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);
> > > +
> > > + smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> > > + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> >
> > I seriously hate this being open-coded, can we separate it out into another
> > function?
>
> Yeah I think we've discussed this before. I started to generalize
> this, and apply it to other parts of the kernel that maintain a
> similar pattern, but each potential user of the helper was slightly
> different in its approach and I was unable to find a quick solution to
> make it apply to all. I think it will require a lot more thought to
> cleanly refactor this. I figured I could leave this to the later
> cleanup work, or I could just create a static function just for
> khugepaged for now?
Yeah let's at least separate it out for this logic anyway.
Really we should have done the refactoring in advance of these changes, but that
ship has sailed :)
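Something along these lines perhaps - a completely untested sketch just to show
the shape, with made-up naming, and assuming the pmd_ptl lock and pmd_none()
check stay in the caller:

/* Install PTEs for a freshly collapsed mTHP folio and re-establish the PMD. */
static void map_collapsed_mthp_ptes(struct mm_struct *mm,
				    struct vm_area_struct *vma,
				    struct folio *folio, pmd_t *pmd,
				    pmd_t orig_pmd, pte_t *pte,
				    unsigned long start_addr,
				    unsigned long nr_pages)
{
	pte_t entry = mk_pte(folio_page(folio, 0), vma->vm_page_prot);

	entry = maybe_mkwrite(pte_mkdirty(entry), vma);

	folio_ref_add(folio, nr_pages - 1);
	folio_add_new_anon_rmap(folio, vma, start_addr, RMAP_EXCLUSIVE);
	folio_add_lru_vma(folio, vma);
	set_ptes(mm, start_addr, pte, entry, nr_pages);
	update_mmu_cache_range(NULL, vma, start_addr, pte, nr_pages);

	/* Make the PTEs visible before the PMD. See pmd_install(). */
	smp_wmb();
	pmd_populate(mm, pmd, pmd_pgtable(orig_pmd));
}

Then the mTHP branch in collapse_huge_page() becomes a single call.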
>
> >
> > > + }
> > > spin_unlock(pmd_ptl);
> > >
> > > folio = NULL;
> > >
> > > result = SCAN_SUCCEED;
> > > out_up_write:
> > > + if (anon_vma_locked)
> > > + anon_vma_unlock_write(vma->anon_vma);
> >
> > Thanks it's much better tracking this specifically.
> >
> > The whole damn thing needs refactoring (by this I mean - khugepaged and really
> > THP in general to be clear :) but it's not your fault.
>
> Yeah it has not been the prettiest code to try and understand/work on!
:)
>
> >
> > Could I ask though whether you might help out with some cleanups after this
> > lands :)
> >
> > I feel like we all need to do our bit to pay down some technical debt!
>
>
> Yes ofc! I had already planned on doing so. I have some in mind, and I
> believe others have already tackled some. After this lands, let's
> discuss further plans (discussion thread or THP meeting).
Yeah, I'll get that TODO list discussed in the meeting shared soon...
>
> Cheers,
> -- Nico
>
> >
> > > + if (pte)
> > > + pte_unmap(pte);
> > > mmap_write_unlock(mm);
> > > + *mmap_locked = false;
> > > out_nolock:
> > > + WARN_ON_ONCE(*mmap_locked);
> > > if (folio)
> > > folio_put(folio);
> > > trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> > > @@ -1471,9 +1503,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > > pte_unmap_unlock(pte, ptl);
> > > if (result == SCAN_SUCCEED) {
> > > result = collapse_huge_page(mm, start_addr, referenced,
> > > - unmapped, cc);
> > > - /* collapse_huge_page will return with the mmap_lock released */
> > > - *mmap_locked = false;
> > > + unmapped, cc, mmap_locked,
> > > + HPAGE_PMD_ORDER);
> > > }
> > > out:
> > > trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> > > --
> > > 2.52.0
> > >
> >
> > Cheers, Lorenzo
> >
>
Cheers, Lorenzo
Thread overview: 39+ messages
2026-01-22 19:28 [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 01/16] mm: introduce is_pmd_order helper Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 02/16] khugepaged: rename hpage_collapse_* to collapse_* Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 03/16] introduce collapse_single_pmd to unify khugepaged and madvise_collapse Nico Pache
2026-01-23 5:07 ` Lance Yang
2026-01-23 9:31 ` Baolin Wang
2026-01-26 12:25 ` Lorenzo Stoakes
2026-01-23 23:26 ` Nico Pache
2026-01-24 4:41 ` Lance Yang
2026-01-26 12:25 ` Lorenzo Stoakes
2026-01-26 11:40 ` Lorenzo Stoakes
2026-01-26 15:09 ` Andrew Morton
2026-01-26 15:18 ` Lorenzo Stoakes
2026-01-28 16:38 ` Nico Pache
2026-02-03 11:43 ` Lorenzo Stoakes
2026-02-03 11:35 ` Lorenzo Stoakes
2026-01-22 19:28 ` [PATCH mm-unstable v14 04/16] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 05/16] khugepaged: generalize alloc_charge_folio() Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 06/16] khugepaged: generalize __collapse_huge_page_* for mTHP support Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 07/16] khugepaged: introduce collapse_max_ptes_none helper function Nico Pache
2026-02-03 12:08 ` Lorenzo Stoakes
2026-02-04 21:39 ` Nico Pache
2026-02-06 17:44 ` Nico Pache
2026-02-16 15:16 ` Lorenzo Stoakes
2026-01-22 19:28 ` [PATCH mm-unstable v14 08/16] khugepaged: generalize collapse_huge_page for mTHP collapse Nico Pache
2026-02-03 13:07 ` Lorenzo Stoakes
2026-02-04 22:00 ` Nico Pache
2026-02-16 15:20 ` Lorenzo Stoakes
2026-01-22 19:28 ` [PATCH mm-unstable v14 09/16] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 10/16] khugepaged: add per-order mTHP collapse failure statistics Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 11/16] khugepaged: improve tracepoints for mTHP orders Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 12/16] khugepaged: introduce collapse_allowable_orders helper function Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 13/16] khugepaged: Introduce mTHP collapse support Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 14/16] khugepaged: avoid unnecessary mTHP collapse attempts Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 15/16] khugepaged: run khugepaged for all orders Nico Pache
2026-01-22 19:28 ` [PATCH mm-unstable v14 16/16] Documentation: mm: update the admin guide for mTHP collapse Nico Pache
2026-01-26 11:21 ` [PATCH mm-unstable v14 00/16] khugepaged: mTHP support Lorenzo Stoakes
2026-01-26 11:32 ` Lorenzo Stoakes
2026-02-04 21:35 ` Nico Pache